Jobs fail with 'Too many open files' error in a UNIX NetBackup environment configured to use a service user account.
Problem
Jobs fail with 'Too many open files' (errno 24 = EMFILE) in a UNIX NetBackup domain when the primary server or media servers are configured to use a non-root service account. The condition may also cause connectivity issues between NetBackup hosts.
Error Message
Lines similar to the following appear in the nbpxyhelper logs, which are located at the following path on the primary server or media server.
/usr/openv/logs/nbpxyhelper/*.log
02/03/2022 14:40:32.333 [Application] NB 51216 nbpxyhelper 486 PID:3383 TID:140617991476992 File ID:486 [{880139A2-8531-11EC-AF42-292BC378EEA5}:INBOUND] [Error] V-486-90 ERR - Unable to read json file [/usr/openv/var/vxss/certmapinfo.json]: error details: text:[unable to open /usr/openv/var/vxss/certmapinfo.json: Too many open files] line:[-1] source:[/usr/openv/var/vxss/certmapinfo.json] column:[-1] position:[0]
02/03/2022 14:40:32.333 [Application] NB 51216 nbpxyhelper 486 PID:3383 TID:140617991476992 File ID:486 [{880139A2-8531-11EC-AF42-292BC378EEA5}:INBOUND] [Error] V-486-90 ERR - Error while reading mapping file: json invalid
02/03/2022 14:40:32.333 [Debug] NB 51216 nbpxyhelper 486 PID:3383 TID:140617991476992 File ID:486 [{880139A2-8531-11EC-AF42-292BC378EEA5}:INBOUND] 1 [JsonRequest::populateCertInfo] Error: Failed to read certificate information from certificate mapping file. (../machines/LibNbPxyProtocol.cpp:477)
The following error may also appear in the system dmesg log:
[Tue May 17 09:16:16 2022] VFS: file-max limit 65536 reached
Cause
There could be two problems related to open file limits.
- The vnetd -proxy inbound and outbound processes are not able to increase their open file limit to 8192 as expected.
When the server reboots, NetBackup services are started in a systemd session and therefore inherit the default open file limit configured for that session.
Observe which user account owns the vnetd -proxy processes. It may be root or, beginning with version 9.1, the NetBackup SERVICE_USER.
$ bpps vnetd
root 1882144 1 0 18:37 ? 00:00:00 /usr/openv/netbackup/bin/vnetd -standalone
nbsvcusr 1882148 1 1 18:37 ? 00:00:00 /usr/openv/netbackup/bin/vnetd -proxy inbound_proxy -number 0
nbsvcusr 1882151 1 1 18:37 ? 00:00:00 /usr/openv/netbackup/bin/vnetd -proxy outbound_proxy -number 0
The normal ulimit (nofile) for the root and SERVICE_USER accounts can be observed using these commands:
root@server$ ulimit -n
8192
root@server$ su nbsvcusr --shell /bin/bash --command "ulimit -n"
8192
Notice, however, that the vnetd proxy processes (both inbound and outbound) have an open file limit lower than the ulimit (nofile) for the root and service user accounts. In this example it is 4096.
root@server$ prlimit --pid=`pgrep -f "vnetd -proxy inbound_proxy -number 0"` | grep open
NOFILE max number of open files 4096 4096 files
- The system-wide file-max limit is set too low.
On the RHEL platform, the file-max limit has been observed to be strictly enforced for processes running as non-root users, whereas processes running as root are not subject to it. Changing the NetBackup SERVICE_USER to a non-root account may therefore cause the "Too many open files" error.
For example, file-max is set to 65536 on the RHEL server below. This limit may be reached if the server is heavily loaded.
root@server$ sysctl fs.file-nr
fs.file-nr = 21552 0 65536
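Both conditions described above can be checked with a few small shell functions. The following is a minimal sketch for Linux and is not part of NetBackup: the function names are hypothetical, and it assumes that the /proc filesystem, awk, and pgrep are available on the host.

```shell
# Hypothetical helper functions (not NetBackup tools); assumes Linux /proc.

# Print the soft "Max open files" value from a /proc/<pid>/limits style file.
nofile_soft_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Succeed when the soft limit $1 is numeric and below the expected value $2.
limit_is_low() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # "unlimited" or unreadable: not reported as low
    esac
    [ "$1" -lt "$2" ]
}

# Report each process whose command line matches $1 and whose soft
# open file limit is below $2 (default 8192).
report_low_nofile() {
    pattern=$1
    expected=${2:-8192}
    for pid in $(pgrep -f "$pattern"); do
        soft=$(nofile_soft_limit "/proc/$pid/limits")
        if limit_is_low "$soft" "$expected"; then
            echo "PID $pid: open file soft limit $soft is below $expected"
        fi
    done
}

# Print the percentage of the system-wide limit in use, given the three
# fs.file-nr fields: allocated handles, unused handles, maximum.
file_nr_usage_pct() {
    echo $(( $1 * 100 / $3 ))
}
```

For example, 'report_low_nofile "vnetd -proxy"' lists any proxy process whose limit is below 8192, and 'file_nr_usage_pct 21552 0 65536' prints 32 for the sample values shown above.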
Solution
Problem 1:
Note: Steps 1 and 2 are only appropriate for NetBackup versions earlier than 10.1.1. NetBackup 10.1.1 vnetd proxy processes detect the current (lower than expected) ulimit setting at startup and decrease the fd-in-use-threshold to match, so that a 2nd/3rd/4th copy of the process can be started instead of reaching EMFILE. Step 3 is applicable to, and should be implemented for, all NetBackup versions 8.1 and above to avoid encountering EMFILE in other NetBackup processes.
- (Linux) Temporarily increase the open file limit for already running vnetd proxy processes to 8192. This change will not persist through a process restart such as a host reboot.
root@server$ prlimit --pid=`pgrep -f "vnetd -proxy inbound_proxy -number 0"` --nofile=8192:8192
root@server$ prlimit --pid=`pgrep -f "vnetd -proxy outbound_proxy -number 0"` --nofile=8192:8192
- (Linux) Verify the open file limit is increased for the processes.
root@server$ prlimit --pid=`pgrep -f "vnetd -proxy inbound_proxy -number 0"` | grep open
NOFILE max number of open files 8192 8192 files
root@server$ prlimit --pid=`pgrep -f "vnetd -proxy outbound_proxy -number 0"` | grep open
NOFILE max number of open files 8192 8192 files
- (Linux/UNIX) Permanently solve the problem by appropriately configuring the operating system and any clusterware used to start NetBackup. The open file (nofile) ulimit should be 8192 or higher for any O/S or cluster utilities that start NetBackup processes, including command line shells, systemctl, etc.
For details, see the Related Article: Minimum O/S ulimit settings on primary and media server Linux/UNIX platforms.
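The temporary prlimit commands in steps 1 and 2 target only the proxy processes started with -number 0. If additional copies of the proxy processes are running (an assumption; the examples in this article show only -number 0), each PID needs its own prlimit call. The following is a minimal sketch, assuming Linux with pgrep and prlimit installed. The function name is hypothetical, and it only prints the commands so that they can be reviewed before being run as root.

```shell
# Hypothetical helper (not a NetBackup tool): read PIDs from stdin, one per
# line, and print (do not run) a prlimit command for each of them.
prlimit_commands() {
    limit=${1:-8192}
    while read -r pid; do
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # skip anything that is not a plain PID
        esac
        echo "prlimit --pid=$pid --nofile=$limit:$limit"
    done
}

# Example (as root): pgrep -f "vnetd -proxy" | prlimit_commands 8192
```

The printed commands can be pasted into a root shell or piped to sh once they have been checked. As with steps 1 and 2, the change does not persist through a process restart.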
Problem 2:
- Based on the application load on the server, determine the number of concurrently open files needed.
root@server$ sysctl fs.file-nr
- Increase the file-max limit by an arbitrary, but hopefully appropriate, amount.
In the example below, the value is quadrupled to 262144 (4 * 65536).
- Edit /etc/sysctl.conf and set fs.file-max = 262144
- Run 'sysctl -p' to apply the change.
- Verify the change via 'sysctl fs.file-nr'.
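The arithmetic in the example above can be sketched as a small shell function. This is an illustration only: the function name is hypothetical, and the default factor of 4 simply mirrors the example rather than being a sizing recommendation.

```shell
# Hypothetical helper: print the /etc/sysctl.conf line for a file-max value
# scaled by a factor (default 4, as in the example above).
scaled_file_max_line() {
    current=$1
    factor=${2:-4}
    echo "fs.file-max = $(( current * factor ))"
}

# Example: scaled_file_max_line "$(sysctl -n fs.file-max)"
```

For the server shown earlier, 'scaled_file_max_line 65536' prints fs.file-max = 262144, which is the line to place in /etc/sysctl.conf before running 'sysctl -p'.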