Storage Lifecycle Policy ( SLP ) duplication operations fail with STATUS 50 due to Inactivity Detected in NBJM
Problem
The SLP duplication jobs will appear in Activity Monitor, but within a few minutes the message "client process aborted (50)" also appears because the bpduplicate command has not yet been executed. However, after a delay (possibly 3+ minutes up to a couple of hours later), the bpduplicate job is initiated in the admin debug log and may continue to run the duplication. Because the job has aborted (STATUS 50) the duplication will never be regarded as successful even if the data transfer between the storage units completes.
This does not affect manual duplications.
Error Message
The nbjm debug log extract showing the bpduplicate command being cancelled due to inactivity.
nbjm 117 PID:3880 TID:4 File ID:117 [jobid=486664 parentid=0] 4 [CommandJob::start] (1015ec9e0) started command job, cmd=/usr/openv/netbackup/bin/admincmd/bpduplicate /usr/openv/netbackup/bin/admincmd/bpduplicate -jm -jobid <JOBID> ...
nbjm 117 PID:3880 TID:4 File ID:117 [jobid=486664 parentid=0] 6 [JobInstance::setDefaultInactivityTimeout] +++ ENTERING +++ : obj = 1015ecca8 (../JobInstance.cpp:802)
nbjm 117 PID:3880 TID:4 File ID:117 [jobid=486664 parentid=0] 4 [JobInstance::ping] (1015ecca8) ping for job, next interval is -1(JobInstance.cpp:678)
nbjm 117 PID:3880 TID:4 File ID:117 [jobid=486664 parentid=0] 4 [JobInstance::ping] (1015ecca8) ping timer for job is 180(JobInstance.cpp:720)
nbjm 117 PID:3880 TID:4 File ID:117 [jobid=486664 parentid=0] 6 [JobInstance::setDefaultInactivityTimeout] --- EXITING --- : obj = 1015ecca8 (../JobInstance.cpp:802)
nbjm 117 PID:3880 TID:10 File ID:117 [jobid=486664 parentid=0] 4 [JobInstance::handle_timeout] (1015ecca8) inactivity detected for job, last activity at 1414798817(JobInstance.cpp:785)
nbjm 117 PID:3880 TID:10 File ID:117 [jobid=486664 parentid=0] 4 [JobInstance::preSetStatus] (1015ecca8) copy=0, new status=50(JobInstance.cpp:435)
The bpcd debug log extract at that same time showing when the bpcd child process (PID 25933) becomes the bpduplicate processes.
08:08:09.275 [25933] <2> process_requests: child_args[0] = /usr/openv/netbackup/bin/admincmd/bpduplicate
08:08:09.275 [25933] <2> process_requests: child_args[1] = /usr/openv/netbackup/bin/admincmd/bpduplicate
...snip...
08:08:09.277 [25933] <2> process_requests: child_args[27] = /usr/openv/var/global/nbstserv/bid_file_rs_1415808488_0_0
08:08:09.277 [25933] <2> process_requests: Before execvp of command
08:08:09.277 [25933] <2> process_requests: INF - PID = 25933
08:08:10.271 [25929] <2> process_requests:
08:08:10.272 [25929] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
An truss or strace attached to bpcd to follow the child shows that after becoming bpduplicate, it is delayed for many minutes while closing billions of file descriptors before continuing with the duplication operation. Notice the decreasing file descriptors here, and the increasing elapse time.
25933/1: 0.5164 fstat(2234421244, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
25933/1: 0.5166 fstat(2234421243, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
25933/1: 0.5167 fstat(2234421242, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
25933/1: 0.5168 fstat(2234421241, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
25933/1: 0.5169 fstat(2234421240, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
25933/1: 0.5171 fstat(2234421239, 0xFFFFFFFF7FFFDE18) Err#9 EBADF
Cause
The duplication job is aborted by nbjm canceling the job due to perceived lack of forward progress. After nbjm requests that bpcd initiate the bpduplicate command, it waits 180 for a response. When a timely response is not received, nbjm marks the duplication job as terminated.
The delay on the host where bpcd/bpduplicate is executing is caused by having the number of open file descriptors per process (nofiles) set to unlimited, whereas this will normally be set between 8192 and 65536.
Solution
Investigate the ulimit values, specifically the number of open file descriptors (nofile/nofiles), that the operating system is granting to the user or users that own the NetBackup processes.
These are normally observed when executing 'ulimit -n ', but can be set in multiple places within the operating system configuration. Please seek assistance from the UNIX/Linux system administrator for the operating system to reduce the file descriptors down to a more reasonable number.
Veritas recommends a minimum of 8192 for primary and media servers.
Note: For additional details, see the Related Article: Minimum O/S ulimit settings on primary and media server Linux/UNIX platforms.
Applies To
NetBackup 7.x - 10.x UNIX/Linux primary and media servers