Backup fails with status 24 after the network stops delivering data to the media server.

Article: 100022367
Last Published: 2013-06-20
Product(s): NetBackup & Alta Data Protection

Problem

Backup fails with status 24 after the network stops delivering data to the media server.

Error Message

ERR - Cannot write to STDOUT. Errno = 32: Broken pipe

Solution

Overview:

When a network socket hangs and stops delivering data, the NetBackup client process will eventually time out and fail the backup.

In this case, a network trace showed that the TCP stack on the media server host closed the window while the storage unit was getting ready to receive data.  The TCP stack on the client host sent repeated probes to see if the window could be reopened, but even after the available data was accepted and written to storage, the TCP stack on the media server did not respond to the probes.  The resulting deadlock caused a suspension of data transfer and the eventual failure.

Troubleshooting:

The bpbkar debug log shows a backup in progress and then a delay, after which a broken network socket is reported.

05:14:36.502 [23506] <2> bpbkar SelectFile: INF - Resolved_path = /export/home/OracleExport
05:14:36.506 [23506] <4> bpbkar PrintFile: /
05:14:36.506 [23506] <4> bpbkar PrintFile: /export/
05:14:36.507 [23506] <4> bpbkar PrintFile: /export/home/
05:14:36.507 [23506] <4> bpbkar PrintFile: /export/home/OracleExport/
05:23:26.211 [23506] <16> flush_archive(): ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
05:23:26.212 [23506] <16> bpbkar Exit: ERR - bpbkar FATAL exit status = 24: socket write failed
05:23:26.212 [23506] <4> bpbkar Exit: INF - EXIT STATUS 24: socket write failed
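
The bpbkar, bpbrm, and bpdm debug logs referenced in this article are legacy debug logs; if they are not present, they can typically be enabled by creating the corresponding log directories and rerunning the backup (default UNIX paths shown):

$ mkdir /usr/openv/netbackup/logs/bpbkar     (on the client host)
$ mkdir /usr/openv/netbackup/logs/bpbrm      (on the media server host)
$ mkdir /usr/openv/netbackup/logs/bpdm       (on the media server host)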

The bpbrm debug log shows the file name being passed to the client, followed some time later by the client reporting a broken socket.  Because the failure message arrives over the NAME socket, the broken socket must be the DATA socket between bpbkar and bpdm.  Since the client has failed, bpbrm then terminates bpdm.

01:14:36.303 [23681] <2> write_file_names: buffering file name '/export/home/OracleExport' for output
...snip...
01:14:37.956 [23681] <2> bpbrm readline: retrying partial read from fgets ::
01:23:26.145 [23681] <16> bpbrm main: from client myclient: ERR - Cannot write to STDOUT. Errno = 32: Broken pipe
...snip...
01:23:26.254 [23681] <2> bpbrm main: client myclient EXIT STATUS = 24: socket write failed
...snip...
01:23:26.254 [23681] <2> bpbrm kill_child_process: start
01:23:28.426 [23681] <2> bpbrm wait_for_child: child exit_status = 82 signal_status = 0
...snip...
01:23:29.396 [23681] <2> bpbrm Exit: client backup EXIT STATUS 24: socket write failed

The bpdm debug log confirms that the backup was awaiting inbound data for some time until terminated by bpbrm.

01:14:37.096 [23682] <2> io_set_recvbuf: setting receive network buffer to 262144 bytes
01:14:37.097 [23682] <2> io_set_recvbuf: receive network buffer is 64860 bytes
...snip...
01:14:37.955 [23682] <4> write_backup: begin writing backup id myclient_1255065274, copy 1, fragment 1, destination path /myDSU
01:14:37.955 [23682] <2> signal_parent: sending SIGUSR1 to bpbrm (pid = 23681)
01:23:26.254 [23682] <2> catch_signal: Interrupt processing
01:23:28.264 [23682] <2> db_error_add_to_file: dberrorq.c:midnite = 1255060800
01:23:28.282 [23682] <16> catch_signal: media manager terminated by parent process
...snip...
01:23:28.413 [23682] <2> catch_signal: EXITING with status 82


The 'netstat -na' output from the client host shows that the send window closed almost immediately and did not reopen even after the storage unit began accepting data.  The socket closed when bpbkar exited.

TCP: IPv4
  Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
Friday October  9 05:14:08 GMT 2009
  (The socket does not exist until the backup starts.)
Friday October  9 05:14:38 GMT 2009
10.43.2.62.13724     198.235.125.52.42317     1      0 49344      0 ESTABLISHED
10.43.2.62.13724     198.235.125.52.42318 49680      0 49680      0 ESTABLISHED
Friday October  9 05:15:08 GMT 2009
10.43.2.62.13724     198.235.125.52.42317     1      0 49344      0 ESTABLISHED
10.43.2.62.13724     198.235.125.52.42318 49680      0 49680      0 ESTABLISHED
  (...snipped waiting for the client to timeout ...)
Friday October  9 05:23:08 GMT 2009
10.43.2.62.13724     198.235.125.52.42317     1      0 49344      0 ESTABLISHED
10.43.2.62.13724     198.235.125.52.42318 49680      0 49680      0 ESTABLISHED
Friday October  9 05:23:38 GMT 2009
  (The socket has closed.)

The 'netstat -na' output from the media server host confirms the sequence, except that the receive window does not close.  This is an early indication that the TCP stacks are not in sync with each other.

Fri Oct  9 01:14:10 EDT 2009
  (The socket does not exist until the backup starts.)
Fri Oct  9 01:14:40 EDT 2009
198.235.125.52.42317 10.43.2.62.13724     49344      0 64860      0 ESTABLISHED
198.235.125.52.42318 10.43.2.62.13724     49680      0 49680      0 ESTABLISHED
Fri Oct  9 01:15:10 EDT 2009
198.235.125.52.42317 10.43.2.62.13724     49344      0 64860      0 ESTABLISHED
198.235.125.52.42318 10.43.2.62.13724     49680      0 49680      0 ESTABLISHED
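
Output of this kind can be gathered on each host with a simple loop run during the backup window; the port number and 30 second interval below are illustrative:

$ while true; do date; netstat -na | grep 13724; sleep 30; done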

A truss of the bpdm process confirms that it was prepared to accept 256 KB (262144 bytes) from the socket, but only 97980 bytes were available because the window had closed.  It then attempted to read the remaining 164164 bytes, but the TCP stack did not deliver any additional data.

23682/1:          2.0320 read(0, 0xFC000000, 262144)                     = 97980
23682/1:            /\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
23682/1:           \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
...snip...
23682/1:           \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
23682/1:           \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
23682/1:         read(0, 0xFC017EBC, 164164)     ( sleeping...)
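
Output of this form can be gathered by attaching truss to the running bpdm process.  One possible invocation, using the process ID from the bpdm debug log and the file descriptors observed in this case (0 for the inbound data socket, 4 for the debug log), is:

$ truss -f -d -r 0 -w 4 -o /tmp/bpdm.truss -p 23682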

A network trace confirmed that the TCP stack on the media server host did not respond to the Zero Window Probes from the TCP stack on the client host.  The window remained closed which prevented the client from sending additional data.
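
Such a trace can be captured on either host with the operating system packet capture utility while the backup runs; for example, on Solaris (the interface name and output file are illustrative):

$ snoop -d hme0 -o /tmp/backup.snoop host 10.43.2.62 and port 13724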

Resolution:

The root cause of the problem is external to NetBackup and lies within the TCP stack on the media server host.  The operating system vendor should be contacted for a formal solution.

Work-arounds:

Observation of the bpdm debug log (above) shows that the NET_BUFFER_SZ file is in place, but the requested size is not being honored by the operating system even though success (0) is returned to the application.  The truss output below confirms this at the system call level: setsockopt returns 0, yet the subsequent getsockopt shows that the receive buffer remains at 64860 bytes.

23682/1:          1.5405 open("/usr/openv/netbackup/NET_BUFFER_SZ", O_RDONLY) = 8
23682/1:          1.5413 read(8, " 2 6 2 1 4 4\n", 8192)                 = 7
23682/1:          1.5416 close(8)                                        = 0
23682/1:          1.5418 setsockopt(0, SOL_SOCKET, SO_RCVBUF, 0xFFBFE394, 4, SOV_DEFAULT) = 0
23682/1:          1.5425 write(4, 0xFFBFCC14, 87)                        = 87
23682/1:            0 1 : 1 4 : 3 7 . 0 9 6   [ 2 3 6 8 2 ]   < 2 >   i o _ s e t _ r
23682/1:            e c v b u f :   s e t t i n g   r e c e i v e   n e t w o r k
23682/1:            b u f f e r   t o   2 6 2 1 4 4   b y t e s\n
23682/1:          1.5426 getsockopt(0, SOL_SOCKET, SO_RCVBUF, 0xFFBFE394, 0xFFBFE390, SOV_DEFAULT) = 0
23682/1:          1.5433 write(4, 0xFFBFCC14, 78)                        = 78
23682/1:            0 1 : 1 4 : 3 7 . 0 9 7   [ 2 3 6 8 2 ]   < 2 >   i o _ s e t _ r
23682/1:            e c v b u f :   r e c e i v e   n e t w o r k   b u f f e r   i
23682/1:            s   6 4 8 6 0   b y t e s\n
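
The size currently being requested can be confirmed by examining the file directly:

$ cat /usr/openv/netbackup/NET_BUFFER_SZ
262144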

Removing the NET_BUFFER_SZ file prevented bpdm from issuing the setsockopt system call to request increased buffer space.  The TCP stack on the media server then responded correctly to the Zero Window Probes from the client host, and backups were successful.
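
For example, renaming the file (so that it can easily be restored later) is sufficient to prevent the request:

$ mv /usr/openv/netbackup/NET_BUFFER_SZ /usr/openv/netbackup/NET_BUFFER_SZ.disabled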

A review of the current and maximum network buffer space confirmed that the limits could be raised.

$ ndd -get /dev/tcp tcp_recv_hiwat
49152
$ ndd -get /dev/tcp tcp_xmit_hiwat
49152
$ ndd -get /dev/tcp tcp_max_buf
1048576

Increasing the TCP receive buffer space to a value greater than or equal to the desired NET_BUFFER_SZ also provided a work-around.

$ ndd -set /dev/tcp tcp_recv_hiwat 262144
$ ndd -set /dev/tcp tcp_xmit_hiwat 262144
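
Note that ndd settings do not persist across a reboot.  To make the change permanent, the commands are typically placed in a boot-time script; the script name below is illustrative:

$ cat /etc/rc2.d/S99tcp_buffers
#!/bin/sh
# Raise the TCP send and receive buffer limits at boot
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 262144
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 262144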

See Related Articles for NET_BUFFER_SZ best practices. 
