NetBackup™ Backup Planning and Performance Tuning Guide

Last Published: 2024-04-16

Product(s): NetBackup & Alta Data Protection (10.4, 10.3.0.1, 10.3, 10.2.0.1, 10.2, 10.1.1, 10.1, 10.0.0.1, 10.0, 9.1.0.1, 9.1, 9.0.0.1, 9.0, 8.3.0.2, 8.3.0.1, 8.3)

About tape I/O error handling

Note:

This topic has nothing to do with the number of times NetBackup retries a backup or restore that fails. That situation is controlled by the global configuration parameter Backup Tries for backups and the bp.conf entry RESTORE_RETRIES for restores.

The algorithm that is described here determines whether I/O errors on tape should cause media to be frozen or drives to be downed.

When a read/write/position error occurs on tape, the error that is returned by the operating system does not identify whether the tape or drive caused the error. To prevent the failure of all backups in a given time frame, bptm tries to identify a bad tape volume or drive based on past history.

To do so, bptm uses the following logic:

Each time an I/O error occurs on a read/write/position, bptm logs the error in the following file.
Linux/UNIX
```
/usr/openv/netbackup/db/media/errors
```
Windows
```
install_path\NetBackup\db\media\errors
```
The error message includes the time of the error, media ID, drive index, and type of error. The following examples illustrate the entries in this file:
```
07/21/96 04:15:17 A00167 4 WRITE_ERROR 
07/26/96 12:37:47 A00168 4 READ_ERROR
```
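As an illustration of the entry layout described above, the following Python sketch splits one line into its fields. It assumes the five whitespace-separated fields shown in the examples (date, time, media ID, drive index, error type) and a two-digit year. The names `MediaError` and `parse_error_line` are hypothetical and are not part of NetBackup, and real error files may contain variations that this sketch does not handle.

```python
from datetime import datetime
from typing import NamedTuple


class MediaError(NamedTuple):
    """One entry from the media errors file (hypothetical helper type)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def parse_error_line(line: str) -> MediaError:
    """Parse a line such as '07/21/96 04:15:17 A00167 4 WRITE_ERROR'.

    Assumes exactly five whitespace-separated fields, as in the examples;
    a line with a different field count raises ValueError.
    """
    date, time, media_id, drive_index, error_type = line.split()
    when = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
    return MediaError(when, media_id, int(drive_index), error_type)


entry = parse_error_line("07/21/96 04:15:17 A00167 4 WRITE_ERROR ")
print(entry.media_id, entry.drive_index, entry.error_type)
```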
Each time an entry is made, the past entries are scanned. The scan determines whether the same media ID or drive has had this type of error in the past "n" hours. "n" is known as the time_window. The default time window is 12 hours.
During the history search for the time_window entries, EMM notes the past errors that match the media ID, the drive, or both. The purpose is to determine the cause of the error. For example:
If a media ID gets write errors on more than one drive, the tape volume may be bad and NetBackup freezes the volume.
If more than one media ID gets a particular error on the same drive, the drive goes to a "down" state.
If the only past errors found are on the same drive with the same media ID, EMM assumes that the volume is bad and freezes it.
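The history search can be pictured with a short Python sketch. It is a simplified reading of the rules in this topic, not NetBackup's actual implementation. The function name `suspect`, the dictionary keys, and the choice to test the media rule before the drive rule when both could apply are assumptions made for illustration. The sketch reports only which component looks suspect; it does not apply the error thresholds.

```python
from datetime import datetime, timedelta


def suspect(history, new, time_window_hours=12):
    """Return 'media', 'drive', or None for a new I/O error (illustrative only).

    history: list of dicts with keys 'time', 'media', 'drive', 'error'
    new:     dict with the same keys, describing the error just logged
    """
    cutoff = new["time"] - timedelta(hours=time_window_hours)
    # Keep past errors of the same type, inside the time window, that
    # match the media ID, the drive, or both.
    recent = [
        e for e in history
        if cutoff <= e["time"] <= new["time"]
        and e["error"] == new["error"]
        and (e["media"] == new["media"] or e["drive"] == new["drive"])
    ]
    if not recent:
        return None  # no matching history: nothing to conclude yet

    drives_for_media = {e["drive"] for e in recent if e["media"] == new["media"]}
    drives_for_media.add(new["drive"])
    media_for_drive = {e["media"] for e in recent if e["drive"] == new["drive"]}
    media_for_drive.add(new["media"])

    if len(drives_for_media) > 1:
        return "media"  # same media failed on more than one drive
    if len(media_for_drive) > 1:
        return "drive"  # several media failed on the same drive
    return "media"      # only the same media on the same drive


t0 = datetime(1996, 7, 21, 4, 15, 17)
past = [{"time": t0, "media": "A00167", "drive": 4, "error": "WRITE_ERROR"}]
later = {"time": t0 + timedelta(hours=1), "media": "A00168", "drive": 4,
         "error": "WRITE_ERROR"}
print(suspect(past, later))  # a second media ID failing on drive 4
```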
The freeze or down operation is not performed on the first error.
Note two other parameters: media_error_threshold and drive_error_threshold. For both of these parameters, the default is 2. For a freeze or down to happen, more than the threshold number of errors must occur. By default, at least three errors must occur in the time window for the same drive or media ID.
If either media_error_threshold or drive_error_threshold is 0, a freeze or down occurs the first time an I/O error occurs. media_error_threshold is looked at first, so if both values are 0, a freeze overrides a down. Veritas does not recommend that these values be set to 0.
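The threshold rules can be summarized in a minimal Python sketch. It assumes that the error counts include the error just logged and that a freeze or down requires a count strictly greater than the threshold, as described in this topic. The function `action_for_error` and its parameters are hypothetical names for illustration and are not part of NetBackup.

```python
def action_for_error(media_errors, drive_errors,
                     media_error_threshold=2, drive_error_threshold=2):
    """Return 'freeze', 'down', or None (illustrative only).

    media_errors / drive_errors: matching errors in the time window for
    the media ID and the drive, counting the error just logged.
    """
    # The media threshold is checked first, so a freeze overrides a down.
    if media_errors > media_error_threshold:
        return "freeze"
    if drive_errors > drive_error_threshold:
        return "down"
    return None


print(action_for_error(1, 1))        # first error, default thresholds
print(action_for_error(3, 1))        # third error for the same media ID
print(action_for_error(1, 1, 0, 0))  # both thresholds set to 0
```

With the default thresholds of 2, the first and second errors produce no action and the third produces one. With both thresholds set to 0, the first error already exceeds the media threshold, so the result is a freeze rather than a down.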
A change to the default values is not recommended without good reason. One obvious change would be to put very large numbers in the threshold files. Large numbers in those files would disable the mechanism, so that tapes are never frozen and drives are never downed.
Freezing and downing are primarily intended to benefit backups. If read errors occur on a restore, a freeze of media has little effect. NetBackup still accesses the tape to perform the restore. In the restore case, downing a bad drive may help.

For further tuning information on tape backup, see About the threshold for media errors.