NetBackup™ Backup Planning and Performance Tuning Guide
About tape I/O error handling
Note:
This topic has nothing to do with the number of times NetBackup retries a backup or restore that fails. That situation is controlled by the global configuration parameter Backup Tries for backups and the bp.conf entry RESTORE_RETRIES for restores.
The algorithm that is described here determines whether I/O errors on tape should cause media to be frozen or drives to be downed.
When a read/write/position error occurs on tape, the error that is returned by the operating system does not identify whether the tape or drive caused the error. To prevent the failure of all backups in a given time frame, bptm tries to identify a bad tape volume or drive based on past history.
To do so, bptm uses the following logic:
Each time an I/O error occurs on a read, write, or position operation, bptm logs the error in the following file:
UNIX
/usr/openv/netbackup/db/media/errors
Windows
install_path\NetBackup\db\media\errors
The error message includes the time of the error, media ID, drive index, and type of error. The following examples illustrate the entries in this file:
07/21/96 04:15:17 A00167 4 WRITE_ERROR
07/26/96 12:37:47 A00168 4 READ_ERROR
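As an illustration, the following Python sketch parses one entry of the errors file into its fields. It assumes the whitespace-separated layout and two-digit year that the sample entries show. The names TapeError and parse_entry are hypothetical and are not part of NetBackup.

```python
from datetime import datetime
from typing import NamedTuple


class TapeError(NamedTuple):
    """One entry of the media errors file (hypothetical helper type)."""
    when: datetime
    media_id: str
    drive_index: int
    error_type: str


def parse_entry(line: str) -> TapeError:
    # Assumed field order: date, time, media ID, drive index, error type.
    date, time, media_id, drive, error_type = line.split()
    when = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
    return TapeError(when, media_id, int(drive), error_type)


entry = parse_entry("07/21/96 04:15:17 A00167 4 WRITE_ERROR")
print(entry.media_id, entry.drive_index, entry.error_type)
```

A line that does not contain exactly five fields raises ValueError in this sketch, so a real parser would need to decide how to handle malformed or differently formatted entries.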
Each time an entry is made, the past entries are scanned. The scan determines whether the same media ID or drive has had this type of error in the past "n" hours. "n" is known as the time_window. The default time window is 12 hours.
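The scan can be pictured with a short Python sketch. This is not the bptm implementation: it only assumes that an earlier entry is relevant when it has the same error type, falls within the time window that ends at the new error, and shares either the media ID or the drive. The function name matching_history and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta


def matching_history(history, new_error, time_window_hours=12):
    """Return past entries that fall inside the time window and match
    the new error's type and either its media ID or its drive."""
    cutoff = new_error["when"] - timedelta(hours=time_window_hours)
    return [
        e for e in history
        if cutoff <= e["when"] <= new_error["when"]
        and e["error_type"] == new_error["error_type"]
        and (e["media_id"] == new_error["media_id"]
             or e["drive"] == new_error["drive"])
    ]


history = [
    # Inside the 12-hour window, same media ID and error type.
    {"when": datetime(1996, 7, 21, 4, 15, 17), "media_id": "A00167",
     "drive": 4, "error_type": "WRITE_ERROR"},
    # Same media ID, but older than the window.
    {"when": datetime(1996, 7, 20, 10, 0, 0), "media_id": "A00167",
     "drive": 4, "error_type": "WRITE_ERROR"},
    # Inside the window, but a different error type.
    {"when": datetime(1996, 7, 21, 9, 0, 0), "media_id": "A00167",
     "drive": 2, "error_type": "READ_ERROR"},
]
new_error = {"when": datetime(1996, 7, 21, 10, 0, 0), "media_id": "A00167",
             "drive": 2, "error_type": "WRITE_ERROR"}

matches = matching_history(history, new_error)
print(len(matches))
```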
During the history search for the time_window entries, EMM notes the past errors that match the media ID, the drive, or both. The purpose is to determine the cause of the error. For example:
- If a media ID gets write errors on more than one drive, the tape volume may be bad and NetBackup freezes the volume.
- If more than one media ID gets a particular error on the same drive, the drive goes to a "down" state.
- If only past errors are found on the same drive with the same media ID, EMM assumes that the volume is bad and freezes it.
The freeze or down operation is not performed on the first error.
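The cause analysis described above can be sketched in Python as follows. This is an illustrative reading of the three cases, not the actual EMM logic. The function name likely_cause is hypothetical, and the order in which the cases are tested when more than one applies is an assumption.

```python
def likely_cause(new_error, past_errors):
    """Guess whether the media or the drive is at fault.

    new_error and each item of past_errors are (media_id, drive_index)
    pairs. past_errors is assumed to hold only errors of the same type
    that already fall inside the time window.
    """
    media_id, drive = new_error
    # Drives on which this media ID has failed, including the current one.
    drives_for_media = {d for m, d in past_errors if m == media_id} | {drive}
    # Media IDs that have failed on this drive, including the current one.
    media_on_drive = {m for m, d in past_errors if d == drive} | {media_id}

    if len(drives_for_media) > 1:
        return "media"  # the same tape fails on more than one drive
    if len(media_on_drive) > 1:
        return "drive"  # more than one tape fails on the same drive
    if (media_id, drive) in past_errors:
        return "media"  # only this tape and drive pair has a history
    return None  # no relevant history: this is the first error


print(likely_cause(("A00167", 4), [("A00167", 2)]))
print(likely_cause(("A00167", 4), [("A00168", 4)]))
```

The sketch only classifies the probable cause. Whether a freeze or down actually happens also depends on how many errors have occurred in the time window.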
Note two other parameters: media_error_threshold and drive_error_threshold. For both of these parameters, the default is 2. For a freeze or down to happen, more than the threshold number of errors must occur. By default, at least three errors must occur in the time window for the same drive or media ID.
If either media_error_threshold or drive_error_threshold is 0, a freeze or down occurs the first time an I/O error occurs. media_error_threshold is looked at first, so if both values are 0, a freeze overrides a down. Veritas does not recommend that these values be set to 0.
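A minimal Python sketch of the threshold comparison follows. It assumes that each count includes the current error, so that "more than the threshold" becomes a strict greater-than test, and that the media check runs before the drive check. The function name decide_action and the count parameters are hypothetical.

```python
def decide_action(media_error_count, drive_error_count,
                  media_error_threshold=2, drive_error_threshold=2):
    """Return "freeze", "down", or None.

    Both counts are assumed to include the error that just occurred and
    to cover only errors inside the time window.
    """
    # The media threshold is checked first, so a freeze overrides a down.
    if media_error_count > media_error_threshold:
        return "freeze"
    if drive_error_count > drive_error_threshold:
        return "down"
    return None


print(decide_action(1, 1))        # first error, default thresholds
print(decide_action(3, 1))        # third error for the same media ID
print(decide_action(1, 1, 0, 0))  # both thresholds set to 0
```

With the default value of 2, the third error in the time window is the first one that exceeds the threshold. With a threshold of 0, the first error already exceeds it.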
A change to the default values is not recommended without good reason. One obvious change would be to put very large numbers in the threshold files. Very large thresholds would effectively disable the mechanism, so that a tape is never frozen and a drive is never downed.
Freezing and downing are primarily intended to benefit backups. If read errors occur on a restore, a freeze of media has little effect. NetBackup still accesses the tape to perform the restore. In the restore case, downing a bad drive may help.
For further tuning information on tape backup, see the following topics: