Information Center

The Complete Data Deduplication Guide and Why It is Important

Today, organizations save a lot of critical information in their databases daily. They back up the data on auto-pilot, naturally resulting in continuous data re-coping and re-saving. As a result, the data storage becomes unnecessarily burdened over time with redundant data copies, which costs money as data requirements grow and processing times become slower.

Hence, the need for data deduplication.

Data deduplication eliminates redundant data to reduce storage needs. The technology has graduated from cutting-edge technology to a mainstream staple. Data deduplication first appeared in 2003 when organizations wanted to move away from tape storage towards disk-based backups for their performance.

A decade later, data deduplication is standard in backup products like Veritas’ NetBackup Appliances, making it a valuable tool in an enterprise data protection strategy.

However, there are numerous elements to consider when picking the right backup deduplication technology for your business. You need to consider issues such as the types of deduplication available, how the technology works, the factors affecting deduplication, how it differs from compression, and deduplication with virtualization environments.

This complete guide on data deduplication explains all these and how Veritas builds advanced data deduplication technology into its NetBackup Appliance media servers. As a result, Veritas has engineered NetBackup Media Server Deduplication Pool (MSDP) with over 80 patents specifically on deduplication.

MSDP ensures cloud deduplication with compression and encryption, shorter backup times, and faster recovery at scale.

What is Data Deduplication?

The data deduplication process eliminates redundant data copies and reduces a software system’s processing time. As a result, every software system data backup results in copying and storing large data sets. Over time, it requires a significant amount of data storage. Therefore, data deduplication optimizes data storage to ensure the organization only copies and stores one unique data instance.

At the basic level, deduplication eliminates non-unique data segments within data sets. From this definition, deduplication is not that different from compression. However, its real distinction is data reduction against historical data, which allows storage savings and prevents copying similar data from multiple sources.

Compression was previously the primary storage savings activity before deduplication. During compression, backup solutions compressed data streams as the data was written to a backup tape or disk. However, compression savings only occurred at that point in time. Thus, it was possible to compress a similar data backup at another point in time, which could take up an equivalent amount of space.

Data Deduplication is different because it segments data and checks it against a matrix representing previously written data. As a result, unique segments are sent to storage, while non-unique ones create a reference to the unique instances of similar segmented data.

For example, a company’s email system might contain 50 instances of the same one megabyte (MB) file attachment. Backing up the platform without deduplication will see all 50 instances saved, requiring 50MB of storage space. However, deduplication only allows the storage of one instance of the email attachment, with each subsequent instance referenced to the saved copy. Therefore, the 50MB storage demand drops to one MB.

How Data Deduplication Works

In the simplest terms, the data dedupe process starts by chopping the data set aside for deduplication into chunks. A chunk consists of one or more contiguous data blocks. How and where the process divides the chunks is subject to individual patents. However, once the process creates a series of chunks, data deduplication compares them against all previous chunks already made and seen by the dedupe system.

The system compares chunks by running a deterministic cryptographic hashing algorithm that creates a hash. If the hashes of two different chunks match, the system considers them identical since even the slightest change changes chunk hashes. For example, if the cryptographic hashing algorithm creates a 160 bits hash called SHA-1 for an eight MB chunk, the system saves almost eight MBs every time it backs up that chunk. Hence, data dedupe is a significant space saver.

The data deduplication process eliminates duplicate data blocks and stores unique data blocks only. It relies on fingerprints – unique digital signatures for data blocks. Therefore, the inline deduplication engine will examine incoming data blocks, develop a fingerprint for each and store it in a hash store when the system writes the data (in-memory data structure).

After calculating the fingerprint, the process conducts a lookup in the hash store. It then examines data blocks matching cache memory’s duplicate fingerprint (donor block). If it finds a match in the hash store, one of two things happens:

If a match is present, it compares the new data block (receiver) and the donor block, which acts as verification. The system verifies the data between the two blocks without writing the receiver block to the disk. It then updates the metadata to track the sharing details.
If the donor block is unavailable in the cache memory, the system pre-fetches it from the disk to compare it byte-to-byte with the recipient block in the cache. If it’s an exact match, the system flags the recipient block as a duplicate without writing it into the disk but updates the metadata to track the sharing details.

The background duplication engine functions similarly. It searches all data bocks in bulk. It compares block fingerprints and performs byte-to-byte comparisons to eliminate false positives and remove duplicates. The process does not lose any data.

Types of Data Deduplication

While it’s not hard to make a deduplication engine, it isn’t easy to make a performance-optimized and fault-tolerant solution that’s scalable. How and where deduplication occurs makes a significant difference in service quality. Below are the major types of deduplication:

1. Post-Process Data Deduplication

As the least efficient form of data deduplication, post-process deduplication requires a large disk cache to temporarily store a complete data set plus another disk cache for the deduplicated data. Therefore, it does not apply the deduplication process until after the successful writing of data to the target disk, where processing occurs using a post-processing deduplication method. It then stores the data in a deduplication repository.

While it helps get data from the source without worrying about the processing time, it results in inefficient use of space, leading to data integrity issues. Due to these setbacks, Veritas deduplication does not offer post-process deduplication.

2. Inline Data Deduplication

Inline data deduplication applies the deduplication process to the data stream before writing it to storage. It only writes unique data segments to storage.

Target-level inline deduplication means all data stored is streamed to the target device and deduplicated as it’s written to storage.
Source-level inline deduplication means data deduplication against written data occurs before it’s sent to the target device.

Source side deduplication is efficient from a data transport perspective because it dramatically reduces the data amount the organization needs to send across the network. Fortunately, Veritas deduplication performs both target and source inline deduplication and compression.

Below are other common methods of data deduplication:

File deduplication: Refers to deduplication at the file level and examines the file as a whole instead of its contents. The deduplicate functions remove duplicate files and reference the original. However, they fail to address identical content within the file.
Chunking deduplication: It breaks down data into chunks and runs them through a hashing algorithm to create a unique hash of that data set. As with file deduplication, the system removes duplicate hashes and keeps the original.
Sub-file deduplication: It assesses individual file contents to search for duplicate content before removing it. It divides the content into file blocks and compares them against each other to remove duplicate content to save storage.
Client backup deduplication: It’s also called source deduplication and occurs on the internal client backup program using the chunking method to remove duplicate data.
Windows server deduplication: It allows users to store data only once while creating intelligent pointers to its location. Microsoft keeps improving windows deduplication. For example, Windows Server 2019 can now deduplicate NTFS and ReFS volumes.

Why is Data Deduplication Important?

Even though disk capacities continually increase, data storage vendors still seek methods to help customers store their ever-increasing data into storage and backup devices. Besides, exploring opportunities to help maximize data storage and the potential disk capacity makes sense.

Hence, storage and backup vendors rely on data reduction strategies such as deduplication and compression. They allow customers to effectively store more data than the capacity of their storage media suggests. So if the customer gets a five to one (5:1) benefit from various data reduction mechanisms, they can theoretically store up to 50 TB of data on a 10 TB storage array.

Consider the scenario below:

An organization runs a virtual desktop environment supporting 200 identical workstations that store their data on an expensive storage array explicitly purchased for that purpose. Suppose the organization runs copies of Windows 10, Office 2013 and 2016, ERP software, and numerous other software tools that users require, and each workstation image consumes about 25 GB of disk space. The 200 workstations will consume five terabytes of capacity.

Data deduplication allows the organization to store one copy of the individual virtual machines while the storage array places pointers to the rest. Therefore, each time the deduplication engine finds an identical data asset already stored in the environment, it saves a small pointer in place of the data copy instead of copying the data again. This way, deduplication frees up storage blocks.

Factors Affecting Data Deduplication

Careful deduplication deployment planning is necessary to ensure the protected data deduplicates well. Different data types can get different deduplication levels based on the makeup. For example, image files, virtual images, compressed data, encrypted data, and NDMP streams don’t deduplicate well.

Additionally, databases with a high change rate may require more effort to ensure data presentation in a manner that results in optimal deduplication results. The Veritas data deduplication process can implement separate policies within NetBackup for different data types based on how well they deduplicate.

Veritas has designed two different methods to improve data deduplication:

Adaptive variable-length segmentation
Fixed-length segmentation with stream handlers

MSDP uses intelligent stream handlers which employ Veritas technology to optimize the stream for deduplication based on the data type. In addition, stream handlers are adaptive and data-aware, so they help improve storage efficiency and backup performance based on the data type ingested.

As a result, the data stream turns into something that achieves consistently good deduplication rates at high speeds with fixed length segmentation. In addition, it engages stream handlers in standard filesystem backups and VMware, NetApp, EMC NDMP, Hyper-V, and other snapshot-based solutions such as FlashBackup.

Veritas introduced Adaptive Variable Length (VLD) segmentation in NetBackup for optimal deduplication results whenever the client cannot employ a stream handler. VLD uses defined segment size ranges to find optimal segmentation for the deduplicated data, allowing for the best results for opaque data while utilizing CPU power better than fixed-length segmentation.

NetBackup, NetBackup Virtual Appliances, and NetBackup Appliances can create a deduplication pool that extends beyond shelf boundaries and does not restrict disk shelves from other storage use. In addition, MSDP allows organizations to select between fixed-length, variable-length, and no deduplication on one media server.

Many of today’s applications use encryption at rest, which industry security trends drive rapidly. NetBackup does not require dedicated storage shelves for data storage, meaning these workloads are directed to a non-deduplicated storage pool, saving up to 200% in storage costs. It’s something to consider when comparing vendor rates.

Benefits of Data Deduplication

Data deduplication is essential because it significantly reduces storage space requirements, saves money, and reduces the amount of bandwidth wasted transferring data to and fro remote storage locations. It also improves scaling and efficiency when storing and pulling data from one source. Having lots of similar data stored in different spaces slows down the entire system.

Below are some other benefits:

Creates backup capacity by reducing redundancy, especially in full backups
Allows continuous data validation as opposed to simply storing backup data because the latter only discovers problems during recovery
Enables higher data recovery because its accurate, faster, and reliable
Supports optimal backup data disaster recovery because deduplication has an excellent capacity optimization capability
Deduplication has a smaller data footprint
It uses less bandwidth while copying data for replication, remote backups, and disaster recovery
It has longer retention periods
It achieves reduced tape backups with faster recovery time targets

Differences between Data Deduplication and Compression

Data dedupe looks for duplicate data chunks and places pointers instead of copying them again, while compression minimizes the number of storage bits required to represent data. However, both are part of data reduction strategies that maximize storage capacity.

Data Deduplication Uses Cases

Below are areas where deduplication is applicable:

General-Purpose File Servers

These file servers have numerous purposes and may hold the following shares:

Home folders for users
Folders for work
Shared by groups
Shared in software development

Multiple users have numerous data copies and revisions of the same file, making general-purpose file servers suitable for data deduplication. In addition, it benefits software development shares because many binaries remain primarily unchanged from build to build.

Virtual Desktop Infrastructure (VDI) Deployments

VDI servers like remote desktop services allow organizations to supply employees with PCs efficiently. Below are some reasons for using this technology:

Allows application deployment throughout the organization, helpful when dealing with regularly updated, rarely utilized, and hard-to-administer apps.
Allows application consolidation by eliminating the need for software updates on client computers because it installs and runs them from centrally controlled virtual machines
Enables remote access to enterprise programs from personal devices, some with different operating systems
Enables branch office access and improves the application performance of branch office workers who require access to centralized data repositories

VDI deployments are excellent data deduplication candidates because virtual hard disks driving the remote desktops are virtually identical.

Backup Targets

Virtualized backup apps are backup targets due to the effective data deduplication between backup snapshots. Therefore, backup programs are perfect candidates for deduplication.

Data Deduplication for Backup and Disaster Recovery

Data deduplication technology has achieved significant savings when used in the backup infrastructure. However, logically, backup images will eventually result in duplicated data.

For example, it’s easy to have a situation where multiple parties work on the same data sets or documents. It could result in partial or wholly duplicated data across numerous systems, which is inefficient and costly. In addition, scenarios resulting in multi-year data retention requirements could lead to staggering amounts of data storage.

Tape storage was initially the best cost-effective solution for data retention. However, the cost of storing all that data became a significant problem. While tape keeps costs lower than sets of arrays, it is not an ideal solution because the media tends to take up too much physical space.

Tape storage also results in a large data center footprint of specialized management hardware. Long-term data shipping and storage and other logistical challenges occur when getting the tapes where they are needed. It adds significant downtime during emergency restore situations and significantly impacts the operational ability and total ownership costs.

Veritas considered all these issues to develop a well-rounded data protection solution in the form of a powerful and integrated data deduplication storage engine. We integrated MSDP and NetBackup to create a complete solution in a single application. As a result, our deduplication data format is now highly portable with new possibilities. In addition, it facilitates data replication across multiple locations and diverse targets.

Finally, NetBackup clients support client-side deduplication, while MSDP does not limit the number of incoming streams or refuse connections, unlike other data deduplication solutions.

Data Deduplication with Virtualization Environments

Virtualization solutions have come with a new set of opportunities and complexities. For example, many virtual entities usually share a common core infrastructure, leading to VM sprawl where thousands of hosts share data sets or a standard template while having unique elements. Protecting these points while maintaining the independence of guest systems could result in storing massive amounts of historical data.

Data deduplication helps protect all the data. NetBackup MSDP protects virtual machine (VM) data and provides instant operational and disaster recoverability. In addition, customers can leverage NetBackup Appliances and NetBackup Universal Share with MSDP to secure instant access to individual files from VMs or secondary copies of the VMs for replication, testing, or other uses.

NetBackup also allows backup administrators to exclude data contained within the swap and paging files of guest operating systems, leading to less data to backup and compress.

As a result, data deduplication in virtualization environments helps reclaim space and makes writing easier than removing data segments no longer required. MSDP has a patented process called rebase to simplify data cleaning and deduplicate data in cloud environments.

MSDP Storage Server

An MSDP storage server is an entity that writes data to storage and reads from it. One host is the storage server and must be a NetBackup media server with only one existing for each NetBackup deduplication node. Additionally, while the storage server component runs on a media server, it is a separate logical entity. Below are the functions of the MSDP storage server:

It receives the backups from clients and deduplicates the data
It receives deduplicated data from clients and other media servers
It allows configuration settings from NetBackup clients and other media servers to deduplicate data, meaning the storage server only receives data after deduplication
It manages data deduplication in storage
It writes and reads deduplicated data from the disk storage
It manages the data
deduplication process

The number of storage servers and nodes you configure depends on the storage requirements and whether or not you use optimized replication or duplication.

Data Deduplication with NetBackup Appliances

NetBackup and Virtual Appliances allow organizations to deploy MSDP services in a secure, flexible, scalable, and easy-to-manage way. A single NetBackup Appliance supports up to 960TB of deduplicated data, while a Virtual Appliance supports 250TB. Additionally, each NetBackup Media Server Appliance hosts deduplicated and non-deduplicated data.

NetBackup Appliance runs a single and secure operating system (OS) instead of multiple virtual machines (VMs) with different OSs. However, the latter solution is less secure because it increases the potential attack surface.

NetBackup Appliances provide security protection and intrusion detection capability through Role-Based Access Controls and Systemic Data Center Security (SDCS). They also include FIPS 140-2 validation at no additional cost.

Additionally, NetBackup Appliances provide rapid recovery speeds organizations need to restore at scale. The technology supports several concurrent recoveries without limitations or additional requirements like SSD.

Veritas also has a fully staffed team of engineers and performance experts who test and validate the performance of NetBackup Appliance versions.

The Bottom Line

As organizations expand their operations, managing large data volumes is crucial to ensure cost savings and efficiency. Data deduplication allows them to handle large data in the best possible way.

Veritas NetBackup Appliances are industry-leading technology solutions for data protection and deduplication. They also provide data encryption and compression capability in a high-performing secure, and scalable environment.

NetBackup Appliances with MSDP technology provide significant savings through minimized backup footprint and optimized data transfer rates. In addition, NetBackup virtual appliances extend MSDP services to the cloud and other virtual environments.

Veritas customers include 95% of the Fortune 100, and NetBackup™ is the #1 choice for enterprises looking to back up large amounts of data.

Learn how Veritas keeps your data fully protected across virtual, physical, cloud and legacy workloads with Data Protection Services for Enterprise Businesses.