The volume of business data under protection is growing rapidly, driven by the explosion of mobile computing, the use of powerful business applications that generate more data, and stringent regulations that require companies to retain data longer and keep it in a format that is readily available upon request. The problem of massive data growth is particularly acute in traditional, large, data-intensive enterprises that have become increasingly reliant on database-driven business automation systems such as Oracle, Microsoft SQL Server, and SAP. These organizations are also adopting a new wave of data-intensive applications to analyze and manage their "big data," further compounding the problem.
Data deduplication has emerged as a key technology for helping large, data-intensive enterprises control costs and improve efficiency in their data protection environments by reducing the volume of data they need to protect, store, and replicate for disaster recovery (DR). While a wide range of technologies is available to help small-to-mid-sized businesses (SMBs) and small-to-medium-sized enterprises (SMEs) protect data and control data growth, there are few options for very large enterprises. In these organizations, the massive volume of data to protect and the complexity of the data protection environment overwhelm most data deduplication and data protection technologies.
Large enterprises face a variety of challenges that their SME counterparts do not, including deduplication challenges specific to large databases, shorter and more critical windows for completing backups and replication, stringent regulatory requirements for data availability and restore times, and higher expectations for business continuity and disaster recovery protection. As both data volumes and reliance on these systems continue to grow, IT managers need to make important decisions about the deduplication and replication technologies and strategies they use to optimize their big data backup environments.
Big Data Backup Environments
There are several deduplication products and options to choose from. Some are limited to the smaller data sets typical of departments and small-to-medium-sized enterprises (SMEs); others are optimized for large enterprise use and big data backup environments. Data center managers need to weigh the strengths and drawbacks of each of these technologies to choose the solution that best meets their requirements. All deduplication technologies compare the data in each backup to previously seen data to identify duplicates. New data is stored, and duplicate data is replaced with a pointer to the baseline data set. Although all of these technologies perform the same basic task at a high level, significant differences exist among implementations. The following highlights some of these differences:
Source vs. Target Deduplication
Target deduplication is performed on the backup system (the target). Target-based deduplication can be performed inline, post process, or concurrently, and it is typically used to process larger data volumes.
Source deduplication is performed on a non-target system, typically the backup media server or individual client systems (desktops and servers), to reduce the amount of data sent to the backup target. The desktop and server systems may use the file system to locate only those files that have changed since the previous backup, or they may apply differencing or hashing techniques. Once the unique data is identified, only that data is sent to the target. This approach pushes less data to the target but can be expensive for the source because of frequent client-target negotiations, so source deduplication is typically used in smaller backup environments.
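As a rough illustration of the file-system approach, the sketch below compares each file's size and modification time against a manifest recorded during the previous backup. The manifest format and function names are assumptions made for illustration, not a description of any particular product.

```python
# Minimal sketch of source-side change detection, assuming a JSON manifest
# kept from the previous backup run; paths and format are illustrative.
import json
import os

def changed_files(root, manifest_path):
    """Return files under `root` whose size or mtime differs from the last backup."""
    try:
        with open(manifest_path) as f:
            previous = json.load(f)          # {path: [size, mtime]}
    except FileNotFoundError:
        previous = {}                        # first backup: everything is new

    changed, current = [], {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            current[path] = [stat.st_size, stat.st_mtime]
            if previous.get(path) != current[path]:
                changed.append(path)         # only these are sent to the target

    with open(manifest_path, "w") as f:
        json.dump(current, f)                # record state for the next run
    return changed
```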
Inline, Post Process, and Concurrent Deduplication
A key distinction among deduplication technologies is whether deduplication is performed before, after, or while the backup data is being written to the target. Inline deduplication is performed in memory before the data is written to the target. Deduplication may also be performed as a post process after the entire backup reaches the target disks, or concurrently while data is being ingested by the target.
Source deduplication is typically performed inline on the non-target system, while target deduplication may be performed inline, post process, or concurrently. In post-process deduplication, the goal is to back up and protect the data at the fastest possible speed, minimizing the time to safety. While this method improves backup time, it slows completion of deduplication, replication, and restore processes. In contrast, concurrent deduplication performs backup, deduplication, replication, and restore operations at the same time, load balancing them across many nodes. This approach results in a balance of time to safety, time to replication, and capacity reduction.
Hash vs. Content-Aware Deduplication
Deduplication schemes fall into two broad categories: hash-based and content-aware (differential).
Hash-based technologies compute and assign a unique identifier, called a hash, to each incoming segment of data and store the hashes in an index in memory. As each backup is performed, they compare the hashes of incoming data to those already stored in the index. If a hash already exists, the incoming data is replaced with a pointer to the previously stored segment. Deduplication technologies face the challenge of delivering more granular comparisons of data without slowing the backup process. In large enterprises with massive data volumes, the more closely data is examined, the more CPU resources are required to deduplicate it. Hash-based schemes are limited in that, with the large number of small changes found in large data volumes, the index grows very large, slowing the deduplication process.
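A minimal sketch of the hash-based approach, assuming fixed-size segments, SHA-256 hashes, and a simple in-memory dictionary as the index (real products use far more elaborate structures and storage layouts):

```python
# Minimal sketch of hash-based deduplication; segment size, hash choice,
# and storage layout are illustrative assumptions.
import hashlib

SEGMENT_SIZE = 16 * 1024      # inline systems rarely examine smaller segments
index = {}                    # hash -> location of the stored segment
store = []                    # stand-in for the backup target's disk

def ingest(stream):
    """Store new segments; represent duplicates as pointers into the index."""
    recipe = []               # pointers needed to reconstruct this backup
    while True:
        segment = stream.read(SEGMENT_SIZE)
        if not segment:
            break
        digest = hashlib.sha256(segment).hexdigest()
        if digest not in index:          # new data: write it and index it
            index[digest] = len(store)
            store.append(segment)
        recipe.append(index[digest])     # duplicate: keep only the pointer
    return recipe
```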
Inline deduplication technologies typically cannot afford to examine data in sections smaller than 16KB, because doing so would severely slow backup performance and dramatically increase the size of the hash table index. Also, to avoid outgrowing the hash table, hash-based systems impose capacity limits matched to the index size. Because inline systems cannot be load balanced across multiple processing nodes, the entire backup process has to wait for a single node to perform the hash calculation, assignment, and comparison before data is written to disk.
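A back-of-the-envelope calculation shows why segment size matters: assuming roughly 32 bytes per index entry and 100TB of protected data (both figures are illustrative assumptions, not vendor data), halving the segment size doubles the number of entries the index must hold and search.

```python
# Rough index sizing under assumed figures: ~32 bytes per entry, 100TB protected.
protected_bytes = 100 * 2**40            # 100 TB of backup data
entry_bytes = 32

for segment_kb in (16, 8):
    entries = protected_bytes // (segment_kb * 1024)
    print(f"{segment_kb}KB segments: {entries / 1e9:.1f}B entries, "
          f"~{entries * entry_bytes / 2**30:.0f} GiB of index")
# 16KB segments -> ~6.7 billion entries (~200 GiB of index)
# 8KB segments  -> ~13.4 billion entries (~400 GiB of index)
```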
This lack of granularity is particularly important when deduplicating databases, such as Oracle, SAP, and Exchange, that store data in segments of 8KB or smaller. For these critical applications, a large volume of duplicate data goes completely unidentified by inline solutions.
In contrast, content-aware deduplication technologies can load balance backup, deduplication, replication, and restore operations across multiple processing nodes. They also read metadata from the incoming data stream to efficiently identify data that is highly likely to be duplicate. This process enables the system to focus processing resources on the likely duplicates, allowing a more granular examination down to the byte level. As a result, content-aware deduplication is significantly more efficient at reducing the capacity consumed by the large database data sets found in big data backup environments.
Deduplication Rates
Although many deduplication vendors claim very high capacity reduction ratios, these ratios assume a one percent data change rate and a dataset with the typical mix of data types found in SMB companies. Note that a 5:1 ratio reduces storage needs by 80% and yields tremendous cost savings. However, a 10:1 ratio is not double the capacity reduction of a 5:1 ratio; it reduces storage needs by 90%, a difference of only ten percentage points and a minimal improvement. The bulk of the actual capacity savings is realized between ratios of 2:1 and 5:1.
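The arithmetic behind these figures is straightforward: capacity saved equals 1 - 1/ratio, so each increase in the ratio yields diminishing returns, as the short calculation below shows.

```python
# Worked arithmetic behind the ratios quoted above: capacity saved = 1 - 1/ratio.
for ratio in (2, 5, 10, 20):
    saved = 1 - 1 / ratio
    print(f"{ratio}:1 deduplication -> {saved:.0%} capacity reduction")
# 2:1 -> 50%, 5:1 -> 80%, 10:1 -> 90%, 20:1 -> 95%:
# each doubling of the ratio adds progressively less real savings.
```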
Some data types and application environments typically contain more duplicate copies than others, depending on a number of factors, such as:
- Data type: unstructured, semi-structured, and structured data.
  - Unstructured data files: file data such as Microsoft Word, PowerPoint, or Excel files created by office workers often contains redundant data and is frequently distributed or copied.
  - Semi-structured and structured data types: data created by specific business applications that keep day-to-day operations running. For example, Microsoft Exchange for email and Oracle for a transactional database clearly fall into the "must protect" category and require frequent backup.
- Frequency of data change: the less frequently data is modified, the greater the chance that copies of that data will contain duplicates. The data deduplication ratio will be higher when the change rate is lower. This also implies that a higher deduplication ratio should be expected as the percentage of reference data to total active data increases, because reference data does not change.
- Retention period: a longer retention period increases the likelihood that duplicate data will be found.
Matching the Deduplication Technology to the Requirement
While SMB and SME organizations have a wide range of deduplication options to choose from, large enterprises with massive data volumes need to choose their deduplication technology wisely (see Table 1). Inline deduplication (at the source or target) cannot provide sufficient time to safety, capacity reduction efficiency, or replication bandwidth optimization to be practical or cost-efficient at this scale. Large enterprises need content-aware deduplication technology that can move massive data volumes to safety quickly, perform deduplication without slowing backup, handle databases with changes in segments smaller than 8KB, and enable bandwidth-optimized replication.
About the Author:
Jeff Tofano is chief technology officer, SEPATON, Inc.