Operational databases continue to expand; database sizes are growing in most organizations. There are several reasons for this growth. The primary reason, of course, is that more and more data is being generated every year. Analysts at IDC estimate that the global datasphere will grow to reach 175 zettabytes by 2025.[1]
But data growth is exacerbated by increasing governmental and industry regulations that dictate, among other things, that we must preserve data for longer periods of time. So we are stuck managing more data for longer, which can complicate matters as we attempt to wring performance and usability out of the applications that access our databases.
Indeed, expanding data volumes impact operational databases in two ways:
- Additional data stresses transaction processing by slowing things down.
- Database administration tasks take longer and require longer outages.
In terms of performance, the more data in the operational database, the less efficient transactions running against that database tend to be. This is so for many reasons. Table scans must reference more pages of data to return a result. Indexes grow in size to support larger data volumes, causing access by the index to degrade because there are more levels to traverse to return an answer. Such performance impacts cause many companies to seek solutions that offload older data to archive data stores—more on that in a moment.
The other impact, database administration complexity, causes longer processing times and outages to perform traditional DBA tasks. Functions such as backups, unloads, loads, reorganizations, recoveries, and disaster recoveries all take longer to perform as the database increases in size. And these tasks can cause outages … outages that lengthen as the database grows. In many cases these extended outages have become unacceptable. When you can no longer satisfy agreed-upon service levels (e.g., time to recovery), you need to find ways to reduce the bulk of your operational databases.
But as important as these operational performance and administration issues are, frequently they are ancillary to the regulatory issue of preserving authentic data over time. Although both the desire for better performance and the need to comply with laws are driving the need to move data from the operational database, it is the legal requirements that are likely to have the biggest impact in terms of data volume expansion. Nonetheless, the operational issues are real and need to be managed.
One approach to handling the data growth problem is database archiving. Database archiving is the process of removing selected data records that are not expected to be referenced again from operational databases and storing them in an archive data store, from which they can be retrieved if needed.
By archiving data when it is no longer needed for operational purposes, administrators can relieve the pressure on their production databases while at the same time preserving the data that is no longer accessed by business transactions and reports. Because the data is preserved in the archive it can be easily retrieved if and when it is needed, for example in the e-discovery phase of a lawsuit or during an audit for compliance with governmental regulations (such as GDPR or HIPAA, to name but two of the many regulations that demand data retention).
When database archive policies are in place, older data that is no longer accessed is periodically archived out of the operational database. This has the effect of minimizing the impact of data growth, at least in terms of database and transaction performance. Yes, the archived data is still stored, so we are not likely to reduce the total amount of disk storage needed, but the data is stored separately from the operational database and no longer requires the DBMS. Archived data is separate and independent from the production systems from which it was moved, so business transactions and DBA tasks can be performed more efficiently and with fewer, shorter outages. Furthermore, archived data can be stored on cheaper, slower storage because it rarely needs to be accessed.
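At its core, a policy like the one above amounts to a periodic copy-then-delete of rows older than a retention cutoff, performed in a single transaction so the move is all-or-nothing. Here is a minimal sketch in Python using SQLite; the `orders` table, the `order_date` column, and the separate archive database file are all hypothetical stand-ins for whatever operational schema and archive data store an organization actually uses.

```python
import sqlite3

def archive_old_rows(operational_db, archive_db, cutoff_date):
    """Move rows older than cutoff_date from the operational database
    to a separate archive database file. Returns the number of rows moved.

    The table and column names here are illustrative, not prescriptive.
    """
    con = sqlite3.connect(operational_db)
    # Attach the archive store so we can copy across databases in one step.
    con.execute("ATTACH DATABASE ? AS archive", (archive_db,))
    # Create the archive table with the same columns if it does not exist
    # (the WHERE 0 clause copies the schema but no rows).
    con.execute(
        "CREATE TABLE IF NOT EXISTS archive.orders AS "
        "SELECT * FROM main.orders WHERE 0"
    )
    # One transaction: copy the old rows, then delete them from the
    # operational table, so a failure leaves both databases consistent.
    with con:
        con.execute(
            "INSERT INTO archive.orders "
            "SELECT * FROM main.orders WHERE order_date < ?",
            (cutoff_date,),
        )
        cur = con.execute(
            "DELETE FROM main.orders WHERE order_date < ?",
            (cutoff_date,),
        )
    con.close()
    return cur.rowcount
```

A real archiving product would add safeguards this sketch omits (verifying the copy before deleting, handling referential integrity across related tables, and recording what was archived and when), but the transactional copy-then-delete pattern is the essence of the operation.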
[1] www.i-scoop.eu/big-data-action-value-context/data-age-2025-datasphere