When database administrators swap war stories, they are likely to relate similar tales about the woes of managing time-series data that may include clogs, jams, and general inefficiency. Why the ubiquitous complaints? Because a standard, relational database is not equipped to handle the rigorous demands this kind of data dishes out to its handlers.
A “relational” database, which maintains a set of separate, related files that are combined when required, is an extremely flexible solution that allows users to store and retrieve virtually any type of information. In order to handle the sheer speed and volume often involved in time-series management, however, the data manager must shed the “one-size-fits-all” mentality and look to a solution that optimizes its storage, manipulation, and retrieval algorithms for this type of data.
Time-series data, meaning any value with a time and date attached to it, is commonly used to provide vital information for analysis in a variety of industries. For example, power plants, companies that use heavy machinery, and fleet operators may employ devices to collect time-series information to monitor equipment stress levels in order to schedule preventative maintenance. For clinicians and researchers within the healthcare profession, the same type of monitoring of time-series data may yield critical intelligence for more timely and accurate diagnoses as well as more effective monitoring of the progression and treatment of disease. Additionally, financial organizations commonly collect and analyze this type of data to uncover stock market patterns and trends.
As useful as it may be, however, time-series data can be problematic for several reasons. First, it is often collected at an extremely rapid pace. In such tasks as the monitoring of critical equipment, for example, information is relayed at fraction-of-a-second intervals, which may lead to clogs and jams during the input process.
Second, time-series data is usually collected in massive quantities. Such activities as monitoring stress levels in a power plant or observing stock market activity often involve collecting terabytes of data from thousands of different sources simultaneously to gain a complete picture for analysis. When faced with such high-volume demands, most relational databases eventually run out of room, leaving managers scrambling for alternate solutions. Furthermore, due to technological developments that allow sensors attached to equipment to relay massive amounts of data, we are currently seeing a lag between the ability to generate and the ability to manage time-series information. In essence, there is more data available than standard database solutions can effectively handle.
Other common issues arise because relational database solutions store time-series entries in tables with user-specified indexes. Without an index on the timestamp, the data is useless for practical purposes, yet maintaining that index as new entries arrive can degrade the performance of the solution.
To address these problems, many vendors are making minor modifications to relational databases and passing them off to their clients as time-series solutions. Data managers dealing with large amounts of time-series information usually realize that this practice is akin to putting snow tires on a Honda Accord and trying to pass it off as a Hummer. Eventually, and usually sooner rather than later, these “souped-up” solutions lose their steam and begin to clog and jam, and sometimes just plain give out.
To avoid the pitfalls and generally keep pace, data managers should be aware of the key characteristics of an effective data management solution for time-series, as well as the unique features of this type of information that give rise to opportunities for optimization.
Time-Series Shortcuts
Because time-series data is inherently regular, that regularity can be exploited for efficiency’s sake in unique ways. For example, it is not necessary to actually store every date associated with the values in the database. When values arrive at a known, regular interval, the database can dynamically infer the date for any position where none is stored, for instance from a starting time and the position of the value in the series. This reduces the I/O requirements on the server as well as the overall disk space required for loaded data. The use of descriptive information can also be minimized, so that any time series can be identified and extracted without excessive overhead in describing each data point.
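To make the idea of inferred dates concrete, here is a minimal sketch in Python. It assumes a strictly regular sampling interval; the class name RegularSeries and its methods are hypothetical and do not describe any particular product.

```python
from datetime import datetime, timedelta


class RegularSeries:
    """Regularly spaced series that keeps one start time, one interval, and the values."""

    def __init__(self, start, interval, values):
        self.start = start          # timestamp of the first sample
        self.interval = interval    # fixed spacing between samples (assumed constant)
        self.values = list(values)  # the only per-sample data that is stored

    def timestamp_at(self, index):
        # Infer the timestamp from the position instead of reading a stored one.
        return self.start + index * self.interval

    def index_at(self, when):
        # Floor division selects the sample taken at or just before `when`.
        return (when - self.start) // self.interval

    def value_at(self, when):
        index = self.index_at(when)
        if not 0 <= index < len(self.values):
            raise IndexError("timestamp is outside the stored series")
        return self.values[index]


series = RegularSeries(
    start=datetime(2024, 1, 1, 0, 0, 0),
    interval=timedelta(milliseconds=250),
    values=[10.0, 10.5, 11.0, 10.75],
)
third_sample_time = series.timestamp_at(2)
```

Four values are stored here, but only one timestamp. The saving grows with the length of the series, at the cost of requiring a separate mechanism for gaps or irregular samples.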
Additionally, lossless compression can be utilized to facilitate a high level of data compaction for an unimpeded flow of data. This is achieved through a combination of block compression and careful management of the internal structure information.
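As a rough illustration of lossless block compression, the sketch below packs values as 64-bit floats, compresses each fixed-size block separately with Python's standard zlib module, and reads a single value back by decompressing only the block that holds it. The block size and function names are assumptions made for this example; a production system would likely use a codec tuned for time-series.

```python
import struct
import zlib

BLOCK_SIZE = 1024  # values per block; an arbitrary choice for this sketch


def compress_blocks(values, block_size=BLOCK_SIZE):
    """Pack values as little-endian 64-bit floats and compress each block separately."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        raw = struct.pack(f"<{len(chunk)}d", *chunk)
        blocks.append(zlib.compress(raw))
    return blocks


def read_value(blocks, index, block_size=BLOCK_SIZE):
    """Decompress only the block that contains the requested position."""
    block_no, offset = divmod(index, block_size)
    raw = zlib.decompress(blocks[block_no])
    return struct.unpack_from("<d", raw, offset * 8)[0]


# Synthetic, repetitive readings used only to exercise the functions.
values = [20.0 + (i % 10) * 0.5 for i in range(5000)]
blocks = compress_blocks(values)
raw_bytes = len(values) * 8
packed_bytes = sum(len(b) for b in blocks)
```

Compressing in blocks rather than as one stream is the design choice that matters: a reader can reach any value without decompressing the whole series, and the only internal structure it needs is the block size.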
The Retrieval Process
Collected time-series data must be quickly retrievable in order for it to be of analytical use. The same problems encountered while inputting time-series data into a relational database reappear during the retrieval process. In answer to this problem, memory mapping can be used to access the information, limiting the amount of work the system has to do when retrieving data and improving speed and flexibility. Because time-series data is temporal in nature, temporal logic can be used to access data by value. Thus the time-series database can act much like content-addressable storage, using the properties of “value” and “time.”
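The memory-mapping approach can be sketched with Python's standard mmap module. The file layout (a flat run of little-endian 64-bit floats), the file name, and the read_range helper are assumptions made for this example only.

```python
import mmap
import os
import struct
import tempfile

VALUE_SIZE = 8  # bytes per 64-bit float

# Write a small sample file of packed values (a stand-in for a real data file).
values = [float(i) * 0.5 for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "series.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))


def read_range(path, first, count):
    """Read `count` values starting at position `first` through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The operating system pages in only the parts of the file that are touched.
            return list(struct.unpack_from(f"<{count}d", mm, first * VALUE_SIZE))


window = read_range(path, 500, 3)
```

Because each value sits at a fixed offset, a position derived from a timestamp translates directly into a byte offset in the map, and no index lookup is needed to reach it.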
A query engine with time-logic support can answer questions that may be difficult and error-prone to express in systems that lack it. Without this ability, a time-series solution will often require data to be transported out of the solution for processing and back in again.
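The article does not define "time logic" precisely. One plausible reading is a query that combines a condition on value with a condition on duration, such as "when did the reading stay above a threshold for at least three seconds?" The Python sketch below answers that question for a regularly sampled series. The function name and the convention that each sample covers one full interval are assumptions of this example.

```python
from datetime import datetime, timedelta


def intervals_above(start, interval, values, threshold, min_duration):
    """Return (begin, end) pairs where values stay above threshold for at least min_duration.

    Each sample is treated as covering [its timestamp, its timestamp + interval),
    so `end` is the timestamp of the first sample that is no longer above threshold.
    """
    results = []
    run_start = None
    for i, v in enumerate(list(values) + [None]):  # the sentinel closes a trailing run
        above = v is not None and v > threshold
        if above and run_start is None:
            run_start = i
        elif not above and run_start is not None:
            begin = start + run_start * interval
            end = start + i * interval
            if end - begin >= min_duration:
                results.append((begin, end))
            run_start = None
    return results


hot = intervals_above(
    start=datetime(2024, 1, 1),
    interval=timedelta(seconds=1),
    values=[1, 5, 6, 7, 2, 8, 1],
    threshold=4,
    min_duration=timedelta(seconds=3),
)
```

In this example the run of 5, 6, 7 qualifies and the isolated 8 does not. Running this kind of logic inside the store avoids exporting the raw samples simply to scan them elsewhere.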
Additional Challenges
When implementing a time-series solution, here are a few factors that should be kept in mind:
Even with a database that is optimized for time-series data, the sheer volume of data that may be loaded can require capacity in the range of hundreds of terabytes. For some applications, petabyte capacity may even be required. Data managers dealing with high volumes should therefore look to solutions utilizing 64-bit executable binaries.
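A back-of-the-envelope calculation shows how quickly such volumes build up. Every input below (sensor count, sampling rate, bytes per value) is a hypothetical figure chosen for illustration, not a measurement from any real installation.

```python
# Hypothetical workload: all three inputs are illustrative assumptions.
sensors = 10_000          # number of data sources
samples_per_second = 100  # readings per source per second
bytes_per_value = 8       # one 64-bit float per reading, timestamps not stored

bytes_per_second = sensors * samples_per_second * bytes_per_value
bytes_per_day = bytes_per_second * 86_400
bytes_per_year = bytes_per_day * 365

TB = 10**12
terabytes_per_year = bytes_per_year / TB

# A 32-bit unsigned byte offset can address at most 2**32 bytes (4 GiB).
max_32bit_offset = 2**32
seconds_to_fill = max_32bit_offset / bytes_per_second
```

Under these assumptions the raw values alone exceed 250 terabytes per year, and the range of a 32-bit offset is used up in roughly nine minutes of collection, which is one way to see why 64-bit addressing matters at this scale.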
It is also worth mentioning that large file systems are not a complete solution to the management problems of time-series data. Look for a solution that complements relational model databases. A solution that captures data in increments must be able to provide the data back to the application with orders-of-magnitude optimization. When this is not in place, the data cannot be read “fast enough” to be relevant to the problem.
In summary, time-series should be viewed as a unique category within the family of data, with its own set of challenges. Data managers should be aware that while its volume and speed requirements may seem overwhelming, time-series also presents unique opportunities for optimization because of its numerical nature. Thus, an informed manager with the right set of tools can transform what would otherwise be an unwieldy set of data into actionable intelligence for their organization.