Many enterprises handle large data volumes with traditional databases and data warehouses, so timely data access can become a major issue, especially when customers demand quick response times. Transitioning an enterprise’s data systems to improve response time is challenging and requires a good understanding of current technology solutions. Let’s examine the challenge of processing data reliably in real time and meeting customers’ expectations for quick responses.
Nature of Data
An enterprise needs to understand the nature of its data. Each facet of data has a corresponding technology that handles it in an enterprise computing environment.
Value of Data
Data is no longer stationary: it exists on a time continuum, moving from function to function within an organization. Data is processed and analyzed differently depending on its age, that is, the time elapsed since its creation. In the Figure 1 chart below, time is the horizontal axis and data value is the vertical axis.
Just after data is created, it carries high value. As time goes by, the data is kept in storage for retrieval at a future point in time. As data ages, its value does not diminish, but the nature of that value changes. Enterprises have found countless ways to gain valuable insights (trends, patterns, anomalies) from data held over long timelines. Business intelligence reports are often based on data held over time. The individual data item, though useful in its own right, now becomes valuable when “aggregated” with others of like kind. Additionally, we are seeing increasing use of data science, meaning applications that explore data for deeper insights, not just observing trends but discovering them. This is exploratory analytics.
Data Storage
Data is physically stored in three forms, depending on the technologies an enterprise uses. The first is in-memory storage on DRAM chips attached to the CPU of a local computing node. Response times are in the range of nanoseconds, the fastest available with current technology. The second is storage on hard disk media (persistent even when power fails), whose access times are typically in the range of milliseconds or more. The third is distributed storage, in which the data may reside on a computing node far removed from the local node. The data may be in memory or on hard disk on the remote node, but either way it must traverse a network before it is available locally. Furthermore, distributed data can be relocated anywhere on the network at any time.
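To make the gap between these three forms concrete, the following minimal sketch does the arithmetic for a batch of sequential accesses. The specific latency figures are illustrative, order-of-magnitude assumptions (the text above gives only "nanoseconds" and "milliseconds"), and the network round-trip figure in particular is an assumption that varies widely between deployments. The names `LATENCY_SECONDS` and `estimated_access_time` are hypothetical.

```python
# Order-of-magnitude access latencies in seconds.
# All three figures are illustrative assumptions, not measurements.
LATENCY_SECONDS = {
    "in_memory": 100e-9,    # roughly 100 ns for a DRAM access
    "disk": 5e-3,           # roughly 5 ms for a hard disk access
    "network_hop": 0.5e-3,  # assumed 0.5 ms round trip to a remote node
}


def estimated_access_time(tier: str, remote: bool = False, accesses: int = 1) -> float:
    """Estimate the total time for sequential, uncached accesses to a storage tier.

    tier is "in_memory" or "disk"; remote=True adds one network round trip
    per access to model data held on a distributed node.
    """
    per_access = LATENCY_SECONDS[tier]
    if remote:
        per_access += LATENCY_SECONDS["network_hop"]
    return per_access * accesses


# Compare one million accesses across the storage forms.
for label, tier, remote in [
    ("local memory", "in_memory", False),
    ("local disk", "disk", False),
    ("remote memory", "in_memory", True),
    ("remote disk", "disk", True),
]:
    seconds = estimated_access_time(tier, remote=remote, accesses=1_000_000)
    print(f"{label:>13}: {seconds:,.3f} s")
```

Under these assumed figures, remote in-memory access is dominated by the network hop rather than by DRAM, which is why where the data lives matters as much as the medium it is stored on.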
Data Size and Time
The three forms of data storage (in-memory, persistent, distributed) have implications for the speed and elapsed time of data manipulation. Small data sets can often fit entirely in DRAM. Because memory capacity is small relative to disk-based storage, typically only a subset of a larger data set can reside in memory. When data is in memory, access is expected to be very quick; in fact, it is the fastest of the three forms.
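One common way to keep only a subset of the data in memory is a least-recently-used (LRU) cache placed in front of slower storage. The sketch below is a minimal illustration of that idea, not a description of any particular product: the class name `SmallMemoryCache` is hypothetical, and a plain Python dictionary stands in for disk-based storage.

```python
from collections import OrderedDict


class SmallMemoryCache:
    """Keeps only the most recently used records in memory (LRU eviction)."""

    def __init__(self, backing_store: dict, capacity: int):
        self.backing_store = backing_store  # stand-in for disk-based storage
        self.capacity = capacity            # maximum records held in memory
        self.in_memory = OrderedDict()      # ordered oldest to most recent
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.in_memory:
            # Fast path: the record is already in memory.
            self.in_memory.move_to_end(key)
            self.hits += 1
            return self.in_memory[key]
        # Slow path: fetch from the backing store and keep a copy in memory.
        self.misses += 1
        value = self.backing_store[key]
        self.in_memory[key] = value
        if len(self.in_memory) > self.capacity:
            # Evict the least recently used record to stay within capacity.
            self.in_memory.popitem(last=False)
        return value


# 1,000 records on "disk", room for only 100 in memory.
disk = {f"customer-{i}": {"id": i} for i in range(1000)}
cache = SmallMemoryCache(disk, capacity=100)

# Repeatedly read a small working set of 50 records.
for _ in range(3):
    for i in range(50):
        cache.get(f"customer-{i}")

print(f"hits={cache.hits} misses={cache.misses}")
```

When the working set fits within the cache capacity, only the first read of each record pays the slow-storage cost; later reads are served from memory.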
Larger data sets must be stored on disk and are therefore relatively slow to access and manipulate (milliseconds at best, versus nanoseconds). When large data is involved, a combination of memory and disk is used: a portion of the data is brought from disk into memory, manipulated, and then persisted back to hard disk. The expected times are slower than for purely in-memory data because fetching and managing the data is time-consuming.
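The load, manipulate, and persist cycle described above can be sketched as a loop over fixed-size chunks of a file. This is a minimal illustration under stated assumptions: the record format (one 64-bit integer per record) and the function names `process_in_chunks`, `write_records`, and `read_records` are all hypothetical, and a real system would add buffering, error handling, and crash safety.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # assumed format: one signed 64-bit integer per record


def write_records(path, values):
    """Persist a sequence of integers to disk as fixed-size records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_records(path):
    """Read every record back from disk (only sensible for small files)."""
    with open(path, "rb") as f:
        return [v for (v,) in RECORD.iter_unpack(f.read())]


def process_in_chunks(path, chunk_records, transform):
    """Bring one chunk at a time into memory, transform it, and write it back in place."""
    chunk_bytes = chunk_records * RECORD.size
    with open(path, "r+b") as f:
        while True:
            offset = f.tell()
            raw = f.read(chunk_bytes)       # disk -> memory
            if not raw:
                break
            updated = [transform(v) for (v,) in RECORD.iter_unpack(raw)]
            f.seek(offset)                  # return to the start of this chunk
            f.write(b"".join(RECORD.pack(v) for v in updated))  # memory -> disk
            f.seek(offset + len(raw))       # continue with the next chunk


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, range(10))
    # Only 4 records are held in memory at any one time.
    process_in_chunks(path, chunk_records=4, transform=lambda v: v * 2)
    print(read_records(path))
```

The chunk size is the tuning knob: larger chunks mean fewer disk round trips but more memory in use at once.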