In the beginning, the “data warehouse” was a concept that the database fraternity did not accept. From that humble beginning, the data warehouse has become conventional wisdom and a standard part of the infrastructure in most organizations. The data warehouse has become the foundation of corporate data. When an organization wants to look at data from a corporate perspective rather than an application perspective, the data warehouse is the tool of choice.
Data Warehousing and Business Intelligence
A data warehouse is the enabling foundation of business intelligence. The data warehouse and business intelligence are as closely linked as fish and water.
Spending on data warehousing and business intelligence long ago surpassed spending on transaction-based operational systems. Once, operational systems dominated the IT budget. Now, data warehousing and business intelligence dominate.
Through the years, data warehouses have grown in size and sophistication. Once, data warehouse capacity was measured in gigabytes. Today, many data warehouses are measured in terabytes. Once, single processors were sufficient to manage data warehouses. Today, parallel processors are the norm.
Today, also, most corporations understand the strategic significance of a data warehouse. Most corporations appreciate that being able to look at data uniformly across the corporation is an essential aspect of doing business.
But in many ways, the data warehouse is like a river. It is constantly moving, never standing still. The architecture of data warehouses has evolved with time. First, there was just the warehouse. Then, there was the corporate information factory (CIF). Then, there was DW 2.0. Now there is big data.
Enter Big Data
Continuing the architectural evolution is the newest technology—big data. Big data technology arrived on the scene in answer to the need to store and process very large amounts of data. There are several definitions of big data. The definition used here is the one typically heard in Silicon Valley.
Big data technology:
• Is capable of handling lots and lots of data
• Is capable of operating on inexpensive storage
Big data:
• Is managed by the “Roman census” method (see the sketch after this list)
• Resides in an unstructured format
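The “Roman census” method is generally understood to mean that the processing is sent to where the data already resides, and only small local results travel back to be combined centrally. The following sketch, with invented node names and record counts, illustrates that idea in a few lines of Python.

```python
# A minimal sketch of the "Roman census" idea, assuming three hypothetical nodes,
# each holding its own partition of the data. The work (a simple count) is shipped
# to each partition, and only the small per-node answers come back to be summed.

from typing import Callable, Dict, List

nodes: Dict[str, List[str]] = {
    "node-1": ["call log 1", "call log 2", "call log 3"],
    "node-2": ["email 1", "email 2"],
    "node-3": ["click 1", "click 2", "click 3", "click 4"],
}

def census(local_data: List[str]) -> int:
    """The 'census taker' runs where the data lives and returns only a count."""
    return len(local_data)

def run_census(partitions: Dict[str, List[str]],
               work: Callable[[List[str]], int]) -> int:
    # Ship the function to each partition; bring back only the small answers.
    local_results = {name: work(data) for name, data in partitions.items()}
    print("local results:", local_results)
    return sum(local_results.values())

if __name__ == "__main__":
    print("total records:", run_census(nodes, census))
```

The point of the pattern is that the bulky data never has to move; only the census takers and their tallies do.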
Organizations are finding that big data extends their capabilities well beyond their current horizon. With big data technology, organizations can capture, search, and analyze volumes of data that would never have fit in the standard DBMS environment. As such, big data technology extends the reach of data warehousing as well.
Fundamental Challenges with Big Data
But with big data come some fundamental challenges. The biggest challenge is that big data cannot be analyzed using standard analytical software. Standard analytical software assumes that data is organized into standard fields, columns, rows, keys, indexes, and so forth. This classical DBMS structuring of data provides context for the data, and analytical software depends heavily on that context. Stated differently, if standard analytical software does not find the context it assumes is there, the software simply does not work.
Therefore, without context, unstructured data cannot be analyzed by standard analytical software. If big data is to fulfill its destiny, there must be a means of analyzing the data once it has been captured.
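To make the point concrete, here is a small illustration in Python. The structured records (with invented field names and values) carry named columns, so ordinary analytical code knows exactly what to compute; the raw text beneath them holds the same facts but offers no fields, keys, or indexes for a program to work against.

```python
# Structured data: context (column names, types) is built in, so a standard
# analytical operation such as summing a column is trivial.
structured_sales = [
    {"customer": "C-101", "amount": 250.00, "date": "2014-03-01"},
    {"customer": "C-102", "amount": 75.50,  "date": "2014-03-02"},
    {"customer": "C-101", "amount": 120.00, "date": "2014-03-05"},
]
total = sum(row["amount"] for row in structured_sales)
print(f"total sales: {total:.2f}")

# Unstructured data: the same facts as free text. Nothing here tells a program
# which token is a customer, an amount, or a date, so there is nothing to sum
# until context is somehow derived.
raw_text = (
    "Customer C-101 called on March 1st and spent $250; the next day C-102 "
    "spent $75.50, and C-101 came back on the 5th for another $120."
)
print("words in the unstructured version:", len(raw_text.split()))
```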
Determining Context for Unstructured Data
There have been several earlier attempts to analyze unstructured data. Each of the attempts has its own major weakness. The previous attempts to analyze unstructured data include:
1. NLP—natural language processing. NLP is intuitively appealing. But the flaw in NLP is that it assumes context can be determined by examining the text itself. The problem with this assumption is that most context is nonverbal and never finds its way into any form of text.
2. Data scientists. The problem with throwing a data scientist at the problem of analyzing unstructured data is that the world has only a finite supply of such scientists. Even if the universities of the world started to turn out data scientists in droves, the demand for data scientists everywhere there is big data would far outstrip the supply.
3. MapReduce. The leading big data technology—Hadoop—includes a processing framework called MapReduce. With MapReduce, you can create and manage unstructured data to the nth degree. But the problem with MapReduce is that it requires very technical coding in order to be implemented; the sketch following this list gives a sense of that style. In many ways, programming in MapReduce is like coding in Assembler. Thousands and thousands of lines of custom code are required. Furthermore, as business functionality changes, those thousands of lines of code need to be maintained. And no organization likes to be stuck with ongoing maintenance of thousands of lines of detailed, technical custom code.
4. MapReduce on steroids. Organizations have recognized that creating thousands of lines of custom code is no real solution. Instead, technology has been developed that accomplishes the same thing as MapReduce except that the code is written much more efficiently. But even here there are some basic problems. The MapReduce on steroids approach is still written for the technician, not the business person. And the raw data found in big data is essentially missing context.
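To give a feel for the low-level coding referred to in item 3, here is a sketch of the classic word-count job written in the style of a Hadoop Streaming mapper and reducer. Python stands in for the longer Java version; the file name wordcount.py and the local pipeline shown in the comment are illustrative only, and the real cluster configuration and job submission are omitted.

```python
#!/usr/bin/env python
# Word count in the style of a Hadoop Streaming job. A rough local simulation:
#   cat some_text.txt | python wordcount.py map | sort | python wordcount.py reduce

import sys

def mapper() -> None:
    # Emit one "word<TAB>1" line per word; the framework shuffles and sorts them.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reducer() -> None:
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word == current_word:
            current_count += int(count or 0)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count or 0)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```

Even this toy job is mostly plumbing: splitting input, emitting key-value pairs, and re-aggregating sorted output. Real business logic multiplies that plumbing many times over, which is exactly the maintenance burden described above.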