Another emerging scenario involves using Hadoop as a repository in which to persist the raw data that feeds a data warehouse. In traditional data integration, this data is often staged in a middle tier, which can consist of an ETL repository or an operational data store (ODS). On a per-gigabyte or per-terabyte basis, both the ETL and ODS stores are more expensive than Hadoop. In this scheme, some or all of this data could be shifted into Hadoop, where it could be used to (inexpensively) augment analytic discovery (which prefers denormalized or raw data) or to assist with data warehouse maintenance—e.g., in case dimensions are added or have to be rekeyed.
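The rekeying case above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL job: the record layout, field names (`cust_natural_key`, `cust_sk`), and the in-memory list standing in for the raw feed are all hypothetical—in practice the raw history would be replayed out of HDFS.

```python
# Sketch: rebuilding a rekeyed dimension by replaying raw records that were
# persisted (hypothetically) in Hadoop. All field names are illustrative.

raw_customers = [
    {"cust_natural_key": "ACME-001", "name": "Acme Corp", "region": "EMEA"},
    {"cust_natural_key": "GLOBX-07", "name": "Globex", "region": "APAC"},
]

def rekey_dimension(raw_rows, start_id=1):
    """Assign fresh surrogate keys by replaying the raw source feed."""
    dimension = []
    for surrogate_id, row in enumerate(raw_rows, start=start_id):
        rec = dict(row)              # keep the raw attributes intact
        rec["cust_sk"] = surrogate_id  # new surrogate key
        dimension.append(rec)
    return dimension

dim = rekey_dimension(raw_customers)
```

Because the full raw history is retained cheaply in Hadoop, the dimension can be rebuilt under new keys at any time instead of being patched in place in the warehouse.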
Still another use case involves offloading workloads from Hadoop to SQL analytic platforms. Some of these platforms are able to execute analytic algorithms inside their database engines. Some SQL DBMS vendors claim that an advanced analysis will run faster on their own MPP platforms than on Hadoop using MapReduce. They note that MapReduce is a brute-force data processing tool, and while it’s ideal for certain kinds of workloads, it’s far from ideal as a general-purpose compute engine. This is why so much Hadoop development work has focused on YARN—Yet Another Resource Negotiator—which will permit Hadoop to schedule, execute, and manage non-MapReduce jobs.

The benefits of doing so are manifold, especially from a data integration perspective. First, even though some ETL tools run in Hadoop and replace MapReduce with their own engines, Hadoop itself provides no native facility to schedule or manage non-MapReduce jobs. (Hadoop’s existing JobTracker and TaskTracker paradigm is tightly coupled to the MapReduce compute engine.) Second, YARN should permit users to run optimized analytic libraries—much like the SQL analytic database vendors do—in the Hadoop environment. This promises to be faster and more efficient than the status quo, which involves coding analytic workloads as MapReduce jobs. Third, YARN could help stem the flow of analytic workloads out of Hadoop, and even encourage workloads to be shifted from the SQL world into Hadoop. Even though it might be faster to run an analytic workload in an MPP database platform, it probably isn’t cheaper—relative, that is, to running the same workload in Hadoop.
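The “brute-force” character of MapReduce is easiest to see in miniature. The sketch below models the three fixed stages—map, shuffle, reduce—in plain Python (no Hadoop involved), using the classic word count. Every stage materializes its entire output before the next begins, which is exactly why iterative or multi-pass analytics coded as chains of MapReduce jobs tend to lag purpose-built engines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in documents for word in doc.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big hadoop", "hadoop data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "hadoop": 2}
```

Any analysis must be contorted into this map-then-reduce shape; an algorithm needing ten passes over the data becomes ten full jobs, each writing its intermediate results to disk—the inefficiency YARN-hosted alternative engines aim to avoid.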
Alternatives to Hadoop
But while big data is often discussed through the prism of Hadoop, owing to the popularity and prominence of that platform, alternatives abound. Among NoSQL platforms, for example, there’s Apache Cassandra, which is able to host and run Hadoop MapReduce workloads, and which—unlike Hadoop—is fault-tolerant. There’s also Spanner, Google’s successor to BigTable. Google runs its F1 DBMS—a SQL- and ACID-compliant database platform—on top of Spanner; the combination has already garnered the sobriquet “NewSQL.” (And F1, unlike Hadoop, can be used as a streaming database. Here and elsewhere, Hadoop’s file-based architecture is a significant constraint.) Remember, a primary contributor to Hadoop’s success is its cost—as an MPP storage and compute platform, Hadoop is significantly less expensive than existing alternatives. But Hadoop by itself isn’t ACID-compliant and doesn’t expose a native SQL interface. To the extent that technologies such as F1 address existing data management requirements, enable scalable parallel workload processing, and expose more intuitive programming interfaces, they could constitute compelling alternatives to Hadoop.
What’s Ahead in Big Data Integration
Big data, along with related technologies such as Hadoop and other NoSQL platforms, is just one of several destabilizing forces on the IT horizon. Other technologies are changing the practice of data integration as well—such as the shift to the cloud and the emergence of data virtualization.