An issue that tends to get glossed over is that of Hadoop's efficacy as a data management platform. "Managing" data isn't simply a question of ingesting and storing it; it is also, and to a much greater extent, a question of retrieving just the right data, of preparing it in just the right format, and of delivering it at more or less the right time. In other words, not only are big data tools less productive than those of traditional BI and decision support, but big data management platforms are themselves comparatively immature, too. Generally speaking, they lack support for key database features and for core transaction-processing concepts, such as ACID integrity. The simple reason for this is that many of these platforms either aren't databases at all or deliberately eschew conventional DBMS reliability and concurrency features in order to address scaling-specific or application-specific requirements. The upshot, then, is that the human focus of data integration is shifting, and will continue to shift, to Hadoop and other big data platforms, not least because these platforms tend to require considerable human oversight and intervention.
This doesn't mean that data, applications, and other resources are shifting or will shift to big data platforms, never to return or to be recirculated. For one thing, there's the cloud, which is having no less profound an impact on data integration and data management. Data must be vectored from big data platforms (in the cloud or on-premises) to other big data platforms (in the cloud or on-premises), to the cloud in general, i.e., to SaaS, platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) resources, and, last but not least, to good old on-premises resources such as applications and databases.
There's no shortage of data exchange formats for integrating data in this context, with JSON and XML foremost among them, but the venerable SQL language will continue to be an important and even a preferred mechanism for data integration in on-premises, big data, and even cloud environments. The reasons for this are many. First, SQL is an extremely efficient and productive language: According to a tally compiled by Andrew Binstock, editor-in-chief of Dr. Dobb's Journal, SQL trails only legacy languages such as ASP and Visual Basic (at numbers 1 and 2, respectively) and Java (at number 3) in productivity. (Binstock based his tally on data sourced from the International Software Benchmarking Standards Group, or ISBSG, which maintains a database of more than 6,000 software projects.) Second, there's a surfeit of available SQL query interfaces and adapters, along with (to a lesser extent) a ready supply of SQL-savvy coders. Third, open source software (OSS) and proprietary vendors have expended a simply shocking amount of effort to develop ANSI-SQL-on-Hadoop technologies. This is a very good thing, chiefly because SQL is arguably the single most promising tool for getting the right data in the right format out of Hadoop.
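To put that last point in concrete terms, the sketch below shows how a SQL-on-Hadoop engine such as Apache Hive (or a compatible engine such as Spark SQL) might be used to get the right data in the right format out of files sitting in HDFS. The table, columns, and file path used here (clickstream_raw, page_url, /data/landing/clickstream/, and so on) are hypothetical, invented purely for illustration.

    -- Expose raw, comma-delimited click data already landed in HDFS as a queryable table
    -- (hypothetical schema and path, for illustration only).
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_raw (
        event_time  TIMESTAMP,
        user_id     STRING,
        page_url    STRING,
        referrer    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/landing/clickstream/';

    -- Retrieve just the right data in just the right format: daily click counts per page,
    -- shaped for delivery to a downstream BI tool, cloud service, or on-premises database.
    SELECT
        to_date(event_time) AS click_date,
        page_url,
        COUNT(*)            AS clicks
    FROM clickstream_raw
    WHERE event_time >= '2014-01-01'
    GROUP BY to_date(event_time), page_url
    ORDER BY click_date, clicks DESC;

The appeal is that the same declarative, ANSI-style query an analyst would write against a relational database also does the retrieval, shaping, and aggregation work against raw files in Hadoop.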