Google introduced the MapReduce programming model to perform massively parallel processing of very large data sets using clusters of commodity hardware. MapReduce is a core Google technology and is key to maintaining Google's web search indexes.
The general MapReduce concept is simple: the "map" step partitions the data and distributes it to worker processes, which may run on remote hosts; the outputs of these parallel computations are then "reduced" into a merged result. MapReduce works well for the sort of aggregate operations typical of business intelligence (daily sales totals, for instance) as well as for operations, such as web search, that scan massive data sets.
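To make the model concrete, here is a minimal sketch of that daily-sales-totals aggregation written against the open source Hadoop Java API (discussed below). The input record layout (date,store,amount) and all class names are illustrative assumptions, not details from any particular system:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailySales {

  // Map: parse each sale record (e.g. "2009-08-01,store42,19.99")
  // and emit (date, amount). Input lines arrive as (offset, line) pairs.
  public static class SaleMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      context.write(new Text(fields[0]),                 // sale date
          new DoubleWritable(Double.parseDouble(fields[2])));
    }
  }

  // Reduce: sum all amounts that arrive for a given date into one total.
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text date, Iterable<DoubleWritable> amounts,
                          Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(date, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "daily sales totals");
    job.setJarByClass(DailySales.class);
    job.setMapperClass(SaleMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on each node before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the partitioning and distribution: mappers run wherever the input blocks live, and the shuffle routes all amounts for a given date to a single reducer for merging.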
Over the past couple of years, the emergence of a reasonably robust open source implementation of MapReduce, in the form of the Hadoop project, has provided ready access to MapReduce for the wider IT community and resulted in some notable MapReduce success stories outside of Google:
- Facebook now has more than 2 petabytes of data in Hadoop clusters, which form key components of its data warehousing solution.
- Yahoo has more than 25,000 nodes running Hadoop with data volumes of up to 1.5 petabytes. A recent Hadoop benchmark sorted 1 TB of data in just over 1 minute using 1,400 nodes.
- The well-known New York Times project that used the Amazon cloud to convert scanned images of older newspapers into PDFs did so using Hadoop.
Hadoop can be installed on your own hardware or deployed in the Amazon cloud using Amazon's Elastic MapReduce service. At least one company, Cloudera, provides commercial support and services around Hadoop.
While Hadoop can be used to perform analytic data processing, doing so requires far more programming expertise than SQL or BI tools demand. Consequently, there are active efforts to merge the familiar world of SQL with the new world of MapReduce.
Facebook developed, and has now open sourced, Hive, which provides a SQL-like interface to Hadoop. Hive supports many familiar SQL features, such as joins and GROUP BY aggregation, though it is not strictly ANSI SQL compatible. Recently, researchers at Yale University announced HadoopDB, which combines Hadoop, Hive, and PostgreSQL to support more highly structured data analytics.
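As a rough illustration of how familiar this can feel, the sketch below issues a HiveQL join-and-aggregate query through Hive's JDBC driver; behind the scenes, Hive compiles the query into MapReduce jobs. The table names (page_views, users), their columns, and the server address are hypothetical, and the driver class and connection URL assume a HiveServer2-style setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver and connect to a (hypothetical) local server.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // A join plus GROUP BY aggregation, written in ordinary SQL style;
    // Hive turns this into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT u.country, COUNT(*) AS views " +
        "FROM page_views v JOIN users u ON (v.user_id = u.id) " +
        "GROUP BY u.country");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```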
Vendors Greenplum and Aster both provide ways to merge SQL and MapReduce, each allowing MapReduce to process data held in its RDBMS-based data warehouse. Greenplum represents MapReduce routines as views within the relational database: SQL statements can query these views, invoking MapReduce and then applying further SQL processing to the MapReduce output. Greenplum also allows SQL queries to be defined as inputs to a MapReduce stream.
Aster's implementation embeds MapReduce functions into the database in a manner reminiscent of stored procedures; a SQL statement can invoke them through SQL language extensions.
Despite the enthusiasm for MapReduce, some doubt it is truly suitable for mainstream analytics. In particular, a group of RDBMS luminaries, including Michael Stonebraker of Postgres fame, has called MapReduce a "major step backwards" for the database community, arguing that it relies on brute force rather than optimization and re-implements many features long considered solved in the RDBMS world. Furthermore, they say, MapReduce is incompatible with existing BI tools and omits many essential features found in today's RDBMSs.
It's hard to argue against MapReduce as a significant technology, given that we indirectly exploit it with every Google search. While MapReduce and Hadoop probably won't replace the analytic RDBMS, I suspect they are going to play a major role in big data analytics for years to come.