While Hadoop is officially 15 years old as an Apache project, it only gained mainstream IT attention 10 years ago. Hadoop started as an open source implementation of key Google technologies for storing and processing huge amounts of data on large numbers of commodity servers. The MapReduce parallel processing algorithm allowed work to be parallelized across thousands of servers, while the Google File System (GFS) allowed the disks on those servers to be addressed as a single logical file store. Hadoop 1.0 essentially paired an open source counterpart to GFS, the Hadoop Distributed File System (HDFS), with an implementation of MapReduce.
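To make the MapReduce model concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read stdin and write tab-separated key/value lines to stdout. The script name and the map/reduce switch are illustrative assumptions, not anything from the article; with Hadoop Streaming, such scripts are typically supplied to the streaming jar via its -mapper and -reducer options.

```python
#!/usr/bin/env python3
# wordcount.py - a minimal sketch of MapReduce word count (Hadoop Streaming style).
# The mapper emits "word<TAB>1" for every word; the reducer sums counts per word,
# relying on Hadoop's shuffle/sort phase to group identical keys onto adjacent lines.
import sys

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Run as "wordcount.py map" or "wordcount.py reduce".
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```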
The Promise
Hadoop was battle-hardened at Yahoo and, in 2008, Cloudera became the first dedicated Hadoop company, followed by MapR in 2009 and Hortonworks in 2011. Hadoop almost instantly gained favor among Fortune 500 companies, which were attracted by the idea that “big data” could generate a competitive advantage.
Big data is a somewhat nebulous term, but “more data to more effect” sums it up nicely. Companies realized that data generated from the web and social media could be used to create an improved and personalized user experience, which could drive greater adoption and generate even more data. This big data positive feedback loop became a critical success factor for companies competing on the web.
Big data required new technology for the economical storage and processing of masses of loosely structured data. Hadoop seemed custom-made for the task. The concept of the “data lake”—a vast reservoir of priceless company data—was born.
The Reality
Unfortunately, while Hadoop provided an economical platform for big data storage, it offered very little in the way of analytic capabilities. At the biggest companies, elite data scientists crafted sophisticated machine learning models on top of Hadoop. But in the mid-market, these skills were in short supply. Consequently, many Hadoop data lakes became “data swamps” full of stale and unused data.
The data lake was typically an on-premises deployment. As companies migrated their assets to the cloud, they generally found alternatives to both the Hadoop storage layer (HDFS) and the Hadoop processing engine.
New Alternatives
Over time, Apache Spark increasingly became the compute layer of choice for big data analytics. Spark is somewhat Hadoop-inspired, but it is built on in-memory computation rather than disk-based brute force, which makes it far better suited to iterative machine learning and analytic workloads.
Spark is agnostic about the underlying storage platform, though it works very well with HDFS. Cloud vendors, however, offered object storage that was far more cost-effective than HDFS. Consequently, as workloads moved into the cloud, they tended to move off HDFS and onto cloud object stores such as Amazon S3 or Azure Blob Storage.
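As a minimal PySpark sketch of both points, the snippet below reads a dataset from object storage rather than HDFS simply by changing the URI scheme, then caches an intermediate result in memory for repeated use. The bucket, paths, and column names are assumptions for illustration, and reading s3a:// paths presumes the S3 connector (hadoop-aws) and credentials are configured.

```python
# A sketch, not a production job: the same Spark code runs against HDFS or
# object storage, with an in-memory cache for iterative reuse.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-demo").getOrCreate()

# Swapping the storage layer is largely a matter of changing the URI scheme:
# "hdfs://..." for an on-premises cluster, "s3a://..." for Amazon S3.
events = spark.read.parquet("s3a://example-bucket/events/")  # hypothetical bucket

# cache() keeps the filtered DataFrame in executor memory, so the repeated
# passes below avoid re-reading from storage on every iteration.
recent = events.filter(F.col("event_date") >= "2021-01-01").cache()

for day in ["2021-01-01", "2021-01-02", "2021-01-03"]:
    recent.filter(F.col("event_date") == day).groupBy("user_id").count().show()

spark.stop()
```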
Companies disillusioned by the failure of Hadoop data lake projects have also turned to next-generation data warehousing solutions. For instance, the Snowflake data warehouse provides elastic cloud processing, support for very large databases, and a familiar SQL interface, together with strong support for semi-structured data.
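As a hedged illustration of that SQL interface, the sketch below uses the snowflake-connector-python package to run a query that reaches into JSON held in a VARIANT column. The account, credentials, warehouse, and the events table with its payload column are all hypothetical.

```python
# A sketch of querying semi-structured data in Snowflake from Python.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",    # hypothetical
    user="example_user",          # hypothetical
    password="example_password",  # hypothetical
    warehouse="ANALYTICS_WH",     # hypothetical
)
try:
    cur = conn.cursor()
    # Snowflake's path notation lets ordinary SQL reach into JSON stored in a
    # VARIANT column, which is the semi-structured support mentioned above.
    cur.execute("""
        SELECT payload:device.os::string AS os, COUNT(*) AS events
        FROM events
        GROUP BY 1
        ORDER BY 2 DESC
    """)
    for os_name, event_count in cur.fetchall():
        print(os_name, event_count)
finally:
    conn.close()
```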
A Pivotal Technology
To describe Hadoop as dead is extreme. Large Hadoop clusters are likely to be around for decades to come. However, it’s hard to see Hadoop making a comeback in the era of cloud dominance.
Regardless of its long-term prospects, Hadoop remains a pivotal technology in the history of databases. Hadoop was one of the key technologies that broke the stranglehold of relational databases, and it forced a shift in the way in which we think about and store data. Prior to Hadoop, raw data was generally transformed as it was loaded into a data warehouse and the original data discarded. Today, companies of all sizes recognize the importance of keeping original data as a potentially valuable asset, and we live in a world in which little or no data is ever discarded—for better or worse.