Anyone who thought Hadoop was a fly-by-night technology was wrong. Hadoop has rapidly evolved—improving and gaining mainstream adoption as a technology and framework for enabling data applications previously out of reach for all but the savviest of companies. The open source Apache Hadoop developer community (and distribution vendors) continuously contributes advances to meet the demands of companies seeking more powerful—and useful—data applications, while also focusing on requirements for improved data management, security, metadata, and governance. Hadoop is not only stable but worthy of consideration for core IT strategies.
The Future of Hadoop as Your Data Operating System
Major development initiatives have moved the Hadoop ecosystem beyond its batch-oriented roots to interactive capabilities, delivering better performance for interactive SQL and distributed in-memory applications. This transforms the traditional IT application and data architecture: with Hadoop supplying scalability and performance, the next class of business and IT requirements becomes more operational in nature. No other data technology has been able to perform as a horizontal "data operating system," supporting a variety of data application engines on a shared file system so that data- and analytic-oriented applications live in one place. The data operating system thus eliminates the interchange and synchronization of data across siloed applications and technology stacks. In turn, it can unify the fragmented enterprise architecture stack (along with its application, presentation, and networking layers) to achieve the long-standing vision of enterprise and service-oriented architecture.
True transformation takes place when people think beyond current perceptions of Hadoop in its standalone or verticalized implementations. Hadoop was always designed to be a data management platform, even as it rode the wave of big data needs. As Hadoop evolves rapidly, with the open source development community fueling innovation and addressing business needs, it can become a true data operating system: centralizing all enterprise data and offering a choice of data management engines on top of it. It remains to be seen whether effectively centralizing data will further enable unification efforts such as master data management or data warehousing, but it will simplify the data management challenge for those committed to the endeavor and its benefits.
Just as Hadoop’s first and second generations were driven by business demands, the third generation will offer improvements that fuel the adoption of the data lake and data operating system strategies within the enterprise data architecture.
Enterprise architects are beginning to discuss how Hadoop can become a long-term, central part of IT's strategic vision, positioning it as the data persistence and application layers that serve a significant portion of future enterprise business application and data needs. Combined with ongoing advances in computing and storage technology at ever better price-performance, the realization of the Hadoop data operating system will only further these strategic discussions.
Laying the Foundation for the Modern Hadoop Platform
Inspired by Google white papers and cultivated at Yahoo, the first generation of Apache Hadoop grew out of the need to provide affordable scalability and a flexible data structure for working with big data sets. Traditionally, the obstacle for companies trying to embrace big data was simple: cost-justifying it. But for Google and Yahoo, working with massive amounts of data was a necessity, and their businesses required out-of-the-box thinking. Big data demanded a lot of raw computing power, storage, and parallelism, which in turn required a lot of money to build the supporting infrastructure. Economic factors originally restricted all but the largest Fortune 500 organizations from investing in the Big Iron needed for big data. Pioneering data-centric companies, where data was the business, invested heavily in big data because their businesses depended on it.
The only way to solve the big data problem was to break big data into manageable chunks and run smaller jobs massively in parallel. And the only way to afford that solution was to use the least expensive hardware available, which also meant the hardware with the highest failure rates. Fault tolerance and self-healing therefore had to be handled in software, and the Hadoop Distributed File System (HDFS) was developed as one of the core components of Hadoop: the "persistence layer" of the data operating system. Big data also required programs built on a parallel processing framework, which led to MapReduce, the other core component of Hadoop. MapReduce can be considered the first "data management application" of the data operating system. Together, MapReduce and HDFS gave the first-generation Hadoop platform an economical way to conquer big data sets with parallelism and affordability, and set the architectural core for a data operating system.
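To make the map-and-reduce pattern concrete, the sketch below is a minimal word-count job in the style of the canonical Hadoop MapReduce example: mappers process splits of an HDFS file in parallel and emit (word, 1) pairs, and reducers sum the counts for each word. The input and output paths are placeholders supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: each mapper works on one split of the HDFS input in parallel
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
          }
        }
      }

      // Reduce step: counts for each word are gathered from all mappers and summed
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }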
First-generation Hadoop met the need for affordable scalability and flexible data structure, democratizing the big data investment and giving companies big and small the power of big data at an affordable price. However, it was only the first step. Limitations, such as batch-oriented job processing and resource management tied to the MapReduce JobTracker, drove the development of Yet Another Resource Negotiator (YARN), which was needed to take Hadoop to the next level.
In its second major release, Hadoop took the giant leap from batch-oriented to interactive. It also opened the door to more independent execution models, with MapReduce itself becoming just one YARN application running on top of HDFS persistence. Via YARN, Hadoop became a true data operating system, and the improved two-tier framework enabled parallel innovation of data engines (both as Apache projects and in vendor-proprietary products). This was a significant advance: breaking apart the database management system and separating data persistence from the execution models so that one copy of the data can serve multiple workloads.
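To illustrate that division of labor, here is a minimal sketch using the YARN client API to ask the ResourceManager for a container in which an application master could run. The application name, launch command, and resource sizes are placeholders; the point is that resource negotiation is generic, and MapReduce, Tez, Spark, and other engines all request containers in essentially this same way.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
      public static void main(String[] args) throws Exception {
        // Connect to the cluster's ResourceManager
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Describe the application: a name, a launch command, and a resource request
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("placeholder-engine");  // hypothetical name

        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(
            Collections.singletonList("java MyApplicationMaster"));  // placeholder command
        context.setAMContainerSpec(amContainer);

        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(1024);    // MB of memory for the application master
        capability.setVirtualCores(1);
        context.setResource(capability);

        // YARN schedules the container; the execution model inside it is up to the application
        System.out.println("Submitted " + yarnClient.submitApplication(context));
        yarnClient.stop();
      }
    }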
Hadoop Version 2 lays the foundation for today's data lake strategy, which champions the affordability and effectiveness of consolidating all data into one repository and working on it with any of the many available YARN applications (such as MapReduce2, HOYA/HBase, Hive/Tez, Storm, and Spark). Using the data lake only as a consolidated repository for other systems and platforms to fish data out of is shortsighted; it leverages nothing more than the affordable scalability of HDFS as storage. Using Hadoop as an interactive, multiple-workload, operational data platform, by contrast, redefines data architecture.
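As a small illustration of interactive SQL directly against files already in the lake, the sketch below defines a Hive external table over an HDFS directory and queries it through the standard HiveServer2 JDBC driver. The host, credentials, table layout, and path are hypothetical; the key point is that the data is queried where it sits rather than copied into a separate warehouse.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LakeSql {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and user
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

          // External table: Hive stores only metadata; the files stay where they are in HDFS
          stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events (ts STRING, level STRING, msg STRING) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
              + "LOCATION '/data/lake/events'");

          // Interactive SQL (running on Tez or MapReduce) over files other engines can also read
          try (ResultSet rs = stmt.executeQuery("SELECT level, COUNT(*) FROM events GROUP BY level")) {
            while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
          }
        }
      }
    }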
Embracing the Data Operating System
With the affordable scalability and manageability of consolidating data in the Hadoop platform via HDFS, the data operating system enabled by Hadoop's YARN applications represents a fundamental shift in data architecture. It takes polyglot persistence (the ability to use the execution engine best suited to each purpose) to another level by decoupling the specialized engine from the dedicated persistence silo it previously required. Now, data that once lived only in transaction-oriented databases, document databases, or graph databases can be stored within Hadoop and accessed through YARN applications, without duplicating or moving it for each different workload.
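For instance, an in-memory engine such as Spark, submitted as just another YARN application, can scan the same hypothetical event files used in the Hive sketch above without a second copy of the data; the path and filter here are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LakeScan {
      public static void main(String[] args) {
        // Submitted with --master yarn, Spark runs alongside the other engines on the cluster
        SparkConf conf = new SparkConf().setAppName("LakeScan");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // Same HDFS files the batch and SQL engines use; nothing is exported or synchronized
          JavaRDD<String> events = sc.textFile("hdfs:///data/lake/events");
          long errorLines = events.filter(line -> line.contains("ERROR")).count();
          System.out.println("error lines: " + errorLines);
        }
      }
    }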