In 2014, we continued to watch how big data is enabling all things “big” about data and its business analytics capabilities. We also saw the emergence (and early acceptance) of Hadoop Version 2 as a data operating platform, with YARN (Yet Another Resource Negotiator) and HDFS (Hadoop Distributed File System) as its cornerstones. The ecosystem of Apache Foundation projects has continued to mature at a rapid pace, while vendor products continue to join, mature, and benefit from Hadoop improvements.
In last year’s Big Data Sourcebook, we highlighted several items in “The State of Big Data” article that are worth recapping. First, we referenced the “battle over persistence” for data architectures, which in enterprise adoption pitted the pundits promising “everything in Hadoop” against those saying “it’s OK to have another data platform.” In 2014, we witnessed the acceptance of these multi-tiered architectures, with each tier suited to specific workloads, which at Radiant Advisors we refer to as the “modern data platform.” With its acceptance growing, Hadoop is here to stay, and many analysts refer to its role as “inevitable.” This, naturally, is tempered by its maturity, the ability of enterprises to find and/or train resources, and the need to specify the proper first use case project and long-term strategy, such as the data lake or enterprise data hub strategies.
We also discussed how companies needed to understand that “data is data” when approaching big data with “big” eyes. For the most part, in 2014 we saw mainstream companies shift from a “the sky is falling if I don’t start a big data project” mindset to reserving big data projects for situations where the data wasn’t typically relationally structured, or where it had volatile schemas. “Schema on read” versus “schema on write,” and the benefits and situations suited to each, became much better understood in 2014, too. And, more importantly, we have seen an increasing understanding that all data can be valuable and that data needs to be explored for discovery and insights.
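To make the “schema on read” versus “schema on write” distinction concrete, here is a minimal Python sketch. It is an illustration only, not how Hadoop or any particular product implements either approach; the sample events, the field names, and the functions `write_with_schema` and `read_with_schema` are all hypothetical.

```python
import json

# Hypothetical raw events with a volatile schema: fields vary per record.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50"}',
    '{"user": "bob", "amount": "3", "coupon": "SPRING"}',
    '{"user": "cy"}',
]

# Assumed fixed schema for the "schema on write" store.
REQUIRED_FIELDS = ("user", "amount")


def write_with_schema(lines):
    """Schema on write: validate and coerce each record before storing it.

    Records that do not fit the fixed schema are rejected at load time.
    """
    stored, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in REQUIRED_FIELDS):
            stored.append(
                {"user": record["user"], "amount": float(record["amount"])}
            )
        else:
            rejected.append(line)
    return stored, rejected


def read_with_schema(lines, fields):
    """Schema on read: keep the raw lines as-is and project fields per query.

    Missing fields come back as None instead of blocking the load.
    """
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}
```

In this sketch, `write_with_schema(RAW_EVENTS)` keeps two records and rejects the one without an `amount`, while `read_with_schema(RAW_EVENTS, ("user", "coupon"))` returns a row for all three events, including a `coupon` field that the fixed schema never anticipated. That trade-off, structure enforced up front versus structure applied at query time, is why volatile or loosely structured data tends to favor schema on read.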
Last year, we said that 2014 would be “the race for access hill” as companies demanded better access to data in Hadoop for business analysts and power users, and that this access no longer be restricted to programmers. As SQL reasserted itself as the de facto standard for everyday users and for existing data analysis and integration tools, the SQL access capabilities of Hadoop came under incredible pressure to improve in both performance and capability. Continued releases of Hive/Tez by Hortonworks, Impala by Cloudera, and MapR’s Drill initiative delivered orders-of-magnitude performance improvements for SQL access. The race was on: Actian’s Vortex made a splash at the Hadoop Summit in June, and others, such as IBM and Pivotal, made significant improvements, too. The race that ran through 2014 continues into 2015, with more SQL analytic capabilities and performance improvements.
Hadoop 2 Ushers in the Next Generation
The significance of Hadoop 2 has recently started to resonate with companies and enterprise architects. Moving Hadoop away from its batch-oriented origins, YARN has clearly positioned the data operating system as two separate fundamental architecture components.