The year 2015 started out with people recognizing that the Hadoop ecosystem is here to stay. As Gartner put it, Hadoop was in the “trough of disillusionment,” meaning the hype was at an end. Some people were disillusioned because they had moved beyond proof of concept and were now grappling with the harder work of solving real problems. Other early adopters had already moved past the “how to manage and implement” phase and were making real progress and reinventing their business processes. Use cases that work well with this technology stack have now been implemented across nearly every industry. 2015 was the year in which organizations achieved real success within the Hadoop ecosystem.
Interestingly, more projects are popping up within the Hadoop ecosystem that can run both with and without Hadoop. The great thing about this trend is that it lowers the barrier to entry for people getting started with these technologies. More importantly, all of these new technologies still work best at large scale alongside the rest of the Hadoop ecosystem, even as Hadoop MapReduce has begun its ride off into the sunset. MapReduce is still there, still supported, and people still have real code running in production with it.
In 2016, we’ll see a continued shift from Hadoop MapReduce to other technologies such as Apache Spark. Tools like Apache Zeppelin and Jupyter continue to integrate more of these technologies and expose them to users through easy-to-use web notebooks, which again helps lower the barrier to entry.
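As a minimal sketch of the kind of cell a user might run in such a notebook (assuming a PySpark kernel and a hypothetical log file on HDFS; Zeppelin and many Jupyter setups already provide a ready-made SparkContext as sc):

    from pyspark import SparkContext

    # Notebook environments usually supply the SparkContext for you; creating one
    # explicitly here just keeps the sketch self-contained.
    sc = SparkContext(appName="notebook-sketch")

    # Count log lines per level (hypothetical HDFS path and log format).
    counts = (sc.textFile("hdfs:///logs/app.log")
                .map(lambda line: (line.split(" ")[0], 1))
                .reduceByKey(lambda a, b: a + b)
                .collect())

    print(counts)
    sc.stop()

The point is less the specific job than the workflow: a few lines typed into a browser, run interactively against the cluster, with no MapReduce boilerplate to write or package.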
Looking Forward
2015 was a great year for Hadoop-related big data technologies. Apache Spark has gained significant momentum in replacing Hadoop MapReduce within the Hadoop stack, and there is no sign that this momentum will slow down any time soon—especially as users find new and creative ways to exploit it.
Keep in mind that there’s more to the Hadoop ecosystem than just Hadoop and Spark. Let’s take a look at some areas that will have a big impact in the coming year.
Resource Management
The early adopters of Hadoop are now moving into the second or even third iterations of their big data platforms. They are looking at new ways to leverage these massive clusters of hardware to gain even more benefit. Many have already put a data lake in place, or are at least evaluating doing so. For those unfamiliar with the term, a “data lake” is nothing more than storing data from multiple domains of a business in a single cluster. While this helps make the data available for computing, it still doesn’t enable real-time access to the data. As organizations with data lakes continue their journey, they will start to gravitate toward the Zeta Architecture. Whether they realize it or not, a data lake is an interim step toward optimizing the data center. Data will be processed where it is created, enabling real-time access and eliminating the heavyweight processes and extreme latency caused by moving data into a data lake.
Technologies such as Apache Myriad and Apache Mesos will enable the move to the Zeta Architecture. These technologies give IT teams the ability to manage all resources, not just the Hadoop cluster, in a more granular way. Myriad bridges the gap between distributed applications that can only be managed by YARN and the rest of the data center, which can be resource-managed by Mesos. This move will give businesses the opportunity to operate in a manner similar to Google.
SQL and NoSQL
Years ago, people called business analysts would dive into the data and figure out how to optimize business processes. Today, that role has morphed into what is now called a data scientist, and these same people need a new set of skills and proficiencies to leverage this data. Ever since Hadoop gained popularity, a goal of the big data ecosystem has been to lower the barrier to entry and ease the burden of adopting new technologies that work at scale.
For example, Apache Drill reached its 1.0 release in May and has seen substantial adoption. Drill plugs directly into the standard BI tools that organizations have already invested in. There’s no need to retrain users on those tools; instead, users can take a free on-demand training class and, leveraging the skills they already have, get up to speed on querying complex JSON data within a couple of hours. Drill even plugs into Microsoft Excel. Many people pretend that Excel doesn’t exist, but it is still heavily used, and thanks to Drill’s ease of integration with common end-user tools, it should not be ignored in the big data space. Drill will continue to experience strong growth and add even more integrations in the coming year. Given this momentum, Apache Drill’s reputation will be cemented as the “SQL engine for everything.”
Apache Drill will also bridge the gap to the NoSQL world. Even more NoSQL databases will be exposed to end users via Drill. SQL access is already available for several NoSQL databases (HBase, MapR-DB, MongoDB, and Cassandra), and this trend will continue to gain traction in the coming months.
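As a rough sketch of what this unified access looks like in practice (assuming a local Drill instance with its REST API on the default port 8047, and hypothetical file, database, and collection names), the same SQL interface can be pointed at raw JSON files and at a MongoDB collection:

    import requests

    DRILL_URL = "http://localhost:8047/query.json"  # Drill REST endpoint (assumed default port)

    def run_query(sql):
        """Submit a SQL statement to Drill over its REST API and return the result rows."""
        resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
        resp.raise_for_status()
        return resp.json().get("rows", [])

    # Query raw JSON files sitting on the file system; no schema definition or ETL step needed.
    json_rows = run_query(
        "SELECT t.name, t.address.city AS city "
        "FROM dfs.`/data/customers.json` t LIMIT 10"
    )

    # The same interface reaches into a NoSQL store; here, a hypothetical MongoDB
    # collection exposed through Drill's mongo storage plugin.
    mongo_rows = run_query("SELECT name, status FROM mongo.crm.customers LIMIT 10")

    print(json_rows)
    print(mongo_rows)

The same queries could just as easily be issued from a BI tool or from Excel through Drill’s ODBC driver; the REST call simply keeps the sketch self-contained.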
In addition, document databases with tight integration into the Hadoop ecosystem will open the door to developer-friendly programming interfaces, making the transition to big data far simpler for more companies. This will be a major factor in the adoption of big data technologies that work at any scale, for nearly any application being developed. In other words, 2016 will be the year in which software engineers can easily build new applications on top of these big data technologies, aside from those applications that are purely and strictly relational in nature. Once an application is developed, scaling it is no longer a concern, because the backend scales effortlessly.
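As one illustration of the kind of developer-friendly interface in question (a sketch using MongoDB’s Python driver, pymongo, with hypothetical database, collection, and field names; document databases that integrate with the Hadoop ecosystem expose similar document-oriented APIs), application code works with documents directly and leaves scaling to the backend:

    from pymongo import MongoClient

    # Connect to a (hypothetical) document store; with a distributed backend,
    # this application code stays the same as data volume grows.
    client = MongoClient("mongodb://localhost:27017")
    orders = client.shop.orders

    # Persist a nested document as-is; no relational schema or object-mapping layer required.
    orders.insert_one({
        "customer": "acme",
        "items": [{"sku": "A-100", "qty": 3}, {"sku": "B-200", "qty": 1}],
        "status": "open",
    })

    # Query by any field of the document.
    for order in orders.find({"status": "open"}):
        print(order["customer"], len(order["items"]))
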