It is no secret that we are in the data age. Data comes at us from all directions, in all shapes and sizes. As technology leaders, we are tasked with answering complex questions like: How will we process all this information? How can it be more accessible? And how will users benefit from it?
To further complicate these endeavors, new technologies appear every day. Incumbent vendors and startups constantly add new features, build on top of emerging open source projects, and claim to solve the next wave of challenges. Within the Hadoop ecosystem alone, there are (at least) 11 Hadoop-related open source projects. Making sense of it all can be a time-consuming headache.
To bring clarity and peace of mind, here are the top 5 big data predictions for 2015 and beyond.
1. In-Memory Computing Becomes Widespread
Dropping prices, undisputed performance, and untapped business value will drive the widespread adoption of in-memory computing.
Microsoft recently announced its G-series of cloud instances, which scale up to 32 cores and 448 GiB of RAM at a cost of $8.69 per hour. Amazon’s comparable R3 instance currently runs $2.80 per hour for 244 GiB of memory with 32 CPU cores, and Google’s high-memory option averages $1.184 per hour for 104 GB of memory, though it tops out at 16 cores.
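To compare these options on equal footing, it helps to normalize price to memory. Here is a quick back-of-the-envelope calculation, using only the figures quoted above:

```python
# Normalize the quoted on-demand prices to dollars per unit of RAM
# per hour (prices and memory sizes as cited above).
instances = {
    "Microsoft G-series": (8.69, 448),   # ($/hour, memory as quoted)
    "Amazon R3":          (2.80, 244),
    "Google high-memory": (1.184, 104),
}

for name, (price_per_hour, ram) in instances.items():
    print(f"{name}: ${price_per_hour / ram:.4f} per GiB-hour")
```

All three land between roughly one and two cents per GiB of RAM per hour, which is what makes memory-optimized instances economical to run around the clock.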
The key takeaway: you can now run applications on high-performance, memory-optimized virtual machines for a trivial cost.
In-memory computing delivers performance hundreds of times faster than a traditional disk-based RDBMS. With a memory-first system architecture, what used to take hours or days can be done in seconds, all by processing data where it resides.
Ultimately, this leads to new sources of business revenue. In the financial world, organizations must respond to market volatility instantly as delays drain money from their pockets. In-memory computing enables financial institutions to respond to fluctuating market conditions as they happen.
2. Resurgence of SQL and Relational Architectures
With growth markets like the Internet of Things and mobile computing continuing unabated, enterprise data architectures require structured and semi-structured data to scale together.
In reality, businesses use both structured and semi-structured data, but problems arise because most databases handle only one type or the other. The challenge lies in uniting different data models so that all data can be analyzed together.
Evidence of this convergence can be found in open source initiatives like Apache Hive, which provides a SQL-like layer on top of Hadoop, and Apache Pig, which offers a higher-level dataflow language for the same purpose. But these platforms still require experienced engineering teams to support production application environments.
As the lines between relational and non-relational database management systems blur, SQL continues to dominate as a preferred method for managing data for the following reasons:
- Scale and performance - Preconceived notions that SQL lacks scale and flexibility have been dispelled by advances in in-memory computing and distributed system architectures. It is now entirely possible for relational databases to scale easily while providing the familiarity and stability of SQL. SQL can also express remarkably complex queries with just a handful of clauses built around the SELECT statement (see the sketch after this list).
- Everyone knows it - Decades of usage have made SQL the de facto data analysis language. With millions of SQL users and thousands of SQL tools readily available, switching to an unfamiliar query language is an uphill battle. Adopting a SQL interface inside a big data architecture can also increase consumption of Hadoop by opening access to a much larger population of users.
- Stability - Relational database management systems support the SQL compatibility, transactional consistency, and enforced schema required by data-reliant enterprises.
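To make the expressiveness point concrete, here is a minimal sketch using Python’s built-in sqlite3 module and a hypothetical trades table; the schema and data are illustrative only:

```python
# A single SELECT combining filtering, aggregation, a post-aggregation
# filter, and ordering: a few clauses doing a lot of work.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, side TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [("AAPL", "buy", 100, 109.5), ("AAPL", "sell", 40, 110.2),
     ("MSFT", "buy", 200, 46.7), ("MSFT", "buy", 50, 47.1)],
)

query = """
    SELECT symbol,
           SUM(qty)         AS total_qty,
           SUM(qty * price) AS notional
    FROM trades
    WHERE side = 'buy'
    GROUP BY symbol
    HAVING SUM(qty) > 50
    ORDER BY notional DESC
"""
for row in conn.execute(query):
    print(row)
```

The same question posed against a lower-level API would take dozens of lines of imperative code; in SQL it reads almost like the business question itself.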
3. Move to Scaling Systems Online
The onslaught of incoming data shows no sign of stopping. At the same time, emerging data sources (and the volume of data that comes with them) are unpredictable. For this reason, data-dependent organizations will need distributed system architectures that scale out rather than up.
Online scaling provides several benefits over traditional scale-up databases:
- Seamless growth - Scaling capacity or performance online should be fast and painless, triggered with a click or a simple command (the sharding sketch after this list shows the basic idea). In contrast, scale-up systems have fixed limits on capacity and compute power, and scaling beyond those limits adds complexity and cost.
- Schema flexibility - As applications mature, schema changes can be made without taking the system down.
- High Availability - Users no longer tolerate downtime, so systems must meet that expectation.
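As a minimal sketch of the scale-out idea, here is a toy hash-based sharding scheme in Python; the Shard class and route function are illustrative, not any particular product’s API:

```python
# Rows are spread across shards by hashing a key, so capacity grows by
# adding shards rather than by buying a bigger machine.
import hashlib

class Shard:
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

def route(shards, key):
    # A stable hash ensures the same key always lands on the same shard.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[digest % len(shards)]

shards = [Shard(f"node-{i}") for i in range(4)]
for user_id in ("alice", "bob", "carol"):
    route(shards, user_id).put(user_id, {"visits": 1})

# Growing online means appending nodes. A production system would also
# rebalance existing rows (for example, with consistent hashing) rather
# than rehash everything, which this toy modulo scheme does not handle.
shards.append(Shard("node-4"))
```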
4. The Rise of Apache Spark
As the world moves from batch to online data processing, real-time data pipelines will supersede legacy data warehouse and transaction processing systems as core business computing infrastructure.
Latency and complexity negatively impact revenue by preventing key stakeholders from accessing data when they need it. As Apache Hadoop and Spark see wider enterprise adoption, the next challenge will be integrating these batch processing tools with a real-time database for transaction processing and operational analytics.
With a memory-first infrastructure, businesses can process transactions and query real-time data simultaneously, while integrating tightly with Hadoop and Spark.
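As a rough sketch of what such a pipeline looks like in practice, here is a minimal Spark Streaming job in Python; the socket source, port, and save_to_database function are placeholders, not a prescribed integration:

```python
# Ingest a text stream, aggregate each micro-batch, and hand the results
# to a writer that would persist them to an operational database.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RealTimePipeline")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

def save_to_database(rdd):
    # In a real deployment, write each batch to the operational database.
    for key, count in rdd.collect():
        print(key, count)

counts.foreachRDD(save_to_database)
ssc.start()
ssc.awaitTermination()
```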
5. Up-to-the-Moment Analytics Everywhere
In a world where, by 2017, the average person will own five connected devices, user expectations of a cohesive experience have shifted from a plus to a must. Connected devices and applications generate huge amounts of data and processing, and a seamless experience demands up-to-the-moment accuracy. In a recent post on DBMS2.com, Curt Monash hit the nail on the head: “Internet interaction applications increasingly require data freshness to the last click or other user action.”
The omni-channel retail experience illustrates the need for up-to-the-moment analytics. Today’s retail experience spans customer touch points from an in-store PoS system to an online webstore to a mobile application. Data crossing these touch points needs to be updated in real time, to the last event. For example, purchases made online at Target.com should influence the mobile offers a customer receives upon entering a Target store, and checkouts at the store should drive “you-might-like” recommendations during the next online visit. Customers make these decisions in real time, and their interactions with the brands they choose should reflect that.
Looking Ahead
While the specifics of how we’ll tackle the data challenges of 2015 are still taking shape, the need to do so remains. Paying attention to these key trends should help guide you to the right solutions.
Gary Orenstein, Chief Marketing Officer at MemSQL, leads marketing strategy, growth, communications, and customer engagement. Prior to MemSQL, Gary was the Chief Marketing Officer at Fusion-io, where he led global marketing activities. He also served as Senior Vice President of Products at Fusion-io during the company’s expansion to multiple product lines. Prior to Fusion-io, Gary worked at infrastructure companies across file systems, caching, and high-speed networking. Earlier in his career he served as Vice President of Marketing at Compellent. Gary holds a bachelor’s degree from Dartmouth College and a master’s in business administration from The Wharton School at the University of Pennsylvania.