Over the years, I have discovered that “good enough” is never “good enough.” Years ago when I worked for a digital advertising company, most people in the business were content with putting out a new advertisement and waiting 2 hours before knowing how it was performing. If an advertisement was misconfigured, it would be reconfigured, and 2 more hours would elapse before seeing the new statistics. Making little adjustments would take the better part of a day.
To move the company forward, we laid out plans to transition to the Hadoop ecosystem and leverage batch processing. Although the new system would benefit the company, I predicted that batch processing would be good enough only until employees realized the power that real-time Hadoop could deliver. Others didn’t buy that argument, believing that going from 2 hours to 5 minutes was great; after all, the business had been content with a 2-hour delay for nearly 5 years.
After the system went into production, everyone was excited that it enabled work processes to be modified to deliver a higher level of performance and customer satisfaction. However, less than 3 months later, users were asking for “real time” so they could get even faster feedback. The employees saw what was possible and wanted more.
A PDF of the Fall 2015 issue of Big Data Quarterly magazine is available for download.
In general, batch data processing is where people tend to be the most comfortable when beginning to work with big data technologies. Most systems have been built this way for the last 30 years, with batches of data ingested into a data warehouse or other RDBMS. However, after people become acclimated to batch processing in a Hadoop environment, their trepidation tends to wear off, and they grow more comfortable putting real-time processes into their operational workflows.
Real-time platforms consist of a variety of different technologies. These include real-time databases, messaging platforms, and stream processing engines. Depending on the use case, some or all of these may be used to build a real-time platform:
- Real-time databases in the Hadoop ecosystem such as Apache HBase and MapR-DB enable real-time data processing for online transactional processing (OLTP) or online analytical processing (OLAP). They scale linearly, which makes performance and expansion costs very predictable. The main enterprise concern with HBase is that it is not well-equipped to handle transactions and analytics on the same hardware at the same time and generally requires dedicated clusters.
- Stream processing engines garner the most attention these days. They provide a framework for building software that takes an action on each event or message as it is received. The most popular streaming engines at present are Apache Spark, Apache Storm, and Apache Flink. Spark Streaming (micro-batch) has gained a significant following because it is easy to get started with. Flink (event-based) has been rapidly gaining attention. Finally, Storm (event-based), which Twitter open-sourced, had been the dominant choice until recently, when Twitter announced it had migrated off Storm to its own stream processing engine, Heron. There are many differences between these engines, and each can be set up to handle messages from a variety of messaging platforms.
- Messaging platforms should be given the most consideration when building a real-time platform. This component receives all data coming into the system and can place guarantees on message delivery. If it cannot offer certain performance and functional guarantees, it becomes the limiting factor for every downstream application. Apache Kafka delivers a publish/subscribe model and is currently the most dominant in this space, but it is not the only option: RabbitMQ, which supports distributed queues (very important), and Apache Flume, generally used for log shipping, are also options. Kafka scales easily because it partitions an append-only message log across a cluster of brokers and persists it to disk; that, coupled with a simple API, are the two biggest reasons Kafka has gained such popularity. Kafka does, however, have certain enterprise concerns that shouldn’t be ignored. The most prominent among them is Kafka’s inability to coexist on the same cluster of hardware as other Hadoop use cases. Another noteworthy concern is its lack of support for globally distributed messaging.
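The row-keyed access pattern behind real-time databases such as HBase and MapR-DB can be sketched with a toy in-memory wide-column store. This is only an illustration of the data model (row key, column family, column, value) — the class and method names are hypothetical, not a real HBase client API:

```python
class WideColumnTable:
    """Toy wide-column store, HBase/MapR-DB style. Rows are addressed by
    key, and cells live under column families; single-row reads and writes
    are what keep latency predictable as the table grows.
    Illustrative only -- not a real client API."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, column, value):
        # Write one cell: row -> family -> column -> value
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family=None):
        # Read a whole row, or just one column family of it
        row = self.rows.get(row_key, {})
        return row.get(family, {}) if family is not None else row

# Hypothetical ad-performance table
ads = WideColumnTable()
ads.put("ad-1", "stats", "clicks", 42)
ads.put("ad-1", "stats", "impressions", 1000)
ads.put("ad-1", "meta", "campaign", "fall-sale")

print(ads.get("ad-1", "stats"))   # {'clicks': 42, 'impressions': 1000}
```

Because every operation touches a single row by key, adding nodes adds capacity without changing per-request cost — the linear scaling the bullet above describes.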
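The difference between Spark Streaming's micro-batch model and the event-at-a-time model of Storm and Flink can be shown with a small simulation. The functions and the click-stream data below are hypothetical, meant only to contrast the two processing styles:

```python
def micro_batch_process(events, batch_interval=5):
    """Micro-batch model (Spark Streaming style): buffer events into
    fixed time windows, then process each buffered batch as a unit."""
    batches, current, window_end = [], [], batch_interval
    for ts, value in events:
        if ts >= window_end:          # window closed: emit the batch
            batches.append(current)
            current = []
            window_end += batch_interval
        current.append(value)
    batches.append(current)           # emit the final partial window
    return batches

def event_at_a_time_process(events, handler):
    """Event-based model (Storm/Flink style): invoke the handler on each
    event as it arrives, with no batching delay."""
    return [handler(value) for _, value in events]

# Hypothetical click stream: (timestamp in seconds, ad id)
clicks = [(0, "ad-1"), (1, "ad-2"), (6, "ad-1"), (7, "ad-3"), (11, "ad-2")]

print(micro_batch_process(clicks))                # three 5-second batches
print(event_at_a_time_process(clicks, str.upper))  # one result per event
```

The trade-off the simulation makes visible: micro-batching delays each result until its window closes (here, up to 5 seconds), while the event-based model reacts to every message immediately.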
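The publish/subscribe model that makes Kafka attractive can also be sketched in a few lines: messages are appended to an ordered log, and each subscriber tracks its own read offset, so a slow consumer never blocks a fast one. This is a minimal in-memory sketch of the idea — the class and topic names are illustrative, not Kafka's actual API:

```python
class Topic:
    """Toy publish/subscribe topic backed by an append-only log.
    Each subscriber keeps an independent offset into the log,
    which is the core of the Kafka-style consumption model."""

    def __init__(self):
        self.log = []        # ordered, append-only message log
        self.offsets = {}    # subscriber name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, name):
        self.offsets[name] = 0   # new subscribers start from the beginning

    def poll(self, name):
        """Return this subscriber's unread messages and advance its offset."""
        start = self.offsets[name]
        self.offsets[name] = len(self.log)
        return self.log[start:]

# Hypothetical ad-click topic with two independent consumers
ad_events = Topic()
ad_events.subscribe("dashboard")
ad_events.subscribe("billing")

ad_events.publish({"ad": "ad-1", "clicks": 3})
ad_events.publish({"ad": "ad-2", "clicks": 1})
first = ad_events.poll("dashboard")     # dashboard sees both messages

ad_events.publish({"ad": "ad-1", "clicks": 5})
second = ad_events.poll("dashboard")    # only the newly published message
everything = ad_events.poll("billing")  # billing catches up on all three
```

Because delivery state lives with the consumer (its offset) rather than the broker deleting messages on delivery, any number of downstream applications can read the same stream at their own pace.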
We live in an “as-it-happens” world, and consumers expect on-demand everything—TV, taxi service, you name it. That same culture is now (and rightly so) expected in businesses. Don’t just think about “good enough”—think and plan for real time. The technology is here now to leverage real time in your business for shorter feedback loops, improved time-to-market, and happier customers. The sooner, the better. It’s never too early for real time. “Good enough” simply isn’t, well, good enough anymore.