Given all of the recent discussion around big data, NoSQL and NewSQL, this is a good opportunity to visit a topic I believe will be (or should be) at the forefront of our minds for the next several years - high velocity transactional systems. Let's start with a description of the problem. High velocity transactional applications have input streams that can reach millions of database operations per second under load. To complicate the problem, many of these systems simply cannot tolerate data inconsistencies. As Todd Hoff of HighScalability.com lamented from his own experience of being trapped by an "eventually consistent" e-commerce application:
"I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!"
To put that another way, for transactional systems, the costs of data inconsistency are often unreasonably high relative to the costs of getting the transaction right in the first place.
High velocity transactional systems are by no means new. Capital markets have been handling real-time feeds, algorithmic trading and other throughput-intensive applications for decades; phone companies have high performance systems for processing call data; military operations have strategic systems for processing real-time battlefield intelligence. What's different now is the number and diversity of applications spooling up around three profound forces:
1. The web itself, which has disintermediated many of the traditional human transaction inputs (when was the last time you called a travel agent to book an airline reservation?).
2. Sensor technology. Within a few short years, everything worth anything will be sensor-tagged, tracked, analyzed and optimized.
3. Wireless and location-based systems. Everyone with a smart mobile device is a trackable, addressable endpoint.
We've entered a new cycle of interactivity and engagement. Online gaming and commerce applications need micro-transactions; digital advertising systems need data infrastructures for low-latency trading; web analytics and vulnerability detection apps must ingest and analyze massive inputs in real time. Along the continuum of information technology, these are relatively new applications. And there are many more on the horizon. Soon there will be many thousands of new performance-intensive transactional applications in production, most of which will severely stress their data infrastructures.
Data Tier for High Velocity Applications
What's the most appropriate data infrastructure for emerging high velocity applications? Let's define the core requirements first:
1. Raw throughput and scale. Assuming an application includes one or more "fire hose" data sources, the data tier must be able to ingest and provide stateful management for those high velocity inputs. As the application grows, its data tier must be able to scale easily and inexpensively.
2. ACID transactions. Although some parts of an application may tolerate "eventual consistency," high velocity transactional systems must guarantee data accuracy. Why? Because under eventual consistency, replicas that have temporarily diverged will only converge to a correct state if the operations applied to them are commutative (and most "real world" transactions are not). Otherwise, you've set yourself up for a nightmare - selling the same product twice, giving customers wrong information (as Todd experienced), or some other undesirable outcome requiring a complex brew of compensating transactions (a short sketch after this list illustrates the problem). High velocity systems will not tolerate these post-mortem gymnastics - they need the accuracy guarantees that come with ACID-level consistency.
3. Built-in high availability. High velocity systems also won't tolerate unscheduled downtime. The data tier must be incredibly resilient to failure, including built-in "Tandem-style" redundancy, and have the ability to repair itself without service windows.
4. Real-time analytics. In addition to raw transaction throughput, high velocity data infrastructures must support analytics on volatile, high frequency data. Examples include leaderboard reports, trend analyses and analytics that drive alerting systems.
5. High performance historical analytics. Once data transitions from a high velocity, temporal state to its ultimate historical form, organizations typically need to analyze it deeply for patterns, trends, anomalies and other useful insights.
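To make the commutativity point in requirement 2 concrete, here is a minimal, product-agnostic sketch in Java of the overselling scenario Todd ran into. Every class and method name here is invented for illustration; the point is simply that a "check stock, then decrement" operation cannot be merged safely across replicas, whereas a serialized ACID execution rejects the second sale outright.

```java
// Toy illustration of requirement 2: why eventually consistent replicas only
// converge correctly when operations are commutative. All names here are
// invented for this sketch; it models no particular product.
public class OversellDemo {

    /** A "sale" checked and applied against one replica's local view of stock. */
    static int sellIfAvailable(int localStock) {
        if (localStock > 0) {
            System.out.println("sale accepted");
            return localStock - 1;
        }
        System.out.println("sale rejected");
        return localStock;
    }

    public static void main(String[] args) {
        int stock = 1;  // one unit left, replicated to nodes A and B

        // Eventual consistency: each replica accepts a sale against its own
        // copy of the data, and the two decrements are merged later.
        int replicaA = sellIfAvailable(stock);  // accepted (A still sees stock = 1)
        int replicaB = sellIfAvailable(stock);  // also accepted (so does B)
        int merged = stock - (stock - replicaA) - (stock - replicaB);
        System.out.println("merged stock = " + merged);  // -1: the item sold twice

        // Serialized (ACID) execution: the second sale sees the first one's
        // effect and is rejected, so no customer is promised a missing item.
        int serial = sellIfAvailable(sellIfAvailable(stock));
        System.out.println("serialized stock = " + serial);  // 0: correct
    }
}
```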
Based on the above requirements, a high performance transactional data tier will, in most cases, consist of at least two purpose-built components:
1. An ultra-high throughput, scalable RDBMS. This component will generally handle requirements 1-4 above, although some of those requirements could also be serviced by other components in the infrastructure (e.g., low frequency ACID transactions such as user profile changes could be managed by a traditional RDBMS). In my opinion, the only way to implement high velocity transaction processing is to use an in-memory, NewSQL product like VoltDB. Why? Because products that attempt to retrofit these new requirements into decades-old code bases are simply disasters. Database systems need to be designed from scratch to deliver ultra-high throughput, ACID transactions, shared-nothing scale-out, and built-in high availability. Also, a point that doesn't seem to get much air time in scalability circles is node speed. Products like VoltDB are so fast on a per-node basis that you can achieve incredible transaction throughput with a relatively small number of nodes. This profile significantly limits the likelihood of node failures (fewer nodes = fewer possible points of failure) and reduces the need for multi-partition operations, further improving both read and write throughput (a sketch of a single-partition transaction follows this list). Moore's Law will only amplify these performance and reliability benefits over time.
2. A high-performance, scalable analytic datastore. This component will generally handle parts of requirement 4 (depending on your definition of "real-time") and all of requirement 5. Popular analytic datastores include Hadoop/HDFS, column stores such as HP/Vertica and Infobright, and adapted row stores like IBM/Netezza, EMC/Greenplum and Teradata/Aster. Some database vendors will tell you that they can handle all of the requirements of high velocity transactional systems in a single, one-size-fits-all product. Proceed with caution. High performance transaction processing and high performance historical analytics are sufficiently different that you need specialized solutions for each problem. Trying to address both needs with one technology means you'll probably do a bad job at both.
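To illustrate the single-partition point from item 1, here is a hedged sketch of what a VoltDB-style Java stored procedure for the inventory example might look like. The INVENTORY table, its columns and its partitioning key are assumptions made purely for this sketch, and the exact API details should be confirmed against the product's documentation.

```java
// Hedged sketch of a VoltDB-style single-partition stored procedure. The
// INVENTORY schema (partitioned on PRODUCT_ID) and all names are assumptions
// for illustration; check the product documentation for exact API details.
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class SellOne extends VoltProcedure {

    public final SQLStmt getStock = new SQLStmt(
        "SELECT quantity FROM inventory WHERE product_id = ?;");
    public final SQLStmt decrement = new SQLStmt(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ?;");

    public long run(long productId) {
        // Both statements are keyed on product_id, so the whole procedure can
        // run as one serialized ACID transaction on a single partition, with
        // no distributed locking or two-phase commit.
        voltQueueSQL(getStock, productId);
        VoltTable stock = voltExecuteSQL()[0];

        if (!stock.advanceRow() || stock.getLong(0) <= 0) {
            // Aborting rolls the transaction back; the caller learns right
            // away that the item is gone, rather than "eventually."
            throw new VoltAbortException("out of stock");
        }

        voltQueueSQL(decrement, productId);
        voltExecuteSQL(true);
        return 0;
    }
}
```

In this model, a transaction whose statements all touch one partition can run to completion on a single node, which is what keeps both coordination overhead and latency low and is where the per-node speed argument above comes from.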
What Else Do You Need to Consider?
The two key components of a high performance data tier - ultra-high throughput, scalable RDBMS and high performance, scalable analytic datastore - need to be designed from the ground up for interoperability. Data will move across these engines at high frequencies and in large volumes, and that traffic needs to be reflected in the products' core architectures. Also, to the degree that a "front-end" RDBMS component is capable of performing common enrichment operations (de-normalization, pre-aggregation, de-duplication and other useful transformations), data can be moved into the analytic "back-end" quickly and efficiently. Finally, to the degree that the components in your data tier support a common data language (i.e., SQL), you'll build applications more quickly and maintain them more easily and cost effectively.
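As an illustration of the enrichment idea, here is a small, hypothetical Java sketch of the kind of pre-aggregation and de-duplication the front-end tier might perform before shipping data to the analytic back-end. The Event shape and the per-minute bucketing are assumptions made purely for illustration.

```java
// Minimal sketch of enrichment before export: de-duplicating raw events and
// pre-aggregating them into per-minute counts so the analytic back-end
// ingests far fewer rows. The Event record and the bucket granularity are
// assumptions made for this sketch.
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PreAggregate {

    record Event(String eventId, String productId, long epochMillis) {}

    /** Collapses a batch of raw events into (productId, minute) -> count rows. */
    static Map<String, Long> rollUp(List<Event> batch) {
        Set<String> seen = new HashSet<>();          // de-duplication by event id
        Map<String, Long> counts = new HashMap<>();  // pre-aggregated output rows
        for (Event e : batch) {
            if (!seen.add(e.eventId())) continue;    // drop duplicate deliveries
            long minute = e.epochMillis() / 60_000;  // denormalized time bucket
            counts.merge(e.productId() + "@" + minute, 1L, Long::sum);
        }
        return counts;                               // ship these rows to the back-end
    }
}
```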
Summary
High velocity transactional systems will soon find a home in many IT portfolios. By their nature, high velocity systems require a data tier that delivers extremely high throughput, ACID transactions, high scalability and built-in fault tolerance. That data tier must also support a wide range of real-time and historical analytics. My advice is to break the problem into at least two parts - high velocity and deep analytics - and implement an infrastructure that excels at addressing each of these needs in separate but integrated ways.
About the author:
Dr. Michael Stonebraker is CTO of VoltDB, Inc. (www.voltdb.com). He has been a pioneer of database research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS and the object-relational DBMS POSTGRES. These prototypes were developed at the University of California at Berkeley, where Stonebraker was a Professor of Computer Science for 25 years. More recently, at M.I.T., he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs and search engines for accessing the deep web. He is a founder of six venture-capital-backed startups, which commercialized these prototypes. Presently he is also a co-founder of VoltDB, Inc., SciDB and Goby.com.
Professor Stonebraker is the author of scores of research papers on database technology, operating systems and the architecture of system software services. He was awarded the ACM System Software Award in 1992 for his work on INGRES. Additionally, he was awarded the first annual Innovation Award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John von Neumann Medal in 2005, and is presently an adjunct professor of Computer Science at M.I.T.