Building Architectures for Real-Time Analytic Applications

May 18, 2017

By Joyce Wells

BI and analytics are at the top of corporate agendas. Competition is intense, and, more than ever, organizations are demanding faster access to insights about their customers, markets, and internal operations to make better decisions. However, they are also facing challenges in powering real-time business analytics and systems of engagement. Analytic applications and systems of engagement need to be fast and consistent, but traditional database approaches, including RDBMS and first-generation NoSQL solutions, can be costly and challenging to maintain, according to Aerospike CTO and co-founder Brian Bulkowski.

In Bulkowski's view, companies should aim to simplify traditional systems and architectures while also reducing vendors. One way to do this is by embracing an emerging hybrid memory architecture, which removes an entire caching layer from the front-end application. In a talk at Data Summit 2017, Bulkowski discussed the use of this pattern to improve application agility and reduce operational database spend.

Moving from cloud to cloud becomes easier if you standardize on a database that you can use in multiple clouds, said Bulkowski.

The Aerospike database architecture is one of clustering and high performance, said Bulkowski. The system is optimized for different hardware choices and is also highly available and resilient, he noted. Because it is clustered, you can easily bring new nodes online in production, and you can change software versions in production as well. More important for cloud deployments, however, is the way that Aerospike architected its cross-data center support so that you can shift data from source to source in real time, said Bulkowski.

“The application architecture becomes key in real-time uses,” said Bulkowski. The old application architecture of slower databases, a cache, and an enterprise volume manager as a storage layer is not how internet companies build their tech stacks, he noted.

However, according to Bulkowski, a new trend in NoSQL (with Aerospike, Cassandra, Redis to some extent, and DynamoDB) embraces a different premise that asks: Why do you need a cache layer? We can have caching in the database. A separate cache layer is a consistency violation, a chance for errors, and a separate thing for your team to manage, said Bulkowski. It does have one benefit, which is that it scales the read path independently, so you can worry about the two problems separately and scale out your cache independently of whatever you did on the actual data. However, the simplification of removing a layer gives you a “massive performance benefit.”

Remove your network hops as much as you can in the cloud, advised Bulkowski. The one-hop design (the application and then the database, with no cache) is one that succeeds in the internet world.
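To make the trade-off concrete, here is a minimal sketch in Python of the two read paths described above. It is not Aerospike code: the `Database` and `CacheAsideReader` classes are hypothetical in-memory stand-ins, and the "hops" are simply counted rather than sent over a network. It illustrates how a separate cache can serve a stale value after the underlying record changes, while a direct read takes one hop and returns the current value.

```python
class Database:
    """Hypothetical in-memory stand-in for the system of record."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        # One simulated network hop: application -> database.
        return self._rows.get(key), 1


class CacheAsideReader:
    """Hypothetical cache tier sitting in front of the database."""

    def __init__(self, db):
        self._db = db
        self._cache = {}

    def get(self, key):
        # First hop: application -> cache.
        if key in self._cache:
            return self._cache[key], 1
        # Cache miss adds a second hop: cache tier -> database.
        value, db_hops = self._db.get(key)
        self._cache[key] = value
        return value, 1 + db_hops


def demo():
    db = Database()
    cached = CacheAsideReader(db)

    db.put("user:1", "bronze")
    first_read = cached.get("user:1")   # miss: two hops, fills the cache

    # Another writer updates the database without touching the cache.
    db.put("user:1", "gold")

    stale_read = cached.get("user:1")   # hit: one hop, but an old value
    direct_read = db.get("user:1")      # one hop, current value
    return first_read, stale_read, direct_read


first_read, stale_read, direct_read = demo()
```

In this toy setup the cached path answers with the old value until something invalidates the entry, which is the kind of consistency gap and extra operational work Bulkowski attributes to a separate cache layer.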

Supporting the team

“The architecture that I have seen succeed in internet companies is one where what is considered the decision engine, and is using a variety of frameworks and technologies, becomes your compute server.”

One of the benefits of this approach is for the team, said Bulkowski. The riskiest thing in starting a new project is getting a great team together, which means that you need to give them the tools they are familiar with. If you shackle them with tools they don’t know or that are not appropriate to the task, that increases the risk, said Bulkowski. If you tell them that they must use SQL and SQL is not the best tool for the job, they are going to have a harder time.

The best architecture for these systems is one of polyglot languages and CPU in a cloud-oriented, scalable system, with a separate data layer that is expandable, clustered, and has all of the NoSQL-style characteristics, said Bulkowski.

Split between analytics and operational

There is a growing belief that you should be able to see analytics on your front edge or transactional data, noted Bulkowski. But doing complex SQL queries on operational data is difficult if not impossible.

“However, when I look at what has happened in Aerospike, I see that the operational use of Aerospike is providing next-generation analytics. People do a lot of machine learning on their real-time, up-to-the-moment data. But the way that they break down the problem is not complex queries. That is the key to the whole thing,” said Bulkowski. For those who want an HTAP system, he recommended this data and application architecture, with scalable CPU and a separate back-end system. You still need your analytics warehouse, your data lake and a sandbox to charge back system for providing that type of research back end; all of that is necessary, and the data warehouse does not go away.
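As a rough illustration of breaking an analytic problem into key lookups rather than complex queries, consider the Python sketch below. Everything in it is an assumption made for the example: the `feature_store` dictionary stands in for a key-value database, and the record fields, weights, and `score` function are hypothetical rather than anything described in the talk.

```python
# Hypothetical key-value store: one precomputed record per user key.
feature_store = {
    "user:42": {"visits_last_hour": 3, "cart_value": 120.0},
}

# Hypothetical model weights, assumed to be trained offline in the
# analytics back end and shipped to the compute tier.
WEIGHTS = {"visits_last_hour": 0.5, "cart_value": 0.01}


def score(user_key, store=feature_store, weights=WEIGHTS):
    """Score one user with a single key lookup and a weighted sum."""
    record = store.get(user_key)   # one read by key, no joins or scans
    if record is None:
        return 0.0                 # unknown user: neutral score
    return sum(w * record.get(name, 0.0) for name, w in weights.items())


known_user_score = score("user:42")
unknown_user_score = score("user:999")
```

The per-request work here is one read by key plus arithmetic in the application tier, so the data layer never has to execute a join or a scan on the operational path.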

But what these high-performance databases like Aerospike and DynamoDB enable is the ability to have a real-time system with hundreds of TBs of data doing millions of transactions per second in a cost-effective fashion, and since they are key-value-oriented they don’t get in the way of each other. That is part of the key idea here. You scale out your CPUs, you have got some CPUs for these more analytics processes, some for true transactions, and then a key-value-oriented database with some query capability for maintenance. “These kinds of architectures are the ones that I have found succeed.”

Many conference presentations have been made available by speakers at www.dbta.com/datasummit/2017/presentations.aspx.