There's no question that databases are the heart of nearly every application running these days. Moreover, the information stored in databases is now being routinely used as a competitive and operational business weapon by all businesses and organizations regardless of size or industry. Whether used internally in business intelligence applications or utilized externally via the exposure of data tools that let customers view and search through vast amounts of data on websites, data is being maximized in many different ways.
Increasingly, the requirement to quickly search through, read, and mine large amounts of data is becoming a necessity. This being the case, IT staffs are being asked to focus on delivering query-based and analytic systems that are very high-performant, which is often not easy to do when progressively larger data volumes are combined with many concurrent users and stringent service level agreements that contain demanding response time requirements.
These types of query and analytic systems generally fall into three main types:
Data warehouses, data marts, and analytic applications: data from one or more traditional online processing systems are fused together and loaded into a database designed specifically to support demanding business intelligence applications and ad-hoc query customers.
Reporting databases: these are typically more simplified data warehouses or marts where a transactional system is basically cloned onto a different box and database and used for decision support purposes. Sometimes the reporting database and transactional system are linked together via data archiving methods to provide one logical view of the data.
Hybrid online systems: popularized by large websites, the hybrid system separates read-intensive portions of an online application and directs all query traffic to one set of databases and then all online transaction processing work to a different set of databases. As the read portion of an application can grow quickly given certain conditions (e.g., the storage of past customer history made available for search), the read side of the application oftentimes employs multiple machines and a partitioning scheme to service all the various query needs.
How can IT personnel design and deploy databases systems that can deliver the performance needed for the online world and do so without breaking the bank? Here are six recommendations for building high-performance query and analytic databases that will both meet the requirements of today and tomorrow.
1. Only Read the Data You Need. This seems like a fairly obvious thing to strive for: avoid doing any unnecessary read operations so queries will run faster. Unfortunately, in query and analytic systems it isn't as easy as one might think. Legacy RDBMS's are row-architected, meaning that all I/O operations ultimately act upon an entire row of data even when only one or two columns of data are needed. Traditional databases try and get around this via indexing, data partitioning, materialized views, and much more, but a lot of the time these techniques fall short. What's needed instead is a new approach.
Recommendation: Use column-oriented databases for read-intensive applications. Column-oriented databases designed especially for analytics overcome the limitations that exist in traditional RDBMS systems by storing, managing, and querying data based on columns rather than rows. Because only the necessary columns in a query are accessed rather than entire rows, I/O activities as well as overall query response times can be reduced. The end result is the ability to interrogate and return query results against either moderate amounts of information (tens or hundreds of GB's) or large amounts of data (1-n terabytes) in less time that standard RDBMS systems can.
2. Exploit Modern Hardware. Whatever database you choose must be capable of addressing all available CPU's/cores on a machine and parallelizing the queries across those processors. This means you should use a database that's multi-threaded in nature.
Recommendation: Choose a column-oriented database with multi-threading and parallel processing across CPU's. In doing this, you'll be exploiting the hardware resources you have and using the best architecture for reading large amounts of data.
3. Divide and Conquer. Not only should your system designs include a column database that can scale up on one machine, but you should ensure that you can also employ a divide-and-conquer approach across multiple machines. Sometimes this can be accomplished via a database's built-in replication functionality, but oftentimes this approach necessitates application programming layers that smartly direct read traffic to the machines where it needs to go.
Recommendation: Choose a database that is massive parallel processing (MPP) capable. This is the easiest divide-and-conquer approach as MPP databases automatically distribute work across all participating nodes and offer linear performance gains when additional nodes are added.
4. Scale Both I/O and Concurrency. When IT staff think ‘scale out', oftentimes they only think of distributing and scaling I/O activity and don't consider the need to scale concurrent user activity. Even idle user connections eat up resources and overhead, so it's important to include a scaling feature in your analytic systems that allows for the growth of user concurrency as well as added data volumes and I/O processing.
Recommendation: Utilize a modular architecture that separates initial/core user query processing from the heavy I/O activities that are responsible for most of the overall database work. This may mean using multiple front ends that act as query entry points and connect into the database I/O layer.
5. Provide for Transparent Expansion and Failover. If there's one phrase that's been repeated by each and every successful business, it's this: "I never imagined that we'd grow so fast!" Naturally, this is a good problem to have, but it does mean that you need to be able to transparently add capacity when needed. Further, you also need to have good failover mechanisms in place so you experience no downtime should a tech hiccup in the system occur.
Recommendation: Use an MPP-based database that allows for online node additions and has built-in failover and redistribution of work. If that can't be done, then a reliable replication paradigm should be put in place that's coupled with either application partitioning mechanisms to redirect traffic or good load balancers.
6. Keep a Sharp Eye on the Budget. The price tag for all the above can add up fast, especially when things are done in a hurry and data growth is mounting by the hour.
Recommendation: Use open source/core software and commodity hardware. According to Gartner Group, some 66% of companies are either already using open source databases or planning to very soon. Long gone are the days when open source software wasn't ready for prime time; today open source software supports nearly every Web site and more traditional businesses and government agencies each day. Don't discount the positive impact open source software can have on your budget.
Read-intensive applications - whether they are data warehouses, reporting systems, or hybrid online applications - are growing in number every day, so chances are you're already designing a new query/analytic system or about to. By following the recommendations above, you'll be able to ensure your analytic database can meet all the demands that are thrown at it today and be future-proofed for years afterward as well.
About the author:
Robin Schumacher is vice president for products at Calpont, a provider of high performance analytic database solutions. Schumacher oversees the company's product line and brings over 20 years of practical database experience to Calpont. He has deep experience in DB2, Teradata, Oracle, Sybase, Microsoft SQL Server, and MySQL, has written extensively in print and online on the subject of database design and performance, and is the author of three database performance books. Prior to joining Calpont, Schumacher was the director of product management for MySQL and previously was the vice president of product management for Embarcadero Technologies.