Database sizes have grown exponentially, with more than half of all databases in use globally projected to exceed 10TB by 2010, according to The Data Warehouse Institute. But as the size of data warehouses has exploded and business requirements have forced companies to conduct more ad hoc queries on those warehouses, response times have slowed, requiring increasingly expensive database hardware investments.
Column-oriented relational databases - which store data in columns rather than in rows - were developed in the mid-1990s as a way to improve query speed and reduce data storage and retrieval overhead for data warehouses and other read-intensive applications. This article defines column databases, describes the next generation and associated benefits, and sheds light on how companies can use them to better capitalize on business opportunities in real time.
A Columnar Approach
Ingres, Postgres and other early databases shared a common architecture that became the blueprint for most relational database management system (RDBMS) products still in use today. Data is logically arranged in tables of records, where each record stores information about one entity (e.g., the name, salary, and age of an employee), with multiple rows representing information about different entities of the same class (e.g., different employees). Most databases use a “row-oriented” layout, moving horizontally across the table and storing the fields of each record consecutively on disk. Row-oriented architectures are optimized for write-intensive online transaction processing (OLTP), since it is easy to add a new record or read all of the fields from a single record when it is arranged in this way.
Because disks typically read data in blocks of several hundred bytes at a time, it is essentially impossible to just read the data of one column from a given record in a row store. This means that any row-store query will read data from all of the columns in a table. If a query is only interested in four columns of a 20-column-wide table, it will read four times more data than is needed.
On the other hand, column-store architectures arrange data in disk by writing values from the same column consecutively on disk, effectively placing each column in a separate file. This means that they can read only the data in pertinent columns eliminating much of the I/O of row-store architectures. Since disk I/O often dominates query performance in database systems, many queries in column-stores inherently run much faster than in row-stores, delivering results when business managers need them.
Making Compression Count
Though compression is widely used in database systems, column databases can exploit compression much more effectively than their row-oriented counterparts. That’s because compression algorithms work by eliminating redundancy in data, and successive elements in a column are fundamentally more similar, and therefore more redundant, than successive rows in a table.
Consider, for example, a year of sales transactions stored in a 100 million-row table. If the table is sorted on transaction date, many consecutive values will have the same date. In this case, we can reduce the length of the column from 100 million rows to just 365 rows by storing each date of the year along with the number of appearances of each date. Such compression can only be applied in column stores because row stores will likely have no repeated tuples for this kind of run-length encoding (RLE).
Column databases are also able to exploit compression by operating directly on compressed data, unlike many databases in use today. Consider our column of compressed dates; if the database has to evaluate a predicate like (Date = “01/01/2006”), it can apply this predicate directly to the encoded RLE value “01/01/2006,270000” (assuming there were 270,000 transactions on Jan. 1) , without decompressing. This compression reduces both I/O costs and total CPU query processing time giving column databases a significant performance advantage versus databases that have to decompress data before operating on it.
Projecting Efficiency
Querying data stored in a series of columns requires some way to establish correspondence between fields from the same records in different columns. The simplest way to do this is to ensure that the positions in the different files line up - e.g., the nth entry in column 1 belongs to the same tuple as the nth entry in column 2, and so on. When columns are compressed, finding the nth position is slightly complex; a typical technique is to store a sparse index that indicates the range of tuple positions that are stored in every few kilobytes of a compressed column. This index can be used to find approximately where a given position begins, and those positions can be scanned to find the exact records of interest
A collection of columns will usually be stored in a particular sort order - that is, where a column or collection of columns are sorted, and other columns are ordered in the same way. Sorting is particular important because many compression techniques (including RLE) work best for columns that are sorted or nearly sorted. Furthermore, sorted columns can be quickly scanned to find values of interest.
Hence, query processing time for a given query will be minimized if one or more of the columns accessed by that query are stored in sorted order. For that reason, it’s useful for a column database to store each table on disk as projections - collections of sorted columns - spread across a collection of networked, shared-nothing machines. Though it may seem that such additional copies of data are wasteful of disk space, the space reductions offered by column-oriented compression often mean that a column-based database can store many projections in the space that a traditional database would store just one copy of the data.
Multiple projections containing overlapping subsets of columns with different sort orders ensure high availability and enhance performance by enabling the DBMS to execute queries against the projection(s) with the most appropriate columns and sort orders. The trick is to have a query optimizer that knows some of the data is compressed and knows how to act on it directly. For example, certain orders, such as those sorted on one or more of the attributes (customer state, and then customer name) – allow the system to rapidly find records of interest (such as every customer named Jones in Florida).
Storing data in columns in multiple sort orders across shared nothing systems also provides companies with redundancy to safeguard against data loss and other performance issues due to failures. Column stores enable DBMS solutions to query projections not only to handle user requests, but also to rebuild the data in a recently restored projection or node. The DBMS can build the necessary redundancy into the projections it creates so that a number of node failures (pre-determined by a DBA) can occur without compromising the system.
Adding up the Column
Aside from the significant performance benefits that column store architectures present, they also improve the bottom line. The aggressive compression coupled with the multiple sort order projections of column databases allow companies to reduce hardware costs because they don’t need as much disk space to store and access their data. Instead of spending millions of dollars on proprietary iron, companies can build a cost-efficient infrastructure with standards-based blades and servers that easily and affordably scales with demand. As databases continue to grow and frontline business managers continue to conduct more detailed queries in real time, the benefits of such column-oriented architectures will become more and more evident.