The tide is changing for analytics architectures. Traditional approaches, from the data warehouse to the data lake, implicitly assume that all relevant data can be stored in a single, centralized repository. But this approach is slow and expensive, and sometimes not even feasible, because some data sources are too big to be replicated, and data is often too distributed to make a “full centralization” strategy successful.
That is why leading analysts such as Gartner and Forrester recommend architectures such as the logical data warehouse. In these architectures, data is distributed across several specialized data stores such as data warehouses, Hadoop clusters, and cloud databases, and a common infrastructure provides unified querying, administration, and metadata management. Logical architectures are the only option when data collection/replication becomes infeasible, and, in addition, they greatly reduce the need for effort-intensive ETL processes, providing much shorter times to production and significant cost reductions. Gartner recently estimated cost savings of 40% from deploying logical architectures, and some companies report even bigger improvements.
Logical architectures for analytics are typically implemented using data virtualization (DV), which makes all the underlying data sources appear to be a single system with a unified query interface (see Figure 1). Data virtualization creates real-time logical views over several sources and publishes the results to other applications through multiple interfaces, such as JDBC, ODBC, or REST data services. This way, consuming applications do not need to know where the data resides or which query language each source system uses, and they are also insulated from changes in data management technologies, such as a migration from Hive to Spark.
Figure 1: The logical data warehouse architecture
It is also straightforward to create different logical views over the same physical data, adapted to the needs of each type of user. Furthermore, data virtualization provides a single entry point to apply global security and governance policies across all the underlying systems.
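As a simple illustration of this abstraction, the sketch below shows how a consuming application might query a virtual view exposed by the DV layer over ODBC. The DSN, credentials, view, and column names are hypothetical placeholders, and pyodbc is just one of many ODBC clients that could be used; the point is that the consumer issues an ordinary SQL query and never references the underlying EDW, Hadoop, or CRM systems.

```python
# Hypothetical consumer-side access to a virtual view published by the DV layer.
# The DSN, user, view name, and columns are illustrative placeholders.
import pyodbc

conn = pyodbc.connect("DSN=dv_server;UID=report_user;PWD=secret")
cursor = conn.cursor()

# The consumer sees one logical view; it does not know whether the rows behind
# it come from the EDW, the Hadoop cluster, or the CRM database.
cursor.execute(
    """
    SELECT country, SUM(amount) AS total_sales
    FROM v_sales_with_customer   -- virtual view joining sales and customer data
    GROUP BY country
    """
)
for country, total_sales in cursor.fetchall():
    print(country, total_sales)

conn.close()
```

A REST data service published over the same view would return the same result to clients that cannot use ODBC or JDBC.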
Nevertheless, to realize the full potential of logical architectures, it is crucial that the data virtualization system include query optimization techniques specifically designed for combining large distributed datasets. However, many of the data federation/virtualization systems on the market reuse the query optimizers of conventional databases with only slight adaptations. This is the case with the data federation extensions recently introduced by some database and BI tool vendors. Those optimizers cannot apply the most effective optimization techniques for logical architectures, and the associated performance penalty can be very significant.
More precisely, there are two types of capabilities needed to achieve the best performance in these scenarios:
1. Applying automatic optimizations to minimize network traffic, pushing down as much processing as possible to the data sources.
2. Using parallel in-memory computation to perform, at the DV layer, the post-processing operations that cannot be pushed down to the data sources.
The following example illustrates how both capabilities work; the figure below depicts a simplified logical data warehouse scenario:
• An enterprise data warehouse (EDW) contains the sales from the current year (290 million rows), and a Hadoop system contains the sales data from previous years (3 billion rows). Sales data include, among other information, the customer ID associated with each sale.
• A CRM database contains customer data (5 million rows, one row per customer). The information for each customer includes the customer's country of origin.
Figure 2: Simple federation (A) versus advanced data virtualization approach (B)
Parts A and B of the figure show two alternative strategies to compute a report of total sales by customer country for the last two years. As depicted, the report needs sales data from both the current and previous years, plus the country of origin of each customer, so it must combine data from all three data sources.
Data federation tools that use extensions of conventional query optimizers would compute this report using Strategy A, while DV tools with optimizers designed for logical architectures would use Strategy B. In fact, the most sophisticated DV tools would consider additional strategies beyond Strategy B and choose the best one using cost information.
In Strategy A, the federation tool pushes down the filtering conditions to the data sources and retrieves the data required to calculate the report. Since the report includes a filtering condition (only sales from the last two years are needed), only 400 million rows are retrieved from Hadoop instead of the full 3 billion. In addition, the report requires all 290 million rows from the EDW and all 5 million rows from the CRM. Therefore, even though the filters have been pushed down to the data sources, Strategy A still needs to transfer 695 million rows through the network and then post-process all of that data in the federation system, so execution will be very slow.
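As a quick sanity check, the row counts quoted above (taken directly from the scenario figures) add up as follows:

```python
# Rows transferred over the network under Strategy A (figures from the text).
hadoop_rows = 400_000_000   # previous-year sales that survive the 2-year filter
edw_rows    = 290_000_000   # all current-year sales
crm_rows    =   5_000_000   # one row per customer

print(f"{hadoop_rows + edw_rows + crm_rows:,} rows")   # 695,000,000 rows
```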
In Strategy B, by contrast, the optimizer of the data virtualization system introduces additional group-by operations, shown as the circles above the EDW and Hadoop data sources, to divide the computation of the report into two steps. The first step is executed at the EDW and Hadoop systems, each of which computes the total sales by customer for the data it holds. The second step is performed at the data virtualization system: it adds up the partial results obtained for each customer in both data sources and groups the resulting data by country using the information retrieved from the CRM. Since the first step does not require information about the country of origin, it can be entirely pushed down to the data sources. This means at most 5 million rows (one per customer) need to be retrieved from each of the EDW and Hadoop systems, so network traffic is drastically reduced, from 695 million rows to 15 million.
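The sketch below simulates Strategy B on toy in-memory data. It is a conceptual illustration, not actual DV engine code: the `pushed_down_group_by` function stands in for the partial `GROUP BY customer_id` that the optimizer pushes into the EDW and Hadoop, so only one (customer ID, partial total) row per customer ever crosses the network, and the DV layer merges those partial totals, joins them with the CRM's country data, and groups by country.

```python
# Minimal sketch of Strategy B with toy data. The per-source function stands in
# for the partial GROUP BY that the DV optimizer pushes down to the EDW and
# Hadoop; only one (customer_id, partial_sum) row per customer crosses the network.
from collections import defaultdict

# Toy source data: (customer_id, amount) sales rows and a CRM country table.
edw_sales    = [(1, 100.0), (2, 50.0), (1, 25.0)]   # current-year sales
hadoop_sales = [(1, 300.0), (3, 80.0), (2, 20.0)]   # previous-year sales
crm_country  = {1: "US", 2: "DE", 3: "US"}          # one row per customer

def pushed_down_group_by(sales_rows):
    """Simulates 'SELECT customer_id, SUM(amount) ... GROUP BY customer_id'
    executed inside the data source itself."""
    partial = defaultdict(float)
    for customer_id, amount in sales_rows:
        partial[customer_id] += amount
    return partial                                   # at most one row per customer

# Step 1: executed at the sources. Step 2: executed at the DV layer.
edw_partial = pushed_down_group_by(edw_sales)
hadoop_partial = pushed_down_group_by(hadoop_sales)

sales_by_country = defaultdict(float)
for customer_id in set(edw_partial) | set(hadoop_partial):
    total = edw_partial.get(customer_id, 0.0) + hadoop_partial.get(customer_id, 0.0)
    sales_by_country[crm_country[customer_id]] += total   # join with CRM + GROUP BY country

print(dict(sales_by_country))   # -> totals: US 505.0, DE 70.0
```

Running the same logic without the pushed-down step would require shipping every individual sales row to the DV layer, which is exactly the difference between Strategies A and B.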
Notice that these are the kinds of techniques only an optimizer specifically designed for logical architectures will consider. A conventional query optimizer designed to work in a physical architecture would not add extra operations to the query plan, as Strategy B does, because doing so usually makes no sense in a physical environment.
The second capability required for best performance in logical architectures is parallel in-memory computation. Notice that in Strategy B, the data virtualization system still needs to post-process 15 million rows. While this can be done in acceptable time with conventional serial execution, the process can be further accelerated using a parallel in-memory grid, as illustrated in the upper-right part of the figure.
The in-memory grid should be installed on a cluster of machines connected to the DV system through a high-speed network. When the DV system needs to post-process a significant amount of data, it can use the in-memory grid to execute those operations in parallel. For this, the DV optimizer should partition the data across the cluster nodes to maximize the degree of parallelism. In this example, the DV optimizer would partition the data by "customer_id," ensuring that all the sales from the same customer end up in the same node, so the join and group-by post-processing operations can be parallelized almost in full. Additionally, the data does not need to be materialized on disk at either the DV system or the in-memory grid, so it can be streamed directly as it arrives from the data sources. As a result, parallel computation starts almost immediately, with the first chunk of data.
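The following sketch captures the idea of that parallel post-processing step, with a plain Python process pool standing in for the in-memory grid; a real DV product would use its own grid and stream the data instead of building these dictionaries up front. The per-customer totals and the CRM rows are hash-partitioned by customer ID so that each "node" can join and aggregate its share independently, followed by a small final reduce at the DV layer.

```python
# Conceptual sketch: parallel join + group-by over data hash-partitioned by
# customer_id, with a process pool standing in for the in-memory grid.
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

NUM_NODES = 4   # stand-in for the nodes of the in-memory grid

def aggregate_partition(args):
    """Runs on one 'node': join its partial sums with its CRM rows, group by country."""
    partial_sums, countries = args
    by_country = defaultdict(float)
    for customer_id, amount in partial_sums.items():
        by_country[countries[customer_id]] += amount
    return dict(by_country)

def partition_by_customer(table, num_nodes):
    """Hash-partition a dict keyed by customer_id across the grid nodes."""
    parts = [dict() for _ in range(num_nodes)]
    for customer_id, value in table.items():
        parts[hash(customer_id) % num_nodes][customer_id] = value
    return parts

def parallel_sales_by_country(per_customer_totals, crm_country):
    sums_parts = partition_by_customer(per_customer_totals, NUM_NODES)
    crm_parts  = partition_by_customer(crm_country, NUM_NODES)   # co-partitioned join side
    totals = defaultdict(float)
    with ProcessPoolExecutor(max_workers=NUM_NODES) as grid:
        for node_result in grid.map(aggregate_partition, zip(sums_parts, crm_parts)):
            for country, amount in node_result.items():          # final reduce at the DV layer
                totals[country] += amount
    return dict(totals)

if __name__ == "__main__":
    per_customer_totals = {1: 425.0, 2: 70.0, 3: 80.0}   # merged partial sums from Strategy B
    crm_country = {1: "US", 2: "DE", 3: "US"}
    print(parallel_sales_by_country(per_customer_totals, crm_country))  # -> US: 505.0, DE: 70.0
```

Because both inputs are partitioned on the same key, no data needs to move between "nodes" during the join, which is what allows the post-processing to be parallelized almost in full.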
Using the parallel in-memory grid can make the post-processing operations much faster than in conventional serial architectures. However, parallel in-memory computation does nothing to minimize network traffic: in the example above, parallel databases with simple federation capabilities would still use Strategy A. This means 695 million rows would need to be transferred through the network before the parallel processing even starts, resulting in query execution times that are unacceptable for many applications. Therefore, both the advanced techniques for minimizing network traffic and the parallel in-memory computation capabilities are needed to achieve the best performance.
Benefits of Redefining Analytics Architectures
Logical architectures for analytics provide shorter times to production and are cheaper to create and maintain. Data virtualization is a key component of these architectures, providing them with abstraction, unified query execution, and unified security capabilities. To guarantee the best performance, it is crucial that the query optimizer of the data virtualization system include techniques specifically designed to minimize network traffic in logical architectures. In addition, the query optimizer should leverage in-memory parallel processing to perform the post-processing operations that cannot be pushed down to the data sources.