Despite the increasing focus on offering more access to more users in organizations, ad hoc querying of big data remains a problem for most, according to Jair Aguirre, lead data scientist at Booz Allen Hamilton, who presented a session at Data Summit 2016 titled “De-Siloing Data Using Apache Drill.”
Many studies have shown that data scientists spend 50–90% of their time gathering and preparing data. The problem keeps repeating itself, said Aguirre, who explained that the recurring theme is that people can’t get to data in various formats and locations, making querying a major challenge.
If you are a CIO or CTO, you want your data scientists writing algorithms, not getting data from here to there, said Aguirre. It’s a cliché to even talk about data volumes nowadays, yet at every conference, speakers present slides showing that data is being collected at rates that organizations can’t keep up with. “We have that problem, everybody understands it,” he said. But what falls by the wayside is that people are also finding new formats and new data stores all the time. “As we go forward, we can expect new data formats to come online. We can expect new data formats to come online in databases that we haven’t even imagined yet. And all those data formats and those data silos take a certain level of skill to set up and interact with so this is another costly issue,” he added.
As data volumes and data sources go up, the time to analytics also increases while profits go down. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in place using standard ANSI SQL, without having to define complex schemata or rebuild their entire data infrastructure. In the session, Aguirre walked attendees through big data analytics use cases with the status quo using various big data technologies and also illustrated how Drill can be used to query a variety of data sources more easily.
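To make the idea of querying in place concrete, here is a minimal Python sketch that prepares and submits a SQL statement to Drill's REST query endpoint. It is an illustration, not code from Aguirre's session. It assumes a Drillbit running locally with its web port at the default 8047 and the `dfs` storage plugin enabled; the file path `/data/logs/clicks.json` and the column `user_id` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Drillbit exposing its REST API on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_request(sql, url=DRILL_URL):
    """Build the POST request that carries a SQL query to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows as a list of dicts."""
    with urllib.request.urlopen(build_request(sql, url)) as response:
        return json.load(response).get("rows", [])


# Hypothetical file and column: the JSON file is read where it sits,
# with no table definition or schema declared beforehand.
SQL = (
    "SELECT t.user_id, COUNT(*) AS clicks "
    "FROM dfs.`/data/logs/clicks.json` t "
    "GROUP BY t.user_id "
    "ORDER BY clicks DESC "
    "LIMIT 10"
)
```

Calling `run_query(SQL)` against a live Drillbit would return the matching rows. The point of the sketch is that the query names the file directly, with no load step or schema definition in between.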
While many big data technologies clearly offer advantages, Aguirre emphasized that for ad hoc queries where you are trying to understand your data, Drill is a technology with strong benefits. Key advantages, he noted, are exploration and interaction, and ease of configuration. It offers schema-less reads and full ANSI SQL capability, and it handles complex data, with no need for Scala, Java, or Python. In addition, users can query directly from popular BI tools using ODBC and JDBC connectors. “It sends a query to Apache Drill and Apache Drill parses the query and then you can interact with the data that way for push-button analytics.” In addition, for security, views can be set up that allow access only to data that users have permission to handle. “If you don’t have access to certain data on the data silo, you just won’t see it. You can run the query as you would normally but you won’t get the results from it,” Aguirre said.
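The view-based approach to security that Aguirre described could look roughly like the following sketch, which builds a Drill `CREATE OR REPLACE VIEW` statement exposing only selected columns of a file. It assumes the writable `dfs.tmp` workspace from Drill's default configuration; the helper `create_view_sql`, the view name, the file path, and the column names are all hypothetical. Whether a given user can actually read the view or the underlying file depends on Drill's impersonation and file-system permission settings, which are not shown here.

```python
def create_view_sql(view_name, source, columns, workspace="dfs.tmp"):
    """Return a Drill CREATE VIEW statement exposing only `columns`.

    All arguments are illustrative; nothing here is taken from the talk.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Backticks quote identifiers in Drill SQL.
    column_list = ", ".join(f"t.`{c}`" for c in columns)
    return (
        f"CREATE OR REPLACE VIEW {workspace}.`{view_name}` AS "
        f"SELECT {column_list} FROM {source} t"
    )


# Hypothetical example: publish three non-sensitive columns of a CRM file.
STATEMENT = create_view_sql(
    view_name="customers_public",
    source="dfs.`/data/crm/customers.json`",
    columns=["customer_id", "region", "signup_date"],
)
print(STATEMENT)
```

Analysts would then query `dfs.tmp.customers_public` rather than the raw file, so columns left out of the view never appear in their results.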
“Where Apache Drill really shines is in being able to interact with different data formats from different data silos at the same time in the same query,” said Aguirre. With Apache Drill, the data scientist is spared “ETL heroics” because nothing needs to be moved, said Aguirre. For CIOs and CTOs, this means more profit from the data. “It is really magical and Apache Drill does the magic for you.”
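A single query spanning two silos might be sketched as below: a JSON file on the file system joined to a MongoDB collection. This is an illustration under stated assumptions, not an example from the session. It assumes storage plugins named `dfs` and `mongo` are enabled in Drill; the database `crm`, the collection `customers`, the file path, and every column name are hypothetical.

```python
# Hypothetical sources: a JSON file on the file system and a MongoDB
# collection, each reached through its own Drill storage plugin.
ORDERS = "dfs.`/data/sales/orders.json`"
CUSTOMERS = "mongo.crm.`customers`"


def join_sql(orders, customers, limit=20):
    """Build one SQL statement that joins two differently stored datasets."""
    return (
        "SELECT c.region, SUM(o.amount) AS total_spent "
        f"FROM {orders} o "
        f"JOIN {customers} c ON o.customer_id = c.customer_id "
        "GROUP BY c.region "
        "ORDER BY total_spent DESC "
        f"LIMIT {int(limit)}"
    )


QUERY = join_sql(ORDERS, CUSTOMERS)
print(QUERY)
```

Neither dataset is copied or converted first. The join is expressed in ordinary SQL, and each source is addressed by its storage plugin prefix.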
Aguirre has made the slides from his presentation available at https://www.dbta.com/DataSummit/2016/Presentations.aspx. Many additional presentations from Data Summit 2016 have also been made available for review.
Data Summit is an annual 2-day conference, preceded by a day of workshops. Data Summit offers a comprehensive educational experience designed to guide you through all of the key issues in data management and analysis today. The event brings together IT managers, data architects, application developers, data analysts, project managers, and business managers for an intense immersion into the key technologies and strategies for becoming a data-informed business.