From centralized data acquisition and offloading to data discovery and data science projects, data lakes are on the rise at enterprises today. In a recent DBTA study, more than a third of respondents counted themselves as data lake users, and another 15% reported near-term adoption plans. The benefits of inexpensive storage, data democratization, and greater flexibility and scalability are easily understood in theory. Establishing effective processes for integrating, securing, and governing that data, however, is a far more complicated endeavor, and this is where the rubber meets the road. Register for Data Lake Boot Camp today for a deep dive into the latest supporting technologies, best practices, real-world success factors, and expert insights.
Access to Data Lake Boot Camp is included when you register for an All Access or Full Two-Day Conference Pass, or it is available as a standalone registration option. View all our registration options here.
Tuesday, May 21: 10:45 a.m. - 11:45 a.m.
A new data platform approach is needed to extend the data warehouse and address the vast quantity and variety of data flowing into organizations, much of it unstructured.
10:45 a.m. - 11:45 a.m.
The data warehouse is experiencing pressure from increasing data volumes, more users, and tight budgets—a triple threat to its ongoing existence and value. In addition, new data types are coming into play. This increased pressure means the old-school data warehouse may not be delivering insights at the speed of business. There are a number of alternatives to meet modern analytics infrastructure needs. This presentation outlines in detail why a modern data platform is required to deliver on new analytics demands.
Lynda Partner, VP, Products and Offerings, The Pythian Group
10:45 a.m. - 11:45 a.m.
In a data fabric, the data discovery and integration layer maps all enterprise data in its original business context so that users can find and blend data from diverse siloed sources into analytic-ready datasets on demand. Join Cambridge Semantics CTO Sean Martin to hear how companies are using data discovery and integration solutions to exploit enterprise data for transformational analytic and machine learning projects.
Sean Martin, CTO, Cambridge Semantics
Tuesday, May 21: 12:00 p.m. - 12:45 p.m.
With the abundance of data stored in data lakes, finding the relevant information is increasingly challenging, particularly in light of the many formats in which the data appears.
12:00 p.m. - 12:45 p.m.
As organizations realize the power of data lakes, more and more of their data, in a variety of formats and standards, is being made available there. Given this plethora of information, it is increasingly daunting for users to find the data of interest to them using conventional analytical tools. Data discovery tools that combine semantic search and concept search bring the right blend of capabilities, enabling "comparison shopping" between seemingly similar datasets so that end users can evaluate the best fit, while facilitating the discovery and reuse of all available information and data assets, both internal and external.
Subhayan Das, Associate Director-Digital Capability Management, R&D Data Lakes and Integrations, Bristol-Myers Squibb
Tuesday, May 21: 2:00 p.m. - 2:45 p.m.
As part of our deep dive into data lakes, our panel of experts contemplates success factors, failure avoidance, and new developments. Join us for an invigorating discussion.
Richard Sherman, Managing Partner, Athena IT Solutions
Sean Martin, CTO, Cambridge Semantics
Tuesday, May 21: 3:15 p.m. - 4:00 p.m.
Data lakes are highly appealing because they can support all types of data and maintain it in its original format for future use. Before diving in, it’s important to understand the components of a successful data lake implementation.
3:15 p.m. - 4:00 p.m.
Marmaray, Uber’s general-purpose Apache Hadoop data ingestion and dispersal framework and library, was open-sourced in 2018. Marmaray was envisioned, designed, and first released internally in late 2017 to fulfill the need for a flexible, universal dispersal platform that would complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online data store. Before Marmaray, each team built its own ad hoc dispersal system, which resulted in duplicated effort and an inefficient use of engineering resources.
Omkar Joshi, Software Engineer, Uber
3:15 p.m. - 4:00 p.m.
Rafael shares how an ad-tech company moved from a data warehouse to a fully functional data lake that processes over 400,000 events per second without writing a single line of code.
Ori Rafael, CEO & Co-Founder, UpSolver
Tuesday, May 21: 4:15 p.m. - 5:00 p.m.
New frameworks and platforms are enabling organizations to improve time-consuming processes and meet enterprise requirements for high performance.
4:15 p.m. - 5:00 p.m.
Apache Gobblin is a distributed data integration framework for both streaming and batch data ecosystems. This presentation covers how Gobblin powers several data processing pipelines at LinkedIn, with use cases such as the daily ingestion of more than 300 billion events across thousands of Kafka topics, metadata and storage management for several petabytes of data on HDFS, and near real-time processing of thousands of enterprise customer jobs. It also looks at the key Gobblin features that help LinkedIn build and run these data pipelines at extreme scale.
Krishnan Raman, Sr. Site Reliability Engineer, LinkedIn