Building performant ETL pipelines that meet analytics requirements is difficult as data volume and variety grow at an explosive pace.
With existing technologies, data engineers struggle to deliver pipelines that support the real-time insights business owners demand from their analytics.
In a recent DBTA webcast, Joe Widen, solutions architect at Databricks, and Singh Garewal, director of product marketing at Databricks, discussed how to avoid common data engineering pitfalls and how the Databricks Unified Analytics Platform can ensure performance and reliability.
The drivers of data engineering include advanced analytics and machine learning coming of age; industry-spanning adoption; technology innovation across hardware, cloud, and storage; increased financial scrutiny; and the evolution of roles such as the CDO and data curator, Widen and Garewal explained.
However, moving forward presents a variety of challenges, from failed production jobs that leave data in a corrupt state and require tedious recovery, to files that are too small or too large for efficient processing.
Data pipelines need to satisfy three goals:
- Deliver cleaner data in less time
- Provide simple data engineering at scale
- Enable efficient, faster queries for analytics
Databricks Delta is the next-generation unified engine, Widen and Garewal said. It co-designs compute and storage, is compatible with Spark APIs, and is built on open standards (Parquet).
The platform makes data lakes ready for analytics, makes data reliable and more performant, and supports Spark APIs and IoT streaming.
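To make the Spark compatibility concrete, here is a minimal sketch (not from the webinar) of what working with Delta looks like in PySpark. It assumes the delta-spark package is configured for the session, and the table path and column name are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package (delta-spark) is available to the session.
spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Write a DataFrame as a Delta table: the data is stored as Parquet files
# plus a transaction log that provides reliability guarantees.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back with the same DataFrame API used for any other Spark source.
spark.read.format("delta").load("/tmp/delta/events").show(5)
```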
In a real-world cybersecurity analysis use case, 93.2% of the records in a 504-terabyte dataset were skipped for a typical query, reducing query times by up to 100X, Widen and Garewal said.
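As a rough illustration of how such skipping works (the column, range, and path below are our own invention, not the presenters'): Delta records per-file statistics such as min/max column values, so a selective predicate lets the engine avoid reading files whose value ranges cannot match:

```python
# Hypothetical continuation of the sketch above: a selective filter lets
# Delta consult per-file min/max statistics and skip files whose ranges
# cannot satisfy the predicate.
hits = (spark.read.format("delta")
        .load("/tmp/delta/events")
        .filter("event_id BETWEEN 100 AND 200"))
print(hits.count())  # only files overlapping [100, 200] are scanned
```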
An archived on-demand replay of this webinar is available here.