The Importance of Data Engineering for Successful AI with Airbyte and Zilliz

Oct 17, 2024

By Sydney Blanchard

Collecting and using data effectively is crucial to supporting AI projects at enterprise scale. From data integration and data pipelines to AI performance, data governance, and compliance, adhering to data engineering best practices has never been more important for enabling an AI-powered future.

In DBTA’s latest webinar, Data Engineering Best Practices for AI, Brian Leonard, director of engineering at Airbyte, and Tim Spann, principal developer advocate for Milvus at Zilliz, shared their expertise on how data engineering can resolve common challenges in deploying and scaling AI effectively.

As the open source data movement company, Airbyte makes data actionable anywhere, enabling over 20,000 data and AI professionals to manage diverse data across multi-cloud environments, according to Leonard. Regarding Airbyte’s AI use case, many enterprises are leveraging the Airbyte platform to load first-party data into AI apps by extracting records from unstructured sources—such as Google Drive or Salesforce—and moving that data into lakehouses, where users can enable retrieval-augmented generation (RAG) and fine-tuning.

Leonard then took a closer look at the AI data pipeline, examining the journey from extraction to normalization, processing, and usage. Each phase of the pipeline incorporates the following processes:

Extraction: Data encryption, PII masking, pushdown filters, file transfer, permissions
Normalization: Schema normalization, data cleaning, deduplication
Processing: Enrichment, summarization, use case optimization, document chunking, embeddings calculation
Usage: Place embeddings into a queryable data store, such as Milvus, a vector database
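
To make the four phases concrete, here is a minimal Python sketch of one record flowing through them. It is illustrative only and was not shown in the webinar: every function and class name is hypothetical, the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model, and `InMemoryVectorStore` is a placeholder for a vector database such as Milvus rather than its actual API. Only one representative step per phase is shown (email masking, deduplication, chunking and embedding, similarity search).

```python
import hashlib
import math
import re

# Assumption: emails are the only PII we mask in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_pii(text):
    """Extraction phase: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(records):
    """Normalization phase: collapse whitespace, drop empties and exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        text = " ".join(record.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def chunk(text, max_words=8):
    """Processing phase: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text, dim=256):
    """Processing phase: toy hashed bag-of-words vector, L2-normalized.

    Hypothetical stand-in for a real embedding model.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Usage phase: placeholder for a vector database (not the Milvus API)."""

    def __init__(self):
        self.rows = []  # list of (vector, text) pairs

    def insert(self, text):
        self.rows.append((embed(text), text))

    def search(self, query, top_k=1):
        """Return the top_k (cosine similarity, text) pairs for the query."""
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


def run_pipeline(raw_records):
    """Run extraction, normalization, processing, and usage in order."""
    store = InMemoryVectorStore()
    masked = [mask_pii(record) for record in raw_records]
    for document in normalize(masked):
        for piece in chunk(document):
            store.insert(piece)
    return store


if __name__ == "__main__":
    store = run_pipeline([
        "Contact jane@example.com about the contract renewal",
        "Contact jane@example.com about the   contract renewal",  # duplicate after cleaning
        "Quarterly revenue grew in the retail segment",
    ])
    for score, text in store.search("retail revenue"):
        print(f"{score:.3f}  {text}")
```

In a production pipeline, each step would typically be handled by dedicated tooling, with the masking and deduplication rules, chunk size, and embedding model chosen for the use case.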

Spann expanded on the advantages of Zilliz’s Milvus, a high-performance, open source vector database built for scale. Vector search, Spann noted, is the new paradigm for AI, as “now, images, text, video, documents—everything is data, and vector search makes it searchable.” In fact, IDC predicts that 90% of newly generated data in 2025 will be unstructured, reflecting a crucial need for vector search.

Vector databases power search across a variety of use cases, from RAG to molecular similarity search, fraud and anomaly detection, multimodal similarity search, and more. Unstructured data, and the ability to extract knowledge from it, is fundamental to AI success.

Since 2017, Zilliz has been helping organizations make sense of unstructured data. Built by a top-tier team of algorithm and database engineers with a strong pedigree in developing high-performance, scalable, and highly available distributed systems, Zilliz was designed from the ground up for vector search and for the data engineering challenges associated with AI, Spann noted.

As a result, Milvus is a feature-rich vector database that is easy to set up and offers elastic scaling, reusable code, and expansive integrations, underpinned by a robust, supportive community. Spann then walked webinar viewers through how Milvus operates, detailing its structure, features, and more.

For the full, in-depth discussion of data engineering for the age of AI, you can view an archived version of the webinar here.