RAG Has a Dirty Little Secret: Here’s How to Clean It Up and Get GenAI Right


Retrieval-augmented generation (RAG) has become a go-to architecture for companies using generative AI (GenAI). Enterprises adopt RAG to enrich large language models (LLMs) with proprietary corporate data, giving the models the information they need to answer questions but were never trained on.

The healthcare, finance, and legal sectors are leading RAG adoption, Grand View Research reports. However, businesses across industries are now implementing RAG. It’s hard to ignore.

RAG can increase the accuracy of GenAI tools such as ChatGPT, enabling such technology to generate more relevant, context-aware responses. That can position companies to be more efficient and competitive by allowing them to make decisions and act faster, revolutionizing their businesses.

However, you can’t simply embrace RAG and consider all your data problems solved.

Enterprises have worked for decades to solve the problem of providing users with the right information at the right time. It’s an enduring challenge for business intelligence reporting, data warehouses and lakehouses, and dashboards. Now it’s a challenge for RAG as well.

For the best results with RAG, you need to build a solid data foundation. Here’s how to get started:

Provide RAG with the right information.

This first piece of advice is a bit of a no-brainer, so I’ll keep it brief. If you add information to the context window that is not relevant to the question asked, it won’t help the LLM provide a good answer. The answer quality will deteriorate, or the AI will hallucinate. Be sure the information you’re adding as part of your RAG process is relevant to the question by utilizing advanced RAG techniques, such as reranking.
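To make the point concrete, here is a minimal sketch of reranking in Python, assuming the open source sentence-transformers package; the model name and the candidate list are placeholders for whatever your retriever already produces.

```python
# Minimal reranking sketch (assumes the sentence-transformers package;
# the retriever and model name are placeholders for your own pipeline).
from sentence_transformers import CrossEncoder

def rerank(question: str, candidate_chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each retrieved chunk against the question and keep the best ones."""
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(question, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    # Only the top-scoring chunks go into the context window.
    return [chunk for chunk, _ in ranked[:top_k]]
```

Only the chunks that survive the reranking step are added to the context window, which keeps irrelevant material from dragging down answer quality.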

Ensure that your documents are up-to-date.

Imagine that a corporate user asks, “When is Thanksgiving?” Your RAG solution might add a line from a 2022 company holiday document to the context, so your LLM answers with an outdated date.

This is not a hallucination. The problem is that the information the LLM had access to was outdated. Make sure RAG has access to the most current data. Also, ensure that all documents, chunks, embeddings, etc., are fully tracked. Employ technology that assigns your documents a content hash.

That way, for example, if someone makes a change to the corporate travel policy, your solution will trigger a notification that the embedding—which RAG created to represent the meaning of a particular segment of a document—needs to be re-created.
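As an illustration, here is a minimal sketch of the content-hash idea in Python; the in-memory registry is a stand-in for whatever catalog your pipeline actually uses.

```python
# Content-hash sketch: detect when a document has changed so its chunks
# and embeddings can be re-created. The in-memory registry is a stand-in
# for whatever catalog your pipeline actually uses.
import hashlib
from pathlib import Path

hash_registry: dict[str, str] = {}  # document ID -> last known content hash

def content_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_reembedding(doc_id: str, path: Path) -> bool:
    """True if the document changed since its embeddings were last created."""
    current = content_hash(path)
    changed = hash_registry.get(doc_id) != current
    hash_registry[doc_id] = current
    return changed
```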

In other cases, someone at your company may move a document used by RAG. For example, a document may move from an on-prem system to the cloud. When that happens, it normally means your RAG process can no longer be validated because it can no longer find the original document.

Solve this by implementing a solution that assigns each stored document a unique identifier, so that the RAG solution can still locate it. This way, everything keeps working even if you move things around.
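Here is a hedged sketch of that idea: the pipeline references documents by a stable identifier, and a registry (hypothetical here) maps the identifier to the document’s current location.

```python
# Stable-identifier sketch: the RAG pipeline references documents by an ID
# that never changes, and a registry maps the ID to the document's current
# location. Moving a file only updates the registry entry.
import uuid

location_registry: dict[str, str] = {}  # document ID -> current URI

def register_document(uri: str) -> str:
    doc_id = str(uuid.uuid4())
    location_registry[doc_id] = uri
    return doc_id

def move_document(doc_id: str, new_uri: str) -> None:
    location_registry[doc_id] = new_uri  # embeddings keep pointing at doc_id

def resolve(doc_id: str) -> str:
    return location_registry[doc_id]

# Example: a document moves from an on-prem system to the cloud,
# but the ID stored with its embeddings still resolves.
doc_id = register_document("file:///nas/policies/travel-policy.docx")
move_document(doc_id, "s3://corp-docs/policies/travel-policy.docx")
```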

Make data easily accessible.

Many proprietary and open source solutions require data to be curated and consolidated in a single place before you can use it. The same requirement applies to RAG, and it can create as many problems as it solves.

If you ask a question, the answer may exist anywhere in a billion documents, which are likely in various formats and a wide range of locations. If you need to dedicate months to identify, curate, and put the relevant data in one place so that the LLM can access it, that’s a nonstarter.

This may help explain why 37% of IT leaders have identified data quality as a major barrier to AI success, according to the recently released Hitachi Vantara State of Data Infrastructure Survey.

To simplify and speed up providing people and AI with the right information at the right time, you need to make data easy to access, no matter where the data is physically located. Do this by employing a semantic data plane, which connects various data sources, adds a data virtualization layer on top, and provides APIs that make it easy to retrieve documents.

You will need three APIs to make this work: One will access unstructured data, such as PowerPoint slides and PDF and Word documents. Another will access your structured data, stored in, say, Parquet files in object stores or in database tables. A third is for metadata and factual data (think knowledge graphs), which is critical for GenAI because it provides context.

Standardizing on a few APIs to access information irrespective of their physical location (on-prem or cloud object store, NAS, etc.) and their file format (e.g., Parquet files, CSV files, or database tables) significantly simplifies data access for RAG solutions.
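To illustrate, here is a sketch of what those three access paths might look like as Python interfaces; the method names are illustrative, not any particular product’s API.

```python
# Illustrative interfaces for a semantic data plane with three access paths.
# The method names are hypothetical, not any specific product's API.
from typing import Protocol, Any

class UnstructuredAPI(Protocol):
    def fetch_document(self, doc_id: str) -> bytes: ...        # PDFs, slides, Word files

class StructuredAPI(Protocol):
    def query(self, sql: str) -> list[dict[str, Any]]: ...     # Parquet files, database tables

class MetadataAPI(Protocol):
    def facts_about(self, entity: str) -> dict[str, Any]: ...  # knowledge-graph lookups

def build_context(doc_api: UnstructuredAPI, meta_api: MetadataAPI, doc_id: str, entity: str) -> str:
    """Combine a retrieved document with factual context before prompting the LLM."""
    document = doc_api.fetch_document(doc_id).decode("utf-8", errors="ignore")
    facts = meta_api.facts_about(entity)
    return f"{document}\n\nKnown facts: {facts}"
```

The RAG solution then codes against these interfaces rather than against each storage system, regardless of where the underlying files live.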

Understand lineage—end to end.

The standard RAG process involves reading a document, converting it to text, breaking it up into small sections such as paragraphs, creating embeddings to represent the meaning of each section, and then storing embeddings in a vector database.

Later, when a person or an AI agent asks a question, RAG converts the question into an embedding and compares that to the embeddings in the vector database to find information that’s similar to the meaning of the question. In doing so, the system finds relevant documents that it adds to the context window.
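Here is a compact sketch of that flow, assuming the sentence-transformers package, with a plain Python list standing in for the vector database.

```python
# End-to-end sketch of the standard RAG flow described above: chunk, embed,
# store, then embed the question and find the closest chunks. A plain list
# stands in for the vector database; sentence-transformers is assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector_store: list[tuple[str, np.ndarray]] = []  # (chunk text, embedding)

def index_document(text: str) -> None:
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]  # naive paragraph chunking
    for chunk, emb in zip(chunks, model.encode(chunks)):
        vector_store.append((chunk, emb))

def retrieve(question: str, top_k: int = 3) -> list[str]:
    q = model.encode([question])[0]
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(vector_store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```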

Innovations now also make it possible to know exactly what’s happening in a system at any point in time. That includes providing end-to-end lineage from the original document to the text version of it, to the document section, to embedding, to what was stored in the vector database. You can then see whether personally identifiable information (PII) is in the document, and you’ll have all the information about how the document was used and for what.

When an LLM has complete insights and introspection through such a “meta RAG” solution, the LLM can validate whether the users who get answers from GenAI should have access to the data on which those answers are based and can indicate if an access control adjustment is needed.
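As a sketch of what such a “meta RAG” layer might record, consider a per-chunk lineage entry like the following; the field names are illustrative.

```python
# Sketch of a per-chunk lineage record a "meta RAG" layer might keep.
# Field names are illustrative; the point is that every embedding can be
# traced back to its source document and carries PII/access metadata.
from dataclasses import dataclass, field

@dataclass
class ChunkLineage:
    source_doc_id: str          # stable identifier of the original document
    source_content_hash: str    # hash of the document version that was embedded
    chunk_index: int            # position of the chunk in the extracted text
    embedding_id: str           # key of the vector stored in the vector database
    contains_pii: bool = False  # flagged during extraction
    allowed_groups: set[str] = field(default_factory=set)  # who may see answers built on it

def user_may_see(record: ChunkLineage, user_groups: set[str]) -> bool:
    """Check whether the user asking a question may see content derived from this chunk."""
    return bool(record.allowed_groups & user_groups)
```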

End-to-end lineage is more important than ever. When GenAI first surfaced, a deployment typically meant a single LLM, and when RAG emerged, you would add only one or two documents to the context.

Today, there are multi-agent systems that collaborate to solve more complex problems. Multi-agent systems, which involve multiple LLMs, could draw from millions or billions of documents in storage systems. So, using manual processes to identify documents and safeguard confidential information simply isn’t feasible.

Create embeddings inside your storage and include embeddings in your archives.

Security and scale should also inform where you choose to create and keep your embeddings. You could take documents out of your storage systems to create embeddings, but every time you move data, you add complexity and open the data to vulnerabilities. For example, the very person who is charged with taking a document out of storage and ascertaining if it contains PII, such as Social Security numbers, may not technically be allowed to see those numbers.

However, if your storage system can create embeddings, you don’t have to load those documents into a different environment to create them. This approach is both more efficient and more secure.

A second issue to consider is that creating embeddings is expensive. If you create embeddings for many documents, you don’t want to lose them, so you’ll want to back them up. If embeddings are stored in a separate vector database, you need a backup and archiving strategy specifically for embeddings, which is extra work and requires additional skills. However, if you store and manage embeddings in the storage system itself, they can be backed up and archived together with the documents from which they were created. No additional tools or skills are needed.

Now say you are using semantic search to look for old, already-archived documents about a specific topic. Semantic search, of course, requires embeddings, so restoring archived documents together with their embeddings lets you immediately run a semantic search to locate them.
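Here is a hedged sketch of that approach: each archive record keeps the embedding next to the document it came from, so a restore brings back both at once. The record format is illustrative, not a particular storage system’s API.

```python
# Sketch of archiving a document together with its embedding so that a
# restore immediately supports semantic search again. The archive record
# format is illustrative, not a particular storage system's API.
import json
from dataclasses import dataclass, asdict

@dataclass
class ArchiveRecord:
    doc_id: str
    filename: str
    content_hash: str
    embedding: list[float]   # stored alongside the document, not in a separate system

def archive(record: ArchiveRecord, document_bytes: bytes, base_path: str) -> None:
    with open(f"{base_path}/{record.doc_id}.bin", "wb") as f:
        f.write(document_bytes)
    with open(f"{base_path}/{record.doc_id}.json", "w") as f:
        json.dump(asdict(record), f)   # embedding travels with the document

def restore(doc_id: str, base_path: str) -> tuple[bytes, ArchiveRecord]:
    with open(f"{base_path}/{doc_id}.bin", "rb") as f:
        document = f.read()
    with open(f"{base_path}/{doc_id}.json") as f:
        record = ArchiveRecord(**json.load(f))
    return document, record   # embedding is immediately available for semantic search
```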

Taking these steps simplifies RAG, protects against leaking sensitive data into the LLM and to GenAI end users, drives operational efficiency and environmental and business sustainability, and helps ensure that the information users ultimately derive from GenAI is very high quality.

This methodical approach to RAG—and holistic approach to data—is clean and simple. It frees you from the complexities of providing users with the right information at the right time. More importantly, it helps you to revolutionize your organization and drive real business value. 

