Retrieval-augmented generation—or RAG—is an AI strategy that supplements text generation with information from private or proprietary data sources, according to Elastic, the search AI company. RAG plays a crucial role in improving large language models (LLMs); however, building and optimizing RAG workflows in production and at scale requires a deft hand.
Jeff Vestal, principal customer enterprise architect at Elastic, joined DBTA’s webinar, Beyond RAG basics: Strategies and best practices for implementing RAG, to explore best practices, patterns, and techniques for delivering superior RAG performance.
In keeping with the ethical standard that AI must be safe, secure, and trustworthy, Vestal explained that LLMs, which form the foundation of many popular generative AI (GenAI) tools, need to be grounded.
LLMs “aren’t infallible,” said Vestal. “There’s examples all the time of these models giving bad information—sometimes it’s a little more funny, [and] sometimes it’s companies giving away $80,000 pickup trucks for $200.”
Through RAG, added context grounds the LLM and helps prevent this sort of misinformation from reaching users of an AI chat tool. Vestal cautioned, however, that an LLM can never be 100% controlled; RAG only improves overall governability.
Ultimately, LLMs are limited by the fact that:
- Base model knowledge is limited to public training data
- Data is frozen in time after training and fine-tuning
- Results are non-deterministic and prone to hallucination
- There is no built-in concept of access control
With RAG, LLMs are capable of human-like responses, benefitting from:
- Access to private data
- Real-time updates (and deletes)
- Confidence in response
- Access control
- Cost reduction
Vestal walked viewers through an ordinary RAG pipeline, where a customer question pulled relevant information from both business data and public internet data to deliver a conversational answer. In a production environment, such as a large, multi-component application, RAG is composed of many crucial layers. From a deployment and inference layer (where embeddings are generated) to an analysis layer, application layer, model layer, and data layer, each level is unified by a robust security and observability strategy.
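The core of that pipeline can be sketched in a few lines of Python. The in-memory documents, the naive keyword retriever, and the commented-out `llm_client.complete` call below are all stand-ins for illustration, not Elastic APIs; a production setup would swap in a vector store for retrieval and a real LLM endpoint for generation.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str
    body: str

# Toy stand-in for private business data; in production this lives in a search/vector store.
DOCS = [
    Doc("Return policy", "Customers may return items within 30 days with a receipt."),
    Doc("Shipping", "Standard shipping takes 3 to 5 business days within the US."),
]

def retrieve(question: str, k: int = 2) -> list[Doc]:
    """Naive keyword-overlap scoring standing in for semantic search."""
    terms = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(terms & set(d.body.lower().split())))
    return ranked[:k]

def build_prompt(question: str, context: list[Doc]) -> str:
    """Ground the LLM by prepending retrieved context to the user's question."""
    ctx = "\n".join(f"- {d.title}: {d.body}" for d in context)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer:"
    )

question = "How long do I have to return an item?"
prompt = build_prompt(question, retrieve(question))
# answer = llm_client.complete(prompt)  # hypothetical LLM call; any chat-completion API works here
print(prompt)
```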
Embeddings are vector representations of enterprise data that enable semantic search, leveraging approximate nearest neighbor search to yield accurate results efficiently. This is a desirable asset to any LLM, and many enterprises might look to vectorize all their private data.
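That retrieval step rests on a simple operation: compare the query's embedding against stored chunk embeddings and keep the closest matches. The sketch below uses random vectors and an exact scan purely to show the mechanics; real embeddings come from a model in the inference layer, and production systems replace the exact scan with an approximate nearest neighbor index (such as HNSW) to stay fast at scale.

```python
import numpy as np

# Placeholder embeddings: real ones come from an embedding model, not a random generator.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 384))   # 1,000 chunks, 384-dim vectors (illustrative sizes)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact cosine-similarity search; ANN indexes approximate this at much larger scale."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vectors @ q                 # cosine similarity on unit vectors
    return np.argsort(-scores)[:k]           # indices of the k closest chunks

print(top_k(rng.normal(size=384)))
```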
Vectorizing everything is possible, Vestal noted, but organizations must implement a chunking strategy: a single vector cannot accurately represent an entire novel, so enterprise data must be broken into chunks to refine its semantic precision.
Chunking breaks the input text into smaller pieces, and it can be done in many ways. You generally do not want to chunk all the way down to a single word, but neither do you want chunks as large as several paragraphs; striking a balance appropriate for your data is key.
Another chunking technique is overlap, in which adjacent chunks share a known number of characters to improve the chances of catching complete ideas, topic transitions, and context clues. Vestal offered an example of several paragraphs of text in which the first half is highlighted in blue and the latter half in yellow; in the middle, the highlights overlap, creating an additional chunk that increases the odds of semantic accuracy.
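A character-based version of this overlap idea fits in a short function. The 400-character chunks and 80-character overlap below are arbitrary illustrative values, not recommendations; many pipelines split on tokens or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size character chunks, each sharing `overlap`
    characters with the previous chunk so ideas that straddle a boundary
    appear intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # step forward less than a full chunk
    return chunks

# Quick check: a 1,000-character string yields overlapping 400-character chunks.
for piece in chunk_text("x" * 1000):
    print(len(piece))
```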
Deploying RAG at scale comes with its own set of considerations, including:
- Security and reliability: disaster recovery, SLO (service level objective), Infrastructure as Code orchestration
- Cost optimization: model efficiency, quantization, cost analysis
- Continuous analysis: observability and evaluation strategies
LLM caching addresses some of these considerations by caching each response delivered by the LLM. When a user later asks a similar question, the cached response is returned, reducing the need to call the LLM again. This cuts LLM token costs from duplicated calls and can increase GenAI response validity and transparency when users’ likely questions are pre-cached before they are asked. Elastic’s LLM caching technology can be explored here.
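One common way to implement such a cache is to key it on query embeddings and treat sufficiently similar questions as hits. The class below is a generic sketch of that pattern under assumed names (`SemanticCache`, a 0.9 similarity threshold, a placeholder LLM call); it is not Elastic’s implementation.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.90   # assumed cutoff; tune per use case

class SemanticCache:
    """Cache LLM responses keyed on query embeddings; a similar-enough new
    question returns the stored answer instead of triggering a new LLM call."""

    def __init__(self) -> None:
        self.query_vecs: list[np.ndarray] = []   # normalized embeddings of past questions
        self.answers: list[str] = []             # their cached LLM responses

    def lookup(self, query_vec: np.ndarray) -> str | None:
        q = query_vec / np.linalg.norm(query_vec)
        for vec, answer in zip(self.query_vecs, self.answers):
            if float(vec @ q) >= SIMILARITY_THRESHOLD:
                return answer                    # cache hit: no tokens spent
        return None

    def store(self, query_vec: np.ndarray, answer: str) -> None:
        self.query_vecs.append(query_vec / np.linalg.norm(query_vec))
        self.answers.append(answer)

def answer_question(question_vec: np.ndarray, cache: SemanticCache) -> str:
    cached = cache.lookup(question_vec)
    if cached is not None:
        return cached
    response = "...response from a hypothetical LLM call..."
    cache.store(question_vec, response)
    return response
```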
For the full, in-depth discussion of RAG implementation and optimization, you can view an archived version of the webinar here.