September 26, 2024

Retrieval Augmented Generation: the key to enhancing relevancy and precision


What is RAG?

The LLMs most frequently adopted in business environments, namely those created by OpenAI and Meta, are trained on vast swathes of data spanning the entire internet. However, successfully integrating AI technologies into everyday workflows requires models to have specific knowledge about your business. These models have no direct access to company-specific data, which limits their usefulness in many business use cases.

This is why businesses are adopting RAG. Retrieval-Augmented Generation (RAG) is a technique in natural language processing (NLP) that allows users to provide external knowledge to an LLM in order to give the model access to information that it has not previously been trained on. A RAG system consists of multiple components, combining the strengths of retrieval-based systems, which find relevant data, with generative models that create coherent and contextually appropriate text, ultimately enhancing the accuracy of responses generated by AI models.

A RAG system first involves retrieving relevant information from a dataset provided by the user, before generating a response. The system can cross-reference the user’s documents with the question posed by the user, and ultimately return an answer that is best matched to the user query. This means that the language model can reason with the provided documents in order to provide the most relevant answer for the user.

To get more technical, a RAG system uses embedding models to take documents of any form (PowerPoints, PDFs, plain text, etc.) and ingest them into a vector database, where they can be searched by semantic similarity: a technique that finds passages similar in meaning, even when the words themselves are not syntactically similar.

Once the text is in the vector database, the user can ask questions based on the corpus of documents that have been ingested. The user query is embedded into the same vector space as the documents, and the vector database returns the documents - or pieces of documents - that best match the user query in meaning, rather than by keyword matching. Finally, the context and the user query are passed to the large language model, which synthesises an answer.
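To make this concrete, here is a minimal sketch of the embed-and-retrieve flow in Python, assuming the sentence-transformers library and a plain in-memory list in place of a real vector database; the model name and the example documents are purely illustrative.

```python
# Minimal embed-and-retrieve sketch. Assumes the sentence-transformers library;
# a production system would use a dedicated vector database rather than a list.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The quarterly sales report is published on the first Monday of each month.",
    "Employees accrue 25 days of annual leave per year.",
]

# Ingest: embed every document into the same vector space.
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Query: embed the question and find the closest document by meaning.
query = "How long do customers have to return an item?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vectors @ query_vector          # cosine similarity, since vectors are normalised
context = documents[int(np.argmax(scores))]  # this chunk is passed to the LLM with the query
```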

What are the key benefits of RAG for a business? 

RAG helps businesses mitigate the risks associated with LLMs by ensuring greater accuracy, contextual relevance, compliance, efficient use of knowledge, and reduction in hallucinations.

Traditional LLMs, while powerful, rely on their vast training data to generate responses. That knowledge base cannot cover the most current or business-specific information needed for particular queries. By integrating a retrieval mechanism, RAG ensures that generated responses are not only accurate but also grounded in the business's existing knowledge infrastructure, so employees and customers receive information that is both correct and consistent with the company's knowledge base. Because the generative model bases its responses on factual, retrieved data, the information it provides is verifiable, reducing the risk of hallucination.

Breaking down RAG: a technical pipeline

A RAG system consists of three subsystems: an ingestion pipeline, a retrieval pipeline, and a synthesis pipeline.

The Ingestion Pipeline

The Ingestion Pipeline is the crucial first step in a RAG system, responsible for preparing documents so they can be effectively parsed, broken down, and embedded for use by the system.

Firstly, each document in the knowledge base needs to be parsed in order to be properly understood by the system. Parsing involves extracting relevant information from various document formats and converting it into a structured format that can be processed further. This technique involves raw text extraction, cleaning and preprocessing to remove unnecessary characters, and metadata extraction to collect important data that might be useful for indexing or querying.
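As a rough illustration of this step, the sketch below parses a PDF with the pypdf library (one possible reader among many), cleans the extracted text, and attaches simple metadata; the file name is hypothetical.

```python
# Parsing sketch: extract raw text page by page, clean it, and keep metadata
# that is useful later for indexing, querying and citing sources.
import re
from pypdf import PdfReader

def parse_pdf(path: str) -> list[dict]:
    reader = PdfReader(path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        raw = page.extract_text() or ""
        text = re.sub(r"\s+", " ", raw).strip()  # collapse whitespace and stray characters
        pages.append({
            "text": text,
            "metadata": {"source": path, "page": number},
        })
    return pages

documents = parse_pdf("company_handbook.pdf")  # hypothetical file name
```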

Secondly, once the document has been parsed, it needs to be broken down into smaller parts, or ‘chunks’, so that it can be processed efficiently. This step involves determining optimal chunk sizes for the specific task, and splitting the document based on natural language structures, such as sentences or paragraphs, depending on the document size and type.
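A simple way to picture chunking is a function that splits on paragraph boundaries up to a size limit; the 500-character limit below is an arbitrary example rather than a recommendation, and a real pipeline would tune it to the embedding model and task.

```python
# Chunking sketch: group paragraphs together until a size limit is reached.
# Paragraphs longer than the limit simply become chunks of their own.
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):  # split on natural language structure first
        if len(current) + len(paragraph) <= max_chars:
            current = f"{current}\n\n{paragraph}".strip()
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```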

Once the chunking stage is complete, the text chunks are each passed to the embedding model, where they are converted into vector representations that capture the semantic meaning of the text. Embedding typically involves tokenisation in order to convert text into tokens that a model can understand, feeding tokens into the model to generate embeddings, and post-processing.
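The sketch below makes those three steps - tokenisation, the model forward pass, and post-processing - explicit using the Hugging Face transformers library; the model name is only an example, and many systems would use a managed embedding service instead.

```python
# Embedding sketch: tokenise the chunks, run them through an example model,
# then mean-pool and normalise to get one vector per chunk.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(chunks: list[str]) -> torch.Tensor:
    # Tokenisation: convert text into tokens the model can understand.
    inputs = tokenizer(chunks, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Post-processing: mean-pool the token embeddings into one vector per chunk,
    # then normalise so similarities can be compared directly.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(pooled, dim=1)
```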

If done correctly, the RAG system can effectively ingest documents, convert them into a structured format, and generate embeddings that facilitate effective retrieval and augmentation during the generation phase.

The Retrieval Pipeline

The Retrieval Pipeline kicks in when a user asks the model a question, and is all about finding the best answer. It is responsible for retrieving the chunks of information that are most relevant and match the user query.

The retrieval phase involves embedding the user query - transforming it into a suitable format and tokenising it for the model - and then running a semantic similarity search within the database. A popular measure for this is cosine similarity, which scores how close each stored chunk is in meaning to the user query so that the most similar chunks can be returned.
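As a small illustration, cosine similarity can be computed directly with NumPy and used to rank stored chunk vectors against the embedded query; the helper functions here are our own names, not part of any particular library.

```python
# Cosine similarity measures how closely two vectors point in the same direction:
# values near 1 mean very similar meaning, values near 0 mean unrelated.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, chunk_vecs: list[np.ndarray], k: int = 3) -> list[int]:
    scores = [cosine_similarity(query_vec, c) for c in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```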

Depending on the size of the dataset, the search may return a range of documents that match the semantic meaning of the query. If so, additional post-processing is needed to optimise the chunks and decide which are the most relevant. This optimisation may include enriching the chunks with metadata, expanding abbreviations for clarity, or adding the sentences immediately before and after what was retrieved to give more context. Once the most relevant sections of the dataset have been sourced, the retrieval pipeline is complete.
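One simple form of this post-processing, sketched below, is to expand each retrieved chunk with its neighbouring chunks so the model sees the surrounding context; the function and the window size are illustrative only.

```python
# Context expansion sketch: for each retrieved chunk, also include the chunks
# immediately before and after it in the original document.
def expand_with_neighbours(chunks: list[str], hit_indices: list[int], window: int = 1) -> list[str]:
    expanded = []
    for i in hit_indices:
        start = max(0, i - window)
        end = min(len(chunks), i + window + 1)
        expanded.append(" ".join(chunks[start:end]))
    return expanded
```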

The Synthesis Pipeline

The final component of a RAG system is the Synthesis Pipeline, which is responsible for integrating the retrieved information and generating a coherent response. The purpose of this stage is to leverage both the retrieval results and the generative capabilities of a language model to produce the final output, while also ensuring that the LLM does not hallucinate or provide inaccurate responses.

Firstly, contextualisation involves preparing the retrieved chunks and the user query so they can be processed. The user query and chunks are concatenated to form a single input context, after which input formatting will ensure that the concatenated text is in a suitable format. The input is converted into tokens that the model can understand, and might also be truncated to fit within a certain token limit.
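A minimal sketch of this contextualisation step might look like the following, where the token budget and the whitespace-based token count are deliberate simplifications; a real system would count tokens with the model's own tokeniser.

```python
# Contextualisation sketch: concatenate retrieved chunks with the user query
# and stop adding chunks once a rough token budget is exhausted.
def build_context(query: str, chunks: list[str], max_tokens: int = 3000) -> str:
    selected, used = [], 0
    for chunk in chunks:
        tokens = len(chunk.split())  # crude approximation of the model's tokeniser
        if used + tokens > max_tokens:
            break
        selected.append(chunk)
        used += tokens
    return "Context:\n" + "\n\n".join(selected) + f"\n\nQuestion: {query}"
```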

The generation stage functions as a reasoning engine, responsible for sifting through many facts to find the best answer. It is a crucial step where the model uses the context provided by the query and the retrieved chunks to generate a relevant response; models with more parameters generally reason more effectively. To avoid hallucination and give the user full transparency, if the LLM cannot answer the question from the context provided, it is instructed not to attempt a response and instead to say that it does not know the answer.
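In practice, the generation step can be as simple as wrapping the assembled context in a prompt that carries an explicit grounding instruction. The sketch below uses the OpenAI chat API purely as an example; any hosted or locally deployed LLM could be substituted, and the model name is illustrative.

```python
# Generation sketch: send the assembled context and question to an LLM with a
# system instruction that forbids answering beyond the provided context.
from openai import OpenAI

client = OpenAI()

def answer(context_and_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": (
                "Answer using only the provided context. If the context does "
                "not contain the answer, say that you do not know."
            )},
            {"role": "user", "content": context_and_question},
        ],
    )
    return response.choices[0].message.content
```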

RAG vs. fine-tuning - which is better for generating the most precise and relevant outputs?

When it comes to improving the accuracy of a model's responses, fine-tuning and RAG are both popular approaches. RAG is often the preferred method for optimising the performance of LLMs; however, the two techniques can also be used alongside each other for optimal results.

LLM fine-tuning is a well-established technique for improving the capabilities of existing models, and is most effective on narrow tasks, such as summarisation or asking a model to respond in a certain tone. Fine-tuning takes a pre-existing, pre-trained LLM and trains it further on a specific dataset, for example your business documents. However, while the fine-tuned model has absorbed information relevant to the user query, it still runs the risk of hallucination, as it is also pre-trained on vast quantities of data.

Contrastingly, a RAG system greatly reduces the risk of hallucination, as the user can be confident that the answer to their query comes from the dataset provided to the model, and if the RAG system cannot answer the user's question, it simply will not. Unlike a fine-tuned model, a RAG system can also document its lineage - in other words, precisely where within the knowledge base each response was sourced.

Clairo RAG

Clairo AI's existence is defined by an unwavering commitment to data privacy, environmental sustainability, and digital innovation. Clairo RAG allows us to uphold those values and lets our users leverage the power of Large Language Models without compromising the security of their data, straining their budget, or impacting the environment.

RAG and Data Privacy

Businesses should be able to harness the power of large language models (LLMs) without compromising the integrity and privacy of their data. Traditional generative AI solutions rely on processing vast amounts of user-inputted personal data to continuously learn and improve, putting the privacy of data at risk. As a result, businesses are increasingly prioritising data privacy in their AI strategies. In contrast, proprietary data is not used to train or fine-tune models in RAG systems, ensuring that companies can confidently protect their sensitive information.

RAG and Environmental Sustainability

At Clairo AI, we are on a mission to combat the environmental impact of AI solutions, and RAG is a core driver of this mission. By leveraging existing databases to retrieve relevant information, RAG systems reduce the need for extensive data processing and therefore lower energy consumption compared to traditional Generative AI solutions. As well as reduced data processing times, RAG systems use retrieval mechanisms to find and incorporate relevant data on-demand, and do not require the same level of exhaustive training, thus leading to lower energy use.

RAG and AI Agents

RAG is an important component in the creation of AI Agents. Composed of a model, data and prompts, agents can benefit hugely from RAG, due to its ability to enhance the quality, relevance, and accuracy of generated responses. RAG enhances AI agents by combining the strengths of both retrieval-based and generative approaches, making them more capable and reliable as autonomous systems in business operations.

With Clairo AI, users can build custom AI Agents that are tailored to a range of specific business use cases with privately-held data retrieved through RAG, allowing them to navigate complex tasks and decisions.