
Generative AI: Retrieval Augmented Generation (RAG)


 
Another blog post starts with you beautiful people👦. I hope you have explored my last blog post about 2x faster fine-tuning of the Mistral 7B model on a custom dataset👈. In this blog post, we are going to learn an essential technique in Generative AI: Retrieval Augmented Generation (RAG).

What is RAG?

Retrieval Augmented Generation (RAG) is an innovative approach that melds generative models, like transformers, with a retrieval mechanism. By tapping into existing knowledge, RAG retrieves pertinent information from expansive external datasets or knowledge bases to enhance the generation process, thereby elevating the model's content relevance and factual accuracy💪. This versatility renders RAG particularly beneficial for tasks demanding the assimilation of external knowledge, such as question answering or content creation.

Upon receiving input, RAG searches for relevant documents in the specified sources (e.g., Wikipedia, a company knowledge base, etc.). It then combines this retrieved data with the input in the prompt, so the model can produce a comprehensive output complete with references. This structure enables RAG to integrate new and evolving information without the need to retrain the entire model from scratch💥.

RAG vs Fine-Tuning

RAG augments the prompt with external data, while fine-tuning incorporates the additional knowledge into the model's weights. RAG typically requires less labeled data and compute than fine-tuning, which makes it less costly. Most of the expense of RAG goes into setting up the embedding and retrieval systems. In contrast, fine-tuning requires more labeled data and significant computational resources, such as high-performance GPUs or TPUs. As a result, the overall cost of fine-tuning is usually higher than that of RAG💸.

RAG Architecture

A standard RAG application comprises two primary elements:

A. Indexing: a data ingestion pipeline that sources and indexes data, typically conducted offline. The indexing sequence is as follows-

1. Load: First we need to load our data. This is done with DocumentLoaders.

2. Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it into a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.

3. Store: We need somewhere to store and index our splits so that they can later be searched. This is often done using a VectorStore and an Embeddings model.
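To make the Split step a bit more concrete, here is a minimal pure-Python sketch of cutting a text into overlapping chunks. It is only an illustration of the idea and not LangChain's own splitter. The function name `split_text`, the character-based sizes, and the sample text are all my assumptions.

```python
def split_text(text, chunk_size=200, overlap=40):
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample document, just to have something to split
document = "RAG pairs a retriever with a generator. " * 20
chunks = split_text(document, chunk_size=200, overlap=40)
print(len(chunks), "chunks, first one:", repr(chunks[0][:40]))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which usually helps retrieval.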

B. Retrieval and generation: the operational RAG chain, which receives user queries at runtime, retrieves the pertinent data from the index, and passes it to the model. The sequence is as follows-

1. Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

2. Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.
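The Retrieve and Generate steps above can be sketched with nothing but the standard library. This is a toy under loud assumptions: the "embedding" is a simple bag-of-words count instead of a real embedding model, the three chunks are made-up examples, and `embed`, `cosine`, `retrieve` and `build_prompt` are hypothetical helper names. The final prompt is what would be handed to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Augment the question with the retrieved context."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = [
    "FAISS is a library for similarity search over dense vectors.",
    "Mistral 7B is a large language model with about seven billion parameters.",
    "A text splitter breaks long documents into smaller chunks.",
]
question = "What is FAISS used for?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A real setup swaps the word counts for dense vectors from an embedding model and the sorted list for a vector store, but the flow stays the same.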

Let's build something with RAG!

Now we are ready to use RAG with a large language model, Mistral 7B, which we also used in the last blog. As of 18 January 2024, a hot topic in India is the new temple of Lord Ram in the city of Ayodhya. If we ask an LLM a question about it, the model cannot reply accurately because its training data does not include current news. To prove this point, we can load the pre-trained model and ask it a related question like below-

Load the 4-bit Mistral 7B model as we did in my last post-
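The original notebook cell is not reproduced here, so the following is only a sketch of how such a load might look with unsloth. The checkpoint name `unsloth/mistral-7b-bnb-4bit` and the sequence length of 2048 are my assumptions, and `load_model` is a hypothetical helper. The import sits inside the function so the cell can be defined even before unsloth is installed.

```python
MODEL_NAME = "unsloth/mistral-7b-bnb-4bit"  # assumed checkpoint name
MAX_SEQ_LENGTH = 2048                       # assumed value, adjust to your needs


def load_model(model_name=MODEL_NAME, max_seq_length=MAX_SEQ_LENGTH):
    """Load a 4-bit quantized model and its tokenizer (needs a GPU runtime)."""
    from unsloth import FastLanguageModel  # imported lazily on purpose

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=None,          # let unsloth pick a dtype for the GPU
        load_in_4bit=True,   # 4-bit quantization to save memory
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer
```

Calling `model, tokenizer = load_model()` in a GPU Colab runtime should download the weights on first use.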


Then I asked the loaded pre-trained model a question about this current topic as below-
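A minimal sketch of asking the plain model a question, assuming `model` and `tokenizer` follow the usual Hugging Face interface. The helper name `ask_base_model` and the question wording in the usage note are hypothetical; the exact question from the notebook is not shown here.

```python
def ask_base_model(model, tokenizer, question, max_new_tokens=128):
    """Generate a reply from the model alone, without any retrieved context."""
    # Tokenize the question and move the tensors to the model's device
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    # Generate a continuation of the prompt
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Turn the generated token ids back into text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `ask_base_model(model, tokenizer, "What are the dimensions of the Ram temple in Ayodhya?")` would query the model with no external knowledge attached.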



And as I expected, the model is unable to give the answer-


I hope you have understood the problem statement.👀 Now we will use RAG to fetch the required information from an external source and supply it to our LLM, so that it can answer questions about the current topic. As you have read above in the RAG architecture section, we will need to store that source somewhere. For this purpose, we are going to use FAISS, but you can use any other vector database such as Chroma. Let's install the FAISS library using the pip command in our Colab notebook-

Also, install the other required libraries like langchain, langchain-community, unsloth, transformers and sentence-transformers if you have not installed them already.
The next step is to download the data from our source. In my case, it is 'https://srjbtkshetra.org/'. You can replace this with any other site you need. To load all the text from this webpage into a document format, we will use WebBaseLoader from LangChain. Please note that LangChain supports various types of data loaders, and you can refer to the LangChain documentation for all the details-
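Here is a small sketch that checks which of these libraries are still missing and prints the pip command for them. The pip name `faiss-cpu` is my assumption (a GPU build may suit your runtime better), and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Import name -> pip package name (faiss-cpu is an assumed choice)
REQUIRED = {
    "faiss": "faiss-cpu",
    "langchain": "langchain",
    "langchain_community": "langchain-community",
    "unsloth": "unsloth",
    "transformers": "transformers",
    "sentence_transformers": "sentence-transformers",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the libraries that cannot be imported yet."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Run in a notebook cell: !pip install " + " ".join(missing))
else:
    print("All required libraries are already installed.")
```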
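A minimal sketch of the loading step, assuming the `langchain-community` package is installed (WebBaseLoader also relies on BeautifulSoup under the hood). `load_web_documents` is a hypothetical helper name, and the import is kept inside the function so the cell can be defined before the library is available.

```python
SOURCE_URL = "https://srjbtkshetra.org/"  # the source used in this post


def load_web_documents(url=SOURCE_URL):
    """Fetch a web page and return its text as a list of LangChain Documents."""
    from langchain_community.document_loaders import WebBaseLoader  # lazy import

    loader = WebBaseLoader(url)
    return loader.load()
```

Each returned Document carries the page text in `page_content` and details such as the source URL in `metadata`.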

The next step is to divide the loaded text into smaller chunks that can fit into our model's context window. For example, GPT-3.5-turbo has a context window of 4,097 tokens and Mistral 7B has roughly 8,000. If the prompt exceeds that context window, the behavior of the LLM becomes unpredictable and its performance degrades severely👽. For this purpose, we will use a text splitter as below-
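One way this could look, assuming LangChain's RecursiveCharacterTextSplitter. The chunk size of 1000 characters and the overlap of 100 are my assumptions, not values from the notebook. I have also added a rough token budget check based on the common rule of thumb of about 4 characters per token for English text, which is only an approximation; `split_documents` and `estimated_prompt_tokens` are hypothetical helpers.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English, not an exact figure


def split_documents(documents, chunk_size=1000, chunk_overlap=100):
    """Split LangChain Documents into smaller overlapping chunks."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # lazy import

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # measured in characters by default
        chunk_overlap=chunk_overlap,  # characters shared by neighbouring chunks
    )
    return splitter.split_documents(documents)


def estimated_prompt_tokens(chunk_size, k, overhead_tokens=200):
    """Estimate the prompt size when k chunks are stuffed into the prompt."""
    return k * chunk_size // CHARS_PER_TOKEN + overhead_tokens


# 4 chunks of 1000 characters plus 200 tokens for instructions and the question
print(estimated_prompt_tokens(chunk_size=1000, k=4))
```

With these assumed numbers the estimate stays well below the context windows mentioned above, which leaves room for the generated answer.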

The next step is to encode these chunks of text into embeddings and then index those embeddings in our vector database as below-
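A hedged sketch of this step using the LangChain wrappers for Hugging Face embeddings and FAISS. `build_vector_store` is a hypothetical helper, and `chunks` is assumed to be a list of LangChain Documents coming out of your text splitter.

```python
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"


def build_vector_store(chunks, model_name=EMBEDDING_MODEL):
    """Embed every chunk and index the vectors in an in-memory FAISS store."""
    from langchain_community.embeddings import HuggingFaceEmbeddings  # lazy import
    from langchain_community.vectorstores import FAISS                # lazy import

    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    # from_documents embeds each chunk and adds it to a new FAISS index
    return FAISS.from_documents(chunks, embeddings)
```

The first call downloads the embedding model, so it can take a little while in a fresh Colab session.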

Here, for embeddings, we use the 'sentence-transformers/all-mpnet-base-v2' model from the Hugging Face hub, but you can explore other sentence-transformers models as well. Now we will construct a retriever to fetch the documents from the vector DB as below-
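A minimal sketch of the retriever, assuming `vector_store` is a LangChain vector store such as FAISS. The value `k=4` (how many chunks to fetch per question) is my assumption, and `make_retriever` is a hypothetical helper.

```python
def make_retriever(vector_store, k=4):
    """Wrap a vector store as a retriever that returns the k most similar chunks."""
    return vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )
```

A larger `k` gives the model more context to work with, at the price of a longer prompt.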

Next, we will create and load our inference pipeline, as you already read in the last blog-
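As a sketch only, here is how a transformers text-generation pipeline could be wrapped so that LangChain can call it. It assumes `model` and `tokenizer` are already loaded; the `max_new_tokens` value of 256 and the helper name `build_llm` are my assumptions.

```python
def build_llm(model, tokenizer, max_new_tokens=256):
    """Wrap a Hugging Face model as a LangChain LLM."""
    from transformers import pipeline                             # lazy import
    from langchain_community.llms import HuggingFacePipeline      # lazy import

    text_generation = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,  # upper bound on the answer length
    )
    return HuggingFacePipeline(pipeline=text_generation)
```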

Next, we will create a prompt template that gives the model its instructions, format the output as we need, and tie all of this together into a RAG chain like below-
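A hedged sketch of the prompt and the chain in LangChain's pipe syntax. The template wording is my own and not the one from the notebook, `format_docs` and `build_rag_chain` are hypothetical helpers, and `retriever` and `llm` are assumed to be a LangChain retriever and LLM that you have already built.

```python
TEMPLATE = """Use the following context to answer the question.
If the answer is not in the context, say that you do not know.

Context:
{context}

Question: {question}
Answer:"""


def format_docs(docs):
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag_chain(retriever, llm):
    """Retriever -> prompt -> LLM -> plain string."""
    from langchain_core.prompts import PromptTemplate           # lazy imports
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    prompt = PromptTemplate.from_template(TEMPLATE)
    return (
        # The question goes to the retriever for context and is also passed through as-is
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
```

Once the chain is built, something like `build_rag_chain(retriever, llm).invoke("What are the dimensions of the Ram temple?")` runs the whole flow; the question wording here is just an example.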

Now our RAG chain is ready to answer any question related to the source we added. For example, I asked about the dimensions of the newly built Lord Ram Temple, and the model is now able to give me accurate answers as below-

Another one-

How cool is this, right guys👏. In this post, we learned about the creation of a PromptTemplate and a Chain, the usage of RunnablePassthrough, the invocation of the Retriever, the integration of the retrieved context, and the LLM invocation. The possible use cases of RAG are endless💫. So don't wait. Make a copy of this Colab notebook and start playing with your own data and any source you want your LLM to know about. In the next post, we will learn another useful use case of Gen AI. Till then 👉 go chase your dreams, have an awesome day, make every second count, and see you later in my next post.















