Developing LLM Based Applications: Implementing a RAG based Private Data Bot (4/n)
When we search for something on Google, we’re presented with a plethora of websites, links, and documents. It’s up to us to click on the right link to find the most relevant answer.
Let’s go one step deeper into this idea. What if, instead of generic (though highly relevant) search results, we could obtain a curated answer tailored to our specific question?
Ask Google a question like "What is split brain in Elasticsearch?" and you'll see a few articles appear rather than a single, definitive, reliable and relevant answer. Can't I get a ready-to-go answer?
Moreover, what if this answer were sourced from a private dataset?
Enterprises and organizations amass data over time: in databases, applications, cloud drives (Google Drive, OneDrive, Box and the like), Confluence and JIRA, SharePoint and so on.
Wouldn't it be good to get a curated answer prepared from the company's Confluence documents rather than browsing through dozens of potentially relevant results?
This is precisely what a Large Language Model (LLM) offers. Given a set of pertinent documents (similar to the search results Google provides), we can prompt the LLM to craft an insightful and relevant response.
In this article, we look at a framework called "Retrieval Augmented Generation" (RAG) that fuses already established search techniques with generative AI. During the query process, the LLM works over a set of relevant documents before curating a final answer.
What is the Retrieval Augmented Generation (RAG) framework?
LLMs are great with their source knowledge. That is, they can answer questions about data they have seen before, because they were trained on that data.
They certainly weren't trained on your company's Confluence, Google Drive or SharePoint data. So, in essence, if you ask a question about your private data, the LLM will go blank.
In fact, worse than blank: it could "hallucinate", meaning you'd get completely convincing misinformation.
The RAG framework is a relatively recent AI approach in which the Large Language Model is fed private data to produce its answers.
In simple words:
- When a user issues a search query, the query is executed against the database (a vector database, since the stored data was vectorised in the first place) and fetches the relevant results (anywhere from zero to many)
- The results of this query (a handful of closely related, relevant documents) are then fed to the LLM.
- The LLM goes over these results, builds a context from them, and curates an answer from that context rather than from its own training data. The sketch after this list illustrates the flow.
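Conceptually, the flow looks something like the sketch below. This is illustrative pseudocode only; the object and method names are placeholders, not a real API:

def answer_with_rag(question, vector_db, llm):
    # 1. Fetch the chunks most similar to the question from the vector database
    relevant_docs = vector_db.similarity_search(question)
    # 2. Build a context out of the retrieved documents
    context = "\n\n".join(doc.page_content for doc in relevant_docs)
    # 3. Ask the LLM to answer strictly from that context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.predict(prompt)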
As usual, let’s write a sample application to understand it via code :)
Use case: Questioning the Motor Insurance Policy Docs
The idea is that a user uploads their motor insurance policy documents on our fictitious site.
Once the PDF document has been uploaded, the user fires their queries.
Our application, being integrated with a Large Language Model, must provide the necessary answers to these queries.
The LLM will answer from the given context. Should the answer to a query not be available in the given context, the LLM will say "I don't know", as it is specifically instructed not to make up ("hallucinate") an answer.
We use Aviva’s sample motor insurance document (pdf) as our private data for this use case. As part of our application, we will load the document and vectorise it to be stored in the database.
Tech Stack
The tech stack for this use case is:
- OpenAI's gpt-3.5 model (follow my earlier article to run this app with the open source Llama 2 model)
- LangChain framework
- Flask
- Python programming language
- Postman (for testing)
Server Side Code
The code is available here in my repository.
The server side logic is split into the following parts:
- Instantiate OpenAI's gpt-3.5-turbo model
- Load the data from the local PDF file (which is our insurance document)
- Split the document into multiple chunks of 1000 characters each
- Vectorise the chunks to a Chroma DB
- Create a LangChain "RetrievalQA" chain to compose the LLM, the vector DB and the user's question into a prompt
Loading data
To keep the use case simple, we will load the data from a local folder.
Fortunately, LangChain provides excellent integrations to load data from various sources such as the web, Wikipedia, Google Drive and so on. Here's a list of all document loader integrations with external sources.
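For example (illustrative only, not part of this app's code), pulling data from a web page or from Wikipedia is just a matter of swapping in a different loader class:

from langchain.document_loaders import WebBaseLoader, WikipediaLoader

# Load a web page (the URL is a placeholder)
web_docs = WebBaseLoader("https://example.com/some-article").load()

# Load a couple of Wikipedia articles matching a query
wiki_docs = WikipediaLoader(query="Retrieval-augmented generation", load_max_docs=2).load()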
I'd like to express my sincere gratitude to the LangChain community for developing such an outstanding framework in a very limited time! Despite being in existence for less than a year, the array of features and integrations it offers is truly impressive. Hats off to the community!
The following function defines the mechanism to load the data:
from langchain.document_loaders import PyPDFLoader

def load_data():
    print("Loading data from PDF document")
    # Using the PDF Loader class
    pdf_loader = PyPDFLoader("./docs/aviva-motor-insurance.pdf")
    # Load the data and return it
    data = pdf_loader.load()
    return data
The function is pretty straightforward: it uses the PyPDFLoader class, imported from langchain.document_loaders (from langchain.document_loaders import PyPDFLoader), to read and load the PDF document from the local directory.
Splitting the data
A file can consist of many pages; in this case, the insurance policy document has over 40 pages. We've loaded all of these pages into the application's memory.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splitting into chunks of 1000 characters
def get_chunks(data):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=50)
    chunks = splitter.split_documents(data)
    return chunks
Here, we split the full data into chunks of 1000 characters using the RecursiveCharacterTextSplitter class, with an overlap of 50 characters between consecutive chunks so that context is not lost at the boundaries.
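If you want to convince yourself of what the splitter produced, a quick check like the following (not part of the server code, just a sanity check) prints the number of pages, the number of chunks and a peek at the first chunk:

data = load_data()
chunks = get_chunks(data)

print(f"Pages loaded: {len(data)}")
print(f"Chunks produced: {len(chunks)}")
print(chunks[0].page_content[:200])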
Vectorising and Embedding the Chunks
The next step is a crucial one: we vectorise these chunks into a Chroma DB. Chroma DB is an open source vector database. The vectorisation is then followed by the application of embeddings. Let's understand these two features at a very high level.
Vectorising, in simple language, is the process of transforming textual data into numbers: it is the initial step where text (words or sentences) is converted into numerical form.
This can be done in various ways, such as using term frequency-inverse document frequency (TF-IDF). Read my Elasticsearch in Action book or my articles here on medium to understand the TF-IDF algorithm.
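TF-IDF is not what we use in this application (we use OpenAI's embeddings, shown below); but purely for intuition, here is a tiny, optional scikit-learn snippet that turns two sentences into vectors of numbers:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the policy covers loss or damage to the vehicle",
    "the policy does not cover wear and tear",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(tfidf_matrix.toarray())              # each row is a document as a vector of numbers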
The vectorised data is then embedded to capture semantic similarity between the words.
Embeddings are dense vector representations that capture semantic meanings and relationships between words or phrases.
Techniques like Word2Vec and FastText are popular methods to generate word embeddings.
The goal is to represent words such that similar words, or words used in similar contexts, have similar vector representations. For example, "cat" and "dog" (imagine these are two of the many words in your data) are similar, hence they would be placed close to each other in the vector space.
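To make this concrete, here is an optional snippet (not part of the application; it assumes an OPENAI_API_KEY is set) that embeds three words and compares them with cosine similarity:

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
cat, dog, car = embeddings.embed_documents(["cat", "dog", "car"])

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog:", cosine(cat, dog))  # likely the higher score
print("cat vs car:", cosine(cat, car))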
Coming back to our code, the following snippet does this for us:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Vectorising and embedding
def get_db_retriever(chunks):
    embeddings = OpenAIEmbeddings()
    # The chunks are embedded using OpenAI's embedding model
    db = Chroma.from_documents(chunks, embeddings)
    return db.as_retriever()
The chunks we created earlier are run through an embedding algorithm (in this case OpenAIEmbeddings) and persisted in Chroma. The Chroma DB's retriever reference is then used in the next step.
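Before wiring the retriever into a chain, you can sanity-check it on its own (again, illustrative; the question is just an example):

retriever = get_db_retriever(chunks)
docs = retriever.get_relevant_documents("Is loss or damage of the vehicle covered?")
for doc in docs:
    print(doc.page_content[:150], "...")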
Retrieval QA chain
The next step is to create a RetrievalQA chain, a LangChain class that retrieves the relevant documents and pushes them through to the LLM for a curated answer. It combines the LLM, the private data as context, and the incoming question into a single prompt.
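The prompt in the snippet below is a prompt template. The exact wording lives in my repo, but it looks roughly like this sketch: it tells the LLM to answer only from the supplied context and to say it doesn't know otherwise:

from langchain.prompts import PromptTemplate

template = """Use the following context to answer the question at the end.
If the answer is not in the context, just say "Sorry, I don't know the answer" and do not make one up.

Context: {context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])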
# Creating a RetrievalQA chain
retrievalQA = RetrievalQA.from_chain_type(llm,
                                          chain_type="stuff",
                                          retriever=db_retriever,
                                          chain_type_kwargs={"prompt": prompt})
..
The RetrievalQA is a chain that combines the retriever with a question-answer chain; the chain_type="stuff" in the above snippet refers to the StuffDocumentsChain, which simply "stuffs" the retrieved documents into the prompt as context.
The full snippet is given here for completeness (don't forget to check out the code from my repo):
def retrieval_qa(q):
    llm = OpenAI()
    data = load_data()
    chunks = get_chunks(data)
    db_retriever = get_db_retriever(chunks)
    retrievalQA = RetrievalQA.from_chain_type(llm,
                                              chain_type="stuff",
                                              retriever=db_retriever,
                                              chain_type_kwargs={"prompt": prompt})
    return retrievalQA({"query": q})
API Call
Last but not least, let's create an API endpoint that invokes retrieval_qa to fetch the answer.
from flask import Flask, request

app = Flask(__name__)

# This method uses the QA chain to return the answer
def ask(question):
    return retrieval_qa(question)

# The API endpoint
@app.route("/call", methods=["POST"])
def call():
    req = request.json
    question = req.get('question')
    response = ask(question)
    return response

# Run the server on port 3099
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=3099)
We are exposing a call endpoint over the POST method. The user's question is expected to be formatted as a JSON document. We can test this setup using Postman.
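Alternatively, if you'd rather test from code than from Postman, a quick Python call against the running server does the job (assuming it is listening on localhost:3099):

import requests

resp = requests.post(
    "http://localhost:3099/call",
    json={"question": "Is the loss or damage of the vehicle covered?"},
)
print(resp.json())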
Test the application
We've got a web app up and running on localhost at port 3099. We can send two requests: one happy path and one unhappy (erroneous) request.
The happy path request asks for a question that will most likely be in the context of the motor insurance. The following image demonstrates this in action:
Asking the question "Is the loss or damage of the vehicle covered?" yields a positive response, as you can see in the answer.
Should you ask a question whose answer isn't in the PDF document, you should ideally get a "Sorry, I don't know" answer (if you are curious how this is instructed, check out the prompt template).
Let's see whether the application gives us a "hallucinated" answer if we ask about BTC (Bitcoin), which has nothing to do with our document:
As you can see, the application returned a "Sorry, I don't know the answer" message when asked about Bitcoin.
I am currently developing a Streamlit-based UI for this application, so stay tuned!
Yay! We got our LLM-integrated application working on our private data as expected!
Repository of this code is here.