
Day 7: How to Build ChatGPT for Your Data Using Langchain

I write a newsletter called Above Average where I talk about the second-order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.

I think one of the most common use cases everyone wanted after ChatGPT broke into the public imagination was a ChatGPT-like experience on top of their own data.

In this example, I will use Langchain (which raised $10M at a $100M valuation) to access OpenAI’s API.

As a Product Manager on Azure Files, the information I would most like to chat with is the publicly available Azure Files documentation. I downloaded it as a PDF for the purpose of this exercise. If you are following along, download whatever information you want to build your chatbot on in the form of one or more PDFs. You can also use other formats, but I will be sticking to PDFs in this example.

The process by which we will build this chatbot is often referred to as retrieval augmented generation (RAG).

The following image explains the different steps involved in creating the chatbot that will help me do my job better and faster.

Building a RAG using Langchain

So let’s write the code to create our chatbot. I am using Langchain along with OpenAI, so you will need an OpenAI secret key and the IDE of your choice to follow along. I am using VS Code and a Python virtual environment.

Step 1 – Load the PDFs: The first step is to load the data from the data folder using the document loaders available in Langchain. Langchain provides data loaders for 90+ sources, so you can load not just PDFs but almost anything you want.

Loading PDFs
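Since my code lives in the screenshot above, here is a minimal sketch of this step, assuming the classic Langchain API, a folder named data/ holding the PDFs (the folder name is my placeholder), and the pypdf package installed:

from langchain.document_loaders import PyPDFDirectoryLoader

# Load every PDF found in the "data" folder; each page becomes a Document
loader = PyPDFDirectoryLoader("data/")
docs = loader.load()
print(f"Loaded {len(docs)} pages")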

Step 2 – Splitting: Split the data into smaller chunks with a chunk size of 1500 and a chunk overlap of 150, which means each consecutive chunk shares 150 characters (or tokens, depending on the splitter) with the previous chunk so the context doesn’t get split abruptly. There are different ways to split your data, and each of them has different pros and cons. Check out Langchain’s splitters to get more ideas on which splitter to use.

Splitting Data into Chunks
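A sketch of the splitting step with the numbers above; I am assuming RecursiveCharacterTextSplitter, one common choice among the Langchain splitters (note that for this splitter the sizes are measured in characters):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are in characters for this splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")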

Steps 3 & 4 – Embed & Store: In these steps we convert each split into an embedding vector and store it. An embedding vector is a mathematical representation of text.

Embedding text into vectors
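To make the idea concrete, a tiny sketch (the sample sentence is made up; at the time of writing, OpenAIEmbeddings defaults to OpenAI’s text-embedding-ada-002 model, which returns a 1,536-dimensional vector):

from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
vector = embedding.embed_query("Azure Files supports SMB and NFS shares.")
print(len(vector))  # 1536 for text-embedding-ada-002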

An embedding vector of a text block captures the context/meaning of the text, so texts that have similar meaning/context will have similar vectors. To convert the splits into their embedding vectors we use OpenAIEmbeddings. A special type of database, known as a vector database, is required to store these embeddings. In our example we will use Chroma, as it can run in memory. There are other options you can use, like Pinecone, Weaviate, and more. After storing the embeddings for my splits in the Chroma vector DB, I also persist it to disk to reuse in the next steps.

Embeddings in Vector Store
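A sketch of the store-and-persist step, continuing from the variables in the sketches above (the persist directory name is my placeholder):

from langchain.vectorstores import Chroma

# Embed every chunk and store the vectors in a local Chroma collection
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="docs/chroma/",
)
vectordb.persist()  # write the index to disk so the next steps can reuse it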

Step 5 & 6 – Retrieval & Generate Output: We have stored our embedding in chroma db. In our chat bot experience when the user asks a question. We send that question to the vector db and retrieve data splits that might have the answer. Retrieving the right data splits is a huge and evolving subject. There are lot of ways you can retrieve based on your application needs. I am using a common method of retrieval called Maximal Marginal Relevance (MMR). You can learn more techniques like basic semantic similarity, LLM aided retrieval & more. I will write a separate post talking about MMR and others in a separate post. For this post consider retrieval as we are getting top 3 data chunks that could have the context/answer for the question that user asked the chat bot. Once we retrieve that relevant chunks and pass it to Open AI LLM as ask it to generate an answer by using prompt. See my previous post about writing good prompts.

Retrieval & Generate output

The result I got is very accurate. Not only did the LLM correctly identify that there is no feature called polling, it also found a contextually relevant feature called change detection, which is similar to what polling refers to in a lot of products.

Result from LLM

If you need the full code for this and the images are not helping, reach out to me on Twitter. That’s it for Day 7 of 100 Days of AI.

If you understand RAG and the concepts associated with it here, like embeddings, vector databases, and retrieval techniques, you can generate a lot of ideas for interesting chatbots. Feel free to reach out and share those ideas with me.

AI PRODUCT IDEA ALERT 1: Every organization will want chat-with-your-data applications, and will also want their employees to create custom chat-with-your-data applications based on their needs without writing any code. Microsoft and other companies are launching products and features to enable large organizations to do this via Azure OpenAI, but I think there will be startups competing in this space as well.

AI PRODUCT IDEA ALERT 2: A ChatGPT for doctors trained not on the whole internet but on curated data from textbooks, recent research, best practices picked by a medical board, etc. I could see a highly curated LLM tuned for doctors.

AI PRODUCT IDEA ALERT 3: Similar to idea 2, there will be LLMs fine-tuned for education use cases, where all the information has to be accurate. I think there are other verticals that need curated data sets instead of the whole internet.

Follow me on Twitter or LinkedIn for the latest updates on 100 Days of AI, or bookmark this page.