Reading Time: ~10 minutes
GitHub Repo: https://github.com/Jet-Engine/art-deco-chatbot
Large Language Models (LLMs) have advanced significantly and can now answer a broad array of questions. However, they still struggle with specific or recent information, which often leads to inaccuracies or “hallucinations.” To address this, the Retrieval Augmented Generation (RAG) approach adds a document retrieval step to the response generation process. RAG draws on a corpus of documents and uses a vector database for efficient retrieval. It improves the accuracy and reliability of LLM responses through three key steps: retrieving the documents most relevant to a query, augmenting the prompt with them, and generating a response grounded in that context.
Vector databases facilitate quick similarity searches and efficient data management, making RAG a powerful solution for enhancing LLM capabilities.
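The retrieve, augment, and generate steps can be sketched end to end in a few lines. This is a toy illustration, not the project's code. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as nomic-embed-text. An in-memory list replaces the vector database. `generate` is any function that sends a prompt to an LLM.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Hypothetical bag-of-words 'embedding', used only to keep the sketch runnable."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 1 (retrieval): return the k documents most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]


def augment(query: str, contexts: list[str]) -> str:
    """Step 2 (augmentation): place the retrieved text in the prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def rag_answer(query: str, docs: list[str], generate: Callable[[str], str], k: int = 2) -> str:
    """Step 3 (generation): pass the augmented prompt to an LLM."""
    return generate(augment(query, retrieve(query, docs, k)))
```

In the real system, retrieval is a vector database search and `generate` calls a model such as Llama3.1. Only the shape of the pipeline carries over from this sketch.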
The Art Deco era, spanning the roaring 1920s to the 1940s, left a dazzling legacy in architecture. Despite the capabilities of models like Meta’s Llama3.1, their responses can be unreliable, especially for nuanced or detailed queries specific to Art Deco. Our goal with the Art Deco ChatBot is to use RAG to improve the quality of responses about Art Deco architecture, comparing these with those generated by traditional LLMs in both quality and time efficiency.
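By designing the Art Deco ChatBot, we also aim to show how a complex RAG system can be built. You can access the complete code in the Art Deco ChatBot GitHub repository. By examining the code and reading this README, you will learn how each stage of a RAG pipeline is built: data scraping, document chunking, embedding creation, and vector database integration.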
Ollama is a tool for running LLMs easily on local machines. After installing it, pull the two models used in this project:
ollama pull llama3.1 (the LLM used for RAG)
ollama pull nomic-embed-text (the embedding model used for RAG)
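Once both models are pulled, you can check that the embedding model works by calling Ollama's local REST API directly. This sketch uses only the standard library and assumes Ollama is listening on its default port, 11434. It targets the `/api/embeddings` endpoint. Recent Ollama versions also expose `/api/embed`, which takes `input` instead of `prompt`. The project itself may call the models differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address (assumption: unchanged)


def build_embedding_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Send the request to a running Ollama server and return the embedding vector."""
    with urllib.request.urlopen(build_embedding_request(text, model)) as resp:
        return json.loads(resp.read())["embedding"]
```

With the server running, `len(embed("Chrysler Building"))` gives the dimension of the vectors you will store in the vector database.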
In this project, we aim not only to show how RAG can be implemented but also to compare and benchmark RAG results against direct queries to different LLMs. Some of these LLMs cannot be run locally (like GPT-4o), while others are compute-heavy and run on cloud services (like Llama3.1:70b on Groq).
LiteLLM provides a unified interface to query different LLMs, making our code cleaner and more readable. Checking out the LiteLLM Python library is recommended but not required for this project.
Get your API keys from OpenAI and Groq to use them in the project. Be aware that you may be billed for using these services: while the Groq API can be used for free at the time of writing, the OpenAI API is not free.
PulseJet is a high-performance vector database that enables efficient storage and retrieval of document embeddings. To set up PulseJet:
1. Install the PulseJet Python library: pip install pulsejet
2. Run the PulseJet Docker container: docker run --name pulsejet_container -p 4000:4000 -p 47044-47045:47044-47045 jetngine/pulsejet
Note: You can skip the first step, since pulsejet is already included in the requirements.txt file.
See the PulseJet docs for details on running the PulseJet Docker image and on using the pulsejet Python library for vector database operations.
Install all necessary dependencies by running:
pip install -r requirements.txt
This project was developed in a conda environment with Python 3.11. Because we have not tested the project in other environments, we recommend using this configuration to avoid compatibility issues.
The Art Deco ChatBot uses two YAML files for configuration: config.template.yaml and secrets.yaml.
Create a secrets.yaml file containing your API keys.
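Keeping the keys in secrets.yaml keeps them out of version control. One way to pass them to the LLM clients is to export them as the environment variables that the OpenAI and Groq SDKs (and LiteLLM) read: `OPENAI_API_KEY` and `GROQ_API_KEY`. This is a sketch under assumptions. The key names `openai_api_key` and `groq_api_key` inside secrets.yaml are hypothetical, and `load_secrets` requires PyYAML.

```python
import os

# Hypothetical secrets.yaml keys mapped to the environment variables the SDKs read.
ENV_NAMES = {"openai_api_key": "OPENAI_API_KEY", "groq_api_key": "GROQ_API_KEY"}


def load_secrets(path: str = "secrets.yaml") -> dict:
    """Parse secrets.yaml into a dict (requires PyYAML)."""
    import yaml  # imported lazily so the rest of the sketch works without PyYAML

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f) or {}


def export_api_keys(secrets: dict) -> list[str]:
    """Set each known, non-empty key as an environment variable; return the names set."""
    exported = []
    for key, env_name in ENV_NAMES.items():
        value = secrets.get(key)
        if value:
            os.environ[env_name] = value
            exported.append(env_name)
    return exported
```

Typically you would call `export_api_keys(load_secrets())` once at start-up, before querying any model.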
The configuration file is organized into the following sections: Models, Vector Database, Pulsejet Configuration, File Paths, Embeddings, LLM Models Configuration, and RAG Parameters.
In the Embeddings section, when use_precalculated_embeddings is set to true, the system loads embeddings from the specified file. When it is false, the system generates new embeddings and saves them to this file.
Ensure you update these configuration files with your specific settings before running the project. Adjusting the RAG parameters can significantly impact the performance and accuracy of the RAG system. Experimentation with different values may be necessary to find the optimal configuration for your specific use case and document set.
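Document chunking is one of the steps this project covers, and it shows why such parameters matter. The sketch below assumes simple fixed-size character chunks with overlap. The names `chunk_size` and `overlap` are illustrative and may not match the keys in config.template.yaml.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character chunks of `chunk_size`, each overlapping the previous by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk already reaches the end
            break
    return chunks
```

Overlap tends to keep a sentence that straddles a chunk boundary retrievable from at least one chunk. Larger chunks mean fewer vectors to store but coarser matches. This trade-off is why experimenting with these values pays off.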
wiki-bot.py
This step is optional, since the contents of all scraped Wikipedia articles are available in the Hugging Face dataset at https://huggingface.co/datasets/JetEngine/Art_Deco_USA_DS.
You can download this dataset and copy all of its text files into the rag_files directory. If you plan to use pre-calculated embeddings, which are explained in the next section, you don't need to download this dataset at all.
There is no need to repeat the scraping process. You can skip the rest of this section if you are not interested in how the data was scraped.
Our first step is to gather knowledge about Art Deco architecture. We focus on U.S. structures, given their prominence in the Art Deco movement. The wiki-bot.py script automates the collection of relevant Wikipedia articles and organizes them into a structured directory for easy access.
Run the scraper bot using:
python wiki-bot.py
When you run wiki-bot.py with an empty rag_files directory, it saves the contents of the scraped Wikipedia articles in a sub-folder named text under rag_files. The bot also creates various other sub-folders to organize different types of data, such as article URLs and references. Since our current focus is only on the article contents, we transferred only the files from the text sub-folder to our Hugging Face dataset and removed all other sub-folders to reduce clutter.
If you want to run the bot yourself (which is optional, since the scraped documents are already available on Hugging Face), you have two options. You can copy all files from the text sub-folder to the rag_files directory and then delete all sub-folders within rag_files. Alternatively, you can simply change rag_files_path in config.yaml to rag_files/text.
indexing.py
Index the documents by running:
python indexing.py
This script processes the documents, generates embeddings, and stores them in PulseJet. If you don't want to spend time generating embeddings, you can download pre-calculated embeddings from https://huggingface.co/JetEngine/rag_art_deco_embeddings and set use_precalculated_embeddings: true in the configuration.
In our setup, generating the embeddings takes around 15 minutes, while inserting the vectors into PulseJet takes around 4 seconds.
The script outputs timing information for its main stages, including embedding generation and vector insertion.
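Per-stage timings like these can be collected with a small context manager. This is an illustrative sketch. The stage names are hypothetical, and indexing.py may measure time differently.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of the enclosed block under `stage` and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
        print(f"{stage}: {timings[stage]:.2f} s")
```

Wrapping each phase, for example `with timed("embedding generation"): ...` and `with timed("vector insertion"): ...`, produces a per-stage breakdown in the style of the figures above.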
chat.py
Ensure your configuration is correct, then run:
python chat.py
This script queries different LLMs and the RAG system, outputting results in HTML, JSON, and CSV formats for comparison.
Pulsejet is used in this project for efficient vector storage and retrieval. Here’s a detailed overview of how Pulsejet is integrated into our Art Deco ChatBot project:
1. Initializing the Pulsejet Client:
This creates a Pulsejet client. In our project, we're using a remote Pulsejet instance, so location is set to "remote". This connects to a Pulsejet server running in a Docker container.
2. Creating a Collection:
This creates a new collection in Pulsejet to store our document embeddings. The vector_config parameter specifies the configuration for the vector storage, such as the vector size and index type (e.g., HNSW for efficient similarity search).
3. Inserting Vectors:
In our project, we use the following pattern for inserting vectors:
This might look confusing at first, but here’s what it means:
- collection[0] is our Pulsejet client instance.
- collection[1] is the name of the collection we're inserting into.
- embed is the vector we're inserting.
- meta is additional metadata associated with the vector.
This is equivalent to calling:
For bulk insertions, we use:
This inserts multiple embeddings at once, which is more efficient for large datasets.
4. Searching Vectors:
This performs a similarity search in the specified Pulsejet collection to find the most relevant documents for a given query vector. The limit parameter specifies the maximum number of results to return.
In our project, client['db'] is used to access the database methods of the Pulsejet client. This is equivalent to using the client directly:
5. Closing the Connection:
This closes the connection to the Pulsejet database when it’s no longer needed.
The PulsejetRagClient class is defined in pulsejet_rag_client.py and provides a high-level interface for interacting with PulseJet in the context of our RAG system. Here's a breakdown of its key components:
1. Initialization:
The client is initialized with configuration parameters, setting up the PulseJet client and storing relevant config values.
2. Creating a Collection:
This method creates a new collection in PulseJet with the specified parameters. It uses the get_vector_size function to determine the appropriate vector size for the embeddings.
3. Inserting Vectors:
These methods handle the insertion of single and multiple vectors into the PulseJet collection, along with their associated metadata.
4. Searching Vectors:
This method performs a similarity search in the PulseJet collection to find the most relevant documents for a given query vector.
5. Closing the Connection:
This method closes the connection to the PulseJet database when it’s no longer needed.
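Steps 2 and 5 can be illustrated without the PulseJet library. How get_vector_size works in pulsejet_rag_client.py is not shown here, so the version below is an assumption. It embeds a probe string and measures the result, which is one common way to size a collection to match whichever embedding model is configured. `FakeRagClient` is a hypothetical placeholder with a `close()` method.

```python
from contextlib import closing
from typing import Callable, Sequence


def get_vector_size(embed: Callable[[str], Sequence[float]], probe: str = "vector size probe") -> int:
    """Infer the embedding dimension by embedding a short probe string once (assumed approach)."""
    return len(embed(probe))


class FakeRagClient:
    """Hypothetical stand-in for PulsejetRagClient, showing only the lifecycle."""

    def __init__(self, vector_size: int):
        self.vector_size = vector_size
        self.closed = False

    def close(self) -> None:
        # A real client would release its database connection here.
        self.closed = True
```

`contextlib.closing` calls `close()` when the `with` block exits, even if an exception is raised. Writing `with closing(client) as c: ...` therefore guarantees step 5 without a manual close call.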
The PulsejetRagClient is used throughout the project to interact with PulseJet. Here's how it's typically instantiated and used:
1. Creation:
2. Indexing Documents:
In indexing.py, we use the client to create the collection and insert vectors:
3. Searching Similar Vectors:
In ollama_rag.py, we use the client to search for similar vectors during the RAG process:
4. Closing the Connection:
After operations are complete, we close the connection:
This implementation provides a clean, encapsulated interface for all PulseJet operations in our RAG system.
RAG queries to Llama3.1 take longer than simple question answering, because the retrieved context increases the query length.
The Art Deco ChatBot demonstrates how RAG can help LLMs give better answers. Our project offers a comprehensive exploration of RAG implementation, covering every step from data scraping and document chunking to embedding creation and vector database integration.
As the document base for a RAG system grows larger, the performance of insertion and search operations becomes increasingly critical. By learning how to integrate the Pulsejet vector database into a full-fledged RAG system, one can significantly benefit from its capabilities, particularly when dealing with RAG applications on large document bases.
Our RAG responses could have been more accurate. To enhance our Art Deco ChatBot’s performance, we are considering several experimental approaches:
We plan to expand this project through the following initiatives:
We encourage you to experiment with the Art Deco ChatBot, modify its parameters, and adapt it to your own domains of interest.