User Guides
RAG API (Chat with Files)


The RAG (Retrieval-Augmented Generation) API is a powerful tool that integrates with LibreChat to provide context-aware responses based on user-uploaded files.

It leverages LangChain, PostgreSQL with the pgvector extension, and a Python FastAPI server to index and retrieve relevant documents, enhancing the conversational experience.

For further details, refer to the configuration guide provided here: RAG API Configuration


Currently, this feature is available for all Custom Endpoints, as well as OpenAI, Azure OpenAI, Anthropic, and Google.

OpenAI Assistants have their own implementation of RAG through the “Retrieval” capability. Learn more about it here.

Support for using the RAG API with the Assistants API will be introduced in a future update. It will still be useful there, since OpenAI charges separately for both file storage and use of “Retrieval.”

Plugins are not supported, as the whole “plugin/tool” framework will receive a complete rework soon, making tools available to most endpoints (ETA Summer 2024).

Still confused about RAG? Read the section below, which explains the general concept in more detail and links to a helpful video.

What is RAG?

RAG, or Retrieval-Augmented Generation, is an AI framework designed to improve the quality and accuracy of responses generated by large language models (LLMs). It achieves this by grounding the LLM on external sources of knowledge, supplementing the model’s internal representation of information.


Key Features

  • Document Indexing: The RAG API indexes user-uploaded files, creating embeddings for efficient retrieval.
  • Semantic Search: It performs semantic search over the indexed documents to find the most relevant information based on the user’s input.
  • Context-Aware Responses: By augmenting the user’s prompt with retrieved information, the API enables LibreChat to generate more accurate and contextually relevant responses.
  • Asynchronous Processing: The API supports asynchronous operations for improved performance and scalability.
  • Flexible Configuration: It allows customization of various parameters such as chunk size, overlap, and embedding models.
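The chunking step behind the configurable chunk size and overlap parameters can be sketched in plain Python. This is a minimal illustration only: the actual RAG API uses LangChain's text splitters, and the function name and default values here are hypothetical, not the API's own.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks before embedding.

    chunk_size and chunk_overlap mirror the kind of parameters the RAG API
    exposes; the defaults here are illustrative, not official.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = chunk_size - chunk_overlap  # advance so consecutive chunks overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 4000-character document yields three chunks; each chunk after the first
# repeats the last 100 characters of its predecessor for retrieval context.
chunks = chunk_text("A" * 4000, chunk_size=1500, chunk_overlap=100)
```

The overlap matters: without it, a sentence split across a chunk boundary could never be retrieved intact.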

Key Benefits of RAG

  1. Access to up-to-date and reliable facts: RAG ensures that the LLM has access to the most current and reliable information by retrieving relevant facts from an external knowledge base.
  2. Transparency and trust: Users can access the model’s sources, allowing them to verify the accuracy of the generated responses and build trust in the system.
  3. Reduced data leakage and hallucinations: By grounding the LLM on a set of external, verifiable facts, RAG reduces the chances of the model leaking sensitive data or generating incorrect or misleading information.
  4. Lower computational and financial costs: RAG reduces the need for continuous training and updating of the model’s parameters, potentially lowering the computational and financial costs of running LLM-powered chatbots in an enterprise setting.

How RAG Works

RAG consists of two main phases: retrieval and content generation.

  1. Retrieval Phase: Algorithms search for and retrieve snippets of information relevant to the user’s prompt or question from an external knowledge base. In an open-domain, consumer setting, these facts can come from indexed documents on the internet. In a closed-domain, enterprise setting, a narrower set of sources is typically used for added security and reliability.
  2. Generative Phase: The retrieved external knowledge is appended to the user’s prompt and passed to the LLM. The LLM then draws from the augmented prompt and its internal representation of its training data to synthesize a tailored, engaging answer for the user. The answer can be passed to a chatbot with links to its sources.
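The two phases above can be illustrated with a toy pipeline. This is a sketch only: it substitutes a trivial bag-of-words “embedding” and in-memory cosine similarity for the real embedding model and pgvector index, and all function names are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts (stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Retrieval phase: rank indexed chunks by similarity to the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def augment_prompt(query: str, docs: list[str]) -> str:
    # Generative phase (first half): prepend retrieved context to the
    # user's prompt before it is sent to the LLM.
    context = "\n".join(retrieve(query, docs))
    return f"Use the following context to answer.\n{context}\n\nQuestion: {query}"

docs = [
    "LibreChat supports custom endpoints.",
    "PGVector stores embeddings in PostgreSQL.",
    "The moon orbits the Earth.",
]
prompt = augment_prompt("Where are embeddings stored?", docs)
```

In the real system, the ranking happens inside PostgreSQL via a pgvector similarity query rather than in application code, but the shape of the pipeline is the same.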

Challenges and Ongoing Research

While RAG is currently one of the best-known tools for grounding LLMs on the latest, verifiable information and lowering the costs of constant retraining and updating, it’s not perfect. Some challenges include:

  1. Recognizing unanswerable questions: LLMs need to be explicitly trained to recognize questions they can’t answer based on the available information. This may require fine-tuning on thousands of examples of answerable and unanswerable questions.
  2. Improving retrieval and generation: Ongoing research focuses on innovating at both ends of the RAG process: improving the retrieval of the most relevant information possible to feed the LLM, and optimizing the structure of that information to obtain the richest responses from the LLM.

In summary, RAG is a powerful framework that enhances the capabilities of LLMs by grounding them on external, verifiable knowledge. It helps to ensure more accurate, up-to-date, and trustworthy responses while reducing the costs associated with continuous model retraining. As research in this area progresses, we can expect further improvements in the quality and efficiency of LLM-powered conversational AI systems.

For a more detailed explanation of RAG, you can watch this informative video by IBM on YouTube:


The RAG API is a powerful addition to LibreChat, enabling context-aware responses based on user-uploaded files. By leveraging LangChain and FastAPI, it provides efficient document indexing, retrieval, and generation capabilities. With its flexible configuration options and seamless integration, the RAG API enhances the conversational experience in LibreChat.

For more detailed information on the RAG API, including API endpoints, request/response formats, and advanced configuration, please refer to the official RAG API documentation.