Optimizing RAG Performance in LibreChat

Note: This is a guest post. The information provided may not be accurate if the underlying RAG API changes in future LibreChat versions. Always refer to the official documentation for the most up-to-date information.
This guide walks you through optimizing Retrieval-Augmented Generation (RAG) performance in your LibreChat setup. Always change only one major setting at a time and test results carefully.
1. Optimize Database (vectordb - PostgreSQL/pgvector)
Improving database performance is crucial for RAG speed, especially during indexing and retrieval.
1.1. Verify/Create Metadata & Filter Indexes (CRITICAL)
Missing indexes for filtering can drastically degrade performance.
Connect to the Database:
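A minimal sketch, assuming the default `vectordb` compose service name and the default `POSTGRES_USER`/`POSTGRES_DB` values from your `.env` (adjust all three to match your deployment):

```bash
# Open a psql session inside the vectordb container
docker exec -it vectordb psql -U myuser -d mydatabase
```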
Check for existing indexes:
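Inside the `psql` session, list the indexes (the table name `langchain_pg_embedding` is the standard LangChain/pgvector schema; verify it matches your setup):

```sql
-- Quick listing of all indexes
\di

-- Or inspect the embedding table's indexes specifically
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'langchain_pg_embedding';
```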
You should see:
- `custom_id_idx`
- `idx_cmetadata_file_id_text`
- A vector index such as `langchain_pg_embedding_embedding_idx`
If missing, run:
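A sketch of the missing index definitions, assuming the standard LangChain/pgvector schema where `custom_id` is a plain column and `cmetadata` is the JSONB metadata column (verify the column names in your database first):

```sql
-- Index the custom_id column used for per-conversation/file filtering
CREATE INDEX IF NOT EXISTS custom_id_idx
  ON langchain_pg_embedding (custom_id);

-- Expression index on the file_id stored inside the JSONB metadata
CREATE INDEX IF NOT EXISTS idx_cmetadata_file_id_text
  ON langchain_pg_embedding ((cmetadata ->> 'file_id'));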
Exit the `psql` session with `\q`.
1.2. Verify/Tune Vector Index
The pgvector extension typically creates an index on the embedding column.
Check with `\di` again. Look for an `hnsw` or `ivfflat` index type.
⚙️ Advanced: You can tune index parameters such as `lists`, `m`, `ef_search`, and `ef_construction` (see the pgvector README).
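As an illustration only (the index name, operator class, and values here are assumptions — match them to your actual schema and distance function), an HNSW index with explicit build parameters might look like:

```sql
-- m and ef_construction trade build time and index size for recall
CREATE INDEX langchain_pg_embedding_embedding_idx
  ON langchain_pg_embedding
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time knob (per session): higher = better recall, slower queries
SET hnsw.ef_search = 40;
```

For `ivfflat` indexes, the equivalent build-time knob is `lists` and the query-time knob is `ivfflat.probes`.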
1.3. Monitor/Adjust Server Resources
Watch for memory and/or CPU saturation. PostgreSQL benefits from abundant RAM.
Optional: Set resource limits in docker-compose.override.yml
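An illustrative override (the limits below are placeholders — size them to your host):

```yaml
# docker-compose.override.yml
services:
  vectordb:
    mem_limit: 4g
    cpus: 2.0
```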
1.4. Perform Database Maintenance
Run regularly:
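A typical maintenance pass, assuming the standard `langchain_pg_embedding` table name:

```sql
-- Reclaim dead tuples and refresh planner statistics
VACUUM ANALYZE langchain_pg_embedding;

-- Optional: track the table's on-disk size over time
SELECT pg_size_pretty(pg_total_relation_size('langchain_pg_embedding'));
```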
1.5. Advanced PostgreSQL Tuning
Consider tuning:
- `shared_buffers`
- `work_mem`
- `maintenance_work_mem`
- `effective_cache_size`
These live in postgresql.conf (inside the container). Only touch them if you know what you're doing.
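Purely illustrative starting points for a host with around 8 GB of RAM (tune to your workload; these are rules of thumb, not recommendations):

```ini
# postgresql.conf
shared_buffers = 2GB            # ~25% of RAM is a common rule of thumb
work_mem = 64MB                 # applied per sort/hash operation, keep modest
maintenance_work_mem = 512MB    # speeds up VACUUM and index builds
effective_cache_size = 6GB      # planner hint only, not an allocation
```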
2. Tune Chunking Strategy (.env)
Impacts upload speed and retrieval precision.
2.1. Open the main .env file:
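```bash
# From the LibreChat project root, with your editor of choice
nano .env
```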
2.2. Modify chunk settings:
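Using the variable names from this guide's summary (confirm the exact names and defaults against the official RAG API documentation):

```ini
CHUNK_SIZE=1500
CHUNK_OVERLAP=100
```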
Try other combinations like:
- 1000/100
- 500/50
Trade-offs:
- Larger chunks = faster processing, lower precision
- Smaller chunks = slower, more precise
2.3. Save, exit, and restart:
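Assuming the default compose service names (`rag_api`, `api`):

```bash
docker compose restart rag_api api
```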
2.4. Delete Old Embeddings
- Easiest: Delete files via UI
- Advanced: Delete from DB
🔁 Safer method: Use a new test file for each config test
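The advanced DB deletion could look like the following, assuming the standard table name — this is destructive, so back up first:

```sql
-- Remove ALL stored embeddings
TRUNCATE TABLE langchain_pg_embedding;

-- Or remove embeddings for a single file only
-- ('<your-file-id>' is a placeholder for a real file_id value)
DELETE FROM langchain_pg_embedding
WHERE cmetadata ->> 'file_id' = '<your-file-id>';
```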
2.5. Re-upload & test performance
3. Optimize Embedding Process
Set provider/model in .env.
Examples:
OpenAI:
Azure:
Ollama (local):
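The `.env` entries for the three providers above might look like this. The variable names (`EMBEDDINGS_PROVIDER`, `EMBEDDINGS_MODEL`, `OLLAMA_BASE_URL`) and model names are assumptions based on common RAG API conventions — credential variables are omitted here, so check the official documentation for the exact names your version expects:

```ini
# OpenAI
EMBEDDINGS_PROVIDER=openai
EMBEDDINGS_MODEL=text-embedding-3-small

# Azure OpenAI
EMBEDDINGS_PROVIDER=azure
EMBEDDINGS_MODEL=text-embedding-3-small

# Ollama (local)
EMBEDDINGS_PROVIDER=ollama
EMBEDDINGS_MODEL=nomic-embed-text
OLLAMA_BASE_URL=http://host.docker.internal:11434
```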
Restart and re-upload:
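```bash
# Assuming the default rag_api service name
docker compose restart rag_api
```

Then re-upload a test file and confirm embeddings are generated by the new provider.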
4. Tune Retrieval Strategy
The number of chunks retrieved per query affects both answer relevance and how quickly you hit model token limits.
4.1. In .env:
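Using the variable name from this guide's summary (the default of 5 here is an assumption — verify against the official documentation):

```ini
RAG_API_TOP_K=5
```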
- Lower `TOP_K` = safer, faster
- Higher `TOP_K` = more context, but a higher risk of hitting token limits
5. Monitor LibreChat API Logs
Check for truncation or token overflows.
5.1. Run a large query, then:
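A sketch of a log check, assuming the default `api` service name; the keyword filter is generic because exact log wording varies by version:

```bash
# Tail recent API logs and filter for token/truncation-related messages
docker compose logs --tail=500 api | grep -iE "token|trunc|context"
```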
Search for messages mentioning token limits, truncation, or context overflow (the exact wording varies by LibreChat version). If present, reduce `TOP_K` or `CHUNK_SIZE`.
6. Manage Server Load & Isolation
6.1. Monitor:
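```bash
# Host-level CPU/RAM usage
htop

# Live per-container CPU and memory usage
docker stats
```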
6.2. Reduce Load (temporarily):
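For example, stopping a non-essential service during testing (the service name below assumes the default compose file and may differ in yours):

```bash
docker compose stop meilisearch
```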
6.3. Upgrade Hardware
- More RAM/CPU
- Use SSD (preferably NVMe)
- GPU boosts embedding (if using local models)
6.4. Advanced: Separate Services
You can host vectordb and rag_api on separate machines for heavy workloads.
Summary
Start with index optimization. Then move on to:
- Chunk tuning (`CHUNK_SIZE`, `CHUNK_OVERLAP`)
- Retrieval strategy (`RAG_API_TOP_K`)
- Embedding configuration
- API log monitoring
Test each change independently. Always monitor API logs and resource usage. If issues persist, consider model/hardware upgrades.