Hybrid RAG pipeline using cloud and local models

Building a Hybrid RAG Engine with Local + Cloud Embeddings

Retrieval-Augmented Generation (RAG) is one of those patterns that becomes more interesting the deeper you go.
Once you begin working with real workloads, two challenges show up quickly:

  • Embedding API calls become expensive
  • Cloud latency can be unpredictable

I wanted to experiment with a setup that balances cloud power with local flexibility—a system that feels good to develop with and doesn’t punish you for iterating.
This led to a small project I’ve been improving over time:

👉 containerized-Local-LLM-ingest-retrieve
https://github.com/premsgdev/rag-structure

The idea behind it is simple:

Let cloud models and local models work together inside the same RAG engine.


🔍 Why a Hybrid RAG Engine?

Traditional RAG systems rely entirely on cloud embeddings:

text → cloud embedding → vector DB → cloud LLM

It works, but it comes with a cost—literally.

So this project splits the embedding stage into two independent pipelines:

1️⃣ Cloud Embedding Path

  • Model: models/embedding-001 (Gemini)
  • Stored in: ChromaDB collection policies
  • Strengths: high-quality semantic embeddings, predictable behavior

2️⃣ Local Embedding Path

  • Model: all-MiniLM-L6-v2 (Xenova / transformers)
  • Stored in: ChromaDB collection policies_xenova
  • Strengths: zero cost, fast, ideal for local development and fallback

Both embedding paths write to the same ChromaDB instance running locally, but they use different collections to keep vector spaces completely isolated.

The final LLM answer still comes from Gemini, but the pipeline becomes flexible and cost-aware.


Hybrid RAG architecture

This captures the flow:

- Documents → two embedding pipelines
- Both pipelines write to separate collections in the same local ChromaDB instance
- Retrieval chooses the appropriate collection
- Gemini generates the final answer
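To make the isolation concrete, here is a minimal sketch of how both paths can point at the same local ChromaDB instance while staying in separate collections. It assumes the chromadb JS client and the default local port; the collection names match the ones used in this project, but the exact setup code in the repo may differ.

import { ChromaClient } from "chromadb";

// One local ChromaDB instance (Docker, default port), two isolated vector spaces.
const chroma = new ChromaClient({ path: "http://localhost:8000" });

// Cloud embeddings (Gemini models/embedding-001) live here...
const cloudCollection = await chroma.getOrCreateCollection({ name: "policies" });

// ...and local embeddings (all-MiniLM-L6-v2 via Xenova) live here.
const localCollection = await chroma.getOrCreateCollection({ name: "policies_xenova" });

The two models produce vectors with different dimensionality and geometry, so keeping them in separate collections isn't just tidy housekeeping: mixing them in one collection would break similarity search.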

⚙️ Cloud Ingestion: Batching Embeddings

The cloud pipeline uses Gemini’s embedding model. Batching is essential to stay efficient:

// Assumes: `ai` is a GoogleGenAI client (@google/genai), EMBEDDING_MODEL = "models/embedding-001",
// and BATCH_SIZE is a constant defined elsewhere in the module (e.g. 100).
async function generateEmbeddings(allChunks: string[]): Promise<number[][]> {
  const totalChunks = allChunks.length;
  let allEmbeddings: number[][] = [];

  for (let i = 0; i < totalChunks; i += BATCH_SIZE) {
    const batch = allChunks.slice(i, i + BATCH_SIZE);

    const response = await ai.models.embedContent({
      model: EMBEDDING_MODEL,
      contents: batch.map(text => ({ parts: [{ text }] }))
    });

    // Each batch returns one embedding per input chunk; collect the raw vectors
    const vectors = (response.embeddings ?? []).map(e => e.values ?? []);
    allEmbeddings = allEmbeddings.concat(vectors);
  }

  return allEmbeddings;
}

Why batching matters:

- Reduces API calls
- Avoids hitting rate limits
- Improves ingestion throughput

All cloud embeddings are saved into the policies collection inside ChromaDB.
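For completeness, here is a sketch of that write, reusing the cloudCollection handle from the earlier snippet. The id scheme and metadata fields are illustrative, not necessarily what the repo does.

const embeddings = await generateEmbeddings(allChunks);

await cloudCollection.add({
  ids: allChunks.map((_, i) => `policy-chunk-${i}`), // illustrative id scheme
  documents: allChunks,
  embeddings,
  metadatas: allChunks.map(() => ({ source: "cloud", model: "models/embedding-001" }))
});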

⚡ Local Ingestion: Fast and Free

Local embeddings run entirely inside Node.js using @xenova/transformers:

// Assumes initializeExtractor() wraps pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2")
// from @xenova/transformers.
async function generateEmbeddingsXenova(allChunks: string[]): Promise<number[][]> {
  const extractor = await initializeExtractor();

  const embeddingsTensor = await extractor(allChunks, {
    pooling: "mean",
    normalize: true
  });

  const embeddingsArray: number[][] = [];

  // The output tensor is flat, with dims = [numChunks, embeddingDim]
  const dim = embeddingsTensor.dims[1];
  for (let i = 0; i < allChunks.length; i++) {
    // Convert each row of the tensor into a plain JS number[]
    embeddingsArray.push(Array.from(embeddingsTensor.data.slice(i * dim, (i + 1) * dim)));
  }

  return embeddingsArray;
}

Two important parameters:

- pooling: "mean" — produces one vector per chunk
- normalize: true — required for consistent cosine similarity searches

These embeddings go into the policies_xenova collection in the same local ChromaDB instance.
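Consistency matters at query time too: a question asked against policies_xenova has to be embedded with the same local model and the same pooling/normalization settings as the documents. A minimal sketch, reusing generateEmbeddingsXenova and the localCollection handle from above (both names are assumptions about how the pieces fit together):

async function queryLocal(question: string, nResults = 5) {
  // Embed the query exactly like the documents: mean pooling + normalization
  const [queryVector] = await generateEmbeddingsXenova([question]);

  return localCollection.query({
    queryEmbeddings: [queryVector],
    nResults
  });
}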

🏗️ Containerized Infrastructure

Everything is built to be reproducible and easy to spin up. The entire system runs using Docker Compose:

ChromaDB (vector store)

- Runs locally inside Docker
- Holds two collections, one for cloud embeddings, one for local embeddings
- Same instance, logically separated vector spaces

PostgreSQL

- Stores metadata like document source, timestamps, and types
- Enables hybrid search: semantic (Chroma) + structured filters

Redis

- Caches frequent LLM responses
- Reduces repeated calls to Gemini (see the caching sketch after this list)

Node.js services

- Cloud ingestion
- Local ingestion
- Retrieval
- Helpers for embedding and vector operations
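The Redis layer mentioned above is a classic read-through cache. Here is a minimal sketch with node-redis, keyed on the question plus the collection it was answered from; the key format and TTL are assumptions, and answerWithGemini is defined in the retrieval sketch below.

import { createClient } from "redis";
import { createHash } from "node:crypto";

const redis = createClient({ url: "redis://localhost:6379" });
await redis.connect();

async function cachedAnswer(question: string, collectionName: string): Promise<string> {
  const key = "answer:" + createHash("sha256")
    .update(`${collectionName}:${question}`)
    .digest("hex");

  const hit = await redis.get(key);
  if (hit) return hit; // cache hit: skip the Gemini call entirely

  const answer = await answerWithGemini(question, collectionName);
  await redis.set(key, answer, { EX: 60 * 60 }); // 1 hour TTL (illustrative)
  return answer;
}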

🔁 Retrieval Flow

Retrieval Flow Diagram
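In place of the diagram, here is a compressed sketch of the whole retrieval step: embed the question with the model that matches the chosen collection, query ChromaDB, and hand the retrieved chunks to Gemini for the final answer. The generation model name and prompt format are placeholders; only the overall flow mirrors the project.

async function answerWithGemini(question: string, collectionName: string): Promise<string> {
  // 1. Embed the question with the same model that produced the collection's vectors
  const queryVector = collectionName === "policies_xenova"
    ? (await generateEmbeddingsXenova([question]))[0]
    : (await generateEmbeddings([question]))[0];

  // 2. Semantic retrieval from the chosen ChromaDB collection
  const collection = await chroma.getOrCreateCollection({ name: collectionName });
  const results = await collection.query({ queryEmbeddings: [queryVector], nResults: 5 });
  const context = (results.documents[0] || []).join("\n---\n");

  // 3. The final answer still comes from Gemini
  const response = await ai.models.generateContent({
    model: "gemini-2.0-flash", // placeholder: any Gemini chat model works here
    contents: `Answer using only this context:\n${context}\n\nQuestion: ${question}`
  });

  return response.text ?? "";
}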

🌱 Why I Built This

I enjoy exploring the space where developer experience, performance, and cost meet. RAG is powerful, but I’ve always felt it could be made more flexible—something you can run anywhere, tweak freely, and experiment with safely.

This hybrid setup grew out of that curiosity.

If you’re working on retrieval systems yourself, I hope this gives you a useful starting point or sparks a new idea.

🔗 Repository

👉 containerized-Local-LLM-ingest-retrieve https://github.com/premsgdev/rag-structure

If this helps you build your own RAG pipeline—or inspires improvements—I’d love to see what you create.

Happy building! 🚀
