
Set up local LLMs & embedding models

Prepare local models

NOTE

If you are using the Docker image, replace http://localhost with http://host.docker.internal so the app can reach services running on the host machine. See more details.
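
For example, if the app runs in Docker and Ollama runs on the host, the Ollama base_url shown below becomes:

base_url: http://host.docker.internal:11434/v1/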

Install Ollama and start the application.

Pull your models (e.g.):

ollama pull llama3.1:8b
ollama pull nomic-embed-text
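
Optionally, confirm the models are available locally:

ollama list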

Set up the LLM and embedding model on the Resources tab with type OpenAI. Use these parameters to connect to Ollama:

api_key: ollama
base_url: http://localhost:11434/v1/
model: llama3.1:8b (for LLM) | nomic-embed-text (for embedding)
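
Before wiring this into the app, you can sanity-check the endpoint with the openai Python package (a minimal sketch, assuming that package is installed, the two models pulled above, and Ollama on its default port 11434):

# sanity_check_ollama.py - probe Ollama's OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1/")

# chat completion against the pulled LLM
reply = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(reply.choices[0].message.content)

# embedding against the pulled embedding model
emb = client.embeddings.create(model="nomic-embed-text", input="hello world")
print(len(emb.data[0].embedding))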


oobabooga/text-generation-webui OpenAI-compatible server

Install oobabooga/text-generation-webui.

Follow the setup guide to download your models (GGUF, HF). Also take a look at OpenAI compatible server for detailed instructions.

Here is a short version:

# install sentence-transformers so the server can create embeddings
pip install sentence_transformers
# from the text-generation-webui source directory, start the server with the API enabled
python server.py --api

Use the Models tab to download a new model and press Load.

Set up the LLM and embedding model on the Resources tab with type OpenAI. Use these parameters to connect to text-generation-webui:

api_key: dummy
base_url: http://localhost:5000/v1/
model: any
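
To confirm the server is reachable, a quick probe with the openai Python package can help (a sketch assuming the default API port 5000; text-generation-webui answers with whichever model you loaded in the Models tab, so the model field is effectively a placeholder):

# check_tgwui.py - probe text-generation-webui's OpenAI-compatible API
from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:5000/v1/")

# list the model(s) the server reports (should match what you loaded in the Models tab)
for m in client.models.list():
    print(m.id)

# the loaded model responds regardless of the placeholder model name
reply = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(reply.choices[0].message.content)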

llama-cpp-python server (LLM only)

See the llama-cpp-python OpenAI server documentation.

Download any GGUF model weights from HuggingFace or another source and place them somewhere on your local machine.

Run:

LOCAL_MODEL=<path/to/GGUF> python scripts/serve_local.py

Set up the LLM on the Resources tab with type OpenAI. Use these parameters to connect to llama-cpp-python:

api_key: dummy
base_url: http://localhost:8000/v1/
model: model_name
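
As a quick test of the llama-cpp-python server, you can stream a short completion (a minimal sketch assuming the default port 8000 used above; the model field is a placeholder since the server runs the GGUF it was started with):

# check_llama_cpp.py - stream a short reply from the llama-cpp-python server
from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1/")

stream = client.chat.completions.create(
    model="model_name",  # placeholder; the server uses the GGUF it was started with
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # print tokens as they arrive, skipping empty keep-alive chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()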

Use local models for RAG

  • Set the default LLM and embedding model to a local variant.


  • Set the embedding model for the File Collection to a local model (e.g., ollama).


  • Go to Retrieval settings and choose a local model (e.g., ollama) as the LLM relevance scoring model. Alternatively, you can disable this feature if your machine cannot handle many parallel LLM requests at once.


You are set! Start a new conversation to test your local RAG pipeline.