Ollama: Local LLM Serving & Inference
A production-focused guide to running large language models locally with Ollama — from installation and model management to GPU acceleration, quantization trade-offs, OpenAI-compatible APIs, custom Modelfiles, embeddings, vision models, LangChain/LlamaIndex integration, production deployment with Docker Compose, and performance tuning for maximum throughput.
By Jose Nobile | Updated 2026-04-23 | 28 min read
1. What is Ollama
Ollama is an open-source tool for running large language models locally on your own hardware. It wraps the llama.cpp inference engine in a user-friendly CLI and REST API, handling model downloading, quantization, GPU offloading, and serving with zero configuration. Think of it as "Docker for LLMs" — you ollama pull a model and ollama run it, just like pulling and running a container image.
Running models locally provides three main advantages. Privacy: your data never leaves your machine, which matters for proprietary code, medical records, and confidential business data. Cost: after the initial hardware investment there are no per-token charges (only power and maintenance), which can save thousands of dollars per month at high volume. Latency: local inference eliminates network round-trips, so small models can begin responding in under 100 ms, which is hard to achieve through a cloud API.
Ollama supports all major operating systems (Linux, macOS, Windows), all major GPU vendors (NVIDIA CUDA, AMD ROCm, Apple Metal), and exposes an OpenAI-compatible API that makes it a drop-in replacement for cloud LLM endpoints. Models are stored as GGUF files — a quantized format optimized for CPU and GPU inference with llama.cpp — and range from 1B parameter models that run on a laptop to 70B+ models that require multi-GPU workstations.
100% Local Inference
All data stays on your machine. No API keys, no cloud dependency, no data logging. Critical for healthcare, legal, finance, and any domain with compliance requirements (HIPAA, GDPR, SOC 2).
Zero Per-Token Cost
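After the hardware investment, there are no per-token charges. A single RTX 4090 running Qwen 3 8B at ~80 tok/s can generate roughly 200 million tokens a month if kept busy around the clock. The $2,000-5,000/month comparison corresponds to cloud pricing of about $10-25 per million tokens; smaller hosted models cost considerably less, so compare against the API tier you would actually use.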
OpenAI-Compatible API
Drop-in replacement for OpenAI endpoints. Change base_url to http://localhost:11434/v1 and your existing code works. Supports chat completions, embeddings, and streaming.
Broad Model Support
Access 200+ models from the Ollama library: Llama 4, Qwen 3, Gemma 3, DeepSeek V3, Phi-4, Mistral, CodeLlama, and more. New models are available within days of release.
2. Installation
Ollama installs in under a minute on all major platforms. The official install script handles GPU driver detection, binary installation, and systemd service configuration on Linux. On macOS, the desktop app includes Metal GPU acceleration out of the box. On Windows, the installer configures CUDA automatically if NVIDIA drivers are present.
For production and CI/CD environments, the official Docker image (ollama/ollama) provides a containerized deployment with GPU passthrough via NVIDIA Container Toolkit. This is the recommended approach for servers and Kubernetes clusters where you want isolated, reproducible inference environments.
Linux (Recommended)
One-line install with automatic GPU detection. Creates a systemd service that starts on boot. Supports NVIDIA (CUDA 11.7+) and AMD (ROCm 6.0+) GPUs.
# Install Ollama (auto-detects GPU)
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the server manually (skip this if the systemd service is already running;
# a second instance fails because port 11434 is in use)
ollama serve
# Pull and run a model
ollama pull llama3.2
ollama run llama3.2 "Explain GGUF quantization"
macOS (Apple Silicon)
Download the desktop app or use Homebrew. Metal GPU acceleration works automatically on M1/M2/M3/M4 chips. Unified memory lets the GPU address most of system RAM (macOS reserves a portion for the system), so there is no separate VRAM pool to size.
# Install via Homebrew
brew install ollama
# Or download from https://ollama.com/download
# Start the server
ollama serve
# Apple Silicon: Metal GPU is auto-detected
# 32GB M2 Max can run 30B models comfortably
ollama run qwen3:32b
Windows
Download the installer from ollama.com. CUDA acceleration works with NVIDIA drivers 452.39+. Also runs in WSL2 with GPU passthrough for a Linux-like experience.
# Download installer from https://ollama.com/download
# Or install via winget:
winget install Ollama.Ollama
# WSL2 alternative (recommended for dev):
# Install in WSL2 Ubuntu with GPU passthrough
curl -fsSL https://ollama.com/install.sh | sh
# Verify GPU detection
ollama run llama3.2 "Hello from Windows"
Docker (Production)
Official Docker image with NVIDIA GPU passthrough. Requires NVIDIA Container Toolkit. Ideal for servers, Kubernetes, and reproducible deployments.
# CPU-only
docker run -d -v ollama:/root/.ollama \
-p 11434:11434 --name ollama \
ollama/ollama
# NVIDIA GPU (requires nvidia-container-toolkit)
docker run -d --gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 --name ollama \
ollama/ollama
# AMD GPU (ROCm)
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 --name ollama \
ollama/ollama:rocm
# Pull a model inside the container
docker exec ollama ollama pull llama3.2
GPU Requirements
VRAM determines the maximum model size you can run. Models need roughly 0.5-1GB VRAM per billion parameters at Q4 quantization. CPU inference works but is 10-50x slower than GPU.
# VRAM requirements (Q4_K_M quantization):
# 7B model ~ 4.5 GB VRAM
# 13B model ~ 8.0 GB VRAM
# 30B model ~ 18.0 GB VRAM
# 70B model ~ 40.0 GB VRAM
# 110B model ~ 64.0 GB VRAM
# Recommended GPUs:
# Budget: RTX 3060 12GB (7-13B models)
# Mid: RTX 4070 Ti Super 16GB (13-30B)
# Pro: RTX 4090 24GB (30B models)
# Server: A100 80GB / H100 (70B+ models)
# Mac: M2/M3/M4 Pro/Max (unified memory)
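The table above can be turned into a quick planning helper. This is a minimal sketch, assuming linear interpolation between the listed Q4_K_M data points is accurate enough for sizing; the function names and the 1 GB headroom are hypothetical choices, and the figures do not account for extra memory used by long context windows.

```python
from bisect import bisect_left

# (parameters in billions, approx. VRAM in GB) copied from the Q4_K_M table above
Q4_TABLE = [(7, 4.5), (13, 8.0), (30, 18.0), (70, 40.0), (110, 64.0)]


def estimate_q4_vram_gb(params_b: float) -> float:
    """Linearly interpolate (or extrapolate at the ends) VRAM for a Q4_K_M model."""
    sizes = [p for p, _ in Q4_TABLE]
    # Choose the table segment that brackets params_b, clamped to the first/last segment.
    i = min(max(bisect_left(sizes, params_b), 1), len(Q4_TABLE) - 1)
    (p0, v0), (p1, v1) = Q4_TABLE[i - 1], Q4_TABLE[i]
    return v0 + (params_b - p0) * (v1 - v0) / (p1 - p0)


def fits(params_b: float, vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the estimate plus an assumed headroom fits in vram_gb."""
    return estimate_q4_vram_gb(params_b) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 14, 32, 70):
        print(f"{size}B ~ {estimate_q4_vram_gb(size):.1f} GB, fits 24 GB: {fits(size, 24)}")
```

Treat the result as a first filter only; the actual footprint reported by ollama ps is the number to trust.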
3. Model Management
Ollama manages models like a package manager: pull, list, inspect, copy, and remove models with simple CLI commands. Models are stored in ~/.ollama/models (Linux/macOS) or %USERPROFILE%\.ollama\models (Windows) as GGUF blobs with metadata manifests. Each model variant (different quantization levels, sizes) is a separate tag, similar to Docker image tags.
The ollama show command reveals a model's full metadata: architecture, parameter count, quantization level, context window, system prompt, and license. This is essential for understanding exactly what you are running. Use ollama create with a Modelfile to build custom model configurations with tailored system prompts, temperature settings, and stop tokens.
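Because model references follow a Docker-like name:tag shape, scripts that manage models often need to split them. The following is a rough sketch of such a parser; `ModelRef` and `parse_model_ref` are hypothetical helper names, and the only rules assumed are an optional namespace before a slash and a tag that defaults to latest.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    namespace: Optional[str]  # e.g. "username" for user-published models, None otherwise
    name: str
    tag: str


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'namespace/name:tag' into parts; a missing tag becomes 'latest'."""
    path, colon, tag = ref.partition(":")
    if not path or (colon and not tag):
        raise ValueError(f"invalid model reference: {ref!r}")
    namespace, slash, name = path.rpartition("/")
    if slash and (not namespace or not name):
        raise ValueError(f"invalid model reference: {ref!r}")
    return ModelRef(namespace or None, name, tag or "latest")


if __name__ == "__main__":
    for ref in ("llama3.2", "qwen3:8b", "gemma3:12b-it-q4_K_M", "username/my-model"):
        print(ref, "->", parse_model_ref(ref))
```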
Core Commands
Essential CLI commands for managing your local model library. Pull downloads from the Ollama registry, list shows installed models, show displays metadata, and rm removes models.
# Pull a model (downloads GGUF weights)
ollama pull llama3.2
ollama pull qwen3:8b
ollama pull gemma3:12b-it-q4_K_M
# List installed models
ollama list
# NAME ID SIZE MODIFIED
# llama3.2:latest a80c4f17 2.0 GB 2 hours ago
# qwen3:8b ... 4.9 GB 1 day ago
# Show model details
ollama show llama3.2
# architecture: llama
# parameters: 3.2B
# quantization: Q4_K_M
# context: 131072
# Copy/rename a model
ollama cp llama3.2 my-assistant
# Remove a model
ollama rm llama3.2
# List running models
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# qwen3:8b ... 5.4 GB 100% GPU 4 minutes
Custom Modelfiles
Modelfiles define custom model configurations with base models, system prompts, and parameters. Like Dockerfiles for LLMs — reproducible, versioned, shareable.
# Modelfile for a code review assistant
FROM qwen3:8b
SYSTEM """You are an expert code reviewer.
Focus on: security vulnerabilities, performance
issues, error handling gaps, and maintainability.
Be concise. Use bullet points. Cite line numbers."""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"
# Build the custom model
# ollama create code-reviewer -f Modelfile
# Run it
# ollama run code-reviewer "Review this function..."
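If you maintain several assistants, generating Modelfiles from structured settings keeps them consistent. This is a small sketch under the assumption that only FROM, SYSTEM, and PARAMETER are needed; `render_modelfile` is a hypothetical helper, not part of any Ollama library.

```python
from typing import Mapping, Optional, Sequence, Union

ParamValue = Union[str, int, float, Sequence[str]]


def render_modelfile(
    base: str,
    system: Optional[str] = None,
    parameters: Optional[Mapping[str, ParamValue]] = None,
) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}"]
    if system:
        lines.append(f'SYSTEM """{system}"""')
    for key, value in (parameters or {}).items():
        # A list value (e.g. several stop sequences) becomes one PARAMETER line each.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            rendered = f'"{item}"' if isinstance(item, str) else str(item)
            lines.append(f"PARAMETER {key} {rendered}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile(
        "qwen3:8b",
        system="You are an expert code reviewer.",
        parameters={"temperature": 0.3, "num_ctx": 8192, "stop": ["<|endoftext|>"]},
    ))
```

Write the returned text to a file and pass it to ollama create with -f, as in the example above.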
Storage Management
Models can consume significant disk space. Monitor usage, set custom model directories, and prune unused models to keep storage under control.
# Check total model storage
du -sh ~/.ollama/models
# 28G /home/user/.ollama/models
# Change model storage location
export OLLAMA_MODELS=/data/ollama/models
# Or in systemd override:
# sudo systemctl edit ollama
# [Service]
# Environment="OLLAMA_MODELS=/data/models"
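# List models sorted by size (joins "2.0" and "GB" into "2.0G" so sort -h understands the unit)
ollama list | awk 'NR>1 {print $3 substr($4,1,1), $1}' | sort -h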
# Pull specific quantization
ollama pull llama3.2:3b-instruct-q8_0
ollama pull llama3.2:3b-instruct-q4_K_M
Importing GGUF Models
Import any GGUF model from Hugging Face or local files. This unlocks access to thousands of community-quantized models beyond the official Ollama library.
# Import a GGUF from Hugging Face
# 1. Download the GGUF file
wget https://huggingface.co/bartowski/\
Qwen2.5-7B-Instruct-GGUF/resolve/main/\
Qwen2.5-7B-Instruct-Q5_K_M.gguf
# 2. Create a Modelfile pointing to the file
cat > Modelfile <<EOF
FROM ./Qwen2.5-7B-Instruct-Q5_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
PARAMETER num_ctx 4096
EOF
# 3. Create and run the model
ollama create my-qwen -f Modelfile
ollama run my-qwen
4. API Server
Ollama runs a REST API server on port 11434 by default. The API supports text generation, chat completions, embeddings, model management, and streaming responses. Since v0.1.24, Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions, which means any tool, library, or application that works with the OpenAI API can work with Ollama by changing the base URL.
The native Ollama API at /api/generate and /api/chat provides additional features like raw mode, image input for multimodal models, and fine-grained model loading control. Both APIs support streaming (the default) and non-streaming modes. The server automatically loads models into memory on first request and unloads them after an idle timeout (default: 5 minutes) to free resources.
Native API Endpoints
The Ollama-native API provides generate, chat, embeddings, and model management endpoints. Streaming is enabled by default for real-time token delivery.
# Generate (single-turn)
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Explain Docker networking",
"stream": false
}'
# Chat (multi-turn with history)
curl http://localhost:11434/api/chat -d '{
"model": "qwen3:8b",
"messages": [
{"role": "system", "content": "You are a DevOps expert."},
{"role": "user", "content": "How do I set up Traefik?"}
],
"stream": false
}'
# Streaming (default, returns NDJSON)
curl http://localhost:11434/api/chat -d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Write a haiku"}
]
}'
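When streaming is left on, each line of the /api/chat response is a standalone JSON object carrying a fragment of the reply, and the last one has "done": true. A minimal sketch of reassembling the reply with only the standard library follows; `collect_chat_stream` is a hypothetical helper and it assumes the chunk shape shown in the sample data.

```python
import json
from typing import Iterable, Union


def collect_chat_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Concatenate message fragments from an NDJSON /api/chat stream."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk: stop reading
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
        '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
        '{"message": {"role": "assistant", "content": ""}, "done": true}',
    ]
    print(collect_chat_stream(sample))
```

In real use the iterable can be the response object returned by urllib.request.urlopen, which yields one bytes line per chunk.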
OpenAI-Compatible Endpoint
Drop-in replacement for OpenAI API. Change the base URL and your existing code works. Supports chat completions, embeddings, and model listing.
# OpenAI-compatible chat completions
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" -d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
# With streaming (SSE format, like OpenAI)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" -d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Explain RAFT"}
],
"stream": true
}'
# List available models
curl http://localhost:11434/v1/models
# Embeddings
curl http://localhost:11434/v1/embeddings -d '{
"model": "nomic-embed-text",
"input": "Ollama is great for local inference"
}'
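The OpenAI-compatible endpoint streams Server-Sent Events instead of NDJSON: each event is a line starting with data:, and the stream ends with data: [DONE]. The sketch below shows one way to extract the text deltas without the OpenAI SDK; `iter_sse_deltas` is a hypothetical helper and it assumes the usual chat-completions chunk layout.

```python
import json
from typing import Iterable, Iterator, Union


def iter_sse_deltas(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style SSE chat-completions stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    events = [
        'data: {"choices": [{"delta": {"role": "assistant", "content": "Hi"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": " there"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_sse_deltas(events)))
```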
Python Client (OpenAI SDK)
Use the official OpenAI Python SDK with Ollama. Zero code changes except the base URL. Works with LangChain, LlamaIndex, and any OpenAI-compatible library.
from openai import OpenAI
# Point to local Ollama server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but unused
)
# Chat completion (identical to OpenAI API)
response = client.chat.completions.create(
model="qwen3:8b",
messages=[
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Explain GGUF format"}
],
temperature=0.7,
max_tokens=1024
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="qwen3:8b",
messages=[{"role": "user", "content": "Hello"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Official Ollama Python SDK
The native Ollama Python library provides a more feature-rich interface with model management, structured outputs, and multimodal support built in.
import ollama
# Chat with streaming
stream = ollama.chat(
model="qwen3:8b",
messages=[
{"role": "user", "content": "Explain RAG"}
],
stream=True
)
for chunk in stream:
    print(chunk["message"]["content"], end="")
# List local models
models = ollama.list()
for m in models["models"]:
    # ollama-python 0.4+ exposes the tag as "model" (older releases used "name")
    print(f"{m['model']}: {m['size'] / 1e9:.1f}GB")
# Pull a model programmatically
ollama.pull("gemma3:4b")
# Generate embeddings
emb = ollama.embed(
model="nomic-embed-text",
input="Local LLM inference with Ollama"
)
print(f"Dim: {len(emb['embeddings'][0])}")
# Structured output (JSON mode)
resp = ollama.chat(
model="qwen3:8b",
messages=[{"role": "user",
"content": "List 3 Python frameworks as JSON"}],
format="json"
)
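JSON mode returns the JSON as a string in the message content, so the caller still has to parse it and handle failures such as a reply cut off by the token limit. Here is a defensive parsing sketch; `parse_json_reply` is a hypothetical helper, and stripping a Markdown fence is an assumption about how some models behave rather than documented Ollama behavior.

```python
import json
from typing import Any

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(content: str) -> Any:
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        _, _, text = text.partition("\n")
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc


if __name__ == "__main__":
    print(parse_json_reply('{"frameworks": ["Django", "Flask", "FastAPI"]}'))
```

With the call above, the input would be resp["message"]["content"].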
5. Model Library
The Ollama model library hosts 200+ models covering general chat, coding, reasoning, embedding, and vision tasks. Models are tagged by size variant and quantization level. The library is updated within days of major model releases, making it the fastest way to try new open-weight models locally.
Choosing the right model depends on your use case, hardware, and quality requirements. For general-purpose chat, Qwen 3 and Llama 4 lead in quality. For coding, Qwen 3 Coder and DeepSeek Coder V2 excel. For reasoning, DeepSeek R1 and Qwen 3 (with thinking mode) are among the strongest open options. For resource-constrained environments, Phi-4 Mini and Gemma 3 offer remarkable quality at small sizes.
Llama 4 (Meta)
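Meta's latest open model family. Llama 4 Scout (17B active / 109B total, MoE) and Maverick (17B active / 400B total, MoE). Scout advertises a context window of up to 10M tokens, Maverick up to 1M. A strong open option for multilingual and long-context tasks. Note that all experts must be in memory, so the download size tracks the total parameter count, not the active one.
ollama pull llama4:scout # 109B MoE, ~67GB at Q4
ollama pull llama4:maverick # 400B MoE, ~245GB at Q4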
ollama run llama4:scout "Summarize this codebase"
Qwen 3 (Alibaba)
State-of-the-art open model series. 0.6B to 32B sizes plus 30B-A3B MoE. Built-in thinking mode for chain-of-thought reasoning. Excellent coding, math, and multilingual support.
ollama pull qwen3:8b # Best value (4.9GB)
ollama pull qwen3:32b # Near-frontier quality
ollama pull qwen3:4b # Fast, lightweight
ollama run qwen3:8b "Write a FastAPI endpoint"
Gemma 3 (Google)
Google's latest open model. 1B, 4B, 12B, and 27B sizes. Excels at instruction following, multilingual tasks, and efficiency. The 4B variant is remarkably capable for its size.
ollama pull gemma3:4b # Great for constrained envs
ollama pull gemma3:12b # Strong general-purpose
ollama pull gemma3:27b # Highest quality
ollama run gemma3:12b "Explain microservices"
DeepSeek V3 / R1
DeepSeek V3 (685B MoE, 37B active) for general tasks. R1 for advanced chain-of-thought reasoning. Distilled variants (1.5B-70B) bring reasoning to smaller hardware.
ollama pull deepseek-r1:8b # Reasoning, 4.9GB
ollama pull deepseek-r1:32b # Strong reasoning
ollama pull deepseek-v3:latest # Full 685B MoE
ollama run deepseek-r1:8b "Prove sqrt(2) is irrational"
Phi-4 (Microsoft)
Microsoft's small but mighty model family. Phi-4 Mini (3.8B) rivals much larger models. Phi-4 (14B) competes with models 3-5x its size. Excellent for edge and on-device deployment.
ollama pull phi4-mini # 3.8B, 2.2GB
ollama pull phi4:14b # 14B, 8.5GB
ollama run phi4-mini "Optimize this SQL query"
Coding Models
Specialized for code generation, completion, and review. CodeLlama, Qwen 3 Coder, DeepSeek Coder V2, and StarCoder2 cover everything from autocomplete to complex refactoring.
ollama pull codellama:13b # Meta, code-focused
ollama pull deepseek-coder-v2:16b # Strong code + math
ollama pull starcoder2:7b # Fast code completion
ollama run codellama:13b "Write a Redis cache layer"
Mistral / Mixtral
Mistral Small 3.1 (24B) with vision and 128K context. Mixtral 8x7B and 8x22B MoE for high throughput. Strong tool calling and function calling support.
ollama pull mistral-small3.1:24b # 24B, vision
ollama pull mixtral:8x7b # MoE, fast
ollama pull mistral:7b # Classic, efficient
ollama run mistral-small3.1:24b "Analyze this diagram"
6. Quantization
Quantization reduces model precision from 16-bit floating point (FP16) to lower bit-widths (8-bit, 4-bit, or even 2-bit), dramatically reducing VRAM requirements and increasing inference speed with minimal quality loss. Ollama uses the GGUF format (GPT-Generated Unified Format) from llama.cpp, which supports mixed-precision quantization where different layers use different bit-widths to preserve quality in the most important parts of the model.
The naming convention tells you the quantization level: Q4_K_M means 4-bit with K-quants (intelligent mixed-precision) at medium quality. Q5_K_M is 5-bit medium, Q8_0 is 8-bit uniform. As a rule of thumb, Q4_K_M offers the best balance of size and quality for most use cases. Q5_K_M is near-lossless. Q8_0 is virtually indistinguishable from FP16 but twice the size of Q4. Below Q4 (Q3, Q2), quality degrades noticeably.
Choosing quantization depends on your hardware constraints and quality requirements. For a 7B model: FP16 needs ~14GB VRAM, Q8_0 needs ~7.5GB, Q5_K_M needs ~5.5GB, Q4_K_M needs ~4.5GB, and Q2_K needs ~3GB. On Apple Silicon with unified memory, you can often afford higher-precision quantizations (more bits per weight), since most of system RAM is addressable by the GPU.
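The naming convention described above is regular enough to parse. The sketch below decodes the names mentioned in this section (K-quants such as Q4_K_M and uniform types such as Q8_0); `parse_quant` is a hypothetical helper and deliberately does not cover FP16 or the IQ family.

```python
import re
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    bits: int                # nominal bit width
    k_quant: bool            # True for mixed-precision K-quants
    variant: Optional[str]   # "S"/"M"/"L" for K-quants, "0"/"1" for uniform types


_PATTERN = re.compile(r"^Q(\d)_(?:(K)(?:_([SML]))?|([01]))$")


def parse_quant(name: str) -> Quant:
    """Decode names like Q4_K_M, Q6_K, or q8_0 (case-insensitive)."""
    match = _PATTERN.match(name.upper())
    if not match:
        raise ValueError(f"unrecognized quantization name: {name!r}")
    bits, k, k_variant, uniform_variant = match.groups()
    return Quant(int(bits), k is not None, k_variant or uniform_variant)


if __name__ == "__main__":
    for name in ("Q4_K_M", "Q5_K_M", "Q6_K", "q8_0"):
        print(name, "->", parse_quant(name))
```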
GGUF Format
GGUF (GPT-Generated Unified Format) is the standard for CPU/GPU inference with llama.cpp. Stores model weights, tokenizer, and metadata in a single file. Replaced the older GGML format.
# GGUF files contain everything:
# - Model architecture metadata
# - Quantized weight tensors
# - Tokenizer vocabulary and merges
# - Chat template
# - Recommended parameters
# Pull specific quantization
ollama pull qwen3:8b-q4_K_M # 4-bit, medium
ollama pull qwen3:8b-q5_K_M # 5-bit, higher quality
ollama pull qwen3:8b-q8_0 # 8-bit, near-lossless
# Import your own GGUF
cat > Modelfile <<EOF
FROM ./my-model.Q4_K_M.gguf
EOF
ollama create my-model -f Modelfile
Quantization Levels Compared
Each level trades quality for size/speed. K-quants (Q4_K_M, Q5_K_M) use intelligent mixed precision for better quality than uniform quantization at the same bit width.
# Approximate comparison for an 8B model (sizes rounded to whole GB;
# quality and speed are rough figures relative to FP16):
#
# Level Size Quality Speed Use Case
# ------- ----- ------- ------ --------
# FP16 16 GB 100% 1.0x Reference
# Q8_0 8 GB 99% 1.5x Quality-first
# Q6_K 6 GB 98% 1.7x High quality
# Q5_K_M 5 GB 97% 1.8x Recommended
# Q4_K_M 5 GB 95% 2.0x Best balance
# Q4_K_S 4 GB 93% 2.1x Smaller Q4
# Q3_K_M 4 GB 88% 2.2x Low VRAM
# Q2_K 3 GB 75% 2.4x Extreme savings
#
# Rule of thumb:
# - Q4_K_M: default for most users
# - Q5_K_M: when quality matters more
# - Q8_0: when VRAM is not a concern
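The rule of thumb can be expressed as "pick the highest-precision level that fits your VRAM". The sketch below does that using the 7B figures quoted earlier in this section, scaled linearly with parameter count; the linear scaling and the helper name `pick_quant` are assumptions, and context-window memory is ignored.

```python
from typing import Optional

# Approximate VRAM (GB) for a 7B model, as quoted earlier in this section,
# ordered from highest to lowest precision.
VRAM_7B_GB = {"FP16": 14.0, "Q8_0": 7.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q2_K": 3.0}


def pick_quant(params_b: float, vram_gb: float) -> Optional[str]:
    """Return the highest-precision level whose scaled estimate fits, else None."""
    for level, gb_at_7b in VRAM_7B_GB.items():  # dicts keep insertion order
        if gb_at_7b * params_b / 7.0 <= vram_gb:
            return level
    return None


if __name__ == "__main__":
    for params, vram in ((7, 6), (14, 12), (32, 24), (70, 24)):
        print(f"{params}B on {vram} GB -> {pick_quant(params, vram)}")
```

A None result means even Q2_K is estimated not to fit, so a smaller model or partial CPU offloading is needed.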
Quantizing Your Own Models
Convert Hugging Face safetensors to GGUF using llama.cpp tools. Useful for fine-tuned models or models not yet in the Ollama library.
# Build the llama.cpp tools (the project builds with CMake; binaries land in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j $(nproc)
# Install the Python dependencies of the conversion script
pip install -r requirements.txt
# Convert safetensors to GGUF (FP16)
python convert_hf_to_gguf.py \
/path/to/hf-model/ \
--outfile model-fp16.gguf \
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize \
model-fp16.gguf \
model-Q4_K_M.gguf \
Q4_K_M
# Import into Ollama
echo "FROM ./model-Q4_K_M.gguf" > Modelfile
ollama create my-model -f Modelfile
Importance Matrix Quantization
Importance matrix (imatrix) quantization uses calibration data to determine which weights matter most. Produces significantly better results at low bit-widths (Q3, Q2).
# Generate importance matrix
# (with a CMake build of llama.cpp the tools are in build/bin/; adjust the path to your build)
./build/bin/llama-imatrix \
-m model-fp16.gguf \
-f calibration-data.txt \
-o imatrix.dat \
--chunks 100
# Quantize with importance matrix
./build/bin/llama-quantize \
--imatrix imatrix.dat \
model-fp16.gguf \
model-IQ4_XS.gguf \
IQ4_XS
# IQ quantization types (importance-based):
# IQ4_XS: smaller than Q4_K_M, similar quality
# IQ3_XXS: very small, usable quality
# IQ2_XXS: extremely small, for testing
7. GPU Acceleration
GPU acceleration is the single most impactful factor for LLM inference speed. A model running on an RTX 4090 generates tokens 20-50x faster than the same model on a modern CPU. Ollama automatically detects available GPUs and offloads as many model layers as fit in VRAM. If the model is too large for GPU memory, it splits layers between GPU and CPU (partial offloading), which is slower than full GPU but still much faster than CPU-only.
Ollama supports three GPU backends: NVIDIA CUDA (most mature, widest support), AMD ROCm (primarily Linux, with Windows support for a subset of Radeon cards), and Apple Metal (macOS, seamless with unified memory). Multi-GPU setups are supported: when a model does not fit on a single GPU, Ollama distributes its layers across the available GPUs.
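To build intuition for partial offloading, you can approximate how many layers land on the GPU. The sketch below is a deliberately simplified model, not Ollama's actual scheduler: it assumes every layer is the same size and reserves a fixed amount of VRAM (a hypothetical 1.5 GB) for the KV cache and runtime buffers.

```python
from typing import Tuple


def plan_offload(
    n_layers: int, model_gb: float, vram_gb: float, reserve_gb: float = 1.5
) -> Tuple[int, int]:
    """Return (gpu_layers, cpu_layers) under an equal-layer-size assumption."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


if __name__ == "__main__":
    # Example: an 80-layer model weighing 40 GB on a 24 GB card.
    gpu, cpu = plan_offload(n_layers=80, model_gb=40.0, vram_gb=24.0)
    print(f"{gpu} layers on GPU, {cpu} on CPU")
```

The more layers that fall to the CPU, the closer throughput gets to CPU-only speed, which is why a slightly smaller quantization that fits entirely in VRAM is often the faster choice.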
CUDA (NVIDIA)
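Best supported GPU backend. Requires NVIDIA driver 450+ and CUDA 11.7+ (the CUDA runtime ships with Ollama, so normally only the driver needs installing). Supports GPUs with compute capability 5.0 or higher, which covers GeForce cards well before the RTX 3000 series as well as Quadro, Tesla, and datacenter GPUs (A100, H100, L40S).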
# Verify NVIDIA GPU detection
nvidia-smi
ollama run llama3.2 "Test GPU"
# Check GPU utilization during inference
watch -n 0.5 nvidia-smi
# Control GPU usage
export CUDA_VISIBLE_DEVICES=0 # GPU 0 only
export CUDA_VISIBLE_DEVICES=0,1 # GPU 0 and 1
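# Force CPU-only inference by hiding all GPUs (an invalid device ID such as -1)
CUDA_VISIBLE_DEVICES=-1 ollama serve
# The number of offloaded layers is a per-model option (num_gpu), set through the
# API "options" field or a Modelfile PARAMETER rather than a server variable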
# Docker with NVIDIA GPU
docker run --gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
ROCm (AMD)
Primarily Linux (Ollama's Windows build also supports a subset of Radeon GPUs). Requires ROCm 6.0+. Supports RX 7900 XTX (24GB), RX 7900 XT (20GB), and the MI250X and MI300X datacenter GPUs. Performance is competitive with CUDA.
# Install ROCm (Ubuntu 22.04+)
# Follow: https://rocm.docs.amd.com/
# Verify AMD GPU detection
rocm-smi
ollama run llama3.2 "Test AMD GPU"
# Docker with AMD GPU
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:rocm
# Override GPU target (unsupported GPUs)
HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve
# Check VRAM usage
rocm-smi --showmeminfo vram
Metal (Apple Silicon)
Zero-config on M1/M2/M3/M4 chips. Unified memory means most of system RAM is available for models (macOS keeps a portion for the system). M4 Max with 128GB can run 70B models. Excellent performance per watt.
# Metal is auto-detected on Apple Silicon
ollama run qwen3:32b # Uses Metal automatically
# Apple Silicon memory guide:
# M2/M3/M4 (8GB): 7B Q4 models
# M2/M3/M4 (16GB): 13B Q4 or 7B Q8
# M2/M3/M4 Pro (36GB): 30B Q4 models
# M2/M3/M4 Max (64GB): 70B Q4 models
# M2/M3/M4 Max (128GB): 70B Q8 or 110B Q4
# Monitor memory pressure
# Activity Monitor > Memory tab
# Limit concurrent loaded models
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
Multi-GPU & VRAM Management
Ollama auto-distributes model layers across multiple GPUs. Configure VRAM limits, model concurrency, and layer distribution for optimal throughput.
# Multi-GPU: auto-distributes layers
# 2x RTX 4090 (48GB total) = 70B Q4 models
# Control which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# Limit concurrent models in memory
export OLLAMA_MAX_LOADED_MODELS=2
# Set model idle timeout (default: 5m)
export OLLAMA_KEEP_ALIVE=30m
# Force CPU-only mode (for testing) by hiding all GPUs
export CUDA_VISIBLE_DEVICES=-1
ollama serve
# Monitor multi-GPU utilization
watch -n 1 nvidia-smi
# Per-request GPU control via API
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Hello",
"options": {"num_gpu": 99}
}'
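The same per-request control is available from Python without extra dependencies. This sketch only builds the request object so it can be inspected or sent later; `build_generate_request` is a hypothetical helper, and it assumes the native /api/generate fields shown above plus the keep_alive request field.

```python
import json
import urllib.request
from typing import Any, Optional


def build_generate_request(
    model: str,
    prompt: str,
    host: str = "http://localhost:11434",
    keep_alive: Optional[str] = None,
    **options: Any,
) -> urllib.request.Request:
    """Build a non-streaming /api/generate request with per-request options."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options  # e.g. num_gpu=99, num_ctx=8192
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive  # e.g. "30m" to keep the model loaded
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request("qwen3:8b", "Hello", num_gpu=99, keep_alive="30m")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a running server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["response"])
```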
8. Custom Models
Modelfiles are the Dockerfiles of Ollama: declarative configuration files that define a custom model's base weights, system prompt, inference parameters, chat template, and even LoRA adapters. Creating a Modelfile lets you package a specific model configuration as a reusable, shareable artifact.
The Modelfile syntax supports fine-grained control: FROM specifies the base model or GGUF file, SYSTEM sets the system prompt, PARAMETER adjusts inference settings, TEMPLATE defines the chat template, ADAPTER applies LoRA/QLoRA adapters, and LICENSE embeds license information.
Modelfile Reference
Complete Modelfile syntax with all supported instructions. Each instruction configures a different aspect of the model's behavior.
# Complete Modelfile reference
# Base model (library tag or path to a GGUF file); keep comments on their own line
FROM qwen3:8b
SYSTEM """You are a senior backend engineer.
You specialize in Python, Go, and Kubernetes.
Always provide production-ready code with error
handling, logging, and type hints."""
PARAMETER temperature 0.4
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
PARAMETER seed 42
# TEMPLATE and stop tokens must match the base model's chat format (Qwen uses ChatML)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
LICENSE "Apache 2.0"
LoRA Adapters
Apply LoRA fine-tuned adapters to base models. Enables domain-specific fine-tuning without modifying the full model weights. Adapter: 50-200MB vs. full model: 4-16GB.
# Modelfile with LoRA adapter
FROM llama3.2
# Apply a GGUF-format LoRA adapter
ADAPTER ./my-lora-adapter.gguf
SYSTEM """You are a medical coding assistant
specializing in ICD-10 and CPT codes."""
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
# Build the model with adapter
# ollama create medical-coder -f Modelfile
# The adapter modifies base model behavior
# without full fine-tuning overhead:
# - Base model: ~2GB (llama3.2 at Q4_K_M)
# - LoRA adapter: 50-200MB
# - Combined: domain-specific knowledge
Practical Custom Models
Real-world Modelfile examples for common use cases: SQL assistant, commit message writer, and security auditor.
# SQL Assistant
cat > Modelfile.sql <<'EOF'
FROM qwen3:8b
SYSTEM """You are a SQL expert. Given a natural
language question and a database schema, generate
the optimal SQL query. Use CTEs for readability.
Always include column aliases."""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
ollama create sql-assistant -f Modelfile.sql
# Git Commit Message Writer
cat > Modelfile.commit <<'EOF'
FROM phi4-mini
SYSTEM """Write concise git commit messages.
Format: type(scope): description
Types: feat, fix, refactor, docs, test, chore
Keep under 72 characters. No period at end."""
PARAMETER temperature 0.3
PARAMETER num_predict 100
EOF
ollama create commit-writer -f Modelfile.commit
# Usage
git diff --staged | ollama run commit-writer
Sharing & Registry
Push custom models to the Ollama registry for team sharing, or export models as files for offline distribution.
# Push to Ollama registry
ollama push username/my-model
# Pull someone else's custom model
ollama pull username/their-model
# Copy models between machines (offline)
# 1. Find model blob location
ollama show --modelfile qwen3:8b
# 2. Copy the model directory
# Source: ~/.ollama/models/
scp -r ~/.ollama/models/ user@target:~/.ollama/
# Export as Modelfile for versioning
ollama show --modelfile my-model > Modelfile
# Commit Modelfile to Git for team sharing
9. Embeddings & Vision
Ollama supports specialized models beyond text generation: embedding models for RAG (Retrieval-Augmented Generation) and semantic search, and multimodal vision models that can analyze images. Embedding models convert text into high-dimensional vectors that capture semantic meaning, enabling similarity search, clustering, and retrieval systems entirely locally.
Vision models like LLaVA, Gemma 3, and Mistral Small 3.1 accept both text and images as input, enabling local image analysis, OCR, diagram understanding, and visual question answering. This is valuable for processing sensitive images (medical, legal, proprietary) that cannot be sent to cloud APIs.
Embedding Models
Generate vector embeddings for RAG pipelines, semantic search, and document clustering. All processing stays local — no data leaves your machine.
# Popular embedding models
ollama pull nomic-embed-text # 137M, 768-dim
ollama pull mxbai-embed-large # 335M, 1024-dim
ollama pull all-minilm # 23M, 384-dim (fast)
ollama pull snowflake-arctic-embed2 # 568M
# Generate embeddings via API
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Local LLM inference is the future"
}'
# Python: batch embeddings
import ollama
texts = [
"Kubernetes pod scheduling",
"Docker container networking",
"Helm chart templating"
]
result = ollama.embed(
model="nomic-embed-text",
input=texts
)
print(f"Dims: {len(result['embeddings'][0])}")
# Dims: 768
Vision / Multimodal Models
Analyze images locally with multimodal models. Supports diagrams, screenshots, documents, photos. No cloud upload — critical for sensitive visual data.
# Vision-capable models
ollama pull llava:13b # LLaVA 1.6
ollama pull gemma3:12b # Gemma 3 (vision)
ollama pull mistral-small3.1:24b # Mistral (vision)
ollama pull llama4:scout # Llama 4 (vision)
# API: send image as base64
import ollama, base64
with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
response = ollama.chat(
model="gemma3:12b",
messages=[{
"role": "user",
"content": "What errors are in this log?",
"images": [img_b64]
}]
)
print(response["message"]["content"])
Local RAG Pipeline
Build a complete RAG system with Ollama: embed documents, store vectors, retrieve context, and generate answers — all locally with zero cloud dependencies.
import ollama
import numpy as np
# 1. Embed your documents
docs = [
"Ollama runs LLMs locally using GGUF",
"CUDA acceleration requires NVIDIA GPU",
"Quantization reduces model size and VRAM",
"LoRA adapters enable domain fine-tuning",
]
doc_embs = ollama.embed(
model="nomic-embed-text", input=docs
)["embeddings"]
# 2. Embed the query
query = "How do I reduce VRAM usage?"
q_emb = ollama.embed(
model="nomic-embed-text", input=query
)["embeddings"][0]
# 3. Find most similar documents (dot product equals cosine similarity only for unit-length vectors)
scores = [np.dot(q_emb, d) for d in doc_embs]
context = docs[np.argmax(scores)]
# 4. Generate answer with context
response = ollama.chat(model="qwen3:8b", messages=[
{"role": "system",
"content": f"Answer based on: {context}"},
{"role": "user", "content": query}
])
print(response["message"]["content"])
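The pipeline above passes only the single best match to the model. A common refinement is to retrieve the top k documents by cosine similarity. The sketch below shows that step in pure Python so it works on any list of vectors; `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_emb: Sequence[float],
    doc_embs: Sequence[Sequence[float]],
    docs: Sequence[str],
    k: int = 2,
) -> List[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        zip(docs, doc_embs),
        key=lambda pair: cosine(query_emb, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional vectors stand in for real embeddings.
    docs = ["quantization", "gpu drivers", "vram savings"]
    embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(top_k([1.0, 0.0], embs, docs, k=2))
```

Joining the returned documents with newlines gives a context string that can replace the single-document context in step 4.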
RAG with ChromaDB
Production RAG with ChromaDB vector store and Ollama. Persistent storage, metadata filtering, and efficient similarity search.
import chromadb, ollama
client = chromadb.PersistentClient(path="./chroma")
class OllamaEmbed:
    def __call__(self, input):
        return ollama.embed(
            model="nomic-embed-text", input=input
        )["embeddings"]
collection = client.get_or_create_collection(
"docs", embedding_function=OllamaEmbed()
)
# Add documents
collection.add(
documents=["Doc content here..."],
ids=["doc1"],
metadatas=[{"source": "manual.pdf"}]
)
# Query with automatic embedding
results = collection.query(
query_texts=["How to configure GPU?"],
n_results=3
)
# Generate answer
context = "\n".join(results["documents"][0])
resp = ollama.chat(model="qwen3:8b", messages=[
{"role": "system",
"content": f"Context:\n{context}"},
{"role": "user",
"content": "How to configure GPU?"}
])
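Real documents are usually too long to embed as a single entry, so they are split into overlapping chunks before being added to the collection. A minimal character-based sketch follows; `chunk_text` and its default sizes are hypothetical, and token-aware or sentence-aware splitting would generally give better retrieval.

```python
from typing import List


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk's overlap.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


if __name__ == "__main__":
    chunks = chunk_text("abcdefghij", size=4, overlap=1)
    ids = [f"doc1-{i}" for i in range(len(chunks))]
    print(list(zip(ids, chunks)))
```

The chunks and generated ids can be passed to collection.add as the documents and ids arguments.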
10. Integration
Ollama integrates with the entire LLM tooling ecosystem through its OpenAI-compatible API. LangChain, LlamaIndex, and other orchestration frameworks have native Ollama support. IDE extensions like Continue.dev provide AI-powered code completion using local models. Web interfaces like Open WebUI give you a ChatGPT-like experience backed by local inference.
The key advantage of Ollama integrations is that switching from cloud to local inference requires minimal code changes. Most libraries accept a base_url parameter — change it from https://api.openai.com/v1 to http://localhost:11434/v1 and you are running locally.
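One way to keep that switch out of application code is to resolve the endpoint from configuration. In the sketch below, the environment variable names LLM_BACKEND and OLLAMA_BASE_URL are hypothetical application-level settings (Ollama itself does not read them), and `client_settings` is an illustrative helper whose result can be passed to any OpenAI-compatible client.

```python
import os
from typing import Dict, Mapping, Optional

OPENAI_URL = "https://api.openai.com/v1"
OLLAMA_URL = "http://localhost:11434/v1"


def client_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return base_url/api_key for the configured backend (local Ollama by default)."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "ollama") == "openai":
        return {"base_url": OPENAI_URL, "api_key": env["OPENAI_API_KEY"]}
    # Ollama ignores the key, but OpenAI-style clients require a non-empty value.
    return {"base_url": env.get("OLLAMA_BASE_URL", OLLAMA_URL), "api_key": "ollama"}


if __name__ == "__main__":
    print(client_settings({}))
    print(client_settings({"LLM_BACKEND": "openai", "OPENAI_API_KEY": "sk-example"}))
```

With the OpenAI SDK this becomes OpenAI(**client_settings()), so the same code path serves both cloud and local inference.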
LangChain
Native Ollama integration in LangChain for chains, agents, and RAG pipelines. Use ChatOllama for chat and OllamaEmbeddings for vectors.
from langchain_ollama import (
ChatOllama, OllamaEmbeddings
)
from langchain_core.messages import (
HumanMessage, SystemMessage
)
llm = ChatOllama(
model="qwen3:8b",
temperature=0.3,
num_ctx=8192,
base_url="http://localhost:11434"
)
messages = [
SystemMessage(content="You are a DevOps expert."),
HumanMessage(content="Explain K8s Ingress")
]
response = llm.invoke(messages)
print(response.content)
# Embeddings for RAG
embeddings = OllamaEmbeddings(
model="nomic-embed-text"
)
vectors = embeddings.embed_documents([
"Kubernetes pods run containers",
"Helm manages Kubernetes packages"
])
# Streaming
for chunk in llm.stream(messages):
    print(chunk.content, end="", flush=True)
LlamaIndex
Build RAG apps with LlamaIndex using Ollama as the LLM and embedding backend. Index documents, query with natural language, all locally.
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import (
OllamaEmbedding
)
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings
)
Settings.llm = Ollama(
model="qwen3:8b",
request_timeout=120,
temperature=0.3
)
Settings.embed_model = OllamaEmbedding(
model_name="nomic-embed-text"
)
# Load and index documents
documents = SimpleDirectoryReader(
"./docs"
).load_data()
index = VectorStoreIndex.from_documents(documents)
# Query the index
query_engine = index.as_query_engine()
response = query_engine.query(
"How do I configure GPU acceleration?"
)
print(response)
Open WebUI
Self-hosted ChatGPT-like interface for Ollama. Multi-user, conversation history, model switching, file upload, RAG, web search, plugins. Docker deploy in minutes.
# Deploy Open WebUI with Docker
docker run -d --name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
# Features:
# - Multi-user with role-based access
# - Conversation history and search
# - Model switching mid-conversation
# - File upload and RAG
# - Web search integration
# - Custom system prompts per model
# - Plugin/function support
# - Mobile-responsive UI
# Access at http://localhost:3000
Continue.dev (IDE)
AI code assistant for VS Code and JetBrains using local Ollama. Tab completion, inline chat, code explanation, refactoring, test generation — all private.
// ~/.continue/config.json
{
"models": [
{
"title": "Qwen 3 8B (Local)",
"provider": "ollama",
"model": "qwen3:8b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen 3 4B (Fast)",
"provider": "ollama",
"model": "qwen3:4b"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text"
}
}
// Shortcuts:
// Ctrl+L: Chat with codebase context
// Tab: AI autocomplete
// Ctrl+I: Inline edit
// Highlight + Ctrl+L: Explain/refactor
More Integrations
Ollama works with the entire open-source LLM ecosystem: n8n, Dify, CrewAI, Aider, AnythingLLM, and more.
# n8n: AI workflow automation
# Add Ollama node in n8n workflows
# Base URL: http://ollama:11434
# Dify: AI application platform
# Settings > Model Providers > Ollama
# CrewAI: multi-agent orchestration
from crewai import Agent, LLM
llm = LLM(
    model="ollama/qwen3:8b",
    base_url="http://localhost:11434"
)
researcher = Agent(
    role="Researcher",
    goal="Find technical information",
    backstory="Senior engineer who reads docs",
    llm=llm
)
# Aider: AI pair programming
# pip install aider-chat
# aider --model ollama/qwen3:8b
# AnythingLLM: document chat
# Configure Ollama as LLM provider
11. Production Deployment
Deploying Ollama in production requires reverse proxy configuration, TLS termination, authentication, monitoring, resource limits, and high availability. The Docker-based deployment is recommended: it provides isolation, reproducibility, and easy orchestration with Docker Compose or Kubernetes. Always place Ollama behind a reverse proxy (Nginx, Traefik, Caddy) that handles TLS, rate limiting, and authentication.
For multi-user deployments, combine Ollama with Open WebUI for a managed interface, or build a custom API gateway that handles authentication, usage tracking, and request routing. Monitor GPU utilization, memory usage, request latency, and queue depth to detect bottlenecks and plan capacity.
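A custom gateway's authentication and usage-tracking step might look like the sketch below. The key store `API_KEYS`, the `usage` counter, and the function names are all hypothetical; a real gateway would load keys from a secrets manager and forward authorized requests to Ollama.

```python
import hmac
from collections import Counter

# Hypothetical key store: API key -> user name (illustration only).
API_KEYS = {"sk-local-alice": "alice", "sk-local-bob": "bob"}

usage = Counter()  # number of accepted requests per user


def extract_bearer(authorization_header):
    """Return the token from an 'Authorization: Bearer <token>' header, or None."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def authenticate(authorization_header):
    """Return the user name for a valid key, else None."""
    token = extract_bearer(authorization_header)
    if token is None:
        return None
    for key, user in API_KEYS.items():
        # compare_digest avoids leaking key contents through timing
        if hmac.compare_digest(token.encode(), key.encode()):
            return user
    return None


def handle(authorization_header):
    """Return the HTTP status the gateway would answer with before proxying."""
    user = authenticate(authorization_header)
    if user is None:
        return 401
    usage[user] += 1
    return 200
```

Keeping the check in a pure function makes it easy to unit-test and to reuse from whichever web framework fronts the gateway.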
Docker Compose Stack
Production Docker Compose with Ollama, Open WebUI, GPU passthrough, persistent volumes, health checks, and restart policies.
# docker-compose.yml
services:
  ollama:
    # Pin a specific version tag instead of latest for reproducible deploys
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_NUM_PARALLEL=4
    healthcheck:
      # The ollama/ollama image does not ship curl, so probe with the CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
volumes:
  ollama_data:
  webui_data:
Reverse Proxy & TLS
Nginx reverse proxy with TLS termination, rate limiting, and streaming support. Essential for secure network exposure.
# /etc/nginx/sites-available/ollama
# limit_req_zone is only valid in the http context, so it must sit
# outside the server block (site files are included at http level)
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/m;
upstream ollama {
    server 127.0.0.1:11434;
    keepalive 32;
}
server {
    listen 443 ssl http2;
    server_name llm.example.com;
    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
    location / {
        limit_req zone=ollama burst=5;
        proxy_pass http://ollama;
        proxy_set_header Host $host;
        # Needed for the upstream keepalive pool to be used
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        # Streaming support (NDJSON and SSE)
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding on;
    }
}
Monitoring & Metrics
Monitor Ollama with Prometheus, Grafana, and custom health checks. Track GPU utilization, model loading times, request latency, and token throughput.
#!/usr/bin/env python3
"""Ollama health monitor with Prometheus."""
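import subprocess
import time

import requests
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("ollama_gpu_util", "GPU %", ["gpu"])
GPU_MEM = Gauge("ollama_gpu_mem_gb", "VRAM GB", ["gpu"])
MODELS = Gauge("ollama_loaded_models", "Count")

def collect():
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used",
        "--format=csv,noheader,nounits"
    ]).decode()
    # nvidia-smi prints one CSV line per GPU: index, util %, memory MiB
    for line in out.strip().splitlines():
        idx, util, mem = [v.strip() for v in line.split(",")]
        GPU_UTIL.labels(gpu=idx).set(float(util))
        GPU_MEM.labels(gpu=idx).set(float(mem) / 1024)
    resp = requests.get(
        "http://localhost:11434/api/ps", timeout=5
    ).json()
    MODELS.set(len(resp.get("models", [])))

# Port 9090 is also the Prometheus server default; change it if both share a host
start_http_server(9090)
while True:
    try:
        collect()
    except Exception as exc:  # keep the exporter alive if a probe fails
        print(f"collect failed: {exc}")
    time.sleep(15)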
Authentication & Access Control
Ollama has no built-in auth. Implement API key validation at the reverse proxy level. Never expose Ollama directly to the internet.
# Nginx API key authentication
location /v1/ {
set $api_key "";
if ($http_authorization ~* "Bearer (.+)") {
set $api_key $1;
}
if ($api_key != "your-secret-key") {
return 401 '{"error":"unauthorized"}';
}
proxy_pass http://ollama;
proxy_buffering off;
}
# Bind to localhost only
OLLAMA_HOST=127.0.0.1:11434
# Firewall rules (UFW): first matching rule wins, so allow before deny
sudo ufw allow from 10.0.0.0/8 to any port 11434
sudo ufw deny 11434
# Docker: internal networking only
services:
  ollama:
    expose:  # NOT ports:
      - "11434"
Scaling & Load Balancing
Scale horizontally with multiple instances behind a load balancer. Each instance on a separate GPU. Use least-connections routing.
# Multiple instances on different GPUs
# Instance 1: GPU 0 (bind to loopback; Nginx is the only public entry point)
CUDA_VISIBLE_DEVICES=0 \
OLLAMA_HOST=127.0.0.1:11434 ollama serve
# Instance 2: GPU 1
CUDA_VISIBLE_DEVICES=1 \
OLLAMA_HOST=127.0.0.1:11435 ollama serve
# Nginx load balancer
upstream ollama_cluster {
least_conn;
server 127.0.0.1:11434;
server 127.0.0.1:11435;
keepalive 32;
}
server {
location / {
proxy_pass http://ollama_cluster;
proxy_buffering off;
}
}
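To illustrate what least-connections routing does, here is a small Python sketch of the selection rule. It is not how Nginx implements `least_conn`; the class name `LeastConnBalancer` and its methods are hypothetical, and the backend URLs simply mirror the two instances above.

```python
import threading


class LeastConnBalancer:
    """Pick the backend with the fewest in-flight requests (illustrative)."""

    def __init__(self, backends):
        self._active = {backend: 0 for backend in backends}
        self._lock = threading.Lock()

    def acquire(self):
        """Choose a backend and count the new request against it."""
        with self._lock:
            # On ties, min() returns the first backend in insertion order.
            backend = min(self._active, key=self._active.get)
            self._active[backend] += 1
            return backend

    def release(self, backend):
        """Mark one request on `backend` as finished."""
        with self._lock:
            self._active[backend] -= 1

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._active)


balancer = LeastConnBalancer([
    "http://127.0.0.1:11434",
    "http://127.0.0.1:11435",
])
```

Least-connections suits LLM serving better than round-robin because generation times vary widely: a long completion keeps its GPU busy, so new requests drift to the instance with fewer requests in flight.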
12. Performance Tuning
Performance tuning involves optimizing context length, batch size, parallel request handling, KV cache management, and model loading strategy. Key metrics: tokens per second (tok/s), time to first token (TTFT), and concurrent request capacity.
The two biggest performance levers are (1) ensuring the model fits entirely in GPU VRAM to avoid CPU fallback, and (2) tuning num_ctx to the minimum needed. The KV cache grows linearly with context length, so a 7B model at 4K context can be 2-3x faster than the same model at 32K, particularly when the larger cache pushes layers out of VRAM.
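The linear growth can be estimated with a back-of-the-envelope formula: two tensors (keys and values) per layer, per KV head, per token. The sketch below assumes an illustrative architecture (36 layers, 8 KV heads, head dimension 128, FP16 cache); these numbers are assumptions for the example, not values read from any particular model, and real allocations include extra overhead.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, num_parallel=1):
    """Estimate KV cache size in bytes.

    Each token stores one key and one value vector per layer and KV head,
    hence the leading factor of 2. The result scales linearly with the
    context length and with the number of parallel slots.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * num_ctx * num_parallel


# Assumed architecture for illustration only.
ARCH = dict(n_layers=36, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768):
    gib = kv_cache_bytes(ctx, **ARCH) / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB of FP16 KV cache")
```

Under these assumptions, going from 4K to 32K context multiplies the cache by eight, and setting `num_parallel=4` multiplies it by four again. This is why trimming num_ctx is usually the cheapest way to win back VRAM.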
Context Length & KV Cache
Context length directly impacts VRAM usage and speed. Set num_ctx to the minimum needed. Shorter context = faster inference.
# Override context per request:
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Hello",
"options": { "num_ctx": 4096 }
}'
# Approximate total VRAM (weights + KV cache), 8B model, Q4_K_M:
# 2K ctx: ~4.5 GB total
# 4K ctx: ~5.0 GB total
# 8K ctx: ~5.5 GB total
# 16K ctx: ~6.5 GB total
# 32K ctx: ~8.5 GB total
# 128K ctx: ~20.0 GB total
# Set default in Modelfile
# PARAMETER num_ctx 4096
# Or set a server-wide default via environment variable
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
Parallel Requests
Concurrent request processing against a single loaded copy of the model weights. Each parallel slot gets its own KV cache allocation, so size the slot count for multi-user serving with VRAM in mind.
# Enable parallel request processing
export OLLAMA_NUM_PARALLEL=4 # 4 concurrent
# Each slot uses additional KV cache VRAM
# 4 parallel x 8K ctx = 4x KV cache VRAM
# Start server with parallel config
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve
# Test concurrent requests
for i in $(seq 1 4); do
curl -s http://localhost:11434/api/generate \
-d '{"model":"qwen3:8b",
"prompt":"Count to 10",
"stream":false}' &
done
wait
# Check which models are loaded and how much memory they use
ollama ps
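The same concurrency test can be sketched in Python with a thread pool, which makes it easy to compare wall-clock time across different OLLAMA_NUM_PARALLEL settings. The function names `post_generate` and `run_concurrently` are hypothetical; the `send` parameter is injectable so the harness can be exercised without a server.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_generate(prompt, model="qwen3:8b",
                  base_url="http://localhost:11434"):
    """Send one non-streaming /api/generate request; return parsed JSON."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


def run_concurrently(prompts, send=post_generate, workers=4):
    """Call `send` for every prompt on a thread pool.

    Returns (results, wall_seconds); results keep the order of `prompts`.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

Running `run_concurrently(["Count to 10"] * 4)` against a server started with one slot and then with four should show the wall time dropping if the slots are actually serving requests in parallel.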
Model Loading & Keep-Alive
Loading a model into VRAM typically takes 2-10s, and longer for large models or slow disks. Configure keep-alive to avoid reloading, and pre-load models at startup so the first request does not pay that cost.
# Keep models in memory longer (default: 5m)
export OLLAMA_KEEP_ALIVE=30m
# Keep model loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1
# Per-request keep_alive override
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Hello",
"keep_alive": "1h"
}'
# Pre-load a model at startup
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"keep_alive": "24h"
}'
# Unload a model immediately
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"keep_alive": 0
}'
# Limit concurrent loaded models
export OLLAMA_MAX_LOADED_MODELS=2
Benchmarking & Profiling
Measure token throughput, time to first token, and total latency using API response metadata.
#!/usr/bin/env python3
"""Benchmark Ollama model performance."""
import statistics

import requests

def benchmark(model, prompt, n=5):
    results = []
    for i in range(n):
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt,
                  "stream": False,
                  "options": {"num_ctx": 4096}},
            timeout=600,
        ).json()
        # All durations in the response are in nanoseconds
        eval_count = resp.get("eval_count", 0)
        eval_dur = resp.get("eval_duration", 0)
        tok_s = eval_count / (eval_dur / 1e9) if eval_dur else 0.0
        # Approximate TTFT: model load time plus prompt processing time
        ttft = (resp.get("load_duration", 0)
                + resp.get("prompt_eval_duration", 0)) / 1e9
        results.append(tok_s)
        print(f"Run {i+1}: {tok_s:.1f} tok/s, "
              f"TTFT: {ttft:.2f}s")
    print(f"\nMedian: {statistics.median(results):.1f}")
    print(f"Mean: {statistics.mean(results):.1f}")

benchmark("qwen3:8b", "Explain Docker networking")
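Response metadata only approximates time to first token. To measure it from the client side, one option is to stream the response and timestamp the first non-empty chunk. The sketch below assumes the newline-delimited JSON format of a streaming `/api/generate` response; `measure_stream` is a hypothetical helper, and the `clock` parameter exists only so the logic can be tested deterministically.

```python
import json
import time


def measure_stream(lines, clock=time.perf_counter):
    """Consume NDJSON chunks from a streaming /api/generate response.

    Returns (ttft_seconds, tokens_per_second, text). Timing starts when
    this function is called, so call it right after sending the request.
    `lines` is any iterable of str or bytes lines.
    """
    start = clock()
    ttft = None
    parts = []
    final = {}
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if ttft is None and chunk.get("response"):
            ttft = clock() - start  # first visible token arrived
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # the last chunk carries the timing metadata
            break
    eval_count = final.get("eval_count", 0)
    eval_ns = final.get("eval_duration", 0)
    tok_s = eval_count / (eval_ns / 1e9) if eval_ns else 0.0
    return ttft, tok_s, "".join(parts)
```

With the standard library, the object returned by `urllib.request.urlopen` for a request with `"stream": true` can be passed directly as `lines`, since it iterates line by line. For models that emit a reasoning phase first, this measures the time to the first answer token rather than the first generated token.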
Environment Variables Reference
Reference of the most commonly used Ollama environment variables for performance, storage, networking, and resource limits.
# Server configuration
OLLAMA_HOST=0.0.0.0:11434 # Bind address
OLLAMA_ORIGINS=* # CORS origins
OLLAMA_MODELS=/data/models # Model storage
# Performance tuning
OLLAMA_NUM_PARALLEL=4 # Concurrent requests
OLLAMA_MAX_LOADED_MODELS=2 # Models in VRAM
OLLAMA_KEEP_ALIVE=30m # Model idle timeout
# GPU layer offload is a per-model option, not a server variable:
# use PARAMETER num_gpu 99 in a Modelfile (99 = all layers)
OLLAMA_MAX_QUEUE=512 # Request queue size
# GPU control
CUDA_VISIBLE_DEVICES=0,1 # NVIDIA GPU select
HSA_OVERRIDE_GFX_VERSION=11.0.0 # AMD override
OLLAMA_FLASH_ATTENTION=1 # Flash attention
OLLAMA_KV_CACHE_TYPE=q8_0 # Quantized KV cache
# Debug
OLLAMA_DEBUG=1 # Verbose logging
OLLAMA_LLM_LIBRARY=cpu # Force CPU backend
# Apply via systemd override:
sudo systemctl edit ollama
# [Service]
# Environment="OLLAMA_NUM_PARALLEL=4"
# Environment="OLLAMA_KEEP_ALIVE=30m"
sudo systemctl restart ollama
Flash Attention & KV Cache Quantization
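Enable flash attention for faster inference and lower VRAM use. With flash attention on, you can also quantize the KV cache to fit larger contexts in the same VRAM budget; the KV cache type setting only takes effect when flash attention is enabled.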
# Enable flash attention (experimental)
export OLLAMA_FLASH_ATTENTION=1
# KV cache quantization
export OLLAMA_KV_CACHE_TYPE=q8_0
# Impact on 8B model, 32K context:
# FP16 KV cache: ~8.5 GB VRAM
# Q8_0 KV cache: ~6.5 GB VRAM (-24%)
# Q4_0 KV cache: ~5.5 GB VRAM (-35%)
# Combined optimizations:
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_KEEP_ALIVE=1h \
ollama serve
# Verify settings in debug mode
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i flash