
Ollama: Run Open-Source LLMs Locally with One Command

Most LLM demos assume you have an OpenAI key. That is fine for prototypes, but there are good reasons to run the model yourself: the prompt stays on your machine, there is no per-token bill, and you can fine-tune, quantize, or swap weights without asking anyone. The catch used to be that local inference meant wrestling with CUDA, llama.cpp build flags, and Python virtualenvs.

Ollama hides all of that. One binary, one command, and a model is served over a REST API at http://localhost:11434. Under the hood it wraps llama.cpp and adds the pieces you would otherwise script yourself: model pulling, automatic GPU offload, quantized weights (Q4_0, Q4_K_M, Q5, Q8), and an OpenAI-compatible /v1/chat/completions endpoint that drops into tools that already speak OpenAI.

This guide goes from a fresh install to a small working agent that calls a function and returns structured JSON. You will copy-paste, run, and end with a model you control end to end.

Prerequisites

  • macOS, Linux, or Windows (natively or under WSL2)
  • 8GB RAM minimum (16GB+ recommended for 7B+ models)
  • For GPU acceleration: Apple Silicon built-in, or NVIDIA GPU with CUDA drivers on Linux
  • About 10GB free disk per 7B model, more for larger or higher-precision variants

Ollama's install page covers macOS (DMG), Linux (one-line script), and Windows (a native installer, first shipped as a preview build). The project is open source under the MIT license.

Step 1: Install

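On Linux, the install is a one-liner. On macOS, install the DMG from the download page instead; the script has historically been Linux-only.

curl -fsSL https://ollama.com/install.sh | sh

On Linux that typically drops the ollama binary into /usr/local/bin and registers a systemd service that starts on boot. The macOS app runs its own background server. Verify: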

ollama --version

You should see something like ollama version 0.5.x. If ollama serve is not already running, start it manually:

ollama serve
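
To confirm the daemon is actually listening, you can request the root URL, which answers with the plain text "Ollama is running". The sketch below wraps that check using only the standard library. The names server_is_up and fetch are illustrative helpers invented for this example, not part of any Ollama package.

```python
import urllib.request


def _http_get(url, timeout):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")


def server_is_up(base_url="http://localhost:11434", fetch=_http_get, timeout=2.0):
    """Return True if the root URL answers with Ollama's status text.

    `fetch` is injectable so the check can be exercised without a server.
    """
    try:
        body = fetch(base_url.rstrip("/") + "/", timeout)
    except OSError:  # connection refused, timeout, or HTTP error
        return False
    return "Ollama is running" in body
```

Calling server_is_up() with no arguments probes the default address used throughout this guide.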

The first time you issue a model command, the daemon downloads the model from the Ollama registry and caches it under ~/.ollama/models.

Step 2: Pull and Chat

ollama pull llama3.2

That fetches the default 3B Llama 3.2 model (~2GB). For more capability without much more weight, try llama3.1:8b. For coding, qwen2.5-coder:7b is a popular choice. The full tag list is at ollama.com/library.

Once pulled, run it interactively:

ollama run llama3.2

You are now chatting with a local LLM. Type /bye to exit, or pass a prompt directly:

ollama run llama3.2 "Explain PagedAttention in 2 sentences."

Step 3: Call the REST API

Ollama exposes two HTTP surfaces:

  • Native API at POST /api/chat and POST /api/generate
  • OpenAI-compatible at POST /v1/chat/completions, POST /v1/embeddings, GET /v1/models
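
For reference, here is a minimal sketch of a call to the native POST /api/chat endpoint using only the Python standard library. It assumes the default address and a model that is already pulled. build_chat_request and chat are hypothetical helper names, and "stream": False asks for a single JSON object instead of a stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_chat_request(model, prompt, base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming POST /api/chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a stream
    }
    return urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt):
    """Send the request; needs a running server with the model pulled."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, chat("llama3.2", "Write a haiku about caching.") returns the reply text.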

The OpenAI-compatible endpoint is the killer feature. Any tool that already targets OpenAI can point at Ollama by swapping the base URL:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Write a haiku about caching."}
    ]
  }'

The response follows OpenAI's chat-completion schema for the fields Ollama supports, so the openai Python SDK and the openai Node SDK work once you change the base URL and supply a placeholder API key:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the Ollama README."}],
)
print(resp.choices[0].message.content)

The Node SDK takes the same two settings:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const resp = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "List 3 CLI productivity tools." }],
});
console.log(resp.choices[0].message.content);

Step 4: Use the Official Python SDK

For tighter integration, Ollama ships a Python library that exposes streaming, async, and structured outputs:

pip install ollama

import ollama

# streaming
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantization in plain English."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
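
On the wire, a streamed /api/chat response is newline-delimited JSON: each line carries a fragment under message.content, and the final line has done set to true. The sketch below parses that shape by hand. The sample lines are hand-written for illustration, not captured output, and iter_content is a hypothetical helper name.

```python
import json


def iter_content(lines):
    """Yield text fragments from newline-delimited JSON chat chunks."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # ignore empty lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # last chunk: nothing more to read


# Hand-written sample in the shape the native API streams
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_content(sample)))  # Hello
```

The SDK's stream=True iterator does this parsing for you, so the manual version is only needed when you call the HTTP API directly.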

The async client works the same way with ollama.AsyncClient. Embeddings are one call:

emb = ollama.embeddings(model="nomic-embed-text", prompt="hello world")
print(len(emb["embedding"]))  # 768
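
Embedding vectors are usually compared with cosine similarity. The helper below is a plain-Python sketch of that comparison. It is not part of the ollama package, and it rejects vectors of different lengths, which is what you get when you mix embedding models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

In practice you would pass two emb["embedding"] lists from the call above.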

Step 5: Build a Custom Modelfile

A Modelfile is to Ollama what a Dockerfile is to a container. You pick a base model, set parameters, and bake a system prompt in.

FROM llama3.2
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """
You are a senior DevOps engineer. Answer concisely. Prefer commands over prose.
When asked for a Dockerfile, return a complete, runnable example.
"""

Build and run it:

ollama create devops-bot -f ./Modelfile
ollama run devops-bot "Write a Dockerfile for a Go HTTP server on Alpine."

Useful PARAMETER keys: temperature, top_p, top_k, num_ctx (context length), num_gpu (number of layers to offload), stop (a string that halts generation), and seed for reproducibility.
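
If you keep several variants of a bot, generating the Modelfile from code can be easier than hand-editing copies. The sketch below renders the FROM, PARAMETER, and SYSTEM lines shown above. render_modelfile is a hypothetical helper, and it assumes string values (such as stop sequences) contain no double quotes.

```python
def render_modelfile(base, system=None, **params):
    """Render Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}"]
    for key, value in params.items():
        # A list or tuple becomes one PARAMETER line per entry (e.g. several stops)
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            rendered = f'"{v}"' if isinstance(v, str) else str(v)
            lines.append(f"PARAMETER {key} {rendered}")
    if system:
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "llama3.2",
    system="You are a senior DevOps engineer. Answer concisely.",
    temperature=0.2,
    num_ctx=4096,
))
```

Write the returned text to ./Modelfile and build it with ollama create as shown above.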

You can also pull a model from Hugging Face directly, including GGUF files you uploaded yourself:

FROM hf.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF:Q4_K_M

Step 6: A Real Use Case — Local Function-Calling Agent

Ollama added native tool/function calling in mid-2024 (version 0.3). The pattern below shows a tiny agent that picks between two fake tools based on the user request. It runs fully on your laptop.

import json
import ollama

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send an email to a recipient",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string"},
                    "subject": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["to", "subject", "body"],
            },
        },
    },
]

def dispatch(name, args):
    if name == "get_weather":
        return f"Sunny, 24C in {args['city']}"
    if name == "send_email":
        return f"Email queued to {args['to']}"
    return "unknown tool"

messages = [{"role": "user", "content": "What's the weather in Jakarta?"}]

resp = ollama.chat(
    model="llama3.1:8b",  # 8B is more reliable for tool calls than 3B
    messages=messages,
    tools=tools,
)

if resp["message"].get("tool_calls"):
    call = resp["message"]["tool_calls"][0]
    name = call["function"]["name"]
    args = call["function"]["arguments"]
    result = dispatch(name, args)
    print(f"Tool: {name}({args}) -> {result}")
else:
    print(resp["message"]["content"])

The 8B variant picks the right tool more often than 3B. For production agents, you will want either a larger model or a smaller one with a tool-calling fine-tune (Qwen2.5, Llama 3.1, and Mistral Nemo are the usual picks).
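
The script above stops after printing the tool result. A full agent turn usually sends the result back so the model can phrase a final answer. The sketch below builds the messages to append: the assistant's tool-call turn, then one "tool" message per call. tool_messages is a hypothetical helper, and the message shape (role and content only) is the minimal form, so check the API reference for optional fields in your version.

```python
def tool_messages(assistant_message, dispatch):
    """Run each requested tool and return the messages to append to the chat."""
    out = [assistant_message]  # the model's own tool-call turn goes back first
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        result = dispatch(fn["name"], fn["arguments"])
        out.append({"role": "tool", "content": str(result)})
    return out


# Hand-written stand-in for resp["message"], so the sketch runs without a model
fake_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": {"city": "Jakarta"}}}
    ],
}


def fake_dispatch(name, args):
    return f"Sunny, 24C in {args['city']}"


for m in tool_messages(fake_message, fake_dispatch)[1:]:
    print(m)  # {'role': 'tool', 'content': 'Sunny, 24C in Jakarta'}
```

In the real loop you would extend messages with tool_messages(resp["message"], dispatch) and call ollama.chat again with the same model and tools.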

Performance: How to Get More Out of It

A few settings actually move the needle on Apple Silicon and Linux.

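Force CPU-only inference. Setting num_gpu to 0 (PARAMETER num_gpu 0 in a Modelfile, or num_gpu in the options of an API request) offloads no layers to the GPU, so everything runs on CPU. Useful as a baseline.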

Control context length. Long contexts cost a lot of memory. num_ctx 2048 is plenty for chat; raise to 8192+ only when you need it.
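
To see why context length is expensive, it helps to estimate the key/value cache, which grows linearly with num_ctx. The arithmetic below is a rough sketch under stated assumptions: a model shaped like Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. Ollama's real footprint also depends on the cache type and the number of parallel slots.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values
for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"num_ctx={ctx:>6}: {gib:.2f} GiB")
# num_ctx=  2048: 0.25 GiB
# num_ctx=  8192: 1.00 GiB
# num_ctx= 32768: 4.00 GiB
```

Under these assumptions, going from 2048 to 8192 tokens adds about 0.75 GiB on top of the weights.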

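Concurrent requests. Ollama queues incoming requests and can process several in parallel against a single loaded model. The number of parallel slots is set with OLLAMA_NUM_PARALLEL. When it is unset, Ollama chooses the default itself; the FAQ has documented this as 1, or 4 when memory allows, not as a function of CPU core count. Each extra slot also increases the context memory the model reserves.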
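
On the client side, parallel slots only help if you actually send requests concurrently. The sketch below fans prompts out over a thread pool. fan_out and fake_send are illustrative names: in real use, send would be a function that posts one prompt to Ollama and returns the reply.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompts, send, max_workers=4):
    """Send prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    """Stand-in for a real request so the sketch runs without a server."""
    return prompt.upper()


print(fan_out(["a", "b", "c"], fake_send))  # ['A', 'B', 'C']
```

Keeping max_workers at or below the server's parallel slot count avoids piling requests into the server-side queue.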

Watch ollama ps. Live memory and processor usage per running model:

ollama ps

Quantization choice. Smaller quant = faster, less memory, slightly worse quality. For most chat and code work, Q4_K_M is the sweet spot. The tag on the model name, whether passed to ollama pull or written in a Modelfile FROM line, selects which quant gets pulled.

When to Use Ollama (and When Not To)

Ollama fits when you want a single machine, a small number of models, and a one-binary install. It does not fit when you need to scale to many concurrent users or a fleet of GPUs.

Pick Ollama when:

  • You are developing on a laptop and want offline access
  • You need a private model for a single team or a small tool
  • You want an OpenAI-compatible API without running a vLLM cluster
  • You are evaluating open-source models and want to swap them in and out fast

Pick something else when:

  • vLLM if you have a multi-GPU box and need maximum throughput. vLLM's PagedAttention still beats Ollama's llama.cpp backend by a wide margin on large batches. See the vLLM setup guide for the trade-offs.
  • llama.cpp directly if you need exotic quant formats or want to read the source. Ollama wraps it, but you give up flexibility.
  • LM Studio if you want a GUI for chatting with local models. It also uses llama.cpp under the hood and is friendlier for non-CLI users.
  • OpenAI or Anthropic API if you cannot guarantee a GPU, or if you need a frontier model that no open release matches yet.

Common Pitfalls

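Model pulls silently fail behind a corporate proxy. The download is performed by the daemon, not by the ollama pull command, so exporting HTTPS_PROXY in your shell is not enough. Set HTTPS_PROXY in the environment of the ollama serve process (for example in the systemd unit) and restart it.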

Ollama binds to 127.0.0.1 by default. If you want another machine on your LAN to hit the API, set OLLAMA_HOST=0.0.0.0:11434 before starting the service, or edit the systemd unit. Do not expose this port to the public internet without auth.

Out of memory on 16GB machines. A 7B model in Q4 takes around 5GB. A 13B model takes around 9GB. Add a few GB for the KV cache and OS, and you are at the limit. Drop to a smaller model or close other apps.

/v1/embeddings returns a different dimension than OpenAI. OpenAI's text-embedding-3-small is 1536 dims. nomic-embed-text, the embedding model used in this guide, is 768 dims. If you swap them in an existing vector DB, re-embed everything.

Tool calls are flaky on small models. Models under about 7B often hallucinate function names or miss required fields. Test with a known prompt and a known model before wiring it into an agent loop.
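
One cheap guard is to check every tool call against the schema you sent before dispatching it. The sketch below flags unknown tool names, missing required fields, and unexpected fields. check_tool_call is a hypothetical helper and only covers those three cases; it is not a full JSON Schema validator.

```python
def check_tool_call(tools, name, args):
    """Return a list of problems; an empty list means the call looks usable."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    if name not in specs:
        return [f"unknown tool: {name}"]
    schema = specs[name].get("parameters", {})
    known = schema.get("properties", {})
    problems = [
        f"missing required field: {k}"
        for k in schema.get("required", [])
        if k not in args
    ]
    problems += [f"unexpected field: {k}" for k in args if k not in known]
    return problems


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

print(check_tool_call(tools, "get_weather", {"city": "Jakarta"}))  # []
print(check_tool_call(tools, "get_wether", {"city": "Jakarta"}))
# ['unknown tool: get_wether']
print(check_tool_call(tools, "get_weather", {"town": "Jakarta"}))
# ['missing required field: city', 'unexpected field: town']
```

When the list is non-empty, a reasonable fallback is to send the problems back to the model as a message and ask it to retry.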
