
Run a Local LLM Code Reviewer with Ollama and Continue.dev

Most code review tools send your diff to someone else's server. For a side project that is fine. For a client codebase, an internal monorepo, or anything under NDA, it is a non-starter. The fix is running the model yourself.

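Ollama wraps llama.cpp and gives you a single binary that pulls, runs, and serves GGUF models over a local HTTP API on port 11434. Continue is a VS Code and JetBrains extension with a built-in Ollama provider; it can also talk to any OpenAI-compatible endpoint, which Ollama additionally exposes under /v1. Together they give you a code reviewer that runs on your laptop: the code never leaves the machine and there is no subscription. (Continue collects anonymous usage telemetry by default; you can switch it off in its config.)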
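To make "OpenAI-compatible" concrete, here is a minimal sketch of the request shape such a client sends to Ollama's /v1/chat/completions route. It assumes Ollama is running on the default port, the model from Step 2 is already pulled, and jq is installed; ollama_chat_payload, ollama_chat, and OLLAMA_URL are names made up for this example.

```bash
# Sketch: call Ollama through its OpenAI-compatible /v1 route.
# Assumes Ollama on localhost:11434, jq installed, and the model already pulled.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build an OpenAI-style chat request body; jq takes care of JSON escaping.
ollama_chat_payload() {
  local model="$1" prompt="$2"
  jq -n --arg m "$model" --arg p "$prompt" \
    '{model: $m, messages: [{role: "user", content: $p}], stream: false}'
}

# Send the request and print only the assistant's reply text.
ollama_chat() {
  ollama_chat_payload "$1" "$2" |
    curl -sf "$OLLAMA_URL/v1/chat/completions" \
      -H "Content-Type: application/json" -d @- |
    jq -r '.choices[0].message.content'
}
```

Usage: ollama_chat qwen2.5-coder:7b "Review this Python: def add(a,b): return a-b". Continue's native ollama provider talks to Ollama's own API rather than /v1, but the idea is the same: JSON in, completion out.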

This guide walks through a setup that actually works: install Ollama, pick a model that fits a developer laptop, wire it to Continue, and run a real review on a real diff.

Prerequisites

  • macOS, Linux, or Windows (WSL2 recommended on Windows)
  • About 8 GB of free RAM for a 7B model, 16 GB for a 14B
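  • About 5 GB of free disk for a 7B model's weights, about 10 GB for a 14B
  • VS Code 1.85+
  • git, curl, and jq (the optional pre-commit script in Step 5 uses jq)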
  • Optional but recommended: an Apple Silicon Mac or any machine with a discrete GPU
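If you want to check the tooling side of this list from a script, a small helper like the one below is enough. require_cmds is a hypothetical name, and the tool list in the usage line is an assumption based on the commands used later in this guide.

```bash
# Sketch: report any required command-line tools that are not on PATH.
# Returns 0 if all are present, 1 if any are missing (each listed on stderr).
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: require_cmds ollama curl jq git || echo "install the missing tools first"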

Step 1: Install Ollama

# macOS
brew install ollama

# Linux (one-liner from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com/download

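Start the server. The desktop app for macOS and Windows starts it for you. The Homebrew formula installs only the CLI, so run brew services start ollama or start it in a separate terminal. On Linux, the install script registers a systemd service when systemd is available; otherwise run it in a separate terminal: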

ollama serve

Verify it is up:

curl http://localhost:11434/api/version
# prints {"version":"<installed version>"}
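Scripts that start the server and immediately send a request can race it. Here is a minimal sketch that polls the same /api/version endpoint until it answers; wait_for_ollama is a hypothetical helper, and the 10-try default and 2-second timeout are arbitrary choices.

```bash
# Sketch: poll Ollama's /api/version until it answers or the tries run out.
# Arguments: base URL (default http://localhost:11434), number of tries (default 10).
wait_for_ollama() {
  local url="${1:-http://localhost:11434}" tries="${2:-10}" i
  for ((i = 1; i <= tries; i++)); do
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: never hang
    if curl -sf --max-time 2 "$url/api/version" >/dev/null; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if ((i < tries)); then
      sleep 1
    fi
  done
  return 1
}
```

Usage: ollama serve >/tmp/ollama.log 2>&1 & then wait_for_ollama && echo "Ollama is up".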

Step 2: Pick a model

Not every model is good at code review. You want something trained on code, with a context window large enough to fit a diff plus some of the surrounding file. Three solid options as of mid-2026:

Model                   Download size   RAM needed (approx.)   Why pick it
qwen2.5-coder:7b        4.7 GB          8 GB                   Best bang per token for code tasks
qwen2.5-coder:14b       9.0 GB          16 GB                  Noticeably better at catching logic bugs
deepseek-coder-v2:16b   8.9 GB          16 GB                  Strong at multi-file reasoning
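If you script the setup across several machines, the RAM column can be turned into a rough rule. This sketch assumes the 16 GB cutoff from the table, looks at total (not free) RAM, and only chooses between the two Qwen tags; pick_model_for_ram and total_ram_gb are hypothetical helpers.

```bash
# Sketch: map installed RAM (in GB) to a model tag, using the table's 16 GB cutoff.
pick_model_for_ram() {
  local ram_gb="$1"
  if ((ram_gb >= 16)); then
    echo "qwen2.5-coder:14b"
  else
    echo "qwen2.5-coder:7b"
  fi
}

# Total physical RAM in GB: /proc/meminfo (kB) on Linux, sysctl (bytes) on macOS.
# Linux reports slightly less than the installed amount, so round to nearest.
total_ram_gb() {
  if [[ -r /proc/meminfo ]]; then
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 + 0.5 }' /proc/meminfo
  else
    echo $(( $(sysctl -n hw.memsize) / 1073741824 ))
  fi
}
```

Usage: ollama pull "$(pick_model_for_ram "$(total_ram_gb)")"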

Pull the one that fits your hardware:

ollama pull qwen2.5-coder:7b

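The first pull downloads several gigabytes. Running the same pull again is quick because Ollama skips layers it already has, but a different size such as the 14b tag is a separate multi-gigabyte download: the weights are not shared between sizes.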
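To make a setup script skip the download when the model is already there, you can check the output of ollama list. A minimal sketch, assuming the current output format where the first column after the header row is the model name; model_installed is a hypothetical helper.

```bash
# Sketch: succeed if `ollama list` output (read from stdin) contains the tag.
# Skips the header row and compares the first column exactly.
model_installed() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}
```

Usage: ollama list | model_installed qwen2.5-coder:7b || ollama pull qwen2.5-coder:7b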

Sanity check the model before touching the editor:

ollama run qwen2.5-coder:7b "Review this Python: def add(a,b): return a-b"

You should get a response that flags the bug (the function subtracts instead of adding), explains it, and suggests return a + b. If the output is garbled, the download may be damaged: remove it with ollama rm qwen2.5-coder:7b and pull it again.

Step 3: Configure Continue in VS Code

Install the Continue extension from the VS Code marketplace. After install, open the config file with the command palette:

Continue: Open config.json

This creates ~/.continue/config.json. Replace its contents with:

{
  "allowAnonymousTelemetry": false,
  "models": [
    {
      "title": "Qwen Coder 7B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "systemMessage": "You are a senior engineer doing code review. Be specific. Cite line numbers. Do not restate the code. Skip nitpicks. Flag real issues: bugs, security holes, missing edge cases, and confusing naming."
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B (local)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Save the file. Continue reloads its config on save; if the model does not show up, restart VS Code. Open the Continue panel from the left sidebar (it looks like a chat bubble). You should see Qwen Coder 7B (local) in the model dropdown.
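A quick way to catch a typo in a model field is to list every Ollama model the config references and ask Ollama about each one. This sketch assumes jq and the config.json layout shown above; config_models is a hypothetical helper.

```bash
# Sketch: print each distinct Ollama model named in Continue's config.json.
# Assumes jq and the layout used above: a "models" array plus an optional
# "tabAutocompleteModel" object, each with "provider" and "model" fields.
config_models() {
  local cfg="${1:-$HOME/.continue/config.json}"
  jq -r '
    [.models[]?, .tabAutocompleteModel]
    | map(select(. != null and .provider == "ollama") | .model)
    | unique | .[]' "$cfg"
}
```

Usage, flagging anything Ollama does not know about:

config_models | while IFS= read -r m; do
  ollama show "$m" >/dev/null 2>&1 || echo "not pulled: $m"
done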

Step 4: Review a real diff

The keyboard shortcut Cmd+L (macOS) or Ctrl+L (Linux/Windows) opens the chat with the current selection as context. Select a function, hit the shortcut, and ask:

Find bugs and security issues in this function.

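For full-file or full-diff reviews, paste the diff into the chat directly. Qwen2.5-Coder supports a 32K-token context window, which is enough for most PRs, but Ollama only gives the model as much context as its num_ctx setting allows (see Performance tuning below).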
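Before pasting a large diff, a rough size check helps. This sketch assumes about 4 characters per token, a common rule of thumb rather than the model's real tokenizer, so treat the result as an estimate; estimate_tokens and fits_context are hypothetical helpers.

```bash
# Sketch: estimate tokens for text on stdin at ~4 characters per token (rounded up).
estimate_tokens() {
  local bytes
  bytes=$(wc -c | tr -d ' ')
  echo $(( (bytes + 3) / 4 ))
}

# Succeeds if the estimated token count of stdin fits within the given budget.
fits_context() {
  local budget="$1" tokens
  tokens=$(estimate_tokens)
  ((tokens <= budget))
}
```

Usage: git diff main...HEAD | fits_context 12000 || echo "Too big for one pass; review it in pieces." Keep the budget well under num_ctx so there is room left for the system message and the reply.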

For a repeatable review across a branch, run this from the repo root:

git diff main...HEAD > /tmp/changes.diff

Typing a path into the chat does not attach the file, so open it in VS Code (code /tmp/changes.diff), select all, press Cmd+L or Ctrl+L, and ask:

Review this diff. List the top 5 issues with file paths and line numbers.
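When a branch diff is too big for one chat, one option is to review it file by file. A minimal sketch, assuming the base branch is main; split_branch_diff and the /tmp/review-chunks default directory are made up for this example.

```bash
# Sketch: write one diff file per changed file between the merge base and HEAD.
# Arguments: base branch (default main), output directory (default /tmp/review-chunks).
split_branch_diff() {
  local base="${1:-main}" out="${2:-/tmp/review-chunks}" f
  mkdir -p "$out"
  git diff --name-only "$base...HEAD" | while IFS= read -r f; do
    # Flatten the path into a file name: src/app.py -> src__app.py.diff
    git diff "$base...HEAD" -- "$f" > "$out/${f//\//__}.diff"
  done
}
```

Usage: split_branch_diff main /tmp/review-chunks, then open each .diff file and review it the same way as the full diff.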

You can also make the checklist a project-level rule. Create .continue/rules/review.md; Continue adds the rules in that folder to the system prompt for chats in this workspace:

# Code review

When asked to review code:
1. Read the full diff before commenting.
2. Group findings by severity: bug, security, performance, style.
3. Cite the file and line for every finding.
4. If the code is fine, say so. Do not invent issues.

Step 5: Make it part of pre-commit (optional)

If you want a review on every commit, skip Continue and call the Ollama API directly from a shell script. A minimal version:

#!/usr/bin/env bash
set -euo pipefail

DIFF=$(git diff --cached)
[ -z "$DIFF" ] && exit 0

# jq escapes the diff safely into JSON; \(...) is jq string interpolation
PAYLOAD=$(jq -n --arg d "$DIFF" '{
  model: "qwen2.5-coder:7b",
  prompt: "Review this staged diff. Flag bugs and security issues. Be terse.\n\n\($d)",
  stream: false
}')

RESPONSE=$(curl -s http://localhost:11434/api/generate -d "$PAYLOAD")
echo "$RESPONSE" | jq -r .response

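Save it as scripts/pre-commit-review.sh, make it executable (chmod +x), and call it from .git/hooks/pre-commit. The review prints to your terminal before each commit. Because set -e is on, the commit is also blocked if Ollama is not running. Make the script exit non-zero on serious findings and you have a review gate that runs on your laptop. Very large staged diffs can overflow the model's context; see Performance tuning.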
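One way to add that non-zero exit is to ask the model to end its review with a verdict line and check for it. The VERDICT: PASS / VERDICT: BLOCK convention and the review_verdict helper are assumptions of this sketch, not something Ollama provides, and a small model will not always follow the format, so a missing verdict is treated as a pass.

```bash
# Sketch: fail when the review's last verdict line says BLOCK.
# Assumed convention (ours, not Ollama's): the prompt asks the model to end
# with exactly one line, "VERDICT: PASS" or "VERDICT: BLOCK".
# A review with no verdict line passes, since small models often skip it.
review_verdict() {
  local verdict
  # awk keeps the second field of the last matching line and always exits 0,
  # so this stays safe under `set -euo pipefail`.
  verdict=$(printf '%s\n' "$1" |
    awk '/^VERDICT: (PASS|BLOCK)[ \t]*$/ { v = $2 } END { print v }')
  if [[ "$verdict" == "BLOCK" ]]; then
    echo "Review flagged blocking issues; commit aborted." >&2
    return 1
  fi
  return 0
}
```

To use it, append "End with one line: VERDICT: PASS or VERDICT: BLOCK." to the prompt, define review_verdict in the script, and replace the last line with REVIEW=$(echo "$RESPONSE" | jq -r .response), then echo "$REVIEW" and review_verdict "$REVIEW". Link the script as the hook with ln -s ../../scripts/pre-commit-review.sh .git/hooks/pre-commit, and use git commit --no-verify when you need to skip it.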

Performance tuning

Ollama defaults are fine for a single user. Two knobs matter:

Context length. Ollama's default num_ctx is only a few thousand tokens (2048 in older releases), which is too short for most diffs; longer input gets truncated. Set it per model with a Modelfile:

FROM qwen2.5-coder:7b
PARAMETER num_ctx 16384
PARAMETER temperature 0.2

Build it:

ollama create qwen-coder-review -f Modelfile

Then point Continue at qwen-coder-review instead of qwen2.5-coder:7b. The lower temperature makes reviews more consistent from run to run and less prone to speculative findings, though it does not stop the model from inventing issues entirely.
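The same settings can also be sent per request, which is handy for the pre-commit script. Ollama's /api/generate accepts an options object whose values override the model's defaults for that call. A minimal sketch, assuming jq; review_payload is a hypothetical helper and the 16384 / 0.2 values simply mirror the Modelfile above.

```bash
# Sketch: build an /api/generate body with per-request options.
# Values under "options" override the model's Modelfile defaults for this call only.
review_payload() {
  local model="$1" diff="$2"
  jq -n --arg m "$model" --arg d "$diff" '{
    model: $m,
    prompt: ("Review this staged diff. Flag bugs and security issues. Be terse.\n\n" + $d),
    stream: false,
    options: { num_ctx: 16384, temperature: 0.2 }
  }'
}
```

Usage: curl -s http://localhost:11434/api/generate -d "$(review_payload qwen2.5-coder:7b "$(git diff --cached)")" | jq -r .response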

Keep-alive. Ollama unloads a model from memory 5 minutes after the last request. The next prompt pays the reload cost, which is 5-15 seconds. Pin it for the duration of a review session:

{
  "models": [
    {
      "title": "Qwen Coder 7B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "completionOptions": {
        "keepAlive": -1
      }
    }
  ]
}

-1 means the model stays loaded until Ollama restarts. On a 16 GB machine with a 7B model, the cost is about 5 GB of RAM at the default context size, somewhat more with num_ctx at 16384, which is usually fine.
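You can also pin or release the model outside Continue, for example from a shell before a review session. Ollama treats an /api/generate request that carries a model name but no prompt as a load request, and keep_alive controls how long the model stays: -1 keeps it loaded, 0 unloads it right away. keep_alive_payload, pin_model, unpin_model, and OLLAMA_URL are names made up for this sketch.

```bash
# Sketch: load or unload a model via /api/generate without sending a prompt.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Model tags contain no quotes, so plain printf produces valid JSON here.
keep_alive_payload() {
  local model="$1" keep="$2"
  printf '{"model":"%s","keep_alive":%s}' "$model" "$keep"
}

# Keep the model resident until the server restarts.
pin_model() {
  curl -sf "$OLLAMA_URL/api/generate" -d "$(keep_alive_payload "$1" -1)" >/dev/null
}

# Unload the model immediately to free its memory.
unpin_model() {
  curl -sf "$OLLAMA_URL/api/generate" -d "$(keep_alive_payload "$1" 0)" >/dev/null
}
```

Usage: pin_model qwen2.5-coder:7b, then ollama ps to confirm it is loaded. To change the default for every model instead, start the server with OLLAMA_KEEP_ALIVE=-1 ollama serve.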

When this is not the right tool

A local 7B model is not a frontier model. It misses things. Use it for a first pass and still get a human to look at anything that touches auth, payments, or data migration. For a 14B model on a beefy machine, the gap narrows. For deeper review of large monorepos, run an agent loop with hermes-agent or claude-code pointed at the same repo.

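If you need a model that is stronger than what fits on your laptop, the same config file accepts hosted providers. Change provider from ollama to anthropic, set model to an Anthropic model name, and add an apiKey; the rest of the Continue setup stays the same. Keep in mind that your diffs then leave the machine again, so check the client's rules first.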

What to do next

  • Try qwen2.5-coder:14b if you have 16 GB of RAM. The jump in review quality is noticeable.
  • Wire the same Ollama server to a hermes-agent or claude-code agent loop for automated refactors.
  • Add a .continue/rules/ file for each project with project-specific review heuristics.
  • If your team uses JetBrains, the same config works; Continue supports both editors.
