Ollama Service — Full Configuration & Performance Manual

Learn how to install and configure Ollama for optimal performance. This guide covers setup, model storage, environment tuning, and key tips to keep your AI workflows efficient and stable.

1. Goals & scope

This manual gives you everything needed to:

  • Move Ollama model storage to another disk or path.
  • Apply persistent environment variables on Windows, macOS, and Linux.
  • Configure Ollama to best use dual GPUs and many CPU cores (NUMA-aware).
  • Run multiple Ollama servers (one per GPU) for best throughput.
  • Safely edit systemd service overrides and avoid common errors.
  • Monitor, validate and troubleshoot after changes.

Where I cite a command or behavior, the source is the official docs or a well-documented community thread. (docs.ollama.com)

2. Overview: What configuration controls Ollama behavior

Key knobs you will use:

  • OLLAMA_MODELS — tells Ollama where to download / store models. Change this to move storage off a small root disk.
  • OLLAMA_VISIBLE_DEVICES — restricts which GPU indices Ollama sees (or set to all / omit to use all).
  • OLLAMA_SCHED_SPREAD=1 — instructs Ollama to try to spread model layers across GPUs (if possible). Useful when you want explicit spreading even if a model fits on a single GPU; not always faster due to PCIe traffic.
  • OLLAMA_NUM_THREAD — optional direct control for CPU threads. Usually auto-detected, but can be set for tuning.
  • CUDA_VISIBLE_DEVICES — lower-level CUDA control if you run multiple instances (common pattern when you run one service instance per GPU).
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY — a lower-level setting/hack that allows spilling GPU allocations to host memory (can help avoid OOM but has performance tradeoffs).

You’ll apply these as environment variables for the Ollama process (via system environment variables on Windows, launchctl / plist on macOS, or systemd override file on Linux) and restart Ollama. Official docs recommend OLLAMA_MODELS for relocating models and prefer systemd override for Linux service customization.

3. Change model directory (OLLAMA_MODELS) — step-by-step

Why: move models to a big disk (e.g. NVMe or a mounted volume) and avoid filling the root filesystem.

3.1 Windows (GUI method)

  1. Quit Ollama: right-click tray icon → Quit.
  2. Open System Environment Variables: Start → type environment variables → choose Edit the system environment variables.
  3. Click Environment Variables… under System Properties.
  4. In System variables, click New… (or select existing OLLAMA_MODELS and Edit).
    • Name: OLLAMA_MODELS
    • Value: e.g. D:\OllamaModels
  5. OK → close dialogs. Restart the machine, or log out and back in, so services and already-running apps pick up the change.
  6. Start Ollama from Start Menu. Validate that models download to the new path.

3.2 macOS (temporary / session)

  • In Terminal run:
launchctl setenv OLLAMA_MODELS /path/to/new/location
  • Quit Ollama app fully and relaunch for the app process to see the new environment. If you want persistence across reboots, add the env var to a launchd plist that loads at boot (or create a small wrapper script to export and launch Ollama). Official docs show launchctl for session vars. (docs.ollama.com)
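
If you go the wrapper-script route, a minimal sketch follows; it runs the server directly in the foreground (rather than the menu-bar app) so it inherits the variable. The models path is an example; adjust it for your machine.

#!/bin/bash
# Export the models path, then run the server so it inherits the variable.
# /Volumes/BigDisk/ollama-models is an example path; adjust for your setup.
export OLLAMA_MODELS=/Volumes/BigDisk/ollama-models
exec ollama serve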

3.3 Linux (systemd service)

  • Preferred: add the env var in a systemd override so the service process sees it:
sudo systemctl edit ollama.service

This opens an editor. Add:

[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"

Save and exit. Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama
  • Ensure the ollama system user (if used) owns the directory:
sudo chown -R ollama:ollama /mnt/BigData/ollama-models
sudo chmod -R u+rwX,g-rwx,o-rwx /mnt/BigData/ollama-models   # tighten perms

Notes: Some users have reported OLLAMA_MODELS being ignored when the service was not restarted, or when ownership/permission issues blocked writes; check the logs if models still download to the default location.
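
A quick way to rule out permission problems (assuming the ollama system user and the example path above):

# Show owner, group and mode of the models directory
stat -c '%U:%G %a %n' /mnt/BigData/ollama-models
# Check that the service user can actually write there
sudo -u ollama test -w /mnt/BigData/ollama-models && echo writable || echo "NOT writable"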

4. systemd overrides: persistent, safe pattern

Use sudo systemctl edit ollama.service to create a drop-in override file (/etc/systemd/system/ollama.service.d/override.conf). This is update-safe and preferred to editing packaged unit files.

Example override adding multiple variables:

[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_THREAD=32"
# If using numactl (see below), you'll need to clear ExecStart first

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Always check systemctl status ollama and journalctl -u ollama -b for startup errors.
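
To confirm the override was actually applied, inspect the merged unit and the environment systemd will pass to the process:

# Show the unit file plus all drop-in overrides, as systemd merges them
systemctl cat ollama.service
# Show the Environment= values the service will receive
systemctl show ollama -p Environment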

5. GPU configuration & multi-GPU strategies

Basic behavior

  • Ollama auto-detects and uses GPUs by default. If you set no device-restricting env var, Ollama will try to use all GPUs it can. To restrict it explicitly, set OLLAMA_VISIBLE_DEVICES=0,1 (or all). Setting it to -1 is not a valid device list; it hides all GPUs and causes a fallback to CPU. (docs.ollama.com)

Two main strategies

  1. Model parallelism (single Ollama process spreads layers across GPUs)
    • Useful for huge models that don't fit on one GPU. Spreading happens by default when a model does not fit on one GPU; set OLLAMA_SCHED_SPREAD=1 to force it even when a model would fit on a single GPU. Be aware: spreading can create extra PCIe traffic and sometimes slows inference, depending on the model, the interconnect (PCIe vs NVLink) and the balance of VRAM/compute. (GitHub)
  2. Data parallelism (one full model per GPU; multiple Ollama instances)
    • Run one Ollama server per GPU, each bound to a specific GPU using CUDA_VISIBLE_DEVICES or OLLAMA_VISIBLE_DEVICES, and use a load balancer to distribute requests across instances. This often gives better throughput for concurrent requests (replicated models avoid layer-split overhead). Example shell snippet in Section 11.

Env vars to know

  • OLLAMA_VISIBLE_DEVICES — list of GPU indices or all.
  • OLLAMA_SCHED_SPREAD=1 — force spread. (Use with care).
  • CUDA_VISIBLE_DEVICES — common alternative when running multiple OS processes.
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY — allow GPU memory spill to host RAM (useful to avoid OOM but with perf cost).

Practical tips

  • If you have NVLink between the GPUs, model parallelism is penalized less and cross-GPU performance is better. If the GPUs are connected only over PCIe, try running separate instances per GPU instead.
  • Always check nvidia-smi to confirm both GPUs are used and where the memory is allocated during model load/inference.
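
For a more compact view than the full nvidia-smi screen, a query such as the one below prints just the fields that matter here (available field names can be listed with nvidia-smi --help-query-gpu):

# Refresh per-GPU memory and utilization every 2 seconds
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv -l 2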

6. CPU / NUMA tuning for multi-CPU servers

Large multi-socket servers need NUMA-aware tuning.

Typical tools / approaches

  • numactl --interleave=all <command> — interleave memory allocations across NUMA nodes; often improves throughput for multi-threaded processes that access memory across sockets. This is a well-known approach for NUMA systems. Use it when you observe uneven CPU or memory access across sockets.
  • You can inject numactl into systemd ExecStart (see Section 8 for how to safely replace ExecStart).

Example: Numactl with Ollama

If you want the Ollama service to run with interleaved memory:

  1. Edit override: sudo systemctl edit ollama.service
  2. Add:
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/local/bin/ollama serve
  3. Reload & restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama

If you only want to pin to specific NUMA nodes for experiments, use --cpunodebind and --membind options of numactl.
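
For example, a sketch that binds Ollama to a single NUMA node (node indices depend on your topology; inspect it first):

# Show NUMA nodes, their CPUs and memory sizes
numactl --hardware
# Run Ollama pinned to node 0's CPUs and memory (for experiments; adjust the node index)
numactl --cpunodebind=0 --membind=0 /usr/bin/ollama serve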

Thread counts

  • OLLAMA_NUM_THREAD can be used if you must limit or set explicit CPU threads. Default behavior is usually good; only change after measurement.

Warnings: incorrect CPU pinning can reduce throughput; benchmark and measure (htop, perf, or numastat) after changes. See Ollama and community reports about NUMA issues — sometimes fixes originate in llama.cpp and Ollama follows upstream.

7. Running multiple Ollama instances (best throughput for many concurrent clients)

When you need maximum throughput for many simultaneous requests, prefer one server instance per GPU (data parallelism).

Pattern

  • Start N Ollama servers, each with environment restricting it to a single GPU and a distinct port.
  • Put a load balancer (Nginx, HAProxy, or app-level) in front.

Example script (Linux)

#!/bin/bash
# Start Ollama instance per GPU (example 2 GPUs)
# Adjust paths and absolute ollama binary path if needed.

# Instance 1: GPU 0, port 11434
# The listen port comes from OLLAMA_HOST; ollama serve takes no --port flag.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance 2: GPU 1, port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

Then configure your proxy to round-robin requests to 127.0.0.1:11434 and 127.0.0.1:11435.
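
Before putting the proxy in front, it's worth confirming each instance answers on its own port; the /api/tags endpoint simply lists the models available locally:

# Quick smoke test: each instance should respond with its local model list
curl -s http://127.0.0.1:11434/api/tags
curl -s http://127.0.0.1:11435/api/tags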

Note: If you manage Ollama as systemd services for each instance, create separate unit files with distinct OLLAMA_HOST and Environment="CUDA_VISIBLE_DEVICES=..." entries.
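
A minimal sketch of such a per-instance unit (the file name, user and binary path are assumptions; e.g. /etc/systemd/system/ollama-gpu0.service):

[Unit]
Description=Ollama instance bound to GPU 0
After=network-online.target

[Service]
User=ollama
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="OLLAMA_HOST=127.0.0.1:11434"
ExecStart=/usr/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now ollama-gpu0, then create a second unit with GPU 1 and port 11435 for the other instance.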

8. ExecStart errors & how to fix them

Error observed: Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing

Cause: When you create a systemd override and add a new ExecStart= without clearing the original, systemd appends the new ExecStart to the existing one. Multiple ExecStart lines are valid only for Type=oneshot. To replace the service start command you must clear the existing ExecStart first.

Correct override example:

[Service]
# Clear original ExecStart
ExecStart=
# New ExecStart with numactl
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/ollama serve
# Env vars
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"

Then run:

sudo systemctl daemon-reload
sudo systemctl restart ollama

If you still get errors:

  • Make sure you edited the system service (sudo systemctl edit ollama.service) and not a user service.
  • Confirm no other drop-in overrides exist that re-add ExecStart (check /etc/systemd/system/ollama.service.d/ and systemctl cat ollama.service).
  • Revert to packaged file if needed and retest (sudo systemctl revert ollama.service), then reapply a clean override.

9. Permissions, SELinux/AppArmor & filesystem notes

  • On Linux installs, the ollama system user must have read/write access to OLLAMA_MODELS. If you installed via the system installer or a package, Ollama typically runs as a dedicated user; chown the directory accordingly:
    sudo chown -R ollama:ollama /path/to/OLLAMA_MODELS
  • If using external mount (NFS, CIFS), watch out for permission mapping and performance implications. Prefer local NVMe/SSD for fastest model load.
  • SELinux/AppArmor: if enabled, ensure policy allows Ollama to access the configured directory and to use GPUs (NVIDIA’s drivers often require specific permissions). On Ubuntu Server, AppArmor profiles may block unusual paths — test with logs and temporarily set permissive mode when debugging.
  • Symbolic links are an alternative: you can symlink the default .ollama/models directory to your big disk. This is a quick workaround but less explicit than OLLAMA_MODELS.
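
A sketch of the symlink workaround (assuming the default Linux service install, where models live under /usr/share/ollama/.ollama/models; verify the path on your system first):

sudo systemctl stop ollama
# Move the existing store to the big disk. The target must not already exist,
# or mv will nest the directory inside it.
sudo mv /usr/share/ollama/.ollama/models /mnt/BigData/ollama-models
sudo ln -s /mnt/BigData/ollama-models /usr/share/ollama/.ollama/models
sudo chown -R ollama:ollama /mnt/BigData/ollama-models
sudo systemctl start ollama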

10. Monitoring, verification & troubleshooting checklist

After any change, perform these checks:

  1. Service status: systemctl status ollama — confirm active/running and note any errors.
  2. Journal logs: journalctl -u ollama -b --no-pager | tail -n 200 — scan for permission, exec, or GPU errors.
  3. GPU usage: nvidia-smi -l 2 — watch memory and utilization while loading and querying a model. Confirm both GPUs show activity if expected.
  4. CPU usage: htop or top — confirm threads and load distribution; observe NUMA node imbalance with numastat.
  5. Model files location: ls -lh /path/to/OLLAMA_MODELS — ensure model files are present and being written.
  6. Permissions: stat /path/to/OLLAMA_MODELS and ps aux | grep ollama to check that process user matches directory ownership.
  7. Test queries: run a simple ollama run <model> "hello" (the prompt is a positional argument; there is no --prompt flag) and measure latency; see the timing snippet after this checklist. Observe whether running instances saturate GPU or CPU as expected.
  8. If GPUs missing: check driver status, nvidia modules, dmesg for GPU errors. Some users observed GPU timeouts requiring driver or reinstall troubleshooting.
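
For step 7, one low-effort way to get latency and throughput numbers is ollama run with --verbose, which prints load and eval timing after the response (llama3 is just an example model name):

# --verbose reports load duration and eval tokens/sec after the response
ollama run llama3 --verbose "hello"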

11. Example configs & copy-paste snippets

Minimal systemd override (models + spread + devices)

Run sudo systemctl edit ollama.service and paste:

[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_THREAD=32"
ExecStart=
ExecStart=/usr/bin/ollama serve

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

With numactl (multi-CPU)

[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/ollama serve

Run two instances one-per-GPU (shell)

(Useful for data-parallel throughput)

# GPU 0 port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# GPU 1 port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

Quick reference: key environment variables

  • OLLAMA_MODELS: path to the models directory. Suggested value: /mnt/BigData/ollama-models (set it and chown to ollama). (docs.ollama.com)
  • OLLAMA_VISIBLE_DEVICES: GPU indices Ollama should see. 0,1, or omit to use all.
  • OLLAMA_SCHED_SPREAD: force model layer spreading. Set 1 to force; use only if you want layer-split behavior (can reduce performance). (GitHub)
  • OLLAMA_NUM_THREAD: CPU threads for CPU inference. 32 is an example; usually leave unset and measure.
  • CUDA_VISIBLE_DEVICES: per-process GPU restriction. Useful when running multiple instances.
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY: allow GPU allocations to spill to host RAM. Set 1 to enable (use caution; may be slower).

Final checklist before you go live

  • Ensure OLLAMA_MODELS points to a local fast disk (not root if low space).
  • chown -R ollama:ollama /path and set tight perms.
  • Choose multi-GPU strategy (model parallel vs multiple instances) and test both.
  • If using numactl, edit systemd ExecStart by clearing original ExecStart first to avoid the more than one ExecStart= error.
  • Monitor with nvidia-smi, htop, and journalctl and measure latency/throughput.
  • If odd behavior occurs, check community issues for similar reports (Ollama GH and Reddit are active).

Sources & further reading

  • Official Ollama FAQ & Linux docs (how to set OLLAMA_MODELS, customizing via systemd). (docs.ollama.com)
  • Ollama GitHub issues discussing OLLAMA_SCHED_SPREAD, GPU spreading and troubles. (GitHub)
  • Systemd guidance for clearing ExecStart= when overriding service ExecStart lines. (Ask Ubuntu)
  • numactl --interleave=all explanation and NUMA tips. (Stack Overflow)
