1. Goals & scope
This manual gives you everything needed to:
- Move Ollama model storage to another disk or path.
- Apply persistent environment variables on Windows, macOS, and Linux.
- Configure Ollama to best use dual GPUs and many CPU cores (NUMA-aware).
- Run multiple Ollama servers (one per GPU) for best throughput.
- Safely edit systemd service overrides and avoid common errors.
- Monitor, validate and troubleshoot after changes.
Where I cite a command or behavior, the source is the official docs or a well-documented community thread. (docs.ollama.com)
2. Overview: What configuration controls Ollama behavior
Key knobs you will use:
- OLLAMA_MODELS — tells Ollama where to download and store models. Change this to move storage off a small root disk.
- OLLAMA_VISIBLE_DEVICES — restricts which GPU indices Ollama sees (set to all or omit to use all).
- OLLAMA_SCHED_SPREAD=1 — instructs Ollama to try to spread model layers across GPUs (if possible). Useful when you want explicit spreading even if a model fits on a single GPU; not always faster due to PCIe traffic.
- OLLAMA_NUM_THREAD — optional direct control over CPU threads. Usually auto-detected, but can be set for tuning.
- CUDA_VISIBLE_DEVICES — lower-level CUDA control if you run multiple instances (a common pattern when you run one service instance per GPU).
- GGML_CUDA_ENABLE_UNIFIED_MEMORY — a lower-level setting/hack that allows spilling GPU allocations to host memory (can help avoid OOM but has performance tradeoffs).
You’ll apply these as environment variables for the Ollama process (via system environment variables on Windows, launchctl / a plist on macOS, or a systemd override file on Linux) and then restart Ollama. Official docs recommend OLLAMA_MODELS for relocating models and a systemd override for Linux service customization.
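On Linux, you can confirm which of these variables the running server actually inherited by dumping the process environment (a quick check that assumes a single ollama process):
sudo cat /proc/"$(pgrep -xo ollama)"/environ | tr '\0' '\n' | grep -E '^(OLLAMA|CUDA|GGML)'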
3. Change model directory (OLLAMA_MODELS) — step-by-step
Why: move models to a big disk (e.g. NVMe or a mounted volume) and avoid filling /.
3.1 Windows (GUI method)
- Quit Ollama: right-click tray icon → Quit.
- Open System Environment Variables: Start → type environment variables → Edit the system environment variables.
- Click Environment Variables… under System Properties.
- In System variables, click New… (or select an existing OLLAMA_MODELS entry and click Edit).
  - Name: OLLAMA_MODELS
  - Value: e.g. D:\OllamaModels
- OK → close dialogs. Restart the machine, or log out and back in, so services and in-process apps pick up the environment change.
- Start Ollama from Start Menu. Validate that models download to the new path.
3.2 macOS (temporary / session)
- In Terminal run:
launchctl setenv OLLAMA_MODELS /path/to/new/location
- Quit the Ollama app fully and relaunch it so the app process sees the new environment. If you want persistence across reboots, add the env var to a launchd plist that loads at boot, or create a small wrapper script that exports the variable and launches Ollama (see the sketch below). Official docs show launchctl for session variables. (docs.ollama.com)
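If you go the wrapper-script route, a minimal sketch looks like this; the volume path is an assumption for your disk layout, and if you use the menu-bar app rather than ollama serve, a launchd plist with an EnvironmentVariables block is the more reliable way to make the variable persistent:
#!/bin/bash
# start-ollama.sh: export the models path, then start the server in the foreground.
# The volume path below is an assumption; adjust to your disk layout.
export OLLAMA_MODELS="/Volumes/BigDisk/ollama-models"
exec ollama serve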
3.3 Linux (systemd / recommended)
- Preferred: add env var in systemd override so the service process sees it.
sudo systemctl edit ollama.service
This opens an editor. Add:
[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Save and exit. Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
- Ensure the ollama system user (if used) owns the directory:
sudo chown -R ollama:ollama /mnt/BigData/ollama-models
sudo chmod -R u+rwX,g-rwx,o-rwx /mnt/BigData/ollama-models # tighten perms
Notes: Some users reported OLLAMA_MODELS not being respected when the service was not restarted or when ownership/permissions blocked writes — check the logs if models still download to the default location (a quick verification follows below).
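To verify the new location quickly (the model tag below is just a small example):
ollama pull llama3.2:1b               # any small model works; this tag is an example
ls -lh /mnt/BigData/ollama-models     # expect new files (typically blobs/ and manifests/) to appear here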
4. systemd overrides: persistent, safe pattern
Use sudo systemctl edit ollama.service to create a drop-in override file (/etc/systemd/system/ollama.service.d/override.conf). This is update-safe and preferred to editing packaged unit files.
Example override adding multiple variables:
[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_THREAD=32"
# If using numactl (see below), you'll need to clear ExecStart first
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Always check systemctl status ollama and journalctl -u ollama -b for startup errors.
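To confirm the drop-in actually applied, inspect the merged unit and the environment systemd will hand to the process:
systemctl cat ollama.service                     # packaged unit plus any drop-in overrides
systemctl show ollama --property=Environment     # variables the service manager will set
journalctl -u ollama -f                          # follow logs live while you restart in another shell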
5. GPU configuration & multi-GPU strategies
Basic behavior
- Ollama auto-detects and uses GPUs by default. If you set no device-restricting env var, Ollama will try to use all GPUs it can. To restrict explicitly, use OLLAMA_VISIBLE_DEVICES=0,1 or all. Setting -1 is not valid and will hide GPUs (causing CPU fallback). (docs.ollama.com)
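To see which GPUs the server actually discovered, scan the startup log; the exact log wording varies between Ollama versions, so the pattern below is a broad filter rather than an official format:
journalctl -u ollama -b --no-pager | grep -iE "cuda|gpu|vram" | head -n 40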
Two main strategies
- Model parallelism (a single Ollama process spreads layers across GPUs)
  - Useful for huge models that don't fit on one GPU. Controlled by the default behavior and OLLAMA_SCHED_SPREAD=1 to force spreading even when a model might fit on a single GPU. Be aware: spreading can create extra PCIe traffic and sometimes slows inference, depending on the model, the interconnect (PCIe vs NVLink), and the balance of VRAM/compute. (GitHub)
- Data parallelism (one full model per GPU — multiple Ollama instances)
  - Run one Ollama server per GPU, each bound to a specific GPU using CUDA_VISIBLE_DEVICES or OLLAMA_VISIBLE_DEVICES, and use a load balancer to distribute requests across instances. This often gives better throughput for concurrent requests (replicated models avoid layer-split overhead). Example shell snippet below in Section 11.
Env vars to know
- OLLAMA_VISIBLE_DEVICES — list of GPU indices, or all.
- OLLAMA_SCHED_SPREAD=1 — force spreading (use with care).
- CUDA_VISIBLE_DEVICES — common alternative when running multiple OS processes.
- GGML_CUDA_ENABLE_UNIFIED_MEMORY — allow GPU memory to spill to host RAM (useful to avoid OOM, but with a performance cost).
Practical tips
- If you have NVLink between GPUs, model parallelism is less penalized — you’ll see better cross-GPU performance. If only PCIe, try running separate instances per GPU for better perf.
- Always check nvidia-smi to confirm both GPUs are used and where memory is allocated during model load and inference (see the example below).
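For example, to watch per-GPU memory and utilization refresh every two seconds while a model loads:
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv -l 2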
6. CPU / NUMA tuning for multi-CPU servers
Large multi-socket servers need NUMA-aware tuning.
Typical tools / approaches
- numactl --interleave=all <command> — interleave memory allocations across NUMA nodes; this often improves throughput for multi-threaded processes that access memory across sockets and is a well-known approach for NUMA systems. Use it when you observe uneven CPU or memory access across sockets.
- You can inject numactl into the systemd ExecStart (see Section 8 for how to safely replace ExecStart).
Example: Numactl with Ollama
If you want the Ollama service to run with interleaved memory:
- Edit the override: sudo systemctl edit ollama.service
- Add:
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/local/bin/ollama serve
- Reload & restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
If you only want to pin to specific NUMA nodes for experiments, use --cpunodebind and --membind options of numactl.
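For example, a node-0 pinning experiment might look like this (node numbers and the binary path are assumptions for your hardware; run it in the foreground while the service is stopped):
sudo systemctl stop ollama
numactl --cpunodebind=0 --membind=0 /usr/bin/ollama serve
# In another shell, check per-NUMA-node memory usage of the process
numastat -p "$(pgrep -x ollama)"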
Thread counts
OLLAMA_NUM_THREAD can be used if you must limit or set explicit CPU threads. The default behavior is usually good; only change it after measurement.
Warnings: incorrect CPU pinning can reduce throughput; benchmark and measure (htop, perf, or numastat) after changes. See Ollama and community reports about NUMA issues — sometimes fixes originate in llama.cpp and Ollama follows upstream.
7. Running multiple Ollama instances (best throughput for many concurrent clients)
When you need maximum throughput for many simultaneous requests, prefer one server instance per GPU (data parallelism).
Pattern
- Start N Ollama servers, each with an environment restricting it to a single GPU and a distinct port.
- Put a load balancer (Nginx, HAProxy, or app-level) in front.
Example script (Linux)
#!/bin/bash
# Start Ollama instance per GPU (example 2 GPUs)
# Adjust paths and absolute ollama binary path if needed.
# Instance 1: GPU 0, port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
# Instance 2: GPU 1, port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
Then configure your proxy to round-robin requests to 127.0.0.1:11434 and 127.0.0.1:11435.
Note: If you manage Ollama as systemd services for each instance, create separate unit files with distinct OLLAMA_HOST and Environment="CUDA_VISIBLE_DEVICES=..." entries.
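A minimal sketch of one such unit for GPU 0 (the unit name, user, and binary path are assumptions; repeat with a different device index and port per GPU):
# /etc/systemd/system/ollama-gpu0.service  (sketch; adjust user, paths, and ports)
[Unit]
Description=Ollama instance bound to GPU 0
After=network-online.target

[Service]
User=ollama
Group=ollama
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="OLLAMA_HOST=127.0.0.1:11434"
ExecStart=/usr/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target
Enable each instance with sudo systemctl enable --now ollama-gpu0.service and point your load balancer at the distinct ports.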
8. ExecStart errors & how to fix them
Error observed: Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing
Cause: When you create a systemd override and add a new ExecStart= without clearing the original, systemd appends the new ExecStart to the existing one. Multiple ExecStart lines are valid only for Type=oneshot. To replace the service start command you must clear the existing ExecStart first.
Correct override example:
[Service]
# Clear original ExecStart
ExecStart=
# New ExecStart with numactl
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/ollama serve
# Env vars
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"
Then run:
sudo systemctl daemon-reload
sudo systemctl restart ollama
If you still get errors:
- Make sure you edited the system service (sudo systemctl edit ollama.service) and not a user service.
- Confirm no other drop-in overrides exist that re-add ExecStart (check /etc/systemd/system/ollama.service.d/ and systemctl cat ollama.service; the commands below show one quick way).
- Revert to the packaged file if needed and retest (sudo systemctl revert ollama.service), then reapply a clean override.
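For example, to spot a duplicated ExecStart and any stray drop-in files:
ls -l /etc/systemd/system/ollama.service.d/
systemctl cat ollama.service | grep -n "^ExecStart="
# If the override is beyond repair, go back to the packaged unit and start over:
sudo systemctl revert ollama.service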
9. Permissions, SELinux/AppArmor & filesystem notes
- The ollama system user must have read/write access to OLLAMA_MODELS on Linux installs. If you installed via a system package, Ollama often runs as a dedicated user — chown the directory accordingly: sudo chown -R ollama:ollama /path/to/OLLAMA_MODELS.
- SELinux/AppArmor: if enabled, ensure policy allows Ollama to access the configured directory and to use GPUs (NVIDIA’s drivers often require specific permissions). On Ubuntu Server, AppArmor profiles may block unusual paths — test with logs and temporarily set permissive mode when debugging.
- Symbolic links are an alternative: you can symlink the default .ollama/models directory to your big disk (see the sketch below). This is a quick workaround but less explicit than OLLAMA_MODELS.
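A sketch of that workaround on a Linux package install, assuming the service user's default store lives at /usr/share/ollama/.ollama/models (verify the actual default on your system before moving anything):
sudo systemctl stop ollama
sudo mv /usr/share/ollama/.ollama/models /mnt/BigData/ollama-models
sudo ln -s /mnt/BigData/ollama-models /usr/share/ollama/.ollama/models
sudo chown -R ollama:ollama /mnt/BigData/ollama-models
sudo systemctl start ollama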
10. Monitoring, verification & troubleshooting checklist
After any change, perform these checks:
- Service status: systemctl status ollama — confirm active/running and note any errors.
- Journal logs: journalctl -u ollama -b --no-pager | tail -n 200 — scan for permission, exec, or GPU errors.
- GPU usage: nvidia-smi -l 2 — watch memory and utilization while loading and querying a model. Confirm both GPUs show activity if expected.
- CPU usage: htop or top — confirm thread and load distribution; observe NUMA node imbalance with numastat.
- Model files location: ls -lh /path/to/OLLAMA_MODELS — ensure model files are present and being written.
- Permissions: stat /path/to/OLLAMA_MODELS and ps aux | grep ollama to check that the process user matches the directory ownership.
- Test queries: run a simple ollama run <model> "hello" and measure latency. Observe whether running instances saturate GPU or CPU as expected.
- If GPUs are missing: check driver status, nvidia kernel modules, and dmesg for GPU errors. Some users observed GPU timeouts requiring driver or reinstall troubleshooting.
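If you prefer these checks bundled in one place, here is a small sketch of a sanity script; the models path and model tag are placeholders to adjust:
#!/bin/bash
# ollama-sanity.sh: quick post-change checks (sketch; adjust MODELS_DIR and MODEL)
MODELS_DIR=/mnt/BigData/ollama-models
MODEL=llama3.2:1b

systemctl is-active ollama                                   # expect "active"
journalctl -u ollama -b --no-pager | tail -n 50              # recent startup log
nvidia-smi --query-gpu=index,memory.used --format=csv        # current GPU memory
ls -lh "$MODELS_DIR" | head                                   # model store contents
time ollama run "$MODEL" "hello"                              # rough latency check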
11. Example configs & copy-paste snippets
Minimal systemd override (models + spread + devices)
Run sudo systemctl edit ollama.service and paste:
[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_THREAD=32"
ExecStart=
ExecStart=/usr/bin/ollama serve
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
With numactl (multi-CPU)
[Service]
Environment="OLLAMA_MODELS=/mnt/BigData/ollama-models"
Environment="OLLAMA_SCHED_SPREAD=1"
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/ollama serve
Run two instances one-per-GPU (shell)
(Useful for data-parallel throughput)
# GPU 0 port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
# GPU 1 port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
12. Appendix: env var summary & recommended defaults
| Variable | Purpose | Suggested value / note |
|---|---|---|
| OLLAMA_MODELS | Path to models directory | /mnt/BigData/ollama-models (set and chown to ollama) (docs.ollama.com) |
| OLLAMA_VISIBLE_DEVICES | GPU indices Ollama should see | 0,1 or omit to use all |
| OLLAMA_SCHED_SPREAD | Force model layer spreading | 1 to force; use only if you want layer-split behavior (can reduce perf) (GitHub) |
| OLLAMA_NUM_THREAD | CPU threads for CPU inference | 32 (example) — usually leave unset and measure |
| CUDA_VISIBLE_DEVICES | Per-process GPU restriction | Useful when running multiple instances |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Allow GPU->host spill | 1 to enable (use caution; may be slower) |
Final checklist before you go live
- Ensure OLLAMA_MODELS points to a fast local disk (not the root disk if space is low); chown -R ollama:ollama /path and set tight permissions.
- Choose a multi-GPU strategy (model parallel vs multiple instances) and test both.
- If using numactl, edit the systemd ExecStart by clearing the original ExecStart first to avoid the "more than one ExecStart=" error.
- Monitor with nvidia-smi, htop, and journalctl, and measure latency/throughput.
- If odd behavior occurs, check community issues for similar reports (Ollama GitHub and Reddit are active).
Sources & further reading
- Official Ollama FAQ & Linux docs (how to set OLLAMA_MODELS, customizing via systemd). (docs.ollama.com)
- Ollama GitHub issues discussing OLLAMA_SCHED_SPREAD, GPU spreading and related troubles. (GitHub)
- systemd guidance on clearing ExecStart= when overriding a service's ExecStart lines. (Ask Ubuntu)
- numactl --interleave=all explanation and NUMA tips. (Stack Overflow)