
From silence they stir,
words birthing thought,
numbers shaping form—
a spark in the dark
that dreams into being.
I was eager to run some large language models like Qwen3 and DeepSeek‑R1 on my AMD GPU (Radeon RX 7900 XTX) using Endeavour OS. I had just bought a new system for just this purpose, and had already been playing around on my mac and my nucs with smaller models for some time.
Update March 2025
Flash Attention is now built into Ollama and just needs OLLAMA_FLASH_ATTENTION=1 in your service file. No separate addon or build step required. This alone can slash inference time by up to 75 % and free ~5 GB of VRAM. It used to be a pain on AMD (while Nvidia had it out of the box), but as of recent Ollama versions it works natively on ROCm too.
Update August 2025 — gpt-oss:20b Open‑source 20B model released on August 5, 2025, and it’s pretty optimized for Ollama. On my RX 7900 XTX I get around 107 tokens/s on moderate prompts. It follows instructions well, but like other smaller models it can hallucinate often (but its very good for automations).
So What Is Qwen3?
Qwen3 is the at the time of writing latest in the Qwen series of LLMs, released in May 2025. It comes in sizes from 8 billion up to 235 billion parameters, with both dense and Mixture-of-Experts (MoE) variants. Its key strengths are:
- Massive, Diverse Pre-training – Trained on 36 trillion high-quality tokens across 119 languages, including code, STEM, reasoning, and synthetic data.
- Mixture-of-Experts Architecture – Each MoE model uses many specialized “expert” sub-networks; inputs are routed dynamically, so only a subset of experts activate for any token, boosting efficiency without sacrificing capacity.
- Hybrid “Thinking” vs. “Direct” Modes – In thinking mode, Qwen3 can perform chain-of-thought reasoning with settings like temperature 0.6 and top-p 0.95 for clear step-by-step answers. In direct mode, it uses temperature 0.7 and top-p 0.8 for swift, straightforward responses.
How Qwen3 Works
Qwen3’s MoE models contain hundreds of experts, but at inference time, a gating network routes each token through only a small fraction (e.g., two experts). This design gives you:
- Scalable Capacity: Large parameter counts with lower FLOPs per token.
- Specialized Reasoning: Experts can specialize in tasks like math, code, or language.
- Cost Efficiency: Run bigger models on less hardware, and that is what Im using it for.
So how do we use it?
We use it by installing Ollama on Arch. Running Ollama on an AMD GPU requires a bit of setup since most of the world defaults to NVIDIA.
1. Install Ollama and ROCm: The easiest way to install Ollama is with the official install script. It detects your AMD GPU automatically and downloads the ROCm runner libraries alongside the base binary:
curl -fsSL https://ollama.com/install.sh | sh
This creates the ollama user, sets up a basic systemd service, and pulls the ROCm libraries into /usr/lib/ollama/rocm. If you prefer more control (pinning a specific version, cleaning stale GPU backends, health-checking the API after upgrade), you can write a custom upgrade script that pulls release archives directly from GitHub and extracts them yourself.
Also make sure you have the AMD drivers and ROCm libraries installed (ROCm is AMD’s equivalent of CUDA). Tools like rocminfo and amd-smi should detect your GPU (the new tool; replaces rocm-smi).
sudo pacman -S amdsmi
2. Create/Modify the Ollama systemd Service: The official install script creates a bare-bones service file. We’ll replace it with a tuned version that explicitly selects the ROCm backend, enables Flash Attention, quantizes the KV cache, and optionally pins CPU cores. Edit /etc/systemd/system/ollama.service:
[Unit]
Description=Ollama AI Service
After=network.target network-online.target
Wants=network-online.target
Before=syslog.target
[Service]
Type=exec
# --- GPU selection / ROCm ---
Environment=OLLAMA_LLM_LIBRARY=rocm
Environment=OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm
# use *your* rocminfo UUID — find via:
# rocminfo | grep -A2 "gfx11" | grep -E "Uuid|Marketing"
Environment=ROCR_VISIBLE_DEVICES=GPU-xxxxxxxxxxxx
Environment="LD_LIBRARY_PATH=/usr/lib/ollama/rocm:/usr/lib/ollama:/opt/rocm/lib:/opt/rocm/lib64"
# --- Ollama engine knobs ---
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_KV_CACHE_TYPE=q8_0
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=1
Environment=OLLAMA_MAX_QUEUE=128
Environment=OLLAMA_KEEP_ALIVE=10m
Environment=OLLAMA_MODELS=/var/lib/ollama
Environment=TMPDIR=/var/tmp/ollama
# Reserve ~512 MB VRAM headroom per GPU for ROCm overhead
Environment=OLLAMA_GPU_OVERHEAD=536870912
# --- Net / misc ---
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=HOME=/var/lib/ollama
Environment=OLLAMA_DEBUG=0
# --- CPU affinity (For my 7950X3D: V-Cache CCD — adjust to your CPU) ---
Environment=OPENBLAS_NUM_THREADS=16
Environment=OMP_NUM_THREADS=16
User=ollama
Group=ollama
# GPU workloads benefit from unlimited locked memory
LimitMEMLOCK=infinity
# Optional: lock a peak perf profile at start, revert on stop
# ExecStartPre=/usr/bin/amd-smi set -g 0 -l STABLE_PEAK
# ExecStopPost=/usr/bin/amd-smi set -g 0 -l AUTO
# Optional: pin to CCD1 cores and NUMA node 0 (adjust to your CPU)
# ExecStart=/usr/bin/numactl --physcpubind=8-15,24-31 --membind=0 /usr/bin/ollama serve
ExecStart=/usr/bin/ollama serve
Restart=always
RestartSec=3
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ollama
[Install]
WantedBy=multi-user.target
A quick rundown of what a few of these do:
OLLAMA_LLM_LIBRARY=rocm– explicitly tells Ollama to use the ROCm backend. Without this, Ollama auto-detects, which can sometimes pick the wrong library on mixed setups.OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm– points Ollama to the ROCm runner libraries. If you installed via the official install script, they live under/usr/lib/ollama/rocm.LD_LIBRARY_PATH– tells the linker where to find ROCm shared libraries. Include both the Ollama-bundled path and the system ROCm path.OLLAMA_FLASH_ATTENTION=1– enables Flash Attention for faster, more memory-efficient inference. Now built into Ollama, no separate addon needed.ROCR_VISIBLE_DEVICES=GPU-…– selects the AMD GPU by its ROCm UUID. Find it withrocminfo | grep -A2 -E "Uuid|Marketing|gfx".OLLAMA_KV_CACHE_TYPE=q8_0– quantizes the KV cache to shrink VRAM usage with minimal quality loss; a big win on 24 GB GPUs.OLLAMA_GPU_OVERHEAD=536870912– reserves ~512 MB of VRAM headroom per GPU for ROCm runtime overhead. Prevents out-of-memory crashes when loading models close to the VRAM limit.OLLAMA_NUM_PARALLEL=1– server concurrency. Keep at 1 for lowest latency on single-user setups; raise only if you have CPU/GPU headroom.OLLAMA_MAX_LOADED_MODELS=1– limits how many models stay resident; reduces VRAM pressure and CPU RAM swapping.OLLAMA_KEEP_ALIVE=10m– how long to keep the model in memory after the last request. Lower to free VRAM sooner.TMPDIR=/var/tmp/ollama– where Ollama writes temporary files. The oldOLLAMA_TMPDIRenv var has been removed; use the standardTMPDIRinstead.OLLAMA_HOST=0.0.0.0:11434– binds the HTTP API; change if you only want local access (e.g.,127.0.0.1:11434).LimitMEMLOCK=infinity– allows Ollama to lock unlimited memory. GPU workloads pin large buffers and this prevents failures from the default locked-memory cap.ExecStartwithnumactl(optional, commented out) – pins Ollama’s threads to specific CPU cores and a NUMA node. On Ryzen X3D parts, pinning to the CCD with the larger L3 (the V-Cache CCD) improves throughput and latency. Uncomment and adjust the core list for your CPU.How to find good CPU core bindings (CCD/L3 aware):
- Inspect NUMA layout and CPU sets:
numactl -H | cat lscpu -e=CPU,NODE,CORE,ONLINE | column -t | cat - Identify which cores share the larger L3 cache (96 MB on V‑Cache CCD). Either install
hwlocand read cpusets per L3:…or check sysfs for L3 size and the CPUs that share it:sudo pacman -S --noconfirm hwloc lstopo-no-graphics | grep -A1 "L3"grep . /sys/devices/system/cpu/cpu*/cache/index3/size | sort -u grep . /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u - Build a compact CPU list from that CCD (e.g.,
8-15,24-31) and set--physcpubindto it. Use--membindto the same NUMA node (often0) so pages are allocated close to those cores.
- Inspect NUMA layout and CPU sets:
After adding these, reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
3. GPU Tuning & Memory Config: With the above, Ollama will attempt to use your RX 7900 XTX for model inference. The 7900 XTX has 24 GB of VRAM. By default, many models (especially larger ones) only offload a portion to the GPU. Just be aware that if a model doesn’t fit entirely, the rest runs on CPU. If you see your GPU usage not maxing out and CPU pegged at 100%, it might be loading only partially to VRAM. In those cases, increasing the num_gpu setting (layers on GPU) or using a smaller model/quantization can help.
A whisper becomes a world,
each symbol a seed,
growing forests of sense
from a single breath.
4. And then there is life: By this time run and see if it works ollama run qwen3:8b
In another terminal, run amd-smi or watch -n1 amd-smi. You should see the VRAM usage jump and GPU% usage when the model responds. If it stays at 0% and CPU is doing all the work, double-check the service config or logs.
With Ollama successfully using the AMD GPU, we can move on to understanding how to effectively use these large models.
Understanding Context Windows (num_ctx) and Token Limits
Large Language Models have a context window – essentially how many tokens (pieces of text) they can “remember” at once. In Ollama, this is set by the parameter num_ctx and was before 2048 but was recently updated to 4096.
- Context Window (
num_ctx): This is the maximum number of tokens the model can take into account at a time (including both the prompt and the model’s generated response). If a model hasnum_ctx = 8192, that means it can handle up to 8192 tokens of combined input+output before it runs out of memory (and starts forgetting or truncating content). - Input vs Output Tokens: We often differentiate between the prompt length and the model’s output length. Some tools use
max_tokensto mean the maximum output tokens the model will generate. In Ollama’s terms, the maximum output can be controlled bynum_predict(more on that later). But importantly, prompt tokens + output tokens ≤ num_ctx always. If you ask for more tokens than the window can hold, older context will be dropped (or the generation will stop early). - Truncation Behavior: What happens if you feed a model more tokens than it can handle? Typically, the oldest tokens get truncated – imagine the context window as a sliding window that only keeps the last
num_ctxtokens. If you have a long document input, Ollama/llama.cpp will truncate the beginning part once the limit is exceeded (since recent tokens are usually more relevant). This means if you stuff a 10,000-token text into an 8,192 context model, it will quietly ignore about 1,808 tokens from the start! The model only “sees” the last 8192 tokens in that case.
When dealing with inputs longer than the context window, you have a few options:
- Chunk the input into smaller pieces (we’ll do this in our transcript example later).
- Use a model with a larger context if available (some models support 16k, 32k, or more tokens).
- Reduce the output length if your prompt is near the limit. For instance, if your prompt is already 8000 tokens of an 8192 context, you can only safely generate ~192 tokens of output before hitting the cap.
4. Run Qwen3 and modify its parameters
Interactively launch the model:
ollama run qwen3:30b
Inside the REPL, you can set generation parameters:
> /set parameter temperature 0.6
> /set parameter top_p 0.95
> /set parameter top_k 20
> /set parameter repeat_penalty 1.1
> /set parameter num_gpu 2 # offload more layers to GPU
> /set parameter num_ctx 8192 # 8k in context
Save your configuration as a custom model:
>>> /save qwen3:30b-8k
Then simply ollama run qwen3:30b-8k next time.
That is that about that…
large language models once said
They drink from inkless wells, summon echoes of meaning, crafting shapes from shadows, where nothing once dwelled.
Buy Me a Coffee