Benchmarking my local models for n8n

n8n

Benchmarking - The Breaking Point and lessons for n8n

It was a Tuesday morning when my email triage workflow decided it had enough. A 47-message thread about a project deadline landed in my inbox, and my automation sat there… thinking… for three full minutes before finally spitting out a summary that missed the actual deadline buried in message 23. That’s when I knew something had to change.

I’d been running n8n automations on a 24 GB AMD Radeon RX 7900 XTX, handling everything from email summaries to document classification, Slack digests to calendar extractions. The setup worked beautifully for months, until it didn’t. Models were getting slower, contexts were exploding, and my carefully crafted system prompts kept vanishing into the void.

This is the story of how I rebuilt my local AI stack from the ground up, tested 15 different models, discovered a game-changing Ollama upgrade, and finally created a routing system that actually works. Spoiler: the answer wasn’t “bigger model” or “more context”, it was understanding what each model does best and when to use it.

Caveat Note These local models dont replace frontier models in day to day but they can help automate something that is tedious and time consuming such as sorting emails, summarization, reminders so in general automation.

Hardware

Component	Spec / Notes
GPU	AMD Radeon RX 7900 XTX (24 GB VRAM)
CPU	AMD Ryzen 9 7950X3D (16 cores / 32 threads)
RAM	64 GB DDR5
OS	EndeavourOS (Arch-based Linux)
Runtime	Ollama + ROCm
Context limit	Up to 130K (I have never gone larger as that usually would blow past my GPU VRAM)

The Quest Begins: Model Selection

I’d been loyal to Qwen3 because of its /think and /no_think toggle I could turn deliberate reasoning on or off right in the prompt. Perfect for automation where sometimes you need deep thinking and sometimes you just need fast classification. Then the 2507 update dropped and changed everything.

Qwen split into separate Thinking and Instruct models. My first reaction? Frustration my workflows were built around that prompt-side flexibility. After testing, though, the intelligence gains were undeniable. The older omnipurpose models were doing double duty with one set of weights; the 2507 models are specialists, and it shows. Practically, I now run the Thinking or Instruct variant explicitly instead of flipping a toggle in the prompt.

Then came Qwen3.5. The context window officially jumps to 256K, and Alibaba pushed further on quality across the board. But on my 24 GB card the practical ceiling is still around 130K before latency and RAM pressure stop feeling sane. The bigger story for my workflows: Qwen3.5 handles thinking differently again, controlled through model settings or the API call rather than prompt-side toggles. For automation I usually want thinking off entirely, and Qwen3.5 makes that cleaner to enforce. That said, on my current AMD/ROCm stack, Qwen3.5 feels noticeably slower than similarly-sized Qwen3-2507 models. It’s usable and worth watching, but the stack doesn’t feel fully settled yet for AMD.

The Ollama Discovery

But there was a problem. Right after Qwen3-2507 landed, my machine choked. Simple questions were taking forever if the prompt was anything beyond “hello world”. I filed a GitHub issue (#12432) and kept testing.

Then came Ollama 0.12.4-rc6, suggested by the maintainers.

The release notes mention FlashAttention being enabled by default for certain families. I’d already been forcing FlashAttention via flag, but with 0.12.4-rc6 (ROCm build) my numbers leapt: prompt eval rate went from ~230 tok/s to >4000 tok/s on the same prompts. Not a typo. An order-of-magnitude improvement. From my testing and the maintainer guidance, this felt like more than just flipping a flag the RC includes deeper optimizations/refactors that benefit AMD/ROCm systems.

Suddenly, long contexts weren’t painful anymore. No more micro-stalls. Token pacing became smooth and predictable. This was the breakthrough I needed.

The Great Benchmark Session

With Ollama 0.12.4 humming along, I spent a weekend testing every model I could get my hands on. I needed to know: which models were actually fast enough for real-time automation? Which ones could handle tool calling without choking? And most importantly, which ones earned their spot in my workflows?

Here’s what I found, ordered by eval rate (the metric that matters most for automation). I also include prompt-eval numbers for 0.12.1 vs 0.12.4 to show the big jump:

Model	Eval rate (tok/s)	Prompt eval rate tok/s v0.12.1	Prompt eval rate (tok/s) v0.12.4
llama3.2:3b	161.17	409.89	2879.96
qwen3:4b-instruct-2507-q4_K_M	120.57	248.00	2159.90
qwen3:4b-thinking-2507-q4_K_M	121.62	237.02	2169.92
qwen3:4b	119.81	154.36	2219.18
gpt-oss:20b	113.91	229.57	1951.89
qwen3:4b-instruct-2507-q8_0	110.87	235.93	3070.18
qwen3:4b-thinking-2507-q8_0	105.11	259.10	2251.66
qwen3.5:4b	89.30	n/a	n/a
qwen3:30b	83.45	61.59	384.66
qwen3:8b	82.25	134.83	1552.20
deepseek-r1:8b	76.49	66.43	1800.68
qwen3-coder:30b	83.68	47.17	406.34
qwen3.5:9b	63.80	n/a	n/a
qwen3:14b	53.52	103.42	940.61
gemma3:12b	52.25	117.97	1130.46
deepseek-r1:32b	25.93	55.55	419.25

Eval rate is the main apples-to-apples throughput signal for automation.
Prompt eval rate is critical on large contexts if prompt eval is slow, everything is slow. Qwen3.5 prompt eval numbers are still settling on my AMD/ROCm stack so I’m listing them as n/a for now.
On my machine, 20B GPT-OSS runs shockingly fast but Qwen3-8B and Qwen3-4B are my daily drivers they load faster and leave more VRAM for context.
Qwen3.5:4b lands at ~89 tok/s and Qwen3.5:9b at ~64 tok/s. Both are slower than their Qwen3-2507 counterparts at similar parameter counts. On AMD/ROCm this feels like stack maturity rather than a model problem.
Heavy deliberate models (DeepSeek R1 32B) are too slow for live automations, so I use them sparingly but they can think interestingly.

Real Workflows That Actually Work

Benchmarks are nice, but what really matters is: do these models solve actual problems? Let me walk you through some workflows that transformed how I work.

The Paperless Pipeline

I hate paper clutter. Every document that enters my life receipts, contracts, manuals gets scanned into Paperless. But finding things was still tedious until I added AI to the mix.

Now, when a document hits Paperless, a webhook pings n8n. The workflow classifies the document type, extracts key points, and most importantly, finds dates deadlines, renewal dates, warranty expirations. If there’s an actionable date, it creates a calendar entry automatically.

I tested this with several models. Qwen3-8B was good but occasionally missed dates in dense legal text. gpt-oss:20b hit the sweet spot with the updated 0.12.5 accurate enough to catch every deadline, fast enough to process documents in seconds, and the 20B size means I can handle even long contracts without VRAM issues.

Weekly Slack Summaries: Qwen3-8B’s Home Turf

Every Monday morning, I used to spend 30 minutes scrolling through work Slack channels trying to remember what happened last week. Now my n8n workflow does it for me.

It pulls the last week of messages from specific channels and generates a “what mattered” brief: decisions made, outages handled, wins celebrated, follow-ups needed. Sometimes I ask for quarter-to-date accomplishments for review meetings. Slack now has their Recap feature that does something similar, but I built this months ago and it’s tailored exactly to my needs.

Qwen3-8B owns this workflow. It handles the multilingual mix (Swedish and English messages), maintains context across long channel histories, and produces summaries that actually capture the nuance of conversations. At ~82 tok/s, the entire week’s summary is ready before my coffee finishes brewing.

The Context Management Disaster (and How I Fixed It)

The three failures that most folks who even glance at local AI will run into:

I had a beautiful system prompt for email classification seven carefully crafted rules, examples, edge cases. Worked perfectly on short emails. Then a 50-message thread came through, and the model completely ignored my instructions. Classified urgent items as spam. Why? My system prompt got pushed out of context. The model literally forgot what I told it to do, and that’s all because of num_ctx. This is the first thing I optimized: I estimate it inside the n8n workflows and set it when the model is called.

Next up, the big-chungus context. Let’s say someone sent me a 200-page technical specification to summarize and I threw it at my 30B model with a huge context window. First, it wouldn’t fit 30B is a large model in and of itself on a local box but let’s pretend it did. The GPU fans spun up like a jet engine. Ten minutes later, it was still processing. When it finally finished, the summary was… mediocre. Turns out, bigger context doesn’t mean better results. It just means slower unimaginably slower. And a lot of that pain is prompt-eval speed, not “thinking.”

Lastly, my document extraction workflow was supposed to pull deadlines and create calendar entries. It worked great until it hit a contract with multiple dates. The model found all the dates but the JSON output got truncated mid-object. Retries failed the same way. The problem? I hadn’t budgeted space for the tool response. Each tool you add takes a chunk of that sweet context MCP and “normal” tooling are no exception.

These examples taught me to stop thinking about context as a number and start treating it like an expensive budget.

I start by estimating the true footprint of whatever I’m about to feed the model body text, thread tail, attachment text. Before I even think about inference, I “budget” three things that used to get squeezed out:

A protected slice for the system prompt. I always reserve space for it up front so it can’t be evicted by a long user message.
The user message. I calculate the estimated tokens from the size of the user message.
Summaries / JSON / tool I/O need breathing room too. If you don’t reserve it, you get truncation or rambling retries.

With that budget in place, I bucket the job instead of guessing: small (blazing fast), medium (typical emails), large (long threads/docs), or “oversize” (rare). Each bucket maps to a sane context length, so I’m never shoving everything into a single huge window just because I can. Small and medium stay snappy; large gets more space but still within reason; oversize gets special handling. Remember: the bigger the local context, the slower it gets.

Finally, I’m intentional about reasoning. Most automation steps run in a no-think mode that just gets to the point: classify, plan tools, act. I only switch to think when ambiguity truly benefits from a scratchpad. That alone saves a ton of tokens and keeps the important instructions safe at the front of the window.

Here’s my current lineup of models.

Llama3.2-3B: The Speed Demon

When I use it: First-line email triage, spam detection, priority scoring, intent routing
Speed: 161 tok/s the fastest in my arsenal
The story: I needed something to pre-filter my inbox before heavier models kicked in. This tiny model catches ~95% of spam, routes urgent emails to the right workflow, and does it so fast that I can process 100 emails in under 10 seconds. It’s not “smart,” but it doesn’t need to be. It’s fast and focused.

Qwen3-4B-2507: The Efficient Generalist

When I use it: Short emails, webhook responses, light summaries with occasional tool calls
Speed: 121 tok/s with low VRAM footprint
The story: This became my go-to after I noticed my morning email routine was wasting time loading bigger models for simple tasks. Qwen3-4B loads quickly, handles most routine work, and when it needs to call a tool (like creating a task or calendar entry) it can manage that when you have worked through the hit/miss ratios. Nearly as fast as GPT-OSS-20B but leaves more VRAM for context.

Qwen3-8B: The Workhorse

When I use it: Default for anything requiring reasoning, multilingual content, or complex summaries
Speed: 82 tok/s
The story: This is my daily driver. It’s the model that runs when I’m not sure which model to use. It handles Swedish and English seamlessly, produces thoughtful summaries, and manages tool calling with same caveats as 4B, hit/miss ratio so my approach usually is to have it output json with the tools it needs and have a n8n workflow programatically manage the actual tooling. If I could only keep one model, this would be it. The Slack summary workflow? The document analysis? Both run on this model because it consistently delivers quality at reasonable speed.

GPT-OSS-20B: The Power Play

When I use it: Complex tool plans, high-stakes summaries, when quality trumps speed
Speed: 114 tok/s impressive for 20B parameters
The story: This model surprised me. It runs shockingly fast for its size and produces output quality that rivals hosted smaller models on APIs. I use it when a client-facing summary needs to be more perfect or when a complex workflow requires sophisticated tool orchestration (still simple in comparison to the big boys). The catch? I only call it when I have the context budget.

Qwen3.5-4B / 9B: The Next Generation (Early Days)

When I use it: Testing and evaluating for eventual daily driver replacement
Speed: 89 tok/s (4b) / 64 tok/s (9b)
The story: Qwen3.5 is the successor to my Qwen3 lineup with a 256K official context window and improved quality benchmarks across the board. On paper, qwen3.5:4b should replace qwen3:4b-2507 as my lightweight generalist, and qwen3.5:9b should slot in where I use qwen3:8b today. In practice, on my AMD/ROCm stack, both feel slower than their Qwen3-2507 counterparts. The 4b at 89 tok/s is workable but noticeably behind the 120+ tok/s I get from qwen3:4b-2507. The 9b at 64 tok/s with ugly TTFB tails (p95 over a minute) is harder to justify for real-time automation right now. I’m keeping them in rotation for non-latency-critical tasks and waiting for the ROCm story to mature. This is a “not yet, but soon” situation.

DeepSeek R1-8B / 32B: The Deep Thinker

When I use it: Rare moments when I need genuine chain-of-thought reasoning
Speed: 76 tok/s (8B) / 26 tok/s (32B)
The story: Too slow for automation, but fascinating for exploration. When I’m building a new workflow and need to understand edge cases, or when I’m analyzing something genuinely ambiguous, R1-8B helps me think through the problem as its a chatty thinker. The 32B version lives in Open WebUI for offline research sessions.

Now to the Benchmarks

Stories are great, but data is better. Here’s how the Qwen3 family actually performs on automation-relevant benchmarks. I’m particularly interested in instruction following (IFEval), tool use (BFCL/TAU), and multilingual capabilities the things that make or break real workflows.

Numbers here are a mix of public cards and my local runs as noted elsewhere.

Task / Metric	Qwen3-30B-Thinking	Qwen3-4B Thinking	Qwen3-4B-Thinking-2507
IFEval (instr. follow)	86.5	81.9	87.4
Arena-Hard v2 (hard tasks, human prefs)	36.3	13.7	34.9
LiveBench 20241125 (general reasoning)	74.3	63.6	71.8
WritingBench (exposition/clarity)	77.0	73.5	83.3
Creative Writing v3 (coherence/voice)	79.1	61.1	75.6
BFCL-v3 (agentic behavior)	69.1	65.9	71.2
TAU1-Retail (tool-aug. tasks v1)	61.7	33.9	66.1
TAU1-Airline	32.0	32.0	48.0
TAU2-Retail (tool-aug. tasks v2)	34.2	38.6	53.5
TAU2-Airline	36.0	28.0	58.0
TAU2-Telecom	22.8	17.5	27.2
MultiIF (multilingual instr.)	72.2	66.3	77.3
AIME25 (math ref)	70.9	65.6	81.3

What do these benchmarks actually measure?

IFEval (Instruction Following): can it obey formats, constraints, and steps?
Arena-Hard v2 (Human-preference on hard prompts): overall problem-solving quality on tough, ambiguous tasks.
LiveBench (General Reasoning): breadth of reasoning across varied domains.
WritingBench / Creative Writing: clarity, coherence, structure, and controllable style.
BFCL-v3 (Agentic): staying on-task across multi-step objectives.
TAU1 / TAU2 (Tool-Augmented): succeeds when the solution requires tools/APIs, not just text.
MultiIF (Multilingual Instruction Following): follows instructions in multiple languages.
AIME25 (Math): hard math as a proxy for precise reasoning.

Head-to-head (for automation)

Capability	Qwen3-4B-Thinking-2507	Qwen3.5-4B	Qwen3-30B-Thinking-2507	gpt-4.1 mini	gpt-4.1 nano	Claude 3.5 Haiku	gpt-oss-20B
Reasoning depth (multi-step, ambiguity)	Good for 4B; beats older 4B/8B	Good–Very good; quality uplift	Very good; close to small hosted	Excellent	Fair	Good	Good–Very good
Instruction following (IFEval-like)	Very good	Very good	Very good	Excellent	Fair	Very good	Good
Tool use / function calling (BFCL/TAU-like)	Very good (few schema errors)	Very good	Very good	Excellent (low error rates)	Fair	Very good	Good (sometimes verbose)
Summarization / writing quality	Good	Good–Very good	Very good	Excellent	Fair	Excellent	Good–Very good
Multilingual (sv/en mail, Slack)	Good	Good–Very good	Very good	Very good	Fair	Very good	Good
Local Latency	Low (local)	Medium (slower on AMD/ROCm)	Medium (local)	Medium (API RTT)	Low–Medium	Medium	Low–Medium (local)
Cost	$0 variable power	$0 variable power	$0 variable power	$ per token	$ lower	$ per token	$0 variable power
Determinism / control	High (quant, params fixed)	High	High	Medium (provider changes)	Medium	Medium	High
Context (practical)	8–32k sweet spot	8–32k (256K official, ~130K on 24 GB)	16–64k sweet spot	Large (cheap-ish)	Small	Large	8–32k sweet spot
Footprint	4.3 GB (q8_0)	~4 GB (q8_0)	8–16 GB (typical quant)	none local	none local	none local	~10–12 GB (q8_0/q6)

Jump to API when you need:

When I decide between local vs hosted, I ask: Is this data-privacy-sensitive? If yes, I stay local. If not, I’ll use hosted.

Maximum obedience + low schema errors across varied tools, or highest polish in writing (customer-facing) → GPT-4.1 mini.
Cheaper hosted small tasks where reasoning is light → GPT-4.1 nano.
Concise, pleasant summaries and steady formatting → Claude 3.5 Haiku.

All these examples are to be as budget friendly as possible, always use bigger paid models when the need is there!

What I Learned (and What It’s Worth)

Running local AI teaches something counterintuitive: the answer isn’t always “bigger” or “more powerful.” It’s about matching the right model to the right task.

The Real Impact

After a few months of running this setup:

Email triage time: Down from thousands of unread emails to ~5 minutes/day (mostly just reviewing what the AI flagged and sorted)
Document processing: 200+ documents processed without a single missed deadline in the calendar
Slack summaries: Generated many weekly reports, saving time from manual review

The Three Costs of Getting It Wrong

If you don’t match the model to the task, you pay three ways:

Accuracy Smaller models miss nuance they’re not built for; larger models can hallucinate on tasks beneath them
Electricity A 24 GB card processing simple emails is like taking a semi-truck to buy groceries
Time Slow contexts, retries, and “why did it forget my instructions?” debugging sessions that steal your afternoons

The System That Works

My optimization isn’t a fancy setting. It’s a rhythm:

Budget the context before you load the model
Route to the smallest model that can handle the task
Escalate only when needed let failures teach you when to use bigger guns
Protect your instructions reserve context space for what matters

Do this, and local AI goes from an interesting science project to a reliable co-worker that saves you hours every week. Skip it, and the meter runs on power, patience, and days you won’t get back.

Running local models for automation? The hardware matters, but the routing strategy matters more. Start small, measure everything, and let the data tell you when to scale up.

Buy Me a Coffee