There is a particular moment that hooks every developer on local AI. You type a question into a terminal, hit enter, and watch a coherent answer stream back — with your Wi-Fi off, no API key, no usage meter ticking, nothing leaving your laptop. The model is just there, running on silicon you already own.
Getting to that moment used to require a research-lab pedigree. It no longer does. In 2026, a mid-range laptop can run models that would have been considered frontier-class a couple of years ago, and the tooling has matured from finicky Python scripts into one-line installers. The catch is that the landscape is now wide: a dozen serious tools, hundreds of models, and a thicket of jargon — GGUF, quantization, KV cache, MoE, offloading — standing between you and that first streamed token.
This guide is the map. I'll assume you're a competent developer but new to running models locally, and I'll take you from vocabulary to a working setup, with enough depth that intermediate and senior engineers get the why behind each decision, not just the how. By the end you'll be able to do three things with confidence: pick the right open source model for a given job, configure it for your specific hardware, and run it successfully — whether you're on a MacBook Air, a gaming rig with an NVIDIA card, or a CPU-only workstation.
One promise up front: I won't pretend local always beats the cloud (it doesn't, at the very high end), and I won't bury the tradeoffs. Local AI is the right call for privacy, cost, offline capability, and control. Let's make those wins real.
The one-paragraph version: if you read nothing else, install Ollama (or LM Studio if you want a GUI), pull a 7–8B model in Q4_K_M quantization, and you’re running local AI in ten minutes. The single number that decides what you can run is memory: VRAM on a GPU, or unified memory on a Mac. Everything else in this guide is detail on top of those two facts.
The vocabulary you need
This field has a dialect, and most tutorials assume you already speak it. Let's fix that first. Skim this section now, then refer back when a term trips you up later — it's designed as a glossary you can return to, not a wall to memorize in one pass.
The foundational terms
LLM (Large Language Model). A neural network trained to predict the next token of text. That simple objective, at scale, produces the chat, code generation, and reasoning we find so useful. Everything you’ll run is an LLM or a close cousin.
Open source vs. open weight. This distinction matters more than most people realize. Open weight means the trained parameters are downloadable and you can run them yourself. Open source, in the strict sense, additionally requires open training data and code, and a license with no restrictions on who can use it or for what. Most "open" models — Llama, Qwen, Gemma, DeepSeek — are open weight. Only some, typically those under Apache 2.0 or MIT licenses, approach genuine open source.
Parameters (weights). The learned numbers inside the network. "7B" means seven billion parameters. More parameters generally means more capability — and more memory required to hold the model.
Tokens and tokenization. Models don’t read words; they read tokens. A token is roughly four characters or about three-quarters of a word. When you see "tokens per second," that’s the unit of generation speed.
Context window (context length). How many tokens the model can hold in its attention at once — your prompt plus its output combined. Older models maxed out around 4,000 tokens; modern ones reach 128,000, 256,000, and in a few cases over a million.
Inference. Running a trained model to produce output. This is distinct from training (creating the model) and fine-tuning (adapting it). Everything in this guide is about inference.
The terms that actually decide your setup
Quantization. The most important concept after parameter count. Models are trained in 16-bit precision, but you can compress the weights to 8-bit or 4-bit to shrink memory use and speed up inference, trading a little quality. The common levels:
FP16 / BF16 — full (half) precision, the uncompressed baseline. Two bytes per parameter.
Q8_0 — 8-bit, essentially indistinguishable from the original. One byte per parameter.
Q5_K_M — 5-bit, a high-quality middle ground.
Q4_K_M — 4-bit, the universally recommended sweet spot: about 75% smaller than FP16 with only a 1–3% quality drop.
GGUF — not a quantization level but the file format that packages a quantized model into a single file. It’s what Ollama, LM Studio, and llama.cpp all consume.
GPTQ / AWQ / EXL2 — alternative quantization schemes optimized for GPU-based serving.
VRAM vs. RAM. VRAM is the dedicated memory on a discrete graphics card; RAM is your system memory. On a machine with an NVIDIA or AMD GPU, the model must fit in VRAM to run at full speed.
Unified memory. On Apple Silicon (and a few new AMD chips), the CPU and GPU share one fast pool of memory. The GPU can address most of that pool, with a share held back for the OS, which is why a 64GB MacBook can punch far above a gaming GPU on large models.
GPU offloading (layer offloading). When a model is too big for your VRAM, you can keep some of its layers on the GPU and push the rest to system RAM. The model still runs — but the offloaded portion is dramatically slower.
Metal / CUDA / ROCm / Vulkan / SYCL. The hardware-acceleration backends: Apple, NVIDIA, AMD, a cross-vendor fallback, and the Intel path respectively.
MoE (Mixture of Experts). An architecture where only a fraction of the total parameters — the "active" parameters — fire for any given token. You get the quality of a big model with the compute cost of a small one. The catch: you still have to hold all the parameters in memory. Plan memory by total parameters; plan speed by active parameters.
Mental model: A Mixture-of-Experts model is like a hospital with fifty specialists on staff but only four seeing any given patient. You pay the rent on the whole building (memory), but each visit is fast because only a few doctors are involved (compute). A 30B-A3B model has 30 billion parameters total but only ~3 billion active per token.
The terms you’ll see in benchmarks and settings
KV cache. As a model generates, it stores the attention keys and values for every previous token so it doesn’t recompute them. This cache grows with context length, and at long contexts it can consume as much memory as the model weights themselves.
Temperature, top-p, top-k. Sampling controls that govern randomness. Lower temperature produces more deterministic output; higher is more creative. Top-p and top-k limit the pool of candidate tokens.
System prompt. A hidden instruction that sets the model’s role and behavior before the conversation begins.
Throughput vs. latency. Throughput is total tokens per second across all requests; latency is how fast a single response comes back. Tokens per second is the headline speed number, and time to first token measures how snappy the model feels.
Fine-tuning vs. RAG. Two ways to make a model "know" your data. Fine-tuning retrains the model on your examples; RAG (retrieval-augmented generation) leaves the model untouched and feeds it relevant documents at query time. For most use cases, RAG is the cheaper, faster, more maintainable choice.
Embeddings. Numerical vector representations of text that capture meaning, used for semantic search and as the backbone of RAG systems.
Distillation. Training a smaller model to imitate a larger one. DeepSeek’s R1 "distill" models bring large-model reasoning to consumer hardware.
Multimodal / vision-language models. Models that accept images (and sometimes audio or video) alongside text.
Reasoning models. Models trained to "think out loud" — producing an explicit chain of reasoning before their final answer. DeepSeek-R1 and OpenAI’s gpt-oss are leading examples.
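To make the sampling controls from this glossary concrete, here is a minimal sketch of how temperature, top-k, and top-p could be combined to pick one token. It is an illustration only: the function name `sample_token`, the default values, and the order of the filters are assumptions, and real runtimes apply their samplers in their own configurable order.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=random):
    """Pick one token from a {token: logit} mapping (hypothetical helper)."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always take the best token.
        return max(logits, key=logits.get)

    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the max for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((tok, w / total) for tok, w in weights.items()),
                   key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    kept_weights = [p for _, p in kept]
    return rng.choices(tokens, weights=kept_weights, k=1)[0]
```

Setting `temperature=0` or `top_k=1` collapses the choice to the single most likely token, which is why those settings give repeatable output.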
The memory math that governs everything
If you internalize one section of this guide, make it this one. Almost every question you’ll have — "Can I run this model?" "Why is it so slow?" "Which quantization should I pick?" — reduces to a single question: does the model fit in fast memory, and if not, how much are you willing to spill into slow memory?
Estimating how much memory a model needs
The weights of a model take up a predictable amount of space based on parameter count and quantization. The rule of thumb:
Memory (GB) ≈ parameters (billions) × bytes-per-parameter × 1.2, where bytes-per-parameter ≈ 2.0 (FP16), 1.0 (Q8_0), ~0.7 (Q5_K_M), ~0.55 (Q4_K_M). The 1.2 accounts for overhead.
So a 7B model needs about 16.8GB at full precision, ~8.4GB at Q8_0, and ~4.6GB at Q4_K_M once the overhead is included. The handy shortcut: at Q4_K_M, every billion parameters costs roughly 0.55–0.7GB, depending on whether you count the overhead. Here’s the reference table:
Model size | Q4_K_M (weights) | Q8_0 (weights) | Typical GPU it fits |
|---|---|---|---|
3B | ~2 GB | ~3.5 GB | Almost anything, even 4GB |
7–8B | ~4.5–5 GB | ~8 GB | 8GB cards comfortably |
13–14B | ~8 GB | ~14 GB | 12GB cards |
27–32B | ~18–20 GB | ~34 GB | 24GB cards (3090/4090) |
70B | ~40 GB | ~75 GB | 48GB+, or a high-RAM Mac |
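The rule of thumb above is simple enough to keep as a helper. This sketch uses the bytes-per-parameter values and the 1.2 overhead factor from the formula; the function name `estimate_memory_gb` is just an illustrative choice.

```python
# Approximate bytes per parameter for each quantization level (from the rule of thumb).
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q5_K_M": 0.7, "Q4_K_M": 0.55}

def estimate_memory_gb(params_billions, quant="Q4_K_M", overhead=1.2):
    """Memory (GB) ~ parameters (billions) x bytes-per-parameter x overhead."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead

for quant in BYTES_PER_PARAM:
    print(f"7B at {quant}: ~{estimate_memory_gb(7, quant):.1f} GB")
```

Pass `overhead=1.0` to get a weights-only figure like the ones in the table.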
Don’t forget the KV cache
The weights are only part of the story. The KV cache grows linearly with context length, and at long contexts it can rival or exceed the weights. A Llama-3-8B at 32K context burns roughly 4GB on KV cache alone. Push to 128K and the cache can dwarf the model.
Two things rescue you. First, nearly every model released in 2025 and 2026 uses Grouped-Query Attention, which cuts cache size by 50–75% for free. Second, you can quantize the KV cache itself: 8-bit roughly halves its footprint, and 4-bit shrinks it further at some cost in quality.
Common trap: When a model card says "runs in 8GB," that almost always means weights only, at a short context. Budget an extra 1–2GB for the KV cache and overhead at modest context lengths — and far more at long ones.
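The KV cache figure can be worked out from the model's shape. The sketch below assumes the usual layout (one key and one value vector per layer, per KV head, per token) and uses Llama-3-8B's published shape of 32 layers, 8 KV heads, and a head dimension of 128. Treat those numbers as assumptions to check against the model's config file.

```python
def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2.0):
    """Approximate KV cache size in GiB for a given context length."""
    # The leading 2 covers one key vector plus one value vector.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30

# Assumed Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

print(kv_cache_gib(32_768, **LLAMA3_8B))                       # 16-bit cache
print(kv_cache_gib(32_768, **LLAMA3_8B, bytes_per_value=1.0))  # roughly 8-bit cache
print(kv_cache_gib(32_768, n_layers=32, n_kv_heads=32, head_dim=128))  # same model without GQA
```

The third call shows what the same model would need with 32 KV heads instead of 8, which is the saving Grouped-Query Attention provides.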
The MoE wrinkle
Mixture-of-Experts models break the simple mental model. Take Qwen3-30B-A3B: 30 billion total parameters but only ~3 billion active per token. It generates as fast as a 3B model, but you still need enough memory to hold all 30 billion. So: size your memory by total parameters, size your speed expectations by active parameters.
What happens when it doesn’t fit: offloading
When a model exceeds your VRAM, tools like llama.cpp and Ollama automatically offload the excess layers to system RAM. This prevents a crash, but it’s slow — system RAM bandwidth (roughly 50–70 GB/s on a dual-channel DDR5 desktop) is an order of magnitude below GPU VRAM (around 1,000 GB/s on an RTX 4090). Offloading 10–20% is often tolerable; offloading half will make you wish you hadn’t.
This leads to the most important hardware insight in the guide: token generation speed is governed by memory bandwidth, not raw compute. A model generates roughly as fast as your memory bandwidth divided by the model’s size in memory.
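That bandwidth rule can be turned into a quick ceiling estimate. The sketch below assumes every active weight is read once per generated token and ignores the KV cache and compute time, so it is an upper bound and not a prediction. The function name and the example bandwidth figures are illustrative.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params_billions, bytes_per_param=0.55):
    """Ceiling on generation speed: memory bandwidth / bytes read per token."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

# Dense 7B at Q4_K_M on ~1,000 GB/s of VRAM vs ~60 GB/s of system RAM.
print(round(max_tokens_per_second(1000, 7)))
print(round(max_tokens_per_second(60, 7)))

# A 30B-A3B MoE: size the speed by the ~3B active parameters, not the 30B total.
print(round(max_tokens_per_second(1000, 3)))
```

Measured speeds land well below these ceilings, but the ratios between machines tend to follow them.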
The tooling landscape, tool by tool
The ecosystem looks chaotic until you see its structure. There are really three layers: engines that do the actual math (llama.cpp, MLX); experiences that wrap an engine in convenience (Ollama, LM Studio, Jan, GPT4All); and servers for high-throughput, multi-user production (vLLM, TGI, SGLang).
Because most consumer tools wrap llama.cpp, their raw single-user speed differs by only a few percent. So choose based on workflow, not on a myth that one is dramatically faster than another.
llama.cpp — the engine underneath almost everything
The foundation of the entire consumer local-AI world. Created by Georgi Gerganov, its first commit landed on March 10, 2023 — just two weeks after Meta released the original LLaMA weights. It’s a dependency-free C/C++ inference library that reads GGUF and runs on essentially everything: CPU, CUDA, Metal, ROCm, Vulkan, SYCL.
Its superpower is being first: new architectures usually land here before anywhere else. It exposes every tuning knob. The tradeoff is that you compile it yourself and manage flags. Pick it if you want maximum control, the newest models the day they drop, or the last few percent of performance.
Ollama — the "Docker for LLMs"
If one tool is the default recommendation for developers, it’s this one. Ollama is CLI-first, runs as a background daemon, and exposes both a REST API and an OpenAI-compatible endpoint on port 11434. The workflow: ollama pull, ollama run, done. It stores models by content hash and automatically manages VRAM.
Pick it if you’re a developer who wants local AI to behave like a service you forget is running. This is the one most people should start with.
LM Studio — the polished GUI
The friendliest on-ramp, and free for personal use. LM Studio gives you a built-in model browser that shows memory estimates before you download, a chat playground, RAG over your local documents, and an OpenAI-compatible server — all in a clean desktop app. It runs both the llama.cpp and MLX backends.
Pick it if you want the smoothest discover-download-experiment loop, or you’re on a Mac and want MLX speed without touching the command line.
The rest of the field
Each of these earns its place for a specific job:
Tool | What makes it special | Pick it if… |
|---|---|---|
Jan | Open-source, offline-first ChatGPT-style desktop app; can bridge to cloud APIs | You want a clean assistant UI and value fully open-source software |
GPT4All | Point-and-click RAG over a folder of documents, fully offline, near-zero config | You want private document Q&A with no setup |
KoboldCpp | Single-executable llama.cpp fork built for creative writing and roleplay | Fiction or roleplay with rich world/character memory |
Llamafile | Packs an entire model plus runtime into one cross-platform executable | You want maximum portability or to ship a model as a single file |
MLX / MLX-LM | Apple’s native framework; exploits unified memory and supports on-device fine-tuning | You’re on a Mac and want peak performance or local LoRA training |
text-generation-webui | "Swiss Army knife" — multiple loaders behind one UI, plus fine-tuning and RAG | You want to experiment broadly across model formats |
LocalAI | A router, not a runner: one OpenAI-compatible endpoint in front of many backends | You’re orchestrating several model types behind a single API |
The production tier: vLLM and friends
Everything above is built for one user at a desk. The moment you need to serve a model to many concurrent users, you cross into a different category — and the leader is vLLM. Its PagedAttention manages the KV cache in non-contiguous blocks like an OS manages virtual memory, cutting memory waste from 60–80% down to under 4%, and continuous batching slots new requests into the running batch the instant a slot frees. Its launch benchmarks reported up to 24× the throughput of naive Hugging Face Transformers serving.
The tradeoff: vLLM needs a Python environment and a capable GPU, isn’t built around GGUF (its native format is safetensors, typically with AWQ or GPTQ quantization, and its GGUF support is experimental), and is heavier to set up. Its cousins TGI and SGLang compete in the same space.
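To see why paging the KV cache matters, here is a toy accounting sketch. It is not vLLM's implementation: it only compares reserving a full maximum-length slab per request with handing out small fixed-size blocks on demand. The block size of 16 and the request lengths are assumptions chosen for illustration.

```python
import math

def contiguous_waste(seq_lens, max_seq_len):
    """Fraction of reserved KV slots unused when each request reserves max_seq_len."""
    reserved = len(seq_lens) * max_seq_len
    return (reserved - sum(seq_lens)) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction unused when slots are allocated in fixed-size blocks as needed."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return (reserved - sum(seq_lens)) / reserved

requests = [100, 500, 2000]  # tokens actually used by three concurrent requests
print(f"contiguous: {contiguous_waste(requests, max_seq_len=2048):.1%} wasted")
print(f"paged:      {paged_waste(requests):.1%} wasted")
```

With paging, the only waste is the partly filled last block of each request, which is what frees room for more concurrent requests in the same VRAM.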
Rule of thumb: Ollama for your laptop, vLLM for your server. If exactly one person or process talks to the model at a time, use a llama.cpp-based tool. If many do at once, move to vLLM or TGI.
The Python baseline: Hugging Face Transformers
The reference implementation everything else is measured against. Transformers (with Accelerate) gives you maximum model coverage and flexibility — the standard for research and fine-tuning — but carries the most setup and isn’t optimized for consumer single-user inference. Pick it if you’re doing research or need to run a brand-new model before anyone has produced a GGUF for it.
So which one should you actually use?
Your situation | Best choice |
|---|---|
I’m a developer and want one default | Ollama (invisible infrastructure with a clean API) |
I’m a beginner or non-developer | LM Studio or GPT4All |
I’m on a Mac | LM Studio or Ollama with the MLX backend |
I have a powerful NVIDIA card | llama.cpp for control; vLLM to serve |
I need to serve many users | vLLM (or TGI / SGLang) |
I want document chat | GPT4All or LM Studio (built-in RAG) |
I’m on low-end hardware | Ollama with small models; Llamafile for portability |
Creative writing / roleplay | KoboldCpp |
Configuring for your specific machine
Now we get practical. Find your hardware below and follow the path. The through-line is always the memory math from the previous section — here we apply it to real silicon.
Apple Silicon Macs (M1 through M5)
The surprise winner for individual developers. Because of unified memory, a 64GB MacBook can load a 70B model at Q4 with no copying between CPU and GPU. Use an MLX-backed runtime; on the newest chips, MLX is meaningfully faster than the older Metal path.
The rule for Macs: your usable model memory is roughly total unified RAM minus about 8GB for the OS. A 16GB Mac handles 7–8B comfortably, 32GB reaches ~30B, 64GB runs 70B, and 128GB+ opens the big MoE models. The one place a Mac loses to a discrete GPU is raw speed on models that already fit comfortably in that GPU’s VRAM.
NVIDIA GPUs (Windows and Linux)
The best-supported platform, full stop. Every tool works on NVIDIA first. Plan by your VRAM tier:
VRAM | Example cards | What you can run |
|---|---|---|
8 GB | RTX 4060, 3060 Ti | 7–8B at Q4_K_M, roughly 25–40+ tok/s depending on the card. The popular real-world floor. |
12 GB | RTX 3060 12GB, 4070 | 12–14B at Q4_K_M with room for context |
16 GB | RTX 4060 Ti 16GB, 5060 Ti 16GB | 14B at Q5, or gpt-oss-20b — the "16GB sweet spot" |
24 GB | RTX 3090, 4090 | 27–32B at Q4/Q5 fully resident, or 70B with offloading |
32 GB+ | RTX 5090, workstation cards | Larger 32B at high quant; dual 24GB reach 70B fully in VRAM |
The RTX 3090 deserves special mention: thanks to its wide 384-bit bus and ~936 GB/s bandwidth, it often out-generates the newer RTX 4080 despite being a generation older — a perfect illustration of the bandwidth-over-compute principle. A used 3090 remains one of the best value buys in 2026.
AMD GPUs (ROCm and Vulkan)
The story has genuinely improved. In 2026, ROCm has matured enough that AMD is a real choice for inference. On Linux, ROCm/HIP runs llama.cpp and Ollama at roughly 70–80% of CUDA speed at equivalent bandwidth. On Windows, Vulkan (through LM Studio or Ollama) is the least-friction path.
The RX 7900 XTX (24GB) is a credible, cheaper alternative to a 4090 for inference, and AMD’s Strix Halo chips bring Apple-style unified memory to the PC world, with up to 128GB shared. Where NVIDIA still wins decisively is fine-tuning and FP8 production serving.
Intel Arc GPUs
Workable, but the roughest software story of the bunch. The A770 16GB is a genuine budget VRAM bargain. Your paths are llama.cpp’s SYCL or Vulkan backend, IPEX-LLM’s portable Ollama build, or Intel’s vLLM-based stack. Buy Intel only if you enjoy the setup adventure.
CPU-only and low-RAM laptops
Realistic, but manage expectations. A modern CPU does 3–13 tok/s on a quantized 7B — fine for batch jobs, frustrating for interactive chat (anything under ~15 tok/s feels laggy). Stick to small models: Phi-4-mini, Llama 3.2 3B, or Gemma 3 4B at Q4, on 8GB systems.
High-RAM workstation with a weak GPU
You have an underrated option: run a big MoE model (say, gpt-oss-120b) keeping the attention layers on the GPU and the experts in system RAM. Because only a few experts activate per token, you can hit 10–30 tok/s with surprisingly little VRAM. In llama.cpp the --cpu-moe flag does exactly this.
Realistic speed expectations
Approximate tokens/second for a Q4_K_M model, single user, via llama.cpp or Ollama:
Hardware | 7B model | Notes |
|---|---|---|
RTX 4090 | ~135 tok/s | Faster than you can read |
RTX 3090 | ~95 tok/s | The value champion |
RTX 4070 Super | ~75 tok/s | Excellent mid-range |
RTX 4060 8GB | ~25–37 tok/s | Perfectly usable |
M3 Max (64–128GB) | fast on 7B; ~7–14 on 70B | Holds the whole 70B — so it beats a 4090 there |
Modern CPU | ~3–13 tok/s | Batch jobs, not chat |
And the headline number that breaks the pattern: a 30B MoE like Qwen3-30B-A3B can sustain 100+ tok/s on a 4090 — nothing like the slowdown you’d expect from 30 billion total parameters — because only ~3B are active per token.
Choosing the right model by use case
You’ve got a tool and you know your memory budget. Now: which of the hundreds of available models should you actually download? Start with the size tiers, then match a family to your task.
Read this first: This space moves monthly. Specific version numbers and "best in class" claims go stale fast. Treat the families below as durable and the exact version numbers as a snapshot — always check the current model card on Hugging Face or the Ollama library before committing.
The parameter-size tiers
1–3B (small / on-device): Phi-4-mini, Llama 3.2 1B/3B, Gemma 3 1B/4B. For edge devices, autocomplete, classification, and simple chat.
7–8B (the workhorse): Llama 3.1/3.3 8B, Qwen3 8B, Mistral 7B. The best cost-to-capability ratio for most laptops.
13–14B: Phi-4 14B, Qwen3 14B. Meaningfully smarter; needs ~12GB.
27–34B: Gemma 3 27B, Qwen3 32B, Mistral Small 3 (24B). The single-24GB-GPU sweet spot.
70B+: Llama 3.3 70B, Qwen2.5 72B. Needs 48GB+ or a high-RAM Mac.
MoE giants: Llama 4 Scout/Maverick, DeepSeek-V3/R1, Qwen3-235B, gpt-oss-120b. Big-model quality at small-model compute — but huge memory footprints.
The major model families
Llama (Meta). The largest ecosystem and the most community fine-tunes. The 3.x series are dependable dense workhorses; Llama 4 (April 2025) brought Mixture-of-Experts to the line, with Scout offering a headline 10M-token context. The caveat: the Llama Community License is not open source — it carries a 700M-monthly-active-user cap and EU restrictions.
Qwen (Alibaba). For many developers, the default answer to "what should I run locally?" in 2025–2026. Apache 2.0 licensed, dense and MoE from 0.6B to 235B, with strong coding, math, and multilingual ability.
Gemma (Google). Gemma 3 spans 1B to 27B, is multimodal from 4B up, and runs beautifully on consumer hardware. The 4B is a superb laptop default; the 27B is competitive on a 24GB GPU.
DeepSeek. DeepSeek-V3 for general use and the reasoning-focused DeepSeek-R1, released January 2025 under a clean MIT license. The full model needs a server, but the distilled variants (1.5B to 70B) bring R1-style reasoning to consumer hardware.
Mistral / Mixtral. Mistral 7B and Mixtral 8x7B remain heavily deployed; Mistral Small 3 (24B) rivals models three times its size, and the Mistral 3 family moved fully to Apache 2.0.
Microsoft Phi. Phi-4 (14B) and Phi-4-mini (3.8B), MIT-licensed, punch well above their weight per parameter — ideal for budget and edge deployments.
OpenAI gpt-oss. Released August 2025 under Apache 2.0, these are OpenAI’s first open-weight models since GPT-2. Both sizes, gpt-oss-20b and gpt-oss-120b, are MoE with a 128K context and three configurable reasoning-effort levels. gpt-oss-20b needs only ~16GB (a standout for a 16GB GPU or Mac), and gpt-oss-120b runs within 80GB on a single high-end GPU.
Also worth watching: Cohere’s Command R/R+ (RAG-focused), Falcon 3, GLM, Kimi, and coding specialists Qwen-Coder and Mistral’s Devstral.
Mapping use cases to models
Your task | Reach for |
|---|---|
General chat / assistant | Qwen3, Gemma 3, Llama 3.3 |
Code generation | Qwen-Coder, Devstral, DeepSeek-Coder, Code Llama |
Reasoning / math | DeepSeek-R1 (or its distills), gpt-oss at high reasoning effort |
Long-context tasks | Llama 4 Scout, Qwen3, Gemma 3 |
RAG / summarization | Mid-size instruct models plus an embedding model |
Multilingual | Qwen3 (strongest, especially CJK), Gemma 3, Mistral |
Vision-language | Gemma 3, Qwen-VL, Llama 4, Mistral Small 3.1 |
Small / fast on-device | Phi-4-mini, Llama 3.2 3B, Gemma 3 4B |
Reading model cards, and the licensing trap
You’ll find models in two main places: the Ollama library (curated, one-command pulls) and Hugging Face (everything, including community quantizations — "bartowski" and "Unsloth" are prolific and reliable). On any model card, check five things: parameter count, context length, license, intended use, and which quantizations are available.
The licensing nuance matters the moment you build something real. Apache 2.0 (Qwen, Gemma’s permissive releases, Mistral 3, gpt-oss) and MIT (DeepSeek, Phi) are the cleanest for commercial use. Llama’s license permits commercial use but with that 700M-MAU cap and EU restrictions. Always read the actual card before shipping a product, and involve legal for anything at scale.
Optimization: squeezing out performance
Here’s the playbook, roughly in the order you should apply it.
1. Pick the right quantization for your VRAM
Q4_K_M is the default for a reason — ~75% smaller than FP16 with only a 1–3% quality drop. Step up to Q5_K_M or Q8_0 when you have headroom. The most useful heuristic of all: a larger model at lower precision usually beats a smaller model at higher precision. A 14B at Q4 typically outperforms a 7B at Q8 — if both fit.
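One way to apply this step is to walk down the quantization levels until one fits. The sketch below reuses the guide's bytes-per-parameter figures and 1.2 overhead factor, and holds back 1.5GB for the KV cache at modest context. The function name and the reserve value are assumptions you should tune.

```python
# Ordered from highest quality to smallest footprint.
BYTES_PER_PARAM = {"Q8_0": 1.0, "Q5_K_M": 0.7, "Q4_K_M": 0.55}

def pick_quant(params_billions, memory_gb, reserve_gb=1.5, overhead=1.2):
    """Return the highest-quality quant that fits, or None if even Q4_K_M does not."""
    budget_gb = memory_gb - reserve_gb
    for quant, bytes_per_param in BYTES_PER_PARAM.items():
        if params_billions * bytes_per_param * overhead <= budget_gb:
            return quant
    return None

print(pick_quant(8, memory_gb=8))    # 8B model on an 8GB card
print(pick_quant(8, memory_gb=12))   # same model with more headroom
print(pick_quant(14, memory_gb=12))  # bigger model on the same 12GB card
print(pick_quant(70, memory_gb=24))  # does not fit without offloading
```

A `None` result is the signal to drop a size tier or accept offloading.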
2. Maximize GPU layer offloading
In llama.cpp, set -ngl 999 to push every layer onto the GPU. If you run out of memory, lower the number until the model fits. Partial offload is far faster than running everything on the CPU.
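If the model does not fit, you can estimate a starting value for -ngl and skip some of the trial and error. This sketch assumes all layers are roughly the same size and keeps a reserve for the KV cache. The function name and the 1.5GB reserve are hypothetical, and you will usually still nudge the result up or down by a few layers.

```python
def gpu_layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough -ngl starting point, assuming evenly sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# A ~19GB, 64-layer model on a 12GB card: only part of it fits.
print(gpu_layers_that_fit(model_gb=19, n_layers=64, vram_gb=12))

# A ~4.6GB, 32-layer model on a 24GB card: everything fits.
print(gpu_layers_that_fit(model_gb=4.6, n_layers=32, vram_gb=24))
```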
3. Manage context length and the KV cache
Don’t set your context window higher than you actually need; the KV cache scales linearly with it. When context is tight, quantize the cache: setting the KV cache type to q8_0 roughly halves its footprint. In Ollama, this is the OLLAMA_KV_CACHE_TYPE environment variable, which only takes effect when flash attention is enabled (OLLAMA_FLASH_ATTENTION=1).
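Context length can also be set per request. As a sketch, Ollama's native /api/chat endpoint accepts an options object with a num_ctx field. The helper names below are hypothetical, and the default host and model tag are assumptions taken from the rest of this guide.

```python
import json
import urllib.request

def build_chat_request(model, prompt, num_ctx=8192, host="http://localhost:11434"):
    """Build a non-streaming request for Ollama's native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for this request
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, prompt, num_ctx=8192):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, `chat("qwen3:8b", "Summarize the KV cache in one sentence.", num_ctx=4096)` returns the reply as a string. The cache type itself is a server-side setting, so it still has to be configured in the server's environment.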
4. Enable flash attention
Flash attention reduces the memory cost of attention and speeds up long-context inference. It’s standard on modern setups — turn it on with --flash-attn on in llama.cpp.
5. Use speculative decoding for a free speedup
Pair a large "main" model with a small "draft" model from the same family — say, Llama 3.2 1B drafting for Llama 3.1 8B. When acceptance is high, you get a 1.5–3× speedup with no quality loss, because the large model still has the final say. Keep the draft model much smaller than the main one.
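The mechanics are easier to see in a toy. The sketch below implements the greedy variant of speculative decoding with stand-in "models" that are plain functions over integer tokens. Real engines verify all draft tokens in one batched forward pass and use a probabilistic acceptance rule when sampling. Every name here is hypothetical.

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. The target checks the guesses (one batched pass in a real engine).
        target_passes += 1
        accepted = []
        for guess in draft:
            truth = target_next(out + accepted)
            if guess != truth:
                accepted.append(truth)  # first mismatch: keep the target's token
                break
            accepted.append(guess)
        else:
            # Every guess was right, so the same pass yields one bonus token.
            accepted.append(target_next(out + accepted))

        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], target_passes

def toy_target(tokens):
    return (tokens[-1] + 1) % 10

def toy_draft(tokens):
    # Agrees with the target except right after a 6, where it guesses wrong.
    return 0 if tokens[-1] == 6 else toy_target(tokens)

tokens, passes = speculative_generate(toy_target, toy_draft, prompt=[0], n_tokens=12)
print(tokens, passes)
```

The output matches what the target alone would produce, but it takes far fewer target passes than tokens. That gap is where the speedup comes from, and it shrinks as the draft model's guesses get worse.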
6. Choose MoE models when speed matters
A 30B-A3B MoE generates as fast as a 3B dense model while approaching the quality of something far larger — provided you have the memory to hold all 30B.
7. Buy hardware for memory bandwidth
Memory bandwidth (GB/s) predicts token-generation speed better than raw compute (TFLOPS). This is why the RTX 3090 often beats the newer 4080, and why the M3 Max beats the M4 Pro on generation. Match your purchase to two things: enough memory capacity to fit the model, and enough memory bandwidth to run it fast.
Optimization order of operations: (1) Pick the largest model that fits at Q4_K_M. (2) Max out GPU layers. (3) Enable flash attention. (4) Quantize the KV cache if context is tight. (5) Add speculative decoding if you need more speed. (6) Only then consider buying hardware.
Step-by-step setup walkthroughs
Enough theory. Here are the exact commands to get running on each of the three paths most people take. All three expose an OpenAI-compatible API, so any code you’ve written against the OpenAI SDK will work against your local model with a one-line change to the base URL.
Path A: Ollama (the recommended default)
Works on macOS, Windows, and Linux. On macOS and Windows, download the installer; on Linux, one command does it:
```bash
# Linux install (macOS/Windows: download the app)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model (downloads on first run)
ollama run qwen3:8b
# type your prompt in the chat; /bye to exit

# Handy commands
ollama list              # installed models
ollama ps                # models currently in memory
ollama pull gemma3:4b    # download without running
ollama rm <model>        # delete a model
ollama run qwen3:8b --verbose "Write a haiku"   # shows tok/s
```

The server runs on port 11434. Here’s how you hit it from the command line and from Python:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"Hello!"}]}'
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but unused
)
resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

To expose Ollama to other machines, set OLLAMA_HOST=0.0.0.0:11434 before it starts. To customize a model’s system prompt or parameters, write a Modelfile and run ollama create.
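For the streamed-token experience without any extra packages, the same OpenAI-compatible endpoint can be read with the standard library. This is a sketch that assumes the usual server-sent-events framing (lines of `data: {json}` ending with `data: [DONE]`). The helper names are hypothetical.

```python
import json
import urllib.request

def delta_from_sse_line(line):
    """Return the text fragment carried by one 'data: ...' line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    choices = json.loads(body).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")

def stream_chat(prompt, model="qwen3:8b", base_url="http://localhost:11434/v1"):
    """Yield text fragments as the local server produces them."""
    payload = {"model": model, "stream": True,
               "messages": [{"role": "user", "content": prompt}]}
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:  # the response object iterates line by line
            fragment = delta_from_sse_line(raw.decode("utf-8"))
            if fragment:
                yield fragment
```

With the server running, `for piece in stream_chat("Write a haiku"): print(piece, end="", flush=True)` prints tokens as they arrive. Pointing `base_url` at another local server's /v1 endpoint should work the same way.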
Path B: LM Studio (the GUI route)
No commands required for the basics:
1. Download from the LM Studio site (macOS .dmg, Windows .exe, Linux AppImage) and install. It auto-detects your GPU and the right backend.
2. Open the Discover tab, search for a model (start with Gemma 3 4B or Llama 3.2 3B), pick the Q4_K_M quantization, and download. It shows the memory estimate before you commit.
3. Load the model and chat in the playground. Attach a PDF or text file to use the built-in RAG.
4. For development, go to the Developer tab, enable Developer Mode, load a model, and click Start Server. It exposes an OpenAI-compatible API at http://localhost:1234/v1.
5. In settings, max out the GPU layers for your VRAM, set your context length, and on Apple Silicon select the MLX engine for a noticeable speed bump.
Path C: llama.cpp (maximum control)
For when you want every knob. Install via a package manager or build from source with your GPU backend:
```bash
# Easiest: package manager (macOS/Linux)
brew install llama.cpp

# Or build from source with GPU support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # NVIDIA
# -DGGML_METAL=ON (Mac) · -DGGML_HIP=ON (AMD) · -DGGML_VULKAN=ON (cross-vendor)
cmake --build build --config Release
```

```bash
# Run a model straight from Hugging Face (auto-downloads the GGUF)
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch the OpenAI-compatible server with tuning flags.
# Comments cannot follow a trailing backslash, so the flags are explained here:
#   -ngl 999              all layers on GPU
#   -c 8192               context length
#   --flash-attn on       enable flash attention
#   --cache-type-k/-v     quantize the KV cache
#   --host 0.0.0.0        listen on all interfaces (use 127.0.0.1 to stay local)
llama-server -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
  -ngl 999 \
  -c 8192 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080
```

The server gives you a web UI and an API at http://localhost:8080. Two utilities you’ll use often: llama-bench measures tokens/second on your exact hardware, and llama-quantize converts a model to a smaller quant.
When something goes wrong
Symptom | Fix |
|---|---|
Out of memory | Use a smaller model, a lower quant (Q4 instead of Q8), or reduce context length |
Painfully slow (running on CPU) | Confirm GPU detection and raise the GPU layer count |
Port already in use | Ollama defaults to 11434, LM Studio to 1234, llama.cpp to 8080 — they coexist, but don’t double-bind one port |
Garbled or repetitive output | Check the model’s prompt template; lower temperature; try a higher quant |
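When the symptom is a connection error or a port clash, a quick check of which default ports are answering narrows things down. This sketch only tests whether something accepts TCP connections on the ports listed in the table. It cannot tell you which program is listening, and the helper names are hypothetical.

```python
import socket

# Default ports from the table above.
DEFAULT_PORTS = {"Ollama": 11434, "LM Studio": 1234, "llama.cpp": 8080}

def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def report():
    """Map each tool name to whether its default port is answering."""
    return {name: port_in_use(port) for name, port in DEFAULT_PORTS.items()}

for name, listening in report().items():
    print(f"{name:10s} {'listening' if listening else 'not running'}")
```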
Recommendations & what to do next
Let’s compress everything into an action plan.
Start here, today
Install Ollama and run ollama run qwen3:8b (or gemma3:4b on a lighter machine). Confirm you get interactive speed, then point your code at localhost:11434/v1. Prefer clicking to typing? Install LM Studio and download Gemma 3 4B at Q4_K_M instead.
Then match the model to your memory and task
Your hardware | Run this |
|---|---|
8GB GPU / 16GB Mac | 7–8B (Qwen3 8B, Llama 3.1 8B) at Q4_K_M — or gpt-oss-20b on 16GB |
12–16GB | 14B (Phi-4, Qwen3 14B) at Q4/Q5 |
24GB / 32–64GB Mac | 27–32B (Gemma 3 27B, Qwen3 32B), or the Qwen3-30B-A3B MoE for speed |
48GB+ / 64–128GB Mac | 70B at Q4, or the big MoE models |
For reasoning, reach for a DeepSeek-R1 distill sized to your tier. For coding, Qwen-Coder or Devstral. For private document chat, the built-in RAG in GPT4All or LM Studio.
Scale up only when concurrency forces you to
The moment more than one user or process needs the model at once, move to vLLM (or TGI) on a proper GPU server. Until then, a llama.cpp-based tool on your own machine is simpler, cheaper, and entirely sufficient. Ollama for the laptop, vLLM for the server.
The honest caveats
A few truths to keep you grounded. The model leaderboard changes monthly — treat every specific version number and benchmark here as a snapshot, and verify the current state on the model card before you build. Benchmarks themselves disagree and are often vendor-reported, so validate on your workload. "Runs in X GB" almost always means weights only — budget extra for the KV cache. And local models have a real quality ceiling: a local 8B will not match a frontier cloud model on the hardest reasoning. Choose local for privacy, cost, offline capability, and control — not because it always wins.
Finally, privacy is not automatic just because inference is local. Some applications include telemetry; open-source tools let you verify what’s actually happening. Running models on your own hardware supports a strong privacy and compliance posture, but full compliance needs additional access, audit, and physical controls on top.
That’s the whole map. The fundamentals here — the memory math, the quantization tradeoffs, the three tooling layers, the bandwidth-over-compute principle, and the setup commands — will outlast any individual model release. The specific models will keep getting better, faster, and smaller. Which means the best time to get comfortable running them locally is right now, and the second-best time is the next time you open your terminal.
Now go pull a model. That first streamed token is waiting.