How Much VRAM Do You Need to Run an LLM Locally?
By Rui Barreira · Last updated: 13 June 2026
You can estimate VRAM requirements for any LLM with the brevio LLM Memory Calculator: select a preset (Llama 3.3 70B, Mistral 7B, Gemma 2 9B, and more), choose your quantization precision, set your context window, and instantly see model weights, KV cache, and activation memory broken out separately, plus a GPU compatibility table. All calculations run in your browser with no data sent anywhere.
The short answer: a 7B parameter model in fp16 needs at least 14GB of VRAM just for model weights. Add KV cache and activations and you need 18–20GB for a 2048-token context. An RTX 3090 (24GB) handles it in fp16; an RTX 3060 (12GB) requires int4 quantization. For 70B models, you need multi-GPU setups or cloud inference unless you use aggressive int4 quantization (35–40GB compressed).
The Three Memory Components
GPU memory consumption breaks down into three distinct components. Understanding each one prevents the most common mistake: buying hardware that can fit the model weights but runs out of memory at inference time.
1. Model Weights
The dominant cost. Every parameter in the model occupies memory according to its numerical precision. The formula is straightforward:
Model memory (GB) = parameter count × bytes per parameter
For a 7B parameter model: 7,000,000,000 × 2 bytes (fp16) = 14,000,000,000 bytes = 14 GB. This number is fixed once you choose the model and precision — it does not change with context length or batch size.
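The weight formula can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function name and the precision table are assumptions made for this example, and GB here means 10^9 bytes, matching the arithmetic above.

```python
# Bytes per parameter for common precisions (values as used in this article).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str = "fp16") -> float:
    """Return weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


print(f"7B fp16: {weight_memory_gb(7e9, 'fp16'):.1f} GB")
print(f"7B int4: {weight_memory_gb(7e9, 'int4'):.1f} GB")
```

The result depends only on parameter count and precision, which is why this term does not move when you change context length or batch size.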
2. KV Cache
The second-largest cost, and the one that surprises most practitioners. The KV (key-value) cache stores the attention states computed for all previous tokens so the model does not re-compute them on each generation step. Its size scales with:
- Context length: longer contexts store more KV states
- Batch size: each concurrent request needs its own KV cache
- Number of layers: each transformer layer contributes both K and V tensors
- Hidden dimension size: larger models store wider K and V vectors for every token
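The formula: KV cache (bytes) = 2 × layers × batch_size × context_tokens × hidden_dim × 2 bytes (fp16), where the leading 2 counts the K and V tensors. For a 7B model (32 layers, 4096 hidden dim) with 2048 context and batch size 1: 2 × 32 × 1 × 2048 × 4096 × 2 = ~1.07 GB. At 32K context, that becomes ~17 GB, larger than the 14 GB of fp16 weights. The formula assumes standard multi-head attention; models that use grouped-query attention keep fewer K/V heads and need proportionally less.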
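As a rough check of the numbers above, the KV cache formula translates directly into code. This is a sketch under the same assumptions as the formula (every layer caches full-width K and V tensors in fp16); the function and parameter names are illustrative.

```python
def kv_cache_bytes(layers: int, hidden_dim: int, context_tokens: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes, assuming full-width K and V per layer."""
    # The leading 2 accounts for one K tensor and one V tensor per layer.
    return 2 * layers * batch_size * context_tokens * hidden_dim * bytes_per_value


short_ctx = kv_cache_bytes(layers=32, hidden_dim=4096, context_tokens=2048)
long_ctx = kv_cache_bytes(layers=32, hidden_dim=4096, context_tokens=32 * 1024)

print(f"2K context:  {short_ctx / 1e9:.2f} GB")
print(f"32K context: {long_ctx / 1e9:.2f} GB")
```

Because every factor enters as a plain product, doubling the context or the batch size doubles the cache.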
3. Activations
Intermediate computation outputs during the forward pass. For inference (not training), activations are relatively small, typically around 20% of model weight memory. At training time, activations are much larger because they must be kept for every layer until backpropagation computes the gradients. For local inference at long contexts, this component is usually the smallest of the three.
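Putting the three components together gives a simple estimator. The sketch below is an approximation only: the 20% activation fraction is the rule of thumb from this section, the KV cache term assumes fp16 and full multi-head attention, and all names are chosen for illustration.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     layers: int, hidden_dim: int, context_tokens: int,
                     batch_size: int = 1, activation_fraction: float = 0.20) -> dict:
    """Break an inference VRAM estimate into its three components (decimal GB)."""
    weights = params_billion * bytes_per_param
    # 2 tensors (K and V) per layer, 2 bytes per fp16 value.
    kv_cache = 2 * layers * batch_size * context_tokens * hidden_dim * 2 / 1e9
    activations = activation_fraction * weights  # rule-of-thumb assumption
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "activations": activations,
        "total": weights + kv_cache + activations,
    }


estimate = estimate_vram_gb(7, 2, layers=32, hidden_dim=4096, context_tokens=2048)
for name, gb in estimate.items():
    print(f"{name:12s}{gb:6.2f} GB")
```

For the 7B fp16 example at 2048 tokens, the total lands just under 18 GB, in line with the 18–20GB budget quoted at the top of the article.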
Quantization Explained
Quantization reduces the numerical precision of model weights, trading a small amount of accuracy for large memory savings. Modern quantized models (GGUF format used by llama.cpp, GPTQ, and AWQ) are the standard way to run large models on consumer hardware.
| Precision | Bytes per parameter | 7B model size | 70B model size | Quality impact |
|---|---|---|---|---|
| fp32 | 4 bytes | 28 GB | 280 GB | Reference (no loss) |
| fp16 / bf16 | 2 bytes | 14 GB | 140 GB | Negligible |
| int8 | 1 byte | 7 GB | 70 GB | Minimal (~1% benchmark drop) |
| int4 | 0.5 bytes | 3.5 GB | 35 GB | Moderate (2–5% benchmark drop) |
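The rule of thumb: model weight GB ≈ parameter count in billions × bytes per parameter. A 7B model at fp16 = 7 × 2 = 14 GB. A 13B model at int4 = 13 × 0.5 = 6.5 GB. This heuristic tracks fp16 and int8 sizes closely for transformer architectures. Real 4-bit files (for example GGUF Q4_K_M) come out somewhat larger than 0.5 bytes per parameter, because they store scaling factors and keep some tensors at higher precision.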
Modern int4 quantization (particularly GGUF Q4_K_M and AWQ) achieves remarkably close results to fp16 on most benchmarks. The quality loss is most noticeable on mathematical reasoning and coding tasks. For general chat and instruction following, int4 is usually indistinguishable from fp16 in practice.
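Quantized formats are often described by their effective bits per weight rather than a clean byte count. The sketch below shows how the weight estimate shifts as that figure moves; the 4.5 and 5.0 values are illustrative placeholders, not measured sizes of any particular format.

```python
def weights_gb_from_bits(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in decimal GB given an effective bits-per-weight figure."""
    return params_billion * bits_per_weight / 8


# 4.0 is the idealized int4 case from the table; the others are hypothetical.
for bits in (4.0, 4.5, 5.0):
    size = weights_gb_from_bits(7, bits)
    print(f"{bits} bits/weight -> {size:.2f} GB for a 7B model")
```

When a downloaded model file is larger than the table predicts, the file size itself is the better number to budget against.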
Context Window and KV Cache Scaling
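Context length has a dramatic effect on KV cache memory. Running Llama 3.1 8B with a 128K context window (its rated maximum) at fp16 with batch size 1 requires approximately 17 GB of KV cache alone, slightly more than the roughly 16 GB of fp16 weights. That figure already benefits from grouped-query attention (8 KV heads rather than 32); with full multi-head attention the same context would need about 69 GB. This is why many local inference tools default to 2048–4096 token contexts even when the model supports more.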
Practical guideline: if you need long context, budget 2–4x your model weight memory for KV cache. For batch inference (serving multiple requests simultaneously), multiply KV cache by batch size — 8 concurrent requests at 2048 tokens requires 8x the single-request KV cache.
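To see how context length and batch size interact, the sketch below sweeps both for a 32-layer model. It parameterizes the cache by KV heads and head dimension, so it also covers models with grouped-query attention; the 8-KV-head setting is an illustrative assumption, and you should check the configuration of the model you actually run.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, context_tokens: int,
                batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache in decimal GB; kv_heads * head_dim is the cached width per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return per_token_bytes * context_tokens * batch_size / 1e9


for ctx in (2048, 8192, 32768):
    for batch in (1, 8):
        full = kv_cache_gb(32, 32, 128, ctx, batch)  # 32 KV heads x 128 = 4096 wide
        gqa = kv_cache_gb(32, 8, 128, ctx, batch)    # hypothetical 8 KV heads
        print(f"ctx={ctx:6d} batch={batch}  full={full:7.2f} GB  gqa={gqa:6.2f} GB")
```

With 32 KV heads of dimension 128, the cached width equals the 4096 hidden dimension, so the single-request 2048-token row matches the ~1.07 GB worked example.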
GPU Comparison
| GPU | VRAM | Max model (fp16) | Max model (int4) | Typical use case |
|---|---|---|---|---|
| RTX 3060 | 12 GB | ~5B params | ~20B params | 7B int8, 13B int4 |
| RTX 3090 / 4090 | 24 GB | ~11B params | ~44B params | 7B fp16, 13B int8, 30B int4 |
| A100 40GB | 40 GB | ~18B params | ~72B params | 13B fp16, 30B int8, 70B int4 |
| A100 80GB | 80 GB | ~36B params | ~144B params | 30B fp16, 70B int8, 130B int4 |
| H100 80GB | 80 GB | ~36B params | ~144B params | 30B fp16, 70B int8 (3x faster than A100) |
| H200 141GB | 141 GB | ~65B params | ~256B params | 70B int8 with large context |
Note: “max model” figures are upper bounds that leave only about 10% of VRAM for KV cache and activations. Real limits depend on context length and batch size, and are lower at long contexts.
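A "max model" estimate of this kind can be derived by reserving a fraction of VRAM and dividing the rest by bytes per parameter. The sketch below is one way to do that, and the headroom fractions are assumptions. The result is sensitive to that choice, so it will not match a published table unless the same headroom is used.

```python
def max_params_billion(vram_gb: float, bytes_per_param: float,
                       headroom_fraction: float) -> float:
    """Largest model (billions of parameters) whose weights fit the usable VRAM."""
    usable_gb = vram_gb * (1.0 - headroom_fraction)
    return usable_gb / bytes_per_param


gpus = [("RTX 3060", 12), ("RTX 3090 / 4090", 24), ("A100 80GB", 80)]
for headroom in (0.10, 0.20):
    for name, vram in gpus:
        fp16 = max_params_billion(vram, 2.0, headroom)
        int4 = max_params_billion(vram, 0.5, headroom)
        print(f"headroom={headroom:.0%}  {name:16s} fp16 ~{fp16:5.1f}B  int4 ~{int4:6.1f}B")
```

For long-context or batched serving, raise the headroom fraction well above these values, since the KV cache can rival the weights in size.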
Tools for Running LLMs Locally
Ollama is the easiest starting point. It handles model download, model storage, and a local HTTP API with a Docker-like CLI. Run ollama run llama3.1:8b and it downloads the default build for that tag (typically a 4-bit quantization) and loads as many layers onto the GPU as your VRAM allows. Other quantizations are available under separate tags. Free and open source.
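For scripting against Ollama's local HTTP API, a request can be built with the Python standard library alone. The sketch below only constructs the request and does not send it. It assumes a default install listening on localhost port 11434 and uses the /api/generate endpoint with the num_ctx option to set the context window; the helper name and the prompt are made up for this example.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3.1:8b",
                           num_ctx: int = 2048,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request for Ollama."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window; larger costs more KV cache
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Why is the sky blue?")
# To send it against a running Ollama server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

Keeping num_ctx explicit makes the KV cache cost of each request visible in the calling code.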
llama.cpp is the most VRAM-efficient runtime. It supports CPU offloading — layers that do not fit in VRAM spill to system RAM with a performance penalty, but the model still runs. This makes it possible to run 70B models on a 24GB GPU (with significant CPU RAM usage and slower inference). The GGUF model format was developed for llama.cpp.
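A rough way to reason about CPU offloading is to estimate how many transformer layers fit in VRAM and leave the rest in system RAM. The sketch below assumes weights are spread evenly across layers (it ignores embeddings and output layers) and reserves a hypothetical amount for KV cache and buffers. The 80-layer count is typical of Llama-family 70B models but is still an assumption here. The result is the kind of value you would pass to llama.cpp's GPU layer option (-ngl / --n-gpu-layers).

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserved_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# 70B model at int4 (~35 GB) on a 24 GB card, reserving 4 GB (hypothetical).
n_gpu = layers_on_gpu(model_gb=35, n_layers=80, vram_gb=24, reserved_gb=4)
print(f"Offload about {n_gpu} of 80 layers to the GPU")
```

Treat the output as a starting point and lower it if the runtime reports out-of-memory errors, since real per-layer sizes and buffer needs vary.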
LM Studio provides a GUI over llama.cpp with a model browser, chat interface, and local OpenAI-compatible API. Best for non-technical users. Available for macOS, Windows, and Linux.
vLLM is the production-grade choice for serving inference at scale. It uses PagedAttention to minimize KV cache memory waste, supports continuous batching for high throughput, and exposes an OpenAI-compatible API. It targets Linux and is most commonly run on recent NVIDIA GPUs; plan for 16+ GB of VRAM.
When to Use Cloud vs Local
Local inference makes economic sense when you have sustained, predictable workloads and the hardware investment amortizes over 12–18 months. An RTX 3090 at $500 used can serve roughly 10,000 7B model requests per day at near-zero marginal cost (electricity aside). At $0.10 per 1,000 tokens, 10,000 requests per day at 500 tokens average is 5 million tokens, or $500 per day. Hosted prices for small models are often far lower than that, but even at one-thirtieth of this rate the bill is about $500/month, so the GPU pays for itself within a month of heavy use.
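The break-even reasoning is easy to rerun with your own numbers. The sketch below plugs in the request volume, token count, and price from the paragraph above; the daily power cost parameter is a hypothetical addition, and the function names are illustrative.

```python
def api_cost_per_day(requests_per_day: int, tokens_per_request: int,
                     usd_per_1k_tokens: float) -> float:
    """Daily hosted-API spend for a given request volume."""
    return requests_per_day * tokens_per_request / 1000 * usd_per_1k_tokens


def breakeven_days(gpu_cost_usd: float, daily_api_cost_usd: float,
                   daily_power_cost_usd: float = 0.0) -> float:
    """Days until the GPU purchase is recovered; infinite if there is no saving."""
    daily_saving = daily_api_cost_usd - daily_power_cost_usd
    if daily_saving <= 0:
        return float("inf")
    return gpu_cost_usd / daily_saving


daily = api_cost_per_day(10_000, 500, 0.10)
print(f"Hosted API cost: ${daily:.2f} per day")
print(f"Break-even on a $500 GPU: {breakeven_days(500, daily):.1f} days")
```

The conclusion depends heavily on the per-token price, so it is worth substituting the actual rate of the hosted model you would otherwise use.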
Cloud inference is better for: variable workloads (spiky, not sustained), models larger than your GPU pool can handle, or when you need production SLAs. GPU rental on Lambda Labs, Vast.ai, or RunPod costs $0.50–$2.00/hr for A100s, which is economical for batch workloads that run a few hours per day.
Hybrid approach: local for development and testing (free), cloud for production inference or large batch jobs (pay only for what you use).
Frequently Asked Questions
How much VRAM do I need to run a 7B parameter LLM?
A 7B model in fp16 precision requires approximately 14GB of VRAM for model weights alone (7B × 2 bytes). With KV cache and activations for a 2048-token context, budget 18–20GB. An RTX 3090 (24GB) handles it comfortably in fp16; an RTX 3060 (12GB) requires int4 quantization (3.5 GB model weights + KV cache fits well within 12 GB at short contexts).
What is quantization and how does it reduce VRAM?
Quantization reduces the numerical precision of model weights: fp32 uses 4 bytes per parameter, fp16 uses 2, int8 uses 1, and int4 uses 0.5. Switching from fp16 to int4 cuts weight memory by 4x with modest quality loss; the KV cache is unaffected unless it is quantized separately. Most modern quantized models (GGUF, GPTQ, AWQ) use int4 or int8. The “K_M” suffix in GGUF names (e.g. Q4_K_M) indicates a mixed quantization strategy that keeps selected tensors, including some attention weights, at higher precision for better accuracy at a similar average bit width.
Does context window length affect VRAM?
Yes, significantly. The KV cache grows linearly with context length and batch size. Running a 7B model with a 32K context window requires approximately 16 GB more VRAM than the same model with a 2K context (about 17 GB of KV cache versus about 1 GB), more than the fp16 model weights themselves. For long-context use cases, the KV cache is often the binding constraint on VRAM, not the model size.
What is the cheapest GPU for running LLMs locally?
The RTX 3060 (12GB) is the most cost-effective entry point at around $250–300 used. It runs 7B models in int4/int8 quantization with Ollama or llama.cpp at 10–20 tokens/second. For 13B+ models, the RTX 3090 (24GB) at $450–550 used is the sweet spot — it handles 7B models in fp16 and 13B models in int8 without compromise.