How Much VRAM Do You Need to Run an LLM Locally?

By Rui Barreira · Last updated: 13 June 2026

You can estimate VRAM requirements for any LLM with the brevio LLM Memory Calculator: select a preset (Llama 3.3 70B, Mistral 7B, Gemma 2 9B, and more), choose your quantization precision, set your context window, and see model weights, KV cache, and activation memory broken out separately, plus a GPU compatibility table. All calculations run in your browser with no data sent anywhere.

The short answer: a 7B parameter model in fp16 needs about 14 GB of VRAM for the model weights alone. Add KV cache and activations and you need 18–20 GB for a 2048-token context. An RTX 3090 (24 GB) handles it in fp16; an RTX 3060 (12 GB) needs int8 or int4 quantization. A 70B model needs a multi-GPU setup or cloud inference unless you quantize to int4, which brings the weights down to 35–40 GB.

The Three Memory Components

GPU memory consumption breaks down into three distinct components. Understanding each one prevents the most common mistake: buying hardware that can fit the model weights but runs out of memory at inference time.

1. Model Weights

The dominant cost. Every parameter in the model occupies memory according to its numerical precision. The formula is straightforward:

Model memory (bytes) = parameter count × bytes per parameter

For a 7B parameter model: 7,000,000,000 × 2 bytes (fp16) = 14,000,000,000 bytes = 14 GB. This number is fixed once you choose the model and precision — it does not change with context length or batch size.
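The weight formula is simple enough to script. The sketch below is a minimal illustration (the function and constant names are invented for this example); it also shows why a tool that reports binary gibibytes will display a slightly smaller number than the decimal GB used in this guide.

```python
def weight_memory_bytes(num_params: int, bytes_per_param: float) -> float:
    """Memory occupied by the model weights alone, in bytes."""
    return num_params * bytes_per_param


GB = 1e9       # decimal gigabyte, the unit used throughout this guide
GIB = 2 ** 30  # binary gibibyte, which some monitoring tools report instead

size = weight_memory_bytes(7_000_000_000, 2)  # 7B parameters at fp16
print(f"{size / GB:.1f} GB")    # 14.0 GB
print(f"{size / GIB:.2f} GiB")  # 13.04 GiB
```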

2. KV Cache

The second-largest cost, and the one that surprises most practitioners. The KV (key-value) cache stores the attention states computed for all previous tokens so the model does not re-compute them on each generation step. Its size scales with:

  • Context length: longer contexts store more KV states
  • Batch size: each concurrent request needs its own KV cache
  • Number of layers: each transformer layer contributes both K and V tensors
  • Hidden dimension size: larger models have wider K and V tensors in every layer

The formula: KV cache (bytes) = 2 × layers × batch_size × context_tokens × hidden_dim × 2 bytes (fp16). For a 7B model (32 layers, 4096 hidden dim) with 2048 context and batch size 1: 2 × 32 × 1 × 2048 × 4096 × 2 = ~1.07 GB. At 32K context, that becomes ~17 GB — larger than the model itself for some architectures.
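As a quick check, the formula above can be written as a small function. This is a sketch with illustrative names; it assumes full multi-head attention, where the K and V tensors are as wide as the hidden dimension.

```python
def kv_cache_bytes(layers: int, hidden_dim: int, context_tokens: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for full multi-head attention."""
    # The leading 2 accounts for one K tensor and one V tensor per layer.
    return 2 * layers * batch_size * context_tokens * hidden_dim * bytes_per_value


short_ctx = kv_cache_bytes(layers=32, hidden_dim=4096, context_tokens=2048)
long_ctx = kv_cache_bytes(layers=32, hidden_dim=4096, context_tokens=32 * 1024)

print(f"{short_ctx / 1e9:.2f} GB")  # 1.07 GB
print(f"{long_ctx / 1e9:.2f} GB")   # 17.18 GB
```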

3. Activations

Intermediate computation outputs during the forward pass. For inference (not training), activations are relatively small; a common budget is around 20% of model weight memory. At training time, activations are much larger because each layer's outputs must be kept for backpropagation. For local inference, once the context grows past a few thousand tokens, this component is the smallest of the three.
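Putting the three components together gives a rough total. The sketch below is an estimate under this guide's assumptions, not a measurement: fp16 KV cache, full multi-head attention, and activations budgeted as a fixed fraction of the weights. All names are illustrative.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     layers: int, hidden_dim: int, context_tokens: int,
                     batch_size: int = 1,
                     activation_fraction: float = 0.20) -> dict:
    """Rough VRAM breakdown in decimal GB."""
    weights = params_billion * bytes_per_param
    # 2 tensors (K and V) per layer, 2 bytes per fp16 value
    kv_cache = 2 * layers * batch_size * context_tokens * hidden_dim * 2 / 1e9
    activations = activation_fraction * weights  # assumed budget, not measured
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "activations": activations,
        "total": weights + kv_cache + activations,
    }


breakdown = estimate_vram_gb(7, 2, layers=32, hidden_dim=4096, context_tokens=2048)
for name, gb in breakdown.items():
    print(f"{name:>12}: {gb:6.2f} GB")
```

For the 7B fp16 example this comes to just under 18 GB, the low end of the 18–20 GB budget quoted earlier.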

Quantization Explained

Quantization reduces the numerical precision of model weights, trading a small amount of accuracy for large memory savings. Quantized models (in the GGUF format used by llama.cpp, or produced with GPTQ or AWQ) are the standard way to run large models on consumer hardware.

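Precision   | Bytes per parameter | 7B model size | 70B model size | Quality impact
------------|---------------------|---------------|----------------|-------------------------------
fp32        | 4 bytes             | 28 GB         | 280 GB         | Reference (no loss)
fp16 / bf16 | 2 bytes             | 14 GB         | 140 GB         | Negligible
int8        | 1 byte              | 7 GB          | 70 GB          | Minimal (~1% benchmark drop)
int4        | 0.5 bytes           | 3.5 GB        | 35 GB          | Moderate (2–5% benchmark drop)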

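The rule of thumb: model weight GB ≈ parameter count in billions × bytes per parameter. A 7B model at fp16 = 7 × 2 = 14 GB. A 13B model at int4 = 13 × 0.5 = 6.5 GB. The heuristic is close for fp16 weights; real 4-bit files usually come out somewhat larger, because they also store per-block scale factors and often keep some tensors at higher precision.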
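The rule of thumb is easy to tabulate for other model sizes. The sketch below uses the nominal bytes-per-parameter values from this section; the dictionary and function names are invented for the example, and the results are nominal sizes rather than actual file sizes.

```python
# Nominal bytes per parameter for each precision discussed above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Nominal weight size in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    sizes = ", ".join(f"{b}B = {weight_gb(b, precision):g} GB" for b in (7, 13, 70))
    print(f"{precision:>10}: {sizes}")
```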

Modern int4 quantization (particularly GGUF Q4_K_M and AWQ) achieves remarkably close results to fp16 on most benchmarks. The quality loss is most noticeable on mathematical reasoning and coding tasks. For general chat and instruction following, int4 is usually indistinguishable from fp16 in practice.

Context Window and KV Cache Scaling

Context length has a dramatic effect on KV cache memory. Running Llama 3.1 8B with a 128K context window (its rated maximum) at fp16 with batch size 1 requires approximately 17 GB of KV cache alone, about as much as the fp16 weights. That figure already benefits from grouped-query attention (8 KV heads instead of 32), which cuts the cache to a quarter of what the full-hidden-dimension formula gives. This is why many local inference tools default to 2048–4096 token contexts even when the model supports more.

Practical guideline: if you need long context, budget anywhere from 1x to 4x your model weight memory for KV cache, depending on the architecture, and compute the figure for your specific model rather than guessing. For batch inference (serving multiple requests simultaneously), multiply KV cache by batch size: 8 concurrent requests at 2048 tokens require 8x the single-request KV cache.
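Many recent models use grouped-query attention, where the cache stores fewer KV heads than the model has attention heads, so the per-layer width is kv_heads × head_dim rather than the full hidden dimension. The sketch below makes that explicit. The architecture values (32 layers, 8 KV heads, head dimension 128) are assumptions modeled on an 8B Llama-style configuration; check them against the config of the model you actually run.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache in decimal GB when only `kv_heads` heads are cached per layer."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return per_token_bytes * context_tokens * batch_size / 1e9


# Assumed 8B-class configuration: 32 layers, 8 KV heads, head dimension 128.
full_context = kv_cache_gb(32, 8, 128, context_tokens=128 * 1024)
one_request = kv_cache_gb(32, 8, 128, context_tokens=2048)
eight_requests = kv_cache_gb(32, 8, 128, context_tokens=2048, batch_size=8)

print(f"128K context, batch 1: {full_context:.2f} GB")    # 17.18 GB
print(f"2K context, batch 1:   {one_request:.2f} GB")     # 0.27 GB
print(f"2K context, batch 8:   {eight_requests:.2f} GB")  # 2.15 GB
```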

GPU Comparison

GPU             | VRAM   | Max model (fp16) | Max model (int4) | Typical use case
----------------|--------|------------------|------------------|---------------------------------------
RTX 3060        | 12 GB  | ~5B params       | ~20B params      | 7B int8, 13B int4
RTX 3090 / 4090 | 24 GB  | ~11B params      | ~44B params      | 7B fp16, 13B int8, 30B int4
A100 40GB       | 40 GB  | ~18B params      | ~72B params      | 30B int8, 70B int4
A100 80GB       | 80 GB  | ~36B params      | ~144B params     | 70B int8, 130B int4
H100 80GB       | 80 GB  | ~36B params      | ~144B params     | 70B int8 (up to ~3x faster than A100)
H200 141GB      | 141 GB | ~65B params      | ~256B params     | 70B int8 with large context

Note: “max model” figures assume roughly 10–20% headroom for KV cache and activations. Real limits depend on context length and batch size.
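The "max model" columns follow from one line of arithmetic: usable VRAM divided by bytes per parameter. The sketch below is a rough fit check with the headroom fraction as an adjustable parameter (20% by default here); the function names are illustrative, and it ignores context length, so treat a "fits" result as a starting point only.

```python
def max_params_billion(vram_gb: float, bytes_per_param: float,
                       headroom: float = 0.20) -> float:
    """Largest model (billions of parameters) whose weights fit after reserving headroom."""
    return vram_gb * (1 - headroom) / bytes_per_param


def fits(params_billion: float, bytes_per_param: float,
         vram_gb: float, headroom: float = 0.20) -> bool:
    """True if the weights fit in VRAM with the given headroom left over."""
    return params_billion <= max_params_billion(vram_gb, bytes_per_param, headroom)


print(f"12 GB card, fp16: ~{max_params_billion(12, 2):.1f}B parameters")
print("7B fp16 on 24 GB:  ", fits(7, 2, 24))    # True
print("13B int4 on 12 GB: ", fits(13, 0.5, 12)) # True
print("70B fp16 on 80 GB: ", fits(70, 2, 80))   # False
```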

Tools for Running LLMs Locally

Ollama is the easiest starting point. It handles model download, ships a default quantization for each model tag, and exposes a local HTTP API, all behind a Docker-like CLI. Run ollama run llama3.1:8b and it pulls that tag's default build (typically a 4-bit GGUF quantization) and offloads as many layers to the GPU as fit in your available VRAM. Free and open source.
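Once a model is pulled, the local HTTP API can be called from any language. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port (11434) with the llama3.1:8b tag already downloaded; the helper names are invented for this example.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example call, commented out because it needs a running Ollama server:
# print(generate("llama3.1:8b", "Explain the KV cache in one sentence."))
```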

llama.cpp is the most flexible runtime when VRAM is tight. It supports CPU offloading: layers that do not fit in VRAM stay in system RAM and run on the CPU, at a performance penalty, but the model still runs. This makes it possible to run 70B models on a 24 GB GPU (with significant CPU RAM usage and slower inference). The GGUF model format was developed for llama.cpp.
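To get a feel for how a split between GPU and CPU works out, the sketch below estimates how many layers fit in VRAM. It is a back-of-envelope model with loud assumptions: weights are spread evenly across layers (embeddings and the output layer are ignored), a 70B model has 80 layers, and 4 GB of VRAM is held back for KV cache and runtime buffers. The function name and the reserve figure are hypothetical.

```python
def gpu_layers_that_fit(model_gb: float, total_layers: int,
                        vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many transformer layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb // per_layer_gb))


# Assumed example: 70B model at int4 (~35 GB), 80 layers, 24 GB GPU, 4 GB reserved.
on_gpu = gpu_layers_that_fit(model_gb=35, total_layers=80, vram_gb=24, reserve_gb=4)
on_cpu_gb = (80 - on_gpu) * 35 / 80

print(f"layers on GPU: {on_gpu} of 80")            # 45 of 80
print(f"left in system RAM: {on_cpu_gb:.1f} GB")   # 15.3 GB
```

In llama.cpp the resulting layer count is what you would pass to the --n-gpu-layers (-ngl) option; in practice it takes some trial and error, since real layers are not perfectly uniform.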

LM Studio provides a GUI over llama.cpp with a model browser, chat interface, and local OpenAI-compatible API. Best for non-technical users. Available for macOS, Windows, and Linux.

vLLM is the production-grade choice for serving inference at scale. It uses PagedAttention to minimize KV cache memory waste, supports continuous batching for high throughput, and exposes an OpenAI-compatible API. It primarily targets Linux with a recent NVIDIA GPU; plan on 16+ GB of VRAM.
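The idea behind paged KV cache allocation can be illustrated with a toy calculation. This is a simplified model of the concept, not vLLM's actual allocator: it compares reserving the full context window for every request against handing out fixed-size blocks as sequences grow. The block size of 16 tokens and the request lengths are assumptions chosen for the example.

```python
import math


def reserved_tokens_contiguous(seq_lens: list[int], max_context: int) -> int:
    """Each request reserves cache slots for the full context window up front."""
    return len(seq_lens) * max_context


def reserved_tokens_paged(seq_lens: list[int], block_size: int = 16) -> int:
    """Each request holds only as many fixed-size blocks as its tokens need."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


requests = [100, 700, 1500, 2048]  # hypothetical current sequence lengths
used = sum(requests)

contiguous = reserved_tokens_contiguous(requests, max_context=2048)
paged = reserved_tokens_paged(requests)

print(f"tokens in use:       {used}")        # 4348
print(f"contiguous reserved: {contiguous}")  # 8192
print(f"paged reserved:      {paged}")       # 4368
```

In this toy case the paged scheme reserves only 20 slots more than are actually used, while the contiguous scheme reserves nearly twice as many.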

When to Use Cloud vs Local

Local inference makes economic sense when you have sustained, predictable workloads and the hardware investment amortizes over 12–18 months. An RTX 3090 at $500 used runs roughly 10,000 7B model requests per day at near-zero marginal cost (electricity aside). At a hosted API price of $0.10 per 1,000 tokens, 10,000 requests averaging 500 tokens each come to 5 million tokens, or $500 per day, so at that volume and price the GPU pays for itself within days of heavy use.
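Working the paragraph's figures through explicitly helps when you substitute your own volumes and prices. The sketch below is a simple break-even estimate; the function names are illustrative, and the optional daily power cost is a hypothetical parameter you would fill in from your own electricity rate.

```python
def hosted_cost_per_day(requests_per_day: int, tokens_per_request: int,
                        price_per_1k_tokens: float) -> float:
    """Daily spend on a hosted API billed per 1,000 tokens."""
    return requests_per_day * tokens_per_request / 1000 * price_per_1k_tokens


def break_even_days(gpu_price: float, daily_hosted_cost: float,
                    daily_power_cost: float = 0.0) -> float:
    """Days until the GPU purchase is covered by avoided API spend."""
    daily_savings = daily_hosted_cost - daily_power_cost
    if daily_savings <= 0:
        return float("inf")  # local never pays off at these numbers
    return gpu_price / daily_savings


daily = hosted_cost_per_day(10_000, 500, 0.10)
print(f"hosted cost: ${daily:.2f} per day")
print(f"break-even:  {break_even_days(500, daily):.1f} days")
```

At a tenth of that volume or price the break-even stretches to about ten days, which is still well inside the amortization window discussed above.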

Cloud inference is better for: variable workloads (spiky, not sustained), models larger than your GPU pool can handle, or when you need production SLAs. GPU rental on Lambda Labs, Vast.ai, or RunPod costs $0.50–$2.00/hr for A100s, which is economical for batch workloads that run a few hours per day.

Hybrid approach: local for development and testing (free), cloud for production inference or large batch jobs (pay only for what you use).

Frequently Asked Questions

How much VRAM do I need to run a 7B parameter LLM?

A 7B model in fp16 precision requires approximately 14 GB of VRAM for model weights alone (7B × 2 bytes). With KV cache and activations for a 2048-token context, budget 18–20 GB. An RTX 3090 (24 GB) handles it comfortably in fp16; an RTX 3060 (12 GB) needs int8 or int4 quantization (at int4, 3.5 GB of model weights plus KV cache fits well within 12 GB at short contexts).

What is quantization and how does it reduce VRAM?

Quantization reduces the numerical precision of model weights: fp32 uses 4 bytes per parameter, fp16 uses 2, int8 uses 1, and int4 uses 0.5. Switching from fp16 to int4 cuts weight memory by 4x with modest quality loss; the KV cache is not affected. Most modern quantized models (GGUF, GPTQ, AWQ) use int4 or int8. The “K_M” suffix in GGUF names (e.g. Q4_K_M) indicates a k-quant “medium” mix, which keeps selected tensors at higher precision for better accuracy at a slightly higher average bit width.

Does context window length affect VRAM?

Yes, significantly. The KV cache grows linearly with context length and batch size. Running a 7B model with a 32K context window requires approximately 16 GB more VRAM than the same model with a 2K context (about 17 GB of KV cache versus about 1 GB), which is more than the model weights themselves. For long-context use cases, the KV cache is often the binding constraint on VRAM, not the model size.

What is the cheapest GPU for running LLMs locally?

The RTX 3060 (12GB) is the most cost-effective entry point at around $250–300 used. It runs 7B models in int4/int8 quantization with Ollama or llama.cpp at 10–20 tokens/second. For 13B+ models, the RTX 3090 (24GB) at $450–550 used is the sweet spot — it handles 7B models in fp16 and 13B models in int8 without compromise.
