What is a context window in LLMs?

A context window is the maximum total number of tokens (input + output combined) that an LLM can process in a single API call. Claude supports 200,000 tokens, GPT-4o supports 128,000, and Gemini 2.5 Pro supports 1,000,000 tokens.

What happens if my prompt exceeds the context window?

The API returns a context_length_exceeded error (OpenAI) or a similar error. The model does not automatically truncate — it rejects the request. You must reduce your input before retrying.

Does the output count toward the context window?

Yes. The total context = input tokens + output tokens. If a model has a 128K context window and your prompt uses 120K tokens, you can only generate up to 8K tokens of output.

What is the most cost-effective way to handle long documents?

Use RAG (Retrieval-Augmented Generation): split the document into 500–1,000 token chunks, embed them, and retrieve only the relevant chunks for each query. This avoids sending the entire document on every call.

guide

LLM Context Windows Explained: How to Check if Your Prompt Fits

By Rui Barreira · Last updated: 13 June 2026

Every LLM API call has a hard ceiling: the context window. Exceed it and the API refuses your request entirely — no automatic truncation, no fallback. Understanding how context windows work, how to measure your usage, and how to manage long documents is one of the most practical skills for anyone building on top of language models.

Use the brevio Context Window Visualizer to paste your prompt and instantly see what percentage of each model's context window it fills, across all nine major models compared side by side.

What Is a Context Window?

A context window is the maximum number of tokens an LLM can process in a single API call. The critical detail most developers miss: input tokens and output tokens both count against the same limit. If you send a 120,000-token prompt to GPT-4o (128K context), you have only 8,000 tokens left for the model's response.

Tokens are the fundamental unit of LLM processing — not words, not characters. English text averages roughly one token per four characters, or 0.75 tokens per word. Code and technical content with URLs, numbers, and special characters typically uses more tokens per character than natural language prose.

Context Window Sizes: 9-Model Comparison

Model	Context Window	Effective Input (leaving 4K for output)	Approximate Word Capacity
Claude Haiku 4.5	200,000 tokens	196,000 tokens	~147,000 words
Claude Sonnet 4.6	200,000 tokens	196,000 tokens	~147,000 words
Claude Opus 4.8	200,000 tokens	196,000 tokens	~147,000 words
GPT-4o	128,000 tokens	124,000 tokens	~93,000 words
GPT-4o mini	128,000 tokens	124,000 tokens	~93,000 words
Gemini 2.5 Pro	1,000,000 tokens	996,000 tokens	~747,000 words
Gemini 2.5 Flash	1,000,000 tokens	996,000 tokens	~747,000 words
Llama 3.3 70B	128,000 tokens	124,000 tokens	~93,000 words
Mistral Large 2	128,000 tokens	124,000 tokens	~93,000 words

Gemini 2.5's 1M token context window is large enough to hold the entire text of most novels. In practice, very long contexts still perform best when the most relevant content appears early or late — the "lost in the middle" effect means information buried in the middle of a 500K-token prompt is recalled less reliably than information at the edges.

Why Context Limits Matter Beyond the Error Message

The obvious consequence of exceeding a context window is an API error. But context limits have subtler effects on cost, reliability, and application design that matter well before you hit the ceiling.

Cost scales with tokens. Every token in your context costs money, whether the model "uses" it or not. A 100K-token prompt costs roughly 50x more per call than a 2K-token prompt. For Claude Sonnet at $3 per million input tokens, a single 100K-token call costs $0.30 — that adds up quickly at scale.

Latency increases with context length. LLMs process tokens sequentially during the prefill phase. A 100K-token context will have measurably higher time-to-first-token than a 2K-token context, even if your requested output is short.

KV cache complexity. The key-value (KV) cache stores intermediate attention computations for the context. Providers like Anthropic offer prompt caching that can reduce costs by up to 90% for repeated prefixes, but cache hits only occur when the prefix is identical. If your system prompt changes frequently, you lose caching benefits and pay full price for every call.

Strategies for Long Documents

When your content exceeds the context window — or when you want to manage costs on large inputs — these four strategies are the standard toolkit.

1. Chunking

Split the document into fixed-size pieces (typically 500–2,000 tokens each, with 10–20% overlap to avoid cutting across sentences). Process each chunk independently and aggregate the results. Chunking works well for extraction tasks (pull all dates from each section) but poorly for tasks requiring cross-section understanding (summarise the entire argument).

2. Summarisation Pipeline

Summarise the document hierarchically: first summarise each section independently, then summarise the summaries. This works well for broad comprehension tasks. The cost is information loss — fine-grained details from the original document do not survive multiple summarisation rounds.

3. Retrieval-Augmented Generation (RAG)

RAG is the most cost-effective strategy for large document collections. Split documents into chunks of 500–1,000 tokens, embed each chunk using a vector model (text-embedding-3-small, voyage-3-lite), and store embeddings in a vector database (Pinecone, pgvector, Qdrant). At query time, embed the user's question, retrieve the top-k most similar chunks, and include only those chunks in the LLM prompt. You typically need only 5–20 chunks per query — a prompt of 10,000 tokens instead of 100,000.

The tradeoff: RAG requires infrastructure (vector DB, embedding pipeline) and introduces retrieval latency of 50–200ms per query. For single documents under 200K tokens that you already have in memory, sending the whole document may be simpler and cheaper than building a RAG system.

4. Map-Reduce

A variant of chunking specifically for generation tasks. Map phase: process each chunk in parallel, generating a partial result (partial summary, partial answer). Reduce phase: combine all partial results into a final unified output. Map-reduce is highly parallelisable — all chunk calls can run concurrently — which minimises latency for large document processing pipelines.

KV Cache and Cost Optimisation

The KV cache stores the attention states for your prompt so they can be reused across multiple calls. Anthropic's prompt caching reduces input token costs by 90% (from $3 to $0.30 per million tokens for Claude Sonnet) for prompts that share a common prefix. To benefit from caching:

Put your system prompt and any static context at the beginning of the request
Keep dynamic content (the user's current question) at the end
Mark the cacheable prefix with cache_control: {type: "ephemeral"} in the Anthropic API
Ensure the cached prefix is at least 1,024 tokens (Anthropic's minimum cacheable block size)

OpenAI applies automatic prompt caching for prompts over 1,024 tokens with no API-level configuration required — the cache hit is reflected in the usage object as cached_tokens. For high-volume applications, prompt caching can reduce costs by 50–80% on repeated calls with the same system context.

Monitoring Context Usage in Production

All major LLM APIs return token usage in the response object. Log this for every production call:

OpenAI: response.usage.prompt_tokens, response.usage.completion_tokens, response.usage.total_tokens
Anthropic: response.usage.input_tokens, response.usage.output_tokens
Gemini: response.usageMetadata.promptTokenCount, response.usageMetadata.candidatesTokenCount

Set up alerts when input_tokens / context_limit > 0.85 — approaching the limit signals that a prompt template has grown unexpectedly or that conversation history accumulation is not being managed. A gradual increase in average token counts is an early warning sign of prompt bloat in multi-turn applications.

For multi-turn conversation applications, implement a sliding window that drops the oldest messages when the conversation history approaches 70% of the context limit. Always keep the system prompt and the user's current message; drop older assistant and user turns from the middle of the history.

DevTools Verification

The Context Window Visualizer runs entirely in your browser. Open DevTools and check the Network tab while pasting text — no POST requests are sent. Your prompt content never leaves your device. This is the same privacy guarantee as the token counter and tokenizer visualizer.

Quick Reference: Token Estimation Rules

English prose: ~1 token per 4 characters, ~0.75 tokens per word
Source code: ~1 token per 3 characters (more tokens per character than prose)
Numbers and URLs: ~1 token per 2–3 characters
A 1,000-word document ≈ 1,330 tokens
A 10-page PDF ≈ 3,500–5,000 tokens depending on content type

Frequently Asked Questions

What is a context window in LLMs?: A context window is the maximum total number of tokens (input + output combined) that an LLM can process in a single API call. Claude supports 200,000 tokens, GPT-4o supports 128,000, and Gemini 2.5 Pro supports 1,000,000 tokens.
What happens if my prompt exceeds the context window?: The API returns a context_length_exceeded error (OpenAI) or a similar error. The model does not automatically truncate — it rejects the request. You must reduce your input before retrying.
Does the output count toward the context window?: Yes. The total context = input tokens + output tokens. If a model has a 128K context window and your prompt uses 120K tokens, you can only generate up to 8K tokens of output.
What is the most cost-effective way to handle long documents?: Use RAG (Retrieval-Augmented Generation): split the document into 500–1,000 token chunks, embed them, and retrieve only the relevant chunks for each query. This avoids sending the entire document on every call.