
LLM Tokenizer Explained: How to See How Your Text Gets Tokenized

By Rui Barreira · Last updated: 13 June 2026

Before an LLM processes your text, it splits it into tokens, the fundamental units the model operates on. Understanding tokenization helps you predict costs, write more efficient prompts, and debug unexpected model behaviour. Use the brevio Tokenizer Visualizer to see approximate token boundaries highlighted directly in your text.

What Is BPE Tokenization?

Byte Pair Encoding (BPE) builds a vocabulary of frequently occurring byte sequences. Starting from individual bytes or characters, it repeatedly merges the most common adjacent pair into a new vocabulary entry until a target vocabulary size is reached (typically around 50,000 to 250,000 tokens for current models). The result is a dictionary where common words and subwords get their own single token, while rare sequences are split into multiple tokens.

For example, "tokenization" might become two tokens, "token" and "ization", depending on how frequently each piece appeared in the training corpus. "cat" is almost certainly a single token. "ChatGPT" might be two tokens: "Chat" and "GPT".
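
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any production tokenizer is implemented: it works on the characters of whole words rather than on bytes of pre-tokenized text, and the names train_bpe, merge_pair, and encode are illustrative rather than taken from a library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by comparing the pairs themselves.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        merges.append(best)
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["the"] * 10 + ["then"] * 2, num_merges=10)
print(merges)                  # [('t', 'h'), ('th', 'e'), ('the', 'n')]
print(encode("the", merges))   # ['the']
print(encode("they", merges))  # ['the', 'y']
print(encode("cat", merges))   # ['c', 'a', 't']
```

The frequent word "the" ends up as a single token, while the unseen words "they" and "cat" fall back to smaller pieces, which is the behaviour the section describes.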

Why Common Words Are Single Tokens

High-frequency words like "the", "and", "is", "for" are single tokens because they appear so often in training text that their characters are merged early. Programming keywords like "function", "return", "class" are typically single tokens in code-aware models. Conversely, technical jargon, proper nouns, and domain-specific terms that rarely appear in the training data get split into several smaller subword pieces, often at boundaries that look arbitrary to a human reader.

Surprising Token Splits

  • Numbers. "2024" often becomes ["20", "24"] or even ["2", "0", "2", "4"]. This is one reason LLMs struggle with arithmetic: the model often does not see a number as a single unit the way a human does.
  • Punctuation. Apostrophes create splits: "don't" becomes ["don", "'t"]. URLs are typically split at every slash and period.
  • Code. Operators like <= and != are often single tokens in code-aware models. Variable names with underscores may split at the underscore boundary.
  • CJK characters. Chinese, Japanese, and Korean characters are often 1–2 tokens each, so CJK text often needs more tokens than English to express the same content.
  • Leading spaces. Many tokenizers include the space before a word as part of the token. "hello" and " hello" (with a leading space) are different tokens in GPT-4's tokenizer.
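
Several of these splits happen before BPE runs at all, in a pre-tokenization step that cuts the text into pieces with a regular expression. The sketch below is a simplified, ASCII-only pattern loosely modelled on GPT-style pre-tokenizers; it is an assumption for illustration, not the pattern any provider actually ships, and the pieces it returns are pre-token chunks that BPE may split further, not final tokens.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-style pre-tokenization pattern (assumption).
PRETOKEN_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contraction suffixes
    r"| ?[A-Za-z]+"            # a word, with an optional leading space attached
    r"|\d{1,3}"                # digits in groups of at most three
    r"| ?[^\sA-Za-z\d]+"       # runs of punctuation and symbols
    r"|\s+"                    # any remaining whitespace
)


def pre_tokenize(text):
    """Split text into pre-token pieces; BPE merges never cross these boundaries."""
    return PRETOKEN_PATTERN.findall(text)


print(pre_tokenize("don't"))        # ['don', "'t"]
print(pre_tokenize("hello hello"))  # ['hello', ' hello']
print(pre_tokenize("2024"))         # ['202', '4']
print(pre_tokenize("user_id"))      # ['user', '_', 'id']
```

Because the leading space is attached to the following word, "hello" and " hello" are different pieces before the vocabulary lookup even starts.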

How Token Count Affects Cost

LLM API pricing is token-based: you pay per input token and per output token. At a fixed per-token rate, a 1,000-token prompt costs 10 times as much in input charges as a 100-token prompt. This makes tokenization practically important:

  • Removing unnecessary whitespace, comments, or boilerplate from prompts directly reduces cost.
  • Technical text with many numbers, URLs, and rare terms uses more tokens per character than natural language prose.
  • Code typically uses more tokens per character than English prose: roughly 1 token per 3 characters versus 1 per 4.
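
The arithmetic behind a token-based bill is simple enough to sketch. The prices in the example are hypothetical placeholders, not any provider's actual rates, and estimate_cost is an illustrative helper rather than part of an SDK; check the provider's current price list before relying on the numbers.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Return the estimated request cost in the same currency as the prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
short_prompt = estimate_cost(100, 0, 3.00, 15.00)
long_prompt = estimate_cost(1_000, 0, 3.00, 15.00)
print(f"{short_prompt:.4f}")  # 0.0003
print(f"{long_prompt:.4f}")   # 0.0030
```

Output tokens are usually billed at a different rate from input tokens, which is why the helper takes the two prices separately.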

Tokenizer Differences Between Providers

Provider | Tokenizer | Vocabulary Size | Exact Count
OpenAI GPT-4 | cl100k_base (tiktoken) | ~100,000 | tiktoken library or API
Claude | Anthropic proprietary | Not published | count_tokens API endpoint
Gemini | SentencePiece | ~256,000 | countTokens API
Llama 3 | BPE (tiktoken-compatible) | ~128,000 | HuggingFace tokenizer

The brevio Tokenizer Visualizer provides a word-boundary approximation useful for estimation. For exact counts before sending an API request, use each provider's official counting endpoint or library.

Token Count in Prompt Engineering

Context windows, the maximum number of tokens a model can process in a single request, vary widely: the original GPT-4 accepted 8,192 tokens (older GPT-3.5 models only 4,096), while Claude accepts 200,000. If your prompt exceeds the context window, the request is typically rejected unless you truncate the input first. Even below the limit, very long contexts can degrade model attention: the model may "lose" information from parts of a long prompt.

For retrieval-augmented generation (RAG), each retrieved chunk counts toward your prompt tokens. Knowing approximate token counts per chunk helps you stay well within context limits while maximizing the number of useful context chunks you can include.
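
One way to apply this is to give retrieved chunks a fixed token budget and fill it greedily. The sketch below assumes the chunks arrive sorted by relevance and uses a rough characters-divided-by-four counter as a stand-in; select_chunks and approx_count are hypothetical helpers, and a real pipeline would pass in the provider's exact counter instead.

```python
import math


def approx_count(text):
    """Rough stand-in for a real tokenizer: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def select_chunks(chunks, context_window, prompt_tokens,
                  reserved_output_tokens, count_tokens=approx_count):
    """Greedily keep chunks (assumed sorted by relevance) that fit the token budget."""
    budget = context_window - prompt_tokens - reserved_output_tokens
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used


chunks = ["a" * 400, "b" * 800, "c" * 200]  # about 100, 200, and 50 tokens
selected, used = select_chunks(chunks, context_window=950,
                               prompt_tokens=500, reserved_output_tokens=200)
print(len(selected), used)  # 2 150
```

Reserving tokens for the model's output matters because the response shares the same context window as the prompt.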

DevTools Verification

The Tokenizer Visualizer runs entirely in JavaScript. Open DevTools and check the Network tab while using the tool — no POST requests are made. Your prompt content stays in your browser.

Quick Estimation Rules

  • English prose: ~1 token per 4 characters, or ~0.75 words per token (about 1.3 tokens per word)
  • Code: ~1 token per 3 characters
  • Numbers and URLs: ~1 token per 2–3 characters
  • A rough formula: tokens ≈ characters / 4 is usually close enough for budgeting typical English text, but treat it as an estimate; it undercounts code, numbers, URLs, and non-English text
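
These rules of thumb translate directly into a small estimator. This is a sketch of the heuristics above, not a tokenizer: the ratios are the approximate figures from this list (with 2.5 chosen as the midpoint of the 2–3 range), and estimate_tokens is an illustrative name.

```python
import math

# Approximate characters per token, taken from the rules of thumb above.
CHARS_PER_TOKEN = {
    "prose": 4.0,    # English prose
    "code": 3.0,     # source code
    "numeric": 2.5,  # numbers and URLs (midpoint of 2-3)
}


def estimate_tokens(text, kind="prose"):
    """Estimate the token count of `text` from its length in characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


print(estimate_tokens("a" * 400))          # 100
print(estimate_tokens("x" * 300, "code"))  # 100
```

For anything that affects billing or hard context limits, an exact count from the provider's own tokenizer is still the safer choice.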

Frequently Asked Questions

Why does "ChatGPT" tokenize as two tokens but "chat" as one?
BPE (Byte Pair Encoding) builds its vocabulary from the most frequent byte sequences in the training corpus. Common words like "chat" appear frequently enough to get their own token. Less common combinations like "ChatGPT" or any compound proper noun are split into smaller pieces that are already in the vocabulary, such as "Chat" and "GPT".

Does token count affect response quality?
Indirectly. Long prompts that approach the context limit may cause the model to truncate or lose coherence in earlier context. Token count directly affects cost and latency. Response quality is more affected by instruction clarity than by length.

Why do numbers tokenize so inefficiently?
Numbers like "2024" often tokenize as multiple tokens because each digit or digit group gets its own token in many vocabularies. GPT-4's tokenizer, for example, splits long numbers into groups of up to three digits, so the group boundaries do not line up with place value. This is one reason arithmetic can be error-prone in LLMs: the model often does not "see" a number as a whole unit.

Is the tokenizer the same for all models?
No. OpenAI's GPT-4 uses cl100k_base (tiktoken). Claude uses Anthropic's tokenizer. Gemini uses SentencePiece. Llama models use their own BPE vocabularies. For exact token counts per model, use the provider's official tokenizer or API token counting endpoints.