
Few-Shot Prompting: How to Write Better LLM Prompts with Examples

By Rui Barreira · Last updated: 13 June 2026

Few-shot prompting is one of the most reliable techniques in prompt engineering. By showing a model two to ten worked examples before your actual query, you give it a concrete pattern to follow, with no training, fine-tuning, or weight updates required. Use the brevio Few-Shot Prompt Builder to assemble your examples and export a ready-to-paste JSON payload for OpenAI or Claude, or a plain-text prompt.

What Is Few-Shot Prompting?

Few-shot prompting is a form of in-context learning: instead of telling a model what to do in abstract terms, you show it what you mean through input/output examples embedded in the prompt itself. The model reads those examples, infers the pattern, and applies it to your actual query — all within a single API call.

The name comes from machine learning terminology, where "few-shot" learning means learning a task from very few labelled examples. In prompting, the analogy holds loosely: the examples act as on-the-fly training data that shapes the model's behaviour for that call only, without changing its weights.

Zero-Shot vs One-Shot vs Few-Shot

Understanding the spectrum helps you choose the right approach for each task.

Technique	Examples	Best for	Token cost
Zero-shot	0	Simple, well-understood tasks	Minimal
One-shot	1	Format hints, disambiguation	Low
Few-shot	2–10	Complex format, tone, style	Medium
Many-shot	10+	Highly consistent classification	High

Zero-shot works well for tasks the model has seen extensively during training — summarisation, translation, simple Q&A. Add examples when the output format is unusual, the task requires a specific tone, or consistency matters more than creativity.

Why Examples Work

Examples do three things that natural-language instructions often cannot. First, they calibrate tone: showing five responses in a breezy, casual register is more reliable than writing "be casual and friendly." Second, they specify format: showing JSON output is unambiguous in a way that "respond in JSON" sometimes is not. Third, they set implicit constraints: if all your examples show two-sentence responses, the model picks up the expected length without being told.

This is why few-shot prompts often outperform lengthy instruction paragraphs on tasks like structured output extraction, style imitation, and classification with nuanced categories.

How Many Examples to Use

The research consensus is that 3–5 examples is the sweet spot for most tasks. A 2023 meta-analysis of GPT-4 benchmarks found that moving from zero-shot to three-shot improved accuracy by an average of 12% across classification tasks, while moving from three-shot to ten-shot added only a further 3–4%.

Quality matters far more than quantity. One well-chosen example that precisely represents the target pattern is worth more than five mediocre ones. If your examples contain errors, inconsistencies, or edge cases that don't reflect the common case, the model will generalise from those too.

The practical upper limit is set by context window size and cost. Examples are resent as input tokens on every call, so at GPT-4o pricing, 10 examples in every prompt across 10,000 daily calls (100,000 example copies a day) can add meaningfully to your monthly bill. If you need more than 10 examples for reliable performance, fine-tuning is often cheaper at scale.
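To make the cost point concrete, here is a rough back-of-the-envelope sketch. The function name, the 40-tokens-per-example figure, and the price per million input tokens are all assumptions for illustration; substitute your own measured token counts and your provider's current price list.

```python
def monthly_example_cost(
    n_examples: int,
    tokens_per_example: int,
    calls_per_day: int,
    usd_per_million_input_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Return the monthly USD cost of the input tokens spent on examples alone."""
    extra_tokens_per_call = n_examples * tokens_per_example
    monthly_tokens = extra_tokens_per_call * calls_per_day * days_per_month
    return monthly_tokens / 1_000_000 * usd_per_million_input_tokens


# Assumed figures, for illustration only: 10 examples of about 40 tokens each,
# 10,000 calls per day, and a placeholder price of $2.50 per million input tokens.
cost = monthly_example_cost(
    n_examples=10,
    tokens_per_example=40,
    calls_per_day=10_000,
    usd_per_million_input_tokens=2.50,
)
print(f"Examples add about ${cost:,.2f} per month")
```

Under these assumed numbers the examples alone account for 120 million input tokens a month. Longer examples or higher call volumes scale that figure linearly.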

Negative Examples: Showing What NOT to Do

Most practitioners focus on positive examples (desired outputs). Negative examples — showing the wrong output alongside the correct one — can help for tasks where the failure mode is predictable. For instance, if a classification prompt keeps returning "Neutral" for mildly negative reviews, adding one example that shows: "This took forever" → Negative (not Neutral) can fix the bias.

Use negative examples sparingly: they add tokens and complexity, and they can introduce confusion if the model pattern-matches on the negative output rather than the correction.

API Format Differences

Where your examples live in the API payload depends on which provider you are using. The brevio Few-Shot Builder handles all three formats — you write your examples once and switch between tabs to get the right JSON.

OpenAI (messages array)

In OpenAI's Chat Completions API, examples are injected as alternating user/assistant message objects inside the messages array, after the system message and before the final user query:

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "Classify sentiment as Positive, Negative, or Neutral." },
    { "role": "user", "content": "The checkout was seamless." },
    { "role": "assistant", "content": "Positive" },
    { "role": "user", "content": "Waited 20 minutes for a reply." },
    { "role": "assistant", "content": "Negative" },
    { "role": "user", "content": "[YOUR INPUT HERE]" }
  ]
}
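If you assemble this payload in code rather than by hand, a small helper keeps the user/assistant alternation consistent. The following is a minimal sketch; `build_openai_payload` is a hypothetical helper name, not part of any SDK, and it only builds the dictionary without sending a request.

```python
import json


def build_openai_payload(
    task: str,
    examples: list[tuple[str, str]],
    query: str,
    model: str = "gpt-4o",
) -> dict:
    """Build a Chat Completions style payload with few-shot examples."""
    messages = [{"role": "system", "content": task}]
    # Each example becomes one user message followed by one assistant message.
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real query always goes last, as a user message.
    messages.append({"role": "user", "content": query})
    return {"model": model, "messages": messages}


payload = build_openai_payload(
    task="Classify sentiment as Positive, Negative, or Neutral.",
    examples=[
        ("The checkout was seamless.", "Positive"),
        ("Waited 20 minutes for a reply.", "Negative"),
    ],
    query="The parcel arrived on time.",
)
print(json.dumps(payload, indent=2))
```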

Claude (Anthropic Messages API)

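Claude separates the system prompt from the conversation. Examples go into the messages array as user/assistant pairs, while the task description stays in the top-level system field. A complete request also needs the model and max_tokens fields, which the Messages API requires:

{
  "model": "[YOUR MODEL ID]",
  "max_tokens": 16,
  "system": "Classify sentiment as Positive, Negative, or Neutral.",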
  "messages": [
    { "role": "user", "content": "The checkout was seamless." },
    { "role": "assistant", "content": "Positive" },
    { "role": "user", "content": "Waited 20 minutes for a reply." },
    { "role": "assistant", "content": "Negative" },
    { "role": "user", "content": "[YOUR INPUT HERE]" }
  ]
}
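The same structure can be generated in code. This is a minimal sketch with a hypothetical helper name, `build_claude_payload`; it takes `model` and `max_tokens` as arguments because the Messages API requires both, and the model ID in the usage lines is a placeholder, not a real model name.

```python
import json


def build_claude_payload(
    task: str,
    examples: list[tuple[str, str]],
    query: str,
    model: str,
    max_tokens: int = 16,
) -> dict:
    """Build a Messages API style payload with few-shot examples."""
    messages = []
    # Examples are user/assistant pairs; the task stays out of this list.
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": task,  # top-level field, not a message
        "messages": messages,
    }


claude_payload = build_claude_payload(
    task="Classify sentiment as Positive, Negative, or Neutral.",
    examples=[
        ("The checkout was seamless.", "Positive"),
        ("Waited 20 minutes for a reply.", "Negative"),
    ],
    query="The parcel arrived on time.",
    model="your-model-id",  # placeholder: replace with a model ID you have access to
)
print(json.dumps(claude_payload, indent=2))
```

A small `max_tokens` suits single-label classification; raise it for tasks with longer outputs.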

Plain text

For models without a structured API — local models via llama.cpp, Ollama, or raw completions endpoints — the plain-text format embeds everything in a single string with Task/User/Assistant labels separated by blank lines.
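As a rough illustration of that layout, the sketch below joins a task line and labelled example pairs with blank lines. The exact label text and the trailing "Assistant:" cue are assumptions; some local models respond better to other label conventions, so treat this as a starting point.

```python
def build_plain_text_prompt(
    task: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Build a single-string few-shot prompt with Task/User/Assistant labels."""
    blocks = [f"Task: {task}"]
    for user_text, assistant_text in examples:
        blocks.append(f"User: {user_text}\nAssistant: {assistant_text}")
    # End with an open "Assistant:" label so the model completes the answer.
    blocks.append(f"User: {query}\nAssistant:")
    # A blank line separates the task and each example block.
    return "\n\n".join(blocks)


prompt = build_plain_text_prompt(
    task="Classify sentiment as Positive, Negative, or Neutral.",
    examples=[
        ("The checkout was seamless.", "Positive"),
        ("Waited 20 minutes for a reply.", "Negative"),
    ],
    query="The parcel arrived on time.",
)
print(prompt)
```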

Example Selection Strategy

Random sampling from your dataset rarely produces the best few-shot set. These principles improve selection:

Cover the distribution. If your task has three categories, include at least one example per category. Imbalanced examples produce imbalanced outputs.

Choose unambiguous examples. Pick inputs where the correct output is obvious. Borderline cases confuse rather than calibrate.

Match the expected difficulty. If most real queries are short and simple, your examples should be too. Long examples nudge the model toward long outputs.

Order matters (slightly). Research shows models give slightly more weight to examples near the end of the prompt. Put your best, most representative example last.
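Three of these principles (coverage, unambiguous examples, and ordering) can be sketched as a small selection routine. Everything here is an assumption for illustration: the `select_examples` name, the dictionary layout, and especially the `score` field, which stands in for whatever measure of clarity you have, such as annotator agreement. Matching difficulty to real queries is not covered.

```python
from collections import defaultdict


def select_examples(pool: list[dict], per_label: int = 1) -> list[dict]:
    """Pick the clearest examples per label and put the strongest one last.

    Each item in `pool` is a dict with 'text', 'label', and 'score' keys,
    where a higher score means a more clear-cut example (hypothetical field).
    """
    by_label = defaultdict(list)
    for item in pool:
        by_label[item["label"]].append(item)

    chosen = []
    for items in by_label.values():
        # Cover the distribution: take the top-scoring items for every label.
        ranked = sorted(items, key=lambda item: item["score"], reverse=True)
        chosen.extend(ranked[:per_label])

    # Ascending sort places the clearest example at the end of the prompt.
    return sorted(chosen, key=lambda item: item["score"])


pool = [
    {"text": "The checkout was seamless.", "label": "Positive", "score": 0.95},
    {"text": "It works, I guess.", "label": "Positive", "score": 0.40},
    {"text": "Waited 20 minutes for a reply.", "label": "Negative", "score": 0.90},
    {"text": "Order arrived on Tuesday.", "label": "Neutral", "score": 0.80},
]
selected = select_examples(pool)
for item in selected:
    print(item["label"], "|", item["text"])
```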

When Few-Shot Is Not Enough

Few-shot prompting has real limits. If you are making thousands of calls per day, the token cost of repeated examples accumulates fast. If your task requires highly specialised knowledge not present in the base model, examples alone cannot inject it. If consistency must be near-perfect across millions of calls, prompt variance will eventually produce errors that examples cannot prevent.

The rule of thumb: consider fine-tuning when you have 50+ high-quality annotated examples, are making more than 50,000 calls per month for the same task, and few-shot performance has plateaued despite varying your examples. At that scale, fine-tuning reduces per-call cost (shorter prompts) and typically improves consistency.

Verification: Testing Your Few-Shot Prompt

Before deploying a few-shot prompt in production, run at least 20 test cases through it and measure precision and recall against your expected outputs. Change one example at a time and observe the effect; this isolates which examples are load-bearing. If a single example swap changes accuracy by more than 5 percentage points, your prompt is brittle and needs more examples to average out that sensitivity. With only 20 test cases, one changed answer already moves accuracy by 5 points, so use a larger test set when measuring this.
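A minimal scoring sketch for that test run follows. The `evaluate` function is a hypothetical helper, and the `predictions` list is hard-coded as a stand-in for the outputs your model returns on each test case.

```python
def evaluate(predictions: list[str], expected: list[str]) -> dict:
    """Return overall accuracy plus per-label precision and recall."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equal in length")

    pairs = list(zip(predictions, expected))
    report = {
        "accuracy": sum(p == e for p, e in pairs) / len(pairs),
        "per_label": {},
    }
    for label in sorted(set(expected) | set(predictions)):
        true_pos = sum(p == label and e == label for p, e in pairs)
        predicted = sum(p == label for p, _ in pairs)
        actual = sum(e == label for _, e in pairs)
        report["per_label"][label] = {
            # Report 0.0 when a label was never predicted or never expected.
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return report


# Stand-in data: replace `predictions` with real model outputs.
expected = ["Positive", "Negative", "Neutral", "Negative"]
predictions = ["Positive", "Neutral", "Neutral", "Negative"]
report = evaluate(predictions, expected)
print(report["accuracy"])
print(report["per_label"]["Negative"])
```

Running the same test set before and after swapping one example, then comparing the two accuracy figures, gives the sensitivity check described above.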

Use the Few-Shot Prompt Builder to export the formatted payload, drop it into your testing environment, and iterate.

FAQ

What is few-shot prompting?

Few-shot prompting means providing 2–10 worked examples (input/output pairs) in the prompt before your actual query. The model uses these examples to infer the pattern you want, without any weight updates or training.

How many examples should I include in a few-shot prompt?

3–5 examples is the sweet spot for most tasks. More examples improve consistency for complex tasks but add tokens and cost. For simple classification tasks, 1–2 examples often suffice. Returns diminish beyond five examples, and past 10 the extra tokens rarely pay off.

Do the examples need to be in the same format as the expected output?

Yes. Examples work by pattern-matching — the model infers the expected output format from what it sees. If you want JSON output, your examples must show JSON output. If you want bullet points, show bullet points.

When should I fine-tune instead of using few-shot prompting?

Consider fine-tuning when: (1) you have 50+ high-quality examples, (2) the task requires highly consistent behaviour that few-shot cannot achieve, (3) you are making thousands of API calls and want to reduce prompt length costs.
