guide

How to Write a Good System Prompt (7 Dimensions That Matter)

By Rui Barreira · Last updated: 13 June 2026

You can score any system prompt in seconds using brevio System Prompt Scorer — paste your prompt and see an A–F grade alongside per-dimension feedback. Scoring is instant, runs entirely in your browser, and no data is sent to any server.

A good system prompt is not about length — it is about coverage. A 50-word prompt that hits all 7 dimensions outperforms a 500-word prompt that repeats the same instruction in different ways. This guide explains what each dimension means, why it is weighted the way it is, and what good versus bad looks like for each.

What Is a System Prompt?

A system prompt is a set of instructions sent to an LLM at the start of every conversation, before any user message. It persists across the entire session and the model treats it as higher-authority than user messages. If the system prompt says “respond only in French” and the user writes in English, a well-aligned model will respond in French.

System prompts are used to configure an LLM for a specific use case: a customer support bot, a code review assistant, a document summariser, a creative writing partner. Without a system prompt, the model uses only its training defaults — which are optimised for general helpfulness, not your specific application.

The 7 Dimensions (and Why They Are Weighted)

1. Role Definition (weight: 3×)

Role definition is the single most impactful element of a system prompt. It tells the model who it is, which anchors all subsequent instructions and shapes the model's default behaviours. A clear role reduces ambiguity about scope, authority, and personality.

Good: “You are a senior software engineer at a fintech company helping internal developers review pull requests.”

Bad: “Be helpful with code.”

The pattern to match is: “You are a [role] that [primary function] for [audience].” All three components strengthen the definition. The scorer detects phrases like “you are”, “act as”, and “your role is”.

2. Task Clarity (weight: 3×)

Task clarity specifies what the model should do. Without it, the model infers a task from context, which introduces variance. Explicit task verbs — help, answer, generate, explain, create, provide — anchor the model's primary function.

Good: “Help users diagnose billing errors and generate corrected invoice drafts.”

Bad: “You are a billing assistant.”

Role and task together (both weighted 3×) account for 43% of the total score. If your prompt only does one thing right, make it these two.

3. Constraints (weight: 2×)

Constraints define what the model must not do. Without constraints, the model may helpfully wander outside your intended scope — discussing topics it should ignore, making commitments it should not, or producing formats that break your downstream pipeline.

Good: “Do not provide legal advice. Never promise specific refund amounts. Avoid discussing competitor products.”

Bad: No restrictions mentioned at all.

The scorer detects negation patterns: “do not”, “never”, “avoid”, “only”, “must not”. Even a single constraint signals intentional scoping rather than leaving the model unconstrained.

4. Output Format (weight: 2×)

Output format tells the model how to structure its response. Without this, the model uses its training defaults — which may be verbose prose when you need bullet points, or JSON when you need markdown, or long answers when you need one sentence.

Good: “Respond in markdown. Use bullet points for lists. Keep explanations to 2–3 sentences unless the user asks for more detail.”

Bad: No format instructions.

Format instructions are especially important for downstream automation. If a system processes the model's output programmatically, consistent formatting prevents parsing failures.

5. Tone/Persona (weight: 1×)

Tone defines the register and personality of responses. This has lower weight (1×) because it matters less for functionality than for user experience. But inconsistent tone is jarring — a support bot that alternates between formal and casual within the same conversation damages trust.

Good: “Be professional, direct, and empathetic. Avoid jargon unless the user uses it first.”

Bad: No tone guidance.

6. Context/Background (weight: 1×)

Context provides the model with relevant background it cannot infer from the task alone. The scorer uses a simple heuristic: prompts over 200 characters are long enough to contain meaningful context. This dimension has lower weight (1×) because it is often optional for simple use cases — a password generator does not need contextual background.

Context becomes critical when the use case is domain-specific: a medical triage assistant needs to know it is operating in an emergency setting, a legal contract reviewer needs to know which jurisdiction's law applies.

7. Examples (weight: 2×)

Few-shot examples demonstrate the exact input/output format you expect. They are especially effective for non-obvious output structures — custom JSON schemas, specific table formats, or multi-part responses. Examples constrain the model's output distribution more reliably than instructions alone.

Good:

Example: User: What is the status of order #1234? Assistant: Order #1234 is currently in transit. Expected delivery: June 15. Tracking number: UPS-987654321.

Bad: No examples provided, relying entirely on instructions to shape output format.

The scorer detects examples when a phrase like “example” or “for instance” is followed within 200 characters by a role label like “User:”, “Input:”, or “Assistant:”.

Before and After: A Real Transformation

Before (grade: D, score ~28):

You are a support bot. Be helpful.

After (grade: A, score ~93):

You are a customer support agent for Acme Store, helping shoppers resolve order and product issues. Help users track orders, process returns, and answer product questions. Do not make pricing exceptions or promise refunds greater than the order amount. Respond in clear, plain English — use numbered steps when explaining a process, one sentence per step. Be friendly and professional throughout. Example: User: My order hasn't arrived yet. Assistant: I can help with that. Please share your order number and I'll check the current status and estimated delivery date for you.

The transformation adds role definition, task verbs, a specific constraint, format guidance, a tone instruction, sufficient context (the company name and scope), and a few-shot example. The prompt grows from 6 words to about 110 — a reasonable length for a production system prompt.

Common Mistakes

Repeating the same instruction: “Be helpful. Always try to help. Your job is to be as helpful as possible.” This adds tokens without adding information. The model does not become more helpful from repetition — it becomes more helpful from specificity.

No constraints: An unconstrained model will attempt to answer anything, including questions outside your intended scope. This is the most common gap — and the one most likely to cause production incidents.

Vague format guidance: “Be concise” means different things to different models. “Keep each response to 3 sentences or fewer unless the user asks for a detailed explanation” is concrete and enforceable.

No role definition: The model defaults to a generic helpful assistant. For specialized applications, this produces mediocre results — the model lacks the domain anchor to apply appropriate expertise.

How to Verify No Data Is Transmitted

Open DevTools with F12 (Windows/Linux) or ⌘⌥I (Mac).
Go to the Network tab and filter to Fetch/XHR.
Paste a system prompt into brevio System Prompt Scorer.
Observe: no network requests fire. Scoring uses JavaScript regex checks that run entirely in your browser. Your prompt never leaves your device — important if it contains internal product details, proprietary instructions, or confidential business logic.

Frequently Asked Questions

What should every system prompt include?

At minimum: a clear role definition (“You are a...”), the primary task (“Help users with X”), at least one constraint (“Do not discuss Y”), and an output format instruction (“Respond in markdown”). These four elements cover the highest-weighted scoring dimensions and account for most of the score gap between a weak and a strong prompt.

How long should a system prompt be?

Long enough to specify role, task, constraints, format, and tone — typically 100–500 characters for simple use cases, up to 2,000 characters for complex assistants with examples. Keep it tight: every word in a system prompt is billed on every API call. A 1,000-token system prompt on Claude Sonnet 4.6 at $3.00/1M tokens costs $0.003 per request — trivial for 1,000 requests, but $300/day at 100,000 daily requests.

Should I include few-shot examples in the system prompt?

Yes, for non-obvious output formats. One or two examples showing the exact structure you want (input/output pairs) dramatically improve consistency. For straightforward Q&A tasks, examples are less necessary — invest the tokens in better constraints and format guidance instead.

What is the difference between a system prompt and a user prompt?

The system prompt sets persistent instructions that apply to the entire conversation — the model treats it as configuration. The user prompt is the per-turn input from the user. The model treats system prompt instructions as higher authority than user instructions. If the system prompt says “respond only in JSON” and the user asks for a prose answer, a compliant model returns JSON. This authority difference is why system prompts are the right place for constraints, format requirements, and scope boundaries.

Frequently Asked Questions

What should every system prompt include?: At minimum: a clear role definition ("You are a..."), the primary task ("Help users with X"), at least one constraint ("Do not discuss Y"), and an output format instruction ("Respond in markdown"). These four elements cover the highest-weighted scoring dimensions.
How long should a system prompt be?: Long enough to specify role, task, constraints, format, and tone — typically 100–500 characters for simple use cases, up to 2,000 characters for complex assistants with examples. Keep it tight: every word in a system prompt is billed on every API call.
Should I include few-shot examples in the system prompt?: Yes, for non-obvious output formats. 1–2 examples that show the exact structure you want (input/output pairs) dramatically improve consistency. For straightforward Q&A tasks, examples are less necessary.
What is the difference between a system prompt and a user prompt?: The system prompt sets persistent instructions that apply to the entire conversation. The user prompt is the per-turn input from the user. The model treats system prompt instructions as higher authority than user instructions.