How to Calculate A/B Test Statistical Significance — Free (2026)

By Rui Barreira · Last updated: 18 June 2026

Determine whether your A/B test result is statistically significant with the brevio A/B Test Calculator — free, no signup, runs entirely in your browser. Enter visitors and conversions for each variant to get the Z-score, p-value, and confidence level instantly.

Running an A/B test and seeing that variant B converted at 3.6% versus control A at 3.0% does not automatically mean B is better. With a small sample, a difference of this size could easily be due to random chance. Statistical significance testing quantifies how likely a difference at least this large would be if the two variants actually performed the same.

How to Use the Tool

Enter visitors and conversions for Control (A). These are the numbers for the unchanged baseline version.
Enter visitors and conversions for Variant (B). These are the numbers for the version being tested.
Click Calculate Significance. The tool returns conversion rates, relative lift, Z-score, p-value, and whether the result is significant at 95% confidence.

How Statistical Significance Is Calculated

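The tool uses the two-proportion Z-test. The formula is: Z = (p₂ − p₁) / SE, where p₁ = c₁ / n₁ and p₂ = c₂ / n₂ are the conversion rates of control and variant (c is conversions, n is visitors), the standard error is SE = √(p_pool × (1 − p_pool) × (1/n₁ + 1/n₂)), and the pooled proportion is p_pool = (c₁ + c₂) / (n₁ + n₂).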

The p-value is the probability of observing a difference at least as large as this one if the null hypothesis (no real difference) were true. A p-value below 0.05 means that a difference this large would occur less than 5% of the time through random variation alone if the variants truly performed the same. It is not the probability that the result is due to chance. This is the threshold used by most product teams, commonly described as 95% confidence.
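The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function name `two_proportion_z_test`, the choice of a two-sided p-value, and the example counts (10,000 and 1,000 visitors per variant) are assumptions chosen to match the 3.0% versus 3.6% example.

```python
import math

def two_proportion_z_test(c1, n1, c2, n2):
    """Return (z, two-sided p-value) for control (c1, n1) vs variant (c2, n2)."""
    p1, p2 = c1 / n1, c2 / n2
    p_pool = (c1 + c2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided tail area of the standard normal.
    # Assumption: the tool may use a one-sided test instead.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: the same 3.0% vs 3.6% rates at two sample sizes.
z_large, p_large = two_proportion_z_test(300, 10_000, 360, 10_000)
z_small, p_small = two_proportion_z_test(30, 1_000, 36, 1_000)
print(f"10,000 per variant: z = {z_large:.2f}, p = {p_large:.4f}")
print(f" 1,000 per variant: z = {z_small:.2f}, p = {p_small:.4f}")
```

With 10,000 visitors per variant the p-value falls below 0.05. With the same rates on 1,000 visitors per variant it does not, which illustrates why an identical observed lift can be significant in one test and inconclusive in another.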

The relative lift is (p₂ − p₁) / p₁ × 100%. A 3.0% control rate and a 3.6% variant rate produce a 20% relative lift — meaningful even if the absolute difference is only 0.6 percentage points.
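As a quick check of the arithmetic, the lift formula can be written as a small helper. The function name `relative_lift` is illustrative and not part of the tool.

```python
def relative_lift(control_rate, variant_rate):
    """Relative lift of the variant over control, in percent."""
    return (variant_rate - control_rate) / control_rate * 100

lift = relative_lift(0.030, 0.036)
absolute_diff_pp = (0.036 - 0.030) * 100   # difference in percentage points
print(f"{lift:.1f}% relative lift, {absolute_diff_pp:.1f} percentage points absolute")
# prints: 20.0% relative lift, 0.6 percentage points absolute
```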

Minimum Sample Size

The tool also shows the minimum recommended sample size per variant, based on a 10% relative minimum detectable effect (MDE) at 80% power and 95% confidence. If your current sample is below this number, your test may not have enough power to detect real differences reliably — you risk false negatives (missing a real improvement).
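The guide does not state which formula the tool uses for this number, so the sketch below relies on a common normal-approximation formula for comparing two proportions with a two-sided test. The function name `min_sample_size` and its parameter names are assumptions for illustration, and the tool's own figure may differ somewhat.

```python
import math
from statistics import NormalDist

def min_sample_size(baseline_rate, relative_mde=0.10, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (unpooled normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)          # rate at the minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)             # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

print(min_sample_size(0.03))               # 3.0% baseline, 10% relative MDE
print(min_sample_size(0.03, power=0.90))   # higher power needs more visitors
```

Under these assumptions a 3.0% baseline needs roughly 53,000 visitors per variant, which shows how quickly the requirement grows when the baseline rate is low and the effect to detect is small.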

Frequently Asked Questions

What does p < 0.05 actually mean? If the null hypothesis were true (the variants perform identically), you would expect to see a result this extreme or more extreme less than 5% of the time. It does not mean there is a 95% probability that your variant is better. It means the data is unlikely under the assumption of no difference.
Can I stop my test early if it reaches significance? This is called "peeking", and it substantially inflates the false positive rate. If you plan to check results multiple times, use a sequential testing method or apply a Bonferroni correction (divide the significance threshold by the number of planned looks). The safest approach is to decide your sample size in advance and only read results once.
What is statistical power? Power is the probability that your test will detect a real difference if one exists. At 80% power, you will miss a real improvement of the size the test was designed to detect 20% of the time (false negative). Higher power requires larger sample sizes but reduces the risk of inconclusive tests.
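The effect of peeking can be seen in a small simulation. The sketch below is an illustration under assumed settings (a 10% conversion rate for both variants, five evenly spaced looks, 1,000 simulated A/A tests, a fixed random seed); none of these values come from the tool. Because both variants are identical, every significant result is a false positive.

```python
import math
import random

def z_stat(c1, n1, c2, n2):
    """Two-proportion Z statistic with a pooled standard error."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (c2 / n2 - c1 / n1) / se

def simulate_peeking(n_sims=1000, rate=0.10, looks=(200, 400, 600, 800, 1000), seed=1):
    """Return (false positive rate when stopping at any significant look,
    false positive rate when reading only the final look)."""
    rng = random.Random(seed)
    z_crit = 1.959964                     # two-sided 5% threshold
    peek_hits = final_hits = 0
    for _sim in range(n_sims):
        c1 = c2 = seen = 0
        flagged = significant = False
        for n in looks:
            for _visitor in range(n - seen):
                c1 += rng.random() < rate   # control conversion
                c2 += rng.random() < rate   # variant conversion, same true rate
            seen = n
            significant = abs(z_stat(c1, n, c2, n)) > z_crit
            flagged = flagged or significant
        peek_hits += flagged              # significant at any look
        final_hits += significant         # significant at the last look only
    return peek_hits / n_sims, final_hits / n_sims

peek_rate, single_rate = simulate_peeking()
print(f"stop at first significant look: {peek_rate:.1%} false positives")
print(f"read once at the end:           {single_rate:.1%} false positives")
```

Reading the result once keeps the false positive rate near the nominal 5%, while stopping at the first significant look pushes it well above that, even though nothing else about the test changed.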
