Calculate sample size needed for statistically significant results
Sample size depends on the baseline conversion rate, the minimum detectable effect, statistical power (usually 80%), and the significance level (usually 5%, i.e., 95% confidence). Larger effects are easier to detect with smaller samples.
Formula
n = 2 × (Z_α/2 + Z_β)² × p̄(1 − p̄) / δ²

where p̄ is the average of the baseline and expected conversion rates, δ is the absolute difference between them, Z_α/2 is the critical value for the significance level (1.96 for two-sided 95% confidence), and Z_β corresponds to the power (0.84 for 80%).
For a 5% baseline, a 20% relative lift (to 6%), 80% power, 95% confidence: need roughly 8,200 visitors per variation (about 16,400 total).
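A minimal sketch of this calculation in Python; the function name and the scipy dependency are illustrative choices, not from the original:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variation(p_baseline, p_expected, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-sided, two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)        # 1.96 at 95% confidence
    z_beta = norm.ppf(power)                 # 0.84 at 80% power
    p_bar = (p_baseline + p_expected) / 2    # average conversion rate
    delta = abs(p_expected - p_baseline)     # absolute difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return ceil(n)

# 5% baseline, 20% relative lift (to 6%), 80% power, 95% confidence
print(sample_size_per_variation(0.05, 0.06))  # 8159 per variation
```

Different calculators use slightly different variance approximations, so exact figures vary by a few percent.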
Determine required traffic before starting. Running underpowered tests wastes time and risks wrong conclusions.
Each variation needs the full sample size. An A/B/C/D test needs 4x the per-variation sample, i.e., twice the total traffic of an A/B test.
Small changes (button color) rarely have detectable effects. Test substantial differences.
A statistically significant 0.1% lift might not be worth implementing. Set a business-relevant minimum detectable effect (MDE).
Calculate traffic needed for landing page or ad tests.
Determine sample size for feature tests.
Set realistic timelines based on traffic and MDE.
Evaluate statistical significance of completed tests.
Testing at 95% confidence means that if there were truly no difference, a result this extreme would occur only 5% of the time. It's the confidence threshold for making decisions.
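To evaluate a completed test, one standard approach is a two-proportion z-test. A hedged sketch, with illustrative counts:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_pvalue(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))               # two-sided p-value

# Significant at 95% confidence when p < 0.05 (counts are made up)
p = two_proportion_pvalue(408, 8160, 490, 8160)
print(f"p = {p:.4f}, significant: {p < 0.05}")  # p ≈ 0.005, True
```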
Power (typically 80%) is the probability of detecting a real effect when it exists. Lower power means you might miss real improvements.
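Inverting the sample-size formula above gives the power achieved at a given sample size, which makes the cost of underpowering concrete. A sketch under the same assumptions as the calculator above:

```python
from scipy.stats import norm

def achieved_power(p_baseline, p_expected, n_per_variation, alpha=0.05):
    """Probability of detecting a true difference at a given sample size."""
    p_bar = (p_baseline + p_expected) / 2
    delta = abs(p_expected - p_baseline)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = delta * (n_per_variation / (2 * p_bar * (1 - p_bar))) ** 0.5 - z_alpha
    return norm.cdf(z_beta)

print(achieved_power(0.05, 0.06, 8159))  # ~0.80: as designed
print(achieved_power(0.05, 0.06, 4000))  # ~0.50: a coin flip on a real 1pp lift
```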
The minimum detectable effect is the smallest improvement you want to reliably detect. Smaller MDEs require larger sample sizes: because n scales with 1/δ², detecting a 5% lift instead of a 20% lift needs roughly 16x more traffic.
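Using the sample_size_per_variation sketch from the Formula section, the scaling is easy to verify (5% baseline assumed):

```python
# Reuses sample_size_per_variation from the Formula section
n_20pct_lift = sample_size_per_variation(0.05, 0.06)    # 20% relative lift
n_5pct_lift = sample_size_per_variation(0.05, 0.0525)   # 5% relative lift
print(n_20pct_lift, n_5pct_lift, n_5pct_lift / n_20pct_lift)
# 8159, 122125, ~15x (1/δ² alone gives 16x; the p̄(1-p̄) term shifts it slightly)
```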
Run the test until you reach the required sample size. Never stop early based on initial results, since that introduces bias. Plan for at least 1-2 full business cycles so weekly patterns average out.
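A small sketch for turning the required sample into a timeline; the daily traffic figure and the two-variation split are assumed for illustration:

```python
from math import ceil

def test_duration_days(n_per_variation, variations, daily_visitors):
    """Days to reach the required sample, rounded up to whole weeks."""
    days = ceil(n_per_variation * variations / daily_visitors)
    return ceil(days / 7) * 7    # whole weeks smooth out day-of-week effects

# 8,159 per variation, a two-variation A/B test, 1,500 visitors/day
print(test_duration_days(8159, 2, 1500))  # 11 days, rounded up to 14
```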
Generally no: checking results mid-test ("peeking") inflates false positive rates. If you need to monitor results continuously, use sequential testing methods designed for repeated looks.
An inconclusive result means no statistically significant difference was detected. You can extend the test to collect more data, accept the null hypothesis and keep the control, or test a larger change.