A/B Testing Email Campaigns: A Statistical Guide
A/B Testing Fundamentals
A/B testing in email compares two variants of an email to determine which performs better against a specific metric. Unlike web A/B testing, email tests have unique constraints: you get one send per subscriber, sample sizes are fixed by your list size, and external factors (time of day, day of week) can skew results.
A proper email A/B test requires:
- A clear hypothesis ("Subject line with a number will increase open rates by 10%")
- A single variable changed between variants (never test multiple things at once)
- A predetermined success metric (open rate, click rate, conversion rate)
- A sufficient sample size for statistical significance
- A predetermined test duration
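The checklist above is worth writing down before the send, not after. A minimal sketch of such a record in Python (the `TestPlan` class and its field names are illustrative, not tied to any particular email platform's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestPlan:
    """Illustrative record of an email A/B test, filled in before the send."""
    hypothesis: str               # e.g. "A number in the subject line lifts opens by 10%"
    variable: str                 # the single element that differs between variants
    success_metric: str           # open rate, click rate, or conversion rate
    sample_size_per_variant: int  # from a power calculation, not a guess
    duration_hours: int           # how long to wait before reading results

plan = TestPlan(
    hypothesis="Subject line with a number will increase open rates by 10%",
    variable="subject line",
    success_metric="open rate",
    sample_size_per_variant=4900,
    duration_hours=48,
)
print(plan)
```

Freezing the dataclass is deliberate: the plan should not be editable once results start coming in.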
What to Test (and What Not To)
High-impact tests (test these first):
- Subject lines — highest impact on open rates, easiest to test
- Send time — can shift open rates by 10-30%
- CTA text and placement — directly affects click-through rates
- From name — "Jane at Company" vs "Company" vs "The Company Team"
Medium-impact tests:
- Email length (concise vs detailed)
- Personalization (name in subject, dynamic content blocks)
- Preheader text
- Number of links/CTAs
Low-impact tests (usually not worth the effort):
- Font choices or colors within acceptable ranges
- Minor copy tweaks that do not change the core message
- Image placement when the core layout is the same
Getting Your Sample Size Right
The most common A/B testing mistake is declaring a winner too early. To detect a meaningful difference, you need sufficient sample size.
For email open rate tests, here are rough minimums per variant:
- Detect 5% relative difference (25% → 26.25%): ~19,000 subscribers per variant
- Detect 10% relative difference (25% → 27.5%): ~4,900 subscribers per variant
- Detect 20% relative difference (25% → 30%): ~1,250 subscribers per variant
These assume a baseline open rate of ~25%, a 95% confidence level (two-sided), and 80% power. Use an online sample size calculator for your specific baseline metrics.
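Estimates like these come from a standard two-proportion power calculation. A minimal sketch using only the standard library (normal approximation; the hardcoded z values correspond to a two-sided 95% confidence level and 80% power):

```python
import math

def sample_size_per_variant(p_base: float, relative_lift: float,
                            z_alpha: float = 1.96, z_power: float = 0.8416) -> int:
    """Subscribers needed per variant to detect a relative lift in a proportion
    (two-sided two-proportion z-test, normal approximation)."""
    p1 = p_base
    p2 = p_base * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((numerator / (p2 - p1)) ** 2)

for lift in (0.05, 0.10, 0.20):
    print(f"{lift:.0%} relative lift: {sample_size_per_variant(0.25, lift):,} per variant")
```

Note how the required sample size grows roughly with the inverse square of the lift: halving the effect you want to detect roughly quadruples the subscribers you need.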
If your list is too small for statistical significance, consider:
- Testing larger changes that produce bigger effects
- Accumulating results across multiple sends (meta-analysis)
- Using Bayesian methods, which give directly interpretable probabilities ("B has an 85% chance of beating A") even when the sample is too small for a conventional significance test
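The Bayesian option can be sketched with Beta-Binomial conjugacy: give each variant's open rate a Beta posterior and estimate the probability that B beats A by Monte Carlo. A minimal sketch (the uniform Beta(1, 1) prior and the example counts are illustrative choices):

```python
import random

def prob_b_beats_a(opens_a: int, sends_a: int, opens_b: int, sends_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """P(open rate of B > open rate of A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        rate_a = rng.betavariate(1 + opens_a, 1 + sends_a - opens_a)
        rate_b = rng.betavariate(1 + opens_b, 1 + sends_b - opens_b)
        wins += rate_b > rate_a
    return wins / draws

# Illustrative counts from a small list: 52/200 opens for A, 66/200 for B.
print(f"P(B beats A) ~ {prob_b_beats_a(52, 200, 66, 200):.2f}")
```

The output is a statement you can act on directly ("B is probably better, with X% confidence") rather than a binary significant/not-significant verdict, which is why this framing suits small lists.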
Interpreting Results
Key metrics to evaluate:
- Statistical significance — Is p less than 0.05? If not, the result may be due to chance.
- Effect size — a statistically significant 0.1% improvement may be real, but it is too small to matter. Focus on meaningful differences.
- Confidence interval — understand the range of plausible true values, not just the point estimate.
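All three of these come out of a standard two-proportion z-test. A minimal sketch using only the standard library (pooled standard error for the test statistic, unpooled for a 95% interval on the difference; the counts in the example are made up):

```python
import math

def two_proportion_test(opens_a: int, sends_a: int, opens_b: int, sends_b: int):
    """Return (p_value, (ci_low, ci_high)) for the open-rate difference B - A."""
    pa, pb = opens_a / sends_a, opens_b / sends_b
    # Pooled standard error: used for the hypothesis test (assumes rates are equal).
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (pb - pa) / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    # Unpooled standard error: used for the confidence interval on the difference.
    se = math.sqrt(pa * (1 - pa) / sends_a + pb * (1 - pb) / sends_b)
    return p_value, (pb - pa - 1.96 * se, pb - pa + 1.96 * se)

p, ci = two_proportion_test(1200, 4900, 1320, 4900)
print(f"p = {p:.3f}, 95% CI for the lift: [{ci[0]:.2%}, {ci[1]:.2%}]")
```

Reading the interval, not just the p-value, is what separates "B won" from "B won by somewhere between a rounding error and a meaningful lift".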
Be cautious about:
- Novelty effects — a new approach may spike initially but normalize over time
- Segment differences — a winning variant for one segment may lose for another
- Downstream metrics — higher open rates do not always translate to more conversions
Common Pitfalls
- Peeking at results early — checking daily and stopping when you see a winner inflates false positive rates dramatically
- Testing too many things at once — multivariate testing requires exponentially larger sample sizes
- Ignoring seasonality — results from Black Friday week do not generalize to normal periods
- Not documenting learnings — keep a testing log so you build institutional knowledge
- Over-optimizing for opens — clickbait subject lines win open rate tests but hurt long-term engagement
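The first pitfall above is easy to demonstrate by simulation: run many null A/B tests (no true difference between variants), apply a z-test at several interim looks, and compare how often a "winner" appears under peeking versus a single read at the end. A minimal sketch (the send sizes, look schedule, and seed are arbitrary choices):

```python
import math
import random

def simulate_peeking(n_sims: int = 200, n_per_arm: int = 2000,
                     looks: int = 10, p: float = 0.25, seed: int = 42):
    """Fraction of null tests declared significant when peeking vs. reading once."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeked = final = 0
    for _ in range(n_sims):
        opens_a = opens_b = sends = 0
        ever_significant = last_significant = False
        for _ in range(looks):
            # Both arms draw from the same true open rate: any "winner" is noise.
            opens_a += sum(rng.random() < p for _ in range(step))
            opens_b += sum(rng.random() < p for _ in range(step))
            sends += step
            pooled = (opens_a + opens_b) / (2 * sends)
            se = math.sqrt(pooled * (1 - pooled) * 2 / sends)
            z = (opens_b - opens_a) / sends / se if se else 0.0
            last_significant = math.erfc(abs(z) / math.sqrt(2)) < 0.05
            ever_significant = ever_significant or last_significant
        peeked += ever_significant
        final += last_significant
    return peeked / n_sims, final / n_sims

peek_rate, final_rate = simulate_peeking()
print(f"false positives: peeking {peek_rate:.0%} vs single read {final_rate:.0%}")
```

The single-read false positive rate stays near the nominal 5%, while "stop as soon as any look crosses p < 0.05" fires several times as often, which is exactly why the test duration belongs in the plan before the send.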