A/B Testing Email Campaigns: A Statistical Guide
A/B Testing Fundamentals
A/B testing in email compares two variants of an email to determine which performs better against a specific metric. Unlike web A/B testing, email tests have unique constraints: you get one send per subscriber, sample sizes are fixed by your list size, and external factors (time of day, day of week) can skew results.
A proper email A/B test requires:
- A clear hypothesis ("Subject line with a number will increase open rates by 10%")
- A single variable changed between variants (never test multiple things at once)
- A predetermined success metric (open rate, click rate, conversion rate)
- A sufficient sample size for statistical significance
- A predetermined test duration
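The checklist above is worth writing down before the send, not after. A minimal sketch of such a record in Python (the `TestPlan` class and its field names are illustrative, not tied to any particular email platform's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestPlan:
    """Illustrative record of an email A/B test, filled in before the send."""
    hypothesis: str               # e.g. "A number in the subject line lifts opens by 10%"
    variable: str                 # the single element that differs between variants
    success_metric: str           # open rate, click rate, or conversion rate
    sample_size_per_variant: int  # from a power calculation, not a guess
    duration_hours: int           # how long to wait before reading results

plan = TestPlan(
    hypothesis="Subject line with a number will increase open rates by 10%",
    variable="subject line",
    success_metric="open rate",
    sample_size_per_variant=4900,
    duration_hours=48,
)
print(plan)
```

Freezing the dataclass is deliberate: the plan should not be editable once results start coming in.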
What to Test (and What Not To)
High-impact tests (test these first):
- Subject lines — highest impact on open rates, easiest to test
- Send time — can shift open rates by 10-30%
- CTA text and placement — directly affects click-through rates
- From name — "Jane at Company" vs "Company" vs "The Company Team"
Medium-impact tests:
- Email length (concise vs detailed)
- Personalization (name in subject, dynamic content blocks)
- Preheader text
- Number of links/CTAs
Low-impact tests (usually not worth the effort):
- Font choices or colors within acceptable ranges
- Minor copy tweaks that do not change the core message
- Image placement when the core layout is the same
Getting Your Sample Size Right
The most common A/B testing mistake is declaring a winner too early. To detect a meaningful difference, you need sufficient sample size.
For email open rate tests, here are rough minimums per variant:
- Detect 5% relative difference (25% → 26.25%): ~19,000 subscribers per variant
- Detect 10% relative difference (25% → 27.5%): ~4,900 subscribers per variant
- Detect 20% relative difference (25% → 30%): ~1,250 subscribers per variant
These assume a baseline open rate of ~25%, a 95% confidence level (two-sided), and 80% power. Use an online sample size calculator for your specific baseline metrics.
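Estimates like these come from a standard two-proportion power calculation. A minimal sketch using only the standard library (normal approximation; the hardcoded z values correspond to a two-sided 95% confidence level and 80% power):

```python
import math

def sample_size_per_variant(p_base: float, relative_lift: float,
                            z_alpha: float = 1.96, z_power: float = 0.8416) -> int:
    """Subscribers needed per variant to detect a relative lift in a proportion
    (two-sided two-proportion z-test, normal approximation)."""
    p1 = p_base
    p2 = p_base * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((numerator / (p2 - p1)) ** 2)

for lift in (0.05, 0.10, 0.20):
    print(f"{lift:.0%} relative lift: {sample_size_per_variant(0.25, lift):,} per variant")
```

Note how the required sample size grows roughly with the inverse square of the lift: halving the effect you want to detect roughly quadruples the subscribers you need.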
If your list is too small for statistical significance, consider:
- Testing larger changes that produce bigger effects
- Accumulating results across multiple sends (meta-analysis)
- Using Bayesian methods, which give directly interpretable probabilities ("B has an 85% chance of beating A") even when the sample is too small for a conventional significance test
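The Bayesian option can be sketched with Beta-Binomial conjugacy: give each variant's open rate a Beta posterior and estimate the probability that B beats A by Monte Carlo. A minimal sketch (the uniform Beta(1, 1) prior and the example counts are illustrative choices):

```python
import random

def prob_b_beats_a(opens_a: int, sends_a: int, opens_b: int, sends_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """P(open rate of B > open rate of A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        rate_a = rng.betavariate(1 + opens_a, 1 + sends_a - opens_a)
        rate_b = rng.betavariate(1 + opens_b, 1 + sends_b - opens_b)
        wins += rate_b > rate_a
    return wins / draws

# Illustrative counts from a small list: 52/200 opens for A, 66/200 for B.
print(f"P(B beats A) ~ {prob_b_beats_a(52, 200, 66, 200):.2f}")
```

The output is a statement you can act on directly ("B is probably better, with X% confidence") rather than a binary significant/not-significant verdict, which is why this framing suits small lists.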
Interpreting Results
Key metrics to evaluate:
- Statistical significance — Is p less than 0.05? If not, the result may be due to chance.
- Effect size — a statistically significant 0.1% improvement may be real, but it is too small to matter. Focus on meaningful differences.
- Confidence interval — understand the range of plausible true values, not just the point estimate.
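All three of these come out of a standard two-proportion z-test. A minimal sketch using only the standard library (pooled standard error for the test statistic, unpooled for a 95% interval on the difference; the counts in the example are made up):

```python
import math

def two_proportion_test(opens_a: int, sends_a: int, opens_b: int, sends_b: int):
    """Return (p_value, (ci_low, ci_high)) for the open-rate difference B - A."""
    pa, pb = opens_a / sends_a, opens_b / sends_b
    # Pooled standard error: used for the hypothesis test (assumes rates are equal).
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (pb - pa) / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    # Unpooled standard error: used for the confidence interval on the difference.
    se = math.sqrt(pa * (1 - pa) / sends_a + pb * (1 - pb) / sends_b)
    return p_value, (pb - pa - 1.96 * se, pb - pa + 1.96 * se)

p, ci = two_proportion_test(1200, 4900, 1320, 4900)
print(f"p = {p:.3f}, 95% CI for the lift: [{ci[0]:.2%}, {ci[1]:.2%}]")
```

Reading the interval, not just the p-value, is what separates "B won" from "B won by somewhere between a rounding error and a meaningful lift".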
Be cautious about:
- Novelty effects — a new approach may spike initially but normalize over time
- Segment differences — a winning variant for one segment may lose for another
- Downstream metrics — higher open rates do not always translate to more conversions
Common Pitfalls
- Peeking at results early — checking daily and stopping when you see a winner inflates false positive rates dramatically
- Testing too many things at once — multivariate testing requires exponentially larger sample sizes
- Ignoring seasonality — results from Black Friday week do not generalize to normal periods
- Not documenting learnings — keep a testing log so you build institutional knowledge
- Over-optimizing for opens — clickbait subject lines win open rate tests but hurt long-term engagement
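The first pitfall above is easy to demonstrate by simulation: run many null A/B tests (no true difference between variants), apply a z-test at several interim looks, and compare how often a "winner" appears under peeking versus a single read at the end. A minimal sketch (the send sizes, look schedule, and seed are arbitrary choices):

```python
import math
import random

def simulate_peeking(n_sims: int = 200, n_per_arm: int = 2000,
                     looks: int = 10, p: float = 0.25, seed: int = 42):
    """Fraction of null tests declared significant when peeking vs. reading once."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeked = final = 0
    for _ in range(n_sims):
        opens_a = opens_b = sends = 0
        ever_significant = last_significant = False
        for _ in range(looks):
            # Both arms draw from the same true open rate: any "winner" is noise.
            opens_a += sum(rng.random() < p for _ in range(step))
            opens_b += sum(rng.random() < p for _ in range(step))
            sends += step
            pooled = (opens_a + opens_b) / (2 * sends)
            se = math.sqrt(pooled * (1 - pooled) * 2 / sends)
            z = (opens_b - opens_a) / sends / se if se else 0.0
            last_significant = math.erfc(abs(z) / math.sqrt(2)) < 0.05
            ever_significant = ever_significant or last_significant
        peeked += ever_significant
        final += last_significant
    return peeked / n_sims, final / n_sims

peek_rate, final_rate = simulate_peeking()
print(f"false positives: peeking {peek_rate:.0%} vs single read {final_rate:.0%}")
```

The single-read false positive rate stays near the nominal 5%, while "stop as soon as any look crosses p < 0.05" fires several times as often, which is exactly why the test duration belongs in the plan before the send.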