
What is “Frequentist” in A/B Testing?
Frequentist inference interprets probability as the long-run frequency of events. In the context of A/B tests, it asks: If I repeatedly ran this experiment under the null hypothesis, how often would I observe a result at least this extreme just by chance?
Key objects in the frequentist toolkit are null/alternative hypotheses, test statistics, p-values, confidence intervals, Type I/II errors, and power.
Core Concepts (Fast Definitions)
- Null hypothesis (H₀): No difference between variants (e.g., pA=pB).
- Alternative hypothesis (H₁): There is a difference (two-sided) or a specified direction (one-sided).
- Test statistic: A standardized measure (e.g., a z-score) used to compare observed effects to what chance would produce.
- p-value: Probability, assuming H₀ is true, of observing data at least as extreme as what you saw.
- Significance level (α): Threshold for rejecting H₀ (often 0.05).
- Confidence interval (CI): A range of plausible values for the effect size, produced by a procedure that captures the true effect in the stated percentage (e.g., 95%) of repeated samples.
- Power (1−β): Probability your test detects a true effect of a specified size (i.e., avoids a Type II error).
How Frequentist A/B Testing Works (Step-by-Step)
1) Define the effect and hypotheses
For a proportion metric like conversion rate (CR):
- pA = baseline CR (variant A/control)
- pB = treatment CR (variant B/experiment)
Null hypothesis:
H₀: pA = pB
Two-sided alternative:
H₁: pA ≠ pB
2) Choose α, power, and (optionally) the Minimum Detectable Effect (MDE)
- Common choices: α = 0.05, power = 0.8 or 0.9.
- MDE is the smallest lift you care to detect (planning parameter for sample size).
3) Collect data according to a pre-registered plan
Let nA, nB be the sample sizes and xA, xB the conversions; the observed rates are pA = xA/nA and pB = xB/nB.
4) Compute the test statistic (two-proportion z-test)
Pooled proportion under H₀:
p̂ = (xA + xB) / (nA + nB)
Standard error (SE) under H₀:
SE = √( p̂(1 − p̂) · (1/nA + 1/nB) )
z-statistic:
z = (pB − pA) / SE
5) Convert z to a p-value
For a two-sided test:
p-value = 2 · (1 − Φ(|z|))
where Φ is the standard normal CDF.
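As a minimal sketch, steps 4 and 5 fit in a few lines of Python using only the standard library (the helper name two_proportion_ztest is my own; NormalDist requires Python 3.8+):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(xA, nA, xB, nB):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pA, pB = xA / nA, xB / nB
    p_pool = (xA + xB) / (nA + nB)                        # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / nA + 1 / nB))  # standard error under H0
    z = (pB - pA) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return z, p_value
```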
6) Decision rule
- If p-value ≤ α ⇒ Reject H₀ (evidence of a difference).
- If p-value > α ⇒ Fail to reject H₀ (data are consistent with no detectable difference).
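In code, the decision rule is a single comparison (decide is a hypothetical helper):

```python
def decide(p_value, alpha=0.05):
    # Reject H0 when a result at least this extreme would occur
    # no more than alpha of the time under H0.
    return "reject H0" if p_value <= alpha else "fail to reject H0"
```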
7) Report the effect size with a confidence interval
Approximate 95% CI for the difference (pB − pA):
(pB − pA) ± 1.96 · √( pA(1 − pA)/nA + pB(1 − pB)/nB )
Tip: Also report relative lift (pB/pA−1) and absolute difference (pB−pA).
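A sketch of the interval above, using the unpooled (Wald) standard error; diff_ci is again a hypothetical helper, with 1.96 as the two-sided 95% critical value:

```python
from math import sqrt

def diff_ci(xA, nA, xB, nB, z_crit=1.96):
    """Approximate 95% Wald CI for the absolute difference pB - pA."""
    pA, pB = xA / nA, xB / nB
    se = sqrt(pA * (1 - pA) / nA + pB * (1 - pB) / nB)  # unpooled standard error
    diff = pB - pA
    return diff - z_crit * se, diff + z_crit * se
```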
A Concrete Example (Conversions)
Suppose:
- nA=10,000, xA=900⇒pA=0.09
- nB=10,000, xB=960⇒pB=0.096
Plugging into the formulas above: pooled p̂ = 0.093, SE ≈ 0.0041, z ≈ 1.46, two-sided p-value ≈ 0.14, and 95% CI ≈ (−0.002, 0.014). The p-value exceeds 0.05 and the CI includes 0, so the ~0.6-percentage-point lift (≈6.7% relative) is not statistically significant at α = 0.05; at this sample size the test is underpowered for an effect this small.
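Running the two_proportion_ztest and diff_ci sketches from above on these numbers reproduces the figures:

```python
nA = nB = 10_000
xA, xB = 900, 960

z, p_value = two_proportion_ztest(xA, nA, xB, nB)
lo, hi = diff_ci(xA, nA, xB, nB)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({lo:.4f}, {hi:.4f})")
# -> z = 1.46, p = 0.144, 95% CI = (-0.0020, 0.0140)
```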
Why Frequentist Testing Is Important
- Clear, widely understood decisions
Frequentist tests provide a familiar yes/no decision rule (reject/fail to reject H₀) that is easy to operationalize in product pipelines.
- Error control at scale
By fixing α, you control the long-run rate of false positives (Type I errors), crucial when many teams run many tests.
- Confidence intervals communicate uncertainty
CIs provide a range of plausible effects, helping stakeholders gauge practical significance (not just p-values).
- Power planning avoids underpowered tests
You can plan sample sizes to hit desired power for your MDE, reducing wasted time and inconclusive results.
Approximate two-sample proportion power-based sample size per variant:
n ≈ 2 · p(1 − p) · (zα/2 + zβ)² / Δ²
where p is the baseline CR, Δ is your MDE in absolute terms, and zα/2, zβ are the standard normal critical values for the chosen α and power (1.96 and 0.84 for α = 0.05 and power = 0.8).
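A minimal planning sketch under the same assumptions (sample_size_per_variant is a hypothetical helper implementing the formula above):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p, mde, alpha=0.05, power=0.8):
    """Approximate n per variant for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.8
    return ceil(2 * p * (1 - p) * (z_alpha + z_beta) ** 2 / mde ** 2)

# Baseline CR of 9%, absolute MDE of 1 percentage point:
# sample_size_per_variant(0.09, 0.01) -> 12857 users per variant
```

For reference, detecting the earlier example's Δ = 0.006 at these settings works out to roughly 35,700 users per variant, which is why 10,000 was not enough.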
Practical Guidance & Best Practices
- Pre-register your hypothesis, metrics, α, stopping rule, and analysis plan.
- Avoid peeking (optional stopping inflates false positives). If you need flexibility, use group-sequential or alpha-spending methods.
- Adjust for multiple comparisons when testing many variants/metrics (e.g., Bonferroni, Holm, or FDR control); see the sketch after this list.
- Check metric distributional assumptions. For very small counts, prefer exact or mid-p tests; for large samples, z-tests are fine.
- Report both statistical and practical significance. A tiny but “significant” lift may not be worth the engineering cost.
- Monitor variance early. High variance metrics (e.g., revenue/user) may require non-parametric tests or transformations.
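For the multiple-comparisons point above, here is a hand-rolled Holm step-down as one possible sketch (in practice a statistics package can do this for you):

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: which hypotheses to reject at family-wise level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # threshold relaxes from alpha/m up to alpha
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

# holm_reject([0.003, 0.03, 0.04]) -> [True, False, False]
```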
Frequentist vs. Bayesian
- Frequentist p-values tell you how unusual your data are if H₀ were true.
- Bayesian methods provide a posterior distribution for the effect (e.g., probability the lift > 0); see the sketch below.
Both are valid; frequentist tests remain popular for their simplicity, well-established error control, and broad tooling support.
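To make the contrast concrete, here is a Monte Carlo sketch of the Bayesian quantity mentioned above, assuming independent Beta(1, 1) priors on each conversion rate (prob_b_beats_a is a hypothetical helper):

```python
import random

def prob_b_beats_a(xA, nA, xB, nB, draws=100_000, seed=0):
    """Estimate P(pB > pA) by sampling the Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        pa = rng.betavariate(1 + xA, 1 + nA - xA)  # posterior draw for variant A
        pb = rng.betavariate(1 + xB, 1 + nB - xB)  # posterior draw for variant B
        wins += pb > pa
    return wins / draws
```

On the example data this comes out around 0.93: suggestive, but still a judgment call against a decision threshold, much like α.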
Common Pitfalls & How to Avoid Them
- Misinterpreting p-values: A p-value is not the probability H₀ is true.
- Multiple peeks without correction: Inflates Type I errors—use planned looks or sequential methods.
- Underpowered tests: Leads to inconclusive results—plan with MDE and power.
- Metric shift & novelty effects: Run long enough to capture stabilized user behavior.
- Winner’s curse: Significant early winners may regress—replicate or run holdout validation.
Reporting Template
- Hypothesis: H₀: pA = pB; H₁: pA ≠ pB (two-sided)
- Design: α=0.05, power=0.8, MDE=…
- Data: nA,xA,pA; nB,xB,pB
- Analysis: two-proportion z-test (pooled), 95% CI
- Result: p-value = …, z = …, 95% CI = […, …], effect = absolute … / relative …
- Decision: reject/fail to reject H₀
- Notes: peeking policy, multiple-test adjustments, assumptions check
Final Takeaway
Frequentist A/B testing gives you a disciplined framework to decide whether a product change truly moves your metric or if the observed lift could be random noise. With clear error control, simple decision rules, and mature tooling, it remains a workhorse for experimentation at scale.