
What is Sample Ratio Mismatch?
Sample Ratio Mismatch (SRM) is when the observed allocation of users to variants differs significantly from the planned allocation.
Example: You configured a 50/50 split, but after 10,000 users you see 5,300 in A and 4,700 in B. That’s likely SRM.
SRM means the randomization or eligibility pipeline is biased (or data capture is broken), so any effect estimates (lift, p-values, etc.) can’t be trusted.
How SRM Works (Conceptually)
When you specify a target split like 50/50 or 33/33/34, each incoming unit (user, device, session, etc.) should be randomly bucketed so that the expected distribution matches your target in expectation.
Formally, for a test with k variants and total N assigned units, the expected count for variant i is:

E_i = p_i × N

where p_i is the target proportion for variant i and N is the total sample size. If the observed counts, O_i, differ from the expected counts more than chance alone would allow, you have an SRM.
How to Identify SRM (Step-by-Step)
1) Use a Chi-Square Goodness-of-Fit Test (recommended)
For k variants, compute:

χ² = Σ_i (O_i − E_i)² / E_i

with degrees of freedom df = k − 1. Compute the p-value from the chi-square distribution. If the p-value is very small (common thresholds: 10⁻³ to 10⁻⁶), you’ve likely got an SRM.
Example (two-arm 50/50):
N = 10,000; O_A = 5,300; O_B = 4,700; E_A = E_B = 5,000
χ² = (5,300 − 5,000)²/5,000 + (4,700 − 5,000)²/5,000 = 18 + 18 = 36
With df = 1, p ≈ 1.97×10⁻⁹. This triggers SRM.
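The two-arm check above can be sketched in a few lines of Python. This is a minimal standalone version (the function name `srm_check_two_arm` is mine, not from the original); for df = 1 the chi-square p-value reduces to erfc(√(χ²/2)), so no statistics library is needed:

```python
import math

def srm_check_two_arm(n_a: int, n_b: int) -> tuple[float, float]:
    """Chi-square goodness-of-fit test against a 50/50 target (df = 1).

    Returns (chi2, p_value). For df = 1 the chi-square survival
    function equals erfc(sqrt(chi2 / 2)), so math.erfc suffices.
    """
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_check_two_arm(5300, 4700)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # chi2 = 36.0, p = 1.97e-09
```

For k > 2 arms or non-equal targets, the same statistic works with df = k − 1; at that point a library routine such as `scipy.stats.chisquare` is the easier path.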
2) Visual/Operational Checks
- Live split dashboard: Show observed vs. expected % by variant.
- Stratified checks: Repeat the chi-square by country, device, browser, app version, traffic source, time-of-day to find where the skew originates.
- Time series: Plot cumulative allocation over time—SRM that “drifts” may indicate a rollout, caching, or traffic-mix issue.
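The stratified check can be run as a loop over segments, each tested against the same 50/50 target. A sketch, with hypothetical segment counts chosen to illustrate a skew isolated to one platform (segment names and the `two_arm_srm_p` helper are illustrative, not from the original):

```python
import math

def two_arm_srm_p(n_a: int, n_b: int) -> float:
    """p-value of a chi-square goodness-of-fit test vs. a 50/50 target (df = 1)."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical per-segment counts: (arm A, arm B).
segments = {
    "web":     (2600, 2550),
    "ios":     (1400, 1380),
    "android": (1300,  770),  # skewed -- likely where the bug lives
}

for name, (n_a, n_b) in segments.items():
    p = two_arm_srm_p(n_a, n_b)
    flag = "  <-- investigate" if p < 1e-4 else ""
    print(f"{name:8s} A={n_a:5d} B={n_b:5d} p={p:.3g}{flag}")
```

Note the overall split can even look healthy while one segment is badly skewed, which is exactly why the stratified pass matters.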
3) Early-Warning Rule of Thumb
If your observed proportion deviates from the target by more than a few standard errors early in the test, investigate. For two arms with target p = 0.5, the sampling variance of the observed share under perfect randomization is p(1 − p)/N = 0.25/N, i.e., a standard error of 0.5/√N.
Large persistent deviations → likely SRM.
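The rule of thumb above amounts to a z-score on arm A’s observed share. A minimal sketch (the function name is mine); note that for two arms z² equals the chi-square statistic, so the two checks agree:

```python
import math

def allocation_z_score(n_a: int, n_b: int, target: float = 0.5) -> float:
    """Standardized deviation of arm A's observed share from its target.

    Under perfect randomization, SE = sqrt(target * (1 - target) / N);
    |z| persistently above ~3-4 early in the test is an SRM warning sign.
    """
    n = n_a + n_b
    observed = n_a / n
    se = math.sqrt(target * (1 - target) / n)
    return (observed - target) / se

print(allocation_z_score(5300, 4700))  # ~6 standard errors: investigate
```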
Common Causes of SRM
- Eligibility asymmetry: Filters (geo, device, login state, new vs. returning) applied after assignment or applied differently per variant.
- Randomization at the wrong unit: Assigning by session but analyzing by user (or vice versa); cross-device users collide.
- Inconsistent hashing/salts: Different hash salt/seed per service or per page; some code paths skip/override the assignment.
- Sticky sessions / caching / CDNs: Edge caching or load balancer stickiness pinning certain users to one variant.
- Traffic shaping / rollouts: Feature flags, canary releases, or time-based rollouts inadvertently biasing traffic into one arm.
- Bot or test traffic: Non-human or QA traffic not evenly distributed (or filtered in one arm only).
- Telemetry loss / logging gaps: Events dropped more in one arm (ad-blockers, blocked endpoints, CORS, mobile SDK bugs).
- User-ID vs. device-ID mismatch: Some users bucketed by cookie, others by account ID; cookie churn changes ratios.
- Late triggers: Assignment happens at “conversion event” time in one arm but at page load in another.
- Geo or platform routing differences: App vs. web, iOS vs. Android, or specific regions routed to different infrastructure.
How to Prevent SRM (Design & Implementation)
- Choose the right unit of randomization (usually user). Keep it consistent from assignment through analysis.
- Server-side assignment with deterministic hashing on a stable ID (e.g., user_id). Example mapping: assign to A if (H(user_id) mod M) / M < p, where H is a stable hash, M is a large modulus (e.g., 10⁶), and p is the target proportion for A.
- Single source of truth for assignment (SDKs/services call the same bucketing service).
- Pre-exposure assignment: Decide the variant before any UI/network differences occur.
- Symmetric eligibility: Apply identical inclusion/exclusion filters before assignment.
- Consistent rollout & flags: If you use gradual rollouts, do it outside the experiment or symmetrically across arms.
- Bot/QA filtering: Detect and exclude bots and internal IPs equally for all arms.
- Observability: Log (unit_id, assigned_arm, timestamp, eligibility_flags, platform, geo) to a central stream. Monitor the split, by segment, in real time.
- Fail-fast alerts: Trigger alerts when the SRM p-value falls below a strict threshold (e.g., p < 10⁻⁴).
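The deterministic-hashing mapping described above can be sketched as follows. This is an illustrative implementation, not the original author’s exact scheme; the function name, salt format, and choice of SHA-256 are my assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, p_a: float = 0.5) -> str:
    """Deterministic server-side bucketing: hash a stable ID with a
    per-experiment salt, map it to [0, 1), and compare to the target share.

    The same (user_id, salt) pair always yields the same variant on any
    service, giving a single source of truth for assignment.
    """
    M = 10**6  # large modulus, per the mapping in the text
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % M
    return "A" if bucket / M < p_a else "B"

# Sticky: repeated calls never flip a user's arm.
assert assign_variant("user-42", "checkout-test-v1") == \
       assign_variant("user-42", "checkout-test-v1")
```

A per-experiment salt matters: reusing one salt across experiments puts the same users in the same arms every time, correlating experiments with each other.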
How to Fix SRM (Triage & Remediation)
- Pause the experiment immediately. Do not interpret effect estimates from an SRM-affected test.
- Localize the bias. Recompute chi-square by segment (geo, device, source). The segment with the strongest SRM often points to the root cause.
- Audit the assignment path.
- Verify the unit ID is consistent (user_id vs. cookie).
- Check hash function + salt are identical everywhere.
- Ensure assignment occurs pre-render and isn’t skipped due to timeouts.
- Check eligibility filters. Confirm identical filters are applied before assignment and in both arms.
- Review infra & delivery. Look for sticky sessions, CDN cache keys, or feature flag rollouts that differ by arm.
- Inspect telemetry. Compare event loss rates by arm/platform. Fix SDK/network issues (e.g., batch size, retry logic, CORS).
- Sanitize traffic. Exclude bots/internal traffic uniformly; re-run SRM checks.
- Rerun a smoke test. After fixes, run a small, short dry-run experiment to confirm the split is healthy (no SRM) before relaunching the real test.
Analyst’s Toolkit (Ready-to-Use)
- SRM chi-square (two-arm 50/50): χ² = (O_A − N/2)²/(N/2) + (O_B − N/2)²/(N/2), with df = 1
- General k-arm expected counts: E_i = p_i × N
- Standard error for a two-arm proportion (target p): SE = √(p(1 − p)/N)
Practical Checklist
- Confirm unit of randomization and use stable IDs.
- Perform server-side deterministic hashing with shared salt.
- Apply eligibility before assignment, symmetrically.
- Exclude bots/QA consistently.
- Instrument SRM alerts (e.g., chi-square p < 10⁻⁴).
- Segment SRM monitoring by platform/geo/source/time.
- Pause & investigate immediately if SRM triggers.
Summary
SRM isn’t a minor annoyance—it’s a stop sign. It tells you that the randomization or measurement is broken, which can fabricate uplifts or hide regressions. Detect it early with a chi-square test, design your experiments to prevent it (stable IDs, deterministic hashing, symmetric eligibility), and never ship decisions from an SRM-affected test.