
What is Sample Ratio Mismatch?
Sample Ratio Mismatch (SRM) is when the observed allocation of users to variants differs significantly from the planned allocation.
Example: You configured a 50/50 split, but after 10,000 users you see 5,300 in A and 4,700 in B. That’s likely SRM.
SRM means the randomization or eligibility pipeline is biased (or data capture is broken), so any effect estimates (lift, p-values, etc.) can’t be trusted.
How SRM Works (Conceptually)
When you specify a target split like 50/50 or 33/33/34, each incoming unit (user, device, session, etc.) should be randomly bucketed so that the expected distribution matches your target in expectation.
Formally, for a test with k variants and total N assigned units, the expected count for variant i is:

E_i = p_i × N

where p_i is the target proportion for variant i and N is the total sample size. If the observed counts, O_i, differ from the expected counts more than chance alone would allow, you have an SRM.
How to Identify SRM (Step-by-Step)
1) Use a Chi-Square Goodness-of-Fit Test (recommended)
For k variants, compute:

χ² = Σ_i (O_i − E_i)² / E_i

with degrees of freedom df = k − 1. Compute the p-value from the chi-square distribution. If the p-value is very small (common thresholds: 10⁻³ to 10⁻⁶), you’ve likely got an SRM.
Example (two-arm 50/50):
N = 10,000; O_A = 5,300; O_B = 4,700; E_A = E_B = 5,000
χ² = (5,300 − 5,000)²/5,000 + (4,700 − 5,000)²/5,000 = 18 + 18 = 36
With df = 1, p ≈ 1.97×10⁻⁹. This triggers SRM.
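The two-arm check above can be sketched in a few lines of Python. This is a minimal standalone version (the function name `srm_check_two_arm` is mine, not from the original); for df = 1 the chi-square p-value reduces to erfc(√(χ²/2)), so no statistics library is needed:

```python
import math

def srm_check_two_arm(n_a: int, n_b: int) -> tuple[float, float]:
    """Chi-square goodness-of-fit test against a 50/50 target (df = 1).

    Returns (chi2, p_value). For df = 1 the chi-square survival
    function equals erfc(sqrt(chi2 / 2)), so math.erfc suffices.
    """
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_check_two_arm(5300, 4700)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # chi2 = 36.0, p = 1.97e-09
```

For k > 2 arms or non-equal targets, the same statistic works with df = k − 1; at that point a library routine such as `scipy.stats.chisquare` is the easier path.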
2) Visual/Operational Checks
- Live split dashboard: Show observed vs. expected % by variant.
- Stratified checks: Repeat the chi-square by country, device, browser, app version, traffic source, time-of-day to find where the skew originates.
- Time series: Plot cumulative allocation over time—SRM that “drifts” may indicate a rollout, caching, or traffic-mix issue.
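The stratified check can be run as a loop over segments, each tested against the same 50/50 target. A sketch, with hypothetical segment counts chosen to illustrate a skew isolated to one platform (segment names and the `two_arm_srm_p` helper are illustrative, not from the original):

```python
import math

def two_arm_srm_p(n_a: int, n_b: int) -> float:
    """p-value of a chi-square goodness-of-fit test vs. a 50/50 target (df = 1)."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical per-segment counts: (arm A, arm B).
segments = {
    "web":     (2600, 2550),
    "ios":     (1400, 1380),
    "android": (1300,  770),  # skewed -- likely where the bug lives
}

for name, (n_a, n_b) in segments.items():
    p = two_arm_srm_p(n_a, n_b)
    flag = "  <-- investigate" if p < 1e-4 else ""
    print(f"{name:8s} A={n_a:5d} B={n_b:5d} p={p:.3g}{flag}")
```

Note the overall split can even look healthy while one segment is badly skewed, which is exactly why the stratified pass matters.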
3) Early-Warning Rule of Thumb
If your observed proportion deviates from the target by more than a few standard errors early in the test, investigate. For two arms with target p = 0.5, the sampling variance of the observed share under perfect randomization is p(1 − p)/N = 0.25/N, i.e., a standard error of 0.5/√N.
Large persistent deviations → likely SRM.
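The rule of thumb above amounts to a z-score on arm A’s observed share. A minimal sketch (the function name is mine); note that for two arms z² equals the chi-square statistic, so the two checks agree:

```python
import math

def allocation_z_score(n_a: int, n_b: int, target: float = 0.5) -> float:
    """Standardized deviation of arm A's observed share from its target.

    Under perfect randomization, SE = sqrt(target * (1 - target) / N);
    |z| persistently above ~3-4 early in the test is an SRM warning sign.
    """
    n = n_a + n_b
    observed = n_a / n
    se = math.sqrt(target * (1 - target) / n)
    return (observed - target) / se

print(allocation_z_score(5300, 4700))  # ~6 standard errors: investigate
```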
Common Causes of SRM
- Eligibility asymmetry: Filters (geo, device, login state, new vs. returning) applied after assignment or applied differently per variant.
- Randomization at the wrong unit: Assigning by session but analyzing by user (or vice versa); cross-device users collide.
- Inconsistent hashing/salts: Different hash salt/seed per service or per page; some code paths skip/override the assignment.
- Sticky sessions / caching / CDNs: Edge caching or load balancer stickiness pinning certain users to one variant.
- Traffic shaping / rollouts: Feature flags, canary releases, or time-based rollouts inadvertently biasing traffic into one arm.
- Bot or test traffic: Non-human or QA traffic not evenly distributed (or filtered in one arm only).
- Telemetry loss / logging gaps: Events dropped more in one arm (ad-blockers, blocked endpoints, CORS, mobile SDK bugs).
- User-ID vs. device-ID mismatch: Some users bucketed by cookie, others by account ID; cookie churn changes ratios.
- Late triggers: Assignment happens at “conversion event” time in one arm but at page load in another.
- Geo or platform routing differences: App vs. web, iOS vs. Android, or specific regions routed to different infrastructure.
How to Prevent SRM (Design & Implementation)
- Choose the right unit of randomization (usually user). Keep it consistent from assignment through analysis.
- Server-side assignment with deterministic hashing on a stable ID (e.g., user_id). Example mapping: assign to A if (H(user_id) mod M) / M < p, where H is a stable hash, M is a large modulus (e.g., 10⁶), and p is the target proportion for A.
- Single source of truth for assignment (SDKs/services call the same bucketing service).
- Pre-exposure assignment: Decide the variant before any UI/network differences occur.
- Symmetric eligibility: Apply identical inclusion/exclusion filters before assignment.
- Consistent rollout & flags: If you use gradual rollouts, do it outside the experiment or symmetrically across arms.
- Bot/QA filtering: Detect and exclude bots and internal IPs equally for all arms.
- Observability: Log (unit_id, assigned_arm, timestamp, eligibility_flags, platform, geo) to a central stream. Monitor the split, by segment, in real time.
- Fail-fast alerts: Trigger alerts when the SRM p-value falls below a strict threshold (e.g., p < 10⁻⁴).
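The deterministic-hashing mapping described above can be sketched as follows. This is an illustrative implementation, not the original author’s exact scheme; the function name, salt format, and choice of SHA-256 are my assumptions:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, p_a: float = 0.5) -> str:
    """Deterministic server-side bucketing: hash a stable ID with a
    per-experiment salt, map it to [0, 1), and compare to the target share.

    The same (user_id, salt) pair always yields the same variant on any
    service, giving a single source of truth for assignment.
    """
    M = 10**6  # large modulus, per the mapping in the text
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % M
    return "A" if bucket / M < p_a else "B"

# Sticky: repeated calls never flip a user's arm.
assert assign_variant("user-42", "checkout-test-v1") == \
       assign_variant("user-42", "checkout-test-v1")
```

A per-experiment salt matters: reusing one salt across experiments puts the same users in the same arms every time, correlating experiments with each other.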
How to Fix SRM (Triage & Remediation)
- Pause the experiment immediately. Do not interpret effect estimates from an SRM-affected test.
- Localize the bias. Recompute chi-square by segment (geo, device, source). The segment with the strongest SRM often points to the root cause.
- Audit the assignment path.
- Verify the unit ID is consistent (user_id vs. cookie).
- Check hash function + salt are identical everywhere.
- Ensure assignment occurs pre-render and isn’t skipped due to timeouts.
- Check eligibility filters. Confirm identical filters are applied before assignment and in both arms.
- Review infra & delivery. Look for sticky sessions, CDN cache keys, or feature flag rollouts that differ by arm.
- Inspect telemetry. Compare event loss rates by arm/platform. Fix SDK/network issues (e.g., batch size, retry logic, CORS).
- Sanitize traffic. Exclude bots/internal traffic uniformly; re-run SRM checks.
- Rerun a smoke test. After fixes, run a small, short dry-run experiment to confirm the split is healthy (no SRM) before relaunching the real test.
Analyst’s Toolkit (Ready-to-Use)
- SRM chi-square (two-arm 50/50): χ² = (O_A − N/2)²/(N/2) + (O_B − N/2)²/(N/2), with df = 1
- General k-arm expected counts: E_i = p_i × N
- Standard error for a two-arm proportion (target p): SE = √(p(1 − p)/N)
Practical Checklist
- Confirm unit of randomization and use stable IDs.
- Perform server-side deterministic hashing with shared salt.
- Apply eligibility before assignment, symmetrically.
- Exclude bots/QA consistently.
- Instrument SRM alerts (e.g., chi-square p < 10⁻⁴).
- Segment SRM monitoring by platform/geo/source/time.
- Pause & investigate immediately if SRM triggers.
Summary
SRM isn’t a minor annoyance—it’s a stop sign. It tells you that the randomization or measurement is broken, which can fabricate uplifts or hide regressions. Detect it early with a chi-square test, design your experiments to prevent it (stable IDs, deterministic hashing, symmetric eligibility), and never ship decisions from an SRM-affected test.