
What is a “Unit of Randomization”?
The unit of randomization is the entity you randomly assign to variants (A or B). It’s the “thing” that receives the treatment: a user, a session, a device, a household, a store, a geographic region, etc.
Choosing this unit determines:
- Who gets which experience
- How independence assumptions hold (or break)
- How you compute statistics and sample size
- How actionable and unbiased your results are
How It Works (at a high level)
- Define exposure: decide what entity must see a consistent experience (e.g., “Logged-in user must always see the same variant across visits.”).
- Create an ID: select an identifier for that unit (e.g., user_id, device_id, household_id, store_id).
- Hash & assign: use a stable hashing function to map each ID into variant A or B with the desired split (e.g., 50/50); see the sketch after this list.
- Persist: ensure the unit sticks to its assigned variant on every exposure (stable bucketing).
- Analyze accordingly: aggregate metrics at or above the unit level; use the right variance model (especially for clusters).
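To make the hash-and-assign step concrete, here is a minimal Python sketch of stable bucketing. It assumes a 50/50 split and hashes the experiment name together with the unit ID; the function and names are illustrative, not from any particular framework:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a unit ID to a variant.

    Hashing experiment + unit_id means the same unit always gets the
    same variant within an experiment, while assignments stay
    independent across experiments.
    """
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex chars -> uniform integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "A" if bucket < split else "B"

# The same user lands in the same bucket on every call:
print(assign_variant("user_42", "new_signup_flow"))
```

Including the experiment name in the hash key keeps a unit's buckets uncorrelated across experiments, so one test's assignment never leaks into another's.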
Common Units of Randomization (with pros/cons and when to use)
1) User-Level (Account ID or Login ID)
- What it is: Each unique user/account is assigned to a variant.
- Use when: Logged-in products; experiences should persist across devices and sessions.
- Pros: Clean independence between users; avoids cross-device contamination for logged-in flows.
- Cons: Requires reliable, unique IDs; guest traffic may be excluded or need fallback logic.
2) Device-Level (Device ID / Mobile Advertiser ID)
- What it is: Each physical device is assigned.
- Use when: Native apps; no login, but device ID is stable.
- Pros: Better than cookies for persistence; good for app experiments.
- Cons: Same human on multiple devices may see different variants; may bias human-level metrics.
3) Cookie-Level (Browser Cookie)
- What it is: Each browser cookie gets a variant.
- Use when: Anonymous web traffic without login.
- Pros: Simple to implement.
- Cons: Cookies expire/clear; users have multiple browsers/devices → contamination and assignment churn.
4) Session-Level
- What it is: Each session is randomized; the same user may see different variants across sessions.
- Use when: You intentionally want short-lived treatment (e.g., page layout in a one-off landing funnel).
- Pros: Fast ramp, lots of independent observations.
- Cons: Violates persistence; learning/carryover effects make interpretation tricky for longer journeys.
5) Pageview/Request-Level
- What it is: Every pageview or API request is randomized.
- Use when: Low-stakes UI tweaks with negligible carryover; ads/creative rotation tests.
- Pros: Maximum volume quickly.
- Cons: Massive contamination; not suitable when the experience should be consistent within a visit.
6) Household-Level
- What it is: All members/devices of a household share the same assignment (derived from address or shared account).
- Use when: TV/streaming, grocery delivery, multi-user homes.
- Pros: Limits within-home interference; aligns with purchase behavior.
- Cons: Hard to define reliably; potential privacy constraints.
7) Network/Team/Organization-Level
- What it is: Randomize at a group/organization level (e.g., company admin sets a feature; all employees see it).
- Use when: B2B products; settings that affect the whole group.
- Pros: Avoids spillovers inside an org.
- Cons: Fewer units → lower statistical power; requires cluster-aware analysis.
8) Geographic/Store/Region-Level (Cluster Randomization)
- What it is: Entire locations are assigned (cities, stores, countries, data centers).
- Use when: Pricing, inventory, logistics, or features tied to physical/geo constraints.
- Pros: Realistic operational measurement, cleaner separation across regions.
- Cons: Correlated outcomes within a cluster; requires cluster-robust analysis and typically larger sample sizes.
Why the Unit of Randomization Matters
1) Validity (Independence & Interference)
Statistical tests assume independent observations. If people in the control are affected by those in treatment (interference), estimates are biased. Picking a unit that contains spillovers (e.g., randomize at org or store level) preserves validity.
2) Power & Sample Size (Design Effect)
Clustered units (households, stores, orgs) share similarities, captured by the intra-class correlation (ICC), often denoted ρ. This inflates variance via the design effect:

DE = 1 + (m − 1)ρ

where m is the average cluster size. Your effective sample size becomes:

n_eff = n / DE

Larger clusters or higher ρ → bigger DE → less power for the same raw n.
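As a worked illustration with hypothetical numbers:

```python
def design_effect(m: float, rho: float) -> float:
    """Design effect for clusters of average size m with ICC rho."""
    return 1 + (m - 1) * rho

# Hypothetical: 200 stores of ~50 customers each, ICC = 0.05.
n_raw = 200 * 50
de = design_effect(m=50, rho=0.05)  # 1 + 49 * 0.05 = 3.45
print(n_raw / de)                   # effective n ≈ 2899, not 10000
```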
3) Consistency of Experience
Randomizing at the user level with stable bucketing ensures a user’s experience doesn’t flip between variants, avoiding dilution and confusion.
4) Interpretability & Actionability
If you sell at the store level, store-level randomization makes metrics easier to translate into operational decisions. If you optimize user engagement, user-level makes more sense.
How to Choose the Right Unit (Decision Checklist)
- Where do spillovers happen?
Pick the smallest unit that contains meaningful interference (user ↔ household ↔ org ↔ region).
- What is the primary decision maker?
If rollouts happen per account/org/region, align the unit with that boundary.
- Can you persist assignment?
Use stable identifiers and hashing (e.g., SHA-256 on user_id + experiment_name) to keep assignments sticky.
- How will you analyze it?
- User/cookie/device: standard two-sample tests aggregated per unit.
- Cluster (org/store/geo): use cluster-robust standard errors or mixed-effects models; adjust for the design effect in planning (see the sketch after this list).
- Is the ID reliable & unique?
Prefer user_id over cookie when possible. If only cookies exist, add fallbacks and measure churn.
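For the cluster case, here is a minimal sketch of cluster-robust analysis with statsmodels, assuming a pandas DataFrame with one row per customer and illustrative columns sales, variant, and store_id (the data is simulated only to make the example self-contained):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated data: 40 stores, 50 customers each, store-level assignment.
stores = pd.DataFrame({
    "store_id": range(40),
    "variant": rng.permutation(["A"] * 20 + ["B"] * 20),
})
df = stores.loc[stores.index.repeat(50)].reset_index(drop=True)
store_effect = df["store_id"].map(
    dict(zip(range(40), rng.normal(0, 2, size=40)))  # shared store effect → ICC
)
df["sales"] = (
    10
    + 2 * (df["variant"] == "B")      # true treatment effect
    + store_effect
    + rng.normal(0, 3, size=len(df))  # customer-level noise
)

# OLS with standard errors clustered on the unit of randomization.
model = smf.ols("sales ~ variant", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["store_id"]}
)
print(model.summary().tables[1])
```

Dropping the groups argument would treat 2,000 customers as independent observations when there are really only 40 randomized units, which is exactly the inflated-false-positive pitfall described later.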
Practical Implementation Tips
- Stable Bucketing: Hash the chosen unit ID to a uniform number in [0,1); map ranges to variants (e.g., <0.5 → A, ≥0.5 → B). Store assignment server-side for reliability.
- Cross-Device Consistency: If the same human might use multiple devices, prefer user-level (requires login) or implement a linking strategy (e.g., email capture) before randomization.
- Exposure Control: Ensure treatment is only applied after assignment; log exposures to avoid partial-treatment bias.
- Metric Aggregation: Aggregate outcomes per randomized unit first (e.g., user-level conversion), then compare arms; see the sketch after this list. Avoid pageview-level analysis when randomizing at user level.
- Bot & Duplicate Filtering: Scrub bots and detect duplicate IDs (e.g., shared cookies) to reduce contamination.
- Pre-Experiment Checks: Verify balance on key covariates (traffic source, device, geography) across variants for the chosen unit.
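Here is a minimal sketch of unit-level aggregation plus a simple sample-ratio check, assuming a pandas DataFrame of event-level rows with illustrative columns user_id, variant, and converted (0/1):

```python
import pandas as pd
from scipy import stats

def analyze(events: pd.DataFrame) -> None:
    # Collapse to one row per randomized unit: did this user ever convert?
    per_user = (
        events.groupby(["user_id", "variant"], as_index=False)["converted"].max()
    )

    # Sample-ratio check: does the observed split match the intended 50/50?
    counts = per_user["variant"].value_counts()
    _, p_srm = stats.chisquare(counts)
    if p_srm < 0.001:
        print(f"Warning: possible sample-ratio mismatch (p = {p_srm:.2g})")

    # Compare conversion between arms, one observation per user.
    a = per_user.loc[per_user["variant"] == "A", "converted"]
    b = per_user.loc[per_user["variant"] == "B", "converted"]
    _, p = stats.ttest_ind(a, b)
    print(f"A: {a.mean():.3f}  B: {b.mean():.3f}  p = {p:.3f}")
```

A two-proportion z-test would be equivalent at this scale; the key point is that the test sees one observation per randomized unit, not per pageview.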
Examples
- Pricing test in retail chain → randomize at store level; compute sales per store; analyze with cluster-robust errors; account for region seasonality.
- New signup flow on a web app → randomize at user level (or cookie if anonymous); ensure users see the same variant across sessions.
- Homepage hero image rotation for paid ads landing page → potentially session or pageview level; keep awareness of contamination if users return.
Common Pitfalls (and how to avoid them)
- Using too granular a unit (pageview) for features with memory/carryover → inconsistent experiences and biased results.
Fix: move to session or user level.
- Ignoring clustering when randomizing stores/teams → inflated false positives.
Fix: use cluster-aware analysis and plan for the design effect.
- Cookie churn breaks persistence → variant switching mid-experiment.
Fix: server-side assignment with long-lived identifiers; encourage login.
- Interference across units (social/network effects) → contamination.
Fix: enlarge the unit (household/org/region) or use geo-experiments with guard zones.