What Is Stable Bucketing?

Stable bucketing is a repeatable, deterministic way to assign units (users, sessions, accounts, devices, etc.) to experiment variants so that the same unit always lands in the same bucket whenever the assignment is recomputed. It’s typically implemented by hashing a unit identifier together with an experiment “seed” (or namespace) and mapping the result to a bucket index.

Key idea: assignment never changes for a given (unit_id, experiment_seed) unless you deliberately change the seed or unit of bucketing. This consistency is crucial for clean experiment analysis and operational simplicity.

Why We Need It (At a Glance)

  • Consistency: Users don’t flip between A and B when they return later.
  • Reproducibility: You can recompute assignments offline for debugging and analysis.
  • Scalability: Works statelessly across services and languages.
  • Safety: Lets you ramp traffic up or down without re-randomizing previously assigned users.
  • Analytics integrity: Reduces bias and cross-contamination when users see multiple experiments.

How Stable Bucketing Works (Step-by-Step)

1) Choose Your Unit of Bucketing

Pick the identity that best matches the causal surface of your treatment:

  • User ID (most common): stable across sessions/devices (if you have login).
  • Device ID: when login is rare; beware of cross-device spillover.
  • Session ID / Request ID: only for per-request or per-session treatments.

Rule of thumb: bucket at the level where the treatment is applied and outcomes are measured.

2) Build a Deterministic Hash

Compute a hash over a canonical string like:

canonical_key = experiment_namespace + ":" + unit_id
hash = H(canonical_key)  // e.g., 64-bit MurmurHash3, xxHash, SipHash

Desiderata: fast, language-portable implementations, low bias, and uniform output over a large integer space (e.g., 2^64).
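
A minimal Python sketch of this step, using the standard library's hashlib (SHA-256 truncated to 64 bits) as a portable stand-in for MurmurHash3 or xxHash; the hash64 name and the example key are illustrative:

import hashlib

def hash64(canonical_key: str) -> int:
    # SHA-256 needs no third-party package and is portable across
    # languages; take the first 8 bytes as an unsigned 64-bit integer.
    digest = hashlib.sha256(canonical_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

print(hash64("checkout_v2:user_12345"))  # same input, same output, every run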

3) Normalize to [0, 1)

Convert the integer hash to a unit interval. With a 64-bit unsigned hash h ∈ {0, …, 2^64 − 1}:

u = h / 2^64   // floating-point in [0,1)
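
One floating-point caveat, shown as a Python sketch: with 64-bit doubles, h / 2**64 can round up to exactly 1.0 when h is within about 2^11 of 2^64. Keeping only the top 53 bits (a double's mantissa width) makes the division exact; this guard is an implementation choice, not part of the formula above:

def normalize(h: int) -> float:
    # (h >> 11) is at most 2^53 - 1, so dividing by 2^53 is exact
    # and the result is guaranteed to stay strictly below 1.0.
    return (h >> 11) / float(1 << 53)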

4) Map to Buckets

If you have K total buckets (e.g., 1000) and want to allocate N of them to the experiment (others remain “control” or “not in experiment”), you can map:

bucket = floor(u × K)   // integer bucket index in {0, …, K−1}

Then assign variant ranges. For example, for a 50/50 split between two variants A and B over the full experiment allocation:

  • A gets buckets [0,K/2−1]
  • B gets buckets [K/2,K−1]

You can also reserve a global “control” by giving it a fixed bucket range that is outside any experiment’s allocation.
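
A small Python sketch of the mapping, assuming K = 1000 and the 50/50 ranges above; the clamp guards against a u that rounds to 1.0:

K = 1000  # total buckets

def to_bucket(u: float, k: int = K) -> int:
    return min(int(u * k), k - 1)  # floor(u * K), clamped to the top bucket

def variant_for(bucket: int, k: int = K) -> str:
    # A gets [0, K/2 - 1], B gets [K/2, K - 1].
    return "A" if bucket < k // 2 else "B"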

5) Control Allocation (Traffic Percentage)

If the intended inclusion probability is p (e.g., 10%), assign the first p × K buckets to the experiment:

N = floor(p × K)

Include a unit if bucket < N. Split inside N across variants according to desired proportions.
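
Combining the inclusion gate with a 50/50 split inside N, a sketch (the helper name and defaults are illustrative):

def allocate(bucket: int, p: float, k: int = 1000):
    n = int(p * k)                          # N = floor(p * K) experiment buckets
    if bucket >= n:
        return None                         # not in experiment
    return "A" if bucket < n // 2 else "B"  # split N across variants

print(allocate(37, 0.10))   # bucket 37 < N = 100, first half -> "A"
print(allocate(250, 0.10))  # bucket 250 >= 100 -> None (excluded)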

Minimal Pseudocode (Language-Agnostic)

function assign_variant(unit_id, namespace, variants):
    // variants = [{name: "A", weight: 0.5}, {name: "B", weight: 0.5}]
    key = namespace + ":" + canonicalize(unit_id)
    h = Hash64(key)                       // e.g., MurmurHash3 64-bit
    u = h / 2^64                          // float in [0,1)
    // cumulative weights to pick variant
    cum = 0.0
    for v in variants:
        cum += v.weight
        if u < cum:
            return v.name
    return variants[-1].name              // fallback for rounding

Deterministic: same (unit_id, namespace) → same u → same variant every time.
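
The same logic as runnable Python, assuming the SHA-256-based 64-bit hash from Step 2 (MurmurHash3 or xxHash would slot in the same way):

import hashlib

def assign_variant(unit_id: str, namespace: str, variants) -> str:
    # variants: list of (name, weight) pairs whose weights sum to 1.
    digest = hashlib.sha256(f"{namespace}:{unit_id}".encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")   # 64-bit hash
    u = (h >> 11) / float(1 << 53)          # exact float in [0, 1)
    cum = 0.0
    for name, weight in variants:
        cum += weight
        if u < cum:
            return name
    return variants[-1][0]                  # fallback for rounding

ab = [("A", 0.5), ("B", 0.5)]
assert assign_variant("user_12345", "exp1", ab) == assign_variant("user_12345", "exp1", ab)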

Statistical Properties (Why It Works)

Assuming the hash behaves like a uniform random function over [0, 1), the inclusion indicator I_i for each unit i is Bernoulli with target probability p. Over n eligible units, the included count n_A = I_1 + … + I_n satisfies:

E[n_A] = n × p    and    Var[n_A] = n × p × (1 − p)

With stable bucketing, units included at ramp-up remain included as you increase p (monotone ramps), which avoids re-randomization noise.
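
For concreteness: with n = 10,000 eligible units and p = 0.2, E[n_A] = 2,000 and SD[n_A] = sqrt(10,000 × 0.2 × 0.8) = 40, so an observed count outside roughly 2,000 ± 120 (three standard deviations) points to an instrumentation or eligibility problem rather than chance.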

Benefits & Why It’s Important (In Detail)

1) User Experience Consistency

  • A returning user continues to see the same treatment, preventing confusion and contamination.
  • Supports long-running or incremental rollouts (10% → 25% → 50% → 100%) without users flipping between variants.

2) Clean Causal Inference

  • Avoids cross-over effects that can bias estimates when users switch variants mid-experiment.
  • Ensures SUTVA-like stability at the chosen unit (no unit’s potential outcomes change due to assignment instability).

3) Operational Simplicity & Scale

  • Stateless assignment (derive on the fly from (unit_id, namespace)).
  • Works across microservices and languages as long as the hash function and namespace are shared.

4) Reproducibility & Debugging

  • Offline recomputation lets you verify assignments, investigate suspected sample ratio mismatches (SRM), and audit exposure logs.

5) Safe Traffic Management

  • Ramps: increasing p simply widens the bucket interval—no reshuffling of already exposed users.
  • Kill-switches: setting p=0 instantly halts new exposures while keeping analysis intact.

6) Multi-Experiment Harmony

  • Use namespaces or layered bucketing to keep unrelated experiments independent while permitting intended interactions when needed.

Practical Design Choices & Pitfalls

Hash Function

  • Prefer fast, well-tested non-cryptographic hashes (MurmurHash3, xxHash).
  • If adversarial manipulation is a risk (e.g., public IDs), consider SipHash or SHA-based hashing.

Namespace (Seed) Discipline

  • The experiment_namespace must be unique per experiment/phase; changing it re-randomizes all assignments, so change it only deliberately.
  • For follow-up experiments requiring independence, use a new namespace. For continued exposure, reuse the old one (see the sketch below).
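
A small sketch of what namespace discipline buys you; the namespaces here are hypothetical:

import hashlib

def u_for(namespace: str, unit_id: str) -> float:
    digest = hashlib.sha256(f"{namespace}:{unit_id}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / float(1 << 53)

# Different namespaces give the same unit unrelated draws:
print(u_for("checkout_v1", "user_12345"))  # original experiment
print(u_for("checkout_v2", "user_12345"))  # follow-up: re-randomized
# Reusing "checkout_v1" reproduces the original assignment exactly.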

Bucket Count & Mapping

  • Use a large K (e.g., 10,000) to get fine-grained control over traffic percentages and reduce allocation rounding issues.

Unit of Bucketing Mismatch

  • If treatment acts at the user level but you bucket by device, a single user on two devices can see different variants (spillover). Align unit with treatment.

Identity Resolution

  • Cross-device/user-merges can change effective unit IDs. Decide whether to lock assignment post-merge or recompute at login—document the policy and its analytical implications.

SRM Monitoring

  • Even with stable bucketing, instrumentation bugs, filters, and eligibility rules can create SRM. Continuously monitor observed splits against the expected p (a quick check is sketched below).
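
A quick SRM check under a normal approximation to the binomial, using only the standard library (the function name and the example counts are illustrative):

import math

def srm_z_test(n_included: int, n_total: int, p_expected: float) -> float:
    # Two-sided p-value for "observed split != expected p" under a
    # normal approximation to the binomial (fine for large n_total).
    mean = n_total * p_expected
    sd = math.sqrt(n_total * p_expected * (1 - p_expected))
    z = (n_included - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# e.g., 2,130 of 10,000 units included when p = 0.2:
print(srm_z_test(2_130, 10_000, 0.2))  # ~0.001 -> investigate, don't dismiss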

Privacy & Compliance

  • Hash only pseudonymous identifiers and avoid embedding raw PII in logs. The namespace acts as a salt, so the same unit does not produce the same hash value across experiments.

Example: Two-Variant 50/50 with 20% Traffic

Setup

  • K=10,000 buckets
  • Experiment gets p=0.2 ⇒ N=2,000 buckets
  • Within experiment, A and B each get 50% of the N buckets (1,000 each)

Mapping

  • Include user if 0 ≤ bucket < 2000
  • If included:
    • A: 0 ≤ bucket < 1000
    • B: 1000 ≤ bucket < 2000
  • Else: not in experiment (falls through to global control)

Ramp from 20% → 40%

  • Extend inclusion to 0 ≤ bucket < 4000
  • Extend each variant by appending fresh buckets (e.g., A: [0, 1000) ∪ [2000, 3000), B: [1000, 2000) ∪ [3000, 4000)) rather than rescaling the old ranges, so previously included users keep both their inclusion and their variant; new users are added without reshuffling earlier assignments (see the sketch below).
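
A sketch of this example end to end; the range tables are one way to encode the append-only ramp described above, and the loop verifies that no unit changes assignment:

from typing import Optional

Ranges = dict[str, list[tuple[int, int]]]

RANGES_20: Ranges = {"A": [(0, 1_000)], "B": [(1_000, 2_000)]}
RANGES_40: Ranges = {"A": [(0, 1_000), (2_000, 3_000)],
                     "B": [(1_000, 2_000), (3_000, 4_000)]}

def assignment(bucket: int, ranges: Ranges) -> Optional[str]:
    for name, spans in ranges.items():
        if any(lo <= bucket < hi for lo, hi in spans):
            return name
    return None  # not in experiment

# No unit flips variant when ramping 20% -> 40%:
for bucket in range(10_000):
    before, after = assignment(bucket, RANGES_20), assignment(bucket, RANGES_40)
    assert before is None or before == after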

Math Summary (Allocation & Variant Pick)

Inclusion Decision

include = [ bucket < N ], where bucket = floor(u × K)

Variant Selection by Cumulative Weights

Let variants have weights w_1, …, w_m with w_1 + … + w_m = 1. Pick the smallest j such that:

u < w_1 + … + w_j
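
An idiomatic Python sketch of this rule using the standard library's itertools.accumulate and bisect; the helper name is illustrative:

from bisect import bisect_right
from itertools import accumulate

def pick_variant(u: float, names: list[str], weights: list[float]) -> str:
    cum = list(accumulate(weights))       # w1, w1+w2, ..., 1.0
    j = bisect_right(cum, u)              # smallest j with u < cum[j]
    return names[min(j, len(names) - 1)]  # clamp for rounding at the top

print(pick_variant(0.73, ["A", "B"], [0.5, 0.5]))  # "B"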

Implementation Tips (Prod-Ready)

  • Canonicalization: Lowercase IDs, trim whitespace, and normalize encodings before hashing (a sketch follows this list).
  • Language parity tests: Create cross-language golden tests (input → expected bucket) for your SDKs.
  • Versioning: Version your bucketing algorithm; log algo_version, namespace, and unit_id_type.
  • Exposure logs: Record (unit_id, namespace, variant, timestamp) for auditability.
  • Dry-run: Add an endpoint or feature flag to validate expected split on synthetic data before rollout.
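
A sketch combining the canonicalization and golden-test tips above; NFC normalization, trimming, and lowercasing are one example policy that must be fixed once and mirrored in every SDK:

import unicodedata

def canonicalize(unit_id: str) -> str:
    # One fixed policy, implemented identically in every SDK:
    # Unicode NFC normalization, trimmed whitespace, lowercase.
    return unicodedata.normalize("NFC", unit_id).strip().lower()

# Cross-language golden test: every SDK must reproduce these rows.
GOLDEN = [
    ("  User_12345 ", "checkout_v2", "user_12345"),  # raw, namespace, canonical
]
for raw, _ns, expected in GOLDEN:
    assert canonicalize(raw) == expected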

Takeaways

Stable bucketing is the backbone of reliable A/B testing infrastructure. By hashing a stable unit ID within a disciplined namespace, you get deterministic, scalable, and analyzable assignments. This prevents cross-over effects, simplifies rollouts, and preserves statistical validity—exactly what you need for trustworthy product decisions.