Data Science · Jul 1, 2026 · 12 min read

Why your A/B test is lying to you: peeking, alpha inflation, and always-valid inference

The classic t-test guarantees a 5% false-positive rate only if you look exactly once, at a sample size you fixed in advance. Watch a live dashboard and stop when it goes green, and your real false-positive rate can quietly climb past 30%. Here's the math, and the fix.

Abhishek Aggarwal
Co-founder, SERP Axis

The contract you're breaking

A standard fixed-horizon significance test comes with a contract, and almost everyone violates a clause of it. The contract: choose your sample size N in advance (from a power calculation), collect exactly N samples, compute the test once, and reject the null if p < α. Honor that, and the test does exactly what it promises: with α = 0.05, a 5% chance of a false positive when there is truly no effect.

What teams actually do: wire the experiment to a dashboard, watch the p-value wiggle in real time, and ship the moment it dips below 0.05. That is a different statistical procedure entirely — 'stop the first time it looks significant' — and it does not have a 5% false-positive rate. It has a much larger one, and nothing on the dashboard tells you so.

Why peeking inflates false positives

Under the null hypothesis (no real difference), the test statistic doesn't sit still at 'not significant'; it random-walks as data accumulates. A single look gives it a 5% chance of being across the p<0.05 line at that instant. Every additional look is another chance for the wandering statistic to poke across the threshold. The looks are not independent, because each one reuses all the earlier data, so the chances don't simply add up. But each look still contributes some new probability of a false alarm.

You are no longer asking 'is it significant at N?' You are asking 'does it become significant at any point along the way?' — and the probability of that is strictly higher, growing with the number of looks. Continuous monitoring is the limit of infinitely many looks. A deep result (the law of the iterated logarithm) says that under continuous monitoring of a fixed-horizon test, the probability of eventually crossing p<0.05 purely by chance tends to 1. Peek long enough and you are guaranteed a 'winner' that isn't real.

The subtle part

The p-value itself is computed correctly at each look. What's invalid is the stopping rule wrapped around it. Optional stopping turns an honest 5% test into a fishing expedition, and no single p-value on the screen reflects that you've looked a hundred times.

How bad it actually gets

This isn't a rounding error. Simulate an A/A test — two identical variants, so every 'win' is by definition false — and apply the 'stop when p<0.05' rule while checking periodically:

Number of looks during the test      Approx. real false-positive rate
1 (the honest fixed-horizon test)    ~5%
2                                    ~8%
5                                    ~14%
10                                   ~19%
Continuous / every visitor           trends toward 100% given enough time

  • The figures above are the well-known result from continuous-monitoring simulations (e.g. Johari et al.); exact numbers depend on cadence and horizon, but the direction and rough magnitude are robust.
  • Practical consequence: a team that peeks daily on dozens of experiments will 'discover' a stream of winners that fail to replicate and don't move the business, and will conclude that 'A/B testing doesn't work here,' when in fact their stopping rule manufactured the noise.
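The numbers in that table are easy to check for yourself. Here is a minimal simulation sketch, assuming normally distributed outcomes with known unit variance (so it uses a z-test rather than a t-test), evenly spaced looks, and an illustrative sample size of 500 per arm. The function name and its parameters are hypothetical, chosen for this example.

```python
import numpy as np

def peeking_false_positive_rate(n_looks, n_per_arm=500, n_sims=4000, seed=0):
    """Estimate the false-positive rate of 'stop at the first p < 0.05' in an A/A test."""
    rng = np.random.default_rng(seed)
    z_crit = 1.959964  # two-sided 5% critical value of the standard normal
    # Evenly spaced look points, the last one at the full sample size.
    looks = np.rint(np.linspace(n_per_arm / n_looks, n_per_arm, n_looks)).astype(int)
    # Both arms draw from the same N(0, 1) distribution, so the true effect is zero.
    a = rng.standard_normal((n_sims, n_per_arm))
    b = rng.standard_normal((n_sims, n_per_arm))
    # Running sum of the difference between arms, read off at each look.
    diff_sum = np.cumsum(b - a, axis=1)[:, looks - 1]
    # z-statistic for a difference of means with known unit variance per arm.
    z = diff_sum / np.sqrt(2 * looks)
    # A simulated experiment is a false positive if ANY look crosses the threshold.
    return float(np.mean(np.any(np.abs(z) > z_crit, axis=1)))

for k in (1, 2, 5, 10):
    print(f"{k:>2} looks -> estimated false-positive rate {peeking_false_positive_rate(k):.3f}")
```

Estimates will wobble by a percentage point or so from run to run, but the pattern should match the table: roughly 5% with one look, climbing steadily as looks are added.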

"Just go Bayesian" does not automatically save you

A popular claim is that Bayesian A/B testing is immune to peeking. That's half-true in a way that matters. A Bayesian posterior is coherent under optional stopping — the posterior given the data you've seen is a valid posterior no matter when you chose to stop. The likelihood principle holds.

But teams don't ship on 'the posterior is valid.' They ship on a decision rule — 'stop when P(B beats A) > 95%.' Evaluate that rule's frequentist error rate under optional stopping and it, too, can be inflated well above 5%, depending on the prior. Optional stopping is a property of your decision procedure, not of the statistical framework's label. Switching from frequentist to Bayesian without changing the 'stop when the threshold is crossed' behavior mostly relabels the same error.
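To see that inflation concretely, here is a rough simulation sketch of the 'stop when P(B beats A) > 95%' rule on an A/A test. It rests on several assumptions of mine, not anything a specific platform does: a flat Beta(1, 1) prior, a 10% baseline conversion rate, evenly spaced looks, and a normal approximation to the two Beta posteriors when computing P(B beats A). All names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bayes_rule_false_positive_rate(n_looks, n_per_arm=1000, n_sims=2000,
                                   base_rate=0.10, threshold=0.95, seed=0):
    """Share of A/A tests in which 'stop when P(B beats A) > threshold' ever fires."""
    rng = np.random.default_rng(seed)
    looks = np.rint(np.linspace(n_per_arm / n_looks, n_per_arm, n_looks)).astype(int)
    # Both arms convert at the same true rate, so any declared winner is false.
    conv_a = np.cumsum(rng.random((n_sims, n_per_arm)) < base_rate, axis=1)[:, looks - 1]
    conv_b = np.cumsum(rng.random((n_sims, n_per_arm)) < base_rate, axis=1)[:, looks - 1]

    def posterior_mean_var(conversions):
        # Beta(1, 1) prior -> Beta(1 + successes, 1 + failures) posterior.
        alpha, beta = 1 + conversions, 1 + looks - conversions
        total = alpha + beta
        return alpha / total, alpha * beta / (total**2 * (total + 1))

    mean_a, var_a = posterior_mean_var(conv_a)
    mean_b, var_b = posterior_mean_var(conv_b)
    # Normal approximation: P(B > A) > threshold  <=>  z > inverse-CDF(threshold).
    z = (mean_b - mean_a) / np.sqrt(var_a + var_b)
    fired = z > NormalDist().inv_cdf(threshold)
    return float(np.mean(np.any(fired, axis=1)))

print("1 look  :", bayes_rule_false_positive_rate(1))
print("20 looks:", bayes_rule_false_positive_rate(20))
```

With a single look the rule fires on roughly 5% of A/A tests; with twenty looks it fires far more often. A more skeptical prior would dampen this, which is the 'depending on the prior' caveat in action.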

The honest framing: any method is safe under optional stopping only if it was designed to be. That's exactly what sequential and always-valid methods provide — and it's why serious experimentation platforms adopted them rather than just switching priors.

The fix: sequential and always-valid inference

If you want to look continuously and stop early — which is operationally very attractive — use a method whose guarantee holds at every look, not just at one predetermined N. Three families, in increasing order of what modern platforms actually use:

  • Group-sequential designs (Pocock, O'Brien–Fleming): pre-plan K interim analyses and spend your α budget across them (alpha spending). Rigorous, but you must commit to the look schedule in advance.
  • Sequential probability ratio tests (Wald's SPRT) and their mixture variants (mSPRT): test statistics designed so the error guarantee holds while data streams in.
  • Always-valid inference — confidence sequences and always-valid p-values: confidence intervals that are simultaneously valid at all sample sizes, so you can stop whenever you like and the coverage guarantee still holds. This is the machinery behind 'peeking-safe' stats engines (Optimizely's Stats Engine, and the sequential tests in Statsig/Eppo).

The one-line rule

Either fix N in advance and don't peek, or adopt an always-valid method and peek all you want. What you may never do is run a fixed-horizon test and monitor it continuously — that's the combination that lies.
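For a feel of what 'peek all you want' looks like in code, here is a minimal sketch of a mixture SPRT (mSPRT) always-valid p-value. It is a textbook-style simplification, not any vendor's implementation: it assumes observations arrive in pairs (one per arm), the variance of each paired difference is known, and the mixing distribution over the true effect is N(0, tau2). The function name, `var_diff`, and `tau2` are my own labels.

```python
import numpy as np

def msprt_p_values(diffs, var_diff, tau2=1.0):
    """Always-valid p-values for H0: mean(diffs) == 0, via a normal-mixture SPRT.

    diffs: paired differences (B minus A), streamed along the last axis.
    var_diff: assumed known variance of a single paired difference.
    tau2: variance of the N(0, tau2) mixing distribution over the true effect.
    """
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, diffs.shape[-1] + 1)
    s = np.cumsum(diffs, axis=-1)
    # Log of the mixture likelihood ratio after n paired observations.
    log_lr = (0.5 * np.log(var_diff / (var_diff + n * tau2))
              + tau2 * s**2 / (2 * var_diff * (var_diff + n * tau2)))
    # p_n = min(1, 1 / max_{k <= n} LR_k): a running minimum, so it never goes back up.
    p = np.minimum(1.0, np.exp(-log_lr))
    return np.minimum.accumulate(p, axis=-1)

rng = np.random.default_rng(0)
# A/A streams: the difference of two unit-variance arms has mean 0 and variance 2.
null_diffs = rng.normal(0.0, np.sqrt(2.0), size=(1000, 2000))
p_null = msprt_p_values(null_diffs, var_diff=2.0)
# Because p is a running minimum, the final value tells us if it EVER dipped below 0.05.
print("A/A false-positive rate, checking after every pair:", np.mean(p_null[:, -1] < 0.05))
```

Even though this checks after every single pair of observations (2,000 looks per simulated test), the share of A/A tests that ever show p < 0.05 should stay at or below 5%. That is the guarantee a fixed-horizon test cannot offer.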

Fix the sample size properly — or reduce the variance (CUPED)

If you take the fixed-horizon route, size it honestly. Required sample size scales roughly with the variance of your metric divided by the square of the minimum detectable effect (MDE): halve the effect you want to detect and you need about four times the data. Teams that stop early are usually reacting to being under-powered — the test is simply too small to see the effect they hoped for, so they grab the first favorable wiggle.
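The scaling rule can be made concrete with the standard two-sample approximation, n per arm ≈ 2(z₁₋α/₂ + z_power)² σ² / MDE². The sketch below assumes a two-sided test on a difference of means, equal allocation, and a common standard deviation σ; the function name and example values are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(sigma, mde, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-sample test on a mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / mde ** 2)

n_full = sample_size_per_arm(sigma=1.0, mde=0.10)
n_half = sample_size_per_arm(sigma=1.0, mde=0.05)
print(n_full, n_half, n_half / n_full)  # halving the MDE roughly quadruples n
```

Run it with your own metric's standard deviation and the smallest lift you would actually act on. If the answer is more traffic than you have, that is a reason to change the test design, not the stopping rule.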

A better lever than peeking is reducing the metric's variance, which increases power without more traffic. CUPED (Controlled-experiment Using Pre-Experiment Data) uses a pre-experiment covariate correlated with the outcome — typically the same user's pre-period behavior — to subtract predictable variance from the metric. In practice it can cut variance substantially and meaningfully shorten tests, and it's a standard part of mature experimentation stacks. The point: earn shorter tests with better statistics, not with a trigger-happy stopping rule.
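The CUPED adjustment itself is short. The sketch below assumes a single pre-experiment covariate x per user and uses synthetic data purely for illustration; in a real experiment the coefficient would typically be estimated on users pooled across both arms. The function name is hypothetical.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return the CUPED-adjusted metric y - theta * (x - mean(x))."""
    # theta is the regression slope of y on x: cov(x, y) / var(x).
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # Subtracting a mean-zero term leaves the metric's mean unchanged.
    return y - theta * (x - np.mean(x))

rng = np.random.default_rng(0)
# Synthetic data: pre-period behavior x predicts in-experiment outcome y.
x = rng.normal(10.0, 2.0, size=20_000)
y = 0.8 * x + rng.normal(0.0, 1.2, size=20_000)

y_cuped = cuped_adjust(y, x)
print("variance before:", np.var(y))
print("variance after: ", np.var(y_cuped))
print("correlation^2:  ", np.corrcoef(x, y)[0, 1] ** 2)
```

The variance shrinks by a factor of about 1 − ρ², where ρ is the correlation between the covariate and the outcome. A weakly correlated covariate buys little; a strongly correlated one can cut required traffic dramatically.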

The silent invalidators: sample-ratio mismatch and multiplicity

Two more ways a test lies even when your stopping rule is clean.

Sample Ratio Mismatch (SRM): you configured a 50/50 split but observe, say, 51.8/48.2. Run a chi-square goodness-of-fit test on the assignment counts; if it fails (a tiny p-value), your experiment is broken — a bucketing bug, a redirect that drops one arm, bot filtering hitting variants unequally — and every downstream result is untrustworthy. A failed SRM check invalidates the experiment regardless of how pretty the lift looks. Always run it first.
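An SRM guard is a few lines. This sketch assumes two arms and uses the fact that a chi-square statistic with one degree of freedom has survival function erfc(√(χ²/2)), so it needs only the standard library. The user counts and the 0.001 alert threshold are illustrative assumptions, not values from any particular platform.

```python
import math

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm assignment split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

SRM_ALERT_THRESHOLD = 0.001  # assumed cutoff; a strict one keeps false alarms rare

for n_a, n_b in [(5180, 4820), (518, 482)]:
    p = srm_p_value(n_a, n_b)
    status = "SRM: do not trust results" if p < SRM_ALERT_THRESHOLD else "split looks fine"
    print(f"{n_a}/{n_b}: p = {p:.4g} -> {status}")
```

Note that the same 51.8/48.2 split is alarming at 10,000 users and unremarkable at 1,000. The check is about whether the imbalance is larger than chance allows at your sample size, not about the raw percentages.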

Multiplicity: test one hypothesis at 5% and your false-positive rate is 5%. Test twenty metrics, or five variants, each at 5%, and the probability that at least one lights up by chance is far higher: 1 − 0.95^k for k independent tests, which is about 23% at k = 5 and about 64% at k = 20. If you slice by segments after the fact until something is significant, you've p-hacked. Pre-register your primary metric, and correct for multiple comparisons (Bonferroni for strict control of the family-wise error rate, Benjamini–Hochberg to control the false-discovery rate) when you legitimately test many.
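Both corrections fit in a few lines of plain Python. The sketch below computes the chance of at least one false positive across k independent tests, then applies Bonferroni and Benjamini–Hochberg to a made-up list of ten p-values. The p-values are illustrative, not from a real experiment, and the function names are my own.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k/m * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Illustrative p-values for ten metrics (made up for this example).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("Chance of >=1 false positive over 20 tests:", round(family_wise_error_rate(20), 3))
print("Uncorrected 'wins':", sum(p < 0.05 for p in p_values))
print("Bonferroni wins:   ", sum(bonferroni(p_values)))
print("BH wins:           ", sum(benjamini_hochberg(p_values)))
```

Uncorrected, five of the ten metrics look significant. Bonferroni keeps one and Benjamini–Hochberg keeps two, which shows the trade-off: Bonferroni is stricter, BH gives up some strictness for power.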

Order of operations for trusting a result

1) SRM check passes. 2) The test used a valid stopping rule (fixed-N-no-peeking, or always-valid). 3) The primary metric was chosen before launch. 4) Multiplicity is corrected. Only then is the lift worth believing. Skipping any one of these is how good teams ship confident nonsense.

What this means for conversion work

Most 'CRO that didn't work' is really 'experimentation that was never valid.' Powered tests, an SRM guard, a primary metric fixed up front, either a disciplined fixed horizon or an always-valid stats engine, and variance reduction so tests finish before patience runs out — that discipline is the entire difference between optimization that compounds and a dashboard of ghosts. The statistics are not the boring part of CRO; they are the part that decides whether any of it was true.
