Clarigital · Clarity in Digital Marketing
Analytics & CRO · Session 12, Guide 10

A/B Testing Methodology · Stats, Sample Size & Test Design

Most A/B tests run by marketing teams produce unreliable results — not because the testing tools are poor, but because the test design violates the statistical principles that make test results trustworthy. Tests stopped too early, tests with insufficient traffic, tests that change the success metric after seeing results, and tests whose winning variants never actually replicate — these failures all stem from the same source: implementing testing as a technology problem rather than as a statistical discipline. This guide covers the statistical foundations of A/B testing that separate reliable results from false positives.

Analytics & CRO · 5,000 words · Updated Apr 2026

What You Will Learn

  • What an A/B test is actually measuring — the null hypothesis and what it means to reject it
  • What statistical significance is, what 95% confidence means, and what it does not mean
  • How to calculate the required sample size before launching a test
  • How long a test must run to produce reliable results
  • The peeking problem — why checking results during a test invalidates them
  • P-hacking and multiple testing — how common practices create false positives
  • When to use multivariate, split URL, or bandit testing instead of standard A/B
  • What to do with inconclusive results — is "no winner" a failure?
  • Sequential testing methods that allow early stopping with statistical validity
  • A pre-launch checklist for every test

What an A/B Test Measures

An A/B test splits traffic randomly between two or more versions of a page or element — the control (original) and one or more variations — and measures whether the conversion rate differs between them. The fundamental question is: is the observed difference in conversion rates between control and variation due to the change made, or is it due to random variation in who happened to be assigned to each variant?

This is framed as a hypothesis test. The null hypothesis is: "There is no real difference in conversion rate between control and variation — any observed difference is due to random chance." The alternative hypothesis is: "There is a real difference in conversion rate caused by the change." Statistical analysis determines the probability that the observed data would occur if the null hypothesis were true. If that probability is below a threshold (typically 5%), the null hypothesis is rejected and the result is called "statistically significant."

Statistical significance means: "this result is unlikely to have occurred by chance." It does not mean: "this result is certainly true" or "this result will replicate perfectly." It means the evidence is strong enough to act on — not that the conclusion is guaranteed.
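The mechanics above can be sketched as a standard two-proportion z-test. Below is a minimal Python version using only the standard library; the visitor and conversion counts are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns the z statistic and the p-value under the null hypothesis
    that both variants share the same underlying conversion rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return z, p_value

# 2.0% control vs 2.5% variation, 10,000 visitors per variant
z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below 0.05 rejects the null; it does not state the probability that the variation is better.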

Statistical Significance

The significance level (α) is the probability of a false positive: concluding that the variation is better when the two versions actually perform the same. The convention is α = 0.05 (5%), meaning that if the null hypothesis is true, there is a 5% chance the test will wrongly declare a winner. Most testing platforms default to 95% confidence (1 − α = 0.95), which is the same convention stated the other way round.

Statistical power (1 − β) is a separate concept: the probability that the test will detect a real effect if one exists. Higher power requires larger sample sizes. Most testing plans target 80% power, accepting that 20% of real effects will go undetected (false negatives). This is a pragmatic trade-off between sample size requirements and the risk of missing real improvements.

  • Significance level (α): the acceptable false positive rate. Common setting: 5% (95% confidence).
  • Statistical power (1 − β): the probability of detecting a real effect. Common setting: 80%.
  • p-value: the probability of observing a difference at least this large if the null hypothesis is true. Decision rule: reject the null when p < 0.05.
  • Minimum detectable effect (MDE): the smallest improvement the test is designed to detect. Typically 5–10% relative improvement.
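Power is easiest to internalise by simulation: run many synthetic tests in which the variation genuinely converts better and count how often significance is reached. A rough standard-library sketch; the rates (2.0% vs 2.4%) and the sample size are illustrative, chosen so that power lands near 80%, and binomial counts are drawn via a normal approximation for speed:

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_power(p_control, p_variant, n_per_variant,
                   alpha=0.05, runs=20_000, seed=7):
    """Fraction of simulated tests that reach significance when the
    variant really does convert better, i.e. an estimate of power."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)       # 1.96 for alpha = 0.05
    n = n_per_variant

    def draw(p):  # normal approximation to a binomial conversion count
        return max(0.0, rng.gauss(n * p, sqrt(n * p * (1 - p))))

    hits = 0
    for _ in range(runs):
        conv_a, conv_b = draw(p_control), draw(p_variant)
        pool = (conv_a + conv_b) / (2 * n)
        se = sqrt(pool * (1 - pool) * 2 / n)
        if abs(conv_b - conv_a) / n / se > crit:
            hits += 1
    return hits / runs

# roughly 80% of these simulated tests detect the real 20% relative lift
print(simulate_power(0.02, 0.024, 21_000))
```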

Sample Size Calculation

Sample size is determined before a test launches — not during or after. The required sample size depends on: the current conversion rate (baseline); the minimum detectable effect (the smallest improvement you want to reliably detect); the significance level (typically 95%); and the statistical power (typically 80%).

Free online sample size calculators (Optimizely's sample size calculator, Evan Miller's sample size tool) return the required number of visitors per variant given these inputs. Entering a baseline conversion rate of 2%, an MDE of 10% relative improvement (detecting a change from 2% to 2.2%), 95% confidence, and 80% power typically produces a required sample size in the tens of thousands of visitors per variant, roughly 50,000–80,000 depending on the calculator's assumptions. Taking 50,000 per variant, a site with 10,000 monthly visitors split 50/50 between two variants would need approximately 50,000 ÷ 5,000 = 10 months to reach significance; such a test may not be practically achievable.

If the required sample size exceeds practical constraints, adjust the MDE: if you only want to detect improvements of 20% or more (from 2% to 2.4%), the required sample size decreases significantly. This means accepting that smaller improvements will not be reliably detected — a practical trade-off for low-traffic sites.
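The closed-form calculation behind those calculators is short enough to write out. Here is a sketch using the standard two-proportion sample size formula; calculators make slightly different assumptions, so treat the output as an estimate rather than the number any specific tool will return:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Required visitors per variant for a two-sided two-proportion test.

    baseline     -- current conversion rate, e.g. 0.02
    relative_mde -- smallest relative lift to detect, e.g. 0.10 (2.0% -> 2.2%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # 1.96
    z_beta = NormalDist().inv_cdf(power)             # 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variant(0.02, 0.10)
print(n)                     # ~80,000 per variant with these inputs
print(n / 2_500)             # days to finish at 5,000 visitors/day, split 50/50
```

Raising the MDE shrinks the answer fast, because the effect size appears squared in the denominator.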

Test Duration

Test duration is driven by sample size requirements and traffic volume. A test that needs 50,000 visitors per variant on a site receiving 5,000 visitors per day (2,500 per variant) will take 20 days. But even tests that reach their sample size quickly should run for a minimum of one full week, to capture the complete weekly traffic cycle and avoid day-of-week bias.

Novelty and primacy effects

Novelty effect: when a new design launches, regular users notice the change and interact with it out of curiosity — inflating short-term conversion rates for the variation. Primacy effect: regular users who are familiar with the control version may initially perform worse on the variation due to habit disruption, even if the variation is ultimately better. Both effects can reverse over time. Running tests long enough for these effects to stabilise (typically 2–4 weeks) prevents acting on transient behavioural responses to newness.

The Peeking Problem

The most common A/B testing error is "peeking" — checking results during the test and stopping early when statistical significance is reached before the pre-determined sample size. This practice dramatically inflates the false positive rate beyond the intended 5%.

The reason: the 5% false positive guarantee is calibrated for a single analysis at a predetermined sample size. When you check significance multiple times during a test (at n=1,000, n=2,000, n=3,000...) rather than once at the predetermined endpoint, you multiply the opportunities for a false positive. With 10 equally spaced interim checks, the cumulative false positive rate climbs to roughly 19%, and with near-continuous monitoring it can exceed 30%, even though each individual check used a 95% confidence threshold.
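The inflation is easy to demonstrate with an A/A simulation: both variants share the same true conversion rate, so any "significant" result is by definition a false positive. A standard-library sketch with illustrative parameters (10 checks, 1,000 visitors per variant between checks, binomial counts approximated as normal):

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(base_rate=0.05, batch=1_000, looks=10,
                                runs=5_000, seed=11):
    """A/A test: check significance after every batch and record whether
    ANY of the interim looks crossed the 95% threshold."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)
    mu = batch * base_rate

    def draw():  # normal approximation to a binomial batch count
        return max(0.0, rng.gauss(mu, sqrt(mu * (1 - base_rate))))

    false_positives = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0.0
        for _ in range(looks):
            conv_a += draw()
            conv_b += draw()
            n += batch
            pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(pool * (1 - pool) * 2 / n)
            if abs(conv_b - conv_a) / n / se > crit:
                false_positives += 1
                break
        # a single look at the final n would trigger ~5% of the time

    return false_positives / runs

print(peeking_false_positive_rate())   # roughly 0.19, nearly 4x the nominal 5%
```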

The solution is to commit to the predetermined sample size and test duration before launching, and not evaluate significance until the test has completed. Most major testing platforms now offer "always-valid p-values" or sequential testing methods (see below) that address this problem for teams that genuinely need early stopping capability.

P-Hacking and False Positives

P-hacking is any practice that increases the probability of finding a statistically significant result beyond the intended α level. Common forms in A/B testing:

  • Changing the success metric after seeing results. If the primary metric does not show significance, switching to a secondary metric that does show significance and reporting that as the test winner. The significance threshold only applies to pre-specified hypotheses — post-hoc metric selection invalidates the statistical conclusion.
  • Testing until significant. Continuing to run a test past the pre-determined sample size until significance is reached — equivalent to repeated peeking. Every additional data point added beyond the planned sample size increases the false positive probability.
  • Reporting only the significant results. Testing 10 hypotheses, finding 1 that reaches significance at 95% confidence, and reporting only that result while ignoring the other 9. With 10 independent tests at 95% confidence, 0.5 false positives are expected by chance alone, and the probability of at least one is about 40%. Reporting only the "winner" without acknowledging the multiple-testing context inflates the apparent evidence for the winning hypothesis.

The Bonferroni correction is the standard statistical remedy for multiple testing: divide the significance threshold by the number of tests. Testing 5 metrics simultaneously should use α = 0.01 (0.05 ÷ 5) rather than 0.05 for each individual metric.
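Both the correction and the problem it addresses are one-liners. The α and test counts below match the examples in the text:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold when evaluating several tests at once."""
    return alpha / num_tests

def familywise_error(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

print(bonferroni_threshold(0.05, 5))   # 0.01: each metric must beat p < 0.01
print(familywise_error(0.05, 10))      # ~0.40: why 10 uncorrected tests mislead
```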

A/B vs Multivariate vs Split URL

  • A/B test: one control, one variation, one change. Use for most tests, where there is a clear hypothesis about a specific element. Limitation: tests only one change at a time; slower for multi-element optimisation.
  • A/B/n test: one control, multiple variations, each a different change. Use when testing multiple design directions simultaneously. Limitation: requires proportionally more traffic; each variation needs its own full sample.
  • Multivariate test (MVT): multiple elements changed simultaneously, testing all combinations. Use on high-traffic pages where interactions between elements matter. Limitation: requires very high traffic (100k+ per variant) and is complex to interpret.
  • Split URL test: entirely different page versions served at different URLs. Use for large-scale redesigns where the changes are too substantial for an A/B overlay. Limitation: splitting traffic between URLs has SEO implications that require canonical tag management.
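The traffic cost of a multivariate test comes from the combinatorial explosion of cells, since every combination needs its own full sample. A sketch with hypothetical page elements:

```python
from itertools import product

# Hypothetical element variants for a multivariate test
headlines = ["control", "benefit-led", "question"]
cta_labels = ["Buy now", "Start free trial"]
hero_images = ["photo", "illustration"]

cells = list(product(headlines, cta_labels, hero_images))
print(len(cells))   # 12 cells: 3 x 2 x 2, each needing its own full sample
for headline, cta, image in cells[:3]:
    print(headline, "|", cta, "|", image)
```

Three modest element lists already multiply the sample size requirement twelvefold compared with a single A/B test.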

Inconclusive Results

An inconclusive A/B test — one that reaches the planned sample size without a statistically significant winner — is not a failure. It is a result. An inconclusive result means: the test did not detect an improvement of the size it was powered to detect. This has several possible interpretations:

  • The change had no meaningful effect on the conversion rate — the hypothesis was wrong
  • The change had an effect smaller than the MDE — the test was not powered to detect it
  • The test had implementation errors that prevented the variation from showing correctly to all users

For inconclusive results, review: the sample size (was the test adequately powered?); the implementation (did the variation render correctly throughout the test?); the segment-level data (was there a significant effect for a specific user segment that was diluted at the aggregate level?). Inconclusive results are informative — they eliminate hypotheses and narrow the hypothesis space for the next round of research.

Sequential Testing

Sequential testing methods (also called "always-valid inference" or "continuous monitoring methods") allow checking results at any point during a test without inflating the false positive rate. They are designed to address the legitimate business need for early stopping when a result is very strong, while maintaining statistical validity.

Examples: Sequential probability ratio testing (SPRT); Mixture Sequential Probability Ratio Test (mSPRT) used by platforms including Optimizely and VWO. These methods use a different statistical framework from classical null hypothesis significance testing — they are mathematically designed to remain valid regardless of when the test is stopped. If your testing platform offers "always-valid p-values" or "sequential testing mode," this is what it is referring to.
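As an illustration of the underlying idea (not the exact mSPRT any platform ships), Wald's classic SPRT re-evaluates the evidence after every observation and stops as soon as it crosses a boundary. The rates p0 and p1 below are hypothetical, pre-specified values; mSPRT removes the need to pick p1 by mixing over possible effect sizes:

```python
from math import log

def sprt_decision(conversions, failures, p0, p1, alpha=0.05, beta=0.20):
    """Wald's sequential probability ratio test for a Bernoulli rate.

    Compares H0: rate = p0 against H1: rate = p1 using the cumulative
    log-likelihood ratio; valid no matter when the data are inspected.
    Returns 'accept_h1', 'accept_h0', or 'continue'.
    """
    llr = (conversions * log(p1 / p0)
           + failures * log((1 - p1) / (1 - p0)))
    upper = log((1 - beta) / alpha)    # cross upward  -> accept H1
    lower = log(beta / (1 - alpha))    # cross downward -> accept H0
    if llr >= upper:
        return "accept_h1"
    if llr <= lower:
        return "accept_h0"
    return "continue"

# 300 conversions in 10,000 visitors: is the rate 2% or 3%?
print(sprt_decision(300, 9_700, p0=0.02, p1=0.03))   # accept_h1
```

The boundaries depend only on α and β, which is why the decision can be checked after every visitor without inflating the false positive rate.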

Pre-Launch Testing Checklist

  • ☐ Hypothesis is documented — observation, change, expected outcome, reason
  • ☐ Success metric defined and agreed before launch
  • ☐ Sample size calculated — required visitors per variant based on baseline, MDE, α, power
  • ☐ Test duration calculated — weeks needed to reach required sample size at current traffic levels
  • ☐ No other tests running on the same page (reduces cross-test interference)
  • ☐ Variation renders correctly across Chrome, Safari, Firefox, on mobile and desktop
  • ☐ Conversion tracking verified in both control and variation (via DebugView or testing platform preview)
  • ☐ Traffic split is 50/50 (for a two-variant test) — verified in testing platform
  • ☐ Test start date documented; scheduled end date set for pre-determined sample size
  • ☐ Do not check results until the end date (or use a sequential testing platform if early stopping is needed)

Authentic Sources

Source integrity

Every factual claim in this guide is drawn from official Google documentation, regulatory bodies, or platform-published technical specifications. No third-party blogs or marketing tools are used as primary sources. All content is written in our own words — we learn from official sources and explain them; we never copy.

Official · Google Analytics Help — Funnel Exploration

GA4 funnel analysis as the quantitative research foundation for A/B test hypothesis generation.

Official · Google Analytics Developer Guide

Technical reference for GA4 event tracking — essential for verifying conversion events in A/B tests.

Official · Nielsen Norman Group — 5 Users in Usability Tests

Research-backed guideline for qualitative user testing sample size, referenced in CRO research methodology.

Official · Microsoft Clarity

Free session recording and heatmap tool — official product documentation.
