Why Your A/B Tests Are Lying to You

3 min read

95% of product teams are making decisions based on A/B test results that are statistically meaningless. I've seen companies pivot entire strategies, hire data scientists, and burn millions in development costs—all because they misunderstood what "statistically significant" actually means.

The problem isn't that A/B testing doesn't work. It's that most teams are running tests wrong, interpreting results wrong, and making decisions based on statistical theater rather than meaningful data.

The Statistical Significance Theater

Here's what happens at most companies:

  1. Run an A/B test for a week
  2. See p < 0.05
  3. Declare victory
  4. Ship the "winning" variant
  5. Watch conversion rates return to baseline

Sound familiar? You're not alone. Most teams confuse statistical significance with practical significance, and it's costing them dearly.

Test Your Understanding

Before we dive deeper, let's see how your current A/B tests stack up. Try different scenarios and see what the numbers actually tell you:
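
Here's a minimal sketch you can run with your own numbers. The conversion counts below are invented for illustration, and it assumes statsmodels is available:

```python
# Two hypothetical scenarios run through a two-proportion z-test.
# Conversion counts are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

scenarios = {
    # name: (conversions_A, users_A, conversions_B, users_B)
    "500,000 users/arm, +2% relative lift": (25_000, 500_000, 25_500, 500_000),  # 5.0% -> 5.1%
    "200 users/arm, +60% relative lift":    (10,     200,     16,     200),      # 5.0% -> 8.0%
}

for name, (conv_a, n_a, conv_b, n_b) in scenarios.items():
    _, p_value = proportions_ztest([conv_a, conv_b], [n_a, n_b])
    lift = (conv_b / n_b) / (conv_a / n_a) - 1
    print(f"{name}: observed lift {lift:+.0%}, p = {p_value:.3f}")

# The huge test flags a practically trivial 0.1pp bump as "significant";
# the tiny test fails to reach p < 0.05 despite a 3pp jump.
```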

Surprised by those results? Most teams are. Let's break down what's really happening.

The Four Lies Your A/B Tests Tell You

Lie #1: "p < 0.05 Means It's Real"

Statistical significance only tells you that a result this extreme would be unlikely if there were no real effect at all. It doesn't tell you:

  • If the effect is large enough to matter
  • If the effect will persist over time
  • If the test had enough power to detect real differences

The reality: A test with 10,000 users can detect tiny, meaningless differences as "statistically significant." Meanwhile, a test with 200 users might miss huge improvements because it's underpowered.

Lie #2: "Bigger Sample = Better Results"

More data can actually make your tests worse if you're not careful. Large samples can:

  • Detect statistically significant but practically meaningless differences
  • Hide important segments where the effect is actually strong
  • Lead to false confidence in weak effects

The reality: You need the right sample size, not the biggest sample size.

Lie #3: "95% Confident Means 95% Right"

A 95% confidence interval doesn't mean you're 95% certain the true effect is in that range. It means that if you ran this exact test 100 times, about 95 of those intervals would contain the true effect.

The reality: Your specific test result could be in the 5% that's completely wrong.
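
A quick simulation makes this concrete. The true rates below are invented; the sketch builds a plain 95% Wald interval for the difference in each simulated test and counts how often it covers the truth:

```python
# Simulating the "95 out of 100 intervals" interpretation.
# True rates are invented: 5.0% control vs 5.5% variant, 20,000 users per arm.
import numpy as np

rng = np.random.default_rng(0)
p_a, p_b, n, runs = 0.050, 0.055, 20_000, 1_000
true_diff = p_b - p_a

covered = 0
for _ in range(runs):
    est_a = rng.binomial(n, p_a) / n
    est_b = rng.binomial(n, p_b) / n
    diff = est_b - est_a
    se = np.sqrt(est_a * (1 - est_a) / n + est_b * (1 - est_b) / n)
    covered += (diff - 1.96 * se <= true_diff <= diff + 1.96 * se)

print(f"{covered / runs:.1%} of the 95% intervals contained the true +0.5pp effect")
# Close to 95% in aggregate -- but any single interval either caught it or it didn't.
```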

Lie #4: "No Difference = No Effect"

When a test isn't statistically significant, most teams conclude there's no effect. But "no significance" often just means "we didn't collect enough data to detect the effect."

The reality: Absence of evidence isn't evidence of absence.
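
To see what that looks like in practice, here's a rough simulation of an underpowered test run against a real effect. The baseline rate, lift, and traffic are all hypothetical, and it assumes numpy and statsmodels are available:

```python
# How often does an underpowered test come back "not significant"
# even though a real effect exists? All numbers are hypothetical:
# 5.0% baseline, a genuine +10% relative lift, 2,000 users per arm.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
p_a, p_b, n, runs = 0.050, 0.055, 2_000, 1_000

misses = 0
for _ in range(runs):
    counts = [rng.binomial(n, p_a), rng.binomial(n, p_b)]
    _, p_value = proportions_ztest(counts, [n, n])
    misses += (p_value >= 0.05)

print(f"{misses / runs:.0%} of simulated tests missed a real +10% lift")
```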

What Actually Matters: The Power Analysis

The most ignored metric in A/B testing is statistical power—the probability that your test will detect a real effect if one exists. Most tests have terrible power, which means:

  • Low power (< 50%): Your test probably won't detect real improvements
  • Medium power (50-80%): Your test might catch big improvements, but will miss smaller ones
  • High power (80%+): Your test can reliably detect meaningful changes

The brutal truth: Most A/B tests have power below 50%. You're essentially flipping coins and calling it data science.
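
Here's one way to put a number on power, using statsmodels' power utilities on the same hypothetical scenario as the simulation above (5% baseline, a real +10% relative lift, 2,000 users per arm):

```python
# Putting a number on power for the hypothetical scenario above.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05        # 5% conversion rate
lift = 0.10            # the real effect we hope exists: +10% relative
n_per_arm = 2_000      # e.g., one week of traffic per variant (hypothetical)

effect = proportion_effectsize(baseline * (1 + lift), baseline)  # Cohen's h
power = NormalIndPower().solve_power(effect_size=effect, nobs1=n_per_arm,
                                     alpha=0.05, power=None)
print(f"Power to detect a +10% relative lift: {power:.0%}")
# Nowhere near 80% -- this test misses that lift most of the time.
```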

The Minimum Detectable Effect Reality Check

Every test has a minimum detectable effect (MDE)—the smallest change it can reliably detect. If your test can only detect a 25% improvement in conversion rate, but you're looking for 2% improvements, you're wasting everyone's time.
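
The same power machinery can answer the reverse question: how much traffic a given improvement actually requires. The baseline rate and candidate lifts below are hypothetical:

```python
# How much traffic does a given lift actually require at 80% power?
# Baseline rate and candidate lifts are hypothetical.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
for lift in (0.25, 0.02):    # +25% vs +2% relative improvement
    effect = proportion_effectsize(baseline * (1 + lift), baseline)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                     power=0.80, nobs1=None)
    print(f"+{lift:.0%} relative lift: ~{n:,.0f} users per arm")
# The +2% case needs well over a hundred times the traffic of the +25% case.
```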

Before running any test, ask:

  • What's the smallest improvement that would change our strategy?
  • Can our test actually detect that improvement?
  • If not, why are we running it?

The Confidence Interval Truth

Confidence intervals tell you the range of plausible values for your effect. A "statistically significant" result with a confidence interval of [0.1%, 15%] is very different from one with [8%, 12%].
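
Here's a rough way to compute that interval for a difference in conversion rates, using a plain Wald approximation; the counts are invented, and both examples share the same point estimate:

```python
# A rough 95% Wald interval for the difference in conversion rates.
# Counts are invented; both examples have the same +2.5pp point estimate.
import math

def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    pa, pb = conv_a / n_a, conv_b / n_b
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    return (pb - pa) - z * se, (pb - pa) + z * se

print(diff_ci(50, 1_000, 75, 1_000))          # small test: roughly [0.4pp, 4.6pp]
print(diff_ci(2_500, 50_000, 3_750, 50_000))  # large test: roughly [2.2pp, 2.8pp]
# The first interval spans "barely matters" to "huge win"; the second pins it down.
```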

Red flags:

  • Confidence intervals that barely exclude zero (p < 0.05 can still mean an effect close to nothing)
  • Confidence intervals that are huge relative to your effect
  • Confidence intervals that include both trivial and massive effects

How to Run Tests That Actually Matter

1. Design for Power First

  • Calculate required sample size before starting
  • Aim for 80%+ power
  • Design tests to detect the minimum effect size you care about

2. Focus on Effect Size, Not Just Significance

  • Report confidence intervals, not just p-values
  • Consider practical significance alongside statistical significance
  • Ask: "Is this difference big enough to change our strategy?"

3. Plan for Segmentation

  • Different user segments often have different responses
  • Build tests that can detect segment-specific effects
  • Don't hide important variations behind overall averages

4. Embrace "Inconclusive" Results

  • Sometimes the honest answer is "we don't know"
  • It's better to collect more data than make decisions on weak evidence
  • Failed tests teach you as much as successful ones

The Business Impact Framework

Instead of asking "Is it significant?", ask:

  1. Is the effect large enough to matter to our business?
  2. Is our test powerful enough to detect that effect?
  3. What's the range of plausible outcomes?
  4. What would we do differently based on these results?

The Action Plan

  1. Audit your current tests using power analysis
  2. Calculate effect sizes you actually care about
  3. Design tests with adequate power for those effects
  4. Report confidence intervals alongside p-values
  5. Make decisions based on business impact, not just statistics

The Bottom Line

Statistical significance is a tool, not a destination. The goal isn't to achieve p < 0.05—it's to make better product decisions based on reliable evidence.

Stop letting statistical theater drive your product strategy. Start running tests that actually tell you something meaningful about your users and your business.

Your product roadmap will thank you.
