(2024-11-17) Cohen P-hacking Your A/B Tests

Jason Cohen on p-Hacking your A/B tests. Half of your “successful” A/B tests are false-positives.

All experiments sometimes give false positives. An experiment that gives the correct result 95% of the time still gives the wrong answer 5% of the time.

This is also what happens in the social sciences, resulting in The Replication Crisis. An experiment is run once, often with a small number of college students. Occasionally something “interesting” happens, and a journal publishes it. Journals don’t wait for other teams to reproduce the result.

Before you shake your finger at them, shake that finger at yourself.
Because you’re doing this too.

Faking your A/B tests (unintentionally)

Consider this simple example: you’re testing whether a coin is fair.

Your experiment is to flip it 270 times.

According to the binomial distribution, 90% of the time the result will be between 45% and 55% heads, if the coin is fair.
So, you run the experiment, and you get 57% heads. You conclude the coin is biased, and you say “I’m 90% sure of that.” Is this the right conclusion?
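
As a sanity check on those numbers, here’s a minimal SciPy sketch (my own, not from the article) that computes the 90% interval for 270 fair-coin flips and how surprising 57% heads would be if the coin really were fair:

```python
# A quick check of the numbers above (my own sketch, not from the article).
# Assumes a fair coin (p = 0.5) flipped n = 270 times.
from scipy.stats import binom

n, p = 270, 0.5

# Central 90% interval for the number of heads, if the coin is fair.
lo, hi = binom.interval(0.90, n, p)
print(f"90% of fair-coin runs land between {lo / n:.0%} and {hi / n:.0%} heads")
# -> roughly 45% to 55%, matching the claim above

# How surprising would 57% heads (about 154 of 270) be under a fair coin?
heads = round(0.57 * n)
p_two_sided = 2 * binom.sf(heads - 1, n, p)
print(f"Two-sided p-value for {heads} heads: {p_two_sided:.3f}")
```

57% heads falls outside the 90% band, which is exactly why it looks like evidence of bias.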

Now imagine you have 10 coins, and you want to test all of them for fairness. So you run the above experiment, once per coin. 9 of the tests result in “fair coin”, but one test shows “biased coin”.
Should we conclude that the one coin is biased? Almost surely not. With a test that’s wrong 10% of the time, one false alarm out of ten tests is exactly what you’d expect even if every coin is fair.

So: 9 / 1 is the most likely result both if all 10 coins are fair and if only 9 coins are fair.
So… what exactly can you conclude from the 9 / 1 result? Nothing yet, not with confidence. What you can conclude is that this procedure is insufficient, and needs to be augmented to correct the issue.
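
To see how ordinary that one “biased” verdict is, here’s a small simulation (mine, not Cohen’s) that runs the 270-flip, 90%-confidence test on ten fair coins, many times over:

```python
# Test 10 fair coins with the procedure above: flip each 270 times and call it
# "biased" if it lands outside the 45%-55% band. (My own sketch, not Cohen's.)
import numpy as np

rng = np.random.default_rng(0)
n_flips, n_coins, n_trials = 270, 10, 100_000

heads = rng.binomial(n_flips, 0.5, size=(n_trials, n_coins))
rate = heads / n_flips
flagged = ((rate < 0.45) | (rate > 0.55)).sum(axis=1)  # "biased" verdicts per batch of 10

for k in range(4):
    print(f"{k} coins flagged as biased: {np.mean(flagged == k):.1%} of batches")
```

Even though every coin is fair, a batch with exactly one “biased” verdict comes out as the single most common outcome, which is the 9 / 1 ambiguity described above.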

The insight is: You are making exactly this mistake with your A/B tests.

You are running a bunch of A/B tests. You’re looking for (something like) “90% confidence”. Mostly the tests have a negative result. Occasionally one works, maybe one out of ten. And you conclude that was a successful test. But this is exactly what we just did with coin-flipping.

*Or you don’t even pick a confidence level. You don’t decide how much N you need to make a conclusion. Instead you “run the test until we get a result that looks conclusive.” This is a new type of mistake.*

*This particular error of “stopping whenever it looks conclusive” is called “p-hacking” by statisticians, and it’s been a well-documented fallacy since the 1950s.*
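
To put a number on the “stop when it looks conclusive” mistake, here’s a sketch (mine, not from the article) of an A/A test, where both variants have the identical conversion rate, peeked at after every batch of visitors:

```python
# An A/A test (no real difference) peeked at after every batch of visitors,
# and stopped as soon as it "looks conclusive". (My own sketch, not Cohen's.)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)


def peeking_test(n_peeks=20, batch=500, base_rate=0.05, alpha=0.10):
    """Return True if the test ever 'reaches significance', even though
    both variants have the identical conversion rate."""
    conv_a = conv_b = visitors = 0
    for _ in range(n_peeks):
        conv_a += rng.binomial(batch, base_rate)
        conv_b += rng.binomial(batch, base_rate)
        visitors += batch
        p_a, p_b = conv_a / visitors, conv_b / visitors
        pooled = (conv_a + conv_b) / (2 * visitors)
        se = np.sqrt(2 * pooled * (1 - pooled) / visitors)
        if se > 0:
            z = (p_b - p_a) / se
            if 2 * norm.sf(abs(z)) < alpha:  # "looks conclusive" -- stop, declare a winner
                return True
    return False


trials = 2_000
false_wins = sum(peeking_test() for _ in range(trials))
print(f"Declared a 'winner' in {false_wins / trials:.0%} of A/A tests "
      f"(the nominal false-positive rate was 10%)")
```

Because every peek is another chance for noise to cross the threshold, the declared “winner” rate comes out well above the nominal 10%.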

Marketers: The accidental p-hackers (accidental?)

Marketers have been making these p-hacking errors in A/B testing for many years. You are too.
We have data. A study of more than 2,100 real-world A/B tests across Optimizely’s customer base found a 40% false-positive rate.

This explains another phenomenon that you’re probably familiar with if you’ve done a lot of A/B testing:

  • You run tests. Sometimes one is significant. You keep that result and continue testing new variants.
  • You keep repeating this process, keeping the designs that are “better.”
  • Over time… one is 10% better. Another is 20% better. Another is 10% better.
  • Compounded, that should be about 45% better overall (see the arithmetic check after this list).
  • You look back between now and months ago when you first started all this… and you don’t see a 45% improvement! Often, there’s no improvement at all.
  • Why didn’t all those improvements add up? Because they were false-positives.
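
The 45% figure comes from compounding the individual lifts; a quick arithmetic check (not from the article):

```python
# Compounding the claimed per-test lifts: +10%, +20%, +10%.
lifts = [0.10, 0.20, 0.10]
total = 1.0
for lift in lifts:
    total *= 1 + lift
print(f"Expected compounded improvement: {total - 1:.0%}")  # 1.1 * 1.2 * 1.1 - 1, about 45%
```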

How to stop fooling yourself

The easiest thing is to run the test again. A fluke that passed a 90%-confidence test once has only about a one-in-a-hundred chance of passing it a second time.

And don’t stop tests early.

And seek large effects, like double-digit changes in conversion rates. Large effects are unlikely to be caused by randomness; small fluctuations are far more likely to be false-positives. And anyway, large effects actually have an impact on the business, whereas small effects don’t. This might mean testing drastic changes instead of incremental ones.
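
A rough power calculation (my own sketch, using the standard normal-approximation sample-size formula, not something from the article) shows why large effects are so much safer to chase: the visitors needed per variant to reliably detect a lift grow explosively as the effect shrinks.

```python
# Visitors needed per variant to detect a relative lift on a 5% baseline
# conversion rate at alpha = 0.05 with 80% power (normal approximation).
from scipy.stats import norm


def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2


baseline = 0.05
for rel_lift in (0.01, 0.10, 0.30):
    n = n_per_variant(baseline, baseline * (1 + rel_lift))
    print(f"{rel_lift:.0%} relative lift: ~{n:,.0f} visitors per variant")
```

Under these assumptions, a 1% relative lift needs on the order of millions of visitors per variant, while a 30% lift needs only a few thousand; at the sample sizes most sites can realistically collect, the small “wins” are the ones most likely to be noise.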

Form a theory, test the theory, extend the theory

Too often A/B tests are just “throwing shit at the wall.” We excuse this behavior by saying “No one knows which headline will work; it’s impossible to predict, so we just try things.”
Not only is this thoughtless and lazy, it also means you haven’t learned anything, regardless of the result of the test.

Instead, form a theory, then design experiments to test the theory.

Perhaps some theories already popped into your mind. Good! Write those down. Then make designs that would perform better if that theory were true.
It’s not “shit on the wall” because this time you have a specific Theory of Customer that your wall-shit is designed to test.

Let’s say you pick a theory, run a test, and it fails. Is your theory disproved?
Not quite yet. One negative result might only mean that particular design was a poor expression of the theory, so try another variation of the same idea.

But if you’re still not getting positive results after a few iterations, you have accumulated evidence that the theory is incorrect. That is called “learning.”

Suppose you had a positive result. Hooray!

Is the theory proven? No, because you read the first half of the article, so you know that positive A/B tests are often false. So, what do you do?

You lean even further into the theory. Run another test that’s even more extreme, or a different form of the same concept.

If the theory is truly correct, that will work again, perhaps even better! If it reverts to nothing, you know it wasn’t a real result.

Most theories won’t be right (or at least not impactful enough to matter). Most tests will come up negative. That’s frustrating but also the truth.

