(2023-08-09) The Inequality Of A/B Testing

The inequality of A/B testing. Microsoft recently published a paper on the distribution of results from their A/B tests.

The main departure from the literature is that the model allows for fat tails. The key insight is that the optimal experimentation strategy depends on whether most gains accrue from typical innovations or from rare and unpredictable large successes that can be detected using tests with small samples

when this distribution is very fat tailed, a “lean” experimentation strategy consisting of trying more ideas, each with possibly smaller sample sizes, is preferred

We measure the relevant tail parameter using experiments from Microsoft Bing’s EXP platform and find extremely fat tails

How fat were the tails? “…the top 2% of ideas are responsible for 74.8% of the historical gains”. As the paper says, “This is an extreme version of the usual 80-20 Pareto principle”.
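
To build intuition for how a fat-tailed distribution of idea quality concentrates gains in a handful of winners, here is a minimal simulation sketch. The Pareto shape and the tail index are illustrative assumptions, not the paper's estimates.

```python
# Illustrative sketch only: draw per-idea lifts from a fat-tailed Pareto
# distribution and measure how much of the total gain comes from the top 2%.
# The tail index `alpha` is an arbitrary assumption, not the paper's estimate.
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.3                                # smaller alpha = fatter tail
lifts = rng.pareto(alpha, size=10_000)     # hypothetical gains per idea

lifts_sorted = np.sort(lifts)[::-1]
top_2pct_share = lifts_sorted[: int(0.02 * lifts.size)].sum() / lifts.sum()
print(f"Share of total gain from the top 2% of ideas: {top_2pct_share:.1%}")
```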

The conclusions:

Ideas with small t-statistics should be shrunk aggressively, because they are likely to be lucky draws (see the shrinkage sketch after these conclusions)

“…there are large gains from moving towards a lean experimentation strategy”

If the results are “barely significant” or “marginally significant”, assume they are not significant at all

Overall conversion rate improvements will come from a few huge wins, not from dozens of incremental improvements
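
Here is a minimal sketch of what "shrinking" a noisy estimate can look like, using a simple normal-normal empirical-Bayes rule. This is a stand-in, not the paper's estimator; the prior spread `tau` is a hypothetical input, and under the paper's fat-tailed prior, small t-statistics get shrunk even more aggressively.

```python
# Minimal shrinkage sketch (a normal-normal empirical-Bayes rule, not the
# paper's fat-tailed estimator): noisy estimates get pulled toward zero, and
# the noisier they are relative to the prior, the harder the pull.
def shrink_estimate(observed_lift: float, std_error: float, tau: float) -> float:
    """Posterior mean of the true lift under a N(0, tau^2) prior on true effects.
    `tau` (the typical size of a true effect) is a hypothetical input here."""
    weight = tau**2 / (tau**2 + std_error**2)
    return weight * observed_lift

# A "barely significant" result (t ~ 2) against a prior that says most true
# effects are tiny gets pulled most of the way back toward zero.
print(shrink_estimate(observed_lift=0.02, std_error=0.01, tau=0.005))  # ~0.004
```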

You should keep running your tests, but when the results are “inconclusive”, assume it is a null result and will stay a null result; rather than extending the test to see what you can find, end the test and try something new. This is the VC model of A/B testing: most of the return comes from a few outsized winners, so place many bets and cut the rest quickly.

There is one more reason to test: not to figure out whether a new idea is better, but to verify that it is not worse.

use a test to make sure the new design is not significantly worse

If it’s about the same, then you can use your business judgement to decide whether to replace the old one with the new one.
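
One way to operationalize "not significantly worse" is a one-sided non-inferiority z-test on conversion rates. This is an assumed procedure with a hypothetical tolerance margin, not something the paper prescribes:

```python
# Sketch of a one-sided non-inferiority check (an assumed approach, not the
# paper's): conclude "not worse" only if the new variant's conversion rate is
# credibly above the old rate minus a tolerance margin.
import math
from statistics import NormalDist

def not_worse_than_old(conv_old: int, n_old: int, conv_new: int, n_new: int,
                       margin: float = 0.005, alpha: float = 0.05) -> bool:
    """True if we can conclude p_new > p_old - margin at significance level alpha."""
    p_old, p_new = conv_old / n_old, conv_new / n_new
    se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    z = (p_new - p_old + margin) / se          # H0: p_new <= p_old - margin
    return z > NormalDist().inv_cdf(1 - alpha)

# Example: 5.00% vs 4.95% conversion on 100k users each -- "about the same",
# and within the 0.5-percentage-point margin, so the new design passes.
print(not_worse_than_old(5_000, 100_000, 4_950, 100_000))  # True
```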

The CRO team needs to be searching for lots and lots of new ideas and testing them quickly. Tests do NOT need to be run all the way to statistical significance. If a test does not show dramatic improvement fairly early, the team should end it and try something new.

Basically: End tests early, but add 20% more tests.

