Looking for Bias in Split Tests

Revealing a challenge

A technique you may have heard of is running A/A tests. They help show that our experimentation and analysis systems aren't unfairly biased for or against the changes under test. You probably know what an A/B test is. In an A/A test we run a split test with two groups of people: a control group who are shown the current version of our product/website, and a variant group who are shown exactly the same thing. In other words, we are just comparing control to itself.

So we run this for a few days and look at our results. Here is where things get messy. Imagine these are the results:

Question: Is there bias in the experimentation or analysis systems?

We can answer this question by considering another question.

Digging deeper

Question: In an A/A test, where both the control and variant are showing your users the same thing, how likely are you to see a statistically significant positive or negative result even though none should exist? If you answered that it depends on your significance level... great! Let's say we are running a test at 95% confidence, which means a significance level of 5%.

Here you might say that there is a 5% chance of seeing a false signal, whether it shows up as a positive or a negative. Also great! But.

How many independent segments are you going to be looking at? Let's say you divide your results by new and returning visitors, device type (PC, Mobile, Tablet, Other), different store brands, and even a few different metrics. Now let's say that leaves you with 36 independent tests. If each one has a 5% chance of a false signal, the probability of at least one false positive/negative is 1 - 0.95^36 = 84.2%. So you will almost always see some positive and/or negative signals flagged as statistically significant even though you haven't made any changes!
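To make that 84.2% a little more tangible, here is a minimal sketch in Python. It assumes the 36 tests are truly independent and relies on the fact that, when there is no real difference, a well-behaved p-value is uniformly distributed between 0 and 1. The function names are made up for this illustration, and this is not the code behind the simulated results in this post.

```python
import random


def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false signal across m independent tests."""
    return 1 - (1 - alpha) ** m


def simulate_fwer(alpha: float, m: int, n_experiments: int = 20000, seed: int = 0) -> float:
    """Simulate A/A experiments, each reporting m independent p-values.

    Assumption: under the null hypothesis each p-value is uniform on [0, 1],
    so rng.random() stands in for one segment/metric result.
    """
    rng = random.Random(seed)
    experiments_with_false_signal = 0
    for _ in range(n_experiments):
        # An experiment "looks broken" if any one of its m tests is flagged.
        if any(rng.random() < alpha for _ in range(m)):
            experiments_with_false_signal += 1
    return experiments_with_false_signal / n_experiments


analytic = familywise_error_rate(0.05, 36)
simulated = simulate_fwer(0.05, 36)
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

The simulated share should land close to the analytic value, which is the point: most A/A experiments sliced 36 ways will show at least one "significant" result purely by chance.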

As a business stakeholder, this is extremely disconcerting. If you're having a hard time visualizing the issue or understanding why it would be disconcerting, refer back to the example above. It represents a set of results at 95% confidence.

An improvement

We can adjust for all of these independent metrics by using something known as the Bonferroni Correction. With it applied, the results in this example would show no false signals at all. It's still possible to have errors, but instead of having a 5% chance per metric/segment under test, you have at most roughly a 5% chance per experiment! Here's another simulated example with the correction applied:

Much less concerning, isn't it? Technically both are "correct", but they are different ways of thinking about the problem. As data scientists or statisticians we may get focused on the details and miss how an analysis is seen holistically by consumers of the data, and how what we think of as "correct" can miss the mark on what people are expecting.

Applying the correction is as simple as dividing your significance level by the number of independent metrics you are looking at (counting all of the combinations). In our case, a 5% error level spread across 36 independent metrics gives a per-metric significance level of approximately 0.00139 (0.05 / 36) instead of 0.05.
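As a rough sketch of that arithmetic in Python: the helper names and the four example p-values below are invented purely for illustration (they are not results from any real test), and the independence assumption is the same as before.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level after the Bonferroni Correction."""
    return alpha / m


def flag_significant(p_values, alpha=0.05):
    """Flag each p-value against the corrected threshold alpha / len(p_values)."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# The numbers from this post: a 5% error level and 36 independent metrics.
threshold = bonferroni_threshold(0.05, 36)
# Chance of at least one false signal across all 36 tests after correcting.
corrected_fwer = 1 - (1 - threshold) ** 36
print(f"per-test threshold: {threshold:.5f}")
print(f"chance of any false signal per experiment: {corrected_fwer:.4f}")

# Hypothetical p-values for a small family of 4 tests (made up for illustration).
example_p_values = [0.030, 0.004, 0.200, 0.650]
flags = flag_significant(example_p_values)
print(flags)
```

Notice that 0.030 would have been called significant at the usual 0.05 level, but it no longer clears the corrected threshold of 0.05 / 4 = 0.0125. That is the trade-off: fewer false alarms per experiment, at the cost of needing stronger evidence for any single metric.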

In the end, we didn't find bias, but we did find unmet expectations and were able to correct for them using a very simple technique.

Technical resources

On top of correcting when we flag a metric as statistically significant, we can use this correction to also adjust the confidence/predictive intervals and success probabilities. The confidence interval is adjusted in the same way as the significance level: with 36 metrics, a 95% interval becomes a 1 - 0.05/36, or roughly 99.86%, interval. There are caveats and even alternate methods, but if you need something quick that will get you 80% of the way there, this may be exactly what you need.
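Here is one way the interval adjustment could look in Python. This is a sketch under stated assumptions: a two-sided interval built from a normal approximation, using the standard library's `statistics.NormalDist`. The function name is hypothetical, and your own analysis may use a different distribution or interval type.

```python
from statistics import NormalDist


def critical_z(alpha: float, m: int = 1) -> float:
    """Two-sided critical value for a normal-approximation interval.

    Assumption: the Bonferroni Correction is applied by dividing alpha
    by the number of independent metrics m before looking up the quantile.
    """
    adjusted_alpha = alpha / m
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)


z_unadjusted = critical_z(0.05)        # the familiar 95% interval
z_adjusted = critical_z(0.05, m=36)    # same 5% error budget shared by 36 metrics
adjusted_level = 1 - 0.05 / 36

print(f"unadjusted z: {z_unadjusted:.2f}")
print(f"adjusted confidence level: {adjusted_level:.4%}")
print(f"adjusted z: {z_adjusted:.2f}")
```

Under these assumptions an interval is the estimate plus or minus z times its standard error, so the corrected intervals come out noticeably wider than the uncorrected ones, which is why fewer of them exclude zero.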

You can read more about the Bonferroni Correction here:

The code to reproduce these simulated results is here: