In the previous post, we gathered data from an uncontrolled test that a store owner ran. We were able to do that because we assumed that every day was perfectly comparable. If you've ever looked at how business metrics move, though, you know that days are not directly comparable at all. Each day of the week consistently brings in a different number of visitors, some times of the year are busier than others, and so on. In this post, we're going to examine what it would take to test a change to a business without having a true control.
We keep doing what we've been doing: make a random choice about what to show your customers each day.
I know. That doesn't take day-of-the-week effects into account, right? What if Mondays are stellar for your business and Wednesdays are terrible? What if your business is consistently growing 20% year over year?
The data you're thinking about may look something like this over a couple months:
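To make that concrete, here's a minimal sketch (with hypothetical numbers, not the post's actual data) of what generates that kind of messiness: a day-of-week multiplier, a slow growth trend, and random noise stacked on top of each other:

```python
import numpy as np

rng = np.random.default_rng(0)

days = np.arange(60)  # roughly two months of daily data
# hypothetical day-of-week multipliers: weekends strong, Wednesdays weak
dow_effect = np.array([1.3, 1.0, 0.7, 1.0, 1.1, 1.5, 1.4])[days % 7]
trend = 1 + 0.002 * days                       # slow, steady growth
noise = rng.normal(1.0, 0.15, size=days.size)  # day-to-day randomness

visitors = 1000 * dow_effect * trend * noise
print(visitors.round(0)[:7])  # first simulated week: all over the place
```

Even with a perfectly steady business underneath, the daily numbers jump around a lot.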
Pretty messy, and it varies greatly.
It doesn't matter and I can show you this works via a simulation.
Why is it so simple?
Simply put: each day we randomly select which change we'll put in front of our customers. By choosing what we show at random, we guarantee that all of the temporal correlations between our data and the analysis of our variant are broken. Said another way: as long as showing you the variant or the control is in no way related to the day of the week, then the day of the week merely contributes some noise, but not bias (a preference for control or variant caused by that noise).
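Here's a rough sketch of that argument in code (all numbers hypothetical): even with strong day-of-week effects, flipping a coin each day to decide variant vs. control yields a lift estimate that averages out near the truth, because neither group systematically gets the good days:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_LIFT = 0.05
# hypothetical weekday multipliers (Mondays strong, Wednesdays weak)
dow = np.array([1.3, 1.0, 0.7, 1.0, 1.1, 1.5, 1.4])

estimates = []
for _ in range(5000):
    days = np.arange(14)                     # a two-week experiment
    base = 100 * dow[days % 7]               # daily baseline conversions
    show_variant = rng.random(14) < 0.5      # coin flip each day
    observed = base * np.where(show_variant, 1 + TRUE_LIFT, 1.0)
    observed *= rng.normal(1.0, 0.1, 14)     # extra day-to-day noise
    if show_variant.any() and (~show_variant).any():
        # naive estimate: average variant day vs. average control day
        lift = observed[show_variant].mean() / observed[~show_variant].mean() - 1
        estimates.append(lift)

print(f"mean estimated lift over 5000 runs: {np.mean(estimates):.3f}")
```

Any single two-week run is noisy, but across many runs the estimates center near the true 5% rather than being pulled toward whichever weekdays happened to be busy.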
The noise will widen the error margin we end up with, but as we will see, the most likely effect our analysis identifies is usually a good estimate of the experiment's true impact.
How well does it work?
Here is the result of simulating the testing of a variant that gives a 5% lift on the above data over two weeks:
This is the probability distribution we get from simulating the experiment 20,000 more times. Beyond looking technically fancy, we can actually read a fair amount of analysis from it. A quick interpretation: there is almost a 93% chance that the experiment results in a positive lift. We know that's right, because my simulation gave the variant a genuine 5% lift. The width of the distribution tells us how certain we are. Just eyeballing it, the answer could be anywhere between -2% and +14% lift, which is a pretty wide interval. On top of that, our analysis shows the most likely answer to be right around a 5% lift.
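For the curious, the numbers above can be read straight off a sample of simulated outcomes. This sketch stands in a normal distribution for the 20,000 simulated lifts (hypothetical parameters chosen to resemble the plot, not taken from the actual notebook) and computes the same three summaries:

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-in for 20,000 simulated experiment outcomes (hypothetical shape)
lift_samples = rng.normal(loc=0.05, scale=0.035, size=20_000)

p_positive = (lift_samples > 0).mean()              # chance of a positive lift
lo, hi = np.percentile(lift_samples, [2.5, 97.5])   # how wide the uncertainty is
most_likely = np.median(lift_samples)               # central estimate

print(f"P(lift > 0) = {p_positive:.1%}")
print(f"95% interval: [{lo:+.1%}, {hi:+.1%}]")
print(f"central estimate: {most_likely:+.1%}")
```

The probability of a positive lift, the interval width, and the central estimate are exactly the three things we just read off the plot.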
Without a baseline and without a true control we've been able to recover a pretty darn good estimate of the change we made to the business.
Deep in the Weeds
If you would like to dig deeper into how the statistical analysis and simulation were done this is a link to my Python notebook:
That's it. No more excuses. No control, no baseline, no problem. You now have the tools to start testing changes to your business with very little overhead.
What are some next steps one could take to further optimize this? One change would be to combine this with the response surface analysis technique I showed here:
This would allow you to try more variants at once and find an optimal solution much more quickly. You could also pair these methods with multi-armed bandit techniques to optimize the money you make while you test, shifting traffic as variants do worse or better. With some prior information about how your business behaves, you could also remove some noise and make the analyses more accurate by weighting days of the week differently. We've only touched the tip of the iceberg.
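As a hint of that last weekday-weighting idea (purely illustrative, with made-up multipliers): if historical data tells you how each weekday typically performs, dividing each day's numbers by its multiplier before comparing removes that source of noise from the analysis:

```python
import numpy as np

# hypothetical day-of-week multipliers learned from historical data
# (0 = Monday ... 6 = Sunday)
dow_weights = {0: 1.3, 1: 1.0, 2: 0.7, 3: 1.0, 4: 1.1, 5: 1.5, 6: 1.4}

def deseasonalize(values, weekdays):
    """Divide out the known day-of-week effect before estimating lift."""
    return np.array([v / dow_weights[d] for v, d in zip(values, weekdays)])

raw = np.array([130.0, 101.0, 71.0, 98.0, 112.0])   # Mon through Fri
adjusted = deseasonalize(raw, [0, 1, 2, 3, 4])
print(adjusted.round(1))  # roughly flat once the weekday effect is removed
```

With the weekday swings divided out, the remaining variation is mostly real noise plus any true effect, so the lift estimate tightens up.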
Thanks for following along! If you've had success or failure with these methods, or can point me to the "proper" names or papers where these ideas originated (I'm sure I didn't invent them), please drop a comment and let me know.