## Why Sample Sizes?

When talking about an experiment design, the most contentious topic without a doubt is that of sample size. Most commonly, "Why is the sample size so big?" The good news is that the sample size doesn't have to be so big IF you're willing to compromise in other ways. In this post, I'm going to delve into how to evaluate a sample size for a log-normal distributed dataset. Then, I'll focus on how changing a couple parameters can dramatically reduce the sample sizes we need. Lastly, I'll show how using a different statistical distribution (to a binomial distribution) can alter the analysis.

## How Are We Going To Do This?

The easiest way to do this analysis (if you know how to program) is to:

1. Model the variable we'll be testing
2. Simulate changing it
3. Compare the changed distribution to the original many many thousands of times.

Important: I'm going about this backwards in a way. I'm going to say what effect I want to be able to measure and how confident I want to be and then I back into the sample size by trying hundreds of thousands of tests. I then look for the minimum sample size that will meet the constraints I set.

Here are more specific steps:

1. Look at the real data. Get real data and see what kind of distribution it forms.
2. Create a mathematical model of the data that you can use to simulate shifts.
3. Run thousands of simulated statistical tests between a simulated control and a simulated variant where we have a known improvement. Count the number of times we're able to detect the given shift.
4. The number of times we detect the shift divided by the number of simulated split tests is the likelihood to detect an effect of the given magnitude with the given sample size. This is also known as the statistical power.
5. Run other simulations varying sample size and the change between the simulated control and the simulated variant.

## Pretending To Be A Normal Log

On our way to analyzing our sample sizes for how much customers spend, we're going to need to figure out how to first model their purchases. This is a pretty great segueway.

In [59]:
from IPython.display import YouTubeVideo

Out[59]:

My first step is always to look at the data I want to model. Since most of my examples of lognormal data come from work, we'll have to use a generated dataset and pretend like it's natural.

In [2]:
%load_ext rmagic

In [60]:
%%R
generated_sales_data <- round(rlnorm(10000, mean=3.21, sd=.6), 2)
write.table(generated_sales_data, "raw_data/sales_data.csv", col.names=FALSE, row.names=FALSE, sep=',', quote=FALSE)


Now that we have our "real" sales data let's forget we made it at all and work backwards to see how we can reverse engineer the parameters I used to make it up. First, let's view our distribution and see what kind of distribution we have.

In [63]:
%%R
library('ggplot2')

p <- ggplot(sales, aes(x=purchase_amount)) +
geom_histogram(binwidth=.5) +
xlim(0, 150) +
xlab("Customer amount spent (in $)") + ylab("# of customers") + ggtitle("Amounts spent by individual consumers") + theme(axis.title.y = element_text(angle = 0)) print(p)  This looks a lot like a log-normal distribution. We can model a random distribution that looks like this by computing a couple of values from the above data. We need to find the mean of the log of each of the prices and also the standard deviation of the log of each price. Here's how that comes together: In [70]: %%R library('ggplot2') df <- read.csv('~/Documents/Notebooks/raw_data/sales_data.csv', header=T, col.names=c('amount_spent')) purchase.count <- length(df$amount_spent)
purchase.log_mean <- mean(log(df$amount_spent)) purchase.log_stdev <- sd(log(df$amount_spent))

print(paste(c("Standard mean of amount spent:", round(mean(df$amount_spent),2)), sep='')) print(paste(c("Standard deviation of amount spent:", round(sd(df$amount_spent),2)), sep=''))
print(paste(c("Log mean of amount spent:", purchase.log_mean), sep=''))
print(paste(c("Log standard deviation of amount spent:", purchase.log_stdev), sep=''))

[1] "Standard mean of amount spent:" "29.66"
[1] "Standard deviation of amount spent:" "19.48"
[1] "Log mean of amount spent:" "3.20924196153511"
[1] "Log standard deviation of amount spent:"
[2] "0.601137563076673"



Notice how different the log-mean and log-standard deviation are from their typical counter parts. When I first learned to do this I always hoped I could just use the standard mean and standard deviation but they don't even come close. So much for being lazy! ;)

Now that we have these two parameters we should be able to create a pretty good model of our data.

In [71]:
%%R
# Create modeled data
simulated_purchases <- data.frame(amount_spent=rlnorm(purchase.count, mean=purchase.log_mean, sd=purchase.log_stdev))

# Graph it
p <- ggplot(simulated_purchases, aes(x=amount_spent)) +
geom_histogram(binwidth=0.5) +
xlim(0, 150) +
xlab("price") +
ylab("# of orders") +
ggtitle("Simulated price frequency from one day") +
theme(axis.title.y = element_text(angle = 0))
print(p)


Looking at both the real and the simulated distribution of how much customers spent we can see they're pretty similar. Another little difference you may notice in the wild is your simulated histogram won't have some of the sporadic sharp spikes your real data has. Not a big deal, but just pointing it out in case you're ever left thinking wondering about it. So now what?

## Deciding On The Effect Size To Test For

Something that can dramatically affect our sample size is the size of effect we decide to test for. This is something that is very open to be changed in order to mitigate risk. As we'll see in the examples that follow, larger shifts in our data require dramatically fewer samples. Since most of the features we test have a low likelihood of dramatically improving results though, we tend to stick with looking for ~2% shift in our numbers.

This is a primary place where we can tune our tests. The hard part is how do we know what effect to measure for? We'd like to test for as small of an effect as possible right? Sure. At some point though it ends up that we're just running the test on all of our customers or that the test takes forever to complete. You need to choose a number that will allow your team to complete most tests in a week (if you can) and that balances the risk of negatively affecting customers with the risk of mistakenly thinking a feature under test is ineffective.

Let me repeat this because it is vital: We have to balance the possibility of our test not being sensitive enough with the possibility of negatively affecting our customers. We have the ability to work with this but it requires an open dialog with concrete constraints between stakeholders.

## Modeling A Split Test Looking For A 2% Effect

Let's assume we've decided we need to test for an effect of at least 2%. This means we want to detect that the measured mean has increased by 2%.

If you recall, above we showed that we can model the variable we want to test (purchase amounts) as a log-normal distribution. Let's try modeling a slight improvement to our variation and see what happens when we compare the two. Rather than rederive the constants we used before I'll just use the values of mean and standard deviation we've already calculated. First, let's try comparing two different random distributions that should be equivalent (in other words, no statistically significant difference in the means).

In [72]:
%%R
number_of_samples <- 1000
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
equivalent_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)

results <- t.test(equivalent_variation, control, alternative='greater')
print(results)


Welch Two Sample t-test

data:  equivalent_variation and control
t = -0.2264, df = 1997.376, p-value = 0.5895
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-1.560934       Inf
sample estimates:
mean of x mean of y
28.92833  29.11709



So... great! We have statistical test results but what do they mean? Search through the output for the p-value. Let's accept the standard definition that a p-value of less than .05 is statistically significant. In this case our p-value > .05 so we conclude that there's no evidence to support that our variation mean is greater than our control.

Great. That matches our expectations.

Let's go a bit further now. Let's run this test a couple thousand times and see how often we get a result that matches our expectations. We should see a statistically significant difference only around 5% of the time. Running this simulation takes less than a second.

In [73]:
%%R
number_of_samples <- 1000

simulation_results <- mapply(function(x){
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
equivalent_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
results <- t.test(equivalent_variation, control, alternative='greater')
.05 >= results[3]$p.value }, seq(0,2000, by=1)) percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results) print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))  [1] "Percentage of time effect detected: 5.55%"  Pretty close to 5% of the time the results show a statistically significant difference! Perfect. This validates that the t-test is working pretty well for our needs so far. This keeps with statistical testing theory and how the t.test works. We have two concepts that can seem VERY similar so allow me to be a bit more detailed here: • Statistical significance- 95% significant means that 95% of the time the effect we measure will NOT be due to random chance. • Statistical power- Percentage of time that an effect of a certain size will be detected. Refer here for more information: http://en.wikipedia.org/wiki/Statistical_power • Effect size- The magnitude of the change to be measured. Also referred to as sensitivity. When you hear statisticians discussing 95% significance what they're really saying is that if they ran the experiment 100 times they expect that one time out of twenty they will mistakenly detect an effect that's due simply to random chance. The importance of this is that we never get to a point where we are 100% confident. We can approach it, but then that basically just equates to measuring an effect on our entire population of customers. Might as well just roll the feature out at that point. There's a give and take here so it's another place where we can make compromises. If we say we want to measure for an effect size of 2% with 95% statistical power, that means we want to be able to detect an effect of at least 2% at least 95% of the time. Maybe we're ok with throwing out a positive effect twice as often (90% confident). That's a perfectly valid decision to make as long as there's an understanding of how that impacts our testing. Next we need to see what percent of the time we detect an improvement if we shift the variation to have a small difference. Let's do the same test as above but let's shift our variation by 2%. Let's call our shot and predict what should happen. I don't know exactly what to expect but I do know that since there is now a change we should detect a change more often than 5% of the time. In [74]: %%R number_of_samples <- 1000 effect_size <- 1.02 simulation_results <- mapply(function(x){ control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev) improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size results <- t.test(improved_variation, control, alternative='greater') .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 17.79%"



Big change! So now we're seeing that these two distributions are showing a statistically significant difference ~16% of the time! This means if we were okay with only detecting an improvement in our test features ~16% of the time we would only need 1,000 samples. For negative customer experience potential, this is pretty risk adverse. From an experimentation perspective, however, it's pretty terrible. Our tests would hardly even be repeatable. When we first started testing we actually started here. It was so confusing to get different results every test run.

In a real split test, when there's a change, we ideally want to be able to measure a change at least 95% of the time. Again, that's something we can explore shifting but let's take it as a given right now. Let's try increasing our sample size and see where it gets us.

In [75]:
%%R
number_of_samples <- 10000
effect_size <- 1.02

simulation_results <- mapply(function(x){
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
results <- t.test(improved_variation, control, alternative='greater')
.05 >= results[3]$p.value }, seq(0,2000, by=1)) percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results) print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))  [1] "Percentage of time effect detected: 68.72%"  I took a stab in the dark and tried 10x the samples. That got us MUCH closer to detecting the change 95% of the time but we're not quite there yet. Notice how very non-mathematical this approach is? This is a chief draw of using simulation to perform complex analysis. Rather than having to find a formula or tool online and just trust it, we can brute-force the solution in a verifiable way. We're at ~67% with 10,000 samples. Let's triple the sample size and see if that helps us detect the 2% effect at least 95% of the time. In [76]: %%R number_of_samples <- 30000 effect_size <- 1.02 simulation_results <- mapply(function(x){ control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev) improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size results <- t.test(improved_variation, control, alternative='greater') .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 97.55%"



Excellent! 98%. We overshot our goal a little so now I fine tune. Again it's an imperfect process.

I explain later on how exactly a statistical test of conversion rate differs from purchase amounts.

Let's continue our process of guess and check. 98% is more certainty than we want for this test. Let's see if we can get our sample size (and risk!) down a little more and still be at ~95%.

In [78]:
%%R
number_of_samples <- 24000
effect_size <- 1.02

simulation_results <- mapply(function(x){
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
results <- t.test(improved_variation, control, alternative='greater')
.05 >= results[3]$p.value }, seq(0,2000, by=1)) percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results) print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))  [1] "Percentage of time effect detected: 95.55%"  Changing our sample size to 24,000 bounces us around 95% pretty well. Voila! We've discovered the answer to our question: In order to measure at least a 2% effect on customer purchase amount 95% of the time, we need a sample size around 24,000 purchases per variation. Now obviously if I had to go through this process for every test we did I would have a LOT of work. Instead I've just made HUGE tables where I've precomputed some common values and then I filter the spreadsheets in Excel based upon the constraints I have for the test at hand. ## Changing Constraints For A Smaller Sample Size Ah here is where it gets really interesting! What if we decide we don't want to affect any more than 7,000 customers (or any arbitrary number)? Can we still use this technique? Absolutely. Let's walk through this using the same purchase amounts we've already modeled and see how things change. Let's state our question more formally: In order to limit our test to only 7,000 customers per variation what is the maximum effect size and confidence level we can test purchase amounts at? Let's start with a wild guess and see what confidence we get when measuring a 2% shift in customer: In [79]: %%R number_of_samples <- 7000 effect_size <- 1.02 simulation_results <- mapply(function(x){ control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev) improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size results <- t.test(improved_variation, control, alternative='greater') .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 54.22%"



We're able to measure a true 2% shift ~54-57% of the time! Kinda terrible. So let's try changing things. What if we said we were ok with detecting changes of 3% or more?

In [80]:
%%R
number_of_samples <- 7000
effect_size <- 1.03

simulation_results <- mapply(function(x){
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
results <- t.test(improved_variation, control, alternative='greater')
.05 >= results[3]$p.value }, seq(0,2000, by=1)) percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results) print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))  [1] "Percentage of time effect detected: 84.31%"  OK. Now we've found that we can correctly detect a 3% shift in the mean of purchase amounts ~84-87% of the time while only affecting 7,000 customers per variation. Let's get the statistical power up to at least 90%. That means we need to increase the size of the shift we can measure. Let's try 3.4%. In [81]: %%R number_of_samples <- 7000 effect_size <- 1.034 simulation_results <- mapply(function(x){ control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev) improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size results <- t.test(improved_variation, control, alternative='greater') .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 90.05%"



Yeah I cheated above. I had to try 3.5%, 3.1%, 3.2%, and 3.3% first. Finally I tried 3.4% and that got me closest. This is why generating a table has been so helpful. It's allowed me to just look at the possibilities for a given constraint. Back to the problem at hand.

Here's the formal answer to our question: In order to only affect 7,000 customers per variation, we should test for a 3.4% shift in the mean of our purchase amounts with a ~90% confidence in the result.

## Testing Metrics Like Conversion Rate

We've now covered how to test our hypothetical customer purchase amounts which tended to follow what's called a log-normal distribution. Another distribution we see fairly often is a binomial distribution. Specifically we see this in conversion rates. Let's walk through a similar example and find the sample size we need to test a 3% lift in conversion rate. NOTE: This means our conversion rate going from say 8% to 8.25%. We'll also assume that we want to be able to measure this shift 95% of the time.

Conversion rates follow a binomial distribution rather than log-normal. This is because each data point is either a success or failure. Compare this to our log normal data which were some number greater than zero. Simulating this is a bit different but just as straight forward.

For the sake of argument let's say the conversion rate we want to test is typically around 8%. We can generate the distribution like so:

In [83]:
%%R
number_of_samples <- 1000
control_conversion_rate <- 0.08

successes <- factor(rbinom(number_of_samples, 1, control_conversion_rate) == T)
control_distribution <- data.frame(success = successes)
p <- ggplot(control_distribution, aes(x=success)) +
geom_histogram(binwidth=1) +
xlab("Purchased?") +
ylab("# of visitors") +
ggtitle("Proportion Of visitors purchasing vs not") +
theme(axis.title.y = element_text(angle = 0))
print(p)


This looks a lot different from our purchase amount distribution. Here's a refresher so you can compare:

In [84]:
%%R

p <- ggplot(sales, aes(x=purchase_amount)) +
geom_histogram(binwidth=.5) +
xlim(0, 150) +
xlab("Customer amount spent (in $)") + ylab("# of customers") + ggtitle("Amounts spent by individual consumers") + theme(axis.title.y = element_text(angle = 0)) print(p)  These distributions are radically different. For modeling the customer purchases I needed the mean and standard deviation. I can get the mean here, but I can't get a standard deviation. What would that even mean in the case where we have either a success or failure? Due to this we need to use a different statistical test to measure our results. This test is known as a binomial test. Like I did for purchases, I'm going to start by simulating a test between two equivalent distributions. No changes to either one yet. In this case, both distributions are "binomial" distributions. In [85]: %%R number_of_samples <- 7000 control_conversion_rate <- .08 simulation_results <- mapply(function(x){ control <- factor(rbinom(number_of_samples, 1, control_conversion_rate) == T) results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater') .05 >= results[3]$p.value
}, seq(0, 2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 5.65%"



Again, we find a significant change only ~5% of the time which matches with our 95% confident statistical test. Now let's apply the effect size we want to test for and start looking for the point of where we get to 95% confidence in our split test.

In [86]:
%%R
number_of_samples <- 11000
effect_to_measure <- 1.03

simulation_results <- mapply(function(x){
control <- factor(rbinom(number_of_samples, 1, control_conversion_rate*effect_to_measure) == T)
results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater')
.05 >= results[3]$p.value }, seq(0, 2000, by=1)) percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results) print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))  [1] "Percentage of time effect detected: 23.74%"  Wow. So not even close. Let's guess and check until we get to 95%. In [91]: %%R number_of_samples <- 145000 effect_to_measure <- 1.03 simulation_results <- mapply(function(x){ control <- factor(rbinom(number_of_samples, 1, control_conversion_rate*effect_to_measure) == T) results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater') .05 >= results[3]$p.value
}, seq(0, 2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 95.35%"



After a lot of guess and check I finally found a sample size that gets us close to being able to reliably measure a 3% shift. That's a lot right? An important thing to remember though is that since this is conversion rate we only need this many visitors, whether they order or not. Still, compared to our purchase amounts this is a very different scenario. Or is it?

See, in order to get that 24,000 purchases so we can run our statistical test we would need to capture enough visitors to our web site as well. Since 92% of them don't order (since our conversion rate is 8%) that means we have to show our experiment to a lot more visitors than we might expect.

Given that 8% of our visitors convert to customers and if we need 24,000 customers to measure a 2% significance in purchase amount lift, then we need will need to have almost 300,000 visitors to our web site in our experiment per variation to ensure we get enough customers. We only need 145,000 visitors in our experiment to measure a 3% significance in our conversion rate. At least we would kill two birds with one stone.

## Normal, Log-Normal, Binomial... What's the point?

The point is the shape of the distribution is just as important to sample size as the effect size we're looking for, and the sensitivity to it. With a pure mathematical solution we would need a different formula for each distribution and any others we encountered. Here, we use simulation as a one size fits all solution.

## The Case For Brute Force

Without needing to delve too deeply into mathematical theory, we've used simulations to help us conduct a statistical power analysis and determine a pretty accurate approximation of the number of samples we need in each variation to identify a 2% shift in the mean of our purchase amounts! Then we were able to put a hard stop on the sample size of our test and instead played with changing the statistical power of the test as well as the effect size we're measuring.

It's an imperfect method, but we can get more accuracy. All we need to do is tune our algorithm to run more simulated split tests. That takes up some extra computer time, but that's MUCH cheaper than human time.

This paper puts it best:

"Increases in computing power have extended power analyses to many new areas, and R’s capability to run many repeated stochastic simulations is a great help. Paradoxically, the mathematical difficulty of deriving power formulas is a great equalizer: since even research statisticians typically use simulations to estimate power, it’s now possible (by learning simulation, which is easier than learning advanced mathematical statistics) to work on an equal footing with even cutting-edge researchers." http://ms.mcmaster.ca/~bolker/emdbook/chap5A.pdf

And here's an article from the company which produces SAS (a huge statistical software) on also running stochastic simulations in order to conduct similar power analyses: http://blogs.sas.com/content/iml/2013/05/30/simulation-power/

## Speeding things up

For more information on speeding up these simulations through the use of parallelization refer here: http://www.databozo.com/2013/06/24/Practical_parallelizing_in_R.html