You've probably come across the Dirichlet Distribution if you've done some work in Bayesian Non-Parametrics, clustering, or perhaps even statistical testing. If you have, and you're like I was, you may have wondered what this magical thing is and why it gets so much attention. Maybe you saw that a Dirichlet process is actually infinite and then wondered, "Well, how is that going to be useful?"
I think I've found a very intuitive approach... it's at least quite different from any other I've read. To follow along, you should already be familiar with Beta distributions, probability density functions (PDFs), and Python. If you meet that requirement then grab a bag of nuts and let's jump right in.
Why log space? Read through enough tutorials on Bayesian statistics and you're sure to encounter what seems to be an unnecessary, or at the very least confusing, use of log and exp.
Let's go over some examples to understand why we work in log space and how to do it.
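Here is a minimal sketch of the usual motivation, assuming the trouble comes from multiplying many small probabilities together. The 2,000 identical likelihood terms are made up for illustration, and `log_sum_exp` is just a helper name chosen here.

```python
import math

# Hypothetical data: 2,000 likelihood terms, each a small probability.
probs = [1e-5] * 2000

# Naive approach: multiply the raw probabilities.
# The true product is 1e-10000, far below what a float can hold,
# so the running product underflows to exactly 0.0.
naive_product = 1.0
for p in probs:
    naive_product *= p

# Log-space approach: add the logs instead of multiplying the values.
log_product = sum(math.log(p) for p in probs)


def log_sum_exp(log_values):
    """Compute log(sum(exp(v))) without leaving log space.

    Subtracting the largest value first keeps exp() from underflowing.
    """
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))


print(naive_product)  # the information is gone
print(log_product)    # still a perfectly usable number
```

The "how" mostly comes down to two moves: products become sums of logs, and sums of probabilities go through something like `log_sum_exp` so you never have to call `exp` on a hugely negative number directly.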
You've seen the articles that say "MCMC is easy! Read this!" and by the end of the article you're still left scratching your head. Maybe after reading that article you get what MCMC is doing... but you're still left scratching your head. "Why?"
"Why do I need to do MCMC to do Bayes?"
"Why are there so many different types of MCMC?"
"Why does MCMC take so long?"
Why, why, why, why, etc. why.
Here's the point: You don't need MCMC to do Bayes. We're going to do Bayesian modeling in a very naive way and it will lead us naturally to want to do MCMC. We'll understand the motivation, and then we'll also be much better at working with the different MCMC frameworks (PyMC3, Emcee, etc.) because we'll understand where they're coming from.
We'll assume that you have a solid enough background in probability distributions and mathematical modeling that you can Google and teach yourself what you don't understand in this post.
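To give a flavor of what "very naive" can look like, here is a sketch of a grid approximation. This is an assumption about the kind of approach meant, not necessarily the one the post uses. The data (7 successes in 20 trials) and the flat prior are made up for illustration.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Brute-force posterior for a success probability under a flat prior."""
    # Candidate values for the unknown probability, evenly spaced on [0, 1].
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior = prior * likelihood; the flat prior is just 1.
    unnormalized = [
        theta ** successes * (1 - theta) ** (trials - successes)
        for theta in grid
    ]
    # Normalize so the grid weights sum to 1.
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


# Hypothetical data: 7 successes out of 20 trials.
grid, posterior = grid_posterior(successes=7, trials=20)
posterior_mean = sum(theta * weight for theta, weight in zip(grid, posterior))
print(posterior_mean)
```

With one parameter this is instant. With k parameters the same trick needs `grid_size ** k` evaluations, which is one way to see why people start reaching for sampling methods.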
Here's the thing
I'm going to make this quick. You do a carefully thought through analysis. You present it to all the movers and shakers at your company. Everyone loves it. Six months later someone asks you a question you didn't cover so you need to reproduce your analysis...
But you can't remember where the hell you saved the damn thing on your computer.
If you're a data scientist (especially the decision sciences/analysis focused kind) this has happened to you. A bunch. You might laugh it off, shrug, w/e but it's a real problem because now you have to spend hours if not days recreating work you've already done. It's a waste of time and money.
I used to be this person too, so I get it. I decided to experiment with a new method that sounds so simplistic and stupid you'll think it won't work.
Just. Try. It. It will change your life.
In the previous post, we gathered data from an uncontrolled test that a store owner ran. We were able to do that because we assumed that every day was perfectly comparable. If you've ever looked at how business metrics move, though, you know that each day is not at all directly comparable. Each day of the week consistently brings in a different number of visitors, different times of the year may be busier than others, etc. In this post, we're going to examine what it would take to test a change to a business without having a true control.
Controlled testing in modern systems is fairly straightforward. There are many tools to handle statistical analysis, random population sampling, data collection, etc. With the number of web visitors routinely numbering in the millions, the statistical techniques are also greatly simplified by the sheer volume of data. What do we do, though, when data is very costly?
In this post, we'll take an overly simplified model and solve it using response surface modeling to find an approximate optimum with very little data. At the end of the post we'll discuss some possible caveats and some ideas for getting around those caveats.
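As a rough sketch of the idea, and only under the assumption that the response is close to a parabola near its optimum, you can fit a quadratic through just three observations and read the optimum off the fitted curve. The three (setting, response) points below are made up for illustration and the function names are chosen here.

```python
def fit_quadratic(points):
    """Fit y = a*x**2 + b*x + c exactly through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = points
    slope_12 = (y2 - y1) / (x2 - x1)
    slope_13 = (y3 - y1) / (x3 - x1)
    # Second divided difference gives the curvature term.
    a = (slope_13 - slope_12) / (x3 - x2)
    b = slope_12 - a * (x1 + x2)
    c = y1 - a * x1 ** 2 - b * x1
    return a, b, c


def approximate_optimum(points):
    """Return the setting where the fitted parabola peaks."""
    a, b, _ = fit_quadratic(points)
    if a >= 0:
        raise ValueError("fitted surface has no maximum (curvature is not negative)")
    return -b / (2 * a)


# Hypothetical observations: three settings we tried and the response we saw.
observations = [(1.0, 2.0), (2.0, 8.0), (4.0, 8.0)]
print(fit_quadratic(observations))
print(approximate_optimum(observations))
```

The appeal is that three data points are enough to suggest where to try next. The catch is that everything hinges on the quadratic assumption holding near the optimum.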
Say you run a video store and you want to understand how customer rental frequencies are changing over time. You can just plot the numbers but you want some help identifying when a customer's usage really went up and when it came down. Just looking at data you see that happening every day. What are the real changes though?
If you charted the number of rentals per month for a given customer over many months you might see this:
You probably can see an uptick somewhere around a year into their history... but where exactly? And is it safe to say it's constant even though we see a fall afterwards? Definitely we could give a pretty good guess, but we can't automate gut feel. How do we answer these questions quantitatively without needing a human?
Revealing a challenge
A technique you may have heard of is to run A/A tests. These show us whether our experimentation and analysis setup is unfairly biased for or against the changes under test. You probably know what an A/B test is. In an A/A test we run a split test with two groups of people: a control group who are shown the current version of our product/website, and a "variant" group who are shown that same current version. In other words, we are really just running a test comparing control to itself.
So we run this for a few days and look at our results. Here is where things get messy. Imagine these are the results:
Question: Is there bias in the experimentation or analysis systems?
We can answer this question by considering another question.
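One way to get a feel for what a healthy A/A test looks like is to simulate a lot of them. The sketch below assumes a simple two-proportion z-test as the analysis and a made-up 10% conversion rate. Neither is taken from the actual results discussed here.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / std_err
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_tests=1000, users_per_arm=500, rate=0.10,
                      alpha=0.05, seed=0):
    """Share of simulated A/A tests that (wrongly) look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both arms draw from the same conversion rate: there is no real effect.
        conv_a = sum(rng.random() < rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(users_per_arm))
        if two_proportion_p_value(conv_a, users_per_arm,
                                  conv_b, users_per_arm) < alpha:
            false_positives += 1
    return false_positives / n_tests


false_positive_rate = simulate_aa_tests()
print(false_positive_rate)
```

If the analysis is fair, roughly `alpha` of the A/A tests should come out "significant" purely by chance. A rate far above that would point to a problem in the pipeline rather than in the product.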
- With the latest odds and given the way ticket purchases grow with the expected jackpot, the expected value of Powerball is negative. Even more so with a billion dollar payout.
- There is a 95% chance that there will be 3 or more winning tickets after the next draw. Almost zero chance no one will win.
- Don't laugh at the premise! Powerball did have a positive expected value when the expected jackpot fell within a range of $400m-$650m. The game was changed to reduce the odds in October, 2015 which "fixed" that problem.
First of all, a big shout-out to Walter Hickey at Business Insider for the pointer to the Powerball data (here: http://www.lottoreport.com/powerballsales.htm) in this post from a few years ago (here: http://www.businessinsider.com/heres-when-math-says-you-should-start-to-care-about-powerball-2013-9).
This chart plots the relationship between the expected value of purchasing a ticket and the estimated size of the jackpot. The model used takes into account the dramatic increase in tickets purchased as the jackpot size increases.
My analysis shows that because of the exponential increase in the number of tickets being played and the likewise dramatic increase in the likelihood of sharing the winnings, there is never a point where one will break even on buying a Powerball ticket... Now that they've made it harder to win.
Prior to 10/07/2015, this is what the expected value looked like:
Notice anything funny? Yeah, it used to have a range of values where the expected value was positive! That means if the expected jackpot was somewhere between $400m-$700m it was actually a real investment for you to play the lotto. When the Powerball odds were reduced, though, this stopped being a problem for the lottery. (You can read more about it here: http://news.lotteryhub.com/powerball-odds-set-change-october-2015/). Depending on how someone plays, the expected value may not mean much. That's especially true for people who play only a few times in their lives: they won't play enough for the law of large numbers to even out all the times they lost (and they will lose a lot). Even so, I use it here because it's a pretty simple and intuitive framework for understanding the value of an investment.
Let's dig deeper into the game with the current (and harder to win) odds to understand why, even in the face of a billion dollar payout, Powerball is a net negative play with these new odds.
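Here is a stripped-down sketch of the mechanics, not the model behind the charts. It only looks at the jackpot (no smaller prizes, no taxes, no lump-sum discount), and it assumes the number of other winning tickets follows a Poisson distribution. How many other tickets are in play is an input you supply; the fitted sales curve from the analysis is not reproduced here.

```python
import math

# Jackpot odds (1 in N) after and before the October 2015 rule change.
JACKPOT_ODDS_NEW = 292_201_338
JACKPOT_ODDS_OLD = 175_223_510
TICKET_PRICE = 2.0


def prob_no_other_winner(other_tickets, odds=JACKPOT_ODDS_NEW):
    """Chance that none of the other tickets hits the jackpot (Poisson assumption)."""
    return math.exp(-other_tickets / odds)


def expected_jackpot_share(jackpot, other_tickets, odds=JACKPOT_ODDS_NEW):
    """Expected payout to a winning ticket once the pot is split.

    If the number of other winners K is Poisson with mean lam, then
    E[1 / (1 + K)] = (1 - exp(-lam)) / lam.
    """
    lam = other_tickets / odds
    if lam == 0:
        return float(jackpot)
    # -expm1(-lam) is a numerically careful way to write 1 - exp(-lam).
    return jackpot * (-math.expm1(-lam)) / lam


def jackpot_only_expected_value(jackpot, other_tickets,
                                odds=JACKPOT_ODDS_NEW, price=TICKET_PRICE):
    """Expected net value of one ticket, counting only the jackpot prize."""
    return expected_jackpot_share(jackpot, other_tickets, odds) / odds - price
```

The thing to notice is the tug of war: a bigger jackpot raises the payout, but it also pulls in more tickets, which raises `lam` and shrinks your expected share of the pot.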
The Problem to Solve
I've been developing a process for optimizing meals in my household. I want to be able to keep an inventory of ingredients and other foodstuffs and use it to figure out a meal to make. These are the different problems I want to be able to solve:
- Given what's in the house, what meals can we make which meet our nutritional needs and maximize the number of meals we can make?
- What is the meal that costs the least and meets our minimum nutritional needs?
- Plan a grocery shopping list by figuring out meals for breakfast, lunch, and dinner for a 15-day period that meet prep time, meal diversity, and nutrition constraints.
- Also, phase in caloric constraints over time to prevent weight gain.
If you're familiar with Linear Programming you may recognize this as a variant of "The Diet Problem". You can read more about the full diet problem here: http://www.neos-guide.org/content/diet-problem. The most important takeaway is that this kind of problem can be solved using optimization methods such as Linear Programming.
As a proof of concept, I've started laying out a small toy version of this problem to see if it holds water. If it does, I plan to continue to develop a manual solution and then, if that works, I might automate a general solution.
Formulating the Model
Let's first solve the problem of maximizing the number of meals we can make given what we have on hand. In order to solve this problem with Linear Programming we need to formulate an objective function which we want to maximize. I've laid out a few common meals from my household. For this problem, the meals will be what we call the decision variables. These are the inputs into the model that we change in order to maximize the objective function. The reason for making the meals, and not the ingredients, the decision variables (in case you are wondering) is that the meals are what need to be planned. The ingredients give us information about nutrition for use by our constraints, but they don't determine the meals on their own.
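Let's start by figuring out an objective function for this problem. First, I defined six meals to act as decision variables. Those are spaghetti, veggie pepperoni pizza, veggie chili dogs, chili, eggs and toast, and peanut butter and jelly sandwiches.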
For this first simple proof of concept, our objective function is pretty simple though the constraints of the model will get a bit lengthy.
Since we want to just maximize the number of meals, our objective function is simply the sum of the number of meals:
Great! Next we need to define our list of ingredients. Notice that the amount of each ingredient is a function of the meals each is included in. Here are the definitions of the ingredients:
And then this is how each of the ingredients relates to the individual meals along with our constraints about how much we have on hand.
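To make the shape of the model concrete, here is a toy two-meal version in Python. The recipe amounts and pantry stock are made up for illustration and are not the figures from my spreadsheet. With only two decision variables we can skip a real solver and check every corner of the feasible region, which works because this little problem is bounded and an LP optimum sits at a corner.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical per-serving ingredient needs and pantry stock.
MEALS = ["pbj", "eggs_and_toast"]
RECIPES = {
    "pbj": {"bread_slices": 2, "peanut_butter_tbsp": 2},
    "eggs_and_toast": {"bread_slices": 1, "eggs": 2},
}
ON_HAND = {"bread_slices": 8, "peanut_butter_tbsp": 6, "eggs": 10}


def build_constraints():
    """Each row (a, b, c) means a*x_pbj + b*x_eggs_and_toast <= c."""
    rows = []
    for ingredient, stock in ON_HAND.items():
        a, b = (Fraction(RECIPES[meal].get(ingredient, 0)) for meal in MEALS)
        rows.append((a, b, Fraction(stock)))
    # Non-negativity: -x <= 0 for each meal.
    rows.append((Fraction(-1), Fraction(0), Fraction(0)))
    rows.append((Fraction(0), Fraction(-1), Fraction(0)))
    return rows


def solve_max_meals():
    """Maximize x_pbj + x_eggs_and_toast by checking every feasible corner."""
    rows = build_constraints()
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(rows, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel constraint lines never cross
        # Cramer's rule for the point where the two lines intersect.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c for a, b, c in rows):
            total = x + y  # objective: total number of meals
            if best is None or total > best[0]:
                best = (total, x, y)
    return best


best = solve_max_meals()
print(best)
```

Corner checking does not scale past a handful of variables, which is why the real model goes to a proper solver. The structure is the same, though: meals are the decision variables, the objective is their sum, and each ingredient contributes one "can't use more than we have" constraint.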
Solving the model
We still haven't entered in the nutrition information! That's ok for now. This is just the proof of concept. Let's stop here and try to solve the model. You may want to use Google Docs or Excel; I'm going to choose Excel.
At another point I will get into how exactly to solve this model in Excel. For now, I want to show the power of this approach so I will keep things brief by just giving you the answer and providing the Excel spreadsheet.
According to Excel, the maximum number of meals I can create given the above constraints on my inventory is 10.75 meals. Specifically:
- 0.75 servings of spaghetti
- 0 veggie pepperoni pizza
- 1 serving of veggie chili dogs (two hot dogs)
- 3 servings of chili
- 2 servings of eggs and toast
- 4 servings of peanut butter and jelly sandwiches
Not only do we know exactly what meals we can make, but we also are left knowing how much of each ingredient we will have left over:
- Spaghetti noodles: 137.5 grams
- Boca crumbles: 0.625 lbs
- Marinara sauce: 0 cups
- Broccoli: 37.5 grams
- Pizza dough: (still) 0
- Mozzarella cheese: 3.5 cups
- Veggie pepperoni: 11 pepperonis
- Veggie chili: 1.5 cans
- Hot dog buns: 1 bun
- Veggie hot dogs: 0 hot dogs
- Grated cheddar cheese: 0 cups
- Eggs: 0 eggs
- Butter: 0.08 cups
- Bread: 0 slices
- Peanut butter: 1 cup
- Jelly: 1 cup
For a second I thought having 1.5 cans of chili left over was a bug. I realized, though, that according to my recipe I can't make veggie chili without cheddar cheese. Since I have none of that left, I can't make any more chili!
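The same reasoning as a quick sanity check: the number of extra servings you can make is capped by whichever ingredient runs out first. The leftover amounts are the ones listed above, but the per-serving amounts are hypothetical stand-ins, since the exact recipe quantities aren't shown in this post.

```python
# Leftovers reported by the solver.
LEFTOVER = {"veggie_chili_cans": 1.5, "cheddar_cups": 0.0}

# Hypothetical per-serving needs for one serving of chili.
CHILI_RECIPE = {"veggie_chili_cans": 0.5, "cheddar_cups": 0.25}


def max_extra_servings(recipe, leftover):
    """Servings still possible: the tightest ingredient sets the limit."""
    return min(leftover[item] / need for item, need in recipe.items())


def limiting_ingredients(recipe, leftover):
    """Ingredients whose leftover amount is what caps the servings."""
    cap = max_extra_servings(recipe, leftover)
    return [item for item, need in recipe.items()
            if leftover[item] / need == cap]


print(max_extra_servings(CHILI_RECIPE, LEFTOVER))
print(limiting_ingredients(CHILI_RECIPE, LEFTOVER))
```

Whatever the true per-serving amounts are, zero cheddar left means zero extra servings, no matter how much chili is still sitting in the pantry.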
Now, if I had also listed how much each ingredient costs, I could work out how much money I'm paying per meal. As I said earlier, if I had nutrition info, I could put extra constraints around nutrition. If I wanted to ensure that at least half my meals aren't PB&J, I could add a constraint for that as well. The list goes on.
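As a sketch of how that PB&J rule would look, here it is written against the solution Excel reported. The key names and the function name are just labels chosen for this example. The point is that the rule stays linear in the decision variables, so it fits the same Linear Programming setup.

```python
# Servings per meal from the Excel solution above.
PLAN = {
    "spaghetti": 0.75,
    "veggie_pepperoni_pizza": 0,
    "veggie_chili_dogs": 1,
    "chili": 3,
    "eggs_and_toast": 2,
    "pbj": 4,
}

# Objective value: the total number of meals.
total_meals = sum(PLAN.values())


def pbj_constraint_slack(plan, max_share=0.5):
    """Slack in the linear constraint x_pbj <= max_share * sum(all meals).

    Rearranged as x_pbj - max_share * sum(x) <= 0, every term is a constant
    times a decision variable, so the constraint is linear.
    A non-negative slack means the plan satisfies it.
    """
    return max_share * sum(plan.values()) - plan["pbj"]


print(total_meals)
print(pbj_constraint_slack(PLAN))
```

The reported plan already passes this test, since PB&J makes up 4 of the 10.75 meals. Adding the constraint would only change the answer for inventories where PB&J would otherwise take over.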
With the data entered, the model takes less than a second to run (so far). We'll see how the performance fares as I increase its complexity.
In the next blog post, I will continue this analysis but will start to include more realistic estimates for ingredients including calories and nutrition so that we can start to solve some more interesting problems.