You've seen the articles that say "MCMC is easy! Read this!" and by the end of the article you're still left scratching your head. Maybe after reading that article you get what MCMC is doing... but you're still left scratching your head. "Why?"
"Why do I need to do MCMC to do Bayes?"
"Why are there so many different types of MCMC?"
"Why does MCMC take so long?"
Why, why, why, why, etc. why.
Here's the point: You don't need MCMC to do Bayes. We're going to do Bayesian modeling in a very naive way and it will lead us naturally to want to do MCMC. We'll understand the motivation and then! We'll also better understand how to work with the different MCMC frameworks (PyMC3, Emcee, etc) much better because we'll understand where they're coming from.
We'll assume that you have a solid enough background in probability distributions and mathematical modeling that you can Google and teach yourself what you don't understand in this post.
Our Motivating Example
A common business problem is to find the min/max of some function so that you can optimize revenue or costs. For this reason, my favorite toy example is modeling a parabola to some data.
Imagine we are tasked with optimizing our revenue for creating Super Widgets. We want to know what the optimum number of hours are for our workers to work in a day in order to maximize our revenue. We have a fixed staff size and are subject to overtime laws which is where the optimization comes into play.
Now! We've ran an experiment and we varied how many hours we have the average worker work for. Using this data, let's see if we can recommend an optimum.
With the naked eye we might say that 8 hour days are perfect... The workers are sure excited by that! But it may just be noise. Let's see what happens when we model this.
I'm a Model, Are You a Model?
How do we model this? We start somewhere simple. The data looks parabolic and we want to find the maximum point. The nice thing about modeling this with a parabola is that we can write the equation such that the exact information we want is just a parameter. Check out the standard form of a parabola:
In our example, the h in this equation represents the number of hours we have workers work. The a parameter controls the width of the parabola and k controls the height, but we really don't care about these values. We will call a and k our nuisance parameters as a result. We will end up including them in our model but only so we can get to the value of h.
So our list of variables is:
- h- hours worked
- a- width of parabola
- k- amount of money made at our optimum
Now, the Bayesian part. In order to run a bayesian analysis we need to have hypotheses to test. Then we can find the probability of these different hypotheses and find the most likely one. Let's run through each of our variables and create a set of hypotheses for each one:
- h: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
- a: -4,-3,-2,-1
- k: 18000, 20000, 23000, 25000, 26000, 28000, 30000
There's one more nuisance variable and we'll see that in a moment.
So we have our mathematical model and we have hypotheses... how are we going to get to a probability for this? What we need is a way to take a combination of the above hypotheses (say h=3, a=-4, k=20000) and and evaluate all our data points with it and say what is the probability of this specific version of this model being true?
Really each possible combination of these hypotheses is just one hypothesis technically. The code to find all the combinations looks like this:
We're about to get to the "sigma" variable you see there.
The evaluation part is key. That means for each combination of hypotheses like above we're also going to have a predicted net revenue value. This is really just simple regression. So for every hypothesis combination we will evaluate all of our data points and get a corresponding predicted net revenue that we can compare to what the actual observed net revenue was. The missing link is that we want the error of our observed net revenue minus our predicted net revenue to be zero on average and hopefully just have the errors normally distributed around zero.
This is where we need a variable (which I'm calling sigma) to model our error. Because we really don't care about sigma we'll count it as yet another nuisance variable. This is also what probabilistically connects our hypotheses to our data. We expect that we have a normal distribution where the mean is 0 and we observe actual net revenue minus predicted net revenue. The standard deviation around zero is represented by sigma.
This is pretty abstract so let's get to the code.
QWOP-ing Some Code
We've got all of the pieces and now let's step through the Python to make this happen.
All of the probability multiplication we'll be doing, we need to do in log-space because it's very possible our hypotheses will have very small probability densities and multiplying them makes them get so small so quick that the computer will just round to zero. In probability land, zero is a VERY strong statement (implying impossibility) and we like to avoid that whenever possible.
This is why on line 8 we create a prior where every hypothesis is weighted equally, but then immediately log transform it.
After that, since we've already combined all of our hypotheses into all the possible combinations, we loop through each one and evaluate all of the data points for each hypothesis.
On line 18, we evaluate the probability density for the given error occurring given our hypothesis for sigma.
On line 22, for the given hypothesis we are just multiplying this probability density by the new one we compute to eventually get to an aggregated probability density for the given hypothesis over all of our data.
Once we evaluate every hypothesis for every data point we have to normalize the probability densities so that we can read them as probabilities. We also need to do this in log-space. The function is posted here for completeness without explanation. Working in log-space is a separate blog post.
Now that we have every hypothesis with an associated probability we can look for the most likely hypothesis. Below is an abbreviated list of the hypotheses.
To find the most likely hypothesis out of all of these we can run this quick Python code
We find that the most likely hypothesis is the following set of values for our variables:
a = -3
h = 9
k = 23000
sigma = 10
The actual values I used to generate the data were
true_a = -3
true_h = 9.25
true_k = 25000
true_sigma = 10
Pretty much right on.
But let's take a step back and see why this works. All of this is really just the extreme happy path.
Where MCMC Comes In
So while we did find an optimum of 9 hours... we didn't get 9.25. Of course that was due to the fact that we only evaluated whole integer hypotheses. Maybe instead of a hypothesis at every hour we could've done it in quarters? We're already evaluating 420 hypotheses... if we do this we'd be looking at 4*420 = 1680 hypotheses! If I hadn't generated the data though, I wouldn't know that the data fell neatly on 9.25. This is a problem especially on the nuisance variables. We might have some good ideas for what to set k to since it's related to our net revenue, but what about a? That's a parameter that is completely abstract. It has no connection to us and just describes how steep of a fall off we get when we over or under shoot our optimum number of hours.
What if this parameter was basically 0% likely for -3 and higher and -4 and lower and the true value was -3.5? None of our hypotheses would have a value and this would result in our model not converging at all. The worst part is, that this can happen at very small ranges and finding these issues can be very time consuming.
We could just keep adding discrete hypotheses to our set here but it will cause the number of hypotheses to evaluate to explode in number exponentially! At some point it becomes obvious that doing this manually does not scale. Because MCMC algorithms work over continuous distributions, we can work across enormous ranges for our hypotheses and get an effect similar to us including every possible number in the ranges. Given what it's doing, the speed of MCMC is pretty awesome.
We can take the above example and rewrite it in PyMC3 to see how these concepts translate.
And the resulting diagnostics look like this:
What I expect to see when I look at the distributions in the left column is that the highest point in the distribution is very close to the actual values used. That seems to be the case here!
To get our estimate out of this trace we use the following code
And the resulting most likely hypothesis is:
a = -3.65
h = 8.85
k = 25,003
sigma = 11.15
If this were a real scenario and I didn't know what the "true" values were, I would go with the 8.85 hours per worker to maximize our net revenue. This would get us pretty darn close to the true optimum. If I wanted to get a more accurate estimate I would just collect more data and re-run the analysis.
What did we gain from MCMC? While we kept the hypothesis ranges the same we could expand the ranges if we wanted. We also were able to look at all of the values in each range and not just a discrete list of values. If you tried to tackle this with the naive solution, your code would never finish. Relative to that benchmark, PyMC3 runs extremely quickly.
Also, notice how well PyMC3's syntax mirrors what we were doing before and how it simplifies the code. Now that you see how we would do this manually, the brevity of this library and the expression of it hopefully can be more fully appreciated.
You don't need MCMC in the truest sense of the word. You almost certainly want it though. If this article addressed some confusion you've felt in the past (I wrote it to past me who would've loved to read this) then I recommend implementing your own naive solution to Bayesian modeling and then choose an MCMC library and start learning it. I think understanding the basics of what is happening probabilistically is a requirement for really leveraging MCMC.
Thanks for reading! Let me know if this post was helpful. I have more to write and I'm wondering how much value people would get from it. The code in notebook form is posted here: https://github.com/jcbozonier/research/blob/master/notebooks/Naive%20Inference.ipynb
EDIT: Some people had questions on the log-normalize function. I wrote a second post here to cover that in some detail. You can read that here: http://www.databozo.com/why-bayesians-prefer-log-space