Deep in the Weeds: Complex Hierarchical Models in PyMC3

If you've read Thomas Wiecki's awesome introduction to hierarchical models (here: you probably started working to figure out how to apply the technique to your own work. A challenge you may run into pretty quickly is how do you go about modeling N different levels? Even more challenging is how do you model N-levels and also keep the model vectorized? This post will be fairly terse and will illustrate how to actually set this up in PyMC3.

I've generated a small dataset to act as an example here. Let's pretend we have data on how much  money the average person in different states made over time and counties made and we have information on whether or not they got a degree. Using hierarchical modeling, we can relate these different factors to understand their impact on income. As I said, this data is all generated solely to illustrate this technical concept, for one blog post try to not think critically about the data (or do and expect to have some good laughs).

If you want to see how to generate the example data, I have included that at the end of this post.

WARNING: My main goal is to put an example online. This post assumes an understanding of PyMC3, Hierarchical modeling, pandas, etc. 

Modeling the problem

Ultimately I want to understand how degree affects income. For this reason, I have decided to place degree classification at the top of my hierarchy followed by state and then county. For the purpose of this post I am keeping the model a simple linear model but that's an easy detail to change. It's left as an exercise to the reader though. 

Let's start modeling this in PyMC3 and solve problems as we run into them. First, because we are making a hierarchical model, we know that we'll need a global prior for the slope of the lines and the intercept. At the same time we can also easily include the degree level variables. We can get there by writing 

This gives us a vector of variables for slopes at the degree level with a shared global prior and the same for the y-intercepts. The number of random variables the vector degree_m refers to is specified by degree_count. Which makes sense right? We want to model a random variable for each type of degree. 

If we left it here, we would have a single level hierarchical model where each degree classification could have its own slope and intercept that is informed by the slopes and intercepts of the other degree classifications. If this doesn't make sense quite yet, check out the link to Thomas's blog post above.

Now the next step is to get to a level below degrees and introduce states. This is where things get tricky. For each type of degree, we want to create a separate random variable for each state which uses degree_m as the prior for the slope m of the given state and that can also act as the prior for the slope of each county in that state. Essentially this means we have a random variable for each combination of degree and state. Then we'll have yet another set of random variables for every degree / state / county combination. On top of that, because PyMC3 works best with vectorized inputs, we'll need to gather to all of these variables into vectors and refer to them in our data with vector indexes.

Massaging your data

As I said in the beginning, we're going to do this while keeping our model vectorized. When we say "vectorized" what we are really saying is that we want to be able to point to the index of the random variable used for each of our different categories. For example, degree is one such category and "degree" and "no degree" will each be modeled by a separate Normal variable (though they will share a global prior). This means that we need our data to be able to refer to each of these variables in a way that's easy for PyMC3 to understand and in this case that means with an index. So instead of each row referring to "no degree" by name, we need to assign an index that can be used to refer to it.

This can be a bit intimidating so we're going to break this down into several steps. The general outline looks like this though:
  1. Create an index number for each unique degree classification.
  2. Create an index number for each unique degree classification & state combination.
  3. Create an index number for each unique degree classification & state & county combination.
  4. Combine all of these indexed data frames back onto the original data frame so that these combinations can be referred to by an index.
We're going to need this data organized in another way as well but it's difficult to explain how or why until we run into the concrete problem

In order to do this, we need to be able to refer to all of the different classes

This is an example of how I create these indexes using pandas:

Running the Model

Vectorizing the data is the hardest part. With that out of the way we can set up our full model:

Assuming everything runs well we should get a trace plot that looks something like what I have below. The plots with out "sd" in them are different levels of fit for our slope and intercept across Global -> Degree -> State -> County!

Generating example data

Very quickly, this is how I generated my fake data for this example:

The data isn't totally realistic... but real enough to act as a motivating example.

Wrap up

There's a lot to dig through here. The best way to approach is probably to take the Jupyter Notebook and poke at it until everything makes sense. You can take a look at that and download it from here: