Say you run a video store and you want to understand how customer rental frequencies change over time. You could just plot the numbers, but you'd like some help identifying when a customer's usage really went up and when it came down. Eyeballing the raw data, you'll see apparent changes every day. Which changes are real, though?
If you charted the number of rentals per month for a given customer over many months you might see this:
You can probably see an uptick somewhere around a year into their history... but where exactly? And is it safe to call the new rate constant even though we see a fall afterwards? We could certainly make a pretty good guess, but we can't automate gut feel. How do we answer these questions quantitatively, without needing a human?
We can answer these questions using a Bayesian statistical approach. At a high level, we start with the first month's data and check whether it is statistically dissimilar from the rest of the data. If it isn't, we move to the next month and check whether both months together seem dissimilar from the rest of the data, and so forth. At some point we may find that the set of months we've collected is quite different from everything that follows, so we split it off into its own partition. Then we repeat this process on the remaining data as though it were the whole dataset.
Eventually we're left with a series of one or more sets of data, each representing a region of time that is statistically similar within itself but dissimilar from the partitions adjacent to it.
The remaining sections of this article explain this process in more detail.
Envisioning a solution
Imagine if we had an algorithm that could summarize that chart into a series of distinct periods of time with constant frequencies. That algorithm might produce a chart like this:
We've left the original line in to make comparison easier. Here we can see that in month 10 usage increased from 0.67 rentals a month to 3.2 rentals a month. It then fell in month 22 to 0.22 rentals a month. That crisp summary is exactly what we're after.
Some possible ideas you might think of to tackle this problem and reasons why they break down:
- Use a moving average: we'll end up with a picture that is qualitatively clearer, but it won't give us a crisp definition of when the changes occurred.
- Divide the data above into three arbitrary windows and adjust them until the data fits pretty well: that might work in this case, but we shouldn't assume we know in advance how many frequency shifts a dataset contains. We need a way to discover that.
- Choose an arbitrarily sized window of X days and compare before and after for each day (e.g. compare the previous 7 days to the next 7 days): another reasonable option, but it won't summarize the data as fully as what we're looking for. Also, depending on the data, X days may not be enough to see the changes that are occurring.
So what other options are left?
A Bayesian approach
What if instead we had a set of hypotheses, from 0 up to some upper bound, for what the underlying event rate/frequency might be at each time unit? We could start at the beginning of the event observations with a window of one time unit (in this case one month) and ask: is this one-month window statistically dissimilar from the months that follow? If it isn't, we include the next month and ask "Is this set of TWO months statistically dissimilar from the months that follow?", and so on until the answer is yes. Once the answer is yes, we call the set of months we've collected so far one partition, and the months that follow belong to a different partition, because they were probably generated by a different underlying event rate.
If the answer is never yes, we simply end up with our original dataset as a single partition.
Let's lay out code for these steps. I tend to build my code inside out: I start with a very specific case and then generalize as I go. So the first interesting function I worked on creates the probabilities of all the possible underlying event frequencies for a given set of event counts.
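The snippet itself hasn't survived in this version of the post, so here's a sketch of what a function like that might look like. The name rate_probabilities is mine, and the repo's version may differ in detail:

```python
import numpy as np
import scipy.stats as ss

# Every rate from 0.1 to 100 in 0.1 steps: 1,000 hypotheses to evaluate.
HYPOTHESES = np.linspace(start=0.1, stop=100, num=1000)

def rate_probabilities(event_counts, hypotheses=HYPOTHESES):
    """Probability of each hypothesized rate given a window of counts."""
    probabilities = np.ones_like(hypotheses)
    for event_count in event_counts:
        # How likely is this observed count under each hypothesized rate?
        likelihood = ss.poisson.pmf(event_count, hypotheses)
        probabilities *= likelihood / likelihood.sum()
    # Re-normalize so the result is a tidy probability distribution.
    return probabilities / probabilities.sum()
```

For event_counts of [0, 1, 2] the resulting distribution peaks at a rate of 1.0, which matches the intuition that three observations averaging one event per month are best explained by a rate of about 1.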
The key is in the ss.poisson.pmf(event_count, hypotheses) method call. For each event count, it uses scipy's Poisson pmf to evaluate how probable each hypothesized rate is given that count. First note that we set the hypotheses to np.linspace(start=0.1, stop=100, num=1000), which means every rate from 0.1 to 100 in 0.1 intervals (1,000 different hypotheses to evaluate). Next, let's assume we have [0, 1, 2] as event_counts. This code builds, for each event count, a list of how likely each hypothesized event rate is given that observation.
Now comes the math. We want to ask: what is the probability of each of these rates, given that a count of 0 was observed at time 0, AND a count of 1 was observed at time 1, AND a count of 2 was observed at time 2? To compute this, we can essentially just multiply the probabilities of the hypotheses for each event count together. Once we do this, we re-normalize to get nice and tidy probabilities.
To make this a bit more concrete here's a smaller version of the problem. Assume we have three hypotheses for the rate of events coming in: 1, 2, and 3. Now for each time index we evaluate the hypotheses by computing the pmf function with our count:
- At time 1 where the count is 0 we get [0.36787944, 0.13533528, 0.04978707]
- At time 2 where the count is 1 we get [0.36787944, 0.27067057, 0.14936121]
- At time 3 where the count is 2 we get [0.18393972, 0.27067057, 0.22404181]
These are all equivalent to running ss.poisson.pmf(event_count, hypotheses). Next we normalize them to get probabilities (divide each list by its sum):
- At time 1 where the count is 0 we get [0.66524096, 0.24472847, 0.09003057]
- At time 2 where the count is 1 we get [0.46690469, 0.34352927, 0.18956604]
- At time 3 where the count is 2 we get [0.27103684, 0.39883553, 0.33012763]
Now we know, for example, that the probability of a rate of 1 given we saw an event count of 0 is 66.5%. That number ignores everything we know from time indices 2 and 3, though. Keeping in mind that each position in these lists represents one of our hypotheses, we want to ask, "What is the probability of observation 1 and observation 2 and observation 3 given each hypothesis?" To do that we multiply down the columns like so:
[0.08419, 0.03353, 0.00563]
As an example, we compute the 0.084 number by running 0.66524096 * 0.46690469 * 0.27103684 = 0.08419 and do the same for the other columns of data. Lastly, we normalize to get our probabilities of each hypothesized underlying rate.
[0.68249, 0.27183, 0.04568]
This is just [0.08419, 0.03353, 0.00563] divided by the sum of [0.08419, 0.03353, 0.00563]. According to this there is a 68.2% chance the true underlying rate was 1.
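This whole worked example is small enough to verify directly with scipy:

```python
import numpy as np
import scipy.stats as ss

hypotheses = np.array([1.0, 2.0, 3.0])  # the three candidate rates
event_counts = [0, 1, 2]                # one observed count per time index

posterior = np.ones(3)
for count in event_counts:
    # pmf of this count under each hypothesized rate; for a count of 0
    # this gives roughly [0.3679, 0.1353, 0.0498].
    likelihood = ss.poisson.pmf(count, hypotheses)
    posterior *= likelihood / likelihood.sum()  # normalize the column, multiply down

posterior /= posterior.sum()  # final normalization
print(posterior)
```

Normalizing each column before multiplying is optional, by the way: each normalization just divides a whole column by a constant, so multiplying the raw pmf values and normalizing once at the end gives the same answer.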
So now we can take a window of event counts for sequential units of time and find the probabilities of the underlying rates driving the event frequencies. Next our algorithm needs to compare two windows of different sizes and ask "Is it reasonable to assume these both are generated by the same underlying event rate?" That's what this function does.
I won't spend much time on this, but to describe it in English: it finds a credible interval (very similar to a confidence interval) for each of the two windows of event count data we're comparing, and if the intervals don't overlap, it concludes the windows were probably generated by different rates.
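The function itself isn't visible in this version of the post, so here's a sketch built from that description. The names (credible_interval, windows_are_similar) and the 95% level are my assumptions:

```python
import numpy as np
import scipy.stats as ss

HYPOTHESES = np.linspace(start=0.1, stop=100, num=1000)

def credible_interval(event_counts, level=0.95):
    # Posterior over hypothesized rates for this window (log space to
    # avoid underflow), then the central `level` credible interval.
    counts = np.asarray(event_counts)[:, None]
    log_likelihood = ss.poisson.logpmf(counts, HYPOTHESES).sum(axis=0)
    probabilities = np.exp(log_likelihood - log_likelihood.max())
    probabilities /= probabilities.sum()
    cdf = np.cumsum(probabilities)
    tail = (1 - level) / 2
    low = HYPOTHESES[np.searchsorted(cdf, tail)]
    high = HYPOTHESES[np.searchsorted(cdf, 1 - tail)]
    return low, high

def windows_are_similar(window_a, window_b):
    # Overlapping intervals: the windows could plausibly share an
    # underlying rate. Disjoint intervals: they probably don't.
    low_a, high_a = credible_interval(window_a)
    low_b, high_b = credible_interval(window_b)
    return low_a <= high_b and low_b <= high_a
```

Two low-rate windows such as [1, 0, 1, 1] and [1, 1, 0, 2] come back similar under this sketch, while [1, 0, 1, 1] against [10, 9, 11, 10] does not.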
Where do these two windows come from, though? The first window is made of data from earlier in our time series: it starts at the very first time index of event counts. The second window is all of the remaining data. This function gives a pretty good overview of the process of trying many different splittings of the same set of data, looking for the point where the first window was generated by a significantly different rate.
Basically: loop through each event, and once we find a good splitting point, return the data split in two.
Partitioning everything is then a very simple function that mainly just applies this find_breaking_point function repeatedly.
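Since those two snippets also haven't survived here, below is one self-contained sketch of both pieces: find_breaking_point grows the left window until it looks different from everything after it, and partition_everything applies it repeatedly. All names and the credible-interval similarity test are my own reconstruction of what the post describes:

```python
import numpy as np
import scipy.stats as ss

HYPOTHESES = np.linspace(start=0.1, stop=100, num=1000)

def credible_interval(event_counts, level=0.95):
    # Posterior over hypothesized rates for this window, then the
    # central `level` credible interval of that posterior.
    counts = np.asarray(event_counts)[:, None]
    log_likelihood = ss.poisson.logpmf(counts, HYPOTHESES).sum(axis=0)
    probabilities = np.exp(log_likelihood - log_likelihood.max())
    probabilities /= probabilities.sum()
    cdf = np.cumsum(probabilities)
    tail = (1 - level) / 2
    return (HYPOTHESES[np.searchsorted(cdf, tail)],
            HYPOTHESES[np.searchsorted(cdf, 1 - tail)])

def windows_are_similar(left, right):
    # Overlapping credible intervals: plausibly the same rate.
    low_l, high_l = credible_interval(left)
    low_r, high_r = credible_interval(right)
    return low_l <= high_r and low_r <= high_l

def find_breaking_point(event_counts):
    # Grow the left window one time unit at a time until it looks
    # statistically different from everything that follows.
    for i in range(1, len(event_counts)):
        left, right = event_counts[:i], event_counts[i:]
        if not windows_are_similar(left, right):
            return left, right
    return event_counts, []  # never dissimilar: keep it whole

def partition_everything(event_counts):
    # Split off one partition at a time, then treat the remainder
    # as though it were the whole dataset.
    partitions = []
    remaining = list(event_counts)
    while remaining:
        left, right = find_breaking_point(remaining)
        partitions.append(left)
        remaining = right
    return partitions
```

Note that this left-to-right pass can over-split the data; that's exactly the quirk addressed in the next section.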
One more thing
There is an unanticipated consequence of the algorithm reading left to right: in some circumstances it will leave the data more partitioned than it should be. In our initial example, this is how what we've gone through so far will partition the event data:
Partitions will be:
- [1, 0, 1, 1, 1, 0, 2, 0, 0]
- [1, 1, 2, 1, 1, 10, 3, 1, 5, 2, 4]
- [7]
- [0, 1, 0, 0, 0, 1, 0, 0, 0]
Why is the 7 all alone? Because when the algorithm compared the windows [1, 1, 2, 1, 1, 10, 3, 1, 5, 2, 4] and [7, 0, 1, 0, 0, 0, 1, 0, 0, 0], they were different enough to be declared a breaking point even with the 7 included on the right. The next step is to take the window on the right and try to break it up, and the first comparison there is [7] against [0, 1, 0, 0, 0, 1, 0, 0, 0]. Those are pretty different too! So it breaks them up as well.
What should happen? You're probably already ahead of me. When we partition the data, we need to check if the next time index would have fit better in the previous partition. If it does, it needs to be merged into that partition.
To do this, we can add a second phase to our partitioning. The second phase runs through the partitioned data and tries to evaluate if each window is similar enough to its previous window of data that they should be merged. That's all this function does:
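That function isn't shown in this version of the post either, so here's a self-contained sketch of the second phase. merge_similar_partitions is my name for it, and it reuses the same credible-interval overlap test described earlier:

```python
import numpy as np
import scipy.stats as ss

HYPOTHESES = np.linspace(start=0.1, stop=100, num=1000)

def credible_interval(event_counts, level=0.95):
    # Posterior over hypothesized rates, then its central credible interval.
    counts = np.asarray(event_counts)[:, None]
    log_likelihood = ss.poisson.logpmf(counts, HYPOTHESES).sum(axis=0)
    probabilities = np.exp(log_likelihood - log_likelihood.max())
    probabilities /= probabilities.sum()
    cdf = np.cumsum(probabilities)
    tail = (1 - level) / 2
    return (HYPOTHESES[np.searchsorted(cdf, tail)],
            HYPOTHESES[np.searchsorted(cdf, 1 - tail)])

def windows_are_similar(window_a, window_b):
    low_a, high_a = credible_interval(window_a)
    low_b, high_b = credible_interval(window_b)
    return low_a <= high_b and low_b <= high_a

def merge_similar_partitions(partitions):
    # Phase two: walk the partitions left to right, folding each one
    # into its predecessor when they look statistically similar.
    merged = [list(partitions[0])]
    for window in partitions[1:]:
        if windows_are_similar(merged[-1], window):
            merged[-1].extend(window)  # probably the same rate: merge
        else:
            merged.append(list(window))
    return merged
```

Running the over-partitioned example above through it folds the lone [7] back into the partition before it.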
Inside that function you can see where we check whether two adjacent windows are similar, and how that check decides whether or not the two windows are merged. By running our partitioned data through this last function, we get the answer we would expect:
- [1, 0, 1, 1, 1, 0, 2, 0, 0]
- [1, 1, 2, 1, 1, 10, 3, 1, 5, 2, 4, 7]
- [0, 1, 0, 0, 0, 1, 0, 0, 0]
One more test
What if we ran this algorithm over a set of data that all came from the same event rate? Ideally, it wouldn't be partitioned because the whole thing should look like it's been generated from a single event rate. There's a test for that in the repo but this is the result:
Before: [2, 0, 2, 6, 5, 3, 3, 1, 0, 2, 2, 3, 5, 1, 4, 2, 4, 5, 3, 2, 4, 2, 1, 4, 2, 4, 2, 4, 0, 2, 3, 4, 4, 6, 1, 1, 3, 3, 1, 3, 1, 2, 1, 1, 6, 3, 3, 1, 2, 5]
After: [[2, 0, 2, 6, 5, 3, 3, 1, 0, 2, 2, 3, 5, 1, 4, 2, 4, 5, 3, 2, 4, 2, 1, 4, 2, 4, 2, 4, 0, 2, 3, 4, 4, 6, 1, 1, 3, 3, 1, 3, 1, 2, 1, 1, 6, 3, 3, 1, 2, 5]]
So it wasn't partitioned which is exactly what we would expect.
Pulling it all together
That's this technique in a very lengthy nutshell! This is probably something that already exists and has a paper written about it... or it doesn't work and I don't know how it doesn't work yet. Until I figure out what this technique is called, I'm laying claim to it as Bozo's Dynamic Frequency Partitioning.
You can see the code in full and some examples on my GitHub repo here:
Future improvements and research
We could possibly improve the inference by helping it make more simplifications. In our first example, it's entirely possible that the first and third partitions actually share the same underlying rate and just happened to measure differently. We could have the algorithm prefer to reuse previous rates when they're statistically similar, so that we get an even simpler narrative.
So instead of this report
We'd see this
It's a slight difference in this case, but it would make automated summarization read better as well. Instead of saying "Month 10 usage increased from 0.67 rentals a month to 3.2 rentals a month. It then fell in month 22 to 0.22 rentals a month.", we would say "Month 10 usage increased from 0.67 rentals a month to 3.2 rentals a month and fell back to normal in month 22."
The repo is called "Beyond Churn" because I plan to add more techniques to help deal with frequency analysis in modern businesses beyond the simplistic black & white concept of churn. Comments are always welcome (critical or not). Especially leave a comment if you know what this technique is officially called. I'd love to read more about it.