Beta Prior for a Binomial
“We’re all Bayesians in the foxhole”
How can we use prior data to help with low counts?
It happens all the time. We're trying to estimate the rate of something (e.g. clickthrough rate), but we've got low counts for some of the items. When somebody sorts by clickthrough rate, the top results look like trash or don't make sense. Often the reason is that low counts make for high variance. Maybe the true rate is 0.1, but you got lucky and got a clickthrough on the first impression, so the measured rate is 1.0. The low counts create high variance and make it hard to find good performers in a principled way, since you're mostly just seeing the outliers of a high-variance distribution.
Intuitively we'd like to shrink these estimates back toward our prior somehow, but what's a principled way to do that? One option is to add in some pseudocounts, but how many should we add? Fortunately a lot of smart people have thought about this, so let's see what they suggest.
Math-wise it works out conveniently if we can describe our prior in the form of the $Beta(\alpha,\beta)$ distribution. Let our prior be $P(\theta|\alpha,\beta) = Beta(\alpha,\beta)$ and our observed data of successes and failures be $D=(S,F)$, then

$$P(\theta|D) \propto P(D|\theta)\,P(\theta|\alpha,\beta) \propto \theta^{S}(1-\theta)^{F}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{S+\alpha-1}(1-\theta)^{F+\beta-1}$$

Adding back the normalization factor for the beta:

$$P(\theta|D) = \frac{\theta^{S+\alpha-1}(1-\theta)^{F+\beta-1}}{B(S+\alpha,\,F+\beta)} = Beta(S+\alpha,\,F+\beta)$$

So to predict, we use the posterior mean:

$$\hat{\theta} = E[\theta|D] = \frac{S+\alpha}{S+F+\alpha+\beta}$$
So now $\alpha,\beta$ can be thought of as pseudocounts that we add to the numerator and denominator, combining the data we've seen with the prior from previous data or experience.
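As a quick illustration (the prior pseudocounts here are made up for the example), here's that shrinkage in R. Suppose prior experience suggests a rate around 0.1, encoded as $\alpha=2, \beta=18$, and we have one lucky item with 1 click on 1 impression:

# Hypothetical prior pseudocounts: prior mean is alpha / (alpha + beta) = 0.1
alpha <- 2
beta <- 18

# One impression, one click: the raw rate is 1.0
S <- 1  # successes (clicks)
N <- 1  # total impressions (successes + failures)

raw_rate <- S / N                                 # 1.0
shrunken_rate <- (S + alpha) / (N + alpha + beta) # 3 / 21, about 0.14

The one lucky click only nudges the estimate a little above the prior mean of 0.1 instead of dominating it.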
A not-very-good but very fast way to estimate $\alpha,\beta$ in R is to match the mean and variance of your historical rates (method of moments):
# Method-of-moments fit of a Beta(alpha, beta) to a sample mean and variance
estBetaParams <- function(mu, var) {
  alpha <- ((1 - mu) / var - 1 / mu) * mu ^ 2
  beta <- alpha * (1 / mu - 1)
  list(alpha = alpha, beta = beta)
}
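For example (the historical rates below are made up), you could fit the prior to clickthrough rates from items that already have plenty of traffic, then use the resulting $\alpha,\beta$ as the pseudocounts above:

# Hypothetical clickthrough rates for items with lots of impressions
historical_ctr <- c(0.08, 0.12, 0.10, 0.05, 0.15, 0.09, 0.11)
prior <- estBetaParams(mean(historical_ctr), var(historical_ctr))

# Shrink a low-count item: 1 click out of 1 impression
S <- 1
N <- 1
(S + prior$alpha) / (N + prior$alpha + prior$beta)

With these numbers the lucky 1-for-1 item lands near 0.11 instead of 1.0, so it no longer jumps to the top of the sort.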