Facebook Practical Lessons

Xinran He et al. - “Practical Lessons from Predicting Clicks on Ads at Facebook” paper

Interesting to note that of the 11 authors, 5 had already left Facebook at the time of writing.

They use Normalized Cross Entropy (NE) and calibration as their major evaluation metrics.

Normalized Entropy is the predictive log loss normalized by the entropy of the background CTR, i.e. what you gain over just predicting the empirical CTR of the training data set.

$$ NE = \frac{-\frac{1}{N}\sum_{i=1}^N\left(\frac{1+y_i}{2}\log(p_i)+\frac{1-y_i}{2}\log(1-p_i)\right)}{-(p\log(p) + (1-p)\log(1-p))} $$

Where $p_i$ is the estimated probability of click for a particular ad, $p$ is the average empirical CTR and $y_i \in \{-1,+1\}$. NE has the benefit of reflecting both the goodness of the predictions and, implicitly, the calibration. The numerator goes to zero for perfect predictions: when $p_i = 1, y_i = 1$ the loss term is $\frac{2}{2}\log(1) + \frac{0}{2}\log(0) = 0$ (with the convention $0 \cdot \log 0 = 0$), and similarly when $p_i = 0, y_i = -1$ the loss term is $\frac{0}{2}\log(0) + \frac{2}{2}\log(1) = 0$.
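
As a sanity check, here is a minimal numpy sketch of the metric (the function name and toy data are mine, not from the paper):

```python
import numpy as np

def normalized_entropy(y, p, eps=1e-12):
    """Normalized Cross Entropy as defined above.

    y : labels in {-1, +1}
    p : predicted click probabilities
    """
    p = np.clip(p, eps, 1 - eps)
    # average log loss of the model's predictions
    log_loss = -np.mean(((1 + y) / 2) * np.log(p) + ((1 - y) / 2) * np.log(1 - p))
    # entropy of the background CTR (always predicting the empirical CTR)
    ctr = np.mean((1 + y) / 2)
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return log_loss / background

# A model that predicts the empirical CTR for everything gets NE = 1.0;
# anything below 1.0 is an improvement over that background predictor.
y = np.array([1, -1, -1, -1, 1, -1, -1, -1])
print(normalized_entropy(y, np.full(len(y), (1 + y).mean() / 2)))  # ~1.0
```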

The big innovation here is to use gradient boosted decision trees to learn the relevant data crosses. Their model claims a 4% decrease in NE (quite large) by using a binary encoding of which leaf an input falls into in each tree. Each tree contributes a one-hot block: if the ensemble has two trees, the first with 3 terminal leaves and the second with 2, then an input that falls into the second leaf of the first tree and the first leaf of the second tree is encoded as [0,1,0,1,0]. This effectively automates a lot of the feature engineering, since the decision trees learn which data crosses to do.

Turning Decision Trees into Features
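
A rough sketch of the idea using scikit-learn (the paper uses Facebook's own boosting setup; the toy dataset and hyperparameters here are just to illustrate the leaf encoding):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosted trees learn the feature crosses; we only use them as a feature transform.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# apply() returns the index of the leaf each sample falls into, one column per tree.
leaves_train = gbdt.apply(X_train)[:, :, 0]
leaves_test = gbdt.apply(X_test)[:, :, 0]

# One-hot encode the leaf indices: exactly one "1" per tree, as in the paper's example.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves_train), y_train)
print(lr.score(encoder.transform(leaves_test), y_test))
```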

They also show that they get large improvements by training online, and demonstrate the importance of data freshness in their Ads models. I wonder if the same is true for more static data sets than Ads, like movie recommendations. This echoes retargeting being important, i.e. figuring out the user’s current context.

They show that the learning rate schedule has a large impact. The one that works best for them is what they call the “per-coordinate learning rate”, credited to Google’s McMahan et al. paper from KDD 2013.

$$ \eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{j=1}^t\nabla_{j,i}^2}} $$
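
Here $\nabla_{j,i}$ is the gradient on the $j$-th example with respect to coordinate $i$, and $\alpha, \beta$ are tuned constants. A minimal sketch of online logistic regression with this update (the $\alpha$ and $\beta$ values below are placeholders, not the paper's tuned ones):

```python
import numpy as np

def sgd_per_coordinate(stream, dim, alpha=0.1, beta=1.0):
    """Online logistic regression with a per-coordinate learning rate.

    stream : iterable of (x, y) with x a dense feature vector and y in {-1, +1}
    """
    w = np.zeros(dim)
    grad_sq_sum = np.zeros(dim)                      # running sum of squared gradients per coordinate
    for x, y in stream:
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))      # predicted click probability
        g = (p - (y + 1) / 2) * x                    # gradient of the log loss w.r.t. w
        grad_sq_sum += g ** 2
        eta = alpha / (beta + np.sqrt(grad_sq_sum))  # per-coordinate learning rate
        w -= eta * g
    return w
```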

The constant per-weight learning rate performed the worst overall. The gain from the worst to the best scheme was 5% NE, so comparable to changing the model or using the decision tree features described above.

They compare their logistic regression approach to the Bayesian online learning scheme for probit regression (BOPR) described in Bing’s Graepel et al. paper from ICML 2010. BOPR has a similar online update but maintains both a mean and a variance for each weight. This makes the update more expensive, but it yields a full posterior distribution over the click probability, which can be used for explore/exploit learning schemes.
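
Roughly, the BOPR update for a single impression with sparse binary features looks like the sketch below. This is my reconstruction from Graepel et al. (here $\beta$ is the probit noise scale, not the learning rate constant above), so treat the exact form as an assumption rather than the paper's implementation:

```python
import numpy as np
from scipy.stats import norm

def bopr_update(mu, sigma2, active, y, beta=1.0):
    """One BOPR-style update on a single impression.

    mu, sigma2 : per-weight posterior mean and variance arrays
    active     : indices of the binary features that are on for this impression
    y          : label in {-1, +1}
    """
    total_var = beta ** 2 + sigma2[active].sum()
    s = np.sqrt(total_var)
    t = y * mu[active].sum() / s
    v = norm.pdf(t) / norm.cdf(t)   # correction applied to the means
    w = v * (v + t)                 # correction applied to the variances
    mu[active] += y * (sigma2[active] / s) * v
    sigma2[active] *= 1.0 - (sigma2[active] / total_var) * w
    return mu, sigma2
```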

They observe that after 500 trees the model does not improve very much and can even get worse due to over-training. They also observe that the top 10 features are responsible for about half of the total feature importance, as measured by summing the squared loss reduction over all trees. Finally, they see that negative downsampling at a rate of 0.025 gives the maximal reduction in NE. It is interesting that 0.025 does best and that keeping too many negatives actually hurts training.
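
The paper also notes that predictions trained on negatively downsampled data have to be re-calibrated: with downsampling rate $w$ and raw prediction $p$, the calibrated probability is $q = p/(p + (1-p)/w)$. A small sketch of both steps (function names and the $y \in \{0,1\}$ label convention here are mine):

```python
import numpy as np

def downsample_negatives(X, y, rate=0.025, seed=0):
    """Keep all positives and a `rate` fraction of negatives (labels in {0, 1})."""
    rng = np.random.default_rng(seed)
    keep = (y == 1) | (rng.random(len(y)) < rate)
    return X[keep], y[keep]

def recalibrate(p, rate=0.025):
    """Map predictions from the downsampled space back to the original scale:
    q = p / (p + (1 - p) / w), where w is the negative downsampling rate."""
    return p / (p + (1.0 - p) / rate)
```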