Hi! My name is Chuck. Most of the content here was written for my own education but any comments and corrections are appreciated.
Web3
As Mugatu would say, “Web3 and crypto, so hot right now”. Nothing like new financial instruments to stir up passions, and web3 might be one of the greatest branding campaigns ever. Cryptocurrencies like Bitcoin and Ethereum have captured the imagination with their run to over $1 trillion in value. These cryptocurrencies have also received a lot of criticism, as their proof-of-work validation schemes consume more power than Sweden, using over 110 terawatt-hours per year.
vtreat - prepping your data
“80% of data analysis is spent on the process of cleaning and preparing the data”
– Dasu and Johnson 2003
In the glamorous world of big data and deep learning, getting your hands into the dirt of real-world data is often left as an exercise for the analyst, with hardly a footnote in the resulting paper or software.
With the advent of open source software there are more and more people having to deal with the pain of real data, and packages like vtreat (paper) and tidyr (paper) address exactly this kind of work. I recently got my hands on some data sets with high-cardinality categorical values and was wondering what the best way to handle them was when I came upon vtreat and the ideas it covers.
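To make the high-cardinality problem concrete, here is a minimal sketch of out-of-fold impact (target) coding, the flavor of treatment vtreat popularizes, written directly in pandas rather than through vtreat's own API; the column names, smoothing constant, and fold count are made-up illustrations.

```python
import pandas as pd
from sklearn.model_selection import KFold

def impact_code(df, cat_col, target_col, n_splits=5, smoothing=20.0):
    """Out-of-fold impact (target) coding for a high-cardinality categorical.

    Each row gets the smoothed mean of the target computed on the *other*
    folds, which avoids leaking its own label into the encoding.
    """
    global_mean = df[target_col].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    for train_idx, apply_idx in KFold(n_splits=n_splits, shuffle=True,
                                      random_state=0).split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # shrink rare levels toward the global mean
        smoothed = ((stats['count'] * stats['mean'] + smoothing * global_mean)
                    / (stats['count'] + smoothing))
        encoded.iloc[apply_idx] = (df.iloc[apply_idx][cat_col]
                                   .map(smoothed)
                                   .fillna(global_mean)
                                   .to_numpy())
    return encoded

# Hypothetical usage: encode a zip-code column against a binary outcome.
# df['zip_impact'] = impact_code(df, 'zip_code', 'churned')
```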
Personalized PageRank
PageRank is a metric developed by Google founders Larry Page and Sergey Brin. PageRank is defined on a directed graph $G=(V,E)$ as the probability that a random walk over the edges will be at any particular node $v_i$. Since there may be vertices that only point to each other (spider traps), they introduced the idea of a random restart: with probability $1-\beta$ the random walk jumps to a random vertex, where $\beta$ is also known as the damping factor. In the personalized variant, the restart jumps to a chosen set of seed vertices rather than uniformly at random, which biases the ranking toward nodes near those seeds.
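As a sketch of how that random walk is usually computed, here is a small power-iteration implementation in numpy; the toy graph, $\beta = 0.85$, and the fixed iteration count are illustrative assumptions rather than anything from a specific paper.

```python
import numpy as np

def pagerank(adj, beta=0.85, iters=100):
    """Power iteration for PageRank with random restarts.

    adj[i, j] = 1 means an edge from node i to node j; beta is the
    probability of following an out-edge, 1 - beta of teleporting.
    """
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    # column-stochastic transition matrix (dangling nodes teleport uniformly)
    M = np.where(out_degree > 0, adj / np.maximum(out_degree, 1), 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * M @ r + (1 - beta) / n
    return r

# Toy example: a small spider trap (nodes 2 and 3 only point to each other).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))
```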
Multi-classification
So you want to predict more than two classes? Well, life is hard and it isn’t as easy. Some approaches:
Softmax
For binary classification we would normally take our $z = w^T x$ and calculate the logistic function $\frac{1}{1+e^{-z}}$.
When there are multiple classes there are a few different approaches. First is the softmax
$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $$

If you have a large number of classes it is computationally expensive to calculate $\sigma(z)_j$ for each class, so you can use a hierarchical softmax. For hierarchical softmax there is a binary tree where the input starts at the root and then, depending on the response, picks a different path down the tree to find the final label at the leaves.
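For the plain (non-hierarchical) case, a numerically stable numpy version looks like this; subtracting the max before exponentiating is the standard trick to avoid overflow.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# z_j = w_j^T x for K classes; the outputs are probabilities that sum to 1.
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```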
Kaggle SuperStars - Gradient Boosted Decision Trees (GBDT)
Deep learning is cool, and for applications like image processing and speech understanding it has truly raised the bar, but if you want to win a Kaggle competition you might want to check out Gradient Boosted Decision Trees (GBDTs) like XGBoost [code] from Tianqi Chen or the newer contender from Microsoft, LightGBM [code].
What are the things that make GBDTs attractive? (A small training sketch follows the list.)
- They can learn the important crosses between input variables by incorporating them into trees.
- They require minimal data processing before inputting data. No normalization is required since tree splits depend only on the ordering of values rather than the values themselves.
- Implementations like XGBoost and LightGBM can be trained in a distributed fashion on large datasets quickly.
- Lots of parameters to regularize and tune performance.
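Here is the training sketch mentioned above: XGBoost through its scikit-learn wrapper on raw, unscaled features; the dataset choice and parameter values are placeholders rather than a tuned setup.

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Raw, unscaled features go straight in -- tree splits only care about ordering.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,     # number of boosting rounds
    max_depth=4,          # depth controls which feature crosses can be learned
    learning_rate=0.1,    # shrinkage on each tree's contribution
    subsample=0.8,        # row subsampling, one of many regularization knobs
)
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```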
Some downsides to GBDTs are:
Factorization Machines
Steffen Rendle’s Factorization Machines paper is a cool tour de force showing off a new machine learning approach. He proposes a new kind of linear model that can handle very sparse data and automatically model interactions between different variables. Not only is it interesting theoretically, but variants have been winners in Kaggle ad CTR prediction contests for Criteo and Avazu, and they have been tried out at scale in real advertising systems and performed well.
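To make the automatic-interaction idea concrete, here is a small numpy sketch of the second-order FM prediction using the O(kn) reformulation of the pairwise term from the paper; the toy input and randomly initialized weights stand in for learned parameters.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine: w0 + <w, x> + pairwise interactions.

    The pairwise term sum_{i<j} <V_i, V_j> x_i x_j is computed in O(k*n) as
    0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ].
    """
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

# Toy sparse input with n features and k latent factors per feature.
n, k = 10, 4
rng = np.random.default_rng(0)
x = np.zeros(n); x[[1, 7]] = 1.0          # e.g. one-hot user and ad ids
w0, w, V = 0.0, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_predict(x, w0, w, V))
```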
Facebook Practical Lessons
Xinran He et al. - “Practical Lessons from Predicting Clicks on Ads at Facebook” paper
Interesting to note that of the 11 authors 5 had already left Facebook at the time of writing.
They use Normalized Cross Entropy (NE) and calibration as their major evaluation metrics.
Normalized entropy is the predictive log loss normalized by the entropy of the background CTR, i.e. what you are gaining over just using the empirical CTR of the training data set.
$$ NE = \frac{-\frac{1}{N}\sum_{i=1}^N\left(\frac{1+y_i}{2}\log(p_i)+\frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)} $$

where $p_i$ is the estimated probability of click for a particular ad, $p$ is the average empirical CTR, and $y_i \in \{-1,+1\}$. NE has the benefit of reflecting both the goodness of predictions and, implicitly, the calibration. The numerator captures this: when $p_i = 1, y_i = 1$ we get a loss of $\frac{2}{2}\log(1) + \frac{0}{2}\log(0) = 0$ (taking $0 \log 0 = 0$); similarly, when $p_i = 0, y_i = -1$ we get a loss of $0\log(0) + \frac{2}{2}\log(1) = 0$.
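A small numpy sketch of the NE formula above, with labels in $\{-1,+1\}$ as in the paper; the probability clipping is just a practical guard against $\log(0)$ and not part of the definition.

```python
import numpy as np

def normalized_entropy(p, y, eps=1e-12):
    """Normalized Cross Entropy: average log loss divided by the entropy
    of the background CTR, with y in {-1, +1} and p the predicted CTRs."""
    p = np.clip(p, eps, 1 - eps)
    avg_log_loss = -np.mean((1 + y) / 2 * np.log(p) + (1 - y) / 2 * np.log(1 - p))
    ctr = np.mean((1 + y) / 2)                      # empirical background CTR
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return avg_log_loss / background

# A model close to the true labels scores well below 1 (lower is better).
y = np.array([1, -1, -1, 1, -1])
p = np.array([0.9, 0.2, 0.1, 0.8, 0.3])
print(normalized_entropy(p, y))
```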
Wide and Deep Models
Google Play recommendations use a wide and deep modeling approach according to Heng-Tze Cheng, Levent Koc et al. The basic idea is that memorization of co-occurrence for app installs is very powerful when there is a clear signal, but it is hard for new applications and doesn’t generalize well. A deep model over more primitive features (e.g. words, categories) gives better generalization, and combining it with the memorization of the wide model gets something like the best results from each.
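A rough Keras sketch of the idea, assuming some one-hot cross-product features for the wide side and an embedded app id plus dense features for the deep side; the shapes and vocabulary size are made up, and the two sides are combined with a single output layer rather than reproducing the paper's exact joint-training setup.

```python
import tensorflow as tf

# Assumed inputs: 1000 one-hot cross features (wide side), an app-id integer
# with a 10k vocabulary plus 8 dense features (deep side). Purely illustrative.
wide_in = tf.keras.Input(shape=(1000,), name="cross_features")
app_id = tf.keras.Input(shape=(1,), dtype="int32", name="app_id")
dense_in = tf.keras.Input(shape=(8,), name="dense_features")

# Deep side: embed the high-cardinality id, concatenate with dense features.
emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(10_000, 32)(app_id))
deep = tf.keras.layers.Dense(64, activation="relu")(
    tf.keras.layers.Concatenate()([emb, dense_in]))
deep = tf.keras.layers.Dense(32, activation="relu")(deep)

# Combine the wide (memorization) and deep (generalization) sides into one
# sigmoid output, trained jointly with a binary cross-entropy loss.
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([wide_in, deep]))
model = tf.keras.Model([wide_in, app_id, dense_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```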