chuckcode
  • Contact
  • Posts
  • Self

Posts

accumulated cruft

Output from the chuckcode generative model

February 6, 2022

Web3

As Mugatu would say, “Web3 and crypto, so hot right now.” Nothing like new financial instruments to stir up passions, and web3 might be one of the greatest branding campaigns ever. Cryptocurrencies like Bitcoin and Ethereum have captured the imagination with their run to over $1 trillion in value. These cryptocurrencies have also received a lot of criticism, as their proof-of-work validation schemes consume more power than Sweden, using over 110 terawatt-hours per year.

read more
August 6, 2018

vtreat - prepping your data

“80% of data analysis is spent on the process of cleaning and preparing the data”

– Dasu and Johnson 2003

In the glamorous world of big data and deep learning, getting your hands into the dirt of real-world data is often left as an exercise for the analyst, with hardly a footnote in the resulting paper or software.

With the advent of open source software, more and more people have to deal with the pain of real data, and packages like vtreat (paper) and tidyr (paper) have grown up to help. I recently got my hands on some data sets with high-cardinality categorical values and was wondering what the best way to handle them is when I came upon vtreat and the ideas it covers.
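
As a rough illustration of one idea vtreat covers, here is a pandas sketch of impact ("target") coding a high-cardinality categorical: replace each level with a smoothed deviation of its outcome mean from the global mean. The data and smoothing constant are made-up, and the real package also cross-validates to avoid the nested-model bias this naive version suffers from.

```python
import pandas as pd

# Made-up data: a high-cardinality categorical (zip) and a binary outcome (sale).
df = pd.DataFrame({
    "zip":  ["10001", "10001", "94103", "94103", "60601", "02139"],
    "sale": [1, 0, 1, 1, 0, 1],
})

global_mean = df["sale"].mean()
stats = df.groupby("zip")["sale"].agg(["mean", "count"])

k = 5.0  # smoothing strength: rare levels get pulled toward the global mean
smoothed = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)

# Impact code: how much each level shifts the outcome relative to the global mean.
df["zip_impact"] = df["zip"].map(smoothed - global_mean)
print(df)
```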

read more
March 10, 2018

Personalized PageRank

PageRank is a metric developed by Google founders Larry Page and Sergey Brin. It is defined on a directed graph $G=(V,E)$ as the probability that a random walk over the edges will be on any particular node $v_i$. As there may be vertices that only point to each other (spider traps), they introduced the idea of a random restart: with probability $\beta$ the random walk will jump to a random vertex. This is also known as the damping factor.
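
A minimal numpy power-iteration sketch of the idea, following the post's convention that $\beta$ is the restart probability; the graph and defaults are made-up, and passing a one-hot restart vector instead of the uniform one gives the personalized variant.

```python
import numpy as np

def personalized_pagerank(adj, beta=0.15, restart=None, iters=100):
    """adj[i, j] = 1.0 for an edge i -> j; with probability beta the walk
    restarts from the `restart` distribution (uniform = plain PageRank)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-normalized transition matrix; dangling nodes jump uniformly.
    P = np.where(out_deg[:, None] > 0, adj / np.maximum(out_deg[:, None], 1), 1.0 / n)
    r = np.full(n, 1.0 / n) if restart is None else restart
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = (1 - beta) * (P.T @ v) + beta * r
    return v

# Tiny made-up graph: 0 -> 1, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
print(personalized_pagerank(A))
```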

read more
March 3, 2018

Multi-classification

So you want to predict more than two classes? Well, life is hard and it isn't as easy as the binary case. Some approaches:

Softmax

For binary classification we would normally take our $z = w^T x$ and pass it through the logistic function $\frac{1}{1+e^{-z}}$.

When there are multiple classes there are a few different approaches. First is the softmax

$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $$

If you have a large number of classes it is computationally expensive to calculate $\sigma(z)_j$ for each class, so you can use a hierarchical softmax. For hierarchical softmax there is a binary tree where the input starts at the root and, depending on the response at each node, picks a different path down the tree to find the final label at a leaf.
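
A tiny numpy sketch of the softmax above; subtracting the max before exponentiating is the usual trick to avoid overflow and does not change the result. The scores are made-up.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(softmax(z))              # probabilities that sum to 1
```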

read more
March 3, 2018

Kaggle SuperStars - Gradient Boosted Decision Trees (GBDT)

Deep learning is cool and has truly raised the bar for applications like image processing and speech understanding, but if you want to win a Kaggle competition you might want to check out Gradient Boosted Decision Trees (GBDTs) like XGBoost [code] from Tianqi Chen or the newer contender from Microsoft, LightGBM [code].
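
As a rough illustration, a minimal training sketch using XGBoost's scikit-learn wrapper, assuming xgboost and scikit-learn are installed; the dataset and parameters are just placeholders, not a tuned setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Small example dataset; note there is no normalization step before training.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```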

What are the things that make GBDTs attractive?

  1. They can learn the important crosses between input variables by incorporating them into trees.
  2. They require minimal data processing before inputting data. No normalization is required since they use ranks of data rather than values themselves.
  3. Implementations like XGBoost and LightGBM can be trained in a distributed fashion on large datasets quickly.
  4. Lots of parameters to regularize and tune performance.

Some downsides to GBDTs are:

read more
February 15, 2018

Factorization Machines

Steffen Rendle’s Factorization Machines paper is a cool tour de force showing off a new machine learning approach. He proposes a new kind of linear model that can handle very sparse data and automatically model interactions between different variables. Not only is it interesting theoretically, but variants have been winners in the Kaggle ad CTR prediction contests for Criteo and Avazu, and have been tried out in real advertising systems at scale and performed well.
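
A small numpy sketch of the second-order FM prediction, using the $O(kn)$ reformulation of the pairwise term from the paper; the weights, factors, and input here are random placeholders.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (n,) features, w0: bias, w: (n,) linear weights, V: (n, k) factors."""
    linear = w0 + w @ x
    # sum_{i<j} <v_i, v_j> x_i x_j, computed without the explicit double loop:
    # 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    s = V.T @ x
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 10, 4
x = rng.random(n)
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```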

read more
February 3, 2018

Facebook Practical Lessons

Xinran He et al. - “Practical Lessons from Predicting Clicks on Ads at Facebook” paper

Interesting to note that of the 11 authors 5 had already left Facebook at the time of writing.

They use Normalized Cross Entropy (NE) and calibration as the major evaluation metrics.

Normalized entropy is the predictive log loss normalized by the entropy of the background CTR, i.e. what are you gaining over just the empirical CTR of the training data set.

$$ NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log(p_i)+\frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)} $$

Where $p_i$ is the estimated probability of click for a particular ad, $p$ is the average empirical CTR and $y_i \in \{-1,+1\}$. NE has the benefit of reflecting both the goodness of the predictions and, implicitly, the calibration. The numerator means that when $p_i = 1, y_i = 1$ we get a loss of $\frac{2}{2}\log(1) + \frac{0}{2}\log(0) = 0$ (taking $0 \cdot \log 0 = 0$), and similarly when $p_i = 0, y_i = -1$ we get a loss of $\frac{0}{2}\log(0) + \frac{2}{2}\log(1) = 0$.
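
A small numpy sketch of the NE formula above, with made-up predictions and labels in $\{-1,+1\}$; the clipping is only there to avoid $\log(0)$.

```python
import numpy as np

def normalized_entropy(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)             # avoid log(0)
    ll = -np.mean((1 + y) / 2 * np.log(p) + (1 - y) / 2 * np.log(1 - p))
    ctr = np.mean((1 + y) / 2)                   # empirical background CTR
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return ll / background

y = np.array([1, -1, -1, 1, -1])                 # hypothetical labels
p = np.array([0.9, 0.2, 0.1, 0.7, 0.3])          # hypothetical predictions
print(normalized_entropy(p, y))
```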

read more
January 29, 2018

Wide and Deep Models

Google Play recommendations use a wide and deep modeling approach according to Heng-Tze Cheng, Levent Koc et al. The basic idea is that memorization of co-occurrence for app installs is very powerful when there is a clear signal, but it is hard for new applications and doesn’t generalize well. Use a deep learning model over more primitive features (e.g. words, category, etc.) to get better generalization and combine it with memorization to get something like the best results from each.
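
A minimal sketch of the combine step using tf.keras (assumed available); the feature widths, layer sizes, and inputs are invented here, and the real model wires in crossed columns and embeddings rather than pre-made dense vectors.

```python
import tensorflow as tf

n_wide = 1000   # hypothetical crossed/one-hot "memorization" features
n_deep = 64     # hypothetical dense "generalization" features

wide_in = tf.keras.Input(shape=(n_wide,), name="wide_features")
deep_in = tf.keras.Input(shape=(n_deep,), name="deep_features")

# Deep part: a small MLP over the dense features for generalization.
h = tf.keras.layers.Dense(128, activation="relu")(deep_in)
h = tf.keras.layers.Dense(64, activation="relu")(h)

# Combine the linear (wide) path with the deep path and predict jointly.
combined = tf.keras.layers.Concatenate()([wide_in, h])
out = tf.keras.layers.Dense(1, activation="sigmoid")(combined)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```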

read more
December 10, 2017

Beta Prior for a Binomial

“We’re all Bayesians in the foxhole”

How can we use prior data to help with low counts?

It happens all the time. We’re trying to estimate the rate of something (e.g. clickthrough rate) but we’ve got low counts for some of the items. When somebody sorts by clickthrough rate, it looks like the top results are trash or don’t make sense. Often the reason is that the low counts make for high variance. Maybe the true rate is 0.1, but you got lucky and got a clickthrough on the first impression, so the measured rate is 1.0. The low counts create high variance and make it hard to find good performers in a principled way, as you’re just getting the outliers of a high-variance distribution.
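
A tiny sketch of the idea: shrink each observed rate toward a Beta prior, so a lucky one-for-one item no longer tops the sort. The prior parameters here are made-up and would normally come from historical data.

```python
# Hypothetical Beta(alpha, beta) prior with mean CTR of 0.1
alpha, beta = 3.0, 27.0

def smoothed_ctr(clicks, impressions):
    # Posterior mean of a Beta-Binomial: prior counts are added to observed counts.
    return (clicks + alpha) / (impressions + alpha + beta)

print(smoothed_ctr(1, 1))        # 1 click on 1 impression: ~0.13, not 1.0
print(smoothed_ctr(100, 1000))   # plenty of data: ~0.10, close to the raw rate
```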

read more
April 9, 2017

Back to basics - logistic regression

“All models are wrong, but some are useful”

– George Box

Logistic regression is used to predict binary variables and is characterized by the logistic function

$$ \begin{align*} logistic(z) &= \frac{e^z}{e^z + 1} \\ \sigma (z) &= \frac{e^z}{e^z + 1} \\ &= \frac{1}{1 + e^{-z}} \\ \end{align*} $$

Cost Function:

Let $\mathbf{X}$ be the observed variables, $\boldsymbol{\beta}$ and $\beta_0$ the fitted parameters, $z = \beta_0 + \boldsymbol{\beta}^T\mathbf{X}$, and $logistic(z)$ the predicted value, also known as $\hat{y}$, with $y$ being the true label. Letting our labels be $\mathbf{y} \in \{0,1\}$ allows some simplification in the equations; other conventions use $\mathbf{y} \in \{-1,1\}$ and slightly more verbose notation.
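
A bare-bones numpy sketch of fitting the model by gradient descent on the log loss, with synthetic data, labels in $\{0,1\}$, and an arbitrary learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                              # observed variables
y = (X @ np.array([1.5, -2.0]) + 0.5 > 0).astype(float)    # labels in {0, 1}

beta0, beta = 0.0, np.zeros(2)
lr = 0.1
for _ in range(500):
    y_hat = sigmoid(beta0 + X @ beta)
    grad0 = np.mean(y_hat - y)            # gradient of the mean log loss w.r.t. beta0
    grad = X.T @ (y_hat - y) / len(y)     # gradient w.r.t. beta
    beta0 -= lr * grad0
    beta -= lr * grad

print(beta0, beta)
```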

read more
© chuckcode 2025