vtreat - prepping your data
“80% of data analysis is spent on the process of cleaning and preparing the data”
– Dasu and Johnson 2003
In the glamorous world of big data and deep learning, getting your hands into the dirt of real-world data is often left as an exercise for the analyst, with hardly a footnote in the resulting paper or software.
With the advent of open source software there are more and more people having to deal with the pain of real data, and packages like vtreat (paper) and tidyr (paper) have grown up to ease it. I recently got my hands on some data sets with high-cardinality categorical values and was wondering about the best way to handle them when I came upon vtreat and the ideas it covers.
vtreat does three things to help you with the problems you see in typical real-life data:
- How to deal with NAs in your data, arising from both random and systematic issues, in a principled way rather than just filtering them out.
- How to handle data with lots of categorical values, with approaches like impact coding.
- Variable pruning based on significance, to remove variables that don't look relevant, in a principled way.
Mostly vtreat deals with NAs in your data by either substituting the mean for numerical variables or creating a special 'NA' level for categorical variables. This isn't super exciting, but it's nice not to have to think about it. vtreat will also do the usual filtering if you want that too.
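As a minimal sketch of what that looks like in practice (the toy frame and column names here are my own, not from the vtreat docs):

```r
library(vtreat)

# toy frame: a numeric column with an NA and a categorical column with an NA
d <- data.frame(
  x = c(1, 2, NA, 4, 5),
  z = c("a", "b", NA, "b", "a"),
  y = c(1.1, 2.3, 2.9, 4.2, 5.1)
)

# design a treatment plan for the numeric outcome y
plan <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y")

# prepare() applies the plan: x's NA is patched with the mean of x,
# and an indicator column records which rows were patched
treated <- prepare(plan, d)
head(treated)
```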
Impact coding, also known as "likelihood encoding" or effect coding, has a history in the literature but was new to me and seems to make a lot of sense. Daniele Micci-Barreca's 2001 paper lays out the idea as being based on Empirical Bayes and goes over a few different low-count approaches and weightings as well. Impact encoding is a transformation from an input variable $X_i$ to the conditional probability of the target variable $Y$ given $X_i$, centered against the overall distribution of $Y$. Impact encoding plus pruning based on the significance of variables is an interesting approach for dealing with the large number of variables you get after one-hot encoding, and I'd like to see some comparisons to the more standard hashing trick, where variables are randomly assigned to buckets.
For categorical targets vtreat uses a version based on the difference in logits, $x\text{_}catB = \mathrm{logit}(P[y|x]) - \mathrm{logit}(P[y])$, and for numerical targets the difference in conditional means, $x\text{_}catN = E[y|x] - E[y]$ (both spelled out in the suffix lists below).
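To make that concrete, here is a hand-rolled version of the catB-style encoding in base R, skipping the Empirical Bayes smoothing from the paper (all names here are illustrative):

```r
set.seed(1)
# toy data: a 10-level categorical x and a binary outcome y
x <- sample(letters[1:10], 1000, replace = TRUE)
y <- rbinom(1000, 1, ifelse(x %in% c("a", "b"), 0.7, 0.3))

logit <- function(p) log(p / (1 - p))

p_global <- mean(y)            # P[y]
p_level  <- tapply(y, x, mean) # P[y|x] for each level

# naive impact code: logit(P[y|x]) - logit(P[y]); real implementations
# smooth rare levels toward the global rate so they don't blow up
impact <- logit(p_level) - logit(p_global)
x_catB <- as.numeric(impact[x])
head(data.frame(x, y, x_catB))
```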
One thing to be especially careful of here, though, is that you're implicitly creating a nested model, which can lead to pretty severe overfitting if you use the same data for impact coding and model training. You'll want to use a separate training set to estimate the impact coding, or use a resampling-and-averaging approach, to mitigate this. vtreat calls this issue out explicitly in its documentation, and there is a good StackExchange answer about this too that talks about the importance of not overfitting to the impact scores. It links to a few discussions on Kaggle about the importance of resampling (splitting and averaging) to get results that generalize well, along with some example kernels and code.
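vtreat's own mitigation is its cross-frame machinery, which fits the encodings on held-out folds so no row is encoded by a model that saw it; a sketch with made-up data:

```r
library(vtreat)

d <- data.frame(
  x = sample(letters[1:10], 1000, replace = TRUE),
  y = rbinom(1000, 1, 0.4)
)

# builds the treatment plan AND a cross-validated training frame
cfe <- mkCrossFrameCExperiment(d, varlist = "x",
                               outcomename = "y", outcometarget = 1)
train_treated <- cfe$crossFrame  # use this frame for model training
plan <- cfe$treatments           # use prepare(plan, newdata) for test data
```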
Variable pruning is done by estimating the significance of an F statistic for numerical problems or a $\chi^2$ statistic for logistic regression. This goes hand in hand with conversion to one-hot encoding, since categorical values have now exploded into a lot of sparse columns. This variable pruning is a heuristic and could miss variables that are useful only when combined with other variables (e.g. the reverse of Simpson's Paradox). vtreat recommends a significance threshold of $\rho = 1/n_{var}$, where $n_{var}$ is the number of candidate variables going in. As a reminder, a contingency table with $r$ rows and $c$ columns has $(r-1)(c-1)$ degrees of freedom ($c=2$ for our case with a single indicator variable), so a categorical variable with $k$ levels has $k-1$ degrees of freedom when calculating the significance of the $\chi^2$ statistic.
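In code, that rule of thumb amounts to filtering the treatment plan's score frame on its significance column, or passing the threshold to prepare() directly (a sketch, reusing a plan like the ones above):

```r
sf <- plan$scoreFrame   # one row per derived variable, with a 'sig' column
n_var <- nrow(sf)       # using derived-column count as the candidate count

# keep only variables whose significance beats the 1/n_var threshold
keep <- sf$varName[sf$sig < 1 / n_var]

# or let prepare() apply the same pruning for you
treated <- prepare(plan, d, pruneSig = 1 / n_var)
```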
One pain point with vtreat is that the interface is pretty confusing for the non-expert. Example names are designTreatmentsZ for when you don't have an outcome variable, designTreatmentsC for when you have a binary classification outcome, and designTreatmentsN for when you have a numeric regression outcome. Why the shared prefix gets 16 characters and the difference in functionality a single character is confusing. Their mkCrossFrameXExperiment functions are named similarly. I find their docs and naming conventions to be a little confusing as well. They use different suffixes on variable names to indicate their source and meaning; a real-world example can be seen in their demo or Kaggle examples.
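For reference, the three design calls look like this (assuming a toy frame d with inputs x, z and outcome y, as in the sketches above):

```r
# no outcome: y-free treatments (cleaning, indicators, prevalence only)
plan_z <- designTreatmentsZ(d, varlist = c("x", "z"))

# binary classification outcome: adds catB impact codes
plan_c <- designTreatmentsC(d, varlist = c("x", "z"),
                            outcomename = "y", outcometarget = 1)

# numeric regression outcome: adds catN impact codes
plan_n <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y")
```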
For categorical targets (e.g. binary classification):
- clean: numeric variable with NA/NaN/non-finite values replaced
- isBAD: indicator variable for a 'clean' variable above, showing where a value was replaced
- lev: one-hot encoding for a particular categorical variable level
- catB: impact encoding, $x\text{_}catB = impact(x) = \mathrm{logit}(P[y|x]) - \mathrm{logit}(P[y])$
- catP: prevalence fact, how prevalent the original level was
For numerical targets (e.g. regression):
- clean: numeric variable with NA/NaN/non-finite values replaced
- isBAD: indicator variable for a 'clean' variable above, showing where a value was replaced
- lev: one-hot encoding for a particular categorical variable level
- catN: impact encoding, $x\text{_}catN = impact(x) = E[y|x] - E[y]$
- catP: prevalence fact, how prevalent the original level was
- catD: deviation fact for a categorical level, telling whether 'y' is concentrated or diffuse when conditioned on the observed level
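Putting the suffixes together, the columns coming out of prepare() for a numeric target might look something like this (a hypothetical categorical input x and numeric input z; exact names depend on your data and vtreat version):

```r
colnames(prepare(plan_n, d))
# e.g. "x_catN" "x_catP" "x_catD" "x_lev_x_a" "x_lev_x_b"
#      "z_clean" "z_isBAD" "y"
```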