suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(recipes) )
suppressPackageStartupMessages( require(mlbench) )
recipes is a package meant to simplify data preparation and transformation steps, as well as to predefine the role of each variable in the dataset (predictor or response).
There is a very good tutorial on the package website (https://topepo.github.io/recipes/).
In this proof of concept we are going to perform several preparation steps using recipes.
Usually no single dataset will require all of these preparation steps, therefore we will use two different datasets.
The Glass dataset consists of a number of chemical elements and their concentrations in different types of glass. The type of glass is the response.
data(Glass)
data = Glass %>%
as_tibble()
summary(data)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Skewed data can be a problem for modelling algorithms that assume normally distributed data.
data %>%
summarise_if( is.numeric, e1071::skewness ) %>%
knitr::kable()
| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 1.602715 | 0.4478343 | -1.136452 | 0.8946104 | -0.7202392 | 6.460089 | 2.018446 | 3.36868 | 1.729811 |
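To get a feel for these numbers: K is by far the most skewed variable, and even a plain log transform should pull its long right tail in considerably. A quick illustration (not part of the recipe we build below):
# skewness of the most skewed variable before and after a simple log transform
e1071::skewness( data$K )
e1071::skewness( log1p( data$K ) )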
Highly correlated variables are problematic for some modelling algorithms such as regressions. Even for tree-based ensemble methods such as random forest, which can handle highly correlated variables, correlated predictors will show decreased importance when we investigate the contribution of variables to the final model. So we might want to remove them.
The corrplot package offers different methods of ordering the variables on the correlation matrix plot. Here we choose hclust.
require(corrplot)
correlations = cor( select_if( data, is.numeric ) )
corrplot(correlations, method = 'number', order = 'hclust')
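If the plot alone is not enough, the same matrix can be flattened into a sorted table of variable pairs. A small sketch using the correlations object from above:
# turn the correlation matrix into a long table of unique pairs,
# sorted by absolute correlation
correlations %>%
  as.data.frame() %>%
  rownames_to_column( 'var1' ) %>%
  gather( 'var2', 'cor', - var1 ) %>%
  filter( var1 < var2 ) %>%
  arrange( desc( abs(cor) ) ) %>%
  head()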
We first build the recipe by adding the data. Adding a formula as well assigns the roles of predictor and response to the variables.
rec = recipes::recipe(data, Type~.)
summary(rec)
| variable | type | role | source |
|---|---|---|---|
| RI | numeric | predictor | original |
| Na | numeric | predictor | original |
| Mg | numeric | predictor | original |
| Al | numeric | predictor | original |
| Si | numeric | predictor | original |
| K | numeric | predictor | original |
| Ca | numeric | predictor | original |
| Ba | numeric | predictor | original |
| Fe | numeric | predictor | original |
| Type | nominal | outcome | original |
We can add a number of transformation steps to the recipe.
The Yeo-Johnson transformation is similar to the Box-Cox transformation, which one can think of as a more fine-tuned version of a log transformation. However, unlike Box-Cox and the log transformation, which require strictly positive input, Yeo-Johnson can also handle zero and negative data.
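For reference, this is what the transformation does for a fixed lambda; step_YeoJohnson estimates lambda from the data, so the function below is only an illustrative sketch:
# Yeo-Johnson for a single, fixed lambda
yeo_johnson = function( y, lambda ) {
  ifelse( y >= 0,
          if( lambda != 0 ) ( (y + 1)^lambda - 1 ) / lambda else log( y + 1 ),
          if( lambda != 2 ) - ( (-y + 1)^(2 - lambda) - 1 ) / (2 - lambda) else - log( -y + 1 ) )
}
# unlike Box-Cox, zero and negative inputs pose no problem
yeo_johnson( c(-2, 0, 3), lambda = 0.5 )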
Adding steps is as easy as this. We simply must not forget to specify which variables should be affected by each step. In order to provide an example of the [syntax](https://topepo.github.io/recipes/reference/selections.html) we specify all numeric variables via all_numeric() and exclude the response via all_outcomes(). Since the response is a factor variable it would not be included in the all_numeric() selection anyway.
rec = rec %>%
step_center( all_numeric(), - all_outcomes() ) %>%
step_scale( all_numeric(), - all_outcomes() ) %>%
step_YeoJohnson( all_numeric(), - all_outcomes() )
rec = rec %>%
step_corr( all_numeric(), - all_outcomes(), threshold = 0.5 )
rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Steps:
##
## Centering for all_numeric(), -all_outcomes()
## Scaling for all_numeric(), -all_outcomes()
## Yeo-Johnson transformation on all_numeric(), -all_outcomes()
## Correlation filter on all_numeric(), -all_outcomes()
Here we prepare the recipe using a training dataset, which in our case will be the full dataset. We can then apply the same prepared recipe to future datasets. This comes in handy if we train a model on transformed data and later want to apply the identical transformations to new data, which is especially relevant for the transformation steps.
prep_rec = prep( rec, training = data )
prep_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Training data contained 214 data points and no missing data.
##
## Steps:
##
## Centering for RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Scaling for RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Yeo-Johnson transformation on RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Correlation filter removed Ca, Ba [trained]
data_trans = bake( prep_rec, data )
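To make the point about future data concrete, the same pattern with a (hypothetical) 80/20 split would look like this, re-using the trained means, standard deviations and lambda values on the held-out part:
set.seed(42)
idx = sample( nrow(data), size = floor( 0.8 * nrow(data) ) )
prep_split = prep( rec, training = data[ idx, ] )  # parameters are learned on the training split only
train_trans = bake( prep_split, data[ idx, ] )
test_trans = bake( prep_split, data[ -idx, ] )     # same transformations applied to unseen data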
sum_old = data %>%
summarise_if( is.numeric, e1071::skewness ) %>%
mutate( data = 'untransformed') %>%
select( data, everything() )
sum_trans = data_trans %>%
summarise_if( is.numeric, e1071::skewness ) %>%
mutate( data = 'transformed') %>%
select( data, everything() )
sum_old %>%
bind_rows( sum_trans) %>%
knitr::kable()
| data | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| untransformed | 1.6027151 | 0.4478343 | -1.1364523 | 0.8946104 | -0.7202392 | 6.4600889 | 2.018446 | 3.36868 | 1.7298107 |
| transformed | -0.2853362 | -0.1201328 | -0.4567434 | -0.0734347 | 0.2844468 | -0.0017153 | NA | NA | 0.8271965 |
Here we can see a great reduction in skewness: all values are closer to zero after the transformation. (Ca and Ba are NA because the correlation filter removed them.)
We can already see from the recipe summary that Ca and Ba have been removed; nevertheless, let's look at the correlation matrix.
correlations = cor( select_if( data_trans, is.numeric ) )
corrplot( correlations, method = 'number', order = 'hclust' )
None of the correlations is now greater than 0.5. For modelling we might want to lower the threshold even further, since 0.5 is still a fairly high correlation coefficient.
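Tightening the filter only requires rebuilding the recipe with a lower threshold; 0.3 here is a hypothetical choice:
rec_strict = recipe( data, Type~. ) %>%
  step_center( all_numeric(), - all_outcomes() ) %>%
  step_scale( all_numeric(), - all_outcomes() ) %>%
  step_YeoJohnson( all_numeric(), - all_outcomes() ) %>%
  step_corr( all_numeric(), - all_outcomes(), threshold = 0.3 )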
The Soybean dataset contains different attributes of soybean plants. The response variable is Class, and most of the predictors are factor variables.
data(Soybean)
data = Soybean %>%
as_tibble()
summary(data)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
Amelia::missmap( data )
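The map shows missing values scattered over almost all predictors. We can quantify the share of affected rows directly:
# fraction of observations with at least one missing value (121 of 683 rows)
mean( ! complete.cases(data) )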
Variables with near-zero variance are pretty useless for modelling: when splitting the data into training and validation sets there is always a large chance that one of the splits ends up with no occurrences of a specific class.
Following caret::nearZeroVar(), we define a variable as degenerate if both of the following are true: the ratio of the frequency of the most common value to that of the second most common value is large (freq_cut, here 95/5), and the percentage of unique values is low (unique_cut, here 10%). We check both criteria by hand further below.
summary( data[, caret::nearZeroVar(data) ] )
## leaf.mild mycelium sclerotia
## 0 :535 0 :639 0 :625
## 1 : 20 1 : 6 1 : 20
## 2 : 20 NA's: 38 NA's: 38
## NA's:108
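Using the counts from the summary above, we can compute both criteria by hand for mycelium:
# frequency ratio of the two most common values: 639 / 6
tab = table( data$mycelium )
max(tab) / sort( tab, decreasing = TRUE )[2]
# percentage of unique values, far below the 10% cut
n_distinct( data$mycelium ) / nrow(data) * 100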
rec = recipe(data, Class~.)
There are several techniques for dealing with missing data. We can either drop the observations that have incomplete values or drop the variables with incomplete measurements. Looking back at the missingness map, for this dataset that would mean dropping almost all of the variables or one third of all observations. Alternatively, we can assume that all missing values carry the mean or median value, or, more elegantly, we can try to impute the missing values based on similar observations that cluster in the same way.
rec = rec %>%
step_knnimpute( all_predictors() )
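For comparison, the simpler strategy mentioned above is also available as a step. Since all predictors in this dataset are factors, a mode imputation sketch would be (step_modeimpute in the recipes version used here; newer versions rename it step_impute_mode):
rec_mode = recipe( data, Class~. ) %>%
  step_modeimpute( all_nominal(), - all_outcomes() )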
rec = rec %>%
step_nzv( all_predictors(), options = list(freq_cut = 95/5, unique_cut = 10 ) )
prep_rec = prep( rec, training = data )
prep_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 35
##
## Training data contained 683 data points and 121 incomplete rows.
##
## Steps:
##
## 5-nearest neighbor imputation for plant.stand, precip, temp, ... [trained]
## Sparse, unbalanced variable filter removed mycelium [trained]
Note that two of the degenerate variables, leaf.mild and sclerotia, were not removed. Probably the imputation balanced them out, so let us compare the degenerate variables before and after the transformation.
data_trans = bake( prep_rec, data )
summary( data[, caret::nearZeroVar(data, freqCut = 95/5) ] )
## leaf.mild mycelium sclerotia
## 0 :535 0 :639 0 :625
## 1 : 20 1 : 6 1 : 20
## 2 : 20 NA's: 38 NA's: 38
## NA's:108
summary( data_trans[, caret::nearZeroVar(data_trans, freqCut = 95/5) ] )
## sclerotia
## 0:650
## 1: 33
As we suspected, the imputation of the missing values balanced out leaf.mild, which is no longer flagged. sclerotia, however, is still flagged by caret::nearZeroVar(): its frequency ratio after imputation sits just above the 95/5 cut-off, a borderline case that step_nzv left in.
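We can check the borderline case directly with the counts from above:
# frequency ratio of sclerotia after imputation: 650 / 33, just above the 95/5 = 19 cut-off
tab = table( data_trans$sclerotia )
max(tab) / sort( tab, decreasing = TRUE )[2]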
Amelia::missmap(data_trans)