suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(recipes) )
suppressPackageStartupMessages( require(mlbench) )

recipes is a package that is meant to simplify data preparation and transformation steps, as well as predefine the roles of each variable in the dataset (predictor and response).

There is a very good tutorial

in this proof of concept we are going to perform several preparation steps using recipes.

  • filter near zero variance variables
  • filter correlating variables
  • transform skew data
  • impute missing values
  • center and scale

Usually no dataset will require all of these preparation steps therefore we will use two different datasets.

1 Glass

The dataset consists of a number of chemical elements and their conecntrations in different types of glass. The type of glass is the predictor.

data(Glass)

data = Glass %>%
  as_tibble()

summary(data)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

1.1 Investigate Skewness

Skewed data can be a problem for all modelling algorithms that require normally distributed data.

  • Values around zero are unskewed
  • Negative values are left skewed
  • Positive values are right skewes
data %>%
  summarise_if( is.numeric, e1071::skewness ) %>%
  knitr::kable()
RI Na Mg Al Si K Ca Ba Fe
1.602715 0.4478343 -1.136452 0.8946104 -0.7202392 6.460089 2.018446 3.36868 1.729811

1.2 Inverstigate Correlation

Highly correlating variables are problematic for some modelling algorithms such as regressions. Even for tree-based ensembl methods such as randomforest which can handly highly correlating variables, they will have decreased importance when investigating the contribution of variables to the final model. So we might want to remove them.

The corrplot package offers different methods of ordering the variables on the correlation matrix plot. Here we chose hclust

require(corrplot)

correlations = cor( select_if( data, is.numeric )  )

corrplot(correlations, method = 'number', order = 'hclust')

1.3 Recipes

We first build the recipe by adding the data. Adding a formula as well assigns the role of predictor and response to the variables

rec = recipes::recipe(data, Type~.)

summary(rec)
variable type role source
RI numeric predictor original
Na numeric predictor original
Mg numeric predictor original
Al numeric predictor original
Si numeric predictor original
K numeric predictor original
Ca numeric predictor original
Ba numeric predictor original
Fe numeric predictor original
Type nominal outcome original

1.3.1 Adding recipes steps

We can add a number of transformation steps to the recipe

1.3.1.1 Remove Skewness by Yeo-Johnson transformation

The Yeo Johnson transformation is similar to the Box_Cox Transformation which one can think of a more fine-tuned version of a log transformation. However unlike BoxCox and log transformation Yeo-Johnson can handle negative data.

Adding steps is as easy as that. We simply must not forget to specify which variables should be affected by the step. In order to provide an example for the (synthax)[https://topepo.github.io/recipes/reference/selections.html] we specify all numeric variables via all_numeric() and exclude the response via all_outcomes(). Since the response is a factor variable it would actually not be included in the all_nzumerics() select.

rec = rec %>%
  step_center( all_numeric(), - all_outcomes() ) %>%
  step_scale( all_numeric(), - all_outcomes() ) %>%
  step_YeoJohnson( all_numeric(), - all_outcomes() )

1.3.1.2 Remove strongly correlating variables

  • Calculate correlation matrix
  • Determine a threshold ( 0.5 )
  • Determine the two predictors with the largest absolute pairwise correlation that surpases the threshold
  • Calculate the average correlation of both variables with the other variables in the dataset
  • Remove the variable that has the higher correlation with the other variables in the dataset.
  • Repeat the above steps until no correlation pair surpasses the threshold.
rec = rec %>%
  step_corr( all_numeric(), - all_outcomes(), threshold = 0.5 )

rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Steps:
## 
## Centering for all_numeric(), -all_outcomes()
## Scaling for all_numeric(), -all_outcomes()
## Yeo-Johnson transformation on all_numeric(), -all_outcomes()
## Correlation filter on all_numeric(), -all_outcomes()

1.3.2 Prepare the recipe

Here we can prepare the recipe, using a training dataset which in our case will be the full data set. We can use the same prepared recipe and apply it to future datasets. This comes in handy if we train a model on transformed data and then want to apply the same transformations on future data. This is especially relevant for the transformation steps.

prep_rec = prep( rec, training = data )


prep_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9
## 
## Training data contained 214 data points and no missing data.
## 
## Steps:
## 
## Centering for RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Scaling for RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Yeo-Johnson transformation on RI, Na, Mg, Al, Si, K, Ca, Ba, Fe [trained]
## Correlation filter removed Ca, Ba [trained]

1.3.3 Apply the recipe

data_trans = bake( prep_rec, data )

1.4 Validate the effect of the transformations

1.4.1 Skewness

sum_old = data %>%
  summarise_if( is.numeric, e1071::skewness ) %>%
  mutate( data = 'untransformed') %>%
  select( data, everything() )

sum_trans = data_trans %>%
  summarise_if( is.numeric, e1071::skewness ) %>%
  mutate( data = 'transformed') %>%
  select( data, everything() )

sum_old %>%
  bind_rows( sum_trans) %>%
  knitr::kable()
data RI Na Mg Al Si K Ca Ba Fe
untransformed 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.018446 3.36868 1.7298107
transformed -0.2853362 -0.1201328 -0.4567434 -0.0734347 0.2844468 -0.0017153 NA NA 0.8271965

Here we can see a great reduction in skewness (all values are closer to zero after the transformation)

1.4.2 Correllations

We can already see from the recipe summary that Ca and Ba have been removed, however lets look at the correlation matrix

correlations = cor( select_if( data_trans, is.numeric ) )

corrplot( correlations, method = 'number', order = 'hclust' )

Here we can see now that none of the correlations now are greater than 0.5 . For modelling we might want to reduce the threshold even further. 0.5 Is still a pretty high correlation factor.

2 Soybeans

Here we have different attributes of different soybean plants. The response variable here is Class and most of the predictors are factor variables.

data(Soybean)

data = Soybean %>%
  as_tibble()

summary(data)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 

2.1 Investigate Missing Data

Amelia::missmap( data )

2.2 Investigate Near zero variance variables (degenerate variables)

Variables with near zero variance are pretty useless for modelling since when splitting the data into test and validation tests we always have a large chance that we end up with no occurrences of a specific class insideo one of the splits.

We define a variable as degenerate if either is true

  • The ratio of the most common value to the second most common value is less than 95:5
  • The percentage of distinct values out of the number of total samples is less than 10
summary( data[, caret::nearZeroVar(data) ] )
##  leaf.mild  mycelium   sclerotia 
##  0   :535   0   :639   0   :625  
##  1   : 20   1   :  6   1   : 20  
##  2   : 20   NA's: 38   NA's: 38  
##  NA's:108

2.3 Recipes

rec = recipe(data, Class~.)

2.3.1 Impute the missing data

There are several techniques on how to deal with missing data. We can either drop the observations that have incomplete values or we drop the variables with incomplete mesaurements. Looking back at the missingness map that means for this dataset we have to drop almost all of the variables or one third of all observations. So we have can either assume that all missing values carry mean or median values or more elegantly we try to impute the missing values based on similar observations that cluster in the same way.

rec = rec %>%
  step_knnimpute( all_predictors() )

2.3.2 Remove the degenerated variables

rec = rec %>%
  step_nzv( all_predictors(), options = list(freq_cut = 95/5, unique_cut = 10 ) )

2.3.3 Prepare the recipe

prep_rec = prep( rec, training = data )


prep_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         35
## 
## Training data contained 683 data points and 121 incomplete rows. 
## 
## Steps:
## 
## 5-nearest neighbor imputation for plant.stand, precip, temp, ... [trained]
## Sparse, unbalanced variable filter removed mycelium [trained]

Note that one degenerate variable was not removed. Probably the imputation balanced it out.

2.3.4 Apply the recipe

data_trans = bake( prep_rec, data )

2.4 Validate the effect of the transformations

2.4.1 Degenerate Variables

summary( data[, caret::nearZeroVar(data, freqCut = 96/5) ] )
##  leaf.mild  mycelium   sclerotia 
##  0   :535   0   :639   0   :625  
##  1   : 20   1   :  6   1   : 20  
##  2   : 20   NA's: 38   NA's: 38  
##  NA's:108
summary( data_trans[, caret::nearZeroVar(data_trans, freqCut = 95/5) ] )
##  sclerotia
##  0:650    
##  1: 33

Indeed it is as we suspected the imputation of the missing values has balanced out the other two variables.

2.4.2 Missing Values

Amelia::missmap(data_trans)