Dr. Chukwuebuka Ogwo
2/5/2026
Introduction
Model performance can be highly dependent on how input predictors are encoded.
Sensitivity to different encodings varies across model types: tree-based methods are notably insensitive, whereas linear regression is not.
Predictor encoding, called feature engineering, is done as a data preprocessing step prior to model fitting.
Feature engineering generally includes additions, deletions, and transformations of training set data.
Typically, there are many different encodings from which to choose.
Optimal feature engineering depends on the (1) model type and (2) true relationship between the predictors and outcome.
The most effective approaches are often informed by scientific understanding of the problem rather than by purely algorithmic processing.
The fit() and resample() functions in
the MachineShop R package support model specification with recipe
objects supplied by the recipes package (Kuhn and Wickham,
2020).
Recipes define the predictor and outcome variables to be modeled, the training data, and the preprocessing steps to be applied to them; their general syntax is shown below.
Recipe Syntax
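A minimal, self-contained illustration of this general syntax is given below; the data frame df and the variables y, x1, and x2 are placeholders rather than study data.
library(recipes)
### General recipe syntax with placeholder data and steps
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = runif(10))
rec <- recipe(y ~ x1 + x2, data = df) %>%               # outcome ~ predictors, with training data
  step_center(all_numeric(), -all_outcomes()) %>%       # steps are applied in order of appearance
  step_scale(all_numeric(), -all_outcomes())
summary(rec)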
Recipe Elements
- recipe(): defines the ingredients with which to create a recipe for preprocessing data.
- formula: model formula of outcome and predictor variables. Dots are allowed; in-line functions are not.
- data: data frame containing the variables.
- %>%: forward pipe operator for adding preprocessing steps to the recipe.
- step_*(): preprocessing functions to be applied to the data in their order of appearance.
- ...: variables to which to apply the preprocessing. They may be specified using one or more of their data frame names or with the selector functions below.
  - starts_with(), ends_with(), contains(), matches(), num_range(), all_of(), any_of(), everything()
  - all_predictors(), all_outcomes(), has_role()
  - all_numeric(), all_nominal(), has_type()

| Syntax | Types* | Description |
|---|---|---|
| recipe(formula, data) | - | Create a recipe for preprocessing data |
| check_missing() | A | Check for missing values |
| step_factor2string() | F | Convert factors to strings |
| step_intercept() | - | Add intercept column |
| step_naomit() | A | Remove cases with missing values |
| step_novel() | F | Simple value assignments for novel factor levels |
| step_num2factor() | N | Convert numbers to factors |
| step_ordinalscore() | F | Convert ordinal factors to numeric scores |
| step_string2factor() | C | Convert strings to factors |
| step_unorder() | F | Convert ordered to unordered factors |
| summary() | A | Summarize a recipe |

*Types of variables to which a function applies: A = all, C = character, F = factor, N = numeric.
In the following example code, recipe Caries_recipe
is created with the Caries data from the IFS Study.
A step is added to check for missing values in the outcome
variable AdjDFS_I and, if present, return an error message
when the data are processed.
Missing values in the outcome will cause problems with model fitting.
An alternative to excluding such observations is to impute their values, if keeping them in the analysis is desired.
Recall that some of the variables are stored as numeric in the original dataset.
The role_case() step function in the recipe is from the MachineShop package and is included to designate AdjDFS_I as a stratification variable in calls to resample() for resampled estimation of prediction performance.
Some variables whose values represent categorical levels are
converted in the recipe to nominal and ordinal factors with the
step_num2factor() function.
The conversion requires that the values be consecutive integers
starting at 1 or be transformed to such with an appropriate function
supplied to the transform argument.
Finally, a step to remove cases with missing predictor values is added to create recipe Caries_recipe_naomit.
library(MachineShop)
library(recipes)
## Load data and libraries
setwd("/Users/damia/OneDrive - Harvard University/HARVARD HSDM/CREATING NEW COURSE")
#setwd("/Users/cho379/OneDrive-Harvard University/HARVARD HSDM/CREATING NEW COURSE")
Caries <- read.csv("ML_Week 1/Caries_data.csv")
## Dataset without preprocessing
train_indices <- sample(nrow(Caries), nrow(Caries) * 2 / 3)
trainset <- Caries[train_indices, ]
testset <- Caries[-train_indices, ]
## Global resample control for comparability of results
control <- CVControl(folds = 10, seed = 123)
Caries_recipe <- recipe(AdjDFS_I ~ MotherEduc + Income + Female1 + Total_mgF + Homeppm + Brushingfreq +
                          Waterbase + Milk + Juice100 + SSB, data = trainset) %>%
  role_case(stratum = AdjDFS_I) %>%
  check_missing(AdjDFS_I) %>%
  step_num2factor(MotherEduc, transform = function(x) x + 1,
                  levels = c("1", "2", "3", "4", "5", "6"), ordered = TRUE) %>%
  step_num2factor(Income, levels = c("Lowest", "Low", "Medium", "High", "Highest"), ordered = TRUE) %>%
  step_num2factor(Female1, transform = function(x) x + 1, levels = c("M", "F"))
## Remove cases with missing predictor values
Caries_recipe_naomit <- Caries_recipe %>%
  step_naomit(all_predictors())
summary(prep(Caries_recipe))
juice(prep(Caries_recipe))
Dummy Variables
| Income | N | Income_Low | Income_Medium | Income_High | Income_Highest |
|---|---|---|---|---|---|
| Lowest | 4 | 0 | 0 | 0 | 0 |
| Low | 18 | 1 | 0 | 0 | 0 |
| Medium | 18 | 0 | 1 | 0 | 0 |
| High | 16 | 0 | 0 | 1 | 0 |
| Highest | 52 | 0 | 0 | 0 | 1 |
Dummy Variable Recipe Functions
| Syntax | Types | Description |
|---|---|---|
| step_bin2factor() | N | Create factors from dummy variables |
| step_dummy() | F | Dummy variable creation |
| step_regex() | F | Create dummy variables with regular expressions |
| dummy_names() | - | Naming tools |
A recipe step is added to create dummy variables for all nominal (and ordinal) factors in the training data.
At this point, a complete case analysis is being performed by extending the Caries_recipe_naomit recipe, as sketched in the code below.
The step_dummy() function excludes the first dummy
variable in each set by default.
Dummy variables are 0/1 indicators for unordered factors and polynomial contrasts for ordered factors.
Specification of all_nominal() and
-all_outcomes() ensures that only factor variables among
the predictors are processed.
Dummy variables are not needed for the outcome variable.
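A minimal sketch of this step, extending Caries_recipe_naomit and naming the result Car_rec to match the fitting code that follows:
### Dummy variables for nominal and ordinal factor predictors
Car_rec <- Caries_recipe_naomit %>%
  step_dummy(all_nominal(), -all_outcomes())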
fit() and resample() functions
An unprocessed recipe can be passed to the fit() and resample() functions.
The fit() function processes the recipe on the full
dataset and resample() on each resampled dataset of the
resampling algorithm.
Observed and predicted outcomes on new (unprocessed) data can be
obtained with the predict() and response()
functions and compared with performance().
### Model fitting and resampling
modelFit <- fit(Car_rec, model = GLMModel)
res <- resample(Car_rec, model = GLMModel, control = control)
summary(res)
### Prediction and performance on new data
obs <- response(modelFit, newdata = testset)
pred <- predict(modelFit, newdata = testset)
performance(obs, pred)
There may be advantages to removing some predictors prior to modeling.
Fewer predictors require less computational time and lead to more interpretable models.
Removal of weakly or non-informative variables can improve model stability and performance.
Two common motivations for removal, discussed below, are degenerate (zero or near-zero variance) predictor distributions and multicollinearity among predictors.
Preprocessing steps to remove variables are referred to as filtering.
There must be variability in the values of a predictor in order for it to have an association with the outcome.
Predictors that have a single value, and hence zero variance, are said to have degenerate distributions.
Degenerate predictors should be removed since they have no predictive ability and can cause computational errors in some models.
Similarly, the removal of predictors with near-zero variance may also be advantageous.
This is not always the case though; e.g., in genetics studies of rare diseases, rare genetic variants may be of primary interest as predictors.
One filtering approach for near-zero variance predictors is as follows.
Near-Zero Variance Filter Algorithm
Remove predictors for which (1) the percentage of unique values is below a specified threshold and (2) the ratio of the frequency of the most common value to that of the second most common value exceeds a specified threshold.
The algorithm can be applied to numeric or categorical predictors.
Zero variance is an extreme case of near-zero variance.
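These criteria can be sketched directly in R; the following is an illustration only (the helper name near_zero_var is hypothetical), with cutoffs chosen to mirror the step_nzv() defaults described below.
### Illustrative near-zero variance check (not the recipes implementation)
near_zero_var <- function(x, unique_cut = 10, freq_cut = 95 / 5) {
  sapply(x, function(v) {
    tab <- sort(table(v), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) tab[1] / tab[2] else Inf  # most / second most common value
    pct_unique <- 100 * length(unique(v)) / length(v)            # percent of unique values
    pct_unique < unique_cut && freq_ratio > freq_cut
  })
}
### Example: flag near-zero variance variables in the training set
# names(trainset)[near_zero_var(trainset)]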
Multicollinearity is a condition in which there are high degrees of correlation among multiple predictors.
High correlation suggests overlap in information provided by the predictors which can lead to unnecessarily complex models.
Moreover, for some models, multicollinearity can lead to unstable parameter estimates, numerical errors, and degraded prediction performance.
One remedy is to remove the minimum number of predictors necessary to ensure that all pairwise correlations are below a specified threshold.
High Correlation Filter Algorithm
1. Calculate the pairwise correlation matrix for the predictors.
2. Identify the two predictors, say x and x′, with the largest absolute correlation.
3. Compute the average correlation between each of x and x′ and the other predictors.
4. Remove the one with the largest average correlation.
5. Repeat the previous steps until no absolute correlations are above the threshold.
As typically implemented, this algorithm requires numeric predictors for the calculation of pairwise correlations.
Different types of correlation could be computed, including the traditional Pearson correlation as well as Spearman and Kendall rank correlations.
Removing multicollinearity may significantly improve model performance.
However, overlapping information due to non-linear associations between predictors will not necessarily be addressed.
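The algorithm can also be sketched directly in R for numeric predictors; this is an illustration only (the helper name high_corr_filter is hypothetical), and step_corr() below provides the recipes implementation.
### Illustrative high correlation filter (not the recipes implementation)
high_corr_filter <- function(x, threshold = 0.90, method = "pearson") {
  repeat {
    cors <- abs(cor(x, method = method))
    diag(cors) <- 0
    if (max(cors) <= threshold) break                                # stop when no pair exceeds the threshold
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]            # most highly correlated pair
    worst <- pair[which.max(rowMeans(cors[pair, , drop = FALSE]))]   # larger average absolute correlation
    x <- x[, -worst, drop = FALSE]                                   # remove that predictor
  }
  x
}
### Example with a few of the numeric predictors
# filtered <- high_corr_filter(trainset[, c("Total_mgF", "Homeppm", "SSB")], threshold = 0.90)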
Clustering predictors and extracting the medoids with the
MachineShop step_kmedoids() function is another example of
filtering.
Filter Recipe Functions
| Category | Syntax | Types | Description |
|---|---|---|---|
| Degenerate Distribution | step_nzv() | FN | Near-zero variance filter |
| | step_zv() | FN | Zero variance filter |
| Multicollinearity | step_corr() | N | High correlation filter |
| | step_lincomb() | N | Linear combination filter |
| Attributes | step_rm() | A | General variable filter |
A near-zero variance filter on all predictors is added in the first recipe.
The step_nzv() function has default thresholds of 10% for the percent of unique values and 95/5 for the ratio of the most common to the second most common value.
The second recipe adds a pairwise correlation filter on all numeric predictors and sets the absolute correlation threshold at 0.90.
Dummy variables are created after filtering in each case.
### Near-zero variance filter
Car_rec1 <- Caries_recipe_naomit %>%
step_nzv(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes())
### High correlation filter
Car_rec2 <- Caries_recipe_naomit %>%
step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%
step_dummy(all_nominal(), -all_outcomes())
In dental public health research, some values of the predictors may be missing.
Missing values may be structural or indeterminable at the time of model building.
Distinctions are made between types of missing data.
Missing completely at random (MCAR): the missingness of data is unrelated to any study variable; data are rarely MCAR.
Missing at random (MAR): missingness can be fully accounted for by non-missing variables.
Missing not at random (MNAR): missing data that are neither MAR nor MCAR; also known as nonignorable nonresponse.
Do not confuse missing data with censored data.
Approaches for dealing with missing data include removal of incomplete cases and imputation of missing values; several imputation methods are described below.
Imputation has been studied extensively in the context of statistical inference.
However, imputation for predictive modeling is a separate issue in which prediction accuracy, rather than valid statistical inference, is the goal.
Mean imputation simply replaces missing values in a numeric variable with the mean of its observed values.
Nominal predictors can be handled by substituting their modes.
Addition of a categorical level for missing values is another approach, but it is not advisable due to the potential for bias (Jones, 1996).
Advantages: simple and fast to compute.
Disadvantages: assumes MCAR; attenuates associations among variables.
Mean and mode imputation are not generally recommended for multivariate analyses.
This imputation method replaces missing values in a numeric or nominal variable with predictions from a bagged tree fit with the remaining variables as predictors.
Prediction from bagged trees is an aggregation over decision trees individually fit to resampled training datasets.
Bagged trees reduce prediction variance and accommodate mixtures of numeric and nominal predictors.
Advantages: accommodates mixtures of numeric and nominal predictors; reduced prediction variance relative to a single tree.
Disadvantages: assumes MAR; computationally slow; more tuning parameters.
An effective machine learning tool, although it has significant computational and tuning costs.
This method replaces each missing value in a numeric or nominal variable with the mean or mode, respectively, of the K closest neighbors.
Closeness is relative to the remaining variables and can be computed on mixtures of numeric and nominal variables with Gower's distance (Gower, 1971).
Advantages: accommodates mixtures of numeric and nominal variables via Gower's distance.
Disadvantages: assumes MAR; results depend on the choice of K.
Overall, an attractive choice for machine learning.
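As an illustration of the distance calculation, the sketch below computes Gower's distance between two observations with one numeric and one nominal variable; the helper gower_dist, the observation values, and the range used for scaling are hypothetical.
### Gower's distance: range-scaled differences for numeric variables and
### simple mismatch (0/1) for nominal variables, averaged across variables
gower_dist <- function(a, b, ranges) {
  d <- mapply(function(x, y, r) {
    if (is.numeric(x)) abs(x - y) / r  # numeric contribution
    else as.numeric(x != y)            # nominal contribution
  }, a, b, ranges)
  mean(d)
}
### Hypothetical example
obs1 <- list(Total_mgF = 0.8, Income = "Low")
obs2 <- list(Total_mgF = 0.3, Income = "High")
gower_dist(obs1, obs2, ranges = list(Total_mgF = 2.0, Income = NA))  # (0.25 + 1) / 2 = 0.625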
Imputation Recipe Functions
| Syntax | Types | Description |
|---|---|---|
| step_bagimpute() | FN | Imputation via bagged trees |
| step_knnimpute() | FN | Imputation via K-nearest neighbors |
| step_lowerimpute() | N | Impute numeric data below the threshold of measurement |
| step_meanimpute() | N | Impute numeric data using the mean |
| step_modeimpute() | F | Impute nominal data using the most common value |
Note that the outcome variable is not excluded from the imputations.
The recipe is designed so that the outcome is required to have no missing values.
By default, all predictors are utilized in the bagged tree and K-nearest neighbors imputations.
K-nearest neighbors is being conducted with K = 5.
Dummy variables are created at the end of each recipe so that imputation is performed on the original categorical levels.
### Imputation using mean or mode
Car_rec3 <- Caries_recipe %>%
step_meanimpute(all_numeric()) %>%
step_modeimpute(all_nominal()) %>%
step_dummy(all_nominal(), -all_outcomes())
res_meanimpute <- resample(Car_rec3, model = GLMModel, control = control)
### Imputation via bagged trees
Car_rec4 <- Caries_recipe %>%
step_bagimpute(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes())
res_bagimpute <- resample(Car_rec4, model = GLMModel, control = control)
### Imputation via K-nearest neighbors
Car_rec5 <- Caries_recipe %>%
step_knnimpute(all_predictors(), neighbors = 5) %>%  # K = 5 nearest neighbors
step_dummy(all_nominal(), -all_outcomes())
res_knnimpute <- resample(Car_rec5, model = GLMModel, control = control)
### Imputation with predictor-specific methods
Car_rec6 <- Caries_recipe %>%
step_modeimpute(all_nominal()) %>%
step_meanimpute(Total_mgF) %>%
step_knnimpute(MotherEduc, Income, Female1, Homeppm, Brushingfreq, Waterbase, Milk, Juice100, SSB) %>%
step_dummy(all_nominal(), -all_outcomes())
res_customimpute <- resample(Car_rec6, model = GLMModel, control = control)
### Compare method performances
res <- c("mean" = res_meanimpute, "bag" = res_bagimpute, "knn" = res_knnimpute, "custom" = res_customimpute)
summary(res)
Data Transformations for Individual Predictors I
Transformations of individual predictors may be desired or needed for planned modeling approaches.
A few types of transformations are summarized below.
Centering and Scaling: improves numerical stability of some calculations and performance of some models.
step_center() is used to subtract the mean from numeric
predictors.
step_scale() is used to standardize the spread
(variance) of numeric predictors by dividing each variable by its
standard deviation.
step_normalize() performs both centering and scaling (z-score standardization) in a single step.
Skewness: may need to be reduced to deemphasize the effects of observations in the tails of their distributions.
Basis Expansions: may allow for more flexible modeling of predictor effects.
Univariate Transformations: can be applied based on subject-matter knowledge or modeling strategies.
| Category | Syntax | Types | Description |
|---|---|---|---|
| Centering and Scaling | step_center() | N | Centering numeric data |
| | step_scale() | N | Scaling numeric data |
| | step_normalize() | N | Centering and scaling numeric data |
| | step_range() | N | Scaling numeric data to a specific range |
| Skewness | step_BoxCox() | N | Box-Cox transformation (non-negative data) |
| | step_YeoJohnson() | N | Yeo-Johnson transformation |
| Basis Expansions | step_bs() / step_ns() | N | Spline basis functions |
| | step_poly() | N | Orthogonal polynomial basis functions |
| Univariate Transformations | step_discretize() | N | Discretize numeric variables |
| | step_hyperbolic() | N | Hyperbolic transformation |
| | step_invlogit() | N | Inverse logit transformation |
| | step_log() | N | Log transformation |
| | step_logit() | N | Logit transformation |
| | step_other() | F | Collapse some categorical levels |
| | step_relu() | N | Apply rectified linear transformation |
| | step_sqrt() | N | Square root transformation |
| | step_window() | N | Moving window functions |
Recipes are given to illustrate transformations of individual predictors.
Yeo-Johnson transformations are applied to make the numeric variable distributions more symmetric.
Finally, in the last two recipes, centering and scaling of the numeric variables are applied before and after ordinal scoring of MotherEduc, respectively.
# Add one of the imputation methods to the base recipe
Car_rec_impute <- Caries_recipe %>% step_knnimpute(all_predictors())
### Yeo-Johnson transformation
Car_rec7 <- Car_rec_impute %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
### Centering and scaling (before and after ordinal scoring)
Car_rec8 <- Car_rec_impute %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_ordinalscore(MotherEduc) %>%
step_dummy(all_nominal(), -all_outcomes())
Car_rec9 <- Car_rec_impute %>%
step_ordinalscore(MotherEduc) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
Transformations on groups of predictors can help to resolve outliers, reduce the dimensionality of data, or account for interactions.
Data entry errors are often the cause of outliers and should be checked for as a first step.
Removal of outlying subjects is not recommended unless they are confirmed to come from a population for which prediction is not of interest.
Some models are less affected by outliers.
Otherwise, transformations such as spatial sign are available to draw outliers closer to the rest of the observations.
To reduce dimensionality, methods such as principal components analysis (PCA) can be employed to generate a smaller set of predictors that capture much of the original information.
There may be interest in modeling interaction terms to explicitly account for multiplicative effects of predictors.
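For example, an interaction between two of the numeric predictors could be added with the step_interact() function; the pairing of Total_mgF and Homeppm below is an arbitrary illustration built on the Car_rec_impute recipe defined earlier.
### Interaction between two numeric predictors (illustrative pairing)
Car_rec_int <- Car_rec_impute %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_interact(terms = ~ Total_mgF:Homeppm)  # creates the term Total_mgF_x_Homeppm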
Recipe functions for transformations of multiple predictors are listed below:
Multivariate Transformation Recipe Functions
| Category | Syntax | Types | Description |
|---|---|---|---|
| Outliers | step_spatialsign() | N | Spatial sign preprocessing |
| Dimension Reduction | step_ica() | N | ICA signal extraction |
| | step_isomap() | N | Isomap embedding |
| | step_kpca() | N | Kernel PCA signal extraction |
| | step_pca() | N | PCA signal extraction |
| Interaction | step_interact() | N | Create interaction variables |
| | step_ratio() | N | Ratio variable creation |
In the syntax below, PCA is applied to reduce the number of predictors to a smaller set.
Dummy variables are created beforehand so that categorical variables are included in the PCA.
Centering and scaling are not performed by the
step_pca() function by default and thus are explicitly
included as recipe steps.
We set the threshold at 0.75, which means that the recipe will retain enough components to capture 75% of the variance in the variables.
Alternatively, there is a num argument in the PCA
step function that can be used to specify the exact number of components
to retain (default: 5).
Clustering and averaging of predictors within clusters with the
MachineShop step_kmeans() function is another example of
dimension reduction.
### Data reduction with principal components
Car_rec10 <- Car_rec_impute %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_pca(all_numeric(), -all_outcomes(), threshold = 0.75)
res2 <- resample(Car_rec10, model = GLMModel, control = control)
summary(res2)
Binning or categorization is the process of turning a continuous variable into a categorical one.
While binning may aid in interpretability, there are considerable downsides in the context of predictive modeling.
Avoid manual binning of variables!
Data preprocessing can have a large impact on the final performance of a predictive model.
Preprocessing steps should be included in resampling algorithms to obtain proper estimates of prediction performance.
The optimal types and combination (recipe) of preprocessing steps depend on the scientific problem to be addressed and the models to be fit.
Different recipes can be implemented and compared with respect to their impacts on prediction performance.
This can be accomplished with recipe step functions provided by the recipes and MachineShop R packages.
Some custom step functions may also be implemented to extend the packages and develop novel preprocessing approaches.
Gower JC (1971). A general coefficient of similarity and some of its properties. Biometrics 27: 857-871.
Jones MP (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association 91: 222-230.
Kuhn M and Johnson K (2013). Applied Predictive Modeling, Chapter 3. New York: Springer.
Kuhn M and Wickham H (2020). recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes
van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45(3): 1-67. http://www.jstatsoft.org/v45/i03/