Data Pre-processing

Applied Predictive Modeling
By Max Kuhn and Kjell Johnson
ISBN: 9781461468486
http://appliedpredictivemodeling.com/

Chapter 3 Data Pre-Processing

Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. Data preparation can make or break a model's predictive ability.

  • Transformations
  • Missing Values
  • Removing Predictors
  • Adding Predictors
  • Binning Predictors

Dataset

New York Air Quality Measurements

Daily air quality measurements in New York, May to September 1973. Base R dataset with missing data.

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Principles

Principles

  • The need for data pre-processing is determined by the type of model being used.
    • Tree-based models, for example, are insensitive to the characteristics of the predictors.
    • Linear regression models, for example, are sensitive to the characteristics of the predictors.
  • Transformations can lead to significant improvements in performance. For Example:
    • Centering: \(x_{i}-\bar{x}\)
    • Scaling: \({ \left( x_{ i }-\bar { x } \right) }/{ \sigma }\)
  • Modifying predictors based on their lack of information content can be effective. For Example:
    • Deletion: data <- data[complete.cases(data), ]
    • Imputation: x[is.na(x)] <- mean(x, na.rm = TRUE), applied to each predictor x
  • Combinations of predictors can be more effective than using the individual values. For Example:
    • Ratios: \(\mathbf{x}_i / \mathbf{x}_j\)
    • Sums: \(\mathbf{x}_i + \mathbf{x}_j\)

Transformations

Centering and Scaling - Theory

These techniques improve numerical stability of some calculations, but may impact interpretability.

Centering data such that \(x_{ i }-\bar { x }\) coerces the predictor to have a zero mean.

\[\textrm{Sample Mean}=\bar { x } =\frac { 1 }{ n } \sum _{ i=1 }^{ n }{ x_{ i } } \]

Scaling data through standardization such that \({ \left( x_{ i }-\bar { x } \right) }/{ s }\) coerces the values to have a common standard deviation of one. There are other types of scaling as well, such as rescaling to the unit interval, \({ \left[ x_{ i }-\min { \left( x \right) } \right] }/{ { \left[ \max { \left( x \right) } -\min { \left( x \right) } \right] } }\), and mean normalization, \({ \left[ x_{ i }-\bar { x } \right] }/{ { \left[ \max { \left( x \right) } -\min { \left( x \right) } \right] } }\).

\[\textrm{Sample Standard Deviation}=s=\sqrt { \frac { 1 }{ n-1 } \sum _{ i=1 }^{ n }{ { \left( x_{ i }-\bar { x } \right) }^{ 2 } } }\]
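
As a minimal base R sketch (not in the original text), the two alternative scalings above can be computed for a single predictor; Wind is used purely as an illustration.

rescale01 <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))  # rescaling to [0, 1]
meannorm <- function(x) (x - mean(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))  # mean normalization
summary(rescale01(airquality$Wind))
summary(meannorm(airquality$Wind))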

Centering and Scaling - Code

centered <- scaled <- airquality
for (i in 1:ncol(airquality)) {
  centered[, i] <- airquality[, i] - mean(airquality[, i], na.rm = T)
  scaled[, i] <- centered[, i] / sd(airquality[, i], na.rm = T)
}

Sample Means equal zero after centering and remain zero after scaling (rounded to avoid exponential notation).

round(apply(scaled, 2, mean, na.rm = T), 10)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

Sample Standard Deviations equal one after scaling.

apply(scaled, 2, sd, na.rm = T)
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       1       1       1       1       1       1
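
The same centering and scaling can also be done in one step with caret's preProcess function (a sketch; caret is not loaded until later in this document, where it is used for nearZeroVar):

library(caret)
pp <- preProcess(airquality, method = c("center", "scale"))  # learn the column means and standard deviations
standardized <- predict(pp, airquality)                      # apply them to the data
apply(standardized, 2, sd, na.rm = TRUE)                     # each column should again have sd = 1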

Resolving Skewness - Theory

Significant skew (asymmetry in the data) is present when the absolute value of the sample skewness exceeds the standard error of skewness multiplied by the critical value for the desired level of \(\alpha\).

\[\textrm{Sample Skewness}=\frac { \frac { 1 }{ n-1 } \sum _{ i=1 }^{ n }{ { \left( x_{ i }-\bar { x } \right) }^{ 3 } } }{ { \left[ \frac { 1 }{ n-1 } \sum _{ i=1 }^{ n }{ { \left( x_{ i }-\bar { x } \right) }^{ 2 } } \right] }^{ { 3 }/{ 2 } } }\]

\[\textrm{Standard Error of Skewness}=\sqrt { \frac { 6n\left( n-1 \right) }{ \left( n-2 \right) \left( n+1 \right) \left( n+3 \right) } }\]

Box and Cox proposed a family of transformations that are indexed by the parameter \(\lambda\). Using the training data, the transformation parameter \(\lambda\) is determined using maximum likelihood estimation. The procedure is applied independently to each predictor. Box-Cox transformation results in data that are better behaved than the data in its natural units. Some typical Box-Cox transformations are: \[-2\Rightarrow \frac { 1 }{ Y^{ 2 } }, \quad -1\Rightarrow \frac { 1 }{ Y }, \quad -0.5\Rightarrow \frac { 1 }{ \sqrt { Y } }, \quad 0\Rightarrow \log { \left( Y \right) }, \quad 0.5\Rightarrow \sqrt { Y }, \quad 1\Rightarrow Y, \quad 2\Rightarrow { Y }^{ 2 }\]
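
For a single predictor, the estimation of \(\lambda\) can also be illustrated with caret's BoxCoxTrans function; this is a sketch and not part of the original analysis, which uses the car package below to handle all predictors at once.

library(caret)
bct <- BoxCoxTrans(na.omit(airquality$Ozone))      # estimate lambda for Ozone by maximum likelihood
bct                                                # prints the estimated lambda
ozoneT <- predict(bct, na.omit(airquality$Ozone))  # apply the transformation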

Resolving Skewness - Code

library(moments)
skew <- skewness(airquality, na.rm = T)                # sample skewness of each column
n <- colSums(apply(airquality, 2, complete.cases))     # number of non-missing values per column
ses <- sqrt((6*n*(n-1)) / ((n-2)*(n+1)*(n+3)))         # standard error of skewness
critical <- qt(0.05, df=(n-1), lower.tail=F)           # one-sided critical value at alpha = 0.05
data.frame(skew, n, ses, sig=abs(skew)>critical*ses)   # flag predictors with significant skew
##                 skew   n       ses   sig
## Ozone    1.225680663 116 0.2245612  TRUE
## Solar.R -0.423634197 146 0.2006795  TRUE
## Wind     0.344398467 153 0.1961246  TRUE
## Temp    -0.374169579 153 0.1961246  TRUE
## Month   -0.002367988 153 0.1961246 FALSE
## Day      0.002625783 153 0.1961246 FALSE
par(mfrow = c(2, 3))
for (i in 1:ncol(airquality)) {
  boxplot(airquality[ ,i], ylab = names(airquality[i]), horizontal=T,
          main = paste(names(airquality[i]), "Skew:", round(skew[i], 3)),
          col = ifelse(abs(skew[i])>critical[i]*ses[i], "red", "steelblue"))
}

library(car)
boxcox <- powerTransform(airquality, family="yjPower")  # estimate a transformation parameter for each predictor
coef(boxcox)
##     Ozone   Solar.R      Wind      Temp     Month       Day 
## 0.2236305 0.9604316 0.3077159 1.8375454 0.9133494 0.7605748
transformed <- bcPower(airquality, boxcox$lambda)
skewT <- skewness(transformed, na.rm = T)
data.frame(skewT, n, ses, sig=abs(skewT)>critical*ses)
##                    skewT   n       ses   sig
## Ozone^0.22    0.02682566 116 0.2245612 FALSE
## Solar.R^0.96 -0.45615611 146 0.2006795  TRUE
## Wind^0.31    -0.51315302 153 0.1961246  TRUE
## Temp^1.84    -0.15172099 153 0.1961246 FALSE
## Month^0.91   -0.02117506 153 0.1961246 FALSE
## Day^0.76     -0.21008206 153 0.1961246 FALSE
par(mfrow = c(2, 3))
for (i in 1:ncol(airquality)) {
  boxplot(transformed[ ,i], ylab = names(airquality[i]), horizontal=T,
          main = paste(names(airquality[i]), "Skew:", round(skewT[i], 3)),
          col = ifelse(abs(skewT[i])>critical[i]*ses[i], "red", "steelblue"))
}

Resolving Skewness - Remarks

What happened? The skewness in the variables Ozone and Temp was resolved, but the variables Solar.R and Wind are still skewed after Box-Cox transformation. There are two distinct and important things happening here:

Solar.R - When Box-Cox calculates \(\lambda\approx1\), there is almost no difference between the original data and the transformed data because \(x^1=x\cdot1=x\) (the multiplicative identity).

boxcox$lambda[2]
##   Solar.R 
## 0.9604316

Wind - A Box-Cox transformation can overcorrect skewness in data when the \(\lambda\) calculated is very small. This results in data becoming skewed in the opposite direction.

data.frame(boxcox$lambda[3], skew[3], skewT[3])
##      boxcox.lambda.3.   skew.3.  skewT.3.
## Wind        0.3077159 0.3443985 -0.513153

This does not mean, however, that a Box-Cox transformation on those variables is pointless. As long as \(\lambda\neq1\), the data are likely to be better behaved in models after the transformation than in their natural units.

Resolving Outliers

An outlier can be defined statistically as a point that falls more than 1.5 times the interquartile range above the third quartile or below the first quartile. Outliers can indicate scientifically invalid values, such as recording errors, but great care should be taken not to hastily remove or change values because the data may be scientifically valid:

  • Small Samples: Sign of skewed distribution just starting to be captured.
  • Subpopulation: Indication of an area just starting to be sampled.

Boxplots display data in quartiles. Values beyond the whiskers are considered outliers.

par(mfrow = c(2, 3))
for (i in 1:ncol(airquality)) {
  boxplot(airquality[ ,i], ylab = names(airquality[i]), horizontal=T,
          main = names(airquality[i]), col = "grey")
}

There are several predictive models that are resistant to outliers. If a model is considered to be sensitive to outliers, one data transformation that can minimize the problem is the spatial sign. This procedure projects the predictor values onto a multidimensional sphere. This has the effect of making all the samples the same distance from the center of the sphere. Unlike centering or scaling, this manipulation of the predictors transforms them as a group. Removing predictor variables after applying the spatial sign transformation may be problematic.
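
A minimal sketch of the spatial sign transformation, assuming caret's spatialSign function (not used in the original text); the predictors are centered and scaled first, and complete cases are used because the transformation operates on the full predictor vector of each sample.

library(caret)
aq <- na.omit(airquality)
aqCS <- scale(aq)           # center and scale the predictors first
aqSS <- spatialSign(aqCS)   # project each sample onto the surface of a unit sphere
head(rowSums(aqSS^2))       # every sample is now the same (unit) distance from the origin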

Dimensionality Reduction - Theory

Dimensionality (Data) Reduction techniques are another class of predictor transformations. These methods reduce the data by generating a smaller set of predictors (Feature Extraction) that seek to capture a majority of the information in the original variables. For most data reduction techniques, the new predictors are functions of the original predictors; therefore, all the original predictors are still needed to create the surrogate variables.

Some predictive models prefer predictors to be uncorrelated (or at least low correlation) in order to find solutions and to improve the model's numerical stability.

Principal Component Analysis (PCA) is a commonly used data reduction technique that extracts features which are the linear combinations of the predictors that maximize variability. PCA is "an orthogonal transformation" that converts "correlated variables into a set of values of linearly uncorrelated variables called principal components." Skew, distributional differences, and the scale of the predictors can all impact PCA. It is therefore best to first transform, and then center and scale, the predictors before performing PCA.
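
As a side note, caret's preProcess can chain these steps (transform, center, scale, then PCA) in a single call; a minimal sketch using the default threshold of 95% of the variance, applied to complete cases only.

library(caret)
pp <- preProcess(na.omit(airquality), method = c("BoxCox", "center", "scale", "pca"))
head(predict(pp, na.omit(airquality)))  # returns the principal component scores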

Dimensionality Reduction - Code

Principal Component Analysis (PCA)

PCA <- function(X) {
  Xpca <- prcomp(na.omit(X), center = T, scale. = T)  # PCA on centered and scaled complete cases
  M <- as.matrix(na.omit(X)); R <- as.matrix(Xpca$rotation); score <- M %*% R  # project the complete-case data onto the loadings
  print(list("Importance of Components" = summary(Xpca)$importance[ ,1:5], 
             "Rotation (Variable Loadings)" = Xpca$rotation[ ,1:5],
             "Correlation between X and PC" = cor(na.omit(X), score)[ ,1:5]))
  par(mfrow=c(1,2))
  labels <- as.character(1:length(Xpca$sdev^2))
  barplot(Xpca$sdev^2, ylab = "Component Variance", names.arg=labels)  # scree plot of component variances
  labels <- substr(names(airquality), 1, 1)
  barplot(abs(cor(na.omit(X), score)[ ,1]), ylab = "Absolute Correlation", names.arg=labels)  # |correlation| of each predictor with PC1
}
PCA(transformed)
PCA(transformed)
## $`Importance of Components`
##                             PC1      PC2       PC3       PC4       PC5
## Standard deviation     1.606746 1.054987 0.9930061 0.8574583 0.6359942
## Proportion of Variance 0.430270 0.185500 0.1643400 0.1225400 0.0674100
## Cumulative Proportion  0.430270 0.615770 0.7801200 0.9026500 0.9700700
## 
## $`Rotation (Variable Loadings)`
##                     PC1         PC2          PC3         PC4          PC5
## Ozone^0.22   -0.5644851  0.14487458 -0.118700936 -0.11957736 -0.267710106
## Solar.R^0.96 -0.2915797  0.65497916 -0.142607860  0.54831822  0.378999660
## Wind^0.31     0.4686873  0.09203426 -0.010851246  0.60690785 -0.596765926
## Temp^1.84    -0.5509748 -0.11508694 -0.002671725  0.08604586 -0.576451598
## Month^0.91   -0.2508352 -0.71506016  0.023402882  0.55302454  0.310183525
## Day^0.76      0.1008888 -0.13033538 -0.982293492 -0.05891699 -0.007122034
## 
## $`Correlation between X and PC`
##                     PC1         PC2         PC3        PC4        PC5
## Ozone^0.22   -0.7817090 -0.18431288 -0.45651790  0.7102593 -0.7243605
## Solar.R^0.96 -0.3883330  0.68016542 -0.91942347  0.8597371 -0.1625977
## Wind^0.31     0.5217296  0.27704480  0.14151609 -0.3692439  0.5203658
## Temp^1.84    -0.9948390 -0.50143000 -0.31430751  0.7399805 -0.9910374
## Month^0.91   -0.3740879 -0.37921870  0.03399349  0.1719521 -0.4155701
## Day^0.76      0.1229613  0.02948716 -0.33090939 -0.1072570  0.1150340

Dimensionality Reduction - Remarks

The output shows that Principal Component 1 captures nearly half of the variance in the data at 43%.

Loadings close to zero indicate that a predictor variable did not contribute much to that component.

The correlation matrix between the original variables and the principal components indicates that the majority of the variance in the data is coming from (in order of significance) the transformed, centered, and scaled Temp and Ozone variables. To a much lesser extent, Wind is also impacting the variance.

Missing Values

Missing Values - Theory

It is important to understand why values are missing and not to confuse missing data with censored data. With censored data the exact value is missing, but something is known about it, such as when a measurement limit prevents a precise reading and the value is known to be below or above that limit. Censoring can be taken into account formally by making assumptions about the censoring mechanism, or handled more simply by treating the data as missing or by using the censored value as the observed value.

  • Informative Missingness: The pattern of missing data is related to the outcome.
  • Structurally Missing: The number of children a man has given birth to.
  • Censored Data: The exact value is missing but something is known about its value.

There are three general approaches to handling missing data: removal, using robust models, and imputation. For large data sets, removal of samples based on missing values is not a problem, assuming that the missingness is not informative. In smaller data sets, there is a steep price in removing samples. A few predictive models, especially tree-based techniques, can specifically account for missing data. Imputing uses information in the training set predictors to estimate the values of other predictors. This amounts to a predictive model within a predictive model. This extra layer of models adds uncertainty.
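
One common alternative to the chained-equations approach used below is K-nearest-neighbor imputation, available through caret's preProcess (a sketch, not part of the original analysis; note that knnImpute also centers and scales the predictors because the neighbor distances require a common scale).

library(caret)
ppKNN <- preProcess(airquality, method = "knnImpute", k = 5)  # k = 5 neighbors (the caret default)
imputedKNN <- predict(ppKNN, airquality)
sum(is.na(imputedKNN))  # should be 0 after imputation (values are on the centered/scaled scale)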

Missing Values - Code

library(VIM)
library(mice)
aggr(airquality, prop = c(T, T), bars=T, numbers=T, sortVars=T)

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##     Ozone 0.24183007
##   Solar.R 0.04575163
##      Wind 0.00000000
##      Temp 0.00000000
##     Month 0.00000000
##       Day 0.00000000
MICE <- mice(airquality, method="pmm", printFlag=F, seed=624)
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)

## 
##  Variables sorted by number of missings: 
##  Variable Count
##     Ozone     0
##   Solar.R     0
##      Wind     0
##      Temp     0
##     Month     0
##       Day     0

Missing Values - Remarks

The visualizations produced by the aggr function in the VIM package show a bar chart with the proportion of missing data per variable as well as a grid with the proportion of missing data for variable combinations. The bar chart shows two predictor variables have missing values. The grid shows the combination of all predictors with 72.5% of data not missing. The remainder of the grid shows missing data for variable combinations with each row highlighting the missing values for the group of variables detailed in the x-axis.

The Multivariate Imputation by Chained Equations (MICE) method assumes values are missing at random. It is implemented by first imputing the missing data for all variables with a simple method, then removing the imputations for one variable, re-imputing that variable by regressing it on the others, repeating the remove-and-regress step for each of the other imputed variables, and cycling this process over the whole dataset; by default it produces \(m=5\) completed datasets.

MICE uses linear regression for imputation on these data because the variables are numeric (real-valued); Poisson regression is used for count data and logistic regression for categorical data. Here the imputation is carried out with the predictive mean matching (PMM) method, which "imputes missing values by means of the nearest-neighbor donor with distance based on the expected values of the missing variables conditional on the observed covariates."
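
A quick way to sanity-check the imputations (a sketch): the mids object returned by mice stores the imputed values for each variable, and any of the \(m\) completed datasets can be extracted with complete.

head(MICE$imp$Ozone)             # the m = 5 sets of imputed Ozone values
completed2 <- complete(MICE, 2)  # the second completed dataset
summary(completed2$Ozone)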

Removing Predictors

Removing Predictors

Removing predictors with problems can decrease computational time and model complexity. Some problems that arise with predictors are:

  • Degenerate Distributions: Zero or near-zero variance predictors can cripple models: \(\sigma_{ x_{ i } } \approx 0\). There can be a significant improvement in model performance and/or stability without those problematic variables.
  • Collinearity: Highly correlated predictors measure the same underlying information: \(\left| \rho _{ x_{ i },x_{ j } } \right| \to 1\). Redundant predictors frequently add more complexity to the model than the information they provide (see the findCorrelation sketch after the correlation matrix below).

library(caret)
nearZeroVar(airquality, names = TRUE, saveMetrics=T)
##         freqRatio percentUnique zeroVar   nzv
## Ozone    1.500000     43.790850   FALSE FALSE
## Solar.R  1.000000     76.470588   FALSE FALSE
## Wind     1.363636     20.261438   FALSE FALSE
## Temp     1.222222     26.143791   FALSE FALSE
## Month    1.000000      3.267974   FALSE FALSE
## Day      1.000000     20.261438   FALSE FALSE
cor(na.omit(airquality))
##                Ozone     Solar.R        Wind       Temp        Month
## Ozone    1.000000000  0.34834169 -0.61249658  0.6985414  0.142885168
## Solar.R  0.348341693  1.00000000 -0.12718345  0.2940876 -0.074066683
## Wind    -0.612496576 -0.12718345  1.00000000 -0.4971897 -0.194495804
## Temp     0.698541410  0.29408764 -0.49718972  1.0000000  0.403971709
## Month    0.142885168 -0.07406668 -0.19449580  0.4039717  1.000000000
## Day     -0.005189769 -0.05775380  0.04987102 -0.0965458 -0.009001079
##                  Day
## Ozone   -0.005189769
## Solar.R -0.057753801
## Wind     0.049871017
## Temp    -0.096545800
## Month   -0.009001079
## Day      1.000000000
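
caret's findCorrelation function flags predictors whose pairwise correlations exceed a cutoff and suggests which to remove; a minimal sketch against the correlation matrix above, using an illustrative cutoff of 0.75 (not from the original text).

library(caret)
highCorr <- findCorrelation(cor(na.omit(airquality)), cutoff = 0.75, names = TRUE)
highCorr  # predictors suggested for removal; no pairwise correlation above exceeds 0.75, so none are flagged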

Adding Predictors

Adding Predictors

When a predictor is categorical, it is common to decompose it into a set of more specific variables, usually with one dummy variable per category. A dummy variable for every category is optional, however, since an excluded category can be inferred when only \(n-1\) dummy variables are used (as sketched after the output below).

color <- factor(c("red", "green", "red", "blue"))
data.frame(model.matrix(~color-1))
##   colorblue colorgreen colorred
## 1         0          0        1
## 2         0          1        0
## 3         0          0        1
## 4         1          0        0
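
For the \(n-1\) coding mentioned above, R's default treatment contrasts drop the first factor level (blue here) and absorb it into the intercept; a minimal sketch.

data.frame(model.matrix(~color))  # blue is the reference level, inferred when both dummy columns are 0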

Binning Predictors

Binning Predictors

This is a data pre-processing method that should be avoided.

Avoid taking a numeric predictor and manually pre-categorizing or "binning" it into two or more groups prior to data analysis. Manually binning continuous data can lead to a significant loss of model performance, a loss of precision in the predictions, and a high rate of false positives.

There are several models, such as classification/regression trees and multivariate adaptive regression splines, that estimate cut points in the process of model building. The difference is that these models evaluate many variables simultaneously and are usually based on statistically sound methodologies.
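
As a brief illustration (a sketch, not part of the original text), a regression tree chooses its own cut points for a numeric predictor during model fitting, so no manual binning is needed; rpart is assumed here.

library(rpart)
fit <- rpart(Ozone ~ Temp, data = airquality)  # rows with missing Ozone are dropped automatically
fit  # the printed splits show Temp cut points selected by the model, not by manual binning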

Questions?

Email: jose.zuniga@sps.cuny.edu