Chapter 4: Preprocessing your data

In this chapter, you will practice using train() to preprocess data before fitting models, improving your ability to make accurate predictions.

4.1: Apply median imputation

In this chapter, you’ll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.

This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.

Fortunately, the train() function in caret has an argument called preProcess, which lets you specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the train() function using the formula interface (for example, y ~ .). An alternative is to supply the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is the matrix or data frame you would pass as the data argument to, say, an lm() call, minus the response column, and y is a vector containing just the response column.
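To make the two interfaces concrete, here is a minimal sketch using a small made-up data frame df (the names df, model_formula and model_xy are illustrative, not part of the exercise):

library(caret)

# A tiny made-up data frame with a factor response named y
df <- data.frame(
  y  = factor(rep(c("no", "yes"), each = 25)),
  x1 = rnorm(50),
  x2 = rnorm(50)
)

# Formula interface: pass the whole data frame and name the response in the formula
model_formula <- train(y ~ ., data = df, method = "glm")

# x/y interface: x holds only the predictors, y is the response on its own
model_xy <- train(x = df[, c("x1", "x2")], y = df$y, method = "glm")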

For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.

Instructions

100 XP

  • Use the train() function to fit a glm model called model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values.

  • Print model to the console.
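Note that the calls below use a trainControl object called myControl, which carries over from earlier chapters and is not redefined here. A plausible reconstruction, consistent with the 10-fold cross-validation and ROC/Sens/Spec summaries printed below (a sketch, not necessarily the exact object used in the course):

library(caret)

# Reconstructed custom train control (assumed): 10-fold CV with class
# probabilities and a two-class summary so that ROC, Sens and Spec are reported
myControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,  # ROC / Sens / Spec instead of Accuracy
  classProbs = TRUE,                  # required for ROC
  verboseIter = TRUE                  # print per-fold progress while fitting
)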

install.packages("miceadds")
breast_cancer_y<-miceadds::load.Rdata2('BreastCancer.RData', path=getwd())
breast_cancer_x<-read.csv("BreastCancer_x.csv")
str(breast_cancer_y)
 Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
str(breast_cancer_x)
'data.frame':   699 obs. of  9 variables:
 $ Cl.thickness   : int  5 NA NA 6 4 8 1 2 NA NA ...
 $ Cell.size      : int  NA 4 NA 8 1 10 1 1 1 2 ...
 $ Cell.shape     : int  1 4 1 8 1 10 NA 2 1 1 ...
 $ Marg.adhesion  : int  1 NA 1 NA 3 8 NA 1 NA 1 ...
 $ Epith.c.size   : int  NA 7 2 NA 2 7 2 2 2 2 ...
 $ Bare.nuclei    : int  1 10 NA 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : int  3 3 3 3 3 NA 3 3 NA 2 ...
 $ Normal.nucleoli: int  1 NA 1 7 NA 7 1 1 1 NA ...
 $ Mitoses        : int  1 1 1 1 1 1 1 1 5 1 ...
# Apply median imputation: model
model <- train(
  x = breast_cancer_x, y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = "medianImpute"
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
# Print model to console
model
Generalized Linear Model 

699 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: median imputation (9) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 629, 630, 629, 629, 629, 629, ... 
Resampling results:

  ROC        Sens       Spec     
  0.9909593  0.9694203  0.9416667

4.2: Use KNN imputation

In the previous exercise, you used median imputation to fill in missing values in the breast cancer dataset, but that is not the only possible method for dealing with missing data.

An alternative to median imputation is k-nearest neighbors (KNN) imputation. This is a more sophisticated form of imputation in which a missing value is filled in using the rows that are most similar to the current row. While KNN imputation is considerably more complicated to implement from scratch than median imputation, it is easy to try in caret: simply pass preProcess = "knnImpute" to train() to change the imputation method used prior to model fitting. (caret's knnImpute also centers and scales the predictors, since nearest-neighbor distances are only meaningful when the columns are on a common scale; you will see this reflected in the pre-processing summary below.)
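To see what happens under the hood, you can also call preProcess() directly; train() does the equivalent inside each resampling fold. A small sketch, assuming breast_cancer_x is loaded and the RANN package (used by knnImpute) is installed; the object names pp and breast_cancer_imputed are illustrative:

library(caret)

# Learn the imputation model on the predictors (k = 5 neighbors is the default)
pp <- preProcess(breast_cancer_x, method = "knnImpute", k = 5)

# Apply it; the result is also centered and scaled, because neighbor distances
# are only meaningful once the columns share a common scale
breast_cancer_imputed <- predict(pp, newdata = breast_cancer_x)
sum(is.na(breast_cancer_imputed))  # should be 0: no missing values remain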

Instructions

100 XP

  • breast_cancer_x and breast_cancer_y are loaded in your workspace.

  • Use the train() function to fit a glm model called model2 to the breast cancer dataset.

  • Use KNN imputation to handle missing values.

install.packages("RANN")
library(RANN)
# Apply KNN imputation: model2
model2 <- train(
  x = breast_cancer_x, y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = "knnImpute"
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
# Print model to console
model2
Generalized Linear Model 

699 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: nearest neighbor imputation (9), centered (9), scaled (9) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 629, 629, 629, 629, 630, 629, ... 
Resampling results:

  ROC        Sens       Spec     
  0.9927758  0.9715459  0.9296667

4.3: Compare KNN and median imputation

All of the preprocessing steps specified in train() are applied separately within each cross-validation fold (estimated on the fold's training portion and then applied to its held-out portion), so the reported error metrics include the effects of the preprocessing.

This includes the imputation method used (e.g. knnImpute or medianImpute). This is useful because it allows you to compare different methods of imputation and choose the one that performs the best out-of-sample.

median_model and knn_model are available in your workspace, as is resamples, which contains the resampled results of both models. Look at the results of the models by calling dotplot(resamples, metric = "ROC") and choose the one that performs best out-of-sample. Which method of imputation yields the highest out-of-sample ROC score for your glm model?

Instructions

50 XP

Possible Answers

  1. KNN imputation is much better than median imputation.

  2. KNN imputation is slightly better than median imputation. [ans]

  3. Median imputation is much better than KNN imputation.

  4. Median imputation is slightly better than KNN imputation.

# > resamples
# 
# Call:
# resamples.default(x = list(median_model = median_model, knn_model = knn_model))
# 
# Models: median_model, knn_model 
# Number of resamples: 10 
# Performance metrics: ROC, Sens, Spec 
# Time estimates for: everything, final model fit
# 
# dotplot(resamples, metric = "ROC")
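A minimal sketch of how that comparison could be assembled, assuming median_model and knn_model were both fit with the same trControl (the resamples() call mirrors the one recorded in the output above):

library(caret)

# Collect the per-fold results of both models into a single resamples object
resamples <- resamples(list(median_model = median_model, knn_model = knn_model))

# Compare out-of-sample ROC across the 10 folds
summary(resamples, metric = "ROC")
dotplot(resamples, metric = "ROC")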

4.4: Combining preprocessing methods

The preProcess argument to train() isn't limited to imputing missing values. It also supports a wide variety of other preprocessing techniques that can make your life as a data scientist much easier. You can read the full list by typing ?preProcess and reading the help page for this function.

One set of preprocessing functions that is particularly useful for fitting regression models is standardization: centering and scaling. You first center by subtracting the mean of each column from each value in that column, then you scale by dividing by the standard deviation.

Standardization transforms your data such that for each column, the mean is 0 and the standard deviation is 1. This makes it easier for regression models to find a good solution.
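As a quick numeric illustration of what centering and scaling do (a made-up vector, not the exercise data):

x <- c(2, 4, 6, 8)
centered     <- x - mean(x)        # subtract the column mean: -3 -1 1 3
standardized <- centered / sd(x)   # then divide by the standard deviation
mean(standardized)                 # 0
sd(standardized)                   # 1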

Instructions

100 XP

  • breast_cancer_x and breast_cancer_y are loaded in your workspace. Fit two models called model1 and model2 to the breast cancer data, then print each to the console:

  • A logistic regression model using only median imputation: model1

  • A logistic regression model using median imputation, centering, and scaling (in that order): model2

# Fit glm with median imputation: model1
model1 <- train(
  x = breast_cancer_x, y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = c("medianImpute")
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
# Print model1
model1
Generalized Linear Model 

699 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: median imputation (9) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 629, 629, 630, 629, 629, 629, ... 
Resampling results:

  ROC        Sens       Spec     
  0.9915866  0.9715942  0.9461667
# Fit glm with median imputation and standardization: model2
model2 <- train(
  x = breast_cancer_x, y = breast_cancer_y,
  method = "glm",
  trControl = myControl,
  preProcess = c("medianImpute", "center", "scale")
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
# Print model2
model2
Generalized Linear Model 

699 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: median imputation (9), centered (9), scaled (9) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 629, 630, 628, 629, 629, 629, ... 
Resampling results:

  ROC        Sens       Spec 
  0.9926808  0.9695652  0.938

4.5: Remove near zero variance predictors

As you saw in the video, for the next set of exercises, you’ll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:

log((concentration of compound in brain) / (concentration of compound in blood))

This gives a quantitative metric of the compound's ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.

One interesting aspect of this dataset is that it contains many variables, and many of these variables have extremely low variances. This means there is very little information in these variables because they mostly consist of a single value (e.g. zero).

Fortunately, caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling.

nearZeroVar() takes in data x, then looks at the ratio of the most common value to the second most common value (freqCut) and the percentage of distinct values out of the total number of samples (uniqueCut). By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar().
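To see what these two cutoffs measure, you can ask nearZeroVar() for its per-column diagnostics instead of just the offending column names (an illustrative call, assuming caret is loaded and bloodbrain_x is in the workspace; nzv_stats is an illustrative name):

# saveMetrics = TRUE returns one row per column with the statistics behind the cutoffs
nzv_stats <- nearZeroVar(bloodbrain_x, freqCut = 2, uniqueCut = 20, saveMetrics = TRUE)
head(nzv_stats)
# freqRatio     : most common value / second most common value (compared against freqCut)
# percentUnique : 100 * number of distinct values / number of samples (compared against uniqueCut)
# nzv           : TRUE when a column fails both checks and would be flagged for removal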

Instructions

100 XP

  • bloodbrain_x and bloodbrain_y are loaded in your workspace.

  • Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar().

  • Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols.

  • Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don’t want to remove.)

bloodbrain_y <- miceadds::load.Rdata2('BloodBrain.RData', path = getwd())
str(bloodbrain_y)
 num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...
bloodbrain_x <- read.csv("bloodbrain_x.csv")
str(bloodbrain_x)
'data.frame':   208 obs. of  132 variables:
 $ tpsa                : num  12 49.3 50.5 37.4 37.4 ...
 $ nbasic              : int  1 0 1 0 1 1 1 1 1 1 ...
 $ vsa_hyd             : num  167.1 92.6 295.2 319.1 299.7 ...
 $ a_aro               : int  0 6 15 15 12 11 6 12 12 6 ...
 $ weight              : num  156 151 366 383 326 ...
 $ peoe_vsa.0          : num  76.9 38.2 58.1 62.2 74.8 ...
  [list output truncated: 126 further numeric/integer molecular-descriptor columns omitted]
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(bloodbrain_x, names = TRUE, 
                           freqCut = 2, uniqueCut = 20)
# Get all column names from bloodbrain_x: all_cols
all_cols <- names(bloodbrain_x)
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(all_cols, remove_cols)]
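A quick sanity check on the result (illustrative only):

length(remove_cols)       # how many near-zero variance columns were flagged
ncol(bloodbrain_x)        # 132 columns before removal
ncol(bloodbrain_x_small)  # columns remaining (112, matching the predictor count in the next exercise)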

4.6: Fit model on reduced blood-brain data

Now that you’ve reduced your dataset, you can fit a glm model to it using the train() function. This model will run faster than using the full dataset and will yield very similar predictive accuracy.

Furthermore, zero variance variables can cause problems with cross-validation (e.g. if one fold ends up with only a single unique value for that variable), so removing them prior to modeling means you are less likely to get errors during the fitting process.

Instructions

100 XP

  • bloodbrain_x, bloodbrain_y, remove_cols, and bloodbrain_x_small are loaded in your workspace.

  • Fit a glm model using the train() function and the reduced blood-brain dataset you created in the previous exercise.

  • Print the result to the console.

# Fit model on reduced data: model
model <- train(x = bloodbrain_x_small, y = bloodbrain_y, method = "glm")
Warning: prediction from a rank-deficient fit may be misleading (printed once for each bootstrap resample)
# Print model to console
model
Generalized Linear Model 

208 samples
112 predictors

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  1.760752  0.1329531  1.133845

4.7: Using PCA as an alternative to nearZeroVar()

An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw data away: many different low-variance predictors may end up combined into one high-variance principal component, which can have a positive impact on your model's accuracy.

This is an especially good trick for linear models: the pca option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal. This creates an ideal dataset for linear regression modeling, and can often improve the accuracy of your models.
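By default, caret's PCA preprocessing keeps enough components to capture 95% of the variance. If you want to change that, the preProcOptions element of trainControl() is the place to do it; the sketch below (the names ctrl and model_pca80 are illustrative) keeps 80% instead:

library(caret)

# Lower the variance threshold used by the "pca" preprocessing step
ctrl <- trainControl(preProcOptions = list(thresh = 0.80))

model_pca80 <- train(
  x = bloodbrain_x, y = bloodbrain_y,
  method = "glm",
  trControl = ctrl,
  preProcess = "pca"
)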

Instructions

100 XP

  • bloodbrain_x and bloodbrain_y are loaded in your workspace.

  • Fit a glm model to the full blood-brain dataset using the “pca” option to preProcess.

  • Print the model to the console and inspect the result.

# Fit glm model using PCA: model
model <- train(
  x = bloodbrain_x, y = bloodbrain_y,
  method = "glm", preProcess = "pca"
)
# Print model to console
model
Generalized Linear Model 

208 samples
132 predictors

Pre-processing: principal component signal extraction (132), centered (132), scaled (132) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.5885472  0.4441561  0.4420655

Remark: Note that the PCA model performs noticeably better than the nearZeroVar() model from the previous exercise (RMSE drops from about 1.76 to 0.59 and R-squared rises from about 0.13 to 0.44). PCA is generally a better method for handling low-information predictors than throwing them out entirely.
