In this chapter, you will practice using train() to preprocess data before fitting models, improving your ability to make accurate predictions.
In this chapter, you’ll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.
This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.
Fortunately, the train() function in caret contains an argument called preProcess, which allows you to specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the train() function using formulas such as y ~ .. An alternative way is to specify the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is a matrix or data frame that contains the whole dataset you’d use for the data argument to the lm() call, for example, but excludes the response variable column; y is a vector that contains just the response variable column.
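For concreteness, here is a minimal sketch of the two interfaces, assuming a data frame df whose response column is named outcome (both names are hypothetical):
# Formula interface: caret parses the formula and builds x and y for you
model_f <- train(outcome ~ ., data = df, method = "glm")
# x/y interface: pass the predictors and the response separately
model_xy <- train(
  x = df[, setdiff(names(df), "outcome")],  # predictors only
  y = df$outcome,                           # response vector
  method = "glm"
)
One practical difference to keep in mind: for many models, the formula interface dummy-codes factor predictors via model.matrix(), while the x/y interface hands them to the underlying model as-is.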
For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.
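Before modeling, it is worth quantifying the missingness. A quick sketch, using the objects named above:
# How many NAs per predictor, and what fraction of rows have at least one?
colSums(is.na(breast_cancer_x))
mean(!complete.cases(breast_cancer_x))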
Instructions
100 XP
Use the train() function to fit a glm model called model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values.
Print model to the console.
install.packages("miceadds")
breast_cancer_y<-miceadds::load.Rdata2('BreastCancer.RData', path=getwd())
breast_cancer_x<-read.csv("BreastCancer_x.csv")
str(breast_cancer_y)
Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
str(breast_cancer_x)
'data.frame': 699 obs. of 9 variables:
$ Cl.thickness : int 5 NA NA 6 4 8 1 2 NA NA ...
$ Cell.size : int NA 4 NA 8 1 10 1 1 1 2 ...
$ Cell.shape : int 1 4 1 8 1 10 NA 2 1 1 ...
$ Marg.adhesion : int 1 NA 1 NA 3 8 NA 1 NA 1 ...
$ Epith.c.size : int NA 7 2 NA 2 7 2 2 2 2 ...
$ Bare.nuclei : int 1 10 NA 4 1 10 10 1 1 1 ...
$ Bl.cromatin : int 3 3 3 3 3 NA 3 3 NA 2 ...
$ Normal.nucleoli: int 1 NA 1 7 NA 7 1 1 1 NA ...
$ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
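The train() calls below also pass trControl = myControl, an object not shown in the exercise. Judging from the output (10-fold cross-validation, ROC reported instead of Accuracy, per-fold progress messages), it was presumably created along these lines:
# Reusable trainControl: 10-fold CV with class probabilities so that
# twoClassSummary can compute ROC/Sens/Spec; verboseIter prints fold progress
myControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)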
# Apply median imputation: model
model <- train(
x = breast_cancer_x, y = breast_cancer_y,
method = "glm",
trControl = myControl,
preProcess = "medianImpute"
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
+ Fold01: parameter=none
- Fold01: parameter=none
... (Folds 02 through 10 print the same two lines) ...
Aggregating results
Fitting final model on full training set
# Print model to console
model
Generalized Linear Model
699 samples
9 predictor
2 classes: 'benign', 'malignant'
Pre-processing: median imputation (9)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 629, 630, 629, 629, 629, 629, ...
Resampling results:
ROC Sens Spec
0.9909593 0.9694203 0.9416667
In the previous exercise, you used median imputation to fill in missing values in the breast cancer dataset, but that is not the only possible method for dealing with missing data.
An alternative to median imputation is k-nearest neighbors, or KNN, imputation. This is a more advanced form of imputation where missing values are replaced with values from other rows that are similar to the current row. While this is a lot more complicated to implement in practice than simple median imputation, it is very easy to explore in caret using the preProcess argument to train(). You can simply use preProcess = "knnImpute" to change the method of imputation used prior to model fitting.
Instructions
100 XP
breast_cancer_x and breast_cancer_y are loaded in your workspace.
Use the train() function to fit a glm model called model2 to the breast cancer dataset.
Use KNN imputation to handle missing values.
install.packages("RANN")
library(RANN)
# Apply KNN imputation: model2
model2 <- train(
x = breast_cancer_x, y = breast_cancer_y,
method = "glm",
trControl = myControl,
preProcess = "knnImpute"
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
+ Fold01: parameter=none
- Fold01: parameter=none
... (Folds 02 through 10 print the same two lines) ...
Aggregating results
Fitting final model on full training set
# Print model to console
model2
Generalized Linear Model
699 samples
9 predictor
2 classes: 'benign', 'malignant'
Pre-processing: nearest neighbor imputation (9), centered (9), scaled (9)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 629, 629, 629, 629, 630, 629, ...
Resampling results:
ROC Sens Spec
0.9927758 0.9715459 0.9296667
All of the preprocessing steps in the train() function happen within the training set of each cross-validation fold, so the error metrics reported include the effects of the preprocessing.
This includes the imputation method used (e.g. knnImpute or medianImpute), which is useful because it allows you to compare different methods of imputation and choose the one that performs the best out-of-sample. Note from the output above that requesting knnImpute also triggered centering and scaling: caret standardizes the predictors automatically so that nearest-neighbor distances are computed on comparable scales.
median_model and knn_model are available in your workspace, as is resamples, which contains the resampled results of both models. Look at the results of the models by calling
dotplot(resamples, metric = "ROC") and choose the one that performs the best out-of-sample. Which method of imputation yields the highest out-of-sample ROC score for your glm model?
Instructions
50 XP
Possible Answers
KNN imputation is much better than median imputation.
KNN imputation is slightly better than median imputation. [ans]
Median imputation is much better than KNN imputation.
Median imputation is slightly better than KNN imputation.
# > resamples
#
# Call:
# resamples.default(x = list(median_model = median_model, knn_model = knn_model))
#
# Models: median_model, knn_model
# Number of resamples: 10
# Performance metrics: ROC, Sens, Spec
# Time estimates for: everything, final model fit
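If you need to rebuild the comparison yourself, here is a sketch, assuming median_model and knn_model are the train() objects fit with medianImpute and knnImpute, respectively:
# Collect the fold-level results from both models and compare visually
resamples <- resamples(list(median_model = median_model, knn_model = knn_model))
summary(resamples)
dotplot(resamples, metric = "ROC")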
The preProcess argument to train() doesn’t just limit you to imputing missing values. It also supports a wide variety of other preprocessing techniques that make your life as a data scientist much easier. You can read a full list of them by typing ?preProcess and reading the help page for this function.
One set of preprocessing functions that is particularly useful for fitting regression models is standardization: centering and scaling. You first center by subtracting the mean of each column from each value in that column, then you scale by dividing by the standard deviation.
Standardization transforms your data such that for each column, the mean is 0 and the standard deviation is 1. This makes it easier for regression models to find a good solution.
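As a sanity check, here is what centering and scaling do to a single, hypothetical numeric vector:
x <- c(2, 4, 6, 8)
centered <- x - mean(x)            # mean(centered) is now 0
standardized <- centered / sd(x)   # sd(standardized) is now 1
# Equivalent to as.numeric(scale(x)) in base R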
Instructions
100 XP
breast_cancer_x and breast_cancer_y are loaded in your workspace. Fit two models called model1 and model2 to the breast cancer data, then print each to the console:
A logistic regression model using only median imputation: model1
A logistic regression model using median imputation, centering, and scaling (in that order): model2
# Fit glm with median imputation: model1
model1 <- train(
x = breast_cancer_x, y = breast_cancer_y,
method = "glm",
trControl = myControl,
preProcess = c("medianImpute")
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
+ Fold01: parameter=none
- Fold01: parameter=none
... (Folds 02 through 10 print the same two lines) ...
Aggregating results
Fitting final model on full training set
# Print model1
model1
Generalized Linear Model
699 samples
9 predictor
2 classes: 'benign', 'malignant'
Pre-processing: median imputation (9)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 629, 629, 630, 629, 629, 629, ...
Resampling results:
ROC Sens Spec
0.9915866 0.9715942 0.9461667
# Fit glm with median imputation and standardization: model2
model2 <- train(
x = breast_cancer_x, y = breast_cancer_y,
method = "glm",
trControl = myControl,
preProcess = c("medianImpute", "center", "scale")
)
The metric "Accuracy" was not in the result set. ROC will be used instead.
+ Fold01: parameter=none
- Fold01: parameter=none
... (Folds 02 through 10 print the same two lines) ...
Aggregating results
Fitting final model on full training set
# Print model2
model2
Generalized Linear Model
699 samples
9 predictor
2 classes: 'benign', 'malignant'
Pre-processing: median imputation (9), centered (9), scaled (9)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 629, 630, 628, 629, 629, 629, ...
Resampling results:
ROC Sens Spec
0.9926808 0.9695652 0.938
As you saw in the video, for the next set of exercises, you’ll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:
log((concentration of compound in brain) / (concentration of compound in blood))
This gives a quantitative metric of the compound’s ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.
One interesting aspect of this dataset is that it contains many variables, and many of them have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).
Fortunately, caret contains a utility function called nearZeroVar() for removing such variables to save time during modeling.
nearZeroVar() takes in data x, then computes two statistics for each column: the ratio of the most common value to the second most common value (compared against the freqCut threshold) and the percentage of distinct values out of the total number of samples (compared against the uniqueCut threshold). By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar(), as in the sketch below.
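To see what those two statistics look like, here is a hand computation on a hypothetical near-zero variance column (the exact comparison operators are caret internals, so take the flagging rule as approximate):
v <- c(rep(0, 18), 1, 2)                            # 20 samples, mostly zero
tab <- sort(table(v), decreasing = TRUE)
freq_ratio <- tab[[1]] / tab[[2]]                   # 18 / 1 = 18
pct_unique <- 100 * length(unique(v)) / length(v)   # 3 / 20 = 15%
# Roughly: flagged when freq_ratio > freqCut and pct_unique < uniqueCut,
# so v is kept under the defaults (18 < 19) but removed under
# freqCut = 2, uniqueCut = 20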
Instructions
100 XP
bloodbrain_x and bloodbrain_y are loaded in your workspace.
Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar().
Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols.
Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e., the ones you don’t want to remove).
bloodbrain_y<-miceadds::load.Rdata2('BloodBrain.RData', path=getwd())
str(bloodbrain_y)
num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...
bloodbrain_x<-read.csv("bloodbrain_x.csv")
str(bloodbrain_x)
'data.frame': 208 obs. of 132 variables:
$ tpsa : num 12 49.3 50.5 37.4 37.4 ...
$ nbasic : int 1 0 1 0 1 1 1 1 1 1 ...
$ vsa_hyd : num 167.1 92.6 295.2 319.1 299.7 ...
$ a_aro : int 0 6 15 15 12 11 6 12 12 6 ...
$ weight : num 156 151 366 383 326 ...
$ peoe_vsa.0 : num 76.9 38.2 58.1 62.2 74.8 ...
$ peoe_vsa.1 : num 43.4 25.5 124.7 124.7 118 ...
$ peoe_vsa.2 : num 0 0 21.7 13.2 33 ...
$ peoe_vsa.3 : num 0 8.62 8.62 21.79 0 ...
$ peoe_vsa.4 : num 0 23.3 17.4 0 0 ...
$ peoe_vsa.5 : num 0 0 0 0 0 0 0 0 0 0 ...
$ peoe_vsa.6 : num 17.24 0 8.62 8.62 8.62 ...
$ peoe_vsa.0.1 : num 18.7 49 83.8 83.8 83.8 ...
$ peoe_vsa.1.1 : num 43.5 0 49 68.8 36.8 ...
$ peoe_vsa.2.1 : num 0 0 0 0 0 ...
$ peoe_vsa.3.1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ peoe_vsa.4.1 : num 0 0 5.68 5.68 5.68 ...
$ peoe_vsa.5.1 : num 0 13.567 2.504 0 0.137 ...
$ peoe_vsa.6.1 : num 0 7.9 2.64 2.64 2.5 ...
$ a_acc : int 0 2 2 2 2 2 2 2 0 2 ...
$ a_acid : int 0 0 0 0 0 0 0 0 0 0 ...
$ a_base : int 1 0 1 1 1 1 1 1 1 1 ...
$ vsa_acc : num 0 13.57 8.19 8.19 8.19 ...
$ vsa_acid : num 0 0 0 0 0 0 0 0 0 0 ...
$ vsa_base : num 5.68 0 0 0 0 ...
$ vsa_don : num 5.68 5.68 5.68 5.68 5.68 ...
$ vsa_other : num 0 28.1 43.6 28.3 19.6 ...
$ vsa_pol : num 0 13.6 0 0 0 ...
$ slogp_vsa0 : num 18 25.4 14.1 14.1 14.1 ...
$ slogp_vsa1 : num 0 23.3 34.8 34.8 34.8 ...
$ slogp_vsa2 : num 3.98 23.86 0 0 0 ...
$ slogp_vsa3 : num 0 0 76.2 76.2 76.2 ...
$ slogp_vsa4 : num 4.41 0 3.19 3.19 3.19 ...
$ slogp_vsa5 : num 32.9 0 9.51 0 0 ...
$ slogp_vsa6 : num 0 0 0 0 0 0 0 0 0 0 ...
$ slogp_vsa7 : num 0 70.6 148.1 144 140.7 ...
$ slogp_vsa8 : num 113.2 0 75.5 75.5 75.5 ...
$ slogp_vsa9 : num 33.3 41.3 28.3 55.5 26 ...
$ smr_vsa0 : num 0 23.86 12.63 3.12 3.12 ...
$ smr_vsa1 : num 18 25.4 27.8 27.8 27.8 ...
$ smr_vsa2 : num 4.41 0 0 0 0 ...
$ smr_vsa3 : num 3.98 5.24 8.43 8.43 8.43 ...
$ smr_vsa4 : num 0 20.8 29.6 21.4 20.3 ...
$ smr_vsa5 : num 113.2 70.6 235.1 235.1 234.6 ...
$ smr_vsa6 : num 0 5.26 76.25 76.25 76.25 ...
$ smr_vsa7 : num 66.2 33.3 0 31.3 0 ...
$ tpsa.1 : num 16.6 49.3 51.7 38.6 38.6 ...
$ logp.o.w. : num 2.948 0.889 4.439 5.254 3.8 ...
$ frac.anion7. : num 0 0.001 0 0 0 0 0.001 0 0 0 ...
$ frac.cation7. : num 0.999 0 0.986 0.986 0.986 0.986 0.996 0.946 0.999 0.976 ...
$ andrewbind : num 3.4 -3.3 12.8 12.8 10.3 10 10.4 15.9 12.9 9.5 ...
$ rotatablebonds : int 3 2 8 8 8 8 8 7 4 5 ...
$ mlogp : num 2.5 1.06 4.66 3.82 3.27 ...
$ clogp : num 2.97 0.494 5.137 5.878 4.367 ...
$ mw : num 155 151 365 382 325 ...
$ nocount : int 1 3 5 4 4 4 4 3 2 4 ...
$ hbdnr : int 1 2 1 1 1 1 2 1 1 0 ...
$ rule.of.5violations : int 0 0 1 1 0 0 0 0 1 0 ...
$ prx : int 0 1 6 2 2 2 1 0 0 4 ...
$ ub : num 0 3 5.3 5.3 4.2 3.6 3 4.7 4.2 3 ...
$ pol : int 0 2 3 3 2 2 2 3 4 1 ...
$ inthb : int 0 0 0 0 0 0 1 0 0 0 ...
$ adistm : num 0 395 1365 703 746 ...
$ adistd : num 0 10.9 25.7 10 10.6 ...
$ polar_area : num 21.1 117.4 82.1 65.1 66.2 ...
$ nonpolar_area : num 379 248 638 668 602 ...
$ psa_npsa : num 0.0557 0.4743 0.1287 0.0974 0.11 ...
$ tcsa : num 0.0097 0.0134 0.0111 0.0108 0.0118 0.0111 0.0123 0.0099 0.0106 0.0115 ...
$ tcpa : num 0.1842 0.0417 0.0972 0.1218 0.1186 ...
$ tcnp : num 0.0103 0.0198 0.0125 0.0119 0.013 0.0125 0.0162 0.011 0.0109 0.0122 ...
$ ovality : num 1.1 1.12 1.3 1.3 1.27 ...
$ surface_area : num 400 365 720 733 668 ...
$ volume : num 656 555 1224 1257 1133 ...
$ most_negative_charge: num -0.617 -0.84 -0.801 -0.761 -0.857 ...
$ most_positive_charge: num 0.307 0.497 0.541 0.48 0.455 ...
$ sum_absolute_charge : num 3.89 4.89 7.98 7.93 7.85 ...
$ dipole_moment : num 1.19 4.21 3.52 3.15 3.27 ...
$ homo : num -9.67 -8.96 -8.63 -8.56 -8.67 ...
$ lumo : num 3.4038 0.1942 0.0589 -0.2651 0.3149 ...
$ hardness : num 6.54 4.58 4.34 4.15 4.49 ...
$ ppsa1 : num 349 223 518 508 509 ...
$ ppsa2 : num 679 546 2066 2013 1999 ...
$ ppsa3 : num 31 42.3 64 61.7 61.6 ...
$ pnsa1 : num 51.1 141.8 202 225.4 158.8 ...
$ pnsa2 : num -99.3 -346.9 -805.9 -894 -623.3 ...
$ pnsa3 : num -10.5 -44 -43.8 -42 -39.8 ...
$ fpsa1 : num 0.872 0.611 0.719 0.692 0.762 ...
$ fpsa2 : num 1.7 1.5 2.87 2.75 2.99 ...
$ fpsa3 : num 0.0774 0.1159 0.0888 0.0842 0.0922 ...
$ fnsa1 : num 0.128 0.389 0.281 0.307 0.238 ...
$ fnsa2 : num -0.248 -0.951 -1.12 -1.22 -0.933 ...
$ fnsa3 : num -0.0262 -0.1207 -0.0608 -0.0573 -0.0596 ...
$ wpsa1 : num 139.7 81.4 372.7 372.1 340.1 ...
$ wpsa2 : num 272 199 1487 1476 1335 ...
$ wpsa3 : num 12.4 15.4 46 45.2 41.1 ...
$ wnsa1 : num 20.4 51.8 145.4 165.3 106 ...
$ wnsa2 : num -39.8 -126.6 -580.1 -655.3 -416.3 ...
$ wnsa3 : num -4.2 -16.1 -31.5 -30.8 -26.6 ...
$ dpsa1 : num 298.1 81.3 315.8 282.2 350.4 ...
[list output truncated]
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(bloodbrain_x, names = TRUE,
freqCut = 2, uniqueCut = 20)
# Get all column names from bloodbrain_x: all_cols
all_cols<-names(bloodbrain_x)
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(all_cols, remove_cols)]
Now that you’ve reduced your dataset, you can fit a glm model to it using the train() function. This model will run faster than using the full dataset and will yield very similar predictive accuracy.
Furthermore, zero variance variables can cause problems with cross-validation (e.g. if one fold ends up with only a single unique value for that variable), so removing them prior to modeling means you are less likely to get errors during the fitting process.
Instructions
100 XP
bloodbrain_x, bloodbrain_y, remove_cols, and bloodbrain_x_small are loaded in your workspace.
Fit a glm model using the train() function and the reduced blood-brain dataset you created in the previous exercise.
Print the result to the console.
# Fit model on reduced data: model
model <- train(x = bloodbrain_x_small, y = bloodbrain_y, method = "glm")
Warning: prediction from a rank-deficient fit may be misleading (repeated once for each of the 25 bootstrap resamples)
# Print model to console
model
Generalized Linear Model
208 samples
112 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
Resampling results:
RMSE Rsquared MAE
1.760752 0.1329531 1.133845
An alternative to removing low-variance predictors is to run PCA on your dataset. This is sometimes preferable because it does not throw out all of your data: many different low variance predictors may end up combined into one high variance PCA variable, which might have a positive impact on your model’s accuracy.
This is an especially good trick for linear models: the pca option in the preProcess argument will center and scale your data, combine low variance variables, and ensure that all of your predictors are orthogonal. This creates an ideal dataset for linear regression modeling, and can often improve the accuracy of your models.
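You can also run the PCA step standalone to inspect it before modeling. A sketch (thresh = 0.95 is preProcess()'s default cumulative-variance cutoff for choosing the number of components):
# Fit the preprocessing recipe on the predictors, then apply it
pp <- preProcess(bloodbrain_x, method = "pca", thresh = 0.95)
pp                                      # prints how many components were kept
bloodbrain_pca <- predict(pp, bloodbrain_x)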
Instructions
100 XP
bloodbrain_x and bloodbrain_y are loaded in your workspace.
Fit a glm model to the full blood-brain dataset using the “pca” option to preProcess.
Print the model to the console and inspect the result.
# Fit glm model using PCA: model
model <- train(
x = bloodbrain_x, y = bloodbrain_y,
method = "glm", preProcess = "pca"
)
# Print model to console
model
Generalized Linear Model
208 samples
132 predictors
Pre-processing: principal component signal extraction (132), centered (132), scaled (132)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
Resampling results:
RMSE Rsquared MAE
0.5885472 0.4441561 0.4420655
Remark: Note that the PCA model performs substantially better out-of-sample than the nearZeroVar() model from the previous exercise (RMSE 0.59 vs. 1.76, R-squared 0.44 vs. 0.13). PCA is generally a better method for handling low-information predictors than throwing them out entirely.