Robbie Beane
In this lesson, we will introduce the caret package. This package provides a common interface for creating supervised learning models using many different algorithms, along with tools for data preprocessing and for performing cross-validation.
We will begin by loading the package.
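library(caret)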
As our first example of using caret, we will create a basic multiple linear regression model. For this example, we will use the NYC Restaurant dataset.
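The chunk below loads the data into a data frame named nyc and prints a summary; the file path is an assumption about where the data is stored.
nyc <- read.table("data/nyc.txt", sep="\t", header=TRUE)  # hypothetical path
summary(nyc)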
## Price Food Decor Service
## Min. :19.0 Min. :16.0 Min. : 6.00 Min. :14.0
## 1st Qu.:36.0 1st Qu.:19.0 1st Qu.:16.00 1st Qu.:18.0
## Median :43.0 Median :20.5 Median :18.00 Median :20.0
## Mean :42.7 Mean :20.6 Mean :17.69 Mean :19.4
## 3rd Qu.:50.0 3rd Qu.:22.0 3rd Qu.:19.00 3rd Qu.:21.0
## Max. :65.0 Max. :25.0 Max. :25.00 Max. :24.0
## Wait East
## Min. : 0.00 Min. :0.000
## 1st Qu.:16.75 1st Qu.:0.000
## Median :23.00 Median :1.000
## Mean :22.92 Mean :0.631
## 3rd Qu.:29.00 3rd Qu.:1.000
## Max. :46.00 Max. :1.000
Supervised learning models are created in caret using the train function. You can specify the response variable and the predictors using the familiar formula notation (with ~), or by providing separate data frames for the features and the response. The train function also requires a method argument, which determines the type of model being fit.
model_1 <- train(Price ~ ., nyc, method="lm")
# model_1a <- train(nyc[, -1], nyc$Price, method="lm")
summary(model_1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3315 -3.9098 -0.2242 3.3561 17.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.16816 4.78350 -5.261 4.47e-07 ***
## Food 1.55401 0.36844 4.218 4.09e-05 ***
## Decor 1.91615 0.21663 8.845 1.49e-15 ***
## Service -0.04571 0.39688 -0.115 0.9085
## Wait 0.06796 0.05311 1.280 0.2025
## East 2.04599 0.94505 2.165 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.727 on 162 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6202
## F-statistic: 55.55 on 5 and 162 DF, p-value: < 2.2e-16
We can use the predict function to generate predictions from a model created with train. Here we generate predictions for our training set and use them to recompute the training R-squared value.
predictions <- predict(model_1, nyc)
SSE <- sum((nyc$Price - predictions)^2)
SST <- sum((nyc$Price - mean(nyc$Price))^2)
r2 <- 1 - SSE / SST
r2
## [1] 0.6316051
We can also instruct train to produce estimates of out-of-sample performance metrics by performing cross-validation. This is done by creating a trainControl object and passing it to train via the trControl argument.
set.seed(1)
tc <- trainControl(method="cv", number=10)
model_2 <- train(Price ~ ., nyc, method="lm", trControl=tc)
summary(model_2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3315 -3.9098 -0.2242 3.3561 17.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.16816 4.78350 -5.261 4.47e-07 ***
## Food 1.55401 0.36844 4.218 4.09e-05 ***
## Decor 1.91615 0.21663 8.845 1.49e-15 ***
## Service -0.04571 0.39688 -0.115 0.9085
## Wait 0.06796 0.05311 1.280 0.2025
## East 2.04599 0.94505 2.165 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.727 on 162 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6202
## F-statistic: 55.55 on 5 and 162 DF, p-value: < 2.2e-16
The summary does not show the estimates generated by cross-validation. However, we can see these by printing the model object itself.
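model_2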
## Linear Regression
##
## 168 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 152, 151, 151, 150, 152, 151, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.707805 0.6267498 4.549324
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
We can view the metrics for each of the 10 folds through the model’s resample attribute.
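model_2$resample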
## RMSE Rsquared MAE Resample
## 1 3.147815 0.9018018 2.535886 Fold01
## 2 6.304390 0.5827676 5.198243 Fold02
## 3 5.563655 0.7309504 4.354696 Fold03
## 4 8.222586 0.4589760 6.337726 Fold04
## 5 6.161045 0.6410242 4.906047 Fold05
## 6 5.300876 0.5286649 4.456928 Fold06
## 7 6.245975 0.5085592 5.423489 Fold07
## 8 4.670149 0.6859168 4.107081 Fold08
## 9 5.364719 0.7546770 3.838051 Fold09
## 10 6.096839 0.4741603 4.335095 Fold10
We can also use cross-validation to perform hyperparameter tuning and model selection. Before providing an example of this, we will first illustrate the use of the expand.grid() function, which is useful for creating data frames containing all combinations of hyperparameter values.
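For example, a call like the one below produces a data frame with one row for every combination of its arguments; its column names, c1 and c2, match the output that follows.
expand.grid(c1=1:3, c2=c("A", "B"))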
## c1 c2
## 1 1 A
## 2 2 A
## 3 3 A
## 4 1 B
## 5 2 B
## 6 3 B
We will now perform KNN regression on the nyc dataset, using cross-validation to select an appropriate value for K. The candidate hyperparameter values are supplied to the train() function via the tuneGrid parameter. The train function will automatically select the best set of hyperparameters according to the cross-validation scores. By default, train performs this selection based on RMSE for regression problems, but we can instruct it to use R-squared instead with the metric parameter.
set.seed(1)
tc <- trainControl(method="cv", number=10,
selectionFunction="Rsquared")
param_grid <- expand.grid(k = c(1:40))
model_3 <- train(Price ~ ., nyc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
model_3
## k-Nearest Neighbors
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 8.133346 0.3651926 6.559035
## 2 7.630758 0.4065960 6.110262
## 3 7.188013 0.4411575 5.747229
## 4 6.868136 0.4695338 5.508320
## 5 6.620856 0.4967033 5.293771
## 6 6.520592 0.5078362 5.206962
## 7 6.449212 0.5163969 5.119928
## 8 6.382981 0.5259114 5.021135
## 9 6.379889 0.5266769 5.014644
## 10 6.420765 0.5221343 5.035897
## 11 6.427816 0.5226087 5.032147
## 12 6.435852 0.5222320 5.031862
## 13 6.459572 0.5193200 5.048249
## 14 6.444286 0.5232384 5.017959
## 15 6.443646 0.5244012 5.013767
## 16 6.434772 0.5272125 5.002048
## 17 6.433166 0.5293690 5.010073
## 18 6.426012 0.5323696 5.017149
## 19 6.405804 0.5378944 5.002702
## 20 6.362845 0.5470030 4.958886
## 21 6.347289 0.5516032 4.948972
## 22 6.356725 0.5517104 4.966295
## 23 6.368628 0.5523172 4.973502
## 24 6.344537 0.5591353 4.958955
## 25 6.334406 0.5626367 4.949254
## 26 6.320043 0.5670252 4.941686
## 27 6.328378 0.5686898 4.944207
## 28 6.344985 0.5686375 4.951599
## 29 6.347670 0.5693341 4.954045
## 30 6.363316 0.5685342 4.952608
## 31 6.365253 0.5698286 4.956516
## 32 6.360889 0.5711098 4.959845
## 33 6.369985 0.5708191 4.971895
## 34 6.384071 0.5705291 4.981946
## 35 6.400461 0.5693288 4.995635
## 36 6.425944 0.5673021 5.017832
## 37 6.434182 0.5677234 5.031035
## 38 6.446933 0.5673075 5.044687
## 39 6.466928 0.5658344 5.061033
## 40 6.476308 0.5664772 5.073236
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was k = 32.
The hyperparameters resulting in the best model are stored in the bestTune attribute of the model object.
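model_3$bestTune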
## k
## 32 32
The results attribute of the model object is a data frame containing cross-validation results for each combination of hyperparameters being considered.
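model_3$results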
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 8.133346 0.3651926 6.559035 0.6171991 0.08249531 0.5276548
## 2 2 7.630758 0.4065960 6.110262 0.7215375 0.07635848 0.5605432
## 3 3 7.188013 0.4411575 5.747229 0.6484953 0.06398062 0.5815737
## 4 4 6.868136 0.4695338 5.508320 0.6298373 0.06435648 0.5405153
## 5 5 6.620856 0.4967033 5.293771 0.5990372 0.06266739 0.4819627
## 6 6 6.520592 0.5078362 5.206962 0.5848288 0.06306099 0.4390070
## 7 7 6.449212 0.5163969 5.119928 0.5885483 0.06261215 0.4689539
## 8 8 6.382981 0.5259114 5.021135 0.6157454 0.06466605 0.5002379
## 9 9 6.379889 0.5266769 5.014644 0.6068891 0.06359247 0.4794728
## 10 10 6.420765 0.5221343 5.035897 0.6007563 0.06328543 0.4694754
## 11 11 6.427816 0.5226087 5.032147 0.5820196 0.05983282 0.4954389
## 12 12 6.435852 0.5222320 5.031862 0.6029491 0.06026492 0.5047011
## 13 13 6.459572 0.5193200 5.048249 0.6079515 0.05926500 0.5419614
## 14 14 6.444286 0.5232384 5.017959 0.5997702 0.05907761 0.5471148
## 15 15 6.443646 0.5244012 5.013767 0.5861841 0.05836258 0.5272943
## 16 16 6.434772 0.5272125 5.002048 0.5995168 0.06061531 0.5395927
## 17 17 6.433166 0.5293690 5.010073 0.5846699 0.06146139 0.5172125
## 18 18 6.426012 0.5323696 5.017149 0.5723893 0.06026370 0.4940068
## 19 19 6.405804 0.5378944 5.002702 0.5617545 0.06040239 0.4841181
## 20 20 6.362845 0.5470030 4.958886 0.5521411 0.06051247 0.4733970
## 21 21 6.347289 0.5516032 4.948972 0.5667081 0.06361238 0.4924101
## 22 22 6.356725 0.5517104 4.966295 0.5758459 0.06379349 0.4937581
## 23 23 6.368628 0.5523172 4.973502 0.5996523 0.06297839 0.5099448
## 24 24 6.344537 0.5591353 4.958955 0.6003275 0.06133679 0.5207975
## 25 25 6.334406 0.5626367 4.949254 0.5795607 0.05816576 0.5187503
## 26 26 6.320043 0.5670252 4.941686 0.5804074 0.06119485 0.5200360
## 27 27 6.328378 0.5686898 4.944207 0.5680397 0.06042774 0.5152812
## 28 28 6.344985 0.5686375 4.951599 0.5689039 0.06180431 0.5094421
## 29 29 6.347670 0.5693341 4.954045 0.5637367 0.06020353 0.5040553
## 30 30 6.363316 0.5685342 4.952608 0.5848690 0.06229935 0.5192927
## 31 31 6.365253 0.5698286 4.956516 0.5898456 0.06050694 0.5329000
## 32 32 6.360889 0.5711098 4.959845 0.5907581 0.05785153 0.5313975
## 33 33 6.369985 0.5708191 4.971895 0.5879682 0.05745242 0.5351011
## 34 34 6.384071 0.5705291 4.981946 0.5877356 0.06054540 0.5303284
## 35 35 6.400461 0.5693288 4.995635 0.5981007 0.06146628 0.5444196
## 36 36 6.425944 0.5673021 5.017832 0.6053286 0.06185602 0.5504168
## 37 37 6.434182 0.5677234 5.031035 0.6019362 0.06104422 0.5411615
## 38 38 6.446933 0.5673075 5.044687 0.6001116 0.06337007 0.5336664
## 39 39 6.466928 0.5658344 5.061033 0.5895779 0.06360421 0.5271185
## 40 40 6.476308 0.5664772 5.073236 0.5827963 0.06438045 0.5204593
We can use which.max to extract the row of results corresponding to the model with the highest cross-validation R-squared value.
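best_ix_3 <- which.max(model_3$results$Rsquared)  # index of the highest R-squared
model_3$results[best_ix_3, ]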
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 32 32 6.360889 0.5711098 4.959845 0.5907581 0.05785153 0.5313975
We can extract individual columns of the results data frame to plot the cross-validation R-squared estimates as a function of K.
plot(1:40, model_3$results$Rsquared,
main='Using Cross-Validation for Hyperparameter Tuning',
xlab='K', ylab='Cross-Validation r-Squared')
lines(1:40, model_3$results$Rsquared)
As a convenience, we can generate the same plot by simply passing the model object to the plot() function.
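plot(model_3)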
The final model selected through cross-validation is stored in the finalModel attribute of the model object. However, the final model itself doesn't take into account any preprocessing that was applied to the training set. Fortunately, the object returned by train also has a preProcess attribute that stores our feature scaler.
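The chunk below is a minimal sketch of this two-step process; it assumes Price is the first column of nyc, so that nyc[, -1] contains the five predictors.
scaled_features <- predict(model_3$preProcess, nyc[, -1])  # apply the stored centering/scaling
head(predict(model_3$finalModel, newdata=scaled_features), 5)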
## [1] 43.25000 39.78125 36.37500 39.68750 46.34375
To simplify matters, we can provide the object returned by train to the predict function directly. This will perform the necessary preprocessing steps prior to applying the final model.
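head(predict(model_3, nyc), 5)  # preprocessing is applied automatically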
## [1] 43.25000 39.78125 36.37500 39.68750 46.34375
After using cross-validation to select your final model, it is useful to perform an independent round of cross-validation to estimate out-of-sample performance.
set.seed(2)
tc <- trainControl(method="cv", number=10)
best_params_3 <- model_3$bestTune
model_3_best <- train(Price ~ ., nyc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=best_params_3, trControl=tc,
metric="Rsquared")
model_3_best
## k-Nearest Neighbors
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.641145 0.5460618 5.214257
##
## Tuning parameter 'k' was held constant at a value of 32
We will now see an example of using caret to perform hyperparameter tuning on an elasticnet model.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0,1, by=0.2),
lambda=seq(0,1,length=100))
model_4 <- train(Price ~ ., nyc, method="glmnet",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
best_ix_4 <- which.max(model_4$results$Rsquared)
model_4$results[best_ix_4, ]
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 508 1 0.07070707 5.69067 0.6281646 4.541926 1.294004 0.1412212
## MAESD
## 508 1.006432
We will use plot() to see how the cross-validation estimates of R-squared vary with respect to the hyperparameters.
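plot(model_4)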
As a strange quirk, when using train with method="glmnet", the finalModel attribute contains a matrix of coefficient estimates for multiple values of lambda, but only for the optimal value of alpha. We can extract the coefficients for the optimal model as follows:
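coef(model_4$finalModel, s=model_4$bestTune$lambda)  # coefficients at the selected lambda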
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 42.6964286
## Food 2.9917039
## Decor 5.0963233
## Service .
## Wait 0.5020165
## East 0.9210049
We will perform an independent round of cross-validation to assess out-of-sample performance.
set.seed(2)
best_params_4 <- model_4$bestTune
tc <- trainControl(method="cv", number=10)
model_4_best <- train(Price ~ ., nyc, method="glmnet",
preProcess=c("center", "scale"),
tuneGrid=best_params_4, trControl=tc,
metric="Rsquared")
model_4_best
## glmnet
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 152, 152, 150, 150, 151, 151, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.653275 0.6451659 4.51807
##
## Tuning parameter 'alpha' was held constant at a value of 1
##
## Tuning parameter 'lambda' was held constant at a value of 0.07070707
The train function from caret will automatically take care of one-hot encoding qualitative features for us. We will illustrate this using the diamonds dataset.
diamonds <- read.table("data/diamonds.txt", sep="\t", header = TRUE)
diamonds <- diamonds[,c("carat", "cut", "color", "clarity", "price")]
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## price
## Min. : 326
## 1st Qu.: 950
## Median : 2401
## Mean : 3933
## 3rd Qu.: 5324
## Max. :18823
##
We will now use cross-validation to select elasticnet hyperparameters for this dataset.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0.2, 1, by=0.2),
lambda=exp(seq(-3, 2,length=100)))
model_5 <- train(price ~ ., diamonds, method="glmnet",
preProcess = c("range"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
best_ix <- which.max(model_5$results$Rsquared)
model_5$results[best_ix, ]
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 301 0.8 0.04978707 1157.585 0.9158561 801.4206 21.31721 0.001778996
## MAESD
## 301 10.24947
We will now plot the cross-validation estimates of out-of-sample R-squared.
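plot(model_5)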
The coefficients in our final model are displayed below.
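coef(model_5$finalModel, s=model_5$bestTune$lambda)  # coefficients at the selected lambda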
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5322.0557
## carat 42668.8359
## cutGood 622.5196
## cutIdeal 969.4578
## cutPremium 839.3194
## cutVery Good 819.9871
## colorE -194.2664
## colorF -285.6780
## colorG -487.0998
## colorH -959.7280
## colorI -1418.2183
## colorJ -2300.7144
## clarityIF 5163.9155
## claritySI1 3329.1471
## claritySI2 2384.4157
## clarityVS1 4287.2468
## clarityVS2 3972.9075
## clarityVVS1 4820.6038
## clarityVVS2 4718.7002
In the code chunk below, we will retrain a new model using the best combination of hyperparameters found in order to estimate out-of-sample performance.
set.seed(2)
best_params_5 <- model_5$bestTune
tc <- trainControl(method="cv", number=10)
model_5_best <- train(price ~ ., diamonds, method="glmnet",
preProcess = c("range"),
tuneGrid=best_params_5, trControl=tc,
metric="Rsquared")
model_5_best
## glmnet
##
## 53940 samples
## 4 predictor
##
## Pre-processing: re-scaling to [0, 1] (18)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48546, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1157.426 0.9158503 801.4574
##
## Tuning parameter 'alpha' was held constant at a value of 0.8
##
## Tuning parameter 'lambda' was held constant at a value of 1.140325
We will now see how to use train to perform cross-validation to select elasticnet hyperparameters for logistic regression.
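We will use the Wisconsin Breast Cancer dataset, stored in a data frame named wbc. The loading step below is a sketch that assumes a tab-separated file like our other datasets; the path is hypothetical.
wbc <- read.table("data/wbc.txt", sep="\t", header=TRUE)  # hypothetical path
wbc$diagnosis <- factor(wbc$diagnosis)  # ensure the response is a factor
summary(wbc)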
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median :0.006380 Median :0.020450 Median :0.02589
## Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
We will now perform cross-validation and display information relating to the best model.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0, 1, by=0.5),
lambda=exp(seq(-6, -2,length=20)))
model_6 <- train(diagnosis ~ ., wbc, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
#model_6$bestTune
best_ix_6 <- which.max(model_6$results$Accuracy)
model_6$results[best_ix_6, ]
## alpha lambda Accuracy Kappa AccuracySD KappaSD
## 23 0.5 0.003776539 0.9755272 0.9472677 0.02046019 0.04403434
We will now plot the cross-validation estimates for out-of-sample accuracy.
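plot(model_6)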
The coefficients for the best model found are shown below.
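coef(model_6$finalModel, s=model_6$bestTune$lambda)  # coefficients at the selected lambda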
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.315639716
## radius_mean 0.285506745
## texture_mean 0.371914497
## perimeter_mean 0.235269424
## area_mean 0.310513202
## smoothness_mean .
## compactness_mean .
## concavity_mean 0.462308660
## concave.points_mean 0.734380315
## symmetry_mean .
## fractal_dimension_mean -0.219396320
## radius_se 1.131353911
## texture_se -0.074450743
## perimeter_se 0.391405864
## area_se 0.653019321
## smoothness_se 0.104321239
## compactness_se -0.535356211
## concavity_se .
## concave.points_se 0.001014363
## symmetry_se -0.142199830
## fractal_dimension_se -0.255251353
## radius_worst 0.988305173
## texture_worst 0.992847852
## perimeter_worst 0.763287310
## area_worst 0.858604086
## smoothness_worst 0.671143708
## compactness_worst .
## concavity_worst 0.655779729
## concave.points_worst 0.941234045
## symmetry_worst 0.603554472
## fractal_dimension_worst .
We will now retrain a new model with a slightly higher degree of regularization, but with a similar cross-validation score.
set.seed(2)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=0.5,
lambda=exp(-4))
model_6_alt <- train(diagnosis ~ ., wbc, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
model_6_alt
## glmnet
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 511, 512, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9736475 0.9426502
##
## Tuning parameter 'alpha' was held constant at a value of 0.5
##
## Tuning parameter 'lambda' was held constant at a value of 0.01831564
The coefficients of this alternate model are shown below.
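coef(model_6_alt$finalModel, s=model_6_alt$bestTune$lambda)  # coefficients at lambda = exp(-4)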
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.5741890
## radius_mean 0.3141484
## texture_mean 0.2490084
## perimeter_mean 0.2857658
## area_mean 0.2378809
## smoothness_mean .
## compactness_mean .
## concavity_mean 0.1064340
## concave.points_mean 0.4673911
## symmetry_mean .
## fractal_dimension_mean .
## radius_se 0.4429637
## texture_se .
## perimeter_se 0.1416516
## area_se 0.1355298
## smoothness_se .
## compactness_se .
## concavity_se .
## concave.points_se .
## symmetry_se .
## fractal_dimension_se -0.1075690
## radius_worst 0.6555300
## texture_worst 0.5875112
## perimeter_worst 0.5590332
## area_worst 0.4669748
## smoothness_worst 0.4270019
## compactness_worst .
## concavity_worst 0.2660941
## concave.points_worst 0.6538142
## symmetry_worst 0.3069766
## fractal_dimension_worst .
In this final example, we will use train to perform hyperparameter tuning for KNN classification.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(k=1:60)
model_7 <- train(diagnosis ~ ., wbc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
best_ix_7 <- which.max(model_7$results$Accuracy)
model_7$results[best_ix_7, ]
## k Accuracy Kappa AccuracySD KappaSD
## 10 10 0.9701095 0.9346352 0.0235881 0.05246313
We will plot the cross-validation estimates as a function of K.
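plot(model_7)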
We will now use cross-validation to estimate the best model’s out-of-sample performance.
set.seed(2)
tc <- trainControl(method="cv", number=10)
best_params_7 <- model_7$bestTune
model_7_best <- train(diagnosis ~ ., wbc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=best_params_7, trControl=tc,
metric="Accuracy")
model_7_best
## k-Nearest Neighbors
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 511, 512, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9719558 0.9392185
##
## Tuning parameter 'k' was held constant at a value of 10