Robbie Beane
In this lesson, we will introduce the caret package. This package provides a common interface for creating supervised learning models using many different algorithms, along with tools for data preprocessing and for performing cross-validation.
We will begin by loading the package.
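library(caret)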
As our first example of using caret, we will create a basic multiple linear regression model. For this example, we will use the NYC Restaurant dataset.
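The chunk below loads the data into a data frame named nyc and prints a summary; the file path is an assumption about where the data is stored.
nyc <- read.table("data/nyc.txt", sep="\t", header=TRUE)  # hypothetical path
summary(nyc)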
## Price Food Decor Service
## Min. :19.0 Min. :16.0 Min. : 6.00 Min. :14.0
## 1st Qu.:36.0 1st Qu.:19.0 1st Qu.:16.00 1st Qu.:18.0
## Median :43.0 Median :20.5 Median :18.00 Median :20.0
## Mean :42.7 Mean :20.6 Mean :17.69 Mean :19.4
## 3rd Qu.:50.0 3rd Qu.:22.0 3rd Qu.:19.00 3rd Qu.:21.0
## Max. :65.0 Max. :25.0 Max. :25.00 Max. :24.0
## Wait East
## Min. : 0.00 Min. :0.000
## 1st Qu.:16.75 1st Qu.:0.000
## Median :23.00 Median :1.000
## Mean :22.92 Mean :0.631
## 3rd Qu.:29.00 3rd Qu.:1.000
## Max. :46.00 Max. :1.000
Supervised learning models are created in caret using the train function. You can specify the response variable and the predictors using the familiar formula notation (with ~), or by providing separate data frames for the features and the response. The train function also requires a method argument, which determines the type of model being fit.
model_1 <- train(Price ~ ., nyc, method="lm")
# model_1a <- train(nyc[, -1], nyc$Price, method="lm")
summary(model_1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3315 -3.9098 -0.2242 3.3561 17.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.16816 4.78350 -5.261 4.47e-07 ***
## Food 1.55401 0.36844 4.218 4.09e-05 ***
## Decor 1.91615 0.21663 8.845 1.49e-15 ***
## Service -0.04571 0.39688 -0.115 0.9085
## Wait 0.06796 0.05311 1.280 0.2025
## East 2.04599 0.94505 2.165 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.727 on 162 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6202
## F-statistic: 55.55 on 5 and 162 DF, p-value: < 2.2e-16
We can use the predict function to generate predictions from a model created with train. Here we generate predictions for our training set and use them to recompute the training R-squared value.
predictions <- predict(model_1, nyc)
SSE <- sum((nyc$Price - predictions)^2)
SST <- sum((nyc$Price - mean(nyc$Price))^2)
r2 <- 1 - SSE / SST
r2
## [1] 0.6316051
We can also instruct train to produce estimates of out-of-sample performance metrics by performing cross-validation. This is done by creating a trainControl object and passing it to train via the trControl argument.
set.seed(1)
tc <- trainControl(method="cv", number=10)
model_2 <- train(Price ~ ., nyc, method="lm", trControl=tc)
summary(model_2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3315 -3.9098 -0.2242 3.3561 17.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.16816 4.78350 -5.261 4.47e-07 ***
## Food 1.55401 0.36844 4.218 4.09e-05 ***
## Decor 1.91615 0.21663 8.845 1.49e-15 ***
## Service -0.04571 0.39688 -0.115 0.9085
## Wait 0.06796 0.05311 1.280 0.2025
## East 2.04599 0.94505 2.165 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.727 on 162 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6202
## F-statistic: 55.55 on 5 and 162 DF, p-value: < 2.2e-16
The summary does not show the estimates generated by cross-validation. However, we can see these by printing the model object itself.
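model_2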
## Linear Regression
##
## 168 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 152, 151, 151, 150, 152, 151, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.707805 0.6267498 4.549324
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
We can view the metrics for each of the 10 folds through the model’s resample attribute.
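model_2$resample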
## RMSE Rsquared MAE Resample
## 1 3.147815 0.9018018 2.535886 Fold01
## 2 6.304390 0.5827676 5.198243 Fold02
## 3 5.563655 0.7309504 4.354696 Fold03
## 4 8.222586 0.4589760 6.337726 Fold04
## 5 6.161045 0.6410242 4.906047 Fold05
## 6 5.300876 0.5286649 4.456928 Fold06
## 7 6.245975 0.5085592 5.423489 Fold07
## 8 4.670149 0.6859168 4.107081 Fold08
## 9 5.364719 0.7546770 3.838051 Fold09
## 10 6.096839 0.4741603 4.335095 Fold10
We can also use cross-validation to perform hyperparameter tuning and model selection. Before providing an example of this, we will first illustrate the use of the expand.grid() function, which is useful for creating data frames containing all combinations of hyperparameter values.
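For example, a call like the one below produces a data frame with one row for every combination of its arguments; its column names, c1 and c2, match the output that follows.
expand.grid(c1=1:3, c2=c("A", "B"))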
## c1 c2
## 1 1 A
## 2 2 A
## 3 3 A
## 4 1 B
## 5 2 B
## 6 3 B
We will now perform KNN regression on the nyc dataset, using cross-validation to select an appropriate value for K. The candidate hyperparameter values are supplied to the train() function via the tuneGrid parameter. The train function will automatically select the best set of hyperparameters according to the cross-validation scores. By default, train performs this selection based on RMSE for regression problems, but we can instruct it to use R-squared instead with the metric parameter.
set.seed(1)
tc <- trainControl(method="cv", number=10,
selectionFunction="Rsquared")
param_grid <- expand.grid(k = c(1:40))
model_3 <- train(Price ~ ., nyc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
model_3
## k-Nearest Neighbors
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 8.133346 0.3651926 6.559035
## 2 7.630758 0.4065960 6.110262
## 3 7.188013 0.4411575 5.747229
## 4 6.868136 0.4695338 5.508320
## 5 6.620856 0.4967033 5.293771
## 6 6.520592 0.5078362 5.206962
## 7 6.449212 0.5163969 5.119928
## 8 6.382981 0.5259114 5.021135
## 9 6.379889 0.5266769 5.014644
## 10 6.420765 0.5221343 5.035897
## 11 6.427816 0.5226087 5.032147
## 12 6.435852 0.5222320 5.031862
## 13 6.459572 0.5193200 5.048249
## 14 6.444286 0.5232384 5.017959
## 15 6.443646 0.5244012 5.013767
## 16 6.434772 0.5272125 5.002048
## 17 6.433166 0.5293690 5.010073
## 18 6.426012 0.5323696 5.017149
## 19 6.405804 0.5378944 5.002702
## 20 6.362845 0.5470030 4.958886
## 21 6.347289 0.5516032 4.948972
## 22 6.356725 0.5517104 4.966295
## 23 6.368628 0.5523172 4.973502
## 24 6.344537 0.5591353 4.958955
## 25 6.334406 0.5626367 4.949254
## 26 6.320043 0.5670252 4.941686
## 27 6.328378 0.5686898 4.944207
## 28 6.344985 0.5686375 4.951599
## 29 6.347670 0.5693341 4.954045
## 30 6.363316 0.5685342 4.952608
## 31 6.365253 0.5698286 4.956516
## 32 6.360889 0.5711098 4.959845
## 33 6.369985 0.5708191 4.971895
## 34 6.384071 0.5705291 4.981946
## 35 6.400461 0.5693288 4.995635
## 36 6.425944 0.5673021 5.017832
## 37 6.434182 0.5677234 5.031035
## 38 6.446933 0.5673075 5.044687
## 39 6.466928 0.5658344 5.061033
## 40 6.476308 0.5664772 5.073236
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was k = 32.
The hyperparameters resulting in the best model are stored in the bestTune attribute of the model object.
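model_3$bestTune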
## k
## 32 32
The results attribute of the model object is a data frame containing cross-validation results for each combination of hyperparameters being considered.
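model_3$results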
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 8.133346 0.3651926 6.559035 0.6171991 0.08249531 0.5276548
## 2 2 7.630758 0.4065960 6.110262 0.7215375 0.07635848 0.5605432
## 3 3 7.188013 0.4411575 5.747229 0.6484953 0.06398062 0.5815737
## 4 4 6.868136 0.4695338 5.508320 0.6298373 0.06435648 0.5405153
## 5 5 6.620856 0.4967033 5.293771 0.5990372 0.06266739 0.4819627
## 6 6 6.520592 0.5078362 5.206962 0.5848288 0.06306099 0.4390070
## 7 7 6.449212 0.5163969 5.119928 0.5885483 0.06261215 0.4689539
## 8 8 6.382981 0.5259114 5.021135 0.6157454 0.06466605 0.5002379
## 9 9 6.379889 0.5266769 5.014644 0.6068891 0.06359247 0.4794728
## 10 10 6.420765 0.5221343 5.035897 0.6007563 0.06328543 0.4694754
## 11 11 6.427816 0.5226087 5.032147 0.5820196 0.05983282 0.4954389
## 12 12 6.435852 0.5222320 5.031862 0.6029491 0.06026492 0.5047011
## 13 13 6.459572 0.5193200 5.048249 0.6079515 0.05926500 0.5419614
## 14 14 6.444286 0.5232384 5.017959 0.5997702 0.05907761 0.5471148
## 15 15 6.443646 0.5244012 5.013767 0.5861841 0.05836258 0.5272943
## 16 16 6.434772 0.5272125 5.002048 0.5995168 0.06061531 0.5395927
## 17 17 6.433166 0.5293690 5.010073 0.5846699 0.06146139 0.5172125
## 18 18 6.426012 0.5323696 5.017149 0.5723893 0.06026370 0.4940068
## 19 19 6.405804 0.5378944 5.002702 0.5617545 0.06040239 0.4841181
## 20 20 6.362845 0.5470030 4.958886 0.5521411 0.06051247 0.4733970
## 21 21 6.347289 0.5516032 4.948972 0.5667081 0.06361238 0.4924101
## 22 22 6.356725 0.5517104 4.966295 0.5758459 0.06379349 0.4937581
## 23 23 6.368628 0.5523172 4.973502 0.5996523 0.06297839 0.5099448
## 24 24 6.344537 0.5591353 4.958955 0.6003275 0.06133679 0.5207975
## 25 25 6.334406 0.5626367 4.949254 0.5795607 0.05816576 0.5187503
## 26 26 6.320043 0.5670252 4.941686 0.5804074 0.06119485 0.5200360
## 27 27 6.328378 0.5686898 4.944207 0.5680397 0.06042774 0.5152812
## 28 28 6.344985 0.5686375 4.951599 0.5689039 0.06180431 0.5094421
## 29 29 6.347670 0.5693341 4.954045 0.5637367 0.06020353 0.5040553
## 30 30 6.363316 0.5685342 4.952608 0.5848690 0.06229935 0.5192927
## 31 31 6.365253 0.5698286 4.956516 0.5898456 0.06050694 0.5329000
## 32 32 6.360889 0.5711098 4.959845 0.5907581 0.05785153 0.5313975
## 33 33 6.369985 0.5708191 4.971895 0.5879682 0.05745242 0.5351011
## 34 34 6.384071 0.5705291 4.981946 0.5877356 0.06054540 0.5303284
## 35 35 6.400461 0.5693288 4.995635 0.5981007 0.06146628 0.5444196
## 36 36 6.425944 0.5673021 5.017832 0.6053286 0.06185602 0.5504168
## 37 37 6.434182 0.5677234 5.031035 0.6019362 0.06104422 0.5411615
## 38 38 6.446933 0.5673075 5.044687 0.6001116 0.06337007 0.5336664
## 39 39 6.466928 0.5658344 5.061033 0.5895779 0.06360421 0.5271185
## 40 40 6.476308 0.5664772 5.073236 0.5827963 0.06438045 0.5204593
We can use which.max to extract the row of results corresponding to the model with the highest cross-validation R-squared value.
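best_ix_3 <- which.max(model_3$results$Rsquared)  # index of the highest R-squared
model_3$results[best_ix_3, ]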
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 32 32 6.360889 0.5711098 4.959845 0.5907581 0.05785153 0.5313975
We can extract individual columns of the results data frame to plot the cross-validation R-squared estimates as a function of K.
plot(1:40, model_3$results$Rsquared,
main='Using Cross-Validation for Hyperparameter Tuning',
xlab='K', ylab='Cross-Validation r-Squared')
lines(1:40, model_3$results$Rsquared)
As a convenience, we can generate the same plot by simply passing the model object to the plot() function.
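plot(model_3)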
The final model selected through cross-validation is stored in the finalModel attribute of the model object. However, the final model itself doesn't take into account any preprocessing that was applied to the training set. Fortunately, the object returned by train also has a preProcess attribute that stores our feature scaler.
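The chunk below is a minimal sketch of this two-step process; it assumes Price is the first column of nyc, so that nyc[, -1] contains the five predictors.
scaled_features <- predict(model_3$preProcess, nyc[, -1])  # apply the stored centering/scaling
head(predict(model_3$finalModel, newdata=scaled_features), 5)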
## [1] 43.25000 39.78125 36.37500 39.68750 46.34375
To simplify matters, we can provide the object returned by train to the predict function directly. This will perform the necessary preprocessing steps prior to applying the final model.
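head(predict(model_3, nyc), 5)  # preprocessing is applied automatically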
## [1] 43.25000 39.78125 36.37500 39.68750 46.34375
After using cross-validation to select your final model, it is useful to perform an independent round of cross-validation to estimate out-of-sample performance.
set.seed(2)
tc <- trainControl(method="cv", number=10)
best_params_3 <- model_3$bestTune
model_3_best <- train(Price ~ ., nyc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=best_params_3, trControl=tc,
metric="Rsquared")
model_3_best
## k-Nearest Neighbors
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.641145 0.5460618 5.214257
##
## Tuning parameter 'k' was held constant at a value of 32
We will now see an example of using caret to perform hyperparameter tuning on an elasticnet model.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0,1, by=0.2),
lambda=seq(0,1,length=100))
model_4 <- train(Price ~ ., nyc, method="glmnet",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
best_ix_4 <- which.max(model_4$results$Rsquared)
model_4$results[best_ix_4, ]
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 508 1 0.07070707 5.69067 0.6281646 4.541926 1.294004 0.1412212
## MAESD
## 508 1.006432
We will use plot() to see how the cross-validation estimates of R-squared vary with respect to the hyperparameters.
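plot(model_4)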
As a strange quirk, when using train with method="glmnet", the finalModel attribute contains a matrix of coefficient estimates for multiple values of lambda, but only for the optimal value of alpha. We can extract the coefficients for the optimal model as follows:
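coef(model_4$finalModel, s=model_4$bestTune$lambda)  # coefficients at the selected lambda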
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 42.6964286
## Food 2.9917039
## Decor 5.0963233
## Service .
## Wait 0.5020165
## East 0.9210049
We will perform an independent round of cross-validation to assess out-of-sample performance.
set.seed(2)
best_params_4 <- model_4$bestTune
tc <- trainControl(method="cv", number=10)
model_4_best <- train(Price ~ ., nyc, method="glmnet",
preProcess=c("center", "scale"),
tuneGrid=best_params_4, trControl=tc,
metric="Rsquared")
model_4_best
## glmnet
##
## 168 samples
## 5 predictor
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 152, 152, 150, 150, 151, 151, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.653275 0.6451659 4.51807
##
## Tuning parameter 'alpha' was held constant at a value of 1
##
## Tuning parameter 'lambda' was held constant at a value of 0.07070707
The train function from caret will automatically take care of one-hot encoding qualitative features for us. We will illustrate this using the diamonds dataset.
diamonds <- read.table("data/diamonds.txt", sep="\t", header = TRUE)
diamonds <- diamonds[,c("carat", "cut", "color", "clarity", "price")]
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## price
## Min. : 326
## 1st Qu.: 950
## Median : 2401
## Mean : 3933
## 3rd Qu.: 5324
## Max. :18823
##
We will now use cross-validation to select elasticnet hyperparameters for this dataset.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0.2, 1, by=0.2),
lambda=exp(seq(-3, 2,length=100)))
model_5 <- train(price ~ ., diamonds, method="glmnet",
preProcess = c("range"),
tuneGrid=param_grid, trControl=tc,
metric="Rsquared")
best_ix <- which.max(model_5$results$Rsquared)
model_5$results[best_ix, ]
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 301 0.8 0.04978707 1157.585 0.9158561 801.4206 21.31721 0.001778996
## MAESD
## 301 10.24947
We will now plot the cross-validation estimates of out-of-sample R-squared.
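plot(model_5)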
The coefficients in our final model are displayed below.
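coef(model_5$finalModel, s=model_5$bestTune$lambda)  # coefficients at the selected lambda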
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5322.0557
## carat 42668.8359
## cutGood 622.5196
## cutIdeal 969.4578
## cutPremium 839.3194
## cutVery Good 819.9871
## colorE -194.2664
## colorF -285.6780
## colorG -487.0998
## colorH -959.7280
## colorI -1418.2183
## colorJ -2300.7144
## clarityIF 5163.9155
## claritySI1 3329.1471
## claritySI2 2384.4157
## clarityVS1 4287.2468
## clarityVS2 3972.9075
## clarityVVS1 4820.6038
## clarityVVS2 4718.7002
In the code chunk below, we will retrain a new model using the best combination of hyperparameters found in order to estimate out-of-sample performance.
set.seed(2)
best_params_5 <- model_5$bestTune
tc <- trainControl(method="cv", number=10)
model_5_best <- train(price ~ ., diamonds, method="glmnet",
preProcess = c("range"),
tuneGrid=best_params_5, trControl=tc,
metric="Rsquared")
model_5_best
## glmnet
##
## 53940 samples
## 4 predictor
##
## Pre-processing: re-scaling to [0, 1] (18)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48546, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1157.426 0.9158503 801.4574
##
## Tuning parameter 'alpha' was held constant at a value of 0.8
##
## Tuning parameter 'lambda' was held constant at a value of 1.140325
We will now see how to use train to perform cross-validation to select elasticnet hyperparameters for logistic regression.
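We will use the Wisconsin Breast Cancer dataset, stored in a data frame named wbc. The loading step below is a sketch that assumes a tab-separated file like our other datasets; the path is hypothetical.
wbc <- read.table("data/wbc.txt", sep="\t", header=TRUE)  # hypothetical path
wbc$diagnosis <- factor(wbc$diagnosis)  # ensure the response is a factor
summary(wbc)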
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median :0.006380 Median :0.020450 Median :0.02589
## Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
We will now perform cross-validation and display information relating to the best model.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=seq(0, 1, by=0.5),
lambda=exp(seq(-6, -2,length=20)))
model_6 <- train(diagnosis ~ ., wbc, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
#model_6$bestTune
best_ix_6 <- which.max(model_6$results$Accuracy)
model_6$results[best_ix_6, ]
## alpha lambda Accuracy Kappa AccuracySD KappaSD
## 23 0.5 0.003776539 0.9755272 0.9472677 0.02046019 0.04403434
We will now plot the cross-validation estimates for out-of-sample accuracy.
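plot(model_6)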
The coefficients for the best model found are shown below.
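coef(model_6$finalModel, s=model_6$bestTune$lambda)  # coefficients at the selected lambda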
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.315639716
## radius_mean 0.285506745
## texture_mean 0.371914497
## perimeter_mean 0.235269424
## area_mean 0.310513202
## smoothness_mean .
## compactness_mean .
## concavity_mean 0.462308660
## concave.points_mean 0.734380315
## symmetry_mean .
## fractal_dimension_mean -0.219396320
## radius_se 1.131353911
## texture_se -0.074450743
## perimeter_se 0.391405864
## area_se 0.653019321
## smoothness_se 0.104321239
## compactness_se -0.535356211
## concavity_se .
## concave.points_se 0.001014363
## symmetry_se -0.142199830
## fractal_dimension_se -0.255251353
## radius_worst 0.988305173
## texture_worst 0.992847852
## perimeter_worst 0.763287310
## area_worst 0.858604086
## smoothness_worst 0.671143708
## compactness_worst .
## concavity_worst 0.655779729
## concave.points_worst 0.941234045
## symmetry_worst 0.603554472
## fractal_dimension_worst .
We will now retrain a new model with a slightly higher degree of regularization, but with a similar cross-validation score.
set.seed(2)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(alpha=0.5,
lambda=exp(-4))
model_6_alt <- train(diagnosis ~ ., wbc, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
model_6_alt
## glmnet
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 511, 512, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9736475 0.9426502
##
## Tuning parameter 'alpha' was held constant at a value of 0.5
##
## Tuning parameter 'lambda' was held constant at a value of 0.01831564
The coefficients of this alternate model are shown below.
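coef(model_6_alt$finalModel, s=model_6_alt$bestTune$lambda)  # coefficients at lambda = exp(-4)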
## 31 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.5741890
## radius_mean 0.3141484
## texture_mean 0.2490084
## perimeter_mean 0.2857658
## area_mean 0.2378809
## smoothness_mean .
## compactness_mean .
## concavity_mean 0.1064340
## concave.points_mean 0.4673911
## symmetry_mean .
## fractal_dimension_mean .
## radius_se 0.4429637
## texture_se .
## perimeter_se 0.1416516
## area_se 0.1355298
## smoothness_se .
## compactness_se .
## concavity_se .
## concave.points_se .
## symmetry_se .
## fractal_dimension_se -0.1075690
## radius_worst 0.6555300
## texture_worst 0.5875112
## perimeter_worst 0.5590332
## area_worst 0.4669748
## smoothness_worst 0.4270019
## compactness_worst .
## concavity_worst 0.2660941
## concave.points_worst 0.6538142
## symmetry_worst 0.3069766
## fractal_dimension_worst .
In this final example, we will use train to perform hyperparameter tuning for KNN classification.
set.seed(1)
tc <- trainControl(method="cv", number=10)
param_grid <- expand.grid(k=1:60)
model_7 <- train(diagnosis ~ ., wbc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=param_grid, trControl=tc,
metric="Accuracy")
best_ix_7 <- which.max(model_7$results$Accuracy)
model_7$results[best_ix_7, ]
## k Accuracy Kappa AccuracySD KappaSD
## 10 10 0.9701095 0.9346352 0.0235881 0.05246313
We will plot the cross-validation estimates as a function of K.
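plot(model_7)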
We will now use cross-validation to estimate the best model’s out-of-sample performance.
set.seed(2)
tc <- trainControl(method="cv", number=10)
best_params_7 <- model_7$bestTune
model_7_best <- train(diagnosis ~ ., wbc, method="knn",
preProcess=c("center", "scale"),
tuneGrid=best_params_7, trControl=tc,
metric="Accuracy")
model_7_best
## k-Nearest Neighbors
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 512, 513, 511, 512, 511, 512, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9719558 0.9392185
##
## Tuning parameter 'k' was held constant at a value of 10