Caret

Harold Nelson

04/20/2022

The task is to predict the gender of a person based on other characteristics.

This document works through several models using the caret package. It uses the cleaned version of the cdc data (the data frame cdc2).

The Data and Packages

# Load the cleaned cdc data; this creates the data frame cdc2
load("cdc2.Rdata")
library(class)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(caTools)
library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(e1071)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(mboost)
## Warning: package 'mboost' was built under R version 4.1.2
## Loading required package: parallel
## Loading required package: stabs
## 
## Attaching package: 'mboost'
## The following object is masked from 'package:ggplot2':
## 
##     %+%
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.1.2
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(ranger)
library(gbm)
## Loaded gbm 2.1.8
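
The masking messages above are normal when plyr, dplyr, and MASS are loaded together, but note that MASS masks dplyr's select. If you want a quieter document, one option (an aside, not required for the tasks) is to wrap the library calls:

# Optional: load packages without the startup chatter
suppressPackageStartupMessages({
  library(plyr)
  library(dplyr)
  library(MASS)
})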

Task 1

Use the function createDataPartition from the caret package to split the cdc2 data frame into traindf and testdf using an 80/20 split. Use table to examine the distribution of gender in cdc2, traindf, and testdf. The distributions should be very similar.

Solution

set.seed(123)

# createDataPartition samples within each level of gender, so the
# split is stratified and the distributions should match closely
inTrain = createDataPartition(cdc2$gender, p = 0.8, list = FALSE)
traindf = cdc2[inTrain, ]
testdf = cdc2[-inTrain, ]
table(cdc2$gender)/nrow(cdc2)
## 
##         m         f 
## 0.4783718 0.5216282
table(traindf$gender)/nrow(traindf)
## 
##         m         f 
## 0.4783723 0.5216277
table(testdf$gender)/nrow(testdf)
## 
##         m         f 
## 0.4783696 0.5216304

Task 2

Create myControl specifying 5-fold cross-validation.

Use the train function in caret to create a model, glm, based on the data in traindf. In this model use height, weight, exerany and smoke100 to predict gender. Use the method “glm”.

Display the model.

Solution

myControl = trainControl(method = "cv", number = 5, verboseIter = FALSE)

glm = train(gender~height + weight + exerany + smoke100,
             data = traindf,
             method="glm",
             trControl = myControl)

#Display the model.
glm
## Generalized Linear Model 
## 
## 15998 samples
##     4 predictor
##     2 classes: 'm', 'f' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12798, 12798, 12799, 12798, 12799 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8549833  0.7091209
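
The accuracy above is the cross-validated estimate, not test accuracy. If you also want to see the fitted coefficients, one quick sketch is to pull the underlying glm fit out of the train object via its finalModel slot:

# The underlying glm fit is stored in the train object's finalModel slot
summary(glm$finalModel)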

Task 3

Create a vector of predictions, pred_glm, for the data in testdf. Look at the head of the predictions.

Solution

# Create predictions for the test data
pred_glm = predict(glm, testdf)
head(pred_glm)

Task 4

Show the confusion matrix for the test data.

Solution

confusionMatrix(pred_glm,testdf$gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m 1597  287
##          f  316 1799
##                                           
##                Accuracy : 0.8492          
##                  95% CI : (0.8377, 0.8602)
##     No Information Rate : 0.5216          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6977          
##                                           
##  Mcnemar's Test P-Value : 0.2542          
##                                           
##             Sensitivity : 0.8348          
##             Specificity : 0.8624          
##          Pos Pred Value : 0.8477          
##          Neg Pred Value : 0.8506          
##              Prevalence : 0.4784          
##          Detection Rate : 0.3993          
##    Detection Prevalence : 0.4711          
##       Balanced Accuracy : 0.8486          
##                                           
##        'Positive' Class : m               
## 
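
A side note: confusionMatrix returns an object, not just printed text, so individual statistics can be pulled out programmatically. A small sketch:

# Store the result and extract the overall accuracy
cm_glm = confusionMatrix(pred_glm, testdf$gender)
cm_glm$overall["Accuracy"]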

Task 5

Look at the caret model list and search for models related to glm. You will find one you can use under the method name “glmboost”. You do need to have the plyr and mboost packages loaded. Estimate the model and display it.

Solution

glmboost = train(gender~height + weight + exerany + smoke100,
             data = traindf,
             method="glmboost",
             trControl = myControl)
glmboost
## Boosted Generalized Linear Model 
## 
## 15998 samples
##     4 predictor
##     2 classes: 'm', 'f' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12799, 12799, 12798, 12798, 12798 
## Resampling results across tuning parameters:
## 
##   mstop  Accuracy   Kappa    
##    50    0.8508558  0.7008878
##   100    0.8532935  0.7057952
##   150    0.8546063  0.7084065
## 
## Tuning parameter 'prune' was held constant at a value of no
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mstop = 150 and prune = no.
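
Notice that accuracy was still rising at mstop = 150, the edge of caret's default grid, so a wider grid might do slightly better. A sketch, assuming caret's glmboost tuning parameters mstop and prune (glmboost2 is just an illustrative name):

# Wider tuning grid; both tuning parameters must appear in the grid
boostGrid = expand.grid(mstop = seq(50, 500, by = 50),
                        prune = "no")

glmboost2 = train(gender~height + weight + exerany + smoke100,
                  data = traindf,
                  method = "glmboost",
                  trControl = myControl,
                  tuneGrid = boostGrid)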

Task 6

Produce predictions for the test data in pred_glmboost.

Solution

pred_glmboost = predict(glmboost, testdf)

Task 7

Create the confusion matrix for the test data and predictions.

Solution

confusionMatrix(pred_glmboost, testdf$gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m 1602  288
##          f  311 1798
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8388, 0.8611)
##     No Information Rate : 0.5216          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6997          
##                                           
##  Mcnemar's Test P-Value : 0.3687          
##                                           
##             Sensitivity : 0.8374          
##             Specificity : 0.8619          
##          Pos Pred Value : 0.8476          
##          Neg Pred Value : 0.8525          
##              Prevalence : 0.4784          
##          Detection Rate : 0.4006          
##    Detection Prevalence : 0.4726          
##       Balanced Accuracy : 0.8497          
##                                           
##        'Positive' Class : m               
## 

Task 8

Try the gradient boosting model using method “gbm”. In the call to train, set verbose = FALSE.

Do the usual:

  1. Estimate the model
  2. Display it
  3. Make predictions for the test data
  4. Make a confusion matrix for the test data

Solution

gbm <- train(gender~height + weight + exerany + smoke100, data = traindf,
             method = "gbm",
             trControl = myControl,
             verbose = FALSE)
gbm
## Stochastic Gradient Boosting 
## 
## 15998 samples
##     4 predictor
##     2 classes: 'm', 'f' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12799, 12798, 12799, 12798, 12798 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8577949  0.7143309
##   1                  100      0.8624204  0.7238367
##   1                  150      0.8621080  0.7232547
##   2                   50      0.8620454  0.7230612
##   2                  100      0.8644832  0.7280230
##   2                  150      0.8631707  0.7254516
##   3                   50      0.8630456  0.7251339
##   3                  100      0.8644831  0.7280383
##   3                  150      0.8639206  0.7268835
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.
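
The winning values (n.trees = 100, interaction.depth = 2) sit inside caret's default grid, but you can take control of the search with a custom grid. A sketch using caret's standard gbm parameter names (gbm2 is just an illustrative name):

# Custom gbm grid; all four tuning parameters must be present
gbmGrid = expand.grid(n.trees = c(100, 200, 300),
                      interaction.depth = 1:3,
                      shrinkage = c(0.05, 0.1),
                      n.minobsinnode = 10)

gbm2 = train(gender~height + weight + exerany + smoke100, data = traindf,
             method = "gbm",
             trControl = myControl,
             tuneGrid = gbmGrid,
             verbose = FALSE)
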
pred_gbm = predict(gbm,testdf)
confusionMatrix(pred_gbm, testdf$gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m 1598  265
##          f  315 1821
##                                           
##                Accuracy : 0.855           
##                  95% CI : (0.8437, 0.8657)
##     No Information Rate : 0.5216          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7091          
##                                           
##  Mcnemar's Test P-Value : 0.04189         
##                                           
##             Sensitivity : 0.8353          
##             Specificity : 0.8730          
##          Pos Pred Value : 0.8578          
##          Neg Pred Value : 0.8525          
##              Prevalence : 0.4784          
##          Detection Rate : 0.3996          
##    Detection Prevalence : 0.4659          
##       Balanced Accuracy : 0.8541          
##                                           
##        'Positive' Class : m               
## 

Task 9

Try a random forest using “ranger”.

Solution

ranger <- train(gender~height + weight + exerany + smoke100, data = traindf,
                method = "ranger",
                trControl = myControl,
                verbose = FALSE)
ranger
## Random Forest 
## 
## 15998 samples
##     4 predictor
##     2 classes: 'm', 'f' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12799, 12798, 12799, 12798, 12798 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##   2     gini        0.8626080  0.7241665
##   2     extratrees  0.8613576  0.7215392
##   3     gini        0.8556070  0.7102545
##   3     extratrees  0.8617954  0.7225081
##   4     gini        0.8470437  0.6932417
##   4     extratrees  0.8487313  0.6965568
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.
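
If you also want to know which predictors the forest relies on, ranger can track impurity-based variable importance; caret passes extra arguments straight through to ranger. A sketch (importance = "impurity" is a ranger argument, not a caret tuning parameter, and ranger_imp is just an illustrative name):

# Refit with impurity importance so varImp() has something to report
ranger_imp <- train(gender~height + weight + exerany + smoke100, data = traindf,
                    method = "ranger",
                    trControl = myControl,
                    importance = "impurity",
                    verbose = FALSE)
varImp(ranger_imp)
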
pred_ranger = predict(ranger,testdf)


confusionMatrix(pred_ranger, testdf$gender)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    m    f
##          m 1587  254
##          f  326 1832
##                                           
##                Accuracy : 0.855           
##                  95% CI : (0.8437, 0.8657)
##     No Information Rate : 0.5216          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7089          
##                                           
##  Mcnemar's Test P-Value : 0.003197        
##                                           
##             Sensitivity : 0.8296          
##             Specificity : 0.8782          
##          Pos Pred Value : 0.8620          
##          Neg Pred Value : 0.8489          
##              Prevalence : 0.4784          
##          Detection Rate : 0.3968          
##    Detection Prevalence : 0.4604          
##       Balanced Accuracy : 0.8539          
##                                           
##        'Positive' Class : m               
## 

The Big Question

Which of these models performed best in the sense of accuracy ON THE TEST DATA?

Solution

We'll do this and discuss it next time.
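
As a starting point, here is a sketch of one way to get all four test accuracies in a single step; the list entries are the fitted train objects created above.

# Test-set accuracy for each model: proportion of correct predictions
models = list(glm = glm, glmboost = glmboost, gbm = gbm, ranger = ranger)
sapply(models, function(m) mean(predict(m, testdf) == testdf$gender))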