Harold Nelson
04/20/2022
The task is to predict the gender of a person based on other characteristics.
This document works through several models using the caret package. It uses the cleaned version of the cdc data.
Loading the packages for this document produces the usual startup messages; the packages attached are caret (with ggplot2 and lattice), plyr, dplyr, MASS, mboost (with parallel and stabs), xgboost, and gbm 2.1.8.
Use the function createDataPartition from the caret package to split the cdc2 data frame into traindf and testdf using an 80/20 split. Use table to examine the distribution of gender in cdc2, traindf, and testdf. The distributions should be very similar.
set.seed(123)
inTrain = createDataPartition(cdc2$gender, p = .8, list = FALSE)
traindf = cdc2[inTrain,]
testdf = cdc2[-inTrain,]
table(cdc2$gender)/nrow(cdc2)
##
## m f
## 0.4783718 0.5216282
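# A minimal sketch (assumed) of the call producing the next output: gender distribution in the training set.
table(traindf$gender)/nrow(traindf)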
##
## m f
## 0.4783723 0.5216277
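# A minimal sketch (assumed) of the call producing the next output: gender distribution in the test set.
table(testdf$gender)/nrow(testdf)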
##
## m f
## 0.4783696 0.5216304
Create myControl specifying 5-fold cross-validation.
Use the train function in caret to create a model, glm, based on the data in traindf. In this model use height, weight, exerany and smoke100 to predict gender. Use the method “glm”.
Display the model.
myControl = trainControl(method = "cv", number = 5, verboseIter = FALSE)
glm = train(gender ~ height + weight + exerany + smoke100,
            data = traindf,
            method = "glm",
            trControl = myControl)
# Display the model.
glm
## Generalized Linear Model
##
## 15998 samples
## 4 predictor
## 2 classes: 'm', 'f'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12798, 12798, 12799, 12798, 12799
## Resampling results:
##
## Accuracy Kappa
## 0.8549833 0.7091209
Create a vector of predictions, pred_glm, for the data in testdf. Look at the head of the predictions.
Show the confusion matrix for the test data.
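A minimal sketch of these two steps, using caret's predict and confusionMatrix on the glm model and the testdf data from above:

pred_glm = predict(glm, newdata = testdf)
head(pred_glm)
confusionMatrix(pred_glm, testdf$gender)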
## Confusion Matrix and Statistics
##
## Reference
## Prediction m f
## m 1597 287
## f 316 1799
##
## Accuracy : 0.8492
## 95% CI : (0.8377, 0.8602)
## No Information Rate : 0.5216
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6977
##
## Mcnemar's Test P-Value : 0.2542
##
## Sensitivity : 0.8348
## Specificity : 0.8624
## Pos Pred Value : 0.8477
## Neg Pred Value : 0.8506
## Prevalence : 0.4784
## Detection Rate : 0.3993
## Detection Prevalence : 0.4711
## Balanced Accuracy : 0.8486
##
## 'Positive' Class : m
##
Look at the caret model list and search for models related to glm. You will find one you can use under the method name “glmboost”. You will need the plyr and mboost packages loaded. Estimate the model and display it.
glmboost = train(gender ~ height + weight + exerany + smoke100,
                 data = traindf,
                 method = "glmboost",
                 trControl = myControl)
glmboost
## Boosted Generalized Linear Model
##
## 15998 samples
## 4 predictor
## 2 classes: 'm', 'f'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12799, 12799, 12798, 12798, 12798
## Resampling results across tuning parameters:
##
## mstop Accuracy Kappa
## 50 0.8508558 0.7008878
## 100 0.8532935 0.7057952
## 150 0.8546063 0.7084065
##
## Tuning parameter 'prune' was held constant at a value of no
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mstop = 150 and prune = no.
Produce predictions, pred_glmboost, for the test data.
Create the confusion matrix for the test data and predictions.
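A minimal sketch of these steps, following the same pattern as for the glm model:

pred_glmboost = predict(glmboost, newdata = testdf)
confusionMatrix(pred_glmboost, testdf$gender)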
## Confusion Matrix and Statistics
##
## Reference
## Prediction m f
## m 1602 288
## f 311 1798
##
## Accuracy : 0.8502
## 95% CI : (0.8388, 0.8611)
## No Information Rate : 0.5216
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6997
##
## Mcnemar's Test P-Value : 0.3687
##
## Sensitivity : 0.8374
## Specificity : 0.8619
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8525
## Prevalence : 0.4784
## Detection Rate : 0.4006
## Detection Prevalence : 0.4726
## Balanced Accuracy : 0.8497
##
## 'Positive' Class : m
##
Try the gradient boosting model using method “gbm”. In the call to train, set verbose = FALSE.
Then, as before, produce predictions for the test data and display the confusion matrix (a sketch follows the model display below).
gbm <- train(gender ~ height + weight + exerany + smoke100, data = traindf,
             method = "gbm",
             trControl = myControl,
             verbose = FALSE)
gbm
## Stochastic Gradient Boosting
##
## 15998 samples
## 4 predictor
## 2 classes: 'm', 'f'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12799, 12798, 12799, 12798, 12798
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8577949 0.7143309
## 1 100 0.8624204 0.7238367
## 1 150 0.8621080 0.7232547
## 2 50 0.8620454 0.7230612
## 2 100 0.8644832 0.7280230
## 2 150 0.8631707 0.7254516
## 3 50 0.8630456 0.7251339
## 3 100 0.8644831 0.7280383
## 3 150 0.8639206 0.7268835
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
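A minimal sketch of the prediction and confusion-matrix steps (the name pred_gbm is assumed):

pred_gbm = predict(gbm, newdata = testdf)
confusionMatrix(pred_gbm, testdf$gender)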
## Confusion Matrix and Statistics
##
## Reference
## Prediction m f
## m 1598 265
## f 315 1821
##
## Accuracy : 0.855
## 95% CI : (0.8437, 0.8657)
## No Information Rate : 0.5216
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7091
##
## Mcnemar's Test P-Value : 0.04189
##
## Sensitivity : 0.8353
## Specificity : 0.8730
## Pos Pred Value : 0.8578
## Neg Pred Value : 0.8525
## Prevalence : 0.4784
## Detection Rate : 0.3996
## Detection Prevalence : 0.4659
## Balanced Accuracy : 0.8541
##
## 'Positive' Class : m
##
Try a random forest using “ranger”.
ranger <- train(gender ~ height + weight + exerany + smoke100, data = traindf,
                method = "ranger",
                trControl = myControl,
                verbose = FALSE)
ranger
## Random Forest
##
## 15998 samples
## 4 predictor
## 2 classes: 'm', 'f'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12799, 12798, 12799, 12798, 12798
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8626080 0.7241665
## 2 extratrees 0.8613576 0.7215392
## 3 gini 0.8556070 0.7102545
## 3 extratrees 0.8617954 0.7225081
## 4 gini 0.8470437 0.6932417
## 4 extratrees 0.8487313 0.6965568
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
## and min.node.size = 1.
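A minimal sketch of the same steps for the random forest (the name pred_ranger is assumed):

pred_ranger = predict(ranger, newdata = testdf)
confusionMatrix(pred_ranger, testdf$gender)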
## Confusion Matrix and Statistics
##
## Reference
## Prediction m f
## m 1587 254
## f 326 1832
##
## Accuracy : 0.855
## 95% CI : (0.8437, 0.8657)
## No Information Rate : 0.5216
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7089
##
## Mcnemar's Test P-Value : 0.003197
##
## Sensitivity : 0.8296
## Specificity : 0.8782
## Pos Pred Value : 0.8620
## Neg Pred Value : 0.8489
## Prevalence : 0.4784
## Detection Rate : 0.3968
## Detection Prevalence : 0.4604
## Balanced Accuracy : 0.8539
##
## 'Positive' Class : m
##
Which of these models performed best in terms of accuracy on the test data (as opposed to the cross-validation accuracy reported during training)?
Do this, and we will discuss it next time.
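One way to do this, assuming the prediction vectors sketched above (pred_glm, pred_glmboost, pred_gbm, pred_ranger) are in the workspace, is to pull the test-set accuracy out of each confusion matrix:

preds = list(glm = pred_glm, glmboost = pred_glmboost, gbm = pred_gbm, ranger = pred_ranger)
sapply(preds, function(p) confusionMatrix(p, testdf$gender)$overall["Accuracy"])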