This is an example of how to apply the CARET machine learning package in R to classify individuals or objects based upon covariates. I used an artificial data set that contains students' ABCs (attendance, behavior, and course performance) as covariates to predict whether or not a student will drop out (1 = dropout; 0 = non-dropout). The ABCs are common indicators of whether or not students will drop out (Bruce et al., 2011). Therefore, I have three variables for each student: attendance rate, number of suspensions, and GPA. The goal is to develop an algorithm that uses current information on dropout and non-dropout students to predict whether or not future students will drop out.
This example is based upon the example provided by the creators of the CARET package, who demonstrate a very similar process with a different data set. Their example can be found here: http://topepo.github.io/caret/model-training-and-tuning.html
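The data set itself is artificial, and the code that generated it is not shown in the original write-up. As a rough sketch, a similar data frame could be simulated as follows; the column names, the total of 200 students (the model output below reports 150 training samples, which at a 75% split implies roughly 200 students in total), and the call to head are assumptions based on the output that follows, so the simulated values will not match the printed rows exactly.

set.seed(1234)
n <- 200  # assumed total; 75% of 200 gives the 150 training samples reported below

# Simulate 100 dropouts (lower attendance and GPA, more suspensions) and 100 non-dropouts
dropouts <- data.frame(attendance = round(runif(n/2, 70, 85)),
                       suspensions = sample(2:10, n/2, replace = TRUE),
                       gpa = round(runif(n/2, 1.0, 3.5), 1),
                       dropout = "1")
nondropouts <- data.frame(attendance = round(runif(n/2, 85, 100)),
                          suspensions = sample(0:3, n/2, replace = TRUE),
                          gpa = round(runif(n/2, 2.5, 4.0), 1),
                          dropout = "0")
dat <- rbind(dropouts, nondropouts)
dat$dropout <- factor(dat$dropout)  # the outcome must be a factor for classification

head(dat)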
##   attendance suspensions gpa dropout
## 1         77           4 2.5       1
## 2         79           6 3.3       1
## 3         78           8 1.3       1
## 4         79           6 2.3       1
## 5         75           6 2.7       1
## 6         71           8 2.9       1
Next we need to partition the data into training and testing sets. The createDataPartition function in CARET does this by taking a stratified random sample of 75% of the data for training.
We then create the training and testing data sets, which will be used to develop and evaluate the model, respectively.
library(caret)

# Stratified random sample: 75% of the rows go to training, the rest to testing
inTrain <- createDataPartition(y = dat$dropout, p = .75, list = FALSE)
training <- dat[inTrain, ]
testing  <- dat[-inTrain, ]
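As a quick check, you can confirm that about 75% of the students ended up in the training set and that both classes appear in it:

nrow(training)           # should be about 150 (75% of the data)
nrow(testing)            # the remaining 25%
table(training$dropout)  # both dropouts and non-dropouts should be present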
Here we create the cross-validation method that CARET will use when fitting the model. Cross-validation means randomly splitting the training data into k folds (in our case ten), each of which is held out in turn as a test set, and the repeated part means repeating the whole k-fold process a number of times (in our case ten as well).
fitControl <- trainControl(
  method = "repeatedcv",  # repeated k-fold cross-validation
  number = 10,            # k = 10 folds
  repeats = 10)           # repeat the 10-fold procedure 10 times
Now we are ready to develop the model. We use the train function in CARET to regress the dependent variable dropout onto all of the other covariates. Instead of explicitly naming all of the covariates, the "." in the formula tells CARET to include all of the other variables in the data set.
Next the method, or type of model, is selected. Here we are using gbm, or Stochastic Gradient Boosting, which can be used for both regression and classification. More information about the gbm package can be found here: https://cran.r-project.org/web/packages/gbm/gbm.pdf
The trControl argument is used to assign the resampling method created above: run the gbm model with ten-fold cross-validation and repeat that process ten times. Finally, verbose = FALSE hides the iteration-by-iteration output that gbm would otherwise print.
set.seed(12345)
gbmFit1 <- train(dropout ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE)
Let's now inspect the results. The most important piece of information is the accuracy, because that is what CARET uses to choose the final model. It is the overall agreement rate between the predicted and observed classes, averaged across the cross-validation resamples. The Kappa is another statistic used for assessing models with categorical outcomes such as ours; it measures agreement after adjusting for the agreement expected by chance.
CARET chose the model with an interaction depth of 1 and 100 trees, which had an accuracy of about 95% and a Kappa of about .90.
gbmFit1
## Stochastic Gradient Boosting
##
## 150 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 135, 136, 135, 136, 135, 136, ...
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa
##   1                   50      0.9476548  0.8953069
##   1                  100      0.9522024  0.9042521
##   1                  150      0.9497976  0.8994304
##   2                   50      0.9476667  0.8951896
##   2                  100      0.9490417  0.8979223
##   2                  150      0.9476667  0.8951165
##   3                   50      0.9477024  0.8953157
##   3                  100      0.9445833  0.8891125
##   3                  150      0.9417321  0.8833247
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
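If you prefer to see these resampling results graphically, CARET provides a plot method for train objects that displays accuracy across the tuning grid:

plot(gbmFit1)  # accuracy by number of boosting iterations and interaction depth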
Finally, we can use the trained model to predict both classifications and probabilities for the test data set.
The code below uses the built-in predict function with the trained model (gbmFit1) to predict values for the testing data set, which is the 25% of the data that we set aside at the beginning of this example. We wrap the testing data in "head" so that R does not display predictions for the entire data set; if "head" were removed, R would display all of the predictions.
This call includes the argument type = "prob", which tells R to display the probabilities that a student is classified as a non-dropout (0) or a dropout (1). As we can see, there is a 98% probability that the first student in the testing data set will drop out.
predict(gbmFit1, newdata = head(testing), type = "prob")
##             0         1
## 1 0.016978526 0.9830215
## 2 0.002998704 0.9970013
## 3 0.321006595 0.6789934
## 4 0.004878295 0.9951217
## 5 0.098959954 0.9010400
## 6 0.001283639 0.9987164
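To obtain predicted classes rather than probabilities, drop the type = "prob" argument; predict then returns the class labels. The predicted classes can also be compared with the observed dropout values in the testing set using CARET's confusionMatrix function, which reports the accuracy, Kappa, sensitivity, and specificity on the held-out data. A minimal sketch (assuming, as above, that dropout is stored as a factor):

# Predicted class labels (0/1) for the first few students in the testing set
predict(gbmFit1, newdata = head(testing))

# Evaluate the model on the full testing set
confusionMatrix(predict(gbmFit1, newdata = testing), testing$dropout)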
Bruce, M., Bridgeland, J. M., Fox, J. H., & Balfanz, R. (2011). On Track for Success: The Use of Early Warning Indicator and Intervention Systems to Build a Grad Nation. Civic Enterprises.