First things first, let's load the caret library so that it's available for us to use.
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
We'll also set a random seed so that our results are reproducible.
set.seed(1364)
Now we'll load in the Alzheimer's data set that we've been working with in previous sessions.
Alz <- read.csv(file.choose())
dim(Alz)
## [1] 333 132
summary(Alz$gender)
## female Female M male Male
## 2 202 3 2 124
Before we go on, let's clean up one of our variables: gender has been coded inconsistently ('female'/'Female' and 'M'/'male'/'Male' all appear).
Alz$gender <- ifelse(Alz$gender %in% c('female', 'Female'), 'Female', 'Male')
Alz$gender <- as.factor(Alz$gender)
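As a quick sanity check (not part of the original output), re-running summary() should now report only the two cleaned-up levels:
summary(Alz$gender)   # should now show only 'Female' and 'Male'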
Partitioning Data Into Training and Test Sets
Typically, one of the first things we want to do before building any prediction model is to split our data into a 'training' set that will be used to teach the model and a 'test' set that will be held out to validate it.
The caret package has some really simple built-in functions for partitioning data. Let's use createDataPartition() to split the Alzheimer's data set into training and test sets.
samples_for_training <- createDataPartition(Alz$response, p = .75, list = FALSE)
training_samples <- Alz[samples_for_training,]
test_samples <- Alz[-samples_for_training,]
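createDataPartition() samples within each class, so the class balance of response should be roughly preserved in both sets. A quick check (output not shown):
prop.table(table(training_samples$response))  # class proportions in the training set
prop.table(table(test_samples$response))      # should be roughly the same here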
Ready to Build a Model
We're now ready to build a model using any of the machine learning algorithms that are built into caret.
There are more than 200 machine learning algorithms available in the caret package.
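If you'd like to browse that list yourself, caret ships with a couple of helper functions (a quick sketch; output not shown):
length(names(getModelInfo()))   # how many methods this caret installation knows about
modelLookup("gbm")              # the tuning parameters for a given method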
Whichever algorithm we choose, we can generate a model using the train() function.
my_model_gbm <- train(response ~ ., data = training_samples, method = "gbm")
That’s it.
We’ve got a model now that we can use to make predictions.
Taking a look at the model we built:
my_model_gbm
## Stochastic Gradient Boosting
##
## 251 samples
## 131 predictors
## 2 classes: 'Impaired', 'NotImpaired'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 251, 251, 251, 251, 251, 251, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8016530 0.4285405
## 1 100 0.8240044 0.5103785
## 1 150 0.8240777 0.5195473
## 2 50 0.8192888 0.4959376
## 2 100 0.8268481 0.5236718
## 2 150 0.8247815 0.5228646
## 3 50 0.8286565 0.5213867
## 3 100 0.8230417 0.5122095
## 3 150 0.8240807 0.5197905
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 3, shrinkage = 0.1 and n.minobsinnode = 10.
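By default train() resamples with 25 bootstrap replicates and tries a small default grid of tuning values, which is exactly what the output above reflects. If we wanted more control, a sketch using trainControl() and an explicit tuneGrid might look like this (the parameter values here are illustrative, not recommendations):
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation instead of bootstrapping
grid <- expand.grid(n.trees = c(50, 100, 150),
                    interaction.depth = 1:3,
                    shrinkage = 0.1,
                    n.minobsinnode = 10)
my_model_gbm_cv <- train(response ~ ., data = training_samples, method = "gbm",
                         trControl = ctrl, tuneGrid = grid, verbose = FALSE)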
Generating Predictions with Our Model
Let's make some predictions using the test data set and see how the model performs.
To generate predictions we use the predict() function.
my_gbm_predictions <- predict(my_model_gbm, newdata = test_samples)
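predict() returns predicted classes by default. For classification models that support it (gbm does), passing type = "prob" returns class probabilities instead, which is useful if we ever want to pick our own decision threshold:
my_gbm_probs <- predict(my_model_gbm, newdata = test_samples, type = "prob")
head(my_gbm_probs)   # one column of probabilities per class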
Model Performance
Let's summarize model performance with a confusion matrix!
cMatrix_gbm <- confusionMatrix(my_gbm_predictions, test_samples$response)
cMatrix_gbm
## $positive
## [1] "Impaired"
##
## $table
## Reference
## Prediction Impaired NotImpaired
## Impaired 13 1
## NotImpaired 9 59
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.878048780 0.648972603 0.787141173 0.939941987 0.731707317
## AccuracyPValue McnemarPValue
## 0.001090334 0.026856696
##
## $byClass
## Sensitivity Specificity Pos Pred Value
## 0.5909091 0.9833333 0.9285714
## Neg Pred Value Precision Recall
## 0.8676471 0.9285714 0.5909091
## F1 Prevalence Detection Rate
## 0.7222222 0.2682927 0.1585366
## Detection Prevalence Balanced Accuracy
## 0.1707317 0.7871212
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
Switching Algorithms
The caret package makes it exceptionally easy to try a different algorithm. Let's swap in a generalized linear model (method = "glm") and repeat exactly the same steps.
my_model_glm <- train(response ~ ., data = training_samples, method = "glm")
my_glm_predictions <- predict(my_model_glm, newdata = test_samples)
cMatrix_glm <- confusionMatrix(my_glm_predictions, test_samples$response)
cMatrix_glm
## $positive
## [1] "Impaired"
##
## $table
## Reference
## Prediction Impaired NotImpaired
## Impaired 17 17
## NotImpaired 5 43
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.73170732 0.41731266 0.62244024 0.82361014 0.73170732
## AccuracyPValue McnemarPValue
## 0.55706563 0.01901647
##
## $byClass
## Sensitivity Specificity Pos Pred Value
## 0.7727273 0.7166667 0.5000000
## Neg Pred Value Precision Recall
## 0.8958333 0.5000000 0.7727273
## F1 Prevalence Detection Rate
## 0.6071429 0.2682927 0.2073171
## Detection Prevalence Balanced Accuracy
## 0.4146341 0.7446970
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
Comparing the two models by their test-set accuracy:
cMatrix_gbm$overall[1]
## Accuracy
## 0.8780488
cMatrix_glm$overall[1]
## Accuracy
## 0.7317073
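A single accuracy number from one test split can be noisy. caret's resamples() function lets us compare the two models across all of their resampling iterations instead (a sketch; for a strictly paired comparison you would fix the resampling indices in trainControl(), which we did not do above):
resamps <- resamples(list(GBM = my_model_gbm, GLM = my_model_glm))
summary(resamps)   # accuracy and kappa distributions for both models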
Improving Our Models
For some algorithms to run, or to perform at their best, we may need to preprocess our data. Let's look at the distributions of a couple of variables.
hist(Alz$tau, breaks = 30)
hist(Alz$age, breaks = 30)
Scaling Data
Using the scale() function we can rescale a variable to have a mean of 0 and a standard deviation of 1.
tau_scaled <- scale(Alz$tau, center = TRUE)
age_scaled <- scale(Alz$age, center = TRUE)
hist(tau_scaled, breaks = 30)
hist(age_scaled, breaks = 30)
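Rather than scaling columns one at a time, caret can handle this at training time: train() accepts a preProcess argument, and the transformation learned on the training set is then applied automatically to new data at prediction time. A minimal sketch:
my_model_glm_scaled <- train(response ~ ., data = training_samples, method = "glm",
                             preProcess = c("center", "scale"))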
Dummy Variables
Some machine learning algorithms can't handle factor variables; they must be turned into numeric data first. This process is referred to as creating dummy variables.
So which of our variables are factor variables?
Alz_var_classes <- sapply(Alz, class)
which(Alz_var_classes == 'factor')
## Genotype response gender
## 130 131 132
Genotype, response and gender are all factor variables.
Let's see how we can turn one of them into dummy variables using dummyVars() in caret.
dmy <- dummyVars(~ Genotype, data = Alz)
dmy_df <- data.frame(predict(dmy, newdata = Alz))
head(dmy_df)
## Genotype.E2E2 Genotype.E2E3 Genotype.E2E4 Genotype.E3E3 Genotype.E3E4
## 1 0 0 0 1 0
## 2 0 0 0 0 1
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 1 0
## 6 0 0 0 0 0
## Genotype.E4E4
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
head(Alz$Genotype)
## [1] E3E3 E3E4 E3E4 E3E4 E3E3 E4E4
## Levels: E2E2 E2E3 E2E4 E3E3 E3E4 E4E4
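One option worth knowing about (not used above): by default dummyVars() creates one indicator column per level, so the columns sum to 1 and are linearly dependent. For models like glm a full-rank parameterization is usually preferred, which dummyVars() supports via fullRank = TRUE:
dmy_full_rank <- dummyVars(~ Genotype, data = Alz, fullRank = TRUE)
head(data.frame(predict(dmy_full_rank, newdata = Alz)))   # drops one level's column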