First things first, let’s load in the caret library so that it will be available for us to use.

library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1364)

Now we’ll load the Alzheimer’s data set that we’ve been working with in previous sessions.

Alz <- read.csv(file.choose())
dim(Alz)
## [1] 333 132
summary(Alz$gender)
## female Female      M   male   Male 
##      2    202      3      2    124

Before we go on, let’s clean up one of our variables that has some inconsistencies.

Alz$gender <- ifelse(Alz$gender == 'female', 'Female',
                ifelse(Alz$gender == 'M', 'Male',
                  ifelse(Alz$gender == 'male', 'Male',
                    ifelse(Alz$gender == 'Female', 'Female', 'Male'))))
Alz$gender <- as.factor(Alz$gender)
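
As a quick sanity check (a small sketch, assuming the recode above ran as expected), we can confirm that only the two cleaned-up levels remain:

# Only 'Female' and 'Male' should be left after the recode
summary(Alz$gender)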

Partitioning Data Into Training and Test Sets

Typically one of the first things we want to do before building any prediction model is to split our data into a ‘training’ set that will be used to teach the model and a ‘test’ set that will be used to validate it.

The caret package has some really simple built-in functions for partitioning data. Let’s use createDataPartition() to split the Alzheimer’s data set into a training set and a test set.

samples_for_training <- createDataPartition(Alz$response, p = .75, list = FALSE)
training_samples <- Alz[samples_for_training,]
test_samples <- Alz[-samples_for_training,]
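
As a sanity check (a small sketch using the objects created above), we can confirm that roughly 75% of the rows landed in the training set and that createDataPartition() kept the class proportions of response similar in both sets:

# Row counts for the two partitions
nrow(training_samples)
nrow(test_samples)

# createDataPartition() samples within each class, so the class
# proportions should be similar in the training and test sets
prop.table(table(training_samples$response))
prop.table(table(test_samples$response))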

Ready to Build a Model

We’re now ready to build a model using any of the machine learning algorithms that are built into caret.

There are more than 200 machine learning algorithms available in the caret package.
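
If you’re curious what those are, caret can list every model it supports and the tuning parameters for a given method; a quick sketch:

# All model codes caret knows about (the strings accepted by method =)
length(names(getModelInfo()))

# Tuning parameters for a particular method, e.g. gbm
modelLookup("gbm")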

Whichever algorithm we choose, we can generate a model using the train() function.

my_model_gbm <- train(response ~ ., data = training_samples, method = "gbm")

That’s it.
We’ve got a model now that we can use to make predictions.

Taking a look at the model we built

my_model_gbm
## Stochastic Gradient Boosting 
## 
## 251 samples
## 131 predictors
##   2 classes: 'Impaired', 'NotImpaired' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 251, 251, 251, 251, 251, 251, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8016530  0.4285405
##   1                  100      0.8240044  0.5103785
##   1                  150      0.8240777  0.5195473
##   2                   50      0.8192888  0.4959376
##   2                  100      0.8268481  0.5236718
##   2                  150      0.8247815  0.5228646
##   3                   50      0.8286565  0.5213867
##   3                  100      0.8230417  0.5122095
##   3                  150      0.8240807  0.5197905
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 3, shrinkage = 0.1 and n.minobsinnode = 10.
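
Notice that train() bootstrapped the training data 25 times and searched a small default grid of tuning parameters. If we want more control, we can pass a trainControl() object and an explicit tuneGrid. This is a sketch only (the names fit_control, gbm_grid and my_model_gbm_cv are illustrative, not objects from the session above):

# 10-fold cross-validation instead of the default bootstrap
fit_control <- trainControl(method = "cv", number = 10)

# Explicit grid over the gbm tuning parameters shown in the output above
gbm_grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = 1:3,
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

my_model_gbm_cv <- train(response ~ ., data = training_samples,
                         method = "gbm",
                         trControl = fit_control,
                         tuneGrid = gbm_grid,
                         verbose = FALSE)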

Generating Predictions with our model

Let’s make some predictions using the test data set and see how the model performs.

To generate predictions, we use the predict() function.

my_gbm_predictions <- predict(my_model_gbm, newdata = test_samples)
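
predict() returns class labels by default. For classification models that support it (gbm does), we can also ask for class probabilities instead of hard labels; a small sketch:

# Predicted class labels for the test set
head(my_gbm_predictions)

# Class probabilities, one column per class
head(predict(my_model_gbm, newdata = test_samples, type = "prob"))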

Model Performance

Let’s visualize model performance with a confusion matrix!

cMatrix_gbm <- confusionMatrix(my_gbm_predictions, test_samples$response)
cMatrix_gbm
## $positive
## [1] "Impaired"
## 
## $table
##              Reference
## Prediction    Impaired NotImpaired
##   Impaired          13           1
##   NotImpaired        9          59
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.878048780    0.648972603    0.787141173    0.939941987    0.731707317 
## AccuracyPValue  McnemarPValue 
##    0.001090334    0.026856696 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.5909091            0.9833333            0.9285714 
##       Neg Pred Value            Precision               Recall 
##            0.8676471            0.9285714            0.5909091 
##                   F1           Prevalence       Detection Rate 
##            0.7222222            0.2682927            0.1585366 
## Detection Prevalence    Balanced Accuracy 
##            0.1707317            0.7871212 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

Switching Algorithms

Caret makes it exceptionally easy to try a different algorithm.

my_model_glm <- train(response ~ ., data = training_samples, method = "glm")
my_glm_predictions <- predict(my_model_glm, newdata = test_samples)
cMatrix_glm <- confusionMatrix(my_glm_predictions, test_samples$response)
cMatrix_glm
## $positive
## [1] "Impaired"
## 
## $table
##              Reference
## Prediction    Impaired NotImpaired
##   Impaired          17          17
##   NotImpaired        5          43
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.73170732     0.41731266     0.62244024     0.82361014     0.73170732 
## AccuracyPValue  McnemarPValue 
##     0.55706563     0.01901647 
## 
## $byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7727273            0.7166667            0.5000000 
##       Neg Pred Value            Precision               Recall 
##            0.8958333            0.5000000            0.7727273 
##                   F1           Prevalence       Detection Rate 
##            0.6071429            0.2682927            0.2073171 
## Detection Prevalence    Balanced Accuracy 
##            0.4146341            0.7446970 
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

Comparing the two models

Let’s compare the accuracy of the two models.

cMatrix_gbm$overall[1]
##  Accuracy 
## 0.8780488
cMatrix_glm$overall[1]
##  Accuracy 
## 0.7317073
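
A single accuracy number from one test split can be noisy. caret’s resamples() function lets us compare the two models across all of their resampling iterations instead; a minimal sketch using the models fit above (for a strictly paired comparison you would fit both models with the same resampling indices set in trainControl()):

# Collect the bootstrap resampling results from both models
model_comparison <- resamples(list(GBM = my_model_gbm, GLM = my_model_glm))

# Accuracy and Kappa distributions side by side
summary(model_comparison)
bwplot(model_comparison)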

Improving our models

For some algorithms to run or be optimized, we might need to preprocess the data. Let’s look at the distributions of a couple of variables first.

hist(Alz$tau, breaks = 30)

hist(Alz$age, breaks = 30)

Scaling Data

Using the scale() function, we can transform our numeric variables to have a mean of 0 and a standard deviation of 1.

tau_scaled <- scale(Alz$tau, center = TRUE)
age_scaled <- scale(Alz$age, center = TRUE)
hist(tau_scaled, breaks = 30)

hist(age_scaled, breaks = 30)
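
Rather than scaling columns by hand, caret can do this for us at training time via the preProcess argument to train(); a sketch reusing the training data from above (my_model_glm_scaled is just an illustrative name):

# Center and scale the numeric predictors before fitting the model
my_model_glm_scaled <- train(response ~ ., data = training_samples,
                             method = "glm",
                             preProcess = c("center", "scale"))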

Dummy variables

Some machine learning algorithms can’t handle factor variables; they must be turned into numeric data. This process is referred to as creating dummy variables.

So which of our variables are factor variables?

Alz_var_classes <- sapply(Alz, class)
which(Alz_var_classes == 'factor')
## Genotype response   gender 
##      130      131      132

Genotype, response and gender are all factor variables.

Let’s see how we can turn one of them into dummy variables using dummyVars() in caret.

dmy <- dummyVars(~ Genotype, data = Alz)
dmy_df <- data.frame(predict(dmy, newdata = Alz))
head(dmy_df)
##   Genotype.E2E2 Genotype.E2E3 Genotype.E2E4 Genotype.E3E3 Genotype.E3E4
## 1             0             0             0             1             0
## 2             0             0             0             0             1
## 3             0             0             0             0             1
## 4             0             0             0             0             1
## 5             0             0             0             1             0
## 6             0             0             0             0             0
##   Genotype.E4E4
## 1             0
## 2             0
## 3             0
## 4             0
## 5             0
## 6             1
head(Alz$Genotype)
## [1] E3E3 E3E4 E3E4 E3E4 E3E3 E4E4
## Levels: E2E2 E2E3 E2E4 E3E3 E3E4 E4E4
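
To actually use these columns in a model, one common pattern is to drop the original factor and bind in the dummy-coded columns; the sketch below also shows the fullRank argument, which keeps one fewer column per factor (the object names here are illustrative):

# Replace the Genotype factor with its dummy-coded columns
Alz_dummies <- cbind(Alz[, names(Alz) != "Genotype"], dmy_df)
dim(Alz_dummies)

# fullRank = TRUE keeps k - 1 columns for a factor with k levels,
# which avoids perfectly collinear predictors in regression models
dmy_full_rank <- dummyVars(~ Genotype, data = Alz, fullRank = TRUE)
head(predict(dmy_full_rank, newdata = Alz))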