Caroline Zimmerman


Assignment description

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

The goal of this assignment is to predict how well a participant performs a biceps curl, based on data collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were each asked to perform the exercise in five different ways: classe A corresponds to the correct way of doing the exercise, while the other four classes (B, C, D, and E) correspond to common mistakes. The goal is to create a model that accurately predicts classe. The final output of the assignment is a set of predictions that we will use to complete the Week 3 Quiz.

More information on the data is available here: http://groupware.les.inf.puc-rio.br/har (see “Weight Lifting Exercises Dataset”).


Load and view the data

# Disable scientific notation
options(scipen=999)
# Disable warnings and messages
knitr::opts_chunk$set(message=FALSE, warning=FALSE, dev="png")
# Set options for Rpubs
options(rpubs.upload.method = "internal")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
devtools::install_github("rstudio/rmarkdown")
## Skipping install for github remote, the SHA1 (cb74ee52) has not changed since last install.
##   Use `force = TRUE` to force installation
# Load the data directly from the website
Train <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"), na.strings=c("", "NA", "NULL"))
Test <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"), na.strings=c("", "NA", "NULL"))
# Take a look 
str(Train, list.len=10)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##   [list output truncated]


Data preparation

Variable selection

Let's reduce the 5 time variables to 2 scaled time-sequence variables that will allow us to see how the measure variables evolve by group (user, classe) over time.

# Join the datasets
Train$problem_id <- NA
Test$classe <- NA
Test$type <- "TEST"
Train$type <- "TRAIN"
Data <- rbind(Train, Test)
# Order by username and timestamps 1 and 2 
Data <- Data[order(Data$user_name, Data$raw_timestamp_part_1, Data$raw_timestamp_part_2),]
# Then, create a new timestamp variable that fuses timestamps 1 and 2 
Data$timestamp <- as.numeric(paste0(Data$raw_timestamp_part_1, Data$raw_timestamp_part_2))
# Create a scaled timestamp variable (timestamp index by user)
library(plyr)
Data_new <- ddply(Data, .(user_name), transform, timestamp_by_user = seq_along(timestamp))
# Create a new variable that splits timestamps into 12 bins per username
Data_new <- ddply(Data_new, .(user_name), transform, timestamp_bin=as.numeric(cut(timestamp_by_user, 12)))
# Remove the redundant time variables
Data_new <- Data_new[c(-3,-4,-5,-6,-7,-163)]
# Separate the datasets again
Train <- Data_new[Data_new$type=="TRAIN",]
Train <- Train[c(-156, -157)]
Test <- Data_new[Data_new$type=="TEST",]
Test <- Test[c(-155, -157)]

# Get rid of variables that are summary statistics
# Determine whether the first row of each variable is NA 
zy <- as.data.frame(cbind(names(Train),is.na(as.numeric(Train[1,]))))
# Create an index for variables that have a real (non-NA) value in the first row
index <- as.numeric(row.names(zy[zy$V2==F,]))
# Subset original data based on this index 
Train <- Train[,index]
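
The first-row test works because the summary-statistic columns are almost entirely blank. An equivalent check (a sketch only, not used in the analysis; the 0.9 cutoff is an assumption) is to look at the share of NAs per column:

# Share of missing values per column in the combined data;
# the summary-statistic columns are almost entirely NA
na_share <- colMeans(is.na(Data_new))
sum(na_share > 0.9)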

# Remove row number variable
Train <- Train[,-1]

# Split Train into testing and training sets for cross-validation purposes. (This allows us to build and test a model before we actually use it to answer the quiz questions.)
library(caret)
set.seed(1234)
inTrain <- createDataPartition(y=Train$classe, p=0.75, list=FALSE)
train <- Train[inTrain,]
test <- Train[-inTrain,]


Can we reduce the number of variables by removing near-zero variance variables?

head(nearZeroVar(train, saveMetrics=T))
##                  freqRatio percentUnique zeroVar   nzv
## user_name         1.101290    0.04076641   FALSE FALSE
## roll_belt         1.121168    7.96983286   FALSE FALSE
## pitch_belt        1.012739   11.68637043   FALSE FALSE
## yaw_belt          1.054124   12.39298818   FALSE FALSE
## total_accel_belt  1.067216    0.19024324   FALSE FALSE
## gyros_belt_x      1.068560    0.89686099   FALSE FALSE

There are none!
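
A quick programmatic confirmation (a small sketch using the same nearZeroVar metrics):

# Count how many variables caret flags as zero- or near-zero-variance;
# both counts should come back as 0
nzv <- nearZeroVar(train, saveMetrics=TRUE)
c(zeroVar=sum(nzv$zeroVar), nzv=sum(nzv$nzv))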

Compute the correlations between the numeric variables and remove the highly correlated (redundant) ones.

# Generate correlation matrix
numeric_vars <- train[,c(2:53)]
M <- cor(numeric_vars)
# Remove redundant variables, defined here as variables with absolute corr > 0.8
highlyCor <- findCorrelation(M, 0.8)
numeric_vars <- numeric_vars[,-highlyCor]
# Create new data set with only the relevant variables 
clean_train <- cbind(train[,c(1,55,56,54)], numeric_vars)
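
As a quick check on how aggressive this filter is (a small sketch):

# How many of the 52 measure variables were dropped as highly correlated,
# and how many remain?
c(dropped=length(highlyCor), kept=ncol(numeric_vars))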

Outcome distribution

Is the outcome variable (classe) evenly distributed?

classe_distribution <- ddply(clean_train, .(classe), plyr::summarize, freq=length(classe))
classe_distribution
##   classe freq
## 1      A 4185
## 2      B 2848
## 3      C 2567
## 4      D 2412
## 5      E 2706

We notice that the outcome variable is fairly evenly distributed among B, C, D, and E, while A (the correct way of doing the exercise) is noticeably more common, accounting for roughly 28% of observations versus 16-19% for each of the other classes.
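
The same split expressed as proportions (a quick sketch on the training partition):

# Class proportions rather than raw counts
round(prop.table(table(clean_train$classe)), 3)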


Exploratory plots

Histograms

Let's start by seeing whether each of the measure variables is normally distributed. Since there are a lot of variables to consider, let's group them by accelerometer placement (belt, arm, dumbbell, forearm). This also allows us to see whether there are any patterns for a given body area. (For clarity, we'll only print the results for the belt measurements.)

library(Hmisc)
library(psych)
par(mfrow=c(3, 4))
multi.hist(clean_train[,grep("_belt", names(clean_train))], main="belt")
multi.hist(clean_train[,grep("_arm", names(clean_train))], main="arm")
multi.hist(clean_train[,grep("_dumbbell", names(clean_train))], main="dumbbell")
multi.hist(clean_train[,grep("_forearm", names(clean_train))], main="forearm")


Very few of the distributions are normal, and there are no visible patterns for any of the body areas.


What if we make the same plots, but for each class? (For clarity, we won’t print any of these plots.)

for (i in 1:5) {
  x <- unique(clean_train$classe)[i]
  multi.hist(clean_train[ which(clean_train$classe==x),grep("_belt", names(clean_train))], main=paste(x, " belt"))
  multi.hist(clean_train[ which(clean_train$classe==x),grep("_arm", names(clean_train))], main=paste(x, " arm"))
  multi.hist(clean_train[ which(clean_train$classe==x),grep("_dumbbell", names(clean_train))], main=paste(x, " dumbbell"))
  multi.hist(clean_train[ which(clean_train$classe==x),grep("_forearm", names(clean_train))], main=paste(x, " forearm"))
}

There are more normal distributions here, but not enough to make these variables good candidates for a regression model (at least not without transformations). We will favor classification trees over regression in the modeling phase.

Time plots

Let’s try to visualize the relationship between classe and time. (For clarity, we will only print one of these plots.)

# First, create a list where each element is a data frame containing the values for username, timestamp, classe 
# and values for each of the measure variables
# Write a function that pairs each measure variable with username, timestamp and classe
result <- function(variable) {
  x <- data.frame(clean_train$user_name, clean_train$timestamp_by_user, clean_train$classe, variable)
  rename(x, c("clean_train.user_name"="user_name", "clean_train.timestamp_by_user"="timestamp", "clean_train.classe"="classe"))
}
# Run the function on the measure variables in clean_train
clean_train_measures <- clean_train[5:43]
result2 <- lapply(clean_train_measures, result)
# Rename col 3 to be the list element name
for (i in seq_along(result2)){
  colnames(result2[[i]]) <- c("user_name", "timestamp", "classe", names(result2[i]))
}
# Generate scatterplots for each variable to see how classe changes with time
plots <- function (df) {
  (par(mfrow=c(1, 1)))
  plot(df[,2], df[,4], ylab=names(df)[4], xlim=c(min(df[,2]), max(df[,2])), ylim=c(min(df[,4]), max(df[,4])), xlab="Timestamp", main=colnames(df)[4], type="p", col=df[,3])
  legend(3200,max(df[,4]),unique(df[,3]),col=1:length(df[,3]),pch=1)
  
}
lapply(result2, plots)


Classe progresses from A to E, in alphabetical order, over time.


What if we make separate plots for each username? (For clarity, we will only print one set of plots.)

# Generate scatterplots for each variable to be able to see individual performance over time by classe    
plots2 <- function (df) {
  (par(mfrow=c(2, 3)))
  for (i in 1:6) {
    y <- unique(df$user_name)[i]
    yy <- df[ which(df$user_name==y),]
    plot(yy[,2], yy[,4], ylab=names(yy)[4], xlim=c(min(df[,2]), max(df[,2])), ylim=c(min(df[,4]), max(df[,4])), xlab="Timestamp", main=paste(yy[1,1], colnames(yy)[4]), type="p", col=yy[,3])
    legend(3200,max(df[,4]),unique(yy[,3]),col=1:length(yy[,3]),pch=1)
  }
}
lapply(result2, plots2)


For any given user, there is perfect delineation of the classes over time, progressing from A to E in alphabetical order. This suggests that time will be the most important predictor of classe, by far. But what does this really mean? Given what we know about the problem, it’s clear that the 6 participants were each told to do a certain number of reps in five different ways, with no overlap. So while time will help us accurately predict classe for the purposes of the quiz, it won’t actually help us predict classe outside of these experimental conditions. For a working model that can predict classe in real life, we would need to exclude the username and time variables entirely.
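
We can check this claim directly (a quick sketch): within each user, the classes should occupy consecutive, non-overlapping blocks of the per-user time index.

# First and last per-user time index for each user/classe combination;
# if the classes were performed in order, these ranges should not overlap
ddply(clean_train, .(user_name, classe), plyr::summarize,
      first=min(timestamp_by_user), last=max(timestamp_by_user))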

Boxplots

To look at how the measure variables vary by classe, let’s look at boxplots. (For clarity, we will only print one plot.)

boxplots_simple <- function (df){ 
      (par(mfrow=c(1, 1)))
      boxplot(df[,4]~df[,3],data=df, main=colnames(df)[4])
  }
lapply(result2, boxplots_simple)


The class means are often fairly close together, so there isn't much between-class variability, which may make classe more difficult to predict without the time variables.
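
To put a rough number on "fairly close" (a sketch; scaling by each variable's standard deviation is an assumption, not something used elsewhere in the analysis):

# Spread of the five class means for each measure variable,
# expressed in units of that variable's overall standard deviation
mean_spread <- sapply(clean_train[5:43], function(x) {
  class_means <- tapply(x, clean_train$classe, mean)
  diff(range(class_means)) / sd(x)
})
summary(mean_spread)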

T-tests

Let's run some pairwise t-tests to check whether the differences in classe means are statistically significant across variables.

t.tests <- function (df){ 
    pairwise.t.test(x=df[,4],g=df[,3])
}
t.test.result <- lapply(result2, t.tests)

What percentage of these t-tests have p > 0.05?

p.values <- sapply (t.test.result, function (x) {x$p.value})
p.values <- as.data.frame(apply(p.values, 2, function (x) {round(x,5)})) 
p.values <- na.omit(p.values)
p.values <- unlist(p.values)
percentage <- length(p.values[which(p.values>0.05)])/length(p.values)
percentage
## [1] 0.3153846

About a third of the t-tests have p > 0.05, so roughly two thirds show that the differences in classe means are significant. This is good news for building a model that excludes the time variables.

Outliers

An issue that became clear in the time plots by username is outliers. A handful of points are so far off the charts that they are probably errors, so let's remove them.
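
One way to surface such points programmatically before removing them by hand (a rough sketch; the IQR-based score is an assumption and the cutoff is left to eyeballing):

# For each measure variable, how far (in IQR units) is its most extreme value
# from the median? The largest scores point at the handful of rows removed below.
extreme_score <- sapply(clean_train[5:43], function(x) {
  iqr <- IQR(x)
  if (iqr == 0) iqr <- sd(x)   # guard against a zero IQR
  max(abs(x - median(x))) / iqr
})
head(sort(extreme_score, decreasing=TRUE))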

clean_train <- clean_train[ -which(clean_train$total_accel_dumbbell==max(clean_train$total_accel_dumbbell)),]
clean_train <- clean_train[ -which(clean_train$gyros_dumbbell_y==max(clean_train$gyros_dumbbell_y)),]
clean_train <- clean_train[ -which(clean_train$magnet_dumbbell_x==min(clean_train$magnet_dumbbell_x)),]
clean_train <- clean_train[ -which(clean_train$gyros_forearm_x==min(clean_train$gyros_forearm_x)),]

Imputing missing values

The other issue we see in the time plots by username is that there are some zero values for Jeremy's arm and Adelmo's forearm measures that don't fit the patterns for those variables. These zeros are likely due to malfunctioning equipment rather than true scores of zero, and leaving them in could introduce bias into the model, so let's impute replacement values for them.

# First, identify the zero-value variables in question and convert them to NA
make_NA <- function(var, user_name) {
  ifelse(clean_train$user_name==user_name & var==0, NA, var)
}
clean_train$yaw_forearm <- make_NA(clean_train$yaw_forearm, "adelmo")
clean_train$pitch_forearm <- make_NA(clean_train$pitch_forearm, "adelmo")
clean_train$roll_forearm <- make_NA(clean_train$roll_forearm, "adelmo")
clean_train$yaw_arm <- make_NA(clean_train$yaw_arm, "jeremy")
clean_train$pitch_arm <- make_NA(clean_train$pitch_arm, "jeremy")
clean_train$roll_arm <- make_NA(clean_train$roll_arm, "jeremy")
# Impute missing values, using the measure variables to estimate the correct values
library(mice)
clean_train_measures <- clean_train[5:43]
imputed <- mice(clean_train_measures)
clean_train_measures <- complete(imputed,1)
# Then, complete the dataset
clean_train <- cbind(clean_train[c(1:4)], clean_train_measures)
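
A quick sanity check that the imputation filled everything in (assuming mice ran without complaint):

# There should be no missing values left in the cleaned training set
sum(is.na(clean_train))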


Modeling

Decision trees

1. All variables

Let’s run a simple decision tree model first, including all of the variables.

# Run the same transformations on the test set as on the train set
numeric_vars_test <- test[,c(2:53)]
numeric_vars_test <- numeric_vars_test[,-highlyCor]
clean_test <- cbind(test[,c(1,55,56,54)], numeric_vars_test)
# Run the decision tree model on the training set
set.seed(145689)
modFit <- train(classe ~ ., data=clean_train, method="rpart")
# Plot the decision tree
library(rattle)
fancyRpartPlot(modFit$finalModel)


As expected, the tree splits on time variables.

# Predict classe on the test set
predicted <- predict(modFit, clean_test)
# Evaluate model 
confusionMatrix(clean_test$classe, predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1231  164    0    0    0
##          B    1  948    0    0    0
##          C    0   96  713   46    0
##          D    0    0  116  688    0
##          E    0    0    0   91  810
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8952               
##                  95% CI : (0.8863, 0.9036)     
##     No Information Rate : 0.2512               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8679               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9992   0.7848   0.8601   0.8339   1.0000
## Specificity            0.9553   0.9997   0.9652   0.9716   0.9778
## Pos Pred Value         0.8824   0.9989   0.8339   0.8557   0.8990
## Neg Pred Value         0.9997   0.9343   0.9714   0.9666   1.0000
## Prevalence             0.2512   0.2463   0.1690   0.1682   0.1652
## Detection Rate         0.2510   0.1933   0.1454   0.1403   0.1652
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9773   0.8922   0.9126   0.9028   0.9889

Our simple decision tree model has an accuracy rate of 89.5%. Pretty good for a first attempt. Of course, this is due to the conditions of the experiment, rather than a true relationship between time and classe. It would be a worthless model for predicting classe outside of these specific experimental conditions.


2. Measure variables

What if we exclude the time variables and username and use only the measure variables, so that the model could predict classe outside of the experimental conditions?

# Let's create new datasets for train and test, excluding time and username
clean_train_no_time <- clean_train[c(4:43)]
clean_test_no_time <-  clean_test[c(4:43)]
# Run the model, predict classe, and evaluate model performance
set.seed(785369)
modFit2 <- train(classe ~ ., data=clean_train_no_time, method="rpart")
predicted2 <- predict(modFit2, clean_test_no_time)
confusionMatrix(clean_test_no_time$classe, predicted2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 870  28 495   0   2
##          B 149 306 486   0   8
##          C  21  16 817   0   1
##          D  51 167 519   0  67
##          E  22 155 416   0 308
## 
## Overall Statistics
##                                              
##                Accuracy : 0.4692             
##                  95% CI : (0.4552, 0.4833)   
##     No Information Rate : 0.5573             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3343             
##  Mcnemar's Test P-Value : <0.0000000000000002
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7817   0.4554   0.2989       NA  0.79793
## Specificity            0.8615   0.8481   0.9825   0.8361  0.86875
## Pos Pred Value         0.6237   0.3224   0.9556       NA  0.34184
## Neg Pred Value         0.9307   0.9075   0.5268       NA  0.98051
## Prevalence             0.2270   0.1370   0.5573   0.0000  0.07871
## Detection Rate         0.1774   0.0624   0.1666   0.0000  0.06281
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639  0.18373
## Balanced Accuracy      0.8216   0.6517   0.6407       NA  0.83334

Without time variables, our simple decision tree model now has an accuracy rate of just 46.9%. Pretty poor.


3. Measure variables using PCA

What if we use Principal Component Analysis for a decision tree model on the measure variables (excluding time and username), rather than manually reducing the number of variables? Would that improve the model? For that, we will have to reinstate the measure variables that we excluded during data preparation before running the model.
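
To get a feel for how much compression PCA buys here, we can check how many components caret would keep at its default 95% variance threshold (a sketch run on the measure columns only; this is not part of the model fit below):

# Number of principal components needed to capture 95% of the variance
pca_check <- preProcess(train[,2:53], method="pca", thresh=0.95)
ncol(pca_check$rotation)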

# Redo some of the data preparation on the original train and test sets
train <- train[ -which(train$total_accel_dumbbell==max(train$total_accel_dumbbell)),]
train <- train[ -which(train$gyros_dumbbell_y==max(train$gyros_dumbbell_y)),]
train <- train[ -which(train$magnet_dumbbell_x==min(train$magnet_dumbbell_x)),]
train <- train[ -which(train$gyros_forearm_x==min(train$gyros_forearm_x)),]
make_NA <- function(var, user_name) {
  ifelse(train$user_name==user_name & var==0, NA, var)
}
train$yaw_forearm <- make_NA(train$yaw_forearm, "adelmo")
train$pitch_forearm <- make_NA(train$pitch_forearm, "adelmo")
train$roll_forearm <- make_NA(train$roll_forearm, "adelmo")
train$yaw_arm <- make_NA(train$yaw_arm, "jeremy")
train$pitch_arm <- make_NA(train$pitch_arm, "jeremy")
train$roll_arm <- make_NA(train$roll_arm, "jeremy")
library(mice)
train_measures <- train[2:53]
imputed <- mice(train_measures)
train_measures <- complete(imputed,1)
train <- cbind(train[c(54)], train_measures)
test <- test[c(54,2:53)]
# Run the PCA model, predict classe, and evaluate model performance
set.seed(145689)
modFit3 <- train(classe ~ ., data=train, preProcess="pca", method="rpart")
predicted3 <- predict(modFit3, test)
confusionMatrix(test$classe, predicted3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1019  240    0  117   19
##          B  323  353    0  212   61
##          C  544  238    0   68    5
##          D  264  194    0  296   50
##          E  258  264    0  109  270
## 
## Overall Statistics
##                                              
##                Accuracy : 0.3952             
##                  95% CI : (0.3815, 0.409)    
##     No Information Rate : 0.491              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2119             
##  Mcnemar's Test P-Value : <0.0000000000000002
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4232  0.27386       NA  0.36908  0.66667
## Specificity            0.8494  0.83513   0.8257  0.87616  0.85975
## Pos Pred Value         0.7305  0.37197       NA  0.36816  0.29967
## Neg Pred Value         0.6042  0.76334       NA  0.87659  0.96628
## Prevalence             0.4910  0.26285   0.0000  0.16354  0.08259
## Detection Rate         0.2078  0.07198   0.0000  0.06036  0.05506
## Detection Prevalence   0.2845  0.19352   0.1743  0.16395  0.18373
## Balanced Accuracy      0.6363  0.55449       NA  0.62262  0.76321

Our PCA decision tree model using the measure variables has an accuracy rate of just 39.5%. Even worse!

Random forests

1. All variables

Random forest model including all variables.

library(randomForest)
# Run the model, predict classe, and evaluate model performance
set.seed(145873)
modFit4 <- randomForest(classe ~ ., data=clean_train)
predicted4 <- predict(modFit4, clean_test)
confusionMatrix(clean_test$classe, predicted4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    2  947    0    0    0
##          C    0    2  853    0    0
##          D    0    0    3  801    0
##          E    0    0    0    2  899
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9982               
##                  95% CI : (0.9965, 0.9992)     
##     No Information Rate : 0.2849               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9977               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9986   0.9979   0.9965   0.9975   1.0000
## Specificity            1.0000   0.9995   0.9995   0.9993   0.9995
## Pos Pred Value         1.0000   0.9979   0.9977   0.9963   0.9978
## Neg Pred Value         0.9994   0.9995   0.9993   0.9995   1.0000
## Prevalence             0.2849   0.1935   0.1746   0.1637   0.1833
## Detection Rate         0.2845   0.1931   0.1739   0.1633   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9993   0.9987   0.9980   0.9984   0.9998

Accuracy goes up to 99.8%!

varImpPlot(modFit4)


Again, as expected, the most important variables are (by far) those related to time.


2. Measure variables

Random forest model excluding time and username variables.

# Run the model, predict classe, and evaluate model performance
set.seed(21036)
modFit5 <- randomForest(classe ~ ., data=clean_train_no_time)
predicted5 <- predict(modFit5, clean_test_no_time)
confusionMatrix(clean_test_no_time$classe, predicted5)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    3    0    0    0
##          B    5  939    5    0    0
##          C    0    8  846    1    0
##          D    0    0   18  786    0
##          E    0    0    1    2  898
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9912               
##                  95% CI : (0.9882, 0.9936)     
##     No Information Rate : 0.2849               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9889               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9884   0.9724   0.9962   1.0000
## Specificity            0.9991   0.9975   0.9978   0.9956   0.9993
## Pos Pred Value         0.9978   0.9895   0.9895   0.9776   0.9967
## Neg Pred Value         0.9986   0.9972   0.9941   0.9993   1.0000
## Prevalence             0.2849   0.1937   0.1774   0.1609   0.1831
## Detection Rate         0.2838   0.1915   0.1725   0.1603   0.1831
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9978   0.9929   0.9851   0.9959   0.9996

Accuracy goes up to 99.1%, which is almost as good as the model that included the time variables! This is a great illustration of the power of random forests at capturing patterns in the data that a single decision tree misses.

varImpPlot(modFit5)


The variable importance plot shows that we could probably further reduce the number of variables we used in the model and still get an accurate prediction.
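
A rough sketch of that idea (the choice of 20 variables is arbitrary): refit on the top of the importance ranking and compare accuracy.

# Keep only the ~20 most important measure variables and refit
imp <- importance(modFit5)   # MeanDecreaseGini for each variable
top_vars <- rownames(imp)[order(imp[,1], decreasing=TRUE)][1:20]
modFit5_small <- randomForest(classe ~ ., data=clean_train_no_time[, c("classe", top_vars)])
confusionMatrix(clean_test_no_time$classe, predict(modFit5_small, clean_test_no_time))$overall["Accuracy"]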


3. Measure variables using PCA

Random forest model using PCA (excl. time and username variables).

# Run the model, predict classe, and evaluate model performance
set.seed(89)
trControl <- trainControl(method = "none")
modFit6 <- train(classe ~ ., data=train, preProcess="pca", method="rf", trControl=trControl, tuneGrid=data.frame(mtry=3))
predicted6 <- predict(modFit6, test)
confusionMatrix(test$classe, predicted6)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1371    7    6   10    1
##          B   31  891   24    2    1
##          C   73   24  729   27    2
##          D   24    0   39  737    4
##          E    5    7   17   10  862
## 
## Overall Statistics
##                                                
##                Accuracy : 0.936                
##                  95% CI : (0.9288, 0.9427)     
##     No Information Rate : 0.3067               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9188               
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9116   0.9591   0.8945   0.9377   0.9908
## Specificity            0.9929   0.9854   0.9692   0.9837   0.9903
## Pos Pred Value         0.9828   0.9389   0.8526   0.9167   0.9567
## Neg Pred Value         0.9621   0.9904   0.9788   0.9880   0.9980
## Prevalence             0.3067   0.1894   0.1662   0.1603   0.1774
## Detection Rate         0.2796   0.1817   0.1487   0.1503   0.1758
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9523   0.9723   0.9318   0.9607   0.9906

A random forest model using PCA has an accuracy of 93.6%. This is very good, but it doesn't beat the model where we trimmed the correlated variables manually.

Boosting

1. All variables

Could we improve on these results by trying another method? Let's fit a gradient boosted model (gbm), again on the full clean training set (all variables, including username and the time variables).

# Run the model, predict classe, and evaluate model performance
modFit7 <-train(classe ~ ., data=clean_train, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1), verbose = FALSE)
predicted7 <- predict(modFit7, clean_test)
confusionMatrix(clean_test$classe, predicted7)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    2    0    0    0
##          B    1  948    0    0    0
##          C    0    3  848    4    0
##          D    0    0    7  797    0
##          E    0    0    0    6  895
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9953               
##                  95% CI : (0.993, 0.997)       
##     No Information Rate : 0.2843               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9941               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9948   0.9918   0.9876   1.0000
## Specificity            0.9994   0.9997   0.9983   0.9983   0.9985
## Pos Pred Value         0.9986   0.9989   0.9918   0.9913   0.9933
## Neg Pred Value         0.9997   0.9987   0.9983   0.9976   1.0000
## Prevalence             0.2843   0.1943   0.1743   0.1646   0.1825
## Detection Rate         0.2841   0.1933   0.1729   0.1625   0.1825
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9994   0.9973   0.9950   0.9929   0.9993

A gradient boosted model on all variables has an accuracy of 99.5%, so it beats the random forest PCA model and performs a bit better than the random forest using the manually selected measure variables, though it doesn't quite match the random forest on all variables. It's also quite a bit slower to train.


Model selection

Since the random forest including all variables gave us the highest accuracy rate, we'll use it for the Coursera submission. The accuracy of our chosen model gives us an estimate of the out-of-sample error rate: 100% - 99.8% = 0.2%, with a 95% confidence interval of roughly 0.08% to 0.35%. However, if we were selecting a model for real-world predictions outside the experimental conditions of this study, we would select the random forest model that excludes the time and username variables, which is quick to run, easily interpretable, and over 99% accurate.
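
For reference, the error estimate and its confidence interval can be pulled straight from the confusion matrix of the chosen model (a small sketch; it assumes predicted4 is still in memory):

# Estimated out-of-sample error and its 95% CI, from the held-out test set
cm_rf_all <- confusionMatrix(clean_test$classe, predicted4)
err   <- 1 - cm_rf_all$overall["Accuracy"]
lower <- 1 - cm_rf_all$overall["AccuracyUpper"]   # best case
upper <- 1 - cm_rf_all$overall["AccuracyLower"]   # worst case
round(100 * c(error=unname(err), lower=unname(lower), upper=unname(upper)), 2)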


Coursera quiz

# Perform same transformations on Test set as on Train set
zy <- as.data.frame(cbind(names(Test),is.na(as.numeric(Test[1,]))))
index <- as.numeric(row.names(zy[zy$V2==F,]))
Test <- Test[,index]
Test <- Test[,-1]
numeric_vars_Test <- Test[,c(2:53)]
numeric_vars_Test <- numeric_vars_Test[,-highlyCor]
clean_Test <- cbind(Test[,c(1,55,56,54)], numeric_vars_Test)
# Order the set by Problem ID
clean_Test <- clean_Test[order(clean_Test$problem_id),] 
# Remove the Problem ID as a variable
clean_Test <- clean_Test[-4]
# Predict classe for the Test set using the random forest for all variables model
predicted_final <- predict(modFit4, clean_Test)
# Organize the predictions for easy entry into the quiz page
question_number <- 1:20
data.frame(question_number, predicted_final)
##       question_number predicted_final
## 17819               1               B
## 14684               2               A
## 14872               3               B
## 402                 4               A
## 11280               5               A
## 16642               6               E
## 16137               7               D
## 14887               8               B
## 4105                9               A
## 7298               10               A
## 5015               11               B
## 15494              12               C
## 11914              13               B
## 14078              14               A
## 16818              15               E
## 13209              16               E
## 17105              17               A
## 5208               18               B
## 17752              19               B
## 11830              20               B

100% score on the quiz. Hooray!