Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
The goal of this assignment is to predict how well a participant performs a biceps curl, based on data collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Each participant was asked to perform the exercise in five different ways. Classe A corresponds to the correct way of doing the exercise, while the other four classes (B, C, D, and E) correspond to common mistakes. The goal is to create a model that accurately predicts classe. The final output of the assignment is a set of predictions that we will use to complete the Week 3 Quiz.
More information on the data is available here: http://groupware.les.inf.puc-rio.br/har (see “Weight Lifting Exercises Dataset”).
# Disable scientific notation
options(scipen=999)
# Disable warnings and messages
knitr::opts_chunk$set(message=FALSE, warning=FALSE, dev="png")
# Set options for Rpubs
options(rpubs.upload.method = "internal")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
devtools::install_github("rstudio/rmarkdown")
## Skipping install for github remote, the SHA1 (cb74ee52) has not changed since last install.
## Use `force = TRUE` to force installation
# Load the data directly from the website
Train <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"), na.strings=c("", "NA", "NULL"))
Test <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"), na.strings=c("", "NA", "NULL"))
# Take a look
str(Train, list.len=10)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## [list output truncated]
Let’s reduce the 5 time variables into 2 scaled time sequence variables that will allow us to understand how the measure variables evolve by groups (users, classes) over time.
# Join the datasets
Train$problem_id <- NA
Test$classe <- NA
Test$type <- "TEST"
Train$type <- "TRAIN"
Data <- rbind(Train, Test)
# Order by username and timestamps 1 and 2
Data <- Data[order(Data$user_name, Data$raw_timestamp_part_1, Data$raw_timestamp_part_2),]
# Then, create a new timestamp variable that fuses timestamps 1 and 2
Data$timestamp <- as.numeric(paste0(Data$raw_timestamp_part_1, Data$raw_timestamp_part_2))
# Create a scaled timestamp variable (timestamp index by user)
library(plyr)
Data_new <- ddply(Data, .(user_name), transform, timestamp_by_user = seq_along(timestamp))
# Create a new variable that splits timestamps into 12 bins per username
Data_new <- ddply(Data_new, .(user_name), transform, timestamp_bin=as.numeric(cut(timestamp_by_user, 12)))
# Remove the redundant time variables
Data_new <- Data_new[c(-3,-4,-5,-6,-7,-163)]
# Separate the datasets again
Train <- Data_new[Data_new$type=="TRAIN",]
Train <- Train[c(-156, -157)]
Test <- Data_new[Data_new$type=="TEST",]
Test <- Test[c(-155, -157)]
# Get rid of variables that are summary statistics
# Determine whether the first row of each variable is NA
zy <- as.data.frame(cbind(names(Train),is.na(as.numeric(Train[1,]))))
# Create an index for variables that have a real (non-NA) value in the first row
index <- as.numeric(row.names(zy[zy$V2==F,]))
# Subset original data based on this index
Train <- Train[,index]
# Remove row number variable
Train <- Train[,-1]
# Split Train into training and testing sets for cross-validation purposes. (This allows us to build and test a model before we actually use it to answer the quiz questions.)
library(caret)
set.seed(1234)
inTrain <- createDataPartition(y=Train$classe, p=0.75, list=FALSE)
train <- Train[inTrain,]
test <- Train[-inTrain,]
Can we reduce the number of variables by removing near-zero variance variables?
head(nearZeroVar(train, saveMetrics=T))
## freqRatio percentUnique zeroVar nzv
## user_name 1.101290 0.04076641 FALSE FALSE
## roll_belt 1.121168 7.96983286 FALSE FALSE
## pitch_belt 1.012739 11.68637043 FALSE FALSE
## yaw_belt 1.054124 12.39298818 FALSE FALSE
## total_accel_belt 1.067216 0.19024324 FALSE FALSE
## gyros_belt_x 1.068560 0.89686099 FALSE FALSE
There are none!
Compute the correlations between the numeric variables to identify and remove highly correlated (redundant) variables.
# Generate correlation matrix
numeric_vars <- train[,c(2:53)]
M <- cor(numeric_vars)
# Remove redundant variables, defined here as variables with absolute corr > 0.8
highlyCor <- findCorrelation(M, 0.8)
numeric_vars <- numeric_vars[,-highlyCor]
# Create new data set with only the relevant variables
clean_train <- cbind(train[,c(1,55,56,54)], numeric_vars)
Is the outcome variable (classe) evenly distributed?
classe_distribution <- ddply(clean_train, .(classe), plyr::summarize, freq=length(classe))
classe_distribution
## classe freq
## 1 A 4185
## 2 B 2848
## 3 C 2567
## 4 D 2412
## 5 E 2706
We notice that the outcome variable is fairly evenly distributed among B, C, D, and E, while A (the correct way of doing the exercise) occurs noticeably more often, roughly 1.5 times as often as the next most frequent class.
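For reference, the same distribution expressed as proportions (a quick check using base R's prop.table):
# Class proportions in the training split
round(prop.table(table(clean_train$classe)), 3)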
Let’s start by checking whether the measure variables are normally distributed. Since there are a lot of variables to consider, let’s group them by accelerometer placement (belt, arm, dumbbell, forearm). This also allows us to see if there are any patterns for a given body area. (For clarity, we’ll only print the results for the belt measurements.)
library(Hmisc)
library(psych)
par(mfrow=c(3, 4))
multi.hist(clean_train[,grep("_belt", names(clean_train))], main="belt")
multi.hist(clean_train[,grep("_arm", names(clean_train))], main="arm")
multi.hist(clean_train[,grep("_dumbbell", names(clean_train))], main="dumbbell")
multi.hist(clean_train[,grep("_forearm", names(clean_train))], main="forearm")
Very few of the variables are normally distributed, and there are no obvious patterns within any of the body areas.
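To put a rough number on this visual impression, one could run a Shapiro-Wilk normality test on a random sample of each measure variable. A minimal sketch, assuming (as above) that the first four columns of clean_train are user_name, the two time variables, and classe, and sampling 5,000 rows because shapiro.test caps the sample size:
# Proportion of measure variables that a Shapiro-Wilk test would call approximately normal (p > 0.05)
set.seed(42)
sw_p <- sapply(clean_train[, -(1:4)], function(v) shapiro.test(sample(v, 5000))$p.value)
mean(sw_p > 0.05)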
What if we make the same plots, but for each class? (For clarity, we won’t print any of these plots.)
for (i in 1:5) {
  x <- unique(clean_train$classe)[i]
  multi.hist(clean_train[which(clean_train$classe==x), grep("_belt", names(clean_train))], main=paste(x, " belt"))
  multi.hist(clean_train[which(clean_train$classe==x), grep("_arm", names(clean_train))], main=paste(x, " arm"))
  multi.hist(clean_train[which(clean_train$classe==x), grep("_dumbbell", names(clean_train))], main=paste(x, " dumbbell"))
  multi.hist(clean_train[which(clean_train$classe==x), grep("_forearm", names(clean_train))], main=paste(x, " forearm"))
}
There are more normal distributions, but not enough to make these variables good candidates for a regression model (at least without transformations). We will favor classification trees over regression in our modeling phase.
Let’s try to visualize the relationship between classe and time. (For clarity, we will only print one of these plots.)
# First, create a list where each element is a data frame containing the values for username, timestamp, classe
# and values for each of the measure variables
# This function selects a single measure variable and combines it with username, timestamp, and classe
result <- function(variable) {
  x <- data.frame(clean_train$user_name, clean_train$timestamp_by_user, clean_train$classe, variable)
  rename(x, c("clean_train.user_name"="user_name", "clean_train.timestamp_by_user"="timestamp", "clean_train.classe"="classe"))
}
# Run the function on the measure variables in clean_train
clean_train_measures <- clean_train[5:43]
result2 <- lapply(clean_train_measures, result)
# Rename col 3 to be the list element name
for (i in seq_along(result2)){
colnames(result2[[i]]) <- c("user_name", "timestamp", "classe", names(result2[i]))
}
# Generate scatterplots for each variable to see how classe changes with time
plots <- function(df) {
  par(mfrow=c(1, 1))
  plot(df[,2], df[,4], ylab=names(df)[4], xlim=c(min(df[,2]), max(df[,2])), ylim=c(min(df[,4]), max(df[,4])), xlab="Timestamp", main=colnames(df)[4], type="p", col=df[,3])
  legend(3200, max(df[,4]), unique(df[,3]), col=1:length(unique(df[,3])), pch=1)
}
lapply(result2, plots)
Classe evolves from A to E, in alphabetical order, over time.
What if we make separate plots for each username? (For clarity, we will only print one set of plots.)
# Generate scatterplots for each variable to be able to see individual performance over time by classe
plots2 <- function(df) {
  par(mfrow=c(2, 3))
  for (i in 1:6) {
    y <- unique(df$user_name)[i]
    yy <- df[which(df$user_name==y), ]
    plot(yy[,2], yy[,4], ylab=names(yy)[4], xlim=c(min(df[,2]), max(df[,2])), ylim=c(min(df[,4]), max(df[,4])), xlab="Timestamp", main=paste(yy[1,1], colnames(yy)[4]), type="p", col=yy[,3])
    legend(3200, max(df[,4]), unique(yy[,3]), col=1:length(unique(yy[,3])), pch=1)
  }
}
lapply(result2, plots2)
For any given user, there is perfect delineation of the classes over time, progressing from A to E in alphabetical order. This suggests that time will be the most important predictor of classe, by far. But what does this really mean? Given what we know about the problem, it’s clear that the 6 participants were each told to do a certain number of reps in five different ways, with no overlap. So while time will help us accurately predict classe for the purposes of the quiz, it won’t actually help us predict classe outside of these experimental conditions. For a working model that can predict classe in real life, we would need to exclude the username and time variables entirely.
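A quick way to see this numerically is to cross-tabulate classe against the time bins created earlier for a single user. A minimal sketch, using carlitos (one of the six participants) as an example:
# For any one user, each time bin should contain (almost) exclusively one classe
with(clean_train[clean_train$user_name == "carlitos", ], table(classe, timestamp_bin))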
To look at how the measure variables vary by classe, let’s look at boxplots. (For clarity, we will only print one plot.)
boxplots_simple <- function (df){
(par(mfrow=c(1, 1)))
boxplot(df[,4]~df[,3],data=df, main=colnames(df)[4])
}
lapply(result2, boxplots_simple)
The medians across classes are often fairly close, so there isn't much separation between classes on any single variable, which may make classe harder to predict without the time variables.
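To quantify that impression for one variable, we could compute the per-classe medians for one element of result2 (a minimal sketch; the first list element is used purely as an example):
# Medians of one measure variable by classe; close medians mean weak separation
aggregate(result2[[1]][, 4], by = list(classe = result2[[1]][, 3]), FUN = median)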
Let’s run some pairwise t-tests to check whether the differences in classe means are statistically significant across variables.
t.tests <- function (df){
pairwise.t.test(x=df[,4],g=df[,3])
}
t.test.result <- lapply(result2, t.tests)
What percentage of the t-tests have p > 0.05?
p.values <- sapply (t.test.result, function (x) {x$p.value})
p.values <- as.data.frame(apply(p.values, 2, function (x) {round(x,5)}))
p.values <- na.omit(p.values)
p.values <- unlist(p.values)
percentage <- length(p.values[which(p.values>0.05)])/length(p.values)
percentage
## [1] 0.3153846
About a third of the t-tests have p > 0.05, so roughly two thirds of them show significant differences in classe means. This is good news for building a model that excludes the time variables.
An issue that became clear in the per-user time plots is outliers. A handful of points are so far off the charts that they are probably errors, so let's remove them.
clean_train <- clean_train[ -which(clean_train$total_accel_dumbbell==max(clean_train$total_accel_dumbbell)),]
clean_train <- clean_train[ -which(clean_train$gyros_dumbbell_y==max(clean_train$gyros_dumbbell_y)),]
clean_train <- clean_train[ -which(clean_train$magnet_dumbbell_x==min(clean_train$magnet_dumbbell_x)),]
clean_train <- clean_train[ -which(clean_train$gyros_forearm_x==min(clean_train$gyros_forearm_x)),]
The other issue we see in the per-user time plots is that there are some zero values for the Jeremy arm and Adelmo forearm measures that don't fit the patterns for those variables. These zeros are likely due to malfunctioning equipment rather than true scores of zero, and leaving them in could bias the model, so let's impute replacement values for them.
# First, identify the zero-value variables in question and convert them to NA
make_NA <- function(var, user_name) {
  ifelse(clean_train$user_name==user_name & var==0, NA, var)
}
clean_train$yaw_forearm <- make_NA(clean_train$yaw_forearm, "adelmo")
clean_train$pitch_forearm <- make_NA(clean_train$pitch_forearm, "adelmo")
clean_train$roll_forearm <- make_NA(clean_train$roll_forearm, "adelmo")
clean_train$yaw_arm <- make_NA(clean_train$yaw_arm, "jeremy")
clean_train$pitch_arm <- make_NA(clean_train$pitch_arm, "jeremy")
clean_train$roll_arm <- make_NA(clean_train$roll_arm, "jeremy")
# Impute missing values, using the measure variables to estimate the correct values
library(mice)
clean_train_measures <- clean_train[5:43]
imputed <- mice(clean_train_measures)
clean_train_measures <- complete(imputed,1)
# Then, complete the dataset
clean_train <- cbind(clean_train[c(1:4)], clean_train_measures)
Let’s run a simple decision tree model first, including all of the variables.
# Run the same transformations on the test set as on the train set
numeric_vars_test <- test[,c(2:53)]
numeric_vars_test <- numeric_vars_test[,-highlyCor]
clean_test <- cbind(test[,c(1,55,56,54)], numeric_vars_test)
# Run the decision tree model on the training set
set.seed(145689)
modFit <- train(classe ~ ., data=clean_train, method="rpart")
# Plot the decision tree
library(rattle)
fancyRpartPlot(modFit$finalModel)
As expected, the tree splits on time variables.
# Predict classe on the test set
predicted <- predict(modFit, clean_test)
# Evaluate model
confusionMatrix(clean_test$classe, predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1231 164 0 0 0
## B 1 948 0 0 0
## C 0 96 713 46 0
## D 0 0 116 688 0
## E 0 0 0 91 810
##
## Overall Statistics
##
## Accuracy : 0.8952
## 95% CI : (0.8863, 0.9036)
## No Information Rate : 0.2512
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8679
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9992 0.7848 0.8601 0.8339 1.0000
## Specificity 0.9553 0.9997 0.9652 0.9716 0.9778
## Pos Pred Value 0.8824 0.9989 0.8339 0.8557 0.8990
## Neg Pred Value 0.9997 0.9343 0.9714 0.9666 1.0000
## Prevalence 0.2512 0.2463 0.1690 0.1682 0.1652
## Detection Rate 0.2510 0.1933 0.1454 0.1403 0.1652
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9773 0.8922 0.9126 0.9028 0.9889
Our simple decision tree model has an accuracy rate of 89.5%. Pretty good for a first attempt. Of course, this is due to the conditions of the experiment, rather than a true relationship between time and classe. It would be a worthless model for predicting classe outside of these specific experimental conditions.
What if we exclude the time variables and username and use only the measure variables, so that the model could predict classe outside of the experimental conditions?
# Let's create new datasets for train and test, excluding time and username
clean_train_no_time <- clean_train[c(4:43)]
clean_test_no_time <- clean_test[c(4:43)]
# Run the model, predict classe, and evaluate model performance
set.seed(785369)
modFit2 <- train(classe ~ ., data=clean_train_no_time, method="rpart")
predicted2 <- predict(modFit2, clean_test_no_time)
confusionMatrix(clean_test_no_time$classe, predicted2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 870 28 495 0 2
## B 149 306 486 0 8
## C 21 16 817 0 1
## D 51 167 519 0 67
## E 22 155 416 0 308
##
## Overall Statistics
##
## Accuracy : 0.4692
## 95% CI : (0.4552, 0.4833)
## No Information Rate : 0.5573
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3343
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7817 0.4554 0.2989 NA 0.79793
## Specificity 0.8615 0.8481 0.9825 0.8361 0.86875
## Pos Pred Value 0.6237 0.3224 0.9556 NA 0.34184
## Neg Pred Value 0.9307 0.9075 0.5268 NA 0.98051
## Prevalence 0.2270 0.1370 0.5573 0.0000 0.07871
## Detection Rate 0.1774 0.0624 0.1666 0.0000 0.06281
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.18373
## Balanced Accuracy 0.8216 0.6517 0.6407 NA 0.83334
Without time variables, our simple decision tree model now has an accuracy rate of just 46.9%. Pretty poor.
What if we use Principal Component Analysis for a decision tree model on the measure variables (excl. time and username), rather than manually reducing the number of variables? Would that improve the model? For that, we will have to reinstate the measure variables that we excluded during data preparation before running the model.
# Redo some of the data preparation on the original train and test sets
train <- train[ -which(train$total_accel_dumbbell==max(train$total_accel_dumbbell)),]
train <- train[ -which(train$gyros_dumbbell_y==max(train$gyros_dumbbell_y)),]
train <- train[ -which(train$magnet_dumbbell_x==min(train$magnet_dumbbell_x)),]
train <- train[ -which(train$gyros_forearm_x==min(train$gyros_forearm_x)),]
make_NA <- function(var, user_name) {
  ifelse(train$user_name==user_name & var==0, NA, var)
}
train$yaw_forearm <- make_NA(train$yaw_forearm, "adelmo")
train$pitch_forearm <- make_NA(train$pitch_forearm, "adelmo")
train$roll_forearm <- make_NA(train$roll_forearm, "adelmo")
train$yaw_arm <- make_NA(train$yaw_arm, "jeremy")
train$pitch_arm <- make_NA(train$pitch_arm, "jeremy")
train$roll_arm <- make_NA(train$roll_arm, "jeremy")
library(mice)
train_measures <- train[2:53]
imputed <- mice(train_measures)
train_measures <- complete(imputed,1)
train <- cbind(train[c(54)], train_measures)
test <- test[c(54,2:53)]
# Run the PCA model, predict classe, and evaluate model performance
set.seed(145689)
modFit3 <- train(classe ~ ., data=train, preProcess="pca", method="rpart")
predicted3 <- predict(modFit3, test)
confusionMatrix(test$classe, predicted3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1019 240 0 117 19
## B 323 353 0 212 61
## C 544 238 0 68 5
## D 264 194 0 296 50
## E 258 264 0 109 270
##
## Overall Statistics
##
## Accuracy : 0.3952
## 95% CI : (0.3815, 0.409)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2119
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4232 0.27386 NA 0.36908 0.66667
## Specificity 0.8494 0.83513 0.8257 0.87616 0.85975
## Pos Pred Value 0.7305 0.37197 NA 0.36816 0.29967
## Neg Pred Value 0.6042 0.76334 NA 0.87659 0.96628
## Prevalence 0.4910 0.26285 0.0000 0.16354 0.08259
## Detection Rate 0.2078 0.07198 0.0000 0.06036 0.05506
## Detection Prevalence 0.2845 0.19352 0.1743 0.16395 0.18373
## Balanced Accuracy 0.6363 0.55449 NA 0.62262 0.76321
Our PCA decision tree model using the measure variables has an accuracy rate of just 39.5%. Even worse!
Random forest model including all variables.
library(randomForest)
# Run the model, predict classe, and evaluate model performance
set.seed(145873)
modFit4 <- randomForest(classe ~ ., data=clean_train)
predicted4 <- predict(modFit4, clean_test)
confusionMatrix(clean_test$classe, predicted4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 2 947 0 0 0
## C 0 2 853 0 0
## D 0 0 3 801 0
## E 0 0 0 2 899
##
## Overall Statistics
##
## Accuracy : 0.9982
## 95% CI : (0.9965, 0.9992)
## No Information Rate : 0.2849
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9977
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9979 0.9965 0.9975 1.0000
## Specificity 1.0000 0.9995 0.9995 0.9993 0.9995
## Pos Pred Value 1.0000 0.9979 0.9977 0.9963 0.9978
## Neg Pred Value 0.9994 0.9995 0.9993 0.9995 1.0000
## Prevalence 0.2849 0.1935 0.1746 0.1637 0.1833
## Detection Rate 0.2845 0.1931 0.1739 0.1633 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9993 0.9987 0.9980 0.9984 0.9998
Accuracy goes up to 99.8%!
varImpPlot(modFit4)
Again, as expected, the most important variables are by far those related to time.
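For reference, the numeric importance scores behind the plot can also be printed (a minimal sketch, sorted by mean decrease in Gini):
# Top 10 variables by importance for the all-variables random forest
imp_all <- importance(modFit4)
head(imp_all[order(imp_all[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)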
Random forest model excluding time and username variables.
# Run the model, predict classe, and evaluate model performance
set.seed(21036)
modFit5 <- randomForest(classe ~ ., data=clean_train_no_time)
predicted5 <- predict(modFit5, clean_test_no_time)
confusionMatrix(clean_test_no_time$classe, predicted5)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 3 0 0 0
## B 5 939 5 0 0
## C 0 8 846 1 0
## D 0 0 18 786 0
## E 0 0 1 2 898
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9882, 0.9936)
## No Information Rate : 0.2849
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9889
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9884 0.9724 0.9962 1.0000
## Specificity 0.9991 0.9975 0.9978 0.9956 0.9993
## Pos Pred Value 0.9978 0.9895 0.9895 0.9776 0.9967
## Neg Pred Value 0.9986 0.9972 0.9941 0.9993 1.0000
## Prevalence 0.2849 0.1937 0.1774 0.1609 0.1831
## Detection Rate 0.2838 0.1915 0.1725 0.1603 0.1831
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9978 0.9929 0.9851 0.9959 0.9996
Accuracy goes up to 99.1%, which is almost as good as the model that included the time variables! This is a great illustration of the power of random forests to capture patterns in the data that a single decision tree misses.
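As a sanity check, randomForest also reports an out-of-bag (OOB) error estimate computed during training, which should line up with the hold-out accuracy above; printing the fitted object shows it:
# The OOB estimate is a built-in, cross-validation-like error estimate from the bootstrap samples
print(modFit5)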
varImpPlot(modFit5)
The variable importance plot shows that we could probably further reduce the number of variables we used in the model and still get an accurate prediction.
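As a rough illustration of that idea (a sketch only, not run as part of this analysis), we could keep just the top 15 variables by importance and refit:
# Hypothetical refit on the 15 most important measure variables
imp <- importance(modFit5)
top15 <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)][1:15]
modFit5_small <- randomForest(classe ~ ., data = clean_train_no_time[, c("classe", top15)])
confusionMatrix(clean_test_no_time$classe, predict(modFit5_small, clean_test_no_time))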
Random forest model using PCA (excl. time and username variables).
# Run the model, predict classe, and evaluate model performance
set.seed(89)
# With method = "none", caret skips resampling and fits a single random forest with the supplied mtry
trControl <- trainControl(method = "none")
modFit6 <- train(classe ~ ., data=train, preProcess="pca", method="rf", trControl=trControl, tuneGrid=data.frame(mtry=3))
predicted6 <- predict(modFit6, test)
confusionMatrix(test$classe, predicted6)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1371 7 6 10 1
## B 31 891 24 2 1
## C 73 24 729 27 2
## D 24 0 39 737 4
## E 5 7 17 10 862
##
## Overall Statistics
##
## Accuracy : 0.936
## 95% CI : (0.9288, 0.9427)
## No Information Rate : 0.3067
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9188
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9116 0.9591 0.8945 0.9377 0.9908
## Specificity 0.9929 0.9854 0.9692 0.9837 0.9903
## Pos Pred Value 0.9828 0.9389 0.8526 0.9167 0.9567
## Neg Pred Value 0.9621 0.9904 0.9788 0.9880 0.9980
## Prevalence 0.3067 0.1894 0.1662 0.1603 0.1774
## Detection Rate 0.2796 0.1817 0.1487 0.1503 0.1758
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9523 0.9723 0.9318 0.9607 0.9906
A random forest model using PCA has an accuracy of 93.6%. This is very good, but it doesn't beat the model where we trimmed the correlated variables manually.
Could we do better by trying another method, such as a generalized boosted model (gbm)?
# Run the model, predict classe, and evaluate model performance
modFit7 <-train(classe ~ ., data=clean_train, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 1), verbose = FALSE)
predicted7 <- predict(modFit7, clean_test)
confusionMatrix(clean_test$classe, predicted7)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 2 0 0 0
## B 1 948 0 0 0
## C 0 3 848 4 0
## D 0 0 7 797 0
## E 0 0 0 6 895
##
## Overall Statistics
##
## Accuracy : 0.9953
## 95% CI : (0.993, 0.997)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9941
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9948 0.9918 0.9876 1.0000
## Specificity 0.9994 0.9997 0.9983 0.9983 0.9985
## Pos Pred Value 0.9986 0.9989 0.9918 0.9913 0.9933
## Neg Pred Value 0.9997 0.9987 0.9983 0.9976 1.0000
## Prevalence 0.2843 0.1943 0.1743 0.1646 0.1825
## Detection Rate 0.2841 0.1933 0.1729 0.1625 0.1825
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9994 0.9973 0.9950 0.9929 0.9993
A generalized boosted model, run on the same cleaned data as the first random forest (so still including the username and time variables), has an accuracy of 99.5%, which beats the random forest PCA model. It also edges out the random forest built on only the manually selected measure variables, although that comparison isn't quite fair because the gbm still has access to the time variables. It is also quite a bit slower to train.
Since the random forest including all variables gave us the highest accuracy rate, we'll use that one for the Coursera submission. The hold-out accuracy of our chosen model gives us an estimate of the out-of-sample error rate: 100% - 99.82% = 0.18%, with a 95% confidence interval of roughly 0.08% to 0.35%. However, if we were selecting a model for real-world predictions outside of the experimental conditions of this study, we would select the random forest model that excludes the username and time variables, which is quick to run and still achieves over 99% accuracy.
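For reference, the error estimate and its confidence interval can be pulled straight out of the confusion matrix object for the chosen model (a minimal sketch):
# Estimated out-of-sample error and its 95% CI, derived from the hold-out accuracy of modFit4
cm4 <- confusionMatrix(clean_test$classe, predicted4)
err <- 1 - cm4$overall[c("Accuracy", "AccuracyUpper", "AccuracyLower")]
names(err) <- c("error.estimate", "error.lower", "error.upper")
round(err, 4)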
# Perform same transformations on Test set as on Train set
zy <- as.data.frame(cbind(names(Test),is.na(as.numeric(Test[1,]))))
index <- as.numeric(row.names(zy[zy$V2==F,]))
Test <- Test[,index]
Test <- Test[,-1]
numeric_vars_Test <- Test[,c(2:53)]
numeric_vars_Test <- numeric_vars_Test[,-highlyCor]
clean_Test <- cbind(Test[,c(1,55,56,54)], numeric_vars_Test)
# Order the set by Problem ID
clean_Test <- clean_Test[order(clean_Test$problem_id),]
# Remove the Problem ID as a variable
clean_Test <- clean_Test[-4]
# Predict classe for the Test set using the all-variables random forest model
predicted_final <- predict(modFit4, clean_Test)
# Organize the predictions for easy entry into the quiz page
question_number <- 1:20
data.frame(question_number, predicted_final)
## question_number predicted_final
## 17819 1 B
## 14684 2 A
## 14872 3 B
## 402 4 A
## 11280 5 A
## 16642 6 E
## 16137 7 D
## 14887 8 B
## 4105 9 A
## 7298 10 A
## 5015 11 B
## 15494 12 C
## 11914 13 B
## 14078 14 A
## 16818 15 E
## 13209 16 E
## 17105 17 A
## 5208 18 B
## 17752 19 B
## 11830 20 B
100% score on the quiz. Hooray!