This project develops a model fit to data from a fitness device in order to predict how well a bicep curl was performed. There are five classes of performance: (Class A) exactly according to the specification, (Class B) throwing the elbows to the front, (Class C) lifting the dumbbell only halfway, (Class D) lowering the dumbbell only halfway, and (Class E) throwing the hips to the front. The data was graciously provided by this source: http://groupware.les.inf.puc-rio.br/har
The data is downloaded and loaded.
url.train <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url.test <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.train, "./pmlTraining.csv")
download.file(url.test, "./pmlTesting.csv")
training <- read.csv("./pmlTraining.csv", na.strings = c("NA", "#DIV/0!", ""))
testing <- read.csv("./pmlTesting.csv", na.strings = c("NA", "#DIV/0!", ""))
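As a quick sanity check (not shown in the original output), the dimensions of the two data sets can be inspected before any cleaning:
dim(training); dim(testing)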
The training data is partitioned to create a validation set for modelling. The test.build set will be used to cross-validate the models built on train.build and to estimate the out-of-sample error.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(613603)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
train.build <- training[inTrain,]
test.build <- training[-inTrain,]
Variables with near zero variance are tested for and removed, since modelling this data is problematic without that check. Next, irrelevant variables are removed since they will not improve the model predictions. Lastly, columns that are more than 90% NAs are removed.
## Remove columns with near zero variance
nsv <- nearZeroVar(train.build)
train.build <- train.build[, -nsv]
test.build <- test.build[, -nsv]
testing <- testing[, -nsv]
## Remove the first 6 columns that don't make sense for this model
train.build <- train.build[, -(1:6)]
test.build <- test.build[, -(1:6)]
testing <- testing[, -(1:6)]
## Remove columns with mostly NAs
isna <- is.na(train.build)
Cmeans <- colMeans(isna)
train.build <- train.build[Cmeans <= .9]
test.build <- test.build[Cmeans <= .9]
testing <- testing[Cmeans <= .9]
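A brief sanity check (a sketch added here, not part of the original output) can confirm that the cleaning behaved as intended: the three data sets should share the same number of columns, and no remaining column should be mostly NA.
## Verify matching column counts and the absence of high-NA columns
dim(train.build); dim(test.build); dim(testing)
sum(colMeans(is.na(train.build)) > 0.9)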
Many types of models were attempted, but only the successful ones are presented here. We begin by creating a decision tree.
library(rpart)
set.seed(31834)
rpartFit <- rpart(classe ~ ., method = "class", data = train.build)
predict.Rpart <- predict(rpartFit, test.build, type = "class")
confusionMatrix(predict.Rpart, test.build$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1549 223 21 105 42
## B 33 639 43 19 69
## C 43 106 826 148 142
## D 17 86 63 610 50
## E 32 85 73 82 779
##
## Overall Statistics
##
## Accuracy : 0.7482
## 95% CI : (0.7369, 0.7592)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6798
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9253 0.5610 0.8051 0.6328 0.7200
## Specificity 0.9071 0.9654 0.9097 0.9561 0.9434
## Pos Pred Value 0.7985 0.7958 0.6530 0.7385 0.7412
## Neg Pred Value 0.9683 0.9016 0.9567 0.9300 0.9373
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2632 0.1086 0.1404 0.1037 0.1324
## Detection Prevalence 0.3297 0.1364 0.2150 0.1404 0.1786
## Balanced Accuracy 0.9162 0.7632 0.8574 0.7944 0.8317
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
suppressWarnings(fancyRpartPlot(rpartFit))
Next we will try a random forest model.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(67484)
rfFit <- randomForest(classe ~., data = train.build)
predict.RF <- predict(rfFit, test.build)
confusionMatrix(predict.RF, test.build$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 0 1133 6 0 0
## C 1 0 1018 10 0
## D 0 0 2 954 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9933, 0.997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9947 0.9922 0.9896 0.9982
## Specificity 0.9986 0.9987 0.9977 0.9992 1.0000
## Pos Pred Value 0.9964 0.9947 0.9893 0.9958 1.0000
## Neg Pred Value 0.9998 0.9987 0.9984 0.9980 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1925 0.1730 0.1621 0.1835
## Detection Prevalence 0.2853 0.1935 0.1749 0.1628 0.1835
## Balanced Accuracy 0.9990 0.9967 0.9950 0.9944 0.9991
plot(rfFit)
The plot shows that the random forest error has fallen significantly by approximately 50 trees.
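This can also be checked numerically (an optional sketch, not part of the original analysis): the fitted randomForest object stores the out-of-bag error after each tree in its err.rate matrix.
## OOB error after 50 trees versus after the final tree (default ntree = 500)
rfFit$err.rate[50, "OOB"]
rfFit$err.rate[nrow(rfFit$err.rate), "OOB"]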
NOTE: We also attempted Linear Discriminant Analysis, K-Nearest Neighbor, and Gradient Boosted Machine models; however, these methods were all resource-heavy and provided little or no improvement over the random forest method.
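For reference, a minimal sketch of how one such attempt might look with caret is shown below (illustrative settings only; the seed, resampling scheme, and tuning used in the original attempts are not recorded here):
## Illustrative example: linear discriminant analysis with 5-fold cross-validation
set.seed(12345)
ldaFit <- train(classe ~ ., method = "lda", data = train.build,
                trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(ldaFit, test.build), test.build$classe)$overall[1]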
Finally, we will combine the predictions of both models and fit a random forest on them in an attempt to increase accuracy.
set.seed(983346)
combDF <- data.frame(predict.RF, predict.Rpart ,classe = test.build$classe)
combFit <- train(classe ~ ., method = "rf", data = combDF)
predict.comb <- predict(combFit, combDF)
confusionMatrix(predict.comb, test.build$classe)$overall[1]
## Accuracy
## 0.9954121
However, accuracy is not improved over the random forest alone.
In the above models, we used test.build to cross-validate the models and obtain our out-of-sample error. The estimated out-of-sample error rate is 25.18% for the decision tree and about 0.46% for both the random forest and combined models.
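These figures follow directly from the confusion matrices above, since the estimated out-of-sample error is simply one minus the hold-out accuracy:
## Out-of-sample error estimates from the held-out test.build set
1 - confusionMatrix(predict.Rpart, test.build$classe)$overall["Accuracy"]
1 - confusionMatrix(predict.RF, test.build$classe)$overall["Accuracy"]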
The random forest thus appears to give the best predictions, even when compared with the combined model, given that it has the lowest out-of-sample error rate and uses the least resources to generate. Therefore we compute our final predictions with it.
predict.Final <- predict(rfFit, testing)
predict.Final
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E