The goal of this project was to predict the manner in which six individuals performed various weight lifting exercises based on accelerometer readings taken during their exercises. The authors of the study developed a five-tier classification system which ranked the quality of the participants’ exercises.
The Random Forest algorithm was used to predict the ranking of 20 different test cases based on accelerometer readings taken from the belt, forearm, arm, and dumbbell of the six study participants. On the validation set used in this study, the estimated out-of-sample accuracy of our model was 99.9% to 100% at the 95% confidence level.
# Download training and test sets
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "../data/pml-training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "../data/pml-testing.csv")
# Load training set data
training <- read.csv("../data/pml-training.csv")
# Load testing set data
testing <- read.csv("../data/pml-testing.csv")
library(knitr)
library(ggplot2)
# Explore set dimensions
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
# Identify set column name differences
trainCol <- names(training)
testCol <- names(testing)
setdiff(trainCol, testCol)
## [1] "classe"
setdiff(testCol, trainCol)
## [1] "problem_id"
# Examine grouping of test data
testing$problem_id
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Get the six user names
unique(sort(training$user_name))
## [1] adelmo carlitos charles eurico jeremy pedro
## Levels: adelmo carlitos charles eurico jeremy pedro
# Assess numbers in each class
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
# View histogram of class outcome
# counts in the training data
qplot(training$classe, main = "Training Class Counts")
The five-tier classification system developed by the authors of the study awarded an “A” classification for weight lifting performed according to the specification; a “B” for throwing the elbows to the front; a “C” for lifting the dumbbell only halfway; a “D” for lowering the dumbbell only halfway; and an “E” for throwing the hips to the front. As the table above shows, Class “A” has about 47% more measurements than the next most frequent class (5,580 versus 3,797 for Class “B”), while the counts for the other four classes are fairly evenly distributed (3,216 to 3,797).
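The class balance can be checked directly from the counts in the table above. The sketch below is only an illustration: the counts are copied from the `table(training$classe)` output, and the names `class_counts`, `class_share`, and `a_excess` are our own.

```r
# Class counts copied from the table(training$classe) output above
class_counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of all training rows that falls in each class
class_share <- round(prop.table(class_counts), 3)
print(class_share)

# How much larger Class A is than the next most frequent class,
# expressed as a fraction of that class's count
others <- class_counts[names(class_counts) != "A"]
a_excess <- class_counts[["A"]] / max(others) - 1
print(round(a_excess, 2))
```

With the real data loaded, `prop.table(table(training$classe))` should give the same shares.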
We have six test participants whose first names are listed above. The training and testing sets both have 160 variables. Only one variable differs between the two sets: training has a variable named “classe”, and testing has a variable named “problem_id.”
# Create function to assess number of NAs in data sets
na_count <-function (x) sapply(x, function(y) sum(is.na(y)))
# Identify columns that do not have primarily NA values
boolNotNAs <- na_count(training) < 19000
# Subset test and training sets to filter out
# primarily NA columns
train2 <- training[,boolNotNAs]
test2 <- testing[,boolNotNAs]
# Create function to assess number of blank
# values in factor columns
empty_count <-function (x) sapply(x, function(y) sum(y == ""))
# Identify columns that do not have primarily
# blank factor contents
boolNotEmpty <- empty_count(train2) < 19000
# Subset test and training sets to filter out
# columns with primarily blank factor contents
train3 <- train2[,boolNotEmpty]
test3 <- test2[,boolNotEmpty]
# Filter out index, timestamp, and windowing columns
train4 <- train3[,-c(1,3:7)]
test4 <- test3[,-c(1,3:7)]
# Load caret for data partitioning
library(caret)
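# Create validation set; both subsets are taken from the same
# filtered data, and valid4 is extracted before train4 is
# overwritten, so that no validation row is also a training row
inTrain <- createDataPartition(train4$classe, p = 0.7, list = FALSE)
valid4 <- train4[-inTrain,]
train4 <- train4[inTrain,]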
We are able to reduce the number of variables of concern from 160 to 93 by filtering out continuous training set variables with over 19,000 (97%) NA values. We are able to further reduce the number of variables of concern to 60 by filtering out factor variables with over 19,000 (97%) empty string values.
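The two filtering passes described above can also be expressed as a single test on the fraction of missing entries per column. This is a minimal sketch on a toy data frame with made-up values; the names `toy`, `frac_missing`, and `threshold` are hypothetical, and the 0.5 cutoff is chosen only so that the tiny example shows an effect (the analysis above uses 19,000 of 19,622 rows, about 97%).

```r
# Toy data frame (hypothetical values) standing in for the training set
toy <- data.frame(
  roll_belt     = c(1.4, 1.5, 1.3, 1.6, 1.2),
  max_roll_belt = c(NA, NA, NA, NA, 2.0),      # mostly NA
  kurtosis_belt = c("", "", "", "", "0.5"),    # mostly blank strings
  classe        = c("A", "B", "A", "C", "E"),
  stringsAsFactors = FALSE
)

# Fraction of entries in a column that are NA or an empty string
frac_missing <- function(col) mean(is.na(col) | col == "")

# Keep only columns whose missing fraction does not exceed the threshold
threshold <- 0.5
keep <- sapply(toy, frac_missing) <= threshold
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

Because the logical vector `keep` is computed on the training data only, the same vector can be reused to subset the test set, as is done with `boolNotNAs` and `boolNotEmpty` above.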
We filter out six more variables by eliminating the index, timestamp, and windowing columns, which leaves 54. We do this in order to focus on the sensor measurements rather than on how and when those measurements were collected.
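The validation split created with `createDataPartition` only supports an out-of-sample estimate if the validation rows are disjoint from the training rows. The sketch below illustrates that check with a base-R stand-in for a stratified 70/30 split. It uses toy labels rather than the real data, and the names `full`, `in_train`, `train_part`, and `valid_part` are hypothetical.

```r
set.seed(42)  # arbitrary seed so the toy split is reproducible

# Toy labels (hypothetical) standing in for the filtered training data
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(30, 20, 18, 16, 16)))
full <- data.frame(id = seq_along(classe), classe = classe)

# Base-R stand-in for a stratified split: sample 70% of the rows
# within each class
rows_by_class <- split(seq_len(nrow(full)), full$classe)
in_train <- sort(unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE))

# Take both subsets from the same full data frame so they cannot overlap
train_part <- full[in_train, ]
valid_part <- full[-in_train, ]

# No row may appear in both subsets
stopifnot(length(intersect(train_part$id, valid_part$id)) == 0)
print(c(train = nrow(train_part), valid = nrow(valid_part)))
```

The important design choice is that both subsets index the same original object. If the training subset overwrites that object first, a later negative index selects rows from the training subset itself.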
# Load caret, randomForest
library(caret)
# library(randomForest)
ctrl <- trainControl(method = "oob")
set.seed(1235)
# Train model using Random Forest
modFit <- train(classe ~ ., data = train4, method = "rf", prox = TRUE, importance = TRUE, trControl = ctrl)
# Assess importance of top 20 variables
varImp(modFit)
## rf variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 57)
##
## A B C D E
## roll_belt 81.51 89.01 81.71 84.88 100.00
## pitch_belt 31.02 93.09 63.77 48.64 40.98
## pitch_forearm 62.75 73.72 90.54 60.31 63.01
## magnet_dumbbell_y 68.87 66.11 83.54 64.90 57.85
## magnet_dumbbell_z 76.49 57.26 73.77 52.36 49.98
## yaw_belt 64.57 61.35 69.19 74.46 48.85
## accel_forearm_x 25.49 38.79 35.50 50.84 39.82
## roll_forearm 50.07 40.47 47.78 37.24 38.59
## accel_dumbbell_y 37.68 38.22 44.51 31.95 33.85
## yaw_arm 41.26 32.37 32.95 34.60 23.32
## gyros_dumbbell_y 36.05 33.54 40.63 32.32 26.61
## gyros_belt_z 27.73 35.16 32.96 25.08 40.46
## gyros_arm_y 27.91 39.47 24.35 33.24 23.07
## accel_dumbbell_z 28.63 33.99 25.85 30.77 35.70
## magnet_belt_x 18.76 35.45 32.24 20.88 28.37
## magnet_belt_y 19.62 35.31 29.78 23.46 30.65
## total_accel_dumbbell 18.99 35.07 22.68 29.87 32.11
## magnet_belt_z 25.36 35.01 28.28 34.66 29.09
## gyros_belt_x 34.65 17.88 21.91 12.99 16.91
## roll_dumbbell 25.13 34.65 24.98 28.06 34.22
# Predict outcomes on the validation set
predValid <- predict(modFit, newdata = valid4)
# Assess prediction accuracy
confusionMatrix(predValid, valid4$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1191 0 0 0 0
## B 0 788 0 0 0
## C 0 0 712 0 0
## D 0 0 0 702 0
## E 0 0 0 0 722
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2894
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.000 1.0000 1.0000
## Prevalence 0.2894 0.1915 0.173 0.1706 0.1755
## Detection Rate 0.2894 0.1915 0.173 0.1706 0.1755
## Detection Prevalence 0.2894 0.1915 0.173 0.1706 0.1755
## Balanced Accuracy 1.0000 1.0000 1.000 1.0000 1.0000
# Calculate the error rate of predictions
sum(predValid != valid4$classe)/length(predValid)
## [1] 0
The random forest algorithm builds many classification trees and combines them to predict the class for a given set of inputs. The classification trees “vote” on the final classification for each case, and the classification receiving the most votes wins. According to Leo Breiman and Adele Cutler, the authors of the algorithm, Random Forests are unexcelled in accuracy among current algorithms. The algorithm was selected for this test because it works efficiently on large data sets and can handle thousands of input variables.
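The voting step can be illustrated without fitting a model. The sketch below uses a hypothetical matrix of votes from five trees on three cases; `votes` and `majority_vote` are our own names and are not part of the randomForest or caret packages.

```r
# Hypothetical votes: each row is one case, each column is one tree
votes <- matrix(
  c("A", "A", "B", "A", "C",
    "B", "B", "B", "E", "B",
    "D", "C", "D", "D", "D"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("case", 1:3), paste0("tree", 1:5))
)

# For one case, tally the votes and return the class with the most votes
majority_vote <- function(case_votes) {
  tally <- table(case_votes)
  names(tally)[which.max(tally)]
}

forest_prediction <- apply(votes, 1, majority_vote)
print(forest_prediction)
```

In this sketch, `which.max` resolves ties by taking the first class in alphabetical order, a simplification made here for brevity.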
According to Breiman and Cutler, there is no need for cross-validation or a separate test set to get an unbiased estimate of test set error. An out-of-bag (oob) error estimate is constructed at run time: each tree is grown on a bootstrap sample that omits about one-third of the training cases, and those omitted cases are used to test that tree. The authors have shown these estimates to be unbiased in many tests.
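The "about one-third" figure follows from sampling with replacement: the chance that a given case is never drawn in n draws is (1 - 1/n)^n, which approaches exp(-1), roughly 0.368, as n grows. A small simulation, with a hypothetical sample size and our own variable names, shows this:

```r
set.seed(1235)  # seed value chosen here only for reproducibility

n_cases <- 10000  # hypothetical number of training cases

# One bootstrap sample: draw n_cases row indices with replacement
boot_rows <- sample.int(n_cases, size = n_cases, replace = TRUE)

# Out-of-bag cases are those never drawn for this sample
oob_fraction <- 1 - length(unique(boot_rows)) / n_cases
print(round(oob_fraction, 3))

# Limit of (1 - 1/n)^n as n grows
print(round(exp(-1), 3))
```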
The confusion matrix for our validation set shows every case classified correctly across the five classes used to assess exercise performance, with a 95% confidence interval for accuracy of 99.9% to 100%. Consistent with our misclassification calculation, this corresponds to a negligible estimated out-of-sample error rate.
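The reported interval can be reproduced from the counts alone. To our understanding, caret's `confusionMatrix` obtains the accuracy interval from `binom.test`, which returns an exact (Clopper-Pearson) interval; the sketch below assumes that and uses the diagonal of the confusion matrix above. The names `n_correct`, `n_total`, and `ci` are our own.

```r
# Counts taken from the diagonal of the confusion matrix above;
# every validation case was classified correctly
n_correct <- 1191 + 788 + 712 + 702 + 722
n_total   <- n_correct

# Exact (Clopper-Pearson) 95% interval for a binomial proportion
ci <- binom.test(n_correct, n_total, conf.level = 0.95)$conf.int
print(round(ci, 4))
```

When all cases are correct, the upper bound is 1 and the lower bound equals 0.025^(1/n), so the width of the interval depends only on the number of validation cases.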
We also used the “varImp” function from the caret package to rank the 20 most important predictor variables across all classes.
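The ordering in the importance table can be reproduced by hand: as the output header states, variables are sorted by their maximum importance across the classes. The sketch below copies the first six rows of the table above into a data frame; `imp`, `max_imp`, `top_class`, and `ranking` are our own names.

```r
# First six rows of the varImp(modFit) output above
imp <- data.frame(
  A = c(81.51, 31.02, 62.75, 68.87, 76.49, 64.57),
  B = c(89.01, 93.09, 73.72, 66.11, 57.26, 61.35),
  C = c(81.71, 63.77, 90.54, 83.54, 73.77, 69.19),
  D = c(84.88, 48.64, 60.31, 64.90, 52.36, 74.46),
  E = c(100.00, 40.98, 63.01, 57.85, 49.98, 48.85),
  row.names = c("roll_belt", "pitch_belt", "pitch_forearm",
                "magnet_dumbbell_y", "magnet_dumbbell_z", "yaw_belt")
)

# Largest importance each variable reaches in any class
max_imp <- apply(imp, 1, max)

# Class in which each variable is most important
top_class <- names(imp)[apply(imp, 1, which.max)]

ranking <- data.frame(max_importance = max_imp, top_class = top_class,
                      row.names = rownames(imp))
ranking <- ranking[order(-ranking$max_importance), ]
print(ranking)
```

The second column shows that different variables matter most for different classes, which the single ranked list does not make obvious.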
library(caret)
# Predict outcomes on the test set
predTest <- predict(modFit, newdata = test4)
# Predicted outcomes for our 20 test cases were as follows:
predTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# View histogram of predicted class outcome
# counts in the test data
qplot(predTest, main = "Predicted Test Class Counts")
The random forest model resulted in the above listed 20 predictions for the test set provided for this exercise. We estimate a negligible out-of-sample error rate using the Random Forest algorithm.
The predicted class counts in the set of 20 test cases are much more skewed than the class counts in the training data. Classes A and B together account for 15 of the 20 predictions, far more than the other three classes.
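The skew can be quantified by tabulating the predictions. The sketch below copies the 20 predicted labels from the `predTest` output above; `pred`, `pred_counts`, and `share_ab` are our own names.

```r
# Predictions copied from the predTest output above
pred <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
          "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
pred <- factor(pred, levels = c("A", "B", "C", "D", "E"))

# Number of test cases predicted for each class
pred_counts <- table(pred)
print(pred_counts)

# Share of the 20 test cases predicted as Class A or Class B
share_ab <- sum(pred_counts[c("A", "B")]) / length(pred)
print(share_ab)
```

With only 20 cases, some imbalance is expected by chance, so this comparison is descriptive rather than evidence of a shift in class frequencies.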
Breiman, L.; Cutler, A. Random Forests. Random Forests (tm) is a trademark of Leo Breiman and Adele Cutler.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.