This R Markdown file is the final report for Practical Machine Learning. The goal of this report is to predict the manner in which 6 participants performed a dumbbell exercise, one of five possible ways. I first split the labeled sample into a training and a testing subset, clean both in the same way, and then fit three different models to see which predicts the held-out data best. Based on those results, I chose the boosted regression (GBM) approach and submitted its predictions to the automated grader.
Citation for this data:
Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013). "Qualitative Activity Recognition of Weight Lifting Exercises." Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI.
The five possible outcomes (the classe variable) are:
A: Exactly according to the specification.
B: Throwing the elbows to the front.
C: Lifting the dumbbell only halfway.
D: Lowering the dumbbell only halfway.
E: Throwing the hips to the front.
# Load libraries for modeling and plotting
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(randomForest)
library(repmis)

# Set a seed for reproducibility
set.seed(13192)

# Read the labeled data, treating blanks and division errors as NA
training_data <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))

# 60/40 split of the labeled data into training and testing subsets
inTrain <- createDataPartition(y = training_data$classe, p = 0.6, list = FALSE)
training <- training_data[inTrain, ]
testing_subset <- training_data[-inTrain, ]

# The 20 unlabeled cases for the automated grader
testing_data <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
With the partition created, we clean the data. First, we remove all variables with near-zero variance. Then we drop the row-index column, and finally we remove every variable where at least 1% of the values are missing.
# Identify near-zero-variance variables and drop them
myDataNZV <- nearZeroVar(training, saveMetrics = TRUE)
training_subset <- training[, !myDataNZV$nzv]

# Drop the first column (X, a row index)
training_subset <- training_subset[c(-1)]

# Drop every variable where 1% or more of its values are missing
na_fraction <- colMeans(is.na(training_subset))
training_final <- training_subset[, na_fraction < 0.01]
# Tidy up intermediate objects
rm(training_subset, myDataNZV, inTrain, training, training_data, na_fraction)
# Keep only the cleaned columns in both test sets; the unlabeled
# grader set has no classe column, so exclude it there by name
clean1 <- colnames(training_final)
testing_data <- testing_data[clean1[clean1 != "classe"]]
testing_subset <- testing_subset[clean1]
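The chunk that produced the confusion matrix below does not appear in this section. Judging by the libraries loaded above (rpart, rpart.plot, rattle) and the accuracy that follows, the first model was most likely a single classification tree; here is a minimal sketch of such a chunk, with modelfit_tree and predictions_tree as assumed names:

# Sketch of the (missing) first model: a single CART classification tree.
# modelfit_tree and predictions_tree are assumed names, not from the original.
modelfit_tree <- rpart(classe ~ ., data = training_final, method = "class")
fancyRpartPlot(modelfit_tree)  # tree diagram via rattle
predictions_tree <- predict(modelfit_tree, testing_subset, type = "class")
confusionMatrix(predictions_tree, testing_subset$classe)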
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2163 208 9 2 0
B 52 1120 99 69 0
C 17 182 1234 137 51
D 0 8 26 1015 190
E 0 0 0 63 1201
Overall Statistics
Accuracy : 0.8581
95% CI : (0.8502, 0.8658)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8202
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9691 0.7378 0.9020 0.7893 0.8329
Specificity 0.9610 0.9652 0.9403 0.9659 0.9902
Pos Pred Value 0.9081 0.8358 0.7613 0.8192 0.9502
Neg Pred Value 0.9874 0.9388 0.9785 0.9590 0.9634
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2757 0.1427 0.1573 0.1294 0.1531
Detection Prevalence 0.3036 0.1708 0.2066 0.1579 0.1611
Balanced Accuracy 0.9650 0.8515 0.9212 0.8776 0.9115
Our first model's output shows an overall accuracy of 85.81% on the held-out testing subset (an estimated out-of-sample error of about 14.19%). Not bad, but other models can likely do better.
# 5-fold cross-validation, one repeat
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 1)

# Second model: gradient boosted trees (gbm) via caret
modelfit_boostedr <- train(classe ~ ., data = training_final, method = "gbm",
                           trControl = fitControl,
                           verbose = FALSE)
modelfit_boostedr$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 79 predictors of which 42 had non-zero influence.
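As an aside (not part of the original analysis), the relative influence behind that "42 had non-zero influence" line can be inspected with caret's varImp; gbm_importance is an assumed name:

# Rank predictors by relative influence in the boosted model
gbm_importance <- varImp(modelfit_boostedr)
plot(gbm_importance, top = 20)  # top 20 most influential predictors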
# Evaluate the boosted model on the held-out testing subset
predictions_boostedr <- predict(modelfit_boostedr, newdata = testing_subset)
gbmAccuracyTest <- confusionMatrix(predictions_boostedr, testing_subset$classe)
gbmAccuracyTest
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2230 1 0 0 0
B 2 1508 0 0 0
C 0 1 1358 0 0
D 0 8 10 1283 0
E 0 0 0 3 1442
Overall Statistics
Accuracy : 0.9968
95% CI : (0.9953, 0.9979)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.996
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9991 0.9934 0.9927 0.9977 1.0000
Specificity 0.9998 0.9997 0.9998 0.9973 0.9995
Pos Pred Value 0.9996 0.9987 0.9993 0.9862 0.9979
Neg Pred Value 0.9996 0.9984 0.9985 0.9995 1.0000
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2842 0.1922 0.1731 0.1635 0.1838
Detection Prevalence 0.2843 0.1925 0.1732 0.1658 0.1842
Balanced Accuracy 0.9995 0.9965 0.9963 0.9975 0.9998
# Accuracy across boosting iterations and tree depths
plot(modelfit_boostedr, ylim = c(0.8, 1))
Our second model, the boosted regression, shows 99.68% accuracy! That seems more than enough, but why not try just one more?
# Third model: random forest on the same cleaned training set
modelfit_rforest <- randomForest(classe ~ ., data = training_final)
predictions_rforest <- predict(modelfit_rforest, testing_subset, type = "class")
confusionMatrix(predictions_rforest, testing_subset$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2230 0 0 0 0
B 2 1518 4 0 0
C 0 0 1363 0 0
D 0 0 1 1285 0
E 0 0 0 1 1442
Overall Statistics
Accuracy : 0.999
95% CI : (0.998, 0.9996)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9987
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9991 1.0000 0.9963 0.9992 1.0000
Specificity 1.0000 0.9991 1.0000 0.9998 0.9998
Pos Pred Value 1.0000 0.9961 1.0000 0.9992 0.9993
Neg Pred Value 0.9996 1.0000 0.9992 0.9998 1.0000
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2842 0.1935 0.1737 0.1638 0.1838
Detection Prevalence 0.2842 0.1942 0.1737 0.1639 0.1839
Balanced Accuracy 0.9996 0.9995 0.9982 0.9995 0.9999
confusionmatrix2 <- confusionMatrix(predictions_rforest, testing_subset$classe)
plot(confusionmatrix2$table, col = confusionmatrix2$byClass,
     main = paste("Random Forest Confusion Matrix: Accuracy =",
                  round(confusionmatrix2$overall['Accuracy'], 4)))
Our third model, a random forest, showed 99.9% accuracy.
However, I like the methodology behind a boosted regression more than a random forest, and I find its output more visually appealing. I'm fine with accepting a 0.22% lower accuracy rate in exchange for interpretability, so I have chosen the boosted regression as my final model. This means I am estimating an out-of-sample error rate of 0.32%.
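As a quick sanity check on those figures, the estimated out-of-sample errors can be read straight off the two held-out confusion matrices (a sketch using the gbmAccuracyTest and confusionmatrix2 objects created above; gbm_error and rf_error are assumed names):

# Out-of-sample error estimate = 1 minus held-out accuracy
gbm_error <- 1 - unname(gbmAccuracyTest$overall['Accuracy'])   # ~0.0032
rf_error  <- 1 - unname(confusionmatrix2$overall['Accuracy'])  # ~0.0010
round(100 * c(gbm = gbm_error, rf = rf_error), 2)              # in percent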
# Generate predictions for the 20 grader cases using the chosen boosted model
predict_final <- predict(modelfit_boostedr, newdata = testing_data)

# Write one prediction per file, as the automated grader expects
prediction_write <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
prediction_write(predict_final)