Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
I intend to evaluate each of the variables in the dataset to determine which are uninformative for predicting classe.
Once the data are transformed into the proper format, I'll apply four different models (SVM, gradient boosting, CART, and random forest) to determine which fits best.
I will then test each model against the validation set to look for over-fitting.
I will then use that model to predict on the testing set.
I have relied heavily on the work of Max Kuhn to streamline the model testing process through caret.
http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
Kuhn & Johnson, Applied Predictive Modeling (Springer, 2013)
setwd("C:/Users/Kier/Documents/Analytics Course/07_PracticalMachineLearning")
suppressMessages(library(caret))
suppressMessages(library(ggplot2))
suppressMessages(library(plyr))
suppressMessages(library(tidyverse))
suppressMessages(library(rattle))
suppressMessages(library(partykit))
suppressMessages(library(randomForest))
Some of these models take a long time to build, so instead of re-running the training process I will load the models from RDS files. I will show the code used to build the models in case you want to run it on your own.
tr_fit_rpart <- readRDS("tr_fit_rpart.RDS")
tr_fit_rf <- readRDS("tr_fit_rf.RDS")
tr_fit_svm_caret <- readRDS("tr_fit_svm_caret.RDS")
tr_fit_gbm <- readRDS("tr_fit_gbm.RDS")
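For reference, these cached files can be produced with saveRDS after the corresponding train() and randomForest() calls shown later; a minimal sketch (not run here, the file names simply match the readRDS calls above):
# Cache the fitted models so the report can be re-knit without re-training.
saveRDS(tr_fit_rpart, "tr_fit_rpart.RDS")
saveRDS(tr_fit_rf, "tr_fit_rf.RDS")
saveRDS(tr_fit_svm_caret, "tr_fit_svm_caret.RDS")
saveRDS(tr_fit_gbm, "tr_fit_gbm.RDS")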
training_raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing_raw <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))
# Thankfully the na.strings argument exists to deal with those #DIV/0! values.
set.seed(317)
in_train <- createDataPartition(training_raw$classe, p = 0.7, list = FALSE)
validation_raw <- training_raw[-in_train, ]
tr1 <- training_raw[in_train, ]
set_for_removal <- tr1[,colSums(is.na(tr1))/nrow(tr1) >= 0.50]
names_for_removal <- names(set_for_removal)
cols_for_removal <- which(names(tr1) %in% names_for_removal)
tr2 <- tr1[, -cols_for_removal]
# Down to 60 variables
nsv <- nearZeroVar(tr2)
tr3 <- tr2[, -nsv]
# Down to 59 variables
tr4 <- tr3[, -c(1:6)]
# Down to 53 variables
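A couple of quick sanity checks at this point (the counts below reflect this particular run of the data partition):
dim(tr4)        # 13737 rows and 53 columns in this run
sum(is.na(tr4)) # should be 0 if the remaining columns are complete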
Ready for pre-processing
trainX <- tr4[, names(tr4) != "classe"]
trainPreProcValues <- preProcess(trainX, method = c("center", "scale"))
trainScaled <- predict(trainPreProcValues, trainX)
# This is really nice. It does all the work for you.
# Drop highly correlated predictors; high_corr and trainFiltered are reused
# for the validation and test sets below. The 0.75 cutoff is an assumption.
correlations <- cor(trainScaled)
high_corr <- findCorrelation(correlations, cutoff = 0.75)
trainFiltered <- trainScaled[, -high_corr]
training <- cbind(classe = tr4$classe, trainFiltered)
validation_classe <- validation_raw$classe
v1 <- validation_raw[, -cols_for_removal]
v2 <- v1[, -nsv]
v3 <- v2[, -c(1:6)]
valX <- v3[, names(v3) != "classe"]
valPreProcValues <- preProcess(valX, method = c("center", "scale"))
valScaled <- predict(valPreProcValues, valX)
valFiltered <- valScaled[, -high_corr]
validation <- cbind(classe = validation_classe, valFiltered)
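Note that the validation set is centered and scaled with its own preProcess object here. A common alternative is to reuse trainPreProcValues so the validation predictors are scaled with the training-set means and standard deviations; a sketch of that variant (not what was run above; valScaled_alt is just an illustrative name):
# Alternative: apply the training-set preprocessing to the validation predictors
# instead of re-estimating centers and scales from the validation data.
valScaled_alt <- predict(trainPreProcValues, valX)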
problem_id <- testing_raw$problem_id
te1 <- testing_raw[, -cols_for_removal]
te2 <- te1[, -nsv]
te3 <- te2[, -c(1:6, 59)]
testX <- te3
testPreProcValues <- preProcess(testX, method = c("center", "scale"))
testScaled <- predict(testPreProcValues, testX)
testFiltered <- testScaled[, -high_corr]
testing <- testFiltered %>%
mutate(classe = NA_character_) # Add the classe variable to test since it doesn't currently exist.
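A quick sanity check one can add at this point: the engineered test set should carry the same predictors as the training set.
# TRUE if every training predictor is also present in the test set.
all(setdiff(names(training), "classe") %in% names(testing))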
rm(te1, te2, te3, tr1, tr2, tr3, tr4, testScaled, trainScaled,
set_for_removal, testX, trainX, v1, v2, v3, valScaled, valX,
cols_for_removal, names_for_removal, nsv, testPreProcValues,
trainPreProcValues, valPreProcValues, testFiltered, valFiltered,
trainFiltered, correlations)
Kuhn recommends starting with black-box models such as SVM and GBM and then checking whether any simpler models produce similar results. Black-box models tend to produce better results at the expense of interpretability. Simpler models are more interpretable and sometimes produce very similar results.
In this case caret will run 10-fold cross-validation on the training set, repeated three times. This takes longer but gives more stable performance estimates.
cvCtrl <- trainControl(method = "repeatedcv", repeats = 3, savePred=TRUE)
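Once a model is trained with this control object, caret keeps the per-resample performance (and, because savePred = TRUE, the hold-out predictions as well). For example, for the SVM fit built below:
# Per-resample Accuracy/Kappa (10 folds x 3 repeats = 30 rows) and the saved
# hold-out predictions.
head(tr_fit_svm_caret$resample)
head(tr_fit_svm_caret$pred)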
The radial-kernel support vector machine (method = "svmRadial") has the best name by far. It also produces some good results.
tr_fit_svm_caret <- train(classe ~ ., data = training,
method = "svmRadial",
tuneGrid = data.frame(.C = c(.25, .5, 1),
.sigma = .05),
trControl = cvCtrl)
I like to apply the model to both the training and validation sets to see if there is a large gap in results. A large gap in accuracy between the training set and the validation set may mean the model is over-fitting the data it was trained on.
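To make that comparison quick to repeat for each model, here is a small helper function (hypothetical, not part of caret or the original pipeline) that reports both accuracies and the gap:
# Report training accuracy, validation accuracy, and the gap between them
# for any fitted model that predict() accepts.
accuracy_gap <- function(fit, train_df, val_df) {
  train_acc <- mean(predict(fit, train_df) == train_df$classe)
  val_acc   <- mean(predict(fit, val_df) == val_df$classe)
  c(train = train_acc, validation = val_acc, gap = train_acc - val_acc)
}
# e.g. accuracy_gap(tr_fit_svm_caret, training, validation)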
tr_pred_svm_caret <- suppressMessages(predict(tr_fit_svm_caret, training))
confusionMatrix(tr_pred_svm_caret, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3905 98 0 0 0
## B 1 2548 24 0 0
## C 0 12 2351 161 31
## D 0 0 19 2087 41
## E 0 0 2 4 2453
##
## Overall Statistics
##
## Accuracy : 0.9714
## 95% CI : (0.9685, 0.9741)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9638
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9997 0.9586 0.9812 0.9267 0.9715
## Specificity 0.9900 0.9977 0.9820 0.9948 0.9995
## Pos Pred Value 0.9755 0.9903 0.9202 0.9721 0.9976
## Neg Pred Value 0.9999 0.9901 0.9960 0.9858 0.9936
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1855 0.1711 0.1519 0.1786
## Detection Prevalence 0.2914 0.1873 0.1860 0.1563 0.1790
## Balanced Accuracy 0.9949 0.9782 0.9816 0.9608 0.9855
# Accuracy 97.14%; Kappa 0.9638
val_pred_svm_caret <- predict(tr_fit_svm_caret, validation)
confusionMatrix(val_pred_svm_caret, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1668 76 3 0 3
## B 1 909 19 0 0
## C 1 90 989 86 31
## D 1 26 15 874 29
## E 3 38 0 4 1019
##
## Overall Statistics
##
## Accuracy : 0.9276
## 95% CI : (0.9207, 0.9341)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9084
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.7981 0.9639 0.9066 0.9418
## Specificity 0.9805 0.9958 0.9572 0.9856 0.9906
## Pos Pred Value 0.9531 0.9785 0.8262 0.9249 0.9577
## Neg Pred Value 0.9985 0.9536 0.9921 0.9818 0.9869
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2834 0.1545 0.1681 0.1485 0.1732
## Detection Prevalence 0.2974 0.1579 0.2034 0.1606 0.1808
## Balanced Accuracy 0.9885 0.8969 0.9606 0.9461 0.9662
# Accuracy 92.76%; Kappa 0.9084
This is not a bad way to start. Predicting over 90% correct on the validation set is very promising.
Gradient boosting (gbm) is another black-box model. Let’s see how it does…
tr_fit_gbm <- train(classe ~ ., data = training,
method = "gbm",
trControl = cvCtrl,
verbose = FALSE)
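caret tunes gbm over its default grid (number of trees, interaction depth, and so on); the winning combination and the tuning profile can be inspected, for example (output not shown):
# Tuning parameters selected by repeated cross-validation, and the accuracy
# profile across the default tuning grid.
tr_fit_gbm$bestTune
plot(tr_fit_gbm)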
tr_pred_gbm <- suppressMessages(predict(tr_fit_gbm, training))
confusionMatrix(tr_pred_gbm, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3852 79 1 5 11
## B 32 2490 101 20 26
## C 12 76 2263 83 31
## D 9 8 30 2129 21
## E 1 5 1 15 2436
##
## Overall Statistics
##
## Accuracy : 0.9587
## 95% CI : (0.9553, 0.962)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9478
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9862 0.9368 0.9445 0.9454 0.9648
## Specificity 0.9902 0.9838 0.9822 0.9941 0.9980
## Pos Pred Value 0.9757 0.9329 0.9181 0.9690 0.9910
## Neg Pred Value 0.9945 0.9848 0.9882 0.9893 0.9921
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2804 0.1813 0.1647 0.1550 0.1773
## Detection Prevalence 0.2874 0.1943 0.1794 0.1599 0.1789
## Balanced Accuracy 0.9882 0.9603 0.9633 0.9697 0.9814
# Accuracy is 95.87% and Kappa is 0.9478
val_pred_gbm <- predict(tr_fit_gbm, validation)
confusionMatrix(val_pred_gbm, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1637 97 23 18 4
## B 9 872 54 28 23
## C 10 108 861 34 33
## D 17 46 60 858 22
## E 1 16 28 26 1000
##
## Overall Statistics
##
## Accuracy : 0.8884
## 95% CI : (0.88, 0.8963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8585
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9779 0.7656 0.8392 0.8900 0.9242
## Specificity 0.9663 0.9760 0.9619 0.9705 0.9852
## Pos Pred Value 0.9202 0.8844 0.8231 0.8554 0.9337
## Neg Pred Value 0.9910 0.9455 0.9659 0.9783 0.9830
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2782 0.1482 0.1463 0.1458 0.1699
## Detection Prevalence 0.3023 0.1675 0.1777 0.1704 0.1820
## Balanced Accuracy 0.9721 0.8708 0.9006 0.9303 0.9547
# Accuracy is 88.84% and Kappa is 0.8585.
Accuracies of ~96% and ~89%, respectively.
Next, a simpler and more interpretable model: a single classification tree (CART, via rpart).
tr_fit_rpart <- train(classe ~ ., data = training, method = "rpart",
tuneLength = 50,
trControl = cvCtrl)
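One advantage of a single tree is that it can be drawn. The rattle package (loaded above) provides fancyRpartPlot for this, although with tuneLength = 50 the selected tree is fairly deep, so the plot is mostly illustrative (output not shown):
# Draw the final CART tree selected by cross-validation.
fancyRpartPlot(tr_fit_rpart$finalModel)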
# Predict on training set
train_pred_rpart <- suppressMessages(predict.train(tr_fit_rpart, newdata = training))
confusionMatrix(train_pred_rpart, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3803 50 14 27 19
## B 51 2495 67 27 43
## C 20 52 2261 50 42
## D 17 34 34 2132 38
## E 15 27 20 16 2383
##
## Overall Statistics
##
## Accuracy : 0.9517
## 95% CI : (0.948, 0.9553)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.939
## Mcnemar's Test P-Value : 0.000863
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9736 0.9387 0.9437 0.9467 0.9438
## Specificity 0.9888 0.9830 0.9855 0.9893 0.9930
## Pos Pred Value 0.9719 0.9299 0.9324 0.9455 0.9683
## Neg Pred Value 0.9895 0.9853 0.9881 0.9895 0.9874
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2768 0.1816 0.1646 0.1552 0.1735
## Detection Prevalence 0.2849 0.1953 0.1765 0.1642 0.1792
## Balanced Accuracy 0.9812 0.9609 0.9646 0.9680 0.9684
# Predict on validation set
val_pred_rpart <- predict.train(tr_fit_rpart, newdata = validation)
confusionMatrix(val_pred_rpart, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1525 121 36 39 6
## B 85 738 84 34 41
## C 25 109 751 30 51
## D 22 62 128 834 27
## E 17 109 27 27 957
##
## Overall Statistics
##
## Accuracy : 0.8165
## 95% CI : (0.8064, 0.8263)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7678
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9110 0.6479 0.7320 0.8651 0.8845
## Specificity 0.9520 0.9486 0.9558 0.9514 0.9625
## Pos Pred Value 0.8830 0.7515 0.7774 0.7773 0.8417
## Neg Pred Value 0.9642 0.9182 0.9441 0.9730 0.9737
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2591 0.1254 0.1276 0.1417 0.1626
## Detection Prevalence 0.2935 0.1669 0.1641 0.1823 0.1932
## Balanced Accuracy 0.9315 0.7983 0.8439 0.9083 0.9235
Accuracies of ~95% and ~82%, respectively.
I had a hard time getting a random forest to run through caret, so I used the functions in the randomForest package directly.
tr_fit_rf <- randomForest(classe ~ ., data = training)
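randomForest also reports an out-of-bag (OOB) error estimate and variable importance, which complement the hold-out checks below; a quick look (output not shown):
# Printing the fit shows the OOB error estimate and per-class confusion;
# varImpPlot shows which predictors drive the splits.
print(tr_fit_rf)
varImpPlot(tr_fit_rf)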
# Predict on training set
train_pred_rf <- predict(tr_fit_rf, training)
confusionMatrix(train_pred_rf, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3906 0 0 0 0
## B 0 2658 0 0 0
## C 0 0 2396 0 0
## D 0 0 0 2252 0
## E 0 0 0 0 2525
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Predict on validation set
val_pred_rf <- predict(tr_fit_rf, validation)
confusionMatrix(val_pred_rf, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1664 49 8 5 0
## B 4 1042 18 1 0
## C 2 30 974 25 8
## D 0 1 26 933 2
## E 4 17 0 0 1072
##
## Overall Statistics
##
## Accuracy : 0.966
## 95% CI : (0.9611, 0.9705)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.957
## Mcnemar's Test P-Value : 3.456e-13
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9940 0.9148 0.9493 0.9678 0.9908
## Specificity 0.9853 0.9952 0.9866 0.9941 0.9956
## Pos Pred Value 0.9641 0.9784 0.9374 0.9699 0.9808
## Neg Pred Value 0.9976 0.9799 0.9893 0.9937 0.9979
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2828 0.1771 0.1655 0.1585 0.1822
## Detection Prevalence 0.2933 0.1810 0.1766 0.1635 0.1857
## Balanced Accuracy 0.9897 0.9550 0.9680 0.9810 0.9932
Interesting. Accuracies of 100% and ~97%, respectively. Perfect accuracy on the training set is not surprising when a random forest predicts on the same data it was grown on, so the validation figure is the more meaningful one.
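Because the three caret fits share the same cvCtrl resampling scheme, their cross-validated estimates can also be compared side by side with resamples(); the random forest was fit outside caret, so it is not included in this comparison (a sketch, output not shown):
# Cross-validated Accuracy/Kappa for the caret-fitted models, side by side.
cv_compare <- resamples(list(SVM = tr_fit_svm_caret,
                             GBM = tr_fit_gbm,
                             CART = tr_fit_rpart))
summary(cv_compare)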
SVM and Random Forest produced very high accuracies. Let’s apply each of them to the testing set to see what their results are.
(test_pred_svm_caret <- predict(tr_fit_svm_caret, testing))
## [1] B A A A A E D B A A A C B A E E A B B B
## Levels: A B C D E
(test_pred_rf <- predict(tr_fit_rf, testing))
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## E A A A D B D B A A B C D A E E E B E B
## Levels: A B C D E
confusionMatrix(test_pred_rf, test_pred_svm_caret)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 6 0 0 0 0
## B 1 3 0 0 1
## C 0 0 1 0 0
## D 1 1 0 1 0
## E 1 2 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.65
## 95% CI : (0.4078, 0.8461)
## No Information Rate : 0.45
## P-Value [Acc > NIR] : 0.05803
##
## Kappa : 0.5286
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6667 0.5000 1.00 1.0000 0.6667
## Specificity 1.0000 0.8571 1.00 0.8947 0.8235
## Pos Pred Value 1.0000 0.6000 1.00 0.3333 0.4000
## Neg Pred Value 0.7857 0.8000 1.00 1.0000 0.9333
## Prevalence 0.4500 0.3000 0.05 0.0500 0.1500
## Detection Rate 0.3000 0.1500 0.05 0.0500 0.1000
## Detection Prevalence 0.3000 0.2500 0.05 0.1500 0.2500
## Balanced Accuracy 0.8333 0.6786 1.00 0.9474 0.7451
Only 13 of the 20 test cases produced the same results between SVM and RandomForest.
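The same agreement count can be read directly from the two prediction vectors:
# Number and proportion of the 20 test cases where the two models agree.
sum(test_pred_rf == test_pred_svm_caret)
mean(test_pred_rf == test_pred_svm_caret)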
In this case I would choose random forest as the model because of its accuracy scores and because its results are easier to interpret (for example, through variable importance). Based on the validation results, the expected out-of-sample error is roughly 1 - 0.966, or about 3.4%. I could further refine the random forest model by tuning it to take out some of the complexity while producing similar results.
test_pred_rf
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## E A A A D B D B A A B C D A E E E B E B
## Levels: A B C D E