In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The description of the experiment can be found at ‘http://groupware.les.inf.puc-rio.br/har’ under the section ‘Weight Lifting Exercise Dataset’. The data set was provided by Velloso et al. (1).
The training dataset was split into training, test, and validation subsets. We trained four different models on the training subset and evaluated each of them on the test subset. Based on the confusionMatrix() results we chose the random forest model as our final model, with an accuracy of 0.9879 and a 95% confidence interval of (0.984, 0.991). This final model was then validated on our validation subset, where the accuracy dropped by only 0.0017, or 0.17%: it remained very high at 0.9862 with a 95% confidence interval of (0.9829, 0.9891).
Six participants were asked to perform barbell lifts under 5 different conditions, classified as groups A to E, all under the supervision of an experienced observer.
A : exactly according to the specification (correctly)
B : throwing the elbows to the front (incorrectly)
C : lifting the dumbbell only halfway (incorrectly)
D : lowering the dumbbell only halfway (incorrectly)
E : throwing the hips to the front (incorrectly)
Two data sets are downloaded: a training set and a quiz set for the Practical Machine Learning prediction quiz on Coursera.
urlTrainingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTestingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
date_downloaded <- date()
download.file(urlTrainingData, destfile = "training.csv")
download.file(urlTestingData, destfile = "testing.csv")
training <- read.csv("training.csv",strip.white = TRUE, na.strings = c("#DIV/0!", "NA"))
dim(training) #[1] 19622 160
quizSet <- read.csv("testing.csv", strip.white=TRUE, na.strings = c("#DIV/0!", "NA")) #Prediction Quiz
The number of variables in the training set will be reduced so that only variables from the accelerometers on the belt, forearm, arm, and dumbbell are retained and used to build our prediction models. Moreover, further variables that could help predict the outcome are retained; the outcome, named ‘classe’ in the data set, represents one of the five groups (A - E). In the end, of the original 160 variables, 29 predictors and 1 outcome are used to build our models.
summary(training)
#looked at the dataset and wrote down all the variables that are necessary for this assignment.
asked <- c("classe", "roll_belt","pitch_belt", "yaw_belt" ,"accel_belt_x","accel_belt_y", "accel_belt_z", "magnet_belt_x", "magnet_belt_y", "magnet_belt_z", "accel_arm_x", "accel_arm_y", "accel_arm_z", "roll_forearm", "pitch_forearm", "yaw_forearm" ,"accel_forearm_x", "accel_forearm_y", "accel_forearm_z", "magnet_forearm_x", "magnet_forearm_y","magnet_forearm_z", "roll_dumbbell", "pitch_dumbbell" ,"accel_dumbbell_x", "accel_dumbbell_y", "accel_dumbbell_z", "magnet_dumbbell_x", "magnet_dumbbell_y", "magnet_dumbbell_z")
training <- training[,asked]
sum(!complete.cases(training)) #0, thus only complete cases
dim(training) #[1] 19622 30
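As an optional sanity check (a sketch, not part of the original cleaning step), one could verify that none of the retained predictors are near-zero-variance using caret's nearZeroVar():
library(caret)
nzv <- nearZeroVar(training[, -1], saveMetrics = TRUE) # column 1 is the outcome 'classe'
sum(nzv$nzv) # expected to be 0 for these raw sensor readings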
With the cleaned data set we will build four different models.
The models will be constructed with the caret package and the appropriate methods. After building the four models, their accuracy will be measured with confusionMatrix(), and those results will decide our final model. The final model will then be tested on a validation subset.
The number of observations in the original training data set allows us to construct a training subset, a test subset and a validation subset. The training subset will be used to train the four models, which will then be tested on the test subset. Only the final model will be validated. Since the in-sample error is always lower than the out-of-sample error, we expect the models to perform worse on new data. The expected drop in accuracy (where accuracy = 1 - out-of-sample error) will be given for each model as a personal guess. An accuracy of 20% would mean the model is no better than guessing at random among the five classes.
set.seed(12121)#in order to be reproducible for others
library(caret)
library(randomForest)
inBuild <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validation <- training[-inBuild,]
buildData <- training[inBuild,]
inTrain <- createDataPartition(y = buildData$classe, p = 0.7, list = FALSE)
trainSet <- buildData[inTrain,]
testSet <- buildData[-inTrain,]
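A quick look at the resulting subsets (a sketch; the exact counts depend on the seed above) confirms the split and a similar class distribution in each subset:
dim(trainSet); dim(testSet); dim(validation)
round(prop.table(table(trainSet$classe)), 3)
round(prop.table(table(validation$classe)), 3)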
This model provides very good accuracy on the test subset, but it takes a very long time to train. Due to possible overfitting, the out-of-sample error could be considerably higher, so we expect the accuracy to drop to around 90% or lower on new data. The varImp method shows us the 20 most important variables for its predictions.
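Note that allowParallel = TRUE below only has an effect if a parallel backend has been registered beforehand. A minimal sketch using the doParallel package (an assumption; this was not part of the original run):
library(doParallel)
cl <- makeCluster(max(1, parallel::detectCores() - 1)) # leave one core free
registerDoParallel(cl)
# ... call train() with allowParallel = TRUE ...
stopCluster(cl)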
setting <- trainControl(method = "cv", number = 4, allowParallel = TRUE) #4-fold CV to work faster, but still very slow :-(
modRF <- train(classe ~ ., data = trainSet, method = "rf", trControl = setting)
predictionsRF <- predict(modRF, newdata = testSet)
confusionMatrix(predictionsRF, testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1164 16 0 0 0
## B 5 770 1 0 0
## C 1 10 714 4 1
## D 0 1 3 670 6
## E 1 0 0 1 750
##
## Overall Statistics
##
## Accuracy : 0.9879
## 95% CI : (0.984, 0.991)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9846
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9940 0.9661 0.9944 0.9926 0.9908
## Specificity 0.9946 0.9982 0.9953 0.9971 0.9994
## Pos Pred Value 0.9864 0.9923 0.9781 0.9853 0.9973
## Neg Pred Value 0.9976 0.9919 0.9988 0.9985 0.9979
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2827 0.1870 0.1734 0.1627 0.1821
## Detection Prevalence 0.2865 0.1884 0.1773 0.1651 0.1826
## Balanced Accuracy 0.9943 0.9822 0.9949 0.9948 0.9951
varImp(modRF)
## rf variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## roll_belt 100.000
## pitch_forearm 59.798
## yaw_belt 55.207
## pitch_belt 49.762
## roll_forearm 44.370
## magnet_dumbbell_y 43.342
## magnet_dumbbell_z 42.815
## accel_dumbbell_y 26.249
## magnet_dumbbell_x 18.320
## roll_dumbbell 17.455
## accel_dumbbell_z 16.929
## accel_forearm_x 16.811
## magnet_belt_z 16.294
## accel_belt_z 15.300
## magnet_forearm_z 14.392
## magnet_belt_y 12.717
## magnet_belt_x 11.324
## accel_arm_x 10.933
## accel_forearm_z 7.406
## magnet_forearm_x 5.074
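The ranked importances can also be visualized (a sketch; the resulting plot is not shown here):
plot(varImp(modRF), top = 20) # top 20 predictors by importance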
This model provides a much lower accuracy than the random forest model and could only be helpful in combination with our first model. The expected out-of-sample error on a new sample could be around 40% and, if we are unlucky, above 50%.
library(MASS)
modLDA <- train(classe ~ ., data = trainSet, method = "lda")
predictionsLDA <- predict(modLDA, newdata = testSet)
confusionMatrix(predictionsLDA, testSet$classe) #Accuracy 0.6455 (0.6306, 0.6601)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 873 158 116 47 32
## B 79 427 46 78 124
## C 118 123 468 59 63
## D 89 44 79 447 95
## E 12 45 9 44 443
##
## Overall Statistics
##
## Accuracy : 0.6455
## 95% CI : (0.6306, 0.6601)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5512
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7455 0.5358 0.6518 0.6622 0.5852
## Specificity 0.8802 0.9015 0.8932 0.9108 0.9673
## Pos Pred Value 0.7121 0.5663 0.5632 0.5928 0.8011
## Neg Pred Value 0.8970 0.8900 0.9239 0.9322 0.9119
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2120 0.1037 0.1136 0.1085 0.1076
## Detection Prevalence 0.2977 0.1831 0.2018 0.1831 0.1343
## Balanced Accuracy 0.8129 0.7186 0.7725 0.7865 0.7762
This model provides very good results and a very high accuracy. By combining it with our first model, we could improve the accuracy on new samples (a stacking sketch is shown after the results below). The expected accuracy on new samples will be lower, giving a higher out-of-sample error, but we cannot estimate by how much.
library(plyr)
library(survival)
library(splines)
library(parallel)
library(ggplot2)
library(gbm)
#takes a lot of time
modB <- train(classe ~ ., data = trainSet, method = "gbm", verbose = FALSE)
predictionsB <- predict(modB, newdata = testSet)
confusionMatrix(predictionsB, testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1143 32 0 2 1
## B 17 734 16 6 6
## C 5 27 693 19 5
## D 1 3 6 646 12
## E 5 1 3 2 733
##
## Overall Statistics
##
## Accuracy : 0.959
## 95% CI : (0.9524, 0.9648)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9481
## Mcnemar's Test P-Value : 0.0001592
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9761 0.9210 0.9652 0.9570 0.9683
## Specificity 0.9881 0.9864 0.9835 0.9936 0.9967
## Pos Pred Value 0.9703 0.9422 0.9252 0.9671 0.9852
## Neg Pred Value 0.9905 0.9811 0.9926 0.9916 0.9929
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2776 0.1782 0.1683 0.1569 0.1780
## Detection Prevalence 0.2861 0.1892 0.1819 0.1622 0.1807
## Balanced Accuracy 0.9821 0.9537 0.9744 0.9753 0.9825
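As mentioned above, the boosting predictions could in principle be stacked with the random forest predictions. A minimal sketch (not performed in this analysis; note that evaluating the combiner on the validation subset would consume it, so in practice a separate holdout would be preferable):
stackTrain <- data.frame(rf = predictionsRF, gbm = predictionsB, classe = testSet$classe) # combine test-subset predictions
modStack <- train(classe ~ ., data = stackTrain, method = "rf") # simple combiner
stackVal <- data.frame(rf = predict(modRF, validation), gbm = predict(modB, validation))
confusionMatrix(predict(modStack, stackVal), validation$classe)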
This model has the worst performance so far, with an accuracy of only around 50% on the test subset. The out-of-sample error will be higher still, but we cannot estimate by how much.
library(rattle)
modT <- train(classe ~ ., data = trainSet, method = "rpart")
predictionsT <- predict(modT, testSet)
confusionMatrix(predictionsT, testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1054 319 338 291 111
## B 18 272 30 106 86
## C 95 206 350 278 208
## D 0 0 0 0 0
## E 4 0 0 0 352
##
## Overall Statistics
##
## Accuracy : 0.4925
## 95% CI : (0.4771, 0.5079)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3374
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9001 0.34128 0.48747 0.0000 0.46499
## Specificity 0.6407 0.92773 0.76853 1.0000 0.99881
## Pos Pred Value 0.4988 0.53125 0.30783 NaN 0.98876
## Neg Pred Value 0.9416 0.85441 0.87655 0.8361 0.89234
## Prevalence 0.2844 0.19354 0.17436 0.1639 0.18383
## Detection Rate 0.2559 0.06605 0.08499 0.0000 0.08548
## Detection Prevalence 0.5131 0.12433 0.27610 0.0000 0.08645
## Balanced Accuracy 0.7704 0.63451 0.62800 0.5000 0.73190
fancyRpartPlot(modT$finalModel)
Our first model showed such a high accuracy that it is used as a standalone model on our validation set, where it achieves an accuracy of 0.9862 with a 95% confidence interval of (0.9829, 0.9891). The drop in accuracy due to the out-of-sample error was much smaller than expected: only 0.0017, or 0.17%.
predictions <- predict(modRF, validation)
confusionMatrix(predictions, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1665 14 0 0 0
## B 6 1111 9 0 4
## C 2 12 1012 16 1
## D 0 1 5 946 7
## E 1 1 0 2 1070
##
## Overall Statistics
##
## Accuracy : 0.9862
## 95% CI : (0.9829, 0.9891)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9826
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9946 0.9754 0.9864 0.9813 0.9889
## Specificity 0.9967 0.9960 0.9936 0.9974 0.9992
## Pos Pred Value 0.9917 0.9832 0.9703 0.9864 0.9963
## Neg Pred Value 0.9979 0.9941 0.9971 0.9963 0.9975
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2829 0.1888 0.1720 0.1607 0.1818
## Detection Prevalence 0.2853 0.1920 0.1772 0.1630 0.1825
## Balanced Accuracy 0.9956 0.9857 0.9900 0.9893 0.9940
predictionsQuiz <- predict(modRF, quizSet)
quizDF <- data.frame(predictions = predictionsQuiz, quizSet)
#quizDF[,1:2] passed 20/20
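For submission, each quiz prediction could be written to its own text file; a minimal helper sketch (the function name and file names are assumptions, not part of the original analysis):
writeQuizFiles <- function(x) {
  # Write one file per prediction: problem_id_1.txt, problem_id_2.txt, ...
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
#writeQuizFiles(as.character(predictionsQuiz))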