Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions. The fashions are recorded in the data as the variable "classe", a factor with levels A to E. Below are the descriptions for each level:
A: Correctly
B: Throwing elbows
C: Lifting halfway
D: Lowering halfway
E: Throwing hips
As shown, level A represents a correct lift, while the remaining levels are common mistakes made during the dumbbell curl. The outcome to predict is classe, and all remaining variables are used as predictors. Decision tree and random forest models are fitted, and whichever achieves the higher accuracy is used for the final predictions.
Cross validation is performed by splitting the given training data into two subsets (80%/20%). Each model is trained on the 80% subset and then tested on the held-out 20% subset. Once it is determined which model is more accurate, that model is used on the test data set.
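As a rough illustration of the 80%/20% hold-out split described above, the sketch below builds a stratified split in base R on a toy data frame. The data frame `toy`, the helper `stratified_split`, and its arguments are hypothetical names introduced for this example only. The analysis itself relies on caret's createDataPartition, whose internals may differ from this simplified version.

```r
# Hypothetical toy data: 100 rows, outcome 'classe' with five levels.
set.seed(9067)
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B", "C", "D", "E"), times = c(30, 20, 20, 15, 15)))
)

# Hypothetical helper: sample a fraction p of the row indices within each class,
# so that both subsets keep roughly the same class proportions.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_split(toy$classe, p = 0.8)
toy_train <- toy[train_idx, ]   # 80% used to fit the models
toy_valid <- toy[-train_idx, ]  # 20% held out to estimate accuracy

table(toy_train$classe)
table(toy_valid$classe)
```

Sampling within each class keeps the rarer classes from being under-represented in either subset by chance, which is the reason a stratified split is usually preferred to a plain random one for classification.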
This section details the code used to prepare the data for the models. The libraries used in this analysis are: caret, rpart, and randomForest for machine learning, along with RColorBrewer, rpart.plot, rattle, lattice, and ggplot2 for graphing purposes. For reproducibility, a seed value of 9067 is used throughout the analysis.
# Load data and scrub
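# "#DIV/0!" is the spreadsheet error string in the raw files; stringsAsFactors keeps classe a factor in R >= 4.0
traindat <- read.csv("pml-training.csv", na.strings = c("#DIV/0!", "", " ", "NA"), stringsAsFactors = TRUE)
testdat <- read.csv("pml-testing.csv", na.strings = c("#DIV/0!", "", " ", "NA"), stringsAsFactors = TRUE)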
After loading the data, exploratory analysis was done using the dim, head, summary, and str functions. To save space, these calls are not included in the code. It was determined that the first 7 columns were not needed for the analysis, so they were removed from the data set. Columns containing any NA values were removed as well.
traindat <- traindat[,-c(1:7)]
testdat <- testdat[,-c(1:7)]
# Compute the column mask once, before filtering, and apply it to both data sets
keepcols <- colSums(is.na(traindat)) == 0
traindat <- traindat[, keepcols]
testdat <- testdat[, keepcols]
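The column filtering above can be sketched on a small example. The data frames `toy_train` and `toy_test` and the name `keep` are hypothetical and exist only for this illustration. The point of the sketch is that the NA mask should be computed once from the unfiltered training data and then applied to both sets, so that their columns stay aligned.

```r
# Hypothetical toy data standing in for the training and test sets.
toy_train <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c(7, 8, 9))
toy_test  <- data.frame(a = c(4, 5),    b = c(NA, NA),   c = c(1, 2))

# Compute the mask once, from the unfiltered training data:
# TRUE for columns that contain no missing values.
keep <- colSums(is.na(toy_train)) == 0

# Apply the same mask to both sets so their columns stay aligned.
toy_train <- toy_train[, keep, drop = FALSE]
toy_test  <- toy_test[, keep, drop = FALSE]

names(toy_train)
names(toy_test)
```

If the mask were recomputed from the already filtered training data, every entry would be TRUE and the test set would keep all of its columns, which is easy to miss because no error is raised.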
Before doing any testing, the training data needs to be split as detailed in the Cross Validation Section.
set.seed(9067)
trainsub <- createDataPartition(y=traindat$classe, p=.8, list=FALSE)
maintraindat <- traindat[trainsub,]
testtraindat <- traindat[-trainsub,]
Now that the data is split, a distribution of classe can be shown:
table(maintraindat$classe)
   A    B    C    D    E
4464 3038 2738 2573 2886
As shown, A (correct lift) is the most represented while D (lowering halfway) is the least represented.
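To put the counts in perspective, they can be converted to shares. The short sketch below copies the counts from the table above into a hypothetical vector named `counts`. The share of the largest class is a useful baseline, since it is the accuracy a model would achieve on this subset by always predicting that class.

```r
# Class counts copied from the table above.
counts <- c(A = 4464, B = 3038, C = 2738, D = 2573, E = 2886)

# Share of each class in the 80% training subset.
shares <- counts / sum(counts)
round(shares, 4)

# Most and least represented classes.
names(which.max(shares))
names(which.min(shares))
```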
The first model fitted to the training data is a decision tree. Since decision trees are interpretable, the fitted tree can easily be plotted.
set.seed(9067)
treemodel <- rpart(classe ~., data=maintraindat, method="class")
fancyRpartPlot(treemodel)
Next, the model is used to predict the held-out 20% subset, and the predictions are run through a confusion matrix to assess prediction quality.
set.seed(9067)
treepredict <- predict(treemodel, testtraindat, type = "class")
confusionMatrix(treepredict,testtraindat$classe)
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 1005  137   14   81   30
         B   27  389   61   28   53
         C   22  102  536   55   53
         D   54   54   51  421   53
         E    8   77   22   58  532
Overall Statistics
               Accuracy : 0.7349
                 95% CI : (0.7208, 0.7487)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.6633
 Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9005  0.51252   0.7836   0.6547   0.7379
Specificity            0.9067  0.94659   0.9284   0.9354   0.9485
Pos Pred Value         0.7932  0.69713   0.6979   0.6651   0.7633
Neg Pred Value         0.9582  0.89004   0.9531   0.9325   0.9414
Prevalence             0.2845  0.19347   0.1744   0.1639   0.1838
Detection Rate         0.2562  0.09916   0.1366   0.1073   0.1356
Detection Prevalence   0.3230  0.14224   0.1958   0.1614   0.1777
Balanced Accuracy      0.9036  0.72955   0.8560   0.7951   0.8432
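The headline statistics in this output can be recomputed by hand from the confusion matrix, which helps when reading the per-class rows. The sketch below copies the decision tree counts from the output above into a hypothetical matrix `cm` and derives overall accuracy, sensitivity, and positive predictive value in base R. It is an illustration of the definitions, not a replacement for confusionMatrix.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1005, 137,  14,  81,  30,
      27, 389,  61,  28,  53,
      22, 102, 536,  55,  53,
      54,  54,  51, 421,  53,
       8,  77,  22,  58, 532),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the reference (column) total.
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value per class: correct over the predicted (row) total.
ppv <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
round(ppv, 4)
```

Class B stands out: only 389 of its 759 reference cases are recovered, which is why its sensitivity is the lowest of the five classes.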
The second model used on the training data set is a random forest.
set.seed(9067)
# randomForest infers classification from the factor outcome, so no method argument is needed
forestmodel <- randomForest(classe ~ ., data = maintraindat)
The random forest model is likewise used to predict the held-out 20% subset, and the predictions are run through a confusion matrix. Random forests are much less interpretable than decision trees and thus are not easy to plot.
set.seed(9067)
forestpredict <- predict(forestmodel, testtraindat, type = "class")
confusionMatrix(forestpredict,testtraindat$classe)
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 1114    1    0    0    0
         B    2  757    6    0    0
         C    0    1  678    0    1
         D    0    0    0  643    2
         E    0    0    0    0  718
Overall Statistics
               Accuracy : 0.9967
                 95% CI : (0.9943, 0.9982)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9958
 Mcnemar's Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9974   0.9912   1.0000   0.9958
Specificity            0.9996   0.9975   0.9994   0.9994   1.0000
Pos Pred Value         0.9991   0.9895   0.9971   0.9969   1.0000
Neg Pred Value         0.9993   0.9994   0.9981   1.0000   0.9991
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2840   0.1930   0.1728   0.1639   0.1830
Detection Prevalence   0.2842   0.1950   0.1733   0.1644   0.1830
Balanced Accuracy      0.9989   0.9974   0.9953   0.9997   0.9979
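Comparing the overall accuracy and the per-class Balanced Accuracy in the two confusion matrices, the random forest (accuracy 0.9967) clearly outperforms the decision tree (accuracy 0.7349). The estimated out-of-sample error of the random forest on the held-out 20% subset is therefore about 0.33% (1 - 0.9967). Since the end goal is prediction rather than interpretability, the random forest is used on the test data.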
Lastly, the random forest model is used on the test data to predict which type of lift was performed in each observation. The results are detailed below. Please refer to the Model section for a description of the classes.
set.seed(9067)
testpredict <- predict(forestmodel, testdat, type = "class")
testpredict
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
Levels: A B C D E
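For easier reading, the predicted letters can be mapped back to the class descriptions listed earlier. The sketch below copies the 20 predictions from the output above. The names `predicted`, `descriptions`, and `labelled` are hypothetical and introduced only for this illustration.

```r
# Predicted classes copied from the output above, in test-case order.
predicted <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
               "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

# Descriptions copied from the list of classes given earlier.
descriptions <- c(A = "Correctly", B = "Throwing elbows", C = "Lifting halfway",
                  D = "Lowering halfway", E = "Throwing hips")

# Look up each prediction by name and collect the result in a data frame.
labelled <- data.frame(case = seq_along(predicted),
                       classe = predicted,
                       description = unname(descriptions[predicted]))

head(labelled, 3)
table(labelled$classe)
```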