Introduction
This project identifies how a Unilateral Dumbbell Biceps Curl was performed. The dataset contains readings from motion sensors worn on participants' bodies, and these readings are used to classify each repetition into five categories: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). See http://groupware.les.inf.puc-rio.br/har for more information.
The Data
Processing:
setwd("~/Desktop/Courses/Coursera/Data Science Specialization/Machine Learning/Project")
library(caret)
library(randomForest)
library(ggthemes)
library(gridExtra)
library(ggplot2)
library(grid)
train = read.csv("pml-training.csv",header=TRUE)
train_used = train[,c(8:11,37:49,60:68,84:86,102,113:124,140,151:160)]
The raw dataset contained \(19622\) rows and \(160\) variables. Many variables were almost entirely missing (populated only in occasional summary rows), so they were removed, as were the bookkeeping variables not produced by the movement sensors (row index, user name, timestamps, and window markers). This left a dataset of \(53\) variables: \(52\) predictors plus the classe label.
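For reference, the same cleanup can be done programmatically rather than with hard-coded column indices; a minimal sketch (the 90% missingness cutoff and the first-seven-columns rule are assumptions, not part of the original analysis):
# flag columns that are mostly NA or empty strings (90% cutoff assumed)
mostly_missing = sapply(train, function(x) {
  x = as.character(x)
  mean(is.na(x) | x == "") > 0.9
})
train_used_alt = train[, !mostly_missing]
# drop the first seven bookkeeping columns (row id, user, timestamps, windows)
train_used_alt = train_used_alt[, -(1:7)]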
To better understand the structure of the data, density plots were made for a selection of the variables.
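The plotting code did not carry over into this write-up; below is a minimal sketch of how such density plots can be produced (the four variables chosen here are illustrative):
# density of a few sensor readings, split by exercise class
sel = c("roll_belt", "pitch_forearm", "magnet_dumbbell_y", "accel_dumbbell_z")
plots = lapply(sel, function(v) {
  ggplot(train_used, aes(x = .data[[v]], colour = classe)) +
    geom_density() + theme_few()
})
do.call(grid.arrange, c(plots, ncol = 2))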
Partitioning the Data
The dataset was partitioned into a training set (60% of the data) and a testing set (40%). The model was built on the training set and then evaluated on the testing set. The following code performs the split:
# partition the cleaned dataset into 60% training / 40% testing
set.seed(1777)   # seed added for a reproducible split (value assumed)
train_part = createDataPartition(train_used$classe, p = 0.6, list = FALSE)
training = train_used[train_part, ]
testing = train_used[-train_part, ]
The Model
Several classification methods were attempted, including naive Bayes, multinomial logistic regression, and decision trees; the random forest method produced the best results. Principal component analysis was also tried as a preprocessing step, but it noticeably reduced prediction accuracy.
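For reference, a sketch of the PCA variant that was tried and then set aside (the 95% variance threshold is an assumption; the original settings were not recorded):
# project the predictors onto principal components capturing ~95% of the
# variance (threshold assumed), then fit a forest on the rotated data
predictors = training[, names(training) != "classe"]
pca_pre = preProcess(predictors, method = "pca", thresh = 0.95)
train_pca = predict(pca_pre, predictors)
rf_pca = randomForest(x = train_pca, y = training$classe, ntree = 500)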
Cross-validation was not used for model selection because the out-of-bag (OOB) error computed during fitting already provides an unbiased error estimate. As the creators of the random forest algorithm put it: “In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error.” - Leo Breiman and Adele Cutler
The R code and the fitted model are shown below; the printed summary includes the OOB error estimate and the confusion matrix on the training data. For reference, a plot of error rate versus number of trees follows.
set.seed(1777)
# fit a 500-tree random forest, tracking variable importance for later plots
random_forest = randomForest(classe ~ ., data = training, ntree = 500, importance = TRUE)
random_forest
##
## Call:
## randomForest(formula = classe ~ ., data = training, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.59%
## Confusion matrix:
## A B C D E class.error
## A 3345 2 1 0 0 0.0008960573
## B 12 2260 7 0 0 0.0083369899
## C 0 9 2043 2 0 0.0053554041
## D 0 0 27 1902 1 0.0145077720
## E 0 0 2 6 2157 0.0036951501
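For reference, the per-tree OOB error drawn by the plot below is stored in the fitted object, so the final estimate can also be read off directly:
# final OOB error estimate; should match the 0.59% reported above
random_forest$err.rate[500, "OOB"]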
plot(random_forest, main = "Random Forest: Error Rate vs Number of Trees")
Variable Importance
It may be of interest to know which variables were most ‘important’ in building the model. This can be seen by plotting, for each variable, the mean decrease in accuracy and the mean decrease in the Gini coefficient. The more the forest's accuracy decreases when a single variable is excluded (or permuted), the more important that variable is deemed to be. The mean decrease in the Gini coefficient measures how much each variable contributes to the homogeneity of the nodes and leaves of the resulting forest (see https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html).
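As an aside, the randomForest package ships a built-in plot of these same two measures; the ggplot2 code that follows reproduces them with custom styling:
# built-in dotchart alternative to the custom plots below
varImpPlot(random_forest)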
# extract the two overall importance measures from the fitted forest
imp = importance(random_forest)
impL = imp[, c(6, 7)]   # MeanDecreaseAccuracy and MeanDecreaseGini
imp.ma = as.matrix(impL)
imp.df = data.frame(imp.ma)
# round-trip through CSV to turn the row names into a Variable column
write.csv(imp.df, "imp.df.csv", row.names = TRUE)
imp.df.csv = read.csv("imp.df.csv", header = TRUE)
colnames(imp.df.csv) = c("Variable", "MeanDecreaseAccuracy", "MeanDecreaseGini")
# reorder the factor levels by importance so the bars plot in sorted order
imp.sort = transform(imp.df.csv,
                     Variable = reorder(Variable, MeanDecreaseAccuracy))
# bar charts of the two importance measures, drawn side by side
VIP = ggplot(data = imp.sort, aes(x = Variable, y = MeanDecreaseAccuracy)) +
  ylab("Mean Decrease Accuracy") + xlab("") +
  geom_bar(stat = "identity", fill = "skyblue", alpha = .8, width = .75) +
  coord_flip() + theme_few()
imp.sort.Gini <- transform(imp.df.csv,
                           Variable = reorder(Variable, MeanDecreaseGini))
VIP.Gini = ggplot(data = imp.sort.Gini, aes(x = Variable, y = MeanDecreaseGini)) +
  ylab("Mean Decrease Gini") + xlab("") +
  geom_bar(stat = "identity", fill = "skyblue", alpha = .8, width = .75) +
  coord_flip() + theme_few()
VarImpPlot = arrangeGrob(VIP, VIP.Gini, ncol = 2)
grid.draw(VarImpPlot)
Model Applied to Testing Dataset
# predict on the held-out 40% testing set and summarise performance
test_predictions = predict(random_forest, newdata = testing)
confusionMatrix(test_predictions, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 6 0 0 0
## B 0 1508 8 0 0
## C 0 4 1358 19 2
## D 0 0 2 1267 8
## E 0 0 0 0 1432
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9918, 0.9954)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9934 0.9927 0.9852 0.9931
## Specificity 0.9989 0.9987 0.9961 0.9985 1.0000
## Pos Pred Value 0.9973 0.9947 0.9819 0.9922 1.0000
## Neg Pred Value 1.0000 0.9984 0.9985 0.9971 0.9984
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1922 0.1731 0.1615 0.1825
## Detection Prevalence 0.2852 0.1932 0.1763 0.1628 0.1825
## Balanced Accuracy 0.9995 0.9961 0.9944 0.9919 0.9965
The model was applied to the testing dataset to predict the exercise class of each observation; the code and resulting confusion matrix are shown above. The accuracy is very high, at over 99%, so the expected out-of-sample error is under 1%. The model also correctly predicted all 20 cases in the separate graded test set.
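The code for scoring those 20 cases is not reproduced above; a minimal sketch, assuming the graded cases are in pml-testing.csv with the same column layout as the training file:
# predict the 20 graded cases using only the predictor columns the model saw
quiz = read.csv("pml-testing.csv", header = TRUE)
quiz_used = quiz[, setdiff(names(training), "classe")]
predict(random_forest, newdata = quiz_used)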
Cross-validation
In case cross-validation is still deemed necessary, I have also run 5-fold cross-validation with the caret package; a sketch of the code and the resulting error rates are shown below.
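A sketch of the caret call consistent with the summary printed below; the 5-fold setup is taken from the output, and the tuning grid is caret's default, which here tries mtry = 2, 27, and 52:
# 5-fold cross-validation with caret's train(); settings inferred from output
set.seed(1777)   # seed value reused from the earlier fit (assumed)
cv_model = train(classe ~ ., data = training, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))
cv_model
# apply the selected (mtry = 2) model to the held-out testing set
cv_predictions = predict(cv_model, newdata = testing)
confusionMatrix(cv_predictions, testing$classe)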
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 9420, 9422, 9421, 9420, 9421
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9894703 0.9866777 0.003051468 0.003863105
## 27 0.9887059 0.9857124 0.003241171 0.004101986
## 52 0.9781759 0.9723913 0.004598599 0.005817516
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2227 7 0 0 0
## B 5 1508 11 0 0
## C 0 3 1353 29 3
## D 0 0 4 1257 7
## E 0 0 0 0 1432
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9889, 0.9932)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9889
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9934 0.9890 0.9774 0.9931
## Specificity 0.9988 0.9975 0.9946 0.9983 1.0000
## Pos Pred Value 0.9969 0.9895 0.9748 0.9913 1.0000
## Neg Pred Value 0.9991 0.9984 0.9977 0.9956 0.9984
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2838 0.1922 0.1724 0.1602 0.1825
## Detection Prevalence 0.2847 0.1942 0.1769 0.1616 0.1825
## Balanced Accuracy 0.9983 0.9954 0.9918 0.9879 0.9965