Practical Machine Learning Project: Weight Lifting Exercise Classification

Introduction

This project identifies how an exercise, the Unilateral Dumbbell Biceps Curl, was executed. The dataset includes readings from motion sensors on the participants' bodies. These readings are used to classify the performed exercise into five categories: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Please see the website http://groupware.les.inf.puc-rio.br/har for more information.

The Data

Processing:

## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

The raw dataset contained \(19622\) rows of data, with \(160\) variables. Many variables consisted largely of missing values (usually with data in only one row), so these were removed from the dataset (these variables are also unused in the final 20-case testing dataset). In addition, variables that do not come from the movement sensors were removed. This resulted in a dataset of \(53\) variables: 52 sensor predictors plus the outcome, classe.
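The report does not show the code for this filtering step, so the following is only a minimal sketch of how it could be done, run on a toy data frame. The helper name drop_sparse_columns, the 90% missingness cutoff, and the toy column names and values are all assumptions made for illustration.

```r
# Hypothetical helper: drop columns whose share of missing values
# (NA or empty string) exceeds a cutoff. The 0.9 default is an assumption.
drop_sparse_columns = function(df, max_missing = 0.9) {
  missing_share = sapply(df, function(col) mean(is.na(col) | col == ""))
  df[, missing_share <= max_missing, drop = FALSE]
}

# Toy stand-in for the raw data (values are made up)
set.seed(1)
toy = data.frame(
  user_name     = rep(c("p1", "p2"), each = 10),   # bookkeeping column
  roll_belt     = rnorm(20),
  max_roll_belt = c(1.5, rep(NA, 19)),             # data in only one row
  pitch_arm     = rnorm(20),
  classe        = factor(rep(c("A", "B"), 10))
)

# Step 1: remove columns that are almost entirely missing
toy_dense = drop_sparse_columns(toy)

# Step 2: remove columns that do not come from the movement sensors
non_sensor = c("user_name")
toy_used = toy_dense[, !(names(toy_dense) %in% non_sensor), drop = FALSE]

names(toy_used)
```

On the toy data only roll_belt, pitch_arm and classe remain. The same two steps applied to the real data would need the actual list of non-sensor columns, which is not given here.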

To understand the structure of the data a bit better, density plots were made of a selection of the data. These are displayed below.

Partitioning the Data

The dataset was partitioned into training and testing datasets, with 60% of the original data going to the training set and 40% to the testing set. The model was built with the training dataset, then tested on the testing dataset. The following code performs this procedure:

# partition training dataset into 60/40 train/test
train_part = createDataPartition(train_used$classe, p = 0.6, list = FALSE)
training = train_used[train_part, ]
testing = train_used[-train_part, ]

The Model

Several classification methods were attempted, including naive Bayes, multinomial logistic regression, and Support Vector Machines. The Random Forest method produced the best results. Pre-processing with principal component analysis was also attempted; however, this greatly reduced the prediction accuracy.

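Cross-validation was not used for the main model because, according to Leo Breiman and Adele Cutler, the creators of the Random Forest algorithm, the out-of-bag (OOB) error is estimated internally during training and already gives an unbiased estimate of the test set error, so there is no need for separate cross-validation. See http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr for more information.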

The R code and its output are shown below, including the out-of-bag (OOB) error rate on the training data and the confusion matrix. For informational purposes, a plot of the error rate versus the number of trees is also produced.

# fix the seed so the forest is reproducible, then fit 500 trees on all predictors
set.seed(1777)
random_forest = randomForest(classe ~ ., data = training, ntree = 500, importance = TRUE)
random_forest

## 
## Call:
##  randomForest(formula = classe ~ ., data = training, ntree = 500,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.61%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3343    5    0    0    0 0.001493429
## B   12 2261    6    0    0 0.007898201
## C    0   14 2034    6    0 0.009737098
## D    0    0   21 1906    3 0.012435233
## E    0    0    1    4 2160 0.002309469
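The OOB figures above can be checked by hand from the printed confusion matrix. The sketch below re-enters the counts from the output (rows are the actual classes, columns the predicted ones) and recomputes the per-class and overall error in base R; the variable names are illustrative.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows = actual class, columns = predicted class)
oob_cm = matrix(
  c(3343,    5,    0,    0,    0,
      12, 2261,    6,    0,    0,
       0,   14, 2034,    6,    0,
       0,    0,   21, 1906,    3,
       0,    0,    1,    4, 2160),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified
class_error = 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all off-diagonal counts over all training rows
oob_error = 1 - sum(diag(oob_cm)) / sum(oob_cm)

round(class_error, 4)
round(100 * oob_error, 2)   # 0.61 (percent), matching the printed OOB estimate
```

Of the 11776 training rows, 72 are misclassified out of bag, which gives the 0.61% reported by the model.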

plot(random_forest,main="Random Forest: Error Rate vs Number of Trees")

Variable Importance

It may be of interest to know which variables were most ‘important’ in building the model. This can be seen by plotting the mean decrease in accuracy and the mean decrease in the Gini coefficient for each variable. In short, the more the accuracy of the random forest decreases when a single variable is excluded (or permuted), the more important that variable is deemed to be. The mean decrease in the Gini coefficient measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest (from https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html).
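To make the idea of "mean decrease in accuracy" concrete, here is a simplified sketch of permutation importance in base R. It is not how the randomForest package computes its measure (the package works per tree on out-of-bag samples); as assumptions for illustration, it uses simulated data, a logistic regression as a stand-in model, and a single hold-out set.

```r
set.seed(42)

# Simulated data: x1 is a strong predictor, x2 a weaker one, noise is irrelevant
n = 400
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y = factor(ifelse(sim$x1 + 0.5 * sim$x2 + rnorm(n, sd = 0.5) > 0, "A", "B"))

train_idx = sample(n, size = 0.6 * n)
train = sim[train_idx, ]
holdout = sim[-train_idx, ]

# Stand-in model: logistic regression (the report uses a random forest)
fit = glm(y ~ x1 + x2 + noise, data = train, family = binomial)

accuracy = function(model, data) {
  # glm models the probability of the second factor level ("B")
  prob_b = predict(model, newdata = data, type = "response")
  pred = ifelse(prob_b > 0.5, "B", "A")
  mean(pred == data$y)
}

baseline = accuracy(fit, holdout)

# Mean drop in hold-out accuracy after shuffling one predictor at a time
permutation_importance = sapply(c("x1", "x2", "noise"), function(v) {
  drops = replicate(25, {
    shuffled = holdout
    shuffled[[v]] = sample(shuffled[[v]])   # break the link between v and y
    baseline - accuracy(fit, shuffled)
  })
  mean(drops)
})

round(permutation_importance, 3)
```

Shuffling a variable the model relies on costs a lot of accuracy, while shuffling an irrelevant one costs almost none; the same logic underlies the MeanDecreaseAccuracy column.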

# theme_few() comes from ggthemes, arrangeGrob() from gridExtra, grid.draw() from grid
library(ggthemes)
library(gridExtra)
library(grid)

# columns 6 and 7 of the importance matrix hold the overall
# MeanDecreaseAccuracy and MeanDecreaseGini (columns 1-5 are per class)
imp = importance(random_forest)
imp.df = data.frame(Variable = rownames(imp), imp[, c(6, 7)], row.names = NULL)

# order the variables by mean decrease in accuracy for plotting
imp.sort = transform(imp.df,
  Variable = reorder(Variable, MeanDecreaseAccuracy))

VIP = ggplot(data = imp.sort, aes(x = Variable, y = MeanDecreaseAccuracy)) +
  ylab("Mean Decrease Accuracy") + xlab("") +
  geom_bar(stat = "identity", fill = "skyblue", alpha = .8, width = .75) +
  coord_flip() + theme_few()

# same plot, ordered by mean decrease in Gini
imp.sort.Gini = transform(imp.df,
  Variable = reorder(Variable, MeanDecreaseGini))

VIP.Gini = ggplot(data = imp.sort.Gini, aes(x = Variable, y = MeanDecreaseGini)) +
  ylab("Mean Decrease Gini") + xlab("") +
  geom_bar(stat = "identity", fill = "skyblue", alpha = .8, width = .75) +
  coord_flip() + theme_few()

# draw the two plots side by side
VarImpPlot = arrangeGrob(VIP, VIP.Gini, ncol = 2)
grid.draw(VarImpPlot)

Model Applied to Testing Dataset

test_predictions = predict(random_forest, newdata=testing)
CM = confusionMatrix(test_predictions,testing$classe)
CM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2230   15    0    0    0
##          B    2 1498   15    0    0
##          C    0    5 1349   22    1
##          D    0    0    3 1262    4
##          E    0    0    1    2 1437
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9911         
##                  95% CI : (0.9887, 0.993)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9887         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9868   0.9861   0.9813   0.9965
## Specificity            0.9973   0.9973   0.9957   0.9989   0.9995
## Pos Pred Value         0.9933   0.9888   0.9797   0.9945   0.9979
## Neg Pred Value         0.9996   0.9968   0.9971   0.9964   0.9992
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1909   0.1719   0.1608   0.1832
## Detection Prevalence   0.2861   0.1931   0.1755   0.1617   0.1835
## Balanced Accuracy      0.9982   0.9921   0.9909   0.9901   0.9980

The model was applied to the testing dataset to predict the class of each exercise execution. The code and the resulting confusion matrix are shown above. The accuracy is very high, \(0.9911\) (95% CI: 0.9887 to 0.993), which corresponds to an estimated out-of-sample error of about 0.89%. The model also correctly predicted all 20 cases in the final test dataset.
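The headline statistics can be recomputed directly from the test confusion matrix. The sketch below re-enters the counts from the output above (rows are predictions, columns the reference classes) and derives accuracy, sensitivity and positive predictive value in base R. The exact binomial interval from binom.test is used here as an assumption about how the reported 95% CI was obtained.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
test_cm = matrix(
  c(2230,   15,    0,    0,    0,
       2, 1498,   15,    0,    0,
       0,    5, 1349,   22,    1,
       0,    0,    3, 1262,    4,
       0,    0,    1,    2, 1437),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

n_correct = sum(diag(test_cm))
n_total = sum(test_cm)

accuracy = n_correct / n_total                      # overall accuracy
sensitivity = diag(test_cm) / colSums(test_cm)      # correct / actual members of a class
pos_pred_value = diag(test_cm) / rowSums(test_cm)   # correct / predictions of a class

# Exact binomial 95% interval for the accuracy
accuracy_ci = binom.test(n_correct, n_total)$conf.int

round(accuracy, 4)        # 0.9911
round(1 - accuracy, 4)    # 0.0089, the estimated out-of-sample error
round(sensitivity, 4)
round(pos_pred_value, 4)
round(accuracy_ci, 4)
```

In total, 7776 of the 7846 testing cases are classified correctly.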

Cross Validation

In case the grader feels it is necessary to do cross-validation, I have added the code and results of a 5-fold cross-validation run with the caret package. The cross-validation results are shown below.

# 5-fold cross-validation; TRUE/FALSE are spelled out because T and F can be reassigned
CV = trainControl(method = "cv", number = 5, allowParallel = TRUE, verboseIter = FALSE)
CVmodel = train(classe ~ ., data = training, method = "rf", prox = FALSE, trControl = CV)
CVmodel

## Random Forest 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9420, 9420, 9421, 9421, 9422 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD   Kappa SD    
##    2    0.9878568  0.9846370  0.0007688827  0.0009718183
##   27    0.9885364  0.9854978  0.0027655446  0.0034983728
##   52    0.9808940  0.9758270  0.0055730287  0.0070517795
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
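For readers who want to see what 5-fold cross-validation does mechanically, here is a minimal base-R sketch. It is not the caret implementation: as assumptions for illustration, it uses the built-in iris data, a simple nearest-centroid classifier as a stand-in model, and plain random folds (caret also balances the classes across folds).

```r
set.seed(7)
k = 5

# Randomly assign each row to one of k folds of (nearly) equal size
fold_id = sample(rep(1:k, length.out = nrow(iris)))

# Stand-in classifier: assign each test row to the class with the nearest centroid
nearest_centroid = function(train, test) {
  centroids = aggregate(. ~ Species, data = train, FUN = mean)
  X = as.matrix(test[, 1:4])
  C = as.matrix(centroids[, -1])
  # squared Euclidean distance from every test row to every class centroid
  d = sapply(seq_len(nrow(C)), function(j) rowSums(sweep(X, 2, C[j, ])^2))
  as.character(centroids$Species)[max.col(-d)]
}

# Hold out each fold in turn, train on the other k - 1 folds, score on the held-out fold
fold_accuracy = sapply(1:k, function(i) {
  train = iris[fold_id != i, ]
  test = iris[fold_id == i, ]
  mean(nearest_centroid(train, test) == as.character(test$Species))
})

fold_accuracy
mean(fold_accuracy)   # cross-validated accuracy estimate
```

The mean and standard deviation of the per-fold accuracies correspond to the Accuracy and Accuracy SD columns that caret reports for each value of mtry.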

predsCVmodel = predict(CVmodel, newdata = testing)

confMatrix = confusionMatrix(predsCVmodel,testing$classe)
confMatrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2229   19    0    0    0
##          B    3 1492   11    0    0
##          C    0    7 1349   24    1
##          D    0    0    8 1260    3
##          E    0    0    0    2 1438
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9901          
##                  95% CI : (0.9876, 0.9921)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9874          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9987   0.9829   0.9861   0.9798   0.9972
## Specificity            0.9966   0.9978   0.9951   0.9983   0.9997
## Pos Pred Value         0.9915   0.9907   0.9768   0.9913   0.9986
## Neg Pred Value         0.9995   0.9959   0.9971   0.9960   0.9994
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1902   0.1719   0.1606   0.1833
## Detection Prevalence   0.2865   0.1919   0.1760   0.1620   0.1835
## Balanced Accuracy      0.9976   0.9903   0.9906   0.9891   0.9985

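As can be seen, the model tuned with cross-validation performs slightly worse on the testing dataset than the previous one (accuracy \(0.9901\) versus \(0.9911\), that is, 7768 versus 7776 of 7846 cases classified correctly), although the two 95% confidence intervals overlap. As such, the previous model is retained.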

John Slough II

12 July 2015