Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. Quantified self enthusiasts regularly quantify how much of a particular activity they do, yet they rarely qualify how well they do it.

Six males aged 20-28 were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Now we seek to predict whether the lifts were performed correctly using the data from accelerometers on the belt, forearm, arm, and dumbell of study participants. Data for this project is courtesy of http://groupware.les.inf.puc-rio.br/har.

Data Source

Source data for this project:

http://groupware.les.inf.puc-rio.br/har

Training data for this project:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data for this project:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The full reference is as follows:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13)”. Stuttgart, Germany: ACM SIGCHI, 2013.

Data Processing

Load required packages, set seed for reproducibility.

library(caret)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(randomForest)
library(rattle)
library(rpart)
library(rpart.plot)
set.seed(777)

Load training and test datasets.

trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
quizUrl  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainData <- read.csv(url(trainUrl), strip.white = TRUE, na.strings = c("NA",""))
quizData  <- read.csv(url(quizUrl),  strip.white = TRUE, na.strings = c("NA",""))
dim(trainData)
## [1] 19622   160
dim(quizData)
## [1]  20 160

Partition the original training data (70% and 30%)

inTrain  <- createDataPartition(trainData$classe, p=0.70, list=FALSE)
training <- trainData[ inTrain, ]
testing  <- trainData[-inTrain, ]
dim(training)
## [1] 13737   160
dim(testing)
## [1] 5885  160

Inspection of the data reveals that columns 1 through 5 are non-principal components, we’ll remove them.

training <- training[ , -(1:5)]
testing <- testing[ , -(1:5)]
dim(training)
## [1] 13737   155
dim(testing)
## [1] 5885  155

The training and test sets each have a large number of NA values as well as near-zero-variance (NZV) variables. For dimension reduction, let’s remove these, along with their ID variables.

nzvVar <- nearZeroVar(training)
training <- training[ , -nzvVar]
testing  <- testing[ , -nzvVar]
dim(training)
## [1] 13737   119
dim(testing)
## [1] 5885  119

Remove variables that contain > 95% NA values.

naVar <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[ , naVar == FALSE]
testing  <- testing [ , naVar == FALSE]
dim(training)
## [1] 13737    54
dim(testing)
## [1] 5885   54

Now that we’ve reduced the dimensions of our data from the original 160 predictors down to a more manageable 54, let’s see which are covariate.

Exploratory Data Analysis

The final column, “classe”, contains our outcomes of interest. Let’s see which other variables closely correlate with classe.

classeIndex <- which(names(training)=="classe")
partitionClasse <- createDataPartition(y=training$classe, p=0.75, list = FALSE)
subsetTrain <- training[partitionClasse, ]
subsetTest <- training[-partitionClasse, ]

Which predictors are highly correlated with classe?

corrClasse <- cor(subsetTrain[-classeIndex], as.numeric(subsetTrain$classe))
bestCorr <- subset(as.data.frame(as.table(corrClasse)), abs(Freq)>0.25)
bestCorr
##             Var1 Var2       Freq
## 13 magnet_belt_y    A -0.2848973
## 25  magnet_arm_x    A  0.2931318
## 26  magnet_arm_y    A -0.2549615
## 42 pitch_forearm    A  0.3323660

None appear to be highly correlated, let’s check visually.

p1 <- ggplot(subsetTrain, aes(classe, pitch_forearm)) +
        geom_boxplot(aes(fill = classe))
p2 <- ggplot(subsetTrain, aes(classe, magnet_arm_x)) +
        geom_boxplot(aes(fill = classe))

grid.arrange(p1, p2, nrow = 1)

No clear correlation emerges for simple linear regression. Let’s train some models to look more closely for correlation among predictors and the classe outcomes.

Prediction Models

Classification Tree

set.seed(777)
modFit <- train(classe ~ ., method = "rpart", data = training)
fancyRpartPlot(modFit$finalModel)

Predictions of the classification tree model on testing:

predModFit <- predict(modFit, newdata = testing)
confModFit <- confusionMatrix(predModFit, testing$classe)
confModFit
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1555  302  200  196   51
##          B   28  389   33  180   90
##          C   87  448  793  552  292
##          D    0    0    0    0    0
##          E    4    0    0   36  649
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5754         
##                  95% CI : (0.5626, 0.588)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.455          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9289   0.3415   0.7729   0.0000   0.5998
## Specificity            0.8221   0.9303   0.7162   1.0000   0.9917
## Pos Pred Value         0.6749   0.5403   0.3651      NaN   0.9419
## Neg Pred Value         0.9668   0.8548   0.9372   0.8362   0.9167
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2642   0.0661   0.1347   0.0000   0.1103
## Detection Prevalence   0.3915   0.1223   0.3691   0.0000   0.1171
## Balanced Accuracy      0.8755   0.6359   0.7446   0.5000   0.7957

The predictive accuracy of the classification tree model is relatively low at 57.5%.

Generalized Boosted Model (Boosting)

Because we have many generally weak predictors, let’s try boosting. Perhaps weighting them, adding them up, then averaging will allow us to get a stronger predictor.

set.seed(777)
ctrlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fitGBM  <- train(classe ~ ., data = training, method = "gbm",
                  trControl = ctrlGBM, verbose = FALSE)
fitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.

Predictions of the GBM on testing.

predictGBM <- predict(fitGBM, newdata = testing)
conf_matrix_GBM <- confusionMatrix(predictGBM, testing$classe)
conf_matrix_GBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674   11    0    0    1
##          B    0 1117    9    9    6
##          C    0   10 1015    9    2
##          D    0    1    2  942    8
##          E    0    0    0    4 1065
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9846, 0.9904)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9845          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9807   0.9893   0.9772   0.9843
## Specificity            0.9972   0.9949   0.9957   0.9978   0.9992
## Pos Pred Value         0.9929   0.9790   0.9797   0.9885   0.9963
## Neg Pred Value         1.0000   0.9954   0.9977   0.9955   0.9965
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1898   0.1725   0.1601   0.1810
## Detection Prevalence   0.2865   0.1939   0.1760   0.1619   0.1816
## Balanced Accuracy      0.9986   0.9878   0.9925   0.9875   0.9917

The predictive accuracy of our GBM model is reasonably good at 98.7%.

Random Forest Model

set.seed(777)
ctrlRF <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fitRF  <- train(classe ~ ., data = training, method = "rf",
                  trControl = ctrlRF, verbose = FALSE)
fitRF$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.16%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    0    0    0    1 0.0002560164
## B    6 2649    3    0    0 0.0033860045
## C    0    3 2393    0    0 0.0012520868
## D    0    0    6 2246    0 0.0026642984
## E    0    1    0    2 2522 0.0011881188

Predictions of the Random Forest Model on testing.

predictRF <- predict(fitRF, newdata = testing)
conf_matrix_RF <- confusionMatrix(predictRF, testing$classe)
conf_matrix_RF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    2    0    0    0
##          B    0 1136    2    0    0
##          C    0    1 1024    3    0
##          D    0    0    0  961    2
##          E    0    0    0    0 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9983          
##                  95% CI : (0.9969, 0.9992)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9979          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9981   0.9969   0.9982
## Specificity            0.9995   0.9996   0.9992   0.9996   1.0000
## Pos Pred Value         0.9988   0.9982   0.9961   0.9979   1.0000
## Neg Pred Value         1.0000   0.9994   0.9996   0.9994   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1930   0.1740   0.1633   0.1835
## Detection Prevalence   0.2848   0.1934   0.1747   0.1636   0.1835
## Balanced Accuracy      0.9998   0.9985   0.9986   0.9982   0.9991

The predictive accuracy of the Random Forest model is excellent at 99.8%.

Apply Best Predictive Model to the Test Data

In summary, predictive accuracy of the three models evaluated are as follows:

The Random Forest model is chosen to make predictions on the 20 data points from the original testing dataset quizData.

predQuiz <- predict(fitRF, newdata = quizData)
predQuiz
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E