pml_training<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",na.strings=c("NA",""))
dim(pml_training) # 19622 rows, 160 cols
## [1] 19622 160
pml_testing<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",na.strings=c("NA",""))
dim(pml_testing) # 20 rows, 160 cols
## [1] 20 160
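Before the clean-up steps below, the per-column NA counts can be tabulated (a quick sketch) to see how many of the 160 columns are largely empty:
# tabulate the number of NA values in each column of the training data
table(colSums(is.na(pml_training)))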
Step 1. Delete the 100 columns that contain unavailable (NA) or missing ("") data.
non_missing <- apply(!is.na(pml_training),2,sum)>19621
pml_training1 <- pml_training[,non_missing]
dim(pml_training1) # 19,622 rows, 60 cols
## [1] 19622 60
non_missing <- apply(!is.na(pml_testing),2,sum)>19
pml_testing2 <- pml_testing[,non_missing]
dim(pml_testing2) # 20 rows, 60 cols
## [1] 20 60
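As a sanity check (a sketch; it assumes the two files differ only in their last column, classe in the training set versus problem_id in the test set), the retained column names can be compared:
# the first 59 retained column names should be identical in both frames
all(names(pml_training1)[-60] == names(pml_testing2)[-60])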
Step 2. Delete the 406 rows in which the new_window variable equals "yes".
pml_training2 <- subset(pml_training1,new_window=="no")
dim(pml_training2) # 19,216 rows, 60 cols
## [1] 19216 60
Step 3. Delete the index, timestamp and new_window variables (columns 1 and 3-6). The remaining 55 columns comprise 54 predictors (including user_name) and the classe variable.
pml_training2_user_name <- pml_training2[,2]
pml_training2_predictors <- pml_training2[,7:60]
# name the first column "user_name" so it matches the test set built below
pml_training3 <- cbind(user_name=pml_training2_user_name,pml_training2_predictors)
dim(pml_training3) # 19,216 rows, 55 cols
## [1] 19216 55
pml_testing2_user_name <- pml_testing2[,2]
pml_testing2_predictors <- pml_testing2[,7:60]
# use the same "user_name" column name as the training set
pml_testing3 <- cbind(user_name=pml_testing2_user_name,pml_testing2_predictors)
dim(pml_testing3) # 20 rows, 55 cols
## [1] 20 55
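An equivalent one-step subset (a sketch, assuming columns 1 and 3-6 hold the index, timestamps and new_window as described in Step 3, and that column 2 is user_name) is:
# drop the index, timestamp and new_window columns in a single step
pml_training3 <- pml_training2[,-c(1,3:6)]
pml_testing3 <- pml_testing2[,-c(1,3:6)]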
library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.2
InTrain <- createDataPartition(pml_training3$classe,p=0.4)[[1]]
training <- pml_training3[InTrain,] # 7689 rows, 55 cols
dim(training)
## [1] 7689 55
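The remaining 60% of the rows can be held out as an independent validation set (a sketch; the expected row count is simply 19216 - 7689):
validation <- pml_training3[-InTrain,] # should be 11527 rows, 55 cols
dim(validation)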
A generalized boosted regression model (gbm) was fitted to the training data using caret's train function.
set.seed(125)
gbmModel <- train(classe~.,method="gbm",data=training,verbose=FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.1.2
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.1.2
print(gbmModel$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 58 predictors of which 45 had non-zero influence.
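By default, train() tunes gbm over 25 bootstrap resamples; to mirror the random forest fit below, the same five-fold cross-validation could be requested instead (a sketch of the alternative call):
# boosted model tuned with 5-fold CV rather than the default bootstrap resampling
gbmModelCV <- train(classe~.,method="gbm",data=training, trControl=trainControl(method="cv",number=5),verbose=FALSE)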
gbmPredictions <- predict(gbmModel,training)
confusionMatrix(training$classe,gbmPredictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2189 0 0 0 0
## B 9 1470 8 1 0
## C 0 6 1333 1 1
## D 0 2 8 1249 0
## E 0 2 1 8 1401
##
## Overall Statistics
##
## Accuracy : 0.9939
## 95% CI : (0.9919, 0.9955)
## No Information Rate : 0.2859
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9923
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9959 0.9932 0.9874 0.9921 0.9993
## Specificity 1.0000 0.9971 0.9987 0.9984 0.9983
## Pos Pred Value 1.0000 0.9879 0.9940 0.9921 0.9922
## Neg Pred Value 0.9984 0.9984 0.9973 0.9984 0.9998
## Prevalence 0.2859 0.1925 0.1756 0.1637 0.1823
## Detection Rate 0.2847 0.1912 0.1734 0.1624 0.1822
## Detection Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Balanced Accuracy 0.9980 0.9952 0.9931 0.9953 0.9988
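This confusion matrix is computed on the same data used to fit the model, so its accuracy is optimistic. A less biased estimate could be obtained on rows the model has never seen (a sketch, assuming the validation hold-out defined after the data partition above):
# evaluate the boosted model on the held-out 60% of the rows
gbmValPredictions <- predict(gbmModel,validation)
confusionMatrix(validation$classe,gbmValPredictions)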
A random forest (rf) model was fitted to the training data using the train function with five-fold cross-validation (cv).
set.seed(125)
rfModel <- train(classe~.,data=training,method="rf", trControl=trainControl(method="cv",number=5), prox=TRUE,allowParallel=TRUE)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.1.2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
print(rfModel$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 30
##
## OOB estimate of error rate: 0.69%
## Confusion matrix:
## A B C D E class.error
## A 2188 1 0 0 0 0.0004568296
## B 12 1470 5 1 0 0.0120967742
## C 0 12 1326 3 0 0.0111856823
## D 0 0 12 1247 0 0.0095313741
## E 0 0 0 7 1405 0.0049575071
rfPredictions <- predict(rfModel,training)
confusionMatrix(training$classe,rfPredictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2189 0 0 0 0
## B 0 1488 0 0 0
## C 0 0 1341 0 0
## D 0 0 0 1259 0
## E 0 0 0 0 1412
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9995, 1)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Detection Rate 0.2847 0.1935 0.1744 0.1637 0.1836
## Detection Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
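The perfect in-sample confusion matrix again reflects predictions on the training rows themselves; the OOB estimate above (0.69%) is the more realistic error figure. To see which predictors drive the model, caret's varImp can be inspected (a short sketch):
rfImportance <- varImp(rfModel) # ranked variable importance for the final forest
print(rfImportance)
plot(rfImportance,top=20) # plot the 20 most influential predictors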
The random forest model should be highly accurate on new data, given the size of the training set and the five-fold cross-validation: the out-of-bag (OOB) error estimate reported above is 0.69%, so the expected out-of-sample error is below 1%.
The predicted values should match the actual values for all 20 test cases (20 out of 20), because user_name and the window number (num_window) are common to both the training and test data sets.
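Finally, the 20 test cases would be scored with the random forest model as follows (a sketch; it assumes pml_testing3 carries the same predictor names as the training frame, as arranged in Step 3):
# predict the classe of each of the 20 test cases
answers <- predict(rfModel,newdata=pml_testing3)
answers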