Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. The classe variable in the training set is the target. The report describes how the model was built, how cross validation was used, the expected out of sample error, and why the data processing and modeling choices were made. The prediction model will be used to predict 20 different test cases.

Background

The training and test data analyzed in this report are available here:

Training data: used to train the model and test for accuracy.
Test data: used to make the final predictions for the quiz.

Data Processing

This section will go through the required steps to get the raw data loaded, processed, and ready to be modeled.

Loading Packages and Data

First, the required packages/libraries are loaded.

# Load libraries ---------------------------------------------------------------
library(caret)
library(dplyr)

Second, the raw training and testing data sets are loaded.

# Load data --------------------------------------------------------------------
training <- read.csv("./data/pml-training.csv")
testing <- read.csv("./data/pml-testing.csv")

Exploring & Cleaning Data

The training data set has a large number of columns, many of which are either (1) not predictors or (2) have NAs or blank values. We need to clean the training and testing data sets to remove non-predictors / targets and any columns with empty values (e.g. NA or blank values).

# Initial data frame sizes and structure ---------------------------------------
dim(training)

## [1] 19622   160

str(training[ ,1:20]) # Structure of first 20 columns shown

## 'data.frame':    19622 obs. of  20 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name           : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1: int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2: int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp      : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt  : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt  : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1: Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt        : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...

First, the non-predictor / target columns (columns 1 through 7) are removed.

# Remove non-predictors and non-target columns ---------------------------------
training <- training[,-(1:7)]
testing <- testing[,-(1:7)]

Second, the columns with NAs and blanks are removed.

# Remove columns with NA or blank values ---------------------------------------
training <- training[ , !apply(training, 2, function(x) any(is.na(x)))]
training <- training[ , !apply(training, 2, function(x) any(x==""))]
testing <- testing[ , !apply(testing, 2, function(x) any(is.na(x)))]
testing <- testing[ , !apply(testing, 2, function(x) any(x==""))]

The final training set dimensions are as follows. As we can see, it cut down the columns substantially, leaving a well formed data frame for modeling.

# Final dimensions and structure of training set -------------------------------
dim(training)

## [1] 19622    53

str(training[,1:20]) # Structure of first 20 columns shown

## 'data.frame':    19622 obs. of  20 variables:
##  $ roll_belt       : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt      : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt        : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x    : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y    : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z    : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x    : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y    : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z    : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x   : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y   : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z   : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm        : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm       : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm         : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x     : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y     : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z     : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...

Pre-Processing Data

Once the data is cleaned, the first step is to partition the training set into a training and a test set. The set.seed() function is used for reproducibility. A random sample is taken using split of 60/40 for training/testing data sets using the createDataPartition() function. The train.set data will be used for training the models. The test.set data will be used to determine accuracy of the test set.

# Partition data into training, testing, and CV sets ---------------------------
set.seed(100)
inTrain <- createDataPartition(y=training$classe, p=0.6, list=FALSE)
train.set <- training[inTrain, ] # 60% of training set
test.set <- training[-inTrain, ] # 40% of training set

No additional pre-processing is performed. PCA was not used because this would reduce the interpretability of the model. Feature reduction using variable importance was not used since this increases the bias of the model.

Modeling

Three models were created using the train.set: random forest, linear discriminant analysis, and gradient boosting machine (boosting). Cross validation was performed on all models using k = 5 folds. For computational considerations with random forest, rather than limiting the number of factors, limiting the number of trees provided a computationally efficient method to develop the model. The number of trees was limited to 300.

# Modeling ---------------------------------------------------------------------
folds <- 5
ntree <- 300
set.seed(222)
fitControl <- trainControl(method="cv", number = folds)

# Random Forest
model.rf <- train(classe ~ ., 
                  data=train.set, 
                  method="rf", 
                  trControl=fitControl, 
                  ntree=ntree)

# LDA
model.lda <- train(classe ~ ., 
                   data=train.set, 
                   method="lda", 
                   trControl=fitControl)

# GBM (Boosting)
model.gbm <- train(classe ~ ., 
                  data=train.set, 
                  method="gbm", 
                  trControl=fitControl,
                  verbose=FALSE)

Assessing Model Performance

In this section, we assess the performance of each model on the test.set. The model with the lowest out-of-sample error rate (1 - Testing Accuracy Rate) is considered to have the best performance.

Random Forest

The following code predicts the classe classification on the test.set for the Random Forest model.

# Prediction on test set -------------------------------------------------------
pred.rf <-  predict(model.rf, newdata=test.set)

The confusionMatrix() function shows the accuracy of the model on the test data set.

# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.rf, test.set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231   10    0    0    0
##          B    1 1502    9    0    0
##          C    0    6 1350   29    1
##          D    0    0    9 1257    6
##          E    0    0    0    0 1435
## 
## Overall Statistics
##                                           
##                Accuracy : 0.991           
##                  95% CI : (0.9886, 0.9929)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9886          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9895   0.9868   0.9774   0.9951
## Specificity            0.9982   0.9984   0.9944   0.9977   1.0000
## Pos Pred Value         0.9955   0.9934   0.9740   0.9882   1.0000
## Neg Pred Value         0.9998   0.9975   0.9972   0.9956   0.9989
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1914   0.1721   0.1602   0.1829
## Detection Prevalence   0.2856   0.1927   0.1767   0.1621   0.1829
## Balanced Accuracy      0.9989   0.9939   0.9906   0.9876   0.9976

Out of sample error for RF is 0.009.

# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.rf, test.set$classe)$overall[[1]], 4)

## [1] 0.009

Linear Discriminant Analysis

The following code predicts the classe classification on the test.set for the LDA model.

# Prediction on test set -------------------------------------------------------
pred.lda <-  predict(model.lda, newdata=test.set)

The confusionMatrix() function shows the accuracy of the model on the test data set.

# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.lda, test.set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1792  231  115   71   43
##          B   48  960  131   58  259
##          C  211  181  925  140  114
##          D  177   67  166  971  156
##          E    4   79   31   46  870
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7033         
##                  95% CI : (0.693, 0.7134)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.625          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8029   0.6324   0.6762   0.7551   0.6033
## Specificity            0.9181   0.9216   0.9003   0.9137   0.9750
## Pos Pred Value         0.7957   0.6593   0.5888   0.6318   0.8447
## Neg Pred Value         0.9213   0.9127   0.9294   0.9501   0.9161
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2284   0.1224   0.1179   0.1238   0.1109
## Detection Prevalence   0.2870   0.1856   0.2002   0.1959   0.1313
## Balanced Accuracy      0.8605   0.7770   0.7882   0.8344   0.7892

Out of sample error for LDA is 0.2967.

# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.lda, test.set$classe)$overall[[1]], 4)

## [1] 0.2967

Gradient Boosting Machine (Boosting)

The following code predicts the classe classification on the test.set for the GBM model.

# Prediction on test set -------------------------------------------------------
pred.gbm <-  predict(model.gbm, newdata=test.set)

The confusionMatrix() function shows the accuracy of the model on the test data set.

# Confusion Matrix to show accuracy --------------------------------------------
confusionMatrix(pred.gbm, test.set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2204   54    0    3    1
##          B   20 1410   41    2   19
##          C    4   50 1304   39   12
##          D    3    4   20 1228   15
##          E    1    0    3   14 1395
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9566, 0.9653)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9508          
##  Mcnemar's Test P-Value : 1.279e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9875   0.9289   0.9532   0.9549   0.9674
## Specificity            0.9897   0.9870   0.9838   0.9936   0.9972
## Pos Pred Value         0.9744   0.9450   0.9255   0.9669   0.9873
## Neg Pred Value         0.9950   0.9830   0.9901   0.9912   0.9927
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2809   0.1797   0.1662   0.1565   0.1778
## Detection Prevalence   0.2883   0.1902   0.1796   0.1619   0.1801
## Balanced Accuracy      0.9886   0.9579   0.9685   0.9742   0.9823

Out of sample error for GBM is 0.0389.

# Out of sample error ----------------------------------------------------------
round(1-confusionMatrix(pred.gbm, test.set$classe)$overall[[1]], 4)

## [1] 0.0389

Conclusions

The Random Forest model produced the most accurate model, followed closely by the GBM model and more distantly by the LDA model. The out of sample error rates were as follows:

Random Forest: 0.009
LDA: 0.2967
GBM: 0.0389

The Random Forest model will be used to predict the testing set.

Application to Testing Set

The testing set is predicted below.

pred.testing <- predict(model.rf, newdata=testing)
data.frame(Observation = 1:20 , Prediction = pred.testing)

##    Observation Prediction
## 1            1          B
## 2            2          A
## 3            3          B
## 4            4          A
## 5            5          A
## 6            6          E
## 7            7          D
## 8            8          B
## 9            9          A
## 10          10          A
## 11          11          B
## 12          12          C
## 13          13          B
## 14          14          A
## 15          15          E
## 16          16          E
## 17          17          A
## 18          18          B
## 19          19          B
## 20          20          B

Practical Machine Learning: Prediction Assignment

Matt Dancho

August 9, 2016

Executive Summary

Background

Data Processing

Loading Packages and Data

Exploring & Cleaning Data

Pre-Processing Data

Modeling

Assessing Model Performance

Random Forest

Linear Discriminant Analysis

Gradient Boosting Machine (Boosting)

Conclusions

Application to Testing Set