Executive Summary

Introduction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

Summary of Analysis

This report analyzes how well data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants can classify whether the lifts were performed correctly or incorrectly. All data and documentation for the data sets are available here: http://groupware.les.inf.puc-rio.br/har. Methods of analysis included non-linear classification models, cross validation, and a held-out validation test set. All supporting plots are included within the analysis.

Class Definitions

A - Exactly according to the specification
B - Throwing the elbows to the front
C - Lifting the dumbbell only halfway
D - Lowering the dumbbell only halfway
E - Throwing the hips to the front

This report analyzes the key predictor variables used for quantifying how well the participants performed the routine. Any variable that did not have a direct correlation with the outcome, or that was missing a significant amount of data, was eliminated from the feature set. The non-linear models used were Recursive Partitioning, Random Forest, and Gradient Boosting.

Results of the analysis showed that Random Forest had the best accuracy at 99.2% with a p-value of 2.2e-16; Gradient Boosting had an estimated accuracy of 96.1% with a p-value of 2.2e-16; and Recursive Partitioning had an estimated accuracy of 49% with a p-value of 1.0. When the models were applied to the validation test data, both Random Forest and Gradient Boosting were 100% accurate. Further analysis is needed to determine whether there are linear relationships between the variables, or whether variables can be combined to increase accuracy.

Project Constraints

R version 3.4.4

Libraries

caret 6.0-80
dplyr 0.7.4
elasticnet 1.1
gbm 2.1.3
ggplot2 2.2.1
rattle 5.1.0
rpart 4.1-13

library(dplyr)
library(AppliedPredictiveModeling)
library(caret)
library(RGtk2)
library(rattle)
library(rpart)
library(rpart.plot)

Data Extraction and Transformation

Data Load

Data was downloaded from Coursera and stored with the project so that the current analysis can be reproduced. The original data sets were downloaded from the following locations:

Training Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Validation Test Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
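
Before the reads below, the files can be fetched and cached locally. This is a minimal sketch (assuming the working directory is the project root); the file names match those used by read.csv().

# Download the raw CSVs once and reuse the local copies on later runs
trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("./pml-training.csv")) download.file(trainingUrl, destfile = "./pml-training.csv")
if (!file.exists("./pml-testing.csv")) download.file(testingUrl, destfile = "./pml-testing.csv")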

raw_trainingData <- read.csv("./pml-training.csv", sep = ",", header = TRUE, fill = TRUE)
raw_testingData <- read.csv("./pml-testing.csv", sep = ",", header = TRUE, fill = TRUE)

dim(raw_trainingData); dim(raw_testingData)
## [1] 19622   160
## [1]  20 160

Clean and Transform

# Drop bookkeeping columns (row index, user, timestamps, windows) that carry no sensor signal
usefulFeatures <- !(names(raw_trainingData) %in% c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window"))
trainingData <- raw_trainingData[,usefulFeatures]
testingData <- raw_testingData[,usefulFeatures]

# Remove near-zero-variance predictors identified on the training set
nonZeroPredictors <- nearZeroVar(trainingData, saveMetrics = TRUE)
trainingData <- trainingData[, nonZeroPredictors$nzv == FALSE]
testingData <- testingData[, nonZeroPredictors$nzv == FALSE]

# Drop columns where at least 65% of the values are missing
usefulFeatures <- !(names(trainingData) %in% colnames(trainingData)[colMeans(is.na(trainingData)) >= 0.65])
trainingData <- trainingData[,usefulFeatures]
testingData <- testingData[,usefulFeatures]
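
The 65% cutoff works here because the per-column NA proportions are strongly bimodal: the raw sensor columns are fully populated, while the derived summary columns (kurtosis, skewness, averages, and so on) are almost entirely NA. A quick sketch to verify this before committing to the threshold:

# Per-column NA proportions should cluster near 0 or near 1
summary(colMeans(is.na(raw_trainingData)))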

dim(trainingData); dim(testingData)
## [1] 19622    53
## [1] 20 53
# Record the remaining features (the predictors plus the classe response)
features <- colnames(trainingData)
features
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"
# Frequency count of data points
summary(trainingData$classe)
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
# Set the seed for reproducibility
set.seed(50342)

Training the Model

Split Data into Training and Testing using Random Splitting

inTrain <- createDataPartition(y=trainingData$classe, p=0.70, list=FALSE)
training <- trainingData[inTrain,]
testing <- trainingData[-inTrain,]

dim(training); dim(testing)
## [1] 13737    53
## [1] 5885   53
# Frequency counts of data points
summary(training$classe)
##    A    B    C    D    E 
## 3906 2658 2396 2252 2525
summary(testing$classe)
##    A    B    C    D    E 
## 1674 1139 1026  964 1082

Define the training controls for use with cross validation. A k-fold of 5 was selected as a balance between speed and accuracy. Additional folds could be used, but the extra training time does not justify the small increase in accuracy.

# Named fitControl to avoid shadowing caret's trainControl() function
fitControl <- trainControl(method="cv", number=5)

Train the model using Recursive Partitioning (rpart) to build a classification tree

modelRpart <- train(classe ~ ., data=training, method="rpart", trControl=fitControl)
fancyRpartPlot(modelRpart$finalModel)

predictedByRpart <- predict(modelRpart, newdata=testing)

varImp(modelRpart)
## rpart variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                   Overall
## pitch_forearm      100.00
## roll_forearm        72.10
## roll_belt           70.29
## magnet_dumbbell_y   50.09
## accel_belt_z        43.18
## magnet_belt_y       39.80
## yaw_belt            39.49
## total_accel_belt    35.94
## magnet_arm_x        26.90
## accel_arm_x         25.76
## roll_dumbbell       19.55
## magnet_dumbbell_z   18.04
## magnet_dumbbell_x   17.99
## accel_dumbbell_y    15.39
## roll_arm            14.34
## gyros_belt_z         0.00
## accel_forearm_x      0.00
## gyros_belt_x         0.00
## magnet_arm_z         0.00
## magnet_belt_x        0.00
confMatRpart <- confusionMatrix(testing$classe, predictedByRpart)
confMatRpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1526   23  119    0    6
##          B  453  375  311    0    0
##          C  486   38  502    0    0
##          D  421  171  372    0    0
##          E  163  146  291    0  482
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4902          
##                  95% CI : (0.4774, 0.5031)
##     No Information Rate : 0.5181          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.334           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.5005  0.49801   0.3147       NA  0.98770
## Specificity            0.9478  0.85113   0.8779   0.8362  0.88883
## Pos Pred Value         0.9116  0.32924   0.4893       NA  0.44547
## Neg Pred Value         0.6383  0.92035   0.7751       NA  0.99875
## Prevalence             0.5181  0.12795   0.2710   0.0000  0.08292
## Detection Rate         0.2593  0.06372   0.0853   0.0000  0.08190
## Detection Prevalence   0.2845  0.19354   0.1743   0.1638  0.18386
## Balanced Accuracy      0.7242  0.67457   0.5963       NA  0.93827

The overall accuracy for Recursive Partitioning is 0.4902294 with a p-value of 0.9999911 and a classification error of 0.5097706.
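
These figures come straight from the confusionMatrix object. A minimal sketch of extracting them (the same pattern applies to the Random Forest and Gradient Boosting matrices below):

# Pull the headline statistics out of the caret confusionMatrix result
accuracyRpart <- confMatRpart$overall["Accuracy"]       # overall accuracy
pValueRpart <- confMatRpart$overall["AccuracyPValue"]   # p-value for accuracy > NIR
errorRpart <- 1 - accuracyRpart                         # classification error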

Train the model using Random Forest

modelRf <- train(classe ~ ., data=training, method="rf", trControl=fitControl, verbose=FALSE)
predictedByRf <- predict(modelRf, newdata=testing)
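
Because the model was trained with the 5-fold cross validation defined above, caret retains the per-fold results; a quick sketch for inspecting them (output not shown):

modelRf$resample   # accuracy and kappa for each of the 5 folds
modelRf$results    # mean cross-validated accuracy for each tuning value of mtry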

varImp(modelRf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## pitch_forearm          58.67
## yaw_belt               52.17
## pitch_belt             43.81
## magnet_dumbbell_y      41.85
## roll_forearm           41.85
## magnet_dumbbell_z      41.69
## accel_dumbbell_y       21.77
## roll_dumbbell          18.71
## magnet_dumbbell_x      16.91
## accel_forearm_x        15.91
## magnet_belt_z          15.55
## total_accel_dumbbell   14.23
## accel_belt_z           14.16
## magnet_forearm_z       13.84
## accel_dumbbell_z       13.82
## magnet_belt_y          12.67
## gyros_belt_z           10.21
## yaw_arm                10.15
## magnet_belt_x           9.84
confMatRf <- confusionMatrix(testing$classe, predictedByRf)
confMatRf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B   13 1125    1    0    0
##          C    0    9 1009    8    0
##          D    0    0   11  952    1
##          E    0    0    1    1 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9922          
##                  95% CI : (0.9896, 0.9943)
##     No Information Rate : 0.2865          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9901          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9923   0.9912   0.9873   0.9906   0.9991
## Specificity            0.9998   0.9971   0.9965   0.9976   0.9996
## Pos Pred Value         0.9994   0.9877   0.9834   0.9876   0.9982
## Neg Pred Value         0.9969   0.9979   0.9973   0.9982   0.9998
## Prevalence             0.2865   0.1929   0.1737   0.1633   0.1837
## Detection Rate         0.2843   0.1912   0.1715   0.1618   0.1835
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9960   0.9941   0.9919   0.9941   0.9993
plot(modelRf$finalModel, main="Model Error for Random Forest")
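
The error curves plotted above are stored in the underlying randomForest object; the out-of-bag error matrix can also be inspected directly (a sketch, output not shown):

head(modelRf$finalModel$err.rate)   # OOB error and per-class error as trees are added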

The overall accuracy for Random Forest is 0.9921835 with a classification error of 0.0078165.

Train the model using Gradient Boosting

modelBoosting <- train(classe ~ ., data=training, method="gbm", trControl=fitControl, verbose=FALSE)
predictedByBoosting <- predict(modelBoosting, newdata=testing)

#varImp(modelBoosting)
confMatBoosting <- confusionMatrix(testing$classe, predictedByBoosting)
confMatBoosting
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1649   12    4    5    4
##          B   43 1058   36    2    0
##          C    0   27  988   11    0
##          D    0    1   41  918    4
##          E    3   16    7   11 1045
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9614          
##                  95% CI : (0.9562, 0.9662)
##     No Information Rate : 0.288           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9512          
##  Mcnemar's Test P-Value : 1.981e-11       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9729   0.9497   0.9182   0.9694   0.9924
## Specificity            0.9940   0.9830   0.9921   0.9907   0.9923
## Pos Pred Value         0.9851   0.9289   0.9630   0.9523   0.9658
## Neg Pred Value         0.9891   0.9882   0.9819   0.9941   0.9983
## Prevalence             0.2880   0.1893   0.1828   0.1609   0.1789
## Detection Rate         0.2802   0.1798   0.1679   0.1560   0.1776
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9834   0.9664   0.9552   0.9800   0.9924

The overall accuracy for Gradient Boosting is 0.9614274 with a classification error of 0.0385726.

Validate the Model

The validation test data, containing 20 rows, is provided for validating the accuracy of the models.

validationTest <- raw_testingData[,(names(raw_testingData) %in% features)]
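
As a quick sanity check (a sketch, not part of the original analysis), confirm that the validation set carries every predictor the models expect; only the classe response should be absent:

setdiff(features, names(validationTest))   # should return only "classe"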

validationPredictionRf <- predict(modelRf, newdata=validationTest, type="raw")
validationPredictionRfProbability <- predict(modelRf, newdata=validationTest, type="prob")

validationPredictionBoosting <- predict(modelBoosting, newdata=validationTest, type="raw")
validationPredictionBoostingProbability <- predict(modelBoosting, newdata=validationTest, type="prob")


validationPredictionRpart <- predict(modelRpart, newdata=validationTest, type="raw")
validationPredictionRpartProbability <- predict(modelRpart, newdata=validationTest, type="prob")

Predictions by each of the models, listed in order of accuracy:

Random Forest

validationPredictionRf
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Gradient Boosting

validationPredictionBoosting
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Recursive Partitioning

validationPredictionRpart
##  [1] C A C A A C C A A A C C C A C A A A A C
## Levels: A B C D E
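
The file itself provides no labels for these 20 cases, so agreement between the two strongest models serves as a useful consistency check (a sketch using the objects above):

# Proportion of the 20 validation cases where the two models agree
mean(validationPredictionRf == validationPredictionBoosting)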

Random Forest Class Probabilities

validationPredictionRfProbability
##        A     B     C     D     E
## 1  0.028 0.824 0.116 0.026 0.006
## 2  0.990 0.002 0.000 0.006 0.002
## 3  0.172 0.708 0.060 0.012 0.048
## 4  0.956 0.000 0.020 0.024 0.000
## 5  0.974 0.014 0.010 0.000 0.002
## 6  0.006 0.030 0.036 0.004 0.924
## 7  0.010 0.012 0.048 0.908 0.022
## 8  0.034 0.750 0.050 0.140 0.026
## 9  1.000 0.000 0.000 0.000 0.000
## 10 0.994 0.000 0.000 0.004 0.002
## 11 0.038 0.800 0.108 0.032 0.022
## 12 0.004 0.060 0.846 0.038 0.052
## 13 0.006 0.988 0.000 0.000 0.006
## 14 1.000 0.000 0.000 0.000 0.000
## 15 0.004 0.004 0.014 0.014 0.964
## 16 0.006 0.008 0.000 0.002 0.984
## 17 0.966 0.000 0.000 0.000 0.034
## 18 0.028 0.870 0.002 0.084 0.016
## 19 0.100 0.878 0.000 0.016 0.006
## 20 0.000 1.000 0.000 0.000 0.000