If you are accessing this page via GitHub, please go to https://rpubs.com/samkanta/data-sci-pml-wk4 for ease of viewing.

Introduction

The quantified self movement has made it possible to collect a large amount of data about personal activity relatively inexpensively, using devices such as Jawbone Up, Nike FuelBand, and Fitbit. Enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks are part of this group. People often quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, each of whom was asked to perform barbell lifts correctly and incorrectly in 5 different ways. Using a machine learning algorithm, together with techniques that improve the quality of the model fit, we predict the manner in which the 6 participants did the exercise. The following sections summarize the approach for this project.

Preparing the Data

Before any model development occurred, the required R libraries were loaded and the source data files were downloaded.

library(e1071)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(corrplot)
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile  <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
  dir.create("./data")
}
if (!file.exists(trainFile)) {
  download.file(trainUrl, destfile=trainFile, method="curl")
}
if (!file.exists(testFile)) {
  download.file(testUrl, destfile=testFile, method="curl")
}

Reading the Data

After downloading the data, the CSV files were read into two data frames.

trainRaw <- read.csv("./data/pml-training.csv")
testRaw <- read.csv("./data/pml-testing.csv")
dim(trainRaw)
## [1] 19622   160
dim(testRaw)
## [1]  20 160

The training data set contains 19,622 observations and 160 variables, while the testing data set contains 20 observations and 160 variables. The “classe” variable in the training set is the outcome to predict.
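
Many of the 160 columns are removed later because they are incomplete or non-numeric. As a minimal sketch of why that can happen, the snippet below parses a made-up CSV fragment twice. It assumes the raw file contains empty fields and spreadsheet error strings such as `#DIV/0!`, which is an assumption and not something shown in the output above; `csvText` and its values are hypothetical.

```r
# Made-up CSV text standing in for a few columns of the raw file (assumption)
csvText <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,5.58,B"

# Default parsing: the second column holds text, so it is not read as numeric
rawDefault <- read.csv(text = csvText)

# Declaring the placeholder strings as missing lets R parse the column as numeric
rawNA <- read.csv(text = csvText, na.strings = c("NA", "", "#DIV/0!"))

sapply(rawDefault, is.numeric)
sapply(rawNA, is.numeric)
```

With default parsing such a column is dropped by a numeric-only filter; with `na.strings` set it becomes a numeric column containing NA values and is dropped by a missing-value filter instead.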

Cleaning the Data

Before fitting the predictive model, the data needs to be cleaned to remove columns that are incomplete or unrelated to the sensor measurements, which would otherwise reduce the accuracy of the algorithm. This was done in three steps:

  1. Identify complete cases.
sum(complete.cases(trainRaw))
## [1] 406
  2. Remove columns that contain NA missing values.
trainRaw <- trainRaw[, colSums(is.na(trainRaw)) == 0]
testRaw <- testRaw[, colSums(is.na(testRaw)) == 0]
  3. Remove columns that do not contribute to performance measurements.
classe <- trainRaw$classe
trainRemove <- grepl("^X|timestamp|window", names(trainRaw))
trainRaw <- trainRaw[, !trainRemove]
trainCleaned <- trainRaw[, sapply(trainRaw, is.numeric)]
trainCleaned$classe <- classe
testRemove <- grepl("^X|timestamp|window", names(testRaw))
testRaw <- testRaw[, !testRemove]
testCleaned <- testRaw[, sapply(testRaw, is.numeric)]

The resulting training data set contains 19,622 observations and 53 variables, while the testing data set contains 20 observations and 53 variables. Note that the classe variable remains in the cleaned training set.
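
The effect of the column filters can be checked on a small example. The following is a minimal sketch in base R; the data frame `toy` and its values are made up for illustration and only mimic the layout of the real data.

```r
# Hypothetical toy data frame mimicking the raw data layout
toy <- data.frame(
  X = 1:4,                                   # row index column
  raw_timestamp_part_1 = c(10, 11, 12, 13),  # bookkeeping column
  num_window = c(1, 1, 2, 2),                # bookkeeping column
  roll_belt = c(1.4, 1.5, 1.3, 1.6),         # complete sensor column
  max_roll_belt = c(NA, NA, 2.1, NA),        # mostly missing summary column
  classe = c("A", "A", "B", "B")
)

# Keep only columns with no missing values
noNA <- toy[, colSums(is.na(toy)) == 0]

# Drop index, timestamp, and window columns by name
dropCols <- grepl("^X|timestamp|window", names(noNA))
kept <- noNA[, !dropCols]

# Keep numeric predictors and re-attach the outcome
cleaned <- kept[, sapply(kept, is.numeric), drop = FALSE]
cleaned$classe <- kept$classe

names(cleaned)
## [1] "roll_belt" "classe"
```

Only the complete sensor column and the outcome survive, which is the same pattern that reduces the real data from 160 to 53 variables.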

Slicing the Data

The cleaned training set is split into a pure training data set (70%) and a validation data set (30%). The validation data set is held out from model fitting and used to estimate out-of-sample performance.

set.seed(22519) # for reproducibility
inTrain <- createDataPartition(trainCleaned$classe, p=0.70, list=FALSE)
trainData <- trainCleaned[inTrain, ]
testData <- trainCleaned[-inTrain, ]
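
For a categorical outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both subsets. The idea can be sketched in base R as follows. This illustrates stratified sampling on made-up labels and is not caret's actual implementation; `stratifiedSplit` is a hypothetical helper.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratifiedSplit <- function(y, p) {
  idx <- split(seq_along(y), y)            # row indices grouped by class
  picked <- lapply(idx, function(i) {
    i[sample.int(length(i), size = round(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)                                         # arbitrary seed for the illustration
y <- rep(c("A", "B", "C"), times = c(50, 30, 20))   # made-up labels
inTrainToy <- stratifiedSplit(y, p = 0.70)

# Class counts in the training part and in the held-out part
table(y[inTrainToy])
table(y[-inTrainToy])
```

Both parts keep the 50/30/20 class mix of the toy labels, which is the property that makes the held-out set representative.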

Modeling a Predictive Algorithm for the Data

We fit a predictive model for activity recognition using a Random Forest algorithm because it automatically selects important variables and is robust to correlated covariates and outliers. The model is trained with 5-fold cross-validation.

controlRf <- trainControl(method="cv", number=5)
modelRf <- train(classe ~ ., data=trainData, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10988, 10989, 10989, 10991, 10991 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9903191  0.9877528
##   27    0.9919204  0.9897794
##   52    0.9840581  0.9798338
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
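
The resampling scheme behind these numbers can be sketched in base R. The snippet only shows how rows are assigned to 5 folds and where the training sizes come from. It uses plain random assignment, whereas caret also balances the outcome classes across folds, and names such as `foldId` are illustrative.

```r
set.seed(1)            # arbitrary seed for the illustration
n <- 13737             # number of rows in the training partition
k <- 5                 # number of folds

# Assign every row to one of k folds of near-equal size
foldId <- sample(rep(seq_len(k), length.out = n))

# Each candidate model is fit on k - 1 folds and scored on the held-out fold
trainSizes <- sapply(seq_len(k), function(f) sum(foldId != f))
trainSizes
```

Each fit uses roughly four fifths of the 13,737 rows, which is why the sample sizes reported by caret are all close to 10,990.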

The performance of the model is estimated from the validation data set.

#Predicting
predictRf <- predict(modelRf, newdata = testData)
#Testing accuracy
confusionMatrix(table(predictRf, testData$classe))
## Confusion Matrix and Statistics
## 
##          
## predictRf    A    B    C    D    E
##         A 1669    5    0    0    0
##         B    2 1130    4    0    0
##         C    3    3 1019   10    4
##         D    0    1    3  954    2
##         E    0    0    0    0 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9937          
##                  95% CI : (0.9913, 0.9956)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.992           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9921   0.9932   0.9896   0.9945
## Specificity            0.9988   0.9987   0.9959   0.9988   1.0000
## Pos Pred Value         0.9970   0.9947   0.9808   0.9937   1.0000
## Neg Pred Value         0.9988   0.9981   0.9986   0.9980   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1920   0.1732   0.1621   0.1828
## Detection Prevalence   0.2845   0.1930   0.1766   0.1631   0.1828
## Balanced Accuracy      0.9979   0.9954   0.9945   0.9942   0.9972

The quality of the model fit can be summarized by the accuracy on the validation set and by the estimated out-of-sample error, taken here as one minus that accuracy. Because the validation rows were not used for training, this error estimates how the model will perform on new data.

The estimated accuracy of the model is 99.37% and the estimated out-of-sample error is 0.006287171, or about 0.63%. Such a low error indicates that the model fits the validation data very well. Note that the postResample call below compares per-class counts of predicted and observed labels, so the RMSE it reports (6.78) is measured in observations per class and is not the out-of-sample error.

accuracy <- postResample(table(predictRf), table(testData$classe))
accuracy
##      RMSE  Rsquared       MAE 
## 6.7823300 0.9992941 5.2000000
oose <- 1 - as.numeric(confusionMatrix(table(testData$classe, predictRf))$overall[1])
oose
## [1] 0.006287171
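
As a cross-check, the accuracy and the out-of-sample error can be recomputed by hand from the validation confusion matrix printed earlier. This is a small worked sketch in base R; the matrix is typed in from that output and the variable names are illustrative.

```r
# Validation confusion matrix copied from the confusionMatrix output
# (rows: predicted class, columns: reference class)
cm <- matrix(c(1669,    5,    0,    0,    0,
                  2, 1130,    4,    0,    0,
                  3,    3, 1019,   10,    4,
                  0,    1,    3,  954,    2,
                  0,    0,    0,    0, 1076),
             nrow = 5, byrow = TRUE,
             dimnames = list(predicted = LETTERS[1:5], reference = LETTERS[1:5]))

acc <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
err <- 1 - acc                   # estimated out-of-sample error
round(c(accuracy = acc, error = err), 4)
## accuracy    error 
##   0.9937   0.0063
```

Of the 5,885 validation rows, 5,848 lie on the diagonal, so 37 rows are misclassified.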

Final Data Model with the Top Twenty Predictor Variables

modelRf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 250, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 250
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.69%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3900    5    1    0    0 0.001536098
## B   23 2628    6    1    0 0.011286682
## C    0    8 2378   10    0 0.007512521
## D    0    1   20 2227    4 0.011101243
## E    0    3    3   10 2509 0.006336634
varImp(modelRf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt            100.000
## pitch_forearm         60.497
## yaw_belt              52.479
## pitch_belt            42.770
## magnet_dumbbell_z     42.175
## roll_forearm          41.435
## magnet_dumbbell_y     40.922
## accel_dumbbell_y      19.075
## roll_dumbbell         18.299
## magnet_dumbbell_x     17.540
## accel_forearm_x       17.333
## accel_belt_z          15.212
## magnet_belt_z         14.879
## accel_dumbbell_z      13.713
## total_accel_dumbbell  13.536
## magnet_forearm_z      13.476
## magnet_belt_y         11.927
## gyros_belt_z          10.721
## yaw_arm               10.588
## magnet_belt_x          9.371

Conclusion

A machine learning (ML) model was built to predict the manner in which participants performed the exercise, recorded in the classe variable with levels A to E. Data from accelerometers on the participants’ belt, forearm, arm, and dumbbell was cleaned and split into training and validation data sets. A Random Forest model trained with 5-fold cross-validation classified the validation set with an accuracy of 99.37%, giving an estimated out-of-sample error of about 0.63%.

Appendix: Figures - Data Visualization

1. Correlation Matrix Visualization

corrPlot <- cor(trainData[, -length(names(trainData))])
corrplot(corrPlot, method="color")

2. Decision Tree Visualization

treeModel <- rpart(classe ~ ., data=trainData, method="class")
prp(treeModel) # fast plot

Predicting for Test Data Set (for Course Project Prediction Quiz Portion)

The prediction model was applied to the original testing data set, downloaded from the data source.

result <- predict(modelRf, testCleaned[, -length(names(testCleaned))])
result
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E