If you are accessing this page via GitHub, please go to https://rpubs.com/samkanta/data-sci-pml-wk4 for ease of viewing.
The quantified self movement has made it possible to collect a large amount of data about personal activity relatively inexpensively, using devices such as Jawbone Up, Nike FuelBand, and Fitbit. Enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks are part of this group. People often quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, the goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, each of whom was asked to perform barbell lifts correctly and incorrectly in 5 different ways. Using a machine learning algorithm, together with techniques that improve the quality of model fit, we predict the manner in which the 6 participants did the exercise. The following sections summarize the approach for this project.
Before any model development occurred, the required R libraries were loaded and the source data files were downloaded.
library(e1071)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(corrplot)
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(trainFile)) {
download.file(trainUrl, destfile=trainFile, method="curl")
}
if (!file.exists(testFile)) {
download.file(testUrl, destfile=testFile, method="curl")
}
After downloading the data, the csv files were transformed into two data frames.
trainRaw <- read.csv("./data/pml-training.csv")
testRaw <- read.csv("./data/pml-testing.csv")
dim(trainRaw)
## [1] 19622 160
dim(testRaw)
## [1] 20 160
The training data set contains 19,622 observations and 160 variables, while the testing data set contains 20 observations and 160 variables. The “classe” variable in the training set is the outcome to predict.
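Before creating the predictive model, the data needs to be cleaned to remove variables that contain missing values or carry no meaningful signal, which would otherwise reduce the accuracy of the algorithm. This was done in three parts: counting the complete cases, dropping every column that contains missing values, and removing the row index, timestamp, and window columns along with any non-numeric columns.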
sum(complete.cases(trainRaw))
## [1] 406
trainRaw <- trainRaw[, colSums(is.na(trainRaw)) == 0]
testRaw <- testRaw[, colSums(is.na(testRaw)) == 0]
classe <- trainRaw$classe
trainRemove <- grepl("^X|timestamp|window", names(trainRaw))
trainRaw <- trainRaw[, !trainRemove]
trainCleaned <- trainRaw[, sapply(trainRaw, is.numeric)]
trainCleaned$classe <- classe
testRemove <- grepl("^X|timestamp|window", names(testRaw))
testRaw <- testRaw[, !testRemove]
testCleaned <- testRaw[, sapply(testRaw, is.numeric)]
The resulting training data set contains 19,622 observations and 53 variables, while the testing data set contains 20 observations and 53 variables. Note that the classe variable remains in the cleaned training set.
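The three cleaning parts can be illustrated on a small data frame. This is a minimal sketch in base R: the `toy` data frame, its values, and the names `keep`, `outcome`, and `toyCleaned` are made up for illustration and are not part of the original analysis.

```r
# Toy data frame (hypothetical values) that mimics the shape of the raw data
toy <- data.frame(
  X = 1:4,
  user_name = c("p1", "p2", "p1", "p2"),
  raw_timestamp_part_1 = c(10L, 11L, 12L, 13L),
  num_window = c(1L, 1L, 2L, 2L),
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, 2.1, NA),
  classe = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Part 1: count the rows that have no missing values at all
sum(complete.cases(toy))

# Part 2: keep only the columns that contain no missing values
toy <- toy[, colSums(is.na(toy)) == 0]

# Part 3: drop index, timestamp, and window columns by name,
# then keep numeric predictors and re-attach the outcome
outcome <- toy$classe
keep <- !grepl("^X|timestamp|window", names(toy))
toy <- toy[, keep]
toyCleaned <- toy[, sapply(toy, is.numeric), drop = FALSE]
toyCleaned$classe <- outcome

names(toyCleaned)
```

In this toy case only `roll_belt` and `classe` survive, which mirrors how the real data shrinks from 160 columns to 53.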
The cleaned training set is split into a pure training data set (70%) and a validation data set (30%). The validation data set is held out from model fitting and is used to estimate out-of-sample performance.
set.seed(22519) # For reproducibility
inTrain <- createDataPartition(trainCleaned$classe, p=0.70, list=FALSE)
trainData <- trainCleaned[inTrain, ]
testData <- trainCleaned[-inTrain, ]
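For a factor outcome, `createDataPartition` samples within each class so that both subsets keep roughly the same class proportions. The sketch below is a rough base-R analogue of that idea, not caret's actual implementation; the outcome vector `y`, the seed, and the name `inTrainToy` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Hypothetical outcome vector with unbalanced classes
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
inTrainToy <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE))

table(y[inTrainToy])   # class counts in the 70% subset
table(y[-inTrainToy])  # class counts in the 30% subset
```

Both subsets keep the 50:30:20 class ratio, which a plain random split would only match approximately.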
We fit a predictive model for activity recognition using a Random Forest algorithm because it automatically selects important variables and is robust to correlated covariates and outliers. Five-fold cross-validation is used when fitting the model.
controlRf <- trainControl(method="cv", number=5)
modelRf <- train(classe ~ ., data=trainData, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10989, 10989, 10991, 10991
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9903191 0.9877528
## 27 0.9919204 0.9897794
## 52 0.9840581 0.9798338
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
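The resampling summary above can be read as follows: the 13,737 training rows are divided into 5 folds, and each candidate `mtry` is trained on 4 folds and scored on the remaining one. The sketch below shows a plain random fold assignment in base R. It is an assumption-laden illustration (caret also balances the folds by class, which is why its reported sample sizes differ slightly), and the names `fold` and `train_sizes` are hypothetical.

```r
set.seed(2)  # arbitrary seed, used only for this illustration
n <- 13737   # number of training rows reported in the model summary
k <- 5

# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = n))

# For fold i, the model is trained on all rows outside fold i
train_sizes <- sapply(seq_len(k), function(i) sum(fold != i))
train_sizes
```

The accuracy reported for each `mtry` is then the average of the k held-out accuracies.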
The performance of the model is estimated from the validation data set.
#Predicting
predictRf <- predict(modelRf, newdata = testData)
#Testing accuracy
confusionMatrix(table(predictRf, testData$classe))
## Confusion Matrix and Statistics
##
##
## predictRf A B C D E
## A 1669 5 0 0 0
## B 2 1130 4 0 0
## C 3 3 1019 10 4
## D 0 1 3 954 2
## E 0 0 0 0 1076
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9921 0.9932 0.9896 0.9945
## Specificity 0.9988 0.9987 0.9959 0.9988 1.0000
## Pos Pred Value 0.9970 0.9947 0.9808 0.9937 1.0000
## Neg Pred Value 0.9988 0.9981 0.9986 0.9980 0.9988
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1920 0.1732 0.1621 0.1828
## Detection Prevalence 0.2845 0.1930 0.1766 0.1631 0.1828
## Balanced Accuracy 0.9979 0.9954 0.9945 0.9942 0.9972
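The overall accuracy and kappa in this output can be reproduced directly from the confusion matrix. The following is a minimal base-R sketch that re-enters the table by hand; the names `cm`, `acc`, `pe`, and `kap` are introduced here for illustration only.

```r
# Validation confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(1669,    5,    0,   0,    0,
       2, 1130,    4,   0,    0,
       3,    3, 1019,  10,    4,
       0,    1,    3, 954,    2,
       0,    0,    0,   0, 1076),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], reference = LETTERS[1:5])
)

n   <- sum(cm)                                # total validation rows
acc <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe  <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kap <- (acc - pe) / (1 - pe)                  # Cohen's kappa

round(c(accuracy = acc, kappa = kap), 4)
```

Accuracy is the share of the diagonal, and kappa corrects that share for the agreement expected by chance given the class frequencies.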
The quality of model fit can be judged from the accuracy on the validation data set and the corresponding out-of-sample error. Two quantities are computed below. Because postResample receives the per-class count tables rather than the individual predictions, the RMSE, R-squared, and MAE it reports compare predicted and actual class counts. The out-of-sample error is computed as one minus the validation accuracy, so it lies between 0 and 1 and is easy to interpret.
The estimated accuracy of the model is 99.37% and the out-of-sample error is 0.006287171 (about 0.63%). Such a low error indicates that the model fits the validation data well.
accuracy <- postResample(table(predictRf), table(testData$classe))
accuracy
## RMSE Rsquared MAE
## 6.7823300 0.9992941 5.2000000
oose <- 1 - as.numeric(confusionMatrix(table(testData$classe, predictRf))$overall[1])
oose
## [1] 0.006287171
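The two results above can be checked by hand from the validation confusion matrix. This sketch assumes the per-class counts are the row and column sums of that matrix; the variable names are hypothetical and the numbers are copied from the output shown earlier.

```r
# Per-class counts taken from the validation confusion matrix
pred_counts <- c(A = 1674, B = 1136, C = 1039, D = 960, E = 1076)  # row sums
obs_counts  <- c(A = 1674, B = 1139, C = 1026, D = 964, E = 1082)  # column sums

# RMSE and MAE between predicted and actual class counts
rmse_counts <- sqrt(mean((pred_counts - obs_counts)^2))
mae_counts  <- mean(abs(pred_counts - obs_counts))

# Out-of-sample error as the share of misclassified validation rows
misclassified <- 37  # sum of the off-diagonal cells
oose_check <- misclassified / sum(obs_counts)

c(RMSE = rmse_counts, MAE = mae_counts, oose = oose_check)
```

The RMSE and MAE here measure how far the predicted class totals are from the actual totals, whereas the out-of-sample error counts individual misclassifications.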
modelRf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 250, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.69%
## Confusion matrix:
## A B C D E class.error
## A 3900 5 1 0 0 0.001536098
## B 23 2628 6 1 0 0.011286682
## C 0 8 2378 10 0 0.007512521
## D 0 1 20 2227 4 0.011101243
## E 0 3 3 10 2509 0.006336634
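The `class.error` column and the OOB error rate follow from the same table: each row is an actual class, and the error for that class is the share of its row that falls off the diagonal. The sketch below re-enters the table by hand in base R; `oob`, `class_error`, and `oob_error` are names chosen for illustration.

```r
# OOB confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
oob <- matrix(
  c(3900,    5,    1,    0,    0,
      23, 2628,    6,    1,    0,
       0,    8, 2378,   10,    0,
       0,    1,   20, 2227,    4,
       0,    3,    3,   10, 2509),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

class_error <- 1 - diag(oob) / rowSums(oob)   # per-class error rate
oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate

round(class_error, 6)
round(100 * oob_error, 2)
```

Classes B and D have the highest per-class error, mostly from confusion with the neighbouring classes A and C.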
varImp(modelRf)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 60.497
## yaw_belt 52.479
## pitch_belt 42.770
## magnet_dumbbell_z 42.175
## roll_forearm 41.435
## magnet_dumbbell_y 40.922
## accel_dumbbell_y 19.075
## roll_dumbbell 18.299
## magnet_dumbbell_x 17.540
## accel_forearm_x 17.333
## accel_belt_z 15.212
## magnet_belt_z 14.879
## accel_dumbbell_z 13.713
## total_accel_dumbbell 13.536
## magnet_forearm_z 13.476
## magnet_belt_y 11.927
## gyros_belt_z 10.721
## yaw_arm 10.588
## magnet_belt_x 9.371
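The importance values above are on a 0 to 100 scale, with the top variable at 100. A plausible reading, assuming caret's default `scale = TRUE` applies a linear rescaling, is sketched below. The raw scores are hypothetical and `gyros_arm_z` is used only as a placeholder for a low-ranked variable.

```r
# Hypothetical raw importance scores; these are not taken from the report
raw_imp <- c(roll_belt = 900, pitch_forearm = 560, yaw_belt = 490, gyros_arm_z = 40)

# Linear rescaling: least important variable maps to 0, most important to 100
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100

sort(scaled_imp, decreasing = TRUE)
```

Under this reading the scaled values express relative rank only: a score of 60 means a variable sits 60% of the way between the weakest and the strongest predictor.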
A machine learning (ML) model predicted the manner in which participants performed the exercise, recorded as the classe variable with five levels (A to E). Accelerometer data from the participants’ belt, forearm, arm, and dumbbell was cleaned and split into training and validation data sets. The random forest model identified the classe with high accuracy on the validation set, and the low out-of-sample error supports the anticipated accuracy of the model on new data.
corrPlot <- cor(trainData[, -length(names(trainData))])
corrplot(corrPlot, method="color")
treeModel <- rpart(classe ~ ., data=trainData, method="class")
prp(treeModel) # fast plot
The prediction model was applied to the original testing data set, downloaded from the data source.
result <- predict(modelRf, testCleaned[, -length(names(testCleaned))])
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E