Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants to predict the manner in which they did the exercise.
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(corrplot)
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(trainFile)) {
download.file(trainUrl, destfile=trainFile)
}
if (!file.exists(testFile)) {
download.file(testUrl, destfile=testFile)
}
# Replace blank values with NA
trainRaw <- read.csv("./data/pml-training.csv", na.strings = c("", NA))
testRaw <- read.csv("./data/pml-testing.csv", na.strings = c("", NA))
dim(trainRaw)
## [1] 19622 160
dim(testRaw)
## [1] 20 160
sum(complete.cases(trainRaw))
## [1] 406
table(colSums(is.na(trainRaw)))
##
## 0 19216
## 60 100
There are 100 columns out of 160 where out of 19622 observations, 19216 have NA values, thus only 406 observations have values other than NA for these columns, 406/19622 is approx. 2% of the data, so we can afford to drop these 100 columns.
trainRaw <- trainRaw[, colSums(is.na(trainRaw)) == 0]
testRaw <- testRaw[, colSums(is.na(testRaw)) == 0]
Now, let’s clean the data off the variables that are of no use for predictions.
classe <- trainRaw$classe
trainRemove <- grepl("^X|timestamp|window|user_name", names(trainRaw))
trainCleaned <- trainRaw[, !trainRemove]
testRemove <- grepl("^X|timestamp|window|user_name|problem_id", names(testRaw))
testCleaned <- testRaw[, !testRemove]
We now split the cleaned training set into a pure training data set (70%) and a validation data set (30%). We will use the validation data set to conduct cross validation.
set.seed(1977) # For reproducibile purposes
inTrain <- createDataPartition(trainCleaned$classe, p=0.70, list=F)
trainData <- trainCleaned[inTrain, ]
testData <- trainCleaned[-inTrain, ]
We fit a predictive model for activity recognition using Random Forest algorithm because it automatically selects important variables and is robust to correlated covariates & outliers in general. We will use 5-fold cross validation when applying the algorithm. (Wanted to try 10-fold CV with 3 repetitions but it was taking too much of time.)
controlRf <- trainControl(method="cv", number=5)
modelRf <- train(classe ~ ., data=trainData, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10989, 10990, 10989, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9903907 0.9878442
## 27 0.9906817 0.9882126
## 52 0.9872605 0.9838859
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Next, We estimate the performance of the model on the validation data set.
predictRf <- predict(modelRf, testData)
confusionMatrix(testData$classe, predictRf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 2 0 0 0
## B 6 1131 2 0 0
## C 0 2 1022 2 0
## D 0 0 15 948 1
## E 0 0 3 2 1077
##
## Overall Statistics
##
## Accuracy : 0.9941
## 95% CI : (0.9917, 0.9959)
## No Information Rate : 0.2851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9925
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9965 0.9808 0.9958 0.9991
## Specificity 0.9995 0.9983 0.9992 0.9968 0.9990
## Pos Pred Value 0.9988 0.9930 0.9961 0.9834 0.9954
## Neg Pred Value 0.9986 0.9992 0.9959 0.9992 0.9998
## Prevalence 0.2851 0.1929 0.1771 0.1618 0.1832
## Detection Rate 0.2841 0.1922 0.1737 0.1611 0.1830
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9980 0.9974 0.9900 0.9963 0.9990
accuracy <- postResample(predictRf, testData$classe)
accuracy
## Accuracy Kappa
## 0.9940527 0.9924767
1 - as.numeric(confusionMatrix(testData$classe, predictRf)$overall[1])
## [1] 0.005947324
The accuracy of the model is 99.45% and and out of sample error is 0.54%.
Predicting the data downloaded earlier with 20 observations.
result <- predict(modelRf, testCleaned)
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Corelation Matrix
corrPlot <- cor(trainData[, -length(names(trainData))])
corrplot(corrPlot, method="color")
Decision Tree Visualization
treeModel <- rpart(classe ~ ., data=trainData, method="class")
prp(treeModel) # fast plot