A number of wearable devices can be used to identify the activity being performed (e.g. walking, running, sitting). These devices, which include Jawbone Up, Nike FuelBand, and Fitbit, are used to identify behavioral patterns and improve health by helping users adjust those patterns. Unfortunately, merely identifying an activity gives an incomplete picture of the quality of exercise. In order to adjust behavioral patterns to improve health, it is necessary to ensure that exercises are being performed efficiently. The objective of this data analysis is to determine whether the efficiency of exercise can be identified using machine learning algorithms.
Here, I have made use of data available from http://groupware.les.inf.puc-rio.br/har. The data include measurements from wearable devices of the kind mentioned above, together with a classification of the efficiency of each exercise (the classe variable). Below, I have tested two machine learning algorithms: random forests and classification trees.
The objective of this section is to download and clean the data. The data is cleaned in two ways: 1) by removing the first 7 columns of data (which contain irrelevant information) and 2) by removing all columns containing NAs. The remaining data is then partitioned into a training set (70% of the observations) and a cross validation set (the remaining 30%). The column names are also saved in order to later obtain the out-of-sample accuracy.
trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainingUrl, "train.csv")
download.file(testingUrl, "test.csv")
trainingData <- read.csv("train.csv", na.strings = c("NA",""))
testingData <- read.csv("test.csv", na.strings = c("NA",""))
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(randomForest)
library(repmis)
set.seed(12345)
# Remove the first 7 columns
trainingData <- trainingData[, -c(1:7)]
testingData <- testingData[, -c(1:7)]
# Remove columns with NAs
trainingData <- trainingData[, colSums(is.na(trainingData)) == 0]
testingData <- testingData[, colSums(is.na(testingData)) == 0]
# Partition data
trainDataIndices <- createDataPartition(trainingData$classe, p = 0.7, list = FALSE)
trainData <- trainingData[trainDataIndices,]
validationData <- trainingData[-trainDataIndices,]
# Get valid column names (excluding "classe")
validCols <- names(trainData)
validCols <- validCols[-length(validCols)]
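The NA-removal step above can be illustrated on a small example. The following is a minimal sketch in base R; the toy data frame and its values are hypothetical and are not taken from the actual data set.

```r
# Toy data frame (hypothetical values) with one column that contains NAs
toy <- data.frame(
  roll_belt = c(1.41, 1.48, 1.45),
  max_roll  = c(NA, NA, 3.2),
  classe    = c("A", "B", "A")
)

# Count the NAs in each column
naCounts <- colSums(is.na(toy))
print(naCounts)

# Keep only the columns that contain no NAs at all
toyClean <- toy[, naCounts == 0]
print(names(toyClean))
```

A column is dropped even if only some of its values are missing, which is the same rule applied to trainingData and testingData above.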
The following chunk of code defines a cross validation method, trains a random forest model, predicts the classifications of observations in the cross validation data set, and compares those predictions to the actual classifications, thereby giving the out-of-sample accuracy.
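To make the idea of 5-fold cross validation concrete, the following is a simplified sketch in base R of how observations can be split into folds. It is only an illustration under stated assumptions: the number of observations is hypothetical, and caret's own fold creation also tries to balance the outcome classes across folds, which this sketch does not do.

```r
set.seed(12345)
n <- 20   # hypothetical number of observations
k <- 5    # number of folds, as in trainControl(number = 5)

# Randomly assign each observation to one of k folds of equal size
foldId <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  heldOut <- which(foldId == fold)   # rows used to assess the model
  fitRows <- which(foldId != fold)   # rows used to fit the model
  cat("Fold", fold, ": fit on", length(fitRows),
      "rows, assess on", length(heldOut), "rows\n")
}
```

Each observation is held out exactly once, so the accuracy averaged over the folds is an estimate based on data the model did not see during fitting.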
# Set cross validation method
crossV <- trainControl(method = "cv", number = 5)
# Train random forest model
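# caret's train() takes the arguments `method` and `trControl`; with any other
# spelling these settings are not applied and train() uses its defaults
modelRf <- train(classe ~ ., data = trainData, method = "rf",
                 trControl = crossV)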
print(modelRf)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9889702 0.9860414
## 27 0.9882148 0.9850863
## 52 0.9802267 0.9749753
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
predictedCrossVs <- predict(modelRf, validationData)
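# Note: confusionMatrix() takes (data = predictions, reference = truth). With the
# argument order used here, the rows of the table are the actual classes and the
# columns are the predictions; the overall accuracy is unaffected by the order.
resultRf <- confusionMatrix(validationData$classe, predictedCrossVs)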
print(resultRf)
## $positive
## NULL
##
## $table
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 10 1124 5 0 0
## C 0 17 1006 3 0
## D 0 0 23 941 0
## E 0 0 0 4 1078
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9892948 0.9864559 0.9863239 0.9917643 0.2859813
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
##
## $byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: A 0.9940582 0.9997620 0.9994026 0.9976253 0.9994026
## Class: B 0.9842382 0.9968374 0.9868306 0.9962073 0.9868306
## Class: C 0.9729207 0.9958771 0.9805068 0.9942375 0.9805068
## Class: D 0.9926160 0.9953413 0.9761411 0.9985775 0.9761411
## Class: E 1.0000000 0.9991679 0.9963031 1.0000000 0.9963031
## Recall F1 Prevalence Detection Rate
## Class: A 0.9940582 0.9967233 0.2859813 0.2842821
## Class: B 0.9842382 0.9855327 0.1940527 0.1909941
## Class: C 0.9729207 0.9766990 0.1757009 0.1709431
## Class: D 0.9926160 0.9843096 0.1610875 0.1598980
## Class: E 1.0000000 0.9981481 0.1831776 0.1831776
## Detection Prevalence Balanced Accuracy
## Class: A 0.2844520 0.9969101
## Class: B 0.1935429 0.9905378
## Class: C 0.1743415 0.9843989
## Class: D 0.1638063 0.9939787
## Class: E 0.1838573 0.9995839
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
# Accuracy of random forest model on the validation data
print(resultRf$overall[1])
## Accuracy
## 0.9892948
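As a check on how the accuracy figure is obtained, the following sketch recomputes it in base R from the confusion table printed above. The counts are copied from that output; the variable names are introduced here only for illustration.

```r
# Confusion table copied from the output above
# (rows and columns in the order A, B, C, D, E)
confTable <- matrix(
  c(1673,    1,    0,   0,    0,
      10, 1124,    5,   0,    0,
       0,   17, 1006,   3,    0,
       0,    0,   23, 941,    0,
       0,    0,    0,   4, 1078),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified observations / all observations
accuracy <- sum(diag(confTable)) / sum(confTable)
print(round(accuracy, 7))

# Sensitivity per class: diagonal count / column (Reference) total
sensitivity <- diag(confTable) / colSums(confTable)
print(round(sensitivity, 7))
```

The diagonal holds the 5822 observations on which prediction and actual class agree, out of 5885 in the validation set, which reproduces the accuracy of 0.9892948 reported above.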
Using the same training and cross validation data sets, defined above, the following chunk of code trains a classification tree model, predicts the classifications of observations in the cross validation data set, and compares those predictions to the actual classifications, thereby giving the out-of-sample accuracy.
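# As above, the arguments must be spelled `method` and `trControl`; otherwise
# train() falls back to its default method and resampling scheme
modelCt <- train(classe ~ ., data = trainData, method = "rpart",
                 trControl = crossV)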
print(modelCt)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9892757 0.9864323
## 27 0.9881606 0.9850229
## 52 0.9807874 0.9756947
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
predictedCrossVs <- predict(modelCt, validationData)
resultCt <- confusionMatrix(validationData$classe, predictedCrossVs)
print(resultCt$overall[1])
## Accuracy
## 0.9891249
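On the validation data, the reported accuracy was 0.9893 for the random forest model and 0.9891 for the classification tree model, so the random forest model was marginally more accurate. Note, however, that the summary printed for modelCt is headed "Random Forest" and tunes mtry, which suggests that a random forest rather than an rpart tree was fitted when this output was generated; the two figures should therefore not be read as a comparison between the two algorithms.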
Since the random forest model was more accurate, I used the random forest model to predict the classe variable for the test set. The predictions are given below.
predict(modelRf, testingData)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
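The column names saved earlier in validCols are not used in the prediction call above. The following is a minimal sketch, on toy data frames with hypothetical columns and values, of how a saved list of predictor names could be used to confirm that a test set contains every predictor before predicting. It is an illustration only, not part of the original analysis.

```r
# Toy stand-ins (hypothetical) for the cleaned training and test sets
toyTrain <- data.frame(roll_belt = c(1.4, 1.5), pitch_belt = c(8.1, 8.2),
                       classe = c("A", "B"))
toyTest  <- data.frame(pitch_belt = 8.0, roll_belt = 1.3, problem_id = 1)

# Predictor names: every training column except the outcome
predictorCols <- setdiff(names(toyTrain), "classe")

# Confirm that the test set contains every predictor
missingCols <- setdiff(predictorCols, names(toyTest))
stopifnot(length(missingCols) == 0)

# Restrict the test set to the training predictors, in the same order
toyTestAligned <- toyTest[, predictorCols, drop = FALSE]
print(names(toyTestAligned))
```

Selecting by name rather than by position avoids relying on the outcome being the last column, and it fails early if the two sets were cleaned differently.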