Overview

There are a number of wearable devices that can be used to identify activities being performed (i.e. walking, running, sitting). These devices, which include Jawbone Up, Nike FuelBand, and Fitbit, are used to identify behavioral patterns and improve health by helping users adjust those patterns. Unfortunately, merely identifying an activity gives an incomplete picture of the quality of exercise. In order to adjust behavioral patterns to improve health, it is necessary to ensure that exercises are being performed efficiently. The objective of this data analysis is to determine whether the efficiency of exercise can be identified using machine learning algorithms.

Here, I have made use of data available from “http://groupware.les.inf.puc-rio.br/har”. This data includes measurements made by the devices mentioned above as well as classifications of the efficiency of the exercise. Below, I have tested two machine learning algorithms, random forests and classification trees.

Download and organize data

The objective of this section is to download and clean the data. The data is cleaned in several ways: 1) by removing all columns containing NAs and 2) by removing the first 7 columns of data (which contain irrelevant information). The remaining data is then partitioned into a trainint set and a cross validation set. The column names are also saved in order to later obtain the out of sample accuracy.

trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainingUrl, "train.csv")
download.file(testingUrl, "test.csv")
trainingData <- read.csv("train.csv", na.strings = c("NA",""))
testingData <- read.csv("test.csv", na.strings = c("NA",""))
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(randomForest)
library(repmis)
set.seed(12345)
# Remove the first 7 columns
trainingData <- trainingData[, -c(1:7)]
testingData <- testingData[, -c(1:7)]

# Remove columns with NAs
trainingData <- trainingData[, colSums(is.na(trainingData)) == 0]
testingData <- testingData[, colSums(is.na(testingData)) == 0]

# Partition data
trainDataIndices <- createDataPartition(trainingData$classe, p = 0.7, list = FALSE)
trainData <- trainingData[trainDataIndices,]
validationData <- trainingData[-trainDataIndices,]

# Get valid column names (excluding "classe")
validCols <- names(trainData)
validCols <- validCols[-length(validCols)]

Model training

Train random forest model

The following chunk of code defines a cross validation method, trains a random forest model, predicts the classifications of observations in the cross validation data set, and compares those predictions to the actual classifications, thereby giving the out-of-sample accuracy.

# Set cross validation method
crossV <- trainControl(method = "cv", number = 5)

# Train random forest model
modelRf <- train(classe ~ ., data = trainData, method. = "rf",
                 trcontrol = crossV)
print(modelRf)

## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9889702  0.9860414
##   27    0.9882148  0.9850863
##   52    0.9802267  0.9749753
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

predictedCrossVs <- predict(modelRf, validationData)
resultRf <- confusionMatrix(validationData$classe, predictedCrossVs)
print(resultRf)

## $positive
## NULL
## 
## $table
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B   10 1124    5    0    0
##          C    0   17 1006    3    0
##          D    0    0   23  941    0
##          E    0    0    0    4 1078
## 
## $overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9892948      0.9864559      0.9863239      0.9917643      0.2859813 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN 
## 
## $byClass
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: A   0.9940582   0.9997620      0.9994026      0.9976253 0.9994026
## Class: B   0.9842382   0.9968374      0.9868306      0.9962073 0.9868306
## Class: C   0.9729207   0.9958771      0.9805068      0.9942375 0.9805068
## Class: D   0.9926160   0.9953413      0.9761411      0.9985775 0.9761411
## Class: E   1.0000000   0.9991679      0.9963031      1.0000000 0.9963031
##             Recall        F1 Prevalence Detection Rate
## Class: A 0.9940582 0.9967233  0.2859813      0.2842821
## Class: B 0.9842382 0.9855327  0.1940527      0.1909941
## Class: C 0.9729207 0.9766990  0.1757009      0.1709431
## Class: D 0.9926160 0.9843096  0.1610875      0.1598980
## Class: E 1.0000000 0.9981481  0.1831776      0.1831776
##          Detection Prevalence Balanced Accuracy
## Class: A            0.2844520         0.9969101
## Class: B            0.1935429         0.9905378
## Class: C            0.1743415         0.9843989
## Class: D            0.1638063         0.9939787
## Class: E            0.1838573         0.9995839
## 
## $mode
## [1] "sens_spec"
## 
## $dots
## list()
## 
## attr(,"class")
## [1] "confusionMatrix"

# Accuracy of random forest model on the validation data
print(resultRf$overall[1])

##  Accuracy 
## 0.9892948

Train classification tree model

Using the same training and cross validation data sets, defined above, the following chunk of code trains a classification tree model, predicts the classifications of observations in the cross validation data set, and compares those predictions to the actual classifications, thereby giving the out-of-sample accuracy.

modelCt <- train(classe ~ ., data = trainData, method. = "rpart", 
                  trcontrol = crossV)
print(modelCt)

## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9892757  0.9864323
##   27    0.9881606  0.9850229
##   52    0.9807874  0.9756947
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

predictedCrossVs <- predict(modelCt, validationData)
resultCt <- confusionMatrix(validationData$classe, predictedCrossVs)
print(resultCt$overall[1])

##  Accuracy 
## 0.9891249

Model assessment

The accuracy of the random forest model was greater than the accuracy of the classification tree model.

Test set predictions

Since the random forest model was more accurate, I used the random forest model to predict the classe variable for the test set. The predictions are given below.

predict(modelRf, testingData)

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Machine Learning Project

Zachary Colburn