Practical Machine Learning Project

Executive Summary

This project is the course project for the Practical Machine Learning course on Coursera. The goal is to build a machine learning model that predicts whether a weight lifting exercise is performed correctly. The participants were instructed to perform the exercise either properly or incorrectly, and the way each repetition was performed is recorded in the “classe” variable of the data. The value ‘A’ represents the correct way of exercising, while ‘B’, ‘C’, ‘D’ and ‘E’ represent incorrect ways. More information can be found here

For this project, I compare three machine learning algorithms, each tuned with 5-fold cross validation, and choose the one with the best out-of-sample accuracy for the prediction. The models used for the project are as follows:

Decision Trees with CART (rpart)
Gradient Boosting Trees (gbm)
Random Forest Decision Trees (rf)

The Random Forest model yields the best out-of-sample accuracy, about 99.4%, so we use this model for the prediction.

Data Preparation

Here we download and read the csv files for the given training and test data of the project.

The training data is used to build and compare the models; the selected model is then applied to the test data. Empty strings, NA, and #DIV/0! are all read as NA values.

if (!file.exists("Courseradata")) {
    dir.create("Courseradata")
}
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(trainUrl, destfile = "/Users/adrianromano/Downloads/Courseradata/trainFile.csv", method = "curl")

## Warning in download.file(trainUrl, destfile = "/Users/adrianromano/
## Downloads/Courseradata/trainFile.csv", : download had nonzero exit status

download.file(testUrl, destfile = "/Users/adrianromano/Downloads/Courseradata/testFile.csv", method = "curl")

## Warning in download.file(testUrl, destfile = "/Users/adrianromano/
## Downloads/Courseradata/testFile.csv", : download had nonzero exit status

trainData <- read.csv("/Users/adrianromano/Downloads/Courseradata/trainFile.csv", na.strings = c("", "NA", "#DIV/0!"))
testData <- read.csv("/Users/adrianromano/Downloads/Courseradata/testFile.csv", na.strings = c("", "NA", "#DIV/0!"))
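
The absolute paths above tie the script to one machine, and the warnings show that the downloads did not finish cleanly. A minimal sketch of a more portable download step follows; `fetchIfMissing` is a hypothetical helper introduced here for illustration, and the relative `Courseradata` directory in the commented usage is an assumption.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when it is not already on disk, and return the local path.
fetchIfMissing <- function(url, destfile) {
    # Create the target directory if needed (no warning when it exists).
    dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
    if (!file.exists(destfile)) {
        # download.file() returns 0 on success.
        status <- download.file(url, destfile = destfile, mode = "wb")
        if (status != 0) stop("download failed for ", url)
    }
    destfile
}

# Assumed usage, with paths relative to the working directory:
# trainPath <- fetchIfMissing(trainUrl, file.path("Courseradata", "trainFile.csv"))
# trainData <- read.csv(trainPath, na.strings = c("", "NA", "#DIV/0!"))
```

Stopping on a nonzero status makes a failed download visible immediately instead of surfacing later as a missing or truncated csv file.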

Here we check the dimensions and count the total number of NA values in both data sets.

dim(trainData)

## [1] 19622   160

dim(testData)

## [1]  20 160

sum(is.na(trainData))

## [1] 1925102

sum(is.na(testData))

## [1] 2000

The training data is made up of 19622 observations on 160 columns with 1925102 NA values.
The test data is made up of 20 observations on 160 columns with 2000 NA values.

Both data sets contain many NA values, so we remove every variable that contains NA values. We also drop the first seven columns, which hold information such as the user name and timestamps that is not useful for the prediction.

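# Keep only the columns that contain no NA values
trainData <- trainData[, colSums(is.na(trainData)) == 0]
testData <- testData[, colSums(is.na(testData)) == 0]
# Drop the first seven columns (user name, timestamps and similar fields)
trainData <- trainData[, -c(1:7)]
testData <- testData[, -c(1:7)]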
dim(trainData)

## [1] 19622    53

dim(testData)

## [1] 20 53

We are now left with 53 variables in each data set.
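
Filtering the two data sets independently only works when the same columns happen to be complete in both, as they are here (53 columns each). A small sketch on toy data shows one way to make that alignment explicit; the `toy*` data frames and their values are made up for illustration and are not the project data.

```r
# Toy data standing in for the real files (values are made up for illustration).
toyTrain <- data.frame(x1 = c(1, 2, 3), x2 = c(NA, 5, 6), x3 = c(7, 8, 9),
                       classe = c("A", "B", "A"))
toyTest  <- data.frame(x1 = c(4, 5), x2 = c(1, 2), x3 = c(NA, 3),
                       problem_id = c(1, 2))

# Share of missing values per training column.
naShare <- colMeans(is.na(toyTrain))

# Keep predictors that are complete in BOTH sets, so the columns stay aligned.
predictors <- setdiff(names(toyTrain), "classe")
complete <- predictors[naShare[predictors] == 0 &
                       colSums(is.na(toyTest[predictors])) == 0]

toyTrainClean <- toyTrain[, c(complete, "classe")]
toyTestClean  <- toyTest[, complete, drop = FALSE]
```

Selecting by column name rather than by position also avoids a silent mismatch if the two files ever list their columns in a different order.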

Here we check for near-zero-variance predictors to see whether there are further features to remove.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

nzv <- nearZeroVar(trainData, saveMetrics = TRUE)
nzv

##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.101904     6.7781062   FALSE FALSE
## pitch_belt            1.036082     9.3772296   FALSE FALSE
## yaw_belt              1.058480     9.9734991   FALSE FALSE
## total_accel_belt      1.063160     0.1477933   FALSE FALSE
## gyros_belt_x          1.058651     0.7134849   FALSE FALSE
## gyros_belt_y          1.144000     0.3516461   FALSE FALSE
## gyros_belt_z          1.066214     0.8612782   FALSE FALSE
## accel_belt_x          1.055412     0.8357966   FALSE FALSE
## accel_belt_y          1.113725     0.7287738   FALSE FALSE
## accel_belt_z          1.078767     1.5237998   FALSE FALSE
## magnet_belt_x         1.090141     1.6664968   FALSE FALSE
## magnet_belt_y         1.099688     1.5187035   FALSE FALSE
## magnet_belt_z         1.006369     2.3290184   FALSE FALSE
## roll_arm             52.338462    13.5256345   FALSE FALSE
## pitch_arm            87.256410    15.7323412   FALSE FALSE
## yaw_arm              33.029126    14.6570176   FALSE FALSE
## total_accel_arm       1.024526     0.3363572   FALSE FALSE
## gyros_arm_x           1.015504     3.2769341   FALSE FALSE
## gyros_arm_y           1.454369     1.9162165   FALSE FALSE
## gyros_arm_z           1.110687     1.2638875   FALSE FALSE
## accel_arm_x           1.017341     3.9598410   FALSE FALSE
## accel_arm_y           1.140187     2.7367241   FALSE FALSE
## accel_arm_z           1.128000     4.0362858   FALSE FALSE
## magnet_arm_x          1.000000     6.8239731   FALSE FALSE
## magnet_arm_y          1.056818     4.4439914   FALSE FALSE
## magnet_arm_z          1.036364     6.4468454   FALSE FALSE
## roll_dumbbell         1.022388    84.2065029   FALSE FALSE
## pitch_dumbbell        2.277372    81.7449801   FALSE FALSE
## yaw_dumbbell          1.132231    83.4828254   FALSE FALSE
## total_accel_dumbbell  1.072634     0.2191418   FALSE FALSE
## gyros_dumbbell_x      1.003268     1.2282132   FALSE FALSE
## gyros_dumbbell_y      1.264957     1.4167771   FALSE FALSE
## gyros_dumbbell_z      1.060100     1.0498420   FALSE FALSE
## accel_dumbbell_x      1.018018     2.1659362   FALSE FALSE
## accel_dumbbell_y      1.053061     2.3748853   FALSE FALSE
## accel_dumbbell_z      1.133333     2.0894914   FALSE FALSE
## magnet_dumbbell_x     1.098266     5.7486495   FALSE FALSE
## magnet_dumbbell_y     1.197740     4.3012945   FALSE FALSE
## magnet_dumbbell_z     1.020833     3.4451126   FALSE FALSE
## roll_forearm         11.589286    11.0895933   FALSE FALSE
## pitch_forearm        65.983051    14.8557741   FALSE FALSE
## yaw_forearm          15.322835    10.1467740   FALSE FALSE
## total_accel_forearm   1.128928     0.3567424   FALSE FALSE
## gyros_forearm_x       1.059273     1.5187035   FALSE FALSE
## gyros_forearm_y       1.036554     3.7763735   FALSE FALSE
## gyros_forearm_z       1.122917     1.5645704   FALSE FALSE
## accel_forearm_x       1.126437     4.0464784   FALSE FALSE
## accel_forearm_y       1.059406     5.1116094   FALSE FALSE
## accel_forearm_z       1.006250     2.9558659   FALSE FALSE
## magnet_forearm_x      1.012346     7.7667924   FALSE FALSE
## magnet_forearm_y      1.246914     9.5403119   FALSE FALSE
## magnet_forearm_z      1.000000     8.5771073   FALSE FALSE
## classe                1.469581     0.0254816   FALSE FALSE

No variable is flagged as having zero or near-zero variance, so there is nothing more to remove and we can move on to building the model.
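
To make the two metrics in the table easier to read, here is a base R sketch of how they can be computed. It is not caret's implementation. `nzvMetrics` is a hypothetical function, and the cut-offs (`freqCut = 95/19`, `uniqueCut = 10`) are the `nearZeroVar` defaults as I understand them from the caret documentation.

```r
# Base-R sketch of the metrics reported by nearZeroVar(); this is not
# caret's implementation, only an illustration of the definitions.
nzvMetrics <- function(x, freqCut = 95 / 19, uniqueCut = 10) {
    counts <- sort(table(x), decreasing = TRUE)
    # Ratio of the most common value's count to the second most common one.
    freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
    # Number of distinct values as a percentage of the sample size.
    percentUnique <- 100 * length(counts) / length(x)
    zeroVar <- length(counts) <= 1
    data.frame(freqRatio = freqRatio,
               percentUnique = percentUnique,
               zeroVar = zeroVar,
               nzv = zeroVar | (freqRatio > freqCut & percentUnique <= uniqueCut))
}

# A mostly-constant toy vector is flagged; a varied one is not.
nzvMetrics(c(rep(0, 99), 1))
nzvMetrics(1:100)
```

Under these assumed defaults, a variable such as roll_arm is not flagged even though its freqRatio is about 52, because its percentUnique of about 13.5 is above the cut-off of 10.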

Model Building

We further split the training set into a training set and a test set. Three models are built and compared on their out-of-sample accuracy, and the best one is used for the prediction. Each model is tuned with 5-fold cross validation.

We split the data into a 70% training set and a 30% test set.

set.seed(1995)
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
training <- trainData[inTrain, ]
testing <- trainData[-inTrain, ]
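
The split is stratified on classe, so each class appears in roughly the same proportion in both parts. A base R sketch of that idea on toy labels follows; `toyClasse` and `inTrainToy` are names made up for illustration, and this is not how `createDataPartition` is implemented internally.

```r
# Base-R sketch of a stratified 70/30 split; createDataPartition() does the
# real work in the analysis, this only illustrates the idea on toy labels.
set.seed(1995)
toyClasse <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 30, 40)))

# Sample 70% of the row indices within each class.
idxByClass <- split(seq_along(toyClasse), toyClasse)
inTrainToy <- sort(unlist(lapply(idxByClass, function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the two parts should be close to each other.
round(prop.table(table(toyClasse[inTrainToy])), 2)
round(prop.table(table(toyClasse[-inTrainToy])), 2)
```

Sampling within each class keeps a rare class from ending up almost entirely in one part, which a plain random split cannot guarantee.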

Decision Tree Model

library(rpart)
set.seed(1995)
cartFit <- train(classe ~ ., method = "rpart", trControl = trainControl(method = "cv", number = 5), data = training)
cartPred <- predict(cartFit, newdata = testing)
cartCM <- confusionMatrix(cartPred, testing$classe)
cartCM$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1069  212   29   70   15
##          B    3  199   19    7    6
##          C  319  185  664  279  198
##          D  277  543  314  608  368
##          E    6    0    0    0  495

Gradient Boosting Tree Model

library(gbm)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: splines

## Loading required package: parallel

## Loaded gbm 2.1.3

set.seed(1995)
gbmFit <- train(classe ~ ., method = "gbm", data = training, trControl = trainControl(method = "cv", number = 5), verbose = FALSE)

## Loading required package: plyr

gbmPred <- predict(gbmFit, newdata = testing)
gbmCM <- confusionMatrix(gbmPred, testing$classe)
gbmCM$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1658   29    0    1    3
##          B    8 1087   36    1   15
##          C    3   21  974   33   13
##          D    3    2   14  924   10
##          E    2    0    2    5 1041

Random Forest Model

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(1995)
rfFit <- train(classe ~ ., method = "rf", data = training, trControl = trainControl(method = "cv", number = 5), importance = TRUE)
rfPred <- predict(rfFit, newdata = testing)
rfCM <- confusionMatrix(rfPred, testing$classe)
rfCM$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    0 1132   10    0    0
##          C    0    1 1014   14    1
##          D    0    0    2  949    0
##          E    1    0    0    1 1081

Model comparison:

accuracy <- data.frame(Model = c("CART", "GBM", "RF"),
                       Accuracy = rbind(cartCM$overall[1], gbmCM$overall[1], rfCM$overall[1]))
accuracy

##   Model  Accuracy
## 1  CART 0.5157179
## 2   GBM 0.9658454
## 3    RF 0.9938828

The Random Forest model has the best out-of-sample accuracy, about 99.4%, followed by the Gradient Boosting Tree model at about 96.6%. The CART decision tree model has a low out-of-sample accuracy of about 51.6%. We therefore use the model with the highest out-of-sample accuracy, the Random Forest model, for the prediction.

Prediction

We apply the Random Forest model to the original test data.

prediction <- predict(rfFit, newdata = testData)
prediction

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E