This is the course project for the Practical Machine Learning course on Coursera. The goal is to build a machine learning model that predicts whether a weight lifting exercise is performed correctly. The participants were instructed to perform the exercise either correctly or incorrectly in one of several ways, and the outcome is recorded in the “classe” variable of the data: the value ‘A’ represents the correct execution, while ‘B’, ‘C’, ‘D’, and ‘E’ represent common mistakes. More information can be found here
For this project, I compare three different machine learning algorithms, each trained with 5-fold cross-validation, and choose the one with the best out-of-sample accuracy for the prediction. The models used for the project are as follows:
- Decision Tree Model (CART)
- Gradient Boosting Tree Model (GBM)
- Random Forest Model (RF)
The Random Forest Model yields the best out-of-sample accuracy of about 99.4%, so we use this model for the prediction.
Here we download and read the CSV files for the given training and test data of the project.
The training data is used for the analysis, which is then evaluated on the test data. When reading the files, empty strings, "NA", and "#DIV/0!" entries are all treated as NA values.
if (!file.exists("Courseradata")) {
    dir.create("Courseradata")
}
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("Courseradata/trainFile.csv")) {
    download.file(trainUrl, destfile = "Courseradata/trainFile.csv", method = "curl")
}
if (!file.exists("Courseradata/testFile.csv")) {
    download.file(testUrl, destfile = "Courseradata/testFile.csv", method = "curl")
}
trainData <- read.csv("Courseradata/trainFile.csv", na.strings = c("", "NA", "#DIV/0!"))
testData <- read.csv("Courseradata/testFile.csv", na.strings = c("", "NA", "#DIV/0!"))
Here we check the dimensions and the total number of NA values in both data sets.
dim(trainData)
## [1] 19622 160
dim(testData)
## [1] 20 160
sum(is.na(trainData))
## [1] 1925102
sum(is.na(testData))
## [1] 2000
There are a lot of NA values in both data sets, so we remove every variable containing NA values, as well as the first seven columns (row index, user name, timestamps, and window indicators), which are not useful for prediction.
trainData <- trainData[, colSums(is.na(trainData)) == 0]
testData <- testData[, colSums(is.na(testData)) == 0]
trainData <- trainData[, -c(1:7)]
testData <- testData[, -c(1:7)]
dim(trainData)
## [1] 19622 53
dim(testData)
## [1] 20 53
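As a quick sanity check (my addition, not in the original analysis), we can verify that the predictor columns of the two cleaned data sets line up; only the last column differs by design ("classe" in the training data, "problem_id" in the test data):
# TRUE if the 52 predictor columns match between the two data sets
identical(colnames(trainData)[1:52], colnames(testData)[1:52])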
Here we run a near zero variance check to see whether any further features should be removed.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
nzv <- nearZeroVar(trainData, saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.7781062 FALSE FALSE
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_dumbbell 1.022388 84.2065029 FALSE FALSE
## pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
## yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
## total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
## accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
## accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
## accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
## classe 1.469581 0.0254816 FALSE FALSE
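The nzv column is FALSE for every predictor, so no further features need to be removed; all 52 predictors (plus classe) are kept. Instead of scanning the table by eye, this can also be checked programmatically:
# Count of near-zero-variance predictors (0 means nothing to remove)
sum(nzv$nzv)
## [1] 0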
We further split the training set into a new training set and a test set. Three models are built and compared on their out-of-sample accuracy, and the best one is used for the prediction. Each model is trained with 5-fold cross-validation.
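Since all three models share the same resampling scheme, the control object could also be defined once and reused; a minimal sketch (the variable name fitControl is mine; the code below keeps the inline trainControl calls instead):
# 5-fold cross-validation settings, shared by all three train() calls
fitControl <- trainControl(method = "cv", number = 5)
# e.g. train(classe ~ ., method = "rf", data = training, trControl = fitControl)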
We split the data into a 70% training set and a 30% test set.
set.seed(1995)
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
training <- trainData[inTrain, ]
testing <- trainData[-inTrain, ]
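createDataPartition samples within each level of classe, so the class proportions are approximately preserved in both subsets. A quick check (my addition, output not shown):
# Class proportions should be nearly identical across the two subsets
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)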
Decision Tree Model
library(rpart)
set.seed(1995)
cartFit <- train(classe ~ ., method = "rpart", trControl = trainControl(method = "cv", number = 5), data = training)
cartPred <- predict(cartFit, newdata = testing)
cartCM <- confusionMatrix(cartPred, testing$classe)
cartCM$table
## Reference
## Prediction A B C D E
## A 1069 212 29 70 15
## B 3 199 19 7 6
## C 319 185 664 279 198
## D 277 543 314 608 368
## E 6 0 0 0 495
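The overall accuracy used in the model comparison below is stored in the overall element of the confusionMatrix object, for example:
cartCM$overall["Accuracy"]
##  Accuracy 
## 0.5157179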
Gradient Boosting Tree Model
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1995)
gbmFit <- train(classe ~ ., method = "gbm", data = training, trControl = trainControl(method = "cv", number = 5), verbose = FALSE)
## Loading required package: plyr
gbmPred <- predict(gbmFit, newdata = testing)
gbmCM <- confusionMatrix(gbmPred, testing$classe)
gbmCM$table
## Reference
## Prediction A B C D E
## A 1658 29 0 1 3
## B 8 1087 36 1 15
## C 3 21 974 33 13
## D 3 2 14 924 10
## E 2 0 2 5 1041
Random Forest Model
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(1995)
rfFit <- train(classe ~ ., method = "rf", data = training, trControl = trainControl(method = "cv", number = 5), importance = TRUE)
rfPred <- predict(rfFit, newdata = testing)
rfCM <- confusionMatrix(rfPred, testing$classe)
rfCM$table
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 0 1132 10 0 0
## C 0 1 1014 14 1
## D 0 0 2 949 0
## E 1 0 0 1 1081
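Because the random forest was trained with importance = TRUE, variable importance is also available via caret's varImp; a short example (not part of the original report):
# Rank the predictors and plot the ten most influential ones
rfImp <- varImp(rfFit)
plot(rfImp, top = 10)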
accuracy <- data.frame(Model = c("CART", "GBM", "RF"),
Accuracy = rbind(cartCM$overall[1], gbmCM$overall[1], rfCM$overall[1]))
accuracy
## Model Accuracy
## 1 CART 0.5157179
## 2 GBM 0.9658454
## 3 RF 0.9938828
The Random Forest Model has the best out-of-sample accuracy of about 99.4%, followed by the Gradient Boosting Tree Model at about 96.6%. The CART decision tree model has a low out-of-sample accuracy of about 51.6%. The expected out-of-sample error of the chosen model is therefore about 1 − 0.994 ≈ 0.6%. We use the model with the highest out-of-sample accuracy, the Random Forest Model, for the prediction.
Evaluate the Prediction on the Original Test Data
prediction <- predict(rfFit, newdata = testData)
prediction
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
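For readability, the 20 predictions can be paired with their problem IDs; a small addition, assuming the problem_id column of the test data survived the cleaning above (it contains no NA values, so it should):
# Match each predicted classe to its test-case ID
data.frame(problem_id = testData$problem_id, predicted_classe = prediction)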