knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(caret)
library(corrplot)
This report is a write-up for the final project of the Practical Machine Learning course in the Data Science Specialization offered by Johns Hopkins University on Coursera. The purpose of this assignment is to build a model that assesses how well an individual performs an exercise. The model is trained on the weight lifting data set (Velloso et al., 2013):
sample_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
quiz_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
sample_file_name <- "./data/pml-training.csv"
quiz_file_name <- "./data/pml-testing.csv"
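# Download files (creating the data directory first if it does not exist)
if (!dir.exists("./data")) dir.create("./data")
if (!file.exists(sample_file_name) || !file.exists(quiz_file_name)) {
  download.file(sample_url, destfile = sample_file_name, method = "curl")
  download.file(quiz_url, destfile = quiz_file_name, method = "curl")
}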
# Read files
sample_data <- read.csv(sample_file_name, na.strings = c("NA", "#DIV/0!"))
quiz_data <- read.csv(quiz_file_name, na.strings = c("NA", "#DIV/0!"))
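The `na.strings` argument matters because the raw CSV contains spreadsheet error tokens as well as plain `NA` entries. As a minimal sketch, the made-up three-row CSV below (the column names `a` and `b` are hypothetical, not from the real data) shows how both tokens end up as missing values and the columns stay numeric:

```r
# Toy CSV text standing in for the real file (hypothetical columns a and b)
csv_text <- "a,b\n1,#DIV/0!\nNA,2.5\n3,4"

# Treat both "NA" and the spreadsheet error token as missing
toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!"))

str(toy)          # both columns are parsed as numbers
sum(is.na(toy))   # counts the two missing cells
```

Without the second token, column `b` would be read as text rather than numbers.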
The sample data set has 19622 rows and 160 columns, with a total of 1925102 missing values. Let’s see if we can reduce the size a bit to make things easier on ourselves. First we remove columns that are irrelevant to training a model (names, timestamps, IDs), and then we remove columns that are mostly NAs.
# Remove columns that are irrelevant to prediction (names, timestamps, row IDs)
sample_data <- sample_data[,-c(1:7)]
quiz_data <- quiz_data[,-c(1:7)]
# Keep only columns in which more than 95% of the values are present
mean_not_nas <- function(x){mean(!is.na(x)) > 0.95}
good_mean_nas <- sapply(sample_data, mean_not_nas)
sample_data <- sample_data[, good_mean_nas]
quiz_data <- quiz_data[, good_mean_nas]
sum_nas <- sum(is.na(sample_data))
The dimensions are now 53 columns by 19622 rows. There are 0 missing values. Now we can break up the sample set into a training and testing set.
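To make the column filter concrete, here is a small sketch of the same idea on a toy data frame. The column names `complete`, `sparse`, and `nearly` are invented for illustration; only the 95% threshold comes from the code above.

```r
# Toy data frame (hypothetical columns): complete, mostly missing, 1% missing
toy <- data.frame(
  complete = 1:100,
  sparse   = c(1, rep(NA, 99)),
  nearly   = c(NA, 2:100)
)

# Fraction of missing values per column
na_fraction <- colMeans(is.na(toy))

# Keep columns where more than 95% of the values are present
keep <- (1 - na_fraction) > 0.95
filtered <- toy[, keep, drop = FALSE]

names(filtered)
```

The `sparse` column is dropped, while `nearly` survives because only 1% of its values are missing.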
To estimate the model’s accuracy before we take the quiz, we hold out 30% of the sample set as a testing set and train on the remaining 70%.
set.seed(4321)
# Choose which indices will be put in the training set
inTrain <- createDataPartition(y=sample_data$classe, p=0.7, list=FALSE)
# Separate sample set into training and testing data frames
training <- sample_data[inTrain,]
testing <- sample_data[-inTrain,]
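The split above is drawn on the outcome `classe`, so each class should be represented in roughly the same proportion in both pieces. The base R sketch below illustrates that idea on an invented outcome vector. It is a simplified stand-in for `createDataPartition`, not its actual implementation.

```r
set.seed(4321)

# Hypothetical outcome vector with unbalanced classes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Class proportions are preserved in both pieces
prop.table(table(classe[in_train]))
prop.table(table(classe[-in_train]))
```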
I chose to build a random forest and a generalized boosted model (gbm) because both are popular and typically accurate methods for classification problems.
# Train model
rf_cont <- trainControl(method="cv", number=3, verboseIter=FALSE)
rf_fit <- train(classe~., data=training, method="rf",
trControl=rf_cont, verbose=FALSE)
# Display final model
rf_fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.67%
## Confusion matrix:
## A B C D E class.error
## A 3903 1 1 0 1 0.0007680492
## B 18 2631 8 1 0 0.0101580135
## C 0 13 2369 14 0 0.0112687813
## D 0 1 21 2225 5 0.0119893428
## E 0 2 3 3 2517 0.0031683168
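The OOB error rate and the `class.error` column in the output above can be recomputed directly from the printed confusion matrix. The sketch below retypes the counts by hand, assuming rows are the actual classes and columns the predicted ones.

```r
# Out-of-bag confusion matrix copied from the printed output above
# (rows = actual class, columns = predicted class)
classes <- c("A", "B", "C", "D", "E")
oob_conf <- matrix(
  c(3903,    1,    1,    0,    1,
      18, 2631,    8,    1,    0,
       0,   13, 2369,   14,    0,
       0,    1,   21, 2225,    5,
       0,    2,    3,    3, 2517),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Overall OOB error: share of off-diagonal counts
oob_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)

# Per-class error: share of each row that is off the diagonal
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)

round(100 * oob_error, 2)   # 0.67
round(class_error, 4)
```

The counts sum to 13737, which is the number of rows in the training set.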
Let’s check the accuracy of the model using the data that it was trained with.
# Run predictions on training data set
rf_train_pred <- predict(rf_fit, training)
# Assess accuracy of predictions
rf_train_accuracy <- confusionMatrix(rf_train_pred, training$classe)$overall["Accuracy"]
rf_train_accuracy
## Accuracy
## 1
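Our accuracy on the training data is 100%, so the random forest model has no in-sample error. This figure is optimistic because the model has already seen these rows; the OOB error estimate of 0.67% above is a fairer indication of how it will do on new data.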
Now we can check the accuracy of the model using new data.
# Run predictions on testing data set
rf_test_pred <- predict(rf_fit, testing)
# Assess accuracy of predictions
rf_test_accuracy <- confusionMatrix(rf_test_pred, testing$classe)$overall["Accuracy"]
rf_test_accuracy
## Accuracy
## 0.9940527
With the testing data, our model accuracy is 99.41%, so our model has 0.59% out of sample error.
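Since the testing set is finite, the 0.59% figure is itself an estimate. As a rough sketch, the counts implied by the numbers in this report can be fed to base R's `binom.test` to put an exact binomial interval around it. The testing-set size is derived here as 19622 total rows minus the 13737 training rows counted in the OOB confusion matrix. That derivation is my own arithmetic, not output from the original run.

```r
# Counts implied by the report: 19622 rows in total, 13737 of them in the
# training set (the sum of the OOB confusion matrix), the rest held out
n_test    <- 19622 - 13737
n_correct <- round(0.9940527 * n_test)

accuracy  <- n_correct / n_test
oos_error <- 1 - accuracy

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- as.numeric(binom.test(n_correct, n_test)$conf.int)

round(100 * oos_error, 2)       # 0.59
round(100 * (1 - rev(ci)), 2)   # matching interval for the error rate
```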
# Train model
gb_cont <- trainControl(method="repeatedcv", number=3, verboseIter=FALSE)
gb_fit <- train(classe~., data=training, method="gbm",
trControl=gb_cont, verbose=FALSE)
# Display final model
gb_fit$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
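Both models above were tuned with 3-fold cross-validation through `trainControl`. The sketch below shows the basic mechanics of a 3-fold split in base R on a made-up set of 20 rows. It uses plain random assignment, so caret's own fold creation may differ (for example by balancing the classes across folds).

```r
set.seed(4321)

n <- 20   # hypothetical number of training rows
k <- 3    # number of folds, matching number = 3 above

# Randomly assign every row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  holdout  <- which(fold_id == fold)   # rows used for evaluation
  fit_rows <- which(fold_id != fold)   # rows used for fitting
  cat("fold", fold, ":", length(fit_rows), "fit rows,",
      length(holdout), "held-out rows\n")
}
```

Every row is held out exactly once, and each candidate tuning value is scored by averaging its accuracy over the three held-out folds.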
Let’s check the accuracy of the model using the data that it was trained with.
# Run predictions on training data set
gb_train_pred <- predict(gb_fit, training)
# Assess accuracy of predictions
gb_train_accuracy <- confusionMatrix(gb_train_pred, training$classe)$overall["Accuracy"]
gb_train_accuracy
## Accuracy
## 0.9736478
Our accuracy using the training data is 97.36%, so this model has 2.64% in-sample error.
Now we can check the accuracy of the model using new data.
# Run predictions on testing data set
gb_test_pred <- predict(gb_fit, testing)
# Assess accuracy of predictions
gb_test_accuracy <- confusionMatrix(gb_test_pred, testing$classe)$overall["Accuracy"]
gb_test_accuracy
## Accuracy
## 0.9621071
With the testing data, this model’s accuracy is 96.21%, so it has 3.79% out-of-sample error.
| Model | In Sample Error | Out of Sample Error |
|---|---|---|
| Random Forest | 0 | 0.0059473 |
| Boosting | 0.0263522 | 0.0378929 |
I chose to use the random forest model to predict the quiz answers because it has a smaller out of sample error.
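The comparison can be reproduced from the four accuracies reported above. This is a small sketch with the values typed in by hand, and the data frame and column names are my own:

```r
# Accuracies as reported in the sections above
results <- data.frame(
  model          = c("Random Forest", "Boosting"),
  train_accuracy = c(1, 0.9736478),
  test_accuracy  = c(0.9940527, 0.9621071)
)

# Error is the complement of accuracy
results$in_sample_error     <- 1 - results$train_accuracy
results$out_of_sample_error <- 1 - results$test_accuracy

# Pick the model with the smallest out-of-sample error
best_model <- results$model[which.min(results$out_of_sample_error)]
best_model
```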
# Find quiz answers
quiz_answers <- predict(rf_fit, quiz_data)
quiz_answers
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.