knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(caret)
library(corrplot)

Synopsis

This report is a writeup for the final project of the practical machine learning course in the Data Science Specialization provided by Johns Hopkins University’s on Coursera. The purpose of this assignment is to create a model that will assess an individiual’s performance of an exercise. This model will be trained from the weight lifting data set:

http://groupware.les.inf.puc-rio.br/har

Getting the Data

sample_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
quiz_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

sample_file_name <- "./data/pml-training.csv"
quiz_file_name <- "./data/pml-testing.csv"

# Download files
if(!file.exists(sample_file_name) | !file.exists(quiz_file_name)){
  download.file(sample_url, destfile=training_file_name, method="curl")
  download.file(quiz_url, destfile=testing_file_name, method="curl")
}

# Read files
sample_data <- read.csv(sample_file_name, na.string=c("NA", "#DIV/0!"))
quiz_data <- read.csv(quiz_file_name, na.string=c("NA", "#DIV/0!"))

Cleaning the Data

The dimensions of our sample data set are 160 columns by 19622 rows. Let’s see if we can reduce the size a bit to make things easier on ourselves. First we should remove columns from the data set that are irrelevant to training a model (names, timestamps, IDs) and then remove columns that are mostly NAs. The training dataset has a total of 1925102 missing values.

# Remove columns that are irrelevant to prediction (names, timestamps, row IDs)
sample_data <- sample_data[,-c(1:7)]
quiz_data <- quiz_data[,-c(1:7)]

# Remove columns that have greater than 5% missing values 
mean_not_nas <- function(x){mean(!is.na(x)) > 0.95}
good_mean_nas <- sapply(sample_data, mean_not_nas)
sample_data <- sample_data[, good_mean_nas]
quiz_data <- quiz_data[, good_mean_nas]

sum_nas <- sum(is.na(sample_data))

The dimensions are now 53 columns by 19622 rows. There are 0 missing values. Now we can break up the sample set into a training and testing set.

Cross Validation

In order to test our model’s accuracy before we take the quiz, we have to make a training and test set.

set.seed(4321)

# choose which indeces will be put in training set
inTrain <- createDataPartition(y=sample_data$classe, p=0.7, list=FALSE)

# Separate sample set into training and testing data frames
training <- sample_data[inTrain,]
testing <- sample_data[-inTrain,]

Building Models

I chose to build a random forest and a general boosting model because they are some of the most popular and accurate prediction models for a classification problem.

Random Forest Model

# Train model
rf_cont <- trainControl(method="cv", number=3, verboseIter=FALSE)
rf_fit <- train(classe~., data=training, method="rf",
                trControl=rf_cont, verbose=FALSE)

# Display final model
rf_fit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.67%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3903    1    1    0    1 0.0007680492
## B   18 2631    8    1    0 0.0101580135
## C    0   13 2369   14    0 0.0112687813
## D    0    1   21 2225    5 0.0119893428
## E    0    2    3    3 2517 0.0031683168

In Sample Error

Let’s check the accuracy of the model using the data that it was trained with.

# Run predictions on training data set
rf_train_pred <- predict(rf_fit, training)

# Assess accuracy of predictions
rf_train_accuracy <- confusionMatrix(training$classe, rf_train_pred)$overall[1]
rf_train_accuracy

## Accuracy 
##        1

Our accuracy with the training data is 100%. The random forest model has no in sample error.

Out of Sample Error

Now we can check the accuracy of the model using new data.

# Run predictions on testing data set
rf_test_pred <- predict(rf_fit, testing)

# Assess accuracy of predictions
rf_test_accuracy <- confusionMatrix(testing$classe, rf_test_pred)$overall[1]
rf_test_accuracy

##  Accuracy 
## 0.9940527

With the testing data, our model accuracy is 99.41%, so our model has 0.59% out of sample error.

Generalized Boosted Model

# Train model
gb_cont <- trainControl(method="repeatedcv", number=3, verboseIter=FALSE)
gb_fit <- train(classe~., data=training, method="gbm",
                trControl=gb_cont, verbose=FALSE)

# Display final model
gb_fit$finalModel

## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.

In Sample Error

Let’s check the accuracy of the model using the data that it was trained with.

# Run predictions on training dataest
gb_train_pred <- predict(gb_fit, training)

# Assess accuracy of predictions
gb_train_accuracy <- confusionMatrix(training$classe, gb_train_pred)$overall[1]
gb_train_accuracy

##  Accuracy 
## 0.9736478

Our accuracy using the training data is 97.36%, so this model has 2.64%

Out of Sample Error

Now we can check the accuracy of the model using new data.

# Run predictions on testing data set
gb_test_pred <- predict(gb_fit, testing)

# Assess accuracy of predictions
gb_test_accuracy <- confusionMatrix(testing$classe, gb_test_pred)$overall[1]
gb_test_accuracy

##  Accuracy 
## 0.9621071

With the testing data, this model accuracy is 96.21%, so our model has 3.79% out of sample error.

Comparison of Model Accuracy

Model	In Sample Error	Out of Sample Error
Random Forest	0	0.0059473
Boosting	0.0263522	0.0378929

I chose to use the random forest model to predict the quiz answers because it has a smaller out of sample error.

# Find quiz answers
quiz_answers <- predict(rf_fit, quiz_data)
quiz_answers

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.

https://www.coursera.org/learn/practical-machine-learning/peer/R43St/prediction-assignment-writeup/review/5pyI3RqcEeqaogrh_tkcKw

Practical Machine Learning Course Project

Daniel DeWaters

12/2/2019

Synopsis

Getting the Data

Cleaning the Data

Cross Validation

Building Models

Random Forest Model

In Sample Error

Out of Sample Error

Generalized Boosted Model

In Sample Error

Out of Sample Error

Comparison of Model Accuracy

References