Background
Quantified Self devices are becoming increasingly common and can collect a large amount of data about people and their personal health activities. The focus of this project is to use sample data on the quality of certain exercises to predict the manner in which participants performed the exercise.
Analysis
This analysis builds a machine learning model from the sample data with the aim of predicting, as accurately as possible, the manner in which the exercise was performed. This is a classification problem into discrete categories, which in the training data are located in the ‘classe’ variable.
Load and Preprocess
The analysis starts by downloading the data into local files. There are two data sets: the training data set and the testing data set on which predictions from the final model will be made. When the data is loaded into data frames, it is necessary to handle strings containing ‘#DIV/0!’ in otherwise numeric data, a common sentinel error code for division-by-zero errors. These error codes are loaded into the data frame as NA fields. The download code has been retained, but it is not executed when compiling this analytic document.
library(caret)
url_train <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
url_test <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
# download.file(url = url_train, destfile = 'data_train.csv')
# download.file(url = url_test, destfile = 'data_test.csv')
pml_train <- read.csv(file = 'data_train.csv',
                      na.strings = c('NA', '#DIV/0!', ''))
pml_quiz <- read.csv(file = 'data_test.csv',
                     na.strings = c('NA', '#DIV/0!', ''))
Exploratory data analysis reveals that the first 7 fields of the data are dimensional and may not be pertinent to the prediction model. According to the accompanying data documentation, the remaining fields are numeric. Those columns are looped through and cast to numeric, with the exception of the last column, which is the categorical class the prediction model will classify.
# Cast the measurement columns (8 through the second-to-last) to numeric
for (i in 8:(ncol(pml_train) - 1)) {
    pml_train[, i] <- as.numeric(as.character(pml_train[, i]))
    pml_quiz[, i] <- as.numeric(as.character(pml_quiz[, i]))
}
# Ensure the outcome is a factor (read.csv no longer creates factors by default in R >= 4.0)
pml_train$classe <- factor(pml_train$classe)
Analysis additionally reveals that several of the many variables are extraordinarily sparse and thus may not be as useful for building a classification model. The following code builds a slicer index of column names, removing the columns that contain NA values as well as the initial seven columns of dimensional data. Rather than modify the actual data, this vector of column names is used as a slicer index into the training data, cross-validation data, and testing data whenever they interact with a model.
# Build a slicer index: keep only columns with no NA values, then drop the first seven dimensional columns
feature_index <- colnames(pml_train[colSums(is.na(pml_train)) == 0])
feature_index <- feature_index[-c(1:7)]
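As a quick sanity check (not in the original analysis), the length of the slicer index can be confirmed; it should cover the 52 retained predictors plus the classe outcome, matching the 53 columns reported after the data split below.
length(feature_index)  # expected: 53 (52 predictors plus classe)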
Splitting Data into Training and Cross-Validation
To find an optimal model, with the best accuracy and the lowest out of sample error, the full training data is split randomly (with a set seed), placing 80% of the data into the training sample and 20% into a cross-validation sample. When the samples are created, they are sliced by column against the feature set so that only the variables of interest are fed into the model.
set.seed(1300)
index_train <- createDataPartition(y=pml_train$classe, p=0.80, list=FALSE)
data_train <- pml_train[index_train,feature_index]
data_xval <- pml_train[-index_train,feature_index]
dim(data_train); dim(data_xval)
## [1] 15699 53
## [1] 3923 53
Pre-Model Fitting Class Examination
Before a model is fit, it is useful to have an idea of the expected distribution of the classification outcome. This will govern how we seek to optimize models for Specificity, Sensitivity, and Positive/Negative Predictive Value.
ggplot(aes(x=classe), data=data_train) +
    geom_bar(fill='dark orange') +   # classe is categorical, so plot counts with geom_bar rather than geom_histogram
    ggtitle('Histogram of Classe Frequency in Training Set') +
    xlab('Classe of Exercise') +
    ylab('Frequency in Training Data')
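As a numeric complement to the chart, the class balance can also be tabulated directly; a minimal sketch using base R:
round(prop.table(table(data_train$classe)), 3)  # proportion of each classe in the training split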
The chart above shows that the class counts are all within an order of magnitude of one another, with each class roughly as likely as any other. This suggests that optimizing a model for overall accuracy, and thereby minimizing the overall out of sample error, should yield an optimal model for making classifications.
Train Candidate Models
Five candidate models are trained on the training data and validated for accuracy against the cross-validation data set. The models compared are Random Forest, Decision Tree, Gradient Boosting, Linear Discriminant Analysis, and Naive Bayes.
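Each trainControl call below sets allowParallel=TRUE, but that flag only takes effect if a parallel backend has been registered beforehand. The original analysis does not show that step; the following is a hedged sketch assuming the doParallel package is installed:
library(doParallel)                                          # assumed available; not loaded elsewhere in this analysis
cl <- makePSOCKcluster(max(1, parallel::detectCores() - 1))  # leave one core free for the session
registerDoParallel(cl)                                       # caret's allowParallel=TRUE will now use this cluster
# stopCluster(cl)  # run after training completes to release the workers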
mod_rf <- train(classe ~ .,
                data=data_train,
                method='rf',
                trControl=trainControl(method="cv",
                                       number=4,
                                       allowParallel=TRUE,
                                       verboseIter=TRUE)) # Random Forest
mod_tree <- train(classe ~ .,
                  data=data_train,
                  method='rpart',
                  trControl=trainControl(method="cv",
                                         number=4,
                                         allowParallel=TRUE,
                                         verboseIter=TRUE)) # Decision Tree
mod_gbm <- train(classe ~ .,
                 data=data_train,
                 method='gbm',
                 verbose=FALSE,
                 trControl=trainControl(method="cv",
                                        number=4,
                                        allowParallel=TRUE,
                                        verboseIter=TRUE)) # Gradient Boosting
mod_lda <- train(classe ~ .,
                 data=data_train,
                 method='lda',
                 trControl=trainControl(method="cv",
                                        number=4,
                                        allowParallel=TRUE,
                                        verboseIter=TRUE)) # Linear Discriminant Analysis
mod_nb <- train(classe ~ .,
                data=data_train,
                method='nb',
                trControl=trainControl(method="cv",
                                       number=4,
                                       allowParallel=TRUE,
                                       verboseIter=TRUE)) # Naive Bayes
Predictions Against Cross Validation Data
For each candidate model, predictions are made against the cross-validation data set. Then a confusion matrix is calculated and stored for each model for later reference.
pred_rf <- predict(mod_rf,data_xval)
cm_rf <- confusionMatrix(pred_rf,data_xval$classe)
pred_tree <- predict(mod_tree,data_xval)
cm_tree <- confusionMatrix(pred_tree,data_xval$classe)
pred_gbm <- predict(mod_gbm,data_xval)
cm_gbm <- confusionMatrix(pred_gbm,data_xval$classe)
pred_lda <- predict(mod_lda,data_xval)
cm_lda <- confusionMatrix(pred_lda,data_xval$classe)
pred_nb <- predict(mod_nb,data_xval)
cm_nb <- confusionMatrix(pred_nb,data_xval$classe)
The accuracy results are accessible within the confusion matrix object derived from each model. To easily analyze them, the out of sample accuracy is aggregated for each model and plotted on a common index.
model_compare <- data.frame(Model = c('Random Forest',
                                      'Trees',
                                      'Gradient Boosting',
                                      'Linear Discriminant',
                                      'Naive Bayes'),
                            Accuracy = c(cm_rf$overall[1],
                                         cm_tree$overall[1],
                                         cm_gbm$overall[1],
                                         cm_lda$overall[1],
                                         cm_nb$overall[1]))
ggplot(aes(x=Model, y=Accuracy), data=model_compare) +
geom_bar(stat='identity', fill = 'blue') +
ggtitle('Comparative Accuracy of Models on Cross-Validation Data') +
xlab('Models') +
ylab('Overall Accuracy')
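The same comparison can also be read as a table by printing the data frame sorted by accuracy (a small convenience, not in the original code):
model_compare[order(-model_compare$Accuracy), ]  # most accurate model first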
The Random Forest model appears to be the most accurate. Further exploration of the model will help confirm this.
ggplot(data=data.frame(cm_rf$table)) +
geom_tile(aes(x=Reference, y=Prediction, fill=Freq)) +
ggtitle('Prediction Accuracy for Classes in Cross-Validation (Random Forest Model)') +
xlab('Actual Classe') +
ylab('Predicted Classe from Model')
The plot above demonstrates the accuracy of the model by comparing the predictions for each class against the actual values in the cross-validation set, coloring each tile by the frequency of that combination of predicted and actual class. The tiles where Predicted Classe equals Actual Classe show very high frequencies for all 5 classes in the Random Forest model, indicating that the model is highly accurate both overall and within each class.
ggplot(data=data.frame(cm_tree$table)) +
geom_tile(aes(x=Reference, y=Prediction, fill=Freq)) +
ggtitle('Prediction Accuracy for Classes in Cross-Validation (Decision Tree Model)') +
xlab('Actual Classe') +
ylab('Predicted Classe from Model')
To see the amount of error noise in competing models, compare the Random Forest tile plot above with the much less accurate Decision Tree model shown here, and the difference becomes apparent.
Beyond just evaluating a graphical representation, it is prudent to examine the entire bank of calculated statistics available within the confusion matrix object from the cross-validation data for the Random Forest model.
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 8 0 0 0
## B 0 751 3 0 0
## C 0 0 677 5 0
## D 0 0 4 637 1
## E 0 0 0 1 720
##
## Overall Statistics
##
## Accuracy : 0.994
## 95% CI : (0.992, 0.996)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.993
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.989 0.990 0.991 0.999
## Specificity 0.997 0.999 0.998 0.998 1.000
## Pos Pred Value 0.993 0.996 0.993 0.992 0.999
## Neg Pred Value 1.000 0.997 0.998 0.998 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.191 0.173 0.162 0.184
## Detection Prevalence 0.287 0.192 0.174 0.164 0.184
## Balanced Accuracy 0.999 0.994 0.994 0.995 0.999
The accuracy of the model is 0.9944, and the out of sample error is 0.0056, where the out of sample error is calculated as 1 - accuracy for predictions made against the cross-validation set. Considering that the final test set has a sample size of 20, an accuracy rate well above 99% gives good reason to expect that few or none of the test samples will be misclassified.
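Rather than transcribing these figures by hand, the same quantities can be pulled directly from the stored confusion matrix object; a minimal sketch:
cv_accuracy <- as.numeric(cm_rf$overall['Accuracy'])  # cross-validation accuracy
oos_error <- 1 - cv_accuracy                          # estimated out of sample error
round(c(accuracy = cv_accuracy, error = oos_error), 4)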
Applying Selected Model to Test Set
The test set contains 20 samples to be classified. The column names are not consistent between the test and training data, so it is necessary to rename the last column of the testing set for compatibility. However, since that column is not part of the feature set fed into the predictor, the rename does not affect the predictions. Once the predictions are made with the chosen Random Forest model, the prediction vector is shown.
final_col <- ncol(pml_quiz)                # index of the last column in the quiz data
colnames(pml_quiz)[final_col] <- 'classe'  # rename so the feature_index slice applies to the quiz data
quiz_rf <- predict(mod_rf, pml_quiz[, feature_index])
quiz_rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E