Synopsis

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. Additional information is available at http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Objective

In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. My analysis seeks to accurately predict the manner in which they did the exercise (i.e., the “classe” variable in the training set).

Libraries

The R libraries used for this analysis include:

library(randomForest)
library(caret)
library(dplyr)
library(corrplot)

Data Processing

Data Loading

The data for this project originated from the following source: http://groupware.les.inf.puc-rio.br/har. A training set and a test set were already pre-established; their locations are given by the URLs in the code below.

Initial loading of the data is as follows:

# load training data
trainURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train_DF <- read.csv(file=trainURL, na.strings = c("NA", ""), stringsAsFactors=FALSE)

# load test data
testURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test_DF <- read.csv(file=testURL, na.strings = c("NA", ""), stringsAsFactors=FALSE)

Data Pre-processing

Next, I perform some data pre-processing to reduce the data and to assign variable classes properly. Many of the variables have significant amounts of missing data, so I eliminate these variables from the dataframes. In addition, early data exploration revealed no relationship between the timestamp and window variables and the classe variable, so I remove them along with the “X” and user_name variables, which serve tracking purposes only. I also re-class the dependent (“classe”) variable as a factor.

# identify and remove variables with significant missing values
train_DF <- train_DF[colSums(is.na(train_DF)) == 0]
test_DF <- test_DF[colSums(is.na(test_DF)) == 0]

# remove time stamp and window variables
train_DF <- select(train_DF, -contains("timestamp"), -ends_with("window"), -starts_with("user"), -X)
test_DF <- select(test_DF, -contains("timestamp"), -ends_with("window"), -starts_with("user"), -X)

# set dependent ("classe") variable to a factor variable
train_DF$classe <- as.factor(train_DF$classe)
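
As a quick optional check, the dimensions of the reduced dataframes can be inspected; the expected column counts noted below are assumptions based on the 52 predictors referenced later in this analysis.

# optional check: confirm how many variables remain after the clean-up
dim(train_DF)   # expecting 53 columns: 52 predictors plus classe
dim(test_DF)    # expecting 53 columns: 52 predictors plus the problem identifier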

In addition, I check for zero and near-zero variance variables by applying the nearZeroVar() function; as the output below shows, no variables were identified as having zero or near-zero variance.

# perform near-zero variance analysis on numeric columns
zero_var_variables <- nearZeroVar(train_DF[sapply(train_DF, is.numeric)], saveMetrics = TRUE)

# rather than display all the results, this will display only variables that have zero or near-zero variance
filter(zero_var_variables, zeroVar == TRUE | nzv == TRUE)
## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

Data Splitting

Prior to any data reduction or modeling, I split my current training dataframe into training, validation, and test dataframes so that I can assess the prediction performance and error rate of my model before applying it to the test set. The training set allows me to build and fit multiple models. The validation set allows me to compare their accuracy, perform any model/parameter tuning, and choose the best-performing model. The testing set is used to apply the final selected model and assess its expected performance on out-of-sample data. I set the random seed so results can be reproduced, and I apply a 50/25/25 split of the data.

# set random seed
set.seed(3030)

# split dataframe into training, validation & testing dataframes
# 50% of total data goes towards training and 50% of the remaining (25% of total)
# goes towards validation and the other 50% of remaining (25% of total) goes towards testing
inTrain <- createDataPartition(y = train_DF$classe, p=0.50, list=FALSE)
training_DF <- train_DF[inTrain, ]

remaining_DF <- train_DF[-inTrain, ]
inVal <- createDataPartition(y = remaining_DF$classe, p=0.50, list=FALSE)
validating_DF <- remaining_DF[inVal, ]
testing_DF <- remaining_DF[-inVal, ]
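
To verify that the intended 50/25/25 split was achieved, the relative sizes of the three dataframes can be checked (an optional step, not part of the modeling itself):

# verify the approximate 50/25/25 split of the original training data
round(c(training = nrow(training_DF),
        validating = nrow(validating_DF),
        testing = nrow(testing_DF)) / nrow(train_DF), 2)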

Predictor Relationships & Data Compression

At this point I have 52 predictor variables. To identify additional data reduction options, I analyze the correlations among the numeric predictor variables. The correlation plot below orders the predictor variables using hierarchical cluster analysis. As a result, clusters of strongly positively correlated (darker blue) variables appear along the diagonal, while groupings of negatively correlated (darker red) variables appear throughout the matrix. The plot reveals several groups of highly correlated variables, suggesting that principal component analysis (PCA) may be a suitable data reduction technique.

# calculate correlation on numeric predictor variables
predictor_corr <- round(cor(training_DF[sapply(training_DF, is.numeric)]), 2)

# plot correlation matrix and order the variables using hierarchical cluster analysis
par(ps=5)
corrplot.mixed(predictor_corr, order = "hclust", tl.col="black", diag = "n", tl.pos = "lt", 
               lower = "circle", upper = "number", tl.cex = 1.5, mar=c(1,0,1,0))

[Figure: correlation matrix of the numeric predictors, ordered by hierarchical clustering]
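
As a complementary numeric check (not part of the plot output), caret's findCorrelation() can flag predictors involved in highly correlated pairs; the 0.8 cutoff below is an illustrative choice rather than a tuned value.

# count predictors flagged as highly correlated (|r| > 0.8); the cutoff is illustrative
high_corr <- findCorrelation(cor(training_DF[sapply(training_DF, is.numeric)]), cutoff = 0.8)
length(high_corr)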

Next, I compress the data dimensions using PCA, which allows me to work with fewer predictor variables while preserving the structure of the data in a way that best explains its variance. The PCA required 26 components to capture 95% of the variance. Using all of the principal components reduces the number of predictor variables by 50%.

# train PCA model
compress <- preProcess(training_DF[,-53], method = "pca")

# apply PCA model to training, validation and test sets
training_PCA <- predict(compress, training_DF[,-53])
validating_PCA <- predict(compress, validating_DF[,-53])
testing_PCA <- predict(compress, testing_DF[,-53])
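
Printing the preProcess object confirms the number of components retained; caret's summary reports how many components were needed to capture the default 95% variance threshold.

# print the PCA pre-processing summary; it reports the number of components
# needed to capture 95% of the variance (caret's default threshold)
compress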

Modeling & Validation

Applying Random Forest to PCA Data

The modeling technique I apply is Random Forest. This technique improves predictive performance over a single tree by reducing variance and is well suited to handle many variables. First, I train the Random Forest model and then I visualize the relative importance of each principal component.

# train random forest predictive model
model_1 <- train(training_DF$classe ~., method = "rf", data = training_PCA, 
                 trControl = trainControl(method = "cv", number = 4), 
                 ntree = 100, importance = TRUE)

# visualize the relative importance of each PCA component in the predictive model
par(ps=7)
varImpPlot(model_1$finalModel, sort = TRUE, type = 1, pch = 19, col = 1, cex = 1,
           main = "Relative Importance of Principal Components \nin Random Forest Predictive Model")

[Figure: relative importance of the principal components in the Random Forest predictive model]
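
Before moving to the validation set, the 4-fold cross-validation accuracy estimated during training can be inspected directly from the fitted train object (an optional check on the resampling results):

# inspect the resampled (4-fold CV) accuracy for each value of mtry that was tried
model_1$results[, c("mtry", "Accuracy", "Kappa")]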

Cross Validation of Random Forest Model with PCA Data

Next I validate my initial PCA-based model on the held-out validation set to assess its accuracy. As the confusion matrix shows, there is a fair number of misclassifications across all levels of the classe variable.

# apply Random Forest model to validation set
model_1_validate <- predict(model_1, validating_PCA)

# generate confusion matrix for validation model
CM <- confusionMatrix(validating_DF$classe, model_1_validate)
CM$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1373   13    6    2    1
##          B   28  902   14    1    4
##          C    5   25  812    9    5
##          D    2    2   53  745    2
##          E    2   10   15   11  864

I can assess the accuracy of the model as follows:

model_1_accuracy <- round(postResample(validating_DF$classe, model_1_validate)[[1]], 3)
model_1_accuracy
## [1] 0.957

The accuracy rate of 95.7% is not bad and gives us an estimated out-of-sample error of 4.3% as the following shows:

1 - model_1_accuracy
## [1] 0.043
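
The same accuracy, together with its 95% confidence interval, can also be read from the confusion-matrix object computed above:

# overall accuracy and its 95% confidence interval from the confusion matrix
CM$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]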

Applying Random Forest to Full Data Set

To assess how much predictive power was lost by compressing the data with PCA, I now apply the same modeling technique, Random Forest, to the full data set. The variable importance plot shown below illustrates that 7 variables have substantial importance and that the mean decrease in accuracy begins to taper off starting with the accel_forearm_x variable.

# train random forest predictive model
model_2 <- train(classe ~., method = "rf", data = training_DF, 
                 trControl = trainControl(method = "cv", number = 4), 
                 ntree = 100, importance = TRUE)

# visualize the relative importance of each predictor variable in the model
par(ps=7)
varImpPlot(model_2$finalModel, sort = TRUE, type = 1, pch = 19, col = 1, cex = 1,
           main = "Relative Importance of Top 30 Predictor Variables \nin Random Forest Predictive Model")

[Figure: relative importance of the top 30 predictor variables in the Random Forest predictive model]
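
For a numeric view of the same ranking, the mean decrease in accuracy can be extracted from the underlying randomForest object; the top-7 cutoff below simply mirrors the observation above.

# numeric view of variable importance: mean decrease in accuracy (type = 1)
imp <- importance(model_2$finalModel, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 7)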

Cross Validation of Random Forest Model with Full Data Set

Next I validate my second model (the full data set model) on the validation set to assess its accuracy. As the confusion matrix shows, the number of misclassifications decreases markedly across all levels of the classe variable.

# apply Random Forest model to validation set
model_2_validate <- predict(model_2, validating_DF)

# generate confusion matrix for validation model
CM <- confusionMatrix(validating_DF$classe, model_2_validate)
CM$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    2    0    0    1
##          B   16  926    6    1    0
##          C    0   15  835    6    0
##          D    2    0   26  776    0
##          E    0    0    5    2  895

This translates into improved predictive performance. The accuracy rate of model 2 is:

model_2_accuracy <- round(postResample(validating_DF$classe, model_2_validate)[[1]], 3)
model_2_accuracy
## [1] 0.983

The increased accuracy rate of 98.3% gives us an estimated out-of-sample error of 1.7% as the following shows:

1 - model_2_accuracy
## [1] 0.017

This suggests that the data compression via PCA in model 1 cost a little over 2 percentage points of predictive accuracy; equivalently, it increased the estimated out-of-sample error from 1.7% to 4.3%.
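
For completeness, the two models' estimated out-of-sample error rates can be compared directly from the accuracy values computed above:

# compare the estimated out-of-sample error rates of the two models
c(model_1_error = 1 - model_1_accuracy, model_2_error = 1 - model_2_accuracy)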

Final Model Testing & Predicted Answers

Because of its higher accuracy in the validation phase, I apply model 2 to the testing data set in the same fashion as I did with the validation set.

# apply Random Forest model to testing set
model_2_test <- predict(model_2, testing_DF)

# generate confusion matrix for the test set predictions
CM <- confusionMatrix(testing_DF$classe, model_2_test)
CM$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1389    5    1    0    0
##          B    8  930   11    0    0
##          C    0    5  847    3    0
##          D    1    1   21  777    4
##          E    0    0    0    5  896

I can assess the accuracy of the model as follows:

model_2_accuracy <- round(postResample(testing_DF$classe, model_2_test)[[1]], 3)
model_2_accuracy
## [1] 0.987

The accuracy rate for the final model on the testing dataset is 98.7% which gives us an estimated out-of-sample error of 1.3% as the following shows:

1 - model_2_accuracy
## [1] 0.013

As expected, the final model applied to the testing set produced accuracy and estimated out-of-sample error rates similar to those obtained on the validation set, since no further model/parameter tuning took place after validation. The final step is to predict the results of the 20 observations withheld from the original data. These are the results I plan to submit for grading.

# apply Random Forest model to withheld set
answers <- predict(model_2, test_DF[, -53])
answers
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
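
Finally, if each prediction needs to be submitted as a separate text file, a small helper along the following lines could be used; this is a hypothetical sketch, and the file-naming convention should be adapted to the actual submission requirements.

# hypothetical helper: write each of the 20 predictions to its own text file
write_answer_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_answer_files(answers)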