Overview

In this project, I build a machine learning model to predict the class of weight lifting technique performed by a volunteer, given measurements from wearable activity monitors worn during the exercise. The data come from the Weight Lifting Exercises dataset collected and made available by Velloso and colleagues (Groupware@LES). Each volunteer wore sensors on the forearm, upper arm, and waist while doing bicep curls, and a fourth sensor was mounted on the dumbbell itself. Each volunteer performed the exercise correctly (class A) and then in four incorrect fashions: with elbows pushing forward (class B), lifting only halfway (class C), lowering only halfway (class D), and pushing the hips forward (class E). The sensors on the volunteer's body and the dumbbell recorded the motion of both, and I will use these activity measurements to predict the class of the weight lifting technique.

Preprocessing Data

First, I load the libraries I will use in this analysis, then download and open up the data.

library(RColorBrewer)
library(ggplot2)
library(reshape2)
library(caret)

if (!file.exists("./pml-training.csv"))
        {fileURL1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
        download.file(fileURL1, destfile = "./pml-training.csv", method = "curl")}

training <- read.csv("./pml-training.csv", stringsAsFactors = FALSE)

Some of the columns need to have their data types adjusted.

# Columns 8 through 159 hold the sensor measurements and should be numeric;
# columns 2, 6, and 160 (user name, window flag, and the outcome classe) are categorical
training[,8:159] <- lapply(training[,8:159], as.numeric)
training[,c(2,6,160)] <- lapply(training[,c(2,6,160)], as.factor)

There are quite a number of NA values in this data set. The majority of the columns, in fact, are almost entirely populated by NA values.

# Fraction of NA values in each column
fracNAs <- sapply(training, function(y) sum(is.na(y)) / nrow(training))

ggplot(data = data.frame(fracNAs), aes(x = fracNAs)) +
        geom_histogram(alpha = 0.6, fill = "mediumpurple4", binwidth = .02) +
        theme(legend.position = "none") +
        ggtitle("Fraction of NAs in Training Set Columns") +
        xlab("Fraction of NAs") + ylab("Number of columns")

This plot shows that 60 columns in the data set are complete, with no missing values, while the rest are 97% to 99% missing. Let’s look at a heatmap of the values in the data set to search for patterns in these missing values.

heatmap(as.matrix(training[,8:159]), col = brewer.pal(9, "Purples"), Colv = NA, Rowv = NA, 
        labRow = NA, labCol = NA, xlab = "Columns", ylab = "Rows", 
        main = "Heatmap of Data Set Values", margins = c(2,2))

Given this pattern of missing values, it would be unwise to attempt to impute the columns that contain NAs. When only 1% to 3% of a column's values are real, there is not enough information to obtain reliable nearest neighbor estimates, for example. Instead, I will remove the columns with high levels of NA values, keeping only 37.5% of the columns present in the original data set.

trainkeep <- training[,fracNAs < .5]
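
As a sanity check on this filter (not part of the modeling pipeline itself), we can confirm how many columns survive and what fraction of the original columns that represents:

# Dimensions after dropping the mostly-NA columns, and the fraction of columns kept
dim(trainkeep)
ncol(trainkeep) / ncol(training)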

Now it is time to partition this data set into separate sets for training the model, validating the model, and testing the model. I will use 50% of the data to train the model, 20% of the data for validation, and 30% of the data for final testing of my model.

set.seed(34434)
inTrain <- createDataPartition(trainkeep$classe, p = .7, list = FALSE)
trainVal <- trainkeep[inTrain,]
testMod <- trainkeep[-inTrain,]
inTrain <- createDataPartition(trainVal$classe, p = 5/7, list = FALSE)
trainMod <- trainVal[inTrain,]
validMod <- trainVal[-inTrain,]
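
The two-step split above first holds out 30% of the rows for testing and then divides the remaining 70% in a 5:2 ratio, giving the intended 50/20/30 division. A quick check of the proportions:

# Proportion of rows in each set (should be roughly 0.50 / 0.20 / 0.30)
round(c(train = nrow(trainMod), validation = nrow(validMod),
        test = nrow(testMod)) / nrow(trainkeep), 2)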

Let’s examine the columns that still remain. Do any of them have a problem with near zero variance within the training set?

nearZeroVar(trainMod[,8:59], saveMetrics = TRUE)
##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.137850     9.5883432   FALSE FALSE
## pitch_belt            1.079545    15.7326269   FALSE FALSE
## yaw_belt              1.032847    16.5885470   FALSE FALSE
## total_accel_belt      1.079800     0.2853067   FALSE FALSE
## gyros_belt_x          1.042042     1.2533116   FALSE FALSE
## gyros_belt_y          1.175334     0.6317506   FALSE FALSE
## gyros_belt_z          1.021505     1.5488078   FALSE FALSE
## accel_belt_x          1.005102     1.6201345   FALSE FALSE
## accel_belt_y          1.122047     1.3450173   FALSE FALSE
## accel_belt_z          1.004435     2.8224985   FALSE FALSE
## magnet_belt_x         1.053476     2.9040147   FALSE FALSE
## magnet_belt_y         1.076677     2.7613613   FALSE FALSE
## magnet_belt_z         1.026667     4.1777053   FALSE FALSE
## roll_arm             50.764706    22.3150601   FALSE FALSE
## pitch_arm            86.350000    25.2802119   FALSE FALSE
## yaw_arm              29.254237    24.2205013   FALSE FALSE
## total_accel_arm       1.008791     0.6623191   FALSE FALSE
## gyros_arm_x           1.040984     6.2359894   FALSE FALSE
## gyros_arm_y           1.465587     3.6070919   FALSE FALSE
## gyros_arm_z           1.109848     2.2824536   FALSE FALSE
## accel_arm_x           1.011364     7.5606277   FALSE FALSE
## accel_arm_y           1.053571     5.2272264   FALSE FALSE
## accel_arm_z           1.081967     7.4485429   FALSE FALSE
## magnet_arm_x          1.043478    13.1342979   FALSE FALSE
## magnet_arm_y          1.062500     8.6101488   FALSE FALSE
## magnet_arm_z          1.000000    12.5229264   FALSE FALSE
## roll_dumbbell         1.257576    88.6183004   FALSE FALSE
## pitch_dumbbell        1.891566    86.3969839   FALSE FALSE
## yaw_dumbbell          1.338710    88.1903403   FALSE FALSE
## total_accel_dumbbell  1.004348     0.4279601   FALSE FALSE
## gyros_dumbbell_x      1.027682     2.3028327   FALSE FALSE
## gyros_dumbbell_y      1.281356     2.6492765   FALSE FALSE
## gyros_dumbbell_z      1.146758     1.8544936   FALSE FALSE
## accel_dumbbell_x      1.068323     3.9637253   FALSE FALSE
## accel_dumbbell_y      1.039683     4.5037701   FALSE FALSE
## accel_dumbbell_z      1.040000     3.9841043   FALSE FALSE
## magnet_dumbbell_x     1.120879    10.0774404   FALSE FALSE
## magnet_dumbbell_y     1.451220     8.0293458   FALSE FALSE
## magnet_dumbbell_z     1.021053     6.5722437   FALSE FALSE
## roll_forearm         11.287425    16.5070308   FALSE FALSE
## pitch_forearm        72.538462    23.2015488   FALSE FALSE
## yaw_forearm          14.960317    15.8345221   FALSE FALSE
## total_accel_forearm   1.160804     0.6419401   FALSE FALSE
## gyros_forearm_x       1.134387     2.7919299   FALSE FALSE
## gyros_forearm_y       1.037037     7.1428571   FALSE FALSE
## gyros_forearm_z       1.099567     2.8224985   FALSE FALSE
## accel_forearm_x       1.111111     7.8357449   FALSE FALSE
## accel_forearm_y       1.019608     9.6494803   FALSE FALSE
## accel_forearm_z       1.025974     5.3800693   FALSE FALSE
## magnet_forearm_x      1.048780    14.0513552   FALSE FALSE
## magnet_forearm_y      1.187500    17.8724271   FALSE FALSE
## magnet_forearm_z      1.031250    15.4982678   FALSE FALSE

These results look good: none of the remaining columns have a problem with near zero variance. (As one might expect, running nearZeroVar on the data set before removing the columns with high levels of NA values does flag some variables with near zero variance.)
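
That comparison is easy to check directly; a minimal sketch, run on the unfiltered training data:

# Near zero variance diagnostics on the original, unfiltered columns
nzvFull <- nearZeroVar(training, saveMetrics = TRUE)
sum(nzvFull$nzv)                 # how many columns are flagged
rownames(nzvFull)[nzvFull$nzv]   # which columns they are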

Now let’s examine whether any of the remaining columns are highly correlated with each other. If they are, I would like to remove them to obtain better results from my model.

findCorrelation(cor(trainMod[,8:59]), verbose = TRUE)
## Compare row 10  and column  1 with corr  0.992 
##   Means:  0.274 vs 0.171 so flagging column 10 
## Compare row 1  and column  9 with corr  0.925 
##   Means:  0.255 vs 0.167 so flagging column 1 
## Compare row 9  and column  4 with corr  0.928 
##   Means:  0.236 vs 0.164 so flagging column 9 
## Compare row 8  and column  2 with corr  0.966 
##   Means:  0.251 vs 0.16 so flagging column 8 
## Compare row 19  and column  18 with corr  0.919 
##   Means:  0.091 vs 0.16 so flagging column 18 
## All correlations <= 0.9
## [1] 10  1  9  8 18

These columns are highly correlated with other columns in the training set, so I will remove them. Note that findCorrelation() returns indices relative to the subset trainMod[,8:59] passed to it, so they must be offset to line up with column positions in the full data frame. Removing these columns increased the accuracy of my model by a modest but meaningful amount.
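
A small sketch of that mapping from flagged subset indices back to column numbers and names in the full data frame:

# findCorrelation() indices refer to trainMod[,8:59]; add 7 to get positions in trainMod
highCor <- findCorrelation(cor(trainMod[,8:59]))
highCor + 7                   # column numbers in the full data frame
names(trainMod)[highCor + 7]  # the corresponding column names

With the correlated columns identified, I drop them (by their positions in the full data frame) from all three data sets: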

trainMod <- trainMod[,-c(9, 16, 17, 18, 26, 39, 41)]
validMod <- validMod[,-c(9, 16, 17, 18, 26, 39, 41)]
testMod <- testMod[,-c(9, 16, 17, 18, 26, 39, 41)]

This leaves me with 45 predictors to use for fitting, still perhaps a high number. Looking at what these predictors measure, it seems possible that a weighted combination of them may work better than fitting to each individual predictor on its own.

names(trainMod[,8:52])
##  [1] "roll_belt"            "yaw_belt"             "total_accel_belt"    
##  [4] "gyros_belt_x"         "gyros_belt_y"         "gyros_belt_z"        
##  [7] "accel_belt_x"         "magnet_belt_y"        "magnet_belt_z"       
## [10] "roll_arm"             "pitch_arm"            "yaw_arm"             
## [13] "total_accel_arm"      "gyros_arm_x"          "gyros_arm_z"         
## [16] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [19] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [22] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [25] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_z"    
## [28] "accel_dumbbell_y"     "accel_dumbbell_z"     "magnet_dumbbell_x"   
## [31] "magnet_dumbbell_y"    "magnet_dumbbell_z"    "roll_forearm"        
## [34] "pitch_forearm"        "yaw_forearm"          "total_accel_forearm" 
## [37] "gyros_forearm_x"      "gyros_forearm_y"      "gyros_forearm_z"     
## [40] "accel_forearm_x"      "accel_forearm_y"      "accel_forearm_z"     
## [43] "magnet_forearm_x"     "magnet_forearm_y"     "magnet_forearm_z"

To this end, I will do a principal component analysis and find a new set of variables that will explain 95% of the variance in the training data I am working with at this point in the analysis.

preProc <- preProcess(trainMod[,8:52], method = "pca", thresh = .95)
preProc
## Created from 9814 samples and 45 variables
## 
## Pre-processing:
##   - centered (45)
##   - ignored (0)
##   - principal component signal extraction (45)
##   - scaled (45)
## 
## PCA needed 25 components to capture 95 percent of the variance

So 25 components are needed to capture the specified threshold of variance in the training set, fewer than the 45 columns I was working with before. The model will not be as computationally expensive with this reduced number of predictors, so it may be worth using the PCA predictors instead of the original activity measurements. Let’s see what the first few principal components look like. In the plot below, the colored bars show the contribution of each original predictor to that principal component. Each principal component is uncorrelated with the others, and together the principal components capture almost all of the information in the data set.

melted <- melt(preProc$rotation[,1:9])
ggplot(data = melted) +
        theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
        xlab("Measurements from sensors") +
        ylab("Relative importance in each principle component") +
        ggtitle("Variables in Principal Component Analysis") +
        geom_bar(aes(x=Var1, y=value, fill=Var1), stat="identity") +
        facet_wrap(~Var2)

Model Selection and Fitting

To find the best model to predict the activity class from the activity measurements, I experimented with boosting models (method = "gbm") and random forest models (method = "rf") in the caret package, using either the PCA predictors or the original predictors from the activity measurements. After building each model, I checked the accuracy achieved on my validation set. The best results on the validation set came from a random forest model using the original, non-PCA predictors.

set.seed(45656)
# Random forest on the 45 remaining predictors, tuned with 5-fold cross-validation
modelFit <- train(trainMod$classe ~ ., method = "rf", data = trainMod[,8:52],
                  trControl = trainControl(method = "cv", number = 5, 
                                           allowParallel = TRUE))
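
For reference, the PCA-based comparison model mentioned above can be built along these lines; this is a sketch of how those alternatives were fit (the object names trainPC, validPC, and modelFitPC are illustrative), not the final model:

# Build the PCA predictors from the preProcess object fit earlier,
# then train and evaluate a comparison random forest on them
trainPC <- predict(preProc, trainMod[,8:52])
validPC <- predict(preProc, validMod[,8:52])
modelFitPC <- train(x = trainPC, y = trainMod$classe, method = "rf",
                    trControl = trainControl(method = "cv", number = 5,
                                             allowParallel = TRUE))
confusionMatrix(validMod$classe, predict(modelFitPC, validPC))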

Looking at the validation set, the best accuracy I achieved with a random forest model and the PCA predictors was about 96%. The best accuracy I achieved with a boosting model with the original, non-PCA predictors was also about 96%. The accuracy of the boosting model using the PCA predictors was much lower (~80%), which is interesting to think about given how boosting works. Here I show the detailed results for accuracy on the validation set for my best model, the random forest model with original predictors.

confusionMatrix(validMod$classe, predict(modelFit, validMod[,8:52]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1114    2    0    0    0
##          B   17  740    2    0    0
##          C    0   10  673    1    0
##          D    1    2    9  631    0
##          E    0    1    1    3  716
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9875          
##                  95% CI : (0.9835, 0.9907)
##     No Information Rate : 0.2886          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9842          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9841   0.9801   0.9825   0.9937   1.0000
## Specificity            0.9993   0.9940   0.9966   0.9964   0.9984
## Pos Pred Value         0.9982   0.9750   0.9839   0.9813   0.9931
## Neg Pred Value         0.9936   0.9953   0.9963   0.9988   1.0000
## Prevalence             0.2886   0.1925   0.1746   0.1619   0.1825
## Detection Rate         0.2840   0.1886   0.1716   0.1608   0.1825
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9917   0.9871   0.9895   0.9950   0.9992

During training, I used cross-validation by specifying method = "cv" within the trainControl() function, and I obtained good accuracy for this random forest model with 5-fold (number = 5) cross-validation. The accuracy on the validation set is about 99% for this model, though it took about 1.5 times longer to train than the random forest with PCA predictors.
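
The fold-by-fold cross-validation accuracy can also be read directly off the fitted caret object:

# Accuracy for each cross-validation fold at the selected tuning value,
# and the summary across all candidate values of mtry
modelFit$resample
modelFit$results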

Discussion of Results and Errors

To estimate my out-of-sample error, I look at the test set, data I have not used at all in training or selecting my model. These data have been processed in the same manner as the training data (the same columns removed for NA values, correlated values, etc., as detailed above), but they have not been used in choosing or training the principal component analysis or the random forest model, so the accuracy on this set provides an honest estimate of the out-of-sample error.

myMatrix <- confusionMatrix(testMod$classe, predict(modelFit, testMod[,8:52]))
myMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    1    0    0    1
##          B   28 1106    5    0    0
##          C    0    9 1012    5    0
##          D    0    1   14  948    1
##          E    0    2    0    4 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9879          
##                  95% CI : (0.9848, 0.9906)
##     No Information Rate : 0.2889          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9847          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9835   0.9884   0.9816   0.9906   0.9981
## Specificity            0.9995   0.9931   0.9971   0.9968   0.9988
## Pos Pred Value         0.9988   0.9710   0.9864   0.9834   0.9945
## Neg Pred Value         0.9934   0.9973   0.9961   0.9982   0.9996
## Prevalence             0.2889   0.1901   0.1752   0.1626   0.1832
## Detection Rate         0.2841   0.1879   0.1720   0.1611   0.1828
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9915   0.9907   0.9893   0.9937   0.9984

I estimate my out-of-sample error to be about 1.2% (one minus the 0.9879 accuracy above), with a 95% confidence interval of 0.94% to 1.52%.
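
These figures can be pulled straight from the confusion matrix object; note that the confidence interval bounds swap when converting from accuracy to error:

# Out-of-sample error estimate and its 95% confidence interval, derived from the test-set accuracy
1 - myMatrix$overall["Accuracy"]
1 - myMatrix$overall[c("AccuracyUpper", "AccuracyLower")]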