In this project, I build a machine learning model to predict the class of weight lifting technique performed by a volunteer, given measurements from wearable activity monitors worn during the lifting. The data set was collected and made available by the researchers behind the Weight Lifting Exercises data set (Velloso et al.). Each volunteer wore sensors on the forearm, upper arm, and waist while doing bicep curls; there was another sensor on the dumbbell itself. Each volunteer performed the exercise correctly (class A), and then performed it in four incorrect fashions: with elbows pushing forward (class B), lifting the dumbbell only halfway (class C), lowering it only halfway (class D), and pushing the hips forward (class E). The sensors on the volunteer’s body and the dumbbell recorded the motion of both, and I will use these activity measurements to predict the class of the weight lifting technique.
First, I load the libraries I will use in this analysis, then download and open up the data.
library(RColorBrewer)
library(ggplot2)
library(reshape2)
library(caret)
if (!file.exists("./pml-training.csv")) {
    fileURL1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    download.file(fileURL1, destfile = "./pml-training.csv", method = "curl")
}
training <- read.csv("./pml-training.csv", stringsAsFactors = FALSE)
Since I read the file in with stringsAsFactors = FALSE, some of the columns need their data types adjusted: the sensor measurements (columns 8 through 159) should be numeric, while the user name, window flag, and classe outcome (columns 2, 6, and 160) should be factors.
training[,8:159] <- lapply(training[,8:159], as.numeric)
training[,c(2,6,160)] <- lapply(training[,c(2,6,160)], as.factor)
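As a quick check that the conversions took effect (this snippet is an addition to the original analysis):
# Inspect the three factor columns: user name, window flag, and classe outcome
str(training[, c(2, 6, 160)])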
There are quite a number of NA values in this data set. The majority of the columns, in fact, are almost entirely populated by NA values.
# Fraction of NA values in each column
fracNAs <- sapply(training, function(y) mean(is.na(y)))
ggplot(data = data.frame(fracNAs), aes(x = fracNAs)) +
geom_histogram(alpha = 0.6, fill = "mediumpurple4", binwidth = .02) +
theme(legend.position = "none") +
ggtitle("Fraction of NAs in Training Set Columns") +
xlab("Fraction of NAs") + ylab("Number of columns")
This plot shows that 60 columns in the data set are complete, with no missing values, but the rest have very high levels of missing values, 97% to 99%. Let’s look at a heatmap of the values in the data set to check for patterns in these missing values.
heatmap(as.matrix(training[,8:159]), col = brewer.pal(9, "Purples"), Colv = NA, Rowv = NA,
labRow = NA, labCol = NA, xlab = "Columns", ylab = "Rows",
main = "Heatmap of Data Set Values", margins = c(2,2))
Given this pattern of missing values, it would be unwise to attempt imputing the missing values in the columns where NAs are present. When only 1% to 3% of the values are real, there is not enough information to get reliable nearest neighbor estimates, for example. Instead, I will remove the columns with high levels of NA values, keeping only 37.5% (60 of 160) of the columns in the original data set.
trainkeep <- training[,fracNAs < .5]
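A quick check on the result (not in the original code):
# Fraction and count of columns retained after dropping the mostly-NA columns
mean(fracNAs < .5)   # 0.375, i.e., 60 of the 160 original columns
dim(trainkeep)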
Now it is time to partition this data set into separate sets for training, validating, and testing the model. I will use 50% of the data to train the model, 20% for validation, and 30% for final testing. Splitting off 70% of the data first, and then taking 5/7 of that for training, yields the 50/20/30 split.
set.seed(34434)
inTrain <- createDataPartition(trainkeep$classe, p = .7, list = FALSE)
trainVal <- trainkeep[inTrain,]
testMod <- trainkeep[-inTrain,]
inTrain <- createDataPartition(trainVal$classe, p = 5/7, list = FALSE)
trainMod <- trainVal[inTrain,]
validMod <- trainVal[-inTrain,]
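To confirm the split matches the intended proportions (a check added for clarity):
# Proportion of the full data set landing in each partition (expect ~0.50 / 0.20 / 0.30)
round(c(train = nrow(trainMod), valid = nrow(validMod), test = nrow(testMod)) / nrow(trainkeep), 2)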
Let’s examine the columns that still remain. Do any of them have a problem with near zero variance within the training set?
nearZeroVar(trainMod[,8:59], saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.137850 9.5883432 FALSE FALSE
## pitch_belt 1.079545 15.7326269 FALSE FALSE
## yaw_belt 1.032847 16.5885470 FALSE FALSE
## total_accel_belt 1.079800 0.2853067 FALSE FALSE
## gyros_belt_x 1.042042 1.2533116 FALSE FALSE
## gyros_belt_y 1.175334 0.6317506 FALSE FALSE
## gyros_belt_z 1.021505 1.5488078 FALSE FALSE
## accel_belt_x 1.005102 1.6201345 FALSE FALSE
## accel_belt_y 1.122047 1.3450173 FALSE FALSE
## accel_belt_z 1.004435 2.8224985 FALSE FALSE
## magnet_belt_x 1.053476 2.9040147 FALSE FALSE
## magnet_belt_y 1.076677 2.7613613 FALSE FALSE
## magnet_belt_z 1.026667 4.1777053 FALSE FALSE
## roll_arm 50.764706 22.3150601 FALSE FALSE
## pitch_arm 86.350000 25.2802119 FALSE FALSE
## yaw_arm 29.254237 24.2205013 FALSE FALSE
## total_accel_arm 1.008791 0.6623191 FALSE FALSE
## gyros_arm_x 1.040984 6.2359894 FALSE FALSE
## gyros_arm_y 1.465587 3.6070919 FALSE FALSE
## gyros_arm_z 1.109848 2.2824536 FALSE FALSE
## accel_arm_x 1.011364 7.5606277 FALSE FALSE
## accel_arm_y 1.053571 5.2272264 FALSE FALSE
## accel_arm_z 1.081967 7.4485429 FALSE FALSE
## magnet_arm_x 1.043478 13.1342979 FALSE FALSE
## magnet_arm_y 1.062500 8.6101488 FALSE FALSE
## magnet_arm_z 1.000000 12.5229264 FALSE FALSE
## roll_dumbbell 1.257576 88.6183004 FALSE FALSE
## pitch_dumbbell 1.891566 86.3969839 FALSE FALSE
## yaw_dumbbell 1.338710 88.1903403 FALSE FALSE
## total_accel_dumbbell 1.004348 0.4279601 FALSE FALSE
## gyros_dumbbell_x 1.027682 2.3028327 FALSE FALSE
## gyros_dumbbell_y 1.281356 2.6492765 FALSE FALSE
## gyros_dumbbell_z 1.146758 1.8544936 FALSE FALSE
## accel_dumbbell_x 1.068323 3.9637253 FALSE FALSE
## accel_dumbbell_y 1.039683 4.5037701 FALSE FALSE
## accel_dumbbell_z 1.040000 3.9841043 FALSE FALSE
## magnet_dumbbell_x 1.120879 10.0774404 FALSE FALSE
## magnet_dumbbell_y 1.451220 8.0293458 FALSE FALSE
## magnet_dumbbell_z 1.021053 6.5722437 FALSE FALSE
## roll_forearm 11.287425 16.5070308 FALSE FALSE
## pitch_forearm 72.538462 23.2015488 FALSE FALSE
## yaw_forearm 14.960317 15.8345221 FALSE FALSE
## total_accel_forearm 1.160804 0.6419401 FALSE FALSE
## gyros_forearm_x 1.134387 2.7919299 FALSE FALSE
## gyros_forearm_y 1.037037 7.1428571 FALSE FALSE
## gyros_forearm_z 1.099567 2.8224985 FALSE FALSE
## accel_forearm_x 1.111111 7.8357449 FALSE FALSE
## accel_forearm_y 1.019608 9.6494803 FALSE FALSE
## accel_forearm_z 1.025974 5.3800693 FALSE FALSE
## magnet_forearm_x 1.048780 14.0513552 FALSE FALSE
## magnet_forearm_y 1.187500 17.8724271 FALSE FALSE
## magnet_forearm_z 1.031250 15.4982678 FALSE FALSE
These results look good: none of the remaining columns has near zero variance. As one might expect, running nearZeroVar on the data set before removing the mostly-NA columns did reveal variables with near zero variance.
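For reference, this is how one could confirm that claim on the unfiltered measurement columns (a sketch; the specific count is not shown in the original analysis):
# Count the columns flagged as near-zero-variance before the NA filter
length(nearZeroVar(training[, 8:159]))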
Now let’s check whether any of the remaining columns are highly correlated with each other. If they are, I would like to remove them to obtain better results from my model.
findCorrelation(cor(trainMod[,8:59]), verbose = TRUE)
## Compare row 10 and column 1 with corr 0.992
## Means: 0.274 vs 0.171 so flagging column 10
## Compare row 1 and column 9 with corr 0.925
## Means: 0.255 vs 0.167 so flagging column 1
## Compare row 9 and column 4 with corr 0.928
## Means: 0.236 vs 0.164 so flagging column 9
## Compare row 8 and column 2 with corr 0.966
## Means: 0.251 vs 0.16 so flagging column 8
## Compare row 19 and column 18 with corr 0.919
## Means: 0.091 vs 0.16 so flagging column 18
## All correlations <= 0.9
## [1] 10 1 9 8 18
These columns are highly correlated with other columns in the training set, so I will remove them. Removing these columns increased the accuracy of my model by a modest but meaningful amount.
trainMod <- trainMod[,-c(9, 16, 17, 18, 26, 39, 41)]
validMod <- validMod[,-c(9, 16, 17, 18, 26, 39, 41)]
testMod <- testMod[,-c(9, 16, 17, 18, 26, 39, 41)]
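As a sanity check (an addition, not in the original code), running findCorrelation on the reduced predictor set should come back empty at the same cutoff:
# Expect integer(0): no remaining pairwise correlations above 0.9
findCorrelation(cor(trainMod[, 8:52]), cutoff = .9)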
This leaves me with 45 predictors to use for fitting, still perhaps a high number. Looking at what these predictors are measuring, it seems possible that a weighted combination of them may work better than fitting to each individual predictor on its own.
names(trainMod[,8:52])
## [1] "roll_belt" "yaw_belt" "total_accel_belt"
## [4] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z"
## [7] "accel_belt_x" "magnet_belt_y" "magnet_belt_z"
## [10] "roll_arm" "pitch_arm" "yaw_arm"
## [13] "total_accel_arm" "gyros_arm_x" "gyros_arm_z"
## [16] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [19] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [22] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [25] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_z"
## [28] "accel_dumbbell_y" "accel_dumbbell_z" "magnet_dumbbell_x"
## [31] "magnet_dumbbell_y" "magnet_dumbbell_z" "roll_forearm"
## [34] "pitch_forearm" "yaw_forearm" "total_accel_forearm"
## [37] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z"
## [40] "accel_forearm_x" "accel_forearm_y" "accel_forearm_z"
## [43] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
To this end, I will do a principal component analysis to find a new set of variables that explains 95% of the variance in the training data I am working with at this point in the analysis.
preProc <- preProcess(trainMod[,8:52], method = "pca", thresh = .95)
preProc
## Created from 9814 samples and 45 variables
##
## Pre-processing:
## - centered (45)
## - ignored (0)
## - principal component signal extraction (45)
## - scaled (45)
##
## PCA needed 25 components to capture 95 percent of the variance
So 25 components are needed to capture the specified threshold of information in the training set, fewer than the 45 columns I was working with before. A model will be less computationally expensive with this reduced number of predictors, so it may be worth using the PCA predictors instead of the original activity measurements. Let’s see what the first few principal components look like. In the plot below, the colored bars show the contribution of each original predictor to that principal component. Each principal component is uncorrelated with the others, and together the principal components contain almost all of the information in the data set.
melted <- melt(preProc$rotation[,1:9])
ggplot(data = melted) +
theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
xlab("Measurements from sensors") +
ylab("Relative importance in each principle component") +
ggtitle("Variables in Principal Component Analysis") +
geom_bar(aes(x=Var1, y=value, fill=Var1), stat="identity") +
facet_wrap(~Var2)
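The PCA predictors referenced in the model comparison below would be generated by applying preProc with predict(); the original post does not show this step, so this is a sketch:
# Apply the stored centering, scaling, and rotation to get the 25 PCA predictors
trainPC <- predict(preProc, trainMod[, 8:52])
validPC <- predict(preProc, validMod[, 8:52])
dim(trainPC)   # 9814 rows, 25 components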
To find the best model to predict the activity class from the activity measurements, I experimented with boosting models (method = "gbm") and random forest models (method = "rf") in the caret package, using either the PCA predictors or the original predictors from the activity measurements. After building each model, I checked the accuracy I achieved on my validation set. I found that I achieved the best results on the validation data set with a random forest model using the original, non-PCA predictors.
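For reference, one of the comparison fits mentioned above might look like this (a sketch; modelGBM is a name I introduce here, and the exact settings of the original experiments are not shown):
# Boosting comparison model on the original predictors, with validation-set accuracy
modelGBM <- train(x = trainMod[, 8:52], y = trainMod$classe, method = "gbm",
                  verbose = FALSE,
                  trControl = trainControl(method = "cv", number = 5))
confusionMatrix(validMod$classe, predict(modelGBM, validMod[, 8:52]))$overall["Accuracy"]
The winning model, the random forest on the original predictors, is fit below.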
set.seed(45656)
# Random forest on the original predictors, with 5-fold cross-validation
modelFit <- train(x = trainMod[,8:52], y = trainMod$classe, method = "rf",
                  trControl = trainControl(method = "cv", number = 5,
                                           allowParallel = TRUE))
Looking at the validation set, the best accuracy I achieved with a random forest model on the PCA predictors was about 96%, and the best accuracy with a boosting model on the original, non-PCA predictors was also about 96%. The accuracy of the boosting model using the PCA predictors was much lower (~80%), perhaps because the PCA rotation dilutes the individually informative measurements that the boosted trees split on. Here I show the detailed results for accuracy on the validation set for my best model, the random forest model with the original predictors.
confusionMatrix(validMod$classe, predict(modelFit, validMod[,8:52]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 2 0 0 0
## B 17 740 2 0 0
## C 0 10 673 1 0
## D 1 2 9 631 0
## E 0 1 1 3 716
##
## Overall Statistics
##
## Accuracy : 0.9875
## 95% CI : (0.9835, 0.9907)
## No Information Rate : 0.2886
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9842
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9841 0.9801 0.9825 0.9937 1.0000
## Specificity 0.9993 0.9940 0.9966 0.9964 0.9984
## Pos Pred Value 0.9982 0.9750 0.9839 0.9813 0.9931
## Neg Pred Value 0.9936 0.9953 0.9963 0.9988 1.0000
## Prevalence 0.2886 0.1925 0.1746 0.1619 0.1825
## Detection Rate 0.2840 0.1886 0.1716 0.1608 0.1825
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9917 0.9871 0.9895 0.9950 0.9992
I used cross-validation during training by specifying method = "cv" within the trainControl() function, and I found that 5-fold (number = 5) cross-validation gave good results for this random forest model. The accuracy is about 99% for this model, although it took about 1.5 times longer to train than the random forest with PCA predictors.
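The per-fold resampling results are stored on the fitted model and can be inspected directly (exact values vary with the seed):
# Cross-validation accuracy for each of the 5 folds
modelFit$resample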
To estimate my out-of-sample error, I look at the test set, data I have not used at all in my model training or selection. These data have been processed in the same manner as the training data (same columns removed for NA values, correlated values, etc., as detailed above), but they have not been used in choosing or training the model.
myMatrix <- confusionMatrix(testMod$classe, predict(modelFit, testMod[,8:52]))
myMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 1 0 0 1
## B 28 1106 5 0 0
## C 0 9 1012 5 0
## D 0 1 14 948 1
## E 0 2 0 4 1076
##
## Overall Statistics
##
## Accuracy : 0.9879
## 95% CI : (0.9848, 0.9906)
## No Information Rate : 0.2889
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9847
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9835 0.9884 0.9816 0.9906 0.9981
## Specificity 0.9995 0.9931 0.9971 0.9968 0.9988
## Pos Pred Value 0.9988 0.9710 0.9864 0.9834 0.9945
## Neg Pred Value 0.9934 0.9973 0.9961 0.9982 0.9996
## Prevalence 0.2889 0.1901 0.1752 0.1626 0.1832
## Detection Rate 0.2841 0.1879 0.1720 0.1611 0.1828
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9915 0.9907 0.9893 0.9937 0.9984
I estimate my out-of-sample error to be about 1.2% with a 95% confidence interval of 0.94% to 1.52%.
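That estimate comes straight from the test-set confusion matrix above (a short derivation added for clarity):
# Out-of-sample error estimate and its 95% confidence interval
1 - myMatrix$overall["Accuracy"]                          # ~0.0121
1 - myMatrix$overall[c("AccuracyUpper", "AccuracyLower")] # ~ (0.0094, 0.0152)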