Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which each participant completed the exercise. Each participant was asked to perform barbell lifts correctly (Class A) and incorrectly in 4 different ways (Classes B-E).
More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The data for this project come from that source: the training data (pml-training.csv) and the test data (pml-testing.csv) are read in below.
My goal was to create a report describing how I built the prediction models, how I used cross-validation, and what I expected the out-of-sample error to be, while justifying each choice I made along the way. Analyses were completed in R, and the report was generated using R Markdown and knitr.
Using the validation set approach, I split the training data into two sub-samples: a training set (70%) and a testing set (30%). I then used the provided test data (pml-testing.csv) as the validation data for my final model. I also removed variables that contained missing values, variables related to datestamps and timestamps, and metadata variables that would not be used for predicting class assignment. After that, I removed any variables with near-zero variance.
library(caret)    #createDataPartition, nearZeroVar, findCorrelation, train, confusionMatrix
library(dplyr)    #select
library(corrplot) #visualizing the correlation matrix
pml = read.csv("./data/pml-training.csv", header=T)
#Split the data into training, testing, and validation sets. We will use these sub-samples to evaluate the out-of-sample errors later.
intrain = createDataPartition(y=pml$classe, p=0.7, list=FALSE) #Assign 70% of the data to training and 30% to testing
training = pml[intrain,]
testing = pml[-intrain,] #Include all rows not selected by the "intrain" index
validation = read.csv("./data/pml-testing.csv", header=T)
#Remove the variables that contain missing values:
training = training[, colSums(is.na(training))==0]
testing = testing[, colSums(is.na(testing))==0]
validation = validation[, colSums(is.na(validation))==0]
#The first seven columns contain row numbers, user names, datestamps/timestamps, and window metadata.
#We don't need these (including the new_window and num_window columns) to predict classe.
training = training %>% select(-(1:7))
testing = testing %>% select(-(1:7))
validation = validation %>% select(-(1:7))
#Also remove the variables with near-zero variance (i.e., variables that are almost constant and carry little information)
#Only do this with the training and test data, NOT the validation data!
nzv = nearZeroVar(training) #Returns the column indices of near-zero-variance predictors
training = training[, -nzv]
testing = testing[, -nzv]
Next, I investigated the correlations between the remaining variables, excluding the classe variable. I also plotted the correlation matrix with colors to visually highlight variables that were highly correlated (r > 0.75), and I subsequently removed these highly correlated variables from the training set. Thanks to http://rismyhammer.com/ml/Pre-Processing.html for the incredibly helpful documentation on correlation matrices, visualization, and removal of highly correlated variables.
#Investigating correlated predictors
corrtable = cor(training[, -53]) #Correlation matrix of the 52 predictors (column 53 is classe)
summary(corrtable[upper.tri(corrtable)]) #Summary of the pairwise correlations (min, quartiles, mean, max)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992077 -0.112831 0.002472 0.002050 0.091479 0.981343
#Min = -0.99 and max = 0.98!
corrplot(corrtable, order="FPC", method="color", type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
#Given a correlation matrix, findCorrelation() flags predictors for removal:
highcorrvars = findCorrelation(corrtable, cutoff = 0.75)
names(training)[highcorrvars]
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "total_accel_belt" "accel_dumbbell_z" "accel_belt_x"
## [7] "pitch_belt" "magnet_dumbbell_x" "accel_dumbbell_y"
## [10] "magnet_dumbbell_y" "accel_dumbbell_x" "accel_arm_x"
## [13] "accel_arm_z" "magnet_arm_y" "magnet_belt_z"
## [16] "accel_forearm_y" "gyros_forearm_y" "gyros_dumbbell_x"
## [19] "gyros_dumbbell_z" "gyros_arm_x"
#Remove the highly correlated variables from the training set only:
training = training[, -highcorrvars]
I trained three separate models on the training subset: 1.) a decision tree, 2.) a random forest, and 3.) a generalized boosted model (GBM). Then, I evaluated each model on the testing subset. Each model centers and scales (i.e., standardizes) the predictors before fitting. The confusion matrix of each model is presented below, along with a corresponding plot that displays the accuracy on the training subset versus the testing subset.
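The decision tree code is not echoed in the rendered output, so the following is a minimal sketch of how it was fit with caret's rpart method. The model object name ctmodel is illustrative; ctpredict is the prediction object reused later for stacking.
#Sketch: fit a decision tree (rpart) on the training subset, standardizing the predictors
ctmodel = train(classe ~ ., method = "rpart",
                preProcess = c("center", "scale"),
                data = training)
ctpredict = predict(ctmodel, newdata = testing) #Predict class on the testing subset
confusionMatrix(ctpredict, testing$classe)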
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1027 206 28 51 27
## B 330 683 210 336 489
## C 243 215 741 232 181
## D 74 34 46 254 37
## E 0 1 1 91 348
##
## Overall Statistics
##
## Accuracy : 0.5188
## 95% CI : (0.5059, 0.5316)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3939
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6135 0.5996 0.7222 0.26349 0.32163
## Specificity 0.9259 0.7124 0.8207 0.96119 0.98064
## Pos Pred Value 0.7670 0.3335 0.4597 0.57079 0.78912
## Neg Pred Value 0.8577 0.8812 0.9333 0.86949 0.86517
## Prevalence 0.2845 0.1935 0.1743 0.16381 0.18386
## Detection Rate 0.1745 0.1161 0.1259 0.04316 0.05913
## Detection Prevalence 0.2275 0.3480 0.2739 0.07562 0.07494
## Balanced Accuracy 0.7697 0.6560 0.7715 0.61234 0.65113
For this model, I used parallel processing to reduce processing time (see the code in "trainControl()"). Thanks to the following site for guidance on the parallel processing code: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md. To obtain a more reliable accuracy estimate, I used 5-fold cross-validation (5 resampling iterations).
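The random forest fit itself is not echoed in the output; a minimal sketch follows, based on the parallel-processing pattern from the guide linked above. The names cluster and fitcontrol are illustrative; rfmodel and rfpredict are the object names used later in the report.
#Sketch: random forest with 5-fold cross-validation and parallel processing
library(parallel)
library(doParallel)
cluster = makeCluster(detectCores() - 1) #Leave one core for the operating system
registerDoParallel(cluster)
fitcontrol = trainControl(method = "cv", number = 5, allowParallel = TRUE)
rfmodel = train(classe ~ ., method = "rf",
                preProcess = c("center", "scale"),
                trControl = fitcontrol,
                data = training)
stopCluster(cluster) #Shut down the cluster and return to sequential processing
registerDoSEQ()
rfpredict = predict(rfmodel, newdata = testing)
confusionMatrix(rfpredict, testing$classe)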
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 7 0 0 0
## B 2 1128 6 0 0
## C 1 4 1015 10 0
## D 0 0 5 953 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9903 0.9893 0.9886 0.9991
## Specificity 0.9983 0.9983 0.9969 0.9988 0.9998
## Pos Pred Value 0.9958 0.9930 0.9854 0.9937 0.9991
## Neg Pred Value 0.9993 0.9977 0.9977 0.9978 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1917 0.1725 0.1619 0.1837
## Detection Prevalence 0.2851 0.1930 0.1750 0.1630 0.1839
## Balanced Accuracy 0.9983 0.9943 0.9931 0.9937 0.9994
For this model, I again used parallel processing to reduce processing time (see the code in "trainControl()"). Thanks to the following site for guidance on the parallel processing code: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md. To obtain a more reliable accuracy estimate, I used repeated 5-fold cross-validation with one repeat.
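As with the random forest, the boosted model fit is not echoed in the output; a minimal sketch is below. The names cluster, gbmcontrol, and boostmodel are illustrative; boostpredict is the prediction object used later when stacking.
#Sketch: generalized boosted model (gbm) with repeated 5-fold cross-validation (1 repeat) and parallel processing
library(parallel)
library(doParallel)
cluster = makeCluster(detectCores() - 1)
registerDoParallel(cluster)
gbmcontrol = trainControl(method = "repeatedcv", number = 5, repeats = 1, allowParallel = TRUE)
boostmodel = train(classe ~ ., method = "gbm",
                   preProcess = c("center", "scale"),
                   trControl = gbmcontrol,
                   data = training)
stopCluster(cluster)
registerDoSEQ()
boostpredict = predict(boostmodel, newdata = testing)
confusionMatrix(boostpredict, testing$classe)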
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1956
## 2 1.4854 nan 0.1000 0.1452
## 3 1.3955 nan 0.1000 0.1019
## 4 1.3296 nan 0.1000 0.0920
## 5 1.2734 nan 0.1000 0.0775
## 6 1.2252 nan 0.1000 0.0664
## 7 1.1841 nan 0.1000 0.0521
## 8 1.1509 nan 0.1000 0.0584
## 9 1.1154 nan 0.1000 0.0506
## 10 1.0832 nan 0.1000 0.0445
## 20 0.8381 nan 0.1000 0.0226
## 40 0.6020 nan 0.1000 0.0134
## 60 0.4732 nan 0.1000 0.0067
## 80 0.3881 nan 0.1000 0.0036
## 100 0.3266 nan 0.1000 0.0035
## 120 0.2798 nan 0.1000 0.0024
## 140 0.2424 nan 0.1000 0.0024
## 150 0.2255 nan 0.1000 0.0015
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1635 44 0 1 2
## B 22 1049 60 4 17
## C 9 40 946 38 14
## D 5 1 19 914 13
## E 3 5 1 7 1036
##
## Overall Statistics
##
## Accuracy : 0.9482
## 95% CI : (0.9422, 0.9537)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9344
##
## Mcnemar's Test P-Value : 0.0000001787
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9767 0.9210 0.9220 0.9481 0.9575
## Specificity 0.9888 0.9783 0.9792 0.9923 0.9967
## Pos Pred Value 0.9721 0.9106 0.9035 0.9601 0.9848
## Neg Pred Value 0.9907 0.9810 0.9835 0.9899 0.9905
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2778 0.1782 0.1607 0.1553 0.1760
## Detection Prevalence 0.2858 0.1958 0.1779 0.1618 0.1788
## Balanced Accuracy 0.9828 0.9496 0.9506 0.9702 0.9771
After creating three different models, I combined classifiers by “stacking” the predictions together. To do this, I built a new dataframe that consists of predictions from each model and pulled the “classe” variable from the testing dataset. Then, I fit a new model that relates the classe variable to the new prediction variables using the random forest method.
#1.
#Build a new dataset that consists of predictions from each model
#and create a classe variable that is pulled from the classe variable in the testing dataset
preddf = data.frame(ctpredict, rfpredict, boostpredict, classe=testing$classe)
#2.
#Fit a new model that relates the classe variable to the three prediction variables
#Instead of fitting a model based on the predictors, we are fitting a model based on the predictions!
combmodfit = train(classe ~ ., method = "rf",
preProcess = c("center", "scale"),
data=preddf)
combpred = predict(combmodfit, preddf)
#Confusion Matrix:
confusionMatrix(combpred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 7 0 0 0
## B 2 1128 6 0 0
## C 1 4 1015 10 0
## D 0 0 5 953 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9903 0.9893 0.9886 0.9991
## Specificity 0.9983 0.9983 0.9969 0.9988 0.9998
## Pos Pred Value 0.9958 0.9930 0.9854 0.9937 0.9991
## Neg Pred Value 0.9993 0.9977 0.9977 0.9978 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1917 0.1725 0.1619 0.1837
## Detection Prevalence 0.2851 0.1930 0.1750 0.1630 0.1839
## Balanced Accuracy 0.9983 0.9943 0.9931 0.9937 0.9994
comb_accuracy = confusionMatrix(combpred, testing$classe)$overall["Accuracy"]
Finally, I evaluated the accuracy of each model using the test sub-sample and selected the model with the highest accuracy.
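The three individual accuracies referenced below are extracted the same way as comb_accuracy; a brief sketch, assuming the prediction objects ctpredict, rfpredict, and boostpredict from the models above:
#Extract the overall accuracy of each individual model on the testing subset
ct_accuracy = confusionMatrix(ctpredict, testing$classe)$overall["Accuracy"]
rf_accuracy = confusionMatrix(rfpredict, testing$classe)$overall["Accuracy"]
boost_accuracy = confusionMatrix(boostpredict, testing$classe)$overall["Accuracy"]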
#What is the resulting accuracy on the test set?
#Is it better or worse than each of the individual predictions?
ct_accuracy = round(ct_accuracy, 4)
rf_accuracy = round(rf_accuracy, 4)
boost_accuracy = round(boost_accuracy, 4)
comb_accuracy = round(comb_accuracy, 4)
The random forest model (0.9937) and the stacked model (0.9937) achieved identical accuracy on the testing subset, so I selected the simpler random forest model to apply to the validation data. Given this accuracy, I expect the out-of-sample error rate to be under 1% (approximately 0.6%).
#Step 6. Apply the best model to the validation data
results = predict(rfmodel, newdata = validation)
#Results are suppressed and will be submitted to the course project quiz portion of the assignment.