The goal of the project is to predict the quality of exercise activities. This project uses the data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
This report describes the model, cross validation, and the expected out of sample error. Finally, the prediction model is used to predict 20 different test cases.
Load the training and testing data:
training = read.csv("pml-training.csv")
testing = read.csv("pml-testing.csv")
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Check how many columns having more than 80% missing values (NAs) in the testing data set:
sum(sapply(testing, function(e) {sum(is.na(e)) / nrow(testing) > 0.8}))
## [1] 100
Remove the columns with more than 80% missing values from both the training and testing data sets:
missingCols = sapply(testing, function(e) {sum(is.na(e)) / nrow(testing) > 0.8})
trainingNNa = subset(training, select = !missingCols)
testingNNa = subset(testing, select = !missingCols)
dim(trainingNNa)
## [1] 19622 60
dim(testingNNa)
## [1] 20 60
What is the objective variable and its type?
str(trainingNNa$class)
## Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(trainingNNa$class)
## A B C D E
## 5580 3797 3422 3216 3607
There are 5 different discrete values. By the nature of the objective variable, we will train a classification model for the prediction problem.
str(trainingNNa)
## 'data.frame': 19622 obs. of 60 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1: int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2: int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
Remove the first 7 variables that don’t provide predicting values.
trainingNNa = trainingNNa[, -c(1:7)]
testingNNa = testingNNa[, -c(1:7)]
dim(trainingNNa)
## [1] 19622 53
dim(testingNNa)
## [1] 20 53
Randomly pick up 5 variables and create a scatter-plot matrix:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
featurePlot(x = trainingNNa[, c(1, 10, 20, 35, 45)], y = trainingNNa$class, plot = "pairs", auto.key = list(columns = 5))
Separate the predictors and class labels.
trainingVars = trainingNNa[, c(1:52)]
objLabels = trainingNNa[, 53]
testingVars = testingNNa[, c(1:52)]
testingProdIds = testingNNa[, 53]
Center and scale both training and testing data
library(caret)
preproc = preProcess(trainingVars)
trainingVarsProc = predict(preproc, trainingVars)
testingVarsProc = predict(preproc, testingVars)
Are there any near-zero-variance variables?
nearZeroVar(trainingVarsProc)
## integer(0)
Are there highly correlated variables?
descrCor <- cor(trainingVarsProc)
highCorr <- sum(abs(descrCor[upper.tri(descrCor)]) > 0.75)
highCorr
## [1] 31
Remove highly correlated variables
highlyCorDescr <- findCorrelation(descrCor, cutoff = 0.75)
trainingVarsProcNCor = trainingVarsProc[, -highlyCorDescr]
testingVarsProcNCor = testingVarsProc[, -highlyCorDescr]
dim(trainingVarsProcNCor)
## [1] 19622 32
dim(testingVarsProcNCor)
## [1] 20 32
The pre-processed training data is split into two subsets: 75% for training prediction models and 25% as validation set for evaluating error rates and selecting models.
inTrain = createDataPartition(objLabels, p = 0.75, list = FALSE)
forTrainingVars = trainingVarsProcNCor[inTrain, ]
forTestingVars = trainingVarsProcNCor[-inTrain, ]
forTrainingLabels = objLabels[inTrain]
forTestingLabels = objLabels[-inTrain]
For this classification problem, I first train a decision tree and evaluate its error rate on the validation set. I will use the error rate of the decision tree as a baseline error rate. In the later part of the document, I will train other models and use cross validation to select the best one.
Train a decision tree using the caret package.
set.seed(2345)
modeldt = train(y = forTrainingLabels, x = forTrainingVars, method = "rpart")
Predict the labels for the validation set and estimate the error rate.
preddt = predict(modeldt, forTestingVars)
## Loading required package: rpart
confusionMatrix(forTestingLabels, preddt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 854 249 231 61 0
## B 141 585 186 31 6
## C 25 173 610 46 1
## D 33 243 191 230 107
## E 17 363 133 35 353
##
## Overall Statistics
##
## Accuracy : 0.5367
## 95% CI : (0.5226, 0.5507)
## No Information Rate : 0.3289
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4174
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7981 0.3627 0.4515 0.57072 0.75589
## Specificity 0.8589 0.8894 0.9310 0.87247 0.87649
## Pos Pred Value 0.6122 0.6164 0.7135 0.28607 0.39179
## Neg Pred Value 0.9384 0.7401 0.8170 0.95780 0.97152
## Prevalence 0.2182 0.3289 0.2755 0.08218 0.09523
## Detection Rate 0.1741 0.1193 0.1244 0.04690 0.07198
## Detection Prevalence 0.2845 0.1935 0.1743 0.16395 0.18373
## Balanced Accuracy 0.8285 0.6260 0.6913 0.72160 0.81619
The accuracy of the decision tree is 0.537. To improve the accuracy, I will train several different models in the following.
Train a naive Bayes model and estimate its accuracy. To speed up the training process, I only use 10% of the training data.
set.seed(2345)
library(klaR)
## Loading required package: MASS
library(MASS)
inTrain = createDataPartition(forTrainingLabels, p = .1, list = FALSE)
smallTrainingVars = forTrainingVars[inTrain, ]
smallTrainingLabels = forTrainingLabels[inTrain]
modelnb = train(y = smallTrainingLabels, x = smallTrainingVars, method = "nb", )
prednb = predict(modelnb, forTestingVars)
confusionMatrix(forTestingLabels, prednb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 901 88 257 147 2
## B 80 608 156 87 18
## C 65 64 648 72 6
## D 39 19 199 505 42
## E 41 109 96 75 580
##
## Overall Statistics
##
## Accuracy : 0.6611
## 95% CI : (0.6477, 0.6743)
## No Information Rate : 0.2765
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5751
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8002 0.6847 0.4779 0.5700 0.8951
## Specificity 0.8692 0.9151 0.9417 0.9256 0.9246
## Pos Pred Value 0.6459 0.6407 0.7579 0.6281 0.6437
## Neg Pred Value 0.9359 0.9292 0.8251 0.9071 0.9830
## Prevalence 0.2296 0.1811 0.2765 0.1807 0.1321
## Detection Rate 0.1837 0.1240 0.1321 0.1030 0.1183
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.8347 0.7999 0.7098 0.7478 0.9098
The accuracy of the naive Bayes model is 0.66 which is better than that (0.537) of the decision tree although only 10% of the training data is used.
Train a random forest model using the 10% of the training data as used in training the naive Bayes model.
modelrf = train(y = smallTrainingLabels, x = smallTrainingVars, method = "rf", )
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
predrf = predict(modelrf, forTestingVars)
confusionMatrix(forTestingLabels, predrf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1365 11 4 14 1
## B 47 831 45 23 3
## C 0 58 767 29 1
## D 1 2 45 755 1
## E 2 34 8 16 841
##
## Overall Statistics
##
## Accuracy : 0.9296
## 95% CI : (0.9221, 0.9367)
## No Information Rate : 0.2885
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.911
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9647 0.8878 0.8826 0.9020 0.9929
## Specificity 0.9914 0.9703 0.9782 0.9880 0.9852
## Pos Pred Value 0.9785 0.8757 0.8971 0.9391 0.9334
## Neg Pred Value 0.9858 0.9735 0.9748 0.9800 0.9985
## Prevalence 0.2885 0.1909 0.1772 0.1707 0.1727
## Detection Rate 0.2783 0.1695 0.1564 0.1540 0.1715
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9780 0.9290 0.9304 0.9450 0.9891
Using only the 10% of the training data, the accuracy of the random forest model is 0.929 which is much better than that (0.66) of the naive Bayes model. Let us train a random forest model using all the available training data.
modelrfall = train(y = forTrainingLabels, x = forTrainingVars, method = "rf", )
predrfall = predict(modelrfall, forTestingVars)
confusionMatrix(forTestingLabels, predrfall)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 2 944 3 0 0
## C 0 7 844 4 0
## D 0 0 6 798 0
## E 0 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9953
## 95% CI : (0.993, 0.997)
## No Information Rate : 0.2849
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9941
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9926 0.9894 0.9938 1.0000
## Specificity 1.0000 0.9987 0.9973 0.9985 0.9998
## Pos Pred Value 1.0000 0.9947 0.9871 0.9925 0.9989
## Neg Pred Value 0.9994 0.9982 0.9978 0.9988 1.0000
## Prevalence 0.2849 0.1939 0.1739 0.1637 0.1835
## Detection Rate 0.2845 0.1925 0.1721 0.1627 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9993 0.9957 0.9934 0.9962 0.9999
Great! The accuracy of the random forest model trained by using all the available training data is 0.99.
I am curious about whether we can select a smaller set of features/predictors to further improve the training process, e.g., better accuracy or faster training time.
First, create an integer vector for the specific subset sizes of the predictors that should be tested.
subsets <- c(1:5, 10, 15, 20, 25)
Use recursive feature elimination via caret to find the important features. We train a series random forest models and select features through repeated cross validations on different sizes of feature sets.
set.seed(2345)
library(Hmisc)
## Loading required package: grid
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
ctrl <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
rfProfile <- rfe(smallTrainingVars, smallTrainingLabels,
sizes = subsets,
rfeControl = ctrl)
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
rfProfile
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.5007 0.3634 0.08638 0.10988
## 2 0.6920 0.6091 0.05889 0.07496
## 3 0.7794 0.7203 0.04397 0.05578
## 4 0.8158 0.7667 0.03241 0.04093
## 5 0.8384 0.7954 0.03091 0.03921
## 10 0.8914 0.8625 0.02705 0.03435
## 15 0.9138 0.8909 0.02042 0.02589
## 20 0.9166 0.8943 0.02326 0.02949 *
## 25 0.9159 0.8935 0.02173 0.02752
## 32 0.9145 0.8917 0.01966 0.02490
##
## The top 5 variables (out of 20):
## yaw_belt, magnet_dumbbell_z, magnet_belt_z, pitch_forearm, roll_dumbbell
Extract the data.
featureEliTrainingVars = forTrainingVars[, c("magnet_belt_z", "magnet_dumbbell_z", "yaw_belt", "roll_dumbbell", "pitch_forearm")]
featureEliTestingVars = forTestingVars[, c("magnet_belt_z", "magnet_dumbbell_z", "yaw_belt", "roll_dumbbell", "pitch_forearm")]
featureEliFinalTestingVars = testingVarsProcNCor[, c("magnet_belt_z", "magnet_dumbbell_z", "yaw_belt", "roll_dumbbell", "pitch_forearm")]
dim(featureEliTrainingVars)
## [1] 14718 5
dim(featureEliTestingVars)
## [1] 4904 5
dim(featureEliFinalTestingVars)
## [1] 20 5
Train a random forest model using only 5 variables.
modelrf5vars = train(y = forTrainingLabels, x = featureEliTrainingVars, method = "rf", )
predrf5vars = predict(modelrf5vars, featureEliTestingVars)
confusionMatrix(forTestingLabels, predrf5vars)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1344 23 12 15 1
## B 28 865 35 20 1
## C 8 24 789 34 0
## D 10 8 25 759 2
## E 0 11 6 5 879
##
## Overall Statistics
##
## Accuracy : 0.9454
## 95% CI : (0.9386, 0.9515)
## No Information Rate : 0.2834
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9309
## Mcnemar's Test P-Value : 0.002189
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9669 0.9291 0.9100 0.9112 0.9955
## Specificity 0.9855 0.9789 0.9837 0.9889 0.9945
## Pos Pred Value 0.9634 0.9115 0.9228 0.9440 0.9756
## Neg Pred Value 0.9869 0.9833 0.9807 0.9820 0.9990
## Prevalence 0.2834 0.1898 0.1768 0.1699 0.1801
## Detection Rate 0.2741 0.1764 0.1609 0.1548 0.1792
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9762 0.9540 0.9468 0.9501 0.9950
The accuracy of the model trained by only 5 variables is 0.95.
Apply the random forest models trained by the full set of variables and the reduced set of variables to the test set with 20 observations. Check their prediction agreement.
resultsrf5vars = predict(modelrf5vars, featureEliFinalTestingVars)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
resultsrfall = predict(modelrfall, testingVarsProcNCor)
confusionMatrix(resultsrf5vars, resultsrfall)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 6 0 0 0 0
## B 0 8 0 0 0
## C 1 0 1 0 1
## D 0 0 0 1 0
## E 0 0 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.683, 0.9877)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 5.041e-06
##
## Kappa : 0.8592
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8571 1.0 1.0000 1.00 0.6667
## Specificity 1.0000 1.0 0.8947 1.00 1.0000
## Pos Pred Value 1.0000 1.0 0.3333 1.00 1.0000
## Neg Pred Value 0.9286 1.0 1.0000 1.00 0.9444
## Prevalence 0.3500 0.4 0.0500 0.05 0.1500
## Detection Rate 0.3000 0.4 0.0500 0.05 0.1000
## Detection Prevalence 0.3000 0.4 0.1500 0.05 0.1000
## Balanced Accuracy 0.9286 1.0 0.9474 1.00 0.8333
The two models agree about 90% of the test cases.