In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise.
The original data were preprocessed with the dplyr package to remove columns that are mostly NA or empty, and remaining missing values were imputed (knnImpute) when the caret models were trained. Then 6 models (lda2, rf, gbm, svm, knn and kknn) were fitted; all except svm were fitted with cross-validation using the caret library. Comparison of the models on the validation subset showed that the random forest model has the highest accuracy. The selected model was then applied to the test data set.
The following code downloads the csv files with training and testing data to the current working directory and then reads them in.
trainUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainUrl,destfile=paste0(getwd(),"/pml-training.csv"), method = "curl")
download.file(testUrl,destfile=paste0(getwd(),"/pml-testing.csv"), method = "curl")
pmlTrain<-read.csv("pml-training.csv", stringsAsFactors = FALSE, na.strings=c("NA", "", "#DIV/0!"))
pmlTest<-read.csv("pml-testing.csv", stringsAsFactors = FALSE, na.strings=c("NA", "", "#DIV/0!"))
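To see what the na.strings argument does in the calls above, here is a minimal sketch on a small inline CSV. The column names and values are made up for illustration and are not taken from the data set.

```r
# Toy CSV text (hypothetical values) containing the three missing-value markers
csv_text <- "user,kurtosis,roll\na,NA,1.5\nb,,2.0\nc,#DIV/0!,3.5\n"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("NA", "", "#DIV/0!"))

# Every marker listed in na.strings is read as NA
print(toy)
print(colSums(is.na(toy)))
```

Without "#DIV/0!" in na.strings, a column containing that marker would be read as character instead of numeric.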
The training data set contains 19622 observations of 160 variables. Many columns in the training data set consist mostly of NAs or empty strings, so I’ve removed them from both the training and testing data sets. I’ve also excluded variables 1:7, which are not relevant for predicting the way the exercise was done.
library(dplyr)
# Removing columns with NA values
csum<-colSums(is.na(pmlTrain))
nanames<-names(csum[csum>19000])
# all_of() makes the selection by an external character vector explicit
pmlTrain2<-select(pmlTrain, -all_of(nanames))
pmlTesting<-select(pmlTest, -all_of(nanames))
# Removing irrelevant columns
pmlTrain2<-pmlTrain2[,-(1:7)]
pmlTesting<-pmlTesting[,-(1:7)]
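The column filter above drops every column with more than 19000 missing values out of 19622 rows, which is roughly 97%. The same idea can be sketched in base R on a toy data frame. The data and the 50% threshold below are assumptions chosen only for illustration.

```r
# Hypothetical toy data frame: column b is mostly NA, a and c are complete
toy <- data.frame(a = 1:10,
                  b = c(1, rep(NA_real_, 9)),
                  c = letters[1:10],
                  stringsAsFactors = FALSE)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep columns with at most 50% missing values (illustrative threshold)
keep <- names(na_share)[na_share <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]

print(na_share)
print(names(toy_clean))
```

Working with a share instead of an absolute count keeps the rule valid if the number of rows changes.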
The next step is to check the remaining columns for zero or near-zero variance and remove any such columns.
library(caret)
nzval<-nearZeroVar(pmlTrain2, saveMetrics = TRUE)
nzval
As the output below shows, zeroVar and nzv are FALSE for every column, so there is nothing to remove.
freqRatio percentUnique zeroVar nzv
roll_belt 1.101904 6.7781062 FALSE FALSE
pitch_belt 1.036082 9.3772296 FALSE FALSE
yaw_belt 1.058480 9.9734991 FALSE FALSE
total_accel_belt 1.063160 0.1477933 FALSE FALSE
gyros_belt_x 1.058651 0.7134849 FALSE FALSE
gyros_belt_y 1.144000 0.3516461 FALSE FALSE
gyros_belt_z 1.066214 0.8612782 FALSE FALSE
accel_belt_x 1.055412 0.8357966 FALSE FALSE
accel_belt_y 1.113725 0.7287738 FALSE FALSE
accel_belt_z 1.078767 1.5237998 FALSE FALSE
magnet_belt_x 1.090141 1.6664968 FALSE FALSE
magnet_belt_y 1.099688 1.5187035 FALSE FALSE
magnet_belt_z 1.006369 2.3290184 FALSE FALSE
roll_arm 52.338462 13.5256345 FALSE FALSE
pitch_arm 87.256410 15.7323412 FALSE FALSE
yaw_arm 33.029126 14.6570176 FALSE FALSE
total_accel_arm 1.024526 0.3363572 FALSE FALSE
gyros_arm_x 1.015504 3.2769341 FALSE FALSE
gyros_arm_y 1.454369 1.9162165 FALSE FALSE
gyros_arm_z 1.110687 1.2638875 FALSE FALSE
accel_arm_x 1.017341 3.9598410 FALSE FALSE
accel_arm_y 1.140187 2.7367241 FALSE FALSE
accel_arm_z 1.128000 4.0362858 FALSE FALSE
magnet_arm_x 1.000000 6.8239731 FALSE FALSE
magnet_arm_y 1.056818 4.4439914 FALSE FALSE
magnet_arm_z 1.036364 6.4468454 FALSE FALSE
roll_dumbbell 1.022388 84.2065029 FALSE FALSE
pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
roll_forearm 11.589286 11.0895933 FALSE FALSE
pitch_forearm 65.983051 14.8557741 FALSE FALSE
yaw_forearm 15.322835 10.1467740 FALSE FALSE
total_accel_forearm 1.128928 0.3567424 FALSE FALSE
gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
accel_forearm_x 1.126437 4.0464784 FALSE FALSE
accel_forearm_y 1.059406 5.1116094 FALSE FALSE
accel_forearm_z 1.006250 2.9558659 FALSE FALSE
magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
classe 1.469581 0.0254816 FALSE FALSE
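To make the freqRatio and percentUnique columns easier to read, here is a base-R sketch of how the two metrics can be computed for a single vector. It is not caret's code. The function name nzv_metrics is hypothetical, and the cut-offs (95/5 and 10) are what I understand to be the nearZeroVar defaults.

```r
# Illustrative re-implementation of the two metrics for one vector
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) == 1
  # Count of the most common value divided by the count of the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations
  percent_unique <- 100 * length(counts) / length(x)
  list(freqRatio = freq_ratio,
       percentUnique = percent_unique,
       zeroVar = zero_var,
       nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut))
}

set.seed(1)
almost_constant <- c(rep(0, 98), 1, 2)   # dominated by a single value
spread_out      <- rnorm(100)            # all values distinct

str(nzv_metrics(almost_constant))
str(nzv_metrics(spread_out))
```

A column is flagged only when both conditions hold, which is why roll_arm or pitch_arm in the table above are not flagged: their freqRatio is high, but their percentUnique is above 10.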
After this procedure the number of variables decreased to 53 (52 predictors plus classe). Since the training set contains a lot of observations, we can split it into two subsets, training (75%) and validation (25%), in order to find out which model performs best before applying it to the test set.
pmlTrain2$classe<-as.factor(pmlTrain2$classe)
library(caret)
set.seed(123456)
inTrain<-createDataPartition(y=pmlTrain2$classe, p=0.75, list = FALSE)
pmlValidation<-pmlTrain2[-inTrain,]
pmlTraining<-pmlTrain2[inTrain,]
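createDataPartition samples within each level of classe, so both subsets keep roughly the same class proportions. A minimal base-R sketch of such a stratified split is shown here. The labels are made up, and this is an illustration of the idea, not caret's implementation.

```r
set.seed(123456)
# Hypothetical labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 12)))

# Draw 75% of the row indices within each class separately
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

# Class proportions are preserved in both subsets
print(prop.table(table(classe[in_train])))
print(prop.table(table(classe[-in_train])))
```

A plain random split would give similar proportions on average, but stratifying guarantees that no class is under-represented in the validation subset.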
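Many models use random numbers while their parameters are estimated, so to make the results reproducible and to ensure that the same resamples are used between calls to train, we’ll call set.seed before every call to the train function. We will fit 6 models: lda2, rf, gbm, svm, knn and kknn.
For every model fitted with train, 3-fold cross-validation is used by passing trControl=trainControl(method="cv", number=3); the SVM is fitted directly with svm from the e1071 package, without resampling. Then we’ll predict the classe variable for the validation data set and build confusion matrices. Note that confusionMatrix is called below with the true labels as the first argument, so the rows labelled Prediction hold the actual classes and the Reference columns hold the model’s predictions; the overall accuracy is unaffected by this ordering, but the per-class statistics have their roles swapped (for example, the Sensitivity row reports what would normally be Pos Pred Value).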
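As a rough picture of what 3-fold cross-validation does, here is a base-R sketch on the built-in iris data. The nearest-centroid classifier and the helper names fit_centroids and predict_centroids are stand-ins chosen to keep the example short; caret handles the resampling internally for the models in this report.

```r
set.seed(123456)
data(iris)
k <- 3
# Randomly assign every row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in classifier (an assumption for this sketch): nearest class centroid
fit_centroids <- function(df) {
  aggregate(. ~ Species, data = df, FUN = mean)
}
predict_centroids <- function(centroids, df) {
  x <- as.matrix(df[, names(df) != "Species"])
  m <- as.matrix(centroids[, names(centroids) != "Species"])
  # Squared Euclidean distance from every row to every centroid
  d <- sapply(seq_len(nrow(m)), function(j) rowSums(sweep(x, 2, m[j, ])^2))
  as.character(centroids$Species)[max.col(-d, ties.method = "first")]
}

# Hold out each fold in turn, train on the other folds, score on the held-out rows
fold_acc <- sapply(seq_len(k), function(i) {
  model <- fit_centroids(iris[fold != i, ])
  pred  <- predict_centroids(model, iris[fold == i, ])
  mean(pred == as.character(iris$Species[fold == i]))
})

print(fold_acc)
print(mean(fold_acc))   # cross-validated accuracy estimate
```

The averaged value plays the same role as the Accuracy column that train prints for each tuning parameter.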
set.seed(123456)
fitlda2<-train(classe~., data=pmlTraining, method="lda2", preProcess="knnImpute",
trControl=trainControl(method="cv", number=3))
fitlda2
Linear Discriminant Analysis
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: nearest neighbor imputation (52), centered (52),
scaled (52)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 9812, 9811, 9813
Resampling results across tuning parameters:
dimen Accuracy Kappa
1 0.4697651 0.3166359
2 0.5950536 0.4862335
3 0.6449928 0.5497851
4 0.7008431 0.6213326
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was dimen = 4.
vPredict<-predict(fitlda2, pmlValidation)
cmlda2<-confusionMatrix(pmlValidation$classe,vPredict); cmlda2
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1153 34 108 97 3
B 152 604 113 35 45
C 76 74 560 114 31
D 46 37 100 602 19
E 43 157 80 108 513
Overall Statistics
Accuracy : 0.6998
95% CI : (0.6868, 0.7126)
No Information Rate : 0.2998
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.62
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.7844 0.6667 0.5827 0.6297 0.8396
Specificity 0.9295 0.9137 0.9252 0.9488 0.9096
Pos Pred Value 0.8265 0.6365 0.6550 0.7488 0.5694
Neg Pred Value 0.9097 0.9236 0.9010 0.9137 0.9755
Prevalence 0.2998 0.1847 0.1960 0.1949 0.1246
Detection Rate 0.2351 0.1232 0.1142 0.1228 0.1046
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.8569 0.7902 0.7540 0.7893 0.8746
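The Accuracy and Kappa values in the lda2 output can be recomputed by hand from the confusion table. The sketch below copies the counts from the table above and uses the standard formulas for observed agreement and Cohen's kappa; the variable names are my own.

```r
# Counts copied from the lda2 confusion table, in the order printed
cm <- matrix(c(1153,  34, 108,  97,   3,
                152, 604, 113,  35,  45,
                 76,  74, 560, 114,  31,
                 46,  37, 100, 602,  19,
                 43, 157,  80, 108, 513),
             nrow = 5, byrow = TRUE,
             dimnames = list(row = LETTERS[1:5], col = LETTERS[1:5]))

n <- sum(cm)
accuracy  <- sum(diag(cm)) / n                         # observed agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2      # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)    # Cohen's kappa

print(round(c(Accuracy = accuracy, Kappa = kappa_hat), 4))
```

Both values agree with the printed statistics (0.6998 and 0.62), and neither depends on which margin holds the predictions.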
set.seed(123456)
fitrf<-train(classe~., data=pmlTraining, method="rf", preProcess="knnImpute",
trControl=trainControl(method="cv", number=3))
fitrf
Random Forest
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: nearest neighbor imputation (52), centered (52),
scaled (52)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 9812, 9811, 9813
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9883135 0.9852154
27 0.9896042 0.9868490
52 0.9803634 0.9751556
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.
vPredict<-predict(fitrf, pmlValidation)
cmrf<-confusionMatrix(pmlValidation$classe,vPredict); cmrf
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1395 0 0 0 0
B 4 943 2 0 0
C 0 3 849 3 0
D 0 0 7 797 0
E 0 1 4 3 893
Overall Statistics
Accuracy : 0.9945
95% CI : (0.992, 0.9964)
No Information Rate : 0.2853
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.993
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9971 0.9958 0.9849 0.9925 1.0000
Specificity 1.0000 0.9985 0.9985 0.9983 0.9980
Pos Pred Value 1.0000 0.9937 0.9930 0.9913 0.9911
Neg Pred Value 0.9989 0.9990 0.9968 0.9985 1.0000
Prevalence 0.2853 0.1931 0.1758 0.1637 0.1821
Detection Rate 0.2845 0.1923 0.1731 0.1625 0.1821
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.9986 0.9971 0.9917 0.9954 0.9990
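The 95% CI reported for the random forest accuracy is consistent with an exact binomial interval on the number of correct predictions. Assuming that is how it is obtained (my understanding, not checked against the caret source here), it can be reproduced in base R from the diagonal of the table above.

```r
# Counts taken from the random forest confusion table
correct <- 1395 + 943 + 849 + 797 + 893   # diagonal entries
total   <- 4904                            # rows in the validation subset

accuracy <- correct / total
# Exact (Clopper-Pearson) 95% interval for a binomial proportion
ci <- as.numeric(binom.test(correct, total)$conf.int)

print(round(accuracy, 4))
print(round(ci, 4))
```

With only 27 misclassified rows out of 4904, the interval is narrow, which is what makes the comparison with the other models meaningful.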
set.seed(123456)
fitgbm<-train(classe~., data=pmlTraining, method="gbm", preProcess="knnImpute",
trControl=trainControl(method="cv", number=3), verbose=FALSE)
fitgbm
Stochastic Gradient Boosting
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: nearest neighbor imputation (52), centered (52),
scaled (52)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 9812, 9811, 9813
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa
1 50 0.7506455 0.6839001
1 100 0.8204911 0.7727524
1 150 0.8526968 0.8135909
2 50 0.8576571 0.8196083
2 100 0.9087509 0.8844991
2 150 0.9334823 0.9158221
3 50 0.8938040 0.8655349
3 100 0.9427908 0.9275977
3 150 0.9590973 0.9482429
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150,
interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
vPredict<-predict(fitgbm, pmlValidation)
cmgbm<-confusionMatrix(pmlValidation$classe,vPredict); cmgbm
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1366 15 9 4 1
B 26 897 25 0 1
C 0 26 818 8 3
D 1 2 30 767 4
E 1 4 19 15 862
Overall Statistics
Accuracy : 0.9604
95% CI : (0.9546, 0.9657)
No Information Rate : 0.2843
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.95
Mcnemar's Test P-Value : 5.442e-07
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9799 0.9502 0.9079 0.9660 0.9897
Specificity 0.9917 0.9869 0.9908 0.9910 0.9903
Pos Pred Value 0.9792 0.9452 0.9567 0.9540 0.9567
Neg Pred Value 0.9920 0.9881 0.9795 0.9934 0.9978
Prevalence 0.2843 0.1925 0.1837 0.1619 0.1776
Detection Rate 0.2785 0.1829 0.1668 0.1564 0.1758
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.9858 0.9685 0.9493 0.9785 0.9900
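SVM is one of the most widely used and robust classifiers. It handles linear decision boundaries efficiently and, through kernel functions, can also separate classes that are not linearly separable. Here it is fitted with svm from the e1071 package using the default radial kernel. As the results below show, it is a little less accurate than rf and gbm on the validation set (0.9419 versus 0.9945 and 0.9604), but much more accurate than lda2 (0.6998).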
library(e1071)
set.seed(123456)
fitsvm<-svm(classe~., data=pmlTraining)
fitsvm
Call:
svm(formula = classe ~ ., data = pmlTraining)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.01923077
Number of Support Vectors: 6760
vPredict<-predict(fitsvm, pmlValidation)
cmsvm<-confusionMatrix(pmlValidation$classe,vPredict); cmsvm
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1382 4 9 0 0
B 59 862 27 0 1
C 2 31 812 9 1
D 2 0 82 719 1
E 1 7 28 21 844
Overall Statistics
Accuracy : 0.9419
95% CI : (0.935, 0.9483)
No Information Rate : 0.2949
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9264
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9557 0.9535 0.8476 0.9599 0.9965
Specificity 0.9962 0.9782 0.9891 0.9795 0.9860
Pos Pred Value 0.9907 0.9083 0.9497 0.8943 0.9367
Neg Pred Value 0.9818 0.9894 0.9639 0.9927 0.9993
Prevalence 0.2949 0.1843 0.1954 0.1527 0.1727
Detection Rate 0.2818 0.1758 0.1656 0.1466 0.1721
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.9760 0.9659 0.9184 0.9697 0.9912
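The gamma printed by svm above equals 1/52, one over the number of predictors, which is the e1071 default for the radial kernel. A minimal sketch of the radial kernel itself, with made-up observations, shows how gamma controls the similarity between two points. The function name rbf_kernel is hypothetical.

```r
# Radial basis function kernel: K(x, y) = exp(-gamma * ||x - y||^2)
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

p     <- 52        # number of predictors in pmlTraining
gamma <- 1 / p     # matches the gamma printed by svm()
print(gamma)

# Two hypothetical standardized observations (made-up values)
x <- rep(0, p)
y <- c(1, -1, rep(0, p - 2))

print(rbf_kernel(x, x, gamma))   # identical points give 1
print(rbf_kernel(x, y, gamma))   # equals exp(-2 / 52)
```

A larger gamma makes the kernel value fall off faster with distance, which yields a more flexible decision boundary.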
set.seed(123456)
fitknn<-train(classe~., data=pmlTraining, method="knn", preProcess="knnImpute",
trControl=trainControl(method="cv", number=3))
fitknn
k-Nearest Neighbors
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: nearest neighbor imputation (52), centered (52),
scaled (52)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 9812, 9811, 9813
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.9474790 0.9335486
7 0.9313758 0.9131477
9 0.9173792 0.8954277
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
vPredict<-predict(fitknn, pmlValidation)
cmknn<-confusionMatrix(pmlValidation$classe,vPredict); cmknn
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1382 5 7 0 1
B 22 912 15 0 0
C 4 16 825 10 0
D 3 0 32 767 2
E 0 12 16 6 867
Overall Statistics
Accuracy : 0.9692
95% CI : (0.964, 0.9739)
No Information Rate : 0.2877
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.961
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9794 0.9651 0.9218 0.9796 0.9966
Specificity 0.9963 0.9907 0.9925 0.9910 0.9916
Pos Pred Value 0.9907 0.9610 0.9649 0.9540 0.9623
Neg Pred Value 0.9917 0.9917 0.9827 0.9961 0.9993
Prevalence 0.2877 0.1927 0.1825 0.1597 0.1774
Detection Rate 0.2818 0.1860 0.1682 0.1564 0.1768
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.9879 0.9779 0.9572 0.9853 0.9941
The kknn method performs weighted k-nearest neighbor classification: for each row of the test set, the k nearest training set vectors (according to Minkowski distance) are found, and the classification is done via the maximum of summed kernel densities.
library(kknn)
set.seed(123456)
fitkknn<-train(classe~., data=pmlTraining, method="kknn", preProcess="knnImpute",
trControl=trainControl(method="cv", number=3))
fitkknn
k-Nearest Neighbors
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: nearest neighbor imputation (52), centered (52),
scaled (52)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 9812, 9811, 9813
Resampling results across tuning parameters:
kmax Accuracy Kappa
5 0.9824025 0.9777427
7 0.9824025 0.9777427
9 0.9824025 0.9777427
Tuning parameter 'distance' was held constant at a value of 2
Tuning parameter 'kernel' was held constant at a value of optimal
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 9, distance = 2 and
kernel = optimal.
vPredict<-predict(fitkknn, pmlValidation)
cmkknn<-confusionMatrix(pmlValidation$classe,vPredict); cmkknn
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1391 1 1 1 1
B 5 939 4 1 0
C 3 8 840 4 0
D 1 1 12 788 2
E 0 3 1 3 894
Overall Statistics
Accuracy : 0.9894
95% CI : (0.9861, 0.9921)
No Information Rate : 0.2855
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9866
Mcnemar's Test P-Value : 0.1641
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9936 0.9863 0.9790 0.9887 0.9967
Specificity 0.9989 0.9975 0.9963 0.9961 0.9983
Pos Pred Value 0.9971 0.9895 0.9825 0.9801 0.9922
Neg Pred Value 0.9974 0.9967 0.9956 0.9978 0.9993
Prevalence 0.2855 0.1941 0.1750 0.1625 0.1829
Detection Rate 0.2836 0.1915 0.1713 0.1607 0.1823
Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
Balanced Accuracy 0.9962 0.9919 0.9877 0.9924 0.9975
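The difference between knn and kknn is the weighting of the neighbours. Here is a simplified base-R sketch of a kernel-weighted vote for one query point. It uses a triangular kernel instead of the "optimal" kernel selected above, the data are made up, and the function weighted_knn is hypothetical, so treat it as an illustration of the idea only.

```r
# Weighted k-nearest-neighbour vote for a single query point (illustrative only)
weighted_knn <- function(train_x, train_y, query, k = 3, p = 2) {
  # Minkowski distance of order p from the query to every training row
  d <- rowSums(abs(sweep(train_x, 2, query))^p)^(1 / p)
  ord <- order(d)
  nearest <- ord[seq_len(k)]
  # Scale by the distance to the (k+1)-th neighbour so the weights fall in [0, 1]
  scaled <- d[nearest] / d[ord[k + 1]]
  w <- 1 - scaled                      # triangular kernel (an assumed choice)
  # Sum the weights per class and pick the class with the largest total
  votes <- tapply(w, train_y[nearest], sum)
  names(which.max(votes))
}

# Made-up training points in two dimensions
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

print(weighted_knn(train_x, train_y, query = c(0.5, 0.5)))
print(weighted_knn(train_x, train_y, query = c(5.2, 5.1)))
```

Closer neighbours contribute more to the vote, so a few very near points can outweigh a larger number of distant ones.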
Meaningful plots of the final model can be built for only 2 models. For random forest, the error decreases as the number of trees increases. The plot for the kknn model shows the quality of the classification as a function of the number of neighbors.
par(mfrow=c(1,2),mar=c(5,4,2,2))
plot(fitrf$finalModel, main="RF")
plot(fitkknn$finalModel, main="KKNN")
The following plots show how the accuracy changes while the parameters of the models are tuned. The model parameters are selected based on the accuracy value.
p1<-ggplot(fitlda2) + labs(title="lda2") + theme_bw()
p2<-ggplot(fitrf) + labs(title="rf") + theme_bw()
p3<-ggplot(fitgbm) + labs(title="gbm") + theme_bw()
p4<-ggplot(fitknn) + labs(title="knn") + theme_bw()
p5<-ggplot(fitkknn) + labs(title="kknn") + theme_bw()
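# multiplot() is not provided by caret or ggplot2; it is assumed to be available, e.g. from the Rmisc package
multiplot(p1, p4, p2, p5, p3, cols=3)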
Random forest and kknn have the highest validation accuracy (0.9945 and 0.9894, as the table below shows). Random forest is slightly more accurate, so I’ll use it to predict on the test data set.
accuracyDF<-data.frame(Model=c("lda2", "rf", "gbm", "svm","knn", "kknn"),
Accuracy=c(cmlda2$overall[1], cmrf$overall[1], cmgbm$overall[1],
cmsvm$overall[1], cmknn$overall[1], cmkknn$overall[1]))
accuracyDF
Model Accuracy
1 lda2 0.6998369
2 rf 0.9944943
3 gbm 0.9604405
4 svm 0.9418842
5 knn 0.9692088
6 kknn 0.9893964
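As a rough reading of the table, the random forest validation accuracy can be translated into what to expect on the 20 test cases. This assumes that the validation accuracy carries over to the test set and that errors are independent, both of which are simplifications.

```r
acc_rf <- 0.9944943   # random forest validation accuracy from the table
n_test <- 20          # number of cases in the test set

expected_errors <- n_test * (1 - acc_rf)   # expected number of misclassified cases
p_all_correct   <- acc_rf^n_test           # chance that all 20 are classified correctly

print(round(expected_errors, 2))
print(round(p_all_correct, 3))
```

Under these assumptions the estimated out-of-sample error is about 0.55%, and the chance of classifying all 20 test cases correctly is close to 90%.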
Let’s predict the manner in which the exercise was performed for the test data set, using the random forest model trained with cross-validation.
testPredict<-predict(fitrf, pmlTesting)
testPredict
[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E