Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the source website (see the section on the Weight Lifting Exercise Dataset).
The training data and the test data for this project are available from the course website; both files are read with the read.csv() function.
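If the raw CSV files are not already on disk, they must first be downloaded into the ./data/ directory used below. A minimal sketch, assuming the standard course URLs (the links are not reproduced in this report, so adjust them if they differ):

if (!dir.exists("./data")) dir.create("./data")
# Assumed course URLs -- not stated in this report
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url_train, destfile = "./data/pml-training.csv")
download.file(url_test, destfile = "./data/pml-testing.csv")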
training_raw <- read.csv("./data/pml-training.csv", header = TRUE); dim(training_raw)
## [1] 19622 160
testing_raw <- read.csv("./data/pml-testing.csv", header = TRUE); dim(testing_raw)
## [1] 20 160
head(colnames(training_raw), 10)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt"
The first 7 columns are identifier and timestamp variables, so they are not needed for the analysis and are dropped from both data sets.
training_raw <- training_raw[, -c(1:7)]; dim(training_raw)
## [1] 19622 153
testing_raw <- testing_raw[, -c(1:7)]; dim(testing_raw)
## [1] 20 153
Next, the libraries needed for the analysis are loaded.
library(caret)
library(kernlab)
library(rpart)
library(ggplot2)
library(randomForest)
library(rattle)
library(Metrics)
Next, near-zero-variance variables are removed from the training data set and from the testing data set.
NZV_train <- nearZeroVar(training_raw)
training_raw <- training_raw[, -NZV_train]; dim(training_raw)
## [1] 19622 94
NZV_test <- nearZeroVar(testing_raw)
testing_raw <- testing_raw[, -NZV_test]; dim(testing_raw)
## [1] 20 53
Finally, columns that contain missing (NA) values are removed.
training_raw <- training_raw[, colSums(is.na(training_raw)) == 0]; dim(training_raw)
## [1] 19622 53
testing_raw <- testing_raw[, colSums(is.na(testing_raw)) == 0]; dim(testing_raw)
## [1] 20 53
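Since the near-zero-variance filter was applied to each data set separately, it is worth checking that the remaining predictor columns agree between the two sets. A quick sanity check (by design, the only mismatch should be the outcome column: classe in the training data versus problem_id in the testing data):

# Columns present in one cleaned data set but not the other
setdiff(colnames(training_raw), colnames(testing_raw))
setdiff(colnames(testing_raw), colnames(training_raw))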
To allow out-of-sample performance to be estimated, the training data are split in a 70:30 ratio into a sub-training data set and a validation data set.
set.seed(123321)
inTrain <- createDataPartition(training_raw$classe, p = 0.7, list = FALSE)
training_Data <- training_raw[inTrain, ]; dim(training_Data)
## [1] 13737 53
validation_Data <- training_raw[-inTrain, ]; dim(validation_Data)
## [1] 5885 53
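Because createDataPartition() samples within each level of classe, the class proportions should be nearly identical in the two splits. A quick check:

# Class proportions should match closely between the two splits
round(prop.table(table(training_Data$classe)), 3)
round(prop.table(table(validation_Data$classe)), 3)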
Decision tree, random forest, gradient boosting machine (GBM), support vector machine (SVM), and linear discriminant analysis (LDA) models are created and tested below. First, a trainControl object is set up so that each model is fit on the sub-training data set with repeated cross-validation.
control <- trainControl(method = "repeatedcv", number = 3, repeats = 5, verboseIter = FALSE)
dec_tree_model <- train(classe ~ ., data = training_Data, method = "rpart", trControl = control, tuneLength = 5)
fancyRpartPlot(dec_tree_model$finalModel, sub = "Decision Tree Model")
pred_dec_tree <- predict(dec_tree_model, validation_Data)
confusionMatrix(pred_dec_tree, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1517 456 501 451 139
## B 30 354 35 7 119
## C 92 246 388 119 263
## D 29 83 102 387 72
## E 6 0 0 0 489
##
## Overall Statistics
##
## Accuracy : 0.5327
## 95% CI : (0.5199, 0.5455)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3907
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9062 0.31080 0.37817 0.40145 0.45194
## Specificity 0.6326 0.95976 0.85182 0.94188 0.99875
## Pos Pred Value 0.4951 0.64954 0.35018 0.57504 0.98788
## Neg Pred Value 0.9443 0.85300 0.86644 0.88929 0.88998
## Prevalence 0.2845 0.19354 0.17434 0.16381 0.18386
## Detection Rate 0.2578 0.06015 0.06593 0.06576 0.08309
## Detection Prevalence 0.5206 0.09261 0.18828 0.11436 0.08411
## Balanced Accuracy 0.7694 0.63528 0.61499 0.67167 0.72535
The estimated out-of-sample error for the decision tree model is:
1-as.numeric(confusionMatrix(pred_dec_tree, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.4672897
rf_model <- train(classe ~ ., data = training_Data, method = "rf", trControl = control, tuneLength = 5)
pred_rf <- predict(rf_model, validation_Data)
confusionMatrix(pred_rf, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 5 0 0 0
## B 3 1127 3 0 0
## C 0 7 1020 8 3
## D 0 0 3 955 5
## E 1 0 0 1 1074
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.991, 0.9953)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9895 0.9942 0.9907 0.9926
## Specificity 0.9988 0.9987 0.9963 0.9984 0.9996
## Pos Pred Value 0.9970 0.9947 0.9827 0.9917 0.9981
## Neg Pred Value 0.9990 0.9975 0.9988 0.9982 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1915 0.1733 0.1623 0.1825
## Detection Prevalence 0.2846 0.1925 0.1764 0.1636 0.1828
## Balanced Accuracy 0.9982 0.9941 0.9952 0.9945 0.9961
The estimated out-of-sample error for the random forest model is:
1-as.numeric(confusionMatrix(pred_rf, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.006627018
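For a quick look at which sensor readings drive the random forest's predictions, caret's varImp() can be applied to the fitted model. A sketch (the importance ranking itself is not reported here):

# Rank the predictors by importance and plot the ten most influential ones
rf_importance <- varImp(rf_model)
plot(rf_importance, top = 10)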
gbm_model <- train(classe ~ ., data = training_Data, method = "gbm", trControl = control, tuneLength = 5)
pred_gbm <- predict(gbm_model, validation_Data)
confusionMatrix(pred_gbm, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1662 10 0 0 0
## B 8 1119 4 0 5
## C 4 10 1018 7 3
## D 0 0 4 957 8
## E 0 0 0 0 1066
##
## Overall Statistics
##
## Accuracy : 0.9893
## 95% CI : (0.9863, 0.9918)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9865
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9928 0.9824 0.9922 0.9927 0.9852
## Specificity 0.9976 0.9964 0.9951 0.9976 1.0000
## Pos Pred Value 0.9940 0.9850 0.9770 0.9876 1.0000
## Neg Pred Value 0.9972 0.9958 0.9983 0.9986 0.9967
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2824 0.1901 0.1730 0.1626 0.1811
## Detection Prevalence 0.2841 0.1930 0.1771 0.1647 0.1811
## Balanced Accuracy 0.9952 0.9894 0.9936 0.9952 0.9926
The estimated out-of-sample error for the GBM model is:
1-as.numeric(confusionMatrix(pred_gbm, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.01070518
svm_model <- train(classe ~ ., data = training_Data, method = "svmLinear", trControl = control, tuneLength = 5)
pred_svm <- predict(svm_model, validation_Data)
confusionMatrix(pred_svm, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1519 153 84 58 59
## B 37 824 89 41 134
## C 53 54 818 119 76
## D 53 24 23 710 67
## E 12 84 12 36 746
##
## Overall Statistics
##
## Accuracy : 0.7845
## 95% CI : (0.7738, 0.795)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7262
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9074 0.7234 0.7973 0.7365 0.6895
## Specificity 0.9159 0.9366 0.9378 0.9661 0.9700
## Pos Pred Value 0.8110 0.7324 0.7304 0.8096 0.8382
## Neg Pred Value 0.9614 0.9338 0.9563 0.9493 0.9327
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2581 0.1400 0.1390 0.1206 0.1268
## Detection Prevalence 0.3183 0.1912 0.1903 0.1490 0.1512
## Balanced Accuracy 0.9117 0.8300 0.8676 0.8513 0.8297
The estimated out-of-sample error for the SVM model is:
1-as.numeric(confusionMatrix(pred_svm, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.215463
lda_model <- train(classe ~ ., data = training_Data, method = "lda", trControl = control, tuneLength = 5)
pred_lda <- predict(lda_model, validation_Data)
confusionMatrix(pred_lda, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1341 168 103 61 47
## B 46 734 107 38 177
## C 132 129 690 116 108
## D 146 49 114 711 94
## E 9 59 12 38 656
##
## Overall Statistics
##
## Accuracy : 0.7021
## 95% CI : (0.6903, 0.7138)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8011 0.6444 0.6725 0.7376 0.6063
## Specificity 0.9100 0.9225 0.9002 0.9181 0.9754
## Pos Pred Value 0.7797 0.6661 0.5872 0.6382 0.8475
## Neg Pred Value 0.9200 0.9153 0.9287 0.9470 0.9167
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2279 0.1247 0.1172 0.1208 0.1115
## Detection Prevalence 0.2923 0.1873 0.1997 0.1893 0.1315
## Balanced Accuracy 0.8555 0.7834 0.7863 0.8278 0.7909
The estimated out-of-sample error for the LDA model is:
1-as.numeric(confusionMatrix(pred_lda, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.297876
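Collecting the validation accuracies of all five models into one table makes the comparison explicit. A sketch reusing the prediction objects created above:

# Validation accuracy and estimated out-of-sample error for each model
preds <- list(tree = pred_dec_tree, rf = pred_rf, gbm = pred_gbm,
              svm = pred_svm, lda = pred_lda)
acc <- sapply(preds, function(p)
  unname(confusionMatrix(p, factor(validation_Data$classe))$overall["Accuracy"]))
data.frame(model = names(acc), accuracy = round(acc, 4),
           oos_error = round(1 - acc, 4), row.names = NULL)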
The random forest model showed the highest accuracy and the lowest out-of-sample error of the five models, so it is used to predict the 20 cases in the testing data set.
pred_test <- predict(rf_model, testing_raw)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E