Goal of this project is to predict how well people perform a certain activity. We will do so using data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. ‘how well’ is defined as a “classe”. The goal is to build a model to predict the “classe” of 20 test cases, based on training data.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
training_raw<-read.csv("pml-training.csv")
testing_raw<-read.csv("pml-testing.csv")
training_raw$classe<-as.factor(training_raw$classe)
dim(training_raw)
## [1] 19622 160
#remove parameters that are irrelevant for predicting the classe
trainRemove <- grepl("^user|^X|timestamp|window", names(training_raw))
training1<-training_raw[,!trainRemove]
#restrict to the complete cases
# training2<-training1[complete.cases(training_raw),]
# remove rows with missing values
training2<- training1[, colSums(is.na(training1)) == 0]
# Removing near-zero covariates
near_zero<-nearZeroVar(training2, saveMetrics=TRUE)
training3<-training2[,!near_zero[4]]
training_cleaned<-training3
dim(training_cleaned)
## [1] 19622 53
We reduced the number of variables from 160 to 53.
The cleaned dataset is split in a training and a validation set. Note: In theory, one should use ca 70% of the data for training the prediction model and ca 30% for validation. In practice, working with such large training sets to calculate the prediction model really takes a LOT of computing time on my machine, so for this assessment I trained the prediction models on only 20% of the total training dataset. While this obviously will have an impact on the accuracy of the prediction model, it does not take away any of the learning opportunity while doing this assessment.
#careful: the split should be set to 70/30 for training/validation, but doing so takes a LOT of computing time to calculate the prediction models.
#therefore, for now using only 10% for the training dataset to allow troubleshooting the syntax.
inTrain <- createDataPartition(training_cleaned$classe, p=0.1, list=F)
trainData<-training_cleaned[inTrain,]
valData<-training_cleaned[-inTrain,]
We start with random forest model:
#train model
rf_model<-train(classe~.,data=trainData,method="rf",allowParallel = TRUE)
Predictions of the random forest model on the validation set:
#assess prediction accuracy on validation data
rf_predict<-predict(rf_model,newdata=valData)
rf_cm <- confusionMatrix(factor(rf_predict), factor(valData$classe))
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4935 170 12 22 8
## B 33 3022 160 12 70
## C 33 206 2833 126 49
## D 17 10 67 2719 53
## E 4 9 7 15 3066
##
## Overall Statistics
##
## Accuracy : 0.9387
## 95% CI : (0.935, 0.9422)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9224
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9827 0.8844 0.9201 0.9395 0.9445
## Specificity 0.9832 0.9807 0.9716 0.9900 0.9976
## Pos Pred Value 0.9588 0.9166 0.8725 0.9487 0.9887
## Neg Pred Value 0.9930 0.9725 0.9829 0.9882 0.9876
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2795 0.1711 0.1604 0.1540 0.1736
## Detection Prevalence 0.2915 0.1867 0.1839 0.1623 0.1756
## Balanced Accuracy 0.9829 0.9325 0.9459 0.9648 0.9711
#train model
set.seed(12345)
gbm_control <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
gbm_model <- train(classe ~ ., data=trainData, method = "gbm", trControl = gbm_control, verbose = FALSE)
Predictions of the GBM model on the validation set:
#assess prediction accuracy on validation data
gbm_predict <- predict(gbm_model, newdata=valData)
gbm_cm <- confusionMatrix(factor(gbm_predict), factor(valData$classe))
gbm_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4890 161 2 15 21
## B 56 3014 181 9 111
## C 27 217 2788 150 43
## D 37 8 93 2670 61
## E 12 17 15 50 3010
##
## Overall Statistics
##
## Accuracy : 0.9272
## 95% CI : (0.9232, 0.931)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9078
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9737 0.8821 0.9055 0.9226 0.9273
## Specificity 0.9843 0.9749 0.9700 0.9865 0.9935
## Pos Pred Value 0.9609 0.8941 0.8645 0.9306 0.9697
## Neg Pred Value 0.9895 0.9718 0.9798 0.9849 0.9838
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2769 0.1707 0.1579 0.1512 0.1705
## Detection Prevalence 0.2882 0.1909 0.1826 0.1625 0.1758
## Balanced Accuracy 0.9790 0.9285 0.9378 0.9546 0.9604
The random forest prediction model has the highest accuracy, so the actual prediction on the test data is done using this model. The out of sample error - calculated as 1 minus the accuracy - is around 5%. That means on average 1 out of the 20 predictions will be incorrect.
rf_predict_test<-predict(rf_model,testing_raw)
as.data.frame(rf_predict_test)
## rf_predict_test
## 1 C
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 D
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B