This is the final report of the Peer Assessment project from Coursera's Practical Machine Learning course, part of the Data Science Specialization. The main goal of the project is to predict the manner (the classe variable) in which 6 participants performed some exercise, as described below. The machine learning algorithm selected and built through the process described in this report is applied to the 20 cases available in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz3xsbS5bVX
library(caret)
library(scales)
library(kableExtra)
library(dplyr)
# set the URL for the download, download and load the data for the training of the model
UrlTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(UrlTrain, destfile = "pml-training.csv")
training <- read.csv("pml-training.csv")
# set the URL for the download, download and load the data for the prediction
# (ATTENTION: not for the cross-validation testing of the model building)
UrlTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(UrlTest, destfile = "pml-testing.csv")
testing <- read.csv("pml-testing.csv")
We look at the dimensions of the two datasets: the one with which we will build (train and test) our model, and the one whose outcome we will try to predict:
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
We try to find the names of the columns that differ between the two datasets:
(dif <- which(names(training)!=names(testing))) # the only difference is located in the 160th variable/column
## [1] 160
names(c(training,testing))[c(dif,2*dif)] # the names corresponding to the 160th col of each dataset
## [1] "classe" "problem_id"
As we were expecting, the only difference regarding the variables is located in the 160th one, which will be used as the outcome variable of our model and prediction.
So both datasets have essentially the same 160 variables. But those variables include: the first seven identification/bookkeeping columns (row index, user name, timestamps and window indicators), which carry no sensor information; a number of near-zero-variance variables; and many variables consisting almost entirely of missing (NA) values.
We have to deal with these issues.
# simply subsetting the training dataset
training <- training [,-(1:7)]
dim(training)
## [1] 19622 153
Thus, we end up with 153 variables
# using caret's nearZeroVar function and subsetting the training dataset
nzv <- nearZeroVar(training,saveMetrics=TRUE)
training <- training[which(nzv$nzv==0)]
dim(training)
## [1] 19622 94
Thus, we end up with 94 variables
While the treatment of the previous two issues was simple, the case of missing/NA values is more complicated: the various options differ substantially and depend on the percentage and cause of the missing values. In our case, we limit ourselves to dealing with the issue in two ways: either deleting all variables of this kind, or imputing them with some imputation method (we'll use k-nearest neighbors). Subsequently, during our exploratory model building, we will decide which one to keep.
no_na <- which(colSums(is.na(training)) > 0) # find the variables with NA values
summary(colSums(is.na(training[,no_na]))) # summarize the number of NA values in each such variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19216 19216 19216 19216 19216 19216
We note that all the variables with missing data have the same number of NAs.
mean(colSums(is.na(training[,no_na])))/nrow(training) # calculate the percentage of NA values for these variables
## [1] 0.9793089
We also note that for these variables the percentage of missing values is close to 98%, a percentage high enough to justify the “delete” option.
training_no_na <- training[,-no_na] # subset omitting these variables
dim(training_no_na)
## [1] 19622 53
Subsetting again, we end up in this case with 53 variables
Note that the imputation is applied to the original training dataset and not to the new training_no_na one.
set.seed(5959)
preObj <- preProcess(training,method="knnImpute") # training a preprocessing caret's model
training_knn <- predict(preObj,training) # predicting and creating the new imputed training dataset
dim(training_knn)
## [1] 19622 94
Thus, in the imputing case, we end up with 94 variables, logically the same number we had after removing the near-zero-variance variables.
For a cross validation of our data and evaluation (accuracy and out-of-sample error) of our models:
We split our data into a training data set (75% of the total cases) and a testing data set (25% of the total cases; the latter should not be confused with the data in the pml-testing.csv file). This will allow us to estimate the out of sample error of our predictor.
set.seed(5959)
inTrain <- createDataPartition(training$classe, p=0.75, list=FALSE)
sub_training_no_na <- training_no_na[inTrain,]
sub_training_knn <- training_knn[inTrain,]
sub_testing_no_na <- training_no_na[-inTrain,]
sub_testing_knn <- training_knn[-inTrain,]
We must note that in our case, instead of two, we have four new datasets: one sub_training/sub_testing pair for each of the two ways of dealing with missing values that we will try below.
dim(sub_training_no_na); dim(sub_testing_no_na)
## [1] 14718 53
## [1] 4904 53
dim(sub_training_knn); dim(sub_testing_knn)
## [1] 14718 94
## [1] 4904 94
With each sub_training containing 14,718 and each sub_testing 4,904 observations.
K-fold Cross Validation (CV) divides the training data into folds, ensuring that each fold is used as a testing set at some point and thus giving us a more accurate estimate of each model’s prediction capacity (accuracy, out-of-sample error) on an unknown dataset. In order to use it, we set a trainControl object that we’ll use subsequently in the training of our models. For each one of them, cross validation is done with K = 25.
(caret’s default resampling is bootstrapping with 25 repetitions, while for cross-validation the default number of folds is 10; setting this parameter explicitly through the control parameters (trainControl) argument also serves to demonstrate how to set it at will.)
fitControl <- trainControl(method='cv', number = 25)
Acc_Err <- data.frame("Model" = NA, "Specs"=NA, "HO_Accuracy" = NA, "HO_Out_of_Sample_Error" =NA, "CV_Accuracy" = NA, "CV_Out_of_Sample_Error"=NA, "User_Time"=NA)
We first try to apply this algorithm on the datasets cleared of the variables with NA values
no_na:
set.seed(5959)
model_rpart_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rpart", trControl=fitControl)
| Model | Specs | Hold-out Accuracy | Hold-out Out-of-Sample Error | CV Accuracy | CV Out-of-Sample Error | User Time (sec) |
|---|---|---|---|---|---|---|
| Decision Tree (RPART) | no NAs | 50.08% | 48.51% - 51.33% | 50.29% | 49.63% - 49.80% | 30.8 |
We then apply the same algorithm on the datasets where the NA values have been imputed using the knn method.
knn:
set.seed(5959)
model_rpart_knn <- train(classe ~ ., data = sub_training_knn, method = "rpart", trControl=fitControl)
| Model | Specs | Hold-out Accuracy | Hold-out Out-of-Sample Error | CV Accuracy | CV Out-of-Sample Error | User Time (sec) |
|---|---|---|---|---|---|---|
| Decision Tree (RPART) | knn imputed NAs | 50.37% | 48.22% - 51.04% | 52.85% | 46.95% - 47.36% | 31.3 |
We note the user time required to train the algorithm on each of the two datasets, and also that the metrics of the two models are not very different.
Linear Discriminant Analysis is much affected by possible strong collinearity between some variables. As such, we try two different preprocessing approaches for the datasets:
We start with the method of removing highly correlated variables:
cor_matrix_no_na <- abs(cor(sub_training_no_na[ , -53]))
no_na_ncor <- findCorrelation(cor_matrix_no_na, cutoff=2/3)
length(no_na_ncor)
## [1] 25
no_na_ncor <- sort(no_na_ncor)
sub_training_no_na_ncor <- sub_training_no_na[,-no_na_ncor]
sub_testing_no_na_ncor <- sub_testing_no_na[,-no_na_ncor]
set.seed(5959)
model_lda_no_na_ncor <- train(classe ~ ., data = sub_training_no_na_ncor, method = "lda", trControl=fitControl)
And we also try a principal component analysis (PCA) preprocessing, with 27 components (the same number of predictors kept by the correlation filter above):
no_na_pca <- preProcess(sub_training_no_na, method = "pca", pcaComp = 27)
sub_training_no_na_pca <- predict(no_na_pca, sub_training_no_na)
sub_testing_no_na_pca <- predict(no_na_pca, sub_testing_no_na)
set.seed(5959)
model_lda_no_na_pca <- train(classe ~ ., data = sub_training_no_na_pca, method = "lda", trControl=fitControl)
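As an aside (not used in this report), instead of fixing pcaComp = 27 we could let caret choose the number of components from a variance threshold; a minimal sketch:
# Alternative sketch: let preProcess() keep as many principal components as are
# needed to capture 95% of the variance.
no_na_pca_95 <- preProcess(sub_training_no_na, method = "pca", thresh = 0.95)
no_na_pca_95$numComp # number of components retained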
We start with the case of 5 trees on the no_na dataset:
set.seed(5959)
model_gbm_5_no_na <- train(classe ~ ., data = sub_training_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=5,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
We’ll do the same for the cases of 10, 20 and 50 trees.
For the random forest, we likewise start with the case of 5 trees on the no_na dataset:
set.seed(5959)
model_rf_5_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=5, trControl=fitControl)
We’ll do the same for the cases of 10, 20, 50 and 100 trees.
We will try a Linear Support Vector Machine:
set.seed(5959)
model_svml_no_na <- train(classe ~ ., data = sub_training_no_na, method = "svmLinear", trControl=fitControl)
We will also try some stacked models, using the class predictions of the RPART, GBM (50 trees), RF (5 trees) and linear SVM models as the features of new stacked training and testing sets (their construction is shown right below and repeated in the Appendix).
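The stacked training and testing sets are built from the base models' class predictions together with the true classe (this construction also appears in the Appendix):
sub_training_stacked_no_na <- data.frame("rpart"=predict(model_rpart_no_na, sub_training_no_na),
                                         "gbm"=predict(model_gbm_50_no_na, sub_training_no_na),
                                         "rf"=predict(model_rf_5_no_na, sub_training_no_na),
                                         "svm"=predict(model_svml_no_na, sub_training_no_na),
                                         "classe"=sub_training_no_na$classe)
sub_testing_stacked_no_na <- data.frame("rpart"=predict(model_rpart_no_na, sub_testing_no_na),
                                        "gbm"=predict(model_gbm_50_no_na, sub_testing_no_na),
                                        "rf"=predict(model_rf_5_no_na, sub_testing_no_na),
                                        "svm"=predict(model_svml_no_na, sub_testing_no_na),
                                        "classe"=sub_testing_no_na$classe)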
set.seed(5959)
stacked_model_rf_5 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rf", ntree=5, trControl=fitControl)
set.seed(5959)
stacked_model_rf_10 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rf", ntree=10, trControl=fitControl)
set.seed(5959)
stacked_model_gbm_50 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=50,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
set.seed(5959)
stacked_model_rpart <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rpart", trControl=fitControl)
set.seed(5959)
stacked_model_svml <- train(classe ~ ., data = sub_training_stacked_no_na, method = "svmLinear", trControl=fitControl)
sub_training_stacked_X_no_na <- data.frame("gbm"=predict(model_gbm_50_no_na, sub_training_no_na),
"rf"=predict(model_rf_50_no_na, sub_training_no_na),
"svm"=predict(model_svml_no_na, sub_training_no_na),
"classe"=sub_training_no_na$classe)
sub_testing_stacked_X_no_na <- data.frame("gbm"=predict(model_gbm_50_no_na, sub_testing_no_na),
"rf"=predict(model_rf_50_no_na, sub_testing_no_na),
"svm"=predict(model_svml_no_na, sub_testing_no_na),
"classe"=sub_testing_no_na$classe)
set.seed(5959)
stacked_model_X_rf_50 <- train(classe ~ ., data = sub_training_stacked_X_no_na, method = "rf", ntree=5, trControl=fitControl)
Acc_Err %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
| Model | Specs | Hold-out Accuracy | Hold-out Out-of-Sample Error | CV Accuracy | CV Out-of-Sample Error | User Time (sec) |
|---|---|---|---|---|---|---|
| Decision Tree (RPART) |  | 50.08% | 48.51% - 51.33% | 50.29% | 49.63% - 49.80% | 30.8 |
| Linear Discriminant Analysis (LDA) | cor<|2/3| | 57.97% | 40.64% - 43.42% | 55.71% | 44.17% - 44.42% | 4.3 |
| Linear Discriminant Analysis (LDA) | pca | 56.28% | 42.32% - 45.12% | 53.82% | 46.14% - 46.23% | 4.6 |
| Gradient Boosting Machine (GBM) | 5 trees | 55.59% | 43.02% - 45.82% | 53.40% | 46.52% - 46.68% | 16.9 |
| Gradient Boosting Machine (GBM) | 10 trees | 58.85% | 39.77% - 42.54% | 59.11% | 40.83% - 40.95% | 30.8 |
| Gradient Boosting Machine (GBM) | 20 trees | 67.86% | 30.83% - 33.46% | 66.88% | 33.05% - 33.18% | 90.8 |
| Gradient Boosting Machine (GBM) | 50 trees | 76.28% | 22.53% - 24.93% | 75.43% | 24.52% - 24.61% | 130.7 |
| Random Forest (RF) | 5 trees | 97.92% | 1.66% - 2.39% | 98.13% | 1.87% - 1.87% | 66.0 |
| Random Forest (RF) | 10 trees | 98.67% | 0.97% - 1.64% | 98.81% | 1.19% - 1.19% | 119.3 |
| Random Forest (RF) | 20 trees | 99.08% | 0.62% - 1.23% | 99.09% | 0.91% - 0.91% | 211.4 |
| Random Forest (RF) | 50 trees | 99.43% | 0.38% - 0.82% | 99.25% | 0.75% - 0.75% | 516.9 |
| Support Vector Machine (SVM) | Linear | 79.32% | 19.55% - 21.84% | 78.18% | 21.79% - 21.85% | 696.1 |
| RPART + GBM_50 + RF_5 + SVM :: RF | 5 trees | 98.02% | 1.61% - 2.41% | 99.90% | 0.09% - 0.10% | 11.1 |
| RPART + GBM_50 + RF_5 + SVM :: RF | 10 trees | 98.02% | 1.61% - 2.41% | 99.90% | 0.09% - 0.10% | 16.1 |
| RPART + GBM_5 + RF_5 + SVM :: GBM | 50 trees | 98.02% | 1.61% - 2.41% | 99.90% | 0.09% - 0.10% | 35.1 |
| RPART + GBM_50 + RF_5 + SVM :: RPART |  | 64.93% | 33.74% - 36.43% | 73.75% | 24.73% - 27.76% | 2.7 |
| RPART + GBM_50 + RF_5 + SVM :: SVM | Linear | 98.02% | 1.61% - 2.41% | 99.90% | 0.09% - 0.10% | 5.3 |
| GBM_50 + RF_50 + SVM :: RF | 50 trees | 99.45% | 0.36% - 0.80% | 100.00% | 0.00% - 0.00% | 8.8 |
The hold-out accuracies and out-of-sample errors, which were supposed to fall within the corresponding cross-validation confidence intervals with roughly 95% likelihood, are each time found to lie outside those intervals (narrowly most of the time, but still significantly, especially for the lowest error values). This cannot be accidental. It rather points to something imperfect in the random formation of our 25 folds. And indeed, the answer is very easy to see if we just look at the classe values for all the observations that have trained our models.
table(sub_training_no_na$classe)
##
## A B C D E
## 4185 2848 2567 2412 2706
A occurs 50 to 75% more frequently than B, C, D and E.
The imbalance in the response variable is large. In such cases, a slight variation of the K-fold cross validation technique is used, such that each fold contains approximately the same percentage of samples of each target class as the complete set (or, in regression problems, approximately the same mean response value in every fold). This variation is known as Stratified K-Fold Cross Validation. For the purposes of this report, and given the already double-checked validation of our model-building results (hold-out and cross-validation), we will not proceed with its implementation; a minimal sketch of how it could be done is given below for reference.
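A minimal sketch (not run in this report) of supplying stratified folds to caret; note that createFolds() already samples within the levels of a factor outcome, so the folds it produces are stratified:
# Sketch only, reusing the same seed and 25 folds as above.
# returnTrain = TRUE yields the training-row indices that trainControl()
# expects in its index argument.
set.seed(5959)
strat_folds <- createFolds(sub_training_no_na$classe, k = 25, returnTrain = TRUE)
fitControl_strat <- trainControl(method = "cv", index = strat_folds)
# e.g. train(classe ~ ., data = sub_training_no_na, method = "rf",
#            ntree = 50, trControl = fitControl_strat)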
But mainly, we can see a very important difference in accuracy and out-of-sample error among the different machine learning algorithms and their differently tuned versions. In this regard, for our datasets and our question, the most successful models appear to be the random forests as well as the stacked ones. That means that these models, once trained, make the fewest errors when predicting the outcomes on unknown datasets.
User time is a parameter that strongly differentiates, not the performance, but the evaluation of the different models. The (best) trade-off between the metrics, e.g. accuracy, and the computational cost is after all a major issue in the design and selection of suitable machine learning algorithms and models, and it determines their scalability, especially in big data problems.
Given, for instance, the precise terms of the problem, namely that we have to ensure 80% successful prediction, we could choose the algorithm whose accuracy stays over 80% with 99.7% confidence (p < 0.003). But the fact that we cannot fully trust the simple 25-fold cross-validation out-of-sample error estimate is combined, in our case, with the feasibility of applying the algorithm (at least for the size of our dataset in variables and observations).
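As an illustration (a rough sketch using numbers from the table above, not part of the original analysis), such a bound can be checked with a normal approximation:
# One-sided 99.7% lower confidence bound for the hold-out accuracy of the
# RF (50 trees) model.
acc <- 0.9943                                 # hold-out accuracy from the table
n <- nrow(sub_testing_no_na)                  # 4904 hold-out observations
lower_bound <- acc - qnorm(0.997) * sqrt(acc * (1 - acc) / n)
lower_bound                                   # about 0.991, well above the 0.80 target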
Given the above, we think we reasonably end up choosing random forest (RF) as the algorithm that is appropriate for our problem.
Having already selected the random forest (RF) algorithm, it’s time to try its performance with the knn-imputed dataset (instead of the dataset from which we had simply removed the variables/predictors with NA values).
set.seed(5959)
model_rf_50_knn <- train(classe ~ ., data = sub_training_knn, method = "rf", ntree=50, trControl=fitControl)
| Model | Specs | Hold-out Accuracy | Hold-out Out-of-Sample Error | CV Accuracy | CV Out-of-Sample Error | User Time (sec) |
|---|---|---|---|---|---|---|
| Random Forest (50 trees) | no NA variables | 99.43% | 0.38% - 0.82% | 99.25% | 0.75% - 0.75% | 516.9 |
| Random Forest (50 trees) | knn imputed NAs | 99.10% | 0.65% - 1.20% | 99.01% | 0.98% - 0.99% | 957.7 |
We remark that, still, the no_na and knn training datasets don’t show any major difference. Since the user time is nearly double in the knn case, we decide to finally use the no_na datasets.
We will also try the RF algorithm with 100 trees.
set.seed(5959)
model_rf_100_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=100, trControl=fitControl)
knn_no_na_table %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
| Model | Specs | Hold-out Accuracy | Hold-out Out-of-Sample Error | CV Accuracy | CV Out-of-Sample Error | User Time (sec) |
|---|---|---|---|---|---|---|
| Random Forest (RF) | 50 trees | 99.43% | 0.38% - 0.82% | 99.25% | 0.75% - 0.75% | 516.9 |
| Random Forest (RF) | 100 trees | 99.41% | 0.40% - 0.85% | 99.30% | 0.70% - 0.70% | 955.7 |
model_rf_100_no_na
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (25 fold)
## Summary of sample sizes: 14129, 14128, 14129, 14128, 14128, 14130, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9926600 0.9907142
## 27 0.9930030 0.9911476
## 52 0.9889254 0.9859897
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
varImp(model_rf_100_no_na)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 61.81
## yaw_belt 56.50
## pitch_belt 46.71
## magnet_dumbbell_z 46.25
## magnet_dumbbell_y 45.73
## roll_forearm 43.10
## accel_dumbbell_y 21.29
## accel_forearm_x 17.91
## roll_dumbbell 17.51
## magnet_dumbbell_x 16.62
## accel_dumbbell_z 15.33
## magnet_forearm_z 15.16
## total_accel_dumbbell 15.15
## magnet_belt_z 14.62
## accel_belt_z 13.04
## magnet_belt_y 11.75
## yaw_arm 11.39
## magnet_belt_x 11.37
## gyros_belt_z 11.01
We could look more deeply into the RF tuning parameters. We could create grids to optimize the prediction accuracy in more detail, depending (besides the number of trees, ntree) on mtry (how many variables to consider at each node split) and nodesize (how many observations we want in the terminal nodes). We could also follow the reverse logic and stop once we find an algorithm that guarantees, with high confidence, that the prediction success will not fall below 80%. A minimal sketch of such a tuning grid follows.
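A minimal sketch of such a grid (not run here; the grid values and nodesize are illustrative assumptions):
# caret's "rf" method tunes mtry through tuneGrid; ntree and nodesize are
# passed on to randomForest(). Values below are illustrative only.
rf_grid <- expand.grid(mtry = c(2, 7, 14, 27, 40, 52))
set.seed(5959)
# model_rf_tuned <- train(classe ~ ., data = sub_training_no_na, method = "rf",
#                         ntree = 100, nodesize = 5,
#                         tuneGrid = rf_grid, trControl = fitControl)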
One way or another, we have already come up with an algorithm that is very reliable in terms of accuracy and out-of-sample error (random forest, 100 trees, mtry = 27, no_na dataset), with which we answer the original question, that is, we predict the classe variable of the 20 cases of our testing dataset (after preprocessing it appropriately):
testing <- testing [,-(1:7)]
testing <- testing[which(nzv$nzv==0)]
testing_no_na <- testing[,-no_na]
predict(model_rf_100_no_na, testing_no_na)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
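If file-based submission were needed, a small helper like the following (illustrative only, not part of the grading workflow used here) would write each prediction to its own text file:
# Illustrative helper: one text file per prediction (problem_id_1.txt, ...).
answers <- predict(model_rf_100_no_na, testing_no_na)
for (i in seq_along(answers)) {
  write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}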
Acc_Err[1,1] <- "Decision Tree (RPART)"
Acc_Err[1,2] <- "no NAs"
Acc_Err[1,3] <- percent(confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[1,4] <- paste(percent(1-confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[1,5] <- percent(model_rpart_no_na$results[1,2], accuracy = 0.01)
Acc_Err[1,6] <- paste(percent(1-model_rpart_no_na$results[1,2]-1.96*model_rpart_no_na$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_rpart_no_na$results[1,2]+1.96*model_rpart_no_na$results[1,4]^2, accuracy = 0.01))
Acc_Err[1,7] <- format(round(as.numeric(model_rpart_no_na$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[3,1] <- "Linear Discriminant Analysis (LDA)"
Acc_Err[3,2] <- "pca"
Acc_Err[3,3] <- percent(confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['Accuracy'],
accuracy = 0.01)
Acc_Err[3,4] <- paste(percent(1-confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[3,5] <- percent(model_lda_no_na_pca$results[1,2], accuracy = 0.01)
Acc_Err[3,6] <- paste(percent(1-model_lda_no_na_pca$results[1,2]-1.96*model_lda_no_na_pca$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_lda_no_na_pca$results[1,2]+1.96*model_lda_no_na_pca$results[1,4]^2, accuracy =0.01))
Acc_Err[3,7] <- format(round(as.numeric(model_lda_no_na_pca$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[4,1] <- "Gradient Boosting Machine (GBM)"
Acc_Err[4,2] <- "5 trees"
Acc_Err[4,3] <- percent(confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[4,4] <- paste(percent(1-confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[4,5] <- percent(model_gbm_5_no_na$results[1,5], accuracy = 0.01)
Acc_Err[4,6] <- paste(percent(1-model_gbm_5_no_na$results[1,5]-1.96*model_gbm_5_no_na$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-model_gbm_5_no_na$results[1,5]+1.96*model_gbm_5_no_na$results[1,7]^2, accuracy =0.01))
Acc_Err[4,7] <- format(round(as.numeric(model_gbm_5_no_na$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[8,1] <- "Random Forest (RF)"
Acc_Err[8,2] <- "5 trees"
Acc_Err[8,3] <- percent(confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[8,4] <- paste(percent(1-confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_5_no_na$results, mtry==as.numeric(model_rf_5_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_5_no_na$results, mtry==as.numeric(model_rf_5_no_na$bestTune))$AccuracySD
Acc_Err[8,5] <- percent(Accuracy,.01)
Acc_Err[8,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[8,7] <- format(round(as.numeric(model_rf_5_no_na$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[12,1] <- "Support Vector Machine (SVM)"
Acc_Err[12,2] <- "Linear"
Acc_Err[12,3] <- percent(confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[12,4] <- paste(percent(1-confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- model_svml_no_na$results$Accuracy
AccuracySD <- model_svml_no_na$results$AccuracySD
Acc_Err[12,5] <- percent(Accuracy,.01)
Acc_Err[12,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[12,7] <- format(round(as.numeric(model_svml_no_na$times$everything["user.self"]),1), nsmall = 1)
```{r getlabels, echo=FALSE}
labs <- knitr::all_labels()
labs <- labs[!labs %in% c("setup", "working_directory","getlabels","allcode")]
appendix <- c("getlabels", "allcode")
```
```{r allcode, ref.label=labs, eval=FALSE, results=FALSE, echo=TRUE}
```
library(caret)
library(scales)
library(kableExtra)
library(dplyr)
# set the URL for the download, download and load the data for the training of the model
UrlTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(UrlTrain, destfile = "pml-training.csv")
training <- read.csv("pml-training.csv")
# set the URL for the download, download and load the data for the prediction
# (ATTENTION: not for the cross-validation testing of the model building)
UrlTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(UrlTest, destfile = "pml-testing.csv")
testing <- read.csv("pml-testing.csv")
dim(training); dim(testing)
(dif <- which(names(training)!=names(testing))) # the only difference is located in the 160th variable/column
names(c(training,testing))[c(dif,2*dif)] # the names corresponding to the 160th col of each dataset
# simply subsetting the training dataset
training <- training [,-(1:7)]
dim(training)
# using caret's nearZeroVar function and subsetting the training dataset
nzv <- nearZeroVar(training,saveMetrics=TRUE)
training <- training[which(nzv$nzv==0)]
dim(training)
no_na <- which(colSums(is.na(training)) > 0) # find the variables with NA values
summary(colSums(is.na(training[,no_na]))) # summarize the number of NA values in each such variable
mean(colSums(is.na(training[,no_na])))/nrow(training) # calculate the percentage of NA values for these variables
training_no_na <- training[,-no_na] # subset omitting these variables
dim(training_no_na)
set.seed(5959)
preObj <- preProcess(training,method="knnImpute") # training a preprocessing caret's model
training_knn <- predict(preObj,training) # predicting and creating the new imputed training dataset
dim(training_knn)
set.seed(5959)
inTrain <- createDataPartition(training$classe, p=0.75, list=FALSE)
sub_training_no_na <- training_no_na[inTrain,]
sub_training_knn <- training_knn[inTrain,]
sub_testing_no_na <- training_no_na[-inTrain,]
sub_testing_knn <- training_knn[-inTrain,]
dim(sub_training_no_na); dim(sub_testing_no_na)
dim(sub_training_knn); dim(sub_testing_knn)
fitControl <- trainControl(method='cv', number = 25)
Acc_Err <- data.frame("Model" = NA, "Specs"=NA, "HO_Accuracy" = NA, "HO_Out_of_Sample_Error" =NA, "CV_Accuracy" = NA, "CV_Out_of_Sample_Error"=NA, "User_Time"=NA)
set.seed(5959)
model_rpart_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rpart", trControl=fitControl)
Acc_Err[1,1] <- "Decision Tree (RPART)"
Acc_Err[1,2] <- "no NAs"
Acc_Err[1,3] <- percent(confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[1,4] <- paste(percent(1-confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rpart_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[1,5] <- percent(model_rpart_no_na$results[1,2], accuracy = 0.01)
Acc_Err[1,6] <- paste(percent(1-model_rpart_no_na$results[1,2]-1.96*model_rpart_no_na$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_rpart_no_na$results[1,2]+1.96*model_rpart_no_na$results[1,4]^2, accuracy = 0.01))
Acc_Err[1,7] <- format(round(as.numeric(model_rpart_no_na$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[1,] %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
set.seed(5959)
model_rpart_knn <- train(classe ~ ., data = sub_training_knn, method = "rpart", trControl=fitControl)
Acc_Err[2,1] <- "Decision Tree (RPART)"
Acc_Err[2,2] <- "knn imputed NAs"
Acc_Err[2,3] <- percent(confusionMatrix(predict(model_rpart_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[2,4] <- paste(percent(1-confusionMatrix(predict(model_rpart_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rpart_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[2,5] <- percent(model_rpart_knn$results[1,2], accuracy = 0.01)
Acc_Err[2,6] <- paste(percent(1-model_rpart_knn$results[1,2]-1.96*model_rpart_knn$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_rpart_knn$results[1,2]+1.96*model_rpart_knn$results[1,4]^2, accuracy = 0.01))
Acc_Err[2,7] <- format(round(as.numeric(model_rpart_knn$times$everything["user.self"]),1), nsmall = 1)
Acc_Err %>%
slice(2) %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
cor_matrix_no_na <- abs(cor(sub_training_no_na[ , -53]))
no_na_ncor <- findCorrelation(cor_matrix_no_na, cutoff=2/3)
length(no_na_ncor)
no_na_ncor <- sort(no_na_ncor)
sub_training_no_na_ncor <- sub_training_no_na[,-no_na_ncor]
sub_testing_no_na_ncor <- sub_testing_no_na[,-no_na_ncor]
set.seed(5959)
model_lda_no_na_ncor <- train(classe ~ ., data = sub_training_no_na_ncor, method = "lda", trControl=fitControl)
Acc_Err[1,1] <- "Decision Tree (RPART)"
Acc_Err[1,2] <- " "
Acc_Err[2,1] <- "Linear Discriminant Analysis (LDA)"
Acc_Err[2,2] <- "cor<|2/3|"
Acc_Err[2,3] <- percent(confusionMatrix(predict(model_lda_no_na_ncor, sub_testing_no_na_ncor),
sub_testing_no_na_ncor$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[2,4] <- paste(percent(1-confusionMatrix(predict(model_lda_no_na_ncor, sub_testing_no_na_ncor),
sub_testing_no_na_ncor$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_lda_no_na_ncor, sub_testing_no_na_ncor),
sub_testing_no_na_ncor$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[2,5] <- percent(model_lda_no_na_ncor$results[1,2], accuracy = 0.01)
Acc_Err[2,6] <- paste(percent(1-model_lda_no_na_ncor$results[1,2]-1.96*model_lda_no_na_ncor$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_lda_no_na_ncor$results[1,2]+1.96*model_lda_no_na_ncor$results[1,4]^2, accuracy =0.01))
Acc_Err[2,7] <- format(round(as.numeric(model_lda_no_na_ncor$times$everything["user.self"]),1), nsmall = 1)
no_na_pca <- preProcess(sub_training_no_na, method = "pca", pcaComp = 27)
sub_training_no_na_pca <- predict(no_na_pca, sub_training_no_na)
sub_testing_no_na_pca <- predict(no_na_pca, sub_testing_no_na)
set.seed(5959)
model_lda_no_na_pca <- train(classe ~ ., data = sub_training_no_na_pca, method = "lda", trControl=fitControl)
Acc_Err[3,1] <- "Linear Discriminant Analysis (LDA)"
Acc_Err[3,2] <- "pca"
Acc_Err[3,3] <- percent(confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['Accuracy'],
accuracy = 0.01)
Acc_Err[3,4] <- paste(percent(1-confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_lda_no_na_pca, sub_testing_no_na_pca),
sub_testing_no_na_pca$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[3,5] <- percent(model_lda_no_na_pca$results[1,2], accuracy = 0.01)
Acc_Err[3,6] <- paste(percent(1-model_lda_no_na_pca$results[1,2]-1.96*model_lda_no_na_pca$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-model_lda_no_na_pca$results[1,2]+1.96*model_lda_no_na_pca$results[1,4]^2, accuracy =0.01))
Acc_Err[3,7] <- format(round(as.numeric(model_lda_no_na_pca$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
model_gbm_5_no_na <- train(classe ~ ., data = sub_training_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=5,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
Acc_Err[4,1] <- "Gradient Boosting Machine (GBM)"
Acc_Err[4,2] <- "5 trees"
Acc_Err[4,3] <- percent(confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[4,4] <- paste(percent(1-confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_gbm_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[4,5] <- percent(model_gbm_5_no_na$results[1,5], accuracy = 0.01)
Acc_Err[4,6] <- paste(percent(1-model_gbm_5_no_na$results[1,5]-1.96*model_gbm_5_no_na$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-model_gbm_5_no_na$results[1,5]+1.96*model_gbm_5_no_na$results[1,7]^2, accuracy =0.01))
Acc_Err[4,7] <- format(round(as.numeric(model_gbm_5_no_na$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
model_gbm_10_no_na <- train(classe ~ ., data = sub_training_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=10,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
Acc_Err[5,1] <- "Gradient Boosting Machine (GBM)"
Acc_Err[5,2] <- "10 trees"
Acc_Err[5,3] <- percent(confusionMatrix(predict(model_gbm_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[5,4] <- paste(percent(1-confusionMatrix(predict(model_gbm_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_gbm_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[5,5] <- percent(model_gbm_10_no_na$results[1,5], accuracy = 0.01)
Acc_Err[5,6] <- paste(percent(1-model_gbm_10_no_na$results[1,5]-1.96*model_gbm_10_no_na$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-model_gbm_10_no_na$results[1,5]+1.96*model_gbm_10_no_na$results[1,7]^2, accuracy =0.01))
Acc_Err[5,7] <- format(round(as.numeric(model_gbm_10_no_na$times$everything["user.self"]),1), nsmall = 1)
#-------------------------------------------------------------------------------------#
set.seed(5959)
model_gbm_20_no_na <- train(classe ~ ., data = sub_training_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=20,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
Acc_Err[6,1] <- "Gradient Boosting Machine (GBM)"
Acc_Err[6,2] <- "20 trees"
Acc_Err[6,3] <- percent(confusionMatrix(predict(model_gbm_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[6,4] <- paste(percent(1-confusionMatrix(predict(model_gbm_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_gbm_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[6,5] <- percent(model_gbm_20_no_na$results[1,5], accuracy = 0.01)
Acc_Err[6,6] <- paste(percent(1-model_gbm_20_no_na$results[1,5]-1.96*model_gbm_20_no_na$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-model_gbm_20_no_na$results[1,5]+1.96*model_gbm_20_no_na$results[1,7]^2, accuracy =0.01))
Acc_Err[6,7] <- format(round(as.numeric(model_gbm_20_no_na$times$everything["user.self"]),1), nsmall = 1)
#-------------------------------------------------------------------------------------#
set.seed(5959)
model_gbm_50_no_na <- train(classe ~ ., data = sub_training_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=50,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
Acc_Err[7,1] <- "Gradient Boosting Machine (GBM)"
Acc_Err[7,2] <- "50 trees"
Acc_Err[7,3] <- percent(confusionMatrix(predict(model_gbm_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[7,4] <- paste(percent(1-confusionMatrix(predict(model_gbm_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_gbm_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[7,5] <- percent(model_gbm_50_no_na$results[1,5], accuracy = 0.01)
Acc_Err[7,6] <- paste(percent(1-model_gbm_50_no_na$results[1,5]-1.96*model_gbm_50_no_na$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-model_gbm_50_no_na$results[1,5]+1.96*model_gbm_50_no_na$results[1,7]^2, accuracy =0.01))
Acc_Err[7,7] <- format(round(as.numeric(model_gbm_50_no_na$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
model_rf_5_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=5, trControl=fitControl)
Acc_Err[8,1] <- "Random Forest (RF)"
Acc_Err[8,2] <- "5 trees"
Acc_Err[8,3] <- percent(confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[8,4] <- paste(percent(1-confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_5_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_5_no_na$results, mtry==as.numeric(model_rf_5_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_5_no_na$results, mtry==as.numeric(model_rf_5_no_na$bestTune))$AccuracySD
Acc_Err[8,5] <- percent(Accuracy,.01)
Acc_Err[8,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[8,7] <- format(round(as.numeric(model_rf_5_no_na$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
model_rf_10_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=10, trControl=fitControl)
Acc_Err[9,1] <- "Random Forest (RF)"
Acc_Err[9,2] <- "10 trees"
Acc_Err[9,3] <- percent(confusionMatrix(predict(model_rf_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[9,4] <- paste(percent(1-confusionMatrix(predict(model_rf_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_10_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_10_no_na$results, mtry==as.numeric(model_rf_10_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_10_no_na$results, mtry==as.numeric(model_rf_10_no_na$bestTune))$AccuracySD
Acc_Err[9,5] <- percent(Accuracy,.01)
Acc_Err[9,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[9,7] <- format(round(as.numeric(model_rf_10_no_na$times$everything["user.self"]),1), nsmall = 1)
#-------------------------------------------------------------------------------------#
set.seed(5959)
model_rf_20_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=20, trControl=fitControl)
Acc_Err[10,1] <- "Random Forest (RF)"
Acc_Err[10,2] <- "20 trees"
Acc_Err[10,3] <- percent(confusionMatrix(predict(model_rf_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[10,4] <- paste(percent(1-confusionMatrix(predict(model_rf_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_20_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_20_no_na$results, mtry==as.numeric(model_rf_20_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_20_no_na$results, mtry==as.numeric(model_rf_20_no_na$bestTune))$AccuracySD
Acc_Err[10,5] <- percent(Accuracy,.01)
Acc_Err[10,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[10,7] <- format(round(as.numeric(model_rf_20_no_na$times$everything["user.self"]),1), nsmall = 1)
#-------------------------------------------------------------------------------------#
set.seed(5959)
model_rf_50_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=50, trControl=fitControl)
Acc_Err[11,1] <- "Random Forest (RF)"
Acc_Err[11,2] <- "50 trees"
Acc_Err[11,3] <- percent(confusionMatrix(predict(model_rf_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[11,4] <- paste(percent(1-confusionMatrix(predict(model_rf_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_50_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_50_no_na$results, mtry==as.numeric(model_rf_50_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_50_no_na$results, mtry==as.numeric(model_rf_50_no_na$bestTune))$AccuracySD
Acc_Err[11,5] <- percent(Accuracy,.01)
Acc_Err[11,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[11,7] <- format(round(as.numeric(model_rf_50_no_na$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
model_svml_no_na <- train(classe ~ ., data = sub_training_no_na, method = "svmLinear", trControl=fitControl)
Acc_Err[12,1] <- "Support Vector Machine (SVM)"
Acc_Err[12,2] <- "Linear"
Acc_Err[12,3] <- percent(confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[12,4] <- paste(percent(1-confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_svml_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- model_svml_no_na$results$Accuracy
AccuracySD <- model_svml_no_na$results$AccuracySD
Acc_Err[12,5] <- percent(Accuracy,.01)
Acc_Err[12,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[12,7] <- format(round(as.numeric(model_svml_no_na$times$everything["user.self"]),1), nsmall = 1)
sub_training_stacked_no_na <- data.frame("rpart"=predict(model_rpart_no_na, sub_training_no_na),
"gbm"=predict(model_gbm_50_no_na, sub_training_no_na),
"rf"=predict(model_rf_5_no_na, sub_training_no_na),
"svm"=predict(model_svml_no_na, sub_training_no_na),
"classe"=sub_training_no_na$classe)
sub_testing_stacked_no_na <- data.frame("rpart"=predict(model_rpart_no_na, sub_testing_no_na),
"gbm"=predict(model_gbm_50_no_na, sub_testing_no_na),
"rf"=predict(model_rf_5_no_na, sub_testing_no_na),
"svm"=predict(model_svml_no_na, sub_testing_no_na),
"classe"=sub_testing_no_na$classe)
set.seed(5959)
stacked_model_rf_5 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rf", ntree=5, trControl=fitControl)
Acc_Err[13,1] <- "RPART + GBM_50 + RF_5 + SVM :: RF"
Acc_Err[13,2] <- "5 trees"
Acc_Err[13,3] <- percent(confusionMatrix(predict(stacked_model_rf_5, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[13,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_rf_5, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_rf_5, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(stacked_model_rf_5$results, mtry==as.numeric(stacked_model_rf_5$bestTune))$Accuracy
AccuracySD <- filter(stacked_model_rf_5$results, mtry==as.numeric(stacked_model_rf_5$bestTune))$AccuracySD
Acc_Err[13,5] <- percent(Accuracy,.01)
Acc_Err[13,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[13,7] <- format(round(as.numeric(stacked_model_rf_5$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
stacked_model_rf_10 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rf", ntree=10, trControl=fitControl)
Acc_Err[14,1] <- "RPART + GBM_50 + RF_5 + SVM :: RF"
Acc_Err[14,2] <- "10 trees"
Acc_Err[14,3] <- percent(confusionMatrix(predict(stacked_model_rf_10, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[14,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_rf_10, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_rf_10, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(stacked_model_rf_10$results, mtry==as.numeric(stacked_model_rf_10$bestTune))$Accuracy
AccuracySD <- filter(stacked_model_rf_10$results, mtry==as.numeric(stacked_model_rf_10$bestTune))$AccuracySD
Acc_Err[14,5] <- percent(Accuracy,.01)
Acc_Err[14,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[14,7] <- format(round(as.numeric(stacked_model_rf_10$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
stacked_model_gbm_50 <- train(classe ~ ., data = sub_training_stacked_no_na, method = "gbm",
tuneGrid=expand.grid(n.trees=50,
interaction.depth=1,
shrinkage=.1,
n.minobsinnode=10),
trControl=fitControl)
Acc_Err[15,1] <- "RPART + GBM_5 + RF_5 + SVM :: GBM"
Acc_Err[15,2] <- "50 trees"
Acc_Err[15,3] <- percent(confusionMatrix(predict(stacked_model_gbm_50, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[15,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_gbm_50, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_gbm_50, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[15,7] <- format(round(as.numeric(stacked_model_gbm_50$times$everything["user.self"]),1), nsmall = 1)
Acc_Err[15,5] <- percent(stacked_model_gbm_50$results[1,5], accuracy = 0.01)
Acc_Err[15,6] <- paste(percent(1-stacked_model_gbm_50$results[1,5]-1.96*stacked_model_gbm_50$results[1,7]^2, accuracy = 0.01),
"-",
percent(1-stacked_model_gbm_50$results[1,5]+1.96*stacked_model_gbm_50$results[1,7]^2, accuracy =0.01))
Acc_Err[15,7] <- format(round(as.numeric(stacked_model_gbm_50$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
stacked_model_rpart <- train(classe ~ ., data = sub_training_stacked_no_na, method = "rpart", trControl=fitControl)
Acc_Err[16,1] <- "RPART + GBM_50 + RF_5 + SVM :: RPART"
Acc_Err[16,2] <- " "
Acc_Err[16,3] <- percent(confusionMatrix(predict(stacked_model_rpart, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[16,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_rpart, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_rpart, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Acc_Err[16,5] <- percent(stacked_model_rpart$results[1,2], accuracy = 0.01)
Acc_Err[16,6] <- paste(percent(1-stacked_model_rpart$results[1,2]-1.96*stacked_model_rpart$results[1,4]^2, accuracy = 0.01),
"-",
percent(1-stacked_model_rpart$results[1,2]+1.96*stacked_model_rpart$results[1,4]^2, accuracy = 0.01))
Acc_Err[16,7] <- format(round(as.numeric(stacked_model_rpart$times$everything["user.self"]),1), nsmall = 1)
set.seed(5959)
stacked_model_svml <- train(classe ~ ., data = sub_training_stacked_no_na, method = "svmLinear", trControl=fitControl)
Acc_Err[17,1] <- "RPART + GBM_50 + RF_5 + SVM :: SVM"
Acc_Err[17,2] <- "Linear"
Acc_Err[17,3] <- percent(confusionMatrix(predict(stacked_model_svml, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[17,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_svml, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_svml, sub_testing_stacked_no_na),
sub_testing_stacked_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- stacked_model_svml$results$Accuracy
AccuracySD <- stacked_model_svml$results$AccuracySD
Acc_Err[17,5] <- percent(Accuracy,.01)
Acc_Err[17,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[17,7] <- format(round(as.numeric(stacked_model_svml$times$everything["user.self"]),1), nsmall = 1)
sub_training_stacked_X_no_na <- data.frame("gbm"=predict(model_gbm_50_no_na, sub_training_no_na),
"rf"=predict(model_rf_50_no_na, sub_training_no_na),
"svm"=predict(model_svml_no_na, sub_training_no_na),
"classe"=sub_training_no_na$classe)
sub_testing_stacked_X_no_na <- data.frame("gbm"=predict(model_gbm_50_no_na, sub_testing_no_na),
"rf"=predict(model_rf_50_no_na, sub_testing_no_na),
"svm"=predict(model_svml_no_na, sub_testing_no_na),
"classe"=sub_testing_no_na$classe)
set.seed(5959)
stacked_model_X_rf_50 <- train(classe ~ ., data = sub_training_stacked_X_no_na, method = "rf", ntree=5, trControl=fitControl)
Acc_Err[18,1] <- "GBM_50 + RF_50 + SVM :: RF"
Acc_Err[18,2] <- "50 trees"
Acc_Err[18,3] <- percent(confusionMatrix(predict(stacked_model_X_rf_50, sub_testing_stacked_X_no_na),
sub_testing_stacked_X_no_na$classe)$overall['Accuracy'],
accuracy = .01)
Acc_Err[18,4] <- paste(percent(1-confusionMatrix(predict(stacked_model_X_rf_50, sub_testing_stacked_X_no_na),
sub_testing_stacked_X_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(stacked_model_X_rf_50, sub_testing_stacked_X_no_na),
sub_testing_stacked_X_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(stacked_model_X_rf_50$results, mtry==as.numeric(stacked_model_X_rf_50$bestTune))$Accuracy
AccuracySD <- filter(stacked_model_X_rf_50$results, mtry==as.numeric(stacked_model_X_rf_50$bestTune))$AccuracySD
Acc_Err[18,5] <- percent(Accuracy,.01)
Acc_Err[18,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
Acc_Err[18,7] <- format(round(as.numeric(stacked_model_X_rf_50$times$everything["user.self"]),1), nsmall = 1)
Acc_Err %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
table(sub_training_no_na$classe)
knn_no_na_table <- data.frame("Model" = NA, "Specs"=NA, "HO_Accuracy" = NA, "HO_Out_of_Sample_Error" =NA, "CV_Accuracy" = NA, "CV_Out_of_Sample_Error"=NA, "User_Time"=NA)
knn_no_na_table[1,] <- Acc_Err[11,]
knn_no_na_table[1,1] <- "Random Forest (50 trees)"
knn_no_na_table[1,2] <- "no NA variables"
set.seed(5959)
model_rf_50_knn <- train(classe ~ ., data = sub_training_knn, method = "rf", ntree=50, trControl=fitControl)
knn_no_na_table[2,1] <- "Random Forest (50 trees)"
knn_no_na_table[2,2] <- "knn imputed NAs"
knn_no_na_table[2,3] <- percent(confusionMatrix(predict(model_rf_50_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['Accuracy'],
accuracy = .01)
knn_no_na_table[2,4] <- paste(percent(1-confusionMatrix(predict(model_rf_50_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_50_knn, sub_testing_knn),
sub_testing_knn$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_50_knn$results, mtry==as.numeric(model_rf_50_knn$bestTune))$Accuracy
AccuracySD <- filter(model_rf_50_knn$results, mtry==as.numeric(model_rf_50_knn$bestTune))$AccuracySD
knn_no_na_table[2,5] <- percent(Accuracy,.01)
knn_no_na_table[2,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
knn_no_na_table[2,7] <- format(round(as.numeric(model_rf_50_knn$times$everything["user.self"]),1), nsmall = 1)
knn_no_na_table %>%
kable(align = "c", col.names = c("Model","Specs","Accuracy","Out-of-Sample Error","Accuracy","Out-of-Sample Error", "User Time (sec)")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, position = "left") %>%
add_header_above(c(" " = 2, "Hold-out Validation" = 2, "25-fold Cross-Validation" = 2," " = 1))
set.seed(5959)
model_rf_100_no_na <- train(classe ~ ., data = sub_training_no_na, method = "rf", ntree=100, trControl=fitControl)
knn_no_na_table[1,1] <- "Random Forest (RF)"
knn_no_na_table[1,2] <- "50 trees"
knn_no_na_table[2,1] <- "Random Forest (RF)"
knn_no_na_table[2,2] <- "100 trees"
knn_no_na_table[2,3] <- percent(confusionMatrix(predict(model_rf_100_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['Accuracy'],
accuracy = .01)
knn_no_na_table[2,4] <- paste(percent(1-confusionMatrix(predict(model_rf_100_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyUpper'],
accuracy = .01),
"-",
percent(1-confusionMatrix(predict(model_rf_100_no_na, sub_testing_no_na),
sub_testing_no_na$classe)$overall['AccuracyLower'],
accuracy = .01))
Accuracy <- filter(model_rf_100_no_na$results, mtry==as.numeric(model_rf_100_no_na$bestTune))$Accuracy
AccuracySD <- filter(model_rf_100_no_na$results, mtry==as.numeric(model_rf_100_no_na$bestTune))$AccuracySD
knn_no_na_table[2,5] <- percent(Accuracy,.01)
knn_no_na_table[2,6] <- paste(percent(1-Accuracy-AccuracySD^2,.01),
"-",
percent(1-Accuracy+AccuracySD^2,.01))
knn_no_na_table[2,7] <- format(round(as.numeric(model_rf_100_no_na$times$everything["user.self"]),1), nsmall = 1)
model_rf_100_no_na
varImp(model_rf_100_no_na)
testing <- testing [,-(1:7)]
testing <- testing[which(nzv$nzv==0)]
testing_no_na <- testing[,-no_na]
predict(model_rf_100_no_na, testing_no_na)