I. Synopsis

In this paper, we try to predict how well a person performs a weight lifting exercise (graded in 5 categories: A, B, C, D, or E), using a dataset shared on the following website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.
To do so, we run several machine learning algorithms and compare their effectiveness, using cross validation and looking at the expected out-of-sample error.
Once the model is selected, we use it to predict 20 different test cases.

II. Data Reading and Processing

Let us first load the required packages and read the files:

# Load modules
library(ggplot2)
library(caret)
library(rattle)
library(knitr)


# Read files
setwd("C:/Users/tngch/Documents/R/Learning/8. Practical Machine Learning")
trainFile = "pml-training.csv"
testFile = "pml-testing.csv"

original = read.csv(trainFile, stringsAsFactors=F)
quizz = read.csv(testFile, stringsAsFactors=F)

Let us take a look at the data.

str(original)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : chr  "" "" "" "" ...
##  $ kurtosis_picth_belt     : chr  "" "" "" "" ...
##  $ kurtosis_yaw_belt       : chr  "" "" "" "" ...
##  $ skewness_roll_belt      : chr  "" "" "" "" ...
##  $ skewness_roll_belt.1    : chr  "" "" "" "" ...
##  $ skewness_yaw_belt       : chr  "" "" "" "" ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : chr  "" "" "" "" ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : chr  "" "" "" "" ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : chr  "" "" "" "" ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : chr  "" "" "" "" ...
##  $ kurtosis_picth_arm      : chr  "" "" "" "" ...
##  $ kurtosis_yaw_arm        : chr  "" "" "" "" ...
##  $ skewness_roll_arm       : chr  "" "" "" "" ...
##  $ skewness_pitch_arm      : chr  "" "" "" "" ...
##  $ skewness_yaw_arm        : chr  "" "" "" "" ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : chr  "" "" "" "" ...
##  $ kurtosis_picth_dumbbell : chr  "" "" "" "" ...
##  $ kurtosis_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ skewness_roll_dumbbell  : chr  "" "" "" "" ...
##  $ skewness_pitch_dumbbell : chr  "" "" "" "" ...
##  $ skewness_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Let us also count the NAs and unprocessable values (“#DIV/0!”) in both datasets:

### Checking NAs
nNA1 = sum(is.na(original))
nNA2 = sum(is.na(quizz))

### Checking all Unprocessable values ("#DIV/0!") for both data
nDiv1 = 0 ; nDiv2 = 0
for (i in 1:ncol(original)) {
    if (!is.na(sum(original[,i] == "#DIV/0!"))) {
        nDiv1 = nDiv1 + sum(original[,i] == "#DIV/0!")
    }
}
for (i in 1:ncol(quizz)) {
    if (!is.na(sum(quizz[,i] == "#DIV/0!"))) {
        nDiv2 = nDiv2 + sum(quizz[,i] == "#DIV/0!")
    }
}

mat = as.data.frame(cbind(c(nNA1, nNA2),c(nDiv1, nDiv2)))
colnames(mat) = c("Number of NAs", "Number of #DIV/0!")
rownames(mat) = c("Train Dataset", "Prediction Dataset")
kable(mat, caption = "NA check", align=rep("c", 2), format.args = list(big.mark = ","))
NA check
                     Number of NAs   Number of #DIV/0!
Train Dataset          1,287,472           3,502
Prediction Dataset         2,000               0
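
The column-by-column loops above can also be written without explicit iteration. The following is a minimal sketch on a small made-up data frame (`toy`, `n_na` and `n_div` are illustrative names, not part of the original analysis), assuming that comparing a whole data frame to a string is acceptable for data of this size:

```r
# Toy data frame standing in for the real data (hypothetical values)
toy <- data.frame(
  a = c("1.5", "#DIV/0!", ""),
  b = c(NA, 2, 3),
  stringsAsFactors = FALSE
)

# Count missing values across the whole data frame
n_na <- sum(is.na(toy))

# Count "#DIV/0!" cells; na.rm = TRUE skips comparisons involving NA
n_div <- sum(toy == "#DIV/0!", na.rm = TRUE)

print(c(NAs = n_na, DIV0 = n_div))
```

Using `na.rm = TRUE` plays the same role as the `!is.na(sum(...))` guard in the loops, with one difference: the loop skips any column that contains an NA, while this version still counts the markers in the non-missing cells of such a column.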

We notice that quite a few columns contain missing values (NA) and unprocessable values (“#DIV/0!”). We will first transform all unprocessable values into NAs, and then replace all missing values with the mean of their column.
Note that we used “stringsAsFactors=F” when loading the data, so the “classe” variable was read as character and needs to be converted to a factor.
We also notice that the first 5 variables (index, user name, and 3 timestamp columns) are not useful for our prediction, so we will drop these columns.
Finally, we will drop all columns with very low variance (< 1), as well as columns with no values at all, since they won’t add anything to the model either.

We treat missing values in the Quizz dataset the same way as in the original training dataset, but we do not drop any of its columns: the prediction algorithm needs at least the same columns as the training data, so losing additional columns would make prediction impossible. When a whole column is NA and cannot be replaced by its mean, we replace the NAs with 0. Note that there is no “#DIV/0!” in the Quizz dataset.
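
To make the variance rule concrete, here is a small sketch on made-up data (`train_toy`, `steady` and `moving` are hypothetical names and values). It keeps only predictors whose variance is defined and at least 1, and always keeps the outcome column:

```r
# Toy training data (hypothetical values)
train_toy <- data.frame(
  steady = c(0.01, 0.02, 0.01, 0.02),   # variance far below 1
  moving = c(-20, 5, 30, 12),           # variance far above 1
  classe = factor(c("A", "B", "A", "B"))
)

# Variance of every predictor (everything except the outcome)
predictors <- setdiff(names(train_toy), "classe")
variances <- sapply(train_toy[predictors], var)

# Keep predictors with a defined variance of at least 1
keep <- names(variances)[!is.na(variances) & variances >= 1]
train_toy <- train_toy[, c(keep, "classe"), drop = FALSE]

print(names(train_toy))
```

One caveat worth keeping in mind: an absolute threshold such as 1 depends on the units of each variable, so a sensor measured on a small scale can be dropped even if it varies meaningfully.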

# Clean and Process
### Drop the first 5 columns, which don't seem to bring any interesting features
original = original[, c(-1, -2, -3, -4, -5)]
quizz = quizz[, c(-1, -2, -3, -4, -5)]

### Make the classe column a factor
original$classe = factor(original$classe)

### Changing "#DIV/0!" values to NAs (strictly speaking into "", as their columns
### are character; they are turned into numeric later, which changes "" to NA)
v = names(which(sapply(original, function(x) {sum(x == "#DIV/0!")}) != 0))
for(i in 1:length(v)){
    original[original[,v[[i]]]=="#DIV/0!", v[[i]]] = ""
}

### Replace all NA by mean, while transforming character columns into numeric
### (the "classe" factor column is skipped, as it has no mean)
for(i in 1:ncol(original)){
    if (is.character(original[, i])) {
        original[,i] = as.numeric(original[,i])
    }
    if (is.numeric(original[, i])) {
        original[is.na(original[,i]), i] = mean(original[,i], na.rm = TRUE)
    }
}
for(i in 1:ncol(quizz)){
    if (is.character(quizz[, i])) {
        quizz[,i] = as.numeric(quizz[,i])
    }
    quizz[is.na(quizz[,i]), i] = mean(quizz[,i], na.rm = TRUE)
}

### Drop variables with low variance (as they won't add anything to our model) or
### with no value in the entire column
### The Quizz dataset keeps all its columns, even those that are all NA or have
### small variance, so that it still fits the prediction algorithm
measure = sapply(original[,c(-ncol(original))], var)
dropName = c( names(which(is.na(measure))), names(which(measure < 1)) )
original = original[ , -which(names(original) %in% dropName)]
for(i in 1:ncol(quizz)){
    if ( sum(is.na(quizz[, i])) == nrow(quizz)) {
        quizz[is.na(quizz[,i]), i] = 0
    }
}
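
As an alternative sketch of the same cleaning idea, the unprocessable values can be turned into NAs at load time with the `na.strings` argument of `read.csv`, after which mean imputation only has to touch numeric columns. The inline CSV text and the names `csv_text` and `toy` are made up for illustration; this is not the code that produced the results in this report:

```r
# Small inline CSV standing in for the real file (hypothetical values)
csv_text <- 'x,y,classe
1.0,#DIV/0!,A
,2.5,B
3.0,,A'

# Treat "NA", empty strings and "#DIV/0!" as missing while reading
toy <- read.csv(text = csv_text,
                na.strings = c("NA", "", "#DIV/0!"),
                stringsAsFactors = FALSE)

# Replace missing values by the column mean, numeric columns only
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

print(toy)
```

With this approach the affected columns are read directly as numeric, so no character-to-numeric conversion step is needed afterwards.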

Let’s verify that both datasets are clean.

### Checking NAs
nNA1 = sum(is.na(original))
nNA2 = sum(is.na(quizz))

### Checking all Unprocessable values ("#DIV/0!") for both data
nDiv1 = 0 ; nDiv2 = 0
for (i in 1:ncol(original)) {
    if (!is.na(sum(original[,i] == "#DIV/0!"))) {
        nDiv1 = nDiv1 + sum(original[,i] == "#DIV/0!")
    }
}
for (i in 1:ncol(quizz)) {
    if (!is.na(sum(quizz[,i] == "#DIV/0!"))) {
        nDiv2 = nDiv2 + sum(quizz[,i] == "#DIV/0!")
    }
}

mat = as.data.frame(cbind(c(nNA1, nNA2),c(nDiv1, nDiv2)))
colnames(mat) = c("Number of NAs", "Number of #DIV/0!")
rownames(mat) = c("Train Dataset", "Prediction Dataset")
kable(mat, caption = "NA check", align=rep("c", 2))
NA check
                     Number of NAs   Number of #DIV/0!
Train Dataset              0                 0
Prediction Dataset         0                 0

Now that our datasets are processed, we need to split the original data into training and testing sets for the sake of cross validation. We will use 70% of the original data for training and the remaining 30% for testing.

# Split Train / Test
set.seed(3546)
indexTrain = createDataPartition(original$classe, p=.7, list=FALSE)
training = original[indexTrain, ]
testing = original[-indexTrain, ]
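
When the outcome is a factor, `createDataPartition` samples within each class so that the class proportions stay balanced across the split. The sketch below illustrates that idea in base R on made-up labels; it is not caret's implementation, and `labels`, `train_idx`, `train_labels` and `test_labels` are illustrative names:

```r
set.seed(3546)

# Toy outcome with unequal class sizes (hypothetical values)
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class (stratified sampling)
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_labels <- labels[train_idx]
test_labels <- labels[-train_idx]

# Class proportions are the same in both parts
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

On the real data, the same check can be made by comparing `prop.table(table(training$classe))` with `prop.table(table(testing$classe))`.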

We are now ready to run the machine learning algorithms.

III. Analysis

Given that the goal is classification, we can rule out algorithms aimed purely at regression, such as Linear Regression.
The 4 algorithms we will use are: CART, Random Forest, Gradient Boosting and Support Vector Machine.

III. A. CART

Let us start with the Classification and Regression Tree model (CART).

# Model building
### CART
cartModel = train(data=training, classe ~ ., method="rpart")

cartPrediction = predict(cartModel, training)
cartAccuracyTraining = sum(cartPrediction == training$classe) / length(cartPrediction)

cartPrediction = predict(cartModel, testing)
cartAccuracyTesting = sum(cartPrediction == testing$classe) / length(cartPrediction)

### Cross-Validation
cartModel$finalModel
## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 12574 8679 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -33.95 1101    6 A (0.99 0.0054 0 0 0) *
##      5) pitch_forearm>=-33.95 11473 8673 A (0.24 0.23 0.21 0.2 0.12)  
##       10) magnet_dumbbell_y< 438.5 9687 6952 A (0.28 0.18 0.24 0.19 0.11)  
##         20) roll_forearm< 123.5 6012 3557 A (0.41 0.18 0.18 0.17 0.061) *
##         21) roll_forearm>=123.5 3675 2452 C (0.076 0.18 0.33 0.23 0.19) *
##       11) magnet_dumbbell_y>=438.5 1786  875 B (0.036 0.51 0.046 0.22 0.18) *
##    3) roll_belt>=130.5 1163   11 E (0.0095 0 0 0 0.99) *
paste0("On Training, accuracy of the model: ", round(cartAccuracyTraining, 5))
## [1] "On Training, accuracy of the model: 0.49763"
paste0("On Testing, accuracy of the model: ", round(cartAccuracyTesting, 5))
## [1] "On Testing, accuracy of the model: 0.49125"
fancyRpartPlot(cartModel$finalModel)

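We can see that the accuracy of the model is very low, at about 49% on both the training and testing data. The tree also has no terminal node that predicts class D, so that class is never predicted. With such a low level of accuracy, we won’t go further in the cross-validation analysis for this model, as we will probably not use this algorithm for prediction.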
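
The accuracy figures in this section are simply the share of predictions that match the true class. As a minimal sketch, the same quantity can be computed with a small helper (`accuracy` is a hypothetical function name, and the labels below are made up):

```r
# Share of predictions equal to the true label
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}

# Toy example: 4 of the 5 predictions are correct
pred  <- factor(c("A", "B", "C", "A", "E"), levels = LETTERS[1:5])
truth <- factor(c("A", "B", "D", "A", "E"), levels = LETTERS[1:5])

acc <- accuracy(pred, truth)
print(acc)
```

Taking the mean of a logical vector is equivalent to dividing the number of matches by the number of predictions, which is how the accuracies are computed for each model here.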

III. B. Random Forest

We now try the Random Forest algorithm.

# Model building
### Random Forest
control = trainControl(method='repeatedcv', number=10, repeats=3)
tunegrid <- expand.grid(.mtry=5)
randomForestModel = train(data=training, classe ~ ., method="rf", tuneGrid=tunegrid, trControl=control)

rfPrediction = predict(randomForestModel, training)
rfAccuracyTraining = sum(rfPrediction == training$classe) / length(rfPrediction)

rfPrediction = predict(randomForestModel, testing)
rfAccuracyTesting = sum(rfPrediction == testing$classe) / length(rfPrediction)

### Cross-validation
paste0("On Training, accuracy of the model: ", round(rfAccuracyTraining, 5))
## [1] "On Training, accuracy of the model: 1"
paste0("On Testing, accuracy of the model: ", round(rfAccuracyTesting, 5))
## [1] "On Testing, accuracy of the model: 0.99371"
randomForestModel$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 0.63%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    2    0    0    0 0.0005120328
## B   15 2637    6    0    0 0.0079006772
## C    0   16 2377    3    0 0.0079298831
## D    0    0   34 2218    0 0.0150976909
## E    0    0    2    8 2515 0.0039603960

Although an accuracy of 1 on the training dataset suggests that the model may be overfitting the data, the cross-validation shows that the accuracy remains very high (99.37%) on the testing dataset.
In addition, the estimated out-of-sample (OOB) error rate is also low, at 0.63%.
Finally, the OOB confusion matrix shows an overall rate of correct classification of 99.4% (sum of the diagonal / number of observations).
All of these indicators show that the Random Forest model has very solid predictive ability.
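
As a worked check of these figures, the sketch below re-enters the OOB confusion matrix printed above (rows are the actual classes, columns the predicted ones) and recomputes the overall and per-class error rates. The variable names are illustrative:

```r
# OOB confusion matrix copied from the randomForest output above
oob <- matrix(
  c(3904,    2,    0,    0,    0,
      15, 2637,    6,    0,    0,
       0,   16, 2377,    3,    0,
       0,    0,   34, 2218,    0,
       0,    0,    2,    8, 2515),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall accuracy: correctly classified (diagonal) over all observations
oob_accuracy <- sum(diag(oob)) / sum(oob)
oob_error <- 1 - oob_accuracy

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * c(accuracy = oob_accuracy, error = oob_error), 2))
print(class_error)
```

The diagonal sums to 13,651 out of 13,737 observations, which gives an error of about 0.63% and matches the OOB estimate reported by the model.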

III. C. Gradient Boosting

Our third model will be Gradient Boosting.

# Model building
### Gradient Boosting
gradBoostModel = train(data=training, classe ~ ., method="gbm")
gbPrediction = predict(gradBoostModel, training)
gbAccuracyTraining = sum(gbPrediction == training$classe) / length(gbPrediction)

gbPrediction = predict(gradBoostModel, testing)
gbAccuracyTesting = sum(gbPrediction == testing$classe) / length(gbPrediction)

### Cross-validation
gradBoostModel$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 109 predictors of which 48 had non-zero influence.
paste0("On Training, accuracy of the model: ", round(gbAccuracyTraining, 5))
## [1] "On Training, accuracy of the model: 0.99177"
paste0("On Testing, accuracy of the model: ", round(gbAccuracyTesting, 5))
## [1] "On Testing, accuracy of the model: 0.98505"
confusionMatrix(gbPrediction, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668   13    0    1    2
##          B    4 1105    8    4    1
##          C    0   20 1013   12    4
##          D    0    1    3  947   11
##          E    2    0    2    0 1064
## 
## Overall Statistics
##                                          
##                Accuracy : 0.985          
##                  95% CI : (0.9816, 0.988)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9811         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9701   0.9873   0.9824   0.9834
## Specificity            0.9962   0.9964   0.9926   0.9970   0.9992
## Pos Pred Value         0.9905   0.9848   0.9657   0.9844   0.9963
## Neg Pred Value         0.9986   0.9929   0.9973   0.9965   0.9963
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2834   0.1878   0.1721   0.1609   0.1808
## Detection Prevalence   0.2862   0.1907   0.1782   0.1635   0.1815
## Balanced Accuracy      0.9963   0.9833   0.9900   0.9897   0.9913

Like Random Forest, the Gradient Boosting algorithm shows a very high level of accuracy, even after our cross-validation. The accuracy on both training and testing data is very high (99.18% and 98.51%), and so are the Kappa (0.98) and the per-class sensitivity and specificity shown in the confusion matrix.
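
To show where the Kappa value comes from, the sketch below re-enters the confusion matrix printed above and applies the standard Cohen's kappa formula, (observed agreement - chance agreement) / (1 - chance agreement). The variable names are illustrative:

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(1668,   13,    0,    1,    2,
       4, 1105,    8,    4,    1,
       0,   20, 1013,   12,    4,
       0,    1,    3,  947,   11,
       2,    0,    2,    0, 1064),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

n <- sum(cm)

# Observed agreement: share of observations on the diagonal
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (p_observed - p_chance) / (1 - p_chance)

# Sensitivity per class: correct predictions over the reference total
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = p_observed, kappa = kappa), 4))
print(round(sensitivity, 4))
```

Kappa discounts the agreement that would be expected from the class frequencies alone, which is why it is slightly lower than the raw accuracy.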

III. D. Support Vector Machine

Our final model will be the Support Vector Machine (SVM) algorithm.

# Model building
### Support Vector Machine
svmModel = train(data=training, classe ~ ., method="svmLinear", trControl=control)

svmPrediction = predict(svmModel, training)
svmAccuracyTraining = sum(svmPrediction == training$classe) / length(svmPrediction)

svmPrediction = predict(svmModel, testing)
svmAccuracyTesting = sum(svmPrediction == testing$classe) / length(svmPrediction)

### Cross-validation
svmModel$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 7023 
## 
## Objective Function Value : -1341.416 -1178.792 -1084.395 -683.5185 -1334.646 -901.717 -1511.35 -1192.546 -1029.485 -1151.74 
## Training error : 0.197277
paste0("On Training, accuracy of the model: ", round(svmAccuracyTraining, 5))
## [1] "On Training, accuracy of the model: 0.80272"
paste0("On Testing, accuracy of the model: ", round(svmAccuracyTesting, 5))
## [1] "On Testing, accuracy of the model: 0.78029"
confusionMatrix(svmPrediction, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1537  158   85   49   55
##          B   40  786   92   32  128
##          C   40   93  798  109   82
##          D   52   16   24  728   74
##          E    5   86   27   46  743
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7803          
##                  95% CI : (0.7695, 0.7908)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7208          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9182   0.6901   0.7778   0.7552   0.6867
## Specificity            0.9176   0.9385   0.9333   0.9663   0.9659
## Pos Pred Value         0.8158   0.7291   0.7112   0.8143   0.8192
## Neg Pred Value         0.9658   0.9266   0.9521   0.9527   0.9319
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2612   0.1336   0.1356   0.1237   0.1263
## Detection Prevalence   0.3201   0.1832   0.1907   0.1519   0.1541
## Balanced Accuracy      0.9179   0.8143   0.8555   0.8607   0.8263

Unlike the last two models, the SVM algorithm doesn’t return a very good result. Accuracy on the testing data is about 78%, and the confusion matrix shows weaker per-class results, with sensitivity below 70% for classes B and E.

III. E. Accuracy summary

Let us now compare the results of our 4 models.

mat = as.data.frame(cbind(c(cartAccuracyTraining, rfAccuracyTraining,
                            gbAccuracyTraining, svmAccuracyTraining),
                          c(cartAccuracyTesting, rfAccuracyTesting,
                            gbAccuracyTesting, svmAccuracyTesting)))
colnames(mat) = c("Training Accuracy", "Testing Accuracy")
rownames(mat) = c("CART", "Random Forest", "Gradient Boosting", "SVM")
kable(round(mat, 5), caption = "Accuracy Summary", align=rep("c", 2))
Accuracy Summary
                    Training Accuracy   Testing Accuracy
CART                     0.49763            0.49125
Random Forest            1.00000            0.99371
Gradient Boosting        0.99177            0.98505
SVM                      0.80272            0.78029

Random Forest appears to be the best model, followed by Gradient Boosting, SVM, and then CART. Let us conclude.
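
The expected out-of-sample error of each model can be estimated as one minus its accuracy on the testing data. The sketch below uses the testing accuracies reported in the summary table; the variable names are illustrative:

```r
# Testing accuracies copied from the summary table above
testing_accuracy <- c(
  "CART"              = 0.49125,
  "Random Forest"     = 0.99371,
  "Gradient Boosting" = 0.98505,
  "SVM"               = 0.78029
)

# Estimated out-of-sample error rate for each model
oos_error <- 1 - testing_accuracy

# Models ordered from lowest to highest estimated error
ranking <- names(sort(oos_error))

print(round(100 * oos_error, 2))
print(ranking)
```

For Random Forest this gives an estimated out-of-sample error of about 0.63%, in line with the OOB estimate reported by the model itself.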

IV. Conclusion

After running 4 algorithms, the Random Forest model stood out as the best. We will therefore use it to predict the values for the Quizz dataset.

answer = as.vector(predict(randomForestModel, quizz))
answer
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"