This is the report for the final project of the Practical Machine Learning course at Johns Hopkins University. The document was created in RStudio, and all the code is based on R version 4.0.3. In this project, we show the process of training a machine learning algorithm that predicts the manner in which people perform an exercise, using personal activity data collected by smart wearable products such as the Apple Watch. This report contains the following sections:
These sections are intentionally presented in a way that reflects the data science pipeline suggested by the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Shearer 2000).
Human Activity Recognition (HAR) has emerged as a key research area in recent years and is gaining increasing attention. Devices such as the Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively.
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information related to the data for this course project is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset) (Ugulino et al. 2012).
These are the packages that will be used in this project:
library(tidyverse) # Data manipulation
library(lattice) # Graphing
library(caret) # Machine learning framework
library(rpart) # Recursive partitioning (decision trees)
library(rpart.plot) # Decision tree plotting
library(corrplot) # Correlation matrix plots (bivariate analysis)
library(rattle) # fancyRpartPlot() for tree visualization
library(randomForest) # Classical random forest implementation
library(nnet) # Multinomial logistic regression (multinom)
library(ranger) # Fast C++ implementation of random forest
library(MLmetrics) # Model evaluation metrics (e.g., F1 score)
library(RColorBrewer) # Color palettes for graphing
library(xgboost) # Gradient boosting
set.seed(188)
The data are accessed from the following sources:
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(url_train), strip.white = TRUE, na.strings = c("NA",""))
testing <- read.csv(url(url_test), strip.white = TRUE, na.strings = c("NA",""))
# Check the Dimensions
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Now, we further split the training data into a training set and a calibration set. The testing data will be saved for answering the quiz questions accompanying the final project assessment of this course.
cal_index <- createDataPartition(training$classe, p=.25, list=FALSE)
training1 <- training[-cal_index, ]
calibration <- training[cal_index, ]
# Check dimensions again
dim(training1)
## [1] 14715 160
dim(calibration)
## [1] 4907 160
Now, we check for missing values. Our analysis (code shown below) shows that, in our training set, features either have zero missing values or have a very high proportion of missing values (over 90% of values missing).
# Column-wise missing value check
sapply(training1, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt kurtosis_roll_belt
## 0 0 14411
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 14411 14411 14411
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 14411 14411 14411
## max_picth_belt max_yaw_belt min_roll_belt
## 14411 14411 14411
## min_pitch_belt min_yaw_belt amplitude_roll_belt
## 14411 14411 14411
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## 14411 14411 14411
## avg_roll_belt stddev_roll_belt var_roll_belt
## 14411 14411 14411
## avg_pitch_belt stddev_pitch_belt var_pitch_belt
## 14411 14411 14411
## avg_yaw_belt stddev_yaw_belt var_yaw_belt
## 14411 14411 14411
## gyros_belt_x gyros_belt_y gyros_belt_z
## 0 0 0
## accel_belt_x accel_belt_y accel_belt_z
## 0 0 0
## magnet_belt_x magnet_belt_y magnet_belt_z
## 0 0 0
## roll_arm pitch_arm yaw_arm
## 0 0 0
## total_accel_arm var_accel_arm avg_roll_arm
## 0 14411 14411
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 14411 14411 14411
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 14411 14411 14411
## stddev_yaw_arm var_yaw_arm gyros_arm_x
## 14411 14411 0
## gyros_arm_y gyros_arm_z accel_arm_x
## 0 0 0
## accel_arm_y accel_arm_z magnet_arm_x
## 0 0 0
## magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 14411
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## 14411 14411 14411
## skewness_pitch_arm skewness_yaw_arm max_roll_arm
## 14411 14411 14411
## max_picth_arm max_yaw_arm min_roll_arm
## 14411 14411 14411
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 14411 14411 14411
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## 14411 14411 0
## pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 0 0 14411
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 14411 14411 14411
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 14411 14411 14411
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 14411 14411 14411
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 14411 14411 14411
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 14411 14411 0
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## 14411 14411 14411
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## 14411 14411 14411
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## 14411 14411 14411
## var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 14411 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 0 14411 14411
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 14411 14411 14411
## skewness_yaw_forearm max_roll_forearm max_picth_forearm
## 14411 14411 14411
## max_yaw_forearm min_roll_forearm min_pitch_forearm
## 14411 14411 14411
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 14411 14411 14411
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 14411 0 14411
## avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 14411 14411 14411
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## 14411 14411 14411
## avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## 14411 14411 14411
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## 0 0 0
## accel_forearm_x accel_forearm_y accel_forearm_z
## 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 0 0 0
## classe
## 0
Therefore, it is not meaningful to conduct data imputation here. We simply discard the features with missing values and keep only those with zero missing values, as shown below:
na_var <- sapply(training1, function(x) mean(is.na(x))) > 0.90
training2 <- training1[ ,na_var == FALSE]
calibration1 <- calibration[ ,na_var == FALSE]
dim(training2)
## [1] 14715 60
dim(calibration1)
## [1] 4907 60
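As an aside, had the missing proportions been small, imputation would likely have been preferable to dropping columns. A minimal sketch of how this could be done with caret's preProcess is shown below; df_small_na is a hypothetical data frame of numeric predictors with only sparse missing values, not an object used elsewhere in this report.
# Sketch only, not run in this project. `df_small_na` is hypothetical.
impute_pre <- preProcess(df_small_na, method = "knnImpute", k = 5)
# Note: knnImpute also centers and scales the predictors as part of the transformation
df_imputed <- predict(impute_pre, df_small_na)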
We can see that there are now no missing values in either the training set or the calibration set, as shown below:
sapply(training2, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
sapply(calibration1, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
Taking a closer look at our training set, we can see that there are also several near-zero-variance predictors that may affect the model-building process later on, as presented below. Notice that when deciding which features are near-zero-variance, we used more aggressive cutoffs than the defaults, both for the ratio of the most common value to the second most common value (freqCut = 2) and for the percentage of distinct values out of the total number of samples (uniqueCut = 20).
nzv <- nearZeroVar(training2, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
nzv[nzv$nzv==TRUE | nzv$zeroVar==TRUE, ]
## freqRatio percentUnique zeroVar nzv
## new_window 47.40461 0.01359157 FALSE TRUE
## roll_arm 51.14000 16.75840979 FALSE TRUE
## pitch_arm 85.26667 19.39517499 FALSE TRUE
## yaw_arm 31.18293 18.21270812 FALSE TRUE
## roll_forearm 11.65217 13.13625552 FALSE TRUE
## pitch_forearm 62.74468 18.29425756 FALSE TRUE
## yaw_forearm 16.19780 12.34114849 FALSE TRUE
We now remove these near-zero-variance features from both the training set and the calibration set.
training3 <- training2[, nzv$nzv==FALSE]
calibration2 <- calibration1[ , nzv$nzv==FALSE]
We can see that there are no longer any near-zero-variance features in either the training set or the calibration set, as shown below:
nzv_check_train <- nearZeroVar(training3, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
TRUE %in% nzv_check_train$nzv
## [1] FALSE
nzv_check_cal <- nearZeroVar(calibration2, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
TRUE %in% nzv_check_cal$nzv
## [1] FALSE
Because we are going to construct a correlation matrix in the next section, we also remove the first five columns, which contain only identification information. In a real-world data science project, this identification information would later be joined back to the data.
tr <- training3[, -(1:5)]
cal <- calibration2[, -(1:5)]
dim(tr)
## [1] 14715 48
dim(cal)
## [1] 4907 48
As shown above, after data preparation we eventually keep 48 columns (47 predictors plus the outcome classe) for model building. Now we will perform some exploratory analysis.
Before any actual model building, it is always good to check the bivariate correlations between features. From the correlation matrix plot shown below, we can see that most features are not strongly correlated with each other, so multicollinearity is not a serious issue. In addition, we now have 14715 observations in our training set but only 47 predictors. This suggests that it is reasonable to proceed without performing dimensionality reduction, such as Singular Value Decomposition (SVD) or Principal Component Analysis (PCA).
corr_matrix <- cor(tr[ , -48])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
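Although we proceed without dimensionality reduction, a minimal sketch of how PCA could be applied through caret's preProcess is shown below for reference. The 0.95 variance threshold is an illustrative assumption and is not used anywhere else in this report.
# Sketch only: applying PCA in case the predictors had been highly correlated.
# thresh = 0.95 keeps enough principal components to explain 95% of the variance.
pca_pre <- preProcess(tr[ , -48], method = "pca", thresh = 0.95)
tr_pca <- predict(pca_pre, tr[ , -48])
cal_pca <- predict(pca_pre, cal[ , -48])
dim(tr_pca) # fewer columns than the original 47 predictors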
Now, we can begin training our models.
A good starting point for any model training process is a linear model. We now fit a Multinomial Logistic Model to our training set and use it as a baseline.
set.seed(666)
tr$classe <- as.factor(tr$classe)
tr$classe <- relevel(tr$classe, ref = "A")
fit_mln <- multinom(classe ~ ., data = tr)
## # weights: 245 (192 variable)
## initial value 23682.878881
## iter 10 value 18970.458140
## iter 20 value 16736.978040
## iter 30 value 15468.457278
## iter 40 value 14548.008176
## iter 50 value 13946.246911
## iter 60 value 13582.794412
## iter 70 value 13290.060002
## iter 80 value 13161.572771
## iter 90 value 13059.445385
## iter 100 value 12958.337220
## final value 12958.337220
## stopped after 100 iterations
pred_mln <- predict(fit_mln, tr)
conf_m <- confusionMatrix(pred_mln, tr$classe)
conf_m$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.795107e-01 5.929171e-01 6.719018e-01 6.870486e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.476108e-133
We can see that with a simple Multinomial Logistic Model we achieve an overall accuracy of about 67.95% on the training set, as shown above. This is not bad, but it can certainly be improved.
Let’s try a more sophisticated linear model: Linear Discriminant Analysis (LDA).
set.seed(666)
fit_lda <- train(classe ~ ., data = tr, method = "lda")
pred_lda <- predict(fit_lda, tr)
confusionMatrix(pred_lda, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.032960e-01 6.237705e-01 6.958416e-01 7.106699e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 9.250318e-145
By using LDA, we achieve an overall accuracy of about 70%, as shown above, which is not much of an improvement over the logistic model.
Let’s try another discriminant model: Quadratic Discriminant Analysis (QDA), which, unlike LDA, allows quadratic rather than linear decision boundaries.
set.seed(666)
fit_qda <- train(classe ~ ., data = tr, method = "qda")
pred_qda <- predict(fit_qda, tr)
confusionMatrix(pred_qda, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.658512e-01 8.311768e-01 8.602382e-01 8.713190e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 9.149099e-223
By using QDA, we boost our prediction performance to about 87%. However, if we want to do better, we must turn to more advanced machine learning models that can capture the non-linearity in the data.
Let’s try tree models, starting with a basic decision tree model.
set.seed(666)
fit_dt <- rpart(classe ~ ., data = tr, method="class")
fancyRpartPlot(fit_dt)
Since our model is moderately complex, the decision tree plot gets crowded and is hard to interpret, so we will measure the goodness of fit with summary statistics instead.
pred_dt <- predict(fit_dt, tr, type="class")
confusionMatrix(pred_dt, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.662589e-01 5.769591e-01 6.585737e-01 6.738784e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.891569e-97
F1_Score(pred_dt, tr$classe)
## [1] 0.6967846
From the above, we can see that both the overall accuracy and the F1 score are unsatisfactory for the basic decision tree model; the accuracy is even lower than that of the linear models!
Let’s try the Random Forest Model.
set.seed(688)
control_rf <- trainControl(method="cv", number=3, verboseIter=FALSE)
# It takes some time to run the following code
fit_rf <- train(x=tr[,1:47], y=tr$classe, method = "ranger",
trControl = control_rf)
## Growing trees.. Progress: 92%. Estimated remaining time: 2 seconds.
## Growing trees.. Progress: 97%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 69%. Estimated remaining time: 13 seconds.
## Growing trees.. Progress: 87%. Estimated remaining time: 4 seconds.
## Growing trees.. Progress: 85%. Estimated remaining time: 5 seconds.
pred_rf <- predict(fit_rf, tr)
fit_rf
## Random Forest
##
## 14715 samples
## 47 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 9811, 9809, 9810
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9925246 0.9905429
## 2 extratrees 0.9884471 0.9853837
## 24 gini 0.9947671 0.9933800
## 24 extratrees 0.9932722 0.9914886
## 47 gini 0.9923886 0.9903712
## 47 extratrees 0.9935440 0.9918330
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 24, splitrule = gini
## and min.node.size = 1.
confusionMatrix(pred_rf, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1.0000000 1.0000000 0.9997493 1.0000000 0.2844037
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We can see from the above that with the Random Forest Model, with the number of variables available for splitting at each tree node (mtry) set to 24, we reach a surprisingly high cross-validated accuracy of over 99%. Note that in this project we used the ranger package, a fast C++ implementation of the Random Forest algorithm. We also tried the classical randomForest package, but it required more computing resources than our equipment could provide.
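For readers who prefer to call ranger directly instead of through caret, a minimal sketch is shown below; the num.trees and importance settings are illustrative assumptions, not the values selected by caret above.
# Sketch: fitting ranger directly on the prepared training set `tr`.
fit_ranger_direct <- ranger(classe ~ ., data = tr,
                            num.trees = 500, mtry = 24,
                            importance = "impurity") # impurity-based variable importance
fit_ranger_direct$prediction.error # out-of-bag (OOB) error estimate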
Let’s try one last algorithm: the Gradient-Boosted Tree Model (not covered in the course content). In our experience, gradient-boosted trees also usually deliver satisfactory prediction performance.
set.seed(666)
fit_xgboost <- caret::train(classe ~ .,
data = tr,
method = "gbm",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
verboseIter = FALSE),
verbose = 0)
pred_xgboost <- predict(fit_xgboost, tr)
confusionMatrix(pred_xgboost, tr$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4182 10 0 0 1
## B 2 2814 8 7 8
## C 0 21 2551 32 4
## D 0 1 4 2369 24
## E 1 1 3 4 2668
##
## Overall Statistics
##
## Accuracy : 0.9911
## 95% CI : (0.9894, 0.9926)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9887
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9884 0.9942 0.9822 0.9863
## Specificity 0.9990 0.9979 0.9953 0.9976 0.9993
## Pos Pred Value 0.9974 0.9912 0.9781 0.9879 0.9966
## Neg Pred Value 0.9997 0.9972 0.9988 0.9965 0.9969
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1912 0.1734 0.1610 0.1813
## Detection Prevalence 0.2849 0.1929 0.1772 0.1630 0.1819
## Balanced Accuracy 0.9991 0.9932 0.9947 0.9899 0.9928
The Gradient-Boosted Tree Model has three main tuning parameters: the number of trees (n.trees), the tree depth (interaction.depth), and the learning rate (shrinkage). The train function in caret automatically uses cross-validation to select the parameter values that provide the best performance. From the above model-fitting result, we can see that training the model with 150 trees, a depth of 3, and a learning rate of 0.1 gives an overall accuracy of around 99% on the training set. Another evaluation metric, Cohen's Kappa, also reaches about 0.99. This is exciting!
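If we wanted to control the parameter search explicitly instead of relying on caret's default grid, we could supply a tuneGrid. The sketch below uses illustrative values (caret's gbm method also requires n.minobsinnode in the grid); it is not the grid used to produce the results above.
# Sketch only: an explicit tuning grid for gbm through caret. Values are illustrative.
gbm_grid <- expand.grid(n.trees = c(100, 150),
                        interaction.depth = c(2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
fit_gbm_grid <- train(classe ~ ., data = tr, method = "gbm",
                      trControl = trainControl(method = "cv", number = 5),
                      tuneGrid = gbm_grid, verbose = FALSE)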
However, we must be cautious, because we know that our top performers, the Gradient-Boosted Tree Model and the Random Forest Model, are prone to overfitting by nature. Therefore, before we move any further, we test our models on the calibration set we have left untouched up to this point.
First, let’s see how the Gradient-Boosted Tree Model performs on the calibration set:
pred_c_xgboost <- predict(fit_xgboost, cal)
confusionMatrix(pred_c_xgboost, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9863460 0.9827280 0.9826920 0.9894031 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
F1_Score(pred_c_xgboost, factor(cal$classe))
## [1] 0.99678
The Gradient-Boosted Tree Model delivers excellent prediction performance on the calibration set, as shown above. The overall accuracy reaches 98.63%, Cohen's Kappa is about 0.983, and the F1 score is about 0.997. Now let's see how the Random Forest Model performs on the calibration set:
pred_c_rf <- predict(fit_rf, cal)
confusionMatrix(pred_c_rf, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9981659 0.9976802 0.9965211 0.9991610 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
From the above result, we can see that the Random Forest Model also performs almost perfectly on the calibration set, with an overall accuracy of around 99.8%.
Now let’s see how the other models perform on the calibration set:
# for logistic model
pred_c_mln <- predict(fit_mln, cal)
confusionMatrix(pred_c_mln, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.820868e-01 5.962526e-01 6.688504e-01 6.951063e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 7.868420e-41
# for LDA
pred_c_lda <- predict(fit_lda, cal)
confusionMatrix(pred_c_lda, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.998166e-01 6.194547e-01 6.867745e-01 7.126206e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 7.642155e-44
# for QDA
pred_c_qda <- predict(fit_qda, cal)
confusionMatrix(pred_c_qda, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.604035e-01 8.243307e-01 8.503904e-01 8.699857e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 6.443071e-67
# for Decision Tree
pred_c_dt <- predict(fit_dt, cal, type='class')
confusionMatrix(pred_c_dt, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.574282e-01 5.657778e-01 6.439574e-01 6.707114e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.398610e-26
From the above results, it is reasonable to conclude that none of the models built on the training data show significant performance degradation when tested on the calibration data.
Finally, we construct an ensemble based on the majority vote of all the models we have built: the Multinomial Logistic Model, the LDA Model, the QDA Model, the Decision Tree Model, the Random Forest Model, and the Gradient-Boosted Tree Model. This simple ensemble consults what each model predicts and then makes its decision based on what the majority of the models vote for. For those observations where the six models cannot reach an over-50% majority, we accept the prediction of the top-performing model: the Random Forest Model.
model <- c("pred_c_mln", "pred_c_lda", "pred_c_qda",
"pred_c_dt", "pred_c_rf", "pred_c_xgboost")
pred <- sapply(1:6, function(x){
as.factor(get(model[x]))})
dim(pred)
## [1] 4907 6
pred <- as.data.frame(pred)
names(pred) <-c("pred_c_mln", "pred_c_lda", "pred_c_qda",
"pred_c_dt", "pred_c_rf", "pred_c_xgboost")
acc <- colMeans(as.matrix(pred)==cal$classe)
# acc
# mean(acc)
# Initialize the ensemble prediction column (0 = no majority decision yet)
pred$pred_c_ensemble <- 0
n <- 1:nrow(pred)
x <- c('A', 'B', 'C', 'D', 'E')
# Assign class a to row i whenever more than half of the row's entries equal a
# (i.e., at least 4 of the 6 model votes agree)
for (a in x){
  for (i in n){
    if (mean(pred[i,]==a)>0.5){
      pred[i,7] <- a}
  }
}
# Where no class reached a majority, fall back to the Random Forest vote (column 5)
pred[pred$pred_c_ensemble==0,7] <- pred[pred$pred_c_ensemble==0,5]
confusionMatrix(factor(pred$pred_c_ensemble), factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9702466 0.9623971 0.9651021 0.9748209 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
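For reference, the same majority-vote rule can be computed more compactly. The sketch below assumes the pred data frame built above, with the six model predictions in columns 1 through 6 and the Random Forest votes in column 5.
# Sketch: a compact majority vote over the six model columns of `pred`.
vote <- apply(pred[ , 1:6], 1, function(row) {
  tab <- table(row) # count votes per class for this observation
  if (max(tab) > 3) {
    names(tab)[which.max(tab)] # strict majority: more than 3 of the 6 votes
  } else {
    as.character(row[5]) # otherwise fall back to the Random Forest vote
  }
})
confusionMatrix(factor(vote, levels = c('A', 'B', 'C', 'D', 'E')), factor(cal$classe))$overall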
The above result indicates that the majority-vote ensemble has an overall accuracy of about 97% and a Cohen's Kappa of about 0.96. This is slightly inferior to the prediction performance of the Random Forest Model and the Gradient-Boosted Tree Model. Therefore, for this final project, we recommend the Random Forest Model, which slightly beats the Gradient-Boosted Tree Model and the ensemble model.
However, we believe the ensemble model is less likely to overfit in the real world than the Random Forest Model or the Gradient-Boosted Tree Model, because the ensemble also takes into account what the simpler models predict when making its predictions. Although simpler models such as the linear models do not predict as well as the more complex machine learning models like the Gradient-Boosted Tree Model, they are less prone to overfitting because they do not attempt to fit every tiny variation of the training data.
One last task is to evaluate the Random Forest Model on the test set we have reserved, so that the quiz questions can be answered. This also provides an extra opportunity to test the reliability of our chosen model's performance. Below, we use the Random Forest Model to predict the exercise manner (classe) for the test set.
pred_test <- as.data.frame(predict(fit_rf, testing))
pred_test
## predict(fit_rf, testing)
## 1 B
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 B
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B
Our algorithm answered all 20 quiz questions correctly, which again shows that the Random Forest Model we built predicts well.
The above sections show that the Random Forest Model provides satisfactory predictions of individuals' exercise manner from the movements recorded by the accelerometers. The overall accuracy of the Random Forest Model on the calibration data reaches 99.8%. However, given that the Random Forest algorithm tends to overfit by nature, we should expect the actual out-of-sample performance of our model to be somewhat lower than 99.8%. The ensemble model based on the majority vote of all six models built in this project should be an ideal alternative to the Random Forest Model.
Shearer, C. The CRISP-DM model: the new blueprint for data mining. Journal of Data Warehousing, 2000, 5: 13-22.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.