This is the report for the final project of the Practical Machine Learning course at Johns Hopkins University. The document was created in RStudio, and all the code is based on R version 4.0.3. In this project, we show the process of training a machine learning algorithm that predicts the manner in which people perform an exercise, using personal activity data collected by smart wearable products such as the Apple Watch. This report contains the following sections:
These sections are intentionally presented in a way that reflects the data science pipeline suggested by the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Shearer 2000).
Human Activity Recognition (HAR) has emerged as a key research area in recent years and is gaining increasing attention. Devices such as the Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively.
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information related to the data for this course project is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset) (Ugulino et al. 2012).
These are the packages that will be used in this project:
library(tidyverse) # Data manipulation
library(lattice) # Graphing
library(caret) # Machine learning framework
library(rpart) # Recursive partitioning (decision trees)
library(rpart.plot) # Decision tree plotting
library(corrplot) # Correlation matrix plots (bivariate analysis)
library(rattle) # fancyRpartPlot() for tree visualization
library(randomForest) # Classical random forest implementation
library(nnet) # Multinomial logistic regression (multinom)
library(ranger) # Fast C++ implementation of random forest
library(MLmetrics) # Model evaluation metrics (e.g., F1 score)
library(RColorBrewer) # Color palettes for graphing
library(xgboost) # Gradient boosting
set.seed(188)
The data are accessed from the following sources:
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(url_train), strip.white = TRUE, na.strings = c("NA",""))
testing <- read.csv(url(url_test), strip.white = TRUE, na.strings = c("NA",""))
# Check the Dimensions
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Now, we further split the training data into a training set and a calibration set. The testing data will be saved for answering the quiz questions accompanying the final project assessment of this course.
cal_index <- createDataPartition(training$classe, p=.25, list=FALSE)
training1 <- training[-cal_index, ]
calibration <- training[cal_index, ]
# Check dimensions again
dim(training1)
## [1] 14715 160
dim(calibration)
## [1] 4907 160
Now, we check for missing values. Our analysis (code shown below) shows that, in our training set, features either have zero missing values or have a very high proportion of missing values (over 90% of values missing).
# Column-wise missing value check
sapply(training1, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt kurtosis_roll_belt
## 0 0 14411
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 14411 14411 14411
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 14411 14411 14411
## max_picth_belt max_yaw_belt min_roll_belt
## 14411 14411 14411
## min_pitch_belt min_yaw_belt amplitude_roll_belt
## 14411 14411 14411
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## 14411 14411 14411
## avg_roll_belt stddev_roll_belt var_roll_belt
## 14411 14411 14411
## avg_pitch_belt stddev_pitch_belt var_pitch_belt
## 14411 14411 14411
## avg_yaw_belt stddev_yaw_belt var_yaw_belt
## 14411 14411 14411
## gyros_belt_x gyros_belt_y gyros_belt_z
## 0 0 0
## accel_belt_x accel_belt_y accel_belt_z
## 0 0 0
## magnet_belt_x magnet_belt_y magnet_belt_z
## 0 0 0
## roll_arm pitch_arm yaw_arm
## 0 0 0
## total_accel_arm var_accel_arm avg_roll_arm
## 0 14411 14411
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 14411 14411 14411
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 14411 14411 14411
## stddev_yaw_arm var_yaw_arm gyros_arm_x
## 14411 14411 0
## gyros_arm_y gyros_arm_z accel_arm_x
## 0 0 0
## accel_arm_y accel_arm_z magnet_arm_x
## 0 0 0
## magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 14411
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## 14411 14411 14411
## skewness_pitch_arm skewness_yaw_arm max_roll_arm
## 14411 14411 14411
## max_picth_arm max_yaw_arm min_roll_arm
## 14411 14411 14411
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 14411 14411 14411
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## 14411 14411 0
## pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 0 0 14411
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 14411 14411 14411
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 14411 14411 14411
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 14411 14411 14411
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 14411 14411 14411
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 14411 14411 0
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## 14411 14411 14411
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## 14411 14411 14411
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## 14411 14411 14411
## var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 14411 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 0 14411 14411
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 14411 14411 14411
## skewness_yaw_forearm max_roll_forearm max_picth_forearm
## 14411 14411 14411
## max_yaw_forearm min_roll_forearm min_pitch_forearm
## 14411 14411 14411
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 14411 14411 14411
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 14411 0 14411
## avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 14411 14411 14411
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## 14411 14411 14411
## avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## 14411 14411 14411
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## 0 0 0
## accel_forearm_x accel_forearm_y accel_forearm_z
## 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 0 0 0
## classe
## 0
Therefore, it is not meaningful to conduct data imputation here. We simply discard the features with missing values and keep only those with zero missing values, as shown below:
na_var <- sapply(training1, function(x) mean(is.na(x))) > 0.90
training2 <- training1[ ,na_var == FALSE]
calibration1 <- calibration[ ,na_var == FALSE]
dim(training2)
## [1] 14715 60
dim(calibration1)
## [1] 4907 60
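As an aside, had the missing proportions been small, imputation would likely have been preferable to dropping columns. A minimal sketch of how this could be done with caret's preProcess is shown below; df_small_na is a hypothetical data frame of numeric predictors with only sparse missing values, not an object used elsewhere in this report.
# Sketch only, not run in this project. `df_small_na` is hypothetical.
impute_pre <- preProcess(df_small_na, method = "knnImpute", k = 5)
# Note: knnImpute also centers and scales the predictors as part of the transformation
df_imputed <- predict(impute_pre, df_small_na)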
We can see that there are now no missing values in either the training set or the calibration set, as shown below:
sapply(training2, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
sapply(calibration1, function(x) sum(is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
Taking a closer look at our training set, we can see that there are also several near-zero-variance predictors that may affect the model-building process later on, as presented below. Notice that when deciding which features are near-zero-variance, we used more aggressive cutoffs than the defaults, both for the ratio of the most common value to the second most common value (freqCut = 2) and for the percentage of distinct values out of the total number of samples (uniqueCut = 20).
nzv <- nearZeroVar(training2, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
nzv[nzv$nzv==TRUE | nzv$zeroVar==TRUE, ]
## freqRatio percentUnique zeroVar nzv
## new_window 47.40461 0.01359157 FALSE TRUE
## roll_arm 51.14000 16.75840979 FALSE TRUE
## pitch_arm 85.26667 19.39517499 FALSE TRUE
## yaw_arm 31.18293 18.21270812 FALSE TRUE
## roll_forearm 11.65217 13.13625552 FALSE TRUE
## pitch_forearm 62.74468 18.29425756 FALSE TRUE
## yaw_forearm 16.19780 12.34114849 FALSE TRUE
We now remove these near-zero-variance features from both the training set and the calibration set.
training3 <- training2[, nzv$nzv==FALSE]
calibration2 <- calibration1[ , nzv$nzv==FALSE]
We can see that there are no longer any near-zero-variance features in either the training set or the calibration set, as shown below:
nzv_check_train <- nearZeroVar(training3, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
TRUE %in% nzv_check_train$nzv
## [1] FALSE
nzv_check_cal <- nearZeroVar(calibration2, saveMetrics = TRUE, freqCut = 2, uniqueCut = 20)
TRUE %in% nzv_check_cal$nzv
## [1] FALSE
Because we are going to construct a correlation matrix in the next section, we also remove the first five columns, which contain only identification information. In a real-world data science project, this identification information would later be joined back to the data.
tr <- training3[, -(1:5)]
cal <- calibration2[, -(1:5)]
dim(tr)
## [1] 14715 48
dim(cal)
## [1] 4907 48
As shown above, after data preparation we eventually keep 48 columns (47 predictors plus the outcome classe) for model building. Now we will perform some exploratory analysis.
Before any actual model building, it is always good to check the bivariate correlations between features. From the correlation matrix plot shown below, we can see that most features are not strongly correlated with each other, so multicollinearity is not a serious issue. In addition, we now have 14715 observations in our training set but only 47 predictors. This suggests that it is reasonable to proceed without performing dimensionality reduction, such as Singular Value Decomposition (SVD) or Principal Component Analysis (PCA).
corr_matrix <- cor(tr[ , -48])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
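Although we proceed without dimensionality reduction, a minimal sketch of how PCA could be applied through caret's preProcess is shown below for reference. The 0.95 variance threshold is an illustrative assumption and is not used anywhere else in this report.
# Sketch only: applying PCA in case the predictors had been highly correlated.
# thresh = 0.95 keeps enough principal components to explain 95% of the variance.
pca_pre <- preProcess(tr[ , -48], method = "pca", thresh = 0.95)
tr_pca <- predict(pca_pre, tr[ , -48])
cal_pca <- predict(pca_pre, cal[ , -48])
dim(tr_pca) # fewer columns than the original 47 predictors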
Now, we can begin training our models.
A good starting point for any model training process is a linear model. We now fit a Multinomial Logistic Model to our training set and use it as a baseline.
set.seed(666)
tr$classe <- as.factor(tr$classe)
tr$classe <- relevel(tr$classe, ref = "A")
fit_mln <- multinom(classe ~ ., data = tr)
## # weights: 245 (192 variable)
## initial value 23682.878881
## iter 10 value 18970.458140
## iter 20 value 16736.978040
## iter 30 value 15468.457278
## iter 40 value 14548.008176
## iter 50 value 13946.246911
## iter 60 value 13582.794412
## iter 70 value 13290.060002
## iter 80 value 13161.572771
## iter 90 value 13059.445385
## iter 100 value 12958.337220
## final value 12958.337220
## stopped after 100 iterations
pred_mln <- predict(fit_mln, tr)
conf_m <- confusionMatrix(pred_mln, tr$classe)
conf_m$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.795107e-01 5.929171e-01 6.719018e-01 6.870486e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.476108e-133
We can see that with a simple Multinomial Logistic Model we achieve an overall accuracy of about 67.95% on the training set, as shown above. This is not bad, but it can certainly be improved.
Let’s try a more sophisticated linear model: Linear Discriminant Analysis (LDA).
set.seed(666)
fit_lda <- train(classe ~ ., data = tr, method = "lda")
pred_lda <- predict(fit_lda, tr)
confusionMatrix(pred_lda, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.032960e-01 6.237705e-01 6.958416e-01 7.106699e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 9.250318e-145
By using LDA, we achieve an overall accuracy of about 70%, as shown above, which is not much of an improvement over the logistic model.
Let’s try another discriminant model: Quadratic Discriminant Analysis (QDA), which, unlike LDA, allows quadratic rather than linear decision boundaries.
set.seed(666)
fit_qda <- train(classe ~ ., data = tr, method = "qda")
pred_qda <- predict(fit_qda, tr)
confusionMatrix(pred_qda, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.658512e-01 8.311768e-01 8.602382e-01 8.713190e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 9.149099e-223
By using QDA, we boost our prediction performance to about 87%. However, if we want to do better, we must turn to more advanced machine learning models that can capture the non-linearity in the data.
Let’s try tree models, starting with a basic decision tree model.
set.seed(666)
fit_dt <- rpart(classe ~ ., data = tr, method="class")
fancyRpartPlot(fit_dt)
Since our model is moderately complex, the decision tree plot gets crowded and is hard to interpret, so we will measure the goodness of fit with summary statistics instead.
pred_dt <- predict(fit_dt, tr, type="class")
confusionMatrix(pred_dt, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.662589e-01 5.769591e-01 6.585737e-01 6.738784e-01 2.844037e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.891569e-97
F1_Score(pred_dt, tr$classe)
## [1] 0.6967846
From the above, we can see that both the overall accuracy and the F1 score are unsatisfactory for the basic decision tree model; the accuracy is even lower than that of the linear models!
Let’s try the Random Forest Model.
set.seed(688)
control_rf <- trainControl(method="cv", number=3, verboseIter=FALSE)
# It takes some time to run the following code
fit_rf <- train(x=tr[,1:47], y=tr$classe, method = "ranger",
trControl = control_rf)
## Growing trees.. Progress: 92%. Estimated remaining time: 2 seconds.
## Growing trees.. Progress: 97%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 69%. Estimated remaining time: 13 seconds.
## Growing trees.. Progress: 87%. Estimated remaining time: 4 seconds.
## Growing trees.. Progress: 85%. Estimated remaining time: 5 seconds.
pred_rf <- predict(fit_rf, tr)
fit_rf
## Random Forest
##
## 14715 samples
## 47 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 9811, 9809, 9810
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9925246 0.9905429
## 2 extratrees 0.9884471 0.9853837
## 24 gini 0.9947671 0.9933800
## 24 extratrees 0.9932722 0.9914886
## 47 gini 0.9923886 0.9903712
## 47 extratrees 0.9935440 0.9918330
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 24, splitrule = gini
## and min.node.size = 1.
confusionMatrix(pred_rf, tr$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1.0000000 1.0000000 0.9997493 1.0000000 0.2844037
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We can see from the above that with the Random Forest Model, with the number of variables available for splitting at each tree node (mtry) set to 24, we reach a surprisingly high cross-validated accuracy of over 99%. Note that in this project we used the ranger package, a fast C++ implementation of the Random Forest algorithm. We also tried the classical randomForest package, but it required more computing resources than our equipment could provide.
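For readers who prefer to call ranger directly instead of through caret, a minimal sketch is shown below; the num.trees and importance settings are illustrative assumptions, not the values selected by caret above.
# Sketch: fitting ranger directly on the prepared training set `tr`.
fit_ranger_direct <- ranger(classe ~ ., data = tr,
                            num.trees = 500, mtry = 24,
                            importance = "impurity") # impurity-based variable importance
fit_ranger_direct$prediction.error # out-of-bag (OOB) error estimate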
Let’s try one last algorithm: the Gradient-Boosted Tree Model (not covered in the course content). In our experience, gradient-boosted trees also usually deliver satisfactory prediction performance.
set.seed(666)
fit_xgboost <- caret::train(classe ~ .,
data = tr,
method = "gbm",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
verboseIter = FALSE),
verbose = 0)
pred_xgboost <- predict(fit_xgboost, tr)
confusionMatrix(pred_xgboost, tr$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4182 10 0 0 1
## B 2 2814 8 7 8
## C 0 21 2551 32 4
## D 0 1 4 2369 24
## E 1 1 3 4 2668
##
## Overall Statistics
##
## Accuracy : 0.9911
## 95% CI : (0.9894, 0.9926)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9887
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9884 0.9942 0.9822 0.9863
## Specificity 0.9990 0.9979 0.9953 0.9976 0.9993
## Pos Pred Value 0.9974 0.9912 0.9781 0.9879 0.9966
## Neg Pred Value 0.9997 0.9972 0.9988 0.9965 0.9969
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1912 0.1734 0.1610 0.1813
## Detection Prevalence 0.2849 0.1929 0.1772 0.1630 0.1819
## Balanced Accuracy 0.9991 0.9932 0.9947 0.9899 0.9928
The Gradient-Boosted Tree Model has three main tuning parameters: the number of trees (n.trees), the tree depth (interaction.depth), and the learning rate (shrinkage). The train function in caret automatically uses cross-validation to select the parameter values that provide the best performance. From the above model-fitting result, we can see that training the model with 150 trees, a depth of 3, and a learning rate of 0.1 gives an overall accuracy of around 99% on the training set. Another evaluation metric, Cohen's Kappa, also reaches about 0.99. This is exciting!
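If we wanted to control the parameter search explicitly instead of relying on caret's default grid, we could supply a tuneGrid. The sketch below uses illustrative values (caret's gbm method also requires n.minobsinnode in the grid); it is not the grid used to produce the results above.
# Sketch only: an explicit tuning grid for gbm through caret. Values are illustrative.
gbm_grid <- expand.grid(n.trees = c(100, 150),
                        interaction.depth = c(2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
fit_gbm_grid <- train(classe ~ ., data = tr, method = "gbm",
                      trControl = trainControl(method = "cv", number = 5),
                      tuneGrid = gbm_grid, verbose = FALSE)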
However, we must be cautious, because we know that our top performers, the Gradient-Boosted Tree Model and the Random Forest Model, are prone to overfitting by nature. Therefore, before we move any further, we test our models on the calibration set we have left untouched up to this point.
First, let’s see how the Gradient-Boosted Tree Model performs on the calibration set:
pred_c_xgboost <- predict(fit_xgboost, cal)
confusionMatrix(pred_c_xgboost, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9863460 0.9827280 0.9826920 0.9894031 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
F1_Score(pred_c_xgboost, factor(cal$classe))
## [1] 0.99678
The Gradient-Boosted Tree Model delivers excellent prediction performance on the calibration set, as shown above. The overall accuracy reaches 98.63%, Cohen's Kappa is about 0.983, and the F1 score is about 0.997. Now let's see how the Random Forest Model performs on the calibration set:
pred_c_rf <- predict(fit_rf, cal)
confusionMatrix(pred_c_rf, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9981659 0.9976802 0.9965211 0.9991610 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
From the above result, we can see that the Random Forest Model also performs almost perfectly on the calibration set, with an overall accuracy of around 99.8%.
Now let’s see how the other models perform on the calibration set:
# for logistic model
pred_c_mln <- predict(fit_mln, cal)
confusionMatrix(pred_c_mln, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.820868e-01 5.962526e-01 6.688504e-01 6.951063e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 7.868420e-41
# for LDA
pred_c_lda <- predict(fit_lda, cal)
confusionMatrix(pred_c_lda, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.998166e-01 6.194547e-01 6.867745e-01 7.126206e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 7.642155e-44
# for QDA
pred_c_qda <- predict(fit_qda, cal)
confusionMatrix(pred_c_qda, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.604035e-01 8.243307e-01 8.503904e-01 8.699857e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 6.443071e-67
# for Decision Tree
pred_c_dt <- predict(fit_dt, cal, type='class')
confusionMatrix(pred_c_dt, factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.574282e-01 5.657778e-01 6.439574e-01 6.707114e-01 2.842878e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.398610e-26
From the above results, it is reasonable to conclude that none of the models built on the training data show significant performance degradation when tested on the calibration data.
Finally, we construct an ensemble based on the majority vote of all the models we have built: the Multinomial Logistic Model, the LDA Model, the QDA Model, the Decision Tree Model, the Random Forest Model, and the Gradient-Boosted Tree Model. This simple ensemble consults what each model predicts and then makes its decision based on what the majority of the models vote for. For those observations where the six models cannot reach an over-50% majority, we accept the prediction of the top-performing model: the Random Forest Model.
model <- c("pred_c_mln", "pred_c_lda", "pred_c_qda",
"pred_c_dt", "pred_c_rf", "pred_c_xgboost")
pred <- sapply(1:6, function(x){
as.factor(get(model[x]))})
dim(pred)
## [1] 4907 6
pred <- as.data.frame(pred)
names(pred) <-c("pred_c_mln", "pred_c_lda", "pred_c_qda",
"pred_c_dt", "pred_c_rf", "pred_c_xgboost")
acc <- colMeans(as.matrix(pred)==cal$classe)
# acc
# mean(acc)
# Initialize the ensemble prediction column (0 = no majority decision yet)
pred$pred_c_ensemble <- 0
n <- 1:nrow(pred)
x <- c('A', 'B', 'C', 'D', 'E')
# Assign class a to row i whenever more than half of the row's entries equal a
# (i.e., at least 4 of the 6 model votes agree)
for (a in x){
  for (i in n){
    if (mean(pred[i,]==a)>0.5){
      pred[i,7] <- a}
  }
}
# Where no class reached a majority, fall back to the Random Forest vote (column 5)
pred[pred$pred_c_ensemble==0,7] <- pred[pred$pred_c_ensemble==0,5]
confusionMatrix(factor(pred$pred_c_ensemble), factor(cal$classe))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9702466 0.9623971 0.9651021 0.9748209 0.2842878
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
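For reference, the same majority-vote rule can be computed more compactly. The sketch below assumes the pred data frame built above, with the six model predictions in columns 1 through 6 and the Random Forest votes in column 5.
# Sketch: a compact majority vote over the six model columns of `pred`.
vote <- apply(pred[ , 1:6], 1, function(row) {
  tab <- table(row) # count votes per class for this observation
  if (max(tab) > 3) {
    names(tab)[which.max(tab)] # strict majority: more than 3 of the 6 votes
  } else {
    as.character(row[5]) # otherwise fall back to the Random Forest vote
  }
})
confusionMatrix(factor(vote, levels = c('A', 'B', 'C', 'D', 'E')), factor(cal$classe))$overall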
The above result indicates that the majority-vote ensemble has an overall accuracy of about 97% and a Cohen's Kappa of about 0.96. This is slightly inferior to the prediction performance of the Random Forest Model and the Gradient-Boosted Tree Model. Therefore, for this final project, we recommend the Random Forest Model, which slightly beats the Gradient-Boosted Tree Model and the ensemble model.
However, we believe the ensemble model is less likely to overfit in the real world than the Random Forest Model or the Gradient-Boosted Tree Model, because the ensemble also takes into account what the simpler models predict when making its predictions. Although simpler models such as the linear models do not predict as well as the more complex machine learning models like the Gradient-Boosted Tree Model, they are less prone to overfitting because they do not attempt to fit every tiny variation of the training data.
One last task is to evaluate the Random Forest Model on the test set we have reserved, so that the quiz questions can be answered. This also provides an extra opportunity to test the reliability of our chosen model's performance. Below, we use the Random Forest Model to predict the exercise manner (classe) for the test set.
pred_test <- as.data.frame(predict(fit_rf, testing))
pred_test
## predict(fit_rf, testing)
## 1 B
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 B
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B
Our algorithm answered all 20 quiz questions correctly, which again shows that the Random Forest Model we built predicts well.
The above sections show that the Random Forest Model provides satisfactory predictions of individuals' exercise manner from the movements recorded by the accelerometers. The overall accuracy of the Random Forest Model on the calibration data reaches 99.8%. However, given that the Random Forest algorithm tends to overfit by nature, we should expect the actual out-of-sample performance of our model to be somewhat lower than 99.8%. The ensemble model based on the majority vote of all six models built in this project should be an ideal alternative to the Random Forest Model.
Shearer, C. The CRISP-DM model: the new blueprint for data mining. Journal of Data Warehousing, 2000, 5: 13-22.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.