Library
library(dplyr) # data wrangling
library(ggplot2) # visualization
library(Rtsne) # EDA
library(caret) # machine learning functions
library(MLmetrics) # machine learning metrics
library(e1071) # naive bayes
library(rpart) # decision tree
library(rattle) # tree visualization
library(class) # k-NN
library(randomForest) # random forest
Problem Statement
This project builds a model that accurately predicts human activities such as Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, or Laying from smartphone measurements. The dataset used to train the model was collected from 30 people performing these activities with a smartphone worn on the waist, recorded via the phone’s sensors (accelerometer and gyroscope). The experiment was video recorded so that the data could be labeled manually. For more details, please refer to this link.
Dataset
Let’s read the dataset.
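The read-in code isn’t shown in the output below; here is a minimal sketch, assuming the standard ‘UCI HAR Dataset’ folder layout and a feature-name cleanup that strips punctuation (which would yield names like tBodyAccMagmean used later).
features <- read.table('UCI HAR Dataset/features.txt', stringsAsFactors = FALSE)[, 2]
features <- gsub('[^[:alnum:]]', '', features) # e.g. 'tBodyAccMag-mean()' -> 'tBodyAccMagmean'
activities <- read.table('UCI HAR Dataset/activity_labels.txt', stringsAsFactors = FALSE)[, 2]
read_split <- function(split) {
  # assumed folder layout; combines signals, subject id, and activity label for one split
  path <- file.path('UCI HAR Dataset', split)
  X <- read.table(file.path(path, paste0('X_', split, '.txt')), col.names = features)
  y <- read.table(file.path(path, paste0('y_', split, '.txt')))[, 1]
  subject <- read.table(file.path(path, paste0('subject_', split, '.txt')))[, 1]
  data.frame(subject = subject, X, Activity = activities[y])
}
uci_har <- rbind(read_split('train'), read_split('test'))
dim(uci_har)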
#> [1] 10299 563
As can be seen above, this dataset has 10299 observations and 563 columns. That’s a lot! Let’s go through the meaning of these features step by step.
Smartphone sensors captured triaxial linear acceleration (tAcc-XYZ) from accelerometer and triaxial angular velocity (tGyro-XYZ) from gyroscope with several variations. Prefix t in those variables denotes time. Suffix XYZ represents 3-axial signals in X, Y, and Z directions.
These sensor signals are preprocessed by applying noise filters and then sampled in fixed-width windows (sliding windows) of 2.56 seconds each with 50% overlap, that is, each window has 128 readings.
From each window, a feature vector was obtained by calculating variables from the time and frequency domain. In the dataset, each datapoint represents a window with different readings.
The acceleration signal was separated into Body and Gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using a low-pass filter with a corner frequency of 0.3 Hz.
After that, the body linear acceleration and angular velocity were derived in time to obtain jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ).
The magnitude of these three-dimensional signals was calculated using the Euclidean norm. These magnitudes are represented as features with names like tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, and tBodyGyroJerkMag.
Finally, frequency domain signals were calculated from some of the available signals by applying an FFT (Fast Fourier Transform). These obtained signals were labeled with prefix ‘f’ just like original signals with prefix ‘t’. These signals are labeled as fBodyAcc-XYZ, fBodyGyroMag, etc.
These are the signals obtained so far:
- tBodyAcc-XYZ
- tGravityAcc-XYZ
- tBodyAccJerk-XYZ
- tBodyGyro-XYZ
- tBodyGyroJerk-XYZ
- tBodyAccMag
- tGravityAccMag
- tBodyAccJerkMag
- tBodyGyroMag
- tBodyGyroJerkMag
- fBodyAcc-XYZ
- fBodyAccJerk-XYZ
- fBodyGyro-XYZ
- fBodyAccMag
- fBodyAccJerkMag
- fBodyGyroMag
- fBodyGyroJerkMag
A set of variables was then estimated from the above signals; that is, the following properties were computed for each of the signals recorded so far.
- mean(): mean value
- std(): standard deviation
- mad(): median absolute deviation
- max(): largest value in array
- min(): smallest value in array
- sma(): signal magnitude area
- energy(): energy measure (sum of the squares divided by the number of values)
- iqr(): interquartile range
- entropy(): signal entropy
- arCoeff(): autoregression coefficients with Burg order equal to 4
- correlation(): correlation coefficient between two signals
- maxInds(): index of the frequency component with the largest magnitude
- meanFreq(): weighted average of the frequency components to obtain a mean frequency
- skewness(): skewness of the frequency domain signal
- kurtosis(): kurtosis of the frequency domain signal
- bandsEnergy(): energy of a frequency interval within 64 bins of the FFT of each window
- angle(): angle between two vectors
Some other vectors are obtained by taking the average of signals in a single window sample. These are used on the angle() variable.
- gravityMean
- tBodyAccMean
- tBodyAccJerkMean
- tBodyGyroMean
- tBodyGyroJerkMean
The subject feature denotes the person ID; there are 30 unique IDs, one for each of the 30 participants. The Activity feature represents the activity a subject was doing and consists of:
- WALKING
- WALKING_UPSTAIRS
- WALKING_DOWNSTAIRS
- SITTING
- STANDING
- LAYING
In short, for each record in the dataset the following is provided:
- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.
Based on the problem statement, Activity will be our target feature.
Data Cleaning
First, convert the subject and Activity features into factors, and all other features into numeric.
uci_har <- uci_har %>%
mutate_at(c('subject', 'Activity'), as.factor) %>%
mutate_at(vars(-subject, -Activity), as.numeric)
lvl <- levels(uci_har$Activity)
lvl
#> [1] "LAYING" "SITTING" "STANDING"
#> [4] "WALKING" "WALKING_DOWNSTAIRS" "WALKING_UPSTAIRS"
Let’s check if there are any duplicated observations or missing values.
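The code for this check isn’t shown; a minimal equivalent that would produce the output below:
cat('Number of duplicated rows:', sum(duplicated(uci_har)), '\n')
cat('Number of missing values:', sum(is.na(uci_har)), '\n')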
#> Number of duplicated rows: 0
#> Number of missing values: 0
Great! None of them exists. Now let’s check data imbalance.
ggplot(uci_har %>%
group_by(subject, Activity) %>%
count(name = 'activity_count'),
aes(x = subject, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')
There is no significant difference in activity counts across subjects. Hence, the target classes are balanced.
Exploratory Data Analysis
We can categorize Activity into two groups: stationary activities (laying, sitting, and standing) and moving activities (walking, walking downstairs, and walking upstairs). Let’s see the distribution of the mean of the triaxial magnitude of the body acceleration signal (phew, that’s a mouthful! what I really mean is tBodyAccMagmean).
ggplot(uci_har,
aes(x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_density(alpha = .5) +
annotate('text', x = -.8, y = 25, label = "Stationary activities") +
annotate('text', x = -.0, y = 5, label = "Moving activities")
ggplot(uci_har,
aes(y = Activity, x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_boxplot(show.legend = FALSE)
We can see a clear distinction between the two groups:
- stationary activities have very small body movements compared to those of moving activities.
- if tBodyAccMagmean > -0.5, then the activity is probably walking, walking upstairs, or walking downstairs.
- if tBodyAccMagmean < -0.5, then the activity is most probably laying, standing, or sitting.
Now, there should also be a distinction between laying and the other activities in terms of phone orientation. While laying, unlike during the other activities, people tend to have the phone fairly horizontal on their waist. Let’s see whether this hypothesis holds by looking at the angle between each axis and the mean gravity acceleration signal (angleXgravityMean, angleYgravityMean, and angleZgravityMean).
for (coor in c('angleXgravityMean', 'angleYgravityMean', 'angleZgravityMean')) {
print(
ggplot(uci_har,
aes_string(y = 'Activity', x = coor, group = 'Activity', fill = 'Activity')) +
geom_boxplot(show.legend = FALSE)
)
}
It’s apparent that:
- phone orientation while laying is different from phone orientation during the other activities.
- if angleXgravityMean > 0, then the activity is most probably laying; other activities otherwise.
- if angleYgravityMean < -0.25 or angleZgravityMean < -0.25, then the activity is probably laying; other activities otherwise.
- we can predict the laying activity with minimal error using only angleXgravityMean, or probably other gravity-related features along the X axis.
Lastly, we can perform t-SNE on the dataset to reduce its dimensionality for visualization, in the hope that each Activity will be grouped in a different region. t-SNE is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data; it basically gives us a feel or intuition of how the data is arranged in high-dimensional space. We won’t dive deeper into t-SNE in this notebook. We will run t-SNE with perplexity values of 5, 10, and 20 as a sensitivity check, to make sure the dimension-reduced values really do group into different Activity classes.
for (perp in c(5, 10, 20)) {
tsne <- Rtsne(uci_har %>% select(-c(subject, Activity)), perplexity = perp)
tsne_plot <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], Activity = uci_har$Activity)
print(
ggplot(tsne_plot) +
geom_point(aes(x = x, y = y, color = Activity))
)
}
As can be seen, all activities can easily be separated except for Standing and Sitting. This makes sense, since Standing and Sitting don’t differ much in terms of phone orientation.
Cross validation
We cannot apply ordinary k-fold cross validation to this problem. Recall that our objective is to predict human activity based on phone sensors, which means that when a new, unseen subject comes in, we don’t know their behavior with the phone. If we built k-fold cross validation by randomly choosing observations, the same subject could appear in both the train and test datasets, which would be a data leak. To avoid this, during cross validation we split the dataset such that the subjects in the train and test datasets don’t intersect.
set.seed(2072) # for reproducibility
subject_id <- unique(uci_har$subject)
folds <- sample(1:5, 30, replace = TRUE)
d <- data.frame(col1 = c(subject_id), col2 = c(folds))
uci_har$folds <- d$col2[match(uci_har$subject, d$col1)]
uci_har <- uci_har %>%
mutate(folds = as.factor(folds)) %>%
select(-subject)
Please note that after creating folds for each observation, we discard the subject feature since it’s no longer needed in the analysis. Lastly, we can also see below that the data is distributed evenly among the folds, so no imbalance is introduced.
ggplot(uci_har %>%
group_by(folds, Activity) %>%
count(name = 'activity_count'),
aes(x = folds, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')
Metrics
We use accuracy to quantify the performance of our models after considering the following reasons:
- as per the problem statement, we are interested in predicting every class equally and accurately, without favoring one class above the others, so class-specific precision and recall metrics are not the focus here.
- this problem is a multiclass classification, where accuracy is more commonly used and more interpretable than ROC-AUC metrics.
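For reference, the accuracy used throughout (via MLmetrics::Accuracy) is simply the fraction of correct predictions; a one-line equivalent would be:
# fraction of predictions that match the observed labels
accuracy <- function(y_pred, y_true) mean(y_pred == y_true)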
Modeling
First, as a sanity check, let’s look at the dimensions of the dataset, along with the maximum and minimum value of each feature.
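The code that produces the output below isn’t shown; a minimal sketch (the column selection is an assumption) could be:
dim(uci_har)
col_max <- sapply(uci_har %>% select(-c(Activity, folds)), max)
col_min <- sapply(uci_har %>% select(-c(Activity, folds)), min)
max(col_max) # largest per-feature maximum
min(col_max) # smallest per-feature maximum
min(col_min) # smallest per-feature minimum
max(col_min) # largest per-feature minimum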
#> [1] 10299 563
#> [1] 1
#> [1] 0.795525
#> [1] -1
#> [1] -0.9960928
The minimum values are close to -1 for all features, whereas the maximum values range from about 0.8 to 1. We won’t do any normalization because this range is already small, and more importantly we don’t want to lose information about the correlation between features.
The dataset has 563 features, two from which will be discarded while modeling. They are Activity (since this is the target variable) and folds (since this doesn’t add any information to the dataset).
We will build four models to tackle the problem statement: Naive Bayes, Decision Tree, k-Nearest Neighbors, and Random Forest. To simplify things, below is a function that cross validates all models. In this function, we iterate over each fold of the cross validation built before and do the following in each iteration:
- create X_train, y_train, X_test, y_test, which are the training predictors, training target, testing predictors, and testing target, respectively.
- build the model on the training data and predict the output as y_pred.
- calculate the accuracy of the model by comparing y_pred to y_test.
- the accuracy results are then averaged across all folds, producing a single number with which to compare the models.
Note that for the Random Forest model we also use cross validation accuracy instead of the OOB error, so that the comparison with the other models is apples to apples. However, later we treat the Random Forest model separately to emphasize the importance of hyperparameter tuning.
crossvalidate <- function(data, k, model_name,
tuning = FALSE, mtry = NULL, nodesize = NULL) {
# 'data' is the training set with the 'folds' column
# 'k' is the number of folds we have
# 'model_name' is a string describing the model being used
# 'tuning' is a mode in which this function will operate, tuning = TRUE means we are doing hyperparameter tuning
# 'mtry' and 'nodesize' are used only in Random Forest hyperparameter tuning
# initialize empty lists for recording performances
acc_train <- c()
acc_test <- c()
y_preds <- c()
y_tests <- c()
models <- c()
# one iteration per fold
for (fold in 1:k) {
# create training set for this iteration
# subset all the datapoints where folds does not match the current fold
training_set <- data %>% filter(folds != fold)
X_train <- training_set %>% select(-c(Activity, folds))
y_train <- training_set$Activity
# create test set for this iteration
# subset all the datapoints where folds matches the current fold
testing_set <- data %>% filter(folds == fold)
X_test <- testing_set %>% select(-c(Activity, folds))
y_test <- testing_set$Activity
# train & predict
switch(model_name,
nb = {
model <- naiveBayes(x = X_train, y = y_train, laplace = 1)
y_pred <- predict(model, X_test, type = 'class')
y_pred_train <- predict(model, X_train, type = 'class')
},
dt = {
model <- rpart(formula = Activity ~ ., data = training_set %>% select(-folds), method = 'class')
y_pred <- predict(model, X_test, type = 'class')
y_pred_train <- predict(model, X_train, type = 'class')
},
knn = {
k_neighbors <- round(sqrt(nrow(training_set))) # rule-of-thumb k; avoids shadowing the fold-count argument k
y_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_neighbors)
y_pred_train <- knn(train = X_train, test = X_train, cl = y_train, k = k_neighbors)
},
rf = {
if (tuning == FALSE) {
model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test)
} else {
model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test,
mtry = mtry, nodesize = nodesize)
}
y_pred <- model$test$predicted
y_pred_train <- model$predicted
},
{
print("Model is not recognized. Try to input 'nb', 'dt', 'knn', or 'rf'.")
return()
}
)
# populate corresponding lists
acc_train[fold] <- Accuracy(y_pred_train, y_train)
acc_test[fold] <- Accuracy(y_pred, y_test)
y_preds <- append(y_preds, y_pred)
y_tests <- append(y_tests, y_test)
models <- c(models, list(model))
}
# convert back to factor
y_preds <- factor(y_preds, labels = lvl)
y_tests <- factor(y_tests, labels = lvl)
# get the accuracy between the predicted and the observed
cm <- confusionMatrix(y_preds, y_tests)
cm_table <- cm$table
acc <- cm$overall['Accuracy']
# return the results
if (model_name == 'knn') {
return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test))
} else {
return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test, 'models' = models))
}
}
Now we are ready.
Naive Bayes
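The call that produces the result below isn’t shown; with the crossvalidate() helper defined above, it would presumably look like this:
# assumed call; the original notebook hides this chunk
nb <- crossvalidate(uci_har, 5, 'nb')
cat('Naive Bayes Accuracy:', nb$acc)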
#> Naive Bayes Accuracy: 0.7258957
The Naive Bayes model gives a poor result with 73% accuracy. This is mainly due to the model’s underlying assumption that the predictors are independent of one another, which is not the case in our dataset. For instance, we have the following high correlations between some predictors, which the model does not capture.
set.seed(3)
col <- c(sample(names(uci_har), 6))
GGally::ggcorr(uci_har[, col], hjust = 1, layout.exp = 3, label = T)
Let’s see the confusion matrix below.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1623 14 4 0 0
#> SITTING 286 1550 1188 0 0
#> STANDING 0 187 677 0 0
#> WALKING 2 0 1 1181 93
#> WALKING_DOWNSTAIRS 4 0 0 234 1065
#> WALKING_UPSTAIRS 29 26 36 307 248
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 40
#> WALKING_DOWNSTAIRS 124
#> WALKING_UPSTAIRS 1380
The Naive Bayes model still confuses several clearly different activities, such as laying and sitting (around 300 wrong predictions). The model also struggles to differentiate walking upstairs from the stationary activities (around 90 wrong predictions).
Lastly, as can be seen below from the accuracies on the train and test datasets, the model is neither underfitting nor overfitting, except on the fourth fold. Hence, we can’t rely much on trading off bias and variance to improve performance.
#> [1] 0.7349193 0.7647131 0.7280039 0.7635290 0.7612536
#> [1] 0.7475728 0.7491702 0.7276636 0.6863137 0.7169533
Decision Tree
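As before, the call behind the result below would presumably be:
# assumed call using the same cross-validation helper
dt <- crossvalidate(uci_har, 5, 'dt')
cat('Decision Tree Accuracy:', dt$acc)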
#> Decision Tree Accuracy: 0.8599864
The Decision Tree model gives a better result with 86% accuracy. We can see why by recalling from the t-SNE result that our dataset is highly separable (except for the sitting and standing activities), a characteristic that a tree-based model can exploit. To get a sense of how the tree works, please observe the five tree diagrams below, one for each cross validation fold.
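The diagrams themselves aren’t reproduced here; they were presumably drawn with rattle’s fancyRpartPlot, along these lines:
for (model in dt$models) {
  # one plot per cross-validation fold
  fancyRpartPlot(model, sub = '')
}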
One can easily see that tGravityAccminX or tGravityAccmeanX becomes the crucial feature at the first split from the root. If this feature is less than a certain threshold, the models can perfectly predict that the corresponding activity is laying, which accounts for 19% of the training observations. This is consistent with our EDA finding that the laying activity can be distinguished by observing only one gravity-related feature along the X axis.
At the second split, based on a body acceleration signal, the models can perfectly separate sitting and standing from the moving activities (sitting and standing account for 36%, and the moving activities for 45%, of the training observations). This finding confirms our earlier analysis that stationary and moving activities can be separated fairly easily.
Then, from the third split down to the leaves, the models separate sitting from standing, and the moving activities from one another, with some errors. This indicates that these activities are quite hard for the models to distinguish. To see this clearly, here’s the confusion matrix.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1942 15 0 0 0
#> SITTING 2 1564 334 0 0
#> STANDING 0 196 1571 8 0
#> WALKING 0 0 1 1549 151
#> WALKING_DOWNSTAIRS 0 1 0 39 1149
#> WALKING_UPSTAIRS 0 1 0 126 106
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 296
#> WALKING_DOWNSTAIRS 166
#> WALKING_UPSTAIRS 1082
A small note from this table: unlike the Naive Bayes model, the Decision Tree model rarely predicts stationary activities as walking upstairs. In fact, only one such prediction is wrong, compared to 91 in the Naive Bayes model.
Next, let’s see which variables are the most important. The tables below once again confirm the importance of gravity-related features along the X axis (tGravityAccminX and angleXgravityMean) to our models.
for (model in dt$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccminX 1647.338
#> tGravityAccmeanX 1522.782
#> tGravityAccenergyX 1521.581
#> angleXgravityMean 1521.581
#> tGravityAccmaxX 1505.993
#> fBodyAccJerkbandsEnergy116 1374.982
#> fBodyAccJerkbandsEnergy124 1374.982
#> fBodyAccmadX 1374.982
#> fBodyAccmeanX 1374.982
#> tBodyAccJerkmadX 1374.178
#> Overall
#> angleXgravityMean 1494.691
#> tGravityAccenergyX 1494.691
#> tGravityAccmeanX 1494.691
#> tGravityAccminX 1494.691
#> tGravityAccmaxX 1483.883
#> fBodyAccJerkentropyX 1376.017
#> fBodyAccmadX 1376.017
#> fBodyAccmeanX 1376.017
#> tBodyAccJerkmadX 1376.017
#> tBodyAccJerkMagenergy 1376.017
#> Overall
#> angleXgravityMean 1504.419
#> tGravityAccenergyX 1504.419
#> tGravityAccmeanX 1504.419
#> tGravityAccminX 1504.419
#> tGravityAccmaxX 1488.823
#> fBodyAccJerkenergyX 1370.566
#> fBodyAccmadX 1370.566
#> fBodyAccmeanX 1370.566
#> tBodyAccJerkenergyX 1370.566
#> tBodyAccJerkstdX 1370.566
#> Overall
#> tGravityAccminX 1528.936
#> tGravityAccmeanX 1527.734
#> angleXgravityMean 1526.532
#> tGravityAccenergyX 1526.532
#> tGravityAccmaxX 1508.549
#> tBodyAccJerkenergyX 1387.769
#> tBodyAccJerkmadX 1387.769
#> tBodyAccJerkstdX 1387.769
#> tBodyAccmaxX 1387.769
#> tBodyAccstdX 1387.769
#> Overall
#> tGravityAccminX 1531.881
#> tGravityAccmeanX 1530.679
#> angleXgravityMean 1529.478
#> tGravityAccenergyX 1529.478
#> tGravityAccmaxX 1512.692
#> fBodyAccJerkbandsEnergy116 1379.583
#> fBodyAccJerkbandsEnergy124 1379.583
#> fBodyAccmadX 1379.583
#> fBodyAccmeanX 1379.583
#> tBodyAccJerkmadX 1378.776
Normally a Decision Tree model tends to overfit the training data because it can easily fit noise in the data. Luckily, that’s not the case here: as can be seen below, the train and test accuracies are close, except for the fourth fold. Hence, tree pruning is not necessary.
#> [1] 0.8897925 0.8959707 0.8811845 0.8897192 0.8691917
#> [1] 0.8660194 0.8914177 0.8643096 0.7902098 0.8855037
k-Nearest Neighbors
Since k-NN is a distance-based model, the dataset would normally be normalized before modeling so that the model treats each feature equally; otherwise, a feature with relatively large values would dominate the selection of a datapoint’s neighbors. In our case, however, we didn’t normalize the dataset, for the reason already explained at the beginning of this Modeling section.
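Again, the call behind the result below would presumably be:
# assumed call using the same cross-validation helper
knn <- crossvalidate(uci_har, 5, 'knn')
cat('k-Nearest Neighbors Accuracy:', knn$acc)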
#> k-Nearest Neighbors Accuracy: 0.8925138
The k-NN model gives an even better result than the Decision Tree model, with 89% accuracy. Again, this is because our dataset is highly separable, so the k-NN algorithm can easily group each activity.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1926 13 0 0 0
#> SITTING 5 1369 245 0 0
#> STANDING 5 390 1659 0 0
#> WALKING 1 0 1 1658 98
#> WALKING_DOWNSTAIRS 0 0 0 35 1178
#> WALKING_UPSTAIRS 7 5 1 29 130
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 90
#> WALKING_DOWNSTAIRS 52
#> WALKING_UPSTAIRS 1402
However, the k-NN model is worse than the Decision Tree model at distinguishing sitting from standing. Also, some stationary activities are predicted as walking upstairs. On the other hand, more moving activities are predicted correctly than in the Decision Tree model.
By comparing accuracy on train and test dataset, it can be seen that the model is just right with low bias and variance.
#> [1] 0.9299672 0.9269841 0.9332196 0.9308184 0.9335673
#> [1] 0.9169903 0.9184448 0.8781653 0.8516484 0.8958231
Random Forest
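The untuned Random Forest result below would presumably come from:
# assumed call using the same cross-validation helper
rf <- crossvalidate(uci_har, 5, 'rf')
cat('Random Forest Accuracy:', rf$acc)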
#> Random Forest Accuracy: 0.9336829
The Random Forest model gives the best result so far, with a whopping 93% accuracy. In particular, Random Forest is almost always better than a single Decision Tree for the following reasons:
- Random Forest is an ensemble of many Decision Trees. Its prediction is based on majority voting across the trees, and hence tends to reduce the error of a single Decision Tree’s prediction.
- Random Forest performs bagging (bootstrap aggregating) together with random feature selection, which decorrelates the individual trees and hence mitigates the overfitting problem of a single Decision Tree.
However, these advantages don’t come without shortcomings: Random Forest model is slower to train and harder to interpret than a Decision Tree.
Now let’s see the confusion matrix.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1942 15 0 0 0
#> SITTING 0 1626 189 0 0
#> STANDING 0 135 1717 0 0
#> WALKING 0 0 0 1585 18
#> WALKING_DOWNSTAIRS 0 0 0 27 1293
#> WALKING_UPSTAIRS 2 1 0 110 95
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 37
#> WALKING_DOWNSTAIRS 54
#> WALKING_UPSTAIRS 1453
The Random Forest model still has a hard time telling sitting from standing, and the moving activities from one another. Nevertheless, beyond those errors, this model misclassifies only 18 other observations, a tiny portion of the whole dataset.
Based on the variable importance tables below, we see that the Random Forest model favors gravity-related features over body-related ones.
for (model in rf$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccmeanX 233.5234
#> tGravityAccminX 212.9809
#> angleXgravityMean 190.1192
#> tGravityAccmaxX 185.7354
#> angleYgravityMean 168.6782
#> tGravityAccenergyX 158.2211
#> tGravityAccminY 152.9756
#> tGravityAccmaxY 149.8530
#> tGravityAccmeanY 128.6168
#> tGravityAccenergyY 115.0822
#> Overall
#> tGravityAccmeanX 215.90652
#> tGravityAccminX 199.06699
#> tGravityAccenergyX 187.14571
#> tGravityAccmaxX 174.64894
#> angleXgravityMean 170.14726
#> tGravityAccmaxY 148.36554
#> angleYgravityMean 147.43523
#> tGravityAccmeanY 136.34275
#> tGravityAccminY 132.14115
#> tGravityAccenergyY 83.03708
#> Overall
#> angleXgravityMean 211.0124
#> tGravityAccminX 193.4731
#> tGravityAccmaxX 183.6834
#> tGravityAccenergyX 178.2531
#> tGravityAccmaxY 175.3123
#> tGravityAccmeanX 170.4459
#> tGravityAccmeanY 166.4416
#> tGravityAccminY 164.2081
#> angleYgravityMean 159.2264
#> tGravityAccenergyY 113.4814
#> Overall
#> tGravityAccmaxX 214.2470
#> tGravityAccminX 201.6110
#> tGravityAccenergyX 198.3143
#> angleXgravityMean 191.6710
#> tGravityAccmeanY 185.8804
#> tGravityAccmeanX 182.7646
#> tGravityAccmaxY 179.4252
#> angleYgravityMean 172.8559
#> tGravityAccminY 171.5347
#> tGravityAccenergyY 102.5362
#> Overall
#> tGravityAccmeanX 208.67569
#> angleXgravityMean 202.43801
#> tGravityAccminX 192.91251
#> tGravityAccenergyX 185.74270
#> tGravityAccmaxX 158.31243
#> tGravityAccmaxY 148.26482
#> angleYgravityMean 145.74691
#> tGravityAccmeanY 142.97585
#> tGravityAccminY 126.27075
#> tGravityAccenergyY 95.61133
Lastly, from the train and test accuracies below, it’s apparent that the model performs really well on both, even though the test accuracies in the third and fourth folds are still slightly below 91%. We already knew that the tendency to overfit should decrease when we switch from a Decision Tree to a Random Forest (thanks to bagging and random feature selection). However, the generalization error will not go to zero: its variance approaches zero as more trees are added, but its bias will not.
#> [1] 0.9815512 0.9805861 0.9837923 0.9865011 0.9842691
#> [1] 0.9631068 0.9720247 0.9096990 0.8971029 0.9248157
We can further improve the model with hyperparameter tuning. The idea is to grow smaller trees, trading a little bias for a reduction in variance. This is done by increasing the nodesize parameter of the Random Forest. Quoting the R documentation, nodesize is the minimum size of terminal nodes; setting this number larger causes smaller trees to be grown (and thus takes less time). The default value is 1 for classification, which tends to make the model overfit to noise in the data.
Besides nodesize, we will also tune mtry (the number of variables randomly sampled as candidates at each split, which defaults to floor(sqrt(561)) = 23 in our case). We will do a grid search, varying nodesize over c(3, 5, 7) and mtry over c(11, 16).
# establish a list of possible values for nodesize and mtry
nodesize <- c(3, 5, 7)
mtry <- c(11, 16)
# create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize)
# initialize empty vectors to store the results
rf_acc <- c()
rf_acc_train <- c()
rf_acc_test <- c()
# loop over the rows of hyper_grid
for (i in 1:nrow(hyper_grid)) {
# cross validate
rf_tuning <- crossvalidate(uci_har, 5, 'rf',
tuning = TRUE, mtry = hyper_grid$mtry[i], nodesize = hyper_grid$nodesize[i])
# store the results
rf_acc[i] <- rf_tuning$acc
rf_acc_train <- c(rf_acc_train, list(rf_tuning$acc_train))
rf_acc_test <- c(rf_acc_test, list(rf_tuning$acc_test))
}
# identify optimal set of hyperparameters based on accuracy
opt_i <- which.max(rf_acc)
print(hyper_grid[opt_i,])
#> mtry nodesize
#> 5 11 7
The best hyperparameters found are nodesize = 7 and mtry = 11. With these, the accuracy improves slightly towards 94%, as we can see below. Also, the test accuracies are more uniform across folds, all above 91%.
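The outputs below are presumably the tuned accuracy and the per-fold train and test accuracies for the best combination, i.e. something like:
rf_acc[opt_i]
rf_acc_train[opt_i]
rf_acc_test[opt_i]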
#> [1] 0.9370813
#> [[1]]
#> [1] 0.9815512 0.9788767 0.9820863 0.9849343 0.9831801
#> [[1]]
#> [1] 0.9684466 0.9687055 0.9182991 0.9130869 0.9154791
Conclusion
rbind("Naive Bayes" = nb$acc, "Decision Tree" = dt$acc, "k-Nearest Neighbors" = knn$acc, "Random Forest" = max(rf_acc))
#> Accuracy
#> Naive Bayes 0.7258957
#> Decision Tree 0.8599864
#> k-Nearest Neighbors 0.8925138
#> Random Forest 0.9370813
Based on the accuracy table above, Random Forest clearly wins as the best model: it recognizes human activities from phone sensor data with an outstanding 94% accuracy. On the other hand, Random Forest is slow to run, as it is an ensemble of 500 Decision Trees by default. Of course, we could also try simpler models such as One-vs-Rest Logistic Regression, or industry-standard boosting models like XGBoost or LightGBM, and compare the results.