Library
library(dplyr) # data wrangling
library(ggplot2) # visualization
library(Rtsne) # EDA
library(caret) # machine learning functions
library(MLmetrics) # machine learning metrics
library(e1071) # naive bayes
library(rpart) # decision tree
library(rattle) # tree visualization
library(class) # k-NN
library(randomForest) # random forest
Problem Statement
This project builds a model that accurately predicts human activities such as Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, or Laying from smartphone measurements. The dataset used to train the model was collected from 30 people performing these activities with a smartphone worn on the waist, recorded via the phone’s sensors (accelerometer and gyroscope). The experiment was video recorded so that the data could be labeled manually. For more details, please refer to this link.
Dataset
Let’s read the dataset.
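The read-in code isn’t shown in the output below; here is a minimal sketch, assuming the standard ‘UCI HAR Dataset’ folder layout and a feature-name cleanup that strips punctuation (which would yield names like tBodyAccMagmean used later).
features <- read.table('UCI HAR Dataset/features.txt', stringsAsFactors = FALSE)[, 2]
features <- gsub('[^[:alnum:]]', '', features) # e.g. 'tBodyAccMag-mean()' -> 'tBodyAccMagmean'
activities <- read.table('UCI HAR Dataset/activity_labels.txt', stringsAsFactors = FALSE)[, 2]
read_split <- function(split) {
  # assumed folder layout; combines signals, subject id, and activity label for one split
  path <- file.path('UCI HAR Dataset', split)
  X <- read.table(file.path(path, paste0('X_', split, '.txt')), col.names = features)
  y <- read.table(file.path(path, paste0('y_', split, '.txt')))[, 1]
  subject <- read.table(file.path(path, paste0('subject_', split, '.txt')))[, 1]
  data.frame(subject = subject, X, Activity = activities[y])
}
uci_har <- rbind(read_split('train'), read_split('test'))
dim(uci_har)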
#> [1] 10299 563
As can be seen above, this dataset has 10299 observations and 563 columns. That’s a lot! Let’s go through the meaning of these features step by step.
Smartphone sensors captured triaxial linear acceleration (tAcc-XYZ) from accelerometer and triaxial angular velocity (tGyro-XYZ) from gyroscope with several variations. Prefix t in those variables denotes time. Suffix XYZ represents 3-axial signals in X, Y, and Z directions.
These sensor signals are preprocessed by applying noise filters and then sampled in fixed-width windows (sliding windows) of 2.56 seconds each with 50% overlap, that is, each window has 128 readings.
From each window, a feature vector was obtained by calculating variables from the time and frequency domain. In the dataset, each datapoint represents a window with different readings.
The acceleration signal was separated into Body and Gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using a low-pass filter with a corner frequency of 0.3 Hz.
After that, the body linear acceleration and angular velocity were derived in time to obtain jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ).
The magnitude of these three-dimensional signals was calculated using the Euclidean norm. These magnitudes are represented as features with names like tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, and tBodyGyroJerkMag.
Finally, frequency domain signals were calculated from some of the available signals by applying an FFT (Fast Fourier Transform). These obtained signals were labeled with prefix ‘f’ just like original signals with prefix ‘t’. These signals are labeled as fBodyAcc-XYZ, fBodyGyroMag, etc.
These are the signals obtained so far:
- tBodyAcc-XYZ
- tGravityAcc-XYZ
- tBodyAccJerk-XYZ
- tBodyGyro-XYZ
- tBodyGyroJerk-XYZ
- tBodyAccMag
- tGravityAccMag
- tBodyAccJerkMag
- tBodyGyroMag
- tBodyGyroJerkMag
- fBodyAcc-XYZ
- fBodyAccJerk-XYZ
- fBodyGyro-XYZ
- fBodyAccMag
- fBodyAccJerkMag
- fBodyGyroMag
- fBodyGyroJerkMag
A set of variables was then estimated from the above signals; that is, the following properties were computed for each of the signals recorded so far.
- mean(): mean value
- std(): standard deviation
- mad(): median absolute deviation
- max(): largest value in array
- min(): smallest value in array
- sma(): signal magnitude area
- energy(): energy measure (sum of the squares divided by the number of values)
- iqr(): interquartile range
- entropy(): signal entropy
- arCoeff(): autoregression coefficients with Burg order equal to 4
- correlation(): correlation coefficient between two signals
- maxInds(): index of the frequency component with the largest magnitude
- meanFreq(): weighted average of the frequency components to obtain a mean frequency
- skewness(): skewness of the frequency domain signal
- kurtosis(): kurtosis of the frequency domain signal
- bandsEnergy(): energy of a frequency interval within 64 bins of the FFT of each window
- angle(): angle between two vectors
Some other vectors are obtained by taking the average of signals in a single window sample. These are used on the angle() variable.
- gravityMean
- tBodyAccMean
- tBodyAccJerkMean
- tBodyGyroMean
- tBodyGyroJerkMean
The subject feature denotes the person ID; there are 30 unique IDs, one for each of the 30 participants. The Activity feature represents the activity a subject was doing and consists of:
- WALKING
- WALKING_UPSTAIRS
- WALKING_DOWNSTAIRS
- SITTING
- STANDING
- LAYING
In short, for each record in the dataset the following is provided:
- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.
Based on the problem statement, Activity will be our target feature.
Data Cleaning
First, convert the subject and Activity features into factors, and all other features into numeric.
uci_har <- uci_har %>%
mutate_at(c('subject', 'Activity'), as.factor) %>%
mutate_at(vars(-subject, -Activity), as.numeric)
lvl <- levels(uci_har$Activity)
lvl
#> [1] "LAYING" "SITTING" "STANDING"
#> [4] "WALKING" "WALKING_DOWNSTAIRS" "WALKING_UPSTAIRS"
Let’s check if there are any duplicated observations or missing values.
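The code for this check isn’t shown; a minimal equivalent that would produce the output below:
cat('Number of duplicated rows:', sum(duplicated(uci_har)), '\n')
cat('Number of missing values:', sum(is.na(uci_har)), '\n')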
#> Number of duplicated rows: 0
#> Number of missing values: 0
Great! None of them exists. Now let’s check data imbalance.
ggplot(uci_har %>%
group_by(subject, Activity) %>%
count(name = 'activity_count'),
aes(x = subject, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')
There is no significant difference in activity counts across subjects. Hence, the target classes are balanced.
Exploratory Data Analysis
We can categorize Activity into two groups: stationary activities (laying, sitting, and standing) and moving activities (walking, walking downstairs, and walking upstairs). Let’s see the distribution of the mean of the triaxial magnitude of the body acceleration signal (phew, that’s a mouthful! what I really mean is tBodyAccMagmean).
ggplot(uci_har,
aes(x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_density(alpha = .5) +
annotate('text', x = -.8, y = 25, label = "Stationary activities") +
annotate('text', x = -.0, y = 5, label = "Moving activities")
ggplot(uci_har,
aes(y = Activity, x = tBodyAccMagmean, group = Activity, fill = Activity)) +
geom_boxplot(show.legend = FALSE)
We can see a clear distinction between the two groups:
- stationary activities have very small body movements compared to those of moving activities.
- if tBodyAccMagmean > -0.5, then the activity is probably walking, walking upstairs, or walking downstairs.
- if tBodyAccMagmean < -0.5, then the activity is most probably laying, standing, or sitting.
Now, there should also be a distinction between laying and the other activities in terms of phone orientation. While laying, unlike during the other activities, people tend to have the phone fairly horizontal on their waist. Let’s see whether this hypothesis holds by looking at the angle between each axis and the mean gravity acceleration signal (angleXgravityMean, angleYgravityMean, and angleZgravityMean).
for (coor in c('angleXgravityMean', 'angleYgravityMean', 'angleZgravityMean')) {
print(
ggplot(uci_har,
aes_string(y = 'Activity', x = coor, group = 'Activity', fill = 'Activity')) +
geom_boxplot(show.legend = FALSE)
)
}
It’s apparent that:
- phone orientation while laying is different from phone orientation during the other activities.
- if angleXgravityMean > 0, then the activity is most probably laying; other activities otherwise.
- if angleYgravityMean < -0.25 or angleZgravityMean < -0.25, then the activity is probably laying; other activities otherwise.
- we can predict the laying activity with minimal error using only angleXgravityMean, or probably other gravity-related features along the X axis.
Lastly, we can perform t-SNE on the dataset to reduce its dimensionality for visualization, in the hope that each Activity will be grouped in a different region. t-SNE is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data; it basically gives us a feel or intuition of how the data is arranged in high-dimensional space. We won’t dive deeper into t-SNE in this notebook. We will run t-SNE with perplexity values of 5, 10, and 20 as a sensitivity check, to make sure the dimension-reduced values really do group into different Activity classes.
for (perp in c(5, 10, 20)) {
tsne <- Rtsne(uci_har %>% select(-c(subject, Activity)), perplexity = perp)
tsne_plot <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], Activity = uci_har$Activity)
print(
ggplot(tsne_plot) +
geom_point(aes(x = x, y = y, color = Activity))
)
}
As can be seen, all activities can easily be separated except for Standing and Sitting. This makes sense, since Standing and Sitting don’t differ much in terms of phone orientation.
Cross validation
We cannot apply ordinary k-fold cross validation to this problem. Recall that our objective is to predict human activity based on phone sensors, which means that when a new, unseen subject comes in, we don’t know their behavior with the phone. If we built k-fold cross validation by randomly choosing observations, the same subject could appear in both the train and test datasets, which would be a data leak. To avoid this, during cross validation we split the dataset such that the subjects in the train and test datasets don’t intersect.
set.seed(2072) # for reproducibility
subject_id <- unique(uci_har$subject)
folds <- sample(1:5, 30, replace = TRUE)
d <- data.frame(col1 = c(subject_id), col2 = c(folds))
uci_har$folds <- d$col2[match(uci_har$subject, d$col1)]
uci_har <- uci_har %>%
mutate(folds = as.factor(folds)) %>%
select(-subject)
Please note that after creating folds for each observation, we discard the subject feature since it’s no longer needed in the analysis. Lastly, we can also see below that the data is distributed evenly among the folds, so no imbalance is introduced.
ggplot(uci_har %>%
group_by(folds, Activity) %>%
count(name = 'activity_count'),
aes(x = folds, y = activity_count, fill = Activity)) +
geom_bar(stat = 'identity')
Metrics
We use accuracy to quantify the performance of our models after considering the following reasons:
- as per the problem statement, we are interested in predicting every class equally and accurately, without favoring one class above the others, so class-specific precision and recall metrics are not the focus here.
- this problem is a multiclass classification, where accuracy is more commonly used and more interpretable than ROC-AUC metrics.
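For reference, the accuracy used throughout (via MLmetrics::Accuracy) is simply the fraction of correct predictions; a one-line equivalent would be:
# fraction of predictions that match the observed labels
accuracy <- function(y_pred, y_true) mean(y_pred == y_true)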
Modeling
First, as a sanity check, let’s look at the dimensions of the dataset, along with the maximum and minimum value of each feature.
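The code that produces the output below isn’t shown; a minimal sketch (the column selection is an assumption) could be:
dim(uci_har)
col_max <- sapply(uci_har %>% select(-c(Activity, folds)), max)
col_min <- sapply(uci_har %>% select(-c(Activity, folds)), min)
max(col_max) # largest per-feature maximum
min(col_max) # smallest per-feature maximum
min(col_min) # smallest per-feature minimum
max(col_min) # largest per-feature minimum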
#> [1] 10299 563
#> [1] 1
#> [1] 0.795525
#> [1] -1
#> [1] -0.9960928
The minimum values are close to -1 for all features, whereas the maximum values range from about 0.8 to 1. We won’t do any normalization because this range is already small, and more importantly we don’t want to lose information about the correlation between features.
The dataset has 563 features, two from which will be discarded while modeling. They are Activity (since this is the target variable) and folds (since this doesn’t add any information to the dataset).
We will build four models to tackle the problem statement: Naive Bayes, Decision Tree, k-Nearest Neighbors, and Random Forest. To simplify things, below is a function that cross validates all models. In this function, we iterate over each fold of the cross validation built before and do the following in each iteration:
- create X_train, y_train, X_test, y_test, which are the training predictors, training target, testing predictors, and testing target, respectively.
- build the model on the training data and predict the output as y_pred.
- calculate the accuracy of the model by comparing y_pred to y_test.
- the accuracy results are then averaged across all folds, producing a single number with which to compare the models.
Note that for the Random Forest model we also use cross validation accuracy instead of the OOB error, so that the comparison with the other models is apples to apples. However, later we treat the Random Forest model separately to emphasize the importance of hyperparameter tuning.
crossvalidate <- function(data, k, model_name,
tuning = FALSE, mtry = NULL, nodesize = NULL) {
# 'data' is the training set with the 'folds' column
# 'k' is the number of folds we have
# 'model_name' is a string describing the model being used
# 'tuning' is a mode in which this function will operate, tuning = TRUE means we are doing hyperparameter tuning
# 'mtry' and 'nodesize' are used only in Random Forest hyperparameter tuning
# initialize empty lists for recording performances
acc_train <- c()
acc_test <- c()
y_preds <- c()
y_tests <- c()
models <- c()
# one iteration per fold
for (fold in 1:k) {
# create training set for this iteration
# subset all the datapoints where folds does not match the current fold
training_set <- data %>% filter(folds != fold)
X_train <- training_set %>% select(-c(Activity, folds))
y_train <- training_set$Activity
# create test set for this iteration
# subset all the datapoints where folds matches the current fold
testing_set <- data %>% filter(folds == fold)
X_test <- testing_set %>% select(-c(Activity, folds))
y_test <- testing_set$Activity
# train & predict
switch(model_name,
nb = {
model <- naiveBayes(x = X_train, y = y_train, laplace = 1)
y_pred <- predict(model, X_test, type = 'class')
y_pred_train <- predict(model, X_train, type = 'class')
},
dt = {
model <- rpart(formula = Activity ~ ., data = training_set %>% select(-folds), method = 'class')
y_pred <- predict(model, X_test, type = 'class')
y_pred_train <- predict(model, X_train, type = 'class')
},
knn = {
k_neighbors <- round(sqrt(nrow(training_set))) # rule-of-thumb k; avoids shadowing the fold-count argument k
y_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_neighbors)
y_pred_train <- knn(train = X_train, test = X_train, cl = y_train, k = k_neighbors)
},
rf = {
if (tuning == FALSE) {
model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test)
} else {
model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test,
mtry = mtry, nodesize = nodesize)
}
y_pred <- model$test$predicted
y_pred_train <- model$predicted
},
{
print("Model is not recognized. Try to input 'nb', 'dt', 'knn', or 'rf'.")
return()
}
)
# populate corresponding lists
acc_train[fold] <- Accuracy(y_pred_train, y_train)
acc_test[fold] <- Accuracy(y_pred, y_test)
y_preds <- append(y_preds, y_pred)
y_tests <- append(y_tests, y_test)
models <- c(models, list(model))
}
# convert back to factor
y_preds <- factor(y_preds, labels = lvl)
y_tests <- factor(y_tests, labels = lvl)
# get the accuracy between the predicted and the observed
cm <- confusionMatrix(y_preds, y_tests)
cm_table <- cm$table
acc <- cm$overall['Accuracy']
# return the results
if (model_name == 'knn') {
return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test))
} else {
return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test, 'models' = models))
}
}
Now we are ready.
Naive Bayes
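The call that produces the result below isn’t shown; with the crossvalidate() helper defined above, it would presumably look like this:
# assumed call; the original notebook hides this chunk
nb <- crossvalidate(uci_har, 5, 'nb')
cat('Naive Bayes Accuracy:', nb$acc)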
#> Naive Bayes Accuracy: 0.7258957
The Naive Bayes model gives a poor result with 73% accuracy. This is mainly due to the model’s underlying assumption that the predictors are independent of one another, which is not the case in our dataset. For instance, we have the following high correlations between some predictors, which the model does not capture.
set.seed(3)
col <- c(sample(names(uci_har), 6))
GGally::ggcorr(uci_har[, col], hjust = 1, layout.exp = 3, label = T)
Let’s see the confusion matrix below.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1623 14 4 0 0
#> SITTING 286 1550 1188 0 0
#> STANDING 0 187 677 0 0
#> WALKING 2 0 1 1181 93
#> WALKING_DOWNSTAIRS 4 0 0 234 1065
#> WALKING_UPSTAIRS 29 26 36 307 248
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 40
#> WALKING_DOWNSTAIRS 124
#> WALKING_UPSTAIRS 1380
The Naive Bayes model still confuses several clearly different activities, such as laying and sitting (around 300 wrong predictions). The model also struggles to differentiate walking upstairs from the stationary activities (around 90 wrong predictions).
Lastly, as can be seen below from the accuracies on the train and test datasets, the model is neither underfitting nor overfitting, except on the fourth fold. Hence, we can’t rely much on trading off bias and variance to improve performance.
#> [1] 0.7349193 0.7647131 0.7280039 0.7635290 0.7612536
#> [1] 0.7475728 0.7491702 0.7276636 0.6863137 0.7169533
Decision Tree
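As before, the call behind the result below would presumably be:
# assumed call using the same cross-validation helper
dt <- crossvalidate(uci_har, 5, 'dt')
cat('Decision Tree Accuracy:', dt$acc)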
#> Decision Tree Accuracy: 0.8599864
The Decision Tree model gives a better result with 86% accuracy. We can see why by recalling from the t-SNE result that our dataset is highly separable (except for the sitting and standing activities), a characteristic that a tree-based model can exploit. To get a sense of how the tree works, please observe the five tree diagrams below, one for each cross validation fold.
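The diagrams themselves aren’t reproduced here; they were presumably drawn with rattle’s fancyRpartPlot, along these lines:
for (model in dt$models) {
  # one plot per cross-validation fold
  fancyRpartPlot(model, sub = '')
}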
One can easily see that tGravityAccminX or tGravityAccmeanX becomes the crucial feature at the first split from the root. If this feature is less than a certain threshold, the models can perfectly predict that the corresponding activity is laying, which accounts for 19% of the training observations. This is consistent with our EDA finding that the laying activity can be distinguished by observing only one gravity-related feature along the X axis.
At the second split, based on a body acceleration signal, the models can perfectly separate sitting and standing from the moving activities (sitting and standing account for 36%, and the moving activities for 45%, of the training observations). This finding confirms our earlier analysis that stationary and moving activities can be separated fairly easily.
Then, from the third split down to the leaves, the models separate sitting from standing, and the moving activities from one another, with some errors. This indicates that these activities are quite hard for the models to distinguish. To see this clearly, here’s the confusion matrix.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1942 15 0 0 0
#> SITTING 2 1564 334 0 0
#> STANDING 0 196 1571 8 0
#> WALKING 0 0 1 1549 151
#> WALKING_DOWNSTAIRS 0 1 0 39 1149
#> WALKING_UPSTAIRS 0 1 0 126 106
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 296
#> WALKING_DOWNSTAIRS 166
#> WALKING_UPSTAIRS 1082
A small note from this table: unlike the Naive Bayes model, the Decision Tree model rarely predicts stationary activities as walking upstairs. In fact, only one such prediction is wrong, compared to 91 in the Naive Bayes model.
Next, let’s see which variables are the most important. The tables below once again confirm the importance of gravity-related features along the X axis (tGravityAccminX and angleXgravityMean) to our models.
for (model in dt$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccminX 1647.338
#> tGravityAccmeanX 1522.782
#> tGravityAccenergyX 1521.581
#> angleXgravityMean 1521.581
#> tGravityAccmaxX 1505.993
#> fBodyAccJerkbandsEnergy116 1374.982
#> fBodyAccJerkbandsEnergy124 1374.982
#> fBodyAccmadX 1374.982
#> fBodyAccmeanX 1374.982
#> tBodyAccJerkmadX 1374.178
#> Overall
#> angleXgravityMean 1494.691
#> tGravityAccenergyX 1494.691
#> tGravityAccmeanX 1494.691
#> tGravityAccminX 1494.691
#> tGravityAccmaxX 1483.883
#> fBodyAccJerkentropyX 1376.017
#> fBodyAccmadX 1376.017
#> fBodyAccmeanX 1376.017
#> tBodyAccJerkmadX 1376.017
#> tBodyAccJerkMagenergy 1376.017
#> Overall
#> angleXgravityMean 1504.419
#> tGravityAccenergyX 1504.419
#> tGravityAccmeanX 1504.419
#> tGravityAccminX 1504.419
#> tGravityAccmaxX 1488.823
#> fBodyAccJerkenergyX 1370.566
#> fBodyAccmadX 1370.566
#> fBodyAccmeanX 1370.566
#> tBodyAccJerkenergyX 1370.566
#> tBodyAccJerkstdX 1370.566
#> Overall
#> tGravityAccminX 1528.936
#> tGravityAccmeanX 1527.734
#> angleXgravityMean 1526.532
#> tGravityAccenergyX 1526.532
#> tGravityAccmaxX 1508.549
#> tBodyAccJerkenergyX 1387.769
#> tBodyAccJerkmadX 1387.769
#> tBodyAccJerkstdX 1387.769
#> tBodyAccmaxX 1387.769
#> tBodyAccstdX 1387.769
#> Overall
#> tGravityAccminX 1531.881
#> tGravityAccmeanX 1530.679
#> angleXgravityMean 1529.478
#> tGravityAccenergyX 1529.478
#> tGravityAccmaxX 1512.692
#> fBodyAccJerkbandsEnergy116 1379.583
#> fBodyAccJerkbandsEnergy124 1379.583
#> fBodyAccmadX 1379.583
#> fBodyAccmeanX 1379.583
#> tBodyAccJerkmadX 1378.776
Normally a Decision Tree model tends to overfit the training data because it can easily fit noise in the data. Luckily, that’s not the case here: as can be seen below, the train and test accuracies are close, except for the fourth fold. Hence, tree pruning is not necessary.
#> [1] 0.8897925 0.8959707 0.8811845 0.8897192 0.8691917
#> [1] 0.8660194 0.8914177 0.8643096 0.7902098 0.8855037
k-Nearest Neighbors
Since k-NN is a distance-based model, the dataset would normally be normalized before modeling so that the model treats each feature equally; otherwise, a feature with relatively large values would dominate the selection of a datapoint’s neighbors. In our case, however, we didn’t normalize the dataset, for the reason already explained at the beginning of this Modeling section.
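Again, the call behind the result below would presumably be:
# assumed call using the same cross-validation helper
knn <- crossvalidate(uci_har, 5, 'knn')
cat('k-Nearest Neighbors Accuracy:', knn$acc)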
#> k-Nearest Neighbors Accuracy: 0.8925138
The k-NN model gives an even better result than the Decision Tree model, with 89% accuracy. Again, this is because our dataset is highly separable, so the k-NN algorithm can easily group each activity.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1926 13 0 0 0
#> SITTING 5 1369 245 0 0
#> STANDING 5 390 1659 0 0
#> WALKING 1 0 1 1658 98
#> WALKING_DOWNSTAIRS 0 0 0 35 1178
#> WALKING_UPSTAIRS 7 5 1 29 130
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 90
#> WALKING_DOWNSTAIRS 52
#> WALKING_UPSTAIRS 1402
However, the k-NN model is worse than the Decision Tree model at distinguishing sitting from standing. Also, some stationary activities are predicted as walking upstairs. On the other hand, more moving activities are predicted correctly than in the Decision Tree model.
By comparing accuracy on train and test dataset, it can be seen that the model is just right with low bias and variance.
#> [1] 0.9299672 0.9269841 0.9332196 0.9308184 0.9335673
#> [1] 0.9169903 0.9184448 0.8781653 0.8516484 0.8958231
Random Forest
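The untuned Random Forest result below would presumably come from:
# assumed call using the same cross-validation helper
rf <- crossvalidate(uci_har, 5, 'rf')
cat('Random Forest Accuracy:', rf$acc)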
#> Random Forest Accuracy: 0.9336829
The Random Forest model gives the best result so far, with a whopping 93% accuracy. In particular, Random Forest is almost always better than a single Decision Tree for the following reasons:
- Random Forest is an ensemble of many Decision Trees. Its prediction is based on majority voting across the trees, and hence tends to reduce the error of a single Decision Tree’s prediction.
- Random Forest performs bagging (bootstrap aggregating) together with random feature selection, which decorrelates the individual trees and hence mitigates the overfitting problem of a single Decision Tree.
However, these advantages don’t come without shortcomings: Random Forest model is slower to train and harder to interpret than a Decision Tree.
Now let’s see the confusion matrix.
#> Reference
#> Prediction LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
#> LAYING 1942 15 0 0 0
#> SITTING 0 1626 189 0 0
#> STANDING 0 135 1717 0 0
#> WALKING 0 0 0 1585 18
#> WALKING_DOWNSTAIRS 0 0 0 27 1293
#> WALKING_UPSTAIRS 2 1 0 110 95
#> Reference
#> Prediction WALKING_UPSTAIRS
#> LAYING 0
#> SITTING 0
#> STANDING 0
#> WALKING 37
#> WALKING_DOWNSTAIRS 54
#> WALKING_UPSTAIRS 1453
The Random Forest model still has a hard time telling sitting from standing, and the moving activities from one another. Nevertheless, beyond those errors, this model misclassifies only 18 other observations, a tiny portion of the whole dataset.
Based on the variable importance tables below, we see that the Random Forest model favors gravity-related features over body-related ones.
for (model in rf$models) {
var_imp <- varImp(model)
var_imp <- var_imp %>% slice_max(Overall, n = 10)
print(var_imp)
}
#> Overall
#> tGravityAccmeanX 233.5234
#> tGravityAccminX 212.9809
#> angleXgravityMean 190.1192
#> tGravityAccmaxX 185.7354
#> angleYgravityMean 168.6782
#> tGravityAccenergyX 158.2211
#> tGravityAccminY 152.9756
#> tGravityAccmaxY 149.8530
#> tGravityAccmeanY 128.6168
#> tGravityAccenergyY 115.0822
#> Overall
#> tGravityAccmeanX 215.90652
#> tGravityAccminX 199.06699
#> tGravityAccenergyX 187.14571
#> tGravityAccmaxX 174.64894
#> angleXgravityMean 170.14726
#> tGravityAccmaxY 148.36554
#> angleYgravityMean 147.43523
#> tGravityAccmeanY 136.34275
#> tGravityAccminY 132.14115
#> tGravityAccenergyY 83.03708
#> Overall
#> angleXgravityMean 211.0124
#> tGravityAccminX 193.4731
#> tGravityAccmaxX 183.6834
#> tGravityAccenergyX 178.2531
#> tGravityAccmaxY 175.3123
#> tGravityAccmeanX 170.4459
#> tGravityAccmeanY 166.4416
#> tGravityAccminY 164.2081
#> angleYgravityMean 159.2264
#> tGravityAccenergyY 113.4814
#> Overall
#> tGravityAccmaxX 214.2470
#> tGravityAccminX 201.6110
#> tGravityAccenergyX 198.3143
#> angleXgravityMean 191.6710
#> tGravityAccmeanY 185.8804
#> tGravityAccmeanX 182.7646
#> tGravityAccmaxY 179.4252
#> angleYgravityMean 172.8559
#> tGravityAccminY 171.5347
#> tGravityAccenergyY 102.5362
#> Overall
#> tGravityAccmeanX 208.67569
#> angleXgravityMean 202.43801
#> tGravityAccminX 192.91251
#> tGravityAccenergyX 185.74270
#> tGravityAccmaxX 158.31243
#> tGravityAccmaxY 148.26482
#> angleYgravityMean 145.74691
#> tGravityAccmeanY 142.97585
#> tGravityAccminY 126.27075
#> tGravityAccenergyY 95.61133
Lastly, from the train and test accuracies below, it’s apparent that the model performs really well on both, even though the test accuracies in the third and fourth folds are still slightly below 91%. We already knew that the tendency to overfit should decrease when we switch from a Decision Tree to a Random Forest (thanks to bagging and random feature selection). However, the generalization error will not go to zero: its variance approaches zero as more trees are added, but its bias will not.
#> [1] 0.9815512 0.9805861 0.9837923 0.9865011 0.9842691
#> [1] 0.9631068 0.9720247 0.9096990 0.8971029 0.9248157
We can further improve the model with hyperparameter tuning. The idea is to grow smaller trees, trading a little bias for a reduction in variance. This is done by increasing the nodesize parameter of the Random Forest. Quoting the R documentation, nodesize is the minimum size of terminal nodes; setting this number larger causes smaller trees to be grown (and thus takes less time). The default value is 1 for classification, which tends to make the model overfit to noise in the data.
Besides nodesize, we will also tune mtry (the number of variables randomly sampled as candidates at each split, which defaults to floor(sqrt(561)) = 23 in our case). We will do a grid search, varying nodesize over c(3, 5, 7) and mtry over c(11, 16).
# establish a list of possible values for nodesize and mtry
nodesize <- c(3, 5, 7)
mtry <- c(11, 16)
# create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize)
# initialize empty vectors to store the results
rf_acc <- c()
rf_acc_train <- c()
rf_acc_test <- c()
# loop over the rows of hyper_grid
for (i in 1:nrow(hyper_grid)) {
# cross validate
rf_tuning <- crossvalidate(uci_har, 5, 'rf',
tuning = TRUE, mtry = hyper_grid$mtry[i], nodesize = hyper_grid$nodesize[i])
# store the results
rf_acc[i] <- rf_tuning$acc
rf_acc_train <- c(rf_acc_train, list(rf_tuning$acc_train))
rf_acc_test <- c(rf_acc_test, list(rf_tuning$acc_test))
}
# identify optimal set of hyperparameters based on accuracy
opt_i <- which.max(rf_acc)
print(hyper_grid[opt_i,])
#> mtry nodesize
#> 5 11 7
The best hyperparameters found are nodesize = 7 and mtry = 11. With these, the accuracy improves slightly towards 94%, as we can see below. Also, the test accuracies are more uniform across folds, all above 91%.
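The outputs below are presumably the tuned accuracy and the per-fold train and test accuracies for the best combination, i.e. something like:
rf_acc[opt_i]
rf_acc_train[opt_i]
rf_acc_test[opt_i]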
#> [1] 0.9370813
#> [[1]]
#> [1] 0.9815512 0.9788767 0.9820863 0.9849343 0.9831801
#> [[1]]
#> [1] 0.9684466 0.9687055 0.9182991 0.9130869 0.9154791
Conclusion
rbind("Naive Bayes" = nb$acc, "Decision Tree" = dt$acc, "k-Nearest Neighbors" = knn$acc, "Random Forest" = max(rf_acc))
#> Accuracy
#> Naive Bayes 0.7258957
#> Decision Tree 0.8599864
#> k-Nearest Neighbors 0.8925138
#> Random Forest 0.9370813
Based on the accuracy table above, Random Forest clearly wins as the best model: it recognizes human activities from phone sensor data with an outstanding 94% accuracy. On the other hand, Random Forest is slow to run, as it is an ensemble of 500 Decision Trees by default. Of course, we could also try simpler models such as One-vs-Rest Logistic Regression, or industry-standard boosting models like XGBoost or LightGBM, and compare the results.