Library

library(tidymodels)     # collection of modeling packages (provides the recipes functions used below)
library(caret)          # machine learning functions
library(MLmetrics)      # machine learning metrics
library(e1071)          # naive bayes
library(rpart)          # decision tree
library(randomForest)   # random forest
library(class)          # kNN
library(FactoMineR)     # PCA algorithm
library(factoextra)     # PCA and clustering visualization

Problem Statement 1

In the previous project, we performed classification on the Human Activity Recognition dataset. We know that this dataset has many features (561 to be exact) and that some of them strongly correlate with each other. A Random Forest model can classify human activity with 94% accuracy on this dataset; however, it takes a very long time to do so. The second candidate is a kNN model with 89% accuracy, followed by a Decision Tree model with 86% accuracy and a Naive Bayes model with 73% accuracy. This makes us wonder: can we trade some of the accuracy for computation time? Or can we actually improve some models' performance even further?

The idea is to reduce the dimension of the dataset just enough that most of the information is still retained. Of course, this may also reduce the performance of the models, but they will run much faster. We actually performed dimensionality reduction in the previous project using t-SNE, to visualize the relative positions of the datapoints that live in a high-dimensional space. However, t-SNE is a non-deterministic algorithm, so we varied the perplexity across three different values to make sure the result was not due to random chance. Also, t-SNE only tries to preserve the local structure of the data and does not preserve its variance.

In this project, we introduce a new algorithm called Principal Component Analysis (PCA) as a substitute for t-SNE for dimensionality reduction. PCA is a deterministic algorithm, it preserves the global structure of the data, and it lets us decide how much variance to preserve using the eigenvalues.
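As a minimal illustration of that last point (a sketch on the built-in mtcars dataset, not the HAR data), the eigenvalues reported by prcomp() are the squared standard deviations of the principal components, and their cumulative proportion tells us how many components are needed to preserve a given amount of variance.

# minimal sketch on the built-in mtcars dataset, just for illustration
pc_demo <- prcomp(mtcars, scale. = TRUE)

# eigenvalues are the squared standard deviations of the components
eigenvalues <- pc_demo$sdev^2

# cumulative proportion of variance explained by the first m components;
# the smallest m reaching the desired threshold (e.g. 90%) is the number to keep
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)
which(cum_var >= 0.90)[1]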

Dataset

Let’s read the dataset. Detailed explanation about the dataset can be found in the previous project.

uci_har <- read.csv("UCI HAR.csv")
dim(uci_har)
#> [1] 10299   563

Data Cleaning

Convert each feature to its appropriate type.

uci_har <- uci_har %>% 
  mutate_at(c('subject', 'Activity'), as.factor) %>% 
  mutate_at(vars(-subject, -Activity), as.numeric)

lvl <- levels(uci_har$Activity)
lvl
#> [1] "LAYING"             "SITTING"            "STANDING"          
#> [4] "WALKING"            "WALKING_DOWNSTAIRS" "WALKING_UPSTAIRS"

We won’t do any further data cleaning since it’s known in the previous project that the dataset is already clean.

Principal Component Analysis

The PCA algorithm summarizes the information (variance) of the original features into new dimensions called Principal Components (PCs). Our dataset is suitable for PCA because it has many features, many of which are correlated. So let's perform the PCA right away. The barplot below shows that the variance of the dataset can largely be explained by just a few principal components. For the purposes of this analysis, we will take only 10 principal components.

principal_component <- prcomp(x = uci_har %>% select(-c(subject, Activity)), scale. = TRUE, rank. = 10)
plot(principal_component)

Below are the resulting principal components in numbers, along with their summary.

head(as.data.frame(principal_component$x))
#>        PC1       PC2        PC3        PC4      PC5        PC6        PC7
#> 1 16.38018 -1.994986 -3.4155244  0.6498264 7.824682 -2.7718293  2.2981737
#> 2 15.58142 -1.182536  0.3211912 -2.7479498 4.729307 -1.5887459 -0.3340349
#> 3 15.42324 -2.243058  1.2377235 -4.0026866 4.402521 -1.0350383 -0.1297623
#> 4 15.64705 -3.762700  1.2752210 -2.8065268 3.238953 -0.7434877  0.3260538
#> 5 15.84155 -4.438682  1.8081439 -3.1603532 3.331010 -0.9115065 -0.8618895
#> 6 15.64401 -4.619950  2.1287441 -2.7722016 2.462343 -0.8805438 -1.1847999
#>           PC8          PC9        PC10
#> 1 -5.22746872 -1.335465100  3.75994762
#> 2 -1.62109884 -0.006348803 -0.07202641
#> 3 -1.27912309  0.190738822  0.78085331
#> 4 -1.74289880  0.912173701  1.59466696
#> 5 -0.09012166  0.521603541 -1.01580251
#> 6  1.40402104  0.692842819 -1.48599448
summary(principal_component)
#> Importance of first k=10 (out of 561) components:
#>                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
#> Standard deviation     16.8713 5.91623 3.88655 3.70953 3.25529 3.02525 2.81701
#> Proportion of Variance  0.5074 0.06239 0.02693 0.02453 0.01889 0.01631 0.01415
#> Cumulative Proportion   0.5074 0.56977 0.59670 0.62123 0.64012 0.65643 0.67058
#>                            PC8     PC9    PC10
#> Standard deviation     2.61208 2.35101 2.30763
#> Proportion of Variance 0.01216 0.00985 0.00949
#> Cumulative Proportion  0.68274 0.69259 0.70208

We can see that these 10 principal components capture about 70% of the variance of the original dataset. Note that the principal components are mutually orthogonal (perpendicular to each other) and hence uncorrelated, as shown below.

GGally::ggcorr(data = principal_component$x, label = TRUE)

Lastly, keep in mind that each principal component combines information from every feature in the original dataset. As a result, the components are not directly interpretable. However, we can approximately reconstruct the original features using the eigenvectors (the rotation/loadings matrix), even though the result will not be exactly the same as the original dataset.
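As an illustration of that last point, here is a minimal sketch of the approximate reconstruction using the rotation (loadings), center, and scale stored by prcomp(); since we kept only 10 components, the result is only an approximation of the original feature matrix.

# project the 10 principal components back onto the original feature space
approx_scaled <- principal_component$x %*% t(principal_component$rotation)

# undo the scaling and centering applied by prcomp(scale. = TRUE)
approx_original <- sweep(approx_scaled, 2, principal_component$scale, "*")
approx_original <- sweep(approx_original, 2, principal_component$center, "+")

dim(approx_original)  # same dimensions as the original feature matrix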

We can take the first two principal components and plot them to gain some insight. Let's perform the PCA once again, this time with FactoMineR's PCA() so we can use its plotting and description utilities.

uci_har_pca <- PCA(X = uci_har %>% select(-subject),
                   scale.unit = TRUE,
                   quali.sup = 562,
                   graph = FALSE,
                   ncp = 10)

head(uci_har_pca$ind$coord)
#>       Dim.1    Dim.2      Dim.3     Dim.4    Dim.5     Dim.6      Dim.7
#> 1 -16.38098 1.995083  3.4156902  0.649858 7.825062 2.7719639  2.2982853
#> 2 -15.58217 1.182594 -0.3212068 -2.748083 4.729536 1.5888231 -0.3340511
#> 3 -15.42399 2.243166 -1.2377836 -4.002881 4.402735 1.0350886 -0.1297686
#> 4 -15.64781 3.762882 -1.2752829 -2.806663 3.239111 0.7435238  0.3260696
#> 5 -15.84232 4.438897 -1.8082316 -3.160507 3.331172 0.9115508 -0.8619313
#> 6 -15.64477 4.620174 -2.1288474 -2.772336 2.462463 0.8805866 -1.1848574
#>         Dim.8        Dim.9     Dim.10
#> 1  5.22772253  1.335529939 -3.7601302
#> 2  1.62117755  0.006349111  0.0720299
#> 3  1.27918520 -0.190748083 -0.7808912
#> 4  1.74298342 -0.912217989 -1.5947444
#> 5  0.09012604 -0.521628866  1.0158518
#> 6 -1.40408921 -0.692876458  1.4860666

Then, pick the first two principal components and plot.

plot.PCA(x = uci_har_pca,
         choix = "ind",
         invisible = "quali",
         select = "contrib 10",
         habillage = "Activity")

There are several things to unpack here:

  1. Dim 1 explains 51% variance of the original dataset, whereas Dim 2 explains 6%.
  2. Dim 1 clearly separates stationary activities (laying, sitting, and standing) from moving activities (walking, walking downstairs, and walking upstairs): Dim 1 less than 0 indicates stationary activities and Dim 1 greater than 0 indicates moving activities (a quick cross-tab sketch after this list checks this).
  3. There are several outliers in the walking downstairs category. Perhaps some people really do have a different style of walking downstairs?
  4. Sitting and standing activities are stacked on top of each other, making them hard to classify.
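As a quick check of point 2 (a sketch using the FactoMineR coordinates computed above; note that the sign of a principal component is arbitrary up to a flip), we can cross-tabulate the sign of Dim 1 against the activity labels.

# TRUE = observation falls on the negative (stationary) side of Dim 1
stationary_side <- uci_har_pca$ind$coord[, "Dim.1"] < 0
table(uci_har$Activity, stationary_side)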

Next, we can see that many body-related features contribute to Dim 1, and some additional gravity-related features contribute to Dim 2.

dim <- dimdesc(uci_har_pca)
head(as.data.frame(dim$Dim.1$quanti))
#>                     correlation p.value
#> fBodyAccsma           0.9889282       0
#> fBodyAccJerksma       0.9882851       0
#> fBodyGyrosma          0.9878669       0
#> tBodyAccJerksma       0.9877968       0
#> tBodyAccJerkMagsma    0.9868902       0
#> tBodyAccJerkMagmean   0.9868902       0
head(as.data.frame(dim$Dim.2$quanti))
#>                        correlation p.value
#> fBodyAccmeanFreqZ        0.7359410       0
#> tBodyGyroMagarCoeff1     0.7133977       0
#> fBodyAccMagmeanFreq      0.7080972       0
#> tGravityAccarCoeffZ1     0.7075568       0
#> tGravityAccMagarCoeff1   0.7068318       0
#> tBodyAccMagarCoeff1      0.7068318       0

Cross Validation

As before, we do 5-fold cross-validation by splitting the dataset such that the subjects in the train and test sets don't intersect.

set.seed(2072) # for reproducibility
subject_id <- unique(uci_har$subject)
folds <- sample(1:5, 30, replace = TRUE) # assign each of the 30 subjects to one of 5 folds
d <- data.frame(subject = subject_id, fold = folds)
uci_har$folds <- d$fold[match(uci_har$subject, d$subject)] # map each observation to its subject's fold
uci_har <- uci_har %>% 
  mutate(folds = as.factor(folds)) %>% 
  select(-subject)

Please note that after creating the folds for each observation, we discarded the subject feature since it's no longer needed in the analysis. Lastly, we can also see below that the data is distributed evenly among the folds, so there is no class imbalance across folds.
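As a quick numeric check (a sketch complementing the plot below), a cross-tabulation of folds and activities shows the counts directly.

# counts of each activity per fold (numeric counterpart of the plot below)
table(uci_har$folds, uci_har$Activity)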

ggplot(uci_har %>%
         group_by(folds, Activity) %>%
         count(name = 'activity_count'),
       aes(x = folds, y = activity_count, fill = Activity)) +
  geom_bar(stat = 'identity')

Modeling

Now, let's do the modeling using 5-fold cross-validation. The function below is adapted from the crossvalidate function in the previous project. The only differences are that this time we:

  1. remove any feature that has near-zero variance using step_nzv function,
  2. center and scale the data (since this is necessary for PCA) using step_center and step_scale functions, and
  3. incorporate PCA using step_pca function.

All four functions mentioned above come from the tidymodels library (specifically its recipes package). These steps are fitted on the train dataset, and the test dataset is then transformed using the parameters learned during fitting. We keep as many principal components as are needed to explain at least 90% of the cumulative variance of the dataset.

crossvalidate <- function(data, k, model_name) {
  # 'data' is the training set with the 'folds' column
  # 'k' is the number of folds we have
  # 'model_name' is a string describing the model being used
  
  # initialize empty lists for recording performances
  acc_train <- c()
  acc_test <- c()
  y_preds <- c()
  y_tests <- c()
  models <- c()
  
  # one iteration per fold
  for (fold in 1:k) {
    
    # create training set for this iteration
    # subset all the datapoints where folds does not match the current fold
    train <- data %>% filter(folds != fold)
    y_train <- train$Activity
    
    # create test set for this iteration
    # subset all the datapoints where folds matches the current fold
    test <- data %>% filter(folds == fold)
    y_test <- test$Activity
    
    # create PCA pipeline
    rec <- recipe(Activity~., data = train) %>%
       step_rm(folds) %>% 
       step_nzv(all_predictors()) %>%
       step_center(all_numeric()) %>%
       step_scale(all_numeric()) %>%
       step_pca(all_numeric(), threshold = 0.90) %>%
       prep()
    
    # perform PCA pipeline
    X_train_dt <- juice(rec)
    X_test_dt <- bake(rec, test)
    X_train <- X_train_dt %>% select(-Activity)
    X_test <- X_test_dt %>% select(-Activity)
    
    print(glue::glue("Fold {fold} dataset reduced to {ncol(X_train)} dimensions."))
    
    # train & predict
    switch(model_name,
      nb = {
        model <- naiveBayes(x = X_train, y = y_train, laplace = 1)
        y_pred <- predict(model, X_test, type = 'class')
        y_pred_train <- predict(model, X_train, type = 'class')
        models <- c(models, list(model))
      },
      dt = {
        model <- rpart(formula = Activity ~ ., data = X_train_dt, method = 'class')
        y_pred <- predict(model, X_test_dt, type = 'class')
        y_pred_train <- predict(model, X_train_dt, type = 'class')
        models <- c(models, list(model))
      },
      knn = {
        k_nn <- round(sqrt(nrow(X_train)))  # rule-of-thumb choice: k ~ sqrt(n); renamed to avoid shadowing the fold count k
        y_pred <- knn(train = X_train, test = X_test, cl = y_train, k = k_nn)
        y_pred_train <- knn(train = X_train, test = X_train, cl = y_train, k = k_nn)
      },
      rf = {
        model <- randomForest(x = X_train, y = y_train, xtest = X_test, ytest = y_test)
        y_pred <- model$test$predicted
        y_pred_train <- model$predicted
        models <- c(models, list(model))
      },
      {
        print("Model is not recognized. Try to input 'nb', 'dt', 'knn', or 'rf'.")
        return()
      }
    )
    
    # populate corresponding lists
    acc_train[fold] <- Accuracy(y_pred_train, y_train)
    acc_test[fold] <- Accuracy(y_pred, y_test)
    y_preds <- append(y_preds, y_pred)
    y_tests <- append(y_tests, y_test)
  }
  
  # convert back to factor
  y_preds <- factor(y_preds, labels = lvl)
  y_tests <- factor(y_tests, labels = lvl)
  
  # get the accuracy between the predicted and the observed
  cm <- confusionMatrix(y_preds, y_tests)
  cm_table <- cm$table
  acc <- cm$overall['Accuracy']
  
  # return the results
  if (model_name == 'knn') {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test))
  } else {
    return(list('cm' = cm_table, 'acc' = acc, 'acc_train' = acc_train, 'acc_test' = acc_test, 'models' = models))
  }
}

Run the function on several models.

nb <- crossvalidate(uci_har, 5, 'nb')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Naive Bayes Accuracy:", nb$acc)
#> Naive Bayes Accuracy: 0.8010486
dt <- crossvalidate(uci_har, 5, 'dt')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Decision Tree Accuracy:", dt$acc)
#> Decision Tree Accuracy: 0.7174483
knn <- crossvalidate(uci_har, 5, 'knn')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("k-Nearest Neighbors Accuracy:", knn$acc)
#> k-Nearest Neighbors Accuracy: 0.8750364
rf <- crossvalidate(uci_har, 5, 'rf')
#> Fold 1 dataset reduced to 65 dimensions.
#> Fold 2 dataset reduced to 65 dimensions.
#> Fold 3 dataset reduced to 64 dimensions.
#> Fold 4 dataset reduced to 64 dimensions.
#> Fold 5 dataset reduced to 65 dimensions.
cat("Random Forest Accuracy:", rf$acc)
#> Random Forest Accuracy: 0.8714438

We are not going into the details of analyzing these models since this was already done in the previous project. One thing to note here is that the PCA pipeline captures 90% of the original dataset's variance with only around 65 features, almost nine times fewer than the original 561 features! This comes with consequences, some good and some bad (the accuracy figures are recapped in a small table after the list):

  1. Naive Bayes accuracy improves from 73% to 80%. This is because the features produced by PCA are orthogonal and hence uncorrelated with one another, which fits the Naive Bayes independence assumption much better.

  2. Tree-based models such as Decision Tree and Random Forest get worse. Decision Tree accuracy decreases from 86% to 72% and Random Forest accuracy decreases from 94% to 87%. This may be because each principal component mixes information from all of the original features, which blurs the class boundaries along any single axis and makes it harder for tree-based models to place good splits.

  3. The kNN model isn't affected that much. There's only a minor decrease in accuracy, from 89% to 88%, which is expected since we keep only 90% of the original dataset's variance.
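For convenience, here is a small manual recap of the accuracy figures quoted above (previous project without PCA versus this project with PCA); it simply collects the numbers already reported, rather than performing a new computation.

# manual recap of the accuracies quoted above (not a new computation)
acc_recap <- data.frame(
  model        = c("Naive Bayes", "Decision Tree", "kNN", "Random Forest"),
  acc_no_pca   = c(0.73, 0.86, 0.89, 0.94),
  acc_with_pca = unname(round(c(nb$acc, dt$acc, knn$acc, rf$acc), 2))
)
acc_recap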

Problem Statement 2

Assuming we don’t know what the target variable looks like, let’s drop Activity entirely.

uci_har <- uci_har %>% select(-c(Activity, folds))

Now we are tasked with clustering each observation in the dataset into several categories. How do we do this? How many clusters should we make?

Clustering

We will use k-means clustering to categorize the observations. k-means is a centroid-based clustering algorithm that works as follows (a minimal from-scratch sketch is given after the list).

  1. Random initialization: place \(k\) centroids randomly.
  2. Cluster assignment: assign each observation to the closest cluster, based on the distance to centroids.
  3. Centroid update: move centroids to the means of observations of the same cluster.
  4. Repeat step 2 and 3 until convergence is reached.
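To make these steps concrete, below is a minimal from-scratch sketch of the algorithm using Euclidean distance. It is illustrative only: it does not handle edge cases such as empty clusters, and the analysis below relies on the built-in kmeans() instead.

# minimal from-scratch k-means sketch (Euclidean distance), illustrative only
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # 1. random initialization: pick k observations as the starting centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in 1:max_iter) {
    # 2. cluster assignment: squared Euclidean distance to each centroid,
    #    then assign each observation to the closest one
    dists <- sapply(1:k, function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    cluster <- max.col(-dists)
    # 3. centroid update: move each centroid to the mean of its cluster
    new_centroids <- t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
    # 4. repeat until convergence (centroids stop moving)
    if (all(abs(new_centroids - centroids) < 1e-10)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}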

There are two hyperparameters that must be determined by the user when implementing k-means clustering: the number of centroids \(k\), and the distance metric used (usually Euclidean or Manhattan distance). In this project, we will use the Euclidean distance, given by \(d(x,y) = \sqrt{\sum_{i=1}^n (x_{i} - y_{i})^2}\) where \(x = (x_1, x_2, \dots, x_n)\) and \(y = (y_1, y_2, \dots, y_n)\) are observations with dimension \(n\). Hence, we only need to determine \(k\).

A good clustering is one with a low Within Sum of Squares (WSS), that is, a low total squared distance from each observation to the centroid of the cluster it belongs to. A good clustering also has a high ratio of Between Sum of Squares (BSS) to WSS, where BSS is the sum of squared distances from each centroid to the global mean, weighted by the number of observations in each cluster. It's easy to see that a higher \(k\) corresponds to a lower WSS and a higher BSS/WSS ratio. However, \(k\) should be determined from a business point of view, or more objectively using what's called the elbow method.

uci_har_scaled <- scale(uci_har)
fviz_nbclust(x = uci_har_scaled, FUNcluster = kmeans, method = "wss")

The elbow method says that we should pick the \(k\) at which increasing it further no longer yields a significant decrease in WSS. From the plot above, we therefore choose \(k = 2\). Since k-means clustering is a non-deterministic algorithm (due to the random initialization of centroids), we will run k-means 5 times and take the mode of the assigned clusters for each observation.

set.seed(42) # for reproducibility
result <- c()
for (i in 1:5) {
  uci_har_km <- kmeans(x = uci_har_scaled, centers = 2)
  print(glue::glue("Trial {i} BSS/WSS: {uci_har_km$betweenss / uci_har_km$tot.withinss}"))
  result <- c(result, list(uci_har_km$cluster))
}
#> Trial 1 BSS/WSS: 0.765350481077985
#> Trial 2 BSS/WSS: 0.765350481077985
#> Trial 3 BSS/WSS: 0.765350481077985
#> Trial 4 BSS/WSS: 0.765350481077985
#> Trial 5 BSS/WSS: 0.765350481077985
cluster <- apply(as.data.frame(result), 1, function(idx) which.max(tabulate(idx)))

Since BSS/WSS gives the same number for every trial, the clustering most probably produced the same result in all trials. We can visualize this clustering as follows.

fviz_cluster(object = uci_har_km, data = uci_har_scaled, labelsize = 0)

We can easily see that the two clusters most probably correspond to stationary activities (cluster 2) and moving activities (cluster 1), separated by Dim 1 at around zero. We can summarize this profiling as follows.

uci_har$cluster <- cluster
temp <- uci_har %>% group_by(cluster) %>% summarise_all("mean")
DT::datatable(temp, options = list(scrollX=T))
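As an additional check on this interpretation (a sketch that peeks at the labels we deliberately set aside for the unsupervised exercise), we can re-read the Activity column from the CSV and cross-tabulate it with the cluster assignments.

# re-read the labels that were dropped earlier and compare them with the clusters
activity_labels <- read.csv("UCI HAR.csv")$Activity
table(activity_labels, cluster)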

Summary

PCA is a dimensionality reduction technique that is useful for working with high-dimensional data for two reasons: visualization and simplification. In this project, we saw that applying PCA to the Human Activity Recognition dataset before fitting the models gives mixed results: some models get better, some get worse, and some are not much affected. We also saw that clustering this dataset without knowing the target variable gives a good separation between stationary and moving activities.