1 Introduction

This project implements unsupervised machine learning algorithms—Principal Component Analysis (PCA) and Hierarchical Clustering (HAC)—for feature extraction. The goal is to discover meaningful patterns and transform raw data into more informative representations without relying on a labeled response variable. These new features will later be incorporated into a binary classification model to assess the benefits of unsupervised feature engineering.

1.1 Data Set Description

The dataset used for this analysis is the Mental Health and Social Media Balance Dataset, sourced from Kaggle.com. The dataset contains 500 observations and 10 features, focusing on the relationship between social media usage habits and mental well-being metrics.

The dataset includes the required components for this project:

* Numerical Variables: Several highly correlated numerical variables (e.g., screen time, stress, sleep quality) are present for PCA.
* Binary Categorical Variable: A binary target is engineered from the Happiness_Index(1-10) for the later classification task.

2 Setup and Data Loading

This section loads the necessary R libraries and the dataset.

# Load the dataset using base R function
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 3")
df_raw <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv")

3 Feature Engineering and Data Preparation

A binary categorical variable is engineered from the Happiness_Index(1-10) to serve as the target for the future classification model. Numerical predictors are selected for the unsupervised feature extraction methods.

3.1 Creating Predictor and Target Variables

# read.csv() sanitizes column names by default (check.names = TRUE), so "(", "-", and ")" become "."
df_analysis <- data.frame(
  Age = df_raw$Age,
  ScreenTime = df_raw$Daily_Screen_Time.hrs.,
  SleepQuality = df_raw$Sleep_Quality.1.10.,
  StressLevel = df_raw$Stress_Level.1.10.,
  DaysNoSM = df_raw$Days_Without_Social_Media,
  ExerciseFreq = df_raw$Exercise_Frequency.week.,
  HappinessIndex = df_raw$Happiness_Index.1.10.
)

# Create the binary target variable (Y): High Happiness (Index > 8) vs. Low Happiness (Index <= 8)
df_analysis$HighHappiness <- factor(ifelse(df_analysis$HappinessIndex > 8, "High", "Low"))

# Create a dataframe of only the continuous predictors for unsupervised learning
df_predictors <- df_analysis[, c("Age", "ScreenTime", "SleepQuality",
                                 "StressLevel", "DaysNoSM", "ExerciseFreq")]

# Check the distribution of the new binary target variable
cat("Distribution of the HighHappiness Target:\n")
## Distribution of the HighHappiness Target:
print(table(df_analysis$HighHappiness))
## 
## High  Low 
##  256  244

4 Exploratory Data Analysis (EDA)

4.1 Correlation Matrix and Heatmap

The correlation matrix is computed and visualized as a heatmap to check that the numerical variables are sufficiently intercorrelated, which is what makes dimensionality reduction via PCA effective.

# Calculate the correlation matrix
cor_matrix <- cor(df_predictors)

# Heatmap visualization of the Correlation Matrix
heatmap(cor_matrix,
        main = "Correlation Heatmap",
        symm = TRUE, # Matrix is symmetric
        col = colorRampPalette(c("blue", "white", "red"))(100),
        cexRow = 0.8,
        cexCol = 0.8)

Summary of EDA: The heatmap suggests correlations suitable for PCA. Strong intercorrelations are observed among ScreenTime, StressLevel, and SleepQuality. Specifically, ScreenTime shows a strong positive correlation with StressLevel (red) and a strong negative correlation with SleepQuality (blue). StressLevel and SleepQuality also exhibit a strong negative correlation (blue), so the three well-being metrics move together. The remaining variables (DaysNoSM, ExerciseFreq, and Age) show generally weak correlations with these metrics and with each other. Because heatmap() spreads the color ramp over the observed range of values, white does not necessarily mark a correlation of zero and the colors indicate relative strength only; the exact coefficients are stored in cor_matrix and can be inspected with round(cor_matrix, 2).


5 Unsupervised ML Algorithms for Feature Extraction

5.1 Principal Component Analysis (PCA)

PCA is applied to the standardized predictor data to create new, orthogonal features.

5.1.1 Standardizing Data and Fitting PCA

# Standardize the predictors so that variables measured on different scales contribute equally to the PCA
df_scaled <- scale(df_predictors)

# Perform PCA using the core prcomp function
pca_fit <- prcomp(df_scaled)

# Display the variance explained by each principal component (PC)
cat("Variance Explained by Principal Components:\n")
## Variance Explained by Principal Components:
print(summary(pca_fit))
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.5496 1.0353 0.9989 0.9633 0.64294 0.43349
## Proportion of Variance 0.4002 0.1786 0.1663 0.1547 0.06889 0.03132
## Cumulative Proportion  0.4002 0.5788 0.7451 0.8998 0.96868 1.00000

Summary of PCA Fit: The summary() output shows that the first three Principal Components (PCs) together account for 74.5% of the total variance (40.0%, 17.9%, and 16.6%, respectively). This falls just short of the common guideline of retaining enough components to capture at least 80% of the variance; reaching that threshold would require PC4 as well (cumulative 90.0%). As a compromise between parsimony and coverage, we retain PC1, PC2, and PC3 for the classification models.

5.1.2 Visualizing PCA

# 1. Scree Plot
plot(pca_fit, type = "l", main = "Scree Plot")

# 2. Biplot (PC1 vs PC2)
biplot(pca_fit,
       main = "PCA Biplot (PC1 vs PC2)",
       cex = 0.8,
       scale = 0) 

# Extract and store the first three PCs as new features
pca_scores <- as.data.frame(pca_fit$x)
df_analysis <- cbind(df_analysis, pca_scores[, 1:3])

Summary of PCA Visuals: The eigenvalues (squared standard deviations from the summary output) are approximately 2.40, 1.07, 1.00, and 0.93 for PC1 through PC4, so the Scree Plot drops most sharply after PC1, stays nearly flat from PC2 through PC4, and drops again after PC4 rather than showing a single clear “elbow”. Applied strictly, the eigenvalue > 1 criterion retains only PC1 and PC2, because the PC3 eigenvalue (0.9989² ≈ 0.998) falls marginally below 1; retaining PC3 is therefore a borderline choice. The Biplot (PC1 vs PC2) reveals the structure of the first two components: PC1 captures the dimension of “Mental Health/Social Media Load”, as the strong vectors for ScreenTime, StressLevel (positive PC1), and SleepQuality (negative PC1) align along the horizontal axis. PC2 captures a separate dimension primarily driven by ExerciseFreq and Age, which align along the vertical axis. The two components are uncorrelated by construction, and together they summarize 57.9% of the total variance.

5.2 Clustering (Hierarchical)

Hierarchical Clustering (HAC) is performed to group observations, generating a new categorical feature based on cluster membership.

5.2.1 Performing Hierarchical Clustering

# 1. Calculate the dissimilarity matrix using Euclidean distance on scaled data
d_scaled <- dist(df_scaled, method = "euclidean")

# 2. Perform Hierarchical Clustering using Ward's method
hc_fit <- hclust(d_scaled, method = "ward.D2")

# 3. Visualize the Dendrogram
par(mar = c(5, 4, 4, 2) + 0.1) 

# Simplified plot call
plot(hc_fit, 
     labels = FALSE,
     hang = -1, 
     cex = 0.5, # Reduced size for stability
     main = "Hierarchical Clustering Dendrogram (Ward's Method)")

# Reset the margins after plotting
par(mar = c(5, 4, 4, 2) + 0.1) 

5.2.2 Determining and Extracting Clusters

The dendrogram is cut to form \(k=3\) clusters, which generates the new categorical feature HAC_Cluster.

# Cut the tree into k=3 clusters
k <- 3
clusters_hac <- cutree(hc_fit, k = k)

# Add the cluster assignment (new feature) to the main analysis dataframe
df_analysis$HAC_Cluster <- as.factor(clusters_hac)

5.2.3 Characterizing Clusters

The new clusters are characterized by summarizing the mean of each original (unscaled) variable within each group.

# Summarize the means of the original (unscaled) variables by the new cluster feature
cluster_summary <- aggregate(. ~ HAC_Cluster,
                             data = df_analysis[, c(names(df_predictors), "HAC_Cluster")],
                             FUN = mean)

# Calculate cluster size
cluster_sizes <- table(df_analysis$HAC_Cluster)
cluster_summary$Count <- cluster_sizes[match(cluster_summary$HAC_Cluster, names(cluster_sizes))]


cat("Summary of Variable Means (Original Units) by Cluster:\n")
## Summary of Variable Means (Original Units) by Cluster:
print(cluster_summary)
##   HAC_Cluster      Age ScreenTime SleepQuality StressLevel DaysNoSM
## 1           1 31.17600   5.607600     6.140000    6.712000 2.488000
## 2           2 34.87963   7.369444     4.796296    8.203704 4.287037
## 3           3 34.73944   3.994366     7.739437    5.246479 3.394366
##   ExerciseFreq Count
## 1     2.848000   250
## 2     2.175926   108
## 3     1.950704   142

Summary of Clustering: The HAC segmented the data into three lifestyle groups. The table reports cluster means in the original units of each variable (not standardized values), so the clusters are best compared with one another. The means suggest the following archetypes:

* Cluster 1: “Active & Moderately Stressed”: The highest mean for ExerciseFreq (2.85) and the lowest mean for Age (31.18). This is the largest group (250 observations) and the most active. Its StressLevel (6.71), SleepQuality (6.14), and ScreenTime (5.61 hours) all fall between the values of the other two clusters.
* Cluster 2: “High Stress & Low Health”: The highest means for ScreenTime (7.37 hours) and StressLevel (8.20), and the lowest mean for SleepQuality (4.80). This group represents the highest-risk lifestyle archetype. It also has the highest mean for DaysNoSM (4.29) and, by a small margin, the highest mean Age (34.88); its ExerciseFreq (2.18) is below the overall mean of about 2.45.
* Cluster 3: “Relaxed & Low Activity”: The lowest means for ScreenTime (3.99) and StressLevel (5.25), and the highest mean for SleepQuality (7.74). This group enjoys the best mental well-being and sleep metrics. However, it also has the lowest mean for ExerciseFreq (1.95), suggesting a relatively sedentary lifestyle. Its mean Age (34.74) is nearly identical to that of Cluster 2.


6 Conclusion

This first phase of the project successfully implemented two key unsupervised feature extraction techniques: Principal Component Analysis (PCA) and Hierarchical Clustering (HAC).

  1. PCA: The analysis reduced the dimensionality of the six numerical predictors into three orthogonal components (PC1, PC2, PC3). These components now serve as three new continuous features that capture the majority of the original data’s variance (approximately 74.5%) and are mutually uncorrelated, which removes the multicollinearity present in the raw features.
  2. Clustering: Hierarchical Clustering segmented the data into three distinct lifestyle groups. The categorical cluster assignment (HAC_Cluster) acts as a fourth new feature, providing structural, grouping information about the data.

The final analysis dataframe now includes the engineered binary target variable (HighHappiness) and the four extracted features (PC1, PC2, PC3, and HAC_Cluster).

---
title: "Project Three: Feature Extraction with Unsupervised Algorithms, Part 1: PCA and Clustering"
author: "Jeff Delva"
date: "November 12th, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

## Introduction

This project implements unsupervised machine learning algorithms—**Principal Component Analysis (PCA)** and **Hierarchical Clustering (HAC)**—for feature extraction. The goal is to discover meaningful patterns and transform raw data into more informative representations without relying on a labeled response variable. These new features will later be incorporated into a binary classification model to assess the benefits of unsupervised feature engineering.

### Data Set Description

The dataset used for this analysis is the **Mental Health and Social Media Balance Dataset**, sourced from **Kaggle.com**. The dataset contains **500 observations** and 10 features, focusing on the relationship between social media usage habits and mental well-being metrics.

The dataset includes the required components for this project:

* **Numerical Variables:** Several highly correlated numerical variables (e.g., screen time, stress, sleep quality) are present for PCA.
* **Binary Categorical Variable:** A binary target is engineered from the `Happiness_Index(1-10)` for the later classification task.

## Setup and Data Loading

This section loads the necessary R libraries and the dataset.

```{r setup, include=FALSE}
# Load necessary libraries: only core packages for the required algorithms
library(stats) # prcomp(), cor(), dist(), hclust(), cutree(); attached by default
library(cluster) # Additional clustering utilities (not strictly required by the code below)

# Set global options
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    fig.width = 8,
    fig.height = 5
)
```

```{r data-load}
# Load the dataset using base R function
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 3")
df_raw <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv")
```

## Feature Engineering and Data Preparation

A binary categorical variable is engineered from the `Happiness_Index(1-10)` to serve as the target for the future classification model. Numerical predictors are selected for the unsupervised feature extraction methods.

### Creating Predictor and Target Variables

```{r feature-engineering}
# read.csv() sanitizes column names by default (check.names = TRUE), so "(", "-", and ")" become "."
df_analysis <- data.frame(
  Age = df_raw$Age,
  ScreenTime = df_raw$Daily_Screen_Time.hrs.,
  SleepQuality = df_raw$Sleep_Quality.1.10.,
  StressLevel = df_raw$Stress_Level.1.10.,
  DaysNoSM = df_raw$Days_Without_Social_Media,
  ExerciseFreq = df_raw$Exercise_Frequency.week.,
  HappinessIndex = df_raw$Happiness_Index.1.10.
)

# Create the binary target variable (Y): High Happiness (Index > 8) vs. Low Happiness (Index <= 8)
df_analysis$HighHappiness <- factor(ifelse(df_analysis$HappinessIndex > 8, "High", "Low"))

# Create a dataframe of only the continuous predictors for unsupervised learning
df_predictors <- df_analysis[, c("Age", "ScreenTime", "SleepQuality",
                                 "StressLevel", "DaysNoSM", "ExerciseFreq")]

# Check the distribution of the new binary target variable
cat("Distribution of the HighHappiness Target:\n")
print(table(df_analysis$HighHappiness))
```
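The dotted names used in the chunk above come from `read.csv()`'s default `check.names = TRUE`, which passes each header through `make.names()`. The sketch below illustrates that renaming and the thresholding rule for the target. Apart from `Happiness_Index(1-10)`, the raw header strings are assumptions inferred from the sanitized names, and the small `happiness` vector is made up for illustration.

```r
# Raw headers as they presumably appear in the CSV (assumed, except
# Happiness_Index(1-10), which the text names explicitly)
raw_headers <- c("Daily_Screen_Time(hrs)", "Sleep_Quality(1-10)",
                 "Happiness_Index(1-10)")

# make.names() replaces characters that are invalid in R names with "."
clean_headers <- make.names(raw_headers)
print(clean_headers)

# Thresholding rule used for the target: strictly greater than 8 is "High"
happiness <- c(6, 8, 9, 10)  # illustrative values, not taken from the dataset
target <- factor(ifelse(happiness > 8, "High", "Low"))
print(table(target))
```

Note that a score of exactly 8 falls in the "Low" class under this rule.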

## Exploratory Data Analysis (EDA)

### Correlation Matrix and Heatmap

The correlation matrix is computed and visualized as a heatmap to check that the numerical variables are sufficiently intercorrelated, which is what makes dimensionality reduction via PCA effective.

```{r correlation-matrix}
# Calculate the correlation matrix
cor_matrix <- cor(df_predictors)

# Heatmap visualization of the Correlation Matrix
heatmap(cor_matrix,
        main = "Correlation Heatmap",
        symm = TRUE, # Matrix is symmetric
        col = colorRampPalette(c("blue", "white", "red"))(100),
        cexRow = 0.8,
        cexCol = 0.8)
```

**Summary of EDA:** The heatmap suggests correlations suitable for PCA. Strong intercorrelations are observed among `ScreenTime`, `StressLevel`, and `SleepQuality`. Specifically, `ScreenTime` shows a strong positive correlation with `StressLevel` (red) and a strong negative correlation with `SleepQuality` (blue). `StressLevel` and `SleepQuality` also exhibit a strong negative correlation (blue), so the three well-being metrics move together. The remaining variables (`DaysNoSM`, `ExerciseFreq`, and `Age`) show generally weak correlations with these metrics and with each other. Because `heatmap()` spreads the color ramp over the observed range of values, white does not necessarily mark a correlation of zero and the colors indicate relative strength only; the exact coefficients are stored in `cor_matrix` and can be inspected with `round(cor_matrix, 2)`.
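To read exact coefficients rather than colors, the correlation matrix can be printed and its strongest pairs listed. The following is a minimal sketch on synthetic stand-in data (the `toy` data frame and its coefficients are invented for illustration, not taken from the dataset); the same steps should apply to `cor_matrix`.

```r
set.seed(1)
n <- 200
# Synthetic stand-ins for three of the predictors (not the real data)
screen <- rnorm(n, mean = 5, sd = 2)
stress <- 0.8 * screen + rnorm(n, sd = 1)
sleep  <- -0.8 * screen + rnorm(n, sd = 1)
toy <- data.frame(ScreenTime = screen, StressLevel = stress, SleepQuality = sleep)

toy_cor <- cor(toy)
print(round(toy_cor, 2))

# List each unique pair once, ordered by absolute correlation
idx <- which(upper.tri(toy_cor), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r    = toy_cor[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```

If the heatmap colors should be centered on zero, passing `zlim = c(-1, 1)` to `heatmap()` is one option to try, since extra arguments are forwarded to `image()`.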

---

## Unsupervised ML Algorithms for Feature Extraction

### Principal Component Analysis (PCA)

PCA is applied to the standardized predictor data to create new, orthogonal features.
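Why standardization matters can be seen on a small synthetic example. The sketch below uses invented data (the `toy` data frame is hypothetical) to compare an unscaled fit with two equivalent ways of standardizing: calling `scale()` first, or letting `prcomp()` do it through its `center` and `scale.` arguments.

```r
set.seed(42)
# Synthetic predictors on deliberately different scales (not the real data)
toy <- data.frame(
  Age        = rnorm(100, mean = 33, sd = 10),
  ScreenTime = rnorm(100, mean = 5.5, sd = 2),
  Stress     = rnorm(100, mean = 6.5, sd = 1.5)
)

pca_raw    <- prcomp(toy)                               # centered, unscaled
pca_manual <- prcomp(scale(toy))                        # scale first
pca_arg    <- prcomp(toy, center = TRUE, scale. = TRUE) # prcomp() standardizes

# Without scaling, the largest-variance variable (Age) dominates PC1
print(round(abs(pca_raw$rotation[, "PC1"]), 2))

# The two standardized fits give the same component standard deviations
print(all.equal(pca_manual$sdev, pca_arg$sdev))

# With standardized inputs the eigenvalues sum to the number of variables
print(sum(pca_manual$sdev^2))
```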

#### Standardizing Data and Fitting PCA

```{r pca-fit}
# Standardize the predictors so that variables measured on different scales contribute equally to the PCA
df_scaled <- scale(df_predictors)

# Perform PCA using the core prcomp function
pca_fit <- prcomp(df_scaled)

# Display the variance explained by each principal component (PC)
cat("Variance Explained by Principal Components:\n")
print(summary(pca_fit))
```

**Summary of PCA Fit:** The `summary()` output shows that the first three Principal Components (PCs) together account for **74.5%** of the total variance (40.0%, 17.9%, and 16.6%, respectively). This falls just short of the common guideline of retaining enough components to capture at least 80% of the variance; reaching that threshold would require PC4 as well (cumulative 90.0%). As a compromise between parsimony and coverage, we retain PC1, PC2, and PC3 for the classification models.
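The retention rules can be checked with a few lines of arithmetic. The sketch below hard-codes the component standard deviations reported in the knitted `summary(pca_fit)` output; the variable names `k_80` and `k_kaiser` are introduced here for illustration only.

```r
# Standard deviations copied from the knitted summary(pca_fit) output
sdev <- c(1.5496, 1.0353, 0.9989, 0.9633, 0.64294, 0.43349)

eigenvalues <- sdev^2                           # variance carried by each PC
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion
print(round(rbind(eigenvalues, prop_var, cum_var), 4))

# Smallest number of components reaching 80% cumulative variance
k_80 <- which(cum_var >= 0.80)[1]
# Number of components with eigenvalue strictly greater than 1
k_kaiser <- sum(eigenvalues > 1)

cat("Components needed for >= 80% variance:", k_80, "\n")
cat("Components with eigenvalue > 1:", k_kaiser, "\n")
```

With a live fit, the same quantities are available as `pca_fit$sdev^2`, so the hard-coded vector is only needed when working from printed output.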

#### Visualizing PCA

```{r pca-visuals}
# 1. Scree Plot
plot(pca_fit, type = "l", main = "Scree Plot")

# 2. Biplot (PC1 vs PC2)
biplot(pca_fit,
       main = "PCA Biplot (PC1 vs PC2)",
       cex = 0.8,
       scale = 0) 

# Extract and store the first three PCs as new features
pca_scores <- as.data.frame(pca_fit$x)
df_analysis <- cbind(df_analysis, pca_scores[, 1:3])
```

**Summary of PCA Visuals:** The eigenvalues (squared standard deviations from the summary output) are approximately 2.40, 1.07, 1.00, and 0.93 for PC1 through PC4, so the Scree Plot drops most sharply after **PC1**, stays nearly flat from **PC2** through **PC4**, and drops again after PC4 rather than showing a single clear "elbow". Applied strictly, the eigenvalue > 1 criterion retains only PC1 and PC2, because the PC3 eigenvalue (0.9989² ≈ 0.998) falls marginally below 1; retaining PC3 is therefore a borderline choice. The **Biplot (PC1 vs PC2)** reveals the structure of the first two components: **PC1** captures the dimension of "Mental Health/Social Media Load", as the strong vectors for `ScreenTime`, `StressLevel` (positive PC1), and `SleepQuality` (negative PC1) align along the horizontal axis. **PC2** captures a separate dimension primarily driven by `ExerciseFreq` and `Age`, which align along the vertical axis. The two components are uncorrelated by construction, and together they summarize 57.9% of the total variance.
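A biplot reading can be cross-checked against the numeric loadings stored in the `rotation` element of a `prcomp` fit. The sketch below uses synthetic data driven by a hypothetical latent factor (`load`); it is meant only to show where to look, not to reproduce the loadings of the real dataset.

```r
set.seed(7)
n <- 300
load <- rnorm(n)  # hypothetical latent "load" factor, for illustration only
toy <- data.frame(
  ScreenTime   =  load + rnorm(n, sd = 0.5),
  StressLevel  =  load + rnorm(n, sd = 0.5),
  SleepQuality = -load + rnorm(n, sd = 0.5),
  ExerciseFreq = rnorm(n)
)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings: each column holds the weights of the original variables on one PC
print(round(toy_pca$rotation[, 1:2], 2))

# Scores on different PCs are uncorrelated by construction
print(round(cor(toy_pca$x[, 1], toy_pca$x[, 2]), 10))
```

The sign of a component is arbitrary, so only the relative signs of the loadings within a component (for example, `SleepQuality` opposing `ScreenTime`) carry meaning.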

### Clustering (Hierarchical)

Hierarchical Clustering (HAC) is performed to group observations, generating a new categorical feature based on cluster membership.

#### Performing Hierarchical Clustering

```{r clustering-hierarchical}
# 1. Calculate the dissimilarity matrix using Euclidean distance on scaled data
d_scaled <- dist(df_scaled, method = "euclidean")

# 2. Perform Hierarchical Clustering using Ward's method
hc_fit <- hclust(d_scaled, method = "ward.D2")

# 3. Visualize the Dendrogram
par(mar = c(5, 4, 4, 2) + 0.1) 

# Simplified plot call
plot(hc_fit, 
     labels = FALSE,
     hang = -1, 
     cex = 0.5, # Reduced size for stability
     main = "Hierarchical Clustering Dendrogram (Ward's Method)")

# Reset the margins after plotting
par(mar = c(5, 4, 4, 2) + 0.1) 
```
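One informal way to choose where to cut a dendrogram is to look for the largest jump between consecutive merge heights. The sketch below applies that heuristic to three well-separated synthetic groups; the data, the `truth` labels, and the names `jump_at` and `k_suggested` are all invented for illustration, and on real data the jump is usually far less clear-cut.

```r
set.seed(123)
# Three well-separated synthetic groups of 20 points each (not the real data)
centers <- rbind(c(0, 0), c(10, 0), c(5, 8))
truth <- rep(1:3, each = 20)
toy <- centers[truth, ] + matrix(rnorm(120, sd = 0.5), ncol = 2)

toy_hc <- hclust(dist(scale(toy)), method = "ward.D2")

# Merge heights are stored in merge order; a large jump between consecutive
# heights suggests stopping before the later merge
heights <- toy_hc$height
n <- nrow(toy)
jump_at <- which.max(diff(heights))  # jump between merge jump_at and jump_at + 1
k_suggested <- n - jump_at           # clusters remaining after merge jump_at
cat("Suggested k:", k_suggested, "\n")

clusters <- cutree(toy_hc, k = k_suggested)
print(table(cluster = clusters, truth = truth))
```

On a plotted dendrogram, `rect.hclust(hc_fit, k = 3)` draws boxes around the clusters produced by a given cut, which helps to judge the choice visually.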

#### Determining and Extracting Clusters

The dendrogram is cut to form **$k=3$ clusters**, which generates the new categorical feature `HAC_Cluster`.

```{r clustering-cut}
# Cut the tree into k=3 clusters
k <- 3
clusters_hac <- cutree(hc_fit, k = k)

# Add the cluster assignment (new feature) to the main analysis dataframe
df_analysis$HAC_Cluster <- as.factor(clusters_hac)
```

#### Characterizing Clusters

The new clusters are characterized by summarizing the mean of each original (unscaled) variable within each group.

```{r clustering-summary}
# Summarize the means of the original (unscaled) variables by the new cluster feature
cluster_summary <- aggregate(. ~ HAC_Cluster,
                             data = df_analysis[, c(names(df_predictors), "HAC_Cluster")],
                             FUN = mean)

# Calculate cluster size
cluster_sizes <- table(df_analysis$HAC_Cluster)
cluster_summary$Count <- cluster_sizes[match(cluster_summary$HAC_Cluster, names(cluster_sizes))]


cat("Summary of Variable Means (Original Units) by Cluster:\n")
print(cluster_summary)
```

**Summary of Clustering:** The HAC segmented the data into three lifestyle groups. The table reports cluster means in the original units of each variable (not standardized values), so the clusters are best compared with one another. The means suggest the following archetypes:

* **Cluster 1:** **"Active & Moderately Stressed":** The **highest** mean for `ExerciseFreq` (2.85) and the **lowest** mean for `Age` (31.18). This is the largest group (250 observations) and the most active. Its `StressLevel` (6.71), `SleepQuality` (6.14), and `ScreenTime` (5.61 hours) all fall between the values of the other two clusters.
* **Cluster 2:** **"High Stress & Low Health":** The **highest** means for `ScreenTime` (7.37 hours) and `StressLevel` (8.20), and the **lowest** mean for `SleepQuality` (4.80). This group represents the highest-risk lifestyle archetype. It also has the highest mean for `DaysNoSM` (4.29) and, by a small margin, the highest mean `Age` (34.88); its `ExerciseFreq` (2.18) is below the overall mean of about 2.45.
* **Cluster 3:** **"Relaxed & Low Activity":** The **lowest** means for `ScreenTime` (3.99) and `StressLevel` (5.25), and the **highest** mean for `SleepQuality` (7.74). This group enjoys the best mental well-being and sleep metrics. However, it also has the **lowest** mean for `ExerciseFreq` (1.95), suggesting a relatively sedentary lifestyle. Its mean `Age` (34.74) is nearly identical to that of Cluster 2.
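To judge whether a cluster sits above or below the overall average, the cluster means can be compared with the size-weighted overall mean of each variable. The sketch below re-enters the values from the knitted cluster summary table by hand; the object names (`cluster_means`, `overall`, `deviation`) are introduced here for illustration, and the deviations are in raw units rather than standard deviations.

```r
# Cluster means and sizes copied from the knitted summary table
cluster_means <- data.frame(
  HAC_Cluster  = factor(1:3),
  Age          = c(31.17600, 34.87963, 34.73944),
  ScreenTime   = c(5.607600, 7.369444, 3.994366),
  SleepQuality = c(6.140000, 4.796296, 7.739437),
  StressLevel  = c(6.712000, 8.203704, 5.246479),
  DaysNoSM     = c(2.488000, 4.287037, 3.394366),
  ExerciseFreq = c(2.848000, 2.175926, 1.950704),
  Count        = c(250, 108, 142)
)
vars <- setdiff(names(cluster_means), c("HAC_Cluster", "Count"))

# Overall mean of each variable, recovered as a size-weighted average
overall <- sapply(cluster_means[vars], weighted.mean, w = cluster_means$Count)
print(round(overall, 2))

# Deviation of each cluster mean from the overall mean (raw units)
deviation <- sweep(as.matrix(cluster_means[vars]), 2, overall)
rownames(deviation) <- paste0("Cluster", cluster_means$HAC_Cluster)
print(round(deviation, 2))
```

Running `aggregate()` on `df_scaled` instead of the raw columns would give the same comparison in standard-deviation units, which makes variables with different ranges directly comparable.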

---

## Conclusion

This first phase of the project successfully implemented two key unsupervised feature extraction techniques: **Principal Component Analysis (PCA)** and **Hierarchical Clustering (HAC)**.

1.  **PCA:** The analysis reduced the dimensionality of the six numerical predictors into three orthogonal components (PC1, PC2, PC3). These components now serve as **three new continuous features** that capture the majority of the original data's variance (approximately 74.5%) and are mutually uncorrelated, which removes the multicollinearity present in the raw features.
2.  **Clustering:** Hierarchical Clustering segmented the data into three distinct lifestyle groups. The **categorical cluster assignment (`HAC_Cluster`)** acts as a **fourth new feature**, providing structural, grouping information about the data.

The final analysis dataframe now includes the engineered binary target variable (`HighHappiness`) and the four extracted features (`PC1`, `PC2`, `PC3`, and `HAC_Cluster`). 