Introduction
This project implements two unsupervised machine learning
techniques for feature extraction: Principal Component Analysis (PCA)
and Hierarchical Agglomerative Clustering (HAC).
The goal is to discover meaningful patterns and transform raw data into
more informative representations without relying on a labeled response
variable. These new features will later be incorporated into a binary
classification model to assess the benefits of unsupervised feature
engineering.
Data Set Description
The dataset used for this analysis is the Mental Health and
Social Media Balance Dataset, sourced from
Kaggle.com. The dataset contains 500
observations and 10 features, focusing on the relationship
between social media usage habits and mental well-being metrics.
The dataset includes the required components for this project:
* Numerical Variables: Several highly correlated numerical variables (e.g., screen time, stress, sleep quality) are present for PCA.
* Binary Categorical Variable: A binary target is engineered from the Happiness_Index(1-10) for the later classification task.
Setup and Data Loading
This section sets the working directory and loads the dataset using base R.
# Load the dataset using base R function
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 3")
df_raw <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv")
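By default, `read.csv()` uses `check.names = TRUE`, so headers containing parentheses or hyphens are rewritten into syntactically valid R names. The short sketch below reproduces that conversion with `make.names()`. `Happiness_Index(1-10)` is the header named in this report; the second header is an assumed spelling of a raw CSV column, included only for illustration.

```r
# Raw headers as they might appear in the CSV file.
# "Happiness_Index(1-10)" is named in this report; "Daily_Screen_Time(hrs)"
# is an assumed spelling used only for illustration.
raw_names <- c("Happiness_Index(1-10)", "Daily_Screen_Time(hrs)")

# make.names() applies the same sanitizing that read.csv() performs when
# check.names = TRUE: characters that are invalid in R names become ".".
clean_names <- make.names(raw_names)
print(clean_names)
```

Passing `check.names = FALSE` to `read.csv()` would keep the original headers instead, at the cost of needing backticks to reference them.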
Feature Engineering and Data Preparation
A binary categorical variable is engineered from the
Happiness_Index(1-10) to serve as the target for the future
classification model. Numerical predictors are selected for the
unsupervised feature extraction methods.
Creating Predictor and Target Variables
# read.csv() replaces characters such as '(', ')' and '-' in the headers
# with '.', so the sanitized column names are used here
df_analysis <- data.frame(
  Age = df_raw$Age,
  ScreenTime = df_raw$Daily_Screen_Time.hrs.,
  SleepQuality = df_raw$Sleep_Quality.1.10.,
  StressLevel = df_raw$Stress_Level.1.10.,
  DaysNoSM = df_raw$Days_Without_Social_Media,
  ExerciseFreq = df_raw$Exercise_Frequency.week.,
  HappinessIndex = df_raw$Happiness_Index.1.10.
)
# Create the binary target variable (Y): High Happiness (Index > 8) vs. Low Happiness (Index <= 8)
df_analysis$HighHappiness <- factor(ifelse(df_analysis$HappinessIndex > 8, "High", "Low"))
# Create a dataframe of only the numerical predictors for unsupervised learning
df_predictors <- df_analysis[, c("Age", "ScreenTime", "SleepQuality",
                                 "StressLevel", "DaysNoSM", "ExerciseFreq")]
# Check the distribution of the new binary target variable
cat("Distribution of the HighHappiness Target:\n")
## Distribution of the HighHappiness Target:
print(table(df_analysis$HighHappiness))
##
## High  Low
##  256  244
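The counts above correspond to 256/500 = 51.2% High and 244/500 = 48.8% Low, so the classes are close to balanced. As a minimal sketch of the same thresholding rule, the example below applies it to a small hypothetical vector of scores (not taken from the dataset) and reports counts and proportions. Setting the factor levels explicitly is an optional variation, not something the report does.

```r
# Hypothetical happiness scores on a 1-10 scale (not taken from the dataset).
happiness <- c(10, 9, 8, 7, 9, 6, 10, 8)

# Same rule as in the report: scores above 8 are "High", all others "Low".
# The explicit level order (Low first) is an assumption made for this sketch.
high_happiness <- factor(ifelse(happiness > 8, "High", "Low"),
                         levels = c("Low", "High"))

# Counts and proportions per class.
class_counts <- table(high_happiness)
print(class_counts)
print(round(prop.table(class_counts), 3))
```

If the later classifier were fitted with `glm(..., family = binomial)`, the first factor level would be treated as the reference outcome, so placing "Low" first would make the model estimate the probability of "High". With the default alphabetical order used in the report, the modeled outcome would be "Low".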
Exploratory Data Analysis (EDA)
Correlation Matrix and Heatmap
The correlation matrix is presented both numerically and visually via
a heatmap to confirm that the numerical variables are sufficiently
intercorrelated, which is necessary for effective dimensionality
reduction via PCA.
# Calculate the correlation matrix
cor_matrix <- cor(df_predictors)
# Heatmap visualization of the Correlation Matrix
heatmap(cor_matrix,
        main = "Correlation Heatmap",
        symm = TRUE, # Matrix is symmetric
        col = colorRampPalette(c("blue", "white", "red"))(100),
        cexRow = 0.8,
        cexCol = 0.8)

Summary of EDA: The heatmap shows correlations suitable for PCA.
ScreenTime, StressLevel, and SleepQuality are strongly
intercorrelated. ScreenTime has a strong positive correlation with
StressLevel (red) and a strong negative correlation with SleepQuality
(blue), and StressLevel and SleepQuality are themselves strongly
negatively correlated (blue). Heavier screen use, higher stress, and
poorer sleep therefore tend to occur together. The remaining variables
(DaysNoSM, ExerciseFreq, and Age) show generally weak correlations
with these three metrics and with each other. The numerical
correlation matrix (cor_matrix) provides the exact values behind the
colors and confirms that the three health metrics share enough
variance for effective dimensionality reduction via PCA.
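To read the exact values rather than colors, the correlation matrix can be printed and the strongly correlated pairs listed. The sketch below does this on a small synthetic data frame (a hypothetical stand-in, not the Kaggle data) built to mimic the pattern described above. The cutoff of |r| >= 0.5 is an arbitrary assumption for illustration.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data (hypothetical, not the Kaggle dataset):
# stress rises and sleep quality falls with screen time; age is unrelated.
screen <- rnorm(n)
toy <- data.frame(
  ScreenTime   = screen,
  StressLevel  = screen + rnorm(n, sd = 0.5),
  SleepQuality = -screen + rnorm(n, sd = 0.5),
  Age          = rnorm(n)
)

# Numerical correlation matrix, rounded for readability.
cor_toy <- cor(toy)
print(round(cor_toy, 2))

# Keep each pair once (upper triangle) and flag pairs with |r| >= 0.5.
idx <- which(upper.tri(cor_toy) & abs(cor_toy) >= 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = round(cor_toy[idx], 2)
)
print(strong_pairs)
```

Applied to the report's data, the same steps would use `df_predictors` in place of `toy`.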
Conclusion
This first phase of the project implemented two unsupervised feature
extraction techniques: Principal Component Analysis (PCA) and
Hierarchical Agglomerative Clustering (HAC).
- PCA: The analysis reduced the dimensionality of the
six numerical predictors into three orthogonal components (PC1, PC2,
PC3). These components now serve as three new continuous
features that capture the majority of the original data’s
variance (approximately 80%) while eliminating the
multicollinearity present in the raw features.
- Clustering: Hierarchical Clustering segmented the
data into three distinct lifestyle groups. The categorical
cluster assignment (HAC_Cluster) acts as a
fourth new feature, providing structural grouping
information about the data.
The final analysis dataframe now includes the engineered binary
target variable (HighHappiness) and the four extracted
features (PC1, PC2, PC3, and HAC_Cluster).
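As a rough illustration of how the four extracted features can be produced, the sketch below runs PCA and hierarchical clustering on synthetic stand-in predictors (hypothetical data, not the Kaggle dataset). Standardizing the variables, using Euclidean distance, and using Ward linkage (`"ward.D2"`) are assumptions made for this sketch; the settings used in the actual analysis may differ.

```r
set.seed(42)
n <- 150

# Synthetic stand-in predictors (hypothetical, not the Kaggle dataset).
screen <- rnorm(n)
toy <- data.frame(
  Age          = rnorm(n, mean = 30, sd = 8),
  ScreenTime   = screen,
  SleepQuality = -screen + rnorm(n, sd = 0.5),
  StressLevel  = screen + rnorm(n, sd = 0.5),
  DaysNoSM     = rnorm(n),
  ExerciseFreq = rnorm(n)
)

# PCA on standardized variables; keep the first three component scores.
pca_fit <- prcomp(toy, center = TRUE, scale. = TRUE)
pc_scores <- as.data.frame(pca_fit$x[, 1:3])

# Cumulative proportion of variance explained by PC1 to PC3.
var_explained <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
print(round(var_explained[1:3], 3))

# Agglomerative clustering on the standardized data, cut into three groups.
# Ward linkage and Euclidean distance are assumptions of this sketch.
hc_fit <- hclust(dist(scale(toy)), method = "ward.D2")
clusters <- cutree(hc_fit, k = 3)

# Combine the extracted features into one data frame.
features <- cbind(pc_scores, HAC_Cluster = factor(clusters))
str(features)
```

The component scores are uncorrelated by construction, which is why they sidestep the multicollinearity of the raw predictors, while `HAC_Cluster` enters a downstream model as a categorical feature.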
