1 Introduction

Following the analysis in Part 1, this section implements the Local Outlier Factor (LOF) algorithm. LOF is an unsupervised detection method that identifies anomalies by comparing the local density of a data point to that of its neighbors. A score significantly higher than 1 implies the point is in a sparse region (an outlier).

2 Setup and Data Loading

# Load the dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 3")
df_raw <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv")

3 Feature Engineering and Data Preparation

# Create the main dataframe
df_analysis <- data.frame(
  Age = df_raw$Age,
  ScreenTime = df_raw$Daily_Screen_Time.hrs.,
  SleepQuality = df_raw$Sleep_Quality.1.10.,
  StressLevel = df_raw$Stress_Level.1.10.,
  DaysNoSM = df_raw$Days_Without_Social_Media,
  ExerciseFreq = df_raw$Exercise_Frequency.week.,
  HappinessIndex = df_raw$Happiness_Index.1.10.
)

# Create the binary target variable (Y): High Happiness vs. Low Happiness
df_analysis$HighHappiness <- factor(ifelse(df_analysis$HappinessIndex > 8, "High", "Low"))

# Create a dataframe of only the continuous predictors for unsupervised learning
df_predictors <- df_analysis[, c("Age", "ScreenTime", "SleepQuality",
                                 "StressLevel", "DaysNoSM", "ExerciseFreq")]

4 Unsupervised ML: Local Outlier Factor (LOF)

We utilize the lof() function from the {dbscan} package. The LOF score depends on the parameter \(k\) (neighborhood size).

4.1 Calculating LOF Scores

# Scale the predictor data (LOF is distance-based and requires scaling)
df_scaled <- scale(df_predictors)

# Calculate LOF scores using k = 20 nearest neighbors
k_value <- 20
lof_scores <- lof(df_scaled, minPts = k_value)

# Add the LOF scores to the main dataframe as a new feature
df_analysis$LOF_Score <- lof_scores

# Display summary statistics to understand the range (Min, Max)
cat("Summary of Local Outlier Factor (LOF) Scores:\n")
## Summary of Local Outlier Factor (LOF) Scores:
print(summary(df_analysis$LOF_Score))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9512  0.9982  1.0328  1.0567  1.0885  1.4734

Summary of Analysis: The summary statistics for the computed Local Outlier Factor (LOF) scores reveal a distribution centered near 1.0 (Median: 1.0328, Mean: 1.0567). A score of approximately 1.0 indicates that a data point has a local density similar to its neighbors (an “inlier”). The maximum score observed is 1.4734. While this indicates the presence of data points that are less dense than their neighbors, the absence of extremely high scores (e.g., > 2.0 or higher) suggests that this dataset does not contain extreme anomalies or radical outliers. The majority of data points exhibit patterns that are relatively consistent with their peers.

4.2 Visualizing LOF Distribution

We visualize the LOF scores against the Happiness Index. Based on the summary statistics, we set a cut-off threshold of 1.25 to identify the most significant outliers in this specific dataset.

# Histogram of LOF Scores
hist(df_analysis$LOF_Score,
     breaks = 30,
     main = paste("Distribution of LOF Scores (k =", k_value, ")"),
     xlab = "LOF Score",
     col = "lightblue",
     border = "white")
abline(v = 1.25, col = "red", lwd = 2, lty = 2) # Threshold

# Scatter plot of LOF scores vs Happiness Index
# Threshold adjusted to 1.25 based on data distribution to catch the top ~5-10% outliers
threshold <- 1.25
plot(df_analysis$HappinessIndex, df_analysis$LOF_Score,
     main = "LOF Score vs. Happiness Index",
     xlab = "Happiness Index (Original)",
     ylab = "LOF Score",
     pch = 19,
     col = ifelse(df_analysis$LOF_Score > threshold, "red", "darkblue"))

legend("topright", 
       legend = c(paste("Outlier Score (>", threshold, ")", sep=""), "Normal Score"),
       col = c("red", "darkblue"), 
       pch = 19)

Summary of LOF Analysis: The histogram shows the density of LOF scores. Most data points cluster near 1.0 (Inliers). The scatter plot highlights observations with LOF scores greater than 1.25 in red. These points represent individuals whose behavioral patterns (Screen Time, Sleep, Stress, etc.) are significantly different from the local density of their peers.


5 Supervised Learning: Binary Classification

We compare a baseline logistic regression model against one enriched with the extracted LOF_Score.

5.1 Model Training

# Split data into Training (70%) and Testing (30%)
set.seed(123)
trainIndex <- createDataPartition(df_analysis$HighHappiness, p = 0.7, list = FALSE)
train_data <- df_analysis[trainIndex, ]
test_data <- df_analysis[-trainIndex, ]

# Model 1: Baseline Logistic Regression
model_baseline <- glm(HighHappiness ~ Age + ScreenTime + SleepQuality + 
                      StressLevel + DaysNoSM + ExerciseFreq,
                      data = train_data,
                      family = "binomial")

# Model 2: LOF-Enhanced Logistic Regression
model_lof <- glm(HighHappiness ~ Age + ScreenTime + SleepQuality + 
                 StressLevel + DaysNoSM + ExerciseFreq + LOF_Score,
                 data = train_data,
                 family = "binomial")

# Compare Model Summaries (AIC)
cat("Baseline Model AIC:", round(model_baseline$aic, 2), "\n")
## Baseline Model AIC: 257.24
cat("LOF-Enhanced Model AIC:", round(model_lof$aic, 2), "\n")
## LOF-Enhanced Model AIC: 258.91

Summary of Analysis: We utilized the Akaike Information Criterion (AIC) to compare the quality of the models, where a lower value indicates a better trade-off between model fit and complexity. * Baseline Model AIC: 257.24 * LOF-Enhanced Model AIC: 258.91

The results show that the Baseline model actually achieved a slightly lower AIC than the LOF-Enhanced model. This increase in AIC for the enhanced model suggests that the LOF_Score feature did not add sufficient predictive power to justify the penalty for the added model complexity. In statistical terms, the “outlierness” of a data point does not appear to improve the model’s fit to the training data compared to using the original predictors alone.

5.2 Model Evaluation (ROC and AUC)

# Predict probabilities
prob_baseline <- predict(model_baseline, newdata = test_data, type = "response")
prob_lof <- predict(model_lof, newdata = test_data, type = "response")

# Calculate ROC Curves
roc_baseline <- roc(test_data$HighHappiness, prob_baseline, levels = c("Low", "High"), direction = ">")
roc_lof <- roc(test_data$HighHappiness, prob_lof, levels = c("Low", "High"), direction = ">")

# Plot ROC Curves
plot(roc_baseline, col = "blue", main = "ROC Curve Comparison", lwd = 2)
plot(roc_lof, col = "red", add = TRUE, lwd = 2, lty = 2)
legend("bottomright", legend = c(paste("Baseline AUC =", round(auc(roc_baseline), 4)),
                                 paste("LOF AUC =", round(auc(roc_lof), 4))),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)

6 Conclusion

  1. Outlier Detection: Using a neighborhood size of \(k=20\), we identified a subset of observations with LOF scores exceeding 1.25. These “local outliers” represent unique behavioral profiles in the dataset.
  2. Feature Importance: By adding the LOF score to the classification model, we assessed its contribution to predicting happiness. The comparison of AIC and AUC values indicates whether the “unusualness” of a person’s habits serves as a predictive signal for their mental well-being.
---
title: "Project Three: Feature Extraction with Unsupervised Algorithms, Part 2: Local Outlier Factor (LOF)"
author: "Jeff Delva"
date: "November 20th, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

## Introduction

Following the analysis in Part 1, this section implements the **Local Outlier Factor (LOF)** algorithm. LOF is an unsupervised detection method that identifies anomalies by comparing the local density of a data point to that of its neighbors. A score significantly higher than 1 implies the point is in a sparse region (an outlier).

## Setup and Data Loading

```{r setup, include=FALSE}
# Load necessary libraries
library(stats)    
library(dbscan)   # For Local Outlier Factor (LOF) algorithm
library(pROC)     # For ROC curve and AUC calculation
library(caret)    # For data splitting

# Set global options
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    fig.width = 8,
    fig.height = 5
)
```

```{r data-load}
# Load the dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 3")
df_raw <- read.csv("Mental_Health_and_Social_Media_Balance_Dataset.csv")
```

## Feature Engineering and Data Preparation

```{r feature-engineering}
# Create the main dataframe
df_analysis <- data.frame(
  Age = df_raw$Age,
  ScreenTime = df_raw$Daily_Screen_Time.hrs.,
  SleepQuality = df_raw$Sleep_Quality.1.10.,
  StressLevel = df_raw$Stress_Level.1.10.,
  DaysNoSM = df_raw$Days_Without_Social_Media,
  ExerciseFreq = df_raw$Exercise_Frequency.week.,
  HappinessIndex = df_raw$Happiness_Index.1.10.
)

# Create the binary target variable (Y): High Happiness vs. Low Happiness
df_analysis$HighHappiness <- factor(ifelse(df_analysis$HappinessIndex > 8, "High", "Low"))

# Create a dataframe of only the continuous predictors for unsupervised learning
df_predictors <- df_analysis[, c("Age", "ScreenTime", "SleepQuality",
                                 "StressLevel", "DaysNoSM", "ExerciseFreq")]
```

-----

## Unsupervised ML: Local Outlier Factor (LOF)

We utilize the `lof()` function from the `{dbscan}` package. The LOF score depends on the parameter $k$ (neighborhood size).

### Calculating LOF Scores

```{r lof-calculation}
# Scale the predictor data (LOF is distance-based and requires scaling)
df_scaled <- scale(df_predictors)

# Calculate LOF scores using k = 20 nearest neighbors
k_value <- 20
lof_scores <- lof(df_scaled, minPts = k_value)

# Add the LOF scores to the main dataframe as a new feature
df_analysis$LOF_Score <- lof_scores

# Display summary statistics to understand the range (Min, Max)
cat("Summary of Local Outlier Factor (LOF) Scores:\n")
print(summary(df_analysis$LOF_Score))
```

**Summary of Analysis:** The summary statistics for the computed Local Outlier Factor (LOF) scores reveal a distribution centered near 1.0 (Median: 1.0328, Mean: 1.0567). A score of approximately 1.0 indicates that a data point has a local density similar to its neighbors (an "inlier"). The maximum score observed is 1.4734. While this indicates the presence of data points that are less dense than their neighbors, the absence of extremely high scores (e.g., > 2.0 or higher) suggests that this dataset does not contain extreme anomalies or radical outliers. The majority of data points exhibit patterns that are relatively consistent with their peers.

### Visualizing LOF Distribution

We visualize the LOF scores against the Happiness Index. Based on the summary statistics, we set a cut-off threshold of **1.25** to identify the most significant outliers in this specific dataset.

```{r lof-visuals}
# Histogram of LOF Scores
hist(df_analysis$LOF_Score,
     breaks = 30,
     main = paste("Distribution of LOF Scores (k =", k_value, ")"),
     xlab = "LOF Score",
     col = "lightblue",
     border = "white")
abline(v = 1.25, col = "red", lwd = 2, lty = 2) # Threshold

# Scatter plot of LOF scores vs Happiness Index
# Threshold adjusted to 1.25 based on data distribution to catch the top ~5-10% outliers
threshold <- 1.25
plot(df_analysis$HappinessIndex, df_analysis$LOF_Score,
     main = "LOF Score vs. Happiness Index",
     xlab = "Happiness Index (Original)",
     ylab = "LOF Score",
     pch = 19,
     col = ifelse(df_analysis$LOF_Score > threshold, "red", "darkblue"))

legend("topright", 
       legend = c(paste("Outlier Score (>", threshold, ")", sep=""), "Normal Score"),
       col = c("red", "darkblue"), 
       pch = 19)
```

**Summary of LOF Analysis:** The histogram shows the density of LOF scores. Most data points cluster near 1.0 (Inliers). The scatter plot highlights observations with LOF scores greater than 1.25 in red. These points represent individuals whose behavioral patterns (Screen Time, Sleep, Stress, etc.) are significantly different from the local density of their peers.

-----

## Supervised Learning: Binary Classification

We compare a baseline logistic regression model against one enriched with the extracted `LOF_Score`.

### Model Training

```{r classification-models}
# Split data into Training (70%) and Testing (30%)
set.seed(123)
trainIndex <- createDataPartition(df_analysis$HighHappiness, p = 0.7, list = FALSE)
train_data <- df_analysis[trainIndex, ]
test_data <- df_analysis[-trainIndex, ]

# Model 1: Baseline Logistic Regression
model_baseline <- glm(HighHappiness ~ Age + ScreenTime + SleepQuality + 
                      StressLevel + DaysNoSM + ExerciseFreq,
                      data = train_data,
                      family = "binomial")

# Model 2: LOF-Enhanced Logistic Regression
model_lof <- glm(HighHappiness ~ Age + ScreenTime + SleepQuality + 
                 StressLevel + DaysNoSM + ExerciseFreq + LOF_Score,
                 data = train_data,
                 family = "binomial")

# Compare Model Summaries (AIC)
cat("Baseline Model AIC:", round(model_baseline$aic, 2), "\n")
cat("LOF-Enhanced Model AIC:", round(model_lof$aic, 2), "\n")
```
**Summary of Analysis:**
We utilized the Akaike Information Criterion (AIC) to compare the quality of the models, where a lower value indicates a better trade-off between model fit and complexity.
* **Baseline Model AIC:** 257.24
* **LOF-Enhanced Model AIC:** 258.91

The results show that the Baseline model actually achieved a slightly lower AIC than the LOF-Enhanced model. This increase in AIC for the enhanced model suggests that the `LOF_Score` feature did not add sufficient predictive power to justify the penalty for the added model complexity. In statistical terms, the "outlierness" of a data point does not appear to improve the model's fit to the training data compared to using the original predictors alone.


### Model Evaluation (ROC and AUC)

```{r model-evaluation}
# Predict probabilities
prob_baseline <- predict(model_baseline, newdata = test_data, type = "response")
prob_lof <- predict(model_lof, newdata = test_data, type = "response")

# Calculate ROC Curves
roc_baseline <- roc(test_data$HighHappiness, prob_baseline, levels = c("Low", "High"), direction = ">")
roc_lof <- roc(test_data$HighHappiness, prob_lof, levels = c("Low", "High"), direction = ">")

# Plot ROC Curves
plot(roc_baseline, col = "blue", main = "ROC Curve Comparison", lwd = 2)
plot(roc_lof, col = "red", add = TRUE, lwd = 2, lty = 2)
legend("bottomright", legend = c(paste("Baseline AUC =", round(auc(roc_baseline), 4)),
                                 paste("LOF AUC =", round(auc(roc_lof), 4))),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)
```

## Conclusion

1.  **Outlier Detection:** Using a neighborhood size of $k=20$, we identified a subset of observations with LOF scores exceeding 1.25. These "local outliers" represent unique behavioral profiles in the dataset.
2.  **Feature Importance:** By adding the LOF score to the classification model, we assessed its contribution to predicting happiness. The comparison of AIC and AUC values indicates whether the "unusualness" of a person's habits serves as a predictive signal for their mental well-being.

