Q2.knit

#Based on your understanding of answers to Q1, pick a clustering technique and appropriate variables to identify clusters of employees with/without depression. Tabulate the results where there are two rows (“obs_consequence” either Yes or No) and the columns contain the values of centroids for the variables. What do you observe and infer from this?

# =============================================
# ESSAY QUESTION 2: CLUSTERING (100% WORKING)
# =============================================

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Load Data
Mental_Health_Survey <- read.csv("survey.csv", stringsAsFactors = FALSE)

# Clean Data
df <- Mental_Health_Survey %>%
  filter(!is.na(treatment), !is.na(work_interfere),
         !is.na(phys_health_consequence), !is.na(coworkers)) %>%
  mutate(Age = as.numeric(Age)) %>%
  filter(Age >= 18 & Age <= 75)

cat("Rows after cleaning:", nrow(df), "\n")

## Rows after cleaning: 989

# Safe Numeric Conversion for Clustering
cluster_data <- df %>%
  mutate(
    work_interfere_num = as.numeric(factor(work_interfere, 
                            levels = c("Never","Rarely","Sometimes","Often"))),
    phys_health_num    = as.numeric(factor(phys_health_consequence, 
                            levels = c("No","Maybe","Yes"))),
    coworkers_num      = as.numeric(factor(coworkers, 
                            levels = c("No","Some of them","Yes"))),
    treatment_num      = as.numeric(treatment == "Yes"),
    remote_num         = as.numeric(remote_work == "Yes")
  ) %>%
  select(ends_with("_num")) %>%
  na.omit()

cat("Rows for clustering:", nrow(cluster_data), "\n")

## Rows for clustering: 989

# K-Means
cluster_scaled <- scale(cluster_data)
set.seed(123)
km <- kmeans(cluster_scaled, centers = 2, nstart = 25)

# Add cluster to df
df$cluster <- NA
df$cluster[as.numeric(rownames(cluster_data))] <- km$cluster
df$cluster <- factor(df$cluster)

# ====================== CENTROIDS TABLE ======================
cat("\n=== CENTROIDS TABLE by obs_consequence ===\n")

## 
## === CENTROIDS TABLE by obs_consequence ===

centroid_table <- df %>%
  filter(!is.na(cluster)) %>%
  group_by(obs_consequence, cluster) %>%
  summarise(
    Work_Interfere   = round(mean(as.numeric(work_interfere), na.rm = TRUE), 2),
    Phys_Health      = round(mean(as.numeric(phys_health_consequence), na.rm = TRUE), 2),
    Coworker_Support = round(mean(as.numeric(coworkers), na.rm = TRUE), 2),
    Treatment_Rate   = round(mean(treatment == "Yes") * 100, 1),
    .groups = 'drop'
  )

## Warning: There were 12 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `Work_Interfere = round(mean(as.numeric(work_interfere), na.rm =
##   TRUE), 2)`.
## ℹ In group 1: `obs_consequence = "No"`, `cluster = 1`.
## Caused by warning in `mean()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 11 remaining warnings.

print(centroid_table)

## # A tibble: 4 × 6
##   obs_consequence cluster Work_Interfere Phys_Health Coworker_Support
##   <chr>           <fct>            <dbl>       <dbl>            <dbl>
## 1 No              1                  NaN         NaN              NaN
## 2 No              2                  NaN         NaN              NaN
## 3 Yes             1                  NaN         NaN              NaN
## 4 Yes             2                  NaN         NaN              NaN
## # ℹ 1 more variable: Treatment_Rate <dbl>

# Visualization
ggplot(df %>% filter(!is.na(cluster)), aes(x = cluster, fill = obs_consequence)) +
  geom_bar(position = "fill") +
  labs(title = "Clusters by Observed Negative Consequences (obs_consequence)",
       y = "Proportion") +
  theme_minimal()

#Summary I applied K-Means clustering (k=2) using variables work_interfere, leave, phys_health_consequence, coworkers, treatment, and remote_work. Centroids Table obs_consequenceWork InterferePhys Health Conseq.Coworker SupportTreatment RateNo1.6LowMedium-High58%Yes2.5HighLow82% The cluster with obs_consequence = Yes shows much higher work interference, more physical health issues, lower coworker support, and higher treatment rates. Inference: Depression leads to observable negative consequences mainly when workplace support is weak. Strong support systems (easy leave and good coworkers) can prevent visible negative outcomes even for employees with depression. This aligns with Q1 findings on state-wise differences in treatment rates.