
---
title: "Clustering Analysis for RHMCD-20 Dataset"
author: "Langelihle Magwali"
date: "2025-01-26"
output:
  html_document:
    toc: true
    toc_float: true
---

Introduction

This project performs clustering analysis on the RHMCD-20 dataset, which contains survey data about depression and mental health. The objective is to group individuals based on their responses and uncover patterns related to mental health indicators.

Mental health awareness has gained significant attention over the years, yet stigma and misconceptions persist, particularly around conditions like depression, which remains a major global concern. Using unsupervised learning techniques such as K-means and hierarchical clustering, the study looks for natural groupings and hidden connections within the data that could inform better interventions and support systems.

The dataset is sourced from diverse populations, including teenagers from Bangladesh, college students, housewives, and business professionals. Its survey questions address factors such as stress, quarantine frustrations, coping struggles, and changes in habits. The analysis aims to show how data-driven approaches can reveal meaningful patterns and contribute to more effective strategies for addressing mental health challenges.

Load Libraries and Dataset

# installing missing packages
required_packages <- c("tidyverse", "cluster", "factoextra", "dplyr", "ggplot2")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages, dependencies = TRUE)

# Loading necessary libraries
lapply(required_packages, library, character.only = TRUE)

## Warning: package 'tidyverse' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## Warning: package 'readr' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'forcats' was built under R version 4.4.2

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Warning: package 'cluster' was built under R version 4.4.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "cluster"   "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "factoextra" "cluster"    "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
## [16] "utils"      "datasets"   "methods"    "base"      
## 
## [[4]]
##  [1] "factoextra" "cluster"    "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
## [16] "utils"      "datasets"   "methods"    "base"      
## 
## [[5]]
##  [1] "factoextra" "cluster"    "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "stats"      "graphics"   "grDevices" 
## [16] "utils"      "datasets"   "methods"    "base"

# Loading the dataset
data <- read.csv("C:/Users/MAGWALI/Downloads/mental_health_finaldata_1 (1).csv")

# Display of the first few rows
head(data)

##        Age Gender Occupation       Days_Indoors Growing_Stress
## 1    20-25 Female  Corporate          1-14 days            Yes
## 2 30-Above   Male     Others         31-60 days            Yes
## 3 30-Above Female    Student   Go out Every day             No
## 4    25-30   Male     Others          1-14 days            Yes
## 5    16-20 Female    Student More than 2 months            Yes
## 6    25-30   Male  Housewife More than 2 months             No
##   Quarantine_Frustrations Changes_Habits Mental_Health_History Weight_Change
## 1                     Yes             No                   Yes           Yes
## 2                     Yes          Maybe                    No            No
## 3                      No            Yes                    No            No
## 4                      No          Maybe                    No         Maybe
## 5                     Yes            Yes                    No           Yes
## 6                     Yes            Yes                   Yes           Yes
##   Mood_Swings Coping_Struggles Work_Interest Social_Weakness
## 1      Medium               No            No             Yes
## 2        High               No            No             Yes
## 3      Medium              Yes         Maybe              No
## 4      Medium               No         Maybe             Yes
## 5      Medium              Yes         Maybe              No
## 6      Medium               No         Maybe           Maybe

# Checking the structure and dimensions of the dataset
str(data)

## 'data.frame':    824 obs. of  13 variables:
##  $ Age                    : chr  "20-25" "30-Above" "30-Above" "25-30" ...
##  $ Gender                 : chr  "Female" "Male" "Female" "Male" ...
##  $ Occupation             : chr  "Corporate" "Others" "Student" "Others" ...
##  $ Days_Indoors           : chr  "1-14 days" "31-60 days" "Go out Every day" "1-14 days" ...
##  $ Growing_Stress         : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Quarantine_Frustrations: chr  "Yes" "Yes" "No" "No" ...
##  $ Changes_Habits         : chr  "No" "Maybe" "Yes" "Maybe" ...
##  $ Mental_Health_History  : chr  "Yes" "No" "No" "No" ...
##  $ Weight_Change          : chr  "Yes" "No" "No" "Maybe" ...
##  $ Mood_Swings            : chr  "Medium" "High" "Medium" "Medium" ...
##  $ Coping_Struggles       : chr  "No" "No" "Yes" "No" ...
##  $ Work_Interest          : chr  "No" "No" "Maybe" "Maybe" ...
##  $ Social_Weakness        : chr  "Yes" "Yes" "No" "Yes" ...

dim(data)

## [1] 824  13

Data Preprocessing

Handle Missing Values

# Checking for missing values
cat("Missing values per column:\n")

## Missing values per column:

print(colSums(is.na(data)))

##                     Age                  Gender              Occupation 
##                       0                       0                       0 
##            Days_Indoors          Growing_Stress Quarantine_Frustrations 
##                       0                       0                       0 
##          Changes_Habits   Mental_Health_History           Weight_Change 
##                       0                       0                       0 
##             Mood_Swings        Coping_Struggles           Work_Interest 
##                       0                       0                       0 
##         Social_Weakness 
##                       0

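# Impute missing values, if any (none were found above, so this is a safeguard)
impute_mode <- function(x) {
  ux <- unique(x[!is.na(x)])  # exclude NA so it can never be chosen as the mode
  ux[which.max(tabulate(match(x, ux)))]
}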

for (col in names(data)) {
  if (is.numeric(data[[col]])) {
    data[[col]][is.na(data[[col]])] <- median(data[[col]], na.rm = TRUE)
  } else {
    data[[col]][is.na(data[[col]])] <- impute_mode(data[[col]])
  }
}

# Confirming no missing values remain
cat("Missing values after imputation:\n")

## Missing values after imputation:

print(colSums(is.na(data)))

##                     Age                  Gender              Occupation 
##                       0                       0                       0 
##            Days_Indoors          Growing_Stress Quarantine_Frustrations 
##                       0                       0                       0 
##          Changes_Habits   Mental_Health_History           Weight_Change 
##                       0                       0                       0 
##             Mood_Swings        Coping_Struggles           Work_Interest 
##                       0                       0                       0 
##         Social_Weakness 
##                       0
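
A mode helper has to ignore NA when counting, otherwise NA itself can win whenever it is the most common entry in a column. The following is a minimal sketch of that behaviour on a made-up vector; the name `mode_non_missing` and the toy values are illustrative and not part of the analysis.

```r
# Hypothetical helper: most frequent non-missing value of a vector
mode_non_missing <- function(x) {
  ux <- unique(x[!is.na(x)])             # candidate values, NA excluded
  ux[which.max(tabulate(match(x, ux)))]  # tabulate() ignores the NA entries from match()
}

# Toy column (made-up values, not from the survey); NA is the most common entry
toy <- c("Yes", NA, "No", "Yes", NA, NA)
toy[is.na(toy)] <- mode_non_missing(toy)
print(toy)
```

Here the three missing entries are filled with "Yes", the most frequent observed answer.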

Encode Categorical Variables

# Converting the categorical variables to factors
data <- data %>% mutate(across(where(is.character), as.factor))

# Encoding factors to numeric for clustering
cat("Encoding categorical variables as numeric.\n")

## Encoding categorical variables as numeric.

data_encoded <- data %>% mutate(across(where(is.factor), as.numeric))

# Check structure and ensure no rows are dropped
str(data_encoded)

## 'data.frame':    824 obs. of  13 variables:
##  $ Age                    : num  2 4 4 3 1 3 1 3 4 2 ...
##  $ Gender                 : num  1 2 1 2 1 2 1 1 2 2 ...
##  $ Occupation             : num  2 4 5 4 5 3 1 5 4 2 ...
##  $ Days_Indoors           : num  1 3 4 1 5 5 4 1 4 4 ...
##  $ Growing_Stress         : num  3 3 2 3 3 2 3 3 3 1 ...
##  $ Quarantine_Frustrations: num  3 3 2 2 3 3 3 2 3 1 ...
##  $ Changes_Habits         : num  2 1 3 1 3 3 1 1 3 3 ...
##  $ Mental_Health_History  : num  3 2 2 2 2 3 2 1 2 3 ...
##  $ Weight_Change          : num  3 2 2 1 3 3 3 1 3 3 ...
##  $ Mood_Swings            : num  3 1 3 3 3 3 2 1 3 2 ...
##  $ Coping_Struggles       : num  1 1 2 1 2 1 1 1 2 1 ...
##  $ Work_Interest          : num  2 2 1 1 1 1 1 2 1 1 ...
##  $ Social_Weakness        : num  3 3 2 3 2 1 1 3 1 2 ...

dim(data_encoded)

## [1] 824  13
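
Note that `as.numeric()` on a factor uses the alphabetical level order, so the codes above correspond to Maybe = 1, No = 2, Yes = 3. If the answers are meant to be treated as ordinal, one option is to set the level order explicitly before converting. The sketch below illustrates this on made-up rows; the orderings `No < Maybe < Yes` and `Low < Medium < High`, and the presence of a "Low" answer, are assumptions rather than something stated by the questionnaire.

```r
# Toy responses (made-up, mirroring answer labels seen in the survey)
toy <- data.frame(
  Growing_Stress = c("Yes", "No", "Maybe", "Yes"),
  Mood_Swings    = c("Medium", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Assumed orderings; adjust if the questionnaire defines them differently
yes_no_levels <- c("No", "Maybe", "Yes")
mood_levels   <- c("Low", "Medium", "High")

toy_encoded <- data.frame(
  Growing_Stress = as.integer(factor(toy$Growing_Stress, levels = yes_no_levels)),
  Mood_Swings    = as.integer(factor(toy$Mood_Swings, levels = mood_levels))
)
print(toy_encoded)

# Default alphabetical coding of the same column, for comparison
print(as.integer(factor(toy$Growing_Stress)))
```

Purely nominal columns such as Gender or Occupation have no natural order, so integer codes for them impose an arbitrary distance; dummy (one-hot) columns are a common alternative.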

Normalize Numeric Features

# Normalize numeric columns
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
data_normalized <- data_encoded %>% mutate(across(where(is.numeric), normalize))

# Check structure and dimensions after normalization
cat("Dataset dimensions after normalization:\n")

## Dataset dimensions after normalization:

dim(data_normalized)

## [1] 824  13

head(data_normalized)

##         Age Gender Occupation Days_Indoors Growing_Stress
## 1 0.3333333      0       0.25         0.00            1.0
## 2 1.0000000      1       0.75         0.50            1.0
## 3 1.0000000      0       1.00         0.75            0.5
## 4 0.6666667      1       0.75         0.00            1.0
## 5 0.0000000      0       1.00         1.00            1.0
## 6 0.6666667      1       0.50         1.00            0.5
##   Quarantine_Frustrations Changes_Habits Mental_Health_History Weight_Change
## 1                     1.0            0.5                   1.0           1.0
## 2                     1.0            0.0                   0.5           0.5
## 3                     0.5            1.0                   0.5           0.5
## 4                     0.5            0.0                   0.5           0.0
## 5                     1.0            1.0                   0.5           1.0
## 6                     1.0            1.0                   1.0           1.0
##   Mood_Swings Coping_Struggles Work_Interest Social_Weakness
## 1           1                0           0.5             1.0
## 2           0                0           0.5             1.0
## 3           1                1           0.0             0.5
## 4           1                0           0.0             1.0
## 5           1                1           0.0             0.5
## 6           1                0           0.0             0.0
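
Min-max scaling divides by the range of the column, which is zero for a column with a single repeated value. None of the survey columns is constant, but a guarded variant could look like the sketch below; `normalize_safe` is a hypothetical name, and mapping a constant column to 0 is one possible convention.

```r
# Min-max scaling that returns 0 for constant columns instead of NaN
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0 / 0
  (x - rng[1]) / (rng[2] - rng[1])
}

print(normalize_safe(c(1, 2, 3, 5)))
print(normalize_safe(c(4, 4, 4)))
```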

Clustering Analysis

Clustering analysis is the core of this project. Unsupervised learning techniques such as K-means and hierarchical clustering group respondents by shared characteristics, for example stress levels, changes in habits, or work interest, without relying on predefined labels. This makes it possible to explore how work-related factors relate to mental health outcomes. Visualizing and interpreting the resulting clusters shows how these factors interact, which can inform more targeted mental health interventions and strategies.

Determine Optimal Number of Clusters

# Using the Elbow Method
elbow_plot <- fviz_nbclust(data_normalized, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = "dashed", color = "pink") +
  labs(title = "Optimal Number of Clusters", subtitle = "Elbow Method", x = "Number of Clusters", y = "Total Within-Cluster Sum of Squares")
elbow_plot
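
The quantity plotted by the elbow method is the total within-cluster sum of squares for each candidate k. As a rough illustration, it can also be computed directly from `kmeans()`; the sketch below uses synthetic three-blob data rather than the survey, and the seed, the range of k, and the object names are arbitrary choices.

```r
set.seed(42)  # arbitrary seed so the toy example is repeatable

# Toy data: three well-separated 2-D blobs (synthetic, not the survey data)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data like this the curve drops sharply up to k = 3 and changes little afterwards; on real survey data the bend is often less pronounced.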

Apply K-Means Clustering

# Perform K-Means Clustering with k=3 (example)
k <- 3
kmeans_model <- kmeans(data_normalized, centers = k, nstart = 25)

# Add cluster assignments to the dataset
data_clustered <- data %>% mutate(Cluster = kmeans_model$cluster)

# Check structure and ensure clusters are assigned correctly
cat("Cluster assignments:\n")

## Cluster assignments:

head(data_clustered)

##        Age Gender Occupation       Days_Indoors Growing_Stress
## 1    20-25 Female  Corporate          1-14 days            Yes
## 2 30-Above   Male     Others         31-60 days            Yes
## 3 30-Above Female    Student   Go out Every day             No
## 4    25-30   Male     Others          1-14 days            Yes
## 5    16-20 Female    Student More than 2 months            Yes
## 6    25-30   Male  Housewife More than 2 months             No
##   Quarantine_Frustrations Changes_Habits Mental_Health_History Weight_Change
## 1                     Yes             No                   Yes           Yes
## 2                     Yes          Maybe                    No            No
## 3                      No            Yes                    No            No
## 4                      No          Maybe                    No         Maybe
## 5                     Yes            Yes                    No           Yes
## 6                     Yes            Yes                   Yes           Yes
##   Mood_Swings Coping_Struggles Work_Interest Social_Weakness Cluster
## 1      Medium               No            No             Yes       3
## 2        High               No            No             Yes       2
## 3      Medium              Yes         Maybe              No       1
## 4      Medium               No         Maybe             Yes       2
## 5      Medium              Yes         Maybe              No       1
## 6      Medium               No         Maybe           Maybe       2

dim(data_clustered)

## [1] 824  14
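
K-means starts from random centers, so cluster numbering, and occasionally membership, can change between runs unless a seed is fixed. The sketch below shows a seeded fit together with two quick checks, cluster sizes and the average silhouette width from the cluster package. It runs on synthetic two-blob data; the seed value and object names are illustrative.

```r
library(cluster)  # provides silhouette()

set.seed(42)  # fixes the random starts so the result is repeatable

# Toy data: two well-separated 2-D blobs (synthetic, not the survey data)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 25)

print(table(fit$cluster))                   # number of points per cluster
sil <- silhouette(fit$cluster, dist(toy))   # per-point silhouette widths
avg_sil <- mean(sil[, "sil_width"])
print(avg_sil)                              # closer to 1 means better-separated clusters
```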

Visualize Clusters

# Visualize clusters using PCA
fviz_cluster(kmeans_model, data = data_normalized, geom = "point", ellipse.type = "convex", ggtheme = theme_minimal())

Explanation of the Cluster Graph

The plot produced by the fviz_cluster function shows the K-means clusters in a two-dimensional space, obtained by reducing the features with principal component analysis (PCA).

Clusters Visualization:

The graph shows three clusters, matching the choice of k = 3, based on individuals’ responses to the mental health survey. Each point represents an individual, and its position in the 2D space reflects similarity in characteristics after dimensionality reduction. The convex hulls drawn around the clusters (ellipse.type = "convex") indicate the extent of the data points assigned to each cluster.

Cluster Centers

The large dots in the middle of each cluster mark the cluster centroids. Each centroid summarizes the “average” characteristics of the individuals in its cluster.

Axes in the Graph

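The axes do not correspond to individual survey questions. They are the first two principal components, each a weighted combination of all original features, chosen to capture as much of the variance in the data as possible. Individuals plotted close together are therefore similar across the original features, to the extent that two components can represent them.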
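
As a rough sketch of where the two plotted axes come from, the snippet below runs PCA with base R’s `prcomp()` on a small synthetic data frame and reports how much variance the first two components carry. It standardizes the features first, on the assumption that this mirrors the `stand = TRUE` default documented for fviz_cluster; the data and the names `toy`, `pca`, and `coords` are illustrative.

```r
set.seed(42)

# Toy data: 50 rows, 4 numeric features (synthetic, not the survey data)
toy <- as.data.frame(matrix(runif(200), ncol = 4))

pca <- prcomp(scale(toy))                      # PCA on standardized features
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance per component
print(round(var_explained[1:2], 3))            # variance carried by the two plotted axes

coords <- pca$x[, 1:2]                         # 2-D coordinates of each row
print(head(coords))
```

When the first two components explain only a small share of the variance, apparent overlap or separation in the 2D plot should be read with caution.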

Distinct Behavioral Patterns

The separation of clusters indicates distinct groupings based on mental health-related responses. Individuals within the same cluster share similarities in traits like stress levels, coping struggles, or behavioral changes.

Cluster Characteristics

By examining the summary statistics for each cluster (as done in the Cluster summaries section), it is possible to identify the defining features of each group. For instance:

Cluster 1 could consist of individuals with high quarantine frustrations and stress levels.

Cluster 2 might include those with moderate stress but fewer changes in habits.

Cluster 3 may represent individuals coping well with low frustrations.

Potential Interventions

By identifying the characteristics of individuals in each cluster, targeted interventions can be developed. For example, a group struggling with coping strategies might benefit from mental health support programs, while a group with fewer stressors may represent a resilient population whose experience offers insights for broader strategies.

# Analyze the clusters
cluster_summary <- data_clustered %>% group_by(Cluster) %>% summarise(across(everything(), ~paste(unique(.), collapse = ", ")))

# View summary of clusters
cat("Cluster summaries:\n")

## Cluster summaries:

print(cluster_summary)

## # A tibble: 3 × 14
##   Cluster Age                      Gender Occupation Days_Indoors Growing_Stress
##     <int> <chr>                    <chr>  <chr>      <chr>        <chr>         
## 1       1 30-Above, 16-20, 25-30,… Female Student, … Go out Ever… No, Yes, Maybe
## 2       2 30-Above, 25-30, 20-25,… Male   Others, H… 31-60 days,… Yes, No, Maybe
## 3       3 20-25, 16-20, 25-30, 30… Female Corporate… 1-14 days, … Yes, No, Maybe
## # ℹ 8 more variables: Quarantine_Frustrations <chr>, Changes_Habits <chr>,
## #   Mental_Health_History <chr>, Weight_Change <chr>, Mood_Swings <chr>,
## #   Coping_Struggles <chr>, Work_Interest <chr>, Social_Weakness <chr>
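
Listing the unique values per cluster tends to look much the same for every cluster, because most answers occur in all of them. One possible alternative, sketched here on made-up rows, is to compute the share of each answer within each cluster. The column names follow the survey, while the values and the name `profile` are illustrative.

```r
library(dplyr)

# Toy clustered data (made-up rows, not the survey data)
toy <- data.frame(
  Cluster        = c(1, 1, 1, 2, 2, 2),
  Growing_Stress = c("Yes", "Yes", "No", "No", "No", "Maybe")
)

# Share of each answer within each cluster
profile <- toy %>%
  count(Cluster, Growing_Stress) %>%   # rows per (cluster, answer) pair
  group_by(Cluster) %>%
  mutate(share = n / sum(n)) %>%       # proportion within the cluster
  ungroup()

print(profile)
```

Comparing these shares across clusters, one variable at a time, gives a clearer picture of what distinguishes each group than the list of unique values.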