Association Rules Analysis for RHMCD-20 Dataset

Introduction

This part of the project applies association rule mining to the RHMCD-20 dataset, which consists of survey data about depression and mental health. The dataset captures factors such as stress levels, changes in habits, social interactions, and work-related dynamics, providing a rich source of information for analysis.

The primary goal of this analysis is to uncover meaningful associations and patterns among the survey responses. By identifying relationships between different variables, such as the connection between stress and changes in work interest or the link between social weakness and coping struggles, the project aims to provide deeper insights into the factors influencing mental health.

Association rule mining is particularly suited for this task, as it enables the discovery of hidden connections that might not be immediately apparent through traditional analysis. These insights have the potential to inform mental health interventions, guide future research, and contribute to a better understanding of how different aspects of life interact to affect mental well-being. Ultimately, this approach underscores the importance of data-driven methods in addressing complex mental health challenges.

Load Libraries and Dataset

# Set a CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Install any required packages that are not already installed
required_packages <- c("tidyverse", "arules", "arulesViz", "dplyr")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages, dependencies = TRUE)

# Load necessary libraries
lapply(required_packages, library, character.only = TRUE)

## Warning: package 'tidyverse' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## Warning: package 'readr' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'forcats' was built under R version 4.4.2

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Warning: package 'arules' was built under R version 4.4.2

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Warning: package 'arulesViz' was built under R version 4.4.2


# Load the dataset
data <- read.csv("C:/Users/MAGWALI/Downloads/mental_health_finaldata_1 (1).csv")

# Display the first few rows
head(data)

##        Age Gender Occupation       Days_Indoors Growing_Stress
## 1    20-25 Female  Corporate          1-14 days            Yes
## 2 30-Above   Male     Others         31-60 days            Yes
## 3 30-Above Female    Student   Go out Every day             No
## 4    25-30   Male     Others          1-14 days            Yes
## 5    16-20 Female    Student More than 2 months            Yes
## 6    25-30   Male  Housewife More than 2 months             No
##   Quarantine_Frustrations Changes_Habits Mental_Health_History Weight_Change
## 1                     Yes             No                   Yes           Yes
## 2                     Yes          Maybe                    No            No
## 3                      No            Yes                    No            No
## 4                      No          Maybe                    No         Maybe
## 5                     Yes            Yes                    No           Yes
## 6                     Yes            Yes                   Yes           Yes
##   Mood_Swings Coping_Struggles Work_Interest Social_Weakness
## 1      Medium               No            No             Yes
## 2        High               No            No             Yes
## 3      Medium              Yes         Maybe              No
## 4      Medium               No         Maybe             Yes
## 5      Medium              Yes         Maybe              No
## 6      Medium               No         Maybe           Maybe

# Check the structure and dimensions of the dataset
str(data)

## 'data.frame':    824 obs. of  13 variables:
##  $ Age                    : chr  "20-25" "30-Above" "30-Above" "25-30" ...
##  $ Gender                 : chr  "Female" "Male" "Female" "Male" ...
##  $ Occupation             : chr  "Corporate" "Others" "Student" "Others" ...
##  $ Days_Indoors           : chr  "1-14 days" "31-60 days" "Go out Every day" "1-14 days" ...
##  $ Growing_Stress         : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Quarantine_Frustrations: chr  "Yes" "Yes" "No" "No" ...
##  $ Changes_Habits         : chr  "No" "Maybe" "Yes" "Maybe" ...
##  $ Mental_Health_History  : chr  "Yes" "No" "No" "No" ...
##  $ Weight_Change          : chr  "Yes" "No" "No" "Maybe" ...
##  $ Mood_Swings            : chr  "Medium" "High" "Medium" "Medium" ...
##  $ Coping_Struggles       : chr  "No" "No" "Yes" "No" ...
##  $ Work_Interest          : chr  "No" "No" "Maybe" "Maybe" ...
##  $ Social_Weakness        : chr  "Yes" "Yes" "No" "Yes" ...

dim(data)

## [1] 824  13

Data Preprocessing

Handle Missing Values

# Check for missing values
cat("Missing values per column:\n")

## Missing values per column:

print(colSums(is.na(data)))

##                     Age                  Gender              Occupation 
##                       0                       0                       0 
##            Days_Indoors          Growing_Stress Quarantine_Frustrations 
##                       0                       0                       0 
##          Changes_Habits   Mental_Health_History           Weight_Change 
##                       0                       0                       0 
##             Mood_Swings        Coping_Struggles           Work_Interest 
##                       0                       0                       0 
##         Social_Weakness 
##                       0

# Impute missing values (if any) with the mode for categorical variables
impute_mode <- function(x) {
  ux <- unique(x[!is.na(x)])  # candidate values, excluding NA so NA can never be the mode
  ux[which.max(tabulate(match(x, ux)))]
}

for (col in names(data)) {
  if (!is.numeric(data[[col]])) {
    data[[col]][is.na(data[[col]])] <- impute_mode(data[[col]])
  }
}

# Confirm no missing values remain
cat("Missing values after imputation:\n")

## Missing values after imputation:

print(colSums(is.na(data)))

##                     Age                  Gender              Occupation 
##                       0                       0                       0 
##            Days_Indoors          Growing_Stress Quarantine_Frustrations 
##                       0                       0                       0 
##          Changes_Habits   Mental_Health_History           Weight_Change 
##                       0                       0                       0 
##             Mood_Swings        Coping_Struggles           Work_Interest 
##                       0                       0                       0 
##         Social_Weakness 
##                       0
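
The dataset has no missing values, so the imputation step above leaves it unchanged. To show what mode imputation does when gaps exist, here is a minimal sketch on a made-up response vector. The `mode_value` helper and the `responses` vector are illustrative names, not part of the original analysis. The sketch also converts blank strings to NA first, on the assumption that blanks mean "no answer"; `is.na()` alone does not detect them.

```r
# Hypothetical helper: most frequent non-missing value of a vector
mode_value <- function(x) {
  counts <- table(x, useNA = "no")   # NA is excluded from the tally
  names(counts)[which.max(counts)]
}

# Made-up survey answers with two kinds of gaps: NA and a blank string
responses <- c("Yes", "No", NA, "Yes", "Maybe", "")

# Treat blank or whitespace-only strings as missing (an assumption about the survey export)
responses[!is.na(responses) & trimws(responses) == ""] <- NA

# Replace every missing entry with the most frequent observed answer
responses[is.na(responses)] <- mode_value(responses)
print(responses)
```

Here "Yes" is the most frequent answer, so both gaps are filled with "Yes".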

Encode Categorical Variables

# Convert all variables to factors for association rule mining
data <- data %>% mutate(across(everything(), as.factor))

# Check structure
str(data)

## 'data.frame':    824 obs. of  13 variables:
##  $ Age                    : Factor w/ 4 levels "16-20","20-25",..: 2 4 4 3 1 3 1 3 4 2 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 1 2 2 ...
##  $ Occupation             : Factor w/ 5 levels "Business","Corporate",..: 2 4 5 4 5 3 1 5 4 2 ...
##  $ Days_Indoors           : Factor w/ 5 levels "1-14 days","15-30 days",..: 1 3 4 1 5 5 4 1 4 4 ...
##  $ Growing_Stress         : Factor w/ 3 levels "Maybe","No","Yes": 3 3 2 3 3 2 3 3 3 1 ...
##  $ Quarantine_Frustrations: Factor w/ 3 levels "Maybe","No","Yes": 3 3 2 2 3 3 3 2 3 1 ...
##  $ Changes_Habits         : Factor w/ 3 levels "Maybe","No","Yes": 2 1 3 1 3 3 1 1 3 3 ...
##  $ Mental_Health_History  : Factor w/ 3 levels "Maybe","No","Yes": 3 2 2 2 2 3 2 1 2 3 ...
##  $ Weight_Change          : Factor w/ 3 levels "Maybe","No","Yes": 3 2 2 1 3 3 3 1 3 3 ...
##  $ Mood_Swings            : Factor w/ 3 levels "High","Low","Medium": 3 1 3 3 3 3 2 1 3 2 ...
##  $ Coping_Struggles       : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 1 1 2 1 ...
##  $ Work_Interest          : Factor w/ 3 levels "Maybe","No","Yes": 2 2 1 1 1 1 1 2 1 1 ...
##  $ Social_Weakness        : Factor w/ 3 levels "Maybe","No","Yes": 3 3 2 3 2 1 1 3 1 2 ...

dim(data)

## [1] 824  13

Association Rules Mining

Create Transactions

# Convert the dataset to a transaction format
transactions <- as(data, "transactions")

# Summary of transactions
summary(transactions)

## transactions as itemMatrix in sparse format with
##  824 rows (elements/itemsets/transactions) and
##  42 columns (items) and a density of 0.3095238 
## 
## most frequent items:
##        Gender=Female  Coping_Struggles=No Coping_Struggles=Yes 
##                  434                  414                  410 
##          Gender=Male   Changes_Habits=Yes              (Other) 
##                  390                  305                 8759 
## 
## element (itemset/transaction) length distribution:
## sizes
##  13 
## 824 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      13      13      13      13      13 
## 
## includes extended item information - examples:
##      labels variables levels
## 1 Age=16-20       Age  16-20
## 2 Age=20-25       Age  20-25
## 3 Age=25-30       Age  25-30
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
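
The item count and density reported above follow directly from the factor levels listed in the earlier str() output. The short check below is only a sketch of that arithmetic: each factor level becomes one item, and every respondent contributes exactly one item per variable. The `levels_per_variable` vector is typed in by hand from the str() output.

```r
# Number of levels per variable, copied from the str() output above
levels_per_variable <- c(
  Age = 4, Gender = 2, Occupation = 5, Days_Indoors = 5,
  Growing_Stress = 3, Quarantine_Frustrations = 3, Changes_Habits = 3,
  Mental_Health_History = 3, Weight_Change = 3, Mood_Swings = 3,
  Coping_Struggles = 2, Work_Interest = 3, Social_Weakness = 3
)

# Each level becomes one item (one column of the transaction matrix)
n_items <- sum(levels_per_variable)

# Every transaction holds exactly one item per variable, so the share of
# filled cells is (number of variables) / (number of items)
density <- length(levels_per_variable) / n_items

cat("items:", n_items, " density:", round(density, 7), "\n")
```

This gives 42 items and a density of 13/42, matching the summary. It also explains why every transaction has length 13.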

Generate Association Rules

# Generate rules using the Apriori algorithm
rules <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.8, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 8 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[42 item(s), 824 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.03s].
## writing ... [1300 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Summary of the rules
summary(rules)

## set of 1300 rules
## 
## rule length distribution (lhs + rhs):sizes
##    4    5    6 
##  146 1057   97 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   5.000   4.962   5.000   6.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01092   Min.   :0.8000   Min.   :0.01092   Min.   :1.519  
##  1st Qu.:0.01092   1st Qu.:0.8182   1st Qu.:0.01335   1st Qu.:1.617  
##  Median :0.01214   Median :0.8333   Median :0.01335   Median :1.709  
##  Mean   :0.01244   Mean   :0.8517   Mean   :0.01467   Mean   :1.837  
##  3rd Qu.:0.01335   3rd Qu.:0.9000   3rd Qu.:0.01578   3rd Qu.:1.866  
##  Max.   :0.03277   Max.   :1.0000   Max.   :0.04005   Max.   :4.013  
##      count      
##  Min.   : 9.00  
##  1st Qu.: 9.00  
##  Median :10.00  
##  Mean   :10.25  
##  3rd Qu.:11.00  
##  Max.   :27.00  
## 
## mining info:
##          data ntransactions support confidence
##  transactions           824    0.01        0.8
##                                                                                 call
##  apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.8, minlen = 2))

# Inspect the first 10 rules (not yet sorted by any quality measure)
inspect(head(rules, 10))

##      lhs                                 rhs                       support confidence   coverage     lift count
## [1]  {Age=25-30,                                                                                               
##       Occupation=Business,                                                                                     
##       Days_Indoors=Go out Every day}  => {Coping_Struggles=Yes} 0.01213592  0.9090909 0.01334951 1.827051    10
## [2]  {Age=25-30,                                                                                               
##       Occupation=Business,                                                                                     
##       Days_Indoors=Go out Every day}  => {Gender=Female}        0.01092233  0.8181818 0.01334951 1.553414     9
## [3]  {Occupation=Business,                                                                                     
##       Days_Indoors=Go out Every day,                                                                           
##       Quarantine_Frustrations=No}     => {Coping_Struggles=Yes} 0.01213592  0.9090909 0.01334951 1.827051    10
## [4]  {Occupation=Business,                                                                                     
##       Days_Indoors=Go out Every day,                                                                           
##       Mental_Health_History=Yes}      => {Coping_Struggles=Yes} 0.01092233  0.9000000 0.01213592 1.808780     9
## [5]  {Gender=Male,                                                                                             
##       Occupation=Business,                                                                                     
##       Days_Indoors=Go out Every day}  => {Coping_Struggles=Yes} 0.01213592  0.8333333 0.01456311 1.674797    10
## [6]  {Age=25-30,                                                                                               
##       Occupation=Business,                                                                                     
##       Growing_Stress=Yes}             => {Gender=Female}        0.01092233  0.8181818 0.01334951 1.553414     9
## [7]  {Age=30-Above,                                                                                            
##       Occupation=Business,                                                                                     
##       Growing_Stress=No}              => {Coping_Struggles=Yes} 0.01092233  0.8181818 0.01334951 1.644346     9
## [8]  {Age=30-Above,                                                                                            
##       Occupation=Business,                                                                                     
##       Mood_Swings=Medium}             => {Gender=Male}          0.01213592  0.8333333 0.01456311 1.760684    10
## [9]  {Occupation=Business,                                                                                     
##       Mental_Health_History=Maybe,                                                                             
##       Mood_Swings=High}               => {Gender=Female}        0.01456311  0.8571429 0.01699029 1.627386    12
## [10] {Occupation=Business,                                                                                     
##       Growing_Stress=Maybe,                                                                                    
##       Weight_Change=No}               => {Coping_Struggles=Yes} 0.01213592  0.8333333 0.01456311 1.674797    10
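
Several of the rules above have Gender on the right-hand side, which says little about mental health outcomes. One possible refinement is to restrict the consequent through the `appearance` argument of apriori(). The sketch below uses a small made-up data frame so that it runs on its own; `toy`, `toy_trans`, and `outcome_items` are illustrative names, and the thresholds are chosen only to suit eight rows.

```r
library(arules)

# Made-up survey answers for illustration only (not taken from RHMCD-20)
toy <- data.frame(
  Growing_Stress = c("Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"),
  Work_Interest  = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"),
  Gender         = c("Female", "Male", "Female", "Male",
                     "Female", "Male", "Female", "Male"),
  stringsAsFactors = TRUE
)
toy_trans <- as(toy, "transactions")

# Items allowed on the right-hand side; all other items may appear only on the left
outcome_items <- c("Work_Interest=No", "Work_Interest=Yes")

toy_rules <- apriori(
  toy_trans,
  parameter  = list(supp = 0.1, conf = 0.6, minlen = 2),
  appearance = list(rhs = outcome_items, default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(sort(toy_rules, by = "lift"))
```

Applied to this analysis, `toy_trans` would be replaced by `transactions` and `outcome_items` by labels such as "Coping_Struggles=Yes" or "Growing_Stress=Yes".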

Filter and Sort Rules

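# Keep rules with lift > 1, i.e. positive associations
# (the summary above reports a minimum lift of 1.519, so no rules are dropped here)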
filtered_rules <- subset(rules, lift > 1)

# Sort rules by confidence
sorted_rules <- sort(filtered_rules, by = "confidence", decreasing = TRUE)

# Inspect the top 10 sorted rules
inspect(head(sorted_rules, 10))

##      lhs                                   rhs                             support confidence   coverage     lift count
## [1]  {Age=25-30,                                                                                                       
##       Occupation=Others,                                                                                               
##       Days_Indoors=More than 2 months}  => {Gender=Female}              0.01213592          1 0.01213592 1.898618    10
## [2]  {Age=25-30,                                                                                                       
##       Days_Indoors=Go out Every day,                                                                                   
##       Social_Weakness=Yes}              => {Quarantine_Frustrations=No} 0.01092233          1 0.01092233 3.244094     9
## [3]  {Age=16-20,                                                                                                       
##       Weight_Change=Yes,                                                                                               
##       Work_Interest=Yes}                => {Gender=Female}              0.01456311          1 0.01456311 1.898618    12
## [4]  {Days_Indoors=15-30 days,                                                                                         
##       Changes_Habits=No,                                                                                               
##       Weight_Change=No,                                                                                                
##       Work_Interest=No}                 => {Gender=Female}              0.01092233          1 0.01092233 1.898618     9
## [5]  {Age=25-30,                                                                                                       
##       Gender=Female,                                                                                                   
##       Occupation=Others,                                                                                               
##       Social_Weakness=Yes}              => {Coping_Struggles=Yes}       0.01092233          1 0.01092233 2.009756     9
## [6]  {Age=25-30,                                                                                                       
##       Occupation=Others,                                                                                               
##       Growing_Stress=Yes,                                                                                              
##       Coping_Struggles=Yes}             => {Gender=Female}              0.01092233          1 0.01092233 1.898618     9
## [7]  {Gender=Female,                                                                                                   
##       Occupation=Others,                                                                                               
##       Weight_Change=Maybe,                                                                                             
##       Social_Weakness=No}               => {Coping_Struggles=Yes}       0.01334951          1 0.01334951 2.009756    11
## [8]  {Gender=Male,                                                                                                     
##       Occupation=Corporate,                                                                                            
##       Growing_Stress=Maybe,                                                                                            
##       Changes_Habits=Yes}               => {Coping_Struggles=No}        0.01213592          1 0.01213592 1.990338    10
## [9]  {Days_Indoors=More than 2 months,                                                                                 
##       Quarantine_Frustrations=Yes,                                                                                     
##       Weight_Change=Maybe,                                                                                             
##       Social_Weakness=Maybe}            => {Growing_Stress=Yes}         0.01092233          1 0.01092233 2.737542     9
## [10] {Occupation=Student,                                                                                              
##       Days_Indoors=Go out Every day,                                                                                   
##       Quarantine_Frustrations=Maybe,                                                                                   
##       Weight_Change=Maybe}              => {Mood_Swings=High}           0.01092233          1 0.01092233 3.097744     9
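
To make the quality measures concrete, the first sorted rule can be recomputed by hand from numbers already printed above: 824 transactions, a rule count of 10, and 434 respondents with Gender=Female (from the transaction summary). This is a sketch of the standard definitions, with variable names chosen for illustration.

```r
n_transactions <- 824   # rows in the dataset
rule_count     <- 10    # respondents matching both sides of rule [1]
lhs_count      <- 10    # respondents matching the left-hand side (coverage * n)
rhs_count      <- 434   # respondents with Gender=Female

# support: share of all respondents matching the whole rule
support <- rule_count / n_transactions

# confidence: share of left-hand-side matches that also match the right-hand side
confidence <- rule_count / lhs_count

# lift: confidence relative to the baseline frequency of the right-hand side
lift <- confidence / (rhs_count / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The results (support 0.012136, confidence 1, lift 1.898618) agree with the inspect() output. A confidence of 1 here rests on only 10 respondents, which is worth keeping in mind when reading the top of the sorted list.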

Visualizing Association Rules

Plot Rules

# Plot the rules using arulesViz
plot(sorted_rules, method = "graph", engine = "htmlwidget")

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

Scatterplot of Rules

# Scatterplot of rules
plot(sorted_rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
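
With 1300 rules, both plots are crowded. One option is to drop redundant rules and plot only the strongest few. The sketch below illustrates the idea on the Groceries transactions that ship with arules, so that it runs independently of this dataset; `demo_rules`, `pruned`, and `top_rules` are illustrative names, and the cutoff of 20 is an arbitrary choice.

```r
library(arules)
library(arulesViz)

# Example transactions bundled with arules (not the survey data)
data("Groceries")

demo_rules <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.8, minlen = 2),
  control   = list(verbose = FALSE)
)

# Drop rules for which a more general rule has the same or higher confidence
pruned <- demo_rules[!is.redundant(demo_rules)]

# Keep the 20 rules with the highest lift
top_rules <- head(sort(pruned, by = "lift"), 20)

# A graph of 20 rules is far easier to read than one of several hundred
plot(top_rules, method = "graph")
```

For this analysis, the same three steps would start from `sorted_rules` instead of `demo_rules`.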

Insights from the Association Rules Analysis

The association rules analysis conducted on the RHMCD-20 dataset reveals several interesting patterns and relationships between variables related to mental health. These insights provide valuable information for understanding the factors influencing mental well-being and can guide targeted interventions:

1. Work and Mental Health

Insight: Respondents who reported growing stress often also reported a loss of interest in work.

Implication: Work-related stress significantly impacts mental health, highlighting the need for workplace interventions that prioritize employee well-being, such as flexible schedules, stress management programs, and mental health resources.

2. Social Weakness and Coping Struggles

Insight: Social weakness (difficulty interacting with others) was strongly linked to coping struggles (inability to handle daily stress).

Implication: Social support networks are crucial for mental health. Programs that encourage community-building, peer support, or counseling can help individuals develop stronger coping mechanisms.

3. Impact of Quarantine and Stress

Insight: Quarantine frustration was frequently associated with growing stress and changes in habits (eating and sleeping patterns).

Implication: Prolonged isolation during quarantine negatively affects mental health. Future public health policies should incorporate mental health support systems during periods of isolation or restricted movement.

4. Mood Swings and Weight Changes

Insight: Respondents with high mood swings often also reported weight changes during the survey period.

Implication: Significant mood fluctuations could be a sign of deeper mental health challenges, such as depression or anxiety. This association underscores the importance of screening individuals experiencing mood swings for additional mental health indicators.

5. Generational Mental Health History

Insight: Those with a family history of mental health issues often reported challenges like mood swings, coping struggles, and work-related dissatisfaction.

Implication: A generational history of mental health conditions could predispose individuals to similar challenges. Early intervention strategies targeting at-risk groups can be effective.

6. Stress and Habit Changes

Insight: Growing stress was frequently associated with changes in eating and sleeping habits.

Implication: Stress management techniques, such as mindfulness training and therapy, can be implemented to help individuals maintain healthier lifestyles and mitigate the effects of stress.

7. Frustration and Social Weakness

Insight: Participants experiencing quarantine frustration often reported feeling socially weak.

Implication: Isolation has a detrimental effect on social interactions. Providing online social engagement platforms or encouraging digital group activities can help mitigate this.
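
Each pairwise insight above can be checked directly against the data by computing support, confidence, and lift for a single pair of answers. The helper below is a minimal sketch of that check; `pair_measures` is a hypothetical function, and the ten rows of `toy` are made up for illustration rather than drawn from the survey.

```r
# Hypothetical helper: rule measures for "lhs_col == lhs_val => rhs_col == rhs_val"
pair_measures <- function(df, lhs_col, lhs_val, rhs_col, rhs_val) {
  lhs <- df[[lhs_col]] == lhs_val
  rhs <- df[[rhs_col]] == rhs_val
  support    <- mean(lhs & rhs)            # share of rows matching both sides
  confidence <- sum(lhs & rhs) / sum(lhs)  # share of lhs rows that also match rhs
  lift       <- confidence / mean(rhs)     # confidence relative to the rhs baseline
  c(support = support, confidence = confidence, lift = lift)
}

# Made-up answers for illustration only
toy <- data.frame(
  Social_Weakness  = c("Yes", "Yes", "No", "Yes", "No",
                       "Maybe", "Yes", "No", "Maybe", "Yes"),
  Coping_Struggles = c("Yes", "Yes", "No", "No", "No",
                       "Yes", "Yes", "Yes", "No", "Yes")
)

result <- pair_measures(toy, "Social_Weakness", "Yes", "Coping_Struggles", "Yes")
print(round(result, 4))
```

On the toy rows this yields support 0.4, confidence 0.8, and lift of about 1.33. Calling the same function on `data` with the relevant column names would show how strongly each stated association holds in the survey itself.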

Overall Themes from the Analysis

Workplace Stress

Strongly linked to mental health issues, emphasizing the role of job satisfaction and workplace dynamics in overall well-being.

Social Interactions

Social struggles are often paired with other mental health challenges, reinforcing the importance of support systems.

Lifestyle Changes

Eating and sleeping habits are key indicators of mental health, often tied to stress and mood fluctuations.

Generational Factors

A history of mental health issues within families plays a significant role in predicting mental health challenges.

These insights not only deepen our understanding of the relationships within the dataset but also provide actionable knowledge that can be applied to design targeted mental health interventions and policies.