Introduction

In today’s world, social media has become an important part of people’s lives. Many users share their thoughts, feelings, and daily experiences through posts and comments. Often, even close family members, like parents, may not know that someone is struggling with their mental health. However, social media platforms, with the large amount of data they collect, can provide important clues about these issues.

This opens up the possibility for social media to help detect early signs of mental health problems like stress, anxiety, and depression. By analyzing digital footprints—such as words, emojis, and behavior patterns—data science can identify and monitor mental health risks. This project uses Association Rule Learning (ARL) to find hidden patterns in user behavior. These insights can help improve mental health care by supporting early detection of at-risk individuals. The project also enables personalized support with targeted recommendations and offers scalable solutions that can be applied to large groups. With trend visualizations and actionable data, it helps in making better, data-driven decisions for mental health care.


Mental Health Trends

Purpose and Impact

This project uses Association Rule Learning (ARL) to extract insights from social media data to improve mental health care through early detection, personalized support, and scalable solutions. It focuses on:

  • Identifying At-Risk Individuals: Detect early signs of mental health issues by analyzing user behavior and communication patterns.

  • Understanding Behavioral Patterns: Uncover hidden connections between factors like emotions, keywords, emojis, and activity trends.


Data Description

Data Source

  • Platform: Reddit
  • Subreddit: r/MentalHealth

For this project, I collected live data from Reddit, a social media platform, using a custom R script. The script retrieves data by setting up app credentials, getting authorization, and fetching the required information. After collecting the data, I converted it into a data frame and saved it as a CSV file named “reddit_posts.csv”. The data collection process is explained in more detail here. The dataset (reddit_data) gives a detailed view of discussions in the r/MentalHealth subreddit.
reddit_data <- read.csv("reddit_posts.csv", stringsAsFactors = FALSE)

Dataset Summary

summary(reddit_data)
##     title               text              author           time_posted       
##  Length:502         Length:502         Length:502         Min.   :1.721e+09  
##  Class :character   Class :character   Class :character   1st Qu.:1.737e+09  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.737e+09  
##                                                           Mean   :1.737e+09  
##                                                           3rd Qu.:1.737e+09  
##                                                           Max.   :1.737e+09  
##     upvotes       comments_count      emojis         
##  Min.   :  0.00   Min.   :  0.00   Length:502        
##  1st Qu.:  1.00   1st Qu.:  0.00   Class :character  
##  Median :  1.00   Median :  1.00   Mode  :character  
##  Mean   :  3.99   Mean   :  4.55                     
##  3rd Qu.:  2.00   3rd Qu.:  4.00                     
##  Max.   :199.00   Max.   :231.00

The dataset (reddit_data) has 502 rows and 7 columns, providing insights into conversations in the r/MentalHealth subreddit. The structure of the data is shown below:

str(reddit_data)
## 'data.frame':    502 obs. of  7 variables:
##  $ title         : chr  "Elections and Politics" "r/MentalHealth is looking for moderators" "Could weed bring out mental illness" "Sent nudes to someone because I was desperate for cash. And they scammed me. " ...
##  $ text          : chr  "Hello friends!\n\nIt's that time of the year again. We have always intended for r/mentalhealth to be a safe, po"| __truncated__ "Hey r/mentalhealth! We're looking to grow our moderation team. Moderators are a key part of what makes any redd"| __truncated__ "21M. I’ve never been a a constant weed user. But for the last 3 months I’ve been smoking around 3 times per wee"| __truncated__ "I’m 19f and I’ve been desperate for cash for me and my dog since my mum died. I have autism and sometimes strug"| __truncated__ ...
##  $ author        : chr  "Pi25" "DrivesInCircles" "Individual-Young-591" "forgottenpopcork" ...
##  $ time_posted   : int  1730031213 1720873558 1737141167 1737149651 1737116571 1737136497 1737137095 1737149008 1737152194 1737151049 ...
##  $ upvotes       : int  7 21 41 21 59 14 10 4 3 3 ...
##  $ comments_count: int  4 27 97 18 39 10 11 8 10 1 ...
##  $ emojis        : chr  NA NA NA NA ...

This dataset is helpful for studying mental health discussions on Reddit. It can reveal patterns related to user engagement, post content, and community responses to mental health topics.

Key Attributes:

The dataset contains information related to mental health posts and includes the following attributes:

  • Title: Title of the post.

  • Text: Content or body of the post.

  • Author: Username of the post’s author.

  • Time Posted: Timestamp indicating when the post was created.

  • Upvotes: Number of upvotes received.

  • Comments Count: Number of comments on the post.

  • Emojis: Emojis used in the post, if any.

  • Size: 502 rows × 7 columns


Data Preprocessing

This section explains how I approached the analysis step by step. The process involved preparing the data, engineering the relevant features, mining association rules, and interpreting the results.


1. Data Transformation

Modify and create new features from existing data for analysis:

  • Convert Unix Timestamps to Readable Date and Time: The time_posted column was converted from a Unix timestamp to a readable date-time format. Time-based features such as hour, date, and day_of_week were extracted.
library(dplyr)      # %>%, mutate(), select(), case_when()
library(lubridate)  # hour() (assumed to come from lubridate)
reddit_data <- reddit_data %>%
  mutate(
    time_posted = as.POSIXct(time_posted, origin = "1970-01-01", tz = "UTC"),
    date = as.Date(time_posted),               # Extract date
    time = format(time_posted, "%H:%M:%S"),    # Extract time
    hour = hour(time_posted),                  # Extract hour
    day_of_week = weekdays(time_posted)        # Extract day of the week
  )
  • Create Time Categories: Introduce a new categorical variable time_category to classify posts into time-based periods (e.g., Night, Morning, Afternoon, Evening).
reddit_data$time_category <- case_when(
  reddit_data$hour >= 0 & reddit_data$hour < 6 ~ "Night",
  reddit_data$hour >= 6 & reddit_data$hour < 12 ~ "Morning",
  reddit_data$hour >= 12 & reddit_data$hour < 18 ~ "Afternoon",
  TRUE ~ "Evening"
)
  • Add a Binary Column for High Engagement: Create the High_Engagement column to classify posts as “High” (more than 30 upvotes) or “Low” engagement based on upvote counts.
reddit_data$High_Engagement <- ifelse(reddit_data$upvotes > 30, "High", "Low")
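
As a quick sanity check of the transformations above, the sketch below applies the same logic to a single timestamp taken from the str() output, using base R only. The helper name categorize_hour is hypothetical and is not part of the original script.

```r
# One Unix timestamp from the dataset preview (seconds since 1970-01-01 UTC)
ts <- as.POSIXct(1737141167, origin = "1970-01-01", tz = "UTC")

# Extract the hour as an integer without extra packages
post_hour <- as.integer(format(ts, "%H"))

# Hypothetical helper mirroring the case_when() boundaries used above
categorize_hour <- function(h) {
  ifelse(h < 6, "Night",
         ifelse(h < 12, "Morning",
                ifelse(h < 18, "Afternoon", "Evening")))
}

format(ts, "%Y-%m-%d %H:%M:%S")   # "2025-01-17 19:12:47"
categorize_hour(post_hour)        # "Evening"
categorize_hour(c(0, 6, 12, 18))  # "Night" "Morning" "Afternoon" "Evening"
```

Because the conversion uses tz = "UTC", the categories reflect UTC clock time rather than each author's local time.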

2. Data Cleaning

  • Drop Unnecessary Columns: Remove irrelevant columns, such as emojis and the original time_posted, to streamline the dataset.
reddit_data <- reddit_data %>%
  select(-time_posted, -emojis)

3. Text Processing and Feature Engineering

Clean and structure text data to extract meaningful features for further analysis.

  • Perform Sentiment Analysis: Analyze the text column to determine the sentiment of each post (Positive, Negative, or Neutral) based on predefined word dictionaries. The positive dictionary is checked first, so a post that matches both dictionaries is labeled Positive.
# Positive words
positive_words <- c(
  "happy", "love", "great", "joy", "peace", "amazing", "awesome", "fantastic", 
  "wonderful", "excited", "positive", "grateful", "satisfied", "blissful", "content", 
  "hopeful", "cheerful", "optimistic", "smile", "encouraging", "uplifting", 
  "brilliant", "sunshine", "thrilled", "energetic", "motivated", "successful", "winning", 
  "achievement", "relaxed", "calm", "fun", "delightful"
)

# Negative words
negative_words <- c(
  "sad", "anxiety", "depression", "stress", "lonely", "angry", "frustrated", 
  "hopeless", "tired", "upset", "fear", "worried", "pessimistic", "crying", 
  "heartbroken", "loss", "pain", "grief", "guilt", "regret", "failure", "broken", 
  "miserable", "hurt", "devastated", "unhappy", "disappointed", "overwhelmed", 
  "trapped", "nervous", "exhausted", "worthless", "empty", "helpless"
)

reddit_data$Sentiment <- ifelse(
  grepl(paste(positive_words, collapse = "|"), reddit_data$text, ignore.case = TRUE), "Positive",
  ifelse(grepl(paste(negative_words, collapse = "|"), reddit_data$text, ignore.case = TRUE), "Negative", "Neutral")
)
  • Extract and Clean Keywords from Titles: Clean the title column by removing special characters, converting text to lowercase, and eliminating common stop words to create the Keywords column.
reddit_data$Keywords <- gsub("[^a-zA-Z ]", "", reddit_data$title) 
reddit_data$Keywords <- tolower(reddit_data$Keywords)            
reddit_data$Keywords <- gsub("\\b(?:the|and|for|is|that|to|of|on|in|a)\\b", "", reddit_data$Keywords) 
  • Tokenize the Keywords: Split the Keywords column into individual words for further analysis.
library(tidyr)  # separate_rows()
tokenized_data <- reddit_data %>%
  separate_rows(Keywords, sep = "\\s+") # Split by spaces
  • Remove Stopwords from Tokenized Data: Remove commonly used stop words from the tokenized data to retain only significant words.
tokenized_data_cleaned <- tokenized_data %>%
  filter(!Keywords %in% stopwords("en"))
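
The dictionary lookup in the sentiment step above uses plain substring matching, so a short entry such as "fun" also matches inside longer words. The sketch below uses hypothetical example sentences to show the effect, together with one possible variation that wraps the dictionary in word boundaries. It is an illustration only, not the method used for the results reported here.

```r
# Hypothetical example texts
texts <- c("My dysfunction is getting worse",  # contains "fun" only as a substring
           "We had fun at the park")           # contains "fun" as a whole word

# Two entries from the positive dictionary, joined as in the original code
pattern_plain <- paste(c("happy", "fun"), collapse = "|")

# Variation: require word boundaries around each dictionary entry
pattern_words <- paste0("\\b(", pattern_plain, ")\\b")

plain_hits <- grepl(pattern_plain, texts, ignore.case = TRUE)
word_hits  <- grepl(pattern_words, texts, ignore.case = TRUE, perl = TRUE)

plain_hits  # TRUE TRUE
word_hits   # FALSE TRUE
```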

4. Data Preview

  • Preview of Cleaned Data:
head(tokenized_data_cleaned$Keywords)
## [1] "elections"     "politics"      "rmentalhealth" "looking"      
## [5] "moderators"    "weed"

Association Rules

Association rule learning is a rule-based method used in data mining to identify relationships or patterns between variables in a dataset. It is typically used to discover interesting correlations, frequent itemsets, or associations in large sets of transactions. Association rules are represented in the form:

\[ X \rightarrow Y \]

Where \(X\) (antecedent) and \(Y\) (consequent) are itemsets, meaning that the presence of \(X\) implies the presence of \(Y\).

For example, rules in this setting could take the following form (the percentages are illustrative, not results from the data):

  • Users posting negative sentiment during nighttime are 80% likely to express depressive thoughts within a week.

  • High use of specific keywords (e.g., ‘hopeless’, ‘tired’) correlates with a 70% increase in stress-related behavior.

There are 3 main measures used when it comes to mining for association rules:

Support

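\[ support(X) = \frac{count(X)}{N} \]

Here \(count(X)\) is the number of transactions that contain the itemset \(X\), and \(N\) is the total number of transactions. Support shows how frequently an itemset or rule occurs in the dataset.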

Confidence

\[ confidence(X \rightarrow Y) = \frac{support(X \cup Y)}{support(X)} \]

It shows the proportion of transactions containing \(X\) that also contain \(Y\).

Lift

\[ lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{support(Y)} \]

It shows how much more likely \(Y\) is to appear in a transaction when \(X\) is known to be present, compared with the probability of \(Y\) appearing without any knowledge of \(X\).
If lift is greater than 1, there is a positive association between the two items or itemsets.
If it is close to 1, the items or itemsets are approximately independent.
A value lower than 1 indicates a negative association.
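
To make the three measures concrete, here is a minimal worked example in base R. The five toy transactions and the column names night and negative are hypothetical and serve only to illustrate the arithmetic.

```r
# Hypothetical toy data: 5 transactions, TRUE means the item is present
toy <- data.frame(
  night    = c(TRUE, TRUE, FALSE, TRUE, FALSE),   # X: posted at night
  negative = c(TRUE, TRUE, FALSE, FALSE, TRUE)    # Y: negative sentiment
)

support_x  <- mean(toy$night)                 # 3/5 = 0.6
support_y  <- mean(toy$negative)              # 3/5 = 0.6
support_xy <- mean(toy$night & toy$negative)  # 2/5 = 0.4

confidence <- support_xy / support_x          # 0.4 / 0.6, about 0.667
lift       <- confidence / support_y          # about 1.11, a weak positive association
```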


Apriori

Apriori is a classic algorithm in association rule learning that is used to mine frequent itemsets and generate association rules. It is based on the principle that all subsets of a frequent itemset must also be frequent. The algorithm operates in two steps:

  1. Frequent Itemset Generation: Identify all itemsets that appear frequently in the dataset, satisfying a minimum support threshold.
  2. Association Rule Generation: Generate rules from the frequent itemsets that meet the minimum confidence threshold.
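
The two steps can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy transactions, not the arules implementation: it counts support directly for every candidate instead of applying the subset-pruning optimization, and it only generates rules with a single item on the right-hand side. The names support_of, min_support, and min_confidence are chosen for this example.

```r
# Hypothetical toy transactions: each element is the item set of one transaction
transactions <- list(
  c("Negative", "Night", "tired"),
  c("Negative", "Night"),
  c("Positive", "Morning"),
  c("Negative", "Evening", "tired"),
  c("Negative", "Night", "tired")
)
min_support    <- 0.5
min_confidence <- 0.8

# Fraction of transactions that contain every item of the itemset
support_of <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Step 1: level-wise frequent itemset generation
frequent <- list()
current  <- Filter(function(s) support_of(s) >= min_support,
                   as.list(sort(unique(unlist(transactions)))))
while (length(current) > 0) {
  frequent <- c(frequent, current)
  k <- length(current[[1]]) + 1
  # Candidates of size k are unions of two frequent itemsets of size k - 1
  candidates <- list()
  for (i in seq_along(current)) {
    for (j in seq_along(current)) {
      u <- sort(union(current[[i]], current[[j]]))
      if (i < j && length(u) == k) candidates[[paste(u, collapse = ",")]] <- u
    }
  }
  current <- Filter(function(s) support_of(s) >= min_support, unname(candidates))
}

# Step 2: rule generation from frequent itemsets with at least two items
rules <- data.frame(lhs = character(), rhs = character(), confidence = numeric())
for (s in frequent[lengths(frequent) >= 2]) {
  for (rhs in s) {
    lhs  <- setdiff(s, rhs)
    conf <- support_of(s) / support_of(lhs)
    if (conf >= min_confidence) {
      rules <- rbind(rules, data.frame(lhs = paste(lhs, collapse = ","),
                                       rhs = rhs, confidence = conf))
    }
  }
}
rules  # Night => Negative and tired => Negative, both with confidence 1
```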

Example:

Using Apriori, the patterns mentioned above can be discovered:

  • The algorithm identifies frequent itemsets like “negative sentiment + nighttime” and “keywords like ‘hopeless’ or ‘tired’,” then derives association rules showing their correlation with depressive thoughts or stress-related behaviors.
  • For instance: “If negative sentiment occurs during nighttime, then there is an 80% likelihood of depressive thoughts within a week.”

Prepare data for Apriori

reddit_data <- tokenized_data_cleaned

Summary of final data

##     title               text              author             upvotes      
##  Length:2479        Length:2479        Length:2479        Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.:  1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :  1.00  
##                                                           Mean   :  4.57  
##                                                           3rd Qu.:  2.00  
##                                                           Max.   :199.00  
##  comments_count         date                time                hour      
##  Min.   :  0.000   Min.   :2024-07-13   Length:2479        Min.   : 0.00  
##  1st Qu.:  0.000   1st Qu.:2025-01-16   Class :character   1st Qu.: 4.00  
##  Median :  1.000   Median :2025-01-16   Mode  :character   Median :14.00  
##  Mean   :  5.268   Mean   :2025-01-15                      Mean   :12.44  
##  3rd Qu.:  4.000   3rd Qu.:2025-01-17                      3rd Qu.:20.00  
##  Max.   :231.000   Max.   :2025-01-17                      Max.   :23.00  
##  day_of_week        time_category      High_Engagement     Sentiment        
##  Length:2479        Length:2479        Length:2479        Length:2479       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Keywords        
##  Length:2479       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

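The code below prepares the dataset (reddit_data) for association rule mining: it selects the relevant columns, converts them to factors, and coerces the result to the transactions format that the arules algorithms expect. Because the data is tokenized, each of the 2,479 transactions corresponds to one title keyword, and the post-level attributes (sentiment, time category, engagement) repeat for every keyword of the same post.

library(arules)  # transactions class, apriori(), eclat(), inspect()

transaction_data <- reddit_data %>%
  select(Sentiment, time_category, High_Engagement, Keywords) %>%
  mutate(across(everything(), as.factor))  # each factor level becomes a "column=value" item
reddit_transactions <- as(transaction_data, "transactions")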
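
To illustrate what this conversion produces, the base R sketch below builds the same kind of "column=value" item labels that appear in the rule output (for example Sentiment=Negative). The toy rows and the helper row_to_items are hypothetical. The real coercion is done by arules, so this only mimics the labeling.

```r
# Hypothetical toy rows with three of the columns used for mining
toy <- data.frame(
  Sentiment     = c("Negative", "Positive"),
  time_category = c("Night", "Morning"),
  Keywords      = c("tired", "happy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn row i into a vector of "column=value" items
row_to_items <- function(df, i) {
  values <- vapply(df, function(col) as.character(col[i]), character(1))
  paste0(names(df), "=", values)
}

items <- lapply(seq_len(nrow(toy)), function(i) row_to_items(toy, i))
items[[1]]  # "Sentiment=Negative" "time_category=Night" "Keywords=tired"
```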

Most Frequent Items

Generate Rules

Association Rules for Negative Sentiment

This step generates and inspects rules to find patterns related to Negative Sentiment in posts. It helps discover important factors such as posting time, engagement, and keywords that are linked to negative emotions.

rules.Negative.Sentiment <- apriori(
  data = reddit_transactions, 
  parameter = list(supp = 0.002, conf = 0.1), 
  appearance = list(default = "lhs", rhs = "Sentiment=Negative"), 
  control = list(verbose = FALSE)
)
rules.Negative.Sentiment.byconf <- sort(rules.Negative.Sentiment, by = "confidence", decreasing = TRUE)
inspect(head(rules.Negative.Sentiment.byconf ))
##     lhs                         rhs                      support confidence    coverage     lift count
## [1] {time_category=Evening,                                                                           
##      High_Engagement=High}   => {Sentiment=Negative} 0.004840662  1.0000000 0.004840662 2.879210    12
## [2] {Keywords=days}          => {Sentiment=Negative} 0.002016942  0.8333333 0.002420331 2.399342     5
## [3] {Keywords=everyone}      => {Sentiment=Negative} 0.002016942  0.8333333 0.002420331 2.399342     5
## [4] {High_Engagement=Low,                                                                             
##      Keywords=days}          => {Sentiment=Negative} 0.002016942  0.8333333 0.002420331 2.399342     5
## [5] {High_Engagement=Low,                                                                             
##      Keywords=everyone}      => {Sentiment=Negative} 0.002016942  0.8333333 0.002420331 2.399342     5
## [6] {Keywords=getting}       => {Sentiment=Negative} 0.002420331  0.7500000 0.003227108 2.159408     6

The top rule shows that transactions from evening posts with high engagement always carry negative sentiment in this sample: the confidence is 100%, and the lift of 2.88 means negative sentiment is about 2.9 times as frequent under these conditions as in the dataset overall. The rule rests on only 12 transactions (support of about 0.5%), so it is a pattern worth monitoring rather than a firm conclusion.

Other rules show similar patterns. In the output above, the keywords “days” and “everyone” each have a confidence of 83%, and “getting” has 75%. Among the rules beyond the six shown, posts with the keyword “anxiety” have about a 70% chance of being negative, and posts with “help” also show a strong link to negative sentiment.
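
The figures in the rule table can be cross-checked with a little arithmetic. The sketch below uses only numbers printed in the output above (2,479 transactions, rule [1] and rule [2]). The variable names are chosen for this example.

```r
n_transactions <- 2479

# Rule [1]: support is the rule count divided by the number of transactions
support_rule1 <- 12 / n_transactions          # about 0.004840662

# lift = confidence / support(rhs), so the baseline share of
# Sentiment=Negative can be recovered from rule [1]
baseline_negative <- 1.0 / 2.879210           # about 0.347

# Rule [2]: confidence 5/6, so its lift should be (5/6) / baseline
lift_rule2 <- (5 / 6) / baseline_negative     # about 2.399342
```

The recovered baseline means that roughly 35% of all transactions are labeled Negative, which is the reference point for every lift value in the table.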

Rule Visualization

The graph shows how different items, like posting time, engagement level, and keywords, are connected to Negative Sentiment.

Network Graph

library(arulesViz)  # plot() methods for rules
plot(rules.Negative.Sentiment.byconf, method = "graph", engine = "htmlwidget")

Scatter Plot

The scatter plot shows the distribution of support and confidence across the rules, with shading indicating lift.

plot(rules.Negative.Sentiment.byconf, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Grouped Matrix Plot

This plot groups similar rules together, making it easier to compare them based on support and lift.

plot(rules.Negative.Sentiment.byconf, method = "grouped")

Parallel Coordinates Plot

plot(rules.Negative.Sentiment.byconf, method = "paracoord", control = list(reorder = TRUE))

The plot shows that posts made at night or with words like “depression” or “alone” are more likely to have negative sentiment. Keywords and posting time are important for finding posts with negative emotions.

Eclat Algorithm

frequent_itemsets.LowEngagement <- eclat(
  data = reddit_transactions, 
  parameter = list(supp = 0.002, minlen = 2),  # Minimum support and itemset length
  control = list(verbose = FALSE)
)

frequent_itemsets.LowEngagement.sorted <- sort(frequent_itemsets.LowEngagement, by = "support", decreasing = TRUE)
inspect(head(frequent_itemsets.LowEngagement.sorted))
##     items                                          support   count
## [1] {Sentiment=Positive, High_Engagement=Low}      0.4021783 997  
## [2] {time_category=Evening, High_Engagement=Low}   0.3590157 890  
## [3] {Sentiment=Negative, High_Engagement=Low}      0.3364260 834  
## [4] {time_category=Night, High_Engagement=Low}     0.2855990 708  
## [5] {Sentiment=Neutral, High_Engagement=Low}       0.2323518 576  
## [6] {time_category=Afternoon, High_Engagement=Low} 0.1903994 472
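
Eclat mines frequent itemsets from a vertical layout: each item is stored with the list of transaction ids that contain it, and the support of an itemset is obtained by intersecting those lists. Note that the call above is not restricted to low-engagement posts. The top itemsets all contain High_Engagement=Low simply because, judging by rows [1], [3], and [5], 2,407 of the 2,479 transactions (about 97%) are low engagement. The base R sketch below illustrates the intersection idea on hypothetical toy transactions and is not the arules implementation.

```r
# Hypothetical toy transactions in the same "column=value" item format
transactions <- list(
  c("Sentiment=Negative", "High_Engagement=Low"),
  c("Sentiment=Positive", "High_Engagement=Low"),
  c("Sentiment=Negative", "High_Engagement=High"),
  c("Sentiment=Negative", "High_Engagement=Low")
)

# Vertical layout: for each item, the ids of the transactions containing it
items <- unique(unlist(transactions))
tid_lists <- lapply(setNames(items, items), function(item) {
  which(vapply(transactions, function(t) item %in% t, logical(1)))
})

# Support of an itemset = size of the intersection of its tid-lists / N
pair_tids <- intersect(tid_lists[["Sentiment=Negative"]],
                       tid_lists[["High_Engagement=Low"]])
pair_support <- length(pair_tids) / length(transactions)  # 2 / 4 = 0.5
```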

Output Personalized Recommendations

high_risk_keywords <- c("hopeless", "tired", "depressed", "suicide", "help")
reddit_data$high_risk <- grepl(paste(high_risk_keywords, collapse = "|"), reddit_data$Keywords, ignore.case = TRUE)
high_risk_posts <- reddit_data %>% filter(high_risk)

The code helps find posts that show signs of serious emotional distress or crises. It identifies posts with important keywords, making it easier for mental health experts or support systems to review and focus on urgent cases.
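
At this stage reddit_data holds one row per title keyword, so the filter above returns matching keyword rows rather than distinct posts, and it looks only at title keywords, not at the post body. The sketch below shows one possible way to collapse flagged rows to unique posts. The toy rows are hypothetical, and it swaps the substring match for an exact token match with %in%, which avoids flagging words such as "helpful" for the entry "help".

```r
# Hypothetical token-level rows: one row per title keyword
token_rows <- data.frame(
  title    = c("so tired of everything", "so tired of everything",
               "need help tired", "need help tired", "weekend plans"),
  author   = c("user_a", "user_a", "user_b", "user_b", "user_c"),
  Keywords = c("tired", "everything", "help", "tired", "weekend"),
  stringsAsFactors = FALSE
)

high_risk_keywords <- c("hopeless", "tired", "depressed", "suicide", "help")

# Exact token match (a variation on the grepl() substring match used above)
token_rows$high_risk <- tolower(token_rows$Keywords) %in% high_risk_keywords

flagged_rows  <- token_rows[token_rows$high_risk, ]
flagged_posts <- unique(flagged_rows[, c("title", "author")])

nrow(flagged_rows)   # 3 keyword rows
nrow(flagged_posts)  # 2 distinct posts
```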

Application and Impact

This project shows how data science can help improve mental health care by analyzing patterns and language in social media posts. It offers several important benefits:

  1. Early Detection:
    The project identifies posts and users who may be at risk by detecting negative emotions or high-risk keywords. This helps in early intervention to reduce emotional distress.

  2. Personalized Support:
    By flagging posts that need attention, the analysis provides tailored recommendations. This allows mental health professionals and support systems to offer care that fits individual needs.

  3. Scalable Impact:
    The project can analyze large amounts of data, making it useful for monitoring trends in big populations. This scalability helps organizations track and address mental health challenges on a wider scale.

These benefits make the project valuable in improving mental health monitoring and care on both personal and societal levels.

Keyword Frequency Analysis

Conclusion

This project aims to improve mental health care by detecting people at risk early and giving personalized support. It analyzes social media data to find patterns in emotions, behavior, and high-risk keywords. By identifying people in emotional distress, it allows timely help, which can reduce the risk of serious problems like worsening depression or suicide.