In today’s world, social media has become an important part of people’s lives. Many users share their thoughts, feelings, and daily experiences through posts and comments. Often, even close family members, like parents, may not know that someone is struggling with their mental health. However, social media platforms, with the large amount of data they collect, can provide important clues about these issues.
This opens up the possibility of using social media to help detect early signs of mental health problems such as stress, anxiety, and depression. By analyzing digital footprints (words, emojis, and behavior patterns), data science can help flag potential mental health risks. The resulting insights can support early detection of at-risk individuals, personalized support through targeted recommendations, and solutions that scale to large groups. With trend visualizations and actionable data, the approach supports better, data-driven decisions for mental health care.
This project uses Association Rule Learning (ARL) to extract insights from social media data to improve mental health care through early detection, personalized support, and scalable solutions. It focuses on:
Identifying At-Risk Individuals: Detect early signs of mental health issues by analyzing user behavior and communication patterns.
Understanding Behavioral Patterns: Uncover hidden connections between factors like emotions, keywords, emojis, and activity trends.
The dataset (reddit_data) gives a detailed view of discussions in the r/MentalHealth subreddit.
reddit_data <- read.csv("reddit_posts.csv", stringsAsFactors = FALSE)
summary(reddit_data)
## title text author time_posted
## Length:502 Length:502 Length:502 Min. :1.721e+09
## Class :character Class :character Class :character 1st Qu.:1.737e+09
## Mode :character Mode :character Mode :character Median :1.737e+09
## Mean :1.737e+09
## 3rd Qu.:1.737e+09
## Max. :1.737e+09
## upvotes comments_count emojis
## Min. : 0.00 Min. : 0.00 Length:502
## 1st Qu.: 1.00 1st Qu.: 0.00 Class :character
## Median : 1.00 Median : 1.00 Mode :character
## Mean : 3.99 Mean : 4.55
## 3rd Qu.: 2.00 3rd Qu.: 4.00
## Max. :199.00 Max. :231.00
The dataset (reddit_data) has 502 rows and 7 columns, providing insights into conversations in the r/MentalHealth subreddit. Below is the structure of the data:
str(reddit_data)
## 'data.frame': 502 obs. of 7 variables:
## $ title : chr "Elections and Politics" "r/MentalHealth is looking for moderators" "Could weed bring out mental illness" "Sent nudes to someone because I was desperate for cash. And they scammed me. " ...
## $ text : chr "Hello friends!\n\nIt's that time of the year again. We have always intended for r/mentalhealth to be a safe, po"| __truncated__ "Hey r/mentalhealth! We're looking to grow our moderation team. Moderators are a key part of what makes any redd"| __truncated__ "21M. I’ve never been a a constant weed user. But for the last 3 months I’ve been smoking around 3 times per wee"| __truncated__ "I’m 19f and I’ve been desperate for cash for me and my dog since my mum died. I have autism and sometimes strug"| __truncated__ ...
## $ author : chr "Pi25" "DrivesInCircles" "Individual-Young-591" "forgottenpopcork" ...
## $ time_posted : int 1730031213 1720873558 1737141167 1737149651 1737116571 1737136497 1737137095 1737149008 1737152194 1737151049 ...
## $ upvotes : int 7 21 41 21 59 14 10 4 3 3 ...
## $ comments_count: int 4 27 97 18 39 10 11 8 10 1 ...
## $ emojis : chr NA NA NA NA ...
This dataset is helpful for studying mental health discussions on Reddit. It can reveal patterns related to user engagement, post content, and community responses to mental health topics.
The dataset contains information related to mental health posts and includes the following attributes:
Title: Title of the post.
Text: Content or body of the post.
Author: Username of the post’s author.
Time Posted: Timestamp indicating when the post was created.
Upvotes: Number of upvotes received.
Comments Count: Number of comments on the post.
Emojis: Emojis used in the post, if any.
Size: 502 rows × 7 columns
This section explains how I approached the analysis step by step. The process involved preparing the data, selecting the important factors, applying association rule mining, and validating the results.
Modify and create new features from existing data for analysis:
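# Packages this chunk relies on (assumed: dplyr for %>% and mutate(), lubridate for hour())
library(dplyr)
library(lubridate)
reddit_data <- reddit_data %>%
  mutate(
    # Convert Unix epoch seconds to a date-time; all derived fields are in UTC
    time_posted = as.POSIXct(time_posted, origin = "1970-01-01", tz = "UTC"),
    date = as.Date(time_posted),             # Extract date
    time = format(time_posted, "%H:%M:%S"),  # Extract time
    hour = hour(time_posted),                # Extract hour (0-23)
    day_of_week = weekdays(time_posted)      # Extract day of the week
  )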
reddit_data$time_category <- case_when(
reddit_data$hour >= 0 & reddit_data$hour < 6 ~ "Night",
reddit_data$hour >= 6 & reddit_data$hour < 12 ~ "Morning",
reddit_data$hour >= 12 & reddit_data$hour < 18 ~ "Afternoon",
TRUE ~ "Evening"
)
reddit_data$High_Engagement <- ifelse(reddit_data$upvotes > 30, "High", "Low")
reddit_data <- reddit_data %>%
select(-time_posted, -emojis)
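As a quick sanity check of the feature engineering above, the sketch below converts one timestamp taken from the str() output (1737141167) using base R only, then bins its hour with cut(). The helper name bin_hour is hypothetical; it is meant to reproduce the same thresholds as the case_when() call, not to replace it.

```r
# One Unix timestamp taken from the str() output above
ts <- as.POSIXct(1737141167, origin = "1970-01-01", tz = "UTC")

# Extract the hour in UTC using base R only
hr <- as.integer(format(ts, "%H", tz = "UTC"))

# Hypothetical helper: same bins as the case_when() call, written with cut()
bin_hour <- function(h) {
  as.character(cut(h,
    breaks = c(0, 6, 12, 18, 24),
    labels = c("Night", "Morning", "Afternoon", "Evening"),
    right = FALSE  # intervals are [0,6), [6,12), [12,18), [18,24)
  ))
}

format(ts, "%Y-%m-%d %H:%M:%S", tz = "UTC")  # "2025-01-17 19:12:47"
bin_hour(hr)                                  # "Evening"
bin_hour(c(0, 5, 6, 11, 12, 17, 18, 23))      # two hours per bin, in order
```

Because the conversion uses tz = "UTC", "Evening" here means evening in UTC, which may differ from the poster's local time.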
Clean and structure text data to extract meaningful features for further analysis.
# Positive words
positive_words <- c(
"happy", "love", "great", "joy", "peace", "amazing", "awesome", "fantastic",
"wonderful", "excited", "positive", "grateful", "satisfied", "blissful", "content",
"hopeful", "cheerful", "optimistic", "smile", "encouraging", "uplifting",
"brilliant", "sunshine", "thrilled", "energetic", "motivated", "successful", "winning",
"achievement", "relaxed", "calm", "fun", "delightful"
)
# Negative words
negative_words <- c(
"sad", "anxiety", "depression", "stress", "lonely", "angry", "frustrated",
"hopeless", "tired", "upset", "fear", "worried", "pessimistic", "crying",
"heartbroken", "loss", "pain", "grief", "guilt", "regret", "failure", "broken",
"miserable", "hurt", "devastated", "unhappy", "disappointed", "overwhelmed",
"trapped", "nervous", "exhausted", "worthless", "empty", "helpless"
)
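# Label each post by keyword matching on its text. The positive list is checked first,
# so a post containing words from both lists is labeled "Positive".
# Matching is by substring, not whole word.
reddit_data$Sentiment <- ifelse(
  grepl(paste(positive_words, collapse = "|"), reddit_data$text, ignore.case = TRUE), "Positive",
  ifelse(grepl(paste(negative_words, collapse = "|"), reddit_data$text, ignore.case = TRUE), "Negative", "Neutral")
)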
reddit_data$Keywords <- gsub("[^a-zA-Z ]", "", reddit_data$title)
reddit_data$Keywords <- tolower(reddit_data$Keywords)
reddit_data$Keywords <- gsub("\\b(?:the|and|for|is|that|to|of|on|in|a)\\b", "", reddit_data$Keywords)
tokenized_data <- reddit_data %>%
separate_rows(Keywords, sep = "\\s+") # Split by spaces
tokenized_data_cleaned <- tokenized_data %>%
filter(!Keywords %in% stopwords("en"))
head(tokenized_data_cleaned$Keywords)
## [1] "elections" "politics" "rmentalhealth" "looking"
## [5] "moderators" "weed"
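The sentiment rule above (nested ifelse() around grepl()) matches substrings, so a word such as "function" can trigger the positive term "fun". The sketch below compares substring matching with whole-word matching on three made-up sentences. The helper label_sentiment, its whole_word argument, and the shortened word lists are hypothetical and only illustrate the difference; this is not the method used to produce the results in this report.

```r
# Shortened, illustrative word lists (the full lists are defined above)
pos <- c("happy", "content", "fun")
neg <- c("sad", "loss", "tired")

# Hypothetical helper: whole_word = TRUE wraps the alternation in \b...\b
label_sentiment <- function(text, whole_word = FALSE) {
  make_pattern <- function(words) {
    p <- paste(words, collapse = "|")
    if (whole_word) paste0("\\b(", p, ")\\b") else p
  }
  is_pos <- grepl(make_pattern(pos), text, ignore.case = TRUE, perl = TRUE)
  is_neg <- grepl(make_pattern(neg), text, ignore.case = TRUE, perl = TRUE)
  # Same precedence as the original rule: positive is checked first
  ifelse(is_pos, "Positive", ifelse(is_neg, "Negative", "Neutral"))
}

posts <- c(
  "I cannot function anymore",      # "function" contains "fun"
  "I am sad but happy to be here",  # both lists match; positive wins
  "The glossary is long"            # "glossary" contains "loss"
)

label_sentiment(posts)                     # "Positive" "Positive" "Negative"
label_sentiment(posts, whole_word = TRUE)  # "Neutral"  "Positive" "Neutral"
```

The second sentence also shows the effect of checking the positive list first: it stays "Positive" under both settings even though it contains "sad".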
Association rule learning is a rule-based method used in data mining to identify relationships or patterns between variables in a dataset. It is typically used to discover interesting correlations, frequent itemsets, or associations in large sets of transactions. Association rules are represented in the form:
\[ X \rightarrow Y \]
Where \(X\) (antecedent) and \(Y\) (consequent) are itemsets, meaning that the presence of \(X\) suggests the presence of \(Y\).
Illustrative (hypothetical) examples of the kind of rule this approach aims to find:
Users posting negative sentiment during nighttime are 80% likely to express depressive thoughts within a week.
High use of specific keywords (e.g., ‘hopeless’, ‘tired’) correlates with a 70% increase in stress-related behavior.
Three main measures are used when mining association rules:
Support:
\[ support(X) = \frac{count(X)}{N} \]
It shows how frequently an itemset or rule occurs in the dataset, where \(count(X)\) is the number of transactions containing \(X\) and \(N\) is the total number of transactions.
Confidence:
\[ confidence(X \rightarrow Y) = \frac{support(X, Y)}{support(X)} \]
It shows the share of transactions containing \(X\) that also contain \(Y\).
Lift:
\[ lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{support(Y)} \]
It shows how much more likely \(Y\) is to appear when \(X\) is known to be present, compared with the probability of \(Y\) appearing without any knowledge of \(X\).
If lift is greater than 1, there is a positive association between the two items or itemsets.
If it is close to 1, the items or itemsets are approximately independent.
A value lower than 1 indicates a negative association.
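To make the three measures concrete, here is a minimal base R sketch that computes them by hand for the rule Night → Negative. The five transactions are invented for illustration and are not drawn from the Reddit data; the helper name support is likewise just a convenience for this example.

```r
# Toy transactions (hypothetical, not drawn from the Reddit data)
transactions <- list(
  c("Night", "Negative"),
  c("Night", "Negative"),
  c("Night", "Positive"),
  c("Morning", "Positive"),
  c("Morning", "Negative")
)

# support(items): share of transactions containing every item in `items`
support <- function(items) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

X <- "Night"
Y <- "Negative"

supp_xy <- support(c(X, Y))      # 2 of 5 transactions contain both: 0.4
conf_xy <- supp_xy / support(X)  # 0.4 / 0.6, about 0.667
lift_xy <- conf_xy / support(Y)  # 0.667 / 0.6, about 1.111

round(c(support = supp_xy, confidence = conf_xy, lift = lift_xy), 3)
```

A lift of about 1.11 would indicate only a weak positive association in this toy data: "Negative" is slightly more common among "Night" transactions than overall.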
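Apriori is a classic algorithm in association rule learning that is used to mine frequent itemsets and generate association rules. It is based on the principle that all subsets of a frequent itemset must also be frequent. The algorithm operates in two steps:
Frequent itemset generation: Find all itemsets whose support meets the minimum threshold, growing candidates one item at a time and discarding any candidate that has an infrequent subset.
Rule generation: From each frequent itemset, derive the rules whose confidence meets the minimum threshold.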
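Apriori's two phases, finding frequent itemsets and then deriving rules from them, can be sketched in base R on a handful of made-up transactions. This is a simplified illustration that stops at pairs of items; a real implementation such as apriori() in the arules package continues to larger itemsets and is far more efficient. The thresholds min_supp and min_conf and all data below are hypothetical.

```r
# Toy transactions (hypothetical)
transactions <- list(
  c("Night", "Negative", "tired"),
  c("Night", "Negative"),
  c("Night", "Positive", "help"),
  c("Morning", "Positive"),
  c("Morning", "Negative", "tired")
)
min_supp <- 0.4
min_conf <- 0.6

support <- function(items) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

# Step 1a: frequent 1-itemsets ("help" appears once, so it is dropped)
items <- sort(unique(unlist(transactions)))
freq1 <- items[vapply(items, support, numeric(1)) >= min_supp]

# Step 1b: candidate pairs are built only from frequent items, because
# any itemset with an infrequent subset cannot itself be frequent
cand2 <- combn(freq1, 2, simplify = FALSE)
freq2 <- Filter(function(s) support(s) >= min_supp, cand2)

# Step 2: turn each frequent pair into two rules and keep the confident ones
rules <- do.call(rbind, lapply(freq2, function(s) {
  data.frame(
    lhs = s,
    rhs = rev(s),
    support = support(s),
    confidence = unname(support(s) / vapply(s, support, numeric(1))),
    row.names = NULL
  )
}))
rules <- rules[rules$confidence >= min_conf, ]
rules
```

In this toy data the strongest rule is tired → Negative with confidence 1, since both transactions containing "tired" also contain "Negative".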
Using Apriori, the patterns mentioned above can be discovered:
reddit_data <- tokenized_data_cleaned
Summary of the final data:
summary(reddit_data)
## title text author upvotes
## Length:2479 Length:2479 Length:2479 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 1.00
## Mean : 4.57
## 3rd Qu.: 2.00
## Max. :199.00
## comments_count date time hour
## Min. : 0.000 Min. :2024-07-13 Length:2479 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:2025-01-16 Class :character 1st Qu.: 4.00
## Median : 1.000 Median :2025-01-16 Mode :character Median :14.00
## Mean : 5.268 Mean :2025-01-15 Mean :12.44
## 3rd Qu.: 4.000 3rd Qu.:2025-01-17 3rd Qu.:20.00
## Max. :231.000 Max. :2025-01-17 Max. :23.00
## day_of_week time_category High_Engagement Sentiment
## Length:2479 Length:2479 Length:2479 Length:2479
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Keywords
## Length:2479
## Class :character
## Mode :character
##
##
##
This code prepares the dataset (reddit_data) for association rule mining by choosing important columns, turning them into categories, and organizing the data in a format needed for algorithms.
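library(arules)  # assumed: provides the "transactions" class, apriori(), and eclat()
transaction_data <- reddit_data %>%
  select(Sentiment, time_category, High_Engagement, Keywords) %>%
  mutate(across(everything(), as.factor))  # each column=level pair becomes one item
# Each row (one keyword of one post) becomes one transaction
reddit_transactions <- as(transaction_data, "transactions")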
Most Frequent Items
Generating and checking rules to find patterns related to Negative Sentiment in posts. This helps discover important factors like posting time, engagement, and keywords linked to negative emotions.
rules.Negative.Sentiment <- apriori(
data = reddit_transactions,
parameter = list(supp = 0.002, conf = 0.1),
appearance = list(default = "lhs", rhs = "Sentiment=Negative"),
control = list(verbose = FALSE)
)
rules.Negative.Sentiment.byconf <- sort(rules.Negative.Sentiment, by = "confidence", decreasing = TRUE)
inspect(head(rules.Negative.Sentiment.byconf))
## lhs rhs support confidence coverage lift count
## [1] {time_category=Evening,
## High_Engagement=High} => {Sentiment=Negative} 0.004840662 1.0000000 0.004840662 2.879210 12
## [2] {Keywords=days} => {Sentiment=Negative} 0.002016942 0.8333333 0.002420331 2.399342 5
## [3] {Keywords=everyone} => {Sentiment=Negative} 0.002016942 0.8333333 0.002420331 2.399342 5
## [4] {High_Engagement=Low,
## Keywords=days} => {Sentiment=Negative} 0.002016942 0.8333333 0.002420331 2.399342 5
## [5] {High_Engagement=Low,
## Keywords=everyone} => {Sentiment=Negative} 0.002016942 0.8333333 0.002420331 2.399342 5
## [6] {Keywords=getting} => {Sentiment=Negative} 0.002420331 0.7500000 0.003227108 2.159408 6
The rules show that evening posts with high engagement are strongly associated with negative sentiment. The confidence of 100% means every row matching these conditions is labeled negative, and the lift of 2.88 means negative sentiment is almost three times as common under these conditions as in the data overall. However, the rule rests on only 12 rows of the tokenized data (support of about 0.5%), so it is a pattern worth monitoring rather than a general finding about evening posts.
Other rules show similar patterns. In the output above, the keywords “days” and “everyone” reach 83% confidence and “getting” 75%, each based on only 5 or 6 rows. Posts with the keyword “anxiety” have a 70% chance of being negative, and posts with “help” also show a strong link to negative sentiment.
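The figures in the inspect() output can be cross-checked against the formulas for support, confidence, and lift. The sketch below recomputes them from the counts reported for the first rule and backs out the overall share of rows labeled Sentiment=Negative, which the output does not print directly. The derived values (rhs_support, rhs_rows) are inferences from the reported lift, not numbers taken from the original analysis.

```r
# Values read from the first rule in the inspect() output above
n_rows     <- 2479          # rows in the tokenized data
rule_count <- 12            # rows matching both lhs and rhs
coverage   <- 0.004840662   # support of the lhs alone
lift       <- 2.879210

support    <- rule_count / n_rows  # matches the reported 0.004840662
confidence <- support / coverage   # about 1: every lhs row is negative

# lift = confidence / support(rhs), so support(rhs) can be backed out
rhs_support <- confidence / lift   # inferred share of rows with Sentiment=Negative
rhs_rows    <- round(rhs_support * n_rows)

# Second rule: {Keywords=days} has count 5 and lhs coverage of 6 rows,
# so its confidence is 5/6 and its lift should be about 2.399
lift_rule2 <- (5 / 6) / rhs_support

c(support = support, confidence = confidence,
  rhs_support = rhs_support, lift_rule2 = lift_rule2)
```

Under these inputs roughly 35% of rows carry the negative label, which is why a confidence of 100% translates into a lift just under 3.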
The graph shows how different items, like posting time, engagement level, and keywords, are connected to Negative Sentiment.
Network Graph
plot(rules.Negative.Sentiment.byconf, method = "graph", engine = "htmlwidget")
Scatter Plot
This plot shows the distribution of support and confidence across the rules, with shading indicating lift.
plot(rules.Negative.Sentiment.byconf, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Grouped Matrix Plot
This plot groups similar rules together, making it easier to compare them based on support and lift.
plot(rules.Negative.Sentiment.byconf, method = "grouped")
Parallel Coordinates Plot
plot(rules.Negative.Sentiment.byconf, method = "paracoord", control = list(reorder = TRUE))
The plot shows that posts made at night or with words like “depression” or “alone” are more likely to have negative sentiment. Keywords and posting time are important for finding posts with negative emotions.
frequent_itemsets.LowEngagement <- eclat(
data = reddit_transactions,
parameter = list(supp = 0.002, minlen = 2), # Minimum support and itemset length
control = list(verbose = FALSE)
)
frequent_itemsets.LowEngagement.sorted <- sort(frequent_itemsets.LowEngagement, by = "support", decreasing = TRUE)
inspect(head(frequent_itemsets.LowEngagement.sorted))
## items support count
## [1] {Sentiment=Positive, High_Engagement=Low} 0.4021783 997
## [2] {time_category=Evening, High_Engagement=Low} 0.3590157 890
## [3] {Sentiment=Negative, High_Engagement=Low} 0.3364260 834
## [4] {time_category=Night, High_Engagement=Low} 0.2855990 708
## [5] {Sentiment=Neutral, High_Engagement=Low} 0.2323518 576
## [6] {time_category=Afternoon, High_Engagement=Low} 0.1903994 472
sentiment_trends <- reddit_data %>%
group_by(hour, Sentiment) %>%
summarise(count = n(), .groups = 'drop')
ggplot(sentiment_trends, aes(x = hour, y = count, color = Sentiment)) +
  geom_line(linewidth = 1) +  # `size` for lines is deprecated since ggplot2 3.4.0
  labs(title = "Sentiment Trends by Hour", x = "Hour of the Day", y = "Count") +
  scale_x_continuous(breaks = seq(0, 23, by = 1))
Observation
The output reveals distinct sentiment patterns over the course of a day:
Evening spike in negative sentiment:
Negative sentiment is at its highest between 9 PM and 11 PM. This may be
because people reflect on their day or feel stressed and lonely at
night.
Positive sentiment patterns:
Positive sentiment peaks earlier in the evening, between 6 PM and 9 PM,
when people share uplifting or happy experiences. After 10 PM, positive
sentiment drops as negative sentiment increases.
Neutral sentiment:
Neutral posts remain steady throughout the day but show a small increase
in the evening, possibly indicating posts that are less emotional in
nature.
Morning trends:
Activity between 12 AM and 6 AM is not negligible: in the frequent itemsets above, the Night category combined with low engagement alone covers about 29% of rows. Positive sentiment shows a small rise in the morning, reflecting an optimistic or fresh start to the day.
Late-night drop in activity:
After midnight (12 AM–1 AM), activity across all sentiment types
decreases sharply, as fewer people are active during this time.
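These patterns highlight how people’s emotional expression shifts at different times of the day. Note that all hours are in UTC, the time zone set when time_posted was converted, so they may not match posters’ local time.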
high_risk_keywords <- c("hopeless", "tired", "depressed", "suicide", "help")
reddit_data$high_risk <- grepl(paste(high_risk_keywords, collapse = "|"), reddit_data$Keywords, ignore.case = TRUE)
high_risk_posts <- reddit_data %>% filter(high_risk)
This code flags rows whose title keywords contain terms associated with serious emotional distress or crisis, making it easier for mental health experts or support systems to review and prioritize potentially urgent posts.
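Because reddit_data holds one row per title keyword at this point, a post whose title contains several flagged keywords appears several times in high_risk_posts. A minimal base R sketch of collapsing the flagged rows back to one row per post follows. The toy data frame, the user names, and the choice of title plus author as the post identifier are assumptions made for illustration.

```r
# Toy tokenized data (hypothetical): one row per title keyword
tokens <- data.frame(
  title    = c(rep("tired hopeless today", 3), rep("new job update", 3)),
  author   = c(rep("user_a", 3), rep("user_b", 3)),
  Keywords = c("tired", "hopeless", "today", "new", "job", "update"),
  stringsAsFactors = FALSE
)
high_risk_keywords <- c("hopeless", "tired", "depressed", "suicide", "help")

# Same flagging rule as above, applied to the toy data
tokens$high_risk <- grepl(paste(high_risk_keywords, collapse = "|"),
                          tokens$Keywords, ignore.case = TRUE)

flagged_rows <- tokens[tokens$high_risk, ]  # one row per matching keyword

# Collapse to one row per post; title + author is an assumed post identifier
flagged_posts <- unique(flagged_rows[, c("title", "author")])

nrow(flagged_rows)   # 2 keyword rows
nrow(flagged_posts)  # 1 distinct post
```

Counting distinct posts rather than keyword rows keeps a single post with many matching words from looking like several urgent cases.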
This project shows how data science can help improve mental health care by analyzing patterns and language in social media posts. It offers several important benefits:
Early Detection:
The project identifies posts and users who may be at risk by detecting
negative emotions or high-risk keywords. This helps in early
intervention to reduce emotional distress.
Personalized Support:
By flagging posts that need attention, the analysis can inform tailored recommendations. This allows mental health professionals and support systems to offer care that fits individual needs.
Scalable Impact:
The project can analyze large amounts of data, making it useful for
monitoring trends in big populations. This scalability helps
organizations track and address mental health challenges on a wider
scale.
These benefits make the project valuable in improving mental health monitoring and care on both personal and societal levels.
Keyword Frequency Analysis
This project aims to improve mental health care by detecting people at risk early and giving personalized support. It analyzes social media data to find patterns in emotions, behavior, and high-risk keywords. By identifying people in emotional distress, it allows timely help, which can reduce the risk of serious problems like worsening depression or suicide.