Analyzing Sentiment and Linguistic Patterns in Tweets From Depressed and Non-Depressed Users

Executive summary

This study investigates how individuals with depressive symptoms express themselves on social media, using Twitter data to uncover linguistic patterns that may reflect their psychological state. As mental health concerns grow globally, particularly among youth and social media users, understanding how depression manifests in everyday digital communication has become an urgent and valuable pursuit. Language is not only a tool for self-expression but also a window into emotional well-being.

1. Main Question

How do the language patterns of users with depressive symptoms differ from those of the general population, and what do these patterns reveal about their emotional states, thought processes, and coping behaviors?

2. Story Overview

The project began with the goal of identifying distinctive language features in tweets posted by individuals who exhibit signs of depression. To uncover meaningful differences in language use between the Control and Depressed groups, this study employed three methods: NRC sentiment analysis to assess emotional tone, the log odds ratio to identify words distinctive of each group, and trigram network visualization to explore the contextual and semantic patterns of word co-occurrences.

According to the NRC sentiment analysis, the Depressed group used more emotion-related words than the Control group across most sentiment categories, with particularly high usage of negative emotions like sadness and fear. Interestingly, the Depressed group also used more words associated with “joy” and “positive” sentiment, which may reflect a longing for recovery or self-reassurance rather than genuine positive emotions.

The log odds ratio analysis revealed that the Depressed group frequently used emotionally charged words such as depression, treatments, sos, and overcome, reflecting psychological distress and a desire for healing. In contrast, the Control group predominantly used neutral and everyday terms. This suggests that individuals with depression tend to use language more focused on psychological pain, self-reflection, and help-seeking.

The trigram network analysis showed that the Control group’s language structure was horizontal and socially driven, covering diverse topics like politics, music, and popular culture. On the other hand, the Depressed group formed a dense, emotionally anchored network centered around key terms like depression, treatments, and day, reflecting themes of emotional expression, therapeutic effort, and altered perceptions of time. Notably, many trigrams took on metaphorical or poetic forms, suggesting a tendency to express psychological pain indirectly.

In conclusion, the Depressed group tends to use emotionally sensitive language, centers its discourse on psychological pain and healing, and expresses emotion through metaphorical and introspective means.

3. Conclusion & Public Health Implication

The ability to detect subtle linguistic cues associated with depression through social media analysis holds significant promise for public health applications. With proper ethical oversight, such models could inform early detection systems, digital mental health screenings, or targeted outreach efforts. Ultimately, this research reinforces the idea that language is not only a form of communication, but a potential diagnostic tool—offering a new avenue for identifying and supporting individuals at risk in our increasingly digital world.

Data background

The dataset used in this study was collected through the Twitter API as part of a research initiative exploring mental health through social media. It was originally compiled by researchers affiliated with the University of Maryland and other collaborators for the CLPsych 2015 Shared Task (Computational Linguistics and Clinical Psychology), whose goal was to investigate whether psychological states, particularly depression, can be detected from linguistic patterns in users’ tweets. The dataset is raw, uncleaned text, filtered to include only English-language tweets. Each row is a single tweet carrying a binary label that reflects the group of the user who wrote it:

1: indicating that the tweet was written by a user identified as experiencing depression

0: representing a tweet from a user in the non-depressed (control) group.

Alongside the tweet text (post_text) and its label, each row carries a post ID, a creation timestamp, a user ID, account-level counts (followers, friends, favourites, statuses), and a retweets column. This dataset enables fine-grained analysis of emotional expression, word usage, and mental health indicators in social media posts. It serves as a valuable resource for sentiment analysis, natural language processing (NLP), and mental health classification studies, offering insights into how language reflects psychological states.

Data loading, cleaning and preprocessing

The dataset is first loaded from a CSV file. Because the text is in raw tweet form, it must be cleaned before tokenization. A cleaning function, clean_tweets(), standardizes the content by converting all text to lowercase and removing URLs, non-ASCII characters (which includes emojis), punctuation, numbers, and extra whitespace. Only the relevant columns, ‘user_id’, ‘post_text’ and ‘label’, are then selected for analysis. The cleaning function is applied to the tweet text, and the cleaned text is tokenized into individual words with unnest_tokens(). Common English stop words are removed using a customized list that combines tidytext’s built-in stop_words dataset with additional tokens that carry no meaning here, such as “rt”, “amp”, and the contraction fragments left behind once apostrophes are stripped (“don”, “ll”, “ve”). Finally, the label column is recoded from numeric form (1 for depressed, 0 for control) into the descriptive categories “Depressed” and “Control”. The result is a tidy dataset in which each row is one cleaned, meaningful word from a tweet, ready for further analysis.

library(tidytext)
library(stringr)
library(textclean)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

tweets <- read.csv("Mental-Health-Twitter.csv")
head(tweets)

##   X      post_id                   post_created
## 1 0 6.378947e+17 Sun Aug 30 07:48:37 +0000 2015
## 2 1 6.378904e+17 Sun Aug 30 07:31:33 +0000 2015
## 3 2 6.377493e+17 Sat Aug 29 22:11:07 +0000 2015
## 4 3 6.376964e+17 Sat Aug 29 18:40:49 +0000 2015
## 5 4 6.376963e+17 Sat Aug 29 18:40:26 +0000 2015
## 6 5 6.376928e+17 Sat Aug 29 18:26:24 +0000 2015
##                                                                                                                                      post_text
## 1 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 2                                              It's Sunday, I need a break, so I'm planning to spend as little time as possible on the #A14...
## 3                                                                             Awake but tired. I need to sleep but my brain has other ideas...
## 4 RT @SewHQ: #Retro bears make perfect gifts and are great for beginners too! Get stitching with October's Sew on sale NOW! #yay http://t.co/…
## 5        It’s hard to say whether packing lists are making life easier or just reinforcing how much still needs doing... #movinghouse #anxiety
## 6                                                                                         Making packing lists is my new hobby... #movinghouse
##      user_id followers friends favourites statuses retweets label
## 1 1013187241        84     211        251      837        0     1
## 2 1013187241        84     211        251      837        1     1
## 3 1013187241        84     211        251      837        0     1
## 4 1013187241        84     211        251      837        2     1
## 5 1013187241        84     211        251      837        1     1
## 6 1013187241        84     211        251      837        1     1

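clean_tweets <- function(text) {
  text %>%
    str_to_lower() %>%                        # lowercase everything
    str_replace_all("http\\S+\\s*", "") %>%   # drop URLs
    str_replace_all("[^\x01-\x7F]", "") %>%   # drop non-ASCII characters, including emojis
    str_replace_all("[[:punct:]]", " ") %>%   # replace punctuation with a space
    str_replace_all("[0-9]+", "") %>%         # drop digits
    str_squish()                              # collapse repeated whitespace and trim
}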

data_selected <- tweets %>% 
  select(user_id, post_text, label)

custom_stop_words <- bind_rows(
  stop_words,
  tibble(
    word = c("rt", "don", "amp", "ll", "ve", "ii", "hey", "yong"),
    lexicon = "custom"
  )
)

tidy_tweets <- data_selected %>% 
  mutate(clean_text = clean_tweets(post_text)) %>%
  unnest_tokens(word, clean_text) %>%            
  anti_join(custom_stop_words, by = "word") %>% 
  mutate(label = ifelse(label == 1, "Depressed", "Control"))
head(tidy_tweets)

##      user_id
## 1 1013187241
## 2 1013187241
## 3 1013187241
## 4 1013187241
## 5 1013187241
## 6 1013187241
##                                                                                                                                      post_text
## 1 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 2 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 3 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 4 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 5 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
## 6 It's just over 2 years since I was diagnosed with #anxiety and #depression. Today I'm taking a moment to reflect on how far I've come since.
##       label       word
## 1 Depressed  diagnosed
## 2 Depressed    anxiety
## 3 Depressed depression
## 4 Depressed     taking
## 5 Depressed     moment
## 6 Depressed    reflect
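
To make the effect of each cleaning step concrete, the sketch below runs the same steps on a single made-up tweet. The tweet text is hypothetical and the function body is repeated only so that the example runs on its own.

```r
library(dplyr)
library(stringr)

# Same steps as clean_tweets() above, repeated so the example is self-contained.
clean_tweets <- function(text) {
  text %>%
    str_to_lower() %>%
    str_replace_all("http\\S+\\s*", "") %>%
    str_replace_all("[^\x01-\x7F]", "") %>%
    str_replace_all("[[:punct:]]", " ") %>%
    str_replace_all("[0-9]+", "") %>%
    str_squish()
}

# Hypothetical tweet, invented for illustration (not from the dataset).
sample_tweet <- "RT @someuser: Can't sleep... 3rd night in a row! http://t.co/abc #tired"

cleaned <- clean_tweets(sample_tweet)
cleaned
```

The result still contains "rt", the stray "t" from "Can't", and the "rd" left over from "3rd". Fragments of this kind are what the customized stop word list is meant to catch.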

Text data analysis

Figure 1 - NRC Sentiment Analysis Graph

To explore emotional differences in language use between depressed and non-depressed Twitter users, I used the NRC sentiment lexicon, which tags words with eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and two sentiments (positive and negative), giving ten categories in total.

1. Why NRC sentiment analysis?

Importantly, the NRC lexicon was selected over alternatives such as Bing and AFINN because of its multi-dimensional emotional framework. While Bing offers a binary positive/negative classification and AFINN provides a numerical sentiment score, NRC captures a richer and more nuanced emotional spectrum. This aligns more closely with the study’s objective—to analyze emotional complexity and variation in language patterns between depressed and non-depressed users, rather than simply identifying polarity or intensity. The NRC approach enables a deeper understanding of specific emotional themes, such as fear or trust, that are particularly relevant to mental health discourse.

2. Visualization

First, the tokenized tweet data was joined with the NRC lexicon using inner_join(get_sentiments("nrc")), which tags each word with its associated categories. Because a single word can belong to several NRC categories, the join is many-to-many, and such a word is counted once for every category it carries. The frequency of each category within each group (Depressed or Control) was then calculated with group_by() and summarise().

nrc_tweets <- tidy_tweets %>% 
  # one word can map to several NRC categories, so the join is many-to-many by design
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many")

top_nrc <- nrc_tweets %>% 
  group_by(label, sentiment) %>%
  summarise(count = n(), .groups = "drop") %>%   # drop the grouping explicitly
  arrange(label, desc(count))

top_nrc

## # A tibble: 20 × 3
##    label     sentiment    count
##    <chr>     <chr>        <int>
##  1 Control   positive      5035
##  2 Control   negative      3839
##  3 Control   trust         3050
##  4 Control   joy           2495
##  5 Control   anticipation  2487
##  6 Control   fear          2167
##  7 Control   anger         1950
##  8 Control   sadness       1720
##  9 Control   surprise      1517
## 10 Control   disgust       1444
## 11 Depressed positive      5642
## 12 Depressed negative      5636
## 13 Depressed sadness       3470
## 14 Depressed trust         3228
## 15 Depressed anticipation  3047
## 16 Depressed joy           2918
## 17 Depressed fear          2793
## 18 Depressed anger         2362
## 19 Depressed disgust       1661
## 20 Depressed surprise      1145

To facilitate comparison, the resulting counts were visualized as a faceted bar chart in which each facet represents one NRC category and the two bars within it show the Control and Depressed groups side by side. This makes it easy to see which categories are more prevalent in each group. Note that the facets use free y-axes (scales = "free_y"), so bar heights are comparable within a facet but not across facets.

library(ggplot2)

ggplot(top_nrc, aes(x = label, y = count, fill = label)) +
  geom_col(show.legend = T, width = 0.6) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(
    title = "Top 10 NRC Emotions by Group",
    x = "Group",
    y = "Word Frequency"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(size = 11, face = "bold"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

3. Graph analysis

Analysis of the NRC sentiment graph revealed that, in raw counts, the Depressed group used more emotion words than the Control group in 9 of the 10 categories; surprise was the only category in which the Control group led (1,517 vs. 1,145). Particularly striking were the large differences in sadness (3,470 vs. 1,720), fear, and negative sentiment overall. This pattern aligns with clinical understandings of depression, where individuals tend to express more distress-related emotions in daily language use.

Notably, the Depressed group also used more words associated with ‘joy’ and ‘positive’ sentiment than the Control group in raw counts. This suggests that positive words used by depressed users may not reflect actual positive emotional experiences, but may instead indicate psychological longing, nostalgia, or a self-presentation strategy. In some cases, these expressions may serve as a form of self-reassurance or reflect a desire to regain emotional balance. It also illustrates how sentiment analysis based purely on word frequency can overlook contextual meaning, reinforcing the need for complementary methods that capture semantic nuance.
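
Raw counts depend on how many tagged words each group contributed, so a complementary view is each category's share of its group's NRC-tagged words. The sketch below is a base-R illustration that uses only the counts printed in the top_nrc table. The names nrc_share, control_share, depressed_share, and difference are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the top_nrc table above, aligned by sentiment.
sentiments <- c("positive", "negative", "trust", "joy", "anticipation",
                "fear", "anger", "sadness", "surprise", "disgust")
control   <- c(5035, 3839, 3050, 2495, 2487, 2167, 1950, 1720, 1517, 1444)
depressed <- c(5642, 5636, 3228, 2918, 3047, 2793, 2362, 3470, 1145, 1661)

nrc_share <- data.frame(
  sentiment       = sentiments,
  control_share   = control / sum(control),      # share of Control's NRC-tagged words
  depressed_share = depressed / sum(depressed)   # share of Depressed's NRC-tagged words
)
nrc_share$difference <- nrc_share$depressed_share - nrc_share$control_share

# Categories that lean most toward the Depressed group come first.
nrc_share <- nrc_share[order(-nrc_share$difference), ]
print(nrc_share, digits = 3, row.names = FALSE)
```

The denominator here is the number of NRC-tagged words, not all tokens; dividing by each group's total token count in tidy_tweets would be an equally reasonable choice. Shares and raw counts need not agree, because a group that contributed more tagged words overall can lead in count while trailing in share.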

Figure 2 - Log Odds Ratio Graph

The second figure presents a log odds ratio analysis to identify the most characteristic words used by each group, Depressed and Control, in their tweets.

1. Why log odds ratio?

The log odds ratio was chosen for this analysis because it is particularly effective in highlighting words that are statistically characteristic of one group over another, regardless of overall word frequency. Compared to simple frequency or TF-IDF, which can sometimes favor high-frequency generic words, the log odds ratio adjusts for both overall token counts and group imbalance. It helps surface distinctive lexical features that differentiate one population from another, which is crucial in a study focused on group-based linguistic and psychological differences.

2. Visualization

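To create this figure, word frequencies were counted per group and the 10 most frequent words in each group were retained, giving 17 distinct words because three (love, people, and time) appear in both top-10 lists. After pivoting to one row per word, a word missing from the other group’s top-10 list is filled with 0; a 0 therefore means “not in that group’s top 10”, not “never used by that group”. Add-one smoothing is applied, each word’s smoothed count is divided by the group’s total over the retained words, and the log odds ratio is the natural logarithm of the Control proportion divided by the Depressed proportion. Positive values thus indicate Control-leaning words and negative values Depressed-leaning words.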

library(tidyr)

tweets_freq <- tidy_tweets %>% 
  count(label, word)                     # word frequency per group

lor_freq <- tweets_freq %>% 
  group_by(label) %>% 
  slice_max(n, n = 10) %>%               # 10 most frequent words in each group
  ungroup() %>% 
  pivot_wider(names_from = label, values_from = n, values_fill = 0) %>% 
  mutate(ratio_Control = (Control + 1) / sum(Control + 1),       # add-one smoothing
         ratio_Depressed = (Depressed + 1) / sum(Depressed + 1),
         log_odds_ratio = log(ratio_Control / ratio_Depressed))  # > 0 leans Control
lor_freq

## # A tibble: 17 × 6
##    word           Control Depressed ratio_Control ratio_Depressed log_odds_ratio
##    <chr>            <int>     <int>         <dbl>           <dbl>          <dbl>
##  1 user               509         0      0.189           0.000305         6.43  
##  2 trump              428         0      0.159           0.000305         6.25  
##  3 love               282       318      0.105           0.0973           0.0734
##  4 twitter            269         0      0.0999          0.000305         5.79  
##  5 people             268       308      0.0995          0.0942           0.0545
##  6 realdonaldtru…     259         0      0.0962          0.000305         5.75  
##  7 time               172       233      0.0640          0.0714          -0.109 
##  8 joe                169         0      0.0629          0.000305         5.33  
##  9 cameronhoodkin     166         0      0.0618          0.000305         5.31  
## 10 putin              164         0      0.0610          0.000305         5.30  
## 11 depression           0       886      0.000370        0.271           -6.59  
## 12 misslusyd            0       332      0.000370        0.102           -5.61  
## 13 treatments           0       268      0.000370        0.0820          -5.40  
## 14 sos                  0       255      0.000370        0.0781          -5.35  
## 15 day                  0       223      0.000370        0.0683          -5.22  
## 16 genevieveverso       0       220      0.000370        0.0674          -5.20  
## 17 overcome             0       219      0.000370        0.0671          -5.20
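
As a check on the arithmetic, the ratio and log odds columns can be reproduced in base R from the Control and Depressed counts printed in the table. The vectors below are copied from the table in row order; the variable names are introduced for this sketch only.

```r
# Counts copied from the lor_freq table above (17 retained words, in row order).
control   <- c(509, 428, 282, 269, 268, 259, 172, 169, 166, 164, 0, 0, 0, 0, 0, 0, 0)
depressed <- c(0, 0, 318, 0, 308, 0, 233, 0, 0, 0, 886, 332, 268, 255, 223, 220, 219)

# Add-one smoothing, then divide by the smoothed total over the retained words.
ratio_control   <- (control + 1) / sum(control + 1)
ratio_depressed <- (depressed + 1) / sum(depressed + 1)
log_odds_ratio  <- log(ratio_control / ratio_depressed)

sum(control + 1)                     # smoothed Control total: 2703
sum(depressed + 1)                   # smoothed Depressed total: 3279
round(log_odds_ratio[c(1, 11)], 2)   # rows 1 ("user") and 11 ("depression"): 6.43 and -6.59
```

The denominators are totals over the 17 retained words rather than over each group's full vocabulary, so the ratio columns are shares of this retained set.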

top_lor_tweets <- lor_freq %>% 
  group_by(label = ifelse(log_odds_ratio > 0, "Control", "Depressed")) %>%
  slice_max(abs(log_odds_ratio), n = 10, with_ties = F) 
top_lor_tweets

## # A tibble: 17 × 7
## # Groups:   label [2]
##    word     Control Depressed ratio_Control ratio_Depressed log_odds_ratio label
##    <chr>      <int>     <int>         <dbl>           <dbl>          <dbl> <chr>
##  1 user         509         0      0.189           0.000305         6.43   Cont…
##  2 trump        428         0      0.159           0.000305         6.25   Cont…
##  3 twitter      269         0      0.0999          0.000305         5.79   Cont…
##  4 realdon…     259         0      0.0962          0.000305         5.75   Cont…
##  5 joe          169         0      0.0629          0.000305         5.33   Cont…
##  6 cameron…     166         0      0.0618          0.000305         5.31   Cont…
##  7 putin        164         0      0.0610          0.000305         5.30   Cont…
##  8 love         282       318      0.105           0.0973           0.0734 Cont…
##  9 people       268       308      0.0995          0.0942           0.0545 Cont…
## 10 depress…       0       886      0.000370        0.271           -6.59   Depr…
## 11 misslus…       0       332      0.000370        0.102           -5.61   Depr…
## 12 treatme…       0       268      0.000370        0.0820          -5.40   Depr…
## 13 sos            0       255      0.000370        0.0781          -5.35   Depr…
## 14 day            0       223      0.000370        0.0683          -5.22   Depr…
## 15 genevie…       0       220      0.000370        0.0674          -5.20   Depr…
## 16 overcome       0       219      0.000370        0.0671          -5.20   Depr…
## 17 time         172       233      0.0640          0.0714          -0.109  Depr…

library(ggplot2)
ggplot(top_lor_tweets, aes(x = reorder(word, log_odds_ratio),
                  y = log_odds_ratio,
                  fill = label)) +
  geom_col(show.legend = T) +
  coord_flip() +
  labs(title = "Top 10 Log Odds Ratio Words in Tweets by Depressed and Control Groups", x = NULL)

3. Graph analysis

This visualization displays the words with the largest absolute log odds ratios in each direction, 17 words in total, meaning those most disproportionately associated with either the Depressed or the Control group. A horizontal bar chart (geom_col() with coord_flip()) was used to separate the group-distinctive words clearly, with positive values indicating association with the Control group and negative values indicating association with the Depressed group. The question behind the figure is: which specific words are most indicative of the linguistic differences between depressed and non-depressed Twitter users?

Notable patterns emerged. Words most associated with the Depressed group include depression, treatments, sos, and overcome, all terms directly related to emotional struggle, mental health, and seeking help or healing. In contrast, the Control group’s distinctive words were largely topical or platform-related: user, twitter, and names from political news such as trump and putin. Several entries on both sides (for example misslusyd, genevieveverso, and cameronhoodkin) appear to be account handles and so reflect who was being mentioned rather than vocabulary. The three words shared by both top-10 lists (love, people, and time) have log odds ratios close to zero, which means they do not distinguish the groups.

This divergence in word usage suggests that depressed users’ language is more emotionally loaded and psychologically revealing, often oriented around coping, distress, or personal reflection, while control users’ language reflects more general or situational content.
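
Because the top-10 filter is applied before the ratio is computed, a word that is common in both groups but only in one group's top 10 is treated as absent from the other. A possible alternative, sketched here on hypothetical toy counts, is to compute the ratio over the full vocabulary first and select the most distinctive words afterwards. The data frame toy_freq and the column direction are invented for illustration; this is not the pipeline used for the figure.

```r
library(dplyr)
library(tidyr)

# Hypothetical word counts, standing in for count(tidy_tweets, label, word).
toy_freq <- tibble(
  label = c("Control", "Control", "Control", "Depressed", "Depressed", "Depressed"),
  word  = c("day", "music", "love", "day", "therapy", "love"),
  n     = c(40, 30, 30, 45, 35, 20)
)

toy_lor <- toy_freq %>%
  # Pivot the full vocabulary first, so 0 really means "absent from that group".
  pivot_wider(names_from = label, values_from = n, values_fill = 0) %>%
  mutate(ratio_Control   = (Control + 1) / sum(Control + 1),
         ratio_Depressed = (Depressed + 1) / sum(Depressed + 1),
         log_odds_ratio  = log(ratio_Control / ratio_Depressed)) %>%
  # Only now keep the most distinctive word in each direction.
  group_by(direction = ifelse(log_odds_ratio > 0, "Control", "Depressed")) %>%
  slice_max(abs(log_odds_ratio), n = 1, with_ties = FALSE) %>%
  ungroup()

toy_lor
```

In the toy data, "day" is frequent in both groups, receives a ratio near zero, and is not selected, while "music" and "therapy" are picked as the distinctive words. On the real data, words with very low counts would likely need a minimum-frequency filter before ranking.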

Figure 3 - Trigram Network

In this study, I visualized the semantic patterns in the Control and Depressed groups using trigram network graphs. These visualizations were designed to uncover frequent three-word co-occurrences and examine how users in each group express thoughts, emotions, and experiences through language.

1. Why trigram networks?

The choice of trigram networks was motivated by the desire to go beyond surface-level word frequencies. Unigrams carry no context and bigrams capture only adjacent pairs, whereas trigrams are long enough to preserve short phrases (for example, “overcome depressive disorders”) while remaining frequent enough to count. Such phrases carry syntactic and semantic information that single words lack, reflecting more nuanced emotional states and actions. N-gram network analysis thus enables the study of how individuals construct their thoughts, not just what vocabulary they use.

Initially, I intended to use phi coefficient-based visualizations to examine word associations. However, the computation of phi coefficients for key terms like “depression” resulted in repeated NaN (Not a Number) errors, likely due to sparse co-occurrence or imbalance in contingency table distributions. These technical limitations led to the decision to adopt n-gram network visualization as a more flexible and interpretable alternative. This pivot allowed the study to retain a focus on co-occurrence and semantic structure, but in a format that better accommodates data sparsity and visual storytelling.
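
One way such NaN values can arise is offered here as an illustration, not as a diagnosis of the original run. The phi coefficient divides by the square root of the product of the four marginal totals of a 2×2 co-occurrence table, so if a word appears in every unit or in none, the denominator is zero and R returns NaN for 0/0. The helper phi_coef() below is a hypothetical base-R function written for this sketch; it is not the function used in the study.

```r
# Phi coefficient for two binary indicator vectors (1 = word present in the unit).
phi_coef <- function(x, y) {
  n11 <- sum(x == 1 & y == 1)   # both words present
  n10 <- sum(x == 1 & y == 0)   # only x present
  n01 <- sum(x == 0 & y == 1)   # only y present
  n00 <- sum(x == 0 & y == 0)   # neither present
  numerator   <- n11 * n00 - n10 * n01
  denominator <- sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
  numerator / denominator
}

# Two words that always appear together: perfect positive association.
phi_together <- phi_coef(c(1, 1, 0, 0), c(1, 1, 0, 0))

# A word present in every unit has a zero marginal, so phi is 0 / 0.
phi_degenerate <- phi_coef(c(1, 1, 1, 1), c(1, 0, 1, 0))

phi_together     # 1
phi_degenerate   # NaN
```

Checking the marginal totals of a keyword before computing correlations is one way to tell this degenerate case apart from a genuine computational error.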

2. The design and structure of trigram networks

The graph was built using a force-directed layout (layout = "fr"), which lets related words cluster naturally. Because as_tbl_graph() reads the first two columns of a data frame as edge endpoints, each edge connects word1 to word2 of a trigram, while word3 and the count n are carried as edge attributes. Nodes are therefore individual words. Communities were identified with group_infomap() and color-coded, with the palette generated automatically from group membership so that each cluster is visually distinct without manual labeling. Node size represents degree centrality, emphasizing the words with the most connections. Text labels were kept readable with repel = TRUE and max.overlaps = Inf, so that important terms remain legible even in dense graphs. Edges were drawn semi-transparent (alpha = 0.4) to avoid clutter and keep attention on structure rather than volume.

This approach supports truthful representation of the data: it reveals what phrases are frequently used, which terms connect semantically, and which trigrams are central to the discourse within each group.

3-1. Visualization of control group’s tweets

library(tidytext)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidy_control <- tidy_tweets %>% 
  filter(label == "Control")

con_trigram <- tidy_control %>% 
  unnest_tokens(input = post_text,
                output = word,
                token = "ngrams",
                n = 3)

drop_words <- c("br", custom_stop_words$word, "http", "https", "t.co")

con_separated <- con_trigram %>%
  separate(word, c("word1", "word2", "word3"), sep = " ") %>% 
  # discard any trigram that contains a stop word or URL fragment
  filter(!word1 %in% drop_words,
         !word2 %in% drop_words,
         !word3 %in% drop_words)

con_pairs <- con_separated %>%
  count(word1, word2, word3, sort = TRUE) %>%
  na.omit() 
head(con_pairs)

##        word1          word2          word3   n
## 1         gt             gt             gt 377
## 2 pillowtalk bestmusicvideo   iheartawards 339
## 4    cartoon           fake           yhvh 143
## 5       fake           yhvh           fuck 143
## 6     stupid            sun        glasses 143
## 7      video     pillowtalk bestmusicvideo 130

library(tidygraph)

## 
## Attaching package: 'tidygraph'

## The following object is masked from 'package:stats':
## 
##     filter

con_tri_graph <- con_pairs %>%
  filter(n >= 60) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),    
         group = as.factor(group_infomap()))

library(ggraph)
set.seed(1234)
ggraph(con_tri_graph, layout = "fr") +
  geom_edge_link(color = "gray50",         
                 alpha = 0.4) +               
  geom_node_point(aes(size = centrality,     
                      color = group),          
                  show.legend = F) +         
  scale_size(range = c(4, 10)) +               
  geom_node_text(aes(label = name),
               repel = TRUE,
               max.overlaps = Inf,
               size = 4) +
  labs(title = "Trigram Network of Control Tweets") + 
  theme_graph()
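
A point worth checking in the trigram counts: tidy_tweets holds one row per retained word, with post_text repeated on every row (visible in the head(tidy_tweets) output), so tokenizing post_text from that table yields each tweet's trigrams once per retained word. The sketch below uses hypothetical toy tweets to show the effect and one possible remedy, which is collapsing to one row per tweet with distinct() before extracting trigrams. The objects toy, tidy_toy, inflated, and deduped are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Two hypothetical tweets (not from the dataset).
toy <- tibble(
  user_id   = c(1, 2),
  post_text = c("feeling low again today", "feeling low again tonight"),
  label     = "Depressed"
)

# One row per word, with post_text kept on every row (as in tidy_tweets).
tidy_toy <- toy %>%
  unnest_tokens(word, post_text, drop = FALSE)

# Trigrams taken straight from the word-level table: each tweet is processed 4 times.
inflated <- tidy_toy %>%
  unnest_tokens(trigram, post_text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

# Trigrams after collapsing back to one row per tweet.
deduped <- tidy_toy %>%
  distinct(user_id, post_text, label) %>%
  unnest_tokens(trigram, post_text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

inflated$n[inflated$trigram == "feeling low again"]   # 8
deduped$n[deduped$trigram == "feeling low again"]     # 2
```

Starting from data_selected, which already has one row per tweet, would be an equivalent route. Note that distinct() on the text also merges identical tweets posted more than once by the same user; carrying a tweet identifier such as post_id through the pipeline would avoid that.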

3-2. Visualization of depressed group’s tweets

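In the Depressed group’s trigram network, I purposefully narrowed the focus to trigrams that begin with one of six keywords: “depression”, “treatments”, “overcome”, “sos”, “day”, and “time” (the code filters on word1). These were selected from the prior log odds ratio analysis, where all six had negative values, that is, they leaned toward the Depressed group. The lean was strong for the first five and only slight for “time” (log odds ratio -0.109). The remaining high-ranking entries, which appear to be account handles, were left out.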

These terms were seen as semantically central to understanding the unique language patterns of individuals expressing depressive symptoms. By analyzing their surrounding trigrams, I sought to better understand the linguistic context in which these terms appear—whether they signal help-seeking behavior, expressions of struggle, or descriptions of lived experience.

library(tidytext)
library(tidyverse)

tidy_depressed <- tidy_tweets %>% 
  filter(label == "Depressed")

dep_trigram <- tidy_depressed %>% 
  unnest_tokens(input = post_text,
                output = word,
                token = "ngrams",
                n = 3)

drop_words <- c("br", custom_stop_words$word, "http", "https", "t.co")

dep_separated <- dep_trigram %>%
  separate(word, c("word1", "word2", "word3"), sep = " ") %>% 
  # discard any trigram that contains a stop word or URL fragment
  filter(!word1 %in% drop_words,
         !word2 %in% drop_words,
         !word3 %in% drop_words)

target <- c("depression", "treatments", "overcome", "sos", "day", "time") 
dep_pairs <- dep_separated %>%
  count(word1, word2, word3, sort = TRUE) %>%
  na.omit() %>% 
  filter(word1 %in% target)   # keep trigrams that begin with a target word
head(dep_pairs)

##        word1      word2      word3   n
## 1   overcome depressive  disorders 117
## 2 depression depression treatments  85
## 3 depression    article     teller  74
## 4   overcome depression     mental  23
## 5   overcome depression      sleep  20
## 6 depression    florida      times  18

library(tidygraph)
dep_tri_graph <- dep_pairs %>%
  filter(n >= 11) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),    
         group = as.factor(group_infomap()))

library(ggraph)
set.seed(1234)
ggraph(dep_tri_graph, layout = "fr") +
  geom_edge_link(color = "gray50",         
                 alpha = 0.4) +               
  geom_node_point(aes(size = centrality,     
                      color = group),          
                  show.legend = F) +         
  scale_size(range = c(4, 10)) +               
  geom_node_text(aes(label = name),
               repel = TRUE,
               max.overlaps = Inf,
               size = 3) +
  labs(title = "Trigram Network of Depressed Tweets") + 
  theme_graph()

4. Trigram Network Analysis: Control vs. Depressed Group

The trigram network of the Control group encompassed a wide range of topics, including politics, music, and popular culture. For instance, network structures such as “democratic play game” and “hashtags trump quotes” suggest that general users often express political opinions or engage with casual, entertainment-oriented themes such as music requests or social media memes. The frequent presence of usernames and user tags further indicates that language use in this group tends to be interaction-driven and socially embedded, reflecting a communicative style focused on external engagement and everyday discourse.

In contrast, the trigram network of the Depressed group was constructed with a more focused analytical intent. Specifically, I extracted and visualized only those trigrams that begin with one of six key terms: “depression”, “treatments”, “overcome”, “sos”, “day”, and “time”. These keywords came out of the prior log odds ratio analysis as leaning toward the Depressed group. However, their selection was not based solely on frequency. Rather, these terms were chosen to examine the recurring linguistic contexts in which emotional and psychological experiences, especially those related to depression, are framed.

The resulting network showed these six keywords as central nodes, connecting to a variety of surrounding words and forming semantically dense clusters. Their centrality is partly a product of the filter, since every retained trigram starts with one of them; what is informative is which words attach to them. The attachments indicate that individuals expressing depressive symptoms tend to use language that mixes emotional expression, therapeutic effort, and altered perceptions of time and daily life. Terms like “anxiety”, “depression”, and “treatment” frequently co-occurred, often in sequential order, illustrating how users may narrate their struggles in cohesive linguistic units.

Some word chains in the network, such as “wont casually day lightning” or “prompts day lightning”, may initially appear ungrammatical or fragmented. One possible reading is that they are poetic or metaphorical attempts to describe emotional states, for example “a day that passes like lightning without meaning”. Such language use would suggest an indirect or symbolic mode of emotional expression, in which individuals in the Depressed group prefer metaphor and figurative language over explicit description, potentially as a coping mechanism or because of the stigma attached to mental health discourse. Because these chains are read off the network rather than quoted from individual tweets, this interpretation should be treated as tentative.

In summary, the Control group’s network is characterized by a horizontal structure centered on everyday information and social interaction. Meanwhile, the Depressed group’s network forms a more centrally concentrated structure, where emotion-laden keywords such as symptoms, treatments, and psychological states serve as anchors. This contrast highlights how differing psychological states influence not only word choice but also the semantic architecture of digital communication.

Analyzing Sentiment and Linguistic Patterns in Tweets From Depressed and Non-Depressed Users

Lee Hyerin

2025-06-16
