Introduction

This project conducts a computational text analysis of 4,000 tweets concerning the 2017 Catalan independence movement. Using sentiment analysis and text mining techniques, we examine the emotional polarity, thematic framing, and dissemination patterns within this polarized online discourse. The primary objectives are to quantify the sentiment distribution, identify dominant narratives through hashtag analysis, and investigate whether a message's sentiment is related to how it spread through retweets. The analysis aims to show how conflict-driven political debates manifest linguistically on social media, linking lexical patterns to observable sharing behavior.

Required Packages

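# Load required packages
library(tidyverse)     # attaches dplyr, ggplot2, tidyr, readr, stringr
library(tidytext)      # tokenization and sentiment lexicons
library(wordcloud)     # word clouds
library(ggthemes)
library(RColorBrewer)  # color palettes used by the word clouds
library(ggrepel)
library(DT)
library(textdata)      # provides the AFINN lexicon for get_sentiments()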

Load the tweets data

tweets <- read_csv("tweets.csv")

# Add doc_id to track each tweet
tweets <- tweets %>%
  mutate(doc_id = row_number())

# Data overview
glimpse(tweets)
summary(tweets)

Exploratory Data Analysis

# Check basic statistics
cat("Number of tweets:", nrow(tweets), "\n")
## Number of tweets: 4000
cat("Number of unique users:", n_distinct(tweets$screenName), "\n")
## Number of unique users: 2883
# Check retweet vs original content
retweet_summary <- tweets %>%
  summarize(
    total_tweets = n(),
    retweets = sum(isRetweet),
    original = total_tweets - retweets,
    retweet_rate = retweets / total_tweets
  )
retweet_summary
## # A tibble: 1 × 4
##   total_tweets retweets original retweet_rate
##          <int>    <int>    <int>        <dbl>
## 1         4000     3149      851        0.787
# Top users by tweet count
top_users <- tweets %>%
  count(screenName, sort = TRUE) %>%
  head(10)
top_users
## # A tibble: 10 × 2
##    screenName        n
##    <chr>         <int>
##  1 CatalanRobot    196
##  2 globurl          40
##  3 loscarnaless     26
##  4 jackiedeburca    19
##  5 Carme_Almarza    16
##  6 Fcotta63         13
##  7 QuantumAspect    13
##  8 nexthis          13
##  9 CatalanNation    12
## 10 SomColonia       11

Text Preprocessing and Tokenization

# Define custom stop words for Twitter data
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("rt", "https", "t.co", "amp", "catalonia", "spain", "catalonia's"), 
         lexicon = "custom")
)
# Tokenize tweets
tidy_tweets <- tweets %>%
  unnest_tokens(output = word, input = text)
# Remove stopwords
tidy_tweets_clean <- tidy_tweets %>%
  anti_join(custom_stop_words, by = "word")
# Compare token counts
cat("Original tokens:", nrow(tidy_tweets), "\n")
## Original tokens: 83377
cat("After stopword removal:", nrow(tidy_tweets_clean), "\n")
## After stopword removal: 42689
cat("Tokens removed:", nrow(tidy_tweets) - nrow(tidy_tweets_clean), "\n")
## Tokens removed: 40688

This cleaning step cut the 83,377 tokens roughly in half, to 42,689. Removing common English words and Twitter noise such as “rt” and link fragments leaves mostly content-bearing terms, which focuses the analysis on the themes of the Catalonia discussion.
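The stop word list above removes “https” and “t.co”, but the random identifier at the end of each shortened link would still be tokenized as a word. One possible refinement, shown here as a minimal sketch on made-up tweets, is to strip URLs and HTML entities from the text before tokenizing. The `toy_tweets` data and the regular expression are illustrative assumptions, not part of the original pipeline.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical tweets standing in for the real `tweets` data
toy_tweets <- tibble(
  doc_id = 1:2,
  text = c("RT @user: Freedom for Catalonia https://t.co/AbC123xyz",
           "Crisis talks continue &amp; no deal https://t.co/QwE456rty")
)

toy_tokens <- toy_tweets %>%
  # Drop URLs and the "&amp;" entity so their fragments never become tokens
  mutate(text = str_remove_all(text, "https?://\\S+"),
         text = str_remove_all(text, "&amp;")) %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("rt", "catalonia"))

toy_tokens
```

With this ordering, the custom stop list only needs to cover Twitter conventions and topic words, since link debris is gone before tokenization.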

Sentiment Analysis with Multiple Lexicons

# Bing sentiment analysis
bing_tweet_results <- tidy_tweets_clean %>%
  inner_join(get_sentiments("bing"), by = "word")
# Top sentiment words
bing_top_words <- bing_tweet_results %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()
# Visualize top sentiment words
ggplot(bing_top_words, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Top Sentiment Words in Tweets (Bing Dictionary)",
       x = "Word", y = "Frequency") +
  theme_minimal()

The Bing results show two distinct vocabularies. Negative words such as “crazy,” “corrupt,” and “fascist” point to accusation and frustration, while positive terms such as “freedom,” “solidarity,” and “peace” reflect ideals and expressions of support. Both sets are prominent, which is consistent with a highly polarized conversation about Catalonia, although word counts alone cannot show which side of the debate used which words.

AFINN Sentiment Analysis

# AFINN sentiment analysis
afinn_tweet_results <- tidy_tweets_clean %>%
  inner_join(get_sentiments("afinn"), by = "word")
# Distribution of sentiment scores
afinn_tweet_results %>%
  count(value) %>%
  ggplot(aes(x = factor(value), y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Distribution of AFINN Sentiment Scores in Tweets",
       x = "Sentiment Score", y = "Count") +
  theme_minimal()

The AFINN scores are concentrated on the negative side of the scale, with -3 and -4 being the most frequent values (AFINN rates words from -5 to +5). This indicates that the scored words in these tweets are mostly strongly negative, in line with the conflict-laden vocabulary seen in the Bing results.
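A frequency plot shows which scores are common, but not how much each score level pulls the overall average. The sketch below weights each score by its count. It uses a tiny stand-in lexicon with the same `word` and `value` columns as AFINN, so the tokens and values are illustrative assumptions rather than results from the tweet data.

```r
library(dplyr)

# Hypothetical tokens and a stand-in lexicon shaped like AFINN (word, value)
toy_tokens <- tibble(word = c("crisis", "crisis", "fascist",
                              "love", "freedom", "freedom", "freedom"))
toy_afinn <- tibble(word  = c("crisis", "fascist", "love", "freedom"),
                    value = c(-3, -2, 3, 2))

score_summary <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  count(value) %>%
  mutate(contribution = value * n)  # total pull of each score level

score_summary

# Mean score per matched word
sum(score_summary$contribution) / sum(score_summary$n)
```

Applied to `afinn_tweet_results`, the same two steps would show whether a few very negative words or many mildly negative ones account for the shape of the distribution.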

Sentiment Distribution per Tweet

# Calculate sentiment scores per tweet using Bing
bing_tweet_scores <- tidy_tweets_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)
# Distribution of sentiment scores (scores are integers, so use one bin per value)
ggplot(bing_tweet_scores, aes(x = sentiment_score)) +
  geom_histogram(binwidth = 1, fill = "lightblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Distribution of Sentiment Scores Across Tweets",
       x = "Sentiment Score (Positive - Negative)",
       y = "Number of Tweets") +
  theme_minimal()

# Summary statistics
summary(bing_tweet_scores$sentiment_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.0000 -1.0000  1.0000  0.2068  1.0000  4.0000

At the tweet level the picture is more balanced and slightly positive. The median score is +1 and the mean is 0.21; as the summary further below shows, about 57% of scored tweets are net positive, 37% are net negative, and only about 5% sit exactly at zero. So although strongly negative words are common, they do not dominate most tweets: many tweets mix negative and positive words, or the negative vocabulary is concentrated in a smaller number of highly charged messages. Note that only tweets containing at least one Bing word receive a score (2,607 of the 4,000), so the remaining tweets are excluded rather than counted as neutral.
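Because the inner join drops tweets without any lexicon word, it can help to report coverage and to check how the summary changes when unmatched tweets are treated as zero. The following is a minimal sketch on made-up data; `toy_scores` and `toy_tweets` are hypothetical stand-ins for `bing_tweet_scores` and `tweets`.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-tweet scores: only tweets with a lexicon match appear
toy_scores <- tibble(doc_id = c(1, 2, 4), sentiment_score = c(1, -2, 1))
toy_tweets <- tibble(doc_id = 1:5)

all_scores <- toy_tweets %>%
  left_join(toy_scores, by = "doc_id") %>%
  mutate(matched = !is.na(sentiment_score),
         sentiment_score = replace_na(sentiment_score, 0))

mean(all_scores$matched)            # share of tweets with any lexicon word
median(all_scores$sentiment_score)  # median once unmatched tweets count as 0
```

Whether unmatched tweets should count as neutral is a modeling choice; reporting both versions makes the dependence on that choice visible.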

Analyze Retweets vs Sentiment

# Join sentiment scores with original tweet data
tweets_with_sentiment <- bing_tweet_scores %>% left_join(tweets, by = "doc_id")
# compare sentiment between retweets and original tweets 
tweets_with_sentiment %>%
  group_by(isRetweet) %>%
  summarize(
    mean_sentiment = mean(sentiment_score, na.rm = TRUE),
    median_sentiment = median(sentiment_score, na.rm = TRUE),
    n_tweets = n()
  )
## # A tibble: 2 × 4
##   isRetweet mean_sentiment median_sentiment n_tweets
##   <lgl>              <dbl>            <dbl>    <int>
## 1 FALSE             0.0241                0      374
## 2 TRUE              0.237                 1     2233
# Boxplot of sentiment by retweet status
ggplot(tweets_with_sentiment, aes(x = as.factor(isRetweet), y = sentiment_score)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Sentiment Scores: Retweets vs Original Tweets",
       x = "Is Retweet", y = "Sentiment Score") +
  theme_minimal()

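The boxplot shows a small difference: retweets have a higher median sentiment score (+1) than original tweets (0), and a higher mean (0.24 versus 0.02). This is consistent with people being slightly more inclined to share content with a positive or neutral-leaning tone, even within a heated debate. The evidence is limited, though: isRetweet marks whether a row is a retweet, not how far a message spread, and the same retweeted text can appear many times among the 2,233 scored retweets, compared with only 374 scored originals. The wide spread of scores in both groups indicates that strongly opinionated messages, both negative and positive, were circulating and being amplified.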
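One way to probe this comparison is to count each distinct message once and then apply a rank-based test. The sketch below does this on made-up data; the `toy` table is hypothetical, and the choice of a Wilcoxon rank-sum test is an assumption suited to small integer scores, not a method used in the original analysis.

```r
library(dplyr)

# Hypothetical scored tweets; identical text repeats across retweet rows
toy <- tibble(
  text = c("a", "a", "a", "b", "c", "d", "e", "e"),
  isRetweet = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  sentiment_score = c(2, 2, 2, 0, -1, 1, -1, -1)
)

# Keep one row per distinct message so a heavily retweeted tweet counts once
toy_unique <- toy %>%
  distinct(text, .keep_all = TRUE)

toy_unique %>%
  group_by(isRetweet) %>%
  summarize(median_sentiment = median(sentiment_score), n_messages = n())

# Rank-based comparison of the two groups (normal approximation because of ties)
wilcox.test(sentiment_score ~ isRetweet, data = toy_unique, exact = FALSE)
```

If the gap between retweets and originals shrinks after deduplication, the original difference was likely driven by a few widely shared messages.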

Wordclouds of Sentiment Words

# Prepare data for wordclouds
bing_word_counts <- bing_tweet_results %>%
  count(word, sentiment, sort = TRUE)
# Positive words wordcloud
positive_words <- bing_word_counts %>%
  filter(sentiment == "positive")

wordcloud(words = positive_words$word, 
          freq = positive_words$n,
          max.words = 100,
          colors = brewer.pal(8, "Dark2"),
          random.order = FALSE)

# Negative words wordcloud
negative_words <- bing_word_counts %>%
  filter(sentiment == "negative")

wordcloud(words = negative_words$word, 
          freq = negative_words$n,
          max.words = 100,
          colors = brewer.pal(8, "Set1"),
          random.order = FALSE)

The wordclouds show the same emotional split visually. The negative cloud is dense with words of conflict and accusation such as “crazy,” “corrupt,” “fascist,” and “crisis,” suggesting anger and perceived injustice. The positive cloud is led by ideals such as “freedom,” “solidarity,” “peace,” and “love,” which reflect hope and support. Together they suggest that this was not only a political discussion but also a strongly emotional one, framed around struggle on one hand and hope on the other.

Time-based Sentiment Analysis

# Analyze sentiment over time (if created column exists)
if("created" %in% colnames(tweets_with_sentiment)) {
  tweets_with_sentiment <- tweets_with_sentiment %>%
    mutate(date = as.Date(created))
  
  daily_sentiment <- tweets_with_sentiment %>%
    group_by(date) %>%
    summarize(
      avg_sentiment = mean(sentiment_score, na.rm = TRUE),
      n_tweets = n()
    )
  
  ggplot(daily_sentiment, aes(x = date, y = avg_sentiment)) +
    geom_line(color = "steelblue", linewidth = 1) +
    geom_point(color = "darkred") +
    labs(title = "Daily Average Sentiment Score",
         x = "Date", y = "Average Sentiment Score") +
    theme_minimal()
}

This chart tracks how the average sentiment moved across the two-day peak of the conversation. It was slightly negative on November 29th and modestly positive by the end of November 30th. Because the series is aggregated by day, it contains very few points, so this is a difference between days rather than an established trend. Still, it suggests that the collective tone may have warmed slightly, potentially due to increased sharing of hopeful or supportive messages or a change in the dominant narrative over that short period.
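For a window this short, aggregating by hour rather than by day gives more points to judge a trend from. The sketch below shows the idea on made-up timestamps; the `toy` data are hypothetical, and it assumes `created` is a date-time (POSIXct) column.

```r
library(dplyr)

# Hypothetical timestamps and scores standing in for tweets_with_sentiment
toy <- tibble(
  created = as.POSIXct(c("2017-11-29 22:10:00", "2017-11-29 22:50:00",
                         "2017-11-29 23:05:00", "2017-11-30 00:15:00"),
                       tz = "UTC"),
  sentiment_score = c(-1, 1, 2, 0)
)

hourly_sentiment <- toy %>%
  # Truncate each timestamp to the start of its hour
  mutate(hour = as.POSIXct(trunc(created, units = "hours"))) %>%
  group_by(hour) %>%
  summarize(avg_sentiment = mean(sentiment_score), n_tweets = n())

hourly_sentiment
```

Keeping `n_tweets` alongside the average is useful, since hours with very few tweets produce noisy averages that should not be over-read.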

Interpretation and Relation to Topic Modeling

# Key insights summary
cat("=== SENTIMENT ANALYSIS RESULTS SUMMARY ===\n")
## === SENTIMENT ANALYSIS RESULTS SUMMARY ===
# Overall sentiment
overall_sentiment <- mean(bing_tweet_scores$sentiment_score, na.rm = TRUE)
cat("Overall average sentiment score:", round(overall_sentiment, 3), "\n")
## Overall average sentiment score: 0.207
# Sentiment distribution
sentiment_prop <- bing_tweet_scores %>%
  mutate(sentiment_category = case_when(
    sentiment_score > 0 ~ "Positive",
    sentiment_score < 0 ~ "Negative",
    TRUE ~ "Neutral"
  )) %>%
  count(sentiment_category) %>%
  mutate(proportion = n / sum(n))

print(sentiment_prop)
## # A tibble: 3 × 3
##   sentiment_category     n proportion
##   <chr>              <int>      <dbl>
## 1 Negative             968     0.371 
## 2 Neutral              142     0.0545
## 3 Positive            1497     0.574
# Most influential words
top_positive <- bing_word_counts %>%
  filter(sentiment == "positive") %>%
  head(5)

top_negative <- bing_word_counts %>%
  filter(sentiment == "negative") %>%
  head(5)

cat("\nTop positive words:", paste(top_positive$word, collapse = ", "), "\n")
## 
## Top positive words: solidarity, love, freedom, leading, congratulation
cat("Top negative words:", paste(top_negative$word, collapse = ", "), "\n")
## Top negative words: crazy, complain, ridicule, condemn, failure

Interpretation and Relation to Topic Modeling/Text Clustering

Based on our sentiment analysis, we can make some reasonable predictions about what topic modeling or text clustering would likely reveal in this Twitter dataset about the Catalonia discussion.

1. Expectation of Clear, Opposing Clusters. The sentiment vocabulary is sharply divided, as the most common positive and negative words show. On one side are hopeful words like solidarity and freedom; on the other are much harsher terms like crazy, corrupt, and fascist. Given this split, topic modeling would plausibly produce at least two large thematic clusters: one focused on pro-independence or pro-Catalan arguments framed in terms of democracy, self-determination, or international support, and another focused on anti-secession or pro-union narratives emphasizing illegality and crisis. This is a prediction rather than a finding, since word-level sentiment does not by itself identify a tweet's political stance.

2. Sentiment as a Feature for Separation. We also found that retweets have a higher median sentiment than original tweets, which suggests that shared content tends to use slightly more positive or neutral emotional framing. Clustering might therefore separate “high-engagement” topics from more negative or niche posts, based not only on word frequency but also on emotional tone and on how far the tweets spread.

3. Subtopic Differentiation Within Broad Sentiment Categories. Even though the sentiment categories are fairly binary (positive vs. negative), the variety of words in each group hints at smaller subthemes. For instance, the negative words include political-failure terms like failure and crisis as well as stronger accusations such as fascist and repression. Meanwhile, the positive words range from emotional support (love, peace) to more practical notions of success (leading, top). Because of this diversity, topic modeling would probably reveal more specific themes within each larger emotional category, such as legal debates, police actions, economic concerns, or even celebratory events.

4. Hashtags as Thematic Anchors. Many of the hashtags that appear across tweets, such as #democracy, #EU, and #europe, show that users weren’t just discussing Catalonia in isolation. They were linking it to broader European or international contexts. In topic modeling, these hashtags may act as anchor points that link together tweets about international law, European institutions, or democratic principles. This might even form a separate “international reaction” cluster.
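The word tokenization used earlier strips punctuation, so hashtags enter the token counts as ordinary words rather than as tags. To count hashtags directly, they can be extracted from the raw text with a regular expression. The following is a minimal sketch on made-up tweets; `toy_tweets` and the pattern `#\\w+` are illustrative assumptions.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical tweets standing in for the real `tweets` data
toy_tweets <- tibble(
  doc_id = 1:3,
  text = c("Vote today #Democracy #EU",
           "No comment",
           "Solidarity from #europe #democracy")
)

hashtag_counts <- toy_tweets %>%
  # Lower-case first so #Democracy and #democracy are counted together
  mutate(hashtag = str_extract_all(str_to_lower(text), "#\\w+")) %>%
  unnest(hashtag) %>%   # one row per hashtag; tweets without hashtags drop out
  count(hashtag, sort = TRUE)

hashtag_counts
```

Keeping `doc_id` through the `unnest()` step instead of counting immediately would allow hashtags to be joined to the per-tweet sentiment scores.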

Conclusion for Modeling Approach

If we apply topic modeling (such as LDA) or clustering methods (such as k-means on TF-IDF vectors), we would expect the model to pick up on these sentiment-laden narratives. The topics will likely split along keyword sets that also differ in emotional polarity. Adding sentiment scores, either as features or for interpreting the results afterwards, can help check whether each topic leans more toward the “freedom and solidarity” side or the “corruption and crisis” side. Overall, the sentiment analysis acts as a useful guide: it indicates that the underlying topics in this dataset are far from neutral and are instead highly emotional and ideologically divided.
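As a rough illustration of the clustering route, the sketch below builds TF-IDF vectors from tokenized documents and runs k-means with two clusters. The four toy documents, the choice of `centers = 2`, and the seed are assumptions made for the example; nothing here has been run on the tweet data.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized documents standing in for tidy_tweets_clean
toy_tokens <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  word   = c("freedom", "solidarity", "freedom", "peace",
             "crisis", "illegal", "crisis", "failure")
)

# Term counts per document, then TF-IDF weights
tfidf <- toy_tokens %>%
  count(doc_id, word) %>%
  bind_tf_idf(word, doc_id, n)

# Sparse document-term matrix: one row per document, one column per word
dtm <- cast_sparse(tfidf, doc_id, word, tf_idf)

set.seed(42)
clusters <- kmeans(as.matrix(dtm), centers = 2, nstart = 10)
clusters$cluster
```

Joining `clusters$cluster` back to per-tweet sentiment scores by `doc_id` would then show whether the clusters differ in average sentiment, which is the check this section proposes.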