Assignment 10A Sentiment Analysis with Text Mining in R

Author

Zineb Tamnat

This report reproduces and extends the sentiment analysis example from Chapter 2 of Text Mining with R.

First, the original analysis is replicated using Jane Austen's novels. The analysis is then extended to a corpus of tweets, and an additional sentiment lexicon (Loughran) is applied to compare results across lexicons.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(textdata)
library(janeaustenr)
library(stringr)
# Creating the Jane Austen tidy text data
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)
# Basic sentiment analysis
bing_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
  count(book, sentiment)

bing_sentiment
# A tibble: 12 × 3
   book                sentiment     n
   <fct>               <chr>     <int>
 1 Sense & Sensibility negative   3671
 2 Sense & Sensibility positive   4933
 3 Pride & Prejudice   negative   3652
 4 Pride & Prejudice   positive   5052
 5 Mansfield Park      negative   4828
 6 Mansfield Park      positive   6749
 7 Emma                negative   4809
 8 Emma                positive   7157
 9 Northanger Abbey    negative   2518
10 Northanger Abbey    positive   3244
11 Persuasion          negative   2201
12 Persuasion          positive   3473
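
Because the novels differ in length, raw counts are hard to compare across books. One way to put them on a common scale is to convert the counts into the share of matched words that are positive. The sketch below re-enters the counts from the table above by hand; the names `bing_counts` and `book_shares` are introduced here for illustration and are not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Counts copied from the bing_sentiment output above
bing_counts <- tribble(
  ~book,                 ~sentiment, ~n,
  "Sense & Sensibility", "negative", 3671,
  "Sense & Sensibility", "positive", 4933,
  "Pride & Prejudice",   "negative", 3652,
  "Pride & Prejudice",   "positive", 5052,
  "Mansfield Park",      "negative", 4828,
  "Mansfield Park",      "positive", 6749,
  "Emma",                "negative", 4809,
  "Emma",                "positive", 7157,
  "Northanger Abbey",    "negative", 2518,
  "Northanger Abbey",    "positive", 3244,
  "Persuasion",          "negative", 2201,
  "Persuasion",          "positive", 3473
)

# Share of Bing-matched words that are positive, per book
book_shares <- bing_counts %>%
  group_by(book) %>%
  summarise(
    total = sum(n),
    positive_share = sum(n[sentiment == "positive"]) / total,
    .groups = "drop"
  )

book_shares
```

With these counts, the positive share falls between roughly 56% and 61% for every novel.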
#Visual
bing_sentiment %>%
  ggplot(aes(x = book, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(
    title = "Positive and Negative Words in Jane Austen Novels",
    x = "Book",
    y = "Count"
  ) +
  theme_classic()

# Comparing lexicons for Pride and Prejudice
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn"), by = "word", relationship = "many-to-many") %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
    mutate(method = "Bing"),
  pride_prejudice %>%
    inner_join(
      get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative")),
      by = "word",
      relationship = "many-to-many"
    ) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

lexicon_comparison <- bind_rows(afinn, bing_and_nrc)

lexicon_comparison
# A tibble: 489 × 5
   index sentiment method negative positive
   <dbl>     <dbl> <chr>     <int>    <int>
 1     0        29 AFINN        NA       NA
 2     1         0 AFINN        NA       NA
 3     2        20 AFINN        NA       NA
 4     3        30 AFINN        NA       NA
 5     4        62 AFINN        NA       NA
 6     5        66 AFINN        NA       NA
 7     6        60 AFINN        NA       NA
 8     7        18 AFINN        NA       NA
 9     8        84 AFINN        NA       NA
10     9        26 AFINN        NA       NA
# ℹ 479 more rows
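
Two details of the output above may be worth unpacking: `linenumber %/% 80` uses integer division to group lines into 80-line sections, and the AFINN rows show `NA` for `negative` and `positive` because the AFINN table never had those columns. The following is a minimal sketch on made-up values; `afinn_toy` and `bing_toy` are hypothetical stand-ins, not the real tables.

```r
library(dplyr)
library(tibble)

# Integer division assigns each line number to an 80-line section
linenumber <- c(1, 79, 80, 159, 160)
section <- linenumber %/% 80
section

# Toy stand-ins for the two tables combined above (values are made up)
afinn_toy <- tibble(index = 0:1, sentiment = c(5, -2), method = "AFINN")
bing_toy <- tibble(
  method = "Bing",
  index = 0:1,
  negative = c(3L, 4L),
  positive = c(6L, 1L)
) %>%
  mutate(sentiment = positive - negative)

# bind_rows() matches columns by name and fills missing ones with NA
combined <- bind_rows(afinn_toy, bing_toy)
combined
```

Lines 1 and 79 land in section 0, lines 80 and 159 in section 1, and line 160 in section 2. The `NA` values are therefore a side effect of combining tables with different columns, not missing sentiment scores.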
#Visual
lexicon_comparison %>%
  ggplot(aes(x = index, y = sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  labs(
    title = "Comparing Sentiment Lexicons for Pride and Prejudice",
    x = "Index",
    y = "Sentiment Score"
  ) +
  theme_classic()

The base sentiment analysis in this report follows the example from Text Mining with R: A Tidy Approach, Chapter 2: Sentiment Analysis with Tidy Data. The original code and methodology were adapted from the authors’ published materials.

I initially planned to collect tweets using the rtweet R package, as stated in my approach. However, the package was not available for my version of R, so I used a publicly available dataset from Kaggle (Sentiment140), which contains real tweet data.

tweets <- read_csv("train_data.csv")
Rows: 1523975 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sentence
dbl (1): sentiment

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tweets <- tweets %>%
  slice_sample(n = 5000)
colnames(tweets)
[1] "sentence"  "sentiment"
# Cleaning and tokenizing the sampled tweets (the 5,000-row sample was already drawn above)

tidy_tweets <- tweets %>%
  mutate(sentence = str_remove_all(sentence, "http\\S+")) %>%
  mutate(sentence = str_remove_all(sentence, "@\\w+")) %>%
  mutate(sentence = str_remove_all(sentence, "#")) %>%
  unnest_tokens(word, sentence)
head(tidy_tweets)
# A tibble: 6 × 2
  sentiment word    
      <dbl> <chr>   
1         1 is      
2         1 leaving 
3         1 for     
4         1 nyc     
5         1 tomorrow
6         1 loved   
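
To make the cleaning steps concrete, here is a small sketch that applies the same three patterns to a single made-up tweet before tokenizing. The sentence and the object names `toy`, `cleaned`, and `tokens` are hypothetical and only serve as illustration.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# Hypothetical tweet, made up for illustration
toy <- tibble(sentence = "@sam loved the show!! http://example.com/abc #nyc")

cleaned <- toy %>%
  mutate(
    sentence = str_remove_all(sentence, "http\\S+"),  # drop URLs
    sentence = str_remove_all(sentence, "@\\w+"),     # drop @mentions
    sentence = str_remove_all(sentence, "#")          # drop the symbol, keep the hashtag text
  )

# unnest_tokens() lowercases the text and strips punctuation by default
tokens <- cleaned %>%
  unnest_tokens(word, sentence)

tokens$word
```

The toy tweet should reduce to the tokens loved, the, show, and nyc. Removing only the `#` symbol, rather than the whole hashtag, keeps hashtag words available for lexicon matching.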
#Analyzing with Bing
tweet_bing <- tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
  rename(lexicon_sentiment = sentiment.y) %>%
  count(lexicon_sentiment)

tweet_bing
# A tibble: 2 × 2
  lexicon_sentiment     n
  <chr>             <int>
1 negative           2498
2 positive           3387
#Visual
tweet_bing %>%
  ggplot(aes(x = lexicon_sentiment, y = n, fill = lexicon_sentiment)) +
  geom_col() +
  labs(
    title = "Sentiment in Tweet Dataset (Bing Lexicon)",
    x = "Sentiment",
    y = "Count"
  ) +
  theme_classic()

The sampled tweets contain more positive than negative Bing matches (3,387 versus 2,498, or about 58% positive), indicating a moderately positive overall sentiment in the sample.

#Loughran Lexicon
tweet_loughran <- tidy_tweets %>%
  inner_join(get_sentiments("loughran"), by = "word", relationship = "many-to-many") %>%
  rename(lexicon_sentiment = sentiment.y) %>%
  count(lexicon_sentiment)

tweet_loughran
# A tibble: 6 × 2
  lexicon_sentiment     n
  <chr>             <int>
1 constraining         11
2 litigious            30
3 negative            922
4 positive            912
5 superfluous           1
6 uncertainty         369

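The Loughran lexicon produces a different distribution than Bing: positive and negative matches are nearly even (912 versus 922), whereas Bing found clearly more positive than negative words (3,387 versus 2,498). Sentiment results can therefore change depending on the lexicon used.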

#Visual
tweet_loughran %>%
  ggplot(aes(x = lexicon_sentiment, y = n, fill = lexicon_sentiment)) +
  geom_col() +
  labs(
    title = "Sentiment in Tweet Dataset (Loughran Lexicon)",
    x = "Sentiment",
    y = "Count"
  ) +
  theme_classic()

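Unlike the Bing lexicon, the Loughran lexicon identifies additional categories such as uncertainty, constraining, and litigious, which gives a more detailed view of the tweet data. The lexicon was developed for financial documents, which may partly explain why it matches far fewer tweet words than Bing (2,245 matches versus 5,885).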
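
The reason two lexicons can disagree on the same text is that `inner_join()` keeps only the words that appear in the lexicon, and each lexicon has its own vocabulary and category labels. The sketch below shows this with two made-up mini lexicons; `lexicon_a` and `lexicon_b` are hypothetical and do not reproduce real Bing or Loughran entries.

```r
library(dplyr)
library(tibble)

# Made-up tokens for illustration
tokens <- tibble(word = c("great", "risk", "may", "great", "sad", "coffee"))

# Two hypothetical mini lexicons; NOT the real Bing or Loughran entries
lexicon_a <- tibble(
  word = c("great", "sad"),
  sentiment = c("positive", "negative")
)
lexicon_b <- tibble(
  word = c("risk", "may", "sad"),
  sentiment = c("uncertainty", "uncertainty", "negative")
)

# Count matched words per category; unmatched words are silently dropped
count_matches <- function(tokens, lexicon) {
  tokens %>%
    inner_join(lexicon, by = "word") %>%
    count(sentiment)
}

counts_a <- count_matches(tokens, lexicon_a)
counts_b <- count_matches(tokens, lexicon_b)

counts_a
counts_b
```

The same six tokens yield two positive matches under the first lexicon and none under the second, while "coffee" is ignored by both. Differences of this kind reflect lexicon coverage as much as the text itself.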

Comparison of Results

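With the Bing lexicon, the Jane Austen novels and the tweet sample lean positive by a similar margin: positive words make up about 59% of Bing matches across the six novels and about 58% in the tweets. The lexicon has a larger effect on the tweet results. Loughran finds positive and negative words in nearly equal numbers and adds categories, such as uncertainty, that Bing does not have. Results can therefore differ depending on both the text source and the lexicon used.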