Assignment 10A Sentiment Analysis with Text Mining in R
Author
Zineb Tamnat
This report reproduces and extends the sentiment analysis example from Chapter 2 of Text Mining with R.
First the original analysis is replicated using Jane Austen novels. Then the analysis is extended to a corpus of tweets, and an additional sentiment lexicon is included to compare results across different methods.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
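The per-book counts shown next can be reproduced roughly as follows. This is a minimal sketch that follows the Chapter 2 workflow from the book, not necessarily the exact code used for this report. It assumes the tidytext and janeaustenr packages are installed. The names tidy_books and bing_sentiment are the ones used later in this report.

```r
library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)

# One row per word, keeping the book and the line number each word came from
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Keep only words found in the Bing lexicon, then count by book and sentiment
bing_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
  count(book, sentiment)

bing_sentiment
```

The Bing lexicon ships with tidytext, so this step needs no extra download.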
# A tibble: 12 × 3
book sentiment n
<fct> <chr> <int>
1 Sense & Sensibility negative 3671
2 Sense & Sensibility positive 4933
3 Pride & Prejudice negative 3652
4 Pride & Prejudice positive 5052
5 Mansfield Park negative 4828
6 Mansfield Park positive 6749
7 Emma negative 4809
8 Emma positive 7157
9 Northanger Abbey negative 2518
10 Northanger Abbey positive 3244
11 Persuasion negative 2201
12 Persuasion positive 3473
# Visual
bing_sentiment %>%
  ggplot(aes(x = book, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(
    title = "Positive and Negative Words in Jane Austen Novels",
    x = "Book",
    y = "Count"
  ) +
  theme_classic()
# Comparing lexicons for Pride and Prejudice
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

# AFINN scores words from -5 to 5, so sum the values per 80-line chunk
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn"), by = "word", relationship = "many-to-many") %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  mutate(method = "AFINN")

# Bing and NRC label words positive/negative, so count each and take the difference
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
    mutate(method = "Bing"),
  pride_prejudice %>%
    inner_join(
      get_sentiments("nrc") %>%
        filter(sentiment %in% c("positive", "negative")),
      by = "word",
      relationship = "many-to-many"
    ) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

lexicon_comparison <- bind_rows(afinn, bing_and_nrc)
lexicon_comparison
# A tibble: 489 × 5
index sentiment method negative positive
<dbl> <dbl> <chr> <int> <int>
1 0 29 AFINN NA NA
2 1 0 AFINN NA NA
3 2 20 AFINN NA NA
4 3 30 AFINN NA NA
5 4 62 AFINN NA NA
6 5 66 AFINN NA NA
7 6 60 AFINN NA NA
8 7 18 AFINN NA NA
9 8 84 AFINN NA NA
10 9 26 AFINN NA NA
# ℹ 479 more rows
# Visual
lexicon_comparison %>%
  ggplot(aes(x = index, y = sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  labs(
    title = "Comparing Sentiment Lexicons for Pride and Prejudice",
    x = "Index",
    y = "Sentiment Score"
  ) +
  theme_classic()
The base sentiment analysis in this report follows the example from Text Mining with R: A Tidy Approach, Chapter 2: Sentiment Analysis with Tidy Data. The original code and methodology were adapted from the authors’ published materials.
I initially planned to collect tweets using the rtweet R package, as stated in my approach. However, the package was not available for my version of R, so I used a publicly available Kaggle dataset (Sentiment140) that contains real tweet data.
tweets <- read_csv("train_data.csv")
Rows: 1523975 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sentence
dbl (1): sentiment
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The Loughran lexicon produces a different distribution of sentiment categories than Bing, showing that sentiment results can change depending on the lexicon used.
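A minimal sketch of how a summary table such as tweet_loughran could be built is shown here. The objects toy_tweets and toy_lexicon are hypothetical stand-ins so that the example runs without the 1.5 million row file or a lexicon download. In the actual analysis the lexicon would presumably come from get_sentiments("loughran"), which tidytext fetches through the textdata package. Because the tweet data already has a column named sentiment, the lexicon's column is renamed to lexicon_sentiment before joining.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the tweet data (same column names as train_data.csv)
toy_tweets <- tibble(
  sentence = c("great game but a risky bet", "worst delay ever", "maybe it could improve"),
  sentiment = c(1, 0, 1)
)

# Hypothetical stand-in for get_sentiments("loughran")
toy_lexicon <- tibble(
  word = c("great", "risky", "worst", "delay", "maybe", "could", "improve"),
  sentiment = c("positive", "uncertainty", "negative", "negative",
                "uncertainty", "uncertainty", "positive")
)

tweet_loughran <- toy_tweets %>%
  unnest_tokens(word, sentence) %>%                       # one row per word
  inner_join(
    rename(toy_lexicon, lexicon_sentiment = sentiment),   # avoid clashing with the tweet label
    by = "word",
    relationship = "many-to-many"
  ) %>%
  count(lexicon_sentiment, sort = TRUE)

tweet_loughran
```

Swapping the toy lexicon for the Bing lexicon in the same pipeline would give the two-category counts that the Loughran results are compared against.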
# Visual
tweet_loughran %>%
  ggplot(aes(x = lexicon_sentiment, y = n, fill = lexicon_sentiment)) +
  geom_col() +
  labs(
    title = "Sentiment in Tweet Dataset (Loughran Lexicon)",
    x = "Sentiment",
    y = "Count"
  ) +
  theme_classic()
Unlike the Bing lexicon, which labels words only as positive or negative, the Loughran lexicon includes additional categories such as uncertainty, constraining, and litigious, which gives a more detailed view of sentiment in the tweet data.
Comparison of Results
The Jane Austen texts show a more balanced sentiment, while the tweet dataset shows a slightly more positive trend. The Loughran lexicon also provides more detailed sentiment categories than Bing, so results can differ depending on both the text source and the lexicon used.
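One way to make the comparison concrete is to express each corpus as a share of positive words. The sketch below does this for the Austen novels, using the Bing counts reported in the table earlier in this report. The names bing_counts and share_positive are introduced here for illustration only.

```r
library(dplyr)
library(tibble)

# Bing counts per novel, copied from the table earlier in this report
bing_counts <- tribble(
  ~book,                 ~negative, ~positive,
  "Sense & Sensibility",      3671,      4933,
  "Pride & Prejudice",        3652,      5052,
  "Mansfield Park",           4828,      6749,
  "Emma",                     4809,      7157,
  "Northanger Abbey",         2518,      3244,
  "Persuasion",               2201,      3473
)

# Share of sentiment-bearing words that are positive in each novel
positive_share <- bing_counts %>%
  mutate(share_positive = positive / (positive + negative)) %>%
  arrange(desc(share_positive))

positive_share
```

With these counts, every novel has a positive share above one half. Running the same calculation on the Bing counts for the tweets would put both corpora on a common scale.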