This assignment extends the sentiment analysis presented in “Text Mining with R - Chapter 2: Sentiment Analysis with Tidy Data”.
The assignment consists of two sections:
1. Reproduce the sentiment analysis of Jane Austen’s novels using the Bing, AFINN and NRC lexicons.
2. Perform a similar analysis of political news articles sourced from the New York Times, using the numeric syuzhet lexicon.
The goals of this assignment are to compare categorical and numeric sentiment analysis, to explore how positive or negative the political news articles are, and to practice extracting, tidying and visualizing data.
The base code from Chapter 2 is cited from:
> Silge, J. & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.
> https://www.tidytextmining.com/sentiment.html
This section reproduces code from Chapter 2 of “Text Mining with R: A Tidy Approach” (Silge & Robinson, 2017).
library(tidytext)
library(janeaustenr)
library(httr)
library(jsonlite)
library(tidyverse)
library(syuzhet)
# extracts novels, assigns line numbers
# divides into chapters
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    line_number = row_number(),
    # chapter headings look like "CHAPTER 1" or "Chapter I", so the pattern needs the space
    chapter = cumsum(str_detect(
      text, regex(
        "^chapter [\\divxlc]", ignore_case = TRUE
      )
    ))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text) # splits text into words
head(tidy_books)
## # A tibble: 6 × 4
## book line_number chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
# uses Bing here as an example to label positive or negative
bing <- get_sentiments("bing")
book_sentiment <- tidy_books %>%
  inner_join(bing, by = "word") %>%
  count(book, index = line_number %/% 80, sentiment) %>%
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  ) %>%
  mutate(sentiment = positive - negative)
head(book_sentiment)
## # A tibble: 6 × 5
## book index negative positive sentiment
## <fct> <dbl> <int> <int> <int>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
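The 80-line binning and the net score (positive minus negative) used above can be illustrated on a few rows of toy data. This is a minimal sketch in base R; the line numbers and labels are made up for illustration and do not come from the novels.

```r
# Toy data: hypothetical line numbers with Bing-style labels (not from the novels)
toy <- data.frame(
  line_number = c(3, 40, 79, 80, 120, 159, 160),
  sentiment   = c("positive", "negative", "positive",
                  "negative", "negative", "positive", "positive")
)

# Integer division maps lines 0-79 to index 0, lines 80-159 to index 1, and so on
toy$index <- toy$line_number %/% 80

# Count each label per index, then take positive minus negative
counts <- table(toy$index, toy$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
net
```

Each index is one 80-line section, so a positive net value means that section contains more positive than negative lexicon words.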
Visualize Bing sentiments
ggplot(book_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  labs(x = "Index (80-line sections)",
       y = "Net sentiment (positive - negative)")
This section imports political news from the NYT Top Stories API, calculates numeric sentiment scores using syuzhet, and visualizes the sentiment of the articles.
# conceal api key
api_key <- Sys.getenv("NYT_API_KEY")
# import political articles from top stories API
url <- paste0("https://api.nytimes.com/svc/topstories/v2/politics.json?api-key=", api_key)
res <- GET(url)
# stop early on a failed request so the parsing step never sees an error response
if (status_code(res) != 200) {
  stop("Request failed with status: ", status_code(res))
}
data <- fromJSON(content(res, "text", encoding = "UTF-8"), flatten = TRUE)
df <- data$results %>%
  transmute(
    title = title,
    abstract = abstract,
    section = section,
    published_date = as.Date(published_date),
    url = url,
    text = paste(title, abstract, sep = ".")
  )
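Because `Sys.getenv()` returns an empty string when a variable is not set, a missing key would otherwise only show up as a failed request. The sketch below shows one way to catch that before calling the API. `build_topstories_url()` is a hypothetical helper written for this illustration; it is not part of httr or the NYT API.

```r
# Hypothetical helper: builds the Top Stories URL and fails early on a missing key
build_topstories_url <- function(section, key = Sys.getenv("NYT_API_KEY")) {
  if (!nzchar(key)) {
    stop("NYT_API_KEY is not set; define it in .Renviron and restart R.")
  }
  paste0(
    "https://api.nytimes.com/svc/topstories/v2/", section,
    ".json?api-key=", key
  )
}

# "DEMO_KEY" is a placeholder, not a working key
build_topstories_url("politics", key = "DEMO_KEY")
```

The resulting string can be passed to `GET()` in the same way as the URL built above.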
Let’s move on to the sentiment analysis using syuzhet.
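Syuzhet is a lexicon-based sentiment method that turns raw text into numeric values. With the syuzhet method, get_sentiment() looks up the words of each text in the lexicon and sums the matching word scores. The total is not divided by the number of tokens, so longer texts can reach larger absolute scores. A positive total indicates an overall positive article, and a negative total an overall negative one.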
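The core idea of a numeric lexicon score can be sketched as a lookup followed by a sum. The lexicon below is a hypothetical toy with invented values, not the syuzhet dictionary, and `score_text()` is an illustrative helper rather than the package's implementation.

```r
# Hypothetical toy lexicon; the values are illustrative, not syuzhet's
toy_lexicon <- c(good = 0.75, win = 0.5, crisis = -0.75, attack = -1)

score_text <- function(text, lexicon) {
  # Lowercase, then split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Words missing from the lexicon yield NA and are dropped from the sum
  sum(lexicon[words], na.rm = TRUE)
}

score_text("Good news: a win after the crisis.", toy_lexicon)
score_text("Attack deepens the crisis", toy_lexicon)
```

The first text mixes positive and negative words and ends up mildly positive, while the second contains only negative words. Dividing the result by the number of words would turn the total into a per-word average.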
texts <- df$text
# create a new variable for sentiment score per article
df$sentiment_score <- get_sentiment(texts, method = "syuzhet")
head(df %>% select(title, sentiment_score))
## title
## 1 New Weapons Testing Won’t Include Nuclear Explosions, Energy Secretary Says
## 2 N.Y.C. rabbi moves off sidelines to condemn Mamdani amid Jewish divide.
## 3 Syria’s President to Visit Washington for First Time Since Taking Power
## 4 Latest Strike on Boat in Caribbean Sea Kills 3, Hegseth Says
## 5 Food Stamp Cuts Expose Trump’s Strategy to Use Shutdown to Advance Agenda
## 6 Anger Over ICE Raids Is Driving Some Latino Voters to the Polls
## sentiment_score
## 1 2.00
## 2 -0.25
## 3 0.40
## 4 -2.30
## 5 0.70
## 6 -2.85
summary(df$sentiment_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.950 -1.050 0.400 -0.018 0.850 2.250
# the middle 50% of data is between [-1.05, 0.85]
Visualization
# histogram to check the distribution
ggplot(df, aes(x = sentiment_score)) +
  # binwidth alone sets the bins; it overrides bins when both are given
  geom_histogram(binwidth = 0.4, fill = "steelblue", color = "white") +
  labs(
    title = "Sentiment Distribution of NYT Political Articles",
    x = "Sentiment Score",
    y = "Count"
  )
Among the 25 political articles, most score slightly positive, between 0 and 1, while a few strongly negative articles make the distribution left-skewed. This is unsurprising for political articles, which often contain words about conflict, criticism and political bias that pull the scores down. The median (0.40) is mildly positive, whereas the mean (-0.018) sits close to zero because the negative tail drags it down. Overall, the articles lean mildly positive.
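The left skew can also be checked numerically. The sketch below uses only the six scores printed by head() earlier, so it illustrates the calculation rather than reproducing the result for all 25 articles. `skewness()` is a hypothetical helper defined here, not a function from a loaded package.

```r
# The six scores shown by head() above; the full sample has 25 articles
scores <- c(2.00, -0.25, 0.40, -2.30, 0.70, -2.85)

# Moment-based skewness: negative values indicate a longer left tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(scores)

# A mean below the median is another sign of a left skew
mean(scores) < median(scores)
```

Applied to the full data, the same check would be `skewness(df$sentiment_score)`.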
ggplot(df, aes(sample = sentiment_score)) +
  stat_qq() +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(
    title = "Q-Q Plot of Sentiment Scores",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  )
The right tail lies on the line, so the most positive articles are consistent with a normal distribution. The middle sits slightly above the line, meaning those quantiles are higher than a normal distribution would predict. The left tail falls below the line, which matches the histogram: the most negative articles are more extreme than expected. The Q-Q plot therefore shows that normality holds only approximately, which is acceptable given that the sample contains only 25 articles.
The analysis of 25 political articles from the NYT Top Stories API, scored with the syuzhet lexicon, indicates that the overall tone of the articles is mildly positive. The distribution of scores is slightly left-skewed, with most articles showing mild positivity and a few strongly negative outliers. The Q-Q plot agrees with the histogram: the right tail aligns with normality, while the left tail is heavier because of the negative outliers. Overall, the dataset suggests that NYT political articles are fairly balanced in tone, with occasional strongly negative articles.