The goal of this assignment is to replicate the sentiment analysis example from Chapter 2 of Text Mining with R: A Tidy Approach (Silge & Robinson) and extend it using a different text corpus and additional sentiment lexicons.
Approach
This analysis follows a two-part structure:
Reproduction of the Chapter 2 sentiment analysis example
Extension using a real-world news dataset collected via an external API
Step 1: Reproducing the Chapter 2 Example
This step reproduces the sentiment analysis workflow from Chapter 2 of Text Mining with R: A Tidy Approach (Silge & Robinson, 2017). The chapter demonstrates sentiment analysis using tidy text principles, where text is treated as individual word tokens and sentiment is computed by joining words with sentiment lexicons.
The process assumes that overall sentiment can be estimated by aggregating word-level sentiment contributions. Text is first converted into a tidy format using unnest_tokens(), stop words are removed using anti_join(), and sentiment values are assigned through inner_join() with sentiment lexicons.
The analysis uses three lexicons available through the tidytext package: AFINN (Nielsen, 2011), Bing (Hu & Liu, 2004), and NRC (Mohammad & Turney, 2013).
These lexicons are applied to the example dataset from Jane Austen’s novels, and sentiment is summarized across words and text sections using tidy data operations such as joins, grouping, and counting.
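The tokenize, filter, and join steps described above can be sketched on a toy input. This is a minimal illustration rather than the chapter's code: the two sentences and the object names toy_text and toy_sentiment are invented for the example, and only the Bing lexicon is used because it ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Toy input: two short "documents" (invented for illustration)
toy_text <- tibble(
  doc  = c(1, 2),
  text = c("I am happy and full of love",
           "This is a sad and terrible day")
)

toy_sentiment <- toy_text %>%
  unnest_tokens(word, text) %>%                         # one lowercase word per row
  anti_join(stop_words, by = "word") %>%                # drop common stop words
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only words found in the lexicon
  count(doc, sentiment)                                 # aggregate word-level sentiment per document

toy_sentiment
```

Words that are absent from the lexicon simply drop out at the inner_join() step, so the aggregated counts reflect only the vocabulary the lexicon covers.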
Step 2: Extension Using NewsAPI and Additional Sentiment Lexicons
To extend the analysis, I use full news articles collected through the NewsAPI service as the external text corpus. Full articles are used instead of headlines because they provide richer context and more reliable sentiment signals than short, headline-only text.
Initially, the New York Times API was considered; however, due to rate limits and restricted access for large-scale retrieval, I switched to NewsAPI, which provides more flexible and scalable access to news content.
The dataset is retrieved using the NewsAPI /v2/everything endpoint with keyword-based queries (e.g., politics, technology, business, sports). The full article text is constructed by combining title, description, and content.
To extend sentiment analysis beyond the original example, one additional lexicon is applied alongside Bing, AFINN and NRC:
Loughran–McDonald: developed for financial text, originally 10-K filings, and therefore better suited to business-oriented news than general-purpose lexicons (Loughran & McDonald, 2011)
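Additionally, domain-specific stop words are removed to reduce bias in sentiment scoring. Names of political figures are the main case: “trump”, for example, is scored as a positive word by the Bing lexicon even when it refers to a person.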
Data Analysis Workflow
The analysis begins by reproducing the Chapter 2 sentiment workflow using tidy text principles, including tokenization, lexicon joins, and aggregation of sentiment scores.
For the extension, full news articles are collected using the NewsAPI /v2/everything endpoint. The JSON response is converted into a tidy data frame, and the full article text is created by combining title, description, and content.
The dataset is then tokenized using unnest_tokens(), stop words are removed, and sentiment analysis is performed using Bing, NRC, AFINN, and Loughran lexicons. Results are aggregated by article and category.
Finally, sentiment outputs are compared across lexicons.
Anticipated Challenges
Several challenges are expected:
News articles contain noisy and mixed sentiment language
Named entities can distort sentiment classification
Lexicons may disagree on sentiment labeling
API limitations may restrict data volume
Step 1: Reproduce the Base Example (Jane Austen Corpus)
Load and tidy Jane Austen text
This step tokenizes the novels into individual words and creates structural metadata such as line numbers and chapters.
library(tidytext)     # Tidy text mining tools
library(janeaustenr)  # Provides Jane Austen novels
library(dplyr)        # Data manipulation
library(stringr)      # String processing functions

tidy_books <- austen_books() %>%   # Load all Jane Austen books
  group_by(book) %>%               # Group data by each book
  mutate(
    linenumber = row_number(),     # Create a line number within each book
    chapter = cumsum(              # Create chapter numbers
      str_detect(text,             # Detect lines that contain chapter titles
                 regex("^chapter [\\divxlc]", ignore_case = TRUE))
    )
  ) %>%
  ungroup() %>%                    # Remove grouping
  unnest_tokens(word, text)        # Convert text into one word per row (tidy format)
NRC Joy Word Frequency (Example Lexicon Filtering)
This example extracts only “joy” words from the NRC lexicon and counts their frequency in Emma.
nrc_joy <- get_sentiments("nrc") %>%   # Get NRC sentiment lexicon
  filter(sentiment == "joy")           # Keep only words labeled as "joy"

tidy_books %>%
  filter(book == "Emma") %>%           # Keep only the book "Emma"
  inner_join(nrc_joy) %>%              # Keep only words that appear in the joy lexicon
  count(word, sort = TRUE)             # Count frequency of each word (sorted descending)
Joining with `by = join_by(word)`
# A tibble: 301 × 2
word n
<chr> <int>
1 good 359
2 friend 166
3 hope 143
4 happy 125
5 love 117
6 deal 92
7 found 92
8 present 89
9 kind 82
10 happiness 76
# ℹ 291 more rows
Sentiment Over Book Sections (Bing Lexicon)
This block calculates sentiment across sections of novels using the Bing lexicon and visualizes sentiment trends.
library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  # Match each word with its sentiment (positive/negative)
  inner_join(get_sentiments("bing")) %>%
  # Count words by book, chunk (every 80 lines), and sentiment
  count(book, index = linenumber %/% 80, sentiment) %>%
  # Convert "positive" and "negative" into separate columns
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Calculate net sentiment score (positive minus negative)
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +                # Bar chart of sentiment
  facet_wrap(~book, ncol = 2, scales = "free_x") # Create separate panels for each book
Comparing Multiple Lexicons (AFINN, Bing, NRC)
This section compares different sentiment lexicons applied to Pride & Prejudice.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")        # Extract only this novel

afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%    # Match words with AFINN scores
  group_by(index = linenumber %/% 80) %>%    # Group into chunks of 80 lines
  summarise(sentiment = sum(value)) %>%      # Sum sentiment scores within each chunk
  mutate(method = "AFINN")                   # Label method used
Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(
      get_sentiments("nrc") %>%
        filter(sentiment %in% c("positive", "negative"))
    ) %>%
    mutate(method = "NRC")
) %>%
  # Count sentiment occurrences per chunk and method
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)    # Compute net sentiment
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Step 2: Extension Using NewsAPI (Full News Articles)
This section extends the analysis to real-world news articles collected using NewsAPI. This implementation uses full articles rather than headlines only.
Four sentiment lexicons are applied: Bing, NRC, AFINN, and Loughran–McDonald (a finance-oriented lexicon).
Collect News Articles from NewsAPI
This step retrieves full news articles for multiple categories using the NewsAPI /v2/everything endpoint.
# A tibble: 6 × 4
category title description content
<chr> <chr> <chr> <chr>
1 politics RFK Jr. Will Take on Joe Rogan for Podcaster Sup… "\"This is… "Rober…
2 politics OpenAI made economic proposals — here’s what DC … "Happy cea… "<ul><…
3 politics Messy and unpredictable: What I learned from ele… "BBC Radio… "It ha…
4 politics Get ready for a wave of TBPN clones after its bl… "TBPN has … "TBPN …
5 politics Kalshi says it will crack down on politicians an… "Kalshi sa… "Kalsh…
6 politics Trump fires attorney general Pam Bondi. "Taking a … "<ul><…
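The retrieval code is not shown in the rendered output above. The following is a minimal sketch of how one keyword could be requested and converted to a tidy data frame, assuming the httr and jsonlite packages and an API key stored in an environment variable named NEWSAPI_KEY. The helper names fetch_news_json() and parse_news_json(), the chosen query parameters, and the sample JSON are all hypothetical and are not taken from the original analysis.

```r
library(httr)
library(jsonlite)
library(dplyr)

# Hypothetical helper: request one keyword from /v2/everything and return the raw JSON text.
# Assumes the API key is available in the NEWSAPI_KEY environment variable.
fetch_news_json <- function(query, page_size = 20) {
  resp <- GET(
    "https://newsapi.org/v2/everything",
    query = list(q = query, language = "en", pageSize = page_size,
                 apiKey = Sys.getenv("NEWSAPI_KEY"))
  )
  stop_for_status(resp)                          # fail loudly on HTTP errors (e.g. rate limits)
  content(resp, as = "text", encoding = "UTF-8")
}

# Hypothetical helper: turn the JSON response into a tibble with one row per article
parse_news_json <- function(json_text, category_label) {
  articles <- fromJSON(json_text, flatten = TRUE)$articles
  as_tibble(articles) %>%
    mutate(category = category_label) %>%
    select(category, title, description, content)
}

# Offline check with an invented response of the same general shape
sample_json <- '{"status":"ok","totalResults":1,"articles":[{"source":{"id":null,"name":"Example"},"title":"Example title","description":"Example description","content":"Example content"}]}'
parsed <- parse_news_json(sample_json, "politics")
parsed

# With a valid key, several keywords could be combined like this (not run here):
# keywords <- c("politics", "technology", "business", "sports")
# news_df <- bind_rows(lapply(keywords, function(k) parse_news_json(fetch_news_json(k), k)))
```

Separating the network call from the parsing step keeps the parsing logic testable without an API key and makes it easier to cache raw responses when rate limits apply.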
Create Full Article Text Field
This step combines title, description, and content into a single analysis-ready text field.
# A tibble: 6 × 6
category title description content article_id word
<chr> <chr> <chr> <chr> <int> <chr>
1 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 rfk
2 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 jr
3 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 joe
4 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 rogan
5 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 podc…
6 politics RFK Jr. Will Take on Joe Rogan … "\"This is… Robert… 1 supr…
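The code for this step is also hidden in the rendered output. A sketch of one way to build the combined field and tokenize it is shown below, using an invented two-row data frame. The column name full_text, the object names, and the HTML-stripping regular expression are assumptions added for illustration (the content column shown earlier contains markup such as `<ul>`).

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy stand-in for the article data frame (invented rows)
news_df_toy <- tibble(
  category    = c("politics", "sports"),
  title       = c("Budget vote delayed", "Home team wins final"),
  description = c("Lawmakers postpone the vote.", NA),
  content     = c("<ul><li>The vote was delayed again.</li></ul>",
                  "A great victory for the fans.")
)

tidy_news_toy <- news_df_toy %>%
  mutate(
    article_id = row_number(),
    # Combine the three fields; coalesce() keeps missing values from becoming the word "NA"
    full_text = paste(title, coalesce(description, ""), coalesce(content, "")),
    # Remove HTML tags and collapse repeated whitespace
    full_text = str_squish(str_replace_all(full_text, "<[^>]+>", " "))
  ) %>%
  unnest_tokens(word, full_text)   # one word per row; the input column is dropped by default

tidy_news_toy
```

Because unnest_tokens() drops its input column by default, the result keeps category, title, description, content, and article_id alongside the new word column, which matches the column layout of the output shown above.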
Bing Sentiment by Article
This step calculates sentiment per article using Bing lexicon and aggregates sentiment scores.
bing_sentiment <- tidy_news %>%
  inner_join(get_sentiments("bing")) %>%
  inner_join(news_df %>% select(category, article_id)) %>%
  # count sentiment words per article
  group_by(category, article_id, sentiment) %>%
  summarise(n = n(), .groups = "drop") %>%
  # create chapter-like chunks within EACH category
  # (dividing by 1 leaves each article as its own chunk)
  mutate(index = article_id %/% 1) %>%
  # aggregate within category + chunk
  group_by(category, index, sentiment) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # reshape sentiment columns
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  # net sentiment
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(category, article_id)`
Sentiment Visualization (News Articles)
This plot shows sentiment variation across different news categories.
library(ggplot2)

ggplot(bing_sentiment, aes(x = index, y = sentiment, fill = category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~category, scales = "free_x", ncol = 2)
Bag-of-Words Sentiment Exploration
This section identifies the most frequent sentiment-contributing words in political news articles using Bing lexicon.
bing_word_counts_news %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    title = "Top Contributing Words to Sentiment in Political News",
    x = "Contribution to sentiment",
    y = NULL
  )
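The construction of bing_word_counts_news is not shown in the rendered output. One plausible way to build such a table, following the word-count pattern from Chapter 2, is sketched below on invented tokens. The toy data and the object names are assumptions, and the filter on the politics category is inferred from the plot title.

```r
library(dplyr)
library(tidytext)

# Toy tokenized articles (invented for illustration)
tidy_news_toy <- tibble(
  category   = c("politics", "politics", "politics", "sports"),
  article_id = c(1L, 1L, 2L, 3L),
  word       = c("crisis", "support", "crisis", "win")
)

bing_word_counts_toy <- tidy_news_toy %>%
  filter(category == "politics") %>%                    # political articles only
  inner_join(get_sentiments("bing"), by = "word") %>%   # attach positive/negative labels
  count(word, sentiment, sort = TRUE)                   # frequency of each sentiment word

bing_word_counts_toy
```

A table of this shape (word, sentiment, n) is what the plotting code above expects as input.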
Stop Words and Cleaning Effect
This section improves sentiment accuracy by removing misleading or non-informative words using a custom stop word list.
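A custom stop word list can be built by extending tidytext's stop_words table, as Chapter 2 does. The sketch below assumes a single custom entry, "trump", taken from the example given earlier in this report. The toy tokens and object names are invented, and the actual list used in the analysis may contain more entries.

```r
library(dplyr)
library(tidytext)

# Extend the built-in stop word table with a domain-specific entry
custom_stop_words <- bind_rows(
  tibble(word = c("trump"), lexicon = "custom"),
  stop_words
)

# Toy tokens (invented for illustration)
toy_tokens <- tibble(word = c("trump", "wins", "the", "vote"))

cleaned_tokens <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")   # drop both standard and custom stop words

cleaned_tokens
```

Applying anti_join() before the lexicon join means a removed word can no longer contribute to any sentiment score, whichever lexicon is used afterwards.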
This section compares multiple sentiment lexicons applied to political news articles only, enabling a direct evaluation of sentiment methodology differences.
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 6 of `x` matches multiple rows in `y`.
ℹ Row 994 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
This analysis reproduced the sentiment analysis workflow from Chapter 2 of Text Mining with R: A Tidy Approach and extended it using a corpus of full news articles. Compared to the original example using Jane Austen’s novels, the results differ noticeably. The literary text produced smoother and more consistent sentiment patterns due to its structured narrative and stable language. In contrast, the news article corpus resulted in more volatile and less consistent sentiment scores, reflecting the mixed tone, factual reporting style, and domain-specific vocabulary of real-world news.
Differences across lexicons were also more pronounced in the news data. The AFINN lexicon showed greater variation due to its numeric scoring system, while Bing and NRC produced sharper positive/negative swings. The Loughran–McDonald lexicon further diverged by emphasizing domain-specific negative terms common in formal or economic contexts. Additionally, applying custom stop word removal changed the sentiment distribution, highlighting the importance of preprocessing choices.
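The difference between numeric and categorical scoring can be illustrated with two miniature lexicons. Both toy lexicons, their scores, and the tokens below are hypothetical and are not taken from AFINN or Bing. The sketch only shows why a summed numeric score can move more sharply than a positive-minus-negative count on the same words.

```r
library(dplyr)

# Toy tokens grouped into two chunks (invented)
tokens <- tibble(
  chunk = c(1, 1, 1, 2, 2),
  word  = c("disaster", "good", "fine", "good", "fine")
)

# Hypothetical miniature lexicons: one numeric, one categorical
toy_numeric <- tibble(word = c("disaster", "good", "fine"),
                      value = c(-4, 2, 1))
toy_binary  <- tibble(word = c("disaster", "good", "fine"),
                      sentiment = c("negative", "positive", "positive"))

# Numeric scoring: sum the word scores within each chunk
numeric_score <- tokens %>%
  inner_join(toy_numeric, by = "word") %>%
  group_by(chunk) %>%
  summarise(sentiment = sum(value))

# Categorical scoring: positive count minus negative count within each chunk
binary_score <- tokens %>%
  inner_join(toy_binary, by = "word") %>%
  group_by(chunk) %>%
  summarise(sentiment = sum(sentiment == "positive") - sum(sentiment == "negative"))

numeric_score   # chunk 1: -1, chunk 2: 3
binary_score    # chunk 1:  1, chunk 2: 2
```

In chunk 1 a single strongly negative word outweighs two mildly positive ones under numeric scoring, while the count-based score stays positive, so the two methods disagree on the sign.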
Overall, the extended analysis demonstrates that sentiment results depend heavily on both the type of text corpus and the choice of lexicon, with real-world data requiring more careful interpretation than structured literary text.
References
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 168-177).
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3), 436-465.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.
OpenAI. (2026, April 19). ChatGPT conversation with K. M. Qaiduzzaman on Sentiment Analysis Extension in R. Retrieved April 19, 2026, from https://chat.openai.com/