Scraping and Sentiment Analysis of CT News Junkie: Exploring Emotional Content through NRC Lexicon

Author

Saurabh C Srivastava

Published

April 2, 2025

Objective of the Analysis

The aim of this analysis is to extract and examine the emotional content of news articles from the CT News Junkie website. By scraping the site's paragraph content, cleaning the text, and applying emotion detection with the NRC sentiment lexicon, we seek to identify the dominant emotions conveyed in the news coverage. This helps uncover the emotional tone prevalent in local journalism and how it may influence or reflect public discourse.

Practical Implementation

Emotion Identification in Articles
By mapping words to predefined emotional categories (like anger, joy, trust, etc.), this method enables quick identification of the emotional tone embedded in news content. It’s useful for understanding how a piece might emotionally impact readers or shape public sentiment.
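
To make the idea concrete, here is a minimal sketch of word-to-emotion mapping. The three-entry lexicon and the headline are invented stand-ins for the full NRC dictionary and real article text:

library(dplyr)
library(tidytext)
library(tibble)

# Toy lexicon standing in for the NRC dictionary (hypothetical entries).
toy_lexicon <- tribble(
  ~word,       ~sentiment,
  "crisis",    "fear",
  "celebrate", "joy",
  "promise",   "trust"
)

headline <- tibble(text = "Officials celebrate budget promise amid housing crisis")

headline %>%
  unnest_tokens(word, text) %>%              # one row per word, lowercased
  inner_join(toy_lexicon, by = "word") %>%   # keep only emotion-bearing words
  count(sentiment, sort = TRUE)              # tally each detected emotion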

Brand Sentiment Monitoring
Businesses can analyze customer reviews to determine the emotional undertone of feedback. This helps identify how customers feel — whether they’re happy, frustrated, trusting, or angry — without manually reading every review.
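
As a sketch of that workflow, the snippet below scores a few invented reviews against the NRC lexicon; the review text is hypothetical, and the same tokenize-join-count pipeline used for the article analysis applies unchanged:

library(tidyverse)
library(tidytext)
library(textdata)

# Hypothetical customer reviews (invented for illustration).
reviews <- tibble(
  review_id = 1:3,
  text = c("Absolutely love this product, fast delivery",
           "Broke after two days, very disappointed and angry",
           "Support was helpful and I trust the brand")
)

nrc <- get_sentiments("nrc")   # may prompt to download the lexicon on first use

reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(review_id, sentiment) %>%
  arrange(review_id, desc(n))  # per-review emotional profile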

Brief Overview of Code

1. Web Scraping CT News Junkie Articles

We start by using the rvest package to scrape text content from paragraph (<p>) tags on the CT News Junkie homepage. The resulting text is stored in a tibble with a label "junkie" to identify the source.

# Load required libraries

library(rvest)       # Web scraping: Used to extract article text from the CT News Junkie website.
library(tidyverse)   # Data manipulation and visualization: Includes dplyr, ggplot2, and more for data cleaning and plotting.
library(tidytext)    # Text analysis: breaks (tokenizes) text into individual words.
library(textdata)    # Sentiment lexicons: provides emotion dictionaries such as NRC.
library(stringr)     # String manipulation for cleaning text (attached by tidyverse; loaded explicitly for clarity).
library(tibble)      # Tidy data frames for textual data (attached by tidyverse).
library(ggplot2)     # Bar charts of word frequency and emotional content (attached by tidyverse).


page <- rvest::read_html("https://ctnewsjunkie.com/")   # download the homepage HTML
jr <- page %>% html_elements("p") %>% html_text()       # extract text from every <p> tag
jr.df <- tibble(text = jr, name = "junkie")             # label each paragraph with its source

2. Tokenization and Stopword Removal

In the next step, text is split into individual words (tokens) using unnest_tokens(). Then, common stopwords (e.g., “the”, “and”) are removed to focus only on meaningful words.

jr.df <- jr.df %>% unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")

3. Cleaning Unwanted Words and Visualizing Top Words

Before plotting, we drop purely numeric tokens and a handful of site-specific words (the state's name and abbreviation, plus names that recur in the articles) that would otherwise dominate the counts, then chart the twelve most frequent remaining words. The word filters match whole tokens, so terms like "district" are not accidentally removed.

jr.df %>% 
  filter(!str_detect(word, "^[0-9.,]+$")) %>%                     # drop purely numeric tokens (e.g., 70,000, 2.9)
  filter(!word %in% c("ct", "connecticut", "john", "rosen")) %>%  # drop site- and author-specific words (exact match)
  count(word, sort = TRUE) %>%
  slice_max(n, n = 12) %>%
  ggplot(aes(reorder(word, n), n, fill = as.factor(n))) + 
  geom_col() + 
  coord_flip() +
  labs(x = NULL, y = "count") +
  theme(legend.position = "none")

4. Load NRC Emotion Lexicon

The NRC lexicon maps words to predefined emotion categories (anger, joy, fear, and so on) as well as to general positive and negative sentiment. Here we drop the general "positive" and "negative" categories, keeping only the eight specific emotions.

nrc <- get_sentiments("nrc")
nrc_emotion <- nrc %>% filter(!sentiment %in% c("negative", "positive"))
table(nrc_emotion$sentiment)

       anger anticipation      disgust         fear          joy      sadness 
        1245          837         1056         1474          687         1187 
    surprise        trust 
         532         1230 

Finally, we join the tokens with the emotion lexicon and chart the most frequent emotion-bearing words, faceted by emotion. A single word can be associated with several emotions, so the join is declared many-to-many:

g1 <- jr.df %>% 
  inner_join(nrc_emotion, by = "word", relationship = "many-to-many") %>% 
  count(word, sentiment, sort = TRUE) %>%
  slice_max(n, n = 20) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) + 
  geom_col(aes(fill = sentiment)) +
  facet_wrap(~sentiment, scales = "free_y") + 
  coord_flip() +
  labs(title = "Emotional Content", 
       subtitle = "CT News Junkie",
       caption = "Saurabh's Work") +
  theme(legend.position = "none")
g1
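
If you want to keep the chart for a report, ggsave() writes a named plot to disk; the file name and dimensions below are just one reasonable, illustrative choice:

# Save the faceted emotion chart (file name and dimensions are illustrative).
ggsave("ct_news_junkie_emotions.png", plot = g1, width = 10, height = 7, dpi = 300)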

Conclusion

The analysis successfully scraped textual content from CT News Junkie and applied the NRC lexicon to identify the dominant emotions in its articles. The results reveal a strong presence of fear, sadness, and anger, suggesting that much of the coverage focuses on distressing or serious issues. However, the presence of anticipation and trust points to some positive or forward-looking narratives. Overall, the emotional tone of the coverage leans toward critical, serious reporting, with occasional elements of optimism.