The focus will be on the home pages of five Connecticut newspapers’ websites. In no particular order, these are the Hartford Courant, CT Post, The Day, Shelton Herald, and The Chronicle.

Installing Packages

To get started, install the following packages:
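
The install commands themselves are not shown here, so the following is only a sketch of one way to do it. It assumes the package list is the same as the packages loaded in the next section, and `install_missing` is a hypothetical helper name, not something from the original analysis. Note that `pacman::p_load()` also tries to install a package that is missing, so a separate install step may only be needed for `pacman` itself and for the packages loaded with `library()`.

```r
# Packages named in the loading step; pacman is included because
# pacman::p_load() is used to load most of them.
pkgs <- c("pacman", "tidyverse", "tidytext", "textclean", "tokenizers",
          "markovchain", "stm", "rvest", "tm", "gutenbergr",
          "textdata", "cowplot", "hrbrthemes")

# Hypothetical helper: install only the packages that are not present yet.
install_missing <- function(pkgs) {
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}

# install_missing(pkgs)   # uncomment to run; needs an internet connection
```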

Loading Packages

pacman::p_load(tidyverse, tidytext, textclean, tokenizers, markovchain)
pacman::p_load(stm, rvest, tm)
pacman::p_load(gutenbergr)
library(textdata)
## Warning: package 'textdata' was built under R version 4.0.4
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.0.3
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.0.4

1.) Hartford Courant

To start, I will pull the data from the homepage, and then view the text to see if any lines should be removed.

webpage_hc = read_html("https://www.courant.com/")

hc0  = webpage_hc %>% html_nodes("p") %>% html_text()

hc1 = data.frame(text = hc0)

head(hc1,5)
tail(hc1,13)
nrow(hc1)
## [1] 30
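
To see what the `html_nodes("p") %>% html_text()` step returns without depending on a live homepage, here is a small sketch on a made-up HTML string. The page content and the `toy_` names are assumptions for illustration only.

```r
library(rvest)

# Made-up stand-in for a newspaper homepage (not real scraped content)
toy_html <- "<html><body>
  <h1>Toy Gazette</h1>
  <p>Storm closes schools across the state.</p>
  <p>Local team wins the championship.</p>
  <p>Copyright notice and site links.</p>
</body></html>"

toy_page <- read_html(toy_html)

# Keep only the text inside <p> tags; the <h1> heading is not included
toy_text <- toy_page %>% html_nodes("p") %>% html_text()
toy_df <- data.frame(text = toy_text)

# Inspect the rows to decide which ones are not article text
toy_df
nrow(toy_df)
```

In this toy page the last row is footer text, which is the kind of line that gets removed in the next step.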

Next, I remove the unnecessary lines at the end of the data frame (rows 26 through 30) and view the text again to make sure it contains only what I need.

hc2 <- hc1 %>% 
  slice(-(26:30))

head(hc2,2)
tail(hc2,2)

After converting the text column to character, I split it into individual words, removing punctuation and numbers and making all of the letters lowercase.

hc2$text = as.character(hc2$text)

hc3 = hc2 %>% unnest_tokens(word, text, 
                              to_lower = T,
                              strip_punct = T,
                              strip_numeric = T
)
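
As a rough illustration of what this tokenizing step does, the sketch below runs the same `unnest_tokens()` call on two made-up sentences. The data frame `toy` and its contents are assumptions, not text from any of the newspapers.

```r
library(tidytext)

# Made-up sentences containing capitals, punctuation, and numbers
toy <- data.frame(
  text = c("Governor signs 3 new bills, effective 2021!",
           "Schools reopen Monday."),
  stringsAsFactors = FALSE
)

# One word per row; lowercase, with punctuation and numbers stripped
toy_words <- unnest_tokens(toy, word, text,
                           to_lower = TRUE,
                           strip_punct = TRUE,
                           strip_numeric = TRUE)
toy_words
```

Each row of `toy_words` holds a single lowercase word, and tokens such as "3" and "2021" no longer appear.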

Next, I will remove the stop words, which are very common words (such as "the" and "of") that add little meaning to the text. Counting the words before and after shows the effect.

hc3 %>% count(word, sort = T)
hc3 = hc3 %>% anti_join(stop_words, by = "word") 
head(hc3)
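# seq_len() numbers the words in order of appearance; row_number(hc3) ranks the values of hc3 instead
hc4 = cbind.data.frame(linenumber = seq_len(nrow(hc3)), hc3)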
head(hc4)
hc4 %>% count(word, sort = T)
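
The effect of the `anti_join()` against `stop_words` can be seen on a handful of words. This is a minimal sketch with a made-up word list (`toy_words`); only `stop_words` comes from tidytext.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, already lowercase, one word per row
toy_words <- tibble(
  word = c("the", "governor", "of", "connecticut", "and", "budget")
)

# Drop every word that appears in tidytext's stop_words table
toy_kept <- toy_words %>% anti_join(stop_words, by = "word")
toy_kept
```

Only "governor", "connecticut", and "budget" remain, in their original order.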

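I will use the NRC Word-Emotion Association Lexicon to look at the emotions in the text. Each word that appears in the lexicon is matched to its sentiments, and the matches are counted in chunks of 50 words.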

nrc_sent = get_sentiments("nrc")
table(nrc_sent$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##         1247          839         1058         1476          689         3324 
##     positive      sadness     surprise        trust 
##         2312         1191          534         1231
head(table(nrc_sent))
##              sentiment
## word          anger anticipation disgust fear joy negative positive sadness
##   abacus          0            0       0    0   0        0        0       0
##   abandon         0            0       0    1   0        1        0       1
##   abandoned       1            0       0    1   0        1        0       1
##   abandonment     1            0       0    1   0        1        0       1
##   abba            0            0       0    0   0        0        1       0
##   abbot           0            0       0    0   0        0        0       0
##              sentiment
## word          surprise trust
##   abacus             0     1
##   abandon            0     0
##   abandoned          0     0
##   abandonment        1     0
##   abba               0     0
##   abbot              0     1
hc4_emotion = hc4 %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(hc4_emotion)
hc_emotion = hc4_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("Hartford Courant")
hc_emotion
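
The join-and-bin step is easier to follow on a small scale. The sketch below assumes a made-up mini lexicon (`toy_lexicon`) in place of the NRC lexicon and a bin size of 3 instead of 50, so every name and value in it is illustrative only.

```r
library(dplyr)

# Made-up mini lexicon standing in for the NRC lexicon (not real NRC entries)
toy_lexicon <- tibble(
  word      = c("storm", "storm", "win", "win", "crash", "crash", "rally"),
  sentiment = c("fear", "negative", "joy", "positive",
                "fear", "negative", "positive")
)

# Made-up words numbered in order of appearance
toy_words <- tibble(
  linenumber = 1:6,
  word = c("storm", "city", "win", "crash", "budget", "rally")
)

# Keep only words found in the lexicon, then count sentiments per bin.
# Integer division assigns positions 1-2 to bin 0, 3-5 to bin 1, and 6 to bin 2.
toy_emotion <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(index = linenumber %/% 3, sentiment)
toy_emotion
```

With a bin size of 50, as in the analysis, words 1 through 49 fall in bin 0, words 50 through 99 in bin 1, and so on. A word tagged with several sentiments is counted once for each of them.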

For the four remaining newspapers, I will use the same code as above, changing the URL in the read_html() call and making slight modifications as needed (for example, which rows are removed).
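
Since the same steps repeat for each newspaper, they could also be wrapped in a function. The sketch below is one possible way to do that and is not the code used for the results in this post. `emotion_counts`, its arguments, and the toy lexicon are hypothetical names introduced for illustration; row removal is left out because it differs from site to site.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: takes a character vector of paragraphs and a lexicon
# with `word` and `sentiment` columns, and returns sentiment counts per bin.
emotion_counts <- function(paragraphs, lexicon, bin_size = 50) {
  data.frame(text = as.character(paragraphs), stringsAsFactors = FALSE) %>%
    unnest_tokens(word, text,
                  to_lower = TRUE,
                  strip_punct = TRUE,
                  strip_numeric = TRUE) %>%
    anti_join(stop_words, by = "word") %>%
    mutate(linenumber = row_number()) %>%   # position of each remaining word
    inner_join(lexicon, by = "word") %>%
    count(index = linenumber %/% bin_size, sentiment)
}

# Made-up lexicon and text, used only to show the call
toy_lexicon <- data.frame(word = c("storm", "win"),
                          sentiment = c("fear", "joy"),
                          stringsAsFactors = FALSE)
toy_result <- emotion_counts(c("The storm hit the coast.",
                               "Huskies win again!"),
                             toy_lexicon)
toy_result
```

With the real data, the call would take the cleaned paragraph text and `nrc_sent` in place of the toy inputs.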

2.) CT Post

webpage_ct = read_html("https://www.ctpost.com/")

ct0  = webpage_ct %>% html_nodes("p") %>% html_text()

ct1 = data.frame(text = ct0)

head(ct1,6)
tail(ct1,6)
nrow(ct1)
## [1] 11
ct1$text = as.character(ct1$text) 

ct2 = ct1 %>% unnest_tokens(word, text, 
                            to_lower = T,
                            strip_punct = T,
                            strip_numeric = T
)


ct2 %>% count(word, sort = T)
ct2 = ct2 %>% anti_join(stop_words, by = "word") 
head(ct2)
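# seq_len() numbers the words in order of appearance; row_number(ct2) ranks the values of ct2 instead
ct3 = cbind.data.frame(linenumber = seq_len(nrow(ct2)), ct2)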
head(ct3)
ct3 %>% count(word, sort = T)
bing_sent = get_sentiments("bing")
table(bing_sent$sentiment)
## 
## negative positive 
##     4781     2005
ct3_emotion = ct3 %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(ct3_emotion)
ct_emotion = ct3_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("CT Post")
ct_emotion

3.) The Day

webpage_td = read_html("https://www.theday.com/")

td0  = webpage_td %>% html_nodes("p") %>% html_text()

td1 = data.frame(text = td0)

head(td1,10)
tail(td1,10)
nrow(td1)
## [1] 55
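# td1 has 55 rows, so row 56 does not exist and slice(-(56)) drops nothing;
# only the first row is actually removed here
td2 <- td1 %>% 
  slice(-(56)) %>%
  slice(-(1))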

head(td2)
tail(td2)
td2$text = as.character(td2$text)

td3 = td2 %>% unnest_tokens(word, text, 
                            to_lower = T,
                            strip_punct = T,
                            strip_numeric = T
)
td3 %>% count(word, sort = T)
td3 = td3 %>% anti_join(stop_words, by = "word") 
head(td3)
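# seq_len() numbers the words in order of appearance; row_number(td3) ranks the values of td3 instead
td4 = cbind.data.frame(linenumber = seq_len(nrow(td3)), td3)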
head(td4)
td4 %>% count(word, sort = T)
td4_emotion = td4 %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(td4_emotion)
td_emotion = td4_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("The Day")
td_emotion

4.) Shelton Herald

webpage_sh = read_html("https://www.sheltonherald.com/")

sh0  = webpage_sh %>% html_nodes("p") %>% html_text()

sh1 = data.frame(text = sh0)

head(sh1,2)
tail(sh1,2)
nrow(sh1)
## [1] 4
sh1$text = as.character(sh1$text)

sh2 = sh1 %>% unnest_tokens(word, text, 
                            to_lower = T,
                            strip_punct = T,
                            strip_numeric = T
)
sh2 %>% count(word, sort = T)
sh2 = sh2 %>% anti_join(stop_words, by = "word") 
head(sh2)
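# seq_len() numbers the words in order of appearance; row_number(sh2) ranks the values of sh2 instead
sh3 = cbind.data.frame(linenumber = seq_len(nrow(sh2)), sh2)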
head(sh3)
sh3 %>% count(word, sort = T)
sh3_emotion = sh3 %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(sh3_emotion)
sh_emotion = sh3_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("Shelton Herald")
sh_emotion

5.) The Chronicle

webpage_tc = read_html("https://www.thechronicle.com/")

tc0  = webpage_tc %>% html_nodes("p") %>% html_text()

tc1 = data.frame(text = tc0)

head(tc1,10)
tail(tc1,10)
nrow(tc1)
## [1] 101
tc2 <- tc1 %>% 
  slice(-(87:101)) %>%
  slice(-(1:15))

head(tc2)
tail(tc2)
tc2$text = as.character(tc2$text)

tc3 = tc2 %>% unnest_tokens(word, text, 
                            to_lower = T,
                            strip_punct = T,
                            strip_numeric = T
)
tc3 %>% count(word, sort = T)
tc3 = tc3 %>% anti_join(stop_words, by = "word") 
head(tc3)
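# seq_len() numbers the words in order of appearance; row_number(tc3) ranks the values of tc3 instead
tc4 = cbind.data.frame(linenumber = seq_len(nrow(tc3)), tc3)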
head(tc4)
tc4 %>% count(word, sort = T)
tc4_emotion = tc4 %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(tc4_emotion)
tc_emotion = tc4_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("The Chronicle")
tc_emotion

An Overall View

plot_grid(tc_emotion, td_emotion, nrow = 2)

plot_grid(hc_emotion, ct_emotion, nrow = 2)

sh_emotion

The most obvious difference is the number of words that could be pulled from each homepage. Shelton Herald and CT Post yielded very little usable text (4 and 11 paragraphs, respectively), while The Chronicle, The Day, and Hartford Courant were quite text heavy.
At first glance, the five newspapers appear to use all ten of the available sentiments. On closer inspection, however, neither CT Post nor Shelton Herald has any words tagged as fear, and CT Post has none tagged as anger.
Although much of the text was categorized as fear or negative, some of it was tagged as positive and even trust. So, even though many of the headlines pulled were unfavorable, it was not all bad news.
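
A quick way to check which sentiments are absent for a given paper, rather than reading them off the chart, is to compare the sentiments present in its counts against the ten NRC categories. This is a sketch on a made-up result (`toy_emotion`); with the real data, a per-paper data frame such as `ct3_emotion` would take its place.

```r
# The ten categories, as listed in the table(nrc_sent$sentiment) output
nrc_categories <- c("anger", "anticipation", "disgust", "fear", "joy",
                    "negative", "positive", "sadness", "surprise", "trust")

# Made-up stand-in for one paper's sentiment counts
toy_emotion <- data.frame(
  index     = c(0, 0, 0),
  sentiment = c("joy", "positive", "trust"),
  n         = c(2, 5, 3)
)

# Categories that never occur in this paper's counts
missing_sentiments <- setdiff(nrc_categories, unique(toy_emotion$sentiment))
missing_sentiments
```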

Sources

This project made use of the NRC Word-Emotion Association Lexicon, created by Saif M. Mohammad and Peter D. Turney at the National Research Council Canada.

Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.

Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon, Saif Mohammad and Peter Turney, In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California.

Acknowledgment

I would like to thank my Unsupervised Machine Learning professor, Armando E. Rodriguez, Ph.D., for teaching me how to complete an emotional text analysis using R.