The focus will be on the home pages of five Connecticut newspapers' websites. In no particular order, these are the Hartford Courant, CT Post, The Day, the Shelton Herald, and The Chronicle.
To get started, load the following packages (pacman::p_load() installs any that are missing):
pacman::p_load(tidyverse, tidytext, textclean, tokenizers, markovchain)
pacman::p_load(stm, rvest, tm)
pacman::p_load(gutenbergr)
library(textdata)
library(cowplot)
library(hrbrthemes)
To start, I will pull the text from the Hartford Courant homepage and then view it to see whether any lines should be removed.
webpage_hc = read_html("https://www.courant.com/")
hc0 = webpage_hc %>% html_nodes("p") %>% html_text()
hc1 = data.frame(text = hc0)
head(hc1,5)
tail(hc1,13)
nrow(hc1)
## [1] 30
Next, I will remove those unnecessary lines and view the text again to confirm that it contains only what I need.
hc2 <- hc1 %>%
  slice(-(26:30))   # drop the last five unnecessary rows
head(hc2,2)
tail(hc2,2)
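Hard-coding the row positions works here, but it will break whenever the homepage layout changes. As an alternative, unwanted rows could be dropped by pattern instead; this is only a sketch, and the phrases in the regular expression are hypothetical examples of boilerplate, not the actual footer text:
# Hypothetical alternative: drop rows whose text matches known boilerplate
# phrases, rather than hard-coding row positions.
boilerplate = "Privacy Policy|Terms of Service|Subscribe|Copyright"
hc2_alt <- hc1 %>%
  filter(!str_detect(text, regex(boilerplate, ignore_case = TRUE)))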
After converting the text column to character (in case it was read in as a factor), I will remove punctuation and numbers and make all of the letters lowercase.
hc2$text = as.character(hc2$text)
hc3 = hc2 %>%
  unnest_tokens(word, text,
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
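To make the tokenization step concrete, here is a tiny demonstration on a made-up headline (the sentence is invented purely for illustration):
# Tokenizing a made-up headline: punctuation and the number are stripped,
# and everything is lowercased, leaving one word per row.
demo = data.frame(text = "Storms Hit 3 Towns; Cleanup Begins!")
demo %>% unnest_tokens(word, text,
                       to_lower = TRUE,
                       strip_punct = TRUE,
                       strip_numeric = TRUE)
# word: storms, hit, towns, cleanup, begins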
Next, I will remove the stop words: common words such as "the" and "of" that add little meaning to the text.
hc3 %>% count(word, sort = T)
hc3 = hc3 %>% anti_join(stop_words, by = "word")
head(hc3)
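The default stop_words list will not catch site-specific filler such as navigation labels. If needed, it could be extended; the extra words below are hypothetical examples of terms one might want to drop:
# Hypothetical custom additions to the stop word list:
my_stops = bind_rows(stop_words,
                     tibble(word = c("advertisement", "subscribe", "login"),
                            lexicon = "custom"))
hc3_alt = hc3 %>% anti_join(my_stops, by = "word")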
hc4 = hc3 %>% mutate(linenumber = row_number())   # sequential word position, used to group words into chunks below
head(hc4)
hc4 %>% count(word, sort = T)
I will use the NRC Word-Emotion Association Lexicon (loaded through the textdata package) to look at the emotions in the text.
nrc_sent = get_sentiments("nrc")
table(nrc_sent$sentiment)
##
## anger anticipation disgust fear joy negative
## 1247 839 1058 1476 689 3324
## positive sadness surprise trust
## 2312 1191 534 1231
head(table(nrc_sent))
## sentiment
## word anger anticipation disgust fear joy negative positive sadness
## abacus 0 0 0 0 0 0 0 0
## abandon 0 0 0 1 0 1 0 1
## abandoned 1 0 0 1 0 1 0 1
## abandonment 1 0 0 1 0 1 0 1
## abba 0 0 0 0 0 0 1 0
## abbot 0 0 0 0 0 0 0 0
## sentiment
## word surprise trust
## abacus 0 1
## abandon 0 0
## abandoned 0 0
## abandonment 1 0
## abba 0 0
## abbot 0 1
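Before trusting the aggregate counts, it can help to check how individual words are coded; "storm" below is just an arbitrary example word:
# Look up every emotion category attached to one example word:
nrc_sent %>% filter(word == "storm")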
hc4_emotion = hc4 %>%
  inner_join(nrc_sent, by = "word") %>%
  count(index = linenumber %/% 50, sentiment)   # bucket words into chunks of 50
head(hc4_emotion)
hc_emotion = hc4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("Hartford Courant")
hc_emotion
For the four remaining newspapers, I will use the same pipeline as above, changing the URL passed to read_html() and making slight modifications as needed.
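The repeated code below keeps every step visible, but for reference, the whole pipeline could be wrapped in a single helper function. This is only a sketch; the function name and arguments are my own invention, and it assumes the packages and nrc_sent loaded above:
# Hypothetical helper wrapping the scrape -> tokenize -> NRC -> plot pipeline.
# url: homepage to scrape; title: plot title; drop: optional row indices to remove.
analyze_homepage = function(url, title, drop = NULL) {
  txt = read_html(url) %>% html_nodes("p") %>% html_text()
  words = data.frame(text = txt, stringsAsFactors = FALSE)
  if (!is.null(drop)) words = words %>% slice(-drop)
  words %>%
    unnest_tokens(word, text, to_lower = TRUE,
                  strip_punct = TRUE, strip_numeric = TRUE) %>%
    anti_join(stop_words, by = "word") %>%
    mutate(linenumber = row_number()) %>%
    inner_join(nrc_sent, by = "word") %>%
    count(index = linenumber %/% 50, sentiment) %>%
    ggplot(aes(index, n, fill = as.factor(sentiment))) +
    geom_col() +
    theme(legend.position = "right") +
    ggtitle(title)
}
# Example: analyze_homepage("https://www.ctpost.com/", "CT Post")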
webpage_ct = read_html("https://www.ctpost.com/")
ct0 = webpage_ct %>% html_nodes("p") %>% html_text()
ct1 = data.frame(text = ct0)
head(ct1,6)
tail(ct1,6)
nrow(ct1)
## [1] 11
ct1$text = as.character(ct1$text)
ct2 = ct1 %>%
  unnest_tokens(word, text,
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
ct2 %>% count(word, sort = T)
ct2 = ct2 %>% anti_join(stop_words, by = "word")
head(ct2)
ct3 = ct2 %>% mutate(linenumber = row_number())
head(ct3)
ct3 %>% count(word, sort = T)
ct3_emotion = ct3 %>%
  inner_join(nrc_sent, by = "word") %>%
  count(index = linenumber %/% 50, sentiment)
head(ct3_emotion)
ct_emotion = ct3_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("CT Post")
ct_emotion
webpage_td = read_html("https://www.theday.com/")
td0 = webpage_td %>% html_nodes("p") %>% html_text()
td1 = data.frame(text = td0)
head(td1,10)
tail(td1,10)
nrow(td1)
## [1] 55
td2 <- td1 %>%
  slice(-c(1, n()))   # drop the first and last rows
head(td2)
tail(td2)
td2$text = as.character(td2$text)
td3 = td2 %>%
  unnest_tokens(word, text,
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
td3 %>% count(word, sort = T)
td3 = td3 %>% anti_join(stop_words, by = "word")
head(td3)
td4 = td3 %>% mutate(linenumber = row_number())
head(td4)
td4 %>% count(word, sort = T)
td4_emotion = td4 %>%
  inner_join(nrc_sent, by = "word") %>%
  count(index = linenumber %/% 50, sentiment)
head(td4_emotion)
td_emotion = td4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("The Day")
td_emotion
webpage_sh = read_html("https://www.sheltonherald.com/")
sh0 = webpage_sh %>% html_nodes("p") %>% html_text()
sh1 = data.frame(text = sh0)
head(sh1,2)
tail(sh1,2)
nrow(sh1)
## [1] 4
sh1$text = as.character(sh1$text)
sh2 = sh1 %>%
  unnest_tokens(word, text,
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
sh2 %>% count(word, sort = T)
sh2 = sh2 %>% anti_join(stop_words, by = "word")
head(sh2)
sh3 = sh2 %>% mutate(linenumber = row_number())
head(sh3)
sh3 %>% count(word, sort = T)
sh3_emotion = sh3 %>%
  inner_join(nrc_sent, by = "word") %>%
  count(index = linenumber %/% 50, sentiment)
head(sh3_emotion)
sh_emotion = sh3_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("Shelton Herald")
sh_emotion
webpage_tc = read_html("https://www.thechronicle.com/")
tc0 = webpage_tc %>% html_nodes("p") %>% html_text()
tc1 = data.frame(text = tc0)
head(tc1,10)
tail(tc1,10)
nrow(tc1)
## [1] 101
tc2 <- tc1 %>%
  slice(-c(1:15, 87:101))   # drop the first 15 and last 15 rows
head(tc2)
tail(tc2)
tc2$text = as.character(tc2$text)
tc3 = tc2 %>%
  unnest_tokens(word, text,
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
tc3 %>% count(word, sort = T)
tc3 = tc3 %>% anti_join(stop_words, by = "word")
head(tc3)
tc4 = tc3 %>% mutate(linenumber = row_number())
head(tc4)
tc4 %>% count(word, sort = T)
tc4_emotion = tc4 %>%
  inner_join(nrc_sent, by = "word") %>%
  count(index = linenumber %/% 50, sentiment)
head(tc4_emotion)
tc_emotion = tc4_emotion %>%
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right") +
  ggtitle("The Chronicle")
tc_emotion
plot_grid(tc_emotion, td_emotion, nrow = 2)
plot_grid(hc_emotion, ct_emotion, nrow = 2)
sh_emotion
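All five charts could also be combined into a single grid for easier side-by-side comparison; a minimal sketch using the same cowplot function:
# One combined view of all five papers:
plot_grid(hc_emotion, ct_emotion, td_emotion, sh_emotion, tc_emotion,
          nrow = 3)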
At first glance, the most obvious difference is the number of words that could actually be pulled from each homepage. The Shelton Herald and CT Post yielded very few usable words, while The Chronicle, The Day, and the Hartford Courant were quite text heavy.
The five newspapers also appear to use all ten of the available sentiment categories. On closer inspection, however, both CT Post and the Shelton Herald are missing fear, and CT Post had no anger words at all.
While an abundance of the text was categorized as fear and negative, some of it was still positive or associated with trust. So even though many of the headlines pulled were unfavorable, it was not all bad news.
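To confirm these observations with numbers rather than reading them off the bars, the per-paper totals could be tabulated directly; a sketch reusing the emotion data frames built above:
# Total words per sentiment for each paper; a value of 0 in a column
# confirms which emotions never appeared (e.g., fear for some papers).
bind_rows(`Hartford Courant` = hc4_emotion, `CT Post` = ct3_emotion,
          `The Day` = td4_emotion, `Shelton Herald` = sh3_emotion,
          `The Chronicle` = tc4_emotion, .id = "paper") %>%
  count(paper, sentiment, wt = n) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)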
This project made use of the NRC Word-Emotion Association Lexicon, created by Saif M. Mohammad and Peter D. Turney at the National Research Council Canada. See http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm for more information.
Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon, Saif Mohammad and Peter Turney, In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, Los Angeles, California.
I would like to thank my Unsupervised Machine Learning professor, Armando E. Rodriguez, Ph.D., for teaching me how to complete an emotional text analysis using R.