Acknowledgement

I want to express my deepest gratitude to my professor, Dr. Armando Rodriguez (https://www.newhaven.edu/faculty-staff-profiles/armando-rodriguez.php). Without Dr. Rodriguez’s guidance and help, this project would not have been possible.

Introduction

In this project, I scraped the front pages of 5 different Connecticut daily newspapers to compare their sentiments on the same day. I used SelectorGadget, a Google Chrome extension that helps you determine the CSS selector path of the page elements you want to extract. The goal is to track and rank the sentiments of these newspapers, from most positive to least positive.
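
As a quick illustration of that workflow, a selector found with SelectorGadget can be passed straight to rvest's html_nodes(). The selector ".headline--link" below is made up for illustration only; the actual analysis pulls plain "a" tags.

#".headline--link" is a hypothetical selector for illustration
read_html("https://www.ctpost.com/") %>%
  html_nodes(".headline--link") %>%
  html_text()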

First, we load the packages used throughout this analysis.
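
A minimal setup block, assuming the tidyverse-style packages whose functions are called below; bing_sent and nrc_sent are the lexicon objects joined against later (the NRC lexicon may prompt a one-time download via the textdata package).

library(rvest)     #read_html(), html_nodes(), html_text()
library(dplyr)     #%>%, mutate(), count(), inner_join(), anti_join()
library(tidyr)     #spread()
library(tidytext)  #unnest_tokens(), stop_words, get_sentiments()
library(ggplot2)   #plotting
library(cowplot)   #plot_grid()

bing_sent = get_sentiments("bing") #word + positive/negative
nrc_sent  = get_sentiments("nrc")  #word + eight emotions plus positive/negative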

1. Connecticut Post

First, we scrape the front page of the CT Post website, clean the data, and then continue.

webpage_a = read_html("https://www.ctpost.com/")

data1  = webpage_a %>% html_nodes("a") %>% html_text() 

data1.1 = data.frame(text = data1) # we have to convert it to a tibble/dataframe
head(data1.1, 10)
tail(data1.1, 10)

#Preprocessing: drop the footer links (rows 328-347) and the header/navigation links (rows 1-74); the row ranges come from inspecting head() and tail() above
data1.2 <- data1.1 %>% slice(-(328:347)) %>% slice(-(1:74))
head(data1.2)

data1.2$text = as.character(data1.2$text) 

#Into the tidytext world          
data1.3 = data1.2 %>% unnest_tokens(word, text, 
                              to_lower = T, 
                              strip_punct = T, 
                              strip_numeric = T)
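
To see what unnest_tokens() produces, here is a quick check on a made-up headline (not from the scraped data):

tibble(text = "Police: 3 Hurt in Crash!") %>%
  unnest_tokens(word, text, to_lower = T, strip_punct = T, strip_numeric = T)
#one row per token: "police", "hurt", "in", "crash" (lowercased, punctuation and the number stripped)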


#take out the stopwords
data1.3 %>% count(word, sort = T) 

data1.3 = data1.3 %>% anti_join(stop_words, by = "word") 

#add a running word index (used below to form 50-word blocks)
data1.4 = data1.3 %>% mutate(linenumber = row_number())
head(data1.4)
data1.4 %>% count(word, sort = T)

ctPost <- data1.4

Now, we will calculate the sentiment of the newspaper using the Bing lexicon. We will also calculate the overall emotion of the newspaper using the NRC lexicon. Both results will be plotted.
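
The index used below comes from integer division of the running word index, which assigns each word to a 50-word block; a quick illustration:

c(12, 49, 50, 99, 100) %/% 50 #block indices: 0 0 1 1 2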

#sentiment: we use the Bing lexicon for this part
ctPost_sentiment <- ctPost %>%
  inner_join(bing_sent) %>%
  count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(ctPost_sentiment)
##   index negative positive sentiment
## 1     0        5        0        -5
## 2     1        3        2        -1
## 3     2        1        1         0
## 4     3        0        2         2
## 5     4        3        0        -3
## 6     5        3        2        -1
ctPost_sentiment = ggplot(ctPost_sentiment, 
                      aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "none")
ctPost_sentiment

ctPost_emotion = ctPost %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(ctPost_emotion)
##   index sentiment n
## 1     0     anger 5
## 2     0   disgust 4
## 3     0      fear 7
## 4     0  negative 8
## 5     0  positive 3
## 6     0   sadness 4
ctPost_emotion = ctPost_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
ctPost_emotion

Here, we combine the overall emotion and sentiment of the newspaper and plot them together.

plot_grid(ctPost_emotion, ctPost_sentiment, nrow = 2)

2. New Haven Register

Now we scrape the front page of the New Haven Register and continue our analysis.

webpage_b = read_html("https://www.nhregister.com/")

data2  = webpage_b %>% html_nodes("a") %>% html_text() 

data2.1 = data.frame(text = data2) 
head(data2.1, 10)
tail(data2.1, 10)

#Preprocessing: drop footer and header/navigation link rows, as before
data2.2 <- data2.1 %>% slice(-(323:343)) %>% slice(-(1:77))
head(data2.2)

data2.2$text = as.character(data2.2$text) 

#Into the tidytext world          
data2.3 = data2.2 %>% unnest_tokens(word, text, 
                              to_lower = T, 
                              strip_punct = T, 
                              strip_numeric = T)


#take out the stopwords
data2.3 %>% count(word, sort = T) 

data2.3 = data2.3 %>% anti_join(stop_words, by = "word") 

#add a running word index
data2.4 = data2.3 %>% mutate(linenumber = row_number())
head(data2.4)
data2.4 %>% count(word, sort = T)

newHavenReg <- data2.4

Now, we will calculate the sentiment and the overall emotion of the front page of the New Haven Register.

#sentiment: we use the Bing lexicon for this part
newHavenReg_sentiment <- newHavenReg %>%
  inner_join(bing_sent) %>%
  count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(newHavenReg_sentiment)
##   index negative positive sentiment
## 1     0        7        0        -7
## 2     1        1        1         0
## 3     2        1        2         1
## 4     3        2        1        -1
## 5     4        2        0        -2
## 6     5        1        3         2
newHavenReg_sentiment = ggplot(newHavenReg_sentiment, 
                      aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "none")
newHavenReg_sentiment 

#All emotion
newHavenReg_emotion = newHavenReg %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(newHavenReg_emotion)
##   index sentiment  n
## 1     0     anger  8
## 2     0   disgust  4
## 3     0      fear 10
## 4     0  negative 10
## 5     0  positive  4
## 6     0   sadness  2
newHavenReg_emotion = newHavenReg_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
newHavenReg_emotion

Again, we combine the overall emotion and sentiment of the newspaper and plot them together.

plot_grid(newHavenReg_emotion, newHavenReg_sentiment, nrow = 2)

3. Stamford Advocate

webpage_c = read_html("https://www.stamfordadvocate.com/")

data3  = webpage_c %>% html_nodes("a") %>% html_text() 

data3.1 = data.frame(text = data3) 
head(data3.1, 10)
tail(data3.1, 10)

#Preprocessing: drop footer and header/navigation link rows, as before
data3.2 <- data3.1 %>% slice(-(323:343)) %>% slice(-(1:78))
head(data3.2)

data3.2$text = as.character(data3.2$text) 

#Into the tidytext world          
data3.3 = data3.2 %>% unnest_tokens(word, text, 
                                    to_lower = T, 
                                    strip_punct = T, 
                                    strip_numeric = T)

#take out the stopwords
data3.3 %>% count(word, sort = T) 

data3.3 = data3.3 %>% anti_join(stop_words, by = "word") 

#add a running word index
data3.4 = data3.3 %>% mutate(linenumber = row_number())
head(data3.4)
data3.4 %>% count(word, sort = T)

stamfordAdv <- data3.4

Now, we will calculate the sentiment and the overall emotion of the front page of the Stamford Advocate.

#sentiment: we use the Bing lexicon for this part
stamfordAdv_sentiment <- stamfordAdv %>%
  inner_join(bing_sent) %>%
  count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(stamfordAdv_sentiment)
##   index negative positive sentiment
## 1     0        5        0        -5
## 2     1        0        1         1
## 3     2        2        1        -1
## 4     3        2        0        -2
## 5     4        3        0        -3
## 6     5        4        2        -2
stamfordAdv_sentiment = ggplot(stamfordAdv_sentiment, 
                                 aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "none")
stamfordAdv_sentiment  

#All emotion
stamfordAdv_emotion = stamfordAdv %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(stamfordAdv_emotion)
##   index sentiment n
## 1     0     anger 8
## 2     0   disgust 2
## 3     0      fear 8
## 4     0  negative 8
## 5     0  positive 5
## 6     0   sadness 2
stamfordAdv_emotion = stamfordAdv_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
stamfordAdv_emotion

plot_grid(stamfordAdv_emotion, stamfordAdv_sentiment, nrow = 2)

4. Republican American

webpage_d = read_html("https://www.rep-am.com/category/news/")

data4  = webpage_d %>% html_nodes("a") %>% html_text() 

data4.1 = data.frame(text = data4) 
head(data4.1, 10)
tail(data4.1, 10)

#Preprocessing: drop footer and header/navigation link rows, as before
data4.2 <- data4.1 %>% slice(-(395:546)) %>% slice(-(1:272))
head(data4.2)

data4.2$text = as.character(data4.2$text) 

#Into the tidytext world          
data4.3 = data4.2 %>% unnest_tokens(word, text, 
                                    to_lower = T, 
                                    strip_punct = T, 
                                    strip_numeric = T)

#take out the stopwords
data4.3 %>% count(word, sort = T) 

data4.3 = data4.3 %>% anti_join(stop_words, by = "word") 

#add a running word index
data4.4 = data4.3 %>% mutate(linenumber = row_number())
head(data4.4)
data4.4 %>% count(word, sort = T)

repAmerican <- data4.4

Now, we will calculate the sentiment and the overall emotion of the front page of the Republican American.

#sentiment: we use the Bing lexicon for this part
repAmerican_sentiment <- repAmerican %>%
  inner_join(bing_sent) %>%
  count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(repAmerican_sentiment)
##   index negative positive sentiment
## 1     0        2        3         1
## 2     1        7        2        -5
## 3     2        2        0        -2
## 4     3        2        1        -1
## 5     4        1        0        -1
repAmerican_sentiment = ggplot(repAmerican_sentiment, 
                                  aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "none")
repAmerican_sentiment  

#All emotion
repAmerican_emotion = repAmerican %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(repAmerican_emotion)
##   index    sentiment n
## 1     0        anger 3
## 2     0 anticipation 3
## 3     0      disgust 1
## 4     0         fear 2
## 5     0          joy 3
## 6     0     negative 4
repAmerican_emotion = repAmerican_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
repAmerican_emotion

plot_grid(repAmerican_emotion, repAmerican_sentiment, nrow = 2)

5. The Day

webpage_e = read_html("https://www.theday.com/")

data5  = webpage_e %>% html_nodes("a") %>% html_text() 

data5.1 = data.frame(text = data5) 
head(data5.1, 10)
tail(data5.1, 10)

#Preprocessing: drop footer and header/navigation link rows, as before
data5.2 <- data5.1 %>% slice(-(317:381)) %>% slice(-(1:109))
head(data5.2)

data5.2$text = as.character(data5.2$text) 

#Into the tidytext world          
data5.3 = data5.2 %>% unnest_tokens(word, text, 
                                    to_lower = T, 
                                    strip_punct = T, 
                                    strip_numeric = T)

#take out the stopwords
data5.3 %>% count(word, sort = T) 

data5.3 = data5.3 %>% anti_join(stop_words, by = "word") 

#add a running word index
data5.4 = data5.3 %>% mutate(linenumber = row_number())
head(data5.4)
data5.4 %>% count(word, sort = T)

theDay <- data5.4

Now, we will calculate the sentiment and the overall emotion of the front page of The Day.

#sentiment: we use the Bing lexicon for this part
theDay_sentiment <- theDay %>%
  inner_join(bing_sent) %>%
  count(index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
head(theDay_sentiment)
##   index negative positive sentiment
## 1     0        2        2         0
## 2     1        1        2         1
## 3     2        2        0        -2
## 4     3        1        0        -1
## 5     4        6        1        -5
## 6     5        2        0        -2
theDay_sentiment = ggplot(theDay_sentiment, 
                               aes(index, sentiment, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "none")
theDay_sentiment 

#All emotion
theDay_emotion = theDay %>%
  inner_join(nrc_sent) %>%
  count(index = linenumber %/% 50, sentiment) 
## Joining, by = "word"
head(theDay_emotion)
##   index    sentiment n
## 1     0        anger 4
## 2     0 anticipation 5
## 3     0         fear 5
## 4     0          joy 6
## 5     0     negative 4
## 6     0     positive 9
theDay_emotion = theDay_emotion %>% 
  ggplot(aes(index, n, fill = as.factor(sentiment))) +
  geom_col() +
  theme(legend.position = "right")
theDay_emotion

plot_grid(theDay_emotion, theDay_sentiment, nrow = 2)

Newspapers ranked by their Sentiment

plot_grid(ctPost_sentiment + theme(legend.position = "none") + ggtitle("CT Post"),
          newHavenReg_sentiment + theme(legend.position = "none") + ggtitle("New Haven Register"),
          stamfordAdv_sentiment + theme(legend.position = "none") + ggtitle("Stamford Advocate"),
          repAmerican_sentiment + theme(legend.position = "none") + ggtitle("Republican American"),
          theDay_sentiment + theme(legend.position = "none") + ggtitle("The Day"), nrow = 2)

The plot above shows that the CT Post leads in sentiment, with the most positive blocks of the five newspapers. It is interesting that although the stories on each newspaper’s home page were very similar, there is a large difference in their sentiment. The Republican American and The Day sit at the opposite end of the scale from the CT Post, the New Haven Register, and the Stamford Advocate. It is also interesting that the New Haven Register and the Stamford Advocate differ in sentiment despite having very similar bags of words when first scraped. This suggests that differences in the writing styles of these 5 newspapers account for the differences in sentiment. An explicit numeric ranking is sketched at the end of this report.

Newspapers ranked by their Emotion

plot_grid(ctPost_emotion + theme(legend.position = "right") + ggtitle("CT Post"),
          newHavenReg_emotion + theme(legend.position = "right") + ggtitle("New Haven Register"),
          stamfordAdv_emotion + theme(legend.position = "right") + ggtitle("Stamford Advocate"), 
          theDay_emotion + theme(legend.position = "right") + ggtitle("The Day"),
          repAmerican_emotion + theme(legend.position = "right") + ggtitle("Republican American"), nrow = 5)
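
Finally, to make the sentiment ranking explicit rather than purely visual, here is a minimal sketch (not part of the analysis above) that collapses each paper's front page to a single net Bing score and sorts the papers by it:

#net Bing score: count of positive words minus count of negative words
net_score = function(words) {
  scored = words %>% inner_join(bing_sent, by = "word")
  sum(scored$sentiment == "positive") - sum(scored$sentiment == "negative")
}
sort(c("CT Post" = net_score(ctPost),
       "New Haven Register" = net_score(newHavenReg),
       "Stamford Advocate" = net_score(stamfordAdv),
       "Republican American" = net_score(repAmerican),
       "The Day" = net_score(theDay)),
     decreasing = TRUE) #most positive front page first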