In this report, I introduce some of what sentiment analysis can do and explore data on climate change reporting in the United States news media.

Loading the Data

The dataset comes from Dr. Julia Silge’s course on DataCamp called Sentiment Analysis in R: The Tidy Way.

load("climate_text.rda")

There are 593 observations and 4 columns.
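A quick dim() call confirms the shape of the data:

dim(climate_text)
## [1] 593   4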

Exploring the Data

library("tidyverse")

climate <- tbl_df(climate_text)
head(climate)
## # A tibble: 6 x 4
##   station                           show           show_date
##     <chr>                          <chr>              <dttm>
## 1   MSNBC                Morning Meeting 2009-09-22 13:00:00
## 2   MSNBC                Morning Meeting 2009-10-23 13:00:00
## 3     CNN                   CNN Newsroom 2009-12-03 20:00:00
## 4     CNN               American Morning 2009-12-07 11:00:00
## 5   MSNBC                Morning Meeting 2009-12-08 14:00:00
## 6   MSNBC Countdown With Keith Olbermann 2009-12-10 06:00:00
## # ... with 1 more variables: text <chr>

We can see information about

  1. News station
  2. TV show
  3. Air date
  4. The text itself (sentences containing “climate”)

Furthermore, the data set contains observations from three TV networks: CNN, FOX News, and MSNBC. There are 135 distinct TV shows, and the sample spans 2009-09-22 13:00:00 to 2017-04-30 16:00:00.
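These summary figures can be reproduced directly from the tibble:

unique(climate$station)    # the three networks
n_distinct(climate$show)   # 135 distinct shows
range(climate$show_date)   # first and last air dates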

Tidying the Text

Julia Silge and David Robinson created the tidytext package to bring text mining into the tidy data framework. Here, the unnest_tokens() function takes each sentence and splits it into one word per row.

library("tidytext")
library("tm")
library("wordcloud")

tidy_climate <- climate %>%
  unnest_tokens(word, text)
tidy_climate
## # A tibble: 41,076 x 4
##    station            show           show_date       word
##      <chr>           <chr>              <dttm>      <chr>
##  1   MSNBC Morning Meeting 2009-09-22 13:00:00        the
##  2   MSNBC Morning Meeting 2009-09-22 13:00:00   interior
##  3   MSNBC Morning Meeting 2009-09-22 13:00:00 positively
##  4   MSNBC Morning Meeting 2009-09-22 13:00:00      oozes
##  5   MSNBC Morning Meeting 2009-09-22 13:00:00      class
##  6   MSNBC Morning Meeting 2009-09-22 13:00:00      raves
##  7   MSNBC Morning Meeting 2009-09-22 13:00:00        car
##  8   MSNBC Morning Meeting 2009-09-22 13:00:00   magazine
##  9   MSNBC Morning Meeting 2009-09-22 13:00:00      slick
## 10   MSNBC Morning Meeting 2009-09-22 13:00:00        and
## # ... with 41,066 more rows
wordcloud(tidy_climate$word, max.words = 100)

Next, it is common practice to remove stop words (very common conjunctions, articles, prepositions, etc. that rarely affect analyses).

tidy_climate <- tidy_climate %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("climate", "change")) # remove the two most common words; "climate" appears in every sentence by construction
wordcloud(tidy_climate$word, max.words = 100)
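The word cloud gives a qualitative picture; for a numeric view, a quick count() lists the most frequent remaining words:

tidy_climate %>%
  count(word, sort = TRUE) %>%  # tally each word across all stations
  head(10)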

Sentiments

In the tidytext package, we have access to a trio of sentiment lexicons. These are lists of words associated with various sentiments.

The bing lexicon (by Bing Liu) focuses on positive versus negative classification.

get_sentiments("bing")
## # A tibble: 6,788 x 2
##           word sentiment
##          <chr>     <chr>
##  1     2-faced  negative
##  2     2-faces  negative
##  3          a+  positive
##  4    abnormal  negative
##  5     abolish  negative
##  6  abominable  negative
##  7  abominably  negative
##  8   abominate  negative
##  9 abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

The nrc lexicon (by Saif Mohammad and Peter Turney) classifies words into a wider variety of ten categories: eight basic emotions plus positive and negative.

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
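We can verify the ten categories with a quick tally:

get_sentiments("nrc") %>%
  count(sentiment)  # eight emotions plus positive and negative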

Finally, the afinn lexicon (by Finn Årup Nielsen) assigns numerical scores (integers from -5 to 5), capturing that some words are more strongly negative (or positive) than others.

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##          word score
##         <chr> <int>
##  1    abandon    -2
##  2  abandoned    -2
##  3   abandons    -2
##  4   abducted    -2
##  5  abduction    -2
##  6 abductions    -2
##  7      abhor    -3
##  8   abhorred    -3
##  9  abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows

For example, if we want to focus only on positive versus negative sentiment, the bing lexicon is appropriate. We can map words in the climate data frame to those in the bing lexicon, but this only makes sense for words that actually appear in the lexicon, so an inner_join is the right tool. Silge’s course also recommends computing the total words per station, so that we can later use proportions rather than raw word counts.

climate_sentiment <- tidy_climate %>% 
  group_by(station) %>%
  mutate(station_total = n()) %>%
  ungroup() %>%
  inner_join(get_sentiments("bing"))
climate_sentiment
## # A tibble: 1,806 x 6
##    station             show           show_date         word station_total
##      <chr>            <chr>              <dttm>        <chr>         <int>
##  1   MSNBC  Morning Meeting 2009-09-22 13:00:00   positively          6095
##  2   MSNBC  Morning Meeting 2009-09-22 13:00:00        slick          6095
##  3   MSNBC  Morning Meeting 2009-09-22 13:00:00     striking          6095
##  4   MSNBC  Morning Meeting 2009-10-23 13:00:00 disagreement          6095
##  5   MSNBC  Morning Meeting 2009-10-23 13:00:00         hoax          6095
##  6   MSNBC  Morning Meeting 2009-10-23 13:00:00    undermine          6095
##  7     CNN     CNN Newsroom 2009-12-03 20:00:00       bumped          3673
##  8     CNN American Morning 2009-12-07 11:00:00  controversy          3673
##  9     CNN American Morning 2009-12-07 11:00:00        issue          3673
## 10     CNN American Morning 2009-12-07 11:00:00      scandal          3673
## # ... with 1,796 more rows, and 1 more variables: sentiment <chr>

Application: Do the major news networks report climate change news differently?

Here I count the number of positive and negative words per TV station, then normalize by dividing by each network’s total word count.

library("ggplot2")

climate_sentiment %>%
  count(station, sentiment, station_total) %>%
  mutate(percent = 100*n / station_total) %>%
  ggplot(aes(x = sentiment, y = percent, fill = sentiment)) +
      geom_col() + 
      facet_wrap(~station) +
  labs(title = "Do the major news networks report climate change news differently?",
       subtitle = "Sentiment Analysis on Words in Sentences with 'climate'",
       caption = "Source: DataCamp")

The plots suggest that Fox News has the largest gap between negative and positive word usage when reporting on climate change, while MSNBC is closest to an even split between the two sentiments.
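To put numbers on that gap, we can spread the percentages into columns and take the difference (a minimal sketch; spread() matches the tidyr vintage used here):

climate_sentiment %>%
  count(station, sentiment, station_total) %>%
  mutate(percent = 100 * n / station_total) %>%
  select(station, sentiment, percent) %>%
  spread(sentiment, percent) %>%      # one column per sentiment
  mutate(net = positive - negative)   # net sentiment per station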

Application: Does news media reporting on climate change become more negative as temperatures rise?

Now I will bring in weather data from NOAA (National Oceanic and Atmospheric Administration) from a weather station in New York City (near those news networks’ headquarters), covering the same time frame.

library("readr")

nyc_raw <- read_csv("nyc_weather.csv")
nyc_tidy <- nyc_raw %>%
  select(DATE, TMAX)
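A quick summary guards against unit or missing-value surprises in the temperature column (the plot below assumes TMAX is the daily high in degrees Fahrenheit):

summary(nyc_tidy$TMAX)  # check range and NAs of daily high temperatures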

To move toward a regression model, I will use the numerical word sentiment scores from the afinn lexicon.

climate_scores <- tidy_climate %>%
  inner_join(get_sentiments("afinn"))

Before merging the data, I need the dates to be in the correct format.

# date format conversion
nyc_tidy$DATE <- as.Date(nyc_tidy$DATE, "%m/%d/%Y")
climate_scores$DATE <- as.Date(as.character(climate_scores$show_date), "%Y-%m-%d")

Now I can merge the data by the DATE column.

climate_scores_temps <- left_join(climate_scores, nyc_tidy, by = "DATE")
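Since a left join keeps every word even when no matching weather record exists, it is worth checking how many rows are missing a temperature:

sum(is.na(climate_scores_temps$TMAX))  # rows with no matching weather observation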

Here is a graph showing the word sentiment scores versus the high temperatures in New York City.

climate_scores_temps %>%
  ggplot(aes(x = TMAX, y = score)) +
      geom_point(alpha = 0.1) + # alpha outside aes() sets transparency without creating a legend
      geom_smooth(method = "lm", col = "red") + 
      labs(title = "Does news on climate change become more negative as temperatures rise?",
           subtitle = "Linear Regression on sentiment versus NYC temperature",
           x = "high temperature (in Fahrenheit)",
           y = "AFINN Sentiment Score",
           caption = "Sources: DataCamp, NOAA")

We can see a slightly decreasing trend in sentiment in “climate” reporting as NYC temperatures rise, but let me use R’s built-in functions to examine the fit.

x <- climate_scores_temps$TMAX
y <- climate_scores_temps$score
lin_fit <- lm(y ~ x)
summary(lin_fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.712 -1.701 -0.691  1.322  5.323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.260687   0.179734  -1.450    0.147
## x           -0.000686   0.002840  -0.242    0.809
## 
## Residual standard error: 2.015 on 1556 degrees of freedom
## Multiple R-squared:  3.749e-05,  Adjusted R-squared:  -0.0006052 
## F-statistic: 0.05834 on 1 and 1556 DF,  p-value: 0.8092
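In a simple linear regression, the t-test on the slope is equivalent to a Pearson correlation test, so cor.test() offers a quick cross-check (it drops incomplete pairs automatically):

cor.test(x, y)  # should reproduce the slope's p-value of 0.8092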

Alas, this analysis shows that the regression model does not explain the variation in sentiment in a statistically significant way (the slope’s p-value is 0.809).