Text Analysis of January 6th, 2021 Capitol Insurrection News Coverage

News coverage today is shaped heavily by the political “lean” of the station delivering it, which is part of where the idea of “fake news” has come from over the past few years. Information is no longer just information; it arrives through a lens of bias and opinion. This can affect how people understand and interpret what is happening in the world, especially when some of what is broadcast is false. With media and news so accessible, it is easy to believe the first thing you read on your phone. In this project, I look at three news segments broadcast on January 7th, 2021, the day after the January 6th Capitol insurrection. I completed a text analysis of the three segments to compare coverage across three news stations: one left-leaning (MSNBC), one non-partisan (CNN), and one right-leaning (Fox News). I specifically chose segments hosted by one person to create a fairly equal comparison: “Anderson Cooper 360 Degrees” from CNN, “The Beat with Ari Melber” from MSNBC, and “Tucker Carlson Tonight” from Fox News. The goal is to understand the role that politics plays in news coverage and in the information that people consume.

I cleaned each data set so that it includes only text from the actual broadcast. This meant deleting video inserts, captions marking the start and end of a video, commercial breaks, etc. The news segments analyzed are linked below.

CNN: http://www.cnn.com/TRANSCRIPTS/2101/07/acd.01.html

MSNBC: https://www.msnbc.com/transcripts/transcript-beat-ari-melber-january-7-2021-n1259046

FOX: https://www.novakarchive.com/fox-news-tucker-carlson/2021/1/7/tucker-carlson-january-7-2021-transcript

First, let’s load the necessary packages:

library(tidyverse)
library(tidytext)
library(textdata)

Now, let’s load the text:

(all news segments were coded separately in RStudio, but are put together here for comparison)

CNN:

cnn_transcript <- read_delim("cnn transcript.txt", 
                             delim = ";", escape_double = FALSE, col_names = FALSE, 
                             trim_ws = TRUE)

MSNBC:

library(readr)
msnbc_transcript <- read_delim("msnbc transcript.txt", 
                               delim = ";", escape_double = FALSE, col_names = FALSE, 
                               trim_ws = TRUE)

FOX:

library(readr)
FOX_text <- read_delim("FOX text.txt", delim = ";", 
                       escape_double = FALSE, col_names = FALSE, 
                       trim_ws = TRUE)

I am now unnesting tokens for the three transcripts in order to break the text down into individual words:

CNN:

cnn_transcript %>%
  unnest_tokens(word, X1) -> CNN_words1

MSNBC:

msnbc_transcript %>% 
  unnest_tokens(word, X1) -> MSNBC_words

FOX:

FOX_text %>% 
  unnest_tokens(word, X1) -> fox_words
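To make the tokenizing step concrete, here is a minimal sketch on a made-up two-line transcript (the `toy_transcript` table and its sentences are invented for illustration and are not part of the real data). It shows that `unnest_tokens()` gives one row per word and, by default, lowercases the text and drops punctuation.

```r
library(dplyr)
library(tidytext)

# Toy one-column table standing in for a transcript read with col_names = FALSE
toy_transcript <- tibble(X1 = c("The Capitol was attacked.",
                                "The President spoke."))

# One row per word; text is lowercased and punctuation is removed by default
toy_words <- toy_transcript %>%
  unnest_tokens(word, X1)

toy_words$word
```

Because of the lowercasing, “Trump” and “trump” are counted as the same word in the steps that follow.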

After breaking the text down into individual words, I counted the number of words in each news segment in order to establish how long each segment was and how much they varied in length. The CNN segment was the longest.

CNN: 8,109 words

CNN_words1 %>% 
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  8109

MSNBC: 6,989 words

MSNBC_words %>% 
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  6989

FOX: 6,714 words

fox_words %>% 
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  6714

Word Popularity:

In the next step of code, I removed the stop words using anti_join() and created a ggplot showing the 20 most popular words for each show with the stop words removed. Because each plot uses coord_flip(), the most popular words run along the vertical axis and the frequency with which each word was said during the segment runs along the horizontal axis.
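As a minimal sketch of what the anti join does, assuming a made-up word list rather than the real transcripts (`toy_words` is invented for illustration): every row whose word appears in tidytext's `stop_words` table is dropped, and the remaining words are counted.

```r
library(dplyr)
library(tidytext)

# Made-up tokens; "the", "was", and "in" are in the stop_words table
toy_words <- tibble(word = c("the", "president", "was", "in", "the", "capitol"))

toy_counts <- toy_words %>%
  anti_join(stop_words, by = "word") %>%  # keep only words NOT in the stop list
  count(word, sort = TRUE)

toy_counts
```

Naming the join column with `by = "word"` is optional here, but it avoids the message that dplyr prints when it has to guess the column.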

CNN: The most common word was president.

CNN_words1 %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  head(20) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  coord_flip () + 
  theme_classic() +
  labs(x ='Most Popular Words',
       y = 'Frequency of Words',
       title = 'CNN Popular Words: Anderson Cooper 360 Degrees',
       subtitle = 'January 7th, 2021')

MSNBC: The most common word was people.

MSNBC_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  head(20) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  coord_flip () + 
  theme_classic() +
  labs(x ='Most Popular Words',
       y = 'Frequency of Words',
       title = 'MSNBC Popular Words: The Beat with Ari Melber',
       subtitle = 'January 7th, 2021')

FOX: The most common word was Trump.

fox_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  head(20) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  coord_flip () +
  theme_classic() +
  labs(x ='Most Popular Words',
       y = 'Frequency of Words',
       title = 'Fox News Popular Words: Tucker Carlson Tonight',
       subtitle = 'January 7th, 2021')

Something interesting from these visualizations is that all three news segments had the words “people” and “Trump” in their top three words. I did not expect all three segments to share two of their three most common words.
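Since the three segments differ in length, raw counts are not directly comparable. The sketch below is one possible way to put them on the same scale, as uses per 1,000 words. The totals and counts are the ones reported in this post (segment lengths above, top-20 tables in the word cloud section); the `per_thousand()` helper is a hypothetical name introduced only for this example.

```r
# Segment lengths and raw counts as reported in this post
total_words <- c(CNN = 8109, MSNBC = 6989, FOX = 6714)
people_n    <- c(CNN = 69,   MSNBC = 55,   FOX = 51)
trump_n     <- c(CNN = 39,   MSNBC = 29,   FOX = 54)

# Hypothetical helper: uses of a word per 1,000 words spoken
per_thousand <- function(n, total) round(1000 * n / total, 2)

rates <- rbind(people = per_thousand(people_n, total_words),
               trump  = per_thousand(trump_n, total_words))
rates
```

Note that the totals include stop words, so these rates describe how often a word was said relative to everything spoken in the segment.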

Sentiment Analysis:

In the next step of code, I completed a sentiment analysis of the three news segments using two lexicons ‘afinn’ and ‘bing’.

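I used afinn to calculate the mean sentiment value of each segment. Because the code counts each word first and then joins to the lexicon, the mean is taken over the distinct words that appear in afinn, each counted once regardless of how often it was said. All three news segments had mean sentiment values below zero. This makes sense considering that all shows were covering the Capitol insurrection.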

CNN: -0.4482759

CNN_words1 %>% 
  count(word, sort = TRUE)  %>% 
  inner_join(get_sentiments('afinn'))-> CNN_sentiment

mean(CNN_sentiment$value)

## [1] -0.4482759

MSNBC: -0.5436893

MSNBC_words %>% 
  count(word, sort = TRUE)  %>% 
  inner_join(get_sentiments('afinn')) -> MSNBC_sentiment

mean(MSNBC_sentiment$value)

## [1] -0.5436893

FOX: -0.3919598

fox_words %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments('afinn')) -> fox_sentiments

mean(fox_sentiments$value)

## [1] -0.3919598

According to the mean sentiment values, MSNBC, the most left-leaning news segment analyzed, had the most negative average sentiment score (-0.5436893). Fox News, the most right-leaning news segment analyzed, had the highest average sentiment score (-0.3919598). Although the difference is not huge, it suggests that the language used in news coverage may differ with a station's political lean.
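The means above give every distinct word the same weight. As a sketch of an alternative (not what was run for this project), the mean can be weighted by how often each word was said, using the `n` column that `count()` already produces. The words, counts, and scores below are made up for illustration and are not afinn values.

```r
library(dplyr)

# Made-up words, counts, and scores for illustration only (not afinn values)
toy_sentiment <- tibble(
  word  = c("word_a", "word_b", "word_c"),
  n     = c(10, 1, 1),
  value = c(-1, 3, -3)
)

# Each distinct word counted once (the approach used above)
unweighted <- mean(toy_sentiment$value)

# Each word counted as many times as it was said
weighted <- weighted.mean(toy_sentiment$value, w = toy_sentiment$n)

c(unweighted = unweighted, weighted = weighted)
```

With the real data, the same call would look like `weighted.mean(CNN_sentiment$value, w = CNN_sentiment$n)`. The two versions can rank the stations differently, so it is worth stating which one is being reported.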

CNN_sentiment %>% 
  mutate(Station = "CNN") -> CNN_sentiment

MSNBC_sentiment %>% 
  mutate(Station = "MSNBC") -> MSNBC_sentiment

fox_sentiments %>% 
  mutate(Station = "FOX") -> FOX_sentiment

CNN_sentiment %>% 
  full_join(MSNBC_sentiment) %>% 
  full_join(FOX_sentiment) -> Merged_sentiment

CNN_sentiment %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% c('null')) %>% 
  inner_join(get_sentiments('bing')) %>% 
  count(word, sentiment,  sort = TRUE) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative) -> merged_sentiment

The visualization below uses Merged_sentiment, the combined afinn table built in the previous code by labeling each station's table with a Station column and joining the three together. Separately, I removed the stop words from the CNN table using anti_join(), filtered out a word that was not appearing properly (“null”), and inner joined the result with the “bing” lexicon. That table is stored as merged_sentiment (lowercase) and is not used in the plot.
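The merge can also be written as a row stack rather than two full joins. The sketch below shows the idea with `bind_rows()` on made-up tables (`toy_cnn` and `toy_msnbc`, with invented counts and scores); the names of the list become the `Station` column, so the separate `mutate()` steps are not needed.

```r
library(dplyr)

# Made-up per-station tables with the same columns as the real ones
toy_cnn   <- tibble(word = c("attack", "good"), n = c(10, 4), value = c(-1, 3))
toy_msnbc <- tibble(word = c("riot"),           n = c(6),     value = c(-2))

# Stack the tables; list names are stored in the new Station column
toy_merged <- bind_rows(list(CNN = toy_cnn, MSNBC = toy_msnbc), .id = "Station")

toy_merged
```

This gives one row per station and word, which is the shape the faceted plot expects.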

Merged_sentiment %>% 
  filter(n > 3) %>% 
  ggplot(aes(reorder(word, n), value, fill = Station)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Station, ncol = 2, scales = "free_x") +
  coord_flip() + 
  theme_classic() +
  labs(x = 'Sentiment Filled Words',
       y = 'Sentiment Score',
       title = 'Sentiment Levels of CNN, MSNBC, and Fox',
       subtitle = 'January 7th, 2021') +
  scale_x_discrete(guide = guide_axis(n.dodge = 3))

The visualization has one panel per news station and shows the sentiment levels for each segment. Only words said more than three times (n > 3) are plotted. The sentiment-filled words that appeared in each show run along the vertical axis, and the sentiment level attached to each word runs along the horizontal axis. In this visualization you can see the difference between CNN, MSNBC, and Fox. MSNBC has the most negative sentiment score: its scale goes down to negative three, and many of its words carry negative scores. The Fox segment shows the least negativity: its scale only goes down to negative two, the highest minimum of the three stations. CNN dips the lowest, all the way down to negative four, but its negative words appear much less often.

Frequency of Words: Insurrection, Riot, Protest, Mob, Attack Across CNN, MSNBC, and Fox.

In this section of code, I examined five words that carry different meanings but were commonly used when reporting on the Capitol insurrection (insurrection, riot, protest, mob, and attack). I wanted to see how often each of the five words was used by each news station, in order to examine whether watching a particular station could shape people's views of what happened during the insurrection.

CNN: The most common word was attack.

CNN_words1 %>% 
  filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
  count(word, sort = TRUE) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  theme_minimal() +
  labs(x ='Word',
       y = 'Frequency of the Word',
       title = 'CNN Popular Words: Anderson Cooper 360 Degrees',
       subtitle = 'January 7th, 2021')

MSNBC: The most common word was riot.

MSNBC_words %>% 
  filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
  count(word, sort = TRUE) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  theme_minimal() +
  labs(x ='Word',
       y = 'Frequency of the Word',
       title = 'MSNBC Popular Words: The Beat with Ari Melber',
       subtitle = 'January 7th, 2021')

FOX: The most common word was insurrection.

fox_words %>% 
  filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
  count(word, sort = TRUE) %>% 
  ggplot(aes(reorder(word, n), n)) + geom_col() +
  theme_minimal() +
  labs(x ='Word',
       y = 'Frequency of the Word',
       title = 'FOX Popular Words: Tucker Carlson Tonight',
       subtitle = 'January 7th, 2021')

These visualizations were interesting to me because each station's most frequent word was different. I had initially hypothesized that Fox News would avoid using the word insurrection and would instead refer to the events of January 6th, 2021 as a protest. In fact, Fox used insurrection more than any of the other four words. CNN used the word attack the most, and MSNBC used the word riot the most. No single word was used heavily across all three news stations.
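The three separate charts can also be summarized in a single table with one column per station. The sketch below shows one way to do that, assuming the tokens from all stations are stacked in one table with a `station` column. The `toy_tokens` data is made up for illustration and does not reflect the real counts.

```r
library(dplyr)
library(tidyr)

# Made-up tokens for illustration; the real input would be the stacked word tables
toy_tokens <- tibble(
  station = c("CNN", "CNN", "CNN", "MSNBC", "MSNBC", "FOX", "FOX", "FOX"),
  word    = c("attack", "attack", "mob", "riot", "riot",
              "insurrection", "protest", "people")
)

framing_words <- c("insurrection", "riot", "protest", "mob", "attack")

framing_table <- toy_tokens %>%
  filter(word %in% framing_words) %>%   # keep only the five words of interest
  count(station, word) %>%              # uses per station and word
  pivot_wider(names_from = station, values_from = n, values_fill = 0)

framing_table
```

Filling missing combinations with 0 makes it clear when a station never used a word, which the separate bar charts cannot show.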

Word Clouds:

In the next section of code I created wordclouds with the most common 100 words for each news station.

CNN Wordcloud:

library(wordcloud2)
CNN_words1 %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(100) %>% 
  wordcloud2()

CNN Wordcloud Table:

CNN_words1 %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  knitr::kable()

word	n
president	79
people	69
trump	39
capitol	32
yesterday	26
election	18
time	17
video	17
donald	15
25th	14
amendment	14
president’s	14
vice	13
house	12
pence	12
white	12
anderson	11
power	11
attack	10
country	10

MSNBC Wordcloud:

library(wordcloud2)
MSNBC_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(100) %>% 
  wordcloud2()

MSNBC Wordcloud Table:

MSNBC_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  knitr::kable()

word	n
people	55
police	34
trump	29
capitol	27
yesterday	22
officers	19
president	17
time	16
black	15
law	14
white	14
double	12
federal	12
america	11
gene	11
report	11
scene	11
americans	10
breaking	10
donald	10

FOX Wordcloud:

library(wordcloud2)
fox_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(100) %>% 
  wordcloud2()

FOX Wordcloud Table:

fox_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word == "null") %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  knitr::kable()

word	n
trump	54
people	51
donald	37
yesterday	27
happened	25
capitol	22
tonight	12
party	11
police	11
republican	11
york	11
language	10
media	10
supporters	10
america	9
insurrection	9
life	9
political	9
cnn	8
completely	8

Conclusion:

As a communications major, I am interested in the accurate and ethical broadcasting of news, whether that be on television, in an article, or through social media. Unfortunately, in the past couple of years, the United States has reached a point of extreme division, especially when it comes to political views. A large reason for this divide is that many people believe the first thing they hear, see, or read and do not feel the need to fact-check that information. Besides this, I believe that even large, long-standing news stations are sharing completely different messages about major and daily occurrences, such as the January 6th, 2021 Capitol insurrection. Misinformation is a very large problem, and it will affect the future of communications and the way in which information is shared. While this project represents only a minuscule portion of the reporting that has been done on the Capitol insurrection, it does give a direct comparison of three news broadcasts that aired on the same day, and probably on the same evening. People all around the country chose to tune into one of those shows, and from there their opinions about what had happened the day before at the Capitol began to take shape. After this, people usually stick to one station and hear only one side of an issue without questioning it. We as humans, citizens, students, etc. must always be in pursuit of accurate, ethical, and unbiased information.