News coverage today is shaped heavily by the political “lean” of the station delivering it, which is part of where the idea of “fake news” has come from in the past couple of years. Information is no longer just information; it arrives through a lens of bias and opinion. This can affect how people understand and interpret what is happening in the world, especially when some of what is broadcast is false. With media and news so accessible, it is easy to believe the first thing you read on your phone.
In this project, I look at three news segments broadcast on January 7th, 2021, the day after the January 6th Capitol insurrection. I completed a text analysis of the three segments to determine how coverage differed across three news stations: one left-leaning (MSNBC), one non-partisan (CNN), and one right-leaning (Fox News). I specifically chose segments hosted by one person to create a fairly equal comparison. From CNN I am looking at “Anderson Cooper 360 Degrees”, from MSNBC “The Beat with Ari Melber”, and from Fox News “Tucker Carlson Tonight.” The goal is to understand the role that politics plays in news coverage and in the information that people consume.
I cleaned each data set so that it includes only text from the actual broadcast; this meant deleting video inserts, captions marking the start and end of a video, commercial breaks, etc. The news segments analyzed are linked below.
CNN: http://www.cnn.com/TRANSCRIPTS/2101/07/acd.01.html
MSNBC: https://www.msnbc.com/transcripts/transcript-beat-ari-melber-january-7-2021-n1259046
First, let’s load the necessary packages:
library(tidyverse)
library(tidytext)
library(textdata)
(All news segments were coded separately in RStudio, but are put together here for comparison.)
CNN:
cnn_transcript <- read_delim("cnn transcript.txt",
delim = ";", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
MSNBC:
library(readr)
msnbc_transcript <- read_delim("msnbc transcript.txt",
delim = ";", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
FOX:
library(readr)
FOX_text <- read_delim("FOX text.txt", delim = ";",
escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
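One thing to watch with the imports above: `read_delim()` with `delim = ";"` treats every semicolon as a column break, so text that follows a semicolon on a line may end up outside `X1` and be skipped when only `X1` is tokenized. A minimal sketch of an alternative, assuming each transcript is a plain text file with one passage per line, is to read whole lines with `readr::read_lines()`. The helper name `read_transcript` and the temporary demo file are hypothetical and not part of the original analysis.

```r
library(readr)
library(tibble)

# Hypothetical helper: read a transcript as one row per line, in a column
# named X1 so that later unnest_tokens(word, X1) calls work unchanged.
read_transcript <- function(path) {
  tibble(X1 = read_lines(path))
}

# Demo on a temporary file (made-up text) that contains a semicolon
demo_path <- tempfile(fileext = ".txt")
write_lines(c("Good evening; thanks for joining us.",
              "We begin at the Capitol."), demo_path)

demo_transcript <- read_transcript(demo_path)
demo_transcript
```

Whether this changes the word counts reported below depends on how many semicolons the cleaned transcripts actually contain, which I have not checked here.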
I am now unnesting tokens for the three data sets in order to break the text down into individual words:
CNN:
cnn_transcript %>%
unnest_tokens(word, X1) -> CNN_words1
MSNBC:
msnbc_transcript %>%
unnest_tokens(word, X1) -> MSNBC_words
FOX:
FOX_text %>%
unnest_tokens(word, X1) -> fox_words
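To make the tokenizing step concrete, here is a small sketch on a made-up two-line transcript (the text and the `demo_` names are illustrative only). It uses the same `unnest_tokens(word, X1)` call as above and shows that the result has one row per word.

```r
library(tibble)
library(tidytext)

# Made-up two-line transcript in the same one-column shape as the imports
demo_transcript <- tibble(X1 = c("The Capitol was attacked.",
                                 "People are asking: what happened?"))

# One row per word; by default the text is lowercased and punctuation dropped
demo_words <- unnest_tokens(demo_transcript, word, X1)
demo_words$word
```

Because the tokens are lowercased by default, names such as "Trump" show up as `trump` in the counts and tables later on.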
After breaking the text down into individual words, I counted the number of words in each news segment in order to establish how long each segment was and how the segments varied in length. The CNN segment was the longest.
CNN: 8,109 words
CNN_words1 %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8109
MSNBC: 6,989 words
MSNBC_words %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 6989
FOX: 6,714 words
fox_words %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 6714
In the next step of code, I removed the stop words using “anti join” and created a ggplot showing the 20 most popular words for each show with the stop words removed. In each plot the words are mapped to the x aesthetic and their frequencies to the y aesthetic; because of coord_flip(), the words appear along the vertical axis and the frequencies along the horizontal axis.
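As a small illustration of the stop-word step, the sketch below runs `anti_join()` against tidytext's `stop_words` on a made-up word list (the `demo_` names and the words are illustrative only). Passing `by = "word"` is optional; it just states the join column explicitly instead of letting dplyr guess it.

```r
library(dplyr)
library(tibble)
library(tidytext)  # provides the stop_words data set

# Made-up tokens: two stop words ("the", "and") mixed with content words
demo_words <- tibble(word = c("the", "president", "and", "the",
                              "people", "president"))

demo_counts <- demo_words %>%
  anti_join(stop_words, by = "word") %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE)                # count what is left, most frequent first
demo_counts
```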
CNN: The most common word was president.
CNN_words1 %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
head(20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
coord_flip () +
theme_classic() +
labs(x ='Most Popular Words',
y = 'Frequency of Words',
title = 'CNN Popular Words: Anderson Cooper 360 Degrees',
subtitle = 'January 7th, 2021')
MSNBC: The most common word was people.
MSNBC_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
coord_flip () +
theme_classic() +
labs(x ='Most Popular Words',
y = 'Frequency of Words',
title = 'MSNBC Popular Words: The Beat with Ari Melber',
subtitle = 'January 7th, 2021')
FOX: The most common word was Trump.
fox_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(20)%>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
coord_flip () +
theme_classic() +
labs(x ='Most Popular Words',
y = 'Frequency of Words',
title = 'Fox News Popular Words: Tucker Carlson Tonight',
subtitle = 'January 7th, 2021')
Something interesting from these visualizations is that all three news segments had the words “people” and “Trump” among their top three words. I did not expect all three segments to share two of their most common words.
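Since the segments differ in length, raw counts are not directly comparable across stations. One possible adjustment, sketched here, is to express a count as a rate per 1,000 words. The totals are the segment lengths reported earlier, and the counts of "people" are taken from the 20-word tables near the end of this post; the column names are my own.

```r
library(dplyr)
library(tibble)

# Totals and counts copied from the outputs reported in this write-up
people_rates <- tibble(
  station     = c("CNN", "MSNBC", "FOX"),
  total_words = c(8109, 6989, 6714),  # segment lengths from count()
  people_n    = c(69, 55, 51)         # count of "people" from the word tables
) %>%
  mutate(per_1000 = round(1000 * people_n / total_words, 2))
people_rates
```

The same calculation could be applied to any other word in the tables by swapping in its counts.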
In the next step of code, I completed a sentiment analysis of the three news segments using two lexicons ‘afinn’ and ‘bing’.
I used afinn to calculate the mean sentiment value of each segment, averaged over the distinct words in the segment that appear in the lexicon. All three news segments had mean sentiment values below zero. This makes sense considering that all three shows were covering the Capitol insurrection.
CNN: -0.4482759
CNN_words1 %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('afinn'))-> CNN_sentiment
mean(CNN_sentiment$value)
## [1] -0.4482759
MSNBC: -0.5436893
MSNBC_words %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('afinn')) -> MSNBC_sentiment
mean(MSNBC_sentiment$value)
## [1] -0.5436893
FOX: -0.3919598
fox_words %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('afinn')) -> fox_sentiments
mean(fox_sentiments$value)
## [1] -0.3919598
According to the mean sentiment values, MSNBC, the most left-leaning segment analyzed, had the most negative average sentiment score (-0.5436893). Fox News, the most right-leaning segment analyzed, had the highest (least negative) average score (-0.3919598). Although the difference is not huge, it does suggest a difference in the way news is broadcast depending on political views.
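The means above are taken over one row per distinct word (the output of `count()` joined to the lexicon), so a word said thirty times weighs the same as a word said once. A hedged sketch of a frequency-weighted alternative uses base R's `weighted.mean()` with the counts as weights. The table below is a made-up stand-in with illustrative scores, not the real lexicon values or the real segment data.

```r
library(tibble)

# Made-up stand-in for a table like CNN_sentiment: one row per distinct word,
# n = times the word was said, value = AFINN-style score (illustrative only)
demo_sentiment <- tibble(
  word  = c("attack", "violence", "peaceful", "thank"),
  n     = c(10, 6, 2, 2),
  value = c(-1, -3, 2, 2)
)

# Each distinct word counts once
unweighted <- mean(demo_sentiment$value)

# Each utterance counts once: words said more often pull the mean harder
weighted <- weighted.mean(demo_sentiment$value, demo_sentiment$n)

c(unweighted = unweighted, weighted = weighted)
```

On the real tables the equivalent call would be something like `weighted.mean(CNN_sentiment$value, CNN_sentiment$n)`; I have not rerun the analysis this way, so the ranking of the three stations under weighting is unknown.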
CNN_sentiment %>%
mutate(Station = "CNN") -> CNN_sentiment
MSNBC_sentiment %>%
mutate(Station = "MSNBC") -> MSNBC_sentiment
fox_sentiments %>%
mutate(Station = "FOX") ->FOX_sentiment
CNN_sentiment %>%
full_join(MSNBC_sentiment) %>%
full_join(FOX_sentiment) -> Merged_sentiment
CNN_sentiment %>%
anti_join(stop_words) %>%
filter(!word %in% c('null')) %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort = TRUE) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) -> merged_sentiment
This visualization uses the merged data sets from the previous code. The three stations were first merged together into “Merged_sentiment”. In a separate step, I removed the stop words using anti join, filtered out a word that was not appearing properly (“null”), and inner joined the result with the “bing” lexicon. Note that this step starts from the CNN table and is saved as “merged_sentiment” (lowercase), so the plot below relies on “Merged_sentiment” and its afinn values.
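The merge above adds a `Station` column to each table and then chains two `full_join()` calls. A shorter way to get the same stacked shape, sketched here on made-up tables, is `dplyr::bind_rows()` with `.id`, which writes each list name into a new column. The `_demo` tables and their values are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up per-station tables with the same columns as the AFINN joins above
cnn_demo   <- tibble(word = c("attack", "violence"), n = c(10, 6), value = c(-1, -3))
msnbc_demo <- tibble(word = c("riot", "attack"),     n = c(8, 5),  value = c(-2, -1))
fox_demo   <- tibble(word = "attack",                n = 4,        value = -1)

# Stack the tables; .id stores each list name in a Station column
merged_demo <- bind_rows(
  list(CNN = cnn_demo, MSNBC = msnbc_demo, FOX = fox_demo),
  .id = "Station"
)
merged_demo
```

Stacking keeps one row per station and word, which is the shape `facet_wrap(~Station)` expects.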
Merged_sentiment %>%
filter(n > 3) %>%
ggplot(aes(reorder(word,n), value, fill=Station)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Station, ncol = 2, scales = "free_x") +
coord_flip()+
theme_classic() +
labs(x= 'Sentiment Filled Words',
y= 'Sentiment Score',
title = 'Sentiment Levels of CNN, MSNBC, and Fox',
subtitle = 'January 7th, 2021' ) +
scale_x_discrete(guide = guide_axis(n.dodge = 3))
The visualization has one panel per station and shows the sentiment score of each sentiment-bearing word that appeared more than three times in the segment. The words run along the vertical axis and their sentiment scores along the horizontal axis. In this visualization you can see the difference between CNN, MSNBC, and Fox. MSNBC has the most negative sentiment: its scale reaches negative three, and many of the words in its panel carry negative scores. The Fox segment shows the least negativity; its scale only goes down to negative two, the least negative minimum of the three stations. CNN dips the lowest, all the way down to negative four, but negative words appear far less often.
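The reading of the plot above (how low each station's scale goes and how much of its panel is negative) can also be checked numerically. The sketch below assumes a table shaped like `Merged_sentiment`, with `Station` and `value` columns, and uses made-up rows; the summary column names are my own.

```r
library(dplyr)
library(tibble)

# Made-up rows in the shape of Merged_sentiment (values are illustrative)
demo_merged <- tibble(
  Station = c("CNN", "CNN", "CNN", "MSNBC", "MSNBC", "FOX", "FOX"),
  word    = c("murder", "attack", "support", "violence", "riot", "riot", "safe"),
  value   = c(-4, -1, 2, -3, -2, -2, 1)
)

station_summary <- demo_merged %>%
  group_by(Station) %>%
  summarise(
    lowest_score   = min(value),       # how far the scale dips
    share_negative = mean(value < 0),  # fraction of words scored below zero
    .groups = "drop"
  )
station_summary
```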
In this section of code, I examined five words that carry different meanings but were all commonly used when reporting on the Capitol insurrection (insurrection, riot, protest, mob, and attack). I wanted to see how often each station used each of the five words, in order to examine whether the choice of station could affect people’s view of what happened during the insurrection.
CNN: The most common word was attack.
CNN_words1 %>%
filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
count(word, sort = TRUE) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
theme_minimal() +
labs(x ='Word',
y = 'Frequency of the Word',
title = 'CNN Popular Words: Anderson Cooper 360 Degrees',
subtitle = 'January 7th, 2021')
MSNBC: The most common word was riot.
MSNBC_words %>%
filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
count(word, sort = TRUE) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
theme_minimal() +
labs(x ='Word',
y = 'Frequency of the Word',
title = 'MSNBC Popular Words: The Beat with Ari Melber',
subtitle = 'January 7th, 2021')
FOX: The most common word was insurrection.
fox_words %>%
filter(word %in% c("insurrection", "riot", "protest", "mob", "attack")) %>%
count(word, sort = TRUE) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() +
theme_minimal() +
labs(x ='Word',
y = 'Frequency of the Word',
title = 'FOX Popular Words: Tucker Carlson Tonight',
subtitle = 'January 7th, 2021')
These visualizations were interesting to me because each station used a different word most frequently. I had initially hypothesized that Fox News would avoid using the word insurrection and would instead refer to the events of January 6th, 2021 as a protest. Fox actually used insurrection more than any of the other four words. CNN used attack the most, and MSNBC used riot the most. No single word was used with extreme frequency across the three stations.
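To compare the five words side by side rather than in three separate plots, one option is a single table with one row per word and one column per station. The sketch below shows the idea on a made-up token table; `framing_words`, `demo_tokens`, and `framing_counts` are hypothetical names, and the rows are not the real transcripts.

```r
library(dplyr)
library(tidyr)
library(tibble)

framing_words <- c("insurrection", "riot", "protest", "mob", "attack")

# Made-up token table: one row per spoken word, tagged with its station
demo_tokens <- tibble(
  Station = c("CNN", "CNN", "CNN", "MSNBC", "MSNBC", "FOX", "FOX", "FOX"),
  word    = c("attack", "attack", "mob", "riot", "riot",
              "insurrection", "protest", "people")
)

framing_counts <- demo_tokens %>%
  filter(word %in% framing_words) %>%  # keep only the five words of interest
  count(Station, word) %>%             # frequency per station and word
  pivot_wider(names_from = Station, values_from = n, values_fill = 0)
framing_counts
```

With the real data, the token table could be built by stacking the three word tables with a station label, for example `bind_rows(CNN = CNN_words1, MSNBC = MSNBC_words, FOX = fox_words, .id = "Station")`.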
In the next section of code I created wordclouds with the most common 100 words for each news station.
CNN Wordcloud:
library(wordcloud2)
CNN_words1 %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(100) %>%
wordcloud2()
CNN Wordcloud Table:
CNN_words1 %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable()
| word | n |
|---|---|
| president | 79 |
| people | 69 |
| trump | 39 |
| capitol | 32 |
| yesterday | 26 |
| election | 18 |
| time | 17 |
| video | 17 |
| donald | 15 |
| 25th | 14 |
| amendment | 14 |
| president’s | 14 |
| vice | 13 |
| house | 12 |
| pence | 12 |
| white | 12 |
| anderson | 11 |
| power | 11 |
| attack | 10 |
| country | 10 |
MSNBC Wordcloud:
library(wordcloud2)
MSNBC_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(100) %>%
wordcloud2()
MSNBC Wordcloud Table:
MSNBC_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable()
| word | n |
|---|---|
| people | 55 |
| police | 34 |
| trump | 29 |
| capitol | 27 |
| yesterday | 22 |
| officers | 19 |
| president | 17 |
| time | 16 |
| black | 15 |
| law | 14 |
| white | 14 |
| double | 12 |
| federal | 12 |
| america | 11 |
| gene | 11 |
| report | 11 |
| scene | 11 |
| americans | 10 |
| breaking | 10 |
| donald | 10 |
FOX Wordcloud:
library(wordcloud2)
fox_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(100) %>%
wordcloud2()
FOX Wordcloud Table:
fox_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(!word == "null") %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable()
| word | n |
|---|---|
| trump | 54 |
| people | 51 |
| donald | 37 |
| yesterday | 27 |
| happened | 25 |
| capitol | 22 |
| tonight | 12 |
| party | 11 |
| police | 11 |
| republican | 11 |
| york | 11 |
| language | 10 |
| media | 10 |
| supporters | 10 |
| america | 9 |
| insurrection | 9 |
| life | 9 |
| political | 9 |
| cnn | 8 |
| completely | 8 |
As a communications major, I am interested in the accurate and ethical broadcasting of news, whether that be on television, in an article, or through social media. Unfortunately, in the past couple of years, the United States has reached a point of extreme division, especially when it comes to political views. A large reason for this divide is that many people believe the first thing they hear, see, or read, and do not feel the need to fact-check that information. Beyond this, I believe that even large, long-standing news stations are sharing completely different messages about major and daily occurrences, for example, the January 6th, 2021 Capitol insurrection. Misinformation is a very large problem, and it will impact the future of communications and the way in which information is shared. While this project represents only a minuscule portion of the reporting that has been done on the Capitol insurrection, it does give a direct comparison of three news broadcasts that aired on the same day, and probably on the same evening. People all around the country chose to tune into one of those shows, and from there their opinions began to take shape about what had happened the day before at the Capitol. After this, people usually stick to one station and hear only one side of an issue without question. We as humans, citizens, students, etc. must always be in pursuit of accurate, ethical, and unbiased information.