Introduction
News media bias has long interested me. This project asks: is there a way to measure or find patterns in how a politically left-leaning news source differs from a right-leaning one? What key differences identify a right- or left-leaning news source?
The scope of this project was challenging: there are countless news sites on the internet. I also had to address questions of legitimacy, journalistic integrity, and standards, since some sites, like InfoWars, pass off unvalidated conspiracy theories as news. To help decide which news sites to examine, I used a guide from the University of Michigan that classifies news sources by the left- or right-leaning tendencies of their reader base: https://guides.lib.umich.edu/c.php?g=637508&p=4462444
Before describing my process, I will first address the challenges I ran into on this project.
Challenges
- I found a news API that allows downloading articles, but it required a payment I could not justify for this project. To be sure, I checked what the API's free tier returns. First, let's load the libraries we'll use throughout the project.
library(jsonlite)     #parse the JSON returned by the news API
library(tidyverse)    #data manipulation and plotting
library(tidytext)     #tokenization, stop words, sentiment lexicons
library(data.table)   #provides the %like% operator used later
library(SnowballC)    #word stemming
library(stringr)      #string helpers such as str_detect
library(RColorBrewer) #color palettes
Below, I used an API key obtained from https://www.newsapi.org to request NPR articles. Looking at the data, specifically the article content, we notice that the content is truncated.
#stores our main url in a variable
url <- 'https://newsapi.org/v2/everything?domains=npr.org&'
#api_key was set earlier to the key parameter, e.g. 'apiKey=<your key>' (key not shown)
all_articles <- fromJSON(paste(url, api_key, sep = ""), flatten = TRUE) %>%
  data.frame()
knitr::kable(head(all_articles$articles.content), format = "html")
| articles.content |
|---|
| President Trump has appointed Gen. Mark Milley to succeed Joint Chiefs Chairman Gen. Joseph Dunford. Andrew Harnik/AP President Trump made the announcement via Twitter. “I am pleased to announce my nomination of four-star Gen. Mark Milley, Chief of Staff of t… [+1195 chars] |
| Bradley Cooper plays Jackson Maine, a musician struggling with addiction, in A Star Is Born. Peter Lindbergh/Warner Bros. Fresh Air Weekend highlights some of the best interviews and reviews from past weeks, and new program elements specially paced for weeken… [+862 chars] |
| National flags are seen along the road to Eritrea in Zalambessa, northern Ethiopia, in September before a border reopening ceremony. Two land border crossings between Ethiopia and Eritrea were reopened for the first time in nearly 20 years. AFP/Getty Images A… [+5534 chars] |
| In Anna and the Apocalypse, Ella Hunt plays the titular teenager Anna, who fends off a zombie invasion in song. Gerardo Jaconelli/Orion Pictures Anna and the Apocalypse is a [checks notes] Scottish zombie Christmas high school musical. It drew raves in Great … [+4537 chars] |
| The Ferryman follows the large family of a man who was “disappeared” in the Northern Ireland conflicts of the late 20th century. Paddy Considine (standing) plays Quinn, the head of the family and the brother of the dead man. Joan Marcus/Courtesy of The Ferrym… [+5326 chars] |
| NA |
The truncation prompted me to scrape the articles myself.
For feasibility, I scraped only two sites and focused on their politics sections.
My original conservative news site was going to be Breitbart, which would have been ideal given its large reader base. Unfortunately, the volume of content Breitbart publishes caused memory issues on my computer, and I was not able to continue with it. I used National Review instead, with NPR as the left-leaning source.
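For reference, the scrape followed a common pattern: read a politics-section index page, collect the article links, then visit each link for its title, text, and publication date. Below is a minimal sketch with rvest; the scrape_politics() helper and the CSS selectors ('.item-info a', '.storytext p') are hypothetical placeholders, since each site needs its own selectors.
library(rvest)
library(tidyverse)
#hypothetical helper: scrape one politics-section index page
#the CSS selectors below are placeholders; the real ones differ per site
scrape_politics <- function(index_url) {
  index <- read_html(index_url)
  links <- index %>%
    html_elements('.item-info a') %>%  #placeholder selector for article links
    html_attr('href') %>%
    unique()
  map_dfr(links, function(link) {
    page <- read_html(link)
    tibble(
      title   = page %>% html_element('h1') %>% html_text2(),
      article = page %>% html_elements('.storytext p') %>%  #placeholder body selector
        html_text2() %>% paste(collapse = ' '),
      datepub = page %>% html_element('time') %>% html_attr('datetime'),
      link    = link
    )
  })
}
#e.g. scrape_politics('https://www.npr.org/sections/politics/') %>% write_csv('npr.csv')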
Reading in Our Data
First we'll read in the scraped data from NPR and National Review. Let's also take a look at how many articles were pulled from each site.
npr <- read_csv('npr.csv')
npr <- npr %>%
mutate(source = 'npr') %>%
mutate(date = as.Date(datepub))
natrev <- read_csv('natrev.csv')
natrev <- natrev %>%
mutate(source = 'national review') %>%
mutate(date = as.Date(datepub))
count(npr)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   488
count(natrev)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   615
The scrape pulled 488 articles from NPR and 615 articles from National Review. For convenience, let’s combine the two dataframes into one.
news <- full_join(npr, natrev, by=c('article','datepub','link','title','source','date'))
knitr::kable(head(news), format = "html")
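An aside on the join: since the two data frames share identical columns, the full join above simply stacks the rows, so bind_rows() would be the more direct equivalent:
#equivalent and more direct, since the columns match exactly
news <- bind_rows(npr, natrev)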
Word Frequency for Titles: What Is Being Reported?
Let’s start by seeing what topics are most frequently reported by each site. We’ll use the tidytext package to clean up our text data from the article titles and get a simple word frequency analysis. First let’s tidy our data.
title <- news %>%
unnest_tokens(word, title) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$")) %>%
#add count column
count(source,word,sort = TRUE)
#strip any residual whitespace from tokens
title$word <- gsub("\\s+","",title$word)
knitr::kable(head(title), format = "html")
| source | word | n |
|---|---|---|
| npr | trump | 85 |
| national review | trump | 64 |
| npr | house | 41 |
| npr | democrats | 38 |
| national review | judicial | 37 |
| npr | election | 37 |
Now that our data is tidy, let’s visualize what we found using ggplot2.
npr_title <- title %>%
filter(source == 'npr') %>%
mutate(rank = dense_rank(desc(n))) %>%
filter(rank <= 15)
ggplot(npr_title,
aes(x=reorder(word, n),y=n)) +
xlab("Word") +
geom_bar(stat = 'identity', aes(fill=n)) +
scale_fill_gradient() +
theme_bw() +
coord_flip() +
ggtitle('NPR')
[Figure: top 15 words in NPR titles.]
nat_title <- title %>%
filter(source == 'national review') %>%
mutate(rank = dense_rank(desc(n))) %>%
filter(rank <= 15)
ggplot(nat_title,
aes(x=reorder(word, n),y=n)) +
xlab("Word") +
geom_bar(stat = 'identity',aes(fill=n)) +
scale_fill_gradient(low = 'red', high = 'pink') +
theme_bw() +
coord_flip() +
ggtitle('National Review')
[Figure: top 15 words in National Review titles.]
For both sites, Trump was a huge topic, but a closer look at the graphs reveals some key differences. NPR mostly focused on the midterm elections and the outcome of close races after election day. National Review also covered the election, but gave roughly equal weight to immigration, the U.S. southern border, and liberals.
TF-IDF for Article Content
Now we’ll dive into the article contents. We’ll tidy our text the same way as we did with our titles.
#get each word into a row and get rid of stop words
news_w <- news %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"))
news_w$word <- gsub("\\s+","",news_w$word)
knitr::kable(head(news_w), format = "html")
Now that our data is mostly tidy, we'll use term frequency-inverse document frequency (TF-IDF) to zero in on the important words that are distinctive to National Review and NPR. A word's tf-idf is its frequency within a source (tf) weighted by how rare it is across sources (idf = ln(N / number of sources containing the word)), so words common to both sources are downweighted.
news_tfidf <- news_w %>%
count(source,word,sort = TRUE) %>%
bind_tf_idf(word,source,n) %>%
arrange(desc(tf_idf))
knitr::kable(head(news_tfidf), format = "html")
| source | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| national review | it’s | 406 | 0.0029283 | 0.6931472 | 0.0020297 |
| npr | npr’s | 420 | 0.0020847 | 0.6931472 | 0.0014450 |
| national review | trump’s | 245 | 0.0017671 | 0.6931472 | 0.0012248 |
| npr | caption | 291 | 0.0014444 | 0.6931472 | 0.0010012 |
| npr | byline | 231 | 0.0011466 | 0.6931472 | 0.0007947 |
| npr | archived | 226 | 0.0011217 | 0.6931472 | 0.0007775 |
npr_tfidf <- news_tfidf %>%
filter(source == 'npr') %>%
mutate(rank = dense_rank(desc(tf_idf))) %>%
filter(rank <= 20)
ggplot(npr_tfidf, aes(reorder(word,tf_idf),tf_idf)) +
geom_bar(stat = 'identity', aes(fill=tf_idf)) +
xlab('Word') +
ggtitle("NPR") +
scale_fill_gradient() +
theme_bw() +
coord_flip()
[Figure: top 20 TF-IDF words for NPR.]
nat_tfidf <- news_tfidf %>%
filter(source == 'national review') %>%
mutate(rank = dense_rank(desc(tf_idf))) %>%
filter(rank <= 20)
ggplot(nat_tfidf, aes(reorder(word,tf_idf),tf_idf)) +
geom_bar(stat = 'identity', aes(fill=tf_idf)) +
xlab('Word') +
ggtitle("National Review") +
scale_fill_gradient(low = 'red', high = 'pink') +
theme_bw() +
coord_flip()
[Figure: top 20 TF-IDF words for National Review.]
This is particularly notable. NPR's top words are mostly names of politicians or words associated with political figures, which is expected. National Review, however, shows a clear difference. To investigate, I read a few of its articles to get a sense of why its distinctive words were what they were. National Review articles read much closer to casual speech, and the writers do not seem concerned with appearing objective, which makes sense given that National Review advertises itself as a conservative news site. The high TF-IDF words suggest that the writer's viewpoint matters more than an objective dissemination of the news.
Sentiment Analysis
We'll now do a sentiment analysis of the text to see how each site might spin, or not spin, the news on a given day. In the code below, you may notice that I removed the word "trump": the Bing sentiment lexicon (accessed via tidytext) labels "trump" as a positive word, and leaving it in would produce a large number of false positives in our data set.
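We can confirm the labeling directly in the lexicon:
get_sentiments("bing") %>%
  filter(word == 'trump')
## # A tibble: 1 x 2
##   word  sentiment
##   <chr> <chr>
## 1 trump positive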
#tokenize the article text and score each word with the Bing lexicon
news_sent <- news %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"),
#had to remove trump as a word
word != 'trump') %>%
inner_join(get_sentiments("bing")) %>%
count(source,index = date, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
knitr::kable(head(news_sent), format = "html")
| source | index | negative | positive | sentiment |
|---|---|---|---|---|
| national review | 2018-10-31 | 200 | 121 | -79 |
| national review | 2018-11-01 | 408 | 256 | -152 |
| national review | 2018-11-02 | 489 | 313 | -176 |
| national review | 2018-11-03 | 144 | 66 | -78 |
| national review | 2018-11-04 | 90 | 53 | -37 |
| national review | 2018-11-05 | 294 | 218 | -76 |
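A side note on the pipeline above: spread() still works but has since been superseded in tidyr, with pivot_wider() as the modern equivalent of that step. A sketch of the same pipeline using it:
news_sent2 <- news %>%
  unnest_tokens(word, article) %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "^[0-9]*$"), word != 'trump') %>%
  inner_join(get_sentiments("bing")) %>%
  count(source, index = date, sentiment) %>%
  #pivot_wider replaces the spread(sentiment, n, fill = 0) step
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)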
Now that we have daily sentiment scores, we can build our visualization.
ggplot(news_sent, aes(index,sentiment)) +
geom_bar(stat = 'identity', aes(fill=source)) +
theme_bw() +
theme(legend.position = "bottom") +
facet_grid(.~source)
[Figure: daily net sentiment, faceted by source.]
Several contrasting peaks stand out. NPR's large positive peak falls on election day, likely from articles announcing the day's winners. National Review's negative peaks are harder to assess: a deep dive into the articles on those days shows coverage of a wide variety of political topics, but little reporting on political news from that specific day.
Since both news outlets report on Trump in roughly equal measure, I also ran a sentiment analysis restricted to articles about him.
trump_sent <- news %>%
  #keep articles with 'Trump' in the title (%like% comes from data.table)
  filter(title %like% 'Trump') %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"),
#had to remove trump as a word
word != 'trump') %>%
inner_join(get_sentiments("bing")) %>%
count(source, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(trump_sent, aes(source,sentiment)) +
geom_bar(stat = 'identity', aes(fill=source)) +
theme_bw()
[Figure: net sentiment of Trump-titled articles by source.]
We notice that the net negative sentiment toward Trump is roughly twice as large for NPR as for National Review, though part of that gap may simply reflect NPR publishing more Trump-headlined articles.
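To control for volume, one could normalize by the number of Trump-titled articles per source. A sketch (trump_counts is a name introduced here for illustration):
trump_counts <- news %>%
  filter(title %like% 'Trump') %>%
  count(source, name = 'articles')
trump_sent %>%
  left_join(trump_counts, by = 'source') %>%
  mutate(sentiment_per_article = sentiment / articles)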
Conclusion & Next Steps
There are clear differences in how the two sites report the news, both in the topics they select and in their writing style. The ideal next step would be to gather articles from many more news sources and test whether these patterns hold across them.
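If the hypothetical scrape_politics() helper sketched earlier were fleshed out, extending the pipeline to more sources could look like this (the index URLs are illustrative):
sources <- tribble(
  ~source,           ~index_url,
  'npr',             'https://www.npr.org/sections/politics/',
  'national review', 'https://www.nationalreview.com/politics-policy/'
)
#scrape every source and tag each row with its origin
news_all <- map2_dfr(sources$index_url, sources$source,
                     ~ scrape_politics(.x) %>% mutate(source = .y))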