Introduction
News media bias has long interested me. This project asks: is there a way to measure or find patterns in how a politically left-leaning news source differs from a right-leaning one? What key differences identify a right- or left-leaning news source?
The scope of this project was challenging: there are countless news sites on the internet. I also had to address questions of legitimacy, journalistic integrity, and standards, since some sites, like InfoWars, pass off unvalidated conspiracy theories as news. To help decide which news sites to examine, I used a guide from the University of Michigan that classifies news sources by the left- or right-leaning tendencies of their reader base: https://guides.lib.umich.edu/c.php?g=637508&p=4462444
Before describing my process, I will first address the challenges I ran into on this project.
Challenges
- I found a news API that allows downloading articles, but it required a payment I could not justify for this project. To be sure, I checked what the API's free tier returns. First, let's load the libraries we'll use throughout the project.
library(jsonlite)     #parse the JSON returned by the news API
library(tidyverse)    #data manipulation and plotting
library(tidytext)     #tokenization, stop words, sentiment lexicons
library(data.table)   #provides the %like% operator used later
library(SnowballC)    #word stemming
library(stringr)      #string helpers such as str_detect
library(RColorBrewer) #color palettes
Below, I used an API key obtained from https://www.newsapi.org to request NPR articles. Looking at the data, specifically the article content, we notice that the content is truncated.
#stores our main url in a variable
url <- 'https://newsapi.org/v2/everything?domains=npr.org&'
#api_key was set earlier to the key parameter, e.g. 'apiKey=<your key>' (key not shown)
all_articles <- fromJSON(paste(url, api_key, sep = ""), flatten = TRUE) %>%
  data.frame()
knitr::kable(head(all_articles$articles.content), format = "html")
| articles.content |
|---|
| President Trump has appointed Gen. Mark Milley to succeed Joint Chiefs Chairman Gen. Joseph Dunford. Andrew Harnik/AP President Trump made the announcement via Twitter. “I am pleased to announce my nomination of four-star Gen. Mark Milley, Chief of Staff of t… [+1195 chars] |
| Bradley Cooper plays Jackson Maine, a musician struggling with addiction, in A Star Is Born. Peter Lindbergh/Warner Bros. Fresh Air Weekend highlights some of the best interviews and reviews from past weeks, and new program elements specially paced for weeken… [+862 chars] |
| National flags are seen along the road to Eritrea in Zalambessa, northern Ethiopia, in September before a border reopening ceremony. Two land border crossings between Ethiopia and Eritrea were reopened for the first time in nearly 20 years. AFP/Getty Images A… [+5534 chars] |
| In Anna and the Apocalypse, Ella Hunt plays the titular teenager Anna, who fends off a zombie invasion in song. Gerardo Jaconelli/Orion Pictures Anna and the Apocalypse is a [checks notes] Scottish zombie Christmas high school musical. It drew raves in Great … [+4537 chars] |
| The Ferryman follows the large family of a man who was “disappeared” in the Northern Ireland conflicts of the late 20th century. Paddy Considine (standing) plays Quinn, the head of the family and the brother of the dead man. Joan Marcus/Courtesy of The Ferrym… [+5326 chars] |
| NA |
The truncation prompted me to scrape the articles myself.
For feasibility, I scraped only two sites and focused on their politics sections.
My original conservative news site was going to be Breitbart, which would have been ideal given its large reader base. Unfortunately, the volume of content Breitbart publishes caused memory issues on my computer, and I was not able to continue with it. I used National Review instead, with NPR as the left-leaning source.
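For reference, the scrape followed a common pattern: read a politics-section index page, collect the article links, then visit each link for its title, text, and publication date. Below is a minimal sketch with rvest; the scrape_politics() helper and the CSS selectors ('.item-info a', '.storytext p') are hypothetical placeholders, since each site needs its own selectors.
library(rvest)
library(tidyverse)
#hypothetical helper: scrape one politics-section index page
#the CSS selectors below are placeholders; the real ones differ per site
scrape_politics <- function(index_url) {
  index <- read_html(index_url)
  links <- index %>%
    html_elements('.item-info a') %>%  #placeholder selector for article links
    html_attr('href') %>%
    unique()
  map_dfr(links, function(link) {
    page <- read_html(link)
    tibble(
      title   = page %>% html_element('h1') %>% html_text2(),
      article = page %>% html_elements('.storytext p') %>%  #placeholder body selector
        html_text2() %>% paste(collapse = ' '),
      datepub = page %>% html_element('time') %>% html_attr('datetime'),
      link    = link
    )
  })
}
#e.g. scrape_politics('https://www.npr.org/sections/politics/') %>% write_csv('npr.csv')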
Reading in Our Data
First we'll read in the scraped data from NPR and National Review. Let's also take a look at how many articles were pulled from each site.
npr <- read_csv('npr.csv')
npr <- npr %>%
mutate(source = 'npr') %>%
mutate(date = as.Date(datepub))
natrev <- read_csv('natrev.csv')
natrev <- natrev %>%
mutate(source = 'national review') %>%
mutate(date = as.Date(datepub))
count(npr)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   488
count(natrev)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   615
The scrape pulled 488 articles from NPR and 615 articles from National Review. For convenience, let’s combine the two dataframes into one.
news <- full_join(npr, natrev, by=c('article','datepub','link','title','source','date'))
knitr::kable(head(news), format = "html")
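An aside on the join: since the two data frames share identical columns, the full join above simply stacks the rows, so bind_rows() would be the more direct equivalent:
#equivalent and more direct, since the columns match exactly
news <- bind_rows(npr, natrev)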
Word Frequency for Titles: What Is Being Reported?
Let’s start by seeing what topics are most frequently reported by each site. We’ll use the tidytext package to clean up our text data from the article titles and get a simple word frequency analysis. First let’s tidy our data.
title <- news %>%
unnest_tokens(word, title) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$")) %>%
#add count column
count(source,word,sort = TRUE)
#strip any residual whitespace from tokens
title$word <- gsub("\\s+","",title$word)
knitr::kable(head(title), format = "html")
| source | word | n |
|---|---|---|
| npr | trump | 85 |
| national review | trump | 64 |
| npr | house | 41 |
| npr | democrats | 38 |
| national review | judicial | 37 |
| npr | election | 37 |
Now that our data is tidy, let’s visualize what we found using ggplot2.
npr_title <- title %>%
filter(source == 'npr') %>%
mutate(rank = dense_rank(desc(n))) %>%
filter(rank <= 15)
ggplot(npr_title,
aes(x=reorder(word, n),y=n)) +
xlab("Word") +
geom_bar(stat = 'identity', aes(fill=n)) +
scale_fill_gradient() +
theme_bw() +
coord_flip() +
ggtitle('NPR')
[Figure: top 15 words in NPR titles.]
nat_title <- title %>%
filter(source == 'national review') %>%
mutate(rank = dense_rank(desc(n))) %>%
filter(rank <= 15)
ggplot(nat_title,
aes(x=reorder(word, n),y=n)) +
xlab("Word") +
geom_bar(stat = 'identity',aes(fill=n)) +
scale_fill_gradient(low = 'red', high = 'pink') +
theme_bw() +
coord_flip() +
ggtitle('National Review')
[Figure: top 15 words in National Review titles.]
For both sites, Trump was a huge topic, but a closer look at the graphs reveals some key differences. NPR mostly focused on the midterm elections and the outcome of close races after election day. National Review also covered the election, but gave roughly equal weight to immigration, the U.S. southern border, and liberals.
TF-IDF for Article Content
Now we’ll dive into the article contents. We’ll tidy our text the same way as we did with our titles.
#get each word into a row and get rid of stop words
news_w <- news %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"))
news_w$word <- gsub("\\s+","",news_w$word)
knitr::kable(head(news_w), format = "html")
Now that our data is mostly tidy, we'll use term frequency-inverse document frequency (TF-IDF) to zero in on the important words that are distinctive to National Review and NPR. A word's tf-idf is its frequency within a source (tf) weighted by how rare it is across sources (idf = ln(N / number of sources containing the word)), so words common to both sources are downweighted.
news_tfidf <- news_w %>%
count(source,word,sort = TRUE) %>%
bind_tf_idf(word,source,n) %>%
arrange(desc(tf_idf))
knitr::kable(head(news_tfidf), format = "html")
| source | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| national review | it’s | 406 | 0.0029283 | 0.6931472 | 0.0020297 |
| npr | npr’s | 420 | 0.0020847 | 0.6931472 | 0.0014450 |
| national review | trump’s | 245 | 0.0017671 | 0.6931472 | 0.0012248 |
| npr | caption | 291 | 0.0014444 | 0.6931472 | 0.0010012 |
| npr | byline | 231 | 0.0011466 | 0.6931472 | 0.0007947 |
| npr | archived | 226 | 0.0011217 | 0.6931472 | 0.0007775 |
npr_tfidf <- news_tfidf %>%
filter(source == 'npr') %>%
mutate(rank = dense_rank(desc(tf_idf))) %>%
filter(rank <= 20)
ggplot(npr_tfidf, aes(reorder(word,tf_idf),tf_idf)) +
geom_bar(stat = 'identity', aes(fill=tf_idf)) +
xlab('Word') +
ggtitle("NPR") +
scale_fill_gradient() +
theme_bw() +
coord_flip()
[Figure: top 20 TF-IDF words for NPR.]
nat_tfidf <- news_tfidf %>%
filter(source == 'national review') %>%
mutate(rank = dense_rank(desc(tf_idf))) %>%
filter(rank <= 20)
ggplot(nat_tfidf, aes(reorder(word,tf_idf),tf_idf)) +
geom_bar(stat = 'identity', aes(fill=tf_idf)) +
xlab('Word') +
ggtitle("National Review") +
scale_fill_gradient(low = 'red', high = 'pink') +
theme_bw() +
coord_flip()
[Figure: top 20 TF-IDF words for National Review.]
This is particularly notable. NPR's top words are mostly names of politicians or words associated with political figures, which is expected. National Review, however, shows a clear difference. To investigate, I read a few of its articles to get a sense of why its distinctive words were what they were. National Review articles read much closer to casual speech, and the writers do not seem concerned with appearing objective, which makes sense given that National Review advertises itself as a conservative news site. The high TF-IDF words suggest that the writer's viewpoint matters more than an objective dissemination of the news.
Sentiment Analysis
We'll now do a sentiment analysis of the text to see how each site might spin, or not spin, the news on a given day. In the code below, you may notice that I removed the word "trump": the Bing sentiment lexicon (accessed via tidytext) labels "trump" as a positive word, and leaving it in would produce a large number of false positives in our data set.
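We can confirm the labeling directly in the lexicon:
get_sentiments("bing") %>%
  filter(word == 'trump')
## # A tibble: 1 x 2
##   word  sentiment
##   <chr> <chr>
## 1 trump positive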
#tokenize the article text and score each word with the Bing lexicon
news_sent <- news %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"),
#had to remove trump as a word
word != 'trump') %>%
inner_join(get_sentiments("bing")) %>%
count(source,index = date, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
knitr::kable(head(news_sent), format = "html")
| source | index | negative | positive | sentiment |
|---|---|---|---|---|
| national review | 2018-10-31 | 200 | 121 | -79 |
| national review | 2018-11-01 | 408 | 256 | -152 |
| national review | 2018-11-02 | 489 | 313 | -176 |
| national review | 2018-11-03 | 144 | 66 | -78 |
| national review | 2018-11-04 | 90 | 53 | -37 |
| national review | 2018-11-05 | 294 | 218 | -76 |
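A side note on the pipeline above: spread() still works but has since been superseded in tidyr, with pivot_wider() as the modern equivalent of that step. A sketch of the same pipeline using it:
news_sent2 <- news %>%
  unnest_tokens(word, article) %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "^[0-9]*$"), word != 'trump') %>%
  inner_join(get_sentiments("bing")) %>%
  count(source, index = date, sentiment) %>%
  #pivot_wider replaces the spread(sentiment, n, fill = 0) step
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)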
Now that we have daily sentiment scores, we can build our visualization.
ggplot(news_sent, aes(index,sentiment)) +
geom_bar(stat = 'identity', aes(fill=source)) +
theme_bw() +
theme(legend.position = "bottom") +
facet_grid(.~source)
[Figure: daily net sentiment, faceted by source.]
Several contrasting peaks stand out. NPR's large positive peak falls on election day, likely from articles announcing the day's winners. National Review's negative peaks are harder to assess: a deep dive into the articles on those days shows coverage of a wide variety of political topics, but little reporting on political news from that specific day.
Since both news outlets report on Trump in roughly equal measure, I also ran a sentiment analysis restricted to articles about him.
trump_sent <- news %>%
  #keep articles with 'Trump' in the title (%like% comes from data.table)
  filter(title %like% 'Trump') %>%
unnest_tokens(word, article) %>%
anti_join(stop_words) %>%
#take out numbers
filter(!str_detect(word, "^[0-9]*$"),
#had to remove trump as a word
word != 'trump') %>%
inner_join(get_sentiments("bing")) %>%
count(source, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(trump_sent, aes(source,sentiment)) +
geom_bar(stat = 'identity', aes(fill=source)) +
theme_bw()
[Figure: net sentiment of Trump-titled articles by source.]
We notice that the net negative sentiment toward Trump is roughly twice as large for NPR as for National Review, though part of that gap may simply reflect NPR publishing more Trump-headlined articles.
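To control for volume, one could normalize by the number of Trump-titled articles per source. A sketch (trump_counts is a name introduced here for illustration):
trump_counts <- news %>%
  filter(title %like% 'Trump') %>%
  count(source, name = 'articles')
trump_sent %>%
  left_join(trump_counts, by = 'source') %>%
  mutate(sentiment_per_article = sentiment / articles)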
Conclusion & Next Steps
There are clear differences in how the two sites report the news, both in the topics they select and in their writing style. The ideal next step would be to gather articles from many more news sources and test whether these patterns hold across them.
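If the hypothetical scrape_politics() helper sketched earlier were fleshed out, extending the pipeline to more sources could look like this (the index URLs are illustrative):
sources <- tribble(
  ~source,           ~index_url,
  'npr',             'https://www.npr.org/sections/politics/',
  'national review', 'https://www.nationalreview.com/politics-policy/'
)
#scrape every source and tag each row with its origin
news_all <- map2_dfr(sources$index_url, sources$source,
                     ~ scrape_politics(.x) %>% mutate(source = .y))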