Introduction

Recently the NewsAPI aggregator gained an R package, newsanchor. You can see its vignette here . I will use newsanchor to pull a list of URLs for articles dealing with presidential candidates. Most of the code I am using is refactored from the vignette, and I built out some additional functions to make future queries easier. I filter my queries to keep only articles that mention the candidate's name in either the description or the title.
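For readers who want the shape of that step up front, here is a condensed sketch; the full helper functions are in the code appendix, and the hardcoded source index comes from newsanchor's terms_sources table:

library(newsanchor)
library(dplyr)
library(stringr)

# assumes set_api_key() has already stored a NewsAPI key
nyt <- terms_sources[122, 1]  # New York Times row in newsanchor's source table

res <- get_everything(query = "Cory Booker", sources = nyt,
                      from = "2019-01-01", to = "2019-06-08")

# keep only articles naming the candidate in the title or description
articles <- res$results_df %>%
  filter(str_detect(description, "Cory Booker") |
         str_detect(title, "Cory Booker"))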

Eventual Idea

While this article is mostly exploratory, it is the beginning of a plan to investigate media bias. In addition to this R script to scrape the Times, I have built a Python script to scrape the Washington Post. I plan to expand these scripts to cover several other news organizations, store the data in a SQL database, and automate alternating scrapes for weekly updates on all the major presidential candidates. I plan to use this data to develop my own sentiment analysis score for political language.
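As a sketch of what the storage step might look like (RMySQL is already loaded in the code appendix; the connection details and table name below are placeholders, not a working configuration):

library(DBI)
library(RMySQL)

# placeholder credentials; swap in the real host, user, and database
con <- dbConnect(MySQL(), host = "localhost", user = "scraper",
                 password = "****", dbname = "media_bias")

# append each weekly scrape to a running articles table
dbWriteTable(con, "nyt_articles", Cory_booker_corpus,
             append = TRUE, row.names = FALSE)
dbDisconnect(con)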

As it is primary season, I thought it might also be interesting to see what type of media coverage affects polling data. To understand any possible effect of media coverage, we need to understand the magnitude of dispersion across news networks. I will need to figure out a way to score how widespread articles are on public Facebook, Twitter, and other social networks, likely via network analysis. Viewership rates on public news stations will need to be scored as well.
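As a very rough sketch of what a spread score could look like (everything here is hypothetical; the shares edge list stands in for social-share data I have not collected yet):

library(igraph)

# hypothetical edge list: which accounts shared which article URLs
shares <- data.frame(account = c("a1", "a2", "a2", "a3"),
                     url     = c("u1", "u1", "u2", "u2"))

# bipartite account-article graph; an article's degree is a crude spread score
g <- graph_from_data_frame(shares, directed = FALSE)
degree(g, v = c("u1", "u2"))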

Queries

Daily NYT Article Count and Sentiment Analysis for Cory Booker

Daily NYT Article Count and Sentiment Analysis for Bernie Sanders

Daily NYT Article Count and Sentiment Analysis for Joe Biden

What happened on 3/27 for Cory Booker?

  • We can see that two articles were published, one on 3/27 and one on 5/6
  • As we see below, the sentiment analysis assigns a very large negative score to the word “gun”
    • This highlights the limits of scoring single words without context. Bigrams and trigrams, alongside a custom political sentiment dictionary, would likely be needed to dig deeper (see the sketch after the context table below)
    • What’s interesting is that we can look at the instances in which words like “crime” and “gun” are used in the articles
## # A tibble: 178 x 3
## # Groups:   url [2]
##    url                                                        word    score
##    <chr>                                                      <chr>   <int>
##  1 https://www.nytimes.com/2019/05/06/us/politics/cory-booke~ gun       -43
##  2 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ crime     -42
##  3 https://www.nytimes.com/2019/05/06/us/politics/cory-booke~ violen~   -36
##  4 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ proble~   -10
##  5 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ stop      -10
##  6 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ crimin~    -9
##  7 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ fire       -8
##  8 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ abuses     -6
##  9 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ arrest~    -6
## 10 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ kill       -6
## # ... with 168 more rows
context around “gun” (up to 3 words before and after)
seeks to combat gun violence through measures
the most progressive gun
plan to address gun violence is simple
been shattered by gun violence.”
is the proposed gun licensing program, which
to buy a gun would need to
complete a certified gun safety course.
being issued a gun license, which would
lists to obtain gun
University, said that gun control advocates and
effective at reducing gun homicides and suicides
63 percent of gun owners supported requiring
was among the gun policy experts consulted
Bear Arms, a gun
idea on the gun control wish list,”
renewed focus on gun violence comes just
on Monday, calling gun violence in the
been busy enacting gun
Still, America’s gun laws remain among
to deal with gun violence through executive
California has made gun control a central
records grappling with gun safety.
long supported stiffer gun laws, having introduced
record on gun control came under
bill that gave gun manufacturers legal immunity,
of Giffords, the gun violence prevention organization
topic connected to gun control: criminal justice
disproportionately affected by gun violence and incarcerated
Speaking about gun violence, he said,
seeks to end gun violence, said he
out muscularly for gun safety,” Mr. Feinblatt
their car at gun
gun his political career
surround my car, gun
and found two gun
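As suggested above, one way to recover context is to tokenize into bigrams rather than single words. Below is a minimal sketch using tidytext, assuming the negative_articles data frame from the code appendix; a custom political sentiment dictionary could then be joined on these phrases in place of AFINN:

library(dplyr)
library(tidyr)
library(tidytext)

# split each article body into overlapping two-word phrases
bigrams <- negative_articles %>%
  select(url, body) %>%
  unnest_tokens(bigram, body, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# count the words that immediately precede "gun" to see its modifiers
bigrams %>%
  filter(word2 == "gun") %>%
  count(word1, sort = TRUE)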

Conclusion

This was just a quick examination of some of the query results. I look forward to expanding this data collection process.

Code Appendix

knitr::opts_chunk$set(echo = TRUE)
rm(list=ls())
library(newsanchor)  # NewsAPI queries
library(robotstxt)   # check scraping permissions
library(httr)        # HTTP requests
library(rvest)       # HTML parsing
library(dplyr)       # data manipulation
library(stringr)     # string helpers
library(tidytext)    # tokenizing and sentiment lexicons
library(kableExtra)  # table formatting
library(knitr)
library(ggplot2)
library(plotly)
library(RMySQL)
#library(lubridate)
# sources: hardcoded row indices into newsanchor's terms_sources table
# (matching on the source name column would be more robust)
nyt  <- terms_sources[122, 1]
wapo <- terms_sources[129, 1]


# create query result timeframe
# todays_date <- ymd(format(Sys.time(), "%Y-%m-%d"))
# end <- as.character(todays_date)
# start <- as.character(todays_date%m-% days(7))

## Add api key 

#set_api_key(api_key = 'insert api key', 
#            path = "~/.Renviron")

## Functions to scrape NYT and add to existing metadata


## one-liner that takes inputs - query, start, end, and source
make_query <- function(query, start, end, sources){
  cleaned_query <- get_everything(query   = query,
                                  sources = sources,
                                  from    = start,
                                  to      = end)
  cleaned_query <- cleaned_query$results_df
  cleaned_query <- cleaned_query %>%
    filter(str_detect(description, query) |
           str_detect(title, query))
  cleaned_query$candidate <- query
  return(cleaned_query)
}

## takes url from query call and returns scraped full articles from NYT website
get_article_body <- function (url) {
  
  # download article page
  response <- GET(url)
  
  # check if request was successful
  if (response$status_code != 200) return(NA)
  
  # extract html
  html <- httr::content(x        = response, 
                  type     = "text", 
                  encoding = "UTF-8")
  
  # parse html
  parsed_html <- read_html(html)                   
  
  # define paragraph DOM selector
  selector <- "article#story div.StoryBodyCompanionColumn div p"
  
  # parse content
  parsed_html %>% 
    html_nodes(selector) %>%      # extract all paragraphs within class 'article-section'
    html_text() %>%               # extract content of the <p> tags
    str_replace_all("\n", "") %>% # replace all line breaks
    paste(collapse = " ")         # join all paragraphs into one string
}

## loops over all url in query call and executes get_article_body function to return full article
make_corpus <- function(article_list){
    article_list$body <- NA

# loop through articles and "apply" function
    for (i in 1:nrow(article_list)) {
  
# "apply" function to i url
    article_list$body[i] <- get_article_body(article_list$url[i])

     Sys.sleep(1)
    }
    
# drop hourly data
    article_list$published_at <- as.Date(article_list$published_at, "%Y-%m-%d")
    return (article_list)
}


## Function returns plotly graphs to compare article sentiment and number articles by date
melted_corpus<- function(complete_corpus){
        
sentiment_by_day <- complete_corpus %>%
  select(url, body) %>%                                  # extract required columns 
  unnest_tokens(word, body) %>%                          # split each article into single words
  anti_join(get_stopwords(), by = "word") %>%            # remove stopwords
  inner_join(get_sentiments("afinn"), by = "word") %>%   # join sentiment scores
  group_by(url) %>%                                      # group text again by their URL
  summarise(sentiment = sum(score)) %>%                  # sum up sentiment scores
  left_join(complete_corpus, by = "url") %>%                     # add sentiment column to articles
  select(published_at, sentiment) %>%                    # extract required columns 
  group_by(date = as.Date(published_at, "%Y-%m-%d")) %>% # group by date
  summarise(sentiment = mean(sentiment), n = n())        # calculate summaries
return(sentiment_by_day)
}

    
# Function plot number of articles vs. time 

plot_by_day <- function(sentiment_by_day){
  
  num_articles <- ggplot(data=sentiment_by_day, aes(x=date, y=n)) +
  geom_bar(stat="identity", fill="steelblue")+
  theme_minimal()

# plot sentiment score vs. time
  num_sentiment <- ggplot(data=sentiment_by_day, aes(x=date, y=sentiment)) +
  geom_bar(stat="identity", fill="steelblue")+
  #geom_text(aes(label=len), vjust=-0.3, size=3.5)+
  theme_minimal()

subplot(ggplotly(num_sentiment), ggplotly(num_articles), nrows = 2, margin = 0.04, heights = c(0.6, 0.4))
}




#write.csv(Cory_booker_query$url,"cory_booker_csv_links.csv")


## booker calls 
Cory_booker_query <- make_query(query='Cory Booker', start="2019-01-01", end="2019-06-08", sources=nyt)
Cory_booker_corpus <- make_corpus(Cory_booker_query)
Cory_booker_melted <- melted_corpus(Cory_booker_corpus)
plot_by_day(Cory_booker_melted)

## bernie calls
Bernie_Sanders_query <- make_query(query='Bernie Sanders', start="2019-01-01", end="2019-06-08", sources=nyt)
Bernie_corpus <- make_corpus(Bernie_Sanders_query)
Bernie_melted <- melted_corpus(Bernie_corpus)
plot_by_day(Bernie_melted)

## Biden calls
Joe_Biden_query <- make_query(query='Joe Biden', start="2019-01-01", end="2019-06-08", sources=nyt)
Joe_Biden_corpus <- make_corpus(Joe_Biden_query)
Joe_Biden_melted <- melted_corpus(Joe_Biden_corpus)
plot_by_day(Joe_Biden_melted)





## Alternative facet wrap graphing options
## Build df
#full_df <- rbind(Joe_Biden_melted, Bernie_melted, Cory_booker_melted)
# note: melted_corpus drops the candidate column, so it would need to be
# re-added to each data frame before faceting by candidate
# sentiment_graphs <- ggplot(full_df, aes(date,sentiment, fill = candidate)) +
#   geom_col(show.legend = FALSE) +
#   facet_wrap(~candidate, ncol = 1, scales = "free_x")
#     
# 
# number_art_graphs <- ggplot(full_df, aes(date,n, fill = candidate)) +
#   geom_col(show.legend = FALSE) +
#   facet_wrap(~candidate, ncol = 1, scales = "free_x")
# 
# subplot(ggplotly(sentiment_graphs),ggplotly(number_art_graphs))

negative_articles <- Cory_booker_corpus %>% 
    filter(str_detect(published_at, "2019-03-27|2019-05-06"))


guns_article <- negative_articles

guns_article %>% 
 select(url, body) %>%                                  # extract required columns 
  unnest_tokens(word, body) %>%                          # split each article into single words
  anti_join(get_stopwords(), by = "word") %>%            # remove stopwords
  inner_join(get_sentiments("afinn"), by = "word") %>% 
    group_by(url,word) %>% 
    summarize(score=sum(score)) %>% 
    arrange(score) %>% 
    select(word,score)

custom_sentiments <- get_sentiments("afinn") %>% 
    filter(word!="gun")

guns_article <- guns_article %>% 
 select(url, body) %>%                                  # extract required columns 
  unnest_tokens(word, body) %>%                          # split each article into single words
  anti_join(get_stopwords(), by = "word") %>%            # remove stopwords
  inner_join(custom_sentiments, by = "word") %>% 
    group_by(url,word) %>% 
    summarize(score=sum(score)) %>% 
    arrange(score) %>% 
    select(word,score)


library(tm)
library(NLP)
library(openNLP)

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  # call NLP::annotate explicitly so ggplot2::annotate cannot mask it
  sentence.boundaries <- NLP::annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

gun_articles_sentences <- convert_text_to_sentences(negative_articles$body)


## grab sentences with gun up to 3 words before and after
use_of_guns <- stringr::str_extract(gun_articles_sentences, "([^\\s]+\\s){0,3}gun(\\s[^\\s]+){0,3}")
use_of_guns <- use_of_guns[!is.na(use_of_guns)]
kable(use_of_guns)   
# this R markdown chunk generates a code appendix