Recently the NewsAPI aggregator added an R package, newsanchor. You can see their vignette here. I will use newsanchor to pull a list of URLs for articles dealing with presidential candidates. Most of the code I am using is refactored from the vignette, plus some additional functions I built to make future queries easier. I filter my queries to articles displaying the candidate's name in either the description or the title.
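To make the query-and-filter step concrete, here is a minimal sketch of a single call (the refactored helper functions appear in full in the code appendix below; "the-new-york-times" is NewsAPI's source id):

```r
library(newsanchor)
library(dplyr)
library(stringr)

# one page of NYT results for one candidate over the primary-season window,
# keeping only articles that name the candidate in the title or description
results <- get_everything(query = "Cory Booker",
                          sources = "the-new-york-times",
                          from = "2019-01-01",
                          to = "2019-06-08")$results_df %>%
  filter(str_detect(title, "Cory Booker") |
           str_detect(description, "Cory Booker"))
```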
While this article is mostly exploratory, it is the beginning of a plan to investigate media bias. In addition to this R script to scrape the Times, I have built a Python script to scrape the Washington Post. I am planning to expand these scripts to scrape several other news organizations. I will store the data in a SQL database and automate alternating scrapes for weekly updates of all the major presidential candidates. I plan to use this data to develop my own sentiment analysis score for political language.
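As a sketch of that storage step (the database name, table name, and credentials below are placeholders, not a finished design), each weekly scrape could be appended to a MySQL table through DBI:

```r
library(DBI)
library(RMySQL)

# hypothetical connection; swap in real host/user/database values
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "media_bias",
                 host = "localhost",
                 user = "scraper",
                 password = Sys.getenv("MYSQL_PWD"))

# append this week's scraped articles to a running table
dbWriteTable(con, "nyt_articles", Cory_booker_corpus,
             append = TRUE, row.names = FALSE)
dbDisconnect(con)
```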
As it is primary season, I thought it might also be interesting to see what type of media coverage affects polling data. To understand any possible effect of media coverage, we need to understand the magnitude of dispersion in news networks. I will need to figure out a way to score how widespread articles are on public Facebook, Twitter, and other social networks, likely via network analysis. Viewership rates on public news stations will need to be scored as well.
[Figure: Daily NYT article count and sentiment score for Cory Booker]
[Figure: Daily NYT article count and sentiment score for Bernie Sanders]
[Figure: Daily NYT article count and sentiment score for Joe Biden]
## # A tibble: 178 x 3
## # Groups: url [2]
## url word score
## <chr> <chr> <int>
## 1 https://www.nytimes.com/2019/05/06/us/politics/cory-booke~ gun -43
## 2 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ crime -42
## 3 https://www.nytimes.com/2019/05/06/us/politics/cory-booke~ violen~ -36
## 4 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ proble~ -10
## 5 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ stop -10
## 6 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ crimin~ -9
## 7 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ fire -8
## 8 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ abuses -6
## 9 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ arrest~ -6
## 10 https://www.nytimes.com/2019/03/27/us/politics/cory-booke~ kill -6
## # ... with 168 more rows
The most negative terms in these two Booker articles cluster around gun policy, so I pulled each occurrence of "gun" with up to three words of context on either side:

Context around "gun" |
---|
seeks to combat gun violence through measures |
the most progressive gun |
plan to address gun violence is simple |
been shattered by gun violence.” |
is the proposed gun licensing program, which |
to buy a gun would need to |
complete a certified gun safety course. |
being issued a gun license, which would |
lists to obtain gun |
University, said that gun control advocates and |
effective at reducing gun homicides and suicides |
63 percent of gun owners supported requiring |
was among the gun policy experts consulted |
Bear Arms, a gun |
idea on the gun control wish list,” |
renewed focus on gun violence comes just |
on Monday, calling gun violence in the |
been busy enacting gun |
Still, America’s gun laws remain among |
to deal with gun violence through executive |
California has made gun control a central |
records grappling with gun safety. |
long supported stiffer gun laws, having introduced |
record on gun control came under |
bill that gave gun manufacturers legal immunity, |
of Giffords, the gun violence prevention organization |
topic connected to gun control: criminal justice |
disproportionately affected by gun violence and incarcerated |
Speaking about gun violence, he said, |
seeks to end gun violence, said he |
out muscularly for gun safety,” Mr. Feinblatt |
their car at gun |
gun his political career |
surround my car, gun |
and found two gun |
This was just a quick examination of some of the query results. I look forward to expanding this data collection process.
knitr::opts_chunk$set(echo = TRUE)
rm(list=ls())
library(newsanchor)
library(robotstxt)
library(httr)
library(rvest)
library(dplyr)
library(stringr)
library(tidytext)
library(kableExtra)
library(knitr)
library(ggplot2)
library(plotly)
library(RMySQL)
#library(lubridate)
# sources
nyt  <- terms_sources[122, 1]
wapo <- terms_sources[129, 1]
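# Note: positional indexing into terms_sources is fragile across newsanchor
# versions; NewsAPI's documented source ids could be used directly instead:
# nyt  <- "the-new-york-times"
# wapo <- "the-washington-post"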
# create query result timeframe
# todays_date <- ymd(format(Sys.time(), "%Y-%m-%d"))
# end <- as.character(todays_date)
# start <- as.character(todays_date%m-% days(7))
## Add api key
#set_api_key(api_key = 'insert api key',
# path = "~/.Renviron")
## Functions to scrape NYT and add to existing metadata
## one liner that takes inputs - query, start, end, and source
make_query <- function(query, start, end, sources){
  cleaned_query <- get_everything(query = query,
                                  sources = sources,
                                  from = start,
                                  to = end)
  cleaned_query <- cleaned_query$results_df
  cleaned_query <- cleaned_query %>%
    filter(str_detect(description, query) |
             str_detect(title, query))
  cleaned_query$candidate <- query
  return(cleaned_query)
}
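# Note on limits (an assumption about the API plan in use): get_everything()
# returns a single page of results (NewsAPI caps page_size at 100, and free
# keys cap total results), so long date ranges may be truncated; newsanchor's
# get_everything_all() pages through everything the plan allows.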
## takes url from query call and returns scraped full articles from NYT website
get_article_body <- function(url) {
  # download article page
  response <- GET(url)
  # check if request was successful
  if (response$status_code != 200) return(NA)
  # extract html
  html <- httr::content(x = response,
                        type = "text",
                        encoding = "UTF-8")
  # parse html
  parsed_html <- read_html(html)
  # define paragraph DOM selector
  selector <- "article#story div.StoryBodyCompanionColumn div p"
  # parse content
  parsed_html %>%
    html_nodes(selector) %>%       # extract all paragraphs within the story body
    html_text() %>%                # extract content of the <p> tags
    str_replace_all("\n", "") %>%  # remove line breaks
    paste(collapse = " ")          # join all paragraphs into one string
}
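# Note: the CSS selector above matches the NYT article layout as of 2019; if
# the markup changes, html_nodes() returns no paragraphs and the body comes
# back as an empty string rather than an error.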
## loops over all url in query call and executes get_article_body function to return full article
make_corpus <- function(article_list){
  article_list$body <- NA
  # loop through articles and "apply" the scraper to each url
  for (i in seq_len(nrow(article_list))) {
    article_list$body[i] <- get_article_body(article_list$url[i])
    Sys.sleep(1)  # be polite to the server between requests
  }
  # drop the time-of-day component, keeping only the date
  article_list$published_at <- as.Date(article_list$published_at, "%Y-%m-%d")
  return(article_list)
}
## Function returns a per-day summary of mean sentiment score and article count
melted_corpus <- function(complete_corpus){
  sentiment_by_day <- complete_corpus %>%
    select(url, body) %>%                            # extract required columns
    unnest_tokens(word, body) %>%                    # split each article into single words
    anti_join(get_stopwords(), by = "word") %>%      # remove stopwords
    inner_join(get_sentiments("afinn"), by = "word") %>%  # join sentiment scores
    group_by(url) %>%                                # group text again by URL
    summarise(sentiment = sum(score)) %>%            # sum up sentiment scores
    left_join(complete_corpus, by = "url") %>%       # add article metadata back
    select(published_at, sentiment) %>%              # extract required columns
    group_by(date = as.Date(published_at, "%Y-%m-%d")) %>%  # group by date
    summarise(sentiment = mean(sentiment), n = n())  # mean sentiment and article count per day
  return(sentiment_by_day)
}
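# Note: this assumes an older tidytext where get_sentiments("afinn") returns a
# `score` column; in current tidytext (afinn now ships via the textdata
# package) the column is named `value`, so newer installs need sum(value).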
## Function plots number of articles and mean sentiment score vs. time
plot_by_day <- function(sentiment_by_day){
  # plot number of articles vs. time
  num_articles <- ggplot(data = sentiment_by_day, aes(x = date, y = n)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    theme_minimal()
  # plot sentiment score vs. time
  num_sentiment <- ggplot(data = sentiment_by_day, aes(x = date, y = sentiment)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    theme_minimal()
  subplot(ggplotly(num_sentiment), ggplotly(num_articles),
          nrows = 2, margin = 0.04, heights = c(0.6, 0.4))
}
#write.csv(Cory_booker_query$url,"cory_booker_csv_links.csv")
## booker calls
Cory_booker_query <- make_query(query = 'Cory Booker', start = "2019-01-01", end = "2019-06-08", sources = nyt)
Cory_booker_corpus <- make_corpus(Cory_booker_query)
Cory_booker_melted <- melted_corpus(Cory_booker_corpus)
plot_by_day(Cory_booker_melted)
## bernie calls
Bernie_Sanders_query <- make_query(query = 'Bernie Sanders', start = "2019-01-01", end = "2019-06-08", sources = nyt)
Bernie_corpus <- make_corpus(Bernie_Sanders_query)
Bernie_melted <- melted_corpus(Bernie_corpus)
plot_by_day(Bernie_melted)
## Biden calls
Joe_Biden_query <- make_query(query = 'Joe Biden', start = "2019-01-01", end = "2019-06-08", sources = nyt)
Joe_Biden_corpus <- make_corpus(Joe_Biden_query)
Joe_Biden_melted <- melted_corpus(Joe_Biden_corpus)
plot_by_day(Joe_Biden_melted)
## Alternative facet-wrap graphing options
## Build df (note: melted_corpus would need to retain the candidate column for the fill aesthetic below)
#full_df <- rbind(Joe_Biden_melted,Bernie_melted,Cory_booker_melted)
# sentiment_graphs <- ggplot(full_df, aes(date,sentiment, fill = candidate)) +
# geom_col(show.legend = FALSE) +
# facet_wrap(~candidate, ncol = 1, scales = "free_x")
#
#
# number_art_graphs <- ggplot(full_df, aes(date,n, fill = candidate)) +
# geom_col(show.legend = FALSE) +
# facet_wrap(~candidate, ncol = 1, scales = "free_x")
#
# subplot(ggplotly(sentiment_graphs),ggplotly(number_art_graphs))
# pull the two strongly negative Booker articles by publication date
# (str_detect needs a single alternation pattern, not a vector of patterns)
negative_articles <- Cory_booker_corpus %>%
  filter(str_detect(published_at, "2019-03-27|2019-05-06"))
guns_article <- negative_articles
guns_article %>%
  select(url, body) %>%                        # extract required columns
  unnest_tokens(word, body) %>%                # split each article into single words
  anti_join(get_stopwords(), by = "word") %>%  # remove stopwords
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(url, word) %>%
  summarize(score = sum(score)) %>%
  arrange(score) %>%
  select(word, score)
# drop "gun" from the lexicon so the word itself doesn't dominate the scores
custom_sentiments <- get_sentiments("afinn") %>%
  filter(word != "gun")
guns_article <- guns_article %>%
  select(url, body) %>%                        # extract required columns
  unnest_tokens(word, body) %>%                # split each article into single words
  anti_join(get_stopwords(), by = "word") %>%  # remove stopwords
  inner_join(custom_sentiments, by = "word") %>%
  group_by(url, word) %>%
  summarize(score = sum(score)) %>%
  arrange(score) %>%
  select(word, score)
library(tm)
library(NLP)
library(openNLP)
convert_text_to_sentences <- function(text, lang = "en") {
  # compute sentence annotations using the Apache OpenNLP Maxent sentence
  # detector with the default model for the given language
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  # convert text to class String from package NLP
  text <- as.String(text)
  # find sentence boundaries (NLP::annotate, not ggplot2::annotate)
  sentence.boundaries <- NLP::annotate(text, sentence_token_annotator)
  # extract and return the sentences
  sentences <- text[sentence.boundaries]
  return(sentences)
}
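# Note: openNLP runs on Java via rJava, and the English sentence model comes
# from the openNLPdata package; both must be installed for
# Maxent_Sent_Token_Annotator() to work.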
gun_articles_sentences <- convert_text_to_sentences(negative_articles$body)
## grab sentences with gun up to 3 words before and after
use_of_guns <- stringr::str_extract(gun_articles_sentences, "([^\\s]+\\s){0,3}gun(\\s[^\\s]+){0,3}")
use_of_guns <- use_of_guns[!is.na(use_of_guns)]
kable(use_of_guns)
# this R markdown chunk generates a code appendix