My motivation is to investigate the sentiment of public news articles that cover a similar topic from seemingly different points of view. Secondarily, I’m interested in comparing two news orgs that have themselves been the topic of news in recent days: FoxNews.com and NYTimes.com.
I consider myself liberal but have not editorialized the data whatsoever for this observational analysis (i.e. if an article met the criteria and wasn’t mis-formatted or otherwise difficult to process, I kept it in). Observations may be left-leaning.
This final project investigates sentiment and provides observations on the differences between the news orgs by analyzing web-scraped news articles published on the subject of “Trump” from NYTimes.com and FoxNews.com during the first 100 days of his presidency. Articles (and their respective URLs) were selected via search results from each news org’s own site (API for NYT, website scrape for FN). Article texts were scraped in batches, combined locally, and made available on GitHub for reproducibility.
Using the 100-day daily mean AFINN sentiment score from each news source as the sample, I suspect that the average sentiment scores from these fine news organizations will differ, so I structure my test as follows:
\(H_{0}: \mu_{diff} = 0\), i.e. there is no difference in the means.

\(H_{A}: \mu_{diff} \neq 0\), i.e. there is a difference in the means.
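For reference, the theoretical test used at the end of this analysis relies on the standard two-sample statistic for a difference in means (the subscripts FN and NYT simply label the two samples of daily means):

\[ Z = \frac{\bar{x}_{FN} - \bar{x}_{NYT}}{\sqrt{\frac{s_{FN}^{2}}{n_{FN}} + \frac{s_{NYT}^{2}}{n_{NYT}}}} \]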
Will the AFINN sentiment averages for these news articles of similar size, topic, and timeframe be different? Likely. But not necessarily in the way you might expect.
Below are the libraries used for this project:
# for appearance, table formatting
library(prettydoc)
library(knitr)
library(scales)
# for data manipulation, visualization, and stats
library(dplyr)
library(stringr)
library(tidyjson)
library(tidyr)
library(tidytext)
library(tidyverse)
library(ggplot2)
library(DATA606)
# for web scraping:
library(RCurl)
library(RSelenium)
library(Rwebdriver)
library(rvest)
library(httr)
library(XML)
As in my previous NYTimes.com projects (here and here), I returned to the API to pull results for the query “Trump” published during the first 100 days (January 20 through April 29, 2017).
The results of this query would then be used to pull articles from nytimes.com.
Below, we look at a single day example of pulling the results from the NYT API.
t_f100_example <- resolve_nyt("Trump", 20170428, 20170428) %>%
select(mainheadline,pub_date, article_by)
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
knitr::kable(t_f100_example, caption = "NYTimes API results example")
mainheadline | pub_date | article_by |
---|---|---|
Movers: Trump Bump, Time Inc. and Exxon | 2017-04-28T12:11:41+0000 | By THE NEW YORK TIMES |
Trump Nominates Former Disaster Relief Manager to Lead FEMA | 2017-04-28T18:14:43+0000 | By RON NIXON |
‘The President Show’ Puts Trump in the Host’s Chair | 2017-04-28T08:44:57+0000 | By GIOVANNI RUSSONELLO |
Trump Orders Easing Safety Rules Implemented After Gulf Oil Spill | 2017-04-28T01:00:04+0000 | By CORAL DAVENPORT |
Trump Orders Review of Safety Rules Created After Gulf Oil Spill | 2017-04-28T20:37:32+0000 | By CORAL DAVENPORT |
Trump on Being President: ‘I Thought It Would Be Easier’ | 2017-04-28T12:25:07+0000 | By CHRISTOPHER MELE |
Trump Warns That ‘Major, Major Conflict’ With North Korea Is Possible | 2017-04-28T02:46:27+0000 | By GERRY MULLANY |
‘Trump Bump’ Lifts Stocks, Giving President a Win for His First 100 Days | 2017-04-28T20:06:41+0000 | By MICHAEL J. de la MERCED |
Trump Tells N.R.A. Convention, ‘I Am Going to Come Through for You’ | 2017-04-28T09:00:27+0000 | By MICHAEL D. SHEAR |
Court Gives Trump Small Victory in Push Against Clean Power Plan | 2017-04-28T17:23:33+0000 | By CORAL DAVENPORT |
Court Gives Trump Small Victory in Push Against Clean Power Plan | 2017-04-28T17:23:33+0000 | By CORAL DAVENPORT |
Under the Trump Tax Plan, We Might All Want to Become Corporations | 2017-04-28T19:05:45+0000 | By NEIL IRWIN |
Trump Rattles South Korea by Saying It Should Pay for Antimissile System | 2017-04-28T10:47:49+0000 | By CHOE SANG-HUN |
Utah Attorney General Makes a Trump Shortlist, and Donations Pour In | 2017-04-28T09:00:31+0000 | By MATT APUZZO |
Trump Tax Plan Would Shift Trillions From U.S. Coffers to the Richest | 2017-04-28T01:26:18+0000 | By JULIE HIRSCHFELD DAVIS and PATRICIA COHEN |
Trump on North Korea: Tactic? ‘Madman Theory’? Or Just Mixed Messages? | 2017-04-28T15:00:08+0000 | By DAVID E. SANGER |
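The resolve_nyt() helper comes from my earlier projects and isn’t reproduced here. Roughly, it queries the NYT Article Search API for a term between two dates and flattens the JSON response. A minimal sketch of the idea (not the original tidyjson implementation; assumes the jsonlite package and a placeholder api-key):

# Hypothetical sketch of what resolve_nyt() does under the hood:
# query the Article Search API and keep the fields used above.
resolve_nyt_sketch <- function(term, s_date, e_date, page = 0, api_key = "YOUR_KEY"){
  base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
  resp <- jsonlite::fromJSON(paste0(base_url,
                                    "?q=", URLencode(term),
                                    "&begin_date=", s_date,
                                    "&end_date=", e_date,
                                    "&page=", page,
                                    "&api-key=", api_key))
  docs <- resp$response$docs
  data.frame(mainheadline = docs$headline$main,
             pub_date     = docs$pub_date,
             article_by   = docs$byline$original,
             web_url      = docs$web_url,
             stringsAsFactors = FALSE)
}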
Next, we do some light cleaning of the data to ensure that it’s formatted properly for the next step:
nyt_clean <- function(my_nyt_data){
  my_nyt_data %>%
    mutate(article_by = str_replace(article_by, "By ", ""),    # strip the "By " prefix
           pub_date = str_extract(pub_date, "..........")) %>% # keep first 10 chars: YYYY-MM-DD
    filter(!grepl("THE ASSOCIATED PRESS", article_by)) %>%     # drop wire-service copy
    arrange(pub_date)
}
# each of the 3 below were run at different times:
# t_f100 <- nyt_clean(t_f100) # 946 results
# t_f100_pt1 <- nyt_clean(t_f100_pt1)
# t_f100_pt2 <- nyt_clean(t_f100_pt2)
Next, a preview of the cleaned API results example from NYTimes.com:
t_cleaned_example <- nyt_clean(t_f100_example) %>%
select(mainheadline, pub_date, article_by)
knitr::kable(head(t_cleaned_example), caption = "NYTimes API results cleaned")
mainheadline | pub_date | article_by |
---|---|---|
Movers: Trump Bump, Time Inc. and Exxon | 2017-04-28 | THE NEW YORK TIMES |
Trump Nominates Former Disaster Relief Manager to Lead FEMA | 2017-04-28 | RON NIXON |
‘The President Show’ Puts Trump in the Host’s Chair | 2017-04-28 | GIOVANNI RUSSONELLO |
Trump Orders Easing Safety Rules Implemented After Gulf Oil Spill | 2017-04-28 | CORAL DAVENPORT |
Trump Orders Review of Safety Rules Created After Gulf Oil Spill | 2017-04-28 | CORAL DAVENPORT |
Trump on Being President: ‘I Thought It Would Be Easier’ | 2017-04-28 | CHRISTOPHER MELE |
After pulling all of the search results, it was time to scrape the nytimes.com web pages. Again, the code below is similar to my previous projects, but I refined the functions to be more flexible and able to run in parts. I opted to run these scrape sessions in parts so that I could check my results as I went along and ensure I wasn’t making a nuisance of myself w/r/t scraping.
Below, I’ve created a function that pulls a URL from a df and then attempts to scrape the article text from the page’s HTML. The function delays 5 seconds per request.
# pull the article text from NYT using the web_url info:
resolve_nyt_url = function(my_data, x){
  current_url <- my_data[, "web_url"][x]
  Sys.sleep(5)                                      # be polite: 5-second delay per request
  out_html <- read_html(current_url)
  out_nodes <- html_nodes(out_html, ".story-body-text")
  message(sprintf("out_nodes length(): %s", length(out_nodes)))
  if(length(out_nodes) == 0){
    message(sprintf("url num: %s is dead", x))
    out_art_txt <- NULL
  } else {
    out_text <- html_text(out_nodes, trim = TRUE)
    out_text <- as.data.frame(out_text, stringsAsFactors = F)
    out_headline <- my_data[ ,"mainheadline"][x]
    out_author <- my_data[ ,"article_by"][x]
    message(sprintf("Good URL: %s", current_url))
    out_art_txt <- out_text %>%
      mutate(out_hl = out_headline, out_a = out_author)
  }
  out_art_txt
}
# each of the below were run at different times to
# avoid getting blocked. This did not run fast but it
# seemed to be reliable.
# t_text <- pull_art_texts(t_f100, "Trump_art_text")
# t_text1 <- pull_art_texts(t_f100_pt1, "Trump_art_text1")
# t_text2 <- pull_art_texts(t_f100_pt2, "Trump_art_text2")
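The pull_art_texts() wrapper isn’t shown above; it simply loops resolve_nyt_url() over the rows of the search results and saves the scraped text to a local csv named by its second argument. A rough sketch of that idea (assumed, not the exact original):

# Sketch (assumed): batch the per-URL scraper and write an interim file.
# Interim files were written without headers (see the note below).
pull_art_texts_sketch <- function(my_data, f_name){
  out <- NULL
  for(i in 1:nrow(my_data)){
    art <- resolve_nyt_url(my_data, i)
    if(!is.null(art)) out <- rbind(out, art)
  }
  write.table(out, paste0(f_name, ".csv"), sep = ";",
              row.names = FALSE, col.names = FALSE, quote = TRUE)
  out
}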
Next, I’ve provided an example of the raw text results from scraping NYTimes articles. Note that the headers are not included because this was an interim file that would be appended to another scrape session. Also notice that this particular article is actually about Ivanka Trump, which still falls within my criteria of looking at articles with headlines containing “Trump” during the first 100 days:
t_art_example <- read.csv2("Trump_art_text2.csv", stringsAsFactors = F)
head(t_art_example[,1])
## [1] "In announcing her plans to donate $100,000 each to the National Urban League and the Boys & Girls Clubs of America, Ms. Trump, the presidents daughter and adviser, released a statement calling the book part of a continuing effort to empower women that has been central to my mission throughout my career."
## [2] "Ms. Trump said her book aims to supply advice and tips on leadership, entrepreneurship, juggling work and family and building cultures where multidimensional women can thrive now and in the future."
## [3] "Like many other professional women, I have juggled the demands that come with growing my family and building my businesses, and I realize that I am more fortunate than most, Ms. Trump said in the statement."
## [4] "She said she has created the Ivanka M. Trump Charitable Fund for the unpaid portion of her advance and any future royalties to facilitate grants to charitable groups that support the economic empowerment for women and girls."
## [5] "She said she chose to donate to the Urban League and the Boys & Girls Clubs because both have made it a priority to promote entrepreneurship and educational opportunities for women and girls in underserved communities."
## [6] "The Urban League will launch a new womens initiative with the money, while the Boys & Girls Club money will go to the groups science, technology, engineering and math program."
Below, I clean and combine the different files and scrape sessions to create the file t_all_arts, which contains all of the articles that remained after refinement. The code is not run below, but I’ve reproduced it for reference. Further along, we’ll obtain t_all_arts from GitHub for the sentiment analysis.
# below, i remove the misaligned information and combine it with the
# initial search results to have text, date of publish, etc in the text
# t_text_clean <- t_text %>%
# mutate(lng_text = nchar(t_text),
# lng_hl = nchar(t_hl),
# lng_tauth = nchar(t_auth)) %>%
# filter(lng_tauth < 57, lng_hl < 85, lng_hl >= 29, lng_text < 2000)
# t_combine <- t_srcs %>%
# left_join(t_text_clean, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
#
# t_combine1 <- t_srcs1 %>%
# left_join(t_text1, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
#
# t_combine2 <- t_srcs2 %>%
# left_join(t_text2, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
# below contains the combined, cleaned text that forms the NYT portion
# of this analysis
# t_all_arts <- rbind(t_combine, t_combine2) %>%
# rbind(t_combine1) %>%
# select(mainheadline, pub_date, web_url, t_text, t_auth) %>%
# mutate(my_rows = row_number()) %>%
# arrange(pub_date, mainheadline, my_rows) %>%
# select(mainheadline, pub_date, web_url, t_text, t_auth) %>%
# ungroup() %>%
# mutate(line_num = row_number(), t_src = "NYTimes.com")
# write.csv2(t_all_arts, "t_all_arts.csv")
For article search, I used the FoxNews.com search for both “Donald Trump” in quotes and without (without search URL not shown) during the time-frame specified, under the site “Fox News” and by Section “Politics.”
FoxNews.com was much more challenging than NYTimes.com for a few reasons; most notably, plain requests via httr did not work (the search results are rendered in the browser rather than served as static HTML). To deal with these issues, I employed RSelenium, an extremely powerful Java-based tool for scraping all types of webpages by remote-controlling a browser. Before I could use RSelenium, I needed to understand the FoxNews.com search URL, which has similar qualities to an API URL. The function fox_search_url below takes a search term and date range and creates a URL that points to the website search results:
# This url pulls from the site "Fox News" under the section "Politics"
# but results seem to cover a bit more than politics.
fox_search_url <- function(x, s_date, e_date, a_pg = 0){
  f_base_url <- 'http://www.foxnews.com/search-results/search?q="%s"%s%s&max_date=%s&start=%s0'
  #f_term <- paste0('"',URLencode(x), '"')
  f_term <- URLencode(x)
  f_base_url2 <- "&ss=fn&sort=latest&section.path=fnc/politics&min_date="
  fox_news_url <- sprintf(f_base_url, f_term, f_base_url2, s_date, e_date, a_pg)
}
x <- fox_search_url('Donald Trump','2017-01-20', '2017-04-10')
x
## [1] "http://www.foxnews.com/search-results/search?q=\"Donald%20Trump\"&ss=fn&sort=latest§ion.path=fnc/politics&min_date=2017-01-20&max_date=2017-04-10&start=00"
Below, I’ve screen-capped the webpage mentioned above:
Next, I display a series of functions used by an RSelenium object that is a web browser instance (more on that next). These were used to extract specific elements from the FN search results page, including the total hit count, the headline link text, the article URLs, and the publication dates.
Sidenote: I used Google Chrome’s Selector Gadget extension to obtain the web elements of interest:
# used chrome's selector gadget to pull the results http://selectorgadget.com/
pull_fox_hits <- function(a_browsr){
l_hits <- '//*[contains(concat( " ", @class, " " ), concat( " ", "ng-valid", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "ng-binding", " " ))]'
hits <- extract_art(a_browsr, l_hits) %>% str_replace(",", "") %>% as.integer
hits
}
pull_fox_text <- function(a_browsr){
l_text <- '//h3//*[contains(concat( " ", @class, " " ), concat( " ", "ng-binding", " " ))]'
l_dates <- '//*[contains(concat( " ", @class, " " ), concat( " ", "search-date", " " ))]'
linktext <- extract_art(a_browsr, l_text)
linkurl <- extract_link(a_browsr, l_text)
# convert date format via http://stackoverflow.com/a/31837668/5965312
linkdates <- extract_art(a_browsr, l_dates)
linkdates <- linkdates %>%
strsplit( ".*\n\\s+") %>%
unlist() %>%
vapply("[", "", 1) %>%
as.Date("%b %d, %Y")
fox_text <- data.frame(l_hl = linktext,
pub_date = linkdates,
web_url = linkurl)
fox_text
}
Next, a few more helper functions for extracting text from webpages (they look simple here, but this actually required quite a bit of yak shaving to get working smoothly):
# extract the article based on xpath
extract_art <- function(a_browsr,an_xpath){
data <- a_browsr$findElements(using = "xpath", an_xpath)
f_out <- unlist(lapply(data, function(x){x$getElementText()}))
f_out
}
# extract the article based on css
extract_art_css <- function(a_browsr,a_css){
data <- a_browsr$findElements(using = "css selector", a_css)
f_out <- unlist(lapply(data, function(x){x$getElementText()}))
# for dealing with cases where extraction result is NULL
if(is.null(f_out)){
f_out <- NA
}else{
f_out
}
}
# extract the link based on xpath
extract_link <- function(a_browsr,an_xpath){
data <- a_browsr$findElements(value = an_xpath)
f_out <- unlist(lapply(data, function(x){x$getElementAttribute('href')}))
f_out
}
A helpful resource that I used is this Computer World article, which provides some basic functions commonly used for web scraping.
Before one can use RSelenium, there are several items you’ll want to have ready, including a Java installation, the Selenium standalone server (selenium-server-standalone-3.3.1.jar), and a browser driver (geckodriver.exe for Firefox). To make things easier, I saved my geckodriver.exe and selenium-server-standalone-3.3.1.jar in the same folder so that they run together. Here’s my cmd.exe window running the standalone server:
cmd.exe running Selenium Standalone
See the comments in the code chunk below for the cmd.exe commands from the screen cap, which you can use on your Windows PC.
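Once the standalone server is running, a quick smoke test like the sketch below (assuming the default localhost:4444) confirms that R can reach it before kicking off any long scrape:

# Minimal connectivity check (sketch): open a session, load a page, read its title
chk <- remoteDriver(remoteServerAddr = "localhost", port = 4444)
chk$open()
chk$navigate("http://www.foxnews.com")
chk$getTitle()
chk$close()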
Note that the below function will not run unless your Standalone Server is up-and-running:
# MUST BE RUN IN CMD.EXE FIRST:
# cd to >>>> C:\Users\Jaan\Documents\R\win-library\3.3\geckodriver then:
# java -Dwebdriver.gecko.driver=geckodriver.exe -jar selenium-server-standalone-3.3.1.jar
# good Selenium intro via :
#http://www.computerworld.com/article/2971265/application-development/how-to-drive-a-web-browser-with-r-and-rselenium.html
# this function creates a df that holds all of the FN search results:
rtrv_fn <- function(x, s_d, e_d, f_n = "tmp_fn.csv"){
  # create search url, retrieve results
fox_news_url <- fox_search_url(x, s_d, e_d)
brs <- remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444)
brs$open()
brs$navigate(fox_news_url)
# combine pages of results
hits <- pull_fox_hits(brs)
for(i in 0:(floor(hits/10))){
an_url <- fox_search_url(x, s_d, e_d, i)
brs$navigate(an_url)
Sys.sleep(sample(1:2,1))
temp_df <- pull_fox_text(brs)
if(i == 0){ # write the files with headers:
write.table(temp_df, f_n, append = F, sep=";",
row.names = F, col.names = T, quote = T)
}else if(i > 0){ # add pg results to files:
write.table(temp_df, f_n, append = T, sep=";",
row.names = F, col.names = F, quote = T)
}else(stop())
}
ttt <- read.csv2(f_n, stringsAsFactors = F, header = T)
ttt
}
#Uncomment to re-run
#fn_srch <- rtrv_fn("Donald Trump","2017-01-20", "2017-04-10", "tmp_fn0.csv")
#fn_srch1 <- rtrv_fn("Donald Trump","2017-04-11", "2017-04-29", "tmp_fn1.csv")
#fn_srch_big <- rtrv_fn("Donald Trump","2017-01-20", "2017-04-20", "tmp_fn2.csv")
#fn_srch_close <- rtrv_fn("Donald Trump","2017-04-21", "2017-04-29", "tmp_fn3.csv")
To pull the text from FN articles, I created a function that pulls a single article’s text and corresponding information. Previous versions of this function, fn_articles, attempted to pull the publication date from the webpage visited, but I had to comment that out because it wasn’t functioning properly:
# start selenium stand-alone..
# MUST BE RUN IN CMD.EXE FIRST:
# cd to >>>> C:\Users\Jaan\Documents\R\win-library\3.3\geckodriver then:
# java -Dwebdriver.gecko.driver=geckodriver.exe -jar selenium-server-standalone-3.3.1.jar
fn_articles <- function(a_brsr, an_url){
a_brsr$navigate(an_url)
fn_art_body <- '.article-text > p' # actual article text
fn_info <- '.article-info div div a' # from portion (sometimes)
fn_info_by <- '.article-info span' # Author, at times
fn_sub <- '#content h2 a' # section
fn_main <- '#content h1' # title
fn_time <- 'time' # date of publish
fn_art_text <- extract_art_css(a_brsr, fn_art_body)
b <- extract_art_css(a_brsr, fn_info)[1]
b2 <- extract_art_css(a_brsr, fn_info)[2]
c <- extract_art_css(a_brsr, fn_info_by)
d <- extract_art_css(a_brsr, fn_sub)
e <- extract_art_css(a_brsr, fn_main)
#f <- extract_art_css(a_brsr, fn_time)
f_t <- "x" #as.data.frame(f, stringsAsFactors = F) %>%
#filter(grepl("Published ", f)) %>%
#str_replace("Published ", "") %>%
#as.Date("%B %d, %Y") %>% unique()
out_f <- as.data.frame(fn_art_text) %>%
mutate(fn_from = b, fn_from2 = b2,
fn_by = c, fn_sub = d,
fn_main = e, fn_date = f_t)
out_f
}
The final bit of code to pull FN articles is below. The fn_extract_articles function takes a data frame of URLs (i.e. the FN search results) and creates a local csv file containing each article’s webpage text. Future versions of this function would do better to wrap each request in tryCatch() to deal with webpages that hang (FoxNews.com has a lot of ad partners); a sketch of that idea follows the code chunk below. Note that this code did not run fast, and the primary data pull needed to run overnight.
# for extracting all text from a given list of URLs
fn_extract_articles <- function(df_urls, fn_file = "fn.csv"){
loc_file <- fn_file
the_browser <- remDr <- remoteDriver(remoteServerAddr = "localhost",
port = 4444)
the_browser$open()
for(i in 1:nrow(df_urls)){
target_url <- df_urls[,3][i] # assign url
Sys.sleep(sample(1:2,1)) # Delay a bit
target_text <- fn_articles(the_browser,target_url) # Pull text
if(i == 1){
write.table(target_text, loc_file, # write text locally
append = F, sep=";", row.names = F,
col.names = T, quote = T)
}else if(i > 1){
write.table(target_text, loc_file,
append = T, sep=";", row.names = F,
col.names = F, quote = T)
}else(stop())
}
the_browser$close() # close browser
}
# The below took an hour or so to run:
# fn_extract_articles(fn_srch_cln, "fn20170425test.csv")
# The below ran in about 5 minutes
# fn_extract_articles(fn_srch_cln1, "fn20170501test.csv")
# The below ran in about 18hrs
# fn_extract_articles(fn_srch_cln2, "fn20170506test.csv")
# fn_extract_articles(fn_srch_cln3, "fn20170508test.csv")
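As mentioned above, a future improvement would be to guard each article scrape with tryCatch() so that a single hung or broken page doesn’t stop the whole run. A minimal sketch of that idea (hypothetical; not part of the original run):

# Sketch: defensive wrapper that logs and skips pages that error out
fn_articles_safe <- function(a_brsr, an_url){
  tryCatch(fn_articles(a_brsr, an_url),
           error = function(e){
             message(sprintf("Skipping %s: %s", an_url, conditionMessage(e)))
             NULL
           })
}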
Next, I combine the local csv files, clean them up, add dates to the records that lack them, and create the files that will be uploaded to GitHub so that this R Markdown can be run locally:
# read in the extracted data locally,
# clean, combine, and refine:
s_a <- read.csv2("tmp_fn0.csv", stringsAsFactors = F) %>%
select(l_hl, pub_date, web_url)
s_b <- read.csv2("tmp_fn1.csv", stringsAsFactors = F)
s_c <- read.csv2("tmp_fn2.csv", stringsAsFactors = F)
s_d <- read.csv2("tmp_fn3.csv", stringsAsFactors = F)
date_lookup <- rbind(s_a, s_b)
date_lookup <- rbind(date_lookup, s_c)
date_lookup <- rbind(date_lookup, s_d)
date_lookup <- date_lookup[!duplicated(date_lookup), ] %>%
filter(!grepl("video.foxnews.com", web_url)) %>%
select(l_hl, pub_date, web_url)
p_a <- read.csv2("fn20170425test.csv", stringsAsFactors = F)
p_b <- read.csv2("fn20170501test.csv", stringsAsFactors = F)
p_c <- read.csv2("fn20170506test.csv", stringsAsFactors = F)
p_d <- read.csv2("fn20170508test.csv", stringsAsFactors = F)
fn_all_art_raw <- rbind(p_a, p_b)
fn_all_art_raw <- rbind(fn_all_art_raw, p_c)
fn_all_art_raw <- rbind(fn_all_art_raw, p_d)
fn_all_arts <- fn_all_art_raw %>%
inner_join(date_lookup, by = c("fn_main"="l_hl"))
fn_all_arts <- fn_all_arts[!duplicated(fn_all_arts), ]
#write.csv2(fn_all_arts, "fn_all_artsX.csv", row.names = F, quote = F)
fn_datLOC <- read.csv2("fn_all_artsX.csv", stringsAsFactors = F, header = T)
part_a <- fn_datLOC %>%
slice(1:5000)
part_b1 <- fn_datLOC %>%
slice(5001:7300)
part_b2 <- fn_datLOC %>%
slice(7302:10000) #skip one row of problems
part_c <- fn_datLOC %>%
slice(10001:15000)
part_d <- fn_datLOC %>%
slice(15001:20000)
part_e <- fn_datLOC %>%
slice(20001:25000)
part_f <- fn_datLOC %>%
slice(25000:27115)
# write.csv2(part_a, "part_a.csv", row.names = FALSE)
# write.csv2(part_b1, "part_b1.csv", row.names = FALSE)
# write.csv2(part_b2, "part_b2.csv", row.names = FALSE)
# write.csv2(part_c, "part_c.csv", row.names = FALSE)
# write.csv2(part_d, "part_d.csv", row.names = FALSE)
# write.csv2(part_e, "part_e.csv", row.names = FALSE)
# write.csv2(part_f, "part_f.csv", row.names = FALSE)
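The repeated slice/write steps above could also be expressed as a single loop; a small sketch (breakpoints copied from the slices above):

# Sketch: same split-and-write in one loop over named row ranges
breaks <- list(part_a = 1:5000, part_b1 = 5001:7300, part_b2 = 7302:10000,
               part_c = 10001:15000, part_d = 15001:20000,
               part_e = 20001:25000, part_f = 25000:27115)
# for(nm in names(breaks)){
#   write.csv2(slice(fn_datLOC, breaks[[nm]]), paste0(nm, ".csv"), row.names = FALSE)
# }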
Next, a preview of the FoxNews.com articles:
sm_table <- fn_all_arts %>%
select(fn_art_text, fn_main,pub_date) %>% slice(1:3)
knitr::kable(sm_table, caption = "FoxNews.com article results")
fn_art_text | fn_main | pub_date |
---|---|---|
In his first hours as president Friday, Donald Trump ordered federal agencies to the burden of ObamaCare while his chief of staff directed an immediate regulatory freeze. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
Trump was joined in the Oval Office by Vice President Mike Pence, Chief of Staff Reince Priebus and other top advisers as he signed the executive order on former President Barack Obama’s signature health law, which Trump opposed throughout his campaign. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
The order, which noted that Trump intends to seek the law’s repeal, directs agencies to the unwarranted economic and regulatory burdens [of ObamaCare] and prepare to afford the States more flexibility and control to create a more free and open healthcare market. It also tells agencies to waive, defer or delay imposing any ObamaCare provisions that impose fiscal penalties on states, health care providers, families or individuals. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
First I set up the GitHub URLs that will be used to pull the data in:
# FN URLs - the file was large and needed to be saved in parts:
url_a <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_a.csv'
url_b1 <-'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_b1.csv'
url_b2 <-'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_b2.csv'
url_c <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_c.csv'
url_d <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_d.csv'
url_e <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_e.csv'
url_f <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_f.csv'
# nyt data was able to save on GitHub in one file:
nyt_url <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/t_all_arts.csv'
# some reference files I created in google docs and exported as csv
a_100day_url <- "https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/100days.csv"
a_sm_events <- "https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/small_events.csv"
# read GitHub FN files:
fn_dat_a <- read.csv2(text = getURL(url_a), stringsAsFactors = F, header = T)
fn_dat_b1 <- read.csv2(text = getURL(url_b1), stringsAsFactors = F, header = T)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
fn_dat_b2 <- read.csv2(text = getURL(url_b2), stringsAsFactors = F, header = T)
fn_dat_c <- read.csv2(text = getURL(url_c), stringsAsFactors = F, header = T)
fn_dat_d <- read.csv2(text = getURL(url_d), stringsAsFactors = F, header = T)
fn_dat_e <- read.csv2(text = getURL(url_e), stringsAsFactors = F, header = T)
fn_dat_f <- read.csv2(text = getURL(url_f), stringsAsFactors = F, header = T)
# combine them
fn_datGIT <- fn_dat_a %>%
rbind(fn_dat_b1) %>%
rbind(fn_dat_b2) %>%
rbind(fn_dat_c) %>%
rbind(fn_dat_d) %>%
rbind(fn_dat_e) %>%
rbind(fn_dat_f)
# put external source data into dataframes:
nyt_dat <- read.csv2(text = getURL(nyt_url), stringsAsFactors = F, header = T)
cnt100 <- read.csv(text = getURL(a_100day_url), stringsAsFactors = F, header = T) %>%
rename(pub_date = X.U.FEFF.pub_date)
sm_events <- read.csv(text = getURL(a_sm_events),
stringsAsFactors = F, header = T)
After I’ve read in the data, I need to clean it up a bit so that I can work with it:
x <- nyt_dat %>%
select(mainheadline, pub_date, t_text, t_src) %>%
rename(hl = mainheadline, art_text = t_text, src = t_src) %>%
filter(pub_date <= as.Date("2017-04-29"), grepl("Trump", hl))
y <- fn_datGIT %>%
select(fn_main, pub_date, fn_art_text, fn_src, fn_sub) %>%
ungroup() %>%
rename(hl = fn_main, art_text = fn_art_text, src = fn_src) %>%
filter(src == "FoxNews.com",
!grepl("OPINION", fn_sub),
grepl("Trump", hl),
!grepl("Transcript of President Trump's press conference", hl)) %>%
select(-fn_sub)
articles_fn_nyt <- rbind(x, y)
articles_fn_nyt <- articles_fn_nyt %>%
mutate(line_num = row_number())
sm_table1 <- articles_fn_nyt %>% slice(1:3)
knitr::kable(sm_table1, caption = "Combined preview")
hl | pub_date | art_text | src | line_num |
---|---|---|---|---|
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | WASHINGTON Donald J. Trump arrived in Washington the day before his inauguration as the nations 45th president in a swirl of cinematic pageantry but facing serious questions about whether his chaotic transition has left critical parts of the government dangerously short-handed. | NYTimes.com | 1 |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | Mr. Trump will be sworn in at noon Eastern time on Friday, but his team was still scrambling to fill key administration posts when he got here on Thursday, announcing last-minute plans to retain 50 essential State Department and national security officials currently working in the Obama administration to ensure continuity of government, according to Sean Spicer, the incoming White House press secretary. | NYTimes.com | 2 |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | The furious final staff preparations included designating Thomas A. Shannon Jr., an Obama appointee, as the acting secretary of state, pending the expected confirmation of Rex W. Tillerson. | NYTimes.com | 3 |
Here I make use of the tidytext package to tokenize the article data. Again, further refinement is needed to clear out any messy text:
tokens_nyfn <- articles_fn_nyt %>%
unnest_tokens(word, art_text) %>%
filter(!grepl('[0-9]',word),
!grepl('.{2,}(\\.com)',word),
word != "x", word != "na",
word != "_____", word != "_________")
sm_table2 <- tokens_nyfn %>% slice(1:8)
knitr::kable(sm_table2, caption = "Token preview")
hl | pub_date | src | line_num | word |
---|---|---|---|---|
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | washington |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | donald |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | j |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | trump |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | arrived |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | in |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | washington |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | the |
Below, I take a look at some stats regarding the data to be analyzed, starting with each source from a general perspective:
FoxNews.com | NYTimes.com |
---|---|
451,560 words | 492,493 words |
3.88 MB | 3.15 MB |
831 Articles | 381 Articles |
With an article count difference of 450, I am glad that the word count difference is only about 41K.
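The word and article counts in the table above can be derived directly from the token data (the file sizes come from the local csvs); a quick sketch:

# Word and distinct-article counts per source, from the tokenized data:
tokens_nyfn %>%
  group_by(src) %>%
  summarise(words = n(), articles = n_distinct(hl))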
In the code below, I’m setting up some helper functions to be used with ggplot2 that will help me add annotations to the plots.
my_hl <- function(x, some_text = "some_text", up_down = 0, ymi, ymx){
my_xmin <- x - 0.5
my_xmax <- x + 0.5
my_ymin <- ymi
my_ymax <- ymx
label_pos <- my_ymax
if(up_down == 0){
label_pos <- my_ymin + 2
}else
label_pos <- my_ymax - 2
tt <- annotate("rect", xmin = my_xmin, xmax = my_xmax,
ymin = my_ymin, ymax = my_ymax,
alpha = .15, fill = "yellow")
rrr <- annotate("text", label = some_text,
x=my_xmax, y=label_pos, hjust = 0,
alpha = .6)
out_d <- c(rrr, tt)
out_d
}
my_sq <- function(x, some_text = "some_text", up_down = 0, ymi, ymx){
my_xmin <- x - 0.5
my_xmax <- x + 0.5
my_ymin <- ymi
my_ymax <- ymx
label_pos <- my_ymax
if(up_down == 0){
label_pos <- my_ymin - 1
}else
label_pos <- my_ymax + 1
tt <- annotate("segment", x = x, xend = x,
y = ymi, yend = ymx)
rrr <- annotate("text", label = some_text,
x=my_xmax, y=label_pos, hjust = 0,
alpha = .6)
my_ano <- c(rrr, tt)
my_ano
}
The first 100 days had a lot of news events, and I’ve added a selection of them to the plots to help the reader get their bearings. From a 100-day perspective, the articles are distributed decently across all 100 days.
rvw_dates <- tokens_nyfn %>% ungroup() %>%
select(hl, pub_date, src)
dates_articles <- rvw_dates[!duplicated(rvw_dates), ] %>%
count(pub_date, src) %>%
mutate(s_date = substr(pub_date, 6, 10))
#View(dates_articles)
articles_perday <- left_join(cnt100, dates_articles, by ="pub_date")
#View(articles_perday)
#my_sq_new <- (x, some_text = "some_text", up_down = 0, ymi, ymx
a2 <- my_sq(8, "Jan-27:Travel ban",1, 19, 30)
a3 <- my_sq(25, "Feb-25:Flynn resigns",1, 27, 37)
a4 <- my_sq(44, "Mar-4:Trump wiretap claim",1, 2, 33)
a5 <- my_sq(60, "Mar-20:Comey confirms Rus probe",1, 8, 30)
a6 <- my_sq(84, "Apr-13:US drops MOAB",1, 6, 25)
ggplot(data = articles_perday, aes(x = s_date, y = n, fill = src)) +
geom_bar(stat="identity", alpha = .7) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6)) +
a2[1] + a2[2] + a3[1] + a3[2] + a4[1] + a4[2] +
a5[1] + a5[2] + a6[1] + a6[2]
### 100-Day Coverage: Words
wrds_per_day <- tokens_nyfn %>%
  count(pub_date, src, word, sort = T) %>%
  group_by(pub_date, src) %>%
  summarise(word_pd = sum(n)) %>%
  inner_join(cnt100, by = "pub_date") %>%
  mutate(s_date = substr(pub_date, 6, 10))
b2 <- my_sq(8, "Jan-27:Travel ban",1, 13000, 20000)
b3 <- my_sq(25, "Feb-25:Flynn resigns",1, 25000, 32000)
b4 <- my_sq(44, "Mar-4:Trump wiretap claim",1, 3000, 26000)
b5 <- my_sq(60, "Mar-20:Comey confirms Rus probe",1, 6900, 20000)
b6 <- my_sq(84, "Apr-13:US drops MOAB",1, 5000, 25000)
#View(wrds_per_day)
ggplot(data = wrds_per_day, aes(x = s_date, y = word_pd, fill = src)) +
geom_bar(stat="identity", alpha = .7)+
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6)) +
b2[1] + b2[2] + b3[1] + b3[2] + b4[1] + b4[2] +
b5[1] + b5[2] + b6[1] + b6[2]
Let’s take a look at the relative difference in frequencies among shared words, i.e. the words used by both news organizations but at different rates: the difference in word counts between FN and NYT divided by the total number of instances of that word.
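The nyt_fn_compare table used for this calculation was built in an earlier step that isn’t shown; a minimal sketch of how it could be derived from tokens_nyfn (column layout assumed from the code that follows):

# Sketch (assumed): per-word counts by source, spread into one column per org
nyt_fn_compare <- tokens_nyfn %>%
  count(src, word) %>%
  spread(src, n)   # yields columns: word, FoxNews.com, NYTimes.com

The disproportion ranking is then computed below: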
nyt_fn_dispro <- nyt_fn_compare %>%
mutate(my_rank = (FoxNews.com - NYTimes.com)/(NYTimes.com + FoxNews.com)) %>%
ungroup() %>%
mutate(g = str_extract(word, "[a-z']+"),
diff_chck = ifelse(word != g, "miss", "")) %>%
mutate(word = ifelse(diff_chck == "miss", word, g)) %>%
select(word:my_rank)
tot_words <- sum(nyt_fn_dispro[,2:3], na.rm = T)
#View(nyt_fn_dispro)
dispro_data <- nyt_fn_dispro %>%
mutate(all_freq = ((FoxNews.com + NYTimes.com)/tot_words)) %>%
select(word, my_rank, all_freq) %>%
mutate(word_source = ifelse(my_rank <= 0, "nyt", "fn"))
The ranking is set up such that a score of -1 means the word was found only in NYTimes.com, while a score of 1 means the word appeared only in FoxNews.com. A score of -0.8, for example, indicates that the word is found primarily in NYT, and scores closer to zero indicate that both orgs use the word about equally.
sm_table3 <- dispro_data %>% arrange(desc(all_freq)) %>% slice(1:10)
knitr::kable(sm_table3, caption = "A closer look at words")
word | my_rank | all_freq | word_source |
---|---|---|---|
the | -0.0206184 | 0.0596460 | nyt |
to | -0.0020742 | 0.0291096 | nyt |
of | -0.0937432 | 0.0241925 | nyt |
a | -0.1033517 | 0.0234500 | nyt |
and | -0.0173672 | 0.0225062 | nyt |
in | -0.1268589 | 0.0188040 | nyt |
that | -0.0597410 | 0.0148053 | nyt |
trump | -0.0129540 | 0.0133287 | nyt |
s | -0.2604137 | 0.0110110 | nyt |
on | -0.0479328 | 0.0099666 | nyt |
While the plot below doesn’t show all of the words, I’ve created it as a 30,000 ft view of the words used by both orgs, for observational interest:
ggplot(data = dispro_data, aes(x = all_freq, y = my_rank)) +
geom_text(aes(label = word, color = word_source),
check_overlap = TRUE) +
scale_x_log10(labels = percent_format()) +
coord_flip()+
theme(legend.position="none") +
labs(y = "",
x = "100-Day overall frequency")
Using the AFINN sentiments in the tidytext package, I get daily sentiment scores in chunks of ten text lines (line_num %/% 10):
check_feel2 <- tokens_nyfn %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(index = line_num %/% 10, src, pub_date) %>%
summarise(sentiment = sum(score)) %>%
mutate(s_date = substr(pub_date, 6, 10))
cx1 <- tokens_nyfn %>%
left_join(get_sentiments("afinn"), by = "word") %>%
ungroup() %>%
select(pub_date, src, score) %>%
mutate(s_date = substr(pub_date, 6, 10),
score = ifelse(is.na(score), 0, score)) %>%
na.omit()
cx2 <- cx1 %>% select(pub_date, src, score) %>%
group_by(pub_date, src) %>%
summarise_each(funs(daily_mean = mean,
daily_sd = sd))
#View(cx2)
#View(combo_feel)
#View(check_feel2)
sm_table4 <- check_feel2 %>% head()
knitr::kable(sm_table4, caption = "AFINN Scoring")
index | src | pub_date | sentiment | s_date |
---|---|---|---|---|
0 | NYTimes.com | 2017-01-20 | -1 | 01-20 |
1 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
2 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
3 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
4 | NYTimes.com | 2017-01-20 | -15 | 01-20 |
5 | NYTimes.com | 2017-01-20 | 10 | 01-20 |
A view of the 100 days by box plot, with some notable events of the first 100 days marked for reference. It may not be easy to see, but it appears that NYTimes.com has a more positive sentiment profile over this time period.
c2<-my_hl(8, "Travel ban",1, -50, 50)
c3<-my_hl(25, "Flynn resigns",1, -50, 60)
c4<-my_hl(44, "Trump wiretap claim",1, -50, 55)
c5<-my_hl(60, "Comey confirms Rus probe",1, -50, 38)
c6<-my_hl(84, "US drops MOAB",1, -50, 50)
ggplot(data = check_feel2, aes(x = s_date, y = sentiment, fill = src) ) +
geom_boxplot() +
facet_wrap(~src, ncol = 1) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6),
legend.position="none") +
geom_hline(aes(yintercept = 0),
color = "green",
size = .5) +
geom_hline(aes(yintercept = 0),
color = "green",
size = 3,
alpha = .2)+
c2[1] + c2[2] + c3[1] + c3[2] +
c4[1] + c4[2] + c5[1] + c5[2] +
c6[1] + c6[2]
#
#class(check_feel2$pub_date)
Because AFINN scores range from -5 to 5, we can compare the two orgs by the counts of these scores over the full 100-day period:
my_xl <- c('-5','-4','-3','-2','-1','0','1','2','3','4','5')
my_xb <- c(-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5)
ggplot(data = cx1) +
geom_histogram(aes(score, fill = src)) +
scale_x_continuous(breaks = my_xb, labels = my_xl) +
scale_y_log10() +
facet_wrap(~src, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 40 rows containing missing values (geom_bar).
#unique(cx1$score)
As shown above, their distributions look similar at a glance.
Next, a violin plot of the daily mean sentiment scores for each source across the 100 days:
ggplot(data = cx2, aes(x = src, y = daily_mean, fill = src)) +
geom_violin() +
geom_jitter(alpha = .5) +
geom_hline(yintercept = 0) +
theme(legend.position="none")
Looking at the means, we see they’re not that far from zero:
cx3 <- cx2
by(cx2$daily_mean,cx2$src, mean)
## cx2$src: FoxNews.com
## [1] -0.002979484
## --------------------------------------------------------
## cx2$src: NYTimes.com
## [1] 0.007719441
#by(cx2$daily_mean,cx2$src, sd)
I’ll employ the inference function from the DATA606 labs to complete the test, and I paste the results in the comments below:
#inference(y = cx2$daily_mean, x = cx2$src, est = "mean", type = "ht", null = 0,
# alternative = "twosided", method = "theoretical", conflevel = .99)
# Response variable: numerical, Explanatory variable: categorical
# Difference between two means
# Summary statistics:
# n_FoxNews.com = 100, mean_FoxNews.com = -0.003, sd_FoxNews.com = 0.019
# n_NYTimes.com = 77, mean_NYTimes.com = 0.0077, sd_NYTimes.com = 0.0149
# Observed difference between means (FoxNews.com-NYTimes.com) = -0.0107
#
# H0: mu_FoxNews.com - mu_NYTimes.com = 0
# HA: mu_FoxNews.com - mu_NYTimes.com != 0
# Standard error = 0.003
# Test statistic: Z = -4.195
# p-value = 0
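As a sanity check on the inference() output, the same comparison can be run with base R’s Welch two-sample t-test, which should yield a comparable test statistic and p-value:

# Cross-check (sketch): Welch two-sample t-test on the daily mean scores
t.test(daily_mean ~ src, data = cx2)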
As shown above, the p-value is essentially zero, so I reject the null hypothesis: the 100-day average daily sentiments differ between the two sources for this dataset.
These two news organizations certainly seem different. Whether these sentiment scores are interesting depends on the audience and how much faith you put in the AFINN lexicon. There are certainly a number of outstanding questions that could lend or remove credence from these results. As stated initially, this was an observational study, and the conclusions derived should be treated as such.