My motivation is to investigate the sentiment of public news articles that cover a similar topic from seemingly different points of view. Secondarily, I’m interested in comparing two news orgs that have themselves been the topic of news in recent days: FoxNews.com and NYTimes.com.
I consider myself liberal but have not editorialized the data whatsoever for this observational analysis (i.e. if an article met the criteria and wasn’t mis-formatted or otherwise difficult to process, I kept it in). Observations may be left-leaning.
This final project investigates sentiment and provides observations on the differences between the news orgs by analyzing web-scraped news articles published on the subject of “Trump” from NYTimes.com and FoxNews.com during the first 100 days of his presidency. Articles (and their respective URLs) were selected via search results from each news org’s own site (API for NYT, website scrape for FN). Article texts were scraped in batches, combined locally, and made available on GitHub for reproducibility.
Using the 100-day daily mean AFINN sentiment score from each news source as the sample, I suspect that the average sentiment scores from these fine news organizations will differ, so I structure my test as follows:
\(H_{0}: \mu_{diff} = 0\), i.e. there is no difference in the means.

\(H_{A}: \mu_{diff} \neq 0\), i.e. there is a difference in the means.
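For reference, the theoretical test used at the end of this analysis relies on the standard two-sample statistic for a difference in means (the subscripts FN and NYT simply label the two samples of daily means):

\[ Z = \frac{\bar{x}_{FN} - \bar{x}_{NYT}}{\sqrt{\frac{s_{FN}^{2}}{n_{FN}} + \frac{s_{NYT}^{2}}{n_{NYT}}}} \]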
Will the AFINN sentiment averages for these news articles of similar size, topic, and timeframe be different? Likely. But not necessarily in the way you might expect.
Below are the libraries used for this project:
# for appearance, table formatting
library(prettydoc)
library(knitr)
library(scales)
# for data manipulation, visualization, and stats
library(dplyr)
library(stringr)
library(tidyjson)
library(tidyr)
library(tidytext)
library(tidyverse)
library(ggplot2)
library(DATA606)
# for web scraping:
library(RCurl)
library(RSelenium)
library(Rwebdriver)
library(rvest)
library(httr)
library(XML)
As in my previous NYTimes.com projects (here and here), I returned to the API to pull results for the query “Trump” published during the first 100 days (January 20 through April 29, 2017).
The results of this query would then be used to pull articles from nytimes.com.
Below, we look at a single day example of pulling the results from the NYT API.
t_f100_example <- resolve_nyt("Trump", 20170428, 20170428) %>%
select(mainheadline,pub_date, article_by)
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
## Joining, by = c("document.id", "array.index")
knitr::kable(t_f100_example, caption = "NYTimes API results example")
mainheadline | pub_date | article_by |
---|---|---|
Movers: Trump Bump, Time Inc. and Exxon | 2017-04-28T12:11:41+0000 | By THE NEW YORK TIMES |
Trump Nominates Former Disaster Relief Manager to Lead FEMA | 2017-04-28T18:14:43+0000 | By RON NIXON |
‘The President Show’ Puts Trump in the Host’s Chair | 2017-04-28T08:44:57+0000 | By GIOVANNI RUSSONELLO |
Trump Orders Easing Safety Rules Implemented After Gulf Oil Spill | 2017-04-28T01:00:04+0000 | By CORAL DAVENPORT |
Trump Orders Review of Safety Rules Created After Gulf Oil Spill | 2017-04-28T20:37:32+0000 | By CORAL DAVENPORT |
Trump on Being President: ‘I Thought It Would Be Easier’ | 2017-04-28T12:25:07+0000 | By CHRISTOPHER MELE |
Trump Warns That ‘Major, Major Conflict’ With North Korea Is Possible | 2017-04-28T02:46:27+0000 | By GERRY MULLANY |
‘Trump Bump’ Lifts Stocks, Giving President a Win for His First 100 Days | 2017-04-28T20:06:41+0000 | By MICHAEL J. de la MERCED |
Trump Tells N.R.A. Convention, ‘I Am Going to Come Through for You’ | 2017-04-28T09:00:27+0000 | By MICHAEL D. SHEAR |
Court Gives Trump Small Victory in Push Against Clean Power Plan | 2017-04-28T17:23:33+0000 | By CORAL DAVENPORT |
Court Gives Trump Small Victory in Push Against Clean Power Plan | 2017-04-28T17:23:33+0000 | By CORAL DAVENPORT |
Under the Trump Tax Plan, We Might All Want to Become Corporations | 2017-04-28T19:05:45+0000 | By NEIL IRWIN |
Trump Rattles South Korea by Saying It Should Pay for Antimissile System | 2017-04-28T10:47:49+0000 | By CHOE SANG-HUN |
Utah Attorney General Makes a Trump Shortlist, and Donations Pour In | 2017-04-28T09:00:31+0000 | By MATT APUZZO |
Trump Tax Plan Would Shift Trillions From U.S. Coffers to the Richest | 2017-04-28T01:26:18+0000 | By JULIE HIRSCHFELD DAVIS and PATRICIA COHEN |
Trump on North Korea: Tactic? ‘Madman Theory’? Or Just Mixed Messages? | 2017-04-28T15:00:08+0000 | By DAVID E. SANGER |
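The resolve_nyt() helper comes from my earlier projects and isn’t reproduced here. Roughly, it queries the NYT Article Search API for a term between two dates and flattens the JSON response. A minimal sketch of the idea (not the original tidyjson implementation; assumes the jsonlite package and a placeholder api-key):

# Hypothetical sketch of what resolve_nyt() does under the hood:
# query the Article Search API and keep the fields used above.
resolve_nyt_sketch <- function(term, s_date, e_date, page = 0, api_key = "YOUR_KEY"){
  base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
  resp <- jsonlite::fromJSON(paste0(base_url,
                                    "?q=", URLencode(term),
                                    "&begin_date=", s_date,
                                    "&end_date=", e_date,
                                    "&page=", page,
                                    "&api-key=", api_key))
  docs <- resp$response$docs
  data.frame(mainheadline = docs$headline$main,
             pub_date     = docs$pub_date,
             article_by   = docs$byline$original,
             web_url      = docs$web_url,
             stringsAsFactors = FALSE)
}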
Next, we do some light cleaning of the data to ensure that it’s formatted properly for the next step:
nyt_clean <- function(my_nyt_data){
  my_nyt_data %>%
    mutate(article_by = str_replace(article_by, "By ", ""),    # strip the "By " prefix
           pub_date = str_extract(pub_date, "..........")) %>% # keep first 10 chars: YYYY-MM-DD
    filter(!grepl("THE ASSOCIATED PRESS", article_by)) %>%     # drop wire-service copy
    arrange(pub_date)
}
# each of the 3 below were run at different times:
# t_f100 <- nyt_clean(t_f100) # 946 results
# t_f100_pt1 <- nyt_clean(t_f100_pt1)
# t_f100_pt2 <- nyt_clean(t_f100_pt2)
Next, a preview of the cleaned API results example from NYTimes.com:
t_cleaned_example <- nyt_clean(t_f100_example) %>%
select(mainheadline, pub_date, article_by)
knitr::kable(head(t_cleaned_example), caption = "NYTimes API results cleaned")
mainheadline | pub_date | article_by |
---|---|---|
Movers: Trump Bump, Time Inc. and Exxon | 2017-04-28 | THE NEW YORK TIMES |
Trump Nominates Former Disaster Relief Manager to Lead FEMA | 2017-04-28 | RON NIXON |
‘The President Show’ Puts Trump in the Host’s Chair | 2017-04-28 | GIOVANNI RUSSONELLO |
Trump Orders Easing Safety Rules Implemented After Gulf Oil Spill | 2017-04-28 | CORAL DAVENPORT |
Trump Orders Review of Safety Rules Created After Gulf Oil Spill | 2017-04-28 | CORAL DAVENPORT |
Trump on Being President: ‘I Thought It Would Be Easier’ | 2017-04-28 | CHRISTOPHER MELE |
After pulling all of the search results, it was time to scrape the nytimes.com web pages. Again, the code below is similar to my previous projects, but I refined the functions to be more flexible and able to run in parts. I opted to run these scrape sessions in parts so that I could check my results as I went along and ensure I wasn’t making a nuisance of myself w/r/t scraping.
Below, I’ve created a function that pulls a URL from a df and then attempts to scrape the article text from the page’s HTML. The function delays 5 seconds per request.
# pull the article text from NYT using the web_url info:
resolve_nyt_url = function(my_data, x){
  current_url <- my_data[, "web_url"][x]
  Sys.sleep(5)                                      # be polite: 5-second delay per request
  out_html <- read_html(current_url)
  out_nodes <- html_nodes(out_html, ".story-body-text")
  message(sprintf("out_nodes length(): %s", length(out_nodes)))
  if(length(out_nodes) == 0){
    message(sprintf("url num: %s is dead", x))
    out_art_txt <- NULL
  } else {
    out_text <- html_text(out_nodes, trim = TRUE)
    out_text <- as.data.frame(out_text, stringsAsFactors = F)
    out_headline <- my_data[ ,"mainheadline"][x]
    out_author <- my_data[ ,"article_by"][x]
    message(sprintf("Good URL: %s", current_url))
    out_art_txt <- out_text %>%
      mutate(out_hl = out_headline, out_a = out_author)
  }
  out_art_txt
}
# each of the below were run at different times to
# avoid getting blocked. This did not run fast but it
# seemed to be reliable.
# t_text <- pull_art_texts(t_f100, "Trump_art_text")
# t_text1 <- pull_art_texts(t_f100_pt1, "Trump_art_text1")
# t_text2 <- pull_art_texts(t_f100_pt2, "Trump_art_text2")
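The pull_art_texts() wrapper isn’t shown above; it simply loops resolve_nyt_url() over the rows of the search results and saves the scraped text to a local csv named by its second argument. A rough sketch of that idea (assumed, not the exact original):

# Sketch (assumed): batch the per-URL scraper and write an interim file.
# Interim files were written without headers (see the note below).
pull_art_texts_sketch <- function(my_data, f_name){
  out <- NULL
  for(i in 1:nrow(my_data)){
    art <- resolve_nyt_url(my_data, i)
    if(!is.null(art)) out <- rbind(out, art)
  }
  write.table(out, paste0(f_name, ".csv"), sep = ";",
              row.names = FALSE, col.names = FALSE, quote = TRUE)
  out
}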
Next, I’ve provided an example of the raw text results from scraping NYTimes articles. Note that the headers are not included because this was an interim file that would be appended to another scrape session. Also notice that this particular article is actually about Ivanka Trump, which still falls within my criteria of looking at articles with headlines containing “Trump” during the first 100 days:
t_art_example <- read.csv2("Trump_art_text2.csv", stringsAsFactors = F)
head(t_art_example[,1])
## [1] "In announcing her plans to donate $100,000 each to the National Urban League and the Boys & Girls Clubs of America, Ms. Trump, the presidents daughter and adviser, released a statement calling the book part of a continuing effort to empower women that has been central to my mission throughout my career."
## [2] "Ms. Trump said her book aims to supply advice and tips on leadership, entrepreneurship, juggling work and family and building cultures where multidimensional women can thrive now and in the future."
## [3] "Like many other professional women, I have juggled the demands that come with growing my family and building my businesses, and I realize that I am more fortunate than most, Ms. Trump said in the statement."
## [4] "She said she has created the Ivanka M. Trump Charitable Fund for the unpaid portion of her advance and any future royalties to facilitate grants to charitable groups that support the economic empowerment for women and girls."
## [5] "She said she chose to donate to the Urban League and the Boys & Girls Clubs because both have made it a priority to promote entrepreneurship and educational opportunities for women and girls in underserved communities."
## [6] "The Urban League will launch a new womens initiative with the money, while the Boys & Girls Club money will go to the groups science, technology, engineering and math program."
Below, I clean and combine the different files and scrape sessions to create the file t_all_arts, which contains all of the articles that remained after refinement. The code is not run below, but I’ve reproduced it for reference. Further along, we’ll obtain t_all_arts from GitHub for the sentiment analysis.
# below, i remove the misaligned information and combine it with the
# initial search results to have text, date of publish, etc in the text
# t_text_clean <- t_text %>%
# mutate(lng_text = nchar(t_text),
# lng_hl = nchar(t_hl),
# lng_tauth = nchar(t_auth)) %>%
# filter(lng_tauth < 57, lng_hl < 85, lng_hl >= 29, lng_text < 2000)
# t_combine <- t_srcs %>%
# left_join(t_text_clean, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
#
# t_combine1 <- t_srcs1 %>%
# left_join(t_text1, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
#
# t_combine2 <- t_srcs2 %>%
# left_join(t_text2, by = c("mainheadline" = "t_hl")) %>%
# filter(!is.na(t_auth))
# below contains the combined, cleaned text that forms the NYT portion
# of this analysis
# t_all_arts <- rbind(t_combine, t_combine2) %>%
# rbind(t_combine1) %>%
# select(mainheadline, pub_date, web_url, t_text, t_auth) %>%
# mutate(my_rows = row_number()) %>%
# arrange(pub_date, mainheadline, my_rows) %>%
# select(mainheadline, pub_date, web_url, t_text, t_auth) %>%
# ungroup() %>%
# mutate(line_num = row_number(), t_src = "NYTimes.com")
# write.csv2(t_all_arts, "t_all_arts.csv")
For article search, I used the FoxNews.com search for both “Donald Trump” in quotes and without (without search URL not shown) during the time-frame specified, under the site “Fox News” and by Section “Politics.”
FoxNews.com was much more challenging than NYTimes.com for a few reasons; most notably, plain requests via httr did not work (the search results are rendered in the browser rather than served as static HTML). To deal with these issues, I employed RSelenium, an extremely powerful Java-based tool for scraping all types of webpages by remote-controlling a browser. Before I could use RSelenium, I needed to understand the FoxNews.com search URL, which has similar qualities to an API URL. The function fox_search_url below takes a search term and date range and creates a URL that points to the website search results:
# This url pulls from the site "Fox News" under the section "Politics"
# but results seem to cover a bit more than politics.
fox_search_url <- function(x, s_date, e_date, a_pg = 0){
  f_base_url <- 'http://www.foxnews.com/search-results/search?q="%s"%s%s&max_date=%s&start=%s0'
  #f_term <- paste0('"',URLencode(x), '"')
  f_term <- URLencode(x)
  f_base_url2 <- "&ss=fn&sort=latest&section.path=fnc/politics&min_date="
  fox_news_url <- sprintf(f_base_url, f_term, f_base_url2, s_date, e_date, a_pg)
}
x <- fox_search_url('Donald Trump','2017-01-20', '2017-04-10')
x
## [1] "http://www.foxnews.com/search-results/search?q=\"Donald%20Trump\"&ss=fn&sort=latest§ion.path=fnc/politics&min_date=2017-01-20&max_date=2017-04-10&start=00"
Below, I’ve screen-capped the webpage mentioned above:
Next, I display a series of functions used by an RSelenium object that is a web browser instance (more on that next). These were used to extract specific elements from the FN search results page, including the total hit count, the headline link text, the article URLs, and the publication dates.
Sidenote: I used Google Chrome’s Selector Gadget extension to obtain the web elements of interest:
# used chrome's selector gadget to pull the results http://selectorgadget.com/
pull_fox_hits <- function(a_browsr){
l_hits <- '//*[contains(concat( " ", @class, " " ), concat( " ", "ng-valid", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "ng-binding", " " ))]'
hits <- extract_art(a_browsr, l_hits) %>% str_replace(",", "") %>% as.integer
hits
}
pull_fox_text <- function(a_browsr){
l_text <- '//h3//*[contains(concat( " ", @class, " " ), concat( " ", "ng-binding", " " ))]'
l_dates <- '//*[contains(concat( " ", @class, " " ), concat( " ", "search-date", " " ))]'
linktext <- extract_art(a_browsr, l_text)
linkurl <- extract_link(a_browsr, l_text)
# convert date format via http://stackoverflow.com/a/31837668/5965312
linkdates <- extract_art(a_browsr, l_dates)
linkdates <- linkdates %>%
strsplit( ".*\n\\s+") %>%
unlist() %>%
vapply("[", "", 1) %>%
as.Date("%b %d, %Y")
fox_text <- data.frame(l_hl = linktext,
pub_date = linkdates,
web_url = linkurl)
fox_text
}
Next, a few more helper functions for extracting text from webpages (they look simple here, but this actually required quite a bit of yak shaving to get working smoothly):
# extract the article based on xpath
extract_art <- function(a_browsr,an_xpath){
data <- a_browsr$findElements(using = "xpath", an_xpath)
f_out <- unlist(lapply(data, function(x){x$getElementText()}))
f_out
}
# extract the article based on css
extract_art_css <- function(a_browsr,a_css){
data <- a_browsr$findElements(using = "css selector", a_css)
f_out <- unlist(lapply(data, function(x){x$getElementText()}))
# for dealing with cases where extraction result is NULL
if(is.null(f_out)){
f_out <- NA
}else{
f_out
}
}
# extract the link based on xpath
extract_link <- function(a_browsr,an_xpath){
data <- a_browsr$findElements(value = an_xpath)
f_out <- unlist(lapply(data, function(x){x$getElementAttribute('href')}))
f_out
}
A helpful resource that I used is this Computer World article, which provides some basic functions commonly used for web scraping.
Before one can use RSelenium, there are several items you’ll want to have ready, including a Java installation, the Selenium standalone server (selenium-server-standalone-3.3.1.jar), and a browser driver (geckodriver.exe for Firefox). To make things easier, I saved my geckodriver.exe and selenium-server-standalone-3.3.1.jar in the same folder so that they run together. Here’s my cmd.exe window running the standalone server:
cmd.exe running Selenium Standalone
See the comments in the code chunk below for the cmd.exe commands from the screen cap, which you can use on your Windows PC.
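Once the standalone server is running, a quick smoke test like the sketch below (assuming the default localhost:4444) confirms that R can reach it before kicking off any long scrape:

# Minimal connectivity check (sketch): open a session, load a page, read its title
chk <- remoteDriver(remoteServerAddr = "localhost", port = 4444)
chk$open()
chk$navigate("http://www.foxnews.com")
chk$getTitle()
chk$close()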
Note that the below function will not run unless your Standalone Server is up-and-running:
# MUST BE RUN IN CMD.EXE FIRST:
# cd to >>>> C:\Users\Jaan\Documents\R\win-library\3.3\geckodriver then:
# java -Dwebdriver.gecko.driver=geckodriver.exe -jar selenium-server-standalone-3.3.1.jar
# good Selenium intro via :
#http://www.computerworld.com/article/2971265/application-development/how-to-drive-a-web-browser-with-r-and-rselenium.html
# this function creates a df that holds all of the FN search results:
rtrv_fn <- function(x, s_d, e_d, f_n = "tmp_fn.csv"){
  # create search url, retrieve results
fox_news_url <- fox_search_url(x, s_d, e_d)
brs <- remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444)
brs$open()
brs$navigate(fox_news_url)
# combine pages of results
hits <- pull_fox_hits(brs)
for(i in 0:(floor(hits/10))){
an_url <- fox_search_url(x, s_d, e_d, i)
brs$navigate(an_url)
Sys.sleep(sample(1:2,1))
temp_df <- pull_fox_text(brs)
if(i == 0){ # write the files with headers:
write.table(temp_df, f_n, append = F, sep=";",
row.names = F, col.names = T, quote = T)
}else if(i > 0){ # add pg results to files:
write.table(temp_df, f_n, append = T, sep=";",
row.names = F, col.names = F, quote = T)
}else(stop())
}
ttt <- read.csv2(f_n, stringsAsFactors = F, header = T)
ttt
}
#Uncomment to re-run
#fn_srch <- rtrv_fn("Donald Trump","2017-01-20", "2017-04-10", "tmp_fn0.csv")
#fn_srch1 <- rtrv_fn("Donald Trump","2017-04-11", "2017-04-29", "tmp_fn1.csv")
#fn_srch_big <- rtrv_fn("Donald Trump","2017-01-20", "2017-04-20", "tmp_fn2.csv")
#fn_srch_close <- rtrv_fn("Donald Trump","2017-04-21", "2017-04-29", "tmp_fn3.csv")
To pull the text from FN articles, I created a function that pulls a single article’s text and corresponding information. Previous versions of this function, fn_articles, attempted to pull the publication date from the webpage visited, but I had to comment that out because it wasn’t functioning properly:
# start selenium stand-alone..
# MUST BE RUN IN CMD.EXE FIRST:
# cd to >>>> C:\Users\Jaan\Documents\R\win-library\3.3\geckodriver then:
# java -Dwebdriver.gecko.driver=geckodriver.exe -jar selenium-server-standalone-3.3.1.jar
fn_articles <- function(a_brsr, an_url){
a_brsr$navigate(an_url)
fn_art_body <- '.article-text > p' # actual article text
fn_info <- '.article-info div div a' # from portion (sometimes)
fn_info_by <- '.article-info span' # Author, at times
fn_sub <- '#content h2 a' # section
fn_main <- '#content h1' # title
fn_time <- 'time' # date of publish
fn_art_text <- extract_art_css(a_brsr, fn_art_body)
b <- extract_art_css(a_brsr, fn_info)[1]
b2 <- extract_art_css(a_brsr, fn_info)[2]
c <- extract_art_css(a_brsr, fn_info_by)
d <- extract_art_css(a_brsr, fn_sub)
e <- extract_art_css(a_brsr, fn_main)
#f <- extract_art_css(a_brsr, fn_time)
f_t <- "x" #as.data.frame(f, stringsAsFactors = F) %>%
#filter(grepl("Published ", f)) %>%
#str_replace("Published ", "") %>%
#as.Date("%B %d, %Y") %>% unique()
out_f <- as.data.frame(fn_art_text) %>%
mutate(fn_from = b, fn_from2 = b2,
fn_by = c, fn_sub = d,
fn_main = e, fn_date = f_t)
out_f
}
The final bit of code to pull FN articles is below. The fn_extract_articles function takes a data frame of URLs (i.e. the FN search results) and creates a local csv file containing each article’s webpage text. Future versions of this function would do better to wrap each request in tryCatch() to deal with webpages that hang (FoxNews.com has a lot of ad partners); a sketch of that idea follows the code chunk below. Note that this code did not run fast, and the primary data pull needed to run overnight.
# for extracting all text from a given list of URLs
fn_extract_articles <- function(df_urls, fn_file = "fn.csv"){
loc_file <- fn_file
the_browser <- remDr <- remoteDriver(remoteServerAddr = "localhost",
port = 4444)
the_browser$open()
for(i in 1:nrow(df_urls)){
target_url <- df_urls[,3][i] # assign url
Sys.sleep(sample(1:2,1)) # Delay a bit
target_text <- fn_articles(the_browser,target_url) # Pull text
if(i == 1){
write.table(target_text, loc_file, # write text locally
append = F, sep=";", row.names = F,
col.names = T, quote = T)
}else if(i > 1){
write.table(target_text, loc_file,
append = T, sep=";", row.names = F,
col.names = F, quote = T)
}else(stop())
}
the_browser$close() # close browser
}
# The below took an hour or so to run:
# fn_extract_articles(fn_srch_cln, "fn20170425test.csv")
# The below ran in about 5 minutes
# fn_extract_articles(fn_srch_cln1, "fn20170501test.csv")
# The below ran in about 18hrs
# fn_extract_articles(fn_srch_cln2, "fn20170506test.csv")
# fn_extract_articles(fn_srch_cln3, "fn20170508test.csv")
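As mentioned above, a future improvement would be to guard each article scrape with tryCatch() so that a single hung or broken page doesn’t stop the whole run. A minimal sketch of that idea (hypothetical; not part of the original run):

# Sketch: defensive wrapper that logs and skips pages that error out
fn_articles_safe <- function(a_brsr, an_url){
  tryCatch(fn_articles(a_brsr, an_url),
           error = function(e){
             message(sprintf("Skipping %s: %s", an_url, conditionMessage(e)))
             NULL
           })
}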
Next, I combine the local csv files, clean them up, add dates to the records that lack them, and create the files that will be uploaded to GitHub so that this R Markdown can be run locally:
# read in the extracted data locally,
# clean, combine, and refine:
s_a <- read.csv2("tmp_fn0.csv", stringsAsFactors = F) %>%
select(l_hl, pub_date, web_url)
s_b <- read.csv2("tmp_fn1.csv", stringsAsFactors = F)
s_c <- read.csv2("tmp_fn2.csv", stringsAsFactors = F)
s_d <- read.csv2("tmp_fn3.csv", stringsAsFactors = F)
date_lookup <- rbind(s_a, s_b)
date_lookup <- rbind(date_lookup, s_c)
date_lookup <- rbind(date_lookup, s_d)
date_lookup <- date_lookup[!duplicated(date_lookup), ] %>%
filter(!grepl("video.foxnews.com", web_url)) %>%
select(l_hl, pub_date, web_url)
p_a <- read.csv2("fn20170425test.csv", stringsAsFactors = F)
p_b <- read.csv2("fn20170501test.csv", stringsAsFactors = F)
p_c <- read.csv2("fn20170506test.csv", stringsAsFactors = F)
p_d <- read.csv2("fn20170508test.csv", stringsAsFactors = F)
fn_all_art_raw <- rbind(p_a, p_b)
fn_all_art_raw <- rbind(fn_all_art_raw, p_c)
fn_all_art_raw <- rbind(fn_all_art_raw, p_d)
fn_all_arts <- fn_all_art_raw %>%
inner_join(date_lookup, by = c("fn_main"="l_hl"))
fn_all_arts <- fn_all_arts[!duplicated(fn_all_arts), ]
#write.csv2(fn_all_arts, "fn_all_artsX.csv", row.names = F, quote = F)
fn_datLOC <- read.csv2("fn_all_artsX.csv", stringsAsFactors = F, header = T)
part_a <- fn_datLOC %>%
slice(1:5000)
part_b1 <- fn_datLOC %>%
slice(5001:7300)
part_b2 <- fn_datLOC %>%
slice(7302:10000) #skip one row of problems
part_c <- fn_datLOC %>%
slice(10001:15000)
part_d <- fn_datLOC %>%
slice(15001:20000)
part_e <- fn_datLOC %>%
slice(20001:25000)
part_f <- fn_datLOC %>%
slice(25000:27115)
# write.csv2(part_a, "part_a.csv", row.names = FALSE)
# write.csv2(part_b1, "part_b1.csv", row.names = FALSE)
# write.csv2(part_b2, "part_b2.csv", row.names = FALSE)
# write.csv2(part_c, "part_c.csv", row.names = FALSE)
# write.csv2(part_d, "part_d.csv", row.names = FALSE)
# write.csv2(part_e, "part_e.csv", row.names = FALSE)
# write.csv2(part_f, "part_f.csv", row.names = FALSE)
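The repeated slice/write steps above could also be expressed as a single loop; a small sketch (breakpoints copied from the slices above):

# Sketch: same split-and-write in one loop over named row ranges
breaks <- list(part_a = 1:5000, part_b1 = 5001:7300, part_b2 = 7302:10000,
               part_c = 10001:15000, part_d = 15001:20000,
               part_e = 20001:25000, part_f = 25000:27115)
# for(nm in names(breaks)){
#   write.csv2(slice(fn_datLOC, breaks[[nm]]), paste0(nm, ".csv"), row.names = FALSE)
# }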
Next, a preview of the FoxNews.com articles:
sm_table <- fn_all_arts %>%
select(fn_art_text, fn_main,pub_date) %>% slice(1:3)
knitr::kable(sm_table, caption = "FoxNews.com article results")
fn_art_text | fn_main | pub_date |
---|---|---|
In his first hours as president Friday, Donald Trump ordered federal agencies to the burden of ObamaCare while his chief of staff directed an immediate regulatory freeze. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
Trump was joined in the Oval Office by Vice President Mike Pence, Chief of Staff Reince Priebus and other top advisers as he signed the executive order on former President Barack Obama’s signature health law, which Trump opposed throughout his campaign. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
The order, which noted that Trump intends to seek the law’s repeal, directs agencies to the unwarranted economic and regulatory burdens [of ObamaCare] and prepare to afford the States more flexibility and control to create a more free and open healthcare market. It also tells agencies to waive, defer or delay imposing any ObamaCare provisions that impose fiscal penalties on states, health care providers, families or individuals. | In first executive order, Trump tells agencies to ease ObamaCare burden | 2017-01-20 |
First I set up the GitHub URLs that will be used to pull the data in:
# FN URLs - the file was large and needed to be saved in parts:
url_a <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_a.csv'
url_b1 <-'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_b1.csv'
url_b2 <-'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_b2.csv'
url_c <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_c.csv'
url_d <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_d.csv'
url_e <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_e.csv'
url_f <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/part_f.csv'
# nyt data was able to save on GitHub in one file:
nyt_url <- 'https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/t_all_arts.csv'
# some reference files I created in google docs and exported as csv
a_100day_url <- "https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/100days.csv"
a_sm_events <- "https://raw.githubusercontent.com/jbrnbrg/final_proj607/master/small_events.csv"
# read GitHub FN files:
fn_dat_a <- read.csv2(text = getURL(url_a), stringsAsFactors = F, header = T)
fn_dat_b1 <- read.csv2(text = getURL(url_b1), stringsAsFactors = F, header = T)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
fn_dat_b2 <- read.csv2(text = getURL(url_b2), stringsAsFactors = F, header = T)
fn_dat_c <- read.csv2(text = getURL(url_c), stringsAsFactors = F, header = T)
fn_dat_d <- read.csv2(text = getURL(url_d), stringsAsFactors = F, header = T)
fn_dat_e <- read.csv2(text = getURL(url_e), stringsAsFactors = F, header = T)
fn_dat_f <- read.csv2(text = getURL(url_f), stringsAsFactors = F, header = T)
# combine them
fn_datGIT <- fn_dat_a %>%
rbind(fn_dat_b1) %>%
rbind(fn_dat_b2) %>%
rbind(fn_dat_c) %>%
rbind(fn_dat_d) %>%
rbind(fn_dat_e) %>%
rbind(fn_dat_f)
# put external source data into dataframes:
nyt_dat <- read.csv2(text = getURL(nyt_url), stringsAsFactors = F, header = T)
cnt100 <- read.csv(text = getURL(a_100day_url), stringsAsFactors = F, header = T) %>%
rename(pub_date = X.U.FEFF.pub_date)
sm_events <- read.csv(text = getURL(a_sm_events),
stringsAsFactors = F, header = T)
After I’ve read in the data, I need to clean it up a bit so that I can work with it:
x <- nyt_dat %>%
select(mainheadline, pub_date, t_text, t_src) %>%
rename(hl = mainheadline, art_text = t_text, src = t_src) %>%
filter(pub_date <= as.Date("2017-04-29"), grepl("Trump", hl))
y <- fn_datGIT %>%
select(fn_main, pub_date, fn_art_text, fn_src, fn_sub) %>%
ungroup() %>%
rename(hl = fn_main, art_text = fn_art_text, src = fn_src) %>%
filter(src == "FoxNews.com",
!grepl("OPINION", fn_sub),
grepl("Trump", hl),
!grepl("Transcript of President Trump's press conference", hl)) %>%
select(-fn_sub)
articles_fn_nyt <- rbind(x, y)
articles_fn_nyt <- articles_fn_nyt %>%
mutate(line_num = row_number())
sm_table1 <- articles_fn_nyt %>% slice(1:3)
knitr::kable(sm_table1, caption = "Combined preview")
hl | pub_date | art_text | src | line_num |
---|---|---|---|---|
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | WASHINGTON Donald J. Trump arrived in Washington the day before his inauguration as the nations 45th president in a swirl of cinematic pageantry but facing serious questions about whether his chaotic transition has left critical parts of the government dangerously short-handed. | NYTimes.com | 1 |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | Mr. Trump will be sworn in at noon Eastern time on Friday, but his team was still scrambling to fill key administration posts when he got here on Thursday, announcing last-minute plans to retain 50 essential State Department and national security officials currently working in the Obama administration to ensure continuity of government, according to Sean Spicer, the incoming White House press secretary. | NYTimes.com | 2 |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | The furious final staff preparations included designating Thomas A. Shannon Jr., an Obama appointee, as the acting secretary of state, pending the expected confirmation of Rex W. Tillerson. | NYTimes.com | 3 |
Here I make use of the tidytext package to tokenize the article data. Again, further refinement is needed to clear out any messy text:
tokens_nyfn <- articles_fn_nyt %>%
unnest_tokens(word, art_text) %>%
filter(!grepl('[0-9]',word),
!grepl('.{2,}(\\.com)',word),
word != "x", word != "na",
word != "_____", word != "_________")
sm_table2 <- tokens_nyfn %>% slice(1:8)
knitr::kable(sm_table2, caption = "Token preview")
hl | pub_date | src | line_num | word |
---|---|---|---|---|
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | washington |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | donald |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | j |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | trump |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | arrived |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | in |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | washington |
A Trump Administration, With Obama Staff Members Filling In the Gaps | 2017-01-20 | NYTimes.com | 1 | the |
Below, I take a look at some stats regarding the data to be analyzed, starting with each source from a general perspective:
FoxNews.com | NYTimes.com |
---|---|
451,560 words | 492,493 words |
3.88 MB | 3.15 MB |
831 Articles | 381 Articles |
With an article count difference of 450, I am glad that the word count difference is only about 41K.
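The word and article counts in the table above can be derived directly from the token data (the file sizes come from the local csvs); a quick sketch:

# Word and distinct-article counts per source, from the tokenized data:
tokens_nyfn %>%
  group_by(src) %>%
  summarise(words = n(), articles = n_distinct(hl))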
In the code below, I’m setting up some helper functions to be used with ggplot2 that will help me add annotations to the plots.
my_hl <- function(x, some_text = "some_text", up_down = 0, ymi, ymx){
my_xmin <- x - 0.5
my_xmax <- x + 0.5
my_ymin <- ymi
my_ymax <- ymx
label_pos <- my_ymax
if(up_down == 0){
label_pos <- my_ymin + 2
}else
label_pos <- my_ymax - 2
tt <- annotate("rect", xmin = my_xmin, xmax = my_xmax,
ymin = my_ymin, ymax = my_ymax,
alpha = .15, fill = "yellow")
rrr <- annotate("text", label = some_text,
x=my_xmax, y=label_pos, hjust = 0,
alpha = .6)
out_d <- c(rrr, tt)
out_d
}
my_sq <- function(x, some_text = "some_text", up_down = 0, ymi, ymx){
my_xmin <- x - 0.5
my_xmax <- x + 0.5
my_ymin <- ymi
my_ymax <- ymx
label_pos <- my_ymax
if(up_down == 0){
label_pos <- my_ymin - 1
}else
label_pos <- my_ymax + 1
tt <- annotate("segment", x = x, xend = x,
y = ymi, yend = ymx)
rrr <- annotate("text", label = some_text,
x=my_xmax, y=label_pos, hjust = 0,
alpha = .6)
my_ano <- c(rrr, tt)
my_ano
}
The first 100 days had a lot of news events, and I’ve added a selection of them to the plots to help the reader get their bearings. From a 100-day perspective, the articles are distributed decently across all 100 days.
rvw_dates <- tokens_nyfn %>% ungroup() %>%
select(hl, pub_date, src)
dates_articles <- rvw_dates[!duplicated(rvw_dates), ] %>%
count(pub_date, src) %>%
mutate(s_date = substr(pub_date, 6, 10))
#View(dates_articles)
articles_perday <- left_join(cnt100, dates_articles, by ="pub_date")
#View(articles_perday)
#my_sq_new <- (x, some_text = "some_text", up_down = 0, ymi, ymx
a2 <- my_sq(8, "Jan-27:Travel ban",1, 19, 30)
a3 <- my_sq(25, "Feb-25:Flynn resigns",1, 27, 37)
a4 <- my_sq(44, "Mar-4:Trump wiretap claim",1, 2, 33)
a5 <- my_sq(60, "Mar-20:Comey confirms Rus probe",1, 8, 30)
a6 <- my_sq(84, "Apr-13:US drops MOAB",1, 6, 25)
ggplot(data = articles_perday, aes(x = s_date, y = n, fill = src)) +
geom_bar(stat="identity", alpha = .7) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6)) +
a2[1] + a2[2] + a3[1] + a3[2] + a4[1] + a4[2] +
a5[1] + a5[2] + a6[1] + a6[2]
### 100-Day Coverage: Words
wrds_per_day <- tokens_nyfn %>%
  count(pub_date, src, word, sort = T) %>%
  group_by(pub_date, src) %>%
  summarise(word_pd = sum(n)) %>%
  inner_join(cnt100, by = "pub_date") %>%
  mutate(s_date = substr(pub_date, 6, 10))
b2 <- my_sq(8, "Jan-27:Travel ban",1, 13000, 20000)
b3 <- my_sq(25, "Feb-25:Flynn resigns",1, 25000, 32000)
b4 <- my_sq(44, "Mar-4:Trump wiretap claim",1, 3000, 26000)
b5 <- my_sq(60, "Mar-20:Comey confirms Rus probe",1, 6900, 20000)
b6 <- my_sq(84, "Apr-13:US drops MOAB",1, 5000, 25000)
#View(wrds_per_day)
ggplot(data = wrds_per_day, aes(x = s_date, y = word_pd, fill = src)) +
geom_bar(stat="identity", alpha = .7)+
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6)) +
b2[1] + b2[2] + b3[1] + b3[2] + b4[1] + b4[2] +
b5[1] + b5[2] + b6[1] + b6[2]
Let’s take a look at the relative difference in frequencies among shared words, i.e. the words used by both news organizations but at different rates: the difference in word counts between FN and NYT divided by the total number of instances of that word.
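The nyt_fn_compare table used for this calculation was built in an earlier step that isn’t shown; a minimal sketch of how it could be derived from tokens_nyfn (column layout assumed from the code that follows):

# Sketch (assumed): per-word counts by source, spread into one column per org
nyt_fn_compare <- tokens_nyfn %>%
  count(src, word) %>%
  spread(src, n)   # yields columns: word, FoxNews.com, NYTimes.com

The disproportion ranking is then computed below: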
nyt_fn_dispro <- nyt_fn_compare %>%
mutate(my_rank = (FoxNews.com - NYTimes.com)/(NYTimes.com + FoxNews.com)) %>%
ungroup() %>%
mutate(g = str_extract(word, "[a-z']+"),
diff_chck = ifelse(word != g, "miss", "")) %>%
mutate(word = ifelse(diff_chck == "miss", word, g)) %>%
select(word:my_rank)
tot_words <- sum(nyt_fn_dispro[,2:3], na.rm = T)
#View(nyt_fn_dispro)
dispro_data <- nyt_fn_dispro %>%
mutate(all_freq = ((FoxNews.com + NYTimes.com)/tot_words)) %>%
select(word, my_rank, all_freq) %>%
mutate(word_source = ifelse(my_rank <= 0, "nyt", "fn"))
The ranking is set up such that a score of -1 means the word was found only in NYTimes.com, while a score of 1 means the word appeared only in FoxNews.com. A score of -0.8, for example, indicates that the word is found primarily in NYT, and scores closer to zero indicate that both orgs use the word about equally.
sm_table3 <- dispro_data %>% arrange(desc(all_freq)) %>% slice(1:10)
knitr::kable(sm_table3, caption = "A closer look at words")
word | my_rank | all_freq | word_source |
---|---|---|---|
the | -0.0206184 | 0.0596460 | nyt |
to | -0.0020742 | 0.0291096 | nyt |
of | -0.0937432 | 0.0241925 | nyt |
a | -0.1033517 | 0.0234500 | nyt |
and | -0.0173672 | 0.0225062 | nyt |
in | -0.1268589 | 0.0188040 | nyt |
that | -0.0597410 | 0.0148053 | nyt |
trump | -0.0129540 | 0.0133287 | nyt |
s | -0.2604137 | 0.0110110 | nyt |
on | -0.0479328 | 0.0099666 | nyt |
While the plot below doesn’t show all of the words, I’ve created it as a 30,000 ft view of the words used by both orgs, for observational interest:
ggplot(data = dispro_data, aes(x = all_freq, y = my_rank)) +
geom_text(aes(label = word, color = word_source),
check_overlap = TRUE) +
scale_x_log10(labels = percent_format()) +
coord_flip()+
theme(legend.position="none") +
labs(y = "",
x = "100-Day overall frequency")
Using the AFINN sentiments in the tidytext package, I get daily sentiment scores in chunks of ten text lines (line_num %/% 10):
check_feel2 <- tokens_nyfn %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(index = line_num %/% 10, src, pub_date) %>%
summarise(sentiment = sum(score)) %>%
mutate(s_date = substr(pub_date, 6, 10))
cx1 <- tokens_nyfn %>%
left_join(get_sentiments("afinn"), by = "word") %>%
ungroup() %>%
select(pub_date, src, score) %>%
mutate(s_date = substr(pub_date, 6, 10),
score = ifelse(is.na(score), 0, score)) %>%
na.omit()
cx2 <- cx1 %>% select(pub_date, src, score) %>%
group_by(pub_date, src) %>%
summarise_each(funs(daily_mean = mean,
daily_sd = sd))
#View(cx2)
#View(combo_feel)
#View(check_feel2)
sm_table4 <- check_feel2 %>% head()
knitr::kable(sm_table4, caption = "AFINN Scoring")
index | src | pub_date | sentiment | s_date |
---|---|---|---|---|
0 | NYTimes.com | 2017-01-20 | -1 | 01-20 |
1 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
2 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
3 | NYTimes.com | 2017-01-20 | -3 | 01-20 |
4 | NYTimes.com | 2017-01-20 | -15 | 01-20 |
5 | NYTimes.com | 2017-01-20 | 10 | 01-20 |
A view of the 100 days by box plot, with some notable events of the first 100 days marked for reference. It may not be easy to see, but it appears that NYTimes.com has a more positive sentiment profile over this time period.
c2<-my_hl(8, "Travel ban",1, -50, 50)
c3<-my_hl(25, "Flynn resigns",1, -50, 60)
c4<-my_hl(44, "Trump wiretap claim",1, -50, 55)
c5<-my_hl(60, "Comey confirms Rus probe",1, -50, 38)
c6<-my_hl(84, "US drops MOAB",1, -50, 50)
ggplot(data = check_feel2, aes(x = s_date, y = sentiment, fill = src) ) +
geom_boxplot() +
facet_wrap(~src, ncol = 1) +
theme(axis.text.x = element_text(angle = 45,
hjust = 1,
size = 6),
legend.position="none") +
geom_hline(aes(yintercept = 0),
color = "green",
size = .5) +
geom_hline(aes(yintercept = 0),
color = "green",
size = 3,
alpha = .2)+
c2[1] + c2[2] + c3[1] + c3[2] +
c4[1] + c4[2] + c5[1] + c5[2] +
c6[1] + c6[2]
#
#class(check_feel2$pub_date)
Because AFINN scores range from -5 to 5, we can compare the two orgs by the counts of these scores over the full 100-day period:
my_xl <- c('-5','-4','-3','-2','-1','0','1','2','3','4','5')
my_xb <- c(-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5)
ggplot(data = cx1) +
geom_histogram(aes(score, fill = src)) +
scale_x_continuous(breaks = my_xb, labels = my_xl) +
scale_y_log10() +
facet_wrap(~src, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 40 rows containing missing values (geom_bar).
#unique(cx1$score)
As shown above, their distributions look similar at a glance.
Next, a violin plot of the daily mean sentiment scores for each source across the 100 days:
ggplot(data = cx2, aes(x = src, y = daily_mean, fill = src)) +
geom_violin() +
geom_jitter(alpha = .5) +
geom_hline(yintercept = 0) +
theme(legend.position="none")
Looking at the means, we see they’re not that far from zero:
cx3 <- cx2
by(cx2$daily_mean,cx2$src, mean)
## cx2$src: FoxNews.com
## [1] -0.002979484
## --------------------------------------------------------
## cx2$src: NYTimes.com
## [1] 0.007719441
#by(cx2$daily_mean,cx2$src, sd)
I’ll employ the inference function from the DATA606 labs to complete the test, and I paste the results in the comments below:
#inference(y = cx2$daily_mean, x = cx2$src, est = "mean", type = "ht", null = 0,
# alternative = "twosided", method = "theoretical", conflevel = .99)
# Response variable: numerical, Explanatory variable: categorical
# Difference between two means
# Summary statistics:
# n_FoxNews.com = 100, mean_FoxNews.com = -0.003, sd_FoxNews.com = 0.019
# n_NYTimes.com = 77, mean_NYTimes.com = 0.0077, sd_NYTimes.com = 0.0149
# Observed difference between means (FoxNews.com-NYTimes.com) = -0.0107
#
# H0: mu_FoxNews.com - mu_NYTimes.com = 0
# HA: mu_FoxNews.com - mu_NYTimes.com != 0
# Standard error = 0.003
# Test statistic: Z = -4.195
# p-value = 0
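As a sanity check on the inference() output, the same comparison can be run with base R’s Welch two-sample t-test, which should yield a comparable test statistic and p-value:

# Cross-check (sketch): Welch two-sample t-test on the daily mean scores
t.test(daily_mean ~ src, data = cx2)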
As shown above, the p-value is essentially zero, so I reject the null hypothesis: the 100-day average daily sentiments differ between the two sources for this dataset.
These two news organizations certainly seem different. Whether these sentiment scores are interesting depends on the audience and how much faith you put in the AFINN lexicon. There are certainly a number of outstanding questions that could lend or remove credence from these results. As stated initially, this was an observational study, and the conclusions derived should be treated as such.