Introduction


Events and incidents of significance occur all over the globe on a daily basis. The increasing prevalence of web-based services and communication networks allows for faster, more frequent, more diverse and more easily accessible delivery of information to the average newsreader. It is therefore reasonable to presume that the amount of new information, in any digital format, actually digested by the average news-reading citizen has increased1 as social media sites and the like have evolved and grown in popularity. This massive stream of information also becomes increasingly difficult for users to navigate, and so most individuals filter it in some sense. For this reason, some media channels may aim to produce easily digestible content and catchy headlines that appeal to the ‘filtering’ user. In a broad sense, this over-dramatization of content might result in increased numbness among users, who are exposed to more ‘extreme’ stories of ‘extreme’ events than in, e.g., the 1990s or early 2000s.

The above-mentioned factors have several implications for the way users digest media content, and thus for how content is produced as well as for its aim and composition. Producers of media content may realize that the market for news has changed radically: the response of users is directly captured and in turn affects how many other users will view the content. If this is the case, the influence of users on media content is changing; alternatively, the influence of media on user interest may be the driving force behind the virality of a particular article. Either way, considerable power over the content of popular media rests with one of the two sides.

Spreading awareness of and reporting on humanitarian catastrophes, such as natural disasters, relief efforts and terror attacks, is a crucial part of the media business: users are interested in reading about them and/or expressing sympathy for victims. Naturally, however, the initial effect quickly wears off.

# Import Plotting Data #
library(ggplot2)     # plotting
library(gridExtra)   # grid.arrange() for arranging the four panels
Charlie_Hebdo = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Hebdo", sep = ",", header = F)
Charlie_Hebdo$Date = as.Date(Charlie_Hebdo$V1)
Charlie_Hebdo$Index = Charlie_Hebdo$V2
Paris_Batacaln = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Paris1", sep = ",", header = F)
Paris_Batacaln$Date = as.Date(Paris_Batacaln$V1)
Paris_Batacaln$Index = Paris_Batacaln$V2
Nepal = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Nepal", sep = ",", header = F)
Nepal$Date = as.Date(Nepal$V1)
Nepal$Index = Nepal$V2
Charlie_Sheen = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Sheen", sep = ",", header = F)
Charlie_Sheen$Date = as.Date(Charlie_Sheen$V1)
Charlie_Sheen$Index = Charlie_Sheen$V2
Charlie_Hebdo.fb = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/charliel", sep = ";")
Paris_Batacaln.fb = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/parisl", sep=";")
Nepal.fb = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Nepall", sep = ";")
Charlie_Sheen.fb = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/sheenl", sep =";")
paris = ggplot(data=Paris_Batacaln, aes(x = Date, y = Index)) + 
  geom_line(color = "#ff0000") + ggtitle("Paris") + 
  theme_minimal() + labs(x = "Date", y = "Google Trend Index") +
  theme(plot.title = element_text(size= 12, face = "bold")) 

hebdo = ggplot(data=Charlie_Hebdo, aes(x = Date, y = Index)) + 
  geom_line(color = "#ff0000") + ggtitle("Charlie Hebdo") + 
  theme_minimal() + labs(x = "Date", y = "Google Trend Index") +
  theme(plot.title = element_text(size= 12, face = "bold"))

nepal = ggplot(data=Nepal, aes(x = Date, y = Index)) + 
  geom_line(color = "#ff0000") + ggtitle("Nepal Earthquake") + 
  theme_minimal() + labs(x = "Date", y = "Google Trend Index") +
  theme(plot.title = element_text(size= 12, face = "bold"))

sheen = ggplot(data=Charlie_Sheen, aes(x = Date, y = Index)) + 
  geom_line(color = "#ff0000") + ggtitle("Charlie Sheen") + 
  theme_minimal() + labs(x = "Date", y = "Google Trend Index") +
  theme(plot.title = element_text(size= 12, face = "bold"))

grid.arrange(paris, hebdo, nepal, sheen, ncol=2, nrow=2, top = "Figure 1 : Google Trends index of events")

Curiously, the amount of media attention an event receives, and the time it takes for a topic to become ‘uninteresting’, appear at first glance to depend somewhat on the location and nature of the event.

Data


This section provides a thorough discussion of the data used, the limitations of the approach, and the techniques employed for data gathering and manipulation. The following packages are used throughout the data section:

packages.for.data <- (c("dplyr", "rvest", "XML", 
                        "stringr", "readr", "devtools", 
                        "httr", "RCurl", "curl", 
                        "data.table", "plyr"))

1 Data Sources

1.1 Google Trends

Google Trends is a free service provided by Google: Trends data provides an index of the popularity of any particular search term over a specified time period. In constructing the index, Google excludes search terms that have been used sparsely or repeatedly by the same user within a short time period, as well as searches for special characters such as ‘~’, ‘¤’ and so forth. Google constructs the Trends index by adjusting for geography and time range (data points are divided by the total number of searches for the represented geography and time period); the Trends index thus constitutes a comparative, relative index of search-term popularity.
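
The construction of the index can be illustrated with a small, stylized example (the numbers are invented and this is only a sketch of the idea, not Google's actual algorithm): raw counts for a search term are divided by the total search volume for the chosen region and period, and the resulting shares are rescaled so that the peak equals 100.

# Stylized illustration of a Trends-style index (hypothetical numbers, not Google data)
raw.searches   = c(120, 340, 900, 610, 220)                # daily searches for one term
total.searches = c(1.00e6, 1.10e6, 1.20e6, 1.15e6, 1.05e6) # all searches, same region/period
share = raw.searches / total.searches                      # adjust for overall search volume
index = round(100 * share / max(share))                    # rescale so the peak equals 100
index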

In utilizing the Google Trends service, one has to understand that search terms fed into the engine are displayed for the particular region in which the user is located. Also, because the output is an index, it is impossible to compare the order of magnitude of the search popularity of two separate events or search terms. However, intra-time-series movements do provide some insight into trending behaviour in the relative usage of search terms. For example, it is not possible to compare the absolute volume of the search terms “ISIS” and “Boko Haram”, but one may find that spikes in user interest for these search terms exhibit similar properties, i.e. rate of decline or the length of time for which the time series deviates from its mean.

The above-mentioned properties are the ones of interest in this analysis, and the authors refrain from any comparison of the actual search popularity across events or search terms.
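
As a sketch of how such intra-series properties could be computed, the function below summarises a single Trends series. It assumes the Date/Index columns used in the plotting code for Figure 1 and uses an arbitrary threshold of one standard deviation for the deviation measure:

# Sketch: summarise the trending behaviour of one series (here Charlie_Hebdo from Figure 1)
trend_properties = function(df) {
  peak = which.max(df$Index)
  post.peak = df$Index[peak:nrow(df)]
  # Average index points lost per day from the peak to the end of the window
  decay.rate = if (length(post.peak) > 1)
    (post.peak[1] - post.peak[length(post.peak)]) / (length(post.peak) - 1) else NA_real_
  # Number of days the index deviates more than one standard deviation from its mean
  days.deviating = sum(abs(df$Index - mean(df$Index)) > sd(df$Index))
  c(peak = max(df$Index), decay.rate = decay.rate, days.deviating = days.deviating)
}
trend_properties(Charlie_Hebdo)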

1.2 Facebook

Facebook Page data provides basic information about each post of a particular Facebook page. Facebook serves as a social media website in a multitude of ways; activity ranges from user-to-user and user-to-page to page-to-user interaction. This analysis relies on page-to-user interactions. By investigating popular pages that serve as news providers to Facebook users, one can obtain information about the popularity of any particular topic and also, to some extent, about the response of users as private agents.

A limitation of this type of data is that a count of likes, comments and shares serves only as an intermediary measurement tool: one cannot derive the actual intention or degree of consent of the user. As such, this analysis does not rely on interpreting user responses, but rather utilizes the history of posts by pages and their titles, which contain information about what the news services deem on-topic and relevant at any point in time. Thus, for any event of interest one can, to some extent, draw inference on the media output, which in turn may provide some insight into what the media expect Facebook users to be interested in.

This approach of course has limitations in terms of the precision of measurement. However, it seems reasonable to argue as follows: media channels thrive on popularity and attention from their users, so a media channel will make some prediction about what its user base might find interesting. Media channels can, in this sense, be seen as attention-maximizers, providing news that users will like. Alternatively, one could argue that media channels act as creators of opinion and, as such, ‘instruct’ their users on which topics are important. Either way, media channels do reflect public interest to some extent; any discussion of the direction of the causal link lies outside the scope of this analysis.

2 Data Gathering

2.1 Gathering data from Google Trends

Initially, a list was created of all significant events of 2015 in the following categories: terrorism, natural disasters and celebrity scandals. In order to reduce bias, as many events as possible were included from all around the world. The primary source for this process was Wikipedia; for celebrity scandals, an extensive web search was conducted. For each event, the date, a unique search term, the number of casualties, the city and country, and the event category were recorded.

2.1.1 Preparation

Some of this information is used in gathering the Google Trends data, namely the date/time and the unique search term. These data were coerced into separate character vectors, and some manipulations were needed for proper utilization:

###########################################
## Preparing input-strings for Google_Trends_Fetch function
###########################################

# Import external files - seperate lists
Google.Search.Date = read.csv("C:/Users/Adam/Desktop/Google.Search.Dates.csv", sep = ";", stringsAsFactors = F, header = F)
Google.Search.Term = read.csv("C:/Users/Adam/Desktop/Google.Search.Terms.csv", sep = ";", stringsAsFactors = F, header = F)
# Remove leading & trailing whitespace
Google_Trim = function(x){
  gsub("^\\s+|\\s+$", "", x)
}
# Replace space & tab between words with '+'
Google_Trim2 = function(x){
  gsub("[[:space:]]", "+", x, perl = TRUE)
}
Google_Trim3 = function(x){
  gsub("-", "+", x)
}
Google.Search.Term. = lapply(Google.Search.Term, Google_Trim)
Google.Search.Term. = lapply(Google.Search.Term., Google_Trim2)
Google.Search.Term. = lapply(Google.Search.Term., Google_Trim3)
Google.Search.Date = as.vector(Google.Search.Date$V1)
Google.Search.Term. = as.vector(Google.Search.Term.$V1)
2.1.2 Retrieving

In gathering data from Google Trends, three feasible approaches exist: 1) the RGoogleTrends R package, which has some deficiencies and kinks, 2) downloading a spreadsheet from the Google Trends website, or 3) using the underlying script of the GTrends package, which allows for customization. The latter approach was employed for two reasons: efficiency in handling the authorization curl (download speed) and the possibility of creating custom request functions, which provides a better framework for automation and output customization.

Christoph Riedl posted a basic framework for fetching Google Trends data, which utilizes the ‘RCurl’ package to take care of the logon procedures (authorization, cookies and authentication). Upon authentication, the getForm() function can be calibrated to send requests to the Google server and thereby import an untidy Trends data file.

############################################
##    Query GoogleTrends from R
##
## by Christoph Riedl, Northeastern University
## Additional help and bug-fixing re cookies by
## Philippe Massicotte Université du Québec à Trois-Rivières (UQTR)
############################################


# Load required libraries
library(RCurl)      # For getURL() and curl handler / cookie / google login
library(stringr)    # For str_trim() to trim whitespace from strings
library(dplyr)    # Loop

# Google account settings
username <- "sdsprojectgr15@gmail.com" #mod
password <- "socialdatascience" #mod

# URLs
loginURL        <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL       <- "http://www.google.com/trends/trendsReport?"



############################################
## This gets the GALX cookie which we need to pass back with the login form
############################################
getGALX <- function(curl) {
  txt = basicTextGatherer()
  curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )
  
  tmp <- txt$value()
  
  val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], val = TRUE)
  
  return( strsplit( val, "[:=;]")[[1]][3]) 
}


############################################
## Function to perform Google login and get cookies ready
############################################
gLogin <- function(username, password) {
  ch <- getCurlHandle()
  
  ans <- (curlSetOpt(curl = ch,
                     ssl.verifypeer = FALSE,
                     useragent = getOption('HTTPUserAgent', "R"),
                     timeout = 60,         
                     followlocation = TRUE,
                     cookiejar = "./cookies",
                     cookiefile = ""))
  
  galx <- getGALX(ch)
  authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)
  
  authenticatePage2 <- getURL("http://www.google.com", curl=ch)
  
  if(getCurlInfo(ch)$response.code == 200) {
    print("Google login successful!")
  } else {
    print("Google login failed!")
  }
  return(ch)
}

############################################
## Read data for a query
############################################
ch <- gLogin( username, password)
authenticatePage2 <- getURL("http://www.google.com", curl=ch)
res <- getForm(trendsURL, q=" ", date = " ", content=1, export=1, graph="all_csv", curl=ch) #mod
res
# Check if quota limit reached
if( grepl( "You have reached your quota limit", res ) ) {
  stop( "Quota limit reached; You should wait a while and try again later" )
}

The aforementioned lists now have to be fed into a function that does all the handling/handshaking procedures. This would be easy were it not for the fact that the output files are difficult to handle and that the function takes multiple inputs. The following approach was taken: utilize the objects from Riedl’s script (ch, authenticatePage2 and res), create a function that loops through two ordered lists in parallel, outputs the res file under a dynamically generated name, and writes this file as a .csv to a local directory. The function is then run with ‘mapply’, the multi-variable apply:

# Create function with multi-input: search-term & date (YYYY-MM) that outputs as .csv in locale = "out"
  # Unfortunately, the locale has to be placed on the C-drive for the function to work - further calibration seems unnecessary
  # Write to folder in /Documents/, "csv.symphony": Folder has to be created before proceeding

out <- "C:/Users/Adam/Documents/csv.symphony"

options(warn = 2)

# This function does multiple things
  # 1) Verify login information towards Google.com as specified above using curl
  # 2) Loop over the two variables contained in the input vectors
    # 2.1) Request the Trends file containing the information we're interested in
    # 2.2) Dynamic file naming by 'i' in .csv format
    # 2.3) Write to the 'out' locale defined above for each element in i
  # 3) Stop if something goes wrong - i.e. if request limits have been reached

Google_Trends_Fetch = function(x, y){
  ch <- gLogin( username, password )
  authenticatePage2 <- getURL("http://www.google.com", curl=ch)
  for(i in x) {
    for(j in y){
      res <- getForm(trendsURL, q = i, date = j, content = 1, export = 1, graph = "all_csv", curl = ch)
      myfile <- file.path(out, paste0("GTrends", "_", i, ".csv"))
      write.table(res, file = myfile, sep = ";", row.names = FALSE, col.names = FALSE,
                  quote = FALSE, append = FALSE)
    }
  }
  if( grepl( "You have reached your quota limit", res) ) {
    stop( "Quota limit reached; You should wait a while and try again later" )
  }
}
###########################################
## Running the Google_Trends_Function
###########################################
# For (nested) multi-loop: Use "mapply" for inputs > 1:
  # When running this fct be aware of the definition of the out-locale, set to local dir
  # Writes .csv's as: "GTrends_Attack+refugee+center+market+N'Djamena", etc.
    # Read as mapply(FUN, x, y)
mapply(Google_Trends_Fetch, Google.Search.Term., Google.Search.Date)
2.1.3 Manipulation

From here, each .csv should be read according to the naming scheme applied by the Google_Trends_Fetch function. This can be done by using the same list as input into a read function. Each .csv should constitute a data frame within a list, which proves useful for the data manipulation steps.

# Simple import - specifications have to be made outside the lapply
Google_CSV_Import = function(x){
  read.csv(file.path(out, x), sep = ";", header = FALSE)
}
# Read all filenames from out <- "C:/Users/Adam/Documents/csv.symphony"
filenames <- list.files(path = out, pattern = ".*csv")
# Apply the list of filenames to Google_CSV_Import in order to obtain a list of data frames,
# and name each list element after its source file
filelist <- lapply(filenames, Google_CSV_Import)
names(filelist) <- filenames
# If it is desired to analyze them as separate data frames, the command below should be used
  # Note: the invisible() wrapper keeps lapply from printing the data frames to the console
invisible(lapply(names(filelist), function(x) assign(x, filelist[[x]], envir = .GlobalEnv)))

The data gathered using the ‘Google_Trends_Fetch’ function consists, after cleaning, of 236 separate data frames. Each of these is a time series of daily search-term indices for one month. In order to obtain a single workable data frame, ldply [plyr], str_split_fixed [stringr] and cbind are used:

pkgs2 = c("tidyr", "stringr", "readr")
lapply(pkgs2, library, character.only = TRUE)
# Naming the individual data.frames appropriately
filenames2 = gsub("GTrends_", "", filenames)
filenames2 = gsub(".csv", "", filenames2)
filenames2 = gsub("\\+", "_", filenames2)
names(filelist) <- paste0(filenames2)

# Stack the list of data frames, keep only the 2015 daily rows, and split date/index
filelist1 = ldply(filelist, rbind)
filelist1$V1 = as.character(filelist1$V1)
filelist1 = filelist1[grep("^2015", filelist1$V1, perl = TRUE), ]
filelist1$V2 = NULL

filelist2 = str_split_fixed(filelist1$V1, ",", 2)
filelist3 = cbind(filelist1, filelist2)
filelist3$V1 = NULL
filelist3$date  = filelist3[, 2]
filelist3$index = as.numeric(as.character(filelist3[, 3]))
filelist3[, 2] = NULL
filelist3[, 2] = NULL

2.2 Gathering data from Facebook

To obtain a comprehensive list of news pages on Facebook, the authors used a website listing the 100 most popular Facebook news pages. All of these pages are English-language media channels. One might argue that multiple sources should be used for this page search; however, it seems reasonable to assume that this list is broad and representative of Western media channels on Facebook.

2.2.1 Crawler for newspage index
### The packages
pkgs = c("rvest", "plyr", "stringr", "devtools", "httr", "RCurl", "curl", "XML")
lapply(pkgs, library, character.only = TRUE)

The webpage has several subsites, each presenting 20 Facebook pages ordered by popularity; the URL scheme is /page’x’. In order to fetch the 100 most popular Facebook news pages, one simply has to scrape the first five subsites. One small complication: the Facebook links are not directly presented in the HTML of the front page, which instead links to the website’s own subpage for each Facebook page. It is therefore necessary to ‘visit’ each of the 100 subpages to fetch the Facebook page link. The crawler thus needs to be fed a list of 100 URLs, which are obtained in the following way with ‘rvest’:

### First steps for crawling fanpagelist.com for facebook pages
  # Website structure : http://.../sort/fans/page1, http://.../sort/fans/page2, etc.: Shows 20 fb-pages
  # Create vector of 5 -> insert as to have 5 different links -> 100 fb-pages
vec5 = c(1:5)
linksub.li = "http://fanpagelist.com/category/news/view/list/sort/fans/pageLANK"

link_str_replace = function(vec5){
  link.sub = gsub("\\LANK", vec5, linksub.li)
}

link.sub = llply(vec5, link_str_replace) # Coerce into list by applying link_str_replace fct.

### Crawling strategy:
  # As the website does not have href's for FB on front page, we need to visit each fanpagelist.com's page for the facebook-page
  # From there we can retrieve the link to facebook. 
      # Fetch URL's for each fanpagelist.com/page 

css.selector_1 = "a:nth-child(3)"   #URL and/or TITLE depends on 'html_attr(name = href/title)'
css.selector_2 = "div.listing_profile > a"
scrape_links_fanpage = function(link.sub){
  link.url = read_html(link.sub, encoding = "UTF-8") %>%
    html_nodes(css = css.selector_2) %>%
    html_attr(name = "href")
  return(rbind(link.url))
}

### Apply function using simple for-loop
    # Sys.sleep(1) as to not overload servers or block/slow traffic
fanpage.data = list()
for(i in link.sub){
  print(paste("processing", i, sep = " "))
  fanpage.data[[i]] = scrape_links_fanpage(i)
  Sys.sleep(1)
  cat("done!\n")
}

This provides 5 lists, each containing 20 character elements, and some manipulation of these lists is required. Furthermore, the links provided are not full URLs, but only the last part of a URL string, for example /page/bbc rather than http://fanpagelist.com/page/bbc, which is needed for fetching the Facebook link for the particular page:

### Setting up data for next step
  # Each URL yields list with 20 elements 
  # Manipulate into a data frame, transpose, read as characters rather than factors
fanpage.df = data.frame(fanpage.data)
fanpage.df = t(fanpage.df)
fanpage.df = as.character(fanpage.df)

### Preparing for crawling fanpagelist.com's subsites for facebook links
  # http://www.fanpagelist.com only provides href in condensed form: /page/bbc
  # We need the full URL to actually visit the website
    
fanpageurl = "http://fanpagelist.comLANK" # The URL-part that is not provided by scraping
fanpage.fct = function(fanpage.df){
  link.sub2 = gsub("\\LANK", fanpage.df, fanpageurl)
}
fanpage.li = llply(fanpage.df, fanpage.fct)

Once all the URLs are obtained, it is necessary to visit each URL and extract the ‘href’ attribute from each page’s HTML structure; in this case via the selector ‘a:nth-child(3)’.

### The actual crawler:
  # Look at the complete link and fetch only the facebook href-link
css.selector_3 = "a:nth-child(3)"
scrape_links_facebook = function(link.sub2){
  facebook.link.url = read_html(link.sub2, encoding = "UTF-8") %>%
    html_nodes(css = css.selector_3) %>%
    html_attr(name = "href")
  return(cbind(facebook.link.url))
}
  # Do this for all links contained in fanpage.li
facebook.data = list()
for(i in fanpage.li){
  print(paste("processing", i, sep = " "))
  facebook.data[[i]] = scrape_links_facebook(i)
  Sys.sleep(1)
  cat("done!\n")
}

The output is reformatted into a workable list of characters, and some cleaning is needed. The Rfacebook API setup takes input in the form of a page name, not a page URL, so it is necessary to remove ‘https://www.facebook.com/’ from the elements in the list.

### Reformatting the data
  # As before the format is not really workable
  # Some manipulations are required
facebook.data = unlist(facebook.data)
facebook.frame = data.frame(facebook.data)
new_DF = facebook.data
new_DF = new_DF[grep("www.facebook.com", new_DF)]
new_DF = data.frame(new_DF)
page.names = gsub("\\https://www.facebook.com/", "", new_DF$new_DF)
page.names = gsub("/", "", page.names)
  #Remove duplicates
page.names = page.names[!duplicated(page.names)]
2.2.2 Request using Facebook API with Oauth

Requesting all 2015 posts of 100 news pages from the Facebook API takes an immense amount of time, so it does not seem sensible to use the Graph API 2-hour token for this purpose. Instead it proves useful to apply the OAuth procedure of Pablo Barbera:

### SETTING UP
  # Create app on developers.facebook.com
  # Facebook API Oauth procedure - setting up using Guide from Pablo Barberas Rfacebook package
install_github("pablobarbera/Rfacebook/Rfacebook")
library("Rfacebook")
  ## Create oauth in order to avoid having to fetch token every 2nd hr manually
fb_oauth <- fbOAuth(app_id = "XXX", app_secret = "XXX", extended_permissions = FALSE) # 
save(fb_oauth, file = "fb_oauth")
  ## Load oauth-key/token
load("fb_oauth")

Then, the request function is set up:

### Creating function for gathering all 2015 posts
Facebook_Page_Fetch = function(x){
  fb.feed = getPage(x, token = fb_oauth, n = 99999, since = '2015/01/01', until = '2015/12/05')
  return(cbind(fb.feed, x))
}

Initially, using plyr for requesting from the API worked well; however, the process gets shut down at about 100,000 posts due to the high request frequency:

Ply.Fetch = data.frame(ldply(page.names, Facebook_Page_Fetch, .inform = TRUE))

So, in order to avoid being shut down, a for loop with a pause between iterations is used instead:

options(warn=1)
News_Facebook.ldf = list()
for(i in page.names){
  print(paste("processing", i, sep = " :: "))
  News_Facebook.ldf[[i]] = Facebook_Page_Fetch(i)
  Sys.sleep(0.01)
  cat("done!\n")
}
News_Facebook.df = ldply(News_Facebook.ldf, data.frame)
2.2.3 Manipulation

In manipulating the data retrieved via the Facebook API, the aim is to determine the number of posts per day that mention a set of words relating to an actual event. The variable message contains the headline of every scraped post and is therefore the only variable of interest for this task. The only feasible approach is to search for keywords in the character column message; in this way it is possible to obtain an approximate, binary measure (1 or 0) of whether a particular post relates to an event of interest.

Keywords are determined for each event separately, using the search terms from the Google Trends part as well as the geographical location. These keywords are applied with the stringr package, in particular the str_count function, with a logical OR separating the keywords; a search term applied with str_count might be of the type “Paris | France | Bataclan | Shooting | Terror | …”. For each complete search term, a binary variable corresponding to an event is created. The question is how to build a precise search that captures almost all posts related to the specific event and almost none covering different topics. Our approach is to require specific mandatory keywords and then to evaluate the post against a pool of further optional keywords. We considered using regular expressions with two groups of keywords, mandatory and optional, such that a post must contain at least one keyword from each group to qualify as relevant for the topic. However, it proved more time-efficient to conduct a two-step procedure rather than encoding this in a single regular expression. In the first round we search each post for the mandatory words with the str_count function. As a general rule for what a mandatory term should be, we argue that location is the most appropriate criterion, i.e. the country and city name where the event took place; only for the celebrity scandals are the family names of the persons involved used as the mandatory keywords.

From here, two approaches are feasible using search terms and str_count: 1) dynamically create a list of search terms to loop through for all events, creating 236 variables, or 2) manually write specific search terms for a subset of events and employ these in a stringr function. Each has advantages and limitations: 1) does not allow search terms to be customized to the extent that 2) does, while 2) is tedious and will not produce the full range of variables that 1) would. Upon looking briefly at a couple of events, it was evident that many of the events did not receive enough attention from Western media to constitute an interesting time series. We therefore focused only on the twelve events we deemed most important (determined by casualties across geography). This of course implies bias, as the authors are influenced by the media channels analysed; subsequent sampling of events not included in this first subset was conducted, and no events with more than 20 data points were found using custom search terms. From here, the optional terms are applied: a match on any single optional keyword in the second step qualifies a post as related to the event. For most terror attacks these optional lists are very similar, including words like bomb, terror or the name of the recently most active terror group, ISIS. There is an obvious trade-off between including more words, increasing the chance of a matched post, and keeping the search precise.
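
As an illustration of the two-step filter described above, the sketch below flags posts relating to the Paris attacks; the keyword lists, the threshold and the new column name are examples only, not the exact terms used in the analysis:

library(stringr)
# Illustrative two-step keyword filter on the post headlines
mandatory.paris = "Paris|France|Bataclan"             # step 1: location terms
optional.paris  = "terror|attack|shooting|bomb|ISIS"  # step 2: event-type terms
msg = ifelse(is.na(News_Facebook.df$message), "", News_Facebook.df$message)
# A post qualifies only if it matches at least one mandatory AND one optional keyword
News_Facebook.df$paris.match = as.integer(
  str_count(msg, regex(mandatory.paris, ignore_case = TRUE)) > 0 &
  str_count(msg, regex(optional.paris,  ignore_case = TRUE)) > 0
)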

From here, each data frame was reduced to include only the relevant time horizon of one month, which makes for a straightforward comparison with the Google Trends data. In this way, a data frame is obtained for each event, each with an added binary column for search-term matches.
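
A sketch of that reduction and of the resulting daily post count is given below; the event date and the one-month window are illustrative, and paris.match refers to the indicator created in the sketch above:

# Sketch: one-month window around the event and daily count of matching posts
News_Facebook.df$day = as.Date(News_Facebook.df$created_time)
event.date = as.Date("2015-11-13")   # illustrative: Paris attacks
window = subset(News_Facebook.df, day >= event.date & day < event.date + 31)
paris.daily = aggregate(paris.match ~ day, data = window, FUN = sum)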

This allows for plotting some basic graphs showing the movement in the number of media posts published on Facebook relating to the events of interest. It turns out that some events receive very little to almost no media attention, such as the Indian heat wave2 or the massacre of 2,000 civilians in Nigeria3. This finding could partly be explained by the biased selection of only English-language news sites on Facebook. A deeper analysis is therefore only applied to selected events with a sufficient number of posts. Furthermore, there are several ways the search could be improved. A very tedious way would be to improve the optional keywords by checking the headlines of the most important news magazines, papers and sites one by one; an article may be dropped by the existing search because it uses a synonym of an included keyword, and the new word could then be added. A keyword-based search is therefore inherently difficult: English is very rich in synonyms, which is a general problem of keyword-based search.

A more extensive search could be conducted by creating a crawler that follows the URL of the corresponding news article in each Facebook post, downloads each of the approximately 600,000 HTML texts and applies string-match analysis to these. Given the size of the data frame, and the fact that the main subject of an article is typically contained in the title or description, this approach was deemed unnecessary.

3 Description of Data

3.1 Google Trends Data

Google Trends data is used in two different ways: first, as a time series of the Trends index for each search term/event; second, as a joined, cross-sectional data frame of static measures derived from the time series (min, max, rate of decay from peak, etc.) combined with external static measures such as the number of casualties, city, country and event type.
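
A sketch of how such static features could be derived is given below; it assumes the stacked data frame filelist3 from section 2.1.3 (columns .id, date and index, as also shown in the glimpse below), and 'decay' is measured crudely as index points lost per day from the peak to the end of the window:

# Sketch (base R): collapse each stacked Trends series into one cross-sectional row per event
make_static = function(df) {
  peak = which.max(df$index)
  data.frame(min   = min(df$index, na.rm = TRUE),
             max   = max(df$index, na.rm = TRUE),
             range = max(df$index, na.rm = TRUE) - min(df$index, na.rm = TRUE),
             # crude decay: index points lost per day from the peak to the end of the window
             decay = (df$index[peak] - df$index[nrow(df)]) / max(1, nrow(df) - peak))
}
trends.static = do.call(rbind, lapply(split(filelist3, filelist3$.id), make_static))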

*Description of Google Trends time series data:*

require("dplyr")
require("readr")
External.file.1 = read.table("https://raw.githubusercontent.com/adamingwersen/Project.15/master/glimpse", sep = ";")
glimpse(External.file.1)
## Observations: 7,191
## Variables: 3
## $ .id   (fctr) Al_Shabaab_attack_governement_building, Al_Shabaab_atta...
## $ date  (fctr) 2015-04-01, 2015-04-02, 2015-04-03, 2015-04-04, 2015-04...
## $ index (int) 91, 95, 96, 93, 92, 97, 97, 97, 98, 97, 93, 91, 97, 98, ...

3.2 Facebook Data

The data gathered via the Facebook API for the entire year of 2015 for the 100 news pages has 12 variables, 6 of which are of interest.

The data is structured as a data frame of roughly 600,000 observations, each corresponding to a post by a Facebook page.

Analysis


Visual Inspection

The data shows some irregularities: not all events are captured equally well in terms of Facebook posts. Events within Western countries appear to be given more attention by Western media, as can be seen from the y-axes of Figure 2 below, even when some of the other events caused particularly high numbers of casualties. The important point here is that the overall movements of the Facebook and Google data appear to be highly correlated, which means that, to some extent, Google data is a simple but useful way of depicting overall public interest.

Charlie_Hebdo = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Hebdo", sep = ",", header=F)
Charlie_Hebdo$V2=as.numeric(Charlie_Hebdo$V2)
Charlie_Hebdo$V1=as.Date(Charlie_Hebdo$V1)

Paris_Batacaln = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Paris1", sep = ",", header=F)
Paris_Batacaln$V2=as.numeric(Paris_Batacaln$V2)
Paris_Batacaln$V1=as.Date(Paris_Batacaln$V1)

Nepal = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Nepal", sep = ",", header=F)
Nepal$V2=as.numeric(Nepal$V2)
Nepal$V1=as.Date(Nepal$V1)

Charlie_Sheen = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/Sheen", sep = ",", header=F)
Charlie_Sheen$V2=as.numeric(Charlie_Sheen$V2)
Charlie_Sheen$V1=as.Date(Charlie_Sheen$V1)

charlie.df2 = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/charlie.df2", sep=";", header = FALSE)
sheen.df2 = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/sheen.df2", sep=";", header = FALSE)
nepal.df2 = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/nepal.df2", sep=";", header = FALSE)
paris.df2 = read.csv("https://raw.githubusercontent.com/adamingwersen/Project.15/master/paris.df2", sep=";", header = FALSE)
charlie.df2$V2 = as.Date(charlie.df2$V2)
sheen.df2$V2 = as.Date(sheen.df2$V2)
nepal.df2$V2= as.Date(nepal.df2$V2)
paris.df2$V2= as.Date(paris.df2$V2)
par(mfrow=c(2,2))

## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)

## Plot first set of data and draw its axis
plot(Paris_Batacaln$V1, Paris_Batacaln$V2, pch=16, axes=FALSE, ylim=c(0,100), xlab="", ylab="", type="l", col="black", main="Paris")
axis(2, ylim=c(0,100),col="black",las=1)  ## las=1 makes horizontal labels
mtext("Google Trends",side=2, line=2.5)
box()

## Allow a second plot on the same graph
par(new=TRUE)

## Plot the second plot and put axis scale on right
plot(paris.df2$V2, paris.df2$V3, pch=15,  xlab="", ylab="", type="l", ylim=c(0,339), 
     axes=FALSE, col="red")


## a little farther out (line=4) to make room for labels
mtext("Facebook",side=4,col="red", line=2.5) 
axis(4, ylim=c(0,339), col="red",col.axis="red",las=1)

## Draw the time axis
mtext("Date",side=1,col="black", line=2.5)  

## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)

## Plot first set of data and draw its axis
plot(Charlie_Hebdo$V1, Charlie_Hebdo$V2, pch=16, axes=FALSE, ylim=c(0,100), xlab="", ylab="", type="l", col="black", main="Charlie Hebdo")
axis(2, ylim=c(0,100),col="black",las=1)  ## las=1 makes horizontal labels
mtext("Google Trends",side=2, line=2.5)
box()

## Allow a second plot on the same graph
par(new=TRUE)

## Plot the second plot and put axis scale on right
plot(charlie.df2$V2, charlie.df2$V3, pch=15,  xlab="", ylab="", type="l", ylim=c(0,74), 
     axes=FALSE, col="red")


## a little farther out (line=4) to make room for labels
mtext("Facebook",side=4,col="red", line=2.5) 
axis(4, ylim=c(0,74), col="red",col.axis="red",las=1)

## Draw the time axis
mtext("Date",side=1,col="black", line=2.5)  


## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)

## Plot first set of data and draw its axis
plot(Nepal$V1, Nepal$V2, pch=16, axes=FALSE, ylim=c(70,100), xlab="", ylab="", type="l", col="black", main="Nepal Earthquake")
axis(2, ylim=c(70,100),col="black",las=1)  ## las=1 makes horizontal labels
mtext("Google Trends",side=2, line=2.5)
box()

## Allow a second plot on the same graph
par(new=TRUE)

## Plot the second plot and put axis scale on right
plot(nepal.df2$V2, nepal.df2$V3, pch=15,  xlab="", ylab="", type="l", ylim=c(0,67), 
     axes=FALSE, col="red")


## a little farther out (line=4) to make room for labels
mtext("Facebook",side=4,col="red", line=2.5) 
axis(4, ylim=c(0,67), col="red",col.axis="red",las=1)

## Draw the time axis
mtext("Date",side=1,col="black", line=2.5)  

## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)

## Plot first set of data and draw its axis
plot(Charlie_Sheen$V1, Charlie_Sheen$V2, pch=16, axes=FALSE, ylim=c(0,100), xlab="", ylab="", type="l", col="black", main="Charlie Sheen")
axis(2, ylim=c(0,100),col="black",las=1)  ## las=1 makes horizontal labels
mtext("Google Trends",side=2, line=2.5)
box()

## Allow a second plot on the same graph
par(new=TRUE)

## Plot the second plot and put axis scale on right
plot(sheen.df2$V2, sheen.df2$V3, pch=15,  xlab="", ylab="", type="l", ylim=c(0,123), 
     axes=FALSE, col="red")


## a little farther out (line=4) to make room for labels
mtext("Facebook",side=4,col="red", line=2.5) 
axis(4, ylim=c(0,123), col="red",col.axis="red",las=1)

## Draw the time axis
mtext("Date",side=1,col="black", line=2.5)

Supervised Learning

Given that the Google data seem to capture a significant proportion of public interest for a subset of the events analyzed, it may prove insightful to attempt to determine, from the Google data, the underlying mechanics of trending behaviour and the events that trigger it. One interesting way of utilizing the collected data is to attempt to predict the category of an event based on the other variables. Two supervised learning techniques are employed to classify, and thereby predict, the outcomes of interest.

Due to time constraints, the dataset used for the supervised learning tasks is a reduced version of our dataset, containing 105 observations. For further exploration, it could be interesting to apply the same methods to a data frame containing more observations.

Classification Trees

The approach of classification trees is rather intuitive, since it follows a general logic: each observation is classified according to a set of decision rules, indicated by the nodes of the tree, and their resulting outcomes. When applying an algorithm such as Hunt’s for growing decision trees, one should keep in mind to ensure that the tree does not get too complex, since the procedure could otherwise result in a tree that classifies each observation on its own. First, the dataset is separated into a training and a test set by randomly assigning each observation a value of 1 or 2, where 1 is assigned with probability 0.67 and 2 with probability 0.33. In this way a training set of 63 observations is created, while the remaining observations form the test set. Based on the training set, the rpart package by Therneau, Atkinson and Ripley is applied, which results in the classification tree shown below.
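
The split itself is not shown in the tree code below; a minimal sketch, assuming the cross-sectional data frame is called newdf as in the K-nearest-neighbours section and reusing the same random assignment, would be:

library(rpart)        # recursive partitioning trees
library(rpart.plot)   # plotting of rpart objects
# Sketch of the 2/3 - 1/3 split described above (mirrors the KNN section below)
set.seed(1234)
ind <- sample(2, nrow(newdf), replace = TRUE, prob = c(0.67, 0.33))
training_CART <- newdf[ind == 1, ]
test_CART     <- newdf[ind == 2, ]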

CART=rpart(Category ~ Casualties+max+min, training_CART, method="class")
rpart.plot(CART, type=3, uniform=TRUE, extra=101, fallen.leaves = TRUE,main="Classification Tree")

From the classification tree one can see that an event causing more than 4 casualties is most likely a terrorist attack, whereas events with fewer casualties are presumably celebrity scandals. The model has an error rate of 10 per cent. The graphical representation of a classification tree is rather intuitive: the attribute at the top of the tree is the most predictive feature, and the importance of the features for prediction decreases moving downwards. In this case, the algorithm estimates that the number of casualties is the sole feature predicting whether a given event is a terrorist attack or a celebrity scandal, while the range, min and max of the Google Trends data do not seem to have any predictive power. The percentages in the leaf nodes, i.e. the boxes at the bottom of the classification tree, describe how many of the observations are classified as a given category. The remaining numbers in a leaf node indicate the purity of the node, showing how many events in the training set were correctly and incorrectly classified. For example, one can see that 3 events that are actually natural disasters were incorrectly classified as terror attacks, whereas 50 events were correctly classified as terrorist attacks. To avoid overfitting the data, the tree has been pruned by cost-complexity pruning, which is included in the rpart package; this approach penalizes the tree-growing process for increasing the complexity of the tree. The complexity parameter is chosen as the value associated with the minimal cross-validated error, which here corresponds to a tree of around 2 leaf nodes. The pruning does not alter the performance, which is probably due to our limited dataset.
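
A sketch of this pruning step, using rpart's built-in cross-validation table, is given below; the complexity parameter is read off the cptable at the minimal cross-validated error rather than hard-coded:

# Inspect the cross-validated error for each complexity parameter value
printcp(CART)
# Prune at the cp value with the lowest cross-validated error ("xerror")
best.cp = CART$cptable[which.min(CART$cptable[, "xerror"]), "CP"]
CART.pruned = prune(CART, cp = best.cp)
rpart.plot(CART.pruned, type = 3, extra = 101, fallen.leaves = TRUE,
           main = "Pruned classification tree")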

K-nearest Neighbours

The approach of K nearest neighbours is rather intuitive: each observation is classified according to the category of its K nearest neighbours by majority voting, i.e. the event is assigned the dominating category among the K neighbours.
First, the features casualties, min, max and range are normalized, so that their differing ranges do not affect the outcome of the algorithm. Before applying the class package by Ripley and Venables, the dataset is divided into a training and a test set. The value of k is chosen by evaluating the test error rate for a range of values of k and picking the value that leads to the minimal test error rate, since small values of k result in rather irregular, high-variance decision regions, whereas high values risk misclassifying at the local level.

library("class")

########################
#######   KNN  #########
########################

#normalization of decay, casualties, max and min - so that attributes on larger scales do not get undue influence
summary(newdf)
normalize = function(x){ 
  return ( (x-min(x)) / (max(x)-min(x)) )
}
newdf1 <- as.data.frame(lapply(newdf[,c(8,16:18)], normalize))
summary(newdf1)

#creating training and test sets by randomly assigning 2/3 of the observations to the training set and 1/3 to the test set
set.seed(1234)
ind <- sample(2, nrow(newdf), replace=TRUE, prob=c(0.67, 0.33))
training_KNN <- newdf1[ind==1, 1:4]
test_KNN <- newdf1[ind==2, 1:4]

#Categories of the training and test set
trainLabels_KNN <- newdf[ind==1, 11]
testLabels_KNN <- newdf[ind==2, 11]

#Which K, graphical examination
library("class")

k=c(1:10)
p=rep(0,10)
summary=cbind(k,p)
colnames(summary)=c("k","Test error rate")
for(i in 1:10){
  result=knn(training_KNN, test_KNN, cl=trainLabels_KNN, k=i)
  summary[i,2]=(nrow(test_KNN)-sum(diag(table(result,testLabels_KNN))))/nrow(test_KNN) }
plot(summary, type="o", col="orange")
title(main="Per cent misclassified", col.main="black", font.main=4)

#KNN classification with chosen K (requires the graphical examination above of which k to choose)
# The former plot indicates a K of XX, but K should preferably be an odd number to avoid problems with ties 
KNN = knn(train=training_KNN,test=test_KNN, cl=trainLabels_KNN, k=8, prob=TRUE)

#performance of the algorithm (contingency table)
c=table(testLabels_KNN,KNN)
c

The test error rate is derived from the knn() function, which finds the k nearest neighbours of each observation in the test set based on the training set. With K=8, the K nearest neighbours model gives an error rate of 20 per cent, based on the confusion matrix.
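
The error rate follows directly from the confusion matrix c computed above, as the share of off-diagonal test observations (assuming every category occurs in both the test labels and the predictions, so that the table is square):

# Misclassification rate: 1 minus the share of correctly classified test observations
1 - sum(diag(c)) / sum(c)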

Performance

As mentioned, the classification tree performs better on the given dataset than the K nearest neighbours algorithm. Although this is not the general result when comparing the two algorithms, where KNN normally tends to perform better, both perform quite well on the supplied dataset. One should, however, keep in mind that a rather small dataset is used, which makes the results rather sensitive, as shown by the confusion matrices, where a single incorrectly classified event has a large impact on the error rate due to the limited number of observations. Furthermore, the dataset is not well balanced; one ought to expand the time frame or otherwise obtain a more balanced dataset, i.e. more events categorized as celebrity scandals or natural disasters relative to terrorist attacks. Further work on this topic should therefore include a larger and more balanced dataset before making any real predictions.

Conclusion


….


  1. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/Reuters%20Institute%20Digital%20News%20Report%202015_Full%20Report.pdf

  2. https://en.wikipedia.org/wiki/2015_Indian_heat_wave

  3. http://www.theguardian.com/world/2015/jan/09/boko-haram-deadliest-massacre-baga-nigeria