This is a quick applied introduction to web scraping in R prepared for consultations at NYU Data Services. We will be accomplishing two primary tasks. First, we are going to scrape abstracts from the Journal of Peace Research. Second, we will extract some elements of the Times of India archive. In another post we will take a look at analyzing and wrangling this text data.

If you are just interested in the script, you can download it by clicking here. There is also a useful web scraping cheat sheet available here.

To get started, let’s install and load some packages that we will be using.

# Install any packages that are missing, then load them all
packages <- c("rvest","stringr","tidyverse","pbapply","parallel","lubridate")
for(i in packages){
  if(!require(i, character.only = TRUE, quietly = TRUE)){
    install.packages(i)
    library(i, character.only = TRUE, quietly = TRUE)
  }
}

Scraping a Journal

The first thing that we are going to do is identify a page that we want to scrape information from. It is usually a good idea in practice to check the terms of service for the website you are using to make sure that they are alright with automated extraction of data. For example, Sage journals are totally cool with it.
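
Another thing worth a quick look is the site’s robots.txt file, which most websites publish at a standard location and which spells out what automated crawlers are allowed to fetch. For example (the /robots.txt path is a general convention, not something specific to Sage):

browseURL("https://journals.sagepub.com/robots.txt")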

For the first example we are going to be taking a look at the Journal of Peace Research. For simplicity we are going to be looking at a recent Special Issue on Electoral Violence.

issue <- "https://journals.sagepub.com/toc/jpra/57/1"
browseURL(issue)

You should take a moment to look at this page in detail; if you are using Chrome, right click on the page and select “view page source” to take a look at the HTML we will be working with.

What we are going to do is use this issue page to extract the Title, Author, and Abstract for each article and turn it into a dataset.

The first step in this process is to read the html into R and take a look at it.

html <- read_html(issue)
html
## {html_document}
## <html lang="en" class="pb-page" data-request-id="20388a1e-e16d-427f-b009-6f53dada4ee2">
## [1] <head data-pb-dropzone="head">\n<meta http-equiv="Content-Type" content=" ...
## [2] <body class="pb-ui">\n<div class="totoplink">\n<a id="skiptocontent" clas ...

Success! But now how do we work with this mess? There are two basic approaches. The first is to install the SelectorGadget Chrome Extension for CSS selection. This will allow you to click on fields of the page and get their tags. The alternative is to work within the page source itself and extract information by identifying the appropriate XML tags. In my experience the first is simpler and easier when you are first starting, but the second can be helpful in grabbing exactly what you want. Being able to look at the page source is going to be an important task, as we will see in the second example.
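
To make the two approaches concrete, here is a small sketch (using the html object we just read in) that pulls the same nodes once with a CSS selector and once with the equivalent XPath:

html %>% html_nodes("a") %>% head(3)           # CSS selector
html %>% html_nodes(xpath = "//a") %>% head(3) # equivalent XPath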

Alright, the first thing that we are going to do is harvest all the links on the page and then keep only those that contain a DOI, indicating that they lead to an article. To scrape a website effectively you need to know how it works and identify any patterns in how it is put together. Here is what the code looks like:

links <- html %>% html_nodes("a") %>% html_attr("href") 
abs <- unique(links %>% grep("doi/abs",.) %>% links[.] %>% paste0("https://journals.sagepub.com",.))
head(abs)
## [1] "https://journals.sagepub.com/doi/abs/10.1177/0022343319889657"
## [2] "https://journals.sagepub.com/doi/abs/10.1177/0022343319885166"
## [3] "https://journals.sagepub.com/doi/abs/10.1177/0022343319890383"
## [4] "https://journals.sagepub.com/doi/abs/10.1177/0022343319884998"
## [5] "https://journals.sagepub.com/doi/abs/10.1177/0022343319886000"
## [6] "https://journals.sagepub.com/doi/abs/10.1177/0022343319892677"

What we did was create a new object called links by taking the html code of the target page, selecting the “a” nodes, and extracting their “href” attributes. This produces a lot of junk that we don’t need since it grabbed every link on the page, while all we want are those links that lead to articles. In the second step we subset down to the DOIs of interest and paste them to form proper URLs targeting the pages we want.
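
Since stringr is already loaded, a slightly more readable way to do the same subsetting, equivalent to the pipeline above, is:

abs <- links %>% str_subset("doi/abs") %>% unique() %>% paste0("https://journals.sagepub.com",.)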

Excellent! Now what? When you are working on a task that you will eventually loop over or run in parallel, it is often useful to write working code for one instance and then build it into a function. Take a look at the following chunk of code:

article <- abs[1]
a_html <- read_html(article)

title <- a_html %>% html_node("h1") %>% html_text() %>% str_replace_all(.,"\n","")
abstract <- a_html %>% html_node(".abstractInFull") %>% html_text()

tmp <- a_html %>% html_nodes(".entryAuthor") %>% html_text()
tmp <- unique(tmp[-c(grep("ORCID",tmp),grep("See all",tmp),grep("Search",tmp))])

authors <- paste(tmp,collapse = ";")

out <- c(abs[1],title,authors,abstract)
tbl_df(out)
## # A tibble: 4 x 1
##   value                                                                         
##   <chr>                                                                         
## 1 https://journals.sagepub.com/doi/abs/10.1177/0022343319889657                 
## 2 Electoral violence: An introduction                                           
## 3 " Sarah Birch; Ursula Daxecker; Kristine Höglund"                             
## 4 Elections are held in nearly all countries in the contemporary world. Yet des~

The first thing we do is take the first link identified above and make it an object to work with. We then read in the html as before and extract the title, abstract, and author names from the article page. Take a moment to either use SelectorGadget to replicate these results or look at the page source to see how we found these elements. After we have them, we combine them into an object and print it to check the results: we have everything that we want.

An alternative approach is to use XPath expressions to target tags directly in the page source. These would look like the following:

a_html %>% html_nodes(xpath = '//meta[@property="article:author"]') %>% html_attr('content')
## [1] "Sarah Birch, Ursula Daxecker, Kristine Höglund"
a_html %>% html_nodes(xpath = '//a[@class="doiWidgetLink"]') %>% html_attr('href')
## [1] "https://doi.org/10.1177%2F0022343319889657"
a_html %>% html_nodes(xpath = '//div[@class="abstractSection abstractInFull"]') %>% html_text()
## [1] "Elections are held in nearly all countries in the contemporary world. Yet despite their aim of allowing for peaceful transfers of power, elections held outside of consolidated democracies are often accompanied by substantial violence. This special issue introduction article establishes electoral violence as a subtype of political violence with distinct analytical and empirical dynamics. We highlight how electoral violence is distinct from other types of organized violence, but also how it is qualitatively different from nonviolent electoral manipulation. The article then surveys what we have learned about the causes and consequences of electoral violence, identifies important research gaps in the literature, and proceeds to discuss the articles included in the special issue. The contributions advance research in four domains: the micro-level targeting and consequences of electoral violence, the institutional foundations of electoral violence, the conditions leading to high-stakes elections, and electoral violence in the context of other forms of organized violence. The individual articles are methodologically and geographically diverse, encompassing ethnography, survey vignette and list experiments and survey data, quantitative analyses of subnational and crossnational event data, and spanning Africa, Latin America, and Asia."
a_html %>% html_nodes(xpath = '//meta[@name="dc.Title"]') %>% html_attr('content')
## [1] "Electoral violence: An introduction: "

Take a look at the page source and see where we got each of these tags.

Now that we know how to identify all the elements of interest, we can write a function to automate the process. Here is one approach:

grab_info <- function(article_link){
  a_html <- read_html(article_link)
  
  title <- a_html %>% html_node("h1") %>% html_text() %>% str_replace_all(.,"\n","")
  abstract <- a_html %>% html_node(".abstractInFull") %>% html_text()
  
  tmp <- a_html %>% html_nodes(".entryAuthor") %>% html_text()
  tmp <- unique(tmp[-c(grep("ORCID",tmp),grep("See all",tmp),grep("Search",tmp))])
  
  authors <- paste(tmp,collapse = ";")
  
  out <- c(article_link,title,authors,abstract)
  out
}

tbl_df(grab_info(abs[2]))
## # A tibble: 4 x 1
##   value                                                                         
##   <chr>                                                                         
## 1 https://journals.sagepub.com/doi/abs/10.1177/0022343319885166                 
## 2 Dangerously informed: Voter information and pre-electoral violence in Africa  
## 3 " Inken von Borzyskowski; Patrick M Kuhn;Roxana Gutiérrez-Romero;Sarah Birch;~
## 4 A considerable literature examines the effect of voter information on candida~
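
One thing to keep in mind before scaling up: it is polite to pace your requests so you do not hammer the server. A minimal sketch of a throttled wrapper (grab_info_politely is just a hypothetical name):

grab_info_politely <- function(article_link){
  Sys.sleep(1)  # pause for one second between requests
  grab_info(article_link)
}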

Now let’s run it in parallel and make ourselves a dataset:

cl <- makeCluster(detectCores())
clusterExport(cl,varlist = c("grab_info"))
clusterEvalQ(cl,library(rvest))
clusterEvalQ(cl,library(stringr))

info <- as.data.frame(t(pbsapply(abs,grab_info,cl=cl)))

rownames(info) <- NULL
colnames(info) <- c("Link","Title","Authors","Abstract")
tbl_df(info)
## # A tibble: 14 x 4
##    Link             Title               Authors             Abstract            
##    <fct>            <fct>               <fct>               <fct>               
##  1 https://journal~ Electoral violence~ " Sarah Birch; Urs~ Elections are held ~
##  2 https://journal~ Dangerously inform~ " Inken von Borzys~ A considerable lite~
##  3 https://journal~ Raising the stakes~ " Kathleen Klaus;J~ How does large-scal~
##  4 https://journal~ Carrots and sticks~ " Ezequiel Gonzale~ How do parties targ~
##  5 https://journal~ Who dissents? Self~ " Lauren E Young;R~ Reactions to acts o~
##  6 https://journal~ Does electoral vio~ " Roxana Gutiérrez~ Across many new dem~
##  7 https://journal~ Pre-election viole~ " Michael Wahman; ~ Cross-national rese~
##  8 https://journal~ Electoral violence~ " Johan Brosché; H~ Why do the first mu~
##  9 https://journal~ The effect of alte~ " Rubén Ruiz-Rufin~ There is as yet lit~
## 10 https://journal~ Political party st~ " Hanne Fjelde;Urs~ Existing research o~
## 11 https://journal~ Unequal votes, une~ " Ursula Daxecker;~ Elections held outs~
## 12 https://journal~ Patterned pogroms:~ " Ward Berenschot;~ The regular occurre~
## 13 https://journal~ Restrained or cons~ " Jana Krause;Sara~ Anecdotal evidence ~
## 14 https://journal~ Mitigating electio~ " Hannah Smidt;Jan~ False information, ~

Super. One could easily extend the above code to scrape every issue of the journal, but I’ll leave that as an exercise.
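
If you do want to try it, one hypothetical starting point is to build issue URLs the same way we built article URLs, assuming the /toc/jpra/volume/issue pattern in the URL above holds for other issues (check how many issues each volume actually has before looping):

# Hypothetical sketch: issue pages for volume 57, assuming six issues
issue_urls <- paste0("https://journals.sagepub.com/toc/jpra/57/", 1:6)

Each of those pages could then be fed through the same link-harvesting and grab_info steps as above.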

Times of India

The next example that we are going to look at is scraping the Times of India news archive from 2008 through 2019. Take a moment to click around the website to get a feel for how it works.

Based on the first example, we might think that we can simply start from the landing page, harvest all the links, navigate to the articles, and collect what we want. Unfortunately, the page uses JavaScript to generate links on the fly, which makes this approach a bit more challenging. Try replicating the above approach in this setting and see if you can find the issue in the page source.

What we are going to do instead is work around the problem by noting how regular the links to daily articles are.

browseURL("https://timesofindia.indiatimes.com/2020/1/1/archivelist/year-2020,month-1,starttime-43831.cms")

The above link will take you to all articles published on January 1st 2020. What you should note is that the links follow this format:

https://timesofindia.indiatimes.com/Y/m/d/archivelist/year-Y,month-m,starttime-countofdays.cms

Take a look at the following code:

date_vec <- format(seq(as.Date("2008/01/01"),as.Date("2019/12/31"),by="days"),"%Y/%m/%d")
day_count <- 39448:43830

main <- "https://timesofindia.indiatimes.com/"
ext <- paste0(date_vec,"/archivelist/year-",year(date_vec),",month-",month(date_vec),",starttime-",day_count,".cms")

browseURL(paste0(main,ext[1]))
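
Where does the hard-coded range 39448:43830 above come from? The starttime value appears to be a running day count: the endpoints line up with the number of days since 1899-12-30, the spreadsheet-style date origin. Under that assumption you can compute the count rather than hard-code it:

day_count <- as.numeric(seq(as.Date("2008/01/01"),as.Date("2019/12/31"),by="days") - as.Date("1899/12/30"))
range(day_count)  # 39448 and 43830, matching the hard-coded values above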

What we are going to do is simply paste together working URLs that lead us to pages where the earlier strategy of harvesting links to scrapeable content applies. Suppose that we want to grab the link, headline, and description for each story. Here is the basic approach that we will apply:

url <- paste0(main,ext[1])
html <- read_html(url)

headlines <- html %>% html_nodes(".cnt div td span a") %>% html_text()
tbl_df(headlines)
## # A tibble: 193 x 1
##    value                                  
##    <chr>                                  
##  1 2008: For tranquility in troubled times
##  2 Christmas spirit                       
##  3 Strings attached!                      
##  4 Blink and you miss him                 
##  5 Blink and you miss him                 
##  6 Money matters                          
##  7 Dimple makes a comeback                
##  8 Vishal not a victim of over exposure   
##  9 Spice Girls left speechless            
## 10 Rakshanda can’t be on a reality show!  
## # ... with 183 more rows
story_links <- html %>% html_nodes(".cnt div td span a") %>% html_attr("href")
tbl_df(story_links)
## # A tibble: 193 x 1
##    value                                                                        
##    <chr>                                                                        
##  1 http://timesofindia.indiatimes.com//india/2008-For-tranquility-in-troubled-t~
##  2 http://timesofindia.indiatimes.com//entertainment/events/bangalore/Christmas~
##  3 http://timesofindia.indiatimes.com//entertainment/events/mumbai/Strings-atta~
##  4 http://timesofindia.indiatimes.com//entertainment/events/hyderabad/Blink-and~
##  5 http://timesofindia.indiatimes.com//entertainment/events/Blink-and-you-miss-~
##  6 http://timesofindia.indiatimes.com//entertainment/hindi/bollywood/news/Money~
##  7 http://timesofindia.indiatimes.com//entertainment/hindi/bollywood/news/Dimpl~
##  8 http://timesofindia.indiatimes.com//tv/news/hindi/Vishal-not-a-victim-of-ove~
##  9 http://timesofindia.indiatimes.com//entertainment/english/hollywood/news/Spi~
## 10 http://timesofindia.indiatimes.com//tv/news/hindi/Rakshanda-cant-be-on-a-rea~
## # ... with 183 more rows

First we will paste together a URL leading to a particular day of articles. From there we can collect all headlines as well as their corresponding links. Once we have these links, we can pass them through a function to grab the story descriptions as follows:

story_desc <- function(link){
  read_html(link) %>% 
    html_nodes(xpath = '//meta[@name="description"]') %>% 
    html_attr('content')
}

story_desc(story_links[1])
## [1] "India News: The optimism, as they say, is in being here, in being alive."

Great. One way to proceed is to now build a function to collect everything that we want for a particular day.

one_day <- function(link_to_day,date){
  html <- read_html(link_to_day)
  headlines <- html %>% html_nodes(".cnt div td span a") %>% html_text()
  story_links <- html %>% html_nodes(".cnt div td span a") %>% html_attr("href")
  out <- data.frame(Date = rep(date,length(headlines)), 
                    Headline = headlines,
                    Link = story_links,
                    stringsAsFactors = FALSE)
  out
}
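
As a quick sanity check of the function (this just re-scrapes the day we already looked at):

# Should give one row per headline (193 in the output above) and three columns
dim(one_day(paste0(main,ext[1]),date_vec[1]))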

In this particular case, grabbing the descriptions is time-intensive, so we will add those on later. Occasionally you will run into an error caused by being redirected to an invalid page; this is just a feature of the website. We can get around that by writing a simple error checker.

pass_errors <- function(fun){
  # Evaluate the expression; if it errors (for example, on a dead redirect),
  # return the error message as a string instead of stopping the whole run
  tryCatch(fun, error = function(e) paste0("ERROR: ", conditionMessage(e)))
}
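
A toy example of what it does (not part of the scrape): a call that errors now comes back as a labelled string instead of halting.

pass_errors(stop("redirected to an invalid page"))  # returns "ERROR: redirected to an invalid page"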

Now we can put everything together and make our dataset. For simplicity we will only do the first five days and first 50 links of those five days, but you can easily modify the code to grab all of the information.

urls <- paste0(main,ext)

clust <- makeCluster(detectCores())
clusterExport(clust,varlist = c("pass_errors","story_desc"))
clusterEvalQ(clust,library(rvest))
five_days <- pblapply(1:5,function(x)one_day(urls[x],date_vec[x]))
five_days <- do.call("rbind",five_days)

tmp <- unlist(pblapply(five_days$Link[1:50],function(x)pass_errors(story_desc(x)),cl=clust))
five_days[1:50,"Description"] <- tmp

tbl_df(five_days)
## # A tibble: 1,187 x 4
##    Date    Headline         Link                    Description                 
##    <chr>   <chr>            <chr>                   <chr>                       
##  1 2008/0~ 2008: For tranq~ http://timesofindia.in~ India News: The optimism, a~
##  2 2008/0~ Christmas spirit http://timesofindia.in~ The Christmas spirit kicked~
##  3 2008/0~ Strings attache~ http://timesofindia.in~ "The long wait is over and ~
##  4 2008/0~ Blink and you m~ http://timesofindia.in~ Narain Karthikeyan’s visit ~
##  5 2008/0~ Blink and you m~ http://timesofindia.in~ Narain Karthikeyan’s visit ~
##  6 2008/0~ Money matters    http://timesofindia.in~ "After actress Ileana baggi~
##  7 2008/0~ Dimple makes a ~ http://timesofindia.in~ Actress Dimple Kapadia is m~
##  8 2008/0~ Vishal not a vi~ http://timesofindia.in~ Vishal Singh says he’s been~
##  9 2008/0~ Spice Girls lef~ http://timesofindia.in~ Poor mobile signals at Lond~
## 10 2008/0~ Rakshanda can’t~ http://timesofindia.in~ She’d rather host the show,~
## # ... with 1,177 more rows
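
Since we spun up worker clusters along the way, it is good practice to shut them down once the scraping is finished:

stopCluster(cl)
stopCluster(clust)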

Great, so now we know how to harvest text from websites and create a dataset out of it. From here you might want to start analyzing the data. A great place to start is the free Tidy Text Mining Book, but I’ll be writing up a walkthrough for some common tasks in the near future.