In this recitation, we will: (1) identify nodes in the source code of a website, (2) create a function to scrape the content of that website, (3) create a list of URLs of web pages to scrape and (4) export this content to a .csv file.
Today, we will be scraping temperature data from Berkeley Earth, an independent non-profit focused on temperature data analysis for climate science. The website contains data on regional warming across the globe since 1960. It is a good website for introducing students to scraping because (1) it is a relatively straightforward search engine and (2) it has some of the significant flaws that one is likely to encounter when conducting web scraping.
Here is what this website (http://berkeleyearth.lbl.gov/city-list/) looks like:
You’ll notice that this webpage allows you to search by city, with cities grouped according to their first letter. In total, there are 3523 cities we can access within this dataset.
Let’s say we are interested in scraping the name of each city, the country of that city, the average temperature increase in Celsius for that city and the hyperlink to the city-specific page. Our future .csv file should look like this:
| City | Country | Temperature | Link |
|---|---|---|---|
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
Please use the SelectorGadget extension for Google Chrome to identify the CSS selectors for the nodes. Then copy-paste these selectors into the appropriate sections of the code below. I’ll show you exactly how to do this, but ultimately, here is what your R script should look like.
# Helper function for collapsing multi-line output
collapse_to_text <- function(x){
  p <- html_text(x, trim = TRUE) # Extract the text of each node, trimming whitespace
  p <- p[p != ""] # Drop empty lines
  paste(p, collapse = "\n") # Collapse everything into a single newline-separated string
}
# Identifying the nodes
scraping <- function(x){
  tibble(city = html_nodes(x, "td:nth-child(1) a") %>% collapse_to_text, # Scraping the city name
         country = html_nodes(x, "td+ td a") %>% collapse_to_text, # Scraping the country name
         temperature = html_nodes(x, "td~ td+ td") %>% collapse_to_text, # Scraping the temperature
         link = str_trim(html_attr(html_nodes(x, "td:nth-child(1) a"), "href"))) # Scraping the URL behind the city name node
}
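Before wiring everything together, it can be reassuring to try scraping() on a single page. A minimal sanity check might look like the sketch below (test_page and test_result are just illustrative names; the packages it relies on, namely rvest, the tidyverse and stringr, are attached a few steps further down).
# Optional sanity check of scraping() on one page (requires rvest, tidyverse and stringr)
test_page <- read_html("http://berkeleyearth.lbl.gov/city-list/A") # The letter-A listing
test_result <- scraping(test_page)
dim(test_result) # Number of rows and columns returned for a single letter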
We now have the function that will allow us to scrape this website. But before using it, we will need to know which URLs to scrape in the first place.
The first step always consists of looking at how our website of interest structures its hyperlinks. Here, the initial URL of the webpage looks like this:
http://berkeleyearth.lbl.gov/city-list/
However, this only shows us the largest cities within this dataset, and we would like to scrape all of them. If we click on the first letter of the alphabetical listing, here is the URL we get:
http://berkeleyearth.lbl.gov/city-list/A
We can see that the letter following the initial URL represents the link to access all cities starting with this letter. (Please note that not all websites are as straightforward!)
url_base <- "http://berkeleyearth.lbl.gov/city-list/" # This is the base URL for this webpage
letter <- LETTERS[seq(from = 1, to = 26)] # Creating a sequence of letters for the entire alphabet
urls <- paste(url_base, letter, sep = "") # Pasting both together to reproduce the URL format of each alphabetical section
urls[1:5] # Viewing our first five URLs
## [1] "http://berkeleyearth.lbl.gov/city-list/A"
## [2] "http://berkeleyearth.lbl.gov/city-list/B"
## [3] "http://berkeleyearth.lbl.gov/city-list/C"
## [4] "http://berkeleyearth.lbl.gov/city-list/D"
## [5] "http://berkeleyearth.lbl.gov/city-list/E"
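As an aside, paste0() is shorthand for paste(..., sep = ""), so the same vector of URLs could also be built as follows:
urls <- paste0(url_base, letter) # Equivalent to paste(url_base, letter, sep = "")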
We now have all the information we need to scrape the results of all alphabetical lists of that website using the scraping() function we created earlier.
library(rvest) # Loading the rvest package
library(tidyverse) # Loading the tidyverse package
library(stringr) # Loading the stringr package
myResults <- urls %>% # Saving the scraping results in a tibble named "myResults"
  map(read_html) %>% # Applying the "read_html" function to each element and returning a list of the same length
  map_df(scraping) # Applying the "scraping" function to each element and returning a single data frame
# Please note that the %>% operator forwards a value (or the result of an expression) into the next function
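As a toy illustration of the pipe (unrelated to the scraping itself), the line below is equivalent to head(LETTERS, 3):
LETTERS %>% head(3) # The pipe forwards LETTERS as the first argument of head()
## [1] "A" "B" "C"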
Let’s look at our results!
head(myResults,5) # Displaying the first 5 rows of our dataset
## # A tibble: 5 x 4
## city country temperature link
## <chr> <chr> <chr> <chr>
## 1 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 2 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 3 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 4 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 5 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
myResults[1:5,4] # Viewing the content of the "link" vector
## # A tibble: 5 x 1
## link
## <chr>
## 1 http://berkeleyearth.lbl.gov/locations/42.59N-8.73W
## 2 http://berkeleyearth.lbl.gov/locations/50.63N-6.34E
## 3 http://berkeleyearth.lbl.gov/locations/57.05N-10.33E
## 4 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 5 http://berkeleyearth.lbl.gov/locations/29.74N-48.00E
Well, that doesn’t look too good! Clearly, there are a few problems here: while the hyperlink node worked perfectly, the city name, country name and temperature columns each contain all of the values for a given letter collapsed into one string, and that string is repeated on every row. This is problematic, but we saw how to perform text manipulation last week, so we are perfectly equipped to solve this problem.
Links <- data.frame(myResults$link) # Extracting the sole correct vector
dim(myResults)[1] # This is the correct number of cities, but every cell of the city column contains ALL city names
## [1] 3523
# for that letter (the same goes for the country name and the temperature value)
myResults$city[1] # Here's the proof
## [1] "A Coruña\nAachen\nAalborg\nAba\nAbadan\nAbakaliki\nAbakan\nAbbotsford\nAbengourou\nAbeokuta\nAberdeen\nAbha\nAbidjan\nAbiko\nAbilene\nAbohar\nAbomey-Calavi\nAbu Dhabi\nAbuja\nAcapulco\nAcarigua\nAccra\nAchalpur\nAcheng\nAchinsk\nAcuña\nAdana\nAddis Abeba\nAdelaide\nAden\nAdilabad\nAdiwerna\nAdoni\nAfyonkarahisar\nAgadir\nAgartala\nAgboville\nAgeo\nAgra\nAguascalientes\nAhmadabad\nAhmadnagar\nAhmadpur East\nAhvaz\nAix-en-Provence\nAizawl\nAjdabiya\nAjmer\nAkashi\nAkishima\nAkita\nAkola\nAkron\nAksaray\nAksu\nAktau\nAkure\nAkyab\nAlagoinhas\nAlandur\nAlanya\nAlappuzha\nAlbacete\nAlberton\nAlbuquerque\nAlbury\nAlcalá de Henares\nAlcobendas\nAlcorcón\nAleppo\nAlexandria\nAlexandria\nAlgeciras\nAlgiers\nAligarh\nAllahabad\nAllentown\nAlmaty\nAlmere\nAlmería\nAlmetyevsk\nAlor Setar\nAltay\nAlwar\nAmadora\nAmagasaki\nAmaigbo\nAmarillo\nAmbala\nAmbarnath\nAmbato\nAmbattur\nAmbon\nAmbur\nAmericana\nAmersfoort\nAmiens\nAmol\nAmravati\nAmritsar\nAmroha\nAmsterdam\nAnaco\nAnaheim\nAnand\nAnanindeua\nAnantapur\nAnápolis\nAnbu\nAnchorage\nAncona\nAnda\nAndijon\nAngarsk\nAngeles\nAngers\nAngra dos Reis\nAngren\nAnjo\nAnkang\nAnkara\nAnn Arbor\nAnqing\nAnqiu\nAnshan\nAnshun\nAntakya\nAntalya\nAntananarivo\nAntioch\nAntipolo\nAntofagasta\nAntsirabe\nAntwerp\nAnyama\nAnyang\nApeldoorn\nApodaca\nApopa\nApucarana\nAqtöbe\nAra\nAracaju\nAraçatuba\nArad\nAraguaína\nArak\nArapiraca\nAraraquara\nAraras\nAraruama\nAraucária\nArdabil\nArequipa\nÅrhus\nArica\nArjawinangun\nArkhangelsk\nArlington\nArlington\nArmavir\nArmenia\nArnhem\nArusha\nArvada\nAryanah\nArzamas\nAsahikawa\nAsaka\nAsansol\nAsfi\nAsgabat\nAshdod\nAshdod\nAshikaga\nAshqelon\nAsmara\nAstana\nAstanajapura\nAstrakhan\nAsunción\nAswan\nAsyut\nAthens\nAtibaia\nAtlanta\nAtsugi\nAtyrau\nAuckland\nAugsburg\nAurangabad\nAurora\nAurora\nAustin\nAvadi\nAwassa\nAwka\nAyacucho\nAyer Itam\nAzamgarh\nAzare"
# What we'll need to do is erase all these duplicates for each letter,
# and rebuild our dataset from scratch with only 26 rows encompassing all city names, country names and temperature values
myResults$link <- NULL # Erasing the "link" vector (this is necessary to remove the duplicates: since every URL in the "link" vector is different, no rows would be flagged as duplicates if we left it in the data frame)
Data_NoDuplicate <- unique(myResults) # Creating a new dataset without these duplicate rows. We now have only 26 observations, one per letter of the alphabet: each row contains ALL the city names, country names and temperature values for cities starting with that letter.
head(Data_NoDuplicate,3) # Looking at results for letters A,B and C
## # A tibble: 3 x 3
## city country temperature
## <chr> <chr> <chr>
## 1 "A Coruña\nAachen\n… "Spain\nGermany\nDen… "2.37 ± 0.31\n2.83 ± 0.22\n2…
## 2 "Babakan\nBabol\nBa… "Indonesia\nIran\nVi… "1.50 ± 0.30\n3.17 ± 0.47\n1…
## 3 "CÃ Mau\nCabanatua… "Vietnam\nPhilippine… "1.14 ± 0.27\n0.91 ± 0.29\n1…
# The next step consists of separating each of these observations into multiple rows
City <- data.frame(city=unlist(strsplit(as.character(Data_NoDuplicate$city),"\n"))) # City
Country <- data.frame(country=unlist(strsplit(as.character(Data_NoDuplicate$country),"\n"))) # Country
Temperature <- data.frame(temperature=unlist(strsplit(as.character(Data_NoDuplicate$temperature),"\n"))) # Temperature
Output <- cbind(City, Country, Temperature, Links) # Merging these data frames to create a final dataset called "Output"
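# Side note: tidyr (attached with the tidyverse) can also do this reshaping in one step
# with separate_rows(); this sketch should yield the same table, assuming the city,
# country and temperature cells split into the same number of pieces in every row
Output_alt <- Data_NoDuplicate %>%
  separate_rows(city, country, temperature, sep = "\n") %>% # One row per "\n"-separated value
  bind_cols(Links) # Re-attaching the links (assumes the row order and row count match)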
head(Output, 10) # Displaying the first 10 rows of our dataset
## city country temperature
## 1 A Coruña Spain 2.37 ± 0.31
## 2 Aachen Germany 2.83 ± 0.22
## 3 Aalborg Denmark 2.84 ± 0.23
## 4 Aba Nigeria 1.55 ± 0.44
## 5 Abadan Iran 2.89 ± 0.64
## 6 Abakaliki Nigeria 1.55 ± 0.44
## 7 Abakan Russia 2.74 ± 0.19
## 8 Abbotsford Canada 1.62 ± 0.23
## 9 Abengourou Côte d'Ivoire 1.76 ± 0.31
## 10 Abeokuta Nigeria 1.74 ± 0.46
## myResults.link
## 1 http://berkeleyearth.lbl.gov/locations/42.59N-8.73W
## 2 http://berkeleyearth.lbl.gov/locations/50.63N-6.34E
## 3 http://berkeleyearth.lbl.gov/locations/57.05N-10.33E
## 4 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 5 http://berkeleyearth.lbl.gov/locations/29.74N-48.00E
## 6 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 7 http://berkeleyearth.lbl.gov/locations/53.84N-91.36E
## 8 http://berkeleyearth.lbl.gov/locations/49.03N-122.45W
## 9 http://berkeleyearth.lbl.gov/locations/7.23N-4.05W
## 10 http://berkeleyearth.lbl.gov/locations/7.23N-4.05E
This looks much better now. The last step consists of saving this data frame to a .csv file.
write.csv(Output, file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/myFile.csv") # Saving the data frame to a .csv file named "myFile.csv"
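Note that the path above is specific to the author’s machine. With a relative path the file is written to your current working directory, and write.csv() also accepts row.names = FALSE if you don’t want R’s row numbers in the file:
write.csv(Output, file = "myFile.csv", row.names = FALSE) # Relative path, without the row-number column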
Now that we have all this information, let’s find the city with the biggest temperature change since 1960. Again, we can do this easily using text manipulation functions.
Output$t_change <- NA # Creating an empty column (filled in the next line)
Output$t_change <- str_sub(Output$temperature, 1, 3) # Creating a vector for the temperature change (keeping the characters in positions 1 through 3 of each temperature string)
Output[order(Output$t_change, decreasing = T)[1:10], c(1,2,5)] # What are the names of the cities (and associated countries) with the 10 biggest changes?
## city country t_change
## 2182 Norilsk Russia 4.3
## 390 Birjand Iran 4.0
## 1538 Kerman Iran 4.0
## 2547 Rafsanjan Iran 3.9
## 2891 Sirjan Iran 3.9
## 3431 Yazd Iran 3.9
## 1422 Jiroft Iran 3.8
## 1915 Marv Dasht Iran 3.8
## 1359 Jahrom Iran 3.7
## 1918 Mashhad Iran 3.7
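Because str_sub() keeps only the first three characters and the sort then happens on text rather than on numbers, a slightly more robust variant is to extract the full value before the ± sign and convert it to numeric. This is just a sketch, and t_change_num is an illustrative column name:
Output$t_change_num <- as.numeric(str_extract(Output$temperature, "^-?[0-9.]+")) # Numeric value before the ± sign
Output[order(Output$t_change_num, decreasing = TRUE)[1:10], c("city", "country", "t_change_num")] # Top 10, sorted numerically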
Scrape a website of your choosing, but please choose one that doesn’t explicitly ban scraping, and avoid small, low-traffic websites. Select simple webpage designs (older websites are perfect for practicing these skills) and try to avoid interactive webpages (these are much more difficult to scrape!).
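If you do scrape a site with many pages, one simple courtesy is to pause between requests so you don’t overload the server. A minimal sketch, where polite_read is just an illustrative wrapper around rvest’s read_html():
# Illustrative helper: wait a moment, then fetch the page (assumes rvest is loaded)
polite_read <- function(url, delay = 1){
  Sys.sleep(delay) # Delay, in seconds, before each request
  read_html(url)
}
# In the pipeline above, map(read_html) could then be replaced by map(polite_read)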