In this recitation, we will: (1) identify nodes in the source code of a website, (2) create a function to scrape the content of that website, (3) create a list of URLs of web pages to scrape and (4) export this content to a .csv file.
Today, we will be scraping temperature data from Berkeley Earth, an independent non-profit focused on temperature data analysis for climate science. The website contains data on regional warming across the globe since 1960. It is a good website for introducing students to scraping because (1) it is a relatively straightforward search engine and (2) it has some of the significant flaws that one is likely to encounter when conducting web scraping.
Here is what this website (http://berkeleyearth.lbl.gov/city-list/) looks like:
You’ll notice that this webpage allows you to search by city, with cities grouped according to their first letter. In total, there are 3523 cities we can access within this dataset.
Let’s say we are interested in scraping the name of each city, the country of that city, the average temperature increase in Celsius for that city and the hyperlink to the city-specific page. Our future .csv file should look like this:
| City | Country | Temperature | Link |
|---|---|---|---|
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
| … | … | … | … |
Please use the SelectorGadget extension for Google Chrome to identify the CSS selectors for the nodes. Then copy-paste these selectors into the appropriate sections of the code below. I’ll show you exactly how to do this, but ultimately, here is what your R script should look like.
# Helper function for collapsing multi-line output
collapse_to_text <- function(x){
  p <- html_text(x, trim = TRUE) # Extract the text of each node, trimming whitespace
  p <- p[p != ""] # Drop empty lines
  paste(p, collapse = "\n") # Collapse everything into a single newline-separated string
}
# Identifying the nodes
scraping <- function(x){
  tibble(city = html_nodes(x, "td:nth-child(1) a") %>% collapse_to_text, # Scraping the city name
         country = html_nodes(x, "td+ td a") %>% collapse_to_text, # Scraping the country name
         temperature = html_nodes(x, "td~ td+ td") %>% collapse_to_text, # Scraping the temperature
         link = str_trim(html_attr(html_nodes(x, "td:nth-child(1) a"), "href"))) # Scraping the URL behind the city name node
}
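Before wiring everything together, it can be reassuring to try scraping() on a single page. A minimal sanity check might look like the sketch below (test_page and test_result are just illustrative names; the packages it relies on, namely rvest, the tidyverse and stringr, are attached a few steps further down).
# Optional sanity check of scraping() on one page (requires rvest, tidyverse and stringr)
test_page <- read_html("http://berkeleyearth.lbl.gov/city-list/A") # The letter-A listing
test_result <- scraping(test_page)
dim(test_result) # Number of rows and columns returned for a single letter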
We now have the function that will allow us to scrape this website. But before using it, we will need to know which URLs to scrape in the first place.
The first step always consists of looking at how our website of interest structures its hyperlinks. Here, the initial URL of the webpage looks like this:
http://berkeleyearth.lbl.gov/city-list/
However, this only shows us the largest cities within this dataset, and we would like to scrape all of them. If we click on the first letter of the alphabetical listing, here is the URL we get:
http://berkeleyearth.lbl.gov/city-list/A
We can see that the letter following the initial URL represents the link to access all cities starting with this letter. (Please note that not all websites are as straightforward!)
url_base <- "http://berkeleyearth.lbl.gov/city-list/" # This is the base URL for this webpage
letter <- LETTERS[seq(from = 1, to = 26)] # Creating a sequence of letters for the entire alphabet
urls <- paste(url_base, letter, sep = "") # Pasting both together to reproduce the URL format of each alphabetical section
urls[1:5] # Viewing our first five URLs
## [1] "http://berkeleyearth.lbl.gov/city-list/A"
## [2] "http://berkeleyearth.lbl.gov/city-list/B"
## [3] "http://berkeleyearth.lbl.gov/city-list/C"
## [4] "http://berkeleyearth.lbl.gov/city-list/D"
## [5] "http://berkeleyearth.lbl.gov/city-list/E"
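As an aside, paste0() is shorthand for paste(..., sep = ""), so the same vector of URLs could also be built as follows:
urls <- paste0(url_base, letter) # Equivalent to paste(url_base, letter, sep = "")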
We now have all the information we need to scrape the results of all alphabetical lists of that website using the scraping() function we created earlier.
library(rvest) # Loading the rvest package
library(tidyverse) # Loading the tidyverse package
library(stringr) # Loading the stringr package
myResults <- urls %>% # Saving the scraping results in a tibble named "myResults"
  map(read_html) %>% # Applying the "read_html" function to each element and returning a list of the same length
  map_df(scraping) # Applying the "scraping" function to each element and returning a single data frame
# Please note that the %>% operator forwards a value (or the result of an expression) into the next function
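As a toy illustration of the pipe (unrelated to the scraping itself), the line below is equivalent to head(LETTERS, 3):
LETTERS %>% head(3) # The pipe forwards LETTERS as the first argument of head()
## [1] "A" "B" "C"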
Let’s look at our results!
head(myResults,5) # Displaying the first 5 rows of our dataset
## # A tibble: 5 x 4
## city country temperature link
## <chr> <chr> <chr> <chr>
## 1 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 2 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 3 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 4 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
## 5 "A Coruña\nAachen… "Spain\nGermany\nD… "2.37 ± 0.31\n2.83 ± 0.22\… http…
myResults[1:5,4] # Viewing the content of the "link" vector
## # A tibble: 5 x 1
## link
## <chr>
## 1 http://berkeleyearth.lbl.gov/locations/42.59N-8.73W
## 2 http://berkeleyearth.lbl.gov/locations/50.63N-6.34E
## 3 http://berkeleyearth.lbl.gov/locations/57.05N-10.33E
## 4 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 5 http://berkeleyearth.lbl.gov/locations/29.74N-48.00E
Well, that doesn’t look too good! Clearly, there are a few problems here: while the hyperlink node worked perfectly, the city name, country name and temperature columns each contain all of the values for a given letter collapsed into one string, and that string is repeated on every row. This is problematic, but we saw how to perform text manipulation last week, so we are perfectly equipped to solve this problem.
Links <- data.frame(myResults$link) # Extracting the sole correct vector
dim(myResults)[1] # This is the correct number of cities, but every cell of the city column contains ALL city names
## [1] 3523
# for that letter (the same goes for the country name and the temperature value)
myResults$city[1] # Here's the proof
## [1] "A Coruña\nAachen\nAalborg\nAba\nAbadan\nAbakaliki\nAbakan\nAbbotsford\nAbengourou\nAbeokuta\nAberdeen\nAbha\nAbidjan\nAbiko\nAbilene\nAbohar\nAbomey-Calavi\nAbu Dhabi\nAbuja\nAcapulco\nAcarigua\nAccra\nAchalpur\nAcheng\nAchinsk\nAcuña\nAdana\nAddis Abeba\nAdelaide\nAden\nAdilabad\nAdiwerna\nAdoni\nAfyonkarahisar\nAgadir\nAgartala\nAgboville\nAgeo\nAgra\nAguascalientes\nAhmadabad\nAhmadnagar\nAhmadpur East\nAhvaz\nAix-en-Provence\nAizawl\nAjdabiya\nAjmer\nAkashi\nAkishima\nAkita\nAkola\nAkron\nAksaray\nAksu\nAktau\nAkure\nAkyab\nAlagoinhas\nAlandur\nAlanya\nAlappuzha\nAlbacete\nAlberton\nAlbuquerque\nAlbury\nAlcalá de Henares\nAlcobendas\nAlcorcón\nAleppo\nAlexandria\nAlexandria\nAlgeciras\nAlgiers\nAligarh\nAllahabad\nAllentown\nAlmaty\nAlmere\nAlmería\nAlmetyevsk\nAlor Setar\nAltay\nAlwar\nAmadora\nAmagasaki\nAmaigbo\nAmarillo\nAmbala\nAmbarnath\nAmbato\nAmbattur\nAmbon\nAmbur\nAmericana\nAmersfoort\nAmiens\nAmol\nAmravati\nAmritsar\nAmroha\nAmsterdam\nAnaco\nAnaheim\nAnand\nAnanindeua\nAnantapur\nAnápolis\nAnbu\nAnchorage\nAncona\nAnda\nAndijon\nAngarsk\nAngeles\nAngers\nAngra dos Reis\nAngren\nAnjo\nAnkang\nAnkara\nAnn Arbor\nAnqing\nAnqiu\nAnshan\nAnshun\nAntakya\nAntalya\nAntananarivo\nAntioch\nAntipolo\nAntofagasta\nAntsirabe\nAntwerp\nAnyama\nAnyang\nApeldoorn\nApodaca\nApopa\nApucarana\nAqtöbe\nAra\nAracaju\nAraçatuba\nArad\nAraguaína\nArak\nArapiraca\nAraraquara\nAraras\nAraruama\nAraucária\nArdabil\nArequipa\nÅrhus\nArica\nArjawinangun\nArkhangelsk\nArlington\nArlington\nArmavir\nArmenia\nArnhem\nArusha\nArvada\nAryanah\nArzamas\nAsahikawa\nAsaka\nAsansol\nAsfi\nAsgabat\nAshdod\nAshdod\nAshikaga\nAshqelon\nAsmara\nAstana\nAstanajapura\nAstrakhan\nAsunción\nAswan\nAsyut\nAthens\nAtibaia\nAtlanta\nAtsugi\nAtyrau\nAuckland\nAugsburg\nAurangabad\nAurora\nAurora\nAustin\nAvadi\nAwassa\nAwka\nAyacucho\nAyer Itam\nAzamgarh\nAzare"
# What we'll need to do is erase all these duplicates for each letter,
# and rebuild our dataset from scratch with only 26 rows encompassing all city names, country names and temperature values
myResults$link <- NULL # Erasing the "link" vector (this is necessary to remove the duplicates: since every URL in the "link" vector is different, no rows would be flagged as duplicates if we left it in the data frame)
Data_NoDuplicate <- unique(myResults) # Creating a new dataset without these duplicate rows. We now have only 26 observations, one per letter of the alphabet: each row contains ALL the city names, country names and temperature values for cities starting with that letter.
head(Data_NoDuplicate,3) # Looking at results for letters A,B and C
## # A tibble: 3 x 3
## city country temperature
## <chr> <chr> <chr>
## 1 "A Coruña\nAachen\n… "Spain\nGermany\nDen… "2.37 ± 0.31\n2.83 ± 0.22\n2…
## 2 "Babakan\nBabol\nBa… "Indonesia\nIran\nVi… "1.50 ± 0.30\n3.17 ± 0.47\n1…
## 3 "CÃ Mau\nCabanatua… "Vietnam\nPhilippine… "1.14 ± 0.27\n0.91 ± 0.29\n1…
# The next step consists of separating each of these observations into multiple rows
City <- data.frame(city=unlist(strsplit(as.character(Data_NoDuplicate$city),"\n"))) # City
Country <- data.frame(country=unlist(strsplit(as.character(Data_NoDuplicate$country),"\n"))) # Country
Temperature <- data.frame(temperature=unlist(strsplit(as.character(Data_NoDuplicate$temperature),"\n"))) # Temperature
Output <- cbind(City, Country, Temperature, Links) # Merging these data frames to create a final dataset called "Output"
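# Side note: tidyr (attached with the tidyverse) can also do this reshaping in one step
# with separate_rows(); this sketch should yield the same table, assuming the city,
# country and temperature cells split into the same number of pieces in every row
Output_alt <- Data_NoDuplicate %>%
  separate_rows(city, country, temperature, sep = "\n") %>% # One row per "\n"-separated value
  bind_cols(Links) # Re-attaching the links (assumes the row order and row count match)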
head(Output, 10) # Displaying the first 10 rows of our dataset
## city country temperature
## 1 A Coruña Spain 2.37 ± 0.31
## 2 Aachen Germany 2.83 ± 0.22
## 3 Aalborg Denmark 2.84 ± 0.23
## 4 Aba Nigeria 1.55 ± 0.44
## 5 Abadan Iran 2.89 ± 0.64
## 6 Abakaliki Nigeria 1.55 ± 0.44
## 7 Abakan Russia 2.74 ± 0.19
## 8 Abbotsford Canada 1.62 ± 0.23
## 9 Abengourou Côte d'Ivoire 1.76 ± 0.31
## 10 Abeokuta Nigeria 1.74 ± 0.46
## myResults.link
## 1 http://berkeleyearth.lbl.gov/locations/42.59N-8.73W
## 2 http://berkeleyearth.lbl.gov/locations/50.63N-6.34E
## 3 http://berkeleyearth.lbl.gov/locations/57.05N-10.33E
## 4 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 5 http://berkeleyearth.lbl.gov/locations/29.74N-48.00E
## 6 http://berkeleyearth.lbl.gov/locations/5.63N-8.07E
## 7 http://berkeleyearth.lbl.gov/locations/53.84N-91.36E
## 8 http://berkeleyearth.lbl.gov/locations/49.03N-122.45W
## 9 http://berkeleyearth.lbl.gov/locations/7.23N-4.05W
## 10 http://berkeleyearth.lbl.gov/locations/7.23N-4.05E
This looks much better now. The last step consists of saving this data frame to a .csv file.
write.csv(Output, file = "/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/myFile.csv") # Saving the data frame to a .csv file named "myFile.csv"
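Note that the path above is specific to the author’s machine. With a relative path the file is written to your current working directory, and write.csv() also accepts row.names = FALSE if you don’t want R’s row numbers in the file:
write.csv(Output, file = "myFile.csv", row.names = FALSE) # Relative path, without the row-number column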
Now that we have all this information, let’s find the city with the biggest temperature change since 1960. Again, we can do this easily using text manipulation functions.
Output$t_change <- NA # Creating an empty column (filled in the next line)
Output$t_change <- str_sub(Output$temperature, 1, 3) # Creating a vector for the temperature change (keeping the characters in positions 1 through 3 of each temperature string)
Output[order(Output$t_change, decreasing = T)[1:10], c(1,2,5)] # What are the names of the cities (and associated countries) with the 10 biggest changes?
## city country t_change
## 2182 Norilsk Russia 4.3
## 390 Birjand Iran 4.0
## 1538 Kerman Iran 4.0
## 2547 Rafsanjan Iran 3.9
## 2891 Sirjan Iran 3.9
## 3431 Yazd Iran 3.9
## 1422 Jiroft Iran 3.8
## 1915 Marv Dasht Iran 3.8
## 1359 Jahrom Iran 3.7
## 1918 Mashhad Iran 3.7
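Because str_sub() keeps only the first three characters and the sort then happens on text rather than on numbers, a slightly more robust variant is to extract the full value before the ± sign and convert it to numeric. This is just a sketch, and t_change_num is an illustrative column name:
Output$t_change_num <- as.numeric(str_extract(Output$temperature, "^-?[0-9.]+")) # Numeric value before the ± sign
Output[order(Output$t_change_num, decreasing = TRUE)[1:10], c("city", "country", "t_change_num")] # Top 10, sorted numerically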
Scrape a website of your choosing, but please choose one that doesn’t explicitly ban scraping, and avoid small, low-traffic websites. Select simple webpage designs (older websites are perfect for practicing these skills) and try to avoid interactive webpages (these are much more difficult to scrape!).
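If you do scrape a site with many pages, one simple courtesy is to pause between requests so you don’t overload the server. A minimal sketch, where polite_read is just an illustrative wrapper around rvest’s read_html():
# Illustrative helper: wait a moment, then fetch the page (assumes rvest is loaded)
polite_read <- function(url, delay = 1){
  Sys.sleep(delay) # Delay, in seconds, before each request
  read_html(url)
}
# In the pipeline above, map(read_html) could then be replaced by map(polite_read)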