This R Markdown document shows how to scrape (or harvest!) web data in three steps in R, using rvest and, for those who are not familiar with HTML (like myself), selectorgadget. Scraping basically involves three steps:
* Step 1. Convert a website into an XML object. (i.e., translate the website into something R can understand)
* Step 2. Tag the relevant nodes (i.e., parts/contents) of the XML object. In other words, specify which elements of the XML object are needed. (i.e., find what specific content needs to be “harvested”)
* Step 3. Extract the tagged data, either as text or as a data table. (i.e., get it)
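In rvest terms, these three steps map onto three functions, previewed in this generic sketch (the URL and CSS selector below are placeholders, not real targets):
library(rvest)
page <- read_html("https://example.com")        # Step 1: convert the website into an XML object
nodes <- html_nodes(page, ".some-css-selector") # Step 2: tag the relevant nodes
dta <- html_text(nodes)                         # Step 3: extract the tagged data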
There are three sections in this document.
* Example 1. Scraping Text
* Example 2. Scraping Data Table
* More on Finding the Correct Data Table
A simple webpage is used for the first two examples: PMA2020 Datasets List: (https://www.pma2020.org/pma2020-datasets-list). A more complex webpage is used in the last section: (https://unstats.un.org/unsd/methodology/m49/)
The first example scrapes text, circled in green in Figure 1.
Figure 1.
Step 1: Convert a website into an XML object
Use the read_html() function with a target URL. The function calls the web server, collects the data, and parses it.
suppressPackageStartupMessages(library(rvest))
web <- read_html("https://www.pma2020.org/pma2020-datasets-list")
class(web)
[1] "xml_document" "xml_node"
Step 2: Tag the relevant nodes from the XML object
Use html_nodes(), whose argument is the class descriptor (CSS), prepended by a “.” to signify that it is a class. The output is a node set containing all the nodes found that way.
Or, without understanding HTML, a very easy way to find the right node is to use selectorgadget. Thank you, selectorgadget! See the red circle in Figure 1. In this example, the CSS selector is “.last > h3 span”.
nodes <- html_nodes(web, ".last > h3 span")
class(nodes)
[1] "xml_nodeset"
Or, another way to find the specific “selector” for an element is through the browser’s developer tools. Use the “Inspect” function to toggle over and find the element. A good tutorial for this is available here. In this example, the selector is “#node-706 > div.content > div > div > div > h3 > span” (obtained via “Copy selector” in Figure 2), and it can replace “.last > h3 span” in the code chunk above.
Figure 2.
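For example, the tagging step could be rewritten with the copied selector (a sketch, assuming the page structure, and therefore the selector, has not changed):
nodes <- html_nodes(web, "#node-706 > div.content > div > div > div > h3 > span")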
Step 3: Extract the tagged data
Apply html_text() to the nodes you want.
dta <- html_text(nodes)
class(dta)
[1] "character"
dta
[1] "Below is a list of full datasets that are currently available. To request datasets, click here."
Using the pipe operator (%>%), however, the above three steps can be combined into one block of code:
suppressPackageStartupMessages(library(rvest))
dta <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes(".last > h3 span") %>%
  html_text()
dta
[1] "Below is a list of full datasets that are currently available. To request datasets, click here."
The next example scrapes a data table from the same webpage: the list of Household/Female surveys and Service Delivery Point surveys by country, circled in green in Figure 3.
Figure 3.
Compared to the previous text-scraping example, there are two differences:
- Instead of html_text(), html_table() is used to extract the data at the last step.
- Also, there may be many tables (i.e., pieces of information presented in HTML’s table format). Thus, the correct table number (i.e., element number) should be identified (using [[ ]]) and the table coerced to a data frame with html_table().
So, first see all “tables” and identify the right table number.
tables <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
numtables <- length(tables)
str(tables)
List of 1
$ :'data.frame': 64 obs. of 5 variables:
..$ Country : chr [1:64] "Burkina Faso" "" "" "" ...
..$ Household/Female : chr [1:64] "Round 1 (2014)" "Round 2 (2015)" "Round 3 (2016)" "Round 4 (2016)" ...
..$ Service Delivery Point: chr [1:64] "Round 1 (2014)" "Round 2 (2015)" "Round 3 (2016)" "Round 4 (2016)" ...
..$ GPS : chr [1:64] "Burkina Faso" "" "" "" ...
..$ Other : chr [1:64] "Nutrition Round 1 HHQFQ (2017)" "Nutrition Round 1 SDP (2017)" "Nutrition Round 2 HHQFCQ (2018)" "Nutrition Round 2 SDP (2018)" ...
Luckily, the example website has only one table, and that table does have the correct information. Then, execute the three steps (i.e., convert, tag, and extract).
dta <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE)
Finally, as always, tidy up the harvested data.
# check the current data
head(dta, 10)
Country Household/Female Service Delivery Point
1 Burkina Faso Round 1 (2014) Round 1 (2014)
2 Round 2 (2015) Round 2 (2015)
3 Round 3 (2016) Round 3 (2016)
4 Round 4 (2016) Round 4 (2016)
5 Round 5 (2017) Round 5 (2017)
6
7 Cote d'Ivoire Round 1 (2017) Round 1 (2017)
8 Round 2 (2018) Round 2 (2018)
9
10 DRC Round 1 (2013) Kinshasa No SDP survey
GPS Other
1 Burkina Faso Nutrition Round 1 HHQFQ (2017)
2 Nutrition Round 1 SDP (2017)
3 Nutrition Round 2 HHQFCQ (2018)
4 Nutrition Round 2 SDP (2018)
5
6
7 Cote d'Ivoire
8
9
10 Kinshasa, Kongo Central
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Hmisc))
dta <- dta %>%
  # drop unnecessary variables/columns
  select(-GPS, -Other) %>%
  # rename variables
  rename(SDP = "Service Delivery Point") %>%
  rename(HHF = "Household/Female") %>%
  # drop empty rows that were used as spacers
  filter(HHF != "" | SDP != "")
for (i in 1:nrow(dta)) {
  # fill in empty country names, carrying the name down from the row above
  if (dta[i, 1] == "") {
    dta[i, 1] <- dta[i - 1, 1]
  }
}
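Alternatively, the same gap-filling can be done without an explicit loop, using dplyr and tidyr; a minimal sketch, assuming the empty strings should be treated as missing values:
suppressPackageStartupMessages(library(tidyr))
dta <- dta %>%
  mutate(Country = na_if(Country, "")) %>% # convert empty strings to NA
  fill(Country)                            # carry the last non-missing country name downward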
# check the cleaned data
head(dta, 10)
Country HHF SDP
1 Burkina Faso Round 1 (2014) Round 1 (2014)
2 Burkina Faso Round 2 (2015) Round 2 (2015)
3 Burkina Faso Round 3 (2016) Round 3 (2016)
4 Burkina Faso Round 4 (2016) Round 4 (2016)
5 Burkina Faso Round 5 (2017) Round 5 (2017)
6 Cote d'Ivoire Round 1 (2017) Round 1 (2017)
7 Cote d'Ivoire Round 2 (2018) Round 2 (2018)
8 DRC Round 1 (2013) Kinshasa No SDP survey
9 DRC Round 2 (2014) Kinshasa Round 2 (2014) Kinshasa
10 DRC Round 3 (2015) Kinshasa Round 3 (2015) Kinshasa
Most webpages, however, contain multiple tables, as in this example from the UN Statistics Division: https://unstats.un.org/unsd/methodology/m49/.
tables <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
numtables <- length(tables)
There are 48 tables on the webpage. Yikes! Now, explore them.
str(tables)
# Results hidden.
We need the list of countries by geographic region, which looks like Figure 4. There are rows with no country code (e.g., the first row, “World”). Thus, the number of observations should be greater than the total number of countries/areas recognized in the UN system (i.e., 249).
Figure 4.
# check the number of observations in each table (i.e., each element of "tables")
sapply(tables, nrow)
 [1] 249 249 249 249 249 249 280 280 280 280 280 280  46  24  19   9   6   6  47
[20]  47  47  47  47  47  32  32  32  32  32  32  53  53  53  53  53  53  66  66
[39]  66  66  66  66 182 182 182 182 182 182
So now it has been narrowed down to the 7th through 12th tables (those with 280 rows). Explore them further.
for (i in 7:12) {
  dta <- tables[[i]]
  str(dta)
  # note: head(dta, 10) would not auto-print inside a loop; use print(head(dta, 10)) to see it
}
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "World" "Africa" "Northern Africa" "Algeria" ...
$ M49 code : int 1 2 15 12 818 434 504 729 788 732 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "世界" "非洲" "北非" "阿尔及利亚" ...
$ M49 code : int 1 2 15 12 818 434 504 729 788 732 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Весь мир" "Азия" "Восточная Азия" "Китай" ...
$ M49 code : int 1 142 30 156 344 446 408 496 410 392 ...
$ ISO-alpha3 code: chr "" "" "" "CHN" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Monde" "Afrique" "Afrique septentrionale" "Algérie" ...
$ M49 code : int 1 2 15 12 818 434 504 732 729 788 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Mundo" "África" "África septentrional" "Argelia" ...
$ M49 code : int 1 2 15 12 818 434 504 732 729 788 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "العالم" "آسيا" "آسيا الوسطى" "أوزبكستان" ...
$ M49 : int 1 142 143 860 795 762 417 398 34 4 ...
$ ISO-alpha3 : chr "" "" "" "UZB" ...
$ Other groupings: chr "" "" "" "LLDC" ...
The first data frame (the 7th element in “tables”) has all the information in English. Table 7 has been found! Compare it to Figure 4.
dta <- tables[[7]]
head(dta, 10)
Country or Area M49 code ISO-alpha3 code Other groupings
1 World 1
2 Africa 2
3 Northern Africa 15
4 Algeria 12 DZA
5 Egypt 818 EGY
6 Libya 434 LBY
7 Morocco 504 MAR
8 Sudan 729 SDN LDC
9 Tunisia 788 TUN
10 Western Sahara 732 ESH
# or from the beginning
dta <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  .[[7]] %>%
  html_table(header = TRUE)
head(dta, 10)
Country or Area M49 code ISO-alpha3 code Other groupings
1 World 1
2 Africa 2
3 Northern Africa 15
4 Algeria 12 DZA
5 Egypt 818 EGY
6 Libya 434 LBY
7 Morocco 504 MAR
8 Sudan 729 SDN LDC
9 Tunisia 788 TUN
10 Western Sahara 732 ESH
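As a final tidying step, the regional aggregate rows (those without an ISO-alpha3 code) could be dropped to keep only countries/areas; a minimal sketch, assuming aggregates are the only rows with an empty code:
# keep only rows that represent countries/areas
countries <- dta %>%
  filter(`ISO-alpha3 code` != "")
nrow(countries) # expected to be 249, the number of countries/areas recognized in the UN system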
Acknowledgement: The following resources have been very helpful for learning web scraping.
* http://bradleyboehmke.github.io/2015/12/scraping-html-text.html
* https://www.datacamp.com/community/tutorials/r-web-scraping-rvest