This R Markdown document shows how to scrape (or harvest!) web data in three steps in R, using rvest and, for those who are not familiar with HTML (like myself), selectorgadget. Scraping basically involves three steps:
* Step 1. Convert a website into an XML object. (i.e., translate the website into something R can understand)
* Step 2. Tag the relevant nodes (i.e., parts/contents) of the XML object. In other words, specify which elements of the XML object are needed. (i.e., find what specific content needs to be “harvested”)
* Step 3. Extract the tagged data, either as text or as a data table. (i.e., get it)
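In rvest terms, these three steps map onto three functions, previewed in this generic sketch (the URL and CSS selector below are placeholders, not real targets):
library(rvest)
page <- read_html("https://example.com")        # Step 1: convert the website into an XML object
nodes <- html_nodes(page, ".some-css-selector") # Step 2: tag the relevant nodes
dta <- html_text(nodes)                         # Step 3: extract the tagged data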
There are three sections in this document.
* Example 1. Scraping Text
* Example 2. Scraping Data Table
* More on Finding the Correct Data Table
A simple webpage is used for the first two examples: PMA2020 Datasets List: (https://www.pma2020.org/pma2020-datasets-list). A more complex webpage is used in the last section: (https://unstats.un.org/unsd/methodology/m49/)
The first example scrapes text, circled in green in Figure 1.
Figure 1.
Step 1: Convert a website into an XML object
Use the read_html() function with a target URL. The function calls the web server, collects the data, and parses it.
suppressPackageStartupMessages(library(rvest))
web <- read_html("https://www.pma2020.org/pma2020-datasets-list")
class(web)
[1] "xml_document" "xml_node"
Step 2: Tag the relevant nodes from the XML object
Use html_nodes(), whose argument is the class descriptor (CSS), prepended by a “.” to signify that it is a class. The output is a node set containing all the nodes found that way.
Or, without understanding HTML, a very easy way to find the right node is to use selectorgadget. Thank you, selectorgadget! See the red circle in Figure 1. In this example, the CSS selector is “.last > h3 span”.
nodes <- html_nodes(web, ".last > h3 span")
class(nodes)
[1] "xml_nodeset"
Or, another way to find the specific “selector” for an element is through the browser’s developer tools. Use the “Inspect” function to toggle over and find the element. A good tutorial for this is available here. In this example, the selector is “#node-706 > div.content > div > div > div > h3 > span” (obtained via “Copy selector” in Figure 2), and it can replace “.last > h3 span” in the code chunk above.
Figure 2.
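For example, the tagging step could be rewritten with the copied selector (a sketch, assuming the page structure, and therefore the selector, has not changed):
nodes <- html_nodes(web, "#node-706 > div.content > div > div > div > h3 > span")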
Step 3: Extract the tagged data
Apply html_text() to the nodes you want.
dta <- html_text(nodes)
class(dta)
[1] "character"
dta
[1] "Below is a list of full datasets that are currently available. To request datasets, click here."
Using the pipe operator (%>%), however, the above three steps can be combined into one block of code:
suppressPackageStartupMessages(library(rvest))
dta <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes(".last > h3 span") %>%
  html_text()
dta
[1] "Below is a list of full datasets that are currently available. To request datasets, click here."
The next example scrapes a data table from the same webpage: the list of Household/Female surveys and Service Delivery Point surveys by country, circled in green in Figure 3.
Figure 3.
Compared to the previous text-scraping example, there are two differences:
- Instead of html_text(), html_table() is used to extract the data at the last step.
- Also, there may be many tables (i.e., pieces of information presented in HTML’s table format). Thus, the correct table number (i.e., element number) should be identified (using [[ ]]) and the table coerced to a data frame with html_table().
So, first see all “tables” and identify the right table number.
tables <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
numtables <- length(tables)
str(tables)
List of 1
$ :'data.frame': 64 obs. of 5 variables:
..$ Country : chr [1:64] "Burkina Faso" "" "" "" ...
..$ Household/Female : chr [1:64] "Round 1 (2014)" "Round 2 (2015)" "Round 3 (2016)" "Round 4 (2016)" ...
..$ Service Delivery Point: chr [1:64] "Round 1 (2014)" "Round 2 (2015)" "Round 3 (2016)" "Round 4 (2016)" ...
..$ GPS : chr [1:64] "Burkina Faso" "" "" "" ...
..$ Other : chr [1:64] "Nutrition Round 1 HHQFQ (2017)" "Nutrition Round 1 SDP (2017)" "Nutrition Round 2 HHQFCQ (2018)" "Nutrition Round 2 SDP (2018)" ...
Luckily, the example website has only one table, and that table does have the correct information. Then, execute the three steps (i.e., convert, tag, and extract).
dta <- read_html("https://www.pma2020.org/pma2020-datasets-list") %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE)
Finally, as always, tidy up the harvested data.
# check the current data
head(dta, 10)
Country Household/Female Service Delivery Point
1 Burkina Faso Round 1 (2014) Round 1 (2014)
2 Round 2 (2015) Round 2 (2015)
3 Round 3 (2016) Round 3 (2016)
4 Round 4 (2016) Round 4 (2016)
5 Round 5 (2017) Round 5 (2017)
6
7 Cote d'Ivoire Round 1 (2017) Round 1 (2017)
8 Round 2 (2018) Round 2 (2018)
9
10 DRC Round 1 (2013) Kinshasa No SDP survey
GPS Other
1 Burkina Faso Nutrition Round 1 HHQFQ (2017)
2 Nutrition Round 1 SDP (2017)
3 Nutrition Round 2 HHQFCQ (2018)
4 Nutrition Round 2 SDP (2018)
5
6
7 Cote d'Ivoire
8
9
10 Kinshasa, Kongo Central
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Hmisc))
dta <- dta %>%
  # drop unnecessary variables/columns
  select(-GPS, -Other) %>%
  # rename variables
  rename(SDP = "Service Delivery Point") %>%
  rename(HHF = "Household/Female") %>%
  # drop empty rows that were used as spacers
  filter(HHF != "" | SDP != "")
for (i in 1:nrow(dta)) {
  # fill in empty country names, carrying the name down from the row above
  if (dta[i, 1] == "") {
    dta[i, 1] <- dta[i - 1, 1]
  }
}
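Alternatively, the same gap-filling can be done without an explicit loop, using dplyr and tidyr; a minimal sketch, assuming the empty strings should be treated as missing values:
suppressPackageStartupMessages(library(tidyr))
dta <- dta %>%
  mutate(Country = na_if(Country, "")) %>% # convert empty strings to NA
  fill(Country)                            # carry the last non-missing country name downward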
# check the cleaned data
head(dta, 10)
Country HHF SDP
1 Burkina Faso Round 1 (2014) Round 1 (2014)
2 Burkina Faso Round 2 (2015) Round 2 (2015)
3 Burkina Faso Round 3 (2016) Round 3 (2016)
4 Burkina Faso Round 4 (2016) Round 4 (2016)
5 Burkina Faso Round 5 (2017) Round 5 (2017)
6 Cote d'Ivoire Round 1 (2017) Round 1 (2017)
7 Cote d'Ivoire Round 2 (2018) Round 2 (2018)
8 DRC Round 1 (2013) Kinshasa No SDP survey
9 DRC Round 2 (2014) Kinshasa Round 2 (2014) Kinshasa
10 DRC Round 3 (2015) Kinshasa Round 3 (2015) Kinshasa
Most webpages, however, contain multiple tables, as in this example from the UN Statistics Division: https://unstats.un.org/unsd/methodology/m49/.
tables <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
numtables <- length(tables)
There are 48 tables on the webpage. Yikes! Now, explore them.
str(tables)
# Results hidden.
We need the list of countries by geographic region, which looks like Figure 4. There are rows with no country code (e.g., the first row, “World”). Thus, the number of observations should be greater than the total number of countries/areas recognized in the UN system (i.e., 249).
Figure 4.
# check the number of observations in each table (i.e., each element of "tables")
sapply(tables, nrow)
 [1] 249 249 249 249 249 249 280 280 280 280 280 280  46  24  19   9   6   6  47
[20]  47  47  47  47  47  32  32  32  32  32  32  53  53  53  53  53  53  66  66
[39]  66  66  66  66 182 182 182 182 182 182
So now it has been narrowed down to the 7th through 12th tables (those with 280 rows). Explore them further.
for (i in 7:12) {
  dta <- tables[[i]]
  str(dta)
  # note: head(dta, 10) would not auto-print inside a loop; use print(head(dta, 10)) to see it
}
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "World" "Africa" "Northern Africa" "Algeria" ...
$ M49 code : int 1 2 15 12 818 434 504 729 788 732 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "世界" "非洲" "北非" "阿尔及利亚" ...
$ M49 code : int 1 2 15 12 818 434 504 729 788 732 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Весь мир" "Азия" "Восточная Азия" "Китай" ...
$ M49 code : int 1 142 30 156 344 446 408 496 410 392 ...
$ ISO-alpha3 code: chr "" "" "" "CHN" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Monde" "Afrique" "Afrique septentrionale" "Algérie" ...
$ M49 code : int 1 2 15 12 818 434 504 732 729 788 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "Mundo" "África" "África septentrional" "Argelia" ...
$ M49 code : int 1 2 15 12 818 434 504 732 729 788 ...
$ ISO-alpha3 code: chr "" "" "" "DZA" ...
$ Other groupings: chr "" "" "" "" ...
'data.frame': 280 obs. of 4 variables:
$ Country or Area: chr "العالم" "آسيا" "آسيا الوسطى" "أوزبكستان" ...
$ M49 : int 1 142 143 860 795 762 417 398 34 4 ...
$ ISO-alpha3 : chr "" "" "" "UZB" ...
$ Other groupings: chr "" "" "" "LLDC" ...
The first data frame (the 7th element in “tables”) has all the information in English. Table 7 has been found! Compare it to Figure 4.
dta <- tables[[7]]
head(dta, 10)
Country or Area M49 code ISO-alpha3 code Other groupings
1 World 1
2 Africa 2
3 Northern Africa 15
4 Algeria 12 DZA
5 Egypt 818 EGY
6 Libya 434 LBY
7 Morocco 504 MAR
8 Sudan 729 SDN LDC
9 Tunisia 788 TUN
10 Western Sahara 732 ESH
# or from the beginning
dta <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%
  html_nodes("table") %>%
  .[[7]] %>%
  html_table(header = TRUE)
head(dta, 10)
Country or Area M49 code ISO-alpha3 code Other groupings
1 World 1
2 Africa 2
3 Northern Africa 15
4 Algeria 12 DZA
5 Egypt 818 EGY
6 Libya 434 LBY
7 Morocco 504 MAR
8 Sudan 729 SDN LDC
9 Tunisia 788 TUN
10 Western Sahara 732 ESH
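As a final tidying step, the regional aggregate rows (those without an ISO-alpha3 code) could be dropped to keep only countries/areas; a minimal sketch, assuming aggregates are the only rows with an empty code:
# keep only rows that represent countries/areas
countries <- dta %>%
  filter(`ISO-alpha3 code` != "")
nrow(countries) # expected to be 249, the number of countries/areas recognized in the UN system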
Acknowledgement: The following resources have been very helpful for learning web scraping.
* http://bradleyboehmke.github.io/2015/12/scraping-html-text.html
* https://www.datacamp.com/community/tutorials/r-web-scraping-rvest