Scraping Multiple Web Pages for Text Using R and rvest

Data Analysis Series

Author: John Karuitha
Affiliation: Karatina University
Published: June 1, 2023
Modified: June 1, 2023

1 Background

In many cases, data is not available in a ready-to-load-and-analyse format; often, the data we need is not readily available at all. In such cases, we may have to collect the data ourselves. One way to do this is to extract data from websites, a process referred to as web scraping. In this section, we examine how to scrape data from the web using R.

We shall utilise the following packages:

tidyverse   # For data cleaning and plotting.
data.table  # For data cleaning.
rvest       # For data scraping.
RSelenium   # For advanced data scraping.
gt          # For awesome tables.
janitor     # For some very useful extra data-cleaning functions.
netstat     # For checking for free ports on the computer.

We start by loading the packages.

## Install pacman if it is not already installed, then use it to load the other packages.
if (!require(pacman)) {
    install.packages("pacman")
}

pacman::p_load(
    data.table, rvest, RSelenium, tidyverse,
    gt, janitor, rJava, netstat, BiocManager
)

2 CSS Selectors and XPATH

Everything digital is built and maintained with code. Knowing this, we can use the structure of the code behind a website to automatically extract the data of interest from that site. XPATH expressions and CSS selector paths identify elements within a web page, and we use these two kinds of paths to access data. Every element in a web page has an XPATH and a CSS selector path. Given that most data analysts are not experts in HTML and CSS, we have a handy tool: the selector gadget in Google Chrome. This is an add-on that you attach to your Chrome browser. Users of Firefox can use an equivalent tool called ScrapeMate Beta, developed by John Smith, available via this link: https://addons.mozilla.org/en-US/firefox/addon/scrapemate/.

What do these extensions do? They allow you to capture the nodes associated with the data of interest in a web page. Note that particular data types have specific tags attached to them. For example, all tables are enclosed in a <table> tag, as in the sample HTML below.

## Sample HTML code: note the NODES surrounded by <>.
## Examples of section identifiers (nodes) include:
## <h1> for headings,
## <p> for paragraphs,
## <table> for tables,
## <body> for the entire body of content,
## and anchor tags <a>, which carry attributes like "href".

<html>
    <body>
        <h1> HEADING 1 </h1>

        <p> This paragraph consists of;

            <table> Code for Table 1 here </table> 

            Reference <a href="www.karu.ac.ke">link</a> to the Karatina University website.

        </p>

    </body>
</html>

When rendered, the HTML above will appear as follows on a website.

This paragraph consists of;

A table with rows and columns with text (code not included)

Reference link to the Karatina University website.

Thus, the extensions capture these tags (body, h1, p, etc.) and the location of each tag in the web page, allowing us to easily access the contents. Once installed, these browser extensions are quite simple to use, and we do not cover them further in this section.
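
To make the connection between tags and scraping concrete, here is a minimal sketch (not part of the original example) that parses the sample HTML above directly from a string and pulls out each tagged element. The same anchor tag is also selected with an XPath expression, to show the CSS and XPATH routes side by side.

library(rvest)

## Parse the sample HTML shown above from a string.
sample_html <- '
<html>
  <body>
    <h1> HEADING 1 </h1>
    <p> This paragraph consists of;
      Reference <a href="www.karu.ac.ke">link</a> to the Karatina University website.
    </p>
  </body>
</html>'

page <- read_html(sample_html)

page %>% html_nodes("h1") %>% html_text()        # the heading text
page %>% html_nodes("p") %>% html_text()         # the paragraph text
page %>% html_nodes("a") %>% html_attr("href")   # "www.karu.ac.ke"

## The same anchor selected with an XPath expression instead of a CSS selector.
page %>% html_nodes(xpath = "//a") %>% html_attr("href")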

3 The rvest Package

The rvest package is the most popular R package for web scraping. It is relatively simple to use and can handle most of an ordinary web scraper's needs. We illustrate the use of the rvest package using a few simple but illustrative examples.

3.1 The read_html function

Scraping data with rvest starts with identifying the desired website. Next, we read in its contents (which tend to be in HTML/CSS format) using the read_html function.

my_website <- read_html("www.desired_website.com")

The resulting output will contain all the contents of the website, including the nodes representing headers, paragraphs, tables, figures, and text, as well as the formatting that gives websites their aesthetic appeal.
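
As a quick sketch (using the Ecartico page that we scrape in Case Study 3 below rather than the placeholder address), we can confirm that read_html returns an xml_document object, which the other rvest functions then operate on.

## read_html returns an xml_document that the other rvest functions work on.
my_website <- read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")

class(my_website)   # "xml_document" "xml_node"
my_website          # printing shows the <head> and <body> of the page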

3.2 The html_nodes function

Having read in the contents of our target website using the read_html function, we then tell R which parts (nodes) of the website we are interested in. If you are interested in text, you will use the appropriate node for the text sections. For tables, you will most likely use the table node. The code starts from the output of the read_html function, as follows.

## The my_node can be text, table, or other pointers.
my_website <- read_html("www.desired_website.com") %>%
    html_nodes("my_node")

To see the different types of nodes, please refer to the html_text section (Section 3.4.1) below for a brief overview of the selector gadget.

3.3 The html_attr function

The html_attr() function in rvest retrieves a single attribute from an HTML element or node set. It takes two arguments: the first is the HTML element or node set, and the second is the name of the attribute to retrieve. If the attribute does not exist, html_attr() returns NA. For example, the code below pulls the href attribute (the link address) from each search result on a Google Scholar results page.

read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
    html_nodes(".gs_rt a") %>%
    html_attr("href")
 [1] "https://www.sciencedirect.com/science/article/pii/S0195925511000758"
 [2] "https://www.sciencedirect.com/science/article/pii/S0921800907002029"
 [3] "https://www.sciencedirect.com/science/article/pii/S019592550600151X"
 [4] "https://link.springer.com/article/10.1007/s10551-006-9253-8"        
 [5] "https://ascelibrary.org/doi/abs/10.1061/(ASCE)WR.1943-5452.0000134" 
 [6] "https://link.springer.com/article/10.1007/s10098-012-0454-9"        
 [7] "https://www.mdpi.com/91882"                                         
 [8] "https://www.mdpi.com/254628"                                        
 [9] "https://www.sciencedirect.com/science/article/pii/S1470160X14000983"
[10] "https://www.sciencedirect.com/science/article/pii/S092180090400151X"

3.4 Case Study 1: Scraping Text Data

3.4.1 The html_text Function

The html_text function in rvest allows us to capture text from a web page. In this context, let us search for articles that contain the term “Sustainability index” on Google Scholar. Specifically, we want to capture the titles of the articles that appear on the first page of the search results. All you need to do is open your browser, go to Google Scholar and search for sustainability index. Remember to also open the selector gadget from the extensions section of your browser. The selector gadget will typically open a side pane in your browser.

knitr::include_graphics("scrap.png")

Selector Text: See Selector Gadget on Left Panel of the Browser

On the left panel of the browser, we see that the selector gadget tells us the CSS selector associated with the article titles is “.gs_rt a”. We also know that this data is text. Note also the web address associated with this page: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=. With this information, we can invoke rvest as follows:

read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
    html_nodes(".gs_rt a") %>%
    html_text() %>%
    tibble() %>%
    set_names("output") %>%
    gt()
output
Review of sustainability indices and indicators: Towards a new City Sustainability Index (CSI)
Measuring the immeasurable—A survey of sustainability indices
Sustainability index for Taipei
Sustainable development and corporate performance: A study based on the Dow Jones sustainability index
Sustainability index for water resources planning and management
Sustainability performance evaluation in industry by composite sustainability index
Proposal of a sustainability index for the automotive industry
Organizational sustainability practices: A study of the firms listed by the corporate sustainability index
Transport sustainability index: Melbourne case study
In search of a natural systems sustainability index

Note that your output may be different, as the search results are not static. Note also that while we only captured the output on the first page, you could scale this approach to cover multiple pages, as we do in Case Study 3 below.

3.5 Case Study 2: Scraping Tabular Data

3.5.1 The html_table function

In this section, we use the html_table function to capture data already populated in tables. We utilise data from the Webometrics rankings of universities, with a focus on Africa. The URL of interest is https://www.webometrics.info/en/Africa.

read_html("https://www.webometrics.info/en/Africa") %>%
    html_nodes("#siteContent") %>%
    html_table() %>%
    .[[1]] %>%
    tibble() %>%
    clean_names() %>%
    select(-det, -country) %>%
    set_names(names(.) %>% str_to_sentence()) %>%
    head(20) %>%
    gt()
Ranking World_rank University Impact_rank Openness_rank Excellence_rank
1 246 University of Cape Town 284 235 293
2 398 University of the Witwatersrand 657 379 408
3 438 Stellenbosch University 696 350 471
4 450 University of Pretoria 633 470 511
5 548 Cairo University 1546 628 334
6 584 Alexandria University 879 754 599
7 598 University of Kwazulu Natal 1322 548 514
8 653 University of Johannesburg 1774 679 450
9 795 University of South Africa 1234 913 884
10 927 University of the Western Cape 1228 935 1152
11 934 Mansoura University 3902 639 538
12 992 Ain Shams University 3811 702 616
13 1066 Makerere University 1805 1200 1135
14 1076 University of Nairobi 1083 745 1690
15 1078 Zagazig University 5054 823 554
16 1106 University of the Free State 2539 1073 973
17 1109 University of Ghana 2146 777 1194
18 1129 University of Ibadan 2228 737 1223
19 1132 American University in Cairo 977 1131 1793
20 1138 Rhodes University 1525 1171 1427

There are many more functions for scraping data in rvest, but for a majority of users, the html_text and html_table functions will be enough to handle most of their data needs.
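
For example, recent versions of rvest (1.0.0 and above) provide html_element() and html_elements() as successors to html_node() and html_nodes(). The sketch below redoes the table extraction above with these verbs, assuming the ranking table sits inside the #siteContent node as before; the result is equivalent.

read_html("https://www.webometrics.info/en/Africa") %>%
    html_element("#siteContent table") %>%   # the first table inside #siteContent
    html_table() %>%
    clean_names() %>%
    head()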

3.6 Case Study 3 (Extended): Scraping Data from Multiple Web Pages

In this section, I provide an end-to-end case study in web scraping based on a workshop by John Little 1, a data librarian at Duke University. The case study covers the extraction of text data from 26 pages of a website (about 50 names per page) through systematic iteration using the purrr package in R. The exercise also illustrates how to clean such data using dplyr and regular expressions (REGEX).

The site we scrape is Ecartico, which contains the names of famed individuals who lived between 1400 and 1800 AD. The data spans 26 pages with over 1,200 names, making it tedious to scrape page by page, let alone one individual at a time. You can see the first page of this site by following this link: https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse.

We start by reading the website into R. Using the selector gadget, we can see that the 50 names on the first page share the node '#setwidth li a'. We can then scrape these 50 names using the html_text function.

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("#setwidth li a") %>%
    html_text() %>%
    tibble() %>%
    set_names("Children") %>%
    gt()
Children
Hillebrand Boudewynsz. van der Aa (1661 - 1717)
Boudewijn Pietersz van der Aa (? - ?)
Pieter Boudewijnsz. van der Aa (1659 - 1733)
Boudewyn van der Aa (1672 - ca. 1714)
Machtelt van der Aa (? - ?)
Claas van der Aa I (? - ?)
Claas van der Aa II (? - ?)
Willem van der Aa (? - ?)
Johanna van der Aa (? - ?)
Hans von Aachen (1552 - 1615)
Jacobus van Aaken (? - ?)
Justus van Aaken (? - ?)
Johannes van Aalburg (1717 - 1777)
Johannes Aalmis (1714 - 1799)
Johan Bartholomeus Aalmis (1723 - 1786)
Maria van Aalst (1639 - 1664)
Anna Aalst (? - ?)
Anna Aaltse (1715 - 1738)
Allart Aaltsz (1665 - 1748)
Geertruy Aaltsz (? - 1732)
Maria Aaltsz (? - 1746)
Catharina Aaltsz (? - 1727)
Nikolaas van Aaltwijk (1692 - 1727)
Maria Aams (1711 - 1774)
Jacobus Aams (1680 - ?)
Jan Govertsz. van der Aar (1544 - 1612)
Anna van der Aar (1576 - 1656)
Janneke Jans van Aarden (1609 - 1651)
Abraham van Aardenberg (1672 - 1717)
Willem Aardenhout I (? - ?)
Margrietje Aarlincx (1637 - 1690)
Dirck van Aart (1680 - 1737)
Jonas Abarbanel (? - 1667)
Josephus Abarbanel (? - ?)
Esther Abarbanel (? - ?)
Rachel Abarbanel (? - ?)
Lea Abarbanel (1691 - ?)
Isaac Abarbanel (1637 - 1723)
Damiana Abarca (? - 1630)
Bartholomeus Abba (1641 - 1684)
Cornelis Dirksz. Abba (1604 - 1675)
Clara Abba (1631 - 1671)
Aerlant Abbas (1606 - 1696)
Matheus Jansz Abbas (1569 - ?)
Hendrik Abbé (1639 - 1677)
Claude Abbé (? - 1653)
Simon Jan Pontenz. Abbe (1467 - 1549)
Simon IJsbrandz. Abbe (? - ?)
Ysbrandt Simonsz. Abbe (? - 1559)
Maximiliaen l' Abbé (? - 1675)

So how can we crawl over the rest of the pages? Doing so one by one would be time-consuming. Here, we resort to the html_attr function mentioned earlier. Specifically, we want to extract the web address (the href attribute) behind each of the page-navigation links. Let us first see what these navigation links are:

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_text()
[1] "[51-100]"    "[101-150]"   "[151-200]"   "[201-250]"   "[1251-1269]"

Next, we can look at the web addresses of the sites associated with the navigation buttons.

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_attr("href") %>%
    tibble() %>%
    set_names("websites")
# A tibble: 5 × 1
  websites                                                 
  <chr>                                                    
1 index.php?subtask=browse&field=surname&strtchar=A&page=2 
2 index.php?subtask=browse&field=surname&strtchar=A&page=3 
3 index.php?subtask=browse&field=surname&strtchar=A&page=4 
4 index.php?subtask=browse&field=surname&strtchar=A&page=5 
5 index.php?subtask=browse&field=surname&strtchar=A&page=26

We can see a pattern in the naming of the addresses. In this case, we have the addresses for pages 2 to 5, and then the last page, 26. It appears the site follows a consistent naming pattern in its addresses. Remember that the root of the address is:

https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse

This root is then followed by a relative address; for the second page, it is:

<index.php?subtask=browse&field=surname&strtchar=A&page=2>

Hence, we can generalize the latter as:

<index.php?subtask=browse&field=surname&strtchar=A&page={pageNumber}>

Hence, the full address for any page takes the following form (shown here for the last page, 26):

https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=26

In our case, the maximum page is 26, but this may change in the future. Hence, we can first query the site to determine the maximum page at any given time and then construct the addresses with this in mind.

## The output is the maximum number of pages, in this case 26.
no_pages <- read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_attr("href") %>%
    tibble() %>%
    set_names("websites") %>%
    mutate(page = str_extract(websites, "\\d{1,2}$")) %>%
    pull(page) %>%
    as.numeric() %>%
    max()

Having determined the structure of the web addresses and the maximum number of pages, we can now create a function to crawl the website, without worrying whether the number of pages expands in the future.

scrapper <- function(address) {
    ## Pause for two seconds between requests so we do not overload the server.
    Sys.sleep(2)

    ## Read the page and extract the names as a one-column tibble.
    read_html(address) %>%
        html_nodes("#setwidth li a") %>%
        html_text() %>%
        tibble() %>%
        set_names("Children")
}
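
Before crawling all the pages, it helps to test the function on a single page, here page 1 constructed from the address pattern above (a quick sanity check, not part of the original workflow).

## Quick test of scrapper() on the first page before looping over all pages.
scrapper("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=1") %>%
    head()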

Now let us create the addresses to scrape.

full_list <- tibble(root = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=", page = 1:no_pages) %>%
    mutate(web = glue::glue("{root}{page}"))

head(full_list)
# A tibble: 6 × 3
  root                                                                page web  
  <chr>                                                              <int> <glu>
1 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     1 http…
2 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     2 http…
3 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     3 http…
4 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     4 http…
5 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     5 http…
6 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     6 http…

Finally, we loop over the 26 pages to get the full list of names.

## Uncomment and run this code to get the data.
## I save the data as a CSV so that I do not have to re-scrape it.

# map_dfr(full_list %>% pull(web), scrapper) %>%
#   write_csv("dutch.csv")

Now let us clean this data.

"dutch.csv" %>%
    read_csv() %>%
    mutate(details = str_extract(Children, "\\(.*")) %>%
    mutate(Children = str_remove(Children, "\\(.*")) %>%
    separate(details, into = c("birth", "death"), sep = " - ") %>%
    mutate(
        birth = str_remove(birth, "\\("),
        death = str_remove(death, "\\)")
    ) %>%
    mutate(
        birth = parse_number(birth),
        death = parse_number(death)
    ) %>%
    gt(caption = "Clean Data") %>%
    opt_interactive()
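
As a small follow-up sketch (not in the original workshop), we can store the cleaned data in an object, say dutch_clean (an assumed name), instead of piping it straight into gt(), and then summarise it further, for example by counting the people born in each century.

## The same cleaning steps as above, assigned to an object instead of piped into gt().
dutch_clean <- "dutch.csv" %>%
    read_csv() %>%
    mutate(details = str_extract(Children, "\\(.*")) %>%
    mutate(Children = str_remove(Children, "\\(.*")) %>%
    separate(details, into = c("birth", "death"), sep = " - ") %>%
    mutate(
        birth = parse_number(str_remove(birth, "\\(")),
        death = parse_number(str_remove(death, "\\)"))
    )

## How many people were born in each century?
dutch_clean %>%
    mutate(century = floor(birth / 100) + 1) %>%   # e.g. 1661 falls in the 17th century
    count(century, name = "people")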

4 Conclusion

In this analysis, I have highlighted how to scrape data from multiple web pages using the rvest package and R. This approach could be useful for researchers interested in using text as data for their research.

Footnotes

  1. The link to the workshop is https://www.youtube.com/watch?v=8ISc8V9GDAg&t=3769s.↩︎