Scraping Multiple Web Pages for Text Using R and rvest

Data Analysis Series

Author: John Karuitha
Affiliation: Karatina University
Published: June 1, 2023
Modified: June 1, 2023

1 Background

In many cases, data is not available in a ready-to-load-and-analyse format; often, the data we need is not readily available at all. In such cases, we may have to collect the data ourselves. One way to do this is to extract data from websites, a process referred to as web scraping. In this section, we examine how to scrape data from the web using R.

We shall utilise the following packages:

tidyverse   # For data cleaning and plotting.
data.table  # For data cleaning.
rvest       # For data scraping.
RSelenium   # For advanced data scraping.
gt          # For awesome tables.
janitor     # For some very useful extra data-cleaning functions.
netstat     # For checking for free ports on the computer.

We start by loading the packages.

## Install pacman if it is not already installed, then use it to load the other packages.
if (!require(pacman)) {
    install.packages("pacman")
}

pacman::p_load(
    data.table, rvest, RSelenium, tidyverse,
    gt, janitor, rJava, netstat, BiocManager
)

2 CSS Selectors and XPATH

Everything digital is built and maintained with code. Knowing this, we can use the structure of the code behind a website to automatically extract the data of interest from that site. XPATH expressions and CSS selector paths identify elements within a web page, and we use these two kinds of paths to access data. Every element in a web page has an XPATH and a CSS selector path. Given that most data analysts are not experts in HTML and CSS, we have a handy tool: the selector gadget in Google Chrome. This is an add-on that you attach to your Chrome browser. Users of Firefox can use an equivalent tool called ScrapeMate Beta, developed by John Smith, available via this link: https://addons.mozilla.org/en-US/firefox/addon/scrapemate/.

What do these extensions do? They allow you to capture the nodes associated with the data of interest in a web page. Note that particular data types have specific tags attached to them. For example, all tables are enclosed in a <table> tag, as in the sample HTML below.

## Sample HTML code: note the NODES surrounded by <>.
## Examples of section identifiers (nodes) include:
## <h1> for headings,
## <p> for paragraphs,
## <table> for tables,
## <body> for the entire body of content,
## and anchor tags <a>, which carry attributes like "href".

<html>
    <body>
        <h1> HEADING 1 </h1>

        <p> This paragraph consists of;

            <table> Code for Table 1 here </table> 

            Reference <a href="www.karu.ac.ke">link</a> to the Karatina University website.

        </p>

    </body>
</html>

When rendered, the HTML above will appear as follows on a website.

This paragraph consists of;

A table with rows and columns with text (code not included)

Reference link to the Karatina University website.

Thus, the extensions capture these tags (body, h1, p, etc.) and the location of each tag in the web page, allowing us to easily access the contents. Once installed, these browser extensions are quite simple to use, and we do not cover them further in this section.
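
To make the connection between tags and scraping concrete, here is a minimal sketch (not part of the original example) that parses the sample HTML above directly from a string and pulls out each tagged element. The same anchor tag is also selected with an XPath expression, to show the CSS and XPATH routes side by side.

library(rvest)

## Parse the sample HTML shown above from a string.
sample_html <- '
<html>
  <body>
    <h1> HEADING 1 </h1>
    <p> This paragraph consists of;
      Reference <a href="www.karu.ac.ke">link</a> to the Karatina University website.
    </p>
  </body>
</html>'

page <- read_html(sample_html)

page %>% html_nodes("h1") %>% html_text()        # the heading text
page %>% html_nodes("p") %>% html_text()         # the paragraph text
page %>% html_nodes("a") %>% html_attr("href")   # "www.karu.ac.ke"

## The same anchor selected with an XPath expression instead of a CSS selector.
page %>% html_nodes(xpath = "//a") %>% html_attr("href")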

3 The rvest Package

The rvest package is the most popular R package for web scraping. It is relatively simple to use and can handle most of an ordinary web scraper's needs. We illustrate the use of the rvest package using a few simple but illustrative examples.

3.1 The read_html function

Scraping data with rvest starts with identifying the desired website. Next, we read in its contents (which tend to be in HTML/CSS format) using the read_html function.

my_website <- read_html("www.desired_website.com")

The resulting output will contain all the contents of the website, including the nodes representing headers, paragraphs, tables, figures, and text, as well as the formatting that gives websites their aesthetic appeal.
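
As a quick sketch (using the Ecartico page that we scrape in Case Study 3 below rather than the placeholder address), we can confirm that read_html returns an xml_document object, which the other rvest functions then operate on.

## read_html returns an xml_document that the other rvest functions work on.
my_website <- read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")

class(my_website)   # "xml_document" "xml_node"
my_website          # printing shows the <head> and <body> of the page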

3.2 The html_nodes function

Having read in the contents of our target website using the read_html function, we then tell R which parts (nodes) of the website we are interested in. If you are interested in text, you will use the appropriate node for the text sections. For tables, you will most likely use the table node. The code starts from the output of the read_html function, as follows.

## The my_node can be text, table, or other pointers.
my_website <- read_html("www.desired_website.com") %>%
    html_nodes("my_node")

To see the different types of nodes, please refer to the html_text section (Section 3.4.1) below for a brief overview of the selector gadget.

3.3 The html_attr function

The html_attr() function in rvest retrieves a single attribute from an HTML element or node set. It takes two arguments: the first is the HTML element or node set, and the second is the name of the attribute to retrieve. If the attribute does not exist, html_attr() returns NA. For example, the code below pulls the href attribute (the link address) from each search result on a Google Scholar results page.

read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
    html_nodes(".gs_rt a") %>%
    html_attr("href")
 [1] "https://www.sciencedirect.com/science/article/pii/S0195925511000758"
 [2] "https://www.sciencedirect.com/science/article/pii/S0921800907002029"
 [3] "https://www.sciencedirect.com/science/article/pii/S019592550600151X"
 [4] "https://link.springer.com/article/10.1007/s10551-006-9253-8"        
 [5] "https://ascelibrary.org/doi/abs/10.1061/(ASCE)WR.1943-5452.0000134" 
 [6] "https://link.springer.com/article/10.1007/s10098-012-0454-9"        
 [7] "https://www.mdpi.com/91882"                                         
 [8] "https://www.mdpi.com/254628"                                        
 [9] "https://www.sciencedirect.com/science/article/pii/S1470160X14000983"
[10] "https://www.sciencedirect.com/science/article/pii/S092180090400151X"

3.4 Case Study 1: Scraping Text Data

3.4.1 The html_text Function

The html_text function in rvest allows us to capture text from a web page. In this context, let us search for articles that contain the term “Sustainability index” on Google Scholar. Specifically, we want to capture the titles of the articles that appear on the first page of the search results. All you need to do is open your browser, go to Google Scholar and search for sustainability index. Remember to also open the selector gadget from the extensions section of your browser. The selector gadget will typically open a side pane in your browser.

knitr::include_graphics("scrap.png")

Selector Text: See Selector Gadget on Left Panel of the Browser

On the left panel of the browser, we see that the selector gadget tells us the CSS selector associated with the article titles is “.gs_rt a”. We also know that this data is text. Note also the web address associated with this page: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=. With this information, we can invoke rvest as follows:

read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
    html_nodes(".gs_rt a") %>%
    html_text() %>%
    tibble() %>%
    set_names("output") %>%
    gt()
output
Review of sustainability indices and indicators: Towards a new City Sustainability Index (CSI)
Measuring the immeasurable—A survey of sustainability indices
Sustainability index for Taipei
Sustainable development and corporate performance: A study based on the Dow Jones sustainability index
Sustainability index for water resources planning and management
Sustainability performance evaluation in industry by composite sustainability index
Proposal of a sustainability index for the automotive industry
Organizational sustainability practices: A study of the firms listed by the corporate sustainability index
Transport sustainability index: Melbourne case study
In search of a natural systems sustainability index

Note that your output may be different, as the search results are not static. Note also that while we only captured the output on the first page, you could scale this approach to cover multiple pages, as we do in Case Study 3 below.

3.5 Case Study 2: Scraping Tabular Data

3.5.1 The html_table function

In this section, we use the html_table function to capture data already populated in tables. We utilise data from the Webometrics rankings of universities, with a focus on Africa. The URL of interest is https://www.webometrics.info/en/Africa.

read_html("https://www.webometrics.info/en/Africa") %>%
    html_nodes("#siteContent") %>%
    html_table() %>%
    .[[1]] %>%
    tibble() %>%
    clean_names() %>%
    select(-det, -country) %>%
    set_names(names(.) %>% str_to_sentence()) %>%
    head(20) %>%
    gt()
Ranking World_rank University Impact_rank Openness_rank Excellence_rank
1 246 University of Cape Town 284 235 293
2 398 University of the Witwatersrand 657 379 408
3 438 Stellenbosch University 696 350 471
4 450 University of Pretoria 633 470 511
5 548 Cairo University 1546 628 334
6 584 Alexandria University 879 754 599
7 598 University of Kwazulu Natal 1322 548 514
8 653 University of Johannesburg 1774 679 450
9 795 University of South Africa 1234 913 884
10 927 University of the Western Cape 1228 935 1152
11 934 Mansoura University 3902 639 538
12 992 Ain Shams University 3811 702 616
13 1066 Makerere University 1805 1200 1135
14 1076 University of Nairobi 1083 745 1690
15 1078 Zagazig University 5054 823 554
16 1106 University of the Free State 2539 1073 973
17 1109 University of Ghana 2146 777 1194
18 1129 University of Ibadan 2228 737 1223
19 1132 American University in Cairo 977 1131 1793
20 1138 Rhodes University 1525 1171 1427

There are many more functions for scraping data in rvest, but for a majority of users, the html_text and html_table functions will be enough to handle most of their data needs.
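
For example, recent versions of rvest (1.0.0 and above) provide html_element() and html_elements() as successors to html_node() and html_nodes(). The sketch below redoes the table extraction above with these verbs, assuming the ranking table sits inside the #siteContent node as before; the result is equivalent.

read_html("https://www.webometrics.info/en/Africa") %>%
    html_element("#siteContent table") %>%   # the first table inside #siteContent
    html_table() %>%
    clean_names() %>%
    head()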

3.6 Case Study 3 (Extended): Scraping Data from Multiple Web Pages

In this section, I provide an end-to-end case study in web scraping based on a workshop by John Little 1, a data librarian at Duke University. The case study covers the extraction of text data from 26 pages of a website (about 50 names per page) through systematic iteration using the purrr package in R. The exercise also illustrates how to clean such data using dplyr and regular expressions (REGEX).

The site we scrape is Ecartico, which contains the names of famed individuals who lived between 1400 and 1800 AD. The data spans 26 pages with over 1,200 names, making it tedious to scrape page by page, let alone one individual at a time. You can see the first page of this site by following this link: https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse.

We start by reading the website into R. Using the selector gadget, we can see that the 50 names on the first page share the node '#setwidth li a'. We can then scrape these 50 names using the html_text function.

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("#setwidth li a") %>%
    html_text() %>%
    tibble() %>%
    set_names("Children") %>%
    gt()
Children
Hillebrand Boudewynsz. van der Aa (1661 - 1717)
Boudewijn Pietersz van der Aa (? - ?)
Pieter Boudewijnsz. van der Aa (1659 - 1733)
Boudewyn van der Aa (1672 - ca. 1714)
Machtelt van der Aa (? - ?)
Claas van der Aa I (? - ?)
Claas van der Aa II (? - ?)
Willem van der Aa (? - ?)
Johanna van der Aa (? - ?)
Hans von Aachen (1552 - 1615)
Jacobus van Aaken (? - ?)
Justus van Aaken (? - ?)
Johannes van Aalburg (1717 - 1777)
Johannes Aalmis (1714 - 1799)
Johan Bartholomeus Aalmis (1723 - 1786)
Maria van Aalst (1639 - 1664)
Anna Aalst (? - ?)
Anna Aaltse (1715 - 1738)
Allart Aaltsz (1665 - 1748)
Geertruy Aaltsz (? - 1732)
Maria Aaltsz (? - 1746)
Catharina Aaltsz (? - 1727)
Nikolaas van Aaltwijk (1692 - 1727)
Maria Aams (1711 - 1774)
Jacobus Aams (1680 - ?)
Jan Govertsz. van der Aar (1544 - 1612)
Anna van der Aar (1576 - 1656)
Janneke Jans van Aarden (1609 - 1651)
Abraham van Aardenberg (1672 - 1717)
Willem Aardenhout I (? - ?)
Margrietje Aarlincx (1637 - 1690)
Dirck van Aart (1680 - 1737)
Jonas Abarbanel (? - 1667)
Josephus Abarbanel (? - ?)
Esther Abarbanel (? - ?)
Rachel Abarbanel (? - ?)
Lea Abarbanel (1691 - ?)
Isaac Abarbanel (1637 - 1723)
Damiana Abarca (? - 1630)
Bartholomeus Abba (1641 - 1684)
Cornelis Dirksz. Abba (1604 - 1675)
Clara Abba (1631 - 1671)
Aerlant Abbas (1606 - 1696)
Matheus Jansz Abbas (1569 - ?)
Hendrik Abbé (1639 - 1677)
Claude Abbé (? - 1653)
Simon Jan Pontenz. Abbe (1467 - 1549)
Simon IJsbrandz. Abbe (? - ?)
Ysbrandt Simonsz. Abbe (? - 1559)
Maximiliaen l' Abbé (? - 1675)

So how can we crawl over the rest of the pages? Doing so one by one would be time-consuming. Here, we resort to the html_attr function mentioned earlier. Specifically, we want to extract the web address (the href attribute) behind each of the page-navigation links. Let us first see what these navigation links are:

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_text()
[1] "[51-100]"    "[101-150]"   "[151-200]"   "[201-250]"   "[1251-1269]"

Next, we can look at the web addresses of the sites associated with the navigation buttons.

read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_attr("href") %>%
    tibble() %>%
    set_names("websites")
# A tibble: 5 × 1
  websites                                                 
  <chr>                                                    
1 index.php?subtask=browse&field=surname&strtchar=A&page=2 
2 index.php?subtask=browse&field=surname&strtchar=A&page=3 
3 index.php?subtask=browse&field=surname&strtchar=A&page=4 
4 index.php?subtask=browse&field=surname&strtchar=A&page=5 
5 index.php?subtask=browse&field=surname&strtchar=A&page=26

We can see a pattern in the naming of the addresses. In this case, we have the addresses for pages 2 to 5, and then the last page, 26. It appears the site follows a consistent naming pattern in its addresses. Remember that the root of the address is:

https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse

This root is then followed by a relative address; for the second page, it is:

<index.php?subtask=browse&field=surname&strtchar=A&page=2>

Hence, we can generalize the latter as:

<index.php?subtask=browse&field=surname&strtchar=A&page={pageNumber}>

Hence, the full address for any page takes the following form (shown here for the last page, 26):

https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=26

In our case, the maximum page is 26, but this may change in the future. Hence, we can first query the site to determine the maximum page at any given time and then construct the addresses with this in mind.

## The output is the maximum number of pages, in this case 26.
no_pages <- read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
    html_nodes("form+ .subnav a") %>%
    html_attr("href") %>%
    tibble() %>%
    set_names("websites") %>%
    mutate(page = str_extract(websites, "\\d{1,2}$")) %>%
    pull(page) %>%
    as.numeric() %>%
    max()

Having determined the structure of the web addresses and the maximum number of pages, we can now create a function to crawl the website, without worrying whether the number of pages expands in the future.

scrapper <- function(address) {
    ## Pause for two seconds between requests so we do not overload the server.
    Sys.sleep(2)

    ## Read the page and extract the names as a one-column tibble.
    read_html(address) %>%
        html_nodes("#setwidth li a") %>%
        html_text() %>%
        tibble() %>%
        set_names("Children")
}
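
Before crawling all the pages, it helps to test the function on a single page, here page 1 constructed from the address pattern above (a quick sanity check, not part of the original workflow).

## Quick test of scrapper() on the first page before looping over all pages.
scrapper("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=1") %>%
    head()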

Now let us create the addresses to scrape.

full_list <- tibble(root = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=", page = 1:no_pages) %>%
    mutate(web = glue::glue("{root}{page}"))

head(full_list)
# A tibble: 6 × 3
  root                                                                page web  
  <chr>                                                              <int> <glu>
1 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     1 http…
2 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     2 http…
3 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     3 http…
4 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     4 http…
5 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     5 http…
6 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s…     6 http…

Finally, we loop over the 26 pages to get the full list of names.

## Uncomment and run this code to get the data.
## I save the data as a CSV so that I do not have to re-scrape it.

# map_dfr(full_list %>% pull(web), scrapper) %>%
#   write_csv("dutch.csv")

Now let us clean this data.

"dutch.csv" %>%
    read_csv() %>%
    mutate(details = str_extract(Children, "\\(.*")) %>%
    mutate(Children = str_remove(Children, "\\(.*")) %>%
    separate(details, into = c("birth", "death"), sep = " - ") %>%
    mutate(
        birth = str_remove(birth, "\\("),
        death = str_remove(death, "\\)")
    ) %>%
    mutate(
        birth = parse_number(birth),
        death = parse_number(death)
    ) %>%
    gt(caption = "Clean Data") %>%
    opt_interactive()
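
As a small follow-up sketch (not in the original workshop), we can store the cleaned data in an object, say dutch_clean (an assumed name), instead of piping it straight into gt(), and then summarise it further, for example by counting the people born in each century.

## The same cleaning steps as above, assigned to an object instead of piped into gt().
dutch_clean <- "dutch.csv" %>%
    read_csv() %>%
    mutate(details = str_extract(Children, "\\(.*")) %>%
    mutate(Children = str_remove(Children, "\\(.*")) %>%
    separate(details, into = c("birth", "death"), sep = " - ") %>%
    mutate(
        birth = parse_number(str_remove(birth, "\\(")),
        death = parse_number(str_remove(death, "\\)"))
    )

## How many people were born in each century?
dutch_clean %>%
    mutate(century = floor(birth / 100) + 1) %>%   # e.g. 1661 falls in the 17th century
    count(century, name = "people")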

4 Conclusion

In this analysis, I have highlighted how to scrape data from multiple web pages using the rvest package and R. This approach could be useful for researchers interested in using text as data for their research.

Footnotes

  1. The link to the workshop is https://www.youtube.com/watch?v=8ISc8V9GDAg&t=3769s.↩︎