if (!require(pacman)) {
  install.packages("pacman")
}

pacman::p_load(
  data.table, rvest, RSelenium, tidyverse,
  gt, janitor, rJava, netstat, BiocManager
)
1 Background
In many cases, data is not available in a ready-to-load-and-analyse format; often, it is not available at all. In such cases, we may have to collect the data ourselves. One way to do this is to extract data from websites through a process referred to as web scraping. In this section, we examine how to scrape data from the web using R.
We shall utilise the following packages:
tidyverse   # For data cleaning and plotting purposes.
data.table  # For data cleaning.
rvest       # For data scraping.
RSelenium   # For advanced data scraping.
gt          # For awesome tables.
janitor     # For some very important extra functions for cleaning data.
netstat     # For checking for free ports on the computer.
We start by loading the packages.
2 CSS Selectors and XPATH
Everything digital is built from code. Knowing this, we can use the structure of the code behind a website to automatically extract the data of interest from it. XPATH and CSS selector paths are part of how web pages are built, and we use these two kinds of paths to locate data. Every element in a web page has an XPATH and a CSS selector path. Given that most data analysts are not experts in the HTML and CSS languages, we have a handy tool: the selector gadget in Google Chrome. This is an add-on (extension) that you attach to your Chrome browser. Users of Firefox can use an equivalent tool called ScrapeMate Beta, developed by John Smith and available via this link: https://addons.mozilla.org/en-US/firefox/addon/scrapemate/.
What do these extensions do? They allow you to capture the nodes associated with the data of interest in a webpage. Note that particular data types have specific tags attached to them. For example, all tables are enclosed in a tag like the one below.
## Sample HTML code: note the NODES surrounded by <>.
## Examples of section identifiers (nodes) include <h1> for headings,
## <p> for paragraphs,
## <table> for tables, and
## <body> for the entire body of content.
## Anchor tags <a> can carry attributes like "href".
<html>
<body>
<h1> HEADING 1 </h1>
<p> This paragraph consists of;
<table> Code for Table 1 here </table>
Reference <a href="www.karu.ac.ke">link</a> to the Karatina University website.
</p>
</body>
</html>
When rendered, the HTML above will appear as follows in a website.
This paragraph consists of;
A table with rows and columns with text (code not included)
Reference link to the Karatina University website.
Thus, the extensions capture these tags (body, h1, p, etc.) and the location of each tag in the webpage, allowing us to easily access the contents. Once installed, these browser extensions are quite simple to use, so we do not cover them further in this section.
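To make the relationship between the two kinds of paths concrete, here is a minimal, self-contained sketch (run on an inline HTML fragment, so no internet connection is needed) showing that rvest's html_nodes() function accepts either a CSS selector or an XPath expression, and that both point to the same heading node.
## rvest is already loaded via pacman above; library(rvest) is repeated only
## so this snippet is self-contained.
library(rvest)

## A small HTML fragment, parsed directly from a string.
page <- read_html("<html><body>
  <h1> HEADING 1 </h1>
  <p> This paragraph consists of; </p>
  <a href='www.karu.ac.ke'>link</a>
</body></html>")

## The same heading selected via a CSS selector ...
page %>% html_nodes(css = "h1") %>% html_text()

## ... and via the equivalent XPath expression.
page %>% html_nodes(xpath = "//h1") %>% html_text()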
3 The rvest Package
The rvest package is the most popular R package for web scraping. It is relatively simple to use and can handle most of an ordinary web scraper's needs. We illustrate the use of the rvest package with a few simple but illustrative examples.
3.1 The read_html function
Scraping data with rvest starts with identifying the desired website. Next, we read in its contents (which tend to be in HTML/CSS format) using the read_html function.
## Placeholder address: replace with the full URL of the site of interest.
my_website <- read_html("https://www.desired_website.com")
The resulting output will contain all the contents of the website, including the nodes representing headers, paragraphs, tables, figures, and text, as well as the formatting that gives websites their aesthetic appeal.
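Since the address above is only a placeholder, here is a small offline sketch: read_html() also accepts a literal HTML string, so you can inspect the kind of object it returns without hitting a live site (the demo_page object and its contents are made up purely for illustration).
library(rvest)

## Parse a literal HTML string instead of a live URL.
demo_page <- read_html("<html><body><h1>Demo</h1><p>Some text.</p></body></html>")

demo_page          # prints as an {html_document}
class(demo_page)   # an xml_document, which all the html_* functions accept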
3.2 The html_nodes function
Having read the contents of our target website using the read_html function, we then tell R which parts (nodes) of the website we are interested in. If you are interested in text, you will use the appropriate node for the text sections. For tables, you will more likely use the table node. The code starts from the output of the read_html function, as follows.
## The "my_node" placeholder can be a text, table, or other node.
my_website <- read_html("https://www.desired_website.com") %>%
  html_nodes("my_node")
To see the different types of nodes, please refer to the section on the html_text function below, which includes a brief overview of the selector gadget.
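For a concrete, self-contained illustration (again using an in-memory HTML fragment rather than a live website), the sketch below swaps the my_node placeholder for the p and table nodes.
library(rvest)

## A toy page with two paragraphs and one table.
sample_page <- read_html("<html><body>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>
</body></html>")

## All paragraph nodes ...
sample_page %>% html_nodes("p")

## ... and the table node, ready to be passed to html_table() later on.
sample_page %>% html_nodes("table")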
3.3 The html_attr function
The html_attr() function in rvest gets a single attribute from an HTML element or node set. It takes two arguments: the first is the HTML element or node set, and the second is the name of the attribute to retrieve. If the attribute does not exist, html_attr() returns NA.
read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
html_nodes(".gs_rt a") %>%
html_attr("href")
[1] "https://www.sciencedirect.com/science/article/pii/S0195925511000758"
[2] "https://www.sciencedirect.com/science/article/pii/S0921800907002029"
[3] "https://www.sciencedirect.com/science/article/pii/S019592550600151X"
[4] "https://link.springer.com/article/10.1007/s10551-006-9253-8"
[5] "https://ascelibrary.org/doi/abs/10.1061/(ASCE)WR.1943-5452.0000134"
[6] "https://link.springer.com/article/10.1007/s10098-012-0454-9"
[7] "https://www.mdpi.com/91882"
[8] "https://www.mdpi.com/254628"
[9] "https://www.sciencedirect.com/science/article/pii/S1470160X14000983"
[10] "https://www.sciencedirect.com/science/article/pii/S092180090400151X"
3.4 Case Study 1: Scraping Text Data
3.4.1 The html_text function
The html_text function in rvest allows us to capture text from a web page. In this context, let us search for articles that contain the term “Sustainability index” on Google Scholar. Specifically, we want to capture the titles of the articles that appear on the first page of the search results. All you need to do is open your browser, go to Google Scholar and search for sustainability index. Remember to also open your selector gadget from the extensions section of your browser. The selector gadget will typically open a side pane in your browser.
knitr::include_graphics("scrap.png")
In the left panel of the browser, the selector gadget tells us that the path associated with the article titles is “.gs_rt a”. We also know that this data is text. Note also the web address associated with this page, https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=. With this information, we can invoke rvest as follows:
read_html("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index&btnG=") %>%
html_nodes(".gs_rt a") %>%
html_text() %>%
tibble() %>%
set_names("output") %>%
gt()
output |
---|
Review of sustainability indices and indicators: Towards a new City Sustainability Index (CSI) |
Measuring the immeasurable—A survey of sustainability indices |
Sustainability index for Taipei |
Sustainable development and corporate performance: A study based on the Dow Jones sustainability index |
Sustainability index for water resources planning and management |
Sustainability performance evaluation in industry by composite sustainability index |
Proposal of a sustainability index for the automotive industry |
Organizational sustainability practices: A study of the firms listed by the corporate sustainability index |
Transport sustainability index: Melbourne case study |
In search of a natural systems sustainability index |
Note that your output may be different, as the search results are not static. Note also that while we only captured the output on the first page, you could scale this to cover multiple pages.
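As a hedged sketch of how that scaling might look (scrape_page is a hypothetical helper; it assumes Google Scholar exposes further result pages through a start parameter in the URL, roughly 10 results per page, and Scholar may rate-limit or block automated requests, so scrape gently):
library(rvest)
library(tidyverse)

base_url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sustainability+index"

scrape_page <- function(start) {
  Sys.sleep(3)  # pause between requests to be polite
  read_html(paste0(base_url, "&start=", start)) %>%
    html_nodes(".gs_rt a") %>%
    html_text() %>%
    tibble() %>%
    set_names("output")
}

## Titles from the first three pages of results (uncomment to run).
# map_dfr(c(0, 10, 20), scrape_page)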
3.5 Case Study 2: Scraping Tabular Data
3.5.1 The html_table function
In this section, we use the html_table function to capture data already populated in tables. We utilise data from the Webometrics rankings of universities, with a focus on Africa. The URL of interest is https://www.webometrics.info/en/Africa.
read_html("https://www.webometrics.info/en/Africa") %>%
html_nodes("#siteContent") %>%
html_table() %>%
1]] %>%
.[[tibble() %>%
clean_names() %>%
select(-det, -country) %>%
set_names(names(.) %>% str_to_sentence()) %>%
head(20) %>%
gt()
Ranking | World_rank | University | Impact_rank | Openness_rank | Excellence_rank |
---|---|---|---|---|---|
1 | 246 | University of Cape Town | 284 | 235 | 293 |
2 | 398 | University of the Witwatersrand | 657 | 379 | 408 |
3 | 438 | Stellenbosch University | 696 | 350 | 471 |
4 | 450 | University of Pretoria | 633 | 470 | 511 |
5 | 548 | Cairo University | 1546 | 628 | 334 |
6 | 584 | Alexandria University | 879 | 754 | 599 |
7 | 598 | University of Kwazulu Natal | 1322 | 548 | 514 |
8 | 653 | University of Johannesburg | 1774 | 679 | 450 |
9 | 795 | University of South Africa | 1234 | 913 | 884 |
10 | 927 | University of the Western Cape | 1228 | 935 | 1152 |
11 | 934 | Mansoura University | 3902 | 639 | 538 |
12 | 992 | Ain Shams University | 3811 | 702 | 616 |
13 | 1066 | Makerere University | 1805 | 1200 | 1135 |
14 | 1076 | University of Nairobi | 1083 | 745 | 1690 |
15 | 1078 | Zagazig University | 5054 | 823 | 554 |
16 | 1106 | University of the Free State | 2539 | 1073 | 973 |
17 | 1109 | University of Ghana | 2146 | 777 | 1194 |
18 | 1129 | University of Ibadan | 2228 | 737 | 1223 |
19 | 1132 | American University in Cairo | 977 | 1131 | 1793 |
20 | 1138 | Rhodes University | 1525 | 1171 | 1427 |
There are many more functions for scraping data in rvest, but for the majority of users, these two functions will be enough to handle most of their data needs.
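For a taste of a few of those additional helpers, here is a brief, self-contained sketch using html_name(), html_attrs(), and html_children() on an inline HTML fragment chosen purely for illustration.
library(rvest)

snippet <- read_html("<div id='box'><p class='lead'>Intro</p><p>Body</p></div>")

node <- snippet %>% html_node("div")

html_name(node)       # the tag name of the node: "div"
html_attrs(node)      # all of its attributes as a named vector (here, id = "box")
html_children(node)   # the two <p> nodes nested inside the <div>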
3.6 Case Study 3 (Extended): Scraping Data from Multiple Web Pages
In this section, I provide an end-to-end case study in web scraping based on a workshop by John Little 1, a data librarian at Duke University. The case study covers the extraction of text data from over 50 pages of a website through systematic iteration using the purrr package in R. The exercise also illustrates how to clean such data using dplyr and regular expressions (REGEX).
The site we scrape is Ecartico, which contains the names of famed individuals who lived between 1400 and 1800 AD. The data spans 26 pages and over 1200 names, making it hard to scrape page by page, let alone one individual at a time. You can see the first page of this site by following this link: https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse.
We start by reading the website into R. Using the selector gadget, we can see that the 50 names on the first page share the node ‘#setwidth li a’. We can then scrape these 50 names using the html_text function.
read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
html_nodes("#setwidth li a") %>%
html_text() %>%
tibble() %>%
set_names("Children") %>%
gt()
Children |
---|
Hillebrand Boudewynsz. van der Aa (1661 - 1717) |
Boudewijn Pietersz van der Aa (? - ?) |
Pieter Boudewijnsz. van der Aa (1659 - 1733) |
Boudewyn van der Aa (1672 - ca. 1714) |
Machtelt van der Aa (? - ?) |
Claas van der Aa I (? - ?) |
Claas van der Aa II (? - ?) |
Willem van der Aa (? - ?) |
Johanna van der Aa (? - ?) |
Hans von Aachen (1552 - 1615) |
Jacobus van Aaken (? - ?) |
Justus van Aaken (? - ?) |
Johannes van Aalburg (1717 - 1777) |
Johannes Aalmis (1714 - 1799) |
Johan Bartholomeus Aalmis (1723 - 1786) |
Maria van Aalst (1639 - 1664) |
Anna Aalst (? - ?) |
Anna Aaltse (1715 - 1738) |
Allart Aaltsz (1665 - 1748) |
Geertruy Aaltsz (? - 1732) |
Maria Aaltsz (? - 1746) |
Catharina Aaltsz (? - 1727) |
Nikolaas van Aaltwijk (1692 - 1727) |
Maria Aams (1711 - 1774) |
Jacobus Aams (1680 - ?) |
Jan Govertsz. van der Aar (1544 - 1612) |
Anna van der Aar (1576 - 1656) |
Janneke Jans van Aarden (1609 - 1651) |
Abraham van Aardenberg (1672 - 1717) |
Willem Aardenhout I (? - ?) |
Margrietje Aarlincx (1637 - 1690) |
Dirck van Aart (1680 - 1737) |
Jonas Abarbanel (? - 1667) |
Josephus Abarbanel (? - ?) |
Esther Abarbanel (? - ?) |
Rachel Abarbanel (? - ?) |
Lea Abarbanel (1691 - ?) |
Isaac Abarbanel (1637 - 1723) |
Damiana Abarca (? - 1630) |
Bartholomeus Abba (1641 - 1684) |
Cornelis Dirksz. Abba (1604 - 1675) |
Clara Abba (1631 - 1671) |
Aerlant Abbas (1606 - 1696) |
Matheus Jansz Abbas (1569 - ?) |
Hendrik Abbé (1639 - 1677) |
Claude Abbé (? - 1653) |
Simon Jan Pontenz. Abbe (1467 - 1549) |
Simon IJsbrandz. Abbe (? - ?) |
Ysbrandt Simonsz. Abbe (? - 1559) |
Maximiliaen l' Abbé (? - 1675) |
So how can we crawl over the rest of the pages? Doing so one by one would be time consuming. Here, we resort to the html_attr function mentioned earlier. Specifically, we want to use it to extract the web addresses (href attributes) behind each of the navigation links. Let us first see what these navigation links are:
read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
html_nodes("form+ .subnav a") %>%
html_text()
[1] "[51-100]" "[101-150]" "[151-200]" "[201-250]" "[1251-1269]"
Next, we can look at the web addresses of the sites associated with the navigation buttons.
read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
html_nodes("form+ .subnav a") %>%
html_attr("href") %>%
tibble() %>%
set_names("websites")
# A tibble: 5 × 1
websites
<chr>
1 index.php?subtask=browse&field=surname&strtchar=A&page=2
2 index.php?subtask=browse&field=surname&strtchar=A&page=3
3 index.php?subtask=browse&field=surname&strtchar=A&page=4
4 index.php?subtask=browse&field=surname&strtchar=A&page=5
5 index.php?subtask=browse&field=surname&strtchar=A&page=26
We can see a pattern in the naming of the addresses. In this case, we have the addresses for pages 2-5, and then the last page, page 26. It appears that the site uses the following naming pattern for its addresses. Remember, the root of the address is:
https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse
This root is then followed by an address like the following for the second page:
<index.php?subtask=browse&field=surname&strtchar=A&page=2>
Hence, we can generalize the latter as:
<index.php?subtask=browse&field=surname&strtchar=A&page={pageNumber}>
Hence, the full address of, say, the last page (page 26) is:
https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=26
In our case, the maximum page is 26, but this may change in the future. Hence, we can first scrape the navigation links to determine the maximum page at any given time, and then construct the addresses with this in mind.
## The output is the maximum number of pages, in this case 26.
no_pages <- read_html("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse") %>%
  html_nodes("form+ .subnav a") %>%
  html_attr("href") %>%
  tibble() %>%
  set_names("websites") %>%
  mutate(page = str_extract(websites, "\\d{1,2}$")) %>%
  pull(page) %>%
  as.numeric() %>%
  max()
Having determined the structure of the web addresses and the maximum number of pages, we can now create a function to crawl the website, without worrying about whether the number of pages expands in the future.
scrapper <- function(address) {
  Sys.sleep(2)  # Pause between requests to avoid overloading the server.
  read_html(address) %>%
    html_nodes("#setwidth li a") %>%
    html_text() %>%
    tibble() %>%
    set_names("Children")
}
Now let us create the addresses to scrape.
full_list <- tibble(
  root = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=",
  page = 1:no_pages
) %>%
  mutate(web = glue::glue("{root}{page}"))
head(full_list)
# A tibble: 6 × 3
root page web
<chr> <int> <glu>
1 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 1 http…
2 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 2 http…
3 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 3 http…
4 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 4 http…
5 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 5 http…
6 https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?s… 6 http…
Finally, we loop over the 26 pages to get the full list of names.
## Run this code after uncommenting it to get the data.
## I save the data as a CSV to save on data (bandwidth) in later runs.
# map_dfr(full_list %>% pull(web), scrapper) %>%
#   write_csv("dutch.csv")
Now let us clean this data.
"dutch.csv" %>%
read_csv() %>%
mutate(details = str_extract(Children, "\\(.*")) %>%
mutate(Children = str_remove(Children, "\\(.*")) %>%
separate(details, into = c("birth", "death"), sep = " - ") %>%
mutate(
birth = str_remove(birth, "\\("),
death = str_remove(death, "\\)")
%>%
) mutate(
birth = parse_number(birth),
death = parse_number(death)
%>%
) gt(caption = "Clean Data") %>%
opt_interactive()
4 Conclusion
In this analysis, I have highlighted how to scrape data from multiple web pages using the rvest package in R. This approach could be useful for researchers interested in using text as data in their research.
Footnotes
The link to the workshop is https://www.youtube.com/watch?v=8ISc8V9GDAg&t=3769s.↩︎