class: inverse, center, middle, title-slide

.title[
# Regular Expressions, Web Scraping, APIs
]

.subtitle[
## JSC 370: Data Science II
]

.date[
### February 10, 2025
]

---

<style type="text/css">
code.*, .remark-code, pre {
  font-size:15px;
}
body{
  font-family: Helvetica;
  font-size: 12pt;
}
p,h1,h2,h3,h4 {
  font-family: system-ui
}
.html-widget {
  margin: auto;
}
code.r{ /* Code block */
  font-size: 18px;
}
pre { /* Code block - determines code spacing between lines */
  font-size: 20px;
}
</style>

## Today's Goals

- Regular Expressions basics
- Using R for Web Scraping
- Interacting with APIs using R

---

## Regular Expressions: What are they?

> A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. -- [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

<div style="text-align: center;">
  <img src="https://imgs.xkcd.com/comics/regular_expressions.png" width="350px">
</div>

---

## Regular Expressions

We can use Regular Expressions for:

- Validating data fields, e.g., email addresses, numbers, etc.
- Searching text in various formats, e.g., addresses (there are many ways to write an address).
- Replacing text, e.g., different spellings, `Storm`, `Stôrm`, `Stórm` to `Storm`.
- Removing text, e.g., tags from an HTML text, `<name>George</name>` to `George`.

---

## Regular Expressions: Metacharacters

What makes *regex* special is **metacharacters**. While we can always use *regex* to match literals like `dog`, `human`, `1999`, we unlock the full power of *regex* when using metacharacters:

- **`.` (dot)** → Matches **any character** except a newline (`\n`).
  - Example: `c.t` matches `cat`, `cut`, `cot`, but not `ct`.
- **`^` (caret)** → Matches the **beginning** of a string.
  - Example: `^Hello` matches `"Hello world"`, but not `"Say Hello"`.
- **`$` (dollar sign)** → Matches the **end** of a string.
  - Example: `world$` matches `"Hello world"`, but not `"worldwide"`.

These metacharacters create dynamic and flexible search patterns beyond literal matching.

---

## Regular Expressions: Metacharacters

- `[regex]` Match a single character from the set `regex`, e.g.
  - `[0123456789]` Any single digit from 0 to 9
  - `[0-9]` Any digit in the range 0-9 (equivalent to the above)
  - `[a-z]` Lower-case letters
  - `[A-Z]` Upper-case letters
  - `[a-zA-Z]` Lower- or upper-case letters
  - `[a-zA-Z0-9]` Any alphanumeric character

---

## Regular Expressions: Metacharacters

Square brackets `[ ]` define a **character class**, matching a **single** character from the specified set:

- `[0-9]` → Matches any **single** digit from `0` to `9`.
  - Example: `4` matches in `42`.
- `[a-z]` → Matches any **lowercase letter** (`a` to `z`).
  - Example: `c` matches in `cat`.
- `[A-Z]` → Matches any **uppercase letter** (`A` to `Z`).
  - Example: `B` matches in `BIG`.

---

## Regular Expressions: Metacharacters

- `[a-zA-Z]` → Matches **any letter**, lowercase or uppercase.
  - Example: `X` matches in `eXample`.
- `[a-zA-Z0-9]` → Matches any **alphanumeric character** (letters and digits).
  - Example: `G5` matches in `Group5`.

> **Note:** Square brackets match only **one** character from the defined set. If multiple characters are needed, additional quantifiers (`+`, `*`, `{}`) must be used.

---

## Regular Expressions: Negated Metacharacters

The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** the characters inside the brackets.

- **`[^0-9]`** → Matches **any character except** digits (`0-9`).
  - Example: `@` matches, but `8` does not.
``` r grepl("[^0-9]", "Hello") # TRUE (matches 'H', 'e', etc.) grepl("[^0-9]", "12345") # FALSE (only contains digits) grepl("[^0-9]", "A1B2C") # TRUE (matches 'A', 'B', 'C') ``` ``` ## [1] TRUE ## [1] FALSE ## [1] TRUE ``` --- ## Regular Expressions: Negated Metacharacters The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** those inside the brackets. - **`[^a-z]`** → Matches **any character except** lowercase letters (`a-z`). - Example: `X` matches, but `m` does not. ``` r grepl("[^a-z]", "hello") # FALSE (all lowercase letters) grepl("[^a-z]", "Hello") # TRUE (matches 'H') grepl("[^a-z]", "1234") # TRUE (matches '1', '2', etc.) ``` ``` ## [1] FALSE ## [1] TRUE ## [1] TRUE ``` --- ## Regular Expressions: Negated Metacharacters The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** those inside the brackets. - **`[^./ ]`** → Matches **any character except** a period (`.`), slash (`/`), or space. - Example: `A` matches, but `.` does not. ``` r grepl("[^./ ]", "Hello.") # TRUE (matches 'H', 'e', 'l', 'l', 'o') grepl("[^./ ]", " /.") # FALSE (contains only disallowed characters) grepl("[^./ ]", "A/B") # TRUE (matches 'A' and 'B') ``` ``` ## [1] TRUE ## [1] FALSE ## [1] TRUE ``` > **Note:** these match only a **single character** at a time, not multiple characters in sequence. --- ## Regular Expressions: Character Classes Ranges, e.g., `0-9` or `a-z`, are locale- and implementation-dependent, meaning that the range of lower case letters may vary depending on the OS's language. To solve for this problem, you could use [Character classes](https://en.wikipedia.org/wiki/Regular_expression#Character_classes). Some examples: - `[:lower:]` lower case letters in the current locale, could be `[a-z]` - `[:upper:]` upper case letters in the current locale, could be `[A-Z]` - `[:alpha:]` upper and lower case letters in the current locale, could be `[a-zA-Z]` - `[:digit:]` Digits: 0 1 2 3 4 5 6 7 8 9 - `[:alnum:]` Alpha numeric characters `[:alpha:]` and `[:digit:]`. - `[:punct:]` Punctuation characters: ! " \# $ % & ' ( ) * + , - . / : ; < = > ? @ [ \\ ] ^ _ ` \{ | \} ~. --- ## Regular Expressions: Character Classes In some languages and character encodings, regular expressions behave differently based on the locale setting of the system. This is especially important when working with Unicode characters beyond the basic ASCII set. For example, in the locale `en_US`, the word `Ḧóla` IS NOT fully matched by `[a-zA-Z]+` because: - `[a-zA-Z]` only includes ASCII characters (a-z and A-Z). - The letter `Ḧ` (with a diacritic) is not part of ASCII, so it is not matched. - `Ḧóla` is fully matched by `[[:alpha:]]+` because `[[:alpha:]]` is locale-aware and includes all alphabetic characters, including Unicode letters like `Ḧ`. --- ## Other important Metacharacters `\s` white space, matches any whitespace character including: - Spaces ( ) - Newlines (`\n`) - Carriage returns (`\r`) - Tabs (`\t`) - Form feeds (`\f`) - Vertical tabs (`\v`) --- ## Other important Metacharacters `\s` is equivalent to `[\r\n\t\f\v ]`, meaning it captures all these whitespace variations. ``` r grepl("\\s", "Hello World") # TRUE (space between words) grepl("\\s", "HelloWorld") # FALSE (no space) ``` --- ## Other important Metacharacters `|` or (logical or) matches either of the given patterns. 
``` r grepl("cat|dog", "I love cats") # TRUE (matches "cat") grepl("cat|dog", "I love dogs") # TRUE (matches "dog") grepl("cat|dog", "I love birds") # FALSE (neither word is present) ``` --- ## Regular Expressions: Repetition These usually come together with specifying how many times (repetition): - `regex?` Zero or one match. - `regex*` Zero or more matches - `regex+` One or more matches - `regex{n,}` At least `n` matches - `regex{,m}` at most `m` matches - `regex{n,m}` Between `n` and `m` matches. Where `regex` is a regular expression --- ## Regular Expressions: More operations There are other operators that can be very useful, - `(regex)` Group capture. - `(?:regex)` Group operation without capture. - `(?=regex)` Look ahead (match) - `(?!regex)` Look ahead (don't match) - `(?<=regex)` Look behind (match) - `(?<!regex)` Look behind (don't match) Group captures can be reused with `\1`, `\2`, ..., `\n`. More (great) information here https://regex101.com/ --- ## Regular Expressions: Examples Here we are extracting the first occurrence of the following regular expressions (using `stringr::str_extract()`): <table class="table lightable-paper lightable-hover" style='font-size: 15px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Row </th> <th style="text-align:left;"> regex </th> <th style="text-align:left;"> Hanna Perez [name] </th> <th style="text-align:left;"> The 年 year was 1999 </th> <th style="text-align:left;"> HaHa, @abc said that </th> <th style="text-align:left;"> GoGo Blues #2024! </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> .{5} </td> <td style="text-align:left;"> Hanna </td> <td style="text-align:left;"> The 年 </td> <td style="text-align:left;"> HaHa, </td> <td style="text-align:left;"> GoGo </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> n{2} </td> <td style="text-align:left;"> nn </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> [0-9]+ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 2024 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> \s[a-zA-Z]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> year </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> Blues </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> \s[[:alpha:]]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> 年 </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> Blues </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> [a-zA-Z]+ [a-zA-Z]+ </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> year was </td> <td style="text-align:left;"> abc said </td> <td style="text-align:left;"> GoGo Blues </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> ([a-zA-Z]+\s?){2} </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> The </td> <td style="text-align:left;"> HaHa </td> <td style="text-align:left;"> GoGo Blues </td> </tr> <tr> <td style="text-align:left;"> 8 
</td>
   <td style="text-align:left;"> ([a-zA-Z]+)\1 </td>
   <td style="text-align:left;"> nn </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> HaHa </td>
   <td style="text-align:left;"> GoGo </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> (@|#)[a-z0-9]+ </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> @abc </td>
   <td style="text-align:left;"> #2024 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:left;"> (?&lt;=#|@)[a-z0-9]+ </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> abc </td>
   <td style="text-align:left;"> 2024 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:left;"> \[[a-z]+\] </td>
   <td style="text-align:left;"> [name] </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
  </tr>
</tbody>
</table>

---

## Regular Expressions: Examples Explained

1) `.{5}` Match **any character** (except line end) five times.

2) `n{2}` Match the letter **n** twice.

3) `[0-9]+` Match **any number** at least once.

4) `\s[a-zA-Z]+\s` Match a **space**, **any lower or upper case letter** at least once, and a **space**.

5) `\s[[:alpha:]]+\s` Same as before, but using the locale-aware class, so non-ASCII letters such as `年` also match.

---

## Regular Expressions: Examples Explained (cont.)

6) `[a-zA-Z]+ [a-zA-Z]+` Match two sets of letters separated by one space.

7) `([a-zA-Z]+\s?){2}` Match **any lower or upper case letter** at least once, maybe followed by a white space, twice.

8) `([a-zA-Z]+)\1` Match **any lower or upper case letter** at least once, and then match the same text again (a back-reference).

9) `(@|#)[a-z0-9]+` Match either the `@` or `#` symbol, followed by one or more **lower case letters** or **numbers**.

10) `(?<=#|@)[a-z0-9]+` Match one or more **lower case letters** or **numbers** that follow either the `@` or `#` symbol.

11) `\[[a-z]+\]` Match the symbol `[`, at least one **lower case letter**, and the symbol `]`.

---

## Regular Expressions: Functions in R

- Lookup text: `base::grepl()`, `stringr::str_detect()`.

- Lookup matching indices (similar to `which()`, i.e., which elements are `TRUE`): `base::grep()`, `stringr::str_which()`.

- Replace the first instance: `base::sub()`, `stringr::str_replace()`.

- Replace all instances: `base::gsub()`, `stringr::str_replace_all()`.

- Extract text: `base::regmatches()`, `stringr::str_extract()` and `stringr::str_extract_all()`.
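
---

## Regular Expressions: Functions in R

To make the mapping concrete, here is a minimal sketch (the example strings and the phone-number-style pattern are made up for illustration):

``` r
library(stringr)

x <- c("call me at 555-1234", "no phone here", "office: 555-9876")

# Detect a phone-like pattern (base R vs stringr)
grepl("[0-9]{3}-[0-9]{4}", x)       # TRUE FALSE TRUE
str_detect(x, "[0-9]{3}-[0-9]{4}")  # TRUE FALSE TRUE

# Which elements match?
grep("[0-9]{3}-[0-9]{4}", x)        # 1 3

# Replace the first digit vs. every digit
sub("[0-9]", "X", x)
gsub("[0-9]", "X", x)

# Extract the first match
str_extract(x, "[0-9]{3}-[0-9]{4}") # "555-1234" NA "555-9876"
```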
---

## Regular Expressions: Functions in R

For example, let's create a regex that matches usernames or hashtags with the following pattern: `(@|#)([[:alnum:]]+)`

<table class="table lightable-paper lightable-hover" style='font-size: 17px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Code </th>
   <th style="text-align:left;"> @Hanna Perez [name] #html </th>
   <th style="text-align:left;"> The @年 year was 1999 </th>
   <th style="text-align:left;"> HaHa, @abc said that @z </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> str_detect(text, pattern) or grepl(pattern, text) </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_extract(text, pattern) </td>
   <td style="text-align:left;"> @Hanna </td>
   <td style="text-align:left;"> @年 </td>
   <td style="text-align:left;"> @abc </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_extract_all(text, pattern) </td>
   <td style="text-align:left;"> [@Hanna, #html] </td>
   <td style="text-align:left;"> [@年] </td>
   <td style="text-align:left;"> [@abc, @z] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_replace(text, pattern, "\1justinbieber") </td>
   <td style="text-align:left;"> @justinbieber Perez [name] #html </td>
   <td style="text-align:left;"> The @justinbieber year was 1999 </td>
   <td style="text-align:left;"> HaHa, @justinbieber said that @z </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_replace_all(text, pattern, "\1justinbieber") </td>
   <td style="text-align:left;"> @justinbieber Perez [name] #justinbieber </td>
   <td style="text-align:left;"> The @justinbieber year was 1999 </td>
   <td style="text-align:left;"> HaHa, @justinbieber said that @justinbieber </td>
  </tr>
</tbody>
</table>

**Note**: Although it does not show in the table, the group reference in the replacement was escaped in the actual code, i.e., `\\1` instead of `\1`.

---

## Regular Expressions: Functions in R

Here's the code

<img src="code.png" width="50%" style="display: block; margin: auto;" />

---

## Regular Expressions on Data

We will use a dataset consisting of medical transcriptions from https://www.mtsamples.com/. See the readme [here](https://github.com/JSC370/JSC370-2025/tree/main/data/medical_transcriptions).

The dataset consists of 4999 rows and 6 columns: an unnamed index column (read in as `V1`), "description", "medical_specialty", "sample_name", "transcription" and "keywords". Read it in as a `data.table`.

``` r
library(data.table)  # for fread() and data.table syntax
library(stringr)     # for str_extract() used below

fn <- "mtsamples.csv"
if (!file.exists(fn))
  download.file(
    # raw file URL (the /blob/ page would download HTML rather than the CSV itself)
    url = "https://raw.githubusercontent.com/JSC370/JSC370-2025/main/data/medical_transcriptions/mtsamples.csv",
    destfile = fn
  )
mtsamples <- fread(fn, sep = ",", header = TRUE)
names(mtsamples)
```

```
## [1] "V1"                "description"       "medical_specialty"
## [4] "sample_name"       "transcription"     "keywords"
```

---

## Regex to Lookup Text: Tumor

Let's search through the `description` column using `grepl()`, looking for the word *tumor*:

``` r
# How many entries contain the word tumor?
mtsamples[grepl("tumor", description, ignore.case = TRUE), .N]

# Generating a column tagging tumor
mtsamples[, tumor_related := grepl("tumor", description, ignore.case = TRUE)]

# Taking a look at a few examples
mtsamples[tumor_related == TRUE, .(description)][1:3,]
```

```
## [1] 67
##    description
##    <char>
## 1: Transurethral resection of a medium bladder tumor (TURBT), left lateral wall.
## 2: Transurethral resection of the bladder tumor (TURBT), large.
## 3: Cystoscopy, transurethral resection of medium bladder tumor (4.0 cm in diameter), and direct bladder biopsy.
```

Notice the `ignore.case = TRUE`. This is equivalent to transforming the text to lower case with `tolower()` before passing it to the regular expression function.

---

## Regex Lookup text: Pronoun of the patient

Now, let's try to guess the pronoun of the patient. To do so, we could tag the transcriptions using the words *he, his, him, they, them, theirs, ze, hir, hirs, she, hers, her*:

``` r
mtsamples[, pronoun := str_extract(
  string  = tolower(transcription),
  pattern = "he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her"
  )]
mtsamples[1:10, pronoun]
mtsamples[, table(pronoun, useNA = "always")]
```

```
##  [1] "his" "his" "his" "ze"  "he"  "he"  "he"  "he"  "he"  "ze" 
## pronoun
##   he him hir his she them ze <NA> 
## 2558   6  14 934  46   13 43   68
```

What is the problem with this approach?

---

## Regex Lookup text: Pronoun of the patient

For this we use the following regular expression:

`(?<=\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\W|$)`

Bit by bit this is:

- `(?<=regex)` look-behind search, i.e., the match must be preceded by
- `\W` any non-alphanumeric character (equivalent to `[^[:alnum:]]`),
- `|` or
- `^` the beginning of the text

---

## Regex Lookup text: Pronoun of the patient

- `he|his|him...` any of these words,
- `(?=regex)` look-ahead, i.e., followed by
- `\W` any non-alphanumeric character (equivalent to `[^[:alnum:]]`),
- `|` or
- `$` the end of the text.

---

## Regex Lookup text: Pronoun of the patient

``` r
mtsamples[, pronoun := str_extract(
  string  = tolower(transcription),
  pattern = "(?<=\\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\\W|$)"
  )]
mtsamples[1:10, pronoun]
```

```
##  [1] "she" "he"  "he"  NA    NA    "she" "she" NA    NA    NA
```

---

## Regex Lookup text: Pronoun of the patient

``` r
mtsamples[, table(pronoun, useNA = "always")]
```

```
## pronoun
##  he her him his she them they <NA> 
## 767 394  29 361 870   18   67 1176
```

---

## Regex Extract Text: Type of Cancer

- Imagine now that you need to see the types of cancer mentioned in the data.
- For simplicity, let's assume that, if specified, it is in the form of `TYPE cancer`, i.e., a single word followed by *cancer*.
- We are interested in the word before *cancer*; how can we capture this?

---

## Regex Extract Text: Type of Cancer

We can just try to **extract** the phrase `"[some word] cancer"`; in particular, we could use the following regular expression

`[[:alnum:]-_]{4,}\s*cancer`

Where

- `[[:alnum:]-_]{4,}` matches any alphanumeric character, plus `-` and `_`, repeated at least 4 times,
- `\s*` matches 0 or more whitespace characters, and
- `cancer` matches the literal word cancer.

---

## Regex Extract Text: Type of Cancer

``` r
mtsamples[, cancer_type := str_extract(tolower(keywords), "[[:alnum:]-_]{4,}\\s*cancer")]
mtsamples[, table(cancer_type)]
```

```
## cancer_type
##        anal cancer     bladder cancer      breast cancer       colon cancer 
##                  1                  6                 16                 12 
## endometrial cancer  esophageal cancer        lung cancer     ovarian cancer 
##                  5                  1                  8                  1 
##   papillary cancer    prostate cancer     uterine cancer 
##                  2                 14                  4
```

---

## Fundamentals of Web Scraping

> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites -- [Wikipedia](https://en.wikipedia.org/wiki/Web_scraping)

**How?**

- The [`rvest`](https://cran.r-project.org/package=rvest) R package provides various tools for reading and processing web data.
- Under the hood, `rvest` is a wrapper of the [`xml2`](https://cran.r-project.org/package=xml2) and [`httr`](https://cran.r-project.org/package=httr) R packages. (For [dynamic websites](https://en.wikipedia.org/wiki/Dynamic_web_page), take a look at [Selenium](https://en.wikipedia.org/wiki/Selenium_(software)).)

---

## Web scraping raw HTML: Example

We would like to capture the table of COVID-19 death rates per country directly from Wikipedia.

``` r
library(rvest)
library(xml2)

# Reading the HTML document with the function xml2::read_html
covid <- read_html(
  x = "https://en.wikipedia.org/wiki/COVID-19_pandemic_death_rates_by_country"
)

# Let's see the output
covid
```

```
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
```

---

## Web scraping raw HTML: Example

- We want to get the HTML table that shows up in the document. To do this, we can use the functions `xml2::xml_find_all()` and `rvest::html_table()`.
- The first will locate the place in the document that matches a given **XPath** expression.
- [XPath](https://en.wikipedia.org/wiki/XPath), the XML Path Language, is a query language for selecting nodes in an XML document.
- A nice tutorial can be found [here](https://www.w3schools.com/xml/xpath_intro.asp).
- Modern web browsers make it easy to find XPath expressions: inspect elements in [Google Chrome](https://developers.google.com/web/tools/chrome-devtools/open), [Firefox](https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Open_the_Inspector), [Edge/Explorer](https://docs.microsoft.com/en-us/microsoft-edge/devtools-guide-chromium/ie-mode), and [Safari](https://developer.apple.com/library/archive/documentation/NetworkingInternetWeb/Conceptual/Web_Inspector_Tutorial/EditingCode/EditingCode.html#//apple_ref/doc/uid/TP40017576-CH4-DontLinkElementID_25).

---

## Web scraping with `xml2` and the `rvest` package

Now that we know the XPath, let's use it to extract the table:

``` r
table <- xml2::xml_find_all(
  covid,
  xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/div[4]"
)
table <- rvest::html_table(table) # This returns a list of tables
head(table[[1]])
```

```
## # A tibble: 6 × 4
##   Country                `Deaths / million` Deaths    Cases      
##   <chr>                  <chr>              <chr>     <chr>      
## 1 World[a]               886                7,084,010 777,334,464
## 2 Peru                   6,601              220,994   4,528,708  
## 3 Bulgaria               5,678              38,764    1,338,277  
## 4 North Macedonia        5,428              9,990     352,060    
## 5 Bosnia and Herzegovina 5,118              16,404    404,142    
## 6 Hungary                5,072              49,122    2,237,583  
```

---

## Web APIs

**What?**

> A Web API is an application programming interface for either a web server or a web browser.
-- [Wikipedia](https://en.wikipedia.org/wiki/Web_API)

Some examples include: the [Twitter API](https://developer.twitter.com/en), the [Facebook API](https://developers.facebook.com/), and the [Gene Ontology API](http://api.geneontology.org/api).

**How?**

You can request data, the **GET method**, post data, the **POST method**, and do many other things using the [HTTP protocol](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol).

**How in R?**

We will be using the `httr` package, which is a wrapper of the `curl` package, which in turn provides access to the `curl` library used to communicate with APIs.

---

## Web APIs with curl

<div align="center">
  <img src="https://cdn.tutsplus.com/net/authors/jeremymcpeak/http1-url-structure.png" width="700px">
  <br>
  Structure of a URL (source: <a href="https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177" target="_blank">"HTTP: The Protocol Every Web Developer Must Know - Part 1"</a>)
</div>

---

## Web APIs with curl

Under the hood, `httr` (and thus `curl`) sends a request somewhat like this:

```bash
curl -X GET https://google.com -w "%{content_type}\n%{http_code}\n"
```

A GET request (`-X GET`) to `https://google.com`, which also writes out (`-w`) the following: `content_type` and `http_code`:

```html
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
text/html; charset=UTF-8
301
```

We use the `httr` R package to make life easier.

---

## Web API Example 1: Gene Ontology

- We will make use of the [Gene Ontology API](http://api.geneontology.org).
- We want to know what genes (human or not) are **involved in** the function **antiviral innate immune response** (GO term [GO:0140374](http://amigo.geneontology.org/amigo/term/GO:0140374)), looking only at those annotations that have evidence code [ECO:0000006](https://evidenceontology.org/browse/#ECO_0000006) (experimental evidence):

---

## Web API Example 1: Gene Ontology

``` r
library(httr)
go_query <- GET(
  url   = "http://api.geneontology.org/",
  path  = "api/bioentity/function/GO:0140374/genes",
  query = list(
    evidence          = "ECO:0000006",
    relationship_type = "involved_in"
  ),
  # May need to pass this option to curl so it waits at least 60 s before returning an error.
  config = config(
    connecttimeout = 60
  )
)
```

We could have also passed the full URL directly...
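
A minimal sketch of that alternative (same endpoint and query as above, just spelled out in the `url` argument):

``` r
# Equivalent request, with the path and the (URL-encoded) query in the URL itself
go_query_full <- GET(
  url = paste0(
    "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes",
    "?evidence=ECO%3A0000006&relationship_type=involved_in"
  ),
  config = config(connecttimeout = 60)
)
```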
--- ## Web API Example 1: Gene Ontology Let's take a look at the curl call: ```bash curl -X GET "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in" -H "accept: application/json" ``` What `httr::GET()` does: ```r > go_query$request ## <request> ## GET http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in ## Output: write_memory ## Options: ## * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1 ## * connecttimeout: 60 ## * httpget: TRUE ## Headers: ## * Accept: application/json, text/xml, application/xml, */* ``` --- ## Web API Example 1: Gene Ontology Let's take a look at the response: ``` ## Response [https://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in] ## Date: 2025-02-10 18:18 ## Status: 200 ## Content-Type: application/json ## Size: 110 kB ``` Remember the codes: - 1xx: Information message - 2xx: Success - 3xx: Redirection - 4xx: Client error - 5xx: Server error --- ## Web API Example 1: Gene Ontology We can extract the results using the `httr::content()` function ``` r dat <- content(go_query) dat <- lapply(dat$associations, function(a) { data.frame( Gene = a$subject$id, taxon_id = a$subject$taxon$id, taxon_label = a$subject$taxon$label ) }) dat <- do.call(rbind, dat) str(dat) ``` ``` ## 'data.frame': 100 obs. of 3 variables: ## $ Gene : chr "UniProtKB:F6T6A5" "UniProtKB:A0A2I3SSQ6" "UniProtKB:A0A2I3RMV7" "UniProtKB:E1BJZ0" ... ## $ taxon_id : chr "NCBITaxon:9796" "NCBITaxon:9598" "NCBITaxon:9598" "NCBITaxon:9913" ... ## $ taxon_label: chr "Equus caballus" "Pan troglodytes" "Pan troglodytes" "Bos taurus" ... ``` --- ## Web API Example 1: Gene Ontology The structure of the result will depend on the API. In this case, the output was a JSON file, so the content function returns a list in R. 
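
As a quick illustration of that point, `httr::content()` can also return the unparsed body via its `as` argument (a small sketch reusing the `go_query` object from before):

``` r
# Default: parse the JSON body into a nested R list
dat_parsed <- content(go_query)
length(dat_parsed$associations)

# Alternatively, ask for the raw JSON string and peek at its beginning
json_txt <- content(go_query, as = "text", encoding = "UTF-8")
substr(json_txt, 1, 80)
```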
In other scenarios it could return an XML object (we will see more in the lab) <table class="table lightable-paper lightable-hover" style='font-size: 16px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption style="font-size: initial !important;">Genes experimentally annotated with the function **antiviral innate immune response** (GO:0140374)</caption> <thead> <tr> <th style="text-align:left;"> Gene </th> <th style="text-align:left;"> taxon_id </th> <th style="text-align:left;"> taxon_label </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> UniProtKB:F6T6A5 </td> <td style="text-align:left;"> NCBITaxon:9796 </td> <td style="text-align:left;"> Equus caballus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A2I3SSQ6 </td> <td style="text-align:left;"> NCBITaxon:9598 </td> <td style="text-align:left;"> Pan troglodytes </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A2I3RMV7 </td> <td style="text-align:left;"> NCBITaxon:9598 </td> <td style="text-align:left;"> Pan troglodytes </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:E1BJZ0 </td> <td style="text-align:left;"> NCBITaxon:9913 </td> <td style="text-align:left;"> Bos taurus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:F6W4E1 </td> <td style="text-align:left;"> NCBITaxon:9796 </td> <td style="text-align:left;"> Equus caballus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A9J7MPN0 </td> <td style="text-align:left;"> NCBITaxon:7739 </td> <td style="text-align:left;"> Branchiostoma floridae </td> </tr> </tbody> </table> --- ## Web API Example 2: Using Tokens - Sometimes, APIs are not completely open, you need to register. - The API may require to login (user+password), or pass a token. - In this example, I'm using a token which I obtained [here](https://www.ncdc.noaa.gov/cdo-web/token) - You can find information about the [National Centers for Environmental Information](https://www.ncdc.noaa.gov/) API [here](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) --- ## Web API Example 2: Using Tokens - The way to pass the token will depend on the API service. - Some require authentication, others need you to pass it as an argument of the query, i.e., directly in the URL. - In this case, we pass it on the header. ``` r stations_api <- GET( url = "https://www.ncdc.noaa.gov", path = "cdo-web/api/v2/stations", config = add_headers( token = "AqvpIjnLVnPdXQrzCNJhjdFOTuIbOfyb" ), query = list(limit = 1000) ) ``` --- ## Web API Example 2: Using Tokens This is equivalent to using the following query ```bash curl --header "token: [YOUR TOKEN HERE]" \ https://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000 ``` **Note**: This won't run, you need to get your own token --- ## Web API Example 2: Using Tokens Again, we can recover the data using the `content()` function: ``` r ans <- content(stations_api) ans$results[[64]] ``` ``` ## $elevation ## [1] 136.6 ## ## $mindate ## [1] "1938-01-01" ## ## $maxdate ## [1] "2013-12-01" ## ## $latitude ## [1] 33.8463 ## ## $name ## [1] "CARBON HILL 4 SE, AL US" ## ## $datacoverage ## [1] 0.8596 ## ## $id ## [1] "COOP:011377" ## ## $elevationUnit ## [1] "METERS" ## ## $longitude ## [1] -87.4871 ``` --- ### Web API Example 3: HHS health recommendation Here we use the Department of Health and Human Services API for "[...] 
demographic-specific health recommendations" (details at [health.gov](https://health.gov/our-work/health-literacy/consumer-health-content/free-web-content/apis-developers/documentation)).

``` r
health_advice <- GET(
  url   = "https://health.gov/",
  path  = "myhealthfinder/api/v3/myhealthfinder.json",
  query = list(
    lang       = "en",
    age        = "32",
    sex        = "male",
    tobaccoUse = 0
  ),
  config = c(
    add_headers(accept = "application/json"),
    config(connecttimeout = 60)
  )
)
```

---

### Web API Example 3: HHS health recommendation

Let's see the response:

``` r
health_advice
```

```
## Response [https://odphp.health.gov/myhealthfinder/api/v3/myhealthfinder.json?lang=en&age=32&sex=male&tobaccoUse=0]
##   Date: 2025-02-10 18:18
##   Status: 200
##   Content-Type: application/json
##   Size: 322 kB
## {
##   "Result": {
##     "Error": "False",
##     "Total": 18,
##     "Query": {
##       "ApiVersion": "3",
##       "ApiType": "myhealthfinder",
##       "TopicId": null,
##       "ToolId": null,
##       "CategoryId": null,
## ...
```

---

### Web API Example 3: HHS health recommendation

``` r
# Extracting the content
health_advice_ans <- content(health_advice)

# Getting the titles
txt <- with(health_advice_ans$Result$Resources, c(
  sapply(all$Resource, "[[", "Title"),
  sapply(some$Resource, "[[", "Title"),
  sapply(`You may also be interested in these health topics:`$Resource, "[[", "Title")
))
cat(txt, sep = "; ")
```

---

### Web API Example 3: HHS health recommendation

Quit Smoking; Hepatitis C Screening: Questions for the Doctor; Protect Yourself from Seasonal Flu; Talk with Your Doctor About Depression; Get Your Blood Pressure Checked; Get Tested for HIV; Get Vaccines to Protect Your Health (Adults Ages 19 to 49 Years); Drink Alcohol Only in Moderation; Talk with Your Doctor About Drug Misuse and Substance Use Disorder; Aim for a Healthy Weight; Testing for Syphilis: Questions for the Doctor; Eat Healthy; Protect Yourself from Hepatitis B; Testing for Latent Tuberculosis: Questions for the Doctor; Manage Stress; Alcohol Use: Conversation Starters; Get Active; Quitting Smoking: Conversation Starters

---

## Summary

- We learned about regular expressions with the package **stringr**.
- We can use regular expressions to detect (`str_detect()`), replace (`str_replace()`), and extract (`str_extract()`) patterns in text.
- We looked at web scraping using the **rvest** package (a wrapper of **xml2**).
- We extracted elements from the HTML/XML using `xml_find_all()` with XPath expressions.

---

## Summary

- We also used the `html_table()` function from rvest to extract tables from HTML documents.
- We took a quick look at Web APIs and the Hypertext Transfer Protocol (HTTP).
- We used the **httr** R package (a wrapper of **curl**) to make `GET` requests to various APIs.
- We even showed an example using a token passed via a request header.
- Once we got the responses, we used the `content()` function to extract the message of the response.

---

## Detour on CURL options

Sometimes you will need to change the default set of options in CURL. You can check out the list of options in `curl::curl_options()`.
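
For example, a quick way to browse those options and spot the timeout-related ones (a small sketch):

``` r
# Named set of libcurl options that curl (and therefore httr) understands
opts <- curl::curl_options()
length(opts)

# Which of them mention a timeout?
grep("timeout", names(opts), value = TRUE)
```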
A common hack is to extend the time limit before the connection is dropped, e.g.:

Using the **Health IT** API from the US government, we can obtain the **Electronic Prescribing Adoption and Use by County** data (see the docs [here](https://dashboard.healthit.gov/datadashboard/documentation/electronic-prescribing-adoption-use-data-documentation-county.php)).

The problem is that it usually takes longer to get the data, so we pass the config option `connecttimeout` (which corresponds to the flag `--connect-timeout`) in the curl call (see the next slide).

---

## Detour on CURL options

``` r
ans <- httr::GET(
  url   = "https://dashboard.healthit.gov/api/open-api.php",
  query = list(
    source = "AHA_2008-2015.csv",
    region = "California",
    period = 2015
  ),
  config = config(
    connecttimeout = 60
  )
)
```

---

## Detour on CURL options

```r
> ans$request
# <request>
# GET https://dashboard.healthit.gov/api/open-api.php?source=AHA_2008-2015.csv&region=California&period=2015
# Output: write_memory
# Options:
# * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1
# * connecttimeout: 60
# * httpget: TRUE
# Headers:
# * Accept: application/json, text/xml, application/xml, */*
```

---

## Regular Expressions: Email validation

This is a regex that implements the email address grammar defined in [RFC 5322](http://www.ietf.org/rfc/rfc5322.txt):

```
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?
:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[
0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0
-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\
x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
```

See the corresponding post on [StackOverflow](https://stackoverflow.com/a/201378/2097171).
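
---

## Regular Expressions: Email validation

For everyday validation you rarely need the full RFC 5322 grammar; a short, deliberately permissive pattern already goes a long way. A minimal sketch (the simplified pattern and example addresses below are made up for illustration and are not part of the RFC):

``` r
# A simplified email pattern (NOT the full RFC 5322 grammar):
# a local part, an "@", a domain, a dot, and a TLD of 2+ letters
simple_email <- "^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$"

emails <- c("ada@lovelace.org", "not-an-email", "x@y.io", "a@b")
grepl(simple_email, emails)
# Expected: TRUE FALSE TRUE FALSE
```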