class: inverse, center, middle, title-slide

.title[
# Regular Expressions, Web Scraping, APIs
]

.subtitle[
## JSC 370: Data Science II
]

.date[
### February 10, 2025
]

---

<style type="text/css">
code.*, .remark-code, pre {
  font-size:15px;
}
body{
  font-family: Helvetica;
  font-size: 12pt;
}
p,h1,h2,h3,h4 {
  font-family: system-ui
}
.html-widget {
  margin: auto;
}
code.r{ /* Code block */
  font-size: 18px;
}
pre { /* Code block - determines code spacing between lines */
  font-size: 20px;
}
</style>

## Today's Goals

- Regular Expressions basics
- Using R for Web Scraping
- Interacting with APIs using R

---

## Regular Expressions: What are they?

> A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. -- [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

<div style="text-align: center;">
  <img src="https://imgs.xkcd.com/comics/regular_expressions.png" width="350px">
</div>

---

## Regular Expressions

We can use Regular Expressions for:

- Validating data fields, e.g., email addresses, numbers, etc.
- Searching text in various formats, e.g., addresses (there are many ways to write an address).
- Replacing text, e.g., different spellings, `Storm`, `Stôrm`, `Stórm` to `Storm`.
- Removing text, e.g., tags from an HTML text, `<name>George</name>` to `George`.

---

## Regular Expressions: Metacharacters

What makes *regex* special is **metacharacters**. While we can always use *regex* to match literals like `dog`, `human`, `1999`, we unlock the full power of *regex* when using metacharacters:

- **`.` (dot)** → Matches **any character** except a newline (`\n`).
  - Example: `c.t` matches `cat`, `cut`, `cot`, but not `ct`.
- **`^` (caret)** → Matches the **beginning** of a string.
  - Example: `^Hello` matches `"Hello world"`, but not `"Say Hello"`.
- **`$` (dollar sign)** → Matches the **end** of a string.
  - Example: `world$` matches `"Hello world"`, but not `"worldwide"`.

These metacharacters create dynamic and flexible search patterns beyond literal matching.

---

## Regular Expressions: Metacharacters

- `[regex]` Match a single character from the set `regex`, e.g.
  - `[0123456789]` Any single digit from 0 to 9
  - `[0-9]` Any digit in the range 0-9 (equivalent to the above)
  - `[a-z]` Lower-case letters
  - `[A-Z]` Upper-case letters
  - `[a-zA-Z]` Lower- or upper-case letters
  - `[a-zA-Z0-9]` Any alphanumeric character

---

## Regular Expressions: Metacharacters

Square brackets `[ ]` define a **character class**, matching a **single** character from the specified set:

- `[0-9]` → Matches any **single** digit from `0` to `9`.
  - Example: `4` matches in `42`.
- `[a-z]` → Matches any **lowercase letter** (`a` to `z`).
  - Example: `c` matches in `cat`.
- `[A-Z]` → Matches any **uppercase letter** (`A` to `Z`).
  - Example: `B` matches in `BIG`.

---

## Regular Expressions: Metacharacters

- `[a-zA-Z]` → Matches **any letter**, lowercase or uppercase.
  - Example: `X` matches in `eXample`.
- `[a-zA-Z0-9]` → Matches any **alphanumeric character** (letters and digits).
  - Example: `G5` matches in `Group5`.

> **Note:** Square brackets match only **one** character from the defined set. If multiple characters are needed, additional quantifiers (`+`, `*`, `{}`) must be used.

---

## Regular Expressions: Negated Metacharacters

The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** the characters inside the brackets.

- **`[^0-9]`** → Matches **any character except** digits (`0-9`).
  - Example: `@` matches, but `8` does not.
``` r grepl("[^0-9]", "Hello") # TRUE (matches 'H', 'e', etc.) grepl("[^0-9]", "12345") # FALSE (only contains digits) grepl("[^0-9]", "A1B2C") # TRUE (matches 'A', 'B', 'C') ``` ``` ## [1] TRUE ## [1] FALSE ## [1] TRUE ``` --- ## Regular Expressions: Negated Metacharacters The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** those inside the brackets. - **`[^a-z]`** → Matches **any character except** lowercase letters (`a-z`). - Example: `X` matches, but `m` does not. ``` r grepl("[^a-z]", "hello") # FALSE (all lowercase letters) grepl("[^a-z]", "Hello") # TRUE (matches 'H') grepl("[^a-z]", "1234") # TRUE (matches '1', '2', etc.) ``` ``` ## [1] FALSE ## [1] TRUE ## [1] TRUE ``` --- ## Regular Expressions: Negated Metacharacters The `[^ ]` notation in regular expressions defines a **negated class**, meaning it **matches anything except** those inside the brackets. - **`[^./ ]`** → Matches **any character except** a period (`.`), slash (`/`), or space. - Example: `A` matches, but `.` does not. ``` r grepl("[^./ ]", "Hello.") # TRUE (matches 'H', 'e', 'l', 'l', 'o') grepl("[^./ ]", " /.") # FALSE (contains only disallowed characters) grepl("[^./ ]", "A/B") # TRUE (matches 'A' and 'B') ``` ``` ## [1] TRUE ## [1] FALSE ## [1] TRUE ``` > **Note:** these match only a **single character** at a time, not multiple characters in sequence. --- ## Regular Expressions: Character Classes Ranges, e.g., `0-9` or `a-z`, are locale- and implementation-dependent, meaning that the range of lower case letters may vary depending on the OS's language. To solve for this problem, you could use [Character classes](https://en.wikipedia.org/wiki/Regular_expression#Character_classes). Some examples: - `[:lower:]` lower case letters in the current locale, could be `[a-z]` - `[:upper:]` upper case letters in the current locale, could be `[A-Z]` - `[:alpha:]` upper and lower case letters in the current locale, could be `[a-zA-Z]` - `[:digit:]` Digits: 0 1 2 3 4 5 6 7 8 9 - `[:alnum:]` Alpha numeric characters `[:alpha:]` and `[:digit:]`. - `[:punct:]` Punctuation characters: ! " \# $ % & ' ( ) * + , - . / : ; < = > ? @ [ \\ ] ^ _ ` \{ | \} ~. --- ## Regular Expressions: Character Classes In some languages and character encodings, regular expressions behave differently based on the locale setting of the system. This is especially important when working with Unicode characters beyond the basic ASCII set. For example, in the locale `en_US`, the word `Ḧóla` IS NOT fully matched by `[a-zA-Z]+` because: - `[a-zA-Z]` only includes ASCII characters (a-z and A-Z). - The letter `Ḧ` (with a diacritic) is not part of ASCII, so it is not matched. - `Ḧóla` is fully matched by `[[:alpha:]]+` because `[[:alpha:]]` is locale-aware and includes all alphabetic characters, including Unicode letters like `Ḧ`. --- ## Other important Metacharacters `\s` white space, matches any whitespace character including: - Spaces ( ) - Newlines (`\n`) - Carriage returns (`\r`) - Tabs (`\t`) - Form feeds (`\f`) - Vertical tabs (`\v`) --- ## Other important Metacharacters `\s` is equivalent to `[\r\n\t\f\v ]`, meaning it captures all these whitespace variations. ``` r grepl("\\s", "Hello World") # TRUE (space between words) grepl("\\s", "HelloWorld") # FALSE (no space) ``` --- ## Other important Metacharacters `|` or (logical or) matches either of the given patterns. 
``` r grepl("cat|dog", "I love cats") # TRUE (matches "cat") grepl("cat|dog", "I love dogs") # TRUE (matches "dog") grepl("cat|dog", "I love birds") # FALSE (neither word is present) ``` --- ## Regular Expressions: Repetition These usually come together with specifying how many times (repetition): - `regex?` Zero or one match. - `regex*` Zero or more matches - `regex+` One or more matches - `regex{n,}` At least `n` matches - `regex{,m}` at most `m` matches - `regex{n,m}` Between `n` and `m` matches. Where `regex` is a regular expression --- ## Regular Expressions: More operations There are other operators that can be very useful, - `(regex)` Group capture. - `(?:regex)` Group operation without capture. - `(?=regex)` Look ahead (match) - `(?!regex)` Look ahead (don't match) - `(?<=regex)` Look behind (match) - `(?<!regex)` Look behind (don't match) Group captures can be reused with `\1`, `\2`, ..., `\n`. More (great) information here https://regex101.com/ --- ## Regular Expressions: Examples Here we are extracting the first occurrence of the following regular expressions (using `stringr::str_extract()`): <table class="table lightable-paper lightable-hover" style='font-size: 15px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Row </th> <th style="text-align:left;"> regex </th> <th style="text-align:left;"> Hanna Perez [name] </th> <th style="text-align:left;"> The 年 year was 1999 </th> <th style="text-align:left;"> HaHa, @abc said that </th> <th style="text-align:left;"> GoGo Blues #2024! </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> .{5} </td> <td style="text-align:left;"> Hanna </td> <td style="text-align:left;"> The 年 </td> <td style="text-align:left;"> HaHa, </td> <td style="text-align:left;"> GoGo </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> n{2} </td> <td style="text-align:left;"> nn </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> [0-9]+ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 2024 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> \s[a-zA-Z]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> year </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> Blues </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> \s[[:alpha:]]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> 年 </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> Blues </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> [a-zA-Z]+ [a-zA-Z]+ </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> year was </td> <td style="text-align:left;"> abc said </td> <td style="text-align:left;"> GoGo Blues </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> ([a-zA-Z]+\s?){2} </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> The </td> <td style="text-align:left;"> HaHa </td> <td style="text-align:left;"> GoGo Blues </td> </tr> <tr> <td style="text-align:left;"> 8 
</td>
   <td style="text-align:left;"> ([a-zA-Z]+)\1 </td>
   <td style="text-align:left;"> nn </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> HaHa </td>
   <td style="text-align:left;"> GoGo </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> (@|#)[a-z0-9]+ </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> @abc </td>
   <td style="text-align:left;"> #2024 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 10 </td>
   <td style="text-align:left;"> (?&lt;=#|@)[a-z0-9]+ </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> abc </td>
   <td style="text-align:left;"> 2024 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 11 </td>
   <td style="text-align:left;"> \[[a-z]+\] </td>
   <td style="text-align:left;"> [name] </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;">  </td>
  </tr>
</tbody>
</table>

---

## Regular Expressions: Examples Explained

1) `.{5}` Match **any character** (except line end) five times.

2) `n{2}` Match the letter **n** twice.

3) `[0-9]+` Match **any number** at least once.

4) `\s[a-zA-Z]+\s` Match a **space**, **any lower or upper case letter** at least once, and a **space**.

5) `\s[[:alpha:]]+\s` Same as before, but using the locale-aware class, so non-ASCII letters such as `年` also match.

---

## Regular Expressions: Examples Explained (cont.)

6) `[a-zA-Z]+ [a-zA-Z]+` Match two sets of letters separated by one space.

7) `([a-zA-Z]+\s?){2}` Match **any lower or upper case letter** at least once, maybe followed by a white space, twice.

8) `([a-zA-Z]+)\1` Match **any lower or upper case letter** at least once, and then match the same text again (a back-reference).

9) `(@|#)[a-z0-9]+` Match either the `@` or `#` symbol, followed by one or more **lower case letters** or **numbers**.

10) `(?<=#|@)[a-z0-9]+` Match one or more **lower case letters** or **numbers** that follow either the `@` or `#` symbol.

11) `\[[a-z]+\]` Match the symbol `[`, at least one **lower case letter**, and the symbol `]`.

---

## Regular Expressions: Functions in R

- Lookup text: `base::grepl()`, `stringr::str_detect()`.

- Lookup matching indices (similar to `which()`, i.e., which elements are `TRUE`): `base::grep()`, `stringr::str_which()`.

- Replace the first instance: `base::sub()`, `stringr::str_replace()`.

- Replace all instances: `base::gsub()`, `stringr::str_replace_all()`.

- Extract text: `base::regmatches()`, `stringr::str_extract()` and `stringr::str_extract_all()`.
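
---

## Regular Expressions: Functions in R

To make the mapping concrete, here is a minimal sketch (the example strings and the phone-number-style pattern are made up for illustration):

``` r
library(stringr)

x <- c("call me at 555-1234", "no phone here", "office: 555-9876")

# Detect a phone-like pattern (base R vs stringr)
grepl("[0-9]{3}-[0-9]{4}", x)       # TRUE FALSE TRUE
str_detect(x, "[0-9]{3}-[0-9]{4}")  # TRUE FALSE TRUE

# Which elements match?
grep("[0-9]{3}-[0-9]{4}", x)        # 1 3

# Replace the first digit vs. every digit
sub("[0-9]", "X", x)
gsub("[0-9]", "X", x)

# Extract the first match
str_extract(x, "[0-9]{3}-[0-9]{4}") # "555-1234" NA "555-9876"
```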
---

## Regular Expressions: Functions in R

For example, let's create a regex that matches usernames or hashtags with the following pattern: `(@|#)([[:alnum:]]+)`

<table class="table lightable-paper lightable-hover" style='font-size: 17px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Code </th>
   <th style="text-align:left;"> @Hanna Perez [name] #html </th>
   <th style="text-align:left;"> The @年 year was 1999 </th>
   <th style="text-align:left;"> HaHa, @abc said that @z </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> str_detect(text, pattern) or grepl(pattern, text) </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_extract(text, pattern) </td>
   <td style="text-align:left;"> @Hanna </td>
   <td style="text-align:left;"> @年 </td>
   <td style="text-align:left;"> @abc </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_extract_all(text, pattern) </td>
   <td style="text-align:left;"> [@Hanna, #html] </td>
   <td style="text-align:left;"> [@年] </td>
   <td style="text-align:left;"> [@abc, @z] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_replace(text, pattern, "\1justinbieber") </td>
   <td style="text-align:left;"> @justinbieber Perez [name] #html </td>
   <td style="text-align:left;"> The @justinbieber year was 1999 </td>
   <td style="text-align:left;"> HaHa, @justinbieber said that @z </td>
  </tr>
  <tr>
   <td style="text-align:left;"> str_replace_all(text, pattern, "\1justinbieber") </td>
   <td style="text-align:left;"> @justinbieber Perez [name] #justinbieber </td>
   <td style="text-align:left;"> The @justinbieber year was 1999 </td>
   <td style="text-align:left;"> HaHa, @justinbieber said that @justinbieber </td>
  </tr>
</tbody>
</table>

**Note**: Although it does not show in the table, the group reference in the replacement was escaped in the actual code, i.e., `\\1` instead of `\1`.

---

## Regular Expressions: Functions in R

Here's the code

<img src="code.png" width="50%" style="display: block; margin: auto;" />

---

## Regular Expressions on Data

We will use a dataset consisting of medical transcriptions from https://www.mtsamples.com/. See the readme [here](https://github.com/JSC370/JSC370-2025/tree/main/data/medical_transcriptions).

The dataset consists of 4999 rows and 6 columns: an unnamed index column (read in as `V1`), "description", "medical_specialty", "sample_name", "transcription" and "keywords". Read it in as a `data.table`.

``` r
library(data.table)  # for fread() and data.table syntax
library(stringr)     # for str_extract() used below

fn <- "mtsamples.csv"
if (!file.exists(fn))
  download.file(
    # raw file URL (the /blob/ page would download HTML rather than the CSV itself)
    url = "https://raw.githubusercontent.com/JSC370/JSC370-2025/main/data/medical_transcriptions/mtsamples.csv",
    destfile = fn
  )
mtsamples <- fread(fn, sep = ",", header = TRUE)
names(mtsamples)
```

```
## [1] "V1"                "description"       "medical_specialty"
## [4] "sample_name"       "transcription"     "keywords"
```

---

## Regex to Lookup Text: Tumor

Let's search through the `description` column using `grepl()`, looking for the word *tumor*:

``` r
# How many entries contain the word tumor?
mtsamples[grepl("tumor", description, ignore.case = TRUE), .N]

# Generating a column tagging tumor
mtsamples[, tumor_related := grepl("tumor", description, ignore.case = TRUE)]

# Taking a look at a few examples
mtsamples[tumor_related == TRUE, .(description)][1:3,]
```

```
## [1] 67
##    description
##    <char>
## 1: Transurethral resection of a medium bladder tumor (TURBT), left lateral wall.
## 2: Transurethral resection of the bladder tumor (TURBT), large.
## 3: Cystoscopy, transurethral resection of medium bladder tumor (4.0 cm in diameter), and direct bladder biopsy.
```

Notice the `ignore.case = TRUE`. This is equivalent to transforming the text to lower case with `tolower()` before passing it to the regular expression function.

---

## Regex Lookup text: Pronoun of the patient

Now, let's try to guess the pronoun of the patient. To do so, we could tag the transcriptions using the words *he, his, him, they, them, theirs, ze, hir, hirs, she, hers, her*:

``` r
mtsamples[, pronoun := str_extract(
  string  = tolower(transcription),
  pattern = "he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her"
  )]
mtsamples[1:10, pronoun]
mtsamples[, table(pronoun, useNA = "always")]
```

```
##  [1] "his" "his" "his" "ze"  "he"  "he"  "he"  "he"  "he"  "ze" 
## pronoun
##   he him hir his she them ze <NA> 
## 2558   6  14 934  46   13 43   68
```

What is the problem with this approach?

---

## Regex Lookup text: Pronoun of the patient

For this we use the following regular expression:

`(?<=\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\W|$)`

Bit by bit this is:

- `(?<=regex)` look-behind search, i.e., the match must be preceded by
- `\W` any non-alphanumeric character (equivalent to `[^[:alnum:]]`),
- `|` or
- `^` the beginning of the text

---

## Regex Lookup text: Pronoun of the patient

- `he|his|him...` any of these words,
- `(?=regex)` look-ahead, i.e., followed by
- `\W` any non-alphanumeric character (equivalent to `[^[:alnum:]]`),
- `|` or
- `$` the end of the text.

---

## Regex Lookup text: Pronoun of the patient

``` r
mtsamples[, pronoun := str_extract(
  string  = tolower(transcription),
  pattern = "(?<=\\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\\W|$)"
  )]
mtsamples[1:10, pronoun]
```

```
##  [1] "she" "he"  "he"  NA    NA    "she" "she" NA    NA    NA
```

---

## Regex Lookup text: Pronoun of the patient

``` r
mtsamples[, table(pronoun, useNA = "always")]
```

```
## pronoun
##  he her him his she them they <NA> 
## 767 394  29 361 870   18   67 1176
```

---

## Regex Extract Text: Type of Cancer

- Imagine now that you need to see the types of cancer mentioned in the data.
- For simplicity, let's assume that, if specified, it is in the form of `TYPE cancer`, i.e., a single word followed by *cancer*.
- We are interested in the word before *cancer*; how can we capture this?

---

## Regex Extract Text: Type of Cancer

We can just try to **extract** the phrase `"[some word] cancer"`; in particular, we could use the following regular expression

`[[:alnum:]-_]{4,}\s*cancer`

Where

- `[[:alnum:]-_]{4,}` matches any alphanumeric character, plus `-` and `_`, repeated at least 4 times,
- `\s*` matches 0 or more whitespace characters, and
- `cancer` matches the literal word cancer.

---

## Regex Extract Text: Type of Cancer

``` r
mtsamples[, cancer_type := str_extract(tolower(keywords), "[[:alnum:]-_]{4,}\\s*cancer")]
mtsamples[, table(cancer_type)]
```

```
## cancer_type
##        anal cancer     bladder cancer      breast cancer       colon cancer 
##                  1                  6                 16                 12 
## endometrial cancer  esophageal cancer        lung cancer     ovarian cancer 
##                  5                  1                  8                  1 
##   papillary cancer    prostate cancer     uterine cancer 
##                  2                 14                  4
```

---

## Fundamentals of Web Scraping

> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites -- [Wikipedia](https://en.wikipedia.org/wiki/Web_scraping)

**How?**

- The [`rvest`](https://cran.r-project.org/package=rvest) R package provides various tools for reading and processing web data.
- Under the hood, `rvest` is a wrapper of the [`xml2`](https://cran.r-project.org/package=xml2) and [`httr`](https://cran.r-project.org/package=httr) R packages. (For [dynamic websites](https://en.wikipedia.org/wiki/Dynamic_web_page), take a look at [Selenium](https://en.wikipedia.org/wiki/Selenium_(software)).)

---

## Web scraping raw HTML: Example

We would like to capture the table of COVID-19 death rates per country directly from Wikipedia.

``` r
library(rvest)
library(xml2)

# Reading the HTML document with the function xml2::read_html
covid <- read_html(
  x = "https://en.wikipedia.org/wiki/COVID-19_pandemic_death_rates_by_country"
)

# Let's see the output
covid
```

```
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
```

---

## Web scraping raw HTML: Example

- We want to get the HTML table that shows up in the document. To do this, we can use the functions `xml2::xml_find_all()` and `rvest::html_table()`.
- The first will locate the place in the document that matches a given **XPath** expression.
- [XPath](https://en.wikipedia.org/wiki/XPath), the XML Path Language, is a query language for selecting nodes in an XML document.
- A nice tutorial can be found [here](https://www.w3schools.com/xml/xpath_intro.asp).
- Modern web browsers make it easy to find XPath expressions: inspect elements in [Google Chrome](https://developers.google.com/web/tools/chrome-devtools/open), [Firefox](https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Open_the_Inspector), [Edge/Explorer](https://docs.microsoft.com/en-us/microsoft-edge/devtools-guide-chromium/ie-mode), and [Safari](https://developer.apple.com/library/archive/documentation/NetworkingInternetWeb/Conceptual/Web_Inspector_Tutorial/EditingCode/EditingCode.html#//apple_ref/doc/uid/TP40017576-CH4-DontLinkElementID_25).

---

## Web scraping with `xml2` and the `rvest` package

Now that we know the XPath, let's use it to extract the table:

``` r
table <- xml2::xml_find_all(
  covid,
  xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/div[4]"
)
table <- rvest::html_table(table) # This returns a list of tables
head(table[[1]])
```

```
## # A tibble: 6 × 4
##   Country                `Deaths / million` Deaths    Cases      
##   <chr>                  <chr>              <chr>     <chr>      
## 1 World[a]               886                7,084,010 777,334,464
## 2 Peru                   6,601              220,994   4,528,708  
## 3 Bulgaria               5,678              38,764    1,338,277  
## 4 North Macedonia        5,428              9,990     352,060    
## 5 Bosnia and Herzegovina 5,118              16,404    404,142    
## 6 Hungary                5,072              49,122    2,237,583  
```

---

## Web APIs

**What?**

> A Web API is an application programming interface for either a web server or a web browser.
-- [Wikipedia](https://en.wikipedia.org/wiki/Web_API)

Some examples include: the [Twitter API](https://developer.twitter.com/en), the [Facebook API](https://developers.facebook.com/), and the [Gene Ontology API](http://api.geneontology.org/api).

**How?**

You can request data, the **GET method**, post data, the **POST method**, and do many other things using the [HTTP protocol](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol).

**How in R?**

We will be using the `httr` package, which is a wrapper of the `curl` package, which in turn provides access to the `curl` library used to communicate with APIs.

---

## Web APIs with curl

<div align="center">
  <img src="https://cdn.tutsplus.com/net/authors/jeremymcpeak/http1-url-structure.png" width="700px">
  <br>
  Structure of a URL (source: <a href="https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177" target="_blank">"HTTP: The Protocol Every Web Developer Must Know - Part 1"</a>)
</div>

---

## Web APIs with curl

Under the hood, `httr` (and thus `curl`) sends a request somewhat like this:

```bash
curl -X GET https://google.com -w "%{content_type}\n%{http_code}\n"
```

A GET request (`-X GET`) to `https://google.com`, which also writes out (`-w`) the following: `content_type` and `http_code`:

```html
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
text/html; charset=UTF-8
301
```

We use the `httr` R package to make life easier.

---

## Web API Example 1: Gene Ontology

- We will make use of the [Gene Ontology API](http://api.geneontology.org).
- We want to know what genes (human or not) are **involved in** the function **antiviral innate immune response** (GO term [GO:0140374](http://amigo.geneontology.org/amigo/term/GO:0140374)), looking only at those annotations that have evidence code [ECO:0000006](https://evidenceontology.org/browse/#ECO_0000006) (experimental evidence):

---

## Web API Example 1: Gene Ontology

``` r
library(httr)
go_query <- GET(
  url   = "http://api.geneontology.org/",
  path  = "api/bioentity/function/GO:0140374/genes",
  query = list(
    evidence          = "ECO:0000006",
    relationship_type = "involved_in"
  ),
  # May need to pass this option to curl so it waits at least 60 s before returning an error.
  config = config(
    connecttimeout = 60
  )
)
```

We could have also passed the full URL directly...
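
A minimal sketch of that alternative (same endpoint and query as above, just spelled out in the `url` argument):

``` r
# Equivalent request, with the path and the (URL-encoded) query in the URL itself
go_query_full <- GET(
  url = paste0(
    "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes",
    "?evidence=ECO%3A0000006&relationship_type=involved_in"
  ),
  config = config(connecttimeout = 60)
)
```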
--- ## Web API Example 1: Gene Ontology Let's take a look at the curl call: ```bash curl -X GET "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in" -H "accept: application/json" ``` What `httr::GET()` does: ```r > go_query$request ## <request> ## GET http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in ## Output: write_memory ## Options: ## * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1 ## * connecttimeout: 60 ## * httpget: TRUE ## Headers: ## * Accept: application/json, text/xml, application/xml, */* ``` --- ## Web API Example 1: Gene Ontology Let's take a look at the response: ``` ## Response [https://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in] ## Date: 2025-02-10 18:18 ## Status: 200 ## Content-Type: application/json ## Size: 110 kB ``` Remember the codes: - 1xx: Information message - 2xx: Success - 3xx: Redirection - 4xx: Client error - 5xx: Server error --- ## Web API Example 1: Gene Ontology We can extract the results using the `httr::content()` function ``` r dat <- content(go_query) dat <- lapply(dat$associations, function(a) { data.frame( Gene = a$subject$id, taxon_id = a$subject$taxon$id, taxon_label = a$subject$taxon$label ) }) dat <- do.call(rbind, dat) str(dat) ``` ``` ## 'data.frame': 100 obs. of 3 variables: ## $ Gene : chr "UniProtKB:F6T6A5" "UniProtKB:A0A2I3SSQ6" "UniProtKB:A0A2I3RMV7" "UniProtKB:E1BJZ0" ... ## $ taxon_id : chr "NCBITaxon:9796" "NCBITaxon:9598" "NCBITaxon:9598" "NCBITaxon:9913" ... ## $ taxon_label: chr "Equus caballus" "Pan troglodytes" "Pan troglodytes" "Bos taurus" ... ``` --- ## Web API Example 1: Gene Ontology The structure of the result will depend on the API. In this case, the output was a JSON file, so the content function returns a list in R. 
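
As a quick illustration of that point, `httr::content()` can also return the unparsed body via its `as` argument (a small sketch reusing the `go_query` object from before):

``` r
# Default: parse the JSON body into a nested R list
dat_parsed <- content(go_query)
length(dat_parsed$associations)

# Alternatively, ask for the raw JSON string and peek at its beginning
json_txt <- content(go_query, as = "text", encoding = "UTF-8")
substr(json_txt, 1, 80)
```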
In other scenarios it could return an XML object (we will see more in the lab) <table class="table lightable-paper lightable-hover" style='font-size: 16px; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption style="font-size: initial !important;">Genes experimentally annotated with the function **antiviral innate immune response** (GO:0140374)</caption> <thead> <tr> <th style="text-align:left;"> Gene </th> <th style="text-align:left;"> taxon_id </th> <th style="text-align:left;"> taxon_label </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> UniProtKB:F6T6A5 </td> <td style="text-align:left;"> NCBITaxon:9796 </td> <td style="text-align:left;"> Equus caballus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A2I3SSQ6 </td> <td style="text-align:left;"> NCBITaxon:9598 </td> <td style="text-align:left;"> Pan troglodytes </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A2I3RMV7 </td> <td style="text-align:left;"> NCBITaxon:9598 </td> <td style="text-align:left;"> Pan troglodytes </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:E1BJZ0 </td> <td style="text-align:left;"> NCBITaxon:9913 </td> <td style="text-align:left;"> Bos taurus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:F6W4E1 </td> <td style="text-align:left;"> NCBITaxon:9796 </td> <td style="text-align:left;"> Equus caballus </td> </tr> <tr> <td style="text-align:left;"> UniProtKB:A0A9J7MPN0 </td> <td style="text-align:left;"> NCBITaxon:7739 </td> <td style="text-align:left;"> Branchiostoma floridae </td> </tr> </tbody> </table> --- ## Web API Example 2: Using Tokens - Sometimes, APIs are not completely open, you need to register. - The API may require to login (user+password), or pass a token. - In this example, I'm using a token which I obtained [here](https://www.ncdc.noaa.gov/cdo-web/token) - You can find information about the [National Centers for Environmental Information](https://www.ncdc.noaa.gov/) API [here](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) --- ## Web API Example 2: Using Tokens - The way to pass the token will depend on the API service. - Some require authentication, others need you to pass it as an argument of the query, i.e., directly in the URL. - In this case, we pass it on the header. ``` r stations_api <- GET( url = "https://www.ncdc.noaa.gov", path = "cdo-web/api/v2/stations", config = add_headers( token = "AqvpIjnLVnPdXQrzCNJhjdFOTuIbOfyb" ), query = list(limit = 1000) ) ``` --- ## Web API Example 2: Using Tokens This is equivalent to using the following query ```bash curl --header "token: [YOUR TOKEN HERE]" \ https://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000 ``` **Note**: This won't run, you need to get your own token --- ## Web API Example 2: Using Tokens Again, we can recover the data using the `content()` function: ``` r ans <- content(stations_api) ans$results[[64]] ``` ``` ## $elevation ## [1] 136.6 ## ## $mindate ## [1] "1938-01-01" ## ## $maxdate ## [1] "2013-12-01" ## ## $latitude ## [1] 33.8463 ## ## $name ## [1] "CARBON HILL 4 SE, AL US" ## ## $datacoverage ## [1] 0.8596 ## ## $id ## [1] "COOP:011377" ## ## $elevationUnit ## [1] "METERS" ## ## $longitude ## [1] -87.4871 ``` --- ### Web API Example 3: HHS health recommendation Here we use the Department of Health and Human Services API for "[...] 
demographic-specific health recommendations" (details at [health.gov](https://health.gov/our-work/health-literacy/consumer-health-content/free-web-content/apis-developers/documentation)).

``` r
health_advice <- GET(
  url   = "https://health.gov/",
  path  = "myhealthfinder/api/v3/myhealthfinder.json",
  query = list(
    lang       = "en",
    age        = "32",
    sex        = "male",
    tobaccoUse = 0
  ),
  config = c(
    add_headers(accept = "application/json"),
    config(connecttimeout = 60)
  )
)
```

---

### Web API Example 3: HHS health recommendation

Let's see the response:

``` r
health_advice
```

```
## Response [https://odphp.health.gov/myhealthfinder/api/v3/myhealthfinder.json?lang=en&age=32&sex=male&tobaccoUse=0]
##   Date: 2025-02-10 18:18
##   Status: 200
##   Content-Type: application/json
##   Size: 322 kB
## {
##   "Result": {
##     "Error": "False",
##     "Total": 18,
##     "Query": {
##       "ApiVersion": "3",
##       "ApiType": "myhealthfinder",
##       "TopicId": null,
##       "ToolId": null,
##       "CategoryId": null,
## ...
```

---

### Web API Example 3: HHS health recommendation

``` r
# Extracting the content
health_advice_ans <- content(health_advice)

# Getting the titles
txt <- with(health_advice_ans$Result$Resources, c(
  sapply(all$Resource, "[[", "Title"),
  sapply(some$Resource, "[[", "Title"),
  sapply(`You may also be interested in these health topics:`$Resource, "[[", "Title")
))
cat(txt, sep = "; ")
```

---

### Web API Example 3: HHS health recommendation

Quit Smoking; Hepatitis C Screening: Questions for the Doctor; Protect Yourself from Seasonal Flu; Talk with Your Doctor About Depression; Get Your Blood Pressure Checked; Get Tested for HIV; Get Vaccines to Protect Your Health (Adults Ages 19 to 49 Years); Drink Alcohol Only in Moderation; Talk with Your Doctor About Drug Misuse and Substance Use Disorder; Aim for a Healthy Weight; Testing for Syphilis: Questions for the Doctor; Eat Healthy; Protect Yourself from Hepatitis B; Testing for Latent Tuberculosis: Questions for the Doctor; Manage Stress; Alcohol Use: Conversation Starters; Get Active; Quitting Smoking: Conversation Starters

---

## Summary

- We learned about regular expressions with the package **stringr**.
- We can use regular expressions to detect (`str_detect()`), replace (`str_replace()`), and extract (`str_extract()`) patterns in text.
- We looked at web scraping using the **rvest** package (a wrapper of **xml2**).
- We extracted elements from the HTML/XML using `xml_find_all()` with XPath expressions.

---

## Summary

- We also used the `html_table()` function from rvest to extract tables from HTML documents.
- We took a quick look at Web APIs and the Hypertext Transfer Protocol (HTTP).
- We used the **httr** R package (a wrapper of **curl**) to make `GET` requests to various APIs.
- We even showed an example using a token passed via a request header.
- Once we got the responses, we used the `content()` function to extract the message of the response.

---

## Detour on CURL options

Sometimes you will need to change the default set of options in CURL. You can check out the list of options in `curl::curl_options()`.
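
For example, a quick way to browse those options and spot the timeout-related ones (a small sketch):

``` r
# Named set of libcurl options that curl (and therefore httr) understands
opts <- curl::curl_options()
length(opts)

# Which of them mention a timeout?
grep("timeout", names(opts), value = TRUE)
```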
A common hack is to extend the time limit before the connection is dropped, e.g.:

Using the **Health IT** API from the US government, we can obtain the **Electronic Prescribing Adoption and Use by County** data (see the docs [here](https://dashboard.healthit.gov/datadashboard/documentation/electronic-prescribing-adoption-use-data-documentation-county.php)).

The problem is that it usually takes longer to get the data, so we pass the config option `connecttimeout` (which corresponds to the flag `--connect-timeout`) in the curl call (see the next slide).

---

## Detour on CURL options

``` r
ans <- httr::GET(
  url   = "https://dashboard.healthit.gov/api/open-api.php",
  query = list(
    source = "AHA_2008-2015.csv",
    region = "California",
    period = 2015
  ),
  config = config(
    connecttimeout = 60
  )
)
```

---

## Detour on CURL options

```r
> ans$request
# <request>
# GET https://dashboard.healthit.gov/api/open-api.php?source=AHA_2008-2015.csv&region=California&period=2015
# Output: write_memory
# Options:
# * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1
# * connecttimeout: 60
# * httpget: TRUE
# Headers:
# * Accept: application/json, text/xml, application/xml, */*
```

---

## Regular Expressions: Email validation

This is a regex that implements the email address grammar defined in [RFC 5322](http://www.ietf.org/rfc/rfc5322.txt):

```
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?
:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[
0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0
-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\
x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
```

See the corresponding post on [StackOverflow](https://stackoverflow.com/a/201378/2097171).
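
---

## Regular Expressions: Email validation

For everyday validation you rarely need the full RFC 5322 grammar; a short, deliberately permissive pattern already goes a long way. A minimal sketch (the simplified pattern and example addresses below are made up for illustration and are not part of the RFC):

``` r
# A simplified email pattern (NOT the full RFC 5322 grammar):
# a local part, an "@", a domain, a dot, and a TLD of 2+ letters
simple_email <- "^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$"

emails <- c("ada@lovelace.org", "not-an-email", "x@y.io", "a@b")
grepl(simple_email, emails)
# Expected: TRUE FALSE TRUE FALSE
```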