1 Downloading Files and Using API Clients

1.1 Introduction: Working With Web Data in R

1.1.1 Downloading files and reading them into R

In this first exercise we’re going to look at reading already-formatted datasets - CSV or TSV files, with which you’ll no doubt be familiar! - into R from the internet. This is a lot easier than it might sound because R’s file-reading functions accept not just file paths, but also URLs.

# Here are the URLs! As you can see they're just normal strings
csv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv"
tsv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_3026/datasets/tsv_data.tsv"

# Read a file in from the CSV URL and assign it to csv_data
csv_data <- read.csv(csv_url)

# Read a file in from the TSV URL and assign it to tsv_data
tsv_data <- read.delim(tsv_url)

# Examine the objects with head()
head(csv_data)
head(tsv_data)

1.1.2 Saving raw files to disk

Sometimes just reading the file in from the web is enough, but often you’ll want to store it locally so that you can refer back to it. This also lets you avoid having to spend the start of every analysis session twiddling your thumbs while particularly large files download.

Helpfully, R has download.file(), a function that lets you do just that: download a file to a location of your choice on your computer. Its two main arguments are url, the URL to read from, and destfile, the destination to write the downloaded file to. In this case, we’ve pre-defined the URL - once again, it’s csv_url.

# Download the file with download.file()
download.file(url = csv_url, destfile = "feed_data.csv")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv'
Content type '' length 1053 bytes
downloaded 1053 bytes
# Read it in with read.csv()
csv_data <- read.csv("feed_data.csv")
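
One way to avoid repeated downloads is to fetch the file only when no local copy exists yet. The sketch below is a minimal illustration rather than part of the original exercise: read_cached_csv() and its download argument are hypothetical names introduced here, and the function simply wraps download.file() and read.csv().

```r
# Hypothetical helper (not from the course): download a CSV only if it is
# not already on disk, then read the local copy.
# `download` defaults to download.file(); it is an argument only so the
# download step can be swapped out.
read_cached_csv <- function(url, destfile, download = download.file) {
  if (!file.exists(destfile)) {
    download(url = url, destfile = destfile)
  }
  read.csv(destfile)
}
```

Calling read_cached_csv(csv_url, "feed_data.csv") a second time should skip the download and read straight from disk.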

1.1.3 Saving formatted files to disk

Whether you’re downloading the raw files with download.file() or using read.csv() and its sibling functions, at some point you’re probably going to find the need to modify your input data, and then save the modified data to disk so you don’t lose the changes.

You could use write.table(), but then you have to worry about accidentally writing out data in a format R can’t read back in. An easy way to avoid this risk is to use saveRDS() and readRDS(), which save and load R objects in an R-specific file format, with the data structure intact. That means you can use them for any type of R object (even ones that don’t turn into tables easily), and not worry you’ll lose data reading it back in. saveRDS() takes two arguments: object, the R object to save, and file, the path to save it to. readRDS() expects file, the path to the RDS file to read in.

# Add a new column: square_weight
csv_data$square_weight <- csv_data$weight^2

# Save it to disk with saveRDS()
saveRDS(object = csv_data, file = "modified_feed_data.RDS")

# Read it back in with readRDS()
modified_feed_data <- readRDS(file = "modified_feed_data.RDS")

# Examine modified_feed_data
str(modified_feed_data)
'data.frame':   71 obs. of  3 variables:
 $ weight       : int  179 160 136 227 217 168 108 124 143 140 ...
 $ feed         : chr  "horsebean" "horsebean" "horsebean" "horsebean" ...
 $ square_weight: num  32041 25600 18496 51529 47089 ...
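
To see why the RDS format is the safer choice, it can help to compare a round trip through CSV with a round trip through RDS. This is a small sketch with made-up data (the df object and the temporary file paths are assumptions for illustration only).

```r
# Made-up example data: a Date column and a factor column
df <- data.frame(
  day  = as.Date(c("2020-01-01", "2020-01-02")),
  feed = factor(c("horsebean", "linseed"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Write the same object out both ways
write.csv(df, csv_path, row.names = FALSE)
saveRDS(object = df, file = rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(file = rds_path)

class(from_csv$day)      # no longer "Date": the CSV only stored text
class(from_rds$day)      # still "Date"
identical(df, from_rds)  # the RDS round trip returns the same object
```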

1.2 Understanding Application Programming Interfaces

1.2.1 Using API clients

So we know that APIs are server components that make it easy for your code to interact with a service and get data from it. We also know that R features many “clients” - packages that wrap around connections to APIs so you don’t have to worry about the details.

Let’s look at a really simple API client - the pageviews package, which acts as a client to Wikipedia’s API of pageview data. As with other R API clients, it’s formatted as a package, and lives on CRAN - the central repository of R packages. The goal here is just to show how simple clients are to use: they look just like other R code, because they are just like other R code.

# Load pageviews
library(pageviews)
package 'pageviews' was built under R version 4.0.3
# Get the pageviews for "Hadley Wickham"
hadley_pageviews <- article_pageviews(project = "en.wikipedia", "Hadley Wickham")

# Examine the resulting object
str(hadley_pageviews)
'data.frame':   1 obs. of  8 variables:
 $ project    : chr "wikipedia"
 $ language   : chr "en"
 $ article    : chr "Hadley_Wickham"
 $ access     : chr "all-access"
 $ agent      : chr "all-agents"
 $ granularity: chr "daily"
 $ date       : POSIXct, format: "2015-10-01"
 $ views      : num 53

Access tokens and APIs

1.2.2 Using access tokens

As we discussed in the last video, it’s common for APIs to require access tokens - unique keys that verify you’re authorised to use a service. They’re usually pretty easy to use with an API client.

To show how they work, and how easy it can be, we’re going to use the R client for the Wordnik dictionary and word use service - ‘birdnik’ - and an API token we prepared earlier. Birdnik is fairly simple and lets you get all sorts of interesting information about word usage in published works. For example, to get the frequency of the use of the word “chocolate”, you would write:

word_frequency(api_key, "chocolate")

# Load birdnik
library(birdnik)
library(httr)

# Get the word frequency for "vector", using api_key to access it
vector_frequency <- word_frequency(api_key, "vector")
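
The code above assumes an api_key object already exists. One common way to supply a key without pasting it into your script is an environment variable. This is only a sketch: the variable name WORDNIK_API_KEY and the helper get_api_key() are assumptions made for illustration, not something birdnik defines.

```r
# Hypothetical helper: read an API key from an environment variable and
# fail early with a clear message if it has not been set.
get_api_key <- function(var = "WORDNIK_API_KEY") {
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("Set the ", var, " environment variable first")
  }
  key
}
```

With the variable set (for example in your .Renviron file), api_key <- get_api_key() gives you the object that word_frequency() expects as its first argument.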

2 Using httr to interact with APIs directly

2.1 GET and POST requests in theory

2.1.1 GET requests in practice

To start with you’re going to make a GET request. As discussed in the video, this is a request that asks the server to give you a particular piece of data or content (usually specified in the URL). These make up the majority of the requests you’ll make in a data science context, since most of the time you’ll be getting data from servers, not giving it to them.

To do this you’ll use the httr package, written by Hadley Wickham (of course), which makes HTTP requests extremely easy. You’re going to make a very simple GET request, and then inspect the output to see what it looks like.

# Load the httr package
library(httr)

# Make a GET request to http://httpbin.org/get
get_result <- GET("http://httpbin.org/get")

# Print it to inspect it
get_result
Response [http://httpbin.org/get]
  Date: 2020-11-19 17:39
  Status: 200
  Content-Type: application/json
  Size: 364 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip", 
    "Host": "httpbin.org", 
    "User-Agent": "libcurl/7.64.1 r-curl/4.3 httr/1.4.2", 
    "X-Amzn-Trace-Id": "Root=1-5fb6adee-31fe36137bf1bb7a03afc3ea"
  }, 
  "origin": "92.18.64.211", 
...

As you can see from inspecting the output, there are a lot of parts to an HTTP response.

2.1.2 POST requests in practice

Next we’ll look at POST requests, also made through httr, with the function (you’ve guessed it) POST(). Rather than asking the server to give you something, as in GET requests, a POST request asks the server to accept something from you. They’re commonly used for things like file upload, or authentication. As a result of their use for uploading things, POST() accepts not just a url but also a body argument containing whatever you want to give to the server.

You’ll make a very simple POST request, just uploading a piece of text, and then inspect the output to see what it looks like.

# Load the httr package
library(httr)

# Make a POST request to http://httpbin.org/post with the body "this is a test"
post_result <- POST("http://httpbin.org/post", body = "this is a test")

# Print it to inspect it
post_result
Response [http://httpbin.org/post]
  Date: 2020-11-19 17:40
  Status: 200
  Content-Type: application/json
  Size: 471 B
{
  "args": {}, 
  "data": "this is a test", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip", 
    "Content-Length": "14", 
    "Host": "httpbin.org", 
...

The output for POST requests looks pretty similar to that for GET requests, although (in this case) the body of your message, "this is a test", is echoed back in the data field.

2.1.3 Extracting the response

Making requests is all well and good, but it’s also not why you’re here. What we really want to do is get the data the server sent back, which can be done with httr’s content() function. You pass it an object returned from a GET (or POST, or DELETE, or…) call, and it spits out whatever the server actually sent in an R-compatible structure.

We’re going to demonstrate that now, using a slightly more complicated URL than before - in fact, using a URL from the Wikimedia pageviews system you dealt with through the pageviews package, which is stored as url. Without looking too much at the structure for the time being (we’ll get to that later) this request asks for the number of pageviews to the English-language Wikipedia’s “Hadley Wickham” article on 1 and 2 January 2017.

url <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Hadley_Wickham/daily/20170101/20170102"

# Make a GET request to url and save the results
pageview_response <- GET(url)

# Call content() to retrieve the data the server sent back
pageview_data <- content(pageview_response)

# Examine the results with str()
str(pageview_data)
List of 1
 $ items:List of 2
  ..$ :List of 7
  .. ..$ project    : chr "en.wikipedia"
  .. ..$ article    : chr "Hadley_Wickham"
  .. ..$ granularity: chr "daily"
  .. ..$ timestamp  : chr "2017010100"
  .. ..$ access     : chr "all-access"
  .. ..$ agent      : chr "all-agents"
  .. ..$ views      : int 45
  ..$ :List of 7
  .. ..$ project    : chr "en.wikipedia"
  .. ..$ article    : chr "Hadley_Wickham"
  .. ..$ granularity: chr "daily"
  .. ..$ timestamp  : chr "2017010200"
  .. ..$ access     : chr "all-access"
  .. ..$ agent      : chr "all-agents"
  .. ..$ views      : int 86

As you can see, the result of extracting the content is a list, which is pretty common (but not universal) for API responses.
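
A nested list like this is usually easier to work with once it has been flattened. The sketch below retypes the parsed content shown above by hand so that it runs without a network connection, then pulls the fields of interest into a data frame using base R. The object names views, stamps, dates and pageviews_df are introduced here for illustration.

```r
# The parsed content from above, retyped by hand so this runs offline
pageview_data <- list(items = list(
  list(project = "en.wikipedia", article = "Hadley_Wickham",
       granularity = "daily", timestamp = "2017010100",
       access = "all-access", agent = "all-agents", views = 45L),
  list(project = "en.wikipedia", article = "Hadley_Wickham",
       granularity = "daily", timestamp = "2017010200",
       access = "all-access", agent = "all-agents", views = 86L)
))

# Pull one field out of every item
views  <- vapply(pageview_data$items, function(item) item$views, integer(1))
stamps <- vapply(pageview_data$items, function(item) item$timestamp, character(1))

# Timestamps look like YYYYMMDDHH; keep the date part only
dates <- as.Date(substr(stamps, 1, 8), format = "%Y%m%d")

pageviews_df <- data.frame(date = dates, views = views)
pageviews_df
```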

2.2 Graceful httr

2.2.1 Handling HTTP failures

As mentioned, HTTP calls can go wrong. Handling that can be done with httr’s http_error() function, which identifies whether a server response contains an error. If the response does contain an error, calling http_error() on the response will produce TRUE; otherwise, FALSE. You can use this for really fine-grained control over results. For example, you could check whether the response contained an error, and (if so) issue a warning and re-try the request.

For now we’ll try something a bit simpler - issuing a warning that something went wrong if http_error() returns TRUE, and printing the content if it doesn’t.

fake_url <- "http://google.com/fakepagethatdoesnotexist"

# Make the GET request
request_result <- GET(fake_url)

# Check request_result
if(http_error(request_result)){
    warning("The request failed")
} else {
    content(request_result)
}
The request failed

Error handling is really important for writing robust code, and it looks like you’ve got a good handle on it.
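
The warn-and-retry idea mentioned at the start of this section could look something like the sketch below. get_with_retry() is a hypothetical helper, not part of httr, and its arguments (max_tries, pause, request) are assumptions chosen for illustration. httr also ships its own RETRY() function, which may be a better fit in real code.

```r
library(httr)

# Hypothetical helper (not part of httr): retry a GET a few times.
# `request` defaults to httr's GET() and is an argument only so that a
# different function can be substituted.
get_with_retry <- function(url, max_tries = 3, pause = 1, request = GET) {
  for (attempt in seq_len(max_tries)) {
    response <- request(url)
    if (!http_error(response)) {
      return(response)
    }
    warning("Attempt ", attempt, " failed with status ", status_code(response))
    # Wait before trying again, but not after the final attempt
    if (attempt < max_tries) Sys.sleep(pause)
  }
  stop("The request failed after ", max_tries, " attempts")
}
```

A call such as get_with_retry(fake_url) would warn once per failed attempt and then stop with an error, since that page never exists.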

2.2.2 Constructing queries (Part I)

As briefly discussed in the previous video, the actual API query (which tells the API what you want to do) tends to be in one of two forms. The first is directory-based, where values are separated by / marks within the URL. The second is parameter-based, where all the values exist at the end of the URL and take the form of key=value.

Constructing directory-based URLs can be done via paste(), which takes an unlimited number of strings, along with a separator, as sep. So to construct http://swapi.co/api/vehicles/12 we’d call:

paste("http://swapi.co", "api", "vehicles", "12", sep = "/")

We can play with SWAPI, mentioned above, which is an API chock full of Star Wars data. This time, rather than a vehicle, we’ll look for a person.

# Construct a directory-based API URL to `http://swapi.co/api`,
# looking for person `1` in `people`
directory_url <- paste("http://swapi.co/api", "people", 1, sep = "/")

# Make a GET call with it
result <- GET(directory_url)

2.2.3 Constructing queries (Part II)

As mentioned (albeit briefly) in the last exercise, there are also parameter-based URLs, where all the query values exist at the end of the URL and take the form of key=value - they look something like http://fakeurl.com/foo.php?country=spain&food=goulash.

Constructing parameter-based URLs can also be done with paste(), but the easiest way to do it is with GET() and POST() themselves, which accept a query argument consisting of a list of keys and values. So, to continue with the food-based examples, we could construct fakeurl.com/api.php?fruit=peaches&day=thursday with:

GET("fakeurl.com/api.php", query = list(fruit = "peaches", day = "thursday"))

# Create list with nationality and country elements
query_params <- list(nationality = "americans", 
    country = "antigua")
    
# Make parameter-based call to httpbin, with query_params
parameter_response <- GET("https://httpbin.org/get", query = query_params)

# Print parameter_response
parameter_response
Response [https://httpbin.org/get?nationality=americans&country=antigua]
  Date: 2020-11-19 17:43
  Status: 200
  Content-Type: application/json
  Size: 464 B
{
  "args": {
    "country": "antigua", 
    "nationality": "americans"
  }, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip", 
    "Host": "httpbin.org", 
    "User-Agent": "libcurl/7.64.1 r-curl/4.3 httr/1.4.2", 
...

Did you notice the parameters you passed in were listed in the args of the response?
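
If you want to check what URL a query list produces without sending a request, httr’s modify_url() builds the string for you. A minimal sketch follows; the second call uses a made-up search term simply to show that values containing spaces get percent-encoded.

```r
library(httr)

query_params <- list(nationality = "americans", country = "antigua")

# Build the URL without sending a request
built_url <- modify_url("https://httpbin.org/get", query = query_params)
built_url

# Values are percent-encoded for you; the search term is made up
encoded_url <- modify_url("https://httpbin.org/get", query = list(q = "hadley wickham"))
encoded_url
```

The first result should match the URL shown at the top of the printed response above.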

2.3 Respectful API usage

2.3.1 Using user agents

Informative user agents are a good way of being respectful of the developers running the API you’re interacting with. They make it easy for them to contact you in the event something goes wrong. I always try to include:

My email address;
A URL for the project the code is a part of, if it’s got a URL.

Building user agents is done by passing a call to user_agent() into the GET() or POST() request; something like:

GET("http://url.goes.here/", user_agent("somefakeemail@domain.com http://project.website"))

In the event you don’t have a website, a short one-sentence description of what the project is about serves pretty well.

# Do not change the url
url <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Aaron_Halfaker/daily/2015100100/2015103100"

# Add the email address and the test sentence inside user_agent()
server_response <- GET(url, user_agent("my@email.address this is a test"))

From your end, the request looks exactly the same with or without a user agent, but for the server the difference can be vital.

2.3.2 Rate-limiting

The next stage of respectful API usage is rate-limiting: making sure you only make a certain number of requests to the server in a given time period. Your limit will vary from server to server, but the implementation is always pretty much the same and involves a call to Sys.sleep(). This function takes one argument, a number, which represents the number of seconds to “sleep” (pause) the R session for. So if you call Sys.sleep(15), it’ll pause for 15 seconds before allowing further code to run.

As you can imagine, this is really useful for rate-limiting. What if you are only allowed 4 requests a minute? No problem! Just pause for 15 seconds between each request and you’re guaranteed to never exceed it. Let’s demonstrate now by putting together a little loop that sends multiple requests on a 5-second time delay. We’ll use httpbin.org’s APIs, which allow you to test different HTTP libraries.

# Construct a vector of 2 URLs
urls <- c("http://httpbin.org/status/404", "http://httpbin.org/status/301")

for(url in urls){
    # Send a GET request to url
    result <- GET(url)
    # Delay for 5 seconds between requests
    Sys.sleep(5)
}
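
The same pattern can be wrapped in a small helper that works out the delay from the allowed request rate. This is a sketch under stated assumptions: the limit of 4 requests a minute comes from the text above, while rate_limited_map() and the placeholder fetch() function are hypothetical names introduced here. In real use, fetch would be something like function(url) GET(url).

```r
# Limit taken from the example in the text: 4 requests per minute
requests_per_minute <- 4
delay_seconds <- 60 / requests_per_minute

# Placeholder standing in for a real request such as GET(url)
fetch <- function(x) toupper(x)

# Hypothetical helper: apply `fetch` to each item, pausing between calls
rate_limited_map <- function(items, fetch, delay) {
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    results[[i]] <- fetch(items[[i]])
    # No need to wait after the last item
    if (i < length(items)) Sys.sleep(delay)
  }
  results
}

# A short delay keeps the demonstration quick; use delay_seconds for real
rate_limited_map(c("a", "b", "c"), fetch, delay = 0.1)
```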

2.3.3 Tying it all together

Using everything that you learned in the chapter, let’s make a simple replica of one of the ‘pageviews’ functions - building queries, sending GET requests (with an appropriate user agent) and handling the output in a fault-tolerant way. You’ll build this function up piece by piece in this exercise.

To output an error, you will use the function stop(), which takes a string as an argument, stops the execution of the program, and outputs the string as an error. You can try it right now by running stop("This is an error"). First, get the function to construct the URL.

get_pageviews <- function(article_title){
  url <- paste(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents", 
    # Include article title
    article_title, 
    "daily/2015100100/2015103100", 
    sep = "/"
  ) 
  url
}

Now, make the request.

get_pageviews <- function(article_title){
  url <- paste(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents", 
    article_title, 
    "daily/2015100100/2015103100", 
    sep = "/"
  ) 
  # Get the webpage  
  response <- GET(url, config = user_agent("my@email.com this is a test")) 
  response
}

Now, add an error check.

get_pageviews <- function(article_title){
  url <- paste(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents", 
    article_title, 
    "daily/2015100100/2015103100", 
    sep = "/"
  )   
  response <- GET(url, user_agent("my@email.com this is a test")) 
  # Is there an HTTP error?
  if(http_error(response)){ 
    # Throw an R error
    stop("the request failed")
  }
  response
}

Finally, instead of returning response, return the content() of the response.

get_pageviews <- function(article_title){
  url <- paste(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents", 
    article_title, 
    "daily/2015100100/2015103100", 
    sep = "/"
  )   
  response <- GET(url, user_agent("my@email.com this is a test")) 
  # Is there an HTTP error?
  if(http_error(response)){ 
    # Throw an R error
    stop("the request failed") 
  }
  # Return the response's content
  content(response)
}

3 Handling JSON and XML

3.1 JSON

3.1.1 Parsing JSON

While JSON is a useful format for sharing data, your first step will often be to parse it into an R object, so you can manipulate it with R.

The content() function in httr retrieves the content from a request. It takes an as argument that specifies the type of output to return. You’ve already seen that as = "text" will return the content as a character string, which is useful for checking the content is as you expect.

If you don’t specify as, the default as = "parsed" is used. In this case the type of the content will be guessed from the response header and content() will choose an appropriate parsing function. For JSON this function is fromJSON() from the jsonlite package. If you know your response is JSON, you may want to use fromJSON() directly.
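
To make the difference concrete, here is a small offline sketch using a made-up JSON string shaped like the pageviews responses from earlier. By default fromJSON() simplifies an array of records into a data frame, while simplifyVector = FALSE keeps the nested-list shape (which, as far as I know, is the setting content() uses when it parses JSON for you).

```r
library(jsonlite)

# A made-up JSON string shaped like the pageviews responses above
json_text <- '{"items":[{"article":"Hadley_Wickham","views":45},{"article":"Hadley_Wickham","views":86}]}'

# Default: arrays of records are simplified into a data frame
simplified <- fromJSON(json_text)
class(simplified$items)

# simplifyVector = FALSE keeps the nested-list shape
nested <- fromJSON(json_text, simplifyVector = FALSE)
class(nested$items)
nested$items[[2]]$views
```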

To practice, you’ll retrieve some revision history from the Wikipedia API, check it is JSON, then parse it into a list two ways.

rev_history <- function (title, format = "json") {
    if (title != "Hadley Wickham") {
        stop("rev_history() only works for `title = \"Hadley Wickham\"`")
    }
    if (format == "json") {
        resp <- readRDS("had_rev_json.rds")
    }
    else if (format == "xml") {
        resp <- readRDS("had_rev_xml.rds")
    }
    else {
        stop("Invalid format supplied, try \"json\" or \"xml\"")
    }
    resp
}
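# rev_history() above reads pre-saved responses from local .rds files;
# here we reuse pageview_response from section 2.1.3 as the JSON response
resp_json <- pageview_response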

# Check http_type() of resp_json
http_type(resp_json)
[1] "application/json"
# Examine returned text with content()
content(resp_json, as = "text")
[1] "{\"items\":[{\"project\":\"en.wikipedia\",\"article\":\"Hadley_Wickham\",\"granularity\":\"daily\",\"timestamp\":\"2017010100\",\"access\":\"all-access\",\"agent\":\"all-agents\",\"views\":45},{\"project\":\"en.wikipedia\",\"article\":\"Hadley_Wickham\",\"granularity\":\"daily\",\"timestamp\":\"2017010200\",\"access\":\"all-access\",\"agent\":\"all-agents\",\"views\":86}]}"
# Parse response with content()
content(resp_json, as = "parsed")
$items
$items[[1]]
$items[[1]]$project
[1] "en.wikipedia"

$items[[1]]$article
[1] "Hadley_Wickham"

$items[[1]]$granularity
[1] "daily"

$items[[1]]$timestamp
[1] "2017010100"

$items[[1]]$access
[1] "all-access"

$items[[1]]$agent
[1] "all-agents"

$items[[1]]$views
[1] 45


$items[[2]]
$items[[2]]$project
[1] "en.wikipedia"

$items[[2]]$article
[1] "Hadley_Wickham"

$items[[2]]$granularity
[1] "daily"

$items[[2]]$timestamp
[1] "2017010200"

$items[[2]]$access
[1] "all-access"

$items[[2]]$agent
[1] "all-agents"

$items[[2]]$views
[1] 86
# Parse returned text with fromJSON()
library(jsonlite)
fromJSON(content(resp_json, as = "text"))

3.2 Manipulating JSON

3.2.1 Manipulating parsed JSON

As you saw in the video, the output from parsing JSON is a list. One way to extract relevant data from that list is to use a package specifically designed for manipulating lists, rlist.

rlist provides two particularly useful functions for selecting and combining elements from a list: list.select() and list.stack(). list.select() extracts sub-elements by name from each element in a list. For example using the parsed movies data from the video (movies_list), we might ask for the title and year elements from each element:

list.select(movies_list, title, year)

The result is still a list; that is where list.stack() comes in. It will stack the elements of a list into a data frame.

list.stack(
    list.select(movies_list, title, year)
)
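
Since movies_list is not defined in these notes, here is a self-contained sketch of the same two calls. The contents of movies_list below are made up for illustration and are not the data from the video.

```r
library(rlist)

# Made-up stand-in for the movies data from the video
movies_list <- list(
  list(title = "A New Hope", year = 1977, director = "George Lucas"),
  list(title = "The Empire Strikes Back", year = 1980, director = "Irvin Kershner")
)

# Keep only title and year from each element (still a list)
selected <- list.select(movies_list, title, year)

# Stack the selected elements into a data frame
movies_df <- list.stack(selected)
movies_df
```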

First, you’ll need to figure out where the revisions are. Examine the output from the str() call. Can you see where the list of 5 revisions is?


url <- "https://en.wikipedia.org/w/api.php?action=query&titles=Hadley%20Wickham&prop=revisions&rvprop=timestamp%7Cuser%7Ccomment%7Ccontent&rvlimit=5&format=json&rvdir=newer&rvstart=2015-01-14T17%3A12%3A45Z&rvsection=0"

resp_json <- GET(url)
# Load rlist
library(rlist)
package 'rlist' was built under R version 4.0.3
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     
# Examine output of this code
str(content(resp_json), max.level = 4)
List of 3
 $ continue:List of 2
  ..$ rvcontinue: chr "20150528042700|664370232"
  ..$ continue  : chr "||"
 $ warnings:List of 2
  ..$ main     :List of 1
  .. ..$ *: chr "Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki"| __truncated__
  ..$ revisions:List of 1
  .. ..$ *: chr "Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated,"| __truncated__
 $ query   :List of 1
  ..$ pages:List of 1
  .. ..$ 41916270:List of 4
  .. .. ..$ pageid   : int 41916270
  .. .. ..$ ns       : int 0
  .. .. ..$ title    : chr "Hadley Wickham"
  .. .. ..$ revisions:List of 5

Use list.select() to pull out the user and timestamp elements from each revision, and store the result in user_time.

# Store revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions
# Extract the user and timestamp elements
user_time <- list.select(revs, user, timestamp)

# Print user_time
user_time
[[1]]
[[1]]$user
[1] "214.28.226.251"

[[1]]$timestamp
[1] "2015-01-14T17:12:45Z"


[[2]]
[[2]]$user
[1] "73.183.151.193"

[[2]]$timestamp
[1] "2015-01-15T15:49:34Z"


[[3]]
[[3]]$user
[1] "FeanorStar7"

[[3]]$timestamp
[1] "2015-01-24T16:34:31Z"


[[4]]
[[4]]$user
[1] "KasparBot"

[[4]]$timestamp
[1] "2015-04-26T19:18:17Z"


[[5]]
[[5]]$user
[1] "Spkal"

[[5]]$timestamp
[1] "2015-05-06T18:24:57Z"
# Stack to turn into a data frame
list.stack(user_time)

rlist is designed to make working with lists easy, so if you find you are working with JSON data a lot, you should explore more of its functionality.

3.2.2 Reformatting JSON

Of course you don’t have to use rlist. You can achieve the same thing by using functions from base R or the tidyverse. In this exercise you’ll repeat the task of extracting the username and timestamp using the dplyr package which is part of the tidyverse.

Conceptually, you’ll take the list of revisions, stack them into a data frame, then pull out the relevant columns.

dplyr’s bind_rows() function takes a list and turns it into a data frame. Then you can use select() to extract the relevant columns. And of course we can make use of the %>% (pipe) operator to chain them all together.

# Load dplyr
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Pull out revision list
revs <- content(resp_json)$query$pages$`41916270`$revisions

# Extract user and timestamp
revs %>%
  bind_rows %>%           
  select(user, timestamp)
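
For comparison, the base R route mentioned at the start of this section might look like the sketch below. To keep it runnable offline, revs is retyped by hand with just two of the revisions printed earlier and only the user and timestamp fields; that trimmed-down revs is an assumption for illustration.

```r
# Two of the revisions shown earlier, retyped by hand (other fields omitted)
revs <- list(
  list(user = "FeanorStar7", timestamp = "2015-01-24T16:34:31Z"),
  list(user = "KasparBot",   timestamp = "2015-04-26T19:18:17Z")
)

# Build a one-row data frame per revision, then bind the rows together
user_time <- do.call(rbind, lapply(revs, function(rev) {
  data.frame(user = rev$user, timestamp = rev$timestamp,
             stringsAsFactors = FALSE)
}))
user_time
```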

3.3 XML structure

3.3.1 Examining XML documents

Just as with JSON, you should first verify the response is indeed XML with http_type() and by examining the result of content(r, as = "text"). Then you can turn the response into an XML document object with read_xml().

One benefit of using the XML document object is the available functions that help you explore and manipulate the document. For example xml_structure() will print a representation of the XML document that emphasizes the hierarchical structure by displaying the elements without the data.

In this exercise you’ll grab the same revision history you’ve been working with as XML, and take a look at it with xml_structure().

url <- "https://en.wikipedia.org/w/api.php?action=query&titles=Hadley%20Wickham&prop=revisions&rvprop=timestamp%7Cuser%7Ccomment%7Ccontent&rvlimit=5&format=xml&rvdir=newer&rvstart=2015-01-14T17%3A12%3A45Z&rvsection=0"


# Load xml2
library(xml2)

# Get XML revision history
resp_xml <- GET(url)

# Check response is XML 
http_type(resp_xml)
[1] "text/xml"
# Examine returned text with content()
rev_text <- content(resp_xml, as="text")

# Turn rev_text into an XML document
rev_xml <- read_xml(rev_text)

# Examine the structure of rev_xml
xml_structure(rev_xml)
<api>
  <continue [rvcontinue, continue]>
  <warnings>
    <main [space]>
      {text}
    <revisions [space]>
      {text}
  <query>
    <pages>
      <page [_idx, pageid, ns, title]>
        <revisions>
          <rev [user, anon, timestamp, contentformat, contentmodel, comment, space]>
            {text}
          <rev [user, anon, timestamp, contentformat, contentmodel, comment, space]>
            {text}
          <rev [user, timestamp, contentformat, contentmodel, comment, space]>
            {text}
          <rev [user, timestamp, contentformat, contentmodel, comment, space]>
            {text}
          <rev [user, timestamp, contentformat, contentmodel, comment, space]>
            {text}

xml_structure() helps you understand the structure of your document, without overwhelming you with content.

3.4 XPATHs

3.4.1 Extracting XML data

XPATHs are designed to specify nodes in an XML document. Remember /node_name specifies nodes at the current level that have the tag node_name, whereas //node_name specifies nodes at any level below the current level that have the tag node_name.

xml2 provides the function xml_find_all() to extract nodes that match a given XPATH. For example, xml_find_all(rev_xml, "/api") will find all the nodes at the top level of the rev_xml document that have the tag api. Try running that in the console. You’ll get a nodeset of one node because there is only one node that satisfies that XPATH.

The object returned from xml_find_all() is a nodeset (think of it like a list of nodes). To actually get data out of the nodes in the nodeset, you’ll have to explicitly ask for it with xml_text() (or xml_double() or xml_integer()).
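
The difference between / and // is easiest to see on a tiny document. The XML string below is made up for illustration and is much simpler than the real API response.

```r
library(xml2)

# Made-up document: one rev directly under query, one nested deeper
doc <- read_xml(
  '<api><query><rev user="a">first</rev><pages><page><rev user="b">second</rev></page></pages></query></api>'
)

# An absolute path matches only the rev that sits directly under /api/query
direct_revs <- xml_find_all(doc, "/api/query/rev")
xml_text(direct_revs)

# //rev matches rev nodes at any depth
all_revs <- xml_find_all(doc, "//rev")
xml_text(all_revs)
```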

Use what you know about the location of the revisions data in the returned XML document to extract just the content of the revisions.

# Find all nodes using XPATH "/api/query/pages/page/revisions/rev"
xml_find_all(rev_xml, "/api/query/pages/page/revisions/rev")
{xml_nodeset (5)}
[1] <rev user="214.28.226.251" anon="" timestamp="2015-01-14T17:12:45Z" contentformat="text/x-wiki" contentmodel="wi ...
[2] <rev user="73.183.151.193" anon="" timestamp="2015-01-15T15:49:34Z" contentformat="text/x-wiki" contentmodel="wi ...
[3] <rev user="FeanorStar7" timestamp="2015-01-24T16:34:31Z" contentformat="text/x-wiki" contentmodel="wikitext" com ...
[4] <rev user="KasparBot" timestamp="2015-04-26T19:18:17Z" contentformat="text/x-wiki" contentmodel="wikitext" comme ...
[5] <rev user="Spkal" timestamp="2015-05-06T18:24:57Z" contentformat="text/x-wiki" contentmodel="wikitext" comment=" ...
# Find all rev nodes anywhere in document
rev_nodes <- xml_find_all(rev_xml, "//rev")

# Use xml_text() to get text from rev_nodes
# xml_text(rev_nodes)

3.4.2 Extracting XML attributes

Not all the useful data will be in the content of a node; some might also be in the attributes of a node. To extract attributes from a nodeset, xml2 provides xml_attrs() and xml_attr().

xml_attrs() takes a nodeset and returns all of the attributes for every node in the nodeset. xml_attr() takes a nodeset and an additional argument attr to extract a single named attribute from each node in the nodeset.

In this exercise you’ll grab the user and anon attributes for each revision. You’ll see xml_find_first() in the sample code. It works just like xml_find_all() but it only extracts the first node it finds.

# All rev nodes
rev_nodes <- xml_find_all(rev_xml, "//rev")

# The first rev node
first_rev_node <- xml_find_first(rev_xml, "//rev")

# Find all attributes with xml_attrs()
xml_attrs(first_rev_node)
                  user                   anon              timestamp          contentformat           contentmodel 
      "214.28.226.251"                     "" "2015-01-14T17:12:45Z"          "text/x-wiki"             "wikitext" 
               comment                  space 
                    ""             "preserve" 
# Find user attribute with xml_attr()
xml_attr(first_rev_node, "user")
[1] "214.28.226.251"
# Find user attribute for all rev nodes
xml_attr(rev_nodes, "user")
[1] "214.28.226.251" "73.183.151.193" "FeanorStar7"    "KasparBot"      "Spkal"         
# Find anon attribute for all rev nodes
xml_attr(rev_nodes, "anon")
[1] "" "" NA NA NA

Did you notice that if a node didn’t have the anon attribute xml_attr() returned an NA?
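
That NA can be put to work: because the anon attribute is present (even if empty) only on anonymous edits, testing for NA gives a logical flag. The sketch below uses a made-up two-node document that mirrors the attributes shown above; is_anonymous is a name introduced here for illustration.

```r
library(xml2)

# Made-up document mirroring the rev attributes shown above
doc <- read_xml(
  '<revisions><rev user="214.28.226.251" anon=""/><rev user="Spkal"/></revisions>'
)
rev_nodes <- xml_find_all(doc, "//rev")

# xml_attr() gives "" where anon is present and NA where it is missing
anon <- xml_attr(rev_nodes, "anon")
is_anonymous <- !is.na(anon)

data.frame(user = xml_attr(rev_nodes, "user"), is_anonymous = is_anonymous)
```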

3.4.3 Wrapup: returning nice API output

How might all this work together? A useful API function will retrieve results from an API and return them in a useful form. In Chapter 2, you finished up by writing a function that retrieves data from an API that relied on content() to convert it to a useful form. To write a more robust API function you shouldn’t rely on content() but instead parse the data yourself.

To finish up this chapter you’ll do exactly that: write get_revision_history() which retrieves the XML data for the revision history of a page on Wikipedia, parses it, and returns it in a nice data frame.

So that you can focus on the parts of the function that parse the return object, you’ll see your function calls rev_history() to get the response from the API. You can assume this function returns the raw response and follows the best practices you learnt in Chapter 2, like using a user agent, and checking the response status.

get_revision_history <- function(article_title){
  # Get raw revision response (here, the resp_xml object fetched in 3.3.1
  # stands in for a call to rev_history())
  rev_resp <- resp_xml
  
  # Turn the content() of rev_resp into XML
  rev_xml <- read_xml(content(rev_resp, "text"))
  
  # Find revision nodes
  rev_nodes <- xml_find_all(rev_xml, "//rev")

  # Parse out usernames
  user <- xml_attr(rev_nodes, "user")
  
  # Parse out timestamps
  timestamp <- readr::parse_datetime(xml_attr(rev_nodes, "timestamp"))
  
  # Parse out content
  content <- xml_text(rev_nodes)
  
  # Return data frame 
  data.frame(user = user,
    timestamp = timestamp,
    content = substr(content, 1, 40))
}

# Call function for "Hadley Wickham"
get_revision_history(article_title = "Hadley Wickham")
NA

Your function parsed the XML data, but you could have just as easily parsed the JSON data.
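
As a rough illustration of the JSON route, the sketch below parses a hand-written JSON string with jsonlite. The field names mirror the XML attributes used above, but the records are invented and a real Wikipedia JSON response is more deeply nested, so treat this as an assumption-laden outline rather than a drop-in replacement:

```r
library(jsonlite)

# Hypothetical JSON with the same fields as the XML rev nodes
toy_json <- '[
  {"user": "ExampleUser", "timestamp": "2015-01-14T17:12:45Z", "content": "First draft"},
  {"user": "AnotherUser", "timestamp": "2015-02-01T09:30:00Z", "content": "Fixed a typo"}
]'

# fromJSON() simplifies an array of objects into a data frame
revs <- fromJSON(toy_json)

# Convert the ISO 8601 timestamps to date-times (base R equivalent of
# the readr::parse_datetime() step in the XML version)
revs$timestamp <- as.POSIXct(revs$timestamp,
  format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

revs
```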

4 Web scraping with XPATHs

4.1 Web scraping 101

4.1.1 Reading HTML

The first step with web scraping is actually reading the HTML in. This can be done with a function from xml2, which is imported by rvest - read_html(). This accepts a single URL, and returns a big blob of XML that we can use further on.

We’re going to experiment with that by grabbing Hadley Wickham’s Wikipedia page with rvest, and then printing it just to see what the structure looks like.

# Load rvest
library(rvest)

# Hadley Wickham's Wikipedia page
test_url <- "https://en.wikipedia.org/wiki/Hadley_Wickham"

# Read the URL stored as "test_url" with read_html()
test_xml <- read_html(test_url)

# Print test_xml
test_xml
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<title>Hadl ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Hadley_Wickham rootpag ...

4.1.2 Extracting nodes by XPATH

Now you’ve got an HTML page read into R. Great! But how do you get individual, identifiable pieces of it?

The answer is to use html_node(), which extracts individual chunks of HTML from an HTML document. There are a couple of ways of identifying and filtering nodes, and for now we’re going to use XPATHs: path expressions that pinpoint individual pieces of an HTML document.

These can be retrieved using a browser gadget we’ll talk about later - in the meanwhile the XPATH for the information box in the page you just downloaded is stored as test_node_xpath. We’re going to retrieve the box from the HTML doc with html_node(), using test_node_xpath as the xpath argument.

test_node_xpath <- "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"vcard\", \" \" ))]"
test_node_xpath
[1] "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"vcard\", \" \" ))]"
# Use html_node() to grab the node with the XPATH stored as `test_node_xpath`
node <- html_node(x = test_xml, xpath = test_node_xpath)

# Print the first element of the result
node[1]
$node
<pointer: 0x000001516b97e6d0>

XML nodes are the building blocks of an XML document - extracting them leads to everything else.
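
To try the same idea without downloading anything, here is a small sketch on a hand-written page. The HTML and the names toy_html, toy_xpath and toy_node are made up for illustration; the XPATH is a shorter but equivalent form of the class-matching pattern in test_node_xpath:

```r
library(rvest)

# Hypothetical page: two tables, only one with the class "vcard"
toy_html <- read_html('<html><body>
  <table class="navbox"><tr><td>Links</td></tr></table>
  <table class="infobox vcard"><tr><td>Details</td></tr></table>
</body></html>')

# Match "vcard" as a whole class name, wherever it sits in the class list
toy_xpath <- "//*[contains(concat(' ', @class, ' '), ' vcard ')]"

# html_node() returns the first node that matches the XPATH
toy_node <- html_node(x = toy_html, xpath = toy_xpath)

# Inspect the class attribute of the node that was found
html_attr(toy_node, "class")
```

Padding the class attribute with spaces before matching is what stops a class such as "vcards" from matching by accident.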

4.2 HTML structure

4.2.1 Extracting names

The first thing we’ll grab is a name, from the previously extracted table element (stored here as node). We can do this with html_name(). As you may recall from when you printed it, the element has the tag <table>...</table>, so we’d expect the name to be, well, table.

# Extract the name of the table element (stored as node)
element_name <- html_name(node)

# Print the name
element_name
[1] "table"

You’ve started extracting components from HTML and XML nodes. The tag might not seem important (and most of the time, it’s not) but it’s a good first step, and the actual node contents (text, say) are something we’ll move on to next.

4.2.2 Extracting values

Knowing the type of HTML element a node is can be helpful, but it only gets you so far. What we really want is to extract the actual text stored within the node.

We can do that with (shocker) html_text(), another convenient rvest function that accepts a node and passes back the text inside it. For this we’ll want a node within the extracted element - specifically, the one containing the page title. The xpath value for that node is stored as second_xpath_val.

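Using this xpath value, extract the node we want, and then use html_text() to extract the text before printing it. Because the XPATH starts with //, it searches the whole document, so the code below passes test_xml rather than the table element.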

second_xpath_val <- "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"fn\", \" \" ))]"
# Extract the element referred to by second_xpath_val and store it as page_name
page_name <- html_node(x = test_xml, xpath = second_xpath_val)

# Extract the text from page_name
page_title <- html_text(page_name)

# Print page_title
page_title
[1] "Hadley Wickham"

Text extraction is most of what you’re likely to do with XML - after all, the content of an XML tag is almost always the important value. If it’s consistently a set of digits, say, you can always use as.integer() or as.numeric() to turn it from a string into a number.
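
A short sketch of both points, again on a made-up snippet (the class names card and year and the values are assumptions for illustration). It also shows an XPATH that is relative to an already extracted node, which needs a leading dot:

```r
library(rvest)

# Hypothetical snippet with a name and a numeric value stored as text
toy_html <- read_html('<html><body><div class="card">
  <span class="fn">Example Person</span>
  <span class="year">1979</span>
</div></body></html>')

card <- html_node(toy_html, xpath = "//div[@class='card']")

# The leading "." makes the XPATH relative to card; a plain "//" would
# search the whole document again
name_text <- html_text(html_node(card, xpath = ".//span[@class='fn']"))
year_text <- html_text(html_node(card, xpath = ".//span[@class='year']"))

# html_text() returns strings, so convert when a number is needed
year <- as.integer(year_text)
```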

4.3 Reformatting Data

4.3.1 Extracting tables

The data from Wikipedia that we’ve been playing around with can be extracted bit by bit and cleaned up manually, but since it’s a table, we have an easier way of turning it into an R object. rvest contains the function html_table() which, as the name suggests, extracts tables. It accepts a node containing a table object, and outputs a data frame.

# Turn the table element (node) into a data frame and assign it to wiki_table
wiki_table <- html_table(node)

# Print wiki_table
wiki_table

Being able to extract tables directly is a massive speedup, since otherwise they’re a ton of different nested tags.
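
For a version that runs offline, the sketch below applies html_table() to a tiny hand-written table. The keys and values are invented, and unlike the Wikipedia infobox this table has a header row, so the column names come out of the HTML directly:

```r
library(rvest)

# Hypothetical table with a header row and two data rows
toy_html <- read_html('<html><body><table class="infobox">
  <tr><th>key</th><th>value</th></tr>
  <tr><td>Colour</td><td>Blue</td></tr>
  <tr><td>Shape</td><td>Round</td></tr>
</table></body></html>')

# Extract the table node, then convert it to a data frame
toy_table <- html_table(html_node(toy_html, xpath = "//table"))
toy_table
```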

4.3.2 Cleaning a data frame

In the last exercise, we looked at extracting tables with html_table(). The resulting data frame was pretty clean, but had two problems - first, the column names weren’t descriptive, and second, there was an empty row.

In this exercise we’re going to look at fixing both of those problems. First, column names. Column names can be cleaned up with the colnames() function. You call it on the object you want to rename, and then assign to that call a vector of new names.

The empty row, meanwhile, can be removed with the subset() function. subset() takes an object and a condition. For example, if you have a data frame df containing a column x, you could run

subset(df, !x == "")

to remove all rows from df that have an empty string ("") in the column x.

# Rename the columns of wiki_table
colnames(wiki_table) <- c("key", "value")

# Remove the empty row from wiki_table
cleaned_table <- subset(wiki_table, !key == "")

# Print cleaned_table
cleaned_table

Cleaning up data, or ‘munging’, is a really common thing to have to do, particularly when someone else picked how it’s formatted. If you can scrape data and clean it, you can do anything.
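
The two cleaning steps can be tried on a data frame built by hand. The contents below are invented stand-ins for a scraped table (default column names X1 and X2, one empty row); only base R is used:

```r
# Hypothetical scraped table: default column names and one empty row
raw_table <- data.frame(
  X1 = c("Colour", "", "Shape"),
  X2 = c("Blue", "", "Round"),
  stringsAsFactors = FALSE
)

# Step 1: give the columns descriptive names
colnames(raw_table) <- c("key", "value")

# Step 2: keep only the rows whose key is not an empty string
tidy_table <- subset(raw_table, key != "")
tidy_table
```

Here key != "" is just another way of writing !key == ""; both keep the same rows.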

5 CSS Web Scraping and Final Case Study

5.1 CSS web scraping in theory

5.1.1 Using CSS to scrape nodes

CSS is a way to add design information to HTML that instructs the browser on how to display the content. You can leverage these design instructions to identify content on the page.

You’ve already used html_node(), but it’s more common with CSS selectors to use html_nodes() since you’ll often want more than one node returned. Both functions allow you to specify a css argument to use a CSS selector, instead of specifying the xpath argument.

# Select the table elements
html_nodes(test_xml, css = "table")
{xml_nodeset (3)}
[1] <table class="infobox biography vcard" style="width:22em"><tbody>\n<tr><th colspan="2" style="text-align:center; ...
[2] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style="border-spacing:0;background:transpare ...
[3] <table class="nowraplinks hlist navbox-inner" style="border-spacing:0;background:transparent;color:inherit"><tbo ...
# Select elements with class = "infobox"
html_nodes(test_xml, css = ".infobox")
{xml_nodeset (1)}
[1] <table class="infobox biography vcard" style="width:22em"><tbody>\n<tr><th colspan="2" style="text-align:center; ...
# Select elements with id = "firstHeading"
html_nodes(test_xml, css = "#firstHeading")
{xml_nodeset (1)}
[1] <h1 id="firstHeading" class="firstHeading" lang="en">Hadley Wickham</h1>

5.2 Scraping names

You might have noticed in the previous exercise that, to select elements with a certain class, you add a . in front of the class name. If you need to select an element based on its id, you add a # in front of the id name.

For example if this element was inside your HTML document:

<h1 class = "heading" id = "intro">
  Introduction
</h1>

You could select it by its class using the CSS selector “.heading”, or by its id using the CSS selector “#intro”.

Once you’ve selected an element with a CSS selector, you can get the element tag name just like you did with XPATH selectors, with html_name().

# Extract element with class infobox
infobox_element <- html_nodes(test_xml, css = ".infobox")

# Get tag name of infobox_element
element_name <- html_name(infobox_element)

# Print element_name
element_name
[1] "table"
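
The three kinds of selector (tag, class and id) can be compared side by side on a small made-up page. The HTML below is an assumption for illustration, not the real Wikipedia markup:

```r
library(rvest)

# Hypothetical page with one heading and two tables
toy_html <- read_html('<html><body>
  <h1 id="firstHeading" class="heading">Example Page</h1>
  <table class="infobox"><tr><td>A</td></tr></table>
  <table class="navbox"><tr><td>B</td></tr></table>
</body></html>')

tables  <- html_nodes(toy_html, css = "table")          # by tag name
infobox <- html_nodes(toy_html, css = ".infobox")       # by class
heading <- html_nodes(toy_html, css = "#firstHeading")  # by id

# Selectors can be combined: td cells inside a table with class infobox
cells <- html_nodes(toy_html, css = "table.infobox td")

length(tables)
html_name(heading)
html_text(cells)
```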

5.2.1 Scraping text

Of course you can get the contents of a node extracted using a CSS selector too, with html_text().

# Extract element with class fn
page_name <- html_node(x = infobox_element, ".fn")

# Get contents of page_name
page_title <- html_text(page_name)

# Print page_title
page_title
[1] "Hadley Wickham"

Why do you think the class for this element is fn? It comes from the hCard microformat (note the vcard class on the infobox), where fn stands for “formatted name”.

5.3 Final case study: Introduction

5.3.1 API calls

Your first step is to use the Wikipedia API to get the page contents for a specific page. We’ll continue to work with the Hadley Wickham page, but as your last exercise, you’ll make it more general.

To get the content of a page from the Wikipedia API you need to use a parameter-based URL. The URL you want is

https://en.wikipedia.org/w/api.php?action=parse&page=Hadley%20Wickham&format=xml

which specifies that you want the parsed content (i.e. the HTML) for the “Hadley Wickham” page, and that the API response should be XML.

# Load httr
library(httr)

# The API url
base_url <- "https://en.wikipedia.org/w/api.php"

# Set query parameters
query_params <- list(action = "parse", 
  page = "Hadley Wickham", 
  format = "xml")

# Get data from API
resp <- GET(url = base_url, query = query_params)
    
# Parse response
resp_xml <- content(resp)
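
If you want to check which URL the query list turns into without sending a request, one option is httr's modify_url(), sketched below with the same base URL and parameters as above. The name full_url is just for illustration:

```r
library(httr)

# Same base URL and query parameters as in the request above
base_url <- "https://en.wikipedia.org/w/api.php"
query_params <- list(action = "parse",
  page = "Hadley Wickham",
  format = "xml")

# modify_url() assembles and encodes the URL locally; nothing is sent
full_url <- modify_url(base_url, query = query_params)
full_url
```

The printed URL should have the same shape as the parameter-based URL shown earlier, with the space in the page title percent-encoded.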

5.3.2 Extracting information

Now we have a response from the API, we need to extract the HTML for the page from it. It turns out the HTML is stored in the contents of the XML response. Take a look, by using xml_text() to pull out the text from the XML response:

xml_text(resp_xml)
# Load rvest
library(rvest)

# Read page contents as HTML
page_html <- read_html(xml_text(resp_xml))

# Extract infobox element
infobox_element <- html_node(page_html, ".infobox")

# Extract page name element from infobox
page_name <- html_node(infobox_element, ".fn")

# Extract page name as text
page_title <- html_text(page_name)
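
The two-stage parse (XML first, then the HTML stored inside it) can be sketched offline with a hand-written response. The XML below only imitates the general shape of an API reply and is an assumption for illustration; the real response contains far more markup:

```r
library(xml2)
library(rvest)

# Hypothetical API response: the page HTML is escaped inside a <text> element
toy_resp <- read_xml('<api><parse><text>&lt;div&gt;&lt;table class="infobox"&gt;&lt;tr&gt;&lt;td class="fn"&gt;Example Person&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;</text></parse></api>')

# xml_text() unescapes the entities and returns the HTML as one string
html_string <- xml_text(toy_resp)

# That string can then be parsed as HTML in its own right
toy_page <- read_html(html_string)

# From here on the usual CSS selectors apply
toy_title <- html_text(html_node(toy_page, ".infobox .fn"))
toy_title
```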

5.3.3 Normalising information

Now it’s time to put together the information in a nice format. You’ve already seen you can use html_table() to parse the infobox into a data frame. But one piece of important information is missing from that table: who the information is about!

# Your code from earlier exercises
wiki_table <- html_table(infobox_element)
colnames(wiki_table) <- c("key", "value")
cleaned_table <- subset(wiki_table, !key == "")

# Create a dataframe for full name
name_df <- data.frame(key = "Full name", value = page_title)

# Combine name_df with cleaned_table
wiki_table2 <- rbind(name_df, cleaned_table)

# Print wiki_table2
wiki_table2

5.3.4 Reproducibility

Now you’ve figured out the process for requesting and parsing the infobox for the Hadley Wickham page, it’s time to turn it into a function that does the same thing for anyone.

You’ve already done all the hard work! In the sample script we’ve just copied all your code from the previous three exercises, with only one change: we’ve wrapped it in the function definition syntax, and chosen the name get_infobox() for this function.

library(httr)
library(rvest)
library(xml2)

get_infobox <- function(title){
  base_url <- "https://en.wikipedia.org/w/api.php"
  
  # Change "Hadley Wickham" to title
  query_params <- list(action = "parse", 
    page = title, 
    format = "xml")
  
  resp <- GET(url = base_url, query = query_params)
  resp_xml <- content(resp)
  
  page_html <- read_html(xml_text(resp_xml))
  infobox_element <- html_node(x = page_html, css = ".infobox")
  page_name <- html_node(x = infobox_element, css = ".fn")
  page_title <- html_text(page_name)
  
  wiki_table <- html_table(infobox_element)
  colnames(wiki_table) <- c("key", "value")
  cleaned_table <- subset(wiki_table, !wiki_table$key == "")
  name_df <- data.frame(key = "Full name", value = page_title)
  wiki_table <- rbind(name_df, cleaned_table)
  
  wiki_table
}

# Test get_infobox with "Hadley Wickham"
get_infobox(title = "Hadley Wickham")

# Try get_infobox with "Ross Ihaka"
get_infobox(title = "Ross Ihaka")

# Try get_infobox with "Grace Hopper"
get_infobox(title = "Grace Hopper")
