The simplest way to get data from the web is to download a file from a web site. read.csv() and read.delim() can read from a URL just as they would from a local file path.
chickwts <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv")
Or download the file first, then read it.
download.file(url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv",
destfile = "./feed_data.csv")
chickwts <- read.csv("./feed_data.csv")
Application Programming Interfaces (APIs) are server components that facilitate programmatic interaction with a service. On the client side, there is typically a package that wraps the connection to the API.
The pageviews package is a client for the Wikimedia pageviews API, which serves page view counts for Wikipedia articles. pageviews lives on CRAN, the central repository of R packages.
library(pageviews) # Wikipedia article pageviews API client
hadley_pageviews <- article_pageviews(project = "en.wikipedia",
article = "Hadley Wickham")
str(hadley_pageviews)
## 'data.frame': 1 obs. of 8 variables:
## $ project : chr "wikipedia"
## $ language : chr "en"
## $ article : chr "Hadley_Wickham"
## $ access : chr "all-access"
## $ agent : chr "all-agents"
## $ granularity: chr "daily"
## $ date : POSIXct, format: "2015-10-01"
## $ views : num 53
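article_pageviews() also accepts a date range. The sketch below assumes the start and end arguments documented in ?article_pageviews, which take timestamps in the same YYYYMMDDHH format the Wikimedia API uses; it is illustrative rather than evaluated here.
hadley_jan_2017 <- article_pageviews(project = "en.wikipedia",
                                     article = "Hadley Wickham",
                                     start = "2017010100",
                                     end = "2017010300")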
API developers typically regulate usage with access tokens: unique keys that verify you are authorized to use the service. For most public APIs, you only need to register an email address and perhaps explain how you plan to use the data. Wordnik is one API that requires access tokens; the R package birdnik is a client for the Wordnik API.
#library(birdnik)
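As a sketch of how a token is typically supplied, many APIs accept the key as a query parameter or request header. The endpoint and parameter names below are purely hypothetical; the key is read from an environment variable so it never appears in the script.
library(httr)
# Hypothetical endpoint and parameter names, for illustration only.
# Store the real key with Sys.setenv(WORDNIK_API_KEY = "...") or in .Renviron.
token_resp <- GET(url = "https://api.example.com/v1/word",
                  query = list(word = "vernacular",
                               api_key = Sys.getenv("WORDNIK_API_KEY")))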
There are several types of HTTP "requests". The main ones are GET (a request to "get" content from the server) and POST (a request to "post" content to the server). Less common request types include HEAD (like head()!) and DELETE (like rm()!). The httr package has a function for each: GET() requests content from the server, which you can then inspect with content(), and POST() sends content to the server.
library(httr)
(get_result <- GET(url = "http://httpbin.org/get"))
## Response [http://httpbin.org/get]
## Date: 2019-11-10 23:09
## Status: 200
## Content-Type: application/json
## Size: 315 B
## {
## "args": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Host": "httpbin.org",
## "User-Agent": "libcurl/7.64.1 r-curl/4.2 httr/1.4.1"
## },
## "origin": "173.88.139.110, 173.88.139.110",
## "url": "https://httpbin.org/get"
## ...
In addition to the url argument, POST() accepts a body argument containing the content you want to post to the server.
(post_result <- POST(url = "http://httpbin.org/post", body = "this is a test"))
## Response [http://httpbin.org/post]
## Date: 2019-11-10 23:09
## Status: 200
## Content-Type: application/json
## Size: 422 B
## {
## "args": {},
## "data": "this is a test",
## "files": {},
## "form": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Content-Length": "14",
## "Host": "httpbin.org",
## ...
View the content returned from a GET() or POST() request with the content() function. Here is a request to the Wikimedia pageviews API for the number of pageviews to the English-language Wikipedia’s “Hadley Wickham” article on 1 and 2 January 2017.
pageview_response <- GET(url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Hadley_Wickham/daily/20170101/20170102")
pageview_data <- content(pageview_response)
str(pageview_data)
## List of 1
## $ items:List of 2
## ..$ :List of 7
## .. ..$ project : chr "en.wikipedia"
## .. ..$ article : chr "Hadley_Wickham"
## .. ..$ granularity: chr "daily"
## .. ..$ timestamp : chr "2017010100"
## .. ..$ access : chr "all-access"
## .. ..$ agent : chr "all-agents"
## .. ..$ views : int 45
## ..$ :List of 7
## .. ..$ project : chr "en.wikipedia"
## .. ..$ article : chr "Hadley_Wickham"
## .. ..$ granularity: chr "daily"
## .. ..$ timestamp : chr "2017010200"
## .. ..$ access : chr "all-access"
## .. ..$ agent : chr "all-agents"
## .. ..$ views : int 86
The status of a request is returned in the Status field of the response object. Status 200 indicates success; 404 means the resource was not found. There are many status codes, but a good rule of thumb is that 2xx and 3xx codes are good, 4xx means you've made a mistake, and 5xx means the server has. There is a fuller explanation of the codes at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes. http_error() returns TRUE if the status code indicates an error.
get_result <- GET(url = "http://google.com/fakepagethatdoesnotexist")
if (http_error(get_result)) {
  warning("The request failed")
} else {
  content(get_result)
}
## Warning: The request failed
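httr also exposes the status directly: status_code() returns the numeric code and http_status() its category and message, while stop_for_status() and warn_for_status() convert error codes into R errors or warnings. A short sketch using the response above:
status_code(get_result)        # numeric code, e.g. 404 for the fake page
http_status(get_result)        # category, reason and message
# stop_for_status(get_result)  # would throw an error for any 4xx/5xx response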
The best practice for constructing URLs is to stitch together the parts that do not change with the parts that do. If the API uses directory-based URLs such as "https://fakeurl.com/api/peaches/thursday", you can stitch the URL together with paste(..., sep = "/"). Modern APIs tend to use directory-based URLs. If the API uses parameter-based URLs such as "https://fakeurl.com/api.php?fruit=peaches&day=thursday", pass the parameters to GET() via its query argument.
# Directory-based API URL for person `1` in `people`
directory_url <- paste("http://swapi.co/api", "people", "1", sep = "/")
result <- GET(directory_url)
content(result)
## $name
## [1] "Luke Skywalker"
##
## $height
## [1] "172"
##
## $mass
## [1] "77"
##
## $hair_color
## [1] "blond"
##
## $skin_color
## [1] "fair"
##
## $eye_color
## [1] "blue"
##
## $birth_year
## [1] "19BBY"
##
## $gender
## [1] "male"
##
## $homeworld
## [1] "https://swapi.co/api/planets/1/"
##
## $films
## $films[[1]]
## [1] "https://swapi.co/api/films/2/"
##
## $films[[2]]
## [1] "https://swapi.co/api/films/6/"
##
## $films[[3]]
## [1] "https://swapi.co/api/films/3/"
##
## $films[[4]]
## [1] "https://swapi.co/api/films/1/"
##
## $films[[5]]
## [1] "https://swapi.co/api/films/7/"
##
##
## $species
## $species[[1]]
## [1] "https://swapi.co/api/species/1/"
##
##
## $vehicles
## $vehicles[[1]]
## [1] "https://swapi.co/api/vehicles/14/"
##
## $vehicles[[2]]
## [1] "https://swapi.co/api/vehicles/30/"
##
##
## $starships
## $starships[[1]]
## [1] "https://swapi.co/api/starships/12/"
##
## $starships[[2]]
## [1] "https://swapi.co/api/starships/22/"
##
##
## $created
## [1] "2014-12-09T13:50:51.644000Z"
##
## $edited
## [1] "2014-12-20T21:17:56.891000Z"
##
## $url
## [1] "https://swapi.co/api/people/1/"
# Parameter-based API request with query parameters nationality and country
parameter_response <- GET(url = "https://httpbin.org/get",
query = list(nationality = "americans",
country = "antigua"))
content(parameter_response)
## $args
## $args$country
## [1] "antigua"
##
## $args$nationality
## [1] "americans"
##
##
## $headers
## $headers$Accept
## [1] "application/json, text/xml, application/xml, */*"
##
## $headers$`Accept-Encoding`
## [1] "deflate, gzip"
##
## $headers$Host
## [1] "httpbin.org"
##
## $headers$`User-Agent`
## [1] "libcurl/7.64.1 r-curl/4.2 httr/1.4.1"
##
##
## $origin
## [1] "173.88.139.110, 173.88.139.110"
##
## $url
## [1] "https://httpbin.org/get?nationality=americans&country=antigua"
Good practices: identify yourself by passing a user agent with user_agent(), and rate-limit repeated requests with Sys.sleep().
directory_url <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Aaron_Halfaker/daily/2015100100/2015103100"
get_response <- GET(directory_url, user_agent("my@email.address this is a test"))
Sys.sleep() pauses execution for the given number of seconds, adding a delay between successive requests.
urls <- c("http://httpbin.org/status/404", "http://httpbin.org/status/301")
for (url in urls) {
  result <- GET(url)
  Sys.sleep(5) # 5 second delay between requests
}
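For flaky endpoints, httr's RETRY() wraps a request with automatic retries and exponential back-off, which pairs well with a polite user agent and a delay between calls. A small sketch:
# Retry a GET up to 3 times, backing off between attempts
result <- RETRY("GET", url = "http://httpbin.org/status/503",
                times = 3, pause_base = 1)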
The following function puts all these ideas together.
get_pageviews <- function(article_title){
  url <- paste(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents",
    article_title,
    "daily/2015100100/2015103100",
    sep = "/"
  )
  response <- GET(url, user_agent("my@email.com this is a test"))
  # Is there an HTTP error?
  if (http_error(response)) {
    stop("the request failed") # Throw an error
  }
  content(response) # Return the response content
}
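An illustrative call (it would hit the live Wikimedia API, so it is not evaluated here):
# Daily pageviews for the "Hadley_Wickham" article in October 2015
hadley_oct_2015 <- get_pageviews("Hadley_Wickham")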
APIs typically return data in either JSON or XML format.
JSON files are built from two data structures: objects and arrays. Objects enclose name-value pairs in braces {}: {"title" : "New Hope", "year" : "1977"}. Arrays enclose values in brackets []: [1977, 1988]. A value within an object or array can be a string, a number, true/false, null, another object, or another array.
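A minimal illustration: jsonlite (introduced below) parses a JSON string containing an object, an array, and the basic value types into the corresponding R structures.
library(jsonlite)
film_json <- '{"title": "A New Hope", "year": 1977, "sequel": true, "cast": ["Luke", "Leia"]}'
str(fromJSON(film_json))  # a named list with character, numeric, logical and vector elements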
content() returns the "parsed" response by default: it inspects the response's content type and calls the appropriate parser. You can see what type of content a response contains with http_type(). For JSON, content() calls fromJSON() from the jsonlite package.
resp <- GET(url = "https://httpbin.org/get",
query = list(nationality = "americans",
country = "antigua"))
http_type(resp)
## [1] "application/json"
content(resp)
## $args
## $args$country
## [1] "antigua"
##
## $args$nationality
## [1] "americans"
##
##
## $headers
## $headers$Accept
## [1] "application/json, text/xml, application/xml, */*"
##
## $headers$`Accept-Encoding`
## [1] "deflate, gzip"
##
## $headers$Host
## [1] "httpbin.org"
##
## $headers$`User-Agent`
## [1] "libcurl/7.64.1 r-curl/4.2 httr/1.4.1"
##
##
## $origin
## [1] "173.88.139.110, 173.88.139.110"
##
## $url
## [1] "https://httpbin.org/get?nationality=americans&country=antigua"
If you know your response is JSON, you may want to call fromJSON() directly.
library(jsonlite)
fromJSON(txt = content(resp, as = "text"))
## No encoding supplied: defaulting to UTF-8.
## $args
## $args$country
## [1] "antigua"
##
## $args$nationality
## [1] "americans"
##
##
## $headers
## $headers$Accept
## [1] "application/json, text/xml, application/xml, */*"
##
## $headers$`Accept-Encoding`
## [1] "deflate, gzip"
##
## $headers$Host
## [1] "httpbin.org"
##
## $headers$`User-Agent`
## [1] "libcurl/7.64.1 r-curl/4.2 httr/1.4.1"
##
##
## $origin
## [1] "173.88.139.110, 173.88.139.110"
##
## $url
## [1] "https://httpbin.org/get?nationality=americans&country=antigua"
The rlist package provides convenient tools for pulling elements out of the nested lists that parsed JSON often becomes.
library(rlist)
resp <- GET(url = "https://en.wikipedia.org/w/api.php",
            query = list(action = "query",
                         titles = "Hadley Wickham",
                         prop = "revisions",
                         rvprop = "timestamp|user|comment|content",
                         rvlimit = "5",
                         format = "json",
                         rvdir = "newer",
                         rvstart = "2015-01-14T17:12:45Z",
                         rvsection = "0"))
str(content(resp), max.level = 4)
## List of 3
## $ continue:List of 2
## ..$ rvcontinue: chr "20150528042700|664370232"
## ..$ continue : chr "||"
## $ warnings:List of 2
## ..$ main :List of 1
## .. ..$ *: chr "Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki"| __truncated__
## ..$ revisions:List of 1
## .. ..$ *: chr "Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated,"| __truncated__
## $ query :List of 1
## ..$ pages:List of 1
## .. ..$ 41916270:List of 4
## .. .. ..$ pageid : int 41916270
## .. .. ..$ ns : int 0
## .. .. ..$ title : chr "Hadley Wickham"
## .. .. ..$ revisions:List of 5
revs <- content(resp)$query$pages$`41916270`$revisions
# revs is a list of lists. Extract user, timestamp elements from each sublist.
(user_time <- list.select(revs, user, timestamp))
## [[1]]
## [[1]]$user
## [1] "214.28.226.251"
##
## [[1]]$timestamp
## [1] "2015-01-14T17:12:45Z"
##
##
## [[2]]
## [[2]]$user
## [1] "73.183.151.193"
##
## [[2]]$timestamp
## [1] "2015-01-15T15:49:34Z"
##
##
## [[3]]
## [[3]]$user
## [1] "FeanorStar7"
##
## [[3]]$timestamp
## [1] "2015-01-24T16:34:31Z"
##
##
## [[4]]
## [[4]]$user
## [1] "KasparBot"
##
## [[4]]$timestamp
## [1] "2015-04-26T19:18:17Z"
##
##
## [[5]]
## [[5]]$user
## [1] "Spkal"
##
## [[5]]$timestamp
## [1] "2015-05-06T18:24:57Z"
# Stack to turn into a data frame
list.stack(user_time)
## user timestamp
## 1 214.28.226.251 2015-01-14T17:12:45Z
## 2 73.183.151.193 2015-01-15T15:49:34Z
## 3 FeanorStar7 2015-01-24T16:34:31Z
## 4 KasparBot 2015-04-26T19:18:17Z
## 5 Spkal 2015-05-06T18:24:57Z
Or, better yet, use the dplyr functions bind_rows() instead of list.stack() and select() instead of list.select().
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
revs %>%
bind_rows() %>%
select(user, timestamp)
## # A tibble: 5 x 2
## user timestamp
## <chr> <chr>
## 1 214.28.226.251 2015-01-14T17:12:45Z
## 2 73.183.151.193 2015-01-15T15:49:34Z
## 3 FeanorStar7 2015-01-24T16:34:31Z
## 4 KasparBot 2015-04-26T19:18:17Z
## 5 Spkal 2015-05-06T18:24:57Z
XML files consist of markup and content. Markup encloses content with tags, and tags can also carry attributes. For XML, content() calls read_xml() from the xml2 package.
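A minimal illustration of tags, content, and attributes, using a made-up one-element document:
library(xml2)
film_xml <- read_xml('<film year="1977"><title>A New Hope</title></film>')
xml_name(film_xml)                             # tag name: "film"
xml_attr(film_xml, "year")                     # attribute value: "1977"
xml_text(xml_find_first(film_xml, "//title"))  # tag content: "A New Hope"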
library(xml2)
resp <- GET(url = "https://en.wikipedia.org/w/api.php",
            query = list(action = "query",
                         titles = "Hadley Wickham",
                         prop = "revisions",
                         rvprop = "timestamp|user|comment|content",
                         rvlimit = "5",
                         format = "xml",
                         rvdir = "newer",
                         rvstart = "2015-01-14T17:12:45Z",
                         rvsection = "0"))
http_type(resp) # is content xml?
## [1] "text/xml"
content(resp)
## {xml_document}
## <api>
## [1] <continue rvcontinue="20150528042700|664370232" continue="||"/>
## [2] <warnings>\n <main xml:space="preserve">Subscribe to the mediawiki- ...
## [3] <query>\n <pages>\n <page _idx="41916270" pageid="41916270" ns=" ...
# alternatively, extract the text and parse the xml explicitly.
resp_xml <- content(resp, as = "text") %>%
read_xml()
# structure of the xml
xml_structure(resp_xml)
## <api>
## <continue [rvcontinue, continue]>
## <warnings>
## <main [space]>
## {text}
## <revisions [space]>
## {text}
## <query>
## <pages>
## <page [_idx, pageid, ns, title]>
## <revisions>
## <rev [user, anon, timestamp, contentformat, contentmodel, comment, space]>
## {text}
## <rev [user, anon, timestamp, contentformat, contentmodel, comment, space]>
## {text}
## <rev [user, timestamp, contentformat, contentmodel, comment, space]>
## {text}
## <rev [user, timestamp, contentformat, contentmodel, comment, space]>
## {text}
## <rev [user, timestamp, contentformat, contentmodel, comment, space]>
## {text}
Note that the list-extraction approach from the JSON example no longer works here: content(resp) now returns an xml_document rather than a nested list, so the same code comes back empty.
revs <- content(resp)$query$pages$`41916270`$revisions
(user_time <- list.select(revs, user, timestamp))
## list()
list.stack(user_time)
## data frame with 0 columns and 0 rows
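If you prefer the list-based workflow, xml2's as_list() converts an XML document into nested R lists, although for anything beyond small documents the XPath tools described next are usually easier. A sketch (not evaluated here):
# Convert the parsed XML into nested lists and inspect the top levels
str(as_list(resp_xml), max.level = 3)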
The xml2 function xml_find_all() extracts nodes that match the provided XPath. An XPath step beginning with "/" matches children at that level of the document, while "//" matches nodes at any depth below it. "@" refers to a node attribute. xml_find_all() returns a nodeset; to get data out of the nodes, explicitly extract it with xml_text(), xml_double(), xml_integer(), or as_list().
xml_find_all(resp_xml, xpath = "/api/query/pages/page/revisions/rev")
## {xml_nodeset (5)}
## [1] <rev user="214.28.226.251" anon="" timestamp="2015-01-14T17:12:45Z" ...
## [2] <rev user="73.183.151.193" anon="" timestamp="2015-01-15T15:49:34Z" ...
## [3] <rev user="FeanorStar7" timestamp="2015-01-24T16:34:31Z" contentform ...
## [4] <rev user="KasparBot" timestamp="2015-04-26T19:18:17Z" contentformat ...
## [5] <rev user="Spkal" timestamp="2015-05-06T18:24:57Z" contentformat="te ...
# All nodes in path
rev_nodes <- xml_find_all(resp_xml,
xpath = "/api/query/pages/page/revisions/rev")
# All rev nodes in document
rev_nodes <- xml_find_all(resp_xml, xpath = "//rev")
xml_text(rev_nodes)
## [1] "'''Hadley Mary Helen Wickham III''' is a [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\"about\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\n\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\"about\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>"
## [2] "'''Hadley Wickham''' is a [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\"about\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\n\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\"about\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>"
## [3] "'''Hadley Wickham''' is a [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\"about\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\n\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\"about\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>"
## [4] "'''Hadley Wickham''' is a [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\"about\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\n\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\"about\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>"
## [5] "'''Hadley Wickham''' is a [[statistician]] from [[New Zealand]] who is currently Chief Scientist at [[RStudio]]<ref>{{cite web|url=http://washstat.org/wss1310.shtml |title=Washington Statistical Society October 2013 Newsletter |publisher=Washstat.org |date= |accessdate=2014-02-12}}</ref><ref>{{cite web|url=http://news.idg.no/cw/art.cfm?id=F66B12BB-D13E-94B0-DAA22F5AB01BEFE7 |title=60+ R resources to improve your data skills ( - Software ) |publisher=News.idg.no |date= |accessdate=2014-02-12}}</ref> and an [[Professors_in_the_United_States#Adjunct_professor|adjunct]] [[Assistant Professor]] of statistics at [[Rice University]].<ref name=\"about\">{{cite web|url=http://www.rstudio.com/about/ |title=About - RStudio |accessdate=2014-08-13}}</ref> He is best known for his development of open-source statistical analysis software packages for [[R (programming language)]] that implement logics of [[data visualisation]] and data transformation. Wickham completed his undergraduate studies at the [[University of Auckland]] and his PhD at [[Iowa State University]] under the supervision of Di Cook and Heike Hoffman.<ref>{{cite web|URL=http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html |title= The R-Files: Hadley Wickham}}</ref> In 2006 he was awarded the [[John_Chambers_(statistician)|John Chambers]] Award for Statistical Computing for his work developing tools for data reshaping and visualisation.<ref>{{cite web|url=http://stat-computing.org/awards/jmc/winners.html |title=John Chambers Award Past winners|publisher=ASA Sections on Statistical Computing, Statistical Graphics,|date= |accessdate=2014-08-12}}</ref>\n\nHe is a prominent and active member of the [[R (programming language)|R]] user community and has developed several notable and widely used packages including [[ggplot2]], plyr, dplyr, and reshape2.<ref name=\"about\" /><ref>{{cite web|url=http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/ |title=Top 100 R Packages for 2013 (Jan-May)! |publisher=R-statistics blog |date= |accessdate=2014-08-12}}</ref>"
Get tag attributes with xml_attrs() (all attributes of a node) or xml_attr() (a single named attribute).
rev_nodes <- xml_find_all(resp_xml, xpath = "//rev") # all nodes
first_rev_node <- xml_find_first(resp_xml, "//rev") # first rev node
xml_attrs(first_rev_node) # list attrs
## user anon timestamp
## "214.28.226.251" "" "2015-01-14T17:12:45Z"
## contentformat contentmodel comment
## "text/x-wiki" "wikitext" ""
## space
## "preserve"
xml_attr(rev_nodes, "user") # get attr value
## [1] "214.28.226.251" "73.183.151.193" "FeanorStar7" "KasparBot"
## [5] "Spkal"
Here is all of this pulled together into a function.
get_revision_history <- function(article_title){
  resp <- GET(url = "https://en.wikipedia.org/w/api.php",
              query = list(action = "query",
                           titles = article_title,
                           prop = "revisions",
                           rvprop = "timestamp|user|comment|content",
                           rvlimit = "5",
                           format = "xml",
                           rvdir = "newer",
                           rvstart = "2015-01-14T17:12:45Z",
                           rvsection = "0"))
  resp_xml <- read_xml(content(resp, "text")) # convert response text to xml
  rev_nodes <- xml_find_all(resp_xml, "//rev") # Find revision nodes
  # Parse user names, timestamps, and content
  user <- xml_attr(rev_nodes, "user")
  timestamp <- xml_attr(rev_nodes, "timestamp") %>% readr::parse_datetime()
  content <- xml_text(rev_nodes)
  # Return a data frame
  data.frame(user = user,
             timestamp = timestamp,
             content = substr(content, 1, 40))
}
# Call function for "Hadley Wickham"
get_revision_history("Hadley Wickham")
## user timestamp
## 1 214.28.226.251 2015-01-14 17:12:45
## 2 73.183.151.193 2015-01-15 15:49:34
## 3 FeanorStar7 2015-01-24 16:34:31
## 4 KasparBot 2015-04-26 19:18:17
## 5 Spkal 2015-05-06 18:24:57
## content
## 1 '''Hadley Mary Helen Wickham III''' is a
## 2 '''Hadley Wickham''' is a [[statisticia
## 3 '''Hadley Wickham''' is a [[statisticia
## 4 '''Hadley Wickham''' is a [[statisticia
## 5 '''Hadley Wickham''' is a [[statisticia
If a web site does not have an API, you can still collect data by web scraping, which searches for tags in the page's HTML. Use a selector tool or browser plug-in (such as SelectorGadget) to identify the tags that mark the content you want. rvest is a web scraping package: read an HTML page with read_html(), then extract content with html_node() by specifying an XPath.
library(rvest)
(test_xml <- read_html("https://en.wikipedia.org/wiki/Hadley_Wickham"))
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
(test_node <- html_node(test_xml, xpath = "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"vcard\", \" \" ))]"))
## {html_node}
## <table class="infobox biography vcard" style="width:22em">
## [1] <tbody>\n<tr><th colspan="2" style="text-align:center;font-size:125% ...
The information stored for an HTML node includes the tag name (html_name()), tag attributes (html_attr()), and tag contents (html_text()).
html_name(test_node)
## [1] "table"
(test_node_element <- html_node(x = test_node, xpath = "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"fn\", \" \" ))]"))
## {html_node}
## <div class="fn" style="display:inline">
html_text(test_node_element)
## [1] "Hadley Wickham"
If the node is a table (as it is in this example), you can convert it directly into a data frame with html_table().
html_name(test_node)
## [1] "table"
test_df <- html_table(test_node)
test_df[1, ] <- colnames(test_df)        # keep the original header row as data
colnames(test_df) <- c("key", "value")   # give the columns generic names
test_df <- subset(test_df, !(key == "")) # drop rows with empty keys
test_df
## key
## 1 Hadley Wickham
## 2 Born
## 3 Residence
## 4 Alma mater
## 5 Known for
## 6 Awards
## 7 Scientific career
## 8 Fields
## 9 Thesis
## 10 Doctoral advisors
## 11 Doctoral students
## value
## 1 Hadley Wickham
## 2 (1979-10-14) 14 October 1979 (age 40)Hamilton, New Zealand
## 3 United States
## 4 Iowa State University, University of Auckland
## 5 R programming language packages
## 6 John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 7 Scientific career
## 8 Statistics\nData science\nR (programming language)
## 9 Practical tools for exploring data and models (2008)
## 10 Di Cook\nHeike Hofmann
## 11 Garrett Grolemund
CSS adds design information to HTML that tells the browser how to display the content. You can leverage these design hooks to identify content on the page: CSS scraping selects nodes by class (prefixed with ".") or id (prefixed with "#"). It works much like XPath, but will often return multiple elements.
# Select the table elements
html_nodes(test_xml, css = "table")
## {xml_nodeset (2)}
## [1] <table class="infobox biography vcard" style="width:22em"><tbody>\n< ...
## [2] <table class="nowraplinks hlist navbox-inner" style="border-spacing: ...
# Select elements with class = "infobox" (notice the .)
test_ib <- html_nodes(test_xml, css = ".infobox")
html_text(test_ib)
## [1] "Hadley WickhamBorn (1979-10-14) 14 October 1979 (age 40)Hamilton, New ZealandResidenceUnited StatesAlma materIowa State University, University of AucklandKnown forR programming language packagesAwards\nJohn Chambers Award (2006)\nFellow of the American Statistical Association (2015)Scientific careerFields\nStatistics\nData science\nR (programming language)ThesisPractical tools for exploring data and models (2008)Doctoral advisors\nDi Cook\nHeike HofmannDoctoral studentsGarrett Grolemund\n"
# Select elements with id = "firstHeading" (notice the #)
test_fh <- html_nodes(test_xml, css = "#firstHeading")
html_text(test_fh)
## [1] "Hadley Wickham"
Here is everything wrapped in a function.
library(httr)
library(rvest)
library(xml2)
get_infobox <- function(title){
  base_url <- "https://en.wikipedia.org/w/api.php"
  # Build the query for the requested page title
  query_params <- list(action = "parse",
                       page = title,
                       format = "xml")
  resp <- GET(url = base_url, query = query_params)
  resp_xml <- content(resp)
  # The parsed page HTML is stored as text inside the XML response
  page_html <- read_html(xml_text(resp_xml))
  # Extract the infobox and the page's full name
  infobox_element <- html_node(x = page_html, css = ".infobox")
  page_name <- html_node(x = infobox_element, css = ".fn")
  page_title <- html_text(page_name)
  # Convert the infobox table to a data frame and clean it up
  wiki_table <- html_table(infobox_element)
  colnames(wiki_table) <- c("key", "value")
  cleaned_table <- subset(wiki_table, !wiki_table$key == "")
  name_df <- data.frame(key = "Full name", value = page_title)
  wiki_table <- rbind(name_df, cleaned_table)
  wiki_table
}
# Test get_infobox with "Hadley Wickham"
get_infobox(title = "Hadley Wickham")
## key
## 1 Full name
## 2 Born
## 3 Residence
## 4 Alma mater
## 5 Known for
## 6 Awards
## 7 Scientific career
## 8 Fields
## 9 Thesis
## 10 Doctoral advisors
## 11 Doctoral students
## value
## 1 Hadley Wickham
## 2 (1979-10-14) 14 October 1979 (age 40)Hamilton, New Zealand
## 3 United States
## 4 Iowa State University, University of Auckland
## 5 R programming language packages
## 6 John Chambers Award (2006)\nFellow of the American Statistical Association (2015)
## 7 Scientific career
## 8 Statistics\nData science\nR (programming language)
## 9 Practical tools for exploring data and models (2008)
## 10 Di Cook\nHeike Hofmann
## 11 Garrett Grolemund
# Try get_infobox with "Ross Ihaka"
get_infobox(title = "Ross Ihaka")
## key
## 1 Full name
## 2 Ihaka at the 2010 New Zealand Open Source Awards
## 3 Alma mater
## 4 Known for
## 5 Awards
## 6 Scientific career
## 7 Fields
## 8 Institutions
## 9 Thesis
## value
## 1 Ross Ihaka
## 2 Ihaka at the 2010 New Zealand Open Source Awards
## 3 University of AucklandUniversity of California, Berkeley
## 4 R programming language
## 5 Pickering Medal (2008)
## 6 Scientific career
## 7 Statistical Computing
## 8 University of Auckland
## 9 Ruaumoko (1985)
# Try get_infobox with "Grace Hopper"
get_infobox(title = "Grace Hopper")
## key
## 1 Full name
## 2 Rear Admiral Grace M. Hopper, 1984
## 3 Born
## 4 Died
## 5 Alma mater
## 6 Military career
## 7 Place of burial
## 8 Allegiance
## 9 Service/branch
## 10 Years of service
## 11 Rank
## 12 Awards
## value
## 1 Grace Murray Hopper
## 2 Rear Admiral Grace M. Hopper, 1984
## 3 (1906-12-09)December 9, 1906New York City, U.S.
## 4 January 1, 1992(1992-01-01) (aged 85)Arlington, Virginia, U.S.
## 5 Vassar College and Yale University
## 6 Military career
## 7 Arlington National Cemetery
## 8 United States of America
## 9 United States Navy
## 10 1943–1966, 1967–1971, 1972–1986
## 11 Rear admiral (lower half)
## 12 Defense Distinguished Service Medal Legion of Merit Meritorious Service Medal American Campaign Medal World War II Victory Medal National Defense Service Medal Armed Forces Reserve Medal with two Hourglass Devices Naval Reserve Medal Presidential Medal of Freedom (posthumous)
The Cleveland Museum of Art Open Access API provides data on more than 61,000 artworks in its collection.
library(ggplot2)
library(dplyr)
library(jsonlite)
library(httr)
resp <- GET(url = "https://openaccess-api.clevelandart.org/api/artworks/",
            query = list(department = "American Painting and Sculpture",
                         type = "Painting",
                         q = "Sargent"))
cont <- content(resp)$data # one list per result
# Pre-define the data frame columns, then fill one row per result
df <- data.frame(id = as.integer(NA),
                 title = as.character(NA),
                 creation_date = as.character(NA),
                 url = as.character(NA),
                 fun_fact = as.character(NA),
                 stringsAsFactors = FALSE)
for (i in seq_along(cont)) {
  df[i, 1] <- cont[[i]]$id
  df[i, 2] <- cont[[i]]$title
  df[i, 3] <- cont[[i]]$creation_date
  df[i, 4] <- cont[[i]]$url
  df[i, 5] <- ifelse(is.null(cont[[i]]$fun_fact), NA, cont[[i]]$fun_fact)
}
head(df)
## id title creation_date
## 1 160289 Portrait of Lisa Colt Curtis 1898
## 2 109250 The Cossack undated
## 3 109971 Head of a Girl before 1929
## 4 170082 Self-Portrait with Five Muses c. 1880
## 5 121261 The Violin Player c. 1894
## url
## 1 https://clevelandart.org/art/1998.168
## 2 https://clevelandart.org/art/1927.397
## 3 https://clevelandart.org/art/1928.579
## 4 https://clevelandart.org/art/2012.30
## 5 https://clevelandart.org/art/1942.1133
## fun_fact
## 1 Elegant and poised in her silk gown, Lisa Colt Curtis does not directly engage the viewer. Sargent's portrait seems to preserve the moment just before she steps forward to greet guests.
## 2 <NA>
## 3 <NA>
## 4 This painting includes a self-portrait of the artist with the muses of painting, sculpture, music, and blacksmithing.
## 5 <NA>
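A tidier alternative to growing the data frame row by row is to build one small data frame per result and bind them together; the sketch below assumes the same cont list and uses dplyr::bind_rows().
rows <- lapply(cont, function(x) {
  data.frame(id = x$id,
             title = x$title,
             creation_date = x$creation_date,
             url = x$url,
             fun_fact = ifelse(is.null(x$fun_fact), NA, x$fun_fact),
             stringsAsFactors = FALSE)
})
df2 <- dplyr::bind_rows(rows)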
Working with Web Data in R. DataCamp. https://campus.datacamp.com/courses/working-with-web-data-in-r.
Best practices for API packages. https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html.