Introduction to APIs and the Protocol of the Internet

An Application Programming Interface (API) is a set of procedures that lets a program read and work with a website’s data

HTTP

The main protocol of the web is the Hypertext Transfer Protocol (HTTP)

  • Most APIs use HTTP
  • HTTP works through the Request-Response Cycle
    • The client sends a request to the server to perform a task
    • The server sends a response to the client indicating whether it could perform the task
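
The cycle above can be sketched without any network access. The example below is a toy, in-memory stand-in for a server, not how a real web server is written; the names toy_server and handle_request are hypothetical and exist only for illustration.

```r
# Toy "server": a few resources keyed by their path (hypothetical data store).
toy_server <- list("/universities/127060" = "University of Denver")

# The server receives a request (a method plus a path) and returns a
# response (a status code plus a body).
handle_request <- function(request) {
  if (request$method != "GET") {
    return(list(status = 405L, body = NULL))  # this toy server only supports GET
  }
  resource <- toy_server[[request$path]]
  if (is.null(resource)) {
    return(list(status = 404L, body = NULL))  # nothing stored at that path
  }
  list(status = 200L, body = resource)
}

# The "client" sends two requests and inspects the responses.
found   <- handle_request(list(method = "GET", path = "/universities/127060"))
missing <- handle_request(list(method = "GET", path = "/universities/000000"))

found$status
found$body
missing$status
```

The first request succeeds because the path exists on the toy server; the second fails because it does not, and the status code is how the client finds out.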

HTTP Requests and Responses

  • Uniform Resource Locator (URL): A unique address for a resource such as a webpage, image, video, or message
  • Methods or Verbs: Tell the server which task to perform
    • GET: Asks the server to retrieve something
    • POST: Asks the server to create something
    • PUT: Asks the server to update or replace something
    • DELETE: Asks the server to delete something
  • Headers: Metadata about the request such as the date, connection type, and content type
  • Body: The data the client sends to the server, such as the content to create or update
  • Status Code: A three-digit number in the server’s response that tells the client whether the request was successful
    • 200: OK, the request succeeded
    • 404: Not found
    • 403: Forbidden, permission denied
    • 500: Internal server error, a generic failure code
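
As a rough illustration of two of the pieces above, the sketch below splits a URL into its components and looks up the meaning of a few status codes. It assumes the httr package is installed and uses its parse_url() and http_status() functions; no request is actually sent. The URL is the one used in the httr examples later in these notes.

```r
library(httr)

# The API URL used later in these notes, split into its components.
api_url <- paste0(
  "https://ed-data-portal.urban.org/api/v1/college-university/",
  "ipeds/admissions-enrollment/2016/?year=2016&unitid=127060"
)
parts <- parse_url(api_url)

parts$scheme    # protocol
parts$hostname  # server that receives the request
parts$path      # resource on that server
parts$query     # named list of query parameters

# Look up the category and reason phrase for the status codes listed above.
codes <- c(200L, 403L, 404L, 500L)
status_info <- lapply(codes, http_status)
names(status_info) <- codes
vapply(status_info, function(s) s$message, character(1))
```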

API Authentication

  • Basic Authentication: Requires only a username and password.
    • The client passes the two credentials to the server in an HTTP header.
    • The server searches its stored credentials for a match.
    • If there is no match, the server responds with a status code of 401.
    • A potential problem is that the username and password are the account owner’s own login, so anyone given them gets the owner’s full access.
  • Key Authentication: Requires a unique key for the API to be accessed
    • The key is a long string of characters that is distinct from the owner’s login credentials
    • The key can be used to offer restricted access to the user
    • The key can be put into the Authorization header or into the URL
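
A minimal sketch of what the client actually sends for these two schemes, assuming the jsonlite package is installed (its base64_enc() function does the encoding). The username and password are the example pair from the HTTP Basic authentication specification; the API key, header name, and example.com URL are hypothetical placeholders.

```r
library(jsonlite)

# Basic authentication: "username:password", base64-encoded, is placed in the
# Authorization header. Base64 is an encoding, not encryption, so the
# credentials are only protected if the connection itself uses HTTPS.
user     <- "Aladdin"
password <- "open sesame"
basic_header <- paste("Basic", base64_enc(paste0(user, ":", password)))
basic_header

# Key authentication: the key travels in a header or in the URL.
# The key, the header name, and the URL are hypothetical placeholders.
api_key    <- "hypothetical-key-123"
key_header <- c("X-API-Key" = api_key)
key_url    <- paste0("https://api.example.com/data?api_key=", api_key)
key_url
```

In practice httr builds these headers for you: authenticate(user, password) handles basic authentication and add_headers() attaches a custom header to a request.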

  • Open Authorization (OAuth): Automates key exchange by allowing access via a third party
    • All a user has to do is enter credentials; the client and server then communicate to get a valid key
    • The user has the client send a request to the server and logs in to the server
    • The server responds by sending the user back to the client along with a code
    • The client exchanges the code and its secret key for an access token and uses the token to fetch data from the server
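
The OAuth steps above map onto a handful of httr functions. The sketch below only sets up the pieces and wraps the exchange in a function, because the exchange opens a browser for the user to log in and needs an app registered with the server. It assumes httr is installed; the app name, key, and secret are hypothetical placeholders, and fetch_rate_limit is an illustrative helper, not part of httr.

```r
library(httr)

# 1. The server's OAuth endpoints (httr ships a few well-known ones).
endpoint <- oauth_endpoints("github")

# 2. The client application, identified by a key and a secret that the server
#    issued when the app was registered. Placeholder values only.
app <- oauth_app(
  "hypothetical-app",
  key    = "hypothetical-client-id",
  secret = "hypothetical-client-secret"
)

# 3. The exchange: the user logs in, the server returns a code, and the client
#    trades the code plus its secret for an access token. Calling this
#    requires an interactive session and real credentials.
fetch_rate_limit <- function() {
  token <- oauth2.0_token(endpoint, app)
  GET("https://api.github.com/rate_limit", config(token = token))
}

endpoint$authorize  # where the user is sent to log in
endpoint$access     # where the code is exchanged for a token
```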

Internet data formats - JSON and XML

JSON

JavaScript Object Notation (JSON) is a human-readable internet data format.

  • Most popular internet data format for APIs
  • JSON is programming language agnostic
  • There are two pieces to JSON: keys and values
    • Keys are the attributes of the object being described
    • Each attribute has a corresponding value

Example: IPEDS data in JSON format

  • Object: ‘universities’
  • Keys: ‘unitid’ and ‘instnm’
{"universities": [
  {"unitid": "127060", "instnm": "University of Denver"},
  {"unitid": "126678", "instnm": "Colorado College"},
  {"unitid": "126614", "instnm": "University of Colorado Boulder"}
]}
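
To show how keys and values turn into R data, the sketch below parses the same three records with jsonlite (assumed to be installed). An array of objects that share the same keys becomes a data frame with one column per key.

```r
library(jsonlite)

json_txt <- '{"universities": [
  {"unitid": "127060", "instnm": "University of Denver"},
  {"unitid": "126678", "instnm": "Colorado College"},
  {"unitid": "126614", "instnm": "University of Colorado Boulder"}
]}'

# validate() reports whether the text is well-formed JSON.
validate(json_txt)

# The top-level object becomes a list; its "universities" array becomes a
# data frame with one row per record.
parsed <- fromJSON(json_txt)
universities <- parsed$universities
universities$instnm
```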

XML

The Extensible Markup Language (XML) is a common internet data format that has similarities to both HTML and JSON.

  • An older format that is falling out of style
  • Programming language agnostic
  • It is both human and machine readable as plain text.
  • It is hierarchical in nature: elements can contain other elements.
  • The main building block used to structure the data is called a node.
  • It starts with a root node, and inside the root node are child nodes.
  • Each node has a name, which tells us the attribute being described

Differences between XML and other Internet Data Formats

  • Unlike JSON, XML uses tags and has no array syntax. XML is not as easy for a machine to parse as JSON because it is much more verbose.
  • It is similar to HTML because both are markup languages.
  • In HTML, tags are predefined, whereas in XML they are not.
  • XML was developed to describe data and HTML was developed to display data.

Example: IPEDS data in XML format

  • Root node: ‘universities’
  • Child nodes: ‘university’, ‘unitid’, ‘instname’
<universities>
  <university>
    <unitid>127060</unitid><instname>University of Denver</instname>
  </university>
  <university>
    <unitid>126678</unitid><instname>Colorado College</instname>
  </university>
  <university>
    <unitid>126614</unitid><instname>University of Colorado Boulder</instname>
  </university>
</universities>
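
The node structure can be walked from R as well. The sketch below assumes the xml2 package is installed; xml2 is not used elsewhere in these notes and is just one reasonable choice for parsing XML. It reads the same three records and collects the child nodes into a data frame.

```r
library(xml2)  # assumption: xml2 is installed; it is not part of the original notes

xml_txt <- "<universities>
  <university><unitid>127060</unitid><instname>University of Denver</instname></university>
  <university><unitid>126678</unitid><instname>Colorado College</instname></university>
  <university><unitid>126614</unitid><instname>University of Colorado Boulder</instname></university>
</universities>"

doc <- read_xml(xml_txt)

# Name of the root node.
xml_name(xml_root(doc))

# Every 'university' node, then the text of its 'unitid' and 'instname' children.
nodes <- xml_find_all(doc, "//university")
universities <- data.frame(
  unitid   = xml_text(xml_find_first(nodes, "unitid")),
  instname = xml_text(xml_find_first(nodes, "instname")),
  stringsAsFactors = FALSE
)
universities
```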

R functions and packages to access Internet Data

curl and httr

  • curl is a package that provides the functionality of cURL in R.
    • cURL (“Client URL”) is a library and command-line tool to transfer data using various protocols
      • FTP (file transfer protocol), HTTP, IMAP (internet message access protocol) among others
      • Sends and gets files using the URL syntax
      • Supports downloading and streaming data
  • httr is a wrapper for the curl package that adds functionality for working with modern web APIs
    • Functions for the HTTP verbs
    • Functions to modify requests
    • Support for OAuth

Useful curl functions

  • curl(url): Curl connection interface
    • url: A character string of the URL
  • curl_download(url, destfile, ...): Download a file to disk
    • url: A character string of the URL
    • destfile: A character string giving the path where the file is saved

curl examples

# Create a connection to DU IRA webpage

suppressWarnings(library(curl))

# DU IRA URL
du_ira <- "https://www.du.edu/ir/"

# Create and Open a Connection
con <- curl(du_ira)
open(con)

# Read the first 10 lines of the connection
out <- readLines(con, n = 10)

# Close the connection once the lines have been read
close(con)

# Print the output
cat(out, sep = "\n")
<!DOCTYPE HTML><html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr" id="du-edu" class="no-js"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Institutional Research &amp; Analysis | University of Denver</title>

<meta name="Description" content="The Office of Institutional Research &amp; Analysis is the central source for information about the University of Denver. We serve the University's vision, values, mission, and goals by analyzing and reporting institutional data to inform University and unit-level planning and development." />
<meta name="Keywords" content="IR, IRA, institutional research, analysis, strategic planning, decision support" />

<meta name="author" content="University of Denver" />

# Download the CCIHE 2018 data file and read it into R

suppressWarnings(library(curl))
library(readxl)

# Create a temporary file
tmp <- tempfile()

# CCIHE 2018 Data file URL
ccihe_2018 <- 'http://carnegieclassifications.iu.edu/downloads/CCIHE2018-PublicData.xlsx'

# Download the file to the temporary file
curl_download(ccihe_2018, tmp)

# Read the spreadsheet
ccihe_2018_data <- suppressWarnings(read_excel(tmp, sheet = 'Data'))

# Show data
print(ccihe_2018_data)
# A tibble: 4,324 x 97
   UNITID NAME  CITY  STABBR CC2000 BASIC2005 BASIC2010 BASIC2015 BASIC2018
    <dbl> <chr> <chr> <chr>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 177834 A T ~ Kirk~ MO         52        25        25        25        25
 2 180203 Aani~ Harl~ MT         60        33        33        33        33
 3 222178 Abil~ Abil~ TX         21        19        18        18        18
 4 138558 Abra~ Tift~ GA         40         2        12        23        23
 5 488031 Abra~ Los ~ CA         -3        -3        -3        -2        31
 6 172866 Acad~ Bloo~ MN         40        14        23        23        28
 7 451079 Acad~ Gain~ FL         -3        -3        26        26        26
 8 457271 Acad~ Los ~ CA         -3        -3        -3        24        24
 9 412173 Acad~ West~ FL         -3        -3        -3        10        10
10 108232 Acad~ San ~ CA         56        30        30        30        18
# ... with 4,314 more rows, and 88 more variables: IPUG2018 <dbl>,
#   IPGRAD2018 <dbl>, ENRPROFILE2018 <dbl>, UGPROFILE2018 <dbl>,
#   SIZESET2018 <dbl>, CCE2015 <dbl>, OBEREG <dbl>, SECTOR <dbl>,
#   ICLEVEL <dbl>, CONTROL <dbl>, LOCALE <dbl>, LANDGRNT <dbl>,
#   MEDICAL <dbl>, HBCU <dbl>, TRIBAL <dbl>, HSI <dbl>, MSI <dbl>,
#   WOMENS <dbl>, COPLAC <dbl>, CUSU <dbl>, CUMU <dbl>, ASSOCDEG <dbl>,
#   BACCDEG <dbl>, MASTDEG <dbl>, DOCRSDEG <dbl>, DOCPPDEG <dbl>,
#   DOCOTHDEG <dbl>, TOTDEG <dbl>, `S&ER&D` <dbl>, `NONS&ER&D` <dbl>,
#   PDNFRSTAFF <dbl>, FACNUM <dbl>, HUM_RSD <dbl>, SOCSC_RSD <dbl>,
#   STEM_RSD <dbl>, OTHER_RSD <dbl>, `DRSA&S` <dbl>, DRSPROF <dbl>,
#   `OGRDA&S` <dbl>, OGRDPROF <dbl>, `A&SBADEG` <dbl>, PROFBADEG <dbl>,
#   ASC1C2TRNS <dbl>, ASC1C2CRTC <dbl>, FALLENR16 <dbl>, ANENR1617 <dbl>,
#   FALLENR17 <dbl>, FALLFTE17 <dbl>, UGTENR17 <dbl>, GRTENR17 <dbl>,
#   UGDSFTF17 <dbl>, UGDSPTF17 <dbl>, UGNDFT17 <dbl>, UGNDPT17 <dbl>,
#   GRFTF17 <dbl>, GRPTF17 <dbl>, UGN1STTMFT17 <dbl>, UGN1STTMPT17 <dbl>,
#   UGNTRFT17 <dbl>, UGNTRPT17 <dbl>, FAITHFLAG <dbl>, OTHSFFLAG <dbl>,
#   NUMCIP2 <dbl>, LRGSTCIP2 <dbl>, PCTLRGST <dbl>, UGCIP4PR <dbl>,
#   GRCIP4PR <dbl>, COEXPR <dbl>, PCTCOEX <dbl>, DOCRESFLAG <dbl>,
#   MAXGPEDUC <dbl>, MAXGPBUS <dbl>, MAXGPOTH <dbl>, NGCIP2PXDR <dbl>,
#   NGCIP2DR <dbl>, ROOMS <dbl>, ACTCAT <dbl>, NSAT <dbl>, NACT <dbl>,
#   NSATACT <dbl>, SATV25 <dbl>, SATM25 <dbl>, SATCMB25 <dbl>,
#   SATACTEQ25 <dbl>, ACTCMP25 <dbl>, ACTFINAL <dbl>, ...96 <lgl>,
#   ...97 <dbl>

Useful httr functions

  • GET(), HEAD(), PUT(), POST(), DELETE(): HTTP verbs
  • headers(resp): Extract the response headers
  • content(resp): Extract the response content
  • oauth_endpoints(): Popular OAuth endpoints
  • oauth_app(): Create an OAuth app
  • config(): Set curl options such as authentication
  • oauth1.0_token(), oauth2.0_token(): Generate an OAuth 1.0 or OAuth 2.0 token

httr examples

suppressWarnings(library(httr))

Attaching package: 'httr'
The following object is masked from 'package:curl':

    handle_reset
url_str <- 
  paste0("https://ed-data-portal.urban.org/api/v1/college-university/",
        "ipeds/admissions-enrollment/2016/?year=2016&unitid=127060")

resp <- GET(url_str); print(resp)
Response [https://ed-data-portal.urban.org/api/v1/college-university/ipeds/admissions-enrollment/2016/?year=2016&unitid=127060]
  Date: 2019-08-19 17:06
  Status: 200
  Content-Type: application/json
  Size: 564 B
headers(resp); content(resp)$results
$server
[1] "nginx/1.15.12"

$date
[1] "Mon, 19 Aug 2019 17:06:46 GMT"

$`content-type`
[1] "application/json"

$`content-length`
[1] "564"

$connection
[1] "keep-alive"

$vary
[1] "Origin, Cookie"

$`x-frame-options`
[1] "SAMEORIGIN"

$`cache-control`
[1] "max-age=36288000"

$allow
[1] "GET, HEAD, OPTIONS"

$expires
[1] "Fri, 09 Oct 2020 16:40:46 GMT"

attr(,"class")
[1] "insensitive" "list"       
[[1]]
[[1]]$year
[1] 2016

[[1]]$fips
[1] 8

[[1]]$unitid
[1] 127060

[[1]]$sex
[1] 1

[[1]]$number_applied
[1] 8939

[[1]]$number_admitted
[1] 4656

[[1]]$number_enrolled_ft
[1] 613

[[1]]$number_enrolled_pt
[1] 12

[[1]]$number_enrolled_total
[1] 625


[[2]]
[[2]]$year
[1] 2016

[[2]]$fips
[1] 8

[[2]]$unitid
[1] 127060

[[2]]$sex
[1] 2

[[2]]$number_applied
[1] 11383

[[2]]$number_admitted
[1] 6211

[[2]]$number_enrolled_ft
[1] 764

[[2]]$number_enrolled_pt
[1] 10

[[2]]$number_enrolled_total
[1] 774


[[3]]
[[3]]$year
[1] 2016

[[3]]$fips
[1] 8

[[3]]$unitid
[1] 127060

[[3]]$sex
[1] 99

[[3]]$number_applied
[1] 20322

[[3]]$number_admitted
[1] 10867

[[3]]$number_enrolled_ft
[1] 1377

[[3]]$number_enrolled_pt
[1] 22

[[3]]$number_enrolled_total
[1] 1399
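
The request above pastes the query string together by hand. As an alternative sketch, httr’s modify_url() can assemble and URL-encode the query from a named list, and stop_for_status() turns a failed request into an R error before the content is parsed. The helper name get_results is hypothetical, and it is defined here but not called because calling it needs network access.

```r
library(httr)

base_url <- paste0(
  "https://ed-data-portal.urban.org/api/v1/college-university/",
  "ipeds/admissions-enrollment/2016/"
)

# Build the same URL from its parts; modify_url() encodes the query string.
url_str <- modify_url(base_url, query = list(year = "2016", unitid = "127060"))
url_str

# Hypothetical helper: send the request, stop with an R error on a 4xx or 5xx
# status, and return the parsed "results" element of the response.
get_results <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)
  content(resp)$results
}
```

GET() also accepts the same query argument directly, so GET(base_url, query = list(...)) is another way to write the request.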

jsonlite

jsonlite is a package to parse and generate JSON

Useful jsonlite functions

  • fromJSON() and toJSON(): Convert JSON to R objects and R objects to JSON
  • rbind_pages(pages): Combine a list of dataframes into a single dataframe.
    • Needed when a JSON API limits the amount of data per request
  • prettify(json): Adds indentation to a JSON string to make it readable
suppressWarnings(library(tibble))
suppressWarnings(library(httr))
suppressWarnings(library(jsonlite))

url_str <- 
  paste0("https://ed-data-portal.urban.org/api/v1/college-university/",
        "ipeds/admissions-enrollment/2016/?year=2016&unitid=127060")

resp <- GET(url_str)

du_admissions <- as_tibble(fromJSON(content(resp, "text", encoding = "UTF-8"))$results)
print(du_admissions)
# A tibble: 3 x 9
   year  fips unitid   sex number_applied number_admitted number_enrolled~
  <int> <int>  <int> <int>          <int>           <int>            <int>
1  2016     8 127060     1           8939            4656              613
2  2016     8 127060     2          11383            6211              764
3  2016     8 127060    99          20322           10867             1377
# ... with 2 more variables: number_enrolled_pt <int>,
#   number_enrolled_total <int>
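
When an API limits the amount of data per request, each request returns one page and the pages have to be stacked afterwards. The sketch below imitates that offline with jsonlite (assumed to be installed): the two JSON strings stand in for two pages of a response, using the applicant counts from the output above, and how the rows are split into pages is an assumption for illustration.

```r
library(jsonlite)

# Two stand-in "pages", as an API that pages its results might return them.
page1 <- fromJSON('[{"unitid": 127060, "sex": 1, "number_applied": 8939},
                    {"unitid": 127060, "sex": 2, "number_applied": 11383}]')
page2 <- fromJSON('[{"unitid": 127060, "sex": 99, "number_applied": 20322}]')

# rbind_pages() stacks a list of data frames into one.
all_rows <- rbind_pages(list(page1, page2))
nrow(all_rows)

# toJSON() goes the other way, and prettify() adds indentation.
prettify(toJSON(all_rows[1, ]))
```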