This package is under active development and the API is subject to change! Feedback is very much welcome. To grab the latest version:
library(devtools)
install_github("jeroenooms/curl")
The curl package implements flexible, low-level bindings to libcurl for R. The package supports retrieving data in-memory, downloading to disk, or streaming using the R “connection” interface. Some knowledge of curl is recommended to use this package. If you are looking for a more user-friendly HTTP client, you are better off using httr, which extends curl with HTTP-specific tools and logic.
The curl package implements three ways to retrieve data from a URL. The curl_perform function is a synchronous interface which returns a list with the content of the server response.
req <- curl_perform("https://httpbin.org/get")
str(req)
List of 6
$ url : chr "https://httpbin.org/get"
$ status_code: int 200
$ headers : raw [1:220] 48 54 54 50 ...
$ content : raw [1:231] 7b 0a 20 20 ...
$ modified : POSIXct[1:1], format: NA
$ times : Named num [1:6] 0 0.00523 0.08579 0.38715 0.46444 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
cat(rawToChar(req$content))
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "r/curl/jeroen"
},
"origin": "167.220.26.164",
"url": "https://httpbin.org/get"
}
The curl_perform interface is the easiest and most powerful interface for building API clients. However, because it is fully in-memory, it is not suitable for downloading really large files. If you are expecting 100G of data, you probably need one of the other interfaces.
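Because curl_perform returns the headers as a raw vector, a client usually has to decode them before use. The following is a minimal sketch of one way to do that in base R. The helper name parse_header_lines is hypothetical and not part of the curl package, and the example input is a hand-written stand-in for req$headers so that the snippet runs without a network connection.

```r
# Hypothetical helper (not part of the curl package): split a raw header
# block into one character string per header line.
parse_header_lines <- function(raw_headers) {
  txt <- rawToChar(raw_headers)
  # HTTP header lines are separated by CRLF
  lines <- strsplit(txt, "\r\n", fixed = TRUE)[[1]]
  # Drop the empty line that terminates the header block
  lines[nzchar(lines)]
}

# Hand-written stand-in for req$headers (assumption, not real server output)
example_headers <- charToRaw(
  "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n"
)
header_lines <- parse_header_lines(example_headers)
print(header_lines)
```

With a real response, the same helper would presumably be called as parse_header_lines(req$headers).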
The second method is curl_download, which has been designed as a drop-in replacement for download.file in base R. It writes the response straight to disk, which is useful for downloading (large) files.
tmp <- tempfile()
curl_download("https://httpbin.org/get", tmp)
cat(readLines(tmp), sep = "\n")
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "r/curl/jeroen"
},
"origin": "167.220.26.164",
"url": "https://httpbin.org/get"
}
The most flexible interface is the curl function, which has been designed as a drop-in replacement for base url. It will create a so-called connection object, which allows for incremental (asynchronous) reading of the response.
con <- curl("https://httpbin.org/get")
open(con)
# Get 3 lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
{
"args": {},
"headers": {
# Get 3 more lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
# Get remaining lines
out <- readLines(con)
close(con)
cat(out, sep = "\n")
"User-Agent": "r/curl/jeroen"
},
"origin": "167.220.26.164",
"url": "https://httpbin.org/get"
}
The example shows how to use readLines on an opened connection to read n lines at a time. Similarly, readBin is used to read n bytes at a time for stream parsing binary data.
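The readBin pattern mentioned above can be sketched as a simple loop that pulls fixed-size chunks until the connection is exhausted. This is only an illustration: the helper name count_bytes is hypothetical, and a base R rawConnection is used as a stand-in for an opened curl() connection so that the snippet runs offline. The assumption is that a curl connection opened in binary mode would be read the same way.

```r
# Hypothetical helper: read a binary connection chunk by chunk and return
# the total number of bytes seen. A real parser would process each chunk
# instead of only counting it.
count_bytes <- function(con, chunk_size = 1024L) {
  total <- 0L
  repeat {
    chunk <- readBin(con, what = raw(), n = chunk_size)
    # readBin returns a zero-length vector once no data is left
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

# Stand-in for an opened curl() connection (assumption): 10 bytes in memory
con <- rawConnection(as.raw(1:10))
total <- count_bytes(con, chunk_size = 4L)
close(con)
print(total)
```

Reading in chunks keeps memory use bounded by chunk_size rather than by the size of the response.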
It is important to note that curl_perform will not automatically raise an error if the request was completed but returned a non-200 status code. When using curl_perform you need to implement the application logic yourself.
req <- curl_perform("https://httpbin.org/status/418")
print(req$status_code)
[1] 418
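One way to implement that application logic is a small wrapper that inspects status_code and raises an R error. The sketch below is an assumption about how a client might do this, not something provided by the package: the name stop_for_http_error is hypothetical, the 400 threshold is a choice made for this example, and the response is mocked with only the url and status_code fields shown above so the snippet runs without a network connection.

```r
# Hypothetical helper (not part of the curl package): raise an R error when
# a response list carries an HTTP error status (4xx or 5xx).
stop_for_http_error <- function(req) {
  if (req$status_code >= 400L) {
    stop(
      sprintf("HTTP error %d for %s", req$status_code, req$url),
      call. = FALSE
    )
  }
  # Return the response invisibly so the call can be chained
  invisible(req)
}

# Mocked response (assumption) with the fields used by the helper
fake_req <- list(url = "https://httpbin.org/status/418", status_code = 418L)
result <- tryCatch(
  stop_for_http_error(fake_req),
  error = function(e) conditionMessage(e)
)
print(result)
```

A stricter client could treat anything other than 200 as a failure by changing the comparison.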
The curl and curl_download functions, on the other hand, will automatically raise an error if the HTTP response was unsuccessful, just as the base functions url and download.file would.
curl_download("https://httpbin.org/status/418", tempfile())
Error: HTTP error 418.
con <- curl("https://httpbin.org/status/418")
open(con)
Error in open.connection(con): HTTP error 418.
By default libcurl uses HTTP GET to issue a request to an HTTP URL. To send a customized request, we first need to create and configure a curl handle object that is passed to the specific download interface.
Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and HTTP request headers.
h <- new_handle()
handle_setopt(h, COPYPOSTFIELDS = "moo=moomooo")
handle_setheaders(h,
"Content-Type" = "text/moo",
"Cache-Control" = "no-cache",
"User-Agent" = "A cow"
)
Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case sensitive, so this would do the same:
handle_setopt(h, copypostfields = "moo=moomooo")
After the handle has been configured, it can be used with any of the three download interfaces to issue the request. For example, curl_perform will store the output of the request in memory:
req <- curl_perform("http://httpbin.org/post", handle = h)
cat(rawToChar(req$content))
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "167.220.26.164",
"url": "http://httpbin.org/post"
}
Alternatively, we can use curl() to read the data via a connection interface:
con <- curl("http://httpbin.org/post", handle = h)
cat(readLines(con), sep = "\n")
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "167.220.26.164",
"url": "http://httpbin.org/post"
}
Or we can use curl_download to write the response to disk:
tmp <- tempfile()
curl_download("http://httpbin.org/post", destfile = tmp, handle = h)
cat(readLines(tmp), sep = "\n")
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "167.220.26.164",
"url": "http://httpbin.org/post"
}
As we have already seen, curl allows for reusing a single handle for multiple requests. However, it is not always a good idea to do so. The performance overhead of creating and configuring a new handle object is usually negligible. The safest way to issue multiple requests, either to a single server or to multiple servers, is to use a separate handle for each request.
req1 <- curl_perform("https://httpbin.org/get", handle = new_handle())
req2 <- curl_perform("http://www.r-project.org", handle = new_handle())
As far as I can tell, there are two reasons why you might want to reuse a handle. The first one is that it will automatically keep track of cookies set by the server. So if your host makes use of cookies that might be useful.
The other reason is to take advantage of HTTP Keep-Alive. Curl automatically maintains a pool of open HTTP connections within each handle. When using a single handle to issue many requests to the same server, curl automatically uses existing connections when possible. This eliminates a little bit of connection overhead, although on a decent network this might not be very significant.
h <- new_handle()
system.time(curl_perform("http://cran.rstudio.com/web/packages/lattice/DESCRIPTION", handle = h))
user system elapsed
0.000 0.001 0.019
system.time(curl_perform("http://cran.rstudio.com/web/packages/MASS/DESCRIPTION", handle = h))
user system elapsed
0.001 0.000 0.011
The argument against reusing handles is that it is very easy to introduce bugs by forgetting to unset or reset a curl option after performing a request. Once you have set an option in the handle, it will stay active until you set it to a new value or reset the handle.
handle_reset(h)
The handle_reset function will reset all curl options and request headers to the default values. It will not erase cookies and it will still keep the connections alive. Therefore it is usually good practice to call handle_reset after performing a request if you want to reuse the handle for a subsequent request. Still, it is generally safer to create a fresh handle when possible, rather than recycling old ones.
The handle_setform function is used to perform a multipart/form-data HTTP POST request (a.k.a. posting a form). Values can be either strings, raw vectors (for binary data) or files.
# Posting multipart
h <- new_handle()
handle_setform(h,
foo = "blabla",
bar = charToRaw("boeboe"),
description = form_file(system.file("DESCRIPTION")),
logo = form_file(file.path(Sys.getenv("R_DOC_DIR"), "html/logo.jpg"), "image/jpeg")
)
req <- curl_perform("http://httpbin.org/post", handle = h)
The form_file function is used to upload files with the form post. It has two arguments: a file path, and optionally a content-type value. If no content-type is set, curl will guess the content type of the file based on the file extension.
All of the handle_xxx functions return the handle object so that function calls can be chained using the popular pipe operators:
library(magrittr)
new_handle() %>%
handle_setopt(copypostfields = "moo=moomooo") %>%
handle_setheaders("Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow") %>%
curl_perform("http://httpbin.org/post", handle = .) %$% content %>% rawToChar %>% cat
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "167.220.26.164",
"url": "http://httpbin.org/post"
}