This document consists of the structure and processes for the most common methods of importing data into R. I put it together as a reference for myself and figured I should share it with the R community as well.

A great debt is owed to DataCamp, whose “Importing Data into R” courses helped clarify and codify the methods of getting data into R more than any other resource I’ve encountered.

1) Importing Flat Files

utils functions

In the utils package that comes standard with R, read.table is the basic function from which read.csv and read.delim are built as special cases.

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv is used for importing csv files.

read.csv(file, header = TRUE, sep = ",", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim is used for importing tab delimited text files.

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

readr

The readr package, developed by Hadley Wickham, provides similar functions for importing flat files. The key difference is that, unlike in the utils package, the default argument of stringsAsFactors is set to FALSE.

library(readr)

In the readr package, read_delim is the basic function from which read_csv and read_delim are built as special cases. For a document that is separated any other way, substitute the delim argument with the separating character in “quotes.”

read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
       escape_double = TRUE, col_names = TRUE, col_types = NULL,
           locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
           comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
           guess_max = min(1000, n_max), progress = interactive())

For csv files, use read_csv, adding whichever arguments from read_delim are necessary to retrieve the appropriate data.

read_csv(file)

For tab delimited text files, use read_tsv, adding whichever arguments from read_delim are necessary to retrieve the appropriate data.

read_tsv(file)

fread

fread is a function in the data.table package, developed by Matt Dowle.

library(data.table)

Like readr, fread sets the stringsAsFactors argument to FALSE by default. However, it improves on readr by automatically detecting the file type that is being imported. The only necessary argument, then, is the file name (or input). Also, for large data sets, fread has an infinitely faster processing speed.

fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA", file, 
           stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
       skip=0L, select=NULL, drop=NULL, colClasses=NULL,
       integer64=getOption("datatable.integer64"), # default: "integer64"
       dec=if (sep!=".") "." else ",", col.names,
       check.names=FALSE, encoding="unknown", quote="\"",
       strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
       showProgress=getOption("datatable.showProgress"), # default: TRUE
       data.table=getOption("datatable.fread.datatable") # default: TRUE)

2) Importing Microsoft Excel Files

readxl

The readxl package, developed by Hadley Wickham, makes it possible to read Microsoft Excel files into R, whether the extension is “.xls” (Excel’s older version) or “.xlsx” (Excel’s newer version).

library(readxl)

The excel_sheets function prints a vector of names referencing the tabs of each worksheet in the workbook. All that is required is the file name.

excel_sheets(path)

The read_excel function creates a data frame based on the file that is referenced in the first argument.

read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "",
       skip = 0)

XLConnect

The XLConnect package, developed by Martin Struder, provides a large number of functions for importing and altering data from Microsoft Excel documents.

library(XLConnect)

The loadWorkbook function brings in the entire Excel workbook as an object in R. The variable to which the workbook is assigned is then fed into other XLConnect functions as the object, typically the first argument.

loadWorkbook(filename, create = FALSE, password = NULL)

Similar to the excel_sheets function from readxl, getSheets prints a vector with the names of the sheets in the Excel workbook (object).

getSheets(object)

The readWorksheet fuction reads a specific sheet from the Excel workbook.

readWorksheet(object,sheet,startRow,startCol,endRow,endCol,autofitRow,autofitCol,
       region,header,rownames,colTypes,forceConversion,dateTimeFormat,check.names,
       useCachedValues,keep,drop, simplify, readStrategy)  

The createSheet function adds an empty tab to the Excel workbook.

createSheet(object, name)

The writeWorksheet function adds data to the tab created with the createSheet function.

writeWorksheet(object,data,sheet,startRow,startCol,header,rownames)

The saveWorkbook function exports the updated Excel workbook to the working directory.

saveWorkbook(object,file)

The renameSheet function can be used to edit the names of the tabs in the Excel workbook.

renameSheet(object,sheet,newName)

The removeSheet function can be used to delete a tab from the Excel workbook.

removeSheet(object,sheet)

XLConnect contains dozens of other functions for working between R and Excel. The package is definitely worthy of further study.

3) Importing Statistical Software Files

haven

The haven package, developed by Hadley Wickham, contains functions for importing data into R from various statistical software systems.

library(haven)

The basic import function of each software system follows this framework: read_software(file). So, for SAS:

read_sas(file)

For SPSS:

read_spss(file)

And, for STATA:

read_stata(file)

foreign

An alternative to haven is foreign. The key difference is the function for reading the file in foreign usually references the file extension rather than the software. For example, read_stata in haven would equate to read.dta in foreign.

library(foreign)

A drawback is that data from some software systems, like SAS, are a bit more complicated to import with foreign.

A benefit to using foreign is that it the capability of reading in data from a broader range of software systems. Some software sytems for which you may need to use foreign include:

Minitab:

read.mtp(file)

And Octave (compatibility with Matlab binary files):

read.octave(file)

4) Importing Data from Databases

Connecting to databases in R can be tricky. Each database management system operates independently, using different packages and functions for retrieving data. Here, we’ll look at two systems in particular: MySQL and Microsoft SQL Server.

MySQL

To bring in data from a MySQL database, we first need to install the DBI package and load the RMySQL (by Jeroen Oooms) library.

install.packages("DBI")
library(RMySQL)

Next, we’ll want to connect to the database and assign the connection to an object on which we’ll base our queries. Typically, the arguments that must be specified for a particular database are the dbname, the host, the port, the user, and the password.

connection <- dbConnect(RMySQL::MySQL(), dbname = NULL, username = NULL,
       password = NULL, host = NULL, unix.socket = NULL, port = 0,
       client.flag = 0, groups = "rs-dbi", default.file = NULL, ...)

Now that we’ve connected to the database, we can use a variety of functions to call information from it.

The dbListTables function prints a vector of the names of all the tables in the database.

dbListTables(connection)

The dbReadTable prints the entire table called in the second argument of the function.

dbReadTable(connection, name)

The dbGetQuery function retrieves only the data that is specified by the SQL query written in the statement argument.

dbGetQuery(connection, statement)

The dbDisconnect function disconnects from the database. This should always be done after querying is finished.

dbDisconnect(connection)

Microsoft SQL Server

Data from a Microsoft SQL Server (or Access) database can be retrieved using the RODBC package, developed by Brian Ripley.

library(RODBC) 

First, to connect with the database, we use the odbcConnect function–inputting the dsn and, if necessary, the user id and password. We then assign this to an object (connection) that will be used for the other functions. Instructions on creating a DSN (Data Source Name) in SQL Server can be found here.

connection <- odbcConnect(dsn, uid = "", pwd = "", ...)

To print a vector with the names of the tables in the database, the sqlTables function can be used.

sqlTables(connection)

To retrive an entire table from the database, the sqlFetch function can be used.

sqlFetch(connection, "table name", max = m)

To write a query to select certain data from the database, the sqlQuery function can be used..

sqlQuery(channel, query, errors = TRUE, ..., rows_at_time)

5) Importing Data from the Web

Depending on the data that is being imported, there are a variety of methods for bringing data into R from the Internet.

Flat Files

The read.csv function of the native utils package can be used to read csv files into R from the web just like it reads csv files from a working directory.

read.csv("http://www.website.com/webpage.csv")

Other Tabular Files

Besides csv files, other files cannot be read into R directly. They must be downloaded to a machine and then loaded from the working directory. However, there is a method for making this process reproducible in the R code…

url <- "http://www.website.com/webpage.xlsx"

dest <- file.path("~","webpage.xlsx")

download.file(url, dest)

Once this process has been completed, the Excel file can now be read into read using XLConnect, readxl, or some other method. For example, we can now use the read_excel function from readxl.

read_excel(dest)

JSON Files

For readability, much of the data on the web is organized in the JSON file format. To import data from the web in this format, the jsonlite package (by Jeroen Oooms) can be used.

library(jsonlite)

For importing data into R, jsonlite has a single function: fromJSON. The txt argument is the file to be imported. It can be a JSON file from the working directory or a URL from the web.

fromJSON(txt, simplifyVector = TRUE, simplifyDataFrame = simplifyVector,
       simplifyMatrix = simplifyVector, flatten = FALSE, ...)