Importing data from external files

Introduction

Some of the data set you will be using in R, if not most, will be available in external formats or the internet. As such, you, as an R programmer will need to have a way of getting them into R before you can begin working on them. At times, this task can become time intensive and to some extent quite frustrating, more so if data is in the wrong format. To ease this task, RStudio includes new menu features to import data from text (.csv), Excel (.xls, .xlsx), SPSS (.sav, .por), Stata (.dta) and SAS (.sas) files.

While loading data from few external software is possible through the point and click menu, this page will focus on loading data using the syntax method. One major drawback for the syntax method is that almost every single type of file that you want to get into R will require its own function, and even then you might get lost in the numerous arguments of some of these functions. In short, it can be fairly easy to mix up things from time to time, whether you are a beginner or a more advanced R user. For this reason, we will work with a data set that is in perfect format so that we do not require to issue multiple arguments, that may be intimidating, particularly to new R users.

Download data sets

The data sets used in the demonstrations on this page are available at: http://drmathematics.com/learning/datasets/supermarkets.zip. Please download this zipped folder and extract them to a preferred folder on your local disk. Once you have done that, change the line: D:/stemresearch/R/datasets as appears in all the syntaces below to the path to your file. Please note that if you do not get this step right (i.e. if you fail to refer to the correct folder), R won’t find the data sets, meaning that all the syntax will crush.

Install required packages

The following packages will be required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function e.g. install.packages(“knitr”) to install the knitr library. Other packages will be loaded only when they are required.

library(knitr) # for dynamic reporting e.g. generate HTML
library(kableExtra) # display table formatting

Text files (.txt)

Text files can generally be separated by any character. Below are four examples of text files (they all are .txt files, what differs is the delimiter that separates the column values).

space - Column values are separated by a single space.
tab - Column values are separated by a tab (tab key).
special characters - Column values are separated by a special character e.g. semicolon (;), hiphen (-), hash (#), etc.
dat - a form of text file whose values are separated by as tab.

Below is a code to import the text file.

# import
supermarketscomma = read.table("D:/stemresearch/R/datasets/supermarkets_comma.txt",
                               sep = ',',
                               header = TRUE)
# display
kbl(supermarketscomma[1:6, 1:10], 
    caption = "Table 1: Import comma delimited .txt file") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Import comma delimited .txt file
transaction	date	id	gender	maritalstatus	homeowner	children	annualincome	city	state
1	12/18/2011	7223	Female	Single	Yes	2	$30K - $50K	Los Angeles	CA
2	12/20/2011	7841	Male	Married	Yes	5	$70K - $90K	Los Angeles	CA
3	12/21/2011	8374	Female	Married	No	2	$50K - $70K	Bremerton	WA
4	12/21/2011	9619	Male	Married	Yes	3	$30K - $50K	Portland	OR
5	12/22/2011	1900	Female	Single	Yes	3	$130K - $150K	Beverly Hills	CA
6	12/22/2011	6696	Female	Married	Yes	3	$10K - $30K	Beverly Hills	CA

The code below illustrates how to import the other text file formats. Note the use of the argument sep to specify the delimiter (i.e. the character that separates the column values).

# separated by space
supermarketspace = read.table("D:/stemresearch/R/datasets/supermarkets_space.txt",
                              sep = ' ',
                              header = TRUE)
# separated by tab
supermarketstab = read.table("D:/stemresearch/R/datasets/supermarkets_tab.txt",
                             sep = '\t', 
                             header = TRUE)
# separated by semicolon
supermarketsemicolon = read.table("D:/stemresearch/R/datasets/supermarkets_semicolon.txt",
                                  sep = ';',
                                  header = TRUE)
# import .dat files
supermarketsdat = read.table("D:/stemresearch/R/datasets/supermarketsdat.dat",
                             sep = '\t',
                             header = TRUE)

Comma separated values (.csv)

Comma separated values (or .csv) are text files in which the values in the columns are separated by a comma. They are one of the most common and useful ways for sharing data across platforms. In this section, we are going to learn how to import .csv files using the read.csv function from base R. You could also use the read_csv() function from the readr library.

# import
supermarketscsv = read.csv("D:/stemresearch/R/datasets/supermarkets.csv")
# display
kbl(supermarketscsv[1:6, 1:10], 
    caption = "Table 2: Import comma separated values (.csv)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2: Import comma separated values (.csv)
transaction	date	id	gender	maritalstatus	homeowner	children	annualincome	city	state
1	18-12-11	7223	Female	Single	Yes	2	$30K - $50K	Los Angeles	CA
2	20-12-11	7841	Male	Married	Yes	5	$70K - $90K	Los Angeles	CA
3	21-12-11	8374	Female	Married	No	2	$50K - $70K	Bremerton	WA
4	21-12-11	9619	Male	Married	Yes	3	$30K - $50K	Portland	OR
5	22-12-11	1900	Female	Single	Yes	3	$130K - $150K	Beverly Hills	CA
6	22-12-11	6696	Female	Married	Yes	3	$10K - $30K	Beverly Hills	CA

Comma separated values: specify columns to import

If you are interested in just some of the columns of the data set, you can use the fread() function from the data.table library. Note that this function imports the data set and stores it as a tible - if the data has to be used as a data frame, you can use as.data.frame() function to convert it from a tible to a data frame.

library(data.table)
# import
data.raw = fread("D:/stemresearch/R/datasets/supermarkets.csv",
                 select = c("date", "country", "productfamily", "unitsold", "revenue"))
supermarketscsvsubset = as.data.frame(data.raw) # convert to data frame
# display
kbl(supermarketscsvsubset[1:6, ], 
    caption = "Table 3: Comma separated values: specify columns to import") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 3: Comma separated values: specify columns to import
date	country	productfamily	unitsold	revenue
0018-12-11	USA	Food	5	$27.38
0020-12-11	USA	Food	5	$14.90
0021-12-11	USA	Food	3	$5.52
0021-12-11	USA	Food	4	$4.44
0022-12-11	USA	Drink	4	$14.00
0022-12-11	USA	Food	3	$4.37

You can also exclude columns of a data set while importing by specifying the argument drop in the fread() function. For example, specifying drop = c(“gender”, “homeowner”, “annualincome”) causes the fread() function to import all variables of the data set except the specified ones.

Excel files (.xlsx or .xls)

One of the best ways to import an Excel file into R is to save it to a comma delimited file (.csv) and import it using the method mentioned above. In this section we are going to use the read_excel() function from the readr library to read Excel files. The first row of the data set should contain column names. If the Excel file (workbook) contains multiple sheets, you will be required to specify the name of the worksheet that contains the data you intend to import - this is done using the sheet argument. If a worksheet is not specified, then R by default will import data in the first worksheet.

library(readxl)
# import
supermarketsxlsx = read_excel("D:/stemresearch/R/datasets/supermarkets.xlsx",
                              sheet = "supermarkets")
# display
kbl(supermarketsxlsx[1:6, 1:10], 
    caption = "Table 4: Import Excel files **(.xlsx** or **.xls)**") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 4: Import Excel files **(.xlsx** or **.xls)**
transaction	date	id	gender	maritalstatus	homeowner	children	annualincome	city	state
1	2011-12-18	7223	Female	Single	Yes	2	$30K - $50K	Los Angeles	CA
2	2011-12-20	7841	Male	Married	Yes	5	$70K - $90K	Los Angeles	CA
3	2011-12-21	8374	Female	Married	No	2	$50K - $70K	Bremerton	WA
4	2011-12-21	9619	Male	Married	Yes	3	$30K - $50K	Portland	OR
5	2011-12-22	1900	Female	Single	Yes	3	$130K - $150K	Beverly Hills	CA
6	2011-12-22	6696	Female	Married	Yes	3	$10K - $30K	Beverly Hills	CA

Html tables

Html are a standard way of rendering tabular data/information on the internet. See the structure of a html table in the Notepad+ screenshot below. We use the htmltab() function from the htmltab library to import html data into R (the XML library also contains the function readHTMLTable that could also be used).

Below is the code.

library(htmltab)
# import, this may take a while, so be patient
supermarketshtml = htmltab("D:/stemresearch/R/datasets/supermarkets.html", which = 1)
# display
kbl(supermarketshtml[1:6, 1:10], 
    caption = "Table 1: Import html tables") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Import html tables
	transaction	date	id	gender	maritalstatus	homeowner	children	annualincome	city	state
2	1	18-12-11	7223	Female	Single	Yes	2	$30K - $50K	Los Angeles	CA
3	2	20-12-11	7841	Male	Married	Yes	5	$70K - $90K	Los Angeles	CA
4	3	21-12-11	8374	Female	Married	No	2	$50K - $70K	Bremerton	WA
5	4	21-12-11	9619	Male	Married	Yes	3	$30K - $50K	Portland	OR
6	5	22-12-11	1900	Female	Single	Yes	3	$130K - $150K	Beverly Hills	CA
7	6	22-12-11	6696	Female	Married	Yes	3	$10K - $30K	Beverly Hills	CA

.xml files

XML is a file format which stands for eXtensive Markup Language. This file format contains markup tags that define the meaning of the data contained in it. XML data can be read into R using the xmlParse () and then converted to a data frame using the function xmlToDataFrame() both from the XML package. The screen shot below shows an XML file in Notepad++.

Below is the code.

library(XML)
# import
data.raw = xmlParse("D:/stemresearch/R/datasets/supermarkets.xml")
supermarketsxml <- xmlToDataFrame(nodes = getNodeSet(data.raw, "//record"))
# display
kbl(supermarketsxml[1:6, 1:10], 
    caption = "Table 5: Import xml files") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 5: Import xml files
id	annualincome	children	city	country	date	gender	homeowner	maritalstatus	productcategory
7223	$30K - $50K	2	Los Angeles	USA	12/18/2011	Female	Yes	Single	Snack Foods
7841	$70K - $90K	5	Los Angeles	USA	12/20/2011	Male	Yes	Married	Vegetables
8374	$50K - $70K	2	Bremerton	USA	12/21/2011	Female	No	Married	Snack Foods
9619	$30K - $50K	3	Portland	USA	12/21/2011	Male	Yes	Married	Candy
1900	$130K - $150K	3	Beverly Hills	USA	12/22/2011	Female	Yes	Single	Carbonated Beverages
6696	$10K - $30K	3	Beverly Hills	USA	12/22/2011	Female	Yes	Married	Side Dishes

.json files

JSON, which stands for JavaScript Object Notation, are files that contain data in text form. In this tutorial, we are going to use the fromJSON() function from the jsonlite library. The same can be achieved using the rjson package, though this will require that you also use the as.data.frame() function to convert the result (which is usually a list) into a data frame. The screen shot below shows a JSON file as viewed in Notepad++.

Below is the code.

library(jsonlite)
# import
supermarketsjson = fromJSON("D:/stemresearch/R/datasets/supermarkets.json")
# display
kbl(supermarketsjson[1:6, 1:10], 
    caption = "Table 6: Import json files") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 6: Import json files
transaction	date	id	gender	maritalstatus	homeowner	children	annualincome	city	state
1	12/18/2011	7223	Female	Single	Yes	2	$30K - $50K	Los Angeles	CA
2	12/20/2011	7841	Male	Married	Yes	5	$70K - $90K	Los Angeles	CA
3	12/21/2011	8374	Female	Married	No	2	$50K - $70K	Bremerton	WA
4	12/21/2011	9619	Male	Married	Yes	3	$30K - $50K	Portland	OR
5	12/22/2011	1900	Female	Single	Yes	3	$130K - $150K	Beverly Hills	CA
6	12/22/2011	6696	Female	Married	Yes	3	$10K - $30K	Beverly Hills	CA

Other external files

The haven library contains functions to import data sets from Stata, SPSS and SAS among others. The code below shows how to import data from the aforementioned software.

library(haven)
# Stata
supermarketsdta = read_dta("D:/stemresearch/R/datasets/supermarkets.dta")
# SPSS
supermarketsav = read_sav("D:/stemresearch/R/datasets/supermarkets.sav")
# SAS
supermarketsas = read_sas("D:/stemresearch/R/datasets/supermarkets.sas7bdat")

Remark

In all the above illustrations, we have considered the case where the file to be imported is saved on your computer. If data is stored on the internet, usually accessible by a url, then you will be required to change the path. For example, the syntax supermarketsdta = read_dta(“D:/stemresearch/R/datasets/supermarkets.dta”) imports the supermarkets.dta data set from the located in the folder: D:/stemresearch/R/datasets. To import the same data set from the internet, replace the file path: D:/stemresearch/R/datasets/supermarkets.dta with the url: http://drmathematics.com/learning/datasets/supermarkets.dta. The syntax should look like the one below.

supermarketsdta = read_dta(“http://drmathematics.com/learning/datasets/supermarkets.dta”)

STEM Research
http://drmathematics.com