Introduction


Some of the data set you will be using in R, if not most, will be available in external formats or the internet. As such, you, as an R programmer will need to have a way of getting them into R before you can begin working on them. At times, this task can become time intensive and to some extent quite frustrating, more so if data is in the wrong format. To ease this task, RStudio includes new menu features to import data from text (.csv), Excel (.xls, .xlsx), SPSS (.sav, .por), Stata (.dta) and SAS (.sas) files.

While loading data from few external software is possible through the point and click menu, this page will focus on loading data using the syntax method. One major drawback for the syntax method is that almost every single type of file that you want to get into R will require its own function, and even then you might get lost in the numerous arguments of some of these functions. In short, it can be fairly easy to mix up things from time to time, whether you are a beginner or a more advanced R user. For this reason, we will work with a data set that is in perfect format so that we do not require to issue multiple arguments, that may be intimidating, particularly to new R users.

Download data sets

The data sets used in the demonstrations on this page are available at: http://drmathematics.com/learning/datasets/supermarkets.zip. Please download this zipped folder and extract them to a preferred folder on your local disk. Once you have done that, change the line: D:/stemresearch/R/datasets as appears in all the syntaces below to the path to your file. Please note that if you do not get this step right (i.e. if you fail to refer to the correct folder), R won’t find the data sets, meaning that all the syntax will crush.

Install required packages


The following packages will be required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function e.g. install.packages(“knitr”) to install the knitr library. Other packages will be loaded only when they are required.

library(knitr) # for dynamic reporting e.g. generate HTML
library(kableExtra) # display table formatting

Text files (.txt)


Text files can generally be separated by any character. Below are four examples of text files (they all are .txt files, what differs is the delimiter that separates the column values).

Below is a code to import the text file.

# import
supermarketscomma = read.table("D:/stemresearch/R/datasets/supermarkets_comma.txt",
                               sep = ',',
                               header = TRUE)
# display
kbl(supermarketscomma[1:6, 1:10], 
    caption = "Table 1: Import comma delimited .txt file") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Import comma delimited .txt file
transaction date id gender maritalstatus homeowner children annualincome city state
1 12/18/2011 7223 Female Single Yes 2 $30K - $50K Los Angeles CA
2 12/20/2011 7841 Male Married Yes 5 $70K - $90K Los Angeles CA
3 12/21/2011 8374 Female Married No 2 $50K - $70K Bremerton WA
4 12/21/2011 9619 Male Married Yes 3 $30K - $50K Portland OR
5 12/22/2011 1900 Female Single Yes 3 $130K - $150K Beverly Hills CA
6 12/22/2011 6696 Female Married Yes 3 $10K - $30K Beverly Hills CA

The code below illustrates how to import the other text file formats. Note the use of the argument sep to specify the delimiter (i.e. the character that separates the column values).


# separated by space
supermarketspace = read.table("D:/stemresearch/R/datasets/supermarkets_space.txt",
                              sep = ' ',
                              header = TRUE)
# separated by tab
supermarketstab = read.table("D:/stemresearch/R/datasets/supermarkets_tab.txt",
                             sep = '\t', 
                             header = TRUE)
# separated by semicolon
supermarketsemicolon = read.table("D:/stemresearch/R/datasets/supermarkets_semicolon.txt",
                                  sep = ';',
                                  header = TRUE)
# import .dat files
supermarketsdat = read.table("D:/stemresearch/R/datasets/supermarketsdat.dat",
                             sep = '\t',
                             header = TRUE)

Comma separated values (.csv)


Comma separated values (or .csv) are text files in which the values in the columns are separated by a comma. They are one of the most common and useful ways for sharing data across platforms. In this section, we are going to learn how to import .csv files using the read.csv function from base R. You could also use the read_csv() function from the readr library.

# import
supermarketscsv = read.csv("D:/stemresearch/R/datasets/supermarkets.csv")
# display
kbl(supermarketscsv[1:6, 1:10], 
    caption = "Table 2: Import comma separated values (.csv)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 2: Import comma separated values (.csv)
transaction date id gender maritalstatus homeowner children annualincome city state
1 18-12-11 7223 Female Single Yes 2 $30K - $50K Los Angeles CA
2 20-12-11 7841 Male Married Yes 5 $70K - $90K Los Angeles CA
3 21-12-11 8374 Female Married No 2 $50K - $70K Bremerton WA
4 21-12-11 9619 Male Married Yes 3 $30K - $50K Portland OR
5 22-12-11 1900 Female Single Yes 3 $130K - $150K Beverly Hills CA
6 22-12-11 6696 Female Married Yes 3 $10K - $30K Beverly Hills CA

Comma separated values: specify columns to import


If you are interested in just some of the columns of the data set, you can use the fread() function from the data.table library. Note that this function imports the data set and stores it as a tible - if the data has to be used as a data frame, you can use as.data.frame() function to convert it from a tible to a data frame.

library(data.table)
# import
data.raw = fread("D:/stemresearch/R/datasets/supermarkets.csv",
                 select = c("date", "country", "productfamily", "unitsold", "revenue"))
supermarketscsvsubset = as.data.frame(data.raw) # convert to data frame
# display
kbl(supermarketscsvsubset[1:6, ], 
    caption = "Table 3: Comma separated values: specify columns to import") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 3: Comma separated values: specify columns to import
date country productfamily unitsold revenue
0018-12-11 USA Food 5 $27.38
0020-12-11 USA Food 5 $14.90
0021-12-11 USA Food 3 $5.52
0021-12-11 USA Food 4 $4.44
0022-12-11 USA Drink 4 $14.00
0022-12-11 USA Food 3 $4.37

You can also exclude columns of a data set while importing by specifying the argument drop in the fread() function. For example, specifying drop = c(“gender”, “homeowner”, “annualincome”) causes the fread() function to import all variables of the data set except the specified ones.

Excel files (.xlsx or .xls)


One of the best ways to import an Excel file into R is to save it to a comma delimited file (.csv) and import it using the method mentioned above. In this section we are going to use the read_excel() function from the readr library to read Excel files. The first row of the data set should contain column names. If the Excel file (workbook) contains multiple sheets, you will be required to specify the name of the worksheet that contains the data you intend to import - this is done using the sheet argument. If a worksheet is not specified, then R by default will import data in the first worksheet.

library(readxl)
# import
supermarketsxlsx = read_excel("D:/stemresearch/R/datasets/supermarkets.xlsx",
                              sheet = "supermarkets")
# display
kbl(supermarketsxlsx[1:6, 1:10], 
    caption = "Table 4: Import Excel files **(.xlsx** or **.xls)**") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 4: Import Excel files (.xlsx or .xls)
transaction date id gender maritalstatus homeowner children annualincome city state
1 2011-12-18 7223 Female Single Yes 2 $30K - $50K Los Angeles CA
2 2011-12-20 7841 Male Married Yes 5 $70K - $90K Los Angeles CA
3 2011-12-21 8374 Female Married No 2 $50K - $70K Bremerton WA
4 2011-12-21 9619 Male Married Yes 3 $30K - $50K Portland OR
5 2011-12-22 1900 Female Single Yes 3 $130K - $150K Beverly Hills CA
6 2011-12-22 6696 Female Married Yes 3 $10K - $30K Beverly Hills CA

Html tables

Html are a standard way of rendering tabular data/information on the internet. See the structure of a html table in the Notepad+ screenshot below. We use the htmltab() function from the htmltab library to import html data into R (the XML library also contains the function readHTMLTable that could also be used).

Below is the code.

library(htmltab)
# import, this may take a while, so be patient
supermarketshtml = htmltab("D:/stemresearch/R/datasets/supermarkets.html", which = 1)
# display
kbl(supermarketshtml[1:6, 1:10], 
    caption = "Table 1: Import html tables") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Import html tables
transaction date id gender maritalstatus homeowner children annualincome city state
2 1 18-12-11 7223 Female Single Yes 2 $30K - $50K Los Angeles CA
3 2 20-12-11 7841 Male Married Yes 5 $70K - $90K Los Angeles CA
4 3 21-12-11 8374 Female Married No 2 $50K - $70K Bremerton WA
5 4 21-12-11 9619 Male Married Yes 3 $30K - $50K Portland OR
6 5 22-12-11 1900 Female Single Yes 3 $130K - $150K Beverly Hills CA
7 6 22-12-11 6696 Female Married Yes 3 $10K - $30K Beverly Hills CA

.xml files


XML is a file format which stands for eXtensive Markup Language. This file format contains markup tags that define the meaning of the data contained in it. XML data can be read into R using the xmlParse () and then converted to a data frame using the function xmlToDataFrame() both from the XML package. The screen shot below shows an XML file in Notepad++.

Below is the code.

library(XML)
# import
data.raw = xmlParse("D:/stemresearch/R/datasets/supermarkets.xml")
supermarketsxml <- xmlToDataFrame(nodes = getNodeSet(data.raw, "//record"))
# display
kbl(supermarketsxml[1:6, 1:10], 
    caption = "Table 5: Import xml files") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 5: Import xml files
id annualincome children city country date gender homeowner maritalstatus productcategory
7223 $30K - $50K 2 Los Angeles USA 12/18/2011 Female Yes Single Snack Foods
7841 $70K - $90K 5 Los Angeles USA 12/20/2011 Male Yes Married Vegetables
8374 $50K - $70K 2 Bremerton USA 12/21/2011 Female No Married Snack Foods
9619 $30K - $50K 3 Portland USA 12/21/2011 Male Yes Married Candy
1900 $130K - $150K 3 Beverly Hills USA 12/22/2011 Female Yes Single Carbonated Beverages
6696 $10K - $30K 3 Beverly Hills USA 12/22/2011 Female Yes Married Side Dishes

.json files


JSON, which stands for JavaScript Object Notation, are files that contain data in text form. In this tutorial, we are going to use the fromJSON() function from the jsonlite library. The same can be achieved using the rjson package, though this will require that you also use the as.data.frame() function to convert the result (which is usually a list) into a data frame. The screen shot below shows a JSON file as viewed in Notepad++.

Below is the code.

library(jsonlite)
# import
supermarketsjson = fromJSON("D:/stemresearch/R/datasets/supermarkets.json")
# display
kbl(supermarketsjson[1:6, 1:10], 
    caption = "Table 6: Import json files") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 6: Import json files
transaction date id gender maritalstatus homeowner children annualincome city state
1 12/18/2011 7223 Female Single Yes 2 $30K - $50K Los Angeles CA
2 12/20/2011 7841 Male Married Yes 5 $70K - $90K Los Angeles CA
3 12/21/2011 8374 Female Married No 2 $50K - $70K Bremerton WA
4 12/21/2011 9619 Male Married Yes 3 $30K - $50K Portland OR
5 12/22/2011 1900 Female Single Yes 3 $130K - $150K Beverly Hills CA
6 12/22/2011 6696 Female Married Yes 3 $10K - $30K Beverly Hills CA

Other external files


The haven library contains functions to import data sets from Stata, SPSS and SAS among others. The code below shows how to import data from the aforementioned software.

library(haven)
# Stata
supermarketsdta = read_dta("D:/stemresearch/R/datasets/supermarkets.dta")
# SPSS
supermarketsav = read_sav("D:/stemresearch/R/datasets/supermarkets.sav")
# SAS
supermarketsas = read_sas("D:/stemresearch/R/datasets/supermarkets.sas7bdat")

Remark


In all the above illustrations, we have considered the case where the file to be imported is saved on your computer. If data is stored on the internet, usually accessible by a url, then you will be required to change the path. For example, the syntax supermarketsdta = read_dta(“D:/stemresearch/R/datasets/supermarkets.dta”) imports the supermarkets.dta data set from the located in the folder: D:/stemresearch/R/datasets. To import the same data set from the internet, replace the file path: D:/stemresearch/R/datasets/supermarkets.dta with the url: http://drmathematics.com/learning/datasets/supermarkets.dta. The syntax should look like the one below.

supermarketsdta = read_dta(“http://drmathematics.com/learning/datasets/supermarkets.dta”)


STEM Research
http://drmathematics.com