Some of the data set you will be using in R, if not most, will be available in external formats or the internet. As such, you, as an R programmer will need to have a way of getting them into R before you can begin working on them. At times, this task can become time intensive and to some extent quite frustrating, more so if data is in the wrong format. To ease this task, RStudio includes new menu features to import data from text (.csv), Excel (.xls, .xlsx), SPSS (.sav, .por), Stata (.dta) and SAS (.sas) files.
While loading data from few external software is possible through the point and click menu, this page will focus on loading data using the syntax method. One major drawback for the syntax method is that almost every single type of file that you want to get into R will require its own function, and even then you might get lost in the numerous arguments of some of these functions. In short, it can be fairly easy to mix up things from time to time, whether you are a beginner or a more advanced R user. For this reason, we will work with a data set that is in perfect format so that we do not require to issue multiple arguments, that may be intimidating, particularly to new R users.
The data sets used in the demonstrations on this page are available at: http://drmathematics.com/learning/datasets/supermarkets.zip. Please download this zipped folder and extract them to a preferred folder on your local disk. Once you have done that, change the line: D:/stemresearch/R/datasets as appears in all the syntaces below to the path to your file. Please note that if you do not get this step right (i.e. if you fail to refer to the correct folder), R won’t find the data sets, meaning that all the syntax will crush.
The following packages will be required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function e.g. install.packages(“knitr”) to install the knitr library. Other packages will be loaded only when they are required.
library(knitr) # for dynamic reporting e.g. generate HTML
library(kableExtra) # display table formatting
Text files can generally be separated by any character. Below are four examples of text files (they all are .txt files, what differs is the delimiter that separates the column values).
Below is a code to import the text file.
# import
supermarketscomma = read.table("D:/stemresearch/R/datasets/supermarkets_comma.txt",
sep = ',',
header = TRUE)
# display
kbl(supermarketscomma[1:6, 1:10],
caption = "Table 1: Import comma delimited .txt file") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| transaction | date | id | gender | maritalstatus | homeowner | children | annualincome | city | state |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 12/18/2011 | 7223 | Female | Single | Yes | 2 | $30K - $50K | Los Angeles | CA |
| 2 | 12/20/2011 | 7841 | Male | Married | Yes | 5 | $70K - $90K | Los Angeles | CA |
| 3 | 12/21/2011 | 8374 | Female | Married | No | 2 | $50K - $70K | Bremerton | WA |
| 4 | 12/21/2011 | 9619 | Male | Married | Yes | 3 | $30K - $50K | Portland | OR |
| 5 | 12/22/2011 | 1900 | Female | Single | Yes | 3 | $130K - $150K | Beverly Hills | CA |
| 6 | 12/22/2011 | 6696 | Female | Married | Yes | 3 | $10K - $30K | Beverly Hills | CA |
The code below illustrates how to import the other text file formats. Note the use of the argument sep to specify the delimiter (i.e. the character that separates the column values).
# separated by space
supermarketspace = read.table("D:/stemresearch/R/datasets/supermarkets_space.txt",
sep = ' ',
header = TRUE)
# separated by tab
supermarketstab = read.table("D:/stemresearch/R/datasets/supermarkets_tab.txt",
sep = '\t',
header = TRUE)
# separated by semicolon
supermarketsemicolon = read.table("D:/stemresearch/R/datasets/supermarkets_semicolon.txt",
sep = ';',
header = TRUE)
# import .dat files
supermarketsdat = read.table("D:/stemresearch/R/datasets/supermarketsdat.dat",
sep = '\t',
header = TRUE)
Comma separated values (or .csv) are text files in which the values in the columns are separated by a comma. They are one of the most common and useful ways for sharing data across platforms. In this section, we are going to learn how to import .csv files using the read.csv function from base R. You could also use the read_csv() function from the readr library.
# import
supermarketscsv = read.csv("D:/stemresearch/R/datasets/supermarkets.csv")
# display
kbl(supermarketscsv[1:6, 1:10],
caption = "Table 2: Import comma separated values (.csv)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| transaction | date | id | gender | maritalstatus | homeowner | children | annualincome | city | state |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 18-12-11 | 7223 | Female | Single | Yes | 2 | $30K - $50K | Los Angeles | CA |
| 2 | 20-12-11 | 7841 | Male | Married | Yes | 5 | $70K - $90K | Los Angeles | CA |
| 3 | 21-12-11 | 8374 | Female | Married | No | 2 | $50K - $70K | Bremerton | WA |
| 4 | 21-12-11 | 9619 | Male | Married | Yes | 3 | $30K - $50K | Portland | OR |
| 5 | 22-12-11 | 1900 | Female | Single | Yes | 3 | $130K - $150K | Beverly Hills | CA |
| 6 | 22-12-11 | 6696 | Female | Married | Yes | 3 | $10K - $30K | Beverly Hills | CA |
If you are interested in just some of the columns of the data set, you can use the fread() function from the data.table library. Note that this function imports the data set and stores it as a tible - if the data has to be used as a data frame, you can use as.data.frame() function to convert it from a tible to a data frame.
library(data.table)
# import
data.raw = fread("D:/stemresearch/R/datasets/supermarkets.csv",
select = c("date", "country", "productfamily", "unitsold", "revenue"))
supermarketscsvsubset = as.data.frame(data.raw) # convert to data frame
# display
kbl(supermarketscsvsubset[1:6, ],
caption = "Table 3: Comma separated values: specify columns to import") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| date | country | productfamily | unitsold | revenue |
|---|---|---|---|---|
| 0018-12-11 | USA | Food | 5 | $27.38 |
| 0020-12-11 | USA | Food | 5 | $14.90 |
| 0021-12-11 | USA | Food | 3 | $5.52 |
| 0021-12-11 | USA | Food | 4 | $4.44 |
| 0022-12-11 | USA | Drink | 4 | $14.00 |
| 0022-12-11 | USA | Food | 3 | $4.37 |
You can also exclude columns of a data set while importing by specifying the argument drop in the fread() function. For example, specifying drop = c(“gender”, “homeowner”, “annualincome”) causes the fread() function to import all variables of the data set except the specified ones.
One of the best ways to import an Excel file into R is to save it to a comma delimited file (.csv) and import it using the method mentioned above. In this section we are going to use the read_excel() function from the readr library to read Excel files. The first row of the data set should contain column names. If the Excel file (workbook) contains multiple sheets, you will be required to specify the name of the worksheet that contains the data you intend to import - this is done using the sheet argument. If a worksheet is not specified, then R by default will import data in the first worksheet.
library(readxl)
# import
supermarketsxlsx = read_excel("D:/stemresearch/R/datasets/supermarkets.xlsx",
sheet = "supermarkets")
# display
kbl(supermarketsxlsx[1:6, 1:10],
caption = "Table 4: Import Excel files **(.xlsx** or **.xls)**") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| transaction | date | id | gender | maritalstatus | homeowner | children | annualincome | city | state |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-12-18 | 7223 | Female | Single | Yes | 2 | $30K - $50K | Los Angeles | CA |
| 2 | 2011-12-20 | 7841 | Male | Married | Yes | 5 | $70K - $90K | Los Angeles | CA |
| 3 | 2011-12-21 | 8374 | Female | Married | No | 2 | $50K - $70K | Bremerton | WA |
| 4 | 2011-12-21 | 9619 | Male | Married | Yes | 3 | $30K - $50K | Portland | OR |
| 5 | 2011-12-22 | 1900 | Female | Single | Yes | 3 | $130K - $150K | Beverly Hills | CA |
| 6 | 2011-12-22 | 6696 | Female | Married | Yes | 3 | $10K - $30K | Beverly Hills | CA |
Html are a standard way of rendering tabular data/information on the internet. See the structure of a html table in the Notepad+ screenshot below. We use the htmltab() function from the htmltab library to import html data into R (the XML library also contains the function readHTMLTable that could also be used).
Below is the code.
library(htmltab)
# import, this may take a while, so be patient
supermarketshtml = htmltab("D:/stemresearch/R/datasets/supermarkets.html", which = 1)
# display
kbl(supermarketshtml[1:6, 1:10],
caption = "Table 1: Import html tables") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| transaction | date | id | gender | maritalstatus | homeowner | children | annualincome | city | state | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 18-12-11 | 7223 | Female | Single | Yes | 2 | $30K - $50K | Los Angeles | CA |
| 3 | 2 | 20-12-11 | 7841 | Male | Married | Yes | 5 | $70K - $90K | Los Angeles | CA |
| 4 | 3 | 21-12-11 | 8374 | Female | Married | No | 2 | $50K - $70K | Bremerton | WA |
| 5 | 4 | 21-12-11 | 9619 | Male | Married | Yes | 3 | $30K - $50K | Portland | OR |
| 6 | 5 | 22-12-11 | 1900 | Female | Single | Yes | 3 | $130K - $150K | Beverly Hills | CA |
| 7 | 6 | 22-12-11 | 6696 | Female | Married | Yes | 3 | $10K - $30K | Beverly Hills | CA |
XML is a file format which stands for eXtensive Markup Language. This file format contains markup tags that define the meaning of the data contained in it. XML data can be read into R using the xmlParse () and then converted to a data frame using the function xmlToDataFrame() both from the XML package. The screen shot below shows an XML file in Notepad++.
Below is the code.
library(XML)
# import
data.raw = xmlParse("D:/stemresearch/R/datasets/supermarkets.xml")
supermarketsxml <- xmlToDataFrame(nodes = getNodeSet(data.raw, "//record"))
# display
kbl(supermarketsxml[1:6, 1:10],
caption = "Table 5: Import xml files") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| id | annualincome | children | city | country | date | gender | homeowner | maritalstatus | productcategory |
|---|---|---|---|---|---|---|---|---|---|
| 7223 | $30K - $50K | 2 | Los Angeles | USA | 12/18/2011 | Female | Yes | Single | Snack Foods |
| 7841 | $70K - $90K | 5 | Los Angeles | USA | 12/20/2011 | Male | Yes | Married | Vegetables |
| 8374 | $50K - $70K | 2 | Bremerton | USA | 12/21/2011 | Female | No | Married | Snack Foods |
| 9619 | $30K - $50K | 3 | Portland | USA | 12/21/2011 | Male | Yes | Married | Candy |
| 1900 | $130K - $150K | 3 | Beverly Hills | USA | 12/22/2011 | Female | Yes | Single | Carbonated Beverages |
| 6696 | $10K - $30K | 3 | Beverly Hills | USA | 12/22/2011 | Female | Yes | Married | Side Dishes |
JSON, which stands for JavaScript Object Notation, are files that contain data in text form. In this tutorial, we are going to use the fromJSON() function from the jsonlite library. The same can be achieved using the rjson package, though this will require that you also use the as.data.frame() function to convert the result (which is usually a list) into a data frame. The screen shot below shows a JSON file as viewed in Notepad++.
Below is the code.
library(jsonlite)
# import
supermarketsjson = fromJSON("D:/stemresearch/R/datasets/supermarkets.json")
# display
kbl(supermarketsjson[1:6, 1:10],
caption = "Table 6: Import json files") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| transaction | date | id | gender | maritalstatus | homeowner | children | annualincome | city | state |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 12/18/2011 | 7223 | Female | Single | Yes | 2 | $30K - $50K | Los Angeles | CA |
| 2 | 12/20/2011 | 7841 | Male | Married | Yes | 5 | $70K - $90K | Los Angeles | CA |
| 3 | 12/21/2011 | 8374 | Female | Married | No | 2 | $50K - $70K | Bremerton | WA |
| 4 | 12/21/2011 | 9619 | Male | Married | Yes | 3 | $30K - $50K | Portland | OR |
| 5 | 12/22/2011 | 1900 | Female | Single | Yes | 3 | $130K - $150K | Beverly Hills | CA |
| 6 | 12/22/2011 | 6696 | Female | Married | Yes | 3 | $10K - $30K | Beverly Hills | CA |
The haven library contains functions to import data sets from Stata, SPSS and SAS among others. The code below shows how to import data from the aforementioned software.
library(haven)
# Stata
supermarketsdta = read_dta("D:/stemresearch/R/datasets/supermarkets.dta")
# SPSS
supermarketsav = read_sav("D:/stemresearch/R/datasets/supermarkets.sav")
# SAS
supermarketsas = read_sas("D:/stemresearch/R/datasets/supermarkets.sas7bdat")
In all the above illustrations, we have considered the case where the file to be imported is saved on your computer. If data is stored on the internet, usually accessible by a url, then you will be required to change the path. For example, the syntax supermarketsdta = read_dta(“D:/stemresearch/R/datasets/supermarkets.dta”) imports the supermarkets.dta data set from the located in the folder: D:/stemresearch/R/datasets. To import the same data set from the internet, replace the file path: D:/stemresearch/R/datasets/supermarkets.dta with the url: http://drmathematics.com/learning/datasets/supermarkets.dta. The syntax should look like the one below.
supermarketsdta = read_dta(“http://drmathematics.com/learning/datasets/supermarkets.dta”)
STEM Research
http://drmathematics.com