Importing and Exporting Data

Data I/O

Input/Output (I/O) is the technical term for reading (importing) and writing (exporting) data.
Imported data can come from a variety of sources (databases, flat files, the internet), and each source can supply a variety of formats.

Establishing Connections

Connections allow R to perform I/O with sources outside of R.
A connection object can take the form of a file, URL, pipe, or socket.
- A pipe connection is used to run a system command and “pipe” the output to R (useful for task scheduling)
- A socket connection allows R to communicate with another process (external software)
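
As a minimal sketch of the connection types listed above, the following creates a file connection and a pipe connection. The temporary file and the `echo` command are illustrative assumptions, not part of the original material; the pipe example assumes a shell where `echo` is available.

```r
# Create a small text file to connect to (hypothetical example data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

# A file connection: created here, opened automatically by readLines()
file_con <- file(tmp)
class(file_con)
file_lines <- readLines(file_con)
close(file_con)

# A pipe connection: runs a system command and reads its output into R
# (assumes an `echo` command is available on the system)
pipe_con <- pipe("echo hello")
pipe_lines <- readLines(pipe_con)
close(pipe_con)

# Remove the temporary file
unlink(tmp)
```

Both connections are read with the same function, readLines(); only the way the connection is created differs.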

Functions for Importing Data from Connections

file(): Creates a connection to the file at the given path
open(), close(): Open and close a connection
scan(): Reads data from the console or a file into a vector of a single data type
readLines(): Reads text lines from a connection
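
The functions above can be combined as in the following sketch. The file contents are hypothetical; the point is that an explicitly opened connection keeps its position, so readLines() and scan() can read successive parts of the same file.

```r
# Write a small file: one header line followed by numeric data (hypothetical)
tmp <- tempfile(fileext = ".txt")
writeLines(c("x y z", "1 2 3", "4 5 6"), tmp)

# Create the connection, then open it explicitly for reading
con <- file(tmp)
open(con, "r")

# Read the first line as text
header <- readLines(con, n = 1)

# Read the remaining values into a numeric vector
values <- scan(con, what = numeric(), quiet = TRUE)

# Close the connection when finished, then remove the temporary file
close(con)
unlink(tmp)
```

If the connection were not opened explicitly, each call would open it, read from the start, and close it again.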

Importing text data into R from a file using base R

Text data can be found in various formats, either on the internet or in a file.
It is stored in an encoding such as ASCII or Unicode.
Much text data is ASCII, but R lets you specify the encoding when the data is stored in something else.
Using the functions scan() and readLines(), R can read character data from a connection.
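
A minimal sketch of specifying an encoding, assuming a file stored as Latin-1. The file and its contents are hypothetical. Note that the encoding argument of readLines() only marks the strings as Latin-1; enc2utf8() performs the actual conversion.

```r
# Write the bytes of "café" in Latin-1, followed by a newline (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)

# Tell readLines() that the input is Latin-1 so the string is marked accordingly
txt <- readLines(tmp, encoding = "latin1")
Encoding(txt)

# Convert the marked string to UTF-8 for further processing
txt_utf8 <- enc2utf8(txt)

unlink(tmp)
```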

# Scan character data from a text file; what = "character" tells scan() to read character data
scan(file = "Z:/R Training/character_data.txt", what = "character")

[1] "This"      "is"        "a"         "test"      "file"      "containg" 
[7] "character" "data."

# Establish a file connection
char_data_file <- file("Z:/R Training/character_data.txt")

# Collapse the character vector into one character string
# str_c() comes from the stringr package
library(stringr)
str_c(scan(char_data_file, what = "character"), collapse = " ")

[1] "This is a test file containg character data."

# readLines() extracts the text from the input source
# and returns each line as a character string, where n is the number of lines to read
readLines(char_data_file, n = 1)

[1] "This is a test file containg character data. "

Importing Rectangular Data

Spreadsheet-like data such as CSV, TSV, and fixed-width (fwf) files
Excel formatted data
Other software-specific formats
Relational Databases

Reading in Data with base R

The function read.table() is the most common base R function for reading in rectangular data.

Arguments for read.table():

file: name of the file or connection
header: A boolean value indicating whether the file has a header line
sep: A string giving the character that separates the columns
colClasses: A character vector of the class for each column
nrows: The number of rows to be read. The default value is the entire file.
skip: The number of lines to skip from the beginning
stringsAsFactors: A boolean value indicating whether character vectors should be coded as factors. The default was TRUE before R 4.0.0 and has been FALSE since.
- It is often advised to set this to FALSE. One reason is that the data can contain character vectors that are not categorical variables. Another is that even when they are encoded as factors, they might not contain all the levels, and you will have to add levels anyway.
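
The arguments above can be sketched together on a small file. The file layout (a note line, a header, and pipe-separated columns) is a hypothetical example, not data from this training.

```r
# Write a small delimited file with a leading note line (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Exported by a hypothetical system",
  "id|name|score",
  "1|Ann|90.5",
  "2|Bob|82.0",
  "3|Cy|77.25"
), tmp)

scores <- read.table(
  tmp,
  header = TRUE,               # the first line after the skipped one holds column names
  sep = "|",                   # columns are separated by a pipe character
  skip = 1,                    # skip the note line at the top
  nrows = 2,                   # read only the first two data rows
  colClasses = c("integer", "character", "numeric"),
  stringsAsFactors = FALSE     # keep name as a character column
)

str(scores)
unlink(tmp)
```

Because colClasses is given, read.table() does not have to guess the type of each column.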

read.table wrapper functions

read.csv(): Comma-separated file. sep has a default value of “,”.
read.csv2(): Semicolon-separated file. sep has a default value of “;”.
read.delim(): Useful for reading in tab-separated data. sep has a default value of “\t” (a tab).
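
As a sketch of how the three wrappers relate, the following writes the same hypothetical table in three layouts and reads each one back. Note that read.csv2() also expects a comma as the decimal mark.

```r
# The same small table in three layouts (hypothetical data)
csv_file  <- tempfile(fileext = ".csv")
csv2_file <- tempfile(fileext = ".csv")
tsv_file  <- tempfile(fileext = ".txt")

writeLines(c("id,score", "1,90.5", "2,82.25"), csv_file)
writeLines(c("id;score", "1;90,5", "2;82,25"), csv2_file)
writeLines(c("id\tscore", "1\t90.5", "2\t82.25"), tsv_file)

from_csv  <- read.csv(csv_file)     # sep = ",", dec = "."
from_csv2 <- read.csv2(csv2_file)   # sep = ";", dec = ","
from_tsv  <- read.delim(tsv_file)   # sep = "\t", dec = "."

# All three calls should produce the same data frame
all.equal(from_csv, from_csv2)
all.equal(from_csv, from_tsv)

unlink(c(csv_file, csv2_file, tsv_file))
```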

Memory Requirements for Datasets

R stores all objects in physical memory
Estimate the memory requirements before importing a new data frame
Number of bytes = number of rows x number of columns x 8 bytes (assuming every column is numeric)
- megabytes = number of bytes / 2^(20)
- gigabytes = megabytes / 2^(10)
file.size() and object.size() give the size of files and objects, respectively
Specifying the colClasses argument can make read.table() much faster, because R does not have to infer each column's type.
Use external packages and other formats for really large data.
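
The estimate above can be worked through for a data frame of 107840 rows and 56 columns, the dimensions of the enrollment data in this section. This is a rough sketch that assumes every column is numeric (8 bytes per value) and uses the binary convention (powers of 2) throughout; character columns will deviate from it.

```r
# Rough estimate, assuming all columns are numeric (8 bytes per value)
n_rows <- 107840
n_cols <- 56

bytes     <- n_rows * n_cols * 8
megabytes <- bytes / 2^20
gigabytes <- megabytes / 2^10

round(megabytes, 2)  # about 46.07 MB

# Compare the same estimate with the measured size of a small numeric data frame
small     <- as.data.frame(matrix(0, nrow = 1000, ncol = 10))
estimated <- 1000 * 10 * 8
measured  <- as.numeric(object.size(small))

# The measured size is a little larger because of per-object overhead
c(estimated = estimated, measured = measured)
```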

# Enrollment data prepared for Factbook 

ir_dir <- "R:/Institutional Research"

enrollment_data <- read.csv(str_c(ir_dir, "Tableau", "University Factbook", "Source Data",
                                  "Enrollment.csv", sep = "/"), stringsAsFactors = FALSE)

class(enrollment_data)

[1] "data.frame"

dim(enrollment_data)

[1] 107840     56

# WWR 2040 file 

# IR directory
ir_dir <- "R:/Institutional Research"

wwr2040_ftfy_2017 <- read.delim(str_c(ir_dir, "Data Management", "Staging", "WWR2040", "FTFY",
                                      "WWR2040_2017_09_15.txt", sep = "/"), stringsAsFactors = FALSE)


class(wwr2040_ftfy_2017)

[1] "data.frame"

head(wwr2040_ftfy_2017)

     pidm        id   term majr_code   majr_desc levl_code coll_code year
1 2861794 873286069 201770      MUPR Performance        UG        AH 2017
2 2859708 873283983 201770      MUPR Performance        UG        AH 2017
3 2865126 873289395 201770      MUPR Performance        UG        AH 2017
4 2849305 873273584 201770      MUPR Performance        UG        AH 2017
5 2858542 873282817 201770      MUPR Performance        UG        AH 2017
6 2867043 873291309 201770      MUPR Performance        UG        AH 2017
  appl_cnt comp_cnt acce_cnt conf_cnt
1        1        1        1        0
2        1        1        1        0
3        1        1        1        0
4        1        1        1        0
5        1        1        1        0
6        1        1        1        0

Brian Pattiz

2019-06-14
