Data I/O

Establishing Connections

  • Connections allow R to perform I/O with sources outside of R
  • A connection object can take the form of a file, URL, pipe or socket.
    • A pipe connection is used to run a system command and “pipe” the output to R (useful for task scheduling)
    • A socket connection allows R to communicate with another process (external software)

Functions for Importing Data from Connections

  • file(): Creates a connection to the file at the given path
  • open(), close(): Open and close a connection
  • scan(): Reads data from the console or a file into a vector whose elements all share one data type
  • readLines(): Reads text lines from a connection
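
The functions above can be combined as in the following minimal sketch. The temporary file and its three lines are made up for illustration and are not part of the original material.

```r
# Write a small, hypothetical text file to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

# file() only creates the connection object; open() actually opens it for reading.
con <- file(path)
open(con, open = "r")

# While the connection stays open, each read continues where the previous one stopped.
first <- readLines(con, n = 1)
rest  <- readLines(con)

# Release the connection and remove the temporary file.
close(con)
unlink(path)

print(first)
print(rest)
```

Opening the connection explicitly is what lets the second readLines() call pick up at the second line; passing the path directly to readLines() would start from the top of the file each time.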

Importing text data into R from a file using base R

  • Text data can be found in various formats, either on the internet or in local files
  • It is stored in an encoding such as ASCII or Unicode (for example UTF-8).
  • Most text data is ASCII; for other encodings, R lets you state the encoding when reading, for example through the encoding argument of file().
  • Using the functions scan() and readLines(), R can read character data from any connection.
[1] "This"      "is"        "a"         "test"      "file"      "containg" 
[7] "character" "data."    
[1] "This is a test file containg character data."
[1] "This is a test file containg character data. "
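
The code that produced the output above is not shown. As a minimal sketch of the same idea, assuming a made-up one-line file, scan() splits the text into whitespace-separated words while readLines() returns each line as a single string:

```r
# Hypothetical one-line text file, written to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines("This is a small example sentence.", path)

# scan() with what = character() returns one element per whitespace-separated token.
words <- scan(path, what = character(), quiet = TRUE)

# readLines() returns one element per line; the encoding is declared on the connection.
con <- file(path, encoding = "UTF-8")
lines <- readLines(con)
close(con)

unlink(path)

print(words)
print(lines)
```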

Importing Rectangular Data

Reading in Data with base R

The function read.table() is the most common base R function for reading in rectangular data.

Arguments for read.table():

  • file: name of the file or connection
  • header: A logical value indicating whether the first line of the file contains the column names
  • sep: The string that separates the columns
  • colClasses: A character vector of the class for each column
  • nrows: The maximum number of rows to read. By default the entire file is read.
  • skip: The number of lines to skip from the beginning
  • stringsAsFactors: A logical value indicating whether character vectors should be converted to factors. The default was TRUE before R 4.0.0 and has been FALSE since.
    • Setting this to FALSE is often advised. Character columns in the data are not necessarily categorical variables, and even when they are, the values read in may not cover every level, so you would have to add levels afterwards anyway.
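
A minimal sketch of how these arguments fit together is given here. The pipe-separated file, its column names and its values are invented for illustration only.

```r
# Hypothetical file: one line of preamble, a header line, then three data rows.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported from a hypothetical system",
  "name|score|passed",
  "Ana|91.5|TRUE",
  "Ben|78.0|FALSE",
  "Cy|85.25|TRUE"
), path)

df <- read.table(
  path,
  header = TRUE,                                      # first line after the skip holds column names
  sep = "|",                                          # columns are separated by a pipe character
  skip = 1,                                           # ignore the preamble line
  nrows = 2,                                          # read only the first two data rows
  colClasses = c("character", "numeric", "logical"),  # one class per column, in order
  stringsAsFactors = FALSE                            # explicit for clarity; colClasses already fixes the types
)

unlink(path)

str(df)
```

Because nrows is 2, the third data row is never read, and colClasses keeps the name column as plain character data rather than a factor.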

read.table wrapper functions

  • read.csv(): Comma-separated file. sep has a default value of “,”.
  • read.csv2(): Semicolon-separated file. sep has a default value of “;” and dec has a default value of “,”.
  • read.delim(): Tab-separated file. sep has a default value of “\t”.
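
As a small illustration of how the wrappers differ only in their defaults, the sketch below writes the same made-up table in three layouts and reads each one back. The file contents and variable names are hypothetical.

```r
# The same two-row table, stored with three different separators.
csv_path  <- tempfile(fileext = ".csv")
csv2_path <- tempfile(fileext = ".csv")
tsv_path  <- tempfile(fileext = ".tsv")

writeLines(c("id,value", "1,2.5", "2,3.5"), csv_path)      # comma separator, "." decimal
writeLines(c("id;value", "1;2,5", "2;3,5"), csv2_path)     # semicolon separator, "," decimal
writeLines(c("id\tvalue", "1\t2.5", "2\t3.5"), tsv_path)   # tab separator, "." decimal

from_csv  <- read.csv(csv_path)
from_csv2 <- read.csv2(csv2_path)
from_tsv  <- read.delim(tsv_path)

unlink(c(csv_path, csv2_path, tsv_path))

# All three calls should yield the same data frame.
print(all.equal(from_csv, from_csv2))
print(all.equal(from_csv, from_tsv))
```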

Memory Requirements for Datasets

  • R stores all objects in physical memory
  • Understand what the memory requirements might be when importing a new dataframe
  • Number of bytes = number of rows x number of columns x 8 bytes (assuming every column is numeric, at 8 bytes per value)
    • megabytes = number of bytes/2^(20)
    • gigabytes = megabytes/2^(10)
  • file.size() and object.size() will give the size of files and objects
  • Specifying the colClasses argument can make read.table() noticeably faster, because R no longer has to work out the type of each column.
  • Use external packages and other formats for really large data.
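
The size estimate in the bullets above can be sketched as follows. The helper estimate_bytes() and the table dimensions are hypothetical, and the estimate assumes every column is numeric.

```r
# Hypothetical helper: rows x columns x bytes per value (8 for a double).
estimate_bytes <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value
}

# Estimate for an imagined table of one million rows and ten numeric columns.
est_bytes <- estimate_bytes(1e6, 10)
est_mb    <- est_bytes / 2^20
est_gb    <- est_mb / 2^10

# Compare the rule of thumb with a small all-numeric data frame built in memory.
small          <- as.data.frame(matrix(0, nrow = 1000, ncol = 10))
small_estimate <- estimate_bytes(nrow(small), ncol(small))
small_actual   <- as.numeric(object.size(small))

# file.size() reports the size on disk, which generally differs from the size in memory.
path <- tempfile(fileext = ".csv")
write.csv(small, path, row.names = FALSE)
small_on_disk <- file.size(path)
unlink(path)

print(c(bytes = est_bytes, megabytes = est_mb, gigabytes = est_gb))
print(c(estimate = small_estimate, in_memory = small_actual, on_disk = small_on_disk))
```

The in-memory size comes out somewhat larger than the estimate because each column and the data frame itself carry some overhead beyond the raw values.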
[1] "data.frame"
[1] 107840     56
[1] "data.frame"
     pidm        id   term majr_code   majr_desc levl_code coll_code year
1 2861794 873286069 201770      MUPR Performance        UG        AH 2017
2 2859708 873283983 201770      MUPR Performance        UG        AH 2017
3 2865126 873289395 201770      MUPR Performance        UG        AH 2017
4 2849305 873273584 201770      MUPR Performance        UG        AH 2017
5 2858542 873282817 201770      MUPR Performance        UG        AH 2017
6 2867043 873291309 201770      MUPR Performance        UG        AH 2017
  appl_cnt comp_cnt acce_cnt conf_cnt
1        1        1        1        0
2        1        1        1        0
3        1        1        1        0
4        1        1        1        0
5        1        1        1        0
6        1        1        1        0