Personal notes for future reference, written after completing the JHU Data Science Specialization online via Coursera. The notes are paraphrased from Roger D. Peng’s book Mastering Software Development in R.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x <- 3
y <- 4
x + y
## [1] 7

Note that adding the echo = FALSE parameter to the code chunk would prevent printing of the R code that generated the output; only the result would appear in the document.

Reading Tabular Data with read_csv of readr Package

R’s built-in read.csv function also reads CSV files, but the read_csv function in readr improves on it by removing some of the quirks of read.csv and by dramatically speeding up reading data into R. It also adds a progress meter and a compact method for specifying column types.

The only required argument to read_csv is a character string specifying the path to the file to read. A typical call to read_csv:

library(readr)
teams <- read_csv("data/team_standings.csv")
## Parsed with column specification:
## cols(
##   Standing = col_integer(),
##   Team = col_character()
## )
teams
## # A tibble: 32 x 2
##    Standing Team       
##       <int> <chr>      
##  1        1 Spain      
##  2        2 Netherlands
##  3        3 Germany    
##  4        4 Uruguay    
##  5        5 Argentina  
##  6        6 Brazil     
##  7        7 Ghana      
##  8        8 Paraguay   
##  9        9 Japan      
## 10       10 Chile      
## # ... with 22 more rows

Column Types

By default, read_csv inspects the first rows of the table (up to 1,000, controlled by the guess_max argument) in order to guess the type of each column (integer, character, etc.). This is convenient and fast, but not robust: if the values that reveal a column’s true type appear only later in the file, the guess will be wrong and you’ll need to supply the correct types yourself.

You can specify the type of each column with the col_types argument. Here, “cc” is a compact specification with one letter per column: the first column is read as character and the second column is read as character (there are only two columns), so Standing is no longer parsed as an integer:

teams <- read_csv("data/team_standings.csv", col_types = "cc")
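
The same idea can be written out by column name with cols(). The following is a minimal sketch, assuming a small made-up standings file written to a temporary path (the data and the names path, teams_typed, and teams_compact are illustrative, not from the original notes):

```r
library(readr)

# Hypothetical sample data, written to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("Standing,Team", "1,Spain", "2,Netherlands", "3,Germany"), path)

# Name each column and give its type explicitly with the col_* functions.
teams_typed <- read_csv(
  path,
  col_types = cols(
    Standing = col_integer(),
    Team = col_character()
  )
)

# Equivalent compact string form: "i" = integer, "c" = character.
teams_compact <- read_csv(path, col_types = "ic")

# Inspect the resulting column types.
str(teams_typed)
```

Supplying col_types also suppresses the column specification message, since nothing has to be guessed.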

Reading in Compressed Files and Specific Col Types (Dates)

The read_csv function will also read compressed files automatically. There is no need to decompress (unzip, etc.) the file first or use the gzfile connection function.

The following call reads a gzip-compressed CSV file containing download logs from the RStudio CRAN mirror.

logs <- read_csv("data/2016-07-19.csv.gz", n_max = 10)
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   size = col_integer(),
##   r_version = col_character(),
##   r_arch = col_character(),
##   r_os = col_character(),
##   package = col_character(),
##   version = col_character(),
##   country = col_character(),
##   ip_id = col_integer()
## )
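
To try this without the CRAN log file, a small gzip-compressed CSV can be created and read back. This is only a sketch: the three columns and their values are made up for illustration, and the file is written with base R’s gzfile purely to create test input (reading it needs no connection function):

```r
library(readr)

# Create a small gzip-compressed CSV in a temporary location (hypothetical data).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(
  c("date,package,size",
    "2016-07-19,readr,1024",
    "2016-07-19,dplyr,2048"),
  con
)
close(con)

# read_csv decompresses the file on the fly; no manual unzip step is needed.
# "Dci" = Date, character, integer. n_max caps the number of rows read.
logs_small <- read_csv(gz_path, col_types = "Dci", n_max = 10, progress = FALSE)
logs_small
```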

The message “Parsed with column specification…” printed after the call is informational: it reports the type that read_csv guessed for each column. It is worth checking, because a guess can be wrong. Supplying the types yourself with the col_types argument makes them explicit and suppresses the message.

You can specify the column type in a more detailed fashion by using the various col_* functions. For example, in the log data the first column is a date, so it makes sense to read it in explicitly as a Date object. If we wanted to read in only that first column, we could do:

logdates <- read_csv("data/2016-07-20.csv.gz", col_types = cols_only(date = col_date()), n_max = 10)
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
logdates
## # A tibble: 10 x 1
##    date      
##    <date>    
##  1 2016-07-20
##  2 2016-07-20
##  3 2016-07-20
##  4 2016-07-20
##  5 2016-07-20
##  6 2016-07-20
##  7 2016-07-20
##  8 2016-07-20
##  9 2016-07-20
## 10 2016-07-20

Now the date column is stored as a Date object which can be used for relevant date-related computations such as those in the lubridate package.
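
As a rough illustration of such date computations, the sketch below reads only a date column from a small made-up file and applies a few lubridate accessors. The file contents and the name dates_only are assumptions for the example:

```r
library(readr)
library(lubridate)

# Hypothetical two-row file with a date column and one other column.
path <- tempfile(fileext = ".csv")
writeLines(c("date,package", "2016-07-20,readr", "2016-07-21,dplyr"), path)

# cols_only() keeps just the named column; col_date() parses ISO 8601 dates by default.
dates_only <- read_csv(path, col_types = cols_only(date = col_date()))

# Because the column is a Date, date arithmetic and lubridate accessors work directly.
year(dates_only$date)                 # calendar year of each date
month(dates_only$date)                # month number of each date
wday(dates_only$date, label = TRUE)   # day of the week as a labelled factor
diff(dates_only$date)                 # time difference between consecutive rows
```

For dates that are not in ISO 8601 form, col_date takes a format argument, for example col_date(format = "%m/%d/%Y").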

The readr progress Option

The read_csv function has a progress option that defaults to TRUE. However, if you are using read_csv inside a function or a loop, it’s probably best to set progress = FALSE so that the progress meter does not clutter the output.
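
A minimal sketch of that pattern follows, assuming a hypothetical helper read_quietly and three small made-up files in a temporary directory:

```r
library(readr)

# Hypothetical helper: reads one file with explicit types and no progress meter.
read_quietly <- function(path) {
  read_csv(path, col_types = "ic", progress = FALSE)
}

# Create three small sample files (made-up data) in a temporary directory.
tmp_dir <- tempfile("standings_")
dir.create(tmp_dir)
for (i in 1:3) {
  writeLines(
    c("Standing,Team", paste0(i, ",Team", i)),
    file.path(tmp_dir, paste0("part", i, ".csv"))
  )
}

# Read every file with the helper and stack the results into one table.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
all_parts <- do.call(rbind, lapply(files, read_quietly))
all_parts
```

Passing col_types inside the helper also keeps the column specification message from being printed once per file.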

readr

Function      Use
read_csv      Comma-separated file
read_csv2     Semicolon-separated file
read_tsv      Tab-separated file
read_delim    General delimited files
read_fwf      Fixed-width files
read_log      Log files
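
The other readers in the table follow the same calling pattern as read_csv. As a sketch, assuming two small made-up files (one pipe-delimited, one tab-separated), read_delim and read_tsv can be used like this:

```r
library(readr)

# Hypothetical pipe-delimited file: read_delim needs the delimiter spelled out.
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("Standing|Team", "1|Spain", "2|Netherlands"), pipe_path)
pipe_data <- read_delim(pipe_path, delim = "|", col_types = "ic")

# The same data, tab-separated: read_tsv assumes the tab delimiter.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("Standing\tTeam", "1\tSpain", "2\tNetherlands"), tsv_path)
tsv_data <- read_tsv(tsv_path, col_types = "ic")

# Both readers should yield the same columns.
identical(pipe_data$Team, tsv_data$Team)
```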