Personal notes for future reference, written after completing the JHU Data Science Specialization, taken online via the Coursera LMS. The notes are paraphrased from Roger D. Peng’s book Mastering Software Development in R.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
x <- 3
y <- 4
x + y
## [1] 7
Note that adding the echo = FALSE parameter to the code chunk would prevent printing of the R code that generated the output.
read_csv of readr Package
R’s built-in read.csv function also reads CSV files, but the read_csv function in readr improves on it by removing some of the quirks of read.csv and by reading data into R considerably faster. It also adds a progress meter and a compact way to specify column types.
The only required argument to read_csv is a character string specifying the path to the file to read. A typical call to read_csv:
library(readr)
teams <- read_csv("data/team_standings.csv")
## Parsed with column specification:
## cols(
## Standing = col_integer(),
## Team = col_character()
## )
teams
## # A tibble: 32 x 2
## Standing Team
## <int> <chr>
## 1 1 Spain
## 2 2 Netherlands
## 3 3 Germany
## 4 4 Uruguay
## 5 5 Argentina
## 6 6 Brazil
## 7 7 Ghana
## 8 8 Paraguay
## 9 9 Japan
## 10 10 Chile
## # ... with 22 more rows
By default, read_csv reads in the first few rows of the table to guess the type of each column (integer, character, etc.); the guess_max argument controls how many rows are used. This is convenient and fast, but not robust. If the guess is wrong, you’ll need to supply the correct types yourself.
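As a minimal sketch of how to check what read_csv guessed, the following writes a small throwaway CSV to a temporary file and inspects the result with spec() and the column classes. The file contents and the name teams_demo are made up for illustration. The exact numeric type guessed for Standing can differ between readr versions, so no output is shown.

```r
library(readr)

# Hypothetical stand-in for data/team_standings.csv, written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Standing,Team", "1,Spain", "2,Netherlands", "3,Germany"), tmp)

# Let read_csv guess the column types
teams_demo <- read_csv(tmp)

# Show the column specification that was guessed
spec(teams_demo)

# Check the resulting class of each column
sapply(teams_demo, class)
```

If a guessed class is not what you expected, that is the cue to pass col_types explicitly.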
You can also specify the type of each column with the col_types argument. Here, the “cc” indicates that the first column is character and the second column is character (there are only two columns):
teams <- read_csv("data/team_standings.csv", col_types = "cc")
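The compact string is shorthand for a fuller specification built with cols(). The sketch below, which uses a made-up temporary file rather than the real standings data, reads the same file both ways with Standing as an integer ("i") and Team as character ("c").

```r
library(readr)

# Hypothetical two-column file written to a temp location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Standing,Team", "1,Spain", "2,Netherlands", "3,Germany"), tmp)

# Compact form: one letter per column ("i" = integer, "c" = character)
teams_compact <- read_csv(tmp, col_types = "ic")

# Equivalent verbose form using cols() and the col_* functions
teams_verbose <- read_csv(
  tmp,
  col_types = cols(Standing = col_integer(), Team = col_character())
)

class(teams_compact$Standing)  # "integer"
identical(teams_compact$Standing, teams_verbose$Standing)
```

The compact form is quicker to type; the verbose form is easier to read when there are many columns, because each type is attached to a column name.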
The read_csv function will also read compressed files automatically. There is no need to decompress (unzip, etc.) the file first or use the gzfile connection function.
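To try this without the CRAN log file at hand, a rough sketch is to create a small gzip-compressed CSV yourself and read it back. The file contents and the name logs_demo are invented for illustration; the only assumption about readr is that it recognises the .gz file extension.

```r
library(readr)

# Write a tiny CSV through a gzip connection (base R), just to have a test file
tmp_gz <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp_gz, "w")
writeLines(c("package,size", "readr,100", "dplyr,200"), con)
close(con)

# read_csv decompresses on the fly; no gzfile() call is needed when reading
logs_demo <- read_csv(tmp_gz, col_types = "ci")
logs_demo
```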
The following call reads a gzip-compressed CSV file containing download logs from the RStudio CRAN mirror.
logs <- read_csv("data/2016-07-19.csv.gz", n_max = 10)
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## time = col_time(format = ""),
## size = col_integer(),
## r_version = col_character(),
## r_arch = col_character(),
## r_os = col_character(),
## package = col_character(),
## version = col_character(),
## country = col_character(),
## ip_id = col_integer()
## )
The message “Parsed with column specification…” printed after the call reports the column types that read_csv guessed. It is informational rather than an error, but it is a reminder that the types were inferred and may be wrong. Supplying the col_types argument makes the types explicit and suppresses the message.
You can specify column types in more detail with the various col_* functions. For example, in the data above the first column is a date, so it makes sense to read it in as a Date object. To read in only that first column, use cols_only:
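One way to act on the column specification message is to capture the specification from a first read and pass it back in as col_types on later reads. This is only a sketch on a made-up temporary file; first_read and second_read are illustrative names.

```r
library(readr)

# Hypothetical small file standing in for a real data set
tmp <- tempfile(fileext = ".csv")
writeLines(c("package,size", "readr,100", "dplyr,200"), tmp)

# First read: types are guessed and the specification is reported
first_read <- read_csv(tmp)

# Reuse the guessed specification so the second read has explicit types
second_read <- read_csv(tmp, col_types = spec(first_read))

# Both reads should produce the same column classes
identical(sapply(first_read, class), sapply(second_read, class))
```

In a script you would normally copy the printed specification into the code and adjust any types that were guessed incorrectly.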
logdates <- read_csv("data/2016-07-20.csv.gz", col_types = cols_only(date = col_date()), n_max = 10)
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
## Warning in read_tokens_(data, tokenizer, col_specs, col_names, locale_, :
## length of NULL cannot be changed
logdates
## # A tibble: 10 x 1
## date
## <date>
## 1 2016-07-20
## 2 2016-07-20
## 3 2016-07-20
## 4 2016-07-20
## 5 2016-07-20
## 6 2016-07-20
## 7 2016-07-20
## 8 2016-07-20
## 9 2016-07-20
## 10 2016-07-20
Now the date column is stored as a Date object which can be used for relevant date-related computations such as those in the lubridate package.
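As a small illustration of what a Date column allows, the sketch below reads a made-up file with cols_only and then uses base R date arithmetic; lubridate offers more convenient helpers, but base R is used here to keep the example dependency-free. The file contents and the name dates_demo are assumptions.

```r
library(readr)

# Hypothetical log-like file with a date column and one other column
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("date,package", "2016-07-20,readr", "2016-07-21,dplyr", "2016-07-27,tidyr"),
  tmp
)

# Keep only the date column and parse it as a Date
dates_demo <- read_csv(tmp, col_types = cols_only(date = col_date()))

class(dates_demo$date)   # "Date"
range(dates_demo$date)   # earliest and latest date
diff(dates_demo$date)    # gaps between consecutive rows, in days
dates_demo$date + 7      # shift every date forward by one week
```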
readr progress Option
The read_csv function has a progress option that is on by default in interactive sessions. However, if you are calling read_csv inside a function or a loop, it’s probably best to set progress = FALSE so that the progress meter does not clutter the output.
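A minimal sketch of that advice: wrap read_csv in a small helper that turns the progress meter off and fixes the column types, then apply it over several files. The helper name read_quietly and the throwaway files are hypothetical.

```r
library(readr)

# Hypothetical helper: read one CSV quietly with fixed column types
read_quietly <- function(path) {
  read_csv(path, col_types = "ic", progress = FALSE)
}

# Create three small throwaway files to loop over
paths <- vapply(1:3, function(i) {
  p <- tempfile(fileext = ".csv")
  writeLines(c("id,label", paste0(i, ",file", i)), p)
  p
}, character(1))

# Read every file and stack the results into one table
all_rows <- do.call(rbind, lapply(paths, read_quietly))
nrow(all_rows)  # 3
```

Fixing col_types in the helper also keeps the column specification message from being printed once per file.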
| readr function | Use |
|---|---|
| read_csv | comma-separated file |
| read_csv2 | semicolon-separated file |
| read_tsv | tab-separated file |
| read_delim | general unknown delimited files |
| read_fwf | fixed width files |
| read_log | log files |
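The functions in this family share most of their arguments with read_csv. As a rough sketch, the same two-row table is written below once as a tab-separated file and once with a pipe delimiter, then read back with read_tsv and read_delim. The files and object names are made up for illustration.

```r
library(readr)

# The same small table in two hypothetical delimited formats
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tlabel", "1\ta", "2\tb"), tsv_path)

pipe_path <- tempfile(fileext = ".txt")
writeLines(c("id|label", "1|a", "2|b"), pipe_path)

# read_tsv assumes tabs; read_delim takes the delimiter explicitly
from_tsv <- read_tsv(tsv_path, col_types = "ic")
from_pipe <- read_delim(pipe_path, delim = "|", col_types = "ic")

# Both reads should yield the same columns
identical(from_tsv$id, from_pipe$id)
identical(from_tsv$label, from_pipe$label)
```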