In this vignette, we will focus on the readr package and its ability
to parse files into tables from data files such as csv’s and
tsv’s.
We will cover the material in two sections; Functions & Column
Specifications
Main Functions
The main functions from the readr package are focused on the
different categories of flat file formats to contain data.
A file and command will be made for each so that the user can view
the file structure outside of R and the command associated with it.
Take a look at each of the linked files to get an idea of the
structure and run the associated function.
read_csv(): comma-separated values (CSV)
- this file has each column separated by commas
CSV
Example
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
read_tsv(): tab-separated values (TSV)
- this file has each column separated by tabs
TSV
Example
read_tsv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/tsv_format.txt")
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
read_csv2(): semicolon-separated values with a
comma as the decimal mark for numbers
- this file has each column separated by semicolons and any numbers
that have decimal values have a comma to replace them
CSV2
Example
read_csv2("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv2_format.txt")
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <dbl>
## 1 value1 value2 142.
read_delim(): generic delimited files
- this command is a generic version of the read_csv command that allow
the user to select the delimiter, in this example we use ‘|’
Generic
Delim Example
read_delim("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/delim_format.txt", delim = "|")
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
read_fwf(): fixed-width files
- this file has the start and end of columns at specifically stated
dimensions, ie. the first column ends on character 9, and the second
column ends on the 20th character
- note that this function cannot handle urls
For this reason, we will use the readr example.
There are several options to be used with read_fwf.
fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample))
## John Smith WA 418-Y11-4111
## Mary Hartford CA 319-Z19-4341
## Evan Nolan IL 219-532-c301
- fwf_empty() - Guesses based on the positions of empty columns
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
## # A tibble: 3 × 4
## first last state ssn
## <chr> <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
- fwf_widths() - Supply the widths of each column
Notice we combined first and last names by denoting the column width
for the first column to be 20.
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
## # A tibble: 3 × 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
- fwf_positions() - Supply paired vectors of start and end
positions
Notice the vectors passed to fwf_samples are read as
fwf_positions(c(start_position_vec1,
start_position_vec2), c(end_position_vec1,
end_position_vec2)). Do not read them as c(start_position_vec1,
end_position_vec2).
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))
## # A tibble: 3 × 2
## name ssn
## <chr> <chr>
## 1 John Smith 418-Y11-4111
## 2 Mary Hartford 319-Z19-4341
## 3 Evan Nolan 219-532-c301
- fwf_cols() - Supply column names and their respective lengths
Note that subsequent named columns will determine the start of its
column by noting the position the previous column ended. I.E. The name
column is the first column and lasts 20 characters. The state column
starts at character 21 and continues for 10 more characters. The ssn
column starts at character 31 and ends 12 characters later. The user
only needs to tell fwf_cols how long each column is.
read_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
## # A tibble: 3 × 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
read_table(): whitespace separated files
- this file designates columns at either ends of whitespace and
removes the whitespace afterwards Whitespace
Example
read_table("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/whitespace_format.txt")
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
read_log(): web log files
- uses a combination of quotation marks and brackets in file structure
Log
Example
read_log("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/example.log")
## # A tibble: 2 × 7
## X1 X2 X3 X4 X5 X6 X7
## <chr> <lgl> <chr> <chr> <chr> <dbl> <dbl>
## 1 172.21.13.45 NA "Microsoft\\JohnDoe" 08/Apr/2001:17:39:0… GET … 200 3401
## 2 127.0.0.1 NA "frank" 10/Oct/2000:13:55:3… GET … 200 2326
Column Specifications
Readr will naturally make guesses about the
best column type to use
by parsing out 1000 values in a particular column. By default, it
usually works acceptably well. Sometimes it needs a more delicate touch
and the readr package will allow the user to specify these column types
during import with one of the read_ functions.
In each of these read functions, readr will
allow the user to specify column types by passing their read function a
list. There is also an option that allows skipping a column. This is
quite useful in the early
stages of data tidying to avoid any column type changes or column
drops in the middle of computations/transformations that are type
specific that normally pop up in tidying. This allows the user to keep a
very clean workflow.
An exhaustive list of types is below with abbreviations in
brackets:
col_logical() [l], containing only T, F, TRUE or FALSE
col_integer() [i], integers
col_double() [d], doubles
col_character() [c], everything else
col_factor(levels, ordered) [f], a fixed set of values
col_date(format = ““) [D]: with the locale’s date_format
col_time(format =““) [t]: with the locale’s time_format
col_datetime(format =““) [T]: ISO8601 date times
col_number() [n], numbers containing the grouping_mark
col_skip() [ _ , -], don’t import this column
col_guess() [?], parse using the “best” type based on the
input
Let’s use a few of these to gain some
experience. In this first
example, we will be as verbose as possible in our code.
In the below example, we follow the standard
syntax of read_csv.
read_csv(file,
col_types = list(
column_name =
column_type(),
….
)
)
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
In this example, let’s try to cut down on typing. We can avoid
denoting each column and the full type by using the col type’s
abbreviation for each column from the list earlier.
read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = list(
'c',
'c',
'c'
))
## # A tibble: 1 × 3
## column1 column2 column3
## <chr> <chr> <chr>
## 1 value1 value2 value3
Let’s say we want to skip a column. There are two methods to achieve
this. One by using the abbreviation and the other by using the
cols_only() option. In this example, lets skip the 2nd column via
abbreviation.
read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = list(
'c',
'_', # You can also use '-' instead of '_'
'c'
))
## # A tibble: 1 × 2
## column1 column3
## <chr> <chr>
## 1 value1 value3
Now with the cols_only option instead of abbreviation. This option is
useful when you have many columns and are already aware of the few
you’re interested in.
read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/csv_format.txt", col_types = cols_only(
column1 = 'c',
column2 = 'c'
))
## # A tibble: 1 × 2
## column1 column2
## <chr> <chr>
## 1 value1 value2
Specifications for Factor Data
If you’re unfamiliar with factors in the R
world, these can be simplified into categories of a categorical
variable. For example, if you have a list of cars, one of the categories
may be the manufacturer. Each of these manufacturers are considered a
factor. When a column has its type labelled as “factor,” it will only
ever accept a value in its rows as one of those “factors”
(categories).
For this example, we will use a slightly modified version of the
mtcars data set. The only change made was setting the first column to be
the make of the car by printing the mtcars data to csv, opening the
file, and typing the column name in.
Although there a other methods to changing a column name in R, this
is an example of how understanding flat file composition can come in
handy. We will be using the make as different factors of the categorical
variable of manufacturer. This would be a good exercise to follow along
with on your end.
Data Prep for Factors (optional)
To create an example using multiple column types. Let’s use the
mtcars dataset as if we were importing it for the first time. To make
this example fitting, I would like to include a factor variable and a
date variable.
Let’s first ‘add’ the factor variable by naming the first column of
the mtcars data set. I first print the mtcars dataset with
write.csv.
library(stringr)
write.csv(mtcars,"C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsReadr.csv")
I open this csv file and add a make column by following the file
structure and save it as a new file.

From here, I will read the file back in with read_csv and edit the
first column so that it only contains the manufacturers.
mtcarsEdit <- read_csv("C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsMake.csv")
#Removing excess text
mtcarsEdit$make <- str_remove(mtcarsEdit$make, " .*")
I then create a vector of dates and bind them to the mtcarsEdit data
frame. I re-write this using another function from readr, write_csv.
#Creating a date column for use in the next example
relDate <- c(sample(seq(as.Date('1970/01/01'), as.Date('1980/01/01'), by="day"), 32))
#Combining the original mtcars dataframe and the new date column
mtcarsEdit <- cbind(mtcarsEdit,relDate)
mtcarsEdit
## make mpg cyl disp hp drat wt qsec vs am gear carb relDate
## 1 Mazda 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1974-08-21
## 2 Mazda 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1975-02-20
## 3 Datsun 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1972-07-06
## 4 Hornet 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1979-08-19
## 5 Hornet 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1975-01-27
## 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1973-02-06
## 7 Duster 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 1971-07-14
## 8 Merc 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1973-06-05
## 9 Merc 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1978-03-19
## 10 Merc 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 1975-11-20
## 11 Merc 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1975-12-05
## 12 Merc 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1978-01-30
## 13 Merc 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1978-06-15
## 14 Merc 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1972-05-02
## 15 Cadillac 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1970-03-26
## 16 Lincoln 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1973-09-17
## 17 Chrysler 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1973-05-01
## 18 Fiat 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 1970-10-14
## 19 Honda 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1978-08-15
## 20 Toyota 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1975-05-12
## 21 Toyota 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1977-01-11
## 22 Dodge 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1976-09-18
## 23 AMC 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 1977-02-28
## 24 Camaro 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1970-12-14
## 25 Pontiac 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 1973-11-29
## 26 Fiat 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 1972-07-31
## 27 Porsche 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1970-10-09
## 28 Lotus 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 1979-12-13
## 29 Ford 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1975-08-17
## 30 Ferrari 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1971-03-31
## 31 Maserati 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 1977-02-06
## 32 Volvo 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1973-02-09
#I re-printed and pushed to github to make available for those that do not wish to copy the previous steps
write_csv(mtcarsEdit,"C:\\repos\\DATA607\\TIDYVERSE Create\\mtcarsEdit.csv")
From here, we will use the github file we
just wrote out above, its path is set to a variable below as ‘url2,’ to
use as an example for importing columns as factors with readr.
Importing & Complete Example w/ mtcarsEdit
Let’s say we only want to import the make,
mpg, and release date from the modified mtcars dataset just above. We
know that the the make should be a factor, as its a categorical
variable, the mpg should be a double, and the release date as a date.
Recall that integers are whole numbers, doubles can contain decimal
points.
Since we know that we’re only interested in certain columns, we will
use the col_only() sub option in combination with our col_types to get a
clean import.
url2 <- "https://raw.githubusercontent.com/d-ev-craig/DATA607/main/TIDYVERSE%20Create/mtcarsEdit.csv"
read_csv(url2, col_types = cols_only(
make = 'f',
mpg = 'd',
relDate = 'D'))
## # A tibble: 32 × 3
## make mpg relDate
## <fct> <dbl> <date>
## 1 Mazda 21 1977-12-15
## 2 Mazda 21 1977-06-17
## 3 Datsun 22.8 1977-07-18
## 4 Hornet 21.4 1970-03-01
## 5 Hornet 18.7 1972-06-10
## 6 Valiant 18.1 1979-10-28
## 7 Duster 14.3 1975-05-17
## 8 Merc 24.4 1971-07-10
## 9 Merc 22.8 1973-02-26
## 10 Merc 19.2 1976-04-22
## # ℹ 22 more rows