Handy readr functions for working with CSV and other data formats

I was looking for a simple dataset with count data for many items to demonstrate some basic readr functions. Luckily, readr comes bundled with a good demo dataset.

chickens <- read_csv(readr_example("chickens.csv"))
chickens

## # A tibble: 5 × 4
##   chicken                 sex     eggs_laid motto                               
##   <chr>                   <chr>       <dbl> <chr>                               
## 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
## 2 Chicken Little          hen             3 The sky is falling!                 
## 3 Ginger                  hen            12 Listen. We'll either die free chick…
## 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
## 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold.

1A: Setting column data types

Q: How do I set column types? A: Use readr column specifications.

By default, readr guesses the type of each column when it reads the file. The guesses are usually good, but they are not perfect. For example, the guessed column type for eggs_laid is double, as spec() shows.

spec(chickens)

## cols(
##   chicken = col_character(),
##   sex = col_character(),
##   eggs_laid = col_double(),
##   motto = col_character()
## )

Since chickens do not lay fractional eggs, we may want readr to read eggs_laid as an integer. We may also want sex to be read in as a factor instead of character. Note that we set specifications only for the columns that were not guessed correctly; the remaining columns are still guessed.

chickens <- read_csv(readr_example("chickens.csv"),
                     col_types = cols(
                         sex = col_factor(levels = c('rooster', 'hen')),
                         eggs_laid = col_integer()
                     )
)
chickens

## # A tibble: 5 × 4
##   chicken                 sex     eggs_laid motto                               
##   <chr>                   <fct>       <int> <chr>                               
## 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
## 2 Chicken Little          hen             3 The sky is falling!                 
## 3 Ginger                  hen            12 Listen. We'll either die free chick…
## 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
## 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold.
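Listing the factor levels explicitly has a side effect worth knowing about: a value that is not among the declared levels cannot be parsed. The sketch below illustrates this with a small made-up CSV (the `csv_text` data and the `birds` name are hypothetical, not part of the chickens file), passed as literal text via `I()`, which assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical two-row CSV supplied as literal text via I() (readr >= 2.0).
csv_text <- I("chicken,sex,eggs_laid\nFoghorn,rooster,0\nDaffy,duck,2\n")

# suppressWarnings() hides the "parsing issues" warning; the details
# remain available through problems().
birds <- suppressWarnings(
  read_csv(
    csv_text,
    col_types = cols(
      chicken = col_character(),
      sex = col_factor(levels = c("rooster", "hen")),
      eggs_laid = col_integer()
    )
  )
)

birds$sex        # "duck" is not a declared level, so it is read as NA
problems(birds)  # reports where parsing failed
```

If unexpected values are possible, calling col_factor() without levels lets readr take the levels from the data instead.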

The column types have now been set correctly. A default type can also be supplied through .default; readr then uses it, instead of guessing, for every column that is not named in the specification.

chickens <- read_csv(readr_example("chickens.csv"),
                     col_types = cols(
                         sex = col_factor(levels = c('rooster', 'hen')),
                         eggs_laid = col_integer(),
                         .default = col_character()
                     )
)
chickens

## # A tibble: 5 × 4
##   chicken                 sex     eggs_laid motto                               
##   <chr>                   <fct>       <int> <chr>                               
## 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
## 2 Chicken Little          hen             3 The sky is falling!                 
## 3 Ginger                  hen            12 Listen. We'll either die free chick…
## 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
## 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold.

Finally, a compact way of providing column types is a string with one character per column, in column order. For example, cfi reads the first column as character, the second as factor, and the third as integer (f stands for factor; the code for double is d). To skip a column, put an underscore (_) in its position.

chickens <- read_csv(readr_example("chickens.csv"),
                     col_types = "cfi_"
                     )
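To make the compact string less cryptic, the following sketch reads the file twice, once with "cfi_" and once with what should be the equivalent spelled-out specification, and compares the resulting column classes. The names `compact` and `verbose` are only for illustration.

```r
library(readr)

# Compact form: c = character, f = factor, i = integer, _ = skip.
compact <- read_csv(readr_example("chickens.csv"), col_types = "cfi_")

# The same specification written out in full.
verbose <- read_csv(
  readr_example("chickens.csv"),
  col_types = cols(
    chicken = col_character(),
    sex = col_factor(),        # levels taken from the data, in order of appearance
    eggs_laid = col_integer(),
    motto = col_skip()
  )
)

names(compact)           # motto is skipped, so only three columns remain
sapply(compact, class)
sapply(verbose, class)
```

The compact form is handy for quick interactive work, while the named form is easier to read back later and does not depend on column order.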

2A: Parsing atomic vectors

Q: How can I parse a character vector into a specific data type? A: Use the readr::parse_*() functions.

parse_double(c('1.1', '2', '3', '4'))

## [1] 1.1 2.0 3.0 4.0

parse_logical(c('t', 'f'))

## [1]  TRUE FALSE
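When a value cannot be parsed, the parse_*() functions return NA for it and record the failure rather than stopping. The sketch below shows one way to inspect those failures with problems(), and how the na argument can mark additional strings as missing; the input vectors are made up for illustration.

```r
library(readr)

# "2.5" and "abc" are not valid integers, so both become NA.
# suppressWarnings() only hides the warning; the failures are still recorded.
x <- suppressWarnings(parse_integer(c("1", "2.5", "abc")))
x
problems(x)   # one row per value that failed to parse

# Strings listed in `na` are treated as missing, not as parsing failures.
y <- parse_double(c("1.1", "n/a"), na = "n/a")
y
```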

Unlike parse_integer() and parse_double(), parse_number() can handle non-numeric prefixes and suffixes, as well as grouping marks such as the commas in 1,000,000.

parse_number(c('$123.45', '1,000,000'))

## [1]     123.45 1000000.00
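The examples above rely on the default locale, where "." is the decimal mark and "," groups thousands. For numbers written the other way round, a locale() can be passed to the parser. This is a minimal sketch; the `de` locale object and the sample strings are assumptions chosen for illustration.

```r
library(readr)

# Continental European formatting: "." groups thousands, "," marks decimals.
de <- locale(decimal_mark = ",", grouping_mark = ".")

euro <- parse_number("1.234,56 EUR", locale = de)
half <- parse_double("1,5", locale = de)

euro
half
```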

Finally, there are flexible Date/Time parsers.

parse_datetime('2022-10-29 18:06')

## [1] "2022-10-29 18:06:00 UTC"

parse_date('2022-10-29')

## [1] "2022-10-29"

parse_time("3:08 pm")

## 15:08:00

The date/time parsers take an optional format argument that specifies the string format.

parse_date("10/29/2022", format = "%m/%d/%Y")

## [1] "2022-10-29"
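A few more format strings may help show how the pieces combine. The sketch below uses %B for a full month name and %y for a two-digit year, and sets a time zone through locale(); the input strings and the choice of "America/New_York" are illustrative assumptions.

```r
library(readr)

# %d = day, %B = full month name, %Y = four-digit year
d1 <- parse_date("29 October 2022", format = "%d %B %Y")

# %y = two-digit year
d2 <- parse_date("29.10.22", format = "%d.%m.%y")

# Interpret the clock time in a specific time zone instead of UTC.
dt <- parse_datetime("2022-10-29 18:06",
                     locale = locale(tz = "America/New_York"))

d1
d2
dt
```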

1B. BEN INBAR. Use chicken demo example data to feature the infamous ‘forcats’ package, useful for the factor data type in the tibble.

fct_relevel() lets us change the order of a factor's levels by hand, on the fly. Levels named in the call are moved to the front by default, which changes the order of the bars in the plot below.

ggplot(data=chickens, aes(x=sex)) + geom_bar(fill='lightyellow')

ggplot(data=chickens, aes(x=fct_relevel(sex, "hen"))) + geom_bar(fill='lightyellow')
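The effect of fct_relevel() can also be checked without plotting by looking at the levels directly. This sketch rebuilds the sex column as a standalone factor (an assumption made to keep the example self-contained).

```r
library(forcats)

# Same values and level order as the sex column read in section 1A.
sex <- factor(c("rooster", "hen", "hen", "hen", "rooster"),
              levels = c("rooster", "hen"))

levels(sex)

# Move "hen" to the front; the data values themselves are unchanged.
hen_first <- fct_relevel(sex, "hen")
levels(hen_first)

# `after = Inf` moves a level to the end instead.
rooster_last <- fct_relevel(sex, "rooster", after = Inf)
levels(rooster_last)
```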

fct_infreq() orders the levels by frequency, with the most frequent level first.

ggplot(data=chickens, aes(x=fct_infreq(sex))) + geom_bar(fill='lightyellow') + geom_text(aes(label = after_stat(count)), stat = "count")
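As a quick check of the frequency ordering, the sketch below applies fct_infreq() to a standalone copy of the sex values (an assumption for self-containment), reverses the order with fct_rev(), and tabulates the counts with fct_count().

```r
library(forcats)

sex <- factor(c("rooster", "hen", "hen", "hen", "rooster"),
              levels = c("rooster", "hen"))

# Most frequent level first: hen appears three times, rooster twice.
by_freq <- fct_infreq(sex)
levels(by_freq)

# fct_rev() flips the order, giving least frequent first.
levels(fct_rev(by_freq))

# A small table of counts, sorted from most to least frequent.
counts <- fct_count(sex, sort = TRUE)
counts
```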

2B. Stringr. Take the ‘motto’ character column from the chickens tibble and extract the words in a specified range using the word() function. Then use dplyr's mutate() to append the result as a new column.

chickens <- read_csv(readr_example("chickens.csv"))
word23 <- word(chickens$motto, start=2, end=3, sep=fixed(" "))
chickens |> mutate(word_2to3 = word23)

## # A tibble: 5 × 5
##   chicken                 sex     eggs_laid motto                        word_…¹
##   <chr>                   <chr>       <dbl> <chr>                        <chr>  
## 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that… a joke,
## 2 Chicken Little          hen             3 The sky is falling!          sky is 
## 3 Ginger                  hen            12 Listen. We'll either die fr… We'll …
## 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.         buck, …
## 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the car… Captai…
## # … with abbreviated variable name ¹word_2to3
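The intermediate vector is not strictly needed: word() can be called inside mutate() directly, and negative positions count from the end of the string. The sketch below uses a small made-up tibble (`mottos`, holding two of the mottos) so that it runs on its own.

```r
library(dplyr)
library(stringr)

# Two of the mottos from the chickens data, as a standalone tibble.
mottos <- tibble(motto = c("The sky is falling!", "Bawk, buck, ba-gawk."))

result <- mottos |>
  mutate(
    word_2to3 = word(motto, start = 2, end = 3),  # second through third word
    last_word = word(motto, -1)                   # -1 counts from the end
  )

result
```

Note that word() splits on spaces only, so punctuation stays attached to the words it touches.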

tidyverse: using readr to control column data types

Jhakim, extended by Ben Inbar

10/25/2022
