Controlling Column Types

Author

Jamal Rogers

Published

May 16, 2023

We load the tidyverse package to continue using the readr package.

library(tidyverse)

A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.

Guessing types

readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,0002 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:

Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.
Does it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.
Does it match the ISO8601 standard? If so, it’s a date or date-time.
Otherwise, it must be a string.

You can see that behavior in action in this simple example:

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")

# A tibble: 3 × 4
  logical numeric date       string
  <lgl>     <dbl> <date>     <chr> 
1 TRUE        1   2021-01-15 abc   
2 FALSE       4.5 2021-02-15 def   
3 TRUE      Inf   2021-02-16 ghi

This heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.

Missing values, column types, and problems

The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.

Take this simple 1 column CSV file as an example:

simple_csv <- "
  x
  10
  .
  20
  30"

If we read it without any additional arguments, x becomes a character column:

read_csv(simple_csv)

# A tibble: 4 × 1
  x    
  <chr>
1 10   
2 .    
3 20   
4 30

In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s speckled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:

df <- read_csv(
  simple_csv, 
  col_types = list(x = col_double())
)

Now read_csv() reports that there was a problem, and tells us we can find out more with problems():

problems(df)

# A tibble: 1 × 5
    row   col expected actual file                                              
  <int> <int> <chr>    <chr>  <chr>                                             
1     3     1 a double .      C:/Users/Jamal Kay Rogers/AppData/Local/Temp/Rtmp…

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So then we set na = “.”, the automatic guessing succeeds, giving us the numeric column that we want:

read_csv(simple_csv, na = ".")

# A tibble: 4 × 1
      x
  <dbl>
1    10
2    NA
3    20
4    30

Column types

readr provides a total of nine column types for you to use:

col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.
col_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
col_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.
col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively.
col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies.
col_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.

It’s also possible to override the default column by switching from list() to cols() and specifying .default:

another_csv <- "
x,y,z
1,2,3"

read_csv(
  another_csv, 
  col_types = cols(.default = col_character())
)

# A tibble: 1 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     2     3

Another useful helper is cols_only() which will read in only the columns you specify:

read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)

# A tibble: 1 × 1
  x    
  <chr>
1 1