Data import with readr

One of the best packages for importing data is readr. First, install and load it.

install.packages("readr")
library(readr)

readr has several functions for reading different types of files. Here, I am concerned with reading CSV files. To do this, use read_csv(). The first argument to this function is the path to the target file.
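As a minimal sketch of the call described above, the snippet below writes a tiny CSV to a temporary file and reads it back. The file contents and the names path and heights are made up for illustration.

```r
library(readr)

# Create a small CSV file in a temporary location (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ada,1.65", "Lin,1.80"), path)

# The first argument to read_csv() is the path to the file.
heights <- read_csv(path)
heights
```

read_csv() returns a tibble and prints a message listing the column types it guessed.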

Some helpful arguments

  • skip and comment

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.
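A short sketch of both arguments, using inline CSV text in place of a file (a string that contains a newline is treated as literal data). The metadata lines are invented for illustration.

```r
library(readr)

# Skip two lines of metadata that precede the header row.
skipped <- read_csv(
  "The first line of metadata\nThe second line of metadata\nx,y,z\n1,2,3",
  skip = 2
)

# Drop every line that starts with "#".
commented <- read_csv("# A comment I want to skip\nx,y,z\n1,2,3", comment = "#")
```

Both calls should yield the same one-row table with columns x, y, and z.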

  • Column names

    • The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn.

    • Alternatively you can pass col_names a character vector which will be used as the column names. For instance: read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

  • NA

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file: read_csv("a,b,c\n1,2,.", na = ".")
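The column-name and missing-value options above can be tried side by side on inline data. This is only a sketch; the object names no_header, named, and with_na are arbitrary.

```r
library(readr)

# No header row: columns are labelled X1, X2, X3.
no_header <- read_csv("1,2,3\n4,5,6", col_names = FALSE)
names(no_header)

# Supply the column names yourself.
named <- read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
names(named)

# Treat "." as a missing value.
with_na <- read_csv("a,b,c\n1,2,.", na = ".")
with_na$c
```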

Parsing a vector

The parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date.

A recurring goal in computing (e.g., machine learning and NLP) is to make the input uniform, because algorithms work best on uniform, flat data. Data should therefore be in a consistent format before it is analyzed in R, and the same holds for preprocessing. In turn, such methods produce uniform outputs.

The parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing: parse_integer(c("1", "231", ".", "456"), na = ".")
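To make the shared interface concrete, here is a small sketch that runs three parsers on hand-written character vectors. The variable names are arbitrary.

```r
library(readr)

# Each parser takes a character vector and returns a typed vector.
flags <- parse_logical(c("TRUE", "FALSE", "NA"))
ints  <- parse_integer(c("1", "231", ".", "456"), na = ".")
dates <- parse_date(c("2010-01-01", "1979-10-14"))

str(flags)
str(ints)
str(dates)
```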

Parse numbers

Three problems make parsing numbers tricky:

  1. People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

Solution: readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument: parse_double("1,23", locale = locale(decimal_mark = ","))

Note: parse_number() works in this case as well, given the same locale.

  2. Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

Solution: parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

  3. Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

Solution: The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”: parse_number("123'456'789", locale = locale(grouping_mark = "'"))
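The three solutions can be checked together in one short sketch. The input strings are illustrative, and the variable names are arbitrary.

```r
library(readr)

# Problem 1: a comma as the decimal mark.
decimal_comma <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Problem 2: surrounding characters are ignored.
dollars <- parse_number("$100")
percent <- parse_number("20%")
in_text <- parse_number("It cost $123.45")

# Problem 3: grouping marks, which differ between regions.
swiss    <- parse_number("123'456'789", locale = locale(grouping_mark = "'"))
european <- parse_number("123.456.789", locale = locale(grouping_mark = "."))

c(decimal_comma, dollars, percent, in_text, swiss, european)
```

Setting grouping_mark = "." makes locale() switch the decimal mark to "," so the two marks never collide.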

Parse strings

Sometimes the encoding of a string differs from the system’s encoding. In such cases, you need to specify the encoding in parse_character(). For a character vector x1 stored as Latin1: parse_character(x1, locale = locale(encoding = "Latin1"))
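The text does not define x1, so the sketch below assumes a Latin1-encoded string in which the byte \xf1 stands for the letter ñ. It also uses guess_encoding(), which readr provides for cases where the encoding is unknown.

```r
library(readr)

# Assumed example input: "El Niño ..." with ñ stored as the Latin1 byte 0xf1.
x1 <- "El Ni\xf1o was particularly bad this year"

# Tell readr how the bytes are encoded; the result is returned as UTF-8.
fixed <- parse_character(x1, locale = locale(encoding = "Latin1"))
fixed

# If the encoding is unknown, readr can suggest likely candidates.
guess_encoding(charToRaw(x1))
```

The guesses are heuristic and tend to be more reliable on longer text.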

Dates, date-times, and times

  • parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

  • parse_date() expects a four digit year, a - or /, the month, a - or /, then the day.

  • parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier.
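A quick sketch of the three parsers with their default formats. The input values are arbitrary.

```r
library(readr)

# ISO8601 date-time (date and time separated by "T").
dt <- parse_datetime("2010-10-01T2010")

# Four-digit year, then month, then day.
d <- parse_date("2010-10-01")

# Hours and minutes, with optional seconds and am/pm.
t1 <- parse_time("01:10 am")
t2 <- parse_time("20:10:01")

dt
d
t1
t2
```

The time values are returned as hms objects, which store the number of seconds since midnight.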

The functions above expect input in a standard layout. Sometimes the date or time is embedded among other characters, such as words or symbols. In these cases, supply your own format string. Consider "23 شب 03" (شب is Persian for “night”). The time is 23:03, but parse_time() on its own fails to parse it. Instead, pass a format: parse_time("23 شب 03", "%H شب %M"). The format specifiers below are the building blocks for such strings.

Year

%Y (4 digits). 
%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999. 

Month

%m (2 digits). 
%b (abbreviated name, like “Jan”). 
%B (full name, “January”). 

Day

%d (2 digits). 
%e (optional leading space). 

Time

%H 0-23 hour. 
%I 0-12, must be used with %p. 
%p AM/PM indicator. 
%M minutes. 
%S integer seconds. 
%OS real seconds. 
%Z Time zone (as name, e.g. America/Chicago). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight saving time. It is not Eastern Standard Time! 
%z (as offset from UTC, e.g. +0800). 

Non-digits

%. skips one non-digit character. 
%* skips any number of non-digits. 
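The specifiers above combine into format strings. The sketch below shows how the same input parses to three different dates depending on the format, and how literal text in the format is matched exactly. The inputs are invented for illustration.

```r
library(readr)

# The same string read as month/day/year, day/month/year, and year/month/day.
mdy <- parse_date("01/02/15", "%m/%d/%y")
dmy <- parse_date("01/02/15", "%d/%m/%y")
ymd <- parse_date("01/02/15", "%y/%m/%d")

# Full month name.
named_month <- parse_date("1 January 2015", "%d %B %Y")

# Literal text between the components must appear in the input as written.
embedded <- parse_time("23 hours 03", "%H hours %M")

c(mdy, dmy, ymd, named_month)
embedded
```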

Parsing a file

readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to pick a type. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column. Two problems can occur:

  1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.

  2. The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a logical vector, whereas you probably want to parse it as something more specific.

Solution: A good strategy is to work column by column until there are no problems remaining. To do that, specify the column types manually in the call to read_csv(), or read every column as character first and inspect the values.

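# Specify the type of each column explicitly.
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

# Or read every column as character first and inspect the values.
challenge2 <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(.default = col_character())
)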

Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.
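The guessing step and the parse/col correspondence can both be tried on small inputs. This is a sketch: the inline CSV and the name typed are made up for illustration.

```r
library(readr)

# Ask readr which parser it would choose for a character vector.
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser("12,352,561")

# parse_guess() applies the guessed parser.
guessed <- parse_guess("2010-10-10")
class(guessed)

# The col_*() counterparts do the same job while a file is being read.
typed <- read_csv(
  "x,y\n1,2020-01-01\n2.5,2020-01-02",
  col_types = cols(x = col_double(), y = col_date())
)
str(typed)
```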

Writing to a file

write_csv() and write_tsv() are two functions for writing data back to disk. The most important arguments are x (the data frame to save) and path (the location to save it; recent versions of readr name this argument file). You can also specify how missing values are written with na, and whether to append to an existing file with append.

write_csv(challenge, "challenge.csv")

Note that the type information is lost when you save to csv.

Solution: write_rds() and read_rds() are uniform wrappers around the base functions saveRDS() and readRDS(). These store data in R’s custom binary format called RDS.
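The difference between the two formats shows up in a round trip. The sketch below uses a made-up data frame with a factor column and writes to temporary files, passing the output location by position so that it does not depend on the argument name.

```r
library(readr)

# Hypothetical data with a factor column.
df <- data.frame(id = c(1, 2), grp = factor(c("a", "b")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV stores plain text, so the factor comes back as character.
write_csv(df, csv_path)
from_csv <- read_csv(csv_path)
class(from_csv$grp)

# RDS keeps the R types intact.
write_rds(df, rds_path)
from_rds <- read_rds(rds_path)
class(from_rds$grp)
```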

Other types of data

There are many other types of files. I do not cover how R reads them here, but these are the two most important packages:

haven: reads SPSS, Stata, and SAS files.

readxl: reads Excel files (both .xls and .xlsx). writexl can be used to write Excel files.

The end.