Data import with readr
One of the best packages for importing data is readr. First, install it.
install.packages("readr")
library(readr)
readr has several functions for reading different types of files. Here, I am concerned with reading CSV files, for which we use read_csv(). The first argument to this function is the path to the target file.
Some helpful arguments
- skip and comment
Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.
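As a minimal sketch of both arguments, the example below reads small inline CSV strings instead of files. The metadata lines and column names are made up for illustration.

```r
library(readr)

# Two lines of metadata precede the real header, so skip them.
with_metadata <- "The first line of metadata\nThe second line of metadata\nx,y,z\n1,2,3"
skipped <- read_csv(with_metadata, skip = 2)

# Lines starting with "#" are dropped before parsing.
with_comment <- "# A comment I want to skip\nx,y,z\n1,2,3"
commented <- read_csv(with_comment, comment = "#")

print(skipped)
print(commented)
```

Both calls should return the same one-row table with columns x, y, and z.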
Col names
The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label the columns sequentially from X1 to Xn. Alternatively, you can pass col_names a character vector, which will be used as the column names. For instance:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
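For comparison, here is a short sketch of the col_names = FALSE variant on the same inline data. It only illustrates the automatic X1 to Xn labels.

```r
library(readr)

# No header row in the data: readr generates the names X1, X2, X3.
no_header <- read_csv("1,2,3\n4,5,6", col_names = FALSE)

print(names(no_header))
```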
NA
Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file: read_csv("a,b,c\n1,2,.", na = ".")
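The na argument also accepts several strings at once. The sketch below uses made-up data in which ".", "-", and "NA" all stand for missing values. Note that setting na replaces the default (empty string and "NA"), so "NA" has to be listed again if it should still count as missing.

```r
library(readr)

# Three different markers for missing values in one file.
missing_demo <- read_csv("a,b,c\n1,NA,.\n4,-,6", na = c(".", "-", "NA"))

print(missing_demo)
```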
Parsing a vector
The parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date.
Many computing methods (e.g., machine learning and NLP) work best on uniform, flat data, so the input should be brought into one consistent format before it is analyzed in R. The same holds for data pre-processing. Such methods, in turn, produce uniform outputs.
The parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing: parse_integer(c("1", "231", ".", "456"), na = ".")
- If parsing fails, you’ll get a warning. You can use problems() to get the complete set of failures. This returns a tibble, which you can then manipulate with dplyr.
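A small sketch of a failed parse and how to inspect it. The input vector is made up: "abc" is not a number and "123.45" is not an integer, so both become NA.

```r
library(readr)

# parse_integer() warns about the two values it cannot parse.
x <- parse_integer(c("123", "345", "abc", "123.45"))

# problems() returns one row per failure (row, col, expected, actual).
probs <- problems(x)

print(x)
print(probs)
```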
Parse numbers
Three problems make parsing numbers tricky:
- People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.
Solution: To address the first problem, readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of . by creating a new locale and setting the decimal_mark argument: parse_double("1,23", locale = locale(decimal_mark = ","))
Note: parse_number() accepts the same locale argument, so it works in this case as well.
- Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
Solution: parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
- Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.
Solution: The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”: parse_number("123'456'789", locale = locale(grouping_mark = "'"))
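The three solutions above can be tried side by side. This is a sketch with made-up input strings. The only functions used are parse_double(), parse_number(), and locale().

```r
library(readr)

# Problem 1: decimal mark differs by region.
decimal_comma <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Problem 2: numbers surrounded by other characters.
dollars <- parse_number("$100")
percent <- parse_number("20%")
in_text <- parse_number("It cost $123.45")

# Problem 3: grouping marks, here "." as used in much of Europe.
grouped <- parse_number("123.456.789", locale = locale(grouping_mark = "."))

print(c(decimal_comma, dollars, percent, in_text, grouped))
```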
Parse strings
Sometimes the encoding of a string differs from the system’s encoding. In such cases, you need to specify the encoding in parse_character(), where x1 is the character vector to parse: parse_character(x1, locale = locale(encoding = "Latin1"))
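As an illustration, the sketch below builds a Latin1-encoded string by hand (the byte \xf1 is "ñ" in Latin1) and re-encodes it while parsing. The sample sentence is an assumption chosen for the example.

```r
library(readr)

# A string whose bytes are Latin1, not UTF-8.
x1 <- "El Ni\xf1o was particularly bad this year"

# Telling readr the source encoding yields a proper UTF-8 string.
fixed <- parse_character(x1, locale = locale(encoding = "Latin1"))

print(fixed)
```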
Dates, date-times, and times
- parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
- parse_date() expects a four digit year, a - or /, the month, a - or /, then the day.
- parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier.
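A short sketch of the three parsers on inputs that already match their default formats. The values are arbitrary examples.

```r
library(readr)

# ISO8601 date-time; the result is in UTC by default.
dt <- parse_datetime("2010-10-01T2010")

# Four digit year, month, day.
d <- parse_date("2010-10-01")

# Hours and minutes with an am/pm specifier, then with seconds.
t_am <- parse_time("01:10 am")
t_sec <- parse_time("20:10:01")

print(dt)
print(d)
print(t_am)
print(t_sec)
```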
The functions above expect clean input in their default formats. Sometimes the date or time is embedded among other characters, such as words or symbols. In these cases, supply your own format string. Consider "23 شب 03". The time is 23:03, but parse_time() fails to parse it with the default format. Instead, use: parse_time("23 شب 03", "%H شب %M"). The format codes below help build such strings.
Year
%Y (4 digits).
%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
Month
%m (2 digits).
%b (abbreviated name, like “Jan”).
%B (full name, “January”).
Day
%d (2 digits).
%e (optional leading space).
Time
%H 0-23 hour.
%I 0-12, must be used with %p.
%p AM/PM indicator.
%M minutes.
%S integer seconds.
%OS real seconds.
%Z Time zone (as name, e.g. America/Chicago). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight saving time. It is not Eastern Standard Time!
%z (as offset from UTC, e.g. +0800).
Non-digits
%. skips one non-digit character.
%* skips any number of non-digits.
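The codes above can be combined into format strings. The sketch below shows how the same text parses to different dates under different formats, plus a few cases with literal or skipped characters. All input strings are made up for illustration.

```r
library(readr)

# The same string read with three different formats.
mdy <- parse_date("01/02/15", "%m/%d/%y")  # 2015-01-02
dmy <- parse_date("01/02/15", "%d/%m/%y")  # 2015-02-01
ymd <- parse_date("01/02/15", "%y/%m/%d")  # 2001-02-15

# Month names in another language need a matching locale.
french <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

# Literal characters in the format must match the input exactly.
literal <- parse_time("23h03", "%Hh%M")

# %* skips the non-digit prefix before the date.
prefixed <- parse_date("Date: 2015-01-02", "%*%Y-%m-%d")

print(c(mdy, dmy, ymd, french, prefixed))
print(literal)
```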
Parsing a file
readr reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column. Two problems can arise:
The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.
The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a logical vector, whereas you probably want to parse it as something more specific.
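The guessing step described above can be tried directly on small character vectors. This is a sketch with arbitrary inputs, using only guess_parser() and parse_guess().

```r
library(readr)

# guess_parser() names the parser readr would pick.
guess_date <- guess_parser("2010-10-01")
guess_time <- guess_parser("15:01")
guess_lgl <- guess_parser(c("TRUE", "FALSE"))
guess_num <- guess_parser("12,352,561")

# parse_guess() applies the guessed parser.
parsed <- parse_guess("2010-10-10")

print(c(guess_date, guess_time, guess_lgl, guess_num))
print(class(parsed))
```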
Solution: A good strategy is to work column by column until there are no problems remaining. To do so, specify the column types manually with col_types in the reading code:
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
- Sometimes it’s easier to diagnose problems if you just read in all the columns as character vectors.
challenge2 <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(.default = col_character())
)
- We can look at more rows than the default by setting the guess_max argument.
Every parse_xyz() function has a corresponding col_xyz() function. You use parse_xyz() when the data is in a character vector in R already; you use col_xyz() when you want to tell readr how to load the data.
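To make the parse_xyz() and col_xyz() pairing concrete, here is a sketch that parses the same values both ways. The inline CSV text and its column names are hypothetical.

```r
library(readr)

csv_text <- "x,y\n1,2010-01-01\n2.5,2010-01-02"

# col_*() describes the column types while loading.
typed <- read_csv(
  csv_text,
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

# parse_*() converts character vectors that are already in R.
parsed_x <- parse_double(c("1", "2.5"))
parsed_y <- parse_date(c("2010-01-01", "2010-01-02"))

print(typed)
```

Both routes should give the same values: typed$x matches parsed_x, and typed$y matches parsed_y.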
Writing to a file
write_csv() and write_tsv() are two functions for exporting data. The most important arguments are x (the data frame to save) and file (the location to save it; this argument was called path before readr 1.4.0). You can also specify how missing values are written with na, and whether to append to an existing file with append.
write_csv(challenge, "challenge.csv")
Note that the type information is lost when you save to csv.
Solution: write_rds() and read_rds() are uniform wrappers around the base functions saveRDS() and readRDS(). These store data in R’s custom binary format called RDS.
- The feather package implements a fast binary file format that can be shared across programming languages.
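The loss of type information can be seen in a small round trip. The sketch below uses a made-up data frame with a factor column and temporary files: after CSV the factor comes back as a plain character vector, while RDS preserves it.

```r
library(readr)

# A tiny data frame with a factor column (hypothetical data).
df <- data.frame(
  id = c(1L, 2L),
  group = factor(c("a", "b"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Round trip through CSV: column types are guessed again on reading.
write_csv(df, csv_path)
from_csv <- read_csv(csv_path)

# Round trip through RDS: the object is restored as it was saved.
write_rds(df, rds_path)
from_rds <- read_rds(rds_path)

print(class(from_csv$group))
print(class(from_rds$group))
```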
Other types of data
There are many other types of files. I do not cover how R reads them here, but the two most important packages are:
haven: reads SPSS, Stata, and SAS files.
readxl: reads Excel files (both .xls and .xlsx). writexl can be used to write Excel files.
The end.