library(tidyverse)
The first step of working on any data set in RStudio is, of course, importing the data set. Because data are collected, transferred, and stored in various ways, there are many different data formats. For tabular data, a few commonly used formats are csv, Excel (.xls/.xlsx), json, txt, and rds.
Here, “csv” and Excel files are most commonly used. “json”, “txt” and “rds” can store a wide range of data (not limited to tabular data), but sometimes tabular data might be stored in these formats.
Base R provides built-in data import functions such as read.csv to read csv files, and the readxl package provides read_excel for Excel files. These functions return data frames. Here we are going to introduce a few functions from the readr package of the tidyverse, which read data directly into tibbles.
read_csv function

The read_csv function reads "csv" files, in which values are separated by commas. For example:
read_csv("x,y,z
1,2,3
4,5,6")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Here we read a simple example of manually entered data in "csv" format. If we don't need a header, we can do the following:
read_csv("1,2,3
4,5,6", col_names = FALSE)
## # A tibble: 2 × 3
## X1 X2 X3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
In this situation, columns are automatically labeled X1, X2, …
We can also assign column names manually:
read_csv("1,2,3
4,5,6", col_names = c("x","y","z"))
## # A tibble: 2 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Sometimes, a file may use \n in the text to represent a new line:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 × 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Another option that commonly needs tweaking is na: it specifies the value (or values) used to represent missing values in your file:
read_csv("a,b,c\n1,2,3\n4,5,.", na = ".")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 NA
To read a file on your disk, you pass the file path as the string for read_csv to read it. Note that the path can be relative to your working directory, or an absolute path. For example, if we read a file directly from the working folder (which can be obtained by getwd() and changed by the setwd() function), we may simply pass the file name to read_csv().
getwd()
## [1] "/Users/miao"
setwd("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets")
bank_data <- read_csv("BankChurners.csv")
print(bank_data)
## # A tibble: 10,127 × 20
## Attrition_Flag Customer_Age Gender Dependent_count Education_Level
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Existing Customer 45 M 3 High School
## 2 Existing Customer 49 F 5 Graduate
## 3 Existing Customer 51 M 3 Graduate
## 4 Existing Customer 40 F 4 High School
## 5 Existing Customer 40 M 3 Uneducated
## 6 Existing Customer 44 M 2 Graduate
## 7 Existing Customer 51 M 4 Unknown
## 8 Existing Customer 32 M 0 High School
## 9 Existing Customer 37 M 3 Uneducated
## 10 Existing Customer 48 M 2 Graduate
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## # Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## # Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## # Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## # Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## # Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
Equivalently, we can read a csv file by its full path. The "Copy Folder Path to Clipboard" option under "More" in the Files panel of RStudio is a convenient way to get this path.
bank_data <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/BankChurners.csv")
There is also a function read_csv2 that reads data using ";" as the separator and "," as the decimal mark, a convention common in some European countries.
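As a quick illustration, a minimal sketch with made-up semicolon-separated values:
read_csv2("x;y\n1,5;2,5\n3,0;4,5")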
This is all you need to know to read around 75% of the "csv" files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with read_tsv() and fixed width files with read_fwf().
Try to read a csv file on your disk into a tibble.
read_tsv and read_delim functions

To show why we need other functions for data import, we use read_tsv and read_fwf as examples. The read_tsv function reads "tab separated values", for example:
read_tsv("a\tb\tc
1\t2\t3
4\t5\t6")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (3): a, b, c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
Here \t represents a tab character, which is usually created in a text file when the "tab" key is pressed on the keyboard. More generally, there is a function read_delim to handle any specified separator:
read_delim("a|b|c
1|2|3
4|5|6", delim = "|")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "|"
## dbl (3): a, b, c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
read_fwf function

The read_fwf function reads fixed-width files. Below is an example:
fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample))
## John Smith WA 418-Y11-4111
## Mary Hartford CA 319-Z19-4341
## Evan Nolan IL 219-532-c301
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
## # A tibble: 3 × 4
## first last state ssn
## <chr> <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
So in these files, each column occupies a fixed range of character positions in every line, such as "positions 1-10" for column 1, "positions 5-20" for column 2, etc. (column positions may overlap).
Specifying the width or positions for each column can be a painful job. We may simply call fwf_empty(<file>, col_names = <names>) to guess the column positions from the empty space and assign the given names, as shown in the example above. For more details, you may refer to the help documentation of read_fwf.
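Alternatively, we can supply the widths ourselves with fwf_widths(). Below is a sketch for the same sample file; the widths 20, 10 and 12 are a guess at its layout, and the column names are our own choice:
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))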
Try to use read_csv to read the following text. What do you get? Then try to use read_delim to read the same text. What do you get?
"John Smith WA 418-Y11-4111
Mary Hartford CA 319-Z19-4341
Evan Nolan IL 219-532-c301"
In base R, we also have the read.csv() function to read CSV files. There are a few good reasons to favour readr functions over the base equivalents (a short comparison sketch follows this list):
They are typically much faster (~10x) than their base equivalents.
They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
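A small comparison sketch, assuming the working directory still contains the BankChurners.csv file used above:
# base R: read.csv() returns a plain data.frame
df_base <- read.csv("BankChurners.csv")
class(df_base)
# readr: read_csv() returns a tibble (a data frame subclass)
df_readr <- read_csv("BankChurners.csv")
class(df_readr)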
There are also many other data types. For example, JSON is a standard data exchange format on the web, which looks like:
json <-
'[
{"Name" : "Mario", "Age" : 32, "Occupation" : "Plumber"},
{"Name" : "Peach", "Age" : 21, "Occupation" : "Princess"},
{},
{"Name" : "Bowser", "Occupation" : "Koopa"}
]'
This stores attribute data about objects in JavaScript notation, which naturally forms a tabular data set. To read it, we need to install the jsonlite package and call its fromJSON function.
library(jsonlite)
mydf <- fromJSON(json)
print(mydf)
## Name Age Occupation
## 1 Mario 32 Plumber
## 2 Peach 21 Princess
## 3 <NA> NA <NA>
## 4 Bowser NA Koopa
Here is a brief summary of the packages that are commonly used to read different data types:
haven reads SPSS, Stata, and SAS files.
readxl reads Excel files (both .xls and .xlsx).
DBI, along with a database-specific backend (e.g. RMySQL, RSQLite, RPostgreSQL), allows you to run SQL queries against a database and return a data frame.
jsonlite reads JSON files; JSON is a standard data exchange format on the web.
xml2 reads XML files.
readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:
Always encoding strings in UTF-8.
Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
Below is an example:
write_csv(bank_data, 'my_bank_data.csv')
Refresh your working directory in the Files panel and you will see the new file on the disk.
However, there is an issue with using "csv" to store data - all your column type information will be lost, as the short sketch below illustrates. Sometimes this can be a problem for storing interim results. There are two alternatives that resolve the problem.
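Before turning to the alternatives, here is a minimal sketch of the issue, using a small made-up tibble with a factor column (the file name is arbitrary):
# write a factor column to csv ...
df <- tibble(x = factor(c("a", "b", "a")))
write_csv(df, 'type_demo.csv')
# ... it comes back as a character column, not a factor
read_csv('type_demo.csv')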
write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R's custom binary format called RDS:
write_rds(bank_data, 'my_bank_data.rds')
read_rds('my_bank_data.rds')
## # A tibble: 10,127 × 20
## Attrition_Flag Customer_Age Gender Dependent_count Education_Level
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Existing Customer 45 M 3 High School
## 2 Existing Customer 49 F 5 Graduate
## 3 Existing Customer 51 M 3 Graduate
## 4 Existing Customer 40 F 4 High School
## 5 Existing Customer 40 M 3 Uneducated
## 6 Existing Customer 44 M 2 Graduate
## 7 Existing Customer 51 M 4 Unknown
## 8 Existing Customer 32 M 0 High School
## 9 Existing Customer 37 M 3 Uneducated
## 10 Existing Customer 48 M 2 Graduate
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## # Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## # Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## # Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## # Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## # Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
The feather package implements a fast binary file format that can be shared across programming languages:
library(feather)
write_feather(bank_data, 'my_bank_data.feather')
read_feather('my_bank_data.feather')
## # A tibble: 10,127 × 20
## Attrition_Flag Customer_Age Gender Dependent_count Education_Level
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Existing Customer 45 M 3 High School
## 2 Existing Customer 49 F 5 Graduate
## 3 Existing Customer 51 M 3 Graduate
## 4 Existing Customer 40 F 4 High School
## 5 Existing Customer 40 M 3 Uneducated
## 6 Existing Customer 44 M 2 Graduate
## 7 Existing Customer 51 M 4 Unknown
## 8 Existing Customer 32 M 0 High School
## 9 Existing Customer 37 M 3 Uneducated
## 10 Existing Customer 48 M 2 Graduate
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## # Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## # Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## # Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## # Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## # Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you’ll learn about later); feather currently does not.
Divide the bank customer data into two parts - data for female customers only and for male customers only. Save them into two different files “female_bank_data.rds” and “male_bank_data.rds”.
In most cases, when we use functions in readr to import data, readr already makes reasonable guesses about the type of each column. For example:
read_csv("x, y, z
a, T, 2.0
b, F, 4.0")
## # A tibble: 2 × 3
## x y z
## <chr> <lgl> <dbl>
## 1 a TRUE 2
## 2 b FALSE 4
To understand how this works, we are going to learn a few parsing functions. These functions can be useful by themselves, so we will take a detour to learn them.
Parsing functions are functions such as parse_integer(), parse_number(), and other functions of the form parse_xyz(), where xyz is some data type. These functions take a character vector and return a more specialised vector such as a logical, integer, or date:
str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
## int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
Here the str function returns the structure of an R object, showing its data type.
There are eight particularly useful parsers:
parse_logical() and parse_integer() parse logicals and integers respectively.
parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser.
parse_character() handles character encodings.
parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values (see the sketch after this list).
parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications.
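For instance, parse_factor() turns a character vector into a factor with the given levels (the fruit values here are made up):
parse_factor(c("apple", "banana", "apple"), levels = c("apple", "banana"))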
For any parse_xyz() function, the first argument is a character vector to be parsed. An na argument can be added to specify which strings should be treated as missing.
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1] 1 231 NA 456
If parsing fails, you receive a warning message.
parse_integer(c("1", "231", "abc", "456"), na = ".")
## Warning: 1 parsing failure.
## row col expected actual
## 3 -- no trailing characters abc
## [1] 1 231 NA 456
## attr(,"problems")
## # A tibble: 1 × 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA no trailing characters abc
It seems like it should be straightforward to parse a number, but three problems make it tricky:
People write numbers differently in different parts of the world. For example, some countries use "." between the integer and fractional parts of a real number, while others use ",".
Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.
To address the first problem, we may use the locale argument to override the default locale specification (which is US-centric). The most important case is to override the decimal mark by creating a new locale and setting its decimal_mark argument:
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
The second problem is addressed by parse_number(): it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but it also works to extract numbers embedded in text.
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
The final problem is addressed by the combination of parse_number() and the locale, as parse_number() will ignore the "grouping mark":
# Used in America
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789
It seems like parse_character() should be unnecessary, since the inputs are already characters. Unfortunately life isn't so simple, as there are multiple ways to represent the same string - different encoding systems.
To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():
charToRaw("Hadley")
## [1] 48 61 64 6c 65 79
Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding; in this case the encoding is ASCII, which is used to represent English characters.
Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters - Latin1 for western European languages, Shift-JIS for Japanese, BIG5 for Chinese etc.
Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when reading, and always uses UTF-8 when writing. When string data come from older systems, we need to specify the encoding via the locale argument of parse_character to make it work properly.
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
print(x1)
## [1] "El Ni\xf1o was particularly bad this year"
print(x2)
## [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "こんにちは"
There are also parse_factor(), parse_datetime(), parse_date(), and parse_time() to parse factors and dates/times from a character vector. We will learn more details when we study factors and dates/times data.
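As a quick taste, with made-up values (the format options are discussed later):
parse_datetime("2010-10-01T2010")   # an ISO8601 date-time
parse_time("20:10:01")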
Finally, we briefly introduce how readr parses a file that stores tabular data. readr automatically guesses the type of each column using a heuristic approach: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column.
The heuristic tries each of the following types, stopping when it finds a match:
logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
integer: contains only numeric characters (and -).
double: contains only valid doubles (including numbers like 4.5e-5).
number: contains valid doubles with the grouping mark inside.
time: matches the default time_format.
date: matches the default date_format.
date-time: any ISO8601 date.
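We can see these heuristics in action with readr's guess_parser() function, which reports the type readr would guess for a given character vector:
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser("12,352,561")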
Submit your answer in a single pdf or html file knitted from an R Markdown file. Submit your R Markdown file as well.