Load Libraries


library(tidyverse)

Introduction


The first step of working on any data set in RStudio is of course, importing the data set. Due to various ways of collecting, transferring and storing data, there are many different data formats. Regarding tabular data, below are a few commonly used data format:

  • csv (comma separated value, a text file for storing tabular data)
  • xls or xlsx (MS Excel format)
  • json (JavaScript Object Notation)
  • txt (plain text)
  • rds (a binary format to store any R object)

Here, “csv” and Excel files are most commonly used. “json”, “txt” and “rds” can store a wide range of data (not limited to tabular data), but sometimes tabular data might be stored in these formats.

The base R package provides many built-in data import functions, such as read.csv, readxl::read_excel to read csv and excel files. These functions work on data frames. Here we are going to introduce a few functions in readr package of tidyverse, which directly read data into tibbles.

The read_csv function


The read_csv function reads “csv” files in which values are separated by commas. For example

read_csv("x,y,z
         1,2,3
         4,5,6")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Here we read a simple example of manually input data in “csv” format:

  • The whole data is in a string (since “csv” is essentially a text file)
  • In each row, the values are separated by comma
  • The first row is read as the header

If we don’t need a header we can do the following:

read_csv("1,2,3
         4,5,6", col_names = FALSE)
## # A tibble: 2 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

In this situation, columns are labeled to be x1, x2, …

We can also assign column names manually:

read_csv("1,2,3
         4,5,6", col_names = c("x","y","z"))
## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Sometimes, a file may use \n in the text to represent a new line:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Another option that commonly needs tweaking is NA: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,3\n4,5,.", na = ".")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5    NA

Read a file from your disk


To read a file on your disk, you need to input the file path inside the string for read_csv to read it. Note that the path can be relative to your working directory, or an absolute path.

For example, If we read a file directly from the working folder (which can be obtained by getwd() and changed by the setwd() function), we may simply input the file name into read_csv().

getwd()
## [1] "/Users/miao"
setwd("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets")
bank_data <- read_csv("BankChurners.csv")
print(bank_data)
## # A tibble: 10,127 × 20
##    Attrition_Flag    Customer_Age Gender Dependent_count Education_Level
##    <chr>                    <dbl> <chr>            <dbl> <chr>          
##  1 Existing Customer           45 M                    3 High School    
##  2 Existing Customer           49 F                    5 Graduate       
##  3 Existing Customer           51 M                    3 Graduate       
##  4 Existing Customer           40 F                    4 High School    
##  5 Existing Customer           40 M                    3 Uneducated     
##  6 Existing Customer           44 M                    2 Graduate       
##  7 Existing Customer           51 M                    4 Unknown        
##  8 Existing Customer           32 M                    0 High School    
##  9 Existing Customer           37 M                    3 Uneducated     
## 10 Existing Customer           48 M                    2 Graduate       
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## #   Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## #   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## #   Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## #   Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## #   Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>

Equivalently, we can read a csv file by its path. We may use the “Copy Folder Path to Clipboard” in “More” of the Files Panel in RStudio.

bank_data <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/BankChurners.csv")

There is also a function read_csv2 that reads data using ; as the separator and , as the decimal points, which is commonly used in some European countries.

This is all you need to know to read around 75% of “csv” files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv() and fixed width files with read_fwf().

Lab Exercise


read a csv file into a tibble from your disk**.

The read_tsv and read_delim function


To show why we need other functions for data import, we use read_tsv and read_fwf as examples.

The read_tsv function reads “tab separated values”, for example:

read_tsv("a\tb\tc
1\t2\t3
4\t5\t6")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Here \t represents a tab character, which can be usually created in a text file when the “tab” key is pressed on a keyboard.

Generally, there is a function read_delim to handle any specified separator:

read_delim("a|b|c
1|2|3
4|5|6", delim = "|")
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "|"
## dbl (3): a, b, c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

read_fwf function


The read_fwf function read fixed-width files. As below is an example:

fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample))
## John Smith          WA        418-Y11-4111
## Mary Hartford       CA        319-Z19-4341
## Evan Nolan          IL        219-532-c301
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
## # A tibble: 3 × 4
##   first last     state ssn         
##   <chr> <chr>    <chr> <chr>       
## 1 John  Smith    WA    418-Y11-4111
## 2 Mary  Hartford CA    319-Z19-4341
## 3 Evan  Nolan    IL    219-532-c301

So in these files, columns occupy some “fixed-width” in each line, such as “position 1-10” for column 1, “position 5-20” for column 2 etc. (column positions may overlap).

Specifying the width or positions for each column can be a painful job. We may simply call the fwf_empty(<data_name>, <col_name>) to make a guess and assign each column with given names as shown in the example above. For more details, you may refer to the help documentation of read_fwf.


Lab Exercise


  1. Try to use read_csv to read the following text. What do you get?

  2. Try to use read_delim to read the same text. What do you get?

"John Smith          WA        418-Y11-4111
Mary Hartford       CA        319-Z19-4341
Evan Nolan          IL        219-532-c301"

Compared to base R


In base R, we also have the read.csv() function to read CSV files. There are a few good reasons to favour readr functions over the base equivalents:

  • They are typically much faster (~10x) than their base equivalents.

  • They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.

  • They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.

Other data types


There are also many other data types. For example, JSON is a standard data exchange format on the web, which looks like:

json <-
'[
  {"Name" : "Mario", "Age" : 32, "Occupation" : "Plumber"}, 
  {"Name" : "Peach", "Age" : 21, "Occupation" : "Princess"},
  {},
  {"Name" : "Bowser", "Occupation" : "Koopa"}
]'

This stores attribute data about objects in JavaScript, which naturally forms a tabular data set. To read this, we need to install jsonlite package and call the fromJSON function to read it.

library(jsonlite)
mydf <- fromJSON(json)
print(mydf)
##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa

Here is a brief summary of the packages that are commonly used to read different data types:

  • haven reads SPSS, Stata, and SAS files.

  • readxl reads excel files (both .xls and .xlsx).

  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

  • jsonlite reads JSON files; JSON is a standard data exchange format on the web.

  • xml2 reads XML files.

Writing to a file


readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:

  • Always encoding strings in UTF-8.

  • Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

Below is an example:

write_csv(bank_data, 'my_bank_data.csv')

Refresh your working directory in the Files panel and you will see the new file on the disk.

However, there is an issue with using “csv” to store data - all your column type information will be lost. Sometimes this can be a problem for storing interim results. There are two alternatives to resolve the problem.

  1. write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:
write_rds(bank_data, 'my_bank_data.rds')
read_rds('my_bank_data.rds')
## # A tibble: 10,127 × 20
##    Attrition_Flag    Customer_Age Gender Dependent_count Education_Level
##    <chr>                    <dbl> <chr>            <dbl> <chr>          
##  1 Existing Customer           45 M                    3 High School    
##  2 Existing Customer           49 F                    5 Graduate       
##  3 Existing Customer           51 M                    3 Graduate       
##  4 Existing Customer           40 F                    4 High School    
##  5 Existing Customer           40 M                    3 Uneducated     
##  6 Existing Customer           44 M                    2 Graduate       
##  7 Existing Customer           51 M                    4 Unknown        
##  8 Existing Customer           32 M                    0 High School    
##  9 Existing Customer           37 M                    3 Uneducated     
## 10 Existing Customer           48 M                    2 Graduate       
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## #   Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## #   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## #   Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## #   Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## #   Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>
  1. The feather package implements a fast binary file format that can be shared across programming languages:
library(feather)
write_feather(bank_data, 'my_bank_data.feather')
read_feather('my_bank_data.feather')
## # A tibble: 10,127 × 20
##    Attrition_Flag    Customer_Age Gender Dependent_count Education_Level
##    <chr>                    <dbl> <chr>            <dbl> <chr>          
##  1 Existing Customer           45 M                    3 High School    
##  2 Existing Customer           49 F                    5 Graduate       
##  3 Existing Customer           51 M                    3 Graduate       
##  4 Existing Customer           40 F                    4 High School    
##  5 Existing Customer           40 M                    3 Uneducated     
##  6 Existing Customer           44 M                    2 Graduate       
##  7 Existing Customer           51 M                    4 Unknown        
##  8 Existing Customer           32 M                    0 High School    
##  9 Existing Customer           37 M                    3 Uneducated     
## 10 Existing Customer           48 M                    2 Graduate       
## # ℹ 10,117 more rows
## # ℹ 15 more variables: Marital_Status <chr>, Income_Category <chr>,
## #   Card_Category <chr>, Months_on_book <dbl>, Total_Relationship_Count <dbl>,
## #   Months_Inactive_12_mon <dbl>, Contacts_Count_12_mon <dbl>,
## #   Credit_Limit <dbl>, Total_Revolving_Bal <dbl>, Avg_Open_To_Buy <dbl>,
## #   Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>,
## #   Total_Ct_Chng_Q4_Q1 <dbl>, Avg_Utilization_Ratio <dbl>

Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you’ll learn about later); feather currently does not.

Lab Exercise


Divide the bank customer data into two parts - data for female customers only and for male customers only. Save them into two different files “female_bank_data.rds” and “male_bank_data.rds”.

Parsing functions


In most cases, when we use functions in readr to import data, it already makes reasonable guesses to the type of each column. For example:

read_csv("x, y, z
a, T, 2.0
b, F, 4.0")
## # A tibble: 2 × 3
##   x     y         z
##   <chr> <lgl> <dbl>
## 1 a     TRUE      2
## 2 b     FALSE     4

To understand how this works, we are going to learn a few parsing functions. These functions can be useful by themselves, so we will take a detour to learn them.

Parsing function are functions such as parse_integer(), parse_number(), and other functions in the form of parse_xyz() where xyz is some data type. These functions take a character vector and return a more specialised vector like a logical, integer, or date:

str(parse_logical(c("TRUE", "FALSE", "NA")))
##  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
##  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
##  Date[1:2], format: "2010-01-01" "1979-10-14"

Here str function returns the structure of an R object to show its data type.

There are eight particularly useful parsers:

  • parse_logical() and parse_integer() parse logicals and integers respectively.

  • parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser.

  • parse_character() handles character encodings.

  • parse_factor() create factors, the data structure that R uses to represent categorical variables with fixed and known values.

  • parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications.

Examples


For any parse_xyz() function, the first argument is a character vector to be parsed. An na argument can be added to specify which strings should be treated as missing.

parse_integer(c("1", "231", ".", "456"), na = ".")
## [1]   1 231  NA 456

If parsing fails, you receive a warning message.

parse_integer(c("1", "231", "abc", "456"), na = ".")
## Warning: 1 parsing failure.
## row col               expected actual
##   3  -- no trailing characters    abc
## [1]   1 231  NA 456
## attr(,"problems")
## # A tibble: 1 × 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA no trailing characters abc

Numbers


It seems like it should be straightforward to parse a number, but three problems make it tricky:

  • People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

  • Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

  • Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

To address the first problem, we may use the argument locale to override the default locale specification (US-centric that we are using). A most important case is to override the value of . by creating a new locale and setting the decimal_mark argument:

parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23

The second problem is addressed by parse_number(): it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45

The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:

# Used in America
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789

Strings


It seems like parse_character() should be really unnecessary since the inputs are already characters. Unfortunately life isn’t so simple, as there are multiple ways to represent the same string - different encoding systems.

To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

charToRaw("Hadley")
## [1] 48 61 64 6c 65 79

Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII that is used to represent English characters.

Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters - Latin1 for western European languages, Shift-JIS for Japanese, BIG5 for Chinese etc.

Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji).

readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. When the string data are from older systems, then we need to specify the locale in parse_character to make it work properly.

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

print(x1)
## [1] "El Ni\xf1o was particularly bad this year"
print(x2)
## [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "こんにちは"

Factors and Dates/Time


There are also parse_factor() and parse_datetime(), parse_date(), parse_time() to parse factors and dates/time from a vector. We will learn more details when we study factors and dates/times data.

Parsing a file


At last we will briefly introduce how readr parse a file that stores some tabular data. readr automatically guesses the type of each column using a heuristic approach. it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column.

The heuristic tries each of the following types, stopping when it finds a match:

  • logical: contains only “F”, “T”, “FALSE”, or “TRUE”.

  • integer: contains only numeric characters (and -).

  • double: contains only valid doubles (including numbers like 4.5e-5).

  • number: contains valid doubles with the grouping mark inside.

  • time: matches the default time_format.

  • date: matches the default date_format.

  • date-time: any ISO8601 date.

Lab Homework


  1. Finish all lab exercises.

Submit your answer in a single pdf or html knitted from a R markdown file. Submit your R markdown file as well.