R Tutorial 3: Importing Data

MKT 410: Marketing Analytics

Author

Levin Zhu

Learning Objectives

In this tutorial, we will go over the first step of any data analysis: importing data. Specifically, we will cover:

  • Reading data from a file
  • Controlling column types
  • Writing data to file

Prerequisites

The focus of this tutorial is on the readr package, which is included in tidyverse.

library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading Data from a File

We’ll focus on comma-separated values (CSV) files, which are the most common type of rectangular data file. A typical CSV file has a header row that gives the column names, followed by rows of data. The columns are separated (delimited) by commas.

Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

The same data, displayed as a table, looks like this:

Student ID  Full Name         favourite.food      mealPlan             AGE
1           Sunil Huffmann    Strawberry yoghurt  Lunch only           4
2           Barclay Lynn      French fries        Lunch only           5
3           Jayendra Lyne     N/A                 Breakfast and lunch  7
4           Leon Rossini      Anchovies           Lunch only           NA
5           Chidiegwu Dunkel  Pizza               Breakfast and lunch  five
6           Güvenç Attila     Ice cream           Lunch only           6

The data are available to download from https://pos.it/r4ds-students-csv. I have also included the data under Resources > data in Laulima. We can read CSV files using the function read_csv().

# read from file
students <- read_csv("data/students.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Full Name, favourite.food, mealPlan, AGE
#> dbl (1): Student ID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
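
The message above lists the column types that read_csv() guessed. If you’d rather not see it printed every time, you can set show_col_types = FALSE (or specify the column types yourself, which we cover later in this tutorial):

# read again, suppressing the column specification message
students <- read_csv("data/students.csv", show_col_types = FALSE)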

Data Cleaning

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. We will cover transformations in depth in a later tutorial, but we’ll use a couple of them here.

This is what students looks like when you import it:

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

We’ll notice the entry “N/A” in the favourite.food column. read_csv() does not recognize “N/A” as a missing value by default (it only treats empty strings and the literal string “NA” as missing), so we need to specify the character string "N/A" as a missing value as well.

students <- read_csv("data/students.csv", na = c("N/A", ""))
students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

There are a few other issues that need to be ironed out:

  1. The column names are inconsistently formatted (spaces, dots, and mixed capitalization)
  2. The mealPlan column appears to be a categorical variable (i.e., it has a known set of possible values), but it is currently a character variable (which allows any value)
  3. The AGE column has a “five” in it, which we would like to convert to an integer

Let’s address each in turn.

Cleaning column names

We first see that the columns Student ID and Full Name have spaces in their names, which is why they are surrounded by backticks (`). Column names with spaces are not standard (syntactic) R names, so we would have to wrap them in backticks every time we refer to them. We can fix this using the rename() function:

students %>%
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
#> # A tibble: 6 × 5
#>   student_id full_name        favourite.food     mealPlan            AGE  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Alternatively, we can use janitor::clean_names(), which uses heuristics to turn all column names into snake case at once:

students %>%
  janitor::clean_names()
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Another way to deal with column names: if you already know what the columns are and want to supply your own names, you can use the col_names argument of read_csv() itself. Note that when you provide col_names, read_csv() no longer treats the first row of the file as a header, so we also pass skip = 1 to skip over the existing header row:

colnames <- c(
  "student_id", "full_name", "favourite_food", "meal_plan", "age"
)
read_csv(
  "data/students.csv",
  na = c("N/A", ""),
  col_names = colnames,
  skip = 1  # skip the file's own header row
)

Column types

We noticed that meal_plan has only two values, which suggests it is a categorical variable. To convert the column from a “character” column to a categorical (or “factor”) column, we’ll use the mutate() function, which creates new columns (or replaces existing ones) using existing columns.

students %>%
  janitor::clean_names() %>%
  mutate(meal_plan = factor(meal_plan))
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Here, we replaced the meal_plan column with a new meal_plan column containing the original values converted to a factor (shown as <fct> in the output).

Fixing specific entries

We might also want to fix the “five” entry in the age column. Here, we can use if_else() to replace the value “five” with “5” while keeping all other values in the column the same, and then as.integer() to convert the column to an integer variable.

students <- students %>%
  janitor::clean_names() %>%
  mutate(
    meal_plan = factor(meal_plan),
    age = as.integer(if_else(age == "five", "5", age))
  )

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <int>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

Controlling Column Types

A CSV file doesn’t contain any information about the column types of your data, so readr tries to guess the type of each column. This guessing doesn’t always work, so we will go over some steps to make sure the dataset is read in correctly.

Guessing Column Types

readr uses a heuristic to figure out the column types. It samples 1,000 rows (evenly spaced) from the dataset, ignoring missing values, and works through the following questions:

  • Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical
  • Does it contain only numbers (e.g. 1, -4.5, 5e6, Inf)? If so, it’s a number
  • Does it match the ISO8601 standard? If so, it’s a date or date-time
  • Otherwise, it must be a string

As a simple example:

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
#> Rows: 3 Columns: 4
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): string
#> dbl  (1): numeric
#> lgl  (1): logical
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 4
#>   logical numeric date       string
#>   <lgl>     <dbl> <date>     <chr> 
#> 1 TRUE        1   2021-01-15 abc   
#> 2 FALSE       4.5 2021-02-15 def   
#> 3 TRUE      Inf   2021-02-16 ghi
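
To check how readr would interpret a particular set of values, you can use the helper guess_parser(), which returns the name of the parser the heuristic would pick for a character vector:

guess_parser("TRUE")          # "logical"
guess_parser(c("1", "5e6"))   # "double"
guess_parser("2021-01-15")    # "date"
guess_parser("abc")           # "character"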

Missing Values, Column Types, and Problems

Of course, real-world datasets are rarely as easy to deal with as the example above.

One of the most common ways that column type detection fails is if there is a missing value that is not recorded correctly by readr.

simple_csv <- "
  x
  10
  .
  20
  30
"

Without providing any additional information, x is read in as a character column:

read_csv(simple_csv)
#> # A tibble: 4 × 1
#>   x    
#>   <chr>
#> 1 10   
#> 2 .    
#> 3 20   
#> 4 30

If you have a lot of rows and are expecting a numeric column, but end up with a character column when you read the data, it may not be obvious where the problem lies: how would you find a few missing values represented by “.” sprinkled among thousands of rows?

One way to diagnose this is to tell readr what type you expect the column to be, using the col_types argument in read_csv():

df <- read_csv(
  simple_csv, 
  col_types = list(x = col_double())
)
#> Warning: One or more parsing issues, call `problems()` on your data frame for
#> details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

We now see that there was a problem. We can use problems() to find out more about it:

problems(df)
#> # A tibble: 1 × 5
#>     row   col expected actual file                                           
#>   <int> <int> <chr>    <chr>  <chr>                                          
#> 1     3     1 a double .      /private/var/folders/dy/mpnv63fj3wxgctzgbhch4g…

This tells us that there was a problem in row 3, column 1, where readr expected a double but got a “.”. This suggests the dataset uses “.” for missing values, so we can set na = "." to get the column type we want.

read_csv(simple_csv, na = ".")
#> # A tibble: 4 × 1
#>       x
#>   <dbl>
#> 1    10
#> 2    NA
#> 3    20
#> 4    30

Column Types

There are nine column types that readr allows you to specify in the col_types argument:

  • col_logical(): true/false columns
  • col_double(): real numbers
  • col_integer(): integers
  • col_character(): strings
  • col_factor(): factors (categories)
  • col_date(): dates
  • col_datetime(): date-times
  • col_number(): a numeric parser that ignores non-numeric components (e.g. useful for currencies)
  • col_skip(): skips a column so it’s not included in the result
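
As an illustration, here is one way we could read students.csv with every column type spelled out. This is just a sketch: the factor levels are taken from the data we saw earlier, and AGE is left as a character column because of the “five” entry.

students_typed <- read_csv(
  "data/students.csv",
  na = c("N/A", ""),
  col_types = list(
    `Student ID`   = col_integer(),
    `Full Name`    = col_character(),
    favourite.food = col_character(),
    mealPlan       = col_factor(levels = c("Lunch only", "Breakfast and lunch")),
    AGE            = col_character()  # "five" would fail a numeric parser
  )
)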

You can also override the default column type by using cols() instead of list() and specifying .default, as follows:

another_csv <- "
x,y,z
1,2,3
"

read_csv(
  another_csv, 
  col_types = cols(.default = col_character())
)
#> # A tibble: 1 × 3
#>   x     y     z    
#>   <chr> <chr> <chr>
#> 1 1     2     3

Another useful helper function is cols_only(), which reads in only the columns you specify:

read_csv(
  another_csv,
  col_types = cols_only(
    x = col_character(),
    y = col_integer()
  )
)
#> # A tibble: 1 × 2
#>   x         y
#>   <chr> <int>
#> 1 1         2

Writing to a File

readr provides a useful function for writing data back to disk: write_csv(). Its main arguments are:

  1. x: the data frame to save
  2. file: the location to save it

You can also control how missing values are written with the na argument, and whether to append to an existing file with the append argument.
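
For example, here is a quick sketch using a temporary file (created with tempfile() so we don’t write into our project folder just yet):

tmp <- tempfile(fileext = ".csv")
# write missing values as empty cells instead of the default "NA"
write_csv(students, tmp, na = "")
# append = TRUE adds rows to the existing file without repeating the header
write_csv(students, tmp, na = "", append = TRUE)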

Let’s save students into a new folder called output. Note that we need to create the directory first; we can do this directly in the file manager, or we can use dir.create() (with dir.exists() to check whether the folder already exists):

if (!dir.exists("output")) {
  dir.create("output")
}

write_csv(students, "output/students.csv")

Now, let’s read the dataset back in and compare it with what we had before:

# original data
students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <int>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

# new output
read_csv("output/students.csv")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <chr>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

We can see that we’re reading back a plain-text file: the meal_plan column is an ordinary character column again, and age has gone back to being a double. The column types we set up earlier are lost, because a CSV file stores only text.
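
If you want to keep working from the CSV file, one option is to re-supply the column types each time you read it back; here is a sketch that mirrors the types we chose earlier:

read_csv(
  "output/students.csv",
  col_types = cols(
    student_id = col_double(),
    meal_plan  = col_factor(levels = c("Lunch only", "Breakfast and lunch")),
    age        = col_integer(),
    .default   = col_character()
  )
)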

Re-supplying the column types every time gets tedious, though. There are two alternatives that preserve the work you’ve already done on the data, and in particular the column types you’ve specified:

  1. write_rds() and read_rds(): These store data in R’s custom binary format, RDS, which means that when you reload the file, you get back exactly the same R object you stored.

    write_rds(students, "output/students.rds")
    read_rds("output/students.rds")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <int>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
  2. Read and write parquet files using the arrow package. Parquet is a fast binary file format that can be shared across programming languages. It tends to be much faster than RDS and is usable outside of R, but it does require installing an additional package.

    # install.packages("arrow")
    library(arrow)
    #> 
    #> Attaching package: 'arrow'
    #> The following object is masked from 'package:lubridate':
    #> 
    #>     duration
    #> The following object is masked from 'package:utils':
    #> 
    #>     timestamp
    write_parquet(students, "output/students.parquet")
    read_parquet("output/students.parquet")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <int>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6

Summary

In this tutorial, we learned how to read and write data files in R. We also talked about data cleaning and controlling column types.

In the next tutorial, we’ll talk about the next step of data analysis: data tidying.