Data Transformation

Introduction

It is rare that you get the data in exactly the right form you need. Often you’ll need to manipulate your data and get ready for your analysis. In this chapter we will learn how to do this using dplyr package and a new dataset on flights departing New York City in 2013.

Prerequisites

library(dplyr); library(ggplot2); library(nycflights13)

nycflights13

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

Dplyr Basics

The basic dplyr functions are:

Select rows by filter() and slice().
Reorder the rows by arrange().
Pick variables by select() and rename().
Create new variables with mutate() and transmute().
Collapse many values down to a single one by summarise().

All of them can be combined with group_by() to work on subsets of variables instead of their entirety.

Let’s dive in and see how these verbs work.

Subset rows:

By Condition

filter() subsets observations based on logical conditions. The first argument is the data frame. The second and subsequent arguments are logical expressions that filter the data frame.

Example: select all flights on January 1st:

filter(flights, month == 1 & day == 1)

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

Dplyr returns a new filtered data frame. To save the result, use the assignment operator, <-:

jan1 <- filter(flights, month==1 & day==1)

In the previous example we used the logical and operator, &, to filter the rows. This means: select the rwos where both conditions are true.

Other logical operators are:

Or, |: select rows where any of the conditions is true
Not, !: select rows where the condition is not true
Group membership %in%: select rows that match a specific group of values of a column

Example:

Find all flights that departed in November or December

Solution:

(dec.or.nov <- filter(flights, month == 11 | month == 12))

## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013    11     1        5           2359         6      352
## 2   2013    11     1       35           2250       105      123
## 3   2013    11     1      455            500        -5      641
## 4   2013    11     1      539            545        -6      856
## 5   2013    11     1      542            545        -3      831
## 6   2013    11     1      549            600       -11      912
## 7   2013    11     1      550            600       -10      705
## 8   2013    11     1      554            600        -6      659
## 9   2013    11     1      554            600        -6      826
## 10  2013    11     1      554            600        -6      749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

To see that it selected only the flights in November or December:

unique(dec.or.nov$month)

## [1] 11 12

Note: when using & or |, you have to repeat the name of the variable like we did: month == 11 | month == 12, not: month == 11 | 12.

Note: to check for equality, we use the double equal operator: ==, not the single one: =.

Example: Find all flights that departed from any airport other than ‘JFK’

no_jfk <- filter(flights, origin != 'JFK')

# check if no_jfk contains 'JFK' in the `origin`
'JFK' %in% no_jfk$origin

## [1] FALSE

Example: Find all flights that belong to ‘UA’, ‘B6’, ‘DL’ airlines, i.e: carrier

filter(flights, carrier %in% c('UA', 'B6', 'DL'))

## # A tibble: 161,410 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      544            545        -1     1004
## 4   2013     1     1      554            600        -6      812
## 5   2013     1     1      554            558        -4      740
## 6   2013     1     1      555            600        -5      913
## 7   2013     1     1      557            600        -3      838
## 8   2013     1     1      558            600        -2      849
## 9   2013     1     1      558            600        -2      853
## 10  2013     1     1      558            600        -2      924
## # ... with 161,400 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

# note: you could also use the logical OR operator `|` here.

By Position

slice() subsets rows based on their position (index).

Example: Get the 2nd five rows

slice(flights, 6:10)

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      554            558        -4      740
## 2  2013     1     1      555            600        -5      913
## 3  2013     1     1      557            600        -3      709
## 4  2013     1     1      557            600        -3      838
## 5  2013     1     1      558            600        -2      753
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

Example: Get the first row

slice(flights, 1)

## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

Example: Get the last row

slice(flights, n())

## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     9    30       NA            840        NA       NA
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

Note: the function n() returns the number of rows in the dataset.

top_n() selects and orde rs top (or bottom) n entries. If you use it with the entire data frame, it’ll select the top (or bottom) entries from the entirety of the dataframe. If used with group_by, it’ll select the top (or bottom) entries from each group. (more on group_by later in the chapter).

Example: select the top 5 flights ordered by dep_delay. i.e: the 5 flights with the highest dep_delay

top_n(flights, 5, dep_delay)

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## 2  2013     1    10     1121           1635      1126     1239
## 3  2013     6    15     1432           1935      1137     1607
## 4  2013     7    22      845           1600      1005     1044
## 5  2013     9    20     1139           1845      1014     1457
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

Example: select the bottom 5 flights ordered by arr_delay. i.e: the 5 flights with the lowest arr_delay

top_n(flights, -5, arr_delay)

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     5     2     1947           1949        -2     2209
## 2  2013     5     4     1816           1820        -4     2017
## 3  2013     5     6     1826           1830        -4     2045
## 4  2013     5     7     1715           1729       -14     1944
## 5  2013     5    20      719            735       -16      951
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

Summary of the Logical and Comparison Functions

	Logic in R	?Comparison	?Logic
<	Less than	!=	Not equal to
>	Greater than	%in%	Group membership
==	Equal to	is.na	Is NA
<=	Less than or equal to	!is.na	Is not NA
>=	Greater than or equal to	&,	,!,xor,any,all

Example: Find all flights where arr_delay is not NA

filter(flights, !is.na(arr_delay))

## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 327,336 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

Other Useful Functions:

distinct(flights): Remove duplicate rows.
sample_n(flights, 10, replace = TRUE): Randomly select n rows.
sample_frac(flights, 0.5, replace = TRUE): Randomly select fraction of rows.

Introducing the Pype Operator `%>%`

So far we’ve been passing the dataframe as a first argument to dplyr functions. This is to tell the function that we’re now working inside the dataframe, so that when we use the variable name it knows that it’s a variable of the dataframe.

We can accomplish the same objective using %>%. For example, our very first example: filter(flights, month == 1 & day == 1) can be re-written as: flights %>% filter(month == 1 & day == 1). The %>% operator tells filter that we’re now inside flights.

We will be using this syntax from now on, because it’s much cleaner and easier to read, especially when you combine multiple operations, as we’ll see later in the chapter.

Missing Values

Later (I think it’s better to include it in the basics chapter at the very beginning)

Arrange rows

arrange() reorder rows according to the values of a column, or a gorup of columns. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Example: Arrange the flights by year, break ties by month, and break ties again by day

dplyr::arrange(flights, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

Note: you don’t see the tie breaking here, because only a small piece of the result if printed. To see the whole result:

break_ties <- arrange(flights, year, month, day)
View(break_ties)    # note the letter 'v' in 'View' is capitalized

Use desc() to re-order by a column in descending order

Example: Arrange flights ascendingly by dep_time, and descendingly by dep_delay

arrange(flights, dep_time, desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     4    10        1           1930       271      106
## 2   2013     5    22        1           1935       266      154
## 3   2013     6    24        1           1950       251      105
## 4   2013     7     1        1           2029       212      236
## 5   2013     1    31        1           2100       181      124
## 6   2013     2    11        1           2100       181      111
## 7   2013     3    18        1           2128       153      247
## 8   2013     6    25        1           2130       151      249
## 9   2013     2    24        1           2245        76      121
## 10  2013     1    13        1           2249        72      108
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

flights %>%
  summarise(sum((arr_delay + dep_delay) < 0, na.rm=T))

## # A tibble: 1 x 1
##   sum((arr_delay + dep_delay) < 0, na.r...
##                                      <int>
## 1                                   188401

Select Columns:

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

Example: Select the year, month, day from the flights data set

flights %>% select(year, month, day)

## # A tibble: 336,776 x 3
##    year month   day
##   <int> <int> <int>
## 1  2013     1     1
## 2  2013     1     1
## 3  2013     1     1
## 4  2013     1     1
## # ... with 336,772 more rows

Example: Select all columns between year and day (inclusive)

flights %>% select(year:day)

## # A tibble: 336,776 x 3
##    year month   day
##   <int> <int> <int>
## 1  2013     1     1
## 2  2013     1     1
## 3  2013     1     1
## 4  2013     1     1
## # ... with 336,772 more rows

Example: Select all columns except those from year to day (inclusive)

flights %>% select(-(year:day))

## # A tibble: 336,776 x 16
##   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##      <int>          <int>     <dbl>    <int>          <int>     <dbl>
## 1      517            515         2      830            819        11
## 2      533            529         4      850            830        20
## 3      542            540         2      923            850        33
## 4      544            545        -1     1004           1022       -18
## # ... with 336,772 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

There are a number of helper functions you can use within select():

starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.

See ?select for more details.

To select columns and renaming them: use rename() instead of select()

flights %>% rename(which_year = year)

## # A tibble: 336,776 x 19
##   which_year month   day dep_time sched_dep_time dep_delay arr_time
##        <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1       2013     1     1      517            515         2      830
## 2       2013     1     1      533            529         4      850
## 3       2013     1     1      542            540         2      923
## 4       2013     1     1      544            545        -1     1004
## # ... with 336,772 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <time>

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

flights %>% select(time_hour, air_time, everything())

## # A tibble: 336,776 x 19
##             time_hour air_time  year month   day dep_time sched_dep_time
##                <time>    <dbl> <int> <int> <int>    <int>          <int>
## 1 2013-01-01 05:00:00      227  2013     1     1      517            515
## 2 2013-01-01 05:00:00      227  2013     1     1      533            529
## 3 2013-01-01 05:00:00      160  2013     1     1      542            540
## 4 2013-01-01 05:00:00      183  2013     1     1      544            545
## # ... with 336,772 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Compute using `summarise`

Now we have the ability to filter rows and select columns of interest, we usually want to perform some computations on them by applying summary functions. In general, we have 4 cases:

Case 1: apply one function to one variable
Case 2: apply many functions to one variable
Case 3: apply one function to many variables
Case 4: apply many functions to many variables

Case 1: apply one function to one variable

Example: How many trips have had arr_delay less than zero?

flights %>% 
  filter(arr_delay < 0) %>% 
  summarize(n())

## # A tibble: 1 x 1
##      n()
##    <int>
## 1 188933

Notice how the %>% operator comes in handy here. It tells the natural flow of thinking about the problem:

I first want to filter my observations (rows) such that it only contains the flights matching the condition
Then I use summarise() to do computation on the resulted dataframe from the previous step
The computation I need here is just count the number rows; which is achieved by the n() helper function.

Case 2: apply many functions to one variable

Let’s look at a slightly more complicated problem:

Example: calculate the mean, standard deviation, and the number of flights with dep_delay less than 20

# narrate the problem:
# 1- select the rows with less than 20 dep_delay
# 2- do the required computation using summarise
flights %>% 
  filter(dep_delay < 20) %>% 
  summarise(dep_delay_mean = mean(dep_delay),
            dep_delay_std = sd(dep_delay),
            count = n())

## # A tibble: 1 x 3
##   dep_delay_mean dep_delay_std  count
##            <dbl>         <dbl>  <int>
## 1      -1.549192      6.415418 265184

Note that dep_delay_mean, dep_delay_std and count are optional names I gave to the columns of the returned table. I could have just typed:

flights %>% 
  filter(dep_delay < 20) %>% 
  summarise(mean(dep_delay),
            sd(dep_delay),
            n())

## # A tibble: 1 x 3
##   mean(dep_delay) sd(dep_delay)    n()
##             <dbl>         <dbl>  <int>
## 1       -1.549192      6.415418 265184

Case 3: apply one function to many variables

In the previous examples, we performed compuations on one variable. To do the same computation on more than one variable simaltaneously, use summarise_each():

Example: compute the variance of dep_delay and arr_delay, for all flights from JFK airport

flights %>%
  filter(origin == 'JFK') %>%
  summarise_each(funs(var), dep_delay, arr_delay)

## # A tibble: 1 x 2
##   dep_delay arr_delay
##       <dbl>     <dbl>
## 1        NA        NA

Case 4: apply many functions to many variables

flights %>%
  filter(origin == 'JFK') %>%
  summarise_each(funs(mean, sd), dep_delay, arr_delay)

## # A tibble: 1 x 4
##   dep_delay_mean arr_delay_mean dep_delay_sd arr_delay_sd
##            <dbl>          <dbl>        <dbl>        <dbl>
## 1             NA             NA          NaN          NaN

Note: dplyr automatically named the output variables for us. If we want to give them custom names:

flights %>%
  filter(origin == 'JFK') %>%
  summarise_each(funs(mean, sd), dep_delay, arr_delay) %>% 
  setNames(c('avg_dep_dealy', 'std_dep_delay', 'avg_arr_delay', 'std_arr_delay'))

## # A tibble: 1 x 4
##   avg_dep_dealy std_dep_delay avg_arr_delay std_arr_delay
##           <dbl>         <dbl>         <dbl>         <dbl>
## 1            NA            NA           NaN           NaN

For more illustration: Aggregation with dplyr: summarise and summarise_each

Add new variables

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate() and transmute()

`mutate()`

mutate() always adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset so we can see the new variables.

flights_sml <- flights %>%
  select(year:day, ends_with('delay'), distance, air_time)

flights_sml %>% 
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60)

## # A tibble: 336,776 × 9
##    year month   day dep_delay arr_delay distance air_time  gain    speed
##   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl>    <dbl>
## 1  2013     1     1         2        11     1400      227     9 370.0441
## 2  2013     1     1         4        20     1416      227    16 374.2731
## 3  2013     1     1         2        33     1089      160    31 408.3750
## 4  2013     1     1        -1       -18     1576      183   -17 516.7213
## # ... with 336,772 more rows

Note that you can refer to columns that you’ve just created:

flights_sml %>% 
  mutate(gain = arr_delay - dep_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours)

## # A tibble: 336,776 × 10
##    year month   day dep_delay arr_delay distance air_time  gain    hours
##   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl>    <dbl>
## 1  2013     1     1         2        11     1400      227     9 3.783333
## 2  2013     1     1         4        20     1416      227    16 3.783333
## 3  2013     1     1         2        33     1089      160    31 2.666667
## 4  2013     1     1        -1       -18     1576      183   -17 3.050000
## # ... with 336,772 more rows, and 1 more variables: gain_per_hour <dbl>

If you only want to keep the new variables, use:

`transmute()`

flights %>% 
  transmute(gain = arr_delay - dep_delay,
            hours = air_time / 60,
            gain_per_hour = gain / hours)

## # A tibble: 336,776 × 3
##    gain    hours gain_per_hour
##   <dbl>    <dbl>         <dbl>
## 1     9 3.783333      2.378855
## 2    16 3.783333      4.229075
## 3    31 2.666667     11.625000
## 4   -17 3.050000     -5.573770
## # ... with 336,772 more rows