0. Introduction

In this module, we will be using the dplyr package to perform data transformation in R. We will use a new data set on flights departing New York City in 2013. You must install the package nycflights13 to make it available.

install.packages("nycflights13")

library(tidyverse)
library(nycflights13)

`nycflights13` data set

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Please check the meaning of each variable from the help document.

Variable types

To summarize the data types in R so far:

int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
lgl stands for logical, vectors that contain only TRUE or FALSE.
fct stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.

dplyr basics

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by()

1. Filter rows with `filter()`

filter() allows you to subset observations based on their values. For example, to pick up the flights on January 1st, we may do the following:

filter(flights, month == 1, day == 1)

## # A tibble: 842 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Use the assignment operator, `<-`:

To use that filtered data, we need to assign it to a new name with the assignment <-. (Later we will learn to use a pipe to simplify this process.)

jan1 <- filter(flights, month == 1, day == 1)

jan1

## # A tibble: 842 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Template for using `fliter()`

fliter(<DATA_NAME>, ...<LOGICAL CONDITIONS>...)

Everything after data_name should be an expression returning some logical value. When multiple conditions are given, then rows that satisfy all conditions (every condition is TRUE) will be retained.

Comparisons

fliter() function supports all logical comparisons to fliter our data using the following operators:

>, >=, <, <=, != (not equal to), and == (equal to).

Error message when use `=`

filter(flights, month = 1)

## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 1`?

Caution with `==` when dealing with float:

When handling float numbers, it is usually not a good idea to use == since there is machine error for floats and even mathmatically true equations may turn out to be FALSE.

sqrt(2) ^ 2 == 2

## [1] FALSE

1/49 * 49 == 1

## [1] FALSE

Instead, one should use near() instead to check equations about float numbers.

near(sqrt(2) ^ 2,  2)

## [1] TRUE

near(1 / 49 * 49, 1)

## [1] TRUE

Lab Exercise

Find all flights that travel a distance of less than 200 miles. How many of such flights are there in the data set?

Logical operators

Use Boolean operators with filter() : & is “and”, | is “or”, and ! is “not”.

Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects.

Finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

## # A tibble: 55,403 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013    11     1        5       2359       6     352     345       7 B6     
##  2  2013    11     1       35       2250     105     123    2356      87 B6     
##  3  2013    11     1      455        500      -5     641     651     -10 US     
##  4  2013    11     1      539        545      -6     856     827      29 UA     
##  5  2013    11     1      542        545      -3     831     855     -24 AA     
##  6  2013    11     1      549        600     -11     912     923     -11 UA     
##  7  2013    11     1      550        600     -10     705     659       6 US     
##  8  2013    11     1      554        600      -6     659     701      -2 US     
##  9  2013    11     1      554        600      -6     826     827      -1 DL     
## 10  2013    11     1      554        600      -6     749     751      -2 DL     
## # … with 55,393 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Note that the following doesn’t work as you expected:

filter(flights, month == (11 | 12))

This is because (11 | 12) is equivalent to evaluate TRUE|TRUE (any non zero value is considered TRUE in logical expressions) which returns TRUE. Then TRUE is interpreted as 1 to match the numeric variable month. So this finds all flights in January instead of November and December!

Use `x %in% y`.

%in% is equivalent to the in keyword in Python. It can conveniently replace | when there are many options. The following code finds flights that depart in May, July or November.

nov_dec <- filter(flights, month %in% c(5, 7, 11))

R also has `&&` and `||`, but

Don’t use them here in filter()!

Use `between()` helper function for (X > a) & (X < b)

For filtering a value between two values, for example, depature time between 0:00am and 6:00am, we can use the between function:

    filter(flights, between(dep_time, 0000, 0600))

## # A tibble: 9,344 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 9,334 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Lab Exercise

a) Find all flights that depart from `JFK` and land on `ORD` or `CVG` (Chicago or Dallas).

b) Find all flights that flied a distance between 200 and 500 miles.

Missing values `NA`

Missing values are represented as NAs (“not available”) in R. NA by itself is a constant in R, and can be of any type depending on the type of other available data.

NA are “contagious” in mathematical operations:

NA > 5

## [1] NA

10 == NA

## [1] NA

NA + 10

## [1] NA

NA / 2

## [1] NA

Why is the following expression `FALSE`?

NA == NA

## [1] NA

To understand with example

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y

## [1] NA

# We don't know!

Use `is.na()` to filter out `NA` in your data set

If you want to determine if a value is missing, use is.na():

is.na(x)

## [1] TRUE

Note: `filter()` excludes both `FALSE` and `NA` values.

filter(flights, is.na(dep_time))

## # A tibble: 8,255 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1       NA       1630      NA      NA    1815      NA EV     
##  2  2013     1     1       NA       1935      NA      NA    2240      NA AA     
##  3  2013     1     1       NA       1500      NA      NA    1825      NA AA     
##  4  2013     1     1       NA        600      NA      NA     901      NA B6     
##  5  2013     1     2       NA       1540      NA      NA    1747      NA EV     
##  6  2013     1     2       NA       1620      NA      NA    1746      NA EV     
##  7  2013     1     2       NA       1355      NA      NA    1459      NA EV     
##  8  2013     1     2       NA       1420      NA      NA    1644      NA EV     
##  9  2013     1     2       NA       1321      NA      NA    1536      NA EV     
## 10  2013     1     2       NA       1545      NA      NA    1910      NA AA     
## # … with 8,245 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

filter(flights, dep_time > 1900)

## # A tibble: 54,702 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1     1904       1905      -1    2139    2227     -48 DL     
##  2  2013     1     1     1904       1905      -1    2157    2208     -11 UA     
##  3  2013     1     1     1905       1900       5    2311    2301      10 DL     
##  4  2013     1     1     1906       1915      -9    2211    2244     -33 DL     
##  5  2013     1     1     1909       1912      -3    2239    2237       2 B6     
##  6  2013     1     1     1909       1910      -1    2212    2224     -12 DL     
##  7  2013     1     1     1910       1909       1    2126    2046      40 EV     
##  8  2013     1     1     1910       1910       0    2126    2107      19 9E     
##  9  2013     1     1     1910       1855      15    2118    2103      15 US     
## 10  2013     1     1     1911       1910       1    2050    2055      -5 MQ     
## # … with 54,692 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

The first result shows that there are missing values in actual departure time, and the second result shows that when we do filtering, NA values are dropped.

Lab Exercise

How many flights have a missing plane tail number?

Data analysis example

Let’s exercise what we just learned combining data transformation and data visualization. One routine practice is to remove samples with NA values before making the graph.

For example, if we want to create a histogram of dep_time, we can directly make the graph and ggplot will smartly throw away the NA values. A warning message would pop out to hint us with this fact.

ggplot(flights) + geom_histogram(aes(dep_time))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).

But sometimes we hope to know information about those flights with NA values. For example, NA in dep_time means that the flight has been canceled (therefore no departure time). If we hope to know which airline company canceled the most flights in our data set, we can do the following:

canceled_flight <- filter(flights, is.na(dep_time))

ggplot(canceled_flight) + 
  geom_bar(aes(carrier)) +
  labs(title = "Canceled Flights by Carrier", 
       x = "Carrier (in abbreviation)", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

So we see that airline `EV` canceled the most flights in our data set. After checking `airlines` commands we see that it is ExpressJet Airlines.

Lab Exercise

For EV, during which month were most flights canceled?

2. Select columns with `select()`

The second function to study here is the select() function, which select variables from the original data set and form a new one.

There are a few ways to use select(), as shown below.

# Select columns by name
select(flights, year, month, day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

# Select all columns between year and day (inclusive)
select(flights, year:day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

## # A tibble: 336,776 × 16
##    dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
##      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
##  1     517     515       2     830     819      11 UA        1545 N14228  EWR   
##  2     533     529       4     850     830      20 UA        1714 N24211  LGA   
##  3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
##  4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
##  5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
##  6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
##  7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
##  8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
##  9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
## 10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
## # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
## #   variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
## #   ⁵sched_arr_time, ⁶arr_delay

# Select by col number
select(flights, 1:3)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

Similarly to filter(), if you want to save the data after selecting, we must either name it or use the pipe (discussed later).

new_data <- select(flights, year, month, day)

Helper functions `starts_with(), ends_with(), contains()`

There are a number of helper functions you can use within select():

* `starts_with("abc")`: matches names that begin with "abc".
* `ends_with("xyz")`: matches names that end with "xyz".
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression.
*  `num_range("x", 1:3)` matches `x1`, `x2` and `x3`.
   
* See `?select` for more details.

# Select columns of year, month, day and those starting with "dep" or "arr"
dep_data = select(flights, year, month, day, starts_with(c("dep", "arr")))
glimpse(dep_data)

## Rows: 336,776
## Columns: 7
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, …
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31…

This may not look useful for a data set with only dozens of columns. But when you have hundreds of columns or more, this can be very helpful.

`rename()` to rename a column

Sometimes we hope to rename a column for convenience. We can do the following:

rename(flights, sch_dep = sched_dep_time) # new_name = old_name

## # A tibble: 336,776 × 19
##     year month   day dep_time sch_dep dep_delay arr_time sched…¹ arr_d…² carrier
##    <int> <int> <int>    <int>   <int>     <dbl>    <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517     515         2      830     819      11 UA     
##  2  2013     1     1      533     529         4      850     830      20 UA     
##  3  2013     1     1      542     540         2      923     850      33 AA     
##  4  2013     1     1      544     545        -1     1004    1022     -18 B6     
##  5  2013     1     1      554     600        -6      812     837     -25 DL     
##  6  2013     1     1      554     558        -4      740     728      12 UA     
##  7  2013     1     1      555     600        -5      913     854      19 B6     
##  8  2013     1     1      557     600        -3      709     723     -14 EV     
##  9  2013     1     1      557     600        -3      838     846      -8 B6     
## 10  2013     1     1      558     600        -2      753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_arr_time, ²arr_delay

Use `everything()` to reorder the columns

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

For example, if we want to move time_hour and air_time to the first two columns and keep all others. We can do

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 × 19
##    time_hour           air_t…¹  year month   day dep_t…² sched…³ dep_d…⁴ arr_t…⁵
##    <dttm>                <dbl> <int> <int> <int>   <int>   <int>   <dbl>   <int>
##  1 2013-01-01 05:00:00     227  2013     1     1     517     515       2     830
##  2 2013-01-01 05:00:00     227  2013     1     1     533     529       4     850
##  3 2013-01-01 05:00:00     160  2013     1     1     542     540       2     923
##  4 2013-01-01 05:00:00     183  2013     1     1     544     545      -1    1004
##  5 2013-01-01 06:00:00     116  2013     1     1     554     600      -6     812
##  6 2013-01-01 05:00:00     150  2013     1     1     554     558      -4     740
##  7 2013-01-01 06:00:00     158  2013     1     1     555     600      -5     913
##  8 2013-01-01 06:00:00      53  2013     1     1     557     600      -3     709
##  9 2013-01-01 06:00:00     140  2013     1     1     557     600      -3     838
## 10 2013-01-01 06:00:00     138  2013     1     1     558     600      -2     753
## # … with 336,766 more rows, 10 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## #   variable names ¹air_time, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵arr_time

3. Arrange rows with `arrange()`

Next, we will study the arrange() function which changes row order by sorting the values in one or more given columns.

For example, the original data set does not completely follow the order of time since month 10, 11, 12 goes right after month 1. To resolve this, we can rearrange the rows by month and day.

flights_md <- arrange(flights, month, day)

Similar to all previous functions, the first argument is the name of data set, followed by column names.

Arrange from the largest to the smallest values - `desc()`

By default, the arrange() function sorting out values from the smallest to the largest. To reverse it, we can simply put an desc (representing “descending”) function on the column name.

for example, we hope to see flights with the worst delay in departure, we can do the following:

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     9      641        900    1301    1242    1530    1272 HA     
##  2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
##  3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
##  4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
##  5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
##  6  2013     4    10     1100       1900     960    1342    2211     931 DL     
##  7  2013     3    17     2321        810     911     135    1020     915 DL     
##  8  2013     6    27      959       1900     899    1236    2226     850 DL     
##  9  2013     7    22     2257        759     898     121    1026     895 DL     
## 10  2013    12     5      756       1700     896    1058    2020     878 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

We see that the worst one delayed by 1301 minutes, which is over 21 hours!

arrange() function can be very useful when we hope to quickly check the minimum and maximum values of any column.

Lab Exercise

What was the longest travel distance for any flight in our data set? What was the origin and the destination?

Note: missing values are always sorted at the end:

(df <- tibble(x = c(5, 2, NA)))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

arrange(df, x)

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x))

## # A tibble: 3 × 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

show sorted columns with `select()` and `arrange()`

When there are too many columns, we may hope to make the arranged column the first one. For example, if we arrange the data set by air_time, then we can do the following:

select(arrange(flights, air_time), flight, month, day, year, air_time)

## # A tibble: 336,776 × 5
##    flight month   day  year air_time
##     <int> <int> <int> <int>    <dbl>
##  1   4368     1    16  2013       20
##  2   4631     4    13  2013       20
##  3   4276    12     6  2013       21
##  4   4619     2     3  2013       21
##  5   4368     2     5  2013       21
##  6   4619     2    12  2013       21
##  7   2132     3     2  2013       21
##  8   3650     3     8  2013       21
##  9   4118     3    18  2013       21
## 10   4276     3    19  2013       21
## # … with 336,766 more rows

Data analysis Example

Let’s do another analysis example. Let’s try to analyze what is the prominent factor that causes very long delays. First, let’s separate our data set into two parts - those with long delay (more than 2 hours) and all other flights.

long_delay <- filter(flights, dep_delay >= 120 | arr_delay >= 120)
others <- filter(flights, dep_delay < 120 & arr_delay < 120)

So we have 11606 observations in long_delay. That’s about 3.4% of the whole data set.

nrow(long_delay)/nrow(flights)

## [1] 0.03446208

Let’s see whether carrier is a big factor here:

ggplot(long_delay) + 
  geom_bar(aes(carrier, y = after_stat(count/sum(count)))) + 
  labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))  

ggplot(others) + 
  geom_bar(aes(carrier, y = after_stat(count/sum(count)))) +
  labs(title = "Long Delay Flights", x = "Carrier", y = "Relative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We can see that the distributions of carriers are quite different between the two graphs. There are some carriers with significantly higher proportion of flights with long delays - AA, EV, UA, US etc.

Lecture 6 - Data Transformation1

Miao Yu

2023-02-07

0. Introduction

nycflights13 data set

Variable types

dplyr basics

1. Filter rows with filter()

Use the assignment operator, <-:

Template for using fliter()

Comparisons

Error message when use =

Caution with == when dealing with float:

Lab Exercise

Find all flights that travel a distance of less than 200 miles. How many of such flights are there in the data set?

Logical operators

Finds all flights that departed in November or December:

Note that the following doesn’t work as you expected:

Use x %in% y.

R also has && and ||, but

Use between() helper function for (X > a) & (X < b)

Lab Exercise

a) Find all flights that depart from JFK and land on ORD or CVG (Chicago or Dallas).

b) Find all flights that flied a distance between 200 and 500 miles.

Missing values NA

Why is the following expression FALSE?

To understand with example

Use is.na() to filter out NA in your data set

Note: filter() excludes both FALSE and NA values.

Lab Exercise

How many flights have a missing plane tail number?

Data analysis example

So we see that airline EV canceled the most flights in our data set. After checking airlines commands we see that it is ExpressJet Airlines.

Lab Exercise

2. Select columns with select()

Helper functions starts_with(), ends_with(), contains()

rename() to rename a column

Use everything() to reorder the columns

3. Arrange rows with arrange()

Arrange from the largest to the smallest values - desc()

Lab Exercise

What was the longest travel distance for any flight in our data set? What was the origin and the destination?

Note: missing values are always sorted at the end:

show sorted columns with select() and arrange()

Data analysis Example

`nycflights13` data set

1. Filter rows with `filter()`

Use the assignment operator, `<-`:

Template for using `fliter()`

Error message when use `=`

Caution with `==` when dealing with float:

Use `x %in% y`.

R also has `&&` and `||`, but

Use `between()` helper function for (X > a) & (X < b)

a) Find all flights that depart from `JFK` and land on `ORD` or `CVG` (Chicago or Dallas).

Missing values `NA`

Why is the following expression `FALSE`?

Use `is.na()` to filter out `NA` in your data set

Note: `filter()` excludes both `FALSE` and `NA` values.

So we see that airline `EV` canceled the most flights in our data set. After checking `airlines` commands we see that it is ExpressJet Airlines.

2. Select columns with `select()`

Helper functions `starts_with(), ends_with(), contains()`

`rename()` to rename a column

Use `everything()` to reorder the columns

3. Arrange rows with `arrange()`

Arrange from the largest to the smallest values - `desc()`

show sorted columns with `select()` and `arrange()`