Note: This is personal practice notes for “R for data science (2e)”

2 Data visualization

2.1 Introduction

2.1.1 Prerequisites

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

In addition to tidyverse, we will also use the palmerpenguins package, which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago.

library(palmerpenguins)

2.2 First steps

Let’s use our first graph to answer a question:

  • Do penguins with longer flippers weigh more or less than penguins with shorter flippers?

  • What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear?

  • Does the relationship vary by the species of the penguin?

  • How about by the island where the penguin lives.

2.2.1 The penguins data frame

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
##    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
## 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
## # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# View(penguins)

2.2.2 Ultimate goal

Our ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.

/need help: how to insert a pic?/

2.2.3 Creating a ggplot

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.2.4 Adding aesthetics and layers

Scatterplots are useful for displaying the relationship between two variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.

ggplot(
  data = penguins, 
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
  ) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

we want points to be colored based on species but don’t want the smooth curves to be separated out for them, we should specify color = species for geom_point() only.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point (mapping = aes(color = species)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Add shapes

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point (mapping = aes(color = species, shape = species)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.2.5 Exercises

  1. Q:How many rows are in penguins? How many columns?

A:(Rows: 344, Columns: 8)

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
  1. Q:What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out. A:(bill_length_mm: a number denoting bill length (millimeters))
?penguins
  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm. Describe the relationship between these two variables.
ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  1. What happens if you make a scatterplot of species vs. bill_depth_mm? Why is the plot not useful?
ggplot(
  data = penguins,
  mapping = aes(x = bill_depth_mm, y = species)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  1. Why does the following give an error and how would you fix it?

  2. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

  3. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

  4. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = bill_depth_mm)) +
  geom_smooth() +
  labs(title = "Data come from the palmerpenguins package."
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  1. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  1. Will these two graphs look different? Why/why not? (They are same)
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).

2.3 ggplot2 calls

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins|>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.4 Visualizing distributions

How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.

2.4.1 A categorical variabl

2.4.2 A numerical variable

ggplot(penguins,aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).

2.4.3 Exercises

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
ggplot(penguins, aes(y = species)) +
  geom_bar()

  1. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")                                   

  1. What does the bins argument in geom_histogram() do?

  2. Make a histogram of the carat variable in the diamonds dataset. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.40)

# 0.3 is interesting binwidth
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.30)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.20)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.05)

2.5 Visualizing relationships

To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot.

2.5.1 A numerical and a categorical variable

To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians.

ggplot(penguins, aes(y = body_mass_g, x = species)) +
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Alternatively, we can make frequency polygons because It’s much easier to understand overlapping lines than bars of geom_histogram().

ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_freqpoly(binwidth = 200, linewidth = 0.75)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

We can also use overlaid density plots, with species mapped to both color and fill aesthetics and using the alpha aesthetic to add transparency to the filled density curves.

ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5)
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).

2.5.2 Two categorical variables

We can use segmented bar plots to visualize the distribution between two categorical variables.

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar() +
  labs(title = 'the frequencies of each species of penguins on each island')

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = 'fill') +
  labs(title = "the proportions of each species within each island")

2.5.3 Two numerical variables

A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.5.4 Three or more variables

One way to add additional variables to a plot is by mapping them to an aesthetic.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way is to split your plot into facets, subplots that each display one subset of the data.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~ island)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

ggsave( filename = "penguinsisland.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.5.5 Exercises

  1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

  2. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

ggplot(mpg, aes(x= displ, y = hwy)) + 
  geom_point()

# third numerical variable, color
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cyl))

# size
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes( size = cyl))

# color and size
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes( color = cyl, size = cyl))

#shape
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes( shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

  1. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(linewidth = cyl))
## Warning in geom_point(aes(linewidth = cyl)): Ignoring unknown aesthetics:
## linewidth

  1. What happens if you map the same variable to multiple aesthetics?

  2. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables?

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  1. Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")
## Warning: Removed 2 rows containing missing values (`geom_point()`).

# remove labs()
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.6 Saving your plots

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

ggsave("pengins1.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.6.1 Exercises

  1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.png")
## Saving 7 x 5 in image
  1. What do you need to change in the code above to save the plot as a PDF instead of a PNG?
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.pdf")
## Saving 7 x 5 in image

2.7 Common problems

2.8 Summary

3 Workflow: basics

3.1 Coding basics

<- keyboard shortcut: Alt + - (the minus sign)

3.2 Comments

Use comments to explain the why of your code, not the how or the what.

3.3 What’s in a name?

We recommend snake_case, where you separate lowercase words with _.

this_is_a_real_long_name <- 2.5

3.4 Calling functions

3.5 Exercises

  1. Why does this code not work?
# my_variable <- 10 
# my_varıable 
#\> Error in eval(expr, envir, enclos): object 'my_varıable' not found
  1. Tweak each of the following R commands so that they run correctly:
#libary(tidyverse)

#ggplot(dota = mpg) + 
  geom_point(maping = aes(x = displ, y = hwy))
## Warning in geom_point(maping = aes(x = displ, y = hwy)): Ignoring unknown
## parameters: `maping`
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
  1. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

  2. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

my_bar_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
## Saving 7 x 5 in image

3.6 Summary

4 Data transformation

4.1 Introduction

This chapter will introduce you to data transformation using the dplyr package and a new dataset on flights that departed New York City in 2013.

4.1.1 Prerequisites

library(tidyverse)
library(nycflights13)

when we need to be precise about which function a package comes from, we’ll use the same syntax as R: packagename::functionname(). For example, If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names:stats::filter()

4.1.2 nycflights13

This dataset contains all 336,776 flights that departed from New York City in 2013.

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

It’s a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.

you can use print(flights, width = Inf) to show all columns, or use call glimpse()

glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
  • int is short for integer,
  • dbl is short for double (aka real numbers),
  • chr for character (aka strings),
  • dttm for date-time

4.1.3 dplyr basics

dplyr verbs work well with the pipe,

flights %>% 
  filter(dest == "IAH") %>% 
  group_by(year, month, day) %>% 
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day arr_delay
##    <int> <int> <int>     <dbl>
##  1  2013     1     1     17.8 
##  2  2013     1     2      7   
##  3  2013     1     3     18.3 
##  4  2013     1     4     -3.2 
##  5  2013     1     5     20.2 
##  6  2013     1     6      9.28
##  7  2013     1     7     -7.74
##  8  2013     1     8      7.79
##  9  2013     1     9     18.1 
## 10  2013     1    10      6.68
## # … with 355 more rows

dplyr’s verbs are organised into four groups based on what they operate on: rows, columns, groups, or tables.

4.2 Rows

filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present.

4.2.1 filter()

flights %>% 
  filter(arr_delay > 120)
## # A tibble: 10,034 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      811        630     101    1047     830     137 MQ     
##  2  2013     1     1      848       1835     853    1001    1950     851 MQ     
##  3  2013     1     1      957        733     144    1056     853     123 UA     
##  4  2013     1     1     1114        900     134    1447    1222     145 UA     
##  5  2013     1     1     1505       1310     115    1638    1431     127 EV     
##  6  2013     1     1     1525       1340     105    1831    1626     125 B6     
##  7  2013     1     1     1549       1445      64    1912    1656     136 EV     
##  8  2013     1     1     1558       1359     119    1718    1515     123 EV     
##  9  2013     1     1     1732       1630      62    2028    1825     123 EV     
## 10  2013     1     1     1803       1620     103    2008    1750     138 MQ     
## # … with 10,024 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use & (and) or | (or) to combine multiple conditions:

# Flights that departed on January 1
flights %>% 
  filter(month == 1 & day == 1)
## # A tibble: 842 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Flights that departed in January or February
flights %>% 
  filter(month == 1 | month == 2)
## # A tibble: 51,955 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 51,945 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:

# A shorter way to select flights that departed in January or February
flights %>% 
  filter(month %in% c(1,2))
## # A tibble: 51,955 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 51,945 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

When you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-

jan1 <- flights %>% 
  filter(month == 1 & day == 1)

4.2.2 Common mistakes

the easiest mistake to make is to use = instead of == when testing for equality. Another mistakes is you write “or” statements like you would in English:

4.2.3 arrange()

flights %>% 
  arrange(year, month, day, dep_time)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% 
  arrange(desc(dep_delay))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     9      641        900    1301    1242    1530    1272 HA     
##  2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
##  3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
##  4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
##  5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
##  6  2013     4    10     1100       1900     960    1342    2211     931 DL     
##  7  2013     3    17     2321        810     911     135    1020     915 DL     
##  8  2013     6    27      959       1900     899    1236    2226     850 DL     
##  9  2013     7    22     2257        759     898     121    1026     895 DL     
## 10  2013    12     5      756       1700     896    1058    2020     878 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# look for the flights that were most delayed on arrival that left on roughly on time
flights %>% 
  filter(dep_delay <= 10 & dep_delay >= 10) %>% 
  arrange(desc(arr_delay))
## # A tibble: 2,859 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     9    12     1309       1259      10    1727    1507     140 EV     
##  2  2013     7     7     1555       1545      10    1919    1710     129 MQ     
##  3  2013     4    10     1445       1435      10    1843    1644     119 DL     
##  4  2013     7     1     1130       1120      10    1424    1233     111 B6     
##  5  2013     9    18     1710       1700      10    2030    1839     111 UA     
##  6  2013     7     7     1815       1805      10    2139    1950     109 AA     
##  7  2013     4    18      710        700      10    1057     916     101 UA     
##  8  2013     7    10     1740       1730      10    2251    2110     101 DL     
##  9  2013     7    22     1859       1849      10    2324    2145      99 UA     
## 10  2013     5    19     1610       1600      10    2053    1915      98 DL     
## # … with 2,849 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% 
  arrange(desc(arr_delay)) %>% 
  filter(dep_delay <= 10 & dep_delay >= 10)
## # A tibble: 2,859 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     9    12     1309       1259      10    1727    1507     140 EV     
##  2  2013     7     7     1555       1545      10    1919    1710     129 MQ     
##  3  2013     4    10     1445       1435      10    1843    1644     119 DL     
##  4  2013     7     1     1130       1120      10    1424    1233     111 B6     
##  5  2013     9    18     1710       1700      10    2030    1839     111 UA     
##  6  2013     7     7     1815       1805      10    2139    1950     109 AA     
##  7  2013     4    18      710        700      10    1057     916     101 UA     
##  8  2013     7    10     1740       1730      10    2251    2110     101 DL     
##  9  2013     7    22     1859       1849      10    2324    2145      99 UA     
## 10  2013     5    19     1610       1600      10    2053    1915      98 DL     
## # … with 2,849 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.2.4 distinct()

finds all the unique rows in a dataset

# This would remove any duplicate rows if there were any
flights %>% 
  distinct()
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# This finds all unique origin and destination pairs.
flights %>% 
  distinct(origin, dest)
## # A tibble: 224 × 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    IAH  
##  2 LGA    IAH  
##  3 JFK    MIA  
##  4 JFK    BQN  
##  5 LGA    ATL  
##  6 EWR    ORD  
##  7 EWR    FLL  
##  8 LGA    IAD  
##  9 JFK    MCO  
## 10 LGA    ORD  
## # … with 214 more rows

Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping distinct() for count() and then filtering as needed.

4.2.5 Exercises

  1. Find all flights that
# Had an arrival delay of two or more hours
flights %>% 
  filter(arr_delay >= 120)
## # A tibble: 10,200 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      811        630     101    1047     830     137 MQ     
##  2  2013     1     1      848       1835     853    1001    1950     851 MQ     
##  3  2013     1     1      957        733     144    1056     853     123 UA     
##  4  2013     1     1     1114        900     134    1447    1222     145 UA     
##  5  2013     1     1     1505       1310     115    1638    1431     127 EV     
##  6  2013     1     1     1525       1340     105    1831    1626     125 B6     
##  7  2013     1     1     1549       1445      64    1912    1656     136 EV     
##  8  2013     1     1     1558       1359     119    1718    1515     123 EV     
##  9  2013     1     1     1732       1630      62    2028    1825     123 EV     
## 10  2013     1     1     1803       1620     103    2008    1750     138 MQ     
## # … with 10,190 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Flew to Houston (IAH or HOU)
flights %>% 
  filter(dest %in% c( 'IAH', 'HOU'))
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      623        627      -4     933     932       1 UA     
##  4  2013     1     1      728        732      -4    1041    1038       3 UA     
##  5  2013     1     1      739        739       0    1104    1038      26 UA     
##  6  2013     1     1      908        908       0    1228    1219       9 UA     
##  7  2013     1     1     1028       1026       2    1350    1339      11 UA     
##  8  2013     1     1     1044       1045      -1    1352    1351       1 UA     
##  9  2013     1     1     1114        900     134    1447    1222     145 UA     
## 10  2013     1     1     1205       1200       5    1503    1505      -2 UA     
## # … with 9,303 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Were operated by United, American, or Delta
flights %>% 
  filter(carrier %in% c("DL",'AA','UA'))
## # A tibble: 139,504 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      554        600      -6     812     837     -25 DL     
##  5  2013     1     1      554        558      -4     740     728      12 UA     
##  6  2013     1     1      558        600      -2     753     745       8 AA     
##  7  2013     1     1      558        600      -2     924     917       7 UA     
##  8  2013     1     1      558        600      -2     923     937     -14 UA     
##  9  2013     1     1      559        600      -1     941     910      31 AA     
## 10  2013     1     1      559        600      -1     854     902      -8 UA     
## # … with 139,494 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Departed in summer (July, August, and September)
flights %>% 
  filter(month %in% c(7,8,9))
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     7     1        1       2029     212     236    2359     157 B6     
##  2  2013     7     1        2       2359       3     344     344       0 B6     
##  3  2013     7     1       29       2245     104     151       1     110 B6     
##  4  2013     7     1       43       2130     193     322      14     188 B6     
##  5  2013     7     1       44       2150     174     300     100     120 AA     
##  6  2013     7     1       46       2051     235     304    2358     186 B6     
##  7  2013     7     1       48       2001     287     308    2305     243 VX     
##  8  2013     7     1       58       2155     183     335      43     172 B6     
##  9  2013     7     1      100       2146     194     327      30     177 B6     
## 10  2013     7     1      100       2245     135     337     135     122 B6     
## # … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Arrived more than two hours late, but didn’t leave late
flights %>% 
  filter(arr_delay > 120 & dep_delay <= 0)
## # A tibble: 29 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1    27     1419       1420      -1    1754    1550     124 MQ     
##  2  2013    10     7     1350       1350       0    1736    1526     130 EV     
##  3  2013    10     7     1357       1359      -2    1858    1654     124 AA     
##  4  2013    10    16      657        700      -3    1258    1056     122 B6     
##  5  2013    11     1      658        700      -2    1329    1015     194 VX     
##  6  2013     3    18     1844       1847      -3      39    2219     140 UA     
##  7  2013     4    17     1635       1640      -5    2049    1845     124 MQ     
##  8  2013     4    18      558        600      -2    1149     850     179 AA     
##  9  2013     4    18      655        700      -5    1213     950     143 AA     
## 10  2013     5    22     1827       1830      -3    2217    2010     127 MQ     
## # … with 19 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>% 
  filter(dep_delay >= 60 & dep_delay - arr_delay > 30) 
## # A tibble: 1,844 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1     2205       1720     285      46    2040     246 AA     
##  2  2013     1     1     2326       2130     116     131      18      73 B6     
##  3  2013     1     3     1503       1221     162    1803    1555     128 UA     
##  4  2013     1     3     1839       1700      99    2056    1950      66 AA     
##  5  2013     1     3     1850       1745      65    2148    2120      28 AA     
##  6  2013     1     3     1941       1759     102    2246    2139      67 UA     
##  7  2013     1     3     1950       1845      65    2228    2227       1 B6     
##  8  2013     1     3     2015       1915      60    2135    2111      24 9E     
##  9  2013     1     3     2257       2000     177      45    2224     141 9E     
## 10  2013     1     4     1917       1700     137    2135    1950     105 AA     
## # … with 1,834 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  1. Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     9      641        900    1301    1242    1530    1272 HA     
##  2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
##  3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
##  4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
##  5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
##  6  2013     4    10     1100       1900     960    1342    2211     931 DL     
##  7  2013     3    17     2321        810     911     135    1020     915 DL     
##  8  2013     6    27      959       1900     899    1236    2226     850 DL     
##  9  2013     7    22     2257        759     898     121    1026     895 DL     
## 10  2013    12     5      756       1700     896    1058    2020     878 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
arrange(flights,sched_dep_time)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     7    27       NA        106      NA      NA     245      NA US     
##  2  2013     1     2      458        500      -2     703     650      13 US     
##  3  2013     1     3      458        500      -2     650     650       0 US     
##  4  2013     1     4      456        500      -4     631     650     -19 US     
##  5  2013     1     5      458        500      -2     640     650     -10 US     
##  6  2013     1     6      458        500      -2     718     650      28 US     
##  7  2013     1     7      454        500      -6     637     648     -11 US     
##  8  2013     1     8      454        500      -6     625     648     -23 US     
##  9  2013     1     9      457        500      -3     647     648      -1 US     
## 10  2013     1    10      450        500     -10     634     648     -14 US     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  1. Sort flights to find the fastest flights (Hint: try sorting by a calculation).
flights %>% 
  arrange(desc( distance / air_time))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     5    25     1709       1700       9    1923    1937     -14 DL     
##  2  2013     7     2     1558       1513      45    1745    1719      26 EV     
##  3  2013     5    13     2040       2025      15    2225    2226      -1 EV     
##  4  2013     3    23     1914       1910       4    2045    2043       2 EV     
##  5  2013     1    12     1559       1600      -1    1849    1917     -28 DL     
##  6  2013    11    17      650        655      -5    1059    1150     -51 DL     
##  7  2013     2    21     2355       2358      -3     412     438     -26 B6     
##  8  2013    11    17      759        800      -1    1212    1255     -43 AA     
##  9  2013    11    16     2003       1925      38      17      36     -19 DL     
## 10  2013    11    16     2349       2359     -10     402     440     -38 B6     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  1. Was there a flight on every day of 2013?
flights %>% 
  filter(year == 2013) %>% 
  distinct(month, day)
## # A tibble: 365 × 2
##    month   day
##    <int> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10
## # … with 355 more rows
flight2013 <- flights %>% 
  filter(year == 2013) %>% 
  count(month, day)
flight2013 %>% arrange(n)
## # A tibble: 365 × 3
##    month   day     n
##    <int> <int> <int>
##  1    11    28   634
##  2    11    29   661
##  3     1    19   674
##  4    10    12   676
##  5     1    26   680
##  6     8    31   680
##  7     2     2   682
##  8     9    28   682
##  9     2     9   684
## 10    10    19   684
## # … with 355 more rows
  1. Which flights traveled the farthest distance? Which traveled the least distance?
flights %>% 
  arrange(desc(distance))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      857        900      -3    1516    1530     -14 HA     
##  2  2013     1     2      909        900       9    1525    1530      -5 HA     
##  3  2013     1     3      914        900      14    1504    1530     -26 HA     
##  4  2013     1     4      900        900       0    1516    1530     -14 HA     
##  5  2013     1     5      858        900      -2    1519    1530     -11 HA     
##  6  2013     1     6     1019        900      79    1558    1530      28 HA     
##  7  2013     1     7     1042        900     102    1620    1530      50 HA     
##  8  2013     1     8      901        900       1    1504    1530     -26 HA     
##  9  2013     1     9      641        900    1301    1242    1530    1272 HA     
## 10  2013     1    10      859        900      -1    1449    1530     -41 HA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights %>% 
  arrange(distance)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     7    27       NA        106      NA      NA     245      NA US     
##  2  2013     1     3     2127       2129      -2    2222    2224      -2 EV     
##  3  2013     1     4     1240       1200      40    1333    1306      27 EV     
##  4  2013     1     4     1829       1615     134    1937    1721     136 EV     
##  5  2013     1     4     2128       2129      -1    2218    2224      -6 EV     
##  6  2013     1     5     1155       1200      -5    1241    1306     -25 EV     
##  7  2013     1     6     2125       2129      -4    2224    2224       0 EV     
##  8  2013     1     7     2124       2129      -5    2212    2224     -12 EV     
##  9  2013     1     8     2127       2130      -3    2304    2225      39 EV     
## 10  2013     1     9     2126       2129      -3    2217    2224      -7 EV     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  1. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

4.3 Columns

4.3.1 mutate()

to add new columns that are calculated from the existing columns. By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left hand side.

flights %>% 
  mutate(
   gain = dep_delay - arr_delay,  
  speen = distance / air_time * 60,
  .before = 1
  )
## # A tibble: 336,776 × 21
##     gain speen  year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
##    <dbl> <dbl> <int> <int> <int>   <int>   <int>   <dbl>   <int>   <int>   <dbl>
##  1    -9  370.  2013     1     1     517     515       2     830     819      11
##  2   -16  374.  2013     1     1     533     529       4     850     830      20
##  3   -31  408.  2013     1     1     542     540       2     923     850      33
##  4    17  517.  2013     1     1     544     545      -1    1004    1022     -18
##  5    19  394.  2013     1     1     554     600      -6     812     837     -25
##  6   -16  288.  2013     1     1     554     558      -4     740     728      12
##  7   -24  404.  2013     1     1     555     600      -5     913     854      19
##  8    11  259.  2013     1     1     557     600      -3     709     723     -14
##  9     5  405.  2013     1     1     557     600      -3     838     846      -8
## 10   -10  319.  2013     1     1     558     600      -2     753     745       8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time, ⁵​sched_arr_time,
## #   ⁶​arr_delay
flights %>% 
  mutate(
   gain = dep_delay - arr_delay,  
  speen = distance / air_time * 60,
  .after = day
  )
## # A tibble: 336,776 × 21
##     year month   day  gain speen dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
##    <int> <int> <int> <dbl> <dbl>   <int>   <int>   <dbl>   <int>   <int>   <dbl>
##  1  2013     1     1    -9  370.     517     515       2     830     819      11
##  2  2013     1     1   -16  374.     533     529       4     850     830      20
##  3  2013     1     1   -31  408.     542     540       2     923     850      33
##  4  2013     1     1    17  517.     544     545      -1    1004    1022     -18
##  5  2013     1     1    19  394.     554     600      -6     812     837     -25
##  6  2013     1     1   -16  288.     554     558      -4     740     728      12
##  7  2013     1     1   -24  404.     555     600      -5     913     854      19
##  8  2013     1     1    11  259.     557     600      -3     709     723     -14
##  9  2013     1     1     5  405.     557     600      -3     838     846      -8
## 10  2013     1     1   -10  319.     558     600      -2     753     745       8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time, ⁵​sched_arr_time,
## #   ⁶​arr_delay

Alternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is "used" which allows you to see the inputs and outputs from your calculations:

flights %>% 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
## # A tibble: 336,776 × 6
##    dep_delay arr_delay air_time  gain hours gain_per_hour
##        <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
##  1         2        11      227    -9 3.78          -2.38
##  2         4        20      227   -16 3.78          -4.23
##  3         2        33      160   -31 2.67         -11.6 
##  4        -1       -18      183    17 3.05           5.57
##  5        -6       -25      116    19 1.93           9.83
##  6        -4        12      150   -16 2.5           -6.4 
##  7        -5        19      158   -24 2.63          -9.11
##  8        -3       -14       53    11 0.883         12.5 
##  9        -3        -8      140     5 2.33           2.14
## 10        -2         8      138   -10 2.3           -4.35
## # … with 336,766 more rows

4.3.2 select()

#select columns by names
flights %>% 
  select(year, month, day)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
# Select all columns between year and day (inclusive)
flights %>% 
  select(year:day)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
# Select all columns except those from year to day (inclusive)
flights %>% 
  select(!year:day)
## # A tibble: 336,776 × 16
##    dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
##      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
##  1     517     515       2     830     819      11 UA        1545 N14228  EWR   
##  2     533     529       4     850     830      20 UA        1714 N24211  LGA   
##  3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
##  4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
##  5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
##  6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
##  7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
##  8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
##  9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
## 10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
## # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
## #   variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time,
## #   ⁵​sched_arr_time, ⁶​arr_delay
# Select all columns that are characters
flights %>% 
  select(where(is.character))
## # A tibble: 336,776 × 4
##    carrier tailnum origin dest 
##    <chr>   <chr>   <chr>  <chr>
##  1 UA      N14228  EWR    IAH  
##  2 UA      N24211  LGA    IAH  
##  3 AA      N619AA  JFK    MIA  
##  4 B6      N804JB  JFK    BQN  
##  5 DL      N668DN  LGA    ATL  
##  6 UA      N39463  EWR    ORD  
##  7 B6      N516JB  EWR    FLL  
##  8 EV      N829AS  LGA    IAD  
##  9 B6      N593JB  JFK    MCO  
## 10 AA      N3ALAA  LGA    ORD  
## # … with 336,766 more rows
# Select all columns that are interger
flights %>% 
  select(where(is.integer))
## # A tibble: 336,776 × 8
##     year month   day dep_time sched_dep_time arr_time sched_arr_time flight
##    <int> <int> <int>    <int>          <int>    <int>          <int>  <int>
##  1  2013     1     1      517            515      830            819   1545
##  2  2013     1     1      533            529      850            830   1714
##  3  2013     1     1      542            540      923            850   1141
##  4  2013     1     1      544            545     1004           1022    725
##  5  2013     1     1      554            600      812            837    461
##  6  2013     1     1      554            558      740            728   1696
##  7  2013     1     1      555            600      913            854    507
##  8  2013     1     1      557            600      709            723   5708
##  9  2013     1     1      557            600      838            846     79
## 10  2013     1     1      558            600      753            745    301
## # … with 336,766 more rows

you can rename variables, new name is appears on the left side.

flights %>% 
  select(tail_num = tailnum)
## # A tibble: 336,776 × 1
##    tail_num
##    <chr>   
##  1 N14228  
##  2 N24211  
##  3 N619AA  
##  4 N804JB  
##  5 N668DN  
##  6 N39463  
##  7 N516JB  
##  8 N829AS  
##  9 N593JB  
## 10 N3ALAA  
## # … with 336,766 more rows

4.3.3 rename()

to keep all the existing variables and just want to rename a few

flights %>% 
  rename(tail_num = tailnum)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.3.4 relocate()

By default relocate() moves variables to the front:

flights %>% 
  relocate(carrier, flight)
## # A tibble: 336,776 × 19
##    carrier flight  year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴
##    <chr>    <int> <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>
##  1 UA        1545  2013     1     1      517         515       2     830     819
##  2 UA        1714  2013     1     1      533         529       4     850     830
##  3 AA        1141  2013     1     1      542         540       2     923     850
##  4 B6         725  2013     1     1      544         545      -1    1004    1022
##  5 DL         461  2013     1     1      554         600      -6     812     837
##  6 UA        1696  2013     1     1      554         558      -4     740     728
##  7 B6         507  2013     1     1      555         600      -5     913     854
##  8 EV        5708  2013     1     1      557         600      -3     709     723
##  9 B6          79  2013     1     1      557         600      -3     838     846
## 10 AA         301  2013     1     1      558         600      -2     753     745
## # … with 336,766 more rows, 9 more variables: arr_delay <dbl>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time
# .before
flights %>% 
  relocate(year:day, .before = carrier)
## # A tibble: 336,776 × 19
##    dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵  year month   day carrier
##       <int>      <int>   <dbl>   <int>   <int>   <dbl> <int> <int> <int> <chr>  
##  1      517        515       2     830     819      11  2013     1     1 UA     
##  2      533        529       4     850     830      20  2013     1     1 UA     
##  3      542        540       2     923     850      33  2013     1     1 AA     
##  4      544        545      -1    1004    1022     -18  2013     1     1 B6     
##  5      554        600      -6     812     837     -25  2013     1     1 DL     
##  6      554        558      -4     740     728      12  2013     1     1 UA     
##  7      555        600      -5     913     854      19  2013     1     1 B6     
##  8      557        600      -3     709     723     -14  2013     1     1 EV     
##  9      557        600      -3     838     846      -8  2013     1     1 B6     
## 10      558        600      -2     753     745       8  2013     1     1 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# .after
flights %>% 
  relocate(year:day, .after = flight)
## # A tibble: 336,776 × 19
##    dep_time sched_d…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight  year month
##       <int>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <int> <int>
##  1      517       515       2     830     819      11 UA        1545  2013     1
##  2      533       529       4     850     830      20 UA        1714  2013     1
##  3      542       540       2     923     850      33 AA        1141  2013     1
##  4      544       545      -1    1004    1022     -18 B6         725  2013     1
##  5      554       600      -6     812     837     -25 DL         461  2013     1
##  6      554       558      -4     740     728      12 UA        1696  2013     1
##  7      555       600      -5     913     854      19 B6         507  2013     1
##  8      557       600      -3     709     723     -14 EV        5708  2013     1
##  9      557       600      -3     838     846      -8 B6          79  2013     1
## 10      558       600      -2     753     745       8 AA         301  2013     1
## # … with 336,766 more rows, 9 more variables: day <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.3.5 Exercises

  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
  • dep_delay = dep_time - sched_dep_time
  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
flights %>% 
  select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
flights %>% 
  select(c(dep_time, dep_delay, arr_time, arr_delay))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
flights %>% 
  select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
  1. What happens if you include the name of a variable multiple times in a select() call?
# it will ignore  duplicate variables
flights %>% 
  select(dep_time, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
  1. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
#It returns all the variables you ask for, for example ones stored in a vector.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% 
  select(any_of(variables))
## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
flights %>% 
  select(one_of(variables))
## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
  1. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
# The default helper functions are insensitive to case. 
select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # … with 336,766 more rows
#This can be changes by setting `ignore.case=FALSE`.
select(flights, contains("TIME",  ignore.case = FALSE))
## # A tibble: 336,776 × 0

4.4 Groups

4.4.1 group_by

group_by() doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.

group_by(flights, month)
## # A tibble: 336,776 × 19
## # Groups:   month [12]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

4.4.2 summarize()

flights %>% 
  group_by(month) %>% 
  summarise(
    delay = mean(dep_delay)
  )
## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     1    NA
##  2     2    NA
##  3     3    NA
##  4     4    NA
##  5     5    NA
##  6     6    NA
##  7     7    NA
##  8     8    NA
##  9     9    NA
## 10    10    NA
## 11    11    NA
## 12    12    NA
flights %>% 
  group_by(month) %>% 
  summarise(
    delay = mean(dep_delay, na.rm = TRUE)
  )
## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     1 10.0 
##  2     2 10.8 
##  3     3 13.2 
##  4     4 13.9 
##  5     5 13.0 
##  6     6 20.8 
##  7     7 21.7 
##  8     8 12.6 
##  9     9  6.72
## 10    10  6.24
## 11    11  5.44
## 12    12 16.6

one very useful summary is n(), which returns the number of rows in each group:

flights %>% 
  group_by(month) %>% 
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
    )
## # A tibble: 12 × 3
##    month delay     n
##    <int> <dbl> <int>
##  1     1 10.0  27004
##  2     2 10.8  24951
##  3     3 13.2  28834
##  4     4 13.9  28330
##  5     5 13.0  28796
##  6     6 20.8  28243
##  7     7 21.7  29425
##  8     8 12.6  29327
##  9     9  6.72 27574
## 10    10  6.24 28889
## 11    11  5.44 27268
## 12    12 16.6  28135

4.4.3 The slice_ functions

There are five handy functions that allow you pick off specific rows within each group:

df |> slice_head(n = 1) takes the first row from each group.

df |> slice_tail(n = 1) takes the last row in each group.

df |> slice_min(x, n = 1) takes the row with the smallest value of x.

df |> slice_max(x, n = 1) takes the row with the largest value of x.

df |> slice_sample(n = 1) takes one random row.

You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group.

flights %>% 
  group_by(dest) %>% 
  slice_max(arr_delay, n = 1) %>% 
  relocate(dest, arr_delay)
## # A tibble: 108 × 19
## # Groups:   dest [105]
##    dest  arr_delay  year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴
##    <chr>     <dbl> <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>
##  1 ABQ         153  2013     7    22     2145       2007      98     132    2259
##  2 ACK         221  2013     7    23     1139        800     219    1250     909
##  3 ALB         328  2013     1    25      123       2000     323     229    2101
##  4 ANC          39  2013     8    17     1740       1625      75    2042    2003
##  5 ATL         895  2013     7    22     2257        759     898     121    1026
##  6 AUS         349  2013     7    10     2056       1505     351    2347    1758
##  7 AVL         228  2013     8    13     1156        832     204    1417    1029
##  8 BDL         266  2013     2    21     1728       1316     252    1839    1413
##  9 BGR         238  2013    12     1     1504       1056     248    1628    1230
## 10 BHM         291  2013     4    10       25       1900     325     136    2045
## # … with 98 more rows, 9 more variables: carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time
flights %>% 
  group_by(dest) %>% 
  summarise(max_arr_delay = max(arr_delay, na.rm = "TRUE"))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `max_arr_delay = max(arr_delay, na.rm = "TRUE")`.
## ℹ In group 52: `dest = "LGA"`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## # A tibble: 105 × 2
##    dest  max_arr_delay
##    <chr>         <dbl>
##  1 ABQ             153
##  2 ACK             221
##  3 ALB             328
##  4 ANC              39
##  5 ATL             895
##  6 AUS             349
##  7 AVL             228
##  8 BDL             266
##  9 BGR             238
## 10 BHM             291
## # … with 95 more rows

4.4.4 Grouping by multiple variables

daily <- flights %>% 
  group_by(year, month, day)
daily
## # A tibble: 336,776 × 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
daily_flight <- daily %>% 
  summarise(
    n = n(),
    .group = "drop_last"
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
daily_flight
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day     n .group   
##    <int> <int> <int> <int> <chr>    
##  1  2013     1     1   842 drop_last
##  2  2013     1     2   943 drop_last
##  3  2013     1     3   914 drop_last
##  4  2013     1     4   915 drop_last
##  5  2013     1     5   720 drop_last
##  6  2013     1     6   832 drop_last
##  7  2013     1     7   933 drop_last
##  8  2013     1     8   899 drop_last
##  9  2013     1     9   902 drop_last
## 10  2013     1    10   932 drop_last
## # … with 355 more rows

4.4.5 Ungrouping

to remove grouping outside of summarize(). You can do this with ungroup().

daily %>% 
  ungroup() %>% 
  summarise(
    delay = mean(dep_delay, na.rm = "TRUE"),
    n = n()
  )
## # A tibble: 1 × 2
##   delay      n
##   <dbl>  <int>
## 1    NA 336776

4.4.6 Exercises

  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights %>% 
  filter(arr_delay > 0) %>% 
  group_by(carrier) %>% 
  summarise(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  arrange(desc(mean_arr_delay))
## # A tibble: 16 × 2
##    carrier mean_arr_delay
##    <chr>            <dbl>
##  1 OO                60.6
##  2 YV                51.1
##  3 9E                49.3
##  4 EV                48.3
##  5 F9                47.6
##  6 VX                43.8
##  7 FL                41.1
##  8 WN                40.7
##  9 B6                40.0
## 10 AA                38.3
## 11 MQ                37.9
## 12 DL                37.7
## 13 UA                36.7
## 14 HA                35.0
## 15 AS                34.4
## 16 US                29.0
flights %>%
  summarise(n_distinct(carrier),
            n_distinct(origin),
            n_distinct(dest))
## # A tibble: 1 × 3
##   `n_distinct(carrier)` `n_distinct(origin)` `n_distinct(dest)`
##                   <int>                <int>              <int>
## 1                    16                    3                105
# There are 16 carriers, 3 origin airports, and 105 destination airports in this dataset. 
# For many destination airports, there are only one or two carriers that fly there.
# so it is difficult to tell how much of the delay is due to the carrier, and how much is due to the airport (busy destination airports can force planes to loiter longer before there is a free landing slot). 
# We also can't necessarily tell how much of the delay is due to the route, versus the airport itself. 
# This makes attributing the cause of in flight delays difficult.
  1. Find the most delayed flight to each destination.
flights %>%
  filter(arr_delay > 0) %>% 
  select(dest, arr_delay) %>% 
  group_by(dest) %>% 
  slice_tail(n=1)
## # A tibble: 103 × 2
## # Groups:   dest [103]
##    dest  arr_delay
##    <chr>     <dbl>
##  1 ABQ          45
##  2 ACK           3
##  3 ALB           9
##  4 ANC          39
##  5 ATL         136
##  6 AUS          12
##  7 AVL           4
##  8 BDL          77
##  9 BGR           8
## 10 BHM          77
## # … with 93 more rows
  1. How do delays vary over the course of the day. Illustrate your answer with a plot.
# 3.1 average delay VS dep_hour (if delay > 0)
delay_cover_dephour <- flights %>% 
  filter(dep_delay > 0 | arr_delay > 0) %>% 
  group_by(hour) %>% 
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
    sd_dep_delay = sd(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE), 
    sd_arr_delay = sd(arr_delay, na.rm = TRUE)
    )

delay_cover_dephour
## # A tibble: 19 × 5
##     hour avg_dep_delay sd_dep_delay avg_arr_delay sd_arr_delay
##    <dbl>         <dbl>        <dbl>         <dbl>        <dbl>
##  1     5          8.14         22.7          10.2         26.4
##  2     6         12.0          35.9          15.5         37.3
##  3     7         13.3          36.3          16.0         38.2
##  4     8         17.1          42.5          21.1         42.7
##  5     9         17.2          43.1          20.0         43.0
##  6    10         20.6          46.4          22.8         47.0
##  7    11         22.2          43.8          23.3         45.3
##  8    12         23.0          43.4          23.9         44.9
##  9    13         25.4          44.7          25.8         47.8
## 10    14         29.0          50.1          30.6         53.7
## 11    15         30.6          50.3          31.1         54.2
## 12    16         34.6          55.5          34.1         58.8
## 13    17         36.7          58.1          37.3         60.8
## 14    18         37.6          57.3          36.9         58.5
## 15    19         43.1          60.3          39.2         60.7
## 16    20         42.2          54.2          38.3         54.3
## 17    21         42.5          51.2          39.5         50.5
## 18    22         38.9          46.1          39.1         44.0
## 19    23         26.8          40.6          28.3         38.3
ggplot(delay_cover_dephour)+
  geom_point(
    aes(x = hour, y = avg_dep_delay, color = "avg_dep_delay", size = sd_dep_delay ))+
  geom_point(
    aes(x = hour,  y = avg_arr_delay, color = "avg_arr_delay", size = sd_arr_delay))+
  labs(title = "Average delay VS schedual departure time ",
       caption = "Only include data with delay time > 0 ")

# 3.2 all data scatter by dep_hour
ggplot(flights, aes(x = hour, y = arr_delay)) +
  geom_count() +
  labs(title = "Arrive Delay VS schedual departure time ",
       caption = "Include all data except missing values ")
## Warning: Removed 9430 rows containing non-finite values (`stat_sum()`).

# 3.3 average delay VS sched_arr_time (all data)
delay_cover_arrhour <- flights %>%  
  mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 )) %>% 
  group_by(sched_arr_hour) %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            avg_arr_delay = mean(arr_delay,na.rm = TRUE)) 
 
delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
  
ggplot(delay_cover_arrhour)+
  geom_point(aes(x = sched_arr_hour.num, y = avg_dep_delay, color = "avg_dep_delay"))+
  geom_point(aes(x = sched_arr_hour.num, y = avg_arr_delay,color = "avg_arr_delay"))+
  labs(title = "Average delay VS schedual arrival time ")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

# 3.4 average delay VS sched_arr_time (if delay > 0)
delay_cover_arrhour <- flights %>%  
  filter(dep_delay > 0 | arr_delay > 0) %>% 
    mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 )) %>% 
  group_by(sched_arr_hour) %>% 
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            avg_arr_delay = mean(arr_delay,na.rm = TRUE),
            sd_dep_delay = sd(dep_delay, na.rm = TRUE), 
            sd_arr_delay = sd(arr_delay,na.rm = TRUE)) 

delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
delay_cover_arrhour  
## # A tibble: 24 × 6
##    sched_arr_hour avg_dep_delay avg_arr_delay sd_dep_delay sd_arr_delay sched_…¹
##    <chr>                  <dbl>         <dbl>        <dbl>        <dbl>    <dbl>
##  1 ""                      39.0          35.0         52.1         53.7       NA
##  2 "1"                     42.8          35.2         47.5         49.9        1
##  3 "10"                    15.5          19.4         41.0         41.5       10
##  4 "11"                    17.6          20.3         42.7         43.0       11
##  5 "12"                    18.4          20.5         41.4         41.9       12
##  6 "13"                    20.7          22.9         42.5         43.9       13
##  7 "14"                    23.9          25.1         44.3         46.5       14
##  8 "15"                    26.9          27.4         50.8         52.9       15
##  9 "16"                    27.6          29.4         48.8         51.7       16
## 10 "17"                    31.0          32.4         50.8         54.7       17
## # … with 14 more rows, and abbreviated variable name ¹​sched_arr_hour.num
ggplot(delay_cover_arrhour)+
  geom_point(aes(
    x = sched_arr_hour.num, y = avg_dep_delay, 
    color = "avg_dep_delay", size = sd_dep_delay))+
  geom_point(aes(
    x = sched_arr_hour.num, y = avg_arr_delay,
    color = "avg_arr_delay", size = sd_arr_delay))+
  labs(title = "Average delay VS schedual arrival time ",
       caption= "Only include data with delay time > 0")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).

# 3.5 all data scatter by arr_hour
delay_cover_arrhour <- flights %>%  
    mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 ))

delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)

ggplot(delay_cover_arrhour, aes(x = sched_arr_hour.num, y = arr_delay)) +
  geom_point() +
  labs(title = "Arrive Delay VS schedual arrive hour ",
       caption = "Include all data except missing values ")
## Warning: Removed 12578 rows containing missing values (`geom_point()`).

#delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
  1. What happens if you supply a negative n to slice_min() and friends?

  2. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

  3. Suppose we have the following tiny data frame: