Note: This is personal practice notes for “R for data science (2e)”
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
In addition to tidyverse, we will also use the palmerpenguins package, which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago.
library(palmerpenguins)
Let’s use our first graph to answer a question:
Do penguins with longer flippers weigh more or less than penguins with shorter flippers?
What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear?
Does the relationship vary by the species of the penguin?
How about by the island where the penguin lives.
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# View(penguins)
Our ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.
/need help: how to insert a pic?/
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Scatterplots are useful for displaying the relationship between two variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
we want points to be colored based on species but don’t want the smooth curves to be separated out for them, we should specify color = species for geom_point() only.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point (mapping = aes(color = species)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Add shapes
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point (mapping = aes(color = species, shape = species)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
A:(Rows: 344, Columns: 8)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
?penguins
ggplot(
data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(x = bill_depth_mm, y = species)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Why does the following give an error and how would you fix it?
What does the na.rm argument do in
geom_point()? What is the default value of the argument?
Create a scatterplot where you successfully use this argument set to
TRUE.
Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = bill_depth_mm)) +
geom_smooth() +
labs(title = "Data come from the palmerpenguins package."
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
penguins|>
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.
ggplot(penguins,aes(x = body_mass_g)) +
geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
ggplot(penguins, aes(x = body_mass_g)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
ggplot(penguins, aes(y = species)) +
geom_bar()
ggplot(penguins, aes(x = species)) +
geom_bar(color = "red")
ggplot(penguins, aes(x = species)) +
geom_bar(fill = "red")
What does the bins argument in geom_histogram() do?
Make a histogram of the carat variable in the diamonds dataset. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.40)
# 0.3 is interesting binwidth
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.30)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.20)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.1)
ggplot(diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.05)
To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot.
To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians.
ggplot(penguins, aes(y = body_mass_g, x = species)) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Alternatively, we can make frequency polygons because It’s much easier to understand overlapping lines than bars of geom_histogram().
ggplot(penguins, aes(x = body_mass_g, color = species)) +
geom_freqpoly(binwidth = 200, linewidth = 0.75)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
We can also use overlaid density plots, with species mapped to both color and fill aesthetics and using the alpha aesthetic to add transparency to the filled density curves.
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
geom_density(alpha = 0.5)
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
We can use segmented bar plots to visualize the distribution between two categorical variables.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar() +
labs(title = 'the frequencies of each species of penguins on each island')
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = 'fill') +
labs(title = "the proportions of each species within each island")
A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
One way to add additional variables to a plot is by mapping them to an aesthetic.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way is to split your plot into facets, subplots that each display one subset of the data.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~ island)
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggsave( filename = "penguinsisland.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
ggplot(mpg, aes(x= displ, y = hwy)) +
geom_point()
# third numerical variable, color
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = cyl))
# size
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes( size = cyl))
# color and size
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes( color = cyl, size = cyl))
#shape
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes( shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(linewidth = cyl))
## Warning in geom_point(aes(linewidth = cyl)): Ignoring unknown aesthetics:
## linewidth
What happens if you map the same variable to multiple aesthetics?
Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables?
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(aes(color = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species")
## Warning: Removed 2 rows containing missing values (`geom_point()`).
# remove labs()
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggsave("pengins1.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(mpg, aes(x = class)) +
geom_bar()
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.png")
## Saving 7 x 5 in image
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.pdf")
## Saving 7 x 5 in image
<- keyboard shortcut: Alt + - (the minus sign)
We recommend snake_case, where you separate lowercase words with _.
this_is_a_real_long_name <- 2.5
# my_variable <- 10
# my_varıable
#\> Error in eval(expr, envir, enclos): object 'my_varıable' not found
#libary(tidyverse)
#ggplot(dota = mpg) +
geom_point(maping = aes(x = displ, y = hwy))
## Warning in geom_point(maping = aes(x = displ, y = hwy)): Ignoring unknown
## parameters: `maping`
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Press Alt + Shift + K. What happens? How can you get to the same place using the menus?
Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
my_bar_plot <- ggplot(mpg, aes(x = class)) +
geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
## Saving 7 x 5 in image
This chapter will introduce you to data transformation using the dplyr package and a new dataset on flights that departed New York City in 2013.
library(tidyverse)
library(nycflights13)
when we need to be precise about which function a package comes from,
we’ll use the same syntax as R: packagename::functionname(). For
example, If you want to use the base version of these functions after
loading dplyr, you’ll need to use their full
names:stats::filter()
This dataset contains all 336,776 flights that departed from New York City in 2013.
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
It’s a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
you can use print(flights, width = Inf) to show all
columns, or use call glimpse()
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
dplyr verbs work well with the pipe,
flights %>%
filter(dest == "IAH") %>%
group_by(year, month, day) %>%
summarize(arr_delay = mean(arr_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day arr_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 17.8
## 2 2013 1 2 7
## 3 2013 1 3 18.3
## 4 2013 1 4 -3.2
## 5 2013 1 5 20.2
## 6 2013 1 6 9.28
## 7 2013 1 7 -7.74
## 8 2013 1 8 7.79
## 9 2013 1 9 18.1
## 10 2013 1 10 6.68
## # … with 355 more rows
dplyr’s verbs are organised into four groups based on what they operate on: rows, columns, groups, or tables.
filter(), which changes which rows are present without
changing their order, and arrange(), which changes the
order of the rows without changing which are present.
flights %>%
filter(arr_delay > 120)
## # A tibble: 10,034 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 811 630 101 1047 830 137 MQ
## 2 2013 1 1 848 1835 853 1001 1950 851 MQ
## 3 2013 1 1 957 733 144 1056 853 123 UA
## 4 2013 1 1 1114 900 134 1447 1222 145 UA
## 5 2013 1 1 1505 1310 115 1638 1431 127 EV
## 6 2013 1 1 1525 1340 105 1831 1626 125 B6
## 7 2013 1 1 1549 1445 64 1912 1656 136 EV
## 8 2013 1 1 1558 1359 119 1718 1515 123 EV
## 9 2013 1 1 1732 1630 62 2028 1825 123 EV
## 10 2013 1 1 1803 1620 103 2008 1750 138 MQ
## # … with 10,024 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use & (and) or | (or) to combine multiple conditions:
# Flights that departed on January 1
flights %>%
filter(month == 1 & day == 1)
## # A tibble: 842 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Flights that departed in January or February
flights %>%
filter(month == 1 | month == 2)
## # A tibble: 51,955 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 51,945 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:
# A shorter way to select flights that departed in January or February
flights %>%
filter(month %in% c(1,2))
## # A tibble: 51,955 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 51,945 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
When you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-
jan1 <- flights %>%
filter(month == 1 & day == 1)
the easiest mistake to make is to use = instead of == when testing for equality. Another mistakes is you write “or” statements like you would in English:
flights %>%
arrange(year, month, day, dep_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
arrange(desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL
## 8 2013 6 27 959 1900 899 1236 2226 850 DL
## 9 2013 7 22 2257 759 898 121 1026 895 DL
## 10 2013 12 5 756 1700 896 1058 2020 878 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# look for the flights that were most delayed on arrival that left on roughly on time
flights %>%
filter(dep_delay <= 10 & dep_delay >= 10) %>%
arrange(desc(arr_delay))
## # A tibble: 2,859 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 9 12 1309 1259 10 1727 1507 140 EV
## 2 2013 7 7 1555 1545 10 1919 1710 129 MQ
## 3 2013 4 10 1445 1435 10 1843 1644 119 DL
## 4 2013 7 1 1130 1120 10 1424 1233 111 B6
## 5 2013 9 18 1710 1700 10 2030 1839 111 UA
## 6 2013 7 7 1815 1805 10 2139 1950 109 AA
## 7 2013 4 18 710 700 10 1057 916 101 UA
## 8 2013 7 10 1740 1730 10 2251 2110 101 DL
## 9 2013 7 22 1859 1849 10 2324 2145 99 UA
## 10 2013 5 19 1610 1600 10 2053 1915 98 DL
## # … with 2,849 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
arrange(desc(arr_delay)) %>%
filter(dep_delay <= 10 & dep_delay >= 10)
## # A tibble: 2,859 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 9 12 1309 1259 10 1727 1507 140 EV
## 2 2013 7 7 1555 1545 10 1919 1710 129 MQ
## 3 2013 4 10 1445 1435 10 1843 1644 119 DL
## 4 2013 7 1 1130 1120 10 1424 1233 111 B6
## 5 2013 9 18 1710 1700 10 2030 1839 111 UA
## 6 2013 7 7 1815 1805 10 2139 1950 109 AA
## 7 2013 4 18 710 700 10 1057 916 101 UA
## 8 2013 7 10 1740 1730 10 2251 2110 101 DL
## 9 2013 7 22 1859 1849 10 2324 2145 99 UA
## 10 2013 5 19 1610 1600 10 2053 1915 98 DL
## # … with 2,849 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
finds all the unique rows in a dataset
# This would remove any duplicate rows if there were any
flights %>%
distinct()
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# This finds all unique origin and destination pairs.
flights %>%
distinct(origin, dest)
## # A tibble: 224 × 2
## origin dest
## <chr> <chr>
## 1 EWR IAH
## 2 LGA IAH
## 3 JFK MIA
## 4 JFK BQN
## 5 LGA ATL
## 6 EWR ORD
## 7 EWR FLL
## 8 LGA IAD
## 9 JFK MCO
## 10 LGA ORD
## # … with 214 more rows
Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping distinct() for count() and then filtering as needed.
# Had an arrival delay of two or more hours
flights %>%
filter(arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 811 630 101 1047 830 137 MQ
## 2 2013 1 1 848 1835 853 1001 1950 851 MQ
## 3 2013 1 1 957 733 144 1056 853 123 UA
## 4 2013 1 1 1114 900 134 1447 1222 145 UA
## 5 2013 1 1 1505 1310 115 1638 1431 127 EV
## 6 2013 1 1 1525 1340 105 1831 1626 125 B6
## 7 2013 1 1 1549 1445 64 1912 1656 136 EV
## 8 2013 1 1 1558 1359 119 1718 1515 123 EV
## 9 2013 1 1 1732 1630 62 2028 1825 123 EV
## 10 2013 1 1 1803 1620 103 2008 1750 138 MQ
## # … with 10,190 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Flew to Houston (IAH or HOU)
flights %>%
filter(dest %in% c( 'IAH', 'HOU'))
## # A tibble: 9,313 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 623 627 -4 933 932 1 UA
## 4 2013 1 1 728 732 -4 1041 1038 3 UA
## 5 2013 1 1 739 739 0 1104 1038 26 UA
## 6 2013 1 1 908 908 0 1228 1219 9 UA
## 7 2013 1 1 1028 1026 2 1350 1339 11 UA
## 8 2013 1 1 1044 1045 -1 1352 1351 1 UA
## 9 2013 1 1 1114 900 134 1447 1222 145 UA
## 10 2013 1 1 1205 1200 5 1503 1505 -2 UA
## # … with 9,303 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Were operated by United, American, or Delta
flights %>%
filter(carrier %in% c("DL",'AA','UA'))
## # A tibble: 139,504 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 558 600 -2 753 745 8 AA
## 7 2013 1 1 558 600 -2 924 917 7 UA
## 8 2013 1 1 558 600 -2 923 937 -14 UA
## 9 2013 1 1 559 600 -1 941 910 31 AA
## 10 2013 1 1 559 600 -1 854 902 -8 UA
## # … with 139,494 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Departed in summer (July, August, and September)
flights %>%
filter(month %in% c(7,8,9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## 6 2013 7 1 46 2051 235 304 2358 186 B6
## 7 2013 7 1 48 2001 287 308 2305 243 VX
## 8 2013 7 1 58 2155 183 335 43 172 B6
## 9 2013 7 1 100 2146 194 327 30 177 B6
## 10 2013 7 1 100 2245 135 337 135 122 B6
## # … with 86,316 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Arrived more than two hours late, but didn’t leave late
flights %>%
filter(arr_delay > 120 & dep_delay <= 0)
## # A tibble: 29 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 27 1419 1420 -1 1754 1550 124 MQ
## 2 2013 10 7 1350 1350 0 1736 1526 130 EV
## 3 2013 10 7 1357 1359 -2 1858 1654 124 AA
## 4 2013 10 16 657 700 -3 1258 1056 122 B6
## 5 2013 11 1 658 700 -2 1329 1015 194 VX
## 6 2013 3 18 1844 1847 -3 39 2219 140 UA
## 7 2013 4 17 1635 1640 -5 2049 1845 124 MQ
## 8 2013 4 18 558 600 -2 1149 850 179 AA
## 9 2013 4 18 655 700 -5 1213 950 143 AA
## 10 2013 5 22 1827 1830 -3 2217 2010 127 MQ
## # … with 19 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>%
filter(dep_delay >= 60 & dep_delay - arr_delay > 30)
## # A tibble: 1,844 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 2205 1720 285 46 2040 246 AA
## 2 2013 1 1 2326 2130 116 131 18 73 B6
## 3 2013 1 3 1503 1221 162 1803 1555 128 UA
## 4 2013 1 3 1839 1700 99 2056 1950 66 AA
## 5 2013 1 3 1850 1745 65 2148 2120 28 AA
## 6 2013 1 3 1941 1759 102 2246 2139 67 UA
## 7 2013 1 3 1950 1845 65 2228 2227 1 B6
## 8 2013 1 3 2015 1915 60 2135 2111 24 9E
## 9 2013 1 3 2257 2000 177 45 2224 141 9E
## 10 2013 1 4 1917 1700 137 2135 1950 105 AA
## # … with 1,834 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
## 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
## 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
## 6 2013 4 10 1100 1900 960 1342 2211 931 DL
## 7 2013 3 17 2321 810 911 135 1020 915 DL
## 8 2013 6 27 959 1900 899 1236 2226 850 DL
## 9 2013 7 22 2257 759 898 121 1026 895 DL
## 10 2013 12 5 756 1700 896 1058 2020 878 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
arrange(flights,sched_dep_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## 2 2013 1 2 458 500 -2 703 650 13 US
## 3 2013 1 3 458 500 -2 650 650 0 US
## 4 2013 1 4 456 500 -4 631 650 -19 US
## 5 2013 1 5 458 500 -2 640 650 -10 US
## 6 2013 1 6 458 500 -2 718 650 28 US
## 7 2013 1 7 454 500 -6 637 648 -11 US
## 8 2013 1 8 454 500 -6 625 648 -23 US
## 9 2013 1 9 457 500 -3 647 648 -1 US
## 10 2013 1 10 450 500 -10 634 648 -14 US
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
arrange(desc( distance / air_time))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 5 25 1709 1700 9 1923 1937 -14 DL
## 2 2013 7 2 1558 1513 45 1745 1719 26 EV
## 3 2013 5 13 2040 2025 15 2225 2226 -1 EV
## 4 2013 3 23 1914 1910 4 2045 2043 2 EV
## 5 2013 1 12 1559 1600 -1 1849 1917 -28 DL
## 6 2013 11 17 650 655 -5 1059 1150 -51 DL
## 7 2013 2 21 2355 2358 -3 412 438 -26 B6
## 8 2013 11 17 759 800 -1 1212 1255 -43 AA
## 9 2013 11 16 2003 1925 38 17 36 -19 DL
## 10 2013 11 16 2349 2359 -10 402 440 -38 B6
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
filter(year == 2013) %>%
distinct(month, day)
## # A tibble: 365 × 2
## month day
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
## # … with 355 more rows
flight2013 <- flights %>%
filter(year == 2013) %>%
count(month, day)
flight2013 %>% arrange(n)
## # A tibble: 365 × 3
## month day n
## <int> <int> <int>
## 1 11 28 634
## 2 11 29 661
## 3 1 19 674
## 4 10 12 676
## 5 1 26 680
## 6 8 31 680
## 7 2 2 682
## 8 9 28 682
## 9 2 9 684
## 10 10 19 684
## # … with 355 more rows
flights %>%
arrange(desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 857 900 -3 1516 1530 -14 HA
## 2 2013 1 2 909 900 9 1525 1530 -5 HA
## 3 2013 1 3 914 900 14 1504 1530 -26 HA
## 4 2013 1 4 900 900 0 1516 1530 -14 HA
## 5 2013 1 5 858 900 -2 1519 1530 -11 HA
## 6 2013 1 6 1019 900 79 1558 1530 28 HA
## 7 2013 1 7 1042 900 102 1620 1530 50 HA
## 8 2013 1 8 901 900 1 1504 1530 -26 HA
## 9 2013 1 9 641 900 1301 1242 1530 1272 HA
## 10 2013 1 10 859 900 -1 1449 1530 -41 HA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
arrange(distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## 2 2013 1 3 2127 2129 -2 2222 2224 -2 EV
## 3 2013 1 4 1240 1200 40 1333 1306 27 EV
## 4 2013 1 4 1829 1615 134 1937 1721 136 EV
## 5 2013 1 4 2128 2129 -1 2218 2224 -6 EV
## 6 2013 1 5 1155 1200 -5 1241 1306 -25 EV
## 7 2013 1 6 2125 2129 -4 2224 2224 0 EV
## 8 2013 1 7 2124 2129 -5 2212 2224 -12 EV
## 9 2013 1 8 2127 2130 -3 2304 2225 39 EV
## 10 2013 1 9 2126 2129 -3 2217 2224 -7 EV
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
to add new columns that are calculated from the existing columns. By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left hand side.
flights %>%
mutate(
gain = dep_delay - arr_delay,
speen = distance / air_time * 60,
.before = 1
)
## # A tibble: 336,776 × 21
## gain speen year month day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
## <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
## 1 -9 370. 2013 1 1 517 515 2 830 819 11
## 2 -16 374. 2013 1 1 533 529 4 850 830 20
## 3 -31 408. 2013 1 1 542 540 2 923 850 33
## 4 17 517. 2013 1 1 544 545 -1 1004 1022 -18
## 5 19 394. 2013 1 1 554 600 -6 812 837 -25
## 6 -16 288. 2013 1 1 554 558 -4 740 728 12
## 7 -24 404. 2013 1 1 555 600 -5 913 854 19
## 8 11 259. 2013 1 1 557 600 -3 709 723 -14
## 9 5 405. 2013 1 1 557 600 -3 838 846 -8
## 10 -10 319. 2013 1 1 558 600 -2 753 745 8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time,
## # ⁶arr_delay
flights %>%
mutate(
gain = dep_delay - arr_delay,
speen = distance / air_time * 60,
.after = day
)
## # A tibble: 336,776 × 21
## year month day gain speen dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
## <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int> <dbl>
## 1 2013 1 1 -9 370. 517 515 2 830 819 11
## 2 2013 1 1 -16 374. 533 529 4 850 830 20
## 3 2013 1 1 -31 408. 542 540 2 923 850 33
## 4 2013 1 1 17 517. 544 545 -1 1004 1022 -18
## 5 2013 1 1 19 394. 554 600 -6 812 837 -25
## 6 2013 1 1 -16 288. 554 558 -4 740 728 12
## 7 2013 1 1 -24 404. 555 600 -5 913 854 19
## 8 2013 1 1 11 259. 557 600 -3 709 723 -14
## 9 2013 1 1 5 405. 557 600 -3 838 846 -8
## 10 2013 1 1 -10 319. 558 600 -2 753 745 8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time,
## # ⁶arr_delay
Alternatively, you can control which variables are kept with the
.keep argument. A particularly useful argument is
"used" which allows you to see the inputs and outputs from
your calculations:
flights %>%
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = "used"
)
## # A tibble: 336,776 × 6
## dep_delay arr_delay air_time gain hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 11 227 -9 3.78 -2.38
## 2 4 20 227 -16 3.78 -4.23
## 3 2 33 160 -31 2.67 -11.6
## 4 -1 -18 183 17 3.05 5.57
## 5 -6 -25 116 19 1.93 9.83
## 6 -4 12 150 -16 2.5 -6.4
## 7 -5 19 158 -24 2.63 -9.11
## 8 -3 -14 53 11 0.883 12.5
## 9 -3 -8 140 5 2.33 2.14
## 10 -2 8 138 -10 2.3 -4.35
## # … with 336,766 more rows
#select columns by names
flights %>%
select(year, month, day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
# Select all columns between year and day (inclusive)
flights %>%
select(year:day)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
# Select all columns except those from year to day (inclusive)
flights %>%
select(!year:day)
## # A tibble: 336,776 × 16
## dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
## <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
## # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
## # ⁵sched_arr_time, ⁶arr_delay
# Select all columns that are characters
flights %>%
select(where(is.character))
## # A tibble: 336,776 × 4
## carrier tailnum origin dest
## <chr> <chr> <chr> <chr>
## 1 UA N14228 EWR IAH
## 2 UA N24211 LGA IAH
## 3 AA N619AA JFK MIA
## 4 B6 N804JB JFK BQN
## 5 DL N668DN LGA ATL
## 6 UA N39463 EWR ORD
## 7 B6 N516JB EWR FLL
## 8 EV N829AS LGA IAD
## 9 B6 N593JB JFK MCO
## 10 AA N3ALAA LGA ORD
## # … with 336,766 more rows
# Select all columns that are interger
flights %>%
select(where(is.integer))
## # A tibble: 336,776 × 8
## year month day dep_time sched_dep_time arr_time sched_arr_time flight
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2013 1 1 517 515 830 819 1545
## 2 2013 1 1 533 529 850 830 1714
## 3 2013 1 1 542 540 923 850 1141
## 4 2013 1 1 544 545 1004 1022 725
## 5 2013 1 1 554 600 812 837 461
## 6 2013 1 1 554 558 740 728 1696
## 7 2013 1 1 555 600 913 854 507
## 8 2013 1 1 557 600 709 723 5708
## 9 2013 1 1 557 600 838 846 79
## 10 2013 1 1 558 600 753 745 301
## # … with 336,766 more rows
you can rename variables, new name is appears on the left side.
flights %>%
select(tail_num = tailnum)
## # A tibble: 336,776 × 1
## tail_num
## <chr>
## 1 N14228
## 2 N24211
## 3 N619AA
## 4 N804JB
## 5 N668DN
## 6 N39463
## 7 N516JB
## 8 N829AS
## 9 N593JB
## 10 N3ALAA
## # … with 336,766 more rows
to keep all the existing variables and just want to rename a few
flights %>%
rename(tail_num = tailnum)
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
By default relocate() moves variables to the front:
flights %>%
relocate(carrier, flight)
## # A tibble: 336,776 × 19
## carrier flight year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴
## <chr> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 UA 1545 2013 1 1 517 515 2 830 819
## 2 UA 1714 2013 1 1 533 529 4 850 830
## 3 AA 1141 2013 1 1 542 540 2 923 850
## 4 B6 725 2013 1 1 544 545 -1 1004 1022
## 5 DL 461 2013 1 1 554 600 -6 812 837
## 6 UA 1696 2013 1 1 554 558 -4 740 728
## 7 B6 507 2013 1 1 555 600 -5 913 854
## 8 EV 5708 2013 1 1 557 600 -3 709 723
## 9 B6 79 2013 1 1 557 600 -3 838 846
## 10 AA 301 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, 9 more variables: arr_delay <dbl>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
# .before
flights %>%
relocate(year:day, .before = carrier)
## # A tibble: 336,776 × 19
## dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ year month day carrier
## <int> <int> <dbl> <int> <int> <dbl> <int> <int> <int> <chr>
## 1 517 515 2 830 819 11 2013 1 1 UA
## 2 533 529 4 850 830 20 2013 1 1 UA
## 3 542 540 2 923 850 33 2013 1 1 AA
## 4 544 545 -1 1004 1022 -18 2013 1 1 B6
## 5 554 600 -6 812 837 -25 2013 1 1 DL
## 6 554 558 -4 740 728 12 2013 1 1 UA
## 7 555 600 -5 913 854 19 2013 1 1 B6
## 8 557 600 -3 709 723 -14 2013 1 1 EV
## 9 557 600 -3 838 846 -8 2013 1 1 B6
## 10 558 600 -2 753 745 8 2013 1 1 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# .after
flights %>%
relocate(year:day, .after = flight)
## # A tibble: 336,776 × 19
## dep_time sched_d…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight year month
## <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <int> <int>
## 1 517 515 2 830 819 11 UA 1545 2013 1
## 2 533 529 4 850 830 20 UA 1714 2013 1
## 3 542 540 2 923 850 33 AA 1141 2013 1
## 4 544 545 -1 1004 1022 -18 B6 725 2013 1
## 5 554 600 -6 812 837 -25 DL 461 2013 1
## 6 554 558 -4 740 728 12 UA 1696 2013 1
## 7 555 600 -5 913 854 19 B6 507 2013 1
## 8 557 600 -3 709 723 -14 EV 5708 2013 1
## 9 557 600 -3 838 846 -8 B6 79 2013 1
## 10 558 600 -2 753 745 8 AA 301 2013 1
## # … with 336,766 more rows, 9 more variables: day <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
flights %>%
select(c(dep_time, dep_delay, arr_time, arr_delay))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
flights %>%
select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
# it will ignore duplicate variables
flights %>%
select(dep_time, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
#It returns all the variables you ask for, for example ones stored in a vector.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>%
select(any_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
flights %>%
select(one_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
# The default helper functions are insensitive to case.
select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # … with 336,766 more rows
#This can be changes by setting `ignore.case=FALSE`.
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
group_by() doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.
group_by(flights, month)
## # A tibble: 336,776 × 19
## # Groups: month [12]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights %>%
group_by(month) %>%
summarise(
delay = mean(dep_delay)
)
## # A tibble: 12 × 2
## month delay
## <int> <dbl>
## 1 1 NA
## 2 2 NA
## 3 3 NA
## 4 4 NA
## 5 5 NA
## 6 6 NA
## 7 7 NA
## 8 8 NA
## 9 9 NA
## 10 10 NA
## 11 11 NA
## 12 12 NA
flights %>%
group_by(month) %>%
summarise(
delay = mean(dep_delay, na.rm = TRUE)
)
## # A tibble: 12 × 2
## month delay
## <int> <dbl>
## 1 1 10.0
## 2 2 10.8
## 3 3 13.2
## 4 4 13.9
## 5 5 13.0
## 6 6 20.8
## 7 7 21.7
## 8 8 12.6
## 9 9 6.72
## 10 10 6.24
## 11 11 5.44
## 12 12 16.6
one very useful summary is n(), which returns the number of rows in each group:
flights %>%
group_by(month) %>%
summarise(
delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
## # A tibble: 12 × 3
## month delay n
## <int> <dbl> <int>
## 1 1 10.0 27004
## 2 2 10.8 24951
## 3 3 13.2 28834
## 4 4 13.9 28330
## 5 5 13.0 28796
## 6 6 20.8 28243
## 7 7 21.7 29425
## 8 8 12.6 29327
## 9 9 6.72 27574
## 10 10 6.24 28889
## 11 11 5.44 27268
## 12 12 16.6 28135
There are five handy functions that allow you pick off specific rows within each group:
df |> slice_head(n = 1) takes the first row from each group.
df |> slice_tail(n = 1) takes the last row in each group.
df |> slice_min(x, n = 1) takes the row with the smallest value of x.
df |> slice_max(x, n = 1) takes the row with the largest value of x.
df |> slice_sample(n = 1) takes one random row.
You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group.
flights %>%
group_by(dest) %>%
slice_max(arr_delay, n = 1) %>%
relocate(dest, arr_delay)
## # A tibble: 108 × 19
## # Groups: dest [105]
## dest arr_delay year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴
## <chr> <dbl> <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 ABQ 153 2013 7 22 2145 2007 98 132 2259
## 2 ACK 221 2013 7 23 1139 800 219 1250 909
## 3 ALB 328 2013 1 25 123 2000 323 229 2101
## 4 ANC 39 2013 8 17 1740 1625 75 2042 2003
## 5 ATL 895 2013 7 22 2257 759 898 121 1026
## 6 AUS 349 2013 7 10 2056 1505 351 2347 1758
## 7 AVL 228 2013 8 13 1156 832 204 1417 1029
## 8 BDL 266 2013 2 21 1728 1316 252 1839 1413
## 9 BGR 238 2013 12 1 1504 1056 248 1628 1230
## 10 BHM 291 2013 4 10 25 1900 325 136 2045
## # … with 98 more rows, 9 more variables: carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
flights %>%
group_by(dest) %>%
summarise(max_arr_delay = max(arr_delay, na.rm = "TRUE"))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `max_arr_delay = max(arr_delay, na.rm = "TRUE")`.
## ℹ In group 52: `dest = "LGA"`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## # A tibble: 105 × 2
## dest max_arr_delay
## <chr> <dbl>
## 1 ABQ 153
## 2 ACK 221
## 3 ALB 328
## 4 ANC 39
## 5 ATL 895
## 6 AUS 349
## 7 AVL 228
## 8 BDL 266
## 9 BGR 238
## 10 BHM 291
## # … with 95 more rows
daily <- flights %>%
group_by(year, month, day)
daily
## # A tibble: 336,776 × 19
## # Groups: year, month, day [365]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
daily_flight <- daily %>%
summarise(
n = n(),
.group = "drop_last"
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
daily_flight
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day n .group
## <int> <int> <int> <int> <chr>
## 1 2013 1 1 842 drop_last
## 2 2013 1 2 943 drop_last
## 3 2013 1 3 914 drop_last
## 4 2013 1 4 915 drop_last
## 5 2013 1 5 720 drop_last
## 6 2013 1 6 832 drop_last
## 7 2013 1 7 933 drop_last
## 8 2013 1 8 899 drop_last
## 9 2013 1 9 902 drop_last
## 10 2013 1 10 932 drop_last
## # … with 355 more rows
to remove grouping outside of summarize(). You can do this with ungroup().
daily %>%
ungroup() %>%
summarise(
delay = mean(dep_delay, na.rm = "TRUE"),
n = n()
)
## # A tibble: 1 × 2
## delay n
## <dbl> <int>
## 1 NA 336776
flights %>%
filter(arr_delay > 0) %>%
group_by(carrier) %>%
summarise(
mean_arr_delay = mean(arr_delay, na.rm = TRUE)
) %>%
arrange(desc(mean_arr_delay))
## # A tibble: 16 × 2
## carrier mean_arr_delay
## <chr> <dbl>
## 1 OO 60.6
## 2 YV 51.1
## 3 9E 49.3
## 4 EV 48.3
## 5 F9 47.6
## 6 VX 43.8
## 7 FL 41.1
## 8 WN 40.7
## 9 B6 40.0
## 10 AA 38.3
## 11 MQ 37.9
## 12 DL 37.7
## 13 UA 36.7
## 14 HA 35.0
## 15 AS 34.4
## 16 US 29.0
flights %>%
summarise(n_distinct(carrier),
n_distinct(origin),
n_distinct(dest))
## # A tibble: 1 × 3
## `n_distinct(carrier)` `n_distinct(origin)` `n_distinct(dest)`
## <int> <int> <int>
## 1 16 3 105
# There are 16 carriers, 3 origin airports, and 105 destination airports in this dataset.
# For many destination airports, there are only one or two carriers that fly there.
# so it is difficult to tell how much of the delay is due to the carrier, and how much is due to the airport (busy destination airports can force planes to loiter longer before there is a free landing slot).
# We also can't necessarily tell how much of the delay is due to the route, versus the airport itself.
# This makes attributing the cause of in flight delays difficult.
flights %>%
filter(arr_delay > 0) %>%
select(dest, arr_delay) %>%
group_by(dest) %>%
slice_tail(n=1)
## # A tibble: 103 × 2
## # Groups: dest [103]
## dest arr_delay
## <chr> <dbl>
## 1 ABQ 45
## 2 ACK 3
## 3 ALB 9
## 4 ANC 39
## 5 ATL 136
## 6 AUS 12
## 7 AVL 4
## 8 BDL 77
## 9 BGR 8
## 10 BHM 77
## # … with 93 more rows
# 3.1 average delay VS dep_hour (if delay > 0)
delay_cover_dephour <- flights %>%
filter(dep_delay > 0 | arr_delay > 0) %>%
group_by(hour) %>%
summarise(
avg_dep_delay = mean(dep_delay, na.rm = TRUE),
sd_dep_delay = sd(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE),
sd_arr_delay = sd(arr_delay, na.rm = TRUE)
)
delay_cover_dephour
## # A tibble: 19 × 5
## hour avg_dep_delay sd_dep_delay avg_arr_delay sd_arr_delay
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 8.14 22.7 10.2 26.4
## 2 6 12.0 35.9 15.5 37.3
## 3 7 13.3 36.3 16.0 38.2
## 4 8 17.1 42.5 21.1 42.7
## 5 9 17.2 43.1 20.0 43.0
## 6 10 20.6 46.4 22.8 47.0
## 7 11 22.2 43.8 23.3 45.3
## 8 12 23.0 43.4 23.9 44.9
## 9 13 25.4 44.7 25.8 47.8
## 10 14 29.0 50.1 30.6 53.7
## 11 15 30.6 50.3 31.1 54.2
## 12 16 34.6 55.5 34.1 58.8
## 13 17 36.7 58.1 37.3 60.8
## 14 18 37.6 57.3 36.9 58.5
## 15 19 43.1 60.3 39.2 60.7
## 16 20 42.2 54.2 38.3 54.3
## 17 21 42.5 51.2 39.5 50.5
## 18 22 38.9 46.1 39.1 44.0
## 19 23 26.8 40.6 28.3 38.3
ggplot(delay_cover_dephour)+
geom_point(
aes(x = hour, y = avg_dep_delay, color = "avg_dep_delay", size = sd_dep_delay ))+
geom_point(
aes(x = hour, y = avg_arr_delay, color = "avg_arr_delay", size = sd_arr_delay))+
labs(title = "Average delay VS schedual departure time ",
caption = "Only include data with delay time > 0 ")
# 3.2 all data scatter by dep_hour
ggplot(flights, aes(x = hour, y = arr_delay)) +
geom_count() +
labs(title = "Arrive Delay VS schedual departure time ",
caption = "Include all data except missing values ")
## Warning: Removed 9430 rows containing non-finite values (`stat_sum()`).
# 3.3 average delay VS sched_arr_time (all data)
delay_cover_arrhour <- flights %>%
mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 )) %>%
group_by(sched_arr_hour) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay,na.rm = TRUE))
delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
ggplot(delay_cover_arrhour)+
geom_point(aes(x = sched_arr_hour.num, y = avg_dep_delay, color = "avg_dep_delay"))+
geom_point(aes(x = sched_arr_hour.num, y = avg_arr_delay,color = "avg_arr_delay"))+
labs(title = "Average delay VS schedual arrival time ")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
# 3.4 average delay VS sched_arr_time (if delay > 0)
delay_cover_arrhour <- flights %>%
filter(dep_delay > 0 | arr_delay > 0) %>%
mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 )) %>%
group_by(sched_arr_hour) %>%
summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay,na.rm = TRUE),
sd_dep_delay = sd(dep_delay, na.rm = TRUE),
sd_arr_delay = sd(arr_delay,na.rm = TRUE))
delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
delay_cover_arrhour
## # A tibble: 24 × 6
## sched_arr_hour avg_dep_delay avg_arr_delay sd_dep_delay sd_arr_delay sched_…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "" 39.0 35.0 52.1 53.7 NA
## 2 "1" 42.8 35.2 47.5 49.9 1
## 3 "10" 15.5 19.4 41.0 41.5 10
## 4 "11" 17.6 20.3 42.7 43.0 11
## 5 "12" 18.4 20.5 41.4 41.9 12
## 6 "13" 20.7 22.9 42.5 43.9 13
## 7 "14" 23.9 25.1 44.3 46.5 14
## 8 "15" 26.9 27.4 50.8 52.9 15
## 9 "16" 27.6 29.4 48.8 51.7 16
## 10 "17" 31.0 32.4 50.8 54.7 17
## # … with 14 more rows, and abbreviated variable name ¹sched_arr_hour.num
ggplot(delay_cover_arrhour)+
geom_point(aes(
x = sched_arr_hour.num, y = avg_dep_delay,
color = "avg_dep_delay", size = sd_dep_delay))+
geom_point(aes(
x = sched_arr_hour.num, y = avg_arr_delay,
color = "avg_arr_delay", size = sd_arr_delay))+
labs(title = "Average delay VS schedual arrival time ",
caption= "Only include data with delay time > 0")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 rows containing missing values (`geom_point()`).
# 3.5 all data scatter by arr_hour
delay_cover_arrhour <- flights %>%
mutate(sched_arr_hour = str_sub(sched_arr_time,1,nchar(sched_arr_time)-2 ))
delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
ggplot(delay_cover_arrhour, aes(x = sched_arr_hour.num, y = arr_delay)) +
geom_point() +
labs(title = "Arrive Delay VS schedual arrive hour ",
caption = "Include all data except missing values ")
## Warning: Removed 12578 rows containing missing values (`geom_point()`).
#delay_cover_arrhour$sched_arr_hour.num <- as.numeric(delay_cover_arrhour$sched_arr_hour)
What happens if you supply a negative n to slice_min() and friends?
Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
Suppose we have the following tiny data frame:
3.2 Comments
Use comments to explain the why of your code, not the how or the what.