Exercise 1

2.5 Exercises

1. Why does this code not work?

This code does not work because the variable is created as “my_variable” but then referenced as “my_varıable”, which uses a dotless “ı” in place of “i”. R treats these as two different names, so it cannot find an object called “my_varıable”.
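One way to confirm this kind of mismatch, sketched here in base R, is to compare the two names as strings. The strings below are typed out for illustration, with the dotless “ı” written as the escape \u0131.

```r
# The name as assigned, and the name as mistyped with a dotless "ı" (U+0131)
assigned <- "my_variable"
mistyped <- "my_var\u0131able"

# The names look alike but are not the same string
identical(assigned, mistyped)
## [1] FALSE

# Position of the character whose code point differs
differing_position <- which(utf8ToInt(assigned) != utf8ToInt(mistyped))
differing_position
## [1] 7
```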

2. Tweak each of the following R commands so that they run correctly:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

3. Press Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?

Pressing Option + Shift + K opens the Keyboard Shortcut Quick Reference. I can reach the same reference from the menus through the “Help” tab at the top of my screen, then “Keyboard Shortcuts Help”.

4. Let’s revisit an exercise from the Section 1.6. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

my_bar_plot <- ggplot(mpg, aes(x = class)) + geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)

## Saving 7 x 5 in image

The “my_bar_plot” plot is saved as “mpg-plot.png” because the plot argument of ggsave() on the third line names it explicitly. Without that argument, ggsave() defaults to the last plot displayed.

3.2.5 Exercises

library(nycflights13)

1. In a single pipeline for each condition, find all flights that meet the condition:

flights |> filter(arr_delay >= 120)

## # A tibble: 10,200 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # ℹ 10,190 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> filter(dest == "IAH" | dest == "HOU")

## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ℹ 9,303 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> filter(carrier == "UA" | carrier == "AA" | carrier == "DL")

## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> filter(month == 7 | month == 8 | month == 9)

## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> filter(dep_delay < 0 & arr_delay >= 120)

## # A tibble: 26 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1357           1359        -2     1858           1654
##  3  2013    10    16      657            700        -3     1258           1056
##  4  2013    11     1      658            700        -2     1329           1015
##  5  2013     3    18     1844           1847        -3       39           2219
##  6  2013     4    17     1635           1640        -5     2049           1845
##  7  2013     4    18      558            600        -2     1149            850
##  8  2013     4    18      655            700        -5     1213            950
##  9  2013     5    22     1827           1830        -3     2217           2010
## 10  2013     6     5     1604           1615       -11     2041           1840
## # ℹ 16 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> filter(dep_delay >= 60 & arr_delay <= dep_delay - 30)

## # A tibble: 2,074 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1716           1545        91     2140           2039
##  2  2013     1     1     2205           1720       285       46           2040
##  3  2013     1     1     2326           2130       116      131             18
##  4  2013     1     3     1503           1221       162     1803           1555
##  5  2013     1     3     1821           1530       171     2131           1910
##  6  2013     1     3     1839           1700        99     2056           1950
##  7  2013     1     3     1850           1745        65     2148           2120
##  8  2013     1     3     1923           1815        68     2036           1958
##  9  2013     1     3     1941           1759       102     2246           2139
## 10  2013     1     3     1950           1845        65     2228           2227
## # ℹ 2,064 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

2. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

flights |> arrange(desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> arrange(dep_time)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    13        1           2249        72      108           2357
##  2  2013     1    31        1           2100       181      124           2225
##  3  2013    11    13        1           2359         2      442            440
##  4  2013    12    16        1           2359         2      447            437
##  5  2013    12    20        1           2359         2      430            440
##  6  2013    12    26        1           2359         2      437            440
##  7  2013    12    30        1           2359         2      441            437
##  8  2013     2    11        1           2100       181      111           2225
##  9  2013     2    24        1           2245        76      121           2354
## 10  2013     3     8        1           2355         6      431            440
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

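# Note: arrange() sorts in ascending order, so this lists the slowest flights first;
# arrange(desc(distance / air_time)) would put the fastest flights at the top.
flights |> arrange(distance / air_time)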

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    28     1917           1825        52     2118           1935
##  2  2013     6    29      755            800        -5     1035            909
##  3  2013     8    28      932            940        -8     1116           1051
##  4  2013     1    30     1037            955        42     1221           1100
##  5  2013    11    27      556            600        -4      727            658
##  6  2013     5    21      558            600        -2      721            657
##  7  2013    12     9     1540           1535         5     1720           1656
##  8  2013     6    10     1356           1300        56     1646           1414
##  9  2013     7    28     1322           1325        -3     1612           1432
## 10  2013     4    11     1349           1345         4     1542           1453
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
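Because arrange() sorts in ascending order by default, a quick check on made-up data can help confirm which end of the sorted table holds the fastest flights. The tibble below is hypothetical and only illustrates the ordering.

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
toy_flights <- tibble(
  flight = c("A", "B", "C"),
  distance = c(500, 1000, 300),
  air_time = c(100, 100, 100)
)

# Ascending order puts the slowest flight first
slowest_first <- toy_flights |> arrange(distance / air_time) |> pull(flight)
slowest_first
## [1] "C" "A" "B"

# desc() puts the fastest flight first
fastest_first <- toy_flights |> arrange(desc(distance / air_time)) |> pull(flight)
fastest_first
## [1] "B" "A" "C"
```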

4. Was there a flight on every day of 2013?

flights |> distinct(month,day)

## # A tibble: 365 × 2
##    month   day
##    <int> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10
## # ℹ 355 more rows

Yes. The result has 365 rows, one per distinct month and day pair, and 2013 had 365 days, so there was at least one flight on every day of the year.
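As a cross-check on the number of calendar days, a small base R sketch can count the dates in 2013 directly:

```r
# Every calendar date from 1 January to 31 December 2013
days_2013 <- seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")

length(days_2013)
## [1] 365
```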

5. Which flights traveled the farthest distance? Which traveled the least distance?

# Arrange flights by farthest distance
flights |> arrange(desc(distance))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      857            900        -3     1516           1530
##  2  2013     1     2      909            900         9     1525           1530
##  3  2013     1     3      914            900        14     1504           1530
##  4  2013     1     4      900            900         0     1516           1530
##  5  2013     1     5      858            900        -2     1519           1530
##  6  2013     1     6     1019            900        79     1558           1530
##  7  2013     1     7     1042            900       102     1620           1530
##  8  2013     1     8      901            900         1     1504           1530
##  9  2013     1     9      641            900      1301     1242           1530
## 10  2013     1    10      859            900        -1     1449           1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# Arrange flights by least distance
flights |> arrange(distance)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     3     2127           2129        -2     2222           2224
##  3  2013     1     4     1240           1200        40     1333           1306
##  4  2013     1     4     1829           1615       134     1937           1721
##  5  2013     1     4     2128           2129        -1     2218           2224
##  6  2013     1     5     1155           1200        -5     1241           1306
##  7  2013     1     6     2125           2129        -4     2224           2224
##  8  2013     1     7     2124           2129        -5     2212           2224
##  9  2013     1     8     2127           2130        -3     2304           2225
## 10  2013     1     9     2126           2129        -3     2217           2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The flights that traveled the farthest distance were those from JFK to HNL. The flight that traveled the least distance was one from EWR to LGA; its dep_time is NA in the output above, which suggests it never actually departed.

6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

The order does not change the final result, but it does change how much work is done. Filtering first trims the data set to the relevant rows, so arrange() only has to sort that smaller set; arranging first sorts every row, including rows that filter() then discards.
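A minimal sketch on made-up data (the tibble and its values are hypothetical) can be used to compare the two orders and see which rows come out of each:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(id = 1:6, delay = c(30, -5, 120, 45, NA, 10))

# Filter first: arrange() only sorts the rows that survive the filter
filter_first <- toy |> filter(delay > 0) |> arrange(delay)

# Arrange first: every row is sorted, then some are discarded
arrange_first <- toy |> arrange(delay) |> filter(delay > 0)

# Both pipelines keep the same rows in the same order
identical(filter_first$id, arrange_first$id)
## [1] TRUE
```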

3.3.5 Exercises

1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

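I would expect that sched_dep_time + dep_delay = dep_time, once the values are treated as clock times. Because dep_time and sched_dep_time are stored as HHMM numbers (e.g., 517 for 5:17) and dep_delay is in minutes, plain addition only works when the delay does not cross an hour boundary.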
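The relationship can be checked with clock arithmetic. The two helper functions below are hypothetical (they are not part of dplyr or nycflights13) and assume times are stored as HHMM numbers; the example values are taken from rows shown earlier in this document.

```r
# Convert a clock time stored as HHMM (e.g. 517 for 5:17) to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Convert minutes since midnight back to HHMM, wrapping past midnight
minutes_to_hhmm <- function(minutes) {
  minutes <- minutes %% 1440
  (minutes %/% 60) * 100 + minutes %% 60
}

# Scheduled 5:29 with a 4-minute delay: departs at 5:33
minutes_to_hhmm(hhmm_to_minutes(529) + 4)
## [1] 533

# Scheduled 18:35 with an 853-minute delay: departs at 8:48 the next day
minutes_to_hhmm(hhmm_to_minutes(1835) + 853)
## [1] 848
```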

2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights |> select(dep_time, dep_delay, arr_time, arr_delay)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows
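A few other ways to reach the same four columns are sketched below on a hypothetical one-row stand-in that borrows some of the flights column names. On the full flights data the same select() calls should behave the same way, though that is not run here.

```r
library(dplyr)

# Hypothetical one-row stand-in with some of the flights column names
toy <- tibble(
  dep_time = 517, sched_dep_time = 515, dep_delay = 2,
  arr_time = 830, sched_arr_time = 819, arr_delay = 11
)
wanted <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

# By prefix: columns starting with "dep" or "arr"
by_prefix <- toy |> select(starts_with("dep"), starts_with("arr"))

# By character vector: all_of() requires every name to exist
by_vector <- toy |> select(all_of(wanted))

# By regular expression: dep/arr followed by time/delay
by_regex <- toy |> select(matches("^(dep|arr)_(time|delay)$"))

names(by_prefix)
```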

3. What happens if you specify the name of the same variable multiple times in a select() call?

flights |> select(dep_time, dep_time)

## # A tibble: 336,776 × 1
##    dep_time
##       <int>
##  1      517
##  2      533
##  3      542
##  4      544
##  5      554
##  6      554
##  7      555
##  8      557
##  9      557
## 10      558
## # ℹ 336,766 more rows

select() includes the variable’s column only once; the repeated name is ignored without an error.

4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

The any_of() function selects every column whose name appears in a character vector and silently skips names that are not in the data. After storing “year”, “month”, “day”, “dep_delay” and “arr_delay” in the vector “variables”, I can use flights |> select(any_of(variables)) to pull those columns without typing out each one every time I run an analysis.
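A small sketch of this behaviour on made-up data follows. The tibble is hypothetical, and the extra name “not_a_column” is added on purpose to show what happens when a name is missing.

```r
library(dplyr)

# Hypothetical one-row stand-in for flights
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

# "not_a_column" is deliberately absent from the data
variables <- c("year", "month", "day", "dep_delay", "arr_delay", "not_a_column")

# any_of() keeps the names that exist and skips the one that does not
selected <- toy |> select(any_of(variables))
names(selected)

# all_of() is the strict counterpart: it raises an error for the missing name
strict_result <- tryCatch(
  toy |> select(all_of(variables)),
  error = function(e) "all_of() failed on the missing column"
)
strict_result
```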

5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

The result surprises me. R is typically case-sensitive, yet select helpers such as contains() ignore case by default. I can change that default by passing ignore.case = FALSE to the helper.
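The difference can be seen on a small hypothetical tibble (the column names mimic flights, the values are made up):

```r
library(dplyr)

toy <- tibble(dep_time = 517, arr_time = 830, carrier = "UA")

# Default: matching ignores case, so "TIME" matches dep_time and arr_time
n_default <- toy |> select(contains("TIME")) |> ncol()
n_default
## [1] 2

# With ignore.case = FALSE, no column name contains upper-case "TIME"
n_case_sensitive <- toy |> select(contains("TIME", ignore.case = FALSE)) |> ncol()
n_case_sensitive
## [1] 0
```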

6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |> rename(air_time_min = air_time) |> relocate(air_time_min)

## # A tibble: 336,776 × 19
##    air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
##           <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1          227  2013     1     1      517            515         2      830
##  2          227  2013     1     1      533            529         4      850
##  3          160  2013     1     1      542            540         2      923
##  4          183  2013     1     1      544            545        -1     1004
##  5          116  2013     1     1      554            600        -6      812
##  6          150  2013     1     1      554            558        -4      740
##  7          158  2013     1     1      555            600        -5      913
##  8           53  2013     1     1      557            600        -3      709
##  9          140  2013     1     1      557            600        -3      838
## 10          138  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

7. Why doesn’t the following work, and what does the error mean?

The issue is that select(tailnum) keeps only tailnum, so arr_delay no longer exists when arrange() runs, and the error reports that the object arr_delay cannot be found. To fix this, I can use select(tailnum, arr_delay) so that both variables are kept before arrange(arr_delay), or call arrange() before select().
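Both fixes are sketched below on a hypothetical three-row tibble rather than the full flights data:

```r
library(dplyr)

# Hypothetical stand-in with made-up tail numbers and delays
toy <- tibble(tailnum = c("N1", "N2", "N3"), arr_delay = c(20, -5, 11))

# Option 1: keep arr_delay in the selection, then sort by it
kept_both <- toy |> select(tailnum, arr_delay) |> arrange(arr_delay)

# Option 2: sort while arr_delay still exists, then keep only tailnum
sorted_first <- toy |> arrange(arr_delay) |> select(tailnum)

sorted_first$tailnum
## [1] "N2" "N3" "N1"
```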

3.5.7 Exercises

1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

flights |> 
  group_by(carrier) |> 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(desc(avg_delay))

## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          20.2 
##  2 EV          20.0 
##  3 YV          19.0 
##  4 FL          18.7 
##  5 WN          17.7 
##  6 9E          16.7 
##  7 B6          13.0 
##  8 VX          12.9 
##  9 OO          12.6 
## 10 UA          12.1 
## 11 MQ          10.6 
## 12 DL           9.26
## 13 AA           8.59
## 14 AS           5.80
## 15 HA           4.90
## 16 US           3.78

On average, the carrier with the worst delays is F9.

2. Find the flights that are most delayed upon departure from each destination.

flights |> 
  group_by(dest) |>
  slice_max(dep_delay, n = 1) |>
  relocate(dest, dep_delay)

## # A tibble: 105 × 19
## # Groups:   dest [105]
##    dest  dep_delay  year month   day dep_time sched_dep_time arr_time
##    <chr>     <dbl> <int> <int> <int>    <int>          <int>    <int>
##  1 ABQ         142  2013    12    14     2223           2001      133
##  2 ACK         219  2013     7    23     1139            800     1250
##  3 ALB         323  2013     1    25      123           2000      229
##  4 ANC          75  2013     8    17     1740           1625     2042
##  5 ATL         898  2013     7    22     2257            759      121
##  6 AUS         351  2013     7    10     2056           1505     2347
##  7 AVL         222  2013     6    14     1158            816     1335
##  8 BDL         252  2013     2    21     1728           1316     1839
##  9 BGR         248  2013    12     1     1504           1056     1628
## 10 BHM         325  2013     4    10       25           1900      136
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

3. How do delays vary over the course of the day? Illustrate your answer with a plot.

flights_hourly_delay <- flights |>
  filter(!is.na(dep_delay), !is.na(dep_time)) |>
  # dep_time is stored as HHMM, so integer division by 100 gives the hour
  mutate(dep_hour = dep_time %/% 100) |>
  group_by(dep_hour) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
ggplot(flights_hourly_delay, aes(x = dep_hour, y = avg_dep_delay)) +
  geom_line(color = "blue") +
  labs(
    title = "Average departure delays by time of day",
    x = "Hour of the day",
    y = "Average departure delay (minutes)"
  )

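Departure delays vary over the course of the day, with the highest averages among flights that depart before 5 am. Because dep_time is the actual departure time, many of those early-morning departures are probably flights scheduled for the previous evening that left very late.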
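Since dep_time is stored as a number in HHMM form, how it is turned into an hour affects the grouping. A small base R sketch, using a few departure times that appear earlier in this document, shows the difference between ordinary and integer division:

```r
dep_time <- c(517, 533, 559, 600)

# Ordinary division gives a different value for nearly every minute
dep_time / 100
## [1] 5.17 5.33 5.59 6.00

# Integer division gives one value per clock hour
dep_hour <- dep_time %/% 100
dep_hour
## [1] 5 5 5 6
```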

4. What happens if you supply a negative n to slice_min() and friends?

flights |> 
  slice_min(arr_delay, n = -1) |> 
  relocate(arr_delay)

## # A tibble: 336,776 × 19
##    arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
##        <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1       -86  2013     5     7     1715           1729       -14     1944
##  2       -79  2013     5    20      719            735       -16      951
##  3       -75  2013     5     2     1947           1949        -2     2209
##  4       -75  2013     5     6     1826           1830        -4     2045
##  5       -74  2013     5     4     1816           1820        -4     2017
##  6       -73  2013     5     2     1926           1929        -3     2157
##  7       -71  2013     5     6     1753           1755        -2     2004
##  8       -71  2013     5     7     2054           2055        -1     2317
##  9       -71  2013     5    13      657            700        -3      908
## 10       -70  2013     1     4     1026           1030        -4     1305
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

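Supplying a negative n to slice_min() (and similar functions) asks for all rows except the last |n| after sorting. Here the result still has all 336,776 rows, sorted by arr_delay; a likely explanation is that the rows with a missing arr_delay rank last and tie with one another, and ties are kept by default.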

5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

count() is a shortcut for counting how often each unique value in a column occurs; it is equivalent to group_by() followed by summarize(n = n()). The sort argument sorts the output in descending order of the count (n), making it easy to identify the most frequent values.
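The equivalence can be sketched on a small made-up tibble (the carrier values below are hypothetical):

```r
library(dplyr)

toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# count() with sort = TRUE
with_count <- toy |> count(carrier, sort = TRUE)

# The same result spelled out with the verbs from this chapter
by_hand <- toy |>
  group_by(carrier) |>
  summarize(n = n()) |>
  arrange(desc(n))

with_count$carrier
## [1] "UA" "AA" "DL"
```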

6a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

The group_by(y) function groups the data frame by the values in column y. Since y takes two values, “a” and “b”, the data frame is split into two groups: rows where y == “a” and rows where y == “b”. The printed rows look the same as before; only the grouping information is added.

6b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also, comment on how it’s different from the group_by() in part (a).

arrange(y) sorts the rows based on the values in y: All the rows where y == “a” will come first. The rows where y == “b” will come after.

arrange() sorts the rows based on a column’s values; it only changes the order of the rows. group_by() groups the data based on a column’s values; it does not change the order of the rows, but it prepares the data for grouped operations (like summaries).

6c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |> group_by(y): Groups the data frame df by the column y. Rows with the same value of y are grouped together. summarize(mean_x = mean(x)): For each group of y, it calculates the mean of the x values and outputs it in a new column called mean_x.

6d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

The pipeline groups the data by unique combinations of y and z and computes the mean of the x values for each combination. The output is a summarized data frame with one row per unique (y, z) combination, containing the group’s mean_x. The message says that the output is still grouped by y and that this can be overridden with the .groups argument.

6e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

The pipeline with .groups = “drop” removes all grouping from the result, returning a regular, ungrouped data frame. The pipeline without .groups = “drop” drops only the last grouping variable (z), so the result stays grouped by y, which may be useful for further grouped operations.

6f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

Pipeline 1 (summarize()): Outputs a summarized data frame with one row per group and only the relevant columns (y, z, and mean_x). Pipeline 2 (mutate()): Outputs the full original data frame with a new column (mean_x) that contains the mean of x for each group, without removing any rows.
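The contrast between the two pipelines can be sketched as follows. The exercise’s df is not reproduced in this write-up, so the data frame below is an assumption about what it contains.

```r
library(dplyr)

# Assumed stand-in for the exercise's df (values are an assumption)
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize() collapses each (y, z) group to a single row
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate() keeps every row and repeats the group mean on each one
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))

nrow(summarized)
## [1] 3
nrow(mutated)
## [1] 5
```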

4.6 Exercises

1. Restyle the following pipelines following the guidelines above.

flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  filter(n > 10)

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day delay     n
##    <int> <int> <int> <dbl> <int>
##  1  2013     1     1 17.8     20
##  2  2013     1     2  7       20
##  3  2013     1     3 18.3     19
##  4  2013     1     4 -3.2     20
##  5  2013     1     5 20.2     13
##  6  2013     1     6  9.28    18
##  7  2013     1     7 -7.74    19
##  8  2013     1     8  7.79    19
##  9  2013     1     9 18.1     19
## 10  2013     1    10  6.68    19
## # ℹ 355 more rows

flights |>
  filter(
    carrier == "UA",
    dest %in% c("IAH", "HOU"),
    sched_dep_time > 900,
    sched_arr_time < 2000
  ) |>
  group_by(flight) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = sum(is.na(arr_delay)),
    n = n()
  ) |>
  filter(n > 10)

## # A tibble: 74 × 4
##    flight delay cancelled     n
##     <int> <dbl>     <int> <int>
##  1     53 12.5          2    18
##  2    112 14.1          0    14
##  3    205 -1.71         0    14
##  4    235 -5.36         0    14
##  5    255 -9.47         0    15
##  6    268 38.6          1    15
##  7    292  6.57         0    21
##  8    318 10.7          1    20
##  9    337 20.1          2    21
## 10    370 17.5          0    11
## # ℹ 64 more rows

5.2.1 Exercises

1. For each of the sample tables, describe what each observation and each column represents.

Table 1:

Each row in this table represents the number of disease cases and the population for a specific country in a given year.

For each column:
country: The name of the country (e.g., Afghanistan, Brazil, China).
year: The year in which the data was recorded (e.g., 1999, 2000).
cases: The number of cases of a specific disease in that country and year.
population: The total population of the country in that year.

Table 2:

Each row represents a single data point (either the number of cases or the population) for a given country and year. The data is split between “cases” and “population” in the type column.

For each column:
country: The name of the country (e.g., Afghanistan, Brazil).
year: The year in which the data was recorded (e.g., 1999, 2000).
type: The type of data being recorded, either “cases” or “population.”
count: The corresponding value for the type, either the number of cases or the population.

Table 3:

Each row represents the rate of disease cases per population for a specific country and year. The rate is expressed as a fraction of cases over population.

For each column:
country: The name of the country (e.g., Afghanistan, Brazil, China).
year: The year in which the data was recorded (e.g., 1999, 2000).
rate: The ratio of the number of disease cases to the population, expressed as a fraction (e.g., “745/19987071” means 745 cases per 19,987,071 people).

2. Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations.

# For Table 2
library(dplyr)
# Extract cases and population data
cases_table <- table2 |> 
  filter(type == "cases")
population_table <- table2 |> 
  filter(type == "population")

# Join the two tables by country and year
combined_table <- cases_table |> 
  inner_join(population_table, 
             by = c("country", "year"), 
             suffix = c("_cases", "_population")
             )

# Calculate the rate per 10,000 people and store the result
result_table2 <- combined_table |> 
  mutate(rate = (count_cases / count_population) * 10000) |> 
  select(country, year, rate)

# For Table 3
library(dplyr)
library(tidyr)
# Split the 'rate' string into 'cases' and 'population'.
# Store the result under a new name so the original table3 is not overwritten.
table3_split <- table3 |> 
  separate(rate, into = c("cases", "population"), sep = "/") |> 
  mutate(cases = as.numeric(cases),
         population = as.numeric(population)
         )

# Calculate the rate per 10,000 people and store the result
result_table3 <- table3_split |> 
  mutate(rate_per_10000 = (cases / population) * 10000) |> 
  select(country, year, rate_per_10000)
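
As a cross-check on the rate calculation, here is a minimal sketch of an alternative route for table2. It swaps in pivot_wider() in place of the filter() and inner_join() steps, and compares the result against table1, which already has one column per variable. The names rate_from_table2 and rate_from_table1 are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Alternative technique: reshape table2 so that cases and population
# sit side by side, instead of filtering twice and joining.
rate_from_table2 <- table2 |>
  pivot_wider(names_from = type, values_from = count) |>
  mutate(rate = cases / population * 10000) |>
  select(country, year, rate)

# table1 is already tidy, so it serves as a reference.
rate_from_table1 <- table1 |>
  mutate(rate = cases / population * 10000) |>
  select(country, year, rate)

# Compare the two rate columns
all.equal(rate_from_table2$rate, rate_from_table1$rate)
```

If the two vectors agree, the reshaping approach and the tidy table give the same rates, which is a quick sanity check on either pipeline.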

7.2.4 Exercises

1. What function would you use to read a file where fields were separated with “|”?

To read a file where fields are separated by the pipe character |, you can use the read_delim() function from the readr package in R. This function allows you to specify a custom delimiter, such as |.
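
A minimal sketch of reading pipe-delimited data with read_delim(). The inline data and the name pipe_data are invented for illustration; I() marks the string as literal data rather than a file path.

```r
library(readr)

# Hypothetical inline data with "|" between fields.
# I() tells readr to treat the string as data, not as a path.
pipe_data <- read_delim(
  I("name|score\nana|90\nben|85\n"),
  delim = "|",
  show_col_types = FALSE
)

pipe_data
```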

2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

The read_csv() and read_tsv() functions from the readr package read comma-separated (CSV) and tab-separated (TSV) files, respectively. They differ only in the delimiter, so apart from file, skip, and comment they have all of their other arguments in common. These include col_names, col_types, col_select, id, locale, na, quote, trim_ws, n_max, guess_max, name_repair, progress, show_col_types, skip_empty_rows, and lazy.
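
One way to check the shared arguments for the installed version of readr is to compare the formal arguments of the two functions. This is only a sketch, and the exact list may vary between readr versions. The name shared_args is made up for this example.

```r
library(readr)

# Arguments that appear in both function signatures
shared_args <- intersect(names(formals(read_csv)), names(formals(read_tsv)))

# Drop the three arguments the exercise already mentions
setdiff(shared_args, c("file", "skip", "comment"))
```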

3. What are the most important arguments to read_fwf()?

file: The path to the file you want to read.
col_positions: Defines where each column starts and ends (the most important argument for fixed-width files). It is built with a helper such as fwf_widths(), fwf_positions(), fwf_cols(), or fwf_empty(). Column names are also supplied through these helpers, because read_fwf() has no col_names argument.
col_types: Defines the types of each column.
skip: Skips initial lines before reading.
n_max: Limits the number of rows to read.
na: Specifies what values should be treated as missing.
trim_ws: Determines whether to trim leading/trailing whitespace.
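
A minimal sketch of read_fwf() on invented inline data, assuming a two-character field followed by a three-character field. The name fixed_width and the column names id and value are hypothetical. The column names are passed through fwf_widths().

```r
library(readr)

# Hypothetical fixed-width data: 2 characters for id, 3 for value
fixed_width <- read_fwf(
  I("ab123\ncd456\n"),
  col_positions = fwf_widths(c(2, 3), col_names = c("id", "value")),
  show_col_types = FALSE
)

fixed_width
```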

4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

read_csv("x,y\n1,'a,b'", quote = "'")

## Rows: 1 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): y
## dbl (1): x
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a,b

5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

read_csv("a,b\n1,2,3\n4,5,6")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): a
## num (1): b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 2
##       a     b
##   <dbl> <dbl>
## 1     1    23
## 2     4    56

The first line (a,b) defines two columns: a and b. However, the data rows (1,2,3 and 4,5,6) contain three values instead of two. read_csv() expects each row to have the same number of values as there are column headers. It does not throw an error here. Instead it issues a parsing warning, and as the output shows, the extra value in each row is merged into the last column, so b becomes 23 and 56.
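
The warning in the output above points to problems(). A minimal sketch of inspecting the parsing issues for this input follows. The name too_wide is made up for this example, and the call is expected to emit the same parsing warning.

```r
library(readr)

# Each data row has one more field than the header, so a warning is expected
too_wide <- read_csv(I("a,b\n1,2,3\n4,5,6"), show_col_types = FALSE)

# One row per parsing issue, with the expected and actual number of columns
problems(too_wide)
```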

read_csv("a,b,c\n1,2\n1,2,3,4")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## num (1): c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     1     2    34

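The header (a,b,c) defines three columns: a, b, and c. The first data row (1,2) contains only two values (too few values). The second data row (1,2,3,4) contains four values (too many values).

read_csv() does not throw an error; it issues a parsing warning because the number of values in the rows does not match the number of columns defined in the header. In the first row, the missing value for c is filled with NA. In the second row, the extra value is merged into the last column, so c becomes 34.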

read_csv("a,b\n\"1")

## Rows: 0 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 0 × 2
## # ℹ 2 variables: a <chr>, b <chr>

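The header (a,b) defines two columns: a and b. The only data row ("1) opens a double quote (") that is never closed, and it also supplies just one value for two columns.

read_csv() expects a quoted string to be closed, but it does not throw an error here. As the output shows, the unterminated field yields no data rows: the result is a 0-row tibble whose columns a and b default to character.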

read_csv("a,b\n1,2\na,b")

## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b

The header (a,b) defines two columns: a and b. The first data row (1,2) is fine. The second data row (a,b) repeats the column names as data values.

This does not produce an error. However, the second row is read as data rather than as a header. Because it contains the strings “a” and “b”, both columns are parsed as character, so the 1 and 2 in the first row become strings as well, which might not be the intended behavior.

read_csv("a;b\n1;3")

## Rows: 1 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): a;b
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 1
##   `a;b`
##   <chr>
## 1 1;3

The file uses semicolons (;) as delimiters, but read_csv() expects commas. Because it never finds a comma, read_csv() does not split the fields, and each line is read into a single column named a;b.
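
A minimal sketch of reading the same text with the delimiter stated explicitly through read_delim(). The name semi is made up for this example. read_csv2() is another option for semicolon-delimited files, though it also treats the comma as the decimal mark.

```r
library(readr)

# Same inline data as above, but with ";" declared as the delimiter
semi <- read_delim(I("a;b\n1;3"), delim = ";", show_col_types = FALSE)

semi
```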

6. Practice referring to non-syntactic names in the following data frame by:

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

6a. Extracting the variable called 1.

library(dplyr)
# Extract the column `1` using dplyr
annoying |> 
  select(`1`)

## # A tibble: 10 × 1
##      `1`
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
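
select() returns a one-column tibble. To extract the variable as a plain vector, a minimal sketch using pull(), $, or [[ follows. The tibble demo is a small stand-in invented for this example so that the block runs on its own.

```r
library(dplyr)
library(tibble)

# Small stand-in tibble with non-syntactic column names
demo <- tibble(
  `1` = 1:3,
  `2` = c(2, 4, 6)
)

# Three ways to get the column as a vector; backticks are needed
# wherever the name is used as a symbol
demo |> pull(`1`)
demo$`1`
demo[["1"]]
```

Note that pull(1) without backticks would select by position rather than by name, which happens to give the same column here only because the column named 1 is also the first one.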

6b. Plotting a scatterplot of 1 vs. 2.

library(ggplot2)
# Scatterplot of `1` vs. `2`
ggplot(annoying, aes(x = `1`, y = `2`)) +
  geom_point() +
  labs(x = "1", y = "2")

6c. Creating a new column called 3, which is 2 divided by 1.

# Create a new column `3`, which is `2` divided by `1`
annoying <- annoying |> 
  mutate(`3` = `2` / `1`)

6d. Renaming the columns to one, two, and three.

# Rename columns `1`, `2`, and `3` to `one`, `two`, and `three`
annoying <- annoying |> 
  rename(one = `1`, two = `2`, three = `3`)