1. Why does this code not work?
This code does not work because the assignment “my_variable <- 10” and the later reference “my_varıable” are spelled differently: the second uses a dotless “ı” in place of “i”. R treats these as two distinct names, so it cannot find an object called “my_varıable”.
2. Tweak each of the following R commands so that they run correctly:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
3. Press Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?
Upon pressing Option + Shift + K, a Keyboard Shortcut Quick Reference appears on my screen. I can get here using the menu through the “Help” tab at the top of my screen, then clicking on “Keyboard Shortcuts Help”
4. Let’s revisit an exercise from Section 1.6. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
my_bar_plot <- ggplot(mpg, aes(x = class)) + geom_bar()
my_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
## Saving 7 x 5 in image
The “my_bar_plot” plot is saved as “mpg-plot.png” due to the third line of code specifying that “my_bar_plot” is the one to be saved.
library(nycflights13)
1. In a single pipeline for each condition, find all flights that meet the condition:
flights |> filter(arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 811 630 101 1047 830
## 2 2013 1 1 848 1835 853 1001 1950
## 3 2013 1 1 957 733 144 1056 853
## 4 2013 1 1 1114 900 134 1447 1222
## 5 2013 1 1 1505 1310 115 1638 1431
## 6 2013 1 1 1525 1340 105 1831 1626
## 7 2013 1 1 1549 1445 64 1912 1656
## 8 2013 1 1 1558 1359 119 1718 1515
## 9 2013 1 1 1732 1630 62 2028 1825
## 10 2013 1 1 1803 1620 103 2008 1750
## # ℹ 10,190 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> filter(dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # ℹ 9,303 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> filter(carrier == "UA" | carrier == "AA" | carrier == "DL")
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # ℹ 139,494 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> filter(month == 7 | month == 8 | month == 9)
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> filter(dep_delay < 0 & arr_delay >= 120)
## # A tibble: 26 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 27 1419 1420 -1 1754 1550
## 2 2013 10 7 1357 1359 -2 1858 1654
## 3 2013 10 16 657 700 -3 1258 1056
## 4 2013 11 1 658 700 -2 1329 1015
## 5 2013 3 18 1844 1847 -3 39 2219
## 6 2013 4 17 1635 1640 -5 2049 1845
## 7 2013 4 18 558 600 -2 1149 850
## 8 2013 4 18 655 700 -5 1213 950
## 9 2013 5 22 1827 1830 -3 2217 2010
## 10 2013 6 5 1604 1615 -11 2041 1840
## # ℹ 16 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> filter(dep_delay >= 60 & arr_delay <= dep_delay - 30)
## # A tibble: 2,074 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 1716 1545 91 2140 2039
## 2 2013 1 1 2205 1720 285 46 2040
## 3 2013 1 1 2326 2130 116 131 18
## 4 2013 1 3 1503 1221 162 1803 1555
## 5 2013 1 3 1821 1530 171 2131 1910
## 6 2013 1 3 1839 1700 99 2056 1950
## 7 2013 1 3 1850 1745 65 2148 2120
## 8 2013 1 3 1923 1815 68 2036 1958
## 9 2013 1 3 1941 1759 102 2246 2139
## 10 2013 1 3 1950 1845 65 2228 2227
## # ℹ 2,064 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
2. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
flights |> arrange(desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(dep_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 13 1 2249 72 108 2357
## 2 2013 1 31 1 2100 181 124 2225
## 3 2013 11 13 1 2359 2 442 440
## 4 2013 12 16 1 2359 2 447 437
## 5 2013 12 20 1 2359 2 430 440
## 6 2013 12 26 1 2359 2 437 440
## 7 2013 12 30 1 2359 2 441 437
## 8 2013 2 11 1 2100 181 111 2225
## 9 2013 2 24 1 2245 76 121 2354
## 10 2013 3 8 1 2355 6 431 440
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
flights |> arrange(distance/air_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 28 1917 1825 52 2118 1935
## 2 2013 6 29 755 800 -5 1035 909
## 3 2013 8 28 932 940 -8 1116 1051
## 4 2013 1 30 1037 955 42 1221 1100
## 5 2013 11 27 556 600 -4 727 658
## 6 2013 5 21 558 600 -2 721 657
## 7 2013 12 9 1540 1535 5 1720 1656
## 8 2013 6 10 1356 1300 56 1646 1414
## 9 2013 7 28 1322 1325 -3 1612 1432
## 10 2013 4 11 1349 1345 4 1542 1453
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
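Note that arrange() sorts in ascending order, so the listing above starts with the lowest distance/air_time values, i.e. the slowest flights. A minimal sketch of the difference, using a small made-up table (toy_flights and its values are hypothetical, not taken from nycflights13):

```r
library(dplyr)

# Hypothetical three-flight table: distance in miles, air_time in minutes.
toy_flights <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(500, 1000, 300),
  air_time = c(100, 125, 30)
)

# Ascending sort on speed: the slowest flight comes first.
slowest_first <- toy_flights |> arrange(distance / air_time)

# Wrapping the expression in desc() puts the fastest flight first.
fastest_first <- toy_flights |> arrange(desc(distance / air_time))
```

Applied to the real data, the same idea would be flights |> arrange(desc(distance / air_time)).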
4. Was there a flight on every day of 2013?
flights |> distinct(month,day)
## # A tibble: 365 × 2
## month day
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
## # ℹ 355 more rows
Yes. The result has 365 rows, one for each distinct month and day pair, and 2013 had 365 days, so there was at least one flight on every day of the year.
5. Which flights traveled the farthest distance? Which traveled the least distance?
#Arranging flights by farthest distance
flights |> arrange(desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
#Arranging flights by least distance
flights |> arrange(distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 3 2127 2129 -2 2222 2224
## 3 2013 1 4 1240 1200 40 1333 1306
## 4 2013 1 4 1829 1615 134 1937 1721
## 5 2013 1 4 2128 2129 -1 2218 2224
## 6 2013 1 5 1155 1200 -5 1241 1306
## 7 2013 1 6 2125 2129 -4 2224 2224
## 8 2013 1 7 2124 2129 -5 2212 2224
## 9 2013 1 8 2127 2130 -3 2304 2225
## 10 2013 1 9 2126 2129 -3 2217 2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The flights that traveled the farthest were those from JFK to HNL. The flight with the shortest distance was one from EWR to LGA; its dep_time is NA in the output above, which suggests it never actually departed.
6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
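The final result is the same either way, because filter() keeps the same rows and arrange() puts them in the same order. The order does matter for the amount of work, though: calling filter() first trims the data set to the relevant rows, so arrange() only has to sort that smaller set.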
1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
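I would expect that sched_dep_time + dep_delay = dep_time. This only holds as clock arithmetic, though: dep_time and sched_dep_time are stored as HHMM integers, while dep_delay is in minutes. For example, a flight scheduled at 630 with a delay of 101 minutes departed at 811, not 731.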
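To check the relationship, the HHMM values can be converted to minutes since midnight first. The helpers below are hypothetical names introduced for illustration, and the sketch assumes times are plain HHMM integers:

```r
# Convert an HHMM integer (e.g. 630 for 6:30) to minutes since midnight.
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Convert minutes since midnight back to an HHMM integer,
# wrapping around at 24 hours (1440 minutes).
minutes_to_hhmm <- function(minutes) {
  minutes <- minutes %% 1440
  (minutes %/% 60) * 100 + minutes %% 60
}

# A flight scheduled at 630 with a 101-minute delay.
expected_dep_time <- minutes_to_hhmm(hhmm_to_minutes(630) + 101)
```

Here expected_dep_time comes out as 811, matching the dep_time shown for that flight in the earlier output.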
2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
flights |> select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
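A few other ways to reach the same four columns, sketched on a one-row stand-in table (toy is hypothetical; its column names and order are assumed to mirror flights):

```r
library(dplyr)

# Hypothetical one-row table with flights-like column names.
toy <- tibble(
  year = 2013L, dep_time = 517L, sched_dep_time = 515L, dep_delay = 2,
  arr_time = 830L, sched_arr_time = 819L, arr_delay = 11, carrier = "UA"
)

target <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

# By shared prefixes.
by_prefix <- toy |> select(starts_with("dep_"), starts_with("arr_"))

# By a regular expression on the column names.
by_regex <- toy |> select(matches("^(dep|arr)_(time|delay)$"))

# By a column range, minus the scheduled times that fall inside it.
by_range <- toy |> select(dep_time:arr_delay, -starts_with("sched"))

# By a character vector of names.
by_vector <- toy |> select(all_of(target))
```

Each of these should give the same columns as naming the four variables directly.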
3. What happens if you specify the name of the same variable multiple times in a select() call?
flights |> select(dep_time, dep_time)
## # A tibble: 336,776 × 1
## dep_time
## <int>
## 1 517
## 2 533
## 3 542
## 4 544
## 5 554
## 6 554
## 7 555
## 8 557
## 9 557
## 10 558
## # ℹ 336,766 more rows
select() returns the column only once; repeating a name has no further effect.
4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
any_of() selects every column whose name appears in a character vector and silently skips names that are not in the data frame. Rather than typing out each variable of interest every time I run analyses, I can store “year”, “month”, “day”, “dep_delay” and “arr_delay” in a vector called “variables” and use flights |> select(any_of(variables)) to keep just those columns.
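A small sketch of that behaviour, assuming a made-up table (toy) that is missing some of the names in the vector:

```r
library(dplyr)

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

# Hypothetical table that lacks the day and arr_delay columns.
toy <- tibble(year = 2013L, month = 1L, dep_delay = 2, carrier = "UA")

# any_of() keeps the names it finds and ignores the rest.
picked <- toy |> select(any_of(variables))

# all_of() is the strict counterpart: it fails if any name is missing.
strict <- try(toy |> select(all_of(variables)), silent = TRUE)
```

This tolerance for missing names is what makes any_of() convenient when the same vector is reused across data frames that do not all share every column.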
5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
The results surprise me. R is typically case-sensitive. However, the select helpers ignore case. I can change that default with ignore.case = FALSE.
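As a rough illustration on a made-up table (toy is hypothetical), the same helper matches or misses depending on ignore.case:

```r
library(dplyr)

# Hypothetical table with lower-case column names.
toy <- tibble(dep_time = 517L, arr_time = 830L, carrier = "UA")

# Default: the match ignores case, so "TIME" finds both *_time columns.
case_insensitive <- toy |> select(contains("TIME"))

# With ignore.case = FALSE the upper-case pattern matches nothing.
case_sensitive <- toy |> select(contains("TIME", ignore.case = FALSE))
```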
6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights |> rename(air_time_min = air_time) |> relocate(air_time_min)
## # A tibble: 336,776 × 19
## air_time_min year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
7. Why doesn’t the following work, and what does the error mean?
The issue is that select(tailnum) keeps only tailnum, so arr_delay no longer exists when arrange(arr_delay) runs, and the error reports that the object arr_delay cannot be found. To fix this, I need to either use select(tailnum, arr_delay) to keep both variables before arrange(arr_delay), or call arrange(arr_delay) before select(tailnum).
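Both fixes can be sketched on a small made-up table (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical table with the two columns involved.
toy <- tibble(
  tailnum   = c("N1", "N2", "N3"),
  arr_delay = c(20, -5, 7)
)

# Fix 1: keep arr_delay in the selection, then sort.
keep_both <- toy |> select(tailnum, arr_delay) |> arrange(arr_delay)

# Fix 2: sort first, then drop arr_delay.
sort_first <- toy |> arrange(arr_delay) |> select(tailnum)
```

The second form is handy when only tailnum is wanted in the final result.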
1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights |>
group_by(carrier) |>
summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(desc(avg_delay))
## # A tibble: 16 × 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 20.2
## 2 EV 20.0
## 3 YV 19.0
## 4 FL 18.7
## 5 WN 17.7
## 6 9E 16.7
## 7 B6 13.0
## 8 VX 12.9
## 9 OO 12.6
## 10 UA 12.1
## 11 MQ 10.6
## 12 DL 9.26
## 13 AA 8.59
## 14 AS 5.80
## 15 HA 4.90
## 16 US 3.78
On average, the carrier with the worst delays is F9.
2. Find the flights that are most delayed upon departure from each destination.
flights |>
group_by(dest) |>
slice_max(dep_delay, n = 1) |>
relocate(dest, dep_delay)
## # A tibble: 105 × 19
## # Groups: dest [105]
## dest dep_delay year month day dep_time sched_dep_time arr_time
## <chr> <dbl> <int> <int> <int> <int> <int> <int>
## 1 ABQ 142 2013 12 14 2223 2001 133
## 2 ACK 219 2013 7 23 1139 800 1250
## 3 ALB 323 2013 1 25 123 2000 229
## 4 ANC 75 2013 8 17 1740 1625 2042
## 5 ATL 898 2013 7 22 2257 759 121
## 6 AUS 351 2013 7 10 2056 1505 2347
## 7 AVL 222 2013 6 14 1158 816 1335
## 8 BDL 252 2013 2 21 1728 1316 1839
## 9 BGR 248 2013 12 1 1504 1056 1628
## 10 BHM 325 2013 4 10 25 1900 136
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
3. How do delays vary over the course of the day? Illustrate your answer with a plot.
flights_hourly_delay <- flights |>
  filter(!is.na(dep_delay), !is.na(dep_time)) |>
  # dep_time is an HHMM integer, so integer division by 100 gives the hour
  mutate(dep_hour = dep_time %/% 100) |>
  group_by(dep_hour) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
ggplot(flights_hourly_delay, aes(x = dep_hour, y = avg_dep_delay)) +
  geom_line(color = "blue") +
  labs(
    title = "Average departure delays by time of day",
    x = "Hour of the day",
    y = "Average departure delay (minutes)"
  )
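Departure delays vary over the course of the day and are highest for flights that leave before 5 am. Those departures are probably dominated by late-evening flights that were pushed back past midnight, so grouping by the scheduled hour (from sched_dep_time) instead may give a clearer picture of how delays develop during the day.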
4. What happens if you supply a negative n to slice_min() and friends?
flights |>
slice_min(arr_delay, n = -1) |>
relocate(arr_delay)
## # A tibble: 336,776 × 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 -86 2013 5 7 1715 1729 -14 1944
## 2 -79 2013 5 20 719 735 -16 951
## 3 -75 2013 5 2 1947 1949 -2 2209
## 4 -75 2013 5 6 1826 1830 -4 2045
## 5 -74 2013 5 4 1816 1820 -4 2017
## 6 -73 2013 5 2 1926 1929 -3 2157
## 7 -71 2013 5 6 1753 1755 -2 2004
## 8 -71 2013 5 7 2054 2055 -1 2317
## 9 -71 2013 5 13 657 700 -3 908
## 10 -70 2013 1 4 1026 1030 -4 1305
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
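Supplying a negative n to slice_min() (and similar functions) is meant to drop the last |n| rows after sorting, that is, to keep all but |n| rows. Here the output still has all 336,776 rows, most likely because with_ties = TRUE keeps every row tied with the last one kept and many flights share a missing arr_delay. In practice this call just sorts flights by arr_delay.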
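On a table without ties or missing values the effect is easier to see. This is a sketch on made-up data (toy is hypothetical) and assumes dplyr 1.1 or later, where a negative n is subtracted from the group size:

```r
library(dplyr)

# Hypothetical table: five distinct delays, no NAs.
toy <- tibble(delay = c(30, -5, 12, 0, 45))

# n = -2 keeps 5 - 2 = 3 rows: the three smallest delays.
kept <- toy |> slice_min(delay, n = -2)
```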
5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
count() is a shortcut for counting the occurrences of each unique value in a column; it is equivalent to group_by() followed by summarize(n = n()). Setting sort = TRUE arranges the output in descending order of the count (n), making it easy to identify the most frequent values.
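The equivalence can be sketched on a small made-up table (toy is hypothetical):

```r
library(dplyr)

# Hypothetical table of carrier codes.
toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# The shortcut.
counted <- toy |> count(carrier, sort = TRUE)

# The same result spelled out with the verbs from this chapter.
manual <- toy |>
  group_by(carrier) |>
  summarize(n = n()) |>
  arrange(desc(n))
```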
6a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
The group_by(y) call groups the data frame by the values in column y. Since y takes two values, “a” and “b”, there are two groups: rows where y == “a” and rows where y == “b”. The rows themselves are unchanged; only the grouping information is added.
6b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also, comment on how it’s different from the group_by() in part (a).
arrange(y) sorts the rows based on the values in y: All the rows where y == “a” will come first. The rows where y == “b” will come after.
arrange(): Sorts the rows based on a column’s values. It just changes the order of the rows. group_by(): Groups the data based on a column’s values. It doesn’t change the order of the rows, but it prepares the data for grouped operations (like summaries).
6c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
df |> group_by(y): Groups the data frame df by the column y. Rows with the same value of y are grouped together. summarize(mean_x = mean(x)): For each group of y, it calculates the mean of the x values and outputs it in a new column called mean_x.
6d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
The pipeline groups the data by unique combinations of y and z and computes the mean of x for each combination. The output is a summarized data frame with one row per unique (y, z) combination, containing that group’s mean_x. The message says that the result is still grouped by y and that this can be overridden with the .groups argument.
6e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?
The pipeline with .groups = “drop” removes all grouping from the result, returning a regular, ungrouped data frame. The pipeline without .groups = “drop” peels off only the last grouping variable (z), so the result stays grouped by y, which may be useful for further grouped operations.
6f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
Pipeline 1 (summarize()): Outputs a summarized data frame with one row per group and only the relevant columns (y, z, and mean_x). Pipeline 2 (mutate()): Outputs the full original data frame with a new column (mean_x) that contains the mean of x for each group, without removing any rows.
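The answers to parts (a) through (f) can be checked with a sketch like the following. The data frame is an assumption: the exercise's df is not reproduced here, so a five-row stand-in with columns x, y, and z is used.

```r
library(dplyr)

# Assumed stand-in for the exercise's data frame.
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# (a) Same rows, grouping information added.
grouped <- df |> group_by(y)

# (b) Rows reordered by y, no grouping.
sorted <- df |> arrange(y)

# (c) One row per value of y.
mean_by_y <- df |> group_by(y) |> summarize(mean_x = mean(x))

# (d) One row per (y, z) pair; the result stays grouped by y.
kept_groups <- df |> group_by(y, z) |> summarize(mean_x = mean(x))

# (e) Same values, but no grouping left.
dropped_groups <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# (f) mutate() keeps all five rows and adds the group mean to each.
mutated <- df |> group_by(y, z) |> mutate(mean_x = mean(x))
```

group_vars() is a quick way to confirm which grouping each result carries.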
1. Restyle the following pipelines following the guidelines above.
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  filter(n > 10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day delay n
## <int> <int> <int> <dbl> <int>
## 1 2013 1 1 17.8 20
## 2 2013 1 2 7 20
## 3 2013 1 3 18.3 19
## 4 2013 1 4 -3.2 20
## 5 2013 1 5 20.2 13
## 6 2013 1 6 9.28 18
## 7 2013 1 7 -7.74 19
## 8 2013 1 8 7.79 19
## 9 2013 1 9 18.1 19
## 10 2013 1 10 6.68 19
## # ℹ 355 more rows
flights |>
  filter(
    carrier == "UA",
    dest %in% c("IAH", "HOU"),
    sched_dep_time > 900,
    sched_arr_time < 2000
  ) |>
  group_by(flight) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = sum(is.na(arr_delay)),
    n = n()
  ) |>
  filter(n > 10)
## # A tibble: 74 × 4
## flight delay cancelled n
## <int> <dbl> <int> <int>
## 1 53 12.5 2 18
## 2 112 14.1 0 14
## 3 205 -1.71 0 14
## 4 235 -5.36 0 14
## 5 255 -9.47 0 15
## 6 268 38.6 1 15
## 7 292 6.57 0 21
## 8 318 10.7 1 20
## 9 337 20.1 2 21
## 10 370 17.5 0 11
## # ℹ 64 more rows
1. For each of the sample tables, describe what each observation and each column represents.
Table 1:
Each row in this table represents the number of disease cases and the population for a specific country in a given year.
For each column:
country: The name of the country (e.g., Afghanistan, Brazil, China).
year: The year in which the data was recorded (e.g., 1999, 2000).
cases: The number of cases of a specific disease in that country and year.
population: The total population of the country in that year.
Table 2:
Each row represents a single data point (either the number of cases or the population) for a given country and year. The data is split between “cases” and “population” in the type column.
For each column:
country: The name of the country (e.g., Afghanistan, Brazil).
year: The year in which the data was recorded (e.g., 1999, 2000).
type: The type of data being recorded, either “cases” or “population.”
count: The corresponding value for the type, either the number of cases or the population.
Table 3:
Each row represents the rate of disease cases per population for a specific country and year. The rate is expressed as a fraction of cases over population.
For each column:
country: The name of the country (e.g., Afghanistan, Brazil, China).
year: The year in which the data was recorded (e.g., 1999, 2000).
rate: The ratio of the number of disease cases to the population, expressed as a fraction (e.g., “745/19987071” means 745 cases per 19,987,071 people).
2. Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations.
# For Table 2
library(dplyr)
# Extract cases and population data
cases_table <- table2 |>
filter(type == "cases")
population_table <- table2 |>
filter(type == "population")
# Join the two tables by country and year
combined_table <- cases_table |>
inner_join(population_table,
by = c("country", "year"),
suffix = c("_cases", "_population")
)
# Calculate the rate and store the result
result_table2 <- combined_table |>
mutate(rate = (count_cases / count_population) * 10000) |>
select(country, year, rate)
# For Table 3
library(dplyr)
library(tidyr)
# Split the 'rate' string into 'cases' and 'population'
table3 <- table3 |>
separate(rate, into = c("cases", "population"), sep = "/") |>
mutate(cases = as.numeric(cases),
population = as.numeric(population)
)
# Calculate the rate and store the result
result_table3 <- table3 |>
mutate(rate_per_10000 = (cases / population) * 10000) |>
select(country, year, rate_per_10000)
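As an alternative to the filter-and-join approach for table2, the two types can be spread into their own columns first. This is a sketch that swaps in pivot_wider() from tidyr, which is not the method used above; table2 is the example dataset shipped with tidyr.

```r
library(dplyr)
library(tidyr)

# Turn the "cases" and "population" rows into two columns,
# then compute the rate per 10,000 people.
rates_table2 <- table2 |>
  pivot_wider(names_from = type, values_from = count) |>
  mutate(rate_per_10000 = cases / population * 10000) |>
  select(country, year, rate_per_10000)
```

This should produce one row per country and year, the same shape as the join-based result.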
1. What function would you use to read a file where fields were separated with “|”?
To read a file where fields are separated by the pipe character |, you can use the read_delim() function from the readr package in R. This function allows you to specify a custom delimiter, such as |.
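For instance, a minimal sketch with inline data (the text is made up, and I() marks the string as literal data, which assumes readr 2.0 or later):

```r
library(readr)

# Inline pipe-delimited text standing in for a file on disk.
piped <- read_delim(
  I("x|y\n1|a\n2|b"),
  delim = "|",
  show_col_types = FALSE
)
```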
2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
The read_csv() and read_tsv() functions from the readr package are designed to read comma-separated files (CSV) and tab-separated files (TSV), respectively, and they take essentially the same arguments. Apart from file, skip, and comment, these include col_names, col_types, col_select, id, locale, na, quote, trim_ws, n_max, guess_max, name_repair, progress, show_col_types, skip_empty_rows, num_threads, and lazy.
3. What are the most important arguments to read_fwf()?
The two essential arguments are:
file: The path to the file you want to read.
col_positions: Defines where each column starts and ends (most important for fixed-width files). It is usually built with fwf_widths(), fwf_positions(), or fwf_cols(), and these helpers also supply the column names.
Other useful arguments:
col_types: Defines the types of each column.
skip: Skips initial lines before reading.
n_max: Limits the number of rows to read.
na: Specifies what values should be treated as missing.
trim_ws: Determines whether to trim leading/trailing whitespace.
4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?
read_csv("x,y\n1,'a,b'", quote = "'")
## Rows: 1 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): y
## dbl (1): x
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 2
## x y
## <dbl> <chr>
## 1 1 a,b
5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): a
## num (1): b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
## a b
## <dbl> <dbl>
## 1 1 23
## 2 4 56
The first line (a,b) defines two columns: a and b. However, the data rows (1,2,3 and 4,5,6) contain three values instead of two. read_csv() expects each row to have the same number of values as there are column headers. It does not stop with an error, though: it raises a parsing warning and, as the output shows, folds the extra value into the last column, so b becomes 23 and 56.
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 2 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## num (1): c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 NA
## 2 1 2 34
The header (a,b,c) defines three columns: a, b, and c. The first data row (1,2) contains only two values (too few). The second data row (1,2,3,4) contains four values (too many).
read_csv() raises a parsing warning rather than an error: the missing value in the first row is filled with NA, and the extra value in the second row is merged into the last column, so c becomes 34.
read_csv("a,b\n\"1")
## Rows: 0 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 0 × 2
## # ℹ 2 variables: a <chr>, b <chr>
The header (a,b) defines two columns: a and b. The first data row ("1) starts with a double quote ("), but the quote is never closed.
read_csv() expects a quoted string to be closed. Here it does not raise an error; as the output shows, the result is an empty data frame with 0 rows and two character columns.
read_csv("a,b\n1,2\na,b")
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): a, b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
The header (a,b) defines two columns: a and b. The first data row (1,2) is fine. The second data row (a,b) repeats the column names as data values.
This will not result in an error. However, the second row is interpreted as data (not headers), so the data frame ends up with a second row containing the strings “a” and “b”, and both columns are parsed as character rather than numeric, which might not be the intended behavior.
read_csv("a;b\n1;3")
## Rows: 1 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): a;b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 1
## `a;b`
## <chr>
## 1 1;3
The file uses semicolons (;) as delimiters, but read_csv() expects commas. Because it finds no commas, read_csv() does not split the columns and reads each line as a single column named a;b. Using read_csv2() or read_delim() with delim = ";" would split the fields correctly.
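A sketch of one possible fix for the last case, passing the delimiter explicitly to read_delim() (I() marks the string as literal data, which assumes readr 2.0 or later):

```r
library(readr)

# Same inline text as above, read with a semicolon delimiter.
semi <- read_delim(I("a;b\n1;3"), delim = ";", show_col_types = FALSE)
```

With the delimiter specified, the text should parse into two numeric columns, a and b.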
6. Practice referring to non-syntactic names in the following data frame by:
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
6a. Extracting the variable called 1.
library(dplyr)
# Extract the column `1` using dplyr
annoying |>
select(`1`)
## # A tibble: 10 × 1
## `1`
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
6b. Plotting a scatterplot of 1 vs. 2.
library(ggplot2)
# Scatterplot of `1` vs. `2`
ggplot(annoying, aes(x = `1`, y = `2`)) +
geom_point() +
labs(x = "1", y = "2")
6c. Creating a new column called 3, which is 2 divided by 1.
# Create a new column `3`, which is `2` divided by `1`
annoying <- annoying |>
mutate(`3` = `2` / `1`)
6d. Renaming the columns to one, two, and three.
# Rename columns `1`, `2`, and `3` to `one`, `two`, and `three`
annoying <- annoying |>
rename(one = `1`, two = `2`, three = `3`)