library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
All flights that had an arrival delay of two or more hours
flights_arr2 <- flights |>
filter(arr_delay >= 120)
Flew to Houston (IAH or HOU)
flights_houston <- flights |>
filter(dest %in% c("IAH", "HOU"))
Were operated by United, American, or Delta
flights_UAD <- flights |>
filter(carrier %in% c("UA", "AA", "DL"))
Departed in summer (July, August, and September)
flights_summer <- flights |>
filter(month %in% c(7, 8, 9))
Arrived more than two hours late, but didn’t leave late
flights |>
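filter(arr_delay > 120 & dep_delay == 0) # dep_delay == 0 keeps only exactly on-time departures; dep_delay <= 0 would also include early ones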
## # A tibble: 3 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 7 1350 1350 0 1736 1526
## 2 2013 5 23 1810 1810 0 2208 2000
## 3 2013 7 1 905 905 0 1443 1223
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
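filter(dep_delay >= 60 & arr_delay <= 30) # this keeps flights arriving at most 30 minutes late; dep_delay - arr_delay > 30 would test the time made up directly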
## # A tibble: 239 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 3 1850 1745 65 2148 2120
## 2 2013 1 3 1950 1845 65 2228 2227
## 3 2013 1 3 2015 1915 60 2135 2111
## 4 2013 1 6 1019 900 79 1558 1530
## 5 2013 1 7 1543 1430 73 1758 1735
## 6 2013 1 11 1020 920 60 1311 1245
## 7 2013 1 12 1706 1600 66 1949 1927
## 8 2013 1 12 1953 1845 68 2154 2137
## 9 2013 1 19 1456 1355 61 1636 1615
## 10 2013 1 21 1531 1430 61 1843 1815
## # ℹ 229 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
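The last two conditions can be read more strictly. The following is a minimal sketch on a small made-up tibble (`toy` and its values are hypothetical, not rows from `flights`). It assumes "didn't leave late" means `dep_delay <= 0` and "made up over 30 minutes" means `dep_delay - arr_delay > 30`.

```r
library(dplyr)

# Hypothetical rows for illustration only; delays are in minutes
toy <- tibble(
  id        = 1:5,
  dep_delay = c(0, -5, 70, 65, 90),
  arr_delay = c(130, 125, 25, 50, 55)
)

# Arrived more than two hours late without leaving late (on time or early)
late_arrival <- toy |>
  filter(arr_delay > 120 & dep_delay <= 0)

# Left at least an hour late but made up more than 30 minutes in the air
made_up <- toy |>
  filter(dep_delay >= 60 & dep_delay - arr_delay > 30)
```

Under this reading, flight 5 counts as having made up time (90 minutes late out, 55 minutes late in) even though it still arrived more than 30 minutes late.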
flights |>
arrange(desc(dep_delay), dep_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
arrange(sched_dep_time)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 2 458 500 -2 703 650
## 3 2013 1 3 458 500 -2 650 650
## 4 2013 1 4 456 500 -4 631 650
## 5 2013 1 5 458 500 -2 640 650
## 6 2013 1 6 458 500 -2 718 650
## 7 2013 1 7 454 500 -6 637 648
## 8 2013 1 8 454 500 -6 625 648
## 9 2013 1 9 457 500 -3 647 648
## 10 2013 1 10 450 500 -10 634 648
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
flights |>
arrange(distance/air_time) # ascending order puts the slowest flights first; desc(distance/air_time) would list the fastest
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 28 1917 1825 52 2118 1935
## 2 2013 6 29 755 800 -5 1035 909
## 3 2013 8 28 932 940 -8 1116 1051
## 4 2013 1 30 1037 955 42 1221 1100
## 5 2013 11 27 556 600 -4 727 658
## 6 2013 5 21 558 600 -2 721 657
## 7 2013 12 9 1540 1535 5 1720 1656
## 8 2013 6 10 1356 1300 56 1646 1414
## 9 2013 7 28 1322 1325 -3 1612 1432
## 10 2013 4 11 1349 1345 4 1542 1453
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
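Because arrange() sorts in ascending order, sorting by distance/air_time alone lists the slowest flights first. Below is a sketch of the descending version on made-up rows. The `toy` tibble and the `speed_mph` column are hypothetical; `distance` is assumed to be in miles and `air_time` in minutes, as in the nycflights13 documentation.

```r
library(dplyr)

# Hypothetical flights for illustration only
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(500, 1000, 200),  # miles
  air_time = c(60, 100, 40)      # minutes
)

# Compute speed explicitly, then sort from fastest to slowest
fastest <- toy |>
  mutate(speed_mph = distance / air_time * 60) |>
  arrange(desc(speed_mph))
```

Storing the speed in its own column also makes the ranking easy to check by eye.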
Was there a flight on every day of 2013?
flights |>
distinct(year, month, day)
## # A tibble: 365 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 2
## 3 2013 1 3
## 4 2013 1 4
## 5 2013 1 5
## 6 2013 1 6
## 7 2013 1 7
## 8 2013 1 8
## 9 2013 1 9
## 10 2013 1 10
## # ℹ 355 more rows
Answer: yes. There are 365 distinct dates, one for every day of 2013 (not a leap year).
Which flights traveled the farthest distance? Which traveled the least distance?
flights |>
arrange(desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Farthest: JFK-HNL.
flights |>
arrange(distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 3 2127 2129 -2 2222 2224
## 3 2013 1 4 1240 1200 40 1333 1306
## 4 2013 1 4 1829 1615 134 1937 1721
## 5 2013 1 4 2128 2129 -1 2218 2224
## 6 2013 1 5 1155 1200 -5 1241 1306
## 7 2013 1 6 2125 2129 -4 2224 2224
## 8 2013 1 7 2124 2129 -5 2212 2224
## 9 2013 1 8 2127 2130 -3 2304 2225
## 10 2013 1 9 2126 2129 -3 2217 2224
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Shortest: EWR-LGA (the first row, which has no recorded departure time), followed by EWR-PHL.
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
Yes, for efficiency, although the result is the same either way. If you filter first, there is less work since fewer rows are being arranged. If you arrange before filtering, arrange() has to sort rows that are then thrown away.
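A quick way to check that the order only affects the amount of work is to compare both pipelines on a small example. This is a sketch on made-up data (the `toy` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL", "UA"),
  dep_delay = c(15, 3, -2, 40, 7)
)

# Filter first: only the three UA rows are sorted
filter_first <- toy |>
  filter(carrier == "UA") |>
  arrange(dep_delay)

# Arrange first: all five rows are sorted, then two are discarded
arrange_first <- toy |>
  arrange(dep_delay) |>
  filter(carrier == "UA")

all.equal(filter_first, arrange_first)
```

Both pipelines keep the same rows in the same order, so the comparison should report them as equal.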
flights |>
select(dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 × 3
## dep_time sched_dep_time dep_delay
## <int> <int> <dbl>
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # ℹ 336,766 more rows
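I expect that dep_time - sched_dep_time = dep_delay: as dep_time increases and/or sched_dep_time decreases, dep_delay increases. Because the times are stored as clock times in HHMM format, the subtraction only works directly within the same hour (e.g., 554 - 600 is -46, while dep_delay is -6 minutes).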
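The relationship can be checked by converting the HHMM clock times to minutes since midnight before subtracting. This is a sketch using three rows from the output above; `to_minutes()` is a hypothetical helper, not part of any package, and it does not handle delays that cross midnight.

```r
# Convert a clock time stored as HHMM (e.g. 554 for 5:54 am)
# to minutes since midnight
to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Rows 1, 5, and 10 of the select() output above
dep_time       <- c(517, 554, 558)
sched_dep_time <- c(515, 600, 600)

# Difference in minutes; matches dep_delay (2, -6, -2) for these rows
delay <- to_minutes(dep_time) - to_minutes(sched_dep_time)
```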
flights |>
select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
flights |>
select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
flights |>
select(contains("arr"), contains("dep")) # Not equivalent: contains() also matches sched_arr_time, carrier, and sched_dep_time
## # A tibble: 336,776 × 7
## arr_time sched_arr_time arr_delay carrier dep_time sched_dep_time dep_delay
## <int> <int> <dbl> <chr> <int> <int> <dbl>
## 1 830 819 11 UA 517 515 2
## 2 850 830 20 UA 533 529 4
## 3 923 850 33 AA 542 540 2
## 4 1004 1022 -18 B6 544 545 -1
## 5 812 837 -25 DL 554 600 -6
## 6 740 728 12 UA 554 558 -4
## 7 913 854 19 B6 555 600 -5
## 8 709 723 -14 EV 557 600 -3
## 9 838 846 -8 B6 557 600 -3
## 10 753 745 8 AA 558 600 -2
## # ℹ 336,766 more rows
flights |>
select(carrier, carrier, arr_time)
## # A tibble: 336,776 × 2
## carrier arr_time
## <chr> <int>
## 1 UA 830
## 2 UA 850
## 3 AA 923
## 4 B6 1004
## 5 DL 812
## 6 UA 740
## 7 B6 913
## 8 EV 709
## 9 B6 838
## 10 AA 753
## # ℹ 336,766 more rows
A variable that is repeated in select() appears only once in the output, and no error or warning is raised.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |>
select(any_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ 336,766 more rows
any_of() selects the columns whose names are stored in a character vector, so a preassigned set of columns can be selected by passing the vector instead of typing out every name. Unlike all_of(), it does not raise an error when a name in the vector is missing from the data, which also makes it useful for negative selections (making sure a variable is removed if it is present).
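The difference between any_of() and the stricter all_of() shows up when a name in the vector is missing from the data. This is a sketch on a one-row made-up tibble (`toy` is hypothetical and deliberately lacks arr_delay):

```r
library(dplyr)

# Hypothetical data; note there is no arr_delay column
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2)

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the columns that exist and skips the missing one
kept <- toy |> select(any_of(variables))

# all_of() is strict: a missing name raises an error, caught here
strict <- tryCatch(
  toy |> select(all_of(variables)),
  error = function(e) "error"
)

# Negative selection: drop these columns wherever they are present
dropped <- toy |> select(-any_of(variables))
```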
flights |> select(contains("TIME"))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ℹ 336,766 more rows
The select helpers are not case-sensitive by default, so contains("TIME") matches the lowercase column names.
flights |> select(contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
"ignore.case" is TRUE by default; setting it to FALSE makes the match case-sensitive. No columns are returned since there are no uppercase column names.
flights |>
rename(air_time_min = air_time) |>
relocate(air_time_min)
## # A tibble: 336,776 × 19
## air_time_min year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> select(tailnum) |> arrange(arr_delay)
#> Error in `arrange()`:
#> ℹ In argument: `..1 = arr_delay`.
#> Caused by error:
#> ! object 'arr_delay' not found
flights |>
select(tailnum)
## # A tibble: 336,776 × 1
## tailnum
## <chr>
## 1 N14228
## 2 N24211
## 3 N619AA
## 4 N804JB
## 5 N668DN
## 6 N39463
## 7 N516JB
## 8 N829AS
## 9 N593JB
## 10 N3ALAA
## # ℹ 336,766 more rows
arr_delay was dropped by select(tailnum). Only tailnum is left in the tibble, so it is not possible to arrange by arr_delay. Arranging before selecting avoids the error.
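One way to get the intended result is to sort while arr_delay is still available and only then keep tailnum. This is a sketch on made-up rows (the `toy` tibble and its tail numbers are hypothetical):

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- tibble(
  tailnum   = c("N1", "N2", "N3"),
  arr_delay = c(20, -5, 11)
)

# Sort by arr_delay first, then drop every column except tailnum
by_delay <- toy |>
  arrange(arr_delay) |>
  select(tailnum)
```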
flights |>
group_by(carrier, dest) |>
summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(desc(avg_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 314 × 3
## # Groups: carrier [16]
## carrier dest avg_delay
## <chr> <chr> <dbl>
## 1 UA STL 77.5
## 2 OO ORD 67
## 3 OO DTW 61
## 4 UA RDU 60
## 5 EV PBI 48.7
## 6 EV TYS 41.8
## 7 EV CAE 36.7
## 8 EV TUL 34.9
## 9 9E BGR 34
## 10 WN MSY 33.4
## # ℹ 304 more rows
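There does not appear to be a pattern in the destination airports among the top 10 average delays. However, 4 of the 10 highest averages belong to ExpressJet (EV), while no other carrier appears more than twice. Several of the top averages are whole numbers (67, 61, 60, 34), which suggests they may be based on very few flights, so these rankings should be read with caution.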
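Adding a row count next to each average makes it easier to judge how much weight an average deserves. This is a sketch on made-up rows (the `toy` tibble and its values are hypothetical):

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- tibble(
  carrier   = c("UA", "EV", "EV", "EV"),
  dest      = c("STL", "PBI", "PBI", "PBI"),
  dep_delay = c(80, 40, NA, 60)
)

delays <- toy |>
  group_by(carrier, dest) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),          # rows in the group, including those with a missing dep_delay
    .groups = "drop"
  ) |>
  arrange(desc(avg_delay))
```

Here the highest average comes from a single flight, which is the situation the count column is meant to expose.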
flights |>
filter(dep_delay > 0) |>
group_by(dest) |>
slice_max(dep_delay, n = 1) |>
relocate(dest)
## # A tibble: 103 × 19
## # Groups: dest [103]
## dest year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 ABQ 2013 12 14 2223 2001 142 133
## 2 ACK 2013 7 23 1139 800 219 1250
## 3 ALB 2013 1 25 123 2000 323 229
## 4 ANC 2013 8 17 1740 1625 75 2042
## 5 ATL 2013 7 22 2257 759 898 121
## 6 AUS 2013 7 10 2056 1505 351 2347
## 7 AVL 2013 6 14 1158 816 222 1335
## 8 BDL 2013 2 21 1728 1316 252 1839
## 9 BGR 2013 12 1 1504 1056 248 1628
## 10 BHM 2013 4 10 25 1900 325 136
## # ℹ 93 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
ggplot(aes(x = dep_time, y = dep_delay)) +
geom_point() # geom_point(na.rm = TRUE) would drop the missing values without the warning below
## Warning: Removed 8255 rows containing missing values (`geom_point()`).
Delays rise between about 5 and 10 am and keep increasing slightly over the course of the day. There are more delays at night.
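With several hundred thousand points the scatterplot is heavily overplotted, so averaging the delay by hour may make the trend easier to read. This is a sketch on made-up rows (the `toy` tibble is hypothetical; `flights` has a comparable `hour` column for the scheduled departure hour):

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- tibble(
  hour      = c(6, 6, 12, 12, 19, 19),
  dep_delay = c(-2, 4, 10, NA, 30, 20)
)

# One row per hour with the mean delay, ignoring missing values
by_hour <- toy |>
  group_by(hour) |>
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))
```

The resulting table could then be passed to ggplot() with geom_line() or geom_col() instead of plotting every flight.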
flights |>
slice_min(dep_delay, n = -1)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
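With a negative n, slice_min() subtracts that number from the group size instead of selecting the smallest rows, so the output here is essentially the whole table sorted by dep_delay (all 336,776 rows are shown) rather than only the lowest/highest value.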
count() counts the number of rows for each unique value, or combination of values, of the variables you give it (e.g., # of flights per origin, # of flights per destination, # of flights for a given origin and destination, etc.). By default the result is ordered by those variables; sort = TRUE orders it by the count column n instead, with the most common values at the top.
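The effect of sort = TRUE is easiest to see when alphabetical order and count order disagree. This is a sketch on made-up rows (the `toy` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical rows for illustration only
toy <- tibble(origin = c("JFK", "EWR", "LGA", "LGA", "LGA", "JFK"))

# Default: one row per origin, ordered by origin
by_name <- toy |> count(origin)

# sort = TRUE: ordered by the count column n, largest first
by_count <- toy |> count(origin, sort = TRUE)
```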
df <- tibble( x = 1:5, y = c("a", "b", "a", "a", "b"), z = c("K", "K", "L", "L", "K") )
It will create a 5x3 tibble (5 rows, 3 columns) in the environment with variables x, y, and z. x will hold the integers 1 to 5, y the characters a, b, a, a, b, and z the characters K, K, L, L, K, each in the same row order as x.
df <- tibble(
x = 1:5,
y = c("a", "b", "a", "a", "b"),
z = c("K", "K", "L", "L", "K")
)
group_by(y) leaves the data unchanged (still a 5x3 tibble) but marks the rows as belonging to two groups, one for each value of y, so that later verbs operate per group.
df |>
group_by(y)
## # A tibble: 5 × 3
## # Groups: y [2]
## x y z
## <int> <chr> <chr>
## 1 1 a K
## 2 2 b K
## 3 3 a L
## 4 4 a L
## 5 5 b K
df |> arrange(y)
It will sort the rows by the value of y (a before b), returning a 5x3 tibble.
df |>
arrange(y)
## # A tibble: 5 × 3
## x y z
## <int> <chr> <chr>
## 1 1 a K
## 2 3 a L
## 3 4 a L
## 4 2 b K
## 5 5 b K
arrange() sorts rows in ascending order of the given variable by default and can sort in descending order with desc(). Unlike group_by(), it changes the row order but does not attach any grouping to the output.
df |> group_by(y) |> summarize(mean_x = mean(x))
The output will produce a 2x2 tibble with the mean x for each group of y.
df |>
group_by(y) |>
summarize(mean_x = mean(x))
## # A tibble: 2 × 2
## y mean_x
## <chr> <dbl>
## 1 a 2.67
## 2 b 3.5
The pipe passes df to each verb in turn: group_by(y) first groups the rows by y, then summarize() computes the mean of x within each group, giving one row per value of y. The mean x for a (2.67) is lower than the mean x for b (3.5).
df |> group_by(y, z) |> summarize(mean_x = mean(x))
The output will be a 3x3 tibble containing the mean x for each combination of y and z.
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
## `summarise()` has grouped output by 'y'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 3
## # Groups: y [2]
## y z mean_x
## <chr> <chr> <dbl>
## 1 a K 1
## 2 a L 3.5
## 3 b K 3.5
The pipeline groups df by each combination of y and z, then computes the mean of x within each combination. As the message notes, summarize() drops only the last grouping variable (z), so the result is still grouped by y.
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?
The output will be the same as (d).
df |>
group_by(y, z) |>
summarize(mean_x = mean(x), .groups = "drop")
## # A tibble: 3 × 3
## y z mean_x
## <chr> <chr> <dbl>
## 1 a K 1
## 2 a L 3.5
## 3 b K 3.5
The pipeline groups df by the values of y and z, computes the mean x for each combination, and then drops all grouping. The values are the same as in (d), but the result differs in its grouping: in (d) summarize() removed only the last grouping variable, so that output is still grouped by y (note the "# Groups: y [2]" line), whereas .groups = "drop" returns an ungrouped tibble.
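The grouping that remains after summarize() can be inspected with group_vars(). In this sketch, `part_d` and `part_e` are hypothetical names, and .groups = "drop_last" is written out to reproduce the default behavior of (d) without the message.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# As in (d): only the last grouping variable (z) is dropped
part_d <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop_last")

# As in (e): all grouping is dropped
part_e <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

group_vars(part_d)  # still grouped by y
group_vars(part_e)  # no grouping variables left
```

The remaining grouping matters for any verb applied afterwards, since it would operate per group of y on `part_d` but on the whole table on `part_e`.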
df |> group_by(y, z) |> summarize(mean_x = mean(x))
The output will be a 3x3 tibble with the mean x for each combination of y and z (same as (d)).
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
## `summarise()` has grouped output by 'y'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 3
## # Groups: y [2]
## y z mean_x
## <chr> <chr> <dbl>
## 1 a K 1
## 2 a L 3.5
## 3 b K 3.5
The pipeline groups df by y and z and then produces the mean x.
df |> group_by(y, z) |> mutate(mean_x = mean(x))
The output will be a 3x4 tibble with an extra column titled mean_x in addition to the variable x.
df |>
group_by(y, z) |>
mutate(mean_x = mean(x))
## # A tibble: 5 × 4
## # Groups: y, z [3]
## x y z mean_x
## <int> <chr> <chr> <dbl>
## 1 1 a K 1
## 2 2 b K 3.5
## 3 3 a L 3.5
## 4 4 a L 3.5
## 5 5 b K 3.5
The pipeline groups df by y and z, then adds a mean_x column holding the mean of x for each combination of y and z, producing a 5x4 tibble. The output differs from summarize() because mutate() keeps every original row and column instead of collapsing each group to one row, so the group mean is repeated on each row of its group. mean_x equals x only in the first row, where the (a, K) group contains a single observation; the other rows all show their group mean of 3.5.
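Comparing mean_x with x row by row makes the repeated group means visible. In this sketch, `with_means` and `same_as_x` are hypothetical names added for illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# mutate() keeps all five rows and repeats each group's mean on its rows
with_means <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup() |>
  mutate(same_as_x = mean_x == x)  # TRUE only where a group has a single row
```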