library(tidyverse)
library(palmerpenguins)
library(ggthemes)
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm") +
labs(title = "Body mass and flipper length",
subtitle = "Dimensions of Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)", y = "Body mass (g)",
color = "Species", shape = "Species") +
scale_color_colorblind()
Question 3: Make a scatterplot of bill_depth_mm vs. bill_length_mm. Describe the relationship between these two variables.
ggplot(data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(mapping = aes(color = species, shape = species))
Warning: Removed 2 rows containing missing values (`geom_point()`).
There is no clear overall linear relationship between bill length and bill depth; instead, the points cluster by species, so any trend is easier to see within each species than across the whole dataset.
Question 4: Make a scatterplot of species vs. bill_depth_mm. What might be a better choice of geom?
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) +
geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
A boxplot is a better choice of geom, since it summarizes the distribution of bill depth within each species:
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) +
geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Question 8: Replicate this visualization:
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(mapping = aes(color = bill_depth_mm)) +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
To replicate the plot, bill_depth_mm should be mapped to color inside geom_point(), i.e., locally, so that only the points are colored by bill depth and a single smooth line is fit to the full dataset.
Question 2: Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = cty))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(size = cty))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = cty, size = cty))
For a numerical variable, color is shown as a continuous gradient and size as a continuous scale, whereas a categorical variable gets one discrete color or size per level. Shape cannot be mapped to a numerical variable at all: it only works with discrete variables, so aes(shape = cty) throws an error.
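For contrast, mapping shape to a categorical variable works as expected; a small sketch using drv (drive train), which has only three levels:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(shape = drv)) # one distinct shape per drive-train level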
Question 5: Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(aes(color = species)) + facet_wrap(~species)
Warning: Removed 2 rows containing missing values (`geom_point()`).
Coloring by species shows that, within each species, bill length and bill depth are positively related, even though the pooled points might suggest a weak negative trend. Faceting by species separates the clusters into panels, making each within-species pattern easier to compare.
Question 6: Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
geom_point() + labs(color = "Species")
Two legends appear because labs(color = "Species") renames only the color legend; the shape legend keeps the default title species, and ggplot2 only merges legends whose titles match. To combine them, give both aesthetics the same name with labs(color = "Species", shape = "Species"), or drop labs() altogether:
ggplot(data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
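Alternatively, keeping labs() but naming both aesthetics identically also yields a single combined legend; a sketch:
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm,
                     color = species, shape = species)) +
  geom_point() +
  labs(color = "Species", shape = "Species") # same title for both scales, so the legends merge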
Question 7: Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")
The first plot answers "what is the species makeup of each island?": each bar is an island and the fill shows the proportion of each species on it. The second plot answers "how is each species distributed across the islands?": each bar is a species and the fill shows the proportion of that species living on each island.
library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Question 1: In a single pipeline for each condition, find all flights that meet the condition:
Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late
Were delayed by at least an hour, but made up over 30 minutes in flight
# Had an arrival delay of two or more hours
flights |>
filter(arr_delay >= 120)
# A tibble: 10,200 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 811 630 101 1047 830
2 2013 1 1 848 1835 853 1001 1950
3 2013 1 1 957 733 144 1056 853
4 2013 1 1 1114 900 134 1447 1222
5 2013 1 1 1505 1310 115 1638 1431
6 2013 1 1 1525 1340 105 1831 1626
7 2013 1 1 1549 1445 64 1912 1656
8 2013 1 1 1558 1359 119 1718 1515
9 2013 1 1 1732 1630 62 2028 1825
10 2013 1 1 1803 1620 103 2008 1750
# ℹ 10,190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Flew to Houston (IAH or HOU)
flights |>
filter(dest %in% c("IAH", "HOU"))
# A tibble: 9,313 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 623 627 -4 933 932
4 2013 1 1 728 732 -4 1041 1038
5 2013 1 1 739 739 0 1104 1038
6 2013 1 1 908 908 0 1228 1219
7 2013 1 1 1028 1026 2 1350 1339
8 2013 1 1 1044 1045 -1 1352 1351
9 2013 1 1 1114 900 134 1447 1222
10 2013 1 1 1205 1200 5 1503 1505
# ℹ 9,303 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Were operated by United, American, or Delta
flights |>
filter(carrier %in% c("UA", "AA", "DL"))
# A tibble: 139,504 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 554 600 -6 812 837
5 2013 1 1 554 558 -4 740 728
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 600 -1 941 910
10 2013 1 1 559 600 -1 854 902
# ℹ 139,494 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Departed in summer (July, August, and September)
flights |>
filter(month %in% c(7, 8, 9))
# A tibble: 86,326 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 1 1 2029 212 236 2359
2 2013 7 1 2 2359 3 344 344
3 2013 7 1 29 2245 104 151 1
4 2013 7 1 43 2130 193 322 14
5 2013 7 1 44 2150 174 300 100
6 2013 7 1 46 2051 235 304 2358
7 2013 7 1 48 2001 287 308 2305
8 2013 7 1 58 2155 183 335 43
9 2013 7 1 100 2146 194 327 30
10 2013 7 1 100 2245 135 337 135
# ℹ 86,316 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Arrived more than two hours late, but didn’t leave late
flights |>
filter(arr_delay > 120 & dep_delay <= 0)
# A tibble: 29 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 27 1419 1420 -1 1754 1550
2 2013 10 7 1350 1350 0 1736 1526
3 2013 10 7 1357 1359 -2 1858 1654
4 2013 10 16 657 700 -3 1258 1056
5 2013 11 1 658 700 -2 1329 1015
6 2013 3 18 1844 1847 -3 39 2219
7 2013 4 17 1635 1640 -5 2049 1845
8 2013 4 18 558 600 -2 1149 850
9 2013 4 18 655 700 -5 1213 950
10 2013 5 22 1827 1830 -3 2217 2010
# ℹ 19 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
filter(dep_delay >= 60 & dep_delay - arr_delay > 30) # left at least an hour late and made up over 30 minutes in flight
Question 2: Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
flights |> filter(!is.na(dep_delay)) |> arrange(desc(dep_delay))
# A tibble: 328,521 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 6 27 959 1900 899 1236 2226
9 2013 7 22 2257 759 898 121 1026
10 2013 12 5 756 1700 896 1058 2020
# ℹ 328,511 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(sched_dep_time, dep_time, dep_delay)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 27 NA 106 NA NA 245
2 2013 5 8 445 500 -15 620 640
3 2013 5 5 446 500 -14 636 640
4 2013 9 4 446 500 -14 618 648
5 2013 10 1 447 500 -13 614 648
6 2013 9 19 447 500 -13 620 648
7 2013 1 29 448 500 -12 635 648
8 2013 12 27 448 500 -12 648 651
9 2013 5 7 448 500 -12 624 640
10 2013 10 2 449 500 -11 620 648
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Question 3: Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
flights |> arrange(desc(distance / air_time)) # fastest = highest average speed, so sort speed in descending order
Question 4: Was there a flight on every day of 2013?
flights |> distinct(year, month, day)
# A tibble: 365 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 2
3 2013 1 3
4 2013 1 4
5 2013 1 5
6 2013 1 6
7 2013 1 7
8 2013 1 8
9 2013 1 9
10 2013 1 10
# ℹ 355 more rows
Yes — there are 365 distinct (year, month, day) combinations, one for every day of 2013.
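A quick cross-check, counting the distinct dates directly (a sketch):
flights |>
  distinct(year, month, day) |>
  nrow() # 365, one row per calendar day of 2013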
Question 5: Which flights traveled the farthest distance? Which traveled the least distance?
flights |> arrange(desc(distance))
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 857 900 -3 1516 1530
2 2013 1 2 909 900 9 1525 1530
3 2013 1 3 914 900 14 1504 1530
4 2013 1 4 900 900 0 1516 1530
5 2013 1 5 858 900 -2 1519 1530
6 2013 1 6 1019 900 79 1558 1530
7 2013 1 7 1042 900 102 1620 1530
8 2013 1 8 901 900 1 1504 1530
9 2013 1 9 641 900 1301 1242 1530
10 2013 1 10 859 900 -1 1449 1530
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(distance)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 7 27 NA 106 NA NA 245
2 2013 1 3 2127 2129 -2 2222 2224
3 2013 1 4 1240 1200 40 1333 1306
4 2013 1 4 1829 1615 134 1937 1721
5 2013 1 4 2128 2129 -1 2218 2224
6 2013 1 5 1155 1200 -5 1241 1306
7 2013 1 6 2125 2129 -4 2224 2224
8 2013 1 7 2124 2129 -5 2212 2224
9 2013 1 8 2127 2130 -3 2304 2225
10 2013 1 9 2126 2129 -3 2217 2224
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Question 6: Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
flights |> filter(distance > 100) |> arrange(year, month, day, dep_time)
# A tibble: 335,143 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 335,133 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
arrange(year, month, day, dep_time) |>
filter(distance > 100)
# A tibble: 335,143 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 335,133 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
The results are identical either way, but the order affects how much work the functions do: filtering first leaves fewer rows for arrange() to sort, whereas arranging first spends effort ordering rows that are then thrown away.
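A rough way to see the difference in work is to time both orderings; a sketch (timings vary by machine, but the filter-first version has fewer rows to sort):
system.time(flights |> filter(distance > 100) |> arrange(year, month, day, dep_time))
system.time(flights |> arrange(year, month, day, dep_time) |> filter(distance > 100))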
Question 1: Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
flights |> select(contains("dep"))
# A tibble: 336,776 × 3
dep_time sched_dep_time dep_delay
<int> <int> <dbl>
1 517 515 2
2 533 529 4
3 542 540 2
4 544 545 -1
5 554 600 -6
6 554 558 -4
7 555 600 -5
8 557 600 -3
9 557 600 -3
10 558 600 -2
# ℹ 336,766 more rows
flights |>
mutate(obt_delay = dep_time - sched_dep_time,
dep_delay = dep_delay,
.keep = "used")
# A tibble: 336,776 × 4
dep_time sched_dep_time dep_delay obt_delay
<int> <int> <dbl> <int>
1 517 515 2 2
2 533 529 4 4
3 542 540 2 2
4 544 545 -1 -1
5 554 600 -6 -46
6 554 558 -4 -4
7 555 600 -5 -45
8 557 600 -3 -43
9 557 600 -3 -43
10 558 600 -2 -42
# ℹ 336,766 more rows
dep_delay should equal dep_time - sched_dep_time. The naive subtraction (obt_delay above) agrees with dep_delay only within the same hour: because the times are stored as HHMM integers rather than minutes, the difference is off by 40 whenever the departure crosses an hour boundary (e.g., 554 - 600 = -46 instead of -6).
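One way to confirm this is to convert the HHMM times to minutes since midnight before subtracting; a sketch (the new column names are just illustrative, and flights that depart after midnight are still off):
flights |>
  mutate(
    dep_min = (dep_time %/% 100) * 60 + dep_time %% 100,
    sched_dep_min = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100,
    delay_check = dep_min - sched_dep_min, # matches dep_delay except across midnight
    .keep = "used"
  )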
Question 2: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
flights |> select(dep_time, dep_delay, arr_time, arr_delay)
flights |>
select((contains("dep") | contains("arr")) &
!(contains("sched") | carrier))
flights |>
select((contains("time") | contains("delay")) &
!(contains("sched") | contains("air") | contains("hour")))
Question 4: What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# ℹ 336,766 more rows
The any_of() function is a selection helper for select() that is more forgiving than all_of(): it selects whichever of the named columns exist and silently ignores the rest. Suppose we add a non-existent variable takeoff to variables; any_of() returns exactly the same output:
variables <- c("year", "month", "day", "dep_delay", "arr_delay", "takeoff")
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# ℹ 336,766 more rows
whereas all_of() would return an error.
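To see the contrast, a sketch of the same selection with all_of(); try() just keeps the error from stopping the script:
variables <- c("year", "month", "day", "dep_delay", "arr_delay", "takeoff")
try(flights |> select(all_of(variables))) # errors because takeoff is not a column of flights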
Question 5: Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |> select(contains("TIME"))
The contains() helper has an ignore.case argument, which is TRUE by default, so contains("TIME") matches the lowercase time columns. To change that default, set ignore.case = FALSE:
flights |>
select(contains("TIME", ignore.case = FALSE))
# A tibble: 336,776 × 0
With the case-sensitive match, no columns are selected: no column name in flights contains the uppercase string "TIME".
Question 6: Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
flights |>
rename(air_time_min = air_time) |>
relocate(air_time_min, .before = "year")
# A tibble: 336,776 × 19
air_time_min year month day dep_time sched_dep_time dep_delay arr_time
<dbl> <int> <int> <int> <int> <int> <dbl> <int>
1 227 2013 1 1 517 515 2 830
2 227 2013 1 1 533 529 4 850
3 160 2013 1 1 542 540 2 923
4 183 2013 1 1 544 545 -1 1004
5 116 2013 1 1 554 600 -6 812
6 150 2013 1 1 554 558 -4 740
7 158 2013 1 1 555 600 -5 913
8 53 2013 1 1 557 600 -3 709
9 140 2013 1 1 557 600 -3 838
10 138 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Question 7: Why doesn’t the following work, and what does the error mean?
flights |>
select(tailnum) |>
arrange(arr_delay)
Error in `arrange()`:
ℹ In argument: `..1 = arr_delay`.
Caused by error:
! object 'arr_delay' not found
select(tailnum) keeps only the tailnum column, so by the time arrange() runs, arr_delay no longer exists in the data frame. The error message says exactly that: the object arr_delay could not be found.
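One way to make the pipeline work (a sketch) is to arrange while arr_delay still exists and select afterwards:
flights |>
  arrange(arr_delay) |>
  select(tailnum)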
Question 1: Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
# Question 1
flights |>
group_by(carrier) |>
summarize(delay = mean(dep_delay, na.rm = TRUE)) |>
arrange(desc(delay))
# A tibble: 16 × 2
carrier delay
<chr> <dbl>
1 F9 20.2
2 EV 20.0
3 YV 19.0
4 FL 18.7
5 WN 17.7
6 9E 16.7
7 B6 13.0
8 VX 12.9
9 OO 12.6
10 UA 12.1
11 MQ 10.6
12 DL 9.26
13 AA 8.59
14 AS 5.80
15 HA 4.90
16 US 3.78
F9 (Frontier Airlines) has the worst average departure delay. Disentangling bad airports from bad carriers is difficult because carriers fly very different mixes of routes: a carrier may look bad simply because it serves delay-prone destinations. Grouping by both carrier and dest lets you compare carriers on the same destinations, but many carrier-destination pairs have few flights or only one carrier, so the comparison remains imperfect.
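Following the hint, a sketch of the per-route comparison (the column names avg_dep_delay and n are just illustrative):
flights |>
  group_by(carrier, dest) |>
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(desc(avg_dep_delay))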
Question 2: Find the flights that are most delayed upon departure from each destination.
flights |>
group_by(dest) |>
arrange(dest, desc(dep_delay)) |>
slice_head(n = 5) |>
relocate(dest, dep_delay)
# A tibble: 517 × 19
# Groups: dest [105]
dest dep_delay year month day dep_time sched_dep_time arr_time
<chr> <dbl> <int> <int> <int> <int> <int> <int>
1 ABQ 142 2013 12 14 2223 2001 133
2 ABQ 139 2013 12 17 2220 2001 120
3 ABQ 125 2013 7 30 2212 2007 57
4 ABQ 125 2013 9 2 2212 2007 48
5 ABQ 119 2013 7 23 2206 2007 116
6 ACK 219 2013 7 23 1139 800 1250
7 ACK 138 2013 7 2 1018 800 1119
8 ACK 117 2013 7 4 957 800 1106
9 ACK 101 2013 5 30 1321 1140 1419
10 ACK 100 2013 6 24 940 800 1111
# ℹ 507 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Question 3: How do delays vary over the course of the day? Illustrate your answer with a plot.
flights |>
group_by(hour) |>
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
ggplot(aes(x = hour, y = avg_dep_delay)) +
geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 1 rows containing missing values (`geom_point()`).
Average departure delay rises steadily through the day and peaks around 7:00 P.M. (hour 19), after which it falls again, though it never drops back to the low level seen in the early morning.
Question 4: What happens if you supply a negative n to slice_min() and friends?
flights |>
slice_min(dep_delay, n = 5) |>
relocate(dep_delay)
# A tibble: 5 × 19
dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
<dbl> <int> <int> <int> <int> <int> <int> <int>
1 -43 2013 12 7 2040 2123 40 2352
2 -33 2013 2 3 2022 2055 2240 2338
3 -32 2013 11 10 1408 1440 1549 1559
4 -30 2013 1 11 1900 1930 2233 2243
5 -27 2013 1 29 1703 1730 1947 1957
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
slice_min(dep_delay, n = -5) |>
relocate(dep_delay)
# A tibble: 336,776 × 19
dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
<dbl> <int> <int> <int> <int> <int> <int> <int>
1 -43 2013 12 7 2040 2123 40 2352
2 -33 2013 2 3 2022 2055 2240 2338
3 -32 2013 11 10 1408 1440 1549 1559
4 -30 2013 1 11 1900 1930 2233 2243
5 -27 2013 1 29 1703 1730 1947 1957
6 -26 2013 8 9 729 755 1002 955
7 -25 2013 10 23 1907 1932 2143 2143
8 -25 2013 3 30 2030 2055 2213 2250
9 -24 2013 3 2 1431 1455 1601 1631
10 -24 2013 5 5 934 958 1225 1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
slice_max(dep_delay, n = 5) |>
relocate(dep_delay)
# A tibble: 5 × 19
dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
<dbl> <int> <int> <int> <int> <int> <int> <int>
1 1301 2013 1 9 641 900 1242 1530
2 1137 2013 6 15 1432 1935 1607 2120
3 1126 2013 1 10 1121 1635 1239 1810
4 1014 2013 9 20 1139 1845 1457 2210
5 1005 2013 7 22 845 1600 1044 1815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
slice_max(dep_delay, n = -5) |>
relocate(dep_delay)
# A tibble: 336,776 × 19
dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
<dbl> <int> <int> <int> <int> <int> <int> <int>
1 1301 2013 1 9 641 900 1242 1530
2 1137 2013 6 15 1432 1935 1607 2120
3 1126 2013 1 10 1121 1635 1239 1810
4 1014 2013 9 20 1139 1845 1457 2210
5 1005 2013 7 22 845 1600 1044 1815
6 960 2013 4 10 1100 1900 1342 2211
7 911 2013 3 17 2321 810 135 1020
8 899 2013 6 27 959 1900 1236 2226
9 898 2013 7 22 2257 759 121 1026
10 896 2013 12 5 756 1700 1058 2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Supplying a negative n to slice_min() or slice_max() does not subset the rows at all: every row of flights is kept, and the result is simply sorted by dep_delay, in increasing order for slice_min() and decreasing order for slice_max().
Question 5: Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
flights |> count(year, month, day, sort = TRUE)
# A tibble: 365 × 4
year month day n
<int> <int> <int> <int>
1 2013 11 27 1014
2 2013 7 11 1006
3 2013 7 8 1004
4 2013 7 10 1004
5 2013 12 2 1004
6 2013 7 18 1003
7 2013 7 25 1003
8 2013 7 12 1002
9 2013 7 9 1001
10 2013 7 17 1001
# ℹ 355 more rows
count(x, ...) is shorthand for group_by(x, ...) |> summarize(n = n()): it returns one row per unique combination of the supplied variables, together with the number of rows in that combination. The sort argument is FALSE by default; setting sort = TRUE orders the result by n in decreasing order.
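For comparison, a sketch of the equivalent pipeline spelled out with the verbs above:
flights |>
  group_by(year, month, day) |>
  summarize(n = n(), .groups = "drop") |>
  arrange(desc(n)) # same result as count(year, month, day, sort = TRUE)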
Question 6: Suppose we have the following tiny data frame:
df <- tibble(
x = 1:5,
y = c("a", "b", "a", "a", "b"),
z = c("K", "K", "L", "L", "K")
)
Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
The tibble itself is unchanged; group_by(y) only adds grouping metadata (note the # Groups: y [2] line in the output), which later verbs such as summarize() use to operate within each value of y.
df |>
group_by(y)
# A tibble: 5 × 3
# Groups: y [2]
x y z
<int> <chr> <chr>
1 1 a K
2 2 b K
3 3 a L
4 4 a L
5 5 b K
Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?
The tibble is arranged in ascending order of y. Unlike group_by() in part (a), arrange() only changes the row order; it adds no grouping, so later operations are unaffected.
df |>
arrange(y)
# A tibble: 5 × 3
x y z
<int> <chr> <chr>
1 1 a K
2 3 a L
3 4 a L
4 2 b K
5 5 b K
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
The pipeline groups the tibble by y, then summarize() computes the mean of x within each group, returning one row per value of y.
df |>
group_by(y) |>
summarize(mean_x = mean(x))
# A tibble: 2 × 2
y mean_x
<chr> <dbl>
1 a 2.67
2 b 3.5
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
The tibble is grouped by y and z, and summarize() computes the mean of x for each combination of y and z. The message points out that the result is still grouped by y (the last grouping variable, z, is dropped after summarizing) and that this behavior can be overridden with the .groups argument.
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups: y [2]
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).
The tibble is grouped by y and z, and summarize() computes the mean of x for each combination of y and z; because of .groups = "drop", the result is returned ungrouped.
df |>
group_by(y, z) |>
summarize(mean_x = mean(x), .groups = "drop")
# A tibble: 3 × 3
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
Both pipelines group the tibble by y and z and compute the mean of x for each combination of y and z. With summarize() the result has one row per group (three rows here), whereas with mutate() the group mean is added as a new column and the data frame keeps its original five rows.
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups: y [2]
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
df |>
group_by(y, z) |>
mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups: y, z [3]
x y z mean_x
<int> <chr> <chr> <dbl>
1 1 a K 1
2 2 b K 3.5
3 3 a L 3.5
4 4 a L 3.5
5 5 b K 3.5
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
These are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.
There are three interrelated rules that make a dataset tidy:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.
# Compute rate per 10,000
table1 |>
mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
# Compute total cases per year
table1 |>
group_by(year) |>
summarize(total_cases = sum(cases))
# A tibble: 2 × 2
year total_cases
<dbl> <dbl>
1 1999 250740
2 2000 296920
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
For each of the sample tables, describe what each observation and each column represents.
table1: Each observation is one country in one year, with TB data documented by the World Health Organization for Afghanistan, Brazil, and China in 1999 and 2000. The columns represent country, year, TB cases, and population, respectively.
table2: The observations cover the same country-year combinations as table1, but each is split across two rows, one for TB cases and one for population. The variable names are stored in the type column and their values in the count column.
table3: Each observation is again one country in one year, but the rate column combines cases and population into a single character string of the form cases/population.
Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:
table2 |>
pivot_wider(names_from = "type", values_from = "count") |>
mutate(rate = (cases / population) * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
table3 |>
separate_wider_delim(rate, "/", names = c("cases", "population")) |>
mutate(across(c(cases, population), as.numeric)) |>
mutate(rate = (cases / population) * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
Question 1: What function would you use to read a file where fields were separated with “|”?
It should be read using the read_delim() function with delim = "|".
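A minimal sketch with inline text, just to illustrate the call:
read_delim("x|y\n1|2\n3|4", delim = "|")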
Question 4: Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?
"x,y\n1,'a,b'"We will specify the quote argument:
read_csv("x, y\n1, 'a,b'", quote = "\'")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 1 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): y
dbl (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 2
x y
<dbl> <chr>
1 1 a,b
Question 5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")There are two column headers but three values in each row, so the last two values get merged:
read_csv("a,b\n1,2,3\n4,5,6")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 2 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): a
num (1): b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 2
a b
<dbl> <dbl>
1 1 23
2 4 56
There are three column headers but only two values in the first row, so the last column gets an NA there; the second row has four values, so its last two values are merged:
read_csv("a,b,c\n1,2\n1,2,3,4")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): a, b
num (1): c
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 NA
2 1 2 34
The opening quote before the 1 is never closed, so no complete row can be read in; only the header is parsed and the result has zero rows:
read_csv("a,b\n\"1")Rows: 0 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): a, b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 0 × 2
# ℹ 2 variables: a <chr>, b <chr>
Each column has a numerical and a character value, so the column type is coerced to character:
read_csv("a,b\n1,2\na,b")Rows: 2 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): a, b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 2
a b
<chr> <chr>
1 1 2
2 a b
The delimiter is ";" but read_csv() expects ",", so the text is read in as a single-column data frame with a single observation:
read_csv("a;b\n1;3")Rows: 1 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): a;b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 1
`a;b`
<chr>
1 1;3
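Either of the following reads the semicolon-separated text correctly; a sketch (read_csv2() assumes ";" as the field separator and "," as the decimal mark):
read_delim("a;b\n1;3", delim = ";")
read_csv2("a;b\n1;3")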
Question 6: Practice referring to non-syntactic names in the following data frame by:
Extracting the variable called 1.
Plotting a scatterplot of 1 vs. 2.
Creating a new column called 3, which is 2 divided by 1.
Renaming the columns to one, two, and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
Extracting the variable called 1:
annoying |>
select(`1`)
# A tibble: 10 × 1
`1`
<int>
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Plotting a scatterplot of 1 vs. 2:
ggplot(annoying, aes(x = `2`, y = `1`)) +
geom_point()
Creating a new column called 3, which is 2 divided by 1:
annoying |>
mutate(`3` = `2` / `1`)
# A tibble: 10 × 3
`1` `2` `3`
<int> <dbl> <dbl>
1 1 1.28 1.28
2 2 3.83 1.92
3 3 6.70 2.23
4 4 8.06 2.01
5 5 9.46 1.89
6 6 12.7 2.12
7 7 12.4 1.78
8 8 15.5 1.94
9 9 16.3 1.81
10 10 21.7 2.17
Renaming the columns to one, two, and three:
annoying |>
mutate(`3` = `2` / `1`) |>
rename(
"one" = `1`,
"two" = `2`,
"three" = `3`
)
# A tibble: 10 × 3
one two three
<int> <dbl> <dbl>
1 1 1.28 1.28
2 2 3.83 1.92
3 3 6.70 2.23
4 4 8.06 2.01
5 5 9.46 1.89
6 6 12.7 2.12
7 7 12.4 1.78
8 8 15.5 1.94
9 9 16.3 1.81
10 10 21.7 2.17
readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:
Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.
Does it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.
Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Section 17.2).
Otherwise, it must be a string.
You can see that behavior in action in this simple example:
read_csv("
logical, numeric, date, string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi
")Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): string
dbl (1): numeric
lgl (1): logical
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
logical numeric date string
<lgl> <dbl> <date> <chr>
1 TRUE 1 2021-01-15 abc
2 FALSE 4.5 2021-02-15 def
3 TRUE Inf 2021-02-16 ghi
The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.
simple_csv <- "
x
10
.
20
30"Reading it without any additional arguments, x becomes a character column:
read_csv(simple_csv)
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
x
<chr>
1 10
2 .
3 20
4 30
In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:
df <- read_csv(
simple_csv,
col_types = list(x = col_double())
)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Now read_csv() reports that there was a problem, and tells us we can find out more with problems():
problems(df)
# A tibble: 1 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 3 1 a double . /private/var/folders/yq/q9fv54y91vb70rhxtfs06ysh0…
This tells us that there was a problem in row 3, col 1, where readr expected a double but got a ".". That suggests this dataset uses . for missing values. If we then set na = ".", the automatic guessing succeeds, giving us the numeric column we want:
read_csv(simple_csv, na = ".")
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
x
<dbl>
1 10
2 NA
3 20
4 30
readr provides a total of nine column types for you to use:
col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.
col_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
col_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.
col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Chapter 16 and Chapter 17.
col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Chapter 13.
col_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
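A small sketch combining a few of these types on inline data (the column names id, price, and when are made up for illustration): col_character() keeps the identifier's leading zero, col_number() strips the currency symbol, and col_date() parses the ISO date.
read_csv(
  "id,price,when\n0017,$1234,2021-01-15",
  col_types = cols(
    id = col_character(),
    price = col_number(),
    when = col_date()
  )
)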
It’s also possible to override the default column type by switching from list() to cols() and specifying .default:
another_csv <- "
x,y,z
1,2,3"
read_csv(
another_csv,
col_types = cols(.default = col_character())
)
# A tibble: 1 × 3
x y z
<chr> <chr> <chr>
1 1 2 3
Another useful helper is cols_only() which will read in only the columns you specify:
read_csv(
another_csv,
col_types = cols_only(x = col_character())
)
# A tibble: 1 × 1
x
<chr>
1 1
Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.
sales_files <- c(
"Datasets/r4ds-01-sales",
"Datasets/r4ds-02-sales",
"Datasets/r4ds-03-sales"
)
read_csv(sales_files, id = "file")
Rows: 19 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (4): year, brand, item, n
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 19 × 6
file month year brand item n
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Datasets/r4ds-01-sales January 2019 1 1234 3
2 Datasets/r4ds-01-sales January 2019 1 8721 9
3 Datasets/r4ds-01-sales January 2019 1 1822 2
4 Datasets/r4ds-01-sales January 2019 2 3333 1
5 Datasets/r4ds-01-sales January 2019 2 2156 9
6 Datasets/r4ds-01-sales January 2019 2 3987 6
7 Datasets/r4ds-01-sales January 2019 2 3827 6
8 Datasets/r4ds-02-sales February 2019 1 1234 8
9 Datasets/r4ds-02-sales February 2019 1 8721 2
10 Datasets/r4ds-02-sales February 2019 1 1822 3
11 Datasets/r4ds-02-sales February 2019 2 3333 1
12 Datasets/r4ds-02-sales February 2019 2 2156 3
13 Datasets/r4ds-02-sales February 2019 2 3987 6
14 Datasets/r4ds-03-sales March 2019 1 1234 3
15 Datasets/r4ds-03-sales March 2019 1 3627 1
16 Datasets/r4ds-03-sales March 2019 1 8820 3
17 Datasets/r4ds-03-sales March 2019 2 7253 1
18 Datasets/r4ds-03-sales March 2019 2 8766 3
19 Datasets/r4ds-03-sales March 2019 2 8288 6
The id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.
If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names.
sales_files <- list.files("Datasets", pattern = "sales", full.names = TRUE)
sales_files
[1] "Datasets/r4ds-01-sales" "Datasets/r4ds-02-sales" "Datasets/r4ds-03-sales"