Whole Game: Selected Examples and Solutions

Author

Azam Raduan

1. Data Visualization

1.1 Introduction

Data and packages

library(tidyverse)
Warning: package 'readr' was built under R version 4.2.3
library(palmerpenguins)
library(ggthemes)
Warning: package 'ggthemes' was built under R version 4.2.3

Example Plot: Body mass and flipper length

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) + 
  geom_smooth(method = "lm") +
  labs(title = "Body mass and flipper length",
       subtitle = "Dimensions of Adelie, Chinstrap, and Gentoo Penguins",
       x = "Flipper length (mm)", y = "Body mass (g)",
       color = "Species", shape = "Species") +
  scale_color_colorblind()

1.2.5 Selected Solutions

Question 3: Make a scatterplot of bill_depth_mm vs bill_length_mm . Describe the relationship between these two variables.

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(color = species, shape = species))
Warning: Removed 2 rows containing missing values (`geom_point()`).

It doesn’t seem like there is any linear relationship between the variables, except that it tends to cluster along different species.

Question 4: Make a scatterplot of species vs. bill_depth_mm ? Choose a better geom

ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Question 8: Replicate this visualization:

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = bill_depth_mm)) +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

bill_depth_mm should be mapped to geom_point and it should be mapped locally.

1.5.5 Selected Solutions

Question 2: Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty))
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(size = cty))
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty, size = cty))

The shape argument can’t be used, since numerical variables are continuous.

Question 5: Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) + facet_wrap(~species)
Warning: Removed 2 rows containing missing values (`geom_point()`).

Question 6: Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm, 
                     color = species, shape = species)) +
  geom_point() + labs(color = "Species")

The color mapping is mapped both locally and globally. Remove labs(color = "Species') to obtain one legend.

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, y = bill_depth_mm, 
                     color = species, shape = species)) + 
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

Question 7: Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

The first plot is about the population of the island with the species makeup, the second plot is about the proportion each species make up for each island.

3. Data Transformation

library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

3.2.5 Selected Solutions

Question 1: In a single pipeline for each condition, find all flights that meet the condition:

  • Had an arrival delay of two or more hours

  • Flew to Houston (IAH or HOU)

  • Were operated by United, American, or Delta

  • Departed in summer (July, August, and September)

  • Arrived more than two hours late, but didn’t leave late

  • Were delayed by at least an hour, but made up over 30 minutes in flight

# Had an arrival delay of two or more hours
flights |> 
  filter(arr_delay >= 120)
# A tibble: 10,200 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      811            630       101     1047            830
 2  2013     1     1      848           1835       853     1001           1950
 3  2013     1     1      957            733       144     1056            853
 4  2013     1     1     1114            900       134     1447           1222
 5  2013     1     1     1505           1310       115     1638           1431
 6  2013     1     1     1525           1340       105     1831           1626
 7  2013     1     1     1549           1445        64     1912           1656
 8  2013     1     1     1558           1359       119     1718           1515
 9  2013     1     1     1732           1630        62     2028           1825
10  2013     1     1     1803           1620       103     2008           1750
# ℹ 10,190 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Flew to Houston (IAH or HOU)
flights |> 
  filter(dest %in% c("IAH", "HOU"))
# A tibble: 9,313 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      623            627        -4      933            932
 4  2013     1     1      728            732        -4     1041           1038
 5  2013     1     1      739            739         0     1104           1038
 6  2013     1     1      908            908         0     1228           1219
 7  2013     1     1     1028           1026         2     1350           1339
 8  2013     1     1     1044           1045        -1     1352           1351
 9  2013     1     1     1114            900       134     1447           1222
10  2013     1     1     1205           1200         5     1503           1505
# ℹ 9,303 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Were operated by United, American, or Delta
flights |> 
  filter(carrier %in% c("UA", "AA", "DL"))
# A tibble: 139,504 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            600        -6      812            837
 5  2013     1     1      554            558        -4      740            728
 6  2013     1     1      558            600        -2      753            745
 7  2013     1     1      558            600        -2      924            917
 8  2013     1     1      558            600        -2      923            937
 9  2013     1     1      559            600        -1      941            910
10  2013     1     1      559            600        -1      854            902
# ℹ 139,494 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Departed in summer (July, August, and September)
flights |> 
  filter(month %in% c(7, 8, 9))
# A tibble: 86,326 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7     1        1           2029       212      236           2359
 2  2013     7     1        2           2359         3      344            344
 3  2013     7     1       29           2245       104      151              1
 4  2013     7     1       43           2130       193      322             14
 5  2013     7     1       44           2150       174      300            100
 6  2013     7     1       46           2051       235      304           2358
 7  2013     7     1       48           2001       287      308           2305
 8  2013     7     1       58           2155       183      335             43
 9  2013     7     1      100           2146       194      327             30
10  2013     7     1      100           2245       135      337            135
# ℹ 86,316 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Arrived more than two hours late, but didn’t leave late
flights |> 
  filter(arr_delay > 120 & dep_delay <= 0)
# A tibble: 29 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    27     1419           1420        -1     1754           1550
 2  2013    10     7     1350           1350         0     1736           1526
 3  2013    10     7     1357           1359        -2     1858           1654
 4  2013    10    16      657            700        -3     1258           1056
 5  2013    11     1      658            700        -2     1329           1015
 6  2013     3    18     1844           1847        -3       39           2219
 7  2013     4    17     1635           1640        -5     2049           1845
 8  2013     4    18      558            600        -2     1149            850
 9  2013     4    18      655            700        -5     1213            950
10  2013     5    22     1827           1830        -3     2217           2010
# ℹ 19 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights |> 
  filter(dep_delay >= 60 & air_time > 30)
# A tibble: 26,657 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      811            630       101     1047            830
 2  2013     1     1      826            715        71     1136           1045
 3  2013     1     1      848           1835       853     1001           1950
 4  2013     1     1      957            733       144     1056            853
 5  2013     1     1     1114            900       134     1447           1222
 6  2013     1     1     1120            944        96     1331           1213
 7  2013     1     1     1301           1150        71     1518           1345
 8  2013     1     1     1337           1220        77     1649           1531
 9  2013     1     1     1400           1250        70     1645           1502
10  2013     1     1     1505           1310       115     1638           1431
# ℹ 26,647 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 2: Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

flights |> filter(is.na(dep_delay) == F) |> arrange(desc(dep_delay))
# A tibble: 328,521 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 328,511 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> arrange(sched_dep_time, dep_time, dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7    27       NA            106        NA       NA            245
 2  2013     5     8      445            500       -15      620            640
 3  2013     5     5      446            500       -14      636            640
 4  2013     9     4      446            500       -14      618            648
 5  2013    10     1      447            500       -13      614            648
 6  2013     9    19      447            500       -13      620            648
 7  2013     1    29      448            500       -12      635            648
 8  2013    12    27      448            500       -12      648            651
 9  2013     5     7      448            500       -12      624            640
10  2013    10     2      449            500       -11      620            648
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 3: Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

flights |> arrange(distance/air_time)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    28     1917           1825        52     2118           1935
 2  2013     6    29      755            800        -5     1035            909
 3  2013     8    28      932            940        -8     1116           1051
 4  2013     1    30     1037            955        42     1221           1100
 5  2013    11    27      556            600        -4      727            658
 6  2013     5    21      558            600        -2      721            657
 7  2013    12     9     1540           1535         5     1720           1656
 8  2013     6    10     1356           1300        56     1646           1414
 9  2013     7    28     1322           1325        -3     1612           1432
10  2013     4    11     1349           1345         4     1542           1453
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 4: Was there a flight on every day of 2013?

flights |> distinct(year, month, day)
# A tibble: 365 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     2
 3  2013     1     3
 4  2013     1     4
 5  2013     1     5
 6  2013     1     6
 7  2013     1     7
 8  2013     1     8
 9  2013     1     9
10  2013     1    10
# ℹ 355 more rows

Yes.

Question 5: Which flights traveled the farthest distance? Which traveled the least distance?

flights |>  arrange(desc(distance))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      857            900        -3     1516           1530
 2  2013     1     2      909            900         9     1525           1530
 3  2013     1     3      914            900        14     1504           1530
 4  2013     1     4      900            900         0     1516           1530
 5  2013     1     5      858            900        -2     1519           1530
 6  2013     1     6     1019            900        79     1558           1530
 7  2013     1     7     1042            900       102     1620           1530
 8  2013     1     8      901            900         1     1504           1530
 9  2013     1     9      641            900      1301     1242           1530
10  2013     1    10      859            900        -1     1449           1530
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>  arrange(distance)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7    27       NA            106        NA       NA            245
 2  2013     1     3     2127           2129        -2     2222           2224
 3  2013     1     4     1240           1200        40     1333           1306
 4  2013     1     4     1829           1615       134     1937           1721
 5  2013     1     4     2128           2129        -1     2218           2224
 6  2013     1     5     1155           1200        -5     1241           1306
 7  2013     1     6     2125           2129        -4     2224           2224
 8  2013     1     7     2124           2129        -5     2212           2224
 9  2013     1     8     2127           2130        -3     2304           2225
10  2013     1     9     2126           2129        -3     2217           2224
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 6: Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

flights |> filter(distance > 100) |> arrange(year, month, day, dep_time)
# A tibble: 335,143 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 335,133 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> 
  arrange(year, month, day, dep_time) |> 
  filter(distance > 100) 
# A tibble: 335,143 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 335,133 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

3.3.5 Selected Solutions

Question 1: Compare dep_timesched_dep_time, and dep_delay. How would you expect those three numbers to be related?

flights |> select(contains("dep"))
# A tibble: 336,776 × 3
   dep_time sched_dep_time dep_delay
      <int>          <int>     <dbl>
 1      517            515         2
 2      533            529         4
 3      542            540         2
 4      544            545        -1
 5      554            600        -6
 6      554            558        -4
 7      555            600        -5
 8      557            600        -3
 9      557            600        -3
10      558            600        -2
# ℹ 336,766 more rows
flights |> 
  mutate(obt_delay = dep_time - sched_dep_time,
         dep_delay = dep_delay,
         .keep = "used")
# A tibble: 336,776 × 4
   dep_time sched_dep_time dep_delay obt_delay
      <int>          <int>     <dbl>     <int>
 1      517            515         2         2
 2      533            529         4         4
 3      542            540         2         2
 4      544            545        -1        -1
 5      554            600        -6       -46
 6      554            558        -4        -4
 7      555            600        -5       -45
 8      557            600        -3       -43
 9      557            600        -3       -43
10      558            600        -2       -42
# ℹ 336,766 more rows

Question 2: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights |> select(dep_time, dep_delay, arr_time, arr_delay)
flights |> 
  select((contains("dep") | contains("arr")) & 
           !(contains("sched") | carrier))
flights |> 
  select((contains("time") | contains("delay")) & 
           !(contains("sched") | contains("air") | contains("hour")))

Question 4: What does the any_of() function do? Why might it be helpful in conjunction with this vector?

variables <- c("year", "month", "day", "dep_delay", "arr_delay")
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows

The any_of() function is a selection helper for the select() , which allows for less stricter selection compared to all_of() . Suppose we introduce a non-existent variable takeoff in variables, then  any_of()  would return an exact same output:

variables <- c("year", "month", "day", "dep_delay", "arr_delay", "takeoff")
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows

whereas all_of() would return an error.

Question 5: Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

The contains() function contains the ignore.case argument, which is set to by default. to change this, set it to FALSE :

flights |> 
  select(contains("TIME", ignore.case = FALSE))
# A tibble: 336,776 × 0

The output implies that there is no uppercase variable called TIME in flights.

Question 6: Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |> 
  rename(air_time_min = air_time) |> 
  relocate(air_time_min, .before = "year")
# A tibble: 336,776 × 19
   air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
          <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1          227  2013     1     1      517            515         2      830
 2          227  2013     1     1      533            529         4      850
 3          160  2013     1     1      542            540         2      923
 4          183  2013     1     1      544            545        -1     1004
 5          116  2013     1     1      554            600        -6      812
 6          150  2013     1     1      554            558        -4      740
 7          158  2013     1     1      555            600        -5      913
 8           53  2013     1     1      557            600        -3      709
 9          140  2013     1     1      557            600        -3      838
10          138  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 7: Why doesn’t the following work, and what does the error mean?

  flights |> 
    select(tailnum) |> 
    arrange(arr_delay)
Error in `arrange()`:
ℹ In argument: `..1 = arr_delay`.
Caused by error:
! object 'arr_delay' not found

The select() function is already specified to select only tailnum . Since another pipe is placed after select(tailnum), arr-delay isn’t able to be arranged because it was not included as the selected objects.

3.5.7 Selected Solutions

Question 1: Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

# Question 1
flights |> 
  group_by(carrier) |> 
  summarize(delay = mean(dep_delay, na.rm = T)) |> 
  arrange(desc(delay))
# A tibble: 16 × 2
   carrier delay
   <chr>   <dbl>
 1 F9      20.2 
 2 EV      20.0 
 3 YV      19.0 
 4 FL      18.7 
 5 WN      17.7 
 6 9E      16.7 
 7 B6      13.0 
 8 VX      12.9 
 9 OO      12.6 
10 UA      12.1 
11 MQ      10.6 
12 DL       9.26
13 AA       8.59
14 AS       5.80
15 HA       4.90
16 US       3.78

F9 has the worst delays.

Question 2: Find the flights that are most delayed upon departure from each destination.

flights |> 
  group_by(dest) |> 
  arrange(dest, desc(dep_delay)) |>
  slice_head(n = 5) |> 
  relocate(dest, dep_delay)
# A tibble: 517 × 19
# Groups:   dest [105]
   dest  dep_delay  year month   day dep_time sched_dep_time arr_time
   <chr>     <dbl> <int> <int> <int>    <int>          <int>    <int>
 1 ABQ         142  2013    12    14     2223           2001      133
 2 ABQ         139  2013    12    17     2220           2001      120
 3 ABQ         125  2013     7    30     2212           2007       57
 4 ABQ         125  2013     9     2     2212           2007       48
 5 ABQ         119  2013     7    23     2206           2007      116
 6 ACK         219  2013     7    23     1139            800     1250
 7 ACK         138  2013     7     2     1018            800     1119
 8 ACK         117  2013     7     4      957            800     1106
 9 ACK         101  2013     5    30     1321           1140     1419
10 ACK         100  2013     6    24      940            800     1111
# ℹ 507 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 3: How do delays vary over the course of the day. Illustrate your answer with a plot.

flights |> 
  group_by(hour) |> 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) |> 
  ggplot(aes(x = hour, y = avg_dep_delay)) +
  geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 1 rows containing missing values (`geom_point()`).

The average hourly departure delays generally increase and peaked 7:00 P.M., after which the delays begin to drop, although not as low as the beginning of the day

Question 4: What happens if you supply a negative n to slice_min() and friends?

flights |> 
  slice_min(dep_delay, n = 5) |> 
  relocate(dep_delay)
# A tibble: 5 × 19
  dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
      <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
1       -43  2013    12     7     2040           2123       40           2352
2       -33  2013     2     3     2022           2055     2240           2338
3       -32  2013    11    10     1408           1440     1549           1559
4       -30  2013     1    11     1900           1930     2233           2243
5       -27  2013     1    29     1703           1730     1947           1957
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> 
  slice_min(dep_delay, n = -5) |> 
  relocate(dep_delay)
# A tibble: 336,776 × 19
   dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
       <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
 1       -43  2013    12     7     2040           2123       40           2352
 2       -33  2013     2     3     2022           2055     2240           2338
 3       -32  2013    11    10     1408           1440     1549           1559
 4       -30  2013     1    11     1900           1930     2233           2243
 5       -27  2013     1    29     1703           1730     1947           1957
 6       -26  2013     8     9      729            755     1002            955
 7       -25  2013    10    23     1907           1932     2143           2143
 8       -25  2013     3    30     2030           2055     2213           2250
 9       -24  2013     3     2     1431           1455     1601           1631
10       -24  2013     5     5      934            958     1225           1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> 
  slice_max(dep_delay, n = 5) |> 
  relocate(dep_delay)
# A tibble: 5 × 19
  dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
      <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
1      1301  2013     1     9      641            900     1242           1530
2      1137  2013     6    15     1432           1935     1607           2120
3      1126  2013     1    10     1121           1635     1239           1810
4      1014  2013     9    20     1139           1845     1457           2210
5      1005  2013     7    22      845           1600     1044           1815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights |> 
  slice_max(dep_delay, n = -5) |> 
  relocate(dep_delay)
# A tibble: 336,776 × 19
   dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
       <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
 1      1301  2013     1     9      641            900     1242           1530
 2      1137  2013     6    15     1432           1935     1607           2120
 3      1126  2013     1    10     1121           1635     1239           1810
 4      1014  2013     9    20     1139           1845     1457           2210
 5      1005  2013     7    22      845           1600     1044           1815
 6       960  2013     4    10     1100           1900     1342           2211
 7       911  2013     3    17     2321            810      135           1020
 8       899  2013     6    27      959           1900     1236           2226
 9       898  2013     7    22     2257            759      121           1026
10       896  2013    12     5      756           1700     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Supplying a negative n in slice_min() and slice_max() will arrange the dep_delay in increasing and decreasing manner respectively, without slicing flights.

Question 5: Explain what count() does in terms of the dplyr verbs you just learned. What does the sortargument to count() do?

flights |> count(year, month, day, sort = T)
# A tibble: 365 × 4
    year month   day     n
   <int> <int> <int> <int>
 1  2013    11    27  1014
 2  2013     7    11  1006
 3  2013     7     8  1004
 4  2013     7    10  1004
 5  2013    12     2  1004
 6  2013     7    18  1003
 7  2013     7    25  1003
 8  2013     7    12  1002
 9  2013     7     9  1001
10  2013     7    17  1001
# ℹ 355 more rows

The count() function finds the number of occurrences for every row, with sort set to FALSE as default. Specifying sort to TRUE sorts the row according to the number of occurrences in decreasing order.

Question 6: Suppose we have the following tiny data frame:

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)
  1. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

    The tibble is grouped by y.

    df |> 
      group_by(y)
    # A tibble: 5 × 3
    # Groups:   y [2]
          x y     z    
      <int> <chr> <chr>
    1     1 a     K    
    2     2 b     K    
    3     3 a     L    
    4     4 a     L    
    5     5 b     K    
  2. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?

    The tibble is arranged is ascending order according to y.

    df |> 
      arrange(y)
    # A tibble: 5 × 3
          x y     z    
      <int> <chr> <chr>
    1     1 a     K    
    2     3 a     L    
    3     4 a     L    
    4     2 b     K    
    5     5 b     K    
  3. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

    The tibble is grouped by y in ascending order, then calculates the mean of x corresponding to each group of y .

    df |> 
      group_by(y) |> 
      summarize(mean_x = mean(x))
    # A tibble: 2 × 2
      y     mean_x
      <chr>  <dbl>
    1 a       2.67
    2 b       3.5 
  4. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

    The tibble is grouped in order by y and z and calculates the mean of x corresponding to each combination y and z. The data frame is grouped by y.

    df |> 
      group_by(y, z) |> 
      summarize(mean_x = mean(x))
    `summarise()` has grouped output by 'y'. You can override using the `.groups`
    argument.
    # A tibble: 3 × 3
    # Groups:   y [2]
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5
  5. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).

    The tibble is grouped in order by y and z and calculates the mean of x corresponding to each combination of y and z. The data frame is ungrouped.

    df |> 
      group_by(y, z) |> 
      summarize(mean_x = mean(x), .groups = "drop")
    # A tibble: 3 × 3
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5
  6. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

    Both pipeline groups the tibble by y and z and calculates the mean of x corresponding to each combination of y and z. With summarize() the data frame has one row per group combination while with mutate() the data frame has the same number of rows as the original data frame.

    df |> 
      group_by(y, z) |> 
      summarize(mean_x = mean(x))
    `summarise()` has grouped output by 'y'. You can override using the `.groups`
    argument.
    # A tibble: 3 × 3
    # Groups:   y [2]
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5
    df |> 
      group_by(y, z) |> 
      mutate(mean_x = mean(x))
    # A tibble: 5 × 4
    # Groups:   y, z [3]
          x y     z     mean_x
      <int> <chr> <chr>  <dbl>
    1     1 a     K        1  
    2     2 b     K        3.5
    3     3 a     L        3.5
    4     4 a     L        3.5
    5     5 b     K        3.5

5. Data Tidying

5.2 Tidy data

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

These are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.

  2. Each observation is a row; each row is an observation.

  3. Each value is a cell; each cell is a single value.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.

# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 
# Compute total cases per year
table1 |> 
  group_by(year) |> 
  summarize(total_cases = sum(cases))
# A tibble: 2 × 2
   year total_cases
  <dbl>       <dbl>
1  1999      250740
2  2000      296920
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000

5.2.1 Selected Solutions

  1. For each of the sample tables, describe what each observation and each column represents.

    • table1: Each observation represents the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. Each column represents country, year, TB cases, and population respectively.
    • table2: Observations are similar to table1 but divided into two - TB cases and population. The variables cases and population are grouped under type and its values under the variable count
    • table3: Each observation represent the rate of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. THe rate column is derived from the proportion of cases and population from each combination of country and year.
  2. Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:

    1. Extract the number of TB cases per country per year.
    2. Extract the matching population per country per year.
    3. Divide cases by population, and multiply by 10000.
    4. Store back in the appropriate place.
    table2 |> 
      pivot_wider(names_from = "type", values_from = "count") |> 
      mutate(rate = (cases / population) * 10000)
    # A tibble: 6 × 5
      country      year  cases population  rate
      <chr>       <dbl>  <dbl>      <dbl> <dbl>
    1 Afghanistan  1999    745   19987071 0.373
    2 Afghanistan  2000   2666   20595360 1.29 
    3 Brazil       1999  37737  172006362 2.19 
    4 Brazil       2000  80488  174504898 4.61 
    5 China        1999 212258 1272915272 1.67 
    6 China        2000 213766 1280428583 1.67 
    table3 |> 
      separate_wider_delim(rate, "/", names = c("cases", "population")) |> 
      mutate_at(c("cases", "population"), as.double) |> 
      mutate(rate = (cases / population) * 10000)
    # A tibble: 6 × 5
      country      year  cases population  rate
      <chr>       <dbl>  <dbl>      <dbl> <dbl>
    1 Afghanistan  1999    745   19987071 0.373
    2 Afghanistan  2000   2666   20595360 1.29 
    3 Brazil       1999  37737  172006362 2.19 
    4 Brazil       2000  80488  174504898 4.61 
    5 China        1999 212258 1272915272 1.67 
    6 China        2000 213766 1280428583 1.67 

7. Data Import

7.2 Reading data from a file

7.2.4 Selected Solutions

Question 1: What function would you use to read a file where fields were separated with “|”?

It should be read using the read_delim() function with delim = "|".

Question 4: Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

"x,y\n1,'a,b'"

We will specify the quote argument:

read_csv("x, y\n1, 'a,b'", quote = "\'")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 1 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): y
dbl (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 2
      x y    
  <dbl> <chr>
1     1 a,b  

Question 5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
  • There are two column headers but three values in each row, so the last two values get merged:

    read_csv("a,b\n1,2,3\n4,5,6")
    Warning: One or more parsing issues, call `problems()` on your data frame for details,
    e.g.:
      dat <- vroom(...)
      problems(dat)
    Rows: 2 Columns: 2
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    dbl (1): a
    num (1): b
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # A tibble: 2 × 2
          a     b
      <dbl> <dbl>
    1     1    23
    2     4    56
  • There are 3 column headers, but only 2 values in the first row, so the last column gets an NA there, and 4 values in the secod row so the last two values are merged:

    read_csv("a,b,c\n1,2\n1,2,3,4")
    Warning: One or more parsing issues, call `problems()` on your data frame for details,
    e.g.:
      dat <- vroom(...)
      problems(dat)
    Rows: 2 Columns: 3
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    dbl (2): a, b
    num (1): c
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # A tibble: 2 × 3
          a     b     c
      <dbl> <dbl> <dbl>
    1     1     2    NA
    2     1     2    34
  • No rows are read in:

    read_csv("a,b\n\"1")
    Rows: 0 Columns: 2
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (2): a, b
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # A tibble: 0 × 2
    # ℹ 2 variables: a <chr>, b <chr>
  • Each column has a numerical and a character value, so the column type is coerced to character:

    read_csv("a,b\n1,2\na,b")
    Rows: 2 Columns: 2
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (2): a, b
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # A tibble: 2 × 2
      a     b    
      <chr> <chr>
    1 1     2    
    2 a     b    
  • The delimiter is ; but it’s not specified, therefore this is read in as a single-column data frame with a single observation:

    read_csv("a;b\n1;3")
    Rows: 1 Columns: 1
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (1): a;b
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # A tibble: 1 × 1
      `a;b`
      <chr>
    1 1;3  

Question 6: Practice referring to non-syntactic names in the following data frame by:

  1. Extracting the variable called 1.
  2. Plotting a scatterplot of 1 vs. 2.
  3. Creating a new column called 3, which is 2 is divided by 1.
  4. Renaming the columns to one, two, and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
  1. Extracting the variable called 1:

    annoying |> 
      select(`1`)
    # A tibble: 10 × 1
         `1`
       <int>
     1     1
     2     2
     3     3
     4     4
     5     5
     6     6
     7     7
     8     8
     9     9
    10    10
  2. Plotting a scatterplot of 1 vs. 2:

    ggplot(annoying, aes(x = `2`, y = `1`)) +
      geom_point()

  3. Creating a new column called 3, which is 2 divided by 1:

    annoying |>
      mutate(`3` = `2` / `1`)
    # A tibble: 10 × 3
         `1`   `2`   `3`
       <int> <dbl> <dbl>
     1     1  1.28  1.28
     2     2  3.83  1.92
     3     3  6.70  2.23
     4     4  8.06  2.01
     5     5  9.46  1.89
     6     6 12.7   2.12
     7     7 12.4   1.78
     8     8 15.5   1.94
     9     9 16.3   1.81
    10    10 21.7   2.17
  4. Renaming the columns to onetwo, and three:

    annoying |>
      mutate(`3` = `2` / `1`) |>
      rename(
        "one" = `1`,
        "two" = `2`,
        "three" = `3`
      )
    # A tibble: 10 × 3
         one   two three
       <int> <dbl> <dbl>
     1     1  1.28  1.28
     2     2  3.83  1.92
     3     3  6.70  2.23
     4     4  8.06  2.01
     5     5  9.46  1.89
     6     6 12.7   2.12
     7     7 12.4   1.78
     8     8 15.5   1.94
     9     9 16.3   1.81
    10    10 21.7   2.17

7.3 Controlling column types

7.3.1 Guessing types

readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,0002 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:

  • Does it contain only FTFALSE, or TRUE (ignoring case)? If so, it’s a logical.

  • Does it contain only numbers (e.g., 1-4.55e6Inf)? If so, it’s a number.

  • Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Section 17.2).

  • Otherwise, it must be a string.

You can see that behavior in action in this simple example:

read_csv("
  logical, numeric, date, string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): string
dbl  (1): numeric
lgl  (1): logical
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
  logical numeric date       string
  <lgl>     <dbl> <date>     <chr> 
1 TRUE        1   2021-01-15 abc   
2 FALSE       4.5 2021-02-15 def   
3 TRUE      Inf   2021-02-16 ghi   

7.3.2 Missing values, column types, and problems

The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.

simple_csv <- "
  x
  10
  .
  20
  30"

Reading it without any additional arguments, x becomes a character column:

read_csv(simple_csv)
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
  x    
  <chr>
1 10   
2 .    
3 20   
4 30   

In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:

df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Now read_csv() reports that there was a problem, and tells us we can find out more with problems():

problems(df)
# A tibble: 1 × 5
    row   col expected actual file                                              
  <int> <int> <chr>    <chr>  <chr>                                             
1     3     1 a double .      /private/var/folders/yq/q9fv54y91vb70rhxtfs06ysh0…

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So then we set na = ".", the automatic guessing succeeds, giving us the numeric column that we want:

read_csv(simple_csv, na = ".")
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
      x
  <dbl>
1    10
2    NA
3    20
4    30

7.3.3 Column types

readr provides a total of nine column types for you to use:

  • col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.

  • col_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.

  • col_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.

  • col_factor()col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Chapter 16 and Chapter 17.

  • col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Chapter 13.

  • col_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.

It’s also possible to override the default column by switching from list() to cols() and specifying .default:

another_csv <- "
x,y,z
1,2,3"

read_csv(
  another_csv,
  col_types = cols(.default = col_character())
)
# A tibble: 1 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     2     3    

Another useful helper is cols_only() which will read in only the columns you specify:

read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)
# A tibble: 1 × 1
  x    
  <chr>
1 1    

7.4 Reading data from multiple files

Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.

sales_files <- c(
  "Datasets/r4ds-01-sales",
  "Datasets/r4ds-02-sales",
  "Datasets/r4ds-03-sales"
)
read_csv(sales_files, id = "file")
Rows: 19 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (4): year, brand, item, n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 19 × 6
   file                   month     year brand  item     n
   <chr>                  <chr>    <dbl> <dbl> <dbl> <dbl>
 1 Datasets/r4ds-01-sales January   2019     1  1234     3
 2 Datasets/r4ds-01-sales January   2019     1  8721     9
 3 Datasets/r4ds-01-sales January   2019     1  1822     2
 4 Datasets/r4ds-01-sales January   2019     2  3333     1
 5 Datasets/r4ds-01-sales January   2019     2  2156     9
 6 Datasets/r4ds-01-sales January   2019     2  3987     6
 7 Datasets/r4ds-01-sales January   2019     2  3827     6
 8 Datasets/r4ds-02-sales February  2019     1  1234     8
 9 Datasets/r4ds-02-sales February  2019     1  8721     2
10 Datasets/r4ds-02-sales February  2019     1  1822     3
11 Datasets/r4ds-02-sales February  2019     2  3333     1
12 Datasets/r4ds-02-sales February  2019     2  2156     3
13 Datasets/r4ds-02-sales February  2019     2  3987     6
14 Datasets/r4ds-03-sales March     2019     1  1234     3
15 Datasets/r4ds-03-sales March     2019     1  3627     1
16 Datasets/r4ds-03-sales March     2019     1  8820     3
17 Datasets/r4ds-03-sales March     2019     2  7253     1
18 Datasets/r4ds-03-sales March     2019     2  8766     3
19 Datasets/r4ds-03-sales March     2019     2  8288     6

The id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.

If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names.

sales_files <- list.files("Datasets", pattern = "sales", full.names = TRUE)
sales_files
[1] "Datasets/r4ds-01-sales" "Datasets/r4ds-02-sales" "Datasets/r4ds-03-sales"