library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(babynames)
library(nycflights13)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(hms)
## 
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
## 
##     hms

Logicals

Your Turn 1

Use flights to create delayed, a variable that displays whether a flight was delayed (arr_delay > 0).

Then, remove all rows that contain an NA in delayed.

Finally, create a summary table that shows:

  1. How many flights were delayed
  2. What proportion of flights were delayed
flights %>%
  mutate(delayed = arr_delay > 0) %>%
  filter(!is.na(delayed)) %>%
  summarise(total = sum(delayed), prop = mean(delayed))
## # A tibble: 1 × 2
##    total  prop
##    <int> <dbl>
## 1 133004 0.406

Strings

Your Turn 2

In your group, fill in the blanks to:

  1. Isolate the last letter of every name

  2. Create a logical variable that displays whether the last letter is one of “a”, “e”, “i”, “o”, “u”, or “y”.

  3. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by year and sex)

  4. and then display the results as a line plot.

(Hint: Be sure to remove each _ before running the code)

babynames %>%
  mutate(last = str_sub(name, -1),
    vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarise(p_vowel = weighted.mean(vowel, n)) %>%
  ggplot() +
    geom_line(mapping = aes(year, p_vowel, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Factors

Your Turn 3

Repeat the demonstration, some of whose code is below, to make a sensible graph of average TV consumption by marital status.

(Hint: Be sure to remove each _ before running the code)

gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(marital) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(marital, tvhours))) +
    geom_point()

Your Turn 4

Do you think liberals or conservatives watch more TV? Compute average tv hours by party ID an then plot the results.

gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(partyid) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
    geom_point() +
    labs(y = "partyid")

Dates and Times

Your Turn 5

What is the best time of day to fly?

Use the hour and minute variables in flights to make a new variable that shows the time of each flight as an hms.

Then use a smooth line to plot the relationship between time of day and arr_delay.

flights %>%
  mutate(time = hms(hour = hour, minute = minute)) %>%
  ggplot(aes(time, arr_delay)) +
    geom_point(alpha = 0.2) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
## Warning: Removed 9430 rows containing missing values (geom_point).

Your Turn 6

What is the best day of the week to fly?

Look at the code skeleton for Your Turn 7. Discuss with your neighbor:

  • What does each line do?
  • What will the missing parts need to do?

Your Turn 7

Fill in the blank to:

Extract the day of the week of each flight (as a full name) from time_hour.

Plot the average arrival delay by day as a column chart (bar chart).

flights %>%
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>%
  group_by(weekday) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot() +
    geom_col(mapping = aes(x = weekday, y = avg_delay))


Take Aways

Dplyr gives you three general functions for manipulating data: mutate(), summarise(), and group_by(). Augment these with functions from the packages below, which focus on specific types of data.

Package Data Type
stringr strings
forcats factors
hms times
lubridate dates and times