library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(babynames)
library(nycflights13)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(hms)

## 
## Attaching package: 'hms'

## The following object is masked from 'package:lubridate':
## 
##     hms

Logicals

Your Turn 1

Use flights to create delayed, a variable that displays whether a flight was delayed (arr_delay > 0).

Then, remove all rows that contain an NA in delayed.

Finally, create a summary table that shows:

How many flights were delayed
What proportion of flights were delayed

flights %>%
  mutate(delayed = arr_delay > 0) %>%
  filter(!is.na(delayed)) %>%
  summarise(total = sum(delayed), prop = mean(delayed))

## # A tibble: 1 × 2
##    total  prop
##    <int> <dbl>
## 1 133004 0.406
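
Why sum() and mean() work here: R treats TRUE as 1 and FALSE as 0 when doing arithmetic on a logical vector, so the sum counts the TRUE values and the mean gives their proportion. The following is a minimal base R sketch of that idea on a made-up vector of delays (the values are illustrative and not taken from flights).

```r
# Toy arrival delays in minutes (made-up values, not from flights)
arr_delay <- c(12, -5, NA, 30, 0)

delayed <- arr_delay > 0              # TRUE, FALSE, NA, TRUE, FALSE
delayed <- delayed[!is.na(delayed)]   # drop the NA, as filter(!is.na(delayed)) does

sum(delayed)    # number of TRUE values: 2
mean(delayed)   # proportion of TRUE values: 0.5
```

Without the NA removal step, both sum() and mean() would return NA, which is why the exercise filters first.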

Strings

Your Turn 2

In your group, fill in the blanks to:

Isolate the last letter of every name
Create a logical variable that displays whether the last letter is one of “a”, “e”, “i”, “o”, “u”, or “y”.
Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by year and sex)
and then display the results as a line plot.

(Hint: Be sure to remove each _ before running the code)

babynames %>%
  mutate(last = str_sub(name, -1),
    vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarise(p_vowel = weighted.mean(vowel, n)) %>%
  ggplot() +
    geom_line(mapping = aes(year, p_vowel, color = sex))

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
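
The two key pieces of this solution can be tried on a small scale: str_sub() with a negative position counts from the end of the string, and weighted.mean() weights each name by the number of children who received it. This sketch assumes a hypothetical three-name data set; the names and counts are invented for illustration.

```r
library(stringr)

# Hypothetical toy data: three names and how many children received each
name <- c("Emma", "Noah", "Ava")
n <- c(30, 50, 20)

last <- str_sub(name, -1)   # negative positions count from the end: "a" "h" "a"
vowel <- last %in% c("a", "e", "i", "o", "u", "y")

mean(vowel)               # share of distinct names ending in a vowel: 2/3
weighted.mean(vowel, n)   # share of children: (30 + 20) / 100 = 0.5
```

The unweighted mean answers a question about names, while the weighted mean answers the question the exercise asks, which is about children.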

Factors

Your Turn 3

Repeat the demonstration (part of its code is below) to make a sensible graph of average TV consumption by marital status.

(Hint: Be sure to remove each _ before running the code)

gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(marital) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(marital, tvhours))) +
    geom_point()
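
What makes the graph "sensible" is that fct_reorder() changes the order of the factor levels, which is what ggplot2 uses to order the axis. A minimal sketch, assuming a hypothetical three-level factor with one made-up average per level:

```r
library(forcats)

# Hypothetical toy summary: one row per group, with an average already computed
group <- factor(c("a", "b", "c"))
avg <- c(3.1, 1.8, 2.4)

levels(group)                          # default alphabetical order: "a" "b" "c"
reordered <- fct_reorder(group, avg)   # reorder levels by the value of avg
levels(reordered)                      # ascending by avg: "b" "c" "a"
```

Only the level order changes; the values stored in the factor stay the same.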

Your Turn 4

Do you think liberals or conservatives watch more TV? Compute average TV hours by party ID and then plot the results.

gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(partyid) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
    geom_point() +
    labs(y = "partyid")

Dates and Times

Your Turn 5

What is the best time of day to fly?

Use the hour and minute variables in flights to make a new variable that shows the time of each flight as an hms.

Then use a smooth line to plot the relationship between time of day and arr_delay.

flights %>%
  mutate(time = hms(hours = hour, minutes = minute)) %>%
  ggplot(aes(time, arr_delay)) +
    geom_point(alpha = 0.2) +
    geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 9430 rows containing non-finite values (stat_smooth).

## Warning: Removed 9430 rows containing missing values (geom_point).
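
To see what the new time variable holds, it can help to build a single hms value by hand. The sketch below assumes a made-up departure time of 13:05. Note that hms::hms() names its arguments seconds, minutes, hours, and days, and that because hms was attached after lubridate, a bare hms() call refers to the hms package version.

```r
library(hms)

# A made-up departure time of 13:05
dep_time <- hms(hours = 13, minutes = 5)

dep_time               # prints as 13:05:00
as.numeric(dep_time)   # stored as seconds since midnight: 47100
```

Because the value is stored as a number of seconds, ggplot2 can place it on a continuous axis, which is what geom_smooth() needs.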

Your Turn 6

What is the best day of the week to fly?

Look at the code skeleton for Your Turn 7. Discuss with your neighbor:

What does each line do?
What will the missing parts need to do?

Your Turn 7

Fill in the blank to:

Extract the day of the week of each flight (as a full name) from time_hour.

Plot the average arrival delay by day as a column chart (bar chart).

flights %>%
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>%
  group_by(weekday) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot() +
    geom_col(mapping = aes(x = weekday, y = avg_delay))
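
The effect of the label and abbr arguments of wday() is easiest to see on a single date. This sketch uses 1 January 2013, a date chosen here for illustration. The printed day names depend on the session locale, so the names in the comments assume an English locale.

```r
library(lubridate)

# A single illustrative date
d <- ymd("2013-01-01")

wday(d)                               # 3: by default weeks start on Sunday = 1
wday(d, label = TRUE)                 # abbreviated ordered factor, e.g. Tue
wday(d, label = TRUE, abbr = FALSE)   # full name, e.g. Tuesday
```

With label = TRUE the result is an ordered factor, so geom_col() draws the columns in weekday order rather than alphabetically.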

Take Aways

The dplyr package gives you three general functions for manipulating data: mutate(), summarise(), and group_by(). Augment these with functions from the packages below, each of which focuses on a specific type of data.

Package	Data Type
stringr	strings
forcats	factors
hms	times
lubridate	dates and times

Data Types

Elanur Ural

4/4/2022