library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(babynames)
library(nycflights13)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
Use `flights` to create `delayed`, a variable that displays whether a flight was delayed (`arr_delay > 0`).
Then, remove all rows that contain an NA in `delayed`.
Finally, create a summary table that shows the total number of delayed flights and the proportion of flights that were delayed.
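flights %>%
  mutate(delayed = arr_delay > 0) %>%                    # TRUE when the flight arrived late
  filter(!is.na(delayed)) %>%                            # drop flights with no recorded arrival delay
  summarise(total = sum(delayed), prop = mean(delayed))  # count and proportion of delayed flights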
## # A tibble: 1 × 2
## total prop
## <int> <dbl>
## 1 133004 0.406
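The summary above works because R treats TRUE as 1 and FALSE as 0 in arithmetic, so `sum()` counts the TRUE values and `mean()` gives their proportion. Here is a minimal sketch on a made-up vector (the `arr_delay` values below are hypothetical, not taken from `flights`):

```r
# Hypothetical arrival delays in minutes; NA marks a flight with no recorded arrival.
arr_delay <- c(12, -5, NA, 30, 0, -8)

delayed <- arr_delay > 0             # TRUE, FALSE, NA, TRUE, FALSE, FALSE
delayed <- delayed[!is.na(delayed)]  # drop the NA, as filter(!is.na(delayed)) does

total <- sum(delayed)   # TRUE counts as 1, so this is the number of delayed flights
prop  <- mean(delayed)  # share of TRUE values among the remaining flights

total  # 2
prop   # 0.4
```

Without the `filter()` step, a single NA would make both `sum()` and `mean()` return NA.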
In your group, fill in the blanks to:
Isolate the last letter of every name.
Create a logical variable that displays whether the last letter is one of “a”, “e”, “i”, “o”, “u”, or “y”.
Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`), and then display the results as a line plot.
(Hint: Be sure to remove each `_` before running the code)
babynames %>%
  mutate(last = str_sub(name, -1),                      # last letter of each name
         vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>%
  group_by(year, sex) %>%
  summarise(p_vowel = weighted.mean(vowel, n)) %>%      # weight each name by its count n
  ggplot() +
  geom_line(mapping = aes(year, p_vowel, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
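Why a weighted mean? Each row of `babynames` is a name, not a child, so a plain `mean(vowel)` would give the proportion of names ending in a vowel. Weighting by `n` gives the proportion of children. A small sketch with hypothetical counts (the names and numbers are made up, and base R's `substr()` stands in for `str_sub(name, -1)` so the example runs without stringr):

```r
# Hypothetical counts for one year and sex: three names and the number of babies given each.
name <- c("Emma", "Noah", "Olivia")
n    <- c(300, 500, 200)

last  <- substr(name, nchar(name), nchar(name))  # base-R stand-in for str_sub(name, -1)
vowel <- last %in% c("a", "e", "i", "o", "u", "y")

p_names    <- mean(vowel)              # 2 of 3 names end in a vowel
p_children <- weighted.mean(vowel, n)  # (300 + 200) / 1000 = 0.5 of the children

p_names
p_children
```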
Repeat the demonstration, some of whose code is below, to make a sensible graph of average TV consumption by marital status.
(Hint: Be sure to remove each `_` before running the code)
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(marital) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(marital, tvhours))) +
  geom_point()
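The graph is "sensible" because `fct_reorder()` reorders the levels of `marital` by `tvhours`, so the points rise steadily instead of following the factor's original level order. A minimal sketch of what that call does to the levels (the averages below are hypothetical, not computed from `gss_cat`):

```r
library(forcats)

# Hypothetical group averages; factor() sorts the levels alphabetically by default.
marital <- factor(c("Never married", "Married", "Widowed", "Divorced"))
tvhours <- c(3.1, 2.7, 3.9, 3.0)

levels(marital)
# "Divorced" "Married" "Never married" "Widowed"

# Reorder the levels by the value of tvhours, smallest first.
reordered <- fct_reorder(marital, tvhours)
levels(reordered)
# "Married" "Divorced" "Never married" "Widowed"
```

Only the level order changes; the values stored in the factor stay the same.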
Do you think liberals or conservatives watch more TV? Compute average TV hours by party ID and then plot the results.
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  group_by(partyid) %>%
  summarise(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point() +
  labs(y = "partyid")
What is the best time of day to fly?
Use the `hour` and `minute` variables in `flights` to make a new variable that shows the time of each flight as an `hms` object.
Then use a smooth line to plot the relationship between time of day and `arr_delay`.
flights %>%
  mutate(time = hms(hours = hour, minutes = minute)) %>%  # hms::hms() names its arguments hours and minutes
  ggplot(aes(time, arr_delay)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
## Warning: Removed 9430 rows containing missing values (geom_point).
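An `hms` value stores a time of day as a number of seconds since midnight, which is what lets ggplot2 place it on a continuous axis. A short sketch of building one from separate hour and minute values (the two times below are made up for illustration):

```r
library(hms)

# Two hypothetical scheduled departure times: 5:15 and 13:45.
dep <- hms(hours = c(5, 13), minutes = c(15, 45))

dep
# 05:15:00
# 13:45:00

# Underneath, each time is a count of seconds since midnight.
secs <- as.numeric(dep)
secs
# 18900 49500
```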
What is the best day of the week to fly?
Look at the code skeleton for Your Turn 7. Discuss with your neighbor:
Fill in the blank to:
Extract the day of the week of each flight (as a full name) from `time_hour`.
Plot the average arrival delay by day as a column chart (bar chart).
flights %>%
  mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>%
  group_by(weekday) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot() +
  geom_col(mapping = aes(x = weekday, y = avg_delay))
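To see what `wday()` returns on its own, here is a small sketch on two dates chosen for illustration. By default it returns a number with Sunday as 1; `label = TRUE, abbr = FALSE` returns an ordered factor of full day names, which is why the columns in the chart appear in calendar order. The names printed depend on your locale; the comments assume an English one.

```r
library(lubridate)

# Two example dates: 1 January 2013 (a Tuesday) and 5 January 2013 (a Saturday).
d <- as.Date(c("2013-01-01", "2013-01-05"))

day_number <- wday(d)  # Sunday = 1, so Tuesday is 3 and Saturday is 7
day_number

day_name <- wday(d, label = TRUE, abbr = FALSE)  # ordered factor of full names
day_name  # Tuesday, Saturday in an English locale
```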
Dplyr gives you three general functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data.
Package | Data Type |
---|---|
stringr | strings |
forcats | factors |
hms | times |
lubridate | dates and times |