Chapter 5 Homework Solutions
Introduction
These exercises are taken from the chapter on tidy data and iteration in Modern Data Science with R: http://mdsr-book.github.io.
Spreading Babies
Here is a table for babies with the name of Harrison, Roderick or Terry, born in either 1912 or 2012.
babiesHRT <-
  babynames %>%
  filter(name %in% c("Harrison", "Roderick", "Terry")) %>%
  filter(year %in% c(1912, 2012)) %>%
  rename(nbabies = n) %>%
  select(name, sex, year, nbabies)
knitr::kable(babiesHRT)

| name | sex | year | nbabies |
|---|---|---|---|
| Terry | F | 1912 | 17 |
| Harrison | M | 1912 | 170 |
| Terry | M | 1912 | 49 |
| Roderick | M | 1912 | 46 |
| Terry | F | 2012 | 17 |
| Harrison | F | 2012 | 15 |
| Harrison | M | 2012 | 2122 |
| Terry | M | 2012 | 480 |
| Roderick | M | 2012 | 204 |
Use tidyr::spread() to get the following table from babiesHRT:
| name | year | F | M |
|---|---|---|---|
| Harrison | 1912 | 0 | 170 |
| Harrison | 2012 | 15 | 2122 |
| Roderick | 1912 | 0 | 46 |
| Roderick | 2012 | 0 | 204 |
| Terry | 1912 | 17 | 49 |
| Terry | 2012 | 17 | 480 |
Again starting from babiesHRT, use tidyr::spread() to get the following table:
| name | sex | 1912 | 2012 |
|---|---|---|---|
| Harrison | F | 0 | 15 |
| Harrison | M | 170 | 2122 |
| Roderick | M | 46 | 204 |
| Terry | F | 17 | 17 |
| Terry | M | 49 | 480 |
SOLUTION
Here’s how to get the first table:
| name | year | F | M |
|---|---|---|---|
| Harrison | 1912 | 0 | 170 |
| Harrison | 2012 | 15 | 2122 |
| Roderick | 1912 | 0 | 46 |
| Roderick | 2012 | 0 | 204 |
| Terry | 1912 | 17 | 49 |
| Terry | 2012 | 17 | 480 |
Here’s how to get the second one:
| name | sex | 1912 | 2012 |
|---|---|---|---|
| Harrison | F | 0 | 15 |
| Harrison | M | 170 | 2122 |
| Roderick | M | 46 | 204 |
| Terry | F | 17 | 17 |
| Terry | M | 49 | 480 |
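The calls that produce these two tables are not shown above. A minimal sketch of what they might look like, assuming babiesHRT holds the rows printed earlier (it is rebuilt by hand here so the block runs on its own) and that missing combinations should be filled with 0:

```r
library(tibble)
library(tidyr)

# babiesHRT rebuilt by hand from the table printed earlier (an assumption made
# so that this block does not depend on the babynames package)
babiesHRT <- tribble(
  ~name,      ~sex, ~year, ~nbabies,
  "Terry",    "F",  1912,    17,
  "Harrison", "M",  1912,   170,
  "Terry",    "M",  1912,    49,
  "Roderick", "M",  1912,    46,
  "Terry",    "F",  2012,    17,
  "Harrison", "F",  2012,    15,
  "Harrison", "M",  2012,  2122,
  "Terry",    "M",  2012,   480,
  "Roderick", "M",  2012,   204
)

# First table: one column per sex; absent name/year/sex combinations become 0
by_sex <- spread(babiesHRT, key = sex, value = nbabies, fill = 0)

# Second table: one column per year; absent name/sex/year combinations become 0
by_year <- spread(babiesHRT, key = year, value = nbabies, fill = 0)

by_sex
by_year
```

The fill = 0 argument is what turns the missing cells into zeros rather than NA. tidyr now documents spread() as superseded by pivot_wider(); a roughly equivalent first call would be pivot_wider(babiesHRT, names_from = sex, values_from = nbabies, values_fill = 0), although row and column order may differ.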
Home runs
Problem
Consider the number of home runs hit (HR) and home runs allowed (HRA) for the Chicago Cubs (CHN) baseball team. Reshape the Teams data from the Lahman package into long format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them.
SOLUTION
Teams %>%
  gather(key = type, value = home_runs, HR, HRA) %>%
  filter(teamID == "CHN") %>%
  select(yearID, type, home_runs) %>%
  mutate(type = recode(type, "HR" = "hit", "HRA" = "allowed")) %>%
  ggplot(aes(x = yearID, y = home_runs)) +
  geom_line(aes(color = type)) +
  labs(y = "home runs") +
  theme(legend.position = "top")

Seasons
Problem
Write a function called count_seasons that, when given a teamID, will count the number of seasons the team played in the Teams data frame from the Lahman package.
SOLUTION
Let’s try it out by finding how many seasons the Boston Red Sox have played:
## [1] 118
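The definition of count_seasons() is not shown above. A minimal sketch, assuming a season is counted once per distinct yearID; the data argument and its default are additions made here so the function can be tried on a small made-up table without Lahman installed:

```r
library(dplyr)

# Sketch of count_seasons(); the `data` argument and its default are
# assumptions added so the function can be tried on a small toy table.
count_seasons <- function(team, data = Lahman::Teams) {
  data %>%
    filter(teamID == team) %>%
    summarise(n_seasons = n_distinct(yearID)) %>%
    pull(n_seasons)
}

# Made-up rows for illustration only (not real Lahman data)
toy_teams <- tibble(
  teamID = c("AAA", "AAA", "AAA", "BBB"),
  yearID = c(1901, 1902, 1903, 1901)
)

count_seasons("AAA", data = toy_teams)
count_seasons("BBB", data = toy_teams)
```

With the Lahman package installed, count_seasons("BOS") is the kind of call that would correspond to the output printed above; the exact count depends on which seasons the installed Lahman release covers.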
We’ll always have Brooklyn
Problem
The team IDs corresponding to Brooklyn baseball teams from the Teams data frame from the Lahman package are listed below. Use purrr::map_dbl() (or sapply() as discussed in the textbook) to find the number of seasons in which each of those teams played.
SOLUTION
bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF")
brooklyn_seasons <-
  bk_teams %>%
  map_dbl(count_seasons)
names(brooklyn_seasons) <- bk_teams
brooklyn_seasons
## BR1 BR2 BR3 BR4 BRO BRP BRF
##   1   4   6   1  68   1   2
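The problem also allows sapply(). A small sketch of both routes, using a hypothetical stand-in for count_seasons() (a lookup in made-up counts) so the block runs without Lahman. It also shows that naming the input first makes the separate names() assignment unnecessary:

```r
library(purrr)

# Hypothetical stand-in for count_seasons(): made-up season counts
toy_counts <- c(AAA = 3, BBB = 1, CCC = 12)
count_seasons_toy <- function(team) toy_counts[[team]]

ids <- c("AAA", "BBB", "CCC")

# purrr: name the input by itself, and map_dbl() carries the names through
via_map <- map_dbl(set_names(ids), count_seasons_toy)

# base R: sapply() uses a character input as names by default
via_sapply <- sapply(ids, count_seasons_toy)

via_map
via_sapply
```

map_dbl() fails loudly if any call returns something other than a single number, whereas sapply() silently changes its return type, which is one reason to prefer the purrr version in a pipeline.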
Marriage
Problem
In the Marriage data set included in mosaicData, the appdate, ceremonydate, and dob variables are encoded as factors, even though they are dates. Use the lubridate package to convert those three columns into a date format.
SOLUTION
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  mutate(appdate2 = mdy(appdate),
         ceremonydate2 = mdy(ceremonydate),
         dob2 = mdy(dob)) %>%
  glimpse()
## Observations: 98
## Variables: 6
## $ appdate <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 12/26/9…
## $ ceremonydate <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 12/26/9…
## $ dob <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21/70, 1…
## $ appdate2 <date> 1996-10-29, 1996-11-12, 1996-11-19, 1996-12-02, 1996-1…
## $ ceremonydate2 <date> 1996-11-09, 1996-11-12, 1996-11-27, 1996-12-07, 1996-1…
## $ dob2 <date> 2064-04-11, 2064-08-06, 2062-02-20, 2056-05-20, 2066-1…
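Notice that the birth years are wrong: dob2 shows 2064 where 1964 was meant. When a year has only two digits, mdy() has to guess the century, and it reads 00 through 68 as 2000 through 2068 (only 69 through 99 become 19xx). This is a parsing convention for two-digit years, not a limitation on dates before the UNIX epoch (1970).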
Let’s be more careful:
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  mutate(app_date = mdy(appdate),
         ceremony_date = mdy(ceremonydate),
         date_of_birth = mdy(dob),
         date_of_birth = ifelse(year(date_of_birth) > year(now()),
                                date_of_birth - years(100),
                                date_of_birth),
         date_of_birth = as_date(date_of_birth)) %>%
  glimpse()
## Observations: 98
## Variables: 6
## $ appdate <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 12/26/9…
## $ ceremonydate <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 12/26/9…
## $ dob <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21/70, 1…
## $ app_date <date> 1996-10-29, 1996-11-12, 1996-11-19, 1996-12-02, 1996-1…
## $ ceremony_date <date> 1996-11-09, 1996-11-12, 1996-11-27, 1996-12-07, 1996-1…
## $ date_of_birth <date> 1964-04-11, 1964-08-06, 1962-02-20, 1956-05-20, 1966-1…
There we go!
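The same correction can be sketched on a couple of loose strings. The helper name fix_century() and its latest argument are hypothetical; the idea is the one used above, namely that a parsed birth date lying in the future must belong to the previous century. Subsetting is used instead of ifelse(), so the Date class is never lost and no as_date() round trip is needed:

```r
library(lubridate)

# Hypothetical helper: shift any date later than `latest` back by 100 years
fix_century <- function(d, latest = today()) {
  in_future <- !is.na(d) & d > latest
  if (any(in_future)) {
    d[in_future] <- d[in_future] - years(100)
  }
  d
}

dob_raw <- mdy(c("4/11/64", "2/21/70"))  # two-digit years, as in Marriage$dob
dob_fixed <- fix_century(dob_raw)

year(dob_raw)
year(dob_fixed)
```

This sketch compares whole dates against today rather than comparing years, so it differs slightly from the solution above for dates later in the current year.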
Coercion
Problem
Consider the values returned by the as.numeric() and readr::parse_number() functions when applied to the following vectors. Describe the results and their implication.
SOLUTION
## [1] 1900.45 NA NA NA
## [1] 1900.45 1900.45 1900.45 2000.00
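readr::parse_number() is much more likely to parse correctly: as.numeric() returns NA for any string that is not already a bare number, while parse_number() ignores the surrounding non-numeric characters and the grouping comma.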
## [1] 3 1 2 4
## Error in parse_vector(x, col_number(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE
Because R stores a factor as integer codes that index its levels, as.numeric() sees only those codes. readr::parse_number() throws an error right away, complaining that it was not given a character vector. That’s good, because then you can go back and correct the code:
## [1] 1900.45 1900.45 1900.45 2000.00
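The vectors themselves are not reproduced above. The printed results are consistent with the following pair, which is an assumption here rather than something shown in this section: a character vector of differently formatted amounts, and the same vector converted to a factor.

```r
library(readr)

# Assumed inputs (not shown in the text above)
x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)

# as.numeric() only understands bare numbers; the rest become NA (with a warning)
suppressWarnings(as.numeric(x1))

# parse_number() drops non-numeric text and the grouping comma before parsing
parse_number(x1)

# On a factor, as.numeric() returns the integer level codes, not the amounts
as.numeric(x2)

# Converting the factor to character first gives parse_number() what it expects
parse_number(as.character(x2))
```

The level codes returned by as.numeric(x2) depend on how the levels sort, which can vary with the locale, so they carry no information about the amounts themselves.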
Baseball Records
Problem
Use the dplyr::do() function and the Lahman data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from The New York Times.
SOLUTION
Answers will vary. I’ll tackle two of the graphs.
Batting Average
The following makes a chart for batting average, showing the top 30 batters each season since 1900. Only batters with at least 200 at-bats in a season are shown.
top_batters <- function(x) {
  x %>%
    filter(AB >= 200) %>%
    mutate(ba = H / AB) %>%
    arrange(desc(ba)) %>%
    select(playerID, yearID, ba) %>%
    head(30)
}
Batting %>%
  filter(yearID >= 1900) %>%
  group_by(yearID) %>%
  dplyr::do(top_batters(.)) %>%
  ggplot(aes(x = yearID, y = ba)) +
  geom_point(alpha = 0.3) +
  labs(x = NULL, y = "batting average")

Home Runs
Let’s now go for the one showing top home-run hitters, plotting the top fifteen home-run hitters in each season. We’ll also use the gghighlight package to highlight the record-setting performances.
First we write the function that picks up the top 15 home-run hitters in each season:
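The function itself is not shown. A minimal sketch, modelled on top_batters() above; the n_top argument is an assumption added here, and the toy season is made up so the block runs without Lahman:

```r
library(dplyr)

# Sketch of top_hr(), modelled on top_batters(): keep the n_top largest HR
# totals in one season's rows. The n_top argument is an assumption.
top_hr <- function(x, n_top = 15) {
  x %>%
    arrange(desc(HR)) %>%
    select(playerID, yearID, HR) %>%
    head(n_top)
}

# Made-up single season for illustration only (not real Lahman data)
toy_season <- tibble(
  playerID = c("a01", "b02", "c03", "d04"),
  yearID   = 1950,
  HR       = c(12, 40, 7, 25)
)

top_hr(toy_season, n_top = 2)
```

Because head() simply cuts the sorted rows, players tied at the last place are kept or dropped arbitrarily; arrange() puts missing HR values last, so they are only kept when a season has fewer than n_top recorded totals.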
For each player in a season, we are going to highlight his corresponding point if and only if his home-run performance that year matched or exceeded the record for home-runs that was standing at the time. To do this we need to create a table that shows, for each season, what the home-run record was by the end of the season.
HR_Record <-
  Batting %>%
  filter(yearID >= 1900) %>%
  group_by(yearID) %>%
  summarise(HR_season_max = max(HR, na.rm = TRUE)) %>%
  mutate(HR_record = cummax(HR_season_max))

Note the use of cummax(), short for cumulative maximum. It returns, for each position in a vector, the largest value seen up to that point. For example:
## [1] 5 5 7 8 8 8 8 9 9
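The call behind this output is not shown. A hypothetical input that reproduces it:

```r
# Hypothetical input, chosen only so that the running maximum matches the
# output printed above
x <- c(5, 3, 7, 8, 2, 6, 1, 9, 4)
running_max <- cummax(x)
running_max
```

Each element of the result changes only when the input reaches a new high (here at 7, 8 and 9), which is exactly the behaviour needed for a standing record.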
Now we create our table for top home-run performers, and join it to HR_Record:
GlyphReadyTable <-
  Batting %>%
  filter(yearID >= 1900) %>%
  group_by(yearID) %>%
  dplyr::do(top_hr(.)) %>%
  inner_join(HR_Record, by = "yearID") %>%
  mutate(breaks_or_ties_record = HR == HR_record) %>%
  select(playerID, yearID, HR, breaks_or_ties_record)

Let’s look at it to verify that it does the job:
Looks good! Studying the vignette for package gghighlight, we see that we can use breaks_or_ties_record to determine whether or not to highlight a point.
GlyphReadyTable %>%
  ggplot(aes(x = yearID, y = HR)) +
  geom_point() +
  gghighlight::gghighlight(breaks_or_ties_record) +
  labs(x = NULL, y = NULL,
       title = "Home Runs: 49 years")

Top 15 home-run hitters from each season. Record-setters are highlighted.
Let’s modify the default ggplot2 theme so the graph will more closely resemble our target graph:
yearsToShow <- seq(1910, 2010, by = 20)
GlyphReadyTable %>%
  ggplot(aes(x = yearID, y = HR)) +
  geom_point() +
  gghighlight::gghighlight(breaks_or_ties_record) +
  labs(y = NULL, title = "Home Runs: 49 years") +
  scale_y_continuous(position = "right") +
  theme(plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_blank(),
        panel.grid.major.y = element_line(linetype = "dotted"),
        plot.title = element_text(size = 14, hjust = 0.5),
        axis.line = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_line(),
        axis.text.x = element_text(color = "gray", size = 10),
        axis.text.y = element_text(color = "gray", size = 10)) +
  scale_x_continuous(name = NULL,
                     breaks = yearsToShow,
                     labels = paste0("| ", yearsToShow))

Top 15 home-run hitters from each season. Record-setters are highlighted. Attempting to approximate elements of the New York Times theme.
Wikipedia
Problem
Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display.
SOLUTION
Answers will vary.