Chapter 5 Homework Solutions

Introduction

These exercises are taken from the tidy data and iteration chapter of Modern Data Science with R: http://mdsr-book.github.io.

Spreading Babies

Here is a table, babiesHRT, of babies named Harrison, Roderick, or Terry, born in either 1912 or 2012.

name sex year nbabies
Terry F 1912 17
Harrison M 1912 170
Terry M 1912 49
Roderick M 1912 46
Terry F 2012 17
Harrison F 2012 15
Harrison M 2012 2122
Terry M 2012 480
Roderick M 2012 204

Use tidyr::spread() to get the following table from babiesHRT:

name year F M
Harrison 1912 0 170
Harrison 2012 15 2122
Roderick 1912 0 46
Roderick 2012 0 204
Terry 1912 17 49
Terry 2012 17 480

Again starting from babiesHRT, use tidyr::spread() to get the following table:

name sex 1912 2012
Harrison F 0 15
Harrison M 170 2122
Roderick M 46 204
Terry F 17 17
Terry M 49 480

SOLUTION

Here’s how to get the first table:

name year F M
Harrison 1912 0 170
Harrison 2012 15 2122
Roderick 1912 0 46
Roderick 2012 0 204
Terry 1912 17 49
Terry 2012 17 480
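
One way to produce this table is sketched here, assuming babiesHRT is a tibble holding the nine rows from the problem statement (it is rebuilt by hand so the example is self-contained; the name first_table is only illustrative):

```r
library(tibble)
library(tidyr)

# babiesHRT rebuilt by hand from the table in the problem statement
babiesHRT <- tribble(
  ~name,      ~sex, ~year, ~nbabies,
  "Terry",    "F",  1912,    17,
  "Harrison", "M",  1912,   170,
  "Terry",    "M",  1912,    49,
  "Roderick", "M",  1912,    46,
  "Terry",    "F",  2012,    17,
  "Harrison", "F",  2012,    15,
  "Harrison", "M",  2012,  2122,
  "Terry",    "M",  2012,   480,
  "Roderick", "M",  2012,   204
)

# One column per sex; fill = 0 replaces the NA that would otherwise
# appear for name/year combinations with no babies of that sex
first_table <- spread(babiesHRT, key = sex, value = nbabies, fill = 0)

first_table
```

Without fill = 0, the missing combinations (for example, female Harrisons in 1912) would show up as NA rather than 0.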

Here’s how to get the second one:

name sex 1912 2012
Harrison F 0 15
Harrison M 170 2122
Roderick M 46 204
Terry F 17 17
Terry M 49 480
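
A similar sketch for the second table, again assuming babiesHRT holds the nine rows from the problem statement. The only change is that year, rather than sex, becomes the key (second_table is an illustrative name):

```r
library(tibble)
library(tidyr)

# babiesHRT rebuilt by hand from the table in the problem statement
babiesHRT <- tribble(
  ~name,      ~sex, ~year, ~nbabies,
  "Terry",    "F",  1912,    17,
  "Harrison", "M",  1912,   170,
  "Terry",    "M",  1912,    49,
  "Roderick", "M",  1912,    46,
  "Terry",    "F",  2012,    17,
  "Harrison", "F",  2012,    15,
  "Harrison", "M",  2012,  2122,
  "Terry",    "M",  2012,   480,
  "Roderick", "M",  2012,   204
)

# One column per year; the new columns are named `1912` and `2012`
second_table <- spread(babiesHRT, key = year, value = nbabies, fill = 0)

second_table
```

There is no row for Roderick with sex F because that combination never occurs in babiesHRT; spread() only fills cells within name/sex pairs that exist.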

Home runs

Problem

Consider the number of home runs hit (HR) and home runs allowed (HRA) for the Chicago Cubs (CHN) baseball team. Reshape the Teams data from the Lahman package into long format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them.
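
No solution is recorded for this problem. The following is one possible approach, not the author's: it uses tidyr::gather() as the counterpart of spread(), and the names cubs_hr, type, and home_runs are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(Lahman)

# Keep only the Cubs, then stack HR and HRA into one long column
cubs_hr <- Teams %>%
  filter(teamID == "CHN") %>%
  select(yearID, HR, HRA) %>%
  gather(key = "type", value = "home_runs", HR, HRA)

# One line per type: home runs hit (HR) versus allowed (HRA)
cubs_plot <- ggplot(cubs_hr, aes(x = yearID, y = home_runs, color = type)) +
  geom_line() +
  labs(x = "Season", y = "Home runs", color = NULL)

cubs_plot
```

Mapping type to color keeps both series on one panel; facet_wrap(~ type) would be an equally reasonable way to condition on it.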

Seasons

Problem

Write a function called count_seasons that, when given a teamID, will count the number of seasons the team played in the Teams data frame from the Lahman package.

SOLUTION

Let’s try it out by finding how many seasons the Boston Red Sox have played:

## [1] 118
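
A function along these lines might look like the sketch below. The argument is called team rather than teamID, an assumption made here to avoid a clash with the teamID column inside filter(). The count returned depends on which release of the Lahman package is installed.

```r
library(dplyr)
library(Lahman)

# Count the distinct seasons in which a given team appears in Teams
count_seasons <- function(team) {
  Teams %>%
    filter(teamID == team) %>%
    summarise(n = n_distinct(yearID)) %>%
    pull(n)
}

count_seasons("BOS")
```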

We’ll always have Brooklyn

Problem

The team IDs corresponding to Brooklyn baseball teams from the Teams data frame from the Lahman package are listed below. Use purrr::map_dbl() (or sapply() as discussed in the textbook) to find the number of seasons in which each of those teams played.
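
Since the list of IDs is not reproduced here, this sketch recovers a stand-in by matching "Brooklyn" in the name column of Teams; that lookup is an assumption and may not match the textbook's list exactly. count_seasons() is redefined so the example is self-contained.

```r
library(dplyr)
library(purrr)
library(Lahman)

# Count the distinct seasons in which a given team appears in Teams
count_seasons <- function(team) {
  Teams %>%
    filter(teamID == team) %>%
    summarise(n = n_distinct(yearID)) %>%
    pull(n)
}

# Assumption: treat every team whose name mentions Brooklyn as a Brooklyn team
bk_teams <- Teams %>%
  filter(grepl("Brooklyn", name)) %>%
  pull(teamID) %>%
  as.character() %>%
  unique()

# set_names() labels each result with its team ID
bk_seasons <- bk_teams %>%
  set_names() %>%
  map_dbl(count_seasons)

bk_seasons
```

map_dbl() guarantees a numeric vector and fails loudly if count_seasons() ever returns something else, which sapply() would not.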

Marriage

Problem

In the Marriage data set included in mosaicData, the appdate, ceremonydate, and dob variables are encoded as factors, even though they are dates. Use the lubridate package to convert those three columns into a date format.

SOLUTION

## Observations: 98
## Variables: 6
## $ appdate       <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 12/26/9…
## $ ceremonydate  <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 12/26/9…
## $ dob           <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21/70, 1…
## $ appdate2      <date> 1996-10-29, 1996-11-12, 1996-11-19, 1996-12-02, 1996-1…
## $ ceremonydate2 <date> 1996-11-09, 1996-11-12, 1996-11-27, 1996-12-07, 1996-1…
## $ dob2          <date> 2064-04-11, 2064-08-06, 2062-02-20, 2056-05-20, 2066-1…

Notice that for birth years before 1969 we get the wrong century! By default, lubridate reads two-digit years from 00 to 68 as 20xx and those from 69 to 99 as 19xx.
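
The effect can be reproduced on a few values copied from the output above. This is a small sketch on toy factors, not the full Marriage data:

```r
library(lubridate)

# Values copied from the glimpse() output, stored as factors
appdate <- factor(c("10/29/96", "11/12/96", "11/19/96"))
dob     <- factor(c("4/11/64", "8/6/64", "2/20/62"))

# Convert to character first, then parse as month/day/year
appdate2 <- mdy(as.character(appdate))
dob2     <- mdy(as.character(dob))

year(appdate2)  # two-digit year 96 is read as 1996
year(dob2)      # two-digit years 64 and 62 are read as 2064 and 2062
```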

Let’s be more careful:

## Observations: 98
## Variables: 6
## $ appdate       <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 12/26/9…
## $ ceremonydate  <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 12/26/9…
## $ dob           <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21/70, 1…
## $ app_date      <date> 1996-10-29, 1996-11-12, 1996-11-19, 1996-12-02, 1996-1…
## $ ceremony_date <date> 1996-11-09, 1996-11-12, 1996-11-27, 1996-12-07, 1996-1…
## $ date_of_birth <date> 1964-04-11, 1964-08-06, 1962-02-20, 1956-05-20, 1966-1…

There we go!
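
One way to get the right century, which may differ from the approach used to produce the output above, is to spell out the century before parsing. This sketch assumes every date in the column falls in the 1900s:

```r
library(lubridate)

# Values copied from the glimpse() output, stored as factors
dob <- factor(c("4/11/64", "8/6/64", "2/20/62"))

# Assumption: all years are 19xx, so prefix the trailing two digits with "19"
dob_full <- sub("(\\d{2})$", "19\\1", as.character(dob))

date_of_birth <- mdy(dob_full)
date_of_birth
```

Editing the string avoids date arithmetic such as subtracting 100 years, which can produce NA for 29 February when the target year is not a leap year.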

Coercion

Problem

Consider the values returned by the as.numeric() and readr::parse_number() functions when applied to the following vectors. Describe the results and their implication.

SOLUTION

## [1] 1900.45      NA      NA      NA
## [1] 1900.45 1900.45 1900.45 2000.00

readr::parse_number() is much more likely to parse correctly!
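
The vectors themselves are not reproduced here. The sketch below uses a hypothetical character vector x1, chosen to be consistent with the output shown above:

```r
library(readr)

# Hypothetical input, chosen to be consistent with the output shown above
x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")

# as.numeric() only accepts strings that are already clean numbers
as_num <- suppressWarnings(as.numeric(x1))

# parse_number() drops surrounding text and grouping marks before converting
parsed <- parse_number(x1)

as_num
parsed
```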

## [1] 3 1 2 4
## Error in parse_vector(x, col_number(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE

Because R stores a factor as integer codes that index its levels, as.numeric() sees only these codes. readr::parse_number() throws an error right away, complaining that it was not given a character vector. That’s good, because then you can go back and correct the code:

## [1] 1900.45 1900.45 1900.45 2000.00
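
A sketch of the factor case and its correction, using a hypothetical factor x2 consistent with the output above. The exact level codes depend on how the locale sorts the strings.

```r
library(readr)

# Hypothetical input: the same strings as before, now stored as a factor
x2 <- factor(c("1900.45", "$1900.45", "1,900.45", "nearly $2000"))

# as.numeric() returns the integer level codes, not the values
codes <- as.numeric(x2)

# Converting to character first gives parse_number() what it expects
fixed <- parse_number(as.character(x2))

codes
fixed
```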

Baseball Records

Problem

Use the dplyr::do() function and the Lahman data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from The New York Times.

SOLUTION

Answers will vary. I’ll tackle two of the graphs.

Home Runs

Let’s go for the graph showing the top home-run hitters, working with the top fifteen home-run hitters in each season. We’ll also use the gghighlight package to highlight the record-setting performances.

First we write the function that picks up the top 15 home-run hitters in each season:
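
A function of this kind might look like the sketch below. The names top_hr and HR_top are illustrative, and summing HR over a player's stints within a season is an assumption about how season totals should be formed.

```r
library(dplyr)
library(Lahman)

# Return the n rows with the most home runs (ties beyond n are dropped)
top_hr <- function(df, n = 15) {
  df %>%
    arrange(desc(HR)) %>%
    head(n)
}

# Season totals per player, then apply top_hr() to each season with do()
HR_top <- Batting %>%
  group_by(playerID, yearID) %>%
  summarise(HR = sum(HR, na.rm = TRUE), .groups = "drop") %>%
  group_by(yearID) %>%
  do(top_hr(.)) %>%
  ungroup()

head(HR_top)
```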

For each player in a season, we are going to highlight his corresponding point if and only if his home-run performance that year matched or exceeded the record for home-runs that was standing at the time. To do this we need to create a table that shows, for each season, what the home-run record was by the end of the season.
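
The idea can be sketched on toy data. The seasons and totals in season_best are made up for illustration, and the column names max_HR and record are assumptions:

```r
library(dplyr)

# Hypothetical toy data: the best single-season HR total in five made-up seasons
season_best <- tibble(
  yearID = c(2001, 2002, 2003, 2004, 2005),
  max_HR = c(30, 28, 35, 35, 33)
)

# The standing record at the end of each season is the running maximum
HR_Record <- season_best %>%
  arrange(yearID) %>%
  mutate(record = cummax(max_HR))

HR_Record
```

Sorting by yearID first matters, because the running maximum is only meaningful when the seasons are in chronological order.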

Note the use of cummax(), short for cumulative maximum. At each position it returns the largest value seen so far in the vector. For example:

## [1] 5 5 7 8 8 8 8 9 9
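
The input for this example is not reproduced above. One hypothetical vector that yields this output:

```r
# Hypothetical input, chosen so that the running maximum matches the output above
x <- c(5, 3, 7, 8, 2, 6, 8, 9, 4)

running_max <- cummax(x)
running_max
```

The result never decreases: it only steps up when a new value exceeds everything before it.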

Now we create our table for top home-run performers, and join it to HR_Record:
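
A sketch of the join on toy data. The players, seasons, and totals are made up, HR_flagged is an illustrative name, and the flag is assumed to be computed as HR >= record:

```r
library(dplyr)

# Hypothetical toy stand-in for the top hitters in each season
HR_top <- tibble(
  playerID = c("a", "b", "c", "d", "e", "f"),
  yearID   = c(2001, 2001, 2002, 2002, 2003, 2003),
  HR       = c(30, 25, 28, 27, 35, 30)
)

# Hypothetical toy stand-in for the record standing at the end of each season
HR_Record <- tibble(
  yearID = c(2001, 2002, 2003),
  record = c(30, 30, 35)
)

# Attach the season's record to every player, then flag matches
HR_flagged <- HR_top %>%
  inner_join(HR_Record, by = "yearID") %>%
  mutate(breaks_or_ties_record = HR >= record)

HR_flagged
```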

Let’s look at it to verify that it does the job:

Looks good! Studying the vignette for package gghighlight, we see that we can use breaks_or_ties_record to determine whether or not to highlight a point.
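
A minimal sketch of such a plot on toy data (the values are made up). Turning off direct labels with use_direct_label = FALSE is a choice made here, not necessarily the one used for the figure:

```r
library(ggplot2)
library(gghighlight)

# Hypothetical toy data with the logical flag already computed
HR_flagged <- data.frame(
  yearID = c(2001, 2001, 2002, 2002, 2003, 2003),
  HR     = c(30, 25, 28, 27, 35, 30),
  breaks_or_ties_record = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Points where the flag is TRUE keep their colour; the rest are greyed out
hr_plot <- ggplot(HR_flagged, aes(x = yearID, y = HR)) +
  geom_point() +
  gghighlight(breaks_or_ties_record, use_direct_label = FALSE)

hr_plot
```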

Figure: Top 15 home-run hitters from each season. Record-setters are highlighted.

Let’s modify the default ggplot2 theme so the graph will more closely resemble our target graph:
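
The kind of adjustment involved can be sketched as follows. The specific settings are guesses at a sparse newspaper-style look, not the ones used for the figure, and the data are made up:

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  yearID = c(2001, 2001, 2002, 2002, 2003, 2003),
  HR     = c(30, 25, 28, 27, 35, 30)
)

# Start from a light theme, then strip gridlines and axis titles
themed_plot <- ggplot(toy, aes(x = yearID, y = HR)) +
  geom_point(colour = "grey60") +
  theme_minimal() +
  theme(
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.title         = element_blank(),
    plot.title         = element_text(face = "bold")
  ) +
  labs(title = "Home runs by season")

themed_plot
```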

Figure: Top 15 home-run hitters from each season. Record-setters are highlighted. Attempting to approximate elements of the New York Times theme.

Wikipedia

Problem

Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display.

SOLUTION

Answers will vary.

24 April, 2020