Chapter 4 Homework

HSW: Score is 8.7/10, see comments below and study solutions.

Introduction

These exercises are based on exercises in our textbook.

Verbs

Each of these tasks can be performed using a single data verb. For each task, say which verb it is:

  1. Find the average of one of the variables.
  2. Add a new column that is the ratio between two variables.
  3. Sort the cases in descending order of a variable.
  4. Create a new data table that includes only those cases that meet a criterion.
  5. From a data table with three categorical variables A, B, and C, and a quantitative variable X, produce a data frame that has the same cases but only the variables A and X.

SOLUTION

  1. Summarize()
  2. Mutate()
  3. Arrange()
  4. Filter()
  5. Select()

Cancelled flights

Use the nycflights13 package and the flights data frame to answer the following questions: What month had the highest proportion of cancelled flights? What month had the lowest? Interpret any seasonal patterns.

Note: You may consider a flight to be cancelled if its departure time is NA.

Hint

At some point you’ll need to figure out how many elements of a vector are NA. Some R programming ideas will be helpful, here.

Let’s make a small example.

smallVector <- c(4, 2, NA, 17, 4, NA)

Clearly, two of the elements of smallVector are NA. Here’s how to figure that out with programming, rather than by physically looking at the elements.

The function is.na() takes a vector and produces a logical vector whose elements are TRUE when the original value was NA:

is.na(smallVector)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Let’s now use the sum() function to “add up” the elements of this logical vector:

sum(is.na(smallVector))
## [1] 2

What happened was that R converted the TRUE values to 1 and the FALSE values to 0, then added up. The sum is the number of 1’s, thus the number of TRUEs, thus the number of NAs in the original vector.

Accordingly, the following gets you part of the way to the solution:

cancellations <- 
  flights %>%
  group_by(month) %>%
  summarize(cancelled = sum(is.na(dep_time)),
            total = n()
            )

SOLUTION

cancellations <- 
  flights %>%
  group_by(month) %>%
  summarize(cancelled = sum(is.na(dep_time)),
            total = n(),
              proportion_cancelled = cancelled / total)
highest_cancelled <- cancellations %>% 
  filter(proportion_cancelled == max(proportion_cancelled))
lowest_cancelled <- cancellations %>% 
  filter(proportion_cancelled == min(proportion_cancelled))
reactable(cancellations)

Frequent NYC visitors

Use the nycflights13 package and the flights data frame to answer the following questions: What plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013? Plot the number of trips per week over the year.

Hint

The second part of the problem asks you to plot the number of trips per week over the year. For this you’ll have to group the data by weeks of the year.

One of the variables (time_hour) gives the date and time of day of the flight. Package lubridate has a function called week() that takes a date and returns the number of the week it belong in.

Example:

library(lubridate)
dates <- as_date(c("2018-01-03", "2018-02-05"))
week(dates)
## [1] 1 6

The dates were in the first and sixth weeks of the year 2018.

As you wrangle, use mutate() to create a new variable that gives the week-number for each flight, then group by that variable.

SOLUTION

topPlane <- flights %>%
  filter(year(time_hour) == 2013) %>%
  group_by(tailnum) %>%
  summarize(num_flights = n()) %>%
  filter(num_flights == max(num_flights))

tripsPerWeek <- flights %>%
  filter(year(time_hour) == 2013) %>%
  mutate(week = week(time_hour)) %>%
  group_by(week) %>%
  summarize(num_trips = n()) %>%
  arrange(week)

ggplot(tripsPerWeek, aes(x = week, y = num_trips)) +
  geom_line() +
  labs(title = "Number of Trips per Week From the NYC Airport in 2013",
       x = "Week of the Year",
       y = "Number of Trips") 

HSW: To find the top plane you will need to filter out the rows where the tail number is NA. - 0.3 pts. I think we were supposed to find find the flights each week for the top plane, but your interpretation is fine, too.

Weather

Use the nycflights13 package and the weather table to answer the following questions, for the JFK airport for the month of July, 2013.

Part 1

Make a line-graph of the high and low temperatures for each day of the month. There should be a line connecting the high temperatures and another line (of a different color) connecting the low temperatures. Hint: Each line-graph will be its own layer, so the code for your plot will have a call to geom_line() to make the line-graph for the high temperatures and another call to geom_line() to make the line-graph for the low temperatures.

Part 1 Solution

JFKweather <- weather %>%
  filter(origin == "JFK", year == 2013, month == 7)

dailyTemp <- JFKweather %>%
  group_by(day) %>%
  summarize(high_temp = max(temp, na.rm = TRUE),
            low_temp = min(temp, na.rm = TRUE))

dailyTemp <- JFKweather %>%
  group_by(day) %>%
  summarize(high_temp = max(temp, na.rm = TRUE),
            low_temp = min(temp, na.rm = TRUE))
ggplot(dailyTemp, aes(x = day)) +
  geom_line(aes(y = high_temp, color = "High Temp")) +
  geom_line(aes(y = low_temp, color = "Low Temp")) +
  labs(title = "High and Low Temperatures at JFK Airport in July 2013",
       x = "Date",
       y = "Temperature (F)") 

Part 2

Make parallel box-plots of the wind_speed variable for each day of the month. Hint: Consider that the boxplots need to group by day, which is a numeric variable.

Part 2 Solution

ggplot(JFKweather, aes(x = factor(day), y = wind_speed)) +
  geom_boxplot() +
  labs(title = "Wind Speed at JFK Airport in July 2013",
       x = "Date",
       y = "Wind Speed (mph)") 

Part 3

Make a scatterplot to study the relationship between dewp and humid. Describe any pattern that you see. Hint: To study the relationship between two numerical variables, a scatterplot is a good idea.

Part 3 Solution

ggplot(JFKweather, aes(x = dewp, y = humid)) +
  geom_point() +
  labs(title = "Relationship between Dew Point and Humidity at JFK Airport in July 2013",
       x = "Dew Point (F)",
       y = "Humidity (%)")

Batting

Problem

Define two new variables in the Teams data frame from the Lahman package: batting average (\(BA\)) and slugging percentage (\(SLG\)).

Notes for Problem

Batting average is the ratio of hits (H) to at-bats (AB).

Slugging percentage is total bases divided by at-bats. To compute total bases, you a single as 1 base, a double as 2, a triple as 3, and a home run as 4. You have a bit of a problem, in that there is no variable for singles. However, you do have the variable H which is any one four types of hit, so you could compute the total bases in two steps:

Teams_2 <-
  Teams %>% 
  mutate(singles = H - (X2B + X3B + HR)) %>% 
  mutate(total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR)

Solution to Problem

Teams_2 <- Teams %>%
  mutate(
    singles = H - (X2B + X3B + HR),
    total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR,
    BA = H / AB,
    SLG = total_bases / AB)
reactable(Teams_2)

AL vs NL

Problem

Plot a time-series line-graph of SLG since 1954 conditioned by lgID. each point on the plot will stand for a league in a season. The x-coordinate is the season and the y-coordinate is the mean of the slugging percentages for all of the teams in the league. By “conditioned” we mean that there will be two line-plots of different colors, so you should aesthetically map color to lgID variable. A glyph-ready data table would look like this:

yearID lgID mean_slugging
1954 AL 0.3732352
1954 NL 0.4067245
1955 AL 0.3810866
1955 NL 0.4068172

Is slugging percentage typically higher in the American League (AL) or the National League (NL)? Can you think of why this might be the case?

Solution

Teams_2 <- Teams %>%
  filter(yearID >= 1954) %>%
  mutate(singles = H - (X2B + X3B + HR),
    total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR,
    BA = H / AB,
    SLG = total_bases / AB) %>%
  group_by(yearID, lgID) %>%
  summarize(mean_slugging = mean(SLG, na.rm = TRUE))
ggplot(Teams_2, aes(x = yearID, y = mean_slugging, color = lgID)) +
  geom_line() +
  labs(title = "Mean Slugging Percentage Each Season",
       x = "Season",
       y = "Mean Slugging Percentage") 

Slugging percentage is typically higher in the American league. After some light research, it seems that because the American league offers a designated hitter for the pitchers, which leads to higher overall statistics.

Slugging percentage

Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969. Note: You should consider a “team” to be a team in a particular season, e.g., the Mets in 1975, not the Mets in their entire history.

Solution

Teams3 <- Teams %>%
  mutate(singles = H - (X2B + X3B + HR),
    total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR,
    SLG = total_bases / AB)


top15 <- Teams3 %>%
  arrange(desc(SLG)) %>%
  filter(SLG == max(SLG))

top15_1969 <- Teams3 %>%
  filter(yearID >= 1969) %>%
  arrange(desc(SLG)) %>%
  filter(SLG == max(SLG))

reactable(top15)
reactable(top15_1969)

HSW. Here you don’t want to filter for the top value, you want to arrange by SLG (descending) and then take the top 15 using head(). - 0.5 pts.

Angels

The Angels have at times been called the California Angels (CAL), the Anaheim Angels (ANA), and the Los Angeles Angels of Anaheim (LAA). Find the 10 most successful seasons in Angels history. Have they ever won the World Series?

Solution

Angels <- Teams %>%
  filter(teamID %in% c("CAL", "ANA", "LAA")) %>%
  mutate(win_percentage = W / (W + L))

top10Angels <- Angels %>%
  arrange(desc(win_percentage)) %>%
  filter(row_number() <= 10)

AngelsWorldSeries <- Angels %>%
  filter(WSWin == "Y")

reactable(Angels)
reactable(AngelsWorldSeries)

HSW: You need to display your top10Angels table, not the one above. - -.3 pts.

No, they have not won a World Series.

HSW: Your code shows clearly that they did (in 2002). - 0.2 pts.

Presidential baseball

Problem

Create a factor called election that divides the yearID into four-year blocks that correspond to U.S. presidential terms. During which term have the most home runs been hit?

Hints for Problem

You need to use the `cut() function to group the years into presidential terms. The earliest presidential term in the Lahman data begins in 1869 and runs through 1872. The next term is 1873 to 1876, and so on until the current presidential term. Suppose, for the sake of illustration, that the current term is 2013-2016. (It will actually be a more recent term, for you, in this assignment.) Then the terms you want are:

  • 1869-1872
  • 1873-1876
  • 2013-2016

Imagine some years, where each year falls into one of the terms:

year <- c(1871, 1872, 1873, 1894, 1945, 1984, 2005, 2009, 2012, 2015)

Set the sequence of term-starts using the seq() function:

term_start <- seq(from = 1869, to = 2021, by = 4)

Now use cut():

cut(year, breaks = c(term_start), right = FALSE)
##  [1] [1869,1873) [1869,1873) [1873,1877) [1893,1897) [1945,1949) [1981,1985)
##  [7] [2005,2009) [2009,2013) [2009,2013) [2013,2017)
## 38 Levels: [1869,1873) [1873,1877) [1877,1881) [1881,1885) ... [2017,2021)

Note: Can you see why we set the end parameter to 2021 in our call to the seq() function?

Solution to Problem

term_start <- seq(from = 1869, to = 2021, by = 4)
year <- c(1871, 1872, 1873, 1894, 1945, 1984, 2005, 2009, 2012, 2015)
election <- cut(year, breaks = c(term_start), right = FALSE)

Teams$election <- cut(Teams$yearID, breaks = c(term_start), right = FALSE)

home_runs_by_term <- Teams %>%
  group_by(election) %>%
  summarize(total_home_runs = sum(HR, na.rm = TRUE)) %>%
  arrange(desc(total_home_runs))

reactable(home_runs_by_term)

The most home runs have been hit during the 2001-2005 presidential term.

I didn’t use ChatGPT, but used a Microsoft browser AI to reach the conclusion of using “Teams$election” in my code. I was unsure how to access the new election factor within the Teams dataset. I put in my developing code chunk and asked the AI how to access th factor within the dataset. I came to this through this conversation

HSW: Unfortunately the link does not take me to the conversation. I think it is not designed to be publicly “shareable”. Can you use chatGPT in the future?