Chapter 4 Homework
HSW: Score is 8.7/10, see comments below and study solutions.
Introduction
These exercises are based on exercises in our textbook.
Verbs
Each of these tasks can be performed using a single data verb. For each task, say which verb it is:
- Find the average of one of the variables.
- Add a new column that is the ratio between two variables.
- Sort the cases in descending order of a variable.
- Create a new data table that includes only those cases that meet a criterion.
- From a data table with three categorical variables A, B, and C, and a quantitative variable X, produce a data frame that has the same cases but only the variables A and X.
SOLUTION
- Summarize()
- Mutate()
- Arrange()
- Filter()
- Select()
Cancelled flights
Use the nycflights13 package and the
flights data frame to answer the following questions: What
month had the highest proportion of cancelled flights? What month had
the lowest? Interpret any seasonal patterns.
Note: You may consider a flight to be cancelled if
its departure time is NA.
Hint
At some point you’ll need to figure out how many elements of a vector
are NA. Some R programming ideas will be helpful, here.
Let’s make a small example.
Clearly, two of the elements of smallVector are
NA. Here’s how to figure that out with programming, rather
than by physically looking at the elements.
The function is.na() takes a vector and produces a
logical vector whose elements are TRUE when the original
value was NA:
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
Let’s now use the sum() function to “add up” the
elements of this logical vector:
## [1] 2
What happened was that R converted the TRUE values to 1
and the FALSE values to 0, then added up. The sum is the
number of 1’s, thus the number of TRUEs, thus the number of
NAs in the original vector.
Accordingly, the following gets you part of the way to the solution:
SOLUTION
cancellations <-
flights %>%
group_by(month) %>%
summarize(cancelled = sum(is.na(dep_time)),
total = n(),
proportion_cancelled = cancelled / total)
highest_cancelled <- cancellations %>%
filter(proportion_cancelled == max(proportion_cancelled))
lowest_cancelled <- cancellations %>%
filter(proportion_cancelled == min(proportion_cancelled))
reactable(cancellations)Frequent NYC visitors
Use the nycflights13 package and the
flights data frame to answer the following questions: What
plane (specified by the tailnum variable) traveled the most
times from New York City airports in 2013? Plot the number of trips per
week over the year.
Hint
The second part of the problem asks you to plot the number of trips per week over the year. For this you’ll have to group the data by weeks of the year.
One of the variables (time_hour) gives the date and time
of day of the flight. Package lubridate has a function
called week() that takes a date and returns the number of
the week it belong in.
Example:
## [1] 1 6
The dates were in the first and sixth weeks of the year 2018.
As you wrangle, use mutate() to create a new variable
that gives the week-number for each flight, then group by that
variable.
SOLUTION
topPlane <- flights %>%
filter(year(time_hour) == 2013) %>%
group_by(tailnum) %>%
summarize(num_flights = n()) %>%
filter(num_flights == max(num_flights))
tripsPerWeek <- flights %>%
filter(year(time_hour) == 2013) %>%
mutate(week = week(time_hour)) %>%
group_by(week) %>%
summarize(num_trips = n()) %>%
arrange(week)
ggplot(tripsPerWeek, aes(x = week, y = num_trips)) +
geom_line() +
labs(title = "Number of Trips per Week From the NYC Airport in 2013",
x = "Week of the Year",
y = "Number of Trips") HSW: To find the top plane you will need to filter
out the rows where the tail number is NA. - 0.3 pts. I
think we were supposed to find find the flights each week for the top
plane, but your interpretation is fine, too.
Weather
Use the nycflights13 package and the
weather table to answer the following questions, for the
JFK airport for the month of July, 2013.
Part 1
Make a line-graph of the high and low temperatures for each day of
the month. There should be a line connecting the high temperatures and
another line (of a different color) connecting the low temperatures.
Hint: Each line-graph will be its own layer, so the
code for your plot will have a call to geom_line() to make
the line-graph for the high temperatures and another call to
geom_line() to make the line-graph for the low
temperatures.
Part 1 Solution
JFKweather <- weather %>%
filter(origin == "JFK", year == 2013, month == 7)
dailyTemp <- JFKweather %>%
group_by(day) %>%
summarize(high_temp = max(temp, na.rm = TRUE),
low_temp = min(temp, na.rm = TRUE))
dailyTemp <- JFKweather %>%
group_by(day) %>%
summarize(high_temp = max(temp, na.rm = TRUE),
low_temp = min(temp, na.rm = TRUE))
ggplot(dailyTemp, aes(x = day)) +
geom_line(aes(y = high_temp, color = "High Temp")) +
geom_line(aes(y = low_temp, color = "Low Temp")) +
labs(title = "High and Low Temperatures at JFK Airport in July 2013",
x = "Date",
y = "Temperature (F)") Part 2
Make parallel box-plots of the wind_speed variable for
each day of the month. Hint: Consider that the boxplots
need to group by day, which is a numeric variable.
Part 2 Solution
ggplot(JFKweather, aes(x = factor(day), y = wind_speed)) +
geom_boxplot() +
labs(title = "Wind Speed at JFK Airport in July 2013",
x = "Date",
y = "Wind Speed (mph)") Part 3
Make a scatterplot to study the relationship between
dewp and humid. Describe any pattern that you
see. Hint: To study the relationship between two
numerical variables, a scatterplot is a good idea.
Batting
Problem
Define two new variables in the Teams data frame from
the Lahman package: batting average (\(BA\)) and slugging percentage (\(SLG\)).
Notes for Problem
Batting average is the ratio of hits (H) to at-bats
(AB).
Slugging percentage is total bases divided by at-bats. To compute
total bases, you a single as 1 base, a double as 2, a triple as 3, and a
home run as 4. You have a bit of a problem, in that there is no variable
for singles. However, you do have the variable H which is
any one four types of hit, so you could compute the total bases in two
steps:
AL vs NL
Problem
Plot a time-series line-graph of SLG since 1954
conditioned by lgID. each point on the plot will stand for
a league in a season. The x-coordinate is the season and the
y-coordinate is the mean of the slugging percentages for all of the
teams in the league. By “conditioned” we mean that there will be two
line-plots of different colors, so you should aesthetically map
color to lgID variable. A glyph-ready data
table would look like this:
| yearID | lgID | mean_slugging |
|---|---|---|
| 1954 | AL | 0.3732352 |
| 1954 | NL | 0.4067245 |
| 1955 | AL | 0.3810866 |
| 1955 | NL | 0.4068172 |
Is slugging percentage typically higher in the American League (AL) or the National League (NL)? Can you think of why this might be the case?
Solution
Teams_2 <- Teams %>%
filter(yearID >= 1954) %>%
mutate(singles = H - (X2B + X3B + HR),
total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR,
BA = H / AB,
SLG = total_bases / AB) %>%
group_by(yearID, lgID) %>%
summarize(mean_slugging = mean(SLG, na.rm = TRUE))
ggplot(Teams_2, aes(x = yearID, y = mean_slugging, color = lgID)) +
geom_line() +
labs(title = "Mean Slugging Percentage Each Season",
x = "Season",
y = "Mean Slugging Percentage")
Slugging percentage is typically higher in the American league. After
some light research, it seems that because the American league offers a
designated hitter for the pitchers, which leads to higher overall
statistics.
Slugging percentage
Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969. Note: You should consider a “team” to be a team in a particular season, e.g., the Mets in 1975, not the Mets in their entire history.
Solution
Teams3 <- Teams %>%
mutate(singles = H - (X2B + X3B + HR),
total_bases = singles + 2 * X2B + 3 * X3B + 4 * HR,
SLG = total_bases / AB)
top15 <- Teams3 %>%
arrange(desc(SLG)) %>%
filter(SLG == max(SLG))
top15_1969 <- Teams3 %>%
filter(yearID >= 1969) %>%
arrange(desc(SLG)) %>%
filter(SLG == max(SLG))
reactable(top15)HSW. Here you don’t want to filter for the top value, you want to arrange by SLG (descending) and then take the top 15 using head(). - 0.5 pts.
Angels
The Angels have at times been called the California Angels
(CAL), the Anaheim Angels (ANA), and the Los
Angeles Angels of Anaheim (LAA). Find the 10 most
successful seasons in Angels history. Have they ever won the World
Series?
Solution
Angels <- Teams %>%
filter(teamID %in% c("CAL", "ANA", "LAA")) %>%
mutate(win_percentage = W / (W + L))
top10Angels <- Angels %>%
arrange(desc(win_percentage)) %>%
filter(row_number() <= 10)
AngelsWorldSeries <- Angels %>%
filter(WSWin == "Y")
reactable(Angels)HSW: You need to display your
top10Angels table, not the one above. - -.3 pts.
No, they have not won a World Series.
HSW: Your code shows clearly that they did (in 2002). - 0.2 pts.
Presidential baseball
Problem
Create a factor called election that divides the
yearID into four-year blocks that correspond to U.S.
presidential terms. During which term have the most home runs been
hit?
Hints for Problem
You need to use the `cut() function to group the years into presidential terms. The earliest presidential term in the Lahman data begins in 1869 and runs through 1872. The next term is 1873 to 1876, and so on until the current presidential term. Suppose, for the sake of illustration, that the current term is 2013-2016. (It will actually be a more recent term, for you, in this assignment.) Then the terms you want are:
- 1869-1872
- 1873-1876
- …
- 2013-2016
Imagine some years, where each year falls into one of the terms:
Set the sequence of term-starts using the seq()
function:
Now use cut():
## [1] [1869,1873) [1869,1873) [1873,1877) [1893,1897) [1945,1949) [1981,1985)
## [7] [2005,2009) [2009,2013) [2009,2013) [2013,2017)
## 38 Levels: [1869,1873) [1873,1877) [1877,1881) [1881,1885) ... [2017,2021)
Note: Can you see why we set the end
parameter to 2021 in our call to the seq() function?
Solution to Problem
term_start <- seq(from = 1869, to = 2021, by = 4)
year <- c(1871, 1872, 1873, 1894, 1945, 1984, 2005, 2009, 2012, 2015)
election <- cut(year, breaks = c(term_start), right = FALSE)
Teams$election <- cut(Teams$yearID, breaks = c(term_start), right = FALSE)
home_runs_by_term <- Teams %>%
group_by(election) %>%
summarize(total_home_runs = sum(HR, na.rm = TRUE)) %>%
arrange(desc(total_home_runs))
reactable(home_runs_by_term)The most home runs have been hit during the 2001-2005 presidential term.
I didn’t use ChatGPT, but used a Microsoft browser AI to reach the conclusion of using “Teams$election” in my code. I was unsure how to access the new election factor within the Teams dataset. I put in my developing code chunk and asked the AI how to access th factor within the dataset. I came to this through this conversation
HSW: Unfortunately the link does not take me to the conversation. I think it is not designed to be publicly “shareable”. Can you use chatGPT in the future?