Introduction

You will work with yearly baseball statistics for each team from the years 2000 to 2015. You do not need any deep insight into the sport of baseball to successfully complete this exam. However, if you are unsure what a specific variable means or how it relates to baseball, ask questions.

A baseball team wins a game when it outscores its opponent, that is, when it scores more runs than it allows. A team’s win percentage (the percentage of games played that resulted in a win) is related to the number of runs it scores versus the number of runs it allows over the course of a full season (usually 162 games). Runs scored are a function of hits and walks; runs allowed are a function of hits allowed and walks allowed. You will explore three models that investigate the relationship between win percentage and runs scored and runs allowed. This is the first step toward quantifying players’ value to a team.

Data

Below is a preview of the data set. Consult the data dictionary for further details on the variables. Definitions of variables are available if you load package Lahman and type ?Teams in your Console. The file mlb.csv is a subset of data frame Teams. You should not work with the Teams data frame.

To get started, load packages tidyverse and broom. Next, read in mlb.csv with function read_csv() and save it as an object named mlb.

Data exploration

Task 1

Add the following variables to mlb: Wpct, RD, HD, BBD, SOD, logWL, logRRA.

| Variable | Name | Definition |
|---|---|---|
| Wpct | win percentage | \(W / G\), proportion of wins among all games played |
| RD | run differential | \(R - RA\), runs scored minus runs allowed |
| HD | hit differential | \(H - HA\), hits minus hits allowed |
| BBD | walk differential | \(BB - BBA\), walks minus walks allowed |
| SOD | strikeout differential | \(SO - SOA\), strikeouts minus strikeouts allowed |
| logWL | log win-loss ratio | \(\log(W/L)\), logarithm of wins divided by losses |
| logRRA | log runs-runs allowed ratio | \(\log(R/RA)\), logarithm of runs divided by runs allowed |

Task 2

Use data of mlb from 2000 - 2011 to create scatter plots of win percentage versus the four differential variables you created in Task 1: RD, HD, BBD, SOD. Comment on the four relationships you observe. Which variable appears to have the strongest correlation with a team’s win percentage? Which variable appears to have the weakest correlation with a team’s win percentage?


Model I

Task 3

Use data of mlb from 2000 - 2011 to fit a linear model with Wpct as the response and RD as the single predictor variable. Save the object as m.rd. Write out the model. If you do not know LaTeX, look at examples from previous in-class assignments.

Task 4

Give an interpretation of \(b_0\) - the estimated intercept coefficient, and \(b_1\) - the estimated slope coefficient from m.rd.

Task 5

Is run differential statistically significant based on m.rd? Justify your answer.

Model II

Task 6

Use data of mlb from 2000 - 2011 to fit a linear model with Wpct as the response and HD and BBD as predictor variables. Save the object as m.hbbd. Write out the model. If you do not know LaTeX, look at examples from previous in-class assignments.

Task 7

Give an interpretation of \(b_0\) - the estimated intercept coefficient, \(b_1\) - the estimated coefficient for HD from m.hbbd, and \(b_2\) - the estimated coefficient for BBD from m.hbbd.

Task 8

Use data of mlb from 2000 - 2011 to fit a linear model with Wpct as the response and HD, BBD, and SOD as predictor variables. Save the object as m.hbbsod. Compare \(R^2\) and adjusted \(R^2\) between models m.hbbd and m.hbbsod. Comment on the differences.

Model III

The Pythagorean win expectation (derived by Bill James) is given by the following formula. \[\mbox{Win percentage} = \frac{R^2}{R^2 + RA^2},\] where \(R\) is defined as a team’s runs scored and \(RA\) is defined as a team’s runs allowed. For example, in 2008 the Philadelphia Phillies scored 799 runs and allowed 680 runs. Therefore, their expected win percentage is \[\frac{799^2}{799^2 + 680^2} = 0.58.\]
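The Phillies example above can be reproduced in a few lines of R. This is only an illustrative sketch: `pyth_wpct` is a hypothetical helper name, not a function from any package, and it fixes the exponent at 2 as in the formula above.

```r
# Pythagorean win expectation with exponent 2.
# pyth_wpct is a hypothetical helper name used only for illustration.
pyth_wpct <- function(runs, runs_allowed) {
  runs^2 / (runs^2 + runs_allowed^2)
}

# 2008 Philadelphia Phillies: 799 runs scored, 680 runs allowed
phillies_2008 <- pyth_wpct(799, 680)
round(phillies_2008, 2)  # 0.58, matching the value above
```

Because the function uses only arithmetic operators, it is vectorized and can also be applied to whole columns of runs scored and runs allowed.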

We can generalize the above formula by writing \[\mbox{Win percentage} = \frac{R^k}{R^k + RA^k},\] where \(k\) is some real number. After some algebra, it can be shown that the above expression is equivalent to \[\log\bigg(\frac{W}{L}\bigg) = k\times \log\bigg(\frac{R}{RA}\bigg),\] where \(W\) is the number of wins, \(L\) is the number of losses, \(R\) is defined as a team’s runs scored and \(RA\) is defined as a team’s runs allowed.
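The algebra behind this equivalence can be sketched as follows, assuming that win percentage is measured as \(W/(W+L)\) (that is, every game ends in a win or a loss) and that \(W\), \(L\), \(R\), and \(RA\) are all positive.

```latex
\[
\begin{aligned}
\frac{W}{W + L} &= \frac{R^k}{R^k + RA^k}
  && \text{(generalized formula)} \\
1 + \frac{L}{W} &= 1 + \frac{RA^k}{R^k}
  && \text{(take reciprocals of both sides)} \\
\frac{W}{L} &= \bigg(\frac{R}{RA}\bigg)^k
  && \text{(subtract 1, then take reciprocals again)} \\
\log\bigg(\frac{W}{L}\bigg) &= k \times \log\bigg(\frac{R}{RA}\bigg)
  && \text{(take logarithms)}
\end{aligned}
\]
```

The last line is linear in \(\log(R/RA)\) with slope \(k\) and no constant term.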

Task 9

Use data of mlb from 2000 - 2011 to plot the relationship between \(\log(W/L)\) and \(\log(R/RA)\). Include the fitted regression line in your plot.

Task 10

Use data of mlb from 2000 - 2011 to fit a linear model with \(\log(W/L)\) as the response and \(\log(R/RA)\) as the single predictor variable. The model should be fit with no intercept term. Save the object as m.pyth. Write out the model. If you do not know LaTeX, look at examples from previous in-class assignments.

Task 11

Is the value of the coefficient for \(\log(R/RA)\) statistically different from 2? Justify your answer. Recall that 2 is the exponent in the Pythagorean win expectation formula derived by Bill James.

Task 12

Check whether the four linear model assumptions discussed in the course are satisfied for m.pyth. One plot to check each assumption is sufficient. Also, comment on each plot and state which assumption it supports or shows to be violated.

Model comparison

Task 13

Use data of mlb from 2012 - 2015 to evaluate models m.rd, m.hbbd, and m.pyth. Use each model to compute the expected win percentage for each team for the years 2012 - 2015. Since you know the actual win percentages for the teams during those years you can evaluate the prediction accuracy. This prediction accuracy will be evaluated by the mean squared prediction error, or MSPR, such that \[MSPR = \displaystyle \frac{\displaystyle \sum_{i=1}^{n^*} (y_i - \hat{y}_i) ^ 2}{n^*},\] where \(y_i\) is the observed win percentage from the 2012 - 2015 data, \(\hat{y}_i\) is the predicted win percentage from a given model based on inputs from the 2012 - 2015 data, and \(n^*\) is the number of observations in the data from 2012 - 2015.
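The MSPR formula above translates directly into R. The sketch below is a minimal illustration under stated assumptions: `mspr` is a hypothetical helper name, and the `observed` and `predicted` vectors are made-up toy values, not values from mlb.csv.

```r
# Mean squared prediction error: the average squared difference between
# observed and predicted values.
# mspr is a hypothetical helper name used only for illustration.
mspr <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  mean((observed - predicted)^2)
}

# Toy values for illustration only; these are not taken from mlb.csv
observed  <- c(0.50, 0.60, 0.40)
predicted <- c(0.52, 0.57, 0.43)
mspr(observed, predicted)
```

Note that a model whose response is \(\log(W/L)\) produces predictions on the log win-loss scale. Assuming every game ends in a win or a loss, a predicted value \(x\) on that scale corresponds to a win percentage of \(e^x / (1 + e^x)\), so such predictions should be converted before they are compared with observed win percentages.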

Extra credit

Task 14

Use at least one join operation on at least two data frames from package Lahman to create an original (not one done in class) well-labelled visualization involving at least three variables. You can make a GIF or use any ggplot2 extension. Be creative.

Essential details

Deadline and submission

The deadline to submit Exam 2 is 11:59pm on Tuesday, April 09. Submit your work by uploading only your Rmd file through Google Classroom. Late work will not be accepted except under certain extraordinary circumstances.

Help

  • Post your questions in the #exam2 channel on Slack. These should only be general questions, where you feel the directions are not clear. Do not post any code.

  • Visit Scott or me in office hours or make an appointment. However, we will not guide you to a solution or verify that your code is correct.
    • Shawn’s office hours: Wednesdays 9:00 - 10:30am & Fridays 1:30 - 3:00pm, C409 Wells Hall
    • Scott’s office hours: Thursdays 11:00am - 12:00pm, C511 Wells Hall

Academic integrity

  • This is an individual assignment. This document, its questions, and your answers should only be viewed by you, the instructor, and the teaching assistant. If you fail to abide by these rules, you will earn a 0 and an Academic Dishonesty Report will be filed.

  • You may use any course material or other resources you find helpful online.

  • You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.

Grading

You must use R Markdown. Formatting is at your discretion but is graded. Use the in-class assignments and resources available online for inspiration. Another useful resource for R Markdown formatting is available at https://holtzy.github.io/Pimp-my-rmd/

| Topic | Points |
|---|---|
| Tasks 1 - 12 | 60 |
| Task 13 | 8 |
| Code style (80 characters per line; format of code; comments used appropriately; spaces around operations and commas) | 8 |
| Efficiency (using tidyverse code when possible; using broom code when possible; avoiding loops) | 8 |
| R Markdown style / formatting | 7 |
| Knit | 6 |
| Named code chunks | 3 |
| Total | 100 |

A bonus of 5 points can be earned for Task 14.

References

  1. Lahman, S. (2017) Lahman’s Baseball Database, 1871-2016.