STAT 3341 Homework #1
Instructions (Level 1)
Your goal is to reproduce every aspect of this document (including this!). The only additional thing that should be included in your document is your name and student ID. The point breakdown is as follows:
- 15 points: Submitting both a .qmd and .html file that include your name and student ID as the “Author” in the YAML header
- 15 points: YAML header (including embed-resources, spacelab theme, code folding, and table of contents)
- 10 points: R Chunks (including chunk options)
- 10 points: Displaying Pythagorean win percentage equation (including the ^ above wins)
- 30 points: All text formatting issues (including headers, hyperlinks, bulleted lists, tabs, and font in bold/italics)
- 10 points: Datatable formatting
- 10 points: In-line code used to display the max Pythagorean win percentage in the “Viewing the Raw Data” section
Pythagorean Winning Percentage (Level 2)
The Pythagorean winning percentage is an advanced metric first proposed by Bill James. The idea is that game-to-game variability can sometimes mask overall strength… that is, the number of games you actually won may not match the number of games you should have won.
The original formula for Pythagorean winning percentage is: \[\hat{wins} = \frac{R^2}{R^2+RA^2}\] where R = runs scored and RA = runs allowed. More information about Pythagorean winning percentage can be found at:
https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage
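As a quick worked example of the formula, the sketch below evaluates it in R for a hypothetical team that scored 800 runs and allowed 700. The helper name pythag_pct and the run totals are made up for illustration.

```r
# Pythagorean winning percentage with the original exponent of 2.
# pythag_pct is a hypothetical helper name, not part of any package.
pythag_pct <- function(R, RA) {
  R^2 / (R^2 + RA^2)
}

# Hypothetical season: 800 runs scored, 700 runs allowed
est <- pythag_pct(R = 800, RA = 700)
round(est, 3)  # 0.566
```

Over a 162-game season, that estimate corresponds to roughly 92 expected wins.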
Application to Lahman Data (Level 2)
Let’s explore Pythagorean winning percentage using the Lahman package. Specifically, you need to load the library and access the “Teams” data set.
Remember, you can use the datatable() function from the DT package to create a presentation-worthy table. We will work with all seasons since 1995, which is when the Wild Card was introduced. First, let’s subset the data and add both actual winning percentage and Pythagorean winning percentage.
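One way to carry out this step is sketched below. It assumes the Lahman package is installed and that actual winning percentage is computed as W / (W + L). The column names WinPercent, PythagPercent, and WSChamp follow the model output later in this document, while the 0/1 coding of WSChamp from Lahman's WSWin column is an assumption.

```r
library(Lahman)

# Keep seasons from 1995 onward
teams.short <- subset(Teams, yearID >= 1995)

# Actual winning percentage (assumed here to be wins over decided games)
teams.short$WinPercent <- teams.short$W / (teams.short$W + teams.short$L)

# Pythagorean winning percentage from runs scored (R) and runs allowed (RA)
teams.short$PythagPercent <-
  teams.short$R^2 / (teams.short$R^2 + teams.short$RA^2)

# Assumed coding: 1 if the team won the World Series, 0 otherwise
teams.short$WSChamp <-
  as.integer(!is.na(teams.short$WSWin) & teams.short$WSWin == "Y")
```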
Below is a nicely formatted table that uses the teams.short data set defined above. Note that the chunk of code that creates this table has intentionally not been displayed (i.e., echo: false).
Using in-line code with the max() and round() functions (and rounding to three digits), we find that the max Pythagorean winning percentage was 0.729. This can be confirmed by sorting the table above.
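The expression behind such an in-line value might look like the sketch below, shown on a small vector of made-up percentages rather than the real data. In the .qmd file the same call would sit inside an in-line R expression and take the PythagPercent column of teams.short as its input.

```r
# Hypothetical Pythagorean winning percentages (not real data)
pct <- c(0.51234, 0.72861, 0.60119)

# Largest value, rounded to three digits
max.pct <- round(max(pct), 3)
max.pct  # 0.729
```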
Next, let’s explore the correlation between actual winning percentage and Pythagorean winning percentage.
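A minimal sketch of that calculation follows, assuming the Lahman package is installed and that actual winning percentage is taken as W / (W + L). It recomputes both percentages directly from the Teams data so that it runs on its own; the name win.cor is illustrative.

```r
# Pearson correlation between actual and Pythagorean winning percentage
# for seasons 1995 onward; requires the Lahman package
win.cor <- with(
  subset(Lahman::Teams, yearID >= 1995),
  cor(W / (W + L), R^2 / (R^2 + RA^2))
)
win.cor
```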
Last, let’s see whether actual winning percentage or Pythagorean winning percentage is the better predictor of winning the World Series.
SHOW ME THE CODE!
Call:
glm(formula = WSChamp ~ WinPercent, family = "binomial", data = teams.short)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4067  -0.2601  -0.1368  -0.0716   2.8899

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -16.205      2.255  -7.186 6.67e-13 ***
WinPercent    23.363      3.833   6.096 1.09e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236.35  on 803  degrees of freedom
Residual deviance: 183.20  on 802  degrees of freedom
AIC: 187.2

Number of Fisher Scoring iterations: 7

Call:
glm(formula = WSChamp ~ PythagPercent, family = "binomial", data = teams.short)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.02349  -0.26406  -0.14586  -0.07879   2.88091

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -14.966      2.076  -7.210 5.61e-13 ***
PythagPercent   21.143      3.522   6.003 1.94e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236.35  on 803  degrees of freedom
Residual deviance: 187.87  on 802  degrees of freedom
AIC: 191.87

Number of Fisher Scoring iterations: 7

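The AIC for the model using actual winning percentage (187.2) is slightly lower than the AIC for the model using Pythagorean winning percentage (191.87), indicating that actual winning percentage is the better predictor of winning the World Series (but perhaps not meaningfully better).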
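To make the comparison concrete, the sketch below rebuilds both AIC values from the residual deviances printed in the two model summaries. It relies on the fact that for a 0/1 response the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The variable names are illustrative.

```r
# Residual deviances copied from the two model summaries
dev.win    <- 183.20  # WSChamp ~ WinPercent
dev.pythag <- 187.87  # WSChamp ~ PythagPercent

# Each model estimates two coefficients: intercept and slope
k <- 2

# AIC = residual deviance + 2k for a binary-response logistic model
aic.win    <- dev.win + 2 * k
aic.pythag <- dev.pythag + 2 * k

# Positive difference means the WinPercent model has the lower AIC
aic.diff <- aic.pythag - aic.win
c(aic.win = aic.win, aic.pythag = aic.pythag, aic.diff = aic.diff)
```

Both models have the same number of coefficients, so the gap of about 4.67 comes entirely from the difference in residual deviance.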