STAT 3341 Homework #1

Author

Hamish Suter 47940936

Instructions (Level 1)

Your goal is to reproduce every aspect of this document (including this!). The only additional thing that should be included in your document is your name and student ID. The point breakdown is as follows:

15 points: Submitting both a .qmd and .html file that include your name and student ID as the “Author” in the YAML header
15 points: YAML header (including embed-resources, spacelab theme, code folding, and table of contents)
10 points: R Chunks (including chunk options)
10 points: Displaying Pythagorean win percentage equation (including the ^ above wins)
30 points: All text formatting issues (including headers, hyperlinks, bulleted lists, tabs, and font in bold/italics)
10 points: Datatable formatting
10 points: In-line code used to display the max Pythagorean win percentage in the “Viewing the Raw Data” section

Pythagorean Winning Percentage (Level 2)

The Pythagorean winning percentage is an advanced metric first proposed by Bill James. The idea is that game-to-game variability can sometimes mask overall strength; that is, the number of games you actually won may not match the number of games you should have won.

The original formula for Pythagorean winning percentage is: \[\hat{wins} = \frac{R^2}{R^2+RA^2}\] where R = runs scored and RA = runs allowed. More information about Pythagorean winning percentage can be found at:

https://www.mlb.com/glossary/advanced-stats/pythagorean-winning-percentage
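
As a quick worked example of the formula above, here is a minimal sketch in R. The helper name pythag_pct and the season totals (800 runs scored, 700 runs allowed) are made up for illustration and do not come from the Lahman data.

```r
# Hypothetical helper (not part of the Lahman package): Pythagorean winning
# percentage from runs scored (R) and runs allowed (RA).
pythag_pct <- function(R, RA, exponent = 2) {
  R^exponent / (R^exponent + RA^exponent)
}

# Made-up season totals, for illustration only
runs_scored  <- 800
runs_allowed <- 700

pct <- pythag_pct(runs_scored, runs_allowed)
round(pct, 3)        # 0.566
round(pct * 162, 1)  # expected wins over a 162-game season: 91.8
```

The exponent is left as an argument only so the original value of 2 is explicit; a team that scores exactly as many runs as it allows comes out at 0.500.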

Application to Lahman Data (Level 2)

Let’s explore Pythagorean winning percentage using the Lahman package. Specifically, you need to load the library and access the “Teams” data set.

SHOW ME THE CODE!

## You wouldn't want to display the entire data set because it has 2,955 rows.
## If we wanted to do this, we would just type the name of the data set.
## Instead, we set eval: false to prevent the long output!

library(Lahman)
dim(Teams)
Teams

Remember, you can use the datatable() function from the DT package to create a presentation-worthy table. We will work with all seasons since 1995, the first season in which the Wild Card was used. First, let’s subset the data and add both actual winning percentage and Pythagorean winning percentage.

SHOW ME THE CODE!

library(Lahman)

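# Keep seasons from 1995 onward. Columns are picked by position, which assumes
# the usual Teams layout: yearID, lgID, franchID, W, L, R, RA, WSWin
teams.short <- subset(Teams[,c(1:2, 4, 9:10, 15, 27, 14)], yearID >= 1995)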

teams.short$WinPercent <- with(teams.short, W/(W+L))
teams.short$PythagPercent <- with(teams.short, R^2/(R^2 + RA^2))

Below is a nicely formatted table that uses the teams.short data set defined above. Note that the chunk of code that creates this table has intentionally not been displayed (i.e., echo: false).

Using in-line code with the max() and round() functions (and rounding to three digits), we find that the max Pythagorean winning percentage was 0.729. This can be confirmed by sorting the table above.
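
The max() and round() combination described above can be sketched as follows. The three values are made-up stand-ins for teams.short$PythagPercent, so the result here is not the 0.729 reported for the real data.

```r
# Made-up Pythagorean winning percentages standing in for
# teams.short$PythagPercent (illustration only)
pythag_example <- c(0.48812, 0.61357, 0.55204)

# Largest value, rounded to three digits
max_pythag <- round(max(pythag_example), 3)
max_pythag  # 0.614

# In the .qmd text, the same expression would sit inside inline code, e.g.
#   `r round(max(teams.short$PythagPercent), 3)`
```

Writing the value as inline code rather than typing the number by hand keeps the sentence in sync with the data if the subset ever changes.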

Next, let’s explore the correlation between actual winning percentage and Pythagorean winning percentage.

SHOW ME THE CODE!

with(teams.short, plot(PythagPercent, WinPercent, xlab="Pythagorean Winning Percentage", ylab="Actual Winning Percentage"))
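
The scatterplot shows the relationship visually; a single number for it could come from cor(). The sketch below uses a small made-up data frame (toy) rather than the Lahman data, so the value it prints says nothing about the real seasons.

```r
# Made-up team seasons, for illustration only (not Lahman data)
toy <- data.frame(
  W  = c(95, 88, 81, 74, 67),
  L  = c(67, 74, 81, 88, 95),
  R  = c(820, 760, 720, 690, 650),
  RA = c(680, 700, 720, 750, 790)
)
toy$WinPercent    <- with(toy, W / (W + L))
toy$PythagPercent <- with(toy, R^2 / (R^2 + RA^2))

# Pearson correlation between the two percentages
r <- with(toy, cor(PythagPercent, WinPercent))
r

# With the real data, the analogous call would be
#   with(teams.short, cor(PythagPercent, WinPercent))
```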

Last, let’s see whether actual winning percentage or Pythagorean winning percentage is the better predictor of winning the World Series.

SHOW ME THE CODE!

teams.short$WSChamp <- ifelse(teams.short$WSWin=='Y', 1, 0)

summary(glm(WSChamp ~ WinPercent, data=teams.short, family='binomial'))


Call:
glm(formula = WSChamp ~ WinPercent, family = "binomial", data = teams.short)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4067  -0.2601  -0.1368  -0.0716   2.8899  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -16.205      2.255  -7.186 6.67e-13 ***
WinPercent    23.363      3.833   6.096 1.09e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236.35  on 803  degrees of freedom
Residual deviance: 183.20  on 802  degrees of freedom
AIC: 187.2

Number of Fisher Scoring iterations: 7

SHOW ME THE CODE!

summary(glm(WSChamp ~ PythagPercent, data=teams.short, family='binomial'))


Call:
glm(formula = WSChamp ~ PythagPercent, family = "binomial", data = teams.short)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.02349  -0.26406  -0.14586  -0.07879   2.88091  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -14.966      2.076  -7.210 5.61e-13 ***
PythagPercent   21.143      3.522   6.003 1.94e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236.35  on 803  degrees of freedom
Residual deviance: 187.87  on 802  degrees of freedom
AIC: 191.87

Number of Fisher Scoring iterations: 7
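
To get a feel for what the two sets of coefficients imply, the sketch below plugs them into the inverse logit for a hypothetical team at .600 on each scale. The coefficients are copied from the outputs above; the .600 value is an arbitrary choice for illustration.

```r
# Coefficients copied from the two summary() outputs above
b_win    <- c(intercept = -16.205, slope = 23.363)  # WSChamp ~ WinPercent
b_pythag <- c(intercept = -14.966, slope = 21.143)  # WSChamp ~ PythagPercent

# Hypothetical team with a .600 percentage on each scale
pct <- 0.600

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_win    <- plogis(b_win[["intercept"]] + b_win[["slope"]] * pct)
p_pythag <- plogis(b_pythag[["intercept"]] + b_pythag[["slope"]] * pct)

round(c(WinPercent = p_win, PythagPercent = p_pythag), 3)
```

Both fitted probabilities come out at roughly 0.1, so the two models give similar answers for such a team.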

The AIC value for the model using actual winning percentage is slightly lower (187.2 versus 191.87), indicating that it is a better predictor of winning the World Series (but perhaps not meaningfully better).
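
The two AIC values can be recovered by hand from the residual deviances in the outputs. This is a minimal sketch that assumes the usual identity for an ungrouped 0/1 response, where the residual deviance equals -2 times the log-likelihood.

```r
# Residual deviances copied from the two summary() outputs
dev_win    <- 183.20  # WSChamp ~ WinPercent
dev_pythag <- 187.87  # WSChamp ~ PythagPercent
k <- 2                # estimated coefficients: intercept + slope

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * k
aic_win    <- dev_win + 2 * k     # 187.2
aic_pythag <- dev_pythag + 2 * k  # 191.87

aic_pythag - aic_win              # 4.67
```

Because both models estimate the same number of coefficients, the AIC gap is simply the gap in residual deviance.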