Final Project - Predicting Major League Baseball Wins

AB - An at bat refers to a batter’s turn to face a pitcher and attempt to hit the ball.

IPouts - The number of outs a pitcher has recorded in a game or over a certain period.

Shutout (SHO) - A shutout occurs when a pitcher or group of pitchers prevents the opposing team from scoring any runs over the course of an entire game.

Save (SV) - A save is a statistic credited to a relief pitcher who finishes a game for the winning team under certain prescribed circumstances.

1. Introduction

2. Reading the lahman’s Data File.

3. Exploratory Data Analysis.

-Selecting Independent Variables

4. Linear Regression Model - Predicting Wins

5. Model Diagnosis.

6. Model Evaluation using a 2023 season data.

7. Usefulness of the study.

1. Introduction

For my final project, I propose to delve into the realm of sports analytics, specifically focusing on Major League Baseball (MLB). The objective of this project is to develop a predictive model that can forecast the outcome of MLB games based on box score and game-related statistics. By leveraging data-driven insights, we aim to gain a deeper understanding of the factors that contribute to a team’s success in winning games.

library(ggrepel)

## Loading required package: ggplot2

library(boot)
library(broom)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:boot':
## 
##     logit

library(psych)

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(gganimate)

2. Reading the lahman’s Data file.

lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/teams.csv")
lahman_data <- lahman_data |>
  filter(yearID != 2020)

Note: 2020 seasonis removed due to unusual format because of covid pandemic.

3. Exploratory Data Analysis.

Understand the data-set.

numeric_lahman <- lahman_data[, sapply(lahman_data, is.numeric)]
head(describe(numeric_lahman))

##        vars   n    mean    sd median trimmed   mad  min  max range  skew
## yearID    1 660 2010.59  6.50 2010.5 2010.50  8.15 2000 2022    22  0.07
## Rank      2 660    3.01  1.45    3.0    2.99  1.48    1    6     5  0.09
## G         3 660  161.96  0.30  162.0  162.00  0.00  161  163     2 -1.01
## Ghome     4 660   80.96  0.50   81.0   81.00  0.00   77   84     7 -1.51
## W         5 660   80.97 12.04   81.0   81.18 13.34   43  116    73 -0.15
## L         6 660   80.97 12.02   81.0   80.78 13.34   46  119    73  0.15
##        kurtosis   se
## yearID    -1.13 0.25
## Rank      -1.18 0.06
## G          7.35 0.01
## Ghome     23.98 0.02
## W         -0.46 0.47
## L         -0.45 0.47

Distribution of runs league wise (independent variable):

Box_Plot <- lahman_data |>
  ggplot(aes(x=lgID,y=R,fill=lgID))+
  geom_boxplot()+
  labs(x='League',y='Runs scored by teams',title="Distribution of Runs Scored by teams in AL and NL")+
  theme_economist()

ggplotly(Box_Plot)

Interpretation: By interpreting the box plot we can assume that the average runs scored in American league is higher than that in National League.

Box_Plot <- lahman_data |>
  ggplot(aes(x=lgID,y=RA,fill=lgID))+
  geom_boxplot()+
  labs(x='League',y='Runs allowed by teams',title="Distribution of Runs Allowed by teams in AL and NL")+
  theme_economist()

ggplotly(Box_Plot)

Interpretation: By interpreting the box plot we can assume that the average runs allowed in American league is higher than that in National League

Selecting Independent Variables

lahman_data_numeric <- lahman_data[ ,sapply(lahman_data, is.numeric)]
lahman_data_numeric <- lahman_data_numeric |>
  rename(Attd=attendance)|>
  select(-Rank,-G,-Ghome,-L,-BPF,-PPF,-yearID)
Corr = cor(lahman_data_numeric,method = "pearson")

correlation_df <- as.data.frame(Corr)


correlation_df['W'] |>
  arrange(desc(W)) |>
  head()

##                W
## W      1.0000000
## SV     0.6597673
## R      0.5691824
## SHO    0.5113828
## Attd   0.4700870
## IPouts 0.4643397

library(ggcorrplot)
corr <- ggcorrplot(Corr,hc.order = TRUE, 
           type = "lower") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Correlation Plot") +
  theme_classic()

ggplotly(corr)

Interpretation: The above dataframe indicates which independent variables are closely correlated with wins. With some knowledge of baseball, I have selected the following variables as independent variables.

Variables Selected: R+RA+AB+ERA+HR+IPouts+Attd+HA+SHO+SV

library(ggplot2)

# Define custom density function
my_density <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) + 
    geom_density(alpha = 0.5, fill = "cornflowerblue", ...)
}

# Define custom scatterplot function
my_scatter <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) + 
    geom_point(alpha = 0.5, color = "cornflowerblue") + 
    geom_smooth(method = lm, se = FALSE, show.legend = FALSE, ...)
}

# Example data
data <- lahman_data_numeric |>
  select(W, R, ERA, Attd, SV)

# Create scatterplot matrix
pairs <- ggpairs(data, 
        lower = list(continuous = my_scatter), 
        diag = list(continuous = my_density),
        progress = FALSE)
ggplotly(pairs)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

## Warning: Can only have one: highlight

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

## Warning: Can only have one: highlight

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Can only have one: highlight

## Warning: Can only have one: highlight

Interpretation: Show casing the ggpairs plots between the Wins and few other independent variables. The independent variables approximately following the normal distribution.

4. Linear Regression Model

Dependent Variable <- Wins

Independent Variables <- R+RA+H+ERA+HR+Attd+HA+SHO+SV

MLR_model1 <- lm(W~R+RA+ERA+HR+Attd+HA+SHO+SV+AB,data=lahman_data_numeric) |>
  step(direction = 'backward', trace =0)

summary(MLR_model1)

## 
## Call:
## lm(formula = W ~ R + RA + ERA + Attd + SHO + SV + AB, data = lahman_data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8470 -2.0185  0.0385  1.9721  8.8952 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.307e+01  9.517e+00   7.677 5.96e-14 ***
## R            9.064e-02  1.873e-03  48.407  < 2e-16 ***
## RA          -5.609e-02  9.292e-03  -6.037 2.63e-09 ***
## ERA         -3.063e+00  1.506e+00  -2.034   0.0424 *  
## Attd         3.772e-07  1.966e-07   1.919   0.0555 .  
## SHO          2.032e-01  4.244e-02   4.788 2.09e-06 ***
## SV           4.098e-01  2.067e-02  19.829  < 2e-16 ***
## AB          -4.371e-03  1.848e-03  -2.365   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.123 on 652 degrees of freedom
## Multiple R-squared:  0.9334, Adjusted R-squared:  0.9327 
## F-statistic:  1306 on 7 and 652 DF,  p-value: < 2.2e-16

Interpretation:

The intercept (51) is statistically significant (p < 0.05), suggesting that when all predictor variables are zero, the expected value of the response variable is significantly different from zero.
The coefficients for predictors R (Runs), RA (Runs Allowed), SHO (Shutouts), SV (Saves), ERA (Earned Runs Average), AB (At Bats) are all statistically significant (p < 0.05). For instance, a one-unit increase in R is associated with a 0.088 unit increase in the response variable, controlling for other predictors.
Overall, this model explains approximately 93.34% of the variability in the number of wins (Adjusted R-squared = 0.9327), indicating a strong fit. However, it’s important to note that while some variables like runs scored, runs allowed, shutouts, and saves appear to have significant effects on the number of wins, other variables like attendance do not show statistically significant effects in this model.

Variance Inflation Factor

(VIF is a measure used in regression analysis to quantify how much the variance of an estimated regression coefficient increases if your predictors are correlated. High VIF values indicate high multicollinearity, which can lead to unreliable coefficient estimates.)

vif(MLR_model1)

##         R        RA       ERA      Attd       SHO        SV        AB 
##  1.618830 46.759013 44.989702  1.332814  1.905306  1.543260  1.643200

paste("Correlation between RA and ERA", cor(lahman_data$ERA,lahman_data$RA))

## [1] "Correlation between RA and ERA 0.988224732424427"

Interpretation: A rule of thumb is to keep each VIF below 10. In the above output, ERA and RA have values of 45 and 47, respectively. Technically speaking, RA and ERA measure the runs scored by the opponents, and they are highly correlated. It is better to remove either one of the variables from the independent variables list.

After removing ERA, Attd and adding IPouts to the model.

MLR_model2 <- lm(W~R+RA+SHO+SV+IPouts+AB,data=lahman_data_numeric) |>
  step(direction = 'backward', trace =0)

summary(MLR_model2)

## 
## Call:
## lm(formula = W ~ R + RA + SHO + SV + IPouts + AB, data = lahman_data_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9247 -2.1280  0.0897  2.0253  9.2901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.691335  14.001191   0.978    0.329    
## R            0.093685   0.001857  50.452  < 2e-16 ***
## RA          -0.070289   0.002335 -30.102  < 2e-16 ***
## SHO          0.202505   0.041833   4.841 1.62e-06 ***
## SV           0.396055   0.020626  19.202  < 2e-16 ***
## IPouts       0.021186   0.004182   5.066 5.30e-07 ***
## AB          -0.010824   0.002332  -4.641 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.079 on 653 degrees of freedom
## Multiple R-squared:  0.9352, Adjusted R-squared:  0.9346 
## F-statistic:  1571 on 6 and 653 DF,  p-value: < 2.2e-16

vif(MLR_model2)

##        R       RA      SHO       SV   IPouts       AB 
## 1.638348 3.038943 1.905124 1.581730 2.313793 2.692953

tidy(MLR_model2,conf.int = TRUE)

## # A tibble: 7 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)  13.7     14.0         0.978 3.29e-  1 -13.8     41.2    
## 2 R             0.0937   0.00186    50.5   1.79e-227   0.0900   0.0973 
## 3 RA           -0.0703   0.00234   -30.1   1.60e-125  -0.0749  -0.0657 
## 4 SHO           0.203    0.0418      4.84  1.62e-  6   0.120    0.285  
## 5 SV            0.396    0.0206     19.2   1.73e- 65   0.356    0.437  
## 6 IPouts        0.0212   0.00418     5.07  5.30e-  7   0.0130   0.0294 
## 7 AB           -0.0108   0.00233    -4.64  4.19e-  6  -0.0154  -0.00624

Interpretation: In summary, this regression model suggests that the independent variables R, RA, SHO, SV, IPouts and AB are all significantly associated with the number of wins in baseball games. The model has a high R-squared value (0.9352), indicating that it explains a large proportion of the variance in W.

Compare Models - Using BIC to compare model 1 and 2.

paste("BIC of Model 1",BIC(MLR_model1))

## [1] "BIC of Model 1 3426.71585363651"

paste("BIC of Model 2",BIC(MLR_model2))

## [1] "BIC of Model 2 3402.29040494585"

Interpretation: Comparing the BIC values of the two models, we observe that Model 2 has a lower BIC value (3402.3) compared to Model 1 (3426.7). According to the BIC criterion, lower values indicate a better trade-off between model fit and complexity. Therefore, based on the BIC values alone, Model 2 appears to be a preferable model as it achieves a better balance between goodness of fit and parsimony.

5. Model Diagnosis

Diagnostic Plots

gg_diagnose(MLR_model2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Residuals vs. Fitted Values

ggres <- gg_resfitted(MLR_model2) +
  geom_smooth(se=FALSE)+
  theme_economist()

ggplotly(ggres)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Interpretation: The points in the plot exhibit a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is not violated. In other words, there is no discernible pattern in the residuals as the fitted values change. The residuals should be independent of each other. In other words, there is no pattern or correlation between consecutive residuals.

Normality of residuals:

ggqq <- gg_qqplot(MLR_model2)+
  theme_economist()
ggplotly(ggqq)

Interpretation: Normal Q-Q plot shows points closely following the diagonal line, indicating that the data (residuals) is approximately normally distributed

Residuals vs. X Values:

gg_resX(MLR_model2)

Interpretation: The plot helps to assess the assumption of linearity between the predictor variable (X) and the response variable. The points in the above plot exhibiting a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is reasonable.

Residual Histogram:

ghist <- gg_reshist(MLR_model2)+
  geom_histogram(aes(colorbar='blue'),show.legend = FALSE)+
  theme_economist()

## Warning in geom_histogram(aes(colorbar = "blue"), show.legend = FALSE):
## Ignoring unknown aesthetics: colourbar

ggplotly(ghist)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The above histogram of residuals interprets that the data is almost normally distributed.

Influential Data Points:

Reslev <- gg_resleverage(MLR_model2)+
  theme_economist()

ggplotly(Reslev)

## `geom_smooth()` using formula = 'y ~ x'

Interpretation: This plot helps us in identifying the influential points in the dataset. Look for the points which have high leverage and high residual.

Cook’s Distance - Used to identify the influential points.

Cookd <- gg_cooksd(MLR_model2, threshold = 'matlab') +
  theme_economist()
ggplotly(Cookd)

Interpretation: A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential data points that may disproportionately affect the estimation of regression coefficients.

Higher values of Cook’s Distance indicate greater influence of the corresponding data point on the regression model.

Observations with Cook’s Distance values exceeding the threshold are highlighted in the plot. These points are considered potentially influential and may significantly impact the estimated regression coefficients if removed from the analysis.

High Cook’s Distance values indicate that removing the corresponding observation from the dataset could lead to substantial changes in the estimated regression coefficients.

6. Model Evaluation

Lets try to predict the Tampa Bay Rays 2023 season wins using the multiple linear model we built:

Before that lets try to visualize the scoring patterns of the Tampa Bay Rays and New York Yankees (One of the most successful franchises in the major league baseball).

anim <- lahman_data |>
  filter(franchID=='TBD' | franchID == 'NYY')|>
  ggplot(aes(x=yearID,y=R,color=franchID))+
  geom_line()+
  geom_point()+
  labs(title = "Runs Scoring pattern from 2000-2022 season",x="Season",y="Runs Scored")+
  transition_reveal(yearID)+
  theme_economist()
  
anim

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

#anim_save("anim.gif", anim)

Team	Season	At Bats	Runs Scored	Runs Allowed	Shutouts	Saves	IPouts
Tampa Bay Rays	2023	5511	860	665	14	45	4326

Predicted Wins=13.691335+0.093685×R−0.070289×RA+0.202505×SHO+0.396055×SV+0.021186×IPouts−0.010824×AB

Predicted Wins = 96.7.

Actual Wins = 99.

7. Usefulness of the study

The present study delves into the realm of baseball, with a focus on predicting Major League Baseball wins through the utilization of a Multiple Linear Regression model. The primary aim of this endeavor is to provide teams and managers with insights into the key variables that significantly influence game outcomes.

It has been determined that among various game-related statistics, such as Runs Scored, Runs Allowed, Shutouts, Saves, and IPouts, the latter hold significant sway in determining the outcome of matches. While Runs Scored reflects the offensive prowess of a team, Runs Allowed, Shutouts, Saves, and IPouts shed light on defensive and pitching capabilities.

The constructed model demonstrates an impressive explanatory power, accounting for 93.5% of the variance in wins, utilizing the aforementioned game-related statistics.

Moving forward, there exists ample opportunity for further exploration within the domain of baseball analytics, also known as Sabermetrics. While the current study primarily relies on box-score related statistics, the burgeoning advancements in sports technology present a wealth of possibilities for enhancing our understanding of the game and refining the accuracy and efficacy of predictions.

Final Project - Predicting Major League Baseball Wins

Anurag

2024-04-18

Contents

1. Introduction

2. Reading the lahman’s Data File.

3. Exploratory Data Analysis.

-Selecting Independent Variables

4. Linear Regression Model - Predicting Wins

5. Model Diagnosis.

6. Model Evaluation using a 2023 season data.

7. Usefulness of the study.

1. Introduction

2. Reading the lahman’s Data file.

3. Exploratory Data Analysis.

Selecting Independent Variables

4. Linear Regression Model

Interpretation:

Variance Inflation Factor

After removing ERA, Attd and adding IPouts to the model.

Compare Models - Using BIC to compare model 1 and 2.

5. Model Diagnosis

Residuals vs. Fitted Values

Normality of residuals:

Residuals vs. X Values:

Residual Histogram:

Influential Data Points:

6. Model Evaluation

7. Usefulness of the study