Game related statistics explanation:
R - Runs Scored by a team in a season
RA - Runs Allowed by a team in a season
ERA - Runs Allowed responsible by a pitcher of the team in a season
HR - Home Runs scored by the team in a season
AB - An at bat refers to a batter’s turn to face a pitcher and attempt to hit the ball.
IPouts - The number of outs a pitcher has recorded in a game or over a certain period.
Shutout (SHO) - A shutout occurs when a pitcher or group of pitchers prevents the opposing team from scoring any runs over the course of an entire game.
Save (SV) - A save is a statistic credited to a relief pitcher who finishes a game for the winning team under certain prescribed circumstances.
For my final project, I propose to delve into the realm of sports analytics, specifically focusing on Major League Baseball (MLB). The objective of this project is to develop a predictive model that can forecast the outcome of MLB games based on box score and game-related statistics. By leveraging data-driven insights, we aim to gain a deeper understanding of the factors that contribute to a team’s success in winning games.
library(ggrepel)
## Loading required package: ggplot2
library(boot)
library(broom)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:boot':
##
## logit
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:car':
##
## logit
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gganimate)
lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/teams.csv")
lahman_data <- lahman_data |>
filter(yearID != 2020)
Note: 2020 seasonis removed due to unusual format because of covid pandemic.
numeric_lahman <- lahman_data[, sapply(lahman_data, is.numeric)]
head(describe(numeric_lahman))
## vars n mean sd median trimmed mad min max range skew
## yearID 1 660 2010.59 6.50 2010.5 2010.50 8.15 2000 2022 22 0.07
## Rank 2 660 3.01 1.45 3.0 2.99 1.48 1 6 5 0.09
## G 3 660 161.96 0.30 162.0 162.00 0.00 161 163 2 -1.01
## Ghome 4 660 80.96 0.50 81.0 81.00 0.00 77 84 7 -1.51
## W 5 660 80.97 12.04 81.0 81.18 13.34 43 116 73 -0.15
## L 6 660 80.97 12.02 81.0 80.78 13.34 46 119 73 0.15
## kurtosis se
## yearID -1.13 0.25
## Rank -1.18 0.06
## G 7.35 0.01
## Ghome 23.98 0.02
## W -0.46 0.47
## L -0.45 0.47
Box_Plot <- lahman_data |>
ggplot(aes(x=lgID,y=R,fill=lgID))+
geom_boxplot()+
labs(x='League',y='Runs scored by teams',title="Distribution of Runs Scored by teams in AL and NL")+
theme_economist()
ggplotly(Box_Plot)
Interpretation: By interpreting the box plot we can assume that the average runs scored in American league is higher than that in National League.
Box_Plot <- lahman_data |>
ggplot(aes(x=lgID,y=RA,fill=lgID))+
geom_boxplot()+
labs(x='League',y='Runs allowed by teams',title="Distribution of Runs Allowed by teams in AL and NL")+
theme_economist()
ggplotly(Box_Plot)
Interpretation: By interpreting the box plot we can assume that the average runs allowed in American league is higher than that in National League
lahman_data_numeric <- lahman_data[ ,sapply(lahman_data, is.numeric)]
lahman_data_numeric <- lahman_data_numeric |>
rename(Attd=attendance)|>
select(-Rank,-G,-Ghome,-L,-BPF,-PPF,-yearID)
Corr = cor(lahman_data_numeric,method = "pearson")
correlation_df <- as.data.frame(Corr)
correlation_df['W'] |>
arrange(desc(W)) |>
head()
## W
## W 1.0000000
## SV 0.6597673
## R 0.5691824
## SHO 0.5113828
## Attd 0.4700870
## IPouts 0.4643397
library(ggcorrplot)
corr <- ggcorrplot(Corr,hc.order = TRUE,
type = "lower") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Correlation Plot") +
theme_classic()
ggplotly(corr)
Interpretation: The above dataframe indicates which independent variables are closely correlated with wins. With some knowledge of baseball, I have selected the following variables as independent variables.
Variables Selected: R+RA+AB+ERA+HR+IPouts+Attd+HA+SHO+SV
library(ggplot2)
# Define custom density function
my_density <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density(alpha = 0.5, fill = "cornflowerblue", ...)
}
# Define custom scatterplot function
my_scatter <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.5, color = "cornflowerblue") +
geom_smooth(method = lm, se = FALSE, show.legend = FALSE, ...)
}
# Example data
data <- lahman_data_numeric |>
select(W, R, ERA, Attd, SV)
# Create scatterplot matrix
pairs <- ggpairs(data,
lower = list(continuous = my_scatter),
diag = list(continuous = my_density),
progress = FALSE)
ggplotly(pairs)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Can only have one: highlight
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Can only have one: highlight
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Can only have one: highlight
## Warning: Can only have one: highlight
Interpretation: Show casing the ggpairs plots between the Wins and few other independent variables. The independent variables approximately following the normal distribution.
Dependent Variable <- Wins
Independent Variables <- R+RA+H+ERA+HR+Attd+HA+SHO+SV
MLR_model1 <- lm(W~R+RA+ERA+HR+Attd+HA+SHO+SV+AB,data=lahman_data_numeric) |>
step(direction = 'backward', trace =0)
summary(MLR_model1)
##
## Call:
## lm(formula = W ~ R + RA + ERA + Attd + SHO + SV + AB, data = lahman_data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8470 -2.0185 0.0385 1.9721 8.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.307e+01 9.517e+00 7.677 5.96e-14 ***
## R 9.064e-02 1.873e-03 48.407 < 2e-16 ***
## RA -5.609e-02 9.292e-03 -6.037 2.63e-09 ***
## ERA -3.063e+00 1.506e+00 -2.034 0.0424 *
## Attd 3.772e-07 1.966e-07 1.919 0.0555 .
## SHO 2.032e-01 4.244e-02 4.788 2.09e-06 ***
## SV 4.098e-01 2.067e-02 19.829 < 2e-16 ***
## AB -4.371e-03 1.848e-03 -2.365 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.123 on 652 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.9327
## F-statistic: 1306 on 7 and 652 DF, p-value: < 2.2e-16
The intercept (51) is statistically significant (p < 0.05), suggesting that when all predictor variables are zero, the expected value of the response variable is significantly different from zero.
The coefficients for predictors R (Runs), RA (Runs Allowed), SHO (Shutouts), SV (Saves), ERA (Earned Runs Average), AB (At Bats) are all statistically significant (p < 0.05). For instance, a one-unit increase in R is associated with a 0.088 unit increase in the response variable, controlling for other predictors.
Overall, this model explains approximately 93.34% of the variability in the number of wins (Adjusted R-squared = 0.9327), indicating a strong fit. However, it’s important to note that while some variables like runs scored, runs allowed, shutouts, and saves appear to have significant effects on the number of wins, other variables like attendance do not show statistically significant effects in this model.
(VIF is a measure used in regression analysis to quantify how much the variance of an estimated regression coefficient increases if your predictors are correlated. High VIF values indicate high multicollinearity, which can lead to unreliable coefficient estimates.)
vif(MLR_model1)
## R RA ERA Attd SHO SV AB
## 1.618830 46.759013 44.989702 1.332814 1.905306 1.543260 1.643200
paste("Correlation between RA and ERA", cor(lahman_data$ERA,lahman_data$RA))
## [1] "Correlation between RA and ERA 0.988224732424427"
Interpretation: A rule of thumb is to keep each VIF below 10. In the above output, ERA and RA have values of 45 and 47, respectively. Technically speaking, RA and ERA measure the runs scored by the opponents, and they are highly correlated. It is better to remove either one of the variables from the independent variables list.
MLR_model2 <- lm(W~R+RA+SHO+SV+IPouts+AB,data=lahman_data_numeric) |>
step(direction = 'backward', trace =0)
summary(MLR_model2)
##
## Call:
## lm(formula = W ~ R + RA + SHO + SV + IPouts + AB, data = lahman_data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9247 -2.1280 0.0897 2.0253 9.2901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.691335 14.001191 0.978 0.329
## R 0.093685 0.001857 50.452 < 2e-16 ***
## RA -0.070289 0.002335 -30.102 < 2e-16 ***
## SHO 0.202505 0.041833 4.841 1.62e-06 ***
## SV 0.396055 0.020626 19.202 < 2e-16 ***
## IPouts 0.021186 0.004182 5.066 5.30e-07 ***
## AB -0.010824 0.002332 -4.641 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.079 on 653 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9346
## F-statistic: 1571 on 6 and 653 DF, p-value: < 2.2e-16
vif(MLR_model2)
## R RA SHO SV IPouts AB
## 1.638348 3.038943 1.905124 1.581730 2.313793 2.692953
tidy(MLR_model2,conf.int = TRUE)
## # A tibble: 7 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.7 14.0 0.978 3.29e- 1 -13.8 41.2
## 2 R 0.0937 0.00186 50.5 1.79e-227 0.0900 0.0973
## 3 RA -0.0703 0.00234 -30.1 1.60e-125 -0.0749 -0.0657
## 4 SHO 0.203 0.0418 4.84 1.62e- 6 0.120 0.285
## 5 SV 0.396 0.0206 19.2 1.73e- 65 0.356 0.437
## 6 IPouts 0.0212 0.00418 5.07 5.30e- 7 0.0130 0.0294
## 7 AB -0.0108 0.00233 -4.64 4.19e- 6 -0.0154 -0.00624
Interpretation: In summary, this regression model suggests that the independent variables R, RA, SHO, SV, IPouts and AB are all significantly associated with the number of wins in baseball games. The model has a high R-squared value (0.9352), indicating that it explains a large proportion of the variance in W.
paste("BIC of Model 1",BIC(MLR_model1))
## [1] "BIC of Model 1 3426.71585363651"
paste("BIC of Model 2",BIC(MLR_model2))
## [1] "BIC of Model 2 3402.29040494585"
Interpretation: Comparing the BIC values of the two models, we observe that Model 2 has a lower BIC value (3402.3) compared to Model 1 (3426.7). According to the BIC criterion, lower values indicate a better trade-off between model fit and complexity. Therefore, based on the BIC values alone, Model 2 appears to be a preferable model as it achieves a better balance between goodness of fit and parsimony.
Diagnostic Plots
gg_diagnose(MLR_model2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
ggres <- gg_resfitted(MLR_model2) +
geom_smooth(se=FALSE)+
theme_economist()
ggplotly(ggres)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Interpretation: The points in the plot exhibit a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is not violated. In other words, there is no discernible pattern in the residuals as the fitted values change. The residuals should be independent of each other. In other words, there is no pattern or correlation between consecutive residuals.
ggqq <- gg_qqplot(MLR_model2)+
theme_economist()
ggplotly(ggqq)
Interpretation: Normal Q-Q plot shows points closely following the diagonal line, indicating that the data (residuals) is approximately normally distributed
gg_resX(MLR_model2)
Interpretation: The plot helps to assess the assumption of linearity between the predictor variable (X) and the response variable. The points in the above plot exhibiting a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is reasonable.
ghist <- gg_reshist(MLR_model2)+
geom_histogram(aes(colorbar='blue'),show.legend = FALSE)+
theme_economist()
## Warning in geom_histogram(aes(colorbar = "blue"), show.legend = FALSE):
## Ignoring unknown aesthetics: colourbar
ggplotly(ghist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: The above histogram of residuals interprets that the data is almost normally distributed.
Reslev <- gg_resleverage(MLR_model2)+
theme_economist()
ggplotly(Reslev)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: This plot helps us in identifying the influential points in the dataset. Look for the points which have high leverage and high residual.
Cook’s Distance - Used to identify the influential points.
Cookd <- gg_cooksd(MLR_model2, threshold = 'matlab') +
theme_economist()
ggplotly(Cookd)
Interpretation: A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential data points that may disproportionately affect the estimation of regression coefficients.
Higher values of Cook’s Distance indicate greater influence of the corresponding data point on the regression model.
Observations with Cook’s Distance values exceeding the threshold are highlighted in the plot. These points are considered potentially influential and may significantly impact the estimated regression coefficients if removed from the analysis.
High Cook’s Distance values indicate that removing the corresponding observation from the dataset could lead to substantial changes in the estimated regression coefficients.
Lets try to predict the Tampa Bay Rays 2023 season wins using the multiple linear model we built:
Before that lets try to visualize the scoring patterns of the Tampa Bay Rays and New York Yankees (One of the most successful franchises in the major league baseball).
anim <- lahman_data |>
filter(franchID=='TBD' | franchID == 'NYY')|>
ggplot(aes(x=yearID,y=R,color=franchID))+
geom_line()+
geom_point()+
labs(title = "Runs Scoring pattern from 2000-2022 season",x="Season",y="Runs Scored")+
transition_reveal(yearID)+
theme_economist()
anim
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
#anim_save("anim.gif", anim)
Team | Season | At Bats | Runs Scored | Runs Allowed | Shutouts | Saves | IPouts |
---|---|---|---|---|---|---|---|
Tampa Bay Rays | 2023 | 5511 | 860 | 665 | 14 | 45 | 4326 |
Predicted Wins=13.691335+0.093685×R−0.070289×RA+0.202505×SHO+0.396055×SV+0.021186×IPouts−0.010824×AB
Predicted Wins = 96.7.
Actual Wins = 99.
The present study delves into the realm of baseball, with a focus on predicting Major League Baseball wins through the utilization of a Multiple Linear Regression model. The primary aim of this endeavor is to provide teams and managers with insights into the key variables that significantly influence game outcomes.
It has been determined that among various game-related statistics, such as Runs Scored, Runs Allowed, Shutouts, Saves, and IPouts, the latter hold significant sway in determining the outcome of matches. While Runs Scored reflects the offensive prowess of a team, Runs Allowed, Shutouts, Saves, and IPouts shed light on defensive and pitching capabilities.
The constructed model demonstrates an impressive explanatory power, accounting for 93.5% of the variance in wins, utilizing the aforementioned game-related statistics.
Moving forward, there exists ample opportunity for further exploration within the domain of baseball analytics, also known as Sabermetrics. While the current study primarily relies on box-score related statistics, the burgeoning advancements in sports technology present a wealth of possibilities for enhancing our understanding of the game and refining the accuracy and efficacy of predictions.