MLB Prediction Analysis

Question: Can we predict an MLB team’s attendance based on their team statistics and player performance?

Modeling Objective: Create a predictive model for home attendance using team statistics from the MLB dataset.

The dataset that I chose (mlb_teams.csv) includes multiple team statistics across MLB seasons, including wins, runs, home runs, ERA, walks, stolen bases, and more. https://www.openintro.org/data/index.php?data=mlb_teams

Response Variable (Y): home_attendance – total home game attendance for the team in a season.

Predictor Variables (X) : team statistic variables such as wins, runs scored, homeruns, strikeouts, shutouts, and more.

Imported the necessary libraries and removed all NA values and associated rows.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.9     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.1
## ✔ parsnip      1.3.2     ✔ yardstick    1.3.2
## ✔ recipes      1.3.1     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-10

tidymodels::tidymodels_prefer() 

mlbdata <- read.csv("mlb_teams.csv")

mlb <- na.omit(mlbdata)

Exploratory Data Analysis:

Shown from the summary, there is a wide range of fans attending games because this data dates back to the late 1800s where attendance was heavily influenced by stadium capacity, the limited amount of technology, the mainstream rise of baseball, and historic events such as wars and economic depressions. This caused the early years of baseball to smaller amounts of fans in attendance during games. Over time, attendance trends shifted dramatically due to multiple reasons: better stadiums were built that could hold more people, expansion of teams in the league, and improvements to media coverage.

Another trend that is apparent is the positive relationship between team performance and attendance. Teams that recorded a higher amount of wins, homeruns, and runs scored saw greater fan turnout. This suggests that fans are more likely to attend games when their teams are statistically doing good. However, the scatter plots show a lot of variance, indicating that these statistics don’t fully explain home attendance. This is also shown through the R-square values of each of the linear models. Each of the linear regression models show a low amount of R-squared values, ranging in the low 20%s, indicating that the model can only explain a small amount of the variance in attendance. This suggestions that other variables are a big role in attendance.

I also created a histogram of home attendance variable to see the distribution of the values. The histogram showed a slightly right-skewed distribution, with most teams hovering around the 2,000,000 value. This shows us that while attendance is usually consistent, some teams draw more fans while some draw less fans.

summary(mlb$home_attendance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  306763 1458100 2016298 2060766 2617666 4483350

options(scipen = 999)

# Linear regression of Wins vs home attendance
ggplot(mlb, aes(x = wins, y = home_attendance)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Wins vs Attendance")

## `geom_smooth()` using formula = 'y ~ x'

model1 <- lm(home_attendance ~ wins, data = mlb)
summary(model1)

## 
## Call:
## lm(formula = home_attendance ~ wins, data = mlb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1851302  -515440   -41926   480112  2802149 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -324174     122914  -2.637              0.00845 ** 
## wins           29931       1524  19.635 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 700400 on 1382 degrees of freedom
## Multiple R-squared:  0.2181, Adjusted R-squared:  0.2176 
## F-statistic: 385.5 on 1 and 1382 DF,  p-value: < 0.00000000000000022

# Linear regression Runs scored vs home attendance
ggplot(mlb, aes(x = runs_scored, y = home_attendance)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Runs Scored vs Attendance")

## `geom_smooth()` using formula = 'y ~ x'

model2 <- lm(home_attendance ~ runs_scored, data = mlb)
summary(model2)

## 
## Call:
## lm(formula = home_attendance ~ runs_scored, data = mlb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1771225  -518874   -57566   462091  2248917 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -647121.7   138009.7  -4.689           0.00000302 ***
## runs_scored    3801.5      191.9  19.805 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 699100 on 1382 degrees of freedom
## Multiple R-squared:  0.2211, Adjusted R-squared:  0.2205 
## F-statistic: 392.3 on 1 and 1382 DF,  p-value: < 0.00000000000000022

# Linear regression of Homeruns vs home attendance
ggplot(mlb, aes(x = homeruns, y = home_attendance)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Earned Run Average vs Attendance")

## `geom_smooth()` using formula = 'y ~ x'

model3 <- lm(home_attendance ~ homeruns, data = mlb)
summary(model3)

## 
## Call:
## lm(formula = home_attendance ~ homeruns, data = mlb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1741228  -527856   -69827   481462  2493544 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 764885.7    68628.2   11.14 <0.0000000000000002 ***
## homeruns      8626.2      439.3   19.64 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 700400 on 1382 degrees of freedom
## Multiple R-squared:  0.2181, Adjusted R-squared:  0.2176 
## F-statistic: 385.6 on 1 and 1382 DF,  p-value: < 0.00000000000000022

# Histogram of home attendance
ggplot(mlb, aes(x = home_attendance)) +
  geom_histogram(binwidth = 500000, color = "white") +
  labs(title = "Distribution of Home Attendance",
       x = "Home Attendance",
       y = "Frequency") +
  theme_minimal()

Feature Engineering:

I created two new variables to better measure team performance:

win percentage
runs per game

These features help compare teams more fairly across different seasons and makes the model more accurate.

mlb$win_pct = mlb$wins / (mlb$wins+mlb$losses)
mlb$runs_per_game = mlb$runs_scored/mlb$games_played

Model Selection:

I chose linear regression as my primary model since the outcome, home attendance, is continuous, making linear regression the most natural starting point. Linear regression allows us to understand the direction and the strength of the relationships between team statistics and home attendance easily.

Model Training and Evaluation:

To find how effective the linear regression model was at predicting, I split the dataset into two categories: training and test sets, with the training set having 80% of the data while the test set having 20% of the data. I then put the variables wins, homeruns, runs scored, and hits as predictor variables and normalized them. After finding the predicted values using augment(), I found the RMSE, MAE, and R-squared valued it showed that the model wasn’t highly accurate, with the RMSE and MAE value fairly high with the R-squared value fairly low. I also graphed the model to see the the predicted values versus the actual values, and it showed that there was still a high variance.

set.seed(1)

mlb_split <- mlb %>% initial_split(prop = 0.80)
mlb_train <- training(mlb_split)
mlb_test <- testing(mlb_split)

mlb_recipe <- recipe(home_attendance ~ wins + homeruns + runs_scored + hits, data = mlb_train) %>%
  step_normalize(all_predictors())

model <- linear_reg()

mlb_wflow <- workflow() %>%
  add_model(model) %>%
  add_recipe(mlb_recipe)

mlb_fit <- fit(mlb_wflow, data = mlb_train)

results <- augment(mlb_fit, new_data = mlb_test)

results %>% # high variance shown within the graph
  ggplot(aes(x = home_attendance, y = .pred)) +
  geom_abline(lty = 2, color="red") +
  geom_point(alpha = 0.7, color="blue") +
  labs(title = "Predicted versus actual values of home attendance", 
       x = "Home Attendance", 
       y = "Predicted Home Attendance") +
  coord_obs_pred()

mlb_metrics <- metric_set(rmse, rsq, mae)
mlb_metrics(results, truth = home_attendance, estimate = .pred)

## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   663592.   
## 2 rsq     standard        0.358
## 3 mae     standard   525000.

Model Tuning:

I used Lasso Regression and Cross-Validation to identify the most important predictors of home attendance. Lasso regression applies an L1 penalty, shrinking the less important coefficients towards zero, sometimes setting them to exactly zero. The results showed that the strongest predictors included wins, runs_per_game, hits, homeruns, and strikeouts_by_pitcher. This allowed me to tune the parameters for the final model.

set.seed(1)

X <- model.matrix(home_attendance ~ wins + homeruns + runs_scored + hits + strikeouts_by_pitchers + doubles + triples + walks + stolen_bases + complete_games + shutouts + saves + double_plays + earned_run_average + win_pct + runs_per_game, data = mlb)[, -1] 
y <- mlb$home_attendance

cv_model <- cv.glmnet(X, y, alpha = 1)

best_lambda <- cv_model$lambda.min
best_lambda

## [1] 55.90864

X_scaled <- scale(X)
lasso_model <- glmnet(X_scaled, y, alpha = 1, lambda = best_lambda, standardize = TRUE)

coef(lasso_model)

## 17 x 1 sparse Matrix of class "dgCMatrix"
##                                 s0
## (Intercept)            2060766.412
## wins                    369544.795
## homeruns                101642.569
## runs_scored            -420864.914
## hits                    238744.415
## strikeouts_by_pitchers   71214.338
## doubles                  -8748.874
## triples                 -25113.242
## walks                    30537.232
## stolen_bases             18211.341
## complete_games         -335165.361
## shutouts                 54605.059
## saves                   -82358.360
## double_plays            -45687.451
## earned_run_average       64326.884
## win_pct                  35652.664
## runs_per_game           224813.180

Final Model:

The final model assumes that the relationship between the team statistics and home attendance is linear and that the points are independent of each other.

options(scipen = 999)

set.seed(1)
mlb_split <- mlb %>% initial_split(prop = 0.80)
mlb_train <- training(mlb_split)
mlb_test <- testing(mlb_split)

mlb_recipe <- recipe(home_attendance ~ wins + runs_per_game + hits + homeruns + strikeouts_by_pitchers, data = mlb_train) %>%     step_normalize(all_predictors())

model <- linear_reg() 

mlb_wflow <- workflow() %>%
  add_model(model) %>%
  add_recipe(mlb_recipe)

mlb_fit <- fit(mlb_wflow, data = mlb_train)

results <- augment(mlb_fit, new_data = mlb_test)


results %>% # high variance shown within the graph
  ggplot(aes(x = home_attendance, y = .pred)) +
  geom_abline(lty = 2, color="red") +
  geom_point(alpha = 0.7, color="blue") +
  labs(title = "Final Model: Predicted versus actual values of \nhome attendance", 
       x = "Home Attendance", 
       y = "Predicted Home Attendance") +
  coord_obs_pred()

mlb_metrics <- metric_set(rmse, rsq, mae)
mlb_metrics(results, truth = home_attendance, estimate = .pred)

## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   619372.   
## 2 rsq     standard        0.439
## 3 mae     standard   496004.

Interpretations:

Overall, the model’s performance was still not strong enough to be considered highly predictive. This indicates that while team statistics can help explain some of the variation in home attendance, other external factors such as weather, star players, promotions, market size, and rivalry games likely play a big role in home attendance.