Question: Can we predict an MLB team’s attendance based on their team statistics and player performance?
Modeling Objective: Create a predictive model for home attendance using team statistics from the MLB dataset.
The dataset that I chose (mlb_teams.csv) includes multiple team statistics across MLB seasons, including wins, runs, home runs, ERA, walks, stolen bases, and more. https://www.openintro.org/data/index.php?data=mlb_teams
Response Variable (Y): home_attendance – total home game attendance for the team in a season.
Predictor Variables (X) : team statistic variables such as wins, runs scored, homeruns, strikeouts, shutouts, and more.
Imported the necessary libraries and removed all NA values and associated rows.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.3.0
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.9 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
## ✔ parsnip 1.3.2 ✔ yardstick 1.3.2
## ✔ recipes 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-10
Exploratory Data Analysis:
Shown from the summary, there is a wide range of fans attending games because this data dates back to the late 1800s where attendance was heavily influenced by stadium capacity, the limited amount of technology, the mainstream rise of baseball, and historic events such as wars and economic depressions. This caused the early years of baseball to smaller amounts of fans in attendance during games. Over time, attendance trends shifted dramatically due to multiple reasons: better stadiums were built that could hold more people, expansion of teams in the league, and improvements to media coverage.
Another trend that is apparent is the positive relationship between team performance and attendance. Teams that recorded a higher amount of wins, homeruns, and runs scored saw greater fan turnout. This suggests that fans are more likely to attend games when their teams are statistically doing good. However, the scatter plots show a lot of variance, indicating that these statistics don’t fully explain home attendance. This is also shown through the R-square values of each of the linear models. Each of the linear regression models show a low amount of R-squared values, ranging in the low 20%s, indicating that the model can only explain a small amount of the variance in attendance. This suggestions that other variables are a big role in attendance.
I also created a histogram of home attendance variable to see the distribution of the values. The histogram showed a slightly right-skewed distribution, with most teams hovering around the 2,000,000 value. This shows us that while attendance is usually consistent, some teams draw more fans while some draw less fans.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 306763 1458100 2016298 2060766 2617666 4483350
options(scipen = 999)
# Linear regression of Wins vs home attendance
ggplot(mlb, aes(x = wins, y = home_attendance)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Wins vs Attendance")## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = home_attendance ~ wins, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1851302 -515440 -41926 480112 2802149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -324174 122914 -2.637 0.00845 **
## wins 29931 1524 19.635 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 700400 on 1382 degrees of freedom
## Multiple R-squared: 0.2181, Adjusted R-squared: 0.2176
## F-statistic: 385.5 on 1 and 1382 DF, p-value: < 0.00000000000000022
# Linear regression Runs scored vs home attendance
ggplot(mlb, aes(x = runs_scored, y = home_attendance)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Runs Scored vs Attendance")## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = home_attendance ~ runs_scored, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1771225 -518874 -57566 462091 2248917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -647121.7 138009.7 -4.689 0.00000302 ***
## runs_scored 3801.5 191.9 19.805 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 699100 on 1382 degrees of freedom
## Multiple R-squared: 0.2211, Adjusted R-squared: 0.2205
## F-statistic: 392.3 on 1 and 1382 DF, p-value: < 0.00000000000000022
# Linear regression of Homeruns vs home attendance
ggplot(mlb, aes(x = homeruns, y = home_attendance)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Earned Run Average vs Attendance")## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = home_attendance ~ homeruns, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1741228 -527856 -69827 481462 2493544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 764885.7 68628.2 11.14 <0.0000000000000002 ***
## homeruns 8626.2 439.3 19.64 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 700400 on 1382 degrees of freedom
## Multiple R-squared: 0.2181, Adjusted R-squared: 0.2176
## F-statistic: 385.6 on 1 and 1382 DF, p-value: < 0.00000000000000022
# Histogram of home attendance
ggplot(mlb, aes(x = home_attendance)) +
geom_histogram(binwidth = 500000, color = "white") +
labs(title = "Distribution of Home Attendance",
x = "Home Attendance",
y = "Frequency") +
theme_minimal()Feature Engineering:
I created two new variables to better measure team performance:
win percentage
runs per game
These features help compare teams more fairly across different seasons and makes the model more accurate.
Model Selection:
I chose linear regression as my primary model since the outcome, home attendance, is continuous, making linear regression the most natural starting point. Linear regression allows us to understand the direction and the strength of the relationships between team statistics and home attendance easily.
Model Training and Evaluation:
To find how effective the linear regression model was at predicting, I split the dataset into two categories: training and test sets, with the training set having 80% of the data while the test set having 20% of the data. I then put the variables wins, homeruns, runs scored, and hits as predictor variables and normalized them. After finding the predicted values using augment(), I found the RMSE, MAE, and R-squared valued it showed that the model wasn’t highly accurate, with the RMSE and MAE value fairly high with the R-squared value fairly low. I also graphed the model to see the the predicted values versus the actual values, and it showed that there was still a high variance.
set.seed(1)
mlb_split <- mlb %>% initial_split(prop = 0.80)
mlb_train <- training(mlb_split)
mlb_test <- testing(mlb_split)
mlb_recipe <- recipe(home_attendance ~ wins + homeruns + runs_scored + hits, data = mlb_train) %>%
step_normalize(all_predictors())
model <- linear_reg()
mlb_wflow <- workflow() %>%
add_model(model) %>%
add_recipe(mlb_recipe)
mlb_fit <- fit(mlb_wflow, data = mlb_train)
results <- augment(mlb_fit, new_data = mlb_test)
results %>% # high variance shown within the graph
ggplot(aes(x = home_attendance, y = .pred)) +
geom_abline(lty = 2, color="red") +
geom_point(alpha = 0.7, color="blue") +
labs(title = "Predicted versus actual values of home attendance",
x = "Home Attendance",
y = "Predicted Home Attendance") +
coord_obs_pred()mlb_metrics <- metric_set(rmse, rsq, mae)
mlb_metrics(results, truth = home_attendance, estimate = .pred)## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 663592.
## 2 rsq standard 0.358
## 3 mae standard 525000.
Model Tuning:
I used Lasso Regression and Cross-Validation to identify the most important predictors of home attendance. Lasso regression applies an L1 penalty, shrinking the less important coefficients towards zero, sometimes setting them to exactly zero. The results showed that the strongest predictors included wins, runs_per_game, hits, homeruns, and strikeouts_by_pitcher. This allowed me to tune the parameters for the final model.
set.seed(1)
X <- model.matrix(home_attendance ~ wins + homeruns + runs_scored + hits + strikeouts_by_pitchers + doubles + triples + walks + stolen_bases + complete_games + shutouts + saves + double_plays + earned_run_average + win_pct + runs_per_game, data = mlb)[, -1]
y <- mlb$home_attendance
cv_model <- cv.glmnet(X, y, alpha = 1)
best_lambda <- cv_model$lambda.min
best_lambda## [1] 55.90864
X_scaled <- scale(X)
lasso_model <- glmnet(X_scaled, y, alpha = 1, lambda = best_lambda, standardize = TRUE)
coef(lasso_model)## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2060766.412
## wins 369544.795
## homeruns 101642.569
## runs_scored -420864.914
## hits 238744.415
## strikeouts_by_pitchers 71214.338
## doubles -8748.874
## triples -25113.242
## walks 30537.232
## stolen_bases 18211.341
## complete_games -335165.361
## shutouts 54605.059
## saves -82358.360
## double_plays -45687.451
## earned_run_average 64326.884
## win_pct 35652.664
## runs_per_game 224813.180
Final Model:
The final model assumes that the relationship between the team statistics and home attendance is linear and that the points are independent of each other.
options(scipen = 999)
set.seed(1)
mlb_split <- mlb %>% initial_split(prop = 0.80)
mlb_train <- training(mlb_split)
mlb_test <- testing(mlb_split)
mlb_recipe <- recipe(home_attendance ~ wins + runs_per_game + hits + homeruns + strikeouts_by_pitchers, data = mlb_train) %>% step_normalize(all_predictors())
model <- linear_reg()
mlb_wflow <- workflow() %>%
add_model(model) %>%
add_recipe(mlb_recipe)
mlb_fit <- fit(mlb_wflow, data = mlb_train)
results <- augment(mlb_fit, new_data = mlb_test)
results %>% # high variance shown within the graph
ggplot(aes(x = home_attendance, y = .pred)) +
geom_abline(lty = 2, color="red") +
geom_point(alpha = 0.7, color="blue") +
labs(title = "Final Model: Predicted versus actual values of \nhome attendance",
x = "Home Attendance",
y = "Predicted Home Attendance") +
coord_obs_pred()mlb_metrics <- metric_set(rmse, rsq, mae)
mlb_metrics(results, truth = home_attendance, estimate = .pred)## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 619372.
## 2 rsq standard 0.439
## 3 mae standard 496004.
Interpretations:
Overall, the model’s performance was still not strong enough to be considered highly predictive. This indicates that while team statistics can help explain some of the variation in home attendance, other external factors such as weather, star players, promotions, market size, and rivalry games likely play a big role in home attendance.