TARGET_WINS       TEAM_BATTING_H    TEAM_BATTING_2B   TEAM_BATTING_3B
Min.   :  0.00    Min.   : 891      Min.   : 69.0     Min.   :  0.00
1st Qu.: 71.00    1st Qu.:1383      1st Qu.:208.0     1st Qu.: 34.00
Median : 82.00    Median :1454      Median :238.0     Median : 47.00
Mean   : 80.79    Mean   :1469      Mean   :241.2     Mean   : 55.25
3rd Qu.: 92.00    3rd Qu.:1537      3rd Qu.:273.0     3rd Qu.: 72.00
Max.   :146.00    Max.   :2554      Max.   :458.0     Max.   :223.00
TEAM_BATTING_HR   TEAM_BATTING_BB   TEAM_BATTING_SO   TEAM_BASERUN_SB
Min.   :  0.00    Min.   :  0.0     Min.   :   0.0    Min.   :  0.0
1st Qu.: 42.00    1st Qu.:451.0     1st Qu.: 548.0    1st Qu.: 66.0
Median :102.00    Median :512.0     Median : 750.0    Median :101.0
Mean   : 99.61    Mean   :501.6     Mean   : 735.6    Mean   :124.8
3rd Qu.:147.00    3rd Qu.:580.0     3rd Qu.: 930.0    3rd Qu.:156.0
Max.   :264.00    Max.   :878.0     Max.   :1399.0    Max.   :697.0
                                    NA's   :102       NA's   :131
TEAM_BASERUN_CS   TEAM_BATTING_HBP  TEAM_PITCHING_H   TEAM_PITCHING_HR
Min.   :  0.0     Min.   :29.00     Min.   : 1137     Min.   :  0.0
1st Qu.: 38.0     1st Qu.:50.50     1st Qu.: 1419     1st Qu.: 50.0
Median : 49.0     Median :58.00     Median : 1518     Median :107.0
Mean   : 52.8     Mean   :59.36     Mean   : 1779     Mean   :105.7
3rd Qu.: 62.0     3rd Qu.:67.00     3rd Qu.: 1682     3rd Qu.:150.0
Max.   :201.0     Max.   :95.00     Max.   :30132     Max.   :343.0
NA's   :772       NA's   :2085
TEAM_PITCHING_BB  TEAM_PITCHING_SO  TEAM_FIELDING_E   TEAM_FIELDING_DP
Min.   :   0.0    Min.   :    0.0   Min.   :  65.0    Min.   : 52.0
1st Qu.: 476.0    1st Qu.:  615.0   1st Qu.: 127.0    1st Qu.:131.0
Median : 536.5    Median :  813.5   Median : 159.0    Median :149.0
Mean   : 553.0    Mean   :  817.7   Mean   : 246.5    Mean   :146.4
3rd Qu.: 611.0    3rd Qu.:  968.0   3rd Qu.: 249.2    3rd Qu.:164.0
Max.   :3645.0    Max.   :19278.0   Max.   :1898.0    Max.   :228.0
                  NA's   :102                         NA's   :286
1.3 Missing Data Visualization
vis_dat(train)
The training dataset contains 2,276 observations and 16 variables, all stored as integers. The vis_dat plot shows that most variables are complete, but six have missing values that must be addressed before modeling. The most problematic is TEAM_BATTING_HBP (batters hit by pitch), which is missing for the vast majority of observations and will likely need to be dropped rather than imputed. TEAM_BASERUN_CS (caught stealing) shows substantial missingness, while TEAM_FIELDING_DP (double plays), TEAM_BASERUN_SB (stolen bases), TEAM_BATTING_SO (strikeouts by batters), and TEAM_PITCHING_SO (strikeouts by pitchers) show smaller but still notable gaps. All other variables are fully populated. These missing values are unlikely to be random; for example, stolen base statistics may not have been recorded in earlier eras of baseball, which affects how they should be imputed. The missingness patterns are addressed systematically in the Data Preparation section.
Six variables in the training dataset contain missing values. The most severely affected is TEAM_BATTING_HBP (batters hit by pitch), missing 91.6% of observations, making it analytically unusable and a candidate for exclusion from modeling. TEAM_BASERUN_CS (caught stealing) follows with 33.9% missing, while TEAM_FIELDING_DP (double plays), TEAM_BASERUN_SB (stolen bases), TEAM_BATTING_SO, and TEAM_PITCHING_SO have more modest missingness ranging from 4.5% to 12.6%. The missing baserunning statistics likely reflect incomplete record-keeping in earlier eras of professional baseball rather than true random missingness. All variables except TEAM_BATTING_HBP will be addressed through median imputation in the Data Preparation section, with binary flag variables created to retain any signal in the missingness.
# Bar chart of missing counts
missing_counts <- train %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing") %>%
  filter(missing > 0) %>%
  arrange(desc(missing))

ggplot(missing_counts, aes(x = reorder(variable, -missing), y = missing)) +
  geom_col(fill = "steelblue") +
  labs(title = "Missing Values by Variable", x = "Variable", y = "# Missing") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
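The missing percentages quoted above can be derived directly from the NA counts. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins for `train`), showing one way to compute the share of missing values per column.

```r
# Toy stand-in for `train`; the values are invented for illustration only
toy <- data.frame(
  TEAM_BATTING_SO = c(500, NA, 700, 650),
  TEAM_BASERUN_CS = c(NA, NA, 40, 55),
  TEAM_BATTING_H  = c(1400, 1450, 1500, 1380)
)

# is.na() on a data frame returns a logical matrix, so colMeans() gives
# the fraction of missing values in each column
pct_missing <- colMeans(is.na(toy)) * 100

# Keep only columns with at least one NA, largest share first
pct_missing <- sort(pct_missing[pct_missing > 0], decreasing = TRUE)
print(round(pct_missing, 1))
```

Applied to the real data, the same expression on `train` should reproduce figures such as 2085 / 2276, which is about 91.6% for TEAM_BATTING_HBP.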
1.4 Histograms of all variables
train %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Distribution of All Variables") +
  theme_minimal()
Warning: Removed 3478 rows containing non-finite outside the scale range
(`stat_bin()`).
Examining the distributions reveals several important patterns. TARGET_WINS is approximately normally distributed and centered around 80 wins, consistent with a 162-game season in which teams cluster near 0.500. Most batting variables, such as TEAM_BATTING_H, TEAM_BATTING_2B, and TEAM_BATTING_BB, are also roughly normal, suggesting well-behaved predictors. However, several variables display notable right skew and extreme outliers. TEAM_PITCHING_H and TEAM_PITCHING_BB extend far to the right, with maxima of 30,132 and 3,645 respectively; these values are clearly unrealistic for a 162-game season and are likely data entry errors or artifacts of early baseball records. Similarly, TEAM_FIELDING_E and TEAM_PITCHING_SO show heavy right tails, with maxima of 1,898 and 19,278. TEAM_BATTING_HBP confirms its near-total missingness, with only a narrow band of observations visible. The extreme outliers in the pitching and fielding variables will need to be capped or winsorized during data preparation to prevent them from unduly influencing the regression models.
The correlation analysis shows that TEAM_BATTING_H (base hits) has the strongest positive relationship with TARGET_WINS at 0.389, followed by TEAM_BATTING_2B (0.289) and TEAM_BATTING_BB (0.233), suggesting that offensive production is the primary driver of wins. TEAM_FIELDING_E has the strongest negative correlation at -0.176, consistent with the expectation that errors hurt a team’s performance. Notably, TEAM_PITCHING_HR shows a counterintuitive positive correlation of 0.189. Because a pairwise correlation involves only two variables, this more likely reflects confounding (for example, eras in which home runs both hit and allowed were high) than a true relationship. Overall, the moderate correlation magnitudes suggest no single variable dominates, reinforcing the need for a multivariate modeling approach. The correlation matrix also highlights strong relationships among pitching variables, raising multicollinearity concerns to be addressed during model building.
1.6 Full correlation matrix
cor_matrix <- cor(train, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color", type = "upper",
         tl.cex = 0.7, title = "Correlation Matrix", mar = c(0, 0, 1, 0))
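The ranking of predictors by their correlation with the target can be read straight off one column of the matrix. Below is a small base R sketch on simulated data; the data frame `toy` and its columns `hits`, `errors`, and `wins` are hypothetical and only illustrate the mechanics.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the training data (all values are made up)
toy <- data.frame(
  hits   = rnorm(n, 1450, 100),
  errors = rnorm(n, 160, 40)
)
toy$wins <- 0.05 * toy$hits - 0.1 * toy$errors + rnorm(n, 0, 8)
toy$hits[1:10] <- NA  # introduce some missingness

# Pairwise-complete correlations, as in the report's cor() call
cor_matrix <- cor(toy, use = "pairwise.complete.obs")

# Take the target column, drop the target's correlation with itself,
# and order the rest by absolute magnitude
target_cor <- cor_matrix[, "wins"]
target_cor <- target_cor[names(target_cor) != "wins"]
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Note that with `use = "pairwise.complete.obs"` each correlation may be computed on a different subset of rows, so entries of the matrix are not strictly comparable when missingness differs sharply between columns.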
TARGET_WINS TEAM_BATTING_1B TEAM_BATTING_2B TEAM_BATTING_3B
Min. : 0.00 Min. : 709.0 Min. : 69.0 Min. : 0.00
1st Qu.: 71.00 1st Qu.: 990.8 1st Qu.:208.0 1st Qu.: 34.00
Median : 82.00 Median :1050.0 Median :238.0 Median : 47.00
Mean : 80.79 Mean :1073.2 Mean :241.2 Mean : 55.25
3rd Qu.: 92.00 3rd Qu.:1129.0 3rd Qu.:273.0 3rd Qu.: 72.00
Max. :146.00 Max. :2112.0 Max. :458.0 Max. :223.00
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8 1st Qu.: 67.0
Median :102.00 Median :512.0 Median : 750.0 Median :101.0
Mean : 99.61 Mean :501.6 Mean : 736.3 Mean :123.4
3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:151.0
Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_PITCHING_H TEAM_PITCHING_HR
Min. : 65.0 Min. : 52.0 Min. :1137 Min. : 0.0
1st Qu.: 127.0 1st Qu.:134.0 1st Qu.:1419 1st Qu.: 50.0
Median : 159.0 Median :149.0 Median :1518 Median :107.0
Mean : 244.0 Mean :146.7 Mean :1716 Mean :105.7
3rd Qu.: 249.2 3rd Qu.:161.2 3rd Qu.:1682 3rd Qu.:150.0
Max. :1228.0 Max. :228.0 Max. :7054 Max. :343.0
TEAM_PITCHING_BB TEAM_PITCHING_SO BATTING_RATIO WHIP_PROXY
Min. : 0.0 Min. : 0.0 Min. :0.4962 Min. : 9.469
1st Qu.:476.0 1st Qu.: 626.0 1st Qu.:0.6057 1st Qu.:11.969
Median :536.5 Median : 813.5 Median :0.6525 Median :12.802
Mean :547.0 Mean : 798.7 Mean :0.6720 Mean :13.968
3rd Qu.:611.0 3rd Qu.: 957.0 3rd Qu.:0.7283 3rd Qu.:13.995
Max. :921.0 Max. :1461.8 Max. :1.0000 Max. :49.228
Several preparation steps were applied before modeling. Binary flag variables were created for the columns with meaningful missingness (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_SO, and TEAM_FIELDING_DP) to preserve any signal in the missingness itself. TEAM_BATTING_HBP was dropped entirely given its 91.6% missingness. All remaining missing values were imputed using column medians, and winsorization at the 99th percentile was applied to TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E to address the extreme outliers identified during exploration. Three derived variables were also created: TEAM_BATTING_1B (singles, calculated by subtracting extra-base hits from total hits), BATTING_RATIO (hits divided by hits plus strikeouts, as a batting efficiency proxy), and WHIP_PROXY (walks plus hits allowed per game, approximating the standard pitching metric). After preparation, the dataset shows a mean of approximately 81 wins and the most extreme values have been pulled in; for example, the maximum of TEAM_PITCHING_H falls from 30,132 to 7,054. Some capped maxima remain high for a 162-game season, but the data is otherwise ready for modeling.
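The preparation steps described above can be sketched in base R as follows. This is an illustration on a made-up data frame, not the report's actual code: `toy` and its values are hypothetical, and dividing by 162 games in WHIP_PROXY is an assumption about how "per game" was computed.

```r
# Toy stand-in for `train`; the values are invented for illustration only
toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1500, 1450, 1600, 1380),
  TEAM_BATTING_2B  = c(230, 250, 240, 260, 220),
  TEAM_BATTING_3B  = c(40, 50, 45, 60, 35),
  TEAM_BATTING_HR  = c(100, 120, 110, 150, 90),
  TEAM_BATTING_SO  = c(700, NA, 800, 900, 750),
  TEAM_PITCHING_H  = c(1450, 1500, 1480, 30000, 1420),
  TEAM_PITCHING_BB = c(500, 520, 510, 560, 490)
)

# 1. Flag missingness BEFORE imputing, so the signal is not lost
toy$FLAG_BATTING_SO <- as.integer(is.na(toy$TEAM_BATTING_SO))

# 2. Median imputation
so_median <- median(toy$TEAM_BATTING_SO, na.rm = TRUE)
toy$TEAM_BATTING_SO[is.na(toy$TEAM_BATTING_SO)] <- so_median

# 3. Winsorize the upper tail at the 99th percentile
cap <- quantile(toy$TEAM_PITCHING_H, 0.99, names = FALSE)
toy$TEAM_PITCHING_H <- pmin(toy$TEAM_PITCHING_H, cap)

# 4. Derived variables (computed after imputation and capping)
toy$TEAM_BATTING_1B <- with(toy,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
toy$BATTING_RATIO <- with(toy, TEAM_BATTING_H / (TEAM_BATTING_H + TEAM_BATTING_SO))
toy$WHIP_PROXY <- with(toy, (TEAM_PITCHING_BB + TEAM_PITCHING_H) / 162)  # assumes 162 games

print(toy)
```

The order matters: flags must be created before imputation, and derived ratios are best computed after capping so that they inherit the cleaned values.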
TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
Min. : 819 Min. : 44.0 Min. : 14.00 Min. : 0.00
1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00 1st Qu.: 44.50
Median :1455 Median :239.0 Median : 52.00 Median :101.00
Mean :1469 Mean :241.3 Mean : 55.91 Mean : 95.63
3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00 3rd Qu.:135.50
Max. :2170 Max. :376.0 Max. :155.00 Max. :242.00
TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
Min. : 15.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
1st Qu.:436.5 1st Qu.: 565.0 1st Qu.: 60.5 1st Qu.: 44.00
Median :509.0 Median : 686.0 Median : 92.0 Median : 49.50
Mean :499.0 Mean : 707.7 Mean :122.1 Mean : 51.37
3rd Qu.:565.5 3rd Qu.: 904.5 3rd Qu.:149.0 3rd Qu.: 56.00
Max. :792.0 Max. :1268.0 Max. :580.0 Max. :154.00
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
Min. :42.00 Min. :1155 Min. : 0.0 Min. : 136.0
1st Qu.:62.00 1st Qu.:1426 1st Qu.: 52.0 1st Qu.: 471.0
Median :62.00 Median :1515 Median :104.0 Median : 526.0
Mean :62.03 Mean :1744 Mean :102.1 Mean : 545.5
3rd Qu.:62.00 3rd Qu.:1681 3rd Qu.:142.5 3rd Qu.: 606.5
Max. :96.00 Max. :8817 Max. :336.0 Max. :1131.0
TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP FLAG_BATTING_SO
Min. : 0.0 Min. : 73.0 Min. : 69.0 Min. :0.0000
1st Qu.: 622.5 1st Qu.: 131.0 1st Qu.:134.5 1st Qu.:0.0000
Median : 745.0 Median : 163.0 Median :148.0 Median :0.0000
Mean : 761.6 Mean : 247.5 Mean :146.3 Mean :0.0695
3rd Qu.: 927.5 3rd Qu.: 252.0 3rd Qu.:160.5 3rd Qu.:0.0000
Max. :1279.3 Max. :1239.5 Max. :204.0 Max. :1.0000
FLAG_BASERUN_SB FLAG_BASERUN_CS FLAG_PITCHING_SO FLAG_FIELDING_DP
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.05019 Mean :0.3359 Mean :0.0695 Mean :0.1197
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
TEAM_BATTING_1B BATTING_RATIO WHIP_PROXY
Min. : 657.0 Min. :0.4252 Min. : 9.34
1st Qu.: 990.5 1st Qu.:0.6144 1st Qu.:11.99
Median :1059.0 Median :0.6751 Median :12.64
Mean :1076.5 Mean :0.6803 Mean :14.13
3rd Qu.:1134.0 3rd Qu.:0.7334 3rd Qu.:14.01
Max. :1846.0 Max. :1.0000 Max. :61.41
The same preparation steps applied to the training data were replicated on the evaluation dataset to ensure consistency. All flag variables, median imputations, winsorization, and derived variables were successfully applied. The evaluation dataset shows comparable distributions to the training data, with a median of 1,059 singles, a mean BATTING_RATIO of 0.68, and a mean WHIP_PROXY of 14.13, all consistent with the training set ranges. It is worth noting that TEAM_PITCHING_H and TEAM_FIELDING_E have higher maxima in the evaluation set (8,817 and 1,239.5) than in the prepared training set (7,054 and 1,228). Caps reused from the training data would bound the evaluation maxima at the training values, so this pattern suggests the 99th-percentile caps were recomputed on the evaluation data. Overall the evaluation data is clean, fully imputed, and ready to receive predictions from the final model.
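One common way to keep the two datasets consistent is to learn the imputation medians and winsorization caps on the training data only and then reuse them on the evaluation data. The sketch below illustrates that pattern in base R; `train_toy`, `eval_toy`, `params`, and `prepare()` are hypothetical names, and this is not necessarily how the report's pipeline was implemented.

```r
# Toy stand-ins for the training and evaluation sets (values are made up)
train_toy <- data.frame(
  TEAM_BASERUN_SB = c(60, 100, NA, 150, 90),
  TEAM_FIELDING_E = c(120, 150, 160, 1800, 140)
)
eval_toy <- data.frame(
  TEAM_BASERUN_SB = c(NA, 80, 200),
  TEAM_FIELDING_E = c(130, 2500, 155)
)

# Learn the parameters on the training data only
params <- list(
  sb_median = median(train_toy$TEAM_BASERUN_SB, na.rm = TRUE),
  e_cap     = quantile(train_toy$TEAM_FIELDING_E, 0.99, names = FALSE)
)

# Apply the same learned parameters to any data frame
prepare <- function(df, params) {
  df$FLAG_BASERUN_SB <- as.integer(is.na(df$TEAM_BASERUN_SB))
  df$TEAM_BASERUN_SB[is.na(df$TEAM_BASERUN_SB)] <- params$sb_median
  df$TEAM_FIELDING_E <- pmin(df$TEAM_FIELDING_E, params$e_cap)
  df
}

train_prep <- prepare(train_toy, params)
eval_prep  <- prepare(eval_toy, params)

# With shared caps, the evaluation maximum cannot exceed the training cap
print(c(train_max = max(train_prep$TEAM_FIELDING_E),
        eval_max  = max(eval_prep$TEAM_FIELDING_E)))
```

Under this design the evaluation maxima are bounded by the training caps, which gives a quick check on whether caps were shared or recomputed.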
Model 1 includes all 14 original predictors and serves as a baseline “kitchen sink” model. The model achieves an Adjusted R² of 0.306 and an F-statistic of 72.62, which is highly significant (p < 2.2e-16), confirming that the predictors jointly explain a meaningful portion of the variation in wins. Most coefficients align with theoretical expectations — TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, and TEAM_BASERUN_SB all carry positive and significant coefficients, while TEAM_BATTING_SO, TEAM_FIELDING_E, and TEAM_PITCHING_BB are negative and significant, consistent with their expected negative impact on wins.
However, several counterintuitive results emerge. TEAM_BATTING_2B carries a negative coefficient (-0.018), which is unexpected given that doubles should contribute positively to scoring. This likely reflects multicollinearity with TEAM_BATTING_H, since doubles are a component of total hits. Similarly, TEAM_PITCHING_H shows a positive coefficient (0.002), suggesting that allowing more hits leads to more wins — clearly implausible and again likely a product of multicollinearity. TEAM_BASERUN_CS and TEAM_PITCHING_HR are both insignificant (p > 0.05), suggesting they add little explanatory power in the presence of other variables. These issues motivate the construction of more refined models.
Model 2 takes a theory-driven approach, replacing TEAM_BATTING_H with the derived TEAM_BATTING_1B to isolate the contribution of singles and removing several variables that were either insignificant or counterintuitive in Model 1. The model achieves an Adjusted R² of 0.299 and an F-statistic of 98.24 (p < 2.2e-16), remaining highly significant overall with fewer predictors. Most coefficients now align well with baseball theory: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB all carry positive and significant coefficients, confirming that offensive production across all hit types drives wins. TEAM_BASERUN_SB is positive and significant, consistent with stolen bases creating scoring opportunities. TEAM_FIELDING_E and TEAM_FIELDING_DP are both negative and significant, reflecting the expected penalty of errors and the nuanced role of double plays.
However, two variables remain problematic. TEAM_PITCHING_HR carries a positive coefficient (0.027), which contradicts the expectation that allowing home runs hurts a team, though it is not statistically significant (p = 0.313). Similarly, TEAM_PITCHING_BB is insignificant (p = 0.449), suggesting walks allowed may be redundant given the other predictors included. While Model 2 is slightly less explanatory than Model 1, it is more parsimonious and theoretically cleaner, making it a strong candidate for the final model.
Model 3 was built using stepwise AIC selection starting from a full set of predictors including the derived variables BATTING_RATIO and WHIP_PROXY. The algorithm settled on 13 predictors and achieves the best performance of the three models with an Adjusted R² of 0.315 and a residual standard error of 13.04. The F-statistic of 81.46 (p < 2.2e-16) confirms strong overall significance. Notably, all 13 retained variables are statistically significant, making this the cleanest model of the three in terms of variable selection.
Most coefficients behave as expected — all batting variables carry positive coefficients, TEAM_FIELDING_E, TEAM_FIELDING_DP, and TEAM_PITCHING_BB are negative as theorized, and TEAM_PITCHING_SO is positive, reflecting the benefit of strikeout pitching. One counterintuitive result is BATTING_RATIO, which carries a large negative coefficient (-151.7). While this seems to suggest that batting efficiency hurts wins, it likely reflects multicollinearity with the individual batting components already in the model — once singles, doubles, and walks are controlled for, the ratio becomes redundant and its coefficient is distorted. Similarly, TEAM_PITCHING_H retains a positive coefficient (0.004), which remains puzzling but may again reflect multicollinearity among the pitching variables. Despite these nuances, Model 3 is the strongest performer on Adjusted R² and residual standard error, and will be carried forward as the selected model.
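Stepwise AIC selection of the kind described for Model 3 can be illustrated with `step()` from base R's stats package. The example below runs on simulated data with hypothetical column names; whether the report used `step()` or another implementation, and which direction it searched in, is not stated in the text, so `direction = "both"` is an assumption.

```r
set.seed(42)
n <- 300

# Simulated data: two informative predictors and two pure-noise predictors
toy <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
toy$y <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 10)

# Start from the full model and let step() add or drop terms by AIC
full_model <- lm(y ~ ., data = toy)
step_model <- step(full_model, direction = "both", trace = 0)

print(formula(step_model))
print(c(full = AIC(full_model), stepwise = AIC(step_model)))
```

Because the search minimizes AIC on the same data used to fit the model, the p-values of the retained terms tend to be optimistic, which is worth keeping in mind when a stepwise model shows every variable as significant.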
The stargazer table allows for a clean side-by-side comparison of all three models across 2,276 observations. Model 3 outperforms the others on every metric, achieving the highest Adjusted R² of 0.315 and the lowest residual standard error of 13.04, compared to 0.306 and 13.12 for Model 1 and 0.299 and 13.19 for Model 2. While the differences in Adjusted R² are modest across the three models, Model 3 stands out for having all retained variables statistically significant, whereas Models 1 and 2 both contain insignificant predictors such as TEAM_BASERUN_CS and TEAM_PITCHING_HR. The comparison also highlights how separating TEAM_BATTING_H into its components improves coefficient interpretability: in Model 1, TEAM_BATTING_2B carries a counterintuitive negative sign, while in Models 2 and 3 it turns positive once singles are isolated. Overall, the table reinforces that Model 3 is the strongest candidate, combining the best predictive performance with the most statistically clean set of predictors.
Model               Adj_R2   RMSE     F_stat  AIC
Model 1: Full       0.3059   13.0800  72.62   18194.58
Model 2: Theory     0.2994   13.1526  98.24   18211.78
Model 3: Stepwise   0.3150   12.9975  81.46   18163.78
The performance metrics for all three models are summarized in the table above. Model 3 outperforms the others on every criterion: it achieves the highest Adjusted R² (0.315), the lowest RMSE (13.00), and the lowest AIC (18,163.78), indicating the best balance of fit and parsimony. Model 1 ranks second with an Adjusted R² of 0.306 and RMSE of 13.08, while Model 2 trails slightly with an Adjusted R² of 0.299 and RMSE of 13.15. The AIC differences are meaningful: Model 3 is about 31 points below Model 1 (18,194.58) and 48 points below Model 2 (18,211.78), so neither the larger Model 1 nor the smaller Model 2 achieves a better trade-off between fit and number of parameters. While none of the models explain more than roughly 32% of the variation in wins, this is not unexpected given the inherent unpredictability of baseball outcomes and the limited scope of the available variables. Based on these metrics, Model 3 is selected as the final model for generating predictions on the evaluation dataset.
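A comparison table like the one above can be assembled from fitted `lm` objects in base R. The sketch below uses simulated data and hypothetical model names; `model_metrics()` is an illustrative helper, not a function from the report. It reports both RMSE, which divides by n, and the residual standard error, which divides by the residual degrees of freedom; that difference is why the two figures quoted for the same model differ slightly (for example 13.00 versus 13.04 for Model 3).

```r
set.seed(7)
n <- 200

# Simulated data (values are made up for illustration)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 80 + 4 * toy$x1 + 2 * toy$x2 + rnorm(n, sd = 12)

models <- list(
  "Model A: x1 only" = lm(y ~ x1, data = toy),
  "Model B: x1 + x2" = lm(y ~ x1 + x2, data = toy)
)

# Collect in-sample fit metrics for one model as a one-row data frame
model_metrics <- function(fit) {
  s <- summary(fit)
  data.frame(
    Adj_R2 = s$adj.r.squared,
    RMSE   = sqrt(mean(residuals(fit)^2)),  # divides by n
    RSE    = s$sigma,                       # divides by residual df
    F_stat = unname(s$fstatistic["value"]),
    AIC    = AIC(fit)
  )
}

metrics <- do.call(rbind, lapply(models, model_metrics))
metrics <- cbind(Model = names(models), metrics)
rownames(metrics) <- NULL  # avoids auto-generated row labels in the printout
print(metrics, digits = 4)
```

All of these metrics are computed on the training data, so they describe in-sample fit rather than out-of-sample accuracy.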
4.2 VIF, multicollinearity on best model
library(car)
par(mfrow = c(2, 2))
plot(model3, main = "Model 3 Diagnostics")
par(mfrow = c(1, 1))
The Residuals vs Fitted plot shows residuals scattered randomly around zero across the range of fitted values, with the red line remaining approximately flat. This suggests the linearity assumption is reasonably satisfied, though a slight fanning of residuals at lower fitted values hints at some heteroscedasticity. Observations 1342, 2012, and 1828 are flagged as potential outliers with large residuals.
The Normal Q-Q plot shows residuals tracking closely along the diagonal reference line through the middle range, indicating approximate normality. However, both tails deviate from the line, with observations 1828 at the lower end and 2012 and 1342 at the upper end pulling away, suggesting the residuals have slightly heavier tails than a perfect normal distribution.
The Scale-Location plot shows a mildly downward sloping red line, indicating a slight decrease in residual variance at higher fitted values. While not severely problematic, this suggests mild heteroscedasticity that should be noted as a limitation of the model.
The Residuals vs Leverage plot shows that the vast majority of observations cluster at low leverage values near zero, which is reassuring. Observation 1342 stands out with both high leverage and a large residual, and a point at the far right around leverage 0.18 approaches but does not cross the Cook’s distance boundary of 0.5, suggesting no single observation is unduly distorting the model estimates.
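The multicollinearity check named in the 4.2 heading is typically done with variance inflation factors. With the car package loaded, this would usually be a call to `vif()` on the fitted model; the sketch below instead computes the same quantity by hand in base R, using VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The data and the helper `manual_vif()` are hypothetical, and the helper assumes numeric predictors with one coefficient each.

```r
set.seed(3)
n <- 250

# Simulated data in which `singles` is built to be collinear with `hits`
toy <- data.frame(hits = rnorm(n, 1450, 100))
toy$singles <- 0.7 * toy$hits + rnorm(n, 0, 20)
toy$walks   <- rnorm(n, 500, 60)
toy$wins    <- 0.03 * toy$hits + 0.02 * toy$walks + rnorm(n, 0, 10)

fit <- lm(wins ~ hits + singles + walks, data = toy)

# Hypothetical helper: VIF for each predictor column of the design matrix
manual_vif <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  vapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

vifs <- manual_vif(fit)
print(round(vifs, 2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, which is one way to examine the unexpected coefficient signs discussed for BATTING_RATIO and TEAM_PITCHING_H.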
ggplot(eval, aes(x = PREDICTED_WINS)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Predicted Wins", x = "Predicted Wins", y = "Count") +
  theme_minimal()
Model 3 was applied to the evaluation dataset to generate predicted win totals for each team. The distribution of predicted wins is approximately normal and centered around 80-85 wins, which is consistent with the training data distribution and reflects a realistic range for a 162-game baseball season. The bulk of predictions fall between 65 and 95 wins, with a small number of outlier predictions below 30 wins likely corresponding to historically unusual team seasons. All predictions were capped between 0 and 162 to ensure they fall within the physically possible range of a baseball season. A sample of the first ten predictions shows values ranging from 54 to 80 wins, with most clustering in the high 60s to mid 70s. The predicted values have been exported to moneyball_predictions.csv for submission.
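The predict, cap, and export sequence described above can be sketched in base R as follows. The data, the single-predictor model, and the output file name are all hypothetical; the sketch writes to a temporary directory rather than to moneyball_predictions.csv.

```r
set.seed(11)

# Simulated training data and a deliberately extreme evaluation set
train_toy <- data.frame(hits = rnorm(100, 1450, 100))
train_toy$wins <- 0.08 * train_toy$hits - 35 + rnorm(100, 0, 4)
eval_toy <- data.frame(hits = c(100, 1450, 4000))  # extremes chosen to trigger both caps

fit <- lm(wins ~ hits, data = train_toy)

# Raw predictions can fall outside the possible range for extreme inputs
raw_pred <- predict(fit, newdata = eval_toy)

# Cap to the physically possible range of a 162-game season
eval_toy$PREDICTED_WINS <- pmin(pmax(raw_pred, 0), 162)

# Export; the file name here is illustrative only
out_file <- file.path(tempdir(), "toy_predictions.csv")
write.csv(eval_toy, out_file, row.names = FALSE)
print(eval_toy)
```

Capping guarantees valid values but hides how far the model extrapolated, so it can be useful to count how many raw predictions were changed by the cap before exporting.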