Moneyball Prediction

1 Data Exploration

1.1 First Look

head(train, 5)

  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
1          39           1445             194              39              13
2          70           1339             219              22             190
3          86           1377             232              35             137
4          70           1387             209              38              96
5          82           1297             186              27             102
  TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
1             143             842              NA              NA
2             685            1075              37              28
3             602             917              46              27
4             451             922              43              30
5             472             920              49              39
  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
1               NA            9364               84              927
2               NA            1347              191              689
3               NA            1377              137              602
4               NA            1396               97              454
5               NA            1297              102              472
  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
1             5456            1011               NA
2             1082             193              155
3              917             175              153
4              928             164              156
5              920             138              168

1.2 Summary Statistics

summary(train)

  TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
 Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
 1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
 Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
 Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
 3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
 Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
                                                                 
 TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
 Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
 1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
 Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
 Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
 3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
 Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
                                  NA's   :102      NA's   :131    
 TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
 Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
 1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
 Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
 Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
 3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
 Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
 NA's   :772     NA's   :2085                                     
 TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
 Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
 1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
 Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
 Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
 3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
 Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
                  NA's   :102                        NA's   :286

1.3 Missing Data Visualization

vis_dat(train)

The training dataset contains 2,276 observations and 16 variables, all stored as integer type. A visual inspection of the data using vis_dat reveals that while most variables are largely complete, several suffer from meaningful missingness that must be addressed before modeling. The most problematic variable is TEAM_BATTING_HBP (batters hit by pitch), which is missing for the vast majority of observations and will likely need to be dropped or treated with a missing indicator flag rather than imputed. Moderate missingness is observed in TEAM_BASERUN_CS (caught stealing) and TEAM_BASERUN_SB (stolen bases), while TEAM_BATTING_SO (strikeouts by batters), TEAM_PITCHING_SO (strikeouts by pitchers), and TEAM_FIELDING_DP (double plays) show smaller but still notable gaps. All other variables appear fully populated. These missing values are unlikely to be random — for example, stolen base statistics may not have been recorded in earlier eras of baseball — which has implications for how we choose to impute them. The missingness patterns will be addressed systematically in the Data Preparation section.

vis_dat(train) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

train %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(pct_missing = round(n_missing / nrow(train) * 100, 1)) %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing))

# A tibble: 6 × 3
  variable         n_missing pct_missing
  <chr>                <int>       <dbl>
1 TEAM_BATTING_HBP      2085        91.6
2 TEAM_BASERUN_CS        772        33.9
3 TEAM_FIELDING_DP       286        12.6
4 TEAM_BASERUN_SB        131         5.8
5 TEAM_BATTING_SO        102         4.5
6 TEAM_PITCHING_SO       102         4.5

Six variables in the training dataset contain missing values. The most severely affected is TEAM_BATTING_HBP (batters hit by pitch), missing 91.6% of observations, making it analytically unusable and a candidate for exclusion from modeling. TEAM_BASERUN_CS (caught stealing) follows with 33.9% missing, while TEAM_FIELDING_DP (double plays), TEAM_BASERUN_SB (stolen bases), TEAM_BATTING_SO, and TEAM_PITCHING_SO have more modest missingness ranging from 4.5% to 12.6%. The missing baserunning statistics likely reflect incomplete record-keeping in earlier eras of professional baseball rather than true random missingness. All variables except TEAM_BATTING_HBP will be addressed through median imputation in the Data Preparation section, with binary flag variables created to retain any signal in the missingness.
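TEAM_BATTING_SO and TEAM_PITCHING_SO are each missing exactly 102 values, which hints that the gaps may fall on the same rows. One way to check whether missingness is shared across columns is a co-missingness count. The sketch below uses a small made-up data frame (`toy`), not the real training data, and only base R; the same two lines could be pointed at `train`.

```r
# Toy data for illustration only (not the Moneyball training set).
toy <- data.frame(
  TEAM_BATTING_SO  = c(842, NA, 917, NA, 920),
  TEAM_PITCHING_SO = c(5456, NA, 917, NA, 920),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, NA)
)

# 0/1 matrix of missingness indicators, one column per variable.
miss <- is.na(toy) * 1

# Entry [i, j] counts the rows where variables i and j are both missing;
# the diagonal holds each variable's own missing count.
co_missing <- crossprod(miss)
print(co_missing)

# TRUE when the two strikeout columns are missing on exactly the same rows.
same_rows <- identical(is.na(toy$TEAM_BATTING_SO), is.na(toy$TEAM_PITCHING_SO))
print(same_rows)
```

If two columns are always missing together, a single flag variable carries the same information as two separate flags.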

# Bar chart of missing counts
missing_counts <- train %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing") %>%
  filter(missing > 0) %>%
  arrange(desc(missing))

ggplot(missing_counts, aes(x = reorder(variable, -missing), y = missing)) +
  geom_col(fill = "steelblue") +
  labs(title = "Missing Values by Variable", x = "Variable", y = "# Missing") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

1.4 Histograms of all variables

train %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Distribution of All Variables") +
  theme_minimal()

Warning: Removed 3478 rows containing non-finite outside the scale range
(`stat_bin()`).

Examining the distribution of all variables reveals several important patterns. TARGET_WINS is approximately normally distributed, centered around 80 wins, which is consistent with a 162-game season where teams tend to cluster near 0.500. Most batting variables such as TEAM_BATTING_H, TEAM_BATTING_2B, and TEAM_BATTING_BB also follow roughly normal distributions, suggesting well-behaved predictors. However, several variables display notable right skew and extreme outliers. TEAM_PITCHING_H and TEAM_PITCHING_BB have values extending far to the right, with maximums of 30,132 and 3,645 respectively. These values are clearly unrealistic for a 162-game season and are likely data entry errors or artifacts from early baseball records. Similarly, TEAM_FIELDING_E and TEAM_PITCHING_SO show heavy right tails, with TEAM_PITCHING_SO reaching 19,278. TEAM_BATTING_HBP confirms its near-total missingness, with only a narrow band of observations visible. These extreme outliers in the pitching and fielding variables will need to be capped or winsorized during data preparation to prevent them from unduly influencing the regression models.

1.5 Correlation with Target

cor_with_target <- train %>%
  summarise(across(-TARGET_WINS,
                   ~round(cor(., TARGET_WINS, use = "complete.obs"), 3))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "correlation") %>%
  arrange(desc(abs(correlation)))
print(cor_with_target)

# A tibble: 15 × 2
   variable         correlation
   <chr>                  <dbl>
 1 TEAM_BATTING_H         0.389
 2 TEAM_BATTING_2B        0.289
 3 TEAM_BATTING_BB        0.233
 4 TEAM_PITCHING_HR       0.189
 5 TEAM_BATTING_HR        0.176
 6 TEAM_FIELDING_E       -0.176
 7 TEAM_BATTING_3B        0.143
 8 TEAM_BASERUN_SB        0.135
 9 TEAM_PITCHING_BB       0.124
10 TEAM_PITCHING_H       -0.11 
11 TEAM_PITCHING_SO      -0.078
12 TEAM_BATTING_HBP       0.074
13 TEAM_FIELDING_DP      -0.035
14 TEAM_BATTING_SO       -0.032
15 TEAM_BASERUN_CS        0.022

The correlation analysis shows that TEAM_BATTING_H (base hits) has the strongest positive relationship with TARGET_WINS at 0.389, followed by TEAM_BATTING_2B (0.289) and TEAM_BATTING_BB (0.233), which suggests that offensive production is the primary driver of wins. TEAM_FIELDING_E has the strongest negative correlation at -0.176, consistent with the expectation that errors hurt a team’s performance. Notably, TEAM_PITCHING_HR shows a counterintuitive positive correlation of 0.189. Because this is a simple pairwise correlation, it cannot be an artifact of multicollinearity; it more likely reflects confounding with other variables (for example, era effects that could raise home runs hit and home runs allowed together) rather than a true benefit of allowing home runs. Overall, the modest correlation magnitudes suggest no single variable dominates, reinforcing the need for a multivariate modeling approach. The full correlation matrix also highlights strong relationships among the pitching variables, raising multicollinearity concerns to be addressed during model building.

1.6 Full correlation matrix

cor_matrix <- cor(train, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color", type = "upper",
         tl.cex = 0.7, title = "Correlation Matrix", mar = c(0,0,1,0))
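
A correlation matrix only shows two-variable relationships. A common complementary diagnostic for multicollinearity is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The base R sketch below computes it by hand on simulated data; the helper `vif_manual` and the `toy` columns are hypothetical and not part of the original analysis.

```r
# VIF for each predictor: regress it on the remaining predictors
# and convert the resulting R-squared into 1 / (1 - R^2).
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data: hits is almost exactly singles + doubles,
# while walks is unrelated to the other columns.
set.seed(1)
n <- 200
toy <- data.frame(
  singles = rnorm(n, 1000, 80),
  doubles = rnorm(n, 240, 40),
  walks   = rnorm(n, 500, 60)
)
toy$hits <- toy$singles + toy$doubles + rnorm(n, 150, 10)

vifs <- vif_manual(toy, c("hits", "singles", "doubles", "walks"))
print(round(vifs, 2))
```

In this toy setup the VIF for hits is large because it is nearly a linear combination of singles and doubles, while the VIF for walks stays close to 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.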

2 Data Preparation

2.1 Flag missing values

train <- train %>%
  mutate(
    FLAG_BATTING_SO  = ifelse(is.na(TEAM_BATTING_SO),  1, 0),
    FLAG_BASERUN_SB  = ifelse(is.na(TEAM_BASERUN_SB),  1, 0),
    FLAG_BASERUN_CS  = ifelse(is.na(TEAM_BASERUN_CS),  1, 0),
    FLAG_PITCHING_SO = ifelse(is.na(TEAM_PITCHING_SO), 1, 0),
    FLAG_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 1, 0)
  )

2.2 Impute missing values with median

impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

train <- train %>%
  mutate(across(where(is.numeric), impute_median))

sum(is.na(train))

[1] 0

2.3 Cap extreme outliers

winsorize_99 <- function(x) {
  cap <- quantile(x, 0.99, na.rm = TRUE)
  pmin(x, cap)
}

train <- train %>%
  mutate(
    TEAM_PITCHING_H  = winsorize_99(TEAM_PITCHING_H),
    TEAM_PITCHING_BB = winsorize_99(TEAM_PITCHING_BB),
    TEAM_PITCHING_SO = winsorize_99(TEAM_PITCHING_SO),
    TEAM_FIELDING_E  = winsorize_99(TEAM_FIELDING_E)
  )

2.4 Create derived variables

train <- train %>%
  mutate(
    # Singles = total hits minus extra-base hits
    TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
                      TEAM_BATTING_3B - TEAM_BATTING_HR,
    # Batting ratio: hits per plate appearance proxy
    BATTING_RATIO   = TEAM_BATTING_H / (TEAM_BATTING_H + TEAM_BATTING_SO),
    # WHIP proxy: walks + hits allowed per game
    WHIP_PROXY      = (TEAM_PITCHING_H + TEAM_PITCHING_BB) / 162
  )

2.5 Summary on final variables

final_vars <- train %>%
  dplyr::select(TARGET_WINS, TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B,
         TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO,
         TEAM_BASERUN_SB, TEAM_FIELDING_E, TEAM_FIELDING_DP,
         TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB,
         TEAM_PITCHING_SO, BATTING_RATIO, WHIP_PROXY)

summary(final_vars)

  TARGET_WINS     TEAM_BATTING_1B  TEAM_BATTING_2B TEAM_BATTING_3B 
 Min.   :  0.00   Min.   : 709.0   Min.   : 69.0   Min.   :  0.00  
 1st Qu.: 71.00   1st Qu.: 990.8   1st Qu.:208.0   1st Qu.: 34.00  
 Median : 82.00   Median :1050.0   Median :238.0   Median : 47.00  
 Mean   : 80.79   Mean   :1073.2   Mean   :241.2   Mean   : 55.25  
 3rd Qu.: 92.00   3rd Qu.:1129.0   3rd Qu.:273.0   3rd Qu.: 72.00  
 Max.   :146.00   Max.   :2112.0   Max.   :458.0   Max.   :223.00  
 TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
 Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
 1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8   1st Qu.: 67.0  
 Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
 Mean   : 99.61   Mean   :501.6   Mean   : 736.3   Mean   :123.4  
 3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
 Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
 TEAM_FIELDING_E  TEAM_FIELDING_DP TEAM_PITCHING_H TEAM_PITCHING_HR
 Min.   :  65.0   Min.   : 52.0    Min.   :1137    Min.   :  0.0   
 1st Qu.: 127.0   1st Qu.:134.0    1st Qu.:1419    1st Qu.: 50.0   
 Median : 159.0   Median :149.0    Median :1518    Median :107.0   
 Mean   : 244.0   Mean   :146.7    Mean   :1716    Mean   :105.7   
 3rd Qu.: 249.2   3rd Qu.:161.2    3rd Qu.:1682    3rd Qu.:150.0   
 Max.   :1228.0   Max.   :228.0    Max.   :7054    Max.   :343.0   
 TEAM_PITCHING_BB TEAM_PITCHING_SO BATTING_RATIO      WHIP_PROXY    
 Min.   :  0.0    Min.   :   0.0   Min.   :0.4962   Min.   : 9.469  
 1st Qu.:476.0    1st Qu.: 626.0   1st Qu.:0.6057   1st Qu.:11.969  
 Median :536.5    Median : 813.5   Median :0.6525   Median :12.802  
 Mean   :547.0    Mean   : 798.7   Mean   :0.6720   Mean   :13.968  
 3rd Qu.:611.0    3rd Qu.: 957.0   3rd Qu.:0.7283   3rd Qu.:13.995  
 Max.   :921.0    Max.   :1461.8   Max.   :1.0000   Max.   :49.228

Several preparation steps were applied before modeling. Binary flag variables were created for the five retained columns with missing values (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_SO, and TEAM_FIELDING_DP) to preserve any signal in the missingness itself. TEAM_BATTING_HBP was left out of the final variable set given its 91.6% missingness. All remaining missing values were imputed using column medians, and winsorization at the 99th percentile was applied to TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E to address the extreme outliers identified during exploration. Three derived variables were also created: TEAM_BATTING_1B (singles, calculated by subtracting extra-base hits from total hits), BATTING_RATIO (hits divided by hits plus strikeouts as a batting efficiency proxy), and WHIP_PROXY (walks plus hits allowed per game, a rough stand-in for the standard per-inning WHIP metric). After preparation, the dataset shows a mean of approximately 81 wins and no missing values. The most extreme pitching and fielding values have been pulled in, although some implausible values remain (for example, a TARGET_WINS minimum of 0 and a TEAM_PITCHING_H maximum of 7,054).

2.6 Applying the transformations to the evaluation data

eval <- eval %>%
  mutate(
    FLAG_BATTING_SO  = ifelse(is.na(TEAM_BATTING_SO),  1, 0),
    FLAG_BASERUN_SB  = ifelse(is.na(TEAM_BASERUN_SB),  1, 0),
    FLAG_BASERUN_CS  = ifelse(is.na(TEAM_BASERUN_CS),  1, 0),
    FLAG_PITCHING_SO = ifelse(is.na(TEAM_PITCHING_SO), 1, 0),
    FLAG_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 1, 0)
  ) %>%
  mutate(across(where(is.numeric), impute_median)) %>%
  mutate(
    TEAM_PITCHING_H  = winsorize_99(TEAM_PITCHING_H),
    TEAM_PITCHING_BB = winsorize_99(TEAM_PITCHING_BB),
    TEAM_PITCHING_SO = winsorize_99(TEAM_PITCHING_SO),
    TEAM_FIELDING_E  = winsorize_99(TEAM_FIELDING_E),
    TEAM_BATTING_1B  = TEAM_BATTING_H - TEAM_BATTING_2B -
                       TEAM_BATTING_3B - TEAM_BATTING_HR,
    BATTING_RATIO    = TEAM_BATTING_H / (TEAM_BATTING_H + TEAM_BATTING_SO),
    WHIP_PROXY       = (TEAM_PITCHING_H + TEAM_PITCHING_BB) / 162
  )
summary(eval)

 TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B  TEAM_BATTING_HR 
 Min.   : 819   Min.   : 44.0   Min.   : 14.00   Min.   :  0.00  
 1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00   1st Qu.: 44.50  
 Median :1455   Median :239.0   Median : 52.00   Median :101.00  
 Mean   :1469   Mean   :241.3   Mean   : 55.91   Mean   : 95.63  
 3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00   3rd Qu.:135.50  
 Max.   :2170   Max.   :376.0   Max.   :155.00   Max.   :242.00  
 TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB TEAM_BASERUN_CS 
 Min.   : 15.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.00  
 1st Qu.:436.5   1st Qu.: 565.0   1st Qu.: 60.5   1st Qu.: 44.00  
 Median :509.0   Median : 686.0   Median : 92.0   Median : 49.50  
 Mean   :499.0   Mean   : 707.7   Mean   :122.1   Mean   : 51.37  
 3rd Qu.:565.5   3rd Qu.: 904.5   3rd Qu.:149.0   3rd Qu.: 56.00  
 Max.   :792.0   Max.   :1268.0   Max.   :580.0   Max.   :154.00  
 TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
 Min.   :42.00    Min.   :1155    Min.   :  0.0    Min.   : 136.0  
 1st Qu.:62.00    1st Qu.:1426    1st Qu.: 52.0    1st Qu.: 471.0  
 Median :62.00    Median :1515    Median :104.0    Median : 526.0  
 Mean   :62.03    Mean   :1744    Mean   :102.1    Mean   : 545.5  
 3rd Qu.:62.00    3rd Qu.:1681    3rd Qu.:142.5    3rd Qu.: 606.5  
 Max.   :96.00    Max.   :8817    Max.   :336.0    Max.   :1131.0  
 TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP FLAG_BATTING_SO 
 Min.   :   0.0   Min.   :  73.0   Min.   : 69.0    Min.   :0.0000  
 1st Qu.: 622.5   1st Qu.: 131.0   1st Qu.:134.5    1st Qu.:0.0000  
 Median : 745.0   Median : 163.0   Median :148.0    Median :0.0000  
 Mean   : 761.6   Mean   : 247.5   Mean   :146.3    Mean   :0.0695  
 3rd Qu.: 927.5   3rd Qu.: 252.0   3rd Qu.:160.5    3rd Qu.:0.0000  
 Max.   :1279.3   Max.   :1239.5   Max.   :204.0    Max.   :1.0000  
 FLAG_BASERUN_SB   FLAG_BASERUN_CS  FLAG_PITCHING_SO FLAG_FIELDING_DP
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.05019   Mean   :0.3359   Mean   :0.0695   Mean   :0.1197  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 TEAM_BATTING_1B  BATTING_RATIO      WHIP_PROXY   
 Min.   : 657.0   Min.   :0.4252   Min.   : 9.34  
 1st Qu.: 990.5   1st Qu.:0.6144   1st Qu.:11.99  
 Median :1059.0   Median :0.6751   Median :12.64  
 Mean   :1076.5   Mean   :0.6803   Mean   :14.13  
 3rd Qu.:1134.0   3rd Qu.:0.7334   3rd Qu.:14.01  
 Max.   :1846.0   Max.   :1.0000   Max.   :61.41

The same preparation steps applied to the training data were replicated on the evaluation dataset. All flag variables, median imputations, winsorization, and derived variables were applied. The evaluation dataset shows distributions comparable to the training data, with a median of 1,059 singles, a mean BATTING_RATIO of 0.68, and a mean WHIP_PROXY of 14.13, all close to the training set values. Note that impute_median and winsorize_99 were called directly on the evaluation columns, so the medians and 99th-percentile caps were recomputed from the evaluation data rather than carried over from training. This is why TEAM_PITCHING_H and TEAM_FIELDING_E have higher maximums in the evaluation set (8,817 and 1,239.5) than in the winsorized training set (7,054 and 1,228). Reusing the training-set medians and caps would keep both datasets on exactly the same scale. The evaluation data is fully imputed and ready to receive predictions from the final model.
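
One way to make the evaluation data use training-derived parameters is to compute the medians and caps once on the training set, store them, and apply the stored values to any other data frame. The following is a minimal base R sketch under that assumption; `fit_prep`, `apply_prep`, and the toy data frames are hypothetical names introduced for illustration and are not part of the original pipeline.

```r
# Learn imputation medians and upper caps from the training data only.
fit_prep <- function(train_df, impute_cols, cap_cols, prob = 0.99) {
  list(
    medians = sapply(train_df[impute_cols], median, na.rm = TRUE),
    caps    = sapply(train_df[cap_cols], quantile, probs = prob,
                     na.rm = TRUE, names = FALSE)
  )
}

# Apply the stored training parameters to any data frame (train or eval).
apply_prep <- function(df, params) {
  for (col in names(params$medians)) {
    df[[col]][is.na(df[[col]])] <- params$medians[[col]]
  }
  for (col in names(params$caps)) {
    df[[col]] <- pmin(df[[col]], params$caps[[col]])
  }
  df
}

# Toy data for illustration only (not the real Moneyball files).
toy_train <- data.frame(TEAM_PITCHING_H = c(1400, 1500, 1600, 1700),
                        TEAM_BASERUN_SB = c(50, NA, 100, 150))
toy_eval  <- data.frame(TEAM_PITCHING_H = c(1450, 9000),
                        TEAM_BASERUN_SB = c(NA, 80))

params   <- fit_prep(toy_train,
                     impute_cols = "TEAM_BASERUN_SB",
                     cap_cols    = "TEAM_PITCHING_H")
eval_out <- apply_prep(toy_eval, params)
print(eval_out)
```

Here the missing stolen-base value in `toy_eval` is filled with the training median, and the outlying 9000 is capped at the training 99th percentile, so no statistic is ever estimated from the evaluation rows.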

3 Building Models

Model 1: All predictors

model1 <- lm(TARGET_WINS ~
               TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
               TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
               TEAM_BASERUN_SB + TEAM_BASERUN_CS +
               TEAM_FIELDING_E + TEAM_FIELDING_DP +
               TEAM_PITCHING_H + TEAM_PITCHING_HR +
               TEAM_PITCHING_BB + TEAM_PITCHING_SO,
             data = train)

summary(model1)


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
    TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
    TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
    TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.014  -8.402   0.128   8.277  66.648 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      19.5583938  5.3942719   3.626 0.000294 ***
TEAM_BATTING_H    0.0447524  0.0039175  11.424  < 2e-16 ***
TEAM_BATTING_2B  -0.0184258  0.0091669  -2.010 0.044547 *  
TEAM_BATTING_3B   0.0822276  0.0174108   4.723 2.47e-06 ***
TEAM_BATTING_HR   0.0781608  0.0315128   2.480 0.013200 *  
TEAM_BATTING_BB   0.0478276  0.0098452   4.858 1.27e-06 ***
TEAM_BATTING_SO  -0.0247820  0.0054415  -4.554 5.54e-06 ***
TEAM_BASERUN_SB   0.0309100  0.0044932   6.879 7.76e-12 ***
TEAM_BASERUN_CS  -0.0118224  0.0158703  -0.745 0.456386    
TEAM_FIELDING_E  -0.0253983  0.0031387  -8.092 9.49e-16 ***
TEAM_FIELDING_DP -0.1144784  0.0130431  -8.777  < 2e-16 ***
TEAM_PITCHING_H   0.0023269  0.0009206   2.528 0.011552 *  
TEAM_PITCHING_HR -0.0182345  0.0280984  -0.649 0.516436    
TEAM_PITCHING_BB -0.0302420  0.0082164  -3.681 0.000238 ***
TEAM_PITCHING_SO  0.0201817  0.0045028   4.482 7.76e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.12 on 2261 degrees of freedom
Multiple R-squared:  0.3102,    Adjusted R-squared:  0.3059 
F-statistic: 72.62 on 14 and 2261 DF,  p-value: < 2.2e-16

Model 1 includes all 14 original predictors and serves as a baseline “kitchen sink” model. The model achieves an Adjusted R² of 0.306 and an F-statistic of 72.62, which is highly significant (p < 2.2e-16), confirming that the predictors jointly explain a meaningful portion of the variation in wins. Most coefficients align with theoretical expectations — TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, and TEAM_BASERUN_SB all carry positive and significant coefficients, while TEAM_BATTING_SO, TEAM_FIELDING_E, and TEAM_PITCHING_BB are negative and significant, consistent with their expected negative impact on wins.

However, several counterintuitive results emerge. TEAM_BATTING_2B carries a negative coefficient (-0.018), which is unexpected given that doubles should contribute positively to scoring. This likely reflects multicollinearity with TEAM_BATTING_H, since doubles are a component of total hits. Similarly, TEAM_PITCHING_H shows a positive coefficient (0.002), suggesting that allowing more hits leads to more wins — clearly implausible and again likely a product of multicollinearity. TEAM_BASERUN_CS and TEAM_PITCHING_HR are both insignificant (p > 0.05), suggesting they add little explanatory power in the presence of other variables. These issues motivate the construction of more refined models.
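
The negative sign on TEAM_BATTING_2B can also be read as a parameterization effect: when total hits are held fixed, one more double means one fewer hit of another type, so the doubles coefficient measures the value of a double relative to the hit it replaces. The simulated base R sketch below illustrates this with made-up data and hypothetical column names. Swapping `hits` for `singles` gives the same fit, and the doubles coefficient in the singles version equals the sum of the hits and doubles coefficients in the hits version.

```r
# Simulated data: a single is worth 0.05 wins and a double 0.03 wins
# (illustrative values, not estimates from the Moneyball data).
set.seed(42)
n <- 300
singles <- rnorm(n, 1050, 90)
doubles <- rnorm(n, 240, 40)
hits    <- singles + doubles
wins    <- 10 + 0.05 * singles + 0.03 * doubles + rnorm(n, 0, 2)
toy     <- data.frame(wins, hits, singles, doubles)

# Same information, two parameterizations.
fit_hits    <- lm(wins ~ hits + doubles, data = toy)
fit_singles <- lm(wins ~ singles + doubles, data = toy)

# With hits in the model, the doubles coefficient is "double minus single".
coef_doubles_given_hits <- coef(fit_hits)[["doubles"]]

# Adding the hits coefficient back recovers the stand-alone value of a double.
implied_doubles <- coef(fit_hits)[["hits"]] + coef(fit_hits)[["doubles"]]

print(c(given_hits = coef_doubles_given_hits,
        implied    = implied_doubles,
        direct     = coef(fit_singles)[["doubles"]]))
```

Applying the same arithmetic to the Model 1 estimates gives 0.0448 + (-0.0184) = 0.0263 as the implied stand-alone coefficient for doubles, which is positive.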

Model 2: Hand-picked meaningful variables

model2 <- lm(TARGET_WINS ~
               TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B +
               TEAM_BATTING_HR + TEAM_BATTING_BB +
               TEAM_BASERUN_SB +
               TEAM_FIELDING_E + TEAM_FIELDING_DP +
               TEAM_PITCHING_HR + TEAM_PITCHING_BB,
             data = train)

summary(model2)


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
    TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_PITCHING_BB, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.638  -8.461   0.084   8.463  68.179 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      15.785490   3.565024   4.428 9.97e-06 ***
TEAM_BATTING_1B   0.049083   0.003110  15.782  < 2e-16 ***
TEAM_BATTING_2B   0.029898   0.007297   4.097 4.33e-05 ***
TEAM_BATTING_3B   0.127385   0.015412   8.265 2.34e-16 ***
TEAM_BATTING_HR   0.073085   0.029559   2.473  0.01349 *  
TEAM_BATTING_BB   0.019074   0.007295   2.615  0.00899 ** 
TEAM_BASERUN_SB   0.023093   0.004030   5.730 1.14e-08 ***
TEAM_FIELDING_E  -0.018781   0.002626  -7.152 1.15e-12 ***
TEAM_FIELDING_DP -0.118387   0.012887  -9.186  < 2e-16 ***
TEAM_PITCHING_HR  0.026701   0.026451   1.009  0.31286    
TEAM_PITCHING_BB -0.004545   0.006003  -0.757  0.44904    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.18 on 2265 degrees of freedom
Multiple R-squared:  0.3025,    Adjusted R-squared:  0.2994 
F-statistic: 98.24 on 10 and 2265 DF,  p-value: < 2.2e-16

Model 2 takes a theory-driven approach, replacing TEAM_BATTING_H with the derived TEAM_BATTING_1B to isolate the contribution of singles and dropping TEAM_BATTING_SO, TEAM_BASERUN_CS, TEAM_PITCHING_H, and TEAM_PITCHING_SO. The model achieves an Adjusted R² of 0.299 and an F-statistic of 98.24 (p < 2.2e-16), remaining highly significant overall with fewer predictors. Most coefficients now align well with baseball theory: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB all carry positive and significant coefficients, consistent with offensive production across all hit types driving wins. TEAM_BASERUN_SB is positive and significant, consistent with stolen bases creating scoring opportunities. TEAM_FIELDING_E is negative and significant, the expected penalty for errors. TEAM_FIELDING_DP is also negative and significant, which is less obvious; one possible reading is that teams turning many double plays are also allowing many baserunners.

However, two variables remain problematic. TEAM_PITCHING_HR carries a positive coefficient (0.027), which contradicts the expectation that allowing home runs hurts a team, though it is not statistically significant (p = 0.313). Similarly, TEAM_PITCHING_BB is insignificant (p = 0.449), suggesting walks allowed may be redundant given the other predictors included. While Model 2 is slightly less explanatory than Model 1, it is more parsimonious and theoretically cleaner, making it a strong candidate for the final model.

Model 3: Stepwise AIC selection

full_model <- lm(TARGET_WINS ~
                   TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B +
                   TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
                   TEAM_BASERUN_SB + TEAM_BASERUN_CS +
                   TEAM_FIELDING_E + TEAM_FIELDING_DP +
                   TEAM_PITCHING_H + TEAM_PITCHING_HR +
                   TEAM_PITCHING_BB + TEAM_PITCHING_SO +
                   BATTING_RATIO + WHIP_PROXY,
                 data = train)

model3 <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(model3)


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
    TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
    TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_PITCHING_H + 
    TEAM_PITCHING_BB + TEAM_PITCHING_SO + BATTING_RATIO, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.312  -8.383   0.169   8.103  59.042 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.299e+02  2.093e+01   6.208 6.37e-10 ***
TEAM_BATTING_1B   5.909e-02  4.708e-03  12.551  < 2e-16 ***
TEAM_BATTING_2B   5.090e-02  8.619e-03   5.906 4.03e-09 ***
TEAM_BATTING_3B   1.378e-01  1.611e-02   8.557  < 2e-16 ***
TEAM_BATTING_HR   1.305e-01  1.010e-02  12.926  < 2e-16 ***
TEAM_BATTING_BB   4.972e-02  9.622e-03   5.168 2.58e-07 ***
TEAM_BATTING_SO  -6.623e-02  9.345e-03  -7.087 1.83e-12 ***
TEAM_BASERUN_SB   3.128e-02  4.359e-03   7.177 9.62e-13 ***
TEAM_FIELDING_E  -2.166e-02  3.015e-03  -7.183 9.20e-13 ***
TEAM_FIELDING_DP -1.154e-01  1.294e-02  -8.915  < 2e-16 ***
TEAM_PITCHING_H   3.722e-03  9.463e-04   3.933 8.64e-05 ***
TEAM_PITCHING_BB -3.329e-02  7.980e-03  -4.172 3.13e-05 ***
TEAM_PITCHING_SO  1.313e-02  4.348e-03   3.019  0.00257 ** 
BATTING_RATIO    -1.517e+02  2.778e+01  -5.461 5.25e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.04 on 2262 degrees of freedom
Multiple R-squared:  0.3189,    Adjusted R-squared:  0.315 
F-statistic: 81.46 on 13 and 2262 DF,  p-value: < 2.2e-16

Model 3 was built using stepwise AIC selection starting from a full set of predictors including the derived variables BATTING_RATIO and WHIP_PROXY. The algorithm settled on 13 predictors and achieves the best performance of the three models with an Adjusted R² of 0.315 and a residual standard error of 13.04. The F-statistic of 81.46 (p < 2.2e-16) confirms strong overall significance. Notably, all 13 retained variables are statistically significant, making this the cleanest model of the three in terms of variable selection.

Most coefficients behave as expected: the hit and walk variables (TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB) and TEAM_BASERUN_SB carry positive coefficients, TEAM_BATTING_SO, TEAM_FIELDING_E, TEAM_FIELDING_DP, and TEAM_PITCHING_BB are negative as theorized, and TEAM_PITCHING_SO is positive, reflecting the benefit of strikeout pitching. One counterintuitive result is BATTING_RATIO, which carries a large negative coefficient (-151.7). While this seems to suggest that batting efficiency hurts wins, it likely reflects multicollinearity: BATTING_RATIO is computed from TEAM_BATTING_H and TEAM_BATTING_SO, and the hit components and strikeouts are already in the model, so the ratio is largely redundant and its coefficient is distorted. Similarly, TEAM_PITCHING_H retains a positive coefficient (0.004), which remains puzzling but may again reflect multicollinearity among the pitching variables. Despite these nuances, Model 3 is the strongest performer on Adjusted R² and residual standard error, and will be carried forward as the selected model.
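
The Adjusted R² values quoted for the three models can be reproduced from the Multiple R-squared, the sample size, and the number of predictors using the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1). The short base R check below plugs in the figures printed in the model summaries; n = 2,276 follows from the residual degrees of freedom plus the number of estimated coefficients (for example, 2261 + 15). The helper name `adj_r2` is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 2276  # residual df + number of coefficients, identical for all three models

models <- data.frame(
  model = c("Model 1: Full", "Model 2: Theory", "Model 3: Stepwise"),
  r2    = c(0.3102, 0.3025, 0.3189),  # Multiple R-squared from each summary()
  p     = c(14, 10, 13)               # number of predictors in each model
)
models$adj_r2 <- adj_r2(models$r2, n, models$p)

print(models)
```

The penalty term (n - 1) / (n - p - 1) is close to 1 for every model because n is large relative to p, so the ranking by Adjusted R² matches the ranking by raw R² here.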

3.1 Comparison

stargazer(model1, model2, model3,
          type          = "text",
          title         = "Regression Model Comparison",
          dep.var.labels = "TARGET_WINS",
          column.labels  = c("Model 1: Full", "Model 2: Theory", "Model 3: Stepwise"),
          omit.stat     = "f",
          digits        = 3)


Regression Model Comparison
============================================================================
                                      Dependent variable:                   
                    --------------------------------------------------------
                                             TARGET                         
                      Model 1: Full     Model 2: Theory   Model 3: Stepwise 
                           (1)                (2)                (3)        
----------------------------------------------------------------------------
TEAM_BATTING_H           0.045***                                           
                         (0.004)                                            
                                                                            
TEAM_BATTING_1B                             0.049***           0.059***     
                                            (0.003)            (0.005)      
                                                                            
TEAM_BATTING_2B          -0.018**           0.030***           0.051***     
                         (0.009)            (0.007)            (0.009)      
                                                                            
TEAM_BATTING_3B          0.082***           0.127***           0.138***     
                         (0.017)            (0.015)            (0.016)      
                                                                            
TEAM_BATTING_HR          0.078**            0.073**            0.131***     
                         (0.032)            (0.030)            (0.010)      
                                                                            
TEAM_BATTING_BB          0.048***           0.019***           0.050***     
                         (0.010)            (0.007)            (0.010)      
                                                                            
TEAM_BATTING_SO         -0.025***                             -0.066***     
                         (0.005)                               (0.009)      
                                                                            
TEAM_BASERUN_SB          0.031***           0.023***           0.031***     
                         (0.004)            (0.004)            (0.004)      
                                                                            
TEAM_BASERUN_CS           -0.012                                            
                         (0.016)                                            
                                                                            
TEAM_FIELDING_E         -0.025***          -0.019***          -0.022***     
                         (0.003)            (0.003)            (0.003)      
                                                                            
TEAM_FIELDING_DP        -0.114***          -0.118***          -0.115***     
                         (0.013)            (0.013)            (0.013)      
                                                                            
TEAM_PITCHING_H          0.002**                               0.004***     
                         (0.001)                               (0.001)      
                                                                            
TEAM_PITCHING_HR          -0.018             0.027                          
                         (0.028)            (0.026)                         
                                                                            
TEAM_PITCHING_BB        -0.030***            -0.005           -0.033***     
                         (0.008)            (0.006)            (0.008)      
                                                                            
TEAM_PITCHING_SO         0.020***                              0.013***     
                         (0.005)                               (0.004)      
                                                                            
BATTING_RATIO                                                -151.703***    
                                                               (27.780)     
                                                                            
Constant                19.558***          15.785***          129.941***    
                         (5.394)            (3.565)            (20.931)     
                                                                            
----------------------------------------------------------------------------
Observations              2,276              2,276              2,276       
R2                        0.310              0.303              0.319       
Adjusted R2               0.306              0.299              0.315       
Residual Std. Error 13.123 (df = 2261) 13.185 (df = 2265) 13.038 (df = 2262)
============================================================================
Note:                                            *p<0.1; **p<0.05; ***p<0.01

The stargazer table gives a side-by-side comparison of all three models, each fit on 2,276 observations. Model 3 performs best on both fit measures, with the highest Adjusted R² (0.315) and the lowest residual standard error (13.04), compared to 0.306 and 13.12 for Model 1 and 0.299 and 13.19 for Model 2. The differences in Adjusted R² are modest, but Model 3 is also the only model in which every retained variable is statistically significant: Model 1 contains the insignificant predictors TEAM_BASERUN_CS and TEAM_PITCHING_HR, and Model 2 contains TEAM_PITCHING_HR and TEAM_PITCHING_BB. The comparison also highlights how separating TEAM_BATTING_H into its components improves coefficient interpretability. In Model 1, TEAM_BATTING_2B carries a counterintuitive negative sign, while in Models 2 and 3 it turns positive once singles are isolated. Overall, the table reinforces that Model 3 is the strongest candidate, combining the best fit with the most statistically clean set of predictors.

4 Select Model & Predictions

4.1 Performance metrics for all three models

# Collect in-sample fit metrics for a fitted lm object
get_metrics <- function(model, label) {
  s       <- summary(model)
  preds   <- fitted(model)              # fitted values on the training data
  actuals <- model$model$TARGET_WINS    # response stored in the model frame
  rmse    <- sqrt(mean((actuals - preds)^2))
  data.frame(
    Model   = label,
    Adj_R2  = round(s$adj.r.squared, 4),
    RMSE    = round(rmse, 4),
    # unname() drops the "value" name so it does not become a row name
    F_stat  = round(unname(s$fstatistic[1]), 2),
    AIC     = round(AIC(model), 2)
  )
}

metrics <- bind_rows(
  get_metrics(model1, "Model 1: Full"),
  get_metrics(model2, "Model 2: Theory"),
  get_metrics(model3, "Model 3: Stepwise")
)

print(metrics)

              Model Adj_R2    RMSE F_stat      AIC
1     Model 1: Full 0.3059 13.0800  72.62 18194.58
2   Model 2: Theory 0.2994 13.1526  98.24 18211.78
3 Model 3: Stepwise 0.3150 12.9975  81.46 18163.78

The performance metrics across all three models are summarized in the table above. All of them are computed on the training data, so they describe in-sample fit. Model 3 outperforms the others on Adjusted R², RMSE, and AIC: it achieves the highest Adjusted R² of 0.315, the lowest RMSE of 13.00, and the lowest AIC of 18,163.78, indicating the best balance of fit and parsimony. Model 1 ranks second with an Adjusted R² of 0.306 and RMSE of 13.08, while Model 2 trails slightly with an Adjusted R² of 0.299 and RMSE of 13.15. The AIC differences are meaningful: Model 3 is notably lower than both Model 1 (18,194.58) and Model 2 (18,211.78). Model 1 uses one more predictor than Model 3 (14 versus 13) yet fits slightly worse, while Model 2 uses the fewest predictors (10), so its higher AIC reflects a weaker fit rather than a complexity penalty. While none of the models explain more than roughly 32% of the variation in wins, this is not unexpected given the inherent unpredictability of baseball outcomes and the limited scope of the available variables. Based on these metrics, Model 3 is selected as the final model for generating predictions on the evaluation dataset.
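To make the AIC and RMSE columns concrete, the following sketch recomputes both by hand for a linear model. It uses the built-in mtcars data and an arbitrary formula as stand-ins (assumptions for illustration, not the models in this report). For a Gaussian linear model, AIC is 2k minus twice the maximized log-likelihood, where k counts the regression coefficients plus the error variance.

```r
# Stand-in model on a built-in dataset (not the Moneyball models)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)

# In-sample RMSE: square root of the mean squared residual
rmse <- sqrt(rss / n)

# Number of estimated parameters: coefficients plus the error variance
k <- length(coef(fit)) + 1

# Maximized Gaussian log-likelihood of a linear model
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic_manual <- 2 * k - 2 * loglik

# Should agree with R's built-in AIC() up to floating-point error
print(all.equal(aic_manual, AIC(fit)))
```

Because the penalty term 2k grows with each added predictor, a larger model only achieves a lower AIC if its improvement in fit outweighs the extra parameters.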

4.2 VIF, multicollinearity on best model

library(car)

vif(model3)

 TEAM_BATTING_1B  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
        4.930926         2.177680         2.709741         5.003191 
 TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_FIELDING_E 
       18.647718        68.963399         1.854841         5.595227 
TEAM_FIELDING_DP  TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO 
        1.350257         7.426530        11.949044        14.754747 
   BATTING_RATIO 
       79.738979
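In the output above, BATTING_RATIO (79.7), TEAM_BATTING_SO (69.0), TEAM_BATTING_BB (18.6), TEAM_PITCHING_SO (14.8), and TEAM_PITCHING_BB (11.9) all exceed 10, a commonly used rule-of-thumb threshold for problematic multicollinearity. As a minimal sketch of what these numbers measure, the helper below (manual_vif, a hypothetical name) computes the VIF of each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. It is shown on the built-in mtcars data rather than on model3, and it assumes a model with an intercept and only numeric predictors.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared
# from regressing predictor j on the remaining predictors.
manual_vif <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in model on a built-in dataset (not the Moneyball model)
fit  <- lm(mpg ~ wt + hp + disp, data = mtcars)
vifs <- manual_vif(fit)
print(round(vifs, 2))

# Predictors above the rule-of-thumb threshold of 10 (may be empty)
print(names(vifs)[vifs > 10])
```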

4.3 Residual diagnostic for best model

par(mfrow = c(2, 2))
plot(model3, 
     main = "Model 3 Diagnostics")

par(mfrow = c(1, 1))

The Residuals vs Fitted plot shows residuals scattered randomly around zero across the range of fitted values, with the red line remaining approximately flat. This suggests the linearity assumption is reasonably satisfied, though a slight fanning of residuals at lower fitted values hints at some heteroscedasticity. Observations 1342, 2012, and 1828 are flagged as potential outliers with large residuals.

The Normal Q-Q plot shows residuals tracking closely along the diagonal reference line through the middle range, indicating approximate normality. However, both tails deviate from the line, with observations 1828 at the lower end and 2012 and 1342 at the upper end pulling away, suggesting the residuals have slightly heavier tails than a perfect normal distribution.

The Scale-Location plot shows a mildly downward sloping red line, indicating a slight decrease in residual variance at higher fitted values. While not severely problematic, this suggests mild heteroscedasticity that should be noted as a limitation of the model.

The Residuals vs Leverage plot shows that the vast majority of observations cluster at low leverage values near zero, which is reassuring. Observation 1342 stands out with both high leverage and a large residual. A point at the far right, at a leverage of about 0.18, approaches but does not cross the Cook’s distance contour of 0.5, suggesting that no single observation is unduly distorting the model estimates.
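The quantities behind these four plots can also be inspected numerically, which makes it easier to list flagged observations than reading them off a figure. The sketch below uses base R influence functions on a stand-in model fit to the built-in mtcars data (an assumption for illustration). The 0.5 cutoff mirrors the Cook's distance contour mentioned above.

```r
# Stand-in model on a built-in dataset (not the Moneyball model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

infl <- data.frame(
  std_resid = rstandard(fit),        # standardized residuals
  leverage  = hatvalues(fit),        # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)    # Cook's distance per observation
)

# Three most influential observations by Cook's distance
print(head(infl[order(-infl$cooks_d), ], 3))

# Observations beyond the 0.5 Cook's distance contour (may be empty)
flagged <- infl[infl$cooks_d > 0.5, ]
print(nrow(flagged))
```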

4.4 Predictions on evaluation data

eval$PREDICTED_WINS <- predict(model3, newdata = eval)

eval$PREDICTED_WINS <- pmax(0, pmin(162, round(eval$PREDICTED_WINS)))

head(eval %>% dplyr::select(PREDICTED_WINS), 10)

   PREDICTED_WINS
1              64
2              67
3              75
4              86
5              74
6              68
7              79
8              76
9              73
10             74

ggplot(eval, aes(x = PREDICTED_WINS)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Predicted Wins",
       x = "Predicted Wins", y = "Count") +
  theme_minimal()

Model 3 was applied to the evaluation dataset to generate predicted win totals for each team. The distribution of predicted wins is approximately normal and centered around 80-85 wins, which is consistent with the training data distribution and reflects a realistic range for a 162-game baseball season. The bulk of predictions fall between 65 and 95 wins, with a small number of outlier predictions below 30 wins likely corresponding to historically unusual team seasons. All predictions were rounded to whole wins and capped between 0 and 162 to ensure they fall within the physically possible range of a baseball season. A sample of the first ten predictions shows values ranging from 64 to 86 wins, with most clustering in the high 60s to mid 70s. The predicted values have been exported to moneyball_predictions.csv for submission.
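The export step itself is not shown above. As a hedged sketch of how the capping and export could look, the snippet below wraps the rounding and capping in a helper (clamp_wins, a hypothetical name), applies it to made-up raw predictions, and writes a one-column CSV. The file is written to a temporary directory here so the example has no side effects; the column layout of the actual submission file is an assumption.

```r
# Hypothetical helper: round to whole wins, then cap to the 0-162 range
clamp_wins <- function(x, lower = 0, upper = 162) {
  pmax(lower, pmin(upper, round(x)))
}

# Made-up raw predictions, including values outside the valid range
raw    <- c(-3.2, 64.4, 85.6, 170.9)
capped <- clamp_wins(raw)
print(capped)

# Write to a temporary location; the real report uses moneyball_predictions.csv
out_path <- file.path(tempdir(), "moneyball_predictions.csv")
write.csv(data.frame(PREDICTED_WINS = capped), out_path, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out_path)
print(check)
```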