Moneyball Regression: Predicting Baseball Wins

Author

Darwhin Gomez

Moneyball Predictions

Overview

In this homework assignment, you will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive and contains the team's performance for that season, with all statistics adjusted to a 162-game schedule.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Data

Code
train <- read_csv("/data621/moneyball-training-data.csv") %>% clean_names()
Rows: 2276 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): INDEX, TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
test <- read_csv("/data621/moneyball-evaluation-data.csv") %>% clean_names()
Rows: 259 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): INDEX, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

EDA

The data has been imported; let's check the dimensions and missingness.

Code
cat("Dimensions\n")
Dimensions
Code
cat("===========\n")
===========
Code
cat(paste("Training:", paste(dim(train), collapse = " x "), "\n"))
Training: 2276 x 17 
Code
cat(paste("Test:", paste(dim(test), collapse = " x "), "\n"))
Test: 259 x 16 
Code
train_test_ratio <- round(dim(test)[1] / dim(train)[1], 4)
cat(paste("Ratio of test to train data:", train_test_ratio, "\n"))
Ratio of test to train data: 0.1138 
Code
my_skim <- skim_with(
  numeric = sfl(median = median, p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL),
  append = TRUE
)

# Now run skim
my_skim(train) |>
  arrange(complete_rate)
Data summary
Name train
Number of rows 2276
Number of columns 17
_______________________
Column type frequency:
numeric 17
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd hist median
team_batting_hbp 2085 0.08 59.36 12.97 ▂▇▇▅▁ NA
team_baserun_cs 772 0.66 52.80 22.96 ▃▇▁▁▁ NA
team_fielding_dp 286 0.87 146.39 26.23 ▁▂▇▆▁ NA
team_baserun_sb 131 0.94 124.76 87.79 ▇▃▁▁▁ NA
team_batting_so 102 0.96 735.61 248.53 ▁▆▇▇▁ NA
team_pitching_so 102 0.96 817.73 553.09 ▇▁▁▁▁ NA
index 0 1.00 1268.46 736.35 ▇▇▇▇▇ 1270.5
target_wins 0 1.00 80.79 15.75 ▁▁▇▅▁ 82.0
team_batting_h 0 1.00 1469.27 144.59 ▁▇▂▁▁ 1454.0
team_batting_2b 0 1.00 241.25 46.80 ▁▆▇▂▁ 238.0
team_batting_3b 0 1.00 55.25 27.94 ▇▇▂▁▁ 47.0
team_batting_hr 0 1.00 99.61 60.55 ▇▆▇▅▁ 102.0
team_batting_bb 0 1.00 501.56 122.67 ▁▁▇▇▁ 512.0
team_pitching_h 0 1.00 1779.21 1406.84 ▇▁▁▁▁ 1518.0
team_pitching_hr 0 1.00 105.70 61.30 ▇▇▆▁▁ 107.0
team_pitching_bb 0 1.00 553.01 166.36 ▇▁▁▁▁ 536.5
team_fielding_e 0 1.00 246.48 227.77 ▇▁▁▁▁ 159.0

There are some missing values that should be addressed. I will proceed by removing team_batting_hbp, since only about 8% of records have a value for it, and then use median imputation for the remaining missing values. As a reminder, the same treatment must be applied to the test set so the model is fed consistently prepared data.

Code
train %>%
  dplyr::select(where(is.numeric)) %>%
  summarise(across(
    everything(),
    list(
      zeros = ~sum(. == 0, na.rm = TRUE)
      
    )
  ))
# A tibble: 1 × 17
  index_zeros target_wins_zeros team_batting_h_zeros team_batting_2b_zeros
        <int>             <int>                <int>                 <int>
1           0                 1                    0                     0
# ℹ 13 more variables: team_batting_3b_zeros <int>,
#   team_batting_hr_zeros <int>, team_batting_bb_zeros <int>,
#   team_batting_so_zeros <int>, team_baserun_sb_zeros <int>,
#   team_baserun_cs_zeros <int>, team_batting_hbp_zeros <int>,
#   team_pitching_h_zeros <int>, team_pitching_hr_zeros <int>,
#   team_pitching_bb_zeros <int>, team_pitching_so_zeros <int>,
#   team_fielding_e_zeros <int>, team_fielding_dp_zeros <int>

We also notice some columns containing 0s. Since there is little chance a team goes an entire season without allowing a home run or giving up a hit, we will assume these are data entry/capture errors and include them in our imputation step.

Missing Data

Code
train <- train %>% dplyr::select(-team_batting_hbp,-index)
test  <- test  %>% dplyr::select(-team_batting_hbp)
print("train cols")
[1] "train cols"
Code
colnames(train)
 [1] "target_wins"      "team_batting_h"   "team_batting_2b"  "team_batting_3b" 
 [5] "team_batting_hr"  "team_batting_bb"  "team_batting_so"  "team_baserun_sb" 
 [9] "team_baserun_cs"  "team_pitching_h"  "team_pitching_hr" "team_pitching_bb"
[13] "team_pitching_so" "team_fielding_e"  "team_fielding_dp"
Code
print("test cols")
[1] "test cols"
Code
colnames(test)
 [1] "index"            "team_batting_h"   "team_batting_2b"  "team_batting_3b" 
 [5] "team_batting_hr"  "team_batting_bb"  "team_batting_so"  "team_baserun_sb" 
 [9] "team_baserun_cs"  "team_pitching_h"  "team_pitching_hr" "team_pitching_bb"
[13] "team_pitching_so" "team_fielding_e"  "team_fielding_dp"

Aside from target_wins (the response variable, present only in the training data) and index (kept in the test set for output), the columns match. Let's now impute: zero-valued entries in the predictors, along with the single record with zero target_wins, will be treated as missing and replaced together with the remaining NAs.

Code
impute_nonpositive_with_median <- function(df) {
  df %>%
    mutate(across(
      where(is.numeric),
      ~ {
        med <- median(.[. > 0], na.rm = TRUE)
        # fallback median if no positives
        if (is.na(med)) med <- median(., na.rm = TRUE)
        ifelse(. <= 0 | is.na(.), med, .)
      }
    ))
}

train <- impute_nonpositive_with_median(train)
test  <- impute_nonpositive_with_median(test)
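
As a quick sanity check (a small addition, not part of the original pipeline), we can confirm that no missing or non-positive values remain after imputation:

Code
# Count remaining NA or non-positive values per column; all counts should be zero
colSums(sapply(train, function(x) is.na(x) | x <= 0))
colSums(sapply(test,  function(x) is.na(x) | x <= 0))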
Code
vars <- train |> dplyr::select(where(is.numeric)) %>% dplyr::select(-target_wins)

for (col in names(vars)) {
  print(
    ggplot(train, aes(x = .data[[col]], y = target_wins)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
       geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
      labs(title = paste("Linearity check:", col))
  )
}
[Scatter plots of target_wins against each predictor, with linear (red) and loess (blue) smoothers, are rendered here.]

Code
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:psych':

    logit
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
Code
model <- lm(target_wins ~ ., data = train)
par(mfrow = c(2, 3))
crPlots(model)

Pre Process

To help satisfy the linearity assumption of linear regression, we will apply a Box-Cox transformation (with centering and scaling) to the predictors in the training and test sets before modeling.
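
For reference, the Box-Cox family rescales a strictly positive variable \(x\) using a power parameter \(\lambda\) estimated from the data:

\[
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(x), & \lambda = 0
\end{cases}
\]

caret's preProcess() estimates \(\lambda\) separately for each predictor; only strictly positive predictors are eligible, and the rest are left untransformed.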

Code
# Apply Box-Cox transformation, centering, and scaling
train_df <- as.data.frame(train)
test_df  <- as.data.frame(test)

train_x <- train_df[, setdiff(names(train_df), "target_wins")]
test_x  <- test_df  # test contains the same predictors plus the index column

# --- Preprocess predictors only ---
preproc <- preProcess(train_x, method = c("BoxCox", "center", "scale"))

# --- Apply transformations ---
train_trans <- predict(preproc, train_x)
test_trans  <- predict(preproc, test_x)
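
As a quick check, the estimated \(\lambda\) values can be read off the preProcess object. This is a small sketch, assuming caret stored the per-variable Box-Cox fits in the bc element (it only holds the predictors that were eligible for the transform):

Code
# NOTE: assumes preProcess() stored the BoxCoxTrans fits in $bc
# Which predictors received a Box-Cox transform, and with what lambda
sapply(preproc$bc, function(fit) fit$lambda)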

Let's take a look at the component-plus-residual plots to determine whether the linearity assumption is met.

Code
train_trans$target_wins <- train$target_wins
model <- lm(target_wins ~ ., data = train_trans)
par(mfrow = c(2, 3))
crPlots(model)

Models

Now that we have transformed our data, we can train some models and then make predictions.
Let's begin with the kitchen-sink model, using every available predictor:
target_wins ~ . (all predictors)

Code
summary(model)

Call:
lm(formula = target_wins ~ ., data = train_trans)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.805  -7.856  -0.048   8.138  65.161 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       80.8269     0.2754 293.495  < 2e-16 ***
team_batting_h     6.5463     0.6162  10.624  < 2e-16 ***
team_batting_2b   -0.6753     0.4394  -1.537   0.1245    
team_batting_3b    3.6264     0.4694   7.726 1.66e-14 ***
team_batting_hr   -0.9450     1.8877  -0.501   0.6167    
team_batting_bb    5.6780     0.6104   9.303  < 2e-16 ***
team_batting_so   -5.7608     1.1113  -5.184 2.37e-07 ***
team_baserun_sb    2.7764     0.3882   7.151 1.16e-12 ***
team_baserun_cs   -0.4880     0.3211  -1.520   0.1287    
team_pitching_h   -1.2413     0.7004  -1.772   0.0765 .  
team_pitching_hr   2.6807     1.7475   1.534   0.1252    
team_pitching_bb  -3.2703     0.5819  -5.620 2.15e-08 ***
team_pitching_so   4.7434     0.9220   5.145 2.91e-07 ***
team_fielding_e   -5.8311     0.5969  -9.770  < 2e-16 ***
team_fielding_dp  -2.6888     0.3240  -8.299  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.14 on 2261 degrees of freedom
Multiple R-squared:  0.3005,    Adjusted R-squared:  0.2962 
F-statistic: 69.39 on 14 and 2261 DF,  p-value: < 2.2e-16

The first model identifies several important predictors: team_batting_h (hits), team_batting_3b (triples), team_batting_bb (walks), team_batting_so (strikeouts, negative impact on wins), team_baserun_sb (stolen bases), team_pitching_bb (walks given up, negative impact), team_pitching_so (pitcher strikeouts), team_fielding_e (defensive errors, negative impact), and team_fielding_dp (double plays turned, i.e., two outs on one play, negative impact). The home run terms, team_batting_hr and team_pitching_hr, are not statistically significant in this fit.
The overall F-test p-value is below 2.2e-16, indicating a definite relationship between the predictors and the target variable.
The \(R^2 = 0.30\), indicating that the model is still missing some of the information needed to accurately predict wins.
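
For convenience, the predictors kept for the next model can be pulled straight from the coefficient table rather than copied by hand. This is a small sketch; the 0.05 cutoff is the usual convention rather than anything mandated by the assignment.

Code
# 0.05 cutoff is a conventional choice
# Names of predictors with p < 0.05 in the kitchen-sink fit
coefs <- summary(model)$coefficients
sig_vars <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")
sig_vars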


Code
plot(model, c(1,2,4))

Code
model_sig <- lm(target_wins ~
                  team_batting_h +
                  team_batting_3b +
                  team_batting_bb +
                  team_batting_so +
                  team_baserun_sb +
                  team_pitching_bb +
                  team_pitching_so +
                  team_fielding_e +
                  team_fielding_dp,
                data = train_trans)

summary(model_sig)

Call:
lm(formula = target_wins ~ team_batting_h + team_batting_3b + 
    team_batting_bb + team_batting_so + team_baserun_sb + team_pitching_bb + 
    team_pitching_so + team_fielding_e + team_fielding_dp, data = train_trans)

Residuals:
    Min      1Q  Median      3Q     Max 
-58.763  -8.023   0.131   8.348  60.107 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       80.8269     0.2763 292.585  < 2e-16 ***
team_batting_h     6.0443     0.3244  18.631  < 2e-16 ***
team_batting_3b    3.3363     0.4569   7.302 3.92e-13 ***
team_batting_bb    5.6872     0.5908   9.626  < 2e-16 ***
team_batting_so   -4.3359     0.8906  -4.869 1.20e-06 ***
team_baserun_sb    2.2064     0.3341   6.604 4.97e-11 ***
team_pitching_bb  -2.8741     0.5517  -5.210 2.06e-07 ***
team_pitching_so   4.0888     0.7265   5.628 2.05e-08 ***
team_fielding_e   -6.2220     0.5307 -11.724  < 2e-16 ***
team_fielding_dp  -2.5819     0.3181  -8.116 7.80e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.18 on 2266 degrees of freedom
Multiple R-squared:  0.2946,    Adjusted R-squared:  0.2918 
F-statistic: 105.2 on 9 and 2266 DF,  p-value: < 2.2e-16
Code
plot(model_sig,c(1,2,4))

In our second model we drop the non-significant predictors and see a small reduction to \(R^2 = 0.2946\). While the model now contains only significant predictors, \(R^2\) does not improve. Let's see whether removing some influential outliers helps the model.

Code
# Compute diagnostic measures
influence <- influence.measures(model_sig)

# View the summary of flagged observations
#summary(influence)

# Extract Cook’s Distance
cooksD <- cooks.distance(model_sig)

# Inspect top outliers
sort(cooksD, decreasing = TRUE)[1:20]
      1342       1211       1828        299       1825       1210       1584 
0.24249383 0.07861006 0.07728364 0.05010914 0.04624258 0.04443302 0.03027927 
       862          1       2137        297        982       1810         53 
0.02938295 0.02817135 0.02506170 0.02232897 0.01987927 0.01894834 0.01801108 
       418       2012       2233        417        415       1345 
0.01688765 0.01662414 0.01640930 0.01599025 0.01567112 0.01378480 
Code
influential_ids <- c(1342, 1211, 1828, 299, 1210, 862, 1825, 1584,82,1,2137,297,982,1810,53,418,2012,2233,417,409,1385,2232,1820,427,1341,1345,2015,295,2020,859,1191,296,1822,1083,1340)
train_clean <- train_trans[-influential_ids, ]
model_clean <- lm(target_wins ~
                    team_batting_h +
                    team_batting_3b +
                    team_batting_bb +
                    team_batting_so +
                    team_baserun_sb +
                    team_pitching_bb +
                    team_pitching_so +
                    team_fielding_e +
                    team_fielding_dp,
                  data = train_clean)

summary(model_clean)

Call:
lm(formula = target_wins ~ team_batting_h + team_batting_3b + 
    team_batting_bb + team_batting_so + team_baserun_sb + team_pitching_bb + 
    team_pitching_so + team_fielding_e + team_fielding_dp, data = train_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.171  -7.903  -0.041   8.191  44.380 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       80.8732     0.2596 311.488  < 2e-16 ***
team_batting_h     5.9953     0.3184  18.827  < 2e-16 ***
team_batting_3b    2.7662     0.4408   6.275 4.19e-10 ***
team_batting_bb    8.4308     0.6596  12.783  < 2e-16 ***
team_batting_so   -9.1663     0.9362  -9.791  < 2e-16 ***
team_baserun_sb    2.4652     0.3143   7.843 6.75e-15 ***
team_pitching_bb  -6.2403     0.6624  -9.421  < 2e-16 ***
team_pitching_so   7.8474     0.7880   9.958  < 2e-16 ***
team_fielding_e   -6.9664     0.5036 -13.833  < 2e-16 ***
team_fielding_dp  -2.6437     0.2985  -8.857  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.27 on 2231 degrees of freedom
Multiple R-squared:  0.3161,    Adjusted R-squared:  0.3133 
F-statistic: 114.6 on 9 and 2231 DF,  p-value: < 2.2e-16
Code
plot(model_clean,c(1,2,4))

Model Refinement and Outlier Removal

Influential points were identified using Cook's Distance and removed to reduce leverage bias; the criterion was a Cook's distance above 0.015.
A total of 35 observations, roughly \(35/2276 \approx 1.5\%\) of the dataset (e.g., rows 1342, 1211, 1828, and 299), were excluded.
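
For reference, the same cut could be made programmatically rather than by hard-coding row numbers. This is a sketch assuming the 0.015 threshold; the resulting set may differ slightly from the hand-curated list above, which also drew on influence.measures().

Code
# Hypothetical alternative: drop every row whose Cook's distance exceeds 0.015
flagged <- which(cooks.distance(model_sig) > 0.015)
train_clean_alt <- train_trans[-flagged, ]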

After refitting the model:

  • \(R^2\) improved from 0.2946 to 0.3161
  • Diagnostic plots confirmed:
    • Improved residual homoscedasticity
    • Fewer high-leverage outliers
    • Similar coefficient directions and magnitudes

These results suggest that the original model was modestly affected by a small set of influential data points.
After their removal, the multiple linear regression model explains approximately 31.6% of the variance in team wins, and its diagnostic plots are broadly consistent with the OLS assumptions.
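
For a quick side-by-side view, the adjusted \(R^2\) of the two fits can be pulled directly from their summaries (a small sketch using the model objects defined above):

Code
# Note: the two models are fit on different numbers of rows, so this is only indicative
c(before = summary(model_sig)$adj.r.squared,
  after  = summary(model_clean)$adj.r.squared)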

Predicting

Code
predictions <- predict(model_clean, newdata = test_trans)
head(predictions,20)
       1        2        3        4        5        6        7        8 
59.58434 62.62154 72.99858 85.41745 91.52105 74.69780 77.16716 74.75044 
       9       10       11       12       13       14       15       16 
71.60332 72.85401 69.98114 77.94288 76.59361 75.35512 82.38866 71.92271 
      17       18       19       20 
67.69300 77.97088 71.32262 83.98262 

Our model is producing predictions; let's apply the ceiling function so that each predicted win total is a whole number.

Code
predictions_ceiling <- ceiling(predictions)
head(predictions_ceiling,20)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
60 63 73 86 92 75 78 75 72 73 70 78 77 76 83 72 68 78 72 84 
Code
# Create histogram and save histogram info
m <- mean(predictions_ceiling)
s <- sd(predictions_ceiling)

# Re-plot with labels
h <- hist(predictions_ceiling,
          breaks = 20,
          col = "gray",
          border = "white",
          freq = FALSE,
          main = "Distribution of Predicted Wins",
          xlab = "Predicted Wins")

# Add the normal curve
xfit <- seq(min(predictions_ceiling), max(predictions_ceiling), length = 100)
yfit <- dnorm(xfit, mean = m, sd = s)
lines(xfit, yfit, col = "red", lwd = 2)

# Add mean and SD text annotation
text(x = m, y = max(yfit)/2,
     labels = paste0("Mean = ", round(m, 1), 
                     "\nSD = ", round(s, 1)),
     col = "blue", cex = 0.9, pos = 4)

We generated predictions using the cleaned linear model after removing 35 influential observations identified by Cook’s Distance values greater than 0.015.
A ceiling function was applied to ensure all predicted team win totals were whole numbers.
We then visualized the distribution of predicted wins using a histogram with an overlaid normal distribution curve.

The resulting predictions followed an approximately normal (Gaussian) distribution, centered around 80.8 wins—a realistic outcome given that Major League Baseball teams play 162 games per season.
An average near 81 wins represents a balanced league, confirming that our model’s predictions align well with real-world expectations.

As a baseball fan, I recognize that additional variables could improve the model’s ability to explain team wins.

Metrics such as runs scored, hits with runners in scoring position, and runs allowed directly capture a team’s offensive efficiency and pitching effectiveness.

Including these types of performance indicators could enhance the model’s explanatory power and yield a higher \(R^2\), providing a more complete representation of what drives success over a 162-game season.

CSV Predictions

Code
results <- data.frame(test_trans, Predicted_Wins = predictions_ceiling)

# Write to CSV file
write.csv(results, "predicted_team_wins.csv", row.names = FALSE)