Overview
In this homework assignment, you will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive. Each record contains the team's performance for the given year, with all statistics adjusted to match a 162-game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
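The training file is read the same way as the evaluation file shown further below; a minimal sketch of that import (the training file name is an assumption that mirrors the evaluation file) is:
Code
# Packages used throughout: readr/dplyr for import and wrangling, janitor for clean column names
library(tidyverse)
library(janitor)

# Path is an assumption, mirroring the evaluation file imported below
train <- read_csv("/data621/moneyball-training-data.csv") %>% clean_names()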
Rows: 2276 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): INDEX, TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
test <- read_csv("/data621/moneyball-evaluation-data.csv") %>% clean_names()
Rows: 259 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): INDEX, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATT...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
EDA
The data has been imported; let's check dimensions and missingness.
Code
cat("Dimensions\n")
Dimensions
Code
cat("===========\n")
===========
Code
cat(paste("Training:", paste(dim(train), collapse =" x "), "\n"))
Training: 2276 x 17
Code
cat(paste("Test:", paste(dim(test), collapse =" x "), "\n"))
Test: 259 x 16
Code
train_test_ratio <- round(dim(test)[1] / dim(train)[1], 4)
cat(paste("Ratio of test to train data:", train_test_ratio, "\n"))
Ratio of test to train data: 0.1138
Code
my_skim <- skim_with(
  numeric = sfl(median, p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL),
  append = TRUE
)

# Now run skim
my_skim(train) |> arrange(complete_rate)
Data summary

| Name | train |
|---|---|
| Number of rows | 2276 |
| Number of columns | 17 |
| Column type frequency (numeric) | 17 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | hist | median |
|---|---|---|---|---|---|---|
| team_batting_hbp | 2085 | 0.08 | 59.36 | 12.97 | ▂▇▇▅▁ | NA |
| team_baserun_cs | 772 | 0.66 | 52.80 | 22.96 | ▃▇▁▁▁ | NA |
| team_fielding_dp | 286 | 0.87 | 146.39 | 26.23 | ▁▂▇▆▁ | NA |
| team_baserun_sb | 131 | 0.94 | 124.76 | 87.79 | ▇▃▁▁▁ | NA |
| team_batting_so | 102 | 0.96 | 735.61 | 248.53 | ▁▆▇▇▁ | NA |
| team_pitching_so | 102 | 0.96 | 817.73 | 553.09 | ▇▁▁▁▁ | NA |
| index | 0 | 1.00 | 1268.46 | 736.35 | ▇▇▇▇▇ | 1270.5 |
| target_wins | 0 | 1.00 | 80.79 | 15.75 | ▁▁▇▅▁ | 82.0 |
| team_batting_h | 0 | 1.00 | 1469.27 | 144.59 | ▁▇▂▁▁ | 1454.0 |
| team_batting_2b | 0 | 1.00 | 241.25 | 46.80 | ▁▆▇▂▁ | 238.0 |
| team_batting_3b | 0 | 1.00 | 55.25 | 27.94 | ▇▇▂▁▁ | 47.0 |
| team_batting_hr | 0 | 1.00 | 99.61 | 60.55 | ▇▆▇▅▁ | 102.0 |
| team_batting_bb | 0 | 1.00 | 501.56 | 122.67 | ▁▁▇▇▁ | 512.0 |
| team_pitching_h | 0 | 1.00 | 1779.21 | 1406.84 | ▇▁▁▁▁ | 1518.0 |
| team_pitching_hr | 0 | 1.00 | 105.70 | 61.30 | ▇▇▆▁▁ | 107.0 |
| team_pitching_bb | 0 | 1.00 | 553.01 | 166.36 | ▇▁▁▁▁ | 536.5 |
| team_fielding_e | 0 | 1.00 | 246.48 | 227.77 | ▇▁▁▁▁ | 159.0 |
There are some missing values that should be addressed. I will remove team_batting_hbp, since very little of its data is available, and then use median imputation for the remaining missing values in the data set. As a reminder, we must apply the same steps to the test set for accurate modeling.
We also notice some columns containing 0s. Since there is little chance that a team can go a whole season without allowing a home run or giving up a hit, we will assume these are data entry/capture errors and include them in our imputation.
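As a quick sanity check before imputing, the zero entries can be counted per numeric column (a minimal sketch using the packages already loaded above):
Code
# Count zero entries in each numeric column and sort by the worst offenders
train %>%
  summarise(across(where(is.numeric), ~ sum(. == 0, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_zero") %>%
  arrange(desc(n_zero))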
Missing Data
Code
train <- train %>% dplyr::select(-team_batting_hbp, -index)
test  <- test %>% dplyr::select(-team_batting_hbp)
print("train cols")
We see our columns match except for target_wins, our target variable. Let's also impute any zero-valued predictors and remove any row with no target_wins before imputing.
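The row-removal step is not shown in the chunk below; a minimal sketch (assuming a zero or missing target_wins marks such a row) might be:
Code
# Drop training rows without a usable target (assumption: 0 or NA means no recorded wins)
train <- train %>% filter(!is.na(target_wins) & target_wins > 0)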
Code
impute_nonpositive_with_median <- function(df) {
  df %>%
    mutate(across(
      where(is.numeric),
      ~ {
        med <- median(.[. > 0], na.rm = TRUE)
        # fallback median if no positives
        if (is.na(med)) med <- median(., na.rm = TRUE)
        ifelse(. <= 0 | is.na(.), med, .)
      }
    ))
}

train <- impute_nonpositive_with_median(train)
test  <- impute_nonpositive_with_median(test)
Code
vars <- train %>% dplyr::select(where(is.numeric)) %>% dplyr::select(-target_wins)

for (col in names(vars)) {
  print(
    ggplot(train, aes(x = .data[[col]], y = target_wins)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
      geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
      labs(title = paste("Linearity check:", col))
  )
}
`geom_smooth()` using formula = 'y ~ x'
Code
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:psych':
logit
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Code
model <- lm(target_wins ~ ., data = train)

par(mfrow = c(2, 3))
crPlots(model)
Pre-Processing
To satisfy the linearity requirement of linear regression, we will apply a Box-Cox transformation to the training and testing sets before modeling.
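For reference, the Box-Cox family estimates a power \(\lambda\) for each strictly positive predictor \(y\) and transforms it as

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(y), & \lambda = 0
\end{cases}
\]

with centering and scaling applied afterward by the same preprocessing step.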
Code
# Apply Box-Cox transformation, centering, and scaling
train_df <- as.data.frame(train)
test_df  <- as.data.frame(test)

train_x <- train_df[, setdiff(names(train_df), "target_wins")]
test_x  <- test_df  # test should have the same predictors only

# --- Preprocess predictors only ---
preproc <- preProcess(train_x, method = c("BoxCox", "center", "scale"))

# --- Apply transformations ---
train_trans <- predict(preproc, train_x)
test_trans  <- predict(preproc, test_x)
Let's take a look at the component-plus-residual plots to determine whether the linearity assumption is met.
Now that we have transformed our data, we can choose and train some models and then make predictions.
Let's begin with the kitchen sink: target_wins ~ . (all the predictors).
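The kitchen-sink fit itself is not shown as code; a minimal sketch on the transformed predictors (the names train_model_df and model_full are assumptions; train_trans and train_df come from the preprocessing chunk above) is:
Code
# Reattach the untransformed response to the transformed predictors, then fit every predictor
train_model_df <- cbind(train_trans, target_wins = train_df$target_wins)

model_full <- lm(target_wins ~ ., data = train_model_df)  # the "kitchen sink" model
summary(model_full)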
The first model shows some very important predictors:
- team_batting_h (hits)
- team_batting_3b (triples)
- team_batting_bb (walks)
- team_batting_so (strikeouts) [negative impact on wins]
- team_batting_hr (home runs)
- team_pitching_hr (home runs given up) [negative impact]
- team_pitching_so (strikeouts)
- team_fielding_e (defensive errors) [negative impact on wins]
- team_pitching_bb (walks given up) [negative impact on wins]
- team_fielding_dp (double plays turned, i.e., two outs on one play)
The overall F-test has a p-value below 2.2e-16, indicating a clear relationship between the predictors and the target variable.
The \(R^2 = 0.30\), indicating that the model is still missing some of the information needed to accurately predict wins.
In our second model we remove the non-significant predictors and see a small reduction to \(R^2 = 0.2946\). While we now have only significant predictors, we don't see an improvement in \(R^2\). Let's see if removing some outliers helps the model.
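The reduced fit is not shown as code; a sketch that keeps only the predictors flagged above (the exact variable list is an assumption based on the discussion, and train_model_df comes from the earlier sketch) might be:
Code
# Refit using only the predictors that were significant in the full model
model_sig <- lm(
  target_wins ~ team_batting_h + team_batting_3b + team_batting_bb +
    team_batting_so + team_batting_hr + team_pitching_hr + team_pitching_so +
    team_pitching_bb + team_fielding_e + team_fielding_dp,
  data = train_model_df
)
summary(model_sig)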
Code
# Compute diagnostic measures
influence <- influence.measures(model_sig)

# View the summary of flagged observations
# summary(influence)

# Extract Cook's Distance
cooksD <- cooks.distance(model_sig)

# Inspect top outliers
sort(cooksD, decreasing = TRUE)[1:20]
Influential points were identified using Cook's distance and removed to reduce leverage bias. The criterion was a Cook's distance above 0.015.
A total of 35 observations, roughly \(35/2276 \approx 1.5\%\) of the data (e.g., rows 1342, 1211, 1828, 299), were excluded from the dataset.
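The removal and refit can be sketched as follows (the 0.015 cutoff follows the criterion above; the names influential_rows, train_clean, and model_clean are assumptions):
Code
# Flag observations whose Cook's distance exceeds the cutoff and refit without them
influential_rows <- which(cooksD > 0.015)
length(influential_rows)  # roughly 35 observations

train_clean <- train_model_df[-influential_rows, ]
model_clean <- update(model_sig, data = train_clean)
summary(model_clean)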
After refitting the model:

- \(R^2\) improved from 0.2946 to 0.3161
- Diagnostic plots confirmed:
  - Improved residual homoscedasticity
  - Fewer high-leverage outliers
  - Similar coefficient directions and magnitudes
These results suggest that the original model was modestly affected by a small set of influential data points.
After their removal, the refitted linear regression model explains approximately 31.6% of the variance in team wins and better satisfies the OLS assumptions.
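The histogram chunk below assumes a vector predictions_ceiling of whole-number predicted wins for the evaluation set; a minimal sketch of how it might be produced (model_clean and test_trans as above) is:
Code
# Predict wins on the transformed evaluation predictors, then round up to whole wins
predictions_ceiling <- ceiling(predict(model_clean, newdata = test_trans))
summary(predictions_ceiling)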
Code
# Create histogram and save histogram info
m <- mean(predictions_ceiling)
s <- sd(predictions_ceiling)

# Re-plot with labels
h <- hist(predictions_ceiling,
          breaks = 20,
          col = "gray",
          border = "white",
          freq = FALSE,
          main = "Distribution of Predicted Wins",
          xlab = "Predicted Wins")

# Add the normal curve
xfit <- seq(min(predictions_ceiling), max(predictions_ceiling), length = 100)
yfit <- dnorm(xfit, mean = m, sd = s)
lines(xfit, yfit, col = "red", lwd = 2)

# Add mean and SD text annotation
text(x = m, y = max(yfit) / 2,
     labels = paste0("Mean = ", round(m, 1), "\nSD = ", round(s, 1)),
     col = "blue", cex = 0.9, pos = 4)
We generated predictions using the cleaned linear model after removing 35 influential observations identified by Cook’s Distance values greater than 0.015.
A ceiling function was applied to ensure all predicted team win totals were whole numbers.
We then visualized the distribution of predicted wins using a histogram with an overlaid normal distribution curve.
The resulting predictions followed an approximately normal (Gaussian) distribution, centered around 80.8 wins, a realistic outcome given that Major League Baseball teams play 162 games per season.
An average near 81 wins represents a balanced league, confirming that our model’s predictions align well with real-world expectations.
As a baseball fan, I recognize that additional variables could improve the model’s ability to explain team wins.
Metrics such as runs scored, hits with runners in scoring position, and runs allowed directly capture a team’s offensive efficiency and pitching effectiveness.
Including these types of performance indicators could enhance the model’s explanatory power and yield a higher \(R^2\), providing a more complete representation of what drives success over a 162-game season.