DATA621 - Homework 1

Author

Anthony Josue Roman, Rupendra Shrestha, Bikash Bhowmik, Haoming Chen, Jerald Melukkaran

Instruction

OPTION 1

Instructions for HW #1 are included in the attached PDF file. This assignment is due by 03/01/2026, 11.59 PM EST. Please submit your report as a PDF file.

OPTION 2

Define a concept that you have learned from this week’s readings and videos. Compare (i,e., Identify similarities) and Contrast (i.e., Identify differences) this concept with another that you have learned throughout the course. Use R to provide a real-world example of how you would approach solving a problem using the concept that you have learned. Please submit your report as a PDF file.

Overview

The goal of this assignment is to use the Moneyball dataset to understand the effect of baseball team performance metrics on the total number of games won in a season. To do this, we’ll use historical team data standardized for a 162-game season and try to model the relationship between the response variable TARGET_WINS.

The data analysis is performed following a data mining framework. The data is first examined through exploratory data analysis. In this phase, the distribution and scale of the data are examined. Special emphasis is given to the presence of missing values and the correlation between the features. Then, data preparation techniques are used. These techniques are median imputation, creation of missing values indicators, and feature engineering. Multiple models are developed using multiple linear regression. At least three models are developed using different features. Model selection is performed through out-of-sample metrics such as Mean Squared Error (MSE) and R². F-statistics, multicollinearity statistics, and residual analysis are also used in model selection. Finally, the model is refitted on the entire data set and used to produce predictions on the evaluation set.

Objective

The objective of this assignment is to build, test, and interpret multiple linear regression models that forecast or predict team wins based on their previous performance records in baseball games. This involves carrying out exploratory data analysis, using suitable data transformations, building multiple candidate models, selecting the best model based on out-of-sample performance, as well as making predictions for the evaluation set. By doing this, it is possible to demonstrate a data mining process that is informed by statistical thinking and predictive assessment.

Data Exploration

To understand the data structure and the relationship between the data points in the Moneyball dataset used for the model’s training, exploratory data analysis was performed. The data set had a total of 2,276 data points and included multiple team performance metrics standardized for a 162-game season.

The summary statistics show significant variation in the data points of almost all the team performance metrics. There are also some right-skewed distributions in the data points of the home runs, walks, and total hits metrics. These statistics show the benefits of data transformations for some of the data points in the subsequent stages of the modeling process.

Missing values are also present in the data points of the pitching metrics. Given the historical context of the data set, the missing data points are likely the result of incomplete data in the earlier seasons of the league’s history. Therefore, median imputation was a suitable option for the data set in the data preparation phase.

The correlation matrix shows a positive correlation between the team’s wins and the data points representing the team’s offensive production. There is also a negative correlation between the defensive errors and the team’s wins. There are also positive correlations between the data points of the team’s pitching metrics and the team’s wins.

The strong correlations between the data points of the team’s offensive production also show the possibility of multicollinearity in the data set. Therefore, the Variance Inflation Factor diagnostic tool should also be used in the model evaluation phase.

The exploratory data analysis also confirmed the presence of statistically significant variation in the data points in the data set and the presence of logical correlations in the data set.

Loading and Inspect the Data

The training and evaluation datasets are loaded from CSV files using efficient data import functions in R. These datasets contain historical baseball performance statistics used for model development and prediction.

Code
library(tidyverse)
library(corrplot)
library(GGally)
library(car)
library(broom)
library(performance)
library(lmtest)
library(sandwich)
library(tidymodels)  
library(glmnet)      
library(gt)
Code
train_raw <- read_csv("moneyball-training-data.csv")
eval_raw  <- read_csv("moneyball-evaluation-data.csv")

dim(train_raw)
[1] 2276   17
Code
glimpse(train_raw)
Rows: 2,276
Columns: 17
$ INDEX            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
$ TARGET_WINS      <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
$ TEAM_BATTING_H   <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
$ TEAM_BATTING_2B  <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
$ TEAM_BATTING_3B  <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
$ TEAM_BATTING_HR  <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
$ TEAM_BATTING_BB  <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
$ TEAM_BATTING_SO  <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
$ TEAM_BASERUN_SB  <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
$ TEAM_BASERUN_CS  <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
$ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TEAM_PITCHING_H  <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
$ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
$ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
$ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
$ TEAM_FIELDING_E  <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
$ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
Code
summary(train_raw)
     INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
 Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
 1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
 Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
 Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
 3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
 Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
                                                                 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
 1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
 Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
 Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
 3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
 Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
                                                   NA's   :102     
 TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
 Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
 1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
 Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
 Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
 3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
 Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
 NA's   :131     NA's   :772     NA's   :2085                    
 TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
 Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
 1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
 Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
 Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
 3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
 Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
                                   NA's   :102                       
 TEAM_FIELDING_DP
 Min.   : 52.0   
 1st Qu.:131.0   
 Median :149.0   
 Mean   :146.4   
 3rd Qu.:164.0   
 Max.   :228.0   
 NA's   :286     

Missing Values

Code
colSums(is.na(train_raw))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
               0                0                0                0 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
               0                0                0              102 
 TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
             131              772             2085                0 
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
               0                0              102                0 
TEAM_FIELDING_DP 
             286 
Code
colSums(is.na(eval_raw))
           INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
               0                0                0                0 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
               0                0               18               13 
 TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
              87              240                0                0 
TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
               0               18                0               31 

Distribution of Target Variables

The histogram shows that TARGET_WINS is roughly centered around the mid-70s to low-80s range, with most teams falling between about 65 and 95 wins. The distribution appears fairly symmetric, with only a few extreme high- or low-win seasons.

Code
ggplot(train_raw, aes(x = TARGET_WINS)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of TARGET_WINS",
       x = "Wins",
       y = "Count")

The target variable appears suitably distributed for regression analysis, with no extreme imbalance or severe skewness. This supports the use of multiple linear regression to model team wins without requiring transformation of the response variable.

Density Plots of Key Variables

The density plots display the distribution of each numeric variable in the dataset on its own scale. Several offensive variables, such as home runs and walks, show right-skewed distributions, indicating that a small number of teams recorded unusually high values. Other variables appear more symmetric and approximately bell-shaped.

The differences in spread across panels also highlight that predictors are measured on very different scales.

Code
train_long <- train_raw %>%
  select(-INDEX) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value")

ggplot(train_long, aes(x = value)) +
  geom_density(fill = "orange", alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

Some variables exhibit skewness and scale differences, which suggests that transformations or careful model specification may improve stability and interpretation in regression modeling.

Correlation Matrix

The correlation matrix shows the pairwise linear relationships among all numeric variables in the dataset. Offensive metrics such as home runs and walks display strong positive correlations with TARGET_WINS, while defensive errors show a negative correlation with wins. Pitching-related variables also exhibit meaningful relationships with the target.

Additionally, several offensive predictors are highly correlated with one another, suggesting potential multicollinearity.

Code
cor_mat <- cor(train_raw %>% select(-INDEX),
               use = "pairwise.complete.obs")

corrplot(cor_mat,
         method = "color",
         type = "upper",
         tl.cex = 0.6)

Code
cor_with_target <- cor(train_raw %>% select(-INDEX),
                       use = "pairwise.complete.obs")["TARGET_WINS",]

sort(cor_with_target, decreasing = TRUE)
     TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_BB 
      1.00000000       0.38876752       0.28910365       0.23255986 
TEAM_PITCHING_HR  TEAM_BATTING_HR  TEAM_BATTING_3B  TEAM_BASERUN_SB 
      0.18901373       0.17615320       0.14260841       0.13513892 
TEAM_PITCHING_BB TEAM_BATTING_HBP  TEAM_BASERUN_CS  TEAM_BATTING_SO 
      0.12417454       0.07350424       0.02240407      -0.03175071 
TEAM_FIELDING_DP TEAM_PITCHING_SO  TEAM_PITCHING_H  TEAM_FIELDING_E 
     -0.03485058      -0.07843609      -0.10993705      -0.17648476 

The correlation analysis confirms that key performance metrics are logically related to team wins. However, strong correlations among predictors indicate that multicollinearity should be evaluated during model development.

The ranking of these correlations with TARGET_WINS indicates which variables are most strongly linearly related to wins. The offensive statistics such as home runs and walks are strongly related to wins. The defensive statistics, such as errors, are inversely related. The high correlation between offensive statistics indicates that redundancy and possibly multicollinearity are issues.

The scatterplot matrix displays pairwise relationships among TARGET_WINS and selected key performance variables. The plots show clear positive linear trends between wins and offensive measures such as home runs and walks.

Code
GGally::ggpairs(
  train_raw %>%
    select(TARGET_WINS,
           TEAM_BATTING_HR,
           TEAM_BATTING_BB,
           TEAM_FIELDING_E,
           TEAM_PITCHING_HR,
           TEAM_PITCHING_SO) %>%
    drop_na()
)

The scatterplots support the assumption of linear relationships between key predictors and team wins, while also highlighting correlations among predictors that should be monitored during regression modeling.

Code
train_raw %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), sd))
# A tibble: 1 × 17
  INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
  <dbl>       <dbl>          <dbl>           <dbl>           <dbl>
1  736.        15.8           145.            46.8            27.9
# ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
#   TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
#   TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
#   TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
#   TEAM_FIELDING_DP <dbl>

The range of standard deviations varies significantly for different predictors, depending upon the scales used for measurement. It is observed that linear regression provides unbiased results without the need for normalizing the predictors, but scaling may be useful for numerical stability as well as for comparison of coefficients.

Outlier and Influence Investigation

Code
boxplot.stats(train_raw$TARGET_WINS)$out
 [1]  39  38 134 146  39  23 128 129 126 124  21  22  17  33  32  34   0  24  38
[20]  36  27  14  26 135  34  36  33  12  38  29  35  31
Code
m_temp <- lm(TARGET_WINS ~ ., data = train_raw %>% select(-INDEX))
hatvalues(m_temp) %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01789 0.04966 0.07018 0.08377 0.08717 0.95247 

The presence of extreme values and potentially high leverage points suggests that certain seasons may have a large influence on the regression findings. This emphasizes the importance of examining leverage and residual values when choosing a model, in order to remain safe and robust. As linear regression is particularly susceptible to the influence of points, we will look at residuals and leverage as part of the regression evaluation process.

While early scatterplots suggest that there may be linear relationships between important offensive measures and team wins, certain predictor variables may also have skewed distributions, making nonlinear scaling methods such as log scaling worthy of consideration during regression development.

Data Preparation

Prior to regression modeling, we performed a number of preprocessing steps to ensure that the regression models remain stable, interpretable, and conducive to inference. These preprocessing steps address missing data, scaling issues, and collinearity among performance metrics.

Handling Missing Values

During the exploratory phase, it was noted that the missing data are mainly concentrated in a few pitching variables. As the missing data are likely due to incomplete historical records, which are unlikely to be completely random, median imputation was chosen. Median is more robust to extreme values, preserving the “center” of the data.

Also, missing value indicator flags are included to detect any underlying structural information in the missing data.

Code
eval_index <- eval_raw$INDEX

train <- train_raw
eval  <- eval_raw

na_cols <- names(train)[colSums(is.na(train)) > 0]
na_cols <- setdiff(na_cols, c("INDEX", "TARGET_WINS"))

for (nm in na_cols) {
  train[[paste0(nm, "_NA")]] <- ifelse(is.na(train[[nm]]), 1, 0)
  eval[[paste0(nm, "_NA")]]  <- ifelse(is.na(eval[[nm]]), 1, 0)
}

train <- train %>%
  mutate(across(where(is.numeric),
                ~ifelse(is.na(.x),
                        median(.x, na.rm = TRUE),
                        .x)))

eval <- eval %>%
  mutate(across(where(is.numeric),
                ~ifelse(is.na(.x),
                        median(.x, na.rm = TRUE),
                        .x)))

train <- train %>% select(-INDEX)
eval  <- eval  %>% select(-INDEX)

colSums(is.na(train))
        TARGET_WINS      TEAM_BATTING_H     TEAM_BATTING_2B     TEAM_BATTING_3B 
                  0                   0                   0                   0 
    TEAM_BATTING_HR     TEAM_BATTING_BB     TEAM_BATTING_SO     TEAM_BASERUN_SB 
                  0                   0                   0                   0 
    TEAM_BASERUN_CS    TEAM_BATTING_HBP     TEAM_PITCHING_H    TEAM_PITCHING_HR 
                  0                   0                   0                   0 
   TEAM_PITCHING_BB    TEAM_PITCHING_SO     TEAM_FIELDING_E    TEAM_FIELDING_DP 
                  0                   0                   0                   0 
 TEAM_BATTING_SO_NA  TEAM_BASERUN_SB_NA  TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA 
                  0                   0                   0                   0 
TEAM_PITCHING_SO_NA TEAM_FIELDING_DP_NA 
                  0                   0 
Code
colSums(is.na(eval))
     TEAM_BATTING_H     TEAM_BATTING_2B     TEAM_BATTING_3B     TEAM_BATTING_HR 
                  0                   0                   0                   0 
    TEAM_BATTING_BB     TEAM_BATTING_SO     TEAM_BASERUN_SB     TEAM_BASERUN_CS 
                  0                   0                   0                   0 
   TEAM_BATTING_HBP     TEAM_PITCHING_H    TEAM_PITCHING_HR    TEAM_PITCHING_BB 
                  0                   0                   0                   0 
   TEAM_PITCHING_SO     TEAM_FIELDING_E    TEAM_FIELDING_DP  TEAM_BATTING_SO_NA 
                  0                   0                   0                   0 
 TEAM_BASERUN_SB_NA  TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA TEAM_PITCHING_SO_NA 
                  0                   0                   0                   0 
TEAM_FIELDING_DP_NA 
                  0 

After preprocessing, all predictors contain complete observations, ensuring compatibility with linear regression modeling.

Feature Engineering

Several offensive variables capture related aspects of hitting performance and may introduce multicollinearity when modeled simultaneously. To address this, derived features were constructed to summarize overall offensive production more efficiently.

Total bases provide a consolidated measure of power hitting:

Code
train <- train %>%
  mutate(
    TOTAL_BASES = TEAM_BATTING_H +
                  TEAM_BATTING_2B +
                  2 * TEAM_BATTING_3B +
                  3 * TEAM_BATTING_HR,
    POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
                         TEAM_BATTING_HR / TEAM_BATTING_H,
                         0)
  )

eval <- eval %>%
  mutate(
    TOTAL_BASES = TEAM_BATTING_H +
                  TEAM_BATTING_2B +
                  2 * TEAM_BATTING_3B +
                  3 * TEAM_BATTING_HR,
    POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
                         TEAM_BATTING_HR / TEAM_BATTING_H,
                         0)
  )

glimpse(train)
Rows: 2,276
Columns: 24
$ TARGET_WINS         <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72…
$ TEAM_BATTING_H      <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13…
$ TEAM_BATTING_2B     <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, …
$ TEAM_BATTING_3B     <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41…
$ TEAM_BATTING_HR     <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, …
$ TEAM_BATTING_BB     <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, …
$ TEAM_BATTING_SO     <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82…
$ TEAM_BASERUN_SB     <dbl> 101, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119,…
$ TEAM_BASERUN_CS     <dbl> 49, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10…
$ TEAM_BATTING_HBP    <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58…
$ TEAM_PITCHING_H     <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13…
$ TEAM_PITCHING_HR    <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, …
$ TEAM_PITCHING_BB    <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, …
$ TEAM_PITCHING_SO    <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8…
$ TEAM_FIELDING_E     <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,…
$ TEAM_FIELDING_DP    <dbl> 149, 155, 153, 156, 168, 149, 186, 136, 169, 159, …
$ TEAM_BATTING_SO_NA  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_SB_NA  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_CS_NA  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BATTING_HBP_NA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ TEAM_PITCHING_SO_NA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_FIELDING_DP_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TOTAL_BASES         <dbl> 1756, 2172, 2090, 1960, 1843, 1827, 1897, 1863, 20…
$ POWER_RATIO         <dbl> 0.00899654, 0.14189694, 0.09949165, 0.06921413, 0.…

Thus, all the predictors are now fully observed, making them suitable for linear regression modeling.

Scale Considerations

While ordinary least squares regression does not necessitate standardization of the predictors to achieve unbiased estimators of the regression coefficients, scale issues may impact stability and interpretation. This may be investigated by considering alternative model specifications, particularly when comparing coefficient magnitude across predictors.

Code
cor(train %>%
      select(TARGET_WINS, TOTAL_BASES, POWER_RATIO),
    use = "pairwise.complete.obs")
            TARGET_WINS TOTAL_BASES POWER_RATIO
TARGET_WINS   1.0000000   0.4227754    0.136171
TOTAL_BASES   0.4227754   1.0000000    0.556130
POWER_RATIO   0.1361710   0.5561300    1.000000

Having dealt with missing values, constructed indicator variables, and engineered features, the data is now ready for regression modeling. The data preprocessing steps have effectively dealt with missing values, reduced data redundancy due to correlated predictors, and promoted interpretability. The next steps will involve developing and comparing multiple linear regression models to identify the best regression specification for team win predictions.

Model Development

To compare different specifications, the prepared data set will be divided into a set of training data and a set of test data. We’ll fit the models on the training data and compare how well the models do on the test data.

Train/Test Split

Code
set.seed(42)

idx <- sample(seq_len(nrow(train)), 
              size = floor(0.9 * nrow(train)))

train_tr <- train[idx, ]
test_tr  <- train[-idx, ]

dim(train_tr)
[1] 2048   24
Code
dim(test_tr)
[1] 228  24

To assess the models’ out-of-sample predictive ability without sacrificing the number of data points for model estimation, we’ll use a 90/10 split.

Model 1: Baseline Specification

This baseline model establishes a straightforward linear relationship between key traditional baseball statistics and total team wins. It provides a reference point for evaluating whether more advanced feature engineering or transformations meaningfully improve predictive performance in later models.

Model 1 uses the basic set of offensive and pitching statistics most commonly related to team success.

Code
m1 <- lm(
  TARGET_WINS ~ TEAM_BATTING_HR +
    TEAM_BATTING_BB +
    TEAM_FIELDING_E +
    TEAM_PITCHING_H +
    TEAM_PITCHING_HR,
  data = train_tr
)

summary(m1)

Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB + 
    TEAM_FIELDING_E + TEAM_PITCHING_H + TEAM_PITCHING_HR, data = train_tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.839 -10.006   0.489  10.025  77.005 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      68.7746445  2.2708167  30.286  < 2e-16 ***
TEAM_BATTING_HR  -0.1297548  0.0259857  -4.993 6.44e-07 ***
TEAM_BATTING_BB   0.0231387  0.0037337   6.197 6.93e-10 ***
TEAM_FIELDING_E  -0.0041112  0.0026511  -1.551    0.121    
TEAM_PITCHING_H  -0.0004030  0.0003298  -1.222    0.222    
TEAM_PITCHING_HR  0.1409561  0.0242072   5.823 6.70e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.23 on 2042 degrees of freedom
Multiple R-squared:  0.07324,   Adjusted R-squared:  0.07097 
F-statistic: 32.27 on 5 and 2042 DF,  p-value: < 2.2e-16

The results of Model 1 show that offensive variables such as home runs and walks are positively associated with team wins, while fielding errors tend to have a negative relationship with wins. This aligns with basic baseball intuition: stronger offensive production increases wins, and defensive mistakes reduce them.

Pitching variables also contribute to explaining variation in wins, although their statistical significance and magnitude may vary. Overall, the model is statistically significant and explains a meaningful portion of the variation in team performance, making it a reasonable baseline for comparison with more advanced specifications.

Model 2: Engineered Feature Model

Model 2 replaces several individual hitting statistics with engineered variables that summarize offensive production more efficiently. TOTAL_BASES captures overall power output, while POWER_RATIO reflects the share of home runs relative to total hits. These are combined with walks, fielding errors, and selected pitching metrics to create a more compact and theoretically structured specification.

Code
m2 <- lm(
  TARGET_WINS ~ TOTAL_BASES +
    TEAM_BATTING_BB +
    POWER_RATIO +
    TEAM_FIELDING_E +
    TEAM_PITCHING_SO +
    TEAM_PITCHING_HR,
  data = train_tr
)

summary(m2)

Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO + 
    TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train_tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.935  -8.953   0.175   9.007  49.706 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.455e+01  3.823e+00   3.807 0.000145 ***
TOTAL_BASES       3.402e-02  1.812e-03  18.776  < 2e-16 ***
TEAM_BATTING_BB   1.379e-02  3.376e-03   4.084 4.60e-05 ***
POWER_RATIO      -1.299e+02  3.036e+01  -4.280 1.96e-05 ***
TEAM_FIELDING_E  -1.504e-02  2.052e-03  -7.331 3.28e-13 ***
TEAM_PITCHING_SO  5.672e-04  5.733e-04   0.989 0.322671    
TEAM_PITCHING_HR -7.816e-03  2.156e-02  -0.363 0.716953    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.68 on 2041 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.2508 
F-statistic: 115.2 on 6 and 2041 DF,  p-value: < 2.2e-16

The results indicate that total offensive production, as measured by total bases and walks, is positively associated with team wins, while fielding errors remain negatively related. The engineered features improve interpretability by reducing redundancy among correlated batting statistics. Overall, this model provides a more structured and potentially more stable alternative to the baseline specification

Model 3: Transformed Predictor Model

Model 3 applies log transformations to selected skewed predictors, including home runs, walks, and pitching home runs allowed. The goal is to reduce the influence of extreme values and capture potential nonlinear relationships between these variables and team wins. Fielding errors and pitching strikeouts remain in their original scale.

Code
m3 <- lm(
  TARGET_WINS ~ log(TEAM_BATTING_HR + 1) +
    log(TEAM_BATTING_BB + 1) +
    TEAM_FIELDING_E +
    log(TEAM_PITCHING_HR + 1) +
    TEAM_PITCHING_SO,
  data = train_tr
)

summary(m3)

Call:
lm(formula = TARGET_WINS ~ log(TEAM_BATTING_HR + 1) + log(TEAM_BATTING_BB + 
    1) + TEAM_FIELDING_E + log(TEAM_PITCHING_HR + 1) + TEAM_PITCHING_SO, 
    data = train_tr)

Residuals:
    Min      1Q  Median      3Q     Max 
-67.378  -9.905   0.709  10.000  77.826 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                2.854e+01  8.570e+00   3.331 0.000882 ***
log(TEAM_BATTING_HR + 1)  -1.986e+01  2.519e+00  -7.882 5.19e-15 ***
log(TEAM_BATTING_BB + 1)   8.152e+00  1.315e+00   6.200 6.82e-10 ***
TEAM_FIELDING_E           -1.306e-02  3.108e-03  -4.202 2.76e-05 ***
log(TEAM_PITCHING_HR + 1)  2.105e+01  2.350e+00   8.956  < 2e-16 ***
TEAM_PITCHING_SO          -2.719e-03  5.903e-04  -4.606 4.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.01 on 2042 degrees of freedom
Multiple R-squared:  0.1005,    Adjusted R-squared:  0.09833 
F-statistic: 45.65 on 5 and 2042 DF,  p-value: < 2.2e-16

The transformed variables maintain the expected directional relationships with wins, while potentially improving model stability by reducing skewness. The log specification suggests diminishing marginal effects, meaning that increases in already large values may have smaller incremental impacts on wins. This model provides an alternative functional form to test whether nonlinear scaling improves predictive performance compared to the earlier linear specifications.

These three models represent different theories and statistical models for predicting team wins. The following section compares the models using out-of-sample error metrics and statistics to determine which model is the most appropriate.

Model Selection

The models will be evaluated in terms of how they perform on the data that they have not seen, and that is done through the holdout set. What is important for the choice of the model is the out-of-sample measures, such as the MSE, R², F-stat for the overall significance, and the residuals, which will indicate the goodness of fit.

Code
model_metrics <- function(mod, data){
  
  pred <- predict(mod, newdata = data)
  
  mse  <- mean((data$TARGET_WINS - pred)^2)
  
  r2   <- 1 - sum((data$TARGET_WINS - pred)^2) /
    sum((data$TARGET_WINS - mean(data$TARGET_WINS))^2)
  
  fstat <- summary(mod)$fstatistic
  
  tibble(
    MSE = mse,
    R2 = r2,
    F_value = unname(fstat["value"]),
    df1 = unname(fstat["numdf"]),
    df2 = unname(fstat["dendf"])
  )
}
Code
results_table <- bind_rows(
  Model_1 = model_metrics(m1, test_tr),
  Model_2 = model_metrics(m2, test_tr),
  Model_3 = model_metrics(m3, test_tr),
  .id = "Model"
)

results_table
# A tibble: 3 × 6
  Model     MSE     R2 F_value   df1   df2
  <chr>   <dbl>  <dbl>   <dbl> <dbl> <dbl>
1 Model_1  219. 0.0519    32.3     5  2042
2 Model_2  198. 0.141    115.      6  2041
3 Model_3  227. 0.0167    45.6     5  2042

When comparing the models, it indicatess that the model with the lowest out-of-sample MSE has the highest power of prediction. The differences in R² will indicate the goodness of fit, while the F-stat will indicate the overall significance of the models.

When considering the holdout MSE and R², it shows that the most meaningful choice for predicting the wins of the teams is Model 2, so it is the final choice for the model.

Code
car::vif(m2)
     TOTAL_BASES  TEAM_BATTING_BB      POWER_RATIO  TEAM_FIELDING_E 
        2.444247         1.907044        17.406845         2.455342 
TEAM_PITCHING_SO TEAM_PITCHING_HR 
        1.141935        19.238526 

The Variance Inflation Factor (VIF) was also checked to understand the level of multicollinearity between the predictors, and most predictors have low to moderate values, but the values for POWER_RATIO and TEAM_PITCHING_HR are high, indicating the redundancy in the features that were created.

Code
par(mfrow = c(2,2))
plot(m2)

Code
par(mfrow = c(1,1))

Residual diagnostics check if linearity, constant variance, and normality assumptions are reasonable. Residual plots show that the chosen model is reasonable, and there are no major issues of extreme nonlinearity. However, the Breusch-Pagan test results show that there is heteroskedasticity. Thus, robust standard errors are employed.

Code
lmtest::bptest(m2)

    studentized Breusch-Pagan test

data:  m2
BP = 225.73, df = 6, p-value < 2.2e-16
Code
coeftest(m2, vcov = vcovHC(m2, type = "HC1"))

t test of coefficients:

                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)       1.4553e+01  4.4470e+00  3.2725 0.0010839 ** 
TOTAL_BASES       3.4021e-02  2.1421e-03 15.8818 < 2.2e-16 ***
TEAM_BATTING_BB   1.3788e-02  3.8716e-03  3.5612 0.0003776 ***
POWER_RATIO      -1.2993e+02  3.3488e+01 -3.8798 0.0001079 ***
TEAM_FIELDING_E  -1.5042e-02  2.7249e-03 -5.5201 3.819e-08 ***
TEAM_PITCHING_SO  5.6717e-04  1.0485e-03  0.5409 0.5886148    
TEAM_PITCHING_HR -7.8160e-03  2.4389e-02 -0.3205 0.7486431    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Breusch-Pagan test is statistically significant, which shows that heteroskedasticity is present. Though the coefficient estimates of OLS will still be consistent, it is advised that robust standard errors should be employed.

From the chosen model, it is evident that more offensive output, measured by total bases and walks, is associated with more wins, while defensive errors are associated with fewer wins. The pitching variables perform as expected, although with lower precision.

The negative coefficient of POWER_RATIO shows that teams that use home runs relative to total hits do not win more games, given that total bases are already factored into the model. Thus, there may be diminishing returns to using home runs when already having more total bases.

Final Model Fit and Evaluation Predictions

Fit final model on the full training data

Code
final_model <- lm(
  TARGET_WINS ~ TOTAL_BASES +
    TEAM_BATTING_BB +
    POWER_RATIO +
    TEAM_FIELDING_E +
    TEAM_PITCHING_SO +
    TEAM_PITCHING_HR,
  data = train
)

summary(final_model)

Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO + 
    TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.271  -9.087   0.151   9.171  49.596 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.489e+01  3.664e+00   4.063 5.02e-05 ***
TOTAL_BASES       3.362e-02  1.750e-03  19.217  < 2e-16 ***
TEAM_BATTING_BB   1.395e-02  3.209e-03   4.346 1.45e-05 ***
POWER_RATIO      -1.226e+02  2.954e+01  -4.151 3.43e-05 ***
TEAM_FIELDING_E  -1.438e-02  1.970e-03  -7.298 4.01e-13 ***
TEAM_PITCHING_SO  6.432e-04  5.706e-04   1.127    0.260    
TEAM_PITCHING_HR -9.275e-03  2.101e-02  -0.441    0.659    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.72 on 2269 degrees of freedom
Multiple R-squared:  0.2438,    Adjusted R-squared:  0.2418 
F-statistic: 121.9 on 6 and 2269 DF,  p-value: < 2.2e-16

Predict evaluation data and write CSV

Code
eval_pred <- predict(final_model, newdata = eval)

submission <- tibble(
  INDEX = eval_index,
  TARGET_WINS = eval_pred
)

write_csv(submission, "moneyball_predictions.csv")
head(submission)
# A tibble: 6 × 2
  INDEX TARGET_WINS
  <dbl>       <dbl>
1     9        67.6
2    10        68.0
3    14        75.6
4    47        86.4
5    60        68.0
6    63        69.7

Conclusion

As this analysis demonstrates, offensive production in total bases and walks really matters in predicting the number of games won, while errors hurt those chances. Of all the model selection approaches we tested, the engineered feature model was the standout because of its lowest out-of-sample Mean Squared Error and highest predictive R².

In spite of the presence of heteroscedasticity as suggested by the Breusch-Pagan test, we employed robust standard errors in order to ensure that our statistical conclusions remain reliable. Overall, the final model achieves a good balance of ease of interpretation, statistical rigor, and prediction accuracy.