The training dataset contains 2,276 observations and 17 variables, representing the performance of professional baseball teams from 1871–2006, with all statistics normalized to a 162-game season. The target variable is TARGET_WINS, the total number of games won in a season. Predictor variables capture batting performance, baserunning efficiency, pitching outcomes, and fielding metrics.
Most predictors are continuous numeric variables. Summary statistics show substantial variability across teams and eras. For example, TEAM_BATTING_H (base hits) ranges widely, while variables such as TEAM_BATTING_HR (home runs) and TEAM_PITCHING_SO (strikeouts by pitchers) reflect changing offensive and defensive trends in baseball over time.
Missingness is unevenly distributed across variables:
Variable               % Missing   Notes
TEAM_BATTING_SO        4.48%       Minor missingness; imputation reasonable
TEAM_BASERUN_SB        5.76%       Moderate; could impute or bucket
TEAM_BASERUN_CS        33.92%      Substantial missingness; treat with caution
TEAM_BATTING_HBP       91.61%      Almost entirely missing; candidate for removal or flagging
TEAM_FIELDING_DP       12.57%      Moderate; imputation or a missingness flag helpful
TEAM_PITCHING_SO       4.48%       Minor missingness; imputation reasonable
Most other variables   0%          Complete data
The extreme missingness in TEAM_BATTING_HBP suggests it was not consistently recorded historically. Rather than impute values arbitrarily, this variable should either be excluded or used only as a missingness indicator.
Visual inspection via histograms and boxplots (included in the R code below) indicates:
TARGET_WINS is roughly symmetric, centered near 80 wins (median 82, mean 80.8).
Many predictors exhibit right-skewed distributions, particularly:
TEAM_BATTING_HR (home runs),
TEAM_BATTING_BB (walks),
TEAM_PITCHING_SO (pitcher strikeouts).
Several fielding and pitching variables show long upper tails, likely reflecting early-era baseball scoring differences.
These patterns suggest transformations may be beneficial during data preparation.
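To make this concrete, the sketch below shows one way such a transformation could be evaluated; it is illustrative only, it reads the training file the same way the code section below does, and the choice of log1p and of TEAM_PITCHING_SO as the example variable is an added assumption rather than part of the original analysis.
# Illustrative sketch: compare a right-skewed predictor before and after a
# log1p transform to judge whether a transformation is worthwhile.
library(tidyverse)
train <- read_csv("moneyball-training-data.csv", show_col_types = FALSE)
train %>%
  transmute(raw = TEAM_PITCHING_SO,
            logged = log1p(TEAM_PITCHING_SO)) %>%
  pivot_longer(everything(), names_to = "scale", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 40, fill = "steelblue") +
  facet_wrap(~ scale, scales = "free") +
  theme_minimal() +
  labs(title = "TEAM_PITCHING_SO before and after log1p transform")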
The strongest positive correlations with TARGET_WINS include:
TEAM_BATTING_H (r = 0.389): Teams with more hits tend to win more.
TEAM_BATTING_2B (r = 0.289) and TEAM_BATTING_BB (r = 0.233): Extra-base hits and walks both contribute to offensive strength.
TEAM_BATTING_HR (r = 0.176): Home runs positively correlate with wins, though not as strongly as expected.
Negative relationships include:
TEAM_FIELDING_E (r = –0.176): Teams that commit more errors tend to win fewer games.
TEAM_PITCHING_H (r = –0.110): Allowing more hits is associated with losing more games.
TEAM_PITCHING_SO (r = –0.078): Surprisingly negative; likely due to historical era effects and multicollinearity.
Overall, the correlation magnitudes are moderate. This indicates that no single variable strongly determines wins, supporting the need for multivariate linear regression.
Several predictors are conceptually related (e.g., TEAM_BATTING_H, HR, BB; or pitching hits allowed and pitching walks). Although formal VIF calculations will be performed later, correlation patterns already suggest potential multicollinearity within batting and pitching groups.
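As a quick illustration of that concern, the sketch below pulls out the pairwise correlations within the batting block; it assumes the train object read in the code that follows, and high absolute correlations here would foreshadow inflated variance inflation factors later.
# Sketch (assumes `train` as read below): pairwise correlations within the
# batting group; values near 1 in absolute terms suggest redundancy.
batting_vars <- c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
                  "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO")
round(cor(train[batting_vars], use = "pairwise.complete.obs"), 2)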
# Load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
# Read the datasets
train <- read_csv("moneyball-training-data.csv")
## Rows: 2276 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (17): INDEX, TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eval <- read_csv("moneyball-evaluation-data.csv")
## Rows: 259 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): INDEX, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATT...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure of each dataset
glimpse(train)
## Rows: 2,276
## Columns: 17
## $ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
glimpse(eval)
## Rows: 259
## Columns: 16
## $ INDEX <dbl> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
## $ TEAM_BATTING_H <dbl> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B <dbl> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B <dbl> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR <dbl> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB <dbl> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO <dbl> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB <dbl> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS <dbl> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <dbl> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
## $ TEAM_PITCHING_HR <dbl> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
## $ TEAM_PITCHING_BB <dbl> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <dbl> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E <dbl> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <dbl> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…
# View first few rows
head(train)
head(eval)
# Dimensions
dim(train)
## [1] 2276 17
# Summary statistics
summary(train)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
# Check data types
glimpse(train)
## Rows: 2,276
## Columns: 17
## $ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
# Count missing values per column
colSums(is.na(train))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0 102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 131 772 2085 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0 0 102 0
## TEAM_FIELDING_DP
## 286
# Percentage missing
sapply(train, function(x) mean(is.na(x)) * 100)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0.000000 0.000000 0.000000 0.000000
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0.000000 0.000000 0.000000 4.481547
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 5.755712 33.919156 91.608084 0.000000
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0.000000 0.000000 4.481547 0.000000
## TEAM_FIELDING_DP
## 12.565905
# Histogram of target variable
ggplot(train, aes(x = TARGET_WINS)) +
geom_histogram(binwidth = 5, fill = "steelblue") +
theme_minimal()
# Boxplots for selected predictors
vars_to_plot <- c("TEAM_BATTING_H", "TEAM_BATTING_HR", "TEAM_BATTING_BB",
"TEAM_BATTING_SO", "TEAM_FIELDING_E")
train %>%
pivot_longer(all_of(vars_to_plot)) %>%
ggplot(aes(y = value, x = name)) +
geom_boxplot(fill = "lightgray") +
theme_minimal() +
labs(x = "Variable", y = "Value", title = "Boxplots of Key Predictors")
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Numeric-only correlation matrix
numeric_train <- train %>% select(-INDEX)
cor_matrix <- cor(numeric_train, use = "pairwise.complete.obs")
# Correlation with TARGET_WINS
cor_target <- cor_matrix[, "TARGET_WINS"]
sort(cor_target, decreasing = TRUE)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_BB
## 1.00000000 0.38876752 0.28910365 0.23255986
## TEAM_PITCHING_HR TEAM_BATTING_HR TEAM_BATTING_3B TEAM_BASERUN_SB
## 0.18901373 0.17615320 0.14260841 0.13513892
## TEAM_PITCHING_BB TEAM_BATTING_HBP TEAM_BASERUN_CS TEAM_BATTING_SO
## 0.12417454 0.07350424 0.02240407 -0.03175071
## TEAM_FIELDING_DP TEAM_PITCHING_SO TEAM_PITCHING_H TEAM_FIELDING_E
## -0.03485058 -0.07843609 -0.10993705 -0.17648476
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.6)
The training dataset contains several variables with moderate to severe missingness. To build a reliable multiple linear regression model, I applied a structured data preparation process involving imputation, missingness indicators, and removal of variables that cannot be reasonably recovered.
Several predictors contain modest levels of missingness and can be reasonably imputed without introducing bias:
Variable               % Missing   Treatment
TEAM_BATTING_SO        4.48%       Imputed with median
TEAM_BASERUN_SB        5.76%       Imputed with median
TEAM_FIELDING_DP       12.57%      Imputed with median
TEAM_BASERUN_CS        33.92%      Imputed with median + missingness flag
Median imputation was selected because several of these variables are skewed (TEAM_BASERUN_SB, for example, has a mean of about 125 against a median of 101), and the median is more robust to outliers than the mean.
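A quick way to sanity-check that rationale is to compare the mean and median of each variable slated for imputation; the short sketch below is illustrative and uses the train object from the code above.
# Sketch: compare mean and median for the variables being imputed; a mean well
# above the median (as for TEAM_BASERUN_SB) indicates right skew.
train %>%
  summarise(across(c(TEAM_BATTING_SO, TEAM_BASERUN_SB,
                     TEAM_BASERUN_CS, TEAM_FIELDING_DP),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE))))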
TEAM_BATTING_HBP contains over 91% missing values, making meaningful imputation impossible. Including this feature would risk adding noise rather than predictive signal. Therefore:
TEAM_BATTING_HBP was excluded from the model,
A HBP_missing_flag indicator was created to retain the information that the variable is predominantly missing.
Missing data can itself be informative. For example, early baseball eras did not track certain statistics consistently. To preserve this structural information, two missingness indicators were added:
CS_missing_flag = 1 if TEAM_BASERUN_CS was originally missing, else 0
HBP_missing_flag = 1 if TEAM_BATTING_HBP was missing, else 0
These binary features allow the model to account for differences across eras or recording practices.
After imputation:
The four imputed variables now contain no missing values. TEAM_PITCHING_SO was not imputed and retains 102 missing values, so models that include it drop those rows during fitting.
Data types were validated to ensure all predictors remained numeric where appropriate.
The target variable, TARGET_WINS, contained no missing data and did not require modification.
The final dataset used for modeling includes:
All original variables except TEAM_BATTING_HBP, which was removed.
Four median-imputed variables (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP).
Two newly created missingness indicators.
This preparation approach preserves as much information as possible while avoiding distortions caused by heavy missingness. It also supports interpretability, as the model can detect whether missingness itself correlates with team performance.
# Make a copy of the training data
train_prep <- train
# ----- 1. Create missingness flags -----
train_prep <- train_prep %>%
mutate(
CS_missing_flag = ifelse(is.na(TEAM_BASERUN_CS), 1, 0),
HBP_missing_flag = ifelse(is.na(TEAM_BATTING_HBP), 1, 0)
)
# ----- 2. Median imputation -----
median_SO <- median(train_prep$TEAM_BATTING_SO, na.rm = TRUE)
median_SB <- median(train_prep$TEAM_BASERUN_SB, na.rm = TRUE)
median_CS <- median(train_prep$TEAM_BASERUN_CS, na.rm = TRUE)
median_DP <- median(train_prep$TEAM_FIELDING_DP, na.rm = TRUE)
train_prep$TEAM_BATTING_SO[is.na(train_prep$TEAM_BATTING_SO)] <- median_SO
train_prep$TEAM_BASERUN_SB[is.na(train_prep$TEAM_BASERUN_SB)] <- median_SB
train_prep$TEAM_BASERUN_CS[is.na(train_prep$TEAM_BASERUN_CS)] <- median_CS
train_prep$TEAM_FIELDING_DP[is.na(train_prep$TEAM_FIELDING_DP)] <- median_DP
# ----- 3. Remove TEAM_BATTING_HBP (91% missing) -----
train_prep <- train_prep %>% select(-TEAM_BATTING_HBP)
# Check remaining missing values (note: TEAM_PITCHING_SO was not imputed and still has 102 NAs)
colSums(is.na(train_prep))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0 0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR
## 0 0 0 0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 102 0 0
## CS_missing_flag HBP_missing_flag
## 0 0
Modeling Strategy
The goal is to predict TARGET_WINS using multiple linear regression. I built three nested models of increasing complexity:
Model 1: Batting-only predictors
Model 2: Adds pitching and fielding variables
Model 3: Full model including missingness flags for baserunning and HBP
This progression lets us see how much additional variance is explained by incorporating defense and pitching and whether the extra complexity is justified.
Model 1 — Batting-Only Model
Specification:
TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS
This model focuses purely on offensive production and baserunning, which drive run scoring and thus wins.
Rationale for included variables:
TEAM_BATTING_H, 2B, 3B, HR: More hits and extra-base hits increase scoring opportunities; expected positive coefficients.
TEAM_BATTING_BB: Walks extend innings and put runners on base; expected positive coefficient.
TEAM_BATTING_SO: Strikeouts are unproductive outs; expected negative coefficient.
TEAM_BASERUN_SB: Stolen bases move runners into scoring position; expected positive coefficient.
TEAM_BASERUN_CS: Getting caught stealing wastes baserunners; expected negative coefficient.
In estimation, we would expect most batting production variables (hits, extra-base hits, walks) to be statistically significant with positive signs. If any of these variables show a counterintuitive sign, that would likely indicate multicollinearity among batting variables rather than a true negative impact on wins. In that situation, we would examine variance inflation factors (VIFs) and consider combining or removing redundant predictors.
Model 1 generally provides a baseline level of explanatory power by capturing how strong offenses tend to win more games, but it ignores pitching and fielding.
Model 2 — Batting + Pitching + Fielding
Specification:
TARGET_WINS ~ (all Model 1 predictors) + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
Added components:
TEAM_PITCHING_H: Hits allowed; expected negative coefficient.
TEAM_PITCHING_HR: Home runs allowed; expected negative coefficient.
TEAM_PITCHING_BB: Walks allowed; expected negative coefficient.
TEAM_PITCHING_SO: Strikeouts recorded by pitchers; expected positive coefficient (good pitching).
TEAM_FIELDING_E: Errors; expected negative coefficient.
TEAM_FIELDING_DP: Double plays turned; expected positive coefficient.
By incorporating pitching and fielding, this model accounts for run prevention, not just run scoring. In practice, this model typically shows:
Higher R² and lower residual standard error than Model 1, indicating better fit.
Pitching and fielding variables with intuitive signs: more errors and hits/walks allowed are associated with fewer wins, while more double plays are associated with more wins.
Where coefficients are counterintuitive (for example, if TEAM_PITCHING_SO appears with a negative sign), this again suggests multicollinearity or era effects. Teams with high strikeout totals might also allow more baserunners in certain eras, and the model can only see the combined patterns in the data. In such cases, the direction of the coefficient should be interpreted cautiously and in context with other predictors.
Model 3 — Full Model with Missingness Flags
Specification:
TARGET_WINS ~ (all Model 2 predictors) + CS_missing_flag + HBP_missing_flag
Where:
CS_missing_flag = 1 if TEAM_BASERUN_CS was originally missing, else 0
HBP_missing_flag = 1 if TEAM_BATTING_HBP was missing, else 0
This model treats the missingness structure as a potential predictor. Differences in recording practices across baseball eras can correlate with changes in run environment and team strategy. For example, very old seasons with missing CS or HBP might systematically have different scoring patterns.
Interpretation:
If CS_missing_flag has a significant coefficient, it indicates that teams from eras or contexts where caught-stealing data were not recorded tend to win systematically more or fewer games than teams from eras with complete recording.
Similarly, HBP_missing_flag absorbs some of the structural differences related to when HBP was (not) tracked.
Model 3 usually yields the best in-sample performance (highest R², lowest RMSE) but at the cost of being more complex and slightly less interpretable than Models 1 and 2. Several predictors may become statistically insignificant due to overlap in information.
Coefficient Reasonableness:
Across the three models, most coefficient signs are expected to match baseball intuition:
Positive impact on wins:
More hits, extra-base hits, walks, stolen bases,
More pitcher strikeouts,
More double plays.
Negative impact on wins:
More batter strikeouts,
More caught stealing,
More hits, walks, and home runs allowed,
More fielding errors.
When the estimated models produce coefficients whose signs contradict domain knowledge, I interpret them with caution and attribute such behavior to:
Multicollinearity between highly correlated predictors (e.g., hits, doubles, home runs),
Era effects embedded in the data (e.g., older seasons with different scoring environments),
Redundancy between related measures of team quality.
Rather than blindly removing such variables, I consider both statistical significance and domain knowledge in determining whether to keep them in the final model.
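One concrete way to weigh whether a group of questionable predictors earns its place is a partial F-test between nested models. The sketch below assumes the model2 and model3 objects fit in the code that follows; because both are estimated on the same 2,174 complete rows, the nested comparison is valid.
# Sketch (assumes model2 and model3 from the code below): partial F-test of
# whether the two missingness flags jointly improve fit.
anova(model2, model3)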
# ---------------------------
# Model 1: Batting-only model
# ---------------------------
model1 <- lm(
TARGET_WINS ~ TEAM_BATTING_H +
TEAM_BATTING_2B +
TEAM_BATTING_3B +
TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_BATTING_SO +
TEAM_BASERUN_SB +
TEAM_BASERUN_CS,
data = train_prep
)
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS, data = train_prep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.934 -8.858 0.339 8.866 54.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.007686 5.142843 -0.974 0.330
## TEAM_BATTING_H 0.040872 0.003754 10.887 < 2e-16 ***
## TEAM_BATTING_2B -0.009059 0.009440 -0.960 0.337
## TEAM_BATTING_3B 0.077720 0.017079 4.551 5.63e-06 ***
## TEAM_BATTING_HR 0.048281 0.009946 4.855 1.29e-06 ***
## TEAM_BATTING_BB 0.025093 0.002875 8.728 < 2e-16 ***
## TEAM_BATTING_SO 0.003680 0.002288 1.609 0.108
## TEAM_BASERUN_SB 0.018519 0.004244 4.363 1.34e-05 ***
## TEAM_BASERUN_CS 0.024230 0.016045 1.510 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.69 on 2267 degrees of freedom
## Multiple R-squared: 0.247, Adjusted R-squared: 0.2443
## F-statistic: 92.95 on 8 and 2267 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))
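The formal VIF check mentioned in the Model 1 discussion could be run at this point; the sketch below assumes the car package is installed, which is not otherwise used in this analysis.
# Sketch (assumes the car package): variance inflation factors for the
# batting-only model; values well above 5-10 would flag problematic
# collinearity among the batting predictors.
library(car)
vif(model1)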
# ---------------------------
# Model 2: Batting + Pitching + Fielding
# ---------------------------
model2 <- lm(
TARGET_WINS ~ TEAM_BATTING_H +
TEAM_BATTING_2B +
TEAM_BATTING_3B +
TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_BATTING_SO +
TEAM_BASERUN_SB +
TEAM_BASERUN_CS +
TEAM_PITCHING_H +
TEAM_PITCHING_HR +
TEAM_PITCHING_BB +
TEAM_PITCHING_SO +
TEAM_FIELDING_E +
TEAM_FIELDING_DP,
data = train_prep
)
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = train_prep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.946 -8.519 0.162 8.319 58.393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2825372 5.4542667 4.085 4.56e-05 ***
## TEAM_BATTING_H 0.0489021 0.0037208 13.143 < 2e-16 ***
## TEAM_BATTING_2B -0.0239385 0.0092600 -2.585 0.009799 **
## TEAM_BATTING_3B 0.0623086 0.0169261 3.681 0.000238 ***
## TEAM_BATTING_HR 0.0650067 0.0273027 2.381 0.017353 *
## TEAM_BATTING_BB 0.0087890 0.0058032 1.514 0.130045
## TEAM_BATTING_SO -0.0095407 0.0025751 -3.705 0.000217 ***
## TEAM_BASERUN_SB 0.0212215 0.0043562 4.872 1.19e-06 ***
## TEAM_BASERUN_CS 0.0018648 0.0158133 0.118 0.906135
## TEAM_PITCHING_H -0.0011205 0.0003644 -3.075 0.002129 **
## TEAM_PITCHING_HR 0.0102033 0.0240870 0.424 0.671898
## TEAM_PITCHING_BB 0.0021898 0.0041085 0.533 0.594099
## TEAM_PITCHING_SO 0.0028062 0.0009099 3.084 0.002067 **
## TEAM_FIELDING_E -0.0170576 0.0024676 -6.913 6.24e-12 ***
## TEAM_FIELDING_DP -0.1102599 0.0135376 -8.145 6.36e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.89 on 2159 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.319, Adjusted R-squared: 0.3146
## F-statistic: 72.23 on 14 and 2159 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
# ---------------------------
# Model 3: Full model + flags
# ---------------------------
model3 <- lm(
TARGET_WINS ~ TEAM_BATTING_H +
TEAM_BATTING_2B +
TEAM_BATTING_3B +
TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_BATTING_SO +
TEAM_BASERUN_SB +
TEAM_BASERUN_CS +
TEAM_PITCHING_H +
TEAM_PITCHING_HR +
TEAM_PITCHING_BB +
TEAM_PITCHING_SO +
TEAM_FIELDING_E +
TEAM_FIELDING_DP +
CS_missing_flag +
HBP_missing_flag,
data = train_prep
)
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP +
## CS_missing_flag + HBP_missing_flag, data = train_prep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.656 -8.514 0.157 8.427 55.988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.9874436 5.9452979 2.184 0.029034 *
## TEAM_BATTING_H 0.0489623 0.0037246 13.146 < 2e-16 ***
## TEAM_BATTING_2B -0.0160268 0.0096235 -1.665 0.095983 .
## TEAM_BATTING_3B 0.0581914 0.0170386 3.415 0.000649 ***
## TEAM_BATTING_HR 0.0831580 0.0276312 3.010 0.002647 **
## TEAM_BATTING_BB 0.0081704 0.0058560 1.395 0.163095
## TEAM_BATTING_SO -0.0063735 0.0027003 -2.360 0.018349 *
## TEAM_BASERUN_SB 0.0197848 0.0044885 4.408 1.10e-05 ***
## TEAM_BASERUN_CS 0.0076767 0.0170321 0.451 0.652238
## TEAM_PITCHING_H -0.0008640 0.0003769 -2.292 0.021984 *
## TEAM_PITCHING_HR -0.0022346 0.0242921 -0.092 0.926715
## TEAM_PITCHING_BB 0.0021655 0.0041135 0.526 0.598647
## TEAM_PITCHING_SO 0.0023576 0.0009138 2.580 0.009946 **
## TEAM_FIELDING_E -0.0176959 0.0024971 -7.087 1.85e-12 ***
## TEAM_FIELDING_DP -0.1064874 0.0137281 -7.757 1.33e-14 ***
## CS_missing_flag 2.2090810 0.9828647 2.248 0.024703 *
## HBP_missing_flag 4.0304236 1.1750867 3.430 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.85 on 2157 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.3241, Adjusted R-squared: 0.3191
## F-statistic: 64.64 on 16 and 2157 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model3)
par(mfrow = c(1, 1))
# Compare models by AIC
AIC(model1, model2, model3)
## Warning in AIC.default(model1, model2, model3): models are not all fitted to
## the same number of observations
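The warning arises because Models 2 and 3 drop the 102 rows with missing TEAM_PITCHING_SO while Model 1 uses all 2,276 observations. One way to put the three models on an equal footing, sketched below as an optional check rather than part of the original workflow, is to refit them on the shared complete-case subset before comparing AIC.
# Sketch: refit all three models on the rows with no missing TEAM_PITCHING_SO
# so that AIC values are computed on the same observations.
common_rows <- train_prep %>% filter(!is.na(TEAM_PITCHING_SO))
m1_cc <- update(model1, data = common_rows)
m2_cc <- update(model2, data = common_rows)
m3_cc <- update(model3, data = common_rows)
AIC(m1_cc, m2_cc, m3_cc)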
Model Selection & Final Predictions
To determine the best predictive model for TARGET_WINS, I evaluated all three candidate regression models using several metrics commonly applied in linear modeling: adjusted R², residual standard error (reported as RMSE in the comparison table), AIC, the overall F-statistic, and graphical residual diagnostics.
Model Comparison

Model     Description                      Adj R²   RMSE    AIC
Model 1   Batting-only                     0.244    13.69   18,382
Model 2   Batting + Pitching + Fielding    0.315    12.89   17,303
Model 3   Full model + missingness flags   0.319    12.85   17,290
Model 3 performs best on all major criteria:
Lowest AIC, indicating the best balance between model fit and complexity.
Lowest RMSE, meaning the smallest typical prediction error.
Highest Adjusted R², reflecting the strongest explanatory power.
The improvement from Model 2 to Model 3 is modest but consistent across metrics, suggesting that the missingness flags capture meaningful variation due to historical differences in recorded statistics. One caveat, flagged by the AIC warning above: Model 1 was fit on all 2,276 observations, while Models 2 and 3 were fit on the 2,174 rows with non-missing TEAM_PITCHING_SO, so the Model 1 AIC is not strictly comparable to the other two.
Interpretation of Model 3:
The estimated coefficients largely align with baseball intuition:
Positive contributors to wins:
TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR
TEAM_BASERUN_SB
TEAM_PITCHING_SO
Negative contributors to wins:
TEAM_FIELDING_E (errors)
TEAM_FIELDING_DP (unexpectedly negative; likely due to multicollinearity with errors and pitching metrics)
TEAM_PITCHING_H (hits allowed)
The significance of HBP_missing_flag and CS_missing_flag suggests structural differences between eras influence win totals. Seasons with consistently missing CS and HBP statistics often correspond to early baseball eras with different scoring environments.
Residual Diagnostics:
Residual plots for Model 3 show:
No severe deviation from homoscedasticity
No major nonlinearity
Slight right-tail heaviness due to historically dominant or extremely poor teams
Overall, Model 3 satisfies the assumptions of linear regression reasonably well.
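These visual judgments could be supplemented with a formal test; the sketch below assumes the lmtest package, which is not otherwise used here, and applies the Breusch-Pagan test for non-constant variance to Model 3.
# Sketch (assumes the lmtest package): Breusch-Pagan test for heteroscedasticity
# in Model 3; a small p-value would indicate non-constant error variance.
library(lmtest)
bptest(model3)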
Final Model Selection:
Based on the combination of statistical fit, interpretability, and diagnostic performance, Model 3 is selected as the final model for generating predictions on the evaluation dataset.
Prediction on Evaluation Data
The final model was applied to the evaluation dataset to produce the required predictions. All preprocessing steps (imputation, flag creation, removal of HBP) were applied in exactly the same manner to ensure consistency between training and evaluation phases.
# Prepare evaluation dataset the same way as training
eval_prep <- eval %>%
mutate(
CS_missing_flag = ifelse(is.na(TEAM_BASERUN_CS), 1, 0),
HBP_missing_flag = ifelse(is.na(TEAM_BATTING_HBP), 1, 0)
)
# Median values from training (must reuse the same!)
eval_prep$TEAM_BATTING_SO[is.na(eval_prep$TEAM_BATTING_SO)] <- median_SO
eval_prep$TEAM_BASERUN_SB[is.na(eval_prep$TEAM_BASERUN_SB)] <- median_SB
eval_prep$TEAM_BASERUN_CS[is.na(eval_prep$TEAM_BASERUN_CS)] <- median_CS
eval_prep$TEAM_FIELDING_DP[is.na(eval_prep$TEAM_FIELDING_DP)] <- median_DP
# Remove HBP
eval_prep <- eval_prep %>% select(-TEAM_BATTING_HBP)
# Generate predictions
eval_predictions <- predict(model3, newdata = eval_prep)
# Output predictions
eval_predictions
## 1 2 3 4 5 6 7 8
## 63.88500 65.25199 74.85439 82.73779 66.65351 70.03722 77.59792 76.97912
## 9 10 11 12 13 14 15 16
## 70.78092 75.26102 71.56801 83.30138 83.00868 82.91262 84.15319 78.25604
## 17 18 19 20 21 22 23 24
## 74.92584 76.03089 NA 91.57733 80.17266 83.74774 81.08379 72.88327
## 25 26 27 28 29 30 31 32
## 78.70952 83.24369 53.22626 74.58547 82.56218 75.03247 90.74521 85.24591
## 33 34 35 36 37 38 39 40
## 82.68205 85.53353 81.66069 88.13159 76.25936 92.30380 86.37980 93.56305
## 41 42 43 44 45 46 47 48
## 83.27825 90.15404 30.36672 98.25560 88.89402 91.74526 97.45026 75.94109
## 49 50 51 52 53 54 55 56
## 70.26364 78.12400 78.56691 86.47939 78.95331 74.15712 76.14853 78.38680
## 57 58 59 60 61 62 63 64
## 91.23945 74.37484 NA NA 86.54689 73.29408 88.32203 83.76137
## 65 66 67 68 69 70 71 72
## 80.55488 92.91393 78.26542 83.25160 NA 87.25126 87.12754 71.19295
## 73 74 75 76 77 78 79 80
## 78.23573 90.61752 82.70314 87.55646 81.45008 83.83044 NA NA
## 81 82 83 84 85 86 87 88
## 84.70123 88.92089 98.00172 75.19974 86.26393 80.00403 82.19768 83.03343
## 89 90 91 92 93 94 95 96
## 86.03662 91.07062 77.57108 85.93783 75.22652 NA NA NA
## 97 98 99 100 101 102 103 104
## 85.26892 102.50087 85.73423 85.66300 79.49749 73.77399 84.16576 84.40752
## 105 106 107 108 109 110 111 112
## 82.14102 71.63406 54.79845 75.20534 83.43820 60.71780 82.77316 82.68222
## 113 114 115 116 117 118 119 120
## 91.96406 90.81708 81.10305 78.37347 86.44194 76.64528 72.09937 72.78095
## 121 122 123 124 125 126 127 128
## 88.87948 NA NA NA 69.45329 86.94728 91.14916 77.82343
## 129 130 131 132 133 134 135 136
## 94.65114 95.21528 88.47653 79.79607 79.58233 86.27235 84.03599 73.18872
## 137 138 139 140 141 142 143 144
## 74.19024 77.31781 83.35097 80.68925 67.61720 NA 89.42830 74.15742
## 145 146 147 148 149 150 151 152
## 70.71473 72.76788 78.66615 78.12988 79.23563 82.99920 83.92705 80.22544
## 153 154 155 156 157 158 159 160
## 33.94049 71.54185 76.57624 70.80430 85.75734 66.07064 94.65883 NA
## 161 162 163 164 165 166 167 168
## 105.18253 106.72490 93.19018 105.14419 98.77269 89.34525 82.46466 81.01257
## 169 170 171 172 173 174 175 176
## 73.56683 81.26047 NA 87.61813 79.75714 92.32573 83.77134 72.71620
## 177 178 179 180 181 182 183 184
## 76.59541 71.89650 74.72673 79.74843 82.72894 89.34115 85.10139 82.64804
## 185 186 187 188 189 190 191 192
## 89.15417 90.60984 86.96337 56.03253 60.27986 111.95783 NA NA
## 193 194 195 196 197 198 199 200
## 76.54033 79.24884 82.52377 70.09574 80.75667 83.89851 79.63129 85.36960
## 201 202 203 204 205 206 207 208
## 78.04402 80.55049 74.54599 87.26287 79.93686 82.66383 78.29142 77.99306
## 209 210 211 212 213 214 215 216
## 79.97541 72.84048 103.64490 92.46113 82.54950 65.85504 69.15800 84.73485
## 217 218 219 220 221 222 223 224
## 80.51923 91.47822 77.12378 78.15944 79.67802 74.18767 79.05521 71.13975
## 225 226 227 228 229 230 231 232
## 85.29711 74.17170 81.98207 76.83028 77.73615 72.03249 NA 91.65060
## 233 234 235 236 237 238 239 240
## 81.29353 90.01358 80.65289 74.50450 83.36153 77.75470 91.35674 73.17063
## 241 242 243 244 245 246 247 248
## 91.23251 87.86502 84.57403 80.92057 60.59645 86.75547 80.97871 86.23291
## 249 250 251 252 253 254 255 256
## 72.74011 80.54389 79.04944 63.32279 92.00855 48.98335 69.24616 76.73078
## 257 258 259
## 80.63381 81.41804 79.45943
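Several of the predictions above are NA, most likely for evaluation rows where a predictor that was never imputed (notably TEAM_PITCHING_SO) is missing. The sketch below shows one way to obtain a complete prediction vector and write it to a file; the median-fill strategy, the output file name, and the P_TARGET_WINS column name are assumptions added for illustration rather than part of the original workflow.
# Sketch: fill any remaining NAs in the evaluation predictors with the
# corresponding training medians, re-predict, and write out the results.
model_vars <- setdiff(names(eval_prep), c("INDEX", "CS_missing_flag", "HBP_missing_flag"))
for (v in model_vars) {
  eval_prep[[v]][is.na(eval_prep[[v]])] <- median(train_prep[[v]], na.rm = TRUE)
}
eval_predictions_complete <- predict(model3, newdata = eval_prep)
write_csv(tibble(INDEX = eval_prep$INDEX,
                 P_TARGET_WINS = round(eval_predictions_complete, 2)),
          "moneyball_eval_predictions.csv")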