DATA621- Homework 1

Background
Data Preparation
Model Building
- Model 1: Full model
Model 1.1: Full model with Log transformation of the skewed predictors
Model 1.2: Full model with Log transformation of the skewed predictors and Weighted Least Squares
Model 1.3: Full model with square root transformation of the skewed predictors
Select Models

Background

In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME DEFINITION THEORETICAL EFFECT

INDEX Identification Variable (do not use) None

TARGET_WINS Number of wins

TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins

TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins

TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins

TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins

TEAM_BATTING_BB Walks by batters Positive Impact on Wins

TEAM_BATTING_HBP Batters hit by pitch (get a free base) Positive Impact on Wins

TEAM_BATTING_SO Strikeouts by batters Negative Impact on Wins

TEAM_BASERUN_SB Stolen bases Positive Impact on Wins

TEAM_BASERUN_CS Caught stealing Negative Impact on Wins

TEAM_FIELDING_E Errors Negative Impact on Wins

TEAM_FIELDING_DP Double Plays Positive Impact on Wins

TEAM_PITCHING_BB Walks allowed Negative Impact on Wins

TEAM_PITCHING_H Hits allowed Negative Impact on Wins

TEAM_PITCHING_HR Homeruns allowed Negative Impact on Wins

TEAM_PITCHING_SO Strikeouts by pitchers Positive Impact on Wins

Data Preparation

1. Importing Libraries

#Import Libraries
library(readr) # to uses read_csv function
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(e1071)  # For skewness function

## Warning: package 'e1071' was built under R version 4.3.3

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.95 loaded

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(ROSE)

## Warning: package 'ROSE' was built under R version 4.3.3

## Loaded ROSE 0.0-4

library(smotefamily)

## Warning: package 'smotefamily' was built under R version 4.3.3

library(dplyr)
library(caret)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.3.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(GGally)

## Warning: package 'GGally' was built under R version 4.3.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

2. Data Ingestion

data_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/moneyball-training-data.csv")

data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/moneyball-evaluation-data.csv")

3. Data Exploration and Cleaning

The MoneyBall dataset contains 2,276 observations and 17 variables, each representing the performance of a baseball team in a given year between 1871 and 2006. The goal is to predict the number of team wins (TARGET_WINS) using available data.

3.1 Summary Statistics and Distribution

The histograms below display the distributions of the numerical features in the dataset. Several variables, such as TEAM_BATTING_2B, TEAM_BATTING_H, and TEAM_FIELDING_DP, appear approximately normally distributed, although TEAM_BATTING_H shows a slight right skew.

Other features exhibit more pronounced skewness and outliers. For example, TEAM_BATTING_BB is slightly left-skewed, while variables like TEAM_BASERUN_CS, TEAM_BASERUN_SB, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO are heavily right-skewed, indicating that a few teams had exceptionally high values in these categories.

Additionally, variables such as TEAM_BATTING_HBP, TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR display bimodal or multimodal distributions, suggesting that team performance in these metrics may differ substantially depending on the year and team.

The target variable TARGET_WINS is continuous and roughly normally distributed, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers, which may affect modeling accuracy.

3.2 Missing Data

Several variables contain missing values, the most significant being TEAM_BATTING_HBP (2085), followed by TEAM_BASERUN_CS (772), TEAM_FIELDING_DP (286), and TEAM_BASERUN_SB (131). The statistics for these weren’t tracked in early baseball (1871-1890s), so missing likely means “0”.This makes historical sense - if stolen bases weren’t recorded, it’s because they didn’t track them, not because the stat is unknown.

# Structure of the data 
str(data_raw)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

# Glimpse of the data
head(data_raw)

##   INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1     1          39           1445             194              39
## 2     2          70           1339             219              22
## 3     3          86           1377             232              35
## 4     4          70           1387             209              38
## 5     5          82           1297             186              27
## 6     6          75           1279             200              36
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
## 6              92             443             973             107
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
## 6              59               NA            1279               92
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168
## 6              443              973             123              149

# Summary
summary(data_raw)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

# Feature Distribution
data_raw |>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  ggtitle("Histograms of Numerical Features")

# Distribution of TARGET_WINS
data_raw |>
  dplyr::select(TARGET_WINS) |>
  ggplot() + 
  aes(x = TARGET_WINS) + 
  geom_histogram(bins= 40,fill = "blue", color = "black") + 
  labs(title = "Distribution of TARGET_WINS", y = "Count") +  
  theme_minimal()

# Count missing values per Variable
data_raw %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))

## # A tibble: 6 × 2
##   variable         missing_count
##   <chr>                    <int>
## 1 TEAM_BATTING_HBP          2085
## 2 TEAM_BASERUN_CS            772
## 3 TEAM_FIELDING_DP           286
## 4 TEAM_BASERUN_SB            131
## 5 TEAM_BATTING_SO            102
## 6 TEAM_PITCHING_SO           102

3.3 Identifying Outliers

The boxplots reveal that TARGET_WINS exhibits no extreme outliers. In contrast, many predictor variables show clear evidence of outliers. Variables such as TEAM_BASERUN_SB, TEAM_BASERUN_BB, TEAM_FIELDING_E, and several pitching and batting statistics have values that extend far beyond the interquartile range, indicating that some teams had exceptionally high or low values in these categories. These outliers could strongly influence regression models if left unaddressed. To mitigate this, we apply various transformations—such as logarithmic and square root—in the modeling phase, with the goal of reducing skewness and the impact of extreme values.

# Boxplot of TARGET_WINS

data_raw |>
  dplyr::select(TARGET_WINS) |>
  ggplot(aes(x = "", y = TARGET_WINS)) + 
  geom_boxplot(fill="pink") + 
  labs(title = "Distribution of TARGET_WINS", y = "Wins") +  
  theme_minimal()

# Boxplots of all predictors 

ggplot(stack(data_raw[,-1]), aes(x = ind, y = values)) + 
  geom_boxplot(fill = "darkgrey") +
  coord_cartesian(ylim = c(0, 1000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x="Predictors")

## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

3.3 Data Cleansing

To prepare the dataset for modeling, we addressed missing values through a combination of variable removal and imputation.

First, we dropped the INDEX variable, as it serves only as a row identifier and holds no predictive value.

Next, we removed the TEAM_BATTING_HBP variable, which contained over 90% missing values. This specific statistic was not recorded in most early baseball seasons, making the variable unsuitable for imputation or inclusion in modeling.

For other variables with moderate missingness — specifically TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, TEAM_BATTING_SO, and TEAM_PITCHING_SO — we applied zero imputation. This decision is informed by historical context: these statistics were often not tracked in earlier years, so the missing values are likely systematic (i.e., due to lack of data collection) rather than random.

Finally, we imputed any remaining missing values across the dataset using the median of each variable to reduce the influence of outliers and skewed distributions.

# Drop the INDEX variable as it has no theoretical effect and it was mentioned to not use it.
# Drop the TEAM_BATTING_HBP variable as its mostly made up of missing values
data_transform <- data_raw |>
    # dplyr::select(-INDEX, TEAM_BATTING_HBP)
    dplyr::select(-c(INDEX, TEAM_BATTING_HBP))


# These stats: missing likely means "0" (not tracked in early baseball), hence mutating to 0
# TEAM_BATTING_SO missing or 0 assumes not recorded
data_transform  <- data_transform  %>%
    mutate(
      # TEAM_BATTING_HBP = ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP),
      TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
      TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
      TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP),
      TEAM_BATTING_SO = ifelse(is.na(TEAM_BATTING_SO), 0, TEAM_BATTING_SO),
      TEAM_PITCHING_SO = ifelse(is.na(TEAM_PITCHING_SO), 0, TEAM_PITCHING_SO)
    )

# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation
data_transform  <- data_transform  |>
 mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))

str(data_transform)

## 'data.frame':    2276 obs. of  15 variables:
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : num  842 1075 917 922 920 ...
##  $ TEAM_BASERUN_SB : num  0 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : num  0 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: num  5456 1082 917 928 920 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: num  0 155 153 156 168 149 186 136 169 159 ...

3.4 Checking for linear relationship

To evaluate the suitability of linear regression, we plotted the linear relationships between each predictor and the target variable (TARGET_WINS) using scatterplots and Pearson correlation coefficients.

TEAM_BATTING_H (r = 0.389) shows the strongest positive correlation with TARGET_WINS, suggesting that teams with more base hits generally win more games — as expected. This is followed by TEAM_BATTING_2B, which also demonstrates a moderately positive relationship, aligning with baseball intuition.

Interestingly, TEAM_BATTING_SO (strikeouts by batters) shows an almost negligible correlation (r = -0.030) with TARGET_WINS. This is somewhat surprising, as one might expect higher strikeout counts to hurt performance. The lack of relationship suggests that this variable is not a strong independent predictor.

Another notable result is TEAM_PITCHING_SO (strikeouts by pitchers), which shows a weak negative correlation with wins (r = -0.075). This result could indicate multicollinearity — for example, if teams that accumulate many strikeouts also give up more hits or walks — or it may suggest data quality issues in historical records.

# Get all predictor variables (excluding TARGET_WINS)
predictors <- names(data_transform)[names(data_transform) != "TARGET_WINS"]

# Create scatterplots for all predictors vs TARGET_WINS
# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))


# Plot each predictor vs TARGET_WINS
for(i in 1:length(predictors)) {
  predictor <- predictors[i]
  correlation <- cor(data_transform[[predictor]], data_transform$TARGET_WINS)
  
  plot(data_transform[[predictor]], data_transform$TARGET_WINS,
       xlab = predictor,
       ylab = "TARGET_WINS",
       main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
       pch = 16,
       col = ifelse(correlation > 0, "darkblue", "darkred"),
       cex = 0.6)
  
  # Add regression line
  abline(lm(data_transform$TARGET_WINS ~ data_transform[[predictor]]), 
         col = "red", lwd = 2)
  
  # Add correlation text
  text(x = quantile(data_transform[[predictor]], 0.05), 
       y = quantile(data_transform$TARGET_WINS, 0.95),
       labels = paste("r =", round(correlation, 3)),
       cex = 0.8, col = "red")
}

# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))

3.5 Observing Correlation

To further assess linear relationships and potential multicollinearity among variables, we examined the correlation matrix of all numeric features using a corrplot and targeted pairwise comparisons.

Several batting and pitching variables are highly correlated with one another: TEAM_BATTING_HR is highly correlated with TEAM_PITCHING_HR, TEAM_BATTING_H is highly correlated with TEAM_PITCHING_H, TEAM_BATTING_BB is highly correlated with TEAM_PITCHING_BB, TEAM_BATTING_SO is highly correlated with TEAM_PITCHING_SO, TEAM_FIELDING_E is highly correlated (negative) with TEAM_FIELDING_DP

These relationships suggest possible multicollinearity, which could impact model stability and interpretability.

We also ranked features by their correlation with TARGET_WINS, identifying the top 5 positive and negative predictors:

Top positively correlated features include:TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_FIELDING_BB, TEAM_PITCHING_HR, and TEAM_BATTING_HR.

Top negatively correlated features include: TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_BATTING_SO.

# Correlation matrix
cor_matrix <- cor(data_transform, use = "complete.obs")
# Plot
corrplot(cor_matrix, 
         method = "circle", 
         type = "upper", 
         tl.cex = 0.5, 
         tl.srt = 45)  # Rotate text diagonally

par(mar = c(1, 1, 4, 1))  # top margin = 4 lines

# Extract correlations between batting and pitching variables
cat("=== MULTICOLLINEARITY ANALYSIS ===\n")

## === MULTICOLLINEARITY ANALYSIS ===

# Batting vs Pitching
multicollin_pairs <- data.frame(
  Batting_Var = c("TEAM_BATTING_HR", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_BATTING_SO"),
  Pitching_Var = c("TEAM_PITCHING_HR", "TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO"),
  Correlation = c(
    cor(data_transform$TEAM_BATTING_HR, data_transform$TEAM_PITCHING_HR),
    cor(data_transform$TEAM_BATTING_H, data_transform$TEAM_PITCHING_H),
    cor(data_transform$TEAM_BATTING_BB, data_transform$TEAM_PITCHING_BB),
    cor(data_transform$TEAM_BATTING_SO, data_transform$TEAM_PITCHING_SO)
  )
)
print(multicollin_pairs)

##       Batting_Var     Pitching_Var Correlation
## 1 TEAM_BATTING_HR TEAM_PITCHING_HR   0.9693714
## 2  TEAM_BATTING_H  TEAM_PITCHING_H   0.3026937
## 3 TEAM_BATTING_BB TEAM_PITCHING_BB   0.4893613
## 4 TEAM_BATTING_SO TEAM_PITCHING_SO   0.4952422

# Fielding
print(cor(data_transform$TEAM_FIELDING_DP, data_transform$TEAM_FIELDING_E))

## [1] -0.7569313

# Get correlations with TARGET_WINS
target_wins_correlation <- cor_matrix[, "TARGET_WINS"]

# Remove TARGET_WINS itself (correlation = 1)
target_wins_correlation  <- target_wins_correlation [names(target_wins_correlation ) != "TARGET_WINS"]

# Top 5 Positive correlations
top5_pos <- sort(target_wins_correlation , decreasing = TRUE) %>%
  head(5)

# Top 5 Negative correlations
top5_neg <- sort(target_wins_correlation , decreasing = FALSE) %>%
  head(5)

# Combine into one dataframe
combined_df <- data.frame(
  Feature = c(names(top5_pos), names(top5_neg)),
  Correlation = c(top5_pos, top5_neg)
)

# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")


ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Positively and Negatively Correlated with target_wins",
       x = "Feature",
       y = "Correlation with target_wins",
       fill = "Direction") +
  theme_minimal()

# The `findCorrelation` function is specifically designed for this
high_corr_list <- findCorrelation(cor_matrix, cutoff = 0.4, verbose = FALSE)
names(data_transform)[high_corr_list]

## [1] "TEAM_BATTING_HR"  "TEAM_PITCHING_HR" "TEAM_BATTING_SO"  "TEAM_FIELDING_E" 
## [5] "TEAM_BATTING_3B"  "TEAM_FIELDING_DP" "TEAM_BATTING_BB"  "TEAM_BATTING_2B" 
## [9] "TEAM_PITCHING_SO"

3.6 Feature Engineering

To address multicollinearity and explore different modeling strategies, several derived datasets were created using feature engineering and variable selection:

Dataset 1: Conservative Reduction

We removed TEAM_PITCHING_HR, which was highly correlated with TEAM_BATTING_HR.

Dataset 2: Moderate Reduction

This version takes a more aggressive stance on multicollinearity. We removed TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_BATTING_SO. These variables showed correlations > 0.4 with other predictors, and their removal helps reduce noise and potential variance.

Dataset 3: Transformed Variables (Net Advantage)

Rather than eliminating variables, this version transforms them into net advantage metrics:

NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO

Dataset 4: Batting-Focused

This subset includes only batting, base running, and fielding statistics — excluding all pitching variables.

Dataset 5: Pitching-Focused

In contrast, this version includes only pitching, base running, and fielding stats — omitting batting features entirely.

# Dataset 1: Conservative - Remove one variable from each highly correlated pair (TEAM_BATTING_HR, TEAM_PITCHING_HR)
data_reduced <- data_transform %>%
  dplyr::select(-TEAM_PITCHING_HR)

# Dataset 2: Moderate - Remove FE, HR and BB and SO pitching (absolute value correlations > 0.4)
data_moderate <- data_transform %>%
  dplyr::select(-TEAM_PITCHING_HR, -TEAM_PITCHING_BB, -TEAM_PITCHING_SO, -TEAM_BATTING_SO, -TEAM_FIELDING_E)

# Dataset 3: Transformed - Create net advantage variables
data_transformed <- data_transform %>%
  mutate(
    NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR,
    NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H,
    NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB,
    NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
  ) %>%
  # dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB, 
  #        TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_FIELDING_E, TEAM_FIELDING_DP,
  #        NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
  dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB, 
         TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP,
         NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)

# Dataset 4 &5 :  Keep only batting OR pitching variables (separate models)
data_batting_only <- data_transform |>
  dplyr::select(TARGET_WINS, starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"), 
         starts_with("TEAM_FIELDING"))

data_pitching_focused <- data_transform |>
  dplyr::select(TARGET_WINS, starts_with("TEAM_PITCHING"), starts_with("TEAM_FIELDING"),
         TEAM_BASERUN_SB, TEAM_BASERUN_CS)  # Keep base running as it's not correlated

3.7 Split Data for validation

To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.

This was done for all engineered versions of the dataset:

train_full / test_full — Full feature set (after cleaning) train_conservative / test_conservative — Reduced set with one variable removed per collinear pair train_moderate / test_moderate — More aggressively reduced set train_transformed / test_transformed — Feature-engineered net-advantage variables train_batting / test_batting — Batting-only feature set train_pitching / test_pitching — Pitching-focused feature set

Each of these splits will be used in the next phase to build and evaluate multiple linear regression models, comparing their performance on both the training and testing sets.

# Split data for validation
set.seed(123)
#n <- nrow(data_transform)
#train_idx <- sample(n, 0.8 * n)

train_idx <- createDataPartition(y = data_transform$TARGET_WINS, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
train_full <- data_transform[train_idx, ]
test_full <- data_transform[-train_idx, ]

train_conservative <- data_reduced[train_idx, ]
test_conservative <- data_reduced[-train_idx, ]

train_moderate <- data_moderate[train_idx, ]
test_moderate <- data_moderate[-train_idx, ]

train_transformed <- data_transformed[train_idx, ]
test_transformed <- data_transformed[-train_idx, ]

train_batting <- data_batting_only[train_idx, ]
test_batting <- data_batting_only[-train_idx, ]

train_pitching <- data_pitching_focused[train_idx, ]
test_pitching<- data_pitching_focused[-train_idx, ]

Model Building

Model 1: Full model

Let’s first start out with a model that includes everything:

# Model 1: Full model (all data including multicollinearity predictors)

model_full <- lm(TARGET_WINS ~ ., data = train_full)

summary(model_full)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.690  -8.522   0.254   8.868  55.697 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30.5795271  5.1389680   5.951 3.20e-09 ***
## TEAM_BATTING_H    0.0430474  0.0039130  11.001  < 2e-16 ***
## TEAM_BATTING_2B  -0.0119177  0.0104750  -1.138  0.25538    
## TEAM_BATTING_3B   0.0506945  0.0189857   2.670  0.00765 ** 
## TEAM_BATTING_HR   0.0681571  0.0310559   2.195  0.02832 *  
## TEAM_BATTING_BB   0.0029672  0.0062916   0.472  0.63725    
## TEAM_BATTING_SO  -0.0135485  0.0022790  -5.945 3.31e-09 ***
## TEAM_BASERUN_SB   0.0058131  0.0046004   1.264  0.20653    
## TEAM_BASERUN_CS  -0.0006132  0.0120640  -0.051  0.95947    
## TEAM_PITCHING_H  -0.0011992  0.0004539  -2.642  0.00831 ** 
## TEAM_PITCHING_HR  0.0019021  0.0276842   0.069  0.94523    
## TEAM_PITCHING_BB  0.0032131  0.0044546   0.721  0.47081    
## TEAM_PITCHING_SO  0.0025115  0.0010104   2.486  0.01302 *  
## TEAM_FIELDING_E  -0.0270029  0.0034348  -7.862 6.47e-15 ***
## TEAM_FIELDING_DP -0.0591386  0.0113452  -5.213 2.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.12 on 1807 degrees of freedom
## Multiple R-squared:  0.2953, Adjusted R-squared:  0.2898 
## F-statistic: 54.08 on 14 and 1807 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_full)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full))
# n = # of observations
n <- nrow(train_full)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_full)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_full
## BP = 259.6, df = 14, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_full)

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.266195         2.518246         2.987128        37.132896 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         6.171798         4.418444         1.873546         1.499667 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         3.948045        30.165917         6.101638         3.903397 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         6.216632         4.030364

######


preds <- predict(model_full, newdata = test_full)
RMSE(preds, test_full$TARGET_WINS)

## [1] 13.81744

R2(preds, test_full$TARGET_WINS)

## [1] 0.2950739

Model 1 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Variance increases then decreases
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is not constant
Standardized residuals vs leverage: Case 2136 is highly influential

Other notes:

Variance Inflation Factor: TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_FIELDING_E have VIF > 5 indicating serious multicollinearity
Breusch-Pagan Test: P value less than 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Next step: Try a log transformation on the skewed predictors

Model 1.1: Full model with Log transformation of the skewed predictors

# Model 1.1: Full model with Log transformation of the skewed predictors

train_full_log <- train_full %>%
  mutate(
    log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
    log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
    log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
    log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
    log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
    log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
    log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
  )

test_full_log <- test_full %>%
  mutate(
    log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
    log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
    log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
    log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
    log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
    log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
    log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
  )

model_full_log <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
  TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +  log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
  log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log)

summary(model_full_log)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + 
##     log_TEAM_PITCHING_SO, data = train_full_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.543  -8.102   0.097   8.322  75.073 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           98.870093  13.622878   7.258 5.82e-13 ***
## TEAM_BATTING_H         0.051890   0.003904  13.292  < 2e-16 ***
## TEAM_BATTING_2B       -0.025986   0.010062  -2.583 0.009885 ** 
## log_TEAM_BATTING_3B    5.652896   1.048372   5.392 7.88e-08 ***
## TEAM_BATTING_HR        0.010862   0.009482   1.146 0.252151    
## TEAM_BATTING_BB        0.016636   0.005009   3.321 0.000915 ***
## log_TEAM_BASERUN_SB   -2.403681   0.330267  -7.278 5.03e-13 ***
## TEAM_BASERUN_CS       -0.003933   0.011857  -0.332 0.740174    
## log_TEAM_FIELDING_E  -15.874758   1.484497 -10.694  < 2e-16 ***
## TEAM_FIELDING_DP      -0.097354   0.010245  -9.502  < 2e-16 ***
## log_TEAM_PITCHING_H   -3.505929   2.039026  -1.719 0.085710 .  
## log_TEAM_PITCHING_BB   3.896707   1.713334   2.274 0.023062 *  
## log_TEAM_PITCHING_SO  -1.790880   0.241812  -7.406 1.98e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.99 on 1809 degrees of freedom
## Multiple R-squared:  0.3092, Adjusted R-squared:  0.3046 
## F-statistic: 67.47 on 12 and 1809 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_full_log)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log))
# n = # of observations
n <- nrow(train_full_log)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_full_log)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_full_log
## BP = 344.74, df = 12, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_full_log)

##       TEAM_BATTING_H      TEAM_BATTING_2B  log_TEAM_BATTING_3B 
##             3.319843             2.372920             2.907026 
##      TEAM_BATTING_HR      TEAM_BATTING_BB  log_TEAM_BASERUN_SB 
##             3.535063             3.995746             1.795107 
##      TEAM_BASERUN_CS  log_TEAM_FIELDING_E     TEAM_FIELDING_DP 
##             1.479500             8.739550             3.356650 
##  log_TEAM_PITCHING_H log_TEAM_PITCHING_BB log_TEAM_PITCHING_SO 
##             4.091596             2.601571             1.405713

######


preds <- predict(model_full_log, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)

## [1] 13.47148

R2(preds, test_full_log$TARGET_WINS)

## [1] 0.3317772

Model 1.1 validity check (linear regression assumptions):

Histogram: Looks approximately normal, very slightly right skewed
Residuals vs Fitted values: Somewhat constant variance (improved) - a few outliers for lower fitted values such as case 1342
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is not constant
Standardized residuals vs leverage: Case 1211 and 1342 are highly influential

Other notes:

Variance Inflation Factor: log_TEAM_FIELDING_E has VIF > 5 indicating serious multicollinearity
Breusch-Pagan Test: P value less than 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Next step: let’s try a log transformation on the skewed predictors, then a WLS to address nonconstant variance

Model 1.2: Full model with Log transformation of the skewed predictors and Weighted Least Squares

# Model 1.2: Full model with Log transformation of the skewed predictors + WLS

# define weights to use
wt <- 1 / lm(abs(model_full_log$residuals) ~ model_full_log$fitted.values)$fitted.values^2

# perform weighted least squares regression
# Fit model
model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
  TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +  log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
  log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)

#view summary of model
summary(model_full_log_wls)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + 
##     log_TEAM_PITCHING_SO, data = train_full_log, weights = wt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5198 -0.7978  0.0165  0.8211  5.2462 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           90.900031  13.897883   6.541 7.95e-11 ***
## TEAM_BATTING_H         0.048887   0.004012  12.184  < 2e-16 ***
## TEAM_BATTING_2B       -0.030259   0.009865  -3.067  0.00219 ** 
## log_TEAM_BATTING_3B    6.846195   1.059650   6.461 1.34e-10 ***
## TEAM_BATTING_HR        0.025509   0.009267   2.753  0.00597 ** 
## TEAM_BATTING_BB        0.015221   0.006320   2.408  0.01612 *  
## log_TEAM_BASERUN_SB   -2.458922   0.340044  -7.231 7.04e-13 ***
## TEAM_BASERUN_CS        0.004310   0.011934   0.361  0.71801    
## log_TEAM_FIELDING_E  -15.235592   1.466696 -10.388  < 2e-16 ***
## TEAM_FIELDING_DP      -0.091278   0.009992  -9.136  < 2e-16 ***
## log_TEAM_PITCHING_H   -2.581261   2.614669  -0.987  0.32367    
## log_TEAM_PITCHING_BB   3.668685   2.594108   1.414  0.15746    
## log_TEAM_PITCHING_SO  -2.043838   0.243612  -8.390  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.278 on 1809 degrees of freedom
## Multiple R-squared:  0.3033, Adjusted R-squared:  0.2987 
## F-statistic: 65.62 on 12 and 1809 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log_wls, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_full_log_wls)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log_wls))
# n = # of observations
n <- nrow(train_full_log)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log_wls)

# Perform Breusch-Pagan Test
bptest(model_full_log_wls)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_full_log_wls
## BP = 6.0351, df = 12, p-value = 0.9143

# Variance Inflation Factor
vif(model_full_log_wls)

##       TEAM_BATTING_H      TEAM_BATTING_2B  log_TEAM_BATTING_3B 
##             4.397868             2.411497             3.068544 
##      TEAM_BATTING_HR      TEAM_BATTING_BB  log_TEAM_BASERUN_SB 
##             3.535661             6.323528             1.774496 
##      TEAM_BASERUN_CS  log_TEAM_FIELDING_E     TEAM_FIELDING_DP 
##             1.492232             8.555100             3.436528 
##  log_TEAM_PITCHING_H log_TEAM_PITCHING_BB log_TEAM_PITCHING_SO 
##             6.506746             4.628415             1.450357

######


preds <- predict(model_full_log_wls, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)

## [1] 13.55652

R2(preds, test_full_log$TARGET_WINS)

## [1] 0.3232014

Model 1.2 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Somewhat constant variance (improved) - a few outliers for lower fitted values such as case 1342
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes is more flat, yet there are a few outliers
Standardized residuals vs leverage: Zero highly influential cases

Other notes:

Variance Inflation Factor: log_TEAM_FIELDING_E has VIF > 5 indicating serious multicollinearity
Breusch-Pagan Test: P value > 0.05 strongly indicating constant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Next step: let’s try a square root transformation of the skewed predictors

Model 1.3: Full model with square root transformation of the skewed predictors

# Model 1.3: Full model with square root transformation of the skewed predictors

# Transform skewed predictors via sqrt
train_full_sqrt <- train_full %>%
  mutate(
    sqrt_BASERUN_CS        = sqrt(TEAM_BASERUN_CS + 1),
    sqrt_TEAM_BASERUN_SB    = sqrt(TEAM_BASERUN_SB + 1),
    sqrt_TEAM_BATTING_3B     = sqrt(TEAM_BATTING_3B + 1),
    sqrt_TEAM_FIELDING_E     = sqrt(TEAM_FIELDING_E + 1),
    sqrt_TEAM_PITCHING_BB    = sqrt(TEAM_PITCHING_BB + 1),
    sqrt_TEAM_PITCHING_H     = sqrt(TEAM_PITCHING_H + 1),
    sqrt_TEAM_PITCHING_SO    = sqrt(TEAM_PITCHING_SO + 1)
    # … you can include more if you have other skewed predictors …
  )

test_full_sqrt <- test_full %>%
  mutate(
    sqrt_BASERUN_CS        = sqrt(TEAM_BASERUN_CS + 1),
    sqrt_TEAM_BASERUN_SB    = sqrt(TEAM_BASERUN_SB + 1),
    sqrt_TEAM_BATTING_3B     = sqrt(TEAM_BATTING_3B + 1),
    sqrt_TEAM_FIELDING_E     = sqrt(TEAM_FIELDING_E + 1),
    sqrt_TEAM_PITCHING_BB    = sqrt(TEAM_PITCHING_BB + 1),
    sqrt_TEAM_PITCHING_H     = sqrt(TEAM_PITCHING_H + 1),
    sqrt_TEAM_PITCHING_SO    = sqrt(TEAM_PITCHING_SO + 1)
  )

# Fit model 
model_full_sqrt <- lm(TARGET_WINS ~ 
                        TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
                        TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
                        TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP +
                        sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + sqrt_TEAM_PITCHING_SO,
                      data = train_full_sqrt)

summary(model_full_sqrt)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + 
##     sqrt_TEAM_PITCHING_SO, data = train_full_sqrt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.565  -8.741   0.118   8.600  59.755 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           37.227357   5.279731   7.051 2.52e-12 ***
## TEAM_BATTING_H         0.044714   0.003869  11.558  < 2e-16 ***
## TEAM_BATTING_2B       -0.016441   0.010338  -1.590 0.111912    
## sqrt_TEAM_BATTING_3B   1.400029   0.302612   4.626 3.98e-06 ***
## TEAM_BATTING_HR        0.037551   0.009679   3.880 0.000108 ***
## TEAM_BATTING_BB       -0.005708   0.006589  -0.866 0.386412    
## sqrt_TEAM_BASERUN_SB  -0.218281   0.099544  -2.193 0.028448 *  
## TEAM_BASERUN_CS       -0.011769   0.012138  -0.970 0.332374    
## sqrt_TEAM_FIELDING_E  -1.502950   0.162193  -9.266  < 2e-16 ***
## TEAM_FIELDING_DP      -0.084372   0.011440  -7.375 2.48e-13 ***
## sqrt_TEAM_PITCHING_H  -0.106645   0.075170  -1.419 0.156154    
## sqrt_TEAM_PITCHING_BB  0.723495   0.226509   3.194 0.001427 ** 
## sqrt_TEAM_PITCHING_SO -0.231261   0.056310  -4.107 4.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.18 on 1809 degrees of freedom
## Multiple R-squared:  0.2883, Adjusted R-squared:  0.2836 
## F-statistic: 61.07 on 12 and 1809 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_full_sqrt)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_sqrt))
# n = # of observations
n <- nrow(train_full_sqrt)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_sqrt)

# Perform Breusch-Pagan Test
bptest(model_full_sqrt)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_full_sqrt
## BP = 312.21, df = 12, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_full_sqrt)

##        TEAM_BATTING_H       TEAM_BATTING_2B  sqrt_TEAM_BATTING_3B 
##              3.164723              2.431249              3.041022 
##       TEAM_BATTING_HR       TEAM_BATTING_BB  sqrt_TEAM_BASERUN_SB 
##              3.575573              6.710071              1.739953 
##       TEAM_BASERUN_CS  sqrt_TEAM_FIELDING_E      TEAM_FIELDING_DP 
##              1.504790              8.178851              4.062051 
##  sqrt_TEAM_PITCHING_H sqrt_TEAM_PITCHING_BB sqrt_TEAM_PITCHING_SO 
##              5.038932              5.006073              2.136819

######

# Predict on test set
preds_sqrt <- predict(model_full_sqrt, newdata = test_full_sqrt)

rmse_sqrt <- RMSE(preds_sqrt, test_full_sqrt$TARGET_WINS)
r2_sqrt   <- R2(preds_sqrt, test_full_sqrt$TARGET_WINS)

rmse_sqrt

## [1] 13.6536

r2_sqrt

## [1] 0.3144132

Model 1.3 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Somewhat constant variance (improved) - a few outliers for lower fitted values such as case 1342
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is not constan
Standardized residuals vs leverage: Case 1342 Cook’s distance > 0.5

Other notes:

Variance Inflation Factor: TEAM_BATTING_BB, sqrt_TEAM_FIELDING_E, sqrt_TEAM_PITCHING_BB have VIF > 5 indicating serious multicollinearity
Breusch-Pagan Test: P value less than 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Best model 1 version:

Next step: let’s address multicollinearity by using the train_moderate dataset (removed highly correlated predictors)

Model 2: Moderate model

# Model 2: Moderate dataset model (removed highly correlated predictors)

# Fit model 
model_moderate <- lm(TARGET_WINS ~ ., data = train_moderate)

#view summary of model
summary(model_moderate)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_moderate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.909  -8.995   0.495   8.905  76.870 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.0807986  3.9865298   1.274  0.20265    
## TEAM_BATTING_H    0.0427033  0.0035658  11.976  < 2e-16 ***
## TEAM_BATTING_2B  -0.0058880  0.0101635  -0.579  0.56244    
## TEAM_BATTING_3B   0.0544292  0.0191381   2.844  0.00450 ** 
## TEAM_BATTING_HR   0.0471021  0.0085870   5.485 4.71e-08 ***
## TEAM_BATTING_BB   0.0173697  0.0036085   4.814 1.61e-06 ***
## TEAM_BASERUN_SB   0.0126187  0.0045981   2.744  0.00612 ** 
## TEAM_BASERUN_CS  -0.0133792  0.0118269  -1.131  0.25810    
## TEAM_PITCHING_H  -0.0017073  0.0002894  -5.900 4.33e-09 ***
## TEAM_FIELDING_DP -0.0004111  0.0093700  -0.044  0.96501    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.54 on 1812 degrees of freedom
## Multiple R-squared:  0.2479, Adjusted R-squared:  0.2442 
## F-statistic: 66.38 on 9 and 1812 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_moderate)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate))
# n = # of observations
n <- nrow(train_moderate)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

anova(model_moderate)

## Analysis of Variance Table
## 
## Response: TARGET_WINS
##                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## TEAM_BATTING_H      1  63883   63883 348.5795 < 2.2e-16 ***
## TEAM_BATTING_2B     1   3416    3416  18.6403 1.664e-05 ***
## TEAM_BATTING_3B     1      1       1   0.0045  0.946460    
## TEAM_BATTING_HR     1  18718   18718 102.1338 < 2.2e-16 ***
## TEAM_BATTING_BB     1  14674   14674  80.0688 < 2.2e-16 ***
## TEAM_BASERUN_SB     1   1828    1828   9.9768  0.001611 ** 
## TEAM_BASERUN_CS     1    168     168   0.9144  0.339072    
## TEAM_PITCHING_H     1   6793    6793  37.0689 1.389e-09 ***
## TEAM_FIELDING_DP    1      0       0   0.0019  0.965006    
## Residuals        1812 332078     183                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Perform Breusch-Pagan Test
bptest(model_moderate)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_moderate
## BP = 269.1, df = 9, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_moderate)

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.548488         2.227575         2.852050         2.667503 
##  TEAM_BATTING_BB  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_H 
##         1.907658         1.758721         1.354279         1.508270 
## TEAM_FIELDING_DP 
##         2.583185

######

# Predict on test set
preds_moderate <- predict(model_moderate, newdata = test_moderate)

rmse_moderate <- RMSE(preds_moderate, test_moderate$TARGET_WINS)
r2_moderate   <- R2(preds_moderate, test_moderate$TARGET_WINS)

rmse_moderate

## [1] 14.01101

r2_moderate

## [1] 0.2769754

Model 2 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Variance increases then decreases with outlier case 1342
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is not constant
Standardized residuals vs leverage: No highly influential cases

Other notes:

Variance Inflation Factor: Zero predictors showing serious multicollinearity
Breusch-Pagan Test: P value less than 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Next step: let’s address the validity check issues by transforming the skewed predictors

Model 2.1: Moderate dataset model with square root transformation of skewed predictors

# Model 2: Moderate dataset model + transformation of skewed predictors

# Transform skewed predictors via sqrt
train_moderate_sqrt <- train_moderate %>%
  mutate(
    sqrt_BASERUN_CS        = sqrt(TEAM_BASERUN_CS + 1),
    sqrt_TEAM_BASERUN_SB    = sqrt(TEAM_BASERUN_SB + 1),
    sqrt_TEAM_BATTING_3B     = sqrt(TEAM_BATTING_3B + 1),
    sqrt_TEAM_PITCHING_H     = sqrt(TEAM_PITCHING_H + 1),
    # … you can include more if you have other skewed predictors …
  )

test_moderate_sqrt <- test_moderate %>%
  mutate(
    sqrt_BASERUN_CS        = sqrt(TEAM_BASERUN_CS + 1),
    sqrt_TEAM_BASERUN_SB    = sqrt(TEAM_BASERUN_SB + 1),
    sqrt_TEAM_BATTING_3B     = sqrt(TEAM_BATTING_3B + 1),
    sqrt_TEAM_PITCHING_H     = sqrt(TEAM_PITCHING_H + 1),
  )

# Fit model 
model_moderate_sqrt <- lm(TARGET_WINS ~ 
                        TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
                        TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
                        TEAM_BASERUN_CS + TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H,
                      data = train_moderate_sqrt)

summary(model_moderate_sqrt)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP + 
##     sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.848  -8.959   0.565   9.014  78.010 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.607803   4.249041   1.790  0.07354 .  
## TEAM_BATTING_H        0.045072   0.003747  12.029  < 2e-16 ***
## TEAM_BATTING_2B      -0.007240   0.010193  -0.710  0.47766    
## sqrt_TEAM_BATTING_3B  0.962061   0.302455   3.181  0.00149 ** 
## TEAM_BATTING_HR       0.047035   0.008912   5.278 1.46e-07 ***
## TEAM_BATTING_BB       0.018099   0.003701   4.890 1.10e-06 ***
## sqrt_TEAM_BASERUN_SB  0.107730   0.096868   1.112  0.26623    
## TEAM_BASERUN_CS      -0.011914   0.011972  -0.995  0.31982    
## TEAM_FIELDING_DP     -0.011846   0.008987  -1.318  0.18762    
## sqrt_TEAM_PITCHING_H -0.273777   0.048625  -5.630 2.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.56 on 1812 degrees of freedom
## Multiple R-squared:  0.2454, Adjusted R-squared:  0.2416 
## F-statistic: 65.47 on 9 and 1812 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_full_sqrt)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_moderate_sqrt)

# Perform Breusch-Pagan Test
bptest(model_moderate_sqrt)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_moderate_sqrt
## BP = 272.64, df = 9, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_moderate)

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.548488         2.227575         2.852050         2.667503 
##  TEAM_BATTING_BB  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_H 
##         1.907658         1.758721         1.354279         1.508270 
## TEAM_FIELDING_DP 
##         2.583185

######

# Predict on test set
preds_moderate_sqrt <- predict(model_moderate_sqrt, newdata = test_moderate_sqrt)

rmse_sqrt <- RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_sqrt   <- R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)

rmse_sqrt

## [1] 14.05698

r2_sqrt

## [1] 0.2722703

Model 2.1 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Variance looks somewhat constant
QQ Plot: Looks approximately normal with outlier (case 1342)
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is not constant
Standardized residuals vs leverage: Case 1342 Cook’s distance > 0.5

Other notes:

Variance Inflation Factor: Zero predictors showing serious multicollinearity
Breusch-Pagan Test: P value less than 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Next step: let’s address nonconstant variance by transforming the skewed predictors and using Weighted Least Squares

Model 2.2: Moderate dataset model with transformation of skewed predictors and WLS

# Model 2.2: Moderate dataset model + transformation of skewed predictors + WLS

# define weights to use
wt <- 1 / lm(abs(model_moderate_sqrt$residuals) ~ model_moderate_sqrt$fitted.values)$fitted.values^2

# perform weighted least squares regression
wls_model_moderate_sqrt <- lm(TARGET_WINS ~ 
                        TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
                        TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
                        TEAM_BASERUN_CS + TEAM_FIELDING_DP +
                        sqrt_TEAM_PITCHING_H,
                      data = train_moderate_sqrt, weights=wt)

#view summary of model
summary(wls_model_moderate_sqrt)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP + 
##     sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt, weights = wt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7679 -0.8433  0.0520  0.8563  5.0569 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.431146   4.039826   2.087 0.037027 *  
## TEAM_BATTING_H        0.045416   0.003807  11.928  < 2e-16 ***
## TEAM_BATTING_2B      -0.016131   0.010108  -1.596 0.110706    
## sqrt_TEAM_BATTING_3B  0.984533   0.297333   3.311 0.000947 ***
## TEAM_BATTING_HR       0.051740   0.008678   5.962 2.99e-09 ***
## TEAM_BATTING_BB       0.015460   0.003615   4.277 1.99e-05 ***
## sqrt_TEAM_BASERUN_SB  0.094450   0.097192   0.972 0.331282    
## TEAM_BASERUN_CS      -0.008323   0.011978  -0.695 0.487205    
## TEAM_FIELDING_DP     -0.006103   0.008862  -0.689 0.491124    
## sqrt_TEAM_PITCHING_H -0.254253   0.057049  -4.457 8.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.276 on 1812 degrees of freedom
## Multiple R-squared:  0.2392, Adjusted R-squared:  0.2354 
## F-statistic: 63.28 on 9 and 1812 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = wls_model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(wls_model_moderate_sqrt)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(wls_model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(wls_model_moderate_sqrt)

# Perform Breusch-Pagan Test
bptest(wls_model_moderate_sqrt)

## 
##  studentized Breusch-Pagan test
## 
## data:  wls_model_moderate_sqrt
## BP = 5.3621, df = 9, p-value = 0.8017

# Variance Inflation Factor
vif(wls_model_moderate_sqrt)

##       TEAM_BATTING_H      TEAM_BATTING_2B sqrt_TEAM_BATTING_3B 
##             3.654325             2.307732             3.129800 
##      TEAM_BATTING_HR      TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB 
##             2.849087             1.860144             1.637307 
##      TEAM_BASERUN_CS     TEAM_FIELDING_DP sqrt_TEAM_PITCHING_H 
##             1.386230             2.543128             2.415067

######

# Predict on test set
preds_mod_sqrt <- predict(wls_model_moderate_sqrt, newdata = test_moderate_sqrt)

rmse_mod_wls_sqrt <- RMSE(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_mod_wls_sqrt   <- R2(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)

rmse_mod_wls_sqrt

## [1] 14.09027

r2_mod_wls_sqrt

## [1] 0.2697522

Model 2.2 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Variance looks somewhat constant
QQ Plot: Looks approximately normal
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward slightly (improved) indicating variance is slightly nonconstant
Standardized residuals vs leverage: Looks good

Other notes:

Variance Inflation Factor: Zero predictors showing serious multicollinearity
Breusch-Pagan Test: P value > 0.05 strongly indicating constant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Model 3: Transformed data (net advantage variables)

# Model 3: Transformed data (net advantage variables)

model_transformed <- lm(TARGET_WINS ~ ., data = train_transformed)

summary(model_transformed)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.181  -9.836   0.120   9.254  50.919 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         65.8307062  3.1039480  21.209  < 2e-16 ***
## TEAM_BATTING_2B      0.0881661  0.0076316  11.553  < 2e-16 ***
## TEAM_BATTING_3B      0.1179503  0.0159613   7.390 2.23e-13 ***
## TEAM_BASERUN_SB      0.0068615  0.0047736   1.437   0.1508    
## TEAM_BASERUN_CS     -0.0183701  0.0123926  -1.482   0.1384    
## TEAM_FIELDING_E     -0.0283239  0.0033600  -8.430  < 2e-16 ***
## TEAM_FIELDING_DP    -0.0503107  0.0120314  -4.182 3.03e-05 ***
## NET_HR_ADVANTAGE    -0.0568972  0.0289452  -1.966   0.0495 *  
## NET_HITS_ADVANTAGE   0.0003005  0.0004777   0.629   0.5294    
## NET_WALKS_ADVANTAGE -0.0046571  0.0046944  -0.992   0.3213    
## NET_SO_ADVANTAGE    -0.0010532  0.0010548  -0.999   0.3182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.1 on 1811 degrees of freedom
## Multiple R-squared:  0.185,  Adjusted R-squared:  0.1805 
## F-statistic: 41.12 on 10 and 1811 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_transformed, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_transformed)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_transformed))
# n = # of observations
n <- nrow(train_transformed)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_transformed)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_transformed
## BP = 234.15, df = 10, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_transformed)

##     TEAM_BATTING_2B     TEAM_BATTING_3B     TEAM_BASERUN_SB     TEAM_BASERUN_CS 
##            1.158376            1.829650            1.748234            1.371398 
##     TEAM_FIELDING_E    TEAM_FIELDING_DP    NET_HR_ADVANTAGE  NET_HITS_ADVANTAGE 
##            5.155289            3.928120            1.704036            3.590426 
## NET_WALKS_ADVANTAGE    NET_SO_ADVANTAGE 
##            4.945062            2.896198

######


preds <- predict(model_transformed, newdata = test_transformed)
RMSE(preds, test_transformed$TARGET_WINS)

## [1] 14.6851

R2(preds, test_transformed$TARGET_WINS)

## [1] 0.2047218

Model 3 validity check (linear regression assumptions):

Histogram: Looks approximately normal
Residuals vs Fitted values: Variance looks somewhat constant
QQ Plot: Looks approximately normal
Sq Rt Standardized Residuals vs Fitted values: Red line slopes downward indicating variance is nonconstant
Standardized residuals vs leverage: Case 2136 Cook’s distance > 0.5

Other notes:

Variance Inflation Factor: TEAM_FIELDING_E showing serious multicollinearity
Breusch-Pagan Test: P value < 0.05 indicating nonconstant variance
Some of the coefficient signs don’t make sense such as TEAM_BATTING_2B

Model 4: (batting variables only)

# Model 4 (batting variables only)

model_batting <- lm(TARGET_WINS ~ ., data = train_batting)

summary(model_batting)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.421  -8.402   0.373   8.748  58.784 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      33.431575   4.965568   6.733 2.23e-11 ***
## TEAM_BATTING_H    0.039843   0.003821  10.428  < 2e-16 ***
## TEAM_BATTING_2B  -0.007248   0.010340  -0.701  0.48343    
## TEAM_BATTING_3B   0.058620   0.018496   3.169  0.00155 ** 
## TEAM_BATTING_HR   0.063377   0.010237   6.191 7.39e-10 ***
## TEAM_BATTING_BB   0.006910   0.003678   1.879  0.06042 .  
## TEAM_BATTING_SO  -0.009689   0.001948  -4.974 7.19e-07 ***
## TEAM_BASERUN_SB   0.004604   0.004567   1.008  0.31359    
## TEAM_BASERUN_CS  -0.006115   0.011971  -0.511  0.60951    
## TEAM_FIELDING_E  -0.030461   0.002789 -10.922  < 2e-16 ***
## TEAM_FIELDING_DP -0.066330   0.011165  -5.941 3.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.17 on 1811 degrees of freedom
## Multiple R-squared:  0.2884, Adjusted R-squared:  0.2845 
## F-statistic:  73.4 on 10 and 1811 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_transformed)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting))
# n = # of observations
n <- nrow(train_batting)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_batting)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_batting
## BP = 258.36, df = 10, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_batting)

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.090758         2.435330         2.813750         4.004748 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         2.093224         3.204192         1.832782         1.465501 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         4.067925         3.874389

######


preds <- predict(model_batting, newdata = test_batting)
RMSE(preds, test_batting$TARGET_WINS)

## [1] 13.7715

R2(preds, test_batting$TARGET_WINS)

## [1] 0.30035

Model 4.1 (batting variables only) Elastic Net

# Model 4.1 (batting variables only) Elastic Net

# Elastic Net Regression
set.seed(786)
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)

#pre process
pre_process <- c("center","scale")
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))


enet_model <- train(TARGET_WINS ~ ., data = train_batting,
                 method = "enet",
                 tuneGrid = enetGrid,
                 trControl = ctrl,
                 preProcess = pre_process)

# Test predictors and response
testX <- test_batting %>% dplyr::select(-TARGET_WINS)
testY <- test_batting$TARGET_WINS

enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)

enet_model

## Elasticnet 
## 
## 1822 samples
##   10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1457, 1459, 1457, 1458, 1457 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE     
##   0.00    0.05      15.11660  0.1462023  11.85688
##   0.00    0.10      14.77552  0.1693321  11.58652
##   0.00    0.15      14.48511  0.1964809  11.37373
##   0.00    0.20      14.23495  0.2152453  11.20424
##   0.00    0.25      14.02034  0.2262527  11.05617
##   0.00    0.30      13.84813  0.2323982  10.93107
##   0.00    0.35      13.71765  0.2391241  10.83505
##   0.00    0.40      13.63134  0.2431921  10.76693
##   0.00    0.45      13.57522  0.2468239  10.71146
##   0.00    0.50      13.52140  0.2517718  10.66110
##   0.00    0.55      13.46687  0.2572057  10.61409
##   0.00    0.60      13.41047  0.2629063  10.56656
##   0.00    0.65      13.36064  0.2676903  10.52504
##   0.00    0.70      13.31746  0.2716252  10.48747
##   0.00    0.75      13.28100  0.2747834  10.45551
##   0.00    0.80      13.25307  0.2770476  10.42973
##   0.00    0.85      13.23344  0.2785247  10.40983
##   0.00    0.90      13.22041  0.2794839  10.39616
##   0.00    0.95      13.21376  0.2800335  10.38778
##   0.00    1.00      13.21325  0.2801650  10.38311
##   0.01    0.05      15.12632  0.1462023  11.86470
##   0.01    0.10      14.79147  0.1672191  11.59853
##   0.01    0.15      14.50468  0.1948826  11.38754
##   0.01    0.20      14.25854  0.2136086  11.21962
##   0.01    0.25      14.04484  0.2251521  11.07331
##   0.01    0.30      13.87088  0.2313519  10.94667
##   0.01    0.35      13.73914  0.2378597  10.85025
##   0.01    0.40      13.64661  0.2424108  10.77878
##   0.01    0.45      13.58787  0.2458208  10.72150
##   0.01    0.50      13.53528  0.2504010  10.67004
##   0.01    0.55      13.48144  0.2556951  10.62293
##   0.01    0.60      13.42507  0.2614221  10.57553
##   0.01    0.65      13.37461  0.2663131  10.53323
##   0.01    0.70      13.33046  0.2703870  10.49532
##   0.01    0.75      13.29268  0.2737109  10.46208
##   0.01    0.80      13.26137  0.2763500  10.43404
##   0.01    0.85      13.23959  0.2780486  10.41194
##   0.01    0.90      13.22402  0.2792326  10.39587
##   0.01    0.95      13.21607  0.2798368  10.38747
##   0.01    1.00      13.21348  0.2801348  10.38133
##   0.10    0.05      15.17135  0.1462023  11.90093
##   0.10    0.10      14.86332  0.1575644  11.65326
##   0.10    0.15      14.59599  0.1871904  11.45272
##   0.10    0.20      14.36992  0.2047548  11.29213
##   0.10    0.25      14.16348  0.2189006  11.15613
##   0.10    0.30      13.98745  0.2270224  11.03189
##   0.10    0.35      13.84477  0.2321631  10.92584
##   0.10    0.40      13.73567  0.2374867  10.84510
##   0.10    0.45      13.66066  0.2416122  10.78264
##   0.10    0.50      13.60013  0.2457268  10.72409
##   0.10    0.55      13.54813  0.2490694  10.66892
##   0.10    0.60      13.50169  0.2530986  10.62254
##   0.10    0.65      13.45498  0.2575167  10.58096
##   0.10    0.70      13.41150  0.2616107  10.54383
##   0.10    0.75      13.37302  0.2651171  10.50925
##   0.10    0.80      13.33949  0.2681004  10.48044
##   0.10    0.85      13.31086  0.2706215  10.45644
##   0.10    0.90      13.28736  0.2726941  10.43577
##   0.10    0.95      13.26903  0.2743636  10.41721
##   0.10    1.00      13.25643  0.2756133  10.40289
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.

enet_resample

##     RMSE Rsquared      MAE 
## 13.77150  0.30035 10.50955

Model 5: (pitching variables only)

# Model 5: (pitching variables only)

model_pitching <- lm(TARGET_WINS ~ ., data = train_pitching)

summary(model_pitching)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_pitching)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.816  -9.648   0.389   9.423  65.280 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      84.6752515  2.8006028  30.235  < 2e-16 ***
## TEAM_PITCHING_H   0.0006886  0.0004335   1.588   0.1124    
## TEAM_PITCHING_HR  0.0487187  0.0070549   6.906 6.89e-12 ***
## TEAM_PITCHING_BB  0.0150764  0.0026860   5.613 2.30e-08 ***
## TEAM_PITCHING_SO -0.0054353  0.0007213  -7.536 7.63e-14 ***
## TEAM_FIELDING_E  -0.0245325  0.0036059  -6.804 1.38e-11 ***
## TEAM_FIELDING_DP -0.0690500  0.0124265  -5.557 3.16e-08 ***
## TEAM_BASERUN_SB   0.0101588  0.0049969   2.033   0.0422 *  
## TEAM_BASERUN_CS  -0.0210745  0.0127787  -1.649   0.0993 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.53 on 1813 degrees of freedom
## Multiple R-squared:  0.1329, Adjusted R-squared:  0.1291 
## F-statistic: 34.74 on 8 and 1813 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_pitching, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_transformed)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_pitching))
# n = # of observations
n <- nrow(model_pitching)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_pitching)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_pitching
## BP = 486.34, df = 8, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_pitching)

##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         2.937196         1.597455         1.808918         1.622142 
##  TEAM_FIELDING_E TEAM_FIELDING_DP  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         5.586704         3.942808         1.802445         1.372055

######


preds <- predict(model_pitching, newdata = test_pitching)
RMSE(preds, test_pitching$TARGET_WINS)

## [1] 15.71512

R2(preds, test_pitching$TARGET_WINS)

## [1] 0.09207141

Model 6: Conservative dataset

# Model 6: Conservative dataset

model_conservative <- lm(TARGET_WINS ~ ., data = train_conservative)

summary(model_conservative)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_conservative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.696  -8.523   0.255   8.873  55.714 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30.5529851  5.1230167   5.964 2.96e-09 ***
## TEAM_BATTING_H    0.0430641  0.0039045  11.029  < 2e-16 ***
## TEAM_BATTING_2B  -0.0119349  0.0104691  -1.140  0.25443    
## TEAM_BATTING_3B   0.0509068  0.0187274   2.718  0.00662 ** 
## TEAM_BATTING_HR   0.0701690  0.0103410   6.786 1.56e-11 ***
## TEAM_BATTING_BB   0.0028022  0.0058134   0.482  0.62985    
## TEAM_BATTING_SO  -0.0135210  0.0022428  -6.029 2.00e-09 ***
## TEAM_BASERUN_SB   0.0057933  0.0045900   1.262  0.20706    
## TEAM_BASERUN_CS  -0.0005916  0.0120567  -0.049  0.96087    
## TEAM_PITCHING_H  -0.0011970  0.0004527  -2.645  0.00825 ** 
## TEAM_PITCHING_BB  0.0033526  0.0039644   0.846  0.39784    
## TEAM_PITCHING_SO  0.0024874  0.0009472   2.626  0.00871 ** 
## TEAM_FIELDING_E  -0.0270055  0.0034337  -7.865 6.30e-15 ***
## TEAM_FIELDING_DP -0.0591481  0.0113413  -5.215 2.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.12 on 1808 degrees of freedom
## Multiple R-squared:  0.2953, Adjusted R-squared:  0.2902 
## F-statistic: 58.28 on 13 and 1808 DF,  p-value: < 2.2e-16

######

# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_conservative, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")

plot(model_conservative)

# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_conservative))
# n = # of observations
n <- nrow(train_transformed)

# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n

# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)

# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
       col = "red", lty = 2, bty = "n")

# anova(model_full_log)

# Perform Breusch-Pagan Test
bptest(model_conservative)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_conservative
## BP = 258.82, df = 13, p-value < 2.2e-16

# Variance Inflation Factor
vif(model_conservative)

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.253674         2.516804         2.908010         4.119362 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         5.272319         4.281437         1.866157         1.498655 
##  TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         3.929357         4.835196         3.432345         6.215882 
## TEAM_FIELDING_DP 
##         4.029774

######


preds <- predict(model_conservative, newdata = test_conservative)
RMSE(preds, test_conservative$TARGET_WINS)

## [1] 13.81767

R2(preds, test_conservative$TARGET_WINS)

## [1] 0.2950506

Select Models

After evaluating various different linear regression models using key performance metrics—including RMSE, R², MSE, and the F-statistic—Model 1.1 (Full + Log) emerged as the most robust and reliable choice. It demonstrated the best predictive accuracy (lowest RMSE and MSE), the highest proportion of variance explained (R² = 0.332), and a strong F-statistic, indicating overall model significance. The use of log-transformed predictors improved model fit and likely addressed issues related to skewness or heteroscedasticity.

While Model 1.2 (WLS) showed similar metrics and better residual plots, its complexity made interpretation more difficult. Model 4, a simpler alternative, had slightly weaker performance. Overall, Model 1.1 offers the best balance between accuracy and interpretability and is selected as the final model for prediction.

model_results <- data.frame(
  Model = c("Model 1: Full",
            "Model 1.1: Full + Log",
            "Model 1.2: Full + Log + WLS",
            "Model 1.3: Full + Sqrt",
            "Model 2: Moderate",
            "Model 2.1: Moderate + Sqrt",
            "Model 2.2: Moderate + Sqrt + WLS",
            "Model 3: Transformed Advantage",
            "Model 4: Batting",
            "Model 4.1: Batting (Elastic Net)",
            "Model 5: Pitching",
            "Model 6: Conservative"),
  
  RMSE = c(
    RMSE(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
    RMSE(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
    RMSE(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
    rmse_sqrt,
    rmse_moderate,
    RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
    rmse_mod_wls_sqrt,
    RMSE(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
    RMSE(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
    enet_resample["RMSE"],
    RMSE(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
    RMSE(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
  ),
  
  R2 = c(
    R2(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
    R2(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
    R2(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
    r2_sqrt,
    r2_moderate,
    R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
    r2_mod_wls_sqrt,
    R2(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
    R2(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
    enet_resample["Rsquared"],
    R2(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
    R2(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
  )
)

# Add MSE column
model_results$MSE <- model_results$RMSE^2

# Add F-statistic column (NA for models where it doesn’t apply: Elastic Net)
model_results$F_statistic <- c(
  summary(model_full)$fstatistic[1],
  summary(model_full_log)$fstatistic[1],
  summary(model_full_log_wls)$fstatistic[1],
  summary(model_full_sqrt)$fstatistic[1],
  summary(model_moderate)$fstatistic[1],
  summary(model_moderate_sqrt)$fstatistic[1],
  summary(wls_model_moderate_sqrt)$fstatistic[1],
  summary(model_transformed)$fstatistic[1],
  summary(model_batting)$fstatistic[1],
  NA,  # Elastic Net model doesn’t have F-statistic in the same form
  summary(model_pitching)$fstatistic[1],
  summary(model_conservative)$fstatistic[1]
)

# View the updated table
print(model_results)

##                               Model     RMSE         R2      MSE F_statistic
## 1                     Model 1: Full 13.81744 0.29507385 190.9215    54.08500
## 2             Model 1.1: Full + Log 13.47148 0.33177723 181.4809    67.47286
## 3       Model 1.2: Full + Log + WLS 13.55652 0.32320140 183.7792    65.61963
## 4            Model 1.3: Full + Sqrt 14.05698 0.27227029 197.5987    61.07278
## 5                 Model 2: Moderate 14.01101 0.27697537 196.3085    66.37656
## 6        Model 2.1: Moderate + Sqrt 14.05698 0.27227029 197.5987    65.47077
## 7  Model 2.2: Moderate + Sqrt + WLS 14.09027 0.26975219 198.5357    63.28447
## 8    Model 3: Transformed Advantage 14.68510 0.20472177 215.6522    41.11853
## 9                  Model 4: Batting 13.77150 0.30035002 189.6543    73.39906
## 10 Model 4.1: Batting (Elastic Net) 13.77150 0.30035002 189.6543          NA
## 11                Model 5: Pitching 15.71512 0.09207141 246.9649    34.74147
## 12            Model 6: Conservative 13.81767 0.29505061 190.9280    58.27710

##Final Model and Predictions

After selecting the Full + Log model (Model 1.1) as the best-performing model, we applied it to the evaluation dataset. Log transformations were applied to the relevant predictor variables to match the training data. The model was then used to predict the number of team wins.

data_eval$log_TEAM_BATTING_3B <- log(data_eval$TEAM_BATTING_3B + 1)
data_eval$log_BASERUN_CS <- log(data_eval$TEAM_BASERUN_CS + 1)
data_eval$log_TEAM_BASERUN_SB <- log(data_eval$TEAM_BASERUN_SB + 1)
data_eval$log_TEAM_FIELDING_E <- log(data_eval$TEAM_FIELDING_E + 1)
data_eval$log_TEAM_PITCHING_BB <- log(data_eval$TEAM_PITCHING_BB + 1)
data_eval$log_TEAM_PITCHING_H <- log(data_eval$TEAM_PITCHING_H + 1)
data_eval$log_TEAM_PITCHING_SO <- log(data_eval$TEAM_PITCHING_SO + 1)

# Predict
data_eval$predicted_wins <- predict(model_full_log, newdata = data_eval)

hist(data_eval$predicted_wins,
     breaks = 20,
     main = "Distribution of Predicted Wins",
     xlab = "Predicted Wins",
     col = "lightpink",
     border = "white")

# Preview predictions
head(data_eval[, "predicted_wins", drop = FALSE], 4)

##   predicted_wins
## 1       67.94843
## 2       70.51659
## 3       76.91473
## 4       82.01534

DATA621- Homework 1

Dhanya Nair, Nakesha Fray, and Gillian McGovern

2025-09-13

Background

Data Preparation

1. Importing Libraries

2. Data Ingestion

3. Data Exploration and Cleaning

3.1 Summary Statistics and Distribution

3.2 Missing Data

3.3 Identifying Outliers

3.3 Data Cleansing

3.4 Checking for linear relationship

3.5 Observing Correlation

3.6 Feature Engineering

3.7 Split Data for validation

Model Building

Model 1: Full model

Model 1.1: Full model with Log transformation of the skewed predictors

Model 1.2: Full model with Log transformation of the skewed predictors and Weighted Least Squares

Model 1.3: Full model with square root transformation of the skewed predictors

Model 2: Moderate model

Model 2.1: Moderate dataset model with square root transformation of skewed predictors

Model 2.2: Moderate dataset model with transformation of skewed predictors and WLS

Model 3: Transformed data (net advantage variables)

Model 4: (batting variables only)

Model 4.1 (batting variables only) Elastic Net

Model 5: (pitching variables only)

Model 6: Conservative dataset

Select Models