In this homework assignment, we will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and includes the team's performance for that year, with all statistics adjusted to a 162-game season.
Our objective is to build a multiple linear regression model on the training data to predict the number of wins for each team. We can only use the variables provided (or variables that we derive from them). Below is a short description of the variables of interest in the data set:
VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_WINS Number of wins
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins
TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins
TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins
TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins
TEAM_BATTING_BB Walks by batters Positive Impact on Wins
TEAM_BATTING_HBP Batters hit by pitch (get a free base) Positive Impact on Wins
TEAM_BATTING_SO Strikeouts by batters Negative Impact on Wins
TEAM_BASERUN_SB Stolen bases Positive Impact on Wins
TEAM_BASERUN_CS Caught stealing Negative Impact on Wins
TEAM_FIELDING_E Errors Negative Impact on Wins
TEAM_FIELDING_DP Double Plays Positive Impact on Wins
TEAM_PITCHING_BB Walks allowed Negative Impact on Wins
TEAM_PITCHING_H Hits allowed Negative Impact on Wins
TEAM_PITCHING_HR Homeruns allowed Negative Impact on Wins
TEAM_PITCHING_SO Strikeouts by pitchers Positive Impact on Wins
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(dplyr)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/moneyball-training-data.csv")
data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/moneyball-evaluation-data.csv")
The MoneyBall dataset contains 2,276 observations and 17 variables, each representing the performance of a baseball team in a given year between 1871 and 2006. The goal is to predict the number of team wins (TARGET_WINS) using available data.
The histograms below display the distributions of the numerical features in the dataset. Several variables, such as TEAM_BATTING_2B, TEAM_BATTING_H, and TEAM_FIELDING_DP, appear approximately normally distributed, although TEAM_BATTING_H shows a slight right skew.
Other features exhibit more pronounced skewness and outliers. For example, TEAM_BATTING_BB is slightly left-skewed, while variables like TEAM_BASERUN_CS, TEAM_BASERUN_SB, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO are heavily right-skewed, indicating that a few teams had exceptionally high values in these categories.
Additionally, variables such as TEAM_BATTING_HBP, TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR display bimodal or multimodal distributions, suggesting that team performance in these metrics may differ substantially depending on the year and team.
The target variable TARGET_WINS is continuous and roughly normally distributed, which confirms that we are dealing with a regression problem. As a first approach, linear regression is appropriate and may provide reasonable performance. However, not all predictors are symmetric or linearly related to the target, and some exhibit skewness or potential outliers, which may affect modeling accuracy.
Several variables contain missing values, the most significant being TEAM_BATTING_HBP (2085 missing), followed by TEAM_BASERUN_CS (772), TEAM_FIELDING_DP (286), and TEAM_BASERUN_SB (131). These statistics were not tracked in early baseball (1871-1890s), so a missing value likely means "0". This makes historical sense: if stolen bases weren't recorded, it's because they weren't tracked, not because the statistic is unknown.
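One informal way to probe this interpretation is to check whether the high-missingness variables tend to be missing on the same rows, which is what a shared era effect would produce. The following is a small sketch on the raw data, not part of the original assignment output:
# Check whether the high-missingness variables are missing together (consistent with an era effect)
data_raw %>%
  summarise(
    cs_missing         = sum(is.na(TEAM_BASERUN_CS)),
    hbp_missing        = sum(is.na(TEAM_BATTING_HBP)),
    cs_and_hbp_missing = sum(is.na(TEAM_BASERUN_CS) & is.na(TEAM_BATTING_HBP))
  )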
# Structure of the data
str(data_raw)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
# Glimpse of the data
head(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
# Summary
summary(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
# Feature Distribution
data_raw |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
# Distribution of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot() +
aes(x = TARGET_WINS) +
geom_histogram(bins= 40,fill = "blue", color = "black") +
labs(title = "Distribution of TARGET_WINS", y = "Count") +
theme_minimal()
# Count missing values per Variable
data_raw %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 6 × 2
## variable missing_count
## <chr> <int>
## 1 TEAM_BATTING_HBP 2085
## 2 TEAM_BASERUN_CS 772
## 3 TEAM_FIELDING_DP 286
## 4 TEAM_BASERUN_SB 131
## 5 TEAM_BATTING_SO 102
## 6 TEAM_PITCHING_SO 102
The boxplots reveal that TARGET_WINS exhibits no extreme outliers. In contrast, many predictor variables show clear evidence of outliers. Variables such as TEAM_BASERUN_SB, TEAM_PITCHING_BB, TEAM_FIELDING_E, and several other pitching and batting statistics have values that extend far beyond the interquartile range, indicating that some teams had exceptionally high or low values in these categories. These outliers could strongly influence regression models if left unaddressed. To mitigate this, we apply transformations such as logarithmic and square root in the modeling phase, with the goal of reducing skewness and the impact of extreme values.
# Boxplot of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot(aes(x = "", y = TARGET_WINS)) +
geom_boxplot(fill="pink") +
labs(title = "Distribution of TARGET_WINS", y = "Wins") +
theme_minimal()
# Boxplots of all predictors
ggplot(stack(data_raw[,-1]), aes(x = ind, y = values)) +
geom_boxplot(fill = "darkgrey") +
coord_cartesian(ylim = c(0, 1000)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
To prepare the dataset for modeling, we addressed missing values through a combination of variable removal and imputation.
First, we dropped the INDEX variable, as it serves only as a row identifier and holds no predictive value.
Next, we removed the TEAM_BATTING_HBP variable, which contained over 90% missing values. This specific statistic was not recorded in most early baseball seasons, making the variable unsuitable for imputation or inclusion in modeling.
For other variables with moderate missingness — specifically TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_FIELDING_DP — we applied zero imputation. This decision is informed by historical context: these statistics were often not tracked in earlier years, so the missing values are likely systematic (i.e., due to lack of data collection) rather than random.
Finally, we imputed any remaining missing values (TEAM_BATTING_SO and TEAM_PITCHING_SO) across the dataset using the median of each variable to reduce the influence of outliers and skewed distributions.
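As a quick check on the proportions behind these decisions (for example, the claim that TEAM_BATTING_HBP is over 90% missing), the missing counts from earlier can be expressed as percentages. This is a sketch on the raw data:
# Percentage of missing values per variable (TEAM_BATTING_HBP is roughly 92% missing: 2085/2276)
data_raw %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  filter(pct_missing > 0) %>%
  arrange(desc(pct_missing))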
# Drop the INDEX variable as it has no theoretical effect and it was mentioned to not use it.
# Drop the TEAM_BATTING_HBP variable as it is mostly made up of missing values
data_transform <- data_raw |>
# dplyr::select(-INDEX, TEAM_BATTING_HBP)
dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
# For these stats, missing likely means "0" (not tracked in early baseball), hence imputing 0
# Remaining NAs (TEAM_BATTING_SO and TEAM_PITCHING_SO) are handled below with median imputation
data_transform <- data_transform %>%
mutate(
# TEAM_BATTING_HBP = ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP),
TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP)
)
# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation (this applies to TEAM_BATTING_SO and TEAM_PITCHING_SO)
data_transform <- data_transform |>
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
str(data_transform)
## 'data.frame': 2276 obs. of 15 variables:
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : num 842 1075 917 922 920 ...
## $ TEAM_BASERUN_SB : num 0 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : num 0 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: num 5456 1082 917 928 920 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: num 0 155 153 156 168 149 186 136 169 159 ...
To evaluate the suitability of linear regression, we plotted the linear relationships between each predictor and the target variable (TARGET_WINS) using scatterplots and Pearson correlation coefficients.
TEAM_BATTING_H (r = 0.389) shows the strongest positive correlation with TARGET_WINS, suggesting that teams with more base hits generally win more games — as expected. This is followed by TEAM_BATTING_2B, which also demonstrates a moderately positive relationship, aligning with baseball intuition.
Interestingly, TEAM_BATTING_SO (strikeouts by batters) shows an almost negligible correlation (r = -0.030) with TARGET_WINS. This is somewhat surprising, as one might expect higher strikeout counts to hurt performance. The lack of relationship suggests that this variable is not a strong independent predictor.
Another notable result is TEAM_PITCHING_SO (strikeouts by pitchers), which shows a weak negative correlation with wins (r = -0.075). This result could indicate multicollinearity — for example, if teams that accumulate many strikeouts also give up more hits or walks — or it may suggest data quality issues in historical records.
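As a quick, informal probe of that multicollinearity hypothesis, we can check how pitching strikeouts correlate with hits and walks allowed. This is a sketch on the cleaned data_transform, not part of the original analysis:
# Do high-strikeout pitching staffs also allow more hits and walks?
cor(data_transform$TEAM_PITCHING_SO,
    data_transform[, c("TEAM_PITCHING_H", "TEAM_PITCHING_BB")])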
# Get all predictor variables (excluding TARGET_WINS)
predictors <- names(data_transform)[names(data_transform) != "TARGET_WINS"]
# Create scatterplots for all predictors vs TARGET_WINS
# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))
# Plot each predictor vs TARGET_WINS
for(i in 1:length(predictors)) {
predictor <- predictors[i]
correlation <- cor(data_transform[[predictor]], data_transform$TARGET_WINS)
plot(data_transform[[predictor]], data_transform$TARGET_WINS,
xlab = predictor,
ylab = "TARGET_WINS",
main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
pch = 16,
col = ifelse(correlation > 0, "darkblue", "darkred"),
cex = 0.6)
# Add regression line
abline(lm(data_transform$TARGET_WINS ~ data_transform[[predictor]]),
col = "red", lwd = 2)
# Add correlation text
text(x = quantile(data_transform[[predictor]], 0.05),
y = quantile(data_transform$TARGET_WINS, 0.95),
labels = paste("r =", round(correlation, 3)),
cex = 0.8, col = "red")
}
# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))
To further assess linear relationships and potential multicollinearity among variables, we examined the correlation matrix of all numeric features using a corrplot and targeted pairwise comparisons.
Several batting and pitching variables are correlated with one another: TEAM_BATTING_HR and TEAM_PITCHING_HR are nearly collinear (r ≈ 0.97), TEAM_FIELDING_E and TEAM_FIELDING_DP are strongly negatively correlated (r ≈ -0.76), and the batting/pitching pairs for hits, walks, and strikeouts show moderate positive correlations (r ≈ 0.30-0.49).
These relationships suggest possible multicollinearity, which could impact model stability and interpretability.
We also ranked features by their correlation with TARGET_WINS, identifying the top 5 positive and negative predictors:
Top positively correlated features include TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_BB, TEAM_PITCHING_HR, and TEAM_BATTING_HR.
Top negatively correlated features include TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_BATTING_SO.
# Correlation matrix
cor_matrix <- cor(data_transform, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# Extract correlations between batting and pitching variables
cat("=== MULTICOLLINEARITY ANALYSIS ===\n")
## === MULTICOLLINEARITY ANALYSIS ===
# Batting vs Pitching
multicollin_pairs <- data.frame(
Batting_Var = c("TEAM_BATTING_HR", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_BATTING_SO"),
Pitching_Var = c("TEAM_PITCHING_HR", "TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO"),
Correlation = c(
cor(data_transform$TEAM_BATTING_HR, data_transform$TEAM_PITCHING_HR),
cor(data_transform$TEAM_BATTING_H, data_transform$TEAM_PITCHING_H),
cor(data_transform$TEAM_BATTING_BB, data_transform$TEAM_PITCHING_BB),
cor(data_transform$TEAM_BATTING_SO, data_transform$TEAM_PITCHING_SO)
)
)
print(multicollin_pairs)
## Batting_Var Pitching_Var Correlation
## 1 TEAM_BATTING_HR TEAM_PITCHING_HR 0.9693714
## 2 TEAM_BATTING_H TEAM_PITCHING_H 0.3026937
## 3 TEAM_BATTING_BB TEAM_PITCHING_BB 0.4893613
## 4 TEAM_BATTING_SO TEAM_PITCHING_SO 0.4161816
# Fielding
print(cor(data_transform$TEAM_FIELDING_DP, data_transform$TEAM_FIELDING_E))
## [1] -0.7569313
# Get correlations with TARGET_WINS
target_wins_correlation <- cor_matrix[, "TARGET_WINS"]
# Remove TARGET_WINS itself (correlation = 1)
target_wins_correlation <- target_wins_correlation [names(target_wins_correlation ) != "TARGET_WINS"]
# Top 5 Positive correlations
top5_pos <- sort(target_wins_correlation , decreasing = TRUE) %>%
head(5)
# Top 5 Negative correlations
top5_neg <- sort(target_wins_correlation , decreasing = FALSE) %>%
head(5)
# Combine into one dataframe
combined_df <- data.frame(
Feature = c(names(top5_pos), names(top5_neg)),
Correlation = c(top5_pos, top5_neg)
)
# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Positively and Negatively Correlated with target_wins",
x = "Feature",
y = "Correlation with target_wins",
fill = "Direction") +
theme_minimal()
# The `findCorrelation` function is specifically designed for this
high_corr_list <- findCorrelation(cor_matrix, cutoff = 0.4, verbose = FALSE)
names(data_transform)[high_corr_list]
## [1] "TEAM_BATTING_HR" "TEAM_FIELDING_E" "TEAM_PITCHING_HR" "TEAM_BATTING_SO"
## [5] "TEAM_BATTING_3B" "TEAM_FIELDING_DP" "TEAM_BATTING_BB" "TEAM_BATTING_2B"
## [9] "TEAM_PITCHING_SO"
To address multicollinearity and explore different modeling strategies, several derived datasets were created using feature engineering and variable selection:
Dataset 1: Conservative Reduction
We removed TEAM_PITCHING_HR, which was highly correlated with TEAM_BATTING_HR.
Dataset 2: Moderate Reduction
This version takes a more aggressive stance on multicollinearity. We removed TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_BATTING_SO, and TEAM_FIELDING_E. These variables showed pairwise correlations above 0.4 (in absolute value) with other predictors, and their removal helps reduce noise and potential variance.
Dataset 3: Transformed Variables (Net Advantage)
Rather than eliminating variables, this version transforms them into net advantage metrics:
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR
NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H
NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB
NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
Dataset 4: Batting-Focused
This subset includes only batting, base running, and fielding statistics, excluding all pitching variables. A derived extra-base-hits variable (TEAM_BATTING_XBH) is also introduced later (Model 4.1) to address coefficients whose signs contradict theoretical expectations.
Dataset 5: Pitching-Focused
In contrast, this version includes only pitching, base running, and fielding stats — omitting batting features entirely.
# Dataset 1: Conservative - Remove one variable from each highly correlated pair (TEAM_BATTING_HR, TEAM_PITCHING_HR)
data_reduced <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR)
# Dataset 2: Moderate - remove TEAM_PITCHING_HR/BB/SO, TEAM_BATTING_SO, and TEAM_FIELDING_E (pairwise |correlation| > 0.4)
data_moderate <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR, -TEAM_PITCHING_BB, -TEAM_PITCHING_SO, -TEAM_BATTING_SO, -TEAM_FIELDING_E)
# Dataset 3: Transformed - Create net advantage variables
data_transformed <- data_transform %>%
mutate(
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H,
NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB,
NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
) %>%
# dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
# TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_FIELDING_E, TEAM_FIELDING_DP,
# NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP,
NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
# Datasets 4 & 5: keep only batting OR pitching variables (separate models)
data_batting_only <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"),
starts_with("TEAM_FIELDING"))
data_pitching_focused <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_PITCHING"), starts_with("TEAM_FIELDING"),
TEAM_BASERUN_SB, TEAM_BASERUN_CS) # Keep base running as it's not correlated
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
This was done for all engineered versions of the dataset:
train_full / test_full — Full feature set (after cleaning)
train_conservative / test_conservative — Reduced set with one variable removed per collinear pair
train_moderate / test_moderate — More aggressively reduced set
train_transformed / test_transformed — Feature-engineered net-advantage variables
train_batting / test_batting — Batting-only feature set
train_pitching / test_pitching — Pitching-focused feature set
Each of these splits will be used in the next phase to build and evaluate multiple linear regression models, comparing their performance on both the training and testing sets.
# Split data for validation
set.seed(123)
#n <- nrow(data_transform)
#train_idx <- sample(n, 0.8 * n)
train_idx <- createDataPartition(y = data_transform$TARGET_WINS, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train_full <- data_transform[train_idx, ]
test_full <- data_transform[-train_idx, ]
train_conservative <- data_reduced[train_idx, ]
test_conservative <- data_reduced[-train_idx, ]
train_moderate <- data_moderate[train_idx, ]
test_moderate <- data_moderate[-train_idx, ]
train_transformed <- data_transformed[train_idx, ]
test_transformed <- data_transformed[-train_idx, ]
train_batting <- data_batting_only[train_idx, ]
test_batting <- data_batting_only[-train_idx, ]
train_pitching <- data_pitching_focused[train_idx, ]
test_pitching<- data_pitching_focused[-train_idx, ]
Let’s first start out with a model that includes everything:
# Model 1: Full model (all data including multicollinearity predictors)
model_full <- lm(TARGET_WINS ~ ., data = train_full)
summary(model_full)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.719 -8.854 0.269 8.579 53.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0418328 6.3176701 2.856 0.00434 **
## TEAM_BATTING_H 0.0500311 0.0042745 11.705 < 2e-16 ***
## TEAM_BATTING_2B -0.0252875 0.0106160 -2.382 0.01732 *
## TEAM_BATTING_3B 0.0585974 0.0191983 3.052 0.00230 **
## TEAM_BATTING_HR 0.0395462 0.0312765 1.264 0.20625
## TEAM_BATTING_BB 0.0057714 0.0064023 0.901 0.36747
## TEAM_BATTING_SO -0.0036516 0.0028289 -1.291 0.19694
## TEAM_BASERUN_SB 0.0061007 0.0047539 1.283 0.19955
## TEAM_BASERUN_CS -0.0192662 0.0117365 -1.642 0.10085
## TEAM_PITCHING_H -0.0008394 0.0004566 -1.838 0.06616 .
## TEAM_PITCHING_HR 0.0006267 0.0279223 0.022 0.98210
## TEAM_PITCHING_BB 0.0013045 0.0044831 0.291 0.77111
## TEAM_PITCHING_SO 0.0023757 0.0010191 2.331 0.01985 *
## TEAM_FIELDING_E -0.0276333 0.0034789 -7.943 3.43e-15 ***
## TEAM_FIELDING_DP -0.0511858 0.0114022 -4.489 7.60e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 1807 degrees of freedom
## Multiple R-squared: 0.2836, Adjusted R-squared: 0.278
## F-statistic: 51.09 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full))
# n = # of observations
n <- nrow(train_full)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full)
##
## studentized Breusch-Pagan test
##
## data: model_full
## BP = 244.87, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.833728 2.544201 3.004412 37.045992
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 6.286496 4.824217 1.967925 1.396113
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 3.929952 30.184911 6.078982 3.604667
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 6.272760 4.004321
######
preds <- predict(model_full, newdata = test_full)
RMSE(preds, test_full$TARGET_WINS)
## [1] 13.71198
R2(preds, test_full$TARGET_WINS)
## [1] 0.3063485
Model 1 validity check (linear regression assumptions): the residuals are roughly symmetric and centered near zero, but the Breusch-Pagan test (BP = 244.87, p < 2.2e-16) indicates nonconstant variance, and the VIFs for TEAM_BATTING_HR (37.0) and TEAM_PITCHING_HR (30.2) point to severe multicollinearity.
Other notes: test-set performance is modest (RMSE = 13.71, R² = 0.306), and several of the heavily skewed predictors may be limiting the fit.
Next step: Try a log transformation on the skewed predictors
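To decide which predictors to transform, one option is to rank them by skewness using the skewness() function from e1071 (loaded at the top for exactly this purpose but not otherwise used). This is a sketch on the cleaned data_transform; the heavily right-skewed variables flagged in the EDA should appear near the top:
# Rank predictors by absolute skewness to pick candidates for log/sqrt transformation
data_transform %>%
  dplyr::select(-TARGET_WINS) %>%
  summarise(across(everything(), ~ e1071::skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))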
# Model 1.1: Full model with Log transformation of the skewed predictors
train_full_log <- train_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
test_full_log <- test_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
model_full_log <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO, data = train_full_log)
summary(model_full_log)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## log_TEAM_FIELDING_E + TEAM_FIELDING_DP + log_TEAM_PITCHING_H +
## log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO,
## data = train_full_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.836 -8.238 -0.009 8.257 75.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.212e+01 1.814e+01 5.078 4.21e-07 ***
## TEAM_BATTING_H 5.453e-02 4.445e-03 12.269 < 2e-16 ***
## TEAM_BATTING_2B -3.549e-02 1.059e-02 -3.350 0.000825 ***
## log_TEAM_BATTING_3B 5.677e+00 1.100e+00 5.162 2.71e-07 ***
## TEAM_BATTING_HR -1.026e-04 3.208e-02 -0.003 0.997449
## TEAM_BATTING_BB 2.159e-02 5.561e-03 3.882 0.000107 ***
## TEAM_BATTING_SO 1.707e-03 3.038e-03 0.562 0.574259
## log_TEAM_BASERUN_SB -1.952e+00 3.322e-01 -5.875 5.02e-09 ***
## TEAM_BASERUN_CS -2.434e-02 1.184e-02 -2.055 0.039990 *
## log_TEAM_FIELDING_E -1.568e+01 1.521e+00 -10.314 < 2e-16 ***
## TEAM_FIELDING_DP -9.046e-02 1.071e-02 -8.449 < 2e-16 ***
## log_TEAM_PITCHING_H -2.054e+00 2.384e+00 -0.861 0.389219
## log_TEAM_PITCHING_BB -1.144e-01 2.010e+00 -0.057 0.954626
## TEAM_PITCHING_HR -7.289e-03 2.838e-02 -0.257 0.797334
## log_TEAM_PITCHING_SO 4.828e-01 7.491e-01 0.645 0.519300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.18 on 1807 degrees of freedom
## Multiple R-squared: 0.2889, Adjusted R-squared: 0.2834
## F-statistic: 52.43 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full_log)
##
## studentized Breusch-Pagan test
##
## data: model_full_log
## BP = 325.49, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_log)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 4.175962 2.552007 3.103816
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 39.270562 4.778690 5.603908
## log_TEAM_BASERUN_SB TEAM_BASERUN_CS log_TEAM_FIELDING_E
## 1.762556 1.431583 8.896992
## TEAM_FIELDING_DP log_TEAM_PITCHING_H log_TEAM_PITCHING_BB
## 3.557102 5.429279 3.475735
## TEAM_PITCHING_HR log_TEAM_PITCHING_SO
## 31.415344 2.455430
######
preds <- predict(model_full_log, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.36052
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.3458344
Model 1.1 validity check (linear regression assumptions): the log transformations slightly improve the fit (test RMSE = 13.36, R² = 0.346), but the Breusch-Pagan test (BP = 325.49, p < 2.2e-16) still indicates nonconstant variance, and the home run variables retain VIFs above 30.
Other notes: most of the log-transformed pitching terms are not statistically significant, suggesting they add little once their batting counterparts are in the model.
Check for leverage points:
# Leverage points
selected_row_single <- train_full[c("2136", "1342"), ]
print(selected_row_single)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2136 41 992 263 20 0
## 1342 108 1188 338 0 0
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2136 142 952 0 0
## 1342 270 945 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2136 20088 0 2876 19278
## 1342 16038 0 3645 12758
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 2136 952 0
## 1342 716 0
As you can see, these points appear to be leverage points, and they contain a lot of zeros. For example, case 1342's TEAM_BATTING_HR is zero. Since we can't find out more about these observations, we won't remove them. In a real-world scenario, we would investigate further and check whether these values were entered incorrectly.
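To confirm that these rows actually exceed the leverage cutoff used in the assumption checks, their hat values can be pulled directly from the fitted model. This is a sketch using model_full and the same 2*(p+1)/n cutoff as above:
# Hat values for the two flagged rows versus the leverage cutoff
lev <- hatvalues(model_full)
p <- length(coef(model_full))
n <- nrow(train_full)
cutoff <- (2 * (p + 1)) / n
lev[c("2136", "1342")] # leverage of the flagged rows
sum(lev > cutoff)      # how many training rows exceed the cutoff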
Next step: keep the log transformation of the skewed predictors and add weighted least squares (WLS) to address the nonconstant variance
# Model 1.2: Full model with Log transformation of the skewed predictors + WLS
# Define weights as 1 / (fitted values of a regression of |residuals| on fitted values)^2,
# so observations with larger estimated error variance receive smaller weight
wt <- 1 / lm(abs(model_full_log$residuals) ~ model_full_log$fitted.values)$fitted.values^2
# perform weighted least squares regression
# Fit model
model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)
# model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
# TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
# log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)
#view summary of model
summary(model_full_log_wls)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## log_TEAM_FIELDING_E + TEAM_FIELDING_DP + log_TEAM_PITCHING_H +
## log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO,
## data = train_full_log, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.5067 -0.7887 0.0009 0.8068 5.1522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.122e+01 1.964e+01 4.646 3.64e-06 ***
## TEAM_BATTING_H 4.958e-02 4.577e-03 10.833 < 2e-16 ***
## TEAM_BATTING_2B -3.676e-02 1.039e-02 -3.536 0.000417 ***
## log_TEAM_BATTING_3B 6.681e+00 1.123e+00 5.951 3.20e-09 ***
## TEAM_BATTING_HR 2.528e-03 3.222e-02 0.078 0.937470
## TEAM_BATTING_BB 2.315e-02 7.136e-03 3.245 0.001197 **
## TEAM_BATTING_SO 5.269e-04 3.076e-03 0.171 0.864004
## log_TEAM_BASERUN_SB -1.884e+00 3.399e-01 -5.542 3.43e-08 ***
## TEAM_BASERUN_CS -1.733e-02 1.195e-02 -1.451 0.147084
## log_TEAM_FIELDING_E -1.461e+01 1.504e+00 -9.715 < 2e-16 ***
## TEAM_FIELDING_DP -8.219e-02 1.042e-02 -7.886 5.35e-15 ***
## log_TEAM_PITCHING_H -7.814e-01 2.998e+00 -0.261 0.794377
## log_TEAM_PITCHING_BB -2.125e+00 3.032e+00 -0.701 0.483483
## TEAM_PITCHING_HR 6.595e-03 2.852e-02 0.231 0.817128
## log_TEAM_PITCHING_SO 3.132e-01 8.703e-01 0.360 0.718973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 1807 degrees of freedom
## Multiple R-squared: 0.2785, Adjusted R-squared: 0.2729
## F-statistic: 49.82 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log_wls, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log_wls)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log_wls))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log_wls)
# Perform Breusch-Pagan Test
bptest(model_full_log_wls)
##
## studentized Breusch-Pagan test
##
## data: model_full_log_wls
## BP = 5.7805, df = 14, p-value = 0.9717
# Variance Inflation Factor
vif(model_full_log_wls)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 5.473533 2.560235 3.334773
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 40.833173 7.705148 6.150963
## log_TEAM_BASERUN_SB TEAM_BASERUN_CS log_TEAM_FIELDING_E
## 1.713186 1.439969 8.694267
## TEAM_FIELDING_DP log_TEAM_PITCHING_H log_TEAM_PITCHING_BB
## 3.641813 8.171872 6.038269
## TEAM_PITCHING_HR log_TEAM_PITCHING_SO
## 32.736416 2.647742
######
preds <- predict(model_full_log_wls, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.37799
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.346121
Model 1.2 validity check (linear regression assumptions): weighting resolves the heteroscedasticity (Breusch-Pagan BP = 5.78, p = 0.97), although the home run VIFs remain above 30.
Other notes: test-set performance is essentially unchanged from Model 1.1 (RMSE = 13.38, R² = 0.346).
Next step: let’s try a square root transformation of the skewed predictors
# Model 1.3: Full model with square root transformation of the skewed predictors
# Transform skewed predictors via sqrt
train_full_sqrt <- train_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
)
test_full_sqrt <- test_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
)
# Fit model
model_full_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + sqrt_TEAM_PITCHING_SO +
TEAM_PITCHING_HR + TEAM_BATTING_SO,
data = train_full_sqrt)
summary(model_full_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E +
## TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB +
## sqrt_TEAM_PITCHING_SO + TEAM_PITCHING_HR + TEAM_BATTING_SO,
## data = train_full_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.309 -8.626 -0.003 8.309 52.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.924758 7.094322 3.936 8.59e-05 ***
## TEAM_BATTING_H 0.051973 0.004329 12.006 < 2e-16 ***
## TEAM_BATTING_2B -0.030711 0.010587 -2.901 0.003768 **
## sqrt_TEAM_BATTING_3B 1.358313 0.311597 4.359 1.38e-05 ***
## TEAM_BATTING_HR 0.029548 0.031534 0.937 0.348875
## TEAM_BATTING_BB 0.020872 0.008164 2.557 0.010653 *
## sqrt_TEAM_BASERUN_SB -0.072625 0.101922 -0.713 0.476216
## TEAM_BASERUN_CS -0.023099 0.011847 -1.950 0.051354 .
## sqrt_TEAM_FIELDING_E -1.499124 0.162702 -9.214 < 2e-16 ***
## TEAM_FIELDING_DP -0.078172 0.011445 -6.831 1.15e-11 ***
## sqrt_TEAM_PITCHING_H -0.034373 0.076828 -0.447 0.654642
## sqrt_TEAM_PITCHING_BB -0.457310 0.301683 -1.516 0.129729
## sqrt_TEAM_PITCHING_SO 0.498949 0.135541 3.681 0.000239 ***
## TEAM_PITCHING_HR -0.003848 0.027868 -0.138 0.890182
## TEAM_BATTING_SO -0.009869 0.003954 -2.496 0.012646 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1807 degrees of freedom
## Multiple R-squared: 0.2874, Adjusted R-squared: 0.2818
## F-statistic: 52.05 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_sqrt))
# n = # of observations
n <- nrow(train_full_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_sqrt)
# Perform Breusch-Pagan Test
bptest(model_full_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_full_sqrt
## BP = 290.83, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.953139 2.543924 3.216376
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 37.858671 10.276593 1.819623
## TEAM_BASERUN_CS sqrt_TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.430004 8.210097 4.055568
## sqrt_TEAM_PITCHING_H sqrt_TEAM_PITCHING_BB sqrt_TEAM_PITCHING_SO
## 5.250760 8.858551 6.530655
## TEAM_PITCHING_HR TEAM_BATTING_SO
## 30.226649 9.473807
######
# Predict on test set
preds_sqrt <- predict(model_full_sqrt, newdata = test_full_sqrt)
rmse_sqrt <- RMSE(preds_sqrt, test_full_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_sqrt, test_full_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 13.61336
r2_sqrt
## [1] 0.3167827
Model 1.3 validity check (linear regression assumptions): residuals remain roughly symmetric, but the Breusch-Pagan test (BP = 290.83, p < 2.2e-16) again indicates nonconstant variance, and the home run VIFs remain above 30.
Other notes: test-set performance (RMSE = 13.61, R² = 0.317) is slightly worse than the log-transformed version.
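To make the choice among the Model 1 variants explicit, their test-set performance can be recomputed side by side. This is a sketch that simply repeats the RMSE and R² calculations reported above for each fitted model:
# Side-by-side test-set comparison of the Model 1 variants
model1_variants <- list(
  full         = list(model = model_full,         test = test_full),
  full_log     = list(model = model_full_log,     test = test_full_log),
  full_log_wls = list(model = model_full_log_wls, test = test_full_log),
  full_sqrt    = list(model = model_full_sqrt,    test = test_full_sqrt)
)
purrr::map_dfr(model1_variants, function(x) {
  preds <- predict(x$model, newdata = x$test)
  tibble(RMSE = RMSE(preds, x$test$TARGET_WINS),
         R2   = R2(preds, x$test$TARGET_WINS))
}, .id = "model")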
Best model 1 version: Model 1.1 (log-transformed skewed predictors), which has the lowest test RMSE (13.36) and, together with the WLS variant, the highest test R² (about 0.346), as the comparison above shows.
Next step: let's address multicollinearity by using the train_moderate dataset (removed highly correlated predictors)
# Model 2: Moderate dataset model (removed highly correlated predictors)
# Fit model
model_moderate <- lm(TARGET_WINS ~ ., data = train_moderate)
#view summary of model
summary(model_moderate)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_moderate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.909 -8.995 0.495 8.905 76.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0807986 3.9865298 1.274 0.20265
## TEAM_BATTING_H 0.0427033 0.0035658 11.976 < 2e-16 ***
## TEAM_BATTING_2B -0.0058880 0.0101635 -0.579 0.56244
## TEAM_BATTING_3B 0.0544292 0.0191381 2.844 0.00450 **
## TEAM_BATTING_HR 0.0471021 0.0085870 5.485 4.71e-08 ***
## TEAM_BATTING_BB 0.0173697 0.0036085 4.814 1.61e-06 ***
## TEAM_BASERUN_SB 0.0126187 0.0045981 2.744 0.00612 **
## TEAM_BASERUN_CS -0.0133792 0.0118269 -1.131 0.25810
## TEAM_PITCHING_H -0.0017073 0.0002894 -5.900 4.33e-09 ***
## TEAM_FIELDING_DP -0.0004111 0.0093700 -0.044 0.96501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.54 on 1812 degrees of freedom
## Multiple R-squared: 0.2479, Adjusted R-squared: 0.2442
## F-statistic: 66.38 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_moderate)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate))
# n = # of observations
n <- nrow(train_moderate)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
anova(model_moderate)
## Analysis of Variance Table
##
## Response: TARGET_WINS
## Df Sum Sq Mean Sq F value Pr(>F)
## TEAM_BATTING_H 1 63883 63883 348.5795 < 2.2e-16 ***
## TEAM_BATTING_2B 1 3416 3416 18.6403 1.664e-05 ***
## TEAM_BATTING_3B 1 1 1 0.0045 0.946460
## TEAM_BATTING_HR 1 18718 18718 102.1338 < 2.2e-16 ***
## TEAM_BATTING_BB 1 14674 14674 80.0688 < 2.2e-16 ***
## TEAM_BASERUN_SB 1 1828 1828 9.9768 0.001611 **
## TEAM_BASERUN_CS 1 168 168 0.9144 0.339072
## TEAM_PITCHING_H 1 6793 6793 37.0689 1.389e-09 ***
## TEAM_FIELDING_DP 1 0 0 0.0019 0.965006
## Residuals 1812 332078 183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform Breusch-Pagan Test
bptest(model_moderate)
##
## studentized Breusch-Pagan test
##
## data: model_moderate
## BP = 269.1, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate <- predict(model_moderate, newdata = test_moderate)
rmse_moderate <- RMSE(preds_moderate, test_moderate$TARGET_WINS)
r2_moderate <- R2(preds_moderate, test_moderate$TARGET_WINS)
rmse_moderate
## [1] 14.01101
r2_moderate
## [1] 0.2769754
Model 2 validity check (linear regression assumptions): all VIFs are below 3, so the multicollinearity problem is largely resolved, but the Breusch-Pagan test (BP = 269.1, p < 2.2e-16) still indicates nonconstant variance.
Other notes: removing the correlated predictors costs some accuracy (test RMSE = 14.01, R² = 0.277, versus 0.306 for the full model).
Next step: let’s address the validity check issues by transforming the skewed predictors
# Model 2.1: Moderate dataset model + transformation of skewed predictors
# Transform skewed predictors via sqrt
train_moderate_sqrt <- train_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
    # (additional skewed predictors could be transformed here as well)
)
test_moderate_sqrt <- test_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
)
# Fit model
model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt)
summary(model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.848 -8.959 0.565 9.014 78.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.607803 4.249041 1.790 0.07354 .
## TEAM_BATTING_H 0.045072 0.003747 12.029 < 2e-16 ***
## TEAM_BATTING_2B -0.007240 0.010193 -0.710 0.47766
## sqrt_TEAM_BATTING_3B 0.962061 0.302455 3.181 0.00149 **
## TEAM_BATTING_HR 0.047035 0.008912 5.278 1.46e-07 ***
## TEAM_BATTING_BB 0.018099 0.003701 4.890 1.10e-06 ***
## sqrt_TEAM_BASERUN_SB 0.107730 0.096868 1.112 0.26623
## TEAM_BASERUN_CS -0.011914 0.011972 -0.995 0.31982
## TEAM_FIELDING_DP -0.011846 0.008987 -1.318 0.18762
## sqrt_TEAM_PITCHING_H -0.273777 0.048625 -5.630 2.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.56 on 1812 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2416
## F-statistic: 65.47 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_moderate_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_moderate_sqrt
## BP = 272.64, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
# (note: this reports VIFs for the untransformed model_moderate; vif(model_moderate_sqrt) would use the sqrt terms)
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate_sqrt <- predict(model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_sqrt <- RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 14.05698
r2_sqrt
## [1] 0.2722703
Model 2.1 validity check (linear regression assumptions): the square root transformations do not resolve the nonconstant variance (Breusch-Pagan BP = 272.64, p < 2.2e-16).
Other notes: test-set performance is essentially unchanged (RMSE = 14.06, R² = 0.272).
Next step: let’s address nonconstant variance by transforming the skewed predictors and using Weighted Least Squares
# Model 2.2: Moderate dataset model + transformation of skewed predictors + WLS
# Define weights from the Model 2.1 residuals, as in Model 1.2 (inverse squared fitted |residuals|)
wt <- 1 / lm(abs(model_moderate_sqrt$residuals) ~ model_moderate_sqrt$fitted.values)$fitted.values^2
# perform weighted least squares regression
wls_model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt, weights=wt)
#view summary of model
summary(wls_model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4.7679 -0.8433 0.0520 0.8563 5.0569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.431146 4.039826 2.087 0.037027 *
## TEAM_BATTING_H 0.045416 0.003807 11.928 < 2e-16 ***
## TEAM_BATTING_2B -0.016131 0.010108 -1.596 0.110706
## sqrt_TEAM_BATTING_3B 0.984533 0.297333 3.311 0.000947 ***
## TEAM_BATTING_HR 0.051740 0.008678 5.962 2.99e-09 ***
## TEAM_BATTING_BB 0.015460 0.003615 4.277 1.99e-05 ***
## sqrt_TEAM_BASERUN_SB 0.094450 0.097192 0.972 0.331282
## TEAM_BASERUN_CS -0.008323 0.011978 -0.695 0.487205
## TEAM_FIELDING_DP -0.006103 0.008862 -0.689 0.491124
## sqrt_TEAM_PITCHING_H -0.254253 0.057049 -4.457 8.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.276 on 1812 degrees of freedom
## Multiple R-squared: 0.2392, Adjusted R-squared: 0.2354
## F-statistic: 63.28 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = wls_model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(wls_model_moderate_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(wls_model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(wls_model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(wls_model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: wls_model_moderate_sqrt
## BP = 5.3621, df = 9, p-value = 0.8017
# Variance Inflation Factor
vif(wls_model_moderate_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.654325 2.307732 3.129800
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 2.849087 1.860144 1.637307
## TEAM_BASERUN_CS TEAM_FIELDING_DP sqrt_TEAM_PITCHING_H
## 1.386230 2.543128 2.415067
######
# Predict on test set
preds_mod_sqrt <- predict(wls_model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_mod_wls_sqrt <- RMSE(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_mod_wls_sqrt <- R2(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_mod_wls_sqrt
## [1] 14.09027
r2_mod_wls_sqrt
## [1] 0.2697522
Model 2.2 validity check (linear regression assumptions): as with Model 1.2, weighting removes the heteroscedasticity (Breusch-Pagan BP = 5.36, p = 0.80), and all VIFs remain below 4.
Other notes: test-set performance is similar to the other Model 2 variants (RMSE = 14.09, R² = 0.270).
# Model 3: Transformed data (net advantage variables)
model_transformed <- lm(TARGET_WINS ~ ., data = train_transformed)
summary(model_transformed)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.989 -9.839 0.122 9.261 50.902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.7924810 3.1064212 21.180 < 2e-16 ***
## TEAM_BATTING_2B 0.0881168 0.0076330 11.544 < 2e-16 ***
## TEAM_BATTING_3B 0.1180329 0.0159613 7.395 2.15e-13 ***
## TEAM_BASERUN_SB 0.0069852 0.0047682 1.465 0.1431
## TEAM_BASERUN_CS -0.0184721 0.0123932 -1.491 0.1363
## TEAM_FIELDING_E -0.0282612 0.0033613 -8.408 < 2e-16 ***
## TEAM_FIELDING_DP -0.0501127 0.0120283 -4.166 3.24e-05 ***
## NET_HR_ADVANTAGE -0.0580541 0.0289564 -2.005 0.0451 *
## NET_HITS_ADVANTAGE 0.0003041 0.0004778 0.636 0.5245
## NET_WALKS_ADVANTAGE -0.0042913 0.0046882 -0.915 0.3601
## NET_SO_ADVANTAGE -0.0009371 0.0010537 -0.889 0.3739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.1 on 1811 degrees of freedom
## Multiple R-squared: 0.1849, Adjusted R-squared: 0.1804
## F-statistic: 41.09 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_transformed, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_transformed)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_transformed))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_transformed)
##
## studentized Breusch-Pagan test
##
## data: model_transformed
## BP = 233.9, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_transformed)
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1.158682 1.829429 1.744094 1.371380
## TEAM_FIELDING_E TEAM_FIELDING_DP NET_HR_ADVANTAGE NET_HITS_ADVANTAGE
## 5.158672 3.925586 1.705163 3.591665
## NET_WALKS_ADVANTAGE NET_SO_ADVANTAGE
## 4.931322 2.887223
######
preds <- predict(model_transformed, newdata = test_transformed)
RMSE(preds, test_transformed$TARGET_WINS)
## [1] 14.68434
R2(preds, test_transformed$TARGET_WINS)
## [1] 0.2047937
Model 3 validity check (linear regression assumptions): the Breusch-Pagan test (BP = 233.9, p < 2.2e-16) again indicates nonconstant variance, although all VIFs are moderate (at most about 5).
Other notes: the net-advantage formulation fits worst so far (test RMSE = 14.68, R² = 0.205), suggesting that collapsing the batting and pitching counts discards useful information.
# Model 4 (batting variables only)
# model_batting_xbh <- lm(TARGET_WINS ~ ., data = train_batting_fixed)
model_batting <- lm(TARGET_WINS ~ ., data = train_batting)
summary(model_batting)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.383 -8.884 0.276 8.369 57.893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.011e+01 6.184e+00 3.252 0.001169 **
## TEAM_BATTING_H 4.748e-02 4.195e-03 11.319 < 2e-16 ***
## TEAM_BATTING_2B -2.064e-02 1.046e-02 -1.973 0.048628 *
## TEAM_BATTING_3B 6.399e-02 1.866e-02 3.429 0.000619 ***
## TEAM_BATTING_HR 3.498e-02 1.049e-02 3.336 0.000867 ***
## TEAM_BATTING_BB 7.673e-03 3.775e-03 2.033 0.042220 *
## TEAM_BATTING_SO -9.631e-05 2.541e-03 -0.038 0.969765
## TEAM_BASERUN_SB 4.525e-03 4.710e-03 0.961 0.336835
## TEAM_BASERUN_CS -2.239e-02 1.167e-02 -1.919 0.055129 .
## TEAM_FIELDING_E -2.990e-02 2.856e-03 -10.469 < 2e-16 ***
## TEAM_FIELDING_DP -5.708e-02 1.119e-02 -5.100 3.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.26 on 1811 degrees of freedom
## Multiple R-squared: 0.2787, Adjusted R-squared: 0.2747
## F-statistic: 69.97 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting))
# n = # of observations
n <- nrow(train_batting)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting)
##
## studentized Breusch-Pagan test
##
## data: model_batting
## BP = 236.27, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.674924 2.460004 2.826139 4.145321
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2.175251 3.873051 1.922653 1.373299
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 4.209614 3.840744
######
preds <- predict(model_batting, newdata = test_batting)
RMSE(preds, test_batting$TARGET_WINS)
## [1] 13.67704
R2(preds, test_batting$TARGET_WINS)
## [1] 0.3110345
Model 4 validity check (linear regression assumptions): residuals are approximately normal and centered near zero, but the Breusch-Pagan test (BP = 236.27, p < 2.2e-16) again indicates heteroscedasticity. All VIFs are below about 4.2, so multicollinearity is moderate at worst.
Other notes: Model 4 fits noticeably better than Model 3 (test RMSE 13.68, R² 0.31), but TEAM_BATTING_2B and TEAM_FIELDING_DP carry coefficient signs that contradict their expected effects on wins.
Since TEAM_BATTING_2B and TEAM_FIELDING_DP have unexpected coefficient signs, let’s create a single extra-base-hits variable that combines the batting variables (2B, 3B, HR). This should reduce multicollinearity among the batting predictors.
# Model 4.1 (batting variables only) - New Extra Base Hits variable
# Create new TEAM_BATTING_XBH and remove individual predictors to address any multicollinearity between batting variables
train_batting_fixed <- train_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR)
test_batting_fixed <- test_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR)
model_batting_xbh <- lm(TARGET_WINS ~ ., data = train_batting_fixed)
summary(model_batting_xbh)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting_fixed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.261 -9.034 0.119 8.590 49.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3813982 6.0513216 3.533 0.000421 ***
## TEAM_BATTING_H 0.0455792 0.0041558 10.968 < 2e-16 ***
## TEAM_BATTING_BB 0.0093918 0.0037260 2.521 0.011800 *
## TEAM_BATTING_SO -0.0004717 0.0022751 -0.207 0.835788
## TEAM_BASERUN_SB 0.0065223 0.0044980 1.450 0.147216
## TEAM_BASERUN_CS -0.0265459 0.0116684 -2.275 0.023020 *
## TEAM_FIELDING_E -0.0297338 0.0028692 -10.363 < 2e-16 ***
## TEAM_FIELDING_DP -0.0628551 0.0110728 -5.677 1.6e-08 ***
## TEAM_BATTING_XBH 0.0090417 0.0075030 1.205 0.228333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.33 on 1813 degrees of freedom
## Multiple R-squared: 0.2705, Adjusted R-squared: 0.2673
## F-statistic: 84.04 on 8 and 1813 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting_xbh, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting_xbh)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting_xbh))
# n = # of observations
n <- nrow(train_batting_fixed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting_xbh)
##
## studentized Breusch-Pagan test
##
## data: model_batting_xbh
## BP = 210.04, df = 8, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting_xbh)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 3.570685 2.098047 3.074606 1.735990
## TEAM_BASERUN_CS TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_XBH
## 1.359770 4.204279 3.721079 3.817897
######
preds <- predict(model_batting_xbh, newdata = test_batting_fixed)
RMSE(preds, test_batting_fixed$TARGET_WINS)
## [1] 13.79523
R2(preds, test_batting_fixed$TARGET_WINS)
## [1] 0.2991301
Model 4.1 validity check (linear regression assumptions): residuals remain roughly normal, heteroscedasticity persists (BP = 210.04, p < 2.2e-16), and all VIFs stay below about 4.2.
Other notes: collapsing 2B, 3B, and HR into TEAM_BATTING_XBH removes the unexpected sign on doubles with only a small loss in fit (test RMSE 13.80, R² 0.30), but TEAM_FIELDING_DP still has a negative coefficient.
This model helped with multicollinearity among the batting variables. To fix the unexpected coefficient sign on TEAM_FIELDING_DP, let’s remove TEAM_FIELDING_E, which is negatively correlated with TEAM_FIELDING_DP.
# Model 4.2 (batting variables only) - New Extra Base Hits variable with FIELDING_E Removed
train_batting_fixed_v2 <- train_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
test_batting_fixed_v2 <- test_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
model_batting_xbh_v2 <- lm(TARGET_WINS ~ ., data = train_batting_fixed_v2)
summary(model_batting_xbh_v2)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting_fixed_v2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.999 -9.161 0.533 9.102 48.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.737311 5.747359 -0.476 0.633937
## TEAM_BATTING_H 0.040684 0.004248 9.577 < 2e-16 ***
## TEAM_BATTING_BB 0.025280 0.003494 7.235 6.84e-13 ***
## TEAM_BATTING_SO 0.004120 0.002296 1.795 0.072899 .
## TEAM_BASERUN_SB 0.016050 0.004530 3.543 0.000406 ***
## TEAM_BASERUN_CS -0.023355 0.012002 -1.946 0.051813 .
## TEAM_FIELDING_DP 0.007774 0.008979 0.866 0.386697
## TEAM_BATTING_XBH 0.014933 0.007698 1.940 0.052549 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.71 on 1814 degrees of freedom
## Multiple R-squared: 0.2273, Adjusted R-squared: 0.2243
## F-statistic: 76.23 on 7 and 1814 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting_xbh_v2, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting_xbh_v2)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting_xbh_v2))
# n = # of observations
n <- nrow(train_batting_fixed_v2)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting_xbh_v2)
##
## studentized Breusch-Pagan test
##
## data: model_batting_xbh_v2
## BP = 241.83, df = 7, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting_xbh_v2)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 3.524550 1.742845 2.957982 1.663462
## TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_BATTING_XBH
## 1.358823 2.311359 3.795982
######
preds <- predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2)
RMSE(preds, test_batting_fixed_v2$TARGET_WINS)
## [1] 14.25159
R2(preds, test_batting_fixed_v2$TARGET_WINS)
## [1] 0.2535529
Model 4.2 validity check (linear regression assumptions): residuals are roughly symmetric, the Breusch-Pagan test (BP = 241.83, p < 2.2e-16) still indicates heteroscedasticity, and every VIF is below 4, the lowest multicollinearity of the batting models.
Other notes: dropping TEAM_FIELDING_E flips TEAM_FIELDING_DP to the expected positive sign, and the remaining coefficients largely match their theoretical directions (only the small positive estimate on TEAM_BATTING_SO runs counter to expectation), at the cost of some fit (test RMSE 14.25, R² 0.25).
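The leverage cutoff computed above is only drawn on the diagnostic plot. As a purely illustrative sketch, the flagged observations for Model 4.2 can also be listed directly with base R’s hatvalues() and cooks.distance(); the flagged data frame name below is ours, not part of the original analysis.
# Illustrative sketch: list high-leverage / influential rows for Model 4.2
p <- length(coef(model_batting_xbh_v2))     # number of coefficients
n <- nrow(train_batting_fixed_v2)           # number of training observations
high_leverage <- (2 * (p + 1)) / n          # same cutoff used in the plots above
lev <- hatvalues(model_batting_xbh_v2)      # leverage of each training row
cd <- cooks.distance(model_batting_xbh_v2)  # influence of each training row
flagged <- data.frame(leverage = lev, cooks_d = cd)[lev > high_leverage, ]
head(flagged[order(-flagged$leverage), ])   # most extreme leverage first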
Let’s also try regularization with elastic net:
# Model 4.3 (batting variables only) Elastic Net
# Elastic Net Regression
set.seed(786)
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)
#pre process
pre_process <- c("center","scale")
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(TARGET_WINS ~ ., data = train_batting,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = pre_process)
# Test predictors and response
testX <- test_batting %>% dplyr::select(-TARGET_WINS)
testY <- test_batting$TARGET_WINS
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_model
## Elasticnet
##
## 1822 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1457, 1459, 1457, 1458, 1457
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 15.15154 0.1462023 11.88505
## 0.00 0.10 14.83157 0.1622760 11.62952
## 0.00 0.15 14.55462 0.1910254 11.42345
## 0.00 0.20 14.31963 0.2092135 11.25904
## 0.00 0.25 14.10932 0.2220923 11.11903
## 0.00 0.30 13.93332 0.2292596 10.99344
## 0.00 0.35 13.79439 0.2348170 10.89144
## 0.00 0.40 13.68548 0.2407948 10.81057
## 0.00 0.45 13.60776 0.2447093 10.74921
## 0.00 0.50 13.55294 0.2486167 10.70010
## 0.00 0.55 13.49733 0.2535632 10.65247
## 0.00 0.60 13.44407 0.2584496 10.60816
## 0.00 0.65 13.39720 0.2625928 10.56742
## 0.00 0.70 13.35845 0.2658271 10.53521
## 0.00 0.75 13.33147 0.2678645 10.51261
## 0.00 0.80 13.31384 0.2691977 10.49504
## 0.00 0.85 13.30241 0.2700431 10.47931
## 0.00 0.90 13.29495 0.2706265 10.46509
## 0.00 0.95 13.29181 0.2709066 10.45523
## 0.00 1.00 13.29759 0.2703835 10.45305
## 0.01 0.05 15.16134 0.1462023 11.89293
## 0.01 0.10 14.84672 0.1604437 11.64111
## 0.01 0.15 14.57450 0.1893114 11.43750
## 0.01 0.20 14.34379 0.2073170 11.27472
## 0.01 0.25 14.13515 0.2207728 11.13679
## 0.01 0.30 13.95884 0.2283742 11.01163
## 0.01 0.35 13.81734 0.2337047 10.90783
## 0.01 0.40 13.70662 0.2396172 10.82584
## 0.01 0.45 13.62462 0.2437302 10.76313
## 0.01 0.50 13.56619 0.2476038 10.71070
## 0.01 0.55 13.51130 0.2523255 10.66095
## 0.01 0.60 13.45915 0.2570366 10.61731
## 0.01 0.65 13.41145 0.2612984 10.57638
## 0.01 0.70 13.37052 0.2647993 10.54128
## 0.01 0.75 13.33980 0.2672329 10.51642
## 0.01 0.80 13.31882 0.2687891 10.49734
## 0.01 0.85 13.30627 0.2697359 10.48194
## 0.01 0.90 13.29845 0.2703257 10.46798
## 0.01 0.95 13.29460 0.2706450 10.45770
## 0.01 1.00 13.29651 0.2704849 10.45304
## 0.10 0.05 15.20837 0.1462023 11.93041
## 0.10 0.10 14.91970 0.1518720 11.69678
## 0.10 0.15 14.67409 0.1798391 11.50918
## 0.10 0.20 14.46029 0.1979614 11.35431
## 0.10 0.25 14.26715 0.2123321 11.22366
## 0.10 0.30 14.09381 0.2222994 11.10676
## 0.10 0.35 13.94529 0.2283872 10.99991
## 0.10 0.40 13.82615 0.2329749 10.91131
## 0.10 0.45 13.73086 0.2375405 10.83997
## 0.10 0.50 13.65589 0.2418900 10.78079
## 0.10 0.55 13.58911 0.2467168 10.72252
## 0.10 0.60 13.53592 0.2502655 10.66980
## 0.10 0.65 13.49355 0.2533289 10.62393
## 0.10 0.70 13.45470 0.2564967 10.58945
## 0.10 0.75 13.41701 0.2597632 10.55825
## 0.10 0.80 13.38473 0.2625117 10.53007
## 0.10 0.85 13.35842 0.2647103 10.50905
## 0.10 0.90 13.34157 0.2660898 10.49269
## 0.10 0.95 13.33420 0.2666511 10.48261
## 0.10 1.00 13.33227 0.2668126 10.47572
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.95 and lambda = 0.
enet_resample
## RMSE Rsquared MAE
## 13.6869927 0.3110159 10.5173741
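The resampling table ends with caret’s chosen tuning values (fraction = 0.95, lambda = 0). As a brief, hedged illustration, those values and the corresponding coefficients on the centered and scaled predictors can be pulled from the caret object; the finalModel here is an enet fit from the elasticnet package, whose predict method should accept type = "coefficients" with mode = "fraction".
# Illustrative sketch: selected tuning values and coefficients from the elastic net
enet_model$bestTune                         # fraction and lambda chosen by cross-validation
best_frac <- enet_model$bestTune$fraction
enet_coefs <- predict(enet_model$finalModel,
                      s = best_frac,
                      type = "coefficients",
                      mode = "fraction")
enet_coefs$coefficients                     # coefficients on the preprocessed predictors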
# Model 5: (pitching variables only)
model_pitching <- lm(TARGET_WINS ~ ., data = train_pitching)
summary(model_pitching)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_pitching)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.045 -9.521 0.433 9.421 65.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.3668947 2.8219995 30.250 < 2e-16 ***
## TEAM_PITCHING_H 0.0007015 0.0004369 1.606 0.10849
## TEAM_PITCHING_HR 0.0447728 0.0070586 6.343 2.84e-10 ***
## TEAM_PITCHING_BB 0.0136610 0.0026556 5.144 2.98e-07 ***
## TEAM_PITCHING_SO -0.0049613 0.0007268 -6.826 1.18e-11 ***
## TEAM_FIELDING_E -0.0250202 0.0036376 -6.878 8.31e-12 ***
## TEAM_FIELDING_DP -0.0662994 0.0124298 -5.334 1.08e-07 ***
## TEAM_BASERUN_SB 0.0128438 0.0049496 2.595 0.00954 **
## TEAM_BASERUN_CS -0.0284583 0.0127401 -2.234 0.02562 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.57 on 1813 degrees of freedom
## Multiple R-squared: 0.1282, Adjusted R-squared: 0.1243
## F-statistic: 33.32 on 8 and 1813 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_pitching, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_pitching)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_pitching))
# n = # of observations
n <- nrow(train_pitching)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_pitching)
##
## studentized Breusch-Pagan test
##
## data: model_pitching
## BP = 463.05, df = 8, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_pitching)
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2.966536 1.590418 1.758632 1.511765
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.654472 3.923401 1.758882 1.356343
######
preds <- predict(model_pitching, newdata = test_pitching)
RMSE(preds, test_pitching$TARGET_WINS)
## [1] 15.67067
R2(preds, test_pitching$TARGET_WINS)
## [1] 0.09612363
Model 5 validity check (linear regression assumptions): heteroscedasticity is the most severe of any model (BP = 463.05, p < 2.2e-16), and TEAM_FIELDING_E has a VIF above 5.
Other notes: the pitching-only model is the weakest candidate by far (test RMSE 15.67, R² 0.10), and TEAM_PITCHING_HR and TEAM_PITCHING_BB both carry positive coefficients despite their expected negative impact on wins.
# Model 6: Conservative dataset
model_conservative <- lm(TARGET_WINS ~ ., data = train_conservative)
summary(model_conservative)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_conservative)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.721 -8.853 0.267 8.578 53.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0309826 6.2974051 2.863 0.004242 **
## TEAM_BATTING_H 0.0500376 0.0042636 11.736 < 2e-16 ***
## TEAM_BATTING_2B -0.0252943 0.0106088 -2.384 0.017215 *
## TEAM_BATTING_3B 0.0586682 0.0189317 3.099 0.001972 **
## TEAM_BATTING_HR 0.0402066 0.0105933 3.795 0.000152 ***
## TEAM_BATTING_BB 0.0057175 0.0059344 0.963 0.335446
## TEAM_BATTING_SO -0.0036414 0.0027915 -1.304 0.192242
## TEAM_BASERUN_SB 0.0060936 0.0047418 1.285 0.198936
## TEAM_BASERUN_CS -0.0192592 0.0117291 -1.642 0.100764
## TEAM_PITCHING_H -0.0008386 0.0004553 -1.842 0.065638 .
## TEAM_PITCHING_BB 0.0013503 0.0039896 0.338 0.735056
## TEAM_PITCHING_SO 0.0023678 0.0009551 2.479 0.013266 *
## TEAM_FIELDING_E -0.0276340 0.0034778 -7.946 3.36e-15 ***
## TEAM_FIELDING_DP -0.0511885 0.0113984 -4.491 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 1808 degrees of freedom
## Multiple R-squared: 0.2836, Adjusted R-squared: 0.2784
## F-statistic: 55.05 on 13 and 1808 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_conservative, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_conservative)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_conservative))
# n = # of observations
n <- nrow(train_conservative)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_conservative)
##
## studentized Breusch-Pagan test
##
## data: model_conservative
## BP = 244.26, df = 13, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_conservative)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.816249 2.542129 2.923185 4.252122
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.404069 4.699856 1.959051 1.395128
## TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 3.910081 4.816881 3.168400 6.272359
## TEAM_FIELDING_DP
## 4.003873
######
preds <- predict(model_conservative, newdata = test_conservative)
RMSE(preds, test_conservative$TARGET_WINS)
## [1] 13.71211
R2(preds, test_conservative$TARGET_WINS)
## [1] 0.3063382
Model 6 validity check (linear regression assumptions): residuals look roughly normal, heteroscedasticity persists (BP = 244.26, p < 2.2e-16), and TEAM_BATTING_BB and TEAM_FIELDING_E have VIFs above 5.
Other notes: the conservative model performs about as well as the full model (test RMSE 13.71, R² 0.31), but TEAM_BATTING_2B and TEAM_FIELDING_DP again show coefficient signs that conflict with their expected effects.
After evaluating the candidate linear regression models on key performance metrics (RMSE, R², MSE, and the F-statistic), we selected Model 4.2 as the most robust and reliable overall. Model 4.2 did not achieve the lowest RMSE or the highest R² (those belonged to Model 1.1), but it outperformed the other models in stability and interpretability.
Specifically, Model 4.2 exhibited lower multicollinearity across predictors than Model 1.1, as indicated by more acceptable VIF values. This matters because Model 1.1, despite its stronger predictive metrics, showed signs of multicollinearity-induced distortion, including coefficient estimates with unexpected signs, i.e., directions of effect that contradicted theoretical expectations.
In contrast, Model 4.2 produced coefficient signs that were more consistent with baseball domain knowledge, suggesting better interpretability and more trustworthy parameter estimates. No model was perfect, but Model 4.2 offers the best balance between predictive accuracy and statistical validity, with an adjusted R² of 0.224 on the training data, a test RMSE of 14.25 (MSE ≈ 203.1, test R² of 0.254), and an F-statistic of 76.23, and it was therefore selected as the final model for prediction and interpretation.
model_results <- data.frame(
Model = c("Model 1: Full",
"Model 1.1: Full + Log",
"Model 1.2: Full + Log + WLS",
"Model 1.3: Full + Sqrt",
"Model 2: Moderate",
"Model 2.1: Moderate + Sqrt",
"Model 2.2: Moderate + Sqrt + WLS",
"Model 3: Transformed Advantage",
"Model 4: Batting",
"Model 4.1: Batting New Variable",
"Model 4.2: Batting New Variable + Fielding Removal",
"Model 4.3: Batting (Elastic Net)",
"Model 5: Pitching",
"Model 6: Conservative"),
RMSE = c(
RMSE(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
RMSE(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
RMSE(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
rmse_sqrt,
rmse_moderate,
RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
rmse_mod_wls_sqrt,
RMSE(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
RMSE(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
RMSE(predict(model_batting_xbh, newdata = test_batting_fixed), test_batting_fixed$TARGET_WINS),
RMSE(predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2), test_batting_fixed_v2$TARGET_WINS),
enet_resample["RMSE"],
RMSE(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
RMSE(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
),
R2 = c(
R2(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
R2(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
R2(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
r2_sqrt,
r2_moderate,
R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
r2_mod_wls_sqrt,
R2(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
R2(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
R2(predict(model_batting_xbh, newdata = test_batting_fixed), test_batting_fixed$TARGET_WINS),
R2(predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2), test_batting_fixed_v2$TARGET_WINS),
enet_resample["Rsquared"],
R2(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
R2(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
)
)
# Add MSE column
model_results$MSE <- model_results$RMSE^2
# Add F-statistic column (NA for models where it doesn’t apply: Elastic Net)
model_results$F_statistic <- c(
summary(model_full)$fstatistic[1],
summary(model_full_log)$fstatistic[1],
summary(model_full_log_wls)$fstatistic[1],
summary(model_full_sqrt)$fstatistic[1],
summary(model_moderate)$fstatistic[1],
summary(model_moderate_sqrt)$fstatistic[1],
summary(wls_model_moderate_sqrt)$fstatistic[1],
summary(model_transformed)$fstatistic[1],
summary(model_batting)$fstatistic[1],
summary(model_batting_xbh)$fstatistic[1],
summary(model_batting_xbh_v2)$fstatistic[1],
NA, # Elastic Net model doesn’t have F-statistic in the same form
summary(model_pitching)$fstatistic[1],
summary(model_conservative)$fstatistic[1]
)
# View the updated table
print(model_results)
## Model RMSE R2
## 1 Model 1: Full 13.71198 0.30634846
## 2 Model 1.1: Full + Log 13.36052 0.34583439
## 3 Model 1.2: Full + Log + WLS 13.37799 0.34612099
## 4 Model 1.3: Full + Sqrt 14.05698 0.27227029
## 5 Model 2: Moderate 14.01101 0.27697537
## 6 Model 2.1: Moderate + Sqrt 14.05698 0.27227029
## 7 Model 2.2: Moderate + Sqrt + WLS 14.09027 0.26975219
## 8 Model 3: Transformed Advantage 14.68434 0.20479368
## 9 Model 4: Batting 13.67704 0.31103447
## 10 Model 4.1: Batting New Variable 13.79523 0.29913008
## 11 Model 4.2: Batting New Variable + Fielding Removal 14.25159 0.25355288
## 12 Model 4.3: Batting (Elastic Net) 13.68699 0.31101590
## 13 Model 5: Pitching 15.67067 0.09612363
## 14 Model 6: Conservative 13.71211 0.30633816
## MSE F_statistic
## 1 188.0183 51.08818
## 2 178.5034 52.42818
## 3 178.9706 49.82345
## 4 197.5987 52.04595
## 5 196.3085 66.37656
## 6 197.5987 65.47077
## 7 198.5357 63.28447
## 8 215.6299 41.09327
## 9 187.0616 69.96969
## 10 190.3083 84.04218
## 11 203.1077 76.23436
## 12 187.3338 NA
## 13 245.5698 33.31725
## 14 188.0219 55.04844
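The comparison is easier to read when sorted, so a small optional step (using dplyr, which is already loaded) is to rank the candidates by test RMSE:
# Rank candidate models by test RMSE (lowest first); purely for readability
model_results %>%
  arrange(RMSE) %>%
  dplyr::select(Model, RMSE, R2, MSE, F_statistic)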
After selecting Model 4.2 (batting variables only, with the combined TEAM_BATTING_XBH variable and TEAM_FIELDING_E removed) as the final model, we applied it to the evaluation dataset to predict the number of team wins. The evaluation data are first checked for missing values and imputed, then transformed to match the predictors used by the model.
data_eval %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 6 × 2
## variable missing_count
## <chr> <int>
## 1 TEAM_BATTING_HBP 240
## 2 TEAM_BASERUN_CS 87
## 3 TEAM_FIELDING_DP 31
## 4 TEAM_BATTING_SO 18
## 5 TEAM_PITCHING_SO 18
## 6 TEAM_BASERUN_SB 13
# Impute and remove variables
data_transform_eval <- data_eval |>
dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
data_transform_eval <- data_transform_eval %>%
mutate(
TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP)
)
# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation (this applies to TEAM_BATTING_SO and TEAM_PITCHING_SO)
data_transform_eval <- data_transform_eval |>
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
# batting only
data_eval_batting_only <- data_transform_eval |>
dplyr::select(starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"),
starts_with("TEAM_FIELDING"))
# Create new batting variable
data_eval2 <- data_eval_batting_only %>%
mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>%
dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
# Predict
data_eval2$predicted_wins <- predict(model_batting_xbh_v2, newdata = data_eval2)
hist(data_eval2$predicted_wins,
breaks = 20,
main = "Distribution of Predicted Wins",
xlab = "Predicted Wins",
col = "lightpink",
border = "white")
# Preview predictions
head(data_eval2[, "predicted_wins", drop = FALSE], 4)
## predicted_wins
## 1 67.51041
## 2 69.04266
## 3 75.83950
## 4 85.59009
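Point predictions alone do not convey uncertainty. As a hedged sketch, base R’s predict.lm() can attach 95% prediction intervals to the same evaluation rows; the pred_int object below is illustrative and is not part of the saved output.
# Illustrative sketch: 95% prediction intervals for the evaluation teams
# predict.lm() ignores the extra predicted_wins column already present in data_eval2
pred_int <- predict(model_batting_xbh_v2,
                    newdata = data_eval2,
                    interval = "prediction",
                    level = 0.95)
head(cbind(data_eval2["predicted_wins"], pred_int), 4)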
Save results:
# Save results
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Predictions")
writeData(wb, sheet = "Predictions", x = data_eval2)
saveWorkbook(wb, "BaseballPrediction.xlsx", overwrite = TRUE)