In this homework assignment, we will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and includes the team's performance for that year, with all statistics adjusted to a 162-game season.
Our objective is to build a multiple linear regression model on the training data to predict the number of wins for each team. We can only use the variables provided (or variables that we derive from them). Below is a short description of the variables of interest in the data set:
VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_WINS Number of wins
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins
TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins
TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins
TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins
TEAM_BATTING_BB Walks by batters Positive Impact on Wins
TEAM_BATTING_HBP Batters hit by pitch (get a free base) Positive Impact on Wins
TEAM_BATTING_SO Strikeouts by batters Negative Impact on Wins
TEAM_BASERUN_SB Stolen bases Positive Impact on Wins
TEAM_BASERUN_CS Caught stealing Negative Impact on Wins
TEAM_FIELDING_E Errors Negative Impact on Wins
TEAM_FIELDING_DP Double Plays Positive Impact on Wins
TEAM_PITCHING_BB Walks allowed Negative Impact on Wins
TEAM_PITCHING_H Hits allowed Negative Impact on Wins
TEAM_PITCHING_HR Homeruns allowed Negative Impact on Wins
TEAM_PITCHING_SO Strikeouts by pitchers Positive Impact on Wins
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(dplyr)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/moneyball-training-data.csv")
data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/moneyball-evaluation-data.csv")
The MoneyBall dataset contains 2,276 observations and 17 variables, each representing the performance of a baseball team in a given year between 1871 and 2006. The goal is to predict the number of team wins (TARGET_WINS) using available data.
The histograms below display the distributions of the numerical features in the dataset. Several variables, such as TEAM_BATTING_2B, TEAM_BATTING_H, and TEAM_FIELDING_DP, appear approximately normally distributed, although TEAM_BATTING_H shows a slight right skew.
Other features exhibit more pronounced skewness and outliers. For example, TEAM_BATTING_BB is slightly left-skewed, while variables like TEAM_BASERUN_CS, TEAM_BASERUN_SB, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO are heavily right-skewed, indicating that a few teams had exceptionally high values in these categories.
Additionally, variables such as TEAM_BATTING_HBP, TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR display bimodal or multimodal distributions, suggesting that team performance in these metrics may differ substantially depending on the year and team.
The target variable TARGET_WINS is continuous and roughly normally distributed, which confirms that we are dealing with a regression problem. As a first approach, linear regression is appropriate and may provide reasonable performance. However, not all predictors are symmetric or linearly related to the target, and some exhibit skewness or potential outliers, which may affect modeling accuracy.
Several variables contain missing values, the most significant being TEAM_BATTING_HBP (2085 missing), followed by TEAM_BASERUN_CS (772), TEAM_FIELDING_DP (286), and TEAM_BASERUN_SB (131). These statistics were not tracked in early baseball (1871-1890s), so a missing value likely means "0". This makes historical sense: if stolen bases weren't recorded, it's because they weren't tracked, not because the statistic is unknown.
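One informal way to probe this interpretation is to check whether the high-missingness variables tend to be missing on the same rows, which is what a shared era effect would produce. The following is a small sketch on the raw data, not part of the original assignment output:
# Check whether the high-missingness variables are missing together (consistent with an era effect)
data_raw %>%
  summarise(
    cs_missing         = sum(is.na(TEAM_BASERUN_CS)),
    hbp_missing        = sum(is.na(TEAM_BATTING_HBP)),
    cs_and_hbp_missing = sum(is.na(TEAM_BASERUN_CS) & is.na(TEAM_BATTING_HBP))
  )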
# Structure of the data
str(data_raw)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
# Glimpse of the data
head(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
# Summary
summary(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
# Feature Distribution
data_raw |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
# Distribution of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot() +
aes(x = TARGET_WINS) +
geom_histogram(bins= 40,fill = "blue", color = "black") +
labs(title = "Distribution of TARGET_WINS", y = "Count") +
theme_minimal()
# Count missing values per Variable
data_raw %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 6 × 2
## variable missing_count
## <chr> <int>
## 1 TEAM_BATTING_HBP 2085
## 2 TEAM_BASERUN_CS 772
## 3 TEAM_FIELDING_DP 286
## 4 TEAM_BASERUN_SB 131
## 5 TEAM_BATTING_SO 102
## 6 TEAM_PITCHING_SO 102
The boxplots reveal that TARGET_WINS exhibits no extreme outliers. In contrast, many predictor variables show clear evidence of outliers. Variables such as TEAM_BASERUN_SB, TEAM_PITCHING_BB, TEAM_FIELDING_E, and several other pitching and batting statistics have values that extend far beyond the interquartile range, indicating that some teams had exceptionally high or low values in these categories. These outliers could strongly influence regression models if left unaddressed. To mitigate this, we apply transformations such as logarithmic and square root in the modeling phase, with the goal of reducing skewness and the impact of extreme values.
# Boxplot of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot(aes(x = "", y = TARGET_WINS)) +
geom_boxplot(fill="pink") +
labs(title = "Distribution of TARGET_WINS", y = "Wins") +
theme_minimal()
# Boxplots of all predictors
ggplot(stack(data_raw[,-1]), aes(x = ind, y = values)) +
geom_boxplot(fill = "darkgrey") +
coord_cartesian(ylim = c(0, 1000)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
To prepare the dataset for modeling, we addressed missing values through a combination of variable removal and imputation.
First, we dropped the INDEX variable, as it serves only as a row identifier and holds no predictive value.
Next, we removed the TEAM_BATTING_HBP variable, which contained over 90% missing values. This specific statistic was not recorded in most early baseball seasons, making the variable unsuitable for imputation or inclusion in modeling.
For other variables with moderate missingness — specifically TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_FIELDING_DP — we applied zero imputation. This decision is informed by historical context: these statistics were often not tracked in earlier years, so the missing values are likely systematic (i.e., due to lack of data collection) rather than random.
Finally, we imputed any remaining missing values (TEAM_BATTING_SO and TEAM_PITCHING_SO) across the dataset using the median of each variable to reduce the influence of outliers and skewed distributions.
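As a quick check on the proportions behind these decisions (for example, the claim that TEAM_BATTING_HBP is over 90% missing), the missing counts from earlier can be expressed as percentages. This is a sketch on the raw data:
# Percentage of missing values per variable (TEAM_BATTING_HBP is roughly 92% missing: 2085/2276)
data_raw %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  filter(pct_missing > 0) %>%
  arrange(desc(pct_missing))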
# Drop the INDEX variable as it has no theoretical effect and it was mentioned to not use it.
# Drop the TEAM_BATTING_HBP variable as it is mostly made up of missing values
data_transform <- data_raw |>
# dplyr::select(-INDEX, TEAM_BATTING_HBP)
dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
# For these stats, missing likely means "0" (not tracked in early baseball), hence imputing 0
# Remaining NAs (TEAM_BATTING_SO and TEAM_PITCHING_SO) are handled below with median imputation
data_transform <- data_transform %>%
mutate(
# TEAM_BATTING_HBP = ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP),
TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP)
)
# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation (this applies to TEAM_BATTING_SO and TEAM_PITCHING_SO)
data_transform <- data_transform |>
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
str(data_transform)
## 'data.frame': 2276 obs. of 15 variables:
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : num 842 1075 917 922 920 ...
## $ TEAM_BASERUN_SB : num 0 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : num 0 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: num 5456 1082 917 928 920 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: num 0 155 153 156 168 149 186 136 169 159 ...
To evaluate the suitability of linear regression, we plotted the linear relationships between each predictor and the target variable (TARGET_WINS) using scatterplots and Pearson correlation coefficients.
TEAM_BATTING_H (r = 0.389) shows the strongest positive correlation with TARGET_WINS, suggesting that teams with more base hits generally win more games — as expected. This is followed by TEAM_BATTING_2B, which also demonstrates a moderately positive relationship, aligning with baseball intuition.
Interestingly, TEAM_BATTING_SO (strikeouts by batters) shows an almost negligible correlation (r = -0.030) with TARGET_WINS. This is somewhat surprising, as one might expect higher strikeout counts to hurt performance. The lack of relationship suggests that this variable is not a strong independent predictor.
Another notable result is TEAM_PITCHING_SO (strikeouts by pitchers), which shows a weak negative correlation with wins (r = -0.075). This result could indicate multicollinearity — for example, if teams that accumulate many strikeouts also give up more hits or walks — or it may suggest data quality issues in historical records.
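As a quick, informal probe of that multicollinearity hypothesis, we can check how pitching strikeouts correlate with hits and walks allowed. This is a sketch on the cleaned data_transform, not part of the original analysis:
# Do high-strikeout pitching staffs also allow more hits and walks?
cor(data_transform$TEAM_PITCHING_SO,
    data_transform[, c("TEAM_PITCHING_H", "TEAM_PITCHING_BB")])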
# Get all predictor variables (excluding TARGET_WINS)
predictors <- names(data_transform)[names(data_transform) != "TARGET_WINS"]
# Create scatterplots for all predictors vs TARGET_WINS
# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))
# Plot each predictor vs TARGET_WINS
for(i in 1:length(predictors)) {
predictor <- predictors[i]
correlation <- cor(data_transform[[predictor]], data_transform$TARGET_WINS)
plot(data_transform[[predictor]], data_transform$TARGET_WINS,
xlab = predictor,
ylab = "TARGET_WINS",
main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
pch = 16,
col = ifelse(correlation > 0, "darkblue", "darkred"),
cex = 0.6)
# Add regression line
abline(lm(data_transform$TARGET_WINS ~ data_transform[[predictor]]),
col = "red", lwd = 2)
# Add correlation text
text(x = quantile(data_transform[[predictor]], 0.05),
y = quantile(data_transform$TARGET_WINS, 0.95),
labels = paste("r =", round(correlation, 3)),
cex = 0.8, col = "red")
}
# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))
To further assess linear relationships and potential multicollinearity among variables, we examined the correlation matrix of all numeric features using a corrplot and targeted pairwise comparisons.
Several batting and pitching variables are correlated with one another: TEAM_BATTING_HR and TEAM_PITCHING_HR are nearly collinear (r ≈ 0.97), TEAM_FIELDING_E and TEAM_FIELDING_DP are strongly negatively correlated (r ≈ -0.76), and the batting/pitching pairs for hits, walks, and strikeouts show moderate positive correlations (r ≈ 0.30-0.49).
These relationships suggest possible multicollinearity, which could impact model stability and interpretability.
We also ranked features by their correlation with TARGET_WINS, identifying the top 5 positive and negative predictors:
Top positively correlated features include TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_BB, TEAM_PITCHING_HR, and TEAM_BATTING_HR.
Top negatively correlated features include TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_BATTING_SO.
# Correlation matrix
cor_matrix <- cor(data_transform, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# Extract correlations between batting and pitching variables
cat("=== MULTICOLLINEARITY ANALYSIS ===\n")
## === MULTICOLLINEARITY ANALYSIS ===
# Batting vs Pitching
multicollin_pairs <- data.frame(
Batting_Var = c("TEAM_BATTING_HR", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_BATTING_SO"),
Pitching_Var = c("TEAM_PITCHING_HR", "TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO"),
Correlation = c(
cor(data_transform$TEAM_BATTING_HR, data_transform$TEAM_PITCHING_HR),
cor(data_transform$TEAM_BATTING_H, data_transform$TEAM_PITCHING_H),
cor(data_transform$TEAM_BATTING_BB, data_transform$TEAM_PITCHING_BB),
cor(data_transform$TEAM_BATTING_SO, data_transform$TEAM_PITCHING_SO)
)
)
print(multicollin_pairs)
## Batting_Var Pitching_Var Correlation
## 1 TEAM_BATTING_HR TEAM_PITCHING_HR 0.9693714
## 2 TEAM_BATTING_H TEAM_PITCHING_H 0.3026937
## 3 TEAM_BATTING_BB TEAM_PITCHING_BB 0.4893613
## 4 TEAM_BATTING_SO TEAM_PITCHING_SO 0.4161816
# Fielding
print(cor(data_transform$TEAM_FIELDING_DP, data_transform$TEAM_FIELDING_E))
## [1] -0.7569313
# Get correlations with TARGET_WINS
target_wins_correlation <- cor_matrix[, "TARGET_WINS"]
# Remove TARGET_WINS itself (correlation = 1)
target_wins_correlation <- target_wins_correlation [names(target_wins_correlation ) != "TARGET_WINS"]
# Top 5 Positive correlations
top5_pos <- sort(target_wins_correlation , decreasing = TRUE) %>%
head(5)
# Top 5 Negative correlations
top5_neg <- sort(target_wins_correlation , decreasing = FALSE) %>%
head(5)
# Combine into one dataframe
combined_df <- data.frame(
Feature = c(names(top5_pos), names(top5_neg)),
Correlation = c(top5_pos, top5_neg)
)
# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Positively and Negatively Correlated with target_wins",
x = "Feature",
y = "Correlation with target_wins",
fill = "Direction") +
theme_minimal()
# The `findCorrelation` function is specifically designed for this
high_corr_list <- findCorrelation(cor_matrix, cutoff = 0.4, verbose = FALSE)
names(data_transform)[high_corr_list]
## [1] "TEAM_BATTING_HR" "TEAM_FIELDING_E" "TEAM_PITCHING_HR" "TEAM_BATTING_SO"
## [5] "TEAM_BATTING_3B" "TEAM_FIELDING_DP" "TEAM_BATTING_BB" "TEAM_BATTING_2B"
## [9] "TEAM_PITCHING_SO"
To address multicollinearity and explore different modeling strategies, several derived datasets were created using feature engineering and variable selection:
Dataset 1: Conservative Reduction
We removed TEAM_PITCHING_HR, which was highly correlated with TEAM_BATTING_HR.
Dataset 2: Moderate Reduction
This version takes a more aggressive stance on multicollinearity. We removed TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_BATTING_SO, and TEAM_FIELDING_E. These variables showed pairwise correlations above 0.4 (in absolute value) with other predictors, and their removal helps reduce noise and potential variance.
Dataset 3: Transformed Variables (Net Advantage)
Rather than eliminating variables, this version transforms them into net advantage metrics:
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR
NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H
NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB
NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
Dataset 4: Batting-Focused
This subset includes only batting, base running, and fielding statistics, excluding all pitching variables. A derived extra-base-hits variable (TEAM_BATTING_XBH) is also introduced later (Model 4.1) to address coefficients whose signs contradict theoretical expectations.
Dataset 5: Pitching-Focused
In contrast, this version includes only pitching, base running, and fielding stats — omitting batting features entirely.
# Dataset 1: Conservative - Remove one variable from each highly correlated pair (TEAM_BATTING_HR, TEAM_PITCHING_HR)
data_reduced <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR)
# Dataset 2: Moderate - remove TEAM_PITCHING_HR/BB/SO, TEAM_BATTING_SO, and TEAM_FIELDING_E (pairwise |correlation| > 0.4)
data_moderate <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR, -TEAM_PITCHING_BB, -TEAM_PITCHING_SO, -TEAM_BATTING_SO, -TEAM_FIELDING_E)
# Dataset 3: Transformed - Create net advantage variables
data_transformed <- data_transform %>%
mutate(
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H,
NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB,
NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
) %>%
# dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
# TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_FIELDING_E, TEAM_FIELDING_DP,
# NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP,
NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
# Datasets 4 & 5: keep only batting OR pitching variables (separate models)
data_batting_only <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"),
starts_with("TEAM_FIELDING"))
data_pitching_focused <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_PITCHING"), starts_with("TEAM_FIELDING"),
TEAM_BASERUN_SB, TEAM_BASERUN_CS) # Keep base running as it's not correlated
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
This was done for all engineered versions of the dataset:
train_full / test_full — Full feature set (after cleaning)
train_conservative / test_conservative — Reduced set with one variable removed per collinear pair
train_moderate / test_moderate — More aggressively reduced set
train_transformed / test_transformed — Feature-engineered net-advantage variables
train_batting / test_batting — Batting-only feature set
train_pitching / test_pitching — Pitching-focused feature set
Each of these splits will be used in the next phase to build and evaluate multiple linear regression models, comparing their performance on both the training and testing sets.
# Split data for validation
set.seed(123)
#n <- nrow(data_transform)
#train_idx <- sample(n, 0.8 * n)
train_idx <- createDataPartition(y = data_transform$TARGET_WINS, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train_full <- data_transform[train_idx, ]
test_full <- data_transform[-train_idx, ]
train_conservative <- data_reduced[train_idx, ]
test_conservative <- data_reduced[-train_idx, ]
train_moderate <- data_moderate[train_idx, ]
test_moderate <- data_moderate[-train_idx, ]
train_transformed <- data_transformed[train_idx, ]
test_transformed <- data_transformed[-train_idx, ]
train_batting <- data_batting_only[train_idx, ]
test_batting <- data_batting_only[-train_idx, ]
train_pitching <- data_pitching_focused[train_idx, ]
test_pitching<- data_pitching_focused[-train_idx, ]
Let’s first start out with a model that includes everything:
# Model 1: Full model (all data including multicollinearity predictors)
model_full <- lm(TARGET_WINS ~ ., data = train_full)
summary(model_full)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.719 -8.854 0.269 8.579 53.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0418328 6.3176701 2.856 0.00434 **
## TEAM_BATTING_H 0.0500311 0.0042745 11.705 < 2e-16 ***
## TEAM_BATTING_2B -0.0252875 0.0106160 -2.382 0.01732 *
## TEAM_BATTING_3B 0.0585974 0.0191983 3.052 0.00230 **
## TEAM_BATTING_HR 0.0395462 0.0312765 1.264 0.20625
## TEAM_BATTING_BB 0.0057714 0.0064023 0.901 0.36747
## TEAM_BATTING_SO -0.0036516 0.0028289 -1.291 0.19694
## TEAM_BASERUN_SB 0.0061007 0.0047539 1.283 0.19955
## TEAM_BASERUN_CS -0.0192662 0.0117365 -1.642 0.10085
## TEAM_PITCHING_H -0.0008394 0.0004566 -1.838 0.06616 .
## TEAM_PITCHING_HR 0.0006267 0.0279223 0.022 0.98210
## TEAM_PITCHING_BB 0.0013045 0.0044831 0.291 0.77111
## TEAM_PITCHING_SO 0.0023757 0.0010191 2.331 0.01985 *
## TEAM_FIELDING_E -0.0276333 0.0034789 -7.943 3.43e-15 ***
## TEAM_FIELDING_DP -0.0511858 0.0114022 -4.489 7.60e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 1807 degrees of freedom
## Multiple R-squared: 0.2836, Adjusted R-squared: 0.278
## F-statistic: 51.09 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full))
# n = # of observations
n <- nrow(train_full)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full)
##
## studentized Breusch-Pagan test
##
## data: model_full
## BP = 244.87, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.833728 2.544201 3.004412 37.045992
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 6.286496 4.824217 1.967925 1.396113
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 3.929952 30.184911 6.078982 3.604667
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 6.272760 4.004321
######
preds <- predict(model_full, newdata = test_full)
RMSE(preds, test_full$TARGET_WINS)
## [1] 13.71198
R2(preds, test_full$TARGET_WINS)
## [1] 0.3063485
Model 1 validity check (linear regression assumptions): the residuals are roughly symmetric and centered near zero, but the Breusch-Pagan test (BP = 244.87, p < 2.2e-16) indicates nonconstant variance, and the VIFs for TEAM_BATTING_HR (37.0) and TEAM_PITCHING_HR (30.2) point to severe multicollinearity.
Other notes: test-set performance is modest (RMSE = 13.71, R² = 0.306), and several of the heavily skewed predictors may be limiting the fit.
Next step: Try a log transformation on the skewed predictors
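To decide which predictors to transform, one option is to rank them by skewness using the skewness() function from e1071 (loaded at the top for exactly this purpose but not otherwise used). This is a sketch on the cleaned data_transform; the heavily right-skewed variables flagged in the EDA should appear near the top:
# Rank predictors by absolute skewness to pick candidates for log/sqrt transformation
data_transform %>%
  dplyr::select(-TARGET_WINS) %>%
  summarise(across(everything(), ~ e1071::skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))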
# Model 1.1: Full model with Log transformation of the skewed predictors
train_full_log <- train_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
test_full_log <- test_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
model_full_log <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO, data = train_full_log)
summary(model_full_log)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## log_TEAM_FIELDING_E + TEAM_FIELDING_DP + log_TEAM_PITCHING_H +
## log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO,
## data = train_full_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.836 -8.238 -0.009 8.257 75.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.212e+01 1.814e+01 5.078 4.21e-07 ***
## TEAM_BATTING_H 5.453e-02 4.445e-03 12.269 < 2e-16 ***
## TEAM_BATTING_2B -3.549e-02 1.059e-02 -3.350 0.000825 ***
## log_TEAM_BATTING_3B 5.677e+00 1.100e+00 5.162 2.71e-07 ***
## TEAM_BATTING_HR -1.026e-04 3.208e-02 -0.003 0.997449
## TEAM_BATTING_BB 2.159e-02 5.561e-03 3.882 0.000107 ***
## TEAM_BATTING_SO 1.707e-03 3.038e-03 0.562 0.574259
## log_TEAM_BASERUN_SB -1.952e+00 3.322e-01 -5.875 5.02e-09 ***
## TEAM_BASERUN_CS -2.434e-02 1.184e-02 -2.055 0.039990 *
## log_TEAM_FIELDING_E -1.568e+01 1.521e+00 -10.314 < 2e-16 ***
## TEAM_FIELDING_DP -9.046e-02 1.071e-02 -8.449 < 2e-16 ***
## log_TEAM_PITCHING_H -2.054e+00 2.384e+00 -0.861 0.389219
## log_TEAM_PITCHING_BB -1.144e-01 2.010e+00 -0.057 0.954626
## TEAM_PITCHING_HR -7.289e-03 2.838e-02 -0.257 0.797334
## log_TEAM_PITCHING_SO 4.828e-01 7.491e-01 0.645 0.519300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.18 on 1807 degrees of freedom
## Multiple R-squared: 0.2889, Adjusted R-squared: 0.2834
## F-statistic: 52.43 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full_log)
##
## studentized Breusch-Pagan test
##
## data: model_full_log
## BP = 325.49, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_log)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 4.175962 2.552007 3.103816
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 39.270562 4.778690 5.603908
## log_TEAM_BASERUN_SB TEAM_BASERUN_CS log_TEAM_FIELDING_E
## 1.762556 1.431583 8.896992
## TEAM_FIELDING_DP log_TEAM_PITCHING_H log_TEAM_PITCHING_BB
## 3.557102 5.429279 3.475735
## TEAM_PITCHING_HR log_TEAM_PITCHING_SO
## 31.415344 2.455430
######
preds <- predict(model_full_log, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.36052
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.3458344
Model 1.1 validity check (linear regression assumptions): the log transformations slightly improve the fit (test RMSE = 13.36, R² = 0.346), but the Breusch-Pagan test (BP = 325.49, p < 2.2e-16) still indicates nonconstant variance, and the home run variables retain VIFs above 30.
Other notes: most of the log-transformed pitching terms are not statistically significant, suggesting they add little once their batting counterparts are in the model.
Check for leverage points:
# Leverage points
selected_row_single <- train_full[c("2136", "1342"), ]
print(selected_row_single)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2136 41 992 263 20 0
## 1342 108 1188 338 0 0
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2136 142 952 0 0
## 1342 270 945 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2136 20088 0 2876 19278
## 1342 16038 0 3645 12758
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 2136 952 0
## 1342 716 0
As you can see, these points appear to be leverage points, and they contain a lot of zeros. For example, case 1342's TEAM_BATTING_HR is zero. Since we can't find out more about these observations, we won't remove them. In a real-world scenario, we would investigate further and check whether these values were entered incorrectly.
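To confirm that these rows actually exceed the leverage cutoff used in the assumption checks, their hat values can be pulled directly from the fitted model. This is a sketch using model_full and the same 2*(p+1)/n cutoff as above:
# Hat values for the two flagged rows versus the leverage cutoff
lev <- hatvalues(model_full)
p <- length(coef(model_full))
n <- nrow(train_full)
cutoff <- (2 * (p + 1)) / n
lev[c("2136", "1342")] # leverage of the flagged rows
sum(lev > cutoff)      # how many training rows exceed the cutoff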
Next step: keep the log transformation of the skewed predictors and add weighted least squares (WLS) to address the nonconstant variance
# Model 1.2: Full model with Log transformation of the skewed predictors + WLS
# Define weights as 1 / (fitted values of a regression of |residuals| on fitted values)^2,
# so observations with larger estimated error variance receive smaller weight
wt <- 1 / lm(abs(model_full_log$residuals) ~ model_full_log$fitted.values)$fitted.values^2
# perform weighted least squares regression
# Fit model
model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)
# model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
# TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
# log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)
#view summary of model
summary(model_full_log_wls)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_BATTING_SO + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## log_TEAM_FIELDING_E + TEAM_FIELDING_DP + log_TEAM_PITCHING_H +
## log_TEAM_PITCHING_BB + TEAM_PITCHING_HR + log_TEAM_PITCHING_SO,
## data = train_full_log, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.5067 -0.7887 0.0009 0.8068 5.1522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.122e+01 1.964e+01 4.646 3.64e-06 ***
## TEAM_BATTING_H 4.958e-02 4.577e-03 10.833 < 2e-16 ***
## TEAM_BATTING_2B -3.676e-02 1.039e-02 -3.536 0.000417 ***
## log_TEAM_BATTING_3B 6.681e+00 1.123e+00 5.951 3.20e-09 ***
## TEAM_BATTING_HR 2.528e-03 3.222e-02 0.078 0.937470
## TEAM_BATTING_BB 2.315e-02 7.136e-03 3.245 0.001197 **
## TEAM_BATTING_SO 5.269e-04 3.076e-03 0.171 0.864004
## log_TEAM_BASERUN_SB -1.884e+00 3.399e-01 -5.542 3.43e-08 ***
## TEAM_BASERUN_CS -1.733e-02 1.195e-02 -1.451 0.147084
## log_TEAM_FIELDING_E -1.461e+01 1.504e+00 -9.715 < 2e-16 ***
## TEAM_FIELDING_DP -8.219e-02 1.042e-02 -7.886 5.35e-15 ***
## log_TEAM_PITCHING_H -7.814e-01 2.998e+00 -0.261 0.794377
## log_TEAM_PITCHING_BB -2.125e+00 3.032e+00 -0.701 0.483483
## TEAM_PITCHING_HR 6.595e-03 2.852e-02 0.231 0.817128
## log_TEAM_PITCHING_SO 3.132e-01 8.703e-01 0.360 0.718973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 1807 degrees of freedom
## Multiple R-squared: 0.2785, Adjusted R-squared: 0.2729
## F-statistic: 49.82 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log_wls, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log_wls)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log_wls))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log_wls)
# Perform Breusch-Pagan Test
bptest(model_full_log_wls)
##
## studentized Breusch-Pagan test
##
## data: model_full_log_wls
## BP = 5.7805, df = 14, p-value = 0.9717
# Variance Inflation Factor
vif(model_full_log_wls)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 5.473533 2.560235 3.334773
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 40.833173 7.705148 6.150963
## log_TEAM_BASERUN_SB TEAM_BASERUN_CS log_TEAM_FIELDING_E
## 1.713186 1.439969 8.694267
## TEAM_FIELDING_DP log_TEAM_PITCHING_H log_TEAM_PITCHING_BB
## 3.641813 8.171872 6.038269
## TEAM_PITCHING_HR log_TEAM_PITCHING_SO
## 32.736416 2.647742
######
preds <- predict(model_full_log_wls, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.37799
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.346121
Model 1.2 validity check (linear regression assumptions): weighting resolves the heteroscedasticity (Breusch-Pagan BP = 5.78, p = 0.97), although the home run VIFs remain above 30.
Other notes: test-set performance is essentially unchanged from Model 1.1 (RMSE = 13.38, R² = 0.346).
Next step: let’s try a square root transformation of the skewed predictors
# Model 1.3: Full model with square root transformation of the skewed predictors
# Transform skewed predictors via sqrt
train_full_sqrt <- train_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
)
test_full_sqrt <- test_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
)
# Fit model
model_full_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + sqrt_TEAM_PITCHING_SO +
TEAM_PITCHING_HR + TEAM_BATTING_SO,
data = train_full_sqrt)
summary(model_full_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E +
## TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB +
## sqrt_TEAM_PITCHING_SO + TEAM_PITCHING_HR + TEAM_BATTING_SO,
## data = train_full_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.309 -8.626 -0.003 8.309 52.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.924758 7.094322 3.936 8.59e-05 ***
## TEAM_BATTING_H 0.051973 0.004329 12.006 < 2e-16 ***
## TEAM_BATTING_2B -0.030711 0.010587 -2.901 0.003768 **
## sqrt_TEAM_BATTING_3B 1.358313 0.311597 4.359 1.38e-05 ***
## TEAM_BATTING_HR 0.029548 0.031534 0.937 0.348875
## TEAM_BATTING_BB 0.020872 0.008164 2.557 0.010653 *
## sqrt_TEAM_BASERUN_SB -0.072625 0.101922 -0.713 0.476216
## TEAM_BASERUN_CS -0.023099 0.011847 -1.950 0.051354 .
## sqrt_TEAM_FIELDING_E -1.499124 0.162702 -9.214 < 2e-16 ***
## TEAM_FIELDING_DP -0.078172 0.011445 -6.831 1.15e-11 ***
## sqrt_TEAM_PITCHING_H -0.034373 0.076828 -0.447 0.654642
## sqrt_TEAM_PITCHING_BB -0.457310 0.301683 -1.516 0.129729
## sqrt_TEAM_PITCHING_SO 0.498949 0.135541 3.681 0.000239 ***
## TEAM_PITCHING_HR -0.003848 0.027868 -0.138 0.890182
## TEAM_BATTING_SO -0.009869 0.003954 -2.496 0.012646 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1807 degrees of freedom
## Multiple R-squared: 0.2874, Adjusted R-squared: 0.2818
## F-statistic: 52.05 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_sqrt))
# n = # of observations
n <- nrow(train_full_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_sqrt)
# Perform Breusch-Pagan Test
bptest(model_full_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_full_sqrt
## BP = 290.83, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.953139 2.543924 3.216376
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 37.858671 10.276593 1.819623
## TEAM_BASERUN_CS sqrt_TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.430004 8.210097 4.055568
## sqrt_TEAM_PITCHING_H sqrt_TEAM_PITCHING_BB sqrt_TEAM_PITCHING_SO
## 5.250760 8.858551 6.530655
## TEAM_PITCHING_HR TEAM_BATTING_SO
## 30.226649 9.473807
######
# Predict on test set
preds_sqrt <- predict(model_full_sqrt, newdata = test_full_sqrt)
rmse_sqrt <- RMSE(preds_sqrt, test_full_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_sqrt, test_full_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 13.61336
r2_sqrt
## [1] 0.3167827
Model 1.3 validity check (linear regression assumptions): residuals remain roughly symmetric, but the Breusch-Pagan test (BP = 290.83, p < 2.2e-16) again indicates nonconstant variance, and the home run VIFs remain above 30.
Other notes: test-set performance (RMSE = 13.61, R² = 0.317) is slightly worse than the log-transformed version.
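To make the choice among the Model 1 variants explicit, their test-set performance can be recomputed side by side. This is a sketch that simply repeats the RMSE and R² calculations reported above for each fitted model:
# Side-by-side test-set comparison of the Model 1 variants
model1_variants <- list(
  full         = list(model = model_full,         test = test_full),
  full_log     = list(model = model_full_log,     test = test_full_log),
  full_log_wls = list(model = model_full_log_wls, test = test_full_log),
  full_sqrt    = list(model = model_full_sqrt,    test = test_full_sqrt)
)
purrr::map_dfr(model1_variants, function(x) {
  preds <- predict(x$model, newdata = x$test)
  tibble(RMSE = RMSE(preds, x$test$TARGET_WINS),
         R2   = R2(preds, x$test$TARGET_WINS))
}, .id = "model")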
Best model 1 version: Model 1.1 (log-transformed skewed predictors), which has the lowest test RMSE (13.36) and, together with the WLS variant, the highest test R² (about 0.346), as the comparison above shows.
Next step: let's address multicollinearity by using the train_moderate dataset (removed highly correlated predictors)
# Model 2: Moderate dataset model (removed highly correlated predictors)
# Fit model
model_moderate <- lm(TARGET_WINS ~ ., data = train_moderate)
#view summary of model
summary(model_moderate)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_moderate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.909 -8.995 0.495 8.905 76.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0807986 3.9865298 1.274 0.20265
## TEAM_BATTING_H 0.0427033 0.0035658 11.976 < 2e-16 ***
## TEAM_BATTING_2B -0.0058880 0.0101635 -0.579 0.56244
## TEAM_BATTING_3B 0.0544292 0.0191381 2.844 0.00450 **
## TEAM_BATTING_HR 0.0471021 0.0085870 5.485 4.71e-08 ***
## TEAM_BATTING_BB 0.0173697 0.0036085 4.814 1.61e-06 ***
## TEAM_BASERUN_SB 0.0126187 0.0045981 2.744 0.00612 **
## TEAM_BASERUN_CS -0.0133792 0.0118269 -1.131 0.25810
## TEAM_PITCHING_H -0.0017073 0.0002894 -5.900 4.33e-09 ***
## TEAM_FIELDING_DP -0.0004111 0.0093700 -0.044 0.96501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.54 on 1812 degrees of freedom
## Multiple R-squared: 0.2479, Adjusted R-squared: 0.2442
## F-statistic: 66.38 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_moderate)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate))
# n = # of observations
n <- nrow(train_moderate)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
anova(model_moderate)
## Analysis of Variance Table
##
## Response: TARGET_WINS
## Df Sum Sq Mean Sq F value Pr(>F)
## TEAM_BATTING_H 1 63883 63883 348.5795 < 2.2e-16 ***
## TEAM_BATTING_2B 1 3416 3416 18.6403 1.664e-05 ***
## TEAM_BATTING_3B 1 1 1 0.0045 0.946460
## TEAM_BATTING_HR 1 18718 18718 102.1338 < 2.2e-16 ***
## TEAM_BATTING_BB 1 14674 14674 80.0688 < 2.2e-16 ***
## TEAM_BASERUN_SB 1 1828 1828 9.9768 0.001611 **
## TEAM_BASERUN_CS 1 168 168 0.9144 0.339072
## TEAM_PITCHING_H 1 6793 6793 37.0689 1.389e-09 ***
## TEAM_FIELDING_DP 1 0 0 0.0019 0.965006
## Residuals 1812 332078 183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform Breusch-Pagan Test
bptest(model_moderate)
##
## studentized Breusch-Pagan test
##
## data: model_moderate
## BP = 269.1, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate <- predict(model_moderate, newdata = test_moderate)
rmse_moderate <- RMSE(preds_moderate, test_moderate$TARGET_WINS)
r2_moderate <- R2(preds_moderate, test_moderate$TARGET_WINS)
rmse_moderate
## [1] 14.01101
r2_moderate
## [1] 0.2769754
Model 2 validity check (linear regression assumptions): all VIFs are below 3, so the multicollinearity problem is largely resolved, but the Breusch-Pagan test (BP = 269.1, p < 2.2e-16) still indicates nonconstant variance.
Other notes: removing the correlated predictors costs some accuracy (test RMSE = 14.01, R² = 0.277, versus 0.306 for the full model).
Next step: let’s address the validity check issues by transforming the skewed predictors
# Model 2.1: Moderate dataset model + transformation of skewed predictors
# Transform skewed predictors via sqrt
train_moderate_sqrt <- train_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
    # (additional skewed predictors could be transformed here as well)
)
test_moderate_sqrt <- test_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
)
# Fit model
model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt)
summary(model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.848 -8.959 0.565 9.014 78.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.607803 4.249041 1.790 0.07354 .
## TEAM_BATTING_H 0.045072 0.003747 12.029 < 2e-16 ***
## TEAM_BATTING_2B -0.007240 0.010193 -0.710 0.47766
## sqrt_TEAM_BATTING_3B 0.962061 0.302455 3.181 0.00149 **
## TEAM_BATTING_HR 0.047035 0.008912 5.278 1.46e-07 ***
## TEAM_BATTING_BB 0.018099 0.003701 4.890 1.10e-06 ***
## sqrt_TEAM_BASERUN_SB 0.107730 0.096868 1.112 0.26623
## TEAM_BASERUN_CS -0.011914 0.011972 -0.995 0.31982
## TEAM_FIELDING_DP -0.011846 0.008987 -1.318 0.18762
## sqrt_TEAM_PITCHING_H -0.273777 0.048625 -5.630 2.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.56 on 1812 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2416
## F-statistic: 65.47 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_moderate_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_moderate_sqrt
## BP = 272.64, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
# (note: this reports VIFs for the untransformed model_moderate; vif(model_moderate_sqrt) would use the sqrt terms)
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate_sqrt <- predict(model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_sqrt <- RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 14.05698
r2_sqrt
## [1] 0.2722703
Model 2.1 validity check (linear regression assumptions): the square root transformations do not resolve the nonconstant variance (Breusch-Pagan BP = 272.64, p < 2.2e-16).
Other notes: test-set performance is essentially unchanged (RMSE = 14.06, R² = 0.272).
Next step: let’s address nonconstant variance by transforming the skewed predictors and using Weighted Least Squares
# Model 2.2: Moderate dataset model + transformation of skewed predictors + WLS
# Define weights from the Model 2.1 residuals, as in Model 1.2 (inverse squared fitted |residuals|)
wt <- 1 / lm(abs(model_moderate_sqrt$residuals) ~ model_moderate_sqrt$fitted.values)$fitted.values^2
# perform weighted least squares regression
wls_model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt, weights=wt)
#view summary of model
summary(wls_model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4.7679 -0.8433 0.0520 0.8563 5.0569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.431146 4.039826 2.087 0.037027 *
## TEAM_BATTING_H 0.045416 0.003807 11.928 < 2e-16 ***
## TEAM_BATTING_2B -0.016131 0.010108 -1.596 0.110706
## sqrt_TEAM_BATTING_3B 0.984533 0.297333 3.311 0.000947 ***
## TEAM_BATTING_HR 0.051740 0.008678 5.962 2.99e-09 ***
## TEAM_BATTING_BB 0.015460 0.003615 4.277 1.99e-05 ***
## sqrt_TEAM_BASERUN_SB 0.094450 0.097192 0.972 0.331282
## TEAM_BASERUN_CS -0.008323 0.011978 -0.695 0.487205
## TEAM_FIELDING_DP -0.006103 0.008862 -0.689 0.491124
## sqrt_TEAM_PITCHING_H -0.254253 0.057049 -4.457 8.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.276 on 1812 degrees of freedom
## Multiple R-squared: 0.2392, Adjusted R-squared: 0.2354
## F-statistic: 63.28 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = wls_model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(wls_model_moderate_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(wls_model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(wls_model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(wls_model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: wls_model_moderate_sqrt
## BP = 5.3621, df = 9, p-value = 0.8017
# Variance Inflation Factor
vif(wls_model_moderate_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.654325 2.307732 3.129800
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 2.849087 1.860144 1.637307
## TEAM_BASERUN_CS TEAM_FIELDING_DP sqrt_TEAM_PITCHING_H
## 1.386230 2.543128 2.415067
######
# Predict on test set
preds_mod_sqrt <- predict(wls_model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_mod_wls_sqrt <- RMSE(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_mod_wls_sqrt <- R2(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_mod_wls_sqrt
## [1] 14.09027
r2_mod_wls_sqrt
## [1] 0.2697522
Model 2.2 validity check (linear regression assumptions): as with Model 1.2, weighting removes the heteroscedasticity (Breusch-Pagan BP = 5.36, p = 0.80), and all VIFs remain below 4.
Other notes: test-set performance is similar to the other Model 2 variants (RMSE = 14.09, R² = 0.270).
# Model 3: Transformed data (net advantage variables)
model_transformed <- lm(TARGET_WINS ~ ., data = train_transformed)
summary(model_transformed)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.989 -9.839 0.122 9.261 50.902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.7924810 3.1064212 21.180 < 2e-16 ***
## TEAM_BATTING_2B 0.0881168 0.0076330 11.544 < 2e-16 ***
## TEAM_BATTING_3B 0.1180329 0.0159613 7.395 2.15e-13 ***
## TEAM_BASERUN_SB 0.0069852 0.0047682 1.465 0.1431
## TEAM_BASERUN_CS -0.0184721 0.0123932 -1.491 0.1363
## TEAM_FIELDING_E -0.0282612 0.0033613 -8.408 < 2e-16 ***
## TEAM_FIELDING_DP -0.0501127 0.0120283 -4.166 3.24e-05 ***
## NET_HR_ADVANTAGE -0.0580541 0.0289564 -2.005 0.0451 *
## NET_HITS_ADVANTAGE 0.0003041 0.0004778 0.636 0.5245
## NET_WALKS_ADVANTAGE -0.0042913 0.0046882 -0.915 0.3601
## NET_SO_ADVANTAGE -0.0009371 0.0010537 -0.889 0.3739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.1 on 1811 degrees of freedom
## Multiple R-squared: 0.1849, Adjusted R-squared: 0.1804
## F-statistic: 41.09 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_transformed, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_transformed)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_transformed))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_transformed)
##
## studentized Breusch-Pagan test
##
## data: model_transformed
## BP = 233.9, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_transformed)
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1.158682 1.829429 1.744094 1.371380
## TEAM_FIELDING_E TEAM_FIELDING_DP NET_HR_ADVANTAGE NET_HITS_ADVANTAGE
## 5.158672 3.925586 1.705163 3.591665
## NET_WALKS_ADVANTAGE NET_SO_ADVANTAGE
## 4.931322 2.887223
######
preds <- predict(model_transformed, newdata = test_transformed)
RMSE(preds, test_transformed$TARGET_WINS)
## [1] 14.68434
R2(preds, test_transformed$TARGET_WINS)
## [1] 0.2047937
Model 3 validity check (linear regression assumptions): the Breusch-Pagan test (BP = 233.9, p < 2.2e-16) again indicates nonconstant variance, although all VIFs are moderate (at most about 5).
Other notes: the net-advantage formulation fits worst so far (test RMSE = 14.68, R² = 0.205), suggesting that collapsing the batting and pitching counts discards useful information.
# Model 4 (batting variables only)
# model_batting_xbh <- lm(TARGET_WINS ~ ., data = train_batting_fixed)
model_batting <- lm(TARGET_WINS ~ ., data = train_batting)
summary(model_batting)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.383 -8.884 0.276 8.369 57.893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.011e+01 6.184e+00 3.252 0.001169 **
## TEAM_BATTING_H 4.748e-02 4.195e-03 11.319 < 2e-16 ***
## TEAM_BATTING_2B -2.064e-02 1.046e-02 -1.973 0.048628 *
## TEAM_BATTING_3B 6.399e-02 1.866e-02 3.429 0.000619 ***
## TEAM_BATTING_HR 3.498e-02 1.049e-02 3.336 0.000867 ***
## TEAM_BATTING_BB 7.673e-03 3.775e-03 2.033 0.042220 *
## TEAM_BATTING_SO -9.631e-05 2.541e-03 -0.038 0.969765
## TEAM_BASERUN_SB 4.525e-03 4.710e-03 0.961 0.336835
## TEAM_BASERUN_CS -2.239e-02 1.167e-02 -1.919 0.055129 .
## TEAM_FIELDING_E -2.990e-02 2.856e-03 -10.469 < 2e-16 ***
## TEAM_FIELDING_DP -5.708e-02 1.119e-02 -5.100 3.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.26 on 1811 degrees of freedom
## Multiple R-squared: 0.2787, Adjusted R-squared: 0.2747
## F-statistic: 69.97 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting))
# n = # of observations
n <- nrow(train_batting)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting)
##
## studentized Breusch-Pagan test
##
## data: model_batting
## BP = 236.27, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.674924 2.460004 2.826139 4.145321
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2.175251 3.873051 1.922653 1.373299
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 4.209614 3.840744
######
preds <- predict(model_batting, newdata = test_batting)
RMSE(preds, test_batting$TARGET_WINS)
## [1] 13.67704
R2(preds, test_batting$TARGET_WINS)
## [1] 0.3110345
Model 4 validity check (linear regression assumptions): residuals are approximately normal and centered near zero, but the Breusch-Pagan test (BP = 236.27, p < 2.2e-16) again indicates heteroscedasticity. All VIFs are below about 4.2, so multicollinearity is moderate at worst.
Other notes: Model 4 fits noticeably better than Model 3 (test RMSE 13.68, R² 0.31), but TEAM_BATTING_2B and TEAM_FIELDING_DP carry coefficient signs that contradict their expected effects on wins.
Since TEAM_BATTING_2B and TEAM_FIELDING_DP have unexpected coefficient signs, let’s create a single extra-base-hits variable that combines the batting variables (2B, 3B, HR). This should reduce multicollinearity among the batting predictors.
# Model 4.1 (batting variables only) - New Extra Base Hits variable
# Create new TEAM_BATTING_XBH and remove individual predictors to address any multicollinearity between batting variables
train_batting_fixed <- train_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR)
test_batting_fixed <- test_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR)
model_batting_xbh <- lm(TARGET_WINS ~ ., data = train_batting_fixed)
summary(model_batting_xbh)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting_fixed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.261 -9.034 0.119 8.590 49.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3813982 6.0513216 3.533 0.000421 ***
## TEAM_BATTING_H 0.0455792 0.0041558 10.968 < 2e-16 ***
## TEAM_BATTING_BB 0.0093918 0.0037260 2.521 0.011800 *
## TEAM_BATTING_SO -0.0004717 0.0022751 -0.207 0.835788
## TEAM_BASERUN_SB 0.0065223 0.0044980 1.450 0.147216
## TEAM_BASERUN_CS -0.0265459 0.0116684 -2.275 0.023020 *
## TEAM_FIELDING_E -0.0297338 0.0028692 -10.363 < 2e-16 ***
## TEAM_FIELDING_DP -0.0628551 0.0110728 -5.677 1.6e-08 ***
## TEAM_BATTING_XBH 0.0090417 0.0075030 1.205 0.228333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.33 on 1813 degrees of freedom
## Multiple R-squared: 0.2705, Adjusted R-squared: 0.2673
## F-statistic: 84.04 on 8 and 1813 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting_xbh, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting_xbh)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting_xbh))
# n = # of observations
n <- nrow(train_batting_fixed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting_xbh)
##
## studentized Breusch-Pagan test
##
## data: model_batting_xbh
## BP = 210.04, df = 8, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting_xbh)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 3.570685 2.098047 3.074606 1.735990
## TEAM_BASERUN_CS TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_XBH
## 1.359770 4.204279 3.721079 3.817897
######
preds <- predict(model_batting_xbh, newdata = test_batting_fixed)
RMSE(preds, test_batting_fixed$TARGET_WINS)
## [1] 13.79523
R2(preds, test_batting_fixed$TARGET_WINS)
## [1] 0.2991301
Model 4.1 validity check (linear regression assumptions): residuals remain roughly normal, heteroscedasticity persists (BP = 210.04, p < 2.2e-16), and all VIFs stay below about 4.2.
Other notes: collapsing 2B, 3B, and HR into TEAM_BATTING_XBH removes the unexpected sign on doubles with only a small loss in fit (test RMSE 13.80, R² 0.30), but TEAM_FIELDING_DP still has a negative coefficient.
This model helped with multicollinearity among the batting variables. To fix the unexpected coefficient sign on TEAM_FIELDING_DP, let’s remove TEAM_FIELDING_E, which is negatively correlated with TEAM_FIELDING_DP.
# Model 4.2 (batting variables only) - New Extra Base Hits variable with FIELDING_E Removed
train_batting_fixed_v2 <- train_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
test_batting_fixed_v2 <- test_batting %>% mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>% dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
model_batting_xbh_v2 <- lm(TARGET_WINS ~ ., data = train_batting_fixed_v2)
summary(model_batting_xbh_v2)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting_fixed_v2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.999 -9.161 0.533 9.102 48.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.737311 5.747359 -0.476 0.633937
## TEAM_BATTING_H 0.040684 0.004248 9.577 < 2e-16 ***
## TEAM_BATTING_BB 0.025280 0.003494 7.235 6.84e-13 ***
## TEAM_BATTING_SO 0.004120 0.002296 1.795 0.072899 .
## TEAM_BASERUN_SB 0.016050 0.004530 3.543 0.000406 ***
## TEAM_BASERUN_CS -0.023355 0.012002 -1.946 0.051813 .
## TEAM_FIELDING_DP 0.007774 0.008979 0.866 0.386697
## TEAM_BATTING_XBH 0.014933 0.007698 1.940 0.052549 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.71 on 1814 degrees of freedom
## Multiple R-squared: 0.2273, Adjusted R-squared: 0.2243
## F-statistic: 76.23 on 7 and 1814 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting_xbh_v2, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_batting_xbh_v2)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting_xbh_v2))
# n = # of observations
n <- nrow(train_batting_fixed_v2)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting_xbh_v2)
##
## studentized Breusch-Pagan test
##
## data: model_batting_xbh_v2
## BP = 241.83, df = 7, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting_xbh_v2)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 3.524550 1.742845 2.957982 1.663462
## TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_BATTING_XBH
## 1.358823 2.311359 3.795982
######
preds <- predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2)
RMSE(preds, test_batting_fixed_v2$TARGET_WINS)
## [1] 14.25159
R2(preds, test_batting_fixed_v2$TARGET_WINS)
## [1] 0.2535529
Model 4.2 validity check (linear regression assumptions): residuals are roughly symmetric, the Breusch-Pagan test (BP = 241.83, p < 2.2e-16) still indicates heteroscedasticity, and every VIF is below 4, the lowest multicollinearity of the batting models.
Other notes: dropping TEAM_FIELDING_E flips TEAM_FIELDING_DP to the expected positive sign, and the remaining coefficients largely match their theoretical directions (only the small positive estimate on TEAM_BATTING_SO runs counter to expectation), at the cost of some fit (test RMSE 14.25, R² 0.25).
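The leverage cutoff computed above is only drawn on the diagnostic plot. As a purely illustrative sketch, the flagged observations for Model 4.2 can also be listed directly with base R’s hatvalues() and cooks.distance(); the flagged data frame name below is ours, not part of the original analysis.
# Illustrative sketch: list high-leverage / influential rows for Model 4.2
p <- length(coef(model_batting_xbh_v2))     # number of coefficients
n <- nrow(train_batting_fixed_v2)           # number of training observations
high_leverage <- (2 * (p + 1)) / n          # same cutoff used in the plots above
lev <- hatvalues(model_batting_xbh_v2)      # leverage of each training row
cd <- cooks.distance(model_batting_xbh_v2)  # influence of each training row
flagged <- data.frame(leverage = lev, cooks_d = cd)[lev > high_leverage, ]
head(flagged[order(-flagged$leverage), ])   # most extreme leverage first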
Let’s also try regularization with elastic net:
# Model 4.3 (batting variables only) Elastic Net
# Elastic Net Regression
set.seed(786)
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)
#pre process
pre_process <- c("center","scale")
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(TARGET_WINS ~ ., data = train_batting,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = pre_process)
# Test predictors and response
testX <- test_batting %>% dplyr::select(-TARGET_WINS)
testY <- test_batting$TARGET_WINS
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_model
## Elasticnet
##
## 1822 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1457, 1459, 1457, 1458, 1457
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 15.15154 0.1462023 11.88505
## 0.00 0.10 14.83157 0.1622760 11.62952
## 0.00 0.15 14.55462 0.1910254 11.42345
## 0.00 0.20 14.31963 0.2092135 11.25904
## 0.00 0.25 14.10932 0.2220923 11.11903
## 0.00 0.30 13.93332 0.2292596 10.99344
## 0.00 0.35 13.79439 0.2348170 10.89144
## 0.00 0.40 13.68548 0.2407948 10.81057
## 0.00 0.45 13.60776 0.2447093 10.74921
## 0.00 0.50 13.55294 0.2486167 10.70010
## 0.00 0.55 13.49733 0.2535632 10.65247
## 0.00 0.60 13.44407 0.2584496 10.60816
## 0.00 0.65 13.39720 0.2625928 10.56742
## 0.00 0.70 13.35845 0.2658271 10.53521
## 0.00 0.75 13.33147 0.2678645 10.51261
## 0.00 0.80 13.31384 0.2691977 10.49504
## 0.00 0.85 13.30241 0.2700431 10.47931
## 0.00 0.90 13.29495 0.2706265 10.46509
## 0.00 0.95 13.29181 0.2709066 10.45523
## 0.00 1.00 13.29759 0.2703835 10.45305
## 0.01 0.05 15.16134 0.1462023 11.89293
## 0.01 0.10 14.84672 0.1604437 11.64111
## 0.01 0.15 14.57450 0.1893114 11.43750
## 0.01 0.20 14.34379 0.2073170 11.27472
## 0.01 0.25 14.13515 0.2207728 11.13679
## 0.01 0.30 13.95884 0.2283742 11.01163
## 0.01 0.35 13.81734 0.2337047 10.90783
## 0.01 0.40 13.70662 0.2396172 10.82584
## 0.01 0.45 13.62462 0.2437302 10.76313
## 0.01 0.50 13.56619 0.2476038 10.71070
## 0.01 0.55 13.51130 0.2523255 10.66095
## 0.01 0.60 13.45915 0.2570366 10.61731
## 0.01 0.65 13.41145 0.2612984 10.57638
## 0.01 0.70 13.37052 0.2647993 10.54128
## 0.01 0.75 13.33980 0.2672329 10.51642
## 0.01 0.80 13.31882 0.2687891 10.49734
## 0.01 0.85 13.30627 0.2697359 10.48194
## 0.01 0.90 13.29845 0.2703257 10.46798
## 0.01 0.95 13.29460 0.2706450 10.45770
## 0.01 1.00 13.29651 0.2704849 10.45304
## 0.10 0.05 15.20837 0.1462023 11.93041
## 0.10 0.10 14.91970 0.1518720 11.69678
## 0.10 0.15 14.67409 0.1798391 11.50918
## 0.10 0.20 14.46029 0.1979614 11.35431
## 0.10 0.25 14.26715 0.2123321 11.22366
## 0.10 0.30 14.09381 0.2222994 11.10676
## 0.10 0.35 13.94529 0.2283872 10.99991
## 0.10 0.40 13.82615 0.2329749 10.91131
## 0.10 0.45 13.73086 0.2375405 10.83997
## 0.10 0.50 13.65589 0.2418900 10.78079
## 0.10 0.55 13.58911 0.2467168 10.72252
## 0.10 0.60 13.53592 0.2502655 10.66980
## 0.10 0.65 13.49355 0.2533289 10.62393
## 0.10 0.70 13.45470 0.2564967 10.58945
## 0.10 0.75 13.41701 0.2597632 10.55825
## 0.10 0.80 13.38473 0.2625117 10.53007
## 0.10 0.85 13.35842 0.2647103 10.50905
## 0.10 0.90 13.34157 0.2660898 10.49269
## 0.10 0.95 13.33420 0.2666511 10.48261
## 0.10 1.00 13.33227 0.2668126 10.47572
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.95 and lambda = 0.
enet_resample
## RMSE Rsquared MAE
## 13.6869927 0.3110159 10.5173741
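The resampling table ends with caret’s chosen tuning values (fraction = 0.95, lambda = 0). As a brief, hedged illustration, those values and the corresponding coefficients on the centered and scaled predictors can be pulled from the caret object; the finalModel here is an enet fit from the elasticnet package, whose predict method should accept type = "coefficients" with mode = "fraction".
# Illustrative sketch: selected tuning values and coefficients from the elastic net
enet_model$bestTune                         # fraction and lambda chosen by cross-validation
best_frac <- enet_model$bestTune$fraction
enet_coefs <- predict(enet_model$finalModel,
                      s = best_frac,
                      type = "coefficients",
                      mode = "fraction")
enet_coefs$coefficients                     # coefficients on the preprocessed predictors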
# Model 5: (pitching variables only)
model_pitching <- lm(TARGET_WINS ~ ., data = train_pitching)
summary(model_pitching)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_pitching)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.045 -9.521 0.433 9.421 65.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.3668947 2.8219995 30.250 < 2e-16 ***
## TEAM_PITCHING_H 0.0007015 0.0004369 1.606 0.10849
## TEAM_PITCHING_HR 0.0447728 0.0070586 6.343 2.84e-10 ***
## TEAM_PITCHING_BB 0.0136610 0.0026556 5.144 2.98e-07 ***
## TEAM_PITCHING_SO -0.0049613 0.0007268 -6.826 1.18e-11 ***
## TEAM_FIELDING_E -0.0250202 0.0036376 -6.878 8.31e-12 ***
## TEAM_FIELDING_DP -0.0662994 0.0124298 -5.334 1.08e-07 ***
## TEAM_BASERUN_SB 0.0128438 0.0049496 2.595 0.00954 **
## TEAM_BASERUN_CS -0.0284583 0.0127401 -2.234 0.02562 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.57 on 1813 degrees of freedom
## Multiple R-squared: 0.1282, Adjusted R-squared: 0.1243
## F-statistic: 33.32 on 8 and 1813 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_pitching, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_pitching)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_pitching))
# n = # of observations
n <- nrow(train_pitching)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_pitching)
##
## studentized Breusch-Pagan test
##
## data: model_pitching
## BP = 463.05, df = 8, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_pitching)
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2.966536 1.590418 1.758632 1.511765
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.654472 3.923401 1.758882 1.356343
######
preds <- predict(model_pitching, newdata = test_pitching)
RMSE(preds, test_pitching$TARGET_WINS)
## [1] 15.67067
R2(preds, test_pitching$TARGET_WINS)
## [1] 0.09612363
Model 5 validity check (linear regression assumptions): heteroscedasticity is the most severe of any model (BP = 463.05, p < 2.2e-16), and TEAM_FIELDING_E has a VIF above 5.
Other notes: the pitching-only model is the weakest candidate by far (test RMSE 15.67, R² 0.10), and TEAM_PITCHING_HR and TEAM_PITCHING_BB both carry positive coefficients despite their expected negative impact on wins.
# Model 6: Conservative dataset
model_conservative <- lm(TARGET_WINS ~ ., data = train_conservative)
summary(model_conservative)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_conservative)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.721 -8.853 0.267 8.578 53.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0309826 6.2974051 2.863 0.004242 **
## TEAM_BATTING_H 0.0500376 0.0042636 11.736 < 2e-16 ***
## TEAM_BATTING_2B -0.0252943 0.0106088 -2.384 0.017215 *
## TEAM_BATTING_3B 0.0586682 0.0189317 3.099 0.001972 **
## TEAM_BATTING_HR 0.0402066 0.0105933 3.795 0.000152 ***
## TEAM_BATTING_BB 0.0057175 0.0059344 0.963 0.335446
## TEAM_BATTING_SO -0.0036414 0.0027915 -1.304 0.192242
## TEAM_BASERUN_SB 0.0060936 0.0047418 1.285 0.198936
## TEAM_BASERUN_CS -0.0192592 0.0117291 -1.642 0.100764
## TEAM_PITCHING_H -0.0008386 0.0004553 -1.842 0.065638 .
## TEAM_PITCHING_BB 0.0013503 0.0039896 0.338 0.735056
## TEAM_PITCHING_SO 0.0023678 0.0009551 2.479 0.013266 *
## TEAM_FIELDING_E -0.0276340 0.0034778 -7.946 3.36e-15 ***
## TEAM_FIELDING_DP -0.0511885 0.0113984 -4.491 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 1808 degrees of freedom
## Multiple R-squared: 0.2836, Adjusted R-squared: 0.2784
## F-statistic: 55.05 on 13 and 1808 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_conservative, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_conservative)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_conservative))
# n = # of observations
n <- nrow(train_conservative)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_conservative)
##
## studentized Breusch-Pagan test
##
## data: model_conservative
## BP = 244.26, df = 13, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_conservative)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.816249 2.542129 2.923185 4.252122
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.404069 4.699856 1.959051 1.395128
## TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 3.910081 4.816881 3.168400 6.272359
## TEAM_FIELDING_DP
## 4.003873
######
preds <- predict(model_conservative, newdata = test_conservative)
RMSE(preds, test_conservative$TARGET_WINS)
## [1] 13.71211
R2(preds, test_conservative$TARGET_WINS)
## [1] 0.3063382
Model 6 validity check (linear regression assumptions): residuals look roughly normal, heteroscedasticity persists (BP = 244.26, p < 2.2e-16), and TEAM_BATTING_BB and TEAM_FIELDING_E have VIFs above 5.
Other notes: the conservative model performs about as well as the full model (test RMSE 13.71, R² 0.31), but TEAM_BATTING_2B and TEAM_FIELDING_DP again show coefficient signs that conflict with their expected effects.
After evaluating the candidate linear regression models on key performance metrics (RMSE, R², MSE, and the F-statistic), we selected Model 4.2 as the most robust and reliable overall. Model 4.2 did not achieve the lowest RMSE or the highest R² (those belonged to Model 1.1), but it outperformed the other models in stability and interpretability.
Specifically, Model 4.2 exhibited lower multicollinearity across predictors than Model 1.1, as indicated by more acceptable VIF values. This matters because Model 1.1, despite its stronger predictive metrics, showed signs of multicollinearity-induced distortion, including coefficient estimates with unexpected signs, i.e., directions of effect that contradicted theoretical expectations.
In contrast, Model 4.2 produced coefficient signs that were more consistent with baseball domain knowledge, suggesting better interpretability and more trustworthy parameter estimates. No model was perfect, but Model 4.2 offers the best balance between predictive accuracy and statistical validity, with an adjusted R² of 0.224 on the training data, a test RMSE of 14.25 (MSE ≈ 203.1, test R² of 0.254), and an F-statistic of 76.23, and it was therefore selected as the final model for prediction and interpretation.
model_results <- data.frame(
Model = c("Model 1: Full",
"Model 1.1: Full + Log",
"Model 1.2: Full + Log + WLS",
"Model 1.3: Full + Sqrt",
"Model 2: Moderate",
"Model 2.1: Moderate + Sqrt",
"Model 2.2: Moderate + Sqrt + WLS",
"Model 3: Transformed Advantage",
"Model 4: Batting",
"Model 4.1: Batting New Variable",
"Model 4.2: Batting New Variable + Fielding Removal",
"Model 4.3: Batting (Elastic Net)",
"Model 5: Pitching",
"Model 6: Conservative"),
RMSE = c(
RMSE(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
RMSE(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
RMSE(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
rmse_sqrt,
rmse_moderate,
RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
rmse_mod_wls_sqrt,
RMSE(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
RMSE(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
RMSE(predict(model_batting_xbh, newdata = test_batting_fixed), test_batting_fixed$TARGET_WINS),
RMSE(predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2), test_batting_fixed_v2$TARGET_WINS),
enet_resample["RMSE"],
RMSE(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
RMSE(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
),
R2 = c(
R2(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
R2(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
R2(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
r2_sqrt,
r2_moderate,
R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
r2_mod_wls_sqrt,
R2(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
R2(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
R2(predict(model_batting_xbh, newdata = test_batting_fixed), test_batting_fixed$TARGET_WINS),
R2(predict(model_batting_xbh_v2, newdata = test_batting_fixed_v2), test_batting_fixed_v2$TARGET_WINS),
enet_resample["Rsquared"],
R2(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
R2(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
)
)
# Add MSE column
model_results$MSE <- model_results$RMSE^2
# Add F-statistic column (NA for models where it doesn’t apply: Elastic Net)
model_results$F_statistic <- c(
summary(model_full)$fstatistic[1],
summary(model_full_log)$fstatistic[1],
summary(model_full_log_wls)$fstatistic[1],
summary(model_full_sqrt)$fstatistic[1],
summary(model_moderate)$fstatistic[1],
summary(model_moderate_sqrt)$fstatistic[1],
summary(wls_model_moderate_sqrt)$fstatistic[1],
summary(model_transformed)$fstatistic[1],
summary(model_batting)$fstatistic[1],
summary(model_batting_xbh)$fstatistic[1],
summary(model_batting_xbh_v2)$fstatistic[1],
NA, # Elastic Net model doesn’t have F-statistic in the same form
summary(model_pitching)$fstatistic[1],
summary(model_conservative)$fstatistic[1]
)
# View the updated table
print(model_results)
## Model RMSE R2
## 1 Model 1: Full 13.71198 0.30634846
## 2 Model 1.1: Full + Log 13.36052 0.34583439
## 3 Model 1.2: Full + Log + WLS 13.37799 0.34612099
## 4 Model 1.3: Full + Sqrt 14.05698 0.27227029
## 5 Model 2: Moderate 14.01101 0.27697537
## 6 Model 2.1: Moderate + Sqrt 14.05698 0.27227029
## 7 Model 2.2: Moderate + Sqrt + WLS 14.09027 0.26975219
## 8 Model 3: Transformed Advantage 14.68434 0.20479368
## 9 Model 4: Batting 13.67704 0.31103447
## 10 Model 4.1: Batting New Variable 13.79523 0.29913008
## 11 Model 4.2: Batting New Variable + Fielding Removal 14.25159 0.25355288
## 12 Model 4.3: Batting (Elastic Net) 13.68699 0.31101590
## 13 Model 5: Pitching 15.67067 0.09612363
## 14 Model 6: Conservative 13.71211 0.30633816
## MSE F_statistic
## 1 188.0183 51.08818
## 2 178.5034 52.42818
## 3 178.9706 49.82345
## 4 197.5987 52.04595
## 5 196.3085 66.37656
## 6 197.5987 65.47077
## 7 198.5357 63.28447
## 8 215.6299 41.09327
## 9 187.0616 69.96969
## 10 190.3083 84.04218
## 11 203.1077 76.23436
## 12 187.3338 NA
## 13 245.5698 33.31725
## 14 188.0219 55.04844
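The comparison is easier to read when sorted, so a small optional step (using dplyr, which is already loaded) is to rank the candidates by test RMSE:
# Rank candidate models by test RMSE (lowest first); purely for readability
model_results %>%
  arrange(RMSE) %>%
  dplyr::select(Model, RMSE, R2, MSE, F_statistic)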
After selecting Model 4.2 (batting variables only, with the combined TEAM_BATTING_XBH variable and TEAM_FIELDING_E removed) as the final model, we applied it to the evaluation dataset to predict the number of team wins. The evaluation data are first checked for missing values and imputed, then transformed to match the predictors used by the model.
data_eval %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 6 × 2
## variable missing_count
## <chr> <int>
## 1 TEAM_BATTING_HBP 240
## 2 TEAM_BASERUN_CS 87
## 3 TEAM_FIELDING_DP 31
## 4 TEAM_BATTING_SO 18
## 5 TEAM_PITCHING_SO 18
## 6 TEAM_BASERUN_SB 13
# Impute and remove variables
data_transform_eval <- data_eval |>
dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
data_transform_eval <- data_transform_eval %>%
mutate(
TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP)
)
# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation (this applies to TEAM_BATTING_SO and TEAM_PITCHING_SO)
data_transform_eval <- data_transform_eval |>
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
# batting only
data_eval_batting_only <- data_transform_eval |>
dplyr::select(starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"),
starts_with("TEAM_FIELDING"))
# Create new batting variable
data_eval2 <- data_eval_batting_only %>%
mutate(TEAM_BATTING_XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) %>%
dplyr::select(-TEAM_BATTING_2B, -TEAM_BATTING_3B, -TEAM_BATTING_HR, -TEAM_FIELDING_E)
# Predict
data_eval2$predicted_wins <- predict(model_batting_xbh_v2, newdata = data_eval2)
hist(data_eval2$predicted_wins,
breaks = 20,
main = "Distribution of Predicted Wins",
xlab = "Predicted Wins",
col = "lightpink",
border = "white")
# Preview predictions
head(data_eval2[, "predicted_wins", drop = FALSE], 4)
## predicted_wins
## 1 67.51041
## 2 69.04266
## 3 75.83950
## 4 85.59009
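Point predictions alone do not convey uncertainty. As a hedged sketch, base R’s predict.lm() can attach 95% prediction intervals to the same evaluation rows; the pred_int object below is illustrative and is not part of the saved output.
# Illustrative sketch: 95% prediction intervals for the evaluation teams
# predict.lm() ignores the extra predicted_wins column already present in data_eval2
pred_int <- predict(model_batting_xbh_v2,
                    newdata = data_eval2,
                    interval = "prediction",
                    level = 0.95)
head(cbind(data_eval2["predicted_wins"], pred_int), 4)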
Save results:
# Save results
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Predictions")
writeData(wb, sheet = "Predictions", x = data_eval2)
saveWorkbook(wb, "BaseballPrediction.xlsx", overwrite = TRUE)