In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_WINS Number of wins
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins
TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins
TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins
TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins
TEAM_BATTING_BB Walks by batters Positive Impact on Wins
TEAM_BATTING_HBP Batters hit by pitch (get a free base) Positive Impact on Wins
TEAM_BATTING_SO Strikeouts by batters Negative Impact on Wins
TEAM_BASERUN_SB Stolen bases Positive Impact on Wins
TEAM_BASERUN_CS Caught stealing Negative Impact on Wins
TEAM_FIELDING_E Errors Negative Impact on Wins
TEAM_FIELDING_DP Double Plays Positive Impact on Wins
TEAM_PITCHING_BB Walks allowed Negative Impact on Wins
TEAM_PITCHING_H Hits allowed Negative Impact on Wins
TEAM_PITCHING_HR Homeruns allowed Negative Impact on Wins
TEAM_PITCHING_SO Strikeouts by pitchers Positive Impact on Wins
#Import Libraries
library(readr) # to uses read_csv function
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
## Warning: package 'e1071' was built under R version 4.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.3.3
## Loaded ROSE 0.0-4
library(smotefamily)
## Warning: package 'smotefamily' was built under R version 4.3.3
library(dplyr)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA621/refs/heads/main/moneyball-training-data.csv")
data_eval <- read.csv("https://raw.githubusercontent.com/nakesha-fray/DATA621/main/moneyball-evaluation-data.csv")
The MoneyBall dataset contains 2,276 observations and 17 variables, each representing the performance of a baseball team in a given year between 1871 and 2006. The goal is to predict the number of team wins (TARGET_WINS) using available data.
The histograms below display the distributions of the numerical features in the dataset. Several variables, such as TEAM_BATTING_2B, TEAM_BATTING_H, and TEAM_FIELDING_DP, appear approximately normally distributed, although TEAM_BATTING_H shows a slight right skew.
Other features exhibit more pronounced skewness and outliers. For example, TEAM_BATTING_BB is slightly left-skewed, while variables like TEAM_BASERUN_CS, TEAM_BASERUN_SB, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO are heavily right-skewed, indicating that a few teams had exceptionally high values in these categories.
Additionally, variables such as TEAM_BATTING_HBP, TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR display bimodal or multimodal distributions, suggesting that team performance in these metrics may differ substantially depending on the year and team.
The target variable TARGET_WINS is continuous and roughly normally distributed, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers, which may affect modeling accuracy.
Several variables contain missing values, the most significant being TEAM_BATTING_HBP (2085), followed by TEAM_BASERUN_CS (772), TEAM_FIELDING_DP (286), and TEAM_BASERUN_SB (131). The statistics for these weren’t tracked in early baseball (1871-1890s), so missing likely means “0”.This makes historical sense - if stolen bases weren’t recorded, it’s because they didn’t track them, not because the stat is unknown.
# Structure of the data
str(data_raw)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
# Glimpse of the data
head(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
# Summary
summary(data_raw)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
# Feature Distribution
data_raw |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
# Distribution of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot() +
aes(x = TARGET_WINS) +
geom_histogram(bins= 40,fill = "blue", color = "black") +
labs(title = "Distribution of TARGET_WINS", y = "Count") +
theme_minimal()
# Count missing values per Variable
data_raw %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 6 × 2
## variable missing_count
## <chr> <int>
## 1 TEAM_BATTING_HBP 2085
## 2 TEAM_BASERUN_CS 772
## 3 TEAM_FIELDING_DP 286
## 4 TEAM_BASERUN_SB 131
## 5 TEAM_BATTING_SO 102
## 6 TEAM_PITCHING_SO 102
The boxplots reveal that TARGET_WINS exhibits no extreme outliers. In contrast, many predictor variables show clear evidence of outliers. Variables such as TEAM_BASERUN_SB, TEAM_BASERUN_BB, TEAM_FIELDING_E, and several pitching and batting statistics have values that extend far beyond the interquartile range, indicating that some teams had exceptionally high or low values in these categories. These outliers could strongly influence regression models if left unaddressed. To mitigate this, we apply various transformations—such as logarithmic and square root—in the modeling phase, with the goal of reducing skewness and the impact of extreme values.
# Boxplot of TARGET_WINS
data_raw |>
dplyr::select(TARGET_WINS) |>
ggplot(aes(x = "", y = TARGET_WINS)) +
geom_boxplot(fill="pink") +
labs(title = "Distribution of TARGET_WINS", y = "Wins") +
theme_minimal()
# Boxplots of all predictors
ggplot(stack(data_raw[,-1]), aes(x = ind, y = values)) +
geom_boxplot(fill = "darkgrey") +
coord_cartesian(ylim = c(0, 1000)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
To prepare the dataset for modeling, we addressed missing values through a combination of variable removal and imputation.
First, we dropped the INDEX variable, as it serves only as a row identifier and holds no predictive value.
Next, we removed the TEAM_BATTING_HBP variable, which contained over 90% missing values. This specific statistic was not recorded in most early baseball seasons, making the variable unsuitable for imputation or inclusion in modeling.
For other variables with moderate missingness — specifically TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, TEAM_BATTING_SO, and TEAM_PITCHING_SO — we applied zero imputation. This decision is informed by historical context: these statistics were often not tracked in earlier years, so the missing values are likely systematic (i.e., due to lack of data collection) rather than random.
Finally, we imputed any remaining missing values across the dataset using the median of each variable to reduce the influence of outliers and skewed distributions.
# Drop the INDEX variable as it has no theoretical effect and it was mentioned to not use it.
# Drop the TEAM_BATTING_HBP variable as its mostly made up of missing values
data_transform <- data_raw |>
# dplyr::select(-INDEX, TEAM_BATTING_HBP)
dplyr::select(-c(INDEX, TEAM_BATTING_HBP))
# These stats: missing likely means "0" (not tracked in early baseball), hence mutating to 0
# TEAM_BATTING_SO missing or 0 assumes not recorded
data_transform <- data_transform %>%
mutate(
# TEAM_BATTING_HBP = ifelse(is.na(TEAM_BATTING_HBP), 0, TEAM_BATTING_HBP),
TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 0, TEAM_BASERUN_SB),
TEAM_BASERUN_CS = ifelse(is.na(TEAM_BASERUN_CS), 0, TEAM_BASERUN_CS),
TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), 0, TEAM_FIELDING_DP),
TEAM_BATTING_SO = ifelse(is.na(TEAM_BATTING_SO), 0, TEAM_BATTING_SO),
TEAM_PITCHING_SO = ifelse(is.na(TEAM_PITCHING_SO), 0, TEAM_PITCHING_SO)
)
# Handle missing values - using median imputation for numerical variables
# Handle missing values with median imputation
data_transform <- data_transform |>
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
str(data_transform)
## 'data.frame': 2276 obs. of 15 variables:
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : num 842 1075 917 922 920 ...
## $ TEAM_BASERUN_SB : num 0 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : num 0 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: num 5456 1082 917 928 920 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: num 0 155 153 156 168 149 186 136 169 159 ...
To evaluate the suitability of linear regression, we plotted the linear relationships between each predictor and the target variable (TARGET_WINS) using scatterplots and Pearson correlation coefficients.
TEAM_BATTING_H (r = 0.389) shows the strongest positive correlation with TARGET_WINS, suggesting that teams with more base hits generally win more games — as expected. This is followed by TEAM_BATTING_2B, which also demonstrates a moderately positive relationship, aligning with baseball intuition.
Interestingly, TEAM_BATTING_SO (strikeouts by batters) shows an almost negligible correlation (r = -0.030) with TARGET_WINS. This is somewhat surprising, as one might expect higher strikeout counts to hurt performance. The lack of relationship suggests that this variable is not a strong independent predictor.
Another notable result is TEAM_PITCHING_SO (strikeouts by pitchers), which shows a weak negative correlation with wins (r = -0.075). This result could indicate multicollinearity — for example, if teams that accumulate many strikeouts also give up more hits or walks — or it may suggest data quality issues in historical records.
# Get all predictor variables (excluding TARGET_WINS)
predictors <- names(data_transform)[names(data_transform) != "TARGET_WINS"]
# Create scatterplots for all predictors vs TARGET_WINS
# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))
# Plot each predictor vs TARGET_WINS
for(i in 1:length(predictors)) {
predictor <- predictors[i]
correlation <- cor(data_transform[[predictor]], data_transform$TARGET_WINS)
plot(data_transform[[predictor]], data_transform$TARGET_WINS,
xlab = predictor,
ylab = "TARGET_WINS",
main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
pch = 16,
col = ifelse(correlation > 0, "darkblue", "darkred"),
cex = 0.6)
# Add regression line
abline(lm(data_transform$TARGET_WINS ~ data_transform[[predictor]]),
col = "red", lwd = 2)
# Add correlation text
text(x = quantile(data_transform[[predictor]], 0.05),
y = quantile(data_transform$TARGET_WINS, 0.95),
labels = paste("r =", round(correlation, 3)),
cex = 0.8, col = "red")
}
# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))
To further assess linear relationships and potential multicollinearity among variables, we examined the correlation matrix of all numeric features using a corrplot and targeted pairwise comparisons.
Several batting and pitching variables are highly correlated with one another: TEAM_BATTING_HR is highly correlated with TEAM_PITCHING_HR, TEAM_BATTING_H is highly correlated with TEAM_PITCHING_H, TEAM_BATTING_BB is highly correlated with TEAM_PITCHING_BB, TEAM_BATTING_SO is highly correlated with TEAM_PITCHING_SO, TEAM_FIELDING_E is highly correlated (negative) with TEAM_FIELDING_DP
These relationships suggest possible multicollinearity, which could impact model stability and interpretability.
We also ranked features by their correlation with TARGET_WINS, identifying the top 5 positive and negative predictors:
Top positively correlated features include:TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_FIELDING_BB, TEAM_PITCHING_HR, and TEAM_BATTING_HR.
Top negatively correlated features include: TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_BATTING_SO.
# Correlation matrix
cor_matrix <- cor(data_transform, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# Extract correlations between batting and pitching variables
cat("=== MULTICOLLINEARITY ANALYSIS ===\n")
## === MULTICOLLINEARITY ANALYSIS ===
# Batting vs Pitching
multicollin_pairs <- data.frame(
Batting_Var = c("TEAM_BATTING_HR", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_BATTING_SO"),
Pitching_Var = c("TEAM_PITCHING_HR", "TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO"),
Correlation = c(
cor(data_transform$TEAM_BATTING_HR, data_transform$TEAM_PITCHING_HR),
cor(data_transform$TEAM_BATTING_H, data_transform$TEAM_PITCHING_H),
cor(data_transform$TEAM_BATTING_BB, data_transform$TEAM_PITCHING_BB),
cor(data_transform$TEAM_BATTING_SO, data_transform$TEAM_PITCHING_SO)
)
)
print(multicollin_pairs)
## Batting_Var Pitching_Var Correlation
## 1 TEAM_BATTING_HR TEAM_PITCHING_HR 0.9693714
## 2 TEAM_BATTING_H TEAM_PITCHING_H 0.3026937
## 3 TEAM_BATTING_BB TEAM_PITCHING_BB 0.4893613
## 4 TEAM_BATTING_SO TEAM_PITCHING_SO 0.4952422
# Fielding
print(cor(data_transform$TEAM_FIELDING_DP, data_transform$TEAM_FIELDING_E))
## [1] -0.7569313
# Get correlations with TARGET_WINS
target_wins_correlation <- cor_matrix[, "TARGET_WINS"]
# Remove TARGET_WINS itself (correlation = 1)
target_wins_correlation <- target_wins_correlation [names(target_wins_correlation ) != "TARGET_WINS"]
# Top 5 Positive correlations
top5_pos <- sort(target_wins_correlation , decreasing = TRUE) %>%
head(5)
# Top 5 Negative correlations
top5_neg <- sort(target_wins_correlation , decreasing = FALSE) %>%
head(5)
# Combine into one dataframe
combined_df <- data.frame(
Feature = c(names(top5_pos), names(top5_neg)),
Correlation = c(top5_pos, top5_neg)
)
# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Positively and Negatively Correlated with target_wins",
x = "Feature",
y = "Correlation with target_wins",
fill = "Direction") +
theme_minimal()
# The `findCorrelation` function is specifically designed for this
high_corr_list <- findCorrelation(cor_matrix, cutoff = 0.4, verbose = FALSE)
names(data_transform)[high_corr_list]
## [1] "TEAM_BATTING_HR" "TEAM_PITCHING_HR" "TEAM_BATTING_SO" "TEAM_FIELDING_E"
## [5] "TEAM_BATTING_3B" "TEAM_FIELDING_DP" "TEAM_BATTING_BB" "TEAM_BATTING_2B"
## [9] "TEAM_PITCHING_SO"
To address multicollinearity and explore different modeling strategies, several derived datasets were created using feature engineering and variable selection:
Dataset 1: Conservative Reduction
We removed TEAM_PITCHING_HR, which was highly correlated with TEAM_BATTING_HR.
Dataset 2: Moderate Reduction
This version takes a more aggressive stance on multicollinearity. We removed TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_BATTING_SO. These variables showed correlations > 0.4 with other predictors, and their removal helps reduce noise and potential variance.
Dataset 3: Transformed Variables (Net Advantage)
Rather than eliminating variables, this version transforms them into net advantage metrics:
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
Dataset 4: Batting-Focused
This subset includes only batting, base running, and fielding statistics — excluding all pitching variables.
Dataset 5: Pitching-Focused
In contrast, this version includes only pitching, base running, and fielding stats — omitting batting features entirely.
# Dataset 1: Conservative - Remove one variable from each highly correlated pair (TEAM_BATTING_HR, TEAM_PITCHING_HR)
data_reduced <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR)
# Dataset 2: Moderate - Remove FE, HR and BB and SO pitching (absolute value correlations > 0.4)
data_moderate <- data_transform %>%
dplyr::select(-TEAM_PITCHING_HR, -TEAM_PITCHING_BB, -TEAM_PITCHING_SO, -TEAM_BATTING_SO, -TEAM_FIELDING_E)
# Dataset 3: Transformed - Create net advantage variables
data_transformed <- data_transform %>%
mutate(
NET_HR_ADVANTAGE = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_HITS_ADVANTAGE = TEAM_BATTING_H - TEAM_PITCHING_H,
NET_WALKS_ADVANTAGE = TEAM_BATTING_BB - TEAM_PITCHING_BB,
NET_SO_ADVANTAGE = TEAM_PITCHING_SO - TEAM_BATTING_SO
) %>%
# dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
# TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_FIELDING_E, TEAM_FIELDING_DP,
# NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
dplyr::select(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BASERUN_SB,
TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP,
NET_HR_ADVANTAGE, NET_HITS_ADVANTAGE, NET_WALKS_ADVANTAGE, NET_SO_ADVANTAGE)
# Dataset 4 &5 : Keep only batting OR pitching variables (separate models)
data_batting_only <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_BATTING"), starts_with("TEAM_BASERUN"),
starts_with("TEAM_FIELDING"))
data_pitching_focused <- data_transform |>
dplyr::select(TARGET_WINS, starts_with("TEAM_PITCHING"), starts_with("TEAM_FIELDING"),
TEAM_BASERUN_SB, TEAM_BASERUN_CS) # Keep base running as it's not correlated
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
This was done for all engineered versions of the dataset:
train_full / test_full — Full feature set (after cleaning) train_conservative / test_conservative — Reduced set with one variable removed per collinear pair train_moderate / test_moderate — More aggressively reduced set train_transformed / test_transformed — Feature-engineered net-advantage variables train_batting / test_batting — Batting-only feature set train_pitching / test_pitching — Pitching-focused feature set
Each of these splits will be used in the next phase to build and evaluate multiple linear regression models, comparing their performance on both the training and testing sets.
# Split data for validation
set.seed(123)
#n <- nrow(data_transform)
#train_idx <- sample(n, 0.8 * n)
train_idx <- createDataPartition(y = data_transform$TARGET_WINS, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train_full <- data_transform[train_idx, ]
test_full <- data_transform[-train_idx, ]
train_conservative <- data_reduced[train_idx, ]
test_conservative <- data_reduced[-train_idx, ]
train_moderate <- data_moderate[train_idx, ]
test_moderate <- data_moderate[-train_idx, ]
train_transformed <- data_transformed[train_idx, ]
test_transformed <- data_transformed[-train_idx, ]
train_batting <- data_batting_only[train_idx, ]
test_batting <- data_batting_only[-train_idx, ]
train_pitching <- data_pitching_focused[train_idx, ]
test_pitching<- data_pitching_focused[-train_idx, ]
Let’s first start out with a model that includes everything:
# Model 1: Full model (all data including multicollinearity predictors)
model_full <- lm(TARGET_WINS ~ ., data = train_full)
summary(model_full)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.690 -8.522 0.254 8.868 55.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.5795271 5.1389680 5.951 3.20e-09 ***
## TEAM_BATTING_H 0.0430474 0.0039130 11.001 < 2e-16 ***
## TEAM_BATTING_2B -0.0119177 0.0104750 -1.138 0.25538
## TEAM_BATTING_3B 0.0506945 0.0189857 2.670 0.00765 **
## TEAM_BATTING_HR 0.0681571 0.0310559 2.195 0.02832 *
## TEAM_BATTING_BB 0.0029672 0.0062916 0.472 0.63725
## TEAM_BATTING_SO -0.0135485 0.0022790 -5.945 3.31e-09 ***
## TEAM_BASERUN_SB 0.0058131 0.0046004 1.264 0.20653
## TEAM_BASERUN_CS -0.0006132 0.0120640 -0.051 0.95947
## TEAM_PITCHING_H -0.0011992 0.0004539 -2.642 0.00831 **
## TEAM_PITCHING_HR 0.0019021 0.0276842 0.069 0.94523
## TEAM_PITCHING_BB 0.0032131 0.0044546 0.721 0.47081
## TEAM_PITCHING_SO 0.0025115 0.0010104 2.486 0.01302 *
## TEAM_FIELDING_E -0.0270029 0.0034348 -7.862 6.47e-15 ***
## TEAM_FIELDING_DP -0.0591386 0.0113452 -5.213 2.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.12 on 1807 degrees of freedom
## Multiple R-squared: 0.2953, Adjusted R-squared: 0.2898
## F-statistic: 54.08 on 14 and 1807 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full))
# n = # of observations
n <- nrow(train_full)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full)
##
## studentized Breusch-Pagan test
##
## data: model_full
## BP = 259.6, df = 14, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.266195 2.518246 2.987128 37.132896
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 6.171798 4.418444 1.873546 1.499667
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 3.948045 30.165917 6.101638 3.903397
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 6.216632 4.030364
######
preds <- predict(model_full, newdata = test_full)
RMSE(preds, test_full$TARGET_WINS)
## [1] 13.81744
R2(preds, test_full$TARGET_WINS)
## [1] 0.2950739
Model 1 validity check (linear regression assumptions):
Other notes:
Next step: Try a log transformation on the skewed predictors
# Model 1.1: Full model with Log transformation of the skewed predictors
train_full_log <- train_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
test_full_log <- test_full %>%
mutate(
log_BASERUN_CS = log(TEAM_BASERUN_CS + 1),
log_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_SO = log(TEAM_PITCHING_SO + 1),
)
model_full_log <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log)
summary(model_full_log)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E +
## TEAM_FIELDING_DP + log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB +
## log_TEAM_PITCHING_SO, data = train_full_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.543 -8.102 0.097 8.322 75.073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.870093 13.622878 7.258 5.82e-13 ***
## TEAM_BATTING_H 0.051890 0.003904 13.292 < 2e-16 ***
## TEAM_BATTING_2B -0.025986 0.010062 -2.583 0.009885 **
## log_TEAM_BATTING_3B 5.652896 1.048372 5.392 7.88e-08 ***
## TEAM_BATTING_HR 0.010862 0.009482 1.146 0.252151
## TEAM_BATTING_BB 0.016636 0.005009 3.321 0.000915 ***
## log_TEAM_BASERUN_SB -2.403681 0.330267 -7.278 5.03e-13 ***
## TEAM_BASERUN_CS -0.003933 0.011857 -0.332 0.740174
## log_TEAM_FIELDING_E -15.874758 1.484497 -10.694 < 2e-16 ***
## TEAM_FIELDING_DP -0.097354 0.010245 -9.502 < 2e-16 ***
## log_TEAM_PITCHING_H -3.505929 2.039026 -1.719 0.085710 .
## log_TEAM_PITCHING_BB 3.896707 1.713334 2.274 0.023062 *
## log_TEAM_PITCHING_SO -1.790880 0.241812 -7.406 1.98e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.99 on 1809 degrees of freedom
## Multiple R-squared: 0.3092, Adjusted R-squared: 0.3046
## F-statistic: 67.47 on 12 and 1809 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_full_log)
##
## studentized Breusch-Pagan test
##
## data: model_full_log
## BP = 344.74, df = 12, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_log)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 3.319843 2.372920 2.907026
## TEAM_BATTING_HR TEAM_BATTING_BB log_TEAM_BASERUN_SB
## 3.535063 3.995746 1.795107
## TEAM_BASERUN_CS log_TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.479500 8.739550 3.356650
## log_TEAM_PITCHING_H log_TEAM_PITCHING_BB log_TEAM_PITCHING_SO
## 4.091596 2.601571 1.405713
######
preds <- predict(model_full_log, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.47148
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.3317772
Model 1.1 validity check (linear regression assumptions):
Other notes:
Next step: let’s try a log transformation on the skewed predictors, then a WLS to address nonconstant variance
# Model 1.2: Full model with Log transformation of the skewed predictors + WLS
# define weights to use
wt <- 1 / lm(abs(model_full_log$residuals) ~ model_full_log$fitted.values)$fitted.values^2
# perform weighted least squares regression
# Fit model
model_full_log_wls <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + log_TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB + log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E + TEAM_FIELDING_DP +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + log_TEAM_PITCHING_SO, data = train_full_log, weights=wt)
#view summary of model
summary(model_full_log_wls)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## log_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## log_TEAM_BASERUN_SB + TEAM_BASERUN_CS + log_TEAM_FIELDING_E +
## TEAM_FIELDING_DP + log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB +
## log_TEAM_PITCHING_SO, data = train_full_log, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.5198 -0.7978 0.0165 0.8211 5.2462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.900031 13.897883 6.541 7.95e-11 ***
## TEAM_BATTING_H 0.048887 0.004012 12.184 < 2e-16 ***
## TEAM_BATTING_2B -0.030259 0.009865 -3.067 0.00219 **
## log_TEAM_BATTING_3B 6.846195 1.059650 6.461 1.34e-10 ***
## TEAM_BATTING_HR 0.025509 0.009267 2.753 0.00597 **
## TEAM_BATTING_BB 0.015221 0.006320 2.408 0.01612 *
## log_TEAM_BASERUN_SB -2.458922 0.340044 -7.231 7.04e-13 ***
## TEAM_BASERUN_CS 0.004310 0.011934 0.361 0.71801
## log_TEAM_FIELDING_E -15.235592 1.466696 -10.388 < 2e-16 ***
## TEAM_FIELDING_DP -0.091278 0.009992 -9.136 < 2e-16 ***
## log_TEAM_PITCHING_H -2.581261 2.614669 -0.987 0.32367
## log_TEAM_PITCHING_BB 3.668685 2.594108 1.414 0.15746
## log_TEAM_PITCHING_SO -2.043838 0.243612 -8.390 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.278 on 1809 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.2987
## F-statistic: 65.62 on 12 and 1809 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_log_wls, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_log_wls)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_log_wls))
# n = # of observations
n <- nrow(train_full_log)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log_wls)
# Perform Breusch-Pagan Test
bptest(model_full_log_wls)
##
## studentized Breusch-Pagan test
##
## data: model_full_log_wls
## BP = 6.0351, df = 12, p-value = 0.9143
# Variance Inflation Factor
vif(model_full_log_wls)
## TEAM_BATTING_H TEAM_BATTING_2B log_TEAM_BATTING_3B
## 4.397868 2.411497 3.068544
## TEAM_BATTING_HR TEAM_BATTING_BB log_TEAM_BASERUN_SB
## 3.535661 6.323528 1.774496
## TEAM_BASERUN_CS log_TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.492232 8.555100 3.436528
## log_TEAM_PITCHING_H log_TEAM_PITCHING_BB log_TEAM_PITCHING_SO
## 6.506746 4.628415 1.450357
######
preds <- predict(model_full_log_wls, newdata = test_full_log)
RMSE(preds, test_full_log$TARGET_WINS)
## [1] 13.55652
R2(preds, test_full_log$TARGET_WINS)
## [1] 0.3232014
Model 1.2 validity check (linear regression assumptions):
Other notes:
Next step: let’s try a square root transformation of the skewed predictors
# Model 1.3: Full model with square root transformation of the skewed predictors
# Transform skewed predictors via sqrt
train_full_sqrt <- train_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
# … you can include more if you have other skewed predictors …
)
test_full_sqrt <- test_full %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E + 1),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
sqrt_TEAM_PITCHING_SO = sqrt(TEAM_PITCHING_SO + 1)
)
# Fit model
model_full_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + sqrt_TEAM_PITCHING_SO,
data = train_full_sqrt)
summary(model_full_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + sqrt_TEAM_FIELDING_E +
## TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB +
## sqrt_TEAM_PITCHING_SO, data = train_full_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.565 -8.741 0.118 8.600 59.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.227357 5.279731 7.051 2.52e-12 ***
## TEAM_BATTING_H 0.044714 0.003869 11.558 < 2e-16 ***
## TEAM_BATTING_2B -0.016441 0.010338 -1.590 0.111912
## sqrt_TEAM_BATTING_3B 1.400029 0.302612 4.626 3.98e-06 ***
## TEAM_BATTING_HR 0.037551 0.009679 3.880 0.000108 ***
## TEAM_BATTING_BB -0.005708 0.006589 -0.866 0.386412
## sqrt_TEAM_BASERUN_SB -0.218281 0.099544 -2.193 0.028448 *
## TEAM_BASERUN_CS -0.011769 0.012138 -0.970 0.332374
## sqrt_TEAM_FIELDING_E -1.502950 0.162193 -9.266 < 2e-16 ***
## TEAM_FIELDING_DP -0.084372 0.011440 -7.375 2.48e-13 ***
## sqrt_TEAM_PITCHING_H -0.106645 0.075170 -1.419 0.156154
## sqrt_TEAM_PITCHING_BB 0.723495 0.226509 3.194 0.001427 **
## sqrt_TEAM_PITCHING_SO -0.231261 0.056310 -4.107 4.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.18 on 1809 degrees of freedom
## Multiple R-squared: 0.2883, Adjusted R-squared: 0.2836
## F-statistic: 61.07 on 12 and 1809 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_full_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_full_sqrt))
# n = # of observations
n <- nrow(train_full_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_sqrt)
# Perform Breusch-Pagan Test
bptest(model_full_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_full_sqrt
## BP = 312.21, df = 12, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_full_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.164723 2.431249 3.041022
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 3.575573 6.710071 1.739953
## TEAM_BASERUN_CS sqrt_TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.504790 8.178851 4.062051
## sqrt_TEAM_PITCHING_H sqrt_TEAM_PITCHING_BB sqrt_TEAM_PITCHING_SO
## 5.038932 5.006073 2.136819
######
# Predict on test set
preds_sqrt <- predict(model_full_sqrt, newdata = test_full_sqrt)
rmse_sqrt <- RMSE(preds_sqrt, test_full_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_sqrt, test_full_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 13.6536
r2_sqrt
## [1] 0.3144132
Model 1.3 validity check (linear regression assumptions):
Other notes:
Best model 1 version:
Next step: let’s address multicollinearity by using the
train_moderate
dataset (removed highly correlated
predictors)
# Model 2: Moderate dataset model (removed highly correlated predictors)
# Fit model
model_moderate <- lm(TARGET_WINS ~ ., data = train_moderate)
#view summary of model
summary(model_moderate)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_moderate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.909 -8.995 0.495 8.905 76.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0807986 3.9865298 1.274 0.20265
## TEAM_BATTING_H 0.0427033 0.0035658 11.976 < 2e-16 ***
## TEAM_BATTING_2B -0.0058880 0.0101635 -0.579 0.56244
## TEAM_BATTING_3B 0.0544292 0.0191381 2.844 0.00450 **
## TEAM_BATTING_HR 0.0471021 0.0085870 5.485 4.71e-08 ***
## TEAM_BATTING_BB 0.0173697 0.0036085 4.814 1.61e-06 ***
## TEAM_BASERUN_SB 0.0126187 0.0045981 2.744 0.00612 **
## TEAM_BASERUN_CS -0.0133792 0.0118269 -1.131 0.25810
## TEAM_PITCHING_H -0.0017073 0.0002894 -5.900 4.33e-09 ***
## TEAM_FIELDING_DP -0.0004111 0.0093700 -0.044 0.96501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.54 on 1812 degrees of freedom
## Multiple R-squared: 0.2479, Adjusted R-squared: 0.2442
## F-statistic: 66.38 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_moderate)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate))
# n = # of observations
n <- nrow(train_moderate)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
anova(model_moderate)
## Analysis of Variance Table
##
## Response: TARGET_WINS
## Df Sum Sq Mean Sq F value Pr(>F)
## TEAM_BATTING_H 1 63883 63883 348.5795 < 2.2e-16 ***
## TEAM_BATTING_2B 1 3416 3416 18.6403 1.664e-05 ***
## TEAM_BATTING_3B 1 1 1 0.0045 0.946460
## TEAM_BATTING_HR 1 18718 18718 102.1338 < 2.2e-16 ***
## TEAM_BATTING_BB 1 14674 14674 80.0688 < 2.2e-16 ***
## TEAM_BASERUN_SB 1 1828 1828 9.9768 0.001611 **
## TEAM_BASERUN_CS 1 168 168 0.9144 0.339072
## TEAM_PITCHING_H 1 6793 6793 37.0689 1.389e-09 ***
## TEAM_FIELDING_DP 1 0 0 0.0019 0.965006
## Residuals 1812 332078 183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform Breusch-Pagan Test
bptest(model_moderate)
##
## studentized Breusch-Pagan test
##
## data: model_moderate
## BP = 269.1, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate <- predict(model_moderate, newdata = test_moderate)
rmse_moderate <- RMSE(preds_moderate, test_moderate$TARGET_WINS)
r2_moderate <- R2(preds_moderate, test_moderate$TARGET_WINS)
rmse_moderate
## [1] 14.01101
r2_moderate
## [1] 0.2769754
Model 2 validity check (linear regression assumptions):
Other notes:
Next step: let’s address the validity check issues by transforming the skewed predictors
# Model 2: Moderate dataset model + transformation of skewed predictors
# Transform skewed predictors via sqrt
train_moderate_sqrt <- train_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
# … you can include more if you have other skewed predictors …
)
test_moderate_sqrt <- test_moderate %>%
mutate(
sqrt_BASERUN_CS = sqrt(TEAM_BASERUN_CS + 1),
sqrt_TEAM_BASERUN_SB = sqrt(TEAM_BASERUN_SB + 1),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B + 1),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H + 1),
)
# Fit model
model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP + sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt)
summary(model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.848 -8.959 0.565 9.014 78.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.607803 4.249041 1.790 0.07354 .
## TEAM_BATTING_H 0.045072 0.003747 12.029 < 2e-16 ***
## TEAM_BATTING_2B -0.007240 0.010193 -0.710 0.47766
## sqrt_TEAM_BATTING_3B 0.962061 0.302455 3.181 0.00149 **
## TEAM_BATTING_HR 0.047035 0.008912 5.278 1.46e-07 ***
## TEAM_BATTING_BB 0.018099 0.003701 4.890 1.10e-06 ***
## sqrt_TEAM_BASERUN_SB 0.107730 0.096868 1.112 0.26623
## TEAM_BASERUN_CS -0.011914 0.011972 -0.995 0.31982
## TEAM_FIELDING_DP -0.011846 0.008987 -1.318 0.18762
## sqrt_TEAM_PITCHING_H -0.273777 0.048625 -5.630 2.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.56 on 1812 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2416
## F-statistic: 65.47 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_full_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: model_moderate_sqrt
## BP = 272.64, df = 9, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_moderate)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2.548488 2.227575 2.852050 2.667503
## TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 1.907658 1.758721 1.354279 1.508270
## TEAM_FIELDING_DP
## 2.583185
######
# Predict on test set
preds_moderate_sqrt <- predict(model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_sqrt <- RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_sqrt <- R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_sqrt
## [1] 14.05698
r2_sqrt
## [1] 0.2722703
Model 2.1 validity check (linear regression assumptions):
Other notes:
Next step: let’s address nonconstant variance by transforming the skewed predictors and using Weighted Least Squares
# Model 2.2: Moderate dataset model + transformation of skewed predictors + WLS
# define weights to use
wt <- 1 / lm(abs(model_moderate_sqrt$residuals) ~ model_moderate_sqrt$fitted.values)$fitted.values^2
# perform weighted least squares regression
wls_model_moderate_sqrt <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_2B + sqrt_TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + sqrt_TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_FIELDING_DP +
sqrt_TEAM_PITCHING_H,
data = train_moderate_sqrt, weights=wt)
#view summary of model
summary(wls_model_moderate_sqrt)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## sqrt_TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB +
## sqrt_TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP +
## sqrt_TEAM_PITCHING_H, data = train_moderate_sqrt, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4.7679 -0.8433 0.0520 0.8563 5.0569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.431146 4.039826 2.087 0.037027 *
## TEAM_BATTING_H 0.045416 0.003807 11.928 < 2e-16 ***
## TEAM_BATTING_2B -0.016131 0.010108 -1.596 0.110706
## sqrt_TEAM_BATTING_3B 0.984533 0.297333 3.311 0.000947 ***
## TEAM_BATTING_HR 0.051740 0.008678 5.962 2.99e-09 ***
## TEAM_BATTING_BB 0.015460 0.003615 4.277 1.99e-05 ***
## sqrt_TEAM_BASERUN_SB 0.094450 0.097192 0.972 0.331282
## TEAM_BASERUN_CS -0.008323 0.011978 -0.695 0.487205
## TEAM_FIELDING_DP -0.006103 0.008862 -0.689 0.491124
## sqrt_TEAM_PITCHING_H -0.254253 0.057049 -4.457 8.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.276 on 1812 degrees of freedom
## Multiple R-squared: 0.2392, Adjusted R-squared: 0.2354
## F-statistic: 63.28 on 9 and 1812 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = wls_model_moderate_sqrt, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(wls_model_moderate_sqrt)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(wls_model_moderate_sqrt))
# n = # of observations
n <- nrow(train_moderate_sqrt)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(wls_model_moderate_sqrt)
# Perform Breusch-Pagan Test
bptest(wls_model_moderate_sqrt)
##
## studentized Breusch-Pagan test
##
## data: wls_model_moderate_sqrt
## BP = 5.3621, df = 9, p-value = 0.8017
# Variance Inflation Factor
vif(wls_model_moderate_sqrt)
## TEAM_BATTING_H TEAM_BATTING_2B sqrt_TEAM_BATTING_3B
## 3.654325 2.307732 3.129800
## TEAM_BATTING_HR TEAM_BATTING_BB sqrt_TEAM_BASERUN_SB
## 2.849087 1.860144 1.637307
## TEAM_BASERUN_CS TEAM_FIELDING_DP sqrt_TEAM_PITCHING_H
## 1.386230 2.543128 2.415067
######
# Predict on test set
preds_mod_sqrt <- predict(wls_model_moderate_sqrt, newdata = test_moderate_sqrt)
rmse_mod_wls_sqrt <- RMSE(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
r2_mod_wls_sqrt <- R2(preds_mod_sqrt, test_moderate_sqrt$TARGET_WINS)
rmse_mod_wls_sqrt
## [1] 14.09027
r2_mod_wls_sqrt
## [1] 0.2697522
Model 2.2 validity check (linear regression assumptions):
Other notes:
# Model 3: Transformed data (net advantage variables)
model_transformed <- lm(TARGET_WINS ~ ., data = train_transformed)
summary(model_transformed)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.181 -9.836 0.120 9.254 50.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.8307062 3.1039480 21.209 < 2e-16 ***
## TEAM_BATTING_2B 0.0881661 0.0076316 11.553 < 2e-16 ***
## TEAM_BATTING_3B 0.1179503 0.0159613 7.390 2.23e-13 ***
## TEAM_BASERUN_SB 0.0068615 0.0047736 1.437 0.1508
## TEAM_BASERUN_CS -0.0183701 0.0123926 -1.482 0.1384
## TEAM_FIELDING_E -0.0283239 0.0033600 -8.430 < 2e-16 ***
## TEAM_FIELDING_DP -0.0503107 0.0120314 -4.182 3.03e-05 ***
## NET_HR_ADVANTAGE -0.0568972 0.0289452 -1.966 0.0495 *
## NET_HITS_ADVANTAGE 0.0003005 0.0004777 0.629 0.5294
## NET_WALKS_ADVANTAGE -0.0046571 0.0046944 -0.992 0.3213
## NET_SO_ADVANTAGE -0.0010532 0.0010548 -0.999 0.3182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.1 on 1811 degrees of freedom
## Multiple R-squared: 0.185, Adjusted R-squared: 0.1805
## F-statistic: 41.12 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_transformed, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_transformed)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_transformed))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_transformed)
##
## studentized Breusch-Pagan test
##
## data: model_transformed
## BP = 234.15, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_transformed)
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1.158376 1.829650 1.748234 1.371398
## TEAM_FIELDING_E TEAM_FIELDING_DP NET_HR_ADVANTAGE NET_HITS_ADVANTAGE
## 5.155289 3.928120 1.704036 3.590426
## NET_WALKS_ADVANTAGE NET_SO_ADVANTAGE
## 4.945062 2.896198
######
preds <- predict(model_transformed, newdata = test_transformed)
RMSE(preds, test_transformed$TARGET_WINS)
## [1] 14.6851
R2(preds, test_transformed$TARGET_WINS)
## [1] 0.2047218
Model 3 validity check (linear regression assumptions):
Other notes:
# Model 4 (batting variables only)
model_batting <- lm(TARGET_WINS ~ ., data = train_batting)
summary(model_batting)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_batting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.421 -8.402 0.373 8.748 58.784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.431575 4.965568 6.733 2.23e-11 ***
## TEAM_BATTING_H 0.039843 0.003821 10.428 < 2e-16 ***
## TEAM_BATTING_2B -0.007248 0.010340 -0.701 0.48343
## TEAM_BATTING_3B 0.058620 0.018496 3.169 0.00155 **
## TEAM_BATTING_HR 0.063377 0.010237 6.191 7.39e-10 ***
## TEAM_BATTING_BB 0.006910 0.003678 1.879 0.06042 .
## TEAM_BATTING_SO -0.009689 0.001948 -4.974 7.19e-07 ***
## TEAM_BASERUN_SB 0.004604 0.004567 1.008 0.31359
## TEAM_BASERUN_CS -0.006115 0.011971 -0.511 0.60951
## TEAM_FIELDING_E -0.030461 0.002789 -10.922 < 2e-16 ***
## TEAM_FIELDING_DP -0.066330 0.011165 -5.941 3.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.17 on 1811 degrees of freedom
## Multiple R-squared: 0.2884, Adjusted R-squared: 0.2845
## F-statistic: 73.4 on 10 and 1811 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_batting, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_transformed)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_batting))
# n = # of observations
n <- nrow(train_batting)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_batting)
##
## studentized Breusch-Pagan test
##
## data: model_batting
## BP = 258.36, df = 10, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_batting)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.090758 2.435330 2.813750 4.004748
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2.093224 3.204192 1.832782 1.465501
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 4.067925 3.874389
######
preds <- predict(model_batting, newdata = test_batting)
RMSE(preds, test_batting$TARGET_WINS)
## [1] 13.7715
R2(preds, test_batting$TARGET_WINS)
## [1] 0.30035
# Model 4.1 (batting variables only) Elastic Net
# Elastic Net Regression
set.seed(786)
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)
#pre process
pre_process <- c("center","scale")
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(TARGET_WINS ~ ., data = train_batting,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = pre_process)
# Test predictors and response
testX <- test_batting %>% dplyr::select(-TARGET_WINS)
testY <- test_batting$TARGET_WINS
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_model
## Elasticnet
##
## 1822 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1457, 1459, 1457, 1458, 1457
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 15.11660 0.1462023 11.85688
## 0.00 0.10 14.77552 0.1693321 11.58652
## 0.00 0.15 14.48511 0.1964809 11.37373
## 0.00 0.20 14.23495 0.2152453 11.20424
## 0.00 0.25 14.02034 0.2262527 11.05617
## 0.00 0.30 13.84813 0.2323982 10.93107
## 0.00 0.35 13.71765 0.2391241 10.83505
## 0.00 0.40 13.63134 0.2431921 10.76693
## 0.00 0.45 13.57522 0.2468239 10.71146
## 0.00 0.50 13.52140 0.2517718 10.66110
## 0.00 0.55 13.46687 0.2572057 10.61409
## 0.00 0.60 13.41047 0.2629063 10.56656
## 0.00 0.65 13.36064 0.2676903 10.52504
## 0.00 0.70 13.31746 0.2716252 10.48747
## 0.00 0.75 13.28100 0.2747834 10.45551
## 0.00 0.80 13.25307 0.2770476 10.42973
## 0.00 0.85 13.23344 0.2785247 10.40983
## 0.00 0.90 13.22041 0.2794839 10.39616
## 0.00 0.95 13.21376 0.2800335 10.38778
## 0.00 1.00 13.21325 0.2801650 10.38311
## 0.01 0.05 15.12632 0.1462023 11.86470
## 0.01 0.10 14.79147 0.1672191 11.59853
## 0.01 0.15 14.50468 0.1948826 11.38754
## 0.01 0.20 14.25854 0.2136086 11.21962
## 0.01 0.25 14.04484 0.2251521 11.07331
## 0.01 0.30 13.87088 0.2313519 10.94667
## 0.01 0.35 13.73914 0.2378597 10.85025
## 0.01 0.40 13.64661 0.2424108 10.77878
## 0.01 0.45 13.58787 0.2458208 10.72150
## 0.01 0.50 13.53528 0.2504010 10.67004
## 0.01 0.55 13.48144 0.2556951 10.62293
## 0.01 0.60 13.42507 0.2614221 10.57553
## 0.01 0.65 13.37461 0.2663131 10.53323
## 0.01 0.70 13.33046 0.2703870 10.49532
## 0.01 0.75 13.29268 0.2737109 10.46208
## 0.01 0.80 13.26137 0.2763500 10.43404
## 0.01 0.85 13.23959 0.2780486 10.41194
## 0.01 0.90 13.22402 0.2792326 10.39587
## 0.01 0.95 13.21607 0.2798368 10.38747
## 0.01 1.00 13.21348 0.2801348 10.38133
## 0.10 0.05 15.17135 0.1462023 11.90093
## 0.10 0.10 14.86332 0.1575644 11.65326
## 0.10 0.15 14.59599 0.1871904 11.45272
## 0.10 0.20 14.36992 0.2047548 11.29213
## 0.10 0.25 14.16348 0.2189006 11.15613
## 0.10 0.30 13.98745 0.2270224 11.03189
## 0.10 0.35 13.84477 0.2321631 10.92584
## 0.10 0.40 13.73567 0.2374867 10.84510
## 0.10 0.45 13.66066 0.2416122 10.78264
## 0.10 0.50 13.60013 0.2457268 10.72409
## 0.10 0.55 13.54813 0.2490694 10.66892
## 0.10 0.60 13.50169 0.2530986 10.62254
## 0.10 0.65 13.45498 0.2575167 10.58096
## 0.10 0.70 13.41150 0.2616107 10.54383
## 0.10 0.75 13.37302 0.2651171 10.50925
## 0.10 0.80 13.33949 0.2681004 10.48044
## 0.10 0.85 13.31086 0.2706215 10.45644
## 0.10 0.90 13.28736 0.2726941 10.43577
## 0.10 0.95 13.26903 0.2743636 10.41721
## 0.10 1.00 13.25643 0.2756133 10.40289
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.
enet_resample
## RMSE Rsquared MAE
## 13.77150 0.30035 10.50955
# Model 5: (pitching variables only)
model_pitching <- lm(TARGET_WINS ~ ., data = train_pitching)
summary(model_pitching)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_pitching)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.816 -9.648 0.389 9.423 65.280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.6752515 2.8006028 30.235 < 2e-16 ***
## TEAM_PITCHING_H 0.0006886 0.0004335 1.588 0.1124
## TEAM_PITCHING_HR 0.0487187 0.0070549 6.906 6.89e-12 ***
## TEAM_PITCHING_BB 0.0150764 0.0026860 5.613 2.30e-08 ***
## TEAM_PITCHING_SO -0.0054353 0.0007213 -7.536 7.63e-14 ***
## TEAM_FIELDING_E -0.0245325 0.0036059 -6.804 1.38e-11 ***
## TEAM_FIELDING_DP -0.0690500 0.0124265 -5.557 3.16e-08 ***
## TEAM_BASERUN_SB 0.0101588 0.0049969 2.033 0.0422 *
## TEAM_BASERUN_CS -0.0210745 0.0127787 -1.649 0.0993 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.53 on 1813 degrees of freedom
## Multiple R-squared: 0.1329, Adjusted R-squared: 0.1291
## F-statistic: 34.74 on 8 and 1813 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_pitching, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_transformed)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_pitching))
# n = # of observations
n <- nrow(model_pitching)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_pitching)
##
## studentized Breusch-Pagan test
##
## data: model_pitching
## BP = 486.34, df = 8, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_pitching)
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2.937196 1.597455 1.808918 1.622142
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.586704 3.942808 1.802445 1.372055
######
preds <- predict(model_pitching, newdata = test_pitching)
RMSE(preds, test_pitching$TARGET_WINS)
## [1] 15.71512
R2(preds, test_pitching$TARGET_WINS)
## [1] 0.09207141
# Model 6: Conservative dataset
model_conservative <- lm(TARGET_WINS ~ ., data = train_conservative)
summary(model_conservative)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_conservative)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.696 -8.523 0.255 8.873 55.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.5529851 5.1230167 5.964 2.96e-09 ***
## TEAM_BATTING_H 0.0430641 0.0039045 11.029 < 2e-16 ***
## TEAM_BATTING_2B -0.0119349 0.0104691 -1.140 0.25443
## TEAM_BATTING_3B 0.0509068 0.0187274 2.718 0.00662 **
## TEAM_BATTING_HR 0.0701690 0.0103410 6.786 1.56e-11 ***
## TEAM_BATTING_BB 0.0028022 0.0058134 0.482 0.62985
## TEAM_BATTING_SO -0.0135210 0.0022428 -6.029 2.00e-09 ***
## TEAM_BASERUN_SB 0.0057933 0.0045900 1.262 0.20706
## TEAM_BASERUN_CS -0.0005916 0.0120567 -0.049 0.96087
## TEAM_PITCHING_H -0.0011970 0.0004527 -2.645 0.00825 **
## TEAM_PITCHING_BB 0.0033526 0.0039644 0.846 0.39784
## TEAM_PITCHING_SO 0.0024874 0.0009472 2.626 0.00871 **
## TEAM_FIELDING_E -0.0270055 0.0034337 -7.865 6.30e-15 ***
## TEAM_FIELDING_DP -0.0591481 0.0113413 -5.215 2.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.12 on 1808 degrees of freedom
## Multiple R-squared: 0.2953, Adjusted R-squared: 0.2902
## F-statistic: 58.28 on 13 and 1808 DF, p-value: < 2.2e-16
######
# Valid Linear Regression Model Assumption Check
# Nearly normal residuals
ggplot(data = model_conservative, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
plot(model_conservative)
# Calculate the leverage cutoff
# p = # of coeff
p <- length(coef(model_conservative))
# n = # of observations
n <- nrow(train_transformed)
# Cut off point high leverage
high_leverage <- (2 * (p+1)) / n
# Add the cutoff line to the plot
abline(v = high_leverage, col = "red", lty = 2)
# Add a legend
legend("topright", legend = paste("Leverage Cutoff (2*(p+1)/n):", round(high_leverage, 3)),
col = "red", lty = 2, bty = "n")
# anova(model_full_log)
# Perform Breusch-Pagan Test
bptest(model_conservative)
##
## studentized Breusch-Pagan test
##
## data: model_conservative
## BP = 258.82, df = 13, p-value < 2.2e-16
# Variance Inflation Factor
vif(model_conservative)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.253674 2.516804 2.908010 4.119362
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 5.272319 4.281437 1.866157 1.498655
## TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 3.929357 4.835196 3.432345 6.215882
## TEAM_FIELDING_DP
## 4.029774
######
preds <- predict(model_conservative, newdata = test_conservative)
RMSE(preds, test_conservative$TARGET_WINS)
## [1] 13.81767
R2(preds, test_conservative$TARGET_WINS)
## [1] 0.2950506
After evaluating various different linear regression models using key performance metrics—including RMSE, R², MSE, and the F-statistic—Model 1.1 (Full + Log) emerged as the most robust and reliable choice. It demonstrated the best predictive accuracy (lowest RMSE and MSE), the highest proportion of variance explained (R² = 0.332), and a strong F-statistic, indicating overall model significance. The use of log-transformed predictors improved model fit and likely addressed issues related to skewness or heteroscedasticity.
While Model 1.2 (WLS) showed similar metrics and better residual plots, its complexity made interpretation more difficult. Model 4, a simpler alternative, had slightly weaker performance. Overall, Model 1.1 offers the best balance between accuracy and interpretability and is selected as the final model for prediction.
model_results <- data.frame(
Model = c("Model 1: Full",
"Model 1.1: Full + Log",
"Model 1.2: Full + Log + WLS",
"Model 1.3: Full + Sqrt",
"Model 2: Moderate",
"Model 2.1: Moderate + Sqrt",
"Model 2.2: Moderate + Sqrt + WLS",
"Model 3: Transformed Advantage",
"Model 4: Batting",
"Model 4.1: Batting (Elastic Net)",
"Model 5: Pitching",
"Model 6: Conservative"),
RMSE = c(
RMSE(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
RMSE(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
RMSE(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
rmse_sqrt,
rmse_moderate,
RMSE(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
rmse_mod_wls_sqrt,
RMSE(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
RMSE(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
enet_resample["RMSE"],
RMSE(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
RMSE(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
),
R2 = c(
R2(predict(model_full, newdata = test_full), test_full$TARGET_WINS),
R2(predict(model_full_log, newdata = test_full_log), test_full_log$TARGET_WINS),
R2(predict(model_full_log_wls, newdata = test_full_log), test_full_log$TARGET_WINS),
r2_sqrt,
r2_moderate,
R2(preds_moderate_sqrt, test_moderate_sqrt$TARGET_WINS),
r2_mod_wls_sqrt,
R2(predict(model_transformed, newdata = test_transformed), test_transformed$TARGET_WINS),
R2(predict(model_batting, newdata = test_batting), test_batting$TARGET_WINS),
enet_resample["Rsquared"],
R2(predict(model_pitching, newdata = test_pitching), test_pitching$TARGET_WINS),
R2(predict(model_conservative, newdata = test_conservative), test_conservative$TARGET_WINS)
)
)
# Add MSE column
model_results$MSE <- model_results$RMSE^2
# Add F-statistic column (NA for models where it doesn’t apply: Elastic Net)
model_results$F_statistic <- c(
summary(model_full)$fstatistic[1],
summary(model_full_log)$fstatistic[1],
summary(model_full_log_wls)$fstatistic[1],
summary(model_full_sqrt)$fstatistic[1],
summary(model_moderate)$fstatistic[1],
summary(model_moderate_sqrt)$fstatistic[1],
summary(wls_model_moderate_sqrt)$fstatistic[1],
summary(model_transformed)$fstatistic[1],
summary(model_batting)$fstatistic[1],
NA, # Elastic Net model doesn’t have F-statistic in the same form
summary(model_pitching)$fstatistic[1],
summary(model_conservative)$fstatistic[1]
)
# View the updated table
print(model_results)
## Model RMSE R2 MSE F_statistic
## 1 Model 1: Full 13.81744 0.29507385 190.9215 54.08500
## 2 Model 1.1: Full + Log 13.47148 0.33177723 181.4809 67.47286
## 3 Model 1.2: Full + Log + WLS 13.55652 0.32320140 183.7792 65.61963
## 4 Model 1.3: Full + Sqrt 14.05698 0.27227029 197.5987 61.07278
## 5 Model 2: Moderate 14.01101 0.27697537 196.3085 66.37656
## 6 Model 2.1: Moderate + Sqrt 14.05698 0.27227029 197.5987 65.47077
## 7 Model 2.2: Moderate + Sqrt + WLS 14.09027 0.26975219 198.5357 63.28447
## 8 Model 3: Transformed Advantage 14.68510 0.20472177 215.6522 41.11853
## 9 Model 4: Batting 13.77150 0.30035002 189.6543 73.39906
## 10 Model 4.1: Batting (Elastic Net) 13.77150 0.30035002 189.6543 NA
## 11 Model 5: Pitching 15.71512 0.09207141 246.9649 34.74147
## 12 Model 6: Conservative 13.81767 0.29505061 190.9280 58.27710
##Final Model and Predictions
After selecting the Full + Log model (Model 1.1) as the best-performing model, we applied it to the evaluation dataset. Log transformations were applied to the relevant predictor variables to match the training data. The model was then used to predict the number of team wins.
data_eval$log_TEAM_BATTING_3B <- log(data_eval$TEAM_BATTING_3B + 1)
data_eval$log_BASERUN_CS <- log(data_eval$TEAM_BASERUN_CS + 1)
data_eval$log_TEAM_BASERUN_SB <- log(data_eval$TEAM_BASERUN_SB + 1)
data_eval$log_TEAM_FIELDING_E <- log(data_eval$TEAM_FIELDING_E + 1)
data_eval$log_TEAM_PITCHING_BB <- log(data_eval$TEAM_PITCHING_BB + 1)
data_eval$log_TEAM_PITCHING_H <- log(data_eval$TEAM_PITCHING_H + 1)
data_eval$log_TEAM_PITCHING_SO <- log(data_eval$TEAM_PITCHING_SO + 1)
# Predict
data_eval$predicted_wins <- predict(model_full_log, newdata = data_eval)
hist(data_eval$predicted_wins,
breaks = 20,
main = "Distribution of Predicted Wins",
xlab = "Predicted Wins",
col = "lightpink",
border = "white")
# Preview predictions
head(data_eval[, "predicted_wins", drop = FALSE], 4)
## predicted_wins
## 1 67.94843
## 2 70.51659
## 3 76.91473
## 4 82.01534