Assignment Introduction

We have read in the 2 datasets provided as part of this project from Github.

Our goal for this project is to analyze the variables in the baseball statistics columns provided on the training file, explore missing data, understand and prepare the data and build regression models to fit the data (predict our Y or dependent variable “Target Wins” for a given team, based on multiple predictor variables provided in the dataset). 

Once have fit the data, we can then run and test on the evaluation dataset provided, we will analyze model validity and performance and report model that can be selected as the best fit for the data. Shown below is an image all the variables and their definitions.

A good reference site that we have consulted is the MLB league page https://www.mlb.com/stats/. These have great definitions and usage of the column variables for reference information. 

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(kableExtra)
library(tidyverse)
library(tidyr)
library(forecast)
library(cowplot)
#install.packages("car")
library(naniar)
library(Seurat)
library(skimr)
library(PerformanceAnalytics)
library(corrplot)
library(mice)
library(ggplot2)
library(reshape2)
library(dplyr)
library(tidyr)
library(MASS)# For stepAIC (Stepwise selection)
library(caret)  # For splitting data and cross-validation
library(leaps)
library(car)


moneyball_tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv",stringsAsFactors = F,header=TRUE)

moneyball_testdta <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-evaluation-data.csv",stringsAsFactors = F,header=TRUE)
#knitr::include_graphics("Definition.png")

Data Exploration

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed “fixed”?

Explore DataSet & Review Missing Data

This data set contains 2276 records and 17 variables in each record. 

Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Here we have used a few visualizations to summarize the variables that are missing data or have values 0’s. 

Following variables have the highest “NA” missing values which we can either ignore or consider for imputation. In this project our team have decided to proceed with imputation.Three other variables TEAM_BASERUN_SB, TEAM_BATTING_SO,TEAM_PITCHING_SO are all around 6% and 5% missing data respectively.

TEAM_BATTING_HBP (92%)
TEAM_BASERUN_CS (34%)
TEAM_FIELDING_DP (13%) 

We also have records with 0’s.A potential option is transforming them by adding a constant. 

A quick run down of statistics that we need to understand in the world of baseball. Some trivia, for a player, a .275 average is pretty good. However this means that the batter was successful just over 25% of the time. Nearly 73% of the time, they did not get a hit.

BA = batting average,  G = games played,  AB = at bats,  R = runs,
H = hits,
2B = doubles,
3B = triples,
HR = home runs,
RBI = runs batted in,  SB = stolen bases.

#Review the data
skimr::skim(moneyball_tdata)
Data summary
Name moneyball_tdata
Number of rows 2276
Number of columns 17
_______________________
Column type frequency:
numeric 17
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 1268.46 736.35 1 630.75 1270.5 1915.50 2535 ▇▇▇▇▇
TARGET_WINS 0 1.00 80.79 15.75 0 71.00 82.0 92.00 146 ▁▁▇▅▁
TEAM_BATTING_H 0 1.00 1469.27 144.59 891 1383.00 1454.0 1537.25 2554 ▁▇▂▁▁
TEAM_BATTING_2B 0 1.00 241.25 46.80 69 208.00 238.0 273.00 458 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.25 27.94 0 34.00 47.0 72.00 223 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 99.61 60.55 0 42.00 102.0 147.00 264 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 501.56 122.67 0 451.00 512.0 580.00 878 ▁▁▇▇▁
TEAM_BATTING_SO 102 0.96 735.61 248.53 0 548.00 750.0 930.00 1399 ▁▆▇▇▁
TEAM_BASERUN_SB 131 0.94 124.76 87.79 0 66.00 101.0 156.00 697 ▇▃▁▁▁
TEAM_BASERUN_CS 772 0.66 52.80 22.96 0 38.00 49.0 62.00 201 ▃▇▁▁▁
TEAM_BATTING_HBP 2085 0.08 59.36 12.97 29 50.50 58.0 67.00 95 ▂▇▇▅▁
TEAM_PITCHING_H 0 1.00 1779.21 1406.84 1137 1419.00 1518.0 1682.50 30132 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 105.70 61.30 0 50.00 107.0 150.00 343 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 553.01 166.36 0 476.00 536.5 611.00 3645 ▇▁▁▁▁
TEAM_PITCHING_SO 102 0.96 817.73 553.09 0 615.00 813.5 968.00 19278 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 246.48 227.77 65 127.00 159.0 249.25 1898 ▇▁▁▁▁
TEAM_FIELDING_DP 286 0.87 146.39 26.23 52 131.00 149.0 164.00 228 ▁▂▇▆▁
#Identify columns with missing values
gg_miss_upset(moneyball_tdata)

gg_miss_var(moneyball_tdata, show_pct = TRUE)

# Drop the Index column
df <- moneyball_tdata %>%  dplyr::select(-INDEX)


# #Analysis of Missing values on the original dataset without Mice imputation
# missing <- colSums(df %>% sapply(is.na))
# missing_pct <- round(missing / nrow(df) * 100, 2)
# stack(sort(missing_pct, decreasing = TRUE))

df %>% 
  gather(variable, value) %>%
  filter(is.na(value)) %>%
  group_by(variable) %>%
  tally() %>%
  mutate(percent = n / nrow(df) * 100) %>%
  mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
  arrange(desc(n)) %>%
  rename(`Variable Missing Data` = variable,
         `Number of Records` = n,
         `Share of Total` = percent) %>%
  kable() %>%
  kable_styling()
Variable Missing Data Number of Records Share of Total
TEAM_BATTING_HBP 2085 92%
TEAM_BASERUN_CS 772 34%
TEAM_FIELDING_DP 286 13%
TEAM_BASERUN_SB 131 5.8%
TEAM_BATTING_SO 102 4.5%
TEAM_PITCHING_SO 102 4.5%
df %>% 
  gather(variable, value) %>%
  filter(value == 0) %>%
  group_by(variable) %>%
  tally() %>%
  mutate(percent = n / nrow(df) * 100) %>%
  mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
  arrange(desc(n)) %>%
  rename(`Variable With Zeros` = variable,
         `Number of Records` = n,
         `Share of Total` = percent) %>%
  kable() %>%
  kable_styling()
Variable With Zeros Number of Records Share of Total
TEAM_BATTING_SO 20 0.9%
TEAM_PITCHING_SO 20 0.9%
TEAM_BATTING_HR 15 0.7%
TEAM_PITCHING_HR 15 0.7%
TEAM_BASERUN_SB 2 0.1%
TEAM_BATTING_3B 2 0.1%
TARGET_WINS 1 0%
TEAM_BASERUN_CS 1 0%
TEAM_BATTING_BB 1 0%
TEAM_PITCHING_BB 1 0%

Review distributions

The histogram distribution show kurtosis, specifically right skew in variables such as BASERUN_CS, BASERUN_SB, FIELDING_E, PITCHING_BB, PITCHING_H and PITCHING_SO. We may need to transform these to see if we can make them more of a normal curve prior to proceeding with the model build using the data. 

BATTING_HR, PITCHING_HR and BATTING_SO have a bimodal data which may suggest a shift in the data represented within the dataset. Looking at quick qqplots of the bimodal variables confirms that these are not normal, however, HBP which has a large number of missing values does seem following along the line on the values represented. We can choose to keep this variable and impute the missing values rather than dropping it from the dataset.

TARGET WINS is normally distributed. 

Boxplots reveal that there are many variables with outliers found in the distribution and as shown by the qqplot as well. These will have some impact on our model fit. We could potentially find leverage and influential points and remove them and rerun the model to determine during model build process.

#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)

# Look at distributions of the columns
mt <- df %>% 
  gather(key = 'variable', value = 'value')

names(df)
##  [1] "TARGET_WINS"      "TEAM_BATTING_H"   "TEAM_BATTING_2B"  "TEAM_BATTING_3B" 
##  [5] "TEAM_BATTING_HR"  "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
##  [9] "TEAM_BASERUN_CS"  "TEAM_BATTING_HBP" "TEAM_PITCHING_H"  "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"
variable_names <- list(
  "TARGET_WINS"      =  "TARGET_WINS"  
  ,"TEAM_BATTING_H"   = "BATTING_H"
  ,"TEAM_BATTING_2B"  = "BATTING_2B"
  ,"TEAM_BATTING_3B"  = "BATTING_3B"
  ,"TEAM_BATTING_HR"  = "BATTING_HR"
  ,"TEAM_BATTING_BB"  = "BATTING_BB"
  ,"TEAM_BATTING_SO"  = "BATTING_SO"
  ,"TEAM_BASERUN_SB"  = "BASERUN_SB"
  ,"TEAM_BASERUN_CS"  = "BASERUN_CS"
  ,"TEAM_BATTING_HBP" = "BATTING_HBP"
  ,"TEAM_PITCHING_H"  = "PITCHING_H"
  ,"TEAM_PITCHING_HR" = "PITCHING_HR"
  ,"TEAM_PITCHING_BB" = "PITCHING_BB"
  ,"TEAM_PITCHING_SO" = "PITCHING_SO"
  ,"TEAM_FIELDING_E"  = "FIELDING_E"
  ,"TEAM_FIELDING_DP" = "FIELDING_DP"
  
)

variable_labeller <- function(variable,value){
  return(variable_names[value])
}

# Histogram plots of each variable

mt %>% ggplot(., aes(value)) + 
geom_density(fill = "lightgray", color="lightgray") + 
facet_wrap(.~variable, scales='free',ncol=4,labeller=variable_labeller)

# findoutlier <- function(x) {
# return(x < quantile(x, .25,na.rm=TRUE) - 1.5*IQR(x,na.rm=TRUE) | x > quantile(x, .75,na.rm=TRUE) + 1.5*IQR(x,na.rm=TRUE))
#  }
# 
# mtout <-mt %>%
# group_by(variable) %>%
# mutate(outlier = ifelse(findoutlier(value), value, NA))

# Boxplot
ggplot(mt, aes(factor(variable), value)) +
  geom_boxplot() +
#geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)+
 facet_wrap(. ~variable, scales='free', ncol=5,labeller = variable_labeller)

# ggplot(stack(df), aes(x = ind, y = values)) + 
#   geom_boxplot() +
#    theme(legend.position="none") +
#   theme(axis.text.x=element_text(angle=45, hjust=1))

#stacked boxplot
longdf <- df %>%
  melt()

longdf %>%
  filter(complete.cases(.)) %>%
  ggplot(aes(x= variable, y=value)) +
  geom_boxplot(fill="lightgray") +
  scale_y_log10() +
  coord_flip() +
  theme_minimal() +
  labs(y="Value", x="variable",
       title="Boxplot")

#For Bimodal we want to understand qqplot.
qqplot_BATTING_3B <- ggplot(df, aes(sample = TEAM_BATTING_3B)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="BATTING_3B")


qqplot_BATTING_HR <- ggplot(df, aes(sample = TEAM_BATTING_HR)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="BATTING_HR")

qqplot_BATTING_HBP <- ggplot(df, aes(sample = TEAM_BATTING_HBP)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="BATTING_HBP")

qqplot_BATTING_3B

qqplot_BATTING_HR

qqplot_BATTING_HBP

CORRELATIONS REVIEW

Interesting Insights from the Correlation Plot

Positive Correlation Between TEAM_BATTING_H and TARGET_WINS (0.47)

As expected, base hits (TEAM_BATTING_H) have a strong positive correlation with the number of wins. This makes sense because getting more base hits usually increases a team’s chances of scoring and winning games.

Positive Correlation Between TEAM_BATTING_HR and TARGET_WINS (0.42)

Homeruns (TEAM_BATTING_HR) are also strongly correlated with wins, indicating that teams with more homeruns tend to win more games. This confirms the importance of power-hitting in baseball.

Moderate Positive Correlation Between TEAM_BATTING_2B and TARGET_WINS (0.31)

Doubles (TEAM_BATTING_2B) show a moderate positive correlation with wins, emphasizing that extra-base hits contribute significantly to a team’s success.

Negative Correlation Between TEAM_BATTING_3B and TARGET_WINS (-0.12)

Surprisingly, triples (TEAM_BATTING_3B) show a slight negative correlation with wins. This could suggest that teams with more triples don’t necessarily translate them into scoring, or that teams who hit fewer triples may rely on other strategies to win.

Positive Correlation Between TEAM_PITCHING_H and TARGET_WINS (0.47)

This is an unexpected result, as you would normally expect more hits allowed to lead to fewer wins. This could indicate that teams that allow more hits are able to compensate with other strong areas, like offensive production or good defense.

Negative Correlation Between TEAM_BATTING_SO and TARGET_WINS (-0.23)

Strikeouts by batters (TEAM_BATTING_SO) are negatively correlated with wins, meaning that teams that strike out more often tend to win less. This is expected, as striking out reduces a team’s chances of getting on base and scoring.

Negative Correlation Between TEAM_FIELDING_E and TARGET_WINS (-0.39)

Fielding errors (TEAM_FIELDING_E) have a strong negative correlation with wins. Teams that commit more errors tend to lose more games, which is logical since errors typically give the opposing team more scoring opportunities.

Positive Correlation Between TEAM_BATTING_BB and TARGET_WINS (0.47)

Walks by batters (TEAM_BATTING_BB) have a significant positive impact on wins. This suggests that good plate discipline and the ability to draw walks help teams increase their chances of winning by getting more players on base.

TEAM_PITCHING_BB and TARGET_WINS (0.47)

Similar to hits allowed, walks allowed by pitchers (TEAM_PITCHING_BB) show a moderate positive correlation with wins. This is counterintuitive, as more walks generally put the team at a disadvantage. Further analysis might reveal that stronger teams manage to win despite allowing more walks.

#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)

# Look at distributions of the columns
mt <- df %>% 
  gather(key = 'variable', value = 'value')

names(df)
##  [1] "TARGET_WINS"      "TEAM_BATTING_H"   "TEAM_BATTING_2B"  "TEAM_BATTING_3B" 
##  [5] "TEAM_BATTING_HR"  "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
##  [9] "TEAM_BASERUN_CS"  "TEAM_BATTING_HBP" "TEAM_PITCHING_H"  "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"
variable_names <- list(
  "TARGET_WINS"      =  "TARGET_WINS"  
  ,"TEAM_BATTING_H"   = "BATTING_H"
  ,"TEAM_BATTING_2B"  = "BATTING_2B"
  ,"TEAM_BATTING_3B"  = "BATTING_3B"
  ,"TEAM_BATTING_HR"  = "BATTING_HR"
  ,"TEAM_BATTING_BB"  = "BATTING_BB"
  ,"TEAM_BATTING_SO"  = "BATTING_SO"
  ,"TEAM_BASERUN_SB"  = "BASERUN_SB"
  ,"TEAM_BASERUN_CS"  = "BASERUN_CS"
  ,"TEAM_BATTING_HBP" = "BATTING_HBP"
  ,"TEAM_PITCHING_H"  = "PITCHING_H"
  ,"TEAM_PITCHING_HR" = "PITCHING_HR"
  ,"TEAM_PITCHING_BB" = "PITCHING_BB"
  ,"TEAM_PITCHING_SO" = "PITCHING_SO"
  ,"TEAM_FIELDING_E"  = "FIELDING_E"
  ,"TEAM_FIELDING_DP" = "FIELDING_DP"
  
)

variable_labeller <- function(variable,value){
  return(variable_names[value])
}

# Select numeric columns only
numeric_data <- df[sapply(df, is.numeric)]
M<- cor(numeric_data,use="complete.obs")
 # M %>% kable() %>%
 #  kable_styling()

ggcorrplot(M, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           #colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data, use="complete.obs")
print(correlation_matrix)
##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS       1.00000000     0.46994665      0.31298400     -0.12434586
## TEAM_BATTING_H    0.46994665     1.00000000      0.56177286      0.21391883
## TEAM_BATTING_2B   0.31298400     0.56177286      1.00000000      0.04203441
## TEAM_BATTING_3B  -0.12434586     0.21391883      0.04203441      1.00000000
## TEAM_BATTING_HR   0.42241683     0.39627593      0.25099045     -0.21879927
## TEAM_BATTING_BB   0.46868793     0.19735234      0.19749256     -0.20584392
## TEAM_BATTING_SO  -0.22889273    -0.34174328     -0.06415123     -0.19291841
## TEAM_BASERUN_SB   0.01483639     0.07167495     -0.18768279      0.16946086
## TEAM_BASERUN_CS  -0.17875598    -0.09377545     -0.20413884      0.23213978
## TEAM_BATTING_HBP  0.07350424    -0.02911218      0.04608475     -0.17424715
## TEAM_PITCHING_H   0.47123431     0.99919269      0.56045355      0.21250322
## TEAM_PITCHING_HR  0.42246683     0.39495630      0.24999875     -0.21973263
## TEAM_PITCHING_BB  0.46839882     0.19529071      0.19592157     -0.20675383
## TEAM_PITCHING_SO -0.22936481    -0.34445001     -0.06616615     -0.19386654
## TEAM_FIELDING_E  -0.38668800    -0.25381638     -0.19427027     -0.06513145
## TEAM_FIELDING_DP -0.19586601     0.01776946     -0.02488808      0.13314758
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS           0.42241683      0.46868793     -0.22889273
## TEAM_BATTING_H        0.39627593      0.19735234     -0.34174328
## TEAM_BATTING_2B       0.25099045      0.19749256     -0.06415123
## TEAM_BATTING_3B      -0.21879927     -0.20584392     -0.19291841
## TEAM_BATTING_HR       1.00000000      0.45638161      0.21045444
## TEAM_BATTING_BB       0.45638161      1.00000000      0.21833871
## TEAM_BATTING_SO       0.21045444      0.21833871      1.00000000
## TEAM_BASERUN_SB      -0.19021893     -0.08806123     -0.07475974
## TEAM_BASERUN_CS      -0.27579838     -0.20878051     -0.05613035
## TEAM_BATTING_HBP      0.10618116      0.04746007      0.22094219
## TEAM_PITCHING_H       0.39549390      0.19848687     -0.34145321
## TEAM_PITCHING_HR      0.99993259      0.45659283      0.21111617
## TEAM_PITCHING_BB      0.45542468      0.99988140      0.21895783
## TEAM_PITCHING_SO      0.20829574      0.21793253      0.99976835
## TEAM_FIELDING_E       0.01567397     -0.07847126      0.30814540
## TEAM_FIELDING_DP     -0.06182222     -0.07929078     -0.12319072
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS           0.01483639    -0.178755979       0.07350424
## TEAM_BATTING_H        0.07167495    -0.093775445      -0.02911218
## TEAM_BATTING_2B      -0.18768279    -0.204138837       0.04608475
## TEAM_BATTING_3B       0.16946086     0.232139777      -0.17424715
## TEAM_BATTING_HR      -0.19021893    -0.275798375       0.10618116
## TEAM_BATTING_BB      -0.08806123    -0.208780510       0.04746007
## TEAM_BATTING_SO      -0.07475974    -0.056130355       0.22094219
## TEAM_BASERUN_SB       1.00000000     0.624737808      -0.06400498
## TEAM_BASERUN_CS       0.62473781     1.000000000      -0.07051390
## TEAM_BATTING_HBP     -0.06400498    -0.070513896       1.00000000
## TEAM_PITCHING_H       0.07395373    -0.092977893      -0.02769699
## TEAM_PITCHING_HR     -0.18948057    -0.275471495       0.10675878
## TEAM_PITCHING_BB     -0.08741902    -0.208470154       0.04785137
## TEAM_PITCHING_SO     -0.07351325    -0.055308336       0.22157375
## TEAM_FIELDING_E       0.04292341     0.207701189       0.04178971
## TEAM_FIELDING_DP     -0.13023054    -0.006764233      -0.07120824
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS           0.47123431       0.42246683       0.46839882
## TEAM_BATTING_H        0.99919269       0.39495630       0.19529071
## TEAM_BATTING_2B       0.56045355       0.24999875       0.19592157
## TEAM_BATTING_3B       0.21250322      -0.21973263      -0.20675383
## TEAM_BATTING_HR       0.39549390       0.99993259       0.45542468
## TEAM_BATTING_BB       0.19848687       0.45659283       0.99988140
## TEAM_BATTING_SO      -0.34145321       0.21111617       0.21895783
## TEAM_BASERUN_SB       0.07395373      -0.18948057      -0.08741902
## TEAM_BASERUN_CS      -0.09297789      -0.27547150      -0.20847015
## TEAM_BATTING_HBP     -0.02769699       0.10675878       0.04785137
## TEAM_PITCHING_H       1.00000000       0.39463199       0.19703302
## TEAM_PITCHING_HR      0.39463199       1.00000000       0.45580983
## TEAM_PITCHING_BB      0.19703302       0.45580983       1.00000000
## TEAM_PITCHING_SO     -0.34330646       0.20920115       0.21887700
## TEAM_FIELDING_E      -0.25073028       0.01689330      -0.07692315
## TEAM_FIELDING_DP      0.01416807      -0.06292475      -0.08040645
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS           -0.22936481     -0.38668800     -0.195866006
## TEAM_BATTING_H        -0.34445001     -0.25381638      0.017769456
## TEAM_BATTING_2B       -0.06616615     -0.19427027     -0.024888081
## TEAM_BATTING_3B       -0.19386654     -0.06513145      0.133147578
## TEAM_BATTING_HR        0.20829574      0.01567397     -0.061822219
## TEAM_BATTING_BB        0.21793253     -0.07847126     -0.079290775
## TEAM_BATTING_SO        0.99976835      0.30814540     -0.123190715
## TEAM_BASERUN_SB       -0.07351325      0.04292341     -0.130230537
## TEAM_BASERUN_CS       -0.05530834      0.20770119     -0.006764233
## TEAM_BATTING_HBP       0.22157375      0.04178971     -0.071208241
## TEAM_PITCHING_H       -0.34330646     -0.25073028      0.014168073
## TEAM_PITCHING_HR       0.20920115      0.01689330     -0.062924751
## TEAM_PITCHING_BB       0.21887700     -0.07692315     -0.080406452
## TEAM_PITCHING_SO       1.00000000      0.31008407     -0.124923213
## TEAM_FIELDING_E        0.31008407      1.00000000      0.040205814
## TEAM_FIELDING_DP      -0.12492321      0.04020581      1.000000000
corrplot(correlation_matrix, method="circle")

OUTLIERS

There are variables with outliers. For example, there are many pairs of variable types that have 0’s recorded above (for example batting_HR and pitching_HR), since these are 0’s, they could be considered outliers or they can be legitimate. In addition, a few variables have outliers further distance away from the mean such as BATTING_3B, PITCHING_SO, PITCHING_BB.An observation is considered an outlier, if it is extreme, relative to other response values. In contrast, some records have extremely high or low values for the predictor variable, relative to the other values. These are referred to as high leverage observations.These can influence our model estimates and introduce bias. We could potentially calculate these points, remove them and refit the line and check our Rsquared value later on during the model build process.

Cook’s D measures how much the model coefficient estimates would change if an observation were to be removed from the data set.

There is one Cook’s D value for each observation used to fit the model. The higher the Cook’s D value, the greater the influence. Generally accepted rules of thumb are that Cook’s D values above 1.0 indicate influential values, and any values that stick out from the rest might also be influential.

#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)


#Scatterplot of variables with target wins
df %>%
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "lightblue", color="lightblue") + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = "Wins")

#Look at few variable distributions in depth

df %>% 
  ggplot(aes(TARGET_WINS)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TARGET_WINS, na.rm = T)), col = "red", lty = 2) +
  geom_vline(aes(xintercept = median(TARGET_WINS, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Distribution TARGET_WINS",
       caption = "* Red line is the mean value and green is the median")

df %>% 
  ggplot(aes(TEAM_BATTING_3B)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "red", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Distribution TEAM_BATTING_3B",
       caption = "* Red line is the mean value and green is the median")

df %>% 
  ggplot(aes(TEAM_PITCHING_H)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_PITCHING_H, na.rm = T)), col = "red", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_PITCHING_H, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Distribution TEAM_PITCHING_H",
       caption = "* Red line is the mean value and green is the median")

df %>% 
  ggplot(aes(TEAM_PITCHING_SO)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_PITCHING_SO, na.rm = T)), col = "red", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_PITCHING_SO, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Distribution TEAM_PITCHING_SO",
       caption = "* Red line is the mean value and green is the median")

df %>% 
  ggplot(aes(TEAM_PITCHING_BB)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_PITCHING_BB, na.rm = T)), col = "red", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_PITCHING_BB, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Distribution PITCHING_BB",
       caption = "* Red line is the mean value and green is the median")

Data Preparation

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root (or use Box-Cox)
e. Combine variables (such as ratios or adding or multiplying) to create new variables)

Method1: Impute missing values by using the most correlated values

The approach here is impute missing values in a specific column using the most correlated variables. Additionally we can view boxplots of the data after imputation. 

  1. Find the most correlated columns with the target column in the pre-computed correlation matrix
  2. Select top correlated columns that have no missing values
  3. Use top n correlated predictors
  4. Build regression model
  5. Predict and impute missing values on the training dataset
  6. We compare summary statistics to view imputed data and then check if any records have NA’s and confirm there are no NA’s present.
impute_missing_values <- function(df, corr_matrix, top_n = 2) {
  #Identify columns with missing values
  missing_columns <- names(df)[colSums(is.na(df)) > 0]
  #Iterate over each column with missing values
  for (column_name in missing_columns) {
    missing_count <- sum(is.na(df[[column_name]]))
    cat(paste("Column name: ", column_name, ", missing values:", missing_count, "\n"))
    
    #Check for missing
    if (missing_count > 0) {
      cat(paste("Working On Column:", column_name, "\n"))
      
      #Get corr columns
      correlations <- sort(corr_matrix[, column_name], decreasing = TRUE)
      
      #Remove column from predictors
      predictor_columns <- names(correlations)[names(correlations) != column_name]
      
      #Exclude predictors that have missing values
      predictor_columns <- predictor_columns[!predictor_columns %in% missing_columns]
      
      #Use top n predictors
      predictor_columns <- predictor_columns[1:min(top_n, length(predictor_columns))]
      
      #Print predictors
      cat(paste("Predictors for column", column_name, ":", predictor_columns, "\n"))
      
      if (length(predictor_columns) > 0) {
        #Impute missing values in predictors with their mean
        for (pred_col in predictor_columns) {
          if (any(is.na(df[[pred_col]]))) {
            df[[pred_col]][is.na(df[[pred_col]])] <- mean(df[[pred_col]], na.rm = TRUE)
          }
        }
        
        #Build model
        formula <- as.formula(paste(column_name, "~", paste(predictor_columns, collapse = "+")))
        
        #Fit Model
        fit <- lm(formula, data = df, na.action = na.exclude)
        
        #Predict
        na_indices <- is.na(df[[column_name]])
        predicted_values <- predict(fit, newdata = df[na_indices, predictor_columns, drop = FALSE])
        df[[column_name]][na_indices] <- predicted_values
        
        cat(paste("Imputed", sum(na_indices), "missing values in column:", column_name, "\n"))
      } else {
        #Fallback: Mean imputation for the column
        df[[column_name]][is.na(df[[column_name]])] <- mean(df[[column_name]], na.rm = TRUE)
        cat(paste("No valid predictors found. Fallback to mean imputation for column:", column_name, "\n"))
      }
    } else {
      cat(paste("Skipped:", column_name, "\n"))
    }
  }
  return(df)
}
df_imputed <- impute_missing_values(df, correlation_matrix)
## Column name:  TEAM_BATTING_SO , missing values: 102 
## Working On Column: TEAM_BATTING_SO 
## Predictors for column TEAM_BATTING_SO : TEAM_FIELDING_E 
##  Predictors for column TEAM_BATTING_SO : TEAM_PITCHING_BB 
## Imputed 102 missing values in column: TEAM_BATTING_SO 
## Column name:  TEAM_BASERUN_SB , missing values: 131 
## Working On Column: TEAM_BASERUN_SB 
## Predictors for column TEAM_BASERUN_SB : TEAM_BATTING_3B 
##  Predictors for column TEAM_BASERUN_SB : TEAM_PITCHING_H 
## Imputed 131 missing values in column: TEAM_BASERUN_SB 
## Column name:  TEAM_BASERUN_CS , missing values: 772 
## Working On Column: TEAM_BASERUN_CS 
## Predictors for column TEAM_BASERUN_CS : TEAM_BATTING_3B 
##  Predictors for column TEAM_BASERUN_CS : TEAM_FIELDING_E 
## Imputed 772 missing values in column: TEAM_BASERUN_CS 
## Column name:  TEAM_BATTING_HBP , missing values: 2085 
## Working On Column: TEAM_BATTING_HBP 
## Predictors for column TEAM_BATTING_HBP : TEAM_PITCHING_HR 
##  Predictors for column TEAM_BATTING_HBP : TEAM_BATTING_HR 
## Imputed 2085 missing values in column: TEAM_BATTING_HBP 
## Column name:  TEAM_PITCHING_SO , missing values: 102 
## Working On Column: TEAM_PITCHING_SO 
## Predictors for column TEAM_PITCHING_SO : TEAM_FIELDING_E 
##  Predictors for column TEAM_PITCHING_SO : TEAM_PITCHING_BB 
## Imputed 102 missing values in column: TEAM_PITCHING_SO 
## Column name:  TEAM_FIELDING_DP , missing values: 286 
## Working On Column: TEAM_FIELDING_DP 
## Predictors for column TEAM_FIELDING_DP : TEAM_BATTING_3B 
##  Predictors for column TEAM_FIELDING_DP : TEAM_FIELDING_E 
## Imputed 286 missing values in column: TEAM_FIELDING_DP
summary(df_imputed)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :-15.2  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 735.0   Median :105.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 733.8   Mean   :127.1  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:162.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.00   Min.   : 29.00   Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 44.00   1st Qu.: 56.44   1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 56.00   Median : 59.12   Median : 1518   Median :107.0   
##  Mean   : 57.13   Mean   : 66.27   Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 68.00   3rd Qu.: 66.76   3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.00   Max.   :485.98   Max.   :30132   Max.   :343.0   
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   :-14.41  
##  1st Qu.: 476.0   1st Qu.:  610.8   1st Qu.: 127.0   1st Qu.:120.51  
##  Median : 536.5   Median :  801.5   Median : 159.0   Median :145.00  
##  Mean   : 553.0   Mean   :  810.5   Mean   : 246.5   Mean   :139.87  
##  3rd Qu.: 611.0   3rd Qu.:  957.2   3rd Qu.: 249.2   3rd Qu.:161.25  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.00
summary(df)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286
#Create Boxplot function
create_grouped_boxplot_with_stats <- function(df) {
  
  #Define groupings (I know could be made more modular)
  team_batting <- colnames(df)[grep("^TEAM_BATTING_", colnames(df))]
  team_pitching <- c("TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO")
  everything_else <- setdiff(names(df), c(team_batting, team_pitching, "INDEX", "TARGET_WINS"))  # Exclude index and target
  
  #Helper function to plot multiple variables
  generate_combined_boxplot <- function(df, columns, group_name) {
    #Subset the data
    df_subset <- df[, columns]
    
    #Melt the data 
    df_melted <- melt(df_subset, variable.name = "Variable", value.name = "Value")
    
    #Create boxplot
    p <- ggplot(df_melted, aes(x = Variable, y = Value)) +
            geom_boxplot(fill = "skyblue", color = "black", outlier.colour = "red") +
            labs(title = paste("Boxplot for", group_name), x = "Variables", y = "Value") +
            theme_minimal() +
            theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels
    
    print(p)
    
    #Print stats for each column
    for (column in columns) {
      stats <- summary(df[[column]])
      cat(paste("Summary statistics for", column, ":\n"))
      print(stats)
    }
  }
  
  #Generate boxplots for each group
  cat("Team Batting Group:\n")
  generate_combined_boxplot(df, team_batting, "TEAM_BATTING")
  
  cat("\nTEAM_PITCHING Group:\n")
  generate_combined_boxplot(df, team_pitching, "TEAM_PITCHING")
  
  cat("\nEverything Else Group:\n")
  generate_combined_boxplot(df, everything_else, "Everything Else")
}

create_grouped_boxplot_with_stats(df_imputed)
## Team Batting Group:

## Summary statistics for TEAM_BATTING_H :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     891    1383    1454    1469    1537    2554 
## Summary statistics for TEAM_BATTING_2B :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    69.0   208.0   238.0   241.2   273.0   458.0 
## Summary statistics for TEAM_BATTING_3B :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   34.00   47.00   55.25   72.00  223.00 
## Summary statistics for TEAM_BATTING_HR :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   42.00  102.00   99.61  147.00  264.00 
## Summary statistics for TEAM_BATTING_BB :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   451.0   512.0   501.6   580.0   878.0 
## Summary statistics for TEAM_BATTING_SO :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   556.8   735.0   733.8   925.0  1399.0 
## Summary statistics for TEAM_BATTING_HBP :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   56.44   59.12   66.27   66.76  485.98 
## 
## TEAM_PITCHING Group:

## Summary statistics for TEAM_PITCHING_H :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1137    1419    1518    1779    1682   30132 
## Summary statistics for TEAM_PITCHING_HR :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    50.0   107.0   105.7   150.0   343.0 
## Summary statistics for TEAM_PITCHING_BB :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   476.0   536.5   553.0   611.0  3645.0 
## Summary statistics for TEAM_PITCHING_SO :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   610.8   801.5   810.5   957.2 19278.0 
## 
## Everything Else Group:

## Summary statistics for TEAM_BASERUN_SB :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -15.2    67.0   105.0   127.1   162.0   697.0 
## Summary statistics for TEAM_BASERUN_CS :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   44.00   56.00   57.13   68.00  201.00 
## Summary statistics for TEAM_FIELDING_E :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    65.0   127.0   159.0   246.5   249.2  1898.0 
## Summary statistics for TEAM_FIELDING_DP :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -14.41  120.51  145.00  139.87  161.25  228.00
compare_summary_stats <- function(df, df_imputed) {
  
  numeric_columns <- df %>% select_if(is.numeric) %>% names()
  
  #Extract summary statistics
  extract_summary_stats <- function(df, columns) {
    stats <- data.frame(Variable = character(),
                        Min = numeric(), Q1 = numeric(), Median = numeric(),
                        Mean = numeric(), Q3 = numeric(), Max = numeric(),
                        stringsAsFactors = FALSE)
    
    for (column in columns) {
      summary_vals <- as.numeric(summary(df[[column]]))
      if (length(summary_vals) == 6) {  #Only keep good data
        stats <- rbind(stats, data.frame(
          Variable = column,
          Min = summary_vals[1], Q1 = summary_vals[2], Median = summary_vals[3],
          Mean = summary_vals[4], Q3 = summary_vals[5], Max = summary_vals[6]
        ))
      }
    }
    
    return(stats)
  }
  
  #Extract summary statistics
  df_stats <- extract_summary_stats(df, numeric_columns)
  df_imputed_stats <- extract_summary_stats(df_imputed, numeric_columns)
  
  #Calculate delta
  stats_comparison <- df_stats %>%
    left_join(df_imputed_stats, by = "Variable", suffix = c("_original", "_imputed")) %>%
    mutate(
      Min_diff = Min_imputed - Min_original,
      Q1_diff = Q1_imputed - Q1_original,
      Median_diff = Median_imputed - Median_original,
      Mean_diff = Mean_imputed - Mean_original,
      Q3_diff = Q3_imputed - Q3_original,
      Max_diff = Max_imputed - Max_original
    )
  return(stats_comparison)
}

stats_comparison <- compare_summary_stats(df, df_imputed)

#Display table
print(stats_comparison)
##            Variable Min_original Q1_original Median_original Mean_original
## 1       TARGET_WINS            0          71            82.0      80.79086
## 2    TEAM_BATTING_H          891        1383          1454.0    1469.26977
## 3   TEAM_BATTING_2B           69         208           238.0     241.24692
## 4   TEAM_BATTING_3B            0          34            47.0      55.25000
## 5   TEAM_BATTING_HR            0          42           102.0      99.61204
## 6   TEAM_BATTING_BB            0         451           512.0     501.55888
## 7   TEAM_PITCHING_H         1137        1419          1518.0    1779.21046
## 8  TEAM_PITCHING_HR            0          50           107.0     105.69859
## 9  TEAM_PITCHING_BB            0         476           536.5     553.00791
## 10  TEAM_FIELDING_E           65         127           159.0     246.48067
##    Q3_original Max_original Min_imputed Q1_imputed Median_imputed Mean_imputed
## 1        92.00          146           0         71           82.0     80.79086
## 2      1537.25         2554         891       1383         1454.0   1469.26977
## 3       273.00          458          69        208          238.0    241.24692
## 4        72.00          223           0         34           47.0     55.25000
## 5       147.00          264           0         42          102.0     99.61204
## 6       580.00          878           0        451          512.0    501.55888
## 7      1682.50        30132        1137       1419         1518.0   1779.21046
## 8       150.00          343           0         50          107.0    105.69859
## 9       611.00         3645           0        476          536.5    553.00791
## 10      249.25         1898          65        127          159.0    246.48067
##    Q3_imputed Max_imputed Min_diff Q1_diff Median_diff Mean_diff Q3_diff
## 1       92.00         146        0       0           0         0       0
## 2     1537.25        2554        0       0           0         0       0
## 3      273.00         458        0       0           0         0       0
## 4       72.00         223        0       0           0         0       0
## 5      147.00         264        0       0           0         0       0
## 6      580.00         878        0       0           0         0       0
## 7     1682.50       30132        0       0           0         0       0
## 8      150.00         343        0       0           0         0       0
## 9      611.00        3645        0       0           0         0       0
## 10     249.25        1898        0       0           0         0       0
##    Max_diff
## 1         0
## 2         0
## 3         0
## 4         0
## 5         0
## 6         0
## 7         0
## 8         0
## 9         0
## 10        0

Method 2: Imputation using Mice

MICE Imputation, short for ‘Multiple Imputation by Chained Equation’ is an advanced missing data imputation technique that uses plausible data points based on distributions. We use this imputation method below on both train data set and test dataset (Evaluation for use later on during prediction). Once we impute HBP column for instance, we look at the change in the histogram. For HBP, the curve has moved away from normal curve to something different with 2 different peaks which is not what we prefer, however, we will proceed keeping this imputation for this variable. For BASERUN_CS seems to have become more normal after imputation with mice.

After imputation of the dataset, we can see positive and negative correlations below with Target Wins. BASERUN_CS have mild positive correlation and Target wins and HBP have a slight negative correlation. 

#Reviewing HBP that has 90% of the data missing to review the spread of the data

df %>% 
  ggplot(aes(TEAM_BATTING_HBP)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Hit By a Pitch",
       caption = "* Blue line is the mean value and green is the median")

# Set the seed for reproducibility
set.seed(12345)

# Perform Multiple Imputation
imputed_data_df <- mice(df, m=5, method='pmm', print=FALSE)
completedData <- complete(imputed_data_df,1)
summary(completedData) %>%
    kable() %>%
    kable_styling()
TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 541.0 1st Qu.: 67 1st Qu.: 42.0 1st Qu.:51.00 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 602.0 1st Qu.: 127.0 1st Qu.:124.0
Median : 82.00 Median :1454 Median :238.0 Median : 47.00 Median :102.00 Median :512.0 Median : 733.0 Median :106 Median : 57.0 Median :56.00 Median : 1518 Median :107.0 Median : 536.5 Median : 797.0 Median : 159.0 Median :145.5
Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 727.2 Mean :137 Mean : 76.2 Mean :58.47 Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 807.2 Mean : 246.5 Mean :141.6
3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:170 3rd Qu.: 92.0 3rd Qu.:69.00 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2 3rd Qu.:162.0
Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697 Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
# Density plots 
ggplot(df, aes(x=TEAM_BATTING_HBP, fill="Original")) +
  geom_density(alpha=0.5) +
  geom_density(data=completedData, aes(x=TEAM_BATTING_HBP, fill="imputed"), alpha=0.5) +
  labs(title="Density Plot of Ozone: Original vs. Imputed")

#check 0's again
completedData %>% 
  gather(variable, value) %>%
  filter(value == 0) %>%
  group_by(variable) %>%
  tally() %>%
  mutate(percent = n / nrow(completedData) * 100) %>%
  mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
  arrange(desc(n)) %>%
  rename(`Variable With Zeros` = variable,
         `Number of Records` = n,
         `Share of Total` = percent) %>%
  kable() %>%
  kable_styling()
Variable With Zeros Number of Records Share of Total
TEAM_BATTING_SO 20 0.9%
TEAM_PITCHING_SO 20 0.9%
TEAM_BATTING_HR 15 0.7%
TEAM_PITCHING_HR 15 0.7%
TEAM_BASERUN_SB 2 0.1%
TEAM_BATTING_3B 2 0.1%
TARGET_WINS 1 0%
TEAM_BASERUN_CS 1 0%
TEAM_BATTING_BB 1 0%
TEAM_PITCHING_BB 1 0%
# #
# hist(completedData$TEAM_BATTING_SO)
# completedData$log_BATTING_SO <- (completedData$TEAM_BATTING_SO)^(1/3)
# hist(completedData$log_BATTING_SO)

#Analysis of Missing values on the imputed dataset
missing1 <- colSums(completedData %>% sapply(is.na))
missing_pct1 <- round(missing1 / nrow(df) * 100, 2)
stack(sort(missing_pct1, decreasing = TRUE))
mt1 <- completedData  %>% 
  gather(key = 'variable', value = 'value')
  
  mt1 %>% ggplot(., aes(value)) + 
  geom_density(fill = "green", color="green") + 
  facet_wrap(.~variable, scales='free',ncol=4,labeller=variable_labeller)

completedData %>% 
  ggplot(aes(TEAM_BATTING_HBP)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "Hit By a Pitch with mice imputation",
       caption = "* Blue line is the mean value and green is the median")

completedData %>% 
  ggplot(aes(TEAM_BASERUN_CS)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BASERUN_CS, na.rm = T)), col = "blue", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BASERUN_CS, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "BASERUN_CS with mice imputation",
       caption = "* Blue line is the mean value and green is the median")

#Analyze Correlations
M<- completedData %>%
 cor(., use = "complete.obs")
print(M)
##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS       1.00000000    0.388767521      0.28910365     0.142608411
## TEAM_BATTING_H    0.38876752    1.000000000      0.56284968     0.427696575
## TEAM_BATTING_2B   0.28910365    0.562849678      1.00000000    -0.107305824
## TEAM_BATTING_3B   0.14260841    0.427696575     -0.10730582     1.000000000
## TEAM_BATTING_HR   0.17615320   -0.006544685      0.43539729    -0.635566946
## TEAM_BATTING_BB   0.23255986   -0.072464013      0.25572610    -0.287235841
## TEAM_BATTING_SO  -0.03700542   -0.423472975      0.18945983    -0.671507774
## TEAM_BASERUN_SB   0.10806892    0.132765300     -0.19125764     0.513732615
## TEAM_BASERUN_CS   0.04189727    0.116968307     -0.29737125     0.637883604
## TEAM_BATTING_HBP -0.02195123    0.199416536      0.04069928     0.263360368
## TEAM_PITCHING_H  -0.10993705    0.302693709      0.02369219     0.194879411
## TEAM_PITCHING_HR  0.18901373    0.072853119      0.45455082    -0.567836679
## TEAM_PITCHING_BB  0.12417454    0.094193027      0.17805420    -0.002224148
## TEAM_PITCHING_SO -0.07633076   -0.231737691      0.08388897    -0.266685175
## TEAM_FIELDING_E  -0.17648476    0.264902478     -0.23515099     0.509778447
## TEAM_FIELDING_DP -0.04872795    0.014656289      0.30360837    -0.435353559
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS          0.176153200     0.232559864     -0.03700542
## TEAM_BATTING_H      -0.006544685    -0.072464013     -0.42347298
## TEAM_BATTING_2B      0.435397293     0.255726103      0.18945983
## TEAM_BATTING_3B     -0.635566946    -0.287235841     -0.67150777
## TEAM_BATTING_HR      1.000000000     0.513734810      0.72991823
## TEAM_BATTING_BB      0.513734810     1.000000000      0.38691765
## TEAM_BATTING_SO      0.729918232     0.386917646      1.00000000
## TEAM_BASERUN_SB     -0.501890501    -0.339539238     -0.30585964
## TEAM_BASERUN_CS     -0.635081576    -0.352347044     -0.46343485
## TEAM_BATTING_HBP    -0.231718837    -0.338486366     -0.26456155
## TEAM_PITCHING_H     -0.250145481    -0.449777625     -0.36207321
## TEAM_PITCHING_HR     0.969371396     0.459552072      0.67285963
## TEAM_PITCHING_BB     0.136927564     0.489361263      0.05421024
## TEAM_PITCHING_SO     0.201349979    -0.009344302      0.42418730
## TEAM_FIELDING_E     -0.587339098    -0.655970815     -0.58194385
## TEAM_FIELDING_DP     0.510707512     0.391002244      0.30778415
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS           0.10806892     0.041897274      -0.02195123
## TEAM_BATTING_H        0.13276530     0.116968307       0.19941654
## TEAM_BATTING_2B      -0.19125764    -0.297371248       0.04069928
## TEAM_BATTING_3B       0.51373261     0.637883604       0.26336037
## TEAM_BATTING_HR      -0.50189050    -0.635081576      -0.23171884
## TEAM_BATTING_BB      -0.33953924    -0.352347044      -0.33848637
## TEAM_BATTING_SO      -0.30585964    -0.463434852      -0.26456155
## TEAM_BASERUN_SB       1.00000000     0.798659242       0.15955042
## TEAM_BASERUN_CS       0.79865924     1.000000000       0.21492884
## TEAM_BATTING_HBP      0.15955042     0.214928843       1.00000000
## TEAM_PITCHING_H       0.17644317     0.173942970       0.15956202
## TEAM_PITCHING_HR     -0.45351728    -0.584641263      -0.18940532
## TEAM_PITCHING_BB      0.02569117     0.009879709      -0.11421634
## TEAM_PITCHING_SO      0.02566609    -0.090100513      -0.03539331
## TEAM_FIELDING_E       0.60025561     0.582171528       0.26247599
## TEAM_FIELDING_DP     -0.61149875    -0.627917734      -0.21305085
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS          -0.10993705       0.18901373      0.124174536
## TEAM_BATTING_H        0.30269371       0.07285312      0.094193027
## TEAM_BATTING_2B       0.02369219       0.45455082      0.178054204
## TEAM_BATTING_3B       0.19487941      -0.56783668     -0.002224148
## TEAM_BATTING_HR      -0.25014548       0.96937140      0.136927564
## TEAM_BATTING_BB      -0.44977762       0.45955207      0.489361263
## TEAM_BATTING_SO      -0.36207321       0.67285963      0.054210244
## TEAM_BASERUN_SB       0.17644317      -0.45351728      0.025691173
## TEAM_BASERUN_CS       0.17394297      -0.58464126      0.009879709
## TEAM_BATTING_HBP      0.15956202      -0.18940532     -0.114216344
## TEAM_PITCHING_H       1.00000000      -0.14161276      0.320676162
## TEAM_PITCHING_HR     -0.14161276       1.00000000      0.221937505
## TEAM_PITCHING_BB      0.32067616       0.22193750      1.000000000
## TEAM_PITCHING_SO      0.26954767       0.22183950      0.489688867
## TEAM_FIELDING_E       0.66775901      -0.49314447     -0.022837561
## TEAM_FIELDING_DP     -0.05508368       0.48492027      0.156541505
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS          -0.076330764     -0.17648476      -0.04872795
## TEAM_BATTING_H       -0.231737691      0.26490248       0.01465629
## TEAM_BATTING_2B       0.083888966     -0.23515099       0.30360837
## TEAM_BATTING_3B      -0.266685175      0.50977845      -0.43535356
## TEAM_BATTING_HR       0.201349979     -0.58733910       0.51070751
## TEAM_BATTING_BB      -0.009344302     -0.65597081       0.39100224
## TEAM_BATTING_SO       0.424187297     -0.58194385       0.30778415
## TEAM_BASERUN_SB       0.025666091      0.60025561      -0.61149875
## TEAM_BASERUN_CS      -0.090100513      0.58217153      -0.62791773
## TEAM_BATTING_HBP     -0.035393311      0.26247599      -0.21305085
## TEAM_PITCHING_H       0.269547667      0.66775901      -0.05508368
## TEAM_PITCHING_HR      0.221839501     -0.49314447       0.48492027
## TEAM_PITCHING_BB      0.489688867     -0.02283756       0.15654151
## TEAM_PITCHING_SO      1.000000000     -0.02841702       0.10633958
## TEAM_FIELDING_E      -0.028417020      1.00000000      -0.49430493
## TEAM_FIELDING_DP      0.106339583     -0.49430493       1.00000000
# corrplot(M, type = 'lower', order = 'hclust', tl.col = 'black',
#          cl.ratio = 0.2, tl.srt = 45, col = COL2('PuOr', 10))
# 
# testRes = cor.mtest(M, conf.level = 0.95)

#correlations including TARGET WINS and multi-collinearity

ggcorrplot(M, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           #colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

#IMPUTE TEST DATASET FOR CONSISTENCY


# moneyball_testdta %>% 
#   ggplot(aes(TEAM_BATTING_HBP)) + 
#   geom_histogram(bins = 50) +
#   geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
#   geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
#   labs(x = element_blank(),
#        y = "Count",
#        title = "Hit By a Pitch TEST EVALUATION DATASET - REVIEW TO IMPUTE",
#        caption = "* Blue line is the mean value and green is the median")

# Set the seed for reproducibility
set.seed(12345)

# Perform Multiple Imputation
imputed_TEST_df <- mice(moneyball_testdta, m=5, method='pmm', print=FALSE)
completedData1 <- complete(imputed_TEST_df,1)
summary(completedData1) %>%
    kable() %>%
    kable_styling()
INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00 Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0 Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 532.5 1st Qu.: 60.5 1st Qu.: 42.00 1st Qu.:55.00 1st Qu.: 1426 1st Qu.: 52.0 1st Qu.: 471.0 1st Qu.: 608.5 1st Qu.: 131.0 1st Qu.:127.0
Median :1249 Median :1455 Median :239.0 Median : 52.00 Median :101.00 Median :509.0 Median : 677.0 Median : 96.0 Median : 57.00 Median :66.00 Median : 1515 Median :104.0 Median : 526.0 Median : 731.0 Median : 163.0 Median :146.0
Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91 Mean : 95.63 Mean :499.0 Mean : 700.4 Mean :134.4 Mean : 69.54 Mean :66.61 Mean : 1813 Mean :102.1 Mean : 552.4 Mean : 790.1 Mean : 249.7 Mean :142.2
3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 904.5 3rd Qu.:159.5 3rd Qu.:115.00 3rd Qu.:74.00 3rd Qu.: 1681 3rd Qu.:142.5 3rd Qu.: 606.5 3rd Qu.: 929.5 3rd Qu.: 252.0 3rd Qu.:162.0
Max. :2525 Max. :2170 Max. :376.0 Max. :155.00 Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0 Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0 Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
# # Density plots 
# ggplot(moneyball_testdta , aes(x=TEAM_BATTING_HBP, fill="Original")) +
#   geom_density(alpha=0.5) +
#   geom_density(data=completedData1, aes(x=TEAM_BATTING_HBP, fill="Imputed"), alpha=0.5) +
#   labs(title="Density Plot of TEAM_BATTING_HBP: Original vs. Imputed")

# #check 0's again
# completedData1 %>% 
#   gather(variable, value) %>%
#   filter(value == 0) %>%
#   group_by(variable) %>%
#   tally() %>%
#   mutate(percent = n / nrow(completedData1) * 100) %>%
#   mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
#   arrange(desc(n)) %>%
#   rename(`Variable With Zeros IMPUTED TEST EVAL DATASET` = variable,
#          `Number of Records` = n,
#          `Share of Total` = percent) %>%
#   kable() %>%
#   kable_styling()


#Analysis of Missing values on the imputed dataset
missing2 <- colSums(completedData1 %>% sapply(is.na))
missing_pct2 <- round(missing2 / nrow(df) * 100, 2)
stack(sort(missing_pct2, decreasing = TRUE))

Method 3: Data Transformation (Log and BoxCox)

As seen on the histogram distributions, the spread of the data shows right and left skew and we can perform some math transformations (might help in normalizing the variability of the data). 

Log and Box Cox transformations are used here for that purpose as seen on the comparative histograms below. Logarithm scale, allows small values that are close together are spread further out. Larger values that are spread out are brought closer together.

We can create a new variable such as TEAM_BATTING_1B to see if it has any influence over the target wins through our model build process.

We have imputed the HBP and the CS variables with the medians in this method. After imputing with medians we have applied log transformation.The HBP variable was missing a large number cases. From the qqplot on the original HBP and then the log transformed HBP_T show that the new variable is not normal on the qqplot. There is a large deviation on left and the right tails. Hence we will consider dropping this transformed variable from the model later on and utilize the other variables to see if the fit of the data to the model is applicable.

3B_t and E_T both on earlier qqplot showed some non-linear on the tails and these seemed have smoothed over better with the log and boxcox transformation so we can keep these in use for the model fit. 

# New variable: TEAM_BATTING_1B
tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv", header = TRUE) %>% dplyr::select(-INDEX)
baseline <- tdata
updbase <- baseline %>% mutate(TEAM_BATTING_1B = baseline$TEAM_BATTING_H - dplyr::select(., TEAM_BATTING_2B:TEAM_BATTING_HR) %>% rowSums(na.rm = FALSE))
head(updbase)
#Check new variable distribution
updbase %>% 
  ggplot(aes(TEAM_BATTING_1B)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BATTING_1B, na.rm = T)), col = "blue", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BATTING_1B, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "New variable: TEAM_BATTING_1B",
       caption = "* Blue line is the mean value and green is the median")

#TRANSFORM ALL OTHER VARIABLES
transformDS <- updbase

#keep HBP and let's not drop it.
med_TF <- round(median(transformDS$TEAM_BATTING_HBP, na.rm = TRUE),0)
transformDS[is.na(transformDS[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_TF

#keep CS and let's not drop it.
med_CS <- round(median(transformDS$TEAM_BASERUN_CS, na.rm = TRUE),0)
transformDS[is.na(transformDS[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS

#Analysis of Missing values on the imputed dataset
missing6 <- colSums(transformDS %>% sapply(is.na))
missing_pct6 <- round(missing6 / nrow(df) * 100, 2)
stack(sort(missing_pct6, decreasing = TRUE))
#Log transform TEAM_BASERUN_CS
transformDS$TEAM_BASERUN_CS_t <-log(transformDS$TEAM_BASERUN_CS)
baserun_cs <- ggplot(transformDS, aes(x=TEAM_BASERUN_CS)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BASERUN_CS")
baserun_cs_tf <- ggplot(transformDS, aes(x=TEAM_BASERUN_CS_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

#Log transform TEAM_BATTING_HBP
transformDS$TEAM_BATTING_HBP_t <-log(transformDS$TEAM_BATTING_HBP)
baserun_HBP <- ggplot(transformDS, aes(x=TEAM_BASERUN_HBP)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_HBP")
baserun_HBP_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_HBP_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

#Log transform TEAM_BASERUN_SB
transformDS$TEAM_BASERUN_SB_t <-log(transformDS$TEAM_BASERUN_SB)
baserun_sb <- ggplot(transformDS, aes(x=TEAM_BASERUN_SB)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BASERUN_SB")
baserun_sb_tf <- ggplot(transformDS, aes(x=TEAM_BASERUN_SB_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

#Log transform TEAM_BATTING_2B
transformDS$TEAM_BATTING_2B_t <-log(transformDS$TEAM_BATTING_2B)
batting_2b <- ggplot(transformDS, aes(x=TEAM_BATTING_2B)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_2B")
batting_2b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_2B_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")


#Log transform TEAM_BATTING_3B
transformDS$TEAM_BATTING_3B_t <-log(transformDS$TEAM_BATTING_3B)
batting_3b <- ggplot(transformDS, aes(x=TEAM_BATTING_3B)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_3B")
batting_3b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_3B_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

#BoxCoxtransform TEAM_BATTING_BB
transformDS$TEAM_BATTING_BB_t <- BoxCox(transformDS$TEAM_BATTING_BB, BoxCox.lambda(transformDS$TEAM_BATTING_BB))
batting_bb <- ggplot(transformDS, aes(x=TEAM_BATTING_BB)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_BB")
batting_bb_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_BB_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "BoxCox Transformed")

#BoxCoxtransform TEAM_BATTING_H
transformDS$TEAM_BATTING_H_t <- BoxCox(transformDS$TEAM_BATTING_H, BoxCox.lambda(transformDS$TEAM_BATTING_H))
batting_h <- ggplot(transformDS, aes(x=TEAM_BATTING_H)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_H")
batting_h_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_H_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "BoxCox Transformed")

#BoxCoxtransform TEAM_BATTING_1B
transformDS$TEAM_BATTING_1B_t <- BoxCox(transformDS$TEAM_BATTING_1B, BoxCox.lambda(transformDS$TEAM_BATTING_1B))
batting_1b <- ggplot(transformDS, aes(x=TEAM_BATTING_1B)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_BATTING_1B")
batting_1b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_1B_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "BoxCox Transformed")

#BoxCoxtransform TEAM_FIELDING_E
transformDS$TEAM_FIELDING_E_t <- BoxCox(transformDS$TEAM_FIELDING_E, BoxCox.lambda(transformDS$TEAM_FIELDING_E))
fielding_e <- ggplot(transformDS, aes(x=TEAM_FIELDING_E)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_FIELDING_E")
fielding_e_tf <- ggplot(transformDS, aes(x=TEAM_FIELDING_E_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "BoxCox Transformed")

#Log transform TEAM_PITCHING_BB
transformDS$TEAM_PITCHING_BB_t <-log(transformDS$TEAM_PITCHING_BB)
pitching_bb <- ggplot(transformDS, aes(x=TEAM_PITCHING_BB)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_PITCHING_BB")
pitching_bb_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_BB_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

#BoxCoxtransform TEAM_PITCHING_H
transformDS$TEAM_PITCHING_H_t <- BoxCox(transformDS$TEAM_PITCHING_H, BoxCox.lambda(transformDS$TEAM_PITCHING_H))
pitching_h <- ggplot(transformDS, aes(x=TEAM_PITCHING_H)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_PITCHING_H")
pitching_h_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_H_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "BoxCox Transformed")

#Log transform TEAM_PITCHING_SO
transformDS$TEAM_PITCHING_SO_t <-log(transformDS$TEAM_PITCHING_SO)
pitching_so <- ggplot(transformDS, aes(x=TEAM_PITCHING_SO)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "TEAM_PITCHING_SO")
pitching_so_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_SO_t)) + 
    geom_histogram(aes(y=..density..), colour="black", fill="red") +
    geom_density(alpha=.8, fill="lightblue") + 
  theme_classic() + labs(title = "Log Transformed")

p1<- plot_grid(baserun_cs, baserun_cs_tf, baserun_sb, baserun_sb_tf,batting_3b, batting_3b_tf, batting_bb, batting_bb_tf,
         batting_h, batting_h_tf, ncol=2)

p2<- plot_grid(batting_1b, batting_1b_tf, fielding_e, fielding_e_tf, pitching_bb, pitching_bb_tf, 
pitching_h, pitching_h_tf, pitching_so, pitching_so_tf,ncol=2)

p1

p2

#For Bimodal we want to understand qqplot.
qqplot_BATTING_HBP_t  <- ggplot(transformDS, aes(sample = TEAM_BATTING_HBP_t)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="TEAM_BATTING_HBP_t")

qqplot_BATTING_HBP_t

#For Bimodal we want to understand qqplot.
qqplot_BATTING_3B_t  <- ggplot(transformDS, aes(sample = TEAM_BATTING_3B_t)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="BATTING_3B_t")
qqplot_BATTING_3B_t 

#for this variable had several outliers and higly skewed distribution
qqplot_TEAM_FIELDING_E_t  <- ggplot(transformDS, aes(sample = TEAM_FIELDING_E_t)) +
    stat_qq() + 
    stat_qq_line() +
    labs(title="TEAM_FIELDING_E_t")

qqplot_TEAM_FIELDING_E_t

#Test data transformations to be consistent
ttest <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-evaluation-data.csv", header = TRUE)%>%dplyr::select(-INDEX)  

basetest <- ttest %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - dplyr::select(., TEAM_BATTING_2B:TEAM_BATTING_HR) %>% rowSums(na.rm = FALSE))

test_tf <- basetest

#keep HBP and let's not drop it.
med_TF1 <- round(median(basetest$TEAM_BATTING_HBP, na.rm = TRUE),0)
basetest[is.na(basetest[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_TF1

#keep CS and let's not drop it.
med_CS1 <- round(median(basetest$TEAM_BASERUN_CS, na.rm = TRUE),0)
basetest[is.na(basetest[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS1

#Analysis of Missing values on the imputed dataset
missing7 <- colSums(basetest %>% sapply(is.na))
missing_pct7 <- round(missing7 / nrow(df) * 100, 2)
stack(sort(missing_pct7, decreasing = TRUE))
#Log transform TEAM_BASERUN_CS
test_tf$TEAM_BASERUN_CS_t <-log(test_tf$TEAM_BASERUN_CS)

#Log transform TEAM_BASERUN_SB
test_tf$TEAM_BASERUN_SB_t <-log(test_tf$TEAM_BASERUN_SB)

#Log transform TEAM_BATTING_3B
test_tf$TEAM_BATTING_3B_t <-log(test_tf$TEAM_BATTING_3B)

#BoxCoxtransform TEAM_BATTING_BB
test_tf$TEAM_BATTING_BB_t <- BoxCox(test_tf$TEAM_BATTING_BB, BoxCox.lambda(test_tf$TEAM_BATTING_BB))

#BoxCoxtransform TEAM_BATTING_H
test_tf$TEAM_BATTING_H_t <- BoxCox(test_tf$TEAM_BATTING_H, BoxCox.lambda(test_tf$TEAM_BATTING_H))

#BoxCoxtransform TEAM_BATTING_1B
test_tf$TEAM_BATTING_1B_t <- BoxCox(test_tf$TEAM_BATTING_1B, BoxCox.lambda(test_tf$TEAM_BATTING_1B))

#BoxCoxtransform TEAM_FIELDING_E
test_tf$TEAM_FIELDING_E_t <- BoxCox(test_tf$TEAM_FIELDING_E, BoxCox.lambda(test_tf$TEAM_FIELDING_E))

#Log transform TEAM_PITCHING_BB
test_tf$TEAM_PITCHING_BB_t <-log(test_tf$TEAM_PITCHING_BB)

#BoxCoxtransform TEAM_PITCHING_H
test_tf$TEAM_PITCHING_H_t <- BoxCox(test_tf$TEAM_PITCHING_H, BoxCox.lambda(test_tf$TEAM_PITCHING_H))

#Log transform TEAM_PITCHING_SO
test_tf$TEAM_PITCHING_SO_t <-log(test_tf$TEAM_PITCHING_SO)

Method 4: Data Transformation (Create new statistic variable similar to MLB site)

Some common statistics variables used in baseball can be derived from existing variables in our dataset.

We have created few variables below to see how we can fit a model to explore these variables and their influence on predicting Target Wins. AVG (Batting Average) - Rate of hits per at bat. Formula - H/AB
OBP (On-Base Percentage) - The rate at which a batter reached base in his plate appearances. Formula -H+BB+HBP/AB+BB+HBP+SF. We don’t have a column for SF(sacrifice flies) so we will calculate without it.
Slugging Percentage (SLG) - Slugging percentage represents the total number of bases a player records per at-bat. Unlike on-base percentage, slugging percentage deals only with hits and does not include walks and hit-by-pitches in its equation.Formula - (1B + 2Bx2 + 3Bx3 + HRx4)/AB

After imputing HBP and CS with the median values, we have then proceeded to calculate the other new variables. We do this for the test evaluation dataset to remain consistent.

# New variable
NEW_features <- function(df1) {
  df1 %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR) %>% 
    mutate(TEAM_BATTING_AB = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_BATTING_SO) %>%
    mutate(TEAM_BATTING_AVG = TEAM_BATTING_H/TEAM_BATTING_AB) %>%
    mutate(TEAM_BATTING_OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/(TEAM_BATTING_AB + TEAM_BATTING_BB + TEAM_BATTING_HBP)) %>%
    mutate(TEAM_BATTING_SLG = (TEAM_BATTING_1B + 2*TEAM_BATTING_2B + 3*TEAM_BATTING_3B + 4*TEAM_BATTING_HR)/TEAM_BATTING_AB)
}

#Impute with median value
NEW_tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv", header = TRUE) %>% dplyr::select(-INDEX)
train_impHBP <- NEW_tdata
med_hbp <- round(median(train_impHBP$TEAM_BATTING_HBP, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_hbp

#keep CS and let's not drop it.
med_CS1 <- round(median(train_impHBP$TEAM_BASERUN_CS, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS1

#keep CS and let's not drop it.
med_SO <- round(median(train_impHBP$TEAM_BATTING_SO, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BATTING_SO"]), "TEAM_BATTING_SO"] <- med_SO

dtrain_ALLNEWfeat <- NEW_features(train_impHBP)

dtrain_ALLNEWfeat
#Analysis of Missing values on the imputed dataset
missing5 <- colSums(dtrain_ALLNEWfeat %>% sapply(is.na))
missing_pct5 <- round(missing5 / nrow(df) * 100, 2)
stack(sort(missing_pct5, decreasing = TRUE))
dtrain_ALLNEWfeat %>%
  gather(variable, value, -c(TARGET_WINS:TEAM_BATTING_1B)) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "lightblue", color="lightblue") + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  facet_wrap(~variable, scales ="free", nrow = 4) +
  labs(x = element_blank(), y = "Wins")

dtrain_ALLNEWfeat %>% 
  ggplot(aes(TEAM_BATTING_AB)) + 
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = mean(TEAM_BATTING_AB, na.rm = T)), col = "blue", lty = 2) +
  geom_vline(aes(xintercept = median(TEAM_BATTING_AB, na.rm = T)), col = "green", lty = 2) +
  labs(x = element_blank(),
       y = "Count",
       title = "TEAM_BATTING_AB TEST EVALUATION DATASET - REVIEW TO IMPUTE",
       caption = "* Blue line is the mean value and green is the median")

BUILD MODEL

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

APPROACH 1: Full Model

The approach 1 is to construct a full model. Here, we have used the mice imputed dataset and did not drop the HBP and CS variables.A few records that had 0 values , we let them remain as is.

In the Coefficients section, TEAM_BATTING_H, TEAM_BATTING_SO,TEAM_BASERUN_SB are considered significant by the model because their p-value is lower than 0.05 , however, TEAM_BASERUN_CS is not considered significant. 

As seem above, SB — The total number of bases stolen by a team’s hitters. Stolen bases give runners opportunities to get in better scoring positions, therefore it has a positive significance here.

# full model using mice imputed data
mfull <- lm(TARGET_WINS ~., data = completedData, na.action = na.omit)

summary(mfull)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = completedData, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.958  -8.314   0.075   8.084  63.658 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.2908245  5.3871360   7.850 6.35e-15 ***
## TEAM_BATTING_H    0.0426291  0.0035300  12.076  < 2e-16 ***
## TEAM_BATTING_2B  -0.0192768  0.0088528  -2.177  0.02955 *  
## TEAM_BATTING_3B   0.0422177  0.0164161   2.572  0.01018 *  
## TEAM_BATTING_HR   0.0516522  0.0262148   1.970  0.04892 *  
## TEAM_BATTING_BB   0.0163852  0.0056562   2.897  0.00381 ** 
## TEAM_BATTING_SO  -0.0180456  0.0024611  -7.332 3.13e-13 ***
## TEAM_BASERUN_SB   0.0543822  0.0048339  11.250  < 2e-16 ***
## TEAM_BASERUN_CS  -0.0090415  0.0103203  -0.876  0.38107    
## TEAM_BATTING_HBP -0.0617980  0.0258507  -2.391  0.01690 *  
## TEAM_PITCHING_H   0.0016089  0.0003825   4.207 2.69e-05 ***
## TEAM_PITCHING_HR  0.0334149  0.0233084   1.434  0.15183    
## TEAM_PITCHING_BB -0.0081385  0.0040059  -2.032  0.04231 *  
## TEAM_PITCHING_SO  0.0025787  0.0008812   2.926  0.00346 ** 
## TEAM_FIELDING_E  -0.0438559  0.0026641 -16.462  < 2e-16 ***
## TEAM_FIELDING_DP -0.1296883  0.0128955 -10.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.49 on 2260 degrees of freedom
## Multiple R-squared:  0.375,  Adjusted R-squared:  0.3709 
## F-statistic: 90.41 on 15 and 2260 DF,  p-value: < 2.2e-16
plot(mfull)

# Get the model residuals
model_residuals = mfull$residuals

# Plot the result
hist(model_residuals)

# Plot the residuals
qqnorm(model_residuals)
# Plot the Q-Q line
qqline(model_residuals)

APPROACH 1A: Model with retaining only highly significant features from the full model

WE have manually used only significant features of the first model, and here we see that the R-squared has slightly decreased to 0.3357 from 0.3709 which implies that the full model is better performing.

From the ANOVA result, we observe that the p-value (2.2e-16 ) is very small (less than 0.05), so we reject the null hypothesis, meaning that the second model is not an improvement of the first one. 

#full model using mice imputed data
mfull2 <- lm(TARGET_WINS ~TEAM_BATTING_H + TEAM_BATTING_SO +TEAM_BASERUN_SB +TEAM_PITCHING_H+TEAM_FIELDING_E+ TEAM_FIELDING_DP, data = completedData, na.action = na.omit)

summary(mfull2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = completedData, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.233  -8.787  -0.069   8.524  53.753 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.967534   3.898330   5.892 4.39e-09 ***
## TEAM_BATTING_H    0.052224   0.002132  24.494  < 2e-16 ***
## TEAM_BATTING_SO  -0.002267   0.001452  -1.561    0.119    
## TEAM_BASERUN_SB   0.046403   0.003760  12.342  < 2e-16 ***
## TEAM_PITCHING_H   0.001405   0.000296   4.746 2.21e-06 ***
## TEAM_FIELDING_E  -0.047441   0.002471 -19.200  < 2e-16 ***
## TEAM_FIELDING_DP -0.101815   0.012671  -8.036 1.48e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.84 on 2269 degrees of freedom
## Multiple R-squared:  0.3374, Adjusted R-squared:  0.3357 
## F-statistic: 192.6 on 6 and 2269 DF,  p-value: < 2.2e-16
plot(mfull2)

anova(mfull, mfull2)

APPROACH 2: Model with automated variable selection process using STEPAIC

We have built 3 MODELS using automated variable selection process and StepAIC below. Additionally calculate the RMSE value and extract coefficients from the top 2 models to evaluate during model slection process.

set.seed(1024)  #Set Seed
split <- createDataPartition(df_imputed$TARGET_WINS, p = 0.8, list = FALSE)
train_data <- df_imputed[split, ]
test_data <- df_imputed[-split, ]

#Build Model with all params
full_model <- lm(TARGET_WINS ~ ., data = train_data)

#Automatic Tuning

#Tune Forward 
forward_model <- stepAIC(lm(TARGET_WINS ~ 1, data = train_data), 
                         direction = "forward", 
                         scope = formula(full_model), 
                         trace = FALSE)

#Tune Backward
backward_model <- stepAIC(full_model, direction = "backward", trace = FALSE)

#Tune Stepwise
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

#Calculate RMSE function
evaluate_rmse <- function(model, test_data) {
  predictions <- predict(model, newdata = test_data)
  rmse <- sqrt(mean((test_data$TARGET_WINS - predictions)^2))
  return(rmse)
}

#Calculate RMSE
rmse_forward <- evaluate_rmse(forward_model, test_data)
rmse_backward <- evaluate_rmse(backward_model, test_data)
rmse_stepwise <- evaluate_rmse(stepwise_model, test_data)

#Visualize RMSE for each model
rmse_table <- data.frame(
  Model = c("Forward Selection", "Backward Elimination", "Stepwise Selection"),
  RMSE = c(rmse_forward, rmse_backward, rmse_stepwise)
)

#Select the top two Models
top_two_models <- rmse_table %>% arrange(RMSE) %>% head(2)
print("Top 2 models by RMSE:")
## [1] "Top 2 models by RMSE:"
print(top_two_models)
##                  Model     RMSE
## 1    Forward Selection 13.58656
## 2 Backward Elimination 13.59236
#Extract coefficients for the top two models
get_model_coefficients <- function(model) {
  coef_df <- as.data.frame(coef(model))
  coef_df$Variable <- rownames(coef_df)
  colnames(coef_df) <- c("Coefficient", "Variable")
  return(coef_df)
}

#Find the top two models by RMSE
model_names <- top_two_models$Model
if ("Forward Selection" %in% model_names) {
  forward_coefficients <- get_model_coefficients(forward_model)
  print("Forward Selection Coefficients:")
  print(forward_coefficients)
}
## [1] "Forward Selection Coefficients:"
##                   Coefficient         Variable
## (Intercept)      36.190002364      (Intercept)
## TEAM_BATTING_H    0.043818530   TEAM_BATTING_H
## TEAM_FIELDING_E  -0.037227545  TEAM_FIELDING_E
## TEAM_BASERUN_SB   0.036793895  TEAM_BASERUN_SB
## TEAM_FIELDING_DP -0.121274716 TEAM_FIELDING_DP
## TEAM_PITCHING_HR  0.014478155 TEAM_PITCHING_HR
## TEAM_BATTING_SO  -0.015518416  TEAM_BATTING_SO
## TEAM_PITCHING_SO  0.004879867 TEAM_PITCHING_SO
## TEAM_BATTING_3B   0.059566279  TEAM_BATTING_3B
## TEAM_BATTING_HR   0.060720213  TEAM_BATTING_HR
## TEAM_BATTING_2B  -0.019572820  TEAM_BATTING_2B
## TEAM_BATTING_BB   0.005894305  TEAM_BATTING_BB
if ("Backward Elimination" %in% model_names) {
  backward_coefficients <- get_model_coefficients(backward_model)
  print("Backward Elimination Coefficients:")
  print(backward_coefficients)
}
## [1] "Backward Elimination Coefficients:"
##                   Coefficient         Variable
## (Intercept)      35.766287007      (Intercept)
## TEAM_BATTING_H    0.044118333   TEAM_BATTING_H
## TEAM_BATTING_2B  -0.019668984  TEAM_BATTING_2B
## TEAM_BATTING_3B   0.059934066  TEAM_BATTING_3B
## TEAM_BATTING_HR   0.076041736  TEAM_BATTING_HR
## TEAM_BATTING_BB   0.005845841  TEAM_BATTING_BB
## TEAM_BATTING_SO  -0.015742469  TEAM_BATTING_SO
## TEAM_BASERUN_SB   0.036918317  TEAM_BASERUN_SB
## TEAM_PITCHING_SO  0.005049941 TEAM_PITCHING_SO
## TEAM_FIELDING_E  -0.037040233  TEAM_FIELDING_E
## TEAM_FIELDING_DP -0.121410201 TEAM_FIELDING_DP
if ("Stepwise Selection" %in% model_names) {
  stepwise_coefficients <- get_model_coefficients(stepwise_model)
  print("Stepwise Selection Coefficients:")
  print(stepwise_coefficients)
}

APPROACH 3: Polynomial Regression

Similar to linear regression, model evaluation metrics such as mean squared error (MSE), R-squared, or adjusted R-squared can be used to assess the performance of the polynomial regression model.The degree selection is important as it can lead to underfitting when it is smaller or overfitting if its higher. The ADJ R^2 of 0.0.4627 is the best fit so far based on our squares. Also the RMSE calculated is 30.20802.

set.seed(1024)
Equation <- "TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_BASERUN_CS^2) + I(TEAM_PITCHING_H^2) + I(TEAM_PITCHING_HR^2) + I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_SO^2) + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2)  + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_HR^3) + I(TEAM_BATTING_BB^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BASERUN_SB^3) + I(TEAM_BASERUN_CS^3) + I(TEAM_PITCHING_H^3) + I(TEAM_PITCHING_HR^3) + I(TEAM_PITCHING_BB^3) + I(TEAM_PITCHING_SO^3) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_DP^3)  + I(TEAM_BATTING_2B^4) + I(TEAM_BATTING_3B^4) + I(TEAM_BATTING_HR^4) + I(TEAM_BATTING_BB^4) + I(TEAM_BATTING_SO^4) + I(TEAM_BASERUN_SB^4) + I(TEAM_BASERUN_CS^4) + I(TEAM_PITCHING_H^4) + I(TEAM_PITCHING_HR^4) + I(TEAM_PITCHING_BB^4) + I(TEAM_PITCHING_SO^4) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_DP^4) "

mpoly <- lm(Equation, completedData)
step_back <- MASS::stepAIC(mpoly, direction="backward", trace = F)
poly_call <- summary(step_back)$call
step_back <- lm(poly_call[2], completedData)
summary(step_back)
## 
## Call:
## lm(formula = poly_call[2], data = completedData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.734  -7.150  -0.064   7.271  59.801 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.403e+02  4.562e+01  -3.076  0.00212 ** 
## TEAM_BATTING_2B        1.283e+00  1.765e-01   7.266 5.10e-13 ***
## TEAM_BATTING_BB        6.948e-01  9.351e-02   7.430 1.54e-13 ***
## TEAM_BATTING_SO       -2.716e-01  5.003e-02  -5.430 6.25e-08 ***
## TEAM_PITCHING_H        3.170e-02  3.665e-03   8.648  < 2e-16 ***
## TEAM_PITCHING_SO       2.874e-02  1.057e-02   2.720  0.00658 ** 
## TEAM_FIELDING_E       -2.173e-01  2.206e-02  -9.851  < 2e-16 ***
## TEAM_FIELDING_DP       2.573e+00  1.346e+00   1.912  0.05606 .  
## I(TEAM_BATTING_2B^2)  -5.165e-03  7.039e-04  -7.338 3.03e-13 ***
## I(TEAM_BATTING_3B^2)   2.570e-03  3.043e-04   8.445  < 2e-16 ***
## I(TEAM_BATTING_HR^2)  -1.846e-03  9.764e-04  -1.890  0.05888 .  
## I(TEAM_BATTING_BB^2)  -2.100e-03  3.069e-04  -6.843 9.95e-12 ***
## I(TEAM_BATTING_SO^2)   5.225e-04  9.370e-05   5.576 2.76e-08 ***
## I(TEAM_BASERUN_SB^2)   5.771e-04  7.215e-05   7.999 2.00e-15 ***
## I(TEAM_BASERUN_CS^2)  -1.326e-03  6.306e-04  -2.103  0.03555 *  
## I(TEAM_PITCHING_H^2)  -2.818e-06  5.122e-07  -5.501 4.20e-08 ***
## I(TEAM_PITCHING_HR^2)  2.162e-03  7.769e-04   2.783  0.00543 ** 
## I(TEAM_PITCHING_BB^2) -1.514e-04  1.958e-05  -7.729 1.62e-14 ***
## I(TEAM_PITCHING_SO^2) -1.401e-05  3.096e-06  -4.527 6.30e-06 ***
## I(TEAM_FIELDING_E^2)   3.106e-04  5.322e-05   5.837 6.11e-09 ***
## I(TEAM_FIELDING_DP^2) -3.170e-02  1.508e-02  -2.102  0.03568 *  
## I(TEAM_BATTING_2B^3)   6.714e-06  9.179e-07   7.315 3.58e-13 ***
## I(TEAM_BATTING_3B^3)  -1.241e-05  1.856e-06  -6.689 2.82e-11 ***
## I(TEAM_BATTING_HR^3)   1.409e-05  7.028e-06   2.005  0.04504 *  
## I(TEAM_BATTING_BB^3)   2.988e-06  4.642e-07   6.437 1.49e-10 ***
## I(TEAM_BATTING_SO^3)  -4.604e-07  8.869e-08  -5.191 2.29e-07 ***
## I(TEAM_BASERUN_SB^3)  -1.561e-06  2.576e-07  -6.058 1.61e-09 ***
## I(TEAM_BASERUN_CS^3)   1.519e-05  7.003e-06   2.169  0.03022 *  
## I(TEAM_PITCHING_H^3)   9.899e-11  3.031e-11   3.266  0.00111 ** 
## I(TEAM_PITCHING_HR^3) -1.225e-05  4.954e-06  -2.473  0.01346 *  
## I(TEAM_PITCHING_BB^3)  1.050e-07  1.556e-08   6.751 1.87e-11 ***
## I(TEAM_PITCHING_SO^3)  2.064e-09  3.926e-10   5.258 1.60e-07 ***
## I(TEAM_FIELDING_E^3)  -2.382e-07  4.827e-08  -4.935 8.63e-07 ***
## I(TEAM_FIELDING_DP^3)  1.530e-04  7.285e-05   2.100  0.03584 *  
## I(TEAM_BATTING_HR^4)  -2.890e-08  1.451e-08  -1.992  0.04645 *  
## I(TEAM_BATTING_BB^4)  -1.462e-09  2.511e-10  -5.823 6.61e-09 ***
## I(TEAM_BATTING_SO^4)   1.412e-10  3.082e-11   4.581 4.87e-06 ***
## I(TEAM_BASERUN_SB^4)   1.218e-09  2.443e-10   4.986 6.64e-07 ***
## I(TEAM_BASERUN_CS^4)  -4.456e-08  2.081e-08  -2.141  0.03235 *  
## I(TEAM_PITCHING_H^4)  -1.112e-15  5.686e-16  -1.956  0.05061 .  
## I(TEAM_PITCHING_HR^4)  1.969e-08  8.742e-09   2.253  0.02438 *  
## I(TEAM_PITCHING_BB^4) -1.996e-11  3.377e-12  -5.910 3.95e-09 ***
## I(TEAM_PITCHING_SO^4) -7.272e-14  1.365e-14  -5.328 1.09e-07 ***
## I(TEAM_FIELDING_E^4)   6.091e-11  1.450e-11   4.202 2.75e-05 ***
## I(TEAM_FIELDING_DP^4) -2.600e-07  1.285e-07  -2.023  0.04315 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.55 on 2231 degrees of freedom
## Multiple R-squared:  0.4731, Adjusted R-squared:  0.4627 
## F-statistic: 45.53 on 44 and 2231 DF,  p-value: < 2.2e-16
coef(step_back)
##           (Intercept)       TEAM_BATTING_2B       TEAM_BATTING_BB 
##         -1.403321e+02          1.282765e+00          6.947867e-01 
##       TEAM_BATTING_SO       TEAM_PITCHING_H      TEAM_PITCHING_SO 
##         -2.716374e-01          3.169657e-02          2.874148e-02 
##       TEAM_FIELDING_E      TEAM_FIELDING_DP  I(TEAM_BATTING_2B^2) 
##         -2.173048e-01          2.572819e+00         -5.164862e-03 
##  I(TEAM_BATTING_3B^2)  I(TEAM_BATTING_HR^2)  I(TEAM_BATTING_BB^2) 
##          2.569934e-03         -1.845551e-03         -2.100048e-03 
##  I(TEAM_BATTING_SO^2)  I(TEAM_BASERUN_SB^2)  I(TEAM_BASERUN_CS^2) 
##          5.224734e-04          5.770797e-04         -1.326280e-03 
##  I(TEAM_PITCHING_H^2) I(TEAM_PITCHING_HR^2) I(TEAM_PITCHING_BB^2) 
##         -2.817604e-06          2.162038e-03         -1.513564e-04 
## I(TEAM_PITCHING_SO^2)  I(TEAM_FIELDING_E^2) I(TEAM_FIELDING_DP^2) 
##         -1.401396e-05          3.106118e-04         -3.170085e-02 
##  I(TEAM_BATTING_2B^3)  I(TEAM_BATTING_3B^3)  I(TEAM_BATTING_HR^3) 
##          6.714131e-06         -1.241406e-05          1.409350e-05 
##  I(TEAM_BATTING_BB^3)  I(TEAM_BATTING_SO^3)  I(TEAM_BASERUN_SB^3) 
##          2.988198e-06         -4.603664e-07         -1.560651e-06 
##  I(TEAM_BASERUN_CS^3)  I(TEAM_PITCHING_H^3) I(TEAM_PITCHING_HR^3) 
##          1.518618e-05          9.899396e-11         -1.225342e-05 
## I(TEAM_PITCHING_BB^3) I(TEAM_PITCHING_SO^3)  I(TEAM_FIELDING_E^3) 
##          1.050250e-07          2.064289e-09         -2.381952e-07 
## I(TEAM_FIELDING_DP^3)  I(TEAM_BATTING_HR^4)  I(TEAM_BATTING_BB^4) 
##          1.529962e-04         -2.890410e-08         -1.462366e-09 
##  I(TEAM_BATTING_SO^4)  I(TEAM_BASERUN_SB^4)  I(TEAM_BASERUN_CS^4) 
##          1.412109e-10          1.218106e-09         -4.455961e-08 
##  I(TEAM_PITCHING_H^4) I(TEAM_PITCHING_HR^4) I(TEAM_PITCHING_BB^4) 
##         -1.112181e-15          1.969337e-08         -1.995516e-11 
## I(TEAM_PITCHING_SO^4)  I(TEAM_FIELDING_E^4) I(TEAM_FIELDING_DP^4) 
##         -7.272284e-14          6.090874e-11         -2.599881e-07
summary(step_back)
## 
## Call:
## lm(formula = poly_call[2], data = completedData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.734  -7.150  -0.064   7.271  59.801 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.403e+02  4.562e+01  -3.076  0.00212 ** 
## TEAM_BATTING_2B        1.283e+00  1.765e-01   7.266 5.10e-13 ***
## TEAM_BATTING_BB        6.948e-01  9.351e-02   7.430 1.54e-13 ***
## TEAM_BATTING_SO       -2.716e-01  5.003e-02  -5.430 6.25e-08 ***
## TEAM_PITCHING_H        3.170e-02  3.665e-03   8.648  < 2e-16 ***
## TEAM_PITCHING_SO       2.874e-02  1.057e-02   2.720  0.00658 ** 
## TEAM_FIELDING_E       -2.173e-01  2.206e-02  -9.851  < 2e-16 ***
## TEAM_FIELDING_DP       2.573e+00  1.346e+00   1.912  0.05606 .  
## I(TEAM_BATTING_2B^2)  -5.165e-03  7.039e-04  -7.338 3.03e-13 ***
## I(TEAM_BATTING_3B^2)   2.570e-03  3.043e-04   8.445  < 2e-16 ***
## I(TEAM_BATTING_HR^2)  -1.846e-03  9.764e-04  -1.890  0.05888 .  
## I(TEAM_BATTING_BB^2)  -2.100e-03  3.069e-04  -6.843 9.95e-12 ***
## I(TEAM_BATTING_SO^2)   5.225e-04  9.370e-05   5.576 2.76e-08 ***
## I(TEAM_BASERUN_SB^2)   5.771e-04  7.215e-05   7.999 2.00e-15 ***
## I(TEAM_BASERUN_CS^2)  -1.326e-03  6.306e-04  -2.103  0.03555 *  
## I(TEAM_PITCHING_H^2)  -2.818e-06  5.122e-07  -5.501 4.20e-08 ***
## I(TEAM_PITCHING_HR^2)  2.162e-03  7.769e-04   2.783  0.00543 ** 
## I(TEAM_PITCHING_BB^2) -1.514e-04  1.958e-05  -7.729 1.62e-14 ***
## I(TEAM_PITCHING_SO^2) -1.401e-05  3.096e-06  -4.527 6.30e-06 ***
## I(TEAM_FIELDING_E^2)   3.106e-04  5.322e-05   5.837 6.11e-09 ***
## I(TEAM_FIELDING_DP^2) -3.170e-02  1.508e-02  -2.102  0.03568 *  
## I(TEAM_BATTING_2B^3)   6.714e-06  9.179e-07   7.315 3.58e-13 ***
## I(TEAM_BATTING_3B^3)  -1.241e-05  1.856e-06  -6.689 2.82e-11 ***
## I(TEAM_BATTING_HR^3)   1.409e-05  7.028e-06   2.005  0.04504 *  
## I(TEAM_BATTING_BB^3)   2.988e-06  4.642e-07   6.437 1.49e-10 ***
## I(TEAM_BATTING_SO^3)  -4.604e-07  8.869e-08  -5.191 2.29e-07 ***
## I(TEAM_BASERUN_SB^3)  -1.561e-06  2.576e-07  -6.058 1.61e-09 ***
## I(TEAM_BASERUN_CS^3)   1.519e-05  7.003e-06   2.169  0.03022 *  
## I(TEAM_PITCHING_H^3)   9.899e-11  3.031e-11   3.266  0.00111 ** 
## I(TEAM_PITCHING_HR^3) -1.225e-05  4.954e-06  -2.473  0.01346 *  
## I(TEAM_PITCHING_BB^3)  1.050e-07  1.556e-08   6.751 1.87e-11 ***
## I(TEAM_PITCHING_SO^3)  2.064e-09  3.926e-10   5.258 1.60e-07 ***
## I(TEAM_FIELDING_E^3)  -2.382e-07  4.827e-08  -4.935 8.63e-07 ***
## I(TEAM_FIELDING_DP^3)  1.530e-04  7.285e-05   2.100  0.03584 *  
## I(TEAM_BATTING_HR^4)  -2.890e-08  1.451e-08  -1.992  0.04645 *  
## I(TEAM_BATTING_BB^4)  -1.462e-09  2.511e-10  -5.823 6.61e-09 ***
## I(TEAM_BATTING_SO^4)   1.412e-10  3.082e-11   4.581 4.87e-06 ***
## I(TEAM_BASERUN_SB^4)   1.218e-09  2.443e-10   4.986 6.64e-07 ***
## I(TEAM_BASERUN_CS^4)  -4.456e-08  2.081e-08  -2.141  0.03235 *  
## I(TEAM_PITCHING_H^4)  -1.112e-15  5.686e-16  -1.956  0.05061 .  
## I(TEAM_PITCHING_HR^4)  1.969e-08  8.742e-09   2.253  0.02438 *  
## I(TEAM_PITCHING_BB^4) -1.996e-11  3.377e-12  -5.910 3.95e-09 ***
## I(TEAM_PITCHING_SO^4) -7.272e-14  1.365e-14  -5.328 1.09e-07 ***
## I(TEAM_FIELDING_E^4)   6.091e-11  1.450e-11   4.202 2.75e-05 ***
## I(TEAM_FIELDING_DP^4) -2.600e-07  1.285e-07  -2.023  0.04315 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.55 on 2231 degrees of freedom
## Multiple R-squared:  0.4731, Adjusted R-squared:  0.4627 
## F-statistic: 45.53 on 44 and 2231 DF,  p-value: < 2.2e-16
#calculate RMSE value for polynomial model
rmse_Polymod3 <- evaluate_rmse(step_back, test_data)
rmse_Polymod3
## [1] 12.91918

APPROACH 4: Model with transformed data

Here we have built a model manually by selecting most of the transformed variables using log and boxcox on the explanatory variables , and did not include the HBP_t values since the qqplot showed a large deviation from the left and right tails. The R^2 value is moderate in this case and not as great as the polynomial fit above.Polynomial degree of 3 can accomadate more curves to fit the data, therefore it is the best fit so far compared to the model with the transformed data.

#data with all transformed variables, we drop HBP_t since on the qqplot it had a significant deviation on the left and right tails.

transformDS
mT_data <- dplyr::select(transformDS, -TEAM_BASERUN_SB, -TEAM_BATTING_3B, -TEAM_BATTING_BB, -TEAM_BATTING_H, -TEAM_FIELDING_E, -TEAM_PITCHING_BB, -TEAM_PITCHING_H, -TEAM_PITCHING_SO, -TEAM_BATTING_1B,-TEAM_BATTING_HBP, -TEAM_BASERUN_CS,-TEAM_BATTING_HBP_t)

mT4_data <- lm(TARGET_WINS ~., data = mT_data)
summary(mT4_data)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mT_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.334  -7.005  -0.010   7.332  32.259 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -8.442e+04  2.712e+04  -3.113  0.00188 ** 
## TEAM_BATTING_2B    -2.665e-02  5.229e-02  -0.510  0.61029    
## TEAM_BATTING_HR     9.400e-02  6.379e-02   1.473  0.14080    
## TEAM_BATTING_SO    -4.349e-02  8.871e-03  -4.903 1.03e-06 ***
## TEAM_PITCHING_HR   -2.433e-02  5.312e-02  -0.458  0.64697    
## TEAM_FIELDING_DP   -1.078e-01  1.252e-02  -8.610  < 2e-16 ***
## TEAM_BASERUN_CS_t  -1.401e+00  9.080e-01  -1.543  0.12307    
## TEAM_BASERUN_SB_t   5.612e+00  6.470e-01   8.675  < 2e-16 ***
## TEAM_BATTING_2B_t  -1.158e+01  1.392e+01  -0.832  0.40572    
## TEAM_BATTING_3B_t   6.681e+00  1.432e+00   4.666 3.29e-06 ***
## TEAM_BATTING_BB_t   7.161e-05  2.274e-05   3.149  0.00167 ** 
## TEAM_BATTING_H_t    1.376e+05  6.392e+04   2.153  0.03143 *  
## TEAM_BATTING_1B_t  -2.480e+04  2.937e+04  -0.844  0.39868    
## TEAM_FIELDING_E_t  -1.206e+03  8.853e+01 -13.625  < 2e-16 ***
## TEAM_PITCHING_BB_t -1.434e+00  6.607e+00  -0.217  0.82824    
## TEAM_PITCHING_H_t  -2.691e+04  1.848e+04  -1.457  0.14539    
## TEAM_PITCHING_SO_t  1.486e+01  6.560e+00   2.266  0.02358 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.44 on 1818 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3707 
## F-statistic: 68.53 on 16 and 1818 DF,  p-value: < 2.2e-16
plot(mT4_data)

APPROACH 4A: Model with transformed data and backwards elimination

Here we have simply manually removed variables that are not as highly significant from the model with transformed variables above with log and boxcox.

#data with all transformed variables,KEEP HBP and CS of imputed variable

# mT5_data <- lm(TARGET_WINS ~ TEAM_BATTING_SO + TEAM_BATTING_HR+TEAM_BATTING_1B_t+TEAM_BASERUN_CS_t+TEAM_PITCHING_HR +TEAM_PITCHING_H_t+TEAM_PITCHING_SO_t+ TEAM_BATTING_HBP_t+
#      TEAM_FIELDING_DP + TEAM_BASERUN_SB_t + 
#      TEAM_BATTING_3B_t + TEAM_BATTING_BB_t + TEAM_BATTING_2B_t+
#     TEAM_FIELDING_E_t , data = transformDS)


mT5_data <- lm(TARGET_WINS ~ TEAM_BATTING_SO +
     TEAM_FIELDING_DP + TEAM_BASERUN_SB_t + 
     TEAM_BATTING_3B_t+ TEAM_BATTING_BB_t +TEAM_FIELDING_E_t   , data = transformDS)

summary(mT5_data)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO + TEAM_FIELDING_DP + 
##     TEAM_BASERUN_SB_t + TEAM_BATTING_3B_t + TEAM_BATTING_BB_t + 
##     TEAM_FIELDING_E_t, data = transformDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.442  -7.637   0.311   7.562  35.741 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.900e+03  1.033e+02  18.391  < 2e-16 ***
## TEAM_BATTING_SO   -1.788e-02  1.937e-03  -9.229  < 2e-16 ***
## TEAM_FIELDING_DP  -8.408e-02  1.307e-02  -6.435 1.58e-10 ***
## TEAM_BASERUN_SB_t  3.670e+00  5.370e-01   6.835 1.11e-11 ***
## TEAM_BATTING_3B_t  7.955e+00  9.169e-01   8.677  < 2e-16 ***
## TEAM_BATTING_BB_t  8.754e-05  5.985e-06  14.626  < 2e-16 ***
## TEAM_FIELDING_E_t -1.572e+03  8.749e+01 -17.967  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.24 on 1828 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.2723, Adjusted R-squared:  0.2699 
## F-statistic:   114 on 6 and 1828 DF,  p-value: < 2.2e-16
plot(mT5_data)

APPROACH 5: Model with new variables & Cook’s distance

Our approach 5 is to build the model with standard baseball derived variables like AB, AVG, OBP and SLG. These all potentially should have a positive impact on Target Wins. We measured the cooks distance and influential points and then on the 2nd fit of the models removed those outliers and reran the model again to see if the fit , R-squared was higher. Indeed the original adj- R squared was 0.1922 and after removing about the outlier records, the adjusted R-square was 0.2422. 

#data with all transformed variables,KEEP HBP and CS of imputed variable


m7 <- lm(TARGET_WINS ~., data = dtrain_ALLNEWfeat, na.action = na.omit)
summary(m7)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = dtrain_ALLNEWfeat, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.768  -6.987   0.074   6.763  30.227 
## 
## Coefficients: (2 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.519e+02  6.779e+01   2.240  0.02519 *  
## TEAM_BATTING_H   -2.017e-02  2.070e-02  -0.975  0.32994    
## TEAM_BATTING_2B  -6.180e-02  2.666e-02  -2.318  0.02055 *  
## TEAM_BATTING_3B   1.571e-01  5.618e-02   2.796  0.00523 ** 
## TEAM_BATTING_HR   9.136e-02  9.888e-02   0.924  0.35562    
## TEAM_BATTING_BB   1.362e-01  5.750e-02   2.369  0.01794 *  
## TEAM_BATTING_SO   2.102e-03  2.527e-02   0.083  0.93373    
## TEAM_BASERUN_SB   7.743e-02  6.268e-03  12.353  < 2e-16 ***
## TEAM_BASERUN_CS  -3.754e-02  1.397e-02  -2.688  0.00726 ** 
## TEAM_BATTING_HBP  1.157e-01  6.800e-02   1.701  0.08912 .  
## TEAM_PITCHING_H   6.374e-02  1.594e-02   3.998 6.63e-05 ***
## TEAM_PITCHING_HR -3.422e-02  8.003e-02  -0.428  0.66905    
## TEAM_PITCHING_BB -9.230e-02  4.156e-02  -2.221  0.02650 *  
## TEAM_PITCHING_SO -4.750e-02  2.144e-02  -2.215  0.02688 *  
## TEAM_FIELDING_E  -1.193e-01  7.305e-03 -16.334  < 2e-16 ***
## TEAM_FIELDING_DP -1.125e-01  1.234e-02  -9.117  < 2e-16 ***
## TEAM_BATTING_1B          NA         NA      NA       NA    
## TEAM_BATTING_AB          NA         NA      NA       NA    
## TEAM_BATTING_AVG  6.805e+01  3.163e+02   0.215  0.82968    
## TEAM_BATTING_OBP -2.714e+02  3.605e+02  -0.753  0.45173    
## TEAM_BATTING_SLG  3.932e+01  7.237e+01   0.543  0.58696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.14 on 1816 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4114, Adjusted R-squared:  0.4056 
## F-statistic: 70.53 on 18 and 1816 DF,  p-value: < 2.2e-16
m6 <- lm(TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat )
summary(m6)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + 
##     TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.849  -9.005   0.550   9.504  50.328 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -8.747e+01  8.519e+00 -10.268  < 2e-16 ***
## TEAM_BATTING_AB   2.793e-02  1.963e-03  14.230  < 2e-16 ***
## TEAM_BATTING_AVG -1.343e+02  2.906e+01  -4.621 4.04e-06 ***
## TEAM_BATTING_OBP  2.399e+02  3.166e+01   7.578 5.07e-14 ***
## TEAM_BATTING_SLG  2.018e+01  8.367e+00   2.412   0.0159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.16 on 2271 degrees of freedom
## Multiple R-squared:  0.1937, Adjusted R-squared:  0.1922 
## F-statistic: 136.4 on 4 and 2271 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(m6)

m6cd<- cooks.distance(m6)
cooks.distance(m6)[which.max(cooks.distance(m6))]
##       1211 
## 0.08032934
#plot cooks distance
plot(m6,which=4)

plot(cooks.distance(m6),type="b",pch=18,col="red")

N = 2271
k = 4
cutoff = 4/ (N-k-1)
abline(h=cutoff,lty=2)

subsetds <- m6cd < 0.001765225

#Find influential & outlier points
sum(m6cd > 0.001765225)
## [1] 127
#removing the outliers about 131 influential points does seemed to have helped.
m6upd <- lm(TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat, subset = m6cd < 0.001765225)
summary(m6upd)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + 
##     TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat, 
##     subset = m6cd < 0.001765225)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.175  -8.437   0.454   8.530  41.815 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -9.806e+01  7.889e+00 -12.430  < 2e-16 ***
## TEAM_BATTING_AB   2.400e-02  1.811e-03  13.252  < 2e-16 ***
## TEAM_BATTING_AVG -2.011e+02  2.698e+01  -7.455 1.30e-13 ***
## TEAM_BATTING_OBP  3.172e+02  2.889e+01  10.982  < 2e-16 ***
## TEAM_BATTING_SLG  3.304e+01  7.547e+00   4.378 1.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.92 on 2144 degrees of freedom
## Multiple R-squared:  0.2436, Adjusted R-squared:  0.2422 
## F-statistic: 172.6 on 4 and 2144 DF,  p-value: < 2.2e-16
plot(cooks.distance(m6upd),type="b",pch=18,col="red")

N = 2271
k = 4
cutoff = 4/ (N-k-1)
abline(h=cutoff,lty=2)

SELECT MODELS

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. 

Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set

BEST FIT: SUMMARY ANALYSIS: Model using APPROACH 1 (From BUILD MODELS section). Predictions for the TEST evaluation data set has been written out to Excel files for both BEST FIT models (We have selected 2 models as best fit)

Delving into the coefficients provides insights into how different factors impact team success. Positive coefficients, such as those for TEAM_BATTING_H (0.0438), TEAM_BATTING_HR (0.0607), and TEAM_BATTING_3B (0.0596), indicate that increasing base hits, home runs, and triples, respectively, are associated with a boost in predicted wins. This suggests that improving a team’s hitting performance, particularly with power and extra-base hits, strongly contributes to the overall success of the team. Additionally, TEAM_BASERUN_SB (0.0368) suggests that increasing stolen bases positively affects wins, further highlighting the offensive advantage of aggressive baserunning. On the other hand, the negative coefficients pinpoint areas that detract form team performance. A great example is TEAM_FIELDING_E (-0.0372) indicate that each additional fielding error decreases the predicted number of wins, underscoring the need to minimize defensive mistakes. Interestingly, TEAM_FIELDING_DP (-0.1213) is a strong negative coefficient, suggesting that while double plays are generally viewed positively in game, from a win perspective, it may indicate other weaknesses within the team. 

merge_coefficients_with_definitions <- function(coeff_table) {
  var_definitions <- data.frame(
    Variable = c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
                 "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_HBP", "TEAM_BATTING_SO",
                 "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_FIELDING_E", "TEAM_FIELDING_DP",
                 "TEAM_PITCHING_BB", "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_SO"),
    Definition = c("Number of wins", 
                   "Base Hits by batters (1B, 2B, 3B, HR)",
                   "Doubles by batters (2B)",
                   "Triples by batters (3B)",
                   "Homeruns by batters (4B)",
                   "Walks by batters",
                   "Batters hit by pitch (get a free base)",
                   "Strikeouts by batters",
                   "Stolen bases",
                   "Caught stealing",
                   "Errors",
                   "Double Plays",
                   "Walks allowed",
                   "Hits allowed",
                   "Homeruns allowed",
                   "Strikeouts by pitchers"),
    stringsAsFactors = FALSE
  )
  
  #Merge the coefficients table with the variable definitions
  merged_table <- left_join(coeff_table, var_definitions, by = "Variable")
  
  return(merged_table)
}
merge_coefficients_with_definitions(forward_coefficients)
merge_coefficients_with_definitions(backward_coefficients)

The majority of the points in the center of the \(Q-Q\) plot fall fairly close to the line of \(y=x\) indicating that they are normally distributed.However, the divergence of the tails indicates some issues for extreme values.

r_squared <- summary(forward_model)$r.squared
cat("R-squared: ", r_squared, "\n")
## R-squared:  0.3265676
qqnorm(residuals(forward_model))
qqline(residuals(forward_model), col = "red")

The residuals vs. fitted values plot suggests that the model is alright. The good news is that the residuals are generally centered around zero, indicating that the model is unbiased in its predictions across the middle range of fitted values. However, there is evidence of heteroscedasticity, as the residuals show a “fanning out” pattern, with increasing variance at higher fitted values, which may affect prediction reliability in these ranges. In addition, there are a few outliers, particularly targeting teams that have low wins, which could be distorting the model’s accuracy. 

fitted_vals <- fitted(forward_model)
residuals_vals <- residuals(forward_model)

plot(fitted_vals, residuals_vals,
     main = "Residuals vs Fitted",
     xlab = "Fitted values",
     ylab = "Residuals")
abline(h = 0, col = "red")

moneyball_evaluation_data <- read.csv("moneyball-evaluation-data.csv")
moneyball_evaluation_data_imputed <- impute_missing_values(moneyball_evaluation_data,correlation_matrix)
## Column name:  TEAM_BATTING_SO , missing values: 18 
## Working On Column: TEAM_BATTING_SO 
## Predictors for column TEAM_BATTING_SO : TEAM_FIELDING_E 
##  Predictors for column TEAM_BATTING_SO : TEAM_PITCHING_BB 
## Imputed 18 missing values in column: TEAM_BATTING_SO 
## Column name:  TEAM_BASERUN_SB , missing values: 13 
## Working On Column: TEAM_BASERUN_SB 
## Predictors for column TEAM_BASERUN_SB : TEAM_BATTING_3B 
##  Predictors for column TEAM_BASERUN_SB : TEAM_PITCHING_H 
## Imputed 13 missing values in column: TEAM_BASERUN_SB 
## Column name:  TEAM_BASERUN_CS , missing values: 87 
## Working On Column: TEAM_BASERUN_CS 
## Predictors for column TEAM_BASERUN_CS : TEAM_BATTING_3B 
##  Predictors for column TEAM_BASERUN_CS : TEAM_FIELDING_E 
## Imputed 87 missing values in column: TEAM_BASERUN_CS 
## Column name:  TEAM_BATTING_HBP , missing values: 240 
## Working On Column: TEAM_BATTING_HBP 
## Predictors for column TEAM_BATTING_HBP : TEAM_PITCHING_HR 
##  Predictors for column TEAM_BATTING_HBP : TEAM_BATTING_HR 
## Imputed 240 missing values in column: TEAM_BATTING_HBP 
## Column name:  TEAM_PITCHING_SO , missing values: 18 
## Working On Column: TEAM_PITCHING_SO 
## Predictors for column TEAM_PITCHING_SO : TEAM_FIELDING_E 
##  Predictors for column TEAM_PITCHING_SO : TEAM_PITCHING_BB 
## Imputed 18 missing values in column: TEAM_PITCHING_SO 
## Column name:  TEAM_FIELDING_DP , missing values: 31 
## Working On Column: TEAM_FIELDING_DP 
## Predictors for column TEAM_FIELDING_DP : TEAM_BATTING_3B 
##  Predictors for column TEAM_FIELDING_DP : TEAM_FIELDING_E 
## Imputed 31 missing values in column: TEAM_FIELDING_DP
moneyball_evaluation_data_imputed[moneyball_evaluation_data_imputed == "" | is.na(moneyball_evaluation_data_imputed)] <- 0
sum(is.na(moneyball_evaluation_data_imputed))
## [1] 0
count_na_as_string <- sum(moneyball_evaluation_data_imputed == "NA", na.rm = TRUE)
moneyball_evaluation_data_imputed[moneyball_evaluation_data_imputed == "NA"] <- 0

#Check for NAs
sum(is.na(moneyball_evaluation_data_imputed))
## [1] 0
#Predict wins 
predicted_wins <- predict(forward_model, newdata = moneyball_evaluation_data_imputed)
moneyball_evaluation_data$Predicted_Wins <- round(predicted_wins)
#export data
print(predicted_wins)
##         1         2         3         4         5         6         7         8 
##  63.34201  65.31216  74.58335  86.55665  66.30999  67.67456  80.52421  76.97679 
##         9        10        11        12        13        14        15        16 
##  71.03161  74.29158  69.13493  83.16481  82.57485  83.13729  85.96313  78.00385 
##        17        18        19        20        21        22        23        24 
##  74.45590  78.00566  71.61659  90.71735  81.25573  83.20570  80.25591  72.39130 
##        25        26        27        28        29        30        31        32 
##  82.47815  87.44777  55.33278  75.48136  82.95327  75.31005  89.47883  85.16628 
##        33        34        35        36        37        38        39        40 
##  81.22681  82.59840  80.04774  86.27811  76.06674  90.12123  87.05489  91.97500 
##        41        42        43        44        45        46        47        48 
##  82.48081  90.46132  33.04682 102.32632  91.84046  94.27205  99.51838  77.24707 
##        49        50        51        52        53        54        55        56 
##  69.49431  80.08895  77.39947  85.10195  76.92961  74.71985  75.18438  79.49462 
##        57        58        59        60        61        62        63        64 
##  91.44009  76.56690  64.33847  78.89128  87.96009  73.25524  87.73089  85.73417 
##        65        66        67        68        69        70        71        72 
##  84.49861  99.33537  75.02017  80.76783  77.98779  87.54773  86.15504  70.71865 
##        73        74        75        76        77        78        79        80 
##  77.48192  89.99050  81.33350  84.63528  81.76609  83.61217  72.50406  76.03212 
##        81        82        83        84        85        86        87        88 
##  84.39389  88.17360  96.64595  74.22512  84.74053  81.06267  83.42634  83.72506 
##        89        90        91        92        93        94        95        96 
##  88.29683  89.25760  80.32536  97.28005  74.19830  84.11231  84.49740  84.05951 
##        97        98        99       100       101       102       103       104 
##  86.89059 101.95956  85.75903  85.99408  79.55943  74.46950  83.99920  84.16972 
##       105       106       107       108       109       110       111       112 
##  78.53056  69.41087  54.32570  77.04523  86.40053  59.45698  83.99773  85.36602 
##       113       114       115       116       117       118       119       120 
##  93.51967  91.76664  81.54284  78.10601  84.98347  80.69157  74.65586  77.07999 
##       121       122       123       124       125       126       127       128 
##  92.73023  69.51742  69.15351  67.99571  69.22172  88.76361  91.25376  77.93940 
##       129       130       131       132       133       134       135       136 
##  93.05825  91.66904  85.35456  79.12941  79.69121  85.49378  87.34826  67.23954 
##       137       138       139       140       141       142       143       144 
##  74.24250  77.99283  86.96169  80.47501  66.29843  72.97633  89.83782  72.79161 
##       145       146       147       148       149       150       151       152 
##  71.72553  72.24594  76.88918  78.34019  78.54647  83.23298  82.35583  80.35067 
##       153       154       155       156       157       158       159       160 
##  53.65196  70.71838  76.89142  70.80214  89.16285  68.31321  92.26013  74.16343 
##       161       162       163       164       165       166       167       168 
## 102.97119 105.73589  93.85001 102.73814  96.86479  89.41896  81.60908  81.72290 
##       169       170       171       172       173       174       175       176 
##  73.62270  80.32511  88.86271  88.58551  80.71567  93.58114  83.25400  73.22329 
##       177       178       179       180       181       182       183       184 
##  77.07210  70.75924  74.50462  78.65383  86.91645  88.05098  84.99958  84.64730 
##       185       186       187       188       189       190       191       192 
##  96.91088  96.79967  86.85417  56.49043  60.65153 113.16314  71.43358  80.82280 
##       193       194       195       196       197       198       199       200 
##  77.60838  77.79108  79.84679  68.26375  78.74758  84.59009  79.73953  83.95373 
##       201       202       203       204       205       206       207       208 
##  75.51396  79.69558  73.96268  91.02842  81.00795  82.96984  77.82390  77.53497 
##       209       210       211       212       213       214       215       216 
##  81.58098  73.80030 106.43962  93.66124  81.85864  65.42548  68.91173  83.67666 
##       217       218       219       220       221       222       223       224 
##  79.25817  93.07118  77.93890  78.54334  78.26961  74.29641  80.49750  73.47004 
##       225       226       227       228       229       230       231       232 
##  88.22145  75.26303  81.64626  78.45500  81.30498  73.34752  79.99408  92.94496 
##       233       234       235       236       237       238       239       240 
##  78.30733  88.92503  79.84306  75.42988  82.14292  77.96719  92.66573  73.19262 
##       241       242       243       244       245       246       247       248 
##  89.04751  86.29256  83.32269  81.36724  61.25078  87.09117  81.45165  85.17810 
##       249       250       251       252       253       254       255       256 
##  73.59924  83.57244  80.66345  53.89637  93.71869  74.93086  70.05148  75.85348 
##       257       258       259 
##  82.66711  83.05989  77.50290
write.csv(moneyball_evaluation_data, "moneyball-evaluation-data-predicted.csv", row.names = FALSE)
print("Predicted wins saved to 'moneyball-evaluation-data-predicted.csv'")
## [1] "Predicted wins saved to 'moneyball-evaluation-data-predicted.csv'"
# #Predict wins using they polynomial regression model using the mice imputed test dataset.
# sum(is.na(completedData1))
# predicted_winspoly <- predict(mpoly, newdata = completedData1)
# completedData1$Predicted_Wins <- round(predicted_winspoly)
# 
# #Display the predicted wins
# print(predicted_winspoly)
# write.csv(completedData1, "moneyball-evaluation-data-predicted_POLY.csv", row.names = FALSE)

2ND BEST FIT: SUMMARY ANALYSIS: Model using APPROACH 3 (POLYNOMIAL From BUILD MODELS section)

The 2nd best fit is our Polynomial Regression Model.One reason why this model may not be higher could be because of the presence of outliers. As an extention of this project in the future, we could remove some of this high leverage outliers and try fitting this model again to see if the adjusted R square is higher than what was observed. For this project, however, this is our 2nd best selection and fit.

The prediction wins have been written out to a data set 

Cook’s distance measures how and observation influences the overall model or predicted values
Studentizided residuals are the residuals divided by their estimated standard deviation
Bonferroni test to identify outliers
Hat-points identify influential points (have a high impact on the predictor variables)

r_squaredpoly <- summary(step_back)$r.squared
cat("R-squared: ", r_squaredpoly, "\n")
## R-squared:  0.4731361
qqnorm(residuals(step_back))
qqline(residuals(step_back), col = "red")

qqPlot(step_back, id.n=3)

## [1] 2012 2233
fitted_valspoly <- fitted(step_back)
residuals_valsply <- residuals(step_back)

plot(fitted_valspoly, residuals_valsply,
     main = "Residuals vs Fitted step_back regression",
     xlab = "Fitted values",
     ylab = "Residuals")
abline(h = 0, col = "red")

influenceIndexPlot(step_back, id.n=3)

influencePlot(step_back, id.n=3)
# Get the model residuals
model_residuals = step_back$residuals

# Plot the result
hist(model_residuals)

Below, we have used the step_back (polynomial regression model) to fit the test data after imputation. I noticed on review that the model predicted negative target wins for 1 record. The record has been printed out below. The TEAM_BASERUN_SB for thiS record is 0 on the record. On the model coefficients this is one of the significant predictor variables, therefore, having an outlier such as 0 has had an impact on this record. No other records on the test dataset has -ve Target wins predicted. Maybe the presence of 0 on the significant predictor played a role in the -ve target wins. One way we can avoid running into negative values is to log transform the target variable. Then we would need to convert to actual scale by taking the exponential. Negative Intercept: If the intercept in a linear regression model is negative, it means that the predicted value of Y when X is zero is negative. In this case, the regression line crosses the y-axis below the zero value. Models can predict out of bounds values in this scenario.

#sTEP 4: Predict wins using they polynomial regression model using the mice imputed test dataset.
sum(is.na(completedData1))
## [1] 0
predicted_winspoly <- predict(step_back, newdata = completedData1)
completedData1$Predicted_Wins <- round(predicted_winspoly)

# This record had -ve TARGET WINS
completedData1[153,]
# Step 3: Display the predicted wins
print(predicted_winspoly)
##         1         2         3         4         5         6         7         8 
##  56.17294  60.28666  69.64209  84.94790  68.27314  85.69634  78.20616  70.28016 
##         9        10        11        12        13        14        15        16 
##  72.62274  69.87750  65.52472  84.40004  86.42408  84.87206  88.84636  76.49452 
##        17        18        19        20        21        22        23        24 
##  70.34470  76.56092  70.57387  80.98965  82.29357  81.51251  83.08143  72.01933 
##        25        26        27        28        29        30        31        32 
##  81.79996  87.92424  74.33068  66.83266  80.53896  67.66413  92.17382  84.85526 
##        33        34        35        36        37        38        39        40 
##  88.78826  88.58816  79.80351  85.16124  77.13142  87.71430  81.61208  89.02724 
##        41        42        43        44        45        46        47        48 
##  84.21168  97.96500  19.54563 106.70521  86.67550  96.90132  96.92875  79.05094 
##        49        50        51        52        53        54        55        56 
##  67.51130  78.10583  72.03109  78.54303  70.68609  77.49007  68.06108  80.91025 
##        57        58        59        60        61        62        63        64 
##  91.45457  84.51076  54.18980  83.38793  88.56795  87.41098  87.88489  80.09362 
##        65        66        67        68        69        70        71        72 
##  85.65714 104.43752  69.42358  73.57874  77.47431  95.06832  84.52662  74.63740 
##        73        74        75        76        77        78        79        80 
##  83.82775  90.32849  72.63642  74.04629  89.17671  82.02841  76.91166  79.06990 
##        81        82        83        84        85        86        87        88 
##  82.96109  84.87351  97.78765  72.49258  88.21016  81.02128  87.09083  83.61214 
##        89        90        91        92        93        94        95        96 
##  97.11698  92.20992  75.67395  95.63936  63.57094  92.02143  90.33738  91.33154 
##        97        98        99       100       101       102       103       104 
## 101.57956  96.13043  84.33106  87.58446  76.56099  68.60087  81.90618  89.07745 
##       105       106       107       108       109       110       111       112 
##  69.54867  84.82527  50.61242  73.87781  89.60298  45.83271  88.15486  88.83011 
##       113       114       115       116       117       118       119       120 
##  94.87455  91.96957  80.81252  77.71058  86.48245  81.07379  69.83175  79.69570 
##       121       122       123       124       125       126       127       128 
## 101.80170  67.89132  68.53678  59.01639  69.53754  82.32925  82.29472  75.69144 
##       129       130       131       132       133       134       135       136 
##  87.11546  86.32240  77.56226  78.54961  75.94046  78.66570  89.84772  82.59521 
##       137       138       139       140       141       142       143       144 
##  76.42258  75.60449  96.87116  80.06190  65.00611  73.87770  88.96301  67.28884 
##       145       146       147       148       149       150       151       152 
##  77.15675  75.76504  73.24084  80.86218  80.05597  83.90668  80.86188  79.88856 
##       153       154       155       156       157       158       159       160 
## -19.84017  64.77946  78.32424  64.87205  91.89268  81.89351  89.36259  72.12265 
##       161       162       163       164       165       166       167       168 
## 103.02976 111.67013  97.21169 108.75241 102.40531  98.45281  88.61654  84.95617 
##       169       170       171       172       173       174       175       176 
##  72.49847  80.70537  95.86099  94.82055  77.89417  88.50043  75.41264  79.19125 
##       177       178       179       180       181       182       183       184 
##  84.74238  71.10428  77.34219  80.99975  98.93045  87.46444  88.35490  86.95535 
##       185       186       187       188       189       190       191       192 
##  85.51381  82.43470  83.71987  46.48291  52.39584 108.56011  64.54175  83.93363 
##       193       194       195       196       197       198       199       200 
##  70.01263  69.62693  68.91100  58.25476  70.12960  98.29763  84.71910  85.60574 
##       201       202       203       204       205       206       207       208 
##  68.83443  79.20509  74.50917  93.34670  81.99057  90.00616  77.24307  76.26457 
##       209       210       211       212       213       214       215       216 
##  73.56917  69.79046 104.60704  95.17004  84.35413  62.58243  71.91686  84.28700 
##       217       218       219       220       221       222       223       224 
##  81.90389  96.32345  76.92910  81.56029  70.41984  70.49817  77.32647  72.56484 
##       225       226       227       228       229       230       231       232 
##  70.78992  78.00976  78.37884  74.24700  84.74737  86.06397  82.04895  90.50258 
##       233       234       235       236       237       238       239       240 
##  87.55257  86.15641  77.86184  75.71337  74.50398  80.12416  85.63313  71.52249 
##       241       242       243       244       245       246       247       248 
##  82.21822  87.97379  80.25439  81.82278  57.15675  81.05379  79.00098  81.39071 
##       249       250       251       252       253       254       255       256 
##  74.23961  84.23937  77.17536  56.02002  93.72933 209.90022  65.68211  79.94488 
##       257       258       259 
##  86.67192  85.85297  73.20054
write.csv(completedData1, "moneyball-evaluation-data-predicted_POLY.csv", row.names = FALSE)