We have read in the 2 datasets provided as part of this project from
Github.
Our goal for this project is to analyze the variables in the baseball statistics columns provided on the training file, explore missing data, understand and prepare the data and build regression models to fit the data (predict our Y or dependent variable “Target Wins” for a given team, based on multiple predictor variables provided in the dataset).
Once have fit the data, we can then run and test on the evaluation
dataset provided, we will analyze model validity and performance and
report model that can be selected as the best fit for the data. Shown
below is an image all the variables and their definitions.
A good reference site that we have consulted is the MLB league page https://www.mlb.com/stats/. These have great definitions and usage of the column variables for reference information.
library(kableExtra)
library(tidyverse)
library(tidyr)
library(forecast)
library(cowplot)
#install.packages("car")
library(naniar)
library(Seurat)
library(skimr)
library(PerformanceAnalytics)
library(corrplot)
library(mice)
library(ggplot2)
library(reshape2)
library(dplyr)
library(tidyr)
library(MASS)# For stepAIC (Stepwise selection)
library(caret) # For splitting data and cross-validation
library(leaps)
library(car)
moneyball_tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv",stringsAsFactors = F,header=TRUE)
moneyball_testdta <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-evaluation-data.csv",stringsAsFactors = F,header=TRUE)
#knitr::include_graphics("Definition.png")Describe the size and the variables in the moneyball training data
set. Consider that too much detail will cause a manager to lose interest
while too little detail will make the manager consider that you aren’t
doing your job. Some suggestions are given below. Please do NOT treat
this as a check list of things to do to complete the assignment. You
should have your own thoughts on what to tell the boss. These are just
ideas.
This data set contains 2276 records and 17 variables in each record.
Each record represents a professional baseball team from the years
1871 to 2006 inclusive. Each record has the performance of the team for
the given year, with all of the statistics adjusted to match the
performance of a 162 game season.
Here we have used a few visualizations to summarize the variables that are missing data or have values 0’s.
Following variables have the highest “NA” missing values which we can
either ignore or consider for imputation. In this project our team have
decided to proceed with imputation.Three other variables
TEAM_BASERUN_SB, TEAM_BATTING_SO,TEAM_PITCHING_SO are all around 6% and
5% missing data respectively.
TEAM_BATTING_HBP (92%)
TEAM_BASERUN_CS (34%)
TEAM_FIELDING_DP (13%)
We also have records with 0’s.A potential option is transforming them by adding a constant.
A quick run down of statistics that we need to understand in the
world of baseball. Some trivia, for a player, a .275 average is pretty
good. However this means that the batter was successful just over 25% of
the time. Nearly 73% of the time, they did not get a hit.
BA = batting average, G = games played, AB = at bats, R =
runs,
H = hits,
2B = doubles,
3B = triples,
HR = home runs,
RBI = runs batted in, SB = stolen bases.
| Name | moneyball_tdata |
| Number of rows | 2276 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 1268.46 | 736.35 | 1 | 630.75 | 1270.5 | 1915.50 | 2535 | ▇▇▇▇▇ |
| TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.00 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
| TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.00 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
| TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.00 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
| TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.00 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
| TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.00 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
| TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.00 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
| TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.00 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
| TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.00 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
| TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.00 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
| TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.50 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
| TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.00 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.00 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.00 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
| TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.00 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.00 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.00 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
# Drop the Index column
df <- moneyball_tdata %>% dplyr::select(-INDEX)
# #Analysis of Missing values on the original dataset without Mice imputation
# missing <- colSums(df %>% sapply(is.na))
# missing_pct <- round(missing / nrow(df) * 100, 2)
# stack(sort(missing_pct, decreasing = TRUE))
df %>%
gather(variable, value) %>%
filter(is.na(value)) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(df) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable Missing Data` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()| Variable Missing Data | Number of Records | Share of Total |
|---|---|---|
| TEAM_BATTING_HBP | 2085 | 92% |
| TEAM_BASERUN_CS | 772 | 34% |
| TEAM_FIELDING_DP | 286 | 13% |
| TEAM_BASERUN_SB | 131 | 5.8% |
| TEAM_BATTING_SO | 102 | 4.5% |
| TEAM_PITCHING_SO | 102 | 4.5% |
df %>%
gather(variable, value) %>%
filter(value == 0) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(df) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable With Zeros` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()| Variable With Zeros | Number of Records | Share of Total |
|---|---|---|
| TEAM_BATTING_SO | 20 | 0.9% |
| TEAM_PITCHING_SO | 20 | 0.9% |
| TEAM_BATTING_HR | 15 | 0.7% |
| TEAM_PITCHING_HR | 15 | 0.7% |
| TEAM_BASERUN_SB | 2 | 0.1% |
| TEAM_BATTING_3B | 2 | 0.1% |
| TARGET_WINS | 1 | 0% |
| TEAM_BASERUN_CS | 1 | 0% |
| TEAM_BATTING_BB | 1 | 0% |
| TEAM_PITCHING_BB | 1 | 0% |
The histogram distribution show kurtosis, specifically right skew in variables such as BASERUN_CS, BASERUN_SB, FIELDING_E, PITCHING_BB, PITCHING_H and PITCHING_SO. We may need to transform these to see if we can make them more of a normal curve prior to proceeding with the model build using the data.
BATTING_HR, PITCHING_HR and BATTING_SO have a bimodal data which may
suggest a shift in the data represented within the dataset. Looking at
quick qqplots of the bimodal variables confirms that these are not
normal, however, HBP which has a large number of missing values does
seem following along the line on the values represented. We can choose
to keep this variable and impute the missing values rather than dropping
it from the dataset.
TARGET WINS is normally distributed.
Boxplots reveal that there are many variables with outliers found in
the distribution and as shown by the qqplot as well. These will have
some impact on our model fit. We could potentially find leverage and
influential points and remove them and rerun the model to determine
during model build process.
#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)
# Look at distributions of the columns
mt <- df %>%
gather(key = 'variable', value = 'value')
names(df)## [1] "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
variable_names <- list(
"TARGET_WINS" = "TARGET_WINS"
,"TEAM_BATTING_H" = "BATTING_H"
,"TEAM_BATTING_2B" = "BATTING_2B"
,"TEAM_BATTING_3B" = "BATTING_3B"
,"TEAM_BATTING_HR" = "BATTING_HR"
,"TEAM_BATTING_BB" = "BATTING_BB"
,"TEAM_BATTING_SO" = "BATTING_SO"
,"TEAM_BASERUN_SB" = "BASERUN_SB"
,"TEAM_BASERUN_CS" = "BASERUN_CS"
,"TEAM_BATTING_HBP" = "BATTING_HBP"
,"TEAM_PITCHING_H" = "PITCHING_H"
,"TEAM_PITCHING_HR" = "PITCHING_HR"
,"TEAM_PITCHING_BB" = "PITCHING_BB"
,"TEAM_PITCHING_SO" = "PITCHING_SO"
,"TEAM_FIELDING_E" = "FIELDING_E"
,"TEAM_FIELDING_DP" = "FIELDING_DP"
)
variable_labeller <- function(variable,value){
return(variable_names[value])
}
# Histogram plots of each variable
mt %>% ggplot(., aes(value)) +
geom_density(fill = "lightgray", color="lightgray") +
facet_wrap(.~variable, scales='free',ncol=4,labeller=variable_labeller)# findoutlier <- function(x) {
# return(x < quantile(x, .25,na.rm=TRUE) - 1.5*IQR(x,na.rm=TRUE) | x > quantile(x, .75,na.rm=TRUE) + 1.5*IQR(x,na.rm=TRUE))
# }
#
# mtout <-mt %>%
# group_by(variable) %>%
# mutate(outlier = ifelse(findoutlier(value), value, NA))
# Boxplot
ggplot(mt, aes(factor(variable), value)) +
geom_boxplot() +
#geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)+
facet_wrap(. ~variable, scales='free', ncol=5,labeller = variable_labeller)# ggplot(stack(df), aes(x = ind, y = values)) +
# geom_boxplot() +
# theme(legend.position="none") +
# theme(axis.text.x=element_text(angle=45, hjust=1))
#stacked boxplot
longdf <- df %>%
melt()
longdf %>%
filter(complete.cases(.)) %>%
ggplot(aes(x= variable, y=value)) +
geom_boxplot(fill="lightgray") +
scale_y_log10() +
coord_flip() +
theme_minimal() +
labs(y="Value", x="variable",
title="Boxplot")#For Bimodal we want to understand qqplot.
qqplot_BATTING_3B <- ggplot(df, aes(sample = TEAM_BATTING_3B)) +
stat_qq() +
stat_qq_line() +
labs(title="BATTING_3B")
qqplot_BATTING_HR <- ggplot(df, aes(sample = TEAM_BATTING_HR)) +
stat_qq() +
stat_qq_line() +
labs(title="BATTING_HR")
qqplot_BATTING_HBP <- ggplot(df, aes(sample = TEAM_BATTING_HBP)) +
stat_qq() +
stat_qq_line() +
labs(title="BATTING_HBP")
qqplot_BATTING_3BCORRELATIONS REVIEW
Interesting Insights from the Correlation Plot
As expected, base hits (TEAM_BATTING_H) have a strong
positive correlation with the number of wins. This makes sense because
getting more base hits usually increases a team’s chances of scoring and
winning games.
Homeruns (TEAM_BATTING_HR) are also strongly correlated
with wins, indicating that teams with more homeruns tend to win more
games. This confirms the importance of power-hitting in baseball.
Doubles (TEAM_BATTING_2B) show a moderate positive
correlation with wins, emphasizing that extra-base hits contribute
significantly to a team’s success.
Surprisingly, triples (TEAM_BATTING_3B) show a slight
negative correlation with wins. This could suggest that teams with more
triples don’t necessarily translate them into scoring, or that teams who
hit fewer triples may rely on other strategies to win.
This is an unexpected result, as you would normally expect more hits allowed to lead to fewer wins. This could indicate that teams that allow more hits are able to compensate with other strong areas, like offensive production or good defense.
Negative Correlation Between TEAM_BATTING_SO and TARGET_WINS (-0.23)
Strikeouts by batters (TEAM_BATTING_SO) are negatively
correlated with wins, meaning that teams that strike out more often tend
to win less. This is expected, as striking out reduces a team’s chances
of getting on base and scoring.
Fielding errors (TEAM_FIELDING_E) have a strong negative
correlation with wins. Teams that commit more errors tend to lose more
games, which is logical since errors typically give the opposing team
more scoring opportunities.
Walks by batters (TEAM_BATTING_BB) have a significant
positive impact on wins. This suggests that good plate discipline and
the ability to draw walks help teams increase their chances of winning
by getting more players on base.
Similar to hits allowed, walks allowed by pitchers
(TEAM_PITCHING_BB) show a moderate positive correlation
with wins. This is counterintuitive, as more walks generally put the
team at a disadvantage. Further analysis might reveal that stronger
teams manage to win despite allowing more walks.
#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)
# Look at distributions of the columns
mt <- df %>%
gather(key = 'variable', value = 'value')
names(df)## [1] "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
variable_names <- list(
"TARGET_WINS" = "TARGET_WINS"
,"TEAM_BATTING_H" = "BATTING_H"
,"TEAM_BATTING_2B" = "BATTING_2B"
,"TEAM_BATTING_3B" = "BATTING_3B"
,"TEAM_BATTING_HR" = "BATTING_HR"
,"TEAM_BATTING_BB" = "BATTING_BB"
,"TEAM_BATTING_SO" = "BATTING_SO"
,"TEAM_BASERUN_SB" = "BASERUN_SB"
,"TEAM_BASERUN_CS" = "BASERUN_CS"
,"TEAM_BATTING_HBP" = "BATTING_HBP"
,"TEAM_PITCHING_H" = "PITCHING_H"
,"TEAM_PITCHING_HR" = "PITCHING_HR"
,"TEAM_PITCHING_BB" = "PITCHING_BB"
,"TEAM_PITCHING_SO" = "PITCHING_SO"
,"TEAM_FIELDING_E" = "FIELDING_E"
,"TEAM_FIELDING_DP" = "FIELDING_DP"
)
variable_labeller <- function(variable,value){
return(variable_names[value])
}
# Select numeric columns only
numeric_data <- df[sapply(df, is.numeric)]
M<- cor(numeric_data,use="complete.obs")
# M %>% kable() %>%
# kable_styling()
ggcorrplot(M, type = "upper", outline.color = "white",
ggtheme = theme_classic,
#colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data, use="complete.obs")
print(correlation_matrix)## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS 1.00000000 0.46994665 0.31298400 -0.12434586
## TEAM_BATTING_H 0.46994665 1.00000000 0.56177286 0.21391883
## TEAM_BATTING_2B 0.31298400 0.56177286 1.00000000 0.04203441
## TEAM_BATTING_3B -0.12434586 0.21391883 0.04203441 1.00000000
## TEAM_BATTING_HR 0.42241683 0.39627593 0.25099045 -0.21879927
## TEAM_BATTING_BB 0.46868793 0.19735234 0.19749256 -0.20584392
## TEAM_BATTING_SO -0.22889273 -0.34174328 -0.06415123 -0.19291841
## TEAM_BASERUN_SB 0.01483639 0.07167495 -0.18768279 0.16946086
## TEAM_BASERUN_CS -0.17875598 -0.09377545 -0.20413884 0.23213978
## TEAM_BATTING_HBP 0.07350424 -0.02911218 0.04608475 -0.17424715
## TEAM_PITCHING_H 0.47123431 0.99919269 0.56045355 0.21250322
## TEAM_PITCHING_HR 0.42246683 0.39495630 0.24999875 -0.21973263
## TEAM_PITCHING_BB 0.46839882 0.19529071 0.19592157 -0.20675383
## TEAM_PITCHING_SO -0.22936481 -0.34445001 -0.06616615 -0.19386654
## TEAM_FIELDING_E -0.38668800 -0.25381638 -0.19427027 -0.06513145
## TEAM_FIELDING_DP -0.19586601 0.01776946 -0.02488808 0.13314758
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS 0.42241683 0.46868793 -0.22889273
## TEAM_BATTING_H 0.39627593 0.19735234 -0.34174328
## TEAM_BATTING_2B 0.25099045 0.19749256 -0.06415123
## TEAM_BATTING_3B -0.21879927 -0.20584392 -0.19291841
## TEAM_BATTING_HR 1.00000000 0.45638161 0.21045444
## TEAM_BATTING_BB 0.45638161 1.00000000 0.21833871
## TEAM_BATTING_SO 0.21045444 0.21833871 1.00000000
## TEAM_BASERUN_SB -0.19021893 -0.08806123 -0.07475974
## TEAM_BASERUN_CS -0.27579838 -0.20878051 -0.05613035
## TEAM_BATTING_HBP 0.10618116 0.04746007 0.22094219
## TEAM_PITCHING_H 0.39549390 0.19848687 -0.34145321
## TEAM_PITCHING_HR 0.99993259 0.45659283 0.21111617
## TEAM_PITCHING_BB 0.45542468 0.99988140 0.21895783
## TEAM_PITCHING_SO 0.20829574 0.21793253 0.99976835
## TEAM_FIELDING_E 0.01567397 -0.07847126 0.30814540
## TEAM_FIELDING_DP -0.06182222 -0.07929078 -0.12319072
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS 0.01483639 -0.178755979 0.07350424
## TEAM_BATTING_H 0.07167495 -0.093775445 -0.02911218
## TEAM_BATTING_2B -0.18768279 -0.204138837 0.04608475
## TEAM_BATTING_3B 0.16946086 0.232139777 -0.17424715
## TEAM_BATTING_HR -0.19021893 -0.275798375 0.10618116
## TEAM_BATTING_BB -0.08806123 -0.208780510 0.04746007
## TEAM_BATTING_SO -0.07475974 -0.056130355 0.22094219
## TEAM_BASERUN_SB 1.00000000 0.624737808 -0.06400498
## TEAM_BASERUN_CS 0.62473781 1.000000000 -0.07051390
## TEAM_BATTING_HBP -0.06400498 -0.070513896 1.00000000
## TEAM_PITCHING_H 0.07395373 -0.092977893 -0.02769699
## TEAM_PITCHING_HR -0.18948057 -0.275471495 0.10675878
## TEAM_PITCHING_BB -0.08741902 -0.208470154 0.04785137
## TEAM_PITCHING_SO -0.07351325 -0.055308336 0.22157375
## TEAM_FIELDING_E 0.04292341 0.207701189 0.04178971
## TEAM_FIELDING_DP -0.13023054 -0.006764233 -0.07120824
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS 0.47123431 0.42246683 0.46839882
## TEAM_BATTING_H 0.99919269 0.39495630 0.19529071
## TEAM_BATTING_2B 0.56045355 0.24999875 0.19592157
## TEAM_BATTING_3B 0.21250322 -0.21973263 -0.20675383
## TEAM_BATTING_HR 0.39549390 0.99993259 0.45542468
## TEAM_BATTING_BB 0.19848687 0.45659283 0.99988140
## TEAM_BATTING_SO -0.34145321 0.21111617 0.21895783
## TEAM_BASERUN_SB 0.07395373 -0.18948057 -0.08741902
## TEAM_BASERUN_CS -0.09297789 -0.27547150 -0.20847015
## TEAM_BATTING_HBP -0.02769699 0.10675878 0.04785137
## TEAM_PITCHING_H 1.00000000 0.39463199 0.19703302
## TEAM_PITCHING_HR 0.39463199 1.00000000 0.45580983
## TEAM_PITCHING_BB 0.19703302 0.45580983 1.00000000
## TEAM_PITCHING_SO -0.34330646 0.20920115 0.21887700
## TEAM_FIELDING_E -0.25073028 0.01689330 -0.07692315
## TEAM_FIELDING_DP 0.01416807 -0.06292475 -0.08040645
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS -0.22936481 -0.38668800 -0.195866006
## TEAM_BATTING_H -0.34445001 -0.25381638 0.017769456
## TEAM_BATTING_2B -0.06616615 -0.19427027 -0.024888081
## TEAM_BATTING_3B -0.19386654 -0.06513145 0.133147578
## TEAM_BATTING_HR 0.20829574 0.01567397 -0.061822219
## TEAM_BATTING_BB 0.21793253 -0.07847126 -0.079290775
## TEAM_BATTING_SO 0.99976835 0.30814540 -0.123190715
## TEAM_BASERUN_SB -0.07351325 0.04292341 -0.130230537
## TEAM_BASERUN_CS -0.05530834 0.20770119 -0.006764233
## TEAM_BATTING_HBP 0.22157375 0.04178971 -0.071208241
## TEAM_PITCHING_H -0.34330646 -0.25073028 0.014168073
## TEAM_PITCHING_HR 0.20920115 0.01689330 -0.062924751
## TEAM_PITCHING_BB 0.21887700 -0.07692315 -0.080406452
## TEAM_PITCHING_SO 1.00000000 0.31008407 -0.124923213
## TEAM_FIELDING_E 0.31008407 1.00000000 0.040205814
## TEAM_FIELDING_DP -0.12492321 0.04020581 1.000000000
OUTLIERS
There are variables with outliers. For example, there are many pairs
of variable types that have 0’s recorded above (for example batting_HR
and pitching_HR), since these are 0’s, they could be considered outliers
or they can be legitimate. In addition, a few variables have outliers
further distance away from the mean such as BATTING_3B, PITCHING_SO,
PITCHING_BB.An observation is considered an outlier, if it is extreme,
relative to other response values. In contrast, some records have
extremely high or low values for the predictor variable, relative to the
other values. These are referred to as high leverage observations.These
can influence our model estimates and introduce bias. We could
potentially calculate these points, remove them and refit the line and
check our Rsquared value later on during the model build process.
Cook’s D measures how much the model coefficient estimates would
change if an observation were to be removed from the data set.
There is one Cook’s D value for each observation used to fit the
model. The higher the Cook’s D value, the greater the influence.
Generally accepted rules of thumb are that Cook’s D values above 1.0
indicate influential values, and any values that stick out from the rest
might also be influential.
#Review the data
#install.packages("ggcorrplot")
library(ggcorrplot)
#Scatterplot of variables with target wins
df %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "lightblue", color="lightblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")#Look at few variable distributions in depth
df %>%
ggplot(aes(TARGET_WINS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TARGET_WINS, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TARGET_WINS, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution TARGET_WINS",
caption = "* Red line is the mean value and green is the median")df %>%
ggplot(aes(TEAM_BATTING_3B)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution TEAM_BATTING_3B",
caption = "* Red line is the mean value and green is the median")df %>%
ggplot(aes(TEAM_PITCHING_H)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_H, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_H, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution TEAM_PITCHING_H",
caption = "* Red line is the mean value and green is the median")df %>%
ggplot(aes(TEAM_PITCHING_SO)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_SO, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_SO, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution TEAM_PITCHING_SO",
caption = "* Red line is the mean value and green is the median")df %>%
ggplot(aes(TEAM_PITCHING_BB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_BB, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_BB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution PITCHING_BB",
caption = "* Red line is the mean value and green is the median")Describe how you have transformed the data by changing the original
variables or creating new variables. If you did transform the data or
create new variables, discuss why you did this. Here are some possible
transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root (or use
Box-Cox)
e. Combine variables (such as ratios or adding or multiplying) to create
new variables)
MICE Imputation, short for ‘Multiple Imputation by Chained Equation’
is an advanced missing data imputation technique that uses plausible
data points based on distributions. We use this imputation method below
on both train data set and test dataset (Evaluation for use later on
during prediction). Once we impute HBP column for instance, we look at
the change in the histogram. For HBP, the curve has moved away from
normal curve to something different with 2 different peaks which is not
what we prefer, however, we will proceed keeping this imputation for
this variable. For BASERUN_CS seems to have become more normal after
imputation with mice.
After imputation of the dataset, we can see positive and negative correlations below with Target Wins. BASERUN_CS have mild positive correlation and Target wins and HBP have a slight negative correlation.
#Reviewing HBP that has 90% of the data missing to review the spread of the data
df %>%
ggplot(aes(TEAM_BATTING_HBP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Hit By a Pitch",
caption = "* Blue line is the mean value and green is the median")# Set the seed for reproducibility
set.seed(12345)
# Perform Multiple Imputation
imputed_data_df <- mice(df, m=5, method='pmm', print=FALSE)
completedData <- complete(imputed_data_df,1)
summary(completedData) %>%
kable() %>%
kable_styling()| TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
| 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 541.0 | 1st Qu.: 67 | 1st Qu.: 42.0 | 1st Qu.:51.00 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 602.0 | 1st Qu.: 127.0 | 1st Qu.:124.0 | |
| Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 733.0 | Median :106 | Median : 57.0 | Median :56.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 797.0 | Median : 159.0 | Median :145.5 | |
| Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 727.2 | Mean :137 | Mean : 76.2 | Mean :58.47 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 807.2 | Mean : 246.5 | Mean :141.6 | |
| 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 925.0 | 3rd Qu.:170 | 3rd Qu.: 92.0 | 3rd Qu.:69.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 957.0 | 3rd Qu.: 249.2 | 3rd Qu.:162.0 | |
| Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 |
# Density plots
ggplot(df, aes(x=TEAM_BATTING_HBP, fill="Original")) +
geom_density(alpha=0.5) +
geom_density(data=completedData, aes(x=TEAM_BATTING_HBP, fill="imputed"), alpha=0.5) +
labs(title="Density Plot of Ozone: Original vs. Imputed")#check 0's again
completedData %>%
gather(variable, value) %>%
filter(value == 0) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(completedData) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable With Zeros` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()| Variable With Zeros | Number of Records | Share of Total |
|---|---|---|
| TEAM_BATTING_SO | 20 | 0.9% |
| TEAM_PITCHING_SO | 20 | 0.9% |
| TEAM_BATTING_HR | 15 | 0.7% |
| TEAM_PITCHING_HR | 15 | 0.7% |
| TEAM_BASERUN_SB | 2 | 0.1% |
| TEAM_BATTING_3B | 2 | 0.1% |
| TARGET_WINS | 1 | 0% |
| TEAM_BASERUN_CS | 1 | 0% |
| TEAM_BATTING_BB | 1 | 0% |
| TEAM_PITCHING_BB | 1 | 0% |
# #
# hist(completedData$TEAM_BATTING_SO)
# completedData$log_BATTING_SO <- (completedData$TEAM_BATTING_SO)^(1/3)
# hist(completedData$log_BATTING_SO)
#Analysis of Missing values on the imputed dataset
missing1 <- colSums(completedData %>% sapply(is.na))
missing_pct1 <- round(missing1 / nrow(df) * 100, 2)
stack(sort(missing_pct1, decreasing = TRUE))mt1 <- completedData %>%
gather(key = 'variable', value = 'value')
mt1 %>% ggplot(., aes(value)) +
geom_density(fill = "green", color="green") +
facet_wrap(.~variable, scales='free',ncol=4,labeller=variable_labeller)completedData %>%
ggplot(aes(TEAM_BATTING_HBP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Hit By a Pitch with mice imputation",
caption = "* Blue line is the mean value and green is the median")completedData %>%
ggplot(aes(TEAM_BASERUN_CS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BASERUN_CS, na.rm = T)), col = "blue", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BASERUN_CS, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "BASERUN_CS with mice imputation",
caption = "* Blue line is the mean value and green is the median")## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS 1.00000000 0.388767521 0.28910365 0.142608411
## TEAM_BATTING_H 0.38876752 1.000000000 0.56284968 0.427696575
## TEAM_BATTING_2B 0.28910365 0.562849678 1.00000000 -0.107305824
## TEAM_BATTING_3B 0.14260841 0.427696575 -0.10730582 1.000000000
## TEAM_BATTING_HR 0.17615320 -0.006544685 0.43539729 -0.635566946
## TEAM_BATTING_BB 0.23255986 -0.072464013 0.25572610 -0.287235841
## TEAM_BATTING_SO -0.03700542 -0.423472975 0.18945983 -0.671507774
## TEAM_BASERUN_SB 0.10806892 0.132765300 -0.19125764 0.513732615
## TEAM_BASERUN_CS 0.04189727 0.116968307 -0.29737125 0.637883604
## TEAM_BATTING_HBP -0.02195123 0.199416536 0.04069928 0.263360368
## TEAM_PITCHING_H -0.10993705 0.302693709 0.02369219 0.194879411
## TEAM_PITCHING_HR 0.18901373 0.072853119 0.45455082 -0.567836679
## TEAM_PITCHING_BB 0.12417454 0.094193027 0.17805420 -0.002224148
## TEAM_PITCHING_SO -0.07633076 -0.231737691 0.08388897 -0.266685175
## TEAM_FIELDING_E -0.17648476 0.264902478 -0.23515099 0.509778447
## TEAM_FIELDING_DP -0.04872795 0.014656289 0.30360837 -0.435353559
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS 0.176153200 0.232559864 -0.03700542
## TEAM_BATTING_H -0.006544685 -0.072464013 -0.42347298
## TEAM_BATTING_2B 0.435397293 0.255726103 0.18945983
## TEAM_BATTING_3B -0.635566946 -0.287235841 -0.67150777
## TEAM_BATTING_HR 1.000000000 0.513734810 0.72991823
## TEAM_BATTING_BB 0.513734810 1.000000000 0.38691765
## TEAM_BATTING_SO 0.729918232 0.386917646 1.00000000
## TEAM_BASERUN_SB -0.501890501 -0.339539238 -0.30585964
## TEAM_BASERUN_CS -0.635081576 -0.352347044 -0.46343485
## TEAM_BATTING_HBP -0.231718837 -0.338486366 -0.26456155
## TEAM_PITCHING_H -0.250145481 -0.449777625 -0.36207321
## TEAM_PITCHING_HR 0.969371396 0.459552072 0.67285963
## TEAM_PITCHING_BB 0.136927564 0.489361263 0.05421024
## TEAM_PITCHING_SO 0.201349979 -0.009344302 0.42418730
## TEAM_FIELDING_E -0.587339098 -0.655970815 -0.58194385
## TEAM_FIELDING_DP 0.510707512 0.391002244 0.30778415
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS 0.10806892 0.041897274 -0.02195123
## TEAM_BATTING_H 0.13276530 0.116968307 0.19941654
## TEAM_BATTING_2B -0.19125764 -0.297371248 0.04069928
## TEAM_BATTING_3B 0.51373261 0.637883604 0.26336037
## TEAM_BATTING_HR -0.50189050 -0.635081576 -0.23171884
## TEAM_BATTING_BB -0.33953924 -0.352347044 -0.33848637
## TEAM_BATTING_SO -0.30585964 -0.463434852 -0.26456155
## TEAM_BASERUN_SB 1.00000000 0.798659242 0.15955042
## TEAM_BASERUN_CS 0.79865924 1.000000000 0.21492884
## TEAM_BATTING_HBP 0.15955042 0.214928843 1.00000000
## TEAM_PITCHING_H 0.17644317 0.173942970 0.15956202
## TEAM_PITCHING_HR -0.45351728 -0.584641263 -0.18940532
## TEAM_PITCHING_BB 0.02569117 0.009879709 -0.11421634
## TEAM_PITCHING_SO 0.02566609 -0.090100513 -0.03539331
## TEAM_FIELDING_E 0.60025561 0.582171528 0.26247599
## TEAM_FIELDING_DP -0.61149875 -0.627917734 -0.21305085
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS -0.10993705 0.18901373 0.124174536
## TEAM_BATTING_H 0.30269371 0.07285312 0.094193027
## TEAM_BATTING_2B 0.02369219 0.45455082 0.178054204
## TEAM_BATTING_3B 0.19487941 -0.56783668 -0.002224148
## TEAM_BATTING_HR -0.25014548 0.96937140 0.136927564
## TEAM_BATTING_BB -0.44977762 0.45955207 0.489361263
## TEAM_BATTING_SO -0.36207321 0.67285963 0.054210244
## TEAM_BASERUN_SB 0.17644317 -0.45351728 0.025691173
## TEAM_BASERUN_CS 0.17394297 -0.58464126 0.009879709
## TEAM_BATTING_HBP 0.15956202 -0.18940532 -0.114216344
## TEAM_PITCHING_H 1.00000000 -0.14161276 0.320676162
## TEAM_PITCHING_HR -0.14161276 1.00000000 0.221937505
## TEAM_PITCHING_BB 0.32067616 0.22193750 1.000000000
## TEAM_PITCHING_SO 0.26954767 0.22183950 0.489688867
## TEAM_FIELDING_E 0.66775901 -0.49314447 -0.022837561
## TEAM_FIELDING_DP -0.05508368 0.48492027 0.156541505
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS -0.076330764 -0.17648476 -0.04872795
## TEAM_BATTING_H -0.231737691 0.26490248 0.01465629
## TEAM_BATTING_2B 0.083888966 -0.23515099 0.30360837
## TEAM_BATTING_3B -0.266685175 0.50977845 -0.43535356
## TEAM_BATTING_HR 0.201349979 -0.58733910 0.51070751
## TEAM_BATTING_BB -0.009344302 -0.65597081 0.39100224
## TEAM_BATTING_SO 0.424187297 -0.58194385 0.30778415
## TEAM_BASERUN_SB 0.025666091 0.60025561 -0.61149875
## TEAM_BASERUN_CS -0.090100513 0.58217153 -0.62791773
## TEAM_BATTING_HBP -0.035393311 0.26247599 -0.21305085
## TEAM_PITCHING_H 0.269547667 0.66775901 -0.05508368
## TEAM_PITCHING_HR 0.221839501 -0.49314447 0.48492027
## TEAM_PITCHING_BB 0.489688867 -0.02283756 0.15654151
## TEAM_PITCHING_SO 1.000000000 -0.02841702 0.10633958
## TEAM_FIELDING_E -0.028417020 1.00000000 -0.49430493
## TEAM_FIELDING_DP 0.106339583 -0.49430493 1.00000000
# corrplot(M, type = 'lower', order = 'hclust', tl.col = 'black',
# cl.ratio = 0.2, tl.srt = 45, col = COL2('PuOr', 10))
#
# testRes = cor.mtest(M, conf.level = 0.95)
#correlations including TARGET WINS and multi-collinearity
ggcorrplot(M, type = "upper", outline.color = "white",
ggtheme = theme_classic,
#colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)#IMPUTE TEST DATASET FOR CONSISTENCY
# moneyball_testdta %>%
# ggplot(aes(TEAM_BATTING_HBP)) +
# geom_histogram(bins = 50) +
# geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "blue", lty = 2) +
# geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
# labs(x = element_blank(),
# y = "Count",
# title = "Hit By a Pitch TEST EVALUATION DATASET - REVIEW TO IMPUTE",
# caption = "* Blue line is the mean value and green is the median")
# Set the seed for reproducibility
set.seed(12345)
# Perform Multiple Imputation
imputed_TEST_df <- mice(moneyball_testdta, m=5, method='pmm', print=FALSE)
completedData1 <- complete(imputed_TEST_df,1)
summary(completedData1) %>%
kable() %>%
kable_styling()| INDEX | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 9 | Min. : 819 | Min. : 44.0 | Min. : 14.00 | Min. : 0.00 | Min. : 15.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. :42.00 | Min. : 1155 | Min. : 0.0 | Min. : 136.0 | Min. : 0.0 | Min. : 73.0 | Min. : 69.0 | |
| 1st Qu.: 708 | 1st Qu.:1387 | 1st Qu.:210.0 | 1st Qu.: 35.00 | 1st Qu.: 44.50 | 1st Qu.:436.5 | 1st Qu.: 532.5 | 1st Qu.: 60.5 | 1st Qu.: 42.00 | 1st Qu.:55.00 | 1st Qu.: 1426 | 1st Qu.: 52.0 | 1st Qu.: 471.0 | 1st Qu.: 608.5 | 1st Qu.: 131.0 | 1st Qu.:127.0 | |
| Median :1249 | Median :1455 | Median :239.0 | Median : 52.00 | Median :101.00 | Median :509.0 | Median : 677.0 | Median : 96.0 | Median : 57.00 | Median :66.00 | Median : 1515 | Median :104.0 | Median : 526.0 | Median : 731.0 | Median : 163.0 | Median :146.0 | |
| Mean :1264 | Mean :1469 | Mean :241.3 | Mean : 55.91 | Mean : 95.63 | Mean :499.0 | Mean : 700.4 | Mean :134.4 | Mean : 69.54 | Mean :66.61 | Mean : 1813 | Mean :102.1 | Mean : 552.4 | Mean : 790.1 | Mean : 249.7 | Mean :142.2 | |
| 3rd Qu.:1832 | 3rd Qu.:1548 | 3rd Qu.:278.5 | 3rd Qu.: 72.00 | 3rd Qu.:135.50 | 3rd Qu.:565.5 | 3rd Qu.: 904.5 | 3rd Qu.:159.5 | 3rd Qu.:115.00 | 3rd Qu.:74.00 | 3rd Qu.: 1681 | 3rd Qu.:142.5 | 3rd Qu.: 606.5 | 3rd Qu.: 929.5 | 3rd Qu.: 252.0 | 3rd Qu.:162.0 | |
| Max. :2525 | Max. :2170 | Max. :376.0 | Max. :155.00 | Max. :242.00 | Max. :792.0 | Max. :1268.0 | Max. :580.0 | Max. :154.00 | Max. :96.00 | Max. :22768 | Max. :336.0 | Max. :2008.0 | Max. :9963.0 | Max. :1568.0 | Max. :204.0 |
# # Density plots
# ggplot(moneyball_testdta , aes(x=TEAM_BATTING_HBP, fill="Original")) +
# geom_density(alpha=0.5) +
# geom_density(data=completedData1, aes(x=TEAM_BATTING_HBP, fill="Imputed"), alpha=0.5) +
# labs(title="Density Plot of TEAM_BATTING_HBP: Original vs. Imputed")
# #check 0's again
# completedData1 %>%
# gather(variable, value) %>%
# filter(value == 0) %>%
# group_by(variable) %>%
# tally() %>%
# mutate(percent = n / nrow(completedData1) * 100) %>%
# mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
# arrange(desc(n)) %>%
# rename(`Variable With Zeros IMPUTED TEST EVAL DATASET` = variable,
# `Number of Records` = n,
# `Share of Total` = percent) %>%
# kable() %>%
# kable_styling()
#Analysis of Missing values on the imputed dataset
missing2 <- colSums(completedData1 %>% sapply(is.na))
missing_pct2 <- round(missing2 / nrow(df) * 100, 2)
stack(sort(missing_pct2, decreasing = TRUE))As seen on the histogram distributions, the spread of the data shows right and left skew and we can perform some math transformations (might help in normalizing the variability of the data).
Log and Box Cox transformations are used here for that purpose as
seen on the comparative histograms below. Logarithm scale, allows small
values that are close together are spread further out. Larger values
that are spread out are brought closer together.
We can create a new variable such as TEAM_BATTING_1B to see if it has
any influence over the target wins through our model build
process.
We have imputed the HBP and the CS variables with the medians in this
method. After imputing with medians we have applied log
transformation.The HBP variable was missing a large number cases. From
the qqplot on the original HBP and then the log transformed HBP_T show
that the new variable is not normal on the qqplot. There is a large
deviation on left and the right tails. Hence we will consider dropping
this transformed variable from the model later on and utilize the other
variables to see if the fit of the data to the model is
applicable.
3B_t and E_T both on earlier qqplot showed some non-linear on the tails and these seemed have smoothed over better with the log and boxcox transformation so we can keep these in use for the model fit.
# New variable: TEAM_BATTING_1B
tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv", header = TRUE) %>% dplyr::select(-INDEX)
baseline <- tdata
updbase <- baseline %>% mutate(TEAM_BATTING_1B = baseline$TEAM_BATTING_H - dplyr::select(., TEAM_BATTING_2B:TEAM_BATTING_HR) %>% rowSums(na.rm = FALSE))
head(updbase)#Check new variable distribution
updbase %>%
ggplot(aes(TEAM_BATTING_1B)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_1B, na.rm = T)), col = "blue", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_1B, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "New variable: TEAM_BATTING_1B",
caption = "* Blue line is the mean value and green is the median")#TRANSFORM ALL OTHER VARIABLES
transformDS <- updbase
#keep HBP and let's not drop it.
med_TF <- round(median(transformDS$TEAM_BATTING_HBP, na.rm = TRUE),0)
transformDS[is.na(transformDS[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_TF
#keep CS and let's not drop it.
med_CS <- round(median(transformDS$TEAM_BASERUN_CS, na.rm = TRUE),0)
transformDS[is.na(transformDS[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS
#Analysis of Missing values on the imputed dataset
missing6 <- colSums(transformDS %>% sapply(is.na))
missing_pct6 <- round(missing6 / nrow(df) * 100, 2)
stack(sort(missing_pct6, decreasing = TRUE))#Log transform TEAM_BASERUN_CS
transformDS$TEAM_BASERUN_CS_t <-log(transformDS$TEAM_BASERUN_CS)
baserun_cs <- ggplot(transformDS, aes(x=TEAM_BASERUN_CS)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BASERUN_CS")
baserun_cs_tf <- ggplot(transformDS, aes(x=TEAM_BASERUN_CS_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#Log transform TEAM_BATTING_HBP
transformDS$TEAM_BATTING_HBP_t <-log(transformDS$TEAM_BATTING_HBP)
baserun_HBP <- ggplot(transformDS, aes(x=TEAM_BASERUN_HBP)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_HBP")
baserun_HBP_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_HBP_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#Log transform TEAM_BASERUN_SB
transformDS$TEAM_BASERUN_SB_t <-log(transformDS$TEAM_BASERUN_SB)
baserun_sb <- ggplot(transformDS, aes(x=TEAM_BASERUN_SB)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BASERUN_SB")
baserun_sb_tf <- ggplot(transformDS, aes(x=TEAM_BASERUN_SB_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#Log transform TEAM_BATTING_2B
transformDS$TEAM_BATTING_2B_t <-log(transformDS$TEAM_BATTING_2B)
batting_2b <- ggplot(transformDS, aes(x=TEAM_BATTING_2B)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_2B")
batting_2b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_2B_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#Log transform TEAM_BATTING_3B
transformDS$TEAM_BATTING_3B_t <-log(transformDS$TEAM_BATTING_3B)
batting_3b <- ggplot(transformDS, aes(x=TEAM_BATTING_3B)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_3B")
batting_3b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_3B_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#BoxCoxtransform TEAM_BATTING_BB
transformDS$TEAM_BATTING_BB_t <- BoxCox(transformDS$TEAM_BATTING_BB, BoxCox.lambda(transformDS$TEAM_BATTING_BB))
batting_bb <- ggplot(transformDS, aes(x=TEAM_BATTING_BB)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_BB")
batting_bb_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_BB_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "BoxCox Transformed")
#BoxCoxtransform TEAM_BATTING_H
transformDS$TEAM_BATTING_H_t <- BoxCox(transformDS$TEAM_BATTING_H, BoxCox.lambda(transformDS$TEAM_BATTING_H))
batting_h <- ggplot(transformDS, aes(x=TEAM_BATTING_H)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_H")
batting_h_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_H_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "BoxCox Transformed")
#BoxCoxtransform TEAM_BATTING_1B
transformDS$TEAM_BATTING_1B_t <- BoxCox(transformDS$TEAM_BATTING_1B, BoxCox.lambda(transformDS$TEAM_BATTING_1B))
batting_1b <- ggplot(transformDS, aes(x=TEAM_BATTING_1B)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_BATTING_1B")
batting_1b_tf <- ggplot(transformDS, aes(x=TEAM_BATTING_1B_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "BoxCox Transformed")
#BoxCoxtransform TEAM_FIELDING_E
transformDS$TEAM_FIELDING_E_t <- BoxCox(transformDS$TEAM_FIELDING_E, BoxCox.lambda(transformDS$TEAM_FIELDING_E))
fielding_e <- ggplot(transformDS, aes(x=TEAM_FIELDING_E)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_FIELDING_E")
fielding_e_tf <- ggplot(transformDS, aes(x=TEAM_FIELDING_E_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "BoxCox Transformed")
#Log transform TEAM_PITCHING_BB
transformDS$TEAM_PITCHING_BB_t <-log(transformDS$TEAM_PITCHING_BB)
pitching_bb <- ggplot(transformDS, aes(x=TEAM_PITCHING_BB)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_PITCHING_BB")
pitching_bb_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_BB_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
#BoxCoxtransform TEAM_PITCHING_H
transformDS$TEAM_PITCHING_H_t <- BoxCox(transformDS$TEAM_PITCHING_H, BoxCox.lambda(transformDS$TEAM_PITCHING_H))
pitching_h <- ggplot(transformDS, aes(x=TEAM_PITCHING_H)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_PITCHING_H")
pitching_h_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_H_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "BoxCox Transformed")
#Log transform TEAM_PITCHING_SO
transformDS$TEAM_PITCHING_SO_t <-log(transformDS$TEAM_PITCHING_SO)
pitching_so <- ggplot(transformDS, aes(x=TEAM_PITCHING_SO)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "TEAM_PITCHING_SO")
pitching_so_tf <- ggplot(transformDS, aes(x=TEAM_PITCHING_SO_t)) +
geom_histogram(aes(y=..density..), colour="black", fill="red") +
geom_density(alpha=.8, fill="lightblue") +
theme_classic() + labs(title = "Log Transformed")
p1<- plot_grid(baserun_cs, baserun_cs_tf, baserun_sb, baserun_sb_tf,batting_3b, batting_3b_tf, batting_bb, batting_bb_tf,
batting_h, batting_h_tf, ncol=2)
p2<- plot_grid(batting_1b, batting_1b_tf, fielding_e, fielding_e_tf, pitching_bb, pitching_bb_tf,
pitching_h, pitching_h_tf, pitching_so, pitching_so_tf,ncol=2)
p1#For Bimodal we want to understand qqplot.
qqplot_BATTING_HBP_t <- ggplot(transformDS, aes(sample = TEAM_BATTING_HBP_t)) +
stat_qq() +
stat_qq_line() +
labs(title="TEAM_BATTING_HBP_t")
qqplot_BATTING_HBP_t#For Bimodal we want to understand qqplot.
qqplot_BATTING_3B_t <- ggplot(transformDS, aes(sample = TEAM_BATTING_3B_t)) +
stat_qq() +
stat_qq_line() +
labs(title="BATTING_3B_t")
qqplot_BATTING_3B_t #for this variable had several outliers and higly skewed distribution
qqplot_TEAM_FIELDING_E_t <- ggplot(transformDS, aes(sample = TEAM_FIELDING_E_t)) +
stat_qq() +
stat_qq_line() +
labs(title="TEAM_FIELDING_E_t")
qqplot_TEAM_FIELDING_E_t#Test data transformations to be consistent
ttest <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-evaluation-data.csv", header = TRUE)%>%dplyr::select(-INDEX)
basetest <- ttest %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - dplyr::select(., TEAM_BATTING_2B:TEAM_BATTING_HR) %>% rowSums(na.rm = FALSE))
test_tf <- basetest
#keep HBP and let's not drop it.
med_TF1 <- round(median(basetest$TEAM_BATTING_HBP, na.rm = TRUE),0)
basetest[is.na(basetest[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_TF1
#keep CS and let's not drop it.
med_CS1 <- round(median(basetest$TEAM_BASERUN_CS, na.rm = TRUE),0)
basetest[is.na(basetest[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS1
#Analysis of Missing values on the imputed dataset
missing7 <- colSums(basetest %>% sapply(is.na))
missing_pct7 <- round(missing7 / nrow(df) * 100, 2)
stack(sort(missing_pct7, decreasing = TRUE))#Log transform TEAM_BASERUN_CS
test_tf$TEAM_BASERUN_CS_t <-log(test_tf$TEAM_BASERUN_CS)
#Log transform TEAM_BASERUN_SB
test_tf$TEAM_BASERUN_SB_t <-log(test_tf$TEAM_BASERUN_SB)
#Log transform TEAM_BATTING_3B
test_tf$TEAM_BATTING_3B_t <-log(test_tf$TEAM_BATTING_3B)
#BoxCoxtransform TEAM_BATTING_BB
test_tf$TEAM_BATTING_BB_t <- BoxCox(test_tf$TEAM_BATTING_BB, BoxCox.lambda(test_tf$TEAM_BATTING_BB))
#BoxCoxtransform TEAM_BATTING_H
test_tf$TEAM_BATTING_H_t <- BoxCox(test_tf$TEAM_BATTING_H, BoxCox.lambda(test_tf$TEAM_BATTING_H))
#BoxCoxtransform TEAM_BATTING_1B
test_tf$TEAM_BATTING_1B_t <- BoxCox(test_tf$TEAM_BATTING_1B, BoxCox.lambda(test_tf$TEAM_BATTING_1B))
#BoxCoxtransform TEAM_FIELDING_E
test_tf$TEAM_FIELDING_E_t <- BoxCox(test_tf$TEAM_FIELDING_E, BoxCox.lambda(test_tf$TEAM_FIELDING_E))
#Log transform TEAM_PITCHING_BB
test_tf$TEAM_PITCHING_BB_t <-log(test_tf$TEAM_PITCHING_BB)
#BoxCoxtransform TEAM_PITCHING_H
test_tf$TEAM_PITCHING_H_t <- BoxCox(test_tf$TEAM_PITCHING_H, BoxCox.lambda(test_tf$TEAM_PITCHING_H))
#Log transform TEAM_PITCHING_SO
test_tf$TEAM_PITCHING_SO_t <-log(test_tf$TEAM_PITCHING_SO)Some common statistics variables used in baseball can be derived from existing variables in our dataset.
We have created few variables below to see how we can fit a model to
explore these variables and their influence on predicting Target Wins.
AVG (Batting Average) - Rate of hits per at bat. Formula - H/AB
OBP (On-Base Percentage) - The rate at which a batter reached base in
his plate appearances. Formula -H+BB+HBP/AB+BB+HBP+SF. We don’t have a
column for SF(sacrifice flies) so we will calculate without it.
Slugging Percentage (SLG) - Slugging percentage represents the total
number of bases a player records per at-bat. Unlike on-base percentage,
slugging percentage deals only with hits and does not include walks and
hit-by-pitches in its equation.Formula - (1B + 2Bx2 + 3Bx3 +
HRx4)/AB
After imputing HBP and CS with the median values, we have then
proceeded to calculate the other new variables. We do this for the test
evaluation dataset to remain consistent.
# New variable
NEW_features <- function(df1) {
df1 %>%
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR) %>%
mutate(TEAM_BATTING_AB = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_BATTING_SO) %>%
mutate(TEAM_BATTING_AVG = TEAM_BATTING_H/TEAM_BATTING_AB) %>%
mutate(TEAM_BATTING_OBP = (TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_HBP)/(TEAM_BATTING_AB + TEAM_BATTING_BB + TEAM_BATTING_HBP)) %>%
mutate(TEAM_BATTING_SLG = (TEAM_BATTING_1B + 2*TEAM_BATTING_2B + 3*TEAM_BATTING_3B + 4*TEAM_BATTING_HR)/TEAM_BATTING_AB)
}
#Impute with median value
NEW_tdata <- read.csv("https://raw.githubusercontent.com/BanuB/CUNY-DATA-621/refs/heads/main/moneyball-training-data.csv", header = TRUE) %>% dplyr::select(-INDEX)
train_impHBP <- NEW_tdata
med_hbp <- round(median(train_impHBP$TEAM_BATTING_HBP, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BATTING_HBP"]), "TEAM_BATTING_HBP"] <- med_hbp
#keep CS and let's not drop it.
med_CS1 <- round(median(train_impHBP$TEAM_BASERUN_CS, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BASERUN_CS"]), "TEAM_BASERUN_CS"] <- med_CS1
#keep CS and let's not drop it.
med_SO <- round(median(train_impHBP$TEAM_BATTING_SO, na.rm = TRUE),0)
train_impHBP[is.na(train_impHBP[,"TEAM_BATTING_SO"]), "TEAM_BATTING_SO"] <- med_SO
dtrain_ALLNEWfeat <- NEW_features(train_impHBP)
dtrain_ALLNEWfeat#Analysis of Missing values on the imputed dataset
missing5 <- colSums(dtrain_ALLNEWfeat %>% sapply(is.na))
missing_pct5 <- round(missing5 / nrow(df) * 100, 2)
stack(sort(missing_pct5, decreasing = TRUE))dtrain_ALLNEWfeat %>%
gather(variable, value, -c(TARGET_WINS:TEAM_BATTING_1B)) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "lightblue", color="lightblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
facet_wrap(~variable, scales ="free", nrow = 4) +
labs(x = element_blank(), y = "Wins")dtrain_ALLNEWfeat %>%
ggplot(aes(TEAM_BATTING_AB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_AB, na.rm = T)), col = "blue", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_AB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "TEAM_BATTING_AB TEST EVALUATION DATASET - REVIEW TO IMPUTE",
caption = "* Blue line is the mean value and green is the median")Using the training data set, build at least three different multiple
linear regression models, using different variables (or the same
variables with different transformations). Since we have not yet covered
automated variable selection methods, you should select the variables
manually (unless you previously learned Forward or Stepwise selection,
etc.). Since you manually selected a variable for inclusion into the
model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For
example, if a team hits a lot of Home Runs, it would be reasonably
expected that such a team would win more games. However, if the
coefficient is negative (suggesting that the team would lose more
games), then that needs to be discussed. Are you keeping the model even
though it is counter intuitive? Why? The boss needs to know.
The approach 1 is to construct a full model. Here, we have used the
mice imputed dataset and did not drop the HBP and CS variables.A few
records that had 0 values , we let them remain as is.
In the Coefficients section, TEAM_BATTING_H, TEAM_BATTING_SO,TEAM_BASERUN_SB are considered significant by the model because their p-value is lower than 0.05 , however, TEAM_BASERUN_CS is not considered significant.
As seem above, SB — The total number of bases stolen by a team’s
hitters. Stolen bases give runners opportunities to get in better
scoring positions, therefore it has a positive significance here.
# full model using mice imputed data
mfull <- lm(TARGET_WINS ~., data = completedData, na.action = na.omit)
summary(mfull)##
## Call:
## lm(formula = TARGET_WINS ~ ., data = completedData, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.958 -8.314 0.075 8.084 63.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.2908245 5.3871360 7.850 6.35e-15 ***
## TEAM_BATTING_H 0.0426291 0.0035300 12.076 < 2e-16 ***
## TEAM_BATTING_2B -0.0192768 0.0088528 -2.177 0.02955 *
## TEAM_BATTING_3B 0.0422177 0.0164161 2.572 0.01018 *
## TEAM_BATTING_HR 0.0516522 0.0262148 1.970 0.04892 *
## TEAM_BATTING_BB 0.0163852 0.0056562 2.897 0.00381 **
## TEAM_BATTING_SO -0.0180456 0.0024611 -7.332 3.13e-13 ***
## TEAM_BASERUN_SB 0.0543822 0.0048339 11.250 < 2e-16 ***
## TEAM_BASERUN_CS -0.0090415 0.0103203 -0.876 0.38107
## TEAM_BATTING_HBP -0.0617980 0.0258507 -2.391 0.01690 *
## TEAM_PITCHING_H 0.0016089 0.0003825 4.207 2.69e-05 ***
## TEAM_PITCHING_HR 0.0334149 0.0233084 1.434 0.15183
## TEAM_PITCHING_BB -0.0081385 0.0040059 -2.032 0.04231 *
## TEAM_PITCHING_SO 0.0025787 0.0008812 2.926 0.00346 **
## TEAM_FIELDING_E -0.0438559 0.0026641 -16.462 < 2e-16 ***
## TEAM_FIELDING_DP -0.1296883 0.0128955 -10.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.49 on 2260 degrees of freedom
## Multiple R-squared: 0.375, Adjusted R-squared: 0.3709
## F-statistic: 90.41 on 15 and 2260 DF, p-value: < 2.2e-16
WE have manually used only significant features of the first model,
and here we see that the R-squared has slightly decreased to 0.3357 from
0.3709 which implies that the full model is better performing.
From the ANOVA result, we observe that the p-value (2.2e-16 ) is very small (less than 0.05), so we reject the null hypothesis, meaning that the second model is not an improvement of the first one.
#full model using mice imputed data
mfull2 <- lm(TARGET_WINS ~TEAM_BATTING_H + TEAM_BATTING_SO +TEAM_BASERUN_SB +TEAM_PITCHING_H+TEAM_FIELDING_E+ TEAM_FIELDING_DP, data = completedData, na.action = na.omit)
summary(mfull2)##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = completedData, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.233 -8.787 -0.069 8.524 53.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.967534 3.898330 5.892 4.39e-09 ***
## TEAM_BATTING_H 0.052224 0.002132 24.494 < 2e-16 ***
## TEAM_BATTING_SO -0.002267 0.001452 -1.561 0.119
## TEAM_BASERUN_SB 0.046403 0.003760 12.342 < 2e-16 ***
## TEAM_PITCHING_H 0.001405 0.000296 4.746 2.21e-06 ***
## TEAM_FIELDING_E -0.047441 0.002471 -19.200 < 2e-16 ***
## TEAM_FIELDING_DP -0.101815 0.012671 -8.036 1.48e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.84 on 2269 degrees of freedom
## Multiple R-squared: 0.3374, Adjusted R-squared: 0.3357
## F-statistic: 192.6 on 6 and 2269 DF, p-value: < 2.2e-16
We have built 3 MODELS using automated variable selection process and
StepAIC below. Additionally calculate the RMSE value and extract
coefficients from the top 2 models to evaluate during model slection
process.
set.seed(1024) #Set Seed
split <- createDataPartition(df_imputed$TARGET_WINS, p = 0.8, list = FALSE)
train_data <- df_imputed[split, ]
test_data <- df_imputed[-split, ]
#Build Model with all params
full_model <- lm(TARGET_WINS ~ ., data = train_data)
#Automatic Tuning
#Tune Forward
forward_model <- stepAIC(lm(TARGET_WINS ~ 1, data = train_data),
direction = "forward",
scope = formula(full_model),
trace = FALSE)
#Tune Backward
backward_model <- stepAIC(full_model, direction = "backward", trace = FALSE)
#Tune Stepwise
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)
#Calculate RMSE function
evaluate_rmse <- function(model, test_data) {
predictions <- predict(model, newdata = test_data)
rmse <- sqrt(mean((test_data$TARGET_WINS - predictions)^2))
return(rmse)
}
#Calculate RMSE
rmse_forward <- evaluate_rmse(forward_model, test_data)
rmse_backward <- evaluate_rmse(backward_model, test_data)
rmse_stepwise <- evaluate_rmse(stepwise_model, test_data)
#Visualize RMSE for each model
rmse_table <- data.frame(
Model = c("Forward Selection", "Backward Elimination", "Stepwise Selection"),
RMSE = c(rmse_forward, rmse_backward, rmse_stepwise)
)
#Select the top two Models
top_two_models <- rmse_table %>% arrange(RMSE) %>% head(2)
print("Top 2 models by RMSE:")## [1] "Top 2 models by RMSE:"
## Model RMSE
## 1 Forward Selection 13.58656
## 2 Backward Elimination 13.59236
#Extract coefficients for the top two models
get_model_coefficients <- function(model) {
coef_df <- as.data.frame(coef(model))
coef_df$Variable <- rownames(coef_df)
colnames(coef_df) <- c("Coefficient", "Variable")
return(coef_df)
}
#Find the top two models by RMSE
model_names <- top_two_models$Model
if ("Forward Selection" %in% model_names) {
forward_coefficients <- get_model_coefficients(forward_model)
print("Forward Selection Coefficients:")
print(forward_coefficients)
}## [1] "Forward Selection Coefficients:"
## Coefficient Variable
## (Intercept) 36.190002364 (Intercept)
## TEAM_BATTING_H 0.043818530 TEAM_BATTING_H
## TEAM_FIELDING_E -0.037227545 TEAM_FIELDING_E
## TEAM_BASERUN_SB 0.036793895 TEAM_BASERUN_SB
## TEAM_FIELDING_DP -0.121274716 TEAM_FIELDING_DP
## TEAM_PITCHING_HR 0.014478155 TEAM_PITCHING_HR
## TEAM_BATTING_SO -0.015518416 TEAM_BATTING_SO
## TEAM_PITCHING_SO 0.004879867 TEAM_PITCHING_SO
## TEAM_BATTING_3B 0.059566279 TEAM_BATTING_3B
## TEAM_BATTING_HR 0.060720213 TEAM_BATTING_HR
## TEAM_BATTING_2B -0.019572820 TEAM_BATTING_2B
## TEAM_BATTING_BB 0.005894305 TEAM_BATTING_BB
if ("Backward Elimination" %in% model_names) {
backward_coefficients <- get_model_coefficients(backward_model)
print("Backward Elimination Coefficients:")
print(backward_coefficients)
}## [1] "Backward Elimination Coefficients:"
## Coefficient Variable
## (Intercept) 35.766287007 (Intercept)
## TEAM_BATTING_H 0.044118333 TEAM_BATTING_H
## TEAM_BATTING_2B -0.019668984 TEAM_BATTING_2B
## TEAM_BATTING_3B 0.059934066 TEAM_BATTING_3B
## TEAM_BATTING_HR 0.076041736 TEAM_BATTING_HR
## TEAM_BATTING_BB 0.005845841 TEAM_BATTING_BB
## TEAM_BATTING_SO -0.015742469 TEAM_BATTING_SO
## TEAM_BASERUN_SB 0.036918317 TEAM_BASERUN_SB
## TEAM_PITCHING_SO 0.005049941 TEAM_PITCHING_SO
## TEAM_FIELDING_E -0.037040233 TEAM_FIELDING_E
## TEAM_FIELDING_DP -0.121410201 TEAM_FIELDING_DP
Similar to linear regression, model evaluation metrics such as mean
squared error (MSE), R-squared, or adjusted R-squared can be used to
assess the performance of the polynomial regression model.The degree
selection is important as it can lead to underfitting when it is smaller
or overfitting if its higher. The ADJ R^2 of 0.0.4627 is the best fit so
far based on our squares. Also the RMSE calculated is 30.20802.
set.seed(1024)
Equation <- "TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_BASERUN_CS^2) + I(TEAM_PITCHING_H^2) + I(TEAM_PITCHING_HR^2) + I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_SO^2) + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_HR^3) + I(TEAM_BATTING_BB^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BASERUN_SB^3) + I(TEAM_BASERUN_CS^3) + I(TEAM_PITCHING_H^3) + I(TEAM_PITCHING_HR^3) + I(TEAM_PITCHING_BB^3) + I(TEAM_PITCHING_SO^3) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_DP^3) + I(TEAM_BATTING_2B^4) + I(TEAM_BATTING_3B^4) + I(TEAM_BATTING_HR^4) + I(TEAM_BATTING_BB^4) + I(TEAM_BATTING_SO^4) + I(TEAM_BASERUN_SB^4) + I(TEAM_BASERUN_CS^4) + I(TEAM_PITCHING_H^4) + I(TEAM_PITCHING_HR^4) + I(TEAM_PITCHING_BB^4) + I(TEAM_PITCHING_SO^4) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_DP^4) "
mpoly <- lm(Equation, completedData)
step_back <- MASS::stepAIC(mpoly, direction="backward", trace = F)
poly_call <- summary(step_back)$call
step_back <- lm(poly_call[2], completedData)
summary(step_back)##
## Call:
## lm(formula = poly_call[2], data = completedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.734 -7.150 -0.064 7.271 59.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.403e+02 4.562e+01 -3.076 0.00212 **
## TEAM_BATTING_2B 1.283e+00 1.765e-01 7.266 5.10e-13 ***
## TEAM_BATTING_BB 6.948e-01 9.351e-02 7.430 1.54e-13 ***
## TEAM_BATTING_SO -2.716e-01 5.003e-02 -5.430 6.25e-08 ***
## TEAM_PITCHING_H 3.170e-02 3.665e-03 8.648 < 2e-16 ***
## TEAM_PITCHING_SO 2.874e-02 1.057e-02 2.720 0.00658 **
## TEAM_FIELDING_E -2.173e-01 2.206e-02 -9.851 < 2e-16 ***
## TEAM_FIELDING_DP 2.573e+00 1.346e+00 1.912 0.05606 .
## I(TEAM_BATTING_2B^2) -5.165e-03 7.039e-04 -7.338 3.03e-13 ***
## I(TEAM_BATTING_3B^2) 2.570e-03 3.043e-04 8.445 < 2e-16 ***
## I(TEAM_BATTING_HR^2) -1.846e-03 9.764e-04 -1.890 0.05888 .
## I(TEAM_BATTING_BB^2) -2.100e-03 3.069e-04 -6.843 9.95e-12 ***
## I(TEAM_BATTING_SO^2) 5.225e-04 9.370e-05 5.576 2.76e-08 ***
## I(TEAM_BASERUN_SB^2) 5.771e-04 7.215e-05 7.999 2.00e-15 ***
## I(TEAM_BASERUN_CS^2) -1.326e-03 6.306e-04 -2.103 0.03555 *
## I(TEAM_PITCHING_H^2) -2.818e-06 5.122e-07 -5.501 4.20e-08 ***
## I(TEAM_PITCHING_HR^2) 2.162e-03 7.769e-04 2.783 0.00543 **
## I(TEAM_PITCHING_BB^2) -1.514e-04 1.958e-05 -7.729 1.62e-14 ***
## I(TEAM_PITCHING_SO^2) -1.401e-05 3.096e-06 -4.527 6.30e-06 ***
## I(TEAM_FIELDING_E^2) 3.106e-04 5.322e-05 5.837 6.11e-09 ***
## I(TEAM_FIELDING_DP^2) -3.170e-02 1.508e-02 -2.102 0.03568 *
## I(TEAM_BATTING_2B^3) 6.714e-06 9.179e-07 7.315 3.58e-13 ***
## I(TEAM_BATTING_3B^3) -1.241e-05 1.856e-06 -6.689 2.82e-11 ***
## I(TEAM_BATTING_HR^3) 1.409e-05 7.028e-06 2.005 0.04504 *
## I(TEAM_BATTING_BB^3) 2.988e-06 4.642e-07 6.437 1.49e-10 ***
## I(TEAM_BATTING_SO^3) -4.604e-07 8.869e-08 -5.191 2.29e-07 ***
## I(TEAM_BASERUN_SB^3) -1.561e-06 2.576e-07 -6.058 1.61e-09 ***
## I(TEAM_BASERUN_CS^3) 1.519e-05 7.003e-06 2.169 0.03022 *
## I(TEAM_PITCHING_H^3) 9.899e-11 3.031e-11 3.266 0.00111 **
## I(TEAM_PITCHING_HR^3) -1.225e-05 4.954e-06 -2.473 0.01346 *
## I(TEAM_PITCHING_BB^3) 1.050e-07 1.556e-08 6.751 1.87e-11 ***
## I(TEAM_PITCHING_SO^3) 2.064e-09 3.926e-10 5.258 1.60e-07 ***
## I(TEAM_FIELDING_E^3) -2.382e-07 4.827e-08 -4.935 8.63e-07 ***
## I(TEAM_FIELDING_DP^3) 1.530e-04 7.285e-05 2.100 0.03584 *
## I(TEAM_BATTING_HR^4) -2.890e-08 1.451e-08 -1.992 0.04645 *
## I(TEAM_BATTING_BB^4) -1.462e-09 2.511e-10 -5.823 6.61e-09 ***
## I(TEAM_BATTING_SO^4) 1.412e-10 3.082e-11 4.581 4.87e-06 ***
## I(TEAM_BASERUN_SB^4) 1.218e-09 2.443e-10 4.986 6.64e-07 ***
## I(TEAM_BASERUN_CS^4) -4.456e-08 2.081e-08 -2.141 0.03235 *
## I(TEAM_PITCHING_H^4) -1.112e-15 5.686e-16 -1.956 0.05061 .
## I(TEAM_PITCHING_HR^4) 1.969e-08 8.742e-09 2.253 0.02438 *
## I(TEAM_PITCHING_BB^4) -1.996e-11 3.377e-12 -5.910 3.95e-09 ***
## I(TEAM_PITCHING_SO^4) -7.272e-14 1.365e-14 -5.328 1.09e-07 ***
## I(TEAM_FIELDING_E^4) 6.091e-11 1.450e-11 4.202 2.75e-05 ***
## I(TEAM_FIELDING_DP^4) -2.600e-07 1.285e-07 -2.023 0.04315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.55 on 2231 degrees of freedom
## Multiple R-squared: 0.4731, Adjusted R-squared: 0.4627
## F-statistic: 45.53 on 44 and 2231 DF, p-value: < 2.2e-16
## (Intercept) TEAM_BATTING_2B TEAM_BATTING_BB
## -1.403321e+02 1.282765e+00 6.947867e-01
## TEAM_BATTING_SO TEAM_PITCHING_H TEAM_PITCHING_SO
## -2.716374e-01 3.169657e-02 2.874148e-02
## TEAM_FIELDING_E TEAM_FIELDING_DP I(TEAM_BATTING_2B^2)
## -2.173048e-01 2.572819e+00 -5.164862e-03
## I(TEAM_BATTING_3B^2) I(TEAM_BATTING_HR^2) I(TEAM_BATTING_BB^2)
## 2.569934e-03 -1.845551e-03 -2.100048e-03
## I(TEAM_BATTING_SO^2) I(TEAM_BASERUN_SB^2) I(TEAM_BASERUN_CS^2)
## 5.224734e-04 5.770797e-04 -1.326280e-03
## I(TEAM_PITCHING_H^2) I(TEAM_PITCHING_HR^2) I(TEAM_PITCHING_BB^2)
## -2.817604e-06 2.162038e-03 -1.513564e-04
## I(TEAM_PITCHING_SO^2) I(TEAM_FIELDING_E^2) I(TEAM_FIELDING_DP^2)
## -1.401396e-05 3.106118e-04 -3.170085e-02
## I(TEAM_BATTING_2B^3) I(TEAM_BATTING_3B^3) I(TEAM_BATTING_HR^3)
## 6.714131e-06 -1.241406e-05 1.409350e-05
## I(TEAM_BATTING_BB^3) I(TEAM_BATTING_SO^3) I(TEAM_BASERUN_SB^3)
## 2.988198e-06 -4.603664e-07 -1.560651e-06
## I(TEAM_BASERUN_CS^3) I(TEAM_PITCHING_H^3) I(TEAM_PITCHING_HR^3)
## 1.518618e-05 9.899396e-11 -1.225342e-05
## I(TEAM_PITCHING_BB^3) I(TEAM_PITCHING_SO^3) I(TEAM_FIELDING_E^3)
## 1.050250e-07 2.064289e-09 -2.381952e-07
## I(TEAM_FIELDING_DP^3) I(TEAM_BATTING_HR^4) I(TEAM_BATTING_BB^4)
## 1.529962e-04 -2.890410e-08 -1.462366e-09
## I(TEAM_BATTING_SO^4) I(TEAM_BASERUN_SB^4) I(TEAM_BASERUN_CS^4)
## 1.412109e-10 1.218106e-09 -4.455961e-08
## I(TEAM_PITCHING_H^4) I(TEAM_PITCHING_HR^4) I(TEAM_PITCHING_BB^4)
## -1.112181e-15 1.969337e-08 -1.995516e-11
## I(TEAM_PITCHING_SO^4) I(TEAM_FIELDING_E^4) I(TEAM_FIELDING_DP^4)
## -7.272284e-14 6.090874e-11 -2.599881e-07
##
## Call:
## lm(formula = poly_call[2], data = completedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.734 -7.150 -0.064 7.271 59.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.403e+02 4.562e+01 -3.076 0.00212 **
## TEAM_BATTING_2B 1.283e+00 1.765e-01 7.266 5.10e-13 ***
## TEAM_BATTING_BB 6.948e-01 9.351e-02 7.430 1.54e-13 ***
## TEAM_BATTING_SO -2.716e-01 5.003e-02 -5.430 6.25e-08 ***
## TEAM_PITCHING_H 3.170e-02 3.665e-03 8.648 < 2e-16 ***
## TEAM_PITCHING_SO 2.874e-02 1.057e-02 2.720 0.00658 **
## TEAM_FIELDING_E -2.173e-01 2.206e-02 -9.851 < 2e-16 ***
## TEAM_FIELDING_DP 2.573e+00 1.346e+00 1.912 0.05606 .
## I(TEAM_BATTING_2B^2) -5.165e-03 7.039e-04 -7.338 3.03e-13 ***
## I(TEAM_BATTING_3B^2) 2.570e-03 3.043e-04 8.445 < 2e-16 ***
## I(TEAM_BATTING_HR^2) -1.846e-03 9.764e-04 -1.890 0.05888 .
## I(TEAM_BATTING_BB^2) -2.100e-03 3.069e-04 -6.843 9.95e-12 ***
## I(TEAM_BATTING_SO^2) 5.225e-04 9.370e-05 5.576 2.76e-08 ***
## I(TEAM_BASERUN_SB^2) 5.771e-04 7.215e-05 7.999 2.00e-15 ***
## I(TEAM_BASERUN_CS^2) -1.326e-03 6.306e-04 -2.103 0.03555 *
## I(TEAM_PITCHING_H^2) -2.818e-06 5.122e-07 -5.501 4.20e-08 ***
## I(TEAM_PITCHING_HR^2) 2.162e-03 7.769e-04 2.783 0.00543 **
## I(TEAM_PITCHING_BB^2) -1.514e-04 1.958e-05 -7.729 1.62e-14 ***
## I(TEAM_PITCHING_SO^2) -1.401e-05 3.096e-06 -4.527 6.30e-06 ***
## I(TEAM_FIELDING_E^2) 3.106e-04 5.322e-05 5.837 6.11e-09 ***
## I(TEAM_FIELDING_DP^2) -3.170e-02 1.508e-02 -2.102 0.03568 *
## I(TEAM_BATTING_2B^3) 6.714e-06 9.179e-07 7.315 3.58e-13 ***
## I(TEAM_BATTING_3B^3) -1.241e-05 1.856e-06 -6.689 2.82e-11 ***
## I(TEAM_BATTING_HR^3) 1.409e-05 7.028e-06 2.005 0.04504 *
## I(TEAM_BATTING_BB^3) 2.988e-06 4.642e-07 6.437 1.49e-10 ***
## I(TEAM_BATTING_SO^3) -4.604e-07 8.869e-08 -5.191 2.29e-07 ***
## I(TEAM_BASERUN_SB^3) -1.561e-06 2.576e-07 -6.058 1.61e-09 ***
## I(TEAM_BASERUN_CS^3) 1.519e-05 7.003e-06 2.169 0.03022 *
## I(TEAM_PITCHING_H^3) 9.899e-11 3.031e-11 3.266 0.00111 **
## I(TEAM_PITCHING_HR^3) -1.225e-05 4.954e-06 -2.473 0.01346 *
## I(TEAM_PITCHING_BB^3) 1.050e-07 1.556e-08 6.751 1.87e-11 ***
## I(TEAM_PITCHING_SO^3) 2.064e-09 3.926e-10 5.258 1.60e-07 ***
## I(TEAM_FIELDING_E^3) -2.382e-07 4.827e-08 -4.935 8.63e-07 ***
## I(TEAM_FIELDING_DP^3) 1.530e-04 7.285e-05 2.100 0.03584 *
## I(TEAM_BATTING_HR^4) -2.890e-08 1.451e-08 -1.992 0.04645 *
## I(TEAM_BATTING_BB^4) -1.462e-09 2.511e-10 -5.823 6.61e-09 ***
## I(TEAM_BATTING_SO^4) 1.412e-10 3.082e-11 4.581 4.87e-06 ***
## I(TEAM_BASERUN_SB^4) 1.218e-09 2.443e-10 4.986 6.64e-07 ***
## I(TEAM_BASERUN_CS^4) -4.456e-08 2.081e-08 -2.141 0.03235 *
## I(TEAM_PITCHING_H^4) -1.112e-15 5.686e-16 -1.956 0.05061 .
## I(TEAM_PITCHING_HR^4) 1.969e-08 8.742e-09 2.253 0.02438 *
## I(TEAM_PITCHING_BB^4) -1.996e-11 3.377e-12 -5.910 3.95e-09 ***
## I(TEAM_PITCHING_SO^4) -7.272e-14 1.365e-14 -5.328 1.09e-07 ***
## I(TEAM_FIELDING_E^4) 6.091e-11 1.450e-11 4.202 2.75e-05 ***
## I(TEAM_FIELDING_DP^4) -2.600e-07 1.285e-07 -2.023 0.04315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.55 on 2231 degrees of freedom
## Multiple R-squared: 0.4731, Adjusted R-squared: 0.4627
## F-statistic: 45.53 on 44 and 2231 DF, p-value: < 2.2e-16
#calculate RMSE value for polynomial model
rmse_Polymod3 <- evaluate_rmse(step_back, test_data)
rmse_Polymod3## [1] 12.91918
Here we have built a model manually by selecting most of the
transformed variables using log and boxcox on the explanatory variables
, and did not include the HBP_t values since the qqplot showed a large
deviation from the left and right tails. The R^2 value is moderate in
this case and not as great as the polynomial fit above.Polynomial degree
of 3 can accomadate more curves to fit the data, therefore it is the
best fit so far compared to the model with the transformed data.
#data with all transformed variables, we drop HBP_t since on the qqplot it had a significant deviation on the left and right tails.
transformDSmT_data <- dplyr::select(transformDS, -TEAM_BASERUN_SB, -TEAM_BATTING_3B, -TEAM_BATTING_BB, -TEAM_BATTING_H, -TEAM_FIELDING_E, -TEAM_PITCHING_BB, -TEAM_PITCHING_H, -TEAM_PITCHING_SO, -TEAM_BATTING_1B,-TEAM_BATTING_HBP, -TEAM_BASERUN_CS,-TEAM_BATTING_HBP_t)
mT4_data <- lm(TARGET_WINS ~., data = mT_data)
summary(mT4_data)##
## Call:
## lm(formula = TARGET_WINS ~ ., data = mT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.334 -7.005 -0.010 7.332 32.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.442e+04 2.712e+04 -3.113 0.00188 **
## TEAM_BATTING_2B -2.665e-02 5.229e-02 -0.510 0.61029
## TEAM_BATTING_HR 9.400e-02 6.379e-02 1.473 0.14080
## TEAM_BATTING_SO -4.349e-02 8.871e-03 -4.903 1.03e-06 ***
## TEAM_PITCHING_HR -2.433e-02 5.312e-02 -0.458 0.64697
## TEAM_FIELDING_DP -1.078e-01 1.252e-02 -8.610 < 2e-16 ***
## TEAM_BASERUN_CS_t -1.401e+00 9.080e-01 -1.543 0.12307
## TEAM_BASERUN_SB_t 5.612e+00 6.470e-01 8.675 < 2e-16 ***
## TEAM_BATTING_2B_t -1.158e+01 1.392e+01 -0.832 0.40572
## TEAM_BATTING_3B_t 6.681e+00 1.432e+00 4.666 3.29e-06 ***
## TEAM_BATTING_BB_t 7.161e-05 2.274e-05 3.149 0.00167 **
## TEAM_BATTING_H_t 1.376e+05 6.392e+04 2.153 0.03143 *
## TEAM_BATTING_1B_t -2.480e+04 2.937e+04 -0.844 0.39868
## TEAM_FIELDING_E_t -1.206e+03 8.853e+01 -13.625 < 2e-16 ***
## TEAM_PITCHING_BB_t -1.434e+00 6.607e+00 -0.217 0.82824
## TEAM_PITCHING_H_t -2.691e+04 1.848e+04 -1.457 0.14539
## TEAM_PITCHING_SO_t 1.486e+01 6.560e+00 2.266 0.02358 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 1818 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3762, Adjusted R-squared: 0.3707
## F-statistic: 68.53 on 16 and 1818 DF, p-value: < 2.2e-16
Here we have simply manually removed variables that are not as highly
significant from the model with transformed variables above with log and
boxcox.
#data with all transformed variables,KEEP HBP and CS of imputed variable
# mT5_data <- lm(TARGET_WINS ~ TEAM_BATTING_SO + TEAM_BATTING_HR+TEAM_BATTING_1B_t+TEAM_BASERUN_CS_t+TEAM_PITCHING_HR +TEAM_PITCHING_H_t+TEAM_PITCHING_SO_t+ TEAM_BATTING_HBP_t+
# TEAM_FIELDING_DP + TEAM_BASERUN_SB_t +
# TEAM_BATTING_3B_t + TEAM_BATTING_BB_t + TEAM_BATTING_2B_t+
# TEAM_FIELDING_E_t , data = transformDS)
mT5_data <- lm(TARGET_WINS ~ TEAM_BATTING_SO +
TEAM_FIELDING_DP + TEAM_BASERUN_SB_t +
TEAM_BATTING_3B_t+ TEAM_BATTING_BB_t +TEAM_FIELDING_E_t , data = transformDS)
summary(mT5_data)##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO + TEAM_FIELDING_DP +
## TEAM_BASERUN_SB_t + TEAM_BATTING_3B_t + TEAM_BATTING_BB_t +
## TEAM_FIELDING_E_t, data = transformDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.442 -7.637 0.311 7.562 35.741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.900e+03 1.033e+02 18.391 < 2e-16 ***
## TEAM_BATTING_SO -1.788e-02 1.937e-03 -9.229 < 2e-16 ***
## TEAM_FIELDING_DP -8.408e-02 1.307e-02 -6.435 1.58e-10 ***
## TEAM_BASERUN_SB_t 3.670e+00 5.370e-01 6.835 1.11e-11 ***
## TEAM_BATTING_3B_t 7.955e+00 9.169e-01 8.677 < 2e-16 ***
## TEAM_BATTING_BB_t 8.754e-05 5.985e-06 14.626 < 2e-16 ***
## TEAM_FIELDING_E_t -1.572e+03 8.749e+01 -17.967 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.24 on 1828 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.2723, Adjusted R-squared: 0.2699
## F-statistic: 114 on 6 and 1828 DF, p-value: < 2.2e-16
Our approach 5 is to build the model with standard baseball derived variables like AB, AVG, OBP and SLG. These all potentially should have a positive impact on Target Wins. We measured the cooks distance and influential points and then on the 2nd fit of the models removed those outliers and reran the model again to see if the fit , R-squared was higher. Indeed the original adj- R squared was 0.1922 and after removing about the outlier records, the adjusted R-square was 0.2422.
#data with all transformed variables,KEEP HBP and CS of imputed variable
m7 <- lm(TARGET_WINS ~., data = dtrain_ALLNEWfeat, na.action = na.omit)
summary(m7)##
## Call:
## lm(formula = TARGET_WINS ~ ., data = dtrain_ALLNEWfeat, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.768 -6.987 0.074 6.763 30.227
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.519e+02 6.779e+01 2.240 0.02519 *
## TEAM_BATTING_H -2.017e-02 2.070e-02 -0.975 0.32994
## TEAM_BATTING_2B -6.180e-02 2.666e-02 -2.318 0.02055 *
## TEAM_BATTING_3B 1.571e-01 5.618e-02 2.796 0.00523 **
## TEAM_BATTING_HR 9.136e-02 9.888e-02 0.924 0.35562
## TEAM_BATTING_BB 1.362e-01 5.750e-02 2.369 0.01794 *
## TEAM_BATTING_SO 2.102e-03 2.527e-02 0.083 0.93373
## TEAM_BASERUN_SB 7.743e-02 6.268e-03 12.353 < 2e-16 ***
## TEAM_BASERUN_CS -3.754e-02 1.397e-02 -2.688 0.00726 **
## TEAM_BATTING_HBP 1.157e-01 6.800e-02 1.701 0.08912 .
## TEAM_PITCHING_H 6.374e-02 1.594e-02 3.998 6.63e-05 ***
## TEAM_PITCHING_HR -3.422e-02 8.003e-02 -0.428 0.66905
## TEAM_PITCHING_BB -9.230e-02 4.156e-02 -2.221 0.02650 *
## TEAM_PITCHING_SO -4.750e-02 2.144e-02 -2.215 0.02688 *
## TEAM_FIELDING_E -1.193e-01 7.305e-03 -16.334 < 2e-16 ***
## TEAM_FIELDING_DP -1.125e-01 1.234e-02 -9.117 < 2e-16 ***
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BATTING_AB NA NA NA NA
## TEAM_BATTING_AVG 6.805e+01 3.163e+02 0.215 0.82968
## TEAM_BATTING_OBP -2.714e+02 3.605e+02 -0.753 0.45173
## TEAM_BATTING_SLG 3.932e+01 7.237e+01 0.543 0.58696
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.14 on 1816 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4114, Adjusted R-squared: 0.4056
## F-statistic: 70.53 on 18 and 1816 DF, p-value: < 2.2e-16
m6 <- lm(TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat )
summary(m6)##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG +
## TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.849 -9.005 0.550 9.504 50.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.747e+01 8.519e+00 -10.268 < 2e-16 ***
## TEAM_BATTING_AB 2.793e-02 1.963e-03 14.230 < 2e-16 ***
## TEAM_BATTING_AVG -1.343e+02 2.906e+01 -4.621 4.04e-06 ***
## TEAM_BATTING_OBP 2.399e+02 3.166e+01 7.578 5.07e-14 ***
## TEAM_BATTING_SLG 2.018e+01 8.367e+00 2.412 0.0159 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.16 on 2271 degrees of freedom
## Multiple R-squared: 0.1937, Adjusted R-squared: 0.1922
## F-statistic: 136.4 on 4 and 2271 DF, p-value: < 2.2e-16
## 1211
## 0.08032934
#plot cooks distance
plot(m6,which=4)
plot(cooks.distance(m6),type="b",pch=18,col="red")
N = 2271
k = 4
cutoff = 4/ (N-k-1)
abline(h=cutoff,lty=2)
subsetds <- m6cd < 0.001765225
#Find influential & outlier points
sum(m6cd > 0.001765225)## [1] 127
#removing the outliers about 131 influential points does seemed to have helped.
m6upd <- lm(TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG + TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat, subset = m6cd < 0.001765225)
summary(m6upd)##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_AB + TEAM_BATTING_AVG +
## TEAM_BATTING_OBP + TEAM_BATTING_SLG, data = dtrain_ALLNEWfeat,
## subset = m6cd < 0.001765225)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.175 -8.437 0.454 8.530 41.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.806e+01 7.889e+00 -12.430 < 2e-16 ***
## TEAM_BATTING_AB 2.400e-02 1.811e-03 13.252 < 2e-16 ***
## TEAM_BATTING_AVG -2.011e+02 2.698e+01 -7.455 1.30e-13 ***
## TEAM_BATTING_OBP 3.172e+02 2.889e+01 10.982 < 2e-16 ***
## TEAM_BATTING_SLG 3.304e+01 7.547e+00 4.378 1.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.92 on 2144 degrees of freedom
## Multiple R-squared: 0.2436, Adjusted R-squared: 0.2422
## F-statistic: 172.6 on 4 and 2144 DF, p-value: < 2.2e-16
plot(cooks.distance(m6upd),type="b",pch=18,col="red")
N = 2271
k = 4
cutoff = 4/ (N-k-1)
abline(h=cutoff,lty=2)Decide on the criteria for selecting the best multiple linear
regression model. Will you select a model with slightly worse
performance if it makes more sense or is more parsimonious? Discuss why
you selected your model.
For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output.
Using the training data set, evaluate the multiple linear regression
model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d)
residual plots. Make predictions using the evaluation data set
Delving into the coefficients provides insights into how different factors impact team success. Positive coefficients, such as those for TEAM_BATTING_H (0.0438), TEAM_BATTING_HR (0.0607), and TEAM_BATTING_3B (0.0596), indicate that increasing base hits, home runs, and triples, respectively, are associated with a boost in predicted wins. This suggests that improving a team’s hitting performance, particularly with power and extra-base hits, strongly contributes to the overall success of the team. Additionally, TEAM_BASERUN_SB (0.0368) suggests that increasing stolen bases positively affects wins, further highlighting the offensive advantage of aggressive baserunning. On the other hand, the negative coefficients pinpoint areas that detract form team performance. A great example is TEAM_FIELDING_E (-0.0372) indicate that each additional fielding error decreases the predicted number of wins, underscoring the need to minimize defensive mistakes. Interestingly, TEAM_FIELDING_DP (-0.1213) is a strong negative coefficient, suggesting that while double plays are generally viewed positively in game, from a win perspective, it may indicate other weaknesses within the team.
merge_coefficients_with_definitions <- function(coeff_table) {
var_definitions <- data.frame(
Variable = c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
"TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_HBP", "TEAM_BATTING_SO",
"TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_FIELDING_E", "TEAM_FIELDING_DP",
"TEAM_PITCHING_BB", "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_SO"),
Definition = c("Number of wins",
"Base Hits by batters (1B, 2B, 3B, HR)",
"Doubles by batters (2B)",
"Triples by batters (3B)",
"Homeruns by batters (4B)",
"Walks by batters",
"Batters hit by pitch (get a free base)",
"Strikeouts by batters",
"Stolen bases",
"Caught stealing",
"Errors",
"Double Plays",
"Walks allowed",
"Hits allowed",
"Homeruns allowed",
"Strikeouts by pitchers"),
stringsAsFactors = FALSE
)
#Merge the coefficients table with the variable definitions
merged_table <- left_join(coeff_table, var_definitions, by = "Variable")
return(merged_table)
}The majority of the points in the center of the \(Q-Q\) plot fall fairly close to the line of
\(y=x\) indicating that they are
normally distributed.However, the divergence of the tails indicates some
issues for extreme values.
## R-squared: 0.3265676
The residuals vs. fitted values plot suggests that the model is alright. The good news is that the residuals are generally centered around zero, indicating that the model is unbiased in its predictions across the middle range of fitted values. However, there is evidence of heteroscedasticity, as the residuals show a “fanning out” pattern, with increasing variance at higher fitted values, which may affect prediction reliability in these ranges. In addition, there are a few outliers, particularly targeting teams that have low wins, which could be distorting the model’s accuracy.
fitted_vals <- fitted(forward_model)
residuals_vals <- residuals(forward_model)
plot(fitted_vals, residuals_vals,
main = "Residuals vs Fitted",
xlab = "Fitted values",
ylab = "Residuals")
abline(h = 0, col = "red")moneyball_evaluation_data <- read.csv("moneyball-evaluation-data.csv")
moneyball_evaluation_data_imputed <- impute_missing_values(moneyball_evaluation_data,correlation_matrix)## Column name: TEAM_BATTING_SO , missing values: 18
## Working On Column: TEAM_BATTING_SO
## Predictors for column TEAM_BATTING_SO : TEAM_FIELDING_E
## Predictors for column TEAM_BATTING_SO : TEAM_PITCHING_BB
## Imputed 18 missing values in column: TEAM_BATTING_SO
## Column name: TEAM_BASERUN_SB , missing values: 13
## Working On Column: TEAM_BASERUN_SB
## Predictors for column TEAM_BASERUN_SB : TEAM_BATTING_3B
## Predictors for column TEAM_BASERUN_SB : TEAM_PITCHING_H
## Imputed 13 missing values in column: TEAM_BASERUN_SB
## Column name: TEAM_BASERUN_CS , missing values: 87
## Working On Column: TEAM_BASERUN_CS
## Predictors for column TEAM_BASERUN_CS : TEAM_BATTING_3B
## Predictors for column TEAM_BASERUN_CS : TEAM_FIELDING_E
## Imputed 87 missing values in column: TEAM_BASERUN_CS
## Column name: TEAM_BATTING_HBP , missing values: 240
## Working On Column: TEAM_BATTING_HBP
## Predictors for column TEAM_BATTING_HBP : TEAM_PITCHING_HR
## Predictors for column TEAM_BATTING_HBP : TEAM_BATTING_HR
## Imputed 240 missing values in column: TEAM_BATTING_HBP
## Column name: TEAM_PITCHING_SO , missing values: 18
## Working On Column: TEAM_PITCHING_SO
## Predictors for column TEAM_PITCHING_SO : TEAM_FIELDING_E
## Predictors for column TEAM_PITCHING_SO : TEAM_PITCHING_BB
## Imputed 18 missing values in column: TEAM_PITCHING_SO
## Column name: TEAM_FIELDING_DP , missing values: 31
## Working On Column: TEAM_FIELDING_DP
## Predictors for column TEAM_FIELDING_DP : TEAM_BATTING_3B
## Predictors for column TEAM_FIELDING_DP : TEAM_FIELDING_E
## Imputed 31 missing values in column: TEAM_FIELDING_DP
moneyball_evaluation_data_imputed[moneyball_evaluation_data_imputed == "" | is.na(moneyball_evaluation_data_imputed)] <- 0
sum(is.na(moneyball_evaluation_data_imputed))## [1] 0
count_na_as_string <- sum(moneyball_evaluation_data_imputed == "NA", na.rm = TRUE)
moneyball_evaluation_data_imputed[moneyball_evaluation_data_imputed == "NA"] <- 0
#Check for NAs
sum(is.na(moneyball_evaluation_data_imputed))## [1] 0
#Predict wins
predicted_wins <- predict(forward_model, newdata = moneyball_evaluation_data_imputed)
moneyball_evaluation_data$Predicted_Wins <- round(predicted_wins)
#export data
print(predicted_wins)## 1 2 3 4 5 6 7 8
## 63.34201 65.31216 74.58335 86.55665 66.30999 67.67456 80.52421 76.97679
## 9 10 11 12 13 14 15 16
## 71.03161 74.29158 69.13493 83.16481 82.57485 83.13729 85.96313 78.00385
## 17 18 19 20 21 22 23 24
## 74.45590 78.00566 71.61659 90.71735 81.25573 83.20570 80.25591 72.39130
## 25 26 27 28 29 30 31 32
## 82.47815 87.44777 55.33278 75.48136 82.95327 75.31005 89.47883 85.16628
## 33 34 35 36 37 38 39 40
## 81.22681 82.59840 80.04774 86.27811 76.06674 90.12123 87.05489 91.97500
## 41 42 43 44 45 46 47 48
## 82.48081 90.46132 33.04682 102.32632 91.84046 94.27205 99.51838 77.24707
## 49 50 51 52 53 54 55 56
## 69.49431 80.08895 77.39947 85.10195 76.92961 74.71985 75.18438 79.49462
## 57 58 59 60 61 62 63 64
## 91.44009 76.56690 64.33847 78.89128 87.96009 73.25524 87.73089 85.73417
## 65 66 67 68 69 70 71 72
## 84.49861 99.33537 75.02017 80.76783 77.98779 87.54773 86.15504 70.71865
## 73 74 75 76 77 78 79 80
## 77.48192 89.99050 81.33350 84.63528 81.76609 83.61217 72.50406 76.03212
## 81 82 83 84 85 86 87 88
## 84.39389 88.17360 96.64595 74.22512 84.74053 81.06267 83.42634 83.72506
## 89 90 91 92 93 94 95 96
## 88.29683 89.25760 80.32536 97.28005 74.19830 84.11231 84.49740 84.05951
## 97 98 99 100 101 102 103 104
## 86.89059 101.95956 85.75903 85.99408 79.55943 74.46950 83.99920 84.16972
## 105 106 107 108 109 110 111 112
## 78.53056 69.41087 54.32570 77.04523 86.40053 59.45698 83.99773 85.36602
## 113 114 115 116 117 118 119 120
## 93.51967 91.76664 81.54284 78.10601 84.98347 80.69157 74.65586 77.07999
## 121 122 123 124 125 126 127 128
## 92.73023 69.51742 69.15351 67.99571 69.22172 88.76361 91.25376 77.93940
## 129 130 131 132 133 134 135 136
## 93.05825 91.66904 85.35456 79.12941 79.69121 85.49378 87.34826 67.23954
## 137 138 139 140 141 142 143 144
## 74.24250 77.99283 86.96169 80.47501 66.29843 72.97633 89.83782 72.79161
## 145 146 147 148 149 150 151 152
## 71.72553 72.24594 76.88918 78.34019 78.54647 83.23298 82.35583 80.35067
## 153 154 155 156 157 158 159 160
## 53.65196 70.71838 76.89142 70.80214 89.16285 68.31321 92.26013 74.16343
## 161 162 163 164 165 166 167 168
## 102.97119 105.73589 93.85001 102.73814 96.86479 89.41896 81.60908 81.72290
## 169 170 171 172 173 174 175 176
## 73.62270 80.32511 88.86271 88.58551 80.71567 93.58114 83.25400 73.22329
## 177 178 179 180 181 182 183 184
## 77.07210 70.75924 74.50462 78.65383 86.91645 88.05098 84.99958 84.64730
## 185 186 187 188 189 190 191 192
## 96.91088 96.79967 86.85417 56.49043 60.65153 113.16314 71.43358 80.82280
## 193 194 195 196 197 198 199 200
## 77.60838 77.79108 79.84679 68.26375 78.74758 84.59009 79.73953 83.95373
## 201 202 203 204 205 206 207 208
## 75.51396 79.69558 73.96268 91.02842 81.00795 82.96984 77.82390 77.53497
## 209 210 211 212 213 214 215 216
## 81.58098 73.80030 106.43962 93.66124 81.85864 65.42548 68.91173 83.67666
## 217 218 219 220 221 222 223 224
## 79.25817 93.07118 77.93890 78.54334 78.26961 74.29641 80.49750 73.47004
## 225 226 227 228 229 230 231 232
## 88.22145 75.26303 81.64626 78.45500 81.30498 73.34752 79.99408 92.94496
## 233 234 235 236 237 238 239 240
## 78.30733 88.92503 79.84306 75.42988 82.14292 77.96719 92.66573 73.19262
## 241 242 243 244 245 246 247 248
## 89.04751 86.29256 83.32269 81.36724 61.25078 87.09117 81.45165 85.17810
## 249 250 251 252 253 254 255 256
## 73.59924 83.57244 80.66345 53.89637 93.71869 74.93086 70.05148 75.85348
## 257 258 259
## 82.66711 83.05989 77.50290
write.csv(moneyball_evaluation_data, "moneyball-evaluation-data-predicted.csv", row.names = FALSE)
print("Predicted wins saved to 'moneyball-evaluation-data-predicted.csv'")## [1] "Predicted wins saved to 'moneyball-evaluation-data-predicted.csv'"
# #Predict wins using they polynomial regression model using the mice imputed test dataset.
# sum(is.na(completedData1))
# predicted_winspoly <- predict(mpoly, newdata = completedData1)
# completedData1$Predicted_Wins <- round(predicted_winspoly)
#
# #Display the predicted wins
# print(predicted_winspoly)
# write.csv(completedData1, "moneyball-evaluation-data-predicted_POLY.csv", row.names = FALSE)The 2nd best fit is our Polynomial Regression Model.One reason why this model may not be higher could be because of the presence of outliers. As an extention of this project in the future, we could remove some of this high leverage outliers and try fitting this model again to see if the adjusted R square is higher than what was observed. For this project, however, this is our 2nd best selection and fit.
The prediction wins have been written out to a data set
Cook’s distance measures how and observation influences the overall
model or predicted values
Studentizided residuals are the residuals divided by their estimated
standard deviation
Bonferroni test to identify outliers
Hat-points identify influential points (have a high impact on the
predictor variables)
## R-squared: 0.4731361
## [1] 2012 2233
fitted_valspoly <- fitted(step_back)
residuals_valsply <- residuals(step_back)
plot(fitted_valspoly, residuals_valsply,
main = "Residuals vs Fitted step_back regression",
xlab = "Fitted values",
ylab = "Residuals")
abline(h = 0, col = "red")# Get the model residuals
model_residuals = step_back$residuals
# Plot the result
hist(model_residuals)Below, we have used the step_back (polynomial regression model) to fit the test data after imputation. I noticed on review that the model predicted negative target wins for 1 record. The record has been printed out below. The TEAM_BASERUN_SB for thiS record is 0 on the record. On the model coefficients this is one of the significant predictor variables, therefore, having an outlier such as 0 has had an impact on this record. No other records on the test dataset has -ve Target wins predicted. Maybe the presence of 0 on the significant predictor played a role in the -ve target wins. One way we can avoid running into negative values is to log transform the target variable. Then we would need to convert to actual scale by taking the exponential. Negative Intercept: If the intercept in a linear regression model is negative, it means that the predicted value of Y when X is zero is negative. In this case, the regression line crosses the y-axis below the zero value. Models can predict out of bounds values in this scenario.
#sTEP 4: Predict wins using they polynomial regression model using the mice imputed test dataset.
sum(is.na(completedData1))## [1] 0
predicted_winspoly <- predict(step_back, newdata = completedData1)
completedData1$Predicted_Wins <- round(predicted_winspoly)
# This record had -ve TARGET WINS
completedData1[153,]## 1 2 3 4 5 6 7 8
## 56.17294 60.28666 69.64209 84.94790 68.27314 85.69634 78.20616 70.28016
## 9 10 11 12 13 14 15 16
## 72.62274 69.87750 65.52472 84.40004 86.42408 84.87206 88.84636 76.49452
## 17 18 19 20 21 22 23 24
## 70.34470 76.56092 70.57387 80.98965 82.29357 81.51251 83.08143 72.01933
## 25 26 27 28 29 30 31 32
## 81.79996 87.92424 74.33068 66.83266 80.53896 67.66413 92.17382 84.85526
## 33 34 35 36 37 38 39 40
## 88.78826 88.58816 79.80351 85.16124 77.13142 87.71430 81.61208 89.02724
## 41 42 43 44 45 46 47 48
## 84.21168 97.96500 19.54563 106.70521 86.67550 96.90132 96.92875 79.05094
## 49 50 51 52 53 54 55 56
## 67.51130 78.10583 72.03109 78.54303 70.68609 77.49007 68.06108 80.91025
## 57 58 59 60 61 62 63 64
## 91.45457 84.51076 54.18980 83.38793 88.56795 87.41098 87.88489 80.09362
## 65 66 67 68 69 70 71 72
## 85.65714 104.43752 69.42358 73.57874 77.47431 95.06832 84.52662 74.63740
## 73 74 75 76 77 78 79 80
## 83.82775 90.32849 72.63642 74.04629 89.17671 82.02841 76.91166 79.06990
## 81 82 83 84 85 86 87 88
## 82.96109 84.87351 97.78765 72.49258 88.21016 81.02128 87.09083 83.61214
## 89 90 91 92 93 94 95 96
## 97.11698 92.20992 75.67395 95.63936 63.57094 92.02143 90.33738 91.33154
## 97 98 99 100 101 102 103 104
## 101.57956 96.13043 84.33106 87.58446 76.56099 68.60087 81.90618 89.07745
## 105 106 107 108 109 110 111 112
## 69.54867 84.82527 50.61242 73.87781 89.60298 45.83271 88.15486 88.83011
## 113 114 115 116 117 118 119 120
## 94.87455 91.96957 80.81252 77.71058 86.48245 81.07379 69.83175 79.69570
## 121 122 123 124 125 126 127 128
## 101.80170 67.89132 68.53678 59.01639 69.53754 82.32925 82.29472 75.69144
## 129 130 131 132 133 134 135 136
## 87.11546 86.32240 77.56226 78.54961 75.94046 78.66570 89.84772 82.59521
## 137 138 139 140 141 142 143 144
## 76.42258 75.60449 96.87116 80.06190 65.00611 73.87770 88.96301 67.28884
## 145 146 147 148 149 150 151 152
## 77.15675 75.76504 73.24084 80.86218 80.05597 83.90668 80.86188 79.88856
## 153 154 155 156 157 158 159 160
## -19.84017 64.77946 78.32424 64.87205 91.89268 81.89351 89.36259 72.12265
## 161 162 163 164 165 166 167 168
## 103.02976 111.67013 97.21169 108.75241 102.40531 98.45281 88.61654 84.95617
## 169 170 171 172 173 174 175 176
## 72.49847 80.70537 95.86099 94.82055 77.89417 88.50043 75.41264 79.19125
## 177 178 179 180 181 182 183 184
## 84.74238 71.10428 77.34219 80.99975 98.93045 87.46444 88.35490 86.95535
## 185 186 187 188 189 190 191 192
## 85.51381 82.43470 83.71987 46.48291 52.39584 108.56011 64.54175 83.93363
## 193 194 195 196 197 198 199 200
## 70.01263 69.62693 68.91100 58.25476 70.12960 98.29763 84.71910 85.60574
## 201 202 203 204 205 206 207 208
## 68.83443 79.20509 74.50917 93.34670 81.99057 90.00616 77.24307 76.26457
## 209 210 211 212 213 214 215 216
## 73.56917 69.79046 104.60704 95.17004 84.35413 62.58243 71.91686 84.28700
## 217 218 219 220 221 222 223 224
## 81.90389 96.32345 76.92910 81.56029 70.41984 70.49817 77.32647 72.56484
## 225 226 227 228 229 230 231 232
## 70.78992 78.00976 78.37884 74.24700 84.74737 86.06397 82.04895 90.50258
## 233 234 235 236 237 238 239 240
## 87.55257 86.15641 77.86184 75.71337 74.50398 80.12416 85.63313 71.52249
## 241 242 243 244 245 246 247 248
## 82.21822 87.97379 80.25439 81.82278 57.15675 81.05379 79.00098 81.39071
## 249 250 251 252 253 254 255 256
## 74.23961 84.23937 77.17536 56.02002 93.72933 209.90022 65.68211 79.94488
## 257 258 259
## 86.67192 85.85297 73.20054