In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("C:/Users/ddebo/Downloads/moneyball-training-data.csv")
str(data)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
summary(data)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
The first thing that stands out is the presence of NAs in six variables. The data set extends back to 1871, before some statistics, such as batters hit by a pitch or runners caught stealing, were being recorded; each began to be recorded or considered later in the decade. Even stolen bases were not recorded until 1877, and the rules behind them were not solidified until the 1890s. That said, with 2,085 of 2,276 values missing, the hit-by-pitch column is useless for this endeavor.
summary_table <- data |>
  select(where(is.numeric)) |>
  # For each column, compute the desired statistics
  summarize(across(everything(),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE),
                        min = ~min(.x, na.rm = TRUE),
                        max = ~max(.x, na.rm = TRUE),
                        missing = ~sum(is.na(.x))))) |>
  # Convert from wide to long format
  pivot_longer(
    cols = everything(),
    names_to = c("Variable", "Statistic"),
    names_pattern = "^(.*)_(mean|sd|min|max|missing)$",
    values_to = "Value"
  ) |>
  # Reshape so each variable is a row and each statistic is a column
  pivot_wider(names_from = Statistic, values_from = Value)
summary_table
## # A tibble: 17 × 6
## Variable mean sd min max missing
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 INDEX 1268. 736. 1 2535 0
## 2 TARGET_WINS 80.8 15.8 0 146 0
## 3 TEAM_BATTING_H 1469. 145. 891 2554 0
## 4 TEAM_BATTING_2B 241. 46.8 69 458 0
## 5 TEAM_BATTING_3B 55.2 27.9 0 223 0
## 6 TEAM_BATTING_HR 99.6 60.5 0 264 0
## 7 TEAM_BATTING_BB 502. 123. 0 878 0
## 8 TEAM_BATTING_SO 736. 249. 0 1399 102
## 9 TEAM_BASERUN_SB 125. 87.8 0 697 131
## 10 TEAM_BASERUN_CS 52.8 23.0 0 201 772
## 11 TEAM_BATTING_HBP 59.4 13.0 29 95 2085
## 12 TEAM_PITCHING_H 1779. 1407. 1137 30132 0
## 13 TEAM_PITCHING_HR 106. 61.3 0 343 0
## 14 TEAM_PITCHING_BB 553. 166. 0 3645 0
## 15 TEAM_PITCHING_SO 818. 553. 0 19278 102
## 16 TEAM_FIELDING_E 246. 228. 65 1898 0
## 17 TEAM_FIELDING_DP 146. 26.2 52 228 286
# Convert data to long format for plotting
long_df <- data |>
select(where(is.numeric)) |>
pivot_longer(everything(), names_to = "Variable", values_to = "Value")
# Plot histogram for each variable
ggplot(long_df, aes(x = Value)) +
geom_histogram(fill = "steelblue", bins = 30, color = "black") +
facet_wrap(~Variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Each Numeric Variable", x = "Value", y = "Count")
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_bin()`).
Several variables appear approximately normally distributed, most notably our target variable, TARGET_WINS. Possible predictors that also look roughly normal are doubles, walks, hit-by-pitches, and double plays. Home runs, strikeouts by batters, and home runs allowed are bimodal. Both baserunning statistics and triples show a notable right skew, and the rest of the variables show a very extreme right skew. To confirm this, I took a closer look at the mean, median, and max for these variables: in each case the max is far higher than the mean would suggest, which produces very large standard deviations.
long_df <- data |>
select(-INDEX) |>
pivot_longer(
cols = everything(),
names_to = "Variable",
values_to = "Value"
)
# Boxplot of all numeric variables except INDEX
ggplot(long_df, aes(x = Variable, y = Value)) +
geom_boxplot(fill = "steelblue") +
coord_cartesian(ylim = c(0, 2000)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Boxplot of All Numeric Variables", x = "", y = "Value")
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The large range in the number of hits given up by pitchers required me to crop the visualization. Hits by batters and strikeouts also have outliers beyond the cutoff, but not to the same degree.
cor_matrix <- cor(data[,-1], use="pairwise.complete.obs")
cor_matrix
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS 1.00000000 0.388767521 0.28910365 0.142608411
## TEAM_BATTING_H 0.38876752 1.000000000 0.56284968 0.427696575
## TEAM_BATTING_2B 0.28910365 0.562849678 1.00000000 -0.107305824
## TEAM_BATTING_3B 0.14260841 0.427696575 -0.10730582 1.000000000
## TEAM_BATTING_HR 0.17615320 -0.006544685 0.43539729 -0.635566946
## TEAM_BATTING_BB 0.23255986 -0.072464013 0.25572610 -0.287235841
## TEAM_BATTING_SO -0.03175071 -0.463853571 0.16268519 -0.669781188
## TEAM_BASERUN_SB 0.13513892 0.123567797 -0.19975724 0.533506448
## TEAM_BASERUN_CS 0.02240407 0.016705668 -0.09981406 0.348764919
## TEAM_BATTING_HBP 0.07350424 -0.029112176 0.04608475 -0.174247154
## TEAM_PITCHING_H -0.10993705 0.302693709 0.02369219 0.194879411
## TEAM_PITCHING_HR 0.18901373 0.072853119 0.45455082 -0.567836679
## TEAM_PITCHING_BB 0.12417454 0.094193027 0.17805420 -0.002224148
## TEAM_PITCHING_SO -0.07843609 -0.252656790 0.06479231 -0.258818931
## TEAM_FIELDING_E -0.17648476 0.264902478 -0.23515099 0.509778447
## TEAM_FIELDING_DP -0.03485058 0.155383321 0.29087998 -0.323074847
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS 0.176153200 0.23255986 -0.03175071
## TEAM_BATTING_H -0.006544685 -0.07246401 -0.46385357
## TEAM_BATTING_2B 0.435397293 0.25572610 0.16268519
## TEAM_BATTING_3B -0.635566946 -0.28723584 -0.66978119
## TEAM_BATTING_HR 1.000000000 0.51373481 0.72706935
## TEAM_BATTING_BB 0.513734810 1.00000000 0.37975087
## TEAM_BATTING_SO 0.727069348 0.37975087 1.00000000
## TEAM_BASERUN_SB -0.453578426 -0.10511564 -0.25448923
## TEAM_BASERUN_CS -0.433793868 -0.13698837 -0.21788137
## TEAM_BATTING_HBP 0.106181160 0.04746007 0.22094219
## TEAM_PITCHING_H -0.250145481 -0.44977762 -0.37568637
## TEAM_PITCHING_HR 0.969371396 0.45955207 0.66717889
## TEAM_PITCHING_BB 0.136927564 0.48936126 0.03700514
## TEAM_PITCHING_SO 0.184707564 -0.02075682 0.41623330
## TEAM_FIELDING_E -0.587339098 -0.65597081 -0.58466444
## TEAM_FIELDING_DP 0.448985348 0.43087675 0.15488939
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS 0.13513892 0.02240407 0.07350424
## TEAM_BATTING_H 0.12356780 0.01670567 -0.02911218
## TEAM_BATTING_2B -0.19975724 -0.09981406 0.04608475
## TEAM_BATTING_3B 0.53350645 0.34876492 -0.17424715
## TEAM_BATTING_HR -0.45357843 -0.43379387 0.10618116
## TEAM_BATTING_BB -0.10511564 -0.13698837 0.04746007
## TEAM_BATTING_SO -0.25448923 -0.21788137 0.22094219
## TEAM_BASERUN_SB 1.00000000 0.65524480 -0.06400498
## TEAM_BASERUN_CS 0.65524480 1.00000000 -0.07051390
## TEAM_BATTING_HBP -0.06400498 -0.07051390 1.00000000
## TEAM_PITCHING_H 0.07328505 -0.05200781 -0.02769699
## TEAM_PITCHING_HR -0.41651072 -0.42256605 0.10675878
## TEAM_PITCHING_BB 0.14641513 -0.10696124 0.04785137
## TEAM_PITCHING_SO -0.13712861 -0.21022274 0.22157375
## TEAM_FIELDING_E 0.50963090 0.04832189 0.04178971
## TEAM_FIELDING_DP -0.49707763 -0.21424801 -0.07120824
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS -0.10993705 0.18901373 0.124174536
## TEAM_BATTING_H 0.30269371 0.07285312 0.094193027
## TEAM_BATTING_2B 0.02369219 0.45455082 0.178054204
## TEAM_BATTING_3B 0.19487941 -0.56783668 -0.002224148
## TEAM_BATTING_HR -0.25014548 0.96937140 0.136927564
## TEAM_BATTING_BB -0.44977762 0.45955207 0.489361263
## TEAM_BATTING_SO -0.37568637 0.66717889 0.037005141
## TEAM_BASERUN_SB 0.07328505 -0.41651072 0.146415134
## TEAM_BASERUN_CS -0.05200781 -0.42256605 -0.106961236
## TEAM_BATTING_HBP -0.02769699 0.10675878 0.047851371
## TEAM_PITCHING_H 1.00000000 -0.14161276 0.320676162
## TEAM_PITCHING_HR -0.14161276 1.00000000 0.221937505
## TEAM_PITCHING_BB 0.32067616 0.22193750 1.000000000
## TEAM_PITCHING_SO 0.26724807 0.20588053 0.488498653
## TEAM_FIELDING_E 0.66775901 -0.49314447 -0.022837561
## TEAM_FIELDING_DP -0.22865059 0.43917040 0.324457226
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS -0.07843609 -0.17648476 -0.03485058
## TEAM_BATTING_H -0.25265679 0.26490248 0.15538332
## TEAM_BATTING_2B 0.06479231 -0.23515099 0.29087998
## TEAM_BATTING_3B -0.25881893 0.50977845 -0.32307485
## TEAM_BATTING_HR 0.18470756 -0.58733910 0.44898535
## TEAM_BATTING_BB -0.02075682 -0.65597081 0.43087675
## TEAM_BATTING_SO 0.41623330 -0.58466444 0.15488939
## TEAM_BASERUN_SB -0.13712861 0.50963090 -0.49707763
## TEAM_BASERUN_CS -0.21022274 0.04832189 -0.21424801
## TEAM_BATTING_HBP 0.22157375 0.04178971 -0.07120824
## TEAM_PITCHING_H 0.26724807 0.66775901 -0.22865059
## TEAM_PITCHING_HR 0.20588053 -0.49314447 0.43917040
## TEAM_PITCHING_BB 0.48849865 -0.02283756 0.32445723
## TEAM_PITCHING_SO 1.00000000 -0.02329178 0.02615804
## TEAM_FIELDING_E -0.02329178 1.00000000 -0.49768495
## TEAM_FIELDING_DP 0.02615804 -0.49768495 1.00000000
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45, # text label color & rotation
number.cex = 0.7, # size of coefficients
diag = FALSE)
One correlation that stands out is the one between home runs hit and home runs pitched (r = .969). Eras in which batters hit more home runs are necessarily eras in which pitchers give up more of them, so this very strong correlation is no surprise. Other strong correlations are between batting strikeouts and home runs hit (r = .727), triples hit and batting strikeouts (r = -.670), fielding errors and hits given up (r = .668), batting strikeouts and home runs pitched (r = .667), fielding errors and walks (r = -.656), and stolen bases and caught stealing (r = .655).
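Reading the strongest pairs off a 16 × 16 matrix by eye is error-prone, so it can help to rank them programmatically. The following is a minimal sketch on simulated data; the column names and values in `toy` are made up for illustration, and the same steps could be applied to `cor_matrix`.

```r
set.seed(1)
base <- rnorm(100)
toy <- data.frame(
  hr_hit     = base,
  hr_allowed = base + rnorm(100, sd = 0.1),  # built to track hr_hit closely
  walks      = rnorm(100)                    # unrelated noise
)
cor_toy <- cor(toy)

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cor_toy), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

Sorting on the absolute value keeps strong negative correlations near the top alongside strong positive ones.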
cor_with_target <- cor_matrix["TARGET_WINS", ]
cor_with_target
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1.00000000 0.38876752 0.28910365 0.14260841
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0.17615320 0.23255986 -0.03175071 0.13513892
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 0.02240407 0.07350424 -0.10993705 0.18901373
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0.12417454 -0.07843609 -0.17648476 -0.03485058
Focusing on the target variable, we can see that the strongest correlation is a positive one with hits (r = .389), followed by doubles hit (r = .289), walks (r = .233), and home runs pitched (r = .189). The negative correlations are weaker: the strongest is with fielding errors committed (r = -.176), followed by hits given up (r = -.110).
As seen earlier, a few variables have missing values. For batters hit by pitches, there are simply too many missing values for the variable to contribute meaningfully to the model. Runners caught stealing also has many missing values, although not as many: about 34% of records lack a value, and among those with a value present the correlation with wins is only .02, suggesting essentially no linear relationship with the number of games won. This is strong enough evidence to exclude this variable from further analysis as well.
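To put the missing-value counts in proportion, the share of missing rows per variable can be worked out from the counts in the summary output. The sketch below hard-codes those counts and the 2,276-row total rather than reading the CSV; with the data loaded, `colMeans(is.na(data))` would give the same proportions.

```r
# Missing-value counts copied from the summary() output
n_rows <- 2276
na_counts <- c(
  TEAM_BATTING_SO  = 102,
  TEAM_BASERUN_SB  = 131,
  TEAM_BASERUN_CS  = 772,
  TEAM_BATTING_HBP = 2085,
  TEAM_PITCHING_SO = 102,
  TEAM_FIELDING_DP = 286
)

# Percentage of rows missing for each variable, largest first
na_pct <- sort(round(100 * na_counts / n_rows, 1), decreasing = TRUE)
print(na_pct)
```

Hit-by-pitch is missing in roughly 92% of rows and caught stealing in roughly 34%; the other four variables are each missing in under 13% of rows.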
After dropping those variables, we still have some variables with missing values. In these cases, whether they have been omitted due to error or the fact that that statistic was not yet being recorded, I have decided that the best course of action is to impute the median value.
numeric_vars <- data |>
select(-INDEX)
numeric_vars <- numeric_vars |>
mutate(across(everything(), ~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
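As a small illustration of what median imputation does, the sketch below applies the same idea to a made-up data frame using base R. The helper name `impute_median` and the toy values are assumptions for illustration only.

```r
# Hypothetical helper: replace NAs in a numeric vector with the median
# of its observed (non-missing) values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data with one missing value per column
toy <- data.frame(
  stolen_bases = c(10, NA, 30, 50),
  double_plays = c(NA, 150, 160, 170)
)

imputed <- as.data.frame(lapply(toy, impute_median))
print(imputed)
```

Each missing cell takes the median of the observed values in its own column, so the observed values themselves are left untouched. One point worth weighing is that medians computed before a train/test split use information from the rows that later become test data.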
With a little research into baseball statistics, two further variables can be derived from what we have. The first is a proxy for run differential (the difference between runs scored and runs given up); the data set has no run totals, so it is approximated from hits, walks, and stolen bases minus hits given up, walks given up, and fielding errors. The second is inspired by on-base plus slugging, which gives a fuller picture of a team's offensive impact; here it is computed as times on base (hits, walks, and hit-by-pitches) per game of a 162-game season. Previously, HBP was identified as a variable that will be cut from the final model, but where available it is useful in computing this metric.
numeric_vars <- numeric_vars |>
mutate(
RUN_DIFFERENTIAL = TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB - TEAM_PITCHING_H - TEAM_PITCHING_BB - TEAM_FIELDING_E,
# HBP NAs were already replaced by the column median in the imputation step above,
# so coalesce() acts only as a safeguard in this derived metric
ON_BASE_PLUS_SLUGGING = (TEAM_BATTING_H + TEAM_BATTING_BB + coalesce(TEAM_BATTING_HBP, 0)) / 162
)
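For comparison, slugging percentage is built on total bases, and that numerator can be derived from the batting columns. The sketch below assumes that TEAM_BATTING_H counts all hits, including doubles, triples, and home runs, and uses the second record shown in the `str()` output. The column names `SINGLES` and `TOTAL_BASES` are hypothetical. At-bats are not in the data, so the percentage itself cannot be computed.

```r
# Second record shown in the str() output
team <- data.frame(
  TEAM_BATTING_H  = 1339,
  TEAM_BATTING_2B = 219,
  TEAM_BATTING_3B = 22,
  TEAM_BATTING_HR = 190
)

# Singles are whatever hits remain after removing extra-base hits
team$SINGLES <- with(
  team,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR
)

# Total bases: 1 per single, 2 per double, 3 per triple, 4 per home run
team$TOTAL_BASES <- with(
  team,
  SINGLES + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR
)

print(team[, c("SINGLES", "TOTAL_BASES")])
```

A variable like this weights extra-base hits more heavily than singles, which the plain hit count does not.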
set.seed(24601)
n <- nrow(numeric_vars)
train_indices <- sample(seq_len(n), size = 0.7 * n)
train_data <- numeric_vars[train_indices, ]
test_data <- numeric_vars[-train_indices, ]
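The rows set aside in `test_data` allow models to be compared on data they were not fitted to. The following is a minimal sketch of that workflow on simulated data; the data frame `toy`, its columns, and the coefficients used to generate it are all made up for illustration.

```r
set.seed(24601)

# Simulated stand-in for the real data: wins loosely driven by hits
n <- 300
toy <- data.frame(hits = runif(n, 1200, 1700))
toy$wins <- -40 + 0.08 * toy$hits + rnorm(n, sd = 10)

# 70/30 split, mirroring the split above
idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Transforming inside the formula means predict() applies the same
# transformation to new data automatically
fit <- lm(wins ~ log1p(hits), data = train)

# Root mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$wins - pred)^2))
print(rmse)
```

Writing the transformation in the formula avoids having to add matching `log_` columns to the test set by hand before predicting.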
# log-transform selected predictors (safe version: log1p handles zeros)
train_data <- train_data %>%
mutate(
log_TEAM_BATTING_H = log1p(TEAM_BATTING_H),
log_TEAM_BATTING_HR = log1p(TEAM_BATTING_HR),
log_TEAM_BATTING_BB = log1p(TEAM_BATTING_BB),
log_TEAM_BASERUN_SB = log1p(TEAM_BASERUN_SB),
log_TEAM_PITCHING_HR = log1p(TEAM_PITCHING_HR),
log_TEAM_FIELDING_E = log1p(TEAM_FIELDING_E)
)
model2 <- lm(
TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_HR +
log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB +
log_TEAM_PITCHING_HR + log_TEAM_FIELDING_E,
data = train_data
)
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_HR +
## log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB + log_TEAM_PITCHING_HR +
## log_TEAM_FIELDING_E, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.701 -8.660 -0.121 8.606 52.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -456.618 28.267 -16.154 < 2e-16 ***
## log_TEAM_BATTING_H 71.754 3.961 18.114 < 2e-16 ***
## log_TEAM_BATTING_HR -8.610 2.726 -3.159 0.00161 **
## log_TEAM_BATTING_BB 5.457 1.362 4.006 6.46e-05 ***
## log_TEAM_BASERUN_SB 4.492 0.611 7.351 3.13e-13 ***
## log_TEAM_PITCHING_HR 8.067 2.500 3.227 0.00128 **
## log_TEAM_FIELDING_E -7.224 1.148 -6.290 4.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.26 on 1586 degrees of freedom
## Multiple R-squared: 0.2508, Adjusted R-squared: 0.248
## F-statistic: 88.5 on 6 and 1586 DF, p-value: < 2.2e-16
All previous variables were kept and logarithmically transformed to build this model. The result is a much stronger model. We still see a surprising negative coefficient for the effect of home runs hit and it remains one of the weaker aspects of our model, along with home runs pitched.
library(performance) # provides check_model()
check_model(model2)
The collinearity graph suggests two variables are still strongly
correlated with each other, and from the evidence above, it seems to be
the two home run variables. For the next model, I will see how removing
them impacts the rest of the model.
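Collinearity of this kind can also be quantified with variance inflation factors (VIF), where the VIF of a predictor is 1 / (1 - R²) from regressing it on the other predictors. The sketch below computes this by hand on simulated data; the variable names and values are made up, with `hr_allowed` deliberately built to track `hr_hit`.

```r
set.seed(42)
n <- 200
hr_hit     <- rnorm(n, mean = 100, sd = 30)
hr_allowed <- hr_hit + rnorm(n, mean = 0, sd = 5)  # nearly a copy of hr_hit
walks      <- rnorm(n, mean = 500, sd = 80)        # independent predictor
toy <- data.frame(hr_hit, hr_allowed, walks)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this toy setup the two home run variables get very large VIFs while the independent predictor stays close to 1, which is the pattern one would expect if the home run pair is the source of the collinearity flagged in the diagnostic plot.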
model3 <- lm(
TARGET_WINS ~ log_TEAM_BATTING_H +
log_TEAM_BATTING_BB + log_TEAM_BASERUN_SB +log_TEAM_FIELDING_E,
data = train_data
)
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_BB +
## log_TEAM_BASERUN_SB + log_TEAM_FIELDING_E, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.027 -8.943 -0.147 8.710 51.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -473.9766 27.3774 -17.313 < 2e-16 ***
## log_TEAM_BATTING_H 74.0328 3.6823 20.105 < 2e-16 ***
## log_TEAM_BATTING_BB 3.8609 1.2267 3.148 0.00168 **
## log_TEAM_BASERUN_SB 4.4868 0.6032 7.438 1.67e-13 ***
## log_TEAM_FIELDING_E -5.5078 0.8099 -6.800 1.47e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.3 on 1588 degrees of freedom
## Multiple R-squared: 0.2459, Adjusted R-squared: 0.244
## F-statistic: 129.5 on 4 and 1588 DF, p-value: < 2.2e-16
check_model(model3)
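Because Models 2 and 3 use different numbers of predictors, adjusted R-squared is the fairer measure for comparing them. The sketch below recomputes it from the reported R-squared values using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 1593 training rows (the residual degrees of freedom plus the number of estimated coefficients). The helper name `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_train <- 1593  # 1586 residual df + 7 coefficients in Model 2

adj <- c(
  model2 = adj_r2(0.2508, n_train, p = 6),
  model3 = adj_r2(0.2459, n_train, p = 4)
)
print(round(adj, 3))
```

The results reproduce the values in the two summaries (.248 and .244), so dropping the two home run variables lowers the adjusted fit only slightly.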
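Dropping the two home run variables costs almost nothing in fit (adjusted R-squared of .244 versus .248 for Model 2) while removing the collinear pair, making this the preferred model so far.

#### Model 4 - Using Generated Variables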
model4 <- lm(TARGET_WINS ~ ON_BASE_PLUS_SLUGGING + RUN_DIFFERENTIAL +
log1p(TEAM_PITCHING_SO),
data = train_data)
summary(model4)
##
## Call:
## lm(formula = TARGET_WINS ~ ON_BASE_PLUS_SLUGGING + RUN_DIFFERENTIAL +
## log1p(TEAM_PITCHING_SO), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.652 -8.908 0.301 8.967 55.890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8789192 5.3217219 0.917 0.35939
## ON_BASE_PLUS_SLUGGING 5.9259662 0.3053187 19.409 < 2e-16 ***
## RUN_DIFFERENTIAL 0.0006255 0.0002361 2.650 0.00814 **
## log1p(TEAM_PITCHING_SO) 0.3765640 0.5112345 0.737 0.46149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.67 on 1589 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.2009
## F-statistic: 134.4 on 3 and 1589 DF, p-value: < 2.2e-16
check_model(model4)
# Note: the HR ratio is 0/0 = NaN when a team hit and allowed zero home runs; lm() drops those rows
model5 <- lm(TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_HR + TEAM_BATTING_BB +
               log1p(TEAM_PITCHING_SO) + log1p(TEAM_PITCHING_BB) + log1p(TEAM_FIELDING_DP) +
               log1p(TEAM_FIELDING_E) + I(TEAM_PITCHING_HR / TEAM_BATTING_HR),
             data = train_data)
summary(model5)
##
## Call:
## lm(formula = TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_HR +
## TEAM_BATTING_BB + log1p(TEAM_PITCHING_SO) + log1p(TEAM_PITCHING_BB) +
## log1p(TEAM_FIELDING_DP) + log1p(TEAM_FIELDING_E) + I(TEAM_PITCHING_HR/TEAM_BATTING_HR),
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.542 -8.476 0.403 8.503 48.748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.138e+02 3.255e+01 -12.710 < 2e-16 ***
## log1p(TEAM_BATTING_H) 7.879e+01 4.220e+00 18.669 < 2e-16 ***
## TEAM_BATTING_HR 2.337e-02 9.190e-03 2.543 0.01108 *
## TEAM_BATTING_BB 1.890e-03 9.136e-03 0.207 0.83611
## log1p(TEAM_PITCHING_SO) -7.458e-01 7.633e-01 -0.977 0.32872
## log1p(TEAM_PITCHING_BB) 8.450e+00 3.934e+00 2.148 0.03188 *
## log1p(TEAM_FIELDING_DP) -2.220e+01 2.039e+00 -10.889 < 2e-16 ***
## log1p(TEAM_FIELDING_E) -3.070e+00 1.008e+00 -3.046 0.00235 **
## I(TEAM_PITCHING_HR/TEAM_BATTING_HR) -3.417e+00 1.403e+00 -2.436 0.01498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.81 on 1574 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.2674, Adjusted R-squared: 0.2637
## F-statistic: 71.81 on 8 and 1574 DF, p-value: < 2.2e-16
check_model(model5)
model6 <- lm(TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_BB + log1p(TEAM_FIELDING_DP) +
log1p(TEAM_FIELDING_E),
data = train_data)
summary(model6)
##
## Call:
## lm(formula = TARGET_WINS ~ log1p(TEAM_BATTING_H) + TEAM_BATTING_BB +
## log1p(TEAM_FIELDING_DP) + log1p(TEAM_FIELDING_E), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.141 -8.591 0.307 8.447 55.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.852e+02 2.616e+01 -14.727 < 2e-16 ***
## log1p(TEAM_BATTING_H) 8.070e+01 3.714e+00 21.726 < 2e-16 ***
## TEAM_BATTING_BB 2.402e-02 3.382e-03 7.100 1.87e-12 ***
## log1p(TEAM_FIELDING_DP) -2.184e+01 1.977e+00 -11.048 < 2e-16 ***
## log1p(TEAM_FIELDING_E) -4.750e+00 7.030e-01 -6.757 1.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.05 on 1588 degrees of freedom
## Multiple R-squared: 0.2737, Adjusted R-squared: 0.2718
## F-statistic: 149.6 on 4 and 1588 DF, p-value: < 2.2e-16
check_model(model6)
par(mfrow = c(2, 2))
plot(model6)