# load packages and data
library(tidyverse) # attaches ggplot2, dplyr, tidyr, and readr
library(curl)
library(scales)
library(zoo) # provides na.aggregate()
baseball_training <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data621/main/moneyball-training-data.csv"))
head(baseball_training)
summary(baseball_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
We can see that a number of the variables have a minimum of 0.0 and/or NA values. These will need to be addressed in order to build models.
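To see exactly which columns are affected, the NA and zero counts can be tallied per column. The sketch below is a minimal illustration on a made-up `toy` data frame (an assumption standing in for `baseball_training`, so the numbers are not from the real file); the same two `colSums()` calls should work on the real data.

```r
# Toy stand-in for baseball_training (values are invented for illustration)
toy <- data.frame(
  TEAM_BATTING_HR = c(0, 42, 102, NA),
  TEAM_BATTING_SO = c(NA, 548, 0, 930),
  TEAM_FIELDING_E = c(65, 127, 159, 249)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of exact zeros in each column, ignoring NAs
zero_counts <- colSums(toy == 0, na.rm = TRUE)

# One row per variable, one column per kind of problem
quality <- data.frame(na = na_counts, zero = zero_counts)
print(quality)
```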
A common belief is that the more home runs a team hits, the more games it wins. That is plausible up to a point, but it ignores how many runs the same team’s pitching gives up. With that caveat in mind, let’s look at how home runs relate to target wins.
baseball_training %>%
ggplot(aes(x=TEAM_BATTING_HR, y=TARGET_WINS, color=TEAM_BATTING_HR)) +
geom_point() +
labs(title = "Wins vs. Home runs hit", x="Home Runs",y="Target Wins",colour="Home Runs")
There seems to be a slight positive correlation between the number of
home runs hit and the number of wins. The team with the most wins (146)
in the data set actually hit relatively few home runs, while a few teams
with 150-155 home runs won fewer than 50 games. Let’s take a look at how
all of the explanatory variables relate to the response variable
(TARGET_WINS):
#exclude INDEX (a row identifier) and reshape to long format, one row per team/variable pair
baseball_training %>%
pivot_longer(cols = -c(INDEX, TARGET_WINS), names_to = "var", values_to = "value") %>%
ggplot(aes(x=value,y=TARGET_WINS))+
geom_point() +
geom_smooth(method="lm",color="red") +
facet_wrap(~ var, scales="free") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3478 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3478 rows containing missing values (`geom_point()`).
It’s pretty hard to tell how each X relates to Y based on just the
plots. Some relationships you would expect to see are just not as clear
from the scatterplots alone. Let’s calculate correlations between the
predictors and the target.
correlations <- cor(baseball_training[ , colnames(baseball_training) != "TARGET_WINS"], baseball_training$TARGET_WINS)
correlations
## [,1]
## INDEX -0.02105643
## TEAM_BATTING_H 0.38876752
## TEAM_BATTING_2B 0.28910365
## TEAM_BATTING_3B 0.14260841
## TEAM_BATTING_HR 0.17615320
## TEAM_BATTING_BB 0.23255986
## TEAM_BATTING_SO NA
## TEAM_BASERUN_SB NA
## TEAM_BASERUN_CS NA
## TEAM_BATTING_HBP NA
## TEAM_PITCHING_H -0.10993705
## TEAM_PITCHING_HR 0.18901373
## TEAM_PITCHING_BB 0.12417454
## TEAM_PITCHING_SO NA
## TEAM_FIELDING_E -0.17648476
## TEAM_FIELDING_DP NA
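No single predictor appears to be strongly correlated with TARGET_WINS on its own; the largest value is TEAM_BATTING_H at about 0.39. The NA entries appear because cor() returns NA by default for any column that contains missing values.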
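If a rough correlation is wanted for the columns with missing values before any imputation, `cor()` accepts a `use` argument. The sketch below uses a small invented `toy` data frame (an assumption, not the real file) to contrast the default behavior with `use = "pairwise.complete.obs"`, which drops only the rows that are missing for each individual pair.

```r
# Toy stand-in for baseball_training (values are invented for illustration)
toy <- data.frame(
  TARGET_WINS     = c(70, 82, 91, 60, 88),
  TEAM_BATTING_H  = c(1383, 1454, 1537, 1300, 1500),
  TEAM_BATTING_SO = c(548, NA, 930, 750, NA)
)

predictors <- toy[, colnames(toy) != "TARGET_WINS"]

# Default: any column containing an NA yields NA
cor_default <- cor(predictors, toy$TARGET_WINS)

# Pairwise: each correlation uses the rows where both values are present
cor_pairwise <- cor(predictors, toy$TARGET_WINS, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

Pairwise correlations are each computed on a different subset of rows, so they are best read as a quick diagnostic rather than a consistent correlation matrix.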
Before moving further along in creating models, let’s tidy the data.
#replace NAs with mean of column
baseball_training2 <- baseball_training
baseball_training2 <- na.aggregate(baseball_training2)
summary(baseball_training2)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8
## Median : 47.00 Median :102.00 Median :512.0 Median : 735.6
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.00 Min. :29.00 Min. : 1137
## 1st Qu.: 67.0 1st Qu.: 44.00 1st Qu.:59.36 1st Qu.: 1419
## Median :106.0 Median : 52.80 Median :59.36 Median : 1518
## Mean :124.8 Mean : 52.80 Mean :59.36 Mean : 1779
## 3rd Qu.:151.0 3rd Qu.: 54.25 3rd Qu.:59.36 3rd Qu.: 1682
## Max. :697.0 Max. :201.00 Max. :95.00 Max. :30132
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 626.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 817.7 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:134.0
## Median :146.4
## Mean :146.4
## 3rd Qu.:161.2
## Max. :228.0
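No NA values remain in the data. However, several statistics still have a minimum of 0.0, which is implausible over a full season. To make the model more reliable, I’ll also replace the zeros in TARGET_WINS, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO with the column mean. Note that mean(.x) is taken over the whole column, so the zeros being replaced are included in that average.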
baseball_training2 <- baseball_training2 |> mutate(across(.cols = c("TARGET_WINS", "TEAM_BATTING_3B", "TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS","TEAM_PITCHING_HR","TEAM_PITCHING_BB","TEAM_PITCHING_SO"),
.fns = ~ifelse(.x == 0, mean(.x), .x)))
summary(baseball_training2)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 12.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.83 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 8.0 Min. : 3.00 Min. : 12.0 Min. : 66.0
## 1st Qu.: 34.0 1st Qu.: 42.75 1st Qu.:451.0 1st Qu.: 562.0
## Median : 47.0 Median :102.00 Median :512.0 Median : 735.6
## Mean : 55.3 Mean :100.27 Mean :501.8 Mean : 742.1
## 3rd Qu.: 72.0 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0
## Max. :223.0 Max. :264.00 Max. :878.0 Max. :1399.0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 14.0 Min. : 7.00 Min. :29.00 Min. : 1137
## 1st Qu.: 67.0 1st Qu.: 44.00 1st Qu.:59.36 1st Qu.: 1419
## Median :106.0 Median : 52.80 Median :59.36 Median : 1518
## Mean :124.9 Mean : 52.83 Mean :59.36 Mean : 1779
## 3rd Qu.:151.0 3rd Qu.: 54.25 3rd Qu.:59.36 3rd Qu.: 1682
## Max. :697.0 Max. :201.00 Max. :95.00 Max. :30132
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 3.0 Min. : 119.0 Min. : 181.0 Min. : 65.0
## 1st Qu.: 52.0 1st Qu.: 476.0 1st Qu.: 633.0 1st Qu.: 127.0
## Median :107.0 Median : 537.0 Median : 817.7 Median : 159.0
## Mean :106.4 Mean : 553.3 Mean : 824.9 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:134.0
## Median :146.4
## Mean :146.4
## 3rd Qu.:161.2
## Max. :228.0
No zero values remain in those variables, which gives us a more workable data set. We could prepare the data further by condensing a few X variables. 1) A hit-by-pitch (TEAM_BATTING_HBP) puts the batter on first base just as a walk (TEAM_BATTING_BB) does, so the two could be summed into a new variable, TEAM_BATTING_WALKS. 2) TEAM_BASERUN_SB and TEAM_BASERUN_CS together give the total stolen base attempts, so the stolen base success rate could be stored in TEAM_SB_SUCCESS as TEAM_BASERUN_SB/(TEAM_BASERUN_CS + TEAM_BASERUN_SB).
#this caused errors in my testing later on. Removing.
#baseball_training2 <- baseball_training2 %>%
# mutate(TEAM_BATTING_WALKS = TEAM_BATTING_BB+TEAM_BATTING_HBP) %>%
# mutate(TEAM_SB_SUCCESS = TEAM_BASERUN_SB/(TEAM_BASERUN_SB+TEAM_BASERUN_CS)) %>%
# select(-c(TEAM_BATTING_BB,TEAM_BATTING_HBP,TEAM_BASERUN_SB,TEAM_BASERUN_CS))
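The source does not say what the later errors were. One possible cause (an assumption) is that a model fitted on derived columns needs those same columns in whatever data is passed to `predict()`. A minimal sketch of a helper that applies identical steps to any data frame follows; `add_derived_features` is a hypothetical name and the `toy_train` values are invented.

```r
# Hypothetical helper: apply the same derived-feature steps to any data frame
add_derived_features <- function(df) {
  # Walks and hit-by-pitch both put the batter on first base
  df$TEAM_BATTING_WALKS <- df$TEAM_BATTING_BB + df$TEAM_BATTING_HBP

  # Success rate of stolen base attempts; NA when there were no attempts
  attempts <- df$TEAM_BASERUN_SB + df$TEAM_BASERUN_CS
  df$TEAM_SB_SUCCESS <- ifelse(attempts > 0,
                               df$TEAM_BASERUN_SB / attempts,
                               NA_real_)

  # Drop the columns that were folded into the derived ones
  folded <- c("TEAM_BATTING_BB", "TEAM_BATTING_HBP",
              "TEAM_BASERUN_SB", "TEAM_BASERUN_CS")
  df[, !(colnames(df) %in% folded), drop = FALSE]
}

# Toy stand-ins for the training and evaluation sets (invented values)
toy_train <- data.frame(
  TARGET_WINS      = c(70, 82),
  TEAM_BATTING_BB  = c(451, 512),
  TEAM_BATTING_HBP = c(50, 58),
  TEAM_BASERUN_SB  = c(66, 0),
  TEAM_BASERUN_CS  = c(38, 0)
)
toy_eval <- toy_train[, colnames(toy_train) != "TARGET_WINS"]

train_fe <- add_derived_features(toy_train)
eval_fe  <- add_derived_features(toy_eval)
print(train_fe)
print(eval_fe)
```

Running both data sets through one function keeps their predictor columns aligned, so a model fitted on one can be used to predict on the other.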
Because that step is commented out, the variable set is unchanged from the original data. Let’s look at the correlations again now that the NA and zero values have been replaced.
correlations_tidy <- cor(baseball_training2[ , colnames(baseball_training2) != "TARGET_WINS"], baseball_training2$TARGET_WINS)
correlations_tidy
## [,1]
## INDEX -0.020937436
## TEAM_BATTING_H 0.381966897
## TEAM_BATTING_2B 0.285642615
## TEAM_BATTING_3B 0.140704764
## TEAM_BATTING_HR 0.151842473
## TEAM_BATTING_BB 0.225471525
## TEAM_BATTING_SO -0.050333581
## TEAM_BASERUN_SB 0.118638506
## TEAM_BASERUN_CS 0.009251595
## TEAM_BATTING_HBP 0.016432705
## TEAM_PITCHING_H -0.074670265
## TEAM_PITCHING_HR 0.163747351
## TEAM_PITCHING_BB 0.117643913
## TEAM_PITCHING_SO -0.085788792
## TEAM_FIELDING_E -0.161152195
## TEAM_FIELDING_DP -0.029009541
Still, not one predictor on its own stands out.
I will build 3 models as follows: 1) based on the TEAM_BATTING variables, 2) based on the TEAM_PITCHING variables, and 3) based on TEAM_BATTING_H, TEAM_PITCHING_H, and TEAM_FIELDING_E, as these seem to have the strongest linear relationships to TARGET_WINS in the plots above. You may notice that TEAM_BATTING_2B also appears to have a fairly strong relationship (per the scatterplots, not necessarily per the calculations), but doubles are already counted in TEAM_BATTING_H, so I’ve decided to use the variable that encompasses all hits.
fit_batting <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_SO+TEAM_BATTING_BB+TEAM_BATTING_HBP,data=baseball_training2)
coef(fit_batting)
## (Intercept) TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## -13.103877806 0.043546345 -0.012480071 0.099596565
## TEAM_BATTING_HR TEAM_BATTING_SO TEAM_BATTING_BB TEAM_BATTING_HBP
## 0.023000975 0.008193301 0.030864722 0.060290292
summary(fit_batting)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BATTING_BB +
## TEAM_BATTING_HBP, data = baseball_training2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.010 -8.797 0.535 9.114 51.629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.103878 6.713408 -1.952 0.051073 .
## TEAM_BATTING_H 0.043546 0.003556 12.247 < 2e-16 ***
## TEAM_BATTING_2B -0.012480 0.009242 -1.350 0.177013
## TEAM_BATTING_3B 0.099597 0.016587 6.005 2.23e-09 ***
## TEAM_BATTING_HR 0.023001 0.009181 2.505 0.012305 *
## TEAM_BATTING_SO 0.008193 0.002204 3.718 0.000206 ***
## TEAM_BATTING_BB 0.030865 0.002790 11.062 < 2e-16 ***
## TEAM_BATTING_HBP 0.060290 0.077058 0.782 0.434061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.77 on 2268 degrees of freedom
## Multiple R-squared: 0.2297, Adjusted R-squared: 0.2273
## F-statistic: 96.6 on 7 and 2268 DF, p-value: < 2.2e-16
This model has a Multiple R-squared of 0.2297 and an overall F-test p-value below 0.05. The model is statistically significant, but it explains only about 23% of the variation in TARGET_WINS, so it does not do a good job on its own. I can also see that TEAM_BATTING_2B (p = 0.177) and TEAM_BATTING_HBP (p = 0.434) have high p-values, so they may be candidates for removal from the model.
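Whether a high p-value predictor can be dropped without hurting the fit can be checked with a nested-model comparison. The sketch below uses simulated data (the `toy` data frame and its coefficients are invented, with TEAM_BATTING_2B constructed as pure noise), so it only illustrates the mechanics of `update()` and `anova()`, not the result for the real model.

```r
set.seed(1)
n <- 200

# Simulated predictors (invented values, not the moneyball data)
toy <- data.frame(
  TEAM_BATTING_H  = rnorm(n, 1469, 140),
  TEAM_BATTING_BB = rnorm(n, 500, 120),
  TEAM_BATTING_2B = rnorm(n, 241, 46)
)

# By construction, wins depend on hits and walks only
toy$TARGET_WINS <- 0.04 * toy$TEAM_BATTING_H +
  0.03 * toy$TEAM_BATTING_BB + rnorm(n, 0, 13)

fit_full <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_2B,
               data = toy)

# Refit the same model without the candidate predictor
fit_reduced <- update(fit_full, . ~ . - TEAM_BATTING_2B)

# Partial F-test: does the dropped term explain significant extra variance?
comparison <- anova(fit_reduced, fit_full)
print(comparison)

# Adjusted R-squared penalizes extra predictors, so it is comparable across sizes
print(c(full    = summary(fit_full)$adj.r.squared,
        reduced = summary(fit_reduced)$adj.r.squared))
```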
fit_pitching <- lm(TARGET_WINS ~ TEAM_PITCHING_H+TEAM_PITCHING_HR+TEAM_PITCHING_BB+TEAM_PITCHING_SO,data=baseball_training2)
coef(fit_pitching)
## (Intercept) TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 72.5417444307 -0.0008080885 0.0399348172 0.0180298652
## TEAM_PITCHING_SO
## -0.0054569652
summary(fit_pitching)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO, data = baseball_training2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.034 -9.834 0.644 9.793 77.022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.5417444 1.1496770 63.098 < 2e-16 ***
## TEAM_PITCHING_H -0.0008081 0.0002488 -3.248 0.00118 **
## TEAM_PITCHING_HR 0.0399348 0.0055313 7.220 7.07e-13 ***
## TEAM_PITCHING_BB 0.0180299 0.0022669 7.954 2.83e-15 ***
## TEAM_PITCHING_SO -0.0054570 0.0006910 -7.897 4.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.12 on 2271 degrees of freedom
## Multiple R-squared: 0.06923, Adjusted R-squared: 0.06759
## F-statistic: 42.23 on 4 and 2271 DF, p-value: < 2.2e-16
This model’s Multiple R-squared is 0.06923, considerably lower than the batting model’s 0.2297, although the overall p-value is still below 0.05. All of the predictors’ p-values are also less than 0.05. Even so, the pitching statistics on their own explain only about 7% of the variation in TARGET_WINS, so this model is a weaker predictor than the batting model.
fit_hits_errors <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_PITCHING_H+ TEAM_FIELDING_E,data=baseball_training2)
coef(fit_hits_errors)
## (Intercept) TEAM_BATTING_H TEAM_PITCHING_H TEAM_FIELDING_E
## 12.3612274067 0.0501349210 -0.0005080986 -0.0174154798
summary(fit_hits_errors)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_PITCHING_H +
## TEAM_FIELDING_E, data = baseball_training2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.795 -9.272 0.267 9.056 68.898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3612274 3.0165918 4.098 4.32e-05 ***
## TEAM_BATTING_H 0.0501349 0.0021125 23.732 < 2e-16 ***
## TEAM_PITCHING_H -0.0005081 0.0002813 -1.807 0.071 .
## TEAM_FIELDING_E -0.0174155 0.0017171 -10.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 2272 degrees of freedom
## Multiple R-squared: 0.221, Adjusted R-squared: 0.22
## F-statistic: 214.9 on 3 and 2272 DF, p-value: < 2.2e-16
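Using only these three variables gives a Multiple R-squared of 0.221, slightly below the batting model’s 0.2297 and well above the pitching model’s 0.06923. It explains about 22% of the variation in our target variable with fewer predictors, but on its own it is still not a strong model.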
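Reading R-squared values off three separate summaries is error-prone, so it can help to collect them in one table. The sketch below defines a hypothetical helper, `compare_fits`, and demonstrates it on R's built-in `mtcars` data, which is only a stand-in so the example runs on its own.

```r
# Hypothetical helper: one row of fit statistics per model in a named list
compare_fits <- function(models) {
  data.frame(
    model         = names(models),
    r_squared     = sapply(models, function(m) summary(m)$r.squared),
    adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
    sigma         = sapply(models, function(m) summary(m)$sigma),
    row.names     = NULL
  )
}

# Stand-in models on built-in data, only to show the output shape
toy_models <- list(
  weight_only = lm(mpg ~ wt, data = mtcars),
  weight_hp   = lm(mpg ~ wt + hp, data = mtcars)
)

fit_table <- compare_fits(toy_models)
print(fit_table)
```

With the models from this analysis, the call would be `compare_fits(list(batting = fit_batting, pitching = fit_pitching, hits_errors = fit_hits_errors))`.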
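With our three models built, we can generate predictions for the evaluation data. Note that the evaluation file has no TARGET_WINS column, so it can be used to produce predictions but not to score them. Let’s first load it.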
baseball_evaluation <- read.csv(curl("https://raw.githubusercontent.com/brsingh7/Data621/main/moneyball-evaluation-data.csv"))
#fill NAs for purposes of testing model
baseball_evaluation2 <- baseball_evaluation
baseball_evaluation2 <- na.aggregate(baseball_evaluation2)
summary(baseball_evaluation2)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 565.0 1st Qu.: 60.5
## Median :101.00 Median :509.0 Median : 709.3 Median : 96.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 904.5 3rd Qu.:149.0
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0
## 1st Qu.: 44.00 1st Qu.:62.37 1st Qu.: 1426 1st Qu.: 52.0
## Median : 52.32 Median :62.37 Median : 1515 Median :104.0
## Mean : 52.32 Mean :62.37 Mean : 1813 Mean :102.1
## 3rd Qu.: 56.00 3rd Qu.:62.37 3rd Qu.: 1681 3rd Qu.:142.5
## Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 471.0 1st Qu.: 622.5 1st Qu.: 131.0 1st Qu.:134.5
## Median : 526.0 Median : 782.0 Median : 163.0 Median :146.1
## Mean : 552.4 Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 606.5 3rd Qu.: 927.5 3rd Qu.: 252.0 3rd Qu.:160.5
## Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
#standardize variables
#baseball_evaluation <- baseball_evaluation %>%
# mutate(TEAM_BATTING_WALKS = TEAM_BATTING_BB+TEAM_BATTING_HBP) %>%
# mutate(TEAM_SB_SUCCESS = TEAM_BASERUN_SB/(TEAM_BASERUN_SB+TEAM_BASERUN_CS)) %>%
# select(-c(TEAM_BATTING_BB,TEAM_BATTING_HBP,TEAM_BASERUN_SB,TEAM_BASERUN_CS))
To assess accuracy, we can use a few visuals and the root mean square error (RMSE) of each model. Let’s first run predictions on the baseball_evaluation data set.
pred1 <- predict(fit_batting,newdata = data.frame(baseball_evaluation2))
pred1
## 1 2 3 4 5 6 7 8
## 69.02332 70.39209 76.54290 81.62419 64.27548 65.25927 78.37235 69.91582
## 9 10 11 12 13 14 15 16
## 71.59791 73.09766 75.41997 81.81791 81.58931 78.67432 75.51881 76.72835
## 17 18 19 20 21 22 23 24
## 73.18468 81.91163 67.40273 90.67725 84.20854 86.63365 85.31085 75.15657
## 25 26 27 28 29 30 31 32
## 79.29060 83.04885 59.06793 73.79108 85.00737 74.96752 91.97348 87.01782
## 33 34 35 36 37 38 39 40
## 89.08648 92.29560 83.28620 83.45037 77.10765 89.74024 85.81320 87.98720
## 41 42 43 44 45 46 47 48
## 81.56670 87.04156 42.50173 100.66709 87.20943 93.68922 95.92407 71.94483
## 49 50 51 52 53 54 55 56
## 70.45126 74.87788 76.38373 83.37616 80.49225 72.63923 76.92429 75.76593
## 57 58 59 60 61 62 63 64
## 87.85558 68.52421 63.13311 77.22863 82.55592 85.15860 84.57356 82.94137
## 65 66 67 68 69 70 71 72
## 80.17056 85.94818 77.03164 81.98240 76.78787 87.45434 89.08374 74.51536
## 73 74 75 76 77 78 79 80
## 82.46346 84.82696 81.05446 86.98952 83.26616 79.09429 68.24419 74.10746
## 81 82 83 84 85 86 87 88
## 85.20444 89.29199 96.79917 82.64148 87.54932 77.16695 76.03202 81.60815
## 89 90 91 92 93 94 95 96
## 82.24683 88.79385 78.70431 93.52340 72.80467 77.95506 76.47452 77.42784
## 97 98 99 100 101 102 103 104
## 87.82337 100.81797 91.23864 92.24367 84.18104 74.29019 83.83800 81.97561
## 105 106 107 108 109 110 111 112
## 83.00491 75.94619 65.86750 81.19457 84.67721 68.82092 81.95114 79.59690
## 113 114 115 116 117 118 119 120
## 88.45745 85.44496 79.08546 80.28643 89.34081 79.21481 77.64203 72.97227
## 121 122 123 124 125 126 127 128
## 86.39589 66.39059 67.30869 60.26995 70.71182 82.55702 89.54708 73.69231
## 129 130 131 132 133 134 135 136
## 87.79599 93.33843 87.96655 79.08752 74.60263 84.62829 84.10350 69.34915
## 137 138 139 140 141 142 143 144
## 76.46567 75.89800 79.10612 79.88932 65.40377 69.92514 93.25758 80.96440
## 145 146 147 148 149 150 151 152
## 76.30607 76.27901 80.91122 82.33784 83.76734 79.61306 82.27751 80.13054
## 153 154 155 156 157 158 159 160
## 61.14164 71.95630 77.43822 72.79959 84.91432 73.15190 91.17033 72.56672
## 161 162 163 164 165 166 167 168
## 104.86366 105.15197 91.06320 105.01078 98.28674 90.72715 86.25125 79.78595
## 169 170 171 172 173 174 175 176
## 72.49598 80.29813 88.63038 84.26605 81.01467 89.95105 83.11377 79.27908
## 177 178 179 180 181 182 183 184
## 80.50449 77.52805 76.91121 82.04500 76.88126 84.55123 82.97896 85.21549
## 185 186 187 188 189 190 191 192
## 98.44055 86.73570 92.12372 68.74468 65.60718 107.11029 71.79960 79.63424
## 193 194 195 196 197 198 199 200
## 72.58022 76.83369 80.15356 69.63103 76.02875 82.86953 80.95599 87.75156
## 201 202 203 204 205 206 207 208
## 81.52108 82.16757 77.78889 82.50815 76.79609 81.45618 81.68290 78.19480
## 209 210 211 212 213 214 215 216
## 81.43368 78.46398 101.96047 92.39319 83.97692 70.71751 75.10910 86.95871
## 217 218 219 220 221 222 223 224
## 84.80324 85.24344 75.13609 78.42613 80.99022 74.78310 85.27730 80.76922
## 225 226 227 228 229 230 231 232
## 93.67668 76.25543 78.53725 83.23883 81.55291 81.64655 73.19156 90.46028
## 233 234 235 236 237 238 239 240
## 84.18946 84.27135 80.03538 74.50281 82.56600 77.07609 93.57988 72.89689
## 241 242 243 244 245 246 247 248
## 88.59108 86.40800 82.98258 81.53952 65.96458 83.48599 76.81224 82.22952
## 249 250 251 252 253 254 255 256
## 72.81879 84.15689 83.28610 62.82698 93.83673 48.18839 69.81853 79.83797
## 257 258 259
## 75.82978 78.85357 77.57921
#compare fitted and actual values on the training data
#(the evaluation set has no TARGET_WINS column, so pred1 cannot be scored)
fitted1 <- fitted(fit_batting)
plot(baseball_training2$TARGET_WINS, type="l",lty=1,col="red")
lines(fitted1,type="l",col="blue")
#RMSE: square the errors first, then average, then take the square root
rmse <- sqrt(mean((fitted1-baseball_training2$TARGET_WINS)^2))
rmse
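The RMSE is the square root of the mean of the squared errors, so the squaring has to happen before the averaging, and the two vectors must describe the same observations. A small sketch with made-up numbers shows how much the order matters; `rmse_of` is a hypothetical helper name.

```r
# Hypothetical helper: RMSE with a guard against mismatched lengths
rmse_of <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  sqrt(mean((predicted - actual)^2))
}

# Made-up values: every prediction is off by exactly 10 wins
actual    <- c(70, 80, 90, 100)
predicted <- c(80, 70, 100, 90)
errors    <- predicted - actual

correct_rmse <- rmse_of(predicted, actual) # squares each error, then averages
squared_mean <- sqrt(mean(errors)^2)       # averages first, so errors cancel

print(correct_rmse) # 10
print(squared_mean) # 0
```

Because positive and negative errors cancel when they are averaged first, the second expression can be close to zero even when every individual prediction is far off.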
pred2<-predict(fit_pitching,newdata = data.frame(baseball_evaluation2))
pred2
## 1 2 3 4 5 6 7 8
## 77.04518 79.30322 79.85272 81.42258 68.45317 74.63967 80.83874 76.42000
## 9 10 11 12 13 14 15 16
## 77.41109 79.11062 79.16637 83.09389 84.55075 84.32453 82.14794 81.24158
## 17 18 19 20 21 22 23 24
## 80.14140 82.62919 75.12521 83.75498 82.70278 84.12543 84.62454 82.55133
## 25 26 27 28 29 30 31 32
## 80.23546 81.58861 73.92224 77.80588 82.07987 77.71562 86.84706 84.83176
## 33 34 35 36 37 38 39 40
## 85.81664 87.73500 81.89847 85.52082 82.01151 84.48213 87.69453 82.89879
## 41 42 43 44 45 46 47 48
## 84.12553 83.48571 72.70279 86.90693 84.11633 84.21597 83.10782 77.78950
## 49 50 51 52 53 54 55 56
## 76.87728 78.83716 79.40117 81.53498 80.52444 79.48955 79.18987 78.61305
## 57 58 59 60 61 62 63 64
## 82.38836 75.98521 75.64863 76.19330 80.14578 83.60986 83.12728 84.38542
## 65 66 67 68 69 70 71 72
## 81.73209 81.36614 76.78993 75.78061 77.64728 80.25995 79.78227 79.19576
## 73 74 75 76 77 78 79 80
## 85.03699 86.96859 77.21840 80.97159 83.39845 81.25324 75.15099 74.90960
## 81 82 83 84 85 86 87 88
## 83.42739 81.10773 83.86050 80.99557 85.81101 82.30003 82.11341 81.29180
## 89 90 91 92 93 94 95 96
## 84.04329 84.78144 80.72769 109.41427 79.48985 74.00984 75.90040 77.39775
## 97 98 99 100 101 102 103 104
## 82.70851 83.03437 82.15563 82.74118 80.05892 81.99711 82.46801 85.53390
## 105 106 107 108 109 110 111 112
## 82.47149 76.78341 73.55357 80.21612 84.01747 75.20067 83.06988 80.73286
## 113 114 115 116 117 118 119 120
## 81.10482 80.35060 78.68424 79.65251 81.95838 81.41096 78.54989 77.92738
## 121 122 123 124 125 126 127 128
## 81.04498 76.13429 75.22439 74.93731 76.97530 78.47243 80.44013 77.49180
## 129 130 131 132 133 134 135 136
## 82.99307 82.77731 82.50669 82.50063 79.97061 81.94254 83.18179 76.69299
## 137 138 139 140 141 142 143 144
## 80.30968 79.45320 79.58514 81.30195 73.04708 75.02532 83.10015 80.27213
## 145 146 147 148 149 150 151 152
## 81.53951 82.40013 82.10462 81.78437 81.46879 81.79842 82.90198 79.37736
## 153 154 155 156 157 158 159 160
## 54.59167 76.82498 79.98688 78.18751 83.35955 76.34499 79.13518 76.20533
## 161 162 163 164 165 166 167 168
## 87.14551 89.71291 85.90671 89.68915 89.34562 86.90390 84.63987 83.59410
## 169 170 171 172 173 174 175 176
## 79.43832 82.60606 77.78716 80.17687 82.21869 82.60998 82.65314 82.70713
## 177 178 179 180 181 182 183 184
## 85.55218 81.49001 81.99295 83.24471 82.35500 84.26075 82.98463 85.03622
## 185 186 187 188 189 190 191 192
## 93.77507 78.05510 83.71529 72.61288 76.55040 84.79403 75.80936 75.95766
## 193 194 195 196 197 198 199 200
## 76.64950 78.76897 78.80736 79.81275 80.52056 86.16151 84.31874 82.90969
## 201 202 203 204 205 206 207 208
## 80.22751 79.95144 79.16455 86.91438 80.45776 83.03616 81.20316 79.77685
## 209 210 211 212 213 214 215 216
## 81.19356 79.99647 84.09687 78.27573 80.49052 80.86473 80.89989 81.70446
## 217 218 219 220 221 222 223 224
## 79.75560 80.35548 78.62211 80.83886 81.26718 76.98884 82.57107 78.82091
## 225 226 227 228 229 230 231 232
## 73.20540 80.73887 80.53003 82.10081 82.07548 76.45128 78.49053 81.56820
## 233 234 235 236 237 238 239 240
## 85.91794 83.92520 82.41435 79.31196 79.97506 80.26811 83.78061 78.29002
## 241 242 243 244 245 246 247 248
## 81.56839 83.58202 81.32763 79.19948 78.47194 81.75620 79.44828 81.91921
## 249 250 251 252 253 254 255 256
## 80.04707 81.62991 82.01191 69.91561 84.83660 50.81624 79.78578 84.73741
## 257 258 259
## 78.52096 81.35358 79.91782
#compare fitted and actual values on the training data
#(the evaluation set has no TARGET_WINS column, so pred2 cannot be scored)
fitted2 <- fitted(fit_pitching)
plot(baseball_training2$TARGET_WINS, type="l",lty=1,col="red")
lines(fitted2,type="l",col="blue")
#RMSE: square the errors first, then average, then take the square root
rmse2 <- sqrt(mean((fitted2-baseball_training2$TARGET_WINS)^2))
rmse2
pred3 <- predict(fit_hits_errors,newdata = data.frame(baseball_evaluation2))
pred3
## 1 2 3 4 5 6 7 8
## 69.92189 70.60449 78.87383 86.57739 72.09565 72.72353 74.73608 75.25965
## 9 10 11 12 13 14 15 16
## 70.86333 78.43871 79.65801 83.95100 80.40780 82.60084 80.38189 80.98343
## 17 18 19 20 21 22 23 24
## 74.03710 82.10969 70.40489 93.37952 84.85222 88.86251 81.29777 77.92393
## 25 26 27 28 29 30 31 32
## 85.76686 87.80418 63.57158 80.93589 85.84209 83.38473 88.42351 86.12408
## 33 34 35 36 37 38 39 40
## 85.01269 86.14169 81.89808 81.60331 76.80906 87.50736 87.88554 89.17448
## 41 42 43 44 45 46 47 48
## 84.38400 88.60607 39.06137 83.86190 80.15680 85.00856 90.08344 72.83338
## 49 50 51 52 53 54 55 56
## 71.60092 78.66625 83.48078 90.28015 78.57083 75.74656 79.35438 80.84595
## 57 58 59 60 61 62 63 64
## 82.07339 71.49616 62.12779 76.01162 85.87424 79.82496 84.94993 84.11700
## 65 66 67 68 69 70 71 72
## 82.68306 87.78081 76.29441 85.14064 73.52318 79.30423 89.40244 81.22710
## 73 74 75 76 77 78 79 80
## 81.27792 83.64773 87.91639 86.75402 79.03760 80.60475 71.34585 77.37806
## 81 82 83 84 85 86 87 88
## 90.14202 91.78527 98.51390 85.43525 82.53457 83.90018 78.21573 83.77106
## 89 90 91 92 93 94 95 96
## 82.57503 86.37456 74.97354 78.60790 74.76515 81.47932 79.24140 75.90281
## 97 98 99 100 101 102 103 104
## 79.21503 98.60902 89.45587 89.00847 86.28271 77.32394 87.13607 80.33120
## 105 106 107 108 109 110 111 112
## 80.13284 74.93526 58.35551 83.07222 82.95817 65.79888 80.67781 82.65071
## 113 114 115 116 117 118 119 120
## 90.94625 89.70126 83.90972 83.36449 89.26695 80.31917 81.89505 69.72583
## 121 122 123 124 125 126 127 128
## 84.30796 66.00216 68.94005 62.39899 69.33363 90.43742 93.42221 77.18817
## 129 130 131 132 133 134 135 136
## 89.42069 95.56621 89.65630 82.79842 77.38148 82.47851 83.04879 71.15935
## 137 138 139 140 141 142 143 144
## 76.40127 82.17776 84.11007 80.84765 68.74505 69.92721 93.06669 80.91154
## 145 146 147 148 149 150 151 152
## 76.88632 76.47451 78.64481 82.86925 86.65137 82.37914 82.06659 83.58462
## 153 154 155 156 157 158 159 160
## 52.64843 77.55826 78.15131 74.94113 83.45182 68.75187 85.16782 71.88811
## 161 162 163 164 165 166 167 168
## 96.33997 96.81719 88.34111 97.38312 90.35794 87.39468 83.77134 82.69735
## 169 170 171 172 173 174 175 176
## 75.77354 82.96511 83.71000 82.60755 81.89726 92.58346 84.48676 78.14771
## 177 178 179 180 181 182 183 184
## 77.90270 78.45673 78.02985 79.45286 75.68671 80.95384 83.68254 82.54374
## 185 186 187 188 189 190 191 192
## 87.91744 83.80529 83.26453 63.96956 65.69721 106.30433 71.92543 79.26786
## 193 194 195 196 197 198 199 200
## 80.78624 85.04144 88.41274 74.73990 79.83767 78.19445 78.22955 84.29407
## 201 202 203 204 205 206 207 208
## 78.53667 82.25292 75.25147 84.98479 80.57516 76.80954 82.74379 79.51389
## 209 210 211 212 213 214 215 216
## 77.18789 70.67993 94.62582 90.37015 78.31362 71.69743 75.18288 87.93221
## 217 218 219 220 221 222 223 224
## 85.50733 86.75756 80.20512 76.85669 81.41352 79.15248 84.80038 81.85919
## 225 226 227 228 229 230 231 232
## 96.45704 75.60749 81.57202 80.02834 80.73452 74.65767 72.29034 91.64071
## 233 234 235 236 237 238 239 240
## 80.38514 85.58598 78.34181 75.03108 83.80862 77.22843 83.46140 73.83059
## 241 242 243 244 245 246 247 248
## 89.14121 88.92305 88.04102 86.71147 66.30425 90.47187 82.44069 83.14790
## 249 250 251 252 253 254 255 256
## 77.86597 84.56626 82.80405 66.35048 85.08240 33.37564 71.99924 76.27058
## 257 258 259
## 78.73635 79.56599 80.69877
#compare fitted and actual values on the training data
#(the evaluation set has no TARGET_WINS column, so pred3 cannot be scored)
fitted3 <- fitted(fit_hits_errors)
plot(baseball_training2$TARGET_WINS, type="l",lty=1,col="red")
lines(fitted3,type="l",col="blue")
#RMSE: square the errors first, then average, then take the square root
rmse3 <- sqrt(mean((fitted3-baseball_training2$TARGET_WINS)^2))
rmse3
Based on the analyses above, I would choose the first model, built on the batting statistics, as the best predictor of the three. Its Multiple R-squared (0.2297) is the highest, narrowly ahead of the hits-and-errors model (0.221) and well ahead of the pitching model (0.06923), and all three models have an overall p-value below 0.05. Even so, the batting model explains only about 23% of the variation in TARGET_WINS, so none of the three is a strong predictor on its own. RMSE is only meaningful when predictions are compared against actual values for the same teams; since the evaluation data has no TARGET_WINS, it has to be computed on the training data or on a held-out portion of it, where lower values indicate a better fit and more precise predictions. As for next steps, I would continue adjusting the first model, adding some TEAM_PITCHING and TEAM_FIELDING predictors to increase the multiple R-squared as well as the performance of the model. This would include some trial and error.
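One way to measure accuracy without a labeled evaluation file is to hold out part of the training data and score the model on it. The sketch below is a minimal illustration on simulated data (the `toy` data frame, its coefficients, and the 80/20 split are all assumptions made for the example), not a result for the real models.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the training data (invented values)
toy <- data.frame(
  TEAM_BATTING_H  = rnorm(n, 1469, 140),
  TEAM_FIELDING_E = rnorm(n, 246, 100)
)
toy$TARGET_WINS <- 20 + 0.05 * toy$TEAM_BATTING_H -
  0.02 * toy$TEAM_FIELDING_E + rnorm(n, 0, 13)

# Randomly assign 80% of rows to fitting and keep 20% for scoring
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
train   <- toy[train_rows, ]
holdout <- toy[-train_rows, ]

# Fit on the training portion only
fit <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_FIELDING_E, data = train)

# Score on rows the model has not seen, where the actual wins are known
holdout_pred <- predict(fit, newdata = holdout)
holdout_rmse <- sqrt(mean((holdout_pred - holdout$TARGET_WINS)^2))
print(holdout_rmse)
```

Because the held-out rows play no part in fitting, this RMSE is a fairer estimate of how far off predictions on new teams would be than an RMSE computed on the fitting data.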