As a huge fan of Major League Baseball, I have many reasons to love the sport. In recent years, my interest in the game has grown because of its use of Analytics and my own study of Data Science. Heavily influenced by the work of Bill James and others, MLB pioneered the use of statistics to evaluate players objectively and to create new statistics that provide better insight into the value of players and what really contributes to winning games. MLB’s use of Analytics has even penetrated popular culture: the book and film “Moneyball” made using Analytics look heroic and cool. Even someone who is not a fan of the sport, but who has a strong interest in seeing the principles of Data Science in action, should take a look at how MLB has used Analytics to improve the sport.
In baseball, there’s a heuristic that pitching wins games and matters more than hitting and scoring. In a blog article titled “Hitting or Pitching, Which wins more games” (www.wmbriggs.com/post/120/, April 28, 2008), Tim Murray and William Briggs explore which is a better predictor of wins, hitting or pitching. For this project, I took a slightly different approach: I wanted to see whether pitching alone can predict wins.
The hypothesis test is as follows:
H0: Pitching alone cannot predict wins by an MLB team
HA: Pitching alone can predict wins by an MLB team
The data was collected from the website fangraphs.com, which is operated by FanGraphs, Inc. FanGraphs compiles statistical data for the entire history of Major League Baseball (“MLB”). In addition, it creates and records advanced baseball metrics beyond the established statistics. FanGraphs is well established as a chronicler and compiler of baseball statistics and has partnership deals with ESPN and SB Nation. It has an interface for members (of which I am one) where you can download individual and team statistics covering a single season or multiple years. For this project, I downloaded data for all 30 teams from 1998 to 2017. I purposely chose 1998 as the starting year because that was the last year that MLB expanded its number of teams. On the site, you can generate a custom report for pitching statistics, which is how I collected the data. I combined the downloads into one csv file, which is located in the DATA folder. The file is titled: DATA/MLB_PITCHING_STATS_1998_to_2017.csv
mlb_df <- read.csv('DATA/MLB_PITCHING_STATS_1998_to_2017.csv')
In the mlb_df dataset, there are 600 cases or observations (30 teams x 20 seasons). This is observational data based on recorded statistics from these teams. Each observation represents the win total and key pitching statistics for one MLB team in one season between 1998 and 2017.
dim(mlb_df)
## [1] 600 22
glimpse(mlb_df)
## Observations: 600
## Variables: 22
## $ Team <fct> Braves, Astros, Padres, Mets, Dodgers, Yankees, Pirate...
## $ W <int> 106, 102, 98, 88, 83, 114, 69, 89, 92, 88, 83, 63, 65,...
## $ ERA <dbl> 3.25, 3.50, 3.63, 3.77, 3.81, 3.82, 3.91, 4.19, 4.19, ...
## $ H <int> 1291, 1435, 1384, 1381, 1332, 1357, 1433, 1457, 1406, ...
## $ R <int> 581, 620, 635, 645, 678, 656, 718, 739, 729, 768, 782,...
## $ ER <int> 520, 572, 587, 611, 612, 619, 629, 687, 668, 698, 706,...
## $ HR <int> 117, 147, 139, 152, 135, 156, 147, 171, 168, 169, 151,...
## $ BB <int> 467, 465, 501, 532, 587, 466, 530, 562, 504, 587, 558,...
## $ SO <int> 1232, 1187, 1217, 1129, 1178, 1080, 1112, 1089, 1025, ...
## $ KsPer9 <dbl> 7.71, 7.26, 7.53, 6.97, 7.33, 6.67, 6.91, 6.64, 6.42, ...
## $ BBper9 <dbl> 2.92, 2.84, 3.10, 3.28, 3.65, 2.88, 3.29, 3.42, 3.16, ...
## $ KperBB <dbl> 2.64, 2.55, 2.43, 2.12, 2.01, 2.32, 2.10, 1.94, 2.03, ...
## $ Hper9 <dbl> 8.08, 8.78, 8.56, 8.52, 8.28, 8.38, 8.90, 8.88, 8.81, ...
## $ HRper9 <dbl> 0.73, 0.90, 0.86, 0.94, 0.84, 0.96, 0.91, 1.04, 1.05, ...
## $ AVG <dbl> 0.236, 0.251, 0.247, 0.248, 0.241, 0.244, 0.253, 0.253...
## $ WHIP <dbl> 1.22, 1.29, 1.30, 1.31, 1.33, 1.25, 1.35, 1.37, 1.33, ...
## $ BABIP <dbl> 0.285, 0.295, 0.293, 0.287, 0.284, 0.277, 0.292, 0.287...
## $ LOB_pct <dbl> 0.74, 0.76, 0.75, 0.76, 0.73, 0.74, 0.71, 0.73, 0.71, ...
## $ FIP <dbl> 3.53, 3.86, 3.84, 4.18, 4.06, 4.15, 4.08, 4.42, 4.40, ...
## $ RAR <dbl> 255.7, 178.7, 202.1, 143.8, 171.8, 212.5, 172.4, 120.3...
## $ K_pct <dbl> 0.207, 0.191, 0.198, 0.183, 0.191, 0.177, 0.179, 0.171...
## $ BB_pct <dbl> 0.08, 0.08, 0.08, 0.09, 0.10, 0.08, 0.09, 0.09, 0.08, ...
The table below briefly describes all 22 variables. For a more in-depth description, please see the glossary on FanGraphs.com, which covers all of these statistics.
| Variables | Description |
|---|---|
| Team | Name of the MLB Team |
| W | Number of Wins |
| ERA | Team Earned Run Average |
| H | Number of Hits Allowed by Team |
| R | Number of Runs Allowed by Team |
| ER | Earned Runs Allowed by Team |
| HR | Home Runs Allowed by Team |
| BB | Walks Allowed by Team |
| SO | Strikeouts by Team |
| KsPer9 | Number of Strikeouts per 9 innings |
| BBper9 | Number of Walks per 9 innings |
| KperBB | Strikeouts per Walks |
| Hper9 | Hits allowed per 9 innings |
| HRper9 | Home Runs per 9 innings |
| AVG | Batting average allowed by Team |
| WHIP | Walks plus Hits per inning pitched |
| BABIP | The rate at which the pitcher allows a hit when the ball is put in play, calculated as (H-HR)/(AB-K-HR+SF) |
| LOB_pct | Percentage of pitcher’s own base runners that they strand over the course of a season. Not equal to the LOB column in the box score |
| FIP | An estimate of a pitcher’s ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play |
| RAR | Runs above replacement level: the number of runs a player has been worth to his team compared to a freely available player, based on FIP |
| K_pct | Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced |
| BB_pct | Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced |
For this project, Wins is the response variable. Looking at the distribution below, you can see that Wins is approximately normally distributed: the distribution is unimodal, with the highest frequency in the middle.
ggplot(mlb_df, aes(x=W)) + geom_histogram(bins=20,
col="blue",
fill="orange") + labs(title="Histogram for Wins", align="center") +
labs(x="Wins", y="Count")
#### New Categorical Variable - Record
I created a new variable called “Record” which categorizes the win totals into two levels: at or above 50% (81 or more wins in a 162-game season) and below 50%.
mlb_df[mlb_df$W >= 81, "Record"] <- as.character("1")
mlb_df[mlb_df$W < 81, "Record"] <- as.character("0")
mlb_df$Record <- as.factor(mlb_df$Record)
MLB is an unusual sport in that it consists of two different leagues, the American and National Leagues (“AL” and “NL”), that play under different rules. In the National League, the pitcher, typically the worst hitter, has to bat. In the American League, teams are allowed a Designated Hitter.
In short, one league, the AL, has the potential for more offense than the other, and this affects pitching: NL pitchers typically have lower ERAs. I created a new categorical variable called “League” and assigned it to each team.
mlb_df$League <- case_when(mlb_df$Team == 'Braves'~'National League',
mlb_df$Team == 'Marlins'~'National League',
mlb_df$Team == 'Mets'~'National League',
mlb_df$Team == 'Phillies'~'National League',
mlb_df$Team == 'Nationals'~'National League',
mlb_df$Team == 'Cubs'~'National League',
mlb_df$Team == 'Reds'~'National League',
mlb_df$Team == 'Brewers'~'National League',
mlb_df$Team == 'Pirates'~'National League',
mlb_df$Team == 'Cardinals'~'National League',
mlb_df$Team == 'Diamondbacks'~'National League',
mlb_df$Team == 'Rockies'~'National League',
mlb_df$Team == 'Dodgers'~'National League',
mlb_df$Team == 'Padres'~'National League',
mlb_df$Team == 'Giants'~'National League',
TRUE ~ as.character('American League'))
During the past 20 years, one team, the Houston Astros, changed leagues from the NL to the AL (in 2013), so I have to account for this switch.
rows <- as.numeric(row.names.data.frame(mlb_df[mlb_df$Team == 'Astros',]))
rows <- rows[rows < 480]
mlb_df$League[ rows] <- "National League"
mlb_df$League <- as.factor(mlb_df$League)
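The long `case_when()` chain can also be written as a lookup against a vector of National League team names, which makes a misspelled or missing team easier to spot. The sketch below is a minimal illustration on a small made-up vector of team names (`nl_teams`, `sample_teams`, and `sample_league` are names introduced here, not part of the original analysis). It does not handle the Astros' league switch, which still needs the separate step shown above.

```r
# National League team names, spelled as they appear in the data
nl_teams <- c("Braves", "Marlins", "Mets", "Phillies", "Nationals",
              "Cubs", "Reds", "Brewers", "Pirates", "Cardinals",
              "Diamondbacks", "Rockies", "Dodgers", "Padres", "Giants")

# Hypothetical sample of team names, for illustration only
sample_teams <- c("Braves", "Yankees", "Diamondbacks", "Red Sox")

# Anything not listed in nl_teams falls through to the American League
sample_league <- factor(ifelse(sample_teams %in% nl_teams,
                               "National League", "American League"))
sample_league

# NL names that never occur in the data; on the full dataset a non-empty
# result would point to a spelling slip in nl_teams
setdiff(nl_teams, sample_teams)
```

Keeping the names in one vector means the same list can be reused for the 2018 data instead of retyping the chain.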
In MLB, there are traditional pitching statistics such as ERA, which has long been used to measure a pitcher’s effectiveness. ERA measures the earned runs that a pitcher yields per nine innings. Traditionally, ERA was the primary statistic used to evaluate pitching. Since the purpose of this model is to determine if pitching can predict wins, the expectation is that ERA should be the best predictor from a traditional perspective.
Below, we can see that ERA is unimodal and approximately normally distributed.
describe(mlb_df$ERA)
## mlb_df$ERA
## n missing distinct Info Mean Gmd .05 .10
## 600 0 205 1 4.28 0.6167 3.43 3.58
## .25 .50 .75 .90 .95
## 3.89 4.24 4.66 5.01 5.20
##
## lowest : 2.94 3.02 3.03 3.13 3.15, highest: 5.56 5.67 5.69 5.71 6.03
ggplot(mlb_df, aes(x=ERA)) + geom_histogram(bins=20,
col="white",
fill="red") + labs(title="Histogram for ERA") +
labs(x="Earned Run Average", y="Count")
Sabermetrics is the new school of baseball metrics. It applies statistics in a more nuanced way to evaluate effectiveness. Fielding Independent Pitching (“FIP”) is one such measure. It estimates the effectiveness of a pitcher independent of the performance of their defense, using only outcomes the pitcher controls directly: strikeouts, walks, hit batters, and home runs allowed.
Below we can see that FIP is also approximately normally distributed.
describe(mlb_df$FIP)
## mlb_df$FIP
## n missing distinct Info Mean Gmd .05 .10
## 600 0 179 1 4.279 0.4915 3.580 3.729
## .25 .50 .75 .90 .95
## 3.970 4.280 4.570 4.841 5.010
##
## lowest : 3.18 3.24 3.27 3.30 3.33, highest: 5.29 5.31 5.46 5.52 5.54
ggplot(mlb_df, aes(x=FIP)) + geom_histogram(bins=15,
col="blue",
fill="orange") + labs(title="Histogram for Fielding Independent Pitching") +
labs(x="FIP", y="Count")
Which pitching variable, ERA or FIP, is a better predictor of Wins? In this section, I used Simple Linear Regression to assess which variable is a better predictor for Wins.
ggplot(mlb_df, aes(x=FIP, y=W))+
geom_point(aes(col=League))
ggplot(mlb_df, aes(x=ERA, y=W))+
geom_point(aes(col=League))
Visually, both variables display a somewhat negative linear relationship with wins, which makes intuitive sense: the higher each stat, the fewer the wins. ERA seems to have a stronger correlation with Wins than FIP. In the next section, I fit a simple linear regression of Wins on each of the two.
cor(mlb_df$W, mlb_df$FIP)
## [1] -0.5190366
cor(mlb_df$W, mlb_df$ERA)
## [1] -0.6256832
p1 <- lm(W ~ FIP, data = mlb_df)
summary(p1)
##
## Call:
## lm(formula = W ~ FIP, data = mlb_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.845 -6.903 -0.366 7.427 32.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140.2366 4.0117 34.96 <2e-16 ***
## FIP -13.8506 0.9327 -14.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.852 on 598 degrees of freedom
## Multiple R-squared: 0.2694, Adjusted R-squared: 0.2682
## F-statistic: 220.5 on 1 and 598 DF, p-value: < 2.2e-16
\[ \hat{y} = 140.2366 -13.8506 * FIP \]
The intercept for this model is 140.2366, which means that a team’s expected win total is about 140 games when its FIP is 0. This is an extrapolation far outside the observed range of FIP (3.18 to 5.54), but it does make some sense, since it would mean that the only way an opposing team would score runs would be through errors or other defensive misplays.
The slope is -13.8506, which is the numerical relationship between FIP and Wins. For every increase of 1 in FIP, the expected number of wins goes down by 13.8506, with the reverse also being true for a 1 unit decrease in FIP. The standard error for the intercept is 4.0117, and for the slope, it’s 0.9327.
The R-squared score of 0.2694 may seem low. R-squared is the fraction by which the variance of the errors is less than the variance of the dependent variable; here, FIP alone accounts for about 27% of the variation in Wins. High variability of the dependent variable is to be expected in MLB.
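For a regression with a single predictor, R-squared is simply the square of the correlation between the predictor and the response, so the reported R-squared values can be cross-checked against the correlations computed earlier. The snippet below only redoes that arithmetic with numbers printed in this report; the variable names are introduced here.

```r
# Correlations between Wins and each statistic, copied from the cor() output
cor_w_fip <- -0.5190366
cor_w_era <- -0.6256832

# In simple linear regression, R-squared equals the squared correlation
r2_fip <- cor_w_fip^2
r2_era <- cor_w_era^2

round(c(FIP = r2_fip, ERA = r2_era), 4)
```

These agree with the Multiple R-squared values in the two model summaries (0.2694 for FIP and 0.3915 for ERA).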
Looking at the p-value for the FIP coefficient, we see that it is < 0.05, which makes it statistically significant. We can reject the null hypothesis that FIP has no linear relationship with Wins.
Below, I test the model using stats from 2018. In that year, the Mets had a team FIP of 3.97, and inserting it into the equation, the Mets should have won 85 games. The Mets actually won 77 games in 2018.
Mets_FIP_Intercept = 140.2366
Mets_FIP_Slope = -13.8506
Mets_FIP = 3.97
Mets_FIP_Wins = Mets_FIP_Intercept + (Mets_FIP_Slope *Mets_FIP)
Mets_FIP_Wins
## [1] 85.24972
Diagnostics – FIP
Conditions are met for Linear Regression
plot(p1$residuals ~ mlb_df$FIP)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
ggplot(data=p1, aes(p1$residuals)) +
geom_histogram(breaks=seq(-40, 40, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
labs(title="Histogram for FIP Residuals", x="FIP Residuals")
qqnorm(p1$residuals)
qqline(p1$residuals)
p2 <- lm(W ~ ERA, data = mlb_df)
summary(p2)
##
## Call:
## lm(formula = W ~ ERA, data = mlb_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.5474 -5.8343 -0.3085 6.5031 26.9156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.8924 2.9255 47.13 <2e-16 ***
## ERA -13.3005 0.6781 -19.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.992 on 598 degrees of freedom
## Multiple R-squared: 0.3915, Adjusted R-squared: 0.3905
## F-statistic: 384.7 on 1 and 598 DF, p-value: < 2.2e-16
\[ \hat{y} = 137.8924 -13.3005 * ERA \]
The intercept for this model is 137.8924, which means that a team’s expected win total is about 138 games when its ERA is 0. This is an extrapolation far outside the observed range of ERA (2.94 to 6.03), but intuitively it does make some sense, since it would mean that the only way an opposing team would score runs would be through errors or other defensive misplays.
The slope is -13.30, which is the numerical relationship between ERA and Wins. For every 1 unit increase in ERA, the expected number of wins goes down by 13.30, with the reverse also being true for a 1 unit decrease in ERA. The standard error for the intercept is 2.9255, and for the slope, it’s 0.6781.
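The standard errors reported in the two model summaries can be turned into approximate 95% confidence intervals for the slopes. The sketch below uses only the estimates, standard errors, and residual degrees of freedom (598) printed in those summaries; on the fitted objects themselves, `confint(p1)` and `confint(p2)` would give the same kind of interval directly.

```r
# Estimates and standard errors copied from the summary() output
slopes     <- c(FIP = -13.8506, ERA = -13.3005)
std_errors <- c(FIP = 0.9327,   ERA = 0.6781)

# Critical t value for a 95% interval with 598 residual degrees of freedom
t_crit <- qt(0.975, df = 598)

# Interval = estimate +/- t * standard error
ci <- cbind(lower = slopes - t_crit * std_errors,
            upper = slopes + t_crit * std_errors)
round(ci, 2)
```

Both intervals lie entirely below zero, which is another way of seeing that each slope is significantly different from zero.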
Looking at the p-value for the ERA coefficient, we see that it is < 0.05, which makes it statistically significant. We can reject the null hypothesis that ERA has no linear relationship with Wins.
Below, I test the model using stats from 2018. In that year, the Mets had a team ERA of 4.07, and inserting it into the equation, the Mets should have won 84 games. The Mets actually won 77 games in 2018.
Mets_ERA_Intercept = 137.8924
Mets_ERA_Slope = -13.3005
Mets_ERA = 4.07
Mets_ERA_Wins = Mets_ERA_Intercept + (Mets_ERA_Slope *Mets_ERA)
Mets_ERA_Wins
## [1] 83.75937
Diagnostics – ERA
Conditions are met for Linear Regression
plot(p2$residuals ~ mlb_df$ERA)
abline(h = 0, lty = 3)
ggplot(data=p2, aes(p2$residuals)) +
geom_histogram(breaks=seq(-40, 40, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
labs(title="Histogram for ERA Residuals", x="ERA Residuals")
qqnorm(p2$residuals)
qqline(p2$residuals)
Reflections:
Both statistics are good predictors of a team’s win total. In the next three code blocks, I created a function for each of the Simple Linear Regression models, loaded 2018 data for testing purposes, and applied both functions to predict wins from each team’s FIP or ERA. I then calculated the difference between actual wins and predicted wins.
In the end, the average differences for the two statistics were virtually the same.
ERAPredictWins <- function(ERA) {
ERA_Intercept = 137.8924
ERA_Slope = -13.3005
ERA_Wins = ERA_Intercept + (ERA_Slope * ERA)
return(ERA_Wins)
}
FIPPredictWins <- function(FIP) {
FIP_Intercept = 140.2366
FIP_Slope = -13.8506
FIP_Wins = FIP_Intercept + (FIP_Slope * FIP)
return(FIP_Wins)
}
mlb_df_2018 <- read.csv("DATA/MLB_ERA_FIP_2018.csv")
mlb_df_2018$FIP_Pred_W <- FIPPredictWins(mlb_df_2018$FIP)
mlb_df_2018$ERA_Pred_W <- ERAPredictWins(mlb_df_2018$ERA)
mlb_df_2018$ERA_Win_Diff <- (mlb_df_2018$W - mlb_df_2018$ERA_Pred_W )
mlb_df_2018$FIP_Win_Diff <- (mlb_df_2018$W - mlb_df_2018$FIP_Pred_W )
mean(mlb_df_2018$FIP_Win_Diff)
## [1] -1.690959
mean(mlb_df_2018$ERA_Win_Diff)
## [1] -1.62209
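A mean of signed differences can hide large misses, because over- and under-predictions cancel each other out. The mean absolute error (MAE) and root mean squared error (RMSE) avoid that. The sketch below applies both to the ten 2018 teams whose wins, FIP, and ERA are printed later in this report. It is an illustration on that ten-team subset only, not a rerun of the full 30-team comparison, and the `errors_fip`, `errors_era`, and `error_summary` names are introduced here.

```r
# 2018 values for the ten teams shown in the head() output of this report
wins <- c(103, 92, 95, 82, 96, 90, 90, 108, 91, 100)
fip  <- c(3.23, 3.60, 4.14, 3.91, 4.01, 3.82, 3.99, 3.82, 3.79, 3.63)
era  <- c(3.11, 3.40, 3.65, 3.73, 3.73, 3.75, 3.75, 3.75, 3.77, 3.78)

# Same coefficients as FIPPredictWins() and ERAPredictWins()
errors_fip <- wins - (140.2366 - 13.8506 * fip)
errors_era <- wins - (137.8924 - 13.3005 * era)

# Summarise a vector of prediction errors three ways
error_summary <- function(e) {
  c(mean_error = mean(e), MAE = mean(abs(e)), RMSE = sqrt(mean(e^2)))
}

round(rbind(FIP = error_summary(errors_fip),
            ERA = error_summary(errors_era)), 2)
```

MAE and RMSE are never smaller than the absolute value of the mean error, so they give a more conservative picture of how far off an individual prediction tends to be.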
Returning to the original purpose of this project, testing the hypothesis that pitching can predict wins, I built a multiple regression model using all of the pitching statistics.
pitch_full <- lm(W ~ ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 + KperBB + Hper9 + HRper9 +
AVG + WHIP + BABIP + LOB_pct + FIP + RAR + K_pct + BB_pct, data = mlb_df)
summary(pitch_full)
##
## Call:
## lm(formula = W ~ ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 +
## KperBB + Hper9 + HRper9 + AVG + WHIP + BABIP + LOB_pct +
## FIP + RAR + K_pct + BB_pct, data = mlb_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.8122 -4.5678 0.1799 4.3503 18.3915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.696e+01 7.110e+01 -0.379 0.70471
## ERA 6.276e+01 7.174e+01 0.875 0.38197
## H 2.057e-01 2.848e-01 0.722 0.47061
## R -6.229e-02 4.929e-02 -1.264 0.20686
## ER -3.186e-01 4.472e-01 -0.713 0.47641
## HR 1.618e-01 5.948e-01 0.272 0.78572
## BB 5.506e-01 4.398e-01 1.252 0.21102
## SO -1.827e-01 1.762e-01 -1.037 0.30039
## KsPer9 3.994e+01 2.697e+01 1.481 0.13908
## BBper9 -9.433e+01 7.001e+01 -1.347 0.17837
## KperBB 3.674e+00 4.777e+00 0.769 0.44210
## Hper9 -3.224e+01 4.554e+01 -0.708 0.47927
## HRper9 -5.053e+01 9.347e+01 -0.541 0.58895
## AVG 2.897e+02 7.056e+02 0.411 0.68155
## WHIP -1.570e+01 9.799e+01 -0.160 0.87281
## BABIP -4.006e+02 4.900e+02 -0.818 0.41393
## LOB_pct 1.322e+02 7.703e+01 1.716 0.08670 .
## FIP 1.362e+01 4.676e+00 2.914 0.00371 **
## RAR 1.326e-01 7.408e-03 17.896 < 2e-16 ***
## K_pct -2.974e+02 4.137e+02 -0.719 0.47244
## BB_pct 9.459e+01 9.840e+01 0.961 0.33683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.897 on 579 degrees of freedom
## Multiple R-squared: 0.6533, Adjusted R-squared: 0.6413
## F-statistic: 54.55 on 20 and 579 DF, p-value: < 2.2e-16
After several iterations in which I removed variables to find the best model, I identified the following variables, whose coefficients have statistically significant p-values.
| Variables | Description |
|---|---|
| RAR | Runs above replacement level: the number of runs a player has been worth to his team compared to a freely available player, based on FIP |
| LOB_pct | Percentage of pitcher’s own base runners that they strand over the course of a season. Not equal to the LOB column in the box score |
| Hper9 | Hits allowed per 9 innings |
| H | Number of Hits Allowed by Team |
| BBper9 | Number of Walks per 9 innings |
| FIP | An estimate of a pitcher’s ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play |
| HRper9 | Home Runs per 9 innings |
| ERA | Team Earned Run Average |
pitch_full_best <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA , data = mlb_df)
summary(pitch_full_best)
##
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 +
## ERA, data = mlb_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9150 -4.5522 0.1612 4.4809 18.1029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.342203 28.897722 -2.227 0.0264 *
## RAR 0.132246 0.007255 18.228 < 2e-16 ***
## LOB_pct 195.507407 36.061038 5.422 8.62e-08 ***
## H 0.104260 0.025240 4.131 4.14e-05 ***
## Hper9 -23.019203 4.277447 -5.382 1.07e-07 ***
## BBper9 -8.854223 1.493380 -5.929 5.19e-09 ***
## FIP 14.223588 2.092105 6.799 2.59e-11 ***
## HRper9 -21.098982 4.983850 -4.233 2.67e-05 ***
## ERA 7.880635 3.573680 2.205 0.0278 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.871 on 591 degrees of freedom
## Multiple R-squared: 0.6488, Adjusted R-squared: 0.644
## F-statistic: 136.5 on 8 and 591 DF, p-value: < 2.2e-16
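The elimination steps themselves are not shown in this report. As a rough sketch of how p-value-based backward elimination can be automated, the function below refits the model repeatedly and drops the predictor with the largest p-value until every remaining one is at or below the cutoff. `backward_eliminate` is a helper written for this illustration, and it is run on R's built-in `mtcars` data as a stand-in because the FanGraphs CSV is not bundled here. It assumes numeric predictors only, and it is not necessarily the exact procedure used to reach the model above.

```r
# Hypothetical helper: backward elimination driven by coefficient p-values
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    keep <- rownames(coefs) != "(Intercept)"
    p_values <- coefs[keep, "Pr(>|t|)"]
    names(p_values) <- rownames(coefs)[keep]

    # Stop once everything is significant or a single predictor is left
    if (max(p_values) <= alpha || length(predictors) == 1) {
      return(fit)
    }
    # Otherwise drop the least significant predictor and refit
    predictors <- setdiff(predictors, names(p_values)[which.max(p_values)])
  }
}

# Stand-in example: predict mpg from all other columns of mtcars
reduced <- backward_eliminate(mtcars, "mpg", setdiff(names(mtcars), "mpg"))
summary(reduced)$coefficients
```

Dropping one variable at a time matters when predictors are strongly related to each other, as many pitching statistics are, because removing one can change the p-values of the rest.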
The final model is below:
\[ \hat{y} = -64.342203 + 0.132246*RAR + 195.507407*LOB\_pct + 0.104260*H - 23.019203*Hper9 - 8.854223*BBper9 + 14.223588*FIP - 21.098982*HRper9 + 7.880635*ERA \]
DIAGNOSTICS and ASSUMPTIONS
qqnorm(pitch_full_best$residuals, main="Normal Q-Q Plot for Pitching Best Model ")
qqline(pitch_full_best$residuals)
plot(abs(pitch_full_best$residuals) ~ pitch_full_best$fitted.values, main="Variability of Residuals for Pitching Best Model")
ggplot(data=pitch_full_best, aes(x=pitch_full_best$residuals)) +
geom_histogram(breaks=seq(-35, 35, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
labs(title="Histogram for Best Model Residuals", x="Best Model Residuals")
Test on 2018 data
In the code block below, I loaded the pitching stats for the best model along with 2018 win totals. I created a function that predicts wins based on these top stats and compared them to the actual win totals.
stats <- c("RAR", "LOB_pct", "H", "Hper9", "BBper9", "FIP", "HRper9", "ERA" )
Best_Model_Predict_Wins <- function (stats) {
Intercept = -64.342203
Slope = c(0.132246, 195.507407, 0.104260, -23.019203, -8.854223, 14.223588, -21.098982, 7.880635)
coeff = sum((Slope * stats))
Wins = Intercept + coeff
return(Wins)
}
mlb_df_2018_best <- read.csv("DATA/MLB_ERA_MODEL_BEST.csv")
mlb_df_2018_best$LOB_pct <- as.numeric(mlb_df_2018_best$LOB_pct)
mlb_df_2018_best$Prediction <- apply(mlb_df_2018_best[3:10],1,Best_Model_Predict_Wins)
mlb_df_2018_best$Win_Diff <- mlb_df_2018_best$W - mlb_df_2018_best$Prediction
In the scatter plot below, the Predicted win total is plotted against the actual win total. The size of each dot is the absolute difference between actual wins and predicted wins. Color represents each team.
ggplot(mlb_df_2018_best, aes(x=Prediction, y=W)) +
geom_point(aes(color=Team, size= abs(Win_Diff)))
head(mlb_df_2018_best,10)
## Team W RAR LOB_pct H Hper9 BBper9 FIP HRper9 ERA
## 1 Astros 103 287.3 0.779 1164 7.20 2.69 3.23 0.94 3.11
## 2 Dodgers 92 189.7 0.762 1279 7.80 2.57 3.60 1.09 3.40
## 3 Cubs 95 121.7 0.762 1319 8.04 3.79 4.14 0.96 3.65
## 4 Diamondbacks 82 153.0 0.757 1313 8.08 3.21 3.91 1.07 3.73
## 5 Brewers 96 155.5 0.744 1259 7.76 3.41 4.01 1.07 3.73
## 6 Rays 90 156.3 0.733 1236 7.68 3.11 3.82 1.02 3.75
## 7 Braves 90 140.7 0.741 1236 7.64 3.92 3.99 0.95 3.75
## 8 Red Sox 108 192.6 0.758 1305 8.05 3.16 3.82 1.09 3.75
## 9 Indians 91 210.4 0.760 1349 8.33 2.51 3.79 1.24 3.77
## 10 Yankees 100 252.3 0.739 1311 8.10 3.05 3.63 1.09 3.78
## Prediction Win_Diff
## 1 108.37278 -5.372782
## 2 95.76610 -3.766096
## 3 87.01077 7.989229
## 4 88.79979 -6.799793
## 5 87.97643 8.023569
## 6 86.54155 3.458451
## 7 83.68636 6.313643
## 8 92.98696 15.013041
## 9 96.19532 -5.195319
## 10 95.14991 4.850094
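As a check on the hand-coded function, the same predictions can be reproduced with a single matrix multiplication. The sketch below rebuilds the first two rows of the table above (Astros and Dodgers) and applies the rounded coefficients from the final model; `first_two`, `intercept`, `coefs`, and `predicted` are names introduced for this illustration. With the fitted model object available, `predict(pitch_full_best, newdata = mlb_df_2018_best)` should give essentially the same numbers without retyping any coefficients.

```r
# First two 2018 rows, copied from the head() output above
first_two <- data.frame(
  Team    = c("Astros", "Dodgers"),
  RAR     = c(287.3, 189.7),
  LOB_pct = c(0.779, 0.762),
  H       = c(1164, 1279),
  Hper9   = c(7.20, 7.80),
  BBper9  = c(2.69, 2.57),
  FIP     = c(3.23, 3.60),
  HRper9  = c(0.94, 1.09),
  ERA     = c(3.11, 3.40)
)

# Coefficients of the final model, named after the columns they multiply
intercept <- -64.342203
coefs <- c(RAR = 0.132246, LOB_pct = 195.507407, H = 0.104260,
           Hper9 = -23.019203, BBper9 = -8.854223, FIP = 14.223588,
           HRper9 = -21.098982, ERA = 7.880635)

# One matrix product predicts every row at once
predicted <- intercept + as.matrix(first_two[, names(coefs)]) %*% coefs
round(drop(predicted), 2)
# agrees with the 108.37 and 95.77 in the Prediction column above
```

Selecting columns by name, rather than by position as in `mlb_df_2018_best[3:10]`, keeps the calculation correct even if the column order of the CSV changes.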
By looking at the p-values for the coefficients, the diagnostics for the model, and the adjusted R-squared of 0.644, we can reject the null hypothesis that pitching cannot predict win totals. The data indicate that pitching is a critical factor in determining how many wins a team will have.
In the next section, I look at two categorical variables, Record and League, to evaluate whether pitching can predict a winning season and how pitching can predict based on the leagues.
In this section, I explore whether pitching can predict a winning team. For the purposes of this project, a winning team is one that wins at least 50% of its games. The categorical variable “Record” captures this split.
ggplot(mlb_df, aes(x=Record)) + geom_bar(color="Green", fill="yellow") + labs(title="Histogram for Record") +
labs(x="Record", y="Count")
In the code block below, Record is the binary categorical factor with a value of “1” for a season at or above 50% in wins and a value of “0” for less than 50% wins.
Record <- glm(Record~ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 + KperBB + Hper9 + HRper9 +
AVG + WHIP + BABIP + LOB_pct + FIP + RAR + K_pct + BB_pct, data = mlb_df, family='binomial')
summary(Record) # display results
##
## Call:
## glm(formula = Record ~ ERA + H + R + ER + HR + BB + SO + KsPer9 +
## BBper9 + KperBB + Hper9 + HRper9 + AVG + WHIP + BABIP + LOB_pct +
## FIP + RAR + K_pct + BB_pct, family = "binomial", data = mlb_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.87008 -0.38161 0.03218 0.41238 2.57727
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.49685 38.62642 -1.463 0.1436
## ERA 22.66176 36.74005 0.617 0.5374
## H 0.04891 0.13988 0.350 0.7266
## R -0.02248 0.02502 -0.898 0.3690
## ER -0.10626 0.22876 -0.465 0.6423
## HR 0.17283 0.28930 0.597 0.5502
## BB 0.16023 0.21065 0.761 0.4469
## SO -0.05402 0.09313 -0.580 0.5619
## KsPer9 15.50739 14.77751 1.049 0.2940
## BBper9 -24.31243 34.00041 -0.715 0.4746
## KperBB 3.55487 2.89923 1.226 0.2201
## Hper9 -7.16777 22.84385 -0.314 0.7537
## HRper9 -39.22307 45.11088 -0.869 0.3846
## AVG 448.37802 363.40464 1.234 0.2173
## WHIP -39.67814 47.62090 -0.833 0.4047
## BABIP -331.12627 251.34917 -1.317 0.1877
## LOB_pct 57.23993 39.59791 1.446 0.1483
## FIP 2.79338 2.25450 1.239 0.2153
## RAR 0.05123 0.00510 10.046 <2e-16 ***
## K_pct -195.49126 232.44620 -0.841 0.4003
## BB_pct 80.85668 47.66879 1.696 0.0898 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 831.45 on 599 degrees of freedom
## Residual deviance: 370.42 on 579 degrees of freedom
## AIC: 412.42
##
## Number of Fisher Scoring iterations: 6
Using backward elimination with a p-value cutoff of 0.05, I identified only one statistically significant variable, Runs Above Replacement (“RAR”), which measures how many runs a team’s pitchers are worth compared with freely available, replacement-level pitchers. The equation below captures this relationship.
\[ \ln\left(\frac{p}{1-p}\right) = -6.0655 + 0.044595 * RAR \]
We can interpret this model to mean that each additional run of RAR multiplies the odds of a winning season by exp(0.044595), or about 1.046. In other words, the odds (not the probability itself) increase by roughly 4.6% per run.
Best_Model <- glm(Record~RAR,data = mlb_df, family='binomial')
summary(Best_Model)
##
## Call:
## glm(formula = Record ~ RAR, family = "binomial", data = mlb_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8685 -0.6581 0.1051 0.6754 2.7506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.065461 0.506281 -11.98 <2e-16 ***
## RAR 0.044595 0.003612 12.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 831.45 on 599 degrees of freedom
## Residual deviance: 513.97 on 598 degrees of freedom
## AIC: 517.97
##
## Number of Fisher Scoring iterations: 5
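The goodness of fit, measured as one minus the ratio of residual deviance to null deviance (McFadden’s pseudo R-squared), is: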
1 - (513.97/831.45)
## [1] 0.381839
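To make the fitted coefficients easier to read, they can be converted into an odds ratio and a break-even RAR. The snippet below uses only the two estimates from the summary above; `prob_winning` is a small helper introduced for this illustration, and the RAR values passed to it are arbitrary examples.

```r
# Estimates copied from the Record ~ RAR summary
intercept <- -6.065461
slope     <- 0.044595

# Multiplicative change in the odds of a winning season per extra run of RAR
odds_ratio <- exp(slope)

# RAR at which the log-odds are zero, i.e. the predicted probability is 50%
break_even_rar <- -intercept / slope

# Predicted probability of a winning season at a given RAR
prob_winning <- function(rar) plogis(intercept + slope * rar)

round(c(odds_ratio = odds_ratio, break_even_rar = break_even_rar), 3)
round(prob_winning(c(100, 136, 200)), 3)
```

The odds ratio is about 1.046, and a team needs an RAR of roughly 136 before the model gives it an even chance of a winning season.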
Using the formula above and the 2018 data, I calculated the probability of each team having a winning record in 2018. In the code blocks and visualizations below, I show that, using a 50% probability cutoff, the model correctly predicted a winning or losing season for 24 out of 30 teams.
RAR = mlb_df_2018_best$RAR
y = -6.065461 + (0.044595 * RAR)
probabilities = exp(y) / (1+exp(y))
mlb_df_2018_best$prob_winning_record <- probabilities
This code block and visualization show that when the model predicted more than a 50% chance of winning, 12 teams had a winning season.
model_right <- mlb_df_2018_best %>%
select(Team, W, prob_winning_record, RAR) %>%
filter(W >= 81 & prob_winning_record >= .5) %>%
arrange(desc(RAR), desc(W))
ggplot(model_right, aes(x=prob_winning_record, y=W)) +
geom_point(aes(color=Team, size= W))
This code block and visualization show that when the model predicted less than a 50% chance of winning, these 12 teams had a losing season.
model_right2 <- mlb_df_2018_best %>%
select(Team, W, prob_winning_record, RAR) %>%
filter(W < 81 & prob_winning_record < .5) %>%
arrange(desc(RAR), desc(W))
ggplot(model_right2, aes(x=prob_winning_record, y=W)) +
geom_point(aes(color=Team, size= W))
In the next block, the visualization shows where the model predicted winning seasons but two teams had losing seasons.
model_wrong <- mlb_df_2018_best %>%
select(Team, W, prob_winning_record, RAR) %>%
filter(W < 81 & prob_winning_record >= .5) %>%
arrange(desc(RAR), desc(W))
ggplot(model_wrong, aes(x=prob_winning_record, y=W)) +
geom_point(aes(color=Team, size= W))
Finally, in the next block, the visualization shows where the model predicted losing seasons but these four teams had winning seasons.
model_wrong2 <- mlb_df_2018_best %>%
select(Team, W, prob_winning_record, RAR) %>%
filter(W >= 81 & prob_winning_record < .5) %>%
arrange(desc(RAR), desc(W))
ggplot(model_wrong2, aes(x=prob_winning_record, y=W)) +
geom_point(aes(color=Team, size= W))
The logistic regression model has proven to be a fairly accurate predictor of whether a team would have a winning season. It correctly predicted the results for 24 of the 30 teams. Again, I can reject the null hypothesis that pitching cannot predict wins.
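The four groups plotted above amount to a confusion matrix. The sketch below assembles it from the counts reported in the text (12 and 12 correct, 2 and 4 incorrect) and derives accuracy, sensitivity, and specificity; the object names are introduced here for illustration.

```r
# Rows are the model's prediction, columns are the actual 2018 outcome
confusion <- matrix(
  c(12,  2,
     4, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("winning", "losing"),
                  actual    = c("winning", "losing"))
)

# Share of all 30 teams classified correctly
overall_accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of actual winning teams the model flagged as winning
sensitivity <- confusion["winning", "winning"] / sum(confusion[, "winning"])

# Share of actual losing teams the model flagged as losing
specificity <- confusion["losing", "losing"] / sum(confusion[, "losing"])

confusion
round(c(accuracy = overall_accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

Accuracy is 24/30, or 80%. The model identified 12 of the 16 winning teams (75%) and 12 of the 14 losing teams (about 86%), so it was slightly better at recognizing losing seasons.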
As stated earlier, the NL and AL operate under different rules. The AL allows a designated hitter (DH) to bat for a pitcher which leads to more offense in the AL. In the code blocks below, I separated out the AL from the NL and applied the best model to see if there was a difference. Separating out the American League gives us 312 observations over 24 variables.
mlb_AL <- mlb_df[mlb_df$League == 'American League',]
glimpse(mlb_AL)
## Observations: 312
## Variables: 24
## $ Team <fct> Yankees, Red Sox, Blue Jays, Devil Rays, Expos, Indian...
## $ W <int> 114, 92, 88, 63, 65, 89, 85, 65, 79, 70, 74, 65, 76, 8...
## $ ERA <dbl> 3.82, 4.19, 4.29, 4.35, 4.39, 4.45, 4.49, 4.64, 4.74, ...
## $ H <int> 1357, 1406, 1443, 1425, 1448, 1552, 1481, 1463, 1505, ...
## $ R <int> 656, 729, 768, 751, 783, 779, 783, 812, 785, 818, 866,...
## $ ER <int> 619, 668, 698, 698, 696, 722, 720, 738, 754, 765, 769,...
## $ HR <int> 156, 168, 169, 171, 156, 171, 164, 188, 169, 180, 179,...
## $ BB <int> 466, 504, 587, 643, 533, 563, 630, 489, 535, 457, 529,...
## $ SO <int> 1080, 1025, 1154, 1008, 1017, 1037, 1091, 908, 1065, 9...
## $ KsPer9 <dbl> 6.67, 6.42, 7.09, 6.29, 6.41, 6.39, 6.80, 5.71, 6.70, ...
## $ BBper9 <dbl> 2.88, 3.16, 3.61, 4.01, 3.36, 3.47, 3.93, 3.07, 3.36, ...
## $ KperBB <dbl> 2.32, 2.03, 1.97, 1.57, 1.91, 1.84, 1.73, 1.86, 1.99, ...
## $ Hper9 <dbl> 8.38, 8.81, 8.86, 8.89, 9.13, 9.57, 9.23, 9.19, 9.46, ...
## $ HRper9 <dbl> 0.96, 1.05, 1.04, 1.07, 0.98, 1.05, 1.02, 1.18, 1.06, ...
## $ AVG <dbl> 0.244, 0.252, 0.252, 0.257, 0.258, 0.269, 0.262, 0.260...
## $ WHIP <dbl> 1.25, 1.33, 1.39, 1.43, 1.39, 1.45, 1.46, 1.36, 1.43, ...
## $ BABIP <dbl> 0.277, 0.282, 0.290, 0.287, 0.291, 0.303, 0.300, 0.282...
## $ LOB_pct <dbl> 0.74, 0.71, 0.71, 0.73, 0.69, 0.72, 0.71, 0.68, 0.70, ...
## $ FIP <dbl> 4.15, 4.40, 4.36, 4.79, 4.38, 4.54, 4.51, 4.69, 4.40, ...
## $ RAR <dbl> 212.5, 156.7, 181.2, 109.8, 119.9, 160.6, 161.4, 87.4,...
## $ K_pct <dbl> 0.177, 0.167, 0.182, 0.161, 0.164, 0.162, 0.173, 0.148...
## $ BB_pct <dbl> 0.08, 0.08, 0.09, 0.10, 0.09, 0.09, 0.10, 0.08, 0.09, ...
## $ Record <fct> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, ...
## $ League <fct> American League, American League, American League, Ame...
AL_full <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA, data = mlb_AL)
summary(AL_full)
##
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 +
## ERA, data = mlb_AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2905 -5.3591 0.3234 5.0264 18.0251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109.00120 43.04478 -2.532 0.011837 *
## RAR 0.14242 0.01075 13.247 < 2e-16 ***
## LOB_pct 264.74632 53.92057 4.910 1.49e-06 ***
## H 0.12384 0.03842 3.224 0.001404 **
## Hper9 -29.80913 6.63934 -4.490 1.01e-05 ***
## BBper9 -10.39012 2.36379 -4.396 1.53e-05 ***
## FIP 13.93472 3.07375 4.533 8.36e-06 ***
## HRper9 -33.87673 7.67554 -4.414 1.42e-05 ***
## ERA 18.73142 5.48324 3.416 0.000722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.332 on 303 degrees of freedom
## Multiple R-squared: 0.6343, Adjusted R-squared: 0.6246
## F-statistic: 65.68 on 8 and 303 DF, p-value: < 2.2e-16
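The AL model keeps the same eight predictors as the one created earlier under Multiple regression. All of them are significant at the 0.05 level in the AL fit, although the coefficients estimated on AL data alone (summary above) differ from the earlier ones. The earlier equation is:
\[ \hat{y} = -64.342203 + 0.132246*RAR + 195.507407*LOB\_pct + 0.104260*H - 23.019203*Hper9 - 8.854223*BBper9 + 14.223588*FIP - 21.098982*HRper9 + 7.880635*ERA \]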
# Teams labelled as National League; every other team falls through to American League.
# Note: 'Reds' is not in this list, so a Reds row would be labelled American League.
nl_teams <- c('Braves', 'Marlins', 'Mets', 'Phillies', 'Nationals',
              'Cubs', 'Brewers', 'Pirates', 'Cardinals',
              'Diamondbacks', 'Rockies', 'Dodgers', 'Padres', 'Giants')
mlb_df_2018_best$League <- ifelse(mlb_df_2018_best$Team %in% nl_teams,
                                  'National League', 'American League')
AL_2018 <- mlb_df_2018_best %>%
filter(League == "American League")
AL_2018$Prediction <- apply(AL_2018[3:10],1,Best_Model_Predict_Wins)
AL_2018$Win_Diff <- AL_2018$W - AL_2018$Prediction
ggplot(AL_2018 , aes(x=Prediction, y=W)) +
geom_point(aes(color=Team, size= abs(Win_Diff)))
The National League is a little different, however. There are 288 observations of 24 variables for the NL over the 1998 to 2017 seasons. In the full NL model, home runs per 9 innings (HRper9, p = 0.124) and ERA (p = 0.452) have high p-values. Using backward elimination, I removed these two variables and re-ran the model.
The final best NL model is:
\[ \hat{y} = -37.17053 + 0.11888*RAR + 156.87046*LOB\_pct + 0.10620*H - 20.34215*Hper9 - 5.94192*BBper9 + 8.84372*FIP \]
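The backward elimination step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `toy`, its columns, and the helper `backward_eliminate` are all hypothetical and not part of the project, and the helper drops one predictor at a time, whereas the two NL variables were removed together.

```r
# Synthetic stand-in data (hypothetical): x1 and x2 drive y,
# noise1 and noise2 are unrelated to y by construction.
set.seed(42)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
toy$y <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 2)

# Hypothetical helper: repeatedly drop the predictor with the largest
# p-value until every remaining predictor has p < alpha.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    coefs <- summary(fit)$coefficients
    pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
    if (nrow(pvals) == 0 || max(pvals) < alpha) break
    worst <- rownames(pvals)[which.max(pvals)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full <- lm(y ~ x1 + x2 + noise1 + noise2, data = toy)
reduced <- backward_eliminate(full)
formula(reduced)  # shows which predictors survived
```

Because the noise columns have no real relationship with `y`, they will usually be the ones removed, but any single random sample can keep one by chance.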
mlb_NL <- mlb_df[mlb_df$League == 'National League',]
glimpse(mlb_NL)
## Observations: 288
## Variables: 24
## $ Team <fct> Braves, Astros, Padres, Mets, Dodgers, Pirates, Giants...
## $ W <int> 106, 102, 98, 88, 83, 69, 89, 83, 77, 90, 74, 75, 77, ...
## $ ERA <dbl> 3.25, 3.50, 3.63, 3.77, 3.81, 3.91, 4.19, 4.32, 4.44, ...
## $ H <int> 1291, 1435, 1384, 1381, 1332, 1433, 1457, 1513, 1400, ...
## $ R <int> 581, 620, 635, 645, 678, 718, 739, 782, 760, 792, 812,...
## $ ER <int> 520, 572, 587, 611, 612, 629, 687, 706, 711, 738, 746,...
## $ HR <int> 117, 147, 139, 152, 135, 147, 171, 151, 170, 180, 188,...
## $ BB <int> 467, 465, 501, 532, 587, 530, 562, 558, 573, 575, 550,...
## $ SO <int> 1232, 1187, 1217, 1129, 1178, 1112, 1089, 972, 1098, 1...
## $ KsPer9 <dbl> 7.71, 7.26, 7.53, 6.97, 7.33, 6.91, 6.64, 5.95, 6.86, ...
## $ BBper9 <dbl> 2.92, 2.84, 3.10, 3.28, 3.65, 3.29, 3.42, 3.42, 3.58, ...
## $ KperBB <dbl> 2.64, 2.55, 2.43, 2.12, 2.01, 2.10, 1.94, 1.74, 1.92, ...
## $ Hper9 <dbl> 8.08, 8.78, 8.56, 8.52, 8.28, 8.90, 8.88, 9.27, 8.74, ...
## $ HRper9 <dbl> 0.73, 0.90, 0.86, 0.94, 0.84, 0.91, 1.04, 0.92, 1.06, ...
## $ AVG <dbl> 0.236, 0.251, 0.247, 0.248, 0.241, 0.253, 0.253, 0.262...
## $ WHIP <dbl> 1.22, 1.29, 1.30, 1.31, 1.33, 1.35, 1.37, 1.41, 1.37, ...
## $ BABIP <dbl> 0.285, 0.295, 0.293, 0.287, 0.284, 0.292, 0.287, 0.292...
## $ LOB_pct <dbl> 0.74, 0.76, 0.75, 0.76, 0.73, 0.71, 0.73, 0.70, 0.71, ...
## $ FIP <dbl> 3.53, 3.86, 3.84, 4.18, 4.06, 4.08, 4.42, 4.40, 4.45, ...
## $ RAR <dbl> 255.7, 178.7, 202.1, 143.8, 171.8, 172.4, 120.3, 123.5...
## $ K_pct <dbl> 0.207, 0.191, 0.198, 0.183, 0.191, 0.179, 0.171, 0.152...
## $ BB_pct <dbl> 0.08, 0.08, 0.08, 0.09, 0.10, 0.09, 0.09, 0.09, 0.09, ...
## $ Record <fct> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, ...
## $ League <fct> National League, National League, National League, Nat...
NL_full <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA, data = mlb_NL)
summary(NL_full)
##
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 +
## ERA, data = mlb_NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6277 -3.8531 0.6236 4.0614 18.4328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.474387 37.669850 -0.464 0.643094
## RAR 0.117112 0.009975 11.740 < 2e-16 ***
## LOB_pct 130.530725 46.823649 2.788 0.005673 **
## H 0.086080 0.032716 2.631 0.008983 **
## Hper9 -17.124652 5.443022 -3.146 0.001833 **
## BBper9 -7.593666 1.948334 -3.898 0.000122 ***
## FIP 15.873808 2.896662 5.480 9.51e-08 ***
## HRper9 -9.772962 6.330072 -1.544 0.123747
## ERA -3.429992 4.553977 -0.753 0.451973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.193 on 279 degrees of freedom
## Multiple R-squared: 0.6931, Adjusted R-squared: 0.6843
## F-statistic: 78.78 on 8 and 279 DF, p-value: < 2.2e-16
NL_Best <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP, data = mlb_NL)
summary(NL_Best)
##
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP, data = mlb_NL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8018 -3.8123 0.3152 4.2274 19.8166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.17053 27.56025 -1.349 0.17852
## RAR 0.11888 0.01008 11.797 < 2e-16 ***
## LOB_pct 156.87046 27.10001 5.789 1.89e-08 ***
## H 0.10620 0.03240 3.278 0.00118 **
## Hper9 -20.34215 5.29179 -3.844 0.00015 ***
## BBper9 -5.94192 1.30916 -4.539 8.40e-06 ***
## FIP 8.84372 1.70426 5.189 4.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.271 on 281 degrees of freedom
## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6763
## F-statistic: 101 on 6 and 281 DF, p-value: < 2.2e-16
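Dropping HRper9 and ERA lowers the adjusted R-squared only slightly (0.6843 to 0.6763). One way to check formally that a reduced model is not significantly worse than the full one is a partial F-test with `anova()`. The sketch below uses synthetic data with hypothetical column names; with the objects from this project the equivalent call would presumably be `anova(NL_Best, NL_full)`.

```r
# Synthetic stand-in data (hypothetical): wins depend on a and b only.
set.seed(1)
n <- 150
toy <- data.frame(a = rnorm(n), b = rnorm(n),
                  extra1 = rnorm(n), extra2 = rnorm(n))
toy$wins <- 81 + 6 * toy$a - 4 * toy$b + rnorm(n, sd = 3)

full_fit    <- lm(wins ~ a + b + extra1 + extra2, data = toy)
reduced_fit <- lm(wins ~ a + b, data = toy)

# Partial F-test for nested models. H0: the coefficients of the
# dropped predictors (extra1, extra2) are all zero.
comparison <- anova(reduced_fit, full_fit)
comparison

# Adjusted R-squared of both fits, side by side
c(full    = summary(full_fit)$adj.r.squared,
  reduced = summary(reduced_fit)$adj.r.squared)
```

A large p-value in the second row of `comparison` would suggest the extra predictors add little, which supports keeping the smaller model.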
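# Predictor order expected by the function below (matches columns 3:8 of NL_2018)
stats <- c("RAR", "LOB_pct", "H", "Hper9", "BBper9", "FIP")
# Predict wins from one team's row of predictors, using the NL_Best coefficients
Best_NL_Model_Predict_Wins <- function (stats) {
  Intercept <- -37.17053
  Slope <- c(0.11888, 156.87046, 0.10620, -20.34215, -5.94192, 8.84372)
  coeff <- sum(Slope * stats)  # dot product of coefficients and predictor values
  Wins <- Intercept + coeff
  return(Wins)
}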
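Copying coefficients by hand works, but it has to be redone whenever the model is refit. As an alternative sketch, `predict()` applied to the fitted `lm` object gives the same numbers as the manual intercept-plus-dot-product calculation. The data and column names below (`train`, `new_teams`, `p1`, `p2`) are hypothetical stand-ins, not the project's data.

```r
# Synthetic training data (hypothetical)
set.seed(7)
n <- 100
train <- data.frame(p1 = rnorm(n), p2 = rnorm(n))
train$wins <- 81 + 4 * train$p1 - 2 * train$p2 + rnorm(n)
fit <- lm(wins ~ p1 + p2, data = train)

# Two hypothetical new teams to score
new_teams <- data.frame(p1 = c(0.5, -1.2), p2 = c(1.0, 0.3))

# Option 1: let predict() apply the fitted coefficients
pred_auto <- unname(predict(fit, newdata = new_teams))

# Option 2: intercept plus dot product, as in the hand-written function
b <- coef(fit)
X <- as.matrix(new_teams[, c("p1", "p2")])
pred_manual <- as.numeric(b["(Intercept)"] + X %*% b[c("p1", "p2")])

# The two approaches agree up to floating-point tolerance
all.equal(pred_auto, pred_manual)
```

With the project's objects, the analogous call would presumably be `predict(NL_Best, newdata = NL_2018)`, provided the column names in `NL_2018` match those used to fit the model.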
NL_2018 <- mlb_df_2018_best %>%
filter(League == "National League")
NL_2018$Prediction <- apply(NL_2018[3:8],1,Best_NL_Model_Predict_Wins)
NL_2018$Win_Diff <- NL_2018$W - NL_2018$Prediction
NL_2018
## Team W RAR LOB_pct H Hper9 BBper9 FIP HRper9 ERA
## 1 Dodgers 92 189.7 0.762 1279 7.80 2.57 3.60 1.09 3.40
## 2 Cubs 95 121.7 0.762 1319 8.04 3.79 4.14 0.96 3.65
## 3 Diamondbacks 82 153.0 0.757 1313 8.08 3.21 3.91 1.07 3.73
## 4 Brewers 96 155.5 0.744 1259 7.76 3.41 4.01 1.07 3.73
## 5 Braves 90 140.7 0.741 1236 7.64 3.92 3.99 0.95 3.75
## 6 Cardinals 88 138.8 0.730 1354 8.37 3.67 3.97 0.89 3.85
## 7 Giants 73 116.7 0.723 1387 8.54 3.23 3.98 0.96 3.95
## 8 Pirates 82 127.9 0.736 1380 8.66 3.12 4.06 1.09 4.00
## 9 Nationals 82 131.2 0.748 1320 8.22 3.03 4.15 1.23 4.04
## 10 Mets 77 144.9 0.730 1364 8.40 2.98 3.97 1.14 4.07
## 11 Phillies 80 182.3 0.710 1366 8.50 3.11 3.83 1.06 4.15
## 12 Rockies 91 178.0 0.713 1377 8.53 3.25 4.06 1.14 4.33
## 13 Padres 66 119.4 0.711 1430 8.83 3.21 4.10 1.14 4.41
## 14 Marlins 63 31.3 0.699 1388 8.66 3.78 4.57 1.20 4.76
## Prediction Win_Diff prob_winning_record League
## 1 98.64398 -6.6439841 0.916384758 National League
## 2 87.45249 7.5475055 0.345642349 National League
## 3 90.35046 -8.3504582 0.680826791 National League
## 4 89.07902 6.9209818 0.704550378 National League
## 5 83.64019 6.3598127 0.552074357 National League
## 6 80.67918 7.3208237 0.531041277 National League
## 7 79.70315 -6.7031516 0.297083576 National League
## 8 81.25057 0.7494256 0.410533465 National League
## 9 87.43458 -5.4345775 0.446555694 National League
## 10 85.95600 -8.9560046 0.597811475 National League
## 11 83.43232 -3.4323220 0.887375355 National League
## 12 85.15187 5.8481283 0.866741393 National League
## 13 77.98914 -11.9891434 0.322825296 National League
## 14 65.40079 -2.4007893 0.009288581 National League
ggplot(NL_2018 , aes(x=Prediction, y=W)) +
geom_point(aes(color=Team, size= abs(Win_Diff)))
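The scatter plot shows the misses team by team; a couple of summary numbers make the overall accuracy easier to judge. The sketch below recomputes the mean absolute error, root mean squared error, and average bias from the `Win_Diff` column printed above for the 14 NL teams. The vector is typed in by hand from that output; with the data frame available, `NL_2018$Win_Diff` could be used directly.

```r
# Win_Diff values (actual wins minus predicted wins), copied from the
# NL_2018 output above in row order, Dodgers through Marlins.
win_diff <- c(-6.6439841,  7.5475055, -8.3504582,  6.9209818,  6.3598127,
               7.3208237, -6.7031516,  0.7494256, -5.4345775, -8.9560046,
              -3.4323220,  5.8481283, -11.9891434, -2.4007893)

mae  <- mean(abs(win_diff))     # typical size of a miss, in wins
rmse <- sqrt(mean(win_diff^2))  # penalises large misses more heavily
bias <- mean(win_diff)          # negative means wins were over-predicted on average

round(c(MAE = mae, RMSE = rmse, bias = bias), 2)
```

These figures can be set against the residual standard error of 6.271 reported for NL_Best to see whether the 2018 predictions are about as accurate as the fit on the 1998 to 2017 data.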
The simple linear regression model, the multiple regression model, the logistic model, and the AL/NL models all show that pitching is a good predictor of wins and winning seasons for an MLB team; the league-specific models above explain roughly 63% (AL) and 68% (NL) of the variance in wins. A good follow-up project would be to do the same with hitting and fielding statistics, and then combine all three to predict wins.