Pitching as a Predictor of Wins

1. Introduction

As a huge fan of Major League Baseball, I have many reasons to love the sport. In recent years, my interest in the game has grown due to its application of Analytics and my study of Data Science. Heavily influenced by the work of Bill James and others, MLB pioneered the use of statistics to objectively evaluate players and create new statistics which provide better insight into the value of players and what really contributes to winnning games. MLB’s use of Analytics has even penetrated the popular culture. The book and film, “Moneyball”, made using Analytics look heroic and cool. Even if someone is not a fan of the sport, anyone who has a strong intrest in seeing the principles of Data Science in action should take a look at how MLB has used Analytics to improve the sport.

In baseball, there’s a heuristic that pitching wins games and matters more than hitting and scoring. In a blog article titled, Hitting or Pitching,Which wins more games, www.wmbriggs.com/post/120/, by Tim Murray and William Briggs (April 28, 2008), they explore which is a better predictor of wins, hitting or pitching. For this project, I took a slightly different appoach. I wanted to see whether pitching alone can predict wins.

The Hypothesis test is as follows:

H0: Pitching alone cannot predict wins by a MLB team HA: Pitching alone can predict wins by a MLB team

2. Data

2.1 Data Collection

The data was collected from the website, fangraphs.com. Fangraphs is operated by FanGraphs, Inc. Fangraphs compiles historical statistical data for the entire history of Major League Baseball (“MLB”). In addition, it creates and records advanced baseball metrics outside of the established statistics. FanGraphs is well established as a chronicler and compiler of baseball statistics. It has parternership deals with ESPN and SB Nation. Fangraphs has an interface for members (of which I am one) where you are able to download individual and team statistics covering a single season or multiple years. For this project, I downloaded data for all 30 teams from 1998 to 2017. I purposely chose the year 1998 as that was the last year that MLB expanded its number of teams.On the site, you can generate a custom report for pitching statistics which is how I collected the data. I combined the downloads into one csv file which is located in the DATA folder. The file is titled: DATA/MLB_PITCHING_STATS_1998_to_2017.csv

mlb_df <- read.csv('DATA/MLB_PITCHING_STATS_1998_to_2017.csv')

2.2 Cases

In the mlb_df dataset, there are 600 cases or observations (30 teams x 20 seasons). This is observational data based on recorded statistics from these teams. Each observation represents the win total and key pitching statistics for 30 MLB teams from 1998 to 2017.

dim(mlb_df)

## [1] 600  22

glimpse(mlb_df)

## Observations: 600
## Variables: 22
## $ Team    <fct> Braves, Astros, Padres, Mets, Dodgers, Yankees, Pirate...
## $ W       <int> 106, 102, 98, 88, 83, 114, 69, 89, 92, 88, 83, 63, 65,...
## $ ERA     <dbl> 3.25, 3.50, 3.63, 3.77, 3.81, 3.82, 3.91, 4.19, 4.19, ...
## $ H       <int> 1291, 1435, 1384, 1381, 1332, 1357, 1433, 1457, 1406, ...
## $ R       <int> 581, 620, 635, 645, 678, 656, 718, 739, 729, 768, 782,...
## $ ER      <int> 520, 572, 587, 611, 612, 619, 629, 687, 668, 698, 706,...
## $ HR      <int> 117, 147, 139, 152, 135, 156, 147, 171, 168, 169, 151,...
## $ BB      <int> 467, 465, 501, 532, 587, 466, 530, 562, 504, 587, 558,...
## $ SO      <int> 1232, 1187, 1217, 1129, 1178, 1080, 1112, 1089, 1025, ...
## $ KsPer9  <dbl> 7.71, 7.26, 7.53, 6.97, 7.33, 6.67, 6.91, 6.64, 6.42, ...
## $ BBper9  <dbl> 2.92, 2.84, 3.10, 3.28, 3.65, 2.88, 3.29, 3.42, 3.16, ...
## $ KperBB  <dbl> 2.64, 2.55, 2.43, 2.12, 2.01, 2.32, 2.10, 1.94, 2.03, ...
## $ Hper9   <dbl> 8.08, 8.78, 8.56, 8.52, 8.28, 8.38, 8.90, 8.88, 8.81, ...
## $ HRper9  <dbl> 0.73, 0.90, 0.86, 0.94, 0.84, 0.96, 0.91, 1.04, 1.05, ...
## $ AVG     <dbl> 0.236, 0.251, 0.247, 0.248, 0.241, 0.244, 0.253, 0.253...
## $ WHIP    <dbl> 1.22, 1.29, 1.30, 1.31, 1.33, 1.25, 1.35, 1.37, 1.33, ...
## $ BABIP   <dbl> 0.285, 0.295, 0.293, 0.287, 0.284, 0.277, 0.292, 0.287...
## $ LOB_pct <dbl> 0.74, 0.76, 0.75, 0.76, 0.73, 0.74, 0.71, 0.73, 0.71, ...
## $ FIP     <dbl> 3.53, 3.86, 3.84, 4.18, 4.06, 4.15, 4.08, 4.42, 4.40, ...
## $ RAR     <dbl> 255.7, 178.7, 202.1, 143.8, 171.8, 212.5, 172.4, 120.3...
## $ K_pct   <dbl> 0.207, 0.191, 0.198, 0.183, 0.191, 0.177, 0.179, 0.171...
## $ BB_pct  <dbl> 0.08, 0.08, 0.08, 0.09, 0.10, 0.08, 0.09, 0.09, 0.08, ...

2.3 Variables

The table below provides descriptions for all 22 variables. In the table below, I briefly describe the variables. For a more in-depth description of these variables, please see FanGraphs.com. At this link, you will find a complete glossary for all statistics.

Variables	Description
Team	Name of the MLB Team
W	Number of Wins
ERA	Team Earned Run Average
H	Number of Hits Allowed by Team
R	Number of Runs Allowed by Team
ER	Earned Runs Allowed by Team
HR	Home Runs Allowed by Team
BB	Walks Allowed by Team
SO	Strikeouts by Team
KsPer9	Number of Strikeouts per 9 innings
BBper9	Number of Walks per 9 innings
KperBB	Strikeouts per Walks
Hper9	Hits allowed per 9 innings
HRper9	Home Runs per 9 innings
AVG	Batting average allowed by Team
WHIP	Walks plus Hits per innings pitched
BABIP	The rate at which the pitcher allows a hit when the ball is put in play, calculated as (H-HR)/(AB-K-HR+SF)
LOB_pct	Percentage of pitcher’s own base runners that they strand over the course of a season Not equal to the LOB column in the box score
FIP	An estimate of a pitcher’s ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play
RAR	Runs above replacement level The number of runs a player has been worth to his team compared to a freely available player based on FIP
K_pct	Frequency with which the pitcher has struck out a batter, calculated as strikeouts divided by total batters faced
BB_pct	Frequency with which the pitcher has issued a walk, calculated as walks divided by total batters faced

2.4 The Response Variable and Creating New Categorical Varibles

The Response Variable - Wins

For this project, Wins is the response variable. Looking at the distribution for wins, you can see that Wins are normally distributed. The distribution is unimodal with the highest frequency in the middle.

ggplot(mlb_df, aes(x=W)) + geom_histogram(bins=20,
                 col="blue", 
                 fill="orange") + labs(title="Histogram for Wins", align="center") +
                                  labs(x="Wins", y="Count")

#### New Categorical Variable - Record

I created a new variable called “Record” which catgorizes the win totals by two levels, at or more than 50% and below 50%.

Ator above 50% in wins gets coded as “1”.
Below 50% in wins gets coded as “0”.

mlb_df[mlb_df$W >= 81, "Record"] <- as.character("1")
mlb_df[mlb_df$W < 81, "Record"] <- as.character("0")
mlb_df$Record <- as.factor(mlb_df$Record)

New Categorical Variable - League Category

MLB is an unusual sport in that it consists of two different leagues, the American and National leagues (“AL” and “NL”), that play under different rules. In the National league, the pitcher, typically the worst hitter, has to bat. In the American league, teams are allowed a Designated Hitter.

In short, one league has the potential for more offense, AL, than the other, and this impacts pitching as NL pitchers typically have lower ERAs. I created a new category variable called “League”, and assigned this variable to each team.

mlb_df$League <- case_when(mlb_df$Team == 'Braves'~'National League',
 mlb_df$Team == 'Marlins'~'National League',
 mlb_df$Team == 'Mets'~'National League',
 mlb_df$Team == 'Phillies'~'National League',
 mlb_df$Team == 'Nationals'~'National League',
 mlb_df$Team == 'Cubs'~'National League',
 mlb_df$Team == 'Reds'~'National League',
 mlb_df$Team == 'Brewers'~'National League',
 mlb_df$Team == 'Pirates'~'National League',
 mlb_df$Team == 'Cardinals'~'National League',
 mlb_df$Team == 'iamondbacks'~'National League',
 mlb_df$Team == 'Rockies'~'National League',
 mlb_df$Team == 'Dodgers'~'National League',
 mlb_df$Team == 'Padres'~'National League',
 mlb_df$Team == 'Giants'~'National League',
 TRUE ~ as.character('American League'))

During the past 20 years, one team, the Houston Astros changed leagues from the NL to the AL, so I have to account for this switch.

rows <- as.numeric(row.names.data.frame(mlb_df[mlb_df$Team == 'Astros',])) 
rows <- rows[rows < 480]
mlb_df$League[ rows] <- "National League"
mlb_df$League <- as.factor(mlb_df$League)

3. Simple Linear Regression - ERA vs. FIP

Predictor Variables - Traditional vs.Sabremetrics

Traditional Predictor: ERA

In MLB, there are traditional pitching statistics such as ERA which was used to measure a pitcher’s effectiveness. ERA meaures the Earned Runs that a pitcher yields. Tradtionally, ERA was the primary statistic used to evaluate pitching. Since the purpose of this model is to determine if pitching can predict wins, the expectation is that ERA should be the best predictor from a traditional perspective.

Below, we can see that ERA is unimodal, normally distributed.

describe(mlb_df$ERA)

## mlb_df$ERA 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      600        0      205        1     4.28   0.6167     3.43     3.58 
##      .25      .50      .75      .90      .95 
##     3.89     4.24     4.66     5.01     5.20 
## 
## lowest : 2.94 3.02 3.03 3.13 3.15, highest: 5.56 5.67 5.69 5.71 6.03

ggplot(mlb_df, aes(x=ERA)) + geom_histogram(bins=20,
                 col="white", 
                 fill="red") + labs(title="Histogram for ERA") +
                                  labs(x="Earned Run Average", y="Count")

Sabremetrics Predictor: Fielding Independent Pitching (FIP)

Sabremetrics is the new school of baseball metrics. It applies statisticals in a more nuanced way to evaluate effectiveness. Fielding Independent Pitching (“FIP”) is one such measure. It estimates the effectiveness of a pitcher independent of the performance of their defense. It observes outcomes based solely on a pitcher’s ability to prevent runs.

Below we can see that FIP is also normally distributed.

describe(mlb_df$FIP)

## mlb_df$FIP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      600        0      179        1    4.279   0.4915    3.580    3.729 
##      .25      .50      .75      .90      .95 
##    3.970    4.280    4.570    4.841    5.010 
## 
## lowest : 3.18 3.24 3.27 3.30 3.33, highest: 5.29 5.31 5.46 5.52 5.54

ggplot(mlb_df, aes(x=FIP)) + geom_histogram(bins=15,
                 col="blue", 
                 fill="orange") + labs(title="Histogram for Fielding Independent Pitching") +
                                    labs(x="FIP", y="Count")

Which pitching variable, ERA or FIP, is a better predictor of Wins? In this section, I used Simple Linear Regression to assess which variable is a better predictor for Wins.

ggplot(mlb_df, aes(x=FIP, y=W))+
    geom_point(aes(col=League))

ggplot(mlb_df, aes(x=ERA, y=W))+
    geom_point(aes(col=League))

Visually, both variables display somewhat of a negative linear relationship with wins which make intuitive sense. The higher each stat, the less wins. ERA seems to have a higher correlation to Wins than FIP. In the next section, I test a simple linear relationship between the two.

cor(mlb_df$W, mlb_df$FIP)

## [1] -0.5190366

cor(mlb_df$W, mlb_df$ERA)

## [1] -0.6256832

Simple Linear Regression - FIP and Wins

p1 <- lm(W ~ FIP, data = mlb_df)
summary(p1)

## 
## Call:
## lm(formula = W ~ FIP, data = mlb_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.845  -6.903  -0.366   7.427  32.828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 140.2366     4.0117   34.96   <2e-16 ***
## FIP         -13.8506     0.9327  -14.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.852 on 598 degrees of freedom
## Multiple R-squared:  0.2694, Adjusted R-squared:  0.2682 
## F-statistic: 220.5 on 1 and 598 DF,  p-value: < 2.2e-16

\[ \hat{y} = 140.2366 -13.8506 * FIP \]

The Intercept for this model is 140.2366 which means that a team’s expected win total is 140 games when the slope is equal to zero which would mean that that team’s FIP is 0. This does make some sense since tt would mean that the only way an opposing team would score runs would be through errors or other defensive mis-plays.

The slope is -13.8506 which is the numerical relationship between FIP and Wins. For every FIP increases of 1, the expected value of the number wins goes down by -13.8506 with the reverse also being true for a 1 unit decrease of FIP. The standard error for the Intercept is 4.0117, and for the coefficient, it’s 0.9327

The R-sqaured score of 0.26494 may seem low. R-squared is the fraction by which the variance of the errors is less than the variance of the dependent variable. High variability of the dependent variable is to be expected in MLB.

Looking at the p-value for the FIP coefficient, we see that it is < 0.05 which makes it statistically significant. We cannot reject the null hypothesis.

Below, I test the model using stats from 2018. In that year, the Mets had a team FIP of 3.97, and inserting it into the equation, the Mets should have won 85 games. The Mets actually won 77 games in 2018.

Mets_FIP_Intercept = 140.2366
Mets_FIP_Slope = -13.8506
Mets_FIP = 3.97

Mets_FIP_Wins = Mets_FIP_Intercept + (Mets_FIP_Slope *Mets_FIP)
Mets_FIP_Wins

## [1] 85.24972

Diagnotics – FIP

Conditions are met for Linear Regression

Linearity. Satisfied. Normal QQ Plot shows linearity.
Nearly normal residuals. Satisfied. as residual plot does not show curvature and is scattered around zero.
Normal distribution of the residuals. Satisfied. Histogram shows this.
Independent observations. Satisfied. Observational data for 30 teams over 20 years.

plot(p1$residuals ~ mlb_df$FIP)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

ggplot(data=p1, aes(p1$residuals)) + 
  geom_histogram(breaks=seq(-40, 40, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
    labs(title="Histogram for FIP Residuals", x="FIP Residuals")

qqnorm(p1$residuals)
qqline(p1$residuals)

Simple Linear Regression - ERA and Wins

p2 <- lm(W ~ ERA, data = mlb_df)
summary(p2)

## 
## Call:
## lm(formula = W ~ ERA, data = mlb_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.5474  -5.8343  -0.3085   6.5031  26.9156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 137.8924     2.9255   47.13   <2e-16 ***
## ERA         -13.3005     0.6781  -19.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.992 on 598 degrees of freedom
## Multiple R-squared:  0.3915, Adjusted R-squared:  0.3905 
## F-statistic: 384.7 on 1 and 598 DF,  p-value: < 2.2e-16

\[ \hat{y} = 137.8924 -13.3005 * ERA \]

The Intercept for this model is 137.8924 which means that a team’s expected win total is 138 games when the slope is equal to zero which would mean that that team’s ERA is 0. Intuitively, this does make some sense since it would mean that the only way an opposing team would score runs would be through errors or other defensive mis-plays.

The slope is -13.30 which is the numerical relationship between ERA and Wins. For every 1 unit increase in ERA increase, the expected value of the number wins goes down by -13.30 with the reverse also being true for a 1 unit decrease of FIP. The standard error for the Intercept is 2.9255, and for the coefficient, it’s 0.06781

Looking at the p-value for the ERA coefficient, we see that it is < 0.05 which makes it statistically significant. We cannot reject the null hypothesis.

Below, I test the model using stats from 2018. In that year, the Mets had a team ERA of 4.07, and inserting it into the equation, the Mets should have won 84 games. The Mets actually won 77 games in 2018.

Mets_ERA_Intercept = 137.8924
Mets_ERA_Slope = -13.3005
Mets_ERA = 4.07

Mets_ERA_Wins = Mets_ERA_Intercept + (Mets_ERA_Slope *Mets_ERA)
Mets_ERA_Wins

## [1] 83.75937

Diagnotics – ERA

Conditions are met for Linear Regression

Linearity. Satisfied. Normal QQ Plot shows linearity.
Nearly normal residuals. Satisfied. as residual plot does not show curvature and is scattered around zero.
Normal distribution of the residuals. Satisfied. Histogram shows this.
Independent observations. Satisfied. Observational data for 30 teams over 20 years.

plot(p2$residuals ~ mlb_df$ERA)
abline(h = 0, lty = 3)

ggplot(data=p2, aes(p2$residuals)) + 
  geom_histogram(breaks=seq(-40, 40, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
    labs(title="Histogram for ERA Residuals", x="ERA Residuals")

qqnorm(p2$residuals)
qqline(p2$residuals)

Reflections:

Both statistics are good predictors of a team’s win total. In the next three code blocks, I created functions for each of the Simple Linear Regression models. Next, I downloaded 2018 data for testing purposes. I applied both functions that predict wins based on the team’s FIP or ERA. I then calculated the difference between actual wins minus predicted wins.

In the end, the average difference for each statistic were virtually the same.

ERAPredictWins <- function(ERA) {
        ERA_Intercept = 137.8924
        ERA_Slope = -13.3005
        ERA_Wins = ERA_Intercept + (ERA_Slope * ERA)
        return(ERA_Wins)
}

FIPPredictWins <- function(FIP) {
            FIP_Intercept = 140.2366
            FIP_Slope = -13.8506
            FIP_Wins = FIP_Intercept + (FIP_Slope * FIP)
            return(FIP_Wins)
}

mlb_df_2018 <- read.csv("DATA/MLB_ERA_FIP_2018.csv")
mlb_df_2018$FIP_Pred_W <- FIPPredictWins(mlb_df_2018$FIP)
mlb_df_2018$ERA_Pred_W <- ERAPredictWins(mlb_df_2018$ERA)
mlb_df_2018$ERA_Win_Diff <- (mlb_df_2018$W - mlb_df_2018$ERA_Pred_W )
mlb_df_2018$ERA_FIP_Diff <- (mlb_df_2018$W - mlb_df_2018$FIP_Pred_W )
mean(mlb_df_2018$ERA_FIP_Diff)

## [1] -1.690959

mean(mlb_df_2018$ERA_Win_Diff)

## [1] -1.62209

MULTIPLE REGRESSION - Pitching

Returning to the original purpose of this project and to test the hypothesis of the predictive value of pitching to wins, I built an multiple regression model for all of the pitching statistics.

pitch_full <- lm(W ~  ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 + KperBB + Hper9 + HRper9 + 
                          AVG + WHIP + BABIP + LOB_pct + FIP + RAR + K_pct + BB_pct, data = mlb_df)
summary(pitch_full)

## 
## Call:
## lm(formula = W ~ ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 + 
##     KperBB + Hper9 + HRper9 + AVG + WHIP + BABIP + LOB_pct + 
##     FIP + RAR + K_pct + BB_pct, data = mlb_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8122  -4.5678   0.1799   4.3503  18.3915 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.696e+01  7.110e+01  -0.379  0.70471    
## ERA          6.276e+01  7.174e+01   0.875  0.38197    
## H            2.057e-01  2.848e-01   0.722  0.47061    
## R           -6.229e-02  4.929e-02  -1.264  0.20686    
## ER          -3.186e-01  4.472e-01  -0.713  0.47641    
## HR           1.618e-01  5.948e-01   0.272  0.78572    
## BB           5.506e-01  4.398e-01   1.252  0.21102    
## SO          -1.827e-01  1.762e-01  -1.037  0.30039    
## KsPer9       3.994e+01  2.697e+01   1.481  0.13908    
## BBper9      -9.433e+01  7.001e+01  -1.347  0.17837    
## KperBB       3.674e+00  4.777e+00   0.769  0.44210    
## Hper9       -3.224e+01  4.554e+01  -0.708  0.47927    
## HRper9      -5.053e+01  9.347e+01  -0.541  0.58895    
## AVG          2.897e+02  7.056e+02   0.411  0.68155    
## WHIP        -1.570e+01  9.799e+01  -0.160  0.87281    
## BABIP       -4.006e+02  4.900e+02  -0.818  0.41393    
## LOB_pct      1.322e+02  7.703e+01   1.716  0.08670 .  
## FIP          1.362e+01  4.676e+00   2.914  0.00371 ** 
## RAR          1.326e-01  7.408e-03  17.896  < 2e-16 ***
## K_pct       -2.974e+02  4.137e+02  -0.719  0.47244    
## BB_pct       9.459e+01  9.840e+01   0.961  0.33683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.897 on 579 degrees of freedom
## Multiple R-squared:  0.6533, Adjusted R-squared:  0.6413 
## F-statistic: 54.55 on 20 and 579 DF,  p-value: < 2.2e-16

After several iterations where I removed variables to find the best model, I identified the following coefficients who p-values are statistically significant.

Variables	Description
RAR	Runs above replacement level The number of runs a player has been worth to his team compared to a freely available player based on FIP
LOB_pct	Percentage of pitcher’s own base runners that they strand over the course of a season Not equal to the LOB column in the box score
Hper9	Hits allowed per 9 innings
H	Number of Hits Allowed by Team
BBper9	Number of Walks per 9 innings
KperBB	Strikeouts per Walks
FIP	An estimate of a pitcher’s ERA based on strikeouts, walks/HBP, and home runs allowed, assuming league average results on balls in play
HRper9	Home Runs per 9 innings
ERA	Team Earned Run Average

pitch_full_best <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA  , data = mlb_df)
summary(pitch_full_best)

## 
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 + 
##     ERA, data = mlb_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9150  -4.5522   0.1612   4.4809  18.1029 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.342203  28.897722  -2.227   0.0264 *  
## RAR           0.132246   0.007255  18.228  < 2e-16 ***
## LOB_pct     195.507407  36.061038   5.422 8.62e-08 ***
## H             0.104260   0.025240   4.131 4.14e-05 ***
## Hper9       -23.019203   4.277447  -5.382 1.07e-07 ***
## BBper9       -8.854223   1.493380  -5.929 5.19e-09 ***
## FIP          14.223588   2.092105   6.799 2.59e-11 ***
## HRper9      -21.098982   4.983850  -4.233 2.67e-05 ***
## ERA           7.880635   3.573680   2.205   0.0278 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.871 on 591 degrees of freedom
## Multiple R-squared:  0.6488, Adjusted R-squared:  0.644 
## F-statistic: 136.5 on 8 and 591 DF,  p-value: < 2.2e-16

The final model is below:

\[ \hat{y} = -64.342203 + 0.132246*RAR + 195.507407*LOB_pct + 0.104260*H - 23.019203*Hper9 - 8.854223*BBper9 + 14.223588 * FIP - -21.098982*HRper9 + 7.880635*ERA \]

DIAGNOSTICS and ASSUMPTIONS

The residuals of the model are nearly normal is satisfied as seen by the histogram below
The variability of the residuals is nearly constant. Satisfied. Variability of the residuals.
The residuals are independent as has been previousl established.
Each variable is linearly related to the outcome as seen by the Normal Q-Q Plot below

qqnorm(pitch_full_best$residuals, main="Normal Q-Q Plot for Pitching Best Model ")
qqline(pitch_full_best$residuals)

plot(abs(pitch_full_best$residuals) ~ pitch_full_best$fitted.values, main="Variability of Residuals for Pitching Best Model")

ggplot(data=pitch_full_best, aes(x=pitch_full_best$residuals)) + 
  geom_histogram(breaks=seq(-35, 35, by = 5), bins=5, col="white", fill="blue", alpha = .2) +
    labs(title="Histogram for Best Model Residuals", x="Best Model Residuals")

Test on 2018 data

In the code block below, I loaded the pitching stats for the best model along with 2018 win totals. I created a function that predicts wins based on these top stats and compared them to the actual win totals.

stats <-  c("RAR", "LOB_pct", "H", "Hper9", "BBper9", "FIP", "HRper9", "ERA" )
Best_Model_Predict_Wins <- function (stats) {
            Intercept = -64.342203
            Slope = c(0.132246, 195.507407, 0.104260, -23.019203, -8.854223, 14.223588, -21.098982, 7.880635)
            coeff = sum((Slope * stats))
            Wins = Intercept + coeff
            return(Wins)
}

mlb_df_2018_best <- read.csv("Data/MLB_ERA_MODEL_BEST.csv")
mlb_df_2018_best$LOB_pct <- as.numeric(mlb_df_2018_best$LOB_pct)
mlb_df_2018_best$Prediction <- apply(mlb_df_2018_best[3:10],1,Best_Model_Predict_Wins)
mlb_df_2018_best$Win_Diff <- mlb_df_2018_best$W - mlb_df_2018_best$Prediction

In the scatter plot below, the Predicted win total is plotted against the actual win total. The size of each dot is the absolute difference between actual wins and predicted wins. Color represents each team.

ggplot(mlb_df_2018_best, aes(x=Prediction, y=W)) +
    geom_point(aes(color=Team, size= abs(Win_Diff)))

head(mlb_df_2018_best,10)

##            Team   W   RAR LOB_pct    H Hper9 BBper9  FIP HRper9  ERA
## 1        Astros 103 287.3   0.779 1164  7.20   2.69 3.23   0.94 3.11
## 2       Dodgers  92 189.7   0.762 1279  7.80   2.57 3.60   1.09 3.40
## 3          Cubs  95 121.7   0.762 1319  8.04   3.79 4.14   0.96 3.65
## 4  Diamondbacks  82 153.0   0.757 1313  8.08   3.21 3.91   1.07 3.73
## 5       Brewers  96 155.5   0.744 1259  7.76   3.41 4.01   1.07 3.73
## 6          Rays  90 156.3   0.733 1236  7.68   3.11 3.82   1.02 3.75
## 7        Braves  90 140.7   0.741 1236  7.64   3.92 3.99   0.95 3.75
## 8       Red Sox 108 192.6   0.758 1305  8.05   3.16 3.82   1.09 3.75
## 9       Indians  91 210.4   0.760 1349  8.33   2.51 3.79   1.24 3.77
## 10      Yankees 100 252.3   0.739 1311  8.10   3.05 3.63   1.09 3.78
##    Prediction  Win_Diff
## 1   108.37278 -5.372782
## 2    95.76610 -3.766096
## 3    87.01077  7.989229
## 4    88.79979 -6.799793
## 5    87.97643  8.023569
## 6    86.54155  3.458451
## 7    83.68636  6.313643
## 8    92.98696 15.013041
## 9    96.19532 -5.195319
## 10   95.14991  4.850094

Conclusion and Reflections:

By looking at the p-values for the coefficients, the Diagnostics for the model, and the adjusted R-squared score, we can reject the null hypothesis that pitching cannot predict win totals. The data shows indeed that pitching is a critical factor in determining how many wins a team will have.

In the next section, I look at two categorical variables, Record and League, to evaluate whether pitching can predict a winning season and how pitching can predict based on the leagues.

4. LOGISTIC REGRESSION - Generalized Linear Model - Predicting A Winning Season

In this section, I explore the predictive value of winning team. For the purposes of this project, a winning team is defined by whether the team wins at least 50% of its games. The category variable, “Record”,

ggplot(mlb_df, aes(x=Record)) + geom_bar(color="Green", fill="yellow") + labs(title="Histogram for Record") +
                                    labs(x="Record", y="Count")

In the code block below record is the binary categorical factor with a value of “1” for a season at or above 50% in wins and a value of “0” for less than 50% wins.

Record <- glm(Record~ERA + H + R + ER + HR + BB + SO + KsPer9 + BBper9 + KperBB + Hper9 + HRper9 + 
                          AVG + WHIP + BABIP + LOB_pct + FIP + RAR + K_pct + BB_pct, data = mlb_df, family='binomial')
summary(Record) # display results

## 
## Call:
## glm(formula = Record ~ ERA + H + R + ER + HR + BB + SO + KsPer9 + 
##     BBper9 + KperBB + Hper9 + HRper9 + AVG + WHIP + BABIP + LOB_pct + 
##     FIP + RAR + K_pct + BB_pct, family = "binomial", data = mlb_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.87008  -0.38161   0.03218   0.41238   2.57727  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -56.49685   38.62642  -1.463   0.1436    
## ERA           22.66176   36.74005   0.617   0.5374    
## H              0.04891    0.13988   0.350   0.7266    
## R             -0.02248    0.02502  -0.898   0.3690    
## ER            -0.10626    0.22876  -0.465   0.6423    
## HR             0.17283    0.28930   0.597   0.5502    
## BB             0.16023    0.21065   0.761   0.4469    
## SO            -0.05402    0.09313  -0.580   0.5619    
## KsPer9        15.50739   14.77751   1.049   0.2940    
## BBper9       -24.31243   34.00041  -0.715   0.4746    
## KperBB         3.55487    2.89923   1.226   0.2201    
## Hper9         -7.16777   22.84385  -0.314   0.7537    
## HRper9       -39.22307   45.11088  -0.869   0.3846    
## AVG          448.37802  363.40464   1.234   0.2173    
## WHIP         -39.67814   47.62090  -0.833   0.4047    
## BABIP       -331.12627  251.34917  -1.317   0.1877    
## LOB_pct       57.23993   39.59791   1.446   0.1483    
## FIP            2.79338    2.25450   1.239   0.2153    
## RAR            0.05123    0.00510  10.046   <2e-16 ***
## K_pct       -195.49126  232.44620  -0.841   0.4003    
## BB_pct        80.85668   47.66879   1.696   0.0898 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 831.45  on 599  degrees of freedom
## Residual deviance: 370.42  on 579  degrees of freedom
## AIC: 412.42
## 
## Number of Fisher Scoring iterations: 6

Using backward elimination with a p-value cutoff of < 0.05, I was identified only one statistically significant coefficient variable, Runs Above Replacement (“RAR”) which is a measure of how many runs a pitcher prevents above the league average for pitchers. The equation below captures this relationship.

\[ \ln(p/1-p)=y = 6.0655 - 0.044595 * RAR \]

We can interpret this model to mean that for every increase in RAR (runs prevented above league average), the probability that the team will have a winning season increases by 4.4%.

Best_Model <- glm(Record~RAR,data = mlb_df, family='binomial')
summary(Best_Model)

## 
## Call:
## glm(formula = Record ~ RAR, family = "binomial", data = mlb_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8685  -0.6581   0.1051   0.6754   2.7506  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.065461   0.506281  -11.98   <2e-16 ***
## RAR          0.044595   0.003612   12.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 831.45  on 599  degrees of freedom
## Residual deviance: 513.97  on 598  degrees of freedom
## AIC: 517.97
## 
## Number of Fisher Scoring iterations: 5

The goodness of fit is:

1 - (513.97/831.45)

## [1] 0.381839

Using the formula above and the 2018 data, I calculated the probabilities of each team having a winning record for 2018. In the code blocks and visualizations below, I show that the model accurately predicted the odds of having a winning season for 24 out of 30 teams.

RAR = mlb_df_2018_best$RAR
y = -6.065461 + (0.044595 * RAR)
probabilities = exp(y) / (1+exp(y))
mlb_df_2018_best$prob_winning_record <- probabilities

This code block and visualization show that when the model predicted more than a 50% chance of winning, 12 teams had a winning season.

model_right <- mlb_df_2018_best %>% 
    select(Team, W, prob_winning_record, RAR) %>% 
    filter(W >= 81 & prob_winning_record >= .5) %>% 
    arrange(desc(RAR, W))

## Warning: package 'bindrcpp' was built under R version 3.5.1

ggplot(model_right, aes(x=prob_winning_record, y=W)) +
    geom_point(aes(color=Team, size= W))

This code block and visualization show that when the model predicted less than a 50% chance of winning, these 12 teams had a losing season.

model_right2 <- mlb_df_2018_best %>% 
    select(Team, W, prob_winning_record, RAR) %>% 
    filter(W < 81 & prob_winning_record < .5) %>% 
    arrange(desc(RAR, W))

ggplot(model_right2, aes(x=prob_winning_record, y=W)) +
    geom_point(aes(color=Team, size= W))

In the next block, the visualization shows where the model predicted winning seasons but two teams had losing seasons.

model_wrong <- mlb_df_2018_best %>% 
    select(Team, W, prob_winning_record, RAR) %>% 
    filter(W < 81 & prob_winning_record >= .5) %>% 
    arrange(desc(RAR, W))

ggplot(model_wrong, aes(x=prob_winning_record, y=W)) +
    geom_point(aes(color=Team, size= W))

Finally, in the next block, the visualization shows where the model predicted losing seasons but these four teams had winning seasons.

model_wrong2 <- mlb_df_2018_best %>% 
    select(Team, W, prob_winning_record, RAR) %>% 
    filter(W >= 81 & prob_winning_record < .5) %>% 
    arrange(desc(RAR, W))

ggplot(model_wrong2, aes(x=prob_winning_record, y=W)) +
    geom_point(aes(color=Team, size= W))

Conclusion and Reflections.

The model created in the logistic regression model has shown to be an accurate predictor of whether a team would have a winning season or not. It correctly predicted the results of 24 out of the 30 teams. Again, I can reject the null hypothesis that pitching cannot predict wins.

5. Applying the Model: AL vs. NL

As stated earlier, the NL and AL operate under different rules. The AL allows a designated hitter (DH) to bat for a pitcher which leads to more offense in the AL. In the code blocks below, I separated out the AL from the NL and applied the best model to see if there was a difference. Separating out the American League gives us 312 observations over 24 variables.

mlb_AL <- mlb_df[mlb_df$League == 'American League',]
glimpse(mlb_AL)

## Observations: 312
## Variables: 24
## $ Team    <fct> Yankees, Red Sox, Blue Jays, Devil Rays, Expos, Indian...
## $ W       <int> 114, 92, 88, 63, 65, 89, 85, 65, 79, 70, 74, 65, 76, 8...
## $ ERA     <dbl> 3.82, 4.19, 4.29, 4.35, 4.39, 4.45, 4.49, 4.64, 4.74, ...
## $ H       <int> 1357, 1406, 1443, 1425, 1448, 1552, 1481, 1463, 1505, ...
## $ R       <int> 656, 729, 768, 751, 783, 779, 783, 812, 785, 818, 866,...
## $ ER      <int> 619, 668, 698, 698, 696, 722, 720, 738, 754, 765, 769,...
## $ HR      <int> 156, 168, 169, 171, 156, 171, 164, 188, 169, 180, 179,...
## $ BB      <int> 466, 504, 587, 643, 533, 563, 630, 489, 535, 457, 529,...
## $ SO      <int> 1080, 1025, 1154, 1008, 1017, 1037, 1091, 908, 1065, 9...
## $ KsPer9  <dbl> 6.67, 6.42, 7.09, 6.29, 6.41, 6.39, 6.80, 5.71, 6.70, ...
## $ BBper9  <dbl> 2.88, 3.16, 3.61, 4.01, 3.36, 3.47, 3.93, 3.07, 3.36, ...
## $ KperBB  <dbl> 2.32, 2.03, 1.97, 1.57, 1.91, 1.84, 1.73, 1.86, 1.99, ...
## $ Hper9   <dbl> 8.38, 8.81, 8.86, 8.89, 9.13, 9.57, 9.23, 9.19, 9.46, ...
## $ HRper9  <dbl> 0.96, 1.05, 1.04, 1.07, 0.98, 1.05, 1.02, 1.18, 1.06, ...
## $ AVG     <dbl> 0.244, 0.252, 0.252, 0.257, 0.258, 0.269, 0.262, 0.260...
## $ WHIP    <dbl> 1.25, 1.33, 1.39, 1.43, 1.39, 1.45, 1.46, 1.36, 1.43, ...
## $ BABIP   <dbl> 0.277, 0.282, 0.290, 0.287, 0.291, 0.303, 0.300, 0.282...
## $ LOB_pct <dbl> 0.74, 0.71, 0.71, 0.73, 0.69, 0.72, 0.71, 0.68, 0.70, ...
## $ FIP     <dbl> 4.15, 4.40, 4.36, 4.79, 4.38, 4.54, 4.51, 4.69, 4.40, ...
## $ RAR     <dbl> 212.5, 156.7, 181.2, 109.8, 119.9, 160.6, 161.4, 87.4,...
## $ K_pct   <dbl> 0.177, 0.167, 0.182, 0.161, 0.164, 0.162, 0.173, 0.148...
## $ BB_pct  <dbl> 0.08, 0.08, 0.09, 0.10, 0.09, 0.09, 0.10, 0.08, 0.09, ...
## $ Record  <fct> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, ...
## $ League  <fct> American League, American League, American League, Ame...

AL_full <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA, data = mlb_AL)
summary(AL_full)

## 
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 + 
##     ERA, data = mlb_AL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2905  -5.3591   0.3234   5.0264  18.0251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -109.00120   43.04478  -2.532 0.011837 *  
## RAR            0.14242    0.01075  13.247  < 2e-16 ***
## LOB_pct      264.74632   53.92057   4.910 1.49e-06 ***
## H              0.12384    0.03842   3.224 0.001404 ** 
## Hper9        -29.80913    6.63934  -4.490 1.01e-05 ***
## BBper9       -10.39012    2.36379  -4.396 1.53e-05 ***
## FIP           13.93472    3.07375   4.533 8.36e-06 ***
## HRper9       -33.87673    7.67554  -4.414 1.42e-05 ***
## ERA           18.73142    5.48324   3.416 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.332 on 303 degrees of freedom
## Multiple R-squared:  0.6343, Adjusted R-squared:  0.6246 
## F-statistic: 65.68 on 8 and 303 DF,  p-value: < 2.2e-16

The AL model is the same as the one created earlier under Multiple regression:

\[ \hat{y} = -64.342203 + 0.132246*RAR + 195.507407*LOB_pct + 0.104260*H - 23.019203*Hper9 - 8.854223*BBper9 + 14.223588 * FIP -21.098982*HRper9 + 7.880635*ERA \]

mlb_df_2018_best$League <- case_when(mlb_df_2018_best$Team == 'Braves'~'National League',
mlb_df_2018_best$Team == 'Marlins'~'National League',
mlb_df_2018_best$Team == 'Mets'~'National League',
mlb_df_2018_best$Team == 'Phillies'~'National League',
mlb_df_2018_best$Team == 'Nationals'~'National League',
mlb_df_2018_best$Team == 'Cubs'~'National League',
mlb_df_2018_best$Team == 'Brewers'~'National League',
mlb_df_2018_best$Team == 'Pirates'~'National League',
mlb_df_2018_best$Team == 'Cardinals'~'National League',
mlb_df_2018_best$Team == 'Diamondbacks'~'National League',
mlb_df_2018_best$Team == 'Rockies'~'National League',
mlb_df_2018_best$Team == 'Dodgers'~'National League',
mlb_df_2018_best$Team == 'Padres'~'National League',
mlb_df_2018_best$Team == 'Giants'~'National League',
 TRUE ~ as.character('American League'))

AL_2018 <- mlb_df_2018_best %>% 
            filter(League == "American League")

AL_2018$Prediction <- apply(AL_2018[3:10],1,Best_Model_Predict_Wins)
AL_2018$Win_Diff <- AL_2018$W - AL_2018$Prediction

ggplot(AL_2018 , aes(x=Prediction, y=W)) +
    geom_point(aes(color=Team, size= abs(Win_Diff)))

However for the National league, it’s a little different. There are 288 cases/observations over 24 variables for the past 30 years of the NL. The NL model shows Home Runs per 9 innings and ERA as variables with high p-values. Using backward elimination, I removed these two variables and re-ran the model.

The final best NL model is:

\[ \hat{y} = -64.342203 + 0.132246*RAR + 195.507407*LOB_pct + 0.104260*H - 23.019203*Hper9 - 8.854223*BBper9 + 14.223588 * FIP \]

mlb_NL <- mlb_df[mlb_df$League == 'National League',]
glimpse(mlb_NL)

## Observations: 288
## Variables: 24
## $ Team    <fct> Braves, Astros, Padres, Mets, Dodgers, Pirates, Giants...
## $ W       <int> 106, 102, 98, 88, 83, 69, 89, 83, 77, 90, 74, 75, 77, ...
## $ ERA     <dbl> 3.25, 3.50, 3.63, 3.77, 3.81, 3.91, 4.19, 4.32, 4.44, ...
## $ H       <int> 1291, 1435, 1384, 1381, 1332, 1433, 1457, 1513, 1400, ...
## $ R       <int> 581, 620, 635, 645, 678, 718, 739, 782, 760, 792, 812,...
## $ ER      <int> 520, 572, 587, 611, 612, 629, 687, 706, 711, 738, 746,...
## $ HR      <int> 117, 147, 139, 152, 135, 147, 171, 151, 170, 180, 188,...
## $ BB      <int> 467, 465, 501, 532, 587, 530, 562, 558, 573, 575, 550,...
## $ SO      <int> 1232, 1187, 1217, 1129, 1178, 1112, 1089, 972, 1098, 1...
## $ KsPer9  <dbl> 7.71, 7.26, 7.53, 6.97, 7.33, 6.91, 6.64, 5.95, 6.86, ...
## $ BBper9  <dbl> 2.92, 2.84, 3.10, 3.28, 3.65, 3.29, 3.42, 3.42, 3.58, ...
## $ KperBB  <dbl> 2.64, 2.55, 2.43, 2.12, 2.01, 2.10, 1.94, 1.74, 1.92, ...
## $ Hper9   <dbl> 8.08, 8.78, 8.56, 8.52, 8.28, 8.90, 8.88, 9.27, 8.74, ...
## $ HRper9  <dbl> 0.73, 0.90, 0.86, 0.94, 0.84, 0.91, 1.04, 0.92, 1.06, ...
## $ AVG     <dbl> 0.236, 0.251, 0.247, 0.248, 0.241, 0.253, 0.253, 0.262...
## $ WHIP    <dbl> 1.22, 1.29, 1.30, 1.31, 1.33, 1.35, 1.37, 1.41, 1.37, ...
## $ BABIP   <dbl> 0.285, 0.295, 0.293, 0.287, 0.284, 0.292, 0.287, 0.292...
## $ LOB_pct <dbl> 0.74, 0.76, 0.75, 0.76, 0.73, 0.71, 0.73, 0.70, 0.71, ...
## $ FIP     <dbl> 3.53, 3.86, 3.84, 4.18, 4.06, 4.08, 4.42, 4.40, 4.45, ...
## $ RAR     <dbl> 255.7, 178.7, 202.1, 143.8, 171.8, 172.4, 120.3, 123.5...
## $ K_pct   <dbl> 0.207, 0.191, 0.198, 0.183, 0.191, 0.179, 0.171, 0.152...
## $ BB_pct  <dbl> 0.08, 0.08, 0.08, 0.09, 0.10, 0.09, 0.09, 0.09, 0.09, ...
## $ Record  <fct> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, ...
## $ League  <fct> National League, National League, National League, Nat...

NL_full <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP+HRper9+ERA, data = mlb_NL)
summary(NL_full)

## 
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP + HRper9 + 
##     ERA, data = mlb_NL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6277  -3.8531   0.6236   4.0614  18.4328 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.474387  37.669850  -0.464 0.643094    
## RAR           0.117112   0.009975  11.740  < 2e-16 ***
## LOB_pct     130.530725  46.823649   2.788 0.005673 ** 
## H             0.086080   0.032716   2.631 0.008983 ** 
## Hper9       -17.124652   5.443022  -3.146 0.001833 ** 
## BBper9       -7.593666   1.948334  -3.898 0.000122 ***
## FIP          15.873808   2.896662   5.480 9.51e-08 ***
## HRper9       -9.772962   6.330072  -1.544 0.123747    
## ERA          -3.429992   4.553977  -0.753 0.451973    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.193 on 279 degrees of freedom
## Multiple R-squared:  0.6931, Adjusted R-squared:  0.6843 
## F-statistic: 78.78 on 8 and 279 DF,  p-value: < 2.2e-16

NL_Best <- lm(W ~ RAR+LOB_pct+H+Hper9+BBper9+FIP, data = mlb_NL)
summary(NL_Best)

## 
## Call:
## lm(formula = W ~ RAR + LOB_pct + H + Hper9 + BBper9 + FIP, data = mlb_NL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8018  -3.8123   0.3152   4.2274  19.8166 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.17053   27.56025  -1.349  0.17852    
## RAR           0.11888    0.01008  11.797  < 2e-16 ***
## LOB_pct     156.87046   27.10001   5.789 1.89e-08 ***
## H             0.10620    0.03240   3.278  0.00118 ** 
## Hper9       -20.34215    5.29179  -3.844  0.00015 ***
## BBper9       -5.94192    1.30916  -4.539 8.40e-06 ***
## FIP           8.84372    1.70426   5.189 4.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.271 on 281 degrees of freedom
## Multiple R-squared:  0.6831, Adjusted R-squared:  0.6763 
## F-statistic:   101 on 6 and 281 DF,  p-value: < 2.2e-16

stats <-  c("RAR", "LOB_pct", "H", "Hper9", "BBper9", "FIP")
Best_NL_Model_Predict_Wins <- function (stats) {
            Intercept = -37.17053
            Slope = c(0.11888, 156.87046 ,  0.10620, -20.34215, -5.94192, 8.84372)
            coeff = sum((Slope * stats))
            Wins = Intercept + coeff
            return(Wins)
}

NL_2018 <- mlb_df_2018_best %>% 
            filter(League == "National League")
NL_2018$Prediction <- apply(NL_2018[3:8],1,Best_NL_Model_Predict_Wins)
NL_2018$Win_Diff <- NL_2018$W - NL_2018$Prediction

NL_2018

##            Team  W   RAR LOB_pct    H Hper9 BBper9  FIP HRper9  ERA
## 1       Dodgers 92 189.7   0.762 1279  7.80   2.57 3.60   1.09 3.40
## 2          Cubs 95 121.7   0.762 1319  8.04   3.79 4.14   0.96 3.65
## 3  Diamondbacks 82 153.0   0.757 1313  8.08   3.21 3.91   1.07 3.73
## 4       Brewers 96 155.5   0.744 1259  7.76   3.41 4.01   1.07 3.73
## 5        Braves 90 140.7   0.741 1236  7.64   3.92 3.99   0.95 3.75
## 6     Cardinals 88 138.8   0.730 1354  8.37   3.67 3.97   0.89 3.85
## 7        Giants 73 116.7   0.723 1387  8.54   3.23 3.98   0.96 3.95
## 8       Pirates 82 127.9   0.736 1380  8.66   3.12 4.06   1.09 4.00
## 9     Nationals 82 131.2   0.748 1320  8.22   3.03 4.15   1.23 4.04
## 10         Mets 77 144.9   0.730 1364  8.40   2.98 3.97   1.14 4.07
## 11     Phillies 80 182.3   0.710 1366  8.50   3.11 3.83   1.06 4.15
## 12      Rockies 91 178.0   0.713 1377  8.53   3.25 4.06   1.14 4.33
## 13       Padres 66 119.4   0.711 1430  8.83   3.21 4.10   1.14 4.41
## 14      Marlins 63  31.3   0.699 1388  8.66   3.78 4.57   1.20 4.76
##    Prediction    Win_Diff prob_winning_record          League
## 1    98.64398  -6.6439841         0.916384758 National League
## 2    87.45249   7.5475055         0.345642349 National League
## 3    90.35046  -8.3504582         0.680826791 National League
## 4    89.07902   6.9209818         0.704550378 National League
## 5    83.64019   6.3598127         0.552074357 National League
## 6    80.67918   7.3208237         0.531041277 National League
## 7    79.70315  -6.7031516         0.297083576 National League
## 8    81.25057   0.7494256         0.410533465 National League
## 9    87.43458  -5.4345775         0.446555694 National League
## 10   85.95600  -8.9560046         0.597811475 National League
## 11   83.43232  -3.4323220         0.887375355 National League
## 12   85.15187   5.8481283         0.866741393 National League
## 13   77.98914 -11.9891434         0.322825296 National League
## 14   65.40079  -2.4007893         0.009288581 National League

ggplot(NL_2018 , aes(x=Prediction, y=W)) +
    geom_point(aes(color=Team, size= abs(Win_Diff)))

5. Final Conclusion

As has been shown in the Simple Linear regression model, the Multiple regression model, the Logistic model, and the AL/NL models, pitching is a good predictor of Wins and winning seasons by a MLB team. A good follow up project would be to do the same with Hitting and Fielding statistics, and then combine all three to predict wins.

DATA606 FALL 2018 FINAL PROJECT

John K. Hancock

December 10, 2018

DATA606 Fall 2018 - FINAL PROJECT