AB RBI League
1 533 42 AL
2 505 61 AL
3 540 53 AL
4 530 58 AL
5 481 40 AL
6 464 48 AL
Simple Regression
model_ab <- lm(RBI ~ AB, data = df)
summary(model_ab)
Call:
lm(formula = RBI ~ AB, data = df)
Residuals:
Min 1Q Median 3Q Max
-52.775 -16.115 1.697 15.551 60.155
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 314.93204 6.40239 49.19 <2e-16 ***
AB -0.51855 0.01538 -33.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.78 on 198 degrees of freedom
Multiple R-squared: 0.8517, Adjusted R-squared: 0.8509
F-statistic: 1137 on 1 and 198 DF, p-value: < 2.2e-16
AB coefficient (-0.51855): for each additional at-bat, we expect RBI to decrease by about 0.52 on average, so the simple model shows a negative linear relationship between RBI and AB.
model_ab_league <- lm(RBI ~ AB + League, data = df)
summary(model_ab_league)
Call:
lm(formula = RBI ~ AB + League, data = df)
Residuals:
Min 1Q Median 3Q Max
-28.1804 -6.3759 -0.2709 6.0080 23.2048
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.87887 12.39089 -4.025 8.11e-05 ***
AB 0.19975 0.02468 8.094 5.88e-14 ***
LeagueNL 149.13172 4.94290 30.171 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.21 on 197 degrees of freedom
Multiple R-squared: 0.9736, Adjusted R-squared: 0.9733
F-statistic: 3634 on 2 and 197 DF, p-value: < 2.2e-16
AB coefficient (0.199): for each additional at-bat, we expect RBI to increase by about 0.2 on average, holding the player's league constant.
LeagueNL coefficient (149.13): League is a categorical variable with AL as the baseline, so 149.13 is the expected difference in RBI between a National League (NL) player and an American League (AL) player with the same number of at-bats. It is not the average RBI of NL players.
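To make the dummy-variable interpretation concrete, here is a small sketch that plugs the coefficients from the output above into the fitted equation by hand. The helper `predict_rbi` and the choice of 500 at-bats are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model_ab_league) output
b0   <- -49.87887   # intercept (AL baseline)
b_ab <- 0.19975     # slope for AB
b_nl <- 149.13172   # shift for NL relative to AL

# Hypothetical helper: fitted RBI for a given AB and league
predict_rbi <- function(ab, league) {
  b0 + b_ab * ab + b_nl * (league == "NL")
}

al <- predict_rbi(500, "AL")  # about 50 RBI
nl <- predict_rbi(500, "NL")  # about 199 RBI
nl - al                       # the gap equals the LeagueNL coefficient
```

Whatever AB value is plugged in, the NL minus AL gap stays the same, which is exactly what "holding AB constant" means in an additive model.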
The AB coefficient changes because adding League isolates the effect of at-bats on RBIs instead of leaving AB to explain RBI on its own. The interesting part is the sign flip, from about -0.5 (each at-bat leads to fewer RBIs, which is suspicious) to 0.199. In the simple model, at-bats appeared negatively related to RBIs because of a hidden variable: League. In the pooled data AB and RBI look negatively related, but that is an artifact of league differences, where one league (AL) has more at-bats but fewer RBIs. Within each league, more at-bats go with more RBIs.
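The sign flip can be reproduced on simulated data. The sketch below invents two leagues with the same positive AB slope and a large level difference, then fits both models. Every number in it is an assumption chosen to mimic the pattern, not a value taken from `df`.

```r
# Simulated data only: values are invented to mimic the pattern described above
set.seed(42)
n <- 100
sim <- data.frame(
  League = rep(c("AL", "NL"), each = n),
  AB     = c(runif(n, 450, 550), runif(n, 150, 250))  # AL has more at-bats
)

# Same positive AB slope (0.2) in both leagues; NL gets a large upward shift
sim$RBI <- -50 + 0.2 * sim$AB + 149 * (sim$League == "NL") + rnorm(2 * n, sd = 9)

slope_simple   <- coef(lm(RBI ~ AB, data = sim))[["AB"]]           # pooled slope
slope_adjusted <- coef(lm(RBI ~ AB + League, data = sim))[["AB"]]  # within-league slope

c(simple = slope_simple, adjusted = slope_adjusted)
```

The pooled slope comes out negative because the line has to connect the high-RBI, low-AB cluster to the low-RBI, high-AB cluster, while the adjusted slope recovers the positive within-league relationship.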
library(ggplot2)

ggplot(df, aes(x = AB, y = RBI)) +
  # SLR: one line across all data (gray)
  geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed") +
  # MLR: separate lines for each league
  geom_smooth(aes(color = League), method = "lm", se = FALSE, linewidth = 1.2) +
  # Points
  geom_point(aes(color = League), alpha = 0.7) +
  labs(
    title = "Comparing Simple vs Multiple Regression Lines",
    subtitle = "RBI vs At-Bats (AB), with and without League accounted for",
    x = "At-Bats (AB)",
    y = "Runs Batted In (RBI)",
    color = "League"
  ) +
  theme_minimal(base_size = 14)
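The gray dashed line shows the SLR trend, which explains RBI with AB alone. The colored lines show the fitted RBI values within each league. In the MLR model both leagues share the same AB slope (about 0.2), but National League players are expected to have about 149 more RBIs with AB held constant. Note that geom_smooth fits each league separately, so the plotted lines are close to, but not exactly, the parallel lines implied by the MLR model.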
yearID AB H BB HBP TB RC G RC_per_game
1 2013 555 193 90 5 353 156.4062 148 1.056798
Mike Trout and Kris Bryant were very close in the 2016 MVP race, each winning in his respective league. Their Runs Created per game were almost identical: Trout at .84 per game and Bryant at .832 per game. Bryant hit more home runs (39) and had more extra-base hits. The Cubs, Bryant's team, were the most dominant team in the league and won the World Series. Bryant had a historic season, but Cabrera's 2013 season displayed peak offensive efficiency: he hit 44 home runs, drove in 137 runs, and created about 156 runs in 148 games (1.06 per game).
Let's compute Runs Created per Game (RC/Game) for all four players, then compare each to Nori Aoki in 2016, who serves as a stand-in for an average MLB player.
# Assumes Batting is the table from the Lahman package and dplyr supplies the verbs
library(Lahman)
library(dplyr)

# Function to compute RC/Game for any player and year
get_rc_per_game <- function(player_id, year) {
  Batting %>%
    filter(playerID == player_id, yearID == year) %>%
    mutate(
      HBP = ifelse(is.na(HBP), 0, as.numeric(HBP)),  # treat missing HBP as zero
      X1B = H - X2B - X3B - HR,                       # singles
      TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,          # total bases
      RC = (H + BB + HBP) * TB / (AB + BB + HBP),     # basic Runs Created
      RC_per_game = RC / G
    ) %>%
    select(playerID, yearID, G, AB, H, BB, HBP, TB, RC, RC_per_game)
}

trout_2016 <- get_rc_per_game("troutmi01", 2016)
bryant_2016 <- get_rc_per_game("bryankr01", 2016)
cabrera_2013 <- get_rc_per_game("cabremi01", 2013)
aoki_2016 <- get_rc_per_game("aokino01", 2016)

# Combine for comparison
rc_comparison <- bind_rows(trout_2016, bryant_2016, cabrera_2013, aoki_2016)
rc_comparison
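As a sanity check on the formula used in the function, the Runs Created calculation can be worked by hand for the 2013 row printed earlier (AB = 555, H = 193, BB = 90, HBP = 5, TB = 353, G = 148):

$$
RC = \frac{(H + BB + HBP)\cdot TB}{AB + BB + HBP} = \frac{(193 + 90 + 5)\cdot 353}{555 + 90 + 5} = \frac{101{,}664}{650} \approx 156.41
$$

$$
RC/G = \frac{156.41}{148} \approx 1.057
$$

Both values match the RC and RC_per_game columns in that output row.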
yearID teamID OBP SLG R
1 2010 NYA 0.3498267 0.4359619 859
2 2011 NYA 0.3433812 0.4441827 867
3 2012 NYA 0.3369355 0.4531137 804
4 2013 NYA 0.3069241 0.3758488 650
5 2014 NYA 0.3068407 0.3800255 633
6 2015 NYA 0.3228599 0.4208730 764
7 2016 NYA 0.3145750 0.4052767 680
model_ops <- lm(R ~ OBP + SLG, data = team_data)
summary(model_ops)
Call:
lm(formula = R ~ OBP + SLG, data = team_data)
Residuals:
Min 1Q Median 3Q Max
-61.367 -15.509 0.656 14.185 74.171
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -739.44 43.87 -16.86 <2e-16 ***
OBP 2345.91 192.42 12.19 <2e-16 ***
SLG 1703.24 93.15 18.29 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.52 on 207 degrees of freedom
Multiple R-squared: 0.8852, Adjusted R-squared: 0.8841
F-statistic: 797.8 on 2 and 207 DF, p-value: < 2.2e-16
plot(model_ops, which =1) # Residuals vs Fitted
The residuals appear to be randomly scattered around zero. There is a slight pattern of positive residuals at the far ends of the fitted range, but it is not too worrying.
An R^2 of .885 means that 88.5% of the variation in team runs (R) from 2010–2016 can be explained by OBP and SLG. With very small p-values (< .0001), both predictors are significant in the model.
OBP has the larger coefficient (2345.91 vs. 1703.24), which suggests a larger effect on runs per unit change, but it also has a smaller t value (12.19 vs. 18.29). That is because OBP has a larger standard error (192.42 vs. 93.15), so we are less confident in its estimate than in the SLG estimate. The raw coefficients are also not directly comparable, since OBP and SLG vary over different ranges. Either way, both variables are clearly significant, so the larger standard error on OBP is not a concern for the model.
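To see what the coefficients imply in runs, the sketch below plugs the printed estimates into the fitted equation for the 2010 row of the team table and converts each coefficient to a "per 10 points" effect. The helper `predict_runs` and the 0.010 step size are illustrative assumptions.

```r
# Coefficients copied from the summary(model_ops) output
b0    <- -739.44
b_obp <- 2345.91
b_slg <- 1703.24

# Hypothetical helper: fitted team runs for a given OBP and SLG
predict_runs <- function(obp, slg) b0 + b_obp * obp + b_slg * slg

# 2010 NYA row from the team table: OBP 0.3498267, SLG 0.4359619, R 859
pred_2010  <- predict_runs(0.3498267, 0.4359619)  # roughly 824 runs
resid_2010 <- 859 - pred_2010                     # roughly +35 runs

# Effect of a 10-point (0.010) change in each statistic
runs_per_10pts_obp <- b_obp * 0.010  # about 23.5 runs
runs_per_10pts_slg <- b_slg * 0.010  # about 17.0 runs
```

A residual of about 35 runs is larger than the residual standard error of 23.52 but still within the range of residuals shown in the model summary.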
3) Pythagorean Expectation
When comparing the Mariners' and Tigers' season records with their run differentials (RD), I like to use the simple rule of thumb that +10 RD leads to about 1 extra predicted win. The Mariners' win % of .716 is .216 above .500, or about 35 wins above .500 in a 162-game season. This would suggest an RD of about +346 (actual: +300). The Tigers had a win % of .265, which is .235 under .500. That translates to about 38 games, which would suggest an RD of about -376 (actual: -337). In both cases the implied RD is within about 50 runs of the actual figure, so the model produces sensible winning percentages for both teams, even at these extremes.
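The two ideas in this section can be sketched as small functions: the classic Pythagorean formula, W% = RS^2 / (RS^2 + RA^2), and the 10-runs-per-win rule of thumb run in reverse. The function names, the exponent of 2, the exact 10-runs-per-win factor, and the 800/700 example inputs are all assumptions for illustration, not the model fitted in the original analysis.

```r
# Pythagorean expectation: expected win % from runs scored (rs) and allowed (ra)
pythag_wpct <- function(rs, ra, exponent = 2) {
  rs^exponent / (rs^exponent + ra^exponent)
}

# Rule of thumb in reverse: run differential implied by a winning percentage
implied_rd <- function(win_pct, games = 162, runs_per_win = 10) {
  (win_pct - 0.5) * games * runs_per_win
}

pythag_wpct(800, 700)  # hypothetical team: 800 scored, 700 allowed
implied_rd(0.716)      # Mariners' win % from the text
implied_rd(0.265)      # Tigers' win % from the text
```

With exactly 10 runs per win the implied differentials are roughly +350 and -381, in the same neighborhood as the +346 and -376 quoted above, so the rule of thumb and the fitted model tell the same story.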
Clearly, run differential predicts playoff series winners more accurately than regular-season wins do. Still, 50% accuracy is not that great, and this could be because playoff series are small samples (best of 5 or 7). Random variation, hot streaks, and slumps have a stronger impact in a short series, so even teams with superior season-long stats can be upset by a small number of events. This diminishes our confidence in the predictor. Playoff baseball is also different: managers use tighter bullpens, shorter rotations, and different strategies in October. Individual stars or clutch performers can take over and shift a series more easily, so team star power explains more of the variability in the postseason.
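The small-sample point can be quantified with a binomial calculation. The sketch below assumes each game is independent and that the better team wins any single game with a fixed probability `p`; both are simplifying assumptions, and `series_win_prob` is a hypothetical helper name.

```r
# Probability that a team with per-game win probability p wins a best-of-n series.
# Equivalent to winning a majority of n games if all n were played.
series_win_prob <- function(p, n_games) {
  wins_needed <- n_games %/% 2 + 1
  1 - pbinom(wins_needed - 1, size = n_games, prob = p)
}

p <- 0.55                            # assumed edge for the better team
best_of_5 <- series_win_prob(p, 5)   # about 0.59
best_of_7 <- series_win_prob(p, 7)   # about 0.61
c(best_of_5 = best_of_5, best_of_7 = best_of_7)
```

Under these assumptions, a team that is a clear favorite in any single game still loses a best-of-7 series roughly four times in ten, which is consistent with the modest prediction accuracy discussed above.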
If a team trades Player A (150 runs created) for Player B (170 runs created), the expected gain is 20 runs, which corresponds to an increase in win % of about 1.2 percentage points, or roughly 2 games. Before the trade the team would be expected to win 86 games (.530); after the trade it is expected to win 88 games (.542).
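The arithmetic behind the trade estimate, using the 10-runs-per-win rule of thumb as an approximation:

$$
\Delta RC = 170 - 150 = 20 \text{ runs}, \qquad \Delta W \approx \frac{20}{10} = 2 \text{ wins}
$$

$$
0.530 \times 162 \approx 85.9 \approx 86, \qquad 0.542 \times 162 \approx 87.8 \approx 88, \qquad (0.542 - 0.530) \times 162 \approx 1.9
$$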