Major League Baseball Runs Scored (2016)

Major League Baseball (MLB) is the oldest of the four major professional sports leagues in the United States and Canada. A total of 30 teams play in the National League (NL) and American League (AL), with 15 teams in each league. The object of the game is to have more points (runs) than the opponent by the end of nine innings (or by the end of extra innings if the score is tied). Some games are fast-paced (more runs) and others are slow (fewer runs). My study focuses on the external variables (mainly Game type and Game hours) that affect total runs in a baseball game. The other variables serve as controls.

I will be using the “2016 MLB Season” dataset obtained from Kaggle.

Dependent variable:

  1. Total Runs

Independent variables:

  1. Game type: Day Game or Night Game
  2. Temperature
  3. Game hours
  4. Attendance
  5. Wind speed

Getting the data ready for Best-Fit Model

library(readr)
mlb2016<-read_csv("C:/Users/wroni/OneDrive/Documents/QC MADASR/SOC 712/mlb2016.csv")

I imported the data.

head(mlb2016)

This is a quick peek at the dataset.

library(dplyr)
mlb2016$game_type=as.factor(mlb2016$game_type)

This converts the game_type variable (Day Game or Night Game) into a factor, so it can be used as a categorical predictor in the models that follow.
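To illustrate what the factor conversion does, here is a minimal sketch on a toy vector (the three values are made up and merely stand in for mlb2016$game_type). R sorts factor levels alphabetically by default and treats the first level as the reference category, which is presumably why the regression output reports a coefficient named game_typeNight Game measured against Day Game.

```r
# Toy vector standing in for mlb2016$game_type (hypothetical values)
game_type <- as.factor(c("Night Game", "Day Game", "Night Game"))

# Levels are sorted alphabetically by default
levels(game_type)

# The first level acts as the reference category in a regression
reference_level <- levels(game_type)[1]
reference_level
```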

Best-Fit Model

library(Zelig)
m1=zelig(total_runs~game_type,model = "poisson",data=mlb2016,cite=F)
m2=zelig(total_runs~game_type+attendance,model = "poisson",data=mlb2016,cite=F)
m3=zelig(total_runs~game_type+attendance+temperature,model = "poisson",data=mlb2016,cite=F)
m4=zelig(total_runs~game_type*temperature+attendance*game_hours_dec,model = "poisson",data=mlb2016,cite=F)
m5=zelig(total_runs~game_type*game_hours_dec*temperature+attendance+wind_speed,model = "poisson",data=mlb2016,cite=F)

library(texreg)
library(pander)
t1 <- htmlreg(list(m1, m2, m3, m4, m5), doctype= FALSE)

pander(t1)
Statistical models
                                                Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)                                     2.18***    2.20***    1.92***    0.38*      1.15*
                                                (0.01)     (0.03)     (0.05)     (0.15)     (0.50)
game_typeNight Game                             0.01       0.01       0.01       -0.01      -1.92**
                                                (0.01)     (0.01)     (0.01)     (0.10)     (0.64)
attendance                                                 -0.00      -0.00      0.00**     -0.00***
                                                           (0.00)     (0.00)     (0.00)     (0.00)
temperature                                                           0.00***    0.00**     -0.00
                                                                      (0.00)     (0.00)     (0.01)
game_hours_dec                                                                   0.52***    0.25
                                                                                 (0.04)     (0.16)
game_typeNight Game:temperature                                                  0.00       0.02**
                                                                                 (0.00)     (0.01)
attendance:game_hours_dec                                                        -0.00***
                                                                                 (0.00)
wind_speed                                                                                  0.01***
                                                                                            (0.00)
game_typeNight Game:game_hours_dec                                                          0.61**
                                                                                            (0.20)
game_hours_dec:temperature                                                                  0.00
                                                                                            (0.00)
game_typeNight Game:game_hours_dec:temperature                                              -0.01**
                                                                                            (0.00)
AIC                                             15180.65   15163.77   15128.35   14286.26   14257.92
BIC                                             15192.27   15181.19   15151.58   14326.92   14316.00
Log Likelihood                                  -7588.32   -7578.88   -7560.18   -7136.13   -7118.96
Deviance                                        5553.35    5545.63    5508.22    4660.13    4625.79
Num. obs.                                       2463       2460       2460       2460       2460
*** p < 0.001, ** p < 0.01, * p < 0.05

The pander table shows that Model 5 has the lowest AIC (14257.92) and the lowest BIC (14316.00), so I will be working with that model. The BIC gap between Model 4 and Model 5 is small (about 11 points) compared with the gap separating both from Models 1 through 3.
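The model comparison can be checked with a short sketch that ranks the five models by AIC and BIC. The numbers are copied by hand from the table; the data frame and column names are my own. Note that Model 1 was fit on 2463 games and the others on 2460, so comparisons involving Model 1 are only approximate.

```r
# Fit statistics copied from the regression table
fit <- data.frame(
  model = paste("Model", 1:5),
  AIC   = c(15180.65, 15163.77, 15128.35, 14286.26, 14257.92),
  BIC   = c(15192.27, 15181.19, 15151.58, 14326.92, 14316.00)
)

# Distance of each model from the best (lowest) value in each column
fit$delta_AIC <- fit$AIC - min(fit$AIC)
fit$delta_BIC <- fit$BIC - min(fit$BIC)

print(fit)

# Which model each criterion prefers
best_aic <- fit$model[which.min(fit$AIC)]
best_bic <- fit$model[which.min(fit$BIC)]
```

Lower values are better for both criteria, so a delta of zero marks the preferred model in that column.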

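Model 5 Quick Interpretation:

Poisson coefficients are on the log scale, so each one is a change in the log of expected total runs, not in the number of runs. Because Model 5 includes interactions, each lower-order term applies when the variables it interacts with are equal to zero.

  1. Game type: For a Night game, as opposed to a Day game, the log of expected total runs decreases by 1.92.
  2. Game hours: For each additional hour of game time, the log of expected total runs increases by 0.25. This coefficient is not statistically significant.
  3. Night game * temperature: For a Night game, each one-unit increase in temperature raises the log of expected total runs by a further 0.02.
  4. Night game * game hours: For a Night game, each additional hour of game time raises the log of expected total runs by a further 0.61.
  5. Game hours * temperature: No detectable effect (coefficient 0.00, not significant).
  6. Night game * game hours * temperature: For a Night game, the extra effect of game time on the log of expected total runs shrinks by 0.01 for each one-unit increase in temperature.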
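Because a Poisson model works on the log scale, the coefficients listed above are easier to read once they are exponentiated into multiplicative factors (rate ratios). The sketch below does this for the Model 5 values copied from the table; the vector names are labels I chose for illustration. With interactions in the model, each factor applies only when the interacting variables are zero, so these are a reading aid and not stand-alone effects.

```r
# Model 5 coefficients copied from the regression table (log scale)
coefs <- c(
  night                = -1.92,
  game_hours           =  0.25,
  night_x_temperature  =  0.02,
  night_x_game_hours   =  0.61,
  night_x_hours_x_temp = -0.01
)

# exp() turns a log-scale coefficient into a multiplicative factor
rate_ratio <- exp(coefs)

# The same factor expressed as a percent change in expected runs
pct_change <- 100 * (rate_ratio - 1)

round(data.frame(coef = coefs, rate_ratio, pct_change), 3)
```

A rate ratio above 1 means more expected runs and a ratio below 1 means fewer.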

This tells us that Game type and Game hours, both on their own and in interaction with each other, have the largest coefficients in Model 5 compared to the other observed variables.

Keep in mind that the Zelig simulations and plots give a better representation of the effect these variables have on total runs in a baseball game.

Zelig simulations

m6 <- zelig(total_runs ~ game_type + attendance + temperature + game_hours_dec + wind_speed, model = "poisson", data = mlb2016, cite = F)

summary(m6)
## Model: 
## 
## Call:
## z5$zelig(formula = total_runs ~ game_type + attendance + temperature + 
##     game_hours_dec + wind_speed, data = mlb2016)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7280  -1.0183  -0.1300   0.8154   4.0463  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)          7.184e-01  6.615e-02  10.859  < 2e-16
## game_typeNight Game  1.459e-02  1.457e-02   1.001  0.31681
## attendance          -2.466e-06  7.017e-07  -3.515  0.00044
## temperature          3.773e-03  6.432e-04   5.865 4.48e-09
## game_hours_dec       3.846e-01  1.263e-02  30.453  < 2e-16
## wind_speed           7.043e-03  1.355e-03   5.198 2.02e-07
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5547.0  on 2459  degrees of freedom
## Residual deviance: 4644.5  on 2454  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 14269
## 
## Number of Fisher Scoring iterations: 4
## 
## Next step: Use 'setx' method

I used a “poisson” model for my Zelig simulations since Total runs is a count variable.
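A Poisson model assumes the variance of the counts equals their mean. One common informal check is the ratio of residual deviance to residual degrees of freedom, which should be close to 1 if that assumption holds. The sketch below uses the two numbers printed in the summary(m6) output; it is a rough diagnostic and was not part of the original analysis.

```r
# Values copied from the summary(m6) output
residual_deviance <- 4644.5
residual_df       <- 2454

# Ratio near 1 is consistent with a Poisson model;
# a ratio well above 1 hints at overdispersion
dispersion_ratio <- residual_deviance / residual_df
round(dispersion_ratio, 2)
```

If the ratio is well above 1, the reported standard errors may be too small, and a quasi-Poisson or negative binomial model is a common alternative to consider.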

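Zelig Simulation Interpretations (coefficients are on the log scale):

  1. Game type: For a Night game, as opposed to a Day game, the log of expected total runs increases by 0.0146 (about 1.5% more runs). This effect is not statistically significant (p = 0.317).
  2. Game hours: For each additional hour of game time, the log of expected total runs increases by 0.3846 (about 47% more runs).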
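The coefficients in the summary(m6) output are printed in scientific notation (1.459e-02 is 0.01459 and 3.846e-01 is 0.3846) and sit on the log scale. A short worked derivation, using generic beta symbols that are my own notation, shows how they translate into multiplicative effects on expected runs:

```latex
\[
\log \mathbb{E}[\text{total\_runs}]
  = \beta_0 + \beta_1\,\text{Night} + \beta_2\,\text{attendance}
  + \beta_3\,\text{temperature} + \beta_4\,\text{game\_hours}
  + \beta_5\,\text{wind\_speed}
\]

\[
\frac{\mathbb{E}[\text{total\_runs} \mid \text{game\_hours} = h + 1]}
     {\mathbb{E}[\text{total\_runs} \mid \text{game\_hours} = h]}
  = e^{\beta_4} = e^{0.3846} \approx 1.47
\]

\[
\frac{\mathbb{E}[\text{total\_runs} \mid \text{Night} = 1]}
     {\mathbb{E}[\text{total\_runs} \mid \text{Night} = 0]}
  = e^{\beta_1} = e^{0.01459} \approx 1.015
\]
```

In words, one more hour of play multiplies expected runs by roughly 1.47, while a night game multiplies them by roughly 1.015, with the other variables held fixed.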

Plotting

x <- setx(m6, game_type = "Night Game")
x1 <- setx(m6, game_type = "Day Game")
s <- sim(m6, x = x, x1 = x1)
summary(s)
## 
##  sim x :
##  -----
## ev
##         mean         sd      50%     2.5%    97.5%
## [1,] 8.80379 0.07403885 8.805196 8.663167 8.945228
## pv
##       mean      sd 50% 2.5% 97.5%
## [1,] 8.856 2.88541   9    4    15
## 
##  sim x1 :
##  -----
## ev
##          mean        sd      50%     2.5%  97.5%
## [1,] 8.681783 0.1037319 8.680748 8.473593 8.8888
## pv
##       mean       sd 50%  2.5% 97.5%
## [1,] 8.696 2.844587   9 3.975    14
## fd
##            mean        sd        50%       2.5%    97.5%
## [1,] -0.1220067 0.1277862 -0.1227673 -0.3825141 0.129917
par("mar")
## [1] 5.1 4.1 4.1 2.1
par(mar= c(2,2,2,2))
s$graph()

This graph shows the effect of Game type (Day game or Night game) on Total runs, with all other variables held at their default values.

First difference: The difference in expected total runs between Day games and Night games (Day minus Night) is about -0.12, and its 95% interval (-0.38 to 0.13) includes zero.

Predicted values: The night game total is about 8.9 runs on average. The day game total is about 8.7 runs on average. The median is 9 for both.

Expected values: The night game total is about 8.8 runs. The day game total is about 8.7 runs.
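As a quick arithmetic check on the simulation summary, the reported first difference can be reproduced from the two expected-value means. The sketch below copies the numbers by hand from the summary(s) output; the variable names are my own.

```r
# Expected-value means copied from the summary(s) output
ev_night <- 8.80379    # sim x  (game_type = "Night Game")
ev_day   <- 8.681783   # sim x1 (game_type = "Day Game")

# First difference as reported: x1 minus x, i.e. Day minus Night
fd <- ev_day - ev_night

# 95% interval bounds copied from the fd row of the output
fd_interval <- c(lower = -0.3825141, upper = 0.129917)
includes_zero <- fd_interval["lower"] < 0 && fd_interval["upper"] > 0

fd
includes_zero
```

When the interval includes zero, the simulation does not clearly separate Day games from Night games.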

h.range = min(mlb2016$game_hours_dec):max(mlb2016$game_hours_dec)
x <- setx(m6, game_hours_dec = h.range)
s <- sim(m6, x = x)
ci.plot(s)

This graph shows that as game hours increase, expected total runs increase as well.

h.range = min(mlb2016$game_hours_dec):max(mlb2016$game_hours_dec)
x <- setx(m6, game_hours_dec = mean(mlb2016$game_hours_dec))
x1 <- setx(m6, game_hours_dec = mean(mlb2016$game_hours_dec) + sd(mlb2016$game_hours_dec))
s <- sim(m6, x = x, x1 = x1)
summary(s)
## 
##  sim x :
##  -----
## ev
##          mean         sd      50%     2.5%   97.5%
## [1,] 8.816183 0.07374476 8.816244 8.665106 8.96538
## pv
##       mean       sd 50% 2.5%  97.5%
## [1,] 8.835 2.876117   9    4 14.025
## 
##  sim x1 :
##  -----
## ev
##          mean        sd      50%    2.5%    97.5%
## [1,] 10.52369 0.1010945 10.52227 10.3266 10.72073
## pv
##        mean       sd 50% 2.5% 97.5%
## [1,] 10.563 3.151819  10    5    17
## fd
##          mean         sd      50%    2.5%    97.5%
## [1,] 1.707503 0.06071185 1.709615 1.58648 1.823581

This is the same analysis as above, but it compares a game of average length with a game one standard deviation longer. Expected total runs rise from about 8.8 to about 10.5, a first difference of about 1.71 runs (95% interval 1.59 to 1.82).
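To put the simulated change in context, the sketch below computes both the absolute and the relative change in expected runs from the two expected-value means printed in the output. The numbers are copied by hand and the variable names are my own.

```r
# Expected-value means copied from the summary(s) output
ev_mean_hours  <- 8.816183   # game_hours_dec at its mean
ev_plus_one_sd <- 10.52369   # game_hours_dec at mean + 1 SD

# Absolute change in expected runs
abs_change <- ev_plus_one_sd - ev_mean_hours

# Relative change in expected runs, in percent
rel_change <- 100 * (ev_plus_one_sd / ev_mean_hours - 1)

round(c(abs_change = abs_change, rel_change_pct = rel_change), 2)
```

The absolute change should agree closely with the fd row of the output, since the first difference is the expected value at x1 minus the expected value at x.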

Conclusion

In conclusion, total runs in a baseball game are affected by the external variables of Game type (Night game or Day game) and Game hours. According to the Zelig simulations, total runs increased as game time increased, by about 1.7 runs for a game one standard deviation longer than average. Night games averaged slightly more runs than day games, but the difference was small (about 0.1 runs) and its interval included zero. Model 5 also points to an interaction effect between Game type and Game hours. I included Temperature, Wind speed, and Attendance in the model used for the Zelig simulations as controls, which reduces the risk of attributing their influence to Game type or Game hours.