We spent the final week in class working on projects. My group is using Major League Baseball data to analyze the impact that designated hitters have on the amount of offense. To do this, we want to:
I looked into modeling the number of hits by the visiting team.
bball<-read.csv("/Users/kuenn/Documents/St. Thomas/STAT 413/Project/Censored MLB Data CSV.csv")
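# Variable creation: combine home (h_) and visiting (v_) totals for each game
nstrikeouts<-bball$v_strikeouts+bball$h_strikeouts
nhomeruns<-bball$v_homeruns+bball$h_homeruns
# Walks for both teams, excluding intentional walks
netwalks<-bball$h_walks+bball$v_walks-bball$h_intentional.walks-bball$v_intentional.walks
hits<-bball$h_hits+bball$v_hits
yrsince2000<-bball$year-2000
# Any home league other than "NL" is labeled "AL"
homeleague<-ifelse(bball$h_league=="NL","NL","AL")
# Designated hitter indicator: 1 when the home team is in the AL, 0 otherwise
dh<-ifelse(bball$h_league=="AL",1,0)
# Attach the new variables to the data (homeleague is not attached)
bball<-cbind(bball,nhomeruns,nstrikeouts,netwalks,hits,yrsince2000,dh)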
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
qplot(v_hits,
data=bball,
geom = "histogram",
binwidth=1,
main="Number of Visiting Hits",
xlab = "Hits in Single Game",
ylab = "Frequency",
fill=I("gray"),
col=I("black"))
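Since `qplot()` has been deprecated as of ggplot2 3.4.0, the same histogram can also be written with `ggplot()`. The sketch below is only an illustration: it uses a simulated stand-in data frame (`toy`, drawn from a Poisson distribution with an assumed mean of 9) instead of the project's `bball` data, so the shape will not match the plot above exactly.

```r
library(ggplot2)

# Simulated stand-in for bball$v_hits (an assumption for illustration only)
set.seed(413)
toy <- data.frame(v_hits = rpois(500, lambda = 9))

# Same histogram settings as the qplot() call, written with ggplot()
p <- ggplot(toy, aes(x = v_hits)) +
  geom_histogram(binwidth = 1, fill = "gray", colour = "black") +
  labs(title = "Number of Visiting Hits",
       x = "Hits in Single Game",
       y = "Frequency")
print(p)
```

Replacing `toy` with `bball` should give the project's version of the plot.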
We assumed that this response followed a normal distribution. Here are the two models we fit:
library(lme4)
## Warning: package 'lme4' was built under R version 3.6.3
## Loading required package: Matrix
m3<-lm(v_hits~dh,data = bball)
summary(m3)
##
## Call:
## lm(formula = v_hits ~ dh, data = bball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1127 -2.8718 -0.1127 2.1282 19.8873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.87180 0.02420 366.632 < 2e-16 ***
## dh 0.24087 0.03513 6.857 7.14e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.565 on 41296 degrees of freedom
## Multiple R-squared: 0.001137, Adjusted R-squared: 0.001113
## F-statistic: 47.02 on 1 and 41296 DF, p-value: 7.137e-12
mm3<-lmer(v_hits~dh+(1|v_name),data = bball)
summary(mm3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: v_hits ~ dh + (1 | v_name)
## Data: bball
##
## REML criterion at convergence: 222123.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6009 -0.7492 -0.0473 0.5875 5.5954
##
## Random effects:
## Groups Name Variance Std.Dev.
## v_name (Intercept) 0.04104 0.2026
## Residual 12.67025 3.5595
## Number of obs: 41298, groups: v_name, 32
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 8.91458 0.04597 193.922
## dh 0.13612 0.04886 2.786
##
## Correlation of Fixed Effects:
## (Intr)
## dh -0.489
In the simple linear model, we can see that the average number of hits for visiting teams in games without a designated hitter is 8.87180. Games with a designated hitter average 0.24087 more visiting hits.
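To attach some uncertainty to that 0.24087 estimate, a rough 95% Wald interval can be built from the estimate and standard error printed in the summary. This is a sketch using the rounded values from the output, with a normal quantile standing in for the t quantile (a reasonable assumption with 41,296 residual degrees of freedom). With the fitted object available, `confint(m3, "dh")` would be the more direct route.

```r
# Estimate and standard error for dh, typed in from summary(m3)
est <- 0.24087
se <- 0.03513

# Normal-approximation 95% interval: estimate +/- z * SE
z <- qnorm(0.975)
ci <- est + c(-1, 1) * z * se
round(ci, 3)
## [1] 0.172 0.310
```

The interval excludes 0, which agrees with the small p-value reported for `dh`.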
For the mixed model, we incorporate a random intercept for the visiting team. It is interesting to note how the estimated number of hits added in games with a designated hitter decreases (from 0.24087 to 0.13612) once we allow games played by the same visiting team to be correlated. This suggests that a substantial share of the hits added when a designated hitter is present has to do with the team doing the hitting. Below is a likelihood ratio test between these two models:
test.stat<-2*(as.numeric(logLik(mm3))-as.numeric(logLik(m3)))
pchisq(test.stat,df=1,lower.tail = FALSE)
## [1] 1.885586e-14
The p-value is below the 0.05 significance level, so we reject the null hypothesis that the variance component for the visiting team is 0. Thus, adding this random effect improves our model.
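One caveat about this kind of comparison: `lmer()` fits by REML by default, and a REML log-likelihood is not directly comparable to the maximum-likelihood value that `logLik()` returns for an `lm` fit. A more careful version refits the mixed model with `REML = FALSE`. In addition, the null value of a variance (0) lies on the boundary of the parameter space, so a common adjustment is to halve the chi-square p-value. The sketch below shows both steps on simulated data; the data frame `toy` and all of its parameter values are assumptions for illustration, not the project data.

```r
library(lme4)

# Simulated stand-in data: 30 visiting teams, 100 games each (assumed values)
set.seed(413)
n_teams <- 30
toy <- data.frame(
  v_name = factor(rep(seq_len(n_teams), each = 100)),
  dh = rbinom(n_teams * 100, 1, 0.5)
)
team_effect <- rnorm(n_teams, sd = 0.5)
toy$v_hits <- 9 + 0.2 * toy$dh +
  team_effect[as.integer(toy$v_name)] +
  rnorm(nrow(toy), sd = 3.5)

# Fit both models by maximum likelihood so the log-likelihoods are comparable
fit_lm <- lm(v_hits ~ dh, data = toy)
fit_ml <- lmer(v_hits ~ dh + (1 | v_name), data = toy, REML = FALSE)

# Likelihood ratio statistic for the random intercept
lrt <- 2 * (as.numeric(logLik(fit_ml)) - as.numeric(logLik(fit_lm)))

# Naive chi-square(1) p-value, and the boundary-adjusted (halved) version
p_naive <- pchisq(lrt, df = 1, lower.tail = FALSE)
p_boundary <- 0.5 * p_naive
c(naive = p_naive, boundary = p_boundary)
```

With a p-value as small as the one reported above, neither adjustment is likely to change the conclusion, but the ML refit keeps the two log-likelihoods on the same footing.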
Using the mixed model, we can create two approximate 95% intervals for the team-level mean number of hits, based on the estimated standard deviation of the visiting-team intercept (0.2026).
8.91458+c(-1,1)*2*.2026 #without DH
## [1] 8.50938 9.31978
8.91458+.13612+c(-1,1)*(2*.2026) #with DH
## [1] 8.6455 9.4559
For games without a designated hitter, about 95% of teams are expected to average between 8.51 and 9.32 hits. For games with a designated hitter, about 95% of teams are expected to average between 8.65 and 9.46 hits. These intervals are close in value, but that does not make the difference meaningless. In baseball, all it takes is one hit to completely change the outcome of a game.
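The between-team standard deviation that sets the width of these intervals can also be expressed as an intraclass correlation, the share of total variance in visiting hits that sits at the team level. Here is a minimal calculation from the variance components reported in the `lmer` summary (values typed in by hand, not extracted from the fitted object):

```r
# Variance components from the Random effects table of summary(mm3)
var_team <- 0.04104    # v_name (Intercept)
var_resid <- 12.67025  # Residual

# Intraclass correlation: team-level variance over total variance
icc <- var_team / (var_team + var_resid)
round(icc, 4)
## [1] 0.0032
```

By this measure, roughly 0.3% of the game-to-game variation in visiting hits is attributable to the team, which is consistent with the two intervals overlapping almost entirely.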