
Pitching Analysis with the ‘Lahman’ package

Hypotheses
- H0 = The number of wins per season cannot be accurately predicted by ERA.
- Ha = The number of wins per season can be accurately predicted by ERA.

Variables
- Win = In each game a team wins, one pitcher is credited with “the win” based on certain scoring criteria. The W column counts these wins for a pitcher over a season.
- ERA = Earned run average: the average number of earned runs a pitcher gives up to the opposition per nine innings pitched. Runs scored via fielding errors are excluded.
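
To make the ERA definition concrete, here is a minimal sketch of the standard formula: nine times earned runs, divided by innings pitched. The helper name `era_from_outs` is my own, and I am assuming innings are supplied as outs recorded (three per inning), which is how the `IPouts` column in `Pitching` stores them. The numbers in the example are made up.

```r
# Hypothetical helper (not part of Lahman): ERA from earned runs and outs recorded
era_from_outs <- function(earned_runs, outs) {
  innings <- outs / 3          # three outs per inning
  9 * earned_runs / innings    # earned runs allowed per nine innings
}

# Made-up illustration: 50 earned runs over 200 innings (600 outs)
era_from_outs(50, 600)  # 2.25
```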

Correlation

library('Lahman')
cor.test(Pitching$ERA,Pitching$W, conf.level=0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  Pitching$ERA and Pitching$W
## t = -48.578, df = 48303, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  -0.2269676 -0.2046199
## sample estimates:
##       cor 
## -0.215822

There is a weak negative correlation between ERA and wins (r = -0.216; 99% confidence interval -0.227 to -0.205). As ERA goes up, wins tend to decrease.
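
As a quick check on the strength of this relationship, the reported t statistic and the share of variance explained can both be recovered from the correlation coefficient alone. This sketch only uses numbers copied from the output above; the variable names are mine.

```r
# Values copied from the cor.test output
r  <- -0.215822
df <- 48303

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of the variance in wins associated with ERA
r_squared <- r^2

round(t_stat, 3)     # -48.578, matching the test output
round(r_squared, 4)  # 0.0466
```

The large t statistic comes mostly from the sample size: with more than 48,000 rows, even a correlation this small is far from zero in the statistical sense.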

Regression

plot(Pitching$ERA,Pitching$W)
pitch<-lm(Pitching$W~Pitching$ERA)
summary(pitch)

## 
## Call:
## lm(formula = Pitching$W ~ Pitching$ERA)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.705 -3.844 -2.093  2.194 54.603 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.70491    0.03464  164.70   <2e-16 ***
## Pitching$ERA -0.22296    0.00459  -48.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.575 on 48303 degrees of freedom
##   (94 observations deleted due to missingness)
## Multiple R-squared:  0.04658,    Adjusted R-squared:  0.04656 
## F-statistic:  2360 on 1 and 48303 DF,  p-value: < 2.2e-16

abline(pitch, col="red", lwd=5)

The regression analysis shows a slope of -0.22296: each one-point increase in ERA is associated with about 0.22 fewer wins. An increase in ERA of around 4.5 therefore lowers a pitcher’s predicted season win total by roughly 1.
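
To put the slope in concrete terms, the sketch below plugs the fitted coefficients from the summary above into the regression line. The function name `predicted_wins` and the example ERA values of 3 and 8 are my own choices for illustration.

```r
# Coefficients copied from summary(pitch)
intercept <- 5.70491
slope     <- -0.22296

# Hypothetical helper: fitted season wins for a given ERA
predicted_wins <- function(era) intercept + slope * era

# Fitted wins at two illustrative ERA values (about 5.04 and 3.92)
predicted_wins(c(3, 8))

# ERA increase associated with one fewer fitted win (about 4.49)
-1 / slope
```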

summary(Pitching$W)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    2.00    4.55    7.00   60.00

summary(pitch$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -5.705  -3.844  -2.093   0.000   2.194  54.603

hist(pitch$residuals, breaks=50)

Conclusions

The median number of wins in the data set is 2, and the median residual is -2.093, so at least half of the residuals are larger in magnitude than the median number of wins. Given this, plus a low R-squared value of 0.04658 (ERA accounts for under 5% of the variation in wins), I am wary of rejecting my null hypothesis H0 based on this simple linear regression model.
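
One way to put a number on this comparison is to compute the share of residuals whose absolute value exceeds the median win total. The sketch below is an illustration under assumptions rather than part of the analysis above: the helper `share_exceeding` is my own name, rows with a non-finite ERA are dropped as a precaution, and the result will depend on the installed version of 'Lahman', so I do not quote a figure.

```r
# Hypothetical helper: fraction of residuals larger in magnitude than a threshold
share_exceeding <- function(residuals, threshold) {
  mean(abs(residuals) > threshold)
}

# Run only if the Lahman package is installed
if (requireNamespace("Lahman", quietly = TRUE)) {
  # Keep rows with usable ERA and W values (precaution against NA or Inf)
  dat <- subset(Lahman::Pitching, is.finite(ERA) & is.finite(W))
  fit <- lm(W ~ ERA, data = dat)

  # Share of residuals exceeding the median season win total
  print(share_exceeding(residuals(fit), median(dat$W)))
}
```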