Baseball: Simple Linear Regression

Tanner Norton

# Load your libraries
library(car)
library(tidyverse)
library(mosaic)
library(DT)
library(readr)
library(pander)
library(sandwich)
library(lmtest)

# Load the data from a csv file saved in the Data folder.
CBrankings <- read.csv("CBrankings.csv", header=TRUE)


lm_ERACB <- lm(Win_perc ~ ERA, data = CBrankings)

Background

Information

Baseball is a game of statistics. Over the course of a Division I college baseball regular season, each of the league's 297 teams plays around 54 games. As in most team sports, the goal is to outscore the opponent. To do this, players must reach base via a hit, walk, error, or hit by pitch, and then have teammates do the same to advance them to home plate and score. On defense, pitchers and fielders try to record three outs without allowing runs, by strikeout, groundout, popout, or pickoff, among other ways.

While several factors go into a baseball team winning, the old phrase “Pitching wins championships” singles out pitching as the most influential of them all. There are many pitching statistics, but this analysis focuses on “Earned Run Average,” or ERA. MLB.com explains that “Earned run average represents the number of earned runs a pitcher allows per nine innings – with earned runs being any runs that scored without the aid of an error or a passed ball.” Thus a low ERA is good because it indicates that a team is giving up fewer runs to its opponents. For individual pitchers an ERA below 3.0 is considered elite, but for teams a sub-4.0 ERA is good. The data collected is the 2019 team ERA for all 297 Division I college teams. The question to answer is: do teams with lower ERAs have higher win percentages?

\[ \underbrace{Y_i}_\text{Winning Percentage} = \overbrace{\beta_0}^\text{y-int} + {\beta_1} \underbrace{X_i}_\text{ERA} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

The hypothesis test is as follows. The null hypothesis is that ERA has no effect on a team's winning percentage; the alternative is that ERA does have an effect. The test uses a significance level of \(\alpha = 0.01\).

\[ H_0: \ \ \beta_1 = 0 \] \[ H_a: \ \ \beta_1 \ {\ne} \ 0 \]
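
As a brief sketch of how these hypotheses are evaluated: the slope is tested with the standard t statistic for simple linear regression. The symbols \(b_1\) (the estimated slope) and \(s_{b_1}\) (its standard error) are notation introduced here for illustration; with \(n = 297\) teams the reference distribution has \(n - 2 = 295\) degrees of freedom.

\[ t = \frac{b_1 - 0}{s_{b_1}} \sim t_{n-2} \quad \text{under} \ H_0, \qquad \text{reject} \ H_0 \ \text{if} \ p\text{-value} < \alpha = 0.01 \]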

Data Set

The data below was gathered from the NCAA’s website and edited into its current form. I combined each team's winning percentage with its corresponding ERA. The figures are team totals accumulated over the ongoing 2019 season.

datatable(CBrankings, options=list(lengthMenu = c(10,50)),  style = "default")

Analysis

Regression Plot

plot(Win_perc ~ ERA, col = Super_regional, pch=19, cex=1, data = CBrankings, xlab = "Team ERA", ylab = "Win %",
     main = "Pitching for Wins")

abline(lm_ERACB, col="blue", lwd=1)

The plot above shows the fitted regression line over the data. The red points are the 16 teams that reached the super regionals, the final round of playoffs before the College World Series, which runs this June 15th - 26th; the black points are teams that did not reach the super regionals. Almost all of the playoff teams had a sub-4.0 ERA, with the exception of a few. In this data an ERA of 4.0 roughly separates the playoff teams from the rest, while the fitted line itself does not drop below a .500 winning percentage until an ERA of about 5.1, so teams above that mark tend to lose more games than they win.

Show Validations

The diagnostic plots below check whether the data meet the five assumptions of regression analysis. The residuals vs. fitted values plot suggests that the relationship is linear and that the variance is constant (that is, the data is homoskedastic). To test this further I perform the Breusch-Pagan test below. The plot of the residuals in row order suggests that the error terms are independent. Finally, the Q-Q plot suggests that the residuals are approximately normally distributed.

Breusch-Pagan test

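Here \(\beta_1\) denotes the slope in the auxiliary regression of the squared residuals on ERA, not the slope of the winning percentage model.

\[ H_0: \beta_1 = 0 \quad \text{(homoskedastic: the variance is constant)} \]

\[ H_a: \beta_1 \neq 0 \quad \text{(heteroskedastic: the variance is not constant)} \]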

pander(bptest(lm_ERACB))

studentized Breusch-Pagan test: `lm_ERACB`
Test statistic	df	P value
0.009125	1	0.9239

With a p-value of 0.9239, the Breusch-Pagan test gives insufficient evidence to reject the null hypothesis, so there is no indication that the variance is non-constant. Together with the diagnostic plots below, this supports treating all five assumptions as met.
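
The reported p-value can be checked by hand. The sketch below uses only the statistic and degrees of freedom printed in the table above and compares the statistic with a chi-square distribution. The significance level `alpha` is an assumption: the report does not state a level for this test, so the 0.01 level of the main test is reused here.

```r
# Values copied from the studentized Breusch-Pagan output above
bp_stat <- 0.009125   # test statistic
bp_df   <- 1          # one regressor (ERA) in the auxiliary regression
alpha   <- 0.01       # assumption: same level as the main hypothesis test

# Upper-tail chi-square probability gives the p-value
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Critical value the statistic would have to exceed to reject H0
bp_crit <- qchisq(1 - alpha, df = bp_df)
reject_h0 <- bp_stat > bp_crit

round(bp_p, 4)
reject_h0
```

The computed p-value agrees with the 0.9239 in the table, and the statistic falls far short of the critical value, so the null hypothesis of constant variance is not rejected.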

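plot(lm_ERACB, which = 1)       # Residuals vs. fitted: checks linearity and constant variance

plot(lm_ERACB$residuals)        # Residuals in row order: checks independence

qqPlot(lm_ERACB)                # Q-Q plot (car): checks normality of the residuals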

## [1] 133 220

Regression Output

pander(summary(lm_ERACB))       # Model was already fit when the data was loaded

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.9404	0.02155	43.65	9.719e-131
ERA	-0.0867	0.004081	-21.24	2.092e-61

Fitting linear model: Win_perc ~ ERA
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
297	0.08716	0.6047	0.6034

The \(R^2\) of 0.6047 means that 60.47% of the variation in winning percentage is explained by the regression on ERA (the adjusted \(R^2\) is 0.6034). Both the y-intercept and the ERA coefficient are significant well beyond the 0.01 level. For every 1-unit increase in ERA, a team's winning percentage drops by 0.0867, or 8.67 percentage points.
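
The figures in the output can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded values printed in the tables above, so results may differ from the full-precision model in the last digit. The ERA values in `era_grid` are hypothetical inputs chosen for illustration, not teams from the data.

```r
# Values copied from the regression output above
b0    <- 0.9404      # intercept
b1    <- -0.0867     # ERA slope
se_b1 <- 0.004081    # standard error of the slope
n     <- 297         # observations
r2    <- 0.6047      # R-squared

t_b1   <- b1 / se_b1                          # slope t statistic
t_crit <- qt(1 - 0.01 / 2, df = n - 2)        # two-sided critical value, alpha = 0.01
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)    # adjusted R-squared, one predictor

# Fitted winning percentage at a few hypothetical ERA values
era_grid <- c(3, 4, 5)
win_hat  <- b0 + b1 * era_grid

# ERA at which the fitted line crosses a .500 winning percentage
era_500 <- (0.5 - b0) / b1

round(c(t = t_b1, t_crit = t_crit, adj_r2 = adj_r2), 4)
round(setNames(win_hat, era_grid), 4)
round(era_500, 2)
```

The slope t statistic and adjusted \(R^2\) computed this way match the table to rounding, and the magnitude of the t statistic is far beyond the critical value, which is why the p-value for ERA is so small.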

Interpretation

Because all five requirements for regression analysis were met and the results were significant, the model can be treated as a reasonable description of the relationship. The purpose of this analysis was to answer the question “do teams with lower ERAs have higher win percentages?” ERA does in fact have a strong association with a team's winning percentage and can help indicate whether or not a team will make the playoffs. A low ERA is associated with a higher winning percentage and a high ERA with a lower one. As mentioned at the beginning, many factors play into a team winning a game, and the number of earned runs it gives up is not the only one. This model should therefore be built upon with other offensive, defensive, and pitching statistics. Doing so would give a more complete model with greater predictive capability.

References

http://web1.ncaa.org/stats/StatsSrv/rankings?doWhat=archive&sportCode=MBA

http://m.mlb.com/glossary/standard-stats/earned-run-average