Baseball Analysis: Was 2017 too Home Run Happy?

Giancarlo Stanton of the Miami Marlins, who hit 59 Home Runs in 2017.

Problem Background

Baseball is America’s pastime, and statistics have been an integral part of the game since the 19th century. Over time, the statistical analysis of baseball has grown to cover nearly every facet of the game. A recent source of controversy is the 2017 season, in which 6105 home runs were hit, the largest number of any season up to that point. Many fans, players, and analysts were alarmed by this total, and many came to believe that it was too abnormal to be chance and must be caused by a change in some inherent aspect of the game. Pitchers in particular have noted that the baseballs themselves feel different than in the past and are harder to grip, making it easier for batters to take advantage of pitches and hit them out of the park. Therefore, our group wants to apply statistical procedures to determine whether this record-setting home run year was a simple anomaly or a symptom of a larger change within the game.

Hypothesis/Prediction

For our analysis project, we will be using a hypothesis test to determine whether the true mean HR/Game for 2017 has changed compared to the historical average for HR/Game.
Null Hypothesis: Mean is the Historical Average. Alternate Hypothesis: Mean is greater than the Historical Average.
\(H_0: \mu = \mu_0\) and \(H_1: \mu > \mu_0\), where \(\mu_0\) is the historical average for HR/G.
Our group’s prediction was that we would reject our null hypothesis in favor of the alternate hypothesis that the Mean HR/Game for 2017 is greater than the historical average.

Methodology

The data used in this project were collected from Baseball-Reference.com, a widely used website for baseball statistics, where we found CSV files containing hitting data for every year of the sport.
Within these CSV files, we focused on the columns relevant to our test (year, total home runs, and games in a season) and computed the HR/Game for each season from them.
For the historical average for HR/Game, although baseball stats have been tracked since its inception in 1871, we are starting with the year 1920, since this is widely regarded as the beginning of the ‘modern era’ of baseball. Prior to 1920, the game had a much different structure and flow, so home runs were hit at a much lower rate.
Lastly, we put the HR/Game column for 2017 and for the years 1920-2016 into separate text files so that they could be directly interpreted by R.
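
The HR/Game computation and the text-file export described above could be sketched as follows. This is a minimal illustration, not necessarily the script used for this project: the helper name `hr_per_game` and the temporary file are assumptions, and only the 2017 league total is shown, using the 6105 home runs cited earlier and 4860 team-games (30 teams playing 162 games each).

```r
# Hypothetical helper: home runs per team-game for one season.
hr_per_game <- function(home_runs, team_games) {
  home_runs / team_games
}

# 2017 league totals: 6105 HR over 30 teams x 162 games = 4860 team-games.
hr_g_2017 <- hr_per_game(6105, 30 * 162)

# Write one value per line with no header, the layout read.table() expects,
# then read it back the same way the analysis code does.
out_file <- tempfile(fileext = ".txt")   # stand-in for a file like 'mlb17hrg.txt'
write.table(hr_g_2017, out_file, row.names = FALSE, col.names = FALSE)
round_trip <- read.table(out_file)[, 1]

print(round(hr_g_2017, 6))
```

The 2017 file used in the t-test holds 30 values (df = 29 in its output), presumably one per team. The league-wide figure computed here agrees with the mean of those values, 1.256173.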

Part 1: Analyzing Historical Precedent for Home Runs

Determining Normality of the Dataset

Based on the histogram of HR/Game from 1920-2016 and the accompanying QQPlot, the data appears to be roughly normally distributed, with slightly longer tails than a normal distribution would have, so a t-distribution would perhaps be most accurate.
Overall, this analysis gave us a historical average of .7630 HR/G. We will then apply a hypothesis test comparing it to our 2017 data.
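
Plots like the histogram and QQPlot described above can be produced with base R. The sketch below is an assumption about how this might be done, not the project's actual plotting code: `normality_plots` is a hypothetical helper, the Shapiro-Wilk test is an extra check that the analysis above does not use, and the simulated vector is only a placeholder for the real `mlb0hrg.txt` values.

```r
# Hypothetical helper: histogram and normal Q-Q plot side by side,
# plus a Shapiro-Wilk test as an additional numeric check.
normality_plots <- function(x, label) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  hist(x, main = paste("Histogram of", label), xlab = "HR/Game")
  qqnorm(x, main = paste("Normal Q-Q plot of", label))
  qqline(x)
  shapiro.test(x)
}

# Placeholder data only: 97 simulated seasons standing in for 1920-2016.
# With the real file this would be: x <- read.table('mlb0hrg.txt')[, 1]
set.seed(1)
x <- rnorm(97, mean = 0.76, sd = 0.2)
result <- normality_plots(x, "HR/Game, 1920-2016")
```

Points that bend away from the `qqline` reference line at either end of the Q-Q plot are the "longer tails" mentioned above.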

Part 2: Analyzing Home Run Data from 2017

Based on the histogram of HR/Game for 2017 and its accompanying QQPlot, it is difficult to tell whether the data is normally distributed because of a left skew. Because the data departs from a normal distribution in the tails, we felt that a t-distribution would fit the data best.

Part 3: Comparing These Sets of Data

Testing using 1-Pop T Test

Based on the datasets available to us, the most reasonable test to apply is the 1-Pop T-Test (one-sample t-test), in which our sample is the 2017 data and the historical average serves as the hypothesized mean.
As indicated earlier, our H0 is \(\mu = .7630\), and our H1 is \(\mu > .7630\).

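# Historical HR/Game values (1920-2016), one per line
v0 = read.table('mlb0hrg.txt')
m0 = v0[,1]
# 2017 HR/Game values
v17 = read.table('mlb17hrg.txt')
m17 = v17[,1]
# One-sided, one-sample t-test of the 2017 mean against the historical mean
t.test(m17, alternative = 'greater', mu = mean(m0), conf.level = .95)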

## 
##  One Sample t-test
## 
## data:  m17
## t = 15.808, df = 29, p-value = 4.305e-16
## alternative hypothesis: true mean is greater than 0.7630291
## 95 percent confidence interval:
##  1.203168      Inf
## sample estimates:
## mean of x 
##  1.256173

Our P-Value was extremely small, \(4.305 \times 10^{-16}\), which is far smaller than our \(\alpha\) value of .05.
Additionally, using the confidence interval approach, the one-sided 95% confidence interval constructed from the data was (1.203168, \(\infty\)), which does not contain the historical mean of .7630.
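
For readers who want to see where these numbers come from, the one-sample t statistic is \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\). The sketch below works backward from the values printed in the output above (it does not re-read the data files), so the variable names are illustrative and the results agree with `t.test` only up to rounding.

```r
# Values copied from the t.test output above.
x_bar  <- 1.256173    # sample mean of the 2017 HR/Game values
mu_0   <- 0.7630291   # historical mean (1920-2016)
t_stat <- 15.808      # reported t statistic
n      <- 30          # df = 29, so 30 observations

# Standard error implied by t = (x_bar - mu_0) / se.
se <- (x_bar - mu_0) / t_stat

# One-sided p-value: P(T >= t_stat) under H0 with n - 1 degrees of freedom.
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval.
lower_bound <- x_bar - qt(0.95, df = n - 1) * se

print(c(se = se, p_value = p_value, lower_bound = lower_bound))
```

The lower bound recovered this way should land very close to the 1.203168 reported by `t.test`.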

Further Analysis: Regression Model over Time

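# Historical HR/Game values (1920-2016)
v0 = read.table('mlb0hrg.txt')
m0 = v0[,1]
# 2017 HR/Game values, kept for comparison with the regression prediction
v17 = read.table('mlb17hrg.txt')
m17 = v17[,1]
# One year per historical observation
year = seq(1920,2016)
# Simple linear regression of HR/Game on year
model = lm(m0~year)
summary(model)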

## 
## Call:
## lm(formula = m0 ~ year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.241337 -0.100692  0.002098  0.086136  0.244672 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.268e+01  8.727e-01  -14.53   <2e-16 ***
## year         6.829e-03  4.434e-04   15.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1223 on 95 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.711 
## F-statistic: 237.2 on 1 and 95 DF,  p-value: < 2.2e-16

Based on this linear regression, there appears to be a statistically significant positive relationship between HR/Game and the year played.
The \(R^2\) value was .7141 (adjusted \(R^2\) of .711), which means that about 71.4% of the variation in HR/G can be explained by the line of best fit of HR/G over time.
The linear formula found by this regression line is (HR/G) = 0.00683(Year) - 12.68.
Based on this formula, the line would predict 1.096 HR/G for 2017, which is still less than the observed 2017 mean of 1.256 HR/G, further supporting the result of our hypothesis test.
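
The prediction above can be reproduced from the rounded coefficients. This is a sketch under that assumption: because the intercept is rounded to two decimals, the result is approximate, and with the fitted `model` object available, `predict(model, newdata = data.frame(year = 2017))` would give the unrounded value. The comparison against the residual standard error is an extra step that is not part of the original analysis.

```r
# Rounded coefficients from the regression formula above.
slope     <- 0.00683
intercept <- -12.68

# Predicted HR/Game for 2017 on the fitted line.
predicted_2017 <- intercept + slope * 2017

# Observed 2017 mean from the t-test output, and the gap between the two.
observed_2017 <- 1.256173
gap <- observed_2017 - predicted_2017

# Express the gap in units of the residual standard error (0.1223).
gap_in_rse <- gap / 0.1223

print(round(c(predicted = predicted_2017, gap = gap, gap_in_rse = gap_in_rse), 3))
```

Scaling the gap by the residual standard error gives a rough sense of how far 2017 sits above the long-run trend relative to the usual year-to-year scatter around the line.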

Part 4: Conclusion

Final Statement Regarding Null Hypothesis

Based on the results of our hypothesis test, the p-value was extremely small (\(4.305 \times 10^{-16}\)), nearly 0. Because of this, we reject our null hypothesis at the 5% significance level in favor of our alternate hypothesis. Thus, we conclude that the true mean HR/Game in 2017 was greater than the historical average of .763, with 2017’s sample average of 1.256 as our best estimate of the 2017 mean HR/Game.

Practical Interpretation

The results of our hypothesis test can be interpreted in the context of our initial problem statement. Since our test indicated that the mean HR/G in 2017 is greater than the historical average, we can then hypothesize about the reasons behind this drastic change. Possible reasons include a difference in the type of balls used this season or a change in hitters’ strategy toward a more power-oriented approach at the plate.

Limitations of Project / Ideas for Future Testing

Our project did have some limitations when it came to collecting data. We initially wanted to use an SQL database to test whether the mean home runs per player in 2017 is statistically different from the historical average, but it would have required more coding, and we were unsure whether this would have produced results that we could interpret.
Another limitation is that our analysis cannot determine the cause of the increase in home runs; this would require a more thorough investigation of MLB.
One key idea for future testing is to incorporate MLB Statcast into our analysis of hitting data. Statcast was introduced in 2015 and uses radar data to track the game in a far more advanced manner.

Statcast leaderboard for the 2017 season

Example of Statcast applications.

With Statcast, two relevant metrics that we could analyze are exit velocity (speed of the ball off the bat) and launch angle (upward angle of the ball’s trajectory). These metrics are related to the number of home runs hit, and with more years of Statcast data available in the future, we could analyze them in relation to the increase in home runs hit.
A home run is typically hit with an ideal combination of exit velocity and launch angle, so these metrics could be analyzed to help explain the rapid increase in home runs.
Overall, we were pleased with the result of our analysis, and this provides a good starting point for further analysis of the topic.