1. Abstract

This project investigates the relationship between the length of baseball games and the other variables in the given data set to provide insight into what determines how long a game lasts. We calculate the correlation of each variable with game time, fit a simple linear regression on the strongest predictor, and interpret the resulting regression equation. We then test whether that predictor is significant in the population of all games and examine the residual plots to assess the linearity, constant variance, and normality assumptions of our model. We complete all data analysis and visualization in R.

2. Introduction

A common complaint among baseball fans is how long the games take. This project uses R to ask which recorded factors are most strongly associated with the length of a baseball game, to test the significance of the strongest one, and to determine how well that predictor fits a simple linear regression model.

3. Methodology

Data Source: Stat2: Models for a World of Data. Tools: R programming language, base packages. Data Visualization: correlation matrices and residual plots created with base R.

4. Results

The purpose of this analysis was to find and analyze the strongest predictor of the length of baseball games (in minutes) given the following four numeric variables: runs (total number scored between the two teams), margin of victory, pitchers (total number of pitchers used by both teams), and attendance (total number of spectators at the game).

##      Game League Runs Margin Pitchers Attendance Time
## 1 CLE-DET     AL   14      6        6      38774  168
## 2 CHI-BAL     AL   11      5        5      15398  164
## 3 BOS-NYY     AL   10      4       11      55058  202
## 4 TOR-TAM     AL    8      4       10      13478  172
## 5  TEX-KC     AL    3      1        4      17004  151
## 6 OAK-LAA     AL    6      4        4      37431  133

First, we calculated the correlations of each predictor variable against the length of the game (time), as shown below.

##        Runs      Margin    Pitchers  Attendance 
##  0.68131437 -0.07135831  0.89430821  0.25719248

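The closer a correlation value is to 1.0, the stronger the positive linear relationship; the closer it is to -1.0, the stronger the negative one. Here “Pitchers” has the highest correlation with time (about 0.89), followed by “Runs” (about 0.68), while “Attendance” (about 0.26) and “Margin” (about -0.07) are only weakly related to it. We therefore base the rest of our analysis on “Pitchers.”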
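
To make the correlation value concrete, the following minimal sketch computes Pearson's r from its definition and compares it with R's built-in cor(). It uses only the six games printed in the head() output above, so its result is expected to differ from the full-data value reported for “Pitchers”; the names games, r_manual, and r_builtin are illustrative only.

```r
# First six games, copied from the head() output shown earlier.
# Assumption: this is only a subset, so r here will not equal the reported value.
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

x <- games$Pitchers
y <- games$Time

# Pearson correlation from its definition:
# sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error, which is a quick way to confirm what cor() is reporting.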

To determine the regression equation, we must first find the coefficients that will make up the intercept and slope of our equation.

## (Intercept)    Pitchers 
##    94.84325    10.71017

Next, we write the regression equation for predicting time from “Pitchers” using the fitted least-squares coefficients above, where x is the total number of pitchers used in the game.

## [1] "Time = 94.84 + 10.71 * x"
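
As a rough illustration of where the intercept and slope come from, the sketch below computes the least-squares estimates in closed form and checks them against lm(). It again uses only the six games from the head() output, so the numbers will not match the reported coefficients; slope_manual and intercept_manual are hypothetical names introduced for this example.

```r
# First six games from the head() output (a subset of the full data).
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

x <- games$Pitchers
y <- games$Time

# Closed-form least-squares estimates for simple linear regression:
# slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)
slope_manual     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept_manual <- mean(y) - slope_manual * mean(x)

# The same fit through lm()
fit <- lm(Time ~ Pitchers, data = games)

print(c(intercept = intercept_manual, slope = slope_manual))
print(coef(fit))
```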

We estimate the slope coefficient to be 10.71. This means that for each additional pitcher used in a game, the predicted game length increases by approximately 10.71 minutes. This is a substantial effect, considering that the number of pitchers used per game in our data ranges from 4 to 16.
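
To see the size of this effect, the short sketch below plugs a few pitcher counts into the reported equation. The coefficients are the rounded values from the output above; predict_time is a hypothetical helper written for this example, not part of the original analysis.

```r
# Rounded coefficients as reported in the regression output above.
intercept <- 94.84
slope     <- 10.71

# Hypothetical helper: predicted game length in minutes.
predict_time <- function(pitchers) intercept + slope * pitchers

# Low end, middle, and high end of the pitcher range stated in the text.
pitchers  <- c(4, 10, 16)
predicted <- predict_time(pitchers)

print(data.frame(Pitchers = pitchers, PredictedMinutes = predicted))
# Predicted minutes: 137.68, 201.94, 266.20

# Difference between the two ends of the range: 12 pitchers * 10.71 = 128.52 minutes
print(predicted[3] - predicted[1])
```

Under this equation, moving from the fewest to the most pitchers in the data corresponds to a predicted difference of a little over two hours.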

We must also consider whether the number of pitchers is related to time in the overall population of baseball games rather than just in this sample. We do this by calculating the p-value of the t-test on the slope coefficient. First, we establish a null hypothesis and an alternative hypothesis.

The null hypothesis is that the population slope for pitchers is zero, that is, there is no linear relationship between the number of pitchers and time. The alternative hypothesis is that the population slope is not zero. If p <= 0.05 we reject the null hypothesis, and if p > 0.05 we fail to reject it.

##  We reject the null hypothesis, the predictor variable is significantly correlated with time (p-value = 6.883851e-06 ).
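
For readers who want to see how this p-value arises, the sketch below reproduces the slope t-test by hand: the t statistic is the slope estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom. It uses only the six games from the head() output, so the p-value will not match the one reported above; the variable names are illustrative.

```r
# First six games from the head() output (a subset of the full data).
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit        <- lm(Time ~ Pitchers, data = games)
coef_table <- coef(summary(fit))

# t statistic for H0: slope = 0
slope_est <- coef_table["Pitchers", "Estimate"]
slope_se  <- coef_table["Pitchers", "Std. Error"]
t_stat    <- slope_est / slope_se

# Two-sided p-value with n - 2 residual degrees of freedom
df_resid <- df.residual(fit)
p_manual <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# Value that summary() reports for the same test
p_reported <- coef_table["Pitchers", "Pr(>|t|)"]

print(c(t = t_stat, df = df_resid, p_manual = p_manual, p_reported = p_reported))
```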

The final step in our analysis is to examine the residual plots for the linear model we fit to our data. We use a residuals vs. fitted values plot to check the linearity and constant variance of the residuals, and a Q-Q (quantile-quantile) plot to check their normality; together these two plots cover the main assumptions of the model. We do not use additional plots aimed at identifying outliers because, with such a small sample, the outliers are already apparent from looking at the data.

Although there are evidently some outliers, the residuals are spread relatively evenly about the horizontal axis at 0, which suggests that a linear model is a good fit for the data. There is also no funnel shape: the residuals do not fan out or contract as the fitted values increase. This indicates that the data meet the assumption of constant variance, which is a standard assumption in regression analysis.

The Q-Q plot assesses the normality of the residuals in a model. Since the residuals fall very close to the line of a theoretical normal distribution (with the exception of our already established outliers at observations 4, 13, and 15), we can conclude that the residuals are approximately normally distributed.
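
The two diagnostic plots can also be built by hand, which makes explicit what each one displays. The following is a minimal sketch on the six games from the head() output only, so the plots will look different from those for the full data. It assumes base graphics; the Q-Q plot uses standardized residuals, as plot() does for lm objects with which = 2.

```r
# First six games from the head() output (a subset of the full data).
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit <- lm(Time ~ Pitchers, data = games)

res      <- resid(fit)      # raw residuals: observed minus fitted
fit_vals <- fitted(fit)     # fitted game lengths in minutes
std_res  <- rstandard(fit)  # standardized residuals for the Q-Q plot

# Residuals vs. fitted: look for curvature (nonlinearity)
# or a funnel shape (non-constant variance).
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q: points near the reference line suggest
# approximately normal residuals.
qqnorm(std_res, main = "Normal Q-Q")
qqline(std_res)
```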

5. Conclusion

This project explored the correlation between the length of baseball games and a number of other variables. We found that the total number of pitchers used in a game by both teams is the variable most strongly correlated with the length of the game. After fitting a regression equation on the “Pitchers” variable and analyzing two types of residual plots, we determined that a linear model fits the data well.

6. References

Baseball Times Dataset [https://www.stat2.org/datasets/BaseballTimes.csv]

7. Appendices

7.1. Appendix A: Code for Data Analysis

7.1.1. Setup, Data Pruning, and Correlation Matrix
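# Load the data (assumes BaseballTimes.csv is in the working directory)
BaseballTimes <- read.csv('BaseballTimes.csv')

head(BaseballTimes)

# Keep only the numeric columns
numeric_vars <- sapply(BaseballTimes, is.numeric)

BaseballTimes_numeric <- BaseballTimes[, numeric_vars]

# Pairwise correlations between all numeric variables
correlation_matrix <- cor(BaseballTimes_numeric)

cor_with_time <- correlation_matrix["Time", ]

# Drop Time's correlation with itself, by name rather than by position
example_1 <- cor_with_time[names(cor_with_time) != "Time"]

print(example_1)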

7.2. Appendix B: Code for Regression Analysis

7.2.1. Model and Coefficients
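# Fit a simple linear regression of Time on Pitchers
lm_pitchers_time <- lm(Time ~ Pitchers, data = BaseballTimes)

# Named model_coefs to avoid masking the base function coefficients()
model_coefs <- coef(lm_pitchers_time)

print(model_coefs)

intercept <- model_coefs["(Intercept)"]

slope <- model_coefs["Pitchers"]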
7.2.2. Regression Equation
regression_equation <- paste("Time =", round(intercept, 2), "+", round(slope, 2), "* x")

print(regression_equation)
7.2.3. T-test
summary_lm_pitcher_time <- summary(lm_pitchers_time)

# p-value of the t-test on the slope (H0: slope = 0)
p_value <- coef(summary_lm_pitcher_time)["Pitchers", "Pr(>|t|)"]

# Reject at p <= 0.05, matching the decision rule stated in the Results section
if (p_value <= 0.05) {
  cat(" We reject the null hypothesis, the predictor variable is significantly correlated with time (p-value =", p_value, ").\n")
} else {
  cat("We fail to reject the null hypothesis, the predictor variable is not significantly correlated with time (p-value =", p_value, ").\n")
}

7.3. Appendix C: Code for Residual Plots

7.3.1. Residual Plots
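# Residuals vs. fitted values (linearity and constant variance)
plot(lm_pitchers_time, which = 1)
# Normal Q-Q plot (normality of the residuals)
plot(lm_pitchers_time, which = 2)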