A Statistical Study on Baseball Game Durations

1. Introduction

This project analyzes the factors affecting baseball game duration by identifying the strongest predictor through correlation analysis, developing a regression model, and evaluating its significance. Residual plots are examined to ensure the model meets the assumptions for accurate statistical inference.

2. Data Preparation

data = read.csv("https://www.stat2.org/datasets/BaseballTimes.csv")
names(data)

## [1] "Game"       "League"     "Runs"       "Margin"     "Pitchers"  
## [6] "Attendance" "Time"

numeric_data = data[sapply(data, is.numeric)]

3. Correlation

# numeric_data already holds only the numeric columns (see Data Preparation)

# Correlation matrix of all numeric variables
correlations = cor(numeric_data, use = "complete.obs")

# Extract correlations with Time
time_correlations = correlations[, "Time"]
time_correlations

##        Runs      Margin    Pitchers  Attendance        Time 
##  0.68131437 -0.07135831  0.89430821  0.25719248  1.00000000
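
One way to pick the strongest predictor from this output is to drop the trivial self-correlation of Time and rank the remaining variables by absolute correlation. The sketch below retypes the printed correlations into a named vector so that it runs on its own; the names `predictors`, `ranked`, and `best` are illustrative and not part of the original analysis.

```r
# Correlations with Time, copied from the printed output above
time_correlations <- c(Runs = 0.68131437, Margin = -0.07135831,
                       Pitchers = 0.89430821, Attendance = 0.25719248,
                       Time = 1.00000000)

# Drop Time itself, since its correlation with itself is always 1
predictors <- time_correlations[names(time_correlations) != "Time"]

# Rank by absolute value so strong negative correlations would also count
ranked <- predictors[order(abs(predictors), decreasing = TRUE)]
ranked

# Name of the variable most strongly correlated with Time
best <- names(ranked)[1]
best
```

Ranking on the absolute value matters in general, because a strongly negative correlation is as useful for prediction as a strongly positive one.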

4. Regression equation

# Fit linear model with 'Pitchers' as the predictor
model = lm(Time ~ Pitchers, data = data)

# summary of the regression model
summary(model)

## 
## Call:
## lm(formula = Time ~ Pitchers, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.945  -8.445  -3.104   9.751  50.794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.843     13.387   7.085 8.24e-06 ***
## Pitchers      10.710      1.486   7.206 6.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844 
## F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06
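
As a consistency check on this output: in a simple linear regression with one predictor, R-squared equals the squared correlation between predictor and response, and the F statistic equals the squared t statistic of the slope. The short sketch below verifies both, using values retyped from the output above; the variable names are illustrative.

```r
# Values copied from the correlation and regression output
r_pitchers_time <- 0.89430821   # correlation between Pitchers and Time
t_slope <- 7.206310             # t value for the Pitchers coefficient

# With a single predictor, R-squared is the squared correlation
r_squared <- r_pitchers_time^2
round(r_squared, 4)   # compare with Multiple R-squared: 0.7998

# With a single predictor, the F statistic is the squared slope t statistic
f_stat <- t_slope^2
round(f_stat, 2)      # compare with F-statistic: 51.93
```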

4.1 Best Predictor and Regression Equation

Based on the correlation analysis, Pitchers is the best single predictor of game time, with a correlation of 0.894; the next strongest is Runs at 0.681.

The regression equation is:

\[ \widehat{\text{Time}} = 94.843 + 10.710 \times \text{Pitchers} \]

Where:

Intercept: 94.843 minutes, the predicted game time at zero pitchers (an extrapolation with no practical meaning).
Slope: 10.710 minutes per pitcher.

The slope coefficient of 10.710 means that each additional pitcher is associated with an increase of about 10.7 minutes in predicted game time.
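
To illustrate how the fitted line is used, the sketch below turns the coefficients reported by summary(model) into a small prediction function. The function name `predict_time` and the choice of 8 pitchers are hypothetical, and predictions are only sensible for pitcher counts within the range observed in the data.

```r
# Coefficients copied from the regression output
intercept <- 94.84325
slope <- 10.71017

# Predicted game time in minutes for a given number of pitchers
predict_time <- function(pitchers) {
  intercept + slope * pitchers
}

# Hypothetical game with 8 pitchers
predict_time(8)   # about 180.5 minutes

# One extra pitcher changes the prediction by exactly the slope
predict_time(9) - predict_time(8)
```

With the fitted model object available, predict(model, newdata = data.frame(Pitchers = 8)) should return the same value up to rounding.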

5. Significance test

summary(model)$coefficients

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 94.84325  13.386998 7.084729 8.235046e-06
## Pitchers    10.71017   1.486221 7.206310 6.883851e-06
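
The Pitchers row can be reproduced by hand, which makes the test explicit: the t statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with 13 degrees of freedom. The sketch below uses values retyped from the table above; the 95% confidence interval is an addition for illustration and was not part of the original output.

```r
# Values copied from the coefficient table
estimate <- 10.71017   # slope for Pitchers
se <- 1.486221         # standard error of the slope
df <- 13               # residual degrees of freedom (15 games - 2 parameters)

# Test of H0: slope = 0 against a two-sided alternative
t_stat <- estimate / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
t_stat    # about 7.206, matching the table
p_value   # about 6.9e-06, matching the table

# 95% confidence interval for the slope (illustrative addition)
ci <- estimate + c(-1, 1) * qt(0.975, df = df) * se
ci        # roughly 7.5 to 13.9 minutes per pitcher
```

Because the p-value is far below 0.05, the slope differs significantly from zero, so the number of pitchers is a significant linear predictor of game time in this sample.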

6. Residual scatterplot

# Plot residuals against fitted values (ggplot2 must be loaded first)
library(ggplot2)
ggplot(data, aes(x = fitted(model), y = resid(model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals", title = "Residual plot") +
  theme_minimal()

6.1 Residuals Analysis

The residual plot shows the residuals scattered around zero with no discernible pattern. This suggests that:

Homoscedasticity: The spread of the residuals looks roughly constant across the fitted values, so there is no clear sign of heteroscedasticity (unequal residual variance).
Linearity: The residuals show no curvature or trend, which supports a linear relationship between Pitchers and Time.
Independence: The plot shows no systematic structure, which is consistent with independent residuals, although a residuals-versus-fitted plot cannot establish independence by itself; that depends mainly on how the games were sampled.

Overall, the residual plot gives no evidence against the assumptions required for inference, and the linear model appears to be a reasonable fit for the data. With only 15 games (13 residual degrees of freedom) and residuals ranging from about -38 to 51 minutes, this conclusion is tentative.
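
A visual reading of the plot can be backed by a few numeric checks. The sketch below is a rough illustration on simulated stand-in data, not the BaseballTimes values: the pitcher range 4 to 16, the seed, and all variable names are assumptions made only so that the example runs on its own. Least-squares residuals always average zero and are uncorrelated with the fitted values, so the informative signal in the plot is a change in spread or a curve, not the overall level.

```r
# Simulated stand-in data (NOT the real games), loosely based on the fitted model
set.seed(1)
sim <- data.frame(Pitchers = sample(4:16, 15, replace = TRUE))
sim$Time <- 94.84 + 10.71 * sim$Pitchers + rnorm(15, sd = 21.46)

sim_model <- lm(Time ~ Pitchers, data = sim)
res <- resid(sim_model)
fit <- fitted(sim_model)

# True by construction for any least-squares fit with an intercept
mean_res <- mean(res)        # numerically zero
cor_res_fit <- cor(res, fit) # numerically zero

# Rough heteroscedasticity check: does the residual spread grow with the fit?
# A value far from zero would hint at non-constant variance.
spread_trend <- cor(abs(res), fit)
spread_trend
```

Replacing `sim` and `sim_model` with the real `data` and `model` objects would apply the same checks to the actual analysis; with 15 observations, any such check has low power.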