This project analyzes the factors affecting baseball game duration by identifying the strongest predictor through correlation analysis, developing a regression model, and evaluating its significance. Residual plots are examined to ensure the model meets the assumptions for accurate statistical inference.
data = read.csv("https://www.stat2.org/datasets/BaseballTimes.csv")
names(data)
## [1] "Game" "League" "Runs" "Margin" "Pitchers"
## [6] "Attendance" "Time"
# Keep only the numeric columns
numeric_data = data[, sapply(data, is.numeric)]
# Compute the correlation matrix for all numeric variables
correlations = cor(numeric_data, use = "complete.obs")
# Extract correlations with Time
time_correlations = correlations[, "Time"]
time_correlations
## Runs Margin Pitchers Attendance Time
## 0.68131437 -0.07135831 0.89430821 0.25719248 1.00000000
# Fit linear model with 'Pitchers' as the predictor
model = lm(Time ~ Pitchers, data = data)
# Summarize the fitted regression model
summary(model)
##
## Call:
## lm(formula = Time ~ Pitchers, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.945 -8.445 -3.104 9.751 50.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.843 13.387 7.085 8.24e-06 ***
## Pitchers 10.710 1.486 7.206 6.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7844
## F-statistic: 51.93 on 1 and 13 DF, p-value: 6.884e-06
Based on the correlation analysis, Pitchers is the best
predictor of game time with a correlation of 0.894.
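The choice of predictor can also be read off programmatically. The sketch below re-enters the correlations printed above as a named vector, so it runs without downloading the data, drops Time itself, and picks the variable with the largest absolute correlation. The names `time_cor`, `predictors`, `best`, and `r_squared` are illustrative and not part of the original analysis.

```r
# Correlations with Time, copied from the printed output (illustrative re-entry)
time_cor <- c(Runs = 0.68131437, Margin = -0.07135831,
              Pitchers = 0.89430821, Attendance = 0.25719248,
              Time = 1.00000000)

# Drop the trivial correlation of Time with itself
predictors <- time_cor[names(time_cor) != "Time"]

# Rank by absolute correlation so strong negative relationships also count
best <- names(which.max(abs(predictors)))
best

# With a single predictor, R-squared equals the squared correlation
r_squared <- unname(predictors[best])^2
round(r_squared, 4)
```

The squared correlation should agree with the Multiple R-squared reported by summary(model), which is a quick consistency check between the two steps.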
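The regression equation is:
\[ \widehat{\text{Time}} = 94.843 + 10.710 \times \text{Pitchers} \]
Where:
Intercept: 94.843 is the predicted game time in minutes when Pitchers is zero. No game is played without pitchers, so this value anchors the line rather than describing a real game.
Slope: 10.710 means that each additional pitcher is associated with an increase of about 10.71 minutes in predicted game time.
The slope is statistically significant (t = 7.206 on 13 degrees of freedom, p = 6.88e-06), so the data give strong evidence of a linear relationship between Pitchers and Time.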
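As a worked example of the fitted line, the sketch below plugs pitcher counts into the equation using the coefficients reported by summary(model). The helper `predict_time` and the pitcher counts 8 and 9 are hypothetical, chosen only for illustration. With the fitted model object available, predict(model, newdata = data.frame(Pitchers = 8)) should give the same value up to rounding.

```r
# Coefficients copied from the regression output
intercept <- 94.84325
slope <- 10.71017

# Hypothetical helper: predicted game time in minutes for a number of pitchers
predict_time <- function(pitchers) intercept + slope * pitchers

predict_time(8)                    # predicted time for a game using 8 pitchers
predict_time(9) - predict_time(8)  # one extra pitcher adds exactly the slope
```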
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.84325 13.386998 7.084729 8.235046e-06
## Pitchers 10.71017 1.486221 7.206310 6.883851e-06
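The significance columns in this table can be reproduced by hand, which makes clear what is being tested. The sketch below recomputes the t statistic and two-sided p-value for the Pitchers slope from its estimate and standard error, using the 13 residual degrees of freedom reported by summary(model). It also adds a 95% confidence interval, which does not appear in the original output; confint(model) would be the usual way to obtain it from the fitted object. The variable names are illustrative.

```r
estimate  <- 10.71017   # slope for Pitchers, from the coefficient table
std_error <- 1.486221   # its standard error
df        <- 13         # residual degrees of freedom

# t statistic: how many standard errors the slope lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df)

# 95% confidence interval for the slope
margin <- qt(0.975, df) * std_error
ci <- c(lower = estimate - margin, upper = estimate + margin)

t_value
p_value
ci
```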
# Plot residuals against fitted values
library(ggplot2)
diagnostics = data.frame(fitted = fitted(model), residual = resid(model))
ggplot(diagnostics, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals", title = "Residual plot") +
  theme_minimal()
The residual plot shows the residuals scattered around zero without any discernible pattern. With only 15 games this evidence is suggestive rather than conclusive, but it points to the following:
Homoscedasticity: The spread of the residuals looks roughly constant across the fitted values, so there is no clear sign of heteroscedasticity (unequal variance of residuals).
Linearity: The residuals show no curvature or trend, which supports a linear relationship between Pitchers and Time and suggests that the linear model is a reasonable fit for the data.
Independence: The residuals show no systematic structure. A plot of residuals against fitted values cannot confirm independence on its own, since that depends mainly on how the games were sampled, but nothing here contradicts it. Independence matters for the validity of the t and F tests reported above.
Overall, the residual plot gives no reason to doubt the assumptions required for inference, and the model appears to be a good fit for the data. The largest residual (50.794 minutes, more than twice the residual standard error of 21.46) may deserve a closer look.
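A visual check can be backed up with a few numbers. The sketch below is one possible way to tabulate fitted values, raw residuals, and standardized residuals for any lm fit and to flag observations whose standardized residual exceeds 2 in absolute value. The helper name `residual_table` and the cutoff of 2 are assumptions (a common rule of thumb, not part of the original analysis), and the built-in `cars` dataset stands in for the baseball data so the example runs without a download.

```r
# Hypothetical helper: one row per observation with residual diagnostics
residual_table <- function(fit, cutoff = 2) {
  out <- data.frame(
    fitted = fitted(fit),
    residual = resid(fit),
    std_residual = rstandard(fit)  # residuals scaled by their estimated SD
  )
  out$flagged <- abs(out$std_residual) > cutoff
  out
}

# Stand-in data: the built-in cars dataset, not the baseball games
demo_fit <- lm(dist ~ speed, data = cars)
diag_tab <- residual_table(demo_fit)

# Observations that may deserve a closer look
diag_tab[diag_tab$flagged, ]
```

For the baseball model, calling residual_table(model) would apply the same check to the 15 games.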