Introduction

We analyze the BaseballTimes dataset (the file BaseballTimes.csv read in the Analysis section), which contains information on 15 Major League Baseball games. The goal of this project is to explore the relationship between game characteristics and the length of a game (Time) using correlation and simple linear regression.

  a. Calculate the correlations of each predictor variable with the length of a game (Time). Identify which variable is most strongly correlated with Time.

  b. Choose the one predictor variable that you consider to be the best predictor of time. Determine the regression equation for predicting time based on that predictor. Also, interpret the slope coefficient of this equation.

  c. Perform the appropriate significance test of whether this predictor is really correlated with time in the population.

  d. Analyze appropriate residual plots for this model, and comment on what they reveal about whether the conditions for inference appear to be met here.

Analysis

We will explore the questions above in detail.

baseball <- read.csv("https://www.stat2.org/datasets/BaseballTimes.csv")
head(baseball)
##      Game League Runs Margin Pitchers Attendance Time
## 1 CLE-DET     AL   14      6        6      38774  168
## 2 CHI-BAL     AL   11      5        5      15398  164
## 3 BOS-NYY     AL   10      4       11      55058  202
## 4 TOR-TAM     AL    8      4       10      13478  172
## 5  TEX-KC     AL    3      1        4      17004  151
## 6 OAK-LAA     AL    6      4        4      37431  133

Question a.

Calculate the correlations of each predictor variable with the length of a game (Time). Identify which variable is most strongly correlated with Time.

cor(baseball$Time, baseball$Runs)
## [1] 0.6813144
cor(baseball$Time, baseball$Margin)
## [1] -0.07135831
cor(baseball$Time, baseball$Pitchers)
## [1] 0.8943082
cor(baseball$Time, baseball$Attendance)
## [1] 0.2571925

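The correlations of Time with each predictor are 0.681 for Runs, -0.071 for Margin, 0.894 for Pitchers, and 0.257 for Attendance. Pitchers is the variable most strongly correlated with Time (r ≈ 0.89), indicating that games using more pitchers tend to last longer. Runs shows a moderate positive correlation, Attendance a weak one, and Margin essentially none.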
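The separate cor() calls can also be condensed into one ranked comparison. The following is a minimal sketch, assuming a stand-in data frame named games (a hypothetical name) that holds only the six rows printed by head(baseball) above. Its correlations will therefore not match the full-data values reported in this section. With the full data, baseball would be used in place of games.

```r
# Stand-in data: ONLY the six games printed by head(baseball), not the full 15
games <- data.frame(
  Runs       = c(14, 11, 10, 8, 3, 6),
  Margin     = c(6, 5, 4, 4, 1, 4),
  Pitchers   = c(6, 5, 11, 10, 4, 4),
  Attendance = c(38774, 15398, 55058, 13478, 17004, 37431),
  Time       = c(168, 164, 202, 172, 151, 133)
)

# Every numeric column except the response is treated as a predictor
predictors <- setdiff(names(games), "Time")

# Correlation of each predictor with Time, returned as a named vector
cors <- sapply(games[predictors], function(x) cor(x, games$Time))

# Rank by absolute value so strong negative correlations are not overlooked
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Ranking by absolute value matters because "most strongly correlated" refers to the size of the correlation, not its sign.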

Question b.

Choose the one predictor variable that you consider to be the best predictor of time. Determine the regression equation for predicting time based on that predictor. Also, interpret the slope coefficient of this equation.

model <- lm(Time ~ Pitchers, data = baseball)
summary(model)
## 
## Call:
## lm(formula = Time ~ Pitchers, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.945  -8.445  -3.104   9.751  50.794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.843     13.387   7.085 8.24e-06 ***
## Pitchers      10.710      1.486   7.206 6.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844 
## F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06

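A simple linear regression model was fit with the number of pitchers as the predictor, since it has the strongest correlation with Time. The fitted regression equation is

predicted Time = 94.843 + 10.710 × Pitchers

The slope indicates that each additional pitcher used in a game is associated with an increase of about 10.71 minutes in predicted game length, on average. The model explains about 80% of the variation in game time (R-squared = 0.7998).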
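To make the interpretation of the coefficients concrete, the sketch below applies the fitted equation by hand. It assumes the rounded coefficients from the summary(model) output, so results may differ slightly from what the unrounded model would give. The helper predict_time and the 8-pitcher game are hypothetical, introduced only for illustration.

```r
# Coefficients copied (rounded) from the summary(model) output
b0 <- 94.843   # intercept, in minutes
b1 <- 10.710   # slope, in minutes per additional pitcher

# Hypothetical helper: predicted game length for a given number of pitchers
predict_time <- function(pitchers) b0 + b1 * pitchers

# Prediction for a hypothetical game that uses 8 pitchers
pred_8 <- predict_time(8)

# Residual (observed minus predicted) for the BOS-NYY game shown by head():
# 11 pitchers, 202 minutes
resid_bos_nyy <- 202 - predict_time(11)

# The slope is the change in prediction between games one pitcher apart
slope_check <- predict_time(9) - predict_time(8)

print(c(pred_8 = pred_8, resid_bos_nyy = resid_bos_nyy, slope_check = slope_check))
```

With these rounded coefficients, a game with 8 pitchers is predicted to last about 180.5 minutes, and the BOS-NYY game ran about 10.7 minutes shorter than predicted. With the fitted object available, predict(model, newdata = data.frame(Pitchers = 8)) gives the prediction from the unrounded coefficients.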

Question c.

Perform the appropriate significance test of whether this predictor is really correlated with time in the population.

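To determine whether the number of pitchers used is linearly related to game time in the population, we test H0: β1 = 0 against Ha: β1 ≠ 0, where β1 is the population slope. In simple linear regression this test is equivalent to testing whether the population correlation is zero. From the regression output, the test statistic is t = 7.206 with 13 degrees of freedom, and the p-value is approximately 6.88×10^-6, which is far less than 0.05. Therefore, we reject the null hypothesis and conclude that there is strong statistical evidence that the number of pitchers used is associated with the length of a baseball game.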
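The test can be reproduced by hand from quantities already reported in this document. The sketch below assumes the rounded slope, standard error, and correlation shown in the earlier output, so the slope-based statistic agrees with the reported t value only up to rounding. It computes the t statistic two ways (from the slope and from the correlation) to show that they coincide.

```r
# Quantities copied from the output reported earlier in this document
n     <- 15          # number of games
df    <- n - 2       # residual degrees of freedom (13)
b1    <- 10.710      # estimated slope (rounded)
se_b1 <- 1.486       # standard error of the slope (rounded)
r     <- 0.8943082   # correlation between Time and Pitchers

# t statistic from the slope: estimate divided by its standard error
t_slope <- b1 / se_b1

# t statistic from the correlation: r * sqrt(n - 2) / sqrt(1 - r^2)
t_corr <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_corr), df)

print(c(t_slope = t_slope, t_corr = t_corr, p_value = p_value))
```

Both statistics come out near 7.21, matching the regression output. With the full data frame loaded, cor.test(baseball$Time, baseball$Pitchers) performs the correlation version of this test directly.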

Question d.

Analyze appropriate residual plots for this model, and comment on what they reveal about whether the conditions for inference appear to be met here.

par(mfrow = c(2,2))
plot(model)

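The residuals-versus-fitted and scale-location plots show no clear patterns, suggesting that the assumptions of linearity and constant variance are reasonable. The normal Q-Q plot indicates that the residuals are approximately normally distributed, and no influential outliers are evident. With only 15 games, however, these plots can reveal only gross violations, and the largest residual (about 51 minutes, roughly 2.4 residual standard errors) deserves a closer look. Overall, the conditions for regression inference appear to be reasonably met.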
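The visual checks can be supplemented with the numbers behind the plots. The sketch below is an illustration only: it assumes a stand-in data frame named games (a hypothetical name) holding just the six rows printed by head(baseball), so its diagnostics do not describe the 15-game model. The cutoffs used for flagging are common rules of thumb, not formal tests.

```r
# Stand-in data: ONLY the six games printed by head(baseball), not the full 15
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

# Same model form as in the report, fit to the stand-in rows
fit <- lm(Time ~ Pitchers, data = games)

# Numeric versions of what plot(model) displays
diagnostics <- data.frame(
  fitted       = fitted(fit),          # x-axis of the residuals-vs-fitted plot
  residual     = residuals(fit),       # raw residuals
  std_residual = rstandard(fit),       # standardized residuals (Q-Q, scale-location)
  cooks_d      = cooks.distance(fit)   # influence measure (residuals-vs-leverage)
)
print(round(diagnostics, 3))

# Rule-of-thumb flags (assumed cutoffs): |standardized residual| > 2 or Cook's D > 1
flagged <- which(abs(diagnostics$std_residual) > 2 | diagnostics$cooks_d > 1)
print(flagged)
```

Applying the same calls to the report's model object (for example rstandard(model) and cooks.distance(model)) would show which of the 15 games, if any, exceed these cutoffs.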

Appendix

The following R code was used for data analysis and visualization for the different questions in this report.

baseball <- read.csv("https://www.stat2.org/datasets/BaseballTimes.csv")
head(baseball)

cor(baseball$Time, baseball$Runs)
cor(baseball$Time, baseball$Margin)
cor(baseball$Time, baseball$Pitchers)
cor(baseball$Time, baseball$Attendance)

model <- lm(Time ~ Pitchers, data = baseball)
summary(model)

par(mfrow = c(2,2))
plot(model)