library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("BaseballTimes.csv")
Rows: 15 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Game, League
dbl (5): Runs, Margin, Pitchers, Attendance, Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset contains 15 Major League Baseball games and 7 variables, 5 of which are numeric. One of the numeric variables is the response, game length (Time, measured in minutes); the other four (Runs, Margin, Pitchers, Attendance) are potential predictors of it.
Correlation with Time
numeric_data <- select(data, where(is.numeric))  # assumed definition: the numeric columns of data
cor(numeric_data)[, "Time"]
      Runs      Margin    Pitchers  Attendance        Time
0.68131437 -0.07135831  0.89430821  0.25719248  1.00000000
The number of pitchers used in a game has the strongest correlation with game length (r = 0.894): games that use more pitchers tend to last longer.
The number of runs scored has a moderate positive correlation with game length (r = 0.681), meaning games with more runs tend to be somewhat longer. Attendance shows only a weak positive correlation (r = 0.257), and the margin of victory has almost no linear relationship with how long a game lasts (r = -0.071).
Fit the regression model
model <- lm(Time ~ Pitchers, data = data)
summary(model)
Call:
lm(formula = Time ~ Pitchers, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-37.945  -8.445  -3.104   9.751  50.794

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844
F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06
The estimated regression equation is:
Time = 94.84 + 10.71 × Pitchers
Interpretation of the slope
The slope of 10.71 means that for each additional pitcher used in a game, the expected length of the game increases by about 10.7 minutes, on average.
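To make the fitted equation concrete, the following sketch plugs a pitcher count into it. The coefficients are copied from the regression output above; the helper name `predict_time` and the input values of 8 and 9 pitchers are hypothetical choices for illustration, and predictions outside the range of pitcher counts seen in the 15 games should be treated with caution.

```r
# Coefficients copied from the summary(model) output
intercept <- 94.843
slope     <- 10.710

# Hypothetical helper: predicted game length in minutes for a given pitcher count
predict_time <- function(pitchers) {
  intercept + slope * pitchers
}

predict_time(8)                    # about 180.5 minutes
predict_time(9) - predict_time(8)  # the slope: about 10.7 minutes per extra pitcher
```

With the fitted `model` object in the session, `predict(model, newdata = data.frame(Pitchers = 8))` should give the same point prediction up to rounding of the coefficients.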
Model strength
The model explains about 80% of the variation in game length (R² = 0.7998, adjusted R² = 0.7844), indicating that the number of pitchers is a strong predictor of game length in this sample of 15 games.
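In a simple linear regression with one predictor, R² equals the square of the correlation between the predictor and the response. The sketch below checks this against the reported values, using only numbers taken from the output above (the correlation, the sample size of 15, and one predictor); the variable names are illustrative.

```r
r <- 0.89430821  # cor(Pitchers, Time) from the correlation output
n <- 15          # number of games
p <- 1           # number of predictors in the model

r_squared     <- r^2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
# r_squared is 0.7998 and adj_r_squared is 0.7844, matching summary(model)
```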
Significance Test
Null hypothesis (H₀): β₁ = 0; there is no linear relationship between Pitchers and Time
Alternative hypothesis (H₁): β₁ ≠ 0; there is a linear relationship between Pitchers and Time
The test statistic for the slope is t = 7.206 on 13 degrees of freedom, with a p-value of 6.88 × 10⁻⁶, which is far less than 0.05.
Therefore, we reject the null hypothesis and conclude that there is a statistically significant linear relationship between the number of pitchers used in a game and game length.
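The slope test can be reproduced by hand from the coefficient table. This is a minimal sketch using the rounded estimate and standard error from the output, so the results agree with `summary(model)` only up to rounding. The 95% confidence interval at the end is an addition that does not appear in the original output.

```r
estimate  <- 10.710  # slope for Pitchers
std_error <- 1.486   # its standard error
df        <- 13      # residual degrees of freedom (n - 2)

t_stat  <- estimate / std_error        # about 7.21
p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value, on the order of 1e-06
f_stat  <- t_stat^2                    # about 51.9; with one predictor, F equals t squared

# 95% confidence interval for the slope (not part of the original output)
ci <- estimate + c(-1, 1) * qt(0.975, df) * std_error
ci                                     # roughly 7.5 to 13.9 minutes per pitcher
```

With the fitted `model` object available, `confint(model)` would give the interval computed from the unrounded values.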
Residual Analysis
Residuals vs Fitted
plot(model, which = 1)
The residuals are randomly scattered around zero, showing that a linear model is appropriate.
Normal Q–Q Plot
plot(model, which = 2)
The points follow the straight line closely, indicating the residuals are approximately normal.
Cook’s Distance
plot(model, which = 4)
There are no highly influential observations, so no single data point unduly affects the model.
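The visual check can be complemented with a numeric one. The sketch below flags observations whose Cook's distance exceeds a cutoff. The helper `flag_influential` is hypothetical, the default cutoff of 4/n is a common rule of thumb rather than a formal test, and R's built-in `cars` dataset stands in for the baseball data, which is not reproduced here.

```r
# Hypothetical helper: return the Cook's distances that exceed a cutoff.
# The default 4/n is a widely used heuristic, not a strict threshold.
flag_influential <- function(fit, cutoff = 4 / nobs(fit)) {
  d <- cooks.distance(fit)
  d[d > cutoff]
}

# Stand-in model on a built-in dataset (NOT the baseball data)
demo_fit <- lm(dist ~ speed, data = cars)
flag_influential(demo_fit)
```

Calling `flag_influential(model)` on the baseball regression would list any games above the cutoff; an empty result would be consistent with the reading of the plot above.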
Residuals vs Leverage
plot(model, which = 5)
This plot likewise shows no highly influential points, so no single observation strongly affects the model.
Conclusion
There is a strong positive relationship between the number of pitchers used and the length of a baseball game. Each additional pitcher is associated with an increase of about 10.7 minutes in expected game time, and the model explains roughly 80% of the variation in game length. The relationship is statistically significant (p = 6.88 × 10⁻⁶), and the diagnostic plots indicate that the regression assumptions are reasonably satisfied, although with only 15 games these conclusions rest on a small sample.