project

Author

Sandeep

Published

January 31, 2026

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("BaseballTimes.csv")
Rows: 15 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Game, League
dbl (5): Runs, Margin, Pitchers, Attendance, Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
numeric_data <- data %>%
  select(where(is.numeric))
colnames(data)
[1] "Game"       "League"     "Runs"       "Margin"     "Pitchers"  
[6] "Attendance" "Time"      

Structure and correlation matrix

str(numeric_data)
tibble [15 × 5] (S3: tbl_df/tbl/data.frame)
 $ Runs      : num [1:15] 14 11 10 8 3 6 5 23 3 19 ...
 $ Margin    : num [1:15] 6 5 4 4 1 4 1 5 1 1 ...
 $ Pitchers  : num [1:15] 6 5 11 10 4 4 5 14 6 12 ...
 $ Attendance: num [1:15] 38774 15398 55058 13478 17004 ...
 $ Time      : num [1:15] 168 164 202 172 151 133 151 239 156 211 ...
cor(numeric_data)
                  Runs      Margin   Pitchers  Attendance        Time
Runs        1.00000000  0.29801292 0.76564555 -0.01293088  0.68131437
Margin      0.29801292  1.00000000 0.09224695  0.26491543 -0.07135831
Pitchers    0.76564555  0.09224695 1.00000000  0.18192049  0.89430821
Attendance -0.01293088  0.26491543 0.18192049  1.00000000  0.25719248
Time        0.68131437 -0.07135831 0.89430821  0.25719248  1.00000000

The dataset contains 15 Major League Baseball games and 5 numeric variables. Four of them (Runs, Margin, Pitchers, and Attendance) are potential predictors of the fifth, game length (Time, measured in minutes).

Correlation with Time

cor(numeric_data)[, "Time"]
       Runs      Margin    Pitchers  Attendance        Time 
 0.68131437 -0.07135831  0.89430821  0.25719248  1.00000000 

The number of pitchers used in a game has the strongest correlation with game length (r = 0.894), a strong positive linear relationship: games that use more pitchers tend to take longer.

The number of runs scored has a moderate positive relationship with game length (r = 0.68), so higher-scoring games tend to run longer. Attendance shows only a weak relationship (r = 0.26), and the margin of victory has almost none (r = -0.07).
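As a quick check on that ordering, the correlations with Time can be ranked by absolute value. This is a small sketch that retypes the values printed above instead of recomputing them from the data; the names `r_time` and `ranked` are made up for illustration.

```r
# Correlations with Time, copied from the cor() output above
r_time <- c(Runs = 0.68131437, Margin = -0.07135831,
            Pitchers = 0.89430821, Attendance = 0.25719248)

# Rank the predictors by strength of linear association, ignoring sign
ranked <- r_time[order(abs(r_time), decreasing = TRUE)]
round(ranked, 3)
```

Sorting on the absolute value matters here because a strong negative correlation would be just as informative as a strong positive one.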

Fit the regression model

model <- lm(Time ~ Pitchers, data = data)
summary(model)

Call:
lm(formula = Time ~ Pitchers, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.945  -8.445  -3.104   9.751  50.794 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared:  0.7998,    Adjusted R-squared:  0.7844 
F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06

The estimated regression equation is:

Time = 94.84 + 10.71 × Pitchers
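To see how the equation is used, the sketch below plugs in the first game in the data, which used 6 pitchers and lasted 168 minutes according to the str() output above. The coefficients are retyped from the summary output, and `predict_time` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the summary(model) output above
b0 <- 94.843   # intercept, in minutes
b1 <- 10.710   # slope, in minutes per pitcher

# Hypothetical helper: fitted game length for a given number of pitchers
predict_time <- function(pitchers) b0 + b1 * pitchers

fitted_6 <- predict_time(6)    # the first game in the data used 6 pitchers
residual_1 <- 168 - fitted_6   # that game actually lasted 168 minutes
c(fitted = fitted_6, residual = residual_1)
```

With the fitted model in the session, `predict(model, newdata = data.frame(Pitchers = 6))` should give essentially the same fitted value, up to rounding of the coefficients.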

Interpretation of the slope

The slope of 10.71 means that each additional pitcher used in a game is associated with an increase of about 10.7 minutes in the expected length of the game.
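The uncertainty around that slope can be gauged with a 95% confidence interval. The sketch below rebuilds it by hand from the estimate, standard error, and residual degrees of freedom in the summary output; the variable names are illustrative only.

```r
# Values copied from the summary(model) output above
est <- 10.710   # slope estimate (minutes per pitcher)
se  <- 1.486    # standard error of the slope
df  <- 13       # residual degrees of freedom

# 95% interval: estimate +/- critical t value times the standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 2)
```

The interval runs from roughly 7.5 to 13.9 minutes per pitcher. With the fitted model available, `confint(model)` is the more direct way to get the same interval.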

Model strength

The model explains about 80% of the variation in game length (R² = 0.80, adjusted R² = 0.78), indicating that the number of pitchers is a strong predictor of how long a game lasts.
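In a regression with a single predictor, R² is the square of the correlation between predictor and response, so the figure can be checked against the correlation reported earlier. The sketch below does that and also applies the standard adjusted R² formula; the variable names are illustrative.

```r
# Correlation between Pitchers and Time, copied from the cor() output
r <- 0.89430821
n <- 15   # number of games
p <- 1    # number of predictors in the model

# With one predictor, R-squared equals the squared correlation
r_squared <- r^2

# Adjusted R-squared penalises for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
```

Both values agree with the summary output (0.7998 and 0.7844).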

Significance Test

Null hypothesis (H₀): There is no linear relationship between Pitchers and Time (the population slope is zero)

Alternative hypothesis (H₁): There is a linear relationship between Pitchers and Time (the population slope is not zero)

The p-value for the slope is 6.88 × 10⁻⁶, which is far less than 0.05.

Therefore, we reject the null hypothesis and conclude that there is statistically significant evidence of a linear relationship between the number of pitchers used in a game and game length.
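The test statistic behind this conclusion can be reproduced from the coefficient table. The sketch below recomputes the t statistic, its two-sided p-value, and the F statistic from the rounded estimate and standard error, so the results match the summary output only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(model) output above
est <- 10.710   # slope estimate
se  <- 1.486    # standard error of the slope
df  <- 13       # residual degrees of freedom

# t statistic for H0: slope = 0
t_stat <- est / se

# Two-sided p-value from the t distribution with 13 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# With one predictor, the overall F statistic is the squared t statistic
f_stat <- t_stat^2

c(t = t_stat, p = p_value, F = f_stat)
```

Because the model has a single predictor, the slope t-test and the overall F-test are the same test, which is why both report the same p-value in the summary.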

Residual Analysis

Residuals vs Fitted

plot(model, which = 1)

The residuals are scattered around zero with no clear pattern, suggesting that a linear model is appropriate.

Normal Q–Q Plot

plot(model, which = 2)

The points follow the straight line closely, indicating the residuals are approximately normal.

Cook’s Distance

plot(model, which = 4)

There are no highly influential observations, so no single data point unduly affects the model.

Residuals vs Leverage

plot(model, which = 5)

The residuals vs leverage plot agrees: no observation combines high leverage with a large residual, so none strongly affects the fitted line.

Conclusion

There is a strong positive relationship between the number of pitchers used and the length of a baseball game. Each additional pitcher is associated with an increase of about 10.7 minutes in expected game time, and the model explains roughly 80% of the variation in game length. The relationship is statistically significant (p = 6.88 × 10⁻⁶), and the diagnostic plots indicate that the regression assumptions are reasonably satisfied.