library(tidyverse)
library(ggplot2)
# load the csv file
baseball_data <- read.csv(file = "https://www.stat2.org/datasets/BaseballTimes.csv",
                          header = TRUE, sep = ",")
Simple Linear Regression Project 1
Introduction
In this project we analyze the Baseball Times 2017 dataset using simple linear regression, with game time (in minutes) as the response variable.
Part (A): Analyzing Correlation and Time
The dataset contains 7 variables, 5 of which are numeric. We compute the correlation between game time (Time) and each of the other four numeric variables to identify an appropriate predictor for simple linear regression.
Column Data
glimpse(baseball_data)
Rows: 15
Columns: 7
$ Game <chr> "CLE-DET", "CHI-BAL", "BOS-NYY", "TOR-TAM", "TEX-KC", "OAK-…
$ League <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "NL", "NL", "NL",…
$ Runs <int> 14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15
$ Margin <int> 6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1
$ Pitchers <int> 6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16
$ Attendance <int> 38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929, 261…
$ Time <int> 168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,…
# select only numeric variables
numeric_vars <- baseball_data %>%
  select(where(is.numeric))
# compute correlations with Time
correlations <- numeric_vars %>%
  select(-Time) %>%
  summarise(across(everything(),
                   ~ cor(.x, numeric_vars$Time, use = "complete.obs"))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Correlation") %>%
  arrange(desc(abs(Correlation)))
correlations
# A tibble: 4 × 2
  Variable   Correlation
  <chr>            <dbl>
1 Pitchers        0.894
2 Runs            0.681
3 Attendance      0.257
4 Margin         -0.0714
Interpretation
The table above displays the correlation between each numeric predictor and game time, sorted by absolute value. The variable with the strongest linear association with Time is Pitchers, with a correlation of 0.894 (0.89430821 before rounding). Pitchers is therefore used as the predictor of game length in the regression that follows.
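As a rough check on how strong this correlation is with only 15 games, the sketch below converts r into a t statistic and into the share of variation explained. It uses only the values reported above (r = 0.89430821, n = 15); the variable names are illustrative and not part of the original analysis.

```r
# Values taken from the correlation table and glimpse() output above
r <- 0.89430821   # correlation between Pitchers and Time
n <- 15           # number of games in the dataset

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# share of the variation in Time explained by a straight-line fit on Pitchers
r_squared <- r^2

round(c(t = t_stat, r_squared = r_squared), 3)
p_value
```

In simple linear regression the t statistic for the correlation equals the t statistic for the slope, and r squared equals the Multiple R-squared, so these numbers should line up with the t value and R-squared in the summary(model) output in Part (B).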
Part (B): Simple Linear Regression Model
# fit simple linear regression
model <- lm(Time ~ Pitchers, data = baseball_data)
summary(model)
Call:
lm(formula = Time ~ Pitchers, data = baseball_data)
Residuals:
    Min      1Q  Median      3Q     Max
-37.945  -8.445  -3.104   9.751  50.794
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7844
F-statistic: 51.93 on 1 and 13 DF, p-value: 6.884e-06
Regression Equation
The fitted regression equation is:
\[ \widehat{\text{Time}} = 94.843 + 10.710(\text{Pitchers}) \]
where:
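- 94.843 is the estimated intercept, the predicted game time in minutes when Pitchers is 0. No game is played with zero pitchers, so the intercept has no practical interpretation on its own.
- 10.710 is the estimated slope.
The slope indicates that each additional pitcher used in a game is associated with an increase of about 10.71 minutes in the predicted length of the game, on average.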
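To make the equation concrete, the sketch below plugs a few pitcher counts into the fitted line. The coefficients are copied from the regression output; the helper name predict_time and the example counts (4, 8, 12) are illustrative choices that fall inside the observed range of 4 to 16 pitchers.

```r
# Coefficients copied from the summary(model) output
b0 <- 94.843   # intercept (minutes)
b1 <- 10.710   # slope (minutes per additional pitcher)

# hypothetical helper: predicted game time for a given number of pitchers
predict_time <- function(pitchers) {
  b0 + b1 * pitchers
}

# predicted game times (minutes) for 4, 8, and 12 pitchers
predict_time(c(4, 8, 12))

# With the fitted model object available, the equivalent call would be:
# predict(model, newdata = data.frame(Pitchers = c(4, 8, 12)))
```

Because the coefficients are rounded to three decimals, results from predict() on the model object may differ slightly in the last digits.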
Part (C): Significance Test for the Predictor Variable
To determine whether the number of pitchers used in a game is significantly associated with game time in the population, a hypothesis test on the slope coefficient was performed using the simple linear regression model.
The hypotheses are:
Null Hypothesis: \[ H_0: \beta_1 = 0 \]
Alternative Hypothesis: \[ H_a: \beta_1 \neq 0 \]
From the regression output, the slope coefficient for Pitchers has a test statistic of t = 7.206 on 13 degrees of freedom, with a p-value of 6.88e-06.
Because the p-value is far below the significance level α = 0.05, we reject the null hypothesis. This provides strong evidence of a positive linear association between the number of pitchers used in a game and the length of the game in the population. The test establishes association only; it does not show that using more pitchers causes longer games.
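The test can be reproduced by hand from the slope estimate and its standard error, and the same quantities give a confidence interval for the slope. The following is a minimal sketch using the rounded values from the regression output, so the last digits may differ slightly from what R reports for the model itself.

```r
# Values copied from the Coefficients table in summary(model)
b1    <- 10.710   # slope estimate for Pitchers
se_b1 <- 1.486    # standard error of the slope
df    <- 13       # residual degrees of freedom (n - 2)
alpha <- 0.05     # significance level

# t statistic and two-sided p-value for H0: beta_1 = 0
t_stat  <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval for the slope
t_crit <- qt(1 - alpha / 2, df = df)
ci <- b1 + c(-1, 1) * t_crit * se_b1

c(t = t_stat, p = p_value)
ci

# With the fitted model object available, confint(model) gives the same interval.
```

The interval runs from roughly 7.5 to 13.9 minutes per additional pitcher. It excludes 0, which agrees with rejecting the null hypothesis at the 5% level.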
Part (D): Analyzing the Residual Plots
Residuals vs Fitted
plot(model, which = 1)
The residuals versus fitted values plot shows no strong pattern, suggesting that a linear relationship is reasonable. The spread of the residuals is mostly consistent, although there is slight variation at higher fitted values.
Q-Q Residuals
plot(model, which = 2)
The normal Q-Q plot shows that the residuals follow an approximately straight line, indicating that the normality assumption is reasonable.
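Scale-Location
plot(model, which = 3)
Ideally, points should be evenly distributed around the horizontal line. Here the points are not evenly distributed, which suggests the residual variance may not be constant. With only 15 games, this is a reason for caution rather than clear evidence of a violation.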
Cook’s distance
plot(model, which = 4)
The Cook’s distance plot identifies one observation with higher influence than the others, but it does not appear to overly affect the model.
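The visual check can be backed up by listing the Cook's distance values directly. The sketch below defines a hypothetical helper, flag_influential, that returns the observations above a cutoff. The default cutoff of 4/n is a common rule of thumb and an assumption here, not something stated in the original analysis. So that the block runs on its own, it is demonstrated on R's built-in cars data rather than on the baseball model.

```r
# hypothetical helper: return observations whose Cook's distance exceeds a cutoff
# (the 4/n default is a rule-of-thumb assumption, not a strict threshold)
flag_influential <- function(fit, cutoff = 4 / nobs(fit)) {
  d <- cooks.distance(fit)   # one Cook's distance per observation
  d[d > cutoff]              # keep only the values above the cutoff
}

# stand-in fit on the built-in cars data so the example is self-contained
demo_fit <- lm(dist ~ speed, data = cars)
flag_influential(demo_fit)

# For the model in this report, the call would be:
# flag_influential(model)
```

Any observation flagged this way is a candidate for a closer look, for example by refitting the model without it and comparing the slope.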
Overall, the regression conditions appear to be reasonably met, with some caution about constant variance given the Scale-Location plot, and the model is appropriate for inference.