Simple Linear Regression Project 1

Author

Sabirin Muuse

Introduction

In this project, we analyze the BaseballTimes dataset using simple linear regression, with game length (Time) as the response variable.

library(tidyverse)
library(ggplot2)

# load the csv file
baseball_data <- read.csv(file = "https://www.stat2.org/datasets/BaseballTimes.csv",
                          header = TRUE, sep = ",")

Part (A): Analyzing Correlation with Time

The dataset contains 15 games and 7 variables. We compute the correlation between game time and each numeric predictor to identify an appropriate variable for simple linear regression.

Column Data

glimpse(baseball_data)
Rows: 15
Columns: 7
$ Game       <chr> "CLE-DET", "CHI-BAL", "BOS-NYY", "TOR-TAM", "TEX-KC", "OAK-…
$ League     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "NL", "NL", "NL",…
$ Runs       <int> 14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15
$ Margin     <int> 6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1
$ Pitchers   <int> 6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16
$ Attendance <int> 38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929, 261…
$ Time       <int> 168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185,…
# select only numeric variables
numeric_vars <- baseball_data %>%
  select(where(is.numeric))

# compute correlations with Time
correlations <- numeric_vars %>%
  select(-Time) %>%
  summarise(across(everything(),
                   ~ cor(.x, numeric_vars$Time, use = "complete.obs"))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Correlation") %>%
  arrange(desc(abs(Correlation)))

correlations
# A tibble: 4 × 2
  Variable   Correlation
  <chr>            <dbl>
1 Pitchers        0.894 
2 Runs            0.681 
3 Attendance      0.257 
4 Margin         -0.0714

Interpretation

The table above displays the correlation between each numeric predictor and game time, sorted by absolute value. The variable with the strongest linear association with Time is Pitchers, with a correlation of 0.894. Pitchers is therefore used as the predictor of game length in the regression model below.


Part (B): Simple Linear Regression Model

# fit simple linear regression
model <- lm(Time ~ Pitchers, data = baseball_data)
summary(model)

Call:
lm(formula = Time ~ Pitchers, data = baseball_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.945  -8.445  -3.104   9.751  50.794 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   94.843     13.387   7.085 8.24e-06 ***
Pitchers      10.710      1.486   7.206 6.88e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.46 on 13 degrees of freedom
Multiple R-squared:  0.7998,    Adjusted R-squared:  0.7844 
F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06
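
As a consistency check on the output above: in simple linear regression, R-squared equals the square of the sample correlation between the response and the predictor. Using the rounded correlation for Pitchers reported in Part (A):

\[ R^2 = r^2 = (0.894)^2 \approx 0.799 \]

This agrees with the Multiple R-squared of 0.7998 up to rounding, so roughly 80% of the variation in game time is accounted for by the number of pitchers used.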

Regression Equation

The fitted regression equation is:

\[ \widehat{\text{Time}} = 94.843 + 10.710(\text{Pitchers}) \]

where:

  • 94.843 is the estimated intercept
  • 10.710 is the estimated slope

The slope coefficient indicates that for each additional pitcher used in a game, the predicted length of the game increases by approximately 10.71 minutes, on average.
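
To illustrate how the fitted equation is used, the sketch below plugs a pitcher count into the rounded coefficients from the regression output. The helper name `predict_time` and the example value of 8 pitchers are assumptions chosen for illustration. With the fitted `model` object available, `predict(model, newdata = data.frame(Pitchers = 8))` would give the unrounded equivalent.

```r
# Rounded coefficients copied from the regression output above
intercept <- 94.843
slope     <- 10.710

# Hypothetical helper: predicted game time (minutes) for a given pitcher count
predict_time <- function(pitchers) {
  intercept + slope * pitchers
}

# Illustrative value (an assumption): a game in which 8 pitchers are used
predicted_8 <- predict_time(8)
predicted_8

# Adding one pitcher changes the prediction by exactly the slope
predict_time(9) - predict_time(8)
```

The example value lies inside the observed range of Pitchers (4 to 16), so the prediction is an interpolation rather than an extrapolation.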

Part (C): Significance Test for the Predictor Variable

To determine whether the number of pitchers used in a game is significantly associated with game time in the population, a hypothesis test on the slope coefficient was performed using the simple linear regression model.

The hypotheses are:

Null Hypothesis:

\[ H_0: \beta_1 = 0 \]

Alternative Hypothesis:

\[ H_a: \beta_1 \neq 0 \]

From the regression output, the test statistic for the slope of Pitchers is t = 7.206 on 13 degrees of freedom, with a p-value of 6.88e-06.

Because the p-value is far below the significance level \(\alpha = 0.05\), we reject the null hypothesis. This provides strong evidence of a linear association between the number of pitchers used in a game and the length of the game in the population.
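
The test can be reproduced by hand from the coefficient table, which may make the reasoning easier to follow. The sketch below uses only the rounded estimate, standard error, and degrees of freedom printed in the regression output. The variable names are illustrative, and the confidence interval is an addition not reported in the output; `confint(model)` would compute it from the unrounded fit.

```r
# Values copied from the regression output above
estimate  <- 10.710   # slope estimate for Pitchers
std_error <- 1.486    # standard error of the slope
df_resid  <- 13       # residual degrees of freedom (n - 2 = 15 - 2)

# Test statistic for H0: beta_1 = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope
t_crit   <- qt(0.975, df = df_resid)
ci_slope <- estimate + c(-1, 1) * t_crit * std_error

t_stat
p_value
ci_slope
```

Because the inputs are rounded, the results match the printed t value and p-value only approximately. The interval runs from roughly 7.5 to 13.9 minutes per additional pitcher and excludes zero, which is consistent with rejecting the null hypothesis.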

Part (D): Analyzing the Residual Plots

Residuals vs Fitted

plot(model, which = 1)

The residuals versus fitted values plot shows no strong pattern, suggesting that a linear relationship is reasonable. The spread of the residuals is mostly consistent, although there is slight variation at higher fitted values.

Q-Q Residuals

plot(model, which = 2)  

The normal Q–Q plot shows that the residuals follow an approximately straight line, indicating that the normality assumption is reasonable.

Scale-Location

plot(model, which = 3)  

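Ideally, the points in the scale-location plot are spread evenly around a roughly horizontal line. Here the points are not evenly distributed, which suggests that the residual variance may not be constant across the fitted values.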

Cook’s distance

plot(model, which = 4)

The Cook’s distance plot identifies one observation with higher influence than the others, but it does not appear to unduly affect the fitted model.

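Overall, the linearity and normality conditions appear reasonable, while the scale-location plot raises some concern about constant variance. The model is adequate for inference, although with only 15 games the conclusions should be treated with some caution.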