Section 1: Introduction

The aim of this project was to find the predictor variable most strongly correlated with the length of a baseball game (Time). Correlations with time were calculated for each predictor variable (runs, margin, pitchers, and attendance), and pitchers had the strongest correlation. A linear regression model was then fitted to predict game duration from the pitchers variable, and its intercept and slope coefficients were used to write the regression equation. The slope coefficient was also interpreted in the context of the data. Next, a significance test was performed by examining the p-value to determine whether the association between pitchers and time is statistically significant. Lastly, diagnostic plots of the model were examined to check whether the conditions for inference are met, which bears on the reliability of the model.

Section 2: Correlations of Predictor Variables with Time

In the data set, the predictor variables are the runs, margin, pitchers, and attendance columns. For each of these columns we can calculate the correlation with time; the correlation with the largest absolute value (closest to 1 or -1) indicates the strongest linear relationship. The correlation of each predictor variable with time is shown below.

## Correlation between runs and time: 0.6813144
## Correlation between margin and time: -0.07135831
## Correlation between pitchers and time: 0.8943082
## Correlation between attendance and time 0.2571925

The strongest correlation is the one with the largest absolute value. As shown above, that is the pitchers variable, which has a correlation of 0.8943082.
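The ranking by absolute value can also be done in one step. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in because BaseballTimes.csv is not reproduced here, and the helper name rank_correlations is an assumption made for this example, not part of the original analysis.

```r
# Rank predictors by the absolute value of their correlation with a response.
# rank_correlations is a hypothetical helper introduced for this sketch.
rank_correlations <- function(data, response, predictors) {
  r <- sapply(predictors, function(p) cor(data[[p]], data[[response]]))
  # Sort by |r| so that strong negative correlations are not overlooked
  r[order(abs(r), decreasing = TRUE)]
}

# Stand-in data: mtcars ships with R. For the baseball data the call would be
# rank_correlations(Baseball_Times, "Time",
#                   c("Runs", "Margin", "Pitchers", "Attendance"))
ranked <- rank_correlations(mtcars, "mpg", c("wt", "hp", "qsec"))
print(ranked)
```

Sorting on the absolute value matters when a predictor has a strong negative correlation, which a "closest to 1" rule would rank last.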

Section 3: Determining Regression Equation

The best single predictor of time is the variable with the strongest correlation with time, which was determined above to be the pitchers variable. We can find the regression equation by fitting a linear regression model of time on pitchers. The summary of the linear regression model is shown below.

## 
## Call:
## lm(formula = Time ~ Pitchers, data = Baseball_Times)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.945  -8.445  -3.104   9.751  50.794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.843     13.387   7.085 8.24e-06 ***
## Pitchers      10.710      1.486   7.206 6.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844 
## F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06

The intercept and slope estimates from the model are the coefficients of the regression equation. As the model summary shows, the intercept is 94.843 and the slope is 10.710. Using these coefficients, we can form the regression equation shown below.

## The regression equation is Time =  94.84325 + 10.71017 * Pitchers

The slope coefficient shows that each additional pitcher is associated with an increase of about 10.7 minutes in predicted game time. This shows that game length tends to increase substantially as more pitchers are used.
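As a worked illustration of the equation and the slope, the sketch below plugs a hypothetical game with 8 pitchers into the fitted equation and builds an approximate 95% confidence interval for the slope. The coefficient values and standard error are copied from the model summary; the choice of 8 pitchers is an assumption made only for the example.

```r
# Values copied from the model summary and regression equation
intercept <- 94.84325
slope     <- 10.71017
slope_se  <- 1.486
df_resid  <- 13

# Predicted game time (minutes) for a hypothetical game with 8 pitchers
predicted_time <- intercept + slope * 8

# Approximate 95% confidence interval for the slope:
# estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

cat("Predicted time for 8 pitchers:", predicted_time, "\n")
cat("Approximate 95% CI for slope:", slope_ci, "\n")
```

With the fitted model object available, confint(regression_model) should give essentially the same interval without copying numbers by hand.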

Section 4: Significance Test

To test significance, a t-test on the slope can be performed and its p-value examined to decide whether to reject the null hypothesis that the slope for pitchers is zero. The summary produced in the previous section already shows the results of this t-test. The p-value was 6.884e-06, which is extremely small. Because of the small p-value, the null hypothesis can be rejected. This means there is a statistically significant linear relationship between the pitchers variable and time.
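The t statistic and p-value in the summary can be reproduced by hand from the slope estimate and its standard error. The sketch below copies those two numbers and the residual degrees of freedom from the summary output; because the inputs are rounded, the results should agree with the summary only up to rounding.

```r
# Values copied from the Pitchers row of the coefficient table
estimate  <- 10.710
std_error <- 1.486
df_resid  <- 13

# t statistic for H0: slope = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 13 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

cat("t statistic:", t_stat, "\n")
cat("p-value:", p_value, "\n")
```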

Section 5: Analysis of Plots

To check the conclusion that pitchers is closely associated with time, the diagnostic plots of the model can be analyzed.

Residuals vs Fitted: This plot is used to examine linearity and constant variance. Linearity appears reasonable, since the residuals are scattered roughly at random around the horizontal line at zero. The curve in the trend of the residuals, however, raises some concern about the constant variance assumption.

Q-Q Plot: Most of the residuals fall close to the line; however, the residuals at both ends deviate from it. This indicates some concern for the normality assumption of the model.
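A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test to model residuals; it fits a stand-in model on R's built-in cars data because BaseballTimes.csv is not reproduced here, and this test is an addition for illustration rather than part of the original analysis.

```r
# Stand-in model: cars ships with R. For the baseball data, use
# residuals(regression_model) instead.
stand_in_model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(residuals(stand_in_model))
print(sw)
```

With only 15 games in the baseball data, a test like this has little power, so a large p-value would not by itself confirm normality.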

Scale-Location Plot: The plot shows an outlier, which suggests a slight violation of the constant variance assumption for the residuals.

Cook's distance: The most influential points are observations 15, 13, and 4, with observation 15 being the most influential. Observation 15 could be an outlier that is disproportionately affecting the regression model.
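The influential observations can also be listed numerically instead of being read off the plot. The sketch below is a minimal example on a stand-in model fitted to R's built-in cars data; the 4/n cutoff is a common rule of thumb and an assumption here, not a threshold used in the original analysis.

```r
# Stand-in model: cars ships with R. For the baseball data, use
# regression_model instead.
stand_in_model <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cooks_d <- cooks.distance(stand_in_model)

# Rule-of-thumb cutoff (assumption): flag observations above 4 / n
threshold <- 4 / nobs(stand_in_model)
flagged   <- cooks_d[cooks_d > threshold]

# Most influential observations first
print(sort(flagged, decreasing = TRUE))
```

A natural follow-up is to refit the model without the most influential observation and compare the slope estimates.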

Appendix

Code for reading in the dataset and finding the correlations.

Baseball_Times <- read.csv("BaseballTimes.csv")
correlation_runs_time <- cor(
  Baseball_Times$Runs, 
  Baseball_Times$Time)

cat(
  "Correlation between runs and time:", 
  correlation_runs_time, 
  "\n")
correlation_margin_time <- cor(
  Baseball_Times$Margin, 
  Baseball_Times$Time)

cat(
  "Correlation between margin and time:",
  correlation_margin_time,
  "\n")

correlation_pitchers_time <- cor(
  Baseball_Times$Pitchers, 
  Baseball_Times$Time)

cat(
  "Correlation between pitchers and time:", 
  correlation_pitchers_time, 
  "\n")

correlation_attendance_time <- cor(
  Baseball_Times$Attendance, 
  Baseball_Times$Time)

cat(
  "Correlation between attendance and time", 
  correlation_attendance_time, 
  "\n")

Code for making regression model and finding regression equation

regression_model <- lm(Time ~ Pitchers, data = Baseball_Times)
summary(regression_model)

coefficients <- coef(regression_model)
intercept <- coefficients[1] # predicted game time when Pitchers = 0 (an extrapolation, since every game uses pitchers)
slope <- coefficients[2] # change in predicted time for each additional pitcher in the game

cat("The regression equation is Time = ", intercept, "+", slope, "* Pitchers\n")

Code for plots used

#Residuals vs Fitted
plot(regression_model, which = 1)

#Q-Q plot
plot(regression_model, which = 2)

#Scale Location Plot
plot(regression_model, which = 3)

#Cook's distance plot
plot(regression_model, which = 4)