This project investigates the relationship between the length of baseball games and the other variables in the given data set to identify which factors best explain how long a game lasts. We calculate the correlation of each variable with game time, fit a simple linear regression on the strongest predictor, and interpret the resulting equation. We then test whether that predictor is significant at the population level and examine residual plots to assess the linearity, constant variance, and normality assumptions of the model. All data analysis and visualization is done in R.
A common complaint among baseball fans is how long the games take. This project uses R to ask which factors are associated with game length, to test the significance of the strongest of them, and to determine how well that predictor fits a simple linear regression model.
Data Source: Stat2 Models for a World of Data. Tools: R programming language, base packages. Data Visualization: Creating correlation matrices and residual plots using base R.
The purpose of this analysis was to find and analyze the strongest predictor of the length of baseball games (in minutes) given the following four numeric variables: runs (total number scored between the two teams), margin of victory, pitchers (total number of pitchers used by both teams), and attendance (total number of spectators at the game).
##      Game League Runs Margin Pitchers Attendance Time
## 1 CLE-DET     AL   14      6        6      38774  168
## 2 CHI-BAL     AL   11      5        5      15398  164
## 3 BOS-NYY     AL   10      4       11      55058  202
## 4 TOR-TAM     AL    8      4       10      13478  172
## 5  TEX-KC     AL    3      1        4      17004  151
## 6 OAK-LAA     AL    6      4        4      37431  133
First, we calculated the correlations of each predictor variable against the length of the game (time), as shown below.
##        Runs      Margin    Pitchers  Attendance
##  0.68131437 -0.07135831  0.89430821  0.25719248
A correlation close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative one, and a value near 0 indicates little linear relationship. Here “Pitchers” has the largest correlation with time (r ≈ 0.89), so we base the rest of our analysis on this variable.
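The ranking step can be sketched in a few lines of R. The block below is a minimal illustration that uses only the six games printed in the preview above, so its correlations will not match the full-sample values reported here. The data frame name `games` is an assumption made for this sketch.

```r
# First six games only, copied from the head() output (not the full data set)
games <- data.frame(
  Runs       = c(14, 11, 10, 8, 3, 6),
  Margin     = c(6, 5, 4, 4, 1, 4),
  Pitchers   = c(6, 5, 11, 10, 4, 4),
  Attendance = c(38774, 15398, 55058, 13478, 17004, 37431),
  Time       = c(168, 164, 202, 172, 151, 133)
)

# Correlation of each predictor with Time
predictors <- setdiff(names(games), "Time")
cor_with_time <- sapply(games[predictors], cor, y = games$Time)
print(round(cor_with_time, 3))

# Rank by absolute value so that a strong negative correlation is not overlooked
ranked <- sort(abs(cor_with_time), decreasing = TRUE)
strongest <- names(ranked)[1]
print(strongest)
```

Ranking by absolute value matters in general, although in this data set the only negative correlation (Margin) is also the weakest.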
To determine the regression equation, we must first find the coefficients that will make up the intercept and slope of our equation.
## (Intercept)    Pitchers
##    94.84325    10.71017
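In simple linear regression the least-squares coefficients can also be written in closed form: the slope is the correlation times the ratio of the standard deviations, and the intercept makes the line pass through the point of means. The sketch below checks this against `lm()` on the six previewed games only, so the numbers it prints differ from the full-sample coefficients above. The names `games`, `b0`, and `b1` are assumptions of this sketch.

```r
# First six games only, copied from the head() output (not the full data set)
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

# Closed-form least-squares estimates
r  <- cor(games$Pitchers, games$Time)
b1 <- r * sd(games$Time) / sd(games$Pitchers)        # slope
b0 <- mean(games$Time) - b1 * mean(games$Pitchers)   # intercept

# The same estimates from lm()
fit <- lm(Time ~ Pitchers, data = games)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

The two printed vectors should agree up to floating-point error, which shows how the correlation and the regression slope are related.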
Next, we write out the regression equation for predicting time from “Pitchers”, using the intercept and slope estimated by the fitted linear model. In the equation below, x is the total number of pitchers used in the game.
## [1] "Time = 94.84 + 10.71 * x"
We estimate our slope coefficient to be 10.71. This means that for each additional pitcher used in a game, the predicted game length increases by approximately 10.71 minutes. This is a substantial increase, especially considering that the number of pitchers used per game in our data ranges from 4 to 16.
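To make the size of this effect concrete, the fitted equation can be evaluated at a few pitcher counts. The sketch below hard-codes the coefficients reported above; `predict_time` is a hypothetical helper introduced only for this illustration.

```r
# Coefficients as reported in the regression output above
intercept <- 94.84325
slope     <- 10.71017

# Hypothetical helper: predicted game length in minutes for a given pitcher count
predict_time <- function(pitchers) intercept + slope * pitchers

pitchers  <- c(4, 10, 16)
predicted <- predict_time(pitchers)
print(data.frame(pitchers, predicted_minutes = round(predicted, 1)))

# Difference in predicted length across the observed range of 4 to 16 pitchers:
# 12 additional pitchers at 10.71017 minutes each
span <- predict_time(16) - predict_time(4)
print(round(span, 1))
```

Across the observed range, the model's predictions differ by a little over two hours, which is why the slope is large in practical terms.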
We must also consider whether the relationship between the number of pitchers and time holds in the overall population of baseball games rather than only in this sample. We do this with the p-value from the t-test on the slope coefficient. First, we establish a null hypothesis and an alternative hypothesis.
The null hypothesis is that the population slope is zero, meaning there is no linear relationship between pitchers and time. The alternative hypothesis is that the population slope is not zero. If p <= 0.05 we reject the null hypothesis, and if p > 0.05 we fail to reject it.
## We reject the null hypothesis, the predictor variable is significantly correlated with time (p-value = 6.883851e-06 ).
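The p-value comes from a two-sided t-test on the slope: the estimate divided by its standard error is compared with a t distribution on n - 2 degrees of freedom. The sketch below reproduces that calculation by hand and compares it with the value in `summary()`. It uses only the six previewed games, so its p-value is not the one reported above, and the variable names are assumptions of this sketch.

```r
# First six games only, copied from the head() output (not the full data set)
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit        <- lm(Time ~ Pitchers, data = games)
coef_table <- coef(summary(fit))

# t statistic: slope estimate divided by its standard error
estimate  <- coef_table["Pitchers", "Estimate"]
std_error <- coef_table["Pitchers", "Std. Error"]
t_stat    <- estimate / std_error

# Two-sided p-value on the residual degrees of freedom (n - 2)
resid_df  <- df.residual(fit)
p_manual  <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)
p_summary <- coef_table["Pitchers", "Pr(>|t|)"]

print(c(t = t_stat, df = resid_df, p_manual = p_manual, p_summary = p_summary))
```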
The final step is to examine residual plots for the fitted linear model. We use a residuals vs. fitted values plot to check linearity and constant variance, and a Q-Q (quantile-quantile) plot to check the normality of the residuals; together these cover the main assumptions of the model. We do not use a separate outlier diagnostic plot because, with such a small sample, the outliers are already apparent from the data.
Although there are some evident outliers, the residuals are spread relatively evenly about the horizontal line at 0, which suggests that a linear model is a good fit for the data. There is also no funnel shape: the residuals do not fan out or contract as the fitted values increase. This indicates that the data meet the constant variance assumption of regression analysis.
The Q-Q plot compares the quantiles of the residuals with those of a theoretical normal distribution. Since the points fall very close to the reference line (with the exception of the previously noted outliers at observations 4, 13, and 15), we conclude that the residuals are approximately normally distributed.
This project explored the relationship between the length of baseball games and several other variables. We found that the total number of pitchers used by both teams is the variable most strongly correlated with game length. Using “Pitchers” to fit a regression equation and examining two residual plots, we determined that a linear model fits the data reasonably well.
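The quantities behind a residuals vs. fitted plot can be extracted and drawn directly, which makes it clear what the plot displays. This is a minimal sketch on the six previewed games only, not a reproduction of the full-sample plot; the names `games` and `diagnostics` are assumptions.

```r
# First six games only, copied from the head() output (not the full data set)
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit <- lm(Time ~ Pitchers, data = games)

# Fitted value and residual (observed minus fitted) for each game
diagnostics <- data.frame(
  fitted   = fitted(fit),
  residual = resid(fit)
)
print(round(diagnostics, 2))

# Residuals vs. fitted values, with a dashed reference line at zero
plot(diagnostics$fitted, diagnostics$residual,
     xlab = "Fitted values (minutes)", ylab = "Residuals (minutes)")
abline(h = 0, lty = 2)
```

When reading such a plot, a curved pattern points to nonlinearity, and a spread that widens or narrows from left to right points to non-constant variance.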
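The coordinates of a normal Q-Q plot can be computed without drawing it, which shows what is being compared. The sketch below uses the six previewed games only. The correlation between the two sets of quantiles is an informal summary of how straight the plot is; it is an addition for illustration and not part of the original analysis.

```r
# First six games only, copied from the head() output (not the full data set)
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit <- lm(Time ~ Pitchers, data = games)

# Standardized residuals, the quantity shown by plot(fit, which = 2)
std_res <- rstandard(fit)

# Theoretical normal quantiles (x) paired with the sample values (y)
qq <- qqnorm(std_res, plot.it = FALSE)
print(round(data.frame(theoretical = qq$x, sample = qq$y), 2))

# Informal straightness summary: values near 1 mean the points lie close to a line
qq_cor <- cor(qq$x, qq$y)
print(round(qq_cor, 3))
```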
Baseball Times Dataset [https://www.stat2.org/datasets/BaseballTimes.csv]
# Load the data and preview the first rows
BaseballTimes <- read.csv("BaseballTimes.csv")
head(BaseballTimes)

# Correlation of each numeric predictor with Time
numeric_vars <- sapply(BaseballTimes, is.numeric)
BaseballTimes_numeric <- BaseballTimes[, numeric_vars]
correlation_matrix <- cor(BaseballTimes_numeric)
cor_with_time <- correlation_matrix["Time", ]
# Drop Time's correlation with itself by name rather than by position
example_1 <- cor_with_time[names(cor_with_time) != "Time"]
print(example_1)

# Simple linear regression of Time on Pitchers
lm_pitchers_time <- lm(Time ~ Pitchers, data = BaseballTimes)
coefs <- coef(lm_pitchers_time)
print(coefs)

# Build the regression equation from the fitted coefficients
intercept <- coefs["(Intercept)"]
slope <- coefs["Pitchers"]
regression_equation <- paste("Time =", round(intercept, 2), "+", round(slope, 2), "* x")
print(regression_equation)

# Significance test for the slope at the 0.05 level
summary_lm_pitcher_time <- summary(lm_pitchers_time)
p_value <- coef(summary_lm_pitcher_time)["Pitchers", "Pr(>|t|)"]
if (p_value <= 0.05) {
  cat("We reject the null hypothesis, the predictor variable is significantly correlated with time (p-value =", p_value, ").\n")
} else {
  cat("We fail to reject the null hypothesis, the predictor variable is not significantly correlated with time (p-value =", p_value, ").\n")
}

# Residuals vs. fitted values (which = 1) and normal Q-Q plot (which = 2)
plot(lm_pitchers_time, which = 1)
plot(lm_pitchers_time, which = 2)