Here we analyze several baseball metrics to find the one that
correlates most strongly with game length in minutes (and can therefore
serve as a loose predictor of it). The metrics come from the data frame
below:

Game | League | Runs | Margin | Pitchers | Attendance | Time (min) |
---|---|---|---|---|---|---|
CLE-DET | AL | 14 | 6 | 6 | 38774 | 168 |
CHI-BAL | AL | 11 | 5 | 5 | 15398 | 164 |
BOS-NYY | AL | 10 | 4 | 11 | 55058 | 202 |
TOR-TAM | AL | 8 | 4 | 10 | 13478 | 172 |
TEX-KC | AL | 3 | 1 | 4 | 17004 | 151 |
OAK-LAA | AL | 6 | 4 | 4 | 37431 | 133 |
MIN-SEA | AL | 5 | 1 | 5 | 26292 | 151 |
CHI-PIT | NL | 23 | 5 | 14 | 17929 | 239 |
LAD-WAS | NL | 3 | 1 | 6 | 26110 | 156 |
FLA-ATL | NL | 19 | 1 | 12 | 17539 | 211 |
CIN-HOU | NL | 3 | 1 | 4 | 30395 | 147 |
MIL-STL | NL | 12 | 12 | 9 | 41121 | 185 |
ARI-SD | NL | 11 | 7 | 10 | 32104 | 164 |
COL-SF | NL | 9 | 5 | 7 | 32695 | 180 |
NYM-PHI | NL | 15 | 1 | 16 | 45204 | 317 |

Of these, the variables relevant to the correlation analysis are
League, Runs, Margin, Pitchers, and Attendance, with Time as the
response variable. The Game column does not lend itself to computing a
correlation, as it is strictly categorical (each value is a unique
matchup label). League is also categorical, but it has only two levels,
so we can encode it as a dummy variable with AL mapped to 0 and NL
mapped to 1.
We can now calculate each of the respective
correlation coefficients:
Variable | Correlation with Time |
---|---|
League | 0.4121187 |
Runs | 0.6813144 |
Margin | -0.0713583 |
Pitchers | 0.8943082 |
Attendance | 0.2571925 |
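As a cross-check, the coefficients can be recomputed directly from the data frame. The following is a minimal sketch in plain Python rather than the original analysis code; the helper name `pearson` and the variable names are illustrative assumptions, and League is dummy-coded with AL as 0 and NL as 1.

```python
import math

# Columns copied from the data frame, in row order.
league = ["AL"] * 7 + ["NL"] * 8
runs = [14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15]
margin = [6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1]
pitchers = [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16]
attendance = [38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
              26110, 17539, 30395, 41121, 32104, 32695, 45204]
time = [168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185, 164, 180, 317]


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Dummy-code the two-level League variable: AL -> 0, NL -> 1.
league_dummy = [0 if lg == "AL" else 1 for lg in league]

predictors = {
    "League": league_dummy,
    "Runs": runs,
    "Margin": margin,
    "Pitchers": pitchers,
    "Attendance": attendance,
}
correlations = {name: pearson(values, time) for name, values in predictors.items()}

# The predictor whose correlation with Time has the largest absolute value.
strongest = max(correlations, key=lambda name: abs(correlations[name]))

for name, r in correlations.items():
    print(f"{name:<10} {r: .7f}")
print("Strongest:", strongest)
```

Ranking by absolute value matters here because a strongly negative correlation would be just as useful for prediction as a strongly positive one.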
The correlation between Pitchers and Time has the greatest absolute
value, so we take the number of pitchers to be the metric most strongly
correlated with game length. Next, we can construct a linear regression
model to attempt to generalize the relationship between number of
pitchers and length of game. The model will be represented by the red
dashed line below:
The intercept of the fitted model is 94.8432502 and the slope is
10.7101727, so the fitted line is Time = 94.8432502 + 10.7101727 ×
Pitchers. The model implies that for each additional pitcher used in a
game, we can expect the duration of the game to increase by about 10.7
minutes.
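The two coefficients follow from the usual least-squares formulas, slope = Sxy / Sxx and intercept = mean(Time) − slope × mean(Pitchers). The sketch below recomputes them from the data; the function name `fit_line` and the example input of 8 pitchers are illustrative assumptions, not part of the original analysis.

```python
# Pitchers and Time columns copied from the data frame, in row order.
pitchers = [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16]
time = [168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185, 164, 180, 317]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


intercept, slope = fit_line(pitchers, time)
print(f"intercept = {intercept:.7f}, slope = {slope:.7f}")

# Example use: predicted length of a game in which 8 pitchers appear.
predicted_8 = intercept + slope * 8
print(f"predicted minutes for 8 pitchers: {predicted_8:.1f}")
```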
We can
assess statistical significance with a correlation test, which compares
the correlation test statistic against a t-distribution with 13 degrees
of freedom (n − 2 for the 15 games). Doing so yields a p-value of
6.8838508 × 10^{-6} and a 95% confidence interval of (0.705038,
0.9646464). That is, we are 95% confident that the true correlation
between the number of pitchers and the duration of a game lies between
0.705038 and 0.9646464. Because the p-value is far below the
conventional 0.05 level, we reject the null hypothesis that the true
correlation is zero and conclude that the correlation is statistically
significant.
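The test can be reproduced by hand as a rough check. The sketch below assumes the test statistic is t = r·sqrt((n − 2) / (1 − r²)) and that the confidence interval comes from the Fisher z-transformation, which reproduces the reported bounds. Python's standard library has no t-distribution CDF, so the p-value is approximated by integrating the t density numerically with Simpson's rule; all function names are illustrative.

```python
import math
from statistics import NormalDist

pitchers = [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16]
time = [168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185, 164, 180, 317]


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, using Simpson's rule for P(0 < T < |t|); steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


n = len(pitchers)
r = pearson(pitchers, time)
df = n - 2                                   # 13 degrees of freedom
t_stat = r * math.sqrt(df / (1 - r * r))
p_value = two_sided_p(t_stat, df)

# 95% interval via Fisher's z-transformation (an assumption about the method used).
z = math.atanh(r)
half_width = NormalDist().inv_cdf(0.975) / math.sqrt(n - 3)
ci_low, ci_high = math.tanh(z - half_width), math.tanh(z + half_width)

print(f"t = {t_stat:.4f}, df = {df}, p = {p_value:.3e}")
print(f"95% CI: ({ci_low:.6f}, {ci_high:.6f})")
```

The interval is asymmetric around r because it is built symmetrically on the z scale and then mapped back through tanh.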
Consider the residual plots:
Here, in the residuals vs. fitted plot, we can see a roughly random
scattering of points around zero. This is good, as it suggests that the
linear model does not systematically under- or over-predict the true
values across the range of fitted values. Additionally, the Q-Q plot
compares the residuals against the quantiles of a normal distribution;
most points lie close to the reference line, which suggests that the
residuals are approximately normally distributed, as the inference
above assumes.
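The quantities behind both plots can be computed without any plotting library. The sketch below refits the line, derives the raw residuals, and pairs the sorted residuals with theoretical normal quantiles at plotting positions (i − 0.5)/n. It uses raw rather than standardized residuals, and the variable names are illustrative, so it is an approximation of what the plots show rather than an exact reconstruction.

```python
from statistics import NormalDist

games = ["CLE-DET", "CHI-BAL", "BOS-NYY", "TOR-TAM", "TEX-KC", "OAK-LAA", "MIN-SEA",
         "CHI-PIT", "LAD-WAS", "FLA-ATL", "CIN-HOU", "MIL-STL", "ARI-SD", "COL-SF",
         "NYM-PHI"]
pitchers = [6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16]
time = [168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147, 185, 164, 180, 317]

# Least-squares fit of Time on Pitchers.
n = len(pitchers)
mx, my = sum(pitchers) / n, sum(time) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(pitchers, time))
         / sum((x - mx) ** 2 for x in pitchers))
intercept = my - slope * mx

# Residuals vs. fitted: observed minus predicted game length.
fitted = [intercept + slope * x for x in pitchers]
residuals = [y - f for y, f in zip(time, fitted)]

# Normal Q-Q pairs: theoretical quantile against sorted residual.
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
qq_pairs = list(zip(theoretical, sorted(residuals)))

# Game with the largest absolute residual, i.e. the point farthest from the line.
worst = max(range(n), key=lambda i: abs(residuals[i]))

print(f"sum of residuals: {sum(residuals):.2e}")
print(f"largest residual: {games[worst]} ({residuals[worst]:+.1f} min)")
for q, res in qq_pairs:
    print(f"{q:+.3f}  {res:+.1f}")
```

Listing the game with the largest residual makes it easy to see which points drive any departure from the reference line in the Q-Q plot.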