In this project, I will use data from 15 baseball games to find which predictor variable most accurately predicts the length of a game. With that predictor, I will determine the regression equation and interpret its slope coefficient. I will then perform a significance test to determine whether the correlation is statistically significant. Finally, I will analyze the residual plot for this model and decide whether the conditions for inference are met.
## [1] "Game" "League" "Runs" "Margin" "Pitchers"
## [6] "Attendance" "Time"
## Game League Runs Margin Pitchers Attendance Time
## 1 CLE-DET AL 14 6 6 38774 168
## 2 CHI-BAL AL 11 5 5 15398 164
## 3 BOS-NYY AL 10 4 11 55058 202
## 4 TOR-TAM AL 8 4 10 13478 172
## 5 TEX-KC AL 3 1 4 17004 151
## 6 OAK-LAA AL 6 4 4 37431 133
## 7 MIN-SEA AL 5 1 5 26292 151
## 8 CHI-PIT NL 23 5 14 17929 239
## 9 LAD-WAS NL 3 1 6 26110 156
## 10 FLA-ATL NL 19 1 12 17539 211
## 11 CIN-HOU NL 3 1 4 30395 147
## 12 MIL-STL NL 12 12 9 41121 185
## 13 ARI-SD NL 11 7 10 32104 164
## 14 COL-SF NL 9 5 7 32695 180
## 15 NYM-PHI NL 15 1 16 45204 317
## Runs Margin Pitchers Attendance Time
## Runs 1.00000000 0.29801292 0.76564555 -0.01293088 0.68131437
## Margin 0.29801292 1.00000000 0.09224695 0.26491543 -0.07135831
## Pitchers 0.76564555 0.09224695 1.00000000 0.18192049 0.89430821
## Attendance -0.01293088 0.26491543 0.18192049 1.00000000 0.25719248
## Time 0.68131437 -0.07135831 0.89430821 0.25719248 1.00000000
Based on this correlation matrix, the number of pitchers used (Pitchers) has the strongest correlation with the length of a game (Time), with r of about 0.894. The next strongest predictor is Runs, with r of about 0.681.
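As a minimal sketch of how this comparison could be reproduced, the code below enters the numeric columns from the table by hand and ranks each predictor by the absolute value of its correlation with Time. The data frame name `games` and the helper variable names are my own assumptions for illustration and are not taken from the original analysis.

```r
# Numeric columns transcribed from the data table above.
# `games` is a hypothetical name chosen for this sketch.
games <- data.frame(
  Runs       = c(14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15),
  Margin     = c(6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1),
  Pitchers   = c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16),
  Attendance = c(38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
                 26110, 17539, 30395, 41121, 32104, 32695, 45204),
  Time       = c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
                 185, 164, 180, 317)
)

# Take the "Time" column of the correlation matrix and drop Time itself.
r_with_time <- cor(games)[, "Time"]
predictors  <- r_with_time[names(r_with_time) != "Time"]

# Rank predictors by the strength (absolute value) of their correlation.
ranked <- predictors[order(abs(predictors), decreasing = TRUE)]
print(round(ranked, 3))

# Name of the predictor most strongly correlated with Time.
strongest_name <- names(ranked)[1]
print(strongest_name)
```

Ranking by absolute value matters here because a strong negative correlation would be just as useful for prediction as a strong positive one.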
##
## Call:
## lm(formula = Time ~ Pitchers, data = strongest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.945 -8.445 -3.104 9.751 50.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.843 13.387 7.085 8.24e-06 ***
## Pitchers 10.710 1.486 7.206 6.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7844
## F-statistic: 51.93 on 1 and 13 DF, p-value: 6.884e-06
Based on this regression output, with Pitchers as the predictor and Time as the response, the regression equation is: predicted Time = 94.843 + 10.710 * Pitchers. The slope means that each additional pitcher used in a game is associated with an increase of about 10.71 minutes in the predicted length of the game, on average.
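To make the equation concrete, here is a short sketch that refits the simple linear regression from the Pitchers and Time columns and uses it for one prediction. The vector names `pitchers` and `time_min`, and the example game with 8 pitchers, are assumptions made for illustration only.

```r
# Pitchers and Time (minutes) transcribed from the data table above.
# The names `pitchers` and `time_min` are hypothetical.
pitchers <- c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16)
time_min <- c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
              185, 164, 180, 317)

# Least-squares fit of game time on number of pitchers.
fit <- lm(time_min ~ pitchers)

# Intercept and slope of the fitted line.
b <- coef(fit)
print(round(b, 3))

# Predicted length of a hypothetical game in which 8 pitchers are used:
# intercept + slope * 8.
pred_8 <- predict(fit, newdata = data.frame(pitchers = 8))
print(pred_8)
```

The prediction is only trustworthy inside the observed range of 4 to 16 pitchers. In particular, the intercept describes a game with zero pitchers, which cannot occur, so it should not be interpreted on its own.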
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 7.2063, df = 13, p-value = 6.884e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7050380 0.9646464
## sample estimates:
## cor
## 0.8943082
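Based on this correlation test at the 5% significance level (95% confidence level), the correlation is statistically significant: the p-value of 6.884e-06 is far below 0.05, and the 95% confidence interval (0.705 to 0.965) does not contain 0. This is strong evidence that the number of pitchers is positively correlated with the length of the game.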
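The test above could be rerun roughly as follows. This is a sketch under the assumption that the two columns are available as plain numeric vectors. The names `pitchers`, `time_min`, and `alpha` are mine, not from the original analysis.

```r
# Pitchers and Time (minutes) transcribed from the data table above.
pitchers <- c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16)
time_min <- c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
              185, 164, 180, 317)

# Two-sided Pearson test of H0: true correlation = 0.
ct <- cor.test(pitchers, time_min, method = "pearson")
print(ct)

# Decision at the 5% significance level.
alpha <- 0.05
reject_null <- ct$p.value < alpha
print(reject_null)

# The confidence interval gives the same conclusion if it excludes 0.
ci_excludes_zero <- ct$conf.int[1] > 0 || ct$conf.int[2] < 0
print(ci_excludes_zero)
```

With a single predictor, this t statistic and p-value match the ones on the Pitchers row of the regression summary, which is why both outputs report t of about 7.206.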
This scatterplot displays the relationship between the number of
pitchers and game time, with the line of best fit and the regression
equation for that line. The points follow a positive, roughly linear
pattern around the regression line, which supports the linearity
condition. The clustering is not uniformly tight, however: the
residuals range from about -38 to 51 minutes, so the remaining
conditions for inference should be judged from the residuals as well.
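One way to examine the residuals is sketched below. It refits the model, draws the scatterplot with its fitted line, and then plots residuals against fitted values. The variable names are assumptions for illustration, and the plots use base R graphics rather than whatever produced the original figure.

```r
# Pitchers and Time (minutes) transcribed from the data table above.
pitchers <- c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16)
time_min <- c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
              185, 164, 180, 317)

fit <- lm(time_min ~ pitchers)
res <- resid(fit)

# Scatterplot with the least-squares line.
plot(pitchers, time_min,
     xlab = "Pitchers used", ylab = "Game time (minutes)",
     main = "Game time vs. pitchers")
abline(fit)

# Residuals vs. fitted values: look for curvature or a fan shape.
plot(fitted(fit), res,
     xlab = "Fitted game time (minutes)", ylab = "Residual (minutes)",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Index of the game with the largest positive residual.
largest <- unname(which.max(res))
print(largest)
print(round(range(res), 3))
```

In this data the largest positive residual belongs to game 15 (NYM-PHI, 16 pitchers, 317 minutes), which is also the game with the most pitchers, so it is worth checking how much that single point influences the fitted line.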