Section 1: Introduction

In this project, I use data from 15 baseball games to find which predictor variable most accurately predicts the length of a game. Using that predictor, I will determine the regression equation and interpret its slope coefficient. I will then perform a significance test to determine whether the correlation is statistically significant. Finally, I will analyze the residual plot for this model and decide whether the conditions for inference are met.

Section 1.1: Data

## [1] "Game"       "League"     "Runs"       "Margin"     "Pitchers"  
## [6] "Attendance" "Time"
##       Game League Runs Margin Pitchers Attendance Time
## 1  CLE-DET     AL   14      6        6      38774  168
## 2  CHI-BAL     AL   11      5        5      15398  164
## 3  BOS-NYY     AL   10      4       11      55058  202
## 4  TOR-TAM     AL    8      4       10      13478  172
## 5   TEX-KC     AL    3      1        4      17004  151
## 6  OAK-LAA     AL    6      4        4      37431  133
## 7  MIN-SEA     AL    5      1        5      26292  151
## 8  CHI-PIT     NL   23      5       14      17929  239
## 9  LAD-WAS     NL    3      1        6      26110  156
## 10 FLA-ATL     NL   19      1       12      17539  211
## 11 CIN-HOU     NL    3      1        4      30395  147
## 12 MIL-STL     NL   12     12        9      41121  185
## 13  ARI-SD     NL   11      7       10      32104  164
## 14  COL-SF     NL    9      5        7      32695  180
## 15 NYM-PHI     NL   15      1       16      45204  317

Section 2: Calculating Correlations

##                   Runs      Margin   Pitchers  Attendance        Time
## Runs        1.00000000  0.29801292 0.76564555 -0.01293088  0.68131437
## Margin      0.29801292  1.00000000 0.09224695  0.26491543 -0.07135831
## Pitchers    0.76564555  0.09224695 1.00000000  0.18192049  0.89430821
## Attendance -0.01293088  0.26491543 0.18192049  1.00000000  0.25719248
## Time        0.68131437 -0.07135831 0.89430821  0.25719248  1.00000000

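Based on this correlation matrix, the number of pitchers used (Pitchers) has the strongest correlation with the length of a baseball game (Time), at r = 0.894. Runs is the next strongest at r = 0.681, while Attendance (0.257) and Margin (-0.071) are only weakly correlated with game time.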
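As a minimal sketch of how the strongest predictor could be picked out programmatically rather than by eye, the code below rebuilds the numeric columns from the table in Section 1.1 and ranks each predictor by the absolute value of its correlation with Time. The names `games`, `cors_with_time`, `ranked`, and `best_predictor` are illustrative assumptions and do not come from the original analysis.

```r
# Numeric columns transcribed from the data table in Section 1.1.
# The data frame name `games` is an assumption made for this sketch.
games <- data.frame(
  Runs       = c(14, 11, 10, 8, 3, 6, 5, 23, 3, 19, 3, 12, 11, 9, 15),
  Margin     = c(6, 5, 4, 4, 1, 4, 1, 5, 1, 1, 1, 12, 7, 5, 1),
  Pitchers   = c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16),
  Attendance = c(38774, 15398, 55058, 13478, 17004, 37431, 26292, 17929,
                 26110, 17539, 30395, 41121, 32104, 32695, 45204),
  Time       = c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
                 185, 164, 180, 317)
)

# Take the Time column of the correlation matrix and drop Time itself.
cors_with_time <- cor(games)[, "Time"]
cors_with_time <- cors_with_time[names(cors_with_time) != "Time"]

# Rank by absolute value so a strong negative correlation is not overlooked.
ranked <- sort(abs(cors_with_time), decreasing = TRUE)
best_predictor <- names(ranked)[1]

print(round(ranked, 3))
print(best_predictor)
```

Ranking by absolute value matters here because a predictor with a correlation near -1 would be just as useful for prediction as one near +1.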

Section 3: Regression Equation and Interpretation

## 
## Call:
## lm(formula = Time ~ Pitchers, data = strongest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.945  -8.445  -3.104   9.751  50.794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.843     13.387   7.085 8.24e-06 ***
## Pitchers      10.710      1.486   7.206 6.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844 
## F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06

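Based on this regression model, with the number of pitchers as the predictor and game time as the response, the regression equation is: y = 94.843 + (10.710 * pitchers). This means that each additional pitcher used in a game is associated with an increase of about 10.710 minutes in the predicted length of the game. The R-squared value of 0.7998 means that about 80% of the variation in game time is explained by the number of pitchers.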
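To illustrate how the equation is used, the sketch below refits the model on the Pitchers and Time columns from Section 1.1 and predicts the length of a hypothetical game with 8 pitchers. The data frame name `games` and the choice of 8 pitchers are assumptions for illustration only (the original call used a data frame named `strongest`).

```r
# Pitchers and Time columns transcribed from the table in Section 1.1.
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16),
  Time     = c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
               185, 164, 180, 317)
)

# Simple linear regression of game time on number of pitchers.
fit <- lm(Time ~ Pitchers, data = games)
b0 <- unname(coef(fit)["(Intercept)"])  # intercept
b1 <- unname(coef(fit)["Pitchers"])     # slope: minutes per additional pitcher

# Predicted time for a hypothetical game with 8 pitchers, computed two ways.
pred_by_hand  <- b0 + b1 * 8
pred_by_model <- unname(predict(fit, newdata = data.frame(Pitchers = 8)))

print(round(c(intercept = b0, slope = b1), 3))
print(round(c(by_hand = pred_by_hand, by_model = pred_by_model), 2))
```

Both predictions should agree, since `predict()` simply evaluates intercept + slope * pitchers. Note that the terms are added, not multiplied.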

Section 4: Significance Testing

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 7.2063, df = 13, p-value = 6.884e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7050380 0.9646464
## sample estimates:
##       cor 
## 0.8943082

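Based on this correlation hypothesis test at the 5% significance level, the p-value of 6.884e-06 is far below 0.05, so the correlation is statistically significant. The 95% confidence interval for the correlation, 0.705 to 0.965, also does not contain 0. This gives strong evidence that the number of pitchers is correlated with the length of the game.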
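The test above can be sketched as follows, assuming `x` is the Pitchers column and `y` is the Time column from Section 1.1 (the output only labels them "x and y", so this mapping is an assumption). The sketch also recomputes the t statistic by hand from r to show where it comes from; `alpha` and `t_by_hand` are illustrative names.

```r
# Assumed mapping: x = Pitchers, y = Time, transcribed from Section 1.1.
x <- c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16)
y <- c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
       185, 164, 180, 317)

# Two-sided test of H0: true correlation = 0 (Pearson is the default method).
result <- cor.test(x, y)

# The same t statistic by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2).
r <- cor(x, y)
n <- length(x)
t_by_hand <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Decision at the 5% significance level.
alpha <- 0.05
significant <- result$p.value < alpha

print(round(c(r = r, t = t_by_hand, df = n - 2), 4))
print(significant)
```

The t statistic and p-value match those of the slope in the regression summary, because with a single predictor the test for a nonzero correlation and the test for a nonzero slope are equivalent.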

Section 5: Analyzing Residual Plots

This scatterplot displays the relationship between the number of pitchers and game time, along with the line of best fit and its regression equation. The scatterplot seems to support the conditions for inference, because most of the points lie close to the regression line. The fit is not perfect, however: the residuals reported in Section 3 range from about -38 to 51 minutes, so some games fall well away from the line.
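Since this section is about residuals, a residual plot may be a more direct check than the scatterplot alone. The following is a minimal sketch, assuming the same Pitchers and Time columns from Section 1.1; the names `games`, `fit`, and `res` are illustrative, and base R graphics are used only as one possible choice.

```r
# Pitchers and Time columns transcribed from the table in Section 1.1.
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4, 5, 14, 6, 12, 4, 9, 10, 7, 16),
  Time     = c(168, 164, 202, 172, 151, 133, 151, 239, 156, 211, 147,
               185, 164, 180, 317)
)

fit <- lm(Time ~ Pitchers, data = games)
res <- resid(fit)  # observed time minus predicted time, in minutes

# Residuals against fitted values: look for curvature (non-linearity)
# or a fan shape (non-constant spread) around the zero line.
plot(fitted(fit), res,
     xlab = "Fitted game time (minutes)",
     ylab = "Residual (minutes)",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals.
qqnorm(res)
qqline(res)

# Numeric summary to compare with the Residuals block in Section 3.
print(round(range(res), 3))
```

If the residuals scatter randomly around zero with roughly constant spread, the linearity and equal-variance conditions look reasonable. With only 15 games, a single large residual can dominate the picture, so any conclusion from these plots is tentative.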