Analysis of data

Data 1.0

##      Game League Runs Margin Pitchers Attendance Time
## 1 CLE-DET     AL   14      6        6      38774  168
## 2 CHI-BAL     AL   11      5        5      15398  164
## 3 BOS-NYY     AL   10      4       11      55058  202
## 4 TOR-TAM     AL    8      4       10      13478  172
## 5  TEX-KC     AL    3      1        4      17004  151
## 6 OAK-LAA     AL    6      4        4      37431  133

Correlation of Data 2.0

Here we determine which variable correlates most strongly with Time.

##                   [,1]
## Runs        0.68131437
## Margin     -0.07135831
## Pitchers    0.89430821
## Attendance  0.25719248
## Time        1.00000000

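As the output shows, apart from Time itself, Pitchers has the strongest correlation with Time (r ≈ 0.89). This indicates a strong positive linear association between the number of pitchers used and the length of a baseball game in this dataset. Runs is the next strongest (r ≈ 0.68), while Margin is essentially uncorrelated with Time (r ≈ -0.07).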
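The correlation step is not listed in the appendix, so the following is only a sketch of how a table in that shape could be produced. It assumes a data frame with the columns shown in the preview; the `games` data frame here holds just the six previewed rows, so its correlations will not match the values reported for the full dataset.

```r
# First six games only, copied from the data preview above.
# The full dataset has more rows, so these correlations will differ from the report.
games <- data.frame(
  Game       = c("CLE-DET", "CHI-BAL", "BOS-NYY", "TOR-TAM", "TEX-KC", "OAK-LAA"),
  League     = "AL",
  Runs       = c(14, 11, 10, 8, 3, 6),
  Margin     = c(6, 5, 4, 4, 1, 4),
  Pitchers   = c(6, 5, 11, 10, 4, 4),
  Attendance = c(38774, 15398, 55058, 13478, 17004, 37431),
  Time       = c(168, 164, 202, 172, 151, 133)
)

# Keep only the numeric columns; Game and League are labels, not measurements
numeric_cols <- c("Runs", "Margin", "Pitchers", "Attendance", "Time")

# Correlate every numeric column with Time; the result is a one-column matrix
cors <- cor(games[, numeric_cols], games$Time)
print(cors)
```

Passing a data frame as the first argument and a single vector as the second yields one row per variable, which is the layout of the table above.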

Fit a Linear Model 3.0

## 
## Call:
## lm(formula = Time ~ Pitchers, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.945  -8.445  -3.104   9.751  50.794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   94.843     13.387   7.085 8.24e-06 ***
## Pitchers      10.710      1.486   7.206 6.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.46 on 13 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7844 
## F-statistic: 51.93 on 1 and 13 DF,  p-value: 6.884e-06
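As a quick consistency check on this summary, the sketch below recomputes three of its figures from numbers already printed in the report. In a regression with a single predictor, R-squared equals the squared correlation, the t value is the estimate divided by its standard error, and the F-statistic equals the squared t value. The variable names are illustrative only, and the results should agree with the summary up to rounding of the printed inputs.

```r
# Values copied from the correlation table and the model summary above
r_pitchers_time <- 0.89430821   # correlation between Pitchers and Time
slope_est       <- 10.710       # slope estimate
slope_se        <- 1.486        # standard error of the slope
t_reported      <- 7.206        # t value printed for the slope

r_squared <- r_pitchers_time^2     # single predictor: R-squared equals r squared
t_slope   <- slope_est / slope_se  # t value = estimate / standard error
f_stat    <- t_reported^2          # single predictor: F equals t squared

print(round(c(r_squared = r_squared, t_slope = t_slope, f_stat = f_stat), 4))
```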

Regression Equation: The linear regression model can be expressed as:

Time = 94.843 + 10.710 × Pitchers

We obtained this equation from the coefficient estimates in the model summary. The p-value for the slope coefficient (6.88 x 10^-6) is far below any conventional significance level, so we reject the null hypothesis that the slope is equal to 0: the linear relationship between Pitchers and Time is statistically significant. The slope estimates that each additional pitcher is associated with a game about 10.7 units longer (Time appears to be recorded in minutes). The R-squared of 0.7998 means the number of pitchers explains about 80% of the variation in game length in this sample.
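To make the equation concrete, here is a small sketch that plugs a pitcher count into the fitted line and builds an approximate 95% confidence interval for the slope. It uses only the rounded coefficients, standard error, and degrees of freedom printed in the summary. `predict_time` is a hypothetical helper, not part of the original analysis, and 8 pitchers is an arbitrary value inside the range seen in the data preview.

```r
# Rounded values copied from the model summary
intercept <- 94.843
slope     <- 10.710
slope_se  <- 1.486
resid_df  <- 13     # residual degrees of freedom

# Hypothetical helper: predicted Time for a given number of pitchers
predict_time <- function(pitchers) intercept + slope * pitchers

pred_8 <- predict_time(8)
print(pred_8)

# 95% confidence interval for the slope: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 2))
```

With these inputs the interval runs from roughly 7.5 to 13.9, so it lies well above 0, in line with the small p-value for the slope.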

Plot the Data

A few things stand out in the plot. The main one is the linear pattern, which agrees with the strong linear correlation between Pitchers and Time found earlier. Second, there are a few outliers, but most of the points follow the expected linear pattern. Both observations support the conclusion that there is a strong relationship between the length of a baseball game and the number of pitchers used.
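One way to check the linear pattern and the outliers more directly is to overlay the fitted line on the scatter plot and look at standardized residuals. The sketch below is an assumption about how that could be done, not the code used for the report. It puts Pitchers on the horizontal axis because it is the predictor in the model, and it uses only the six previewed games, so the fitted line will differ from the one reported above. The cutoff of 2 is a common rule of thumb rather than a fixed standard.

```r
# First six games only, copied from the data preview; the report's model uses the full dataset
games <- data.frame(
  Pitchers = c(6, 5, 11, 10, 4, 4),
  Time     = c(168, 164, 202, 172, 151, 133)
)

fit <- lm(Time ~ Pitchers, data = games)

# Predictor on the x-axis, response on the y-axis
plot(games$Pitchers, games$Time,
     xlab = "Pitchers", ylab = "Time",
     main = "Time vs. Pitchers with fitted line",
     pch = 19, col = "blue")
abline(fit, col = "red", lwd = 2)   # overlay the least-squares line

# Standardized residuals; an absolute value above 2 flags a possible outlier
std_res <- rstandard(fit)
flagged <- games[which(abs(std_res) > 2), ]
print(flagged)
```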

Conclusion/Final Words

This project allowed us to learn how to find the correlation between two data vectors in R and how to fit a linear model to assess the significance of the relationship between two variables. Using linear regression, we determined that there is a strong relationship between the length of a baseball game and the number of pitchers used in the game.

Appendix

Code Used

Load the dataset
data <- read.csv("BaseballTimes.csv")

Fit the linear regression model
model <- lm(Time ~ Pitchers, data = data)

Show summary of the model
summary(model)

Generate scatter plot of Time vs. Pitchers
plot(data$Time, data$Pitchers,
     xlab = "Time", ylab = "Pitchers",
     main = "Scatter Plot of Time vs. Pitchers",
     pch = 19,      # Solid circle
     col = "blue")