The dataset is a summary of attendance at major league baseball parks.

Some definitions:

RS: Total runs scored by the team (only games and attendance only consider home games). RA: Total runs allowed by the team.

library(tidyverse)
library(corrplot)

df <- read_csv('C:\\Users\\pgood\\OneDrive\\Documents\\R\\Data 605\\baseball.csv')
baseball <- df %>%
  mutate(
    attendance_g = attendance/games
  )

cor_mat <- cor(select(baseball, - c(attendance, games, Team, league)))
corrplot(cor_mat, type = 'upper')

Since the number of home games is not standardized, we want to look at attendance per game.
Wins is strongly correlated with both runs scored and runs allowed, so it will capture some of their information. It also appears to have the strongest correlation with attendance. We’ll try it as our predictor variable.

ggplot(baseball) + geom_point(aes(wins, attendance_g )) + labs(y = 'Attendance Per Game', x = 'Team Wins')

ggplot(baseball) + geom_histogram(aes(wins), color = 'black', fill = 'darkgreen', binwidth = 5)

There appears to be a linear correlation. The linearity assumption holds. There also isn’t heavy skew, but we’ll still keep our eye on the residuals.

model <- lm(attendance_g ~ wins, baseball)
summary(model)
## 
## Call:
## lm(formula = attendance_g ~ wins, data = baseball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14719.7  -3872.2   -399.2   5616.9  11953.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    771.6     9151.7   0.084  0.93341   
## wins           366.9      112.1   3.272  0.00284 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6312 on 28 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2508 
## F-statistic: 10.71 on 1 and 28 DF,  p-value: 0.002836

Team wins does seem to offer some predictive power in relation to attendance.

diagnostics <- data.frame(model$model, residuals = model$residuals, 
                          fitted_values = model$fitted.values, residuals_abs = abs(model$residuals))

ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals)) + geom_hline(yintercept = 0)

qqnorm(diagnostics$residuals)
qqline(diagnostics$residuals)

ggplot(diagnostics) + geom_histogram(aes(x = residuals), color = "black", fill = "darkblue", binwidth = 3000)

ggplot(diagnostics) + geom_point(aes(x = fitted_values, y = residuals_abs))

The normal residuals assumption seems to hold, though it’s somewhat tough to evaluate with so few data points. There are no major outliers on the QQ plot, which is what we’d be most concerned with. There doesn’t seem to be any relation between the fitted values and variance level, so constand variance holds as well.