Discussion Week 11

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction

I will be using the built in Lahman data set in R to create the regression model. As a Baseball and Yankees fan the seemingly spike in strikeouts across the league (specifically last years Yankees) in hopes of hitting more home runs has been something that I’ve always wondered about. You see batters striking out more but also hitting more Home runs so I always wondered is the spike in strikeouts worth the spike in home runs for teams. So, I want to investigate how the number of times a team strikes out in a season effects their win total for the year.

library(Lahman)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Inspect the data

We can see from the head of the data frame that we have columns for the year, Team name, Team ID, Teams total wins and Teams Total Strikeouts (SO).

We can also see that our data frame has 300 rows in it, and data up to 2022.

# Load team data
teams_data <- Teams
# Select only the relevant columns
teams_data <- teams_data %>%
  select(yearID,name, teamID, W, SO)

# Focus on recent years for a more relevant analysis, e.g., last 10 years and remove 2020 beacuase it was a shorter season 
recent_data <- teams_data %>%
  filter(yearID >= 2013 & yearID != 2020)

Inspect the head of the data frame

print(head(recent_data))

##   yearID                 name teamID  W   SO
## 1   2013 Arizona Diamondbacks    ARI 81 1142
## 2   2013       Atlanta Braves    ATL 96 1384
## 3   2013    Baltimore Orioles    BAL 85 1125
## 4   2013       Boston Red Sox    BOS 97 1308
## 5   2013    Chicago White Sox    CHA 63 1207
## 6   2013         Chicago Cubs    CHN 66 1230

Inspect the tail we can see we have data up to the 2022 season

print(tail(recent_data))

##     yearID                 name teamID  W   SO
## 265   2022 San Francisco Giants    SFN 81 1462
## 266   2022  St. Louis Cardinals    SLN 93 1226
## 267   2022       Tampa Bay Rays    TBA 86 1395
## 268   2022        Texas Rangers    TEX 68 1446
## 269   2022    Toronto Blue Jays    TOR 92 1242
## 270   2022 Washington Nationals    WAS 55 1221

Inspect the dimensions to see how many rows we have. We can see we have 270 rows.

print(dim(recent_data))

## [1] 270   5

Visualize the data

ggplot(recent_data, aes(x = SO, y = W)) +
  geom_point() +  # Add points to scatter plot
  theme_minimal() + 
  labs(title = "Relationship Between Team Strikeouts and Wins",
       x = "Team Strikeouts (SO)",
       y = "Wins (W)")

The plot above shows Team Strikeouts on the x-axis and Team Wins on the y-axis. From this plot we can see that there does not seem to be a obvious relationship between Team Strikeouts and Team Wins. We can see on the plot that there are teams who struck out a ton over 1500 times but still won 100 games and then we can also see teams who barley struck out with less than 80 wins. Overall, there doesn’t seem to be a strong correlation in either the positive or negative direction for these two variables.

Linear model

# Linear regression: Wins ~ SO
model <- lm(W ~ SO, data = recent_data)

# Model summary
summary(model)

## 
## Call:
## lm(formula = W ~ SO, data = recent_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.216  -9.256  -0.467   8.747  31.014 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.826711   7.907216   13.64  < 2e-16 ***
## SO           -0.020262   0.005942   -3.41 0.000749 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.44 on 268 degrees of freedom
## Multiple R-squared:  0.04159,    Adjusted R-squared:  0.03801 
## F-statistic: 11.63 on 1 and 268 DF,  p-value: 0.0007491

We have a y-intercept of 107.826711 and a slope of -0.020262. This means that a team that strikes out 0 times would be expected to have 107 wins and for every SO they would be expected to lose about .02 wins this hints at a negative relationship between strikeouts and wins meaning as the number of SO increases the number of wins is expected to decrease.

Residual Analysis

Assessing the models summary we can see that we have a median close to zero similar min and max values and also similar quartiles so this data does seem to be normally distributed with a median close to 0.

We see a Multiple R Squared value of 0.04159 this shows that the model does not fit well to the data as we want the value to be closer to 1. A value of only .04159 says that only about 4% of the variability in Wins is explained by strikeouts. The Adjusted R-squared value is 0.03801 again a small number that we would like closer to 1.

We see a p-value of .0007491 which is good as we want as smaller p-values a significant p-value is anything below .05 so this suggests that there is a significant relationship between Wins and strikeouts. However the low \(r^2\) value points to the two not being that significant.

ggplot(recent_data, aes(x = SO, y = W)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Impact of Team Batting Strikeouts on Wins",
       x = "Team Batting Strikeouts",
       y = "Wins") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

plot(fitted(model), resid(model))

There seems to be no pattern in the plot above the points are scatterd all about the plot with no uniform distribution.

qqnorm(resid(model))
qqline(resid(model))

The plot above suggests that the residuals are not normally distributed. We can infer that because if they were then we would expect them to follow the straight line however we get some divergence at either end of the line indicating that they are not normally distributed.

par(mfrow=c(2,2))
plot(model)

Conclusion

In conclusion, there is no significant relationship between wins and team strikeouts. The simple regression model showed a significant p value but the \(r^2\) value was so low at only about 4%. This analysis shows that while strikeouts may be a factor in Wins it is only a part and there are many other variables/factors that effect Wins.

Week 11 discussion

Joey Bochnik

2024-04-04