Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I will be using the built in Lahman data set in R to create the regression model. As a Baseball and Yankees fan the seemingly spike in strikeouts across the league (specifically last years Yankees) in hopes of hitting more home runs has been something that I’ve always wondered about. You see batters striking out more but also hitting more Home runs so I always wondered is the spike in strikeouts worth the spike in home runs for teams. So, I want to investigate how the number of times a team strikes out in a season effects their win total for the year.
library(Lahman)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We can see from the head of the data frame that we have columns for the year, Team name, Team ID, Teams total wins and Teams Total Strikeouts (SO).
We can also see that our data frame has 300 rows in it, and data up to 2022.
# Load team data
teams_data <- Teams
# Select only the relevant columns
teams_data <- teams_data %>%
select(yearID,name, teamID, W, SO)
# Focus on recent years for a more relevant analysis, e.g., last 10 years and remove 2020 beacuase it was a shorter season
recent_data <- teams_data %>%
filter(yearID >= 2013 & yearID != 2020)
Inspect the head of the data frame
print(head(recent_data))
## yearID name teamID W SO
## 1 2013 Arizona Diamondbacks ARI 81 1142
## 2 2013 Atlanta Braves ATL 96 1384
## 3 2013 Baltimore Orioles BAL 85 1125
## 4 2013 Boston Red Sox BOS 97 1308
## 5 2013 Chicago White Sox CHA 63 1207
## 6 2013 Chicago Cubs CHN 66 1230
Inspect the tail we can see we have data up to the 2022 season
print(tail(recent_data))
## yearID name teamID W SO
## 265 2022 San Francisco Giants SFN 81 1462
## 266 2022 St. Louis Cardinals SLN 93 1226
## 267 2022 Tampa Bay Rays TBA 86 1395
## 268 2022 Texas Rangers TEX 68 1446
## 269 2022 Toronto Blue Jays TOR 92 1242
## 270 2022 Washington Nationals WAS 55 1221
Inspect the dimensions to see how many rows we have. We can see we have 270 rows.
print(dim(recent_data))
## [1] 270 5
ggplot(recent_data, aes(x = SO, y = W)) +
geom_point() + # Add points to scatter plot
theme_minimal() +
labs(title = "Relationship Between Team Strikeouts and Wins",
x = "Team Strikeouts (SO)",
y = "Wins (W)")
The plot above shows Team Strikeouts on the x-axis and Team Wins on the y-axis. From this plot we can see that there does not seem to be a obvious relationship between Team Strikeouts and Team Wins. We can see on the plot that there are teams who struck out a ton over 1500 times but still won 100 games and then we can also see teams who barley struck out with less than 80 wins. Overall, there doesn’t seem to be a strong correlation in either the positive or negative direction for these two variables.
# Linear regression: Wins ~ SO
model <- lm(W ~ SO, data = recent_data)
# Model summary
summary(model)
##
## Call:
## lm(formula = W ~ SO, data = recent_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.216 -9.256 -0.467 8.747 31.014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.826711 7.907216 13.64 < 2e-16 ***
## SO -0.020262 0.005942 -3.41 0.000749 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.44 on 268 degrees of freedom
## Multiple R-squared: 0.04159, Adjusted R-squared: 0.03801
## F-statistic: 11.63 on 1 and 268 DF, p-value: 0.0007491
We have a y-intercept of 107.826711 and a slope of -0.020262. This means that a team that strikes out 0 times would be expected to have 107 wins and for every SO they would be expected to lose about .02 wins this hints at a negative relationship between strikeouts and wins meaning as the number of SO increases the number of wins is expected to decrease.
Assessing the models summary we can see that we have a median close to zero similar min and max values and also similar quartiles so this data does seem to be normally distributed with a median close to 0.
We see a Multiple R Squared value of 0.04159 this shows that the model does not fit well to the data as we want the value to be closer to 1. A value of only .04159 says that only about 4% of the variability in Wins is explained by strikeouts. The Adjusted R-squared value is 0.03801 again a small number that we would like closer to 1.
We see a p-value of .0007491 which is good as we want as smaller p-values a significant p-value is anything below .05 so this suggests that there is a significant relationship between Wins and strikeouts. However the low \(r^2\) value points to the two not being that significant.
ggplot(recent_data, aes(x = SO, y = W)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(title = "Impact of Team Batting Strikeouts on Wins",
x = "Team Batting Strikeouts",
y = "Wins") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
plot(fitted(model), resid(model))
There seems to be no pattern in the plot above the points are scatterd all about the plot with no uniform distribution.
qqnorm(resid(model))
qqline(resid(model))
The plot above suggests that the residuals are not normally distributed. We can infer that because if they were then we would expect them to follow the straight line however we get some divergence at either end of the line indicating that they are not normally distributed.
par(mfrow=c(2,2))
plot(model)
In conclusion, there is no significant relationship between wins and team strikeouts. The simple regression model showed a significant p value but the \(r^2\) value was so low at only about 4%. This analysis shows that while strikeouts may be a factor in Wins it is only a part and there are many other variables/factors that effect Wins.