head(mlb11)
## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
Using a scatter plot shows moderate linearity. A linear model should give a reasonable estimate of runs based on at-bats.
plot(runs ~ at_bats, data = mlb11, ylab = "runs", xlab = "at bats", col='blue', pch=19)
z <- line(mlb11$at_bats, mlb11$runs)
z
##
## Call:
## line(mlb11$at_bats, mlb11$runs)
##
## Coefficients:
## [1] -2661.1626 0.6043
abline(coef(z),col='red')
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
The relationship between at-bats and runs is linear with a positive slope. A correlation of 0.61 shows moderate positive linearity. There’s one point that could be considered high leverage, where the (“wretched”, as my husband calls them) New York Yankees got 867 runs with 5518 at-bats. But it doesn’t seem that this point influences the slope of the regression line.
The smallest sum of squares I got from using the interactive plots is 159179.4
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Regression line equation for runs based in home runs: \(\hat{y}\) = 415.2389 + 1.8345 ∗ homeruns
The slope of the line is positive, so for every home run the number of runs increases 1.8345. Which means that there are additional players on base 1.8345 times that a home run is hit.
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
y = −2789.2429 + 0.6305 ∗ atbats = −2789.2429 + 0.6305 ∗ 5578 = 727.6861
This is an overestimate if we go by the closest point, the Phillies with 5579 at bats got 713 runs, which would give a residual of -14 runs. The range of runs for at bats between approximately 5570 and 5650 are all below the regression line.
plot(mlb11$runs ~ mlb11$at_bats, ylab = "runs", xlab = "at bats", col='blue', pch=19)
abline(m1, col = 'red')
There is not a discernable pattern to the residuals. The relationship between runs and at-bats is linear.
plot(m1$residuals ~ mlb11$at_bats, col='orange', pch = 19, ylab = "residuals", xlab = "at bats")
abline(h = 0, lty = 2, col='blue') # adds a horizontal dashed line at y = 0
hist(m1$residuals, col='gold2', main = 'mb11 runs per at-bats residuals', xlab = 'residuals')
qqnorm(m1$residuals, col = 'orange', pch = 19)
qqline(m1$residuals, col = 'blue') # adds diagonal line to the normal prob plot
The distribution of the residuals is close to normal.
The variability of points in the scatterplot of runs by at-bats looks like is has constant variability in the data. The scale location plot has a horizontal line with randomly spread points
par(mfrow=c(2,2))
plot(m1)
The scatterplot of runs by hits looks linear.
m3 <- lm(runs ~ hits, data = mlb11)
plot(runs ~ hits, data = mlb11, ylab = "runs", xlab = "hits", col='blue', pch=19)
abline(m3, col = 'red')
The R2 value for runs_by_hits = 0.642 which is > the R2 value for runs_by_atbats = 0.373. So runs_by_hits accounts for a greater percentage of the variation in the data, and is a better model than runs_by_atbats.
summary(m3)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
par(mfrow=c(2,2))
plot(m3)
Team batting average has the best R2 = 0.656. The scatter plot is closer to linear than hits. The distribution of the residuals in the qqplot are almost normal, there are no influencing outliers, and the homoscedasticity looks good in the scale-location plot.
s1 <- summary(m1) # at-bats
s2 <- summary(m2) # homeruns
s3 <- summary(m3) # hits
s4 <- summary(lm(runs ~ bat_avg, data = mlb11))
s5 <- summary(lm(runs ~ strikeouts, data = mlb11))
s6 <- summary(lm(runs ~ stolen_bases, data = mlb11))
s7 <- summary(lm(runs ~ wins, data = mlb11))
models <- tibble(name = c(s1$terms[[3]], s2$terms[[3]], s3$terms[[3]], s4$terms[[3]],
s5$terms[[3]], s6$terms[[3]], s7$terms[[3]]),
rsquared = c(s1$r.squared, s2$r.squared, s3$r.squared, s4$r.squared,
s5$r.squared, s6$r.squared, s7$r.squared),
fstatistic = c(s1$fstatistic[1], s2$fstatistic[1], s3$fstatistic[1], s4$fstatistic[1],
s5$fstatistic[1], s6$fstatistic[1], s7$fstatistic[1]))
models <- arrange(models, desc(rsquared))
models$name <- as.character(models$name)
kable(models, align = c('l', 'l', 'l'), table.attr = "style = \"color: blue;\"") %>% kable_minimal()
| name | rsquared | fstatistic |
|---|---|---|
| bat_avg | 0.6560771 | 53.4136041 |
| hits | 0.6419388 | 50.1989153 |
| homeruns | 0.6265636 | 46.9792942 |
| at_bats | 0.3728654 | 16.6475120 |
| wins | 0.3609712 | 15.8164901 |
| strikeouts | 0.1693579 | 5.7088634 |
| stolen_bases | 0.0029140 | 0.0818302 |
m4 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg, ylab = "runs", xlab = "batting avg.", col='blue', pch=19)
abline(m4, col = 'red')
par(mfrow=c(2,2))
plot(m4)
All of the new variables have better R2 values than the traditional variables. new_obs has the highest at 0.935 (which seems almost Too Good). The scatter plot of runs by new_obs is very tightly linear.
n1 <- summary(lm(runs ~ new_onbase, data = mlb11))
n2 <- summary(lm(runs ~ new_slug, data = mlb11))
n3 <- summary(lm(runs ~ new_obs, data = mlb11))
modeln <- tibble(name = c(n1$terms[[3]], n2$terms[[3]], n3$terms[[3]]),
rsquared = c(n1$r.squared, n2$r.squared, n3$r.squared),
fstatistic = c(n1$fstatistic[1], n2$fstatistic[1], n3$fstatistic[1]))
modeln <- arrange(modeln, desc(rsquared))
modeln$name <- as.character(modeln$name)
kable(modeln, align = c('l', 'l', 'l'), table.attr = "style = \"color: blue;\"") %>% kable_minimal()
| name | rsquared | fstatistic |
|---|---|---|
| new_obs | 0.9349271 | 402.2868 |
| new_slug | 0.8968704 | 243.5030 |
| new_onbase | 0.8491053 | 157.5598 |
par(mfrow = c(1, 3))
plot(runs ~ new_onbase, data = mlb11, ylab = "runs", xlab = "new_onbase", col='blue', pch=19)
abline(lm(runs ~ new_onbase, data = mlb11), col = 'red')
plot(runs ~ new_slug, data = mlb11, ylab = "runs", xlab = "new_slug", col='blue', pch=19)
abline(lm(runs ~ new_slug, data = mlb11), col = 'red')
plot(runs ~ new_obs, data = mlb11, ylab = "runs", xlab = "new_obs", col='blue', pch=19)
abline(lm(runs ~ new_obs, data = mlb11), col = 'red')
The residuals qqplot for new_obs is very close to normal and the scale location plot has very even spread of the variance. Runs by new_obs is a very good linear model.
mbest <- lm(runs ~ new_obs, data = mlb11)
summary(mbest)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mbest)