Lab 7: Linear Regression
I would use a scatterplot to show the relationship between runs and at_bats. Our data appears to have a linear relationship, but more tests will need to be used to be sure.

Our correlation coefficient is 0.610627 .
The relationship between runs and at bats appears to show that a player with more at-bats would be expected to have score more runs. It appears to be an increasing relationship. Two points may be outliers, and could be checked for leverage. The relationship appears to be stable. More exploration would be necessary to test the residuals, but the relationship appears to be linear. Fewer points occur at higher levels.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Note: Plot ss does not appear to have the functionality described in the lab. However, if we were to fit the line manually, we would not expect to achieve as well a result as if we compute it by maximizing a function.
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388

##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
The homerun to run equation would be: runs = 415.24 + 1.8345 * homeruns. The slope tells us that for every home run a team hits, it is expected to have 1.8 more runs.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

There does not appear to be any pattern in the residuals plot. Our residuals appear to be close to normal. We would be more comfortable if they were closer, but this appears to be close enough. Our linear regression appears to be appropriate.
library(lmtest)
bptest(m1)
##
## studentized Breusch-Pagan test
##
## data: m1
## BP = 0.01269, df = 1, p-value = 0.9103
str(mlb11)
## 'data.frame': 30 obs. of 12 variables:
## $ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
## $ runs : int 855 875 787 730 762 718 867 721 735 615 ...
## $ at_bats : int 5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
## $ hits : int 1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
## $ homeruns : int 210 203 169 129 162 108 222 185 163 95 ...
## $ bat_avg : num 0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
## $ strikeouts : int 930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
## $ stolen_bases: int 143 102 49 153 57 130 147 94 118 118 ...
## $ wins : int 96 90 95 71 90 77 97 96 73 56 ...
## $ new_onbase : num 0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
## $ new_slug : num 0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
## $ new_obs : num 0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...
The constant variability test is met. The variance does not appear to change much and a Breusch-Pagan test confirms that.
On your own:
We chose to test if average batting average for a team was a linear predictor for number of runs scored. We start with a graph and a linear regression:

It appears that batting average is an appropriate linear predictor for runs scored.
m3 <- lm(runs ~ bat_avg, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
plot(mlb11$runs ~ mlb11$bat_avg)
abline(lm(runs ~ bat_avg, data = mlb11))

This relationship appears to be stronger than the relationship between at bats and runs. The R2 for our new regression is .6561, compared to .3729 for runs~ at bats. Our new variable has a correlation that suggests it explains more of the variation than at-bats.
plot(m3$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3)

hist(m3$residuals)

qqnorm(m3$residuals)
qqline(m3$residuals)


##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07

Hits are a good predictor for runs. With an R2 of .6419, hits explain a good amount of the variation in runs. Batting average still has a higher coefficient of determination.

##
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386

## 'data.frame': 30 obs. of 12 variables:
## $ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
## $ runs : int 855 875 787 730 762 718 867 721 735 615 ...
## $ at_bats : int 5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
## $ hits : int 1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
## $ homeruns : int 210 203 169 129 162 108 222 185 163 95 ...
## $ bat_avg : num 0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
## $ strikeouts : int 930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
## $ stolen_bases: int 143 102 49 153 57 130 147 94 118 118 ...
## $ wins : int 96 90 95 71 90 77 97 96 73 56 ...
## $ new_onbase : num 0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
## $ new_slug : num 0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
## $ new_obs : num 0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...
With an R2 of .1694, strikeouts is not a great predictor for runs. Further, it is statistically significant at the .05 level, but not at .001 like our other predictors. This is not surprising. We can expect that better teams may have better offense and defense. But this predictor, as a defensive statistic, is further removed from the other relationships, which are all related to offense.
==================================================
Finally, we investigate the new statistics cited by the authors of “Moneyball”. There are three such variables in our data set: on-base percentage, slugging and on-base percentage plus slugging.

##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13


##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15


##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16




## [1] "The mean of the residuals is "
## [1] 2.181126e-16
##
## studentized Breusch-Pagan test
##
## data: m8
## BP = 0.056807, df = 1, p-value = 0.8116
It turns out the new statistics are very well correlated with runs scored. The R2 of the conglomerate statistic, on-base percentage plus slugging is .9349. The p-value for the slope of the regression is 2 * 10-16 . The residuals are approximately normal. They appear to be stable and our model for our new statistic is an appropriate model.