Lab 7: Linear Regression

Task: perform a series of linear regressions to test whether new statistics for baseball are better correlated with runs scored than traditional baseball statistics.

## 'data.frame':    30 obs. of  12 variables:
##  $ team        : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
##  $ runs        : int  855 875 787 730 762 718 867 721 735 615 ...
##  $ at_bats     : int  5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
##  $ hits        : int  1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
##  $ homeruns    : int  210 203 169 129 162 108 222 185 163 95 ...
##  $ bat_avg     : num  0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
##  $ strikeouts  : int  930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
##  $ stolen_bases: int  143 102 49 153 57 130 147 94 118 118 ...
##  $ wins        : int  96 90 95 71 90 77 97 96 73 56 ...
##  $ new_onbase  : num  0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
##  $ new_slug    : num  0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
##  $ new_obs     : num  0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...

I would use a scatterplot to show the relationship between runs and at_bats. Our data appears to have a linear relationship, but more tests will need to be used to be sure.

Our correlation coefficient is 0.610627 .

The relationship between runs and at bats appears to show that a player with more at-bats would be expected to have score more runs. It appears to be an increasing relationship. Two points may be outliers, and could be checked for leverage. The relationship appears to be stable. More exploration would be necessary to test the residuals, but the relationship appears to be linear. Fewer points occur at higher levels.

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Note: Plot ss does not appear to have the functionality described in the lab. However, if we were to fit the line manually, we would not expect to achieve as well a result as if we compute it by maximizing a function.

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

The homerun to run equation would be: runs = 415.24 + 1.8345 * homeruns. The slope tells us that for every home run a team hits, it is expected to have 1.8 more runs.

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

If a team manager saw the least squares regression line and not the actual data, he or she would predict a team with 5,578 at-bats would have 727.7 runs. We have no residual for this value as we have no observation of a team with 5578 at-bats.

## [1] "The mean of the residuals is "
## [1] -1.067202e-15

There does not appear to be any pattern in the residuals plot. Our residuals appear to be close to normal. We would be more comfortable if they were closer, but this appears to be close enough. Our linear regression appears to be appropriate.

library(lmtest)
bptest(m1)
## 
##  studentized Breusch-Pagan test
## 
## data:  m1
## BP = 0.01269, df = 1, p-value = 0.9103
str(mlb11)
## 'data.frame':    30 obs. of  12 variables:
##  $ team        : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
##  $ runs        : int  855 875 787 730 762 718 867 721 735 615 ...
##  $ at_bats     : int  5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
##  $ hits        : int  1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
##  $ homeruns    : int  210 203 169 129 162 108 222 185 163 95 ...
##  $ bat_avg     : num  0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
##  $ strikeouts  : int  930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
##  $ stolen_bases: int  143 102 49 153 57 130 147 94 118 118 ...
##  $ wins        : int  96 90 95 71 90 77 97 96 73 56 ...
##  $ new_onbase  : num  0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
##  $ new_slug    : num  0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
##  $ new_obs     : num  0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...

The constant variability test is met. The variance does not appear to change much and a Breusch-Pagan test confirms that.

On your own:

We chose to test if average batting average for a team was a linear predictor for number of runs scored. We start with a graph and a linear regression:

It appears that batting average is an appropriate linear predictor for runs scored.

m3 <- lm(runs ~ bat_avg, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
plot(mlb11$runs ~ mlb11$bat_avg)
abline(lm(runs ~ bat_avg, data = mlb11))

This relationship appears to be stronger than the relationship between at bats and runs. The R2 for our new regression is .6561, compared to .3729 for runs~ at bats. Our new variable has a correlation that suggests it explains more of the variation than at-bats.

plot(m3$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3) 

hist(m3$residuals)

qqnorm(m3$residuals)
qqline(m3$residuals)  

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

Hits are a good predictor for runs. With an R2 of .6419, hits explain a good amount of the variation in runs. Batting average still has a higher coefficient of determination.

## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386

## 'data.frame':    30 obs. of  12 variables:
##  $ team        : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
##  $ runs        : int  855 875 787 730 762 718 867 721 735 615 ...
##  $ at_bats     : int  5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
##  $ hits        : int  1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
##  $ homeruns    : int  210 203 169 129 162 108 222 185 163 95 ...
##  $ bat_avg     : num  0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
##  $ strikeouts  : int  930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
##  $ stolen_bases: int  143 102 49 153 57 130 147 94 118 118 ...
##  $ wins        : int  96 90 95 71 90 77 97 96 73 56 ...
##  $ new_onbase  : num  0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
##  $ new_slug    : num  0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
##  $ new_obs     : num  0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...

==================================================

Finally, we investigate the new statistics cited by the authors of “Moneyball”. There are three such variables in our data set: on-base percentage, slugging and on-base percentage plus slugging.

## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13

## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

## [1] "The mean of the residuals is "
## [1] 2.181126e-16
## 
##  studentized Breusch-Pagan test
## 
## data:  m8
## BP = 0.056807, df = 1, p-value = 0.8116

It turns out the new statistics are very well correlated with runs scored. The R2 of the conglomerate statistic, on-base percentage plus slugging is .9349. The p-value for the slope of the regression is 2 * 10-16 . The residuals are approximately normal. They appear to be stable and our model for our new statistic is an appropriate model.