Batter up

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

The data

Let’s load up the data for the 2011 season.

In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins.

We use backward elimination with the p-value approach. The structure of the data set is:

## 'data.frame':    30 obs. of  12 variables:
##  $ team        : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
##  $ runs        : int  855 875 787 730 762 718 867 721 735 615 ...
##  $ at_bats     : int  5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
##  $ hits        : int  1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
##  $ homeruns    : int  210 203 169 129 162 108 222 185 163 95 ...
##  $ bat_avg     : num  0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
##  $ strikeouts  : int  930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
##  $ stolen_bases: int  143 102 49 153 57 130 147 94 118 118 ...
##  $ wins        : int  96 90 95 71 90 77 97 96 73 56 ...
##  $ new_onbase  : num  0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
##  $ new_slug    : num  0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
##  $ new_obs     : num  0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...
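The loading code is not shown, so as a minimal sketch the first ten rows printed above can be re-entered by hand. The name `mlb11_head` is an assumption introduced for illustration. The real `mlb11` data frame has 30 rows, so nothing fitted on this subset will reproduce the summaries in this lab. The final check illustrates that `bat_avg` is simply `hits / at_bats`, which is worth keeping in mind when all three enter the same model.

```r
# First ten rows of mlb11, copied from the str() output above.
# mlb11_head is a hypothetical name; the full data set has 30 teams.
mlb11_head <- data.frame(
  runs         = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  at_bats      = c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598),
  hits         = c(1599, 1600, 1540, 1560, 1513, 1477, 1452, 1422, 1429, 1442),
  homeruns     = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95),
  bat_avg      = c(0.283, 0.280, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261, 0.258, 0.258),
  strikeouts   = c(930, 1108, 1143, 1006, 978, 1085, 1138, 1083, 1201, 1164),
  stolen_bases = c(143, 102, 49, 153, 57, 130, 147, 94, 118, 118),
  wins         = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56)
)

# Inspect the structure of the hand-entered sample.
str(mlb11_head)

# Largest gap between the reported batting average and hits / at_bats;
# it should be no more than rounding error in the third decimal place.
max_gap <- max(abs(mlb11_head$hits / mlb11_head$at_bats - mlb11_head$bat_avg))
print(max_gap)
```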

We start with a multiple linear model that uses all seven variables:

## 
## Call:
## lm(formula = runs ~ at_bats + hits + homeruns + bat_avg + strikeouts + 
##     stolen_bases + wins, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.273 -17.965   2.141  20.011  40.257 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.025e+03  3.750e+03   0.540 0.594549    
## at_bats      -4.764e-01  6.679e-01  -0.713 0.483159    
## hits          2.047e+00  2.599e+00   0.787 0.439522    
## homeruns      1.030e+00  2.220e-01   4.639 0.000127 ***
## bat_avg      -7.568e+03  1.458e+04  -0.519 0.608816    
## strikeouts    4.780e-02  6.733e-02   0.710 0.485216    
## stolen_bases  5.207e-01  1.705e-01   3.053 0.005825 ** 
## wins          9.586e-01  6.783e-01   1.413 0.171559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.08 on 22 degrees of freedom
## Multiple R-squared:  0.9182, Adjusted R-squared:  0.8922 
## F-statistic:  35.3 on 7 and 22 DF,  p-value: 1.562e-10
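Backward elimination with the p-value approach can be sketched as a loop: fit the model, drop the predictor with the largest p-value if it exceeds the threshold, and refit. Under that rule the first candidate in the summary above would be `bat_avg` (p = 0.608816). The helper `backward_eliminate()`, the name `sample_teams`, and the threshold of 0.05 are assumptions made for this sketch, and it is not necessarily the procedure used for the later models in this lab. The data are the ten rows shown by `str()` earlier, so the result will differ from a fit on all 30 teams.

```r
# First ten rows of mlb11 as printed by str(); the real data has 30 rows.
sample_teams <- data.frame(
  runs         = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  at_bats      = c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598),
  hits         = c(1599, 1600, 1540, 1560, 1513, 1477, 1452, 1422, 1429, 1442),
  homeruns     = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95),
  bat_avg      = c(0.283, 0.280, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261, 0.258, 0.258),
  strikeouts   = c(930, 1108, 1143, 1006, 978, 1085, 1138, 1083, 1201, 1164),
  stolen_bases = c(143, 102, 49, 153, 57, 130, 147, 94, 118, 118),
  wins         = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56)
)

# Hypothetical helper: refit, drop the predictor with the largest p-value,
# and stop once every remaining predictor has p <= alpha.
backward_eliminate <- function(response, predictors, data, alpha = 0.05) {
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    p_values <- coefs[, "Pr(>|t|)"]
    names(p_values) <- rownames(coefs)
    if (max(p_values, na.rm = TRUE) <= alpha) return(fit)
    predictors <- setdiff(predictors, names(which.max(p_values)))
  }
  # No predictor survived: fall back to the intercept-only model.
  lm(reformulate("1", response = response), data = data)
}

predictors <- c("at_bats", "hits", "homeruns", "bat_avg",
                "strikeouts", "stolen_bases", "wins")
final_fit <- backward_eliminate("runs", predictors, sample_teams)
print(formula(final_fit))
```

Each pass removes exactly one predictor, so the loop always terminates, at the latest with the intercept-only model.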

Next, we look at the p-value of each variable in a simple linear regression of runs on that variable alone.

runs~hits

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
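To make the fitted line concrete, the coefficients in the summary above can be plugged into a small prediction function. The helper name `predict_runs` and the example value of 1,500 hits are assumptions chosen for illustration; only the two coefficients come from the output.

```r
# Coefficients copied from the runs ~ hits summary above.
intercept <- -375.5600
slope     <- 0.7589

# Hypothetical helper: predicted runs for a given number of hits.
predict_runs <- function(hits) intercept + slope * hits

# Predicted runs for a team with 1,500 hits in the season.
print(predict_runs(1500))

# Predicted change in runs for 100 additional hits (100 times the slope).
print(predict_runs(1600) - predict_runs(1500))
```

The first value is about 763 runs, and 100 extra hits correspond to roughly 76 extra runs under this model.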

runs~homeruns

## 
## Call:
## lm(formula = runs ~ homeruns + homeruns^2, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
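The call above includes `homeruns^2`, yet the summary reports a single slope. Inside an R formula, `^` is a formula operator rather than arithmetic, so `homeruns^2` reduces to `homeruns`. If a squared term is intended, it has to be wrapped in `I()`. The sketch below shows the difference on the first ten rows printed by `str()`; the name `sample_teams` is an assumption, and the coefficients will not match a fit on all 30 teams.

```r
# First ten values of runs and homeruns from the str() output.
sample_teams <- data.frame(
  runs     = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95)
)

# Without I(): the squared term collapses into the linear term.
fit_plain <- lm(runs ~ homeruns + homeruns^2, data = sample_teams)

# With I(): the square is computed arithmetically and gets its own coefficient.
fit_quad <- lm(runs ~ homeruns + I(homeruns^2), data = sample_teams)

print(names(coef(fit_plain)))  # intercept and homeruns only
print(names(coef(fit_quad)))   # intercept, homeruns, and I(homeruns^2)
```

The same applies to the `strikeouts^2` term in the strikeouts model.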

runs~bat_avg

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

runs~strikeouts

## 
## Call:
## lm(formula = runs ~ strikeouts^2 + strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386

runs~stolen_bases

## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769

The p-value for stolen_bases is 0.777, which is greater than 0.05, so on its own this variable is not a significant predictor of runs.
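The negative adjusted R-squared in the stolen_bases summary follows from the standard adjustment, \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), with \(n = 30\) teams and \(p = 1\) predictor. The helper name `adjusted_r2` is an assumption for this sketch; the inputs are taken from the output above.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (not counting the intercept).
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from the runs ~ stolen_bases summary: R-squared 0.002914, 30 teams.
adj_stolen <- adjusted_r2(r2 = 0.002914, n = 30, p = 1)
print(round(adj_stolen, 4))
```

Because the raw R-squared is so close to zero, the penalty for the extra parameter pushes the adjusted value to about -0.0327, matching the summary.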

runs~wins

## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469

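We remove the variable stolen_bases because its p-value in the simple regression is 0.777 > 0.05, and we add log(homeruns) to the model. Note that in the seven-variable model above stolen_bases has a p-value of 0.005825, so this choice rests on the simple regression alone.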

## 
## Call:
## lm(formula = runs ~ at_bats + hits + homeruns + log(homeruns) + 
##     bat_avg + strikeouts + wins, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.823 -18.498  -2.477  22.210  41.595 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    2.520e+03  4.010e+03   0.628  0.53618   
## at_bats       -2.223e-01  7.107e-01  -0.313  0.75734   
## hits           1.080e+00  2.764e+00   0.391  0.69982   
## homeruns       4.057e+00  1.309e+00   3.100  0.00523 **
## log(homeruns) -4.508e+02  1.888e+02  -2.388  0.02598 * 
## bat_avg       -2.527e+03  1.549e+04  -0.163  0.87193   
## strikeouts     4.946e-02  7.160e-02   0.691  0.49688   
## wins           1.270e+00  7.298e-01   1.741  0.09572 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.79 on 22 degrees of freedom
## Multiple R-squared:  0.9076, Adjusted R-squared:  0.8781 
## F-statistic: 30.85 on 7 and 22 DF,  p-value: 5.881e-10

The fitted multiple linear model, where \(\log\) is the natural logarithm, is: \(\widehat{runs} = 2520 - 0.2223 \times at\_bats + 1.080 \times hits + 4.057 \times homeruns - 450.8 \times \log(homeruns) - 2527 \times bat\_avg + 0.04946 \times strikeouts + 1.270 \times wins\)

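This multiple linear model explains 90.76% of the variance in runs (multiple R-squared = 0.9076, adjusted R-squared = 0.8781). That is a better fit than any of the simple linear models above, the best of which, runs~bat_avg, has an R-squared of 0.6561.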
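One way to compare the fits side by side is to collect the R-squared values reported in the summaries and rank them by adjusted R-squared, which penalizes extra predictors. The table below only re-enters numbers printed in this lab; the model labels and the names `models` and `ranked` are assumptions for illustration.

```r
# R-squared values copied from the summaries in this lab (30 teams each).
models <- data.frame(
  model  = c("hits", "homeruns", "bat_avg", "strikeouts", "stolen_bases",
             "wins", "all seven", "final (with log(homeruns))"),
  r2     = c(0.6419, 0.6266, 0.6561, 0.1694, 0.002914, 0.361, 0.9182, 0.9076),
  adj_r2 = c(0.6292, 0.6132, 0.6438, 0.1397, -0.0327, 0.3381, 0.8922, 0.8781),
  stringsAsFactors = FALSE
)

# Rank the models from highest to lowest adjusted R-squared.
ranked <- models[order(models$adj_r2, decreasing = TRUE), ]
print(ranked, row.names = FALSE)
```

Both multiple models sit well above every single-predictor model. On adjusted R-squared alone, the original seven-variable model (0.8922) is slightly ahead of the final one (0.8781).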