The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
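The loading step itself is not shown here. A minimal sketch, assuming the data were saved locally as an `.RData` file that contains a data frame named `mlb11`; the file name `mlb11.RData` and the helper `load_mlb11` are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: load the mlb11 data frame from a local .RData file.
# The file name and location are assumptions, not taken from the lab.
load_mlb11 <- function(path = "mlb11.RData") {
  if (!file.exists(path)) {
    return(NULL)                      # nothing to load; the caller decides what to do
  }
  env <- new.env()
  load(path, envir = env)             # load() restores the saved objects into env
  if (!exists("mlb11", envir = env, inherits = FALSE)) {
    return(NULL)                      # the file did not contain an object called mlb11
  }
  get("mlb11", envir = env)
}

mlb11 <- load_mlb11()
if (!is.null(mlb11)) {
  str(mlb11)                          # one line per variable: name, type, first values
}
```

Loading into a separate environment keeps the helper from silently overwriting objects in the workspace.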
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins.
## 'data.frame': 30 obs. of 12 variables:
## $ team : Factor w/ 30 levels "Arizona Diamondbacks",..: 28 4 10 13 26 18 19 16 9 12 ...
## $ runs : int 855 875 787 730 762 718 867 721 735 615 ...
## $ at_bats : int 5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
## $ hits : int 1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
## $ homeruns : int 210 203 169 129 162 108 222 185 163 95 ...
## $ bat_avg : num 0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
## $ strikeouts : int 930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
## $ stolen_bases: int 143 102 49 153 57 130 147 94 118 118 ...
## $ wins : int 96 90 95 71 90 77 97 96 73 56 ...
## $ new_onbase : num 0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
## $ new_slug : num 0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
## $ new_obs : num 0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...
run.lm <- lm(runs ~ at_bats + hits + homeruns + bat_avg + strikeouts + stolen_bases + wins, data = mlb11)
summary(run.lm)
##
## Call:
## lm(formula = runs ~ at_bats + hits + homeruns + bat_avg + strikeouts +
## stolen_bases + wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.273 -17.965 2.141 20.011 40.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.025e+03 3.750e+03 0.540 0.594549
## at_bats -4.764e-01 6.679e-01 -0.713 0.483159
## hits 2.047e+00 2.599e+00 0.787 0.439522
## homeruns 1.030e+00 2.220e-01 4.639 0.000127 ***
## bat_avg -7.568e+03 1.458e+04 -0.519 0.608816
## strikeouts 4.780e-02 6.733e-02 0.710 0.485216
## stolen_bases 5.207e-01 1.705e-01 3.053 0.005825 **
## wins 9.586e-01 6.783e-01 1.413 0.171559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.08 on 22 degrees of freedom
## Multiple R-squared: 0.9182, Adjusted R-squared: 0.8922
## F-statistic: 35.3 on 7 and 22 DF, p-value: 1.562e-10
runs~hits
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
runs~homeruns
##
## Call:
## lm(formula = runs ~ homeruns + homeruns^2, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
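The call above asks for `homeruns + homeruns^2`, yet the coefficient table contains only a `homeruns` row. Inside an R formula, `^` is a crossing operator rather than arithmetic, so `homeruns^2` collapses to `homeruns`; a squared term has to be wrapped in `I()`. The same applies to the `strikeouts^2` call later in this lab. A minimal sketch, using only the first ten teams printed in the structure listing (so the estimates will differ from the 30-team fit):

```r
# First ten teams only, copied from the structure listing; not the full data set.
mlb11_head <- data.frame(
  runs     = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95)
)

# Inside a formula, ^ means "crossing", so homeruns^2 reduces to homeruns.
fit_plain <- lm(runs ~ homeruns + homeruns^2, data = mlb11_head)

# I() protects the expression so that ^ is evaluated arithmetically.
fit_quad <- lm(runs ~ homeruns + I(homeruns^2), data = mlb11_head)

names(coef(fit_plain))  # "(Intercept)" "homeruns"
names(coef(fit_quad))   # "(Intercept)" "homeruns" "I(homeruns^2)"
```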
runs~bat_avg
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
runs~strikeouts
##
## Call:
## lm(formula = runs ~ strikeouts^2 + strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386
runs~stolen_bases
##
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.94 -62.87 10.01 38.54 182.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 677.3074 58.9751 11.485 4.17e-12 ***
## stolen_bases 0.1491 0.5211 0.286 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared: 0.002914, Adjusted R-squared: -0.0327
## F-statistic: 0.08183 on 1 and 28 DF, p-value: 0.7769
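The p-value (0.777) is greater than 0.05, so on its own stolen_bases is not a statistically significant linear predictor of runs. The simple model explains less than 1% of the variance in runs (R-squared = 0.0029).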
runs~wins
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
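The one-predictor fits above can also be produced in a single pass, which makes it easier to rank the predictors by R-squared. A minimal sketch, assuming a ten-team excerpt (`mlb11_head`, a name introduced here) built from the values printed in the structure listing; its numbers will not match the 30-team output.

```r
# First ten teams only, copied from the structure listing; not the full data set.
mlb11_head <- data.frame(
  runs         = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  at_bats      = c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598),
  hits         = c(1599, 1600, 1540, 1560, 1513, 1477, 1452, 1422, 1429, 1442),
  homeruns     = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95),
  bat_avg      = c(0.283, 0.28, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261, 0.258, 0.258),
  strikeouts   = c(930, 1108, 1143, 1006, 978, 1085, 1138, 1083, 1201, 1164),
  stolen_bases = c(143, 102, 49, 153, 57, 130, 147, 94, 118, 118),
  wins         = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56)
)

predictors <- setdiff(names(mlb11_head), "runs")

# Fit runs ~ <predictor> for each predictor and keep the R-squared of each fit.
r_squared <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "runs"), data = mlb11_head)
  summary(fit)$r.squared
})

sort(r_squared, decreasing = TRUE)    # best single predictor first
```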
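We remove stolen_bases from the multiple regression because its p-value in the simple model, 0.777, exceeds 0.05, and we add log(homeruns) as an extra term. Note that in the first multiple regression the stolen_bases coefficient was significant (p = 0.0058), so this choice rests on the simple-model result alone.
run.lm <- lm(runs ~ at_bats + hits + homeruns + log(homeruns) + bat_avg + strikeouts + wins, data = mlb11)
summary(run.lm)
##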
## Call:
## lm(formula = runs ~ at_bats + hits + homeruns + log(homeruns) +
## bat_avg + strikeouts + wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.823 -18.498 -2.477 22.210 41.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.520e+03 4.010e+03 0.628 0.53618
## at_bats -2.223e-01 7.107e-01 -0.313 0.75734
## hits 1.080e+00 2.764e+00 0.391 0.69982
## homeruns 4.057e+00 1.309e+00 3.100 0.00523 **
## log(homeruns) -4.508e+02 1.888e+02 -2.388 0.02598 *
## bat_avg -2.527e+03 1.549e+04 -0.163 0.87193
## strikeouts 4.946e-02 7.160e-02 0.691 0.49688
## wins 1.270e+00 7.298e-01 1.741 0.09572 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.79 on 22 degrees of freedom
## Multiple R-squared: 0.9076, Adjusted R-squared: 0.8781
## F-statistic: 30.85 on 7 and 22 DF, p-value: 5.881e-10
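When a term is dropped from a model, a partial F-test between the nested fits is one way to check how much fit is lost. This is an illustration of that technique, not the procedure used in this lab: it uses a deliberately small model and only the first ten teams from the structure listing, so its result says nothing about the 30-team data.

```r
# First ten teams only, copied from the structure listing; not the full data set.
mlb11_head <- data.frame(
  runs         = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615),
  homeruns     = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95),
  stolen_bases = c(143, 102, 49, 153, 57, 130, 147, 94, 118, 118),
  wins         = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56)
)

# Larger model, and the same model with stolen_bases removed.
full    <- lm(runs ~ homeruns + wins + stolen_bases, data = mlb11_head)
reduced <- update(full, . ~ . - stolen_bases)

# anova() on two nested lm fits performs the partial F-test.
comparison <- anova(reduced, full)
comparison[2, "Pr(>F)"]               # p-value for the dropped term
```

A small p-value here would suggest the dropped term carries information the remaining predictors do not.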
The fitted multiple linear model, where \(\log\) is the natural logarithm, is: \(runs = 2520 - 0.2223 \times at\_bats + 1.080 \times hits + 4.057 \times homeruns - 450.8 \times \log(homeruns) - 2527 \times bat\_avg + 0.04946 \times strikeouts + 1.270 \times wins\)
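As a worked example, the fitted equation can be evaluated for the first team in the structure listing (5659 at-bats, 1599 hits, 210 home runs, batting average 0.283, 930 strikeouts, 96 wins, 855 observed runs). The function name `predict_runs` is introduced here for illustration, and the coefficients are the rounded values from the summary table, so the result is only approximate.

```r
# Evaluate the fitted equation by hand, using the rounded coefficients
# printed in the summary table (log() in R is the natural logarithm).
predict_runs <- function(at_bats, hits, homeruns, bat_avg, strikeouts, wins) {
  2520 -
    0.2223  * at_bats +
    1.080   * hits +
    4.057   * homeruns -
    450.8   * log(homeruns) -
    2527    * bat_avg +
    0.04946 * strikeouts +
    1.270   * wins
}

# First team in the structure listing; its observed runs are 855.
first_team <- predict_runs(at_bats = 5659, hits = 1599, homeruns = 210,
                           bat_avg = 0.283, strikeouts = 930, wins = 96)
first_team          # predicted runs
855 - first_team    # residual: observed minus predicted
```

With the fitted model object available, `predict(run.lm, newdata = ...)` would give the same kind of prediction without rounding error.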
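This multiple linear model explains 90.76% of the variance in runs (multiple R-squared = 0.9076, adjusted R-squared = 0.8781). R-squared measures how well the model fits these 30 teams, not its predictive accuracy. The fit is better than that of any simple linear model above, the best of which (runs~bat_avg) has an R-squared of 0.6561.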
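Multiple R-squared never decreases when predictors are added, so the adjusted value reported in each summary is the fairer basis for comparing models with different numbers of terms. A short sketch of the standard adjustment, with n teams and p predictors; the function name `adjusted_r2` is introduced here, and the inputs are the rounded R-squared values printed above, so results agree with the summaries only to about three decimals.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adjusted_r2(0.9076, n = 30, p = 7)   # refit without stolen_bases, with log(homeruns)
adjusted_r2(0.9182, n = 30, p = 7)   # first multiple regression
adjusted_r2(0.6561, n = 30, p = 1)   # best simple model, runs ~ bat_avg
```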