The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
library(ggplot2) # provides ggplot(), aes() and geom_point()
ggplot(mlb11, aes(x = at_bats, y = runs)) + geom_point()
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
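As a rough check on what cor() is doing, the sketch below computes the Pearson correlation from its definition and compares it with the built-in function. The two vectors are copied by hand from the runs and at_bats table printed later in this lab, so they are assumed to match mlb11; the helper name pearson_r is only illustrative.

```r
# runs and at_bats for the 30 teams, copied from the table printed in this lab
# (assumed to match mlb11$runs and mlb11$at_bats)
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the sums of squared deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(at_bats, runs)
r_builtin <- cor(at_bats, runs)
print(c(manual = r_manual, builtin = r_builtin))
```

The two numbers should agree, since cor() uses the same formula by default (method = "pearson").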
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
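The quantity that plot_ss reports can also be computed directly. The sketch below is not the plot_ss implementation; it is a small stand-in (the function name sum_sq_resid is hypothetical) that evaluates the sum of squared residuals for any candidate line, using runs and at_bats values copied by hand from the table printed later in this lab.

```r
# runs and at_bats copied from the table in this lab (assumed to match mlb11)
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Sum of squared vertical distances between the points and a candidate line
sum_sq_resid <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line fitted by lm()
fit <- lm(runs ~ at_bats)
b <- unname(coef(fit))
ss_ls <- sum_sq_resid(b[1], b[2], at_bats, runs)

# A hand-picked line through two of the teams, (5417, 593) and (5710, 875),
# mimicking the "click two points" step of plot_ss
slope_guess <- (875 - 593) / (5710 - 5417)
intercept_guess <- 593 - slope_guess * 5417
ss_guess <- sum_sq_resid(intercept_guess, slope_guess, at_bats, runs)

print(c(least_squares = ss_ls, hand_picked = ss_guess))
```

No hand-picked line can beat the least squares line on this measure, which is why the smallest value anyone obtains from plot_ss is bounded below by the sum of squares of the lm() fit.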
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
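For a regression with a single predictor, the Multiple R-squared in this summary is the square of the correlation coefficient computed earlier. A quick check using only the values already reported in this lab:

\[
R^2 = r^2 = (0.610627)^2 \approx 0.3729
\]

So about 37.3% of the variability in runs is explained by at_bats.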
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 415.239 1.835
##
## Sum of Squares: 73671.99
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
summary(lm(runs ~ homeruns, data = mlb11))
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
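\(\beta_0 = 415.2389\)
\(\beta_1 = 1.8345\)
\(y = \beta_0 + \beta_1 \cdot x\)
\(\Rightarrow \text{runs} = 415.2389 + 1.8345 \cdot \text{homeruns}\)
The slope is positive: each additional home run is associated with about 1.83 more runs scored over the season, on average.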
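To make the fitted equation concrete, here is a minimal sketch that plugs home run totals into it. The function name predict_runs_from_homeruns and the example home run totals are illustrative assumptions; the coefficients are the rounded estimates from the summary output above.

```r
# Fitted line from the output above: runs = 415.2389 + 1.8345 * homeruns
predict_runs_from_homeruns <- function(homeruns) {
  415.2389 + 1.8345 * homeruns
}

# Hypothetical home run totals, chosen only for illustration
example_homeruns <- c(100, 150, 200)
predicted <- predict_runs_from_homeruns(example_homeruns)
print(data.frame(homeruns = example_homeruns, predicted_runs = predicted))

# The change in predicted runs for one extra home run is the slope
one_more <- predict_runs_from_homeruns(151) - predict_runs_from_homeruns(150)
print(one_more)
```

With a fitted lm object, predict() with a newdata data frame would give the same kind of prediction using the unrounded coefficients.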
Let’s create a scatterplot with the least squares line laid on top.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
data.frame(mlb11$runs,mlb11$at_bats)
## mlb11.runs mlb11.at_bats
## 1 855 5659
## 2 875 5710
## 3 787 5563
## 4 730 5672
## 5 762 5532
## 6 718 5600
## 7 867 5518
## 8 721 5447
## 9 735 5544
## 10 615 5598
## 11 708 5585
## 12 644 5436
## 13 654 5549
## 14 735 5612
## 15 667 5513
## 16 713 5579
## 17 654 5502
## 18 704 5509
## 19 731 5421
## 20 743 5559
## 21 619 5487
## 22 625 5508
## 23 610 5421
## 24 645 5452
## 25 707 5436
## 26 641 5528
## 27 624 5441
## 28 570 5486
## 29 593 5417
## 30 556 5421
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
Nearly normal residuals: To check this condition, we can look at a histogram
hist(m1$residuals)
or a normal probability plot of the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
Constant variability:
Choose another variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
plot(x = mlb11$new_obs, y = mlb11$runs)
How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
summary(lm(runs ~ new_obs, data = mlb11))
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
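no <- summary(lm(runs ~ new_obs, data = mlb11))$r.squared # R-squared for runs ~ new_obs
r <- summary(m1)$r.squared # R-squared for runs ~ at_bats
The model using new_obs has R\(^2\) = 0.9349, compared with R\(^2\) = 0.3729 for the relationship between runs and at_bats, so new_obs predicts runs much better than at_bats.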
Now investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
plot(x = mlb11$bat_avg, y = mlb11$runs)
summary(lm(runs ~ bat_avg, data = mlb11))
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Comparing the R\(^2\) values of the models, bat_avg (batting average) is the best predictor of runs scored among the “traditional” variables, with R\(^2\) = 0.6561.
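The comparison of R\(^2\) values across several one-predictor models can be scripted instead of run one summary at a time. The sketch below is one possible way to do it; the helper name r_squared_by_predictor is hypothetical, and it is demonstrated on R's built-in mtcars data so that it runs without downloading mlb11.

```r
# Fit one simple linear model per predictor and collect the R-squared values,
# sorted from best to worst
r_squared_by_predictor <- function(data, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# Demonstration on the built-in mtcars data (not the baseball data)
demo <- r_squared_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))
print(round(demo, 4))

# With mlb11 loaded, a call along these lines would rank some of the
# traditional variables that appear in this lab:
# r_squared_by_predictor(mlb11, "runs", c("at_bats", "hits", "homeruns", "bat_avg"))
```

The first element of the result is the predictor with the highest R\(^2\), which is the same criterion used above to pick bat_avg.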
Of all the variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
summary(lm(runs ~ new_onbase, data = mlb11))$r.squared
## [1] 0.8491053
summary(lm(runs ~ new_slug, data = mlb11))$r.squared
## [1] 0.8968704
summary(lm(runs ~ hits, data = mlb11))$r.squared
## [1] 0.6419388
plot(mlb11$runs~mlb11$hits)
abline(lm(runs ~ hits, data = mlb11))
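m_obs <- lm(runs ~ new_obs, data = mlb11) # model for new_obs; m1 is the at_bats model, so its residuals do not belong here
plot(m_obs$residuals ~ mlb11$new_obs) # residuals against the predictor
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
hist(m_obs$residuals) # check for nearly normal residuals
qqnorm(m_obs$residuals)
qqline(m_obs$residuals) # adds diagonal line to the normal prob plot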