In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
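A minimal sketch of the load-and-preview step (the file name mlb11.RData is an assumption; the data frame mlb11 is the one used throughout):

load("mlb11.RData")  # loads the mlb11 data frame for the 2011 season
head(mlb11, 10)      # first ten teams, as printed below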
## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## 7 New York Yankees 867 5518 1452 222 0.263 1138
## 8 Milwaukee Brewers 721 5447 1422 185 0.261 1083
## 9 Colorado Rockies 735 5544 1429 163 0.258 1201
## 10 Houston Astros 615 5598 1442 95 0.258 1164
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
## 7 147 97 0.343 0.444 0.788
## 8 94 96 0.325 0.425 0.750
## 9 118 73 0.329 0.410 0.739
## 10 118 56 0.311 0.374 0.684
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

Here a scatterplot is chosen to display the relationship between runs and at-bats. There appears to be a linear relationship with a positive correlation, so one could use a linear model to derive a rough estimate of runs from at-bats.
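A sketch of how the scatterplot and the correlation could be produced (base R graphics are assumed); the cor() call yields the value printed below:

plot(runs ~ at_bats, data = mlb11,
     xlab = "At-bats", ylab = "Runs")   # scatterplot of runs vs. at-bats
cor(mlb11$runs, mlb11$at_bats)          # strength and direction of the linear association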
## [1] 0.610627
Indeed the correlation is positive, as calculated by the cor() function.
The two variables are moderately and positively correlated (strength and direction). The form appears linear, with the residuals scattered randomly about the regression line.
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
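A sketch of that rerun, assuming the lab's plot_ss() helper produced the interactive output above (the helper's name and exact interface are an assumption):

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)  # also shades each squared residual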
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
We can use the lm function in R to fit the linear model (a.k.a. regression line).
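A sketch of the fit that produces the summary below (the object name m1 is an assumption):

m1 <- lm(runs ~ at_bats, data = mlb11)  # least squares fit of runs on at-bats
summary(m1)                             # coefficient table, R-squared, residual summary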
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats} \]
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between the success of a team and its home runs?
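A sketch of the fit that produces the summary below (the object name m_homeruns is an assumption):

m_homeruns <- lm(runs ~ homeruns, data = mlb11)  # least squares fit of runs on home runs
summary(m_homeruns)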
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
The regression line equation can be written as: \[ \hat{y} = 415.2389 + 1.8345 \times \text{homeruns} \]
The \(b_1\) coefficient is positive, indicating a positive association: in context, the slope tells us that each additional home run a team hits is associated with about 1.83 additional runs scored over the season, on average. The positive direction can be verified with the cor() function.
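A sketch of that check; it yields the value printed below:

cor(mlb11$runs, mlb11$homeruns)  # correlation between runs and home runs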
## [1] 0.7915577
Let’s create a scatterplot with the least squares line laid on top.
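A minimal sketch of the overlay, reusing the m1 object assumed above:

plot(runs ~ at_bats, data = mlb11,
     xlab = "At-bats", ylab = "Runs")  # scatterplot of runs vs. at-bats
abline(m1)                             # add the least squares line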
# Recompute fitted values and residuals by hand from the estimated coefficients
df <- data.frame(at_bats = mlb11$at_bats, runs = mlb11$runs)
df$pred_runs <- -2789.2429 + 0.6305 * mlb11$at_bats  # fitted values from the regression equation
df$residual <- df$runs - df$pred_runs                # residual = observed - predicted
head(df)
## at_bats runs pred_runs residual
## 1 5659 855 778.7566 76.2434
## 2 5710 875 810.9121 64.0879
## 3 5563 787 718.2286 68.7714
## 4 5672 730 786.9531 -56.9531
## 5 5532 762 698.6831 63.3169
## 6 5600 718 741.5571 -23.5571
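The value printed below matches plugging 5,578 at-bats into the fitted equation (that specific at-bats total is an assumption inferred from the printed result):

-2789.2429 + 0.6305 * 5578  # predicted runs for a team with 5,578 at-bats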
## [1] 727.6861
The model predicts roughly 728 runs for a team with that many at-bats.
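One possible check of whether any team actually recorded that at-bats total (the exact call used is an assumption; which() returns integer(0), as below, when no row matches):

which(mlb11$at_bats == 5578)  # row indices of teams with exactly 5,578 at-bats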
## integer(0)
We do not know whether this is an under- or over-estimate, since there are no cases with exactly that number of at-bats. Observing the scatterplot above, most points in the range of 5550 to 5600 at-bats lie below the fitted line, suggesting the regression line over-estimates runs in that range, though not conclusively.
The residual plot shows deviations from the regression line that appear more or less normally distributed around zero.
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.
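A sketch of the residuals-vs-predictor check, again assuming the model object m1 from above:

plot(m1$residuals ~ mlb11$at_bats,
     xlab = "At-bats", ylab = "Residuals")  # residuals against the predictor
abline(h = 0, lty = 3)                      # adds a dashed horizontal line at zero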
There are slightly more points below the zero line, but this is likely due to chance. Thus, the relationship seems to be well described by a linear model.
Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
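A sketch of both checks, again assuming the model object m1:

hist(m1$residuals)    # histogram of residuals
qqnorm(m1$residuals)  # normal probability (Q-Q) plot
qqline(m1$residuals)  # reference line for comparison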
Again there seems to be a slight bias toward negative residuals, with the largest deviations occurring in both tails. Perhaps the model could be improved by removing the really bad “Bush League” teams.
Constant variability:
Further analysis is needed to determine if the removal of certain teams improves the model.
* * *
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Wins are chosen as the predictor, since games are won by scoring more runs than the opposing team.
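A sketch of the scatterplot with the fitted line overlaid (the object name m_wins is an assumption):

m_wins <- lm(runs ~ wins, data = mlb11)  # least squares fit of runs on wins
plot(runs ~ wins, data = mlb11,
     xlab = "Wins", ylab = "Runs")       # scatterplot of runs vs. wins
abline(m_wins)                           # add the least squares line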
Again, with an even spread and no distinct pattern about the regression line, a linear relationship is implied.
How does this relationship compare to the relationship between runs and at_bats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
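A sketch of the summary call that produces the output below, reusing the m_wins object assumed above:

summary(m_wins)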
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
Here the R-squared value is 0.361, compared to 0.3729 when using at-bats as the predictor and 0.6266 when using homeruns. Wins is the weakest predictor of the three.
Investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

## [1] "team" "runs" "at_bats" "hits"
## [5] "homeruns" "bat_avg" "strikeouts" "stolen_bases"
## [9] "wins" "new_onbase" "new_slug" "new_obs"
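A sketch of the fit for the strongest traditional predictor, batting average (the object name m_batavg is an assumption); its summary is shown below:

m_batavg <- lm(runs ~ bat_avg, data = mlb11)  # least squares fit of runs on batting average
summary(m_batavg)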
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Of the seven traditional variables, batting average performs the best, with an R-squared of 0.6561.
The scatterplot also shows a tighter fit of the points around the regression line than for the other traditional predictors.
The histogram of residuals also shows greater concentration around 0.
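A sketch of the supporting plots, assuming the m_batavg object from above:

plot(runs ~ bat_avg, data = mlb11,
     xlab = "Batting average", ylab = "Runs")  # scatterplot of runs vs. batting average
abline(m_batavg)                               # least squares line
hist(m_batavg$residuals)                       # residuals concentrated around zero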
Now examine the three newer, Moneyball-era variables. In general, are they more or less effective at predicting runs than the traditional variables? Which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

The three Moneyball variables perform much better as run predictors.
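A sketch of the three fits whose summaries appear below:

summary(lm(runs ~ new_onbase, data = mlb11))  # on-base percentage
summary(lm(runs ~ new_slug, data = mlb11))    # slugging percentage
summary(lm(runs ~ new_obs, data = mlb11))     # on-base plus slugging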
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
The new measure, on-base plus slugging (OBS), is far and away the best indicator, with a multiple R-squared of 0.9349. This makes sense because runs are largely scored by driving in baserunners who are already in scoring position, or by clearing the bases with home runs.
Here the points fit very tightly along the regression line, indicating a useful predictor, and the normal probability (Q-Q) plot suggests the residuals are nearly normal.
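A sketch of the diagnostics for the OBS model (the object name m_obs is an assumption):

m_obs <- lm(runs ~ new_obs, data = mlb11)
plot(runs ~ new_obs, data = mlb11,
     xlab = "On-base plus slugging", ylab = "Runs")  # scatterplot of runs vs. OBS
abline(m_obs)                                        # least squares line
qqnorm(m_obs$residuals)                              # normal probability (Q-Q) plot of residuals
qqline(m_obs$residuals)                              # reference line for comparison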