In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
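A minimal sketch of the load-and-preview step (the file name mlb11.RData is an assumption; the data frame mlb11 is the one used throughout):

load("mlb11.RData")  # loads the mlb11 data frame for the 2011 season
head(mlb11, 10)      # first ten teams, as printed below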
## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## 7 New York Yankees 867 5518 1452 222 0.263 1138
## 8 Milwaukee Brewers 721 5447 1422 185 0.261 1083
## 9 Colorado Rockies 735 5544 1429 163 0.258 1201
## 10 Houston Astros 615 5598 1442 95 0.258 1164
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
## 7 147 97 0.343 0.444 0.788
## 8 94 96 0.325 0.425 0.750
## 9 118 73 0.329 0.410 0.739
## 10 118 56 0.311 0.374 0.684
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

Here a scatterplot is chosen to display the relationship between runs and at-bats. There appears to be a linear relationship with a positive correlation, so one could use a linear model to derive a rough estimate of runs from at-bats.
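A sketch of how the scatterplot and the correlation could be produced (base R graphics are assumed); the cor() call yields the value printed below:

plot(runs ~ at_bats, data = mlb11,
     xlab = "At-bats", ylab = "Runs")   # scatterplot of runs vs. at-bats
cor(mlb11$runs, mlb11$at_bats)          # strength and direction of the linear association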
## [1] 0.610627
Indeed the correlation is positive, as calculated by the cor() function.
The two variables are moderately and positively correlated (strength and direction). The form appears linear, with the residuals scattered randomly about the regression line.
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
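A sketch of that rerun, assuming the lab's plot_ss() helper produced the interactive output above (the helper's name and exact interface are an assumption):

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)  # also shades each squared residual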
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
We can use the lm function in R to fit the linear model (a.k.a. regression line).
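A sketch of the fit that produces the summary below (the object name m1 is an assumption):

m1 <- lm(runs ~ at_bats, data = mlb11)  # least squares fit of runs on at-bats
summary(m1)                             # coefficient table, R-squared, residual summary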
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats} \]
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between the success of a team and its home runs?
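A sketch of the fit that produces the summary below (the object name m_homeruns is an assumption):

m_homeruns <- lm(runs ~ homeruns, data = mlb11)  # least squares fit of runs on home runs
summary(m_homeruns)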
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
The regression line equation can be written as: \[ \hat{y} = 415.2389 + 1.8345 \times \text{homeruns} \]
The \(b_1\) coefficient is positive, indicating a positive association: in context, the slope tells us that each additional home run a team hits is associated with about 1.83 additional runs scored over the season, on average. The positive direction can be verified with the cor() function.
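A sketch of that check; it yields the value printed below:

cor(mlb11$runs, mlb11$homeruns)  # correlation between runs and home runs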
## [1] 0.7915577
Let’s create a scatterplot with the least squares line laid on top.
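A minimal sketch of the overlay, reusing the m1 object assumed above:

plot(runs ~ at_bats, data = mlb11,
     xlab = "At-bats", ylab = "Runs")  # scatterplot of runs vs. at-bats
abline(m1)                             # add the least squares line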
# Recompute fitted values and residuals by hand from the estimated coefficients
df <- data.frame(at_bats = mlb11$at_bats, runs = mlb11$runs)
df$pred_runs <- -2789.2429 + 0.6305 * mlb11$at_bats  # fitted values from the regression equation
df$residual <- df$runs - df$pred_runs                # residual = observed - predicted
head(df)
## at_bats runs pred_runs residual
## 1 5659 855 778.7566 76.2434
## 2 5710 875 810.9121 64.0879
## 3 5563 787 718.2286 68.7714
## 4 5672 730 786.9531 -56.9531
## 5 5532 762 698.6831 63.3169
## 6 5600 718 741.5571 -23.5571
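The value printed below matches plugging 5,578 at-bats into the fitted equation (that specific at-bats total is an assumption inferred from the printed result):

-2789.2429 + 0.6305 * 5578  # predicted runs for a team with 5,578 at-bats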
## [1] 727.6861
The model predicts roughly 728 runs for a team with that many at-bats.
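One possible check of whether any team actually recorded that at-bats total (the exact call used is an assumption; which() returns integer(0), as below, when no row matches):

which(mlb11$at_bats == 5578)  # row indices of teams with exactly 5,578 at-bats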
## integer(0)
We do not know whether this is an under- or over-estimate, since there are no cases with exactly that number of at-bats. Observing the scatterplot above, most points in the range of 5550 to 5600 at-bats lie below the fitted line, suggesting the regression line over-estimates runs in that range, though not conclusively.
The residual plot shows deviations from the regression line that appear more or less normally distributed around zero.
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.
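A sketch of the residuals-vs-predictor check, again assuming the model object m1 from above:

plot(m1$residuals ~ mlb11$at_bats,
     xlab = "At-bats", ylab = "Residuals")  # residuals against the predictor
abline(h = 0, lty = 3)                      # adds a dashed horizontal line at zero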
There are slightly more points below the zero line, but this is likely due to chance. Thus, the relationship seems to be well described by a linear model.
Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
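A sketch of both checks, again assuming the model object m1:

hist(m1$residuals)    # histogram of residuals
qqnorm(m1$residuals)  # normal probability (Q-Q) plot
qqline(m1$residuals)  # reference line for comparison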
Again there seems to be a slight bias toward negative residuals, with the largest deviations occurring in both tails. Perhaps the model could be improved by removing the really bad “Bush League” teams.
Constant variability:
Further analysis is needed to determine if the removal of certain teams improves the model.
* * *
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Wins are chosen as the predictor, since games are won by scoring more runs than the opposing team.
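A sketch of the scatterplot with the fitted line overlaid (the object name m_wins is an assumption):

m_wins <- lm(runs ~ wins, data = mlb11)  # least squares fit of runs on wins
plot(runs ~ wins, data = mlb11,
     xlab = "Wins", ylab = "Runs")       # scatterplot of runs vs. wins
abline(m_wins)                           # add the least squares line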
Again, with an even spread and no distinct pattern about the regression line, a linear relationship is implied.
How does this relationship compare to the relationship between runs and at_bats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
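A sketch of the summary call that produces the output below, reusing the m_wins object assumed above:

summary(m_wins)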
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
Here the R-squared value is 0.361, compared to 0.3729 when using at-bats as the predictor and 0.6266 when using homeruns. Wins is the weakest predictor of the three.
Investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

## [1] "team" "runs" "at_bats" "hits"
## [5] "homeruns" "bat_avg" "strikeouts" "stolen_bases"
## [9] "wins" "new_onbase" "new_slug" "new_obs"
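A sketch of the fit for the strongest traditional predictor, batting average (the object name m_batavg is an assumption); its summary is shown below:

m_batavg <- lm(runs ~ bat_avg, data = mlb11)  # least squares fit of runs on batting average
summary(m_batavg)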
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Of the seven traditional variables, batting average performs the best, with an R-squared of 0.6561.
The scatterplot also shows a tighter fit of the points around the regression line than for the other traditional predictors.
The histogram of residuals also shows greater concentration around 0.
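A sketch of the supporting plots, assuming the m_batavg object from above:

plot(runs ~ bat_avg, data = mlb11,
     xlab = "Batting average", ylab = "Runs")  # scatterplot of runs vs. batting average
abline(m_batavg)                               # least squares line
hist(m_batavg$residuals)                       # residuals concentrated around zero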
Now examine the three newer, Moneyball-era variables. In general, are they more or less effective at predicting runs than the traditional variables? Which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

The three Moneyball variables perform much better as run predictors.
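A sketch of the three fits whose summaries appear below:

summary(lm(runs ~ new_onbase, data = mlb11))  # on-base percentage
summary(lm(runs ~ new_slug, data = mlb11))    # slugging percentage
summary(lm(runs ~ new_obs, data = mlb11))     # on-base plus slugging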
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
The new measure, on-base plus slugging (OBS), is far and away the best indicator, with a multiple R-squared of 0.9349. This makes sense because runs are largely scored by driving in baserunners who are already in scoring position, or by clearing the bases with home runs.
Here the points fit very tightly along the regression line, indicating a useful predictor, and the normal probability (Q-Q) plot suggests the residuals are nearly normal.
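A sketch of the diagnostics for the OBS model (the object name m_obs is an assumption):

m_obs <- lm(runs ~ new_obs, data = mlb11)
plot(runs ~ new_obs, data = mlb11,
     xlab = "On-base plus slugging", ylab = "Runs")  # scatterplot of runs vs. OBS
abline(m_obs)                                        # least squares line
qqnorm(m_obs$residuals)                              # normal probability (Q-Q) plot of residuals
qqline(m_obs$residuals)                              # reference line for comparison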