Introduction to Linear Regression

Batter up

30 major league baseball teams; which variable best predicts runs scored?

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot, which is what the “plot” function produces by default. The plot shows a positive but fairly loose relationship between runs and at_bats, so I would not feel very confident using a linear model to predict the number of runs based on at_bats.

plot(mlb11$at_bats, mlb11$runs)  # predictor (at_bats) on the x-axis, response (runs) on the y-axis

To find the correlation coefficient (strength of the relationship):

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

Sum of the squared residuals

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

There seems to be a moderate positive linear relationship between the two variables; teams with more at_bats tend to score more runs. However, if we were to draw a regression line, there would be many points deviating far from the line. The correlation coefficient backs up this claim. Since the coefficient is positive, the relationship is positive, but the coefficient (0.61) is not very close to 0 or 1, which indicates a moderate correlation.

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

Mine isn’t letting me change the line at all. I get the same line every time I run it with a sum of squares of 123721.9. I am not sure why mine is not interactive….

The linear model

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
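
The coefficients that plot_ss printed earlier are the same as the ones in summary(m1), so the sum of squares it reported (123721.9) appears to be the residual sum of squares of the least squares line. As a rough check, and assuming that is the case, the residual standard error in the summary should equal the square root of that sum divided by the degrees of freedom. The numbers below are copied from the output above, not recomputed from the data:

```r
# Values copied from the plot_ss and summary(m1) output above.
ss <- 123721.9   # sum of squared residuals reported by plot_ss
n  <- 30         # number of teams in mlb11
df <- n - 2      # two estimated coefficients: intercept and slope

# Residual standard error = sqrt(sum of squared residuals / degrees of freedom)
rse <- sqrt(ss / df)
print(round(rse, 2))  # summary(m1) reports 66.47 on 28 degrees of freedom
```

With the fitted model available, sum(m1$residuals^2) should give the same sum of squares directly.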

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

y = 415.2389 + 1.8345 * homeruns

Because the slope is positive, there is a positive relationship between the two variables: each additional home run is associated with about 1.83 more runs on average. Additionally, compared with at_bats, the correlation coefficient is closer to 1 (r = 0.7916) and the R-squared is larger (0.6266), so 62.66% of the variability in runs is explained by homeruns.

plot(mlb11$homeruns, mlb11$runs)  # predictor (homeruns) on the x-axis, response (runs) on the y-axis

cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
plot_ss(x = mlb11$runs, y = mlb11$homeruns)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##    -85.1566       0.3415  
## 
## Sum of Squares:  13715.52
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
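
For a regression with a single predictor, R-squared is the square of the correlation coefficient, which is why the correlation and R-squared rank at_bats and homeruns the same way. A small check using only the values printed above (the variable names here are just for illustration):

```r
# Correlation coefficients copied from the cor() output above.
r_at_bats  <- 0.610627    # cor(runs, at_bats)
r_homeruns <- 0.7915577   # cor(runs, homeruns)

# In simple linear regression, R-squared = r^2.
r2_at_bats  <- r_at_bats^2
r2_homeruns <- r_homeruns^2

# Compare with "Multiple R-squared" in summary(m1) and summary(m2).
print(round(c(at_bats = r2_at_bats, homeruns = r2_homeruns), 4))
```

The squared values round to 0.3729 and 0.6266, matching the two model summaries.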

Predictions and prediction errors

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

We previously found that the least squares regression line has the following formula:

y = -2789.2429 + 0.6305 * at_bats

So if we plug in 5578 for at_bats we get:

-2789.2429 + (0.6305*5578)
## [1] 727.6861

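The closest observation in the data has 5579 at_bats and 713 runs. So our prediction is a slight overestimate of about 15 runs. Since a residual is the observed value minus the predicted value, the residual is about -15; the size of the difference is: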

728-713
## [1] 15
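
The same calculation can be written as a small function, keeping the sign convention explicit (residual = observed minus predicted). This is only a sketch: the coefficients are copied from summary(m1), predict_runs is a made-up helper name, and the observed value of 713 runs belongs to the team with 5579 at_bats mentioned above, used here as the nearest comparison:

```r
# Coefficients copied from summary(m1).
b0 <- -2789.2429   # intercept
b1 <- 0.6305       # slope for at_bats

# Hypothetical helper: predicted runs for a given number of at-bats.
predict_runs <- function(at_bats) b0 + b1 * at_bats

predicted <- predict_runs(5578)
observed  <- 713                     # nearest team in the data (5579 at_bats)
residual  <- observed - predicted    # negative means the line overestimates

print(predicted)
print(residual)
```

With the fitted model in memory, predict(m1, newdata = data.frame(at_bats = 5578)) should give the same prediction without retyping the coefficients.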

Model diagnostics

Check for linearity

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Exercise 6

There does not appear to be a pattern in the residuals plot. Since a curved or otherwise systematic pattern would indicate a non-linear relationship, the lack of one suggests that a linear model for runs and at_bats is reasonable.
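
To see what a pattern would look like, here is a sketch with made-up data (not from mlb11) where the true relationship is curved. Fitting a straight line to it leaves residuals that are positive at both ends and negative in the middle, which is the kind of U-shape that signals non-linearity:

```r
# Made-up data for illustration only: y is a quadratic function of x.
x <- 1:20
y <- x^2

fit <- lm(y ~ x)          # force a straight line through curved data
res <- residuals(fit)

# Residuals vs. predictor: a clear U-shape instead of random scatter.
plot(res ~ x)
abline(h = 0, lty = 3)

# Signs of the residuals at the left end, middle, and right end.
print(sign(res[c(1, 10, 20)]))
```

In the runs vs at_bats residual plot there is no such curve, so the points scattering randomly around the dashed line is what we want to see.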

Check for nearly normal residuals

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals) 

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

It looks fairly normal to me, but I am still not great at deciphering a qq plot. Maybe a slight right-hand tail?

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

I would say that based on the residuals vs at_bats plot, there does appear to be relatively constant variability. The variability looks a bit less for the larger values, but there are also fewer points which could explain that.

On your own

Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

plot(mlb11$hits, mlb11$runs)  # predictor (hits) on the x-axis, response (runs) on the y-axis

m3 <- lm(runs ~ hits, data = mlb11)
plot(mlb11$runs ~ mlb11$hits)
abline(m3)

Yes, there does seem to be a linear relationship between runs and hits.

How does this relationship compare to the relationship between runs and at_bats? Use the R-squared values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

The R-squared value for runs vs at_bats is 0.3729, while the R-squared for runs vs hits is 0.6419. Since runs vs hits has an R-squared value closer to 1, hits explains more of the variability in runs and is the better predictor.

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

m4 <- lm(runs ~ bat_avg, data = mlb11)
m5 <- lm(runs ~ strikeouts, data = mlb11)
m6 <- lm(runs ~ stolen_bases, data = mlb11)
m7 <- lm(runs ~ wins, data = mlb11)
m8 <- lm(runs ~ new_onbase, data = mlb11)
m9 <- lm(runs ~ new_slug, data = mlb11)
m10 <- lm(runs ~ new_obs, data = mlb11)
summary(m4)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
summary(m5)
## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386
summary(m6)
## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769
summary(m7)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469
summary(m8)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
summary(m9)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
summary(m10)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

Although I have no idea what new_obs is, that is the variable that best predicts runs. It has the R-squared value closest to 1 (0.9349).
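
To make the comparison easier to read than ten separate summaries, the R-squared values can be collected into one named vector and sorted. The numbers below are copied by hand from the summaries in this report (the at_bats, homeruns, and hits values come from m1, m2, and m3); the vector name r2 is just for illustration:

```r
# R-squared values copied from the model summaries in this report.
r2 <- c(
  at_bats      = 0.3729,
  homeruns     = 0.6266,
  hits         = 0.6419,
  bat_avg      = 0.6561,
  strikeouts   = 0.1694,
  stolen_bases = 0.002914,
  wins         = 0.3610,
  new_onbase   = 0.8491,
  new_slug     = 0.8969,
  new_obs      = 0.9349
)

# Rank predictors from best to worst by R-squared.
ranked <- sort(r2, decreasing = TRUE)
print(ranked)

# Name of the single best predictor.
best <- names(ranked)[1]
print(best)
```

With the fitted models in memory, summary(m10)$r.squared (and likewise for the other models) should return these values without retyping them.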

plot(mlb11$runs ~ mlb11$new_obs)
abline(m10)

Oh yeah, definitely looks like a good predictor for runs.

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Oops. I didn’t realize they only wanted the “old” variables for the last question. Yes, it was a good decision for the author of Moneyball to use new_onbase, new_slug, and new_obs to predict runs. Those variables were the top 3 predictors out of the variables provided (they had the highest R-squared values: 0.8491, 0.8969, and 0.9349). I know nothing about baseball, so I don’t know if this makes sense, but it must if the author of Moneyball decided to look into those three variables.

plot(mlb11$runs ~ mlb11$new_onbase)
abline(m8)

plot(mlb11$runs ~ mlb11$new_slug)
abline(m9)

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

So we are going with new_obs. There is no pattern when we plot residuals vs new_obs, which suggests that a linear model is appropriate. Based on the first plot displayed below, we can also check for constant variability. The variability does look constant. Finally, the second plot can help check for nearly normal residuals. It looks more normal than the residuals from the runs vs at_bats model. The qq-plot supports that the residuals are nearly normal! Looks to me like it passes the diagnostic tests!

plot(m10$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)

hist(m10$residuals)

qqnorm(m10$residuals)
qqline(m10$residuals)