Sum of the squared residuals
Exercise 2
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
There seems to be a moderate positive linear relationship between the two variables: as the number of at_bats increases, so does the number of runs. However, if we were to draw a regression line, many points would deviate far from it. The correlation coefficient backs up this claim. Since the coefficient is positive, the correlation is positive, but it is not very close to either 0 or 1, which indicates a moderate correlation.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Exercise 3
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
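plot_ss isn’t letting me change the line at all. I get the same line every time I run it, with a sum of squares of 123721.9. I am not sure why mine is not interactive. The coefficients it reports (-2789.2429 and 0.6305) match the lm() output in the next section, so this appears to be the least squares line itself, which would make 123721.9 the smallest sum of squares possible.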
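For reference, the quantity that plot_ss reports can also be computed by hand. The sketch below is only an illustration on made-up data: the vectors `x` and `y` and the helper `sum_sq` are hypothetical and not part of the lab. It evaluates the sum of squared residuals for a line chosen by eye and for the least squares line from lm(), which should never give the larger value.

```r
# Hypothetical toy data standing in for at_bats and runs (not the mlb11 values)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Sum of squared residuals for the line y = intercept + slope * x
sum_sq <- function(x, y, intercept, slope) {
  residuals <- y - (intercept + slope * x)
  sum(residuals^2)
}

# A line chosen by eye
ss_guess <- sum_sq(x, y, intercept = 0.5, slope = 1.8)

# The least squares line from lm()
fit <- lm(y ~ x)
ss_best <- sum_sq(x, y, coef(fit)[["(Intercept)"]], coef(fit)[["x"]])

ss_guess
ss_best  # no other line gives a smaller value
```

Trying a few different intercept and slope values in `sum_sq` mimics what the interactive version of plot_ss is meant to let you do with mouse clicks.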
The linear model
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Exercise 4
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
y = 415.2389 + 1.8345 * homeruns
Each additional home run is associated with about 1.83 more runs on average. Because the slope is positive, there is a positive relationship between the two variables. Additionally, compared with the at_bats model, the correlation coefficient is closer to 1 (r = 0.7916) and the R-squared is larger (0.6266 versus 0.3729): 62.66% of the variability in runs is explained by homeruns.
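The meaning of the slope, and the link between the correlation and R-squared, can be checked with a few lines of arithmetic. This is a minimal sketch that copies the estimates from the R output in this report. The helper `predict_runs_hr` and the home run counts 150 and 151 are hypothetical, chosen only for illustration.

```r
# Estimates copied from the summary(m2) and cor() output in this report
intercept <- 415.2389
slope     <- 1.8345
r         <- 0.7915577

# Predicted runs for a given number of home runs
predict_runs_hr <- function(homeruns) intercept + slope * homeruns

# Two hypothetical teams that differ by exactly one home run
one_hr_difference <- predict_runs_hr(151) - predict_runs_hr(150)
one_hr_difference  # equals the slope, about 1.83

# For a single predictor, R-squared is the squared correlation
r_squared <- r^2
r_squared  # about 0.6266
```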
plot(mlb11$homeruns, mlb11$runs) # predictor on the x-axis, response on the y-axis

cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
# x and y are swapped here relative to m2, so the line fitted below is homeruns ~ runs
plot_ss(x = mlb11$runs, y = mlb11$homeruns)

## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -85.1566 0.3415
##
## Sum of Squares: 13715.52
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Predictions and prediction errors
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
We previously found that the least squares regression line has the following formula:
y = -2789.2429 + 0.6305 * at_bats
So if we plug in 5578 for at_bats we get:
-2789.2429 + (0.6305*5578)
## [1] 727.6861
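There is no team with exactly 5,578 at-bats, but at 5579 at_bats we see 713 runs. So our prediction is a slight overestimate of about 15 runs. The residual (observed minus predicted) is therefore negative:
713 - 728
## [1] -15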
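The same calculation can be wrapped in a small function, which makes the sign convention explicit: a residual is the observed value minus the predicted value. This is a sketch that hard-codes the coefficients printed by summary(m1). The names `b0`, `b1`, and `predict_runs` are hypothetical.

```r
# Coefficients copied from summary(m1) in this report
b0 <- -2789.2429
b1 <- 0.6305

# Predicted runs from the least squares line
predict_runs <- function(at_bats) b0 + b1 * at_bats

predicted <- predict_runs(5578)
predicted  # about 727.69

# Observed value quoted above: a team with 5,579 at-bats scored 713 runs
observed <- 713

# Residual = observed - predicted; a negative value means the line overestimates
residual <- observed - predicted
residual  # about -14.69
```

With the fitted model object available, predict(m1, newdata = data.frame(at_bats = 5578)) should give the same prediction without retyping the coefficients.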
Model diagnostics
Check for linearity
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

Exercise 6
There does not appear to be a pattern in the residuals plot. A clear pattern, such as a curve, would indicate a non-linear relationship, so the absence of one suggests that a linear model for runs and at_bats is reasonable.
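To see what a problematic residual plot would look like, it can help to fit a straight line to data that are known to be curved. The sketch below uses made-up, noise-free quadratic data (`x` and `y_curved` are hypothetical, not from mlb11), so the pattern is unmistakable.

```r
# Hypothetical data with an exactly quadratic relationship (no noise)
x <- seq(1, 10, length.out = 50)
y_curved <- x^2

# Fit a straight line anyway and look at the residuals
fit_curved <- lm(y_curved ~ x)
res <- resid(fit_curved)

plot(res ~ x)
abline(h = 0, lty = 3)

# The residuals are positive at both ends and negative in the middle,
# a U-shaped pattern that signals a non-linear relationship
res[c(1, 25, 50)]
```

A residual plot with no such structure, like the one for runs and at_bats, gives no reason to doubt linearity.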
Check for nearly normal residuals
hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Exercise 7
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
It looks fairly normal to me, but I am still not great at deciphering a Q-Q plot. Maybe a slight right-hand tail?
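As a reading aid for Q-Q plots, the sketch below compares two idealised samples: one with a perfectly normal shape and one with a long right-hand tail. Both samples are hypothetical and built directly from quantile functions, so there is no randomness involved. The correlation at the end is just an informal measure of how straight each plot is, not a formal test.

```r
n <- 30  # same number of observations as teams in mlb11

# Idealised samples built from quantile functions (hypothetical, no randomness)
normal_like  <- qnorm(ppoints(n))  # perfectly normal shape
right_skewed <- qexp(ppoints(n))   # long right-hand tail

# qqnorm() returns the plotted coordinates: x = theoretical, y = sample quantiles
qq_normal <- qqnorm(normal_like, main = "Normal-shaped sample")
qqline(normal_like)

qq_skewed <- qqnorm(right_skewed, main = "Right-skewed sample")
qqline(right_skewed)

# Correlation between theoretical and sample quantiles: closer to 1 means straighter
cor_normal <- cor(qq_normal$x, qq_normal$y)
cor_skewed <- cor(qq_skewed$x, qq_skewed$y)
cor_normal
cor_skewed
```

In the right-skewed plot the largest points rise above the reference line, which is the shape to look for when judging whether the residuals have a right-hand tail.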
Exercise 8
Based on the plot in (1), does the constant variability condition appear to be met?
I would say that based on the residuals vs at_bats plot, there does appear to be relatively constant variability. The variability looks a bit less for the larger values, but there are also fewer points which could explain that.
On your own
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
plot(mlb11$hits, mlb11$runs) # predictor on the x-axis, response on the y-axis

m3 <- lm(runs ~ hits, data = mlb11)
plot(mlb11$runs ~ mlb11$hits)
abline(m3)

Yes, there does seem to be a linear relationship between runs and hits.
How does this relationship compare to the relationship between runs and at_bats? Use the R² values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
summary(m3)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
The R-squared value for runs vs at_bats is 0.3729, while the R-squared for runs vs hits is 0.6419. Since runs vs hits has an R-squared value closer to 1, hits is a better predictor of runs.
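Rather than reading R-squared off each printed summary, it can be pulled out of the fitted models directly. The following sketch shows one way to do that for several predictors at once. It uses the built-in mtcars data purely as a stand-in for mlb11, and the choice of `mpg`, `wt`, `hp`, and `qsec` is only for illustration.

```r
# mtcars ships with R and is used here only as a stand-in for mlb11
response   <- "mpg"
predictors <- c("wt", "hp", "qsec")

# Fit one simple regression per predictor and pull out R-squared
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = response), data = mtcars)
  summary(fit)$r.squared
})

# Largest R-squared first
sort(r_squared, decreasing = TRUE)
```

For this lab, the same pattern should work with `response <- "runs"`, the mlb11 column names as `predictors`, and `data = mlb11`.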
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
m4 <- lm(runs ~ bat_avg, data = mlb11)
m5 <- lm(runs ~ strikeouts, data = mlb11)
m6 <- lm(runs ~ stolen_bases, data = mlb11)
m7 <- lm(runs ~ wins, data = mlb11)
m8 <- lm(runs ~ new_onbase, data = mlb11)
m9 <- lm(runs ~ new_slug, data = mlb11)
m10 <- lm(runs ~ new_obs, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
summary(m5)
##
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386
summary(m6)
##
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.94 -62.87 10.01 38.54 182.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 677.3074 58.9751 11.485 4.17e-12 ***
## stolen_bases 0.1491 0.5211 0.286 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared: 0.002914, Adjusted R-squared: -0.0327
## F-statistic: 0.08183 on 1 and 28 DF, p-value: 0.7769
summary(m7)
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
summary(m8)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
summary(m9)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
summary(m10)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
Although I have no idea what new_obs is, that is the variable that best predicts runs. It has the R-squared value closest to 1 (0.9349).
plot(mlb11$runs ~ mlb11$new_obs)
abline(m10)

Oh yeah, definitely looks like a good predictor for runs.
Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
Oops. I didn’t realize they only wanted the “old” variables for the last question. Yes, it was a good decision for the author of Moneyball to use new_onbase, new_slug, and new_obs to predict runs. Those variables were the top 3 predictors out of the variables provided (they had the best R-squared values). I know nothing about baseball, so I don’t know if this makes sense, but it must if the author of Moneyball decided to look into those three variables.
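The ranking can be laid out explicitly by collecting the R-squared values from the summaries in this report. The sketch below simply copies those printed values into a named vector (`r2` and `ranked` are hypothetical names) and sorts them.

```r
# R-squared values copied from the summary() outputs in this report
r2 <- c(
  at_bats      = 0.3729,
  homeruns     = 0.6266,
  hits         = 0.6419,
  bat_avg      = 0.6561,
  strikeouts   = 0.1694,
  stolen_bases = 0.002914,
  wins         = 0.361,
  new_onbase   = 0.8491,
  new_slug     = 0.8969,
  new_obs      = 0.9349
)

ranked <- sort(r2, decreasing = TRUE)
ranked

# The three newer statistics occupy the top three places
names(ranked)[1:3]
```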
plot(mlb11$runs ~ mlb11$new_onbase)
abline(m8)

plot(mlb11$runs ~ mlb11$new_slug)
abline(m9)

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
So we are going with new_obs. There is no pattern when we plot the residuals vs new_obs, which suggests that a linear model is appropriate. The same plot (the first one displayed below) also lets us check for constant variability, and the variability does look constant. Finally, the histogram and the Q-Q plot help check for nearly normal residuals. The histogram looks more normal than the one for the residuals from the runs vs at_bats model, and the Q-Q plot supports that the residuals are nearly normal. Looks to me like it passes the diagnostic checks!
plot(m10$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)

hist(m10$residuals)

qqnorm(m10$residuals)
qqline(m10$residuals)
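Since the same three diagnostic plots are drawn for every model in this report, they could be collected into one function. This is only a sketch: `check_model` is a hypothetical helper that is not part of the lab materials, and it is demonstrated on the built-in cars data as a stand-in so that it runs without mlb11.

```r
# Draw the three diagnostic plots for a simple regression fit.
# `check_model` is a hypothetical helper, not part of the lab materials.
check_model <- function(fit, x, xlab = "predictor") {
  res <- resid(fit)
  plot(x, res, xlab = xlab, ylab = "residuals")  # linearity, constant variability
  abline(h = 0, lty = 3)
  hist(res, main = "Histogram of residuals")     # nearly normal residuals
  qqnorm(res)
  qqline(res)
  invisible(res)
}

# Stand-in example with the built-in cars data; for this report the call
# would be check_model(m10, mlb11$new_obs, "new_obs")
fit_cars <- lm(dist ~ speed, data = cars)
res_cars <- check_model(fit_cars, cars$speed, "speed")
```

Returning the residuals invisibly keeps the console quiet while still allowing further checks on them if needed.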
