download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
plot(mlb11$runs ~ mlb11$at_bats)
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.6106
Runs and at bats are positively correlated (r = 0.61). However, the correlation is only moderate, and several notable outliers can be identified. The data are also concentrated in the lower-mid region of at_bats, as few teams have more than 5620 at bats.
The smallest sum of squared residuals I got was 141037.9.
m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
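As a rough check on the sum of squares found earlier (141037.9), the residual sum of squares of the least-squares line can be recovered from the summary output, since the residual standard error is sqrt(RSS / df). This is a sketch that uses the rounded values printed above, so the result is approximate; the variable names are only illustrative.

```r
# Values copied from summary(m1); both are rounded in the printed output.
rse <- 66.5   # residual standard error
df  <- 28     # residual degrees of freedom

# Residual standard error = sqrt(RSS / df), so RSS = rse^2 * df.
rss_ls <- rse^2 * df

# Smallest sum of squares found earlier in this report.
rss_manual <- 141037.9

# The least-squares line should have the smaller sum of squares.
print(rss_ls)
print(rss_manual - rss_ls)
```

With the data loaded, `sum(m1$residuals^2)` gives the unrounded value directly.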
m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
runs = 415.24 + 1.83 * homeruns
For each additional home run, the predicted number of runs increases by about 1.83 (not just by 1).
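To make the slope interpretation concrete, here is a small sketch using the coefficients printed in summary(m2). The home run values 150 and 10 are hypothetical inputs chosen for illustration, not figures from the data.

```r
# Coefficients copied from summary(m2) (rounded as printed).
b0 <- 415.239   # intercept
b1 <- 1.835     # slope for homeruns

# Predicted runs for a hypothetical team with 150 home runs.
pred_150 <- b0 + b1 * 150

# Predicted difference in runs between two teams that differ by
# 10 home runs: only the slope matters, the intercept cancels.
diff_10 <- b1 * 10

print(pred_150)
print(diff_10)
```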
-2789.2429 + 0.6305 * 5579
## [1] 728.3
713 - 728.3
## [1] -15.3
Residual = observed - predicted = 713 - 728.3 = -15.3
The residual is negative, so the model overestimates runs by about 15.
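The same prediction and residual can be wrapped in a small helper. This is a minimal sketch; `predict_runs` is a hypothetical name, and the coefficients are the ones typed into the calculation above.

```r
# Coefficients used in the hand calculation above (from m1).
b0 <- -2789.2429
b1 <- 0.6305

# Hypothetical helper: predicted runs for a given number of at bats.
predict_runs <- function(at_bats) {
  b0 + b1 * at_bats
}

observed  <- 713
predicted <- predict_runs(5579)

# Residual = observed - predicted; a negative value means the
# model overestimated.
residual <- observed - predicted

print(round(predicted, 1))
print(round(residual, 1))
```

With the model object available, `predict(m1, newdata = data.frame(at_bats = 5579))` should give essentially the same prediction without retyping the coefficients.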
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
There is no apparent pattern in the residual plot, which is consistent with a linear relationship between runs and at_bats.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
Both the histogram and the normal probability plot suggest that the residuals are approximately normally distributed.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
The constant variability assumption seems to be met because the variation of the residuals appears roughly constant across the entire plot. It appears to be slightly greater towards the lower end of at_bats, but the difference doesn't appear substantial.
m3 = lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)
There does seem to be a linear relationship between runs and bat_avg.
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.7 -26.3 -5.5 28.5 131.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -643 183 -3.51 0.0015 **
## bat_avg 5242 717 7.31 5.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared: 0.656, Adjusted R-squared: 0.644
## F-statistic: 53.4 on 1 and 28 DF, p-value: 5.88e-08
The R-squared value for this relationship is 0.6561, while the R-squared value for the relationship between runs and at_bats is 0.3729, so this is a much stronger relationship. We can predict runs much better with bat_avg than with at_bats.
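In simple linear regression, R-squared is the square of the correlation between the two variables, so the two values above can be converted back to correlations. A short sketch, taking the positive root because both fitted slopes are positive:

```r
# R-squared values quoted above.
r2_at_bats <- 0.3729
r2_bat_avg <- 0.6561

# With one predictor, R-squared = r^2; both slopes are positive,
# so the correlation is the positive square root.
r_at_bats <- sqrt(r2_at_bats)
r_bat_avg <- sqrt(r2_bat_avg)

print(round(c(at_bats = r_at_bats, bat_avg = r_bat_avg), 4))
```

The value for at_bats agrees, up to rounding, with the correlation computed at the start of the report (0.6106).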
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
plot(m3$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3)
Of the variables examined so far, bat_avg has the largest R-squared value (0.656) as a predictor of runs. The scatterplot shows a clear linear trend, and the residual histogram, normal probability plot, and residual plot appear consistent with the model conditions.
m4 = lm(runs ~ new_obs, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.46 -13.69 1.16 13.94 41.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.6 68.9 -9.96 1e-10 ***
## new_obs 1919.4 95.7 20.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.4 on 28 degrees of freedom
## Multiple R-squared: 0.935, Adjusted R-squared: 0.933
## F-statistic: 402 on 1 and 28 DF, p-value: <2e-16
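The newer variables are much more effective predictors of runs: the model with new_obs has an R-squared of 0.935, compared with 0.656 for bat_avg. I don't follow baseball, but new_obs is the on-base plus slugging statistic.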
plot(mlb11$runs ~ mlb11$new_obs)
abline(m4)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
plot(m4$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
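The four fits reported above can be lined up side by side. This sketch simply retypes the R-squared and residual standard error values from the printed summaries (rounded as shown) into a data frame and sorts them; the column names are illustrative.

```r
# Values copied from the summary() outputs of m1 to m4.
fits <- data.frame(
  model     = c("m1", "m2", "m3", "m4"),
  predictor = c("at_bats", "homeruns", "bat_avg", "new_obs"),
  r_squared = c(0.373, 0.627, 0.656, 0.935),
  resid_se  = c(66.5, 51.3, 49.2, 21.4),
  stringsAsFactors = FALSE
)

# Sort from strongest to weakest fit by R-squared.
fits <- fits[order(-fits$r_squared), ]
print(fits, row.names = FALSE)

# Predictor with the largest R-squared.
best <- fits$predictor[1]
print(best)
```

In this ordering the residual standard error falls as R-squared rises, which is expected because all four models use the same response and the same number of observations.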