download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
I would use a scatterplot to display the relationship between runs and at-bats.
plot(mlb11$at_bats,mlb11$runs)
# Yes, this relationship looks moderately linear. However, even if I knew a team's at_bats, I would not be comfortable using a linear model to predict runs from at_bats, because the scatterplot shows only a fairly mild correlation. In other words, at_bats does not appear to be an entirely reliable predictor of runs.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
# The correlation coefficient is above 0.50, but at 0.61 it suggests only a moderate linear correlation.
There seems to be a positive linear relationship, because higher at_bats totals are associated with higher run totals. However, there is also a lot of spread among the observations. While most observations follow this pattern, some teams with low at_bats totals scored many runs and some with high at_bats totals scored few.
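As a quick arithmetic check on what "moderate" means here, the correlation can be squared to get the share of variability in runs that a straight-line fit on at_bats would account for. The sketch below only reuses the value printed by cor() above; it does not need the data set.

```r
# Correlation between runs and at_bats, copied from the cor() output above.
r <- 0.610627

# In simple linear regression, the squared correlation equals R-squared.
r_squared <- r^2
round(r_squared, 4)
```

The result rounds to 0.3729, the same figure reported as Multiple R-squared in the summary(m1) output, so at_bats accounts for a little over a third of the variability in runs.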
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
The smallest sum of squares I got was 123,721.9, which I believe is the exact minimum sum of squared residuals. It was much lower than the values from my other attempts to find a line of best fit by clicking on two points. The closest of those attempts came to 132,059.9.
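The idea that no hand-picked line can beat the least squares line can be sketched without plot_ss. The example below is an illustration only: it uses R's built-in cars data set as a stand-in for mlb11, and the helper sum_sq() and the two "clicked" points are assumptions made for the demonstration.

```r
# Stand-in data: the built-in `cars` data set, not mlb11.
x <- cars$speed
y <- cars$dist

# Sum of squared residuals for an arbitrary line y = a + b * x.
# (sum_sq is a hypothetical helper written for this sketch.)
sum_sq <- function(a, b) sum((y - (a + b * x))^2)

# Least squares line and its sum of squared residuals.
fit <- lm(y ~ x)
b <- unname(coef(fit))
best <- sum_sq(b[1], b[2])

# A guessed line through the first and last observations,
# mimicking a two-click guess in plot_ss().
b_guess <- (y[50] - y[1]) / (x[50] - x[1])
a_guess <- y[1] - b_guess * x[1]
guess <- sum_sq(a_guess, b_guess)

c(least_squares = best, guess = guess)
```

Whatever two points are chosen, the guessed line's sum of squares can match the least squares value at best and is otherwise larger, which is the pattern described above for the clicked attempts.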
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
# ŷ = 415.2389 + 1.8345 * homeruns
# The slope of 1.8345 tells us that each additional home run is associated with an average increase of about 1.83 runs scored. Thus, a team's run total is strongly related to its home runs, since every additional home run corresponds to nearly two more runs.
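One way to read the slope is as the difference between two predictions that are one home run apart. The sketch below reuses the coefficients from the summary(m2) output; the function predict_runs() and the home run totals 150 and 151 are hypothetical values chosen for illustration.

```r
# Coefficients copied from the summary(m2) output above.
b0 <- 415.2389  # intercept
b1 <- 1.8345    # slope for homeruns

# Hypothetical helper: predicted runs for a given number of home runs.
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Predictions for two illustrative teams that differ by one home run.
slope_check <- predict_runs(151) - predict_runs(150)
slope_check
```

The difference equals the slope regardless of which pair of consecutive home run totals is used, which is why the slope is read as a per-home-run change in predicted runs.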
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
A team manager would predict about 728 runs for a team with 5,578 at-bats based on the least squares regression line. This is an overestimate, as the actual data show that 5,578 at-bats corresponds to about 700 runs. The residual (observed minus predicted) is approximately -28.
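The prediction and residual can be reproduced by hand from the coefficients in the summary(m1) output. In the sketch below, the observed value of 700 is the approximate figure quoted in the text, not an exact value from the data.

```r
# Coefficients copied from the summary(m1) output above.
b0 <- -2789.2429  # intercept
b1 <- 0.6305      # slope for at_bats

# Predicted runs for a team with 5,578 at-bats.
predicted <- b0 + b1 * 5578

# Approximate observed runs, as quoted in the text.
observed <- 700

# Residual = observed - predicted; a negative value means overestimation.
residual <- observed - predicted
round(c(predicted = predicted, residual = residual))
```

With the fitted model in the workspace, predict(m1, newdata = data.frame(at_bats = 5578)) should give the same prediction without copying the coefficients by hand.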
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
# It appears that the residuals corresponding to the middle portion of at-bats have higher variability than the residuals corresponding to low or high amounts of at-bats.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
# The histogram and the normal probability plot both indicate that the residuals are very nearly normally distributed.
It seems that the constant variability condition is not entirely met: in the residuals plot, the spread of the residuals is larger for at-bat values in the middle of the range. The difference is small enough, though, that one could probably claim that the condition is met.
plot(mlb11$runs~mlb11$bat_avg)
m3 <- lm(runs~bat_avg,data = mlb11)
abline(m3)
# Yes, there does seem to be a linear relationship between batting average and runs. Higher batting averages are associated with more runs and vice versa.
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
# The relationship between batting average and runs is similar to the relationship between at-bats and runs, but the positive linear correlation is stronger for batting average. The R² values confirm this: R² for runs as a function of at-bats is only 0.37, whereas R² for runs as a function of batting average is 0.66.
plot_ss(x = mlb11$bat_avg, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -642.8 5242.2
##
## Sum of Squares: 67849.52
cor(mlb11$bat_avg,mlb11$runs)
## [1] 0.8099859
# Yes, batting average predicts runs better than at-bats. This can be inferred from the sum of squared residuals, which is far lower for batting average (67,849.52) than for at-bats (123,721.9).
# It appears that batting average is the best predictor of runs out of the seven traditionally used variables. It had the highest correlation coefficient, 0.81, slightly edging out homeruns and hits, as well as the lowest sum of squared residuals, 67,849.52.
plot_ss(x = mlb11$hits,y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -375.5600 0.7589
##
## Sum of Squares: 70638.75
plot_ss(x = mlb11$homeruns,y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 415.239 1.835
##
## Sum of Squares: 73671.99
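Because every model here has the same response (runs), the sums of squares above can be put on the R² scale using R² = 1 - SSE / SST. The sketch below is a rough check: the total sum of squares is not printed anywhere in the output, so it is recovered from the at_bats model's SSE and its rounded R² of 0.3729, which makes the results approximate.

```r
# Sums of squared residuals copied from the plot_ss() outputs above.
sse <- c(at_bats  = 123721.9,
         bat_avg  = 67849.52,
         hits     = 70638.75,
         homeruns = 73671.99)

# Total sum of squares for runs, recovered from the at_bats model:
# R^2 = 1 - SSE / SST, so SST = SSE / (1 - R^2).
sst <- 123721.9 / (1 - 0.3729)

# Approximate R-squared for each predictor.
r2 <- 1 - sse / sst
round(r2, 2)
```

The values for bat_avg and homeruns agree with the Multiple R-squared figures in summary(m3) and summary(m2) to about three decimal places, and hits comes out near 0.64, between the two. This is consistent with batting average narrowly leading the traditional variables shown here.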
plot(mlb11$runs~mlb11$new_obs)
m4 <- lm(runs~new_obs,data = mlb11)
abline(m4)
summary(m4)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
# The R² value of 0.9349 indicates that 93.5% of the variability in runs is explained by new_obs, which shows that new_obs is the best predictor of runs of the 10 variables.