download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
library(ggplot2)
ggplot(data = mlb11, aes(x = at_bats, y = runs)) +
  geom_point()
The relationship does not look remotely linear. I would not be comfortable using at bats as a predictor for the number of runs.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
There is a moderate positive correlation (r ≈ 0.61) between at bats and the number of runs scored. It is not a strong correlation, since the points are widely scattered rather than falling close to a line, but generally speaking the more at bats, the more runs scored.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
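The "Sum of Squares" that plot_ss reports is the sum of the squared vertical distances between the points and the chosen line. A minimal sketch of that calculation follows, using made-up toy data (the vectors `x` and `y` and the helper `sum_sq` are hypothetical, not part of the lab or of mlb11). It compares a hand-picked line with the least squares line.

```r
# Hypothetical toy data, not taken from mlb11
x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)

# Sum of squared residuals for a candidate line y = b0 + b1 * x
sum_sq <- function(b0, b1, x, y) {
  fitted <- b0 + b1 * x
  sum((y - fitted)^2)
}

# A hand-picked line, similar to clicking two points in plot_ss
guess <- sum_sq(b0 = 0, b1 = 2, x = x, y = y)

# The least squares line found by lm()
fit <- lm(y ~ x)
coefs <- unname(coef(fit))
best <- sum_sq(coefs[1], coefs[2], x, y)

c(guess = guess, best = best)
```

No hand-picked line can give a smaller sum of squares than the one lm() finds, which is why the least squares line is used as the fitted model.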
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
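Two numbers in this summary can be cross-checked against the earlier output. For a simple linear regression, the Multiple R-squared is the square of the correlation coefficient, and the residual standard error is the square root of the sum of squared residuals divided by the degrees of freedom. The sketch below uses only values copied from the printed output; it assumes the 123721.9 printed by plot_ss is the sum of squares for the least squares line, which is consistent with its coefficients matching those of m1.

```r
# Values copied from the printed output above
r   <- 0.610627    # cor(mlb11$runs, mlb11$at_bats)
rss <- 123721.9    # Sum of Squares reported by plot_ss
df  <- 28          # residual degrees of freedom

r_squared <- r^2            # should agree with Multiple R-squared
rse <- sqrt(rss / df)       # should agree with Residual standard error

round(c(r_squared = r_squared, rse = rse), 4)
```

Both values agree with the summary to the printed precision: about 37% of the variability in runs is explained by at bats.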
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
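ŷ = 415.2389 + 1.8345 × homeruns, where ŷ is the predicted number of runs. Generally speaking, more home runs = more runs: each additional home run is associated with about 1.83 more runs on average.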
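The slope of a fitted line can be read as the change in the predicted response per unit change in the predictor. As a rough illustration, the sketch below plugs the intercept and slope printed by summary(m2) into a small helper (`predict_runs` is a hypothetical name, and the home run totals 100 and 200 are made-up inputs).

```r
# Coefficients copied from the summary(m2) output
b0 <- 415.2389   # intercept
b1 <- 1.8345     # slope for homeruns

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) {
  b0 + b1 * homeruns
}

preds <- predict_runs(c(100, 200))
preds
diff(preds)   # 100 extra home runs change the prediction by 100 * b1
```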
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
They would predict just over 700 runs. This is an overestimate: the actual number of runs is closer to 670, a difference of roughly 30 to 50 runs.
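A prediction like this can also be computed directly from the fitted coefficients instead of being read off the plot. The sketch below is only an illustration: the at-bats value of 5578 is an assumption, since the text above does not state which value was used, and the coefficients are the rounded ones printed by summary(m1).

```r
# Coefficients copied from the summary(m1) output
b0 <- -2789.2429   # intercept
b1 <- 0.6305       # slope for at_bats

# Assumed input: the text does not state the at-bats value used
at_bats_new <- 5578

predicted_runs <- b0 + b1 * at_bats_new
predicted_runs

# With the model object available, the same prediction (up to rounding)
# would come from:
# predict(m1, newdata = data.frame(at_bats = at_bats_new))
```

The residual for a team is its observed runs minus this predicted value, so a negative residual means the line overestimates.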
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
# adds a horizontal dashed line at y = 0
There is no real pattern in the residuals plot. The residuals are spread somewhat evenly above and below zero across the range of at-bats, with no curvature evident. This lack of pattern suggests that a linear model is a reasonable description of the relationship between runs and at-bats.
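For contrast, it helps to see what a patterned residual plot looks like. The sketch below fits a straight line to made-up, deliberately curved data (`x` and `y` are hypothetical, not from mlb11). The residuals are positive at both ends and negative in the middle, the kind of U shape that would signal a nonlinear relationship.

```r
# Hypothetical curved data: y is exactly x squared
x <- 1:5
y <- x^2

# Fit a straight line anyway
fit_curved <- lm(y ~ x)
res <- unname(residuals(fit_curved))

round(res, 8)   # positive, negative, negative, negative, positive

# The same residual plot used for m1 would show the U shape:
# plot(res ~ x)
# abline(h = 0, lty = 3)
```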
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
Yes, the nearly normal residuals condition does appear to be met, based both on the histogram and the probability plot.
Based on the plot in Exercise 1, I would say that the constant variability condition has been met. The spread of the points around the regression line stays largely constant across the range of at-bats.