if (Sys.info()["sysname"] == "Windows") {
setwd("~/Masters/DATA606/Week7/Lab/Lab7")
} else {
setwd("~/Documents/Masters/DATA606/Week7/Lab/Lab7")
}
load("more/mlb11.RData")
require(ggplot2)
## Loading required package: ggplot2
Answer:
Since runs are also a numerical variable, I would use a scatter plot to display each data point.
ggplot(mlb11, aes(y = runs, x = at_bats)) + geom_point() + geom_smooth(method = lm,
fullrange = TRUE)
To determine whether a linear regression line is a good fit for runs vs. at_bats, we should look at the correlation coefficient:
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
cor(mlb11$runs, mlb11$at_bats)^2
## [1] 0.3728654
The r-squared value for the linear regression line is 0.37, which means that the model accounts for approximately 37% of the variance in runs. This does not indicate a good fit. The residual plot does not seem to show a pattern but, given the r-squared value, the model still does not appear to be a good fit.
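The link between the two numbers above can be checked directly: in a simple linear regression with one predictor, the squared correlation coefficient equals the model's r-squared. The sketch below is a minimal illustration on simulated data. The variables `x` and `y` and the parameters used to generate them are assumptions made up for the example, not the mlb11 values.

```r
# Hypothetical simulated data; this is NOT the mlb11 data set.
set.seed(1)
x <- rnorm(30, mean = 5500, sd = 80)
y <- -2800 + 0.63 * x + rnorm(30, sd = 65)

fit <- lm(y ~ x)

r <- cor(x, y)                        # correlation coefficient
r_squared <- summary(fit)$r.squared   # r-squared reported by lm

# With a single predictor, the squared correlation equals r-squared.
all.equal(r^2, r_squared)
```

The same comparison could be run on the real data by replacing `x` and `y` with `mlb11$at_bats` and `mlb11$runs`.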
e1.lm <- lm(runs ~ at_bats, data = mlb11)
summary(e1.lm)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
residuals <- data.frame(at_bats = mlb11$at_bats, residuals = resid(e1.lm))
ggplot(residuals, aes(y = residuals, x = at_bats)) + geom_point() +
geom_hline(yintercept = 0)
Answer:
There appears to be a weak, positive linear relationship between runs and at bats. I describe the relationship as positive because the regression line has a positive slope, and as weak because of the low r-squared value. A linear form appears to be appropriate since there were no obvious patterns in the residuals plot.
Answer:
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
The best run I was able to produce is summarized below:
Call: lm(formula = y ~ x, data = pts)
Coefficients: (Intercept) x
-4284.0472 0.9026
Sum of Squares: 139329.3
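My hand-picked line has a larger sum of squares (139329.3) than the least squares line (123721.9), which is expected because the least squares line minimizes this quantity. The sketch below illustrates the idea on simulated data. The data, the helper `sum_sq`, and the guessed line are all assumptions made up for the example and do not reproduce the `plot_ss` output.

```r
# Hypothetical simulated data standing in for at_bats (x) and runs (y).
set.seed(42)
x <- rnorm(30, mean = 5500, sd = 80)
y <- -2800 + 0.63 * x + rnorm(30, sd = 65)

# Sum of squared residuals for a line with the given intercept and slope.
# (sum_sq is a helper defined here for illustration only.)
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line and its sum of squares.
fit <- lm(y ~ x)
ss_best <- sum_sq(coef(fit)[[1]], coef(fit)[[2]])

# A hand-picked alternative: a steeper line through the point of means.
slope_guess <- coef(fit)[[2]] * 1.4
int_guess <- mean(y) - slope_guess * mean(x)
ss_guess <- sum_sq(int_guess, slope_guess)

# The least squares line always has the smaller sum of squares.
ss_best < ss_guess
```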
Answer:
m1 <- lm(runs ~ homeruns, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
The equation for the least squares regression line is:
\[\widehat{runs} = 415.24 + 1.83 \times homeruns\]
The slope of the line is 1.83, which means that for each additional home run, the model predicts about 1.83 more runs scored.
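As a worked illustration of the fitted line, consider a hypothetical team that hits 150 home runs (a value chosen only for illustration, not taken from the data). Using the coefficients reported in the summary above, the predicted number of runs would be approximately:
\[\widehat{runs} = 415.24 + 1.8345 \times 150 \approx 690.4\]
A second hypothetical team with 151 home runs would be predicted to score about 1.83 more runs, which is exactly what the slope describes.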
Answer:
m1 <- lm(runs ~ at_bats, data = mlb11)
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
m1.lm <- summary(m1)
m1_int <- unname(coefficients(m1.lm))[[1]]
m1_slope <- unname(coefficients(m1.lm))[[2]]
m1_predict <- m1_int + 5578 * m1_slope
m1_data <- mlb11[which(mlb11$at_bats == 5579), "runs"]
Since the intercept is -2789.24 and the slope is 0.63, the linear regression line would predict 727.96 runs from 5,578 at bats. The closest data point to 5,578 at bats is 5,579 at bats, for which 713 runs were observed. Therefore the prediction is an overestimate by 14.96 runs.
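The same kind of prediction can also be obtained with `predict()` instead of multiplying the coefficients by hand. The sketch below is a minimal illustration on simulated data. The data frame `teams` and the parameters used to generate it are assumptions made up for the example, so its numbers will not match the values quoted above.

```r
# Hypothetical simulated data; the real analysis would use mlb11.
set.seed(7)
teams <- data.frame(at_bats = round(rnorm(30, mean = 5500, sd = 80)))
teams$runs <- round(-2800 + 0.63 * teams$at_bats + rnorm(30, sd = 65))

fit <- lm(runs ~ at_bats, data = teams)

# predict() applies the fitted intercept and slope to new predictor values.
predicted <- predict(fit, newdata = data.frame(at_bats = 5578))

# The same value computed by hand from the named coefficients.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["at_bats"]] * 5578

# Residual = observed - predicted; a negative residual means the
# model overestimated that observation.
first_residual <- teams$runs[1] - predict(fit)[[1]]
```

Extracting coefficients by name, as in `coef(fit)[["at_bats"]]`, avoids depending on their position in the summary table.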
Answer:
There does not appear to be a pattern in the residuals plot. This indicates that a linear model is appropriate for these data points, since there does not appear to be an obvious non-linear relationship.
Answer:
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
m1_alt <- data.frame(at_bats = mlb11$at_bats, residuals = m1$residuals)
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
Yes, the histogram appears to have a roughly normal shape. However, the normal probability plot suggests that the residuals may have short tails, which means that their distribution could be narrower in the tails than a normal distribution.
Answer:
Yes, the variability appears to be the same regardless of the number of at bats.
Answer:
I will pick bat_avg to compare to runs, since a team that registers more hits (a higher batting average) would be expected to score more runs.
ggplot(mlb11, aes(y = runs, x = bat_avg)) + geom_point() + geom_smooth(method = lm,
fullrange = TRUE)
bat.cor <- cor(mlb11$bat_avg, mlb11$runs)^2
lm.bat <- lm(runs ~ bat_avg, data = mlb11)
summary(lm.bat)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
residuals <- data.frame(bat_avg = mlb11$bat_avg, residuals = resid(lm.bat))
ggplot(residuals, aes(y = residuals, x = bat_avg)) + geom_point() +
geom_hline(yintercept = 0)
Yes, given the scatter plot and the r-squared value of 0.66 (a correlation coefficient of about 0.81), it appears there is a linear relationship between batting average and runs.
Answer:
This relationship has a stronger linear correlation than the relationship between runs and at_bats. The r-squared value for at_bats vs. runs was 0.373 whereas the r-squared value for bat_avg vs. runs is 0.656.
Answer:
# # at_bats
# m1 <- lm(runs ~ at_bats, data = mlb11)
# cor(mlb11$at_bats, mlb11$runs)^2
# summary(m1)
# ggplot(mlb11, aes(y = runs, x = at_bats)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m1$residuals)
# qqnorm(m1$residuals)
# qqline(m1$residuals)
# # hits
# m2 <- lm(runs ~ hits, data = mlb11)
# cor(mlb11$hits, mlb11$runs)^2
# summary(m2)
# ggplot(mlb11, aes(y = runs, x = hits)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m2$residuals)
# qqnorm(m2$residuals)
# qqline(m2$residuals)
# # homeruns
# m3 <- lm(runs ~ homeruns, data = mlb11)
# cor(mlb11$homeruns, mlb11$runs)^2
# summary(m3)
# ggplot(mlb11, aes(y = runs, x = homeruns)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m3$residuals)
# qqnorm(m3$residuals)
# qqline(m3$residuals)
# batting average
m4 <- lm(runs ~ bat_avg, data = mlb11)
cor(mlb11$bat_avg, mlb11$runs)^2
## [1] 0.6560771
summary(m4)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
ggplot(mlb11, aes(y = runs, x = bat_avg)) + geom_point() + geom_smooth(method = lm,
fullrange = TRUE)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
residuals <- data.frame(bat_avg = mlb11$bat_avg, residuals = resid(m4))
ggplot(residuals, aes(y = residuals, x = bat_avg)) + geom_point() +
geom_hline(yintercept = 0)
# # strikeouts
# m5 <- lm(runs ~ strikeouts, data = mlb11)
# cor(mlb11$strikeouts, mlb11$runs)^2
# summary(m5)
# ggplot(mlb11, aes(y = runs, x = strikeouts)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m5$residuals)
# qqnorm(m5$residuals)
# qqline(m5$residuals)
# # stolen bases
# m6 <- lm(runs ~ stolen_bases, data = mlb11)
# cor(mlb11$stolen_bases, mlb11$runs)^2
# summary(m6)
# ggplot(mlb11, aes(y = runs, x = stolen_bases)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m6$residuals)
# qqnorm(m6$residuals)
# qqline(m6$residuals)
# # wins
# m7 <- lm(runs ~ wins, data = mlb11)
# cor(mlb11$wins, mlb11$runs)^2
# summary(m7)
# ggplot(mlb11, aes(y = runs, x = wins)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m7$residuals)
# qqnorm(m7$residuals)
# qqline(m7$residuals)
Of the variables I tested with linear regression models, batting average appears to predict runs best. The summary of this model is displayed above, and all other analyses have been commented out, as requested.
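Rather than fitting each model by hand, the r-squared values for several candidate predictors could be collected in one pass and ranked. The sketch below is one possible way to do this, shown on simulated data. The data frame `teams`, its two columns, and the parameters used to generate them are assumptions made up for the example. With the real data, `teams` would be `mlb11` and `predictors` would list its column names.

```r
# Hypothetical simulated data; this is NOT the mlb11 data set.
set.seed(11)
n <- 30
teams <- data.frame(
  bat_avg = rnorm(n, mean = 0.255, sd = 0.012),
  at_bats = rnorm(n, mean = 5500, sd = 80)
)
teams$runs <- -640 + 5200 * teams$bat_avg + rnorm(n, sd = 50)

predictors <- c("at_bats", "bat_avg")

# Fit runs ~ <predictor> for each candidate and keep its r-squared.
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "runs"), data = teams)
  summary(fit)$r.squared
})

# Largest r-squared first.
sort(r_squared, decreasing = TRUE)
```

A higher r-squared alone does not validate a model, so the residual plot, histogram, and normal probability plot would still need to be checked for the chosen predictor.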
Answer:
Model diagnostics have been provided as a response to question 4.