download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
I would use a scatterplot to display the relationship between these two variables. The relationship in the plot is only loosely linear, with the points widely scattered, so I would not feel comfortable using at bats alone to predict the number of runs. The correlation coefficient is only 0.6106, which indicates a moderate positive association.
plot(mlb11$at_bats, mlb11$runs)
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.6106
With a correlation coefficient of 0.6106, the relationship is neither very weak nor very strong. The direction of the relationship is positive, and the form is closer to linear than curvilinear.
The best sum of squares I got was 143312. My lab mates found smaller sums of squares, ranging from 139853 to 142948. For comparison, the least squares line reported by plot_ss below has a sum of squares of 123722.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.243 0.631
##
## Sum of Squares: 123722
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.243 0.631
##
## Sum of Squares: 123722
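To make the sum of squares idea concrete, here is a minimal sketch of what is being compared when a hand-drawn line is set against the least squares line. The data are made-up toy values, not mlb11, and `sum_sq` is a hypothetical helper written for this illustration.

```r
# Toy data standing in for at_bats (x) and runs (y); values are made up.
x <- c(5400, 5450, 5500, 5550, 5600, 5650)
y <- c(620, 650, 640, 700, 720, 735)

# Sum of squared residuals for the line y = intercept + slope * x.
sum_sq <- function(intercept, slope, x, y) {
  residuals <- y - (intercept + slope * x)
  sum(residuals^2)
}

# A hand-picked line, like one drawn by clicking two points.
ss_guess <- sum_sq(intercept = -1900, slope = 0.47, x, y)

# The least squares line from lm() for comparison.
fit <- lm(y ~ x)
ss_best <- sum_sq(coef(fit)[1], coef(fit)[2], x, y)

ss_guess >= ss_best  # TRUE: no line has a smaller sum of squares than lm()'s
```

Any line chosen by eye can at best tie the lm() line, which is why hand-picked sums of squares come out larger.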
Equation of the regression line: runs_hat = -2789.2429 + 0.6305 * at_bats
m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
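The equation of the line can be read off the fitted model instead of being retyped from the summary. The sketch below assumes a toy data frame whose column names mirror mlb11 but whose values are made up, so the printed numbers will not match the lab output.

```r
# Toy data; the column names mirror mlb11 but the values are made up.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 650, 640, 700, 720, 735)
)

fit <- lm(runs ~ at_bats, data = toy)

# coef() returns a named vector: "(Intercept)" and the predictor's name.
b <- coef(fit)

# Build the equation text from the fitted values rather than retyping them.
equation <- sprintf("runs_hat = %.4f + %.4f * at_bats",
                    b["(Intercept)"], b["at_bats"])
print(equation)
```

Running the same two coef() lookups on m1 would give the intercept and slope used in the equation for at_bats.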
Equation of the regression line: runs_hat = 415.2389 + 1.8345 * homeruns
This shows that a team's success in terms of runs scored is fairly strongly and positively related to the number of home runs it hits: each additional home run is associated with about 1.83 more runs on average. This is reasonable, as a home run is itself a run and often brings in extra runs on top.
m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
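As a small worked example of what the slope means, the sketch below plugs two hypothetical home run totals into the reported equation. The coefficients are the ones from the summary above; the totals 150 and 160 and the helper `predict_runs` are assumptions chosen only for illustration.

```r
# Coefficients as reported in the summary(m2) output.
intercept <- 415.2389
slope     <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs.
predict_runs <- function(homeruns) intercept + slope * homeruns

# Two hypothetical teams that differ by 10 home runs.
pred <- predict_runs(c(150, 160))
diff(pred)   # 18.345: ten extra home runs go with about 18 more predicted runs
```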
Predicted runs = -2789.2429 + 0.6305 * 5579 = 728.3166. Assuming the coach is from the Phillies, who had 5579 at bats, the observed run count is 713. The prediction is therefore an overestimate by about 728 - 713 = 15 runs.
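The prediction and its residual can be checked directly in R. This sketch uses only the coefficients reported for m1 and the at bats and runs figures quoted in the text; the variable names are chosen here for readability.

```r
# Coefficients as reported for m1, and the figures quoted in the text.
intercept     <- -2789.2429
slope         <- 0.6305
at_bats       <- 5579
observed_runs <- 713

predicted_runs <- intercept + slope * at_bats      # explicit * is required in R
residual       <- observed_runs - predicted_runs   # observed minus predicted

round(predicted_runs, 4)  # 728.3166
round(residual, 4)        # -15.3166, negative because the line overestimates
```

A negative residual means the observed value falls below the line, which is the same thing as the line overestimating.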
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
-2789.2429 + 0.6305 * 5579 # R needs an explicit *; evaluates to 728.3166
The residual plot seems to show homogeneous variance across values of at_bats.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
The normal probability plot does not follow a perfectly straight line, but it is close enough that the nearly normal residuals condition appears to be met.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
The constant variability condition appears to be met, as the spread of the residuals does not change noticeably across the values of at_bats.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
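The three condition checks above (residuals against the predictor, histogram, normal probability plot) can be bundled so they are drawn the same way for every model. The following is a minimal sketch: `check_conditions` is a hypothetical helper that is not part of the lab, and the toy data are made up.

```r
# Draw the three diagnostic plots for a simple regression in one call.
# `check_conditions` is a hypothetical helper, not part of the lab code.
check_conditions <- function(model, predictor, label = "predictor") {
  res <- residuals(model)
  old <- par(mfrow = c(1, 3))      # three panels side by side
  on.exit(par(old))                # restore the previous layout afterwards
  plot(predictor, res, xlab = label, ylab = "residuals")  # linearity, spread
  abline(h = 0, lty = 3)           # dashed reference line at zero
  hist(res, main = "Residuals")    # rough shape of the distribution
  qqnorm(res)                      # normal probability plot
  qqline(res)
  invisible(res)
}

# Toy data standing in for mlb11; values are made up.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 650, 640, 700, 720, 735)
)
fit <- lm(runs ~ at_bats, data = toy)
res <- check_conditions(fit, toy$at_bats, label = "at_bats")
```

With the lab data the equivalent call would be check_conditions(m1, mlb11$at_bats, "at_bats").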
There does appear to be a positive linear correlation between the homeruns and the runs.
plot(mlb11$runs ~ mlb11$homeruns)
abline(m2)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
The R-squared value for the at bats relationship is 0.3729. The R-squared value for the home runs relationship is 0.6266. Home runs therefore explain about 63% of the variability in runs, compared with about 37% for at bats, so homeruns is the better predictor.
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
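For a regression with a single predictor, R-squared is the square of the correlation coefficient, which ties these values back to the correlation computed earlier. The sketch below checks that identity on made-up toy data and then on the rounded figures reported in this lab.

```r
# Toy data; column names mirror mlb11 but the values are made up.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 650, 640, 700, 720, 735)
)

r  <- cor(toy$runs, toy$at_bats)
r2 <- summary(lm(runs ~ at_bats, data = toy))$r.squared

all.equal(r^2, r2)   # TRUE: R-squared equals the squared correlation

# The same relationship using the rounded values reported in this lab.
reported_r <- 0.6106
reported_r^2         # about 0.3728, matching the reported 0.3729 up to rounding
```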
R-squared values: at_bats = 0.3729, homeruns = 0.6266, bat_avg = 0.6561, wins = 0.361, hits = 0.6419.
Surprisingly, batting average has the highest R-squared of these five variables, so it is the most accurate predictor of runs among them.
plot(mlb11$runs ~ mlb11$at_bats)
m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
plot(mlb11$runs ~ mlb11$homeruns)
m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
plot(mlb11$runs ~ mlb11$bat_avg)
m3 = lm(runs ~ bat_avg, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.7 -26.3 -5.5 28.5 131.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -643 183 -3.51 0.0015 **
## bat_avg 5242 717 7.31 5.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared: 0.656, Adjusted R-squared: 0.644
## F-statistic: 53.4 on 1 and 28 DF, p-value: 5.88e-08
plot(mlb11$runs ~ mlb11$wins)
m4 = lm(runs ~ wins, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.45 -47.51 -7.48 47.35 142.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.12 89.22 3.83 0.00065 ***
## wins 4.34 1.09 3.98 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.338
## F-statistic: 15.8 on 1 and 28 DF, p-value: 0.000447
plot(mlb11$runs ~ mlb11$hits)
m5 = lm(runs ~ hits, data = mlb11)
summary(m5)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.72 -27.18 -5.23 19.32 140.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.560 151.181 -2.48 0.019 *
## hits 0.759 0.107 7.09 1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.2 on 28 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.629
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.04e-07
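Fitting one model per predictor by copy and paste makes it easy to mistype a variable name. As an alternative, here is a hedged sketch that loops over predictor names and collects only R-squared. The data frame is a made-up toy, so the values it prints are not the lab's results.

```r
# Toy data frame; column names mirror mlb11 but every value is made up.
toy <- data.frame(
  runs     = c(620, 650, 640, 700, 720, 735, 690, 610),
  at_bats  = c(5400, 5450, 5500, 5550, 5600, 5650, 5520, 5410),
  homeruns = c(110, 150, 120, 170, 190, 200, 160, 100),
  hits     = c(1350, 1400, 1380, 1450, 1480, 1500, 1440, 1330)
)

predictors <- c("at_bats", "homeruns", "hits")

# Fit runs ~ <predictor> for each name and keep only R-squared.
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "runs"), data = toy)
  summary(fit)$r.squared
})

sort(r_squared, decreasing = TRUE)   # best predictor first
```

Because the formula is built from the name being looped over, each R-squared is guaranteed to belong to the variable it is labelled with.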
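R-squared values: new_onbase = 0.8491, new_slug = 0.8969. The 0.6419 originally recorded for new_obs is identical to the hits value, which indicates that model was fit on hits rather than new_obs, so new_obs still has to be fit with lm(runs ~ new_obs, data = mlb11).
Both new_onbase and new_slug predict runs more accurately than any of the traditional variables. They are on-base percentage and slugging percentage (new_obs is on-base plus slugging), which measure how often batters reach base and how many bases they gain per at bat, so it makes sense that they track run scoring more closely than raw counts do.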
plot(mlb11$runs ~ mlb11$new_onbase)
m6 = lm(runs ~ new_onbase, data = mlb11)
summary(m6)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.27 -18.33 3.25 19.52 69.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118 144 -7.74 2.0e-08 ***
## new_onbase 5654 450 12.55 5.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.6 on 28 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.844
## F-statistic: 158 on 1 and 28 DF, p-value: 5.12e-13
plot(mlb11$runs ~ mlb11$new_slug)
m7 = lm(runs ~ new_slug, data = mlb11)
summary(m7)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.8 68.7 -5.47 7.7e-06 ***
## new_slug 2681.3 171.8 15.60 2.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27 on 28 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.893
## F-statistic: 244 on 1 and 28 DF, p-value: 2.42e-15
plot(mlb11$runs ~ mlb11$new_obs)
m8 = lm(runs ~ new_obs, data = mlb11) # fit on new_obs (on-base plus slugging)
summary(m8)
Of the models fit above, new_slug is the best predictor of runs (R-squared = 0.897). The diagnostic plots below appear to support the conditions for regression with this model.
plot(m7$residuals ~ mlb11$new_slug)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
hist(m7$residuals)
qqnorm(m7$residuals)
qqline(m7$residuals) # adds diagonal line to the normal prob plot