Lab 9

Name: Sonora Williams

Section: 01l

Date: November 12, 2013

Exercises

Load data & inference function:

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1:

I would use a scatterplot to show the relationship between these two variables. The resulting plot shows a positive trend, but the points are widely scattered, so the relationship is not strongly linear. I would not feel very comfortable using at bats to predict the number of runs. The correlation coefficient is 0.6106, which indicates a moderate positive correlation.

plot(mlb11$at_bats, mlb11$runs)

plot of chunk unnamed-chunk-2

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.6106

Exercise 2:

With a correlation coefficient of 0.6106, the relationship is neither very weak nor very strong; it is moderate. The positive sign of the coefficient shows that the direction is positive: teams with more at bats tend to score more runs. The form is more linear than curvilinear.

Exercise 3:

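The smallest sum of squares I got was 143312. My lab mates found smaller sums of squares, ranging from 139853 to 142948. The output below reports 123722 with the same coefficients as the least-squares fit in Exercise 4, so it appears to show the least-squares line rather than a hand-picked one.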
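
The quantity being minimized here can also be computed directly for any candidate line. The following is a minimal sketch; the helper name `sum_sq_resid` is hypothetical and is not part of the lab's `plot_ss` function.

```r
# Sum of squared residuals for a candidate line y = intercept + slope * x.
# sum_sq_resid is an illustrative helper, not a function from the lab materials.
sum_sq_resid <- function(x, y, intercept, slope) {
  fitted <- intercept + slope * x   # value the line predicts for each x
  sum((y - fitted)^2)               # add up the squared vertical distances
}

# With the lab data loaded, try the least-squares coefficients from summary(m1).
if (exists("mlb11")) {
  print(sum_sq_resid(mlb11$at_bats, mlb11$runs, -2789.2429, 0.6305))
}
```

Any other intercept and slope passed to this helper should give a sum of squares at least as large as the least-squares one.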

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

plot of chunk unnamed-chunk-3

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

plot of chunk unnamed-chunk-3

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722

Exercise 4:

Equation of the regression line: runs hat = -2789.2429 + 0.6305 × at_bats

m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
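
Rather than copying the coefficients by hand, the equation could be built from the fitted model. This is a sketch under the assumption of a simple regression with one predictor; `regression_equation` is a hypothetical helper name.

```r
# Build a "response hat = b0 + b1 x predictor" string from a one-predictor lm fit.
# regression_equation is an illustrative helper, not part of the lab materials.
regression_equation <- function(model) {
  vars <- all.vars(formula(model))   # response first, then the predictor
  b <- unname(coef(model))           # b[1] = intercept, b[2] = slope
  sign_chr <- if (b[2] < 0) "-" else "+"
  sprintf("%s hat = %.4f %s %.4f x %s", vars[1], b[1], sign_chr, abs(b[2]), vars[2])
}

# With the model from this exercise available:
if (exists("m1")) print(regression_equation(m1))
```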

Exercise 5:

Equation of the regression line: runs hat = 415.2389 + 1.8345 × homeruns

This shows that a team's runs scored are fairly strongly and positively associated with the number of home runs it hits (R-squared = 0.627). The slope means that each additional home run is associated with about 1.83 more runs on average. This is reasonable, as a home run is itself a run and often brings in extra runs on top.

m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 6:

Predicted runs = -2789.2429 + 0.6305 × 5579 = 728.3166. Assuming the coach is from the Phillies, with 5579 at bats, the observed run count is 713. The model therefore overestimates by 728.3 - 713 = 15.3, or about 15 runs.
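
The same arithmetic can be written out in R, including the residual (observed minus predicted). This sketch hard-codes the rounded coefficients and the Phillies figures quoted above; the variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary(m1) output
b0 <- -2789.2429
b1 <- 0.6305

at_bats_phillies <- 5579   # at bats quoted above
runs_observed    <- 713    # observed runs quoted above

runs_predicted    <- b0 + b1 * at_bats_phillies      # 728.3166
residual_phillies <- runs_observed - runs_predicted  # -15.3166

print(runs_predicted)
print(residual_phillies)
```

A negative residual means the observed value lies below the line, which is what an overestimate looks like.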

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

plot of chunk unnamed-chunk-6

summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
-2789.2429 + 0.6305 * 5579  # R needs an explicit * for multiplication
## [1] 728.3

Exercise 7:

The residual plot seems to show homogeneous variance across values of at_bats.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

plot of chunk unnamed-chunk-7

Exercise 8:

The normal probability plot does not show a perfectly straight line, but the residuals seem close enough to normal to meet the nearly normal residuals condition.

hist(m1$residuals)

plot of chunk unnamed-chunk-8

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

plot of chunk unnamed-chunk-8

Exercise 9:

The constant variability condition appears to be met, as the spread of the residuals does not change noticeably across the values of at_bats.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

plot of chunk unnamed-chunk-9

Exercise 10:

There does appear to be a positive linear relationship between home runs and runs.

plot(mlb11$runs ~ mlb11$homeruns)
abline(m2)

plot of chunk unnamed-chunk-10

summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 11:

The R-squared value for the at bats relationship is 0.3729. The R-squared value for the home runs relationship is 0.6266. Therefore home runs predict runs more accurately: that model explains about 63% of the variability in runs, compared with about 37% for at_bats.
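
For a regression with one predictor, R-squared is the square of the correlation coefficient, which gives a quick consistency check against Exercise 1. The sketch below uses the rounded correlation printed there; the variable names are illustrative.

```r
# Correlation between runs and at_bats, as printed in Exercise 1 (rounded)
r_at_bats <- 0.6106

# In simple linear regression, R-squared equals the squared correlation
r_squared_at_bats <- r_at_bats^2
print(round(r_squared_at_bats, 3))   # 0.373

# If the fitted models are available, the values can be pulled out directly
if (exists("m1") && exists("m2")) {
  print(c(at_bats = summary(m1)$r.squared, homeruns = summary(m2)$r.squared))
}
```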

summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 12:

R-squared values: at_bats = 0.3729, homeruns = 0.6266, bat_avg = 0.6561, wins = 0.361, hits = 0.6419

Surprisingly, batting average is the most accurate predictor of runs among these five variables, since it has the highest R-squared.
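
Fitting each model by hand is repetitive, so the comparison could also be done in one pass. This is a minimal sketch; `r_squared_by_predictor` is a hypothetical helper, and the column names are the ones used in the models in this lab.

```r
# Fit one simple regression per predictor and return the R-squared values.
# r_squared_by_predictor is an illustrative helper, not part of the lab materials.
r_squared_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)  # e.g. runs ~ at_bats
    summary(fit)$r.squared
  })
}

# With the lab data loaded, rank the five traditional variables
if (exists("mlb11")) {
  r2 <- r_squared_by_predictor(mlb11, "runs",
                               c("at_bats", "hits", "homeruns", "bat_avg", "wins"))
  print(sort(r2, decreasing = TRUE))
}
```

The result is a named numeric vector, so the best predictor is the first name after sorting.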

plot(mlb11$runs ~ mlb11$at_bats)

plot of chunk unnamed-chunk-12

m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
plot(mlb11$runs ~ mlb11$homeruns)

plot of chunk unnamed-chunk-12

m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07
plot(mlb11$runs ~ mlb11$bat_avg)

plot of chunk unnamed-chunk-12

m3 = lm(runs ~ bat_avg, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -94.7  -26.3   -5.5   28.5  131.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -643        183   -3.51   0.0015 ** 
## bat_avg         5242        717    7.31  5.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared: 0.656,   Adjusted R-squared: 0.644 
## F-statistic: 53.4 on 1 and 28 DF,  p-value: 5.88e-08
plot(mlb11$runs ~ mlb11$wins)

plot of chunk unnamed-chunk-12

m4 = lm(runs ~ wins, data = mlb11)
summary(m4)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145.45  -47.51   -7.48   47.35  142.19 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   342.12      89.22    3.83  0.00065 ***
## wins            4.34       1.09    3.98  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361,   Adjusted R-squared: 0.338 
## F-statistic: 15.8 on 1 and 28 DF,  p-value: 0.000447
plot(mlb11$runs ~ mlb11$hits)

plot of chunk unnamed-chunk-12

m5 = lm(runs ~ hits, data = mlb11)
summary(m5)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.72  -27.18   -5.23   19.32  140.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.560    151.181   -2.48    0.019 *  
## hits           0.759      0.107    7.09    1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 50.2 on 28 degrees of freedom
## Multiple R-squared: 0.642,   Adjusted R-squared: 0.629 
## F-statistic: 50.2 on 1 and 28 DF,  p-value: 1.04e-07

Exercise 13:

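R-squared values: new_onbase = 0.8491, new_slug = 0.8969. The value for new_obs was not obtained: model m8 below was fit on hits rather than new_obs, so its R-squared of 0.6419 just repeats the hits result from Exercise 12.

Both new_onbase and new_slug predict runs more accurately than any of the five traditional variables, whose best R-squared was 0.6561. I do not know exactly what these new variables measure, so I cannot fully judge whether the improvement makes sense, but the gain in R-squared is large.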

plot(mlb11$runs ~ mlb11$new_onbase)

plot of chunk unnamed-chunk-13

m6 = lm(runs ~ new_onbase, data = mlb11)
summary(m6)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58.27 -18.33   3.25  19.52  69.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1118        144   -7.74  2.0e-08 ***
## new_onbase      5654        450   12.55  5.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 32.6 on 28 degrees of freedom
## Multiple R-squared: 0.849,   Adjusted R-squared: 0.844 
## F-statistic:  158 on 1 and 28 DF,  p-value: 5.12e-13
plot(mlb11$runs ~ mlb11$new_slug)

plot of chunk unnamed-chunk-13

m7 = lm(runs ~ new_slug, data = mlb11)
summary(m7)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -375.8       68.7   -5.47  7.7e-06 ***
## new_slug      2681.3      171.8   15.60  2.4e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 27 on 28 degrees of freedom
## Multiple R-squared: 0.897,   Adjusted R-squared: 0.893 
## F-statistic:  244 on 1 and 28 DF,  p-value: 2.42e-15
plot(mlb11$runs ~ mlb11$new_obs)

plot of chunk unnamed-chunk-13

m8 = lm(runs ~ hits, data = mlb11)  # note: this regresses on hits, not new_obs, so it repeats m5
summary(m8)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.72  -27.18   -5.23   19.32  140.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.560    151.181   -2.48    0.019 *  
## hits           0.759      0.107    7.09    1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 50.2 on 28 degrees of freedom
## Multiple R-squared: 0.642,   Adjusted R-squared: 0.629 
## F-statistic: 50.2 on 1 and 28 DF,  p-value: 1.04e-07
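
The model above uses hits as the predictor, so the plotted variable new_obs is never actually fit. A sketch of the fit that was presumably intended follows; `fit_new_obs` is a hypothetical helper name, and no output is shown because this model was not run in the original report.

```r
# Regress runs on new_obs. fit_new_obs is an illustrative helper; it assumes
# the data frame has columns named runs and new_obs, as mlb11 does in this lab.
fit_new_obs <- function(data) {
  lm(runs ~ new_obs, data = data)
}

# With the lab data loaded, report the R-squared for comparison with m6 and m7
if (exists("mlb11")) {
  m_obs <- fit_new_obs(mlb11)
  print(summary(m_obs)$r.squared)
}
```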

Exercise 14:

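Of the models fit above, new_slug appears to be the best predictor of runs (R-squared = 0.8969). The model diagnostics below (residual plot, histogram, and normal probability plot) support this notion.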

plot(m7$residuals ~ mlb11$new_slug)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

plot of chunk unnamed-chunk-14

hist(m7$residuals)

plot of chunk unnamed-chunk-14

qqnorm(m7$residuals)
qqline(m7$residuals)  # adds diagonal line to the normal prob plot

plot of chunk unnamed-chunk-14