Lab 9

Name: James Tian

Section: 3

Date: 11/12/13

Exercises

Load data:

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1:

plot(mlb11$runs ~ mlb11$at_bats)

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.6106

Exercise 2:

Runs and at bats are positively correlated (r ≈ 0.61). The relationship is only moderately strong, and several notable outliers can be identified. Most of the data are concentrated in the lower-to-middle range of at_bats, as few teams have more than 5620 at bats.

Exercise 3:

The smallest sum of squares I got was 141037.9.
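
As a minimal sketch of what this sum of squares measures, the code below uses made-up toy data rather than mlb11. For any candidate line, each residual is squared and the squares are added up; the line from lm() should give the smallest total. The helper name sum_sq and the toy vectors are illustrative assumptions, not part of the lab.

```r
# Toy data (hypothetical, not from mlb11)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with the given intercept and slope
sum_sq <- function(intercept, slope, x, y) {
  predicted <- intercept + slope * x
  sum((y - predicted)^2)
}

# A hand-picked candidate line
ss_guess <- sum_sq(0.5, 1.8, x, y)

# The least squares line from lm()
fit <- lm(y ~ x)
b <- unname(coef(fit))
ss_lm <- sum_sq(b[1], b[2], x, y)

# The least squares fit cannot do worse than the hand-picked line
ss_lm <= ss_guess
```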

Exercise 4:

m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Exercise 5:

m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

runs = 415.24 + 1.83 × homeruns

For each additional home run, the predicted number of total runs increases by about 1.83 (not just by 1).

Exercise 6:

-2789.2429 + 0.6305 * 5579

## [1] 728.3

713 - 728.3

## [1] -15.3

Observed (713) - Predicted (728.3) = Residual (-15.3). The residual is negative, so the model overestimates runs by about 15.
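
The same arithmetic can be written with named values, which makes the sign convention (observed minus predicted) explicit. This is only a sketch: the coefficients are the rounded values used in the calculation above, and the variable names are chosen here for illustration.

```r
# Coefficients as used in the calculation above (from summary(m1))
intercept <- -2789.2429
slope <- 0.6305

# Observed values used in this exercise
at_bats_obs <- 5579
runs_obs <- 713

# Prediction from the fitted line, then residual = observed - predicted
predicted <- intercept + slope * at_bats_obs
residual <- runs_obs - predicted  # negative means the line overestimates

round(predicted, 1)
round(residual, 1)
```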

Exercise 7:

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

There is no apparent pattern in the residual plot. This is consistent with a linear relationship between runs and at_bats.

Exercise 8:

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Both the histogram and the normal probability plot suggest that the residuals are approximately normally distributed.

Exercise 9:

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

The constant variability condition seems to be met because the variation of the residuals appears roughly constant across the entire plot. It appears slightly greater toward the lower end of at_bats, but the difference does not appear substantial.
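
A rough numeric companion to the visual check is to compare the spread of the residuals in the lower and upper halves of the predictor. The sketch below uses simulated stand-in data (not mlb11), and the split at the median is an arbitrary choice made here for illustration, not a formal test.

```r
set.seed(1)  # simulated stand-in data, not mlb11
x <- runif(30, 5400, 5700)
y <- -2789 + 0.63 * x + rnorm(30, sd = 66)

fit <- lm(y ~ x)
res <- resid(fit)

# Split residuals at the median of the predictor and compare their spread
lower <- res[x <= median(x)]
upper <- res[x > median(x)]
c(sd_lower = sd(lower), sd_upper = sd(upper))
```

Similar standard deviations in the two halves would agree with the impression of roughly constant variability; with only 30 points, moderate differences are expected by chance.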

Exercise 10:

m3 = lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)

There does seem to be a linear relationship between runs and bat_avg.

Exercise 11:

summary(m3)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -94.7  -26.3   -5.5   28.5  131.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -643        183   -3.51   0.0015 ** 
## bat_avg         5242        717    7.31  5.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared: 0.656,   Adjusted R-squared: 0.644 
## F-statistic: 53.4 on 1 and 28 DF,  p-value: 5.88e-08

The R² value for this relationship is 0.6561, while the R² value for the relationship between runs and at_bats is 0.3729, so this is a much stronger relationship. We can predict runs much better with bat_avg than with at_bats.
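
For a regression with a single predictor, R² equals the square of the correlation coefficient, which ties the R² for at_bats back to the correlation from Exercise 1. The sketch below checks this with the reported correlation and then with a small set of made-up toy values (an assumption for illustration, not lab data).

```r
# Correlation between runs and at_bats reported in Exercise 1
r <- 0.6106
r^2  # about 0.373, in line with the Multiple R-squared for m1

# Check the identity on toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.4, 3.8, 5.1, 6.3)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)
```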

Exercise 12:

plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)

hist(m3$residuals)

qqnorm(m3$residuals)
qqline(m3$residuals)

plot(m3$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3)

Of the models fit above (at_bats, homeruns, and bat_avg), the largest R² value is for runs versus bat_avg. The scatterplot, residual histogram, normal probability plot, and residual plot suggest that the conditions for the linear model are reasonably met.

Exercise 13:

m4 = lm(runs ~ new_obs, data = mlb11)
summary(m4)

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.46 -13.69   1.16  13.94  41.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -686.6       68.9   -9.96    1e-10 ***
## new_obs       1919.4       95.7   20.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 21.4 on 28 degrees of freedom
## Multiple R-squared: 0.935,   Adjusted R-squared: 0.933 
## F-statistic:  402 on 1 and 28 DF,  p-value: <2e-16

new_obs is a much more effective predictor of runs than the variables above: its R² is 0.935, compared with 0.656 for bat_avg. I don't follow baseball and I have no idea what new_obs stands for.
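
Comparing predictors by R² can be done in one pass rather than one summary() at a time. The following is a minimal sketch: r2_by_predictor is a hypothetical helper written here, and R's built-in mtcars data stands in for mlb11 so that the example runs without a download.

```r
# Fit response ~ predictor for each predictor and return the R-squared values
# (r2_by_predictor is a hypothetical helper, not part of the lab)
r2_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Stand-in example with R's built-in mtcars data
r2 <- r2_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))
sort(r2, decreasing = TRUE)
```

With mlb11 loaded, the same helper could be called as r2_by_predictor(mlb11, "runs", c("at_bats", "homeruns", "bat_avg", "new_obs")) to line up the four models fit in this lab.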

Exercise 14:

plot(mlb11$runs ~ mlb11$new_obs)
abline(m4)

hist(m4$residuals)

qqnorm(m4$residuals)
qqline(m4$residuals)

plot(m4$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
