Lab 9

Name: Sonora Williams

Section: 01l

Date: November 12, 2013

Exercises

Load data & inference function:

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1:

I would use a scatterplot to display the relationship between these two variables. The resulting plot shows a positive trend, but the points are widely scattered rather than tightly linear, so I would not feel very comfortable using at bats to predict the number of runs. The correlation coefficient is only 0.6106, which indicates a moderate positive correlation.

plot(mlb11$at_bats, mlb11$runs)

[Plot: scatterplot of runs (y) against at_bats (x)]

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.6106
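
As a reminder of what cor() reports, the following sketch computes Pearson's r by hand and compares it with the built-in function. The vectors x and y are made-up illustrative values, not taken from mlb11.

```r
# Illustrative data only; these values are NOT from mlb11.
x <- c(5400, 5450, 5500, 5550, 5600, 5650)
y <- c(640, 700, 660, 720, 710, 760)

# Pearson's r: sum of cross-products of deviations from the means,
# scaled by (n - 1) * sd(x) * sd(y)
n <- length(x)
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in version for comparison
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

The two values should agree up to floating-point rounding, and the order of the arguments to cor() does not matter.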

Exercise 2:

With a correlation coefficient of 0.6106, the relationship is neither very weak nor very strong; it is moderate. The positive sign of the coefficient shows that the direction of the relationship is positive: teams with more at bats tend to score more runs. The form is more linear than curvilinear.

Exercise 3:

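The smallest sum of squares I got by picking my own line was 143312. My lab mates found smaller sums of squares, ranging from 139853 to 142948. The output below reports 123722; its coefficients match the least-squares fit in Exercise 4, so this is the minimum possible sum of squares, and no hand-picked line can do better.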

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

[Plot: plot_ss output, runs against at_bats with the fitted line]

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

[Plot: plot_ss output, runs against at_bats with the fitted line and squared residuals shown]

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722
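
To illustrate what the sum of squares measures, here is a minimal sketch that compares a hand-picked line with the least-squares line. The data and the helper sum_sq() are hypothetical and made up for this example; they are not part of the lab's plot_ss() function or the mlb11 data.

```r
# Illustrative data only; these values are NOT from mlb11.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.3, 3.8, 6.1, 8.2, 9.7, 12.4)

# Hypothetical helper: sum of squared residuals for the line
# y = intercept + slope * x
sum_sq <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# A hand-picked line, standing in for a line drawn by eye
ss_guess <- sum_sq(intercept = 0, slope = 2.1, x = x, y = y)

# The least-squares line from lm()
fit <- lm(y ~ x)
ss_fit <- sum_sq(coef(fit)[1], coef(fit)[2], x = x, y = y)

print(c(guess = ss_guess, least_squares = ss_fit))
```

Whatever intercept and slope are tried for the hand-picked line, its sum of squares cannot fall below the one from lm(), which is why the hand-picked lines in this exercise all come out larger.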

Exercise 4:

Equation of the regression line: predicted runs = -2789.2429 + 0.6305 × at_bats

m1 = lm(runs ~ at_bats, data = mlb11)
summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Exercise 5:

Equation of the regression line: predicted runs = 415.2389 + 1.8345 × homeruns

This model shows that a team's success, in terms of runs scored, has a fairly strong positive relationship with its number of home runs: each additional home run is associated with about 1.83 more runs on average. This is reasonable, as a home run always scores at least one run and often brings in extra runs on top.

m2 = lm(runs ~ homeruns, data = mlb11)
summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07
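
The meaning of the slope can be sketched with a little arithmetic. The coefficients below are copied from the regression equation for this exercise; the home run totals (150 and 160) and the helper predict_runs() are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the fitted equation for runs ~ homeruns
b0 <- 415.2389
b1 <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Two hypothetical teams whose home run totals differ by 10
hr <- c(150, 160)
pred <- predict_runs(hr)
print(pred)

# The gap between the two predictions equals 10 times the slope
print(diff(pred))
```

With the fitted model object available, predict(m2, newdata = data.frame(homeruns = c(150, 160))) would give the same kind of predictions from the unrounded coefficients.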

Exercise 6:

Predicted value: -2789.2429 + 0.6305 × 5579 = 728.3166 runs. Assuming the team is the Phillies, with 5579 at bats, the observed run count is 713. The residual is 713 - 728.32 = -15.32, so the model overestimates the Phillies' runs by about 15.

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

[Plot: scatterplot of runs against at_bats with the regression line from m1]

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

-2789.2429 + 0.6305 * 5579  # R needs an explicit * for multiplication

## [1] 728.3166
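
The prediction and its residual can be sketched as plain arithmetic. The coefficients, at-bat count, and observed runs are the values quoted in this exercise; the variable names are only illustrative. Note that R requires an explicit * for multiplication, since a number followed by parentheses is read as a function call.

```r
# Coefficients copied from the fitted equation for runs ~ at_bats
b0 <- -2789.2429
b1 <- 0.6305

at_bats  <- 5579  # at bats quoted in this exercise
observed <- 713   # observed runs quoted in this exercise

# Predicted runs from the regression line (explicit `*` is required)
predicted <- b0 + b1 * at_bats

# Residual = observed - predicted; a negative value means the
# model overestimates the observed count
residual <- observed - predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

A negative residual here is consistent with the model overestimating the observed run count.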

Exercise 7:

The residual plot seems to show homogeneous variance across values of at_bats.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

[Plot: residuals of m1 against at_bats, with a dashed horizontal line at 0]

Exercise 8:

The normal probability plot does not follow a perfectly straight line, but the residuals appear close enough to normal to meet the nearly normal residuals condition.

hist(m1$residuals)

[Plot: histogram of the residuals of m1]

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

[Plot: normal probability plot of the residuals of m1 with reference line]

Exercise 9:

The constant variability condition appears to be met: the spread of the residuals does not change noticeably across values of at_bats.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

[Plot: residuals of m1 against at_bats, with a dashed horizontal line at 0]

Exercise 10:

There does appear to be a positive linear relationship between home runs and runs.

plot(mlb11$runs ~ mlb11$homeruns)
abline(m2)

[Plot: scatterplot of runs against homeruns with the regression line from m2]

summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 11:

The R-squared value for the at_bats relationship is 0.3729. The R-squared value for the homeruns relationship is 0.6266. Therefore homeruns predicts runs more accurately than at_bats, as its R-squared is closer to 1: it explains about 63% of the variability in runs, compared with about 37% for at_bats.
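
As a quick consistency check, in a simple linear regression R-squared equals the square of the correlation coefficient. The sketch below uses only numbers quoted in this lab; the variable names are illustrative.

```r
# Values copied from this lab's output
r_at_bats   <- 0.6106  # correlation between runs and at_bats
r2_at_bats  <- 0.3729  # R-squared quoted for the at_bats model
r2_homeruns <- 0.6266  # R-squared quoted for the homeruns model

# Squaring the correlation should reproduce the at_bats R-squared
# (up to rounding of the quoted values)
print(round(r_at_bats^2, 3))

# The correlation implied for homeruns is the square root of its
# R-squared, taken as positive because the fitted slope is positive
r_homeruns <- sqrt(r2_homeruns)
print(round(r_homeruns, 3))
```

The implied correlation for homeruns is larger than the one for at_bats, which matches the conclusion that homeruns is the better single predictor of runs.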

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373,   Adjusted R-squared: 0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627,   Adjusted R-squared: 0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07