Problem Set # 4

Tyler Fricker

date()

## [1] "Tue Oct 29 12:37:07 2013"

Due Date: October 29, 2013

Total Points: 30

1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to be able to predict tree volume based on girth.

a) Create a scatter plot of the data and label the axes.

require(datasets)
require(ggplot2)

## Loading required package: ggplot2

p = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Girth (inches)") + 
    ylab("Volume (cubic feet)")
p

[Figure: scatter plot of Volume versus Girth]

b) Add a linear regression line to the plot.

p + geom_smooth(method = lm, se = FALSE)

[Figure: scatter plot of Volume versus Girth with the fitted regression line]

c) Determine the sum of squared residuals.

model = lm(Volume ~ Girth, data = trees)
SSE = sum(resid(model)^2)
SSE

## [1] 524.3
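
As a cross-check on the value above, here is a minimal sketch of two other ways to get the same sum of squared residuals from a fitted `lm` object. The names `model1`, `sse_manual`, `sse_deviance`, and `sse_anova` are illustrative and not part of the original solution.

```r
# Refit the girth model under a new name so nothing above is overwritten
model1 <- lm(Volume ~ Girth, data = trees)

# 1. Direct definition: sum of the squared residuals
sse_manual <- sum(resid(model1)^2)

# 2. For a linear model, deviance() returns the residual sum of squares
sse_deviance <- deviance(model1)

# 3. The "Residuals" row of the ANOVA table holds the same quantity
sse_anova <- anova(model1)["Residuals", "Sum Sq"]

# All three should agree up to floating-point error
all.equal(sse_manual, sse_deviance)
all.equal(sse_manual, sse_anova)
```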

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?

Girth2 = trees$Girth^2
p2 = ggplot(trees, aes(x = Girth2, y = Volume)) + geom_point() + xlab("Girth squared (square inches)") + 
    ylab("Volume (cubic feet)")
p2

[Figure: scatter plot of Volume versus squared Girth]

p2 + geom_smooth(method = lm, se = FALSE)

[Figure: scatter plot of Volume versus squared Girth with the fitted regression line]

model = lm(Volume ~ Girth2, data = trees)
SSE2 = sum(resid(model)^2)
SSE2

## [1] 329.3

I prefer the squared-girth model (the second one). The scatter plot against squared girth looks more linear, and the fit is better: its SSE is 329.3 versus 524.3 for the girth model. Both models estimate the same number of parameters, so the two SSE values are directly comparable.
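
The comparison can also be laid out side by side. The following is a sketch only: it uses `I(Girth^2)` inside the formula instead of a separate `Girth2` vector, and the names `fit_girth`, `fit_girth2`, and `comparison` are made up for this illustration.

```r
# Fit both candidate models; I() makes ^ mean arithmetic squaring in a formula
fit_girth  <- lm(Volume ~ Girth, data = trees)
fit_girth2 <- lm(Volume ~ I(Girth^2), data = trees)

# Collect SSE and R-squared for each model in one small table
comparison <- data.frame(
  model = c("Girth", "Girth^2"),
  SSE   = c(deviance(fit_girth), deviance(fit_girth2)),
  R2    = c(summary(fit_girth)$r.squared, summary(fit_girth2)$r.squared)
)
print(comparison)
```

A lower SSE and a higher R-squared in the second row would both point toward the squared-girth model.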

2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.

a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.

require(HSAUR2)

## Loading required package: HSAUR2
## Loading required package: tools

m = ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point() + xlab("Latitude") + 
    ylab("Mortality")
m

[Figure: scatter plot of Mortality versus Latitude]

b) Add the linear regression line to your scatter plot.

m + geom_smooth(method = lm, se = FALSE)

[Figure: scatter plot of Mortality versus Latitude with the fitted regression line]

c) Regress mortality on latitude and interpret the value of the slope coefficient.

model = lm(mortality ~ latitude, data = USmelanoma)
summary(model)

## 
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.97 -13.18   0.97  12.01  43.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  389.189     23.812   16.34  < 2e-16 ***
## latitude      -5.978      0.598   -9.99  3.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.1 on 47 degrees of freedom
## Multiple R-squared:  0.68,   Adjusted R-squared:  0.673 
## F-statistic: 99.8 on 1 and 47 DF,  p-value: 3.31e-13

The model is statistically significant with latitude explaining 68% of the variation in mortality counts.

For every 1 degree increase in latitude, male melanoma mortality decreases by about 6 deaths per one million inhabitants (slope = -5.978).
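
To make the slope concrete, here is a small worked example using the coefficients printed in the summary. The latitudes 30 and 40 are arbitrary values chosen for illustration, and `predict_mortality` is a hypothetical helper, not something from the original solution.

```r
# Coefficients copied from the printed summary() output
intercept <- 389.189
slope     <- -5.978

# Fitted line: expected mortality (per one million) at a given latitude
predict_mortality <- function(latitude) intercept + slope * latitude

# Two illustrative latitudes, 10 degrees apart
pred_30 <- predict_mortality(30)
pred_40 <- predict_mortality(40)

# The difference is 10 * slope, i.e. about 60 fewer deaths per million
pred_40 - pred_30
```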

d) Determine the sum of squared errors.

model = lm(mortality ~ latitude, data = USmelanoma)
SSE3 = sum(resid(model)^2)
SSE3

## [1] 17173
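
The SSE ties back to the residual standard error in the regression summary, which is the square root of SSE divided by the residual degrees of freedom. A quick check using the rounded values printed in this document (the variable names are illustrative):

```r
# Values copied from the output in this document
sse      <- 17173   # sum of squared errors
df_resid <- 47      # residual degrees of freedom (49 states - 2 coefficients)

# Residual standard error = sqrt(SSE / df)
rse <- sqrt(sse / df_resid)
round(rse, 1)   # about 19.1, matching the summary output
```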

e) Examine the model assumptions. What do you conclude about them?

require(sm)

## Loading required package: sm
## Package `sm', version 2.2-5: type help(sm) for summary information

ggplot(USmelanoma, aes(x = cut(latitude, breaks = 5), y = mortality)) + geom_boxplot() + 
    xlab("Latitude") + ylab("Mortality")

[Figure: boxplots of Mortality by Latitude interval]

res = residuals(model)
sm.density(res, model = "Normal")

qqnorm(res)
qqline(res)

[Figures: density plot of the residuals; normal Q-Q plot of the residuals]

The model shows some departures from linearity and constant variance. The boxplots fall in a roughly linear pattern, slanting downward as latitude increases, which supports linearity. The boxes are not all the same size (the 2nd and 3rd boxes are larger), which argues against constant variance. As for normality, the black density curve falls within the blue reference band, so we have no reason to doubt the assumption of normally distributed residuals. The Q-Q plot appears to show only small deviations of the residuals from the reference line.
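
The graphical checks could be backed up with a couple of numeric summaries. This is only a sketch: `check_assumptions` is a hypothetical helper written for this illustration, and it assumes a Shapiro-Wilk test and per-bin residual standard deviations are acceptable stand-ins for the plots. No results are shown because the original did not run it.

```r
# Hypothetical helper (not part of the original solution): numeric summaries
# to accompany the boxplot, density, and Q-Q plots.
check_assumptions <- function(model, x, bins = 5) {
  res <- residuals(model)
  list(
    # Shapiro-Wilk test: a small p-value would suggest non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Residual standard deviation within each interval of x; similar values
    # across intervals are consistent with constant variance
    sd_by_bin = tapply(res, cut(x, breaks = bins), sd)
  )
}

# Apply it to the melanoma model if the HSAUR2 package is installed
if (requireNamespace("HSAUR2", quietly = TRUE)) {
  data("USmelanoma", package = "HSAUR2")
  fit <- lm(mortality ~ latitude, data = USmelanoma)
  print(check_assumptions(fit, USmelanoma$latitude))
}
```

The five intervals match the `cut(latitude, breaks = 5)` grouping used for the boxplot, so the standard deviations line up with the box sizes.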