Problem Set # 4

Doug Kossert

date()
## [1] "Tue Oct 29 18:53:59 2013"

Due Date: October 29, 2013

Total Points: 30

1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet), and volume (cubic feet) of timber from 31 felled black cherry trees. Suppose you want to predict tree volume based on girth.

a) Create a scatter plot of the data and label the axes.

require(ggplot2)
## Loading required package: ggplot2
require(datasets)
p = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point(col = "#00FF00") + 
    xlab("Girth (in)") + ylab("Volume (cu ft)") + theme_bw()
p

[Figure: scatter plot of Volume versus Girth]

b) Add a linear regression line to the plot.

p = p + geom_smooth(method = lm, se = FALSE, col = "red")
p

[Figure: scatter plot of Volume versus Girth with the fitted regression line]

c) Determine the sum of squared residuals.

p = p + geom_segment(aes(y = predict(lm(Volume ~ Girth)), yend = Volume, x = Girth, 
    xend = Girth), col = "gold")
p

[Figure: scatter plot of Volume versus Girth with the regression line and residual segments]

sum(residuals(lm(trees$Volume ~ trees$Girth))^2)
## [1] 524.3
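
As a cross-check on the value above, here is a minimal sketch: for a model fitted with `lm`, the residual sum of squares can also be read off with `deviance()`. The object names `fit_girth`, `sse_direct`, and `sse_deviance` are introduced here for illustration only.

```r
# Fit volume on girth using the built-in trees data
fit_girth <- lm(Volume ~ Girth, data = trees)

# Sum of squared residuals computed directly from the residuals
sse_direct <- sum(residuals(fit_girth)^2)

# deviance() returns the residual sum of squares for lm fits
sse_deviance <- deviance(fit_girth)

# Both routes should agree
all.equal(sse_direct, sse_deviance)
round(sse_direct, 1)
```

Passing `data = trees` keeps the formula readable and avoids repeating `trees$` for every variable.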

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?

g = (trees$Girth^2)
g
##  [1]  68.89  73.96  77.44 110.25 114.49 116.64 121.00 121.00 123.21 125.44
## [11] 127.69 129.96 129.96 136.89 144.00 166.41 166.41 176.89 187.69 190.44
## [21] 196.00 201.64 210.25 256.00 265.69 299.29 306.25 320.41 324.00 324.00
## [31] 424.36
p2 = ggplot(trees, aes(x = g, y = Volume)) + geom_point(col = "#8B0000") + xlab("Girth Squared (sq in)") + 
    ylab("Volume (cu ft)") + theme_bw()
p2

[Figure: scatter plot of Volume versus Girth squared]

p2 = p2 + geom_smooth(method = lm, se = FALSE, col = "green")
p2

[Figure: scatter plot of Volume versus Girth squared with the fitted regression line]

p2 = p2 + geom_segment(aes(y = predict(lm(Volume ~ g)), yend = Volume, x = g, 
    xend = g), col = "#FF6347")
p2

[Figure: scatter plot of Volume versus Girth squared with the regression line and residual segments]

sum(residuals(lm(trees$Volume ~ g))^2)
## [1] 329.3

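The second model, which uses girth squared, is preferred because its sum of squared residuals is smaller (329.3 versus 524.3), so it fits the data more closely with the same number of parameters.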
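
The comparison in part d) can be laid out side by side. The following is a sketch, not part of the original answer: it refits both models with `data = trees`, uses `I(Girth^2)` in place of the separate vector `g`, and adds R-squared as a second yardstick. The names `fit_girth`, `fit_girth_sq`, and `comparison` are illustrative.

```r
# Model 1: volume on girth; Model 2: volume on girth squared
fit_girth    <- lm(Volume ~ Girth, data = trees)
fit_girth_sq <- lm(Volume ~ I(Girth^2), data = trees)

# Collect the residual sum of squares and R-squared for each model
comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  sse       = c(deviance(fit_girth), deviance(fit_girth_sq)),
  r_squared = c(summary(fit_girth)$r.squared,
                summary(fit_girth_sq)$r.squared)
)
comparison
```

Because both models have the same response and the same number of coefficients, the one with the smaller sum of squared residuals also has the larger R-squared.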

2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.

a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.

require(HSAUR2)
## Loading required package: HSAUR2
## Loading required package: tools
m = USmelanoma
pm = ggplot(m, aes(x = latitude, y = mortality)) + geom_point(col = "tomato") + 
    xlab("Latitude (degrees)") + ylab("Mortality (per million)") + theme_bw()
pm

[Figure: scatter plot of mortality versus latitude]

b) Add the linear regression line to your scatter plot.

pm = pm + geom_smooth(method = lm, se = FALSE, col = "blue")
pm

[Figure: scatter plot of mortality versus latitude with the fitted regression line]

c) Regress mortality on latitude and interpret the value of the slope coefficient.

model = lm(mortality ~ latitude, data = m)
summary(model)
## 
## Call:
## lm(formula = mortality ~ latitude, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.97 -13.18   0.97  12.01  43.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  389.189     23.812   16.34  < 2e-16 ***
## latitude      -5.978      0.598   -9.99  3.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.1 on 47 degrees of freedom
## Multiple R-squared:  0.68,   Adjusted R-squared:  0.673 
## F-statistic: 99.8 on 1 and 47 DF,  p-value: 3.31e-13

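The slope is negative (-5.98): each additional degree of latitude is associated with about 6 fewer melanoma deaths per million inhabitants, on average. Among the US states, mortality therefore decreases as you move north, away from the equator.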
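
To make the slope concrete, here is a small sketch that plugs the rounded coefficients from the summary output into the fitted line. The helper `predict_mortality` and the latitudes used are illustrative choices, not part of the original analysis.

```r
# Coefficients as reported in the summary output above (rounded)
intercept <- 389.189
slope     <- -5.978

# Fitted mortality (deaths per million) at a given latitude
predict_mortality <- function(latitude) intercept + slope * latitude

# A one-degree step north changes the fitted value by exactly the slope
predict_mortality(41) - predict_mortality(40)

# A ten-degree step north changes it by ten times the slope
predict_mortality(45) - predict_mortality(35)
```

With the fitted object available, `predict(model, newdata = data.frame(latitude = c(40, 41)))` gives the same comparison using the unrounded coefficients.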

d) Determine the sum of squared errors.

SSE = sum(resid(model)^2)
SSE
## [1] 17173
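
As a consistency check, the sum of squared errors should agree with the residual standard error in the summary output, since the residual standard error is the square root of the SSE divided by the residual degrees of freedom. A minimal sketch using the reported (rounded) numbers, with illustrative variable names:

```r
sse      <- 17173  # sum of squared errors reported above
df_resid <- 47     # residual degrees of freedom from the summary output

# Residual standard error implied by the SSE
rse <- sqrt(sse / df_resid)
round(rse, 1)
```

The result can be compared with the "Residual standard error" line of the summary output.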

e) Examine the model assumptions. What do you conclude about them?

require(sm)
## Loading required package: sm
## Package `sm', version 2.2-5: type help(sm) for summary information
ggplot(m, aes(x = cut(latitude, 5), y = mortality)) + geom_boxplot() + xlab("Latitude (degrees)") + 
    ylab("Mortality (per million)")

[Figure: box plots of mortality by latitude interval]

res = residuals(model)
sm.density(res, model = "Normal")

[Figure: density estimate of the residuals with a normal reference band]

qqnorm(res)
qqline(res)

[Figure: normal Q-Q plot of the residuals]

Based on the graphs above, I believe that the model assumptions (linearity, constant variance, and normally distributed residuals) are satisfied; there is no evidence to reject them.
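
The graphical checks could be supplemented with rough numeric summaries. The sketch below is an optional addition, not part of the original answer: `check_residuals` is a hypothetical helper, and it is demonstrated on the built-in trees data so that it runs without extra packages.

```r
# Hypothetical helper: numeric companions to the diagnostic plots
check_residuals <- function(fit) {
  res <- residuals(fit)
  list(
    # Shapiro-Wilk test of normality; a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Crude spread check: correlation between absolute residuals and fitted values
    spread_cor = cor(abs(res), fitted(fit))
  )
}

# Stand-in example on the trees data from problem 1
check_residuals(lm(Volume ~ Girth, data = trees))
```

For the melanoma regression, the same helper would be called as `check_residuals(model)`. These numbers are only a supplement to the plots, not a replacement for them.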