date()
## [1] "Tue Oct 29 12:37:07 2013"
Due Date: October 29, 2013
Total Points: 30
1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet) and volume (cubic feet) of timber from 31 felled Black Cherry trees. Suppose you want to be able to predict tree volume based on girth.
a) Create a scatter plot of the data and label the axes.
require(datasets)
require(ggplot2)
## Loading required package: ggplot2
p = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Girth (inches)") +
ylab("Volume (cubic feet)")
p
b) Add a linear regression line to the plot.
p + geom_smooth(method = lm, se = FALSE)
c) Determine the sum of squared residuals.
model = lm(Volume ~ Girth, data = trees)
SSE = sum(resid(model)^2)
SSE
## [1] 524.3
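As a cross-check on the sum of squared residuals, the sketch below computes it two ways. It assumes the same `trees` data; `fit` is only an illustrative name. For an `lm` object, `deviance()` returns the residual sum of squares, so the two values should agree.

```r
# Refit the girth model under an illustrative name
fit <- lm(Volume ~ Girth, data = trees)

# Sum of squared residuals computed by hand
sse_manual <- sum(resid(fit)^2)

# Same quantity via deviance(), which for lm is the residual sum of squares
sse_deviance <- deviance(fit)

all.equal(sse_manual, sse_deviance)
```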
d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?
Girth2 = trees$Girth^2
p2 = ggplot(trees, aes(x = Girth2, y = Volume)) + geom_point() + xlab("Girth squared (square inches)") +
ylab("Volume (cubic feet)")
p2
p2 + geom_smooth(method = lm, se = FALSE)
model = lm(Volume ~ Girth2, data = trees)
SSE2 = sum(resid(model)^2)
SSE2
## [1] 329.3
I prefer the squared-girth model (the second one). Its scatter plot looks more linear, and its sum of squared residuals is smaller (329.3 vs. 524.3). Both models use the same response variable and the same number of parameters, so the two SSE values are directly comparable.
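One way to put the two fits side by side is sketched below. It is an alternative formulation, not the code used above: it writes the squared term directly in the formula with `I(Girth^2)` instead of creating a separate `Girth2` vector. The names `fit_lin`, `fit_sq`, and `comparison` are illustrative.

```r
# Fit both models on the built-in trees data
fit_lin <- lm(Volume ~ Girth, data = trees)
fit_sq  <- lm(Volume ~ I(Girth^2), data = trees)  # I() protects ^ inside a formula

# Collect SSE and R-squared for each model in one table
comparison <- data.frame(
  model = c("Girth", "Girth^2"),
  SSE   = c(deviance(fit_lin), deviance(fit_sq)),
  R2    = c(summary(fit_lin)$r.squared, summary(fit_sq)$r.squared)
)
comparison
```

A lower SSE together with a higher R-squared for the second row would support the preference stated above.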
2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.
a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.
require(HSAUR2)
## Loading required package: HSAUR2
## Loading required package: tools
m = ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point() + xlab("Latitude") +
ylab("Mortality")
m
b) Add the linear regression line to your scatter plot.
m + geom_smooth(method = lm, se = FALSE)
c) Regress mortality on latitude and interpret the value of the slope coefficient.
model = lm(mortality ~ latitude, data = USmelanoma)
summary(model)
##
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.97 -13.18 0.97 12.01 43.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.189 23.812 16.34 < 2e-16 ***
## latitude -5.978 0.598 -9.99 3.3e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.1 on 47 degrees of freedom
## Multiple R-squared: 0.68, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.31e-13
The model is statistically significant (F = 99.8 on 1 and 47 degrees of freedom, p < 0.001), with latitude explaining 68% of the variation in mortality.
For each 1-degree increase in latitude, predicted melanoma mortality decreases by about 6 deaths per one million inhabitants, on average.
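The slope interpretation can be checked with a short sketch, assuming the HSAUR2 package is installed. The latitudes 35 and 36 are arbitrary illustrative values; any two latitudes one degree apart should give the same difference. The names `fit`, `new_lat`, and `pred` are illustrative.

```r
# Load the data without attaching the whole package
data("USmelanoma", package = "HSAUR2")
fit <- lm(mortality ~ latitude, data = USmelanoma)

# Predicted mortality at two latitudes one degree apart (arbitrary values)
new_lat <- data.frame(latitude = c(35, 36))
pred <- predict(fit, newdata = new_lat)

# The difference in predictions equals the slope coefficient
slope_from_pred <- unname(diff(pred))
slope_from_coef <- unname(coef(fit)["latitude"])
all.equal(slope_from_pred, slope_from_coef)

# 95% confidence interval for the slope
confint(fit, "latitude")
```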
d) Determine the sum of squared errors.
model = lm(mortality ~ latitude, data = USmelanoma)
SSE3 = sum(resid(model)^2)
SSE3
## [1] 17173
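The SSE ties back to the residual standard error in the regression summary, which is the square root of SSE divided by the residual degrees of freedom. The sketch below uses the rounded values printed above (17173 and 47), so it should only match the reported 19.1 approximately. The variable names are illustrative.

```r
# Values copied from the output above (rounded)
sse <- 17173      # sum of squared errors
df_resid <- 47    # residual degrees of freedom (49 states minus 2 coefficients)

# Residual standard error = sqrt(SSE / residual df)
rse <- sqrt(sse / df_resid)
round(rse, 1)
```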
e) Examine the model assumptions. What do you conclude about them?
require(sm)
## Loading required package: sm
## Package `sm', version 2.2-5: type help(sm) for summary information
ggplot(USmelanoma, aes(x = cut(latitude, breaks = 5), y = mortality)) + geom_boxplot() +
xlab("Latitude") + ylab("Mortality")
res = residuals(model)
sm.density(res, model = "Normal")
qqnorm(res)
qqline(res)
The model shows some departures from linearity and constant variance. The box plots follow a roughly linear pattern, sloping downward as latitude increases. The boxes are not all the same size (the 2nd and 3rd are larger), which argues against constant variance. As for normality, the black density curve falls within the blue reference band, so there is no reason to doubt the assumption of normally distributed residuals. The Q-Q plot appears to show residuals that are small in magnitude.
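As a complement to the plots above, a residuals-versus-fitted plot and a Shapiro-Wilk test give another look at the same assumptions. This is a minimal sketch using base R only, assuming the HSAUR2 package is installed; it is not part of the original analysis, and `fit` and `sw` are illustrative names.

```r
# Refit the model so the block is self-contained
data("USmelanoma", package = "HSAUR2")
fit <- lm(mortality ~ latitude, data = USmelanoma)

# Residuals vs. fitted values: look for curvature (linearity)
# or a funnel shape (non-constant variance)
plot(fitted(fit), resid(fit),
     xlab = "Fitted mortality", ylab = "Residual")
abline(h = 0, lty = 2)

# Shapiro-Wilk test of normality on the residuals;
# a small p-value would be evidence against normality
sw <- shapiro.test(resid(fit))
sw$p.value
```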