date()
## [1] "Tue Oct 29 18:53:59 2013"
Due Date: October 29, 2013
Total Points: 30
1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet) and volume of timber (cubic feet) from 31 felled Black Cherry trees. Suppose you want to be able to predict tree volume based on girth.
a) Create a scatter plot of the data and label the axes.
require(ggplot2)
## Loading required package: ggplot2
require(datasets)
p = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point(col = "#00FF00") +
xlab("Girth (in)") + ylab("Volume (cubic ft)") + theme_bw()
p
b) Add a linear regression line to the plot.
p = p + geom_smooth(method = lm, se = FALSE, col = "red")
p
c) Determine the sum of squared residuals.
p = p + geom_segment(aes(y = predict(lm(Volume ~ Girth, data = trees)), yend = Volume,
x = Girth, xend = Girth), col = "gold")
p
sum(residuals(lm(trees$Volume ~ trees$Girth))^2)
## [1] 524.3
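As a cross-check on the value above, the same quantity can be obtained in a few equivalent ways. The sketch below assumes the fit is stored in a variable named `fit1` (an illustrative name, not used in the original answer) and compares the residual-based sum with a manual calculation and with `deviance()`, which for an `lm` object returns the residual sum of squares.

```r
# Fit volume on girth once and reuse the model object (fit1 is an illustrative name)
fit1 <- lm(Volume ~ Girth, data = trees)

# Sum of squared residuals computed three equivalent ways
ssr_resid    <- sum(residuals(fit1)^2)
ssr_manual   <- sum((trees$Volume - fitted(fit1))^2)
ssr_deviance <- deviance(fit1)  # for lm objects this is the residual sum of squares

round(c(ssr_resid, ssr_manual, ssr_deviance), 1)
```

Storing the fitted model once avoids refitting it inside every call.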
d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?
g = (trees$Girth^2)
g
## [1] 68.89 73.96 77.44 110.25 114.49 116.64 121.00 121.00 123.21 125.44
## [11] 127.69 129.96 129.96 136.89 144.00 166.41 166.41 176.89 187.69 190.44
## [21] 196.00 201.64 210.25 256.00 265.69 299.29 306.25 320.41 324.00 324.00
## [31] 424.36
p2 = ggplot(trees, aes(x = g, y = Volume)) + geom_point(col = "#8B0000") + xlab("Girth Squared (sq in)") +
ylab("Volume (cubic ft)") + theme_bw()
p2
p2 = p2 + geom_smooth(method = lm, se = FALSE, col = "green")
p2
p2 = p2 + geom_segment(aes(y = predict(lm(Volume ~ g)), yend = Volume, x = g,
xend = g), col = "#FF6347")
p2
sum(residuals(lm(trees$Volume ~ g))^2)
## [1] 329.3
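The second model is preferred because its sum of squared residuals is smaller (329.3 versus 524.3), so the fitted line lies closer to the observed volumes. This also makes physical sense: the volume of a trunk scales with its cross-sectional area, which is proportional to the square of the girth.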
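The comparison between the two models can also be laid out side by side. This is a minimal sketch, assuming the squared term is written inside the formula with `I(Girth^2)` rather than through the separate vector `g`; the names `fit_linear`, `fit_squared`, and `comparison` are illustrative.

```r
# Both candidate models, fitted from the data frame so no global vector is needed
fit_linear  <- lm(Volume ~ Girth, data = trees)
fit_squared <- lm(Volume ~ I(Girth^2), data = trees)  # I() keeps ^ as arithmetic

# Collect the sum of squared residuals and R-squared for each model
comparison <- data.frame(
  model     = c("Girth", "Girth squared"),
  ssr       = c(deviance(fit_linear), deviance(fit_squared)),
  r_squared = c(summary(fit_linear)$r.squared, summary(fit_squared)$r.squared)
)
comparison
```

Because both models have the same response and the same number of coefficients, the one with the smaller sum of squared residuals also has the larger R-squared.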
2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.
a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.
require(HSAUR2)
## Loading required package: HSAUR2
## Loading required package: tools
m = USmelanoma
pm = ggplot(m, aes(x = latitude, y = mortality)) + geom_point(col = "tomato") +
xlab("Latitude (degrees)") + ylab("Mortality (per million)") + theme_bw()
pm
b) Add the linear regression line to your scatter plot.
pm = pm + geom_smooth(method = lm, se = FALSE, col = "blue")
pm
c) Regress mortality on latitude and interpret the value of the slope coefficient.
model = lm(mortality ~ latitude, data = m)
summary(model)
##
## Call:
## lm(formula = mortality ~ latitude, data = m)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -38.97 -13.18   0.97  12.01  43.94
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  389.189     23.812   16.34  < 2e-16 ***
## latitude      -5.978      0.598   -9.99  3.3e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.1 on 47 degrees of freedom
## Multiple R-squared: 0.68, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.31e-13
The slope is negative (-5.98): each one-degree increase in latitude is associated with about 6 fewer melanoma deaths per million inhabitants, so mortality decreases as you move north, away from the equator.
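To make the slope interpretation concrete, the fitted line can be evaluated at two latitudes. This sketch copies the coefficients from the summary output above instead of refitting the model; the helper `predict_mortality` and the latitudes 30 and 40 are hypothetical choices for illustration.

```r
# Coefficients copied from the summary output above
intercept <- 389.189
slope     <- -5.978

# Fitted line: expected mortality (per million) at a given latitude
predict_mortality <- function(latitude) intercept + slope * latitude

# Two hypothetical latitudes, 10 degrees apart
pred <- predict_mortality(c(30, 40))
pred
diff(pred)  # equals 10 * slope
```

With the model object available, `predict(model, newdata = data.frame(latitude = c(30, 40)))` should give the same values up to rounding of the coefficients.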
d) Determine the sum of squared errors.
SSE = sum(resid(model)^2)
SSE
## [1] 17173
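The sum of squared errors ties back to the residual standard error reported in the regression summary, which is the square root of SSE divided by the residual degrees of freedom. The sketch below uses the rounded values printed in the output, so the names `sse`, `df_resid`, and `rse` are illustrative and the result is only approximate.

```r
# Values reported in the output above
sse      <- 17173  # sum of squared errors
df_resid <- 47     # residual degrees of freedom

# Residual standard error is sqrt(SSE / residual df)
rse <- sqrt(sse / df_resid)
round(rse, 1)
```

The result agrees with the 19.1 shown in the summary, which is a quick sanity check that the SSE was computed from the same model.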
e) Examine the model assumptions. What do you conclude about them?
require(sm)
## Loading required package: sm
## Package `sm', version 2.2-5: type help(sm) for summary information
ggplot(m, aes(x = cut(latitude, 5), y = mortality)) + geom_boxplot() + xlab("Latitude (Degrees)") +
ylab("Mortality Rate Per 1 Million")
res = residuals(model)
sm.density(res, model = "Normal")
qqnorm(res)
qqline(res)
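Based on the box plots, the residual density estimate, and the normal Q-Q plot above, I believe the model assumptions (linearity, constant variance, and normally distributed residuals) are adequately met; the plots give no evidence against them.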
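The graphical checks can be supplemented with a few numeric summaries. This is a rough sketch, not part of the original answer: `check_assumptions` is a hypothetical helper, the Shapiro-Wilk test is one of several possible normality tests, and the correlation between absolute residuals and fitted values is only a crude indicator of changing spread. It is illustrated on the built-in trees data so it runs without HSAUR2.

```r
# Hypothetical helper: collects a few numeric diagnostics for a fitted lm object
check_assumptions <- function(fit) {
  res <- residuals(fit)
  list(
    mean_residual = mean(res),                  # about 0 for least squares with an intercept
    shapiro_p     = shapiro.test(res)$p.value,  # small values suggest non-normal residuals
    spread_cor    = cor(abs(res), fitted(fit))  # crude check for non-constant variance
  )
}

# Illustrated on the built-in trees data
diag_trees <- check_assumptions(lm(Volume ~ Girth, data = trees))
str(diag_trees)
```

For the melanoma regression, the same helper could be called as `check_assumptions(model)`.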