date()
## [1] "Tue Oct 29 20:17:24 2013"
Due Date: October 29, 2013
Total Points: 30
1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet) and volume of timber from 31 felled Black Cherry trees. Suppose you want to be able to predict tree volume based on girth.
a) Create a scatter plot of the data and label the axes.
lmTrees = lm(Volume ~ Girth, data = trees)
plot(trees$Volume ~ trees$Girth, pch = 16, xlab = "Tree Diameter (in)", ylab = "Timber Volume (cubic ft)")
b) Add a linear regression line to the plot.
plot(trees$Volume ~ trees$Girth, pch = 16, xlab = "Tree Diameter (in)", ylab = "Timber Volume (cubic ft)")
abline(lmTrees)
c) Determine the sum of squared residuals.
sum(residuals(lmTrees)^2)
## [1] 524.3
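As a cross-check on the sum above, `deviance()` applied to an ordinary `lm` fit returns the same residual sum of squares. A minimal sketch, refitting the model so the block runs on its own (the names `rss_manual` and `rss_deviance` are illustrative only):

```r
# Refit the girth-only model on the built-in trees data
lmTrees <- lm(Volume ~ Girth, data = trees)

# Residual sum of squares computed two ways
rss_manual   <- sum(residuals(lmTrees)^2)
rss_deviance <- deviance(lmTrees)  # for lm objects this is the RSS

# TRUE when the two calculations agree
all.equal(rss_manual, rss_deviance)
```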
d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?
g2 = trees$Girth^2
lmTrees2 = lm(Volume ~ g2, data = trees)
plot(trees$Volume ~ g2, pch = 16, xlab = "Girth Squared (in^2)", ylab = "Timber Volume (cubic ft)")
abline(lmTrees2)
sum(residuals(lmTrees2)^2)
## [1] 329.3
I prefer the second linear regression model (lmTrees2). The RSS for lmTrees is 524.3025, whereas the RSS for lmTrees2 is 329.3191. Both models use a single explanatory variable and the same response, so the smaller RSS indicates that the second model fits the data better.
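The comparison can also be tabulated side by side. The sketch below is one possible way to do it, not part of the original answer: it refits both models, using `I(Girth^2)` inside the formula instead of a separate `g2` vector, and reports RSS alongside R-squared. The names `fit_girth`, `fit_girth2`, and `comparison` are hypothetical.

```r
# Fit both candidate models on the built-in trees data
fit_girth  <- lm(Volume ~ Girth, data = trees)
fit_girth2 <- lm(Volume ~ I(Girth^2), data = trees)  # I() squares girth inside the formula

# Collect RSS and R-squared for each model
comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  rss       = c(deviance(fit_girth), deviance(fit_girth2)),
  r_squared = c(summary(fit_girth)$r.squared, summary(fit_girth2)$r.squared)
)
print(comparison)
```

Because both models share the same response, the model with the lower RSS necessarily has the higher R-squared.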
2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.
a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.
require(HSAUR2)
## Loading required package: HSAUR2
## Warning: package 'HSAUR2' was built under R version 3.0.2
## Loading required package: tools
lmMort = lm(mortality ~ latitude, data = USmelanoma)
plot(USmelanoma$mortality ~ USmelanoma$latitude, pch = 16, xlab = "Latitude",
ylab = "Mortality Rate")
b) Add the linear regression line to your scatter plot.
plot(USmelanoma$mortality ~ USmelanoma$latitude, pch = 16, xlab = "Latitude",
ylab = "Mortality Rate")
abline(lmMort)
c) Regress mortality on latitude and interpret the value of the slope coefficient.
lmMort = lm(mortality ~ latitude, data = USmelanoma)
lmMort
##
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
##
## Coefficients:
## (Intercept)     latitude
##      389.19        -5.98
The slope coefficient indicates that the mortality rate decreases by about 5.98 deaths per one million inhabitants for each one-degree increase in latitude.
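One way to see this interpretation concretely is to predict mortality at two latitudes one degree apart; the difference between the predictions equals the slope. This is a sketch that assumes the HSAUR2 package is installed. The latitudes 35 and 36 are arbitrary illustrative values, and `confint()` is added only to show the uncertainty around the slope.

```r
library(HSAUR2)  # provides the USmelanoma data set

lmMort <- lm(mortality ~ latitude, data = USmelanoma)
slope  <- coef(lmMort)[["latitude"]]

# Predicted mortality at two hypothetical latitudes one degree apart
new_lat <- data.frame(latitude = c(35, 36))
pred    <- predict(lmMort, newdata = new_lat)

# The change in predicted mortality per degree equals the slope
unname(diff(pred))
slope

# 95% confidence interval for the slope
confint(lmMort, "latitude")
```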
d) Determine the sum of squared errors.
sum(residuals(lmMort)^2)
## [1] 17173
e) Examine the model assumptions. What do you conclude about them?
require(ggplot2)
## Loading required package: ggplot2
lmMort.df = fortify(lmMort)
ggplot(lmMort.df, aes(x = cut(latitude, breaks = 5), y = mortality)) + geom_boxplot() +
xlab("Latitude") + ylab("Mortality Rate")
ggplot(lmMort.df, aes(x = latitude, y = mortality)) + geom_point() + geom_smooth(se = FALSE) +
geom_smooth(method = lm, color = "red", se = FALSE)
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
ggplot(lmMort.df, aes(x = .fitted, y = .stdresid)) + geom_point() + geom_smooth() +
geom_hline(yintercept = 0)
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
require(sm)
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.0.2
## Package `sm', version 2.2-5: type help(sm) for summary information
res = residuals(lmMort)
hist(res)
plot(density(res))
sm.density(res, xlab = "Model Residuals", model = "Normal")
qqnorm(lmMort$residuals)
qqline(lmMort$residuals)
The box plots show evidence against the model assumptions: the interquartile range of mortality decreases as you move north, which calls constant variance into question. The local regression (loess) curves challenge the assumptions of linearity and constant variance. The histogram suggests that the residuals are not exactly normally distributed; they are slightly skewed to the right. The density plot, however, is close to a normal density, and the sm.density plot of the model residuals indicates that they fall within the normal reference band. The staircase pattern in the normal Q-Q plot indicates that the data may be discrete.
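The same assumptions can also be checked with base R alone. The sketch below is an optional alternative to the ggplot2 and sm plots above, assuming the HSAUR2 package is installed. The Shapiro-Wilk test is an addition that is not part of the original analysis, and `sw` is an illustrative name.

```r
library(HSAUR2)  # provides the USmelanoma data set

lmMort <- lm(mortality ~ latitude, data = USmelanoma)

# Built-in diagnostic plots for lm objects, side by side
op <- par(mfrow = c(1, 2))
plot(lmMort, which = 1)  # residuals vs fitted: linearity and constant variance
plot(lmMort, which = 2)  # normal Q-Q plot: normality of residuals
par(op)                  # restore the previous plotting layout

# Optional formal check of residual normality
sw <- shapiro.test(residuals(lmMort))
sw
```

A small p-value from the test would count as evidence against normal residuals. With a sample this small, it is best read alongside the plots, not in place of them.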