Problem Set # 4

Ryan Truchelut

date()
## [1] "Tue Oct 29 11:15:20 2013"

Due Date: October 29, 2013

Total Points: 30

1 The data set trees (datasets package, part of base install) contains the girth (inches), height (feet), and volume (cubic feet) of timber from 31 felled Black Cherry trees. Suppose you want to be able to predict tree volume based on girth.

a) Create a scatter plot of the data and label the axes.

require(ggplot2)
## Loading required package: ggplot2
plot1 = ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() + xlab("Tree Diameter (in)") + 
    ylab("Timber Volume (ft^3)")
plot1

plot of chunk plottrees

b) Add a linear regression line to the plot.

plot1 + geom_smooth(method = lm, se = FALSE)

plot of chunk plottrees2

c) Determine the sum of squared residuals.

model1 = lm(Volume ~ Girth, data = trees)
sum(residuals(model1)^2)
## [1] 524.3
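
As a cross-check on the value above, the same quantity can be recovered in two other ways: directly from the definition (observed minus fitted values) and from deviance(), which for an lm fit returns the residual sum of squares. This is a minimal sketch; the names fit, ssr_resid, ssr_manual, and ssr_dev are illustrative and not part of the original solution.

```r
# Refit the girth-only model from part (c); `fit` is an illustrative name.
fit <- lm(Volume ~ Girth, data = trees)

# 1. Squared residuals, as in the solution above.
ssr_resid <- sum(residuals(fit)^2)

# 2. From the definition: observed values minus fitted values.
ssr_manual <- sum((trees$Volume - fitted(fit))^2)

# 3. deviance() of an lm fit is the residual sum of squares.
ssr_dev <- deviance(fit)

print(c(residuals = ssr_resid, manual = ssr_manual, deviance = ssr_dev))
```

All three entries should agree up to floating-point error.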

d) Repeat a, b, and c but use the square of the girth instead of girth as the explanatory variable. Which model do you prefer and why?


ggplot(trees, aes(x = Girth^2, y = Volume)) + geom_point() + geom_smooth(method = lm, 
    se = FALSE) + xlab("Tree Diameter^2 (in^2)") + ylab("Timber Volume (ft^3)")

plot of chunk modeltrees2

# I() makes ^ mean arithmetic squaring inside the model formula
model2 = lm(Volume ~ I(Girth^2), data = trees)
sum(residuals(model2)^2)
## [1] 329.3
"The model based on the square of the girth is preferred. Statistically, it has the lower sum of squared residuals (329.3 versus 524.3), and because both models have the same number of coefficients the comparison is fair. Physically, the volume of a cylinder is proportional to the square of its diameter (times its height), so volume should be roughly linear in girth squared, which is proportional to the cross-sectional area, rather than in girth itself."
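
The comparison in part (d) can be put side by side in one table. The sketch below refits both models and reports the sum of squared residuals along with R-squared and AIC, two additional summaries that are not used in the original answer. The object names fit_girth, fit_girth2, and comparison are illustrative.

```r
# Fit both candidate models on the same data; names are illustrative.
fit_girth  <- lm(Volume ~ Girth, data = trees)
fit_girth2 <- lm(Volume ~ I(Girth^2), data = trees)

# Collect comparable summaries for each model.
comparison <- data.frame(
  model     = c("Girth", "Girth^2"),
  ssr       = c(deviance(fit_girth), deviance(fit_girth2)),
  r_squared = c(summary(fit_girth)$r.squared,
                summary(fit_girth2)$r.squared),
  aic       = c(AIC(fit_girth), AIC(fit_girth2))
)
print(comparison)
```

Because both models have one slope and one intercept and are fit to the same 31 trees, the model with the lower sum of squared residuals also has the higher R-squared and the lower AIC.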

2 The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state.

a) Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.

require(ggplot2)
require(HSAUR2)
## Loading required package: HSAUR2
## Loading required package: tools
plot1 = ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point() + 
    xlab("centroid latitude of state (deg N)") + ylab("melanoma mortality rate per 1000000 inhabitants")
plot1

plot of chunk plotmelanoma

b) Add the linear regression line to your scatter plot.

plot1 + geom_smooth(method = lm, se = FALSE)

plot of chunk plotmelanoma2

c) Regress mortality on latitude and interpret the value of the slope coefficient.

model1 = lm(mortality ~ latitude, data = USmelanoma)
summary(model1)
## 
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.97 -13.18   0.97  12.01  43.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  389.189     23.812   16.34  < 2e-16 ***
## latitude      -5.978      0.598   -9.99  3.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.1 on 47 degrees of freedom
## Multiple R-squared:  0.68,   Adjusted R-squared:  0.673 
## F-statistic: 99.8 on 1 and 47 DF,  p-value: 3.31e-13
"The slope coefficient from regressing mortality on latitude is -5.98. For each additional degree of latitude north at which a state's centroid lies, the model predicts about 6 fewer melanoma deaths per million inhabitants. The small standard error (0.598) and p-value (3.3e-13) indicate that this negative slope is clearly distinguishable from zero."
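
To make the slope interpretation concrete, the sketch below plugs the printed coefficients into the fitted line at two latitudes and builds an approximate 95% confidence interval for the slope. The coefficient values are copied from the summary output above so that the example runs without HSAUR2; the latitudes 30 and 40 and all variable names are illustrative choices.

```r
# Coefficients copied from the summary(model1) output above.
intercept <- 389.189
slope     <- -5.978
slope_se  <- 0.598
resid_df  <- 47

# Predicted mortality (deaths per million) at two illustrative latitudes.
lat  <- c(30, 40)
pred <- intercept + slope * lat
names(pred) <- paste0(lat, "N")
print(pred)

# Ten degrees further north changes the prediction by 10 * slope.
print(unname(pred[2] - pred[1]))

# Approximate 95% confidence interval for the slope: estimate +/- t * SE.
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 2))
```

With the fitted model object available, predict(model1, newdata = data.frame(latitude = c(30, 40))) and confint(model1) should give the same quantities without the rounding in the printed coefficients.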

d) Determine the sum of squared errors.

sum(residuals(model1)^2)
## [1] 17173
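
The sum of squared errors ties directly to two numbers in the regression summary: the residual standard error is the square root of SSE divided by the residual degrees of freedom, and R-squared is one minus SSE over the total sum of squares. A short sketch using the values printed above (the variable names are illustrative, and the total sum of squares is only approximate because R-squared is rounded in the output):

```r
# Values copied from the output above.
sse      <- 17173   # sum of squared residuals
resid_df <- 47      # residual degrees of freedom (49 observations - 2 coefficients)
r2       <- 0.68    # multiple R-squared, as rounded in the summary

# Residual standard error = sqrt(SSE / df).
rse <- sqrt(sse / resid_df)
print(round(rse, 1))

# R^2 = 1 - SSE / SST, so SST can be backed out (approximate: r2 is rounded).
sst <- sse / (1 - r2)
print(round(sst))
```

The first printed value should match the residual standard error of 19.1 reported by summary(model1).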

e) Examine the model assumptions. What do you conclude about them?

ggplot(USmelanoma, aes(x = cut(latitude, breaks = 5), y = mortality)) + geom_boxplot() + 
    xlab("centroid latitude of state (deg N)") + ylab("melanoma mortality rate per 1000000 inhabitants")

plot of chunk modelassumptions


model2 = lm(mortality ~ latitude, data = USmelanoma)
model.df = fortify(model2)
ggplot(model.df, aes(.resid)) + geom_histogram(binwidth = 10)

plot of chunk modelassumptions


require(sm)
## Loading required package: sm
## Package `sm', version 2.2-5: type help(sm) for summary information
res = residuals(model2)
sm.density(res, model = "Normal")

plot of chunk modelassumptions

"The box plot of melanoma mortality against binned centroid latitude can be used to check the assumptions of linearity and constant variance. With latitude cut into five equal-width bins (not quintiles, so the bins hold different numbers of states), the medians of the bins fall along a roughly straight line, so the assumption of linearity appears to be satisfied. However, the spread of mortality rates appears to be somewhat greater in the low-latitude bins than in the high-latitude bins, which suggests that the assumption of constant variance may be violated. The histogram of model residuals, used to assess normality, shows no obvious left or right skew. This is supported by the sm.density plot, which compares the estimated density of the residuals with a reference band for a normal distribution: the estimated density lies well within the band, so the assumption of normality is upheld. In summary, the model appears adequate in terms of linearity and normality but may violate the assumption of constant variance."
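
The graphical checks above can be supplemented with numerical ones. The sketch below defines a hypothetical helper, check_assumptions(), that is not part of the original solution: it runs a Shapiro-Wilk test on the residuals (normality) and a studentized Breusch-Pagan style check (constant variance), computed by hand as n times the R-squared from regressing the squared residuals on the fitted values. The melanoma model is evaluated only if HSAUR2 is installed.

```r
# Hypothetical helper: numerical companions to the diagnostic plots.
check_assumptions <- function(fit) {
  res <- residuals(fit)
  n   <- length(res)

  # Normality: Shapiro-Wilk test on the residuals.
  sw <- shapiro.test(res)

  # Constant variance: regress squared residuals on fitted values and
  # treat n * R^2 as chi-squared with 1 df (studentized Breusch-Pagan form).
  aux     <- lm(res^2 ~ fitted(fit))
  bp_stat <- n * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

  c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Applied to the melanoma model when HSAUR2 is available.
if (requireNamespace("HSAUR2", quietly = TRUE)) {
  data("USmelanoma", package = "HSAUR2")
  print(check_assumptions(lm(mortality ~ latitude, data = USmelanoma)))
}
```

Small p-values would count as evidence against the corresponding assumption. With only 49 observations these tests have limited power, so they are best read alongside the plots rather than in place of them.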