Exam # 2

Michael Patterson

date()
## [1] "Tue Oct 30 13:57:56 2012"

Due Date/Time: October 30, 2012, 1:45pm
The points per question are given in parentheses.

(1) The UScereal data set (MASS package) contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is shelf position (counting from the floor up). Create side-by-side box plots showing the distribution of sugar by shelf number. Perform a t test to determine if there is a significant difference in the amount of sugar in cereals on the first and second shelves. What do you conclude? (20)

require("MASS")
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 2.15.2
require("ggplot2")
## Loading required package: ggplot2
# Refer to columns by bare name inside aes(); the full column name is "sugars"
ggplot(UScereal, aes(x = factor(shelf), y = sugars)) + geom_boxplot()


t.test(UScereal$sugar[UScereal$shelf == 1], UScereal$sugar[UScereal$shelf == 
    2])
## 
##  Welch Two Sample t-test
## 
## data:  UScereal$sugar[UScereal$shelf == 1] and UScereal$sugar[UScereal$shelf == 2] 
## t = -3.975, df = 30, p-value = 0.0004086
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -9.404 -3.021 
## sample estimates:
## mean of x mean of y 
##     6.295    12.508 
## 

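With a p-value of 0.0004 from the Welch two-sample t test, we reject the null hypothesis of equal means: cereals on the second shelf contain significantly more sugar per portion (mean 12.51) than cereals on the first shelf (mean 6.30). The 95% confidence interval for the difference in means (first shelf minus second shelf) is -9.40 to -3.02.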
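
The Welch statistic in the output above can be reproduced by hand, which makes explicit what t.test() computes when equal variances are not assumed. This is a minimal sketch rather than part of the original answer. It assumes the MASS package is installed and uses the full column name sugars (the exam code's UScereal$sugar works through partial matching). The names x, y, vx, vy, se, t_stat, df and p_value are illustrative.

```r
library(MASS)  # provides the UScereal data set

# Sugar per portion for cereals on the first and second shelves
x <- UScereal$sugars[UScereal$shelf == 1]
y <- UScereal$sugars[UScereal$shelf == 2]

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means divided by its standard error
se <- sqrt(vx + vy)
t_stat <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p = p_value)
```

These values should agree with the t, df and p-value reported by t.test() above.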

(2) The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state. (40)

a. Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.

require("HSAUR2")
## Loading required package: HSAUR2
## Warning: package 'HSAUR2' was built under R version 2.15.2
## Loading required package: lattice
## Loading required package: scatterplot3d
ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point()


b. Add the linear regression line to your scatter plot.

ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point() + geom_smooth(method = lm, 
    se = FALSE)


c. Regress mortality on latitude and interpret the value of the slope coefficient.

cancer = lm(mortality ~ latitude, data = USmelanoma)
cancer
## 
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
## 
## Coefficients:
## (Intercept)     latitude  
##      389.19        -5.98  
## 

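The intercept of 389.19 is the projected mortality at a latitude of 0 degrees (the equator), about 389.2 deaths per million; this is an extrapolation well outside the range of latitudes in the data. The slope of -5.98 means that for each additional degree of latitude north, mortality decreases by about 6 deaths per million on average.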
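
To make the slope interpretation concrete, the fitted line can be evaluated by hand at two latitudes. This is only an illustrative sketch: the coefficients are copied from the printed output above (so they are rounded), and the latitudes 30 and 40 are arbitrary choices for illustration, not values singled out in the exam.

```r
# Coefficients copied from the printed lm() output above (rounded)
b0 <- 389.19   # intercept
b1 <- -5.98    # slope: change in mortality per degree of latitude

# Fitted mortality (per million) at two illustrative latitudes
lat <- c(30, 40)
fitted_mortality <- b0 + b1 * lat
fitted_mortality

# Moving 10 degrees north changes the fitted value by 10 * b1
diff(fitted_mortality)
```

The difference between the two fitted values is ten times the slope, a drop of about 59.8 deaths per million.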

d. Determine the sum of squared errors.

cancererror = lm(mortality ~ latitude, data = USmelanoma)
deviance(cancererror)
## [1] 17173
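
For a linear model, deviance() returns the residual sum of squares, which is why it answers this part. The sketch below checks that identity on R's built-in cars data, used here purely as a stand-in so the example runs without extra packages; it is not the exam data, and the names fit, sse_manual and sse_deviance are illustrative.

```r
# Stand-in example on R's built-in cars data (not the exam data)
fit <- lm(dist ~ speed, data = cars)

# Sum of squared errors computed directly from the residuals
sse_manual <- sum(residuals(fit)^2)

# deviance() returns the same quantity for an lm fit
sse_deviance <- deviance(fit)

all.equal(sse_manual, sse_deviance)
```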

e. Use density and box plots to examine the model assumptions. What do you conclude?

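# include.lowest = TRUE keeps the minimum latitude in the first group instead of dropping it as NA
boxplot(mortality ~ cut(latitude, breaks = quantile(latitude), include.lowest = TRUE), data = USmelanoma)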



require("sm")
## Loading required package: sm
## Package `sm', version 2.2-4.1 Copyright (C) 1997, 2000, 2005, 2007, 2008,
## A.W.Bowman & A.Azzalini Type help(sm) for summary information
cancerres = residuals(cancererror)
sm.density(cancerres, xlab = "Model Residuals", model = "Normal")


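Both the box plots (mortality grouped by latitude quartile) and the density plot of the model residuals suggest that the model assumptions are reasonable: the distribution of mortality within each latitude group looks roughly symmetric, and the residuals are approximately normally distributed.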

(3) Davies and Goldsmith (1972) investigated the abrasion loss (abrasion) of samples of rubber (grams per hour) as a function of hardness (higher values indicate harder rubber) and tensile strength (kg/cm2). The data are in AbrasionLoss.txt. Input the data using AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header = TRUE). (40)

a. Create a scatter plot matrix of the three variables. Based on the scatter of points in the plot of abrasion versus strength does it appear that tensile strength would be helpful in explaining abrasion loss?

AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header = TRUE)
pairs(AL, panel = panel.smooth)


Based on the scatter of points alone, tensile strength does not appear helpful in explaining abrasion loss. The points are widely scattered with no clear linear relationship, so a model using strength by itself would leave most of the variation unexplained.

b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is strength an important explanatory variable after accounting for hardness?

ALmodel = lm(abrasion ~ hardness + strength, data = AL)
summary(ALmodel)
## 
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79.38 -14.61   3.82  19.75  65.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  885.161     61.752   14.33  3.8e-14 ***
## hardness      -6.571      0.583  -11.27  1.0e-11 ***
## strength      -1.374      0.194   -7.07  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,    Adjusted R-squared: 0.828 
## F-statistic:   71 on 2 and 27 DF,  p-value: 1.77e-11 
## 
drop1(ALmodel)
## Single term deletions
## 
## Model:
## abrasion ~ hardness + strength
##          Df Sum of Sq    RSS AIC
## <none>                 35950 219
## hardness  1    169027 204977 269
## strength  1     66607 102556 248

The adjusted R-squared value is 0.828. Strength is an important explanatory variable after accounting for hardness: its coefficient is highly significant (t = -7.07, p = 1.3e-07), and dropping it from the model raises the AIC from 219 to 248.
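
The drop1() table and the coefficient table tell the same story, and the link can be checked with a little arithmetic. The sketch below uses only numbers copied from the printed output above (so they are rounded); the variable names are illustrative. It computes the partial F statistic for strength and recovers the adjusted R-squared from the multiple R-squared.

```r
# Values copied from the drop1() and summary() output above (rounded)
rss_full    <- 35950  # residual sum of squares of the full model
ss_strength <- 66607  # increase in RSS when strength is dropped
df_resid    <- 27     # residual degrees of freedom of the full model

# Partial F statistic for strength (1 numerator degree of freedom)
f_stat  <- (ss_strength / 1) / (rss_full / df_resid)
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# For a single term, sqrt(F) equals |t| from the coefficient table
sqrt(f_stat)
p_value

# Adjusted R-squared from the multiple R-squared
r2 <- 0.84
n  <- df_resid + 3  # two predictors plus the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
adj_r2
```

Up to rounding, sqrt(f_stat) matches the absolute t value for strength (7.07) and adj_r2 matches the reported 0.828.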

c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?

ALmodel
## 
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
## 
## Coefficients:
## (Intercept)     hardness     strength  
##      885.16        -6.57        -1.37  
## 

Holding hardness constant, each 1 kg/cm2 increase in tensile strength is associated with 1.37 grams per hour less abrasion loss, on average.

d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?

cor(AL)
##          abrasion hardness strength
## abrasion   1.0000  -0.7377  -0.2984
## hardness  -0.7377   1.0000  -0.2992
## strength  -0.2984  -0.2992   1.0000

The correlation between hardness and strength is -0.299, which is weak, so collinearity is unlikely to be a problem for interpreting this model.
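
One way to quantify this is the variance inflation factor (VIF). The exam does not ask for it, so treat the following as a supplementary sketch. With exactly two explanatory variables, the VIF of each depends only on their pairwise correlation, so it can be computed from the cor(AL) output above without refitting anything. The names r and vif are illustrative.

```r
# Correlation between hardness and strength, copied from cor(AL) above
r <- -0.2992

# With exactly two predictors, the R-squared from regressing one on the
# other is r^2, so both predictors share the same variance inflation factor
vif <- 1 / (1 - r^2)
vif
```

The result is about 1.1, far below the values of 5 or 10 that are often used as rough warning thresholds.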

e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.

predict(ALmodel, newdata = data.frame(hardness = 60, strength = 200), interval = "prediction")
##   fit   lwr   upr
## 1 216 138.9 293.2

The 95% prediction interval is 138.93 to 293.16 grams per hour, with a fitted value of 216.05.
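
A prediction interval has the form fit ± t × (standard error of prediction), so the reported numbers can be sanity-checked by hand. The sketch below uses only values quoted in the answer and in the summary(ALmodel) output above; the variable names are illustrative. It recovers the implied standard error of prediction, which should exceed the residual standard error because it also includes the uncertainty in the fitted mean.

```r
# Values reported in the answer above and in summary(ALmodel)
fit      <- 216.05
lwr      <- 138.93
upr      <- 293.16
df_resid <- 27
resid_se <- 36.5

# The interval is fit +/- t_crit * se_pred, so it is symmetric about the fit
half_width <- (upr - lwr) / 2
midpoint   <- (upr + lwr) / 2

# Critical value for a 95% interval and the implied prediction standard error
t_crit  <- qt(0.975, df = df_resid)
se_pred <- half_width / t_crit

c(midpoint = midpoint, half_width = half_width, se_pred = se_pred)
```

The midpoint reproduces the fitted value up to rounding, and the implied standard error of prediction is slightly larger than the residual standard error of 36.5, as expected.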