date()
## [1] "Tue Oct 30 13:43:48 2012"
Due Date/Time: October 30, 2012, 1:45pm
The points per quesion are given in parentheses.
(1) The UScereal (MASS package) contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is shelf position (counting from the floor up). Create side-by-side box plots showing the distribution of sugar by shelf number. Perform a t test to determine if there is a significant difference in the amount of sugar in cereals on the first and second shelves. What do you conclude? (20)
require(MASS)
## Loading required package: MASS
head(UScereal)
## mfr calories protein fat sodium fibre carbo
## 100% Bran N 212.1 12.121 3.030 393.9 30.303 15.15
## All-Bran K 212.1 12.121 3.030 787.9 27.273 21.21
## All-Bran with Extra Fiber K 100.0 8.000 0.000 280.0 28.000 16.00
## Apple Cinnamon Cheerios G 146.7 2.667 2.667 240.0 2.000 14.00
## Apple Jacks K 110.0 2.000 0.000 125.0 1.000 11.00
## Basic 4 G 173.3 4.000 2.667 280.0 2.667 24.00
## sugars shelf potassium vitamins
## 100% Bran 18.18 3 848.48 enriched
## All-Bran 15.15 3 969.70 enriched
## All-Bran with Extra Fiber 0.00 3 660.00 enriched
## Apple Cinnamon Cheerios 13.33 1 93.33 enriched
## Apple Jacks 14.00 2 30.00 enriched
## Basic 4 10.67 3 133.33 enriched
attach(UScereal)
boxplot(sugars ~ shelf)
t.test(sugars[shelf == 1], sugars[shelf == 2])
##
## Welch Two Sample t-test
##
## data: sugars[shelf == 1] and sugars[shelf == 2]
## t = -3.975, df = 30, p-value = 0.0004086
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.404 -3.021
## sample estimates:
## mean of x mean of y
## 6.295 12.508
detach(UScereal)
The p-value indicates we can reject the null hypothesis that there is no significant difference between the sugar content for each shelf. The evidence is convincing (p< 0.001) and we can conclude that there is a significant difference between the amount of sugar for shelves 1 and 2.
(2) The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state. (40)
a. Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.
require(HSAUR2)
## Loading required package: HSAUR2
## Loading required package: lattice
## Loading required package: scatterplot3d
attach(USmelanoma)
head(USmelanoma)
## mortality latitude longitude ocean
## Alabama 219 33.0 87.0 yes
## Arizona 160 34.5 112.0 no
## Arkansas 170 35.0 92.5 no
## California 182 37.5 119.5 yes
## Colorado 149 39.0 105.5 no
## Connecticut 159 41.8 72.8 yes
mel1 = ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point(size = 4) +
xlab("Latitude") + ylab("Mortality Rate")
## Error: could not find function "ggplot"
mel1
## Error: object 'mel1' not found
b. Add the linear regression line to your scatter plot.
mel1 = mel1 + geom_smooth(method = lm, se = FALSE)
## Error: object 'mel1' not found
mel1
## Error: object 'mel1' not found
c. Regress mortality on latitude and interpret the value of the slope coefficient.
model1 = lm(mortality ~ latitude, data = USmelanoma)
model1
##
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
##
## Coefficients:
## (Intercept) latitude
## 389.19 -5.98
For every increase in degreee of latitude, there is a decrease in mortality of -5.978, rounded to 6.0/mil, male mortality rate.
d. Determine the sum of squared errors.
deviance(model1)
## [1] 17173
e. Use density and box plots to examine the model assumptions. What do you conclude?
require(sm)
## Loading required package: sm
## Package `sm', version 2.2-4.1 Copyright (C) 1997, 2000, 2005, 2007, 2008,
## A.W.Bowman & A.Azzalini Type help(sm) for summary information
res1 = residuals(model1)
sm.density(res1, model = "Normal")
boxplot(mortality ~ cut(latitude, 5), data = USmelanoma)
boxplot(mortality ~ cut(latitude, breaks = quantile(latitude)), data = USmelanoma)
Using density, we can conclude that normalcy is not suspect. Using the box plot we can conclude that linearity is not suspect as well. The model is a adequate.
(3) Davies and Goldsmith (1972) investigated the relationship between abrasion loss (abrasion) of samples of rubber (grams per hour) as a function of hardness (higher values indicate harder rubber) and tensile strength (kg/cm2 ). The data are in AbrasionLoss.txt. Input the data using AL = read.table(“http://myweb.fsu.edu/jelsner/AbrasionLoss.txt”, header=TRUE) (40)
a. Create a scatter plot matrix of the three variables. Based on the scatter of points in the plot of abrasion versus strength does it appear that tensile strength would be helpful in explaining abrasion loss?
AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header = TRUE)
head(AL)
## abrasion hardness strength
## 1 372 45 162
## 2 206 55 233
## 3 175 61 232
## 4 154 66 231
## 5 136 71 231
## 6 112 71 237
pairs(AL)
No, tensile strenght is not closely related enough having no linarity when compared to abrasion.
b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is strength an important explanatory variable after accounting for hardness?
ALmodel = lm(abrasion ~ hardness + strength, data = AL)
ALmodel
##
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
##
## Coefficients:
## (Intercept) hardness strength
## 885.16 -6.57 -1.37
summary(ALmodel)
##
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
The adjusted R value is 0.83, indicating that when accounting for hardness, strength becomes a significant explanatory variable.
c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?
ALmodel2 = lm(abrasion ~ strength, data = AL)
ALmodel2
##
## Call:
## lm(formula = abrasion ~ strength, data = AL)
##
## Coefficients:
## (Intercept) strength
## 305.225 -0.719
summary(ALmodel2)
##
## Call:
## lm(formula = abrasion ~ strength, data = AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.6 -59.9 2.8 61.2 183.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 305.225 79.996 3.82 0.00069 ***
## strength -0.719 0.435 -1.65 0.10923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.6 on 28 degrees of freedom
## Multiple R-squared: 0.089, Adjusted R-squared: 0.0565
## F-statistic: 2.74 on 1 and 28 DF, p-value: 0.109
For every 1 kg/cm2 of strangth 0.72 grams of abrasion are lost per hour.
d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?
suppressMessages(require(psych))
cor(AL)
## abrasion hardness strength
## abrasion 1.0000 -0.7377 -0.2984
## hardness -0.7377 1.0000 -0.2992
## strength -0.2984 -0.2992 1.0000
pairs.panels(AL)
No, there is no colinearity between hardness and strength.
e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.
head(AL)
## abrasion hardness strength
## 1 372 45 162
## 2 206 55 233
## 3 175 61 232
## 4 154 66 231
## 5 136 71 231
## 6 112 71 237
predict(AL, data.frame(abrasion = 60), level = 0.95, interval = "confidence")
## Error: no applicable method for 'predict' applied to an object of class
## "data.frame"