date()
## [1] "Tue Oct 30 13:46:34 2012"
Due Date/Time: October 30, 2012, 1:45pm
The points per question are given in parentheses.
(1) The UScereal data set (MASS package) contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is shelf position (counting from the floor up). Create side-by-side box plots showing the distribution of sugar by shelf number. Perform a t test to determine if there is a significant difference in the amount of sugar in cereals on the first and second shelves. What do you conclude? (20)
require(MASS)
## Loading required package: MASS
attach(UScereal)
boxplot(sugars ~ shelf, data = UScereal)
# Welch two-sample t test comparing sugar content on shelves 1 and 2 only
t.test(sugars ~ shelf, data = UScereal, subset = shelf %in% c(1, 2))
A p-value below 0.05 from this test leads to rejecting the null hypothesis of equal mean sugar content, in which case the conclusion is that cereals on the first and second shelves differ significantly in sugar per portion.
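The two-shelf comparison can be written either with a formula or with two explicit vectors. The sketch below, assuming the MASS package is installed, checks that both forms give the same Welch statistic; the names low, shelf1, shelf2, by_formula and by_vectors are illustrative only.

```r
library(MASS)  # provides the UScereal data frame

# Keep only cereals displayed on shelves 1 and 2
low <- subset(UScereal, shelf %in% c(1, 2))

# Formula interface: sugars split by shelf
by_formula <- t.test(sugars ~ shelf, data = low)

# Two-vector interface: the same comparison written out explicitly
shelf1 <- low$sugars[low$shelf == 1]
shelf2 <- low$sugars[low$shelf == 2]
by_vectors <- t.test(shelf1, shelf2)

# Both calls run the same Welch test, so the statistics agree
all.equal(unname(by_formula$statistic), unname(by_vectors$statistic))
by_formula$p.value
```

Writing the groups out as separate vectors makes it easy to confirm which observations enter the test.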
(2) The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state. (40)
a. Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.
require(HSAUR2)
## Loading required package: HSAUR2
## Warning: package 'HSAUR2' was built under R version 2.15.2
## Loading required package: lattice
## Loading required package: scatterplot3d
require(ggplot2)
## Loading required package: ggplot2
p = ggplot(USmelanoma, aes(x = latitude, y = mortality))
p = p + geom_point()
p
b. Add the linear regression line to your scatter plot.
p = p + geom_smooth(method = lm, se = FALSE)
p
c. Regress mortality on latitude and interpret the value of the slope coefficient.
# Regress mortality on latitude (latitude is the explanatory variable)
model = lm(mortality ~ latitude, data = USmelanoma)
summary(model)
Slope = -5.98. On average, mortality decreases by about 5.98 deaths per million inhabitants for each one-degree increase in latitude.
d. Determine the sum of squared errors.
# Sum of squared residuals from the fitted regression
sse = sum(residuals(model)^2)
sse
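The sum of squared errors of a linear model can be cross-checked in two other ways. This is a minimal sketch on simulated data: x, y and fit are made-up stand-ins, not the melanoma data, so the numbers are not the answer to part d.

```r
set.seed(1)                                   # reproducible made-up data
x <- runif(40, 25, 50)
y <- 390 - 6 * x + rnorm(40, sd = 20)
fit <- lm(y ~ x)

sse_direct   <- sum(residuals(fit)^2)            # definition: sum of squared residuals
sse_deviance <- deviance(fit)                    # for lm, the deviance is the residual sum of squares
sse_sigma    <- sigma(fit)^2 * df.residual(fit)  # residual standard error squared times residual df

all.equal(sse_direct, sse_deviance)
all.equal(sse_direct, sse_sigma)
```

If the three values disagree for a fitted model, the residuals being summed probably do not come from the model that was intended.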
e. Use density and box plots to examine the model assumptions. What do you conclude?
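One way to look at the residuals is sketched below with base graphics. check_residuals is a hypothetical helper, and the fit it is demonstrated on uses simulated data rather than USmelanoma.

```r
# Hypothetical helper: draws a density plot and a box plot of the residuals
# and returns their mean and median as a rough symmetry check.
check_residuals <- function(fit) {
  res <- unname(residuals(fit))
  old <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(old))             # restore the previous layout on exit
  plot(density(res), main = "Density of residuals")
  boxplot(res, main = "Box plot of residuals")
  c(mean = mean(res), median = median(res))
}

set.seed(2)                     # made-up data for the demonstration
x <- runif(40, 25, 50)
y <- 390 - 6 * x + rnorm(40, sd = 20)
out <- check_residuals(lm(y ~ x))
out
```

With the regression from part c the call would be check_residuals(model). A roughly bell-shaped density and a symmetric box plot without extreme outliers are consistent with the normality assumption.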
(3) Davies and Goldsmith (1972) investigated abrasion loss (abrasion) of samples of rubber (grams per hour) as a function of hardness (higher values indicate harder rubber) and tensile strength (kg/cm2). The data are in AbrasionLoss.txt. Input the data using AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header=TRUE) (40)
AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header = TRUE)
attach(AL)
a. Create a scatter plot matrix of the three variables. Based on the scatter of points in the plot of abrasion versus strength does it appear that tensile strength would be helpful in explaining abrasion loss?
pairs(AL, panel = panel.smooth)
b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is strength an important explanatory variable after accounting for hardness?
ALmodel = lm(abrasion ~ ., data = AL)
summary(ALmodel)
##
## Call:
## lm(formula = abrasion ~ ., data = AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
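Adjusted R-squared value = 0.8284
Yes. After accounting for hardness, the strength coefficient is significantly different from zero (t = -7.07, p-value = 1.3e-07), so strength is an important explanatory variable.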
c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?
ALmodel
##
## Call:
## lm(formula = abrasion ~ ., data = AL)
##
## Coefficients:
## (Intercept) hardness strength
## 885.16 -6.57 -1.37
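The strength coefficient is -1.37 (885.16 is the intercept). On average, abrasion loss decreases by about 1.37 grams per hour for every 1 kg/cm2 increase in tensile strength, holding hardness constant.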
d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?
cor(AL)
## abrasion hardness strength
## abrasion 1.0000 -0.7377 -0.2984
## hardness -0.7377 1.0000 -0.2992
## strength -0.2984 -0.2992 1.0000
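No, collinearity is unlikely to be a problem. The correlation between hardness and strength is only -0.2992, so the two explanatory variables are weakly related and their coefficients can be interpreted separately.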
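The size of the collinearity effect can be gauged with the variance inflation factor. With only two explanatory variables it depends solely on their correlation, so this small sketch uses the value printed by cor(AL); the names r and vif are illustrative.

```r
r <- -0.2992            # correlation between hardness and strength, from cor(AL)
vif <- 1 / (1 - r^2)    # variance inflation factor when there are two explanatory variables
round(vif, 2)           # about 1.10
```

A value this close to 1 means the coefficient variances are barely inflated. A common rule of thumb treats values above 5 or 10 as a concern.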
e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.
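A sketch of how predict() returns a prediction interval for a new sample. The data frame fake is simulated stand-in data with the same column names as AL, so the numbers it prints are not the answer to this question; fit, new_sample and pi95 are illustrative names.

```r
set.seed(3)
# Simulated stand-in for AL: 30 made-up rubber samples
fake <- data.frame(hardness = runif(30, 45, 90),
                   strength = runif(30, 120, 240))
fake$abrasion <- 885 - 6.6 * fake$hardness - 1.4 * fake$strength +
  rnorm(30, sd = 36)

fit <- lm(abrasion ~ hardness + strength, data = fake)

# New sample: hardness of 60 units, tensile strength of 200 kg/cm2
new_sample <- data.frame(hardness = 60, strength = 200)

# interval = "prediction" gives the interval for a single new observation;
# level = 0.95 is the default and is written out for clarity
pi95 <- predict(fit, newdata = new_sample, interval = "prediction", level = 0.95)
pi95
```

With the model from part b the equivalent call would be predict(ALmodel, newdata = new_sample, interval = "prediction"). The columns fit, lwr and upr hold the predicted abrasion and the lower and upper limits of the interval.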