Exam # 2

Guang Xing

date()

## [1] "Tue Oct 30 13:52:40 2012"

Due Date/Time: October 30, 2012, 1:45pm
The points per quesion are given in parentheses.

(1) The UScereal (MASS package) contains many variables regarding breakfast cereals. One variable is the amount of sugar per portion and another is shelf position (counting from the floor up). Create side-by-side box plots showing the distribution of sugar by shelf number. Perform a t test to determine if there is a significant difference in the amount of sugar in cereals on the first and second shelves. What do you conclude? (20)

# chooseCRANmirror()
install.packages("MASS")

## Installing package(s) into 'C:/Users/quant18/Documents/R/win-library/2.15'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

require(MASS)

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 2.15.2

head(UScereal)

##                           mfr calories protein   fat sodium  fibre carbo
## 100% Bran                   N    212.1  12.121 3.030  393.9 30.303 15.15
## All-Bran                    K    212.1  12.121 3.030  787.9 27.273 21.21
## All-Bran with Extra Fiber   K    100.0   8.000 0.000  280.0 28.000 16.00
## Apple Cinnamon Cheerios     G    146.7   2.667 2.667  240.0  2.000 14.00
## Apple Jacks                 K    110.0   2.000 0.000  125.0  1.000 11.00
## Basic 4                     G    173.3   4.000 2.667  280.0  2.667 24.00
##                           sugars shelf potassium vitamins
## 100% Bran                  18.18     3    848.48 enriched
## All-Bran                   15.15     3    969.70 enriched
## All-Bran with Extra Fiber   0.00     3    660.00 enriched
## Apple Cinnamon Cheerios    13.33     1     93.33 enriched
## Apple Jacks                14.00     2     30.00 enriched
## Basic 4                    10.67     3    133.33 enriched

require(ggplot2)

## Loading required package: ggplot2

ggplot(UScereal, aes(x = factor(shelf), y = sugars)) + geom_boxplot()

plot of chunk unnamed-chunk-2

attach(UScereal)
t.test(sugars[shelf == 1], sugars[shelf == 2])

## 
##  Welch Two Sample t-test
## 
## data:  sugars[shelf == 1] and sugars[shelf == 2] 
## t = -3.975, df = 30, p-value = 0.0004086
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -9.404 -3.021 
## sample estimates:
## mean of x mean of y 
##     6.295    12.508

p-value is less than 0.1, so there is a significant difference in the amount of sugar in cereals on the first and second shelves.

(2) The data set USmelanoma (HSAUR2 package) contains male mortality counts per one million inhabitants by state along with the latitude and longitude centroid of the state. (40)

a. Create a scatter plot of mortality versus latitude using latitude as the explanatory variable.

# chooseCRANmirror()
install.packages("HSAUR2")

## Installing package(s) into 'C:/Users/quant18/Documents/R/win-library/2.15'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

require(HSAUR2)

## Loading required package: HSAUR2

## Warning: package 'HSAUR2' was built under R version 2.15.2

## Loading required package: lattice

## Loading required package: scatterplot3d

head(USmelanoma)

##             mortality latitude longitude ocean
## Alabama           219     33.0      87.0   yes
## Arizona           160     34.5     112.0    no
## Arkansas          170     35.0      92.5    no
## California        182     37.5     119.5   yes
## Colorado          149     39.0     105.5    no
## Connecticut       159     41.8      72.8   yes

p = ggplot(USmelanoma, aes(x = latitude, y = mortality)) + geom_point()
p

plot of chunk unnamed-chunk-4

b. Add the linear regression line to your scatter plot.


p = p + geom_smooth(method = lm, se = FALSE)
p

plot of chunk unnamed-chunk-5

c. Regress mortality on latitude and interpret the value of the slope coefficient.


model = lm(mortality ~ latitude, data = USmelanoma)
model

## 
## Call:
## lm(formula = mortality ~ latitude, data = USmelanoma)
## 
## Coefficients:
## (Intercept)     latitude  
##      389.19        -5.98

coef(model)[2]

## latitude 
##   -5.978

with 1-degree increasement of latitude, the mortality decreases 5.9.

d. Determine the sum of squared errors.

deviance(model)

## [1] 17173

e. Use density and box plots to examine the model assumptions. What do you conclude?


suppressMessages(require(sm))
res = residuals(model)
sm.density(res, model = "Normal")

plot of chunk unnamed-chunk-8

boxplot(mortality ~ cut(latitude, 4), data = USmelanoma)

plot of chunk unnamed-chunk-8

This model looks good. It seems to satisfy linearity,constant variance and normality.

(3) Davies and Goldsmith (1972) investigated the relationship between abrasion loss (abrasion) of samples of rubber (grams per hour) as a function of hardness (higher values indicate harder rubber) and tensile strength (kg/cm² ). The data are in AbrasionLoss.txt. Input the data using AL = read.table(“http://myweb.fsu.edu/jelsner/AbrasionLoss.txt”, header=TRUE) (40)

a. Create a scatter plot matrix of the three variables. Based on the scatter of points in the plot of abrasion versus strength does it appear that tensile strength would be helpful in explaining abrasion loss?

AL = read.table("http://myweb.fsu.edu/jelsner/AbrasionLoss.txt", header = TRUE)
install.packages("scatterplot3d")

## Installing package(s) into 'C:/Users/quant18/Documents/R/win-library/2.15'
## (as 'lib' is unspecified)

## Warning: package 'scatterplot3d' is in use and will not be installed

## Error: trying to use CRAN without setting a mirror

require(scatterplot3d)

s3d = scatterplot3d(AL, angle = 55, scale.y = 0.7, pch = 16, xlab = "strength", 
    zlab = "abrasion", ylab = "hardness")

plot of chunk unnamed-chunk-9


one = lm(abrasion ~ strength, data = AL)
plot(AL$abrasion ~ AL$strength, pch = 16)
abline(one)

plot of chunk unnamed-chunk-9

# two = lm(abrasion ~ strength + hardness, data=AL) s3d$plane3d(two)

yes, it appear that tensile strength would be helpful in explaining abrasion loss.

b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is strength an important explanatory variable after accounting for hardness?


modelAL = lm(abrasion ~ strength + hardness, data = AL)
summary(modelAL)

## 
## Call:
## lm(formula = abrasion ~ strength + hardness, data = AL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79.38 -14.61   3.82  19.75  65.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  885.161     61.752   14.33  3.8e-14 ***
## strength      -1.374      0.194   -7.07  1.3e-07 ***
## hardness      -6.571      0.583  -11.27  1.0e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,    Adjusted R-squared: 0.828 
## F-statistic:   71 on 2 and 27 DF,  p-value: 1.77e-11

the adjusted R squared is 0.8284. strength is an important explanatory variable.

c. On average how much additional abrasion is lost for every 1 kg/cm² increase in tensile strength?


abra = coef(modelAL)[2] * 1  # abra represents additional abrasion
abra

## strength 
##   -1.374

additional abrasion 1.374312 is lost

d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?

install.packages("psych")

## Installing package(s) into 'C:/Users/quant18/Documents/R/win-library/2.15'
## (as 'lib' is unspecified)

## Error: trying to use CRAN without setting a mirror

require(psych)

## Loading required package: psych

## Warning: package 'psych' was built under R version 2.15.2

## Attaching package: 'psych'

## The following object(s) are masked from 'package:ggplot2':
## 
## %+%

pairs.panels(AL, panel = panel.smooth)

plot of chunk unnamed-chunk-12

collinearity is not a problem for interpreting the model.

e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm^2.

predict(modelAL, data.frame(hardness = 60, strength = 200), level = 0.95, interval = "prediction")

##   fit   lwr   upr
## 1 216 138.9 293.2