Exam 4

In this exercise you will study a dataset concerning vital capacity, which is the maximal amount of air, that can be exhaled after a maximal inhalation.

library(multcomp)

## Warning: package 'multcomp' was built under R version 3.2.5

## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 3.2.5

## Loading required package: survival

## Warning: package 'survival' was built under R version 3.2.5

## Loading required package: TH.data

## Warning: package 'TH.data' was built under R version 3.2.5

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.2.5

## 
## Attaching package: 'TH.data'

## Det følgende objekt er maskeret fra 'package:MASS':
## 
##     geyser

library(ggplot2)

Read in the data:

vitcap <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=vitcap.dat", header=TRUE)
head(vitcap)

##   exposure age vital.capacity z1 z2
## 1        C  39           4.62  0  1
## 2        C  40           5.29  0  1
## 3        C  41           5.52  0  1
## 4        C  41           3.71  0  1
## 5        C  45           4.02  0  1
## 6        C  49           5.09  0  1

In the dataset, the variable vital.capacity has been measured on 84 workers in the cadmium industry.

The next variable is the factor exposure with 3 levels, indicating the level of cadmium exposure:

A: None
B: Less than 10 years
C: More than 10 years

The data set also contains dummy variables for the factor exposure:

z1=1 if exposure=B and 0 otherwise.
z2=1 if exposure=C and 0 otherwise.

You will use these two variables later on.

Make a model and carry out an analysis investigating the effect of the factor exposure on the response vital.capacity. In that connection you should calculate/interpret

boxplot(vital.capacity ~ exposure, data = vitcap )

-(2) F-test for no effect of exposure

Since the p-value is not significant (0.09021), we can accept the Null Hypothesis of using the simple model for estimation. Therefore, we should NOT use a more complex model, namely one with exposure as a predictor, to better express the change in vital.capacity with the increase of age.

model = lm(vital.capacity ~ exposure, data = vitcap )
summary(model)

## 
## Call:
## lm(formula = vital.capacity ~ exposure, data = vitcap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77179 -0.45205  0.09808  0.51295  1.57083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.46204    0.11223  39.757   <2e-16 ***
## exposureB    0.00974    0.17997   0.054   0.9570    
## exposureC   -0.51288    0.24245  -2.115   0.0375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7445 on 81 degrees of freedom
## Multiple R-squared:  0.05767,    Adjusted R-squared:  0.0344 
## F-statistic: 2.478 on 2 and 81 DF,  p-value: 0.09021

-(3) confidence intervals for multiple pairwise comparisons. Hint (assuming your fitted model is called model and that the multcomp package is installed):

test <- glht(model, mcp(exposure = "Tukey"))
summary(test)

The difference between B time and A time is close to 0, meaning that exposure below 10 years does not appear to affect the vital capacity by a significant amount. However, the difference between C and B or A is significant. Working for more than 10 years will affect you.

We expand the analysis to include the workers age - the variable age - as a predictor. Make a model and carry out an analysis investigating the effect of the predictors exposure and age on the response vital.capacity. In that connection you should:

1. investigate whether there is interaction between the effects of exposure and age

modelA = lm(vital.capacity ~ exposure + age, data = vitcap)
modelB = lm(vital.capacity ~ exposure * age, data = vitcap)
anova(modelA, modelB)

## Analysis of Variance Table
## 
## Model 1: vital.capacity ~ exposure + age
## Model 2: vital.capacity ~ exposure * age
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 30.035                              
## 2     78 27.535  2    2.4995 3.5402 0.03376 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A p-value of around 3% means that we should use the more complex model to determining whether there is a significant interaction between the effects of exposure and age.

1. give a graphical interpretation of such an interaction

p <- ggplot(vitcap, aes(x=age, y=vital.capacity)) #empty plot
p <- p + geom_point(aes(color=exposure)) #add points
p <- p + geom_smooth(aes(color = exposure), method="lm", se=FALSE) #add lines
p

Consider the following two models, where we introduce the dummy variables.

model1 <- lm(vital.capacity ~ age*z2, data = vitcap)
model2 <- lm(vital.capacity ~ age*z1 + age*z2, data = vitcap)
model1
model2

1. Use an F-test to show that there is no significant difference between model1 and model2.
1. Give an interpretation of the difference between the two models. The largest p-value (50%) tells us we should use a simpler model. In this case, that is the model only considering exposure level C. Therefore exposure B doesn’t bring enough releveant information to the model to be significant. This is also visible in the boxplot and how B and C (kind of) overlap. C constains all of the information that B can provide. So we go with the simpler model.