Recipe 6: Analysis of Covariance

Recipes for the Design of Experiments

Zoe Konrad

Rensselaer Polytechnic Institute

Fall 2014 v1

1. Setting

System under test

We are interested in modeling the effect of geographic region and age on wealth of billionaires using Fortune’s 1992 billionaire list.

x <- read.csv("~/Desktop/Zoe/Recipe 6.csv")
attach(x)
head(x)
##   wealth age region
## 1   37.0  50      M
## 2   24.0  88      U
## 3   14.0  64      A
## 4   13.0  63      U
## 5   13.0  66      U
## 6   11.7  72      E

Factors and Levels

The factor of interest is region, the geographic location, with five levels: Asia (A), Europe (E), Middle East (M), United States (U), and Other (O).

plot(region)


We will analyze age as a covariate.

plot(as.factor(age))


summary(age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       7      56      65      64      72     102

Response variables

The response variable, an individual’s wealth (in billions of dollars), is heavily right-skewed.

summary(wealth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.30    1.80    2.73    3.00   37.00
summary(1/wealth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.027   0.333   0.556   0.555   0.769   1.000
par(mfrow=c(1,2))
qqnorm(wealth) ; qqline(wealth)
qqnorm(1/wealth) ; qqline(1/wealth)


Because wealth is far from normally distributed, we use the reciprocal transformation, 1/wealth, as the response in our analysis.
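The choice of transformation can also be checked numerically. The sketch below is only an illustration on synthetic data (the vector toy_wealth and the helper w_stat are hypothetical names, not part of the original analysis). It compares the Shapiro-Wilk W statistic for the raw values, their logarithm, and their reciprocal; a W closer to 1 indicates a sample closer to normal.

```r
# Synthetic stand-in for the wealth column (NOT the real data):
# a heavy right tail with a floor at 1, loosely like a billionaire list.
set.seed(42)
toy_wealth <- 1 / runif(200, min = 0.03, max = 1)

# Hypothetical helper: extract the Shapiro-Wilk W statistic for a vector.
w_stat <- function(v) unname(shapiro.test(v)$statistic)

# Compare candidate transformations; W near 1 means closer to normal.
w_table <- c(
  raw        = w_stat(toy_wealth),
  log        = w_stat(log(toy_wealth)),
  reciprocal = w_stat(1 / toy_wealth)
)
print(round(w_table, 3))
```

On the real data the same comparison would use wealth in place of toy_wealth. The logarithm is included only as a common alternative; the analysis here uses the reciprocal.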

2. (Experimental) Design

We will fit a model with the continuous explanatory variable ‘age’ and the discrete explanatory variable ‘region’ and use analysis of covariance (ANCOVA) to quantify the fit.
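Written out, the model fitted in the Testing section can be expressed as follows. The symbols are our own notation, introduced here as an assumption: i indexes region, j indexes individuals within a region, \tau_i is the region effect, \beta is the common age slope, and \gamma_i is the region-specific adjustment to that slope (the age:region interaction).

```latex
\[
\frac{1}{\text{wealth}_{ij}}
  = \mu + \tau_i + \beta\,\text{age}_{ij} + \gamma_i\,\text{age}_{ij} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\]
```

A classical ANCOVA assumes equal slopes, that is, \gamma_i = 0 for every region. Including the interaction term lets the analysis test that assumption.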

3. (Statistical) Analysis

Exploratory Data Analysis Graphics

plot(wealth~region) ; plot(wealth~region, outline=FALSE)


With outliers removed from the plot, it is easier to see that typical wealth does not differ much across the regions, although M stands out with a much larger spread.

plot(age~region)


The age distributions look similar across regions, so there does not seem to be any strong relationship between the covariate age and the factor region.

plot(age, 1/wealth, col=region)


A color-coded scatter plot does not reveal any obvious trends in the data.

Testing

model <- aov(1/wealth~age+region+region*age)
anova(model)
## Analysis of Variance Table
## 
## Response: 1/wealth
##             Df Sum Sq Mean Sq F value Pr(>F)
## age          1   0.04  0.0434    0.66   0.42
## region       4   0.15  0.0365    0.55   0.70
## age:region   4   0.46  0.1153    1.75   0.14
## Residuals  215  14.17  0.0659

Neither age (p = 0.42), region (p = 0.70), nor their interaction (p = 0.14) is significant at the 0.05 level. We fail to reject the null hypothesis that randomization alone can account for the variation in 1/wealth.
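Two points about this test can be illustrated with a short sketch. It runs on synthetic data generated with no real effects, and the data frame toy and its columns are hypothetical names. First, the formula age + region + region*age fits the same model as the shorter age * region. Second, the interaction row of the table is equivalent to comparing an equal-slopes (additive) model against the full model.

```r
# Synthetic stand-in data (NOT the billionaire data), generated with no
# real age or region effect. All names here are hypothetical.
set.seed(1)
toy <- data.frame(
  region     = factor(rep(c("A", "E", "M", "O", "U"), each = 40)),
  age        = round(rnorm(200, mean = 64, sd = 13)),
  inv_wealth = runif(200, min = 0.03, max = 1)
)

# The formula used in the text and the shorter age * region fit the same model.
fit_long  <- lm(inv_wealth ~ age + region + region * age, data = toy)
fit_short <- lm(inv_wealth ~ age * region, data = toy)
same_fit  <- isTRUE(all.equal(deviance(fit_long), deviance(fit_short)))

# Equal-slopes check: additive model versus the model with interaction.
fit_additive <- lm(inv_wealth ~ age + region, data = toy)
slopes_test  <- anova(fit_additive, fit_short)
print(slopes_test)
```

If the interaction is not significant, one common next step is to drop it and read the age and region effects from the additive model. Whether that is appropriate here is a judgment call, not something the sketch decides.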

Estimation (of Parameters)

plot(TukeyHSD(aov(wealth~region)))


aggregate(wealth, by=list(region), FUN=mean)
##   Group.1     x
## 1       A 2.651
## 2       E 2.258
## 3       M 4.264
## 4       O 2.279
## 5       U 3.000

There is not a significant difference in mean wealth between any pair of levels of the factor region.
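The pairwise comparisons can also be read as a table rather than a plot. The sketch below uses synthetic data with hypothetical names (toy, inv_wealth) and, as an assumption that differs from the call above, runs Tukey's HSD on the reciprocal scale so that it matches the response used in the ANCOVA.

```r
# Synthetic stand-in data (NOT the billionaire data); names are hypothetical.
set.seed(7)
toy <- data.frame(
  region     = factor(rep(c("A", "E", "M", "O", "U"), each = 40)),
  inv_wealth = runif(200, min = 0.03, max = 1)
)

# Tukey's HSD for all pairwise region differences (5 levels give 10 pairs).
tukey     <- TukeyHSD(aov(inv_wealth ~ region, data = toy))
pairs_tbl <- as.data.frame(tukey$region)

# A pair differs significantly when its 95% interval excludes zero.
signif_pairs <- pairs_tbl[pairs_tbl$lwr > 0 | pairs_tbl$upr < 0, ]
print(round(pairs_tbl, 3))
```

Each row gives the estimated difference in means (diff), the interval bounds (lwr, upr), and the adjusted p-value.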

Diagnostics/Model Adequacy Checking

The residuals of the model appear approximately normally distributed, which supports the primary assumption of normality. We conclude that this model is appropriate.

qqnorm(residuals(model)) ; qqline(residuals(model))

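The Q-Q plot can be complemented with formal checks. The following is a minimal sketch on synthetic data, with hypothetical names throughout. It applies a Shapiro-Wilk test to the residuals and a Bartlett test for equal residual variance across regions, which is relevant because the box plots suggested a larger spread for M.

```r
# Synthetic stand-in data (NOT the billionaire data); names are hypothetical.
set.seed(3)
toy <- data.frame(
  region     = factor(rep(c("A", "E", "M", "O", "U"), each = 40)),
  age        = round(rnorm(200, mean = 64, sd = 13)),
  inv_wealth = runif(200, min = 0.03, max = 1)
)

# Fit the same form of model as in the Testing section and keep the residuals.
fit     <- aov(inv_wealth ~ age * region, data = toy)
toy$res <- residuals(fit)

# Normality of residuals: a small p-value is evidence against normality.
normality <- shapiro.test(toy$res)

# Equal variance across regions: a small p-value suggests unequal spread.
spread <- bartlett.test(res ~ region, data = toy)

print(c(shapiro_p = normality$p.value, bartlett_p = spread$p.value))
```

Bartlett's test is itself sensitive to non-normality, so its result is best read alongside the plots rather than on its own.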

4. References to the literature