We are interested in modeling the effect of geographic region and age on wealth of billionaires using Fortune’s 1992 billionaire list.
x <- read.csv("~/Desktop/Zoe/Recipe 6.csv")
attach(x)
head(x)
##   wealth age region
## 1   37.0  50      M
## 2   24.0  88      U
## 3   14.0  64      A
## 4   13.0  63      U
## 5   13.0  66      U
## 6   11.7  72      E
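As a side note, `attach()` places the columns on the search path, where they can mask other objects; a common alternative is to keep the data frame and pass it explicitly. The sketch below is illustrative only: `demo` is a hypothetical data frame re-entered by hand from the six rows printed above, and the level set A, E, M, O, U is assumed from the region codes used in this report. In R 4.0 and later, `read.csv()` reads text columns as character by default, so an explicit `factor()` call is needed before a column can be plotted or used as a colour index the way `region` is here.

```r
# Hypothetical stand-in: the six rows shown by head(x), typed in by hand
demo <- data.frame(
  wealth = c(37.0, 24.0, 14.0, 13.0, 13.0, 11.7),
  age    = c(50, 88, 64, 63, 66, 72),
  region = c("M", "U", "A", "U", "U", "E"),
  stringsAsFactors = FALSE
)

# Make region a factor with all five assumed levels, even those absent here
demo$region <- factor(demo$region, levels = c("A", "E", "M", "O", "U"))

# Count rows per region without attaching the data frame
region_counts <- table(demo$region)
print(region_counts)

# Mean wealth per region; levels with no rows come back as NA
mean_by_region <- with(demo, tapply(wealth, region, mean))
print(mean_by_region)
```

Using `with()` or a `data =` argument keeps every variable tied to its data frame, at the cost of slightly longer calls.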
The factor of interest is region, the billionaire's geographic location: Asia (A), Europe (E), Middle East (M), United States (U), and Other (O).
plot(region)
We will analyze age as a covariate.
plot(as.factor(age))
summary(age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       7      56      65      64      72     102
The response variable, an individual’s wealth (measured in billions of dollars), is heavily right-skewed.
summary(wealth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.30    1.80    2.73    3.00   37.00
summary(1/wealth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.027   0.333   0.556   0.555   0.769   1.000
par(mfrow=c(1,2))
qqnorm(wealth) ; qqline(wealth)
qqnorm(1/wealth) ; qqline(1/wealth)
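Because wealth is far from normally distributed, we will use its reciprocal, 1/wealth, as the response in our analysis. The mean and median of 1/wealth nearly coincide (0.555 vs. 0.556), which suggests a much more symmetric distribution. Note that the transformation reverses the ordering: a larger value of 1/wealth corresponds to a smaller fortune.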
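To make the comparison of candidate transformations concrete, the sketch below computes a moment-based sample skewness for raw, log, and reciprocal values. It runs on simulated data, not on the Fortune list: `w` is a hypothetical Pareto-like stand-in generated as `1 / runif(200)`, so its reciprocal is uniform by construction and the output only illustrates the mechanics. The helper `skewness()` is defined here and does not come from a package.

```r
set.seed(1)

# Simulated stand-in for wealth (NOT the Fortune data): heavy-tailed values >= 1
w <- 1 / runif(200)

# Moment-based sample skewness; values near 0 indicate a symmetric sample
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness of the raw values and of two candidate transformations
skews <- c(
  raw        = skewness(w),
  log        = skewness(log(w)),
  reciprocal = skewness(1 / w)
)
print(round(skews, 2))
```

The same three-way comparison could be run on the real `wealth` column to check whether the reciprocal is in fact the most symmetric choice.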
We will fit a model with the continuous explanatory variable ‘age’ and the discrete explanatory variable ‘region’ and use analysis of covariance (ANCOVA) to quantify the fit.
plot(wealth~region) ; plot(wealth~region, outline=FALSE)
With the outliers hidden, it is easier to see that typical wealth is similar across the regions, although M stands out with a much larger spread.
plot(age~region)
There does not seem to be any strong relationship between age and region: the age distributions look similar in every region.
plot(age, 1/wealth, col=region)
A color-coded scatter plot does not reveal any obvious trends in the data.
model <- aov(1/wealth ~ age * region)
anova(model)
## Analysis of Variance Table
##
## Response: 1/wealth
##             Df Sum Sq Mean Sq F value Pr(>F)
## age          1   0.04  0.0434    0.66   0.42
## region       4   0.15  0.0365    0.55   0.70
## age:region   4   0.46  0.1153    1.75   0.14
## Residuals  215  14.17  0.0659
Neither age, region, nor their interaction is statistically significant at the 0.05 level (the smallest p-value is 0.14). We fail to reject the null hypothesis that random variation alone can account for the variation in 1/wealth.
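One way to examine the interaction term on its own is to compare the additive model with the model that also includes `age:region`. The sketch below uses simulated data with no real effects; the data frame `sim` and the column `inv_wealth` are hypothetical stand-ins for the attached variables, so it shows the mechanics rather than reproducing the table above. Keep in mind that `anova()` on a single fit reports sequential sums of squares, so with unbalanced groups the rows for the main effects depend on the order in which terms enter the formula.

```r
set.seed(42)
n <- 200

# Simulated stand-in data (NOT the Fortune list); the response has no true effects
sim <- data.frame(
  age    = round(runif(n, 30, 90)),
  region = factor(sample(c("A", "E", "M", "O", "U"), n, replace = TRUE))
)
sim$inv_wealth <- runif(n)

# Additive ANCOVA model: a common age slope, one intercept shift per region
additive <- lm(inv_wealth ~ age + region, data = sim)

# Full model: the age slope is allowed to differ by region
full <- lm(inv_wealth ~ age * region, data = sim)

# F test for the four extra interaction coefficients
cmp <- anova(additive, full)
print(cmp)
```

If the interaction is not needed, the additive model is usually easier to interpret, since each region then differs only by a constant shift.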
plot(TukeyHSD(aov(wealth~region)))
aggregate(wealth, by=list(region), FUN=mean)
##   Group.1     x
## 1       A 2.651
## 2       E 2.258
## 3       M 4.264
## 4       O 2.279
## 5       U 3.000
There is no significant difference in mean wealth between any two levels of the factor region.
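The pairwise comparisons can also be read as a table rather than from the plot. The following sketch runs `TukeyHSD()` on simulated data (the data frame `sim` and column `inv_wealth` are hypothetical, and the response is pure noise), then lists any pairs whose adjusted p-value falls below 0.05. It fits the transformed response for consistency with the ANCOVA model, which is an assumption of this sketch; the report's own call uses untransformed wealth.

```r
set.seed(7)

# Simulated stand-in data (NOT the Fortune list)
sim <- data.frame(
  region     = factor(sample(c("A", "E", "M", "O", "U"), 150, replace = TRUE)),
  inv_wealth = runif(150)
)

# One-way ANOVA followed by Tukey's honest significant difference
fit   <- aov(inv_wealth ~ region, data = sim)
tukey <- TukeyHSD(fit, "region", conf.level = 0.95)

# One row per pair of regions: difference, interval bounds, adjusted p-value
pairs_tbl <- tukey$region
print(round(pairs_tbl, 3))

# Names of the pairs (if any) that differ at the 5% level
print(rownames(pairs_tbl)[pairs_tbl[, "p adj"] < 0.05])
```

With five regions there are ten pairwise comparisons, which is why an adjustment for multiple testing is used instead of separate t tests.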
The residuals of the model appear approximately normally distributed in the Q-Q plot below, which supports the normality assumption. We can conclude that this model is appropriate.
qqnorm(residuals(model)) ; qqline(residuals(model))
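A Q-Q plot can be complemented by formal checks. The sketch below applies the Shapiro-Wilk test to the residuals and Bartlett's test for equal variances across regions, which is relevant because the boxplots suggested a larger spread in region M. It uses simulated data with a normal response (the data frame `sim`, the column `inv_wealth`, and the chosen mean and standard deviation are all hypothetical), so the p-values say nothing about the billionaire data.

```r
set.seed(3)
n <- 200

# Simulated stand-in data (NOT the Fortune list)
sim <- data.frame(
  age    = round(runif(n, 30, 90)),
  region = factor(sample(c("A", "E", "M", "O", "U"), n, replace = TRUE))
)
sim$inv_wealth <- rnorm(n, mean = 0.55, sd = 0.25)

# Same model structure as in the report, fitted with an explicit data argument
fit <- aov(inv_wealth ~ age * region, data = sim)
res <- residuals(fit)

# Shapiro-Wilk: a small p-value is evidence against normal residuals
normality <- shapiro.test(res)

# Bartlett: a small p-value is evidence of unequal variances between regions
equal_var <- bartlett.test(res, sim$region)

print(normality$p.value)
print(equal_var$p.value)
```

Bartlett's test is itself sensitive to non-normality, so it is best read alongside the plots rather than in place of them.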