I'm finding myself quite skeptical of Fincher and Thornhill's paper “Pathogen prevalence predicts human cross-cultural variability in individualism/collectivism” (doi: 10.1098/rspb.2008.0094, 7 June 2008 vol. 275 no. 1640 1279-1285).
They say they took data from the CIA factbook, so I downloaded an
electronic version provided by google (warning: there's at least one bogus line in the file!)
baseurl <- "http://www.math.mcmaster.ca/bolker/misc/"
X <- read.csv(url(paste0(baseurl, "cia-data-all.csv")), stringsAsFactors = FALSE)
## find the names of the columns we want
## grep('(Gini|expectancy|GDP)',names(X),value=TRUE) extract relevant
## columns and fix names
CIAdat <- setNames(subset(X, select = c(Distribution.of.family.income...Gini.index,
Life.expectancy.at.birth, GDP...per.capita..PPP., Population, Area)), c("Gini",
"Life_expect", "GDPpercap", "Population", "Area"))
rownames(CIAdat) <- X[, 1]
## make GDP values into numbers and compute pop density
CIAdat <- transform(CIAdat, GDPpercap = as.numeric(gsub("[ $]", "", GDPpercap)),
popdens = Population/Area)
CIAdat <- subset(CIAdat, select = c(Gini:GDPpercap, popdens))
A basic plot:
library(car)
## Warning: package 'car' was built under R version 2.15.1
## Warning: package 'MASS' was built under R version 2.15.1
## Warning: package 'nnet' was built under R version 2.15.1
scatterplotMatrix(CIAdat, gap = 0)
It may amuse you to check out the high-density outliers. I wonder if something got mangled?
subset(CIAdat, popdens > 3000, select = popdens)
## popdens
## Holy See (Vatican City) Inf
## Monaco 16482
## Macau 19994
## Gaza Strip 4311
## Gibraltar 4005
## Hong Kong 6390
## Singapore 6682
Get Fincher & Thornhill data supplement: I took the PDF file, used pdftotext to convert to text, and edited the text file in emacs (lots of kill-rectangle and yank-rectangle) to put it into a R-friendly csv file.
ftdata <- read.csv(url(paste0(baseurl,"fincher_thornhill.csv")))
rownames(ftdata) <- ftdata$Country
## check for name mismatches
(missnames1 <- setdiff(rownames(ftdata),rownames(CIAdat)))
## [1] "China PRC" "Czech Rep." "England" "Nrth Ireland"
## [5] "Rep. Of Korea" "Trinidad" "UAE" "USA"
## setdiff(rownames(CIAdat),rownames(ftdata))
misspos1 <-match(missnames1,rownames(ftdata))
rownames(ftdata)[misspos1] <- c("China","Czech Republic",
"England", ## ?? United Kingdom ??
"Northern Ireland", ## ?? ditto ??
"Korea South",
"Trinidad and Tobago",
"United Arab Emirates",
"United States")
mm <- merge(CIAdat,ftdata,by="row.names",all=FALSE)
## clean up a bit: subset() is fragile but clean
mm <- subset(mm,select=c(Country:GLOBE,Gini:popdens,
Path_histor:Kashima))
rownames(mm) <- mm$Country
head(mm)
## Country World GLOBE Gini Life_expect GDPpercap
## Albania Albania W._Eurasia E._Europe 26.7 77.96 6000
## Argentina Argentina S._America Lat._America 49.0 76.56 14200
## Australia Australia Ins._Pacific Anglo 30.5 81.63 38100
## Austria Austria W._Eurasia Ger._Europe 26.0 79.50 39200
## Bahrain Bahrain W._Eurasia <NA> NA 75.16 37200
## Bangladesh Bangladesh E._Eurasia <NA> 33.2 60.25 1500
## popdens Path_histor Path_contemp Hofstede Suh Gelfand Kashima
## Albania 126.598 NA 24 NA NA 5.74 NA
## Argentina 14.715 -0.07 37 46 4.80 5.51 1
## Australia 2.747 -0.20 27 90 9.00 4.17 0
## Austria 97.892 -0.72 26 55 6.75 4.85 0
## Bahrain 982.166 0.15 23 NA 3.00 NA NA
## Bangladesh 1083.702 0.67 33 20 NA NA NA
library(car)
scatterplotMatrix(ftdata[, -(1:3)], gap = 0)
Appropriate procedure:
We have three choices for handling the missing value scores: (1) discard any countries with any missing values (i.e., na.omit()); (2) crude imputation, replacing missing values by the mean for that variable; (3) more sophisticated imputation (e.g. using the mice package).
Trying the middle option:
fill_NA <- function(x, verbose = FALSE) {
if ((totNA <- sum(is.na(x))) == 0 || !is.numeric(x))
return(x)
if (verbose)
message(paste("replacing", totNA, "NA values with mean"))
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}
mmi <- as.data.frame(lapply(mm, fill_NA))
rownames(mmi) <- rownames(mm) ## argh
## ? is there a slightly nicer way to do this? sapply coerces to a matrix
What does the scatterplot of demographic/economic variables look like with the merged set (i.e. retaining only those countries for which we have social & pathogen data)?
scatterplotMatrix(subset(mmi, subset = popdens < 2000, select = c(Gini:popdens)),
gap = 0)
The subset command excludes only Hong Kong and Singapore (pop. dens. >6000, the next largest is Malta at 1282). It still looks
like popdens has a pretty extreme distribution, maybe log-transform it … ?
subset(mmi, popdens > 500, select = popdens)
## popdens
## Bahrain 982.2
## Bangladesh 1083.7
## Hong Kong 6390.5
## Malta 1282.2
## Singapore 6682.3
## Taiwan 638.5
Select social surveys:
consmat <- as.matrix(subset(mmi, TRUE, select = c(Hofstede:Kashima)))
The variables are reasonably (but not very) strongly correlated (\( r^2 \approx 0.4-0.5 \)), and the variances differ enormously, with Hofstede having the largest:
print(var(consmat), digits = 2)
## Hofstede Suh Gelfand Kashima
## Hofstede 385.1 20.38 -6.52 -5.52
## Suh 20.4 2.22 -0.49 -0.35
## Gelfand -6.5 -0.49 0.30 0.13
## Kashima -5.5 -0.35 0.13 0.15
print(cor(consmat), digits = 2)
## Hofstede Suh Gelfand Kashima
## Hofstede 1.00 0.7 -0.61 -0.73
## Suh 0.70 1.0 -0.60 -0.60
## Gelfand -0.61 -0.6 1.00 0.63
## Kashima -0.73 -0.6 0.63 1.00
PCA on scaled data (correlations), with a scree plot:
(pca1 <- princomp(consmat, cor = TRUE))
## Call:
## princomp(x = consmat, cor = TRUE)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4
## 1.7124 0.6540 0.6318 0.4908
##
## 4 variables and 96 observations.
par(las = 1)
plot(pca1)
mmi$PCA1 <- pca1$scores[, 1]
mmi <- mmi[order(mmi$PCA1), ]
c(head(rownames(mmi)), "...", tail(rownames(mmi)))
## [1] "USA" "Australia" "Netherlands" "Denmark"
## [5] "Canada" "Sweden" "..." "Pakistan"
## [9] "Thailand" "Rep. Of Korea" "Indonesia" "China PRC"
## [13] "Colombia"
## NB PCA1 is ordered from *least* to *most* 'conservative'
tmpf <- function(x) factor(x, levels = as.character(x))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 2.15.1
theme_set(theme_bw())
ggplot(mmi, aes(x = PCA1, y = tmpf(Country), colour = World)) + geom_point() +
labs(x = "Conformity", y = "")
Regress scores on demog/social variables:
lm1 <- lm(PCA1 ~ Gini + popdens + GDPpercap + Life_expect, data = mmi)
mmi$dresids <- residuals(lm1) ## should we standardize?
scatterplotMatrix(mmi2 <- subset(mmi, select = c(dresids, PCA1, Path_histor,
Path_contemp)), gap = 0)
An alternative enhanced-pairs-plot view: it would be nice to modify the function so that negative and positive correlations were distinguished (and possibly get rid of the significance stars … )
library(PerformanceAnalytics)
## Warning: package 'zoo' was built under R version 2.15.1
chart.Correlation(mmi2)
## chart.Correlation plots ABSOLUTE VALUE of correlations
I'm not sure what the strong correlation between PCA1 and the residuals means …
summary((lm1 <- lm(dresids ~ Path_contemp, data = mmi2)))
##
## Call:
## lm(formula = dresids ~ Path_contemp, data = mmi2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.559 -0.654 0.166 0.715 3.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1329 0.7030 -1.61 0.11
## Path_contemp 0.0360 0.0219 1.65 0.10
##
## Residual standard error: 1.39 on 94 degrees of freedom
## Multiple R-squared: 0.028, Adjusted R-squared: 0.0176
## F-statistic: 2.71 on 1 and 94 DF, p-value: 0.103
summary((lm2 <- lm(dresids ~ Path_histor, data = mmi2)))
##
## Call:
## lm(formula = dresids ~ Path_histor, data = mmi2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.124 -0.801 0.116 0.858 3.599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0135 0.1357 -0.10 0.9210
## Path_histor 0.7182 0.2133 3.37 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.33 on 94 degrees of freedom
## Multiple R-squared: 0.108, Adjusted R-squared: 0.0981
## F-statistic: 11.3 on 1 and 94 DF, p-value: 0.0011
Somewhat to my surprise, there is still a strong relationship between historical pathogen index and the merged conformity index …
mmi2B <- with(mmi, data.frame(mmi2, World))
library(ggplot2)
library(grid)
zmargin <- theme(panel.margin = unit(0, "lines"))
theme_set(theme_bw())
library(reshape2)
library(RColorBrewer)
mmi3 <- melt(mmi2B, id.var = c(1:2, 5))
mmi4 <- melt(mmi3, id.var = 3:5)
mmi4 <- setNames(mmi4, c("World", "pred", "predval", "resp", "respval"))
ggplot(mmi4, aes(predval, respval)) + geom_point(aes(colour = World)) + geom_smooth(method = "loess") +
geom_smooth(method = "lm", colour = "red") + facet_grid(resp ~ pred, scale = "free") +
zmargin + scale_colour_brewer(palette = "Dark2")
(tip from Steve Walker: try pcaMethods package for PCA with missing data [to install: source("http://bioconductor.org/biocLite.R"); biocLite("pcaMethods")])
Admittedly starting to snoop now, but I'm concerned that there is a strong correlation between PCA1 and the residuals: does that mean that there was more signal that wasn't removed by the regressions, i.e. a nonlinear pattern?
lm1Q <- lm(PCA1 ~ poly(Gini, popdens, GDPpercap, Life_expect, degree = 2), data = mmi)
library(bbmle)
## Warning: package 'bbmle' was built under R version 2.15.1
## Loading required package: stats4
AICtab(lm1, lm1Q, weights = TRUE)
## dAIC df weight
## lm1Q 0.0 16 1
## lm1 17.9 3 <0.001
A bunch of machinery to construct a nice plot of the quadratic terms ordered by \( t \)-statistic (I should (a) scale these predictors to standardize them (b) look at the data more carefully to get a sense of the quadratic effects):
library(coefplot2)
## Loading required package: coda
## Loading required package: lattice
cc <- coeftab(lm1Q)
## manipulate unwieldy poly(...,2) var names
tfun <- function(cstr) {
vars <- strsplit(gsub("poly\\(([^\\)]+)\\).*", "\\1", cstr[2]), "[, ]")
vars <- vars[[1]][c(1, 3, 5, 7)] ## hack
cstr1 <- gsub("^.*\\)", "", cstr)
c("Intercept", sapply(cstr1[-1], function(x) {
y <- strsplit(x, "\\.")[[1]]
y[y == "0"] <- ""
y[y == "1"] <- vars[y == "1"]
y[y == "2"] <- paste0(vars[y == "2"], "^2")
paste(y[y != ""], collapse = "*")
}))
}
tvals <- coef(summary(lm1Q))[, "t value"]
rownames(cc) <- tfun(rownames(cc))
cc <- cc[order(abs(tvals)), ]
cc <- cc[rownames(cc) != "Intercept", ]
coefplot2(list(cc), intercept = TRUE)
mmi$dresidsQ <- residuals(lm1Q)
scatterplotMatrix(mmi5 <- subset(mmi, select = c(dresidsQ, PCA1, Path_histor,
Path_contemp)), gap = 0)
summary((lm1rQ <- lm(dresidsQ ~ Path_contemp, data = mmi5)))
##
## Call:
## lm(formula = dresidsQ ~ Path_contemp, data = mmi5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9478 -0.7311 0.0482 0.4972 2.9260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7275 0.5540 -1.31 0.19
## Path_contemp 0.0231 0.0173 1.34 0.18
##
## Residual standard error: 1.09 on 94 degrees of freedom
## Multiple R-squared: 0.0188, Adjusted R-squared: 0.00832
## F-statistic: 1.8 on 1 and 94 DF, p-value: 0.183
summary((lm2rQ <- lm(dresidsQ ~ Path_histor, data = mmi5)))
##
## Call:
## lm(formula = dresidsQ ~ Path_histor, data = mmi5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8112 -0.7489 -0.0547 0.5287 2.8037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00612 0.11064 -0.06 0.956
## Path_histor 0.32543 0.17389 1.87 0.064 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.08 on 94 degrees of freedom
## Multiple R-squared: 0.0359, Adjusted R-squared: 0.0257
## F-statistic: 3.5 on 1 and 94 DF, p-value: 0.0644
There's still a weak signal left.