Re-analyzing Fincher and Thornhill

I'm finding myself quite skeptical of Fincher and Thornhill's paper “Pathogen prevalence predicts human cross-cultural variability in individualism/collectivism” (doi: 10.1098/rspb.2008.0094, 7 June 2008 vol. 275 no. 1640 1279-1285).

They say they took data from the CIA factbook, so I downloaded an
electronic version provided by google (warning: there's at least one bogus line in the file!)

baseurl <- "http://www.math.mcmaster.ca/bolker/misc/"
X <- read.csv(url(paste0(baseurl, "cia-data-all.csv")), stringsAsFactors = FALSE)
## find the names of the columns we want
## grep('(Gini|expectancy|GDP)',names(X),value=TRUE) extract relevant
## columns and fix names
CIAdat <- setNames(subset(X, select = c(Distribution.of.family.income...Gini.index, 
    Life.expectancy.at.birth, GDP...per.capita..PPP., Population, Area)), c("Gini", 
    "Life_expect", "GDPpercap", "Population", "Area"))
rownames(CIAdat) <- X[, 1]
## make GDP values into numbers and compute pop density
CIAdat <- transform(CIAdat, GDPpercap = as.numeric(gsub("[ $]", "", GDPpercap)), 
    popdens = Population/Area)
CIAdat <- subset(CIAdat, select = c(Gini:GDPpercap, popdens))

A basic plot:

library(car)
## Warning: package 'car' was built under R version 2.15.1
## Warning: package 'MASS' was built under R version 2.15.1
## Warning: package 'nnet' was built under R version 2.15.1
scatterplotMatrix(CIAdat, gap = 0)

plot of chunk ciaplot1

It may amuse you to check out the high-density outliers. I wonder if something got mangled?

subset(CIAdat, popdens > 3000, select = popdens)
##                         popdens
## Holy See (Vatican City)     Inf
## Monaco                    16482
## Macau                     19994
## Gaza Strip                 4311
## Gibraltar                  4005
## Hong Kong                  6390
## Singapore                  6682

Get Fincher & Thornhill data supplement: I took the PDF file, used pdftotext to convert to text, and edited the text file in emacs (lots of kill-rectangle and yank-rectangle) to put it into a R-friendly csv file.

ftdata <- read.csv(url(paste0(baseurl,"fincher_thornhill.csv")))
rownames(ftdata) <- ftdata$Country
## check for name mismatches
(missnames1 <- setdiff(rownames(ftdata),rownames(CIAdat)))
## [1] "China PRC"     "Czech Rep."    "England"       "Nrth Ireland" 
## [5] "Rep. Of Korea" "Trinidad"      "UAE"           "USA"
## setdiff(rownames(CIAdat),rownames(ftdata))
misspos1 <-match(missnames1,rownames(ftdata))
rownames(ftdata)[misspos1] <- c("China","Czech Republic",
                                "England", ## ?? United Kingdom ??
                                "Northern Ireland", ## ?? ditto ??
                                "Korea South",
                                "Trinidad and Tobago",
                                "United Arab Emirates",
                                "United States")
mm <- merge(CIAdat,ftdata,by="row.names",all=FALSE)
## clean up a bit: subset() is fragile but clean
mm <- subset(mm,select=c(Country:GLOBE,Gini:popdens,
                         Path_histor:Kashima))
rownames(mm) <- mm$Country
head(mm)
##               Country        World        GLOBE Gini Life_expect GDPpercap
## Albania       Albania   W._Eurasia    E._Europe 26.7       77.96      6000
## Argentina   Argentina   S._America Lat._America 49.0       76.56     14200
## Australia   Australia Ins._Pacific        Anglo 30.5       81.63     38100
## Austria       Austria   W._Eurasia  Ger._Europe 26.0       79.50     39200
## Bahrain       Bahrain   W._Eurasia         <NA>   NA       75.16     37200
## Bangladesh Bangladesh   E._Eurasia         <NA> 33.2       60.25      1500
##             popdens Path_histor Path_contemp Hofstede  Suh Gelfand Kashima
## Albania     126.598          NA           24       NA   NA    5.74      NA
## Argentina    14.715       -0.07           37       46 4.80    5.51       1
## Australia     2.747       -0.20           27       90 9.00    4.17       0
## Austria      97.892       -0.72           26       55 6.75    4.85       0
## Bahrain     982.166        0.15           23       NA 3.00      NA      NA
## Bangladesh 1083.702        0.67           33       20   NA      NA      NA
library(car)
scatterplotMatrix(ftdata[, -(1:3)], gap = 0)

plot of chunk ftplot1

Appropriate procedure:

Filling in missing data

We have three choices for handling the missing value scores: (1) discard any countries with any missing values (i.e., na.omit()); (2) crude imputation, replacing missing values by the mean for that variable; (3) more sophisticated imputation (e.g. using the mice package).

Trying the middle option:

fill_NA <- function(x, verbose = FALSE) {
    if ((totNA <- sum(is.na(x))) == 0 || !is.numeric(x)) 
        return(x)
    if (verbose) 
        message(paste("replacing", totNA, "NA values with mean"))
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}
mmi <- as.data.frame(lapply(mm, fill_NA))
rownames(mmi) <- rownames(mm)  ## argh
## ? is there a slightly nicer way to do this?  sapply coerces to a matrix

What does the scatterplot of demographic/economic variables look like with the merged set (i.e. retaining only those countries for which we have social & pathogen data)?

scatterplotMatrix(subset(mmi, subset = popdens < 2000, select = c(Gini:popdens)), 
    gap = 0)

plot of chunk unnamed-chunk-1

The subset command excludes only Hong Kong and Singapore (pop. dens. >6000, the next largest is Malta at 1282). It still looks
like popdens has a pretty extreme distribution, maybe log-transform it … ?

subset(mmi, popdens > 500, select = popdens)
##            popdens
## Bahrain      982.2
## Bangladesh  1083.7
## Hong Kong   6390.5
## Malta       1282.2
## Singapore   6682.3
## Taiwan       638.5

Select social surveys:

consmat <- as.matrix(subset(mmi, TRUE, select = c(Hofstede:Kashima)))

The variables are reasonably (but not very) strongly correlated (\( r^2 \approx 0.4-0.5 \)), and the variances differ enormously, with Hofstede having the largest:

print(var(consmat), digits = 2)
##          Hofstede   Suh Gelfand Kashima
## Hofstede    385.1 20.38   -6.52   -5.52
## Suh          20.4  2.22   -0.49   -0.35
## Gelfand      -6.5 -0.49    0.30    0.13
## Kashima      -5.5 -0.35    0.13    0.15
print(cor(consmat), digits = 2)
##          Hofstede  Suh Gelfand Kashima
## Hofstede     1.00  0.7   -0.61   -0.73
## Suh          0.70  1.0   -0.60   -0.60
## Gelfand     -0.61 -0.6    1.00    0.63
## Kashima     -0.73 -0.6    0.63    1.00

PCA on scaled data (correlations), with a scree plot:

(pca1 <- princomp(consmat, cor = TRUE))
## Call:
## princomp(x = consmat, cor = TRUE)
## 
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 
## 1.7124 0.6540 0.6318 0.4908 
## 
##  4  variables and  96 observations.
par(las = 1)
plot(pca1)

plot of chunk pca

mmi$PCA1 <- pca1$scores[, 1]
mmi <- mmi[order(mmi$PCA1), ]
c(head(rownames(mmi)), "...", tail(rownames(mmi)))
##  [1] "USA"           "Australia"     "Netherlands"   "Denmark"      
##  [5] "Canada"        "Sweden"        "..."           "Pakistan"     
##  [9] "Thailand"      "Rep. Of Korea" "Indonesia"     "China PRC"    
## [13] "Colombia"
## NB PCA1 is ordered from *least* to *most* 'conservative'
tmpf <- function(x) factor(x, levels = as.character(x))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 2.15.1
theme_set(theme_bw())
ggplot(mmi, aes(x = PCA1, y = tmpf(Country), colour = World)) + geom_point() + 
    labs(x = "Conformity", y = "")

plot of chunk cplot

Regress scores on demog/social variables:

lm1 <- lm(PCA1 ~ Gini + popdens + GDPpercap + Life_expect, data = mmi)
mmi$dresids <- residuals(lm1)  ## should we standardize?
scatterplotMatrix(mmi2 <- subset(mmi, select = c(dresids, PCA1, Path_histor, 
    Path_contemp)), gap = 0)

plot of chunk residplot1

An alternative enhanced-pairs-plot view: it would be nice to modify the function so that negative and positive correlations were distinguished (and possibly get rid of the significance stars … )

library(PerformanceAnalytics)
## Warning: package 'zoo' was built under R version 2.15.1
chart.Correlation(mmi2)

plot of chunk corplot

## chart.Correlation plots ABSOLUTE VALUE of correlations

I'm not sure what the strong correlation between PCA1 and the residuals means …

summary((lm1 <- lm(dresids ~ Path_contemp, data = mmi2)))
## 
## Call:
## lm(formula = dresids ~ Path_contemp, data = mmi2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.559 -0.654  0.166  0.715  3.857 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -1.1329     0.7030   -1.61     0.11
## Path_contemp   0.0360     0.0219    1.65     0.10
## 
## Residual standard error: 1.39 on 94 degrees of freedom
## Multiple R-squared: 0.028,   Adjusted R-squared: 0.0176 
## F-statistic: 2.71 on 1 and 94 DF,  p-value: 0.103
summary((lm2 <- lm(dresids ~ Path_histor, data = mmi2)))
## 
## Call:
## lm(formula = dresids ~ Path_histor, data = mmi2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.124 -0.801  0.116  0.858  3.599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.0135     0.1357   -0.10   0.9210   
## Path_histor   0.7182     0.2133    3.37   0.0011 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.33 on 94 degrees of freedom
## Multiple R-squared: 0.108,   Adjusted R-squared: 0.0981 
## F-statistic: 11.3 on 1 and 94 DF,  p-value: 0.0011

Somewhat to my surprise, there is still a strong relationship between historical pathogen index and the merged conformity index …

mmi2B <- with(mmi, data.frame(mmi2, World))
library(ggplot2)
library(grid)
zmargin <- theme(panel.margin = unit(0, "lines"))
theme_set(theme_bw())
library(reshape2)
library(RColorBrewer)
mmi3 <- melt(mmi2B, id.var = c(1:2, 5))
mmi4 <- melt(mmi3, id.var = 3:5)
mmi4 <- setNames(mmi4, c("World", "pred", "predval", "resp", "respval"))
ggplot(mmi4, aes(predval, respval)) + geom_point(aes(colour = World)) + geom_smooth(method = "loess") + 
    geom_smooth(method = "lm", colour = "red") + facet_grid(resp ~ pred, scale = "free") + 
    zmargin + scale_colour_brewer(palette = "Dark2")

plot of chunk lastplot

(tip from Steve Walker: try pcaMethods package for PCA with missing data [to install: source("http://bioconductor.org/biocLite.R"); biocLite("pcaMethods")])

Using quadratic fits of econ/demog to social indices

Admittedly starting to snoop now, but I'm concerned that there is a strong correlation between PCA1 and the residuals: does that mean that there was more signal that wasn't removed by the regressions, i.e. a nonlinear pattern?

lm1Q <- lm(PCA1 ~ poly(Gini, popdens, GDPpercap, Life_expect, degree = 2), data = mmi)
library(bbmle)
## Warning: package 'bbmle' was built under R version 2.15.1
## Loading required package: stats4
AICtab(lm1, lm1Q, weights = TRUE)
##      dAIC df weight
## lm1Q  0.0 16 1     
## lm1  17.9 3  <0.001

A bunch of machinery to construct a nice plot of the quadratic terms ordered by \( t \)-statistic (I should (a) scale these predictors to standardize them (b) look at the data more carefully to get a sense of the quadratic effects):

library(coefplot2)
## Loading required package: coda
## Loading required package: lattice
cc <- coeftab(lm1Q)
## manipulate unwieldy poly(...,2) var names
tfun <- function(cstr) {
    vars <- strsplit(gsub("poly\\(([^\\)]+)\\).*", "\\1", cstr[2]), "[, ]")
    vars <- vars[[1]][c(1, 3, 5, 7)]  ## hack
    cstr1 <- gsub("^.*\\)", "", cstr)
    c("Intercept", sapply(cstr1[-1], function(x) {
        y <- strsplit(x, "\\.")[[1]]
        y[y == "0"] <- ""
        y[y == "1"] <- vars[y == "1"]
        y[y == "2"] <- paste0(vars[y == "2"], "^2")
        paste(y[y != ""], collapse = "*")
    }))
}
tvals <- coef(summary(lm1Q))[, "t value"]
rownames(cc) <- tfun(rownames(cc))
cc <- cc[order(abs(tvals)), ]
cc <- cc[rownames(cc) != "Intercept", ]
coefplot2(list(cc), intercept = TRUE)

plot of chunk qcoefs

mmi$dresidsQ <- residuals(lm1Q)
scatterplotMatrix(mmi5 <- subset(mmi, select = c(dresidsQ, PCA1, Path_histor, 
    Path_contemp)), gap = 0)