Re-analyzing Fincher and Thornhill

I'm finding myself quite skeptical of Fincher and Thornhill's paper “Pathogen prevalence predicts human cross-cultural variability in individualism/collectivism” (doi: 10.1098/rspb.2008.0094, 7 June 2008 vol. 275 no. 1640 1279-1285).

They say they took data from the CIA factbook, so I downloaded an
electronic version provided by google (warning: there's at least one bogus line in the file!)

baseurl <- "http://www.math.mcmaster.ca/bolker/misc/"

X <- read.csv(url(paste0(baseurl, "cia-data-all.csv")), stringsAsFactors = FALSE)
## find the names of the columns we want
## grep('(Gini|expectancy|GDP)',names(X),value=TRUE) extract relevant
## columns and fix names
CIAdat <- setNames(subset(X, select = c(Distribution.of.family.income...Gini.index, 
    Life.expectancy.at.birth, GDP...per.capita..PPP., Population, Area)), c("Gini", 
    "Life_expect", "GDPpercap", "Population", "Area"))
rownames(CIAdat) <- X[, 1]
## make GDP values into numbers and compute pop density
CIAdat <- transform(CIAdat, GDPpercap = as.numeric(gsub("[ $]", "", GDPpercap)), 
    popdens = Population/Area)
CIAdat <- subset(CIAdat, select = c(Gini:GDPpercap, popdens))

A basic plot:

library(car)

## Warning: package 'car' was built under R version 2.15.1

## Warning: package 'MASS' was built under R version 2.15.1

## Warning: package 'nnet' was built under R version 2.15.1

scatterplotMatrix(CIAdat, gap = 0)

plot of chunk ciaplot1

It may amuse you to check out the high-density outliers. I wonder if something got mangled?

subset(CIAdat, popdens > 3000, select = popdens)

##                         popdens
## Holy See (Vatican City)     Inf
## Monaco                    16482
## Macau                     19994
## Gaza Strip                 4311
## Gibraltar                  4005
## Hong Kong                  6390
## Singapore                  6682

Get Fincher & Thornhill data supplement: I took the PDF file, used pdftotext to convert to text, and edited the text file in emacs (lots of kill-rectangle and yank-rectangle) to put it into a R-friendly csv file.

ftdata <- read.csv(url(paste0(baseurl,"fincher_thornhill.csv")))
rownames(ftdata) <- ftdata$Country
## check for name mismatches
(missnames1 <- setdiff(rownames(ftdata),rownames(CIAdat)))

## [1] "China PRC"     "Czech Rep."    "England"       "Nrth Ireland" 
## [5] "Rep. Of Korea" "Trinidad"      "UAE"           "USA"

## setdiff(rownames(CIAdat),rownames(ftdata))
misspos1 <-match(missnames1,rownames(ftdata))
rownames(ftdata)[misspos1] <- c("China","Czech Republic",
                                "England", ## ?? United Kingdom ??
                                "Northern Ireland", ## ?? ditto ??
                                "Korea South",
                                "Trinidad and Tobago",
                                "United Arab Emirates",
                                "United States")
mm <- merge(CIAdat,ftdata,by="row.names",all=FALSE)
## clean up a bit: subset() is fragile but clean
mm <- subset(mm,select=c(Country:GLOBE,Gini:popdens,
                         Path_histor:Kashima))
rownames(mm) <- mm$Country
head(mm)

##               Country        World        GLOBE Gini Life_expect GDPpercap
## Albania       Albania   W._Eurasia    E._Europe 26.7       77.96      6000
## Argentina   Argentina   S._America Lat._America 49.0       76.56     14200
## Australia   Australia Ins._Pacific        Anglo 30.5       81.63     38100
## Austria       Austria   W._Eurasia  Ger._Europe 26.0       79.50     39200
## Bahrain       Bahrain   W._Eurasia         <NA>   NA       75.16     37200
## Bangladesh Bangladesh   E._Eurasia         <NA> 33.2       60.25      1500
##             popdens Path_histor Path_contemp Hofstede  Suh Gelfand Kashima
## Albania     126.598          NA           24       NA   NA    5.74      NA
## Argentina    14.715       -0.07           37       46 4.80    5.51       1
## Australia     2.747       -0.20           27       90 9.00    4.17       0
## Austria      97.892       -0.72           26       55 6.75    4.85       0
## Bahrain     982.166        0.15           23       NA 3.00      NA      NA
## Bangladesh 1083.702        0.67           33       20   NA      NA      NA

library(car)
scatterplotMatrix(ftdata[, -(1:3)], gap = 0)

plot of chunk ftplot1

Appropriate procedure:

(optional) recreate F&T plots and analyses to make sure we're OK so far
find PCA of conformity scores (possibly with imputation? there are only 34 countries with data for all four conformity scores)
regress the pooled conformity score on the CIA data (life expectancy+GDP+Gini+popdens) and take the residuals
see how well current or historical pathogen index predicts the residuals

Filling in missing data

We have three choices for handling the missing value scores: (1) discard any countries with any missing values (i.e., na.omit()); (2) crude imputation, replacing missing values by the mean for that variable; (3) more sophisticated imputation (e.g. using the mice package).

Trying the middle option:

fill_NA <- function(x, verbose = FALSE) {
    if ((totNA <- sum(is.na(x))) == 0 || !is.numeric(x)) 
        return(x)
    if (verbose) 
        message(paste("replacing", totNA, "NA values with mean"))
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}
mmi <- as.data.frame(lapply(mm, fill_NA))
rownames(mmi) <- rownames(mm)  ## argh
## ? is there a slightly nicer way to do this?  sapply coerces to a matrix

What does the scatterplot of demographic/economic variables look like with the merged set (i.e. retaining only those countries for which we have social & pathogen data)?

scatterplotMatrix(subset(mmi, subset = popdens < 2000, select = c(Gini:popdens)), 
    gap = 0)

plot of chunk unnamed-chunk-1

The subset command excludes only Hong Kong and Singapore (pop. dens. >6000, the next largest is Malta at 1282). It still looks
like popdens has a pretty extreme distribution, maybe log-transform it … ?

subset(mmi, popdens > 500, select = popdens)

##            popdens
## Bahrain      982.2
## Bangladesh  1083.7
## Hong Kong   6390.5
## Malta       1282.2
## Singapore   6682.3
## Taiwan       638.5

Select social surveys:

consmat <- as.matrix(subset(mmi, TRUE, select = c(Hofstede:Kashima)))

The variables are reasonably (but not very) strongly correlated (\( r^2 \approx 0.4-0.5 \)), and the variances differ enormously, with Hofstede having the largest:

print(var(consmat), digits = 2)

##          Hofstede   Suh Gelfand Kashima
## Hofstede    385.1 20.38   -6.52   -5.52
## Suh          20.4  2.22   -0.49   -0.35
## Gelfand      -6.5 -0.49    0.30    0.13
## Kashima      -5.5 -0.35    0.13    0.15

print(cor(consmat), digits = 2)

##          Hofstede  Suh Gelfand Kashima
## Hofstede     1.00  0.7   -0.61   -0.73
## Suh          0.70  1.0   -0.60   -0.60
## Gelfand     -0.61 -0.6    1.00    0.63
## Kashima     -0.73 -0.6    0.63    1.00

PCA on scaled data (correlations), with a scree plot:

(pca1 <- princomp(consmat, cor = TRUE))

## Call:
## princomp(x = consmat, cor = TRUE)
## 
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 
## 1.7124 0.6540 0.6318 0.4908 
## 
##  4  variables and  96 observations.

par(las = 1)
plot(pca1)

plot of chunk pca

mmi$PCA1 <- pca1$scores[, 1]
mmi <- mmi[order(mmi$PCA1), ]
c(head(rownames(mmi)), "...", tail(rownames(mmi)))

##  [1] "USA"           "Australia"     "Netherlands"   "Denmark"      
##  [5] "Canada"        "Sweden"        "..."           "Pakistan"     
##  [9] "Thailand"      "Rep. Of Korea" "Indonesia"     "China PRC"    
## [13] "Colombia"

## NB PCA1 is ordered from *least* to *most* 'conservative'

tmpf <- function(x) factor(x, levels = as.character(x))
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 2.15.1

theme_set(theme_bw())
ggplot(mmi, aes(x = PCA1, y = tmpf(Country), colour = World)) + geom_point() + 
    labs(x = "Conformity", y = "")

plot of chunk cplot

Regress scores on demog/social variables:

lm1 <- lm(PCA1 ~ Gini + popdens + GDPpercap + Life_expect, data = mmi)

mmi$dresids <- residuals(lm1)  ## should we standardize?
scatterplotMatrix(mmi2 <- subset(mmi, select = c(dresids, PCA1, Path_histor, 
    Path_contemp)), gap = 0)

plot of chunk residplot1

An alternative enhanced-pairs-plot view: it would be nice to modify the function so that negative and positive correlations were distinguished (and possibly get rid of the significance stars … )

library(PerformanceAnalytics)

## Warning: package 'zoo' was built under R version 2.15.1

chart.Correlation(mmi2)

plot of chunk corplot

## chart.Correlation plots ABSOLUTE VALUE of correlations

I'm not sure what the strong correlation between PCA1 and the residuals means …

summary((lm1 <- lm(dresids ~ Path_contemp, data = mmi2)))

## 
## Call:
## lm(formula = dresids ~ Path_contemp, data = mmi2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.559 -0.654  0.166  0.715  3.857 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -1.1329     0.7030   -1.61     0.11
## Path_contemp   0.0360     0.0219    1.65     0.10
## 
## Residual standard error: 1.39 on 94 degrees of freedom
## Multiple R-squared: 0.028,   Adjusted R-squared: 0.0176 
## F-statistic: 2.71 on 1 and 94 DF,  p-value: 0.103

summary((lm2 <- lm(dresids ~ Path_histor, data = mmi2)))

## 
## Call:
## lm(formula = dresids ~ Path_histor, data = mmi2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.124 -0.801  0.116  0.858  3.599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.0135     0.1357   -0.10   0.9210   
## Path_histor   0.7182     0.2133    3.37   0.0011 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.33 on 94 degrees of freedom
## Multiple R-squared: 0.108,   Adjusted R-squared: 0.0981 
## F-statistic: 11.3 on 1 and 94 DF,  p-value: 0.0011

Somewhat to my surprise, there is still a strong relationship between historical pathogen index and the merged conformity index …

mmi2B <- with(mmi, data.frame(mmi2, World))
library(ggplot2)
library(grid)
zmargin <- theme(panel.margin = unit(0, "lines"))
theme_set(theme_bw())
library(reshape2)
library(RColorBrewer)
mmi3 <- melt(mmi2B, id.var = c(1:2, 5))
mmi4 <- melt(mmi3, id.var = 3:5)
mmi4 <- setNames(mmi4, c("World", "pred", "predval", "resp", "respval"))
ggplot(mmi4, aes(predval, respval)) + geom_point(aes(colour = World)) + geom_smooth(method = "loess") + 
    geom_smooth(method = "lm", colour = "red") + facet_grid(resp ~ pred, scale = "free") + 
    zmargin + scale_colour_brewer(palette = "Dark2")

plot of chunk lastplot

(tip from Steve Walker: try pcaMethods package for PCA with missing data [to install: source("http://bioconductor.org/biocLite.R"); biocLite("pcaMethods")])

Using quadratic fits of econ/demog to social indices

Admittedly starting to snoop now, but I'm concerned that there is a strong correlation between PCA1 and the residuals: does that mean that there was more signal that wasn't removed by the regressions, i.e. a nonlinear pattern?

lm1Q <- lm(PCA1 ~ poly(Gini, popdens, GDPpercap, Life_expect, degree = 2), data = mmi)
library(bbmle)

## Warning: package 'bbmle' was built under R version 2.15.1

## Loading required package: stats4

AICtab(lm1, lm1Q, weights = TRUE)

##      dAIC df weight
## lm1Q  0.0 16 1     
## lm1  17.9 3  <0.001

A bunch of machinery to construct a nice plot of the quadratic terms ordered by \( t \)-statistic (I should (a) scale these predictors to standardize them (b) look at the data more carefully to get a sense of the quadratic effects):

library(coefplot2)

## Loading required package: coda

## Loading required package: lattice

cc <- coeftab(lm1Q)
## manipulate unwieldy poly(...,2) var names
tfun <- function(cstr) {
    vars <- strsplit(gsub("poly\\(([^\\)]+)\\).*", "\\1", cstr[2]), "[, ]")
    vars <- vars[[1]][c(1, 3, 5, 7)]  ## hack
    cstr1 <- gsub("^.*\\)", "", cstr)
    c("Intercept", sapply(cstr1[-1], function(x) {
        y <- strsplit(x, "\\.")[[1]]
        y[y == "0"] <- ""
        y[y == "1"] <- vars[y == "1"]
        y[y == "2"] <- paste0(vars[y == "2"], "^2")
        paste(y[y != ""], collapse = "*")
    }))
}
tvals <- coef(summary(lm1Q))[, "t value"]
rownames(cc) <- tfun(rownames(cc))
cc <- cc[order(abs(tvals)), ]
cc <- cc[rownames(cc) != "Intercept", ]
coefplot2(list(cc), intercept = TRUE)

plot of chunk qcoefs

mmi$dresidsQ <- residuals(lm1Q)
scatterplotMatrix(mmi5 <- subset(mmi, select = c(dresidsQ, PCA1, Path_histor, 
    Path_contemp)), gap = 0)

plot of chunk resids2

summary((lm1rQ <- lm(dresidsQ ~ Path_contemp, data = mmi5)))

## 
## Call:
## lm(formula = dresidsQ ~ Path_contemp, data = mmi5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9478 -0.7311  0.0482  0.4972  2.9260 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.7275     0.5540   -1.31     0.19
## Path_contemp   0.0231     0.0173    1.34     0.18
## 
## Residual standard error: 1.09 on 94 degrees of freedom
## Multiple R-squared: 0.0188,  Adjusted R-squared: 0.00832 
## F-statistic:  1.8 on 1 and 94 DF,  p-value: 0.183

summary((lm2rQ <- lm(dresidsQ ~ Path_histor, data = mmi5)))

## 
## Call:
## lm(formula = dresidsQ ~ Path_histor, data = mmi5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8112 -0.7489 -0.0547  0.5287  2.8037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.00612    0.11064   -0.06    0.956  
## Path_histor  0.32543    0.17389    1.87    0.064 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.08 on 94 degrees of freedom
## Multiple R-squared: 0.0359,  Adjusted R-squared: 0.0257 
## F-statistic:  3.5 on 1 and 94 DF,  p-value: 0.0644

There's still a weak signal left.