# Re-analyzing Fincher and Thornhill

I'm finding myself quite skeptical of Fincher and Thornhill's paper “Pathogen prevalence predicts human cross-cultural variability in individualism/collectivism” (doi: 10.1098/rspb.2008.0094, 7 June 2008 vol. 275 no. 1640 1279-1285).

They say they took data from the CIA factbook, so I downloaded an
electronic version provided by google (warning: there's at least one bogus line in the file!)

baseurl <- "http://www.math.mcmaster.ca/bolker/misc/"

X <- read.csv(url(paste0(baseurl, "cia-data-all.csv")), stringsAsFactors = FALSE)
## find the names of the columns we want
## grep('(Gini|expectancy|GDP)',names(X),value=TRUE) extract relevant
## columns and fix names
CIAdat <- setNames(subset(X, select = c(Distribution.of.family.income...Gini.index,
Life.expectancy.at.birth, GDP...per.capita..PPP., Population, Area)), c("Gini",
"Life_expect", "GDPpercap", "Population", "Area"))
## make GDP values into numbers and compute pop density
CIAdat <- transform(CIAdat, GDPpercap = as.numeric(gsub("[ $]", "", GDPpercap)), popdens = Population/Area) CIAdat <- subset(CIAdat, select = c(Gini:GDPpercap, popdens))  A basic plot: library(car)  ## Warning: package 'car' was built under R version 2.15.1  ## Warning: package 'MASS' was built under R version 2.15.1  ## Warning: package 'nnet' was built under R version 2.15.1  scatterplotMatrix(CIAdat, gap = 0)  It may amuse you to check out the high-density outliers. I wonder if something got mangled? subset(CIAdat, popdens > 3000, select = popdens)  ## popdens ## Holy See (Vatican City) Inf ## Monaco 16482 ## Macau 19994 ## Gaza Strip 4311 ## Gibraltar 4005 ## Hong Kong 6390 ## Singapore 6682  Get Fincher & Thornhill data supplement: I took the PDF file, used pdftotext to convert to text, and edited the text file in emacs (lots of kill-rectangle and yank-rectangle) to put it into a R-friendly csv file. ftdata <- read.csv(url(paste0(baseurl,"fincher_thornhill.csv"))) rownames(ftdata) <- ftdata$Country
## check for name mismatches

## [1] "China PRC"     "Czech Rep."    "England"       "Nrth Ireland"
## [5] "Rep. Of Korea" "Trinidad"      "UAE"           "USA"

## setdiff(rownames(CIAdat),rownames(ftdata))
misspos1 <-match(missnames1,rownames(ftdata))
rownames(ftdata)[misspos1] <- c("China","Czech Republic",
"England", ## ?? United Kingdom ??
"Northern Ireland", ## ?? ditto ??
"Korea South",
"United Arab Emirates",
"United States")
## clean up a bit: subset() is fragile but clean
mm <- subset(mm,select=c(Country:GLOBE,Gini:popdens,
Path_histor:Kashima))
rownames(mm) <- mm$Country head(mm)  ## Country World GLOBE Gini Life_expect GDPpercap ## Albania Albania W._Eurasia E._Europe 26.7 77.96 6000 ## Argentina Argentina S._America Lat._America 49.0 76.56 14200 ## Australia Australia Ins._Pacific Anglo 30.5 81.63 38100 ## Austria Austria W._Eurasia Ger._Europe 26.0 79.50 39200 ## Bahrain Bahrain W._Eurasia <NA> NA 75.16 37200 ## Bangladesh Bangladesh E._Eurasia <NA> 33.2 60.25 1500 ## popdens Path_histor Path_contemp Hofstede Suh Gelfand Kashima ## Albania 126.598 NA 24 NA NA 5.74 NA ## Argentina 14.715 -0.07 37 46 4.80 5.51 1 ## Australia 2.747 -0.20 27 90 9.00 4.17 0 ## Austria 97.892 -0.72 26 55 6.75 4.85 0 ## Bahrain 982.166 0.15 23 NA 3.00 NA NA ## Bangladesh 1083.702 0.67 33 20 NA NA NA  library(car) scatterplotMatrix(ftdata[, -(1:3)], gap = 0)  Appropriate procedure: • (optional) recreate F&T plots and analyses to make sure we're OK so far • find PCA of conformity scores (possibly with imputation? there are only 34 countries with data for all four conformity scores) • regress the pooled conformity score on the CIA data (life expectancy+GDP+Gini+popdens) and take the residuals • see how well current or historical pathogen index predicts the residuals ## Filling in missing data We have three choices for handling the missing value scores: (1) discard any countries with any missing values (i.e., na.omit()); (2) crude imputation, replacing missing values by the mean for that variable; (3) more sophisticated imputation (e.g. using the mice package). Trying the middle option: fill_NA <- function(x, verbose = FALSE) { if ((totNA <- sum(is.na(x))) == 0 || !is.numeric(x)) return(x) if (verbose) message(paste("replacing", totNA, "NA values with mean")) x[is.na(x)] <- mean(x, na.rm = TRUE) x } mmi <- as.data.frame(lapply(mm, fill_NA)) rownames(mmi) <- rownames(mm) ## argh ## ? is there a slightly nicer way to do this? sapply coerces to a matrix  What does the scatterplot of demographic/economic variables look like with the merged set (i.e. retaining only those countries for which we have social & pathogen data)? scatterplotMatrix(subset(mmi, subset = popdens < 2000, select = c(Gini:popdens)), gap = 0)  The subset command excludes only Hong Kong and Singapore (pop. dens. >6000, the next largest is Malta at 1282). It still looks like popdens has a pretty extreme distribution, maybe log-transform it … ? subset(mmi, popdens > 500, select = popdens)  ## popdens ## Bahrain 982.2 ## Bangladesh 1083.7 ## Hong Kong 6390.5 ## Malta 1282.2 ## Singapore 6682.3 ## Taiwan 638.5  Select social surveys: consmat <- as.matrix(subset(mmi, TRUE, select = c(Hofstede:Kashima)))  The variables are reasonably (but not very) strongly correlated ($$r^2 \approx 0.4-0.5$$), and the variances differ enormously, with Hofstede having the largest: print(var(consmat), digits = 2)  ## Hofstede Suh Gelfand Kashima ## Hofstede 385.1 20.38 -6.52 -5.52 ## Suh 20.4 2.22 -0.49 -0.35 ## Gelfand -6.5 -0.49 0.30 0.13 ## Kashima -5.5 -0.35 0.13 0.15  print(cor(consmat), digits = 2)  ## Hofstede Suh Gelfand Kashima ## Hofstede 1.00 0.7 -0.61 -0.73 ## Suh 0.70 1.0 -0.60 -0.60 ## Gelfand -0.61 -0.6 1.00 0.63 ## Kashima -0.73 -0.6 0.63 1.00  PCA on scaled data (correlations), with a scree plot: (pca1 <- princomp(consmat, cor = TRUE))  ## Call: ## princomp(x = consmat, cor = TRUE) ## ## Standard deviations: ## Comp.1 Comp.2 Comp.3 Comp.4 ## 1.7124 0.6540 0.6318 0.4908 ## ## 4 variables and 96 observations.  par(las = 1) plot(pca1)  mmi$PCA1 <- pca1$scores[, 1] mmi <- mmi[order(mmi$PCA1), ]

##  [1] "USA"           "Australia"     "Netherlands"   "Denmark"
##  [5] "Canada"        "Sweden"        "..."           "Pakistan"
##  [9] "Thailand"      "Rep. Of Korea" "Indonesia"     "China PRC"
## [13] "Colombia"

## NB PCA1 is ordered from *least* to *most* 'conservative'

tmpf <- function(x) factor(x, levels = as.character(x))
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 2.15.1

theme_set(theme_bw())
ggplot(mmi, aes(x = PCA1, y = tmpf(Country), colour = World)) + geom_point() +
labs(x = "Conformity", y = "")


Regress scores on demog/social variables:

lm1 <- lm(PCA1 ~ Gini + popdens + GDPpercap + Life_expect, data = mmi)

mmi$dresids <- residuals(lm1) ## should we standardize? scatterplotMatrix(mmi2 <- subset(mmi, select = c(dresids, PCA1, Path_histor, Path_contemp)), gap = 0)  An alternative enhanced-pairs-plot view: it would be nice to modify the function so that negative and positive correlations were distinguished (and possibly get rid of the significance stars … ) library(PerformanceAnalytics)  ## Warning: package 'zoo' was built under R version 2.15.1  chart.Correlation(mmi2)  ## chart.Correlation plots ABSOLUTE VALUE of correlations  I'm not sure what the strong correlation between PCA1 and the residuals means … summary((lm1 <- lm(dresids ~ Path_contemp, data = mmi2)))  ## ## Call: ## lm(formula = dresids ~ Path_contemp, data = mmi2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.559 -0.654 0.166 0.715 3.857 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.1329 0.7030 -1.61 0.11 ## Path_contemp 0.0360 0.0219 1.65 0.10 ## ## Residual standard error: 1.39 on 94 degrees of freedom ## Multiple R-squared: 0.028, Adjusted R-squared: 0.0176 ## F-statistic: 2.71 on 1 and 94 DF, p-value: 0.103  summary((lm2 <- lm(dresids ~ Path_histor, data = mmi2)))  ## ## Call: ## lm(formula = dresids ~ Path_histor, data = mmi2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.124 -0.801 0.116 0.858 3.599 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.0135 0.1357 -0.10 0.9210 ## Path_histor 0.7182 0.2133 3.37 0.0011 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.33 on 94 degrees of freedom ## Multiple R-squared: 0.108, Adjusted R-squared: 0.0981 ## F-statistic: 11.3 on 1 and 94 DF, p-value: 0.0011  Somewhat to my surprise, there is still a strong relationship between historical pathogen index and the merged conformity index … mmi2B <- with(mmi, data.frame(mmi2, World)) library(ggplot2) library(grid) zmargin <- theme(panel.margin = unit(0, "lines")) theme_set(theme_bw()) library(reshape2) library(RColorBrewer) mmi3 <- melt(mmi2B, id.var = c(1:2, 5)) mmi4 <- melt(mmi3, id.var = 3:5) mmi4 <- setNames(mmi4, c("World", "pred", "predval", "resp", "respval")) ggplot(mmi4, aes(predval, respval)) + geom_point(aes(colour = World)) + geom_smooth(method = "loess") + geom_smooth(method = "lm", colour = "red") + facet_grid(resp ~ pred, scale = "free") + zmargin + scale_colour_brewer(palette = "Dark2")  (tip from Steve Walker: try pcaMethods package for PCA with missing data [to install: source("http://bioconductor.org/biocLite.R"); biocLite("pcaMethods")]) ## Using quadratic fits of econ/demog to social indices Admittedly starting to snoop now, but I'm concerned that there is a strong correlation between PCA1 and the residuals: does that mean that there was more signal that wasn't removed by the regressions, i.e. a nonlinear pattern? lm1Q <- lm(PCA1 ~ poly(Gini, popdens, GDPpercap, Life_expect, degree = 2), data = mmi) library(bbmle)  ## Warning: package 'bbmle' was built under R version 2.15.1  ## Loading required package: stats4  AICtab(lm1, lm1Q, weights = TRUE)  ## dAIC df weight ## lm1Q 0.0 16 1 ## lm1 17.9 3 <0.001  A bunch of machinery to construct a nice plot of the quadratic terms ordered by $$t$$-statistic (I should (a) scale these predictors to standardize them (b) look at the data more carefully to get a sense of the quadratic effects): library(coefplot2)  ## Loading required package: coda  ## Loading required package: lattice  cc <- coeftab(lm1Q) ## manipulate unwieldy poly(...,2) var names tfun <- function(cstr) { vars <- strsplit(gsub("poly\$$([^\$$]+)\\).*", "\\1", cstr[2]), "[, ]") vars <- vars[[1]][c(1, 3, 5, 7)] ## hack cstr1 <- gsub("^.*\\)", "", cstr) c("Intercept", sapply(cstr1[-1], function(x) { y <- strsplit(x, "\\.")[[1]] y[y == "0"] <- "" y[y == "1"] <- vars[y == "1"] y[y == "2"] <- paste0(vars[y == "2"], "^2") paste(y[y != ""], collapse = "*") })) } tvals <- coef(summary(lm1Q))[, "t value"] rownames(cc) <- tfun(rownames(cc)) cc <- cc[order(abs(tvals)), ] cc <- cc[rownames(cc) != "Intercept", ] coefplot2(list(cc), intercept = TRUE)  mmi$dresidsQ <- residuals(lm1Q)
scatterplotMatrix(mmi5 <- subset(mmi, select = c(dresidsQ, PCA1, Path_histor,
Path_contemp)), gap = 0)