DATA 110 – PROJECT 2

Democratization and Infectious Disease

Introduction and Background

In their 2009 article, “Parasites, democratization, and the liberalization of values across contemporary countries,” Randy Thornhill, Corey Fincher, and Devaraj Aran hypothesize that “…the variation in values pertaining to autocracy-democracy arises fundamentally out of human (Homo Sapiens) species-typical psychological adaptation that manifests contingently, producing values and associated behaviors that functioned adaptively in human evolutionary history to cope with local levels of infectious diseases.” (Thornhill, Fincher, and Aran, 2009)

In that article, the authors argue that “…the risk of infectious disease…is a cause affecting global variation in three central aspects of democratization: (1) the willingness of powerful people to extend economic and social resources and opportunities outside their own kin or ethnic group, and encourage political involvement of the populace; (2) the validity of rank/authority, as perceived by the general population, and thus the authoritarian—anti-authoritarian dimension; and (3) attitudes about non-traditional ideas and ways of life that determine whether innovation occurs as well as whether innovation diffuses within and across geopolitical boundaries. … the empirical implication is that the degree of democratization should increase as disease prevalence decreases across the countries of the world.” (Thornhill, Fincher, and Aran, 2009)

This hypothesis is controversial. Scholars have long posited that other factors, such as economic development, modernization, and the distribution of resources and political power, are the true determinants of political systems; Thornhill et al. argue that these factors are salient components of democratization, but not its causes. In 2013, Damian Murray, Mark Schaller, and Peter Suedfeld attempted to test the parasite-stress hypothesis while “statistically controlling for other threats to human welfare.” (Murray, Schaller, and Suedfeld, 2013) They ran two studies. The first examined not just the relationship between state governance systems and infection rates, but also the relationship between those variables and the authoritarian attitudes of the people in each country. The second introduced an additional variable and a statistical mediation test to determine whether individuals’ attitudes influenced or were influenced by the government system. They determined that “…the ecological prevalence of infectious diseases predicts the individual authoritarian personalities of people living within that ecological region, and these individual-level dispositions in turn give rise to (and sustain) authoritarian systems of government.” (Murray, Schaller, and Suedfeld, 2013)

Other scholars reject this hypothesis. In their 2018 article, “Parasites and politics: why cross-cultural studies must control for relatedness, proximity and covariation,” Lindell Bromham, Xia Hua, Marcel Cardillo, Hilde Schneemann, and Simon Greenhill dismiss it completely, arguing that most analyses that purport to show that infection rate corresponds with political system “fail to account for one or more sources of statistical non-independence inherent in large observational datasets, which can lead to spurious relationships between traits and environments.” (Bromham et al., 2018) Thomas Currie and Ruth Mace succinctly point out the greatest flaw in the development of this hypothesis: “Because of their historical relationships, countries (F&T’s unit of analysis) cannot be considered as independent for the purposes of statistical analysis.” (Currie and Mace, 2012)

As a historian of Russia, I have long been fascinated by the question of what has led Russia to develop an autocratic system so at odds with most of the rest of Europe. There is no shortage of theories, including one tongue-in-cheek analysis of the relationship between the type of liquor a country consumes and the harshness of its government, but there are no definitive answers. I am intrigued by Thornhill’s hypothesis, but doubtful that the answer to what causes the formation of a particular system could be so simple.

Using a small dataset, it is possible to identify some of the weaknesses in the parasite-stress hypothesis.

Data and Visualization

This dataset, from the Global Infectious Diseases and Epidemiology Network (GIDEON), enables the examination of the parasite-stress hypothesis of democratic or authoritarian political development. It includes the country’s name, its income group, its democracy score, and its infection rate.

library(tidyverse)  # load tidyverse to manipulate data
library(ggthemes)  # to use themes
library(RColorBrewer)  # to use ColorBrewer palettes
setwd("~/Desktop/DATA 110")
gideon_data <- read_csv("disease_democ.csv")  # import data into variable "gideon_data"
head(gideon_data)  # examine top 6 rows of data to make sure it loaded correctly
## # A tibble: 6 x 4
##   country      income_group          democ_score infect_rate
##   <chr>        <chr>                       <dbl>       <dbl>
## 1 Bahrain      High income: non-OECD        45.6          23
## 2 Bahamas, The High income: non-OECD        48.4          24
## 3 Qatar        High income: non-OECD        50.4          24
## 4 Latvia       High income: non-OECD        52.8          25
## 5 Barbados     High income: non-OECD        46            26
## 6 Singapore    High income: non-OECD        64            26

The data is structured as follows:

str(gideon_data)
## tibble [168 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country     : chr [1:168] "Bahrain" "Bahamas, The" "Qatar" "Latvia" ...
##  $ income_group: chr [1:168] "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" ...
##  $ democ_score : num [1:168] 45.6 48.4 50.4 52.8 46 64 65.8 70.6 57.6 40.6 ...
##  $ infect_rate : num [1:168] 23 24 24 25 26 26 26 26 27 28 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   income_group = col_character(),
##   ..   democ_score = col_double(),
##   ..   infect_rate = col_double()
##   .. )

This data appears to be clean and tidy, with four variables and 168 observations. A check for NAs confirms that no variable has missing entries.

# get the total number of NAs in the data
sum(is.na(gideon_data))
## [1] 0
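
A single total can hide where any gaps fall, so a per-column count is a useful companion check. The sketch below rebuilds the six rows printed by head() above as a small data frame (the name head_rows is a hypothetical stand-in, not part of the original analysis) and counts missing values column by column with base R's colSums().

```r
# Hypothetical stand-in: the six rows shown by head(gideon_data) above
head_rows <- data.frame(
  country = c("Bahrain", "Bahamas, The", "Qatar", "Latvia", "Barbados", "Singapore"),
  income_group = rep("High income: non-OECD", 6),
  democ_score = c(45.6, 48.4, 50.4, 52.8, 46, 64),
  infect_rate = c(23, 24, 24, 25, 26, 26)
)

# Count missing values column by column instead of one grand total
na_per_column <- colSums(is.na(head_rows))
na_per_column
```

On the full dataset, the equivalent call would be colSums(is.na(gideon_data)).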

The income group variable is divided into five categories: “High income: non-OECD”, “High income: OECD”, “Low income”, “Lower middle income”, “Upper middle income”.

(Note: the OECD is “The Organization for Economic Co-operation and Development…an intergovernmental economic organization with 37 member countries, founded in 1961 to stimulate economic progress and world trade. It is a forum of countries describing themselves as committed to democracy and the market economy, providing a platform to compare policy experiences, seek answers to common problems, identify good practices and coordinate domestic and international policies of its members. Generally, OECD members are high-income economies with a very high Human Development Index (HDI) and are regarded as developed countries. As of 2017, the OECD member countries collectively comprised 62.2% of global nominal GDP ($49.6 trillion) and 42.8% of global GDP ($54.2 trillion) at purchasing power parity. The OECD is an official United Nations observer.” Source: Wikipedia.)

# use unique to find the categories included under "income_group"
unique(gideon_data$income_group)
## [1] "High income: non-OECD" "High income: OECD"     "Low income"           
## [4] "Lower middle income"   "Upper middle income"

Based upon the top six rows of the data (above), it appears that infection rate is a whole number, probably indicating the number of people infected per some fixed population (most likely 100 or 1,000). More generally, it seems that the higher the number, the greater the impact of disease on the population. Similarly, the democracy score is a decimal number, with a higher score reflecting a higher degree of democratization. The maximum and minimum of each variable are as follows:

# use max and min to identify highest and lowest democracy score and infection rate
max(gideon_data$democ_score)
## [1] 86.6
min(gideon_data$democ_score)
## [1] 15.8
max(gideon_data$infect_rate)
## [1] 48
min(gideon_data$infect_rate)
## [1] 23

Given these variables, it is possible to examine the relationship between democracy and infection rate, and democracy and income group, and infection rate and income group. This allows a very basic examination of the hypothesis that infection rate causes democratization or authoritarianism.

First, a summary of the democracy score and infection rate (the five-number summary plus the mean) will reveal some basic information about those indicators.

democracy <- gideon_data$democ_score
infection <- gideon_data$infect_rate
summary(democracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.80   28.40   38.40   42.78   52.65   86.60
summary(infection)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.00   27.00   32.00   33.33   39.00   48.00
mediandemocracy <- median(democracy)  # 38.4, computed rather than hard-coded
medianinfection <- median(infection)  # 32

This gives a basic understanding of the shape of the data. Simple histograms show the same distributions graphically.

ggplot(gideon_data) +
  geom_histogram(aes(x = democ_score), binwidth = 1, fill = "blue") +
  labs(title = "Distribution of Countries by Democracy Score", x = "Democracy Score", y = "Number of Countries") +
  theme_solarized()

ggplot(gideon_data) +
  geom_histogram(aes(x = infect_rate), binwidth = 1, fill = "blue") +
  labs(title = "Distribution of Countries by Infection Rate", x = "Infection Rate", y = "Number of Countries") +
  theme_solarized()

We can also see how many countries are in each income group.

ggplot(gideon_data) +
  geom_bar(aes(x = income_group), fill = "blue") +
  labs(title = "Distribution of Countries by Income Group", x = "Income Group", y = "Number of Countries") +
  theme_solarized()

A simple check of the parasite-stress hypothesis would be a scatterplot checking for a relationship between democracy score and infection rate.

ggplot(gideon_data, aes(x = democ_score, y = infect_rate)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(color = "red") +  # default smoother for a dataset this small is LOESS
  geom_vline(xintercept = mediandemocracy, size = 1, color = "black") +
  # annotate() draws each label once rather than once per row of data
  annotate("text", x = mediandemocracy + 12, y = 47, label = "Median Democracy Score\n (38.4)") +
  geom_hline(yintercept = medianinfection, size = 1, color = "black") +
  annotate("text", x = 70, y = 35, label = "Median Infection Rate\n (32.0)") +
  labs(title = "Relationship between Infection Rate and Democracy Score", x = "Democracy Score", y = "Infection Rate") +
  theme_solarized()

With a LOESS smoother added (in red), it appears there is a relationship, but it is not strongly linear (it looks more like a curve). Nonetheless, the rough relationship is that the higher the infection rate, the lower the democracy score, and vice versa.

Indeed, the correlation coefficient confirms a negative relationship.

cor(gideon_data$democ_score, gideon_data$infect_rate) # check the correlation of democracy score and infection rate
## [1] -0.6664911

The correlation coefficient for democracy score and infection rate is -0.6664911, which indicates a moderate negative correlation between those two factors.

A linear regression model will provide more information.

fit1 <- lm(formula = gideon_data$democ_score~gideon_data$infect_rate)
summary(fit1)
## 
## Call:
## lm(formula = gideon_data$democ_score ~ gideon_data$infect_rate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.838  -9.689  -1.512   7.775  31.763 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             104.4458     5.4627   19.12   <2e-16 ***
## gideon_data$infect_rate  -1.8503     0.1606  -11.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.08 on 166 degrees of freedom
## Multiple R-squared:  0.4442, Adjusted R-squared:  0.4409 
## F-statistic: 132.7 on 1 and 166 DF,  p-value: < 2.2e-16

This model predicts that for each one-unit increase in infection rate, the democracy score drops by 1.8503 points. The p-value is very low and is thus considered “statistically significant,” but the adjusted R-squared indicates that the model explains only about 44% of the variation in the democracy score in this data. That means about 56% of the variation is not explained by this model.
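
To make the coefficients concrete, the following sketch plugs the reported estimates into the fitted line and checks that, for a one-predictor model, R-squared is simply the squared correlation coefficient. The numbers are copied from the output above; the helper name predict_democ is hypothetical and introduced only for illustration.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- 104.4458
slope <- -1.8503

# Hypothetical helper: predicted democracy score at a given infection rate
predict_democ <- function(infect_rate) intercept + slope * infect_rate

# Predictions at the lowest (23) and highest (48) infection rates in the data
pred_low <- predict_democ(23)
pred_high <- predict_democ(48)

# In a one-predictor regression, R-squared equals the squared correlation
r <- -0.6664911
r_squared <- r^2

c(pred_low = pred_low, pred_high = pred_high, r_squared = r_squared)
```

The fitted line predicts a democracy score of about 61.9 at the lowest observed infection rate and about 15.6 at the highest, and the squared correlation rounds to the 0.4442 reported as Multiple R-squared.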

Another hypothesis is that income is a greater predictor of democracy. Examining the democracy score by income group reveals the following:

ggplot(data = gideon_data) +
  geom_histogram(mapping = aes(x = democ_score), fill = "blue") +
  labs(title = "Democracy Score by Country and Income Group", x = "Democracy Score", y = "Number of Countries") +
  facet_wrap(~income_group) +
  theme_linedraw()

There definitely appears to be a relationship between democracy score and income level. The exception appears to be high income, non-OECD countries. By filtering we can discover what those countries are.

high_income_nonOECD <- gideon_data %>%
  filter(income_group == "High income: non-OECD")

high_income_nonOECD
## # A tibble: 16 x 4
##    country              income_group          democ_score infect_rate
##    <chr>                <chr>                       <dbl>       <dbl>
##  1 Bahrain              High income: non-OECD        45.6          23
##  2 Bahamas, The         High income: non-OECD        48.4          24
##  3 Qatar                High income: non-OECD        50.4          24
##  4 Latvia               High income: non-OECD        52.8          25
##  5 Barbados             High income: non-OECD        46            26
##  6 Singapore            High income: non-OECD        64            26
##  7 Cyprus               High income: non-OECD        65.8          26
##  8 Malta                High income: non-OECD        70.6          26
##  9 Croatia              High income: non-OECD        57.6          27
## 10 United Arab Emirates High income: non-OECD        40.6          28
## 11 Trinidad and Tobago  High income: non-OECD        46.6          28
## 12 Kuwait               High income: non-OECD        49.6          28
## 13 Taiwan               High income: non-OECD        77.6          29
## 14 Oman                 High income: non-OECD        33            35
## 15 Equatorial Guinea    High income: non-OECD        28.4          36
## 16 Saudi Arabia         High income: non-OECD        40            37

A quick scatterplot will reveal the relationship (or lack of relationship) between democracy and infection rate among these countries.

ggplot(high_income_nonOECD, aes(x = democ_score, y = infect_rate)) +
  geom_point(color = "red", alpha = 0.5) +
  geom_smooth(aes(color = "LOESS")) +
  geom_smooth(method = 'lm', formula = y~x, aes(color = "Linear Regression")) +
  geom_vline(xintercept = mediandemocracy, size = 1, color = "black") +
  # annotate() draws each label once rather than once per row of data
  annotate("text", x = mediandemocracy + 12, y = 47, label = "Median Democracy Score\n (38.4)") +
  geom_hline(yintercept = medianinfection, size = 1, color = "black") +
  annotate("text", x = 70, y = 35, label = "Median Infection Rate\n (32.0)") +
  labs(title = "Infection and Democracy in High Income non-OECD Countries", x = "Democracy Score", y = "Infection Rate") +
  scale_colour_manual(name = "lines", values = c("red", "blue")) +
  theme_solarized()

The blue line represents the LOESS smoother, which is the curve of best fit without assuming the data has some particular shape. In contrast, the red line is the linear regression line. The contrast between the two suggests there is not a strong linear relationship between the two variables. The correlation coefficient confirms this.

cor(high_income_nonOECD$democ_score, high_income_nonOECD$infect_rate)
## [1] -0.5060138
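
With only sixteen countries, a point estimate alone says little about how reliable the correlation is. As a rough sketch, the values below are typed in from the high-income non-OECD table printed earlier, and base R's cor.test() adds a p-value and a 95% confidence interval to the same Pearson coefficient. The vector names are stand-ins for the corresponding columns of high_income_nonOECD.

```r
# The sixteen high-income non-OECD rows printed in the table above
democ_score <- c(45.6, 48.4, 50.4, 52.8, 46, 64, 65.8, 70.6,
                 57.6, 40.6, 46.6, 49.6, 77.6, 33, 28.4, 40)
infect_rate <- c(23, 24, 24, 25, 26, 26, 26, 26,
                 27, 28, 28, 28, 29, 35, 36, 37)

# Pearson correlation with a significance test and 95% confidence interval
ct <- cor.test(democ_score, infect_rate)
r_estimate <- unname(ct$estimate)

r_estimate    # should agree with the cor() output above
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the correlation
```

A wide confidence interval here would be another reason to treat the subgroup result cautiously.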

At -0.5060138, the correlation coefficient reveals a weaker correlation than among all countries. Filtering by infection rate (selecting only those countries with an infection rate below the median) enables a closer examination of the democratization pattern of countries with a low infection rate.

high_income_low_infection <- high_income_nonOECD %>% filter(infect_rate < 32.0)
high_income_low_infection
## # A tibble: 13 x 4
##    country              income_group          democ_score infect_rate
##    <chr>                <chr>                       <dbl>       <dbl>
##  1 Bahrain              High income: non-OECD        45.6          23
##  2 Bahamas, The         High income: non-OECD        48.4          24
##  3 Qatar                High income: non-OECD        50.4          24
##  4 Latvia               High income: non-OECD        52.8          25
##  5 Barbados             High income: non-OECD        46            26
##  6 Singapore            High income: non-OECD        64            26
##  7 Cyprus               High income: non-OECD        65.8          26
##  8 Malta                High income: non-OECD        70.6          26
##  9 Croatia              High income: non-OECD        57.6          27
## 10 United Arab Emirates High income: non-OECD        40.6          28
## 11 Trinidad and Tobago  High income: non-OECD        46.6          28
## 12 Kuwait               High income: non-OECD        49.6          28
## 13 Taiwan               High income: non-OECD        77.6          29

To examine these countries more closely, plotly makes it possible to hover over the points and see which country each one represents.

library(plotly)
interactive_plot1 <- ggplot(mapping = aes(x = high_income_low_infection$democ_score, y =  high_income_low_infection$infect_rate, text = paste("country" = high_income_low_infection$country))) +
  geom_point(color = "red")+
  geom_smooth(color = "blue")+
  geom_vline(xintercept = mediandemocracy, size = 1, color = "black")+
  geom_text(aes(x = mediandemocracy + 8, y = 47, label = paste("Median Democracy Score\n (38.4)")))+
  labs(title = "Infection and Democracy in High Income non-OECD Countries", x = "Democracy Score", y= "Infection Rate")+
  theme_solarized()
interactive_plot1 <- ggplotly(interactive_plot1)
interactive_plot1

While all high income, non-OECD countries with below-median levels of infection are above the median for democracy score, there appears to be no relationship between the infection rate and the democracy score. (The linear regression model below confirms this.) These represent only thirteen of the 168 countries in this dataset, but that is a non-trivial share (about 8%) of the total data.

fit3 <- lm(formula = high_income_low_infection$democ_score~high_income_low_infection$infect_rate)
summary(fit3)
## 
## Call:
## lm(formula = high_income_low_infection$democ_score ~ high_income_low_infection$infect_rate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.549  -8.549  -1.026   9.212  17.770 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             11.091     46.809   0.237    0.817
## high_income_low_infection$infect_rate    1.681      1.786   0.941    0.367
## 
## Residual standard error: 11.25 on 11 degrees of freedom
## Multiple R-squared:  0.07452,    Adjusted R-squared:  -0.009618 
## F-statistic: 0.8857 on 1 and 11 DF,  p-value: 0.3669
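
The slope and intercept in this output can be reproduced by hand, which helps show what lm() is doing. The sketch below types in the thirteen rows from the table above and applies the closed-form least-squares formulas (slope as covariance over variance); the vector names and fit_check are illustrative stand-ins, not objects from the original analysis.

```r
# The thirteen low-infection rows printed in the table above
democ_score <- c(45.6, 48.4, 50.4, 52.8, 46, 64, 65.8, 70.6,
                 57.6, 40.6, 46.6, 49.6, 77.6)
infect_rate <- c(23, 24, 24, 25, 26, 26, 26, 26, 27, 28, 28, 28, 29)

# Closed-form least squares: slope = cov(x, y) / var(x)
slope <- cov(infect_rate, democ_score) / var(infect_rate)
intercept <- mean(democ_score) - slope * mean(infect_rate)

# The same fit via lm(), for comparison
fit_check <- lm(democ_score ~ infect_rate)

rbind(closed_form = c(intercept, slope), lm = unname(coef(fit_check)))
```

Both rows should agree with the estimates reported above (an intercept near 11.09 and a slope near 1.68). The slope is positive, but its standard error is larger than the estimate itself, so it is indistinguishable from zero.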

Expanding this exploration to all countries with below-median levels of infection reveals the following relationship between infection rate and democracy score:

low_infection <- gideon_data %>% filter(infect_rate < 32.0) # filter for countries below median
interactive_plot2 <- ggplot(mapping = aes(x = low_infection$democ_score, y =  low_infection$infect_rate, text = paste("country" = low_infection$country))) +
  geom_point(color = "red", alpha = 0.5)+
  geom_smooth(color = "blue")+
  geom_vline(xintercept = mediandemocracy, size = 1, color = "black")+
  geom_text(aes(x = mediandemocracy + 12, y = 47, label = paste("Median Democracy Score\n (38.4)")))+
  labs(title = "Infection and Democracy Below-Median Infection Countries", x = "Democracy Score", y= "Infection Rate")+
  theme_solarized()
interactive_plot2 <- ggplotly(interactive_plot2)
interactive_plot2

With the hypothesis that parasite-stress causes lower democracy scores, we would expect to see these countries, with below-median parasite-stress, clustered toward the right—the higher democracy score region of the plot. That is not what this plot shows. While most points are indeed above the median, there remain several that are below.

There appears to be only a very weak relationship between infection rate and democracy score in countries with low infection rates. Applying a LOESS smoother and linear regression line will make this clearer. In this plot, the countries, their income groups, and the relationship between their infection rate and democracy score are evident.

plot2 <- ggplot(mapping = aes(x = low_infection$democ_score, y =  low_infection$infect_rate, color = low_infection$income_group)) +
  geom_point()+
  geom_smooth(color = "purple")+
  geom_smooth(method = 'lm', formula = y~x, color = 'red')+
  geom_vline(xintercept = mediandemocracy, size = 1, color = "black")+
  geom_text(aes(x = mediandemocracy + 15, y = 47, label = paste("Median Democracy Score\n (38.4)")))+
  labs(title = "Infection and Democracy in Below-Median Infection Countries", x = "Democracy Score", y= "Infection Rate", color = "Income Group")+
  theme_solarized()
plot2

The correlation coefficient confirms this.

cor(low_infection$democ_score, low_infection$infect_rate)
## [1] -0.3867273

The magnitude of this correlation coefficient is below 0.5, so the correlation is weak.

fit2 <- lm(formula = low_infection$democ_score~low_infection$infect_rate)
summary(fit2)
## 
## Call:
## lm(formula = low_infection$democ_score ~ low_infection$infect_rate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.25 -14.28  -1.59  14.68  31.62 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               149.2608    25.7501   5.797 1.46e-07 ***
## low_infection$infect_rate  -3.4717     0.9496  -3.656  0.00047 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.22 on 76 degrees of freedom
## Multiple R-squared:  0.1496, Adjusted R-squared:  0.1384 
## F-statistic: 13.37 on 1 and 76 DF,  p-value: 0.0004697

The adjusted R-squared for a linear regression model of the relationship between infection rate and democracy score is a very low 0.1384. The model explains only about 14 percent of the variation in democracy score. Something else is probably responsible for the low democracy scores in these countries.
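
For reference, adjusted R-squared is R-squared penalized for the number of predictors relative to the sample size. The sketch below applies the standard formula to the values reported in the three model summaries in this report; the function name adj_r_squared is hypothetical, and each n is taken as the reported residual degrees of freedom plus two.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
adj_r_squared <- function(r_squared, n, p = 1) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values copied from the model summaries in this report
adj_all <- adj_r_squared(0.4442, n = 168)          # all countries (fit1)
adj_low <- adj_r_squared(0.1496, n = 78)           # below-median infection (fit2)
adj_high_income <- adj_r_squared(0.07452, n = 13)  # high-income, low-infection (fit3)

round(c(adj_all, adj_low, adj_high_income), 4)
```

The penalty barely matters with 168 countries but is large enough with thirteen to push the adjusted value below zero, which is why the smallest model reports a negative adjusted R-squared.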

Conclusion

The study above uses only a tiny fraction of the data that Thornhill and others examined in order to develop and challenge the parasite-stress hypothesis. As a result, it is insufficient to refute that thesis. Still, the lack of a clear, strong relationship between infection rate and democracy score when only one other variable (income) is considered does raise questions about the accuracy of the hypothesis. Given that an examination of so few variables could raise questions, it would be unsurprising if looking at more of the multitude of factors that make up countries’ nature raised even more. The states by which Thornhill breaks down his analysis are relatively recent creations. As a result, Currie and Mace’s observation that states cannot be treated as independent units for the purpose of statistical analysis rings true. They are somewhat artificial and arbitrary units when considered within the history of human and disease evolution.

Nonetheless, the analysis of the dataset above was useful, particularly the creation of data visualizations. The initial visualizations lacked the median lines, and that led me to misjudge how the points were distributed. For example, I did not notice that all of the points in the plot of high-income non-OECD countries with below-median infection rates had above-median democracy scores. Placing those lines encouraged me to go back and look at the statistical models to double-check my first impressions. The data supported Thornhill’s conclusions more than I believed based upon my first-draft plots, but a closer look at the statistical analyses confirmed my doubts about his conclusions.

Finally, the dataset had some limitations that inhibited my analysis. Aside from only having four variables, one of those variables, the income group, was not as useful as it could have been had it been quantitative (actual GDP) rather than categorical. Because it was categorical, I could not run a simple linear regression of infection rate or democracy score on income as a continuous predictor. (Income group could enter a linear model as a categorical predictor, or serve as the outcome in a multinomial or ordinal logistic regression, but I am still learning how to do those.)

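In addition, I am still beset by a technical problem with plotly and smoother lines: the lines disappear when I run plotly. I believe the text aesthetic is responsible. A likely explanation is that text is a discrete variable in the plot-level mapping, so ggplot2 treats each country as its own group and geom_smooth is left with one point per group, too few to fit a line. Mapping text only in geom_point, or adding aes(group = 1) to geom_smooth, should restore the lines.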
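
One commonly suggested workaround for smoother lines vanishing when a text aesthetic is set at the plot level is to map text only in the point layer. The following is a minimal sketch under that assumption, using the sixteen high-income non-OECD rows printed earlier; demo_data, p, and interactive_plot are hypothetical names, and the behavior may vary with the installed ggplot2 and plotly versions.

```r
library(ggplot2)
library(plotly)

# The sixteen high-income non-OECD rows printed earlier in this report
demo_data <- data.frame(
  country = c("Bahrain", "Bahamas, The", "Qatar", "Latvia", "Barbados",
              "Singapore", "Cyprus", "Malta", "Croatia", "United Arab Emirates",
              "Trinidad and Tobago", "Kuwait", "Taiwan", "Oman",
              "Equatorial Guinea", "Saudi Arabia"),
  democ_score = c(45.6, 48.4, 50.4, 52.8, 46, 64, 65.8, 70.6,
                  57.6, 40.6, 46.6, 49.6, 77.6, 33, 28.4, 40),
  infect_rate = c(23, 24, 24, 25, 26, 26, 26, 26,
                  27, 28, 28, 28, 29, 35, 36, 37)
)

# text is mapped only in geom_point(), so it cannot split the smoother into
# one group per country; ggplot2 may warn about the unknown "text" aesthetic
p <- ggplot(demo_data, aes(x = democ_score, y = infect_rate)) +
  geom_point(aes(text = country), color = "red") +
  geom_smooth(method = "loess", formula = y ~ x, color = "blue") +
  labs(x = "Democracy Score", y = "Infection Rate")

# tooltip = "text" limits the hover label to the country name
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
```

Keeping text out of the plot-level mapping leaves the smoother with a single group containing all points, which is what it needs to fit a curve.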

Bibliography

Bromham, L., Hua, X., Cardillo, M., Schneemann, H., & Greenhill, S. J. (2018). Parasites and politics: why cross-cultural studies must control for relatedness, proximity and covariation. Royal Society Open Science, 5(8), 181100. https://doi.org/10.1098/rsos.181100

Currie, T., & Mace, R. (2012). Analyses do not support the parasite-stress theory of human sociality. Behavioral and Brain Sciences, 35, 83-85. https://doi.org/10.1017/S0140525X11000963

Murray, D. R., Schaller, M., & Suedfeld, P. (2013). Pathogens and Politics: Further Evidence that Parasite Prevalence Predicts Authoritarianism. PLoS ONE, 8(5). https://doi.org/10.1371/journal.pone.0062275

Thornhill, R., Fincher, C. L., & Aran, D. (2009). Parasites, democratization, and the liberalization of values across contemporary countries. Biological Reviews, 84(1), 113-131. https://doi.org/10.1111/j.1469-185X.2008.00062.x