Introduction

Project 2 Review

This final project is a continuation of my prior Project 2 where we looked at the connection between infectious disease and democratization in countries around the world. We used a dataset from 2003 with DEMOCRACY SCORES and INFECTIOUS DISEASE RATES for 168 countries to evaluate a hypothesis of two evolutionary biologists at the University of New Mexico. Their ideas briefly summarized here in New Scientist Magazine:

“The starting point for Thornhill and Fincher’s thinking is a basic human survival instinct: the desire to avoid illness. In a region where disease is rife, they argue, fear of contagion may cause people to avoid outsiders, who may be carrying a strain of infection to which they have no immunity. Such a mindset would tend to make a community as a whole xenophobic, and might also discourage interaction between the various groups within a society – the social classes, for instance – to prevent unnecessary contact that might spread disease. What is more, Thornhill and Fincher argue, it could encourage people to conform to social norms and to respect authority, since adventurous behavior may flout rules of conduct set in place to prevent contamination.”

From “Genes, germs and the origins of politics,” by Jim Giles, in 18 May 2011, New Scientist Magazine, San Francisco CA

Through statistical analysis of their 2003 dataset we were able to confirm a strong correlation between infectious disease and democracy scores. However we were suspicious that other variables, such as economic development, might play an equally important role in democratization.

Objective 1 - Economic Development and Democratization

In this final project we plan to explore the connection between economic development and democratization by merging historical GDP data from the World Bank with the original 2003 data set used by Thornhill/Fincher above. We will evaluate the correlation between PER CAPITA GDP and democracy scores and compare the significance with our original findings from Project 2.

Objective 2 - Recent Data on Disease and Democratization

We will also include an analysis of two more recent data sets on disease (BURDEN OF DISEASE) and democracy (DEMOCRACY INDEX); perform correlation analysis and linear regression on these new data sets; and compare the results to the original 2003 dataset.

First we’ll start off by calling packages and importing datasets:

Call Packages and Read Datasets

getwd()
## [1] "/Users/h0age/Documents/data110/Project 2"
setwd("/Users/h0age/Documents/data110/data_sets")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(shiny)
library(urltools)
library(ggplot2)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggrepel)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(dplyr)
library(RColorBrewer)
library(ggthemes)
library(ggrepel)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(tmap)
library(tmaptools)
library(leaflet)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(htmlwidgets)
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
library(leaflet.extras)
library(rio)
library(sp)


demosick <- read_csv("disease_democ.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   country = col_character(),
##   income_group = col_character(),
##   democ_score = col_double(),
##   infect_rate = col_double()
## )
GDP2003 <- read_csv("World Bank 2003 GDP.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `2003` = col_double()
## )
BurdenComm <- read_csv("burdencomm.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Entity = col_character(),
##   Code = col_character(),
##   Year = col_double(),
##   DALYs = col_double()
## )
DemoIndex <- read_csv("DemoIndex.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   geo = col_character(),
##   name = col_character(),
##   time = col_double(),
##   `Democracy index (EIU)` = col_double(),
##   `Electoral pluralism index (EIU)` = col_double(),
##   `Government index (EIU)` = col_double(),
##   `Political participation index(EIU)` = col_double(),
##   `Political culture index (EIU)` = col_double(),
##   `Civil liberties index (EIU)` = col_double(),
##   `Change in democracy index (EIU)` = col_double()
## )
PC2003 <- read_csv("percapita2003.csv")
## Warning: Missing column names filled in: 'X3' [3]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `Data Source` = col_character(),
##   `World Development Indicators` = col_character(),
##   X3 = col_double()
## )

2003 Dataset - Regression and Correlation

Let us review the results of the linear regression and correlation analysis we performed on the 2003 dataset in Project 2.

byinfection <- demosick %>%
  arrange(infect_rate)

2003 Data - Linear Regression

byinfection %>% ggplot(aes(x = democ_score, y = infect_rate, color = income_group)) +
  ggtitle("Is There a Relationship Between Disease and Democracy?") + ## plot title
  xlab("Democracy Score") + ## x axis label
  ylab("Infectious Disease Score") + ## y axis labell
  ##geom_point(show.legend = FALSE, alpha=0.4)  + ## calling scatter plot
  geom_point(alpha=0.4)  + ## calling scatter plot
  geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
  ##geom_smooth(method='lm',formula=y~x) + ## how to get the income group parameter out??
  labs(color='Country Income Group') +
  scale_fill_discrete(name = "Country's Income Group") +  ## label for the legend 
  scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
  theme_light() ## add new theme
## Warning: Removed 67 rows containing non-finite values (stat_smooth).
## Warning: Removed 67 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_smooth).

2003 Data - Correlation Analysis

cor(byinfection$infect_rate, byinfection$democ_score) ## check correlation 
## [1] -0.6664911
model <- lm(democ_score ~ infect_rate, data = byinfection) 
summary(model) ## view summary
## 
## Call:
## lm(formula = democ_score ~ infect_rate, data = byinfection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.838  -9.689  -1.512   7.775  31.763 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.4458     5.4627   19.12   <2e-16 ***
## infect_rate  -1.8503     0.1606  -11.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.08 on 166 degrees of freedom
## Multiple R-squared:  0.4442, Adjusted R-squared:  0.4409 
## F-statistic: 132.7 on 1 and 166 DF,  p-value: < 2.2e-16

Comment | 2003 Data - Correlation Analysis

With Infection Rate (x) as the explanatory variable and Democracy Score (y) as response variable, our linear regression plot shows us clear visible evidence of a negative correlation between infectious disease rates and democracy scores. Correlation analysis provides further evidence for this, producing a negative correlation coefficient of -0.667, where values of 1 indicate a strong positive relationship, values of -1 a strong negative relationship, and values of zero indicating no relationship at all.

The column for Pr(>|t|) showing how useful the predictor is to the model is <2e-16 ***. Three asterisks indicating that the predictor has a significant impact on the model. This reinforces conclusions of Thornhill et al that there is a negative correlation between infection rates and democracy.

2003 Dataset - Add World Bank Percapita GDP data

In order to test the effect of economic development on democratization, let us bring in a new variable: PerCapita GDP data from 2003 via the World Bank. In the plots below PerCapita GDP will replace Infectious Disease as our explanatory variable in relation to the Democracy Index variable, our response variable.

Clean Percapita GDP Data

GDP <- GDP2003 %>%
  rename("country" = "Country Name", "gdp" = "2003")
demosickGDP <- left_join(demosick, GDP, by="country") ## 

Merge percapita GDP with with 2003 dataset

PerCap <- PC2003 %>%
  rename("country" = "Data Source", "PerCapitaGDP" = "X3")
PCdemosick <- left_join(demosick, PerCap, by="country")

Remove NA values

PCdemosick_na_omit <- na.omit(PCdemosick)

Round Percapita GDP Values to remove decimals

PCdemosick$percapita <- round(PCdemosick$PerCapitaGDP, 0)

PCdemosick %>%
mutate(percapita = round(PerCapitaGDP, 0))
## # A tibble: 168 x 7
##    country   income_group democ_score infect_rate `World Developme… PerCapitaGDP
##    <chr>     <chr>              <dbl>       <dbl> <chr>                    <dbl>
##  1 Bahrain   High income…        45.6          23 BHR                     14222.
##  2 Bahamas,… High income…        48.4          24 BHS                     28327.
##  3 Qatar     High income…        50.4          24 QAT                     34518.
##  4 Latvia    High income…        52.8          25 LVA                      5135.
##  5 Barbados  High income…        46            26 BRB                     11699.
##  6 Singapore High income…        64            26 SGP                     23730.
##  7 Cyprus    High income…        65.8          26 CYP                     20252.
##  8 Malta     High income…        70.6          26 MLT                     13669.
##  9 Croatia   High income…        57.6          27 HRV                      8059.
## 10 United A… High income…        40.6          28 ARE                     33499.
## # … with 158 more rows, and 1 more variable: percapita <dbl>

Now we plot the PerCapita GDP variable against Democracy Index scores to view the distribution of the data and check for any visual evidence of correlation

Linear Regression - PerCapitaGDP & Democracy Score

PCdemosick_na_omit %>% ggplot(aes(x = PerCapitaGDP, y = democ_score, color = income_group)) +
  ggtitle(("Relation Between Per Capita GDP and Democracy")) +
  xlab("Per Capita GDP") +
  ylab("Democracy Score") +
  geom_point(alpha=0.2) +
  geom_smooth(color="red", formula=y~x,method="lm", se=FALSE, linetype="solid") +
  labs(color="Country Income Group") +
  scale_fill_discrete(name = "Country's Income Group") +
  theme_light()

## Commment | PerCapita GDP vs Democracy - Linear Regression Using PerCapitaGDP as the explanatory variable and Democracy Score as response variable we see there is clear visual evidence of a positive correlation between the two. Let’s perform a correlation analysis using the same explanatory and response variables.

To perform the correlation test we need to remove any NA values from the dataset

str(PCdemosick_na_omit)
## tibble [160 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country                     : chr [1:160] "Bahrain" "Bahamas, The" "Qatar" "Latvia" ...
##  $ income_group                : chr [1:160] "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" ...
##  $ democ_score                 : num [1:160] 45.6 48.4 50.4 52.8 46 64 65.8 70.6 57.6 40.6 ...
##  $ infect_rate                 : num [1:160] 23 24 24 25 26 26 26 26 27 28 ...
##  $ World Development Indicators: chr [1:160] "BHR" "BHS" "QAT" "LVA" ...
##  $ PerCapitaGDP                : num [1:160] 14222 28327 34518 5135 11699 ...
##  - attr(*, "na.action")= 'omit' Named int [1:8] 13 51 73 101 105 107 130 142
##   ..- attr(*, "names")= chr [1:8] "13" "51" "73" "101" ...

Correlation - Per Capita GDP and Democracy

cor(PCdemosick_na_omit$PerCapitaGDP, PCdemosick_na_omit$democ_score) ## check correlation 
## [1] 0.8321572

Comment | Correlation - Per Capita GDP and Democracy

With a correlation coefficient of 0.83 (with strongest positive correlation represented by a coefficient of 1) we see there is indeed a very strong positive correlation between PerCapita GDP and Democracy Index scores, even stronger than the -0.67 coefficient (strongest negative correlation being -1) for the disease/democracy test above.

Now lets look at the Per Capita GDP data in HighCharter scatterplot so we can explore the distribution of the data in finer detail.

Highchart Scatterplot - Per Capita and Democracy

cols <- brewer.pal(5, "Set1")

highchart() %>%
  hc_add_series(data = PCdemosick, type = "scatter", hcaes(x = percapita, y = democ_score, group = income_group)) %>%
  ## add regression line??
  ## lm(infect_rate ~ democ_score)
  ## hc_add_series(model, type = "line", color = "red") %>%
  hc_title(text = "Relation Between Per Capita GDP and Democracy") %>% ## set topline title
  hc_xAxis(title = list(text="Per Capita GDP"))     %>%                     ## x-axis title
  hc_yAxis(title = list(text="Democracy Score")) %>%                ## y-axis title
  ## Legend code
  hc_legend(align = "right",
            verticalAlign = "top") %>% ## Legend alignment 
  ## add a caption
 ## hc_caption(text = "<b>Source:  Vanhanen, Tatu. 2003: Democratization: A Comparative Analysis of 170 Countries. London: Routledge Research in Comparative Politics, Routledge. Taylor & Francis Group, 302 pp</em>'") %>%
  ## tool tip code
  hc_tooltip(pointFormat = "Country: {point.country}:<br> 
                            Per Capita GDP: $: {point.percapita} <br>
                            Democracy Score: {point.democ_score} <br>") %>%
  hc_colors(cols) %>% ## call color palette
  hc_chart(style = list(fontFamily = "Georgia")) ## set new font
## Add "$" sign, remove decimal places?

Comment | Highchart Scatterplot - Per Capita and Democracy

Again we see a clear connection between GDP and democracy scores, with most low PerCapita GDP countries clustered towards the lowest end of the democracy score range, and many higher income countries scoring highest (although with significant variance).

New Datasets - Burden of Communicable Disease, Democracy Index

Now lets bring in more recent data to see if the the original hypothesis—that higher disease prevalence correlates with lower democratization—still holds up.

The Democracy Index

The Democracy Index provides a measure of the quality of democracy in a given country and is produced by the Economist Intelligence Unit and published on a yearly basis. Available on GapMinder here: https://www.gapminder.org/data/documentation/democracy-index/

The Democracy Index value here is in fact one of the data sources used by Thornhill et Al. in their published work on disease and democracy. From this dataset we will use only the variable “Democracy index (EIU)” values.

Burden of Disease Data

The Burden from Communicable, Neonatal, Maternal and Nutritional Diseases dataset is compiled by Institute for Health Metrics and Evaluation and available from our World in Data here: https://ourworldindata.org/burden-of-disease#the-burden-from-communicable-neonatal-maternal-and-nutritional-diseases

We’ll be using the ‘Disability Adjusted Life Years‘ (DALYs) values from this dataset as a stand in for the Infection Rate from our original 2003 dataset. DALYs are calculated per 100,000 people of the population. Regions considered to have the best health are below 20,000 DALYs per 100,000 individuals. In 2017 this is achieved in many European countries, but also in Canada, Israel, South Korea, Taiwan, Japan, Kuwait, the Maldives, and Australia. In worst-off regions the rate is higher than 80,000 DALYs per 100,000.

## Clean DemoIndex Dataset - Democracy Index
DemocracyIndex <- DemoIndex%>%
## change "name" to "country"
## change "Democracy index (EIU)" to "DemocracyIndex"
## change "time" to "Year"
  rename("country" = "name", "DemocracyIndex" = "Democracy index (EIU)", "Year" = "time")


## Clean BurdenComm Dataset - Burden of Disease
Burden <- BurdenComm %>%
## change "entity" to "country"
  rename("country" = "Entity")

## merge by country
## merge by year
NewDemoSet <- left_join(Burden, DemocracyIndex, by=c("country", "Year")) %>%
  filter(DemocracyIndex != "NA")

Now that we’ve cleaned the two datasets and merged them into new object, let us look at a scatterplot and linear regression to look at the distribution of the data and check for any visual evidence of correlation.

Plot New Datasets -

Scatterplot Linear Regression - Burden of Disease by Democracy index

NewDemoSet %>% ggplot(aes(x = DALYs, y = DemocracyIndex)) +
  ggtitle("Burden of Disease vs Democracy Index") + ## plot title
  xlab("Burden of Disease") + ## x axis label
  ylab("Democracy Index") + ## y axis labell
  ##geom_point(show.legend = FALSE, alpha=0.4)  + ## calling scatter plot
  geom_point(alpha=0.4)  + ## calling scatter plot
  geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
  ## labs(color='Country Income Group') +
  ##scale_fill_discrete(name = "Country's Income Group") +  ## label for the legend 
  scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
  theme_light() ## add new theme
## Warning: Removed 1406 rows containing non-finite values (stat_smooth).
## Warning: Removed 1406 rows containing missing values (geom_point).

Comment | Scatterplot Linear Regression - Burden of Disease by Democracy index

Using our measure of disease prevalence (DALYs) as explanatory variable (x) and Democracy Index as response variable (y) we see that the negative correlation from the 2003 data set is no longer visible, in fact it appears to even have a slight positive correlation

To look at the data in more depth let us view the data through an interactive Highcharter scatterplot

New Dataset - Higher Charter Scatterplot

# define color palette
cols <- brewer.pal(2, "Set2")
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
highchart() %>%
  hc_add_series(data = NewDemoSet, type = "scatter", hcaes(x = DALYs, y = DemocracyIndex)) %>%
  hc_title(text = "Burden of Disease vs Democracy Index") %>% ## set topline title
  hc_xAxis(title = list(text="Burden of Disease"))     %>%                     ## x-axis title
  hc_yAxis(title = list(text="Democracy Index")) %>%                ## y-axis title
  ## Legend code
  hc_legend(align = "right",
            verticalAlign = "top") %>% ## Legend alignment 
  ## add a caption
  hc_caption(text = "<b>Democracy Index by Economist Intelligence Unit; Disease burden data from non-communicable diseases</em>'") %>%
  ## tool tip code
  hc_tooltip(pointFormat = "Country: {point.country}:<br> 
                            Democracy Index: {point.DemocracyIndex}:<br>
                            Burden of Disease: {point.DALYs} <br>") %>%
  hc_colors(cols) %>% ## call color palette
  hc_chart(style = list(fontFamily = "Georgia")) ## set new font

Comment | New Dataset - Higher Charter Scatterplot

In the flat distribution of the data we see again very little visual evidence of a strong correlation, either positive or negative.

The above visualizations included 11 years of observations for each each of the two variables in each country resulting in a crowded plot with 1884 datapoints. In the hopes of producing a more manageable plot, we will limit the observations to the most recent year (2017), a scope and method used in evaluating our original 2003 dataset.

Filter out of the dataset all years except 2017

New Dataset for most recent year 2017

NewDemo2017 <- NewDemoSet %>%
  filter(Year == "2017")

Linear Regression for 2017 data

NewDemo2017 %>% ggplot(aes(x = DALYs, y = DemocracyIndex)) +
  ggtitle("Burden of Disease vs Democracy Index") + ## plot title
  xlab("Burden of Disease") + ## x axis label
  ylab("Democracy Index") + ## y axis labell
  ##geom_point(show.legend = FALSE, alpha=0.4)  + ## calling scatter plot
  geom_point(alpha=0.4)  + ## calling scatter plot
  geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
  ## labs(color='Country Income Group') +
  ##scale_fill_discrete(name = "Country's Income Group") +  ## label for the legend 
  scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
  theme_light() ## add new theme
## Warning: Removed 114 rows containing non-finite values (stat_smooth).
## Warning: Removed 114 rows containing missing values (geom_point).

Correlation Coefficient for 2017 data

cor(NewDemo2017$DALYs, NewDemo2017$DemocracyIndex) ## check correlation 
## [1] -0.4097654

Highcharter Scatter Plot for 2017 Data

# define color palette
cols <- brewer.pal(2, "Set2")
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
highchart() %>%
  hc_add_series(data = NewDemo2017, type = "scatter", hcaes(x = DALYs, y = DemocracyIndex)) %>%
  hc_title(text = "Burden of Disease vs Democracy Index") %>% ## set topline title
  hc_xAxis(title = list(text="Burden of Disease"))     %>%                     ## x-axis title
  hc_yAxis(title = list(text="Democracy Index")) %>%                ## y-axis title
  ## Legend code
  hc_legend(align = "right",
            verticalAlign = "top") %>% ## Legend alignment 
  ## add a caption
  hc_caption(text = "<b>Democracy Index by Economist Intelligence Unit; Disease burden data from non-communicable diseases</em>'") %>%
  ## tool tip code
  hc_tooltip(pointFormat = "Country: {point.country}:<br> 
                            Democracy Index: {point.DemocracyIndex}:<br>
                            Burden of Disease: {point.DALYs} <br>") %>%
  hc_colors(cols) %>% ## call color palette
  hc_chart(style = list(fontFamily = "Georgia")) ## set new font

Comments on three plots for 2017 Data

We see repeated again the pattern of little evidence for strong correlation between disease and democracy in all three of the above visualizations and analysis. First our linear regression line shows an almost flat correlation line; second our correlation test produces a coefficient of -0.41, a relatively weak level of correlation. In the third Highcharter scatterplot we see again the loose and flat distribution of data points and, by using the interactive pointer, we can see countries that regardless of having a relative low burden of disease, still show very low democracy index values, for example Syria and North Korea, among others.

Summary of Findings

As a socio-political hypothesis, I found Thornhill and Fincher’s arguments connecting high rates of infectious disease to less democratic forms of government compelling and insightful. However I was not convinced by their conclusion that disease prevalence was nearly the sole factor in determining whether a country had the potential to develop into a democracy.

Looking deeper into their research I discovered that their dataset was carefully selected through collaboration with other academics and tailored to the models that they had developed. My inclusion of newer, alternate datasets is not meant to replicate exactly their methods or their model. However I was curious to see if more broad, raw data on democracy and disease levels would show a similar correlation between close approximate variables.

Because the highcharter scatterplot I used in Project 2 strongly hinted that countries with higher levels of economic development might be more likely to have higher democracy scores, I also wanted to investigate an economic factor as an alternative explanatory variable and confounding factor that might play an important role in democratization.

After reviewing as a whole the statistical analyses and visualizations above, it is clear that the disease and democracy hypothesis as a broad trend is not easily replicated in datasets approximate to the original used by Thornhill et al. Instead of pointing to disease prevalence as the determinative factor, I think we should consider it as one of many, including economic development and many possible others, that play important roles in the development of democratic countries.