This final project is a continuation of my prior Project 2 where we looked at the connection between infectious disease and democratization in countries around the world. We used a dataset from 2003 with DEMOCRACY SCORES and INFECTIOUS DISEASE RATES for 168 countries to evaluate a hypothesis of two evolutionary biologists at the University of New Mexico. Their ideas briefly summarized here in New Scientist Magazine:
“The starting point for Thornhill and Fincher’s thinking is a basic human survival instinct: the desire to avoid illness. In a region where disease is rife, they argue, fear of contagion may cause people to avoid outsiders, who may be carrying a strain of infection to which they have no immunity. Such a mindset would tend to make a community as a whole xenophobic, and might also discourage interaction between the various groups within a society – the social classes, for instance – to prevent unnecessary contact that might spread disease. What is more, Thornhill and Fincher argue, it could encourage people to conform to social norms and to respect authority, since adventurous behavior may flout rules of conduct set in place to prevent contamination.”
From “Genes, germs and the origins of politics,” by Jim Giles, in 18 May 2011, New Scientist Magazine, San Francisco CA
Through statistical analysis of their 2003 dataset we were able to confirm a strong correlation between infectious disease and democracy scores. However we were suspicious that other variables, such as economic development, might play an equally important role in democratization.
In this final project we plan to explore the connection between economic development and democratization by merging historical GDP data from the World Bank with the original 2003 data set used by Thornhill/Fincher above. We will evaluate the correlation between PER CAPITA GDP and democracy scores and compare the significance with our original findings from Project 2.
We will also include an analysis of two more recent data sets on disease (BURDEN OF DISEASE) and democracy (DEMOCRACY INDEX); perform correlation analysis and linear regression on these new data sets; and compare the results to the original 2003 dataset.
First we’ll start off by calling packages and importing datasets:
getwd()
## [1] "/Users/h0age/Documents/data110/Project 2"
setwd("/Users/h0age/Documents/data110/data_sets")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(shiny)
library(urltools)
library(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(ggrepel)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(dplyr)
library(RColorBrewer)
library(ggthemes)
library(ggrepel)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(tmap)
library(tmaptools)
library(leaflet)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(htmlwidgets)
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
library(leaflet.extras)
library(rio)
library(sp)
demosick <- read_csv("disease_democ.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## country = col_character(),
## income_group = col_character(),
## democ_score = col_double(),
## infect_rate = col_double()
## )
GDP2003 <- read_csv("World Bank 2003 GDP.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `2003` = col_double()
## )
BurdenComm <- read_csv("burdencomm.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Entity = col_character(),
## Code = col_character(),
## Year = col_double(),
## DALYs = col_double()
## )
DemoIndex <- read_csv("DemoIndex.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## geo = col_character(),
## name = col_character(),
## time = col_double(),
## `Democracy index (EIU)` = col_double(),
## `Electoral pluralism index (EIU)` = col_double(),
## `Government index (EIU)` = col_double(),
## `Political participation index(EIU)` = col_double(),
## `Political culture index (EIU)` = col_double(),
## `Civil liberties index (EIU)` = col_double(),
## `Change in democracy index (EIU)` = col_double()
## )
PC2003 <- read_csv("percapita2003.csv")
## Warning: Missing column names filled in: 'X3' [3]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Data Source` = col_character(),
## `World Development Indicators` = col_character(),
## X3 = col_double()
## )
Let us review the results of the linear regression and correlation analysis we performed on the 2003 dataset in Project 2.
byinfection <- demosick %>%
arrange(infect_rate)
byinfection %>% ggplot(aes(x = democ_score, y = infect_rate, color = income_group)) +
ggtitle("Is There a Relationship Between Disease and Democracy?") + ## plot title
xlab("Democracy Score") + ## x axis label
ylab("Infectious Disease Score") + ## y axis labell
##geom_point(show.legend = FALSE, alpha=0.4) + ## calling scatter plot
geom_point(alpha=0.4) + ## calling scatter plot
geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
##geom_smooth(method='lm',formula=y~x) + ## how to get the income group parameter out??
labs(color='Country Income Group') +
scale_fill_discrete(name = "Country's Income Group") + ## label for the legend
scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
theme_light() ## add new theme
## Warning: Removed 67 rows containing non-finite values (stat_smooth).
## Warning: Removed 67 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_smooth).
cor(byinfection$infect_rate, byinfection$democ_score) ## check correlation
## [1] -0.6664911
model <- lm(democ_score ~ infect_rate, data = byinfection)
summary(model) ## view summary
##
## Call:
## lm(formula = democ_score ~ infect_rate, data = byinfection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.838 -9.689 -1.512 7.775 31.763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.4458 5.4627 19.12 <2e-16 ***
## infect_rate -1.8503 0.1606 -11.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.08 on 166 degrees of freedom
## Multiple R-squared: 0.4442, Adjusted R-squared: 0.4409
## F-statistic: 132.7 on 1 and 166 DF, p-value: < 2.2e-16
In order to test the effect of economic development on democratization, let us bring in a new variable: PerCapita GDP data from 2003 via the World Bank. In the plots below PerCapita GDP will replace Infectious Disease as our explanatory variable in relation to the Democracy Index variable, our response variable.
Clean Percapita GDP Data
GDP <- GDP2003 %>%
rename("country" = "Country Name", "gdp" = "2003")
demosickGDP <- left_join(demosick, GDP, by="country") ##
Merge percapita GDP with with 2003 dataset
PerCap <- PC2003 %>%
rename("country" = "Data Source", "PerCapitaGDP" = "X3")
PCdemosick <- left_join(demosick, PerCap, by="country")
Remove NA values
PCdemosick_na_omit <- na.omit(PCdemosick)
Round Percapita GDP Values to remove decimals
PCdemosick$percapita <- round(PCdemosick$PerCapitaGDP, 0)
PCdemosick %>%
mutate(percapita = round(PerCapitaGDP, 0))
## # A tibble: 168 x 7
## country income_group democ_score infect_rate `World Developme… PerCapitaGDP
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Bahrain High income… 45.6 23 BHR 14222.
## 2 Bahamas,… High income… 48.4 24 BHS 28327.
## 3 Qatar High income… 50.4 24 QAT 34518.
## 4 Latvia High income… 52.8 25 LVA 5135.
## 5 Barbados High income… 46 26 BRB 11699.
## 6 Singapore High income… 64 26 SGP 23730.
## 7 Cyprus High income… 65.8 26 CYP 20252.
## 8 Malta High income… 70.6 26 MLT 13669.
## 9 Croatia High income… 57.6 27 HRV 8059.
## 10 United A… High income… 40.6 28 ARE 33499.
## # … with 158 more rows, and 1 more variable: percapita <dbl>
Now we plot the PerCapita GDP variable against Democracy Index scores to view the distribution of the data and check for any visual evidence of correlation
PCdemosick_na_omit %>% ggplot(aes(x = PerCapitaGDP, y = democ_score, color = income_group)) +
ggtitle(("Relation Between Per Capita GDP and Democracy")) +
xlab("Per Capita GDP") +
ylab("Democracy Score") +
geom_point(alpha=0.2) +
geom_smooth(color="red", formula=y~x,method="lm", se=FALSE, linetype="solid") +
labs(color="Country Income Group") +
scale_fill_discrete(name = "Country's Income Group") +
theme_light()
## Commment | PerCapita GDP vs Democracy - Linear Regression Using PerCapitaGDP as the explanatory variable and Democracy Score as response variable we see there is clear visual evidence of a positive correlation between the two. Let’s perform a correlation analysis using the same explanatory and response variables.
To perform the correlation test we need to remove any NA values from the dataset
str(PCdemosick_na_omit)
## tibble [160 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : chr [1:160] "Bahrain" "Bahamas, The" "Qatar" "Latvia" ...
## $ income_group : chr [1:160] "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" "High income: non-OECD" ...
## $ democ_score : num [1:160] 45.6 48.4 50.4 52.8 46 64 65.8 70.6 57.6 40.6 ...
## $ infect_rate : num [1:160] 23 24 24 25 26 26 26 26 27 28 ...
## $ World Development Indicators: chr [1:160] "BHR" "BHS" "QAT" "LVA" ...
## $ PerCapitaGDP : num [1:160] 14222 28327 34518 5135 11699 ...
## - attr(*, "na.action")= 'omit' Named int [1:8] 13 51 73 101 105 107 130 142
## ..- attr(*, "names")= chr [1:8] "13" "51" "73" "101" ...
cor(PCdemosick_na_omit$PerCapitaGDP, PCdemosick_na_omit$democ_score) ## check correlation
## [1] 0.8321572
With a correlation coefficient of 0.83 (with strongest positive correlation represented by a coefficient of 1) we see there is indeed a very strong positive correlation between PerCapita GDP and Democracy Index scores, even stronger than the -0.67 coefficient (strongest negative correlation being -1) for the disease/democracy test above.
Now lets look at the Per Capita GDP data in HighCharter scatterplot so we can explore the distribution of the data in finer detail.
cols <- brewer.pal(5, "Set1")
highchart() %>%
hc_add_series(data = PCdemosick, type = "scatter", hcaes(x = percapita, y = democ_score, group = income_group)) %>%
## add regression line??
## lm(infect_rate ~ democ_score)
## hc_add_series(model, type = "line", color = "red") %>%
hc_title(text = "Relation Between Per Capita GDP and Democracy") %>% ## set topline title
hc_xAxis(title = list(text="Per Capita GDP")) %>% ## x-axis title
hc_yAxis(title = list(text="Democracy Score")) %>% ## y-axis title
## Legend code
hc_legend(align = "right",
verticalAlign = "top") %>% ## Legend alignment
## add a caption
## hc_caption(text = "<b>Source: Vanhanen, Tatu. 2003: Democratization: A Comparative Analysis of 170 Countries. London: Routledge Research in Comparative Politics, Routledge. Taylor & Francis Group, 302 pp</em>'") %>%
## tool tip code
hc_tooltip(pointFormat = "Country: {point.country}:<br>
Per Capita GDP: $: {point.percapita} <br>
Democracy Score: {point.democ_score} <br>") %>%
hc_colors(cols) %>% ## call color palette
hc_chart(style = list(fontFamily = "Georgia")) ## set new font
## Add "$" sign, remove decimal places?
Again we see a clear connection between GDP and democracy scores, with most low PerCapita GDP countries clustered towards the lowest end of the democracy score range, and many higher income countries scoring highest (although with significant variance).
Now lets bring in more recent data to see if the the original hypothesis—that higher disease prevalence correlates with lower democratization—still holds up.
The Democracy Index provides a measure of the quality of democracy in a given country and is produced by the Economist Intelligence Unit and published on a yearly basis. Available on GapMinder here: https://www.gapminder.org/data/documentation/democracy-index/
The Democracy Index value here is in fact one of the data sources used by Thornhill et Al. in their published work on disease and democracy. From this dataset we will use only the variable “Democracy index (EIU)” values.
The Burden from Communicable, Neonatal, Maternal and Nutritional Diseases dataset is compiled by Institute for Health Metrics and Evaluation and available from our World in Data here: https://ourworldindata.org/burden-of-disease#the-burden-from-communicable-neonatal-maternal-and-nutritional-diseases
We’ll be using the ‘Disability Adjusted Life Years‘ (DALYs) values from this dataset as a stand in for the Infection Rate from our original 2003 dataset. DALYs are calculated per 100,000 people of the population. Regions considered to have the best health are below 20,000 DALYs per 100,000 individuals. In 2017 this is achieved in many European countries, but also in Canada, Israel, South Korea, Taiwan, Japan, Kuwait, the Maldives, and Australia. In worst-off regions the rate is higher than 80,000 DALYs per 100,000.
## Clean DemoIndex Dataset - Democracy Index
DemocracyIndex <- DemoIndex%>%
## change "name" to "country"
## change "Democracy index (EIU)" to "DemocracyIndex"
## change "time" to "Year"
rename("country" = "name", "DemocracyIndex" = "Democracy index (EIU)", "Year" = "time")
## Clean BurdenComm Dataset - Burden of Disease
Burden <- BurdenComm %>%
## change "entity" to "country"
rename("country" = "Entity")
## merge by country
## merge by year
NewDemoSet <- left_join(Burden, DemocracyIndex, by=c("country", "Year")) %>%
filter(DemocracyIndex != "NA")
Now that we’ve cleaned the two datasets and merged them into new object, let us look at a scatterplot and linear regression to look at the distribution of the data and check for any visual evidence of correlation.
NewDemoSet %>% ggplot(aes(x = DALYs, y = DemocracyIndex)) +
ggtitle("Burden of Disease vs Democracy Index") + ## plot title
xlab("Burden of Disease") + ## x axis label
ylab("Democracy Index") + ## y axis labell
##geom_point(show.legend = FALSE, alpha=0.4) + ## calling scatter plot
geom_point(alpha=0.4) + ## calling scatter plot
geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
## labs(color='Country Income Group') +
##scale_fill_discrete(name = "Country's Income Group") + ## label for the legend
scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
theme_light() ## add new theme
## Warning: Removed 1406 rows containing non-finite values (stat_smooth).
## Warning: Removed 1406 rows containing missing values (geom_point).
Using our measure of disease prevalence (DALYs) as explanatory variable (x) and Democracy Index as response variable (y) we see that the negative correlation from the 2003 data set is no longer visible, in fact it appears to even have a slight positive correlation
To look at the data in more depth let us view the data through an interactive Highcharter scatterplot
# define color palette
cols <- brewer.pal(2, "Set2")
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
highchart() %>%
hc_add_series(data = NewDemoSet, type = "scatter", hcaes(x = DALYs, y = DemocracyIndex)) %>%
hc_title(text = "Burden of Disease vs Democracy Index") %>% ## set topline title
hc_xAxis(title = list(text="Burden of Disease")) %>% ## x-axis title
hc_yAxis(title = list(text="Democracy Index")) %>% ## y-axis title
## Legend code
hc_legend(align = "right",
verticalAlign = "top") %>% ## Legend alignment
## add a caption
hc_caption(text = "<b>Democracy Index by Economist Intelligence Unit; Disease burden data from non-communicable diseases</em>'") %>%
## tool tip code
hc_tooltip(pointFormat = "Country: {point.country}:<br>
Democracy Index: {point.DemocracyIndex}:<br>
Burden of Disease: {point.DALYs} <br>") %>%
hc_colors(cols) %>% ## call color palette
hc_chart(style = list(fontFamily = "Georgia")) ## set new font
In the flat distribution of the data we see again very little visual evidence of a strong correlation, either positive or negative.
The above visualizations included 11 years of observations for each each of the two variables in each country resulting in a crowded plot with 1884 datapoints. In the hopes of producing a more manageable plot, we will limit the observations to the most recent year (2017), a scope and method used in evaluating our original 2003 dataset.
Filter out of the dataset all years except 2017
NewDemo2017 <- NewDemoSet %>%
filter(Year == "2017")
NewDemo2017 %>% ggplot(aes(x = DALYs, y = DemocracyIndex)) +
ggtitle("Burden of Disease vs Democracy Index") + ## plot title
xlab("Burden of Disease") + ## x axis label
ylab("Democracy Index") + ## y axis labell
##geom_point(show.legend = FALSE, alpha=0.4) + ## calling scatter plot
geom_point(alpha=0.4) + ## calling scatter plot
geom_smooth(color="red", ,formula=y~x,method="lm", se=FALSE, linetype="solid") +
## labs(color='Country Income Group') +
##scale_fill_discrete(name = "Country's Income Group") + ## label for the legend
scale_y_continuous(limits = c(30,50)) + ## do I need this? the chunk doesn't work if i remove it
theme_light() ## add new theme
## Warning: Removed 114 rows containing non-finite values (stat_smooth).
## Warning: Removed 114 rows containing missing values (geom_point).
cor(NewDemo2017$DALYs, NewDemo2017$DemocracyIndex) ## check correlation
## [1] -0.4097654
# define color palette
cols <- brewer.pal(2, "Set2")
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
highchart() %>%
hc_add_series(data = NewDemo2017, type = "scatter", hcaes(x = DALYs, y = DemocracyIndex)) %>%
hc_title(text = "Burden of Disease vs Democracy Index") %>% ## set topline title
hc_xAxis(title = list(text="Burden of Disease")) %>% ## x-axis title
hc_yAxis(title = list(text="Democracy Index")) %>% ## y-axis title
## Legend code
hc_legend(align = "right",
verticalAlign = "top") %>% ## Legend alignment
## add a caption
hc_caption(text = "<b>Democracy Index by Economist Intelligence Unit; Disease burden data from non-communicable diseases</em>'") %>%
## tool tip code
hc_tooltip(pointFormat = "Country: {point.country}:<br>
Democracy Index: {point.DemocracyIndex}:<br>
Burden of Disease: {point.DALYs} <br>") %>%
hc_colors(cols) %>% ## call color palette
hc_chart(style = list(fontFamily = "Georgia")) ## set new font
We see repeated again the pattern of little evidence for strong correlation between disease and democracy in all three of the above visualizations and analysis. First our linear regression line shows an almost flat correlation line; second our correlation test produces a coefficient of -0.41, a relatively weak level of correlation. In the third Highcharter scatterplot we see again the loose and flat distribution of data points and, by using the interactive pointer, we can see countries that regardless of having a relative low burden of disease, still show very low democracy index values, for example Syria and North Korea, among others.
As a socio-political hypothesis, I found Thornhill and Fincher’s arguments connecting high rates of infectious disease to less democratic forms of government compelling and insightful. However I was not convinced by their conclusion that disease prevalence was nearly the sole factor in determining whether a country had the potential to develop into a democracy.
Looking deeper into their research I discovered that their dataset was carefully selected through collaboration with other academics and tailored to the models that they had developed. My inclusion of newer, alternate datasets is not meant to replicate exactly their methods or their model. However I was curious to see if more broad, raw data on democracy and disease levels would show a similar correlation between close approximate variables.
Because the highcharter scatterplot I used in Project 2 strongly hinted that countries with higher levels of economic development might be more likely to have higher democracy scores, I also wanted to investigate an economic factor as an alternative explanatory variable and confounding factor that might play an important role in democratization.
After reviewing as a whole the statistical analyses and visualizations above, it is clear that the disease and democracy hypothesis as a broad trend is not easily replicated in datasets approximate to the original used by Thornhill et al. Instead of pointing to disease prevalence as the determinative factor, I think we should consider it as one of many, including economic development and many possible others, that play important roles in the development of democratic countries.
Comment | 2003 Data - Correlation Analysis
With Infection Rate (x) as the explanatory variable and Democracy Score (y) as response variable, our linear regression plot shows us clear visible evidence of a negative correlation between infectious disease rates and democracy scores. Correlation analysis provides further evidence for this, producing a negative correlation coefficient of -0.667, where values of 1 indicate a strong positive relationship, values of -1 a strong negative relationship, and values of zero indicating no relationship at all.
The column for Pr(>|t|) showing how useful the predictor is to the model is <2e-16 ***. Three asterisks indicating that the predictor has a significant impact on the model. This reinforces conclusions of Thornhill et al that there is a negative correlation between infection rates and democracy.