Introduction

This problem set prepares data on regime type and basic health statistics for 161 countries over the years 1960-2015, conducts longitudinal and cross-sectional comparisons, and produces basic data visualizations. The HTML file for this assignment can be found at: http://rpubs.com/carriecoberly/247473

Part 1: Data Management

Loading the data and the relevant libraries

First, I load the necessary libraries in R:

library(dplyr)
library(tidyr)
library(ggplot2)
suppressPackageStartupMessages(library(googleVis))
library(plotly)
library(knitr)

I am using the Varieties of Democracy dataset for information on regime type, and data from the UN and World Health Organization on the number of physicians, percent undernourished, and health expenditures across countries. Both datasets are comma-separated values (CSV) files. I load both files:

vdem <- read.csv("vdem.csv")
health <- read.csv("UNdata.csv")

Cleaning the data

Varieties of Democracy

For this assignment, I will examine the relationship between health indicators and two measures of democracy, three measures of civil liberties, and three measures of women’s rights. To do so, I will keep the following variables from the Varieties of Democracy dataset:

  • Country Identifiers
      • country_name
      • country_text_id: The UN country identifier
      • COWcode: The country code used by the Correlates of War database
  • Measures of Regime Type
      • v2x_polyarchy: This measure of electoral democracy (or polyarchy) evaluates whether a country holds free and fair elections. It incorporates measures of suffrage, freedoms of association and speech, and electoral fraud. This variable is an interval variable - a continuous measure of democracy scaled between 0 and 1.
      • v2x_libdem: This measure of liberal democracy incorporates the definition of electoral democracy with additional protections of individual and minority rights. In addition to the freedoms described above, this variable also examines whether there are constitutionally protected civil liberties, rule of law, an independent judiciary, and effective checks and balances on the executive. This variable is an interval variable - a continuous measure of democracy scaled between 0 and 1.

  • Measures of Civil Liberties
      • v2x_cspart: This index evaluates whether major civil society organizations are routinely consulted by policymakers; popular participation in civil society organizations, including female participation; and the centralization of candidate nominations within parties. This variable is an interval variable - a continuous measure of participation scaled between 0 and 1.
      • v2xme_altinf: This index evaluates the extent to which print and broadcast media are unbiased in their coverage, allowed to be critical of the regime, and representative of a wide array of political perspectives. This variable is an interval variable - a continuous measure of bias scaled between 0 and 1.
      • v2xcl_rol: This index measures the extent to which laws are transparent and rigorously and impartially enforced, and the extent to which citizens enjoy access to justice and basic rights. It is formed using indicators for impartial public administration, transparent and predictable laws, access to justice, property rights, freedom from torture, political killings and forced labor, and freedom of religion and movement. This variable is an interval variable - a continuous measure of rule of law scaled between 0 and 1.

  • Measures of Gender Equality
      • v2xeg_eqprotec: This index measures how equal the protection of rights and freedoms is across social groups. This variable is a ratio variable - a continuous measure of protection scaled between 0 and 1.
      • v2x_gender: This variable measures women’s political empowerment, which incorporates protection of women’s civil liberties, women’s participation in civil society organizations, and the representation of women in the formal political system. This variable is an interval variable - a continuous measure of rights scaled between 0 and 1.
      • v2x_suffr: This variable measures the share of adult citizens who have the legal right to vote in national elections. It is an interval variable - a proportion between 0 and 1.

vdem <- select(vdem, country_name, country_text_id, COWcode, year, v2x_polyarchy, v2x_libdem, v2x_cspart, v2xeg_eqprotec, v2x_gender, v2xme_altinf, v2x_suffr, v2xcl_rol)

The Varieties of Democracy dataset includes information on territories that are not recognized by the UN (and are not included in its datasets). In order to remove these territories (which do not have Correlates of War codes) from the dataset, I use the code:

vdem <- drop_na(vdem, COWcode)
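
As a quick check of what this step does, the sketch below applies drop_na() to a small stand-in data frame. The rows and code values are invented for illustration and are not taken from the VDEM file.

```r
library(tidyr)

# Toy stand-in for the VDEM data; names and codes are illustrative only.
toy_vdem <- data.frame(
  country_name = c("Country A", "Territory B", "Country C"),
  COWcode      = c(100, NA, 200),
  stringsAsFactors = FALSE
)

# drop_na() keeps only the rows where the named column is not missing.
toy_clean <- drop_na(toy_vdem, COWcode)
print(toy_clean$country_name)
```

Only the row with a missing COWcode is removed; missing values in other columns are left alone because COWcode is the only column named in the call.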

This dataset is now ready to merge with the UN data.

UN Health Data

The UN data is organized with years as variables instead of observations. In order to format this data to merge with the Varieties of Democracy data, I need to transpose the columns and rows and reformat them to match the VDEM data.

First, I will remove one unnecessary variable:

health <- select(health, -Series.Code)

The column that holds the variable names also contains rows with no relevant information. To eliminate them, I filter to keep only rows that have data associated with our three variables of interest: Health expenditure per capita, Physicians per 1,000 people, and Prevalence of undernourishment. I also remove data for UN aggregate measures and territories.

levels(health$Series.Name)
## [1] ""                                                              
## [2] "Data from database: Health Nutrition and Population Statistics"
## [3] "Health expenditure per capita (current US$)"                   
## [4] "Last Updated: 12/16/2016"                                      
## [5] "Physicians (per 1,000 people)"                                 
## [6] "Prevalence of undernourishment (% of population)"
# Keep only the three health series of interest
keep_series <- c("Health expenditure per capita (current US$)",
                 "Physicians (per 1,000 people)",
                 "Prevalence of undernourishment (% of population)")
health <- filter(health, Series.Name %in% keep_series)
# Remove UN aggregate measures and territories by country code
drop_codes <- c("ARB", "ASM", "CSS", "CEB", "EAR", "EAS", "EAP", "TEA", "EMU", "ECS",
                "ECA", "TEC", "EUU", "FRO", "FCS", "PYF", "GIB", "GRL", "GUM", "HPC",
                "HIC", "HKG", "IMY", "LTE", "LAC", "LCN", "LDC", "LMY", "LIC", "LMC",
                "MAC", "MEA", "MNA", "TMN", "MIC", "NCL", "NAC", "MNP", "OED", "OSS",
                "PSS", "PST", "PRE", "PRI", "SXM", "SST", "SSA", "SSF", "SAS", "TSA",
                "TSS", "MAF", "WLD", "UMC", "VIR", "VGB", "WBG", "CYM", "CHI", "TLA")
health <- filter(health, !(Country.Code %in% drop_codes))

We need to reshape the data into a long version by moving the years to the rows. We use the gather command for that:

health <- gather(health, key="year", value="stat", X1960..YR1960.:X2015..YR2015.)
## Warning: attributes are not identical across measure variables; they will
## be dropped
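
To make the reshaping concrete, here is a minimal sketch of the same gather call on a two-year toy table. The country codes and values are invented; only the column naming pattern follows the UN file.

```r
library(tidyr)

# Toy wide table: one row per country, one column per year (values invented).
toy_wide <- data.frame(
  Country.Code   = c("AAA", "BBB"),
  X1960..YR1960. = c("1.5", ".."),
  X1961..YR1961. = c("1.7", "2.0"),
  stringsAsFactors = FALSE
)

# Move the year columns into rows: one row per country-year.
toy_long <- gather(toy_wide, key = "year", value = "stat",
                   X1960..YR1960.:X1961..YR1961.)
print(toy_long)
```

Two countries and two year columns become four country-year rows. Newer versions of tidyr also provide pivot_longer() for this kind of reshaping.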

The UN data codes missing values as “..” instead of R’s standard “NA”. To convert the data, I place the values into a separate vector, replace every “..” in that vector with NA, append that vector to the original dataset, and drop the earlier version of the variable. I can then convert this column to a numeric variable.

conv <- health$stat
# replace() returns a modified copy, so the result must be assigned back to conv
conv <- replace(conv, conv=="..", NA)
health$conv <- conv
health <- select(health, -stat)
health$conv <- as.numeric(health$conv)
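
The conversion can be checked on a short vector. This is a sketch with invented values; it also shows a pitfall that applies only if the column were stored as a factor rather than as character strings, which I am assuming could happen with other inputs.

```r
# Character values as they might appear in the raw column (invented).
raw   <- c("12.5", "..", "3")
clean <- replace(raw, raw == "..", NA)   # replace() returns a new vector
nums  <- as.numeric(clean)
print(nums)

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the values. Converting through as.character() first avoids this.
f      <- factor(raw)
codes  <- as.numeric(f)
values <- suppressWarnings(as.numeric(as.character(f)))
```

Here nums and values hold the intended numbers with NA in the second position, while codes holds integer level codes that have nothing to do with the data.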

The year variable still has an X prefix and a suffix (for example, X1960..YR1960.). To remove both, I treat year as a character variable and use the substr command to keep only the four-digit year. Finally, I convert year to a numeric variable.

health$year <- substr(health$year, 2, 5)
health$year <- as.numeric(health$year)
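
A short check of the character positions used above, applied to two of the column labels:

```r
# substr() keeps characters 2 through 5, which hold the four-digit year.
year_raw <- c("X1960..YR1960.", "X2015..YR2015.")
year_num <- as.numeric(substr(year_raw, 2, 5))
print(year_num)
```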

I now need to move the three variables to columns:

health <- spread(health, key=Series.Name, value=conv)
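
The sketch below shows what spread does on a toy long table. The series names are shortened and the values are invented for illustration.

```r
library(tidyr)

# Toy long table: one row per country, year, and series (values invented).
toy_long <- data.frame(
  Country.Code = c("AAA", "AAA", "BBB", "BBB"),
  year         = c(2000, 2000, 2000, 2000),
  Series.Name  = c("Physicians", "Undernourishment",
                   "Physicians", "Undernourishment"),
  conv         = c(1.2, 10, 0.4, 25),
  stringsAsFactors = FALSE
)

# Each distinct Series.Name becomes its own column, filled from conv.
toy_wide <- spread(toy_long, key = Series.Name, value = conv)
print(toy_wide)
```

The four rows collapse to one row per country-year, with one column per series.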

For clarity and to match the variable names to the VDEM data, I rename the variables as follows:

names(health) <- c("country", "country_text_id", "year", "health.exp", "physicians", "undernour")

Since the “undernour” variable is the percentage of the population that is undernourished, I will convert it to a fraction between 0 and 1 using the mutate command:

health <- mutate(health, undernour = undernour/100)

Merging the datasets

To merge the two datasets, I use a “join” command. The “full_join” command keeps all rows from both datasets, regardless of whether they have a match in the other dataset.

vdem_un <- full_join(vdem, health, by=c("country_text_id", "year"))
## Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
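
To illustrate how full_join treats unmatched rows, here is a minimal sketch with two toy tables. The country codes and values are hypothetical.

```r
library(dplyr)

# Toy tables that overlap on only two of their three country codes.
toy_vdem <- data.frame(
  country_text_id = c("AAA", "BBB", "CCC"),
  year            = c(2000, 2000, 2000),
  v2x_polyarchy   = c(0.2, 0.7, 0.1),
  stringsAsFactors = FALSE
)
toy_health <- data.frame(
  country_text_id = c("AAA", "BBB", "DDD"),
  year            = c(2000, 2000, 2000),
  undernour       = c(0.40, 0.05, 0.12),
  stringsAsFactors = FALSE
)

# Rows without a partner are kept, and the missing side is filled with NA.
toy_full <- full_join(toy_vdem, toy_health, by = c("country_text_id", "year"))
print(toy_full)
```

The result has four rows: two matched rows, one row for CCC with a missing undernour, and one row for DDD with a missing v2x_polyarchy.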

The UN health data had information on more countries than are included in the VDEM data (countries with populations smaller than 1 million people, primarily Pacific and Caribbean island countries), and includes data on countries prior to their independence (for example, on Armenia before the breakup of the Soviet Union). The VDEM data includes historical countries (such as South Yemen and North Vietnam) that are not in the UN health data. I would normally go back to the codebook for the UN data to determine whether the coders separated data for the Soviet Union into each republic to see if countries need to be reconstructed, but in this case I do not have that codebook and choose to exclude data on countries that do not exist in the other dataset. This will eliminate small countries, historical countries, and countries prior to independence from the dataset.

vdem_un <- drop_na(vdem_un, country)
vdem_un <- drop_na(vdem_un, country_name)
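
Dropping the rows that are missing either country name keeps only the observations present in both datasets. A sketch on toy data (hypothetical codes and names) suggests this gives the same rows as dplyr's inner_join, assuming neither name column has missing values before the merge:

```r
library(dplyr)
library(tidyr)

left <- data.frame(
  country_text_id = c("AAA", "BBB", "CCC"),
  country_name    = c("A-land", "B-land", "C-land"),
  stringsAsFactors = FALSE
)
right <- data.frame(
  country_text_id = c("AAA", "BBB", "DDD"),
  country         = c("A-land", "B-land", "D-land"),
  stringsAsFactors = FALSE
)

# Route 1: full join, then drop rows missing either name column.
via_full <- full_join(left, right, by = "country_text_id")
via_full <- drop_na(via_full, country)
via_full <- drop_na(via_full, country_name)

# Route 2: inner join keeps only rows that match in both tables.
via_inner <- inner_join(left, right, by = "country_text_id")
```

The two-step route has the advantage that the unmatched rows can be inspected before they are discarded.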

The last step is to rearrange the columns and rows. I first drop the extra country name variable, which leaves the columns in order. I then use arrange() to sort the rows by country name and year.

vdem_un <- select(vdem_un, -country)
vdem_un <- arrange(vdem_un, country_name, year)

In order to find out how many countries remain in the dataset, and how many years are covered for each country, I use the “count” command:

count(vdem_un, country_name)
## # A tibble: 161 × 2
##    country_name     n
##          <fctr> <int>
## 1   Afghanistan    56
## 2       Albania    53
## 3       Algeria    56
## 4        Angola    53
## 5     Argentina    56
## 6       Armenia    26
## 7     Australia    55
## 8       Austria    53
## 9    Azerbaijan    26
## 10   Bangladesh    44
## # ... with 151 more rows

The dataset now has data for 161 countries for the years 1960-2015, although many countries are missing data for one or more years.

Part 2: Collapsing the data

I will now examine whether the percent undernourished in a country (undernour) is related to the country’s overall level of electoral democracy (v2x_polyarchy) and women’s empowerment (v2x_gender).

I will first examine the data by year. To do so, I group (group_by()) and provide summary statistics of (summarize()) the data by year.

vdem_un_year <- group_by(vdem_un, year)
vdem_un_year <- summarize(vdem_un_year, demmean=mean(v2x_polyarchy, na.rm=TRUE),
                            gendermean=mean(v2x_gender, na.rm=TRUE), 
                            nourmean=mean(undernour, na.rm=TRUE))
kable(vdem_un_year, digits=3)
year demmean gendermean nourmean
1960 0.329 0.412 NaN
1961 0.332 0.412 NaN
1962 0.330 0.410 NaN
1963 0.333 0.421 NaN
1964 0.330 0.429 NaN
1965 0.331 0.430 NaN
1966 0.332 0.437 NaN
1967 0.327 0.442 NaN
1968 0.325 0.444 NaN
1969 0.322 0.449 NaN
1970 0.322 0.467 NaN
1971 0.321 0.471 NaN
1972 0.321 0.468 NaN
1973 0.319 0.457 NaN
1974 0.319 0.470 NaN
1975 0.324 0.501 NaN
1976 0.328 0.509 NaN
1977 0.331 0.510 NaN
1978 0.336 0.511 NaN
1979 0.350 0.512 NaN
1980 0.356 0.522 NaN
1981 0.354 0.528 NaN
1982 0.354 0.527 NaN
1983 0.357 0.533 NaN
1984 0.362 0.533 NaN
1985 0.374 0.536 NaN
1986 0.380 0.545 NaN
1987 0.385 0.547 NaN
1988 0.393 0.554 NaN
1989 0.401 0.564 NaN
1990 0.437 0.598 NaN
1991 0.469 0.611 0.245
1992 0.493 0.617 0.248
1993 0.506 0.627 0.248
1994 0.514 0.637 0.248
1995 0.518 0.644 0.244
1996 0.524 0.645 0.240
1997 0.527 0.649 0.235
1998 0.528 0.657 0.232
1999 0.526 0.661 0.226
2000 0.529 0.677 0.220
2001 0.531 0.682 0.213
2002 0.541 0.688 0.208
2003 0.544 0.689 0.202
2004 0.546 0.695 0.196
2005 0.551 0.701 0.190
2006 0.555 0.702 0.183
2007 0.552 0.708 0.177
2008 0.558 0.712 0.171
2009 0.561 0.713 0.166
2010 0.561 0.717 0.161
2011 0.566 0.720 0.156
2012 0.567 0.723 0.151
2013 0.566 0.745 0.136
2014 0.555 0.775 0.134
2015 0.485 0.695 0.137

This data on its own does not provide much information (aside from the fact that data on undernourishment is not available before 1991) – the global levels of democracy and women’s empowerment have generally increased at similar rates over time while overall levels of undernourishment have decreased at a slower rate. Correlations between these variables are unclear.
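
The NaN entries in the table arise because mean() with na.rm=TRUE has nothing left to average when every value in a group is missing. The sketch below reproduces this on invented data and adds a count of non-missing observations; the n_obs column is my own addition for illustration, not part of the analysis above.

```r
library(dplyr)

# Toy data: no undernourishment values in 1990, one value in 1991.
toy <- data.frame(
  year      = c(1990, 1990, 1991, 1991),
  undernour = c(NA, NA, 0.2, NA)
)

by_year <- toy %>%
  group_by(year) %>%
  summarize(nourmean = mean(undernour, na.rm = TRUE),  # NaN if all values are NA
            n_obs    = sum(!is.na(undernour)))         # observations behind the mean
print(by_year)
```

Reporting the count alongside the mean makes it easier to see how many countries each yearly average is based on.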

To look at cross-country comparisons, I use the same commands to group the data by country and then generate summary statistics:

vdem_un_country <- group_by(vdem_un, country_name)
vdem_un_country <- summarize(vdem_un_country, demmean=mean(v2x_polyarchy, na.rm=TRUE),
                            gendermean=mean(v2x_gender, na.rm=TRUE), 
                            nourmean=mean(undernour, na.rm=TRUE))
kable(vdem_un_country, digits=3)
country_name demmean gendermean nourmean
Afghanistan 0.181 0.269 0.358
Albania 0.320 0.529 NaN
Algeria 0.243 0.394 0.070
Angola 0.102 0.472 0.440
Argentina 0.612 0.739 0.050
Armenia 0.453 0.642 0.153
Australia 0.898 0.791 NaN
Austria 0.881 0.830 NaN
Azerbaijan 0.252 0.477 0.141
Bangladesh 0.437 0.580 0.238
Barbados 0.584 0.823 0.053
Belarus 0.346 0.810 NaN
Belgium 0.815 0.852 NaN
Benin 0.403 0.697 0.190
Bhutan 0.127 0.516 NaN
Bolivia 0.479 0.519 0.300
Bosnia and Herzegovina 0.340 0.765 NaN
Botswana 0.602 0.544 0.309
Brazil 0.596 0.597 0.090
Bulgaria 0.421 0.771 NaN
Burkina Faso 0.405 0.705 0.238
Burma_Myanmar 0.150 0.213 0.406
Burundi 0.205 0.442 NaN
Cambodia 0.251 0.455 0.243
Cameroon 0.256 0.480 0.255
Canada 0.874 0.860 NaN
Cape Verde 0.477 0.683 0.153
Central African Republic 0.228 0.412 0.428
Chad 0.200 0.359 0.448
Chile 0.585 0.662 0.055
China 0.097 0.571 0.163
Colombia 0.467 0.538 0.104
Comoros 0.331 0.484 NaN
Congo_Republic of the 0.223 0.343 0.363
Costa Rica 0.867 0.782 0.055
Croatia 0.668 0.838 NaN
Cuba 0.108 0.617 0.082
Cyprus 0.661 0.636 NaN
Czech Republic 0.500 0.750 NaN
Denmark 0.920 0.926 NaN
Djibouti 0.201 0.250 0.487
Dominican Republic 0.506 0.692 0.250
Ecuador 0.548 0.527 0.170
Egypt 0.224 0.420 0.050
El Salvador 0.358 0.398 0.127
Eritrea 0.084 0.335 NaN
Estonia 0.911 0.904 NaN
Ethiopia 0.171 0.288 0.523
Fiji 0.434 0.571 0.052
Finland 0.887 0.911 NaN
France 0.893 0.823 NaN
Gabon 0.286 0.478 0.065
Gambia 0.430 0.487 0.137
Georgia 0.498 0.748 0.214
Germany 0.771 0.861 NaN
Ghana 0.438 0.661 0.163
Greece 0.693 0.773 NaN
Guatemala 0.342 0.288 0.170
Guinea 0.209 0.387 0.232
Guinea-Bissau 0.208 0.444 0.244
Guyana 0.442 0.764 0.129
Haiti 0.265 0.462 0.570
Honduras 0.395 0.492 0.181
Hungary 0.434 0.676 NaN
Iceland 0.878 0.802 NaN
India 0.707 0.574 0.188
Indonesia 0.348 0.443 0.158
Iran 0.192 0.296 0.057
Iraq 0.174 0.401 0.228
Ireland 0.866 0.755 NaN
Israel 0.733 0.739 NaN
Italy 0.815 0.753 NaN
Ivory Coast 0.304 0.483 0.136
Jamaica 0.554 0.819 0.081
Japan 0.874 0.676 NaN
Jordan 0.189 0.397 0.059
Kazakhstan 0.295 0.674 0.050
Kenya 0.315 0.373 0.301
Korea_North 0.092 0.278 0.354
Korea_South 0.534 0.592 0.050
Kyrgyzstan 0.352 0.643 0.120
Laos 0.123 0.317 0.346
Latvia 0.866 0.934 NaN
Lebanon 0.437 0.477 0.050
Lesotho 0.286 0.521 0.131
Liberia 0.306 0.456 0.363
Libya 0.117 0.239 NaN
Lithuania 0.868 0.917 NaN
Macedonia 0.529 0.719 NaN
Madagascar 0.331 0.514 0.335
Malawi 0.292 0.353 0.302
Malaysia 0.311 0.565 0.050
Maldives 0.223 0.527 0.111
Mali 0.376 0.581 0.123
Mauritania 0.271 0.413 0.115
Mauritius 0.730 0.590 0.063
Mexico 0.449 0.530 0.055
Moldova 0.603 0.781 NaN
Mongolia 0.440 0.636 0.347
Montenegro 0.516 0.797 NaN
Morocco 0.210 0.449 0.060
Mozambique 0.241 0.538 0.409
Namibia 0.346 0.409 0.340
Nepal 0.236 0.329 0.177
Netherlands 0.894 0.870 NaN
New Zealand 0.880 0.841 NaN
Nicaragua 0.411 0.534 0.332
Niger 0.365 0.549 0.216
Nigeria 0.312 0.588 0.096
Norway 0.905 0.922 NaN
Pakistan 0.293 0.469 0.231
Panama 0.426 0.577 0.227
Papua New Guinea 0.410 0.407 NaN
Paraguay 0.357 0.382 0.136
Peru 0.506 0.623 0.198
Philippines 0.448 0.714 0.191
Poland 0.519 0.772 NaN
Portugal 0.684 0.692 NaN
Qatar 0.041 0.297 NaN
Russia 0.259 0.602 NaN
Rwanda 0.217 0.518 0.491
Sao Tome and Principe 0.331 0.677 0.166
Saudi Arabia 0.020 0.145 0.050
Senegal 0.562 0.653 0.239
Serbia 0.297 0.705 NaN
Seychelles 0.337 0.654 NaN
Sierra Leone 0.325 0.367 0.361
Slovakia 0.734 0.873 NaN
Slovenia 0.810 0.882 NaN
Solomon Islands 0.490 0.352 0.147
Somalia 0.191 0.240 NaN
South Africa 0.389 0.448 0.051
South Sudan 0.219 0.496 NaN
Spain 0.651 0.703 NaN
Sri Lanka 0.578 0.616 0.289
Sudan 0.203 0.284 NaN
Suriname 0.658 0.694 0.120
Swaziland 0.118 0.382 0.201
Sweden 0.893 0.919 NaN
Switzerland 0.867 0.839 NaN
Syria 0.155 0.439 NaN
Tajikistan 0.248 0.523 0.362
Tanzania 0.356 0.659 0.337
Thailand 0.327 0.628 0.174
Togo 0.251 0.574 0.289
Trinidad and Tobago 0.655 0.760 0.125
Tunisia 0.222 0.593 0.050
Turkey 0.529 0.466 0.050
Turkmenistan 0.153 0.385 0.074
Uganda 0.264 0.586 0.257
Ukraine 0.508 0.788 NaN
United Kingdom 0.873 0.769 NaN
United States 0.838 0.793 NaN
Uruguay 0.699 0.751 0.053
Uzbekistan 0.177 0.420 0.089
Vanuatu 0.480 0.756 0.084
Venezuela 0.716 0.713 0.122
Vietnam_Democratic Republic of 0.194 0.599 0.267
Yemen 0.194 0.192 0.286
Zambia 0.377 0.534 0.444
Zimbabwe 0.258 0.363 0.406

Breaking the data down by country shows us that while the average levels of democracy and gender empowerment in a country are often similar, in some countries (particularly current or former socialist countries) they diverge (general political rights may be limited, but women have higher degrees of equality within them). The correlation with undernourishment is unclear, except to note that this data is missing for many countries, notably high-income countries. As with the annual data, levels of women’s empowerment are generally higher than levels of democracy.

Part 3: Graphics

1. Scatterplot by year

To better understand the correlation between electoral democracy (v2x_polyarchy) and undernourishment (undernour), I will graph the data. First, I will create a smaller version of the dataset covering only the relevant years (1991-2015). Then I will create a grid of scatterplots by year, with best fit lines and 95% confidence intervals.

dem_nour <- filter(vdem_un, year>1990)
g1 <- ggplot(dem_nour, aes(x=v2x_polyarchy, y=undernour)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap( ~ year)
g1

These graphs show us that, for the limited number of countries for which data is available, the already weak correlation between regime type and undernourishment has weakened further over time. While in the early 1990s there was a weak but clear correlation between democracy score and percent undernourished, by 2015 this relationship had disappeared. This correlation might appear stronger, however, if high-income countries were included in the available data. In addition, it appears that the overall number of countries for which data is available decreases over time as well.
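
One way to put numbers on what the scatterplots suggest would be to compute the correlation coefficient separately for each year. The sketch below does this on invented data with the same column names. It is an illustration of the approach, not a result from the merged dataset.

```r
library(dplyr)

# Toy data: a strong negative relationship in 1991, almost none in 2015.
toy <- data.frame(
  year          = rep(c(1991, 2015), each = 4),
  v2x_polyarchy = c(0.1, 0.3, 0.6, 0.9, 0.1, 0.3, 0.6, 0.9),
  undernour     = c(0.40, 0.35, 0.20, 0.10, 0.15, 0.12, 0.16, 0.13)
)

cor_by_year <- toy %>%
  group_by(year) %>%
  summarize(r = cor(v2x_polyarchy, undernour, use = "complete.obs"),
            n = sum(!is.na(v2x_polyarchy) & !is.na(undernour)))
print(cor_by_year)
```

With use = "complete.obs", cor() raises an error for a group that has no complete pairs, so on the real data the years before 1991 would need to be filtered out first, as is done for dem_nour.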

2. Scatterplot by country

I will now graph the same data by country instead of by year.

g2 <- ggplot(dem_nour, aes(x=v2x_polyarchy, y=undernour)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap( ~ country_name, ncol=5)
g2

These graphs show us that for most countries, the level of undernourishment does not change with the level of democracy – the level of undernourishment remains fairly constant even when there is a change in the level of democracy. There are a few countries that are exceptions to this trend – countries such as Angola and Georgia experienced lower levels of malnourishment when they had higher levels of electoral democracy, while a handful of countries (Mongolia, Zambia) experienced slightly higher levels of malnourishment with higher levels of electoral democracy.

3. Time series line plot

To show change over time for these variables within individual countries, I create time series graphs. I have chosen Angola, Mongolia, and Thailand as sample countries because these three countries demonstrate three different trends in the cross-national data for democracy and undernourishment. These graphs will show change in electoral democracy score.

graphdata <- filter(vdem_un, country_name=="Angola" | country_name=="Mongolia" | country_name=="Thailand")
g3 <- ggplot(graphdata, aes(x=year, y=v2x_polyarchy, group=country_name, color=country_name)) +
  geom_line() 
g3

This graph shows the variations in types of political transitions around the world. Some countries (such as Mongolia) make a stable transition from an authoritarian system to a democratic system. Others, such as Thailand, have experienced a series of coups that have moved the country repeatedly between democracy and non-democracy. Other countries like Angola may experience transitions (such as the 1991 accords that were designed to create free and fair elections) but remain autocracies (in this case, due to civil war).

4. Time series graph with plotly

I recreate the above graph using the “plotly” package.

plot_ly(graphdata, x = ~year, y = ~v2x_polyarchy, color = ~country_name, type = "scatter", mode = "lines")

Plotly produces the same graph, but allows you to view the exact data along the line with an interactive feature.

5. Motion graph

With GoogleVis, I can graph the change in malnourishment and democracy over time in my three sample countries. This graph takes a long time to load, but it will appear in a separate browser window when you run it.

M1 <- gvisMotionChart(graphdata, idvar = "country_name", timevar = "year", xvar = "v2x_polyarchy", yvar = "undernour", colorvar = "country_name", options=list(width=600, height=400))
plot(M1)
## starting httpd help server ...
##  done
print(M1, "chart")

6. Interactive map

GoogleVis can also plot a world map with color variations based on the level of democracy in a country. To develop this graph, I first select one year for analysis (in this case, 2010). On this graph, darker shades of green correspond to higher levels of democracy.

data2010 <- filter(vdem_un, year==2010)
M2 <- gvisGeoChart(data2010, locationvar = "country_name", colorvar = "v2x_polyarchy", options=list(width=600, height=400))
plot(M2)
print(M2, "chart")

7. 3D scatterplot

Finally, it is possible to graph both the democracy score and level of malnourishment for my three sample countries on the same 3D graph.

plot_ly(graphdata, x = ~year, y = ~v2x_polyarchy, z = ~undernour,
        type = "scatter3d", color = ~country_name)
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

While the relationship for Thailand is unclear, it is possible to see that both Angola’s and Mongolia’s levels of undernourishment have decreased over time, despite a lack of significant change in their democracy scores during the time period under consideration.