In this section, lets grab all the data from gapminder and tidy it into one dataframe that we’ll then use to create the models.
Predictor Variables
Lets download, join, and tidy the two data sources we’ll be looking at for our predictor variables. Both are csv files that contain metrics by country and year.
Since all of the gapminder datasets seem to be in the same format, with countries as rows and years as columns, lets use a function to unpivot them into country and year in one line, since those will be the observations we’ll be using.
unpivot <- function(src_df, metric) {
df <- gather(src_df, year, val, -country) %>%
filter(!is.na(val)) %>%
mutate(year = as.numeric(str_replace_all(year, 'X', '')))
names(df)[3] <- metric
return(df)
}
# Unpivot Raw Data
l_rate_tall <- unpivot(l_rate, 'literacy_rate')
pc_rate_tall <- unpivot(pc_rate, 'pschool_crate')
e_rate_tall <- unpivot(e_rate, 'pschool_erate')
# Combine Data for Predictor Varaibles Dataframe
pv_df <- l_rate_tall %>%
full_join(pc_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
full_join(e_rate_tall, by = c('country' = 'country', 'year' = 'year'))
head(pv_df)
## country year literacy_rate pschool_crate pschool_erate
## 1 Burkina Faso 1975 8.69 7.63 NA
## 2 Central African Republic 1975 18.20 36.20 NA
## 3 Kuwait 1975 59.60 60.60 NA
## 4 Turkey 1975 61.60 NA NA
## 5 United Arab Emirates 1975 53.50 41.70 NA
## 6 Uruguay 1975 93.90 NA NA
## [1] 144
It looks like there’s only 144 country-year observations with all 3 metrics, which may make it difficult to use all three predictor variables in one model.
Lets evaluate which variables have enough data that we can use:
pvars <- pv_df %>%
mutate(l = ifelse(is.na(literacy_rate), 0, 1),
c = ifelse(is.na(pschool_crate), 0, 1),
e = ifelse(is.na(pschool_erate), 0, 1),
n = l + c + e) %>%
group_by(n) %>%
summarize(ns = n(),
ls = sum(l),
cs = sum(c),
es = sum(e)) %>%
arrange(desc(n))
pvars
## # A tibble: 3 x 5
## n ns ls cs es
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 3 144 144 144 144
## 2 2 1303 256 1274 1076
## 3 1 3684 161 3197 326
This is a table that shows the instances of the variables.
- n = the number of variables, out of 3, that exist per observation.
There are 144 observations with all 3 varaibles, 1303 observations with 2 variables, and 3684 with only 1 variable.
## Number of Rows for literacy rate dataframe: 561
## Number of Distinct Countries: 150
## Number of Rows for primary school completion rate dataframe: 4615
## Number of Distinct Countries: 185
## Number of Rows for primary school expenditure rate dataframe: 1546
## Number of Distinct Countries: 163
It looks like the primary school completion rate has the most observations, but we’ll still need to determine which variables can be used based on the country-year match in the response variables.
Murder and Suicide
For most people, the idea of murder and suicide is a rare occurence in civilized society. We hear about these acts of violence rarely with people we know first hand, and unfortunately quite often in the news.
It wouldn’t be a huge leap to say that people who are more educated are less likely to be involved in these kinds of situations, but we’ll take a look at the data to see if this association can be supported by data.
Preparing the Data
Below, we’ll import the data from gapminder, tidy it, and add the variables to our combined predictor variable dataset.
## country year literacy_rate pschool_crate pschool_erate
## 1 Burkina Faso 1975 8.69 7.63 NA
## 2 Central African Republic 1975 18.20 36.20 NA
## 3 Kuwait 1975 59.60 60.60 NA
## 4 Turkey 1975 61.60 NA NA
## 5 United Arab Emirates 1975 53.50 41.70 NA
## 6 Uruguay 1975 93.90 NA NA
## murder_rate suicide_rate
## 1 NA NA
## 2 NA NA
## 3 1.64 0.504
## 4 NA NA
## 5 NA NA
## 6 2.96 9.930
Vaccination Rate
Childhood vaccinations are one way in which successful societies protect their population. Here in the United States, vaccinations are received for a variety of potential ailments. As any of these could serve as a proxy for society wellness, we pulled in each of these datasets to see which was the most complete.
# DTP vaccine percentage in 1 year olds
dtp_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/dtp3_immunized_percent_of_one_year_olds.csv',
stringsAsFactors = F)
# Measels vaccine percentage in 1 year olds
MCV_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/mcv_immunized_percent_of_one_year_olds.csv',
stringsAsFactors = F)
# Teatenus vaccine percentage in newborns
PAB_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/pab_immunized_percent_of_newborns.csv',
stringsAsFactors = F)
# Hepatitis vaccine percentage in 1 year olds
hepb3_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/hepb3_immunized_percent_of_one_year_olds.csv',
stringsAsFactors = F)
# Let's use the same function we created earlier to unpivot year columns
dtp_rate_tall <- unpivot(dtp_rate, 'dtp_rate')
MCV_rate_tall <- unpivot(MCV_rate, 'MCV_rate')
PAB_rate_tall <- unpivot(PAB_rate, 'PAB_rate')
hepb3_rate_tall <- unpivot(hepb3_rate, 'hepb3_rate')
dfVax <- pv_df %>%
full_join(dtp_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
full_join(MCV_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
full_join(PAB_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
full_join(hepb3_rate_tall, by = c('country' = 'country', 'year' = 'year'))
summary(dfVax)
## country year literacy_rate pschool_crate
## Length:7369 Min. :1970 Min. : 8.69 Min. : 1.52
## Class :character 1st Qu.:1987 1st Qu.: 64.90 1st Qu.: 62.50
## Mode :character Median :1997 Median : 85.10 Median : 90.70
## Mean :1997 Mean : 77.04 Mean : 79.38
## 3rd Qu.:2007 3rd Qu.: 94.10 3rd Qu.: 98.70
## Max. :2019 Max. :100.00 Max. :135.00
## NA's :6808 NA's :2754
## pschool_erate dtp_rate MCV_rate PAB_rate
## Min. : 0.235 Min. : 0.00 Min. : 0.00 Min. : 1.00
## 1st Qu.:10.500 1st Qu.:66.00 1st Qu.:62.00 1st Qu.:42.00
## Median :15.000 Median :86.00 Median :84.00 Median :67.00
## Mean :15.791 Mean :76.87 Mean :75.49 Mean :60.31
## 3rd Qu.:20.000 3rd Qu.:95.00 3rd Qu.:94.00 3rd Qu.:83.00
## Max. :65.100 Max. :99.00 Max. :99.00 Max. :99.00
## NA's :5823 NA's :1817 NA's :1937 NA's :4479
## hepb3_rate
## Min. : 1.00
## 1st Qu.:77.00
## Median :91.00
## Mean :82.47
## 3rd Qu.:96.00
## Max. :99.00
## NA's :5106
Looking at the quantity of NAs in each variable in the summary, it is clear there is significantly more data on dtp and measles vaccinations. For these reasons we will exclude the PAB and hepb vaccinations.
## country year literacy_rate pschool_crate pschool_erate
## 1 Burkina Faso 1975 8.69 7.63 NA
## 2 Central African Republic 1975 18.20 36.20 NA
## 3 Kuwait 1975 59.60 60.60 NA
## 4 Turkey 1975 61.60 NA NA
## 5 United Arab Emirates 1975 53.50 41.70 NA
## 6 Uruguay 1975 93.90 NA NA
## murder_rate suicide_rate dtp_rate MCV_rate
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 1.64 0.504 NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 2.96 9.930 NA NA
Inequality
The Gini index, a measure of inequality, is another metric in this dataset we wanted to explore. The Gini index is built such that a value of 0 indicates that all members of a society have equal income, and 1 indicates that one individual earns all income in a society, while the other members are without any income. In this dataset, it appears that each country has a Gini index value for all years, making it ideal for our purposes.
Exploring the Data
Now that we have our working dataframe, lets make a few observations through visualizations before we dive into modeling.
Literacy Rate

The plot was split into 6 groups because it would be difficult to see all the lines overlapping on one chart.
Generally, countries are seeing higher rates of adult literacy over time. This may be a result of countries advancing and growing. There are some countries that seem to be declining in literacy, perhaps in areas of war? Let’s take a look.
## # A tibble: 9 x 6
## country min_yr max_yr literacy_rate.x literacy_rate.y change
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lesotho 2000 2009 86.3 75.8 -10.5
## 2 Kenya 2000 2007 82.2 72.2 -10
## 3 Madagascar 2000 2009 70.7 64.5 -6.2
## 4 Congo, Dem. Rep. 2001 2007 67.2 61.2 -6
## 5 Nigeria 1991 2008 55.5 51.1 -4.40
## 6 Zambia 1990 2007 65 61.4 -3.6
## 7 Albania 2001 2011 98.7 96.8 -1.9
## 8 Tonga 1976 2006 99.6 99 -0.600
## 9 Mongolia 2000 2011 97.8 97.4 -0.400
bind_rows(select(lr_mm, country, yr = min_yr, val = literacy_rate.x),
select(lr_mm, country, yr = max_yr, val = literacy_rate.y)) %>%
ggplot(aes(x = yr, y = val, color = country)) +
geom_line(size = 1) +
theme_bw() +
labs(title = 'Overall Declining Literacy Rates',
y = '% of Adults',
x = element_blank())

It seems like some countries in Africa are struggling with improving adult literacy. I’m not quite sure about the history of those countries, but its possibly a sampling error.
Albania, Tonga, and Mongolia also have shown a net decline, but almost all of the population is literate. There may just be a ceiling to literacy in any given country, given that some people are unable to learn for reasons other than infrastructure.
Primary School Completion Rate
# break countries in groups
group_count <- 8
cgrp <- distinct(pc_rate, country) %>%
mutate(grp = ntile(country, group_count))
# Primary School Completion Rates Over Time
inner_join(pc_rate_tall, cgrp, by = ('country' = 'country')) %>%
filter(grp <= 4) %>%
ggplot(aes(x = as.Date(ISOdate(year,1,1)), y = pschool_crate, color = country)) +
geom_smooth(se = F, method = 'lm') +
theme_bw() +
theme(legend.position = 'none') +
facet_wrap(~grp) +
labs(title = 'Primary School Completion Rate Over Time',
subtitle = 'Group 1 - 4',
y = '% of Adults',
x = element_blank())

# Primary School Completion Rates Over Time
inner_join(pc_rate_tall, cgrp, by = ('country' = 'country')) %>%
filter(grp >= 5) %>%
ggplot(aes(x = as.Date(ISOdate(year,1,1)), y = pschool_crate, color = country)) +
geom_smooth(se = F, method = 'lm') +
theme_bw() +
theme(legend.position = 'none') +
facet_wrap(~grp) +
labs(title = 'Primary School Completion Rate Over Time',
subtitle = 'Group 5 - 8',
y = '% of Adults',
x = element_blank())

Again, We split the countries into 8 groups since it would be too messy to view on one chart. The views are linear models of the data points for each country, which shows a general trend towards higher completion rate over time.
Primary School Expediture Rate (% of GDP)

For most countries, it looks almost like the rate of gdp expenditure for primary school remained relatively steady throughout the years. There do seem to be some countries that invested quite a bit into their childrens’ future. Let’s isolate some of those countries and take a look.

Here, we’re looking at all countries in the dataset that have at any point spent at least 1/3 of GDP per person on primary school education. It’s an arbitrary amount, but that’s 1/3 the value of each person towards furthering basic education. Cuba is quite impressive and seems like its still rising, where Ukraine is seeing the opposite effect.
Murder Rate (per 100k People)

It looks like most countries are grounded to low levels of murder, but at present, most countries are near zero. Since travel is quite common in this day and age, let’s look at the top and bottom countries for murder rate, so we know where or where not to plan our next trip.
Let’s look at the most recent year for every country and exclude anything from before a decade ago, or 2010.
# Look at only the latest year of data for each country
m_ly <- group_by(m_rate_tall, country) %>%
summarize(year = max(year)) %>%
inner_join(m_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
filter(year >= 2010) %>%
mutate(country_year = str_c(country, ' (', year, ')', sep = ''))
top_x <- 15
# Most Dangerous Countries
arrange(m_ly, desc(murder_rate))[1:top_x,] %>%
ggplot(aes(x = reorder(country_year, murder_rate), y = murder_rate)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(title = 'Countries With Highest Murder Rate',
y = 'Murders per 100K People',
x = element_blank())


Oman seems to be the safest country if you’re worried about being murdered. Keep mind that this is only one crime and that the numbers are reported or sampled by different methods, so this isn’t a list of safest countries, just a list of the countries that reported the lowest murder rate.
Suicide Rate

For most countries, the rate is pretty low, and for others it seems to peak. Lets pick out some of the countries that had a high level of suicides at one point and see what we find.

Take a look at Hungary– communism ended there in 1989, which coincides with the peak of suicides. Thankfully, the suicide rate there has been declining ever since.
Suriname, with a prominent peak of suicides in the 80’s, went through historical changes that seemed to coincide with these figures. A coup d’état and political uncertainty might contribute to these figures.
The other countries above likely to have their reasons why there’s so much psychological pressure within their borders, whether we can find them or not.