Data607 Final Project

Introduction

We want to research the association between basic education and different aspects of society. There’s a common understanding that education is beneficial for society and that, as people become more educated, society becomes more civilized. This is paired with the idea that, on a scale from animal instinct to conscious human action, it’s better for society if its members are more like wise sages and less like impulsive cavemen.

The data that we will use to explore these assumptions and their effects on different facets of society will be sourced from Gapminder.

To proxy or measure basic education, we’ll take a look at adult literacy rates, primary school completion rates, and primary school expenditure per student as a % of GDP per person. Primary school completion rates can be influenced by many factors, but our initial idea is that they are a suitable proxy for how much a society values, and is capable of, putting its children through school.

We’ll then be looking at a few other factors that we believe may be related and discuss the results of regression models.

Preparing the data

In this section, let’s tidy the data into one dataframe that we’ll then use to create the models.

All data has been downloaded from Gapminder.

Packages

These are the packages we’ll be using for this project:

library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(GGally)
library(caret)
library(tidymodels)

Explanatory Variables

Let’s download, join, and tidy the three data sources we’ll be looking at for our predictor variables. All are csv files that contain metrics by country and year.

# Literacy Rate
l_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/literacy_rate_adult_total_percent_of_people_ages_15_and_above.csv',
                   stringsAsFactors = F)
# Primary School Completion Rate
pc_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/primary_completion_rate_total_percent_of_relevant_age_group.csv',
                    stringsAsFactors = F)
# Primary School Expenditure Rate (% of GDP per Person)
e_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/expenditure_per_student_primary_percent_of_gdp_per_person.csv',
                   stringsAsFactors = F)

Since all of the Gapminder datasets seem to be in the same format, with countries as rows and years as columns, let’s use a function to unpivot them so that each row holds one country and one year, since those will be the observations we’ll be using.

unpivot <- function(src_df, metric) {
  df <- gather(src_df, year, val, -country) %>%
    filter(!is.na(val)) %>%
    mutate(year = as.numeric(str_replace_all(year, 'X', '')))
  
  names(df)[3] <- metric
  
  return(df)
}
# Unpivot Raw Data
l_rate_tall <- unpivot(l_rate, 'literacy_rate')
pc_rate_tall <- unpivot(pc_rate, 'pschool_crate')
e_rate_tall <- unpivot(e_rate, 'pschool_erate')
# Combine Data for Predictor Variables Dataframe
pv_df <- l_rate_tall %>%
  full_join(pc_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(e_rate_tall, by = c('country' = 'country', 'year' = 'year'))
head(pv_df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA
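The same reshaping can also be sketched with tidyr’s newer pivot_longer(), which supersedes gather(). This is a minimal sketch on a made-up two-country table; the function name unpivot_longer and the toy values are assumptions for illustration and are not part of the project data.

```r
library(tidyr)

# Toy stand-in for a Gapminder file: one row per country, one "X<year>" column per year
wide <- data.frame(
  country = c("A", "B"),
  X1975 = c(10.5, NA),
  X1976 = c(11.0, 20.0)
)

# Hypothetical variant of unpivot() built on pivot_longer()
unpivot_longer <- function(src_df, metric) {
  pivot_longer(
    src_df,
    cols = -country,
    names_to = "year",
    names_prefix = "X",                        # strip the "X" that read.csv adds
    names_transform = list(year = as.numeric), # year as a number, not text
    values_to = metric,
    values_drop_na = TRUE                      # drop country-years with no value
  )
}

tall <- unpivot_longer(wide, "literacy_rate")
print(tall)
```

One difference to keep in mind: pivot_longer() returns rows grouped by country, whereas gather() stacks them year by year, so head() of the result will show different rows even though the content is the same.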

# Observations with all 3 metrics
pv_df2 <- filter(pv_df, !is.na(literacy_rate) & !is.na(pschool_crate) & !is.na(pschool_erate))
nrow(pv_df2)

## [1] 144

It looks like there are only 144 country-year observations with all 3 metrics, which may make it difficult to use all three predictor variables in one model.

Let’s evaluate which variables have enough data that we can use:

pvars <- pv_df %>%
  mutate(l = ifelse(is.na(literacy_rate), 0, 1),
          c = ifelse(is.na(pschool_crate), 0, 1),
          e = ifelse(is.na(pschool_erate), 0, 1),
          n = l + c + e) %>%
  group_by(n) %>%
  summarize(ns = n(),
            ls = sum(l),
            cs = sum(c),
            es = sum(e)) %>%
  arrange(desc(n))
pvars

## # A tibble: 3 x 5
##       n    ns    ls    cs    es
##   <dbl> <int> <dbl> <dbl> <dbl>
## 1     3   144   144   144   144
## 2     2  1303   256  1274  1076
## 3     1  3684   161  3197   326

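This table shows how often each variable is present.

n = the number of variables, out of 3, that exist per observation. ns = the number of observations with that count. ls, cs, and es = how many of those observations have a literacy rate, a completion rate, and an expenditure rate, respectively.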

There are 144 observations with all 3 variables, 1303 observations with 2 variables, and 3684 with only 1 variable.
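The availability count above can be reproduced on a small scale. The sketch below uses four made-up country-year rows (the values are assumptions for illustration only) and counts, for each row, how many of the three metrics are present.

```r
library(dplyr)

# Made-up country-year rows; NA marks a metric that was not reported
toy <- data.frame(
  country       = c("A", "A", "B", "C"),
  year          = c(2000, 2001, 2000, 2000),
  literacy_rate = c(90, NA, 55, NA),
  pschool_crate = c(95, 96, NA, 70),
  pschool_erate = c(12, NA, NA, 15)
)

metrics <- c("literacy_rate", "pschool_crate", "pschool_erate")

# n = how many of the three metrics are present in each row
toy$n <- rowSums(!is.na(toy[metrics]))

# Rows per value of n, plus how often each metric is present within that group
coverage <- toy %>%
  group_by(n) %>%
  summarize(ns = n(),
            across(all_of(metrics), ~ sum(!is.na(.x)))) %>%
  arrange(desc(n))
print(coverage)
```

Here one row has all three metrics, one has two, and two have only one, which mirrors the shape of the table above at a much smaller scale.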

cat('Number of Rows for literacy rate dataframe:', nrow(l_rate_tall), '\nNumber of Distinct Countries:', n_distinct(l_rate_tall$country))

## Number of Rows for literacy rate dataframe: 561 
## Number of Distinct Countries: 150

cat('Number of Rows for primary school completion rate dataframe:', nrow(pc_rate_tall), '\nNumber of Distinct Countries:', n_distinct(pc_rate_tall$country))

## Number of Rows for primary school completion rate dataframe: 4615 
## Number of Distinct Countries: 185

cat('Number of Rows for primary school expenditure rate dataframe:', nrow(e_rate_tall), '\nNumber of Distinct Countries:', n_distinct(e_rate_tall$country))

## Number of Rows for primary school expenditure rate dataframe: 1546 
## Number of Distinct Countries: 163

It looks like the primary school completion rate has the most observations, but we’ll still need to determine which variables can be used based on the country-year match in the response variables.

Murder and Suicide

For most people, murder and suicide are rare occurrences in civilized society. We rarely hear about these acts of violence among people we know first hand, and unfortunately quite often in the news.

It wouldn’t be a huge leap to say that people who are more educated are less likely to be involved in these kinds of situations, but we’ll take a look at the data to see if this association holds up.

# Murder Rate per 100k People
m_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/murder_per_100000_people.csv',
                   stringsAsFactors = F)
# Suicide Rate per 100k People
s_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/suicide_per_100000_people.csv',
                    stringsAsFactors = F)  
# Let's use the same function we created earlier to unpivot year columns
m_rate_tall <- unpivot(m_rate, 'murder_rate')
s_rate_tall <- unpivot(s_rate, 'suicide_rate')
df <- pv_df %>%
  full_join(m_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(s_rate_tall, by = c('country' = 'country', 'year' = 'year'))
head(df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA
##   murder_rate suicide_rate
## 1          NA           NA
## 2          NA           NA
## 3        1.64        0.504
## 4          NA           NA
## 5          NA           NA
## 6        2.96        9.930

Vaccination Rate

Childhood vaccinations are one way in which successful societies protect their population. Here in the United States, vaccinations are received for a variety of potential ailments. As any of these could serve as a proxy for society wellness and mindfulness, we pulled in each of these datasets to see which was the most complete.

# DTP vaccine percentage in 1 year olds
DTP_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/dtp3_immunized_percent_of_one_year_olds.csv',
                   stringsAsFactors = F)
# Measles vaccine percentage in 1 year olds
MCV_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/mcv_immunized_percent_of_one_year_olds.csv',
                    stringsAsFactors = F)  
# Tetanus vaccine percentage in newborns
PAB_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/pab_immunized_percent_of_newborns.csv',
                    stringsAsFactors = F)  
# Hepatitis vaccine percentage in 1 year olds
hepb3_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/hepb3_immunized_percent_of_one_year_olds.csv',
                    stringsAsFactors = F)  
# Let's use the same function we created earlier to unpivot year columns
DTP_rate_tall <- unpivot(DTP_rate, 'DTP_rate')
MCV_rate_tall <- unpivot(MCV_rate, 'MCV_rate')
PAB_rate_tall <- unpivot(PAB_rate, 'PAB_rate')
hepb3_rate_tall <- unpivot(hepb3_rate, 'hepb3_rate')
dfVax <- pv_df %>%
  full_join(DTP_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(MCV_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(PAB_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(hepb3_rate_tall, by = c('country' = 'country', 'year' = 'year'))
summary(dfVax)

##    country               year      literacy_rate    pschool_crate   
##  Length:7369        Min.   :1970   Min.   :  8.69   Min.   :  1.52  
##  Class :character   1st Qu.:1987   1st Qu.: 64.90   1st Qu.: 62.50  
##  Mode  :character   Median :1997   Median : 85.10   Median : 90.70  
##                     Mean   :1997   Mean   : 77.04   Mean   : 79.38  
##                     3rd Qu.:2007   3rd Qu.: 94.10   3rd Qu.: 98.70  
##                     Max.   :2019   Max.   :100.00   Max.   :135.00  
##                                    NA's   :6808     NA's   :2754    
##  pschool_erate       DTP_rate        MCV_rate        PAB_rate    
##  Min.   : 0.235   Min.   : 0.00   Min.   : 0.00   Min.   : 1.00  
##  1st Qu.:10.500   1st Qu.:66.00   1st Qu.:62.00   1st Qu.:42.00  
##  Median :15.000   Median :86.00   Median :84.00   Median :67.00  
##  Mean   :15.791   Mean   :76.87   Mean   :75.49   Mean   :60.31  
##  3rd Qu.:20.000   3rd Qu.:95.00   3rd Qu.:94.00   3rd Qu.:83.00  
##  Max.   :65.100   Max.   :99.00   Max.   :99.00   Max.   :99.00  
##  NA's   :5823     NA's   :1817    NA's   :1937    NA's   :4479   
##    hepb3_rate   
##  Min.   : 1.00  
##  1st Qu.:77.00  
##  Median :91.00  
##  Mean   :82.47  
##  3rd Qu.:96.00  
##  Max.   :99.00  
##  NA's   :5106

Looking at the quantity of NAs in each variable in the summary, it is clear there is significantly more data on DTP and measles vaccinations. For this reason we will exclude the PAB and HepB vaccination datasets.

df <- df %>%
  full_join(DTP_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(MCV_rate_tall, by = c('country' = 'country', 'year' = 'year'))
head(df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA
##   murder_rate suicide_rate DTP_rate MCV_rate
## 1          NA           NA       NA       NA
## 2          NA           NA       NA       NA
## 3        1.64        0.504       NA       NA
## 4          NA           NA       NA       NA
## 5          NA           NA       NA       NA
## 6        2.96        9.930       NA       NA

Inequality

The Gini coefficient, a measure of inequality, is another metric in this dataset we wanted to explore. The Gini coefficient is on a scale from 0 to 100, with higher numbers implying greater inequality. In this dataset, it appears that each country has a Gini coefficient value for all years, making it ideal for our purposes.
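To make the scale concrete, here is a small worked sketch of one common definition of the Gini coefficient: the mean absolute difference between all pairs of incomes, divided by twice the mean income, scaled to 0 to 100. The function name gini and the income vectors are assumptions for illustration; how Gapminder estimates its published values is not described in this report.

```r
# Gini coefficient on a 0-100 scale from a vector of incomes
gini <- function(x) {
  n <- length(x)
  # Sum of |x_i - x_j| over all ordered pairs, normalised by 2 * n^2 * mean
  100 * sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

equal_society   <- c(10, 10, 10, 10)  # everyone earns the same
unequal_society <- c(0, 0, 0, 100)    # one person earns everything

print(gini(equal_society))
print(gini(unequal_society))
```

With perfectly equal incomes the value is 0. When one of four people holds all income the value is 75, and it approaches 100 as the population grows.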

Gini_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/gini.csv',
                    stringsAsFactors = F) 
Gini_rate_tall <- unpivot(Gini_rate, 'Gini_rate')
df <- df %>%
  full_join(Gini_rate_tall, by = c('country' = 'country', 'year' = 'year'))

Exploring the Data

Now that we have our working dataframe, let’s make a few observations through visualizations before we dive into modeling.

Literacy Rate

# break countries in groups
group_count <- 6
lgrp <- distinct(l_rate, country) %>%
  mutate(grp = ntile(country, group_count))
# Literacy Rates Over Time
inner_join(l_rate_tall, lgrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = literacy_rate, color = country)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Literacy Rate Over Time',
       caption = 'Each Line Is a Country',
       y = '% of Adults',
       x = element_blank())

The plot was split into 6 groups because it would be difficult to see all the lines overlapping on one chart.

Generally, countries are seeing higher rates of adult literacy over time. This may be a result of countries advancing and growing. There are some countries that seem to be declining in literacy, perhaps in areas of war? Let’s take a look.

lr_mm <- group_by(l_rate_tall, country) %>%
  summarize(min_yr = min(year),
            max_yr = max(year)) %>%
  left_join(l_rate_tall, by = c('country' = 'country', 'min_yr' = 'year')) %>%
  left_join(l_rate_tall, by = c('country' = 'country', 'max_yr' = 'year')) %>%
  mutate(change = literacy_rate.y - literacy_rate.x) %>%
  filter(change < 0) %>%
  arrange(change)
lr_mm

## # A tibble: 9 x 6
##   country          min_yr max_yr literacy_rate.x literacy_rate.y  change
##   <chr>             <dbl>  <dbl>           <dbl>           <dbl>   <dbl>
## 1 Lesotho            2000   2009            86.3            75.8 -10.5  
## 2 Kenya              2000   2007            82.2            72.2 -10    
## 3 Madagascar         2000   2009            70.7            64.5  -6.2  
## 4 Congo, Dem. Rep.   2001   2007            67.2            61.2  -6    
## 5 Nigeria            1991   2008            55.5            51.1  -4.40 
## 6 Zambia             1990   2007            65              61.4  -3.6  
## 7 Albania            2001   2011            98.7            96.8  -1.9  
## 8 Tonga              1976   2006            99.6            99    -0.600
## 9 Mongolia           2000   2011            97.8            97.4  -0.400
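The first-versus-last comparison can also be sketched without the two self-joins, by sorting each country’s rows by year and taking the first and last values. The toy data and column names first_val and last_val below are assumptions for illustration.

```r
library(dplyr)

# Made-up literacy series for two countries
toy <- data.frame(
  country       = c("A", "A", "A", "B", "B"),
  year          = c(2000, 2005, 2010, 1990, 2008),
  literacy_rate = c(80, 85, 78, 50, 60)
)

change <- toy %>%
  arrange(country, year) %>%
  group_by(country) %>%
  summarize(min_yr    = first(year),
            max_yr    = last(year),
            first_val = first(literacy_rate),
            last_val  = last(literacy_rate),
            .groups   = "drop") %>%
  mutate(change = last_val - first_val)

# Countries whose latest value is below their earliest value
declining <- filter(change, change < 0)
print(declining)
```

Only country A is flagged here, because its latest value (78) is below its earliest (80) even though it rose in between. The same caveat applies to the table above: a net decline says nothing about the path taken between the two endpoints.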

bind_rows(select(lr_mm, country, yr = min_yr, val = literacy_rate.x),
          select(lr_mm, country, yr = max_yr, val = literacy_rate.y)) %>%
  ggplot(aes(x = yr, y = val, color = country)) +
  geom_line(size = 1) +
  theme_bw() +
  labs(title = 'Overall Declining Literacy Rates',
       y = '% of Adults',
       x = element_blank())

It seems like some countries in Africa are struggling with improving adult literacy. I’m not quite sure about the history of those countries, but it’s possibly a sampling error.

Albania, Tonga, and Mongolia also have shown a net decline, but almost all of the population is literate. There may just be a ceiling to literacy in any given country, given that some people are unable to learn for reasons other than infrastructure.

Primary School Completion Rate

# break countries in groups
group_count <- 6
cgrp <- distinct(pc_rate, country) %>%
  mutate(grp = ntile(country, group_count))
# Primary School Completion Rates Over Time
inner_join(pc_rate_tall, cgrp, by = c('country' = 'country')) %>%
  ggplot(aes(x = as.Date(ISOdate(year,1,1)), y = pschool_crate, color = country)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Primary School Completion Rate Over Time',
       y = '% of Relevant Age Group',
       x = element_blank())

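Again, we split the countries into 6 groups since it would be too messy to view on one chart. Each line traces one country’s data points, and together they show a general trend towards higher completion rates over time. Rates above 100% are possible because the measure counts completers of any age against the population of the official completion age.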

pc_mm <- group_by(pc_rate_tall, country) %>%
  summarize(min_yr = min(year),
            max_yr = max(year)) %>%
  left_join(pc_rate_tall, by = c('country' = 'country', 'min_yr' = 'year')) %>%
  left_join(pc_rate_tall, by = c('country' = 'country', 'max_yr' = 'year')) %>%
  mutate(change = pschool_crate.y - pschool_crate.x) %>%
  filter(change < 0) %>%
  arrange(change)
pc_mm

## # A tibble: 29 x 6
##    country             min_yr max_yr pschool_crate.x pschool_crate.y change
##    <chr>                <dbl>  <dbl>           <dbl>           <dbl>  <dbl>
##  1 St. Kitts and Nevis   1992   2016           119              98.1 -20.9 
##  2 Maldives              1997   2017           118              97.4 -20.6 
##  3 Marshall Islands      1999   2016            90              70.9 -19.1 
##  4 Armenia               1994   2018           108              89.9 -18.1 
##  5 Trinidad and Tobago   1982   2010           113              94.9 -18.1 
##  6 St. Lucia             1983   2018           106              94.9 -11.1 
##  7 Brazil                2001   2004           112             101   -11   
##  8 Bahamas               1994   2018            86.3            76.5  -9.80
##  9 Nigeria               2000   2010            82.2            73.8  -8.4 
## 10 Bulgaria              1971   2017            96.8            89.8  -7   
## # ... with 19 more rows

bind_rows(select(pc_mm, country, yr = min_yr, val = pschool_crate.x),
          select(pc_mm, country, yr = max_yr, val = pschool_crate.y)) %>%
  ggplot(aes(x = yr, y = val, color = country)) +
  geom_line(size = 1) +
  theme_bw() +
  labs(title = 'Overall Declining Primary School Completion Rates',
       y = '% of Relevant Age Group',
       x = element_blank())

Primary School Expenditure Rate

(% of GDP per Person)

group_count <- 6
egrp <- distinct(e_rate, country) %>%
  mutate(grp = ntile(country, group_count))
inner_join(e_rate_tall, egrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = pschool_erate, color = country)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Primary School Expenditure Rate Over Time',
       y = '% of GDP per Person',
       x = element_blank())

For most countries, it looks like the rate of GDP expenditure for primary school remained relatively steady throughout the years. There do seem to be some countries that invested quite a bit into their children’s future. Let’s isolate some of those countries and take a look.

filter(e_rate_tall, pschool_erate >= 33.33) %>%
  distinct(country) %>%
  inner_join(e_rate_tall, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = pschool_erate, color = country)) +
  geom_line() +
  facet_wrap(~country) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = 'none') +
  geom_hline(yintercept = 33.33, linetype = 3) +
  labs(title = 'High Spenders on Primary School',
       subtitle = 'Expenditure Rate >= 33.33',
       y = 'Primary School Expenditure Rate',
       x = element_blank())

Here, we’re looking at all countries in the dataset that have at any point spent at least 1/3 of GDP per person on each primary school student. It’s an arbitrary threshold, but it means a third of the average person’s economic output goes towards one student’s basic education. Cuba is quite impressive and seems like it’s still rising, whereas Ukraine is seeing the opposite effect.

Murder Rate

(per 100k People)

ggplot(m_rate_tall, aes(x = year, y = murder_rate, color = country)) +
  geom_line() +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Murder Rate by Year by Country',
       x = element_blank(),
       y = 'Murders per 100k People')

It looks like most countries have stayed at low murder rates, and at present most are near zero. Since travel is quite common in this day and age, let’s look at the top and bottom countries for murder rate, so we know where or where not to plan our next trip.

Let’s look at the most recent year for every country and exclude any country whose latest data is from before 2010.

# Look at only the latest year of data for each country
m_ly <- group_by(m_rate_tall, country) %>%
  summarize(year = max(year)) %>%
  inner_join(m_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  filter(year >= 2010) %>%
  mutate(country_year = str_c(country, ' (', year, ')', sep = ''))
top_x <- 10
  
# Most Dangerous Countries
arrange(m_ly, desc(murder_rate))[1:top_x,] %>%
ggplot(aes(x = reorder(country_year, murder_rate), y = murder_rate)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = 'Countries With Highest Murder Rate',
       y = 'Murders per 100K People',
       x = element_blank())

# Least Dangerous Countries
arrange(m_ly, murder_rate)[1:top_x,] %>%
ggplot(aes(x = reorder(country_year, desc(murder_rate)), y = murder_rate)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = 'Countries With Lowest Murder Rate',
       y = 'Murders per 100K People',
       x = element_blank())

Oman seems to be the safest country if you’re worried about being murdered. Keep in mind that this is only one crime and that the numbers are reported or sampled by different methods, so this isn’t a list of the safest countries, just a list of the countries that reported the lowest murder rates.

Suicide Rate

(per 100k People)

group_count <- 6
sgrp <- distinct(s_rate, country) %>%
  mutate(grp = ntile(country, group_count))
inner_join(s_rate_tall, sgrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = suicide_rate, color = country)) +
  geom_line() +
  facet_wrap(~grp) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Suicide Rate by Year by Country',
       x = element_blank(),
       y = 'Suicide Rate Per 100K People')

For most countries, the rate is pretty low, and for others it seems to peak. Let’s pick out some of the countries that had a high level of suicides at one point and see what we find.

filter(s_rate_tall, suicide_rate >= 25) %>%
  distinct(country) %>%
  inner_join(s_rate_tall, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = suicide_rate, color = country)) +
  geom_line() +
  facet_wrap(~country) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = 'none') +
  labs(title = 'Top Suicide Rates',
       subtitle = '>= 25 / 100,000',
       y = 'Suicide Rate Per 100K People',
       x = element_blank())

Take a look at Hungary: communism ended there in 1989, which coincides with the peak of suicides. Thankfully, the suicide rate there has been declining ever since.

Suriname, with a prominent peak of suicides in the 1980s, went through historical changes that seem to coincide with these figures. A coup d’état and political uncertainty might have contributed to them.

The other countries above likely have their own reasons for so much psychological pressure within their borders, whether we can find them or not.

Vaccination rates

Let’s start by visualizing the raw data to see if there are any obvious differences in vaccination rates, and in how these rates change over time, for each of our vaccines: DTP and MCV.

Group_df <- dplyr::distinct(df, country)
Group_df$grp <- ntile(Group_df$country, 6)

inner_join(DTP_rate_tall, Group_df, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = DTP_rate, color = country)) +
  geom_line() +
  facet_wrap(~grp) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'DTP Vaccination Rate by Country',
       x = element_blank(),
       subtitle = 'Complete cycle received before 1st birthday')

inner_join(MCV_rate_tall, Group_df, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = MCV_rate, color = country)) +
  geom_line() +
  facet_wrap(~grp) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'MCV Vaccination Rate by Country',
       x = element_blank(),
       subtitle = 'Complete cycle received before 1st birthday')

As we see above, there is not much to distinguish these two vaccination rates. It makes sense that these are largely collinear, due to their history and rise in acceptance.

Let’s visualize the change in vaccination rate for 2 different groups, the most and least vaccinated countries. As our sample is around 195 countries, 20 countries in each of these groups should be sufficient.

#aggregate average vaccination rates across sample for both vaccines, compile into new DF. 
Vaccine_Sum <- data.frame(DTP_rate$country,rowMeans(DTP_rate[,2:33],na.rm = TRUE,dims=1),rowMeans(MCV_rate[,2:33],na.rm = TRUE,dims=1))
names(Vaccine_Sum)<-c("Country","Avg_DTP", "Avg_MCV")
Vaccine_Sum$Avg_Avg <- (Vaccine_Sum$Avg_DTP+Vaccine_Sum$Avg_MCV)/2
#Select top and bottom 20 countries, label as such. 
LVaxCounties <- top_n(Vaccine_Sum, 20, -Avg_Avg) 
LVaxCounties$HL <- "L"
HVaxCountries <- top_n(Vaccine_Sum, 20, Avg_Avg)
HVaxCountries$HL <- "H" 
EVaxCountries <- rbind(LVaxCounties, HVaxCountries)
DTP_rate_tall %>%
  left_join(EVaxCountries,by = c('country' = 'Country')) %>%
  filter(country %in% EVaxCountries$Country) %>%
ggplot(aes(x = year, y = DTP_rate, color = HL)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'DTP Vaccination Rate by Country',
       x = element_blank(),
       subtitle = '20 countries with lowest and highest rates of vaccinations')

MCV_rate_tall %>%
  left_join(EVaxCountries,by = c('country' = 'Country')) %>%
  filter(country %in% EVaxCountries$Country) %>%
ggplot(aes(x = year, y = MCV_rate, color = HL)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'MCV Vaccination Rate by Country',
       x = element_blank(),
       subtitle = '20 countries with lowest and highest rates of vaccinations')

Here we can see how the most vaccinated countries already had high vaccination rates starting at the beginning of this dataset in 1980. While there was a slight increase leading to present day, the growth achieved by the countries with the lowest vaccination rate is staggering.
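The averaging step above pairs the DTP and MCV row means by row position, so it relies on both files listing countries in the same order. A minimal sketch of an alternative that matches on country name instead is shown below; the toy tables and their values are assumptions for illustration only.

```r
library(dplyr)

# Made-up tall tables; note the two tables list countries in different orders
dtp_tall <- data.frame(
  country  = c("A", "A", "B", "B", "C"),
  year     = c(1980, 1981, 1980, 1981, 1980),
  DTP_rate = c(90, 94, 20, 40, 60)
)
mcv_tall <- data.frame(
  country  = c("B", "B", "A", "A", "C"),
  year     = c(1980, 1981, 1980, 1981, 1980),
  MCV_rate = c(30, 50, 88, 92, 70)
)

# Average each vaccine per country, then join the averages by country name
avg <- full_join(
  dtp_tall %>% group_by(country) %>% summarize(Avg_DTP = mean(DTP_rate)),
  mcv_tall %>% group_by(country) %>% summarize(Avg_MCV = mean(MCV_rate)),
  by = "country"
) %>%
  mutate(Avg_Avg = (Avg_DTP + Avg_MCV) / 2)

# Lowest and highest countries by combined average (n = 1 for the toy data)
lowest  <- slice_min(avg, Avg_Avg, n = 1)
highest <- slice_max(avg, Avg_Avg, n = 1)
print(avg)
```

Because the join is keyed on country, the result does not depend on how either source file happens to be sorted.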

Gini coefficient

For our inequality index, let’s again begin by viewing the raw data to see what stands out. It’s worth noting that this dataset begins in 1800, well outside the scope of our explanatory variables. Let’s limit our scope to something more reasonable. While many of our variables begin around 1970, let’s start around WW1 to get a full picture of how inequality has changed leading up to modern times.

inner_join(Gini_rate_tall, Group_df, by = c('country' = 'country')) %>%
  filter(year > 1910, year < 2020) %>%
  ggplot(aes(x = year, y = Gini_rate, color = country)) +
  geom_line() +
  facet_wrap(~grp) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Gini coefficient by Country',
       x = element_blank(),
       subtitle = 'Higher value implies greater inequality')

Here we see a potential issue with data quality. There are many countries whose inequality either never changed or remained static for large periods of time. It is possible that these values are approximated. What is unclear is whether this approximation is applied evenly for all countries or only in situations when better data was not available.

Let’s plot the change in inequality for our most dynamic countries. We will calculate the standard deviation of the inequality index over the entire sample (from 1800) and select the top 20 countries.

# Row-wise SD across the year columns only (the country column is excluded)
Gini_SD <- apply(Gini_rate[-1], 1, sd, na.rm = TRUE)
Gini_sum <- data.frame(Gini_rate$country, Gini_SD)
Gini_extreme <- top_n(Gini_sum, 20, Gini_SD) 
Gini_rate_tall %>%
  filter(country %in% Gini_extreme$Gini_rate.country) %>%
  ggplot(aes(x = year, y = Gini_rate, color = country)) +
  geom_line() +
  theme_bw() + 
  labs(title = 'Gini coefficient by Country',
       x = element_blank(),
       subtitle = 'Countries with most extreme range of values by SD')

As expected, we see wild swings in both directions. It’s noteworthy how much wider the variation in inequality is in the 1980s, and how narrow it is before WW1. I am not an expert in history; however, I was not surprised to see countries like South Africa and Iran, noted for their relatively recent or frequent shifts in power. That being said, seeing countries like Finland and the Netherlands on this list encourages me to learn more about their history.

As a final exploratory exercise with this data, let’s see which countries are the most similar. For the sake of discovery, we will remove countries that have remained nearly static in their inequality measure and examine a potentially more interesting set, keeping only those whose Gini coefficient has a standard deviation greater than 6.

names(Gini_rate) <- as.numeric(str_replace_all(names(Gini_rate), 'X', ''))
names(Gini_rate)[1] <- "country"
Gini_cor_DF <- data.frame(cor(t(Gini_rate[-1])))
Gini_cor_df <- Gini_cor_DF
names(Gini_cor_DF) <- Gini_rate$country
Gini_cor_DF$CountryA <- Gini_rate$country
Gini_cor_DF <- Gini_cor_DF %>% 
  pivot_longer(-CountryA,names_to = "CountryB",values_to = "Cor")%>%
  filter(CountryA != CountryB)%>%
  arrange(-Cor)
Gini_extreme2 <- filter(Gini_sum,Gini_SD>6) 
Gini_cor_DF_2 <- filter(Gini_cor_DF, CountryA %in% Gini_extreme2$Gini_rate.country) %>%
    filter(CountryA>CountryB) %>%
    filter(Cor > .9)
Gini_cor_DF_2$Grp <- as.numeric(row.names(Gini_cor_DF_2))
Gini_cor_DF_2 <- Gini_cor_DF_2 %>%
  pivot_longer(cols = c(CountryA, CountryB),values_to = "Country")
Gini_cor_DF_2 <- Gini_cor_DF_2[,-c(1,3)]
inner_join(Gini_rate_tall,Gini_cor_DF_2, by = c('country' = 'Country')) %>%
    filter(Grp<10) %>%
    ggplot(aes(x = year, y = Gini_rate, color = country)) +
  geom_line() +
  facet_wrap(~Grp) +
  theme_bw() + 
  labs(title = 'Similar Gini values',
       x = element_blank(),
       subtitle = 'Correlation over 90%, each with a SD > 6')

inner_join(Gini_rate_tall,Gini_cor_DF_2, by = c('country' = 'Country')) %>%
    filter(Grp<19, Grp>9) %>%
    ggplot(aes(x = year, y = Gini_rate, color = country)) +
  geom_line() +
  facet_wrap(~Grp) +
  theme_bw() + 
  labs(title = 'Similar Gini values',
       x = element_blank(),
       subtitle = 'Correlation over 90%, each with a SD > 6')

While it is possible (fairly likely) that some of these correlations are coincidental, this is an interesting visualization. Pairings like Greece and France seem significantly more plausible than the Netherlands and Moldova.
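The pairing logic above rests on one idea: with countries as rows and years as columns, transposing the table and calling cor() yields a country-by-country correlation matrix. A minimal sketch on three made-up series (the values and the 0.9 cutoff applied to them are for illustration only):

```r
# Made-up Gini series: rows are countries, columns are years
m <- rbind(
  A = c(30, 35, 40, 45),
  B = c(20, 24, 28, 32),  # rises in step with A
  C = c(50, 45, 40, 35)   # falls while A rises
)
colnames(m) <- 1990:1993

# cor() correlates columns, so transpose to correlate countries with each other
cors <- cor(t(m))

# Keep each pair once (upper triangle) and only strongly correlated pairs
idx <- which(upper.tri(cors) & cors > 0.9, arr.ind = TRUE)
pairs <- data.frame(
  CountryA = rownames(cors)[idx[, "row"]],
  CountryB = colnames(cors)[idx[, "col"]],
  Cor      = cors[idx]
)
print(pairs)
```

A and B come out as a pair even though their levels differ by 10 or more points, because correlation only measures whether two series move together. This is one reason a high correlation between two countries’ Gini values need not reflect any real connection.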

Models

In this section, let’s create models to see if we can predict certain behaviors based on our basic education data.

Pairs Plot

ggpairs(df, columns = 3:ncol(df))

It’s no surprise that literacy rate and primary school completion rate have a strong correlation. Reading is taught in primary school, and even with the gap in time between when a person completes primary school and when they’re considered an adult, we can assume that the primary school completion rate is a proxy for the effective value of basic education within a nation in any given year.

Interestingly enough, there’s a weak correlation between adult literacy rate and the suicide rate. It’s a bit scary to think that there’s a link between being able to read and suicide. There’s also a weak negative correlation between the primary school completion rate and the murder rate.

Murder

Let’s see if murder can be predicted by basic education statistics. We’ll try multiple regression and then use backwards elimination to get our final model.

mm <- lm(murder_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(mm)$r.squared

## [1] 0.03867512

This model explains less than 4% of the variance in murder rate, so it is of little practical use. It’s not exactly a surprise that literacy and information about primary school are weak indicators for murder.

What would a model look like with just primary school completion rate, which showed a weak correlation?

mm2 <- lm(murder_rate ~ pschool_crate, data = df)
summary(mm2)$r.squared

## [1] 0.1062743

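This simple linear regression has a higher \(R^2\), but it’s still not robust enough to be useful: with an \(R^2\) of about .1, only around 10% of the variance in murder rate is explained by the model. Note that lm() drops rows with missing values, so mm and mm2 are fit on different sets of observations and their \(R^2\) values are not strictly comparable.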
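To make the “variance explained” reading concrete, here is a minimal sketch that computes \(R^2\) by hand as one minus the ratio of residual to total sum of squares. The data is simulated, and the names completion and murder are hypothetical stand-ins for pschool_crate and murder_rate, not our actual columns.

```r
# Simulated data: hypothetical stand-ins for pschool_crate and murder_rate
set.seed(1)
completion <- runif(200, 40, 100)
murder <- 20 - 0.15 * completion + rnorm(200, sd = 6)

fit_r2 <- lm(murder ~ completion)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit_r2)^2)
ss_tot <- sum((murder - mean(murder))^2)
r2_manual <- 1 - ss_res / ss_tot

# Both lines print the same value
r2_manual
summary(fit_r2)$r.squared
```

The hand calculation and summary() agree, which is all the “10% of the variance” statement means: the residual sum of squares is about 90% of the total.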

# Add predictions and residuals from the simple regression (mm2)
df$pred <- predict(mm2, df)
df$resid <- df$murder_rate - df$pred
filter(df, !is.na(murder_rate)) %>%
  ggplot(aes(x = pred, y = resid)) +
  geom_point(alpha = .1) +
  geom_hline(yintercept = 0, linetype = 3) +
  theme_bw() +
  labs(title = 'Predictions vs Residuals',
       subtitle = 'Murder Rate Simple Linear Regression Model Evaluation',
       x = 'Predictions',
       y = 'Residuals')

# Points on the dotted line would be perfect predictions
filter(df, !is.na(murder_rate)) %>%
  ggplot(aes(x = pred, y = murder_rate)) +
  geom_point(alpha = .1) +
  geom_abline(slope = 1, linetype = 3) +
  scale_x_continuous(limits = c(0, 50)) +
  theme_bw() +
  labs(title = 'Predictions vs Actuals',
       subtitle = 'Murder Rate Simple Linear Regression Model Evaluation',
       x = 'Predictions',
       y = 'Murder Rate')

Murder with Random Forest

Earlier, we saw that there was a weak negative correlation between murder rate and primary school completion rate. Let’s see if a random forest model will produce better predictions.

set.seed(321)
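# Random Forest Model
# Keep only rows where the response and both predictors are present
x <- filter(df, !is.na(murder_rate) & !is.na(literacy_rate) & !is.na(pschool_crate))
# 70/30 train/test split
s <- sample(nrow(x), nrow(x) * .7)
train <- x[s,]
test <- x[-s,]
# caret selects the algorithm with `method`; 'rf' is random forest
rf <- train(murder_rate ~ literacy_rate + pschool_crate, data = train, method = 'rf')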

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

test$pred <- predict(rf, newdata = test)
test$resid <- test$murder_rate - test$pred
ggplot(test, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3) +
  theme_bw() +
  labs(title = 'Predicted vs Residuals',
       x = 'Predicted',
       y = 'Residuals')

ggplot(test, aes(x = pred, y = murder_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3) +
  theme_bw() +
  labs(title = 'Predicted vs Actuals',
       x = 'Predicted',
       y = 'Actual Murder Rate')

rf

## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.867724  0.1040575  3.835524
## 
## Tuning parameter 'mtry' was held constant at a value of 2

With a resampled \(R^2\) of about 0.10, the random forest does no better than the simple linear regression, so this model isn’t up to par either.
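The plots above show the held-out predictions, but it can also help to summarize them numerically. The sketch below computes RMSE, MAE, and \(R^2\) on a 30% test split using base R only. It uses simulated data and a plain lm() rather than our random forest, and the names sim, train_sim, and test_sim are hypothetical. The \(R^2\) here is the squared correlation between observed and predicted values, which, as far as we know, is how caret reports Rsquared by default.

```r
set.seed(321)
n <- 150
# Simulated stand-ins for literacy_rate and murder_rate
sim <- data.frame(literacy = runif(n, 30, 100))
sim$murder <- 15 - 0.1 * sim$literacy + rnorm(n, sd = 5)

# 70/30 train/test split
idx <- sample(nrow(sim), floor(nrow(sim) * 0.7))
train_sim <- sim[idx, ]
test_sim <- sim[-idx, ]

# Fit on the training rows only, then predict the held-out rows
fit_sim <- lm(murder ~ literacy, data = train_sim)
pred_sim <- predict(fit_sim, newdata = test_sim)

# Held-out error metrics
rmse <- sqrt(mean((test_sim$murder - pred_sim)^2))
mae <- mean(abs(test_sim$murder - pred_sim))
r2_test <- cor(test_sim$murder, pred_sim)^2

c(RMSE = rmse, MAE = mae, Rsquared = r2_test)
```

The same three lines of arithmetic could be applied to test$pred and test$murder_rate from the random forest above to compare held-out performance with the bootstrap estimates that caret prints.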

Suicide

What about suicide? Let’s use a similar procedure to determine whether the data we have available can predict suicide rates based on basic education.

sm <- lm(suicide_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(sm)

## 
## Call:
## lm(formula = suicide_rate ~ literacy_rate + pschool_crate + pschool_erate, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3292 -5.2300 -0.3022  3.1134 14.1280 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -140.85195   67.29326  -2.093   0.0493 *
## literacy_rate    1.53450    0.68922   2.226   0.0376 *
## pschool_crate    0.03788    0.24999   0.152   0.8811  
## pschool_erate   -0.09900    0.20825  -0.475   0.6396  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.158 on 20 degrees of freedom
##   (46971 observations deleted due to missingness)
## Multiple R-squared:  0.2171, Adjusted R-squared:  0.09969 
## F-statistic: 1.849 on 3 and 20 DF,  p-value: 0.1708

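This model has a higher \(R^2\) than the one that predicts murder, but it is fit on only 24 complete observations and the overall F-test (p = 0.17) is not significant. Literacy rate is the only predictor with a p-value below 0.05. Let’s use backwards elimination to see whether a smaller model does as well.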

sm2 <- lm(suicide_rate ~ literacy_rate + pschool_erate, data = df)
summary(sm2)$r.squared

## [1] 0.2156279

Dropping primary school completion rate, the least significant term, leaves the \(R^2\) at about 0.22, so the regression with two explanatory variables is our best relative model. Let’s take a look at how this model performs.
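For reference, backwards elimination can be automated. The sketch below is one possible version, not the exact procedure we followed: it repeatedly drops the term with the largest p-value until every remaining term is below a cutoff. The helper backward_p, the 0.05 cutoff, and the simulated data frame sim_be are all assumptions made for illustration, and the helper only handles numeric predictors (where coefficient names match term names).

```r
set.seed(3)
n <- 200
# Simulated predictors; only `literacy` actually drives the response
sim_be <- data.frame(
  literacy = rnorm(n),
  completion = rnorm(n),
  spending = rnorm(n)
)
sim_be$suicide <- 10 + 2 * sim_be$literacy + rnorm(n)

full_model <- lm(suicide ~ literacy + completion + spending, data = sim_be)

# Hypothetical helper: p-value based backward elimination
backward_p <- function(model, cutoff = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    tab <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(tab) == 0) break                      # nothing left to drop
    pvals <- setNames(tab[, "Pr(>|t|)"], rownames(tab))
    worst <- names(pvals)[which.max(pvals)]
    if (pvals[[worst]] <= cutoff) break            # all terms significant
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

reduced_model <- backward_p(full_model)
formula(reduced_model)
```

Base R also provides step(), which performs a similar search using AIC rather than p-values. Either way, with missing data the rows used can change as terms are dropped, so it is safest to run the procedure on complete cases only.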

df$pred <- predict(sm2, df)
df$resid <- df$suicide_rate - df$pred
filter(df, !is.na(suicide_rate)) %>%
  ggplot(aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3) +
  theme_bw() +
  labs(title = 'Predictions vs Residuals',
       subtitle = 'Suicide Rate Multiple Regression Model Evaluation',
       x = 'Predictions',
       y = 'Residuals')

filter(df, !is.na(suicide_rate)) %>%
  ggplot(aes(x = pred, y = suicide_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3) +
  scale_x_continuous(limits = c(0, 50)) +
  theme_bw() +
  labs(title = 'Predictions vs Actuals',
       subtitle = 'Suicide Rate Multiple Regression Model Evaluation',
       x = 'Predictions',
       y = 'Suicide Rate')

It looks like this multiple regression model isn’t very robust.

Let’s try to build a random forest model using literacy and primary school completion rate as the explanatory variables. Using all 3 variables wouldn’t leave us with enough observations to run a reasonable model.

Random Forest

set.seed(123)
# Random Forest Model
# x <- filter(df, !is.na(literacy_rate) & !is.na(suicide_rate))
x <- filter(df, !is.na(literacy_rate) & !is.na(suicide_rate) & !is.na(pschool_crate))
s <- sample(nrow(x), nrow(x) * .7)
train <- x[s,]
test <- x[-s,]
# rf <- train(suicide_rate ~ literacy_rate, data = train, method = 'rf')
# caret selects the algorithm with `method`; 'rf' is random forest
rf <- train(suicide_rate ~ literacy_rate + pschool_crate, data = train, method = 'rf')

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

test$pred <- predict(rf, newdata = test)
test$resid <- test$suicide_rate - test$pred
ggplot(test, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3) +
  theme_bw() +
  labs(title = 'Predicted vs Residuals',
       x = 'Predicted',
       y = 'Residuals')

ggplot(test, aes(x = pred, y = suicide_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3) +
  theme_bw() +
  labs(title = 'Predicted vs Actuals',
       x = 'Predicted',
       y = 'Actual Suicide Rate')

rf

## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   7.23475  0.2286956  5.465412
## 
## Tuning parameter 'mtry' was held constant at a value of 2

The random forest model with literacy rate and primary school completion rate has a resampled \(R^2\) of about 0.23, only marginally higher than the 0.22 from the multiple linear regression. Even with the best model, the predictions won’t be very convincing.

Vaccination

As we saw above, the rates of MCV and DTP vaccination appear highly correlated. Let’s measure to what degree:

Vax_lm <- lm(DTP_rate ~ MCV_rate, data = df)
summary(Vax_lm)$r.squared

## [1] 0.8002264

df %>%
ggplot(aes(x = MCV_rate, y = DTP_rate)) +
  geom_point() +
  geom_smooth(method = "lm", size =2) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'MCV vs DTP Vaccination Rate')+
  geom_abline(slope = 1, size = 2, colour = "red")

Reviewing these metrics, we do indeed find that these are very correlated. Our \(R^2\) is .80.
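In a regression with a single predictor, \(R^2\) is simply the square of the Pearson correlation, so an \(R^2\) of .80 corresponds to a correlation of about .89. The sketch below checks that identity on simulated data; mcv and dtp are hypothetical stand-ins for our MCV_rate and DTP_rate columns.

```r
set.seed(2)
# Simulated vaccination rates (not our data)
mcv <- runif(300, 30, 100)
dtp <- 5 + 0.9 * mcv + rnorm(300, sd = 8)

r <- cor(mcv, dtp)
fit_vax <- lm(dtp ~ mcv)

# The squared correlation equals the model's R^2
c(r = r, r_squared = r^2, model_r2 = summary(fit_vax)$r.squared)

# Correlation implied by the R^2 reported above (about 0.89)
sqrt(0.8002264)
```

On the real data frame, cor() would need use = "complete.obs" to match the rows that lm() keeps after dropping missing values.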

Let’s now look at the predictive power of education on vaccination rate, starting with DTP.

DTP_lm <- lm(DTP_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(DTP_lm)

## 
## Call:
## lm(formula = DTP_rate ~ literacy_rate + pschool_crate + pschool_erate, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.038  -3.260   2.437   6.253  21.570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   53.30339    4.38329  12.161  < 2e-16 ***
## literacy_rate  0.07993    0.08763   0.912 0.363264    
## pschool_crate  0.27844    0.07928   3.512 0.000599 ***
## pschool_erate  0.28100    0.14464   1.943 0.054056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.63 on 140 degrees of freedom
##   (46851 observations deleted due to missingness)
## Multiple R-squared:  0.3235, Adjusted R-squared:  0.309 
## F-statistic: 22.31 on 3 and 140 DF,  p-value: 7.209e-12

DTP_lm2 <- lm(DTP_rate ~ pschool_crate + pschool_erate, data = df)
summary(DTP_lm2)

## 
## Call:
## lm(formula = DTP_rate ~ pschool_crate + pschool_erate, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.402  -2.945   1.847   5.661  26.286 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   52.24570    1.67128  31.261  < 2e-16 ***
## pschool_crate  0.38964    0.01899  20.514  < 2e-16 ***
## pschool_erate  0.16083    0.04821   3.336 0.000889 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.99 on 793 degrees of freedom
##   (46199 observations deleted due to missingness)
## Multiple R-squared:  0.3885, Adjusted R-squared:  0.3869 
## F-statistic: 251.9 on 2 and 793 DF,  p-value: < 2.2e-16

DTP_lm3 <- lm(DTP_rate ~ pschool_crate, data = df)
summary(DTP_lm3)

## 
## Call:
## lm(formula = DTP_rate ~ pschool_crate, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.436  -7.055   4.026   9.830  43.092 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.20107    1.05214   32.51   <2e-16 ***
## pschool_crate  0.56091    0.01251   44.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.24 on 2987 degrees of freedom
##   (44006 observations deleted due to missingness)
## Multiple R-squared:  0.4021, Adjusted R-squared:  0.4019 
## F-statistic:  2009 on 1 and 2987 DF,  p-value: < 2.2e-16

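Starting with all three predictor variables, things look initially promising: an \(R^2\) of 32%, although the p-value for Literacy Rate is very high at .36. When we remove Literacy Rate, our \(R^2\) climbs to 39%, and when we limit the model to Primary School Completion Rate alone it reaches 40%. These figures are not directly comparable, however: each model is fit only to rows with no missing values, so the three models use 144, 796, and 2,989 observations respectively. On a fixed sample, removing a predictor can never raise \(R^2\). What the results do suggest is that Primary School Completion Rate carries most of the predictive power, though the residual standard error of the single-variable model (17.24) is noticeably larger than that of the two-variable model (9.99).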
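One way to compare nested models fairly is to refit them on the same complete-case subset. The following is a minimal sketch on simulated data, assuming a hypothetical data frame sim_cc in which literacy is mostly missing; it is meant to illustrate the idea rather than reproduce our numbers.

```r
set.seed(4)
n <- 1000
sim_cc <- data.frame(completion = runif(n, 30, 100))
sim_cc$dtp <- 35 + 0.55 * sim_cc$completion + rnorm(n, sd = 12)
sim_cc$literacy <- sim_cc$completion + rnorm(n, sd = 10)
sim_cc$literacy[sample(n, 800)] <- NA   # literacy is missing for 800 rows

# lm() silently drops incomplete rows, so these use different samples
m_full <- lm(dtp ~ literacy + completion, data = sim_cc)
m_small <- lm(dtp ~ completion, data = sim_cc)
c(full = nobs(m_full), small = nobs(m_small))

# Refit both models on the rows that are complete for every variable
common <- sim_cc[complete.cases(sim_cc[, c("dtp", "literacy", "completion")]), ]
m_full_c <- lm(dtp ~ literacy + completion, data = common)
m_small_c <- lm(dtp ~ completion, data = common)
c(full = summary(m_full_c)$r.squared, small = summary(m_small_c)$r.squared)
```

On the common sample the smaller model’s \(R^2\) cannot exceed the larger model’s, so any apparent increase after dropping a variable comes from the change in rows rather than from the variable itself. Adjusted \(R^2\) or an F-test via anova() on the common-sample fits would be natural next steps.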

df %>%
ggplot(aes(x = pschool_crate, y = DTP_rate)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Primary School Completion vs DTP Vaccine Rates')

#hist(DTP_lm3$residuals)

# Put the residuals in a data frame rather than using `$` inside aes()
data.frame(resid = residuals(DTP_lm3)) %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 5) +
  labs(title = 'DTP Residuals Histogram')

Looking towards measles, we initially see a higher \(R^2\), yet this decreases when we remove Literacy Rate:

MCV_lm <- lm(MCV_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(MCV_lm)$r.squared

## [1] 0.4293597

MCV_lm2 <- lm(MCV_rate ~ pschool_crate + pschool_erate, data = df)
summary(MCV_lm2)$r.squared

## [1] 0.3778673

Vaccinations with Random Forest

Let’s create and test a model with Random Forest and see if this fares any better. We will begin by filtering the dataframe and removing country/year pairings with any missing values.

library(tidymodels)
vax_df <- filter(df, !is.na(literacy_rate) & !is.na(pschool_erate) & !is.na(pschool_crate))
dim(vax_df)

## [1] 144  12

This only leaves us with 144 observations, not enough to train our model. As Literacy Rate fared the worst in our linear regression, let’s remove that and reassess.

vax_df2 <- filter(df, !is.na(pschool_erate) & !is.na(pschool_crate)  & !is.na(MCV_rate))
dim(vax_df2)

## [1] 796  12

This seems appropriate, so let’s move forward.

# may as well use the seed my partner selected earlier
set.seed(123)
vax_samp <- initial_split(vax_df2, .75)
vax_train <- training(vax_samp)
vax_test  <- testing(vax_samp)

vax_model <- 
  rand_forest(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("randomForest")
vax_model_fit <- parsnip::fit(vax_model, MCV_rate ~ pschool_crate + pschool_erate, vax_train)
vax_model_result <- vax_model_fit %>%
  predict(new_data = vax_test)
vax_model_result$Actual<-vax_test$MCV_rate
names(vax_model_result)[1] <- "Prediction"
vax_model_fit

## parsnip model object
## 
## Fit time:  421ms 
## 
## Call:
##  randomForest(x = as.data.frame(x), y = y, ntree = ~500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 115.1834
##                     % Var explained: 34.84

The percentage of variance explained (about 35%) is close to, but a little lower than, the \(R^2\) of 38% from the two-variable linear model above. That being said, it is in the same realm. Let’s test this model against our testing sample:

vax_model_result %>%
ggplot(aes(x = Prediction, y = Actual)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  theme(legend.position = 'none') +
  geom_abline(slope = 1, colour = "red") +
  labs(title = 'MCV Model: Predictions vs Actual')

vax_model_result$Resid <- vax_model_result$Actual - vax_model_result$Prediction
vax_model_result %>%
ggplot(aes(x = Actual, y = Resid)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  geom_abline(slope = 0, colour = "red")+
  theme(legend.position = 'none') +
  labs(title = 'MCV Model: Residuals vs Actual')

vax_model_result %>%
  ggplot(aes(x= Resid)) + geom_histogram(binwidth = 3)+
  labs(title = 'MCV Residuals Histogram')

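These are some interesting graphs. At first glance our model seems successful: the predicted values are on average close to the actual values for our testing sample. That being said, the model becomes noticeably less accurate as the actual vaccination rate moves down from 100%, which suggests the condition of constant variability (homoscedasticity) is not met. Some of the trend in the residuals-versus-actuals plot is expected, though, because a residual is defined as the actual value minus the prediction and is therefore correlated with the actual value.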
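How much of that trend is mechanical can be seen in a small simulation. In the sketch below the noise has constant variance by construction, yet the residuals still correlate with the actual values; for ordinary least squares with an intercept that correlation equals \(\sqrt{1 - R^2}\). The example uses lm() on simulated data (names x_sim and y_sim are hypothetical), so it only approximates what happens with a random forest evaluated on a test set.

```r
set.seed(5)
x_sim <- runif(500, 30, 100)
y_sim <- 40 + 0.5 * x_sim + rnorm(500, sd = 10)   # constant-variance noise

fit_res <- lm(y_sim ~ x_sim)
e <- residuals(fit_res)

# Residuals vs actuals: clearly correlated, even for a well-specified model
cor(e, y_sim)
sqrt(1 - summary(fit_res)$r.squared)   # the same number

# Residuals vs fitted values: uncorrelated, the usual diagnostic view
cor(e, fitted(fit_res))
```

For that reason, plotting residuals against predictions rather than against actuals is usually the more reliable way to judge constant variability.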

Gini coefficient

Initially, the logic in searching for correlations between education and inequality was that, as a population becomes more educated, GDP would increase and the population would become more engaged in democracy, each of which might independently lead to a move away from a rigid class structure. After viewing the data, however, the erratic behavior of this response variable gives me little hope of finding correlations:

Gini_lm <- lm(Gini_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(Gini_lm)$r.squared

## [1] 0.142503

Gini_lm2 <- lm(Gini_rate ~ pschool_erate, data = df)
summary(Gini_lm2)$r.squared

## [1] 0.1097555

Immediately we see a very low \(R^2\).

As linear modeling does not seem to work with a variable of this nature, let’s see if a Random Forest approach might be more effective:

gini_df <- filter(df, !is.na(pschool_erate) & !is.na(pschool_crate) & !is.na(Gini_rate))
set.seed(123)
gini_samp <- initial_split(gini_df, .75)
gini_train <- training(gini_samp)
gini_test  <- testing(gini_samp)

gini_model <- 
  rand_forest(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("randomForest")
gini_model_fit <- parsnip::fit(gini_model, Gini_rate ~ pschool_crate + pschool_erate, gini_train)
gini_model_result <- gini_model_fit %>%
  predict(new_data = gini_test)
gini_model_result$Actual<-gini_test$Gini_rate
names(gini_model_result)[1] <- "Prediction"
gini_model_fit

## parsnip model object
## 
## Fit time:  701ms 
## 
## Call:
##  randomForest(x = as.data.frame(x), y = y, ntree = ~500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 55.56297
##                     % Var explained: 22.07

We find that this model explains noticeably more of the Gini coefficient as a response variable: about 22% of the variance, compared with 14% for the best linear model. Let’s take a look at the correlation and the distribution of residuals:

gini_model_result %>%
ggplot(aes(x = Prediction, y = Actual)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  theme(legend.position = 'none') +
  geom_abline(slope = 1, colour = "red") +
  labs(title = 'Gini Model: Predictions vs Actual')

gini_model_result$Resid <- gini_model_result$Actual - gini_model_result$Prediction
gini_model_result %>%
ggplot(aes(x = Actual, y = Resid)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() + 
  geom_abline(slope = 0, colour = "red")+
  theme(legend.position = 'none') +
  labs(title = 'Gini Model: Residuals vs Actual')

gini_model_result %>%
  ggplot(aes(x=Resid)) + geom_histogram(binwidth = 3)+
  labs(title = 'Gini Residuals Histogram')

While it is clear the correlation is weak, the trend line is not awful. The distribution of the residuals is fairly interesting: most residuals fall between -20 and 20, with some outliers beyond that range. This raises questions about our testing sample, since it is possible that an outsized number of the larger Gini values ended up in our testing set.
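One way to guard against an unlucky split is to stratify on the response, so that the training and testing sets cover the same range of values. Below is a minimal base R sketch on simulated Gini-like values; the vector gini_sim, the quartile bins, and the 75% proportion are assumptions for illustration. We believe rsample’s initial_split() offers a strata argument for the same purpose, but we did not use it here.

```r
set.seed(123)
# Simulated, right-skewed Gini-like values (not our data)
gini_sim <- rgamma(400, shape = 16, rate = 0.4)

# Bin the response into quartiles
bins <- cut(gini_sim,
            breaks = quantile(gini_sim, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Sample 75% of the rows within each quartile bin
train_idx <- unlist(lapply(
  split(seq_along(gini_sim), bins),
  function(i) sample(i, floor(length(i) * 0.75))
), use.names = FALSE)
test_idx <- setdiff(seq_along(gini_sim), train_idx)

# Each quartile is equally represented in the test set
table(bins[test_idx])

# The two sets now have similar distributions
rbind(train = quantile(gini_sim[train_idx]),
      test = quantile(gini_sim[test_idx]))
```

Comparing quantile() of the response between gini_train and gini_test would be a quick first check of whether the original split was lopsided.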

Conclusion

Before doing this research, I guessed there would be a negative correlation between basic education and both murder and suicide. As it turns out, that is a generalizing assumption that isn’t supported by the data, at least not here. We were able to determine that most people across different countries in the world have been getting more basic education. On the other hand, we’re looking at averages of entire countries over the course of one year as one observation. It’s difficult to create an accurate model when we’re working with data that has already been reduced. Also, murder and suicide are highly complicated, and there are countless circumstances that influence these behaviors.

In other words, it was probably too simplistic to believe that these rare and strange events could be predicted by something as common as basic education. That seems much clearer now than it did before we mined the data.

Looking toward vaccination rate: with a relatively small amount of variation across countries compared to our explanatory variables, I did not have a lot of hope in our ability to predict it. I was pleasantly surprised with how our models performed. Taking a broader view, however, the role education plays in vaccination is unclear. The most likely explanation is that vaccinations are compulsory in places where primary school is also mandated, leading to the correlation we were able to model.

Finally, the Gini Coefficient was an interesting variable to work with. I was disappointed in the quality of the data, but I was also pleasantly surprised by how much better the random forest approach performed than linear modeling. While I was expecting a negative correlation, finding a positive correlation was enlightening.

Data607 Final Project

Leo Yi & Christopher Bloome

4/28/2020
