Data607 Final Project

Introduction

We want to research the association of basic education and its impact to different aspects of society. There’s a common understanding that education is beneficial for society and as people become more educated, society becomes more civilized. This is paired with the idea that when evaluating a scale of animal instincts to conscious human actions– its better for society if its members are more like wise sages and less like impulsive cave men.

The data that we will use to explore these assumptions and its effects on different facets of society will be sourced from gapminder.

To proxy or measure basic education, we’ll take a look at literacy rates of adults, completion rates of primary school, and expenditure per student as a % of GDP. Primary school completion rates can be influenced by many factors, but our initial idea is that this is a suitable proxy to measure how much a society values and is capable of putting their children through school.

We’ll then be looking at a few other factors that we believe may be related and discuss the results of regression models.

Preparing the data

In this section, lets grab all the data from gapminder and tidy it into one dataframe that we’ll then use to create the models.

All data has been downloaded from Gapminder

Packages

These are the packages we’ll be using for this project:

library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(GGally)
library(caret)

Predictor Variables

Lets download, join, and tidy the two data sources we’ll be looking at for our predictor variables. Both are csv files that contain metrics by country and year.

# Literacy Rate
l_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/literacy_rate_adult_total_percent_of_people_ages_15_and_above.csv',
                   stringsAsFactors = F)

# Primary School Completion Rate
pc_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/primary_completion_rate_total_percent_of_relevant_age_group.csv',
                    stringsAsFactors = F)

# Primary School Expenditure Rate (% of GDP per Person)
e_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/expenditure_per_student_primary_percent_of_gdp_per_person.csv',
                   stringsAsFactors = F)

Since all of the gapminder datasets seem to be in the same format, with countries as rows and years as columns, lets use a function to unpivot them into country and year in one line, since those will be the observations we’ll be using.

unpivot <- function(src_df, metric) {
  df <- gather(src_df, year, val, -country) %>%
    filter(!is.na(val)) %>%
    mutate(year = as.numeric(str_replace_all(year, 'X', '')))
  
  names(df)[3] <- metric
  
  return(df)
}

# Unpivot Raw Data
l_rate_tall <- unpivot(l_rate, 'literacy_rate')
pc_rate_tall <- unpivot(pc_rate, 'pschool_crate')
e_rate_tall <- unpivot(e_rate, 'pschool_erate')

# Combine Data for Predictor Varaibles Dataframe
pv_df <- l_rate_tall %>%
  full_join(pc_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(e_rate_tall, by = c('country' = 'country', 'year' = 'year'))

head(pv_df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA

# Observations will all 3 metrics
pv_df2 <- filter(pv_df, !is.na(literacy_rate) & !is.na(pschool_crate) & !is.na(pschool_erate))

nrow(pv_df2)

## [1] 144

It looks like there’s only 144 country-year observations with all 3 metrics, which may make it difficult to use all three predictor variables in one model.

Lets evaluate which variables have enough data that we can use:

pvars <- pv_df %>%
  mutate(l = ifelse(is.na(literacy_rate), 0, 1),
          c = ifelse(is.na(pschool_crate), 0, 1),
          e = ifelse(is.na(pschool_erate), 0, 1),
          n = l + c + e) %>%
  group_by(n) %>%
  summarize(ns = n(),
            ls = sum(l),
            cs = sum(c),
            es = sum(e)) %>%
  arrange(desc(n))

pvars

## # A tibble: 3 x 5
##       n    ns    ls    cs    es
##   <dbl> <int> <dbl> <dbl> <dbl>
## 1     3   144   144   144   144
## 2     2  1303   256  1274  1076
## 3     1  3684   161  3197   326

This is a table that shows the instances of the variables.

n = the number of variables, out of 3, that exist per observation.

There are 144 observations with all 3 varaibles, 1303 observations with 2 variables, and 3684 with only 1 variable.

cat('Number of Rows for literacy rate dataframe:', nrow(l_rate_tall), '\nNumber of Distinct Countries:', n_distinct(l_rate_tall$country))

## Number of Rows for literacy rate dataframe: 561 
## Number of Distinct Countries: 150

cat('Number of Rows for primary school completion rate dataframe:', nrow(pc_rate_tall), '\nNumber of Distinct Countries:', n_distinct(pc_rate_tall$country))

## Number of Rows for primary school completion rate dataframe: 4615 
## Number of Distinct Countries: 185

cat('Number of Rows for primary school expenditure rate dataframe:', nrow(e_rate_tall), '\nNumber of Distinct Countries:', n_distinct(e_rate_tall$country))

## Number of Rows for primary school expenditure rate dataframe: 1546 
## Number of Distinct Countries: 163

It looks like the primary school completion rate has the most observations, but we’ll still need to determine which variables can be used based on the country-year match in the response variables.

Murder and Suicide

For most people, the idea of murder and suicide is a rare occurence in civilized society. We hear about these acts of violence rarely with people we know first hand, and unfortunately quite often in the news.

It wouldn’t be a huge leap to say that people who are more educated are less likely to be involved in these kinds of situations, but we’ll take a look at the data to see if this association can be supported by data.

Preparing the Data

Below, we’ll import the data from gapminder, tidy it, and add the variables to our combined predictor variable dataset.

# Murder Rate per 100k People
m_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/murder_per_100000_people.csv',
                   stringsAsFactors = F)

# Suicide Rate per 100k People
s_rate <- read.csv('https://raw.githubusercontent.com/dataconsumer101/data607_final_project/master/suicide_per_100000_people.csv',
                    stringsAsFactors = F)  

# Let's use the same function we created earlier to unpivot year columns
m_rate_tall <- unpivot(m_rate, 'murder_rate')
s_rate_tall <- unpivot(s_rate, 'suicide_rate')


df <- pv_df %>%
  full_join(m_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(s_rate_tall, by = c('country' = 'country', 'year' = 'year'))

head(df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA
##   murder_rate suicide_rate
## 1          NA           NA
## 2          NA           NA
## 3        1.64        0.504
## 4          NA           NA
## 5          NA           NA
## 6        2.96        9.930

Vaccination Rate

Childhood vaccinations are one way in which successful societies protect their population. Here in the United States, vaccinations are received for a variety of potential ailments. As any of these could serve as a proxy for society wellness, we pulled in each of these datasets to see which was the most complete.

# DTP vaccine percentage in 1 year olds
dtp_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/dtp3_immunized_percent_of_one_year_olds.csv',
                   stringsAsFactors = F)
# Measels vaccine percentage in 1 year olds
MCV_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/mcv_immunized_percent_of_one_year_olds.csv',
                    stringsAsFactors = F)  
# Teatenus vaccine percentage in newborns
PAB_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/pab_immunized_percent_of_newborns.csv',
                    stringsAsFactors = F)  
# Hepatitis vaccine percentage in 1 year olds
hepb3_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/hepb3_immunized_percent_of_one_year_olds.csv',
                    stringsAsFactors = F)  




# Let's use the same function we created earlier to unpivot year columns
dtp_rate_tall <- unpivot(dtp_rate, 'dtp_rate')
MCV_rate_tall <- unpivot(MCV_rate, 'MCV_rate')
PAB_rate_tall <- unpivot(PAB_rate, 'PAB_rate')
hepb3_rate_tall <- unpivot(hepb3_rate, 'hepb3_rate')
dfVax <- pv_df %>%
  full_join(dtp_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(MCV_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(PAB_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(hepb3_rate_tall, by = c('country' = 'country', 'year' = 'year'))

summary(dfVax)

##    country               year      literacy_rate    pschool_crate   
##  Length:7369        Min.   :1970   Min.   :  8.69   Min.   :  1.52  
##  Class :character   1st Qu.:1987   1st Qu.: 64.90   1st Qu.: 62.50  
##  Mode  :character   Median :1997   Median : 85.10   Median : 90.70  
##                     Mean   :1997   Mean   : 77.04   Mean   : 79.38  
##                     3rd Qu.:2007   3rd Qu.: 94.10   3rd Qu.: 98.70  
##                     Max.   :2019   Max.   :100.00   Max.   :135.00  
##                                    NA's   :6808     NA's   :2754    
##  pschool_erate       dtp_rate        MCV_rate        PAB_rate    
##  Min.   : 0.235   Min.   : 0.00   Min.   : 0.00   Min.   : 1.00  
##  1st Qu.:10.500   1st Qu.:66.00   1st Qu.:62.00   1st Qu.:42.00  
##  Median :15.000   Median :86.00   Median :84.00   Median :67.00  
##  Mean   :15.791   Mean   :76.87   Mean   :75.49   Mean   :60.31  
##  3rd Qu.:20.000   3rd Qu.:95.00   3rd Qu.:94.00   3rd Qu.:83.00  
##  Max.   :65.100   Max.   :99.00   Max.   :99.00   Max.   :99.00  
##  NA's   :5823     NA's   :1817    NA's   :1937    NA's   :4479   
##    hepb3_rate   
##  Min.   : 1.00  
##  1st Qu.:77.00  
##  Median :91.00  
##  Mean   :82.47  
##  3rd Qu.:96.00  
##  Max.   :99.00  
##  NA's   :5106

Looking at the quantity of NAs in each variable in the summary, it is clear there is significantly more data on dtp and measles vaccinations. For these reasons we will exclude the PAB and hepb vaccinations.

df <- df %>%
  full_join(dtp_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  full_join(MCV_rate_tall, by = c('country' = 'country', 'year' = 'year'))
head(df)

##                    country year literacy_rate pschool_crate pschool_erate
## 1             Burkina Faso 1975          8.69          7.63            NA
## 2 Central African Republic 1975         18.20         36.20            NA
## 3                   Kuwait 1975         59.60         60.60            NA
## 4                   Turkey 1975         61.60            NA            NA
## 5     United Arab Emirates 1975         53.50         41.70            NA
## 6                  Uruguay 1975         93.90            NA            NA
##   murder_rate suicide_rate dtp_rate MCV_rate
## 1          NA           NA       NA       NA
## 2          NA           NA       NA       NA
## 3        1.64        0.504       NA       NA
## 4          NA           NA       NA       NA
## 5          NA           NA       NA       NA
## 6        2.96        9.930       NA       NA

Inequality

The Gini index, a measure of inequality, is another metric in this dataset we wanted to explore. The Gini index is built such that a value of 0 indicates that all members of a society have equal income, and 1 indicates that one individual earns all income in a society, while the other members are without any income. In this dataset, it appears that each country has a Gini index value for all years, making it ideal for our purposes.

Gini_rate <- read.csv('https://raw.githubusercontent.com/ChristopherBloome/607/master/gini.csv',
                    stringsAsFactors = F) 
Gini_rate_tall <- unpivot(Gini_rate, 'Gini_rate')

df <- df %>%
  full_join(Gini_rate_tall, by = c('country' = 'country', 'year' = 'year'))

Exploring the Data

Now that we have our working dataframe, lets make a few observations through visualizations before we dive into modeling.

Literacy Rate

# break countries in groups
group_count <- 6
lgrp <- distinct(l_rate, country) %>%
  mutate(grp = ntile(country, group_count))

# Literacy Rates Over Time
inner_join(l_rate_tall, lgrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = literacy_rate, color = country)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Literacy Rate Over Time',
       caption = 'Each Line Is a Country',
       y = '% of Adults',
       x = element_blank())

The plot was split into 6 groups because it would be difficult to see all the lines overlapping on one chart.

Generally, countries are seeing higher rates of adult literacy over time. This may be a result of countries advancing and growing. There are some countries that seem to be declining in literacy, perhaps in areas of war? Let’s take a look.

lr_mm <- group_by(l_rate_tall, country) %>%
  summarize(min_yr = min(year),
            max_yr = max(year)) %>%
  left_join(l_rate_tall, by = c('country' = 'country', 'min_yr' = 'year')) %>%
  left_join(l_rate_tall, by = c('country' = 'country', 'max_yr' = 'year')) %>%
  mutate(change = literacy_rate.y - literacy_rate.x) %>%
  filter(change < 0) %>%
  arrange(change)

lr_mm

## # A tibble: 9 x 6
##   country          min_yr max_yr literacy_rate.x literacy_rate.y  change
##   <chr>             <dbl>  <dbl>           <dbl>           <dbl>   <dbl>
## 1 Lesotho            2000   2009            86.3            75.8 -10.5  
## 2 Kenya              2000   2007            82.2            72.2 -10    
## 3 Madagascar         2000   2009            70.7            64.5  -6.2  
## 4 Congo, Dem. Rep.   2001   2007            67.2            61.2  -6    
## 5 Nigeria            1991   2008            55.5            51.1  -4.40 
## 6 Zambia             1990   2007            65              61.4  -3.6  
## 7 Albania            2001   2011            98.7            96.8  -1.9  
## 8 Tonga              1976   2006            99.6            99    -0.600
## 9 Mongolia           2000   2011            97.8            97.4  -0.400

bind_rows(select(lr_mm, country, yr = min_yr, val = literacy_rate.x),
          select(lr_mm, country, yr = max_yr, val = literacy_rate.y)) %>%
  ggplot(aes(x = yr, y = val, color = country)) +
  geom_line(size = 1) +
  theme_bw() +
  labs(title = 'Overall Declining Literacy Rates',
       y = '% of Adults',
       x = element_blank())

It seems like some countries in Africa are struggling with improving adult literacy. I’m not quite sure about the history of those countries, but its possibly a sampling error.

Albania, Tonga, and Mongolia also have shown a net decline, but almost all of the population is literate. There may just be a ceiling to literacy in any given country, given that some people are unable to learn for reasons other than infrastructure.

Primary School Completion Rate

# break countries in groups
group_count <- 8
cgrp <- distinct(pc_rate, country) %>%
  mutate(grp = ntile(country, group_count))


# Primary School Completion Rates Over Time
inner_join(pc_rate_tall, cgrp, by = ('country' = 'country')) %>%
  filter(grp <= 4) %>%
  ggplot(aes(x = as.Date(ISOdate(year,1,1)), y = pschool_crate, color = country)) +
  geom_smooth(se = F, method = 'lm') +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Primary School Completion Rate Over Time',
       subtitle = 'Group 1 - 4',
       y = '% of Adults',
       x = element_blank())

# Primary School Completion Rates Over Time
inner_join(pc_rate_tall, cgrp, by = ('country' = 'country')) %>%
  filter(grp >= 5) %>%
  ggplot(aes(x = as.Date(ISOdate(year,1,1)), y = pschool_crate, color = country)) +
  geom_smooth(se = F, method = 'lm') +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp) +
  labs(title = 'Primary School Completion Rate Over Time',
       subtitle = 'Group 5 - 8',
       y = '% of Adults',
       x = element_blank())

Again, We split the countries into 8 groups since it would be too messy to view on one chart. The views are linear models of the data points for each country, which shows a general trend towards higher completion rate over time.

Primary School Expediture Rate (% of GDP)

group_count <- 6
egrp <- distinct(e_rate, country) %>%
  mutate(grp = ntile(country, group_count))

inner_join(e_rate_tall, egrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = pschool_erate, color = country)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = 'none') +
  facet_wrap(~grp)

For most countries, it looks almost like the rate of gdp expenditure for primary school remained relatively steady throughout the years. There do seem to be some countries that invested quite a bit into their childrens’ future. Let’s isolate some of those countries and take a look.

filter(e_rate_tall, pschool_erate >= 33.33) %>%
  distinct(country) %>%
  inner_join(e_rate_tall, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = pschool_erate, color = country)) +
  geom_line() +
  facet_wrap(~country) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = 'none') +
  geom_hline(yintercept = 33.33, linetype = 3)

Here, we’re looking at all countries in the dataset that have at any point spent at least 1/3 of GDP per person on primary school education. It’s an arbitrary amount, but that’s 1/3 the value of each person towards furthering basic education. Cuba is quite impressive and seems like its still rising, where Ukraine is seeing the opposite effect.

Murder Rate (per 100k People)

ggplot(m_rate_tall, aes(x = year, y = murder_rate, color = country)) +
  geom_line() +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Murder Rate by Year by Country',
       x = element_blank(),
       subtitle = 'Murders per 100k People')

It looks like most countries are grounded to low levels of murder, but at present, most countries are near zero. Since travel is quite common in this day and age, let’s look at the top and bottom countries for murder rate, so we know where or where not to plan our next trip.

Let’s look at the most recent year for every country and exclude anything from before a decade ago, or 2010.

# Look at only the latest year of data for each country
m_ly <- group_by(m_rate_tall, country) %>%
  summarize(year = max(year)) %>%
  inner_join(m_rate_tall, by = c('country' = 'country', 'year' = 'year')) %>%
  filter(year >= 2010) %>%
  mutate(country_year = str_c(country, ' (', year, ')', sep = ''))

top_x <- 15
  
# Most Dangerous Countries
arrange(m_ly, desc(murder_rate))[1:top_x,] %>%
ggplot(aes(x = reorder(country_year, murder_rate), y = murder_rate)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = 'Countries With Highest Murder Rate',
       y = 'Murders per 100K People',
       x = element_blank())

# Least Dangerous Countries
arrange(m_ly, murder_rate)[1:top_x,] %>%
ggplot(aes(x = reorder(country_year, desc(murder_rate)), y = murder_rate)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = 'Countries With Lowest Murder Rate',
       y = 'Murders per 100K People',
       x = element_blank())

Oman seems to be the safest country if you’re worried about being murdered. Keep mind that this is only one crime and that the numbers are reported or sampled by different methods, so this isn’t a list of safest countries, just a list of the countries that reported the lowest murder rate.

Suicide Rate

group_count <- 6
sgrp <- distinct(s_rate, country) %>%
  mutate(grp = ntile(country, group_count))

inner_join(s_rate_tall, egrp, by = c('country' = 'country')) %>%
ggplot(aes(x = year, y = suicide_rate, color = country)) +
  geom_line() +
  facet_wrap(~grp) +
  theme_bw() + 
  theme(legend.position = 'none') +
  labs(title = 'Suicide Rate by Year by Country',
       x = element_blank(),
       subtitle = 'Suicides per 100k People')

For most countries, the rate is pretty low, and for others it seems to peak. Lets pick out some of the countries that had a high level of suicides at one point and see what we find.

filter(s_rate_tall, suicide_rate >= 25) %>%
  distinct(country) %>%
  inner_join(s_rate_tall, by = c('country' = 'country')) %>%
  ggplot(aes(x = year, y = suicide_rate, color = country)) +
  geom_line() +
  facet_wrap(~country) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = 'none')

Take a look at Hungary– communism ended there in 1989, which coincides with the peak of suicides. Thankfully, the suicide rate there has been declining ever since.

Suriname, with a prominent peak of suicides in the 80’s, went through historical changes that seemed to coincide with these figures. A coup d’état and political uncertainty might contribute to these figures.

The other countries above likely to have their reasons why there’s so much psychological pressure within their borders, whether we can find them or not.

Models

In this section, let’s create models to see if we can predict certain behaviors based on our basic education data.

Pairs Plot

ggpairs(df, columns = 3:ncol(df))

Its no surprise that literacy rate and primary school completion rate have a strong correlation. Reading is taught in primary school and we can make the assumption that even with the gap in time between when a person completes primary school and when they’re considered an adult, that the primary school completion rate can be a proxy for the effective value of basic education within a nation in any given year.

Interestingly enough, there’s a weak correlation between adult literacy rate and the suicide rate. It’s a bit scary to think that there’s a link between being able to read and suicide. There’s also a weak negative correlation between the primary school completion rate and the murder rate.

Murder

Let’s see if murder can be predicted by basic education statstics. We’ll try multiple regression and then use backwards elimination to get our final model.

mm <- lm(murder_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(mm)$r.squared

## [1] 0.03867512

It looks like this model is worthless. It’s not exactly a surprise that literacy and information about primary school are weak indicators for murder.

What would a model look like with just primary school completion rate, which showed a weak correlation?

mm2 <- lm(murder_rate ~ pschool_crate, data = df)
summary(mm2)$r.squared

## [1] 0.1062743

# ggplot(df, aes(y = pschool_crate, x = murder_rate)) +
#   geom_point() +
#   scale_x_log10() +
#   geom_smooth(method = 'lm')

It looks like this simple linear regression has better results, but its still not robust enough to be useful. With an \(R^2\) of .1, only 10% of the variance is explained by the model.

Murder with Random Forest

Earlier, we saw that there was a negative correlation between and murder rate and primary school completion rate. Let’s try and see if a random forest model will produce a model with better predictions.

set.seed(321)
# Random Forest Model
x <- filter(df, !is.na(murder_rate) & !is.na(literacy_rate) & !is.na(pschool_crate))

s <- sample(nrow(x), nrow(x) * .7)
train <- x[s,]
test <- x[-s,]
rf <- train(murder_rate ~ literacy_rate + pschool_crate, data = train, model = 'rf')

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

test$pred <- predict(rf, newdata = test)
test$resid <- test$murder_rate - test$pred

ggplot(test, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3)

ggplot(test, aes(x = pred, y = murder_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3)

rf

## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6.867724  0.1040575  3.835524
## 
## Tuning parameter 'mtry' was held constant at a value of 2

Based on the \(R^2\) returned by the random forest, this model isn’t quite up to par either.

Suicide

What about suicide? Let’s use a similar procedure to determine whether the data we have available can predict suicide rates based on basic education.

sm <- lm(suicide_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)
summary(sm)

## 
## Call:
## lm(formula = suicide_rate ~ literacy_rate + pschool_crate + pschool_erate, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3292 -5.2300 -0.3022  3.1134 14.1280 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -140.85195   67.29326  -2.093   0.0493 *
## literacy_rate    1.53450    0.68922   2.226   0.0376 *
## pschool_crate    0.03788    0.24999   0.152   0.8811  
## pschool_erate   -0.09900    0.20825  -0.475   0.6396  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.158 on 20 degrees of freedom
##   (46971 observations deleted due to missingness)
## Multiple R-squared:  0.2171, Adjusted R-squared:  0.09969 
## F-statistic: 1.849 on 3 and 20 DF,  p-value: 0.1708

It seems like this model is stronger than the one that predicts murder, since the \(R^2\) is higher. Let’s use backwards elimination to see if we can reduce the variance in the next model.

sm2 <- lm(suicide_rate ~ literacy_rate + pschool_erate, data = df)

summary(sm2)$r.squared > summary(sm)$r.squared

## [1] FALSE

It looks like the multiple regression using all 3 predictor variables results in the best relative model. Let’s take a look at how this model looks.

df$pred <- predict(sm, df)
df$resid <- df$suicide_rate - df$pred

ssd <- sd(s_rate_tall$suicide_rate)

filter(df, !is.na(df$suicide_rate)) %>%
ggplot(aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3) +
  theme_bw() +
  labs(title = 'Predictions vs Residuals',
       subtitle = 'Suicide Rate Multiple Regression Model Evaluation',
       x = 'Predictions',
       y = 'Residuals')

filter(df, !is.na(suicide_rate)) %>%
  ggplot(aes(x = pred, y = suicide_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3) +
  scale_x_continuous(limits = c(0, 50)) +
  theme_bw() +
  labs(title = 'Predictions vs Actuals',
       subtitle = 'Suicide Rate Multiple Regression Model Evaluation',
       x = 'Predictions',
       y = 'Suicide Rate')

It looks like this multiple regression model isn’t very robust.

Let’s try to build a random forest model using literacy and primary school completion rate as the dependent variable. Using all 3 variables wouldn’t leave us with enough observations to run a reasonable model.

Random Forest

set.seed(123)
# Random Forest Model
# x <- filter(df, !is.na(literacy_rate) & !is.na(suicide_rate))
x <- filter(df, !is.na(literacy_rate) & !is.na(suicide_rate) & !is.na(pschool_crate))

s <- sample(nrow(x), nrow(x) * .7)
train <- x[s,]
test <- x[-s,]

# rf <- train(suicide_rate ~ literacy_rate, data = train, model = 'rf')
rf <- train(suicide_rate ~ literacy_rate + pschool_crate, data = train, model = 'rf')

## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

test$pred <- predict(rf, newdata = test)
test$resid <- test$suicide_rate - test$pred


ggplot(test, aes(x = pred, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3)

ggplot(test, aes(x = pred, y = suicide_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 3)

rf

## Random Forest 
## 
## 53 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   7.23475  0.2286956  5.465412
## 
## Tuning parameter 'mtry' was held constant at a value of 2

It looks like a random forest model with literacy rate and primary school completion rate is a better predictor of suicides compared to a multiple linear regression. Still, even with the best model, the predictions won’t be very convincing.

Conclusion

Before doing this research, I guessed there would be a negative correlation between basic education vs murder and suicide. As expected, it turns out that that is a generalizing assumption that isn’t based on data, at least not here. We were able to determine that most people across different countries in the world have been getting more basic education. On the other hand, we’re looking at averages of entire countries over the course of one year as one observation. It’s difficult to create an accurate model when we’re working with data that has already been reduced. Also, the concept of murder and suicide is highly complicated and there’s countless cirumstances that influence these behaviors.

In other words, it was probably too simplistic to believe that these rare and strange events could be predicted a common behavior like basic education. Seems much clearer now compared to before we mined the data.