Suicide Rate Prediction

Background

Introduction and Objective

Suicide rates are defined as the deaths deliberately initiated and performed by a person in the full knowledge or expectation of its fatal outcome, per 100,000 population. In other words, suicide is a death caused by intentional self-harm. Suicide rate is an important proxy for the prevalence of mental health disorders in many countries in the world.

This project objective is to create a machine learning algorithm for suicide rate prediction. It uses country level suicide rates data from the World Health Organization and other country level indicators from The World Bank (WB) and The United Nations Development Program (UNDP), from 1985 to 2016. The data contains information on suicide rates per 100k population, age, sex, generation, country Gross Domestic Product (GDP), population size, and Human Development Index (HDI) country score.

Key Facts

Close to 800 000 people die due to suicide every year.
For every suicide there are many more people who attempt suicide every year.
Suicide is the third leading cause of death in 13-19 year olds.
79% of suicides occur in low to middle income countries.

Literature Review

According to previous research, prediction of suicide rate was done by using Kaplan-Meier method to estimate the events and Cox Proportional Hazards Regression was used to examine the predictors of suicide and suicides reattempt (L.N Grendas, et al., 2019). The aim was to evaluate the prospective predictors and the correlation with the factors that effects suicide rates. Based on the prospective prediction of suicide attempts (M. Miche, et al., 2020), four prediction models were used: Logistic regression, random forest, lasso, ridge (variants of the logistic regression). All four yielded good accuracies. The predictive performance of random forest (ML algorithm) depends on the sample size, with bigger sample size leading to increased performance results. In predicting death by suicide using administrative health care system data (M.Sanderson, et al., 2019), they have compared the performance of recurrent neural networks, one-dimensional convolutional neural networks, and gradient boosted trees, with logistic regression and feedforward neural networks. In this study, they have concluded, the recurrent neural network model configuration, one-dimensional convolutional neural network configuration has outperformed logistic regression.

Data Set

Data set for this project was obtained from Kaggle Suicide Rates Overview 1985-2016. This dataset was built using a combination of four other datasets namely, United Nations Development Program, World Bank, World Health Organization and 2017 Journal article.The dataset consists of 27820 rows with 12 columns as follows :

Country (character) - Represents the name of country where data is collected from
Year (integer) - Represents year when data is collected
Sex (character) - Male or female suiside victim
Age (integer) - Age of suicide victim
Suicide Number (integer) - Number of suicides
Population (integer) - Number of people in a given country
Suicide per 100k Population (numerical) - Normalized rate of suicide for every 100k people
Country & Year (character) - Concatenated data with country and year
Annual HDI (numerical) - Yearly Human Development Index
Annual GDP (character) - Yearly Gross Domestic Product
GDP per Capita (integer) - Gross Domestic Product per capita
Generation (character) - Suicide victim generation group (i.e Millenial, Gen Z, etc)

Limitations

In this project, a transformation for the target variable is introduced which is the log transformation after adding +1 to the variable. This particular transformation and the value 1 was are selected based on convenience. Other transformations can be explored and optimized for the purpose of predicting suicide rates. Besides that, is the parameter tuning for the Random Forests algorithm. The number of trees used for the model was set at 500. This number can be changes to further optimize the performance of the model.

Data Processing

Installing Libraries

All of the following libraries are pre-installed in R Studio. Thus, only packages loading are necessary for use in analysis and prediction modeling. Short description on what each pacakge does is shown below.

library(tidyverse) #general data tidying

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidyr) #general
library(ggalt) #dumbbell plots

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

library(countrycode) #continent
library(rworldmap) #country-level heatmaps

## Loading required package: sp

## ### Welcome to rworldmap ###

## For a short introduction type :   vignette('rworldmap')

library(gridExtra) #plots

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(caret) #test and training data split

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(wesanderson) #visualization theme
library(ranger) #random forest 
library(broom) #convert statistical objects into tidy tibbles

Loading Dataset

Working directory is set to allow for dataexport into R studio seamlessly. Dataset which is in csv format is loaded into R environment for further data processing.

#set working directory
setwd("/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Group Project")
getwd()

## [1] "/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Group Project"

#load and view data
suicidedata<-read.csv("master.csv", header = TRUE, sep = ",")
colSums(is.na(suicidedata))

##            country               year                sex                age 
##                  0                  0                  0                  0 
##        suicides_no         population  suicides.100k.pop       country.year 
##                  0                  0                  0                  0 
##       HDI.for.year   gdp_for_year.... gdp_per_capita....         generation 
##              19456                  0                  0                  0

head(suicidedata)

##   country year    sex         age suicides_no population suicides.100k.pop
## 1 Albania 1987   male 15-24 years          21     312900              6.71
## 2 Albania 1987   male 35-54 years          16     308000              5.19
## 3 Albania 1987 female 15-24 years          14     289700              4.83
## 4 Albania 1987   male   75+ years           1      21800              4.59
## 5 Albania 1987   male 25-34 years           9     274300              3.28
## 6 Albania 1987 female   75+ years           1      35600              2.81
##   country.year HDI.for.year gdp_for_year.... gdp_per_capita....      generation
## 1  Albania1987           NA    2,156,624,900                796    Generation X
## 2  Albania1987           NA    2,156,624,900                796          Silent
## 3  Albania1987           NA    2,156,624,900                796    Generation X
## 4  Albania1987           NA    2,156,624,900                796 G.I. Generation
## 5  Albania1987           NA    2,156,624,900                796         Boomers
## 6  Albania1987           NA    2,156,624,900                796 G.I. Generation

str(suicidedata)

## 'data.frame':    27820 obs. of  12 variables:
##  $ country           : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
##  $ sex               : chr  "male" "male" "female" "male" ...
##  $ age               : chr  "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
##  $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
##  $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
##  $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country.year      : chr  "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year....  : chr  "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
##  $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
##  $ generation        : chr  "Generation X" "Silent" "Generation X" "G.I. Generation" ...

Data Cleaning

Prior to data analysis and model building, data cleaning was done on the dataset. Here, HDI column was removed due to insufficient data. More than 70% of dataset in HDI column have missing values. After HDI column was removed, missing values from the remaining columns were dropped. Only one row with missing data was dropped. After that, annual gdp column which was initially in character format was changed to numerical by removing commas and reformatting its data type.Next, some of the column were renamed for simplifaction purposes. A new category was added to represent continent, using “countrycode” package.

#data cleaning & tidying
suicidedataclean<-suicidedata %>%
  subset(select =-c(HDI.for.year)) %>% #remove HDI column
  drop_na() %>% #drop any missing values
  mutate(across(all_of("gdp_for_year...."), ~gsub(",", "",.) %>% as.numeric)) %>% #remove commas and change gdp column to numeric
  rename(suicide.number=suicides_no) %>% #rename suicide number column
  rename(annual.gdp=gdp_for_year....) %>% #rename annual gdp column
  rename(gdp.per.capita=gdp_per_capita....) #rename gdp per capita column

#adding new category
suicidedataclean$continent <- countrycode(sourcevar = suicidedataclean[, "country"],
                              origin = "country.name",
                              destination = "continent")

#converting categorical values such as gender variable as factor
#categorical values are then given indicators. Eg: female=0 and male=1 
suicidedataclean <- suicidedataclean %>% 
  mutate(sex= as.factor(sex)) %>% 
  mutate(age= as.factor(age)) %>% 
  mutate(country= as.factor(country)) %>%
  mutate(generation= as.factor(generation))

#preview clean data
head(suicidedataclean)

##   country year    sex         age suicide.number population suicides.100k.pop
## 1 Albania 1987   male 15-24 years             21     312900              6.71
## 2 Albania 1987   male 35-54 years             16     308000              5.19
## 3 Albania 1987 female 15-24 years             14     289700              4.83
## 4 Albania 1987   male   75+ years              1      21800              4.59
## 5 Albania 1987   male 25-34 years              9     274300              3.28
## 6 Albania 1987 female   75+ years              1      35600              2.81
##   country.year annual.gdp gdp.per.capita      generation continent
## 1  Albania1987 2156624900            796    Generation X    Europe
## 2  Albania1987 2156624900            796          Silent    Europe
## 3  Albania1987 2156624900            796    Generation X    Europe
## 4  Albania1987 2156624900            796 G.I. Generation    Europe
## 5  Albania1987 2156624900            796         Boomers    Europe
## 6  Albania1987 2156624900            796 G.I. Generation    Europe

str(suicidedataclean)

## 'data.frame':    27820 obs. of  12 variables:
##  $ country          : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year             : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
##  $ age              : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
##  $ suicide.number   : int  21 16 14 1 9 1 6 4 1 0 ...
##  $ population       : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
##  $ suicides.100k.pop: num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country.year     : chr  "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ annual.gdp       : num  2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
##  $ gdp.per.capita   : int  796 796 796 796 796 796 796 796 796 796 ...
##  $ generation       : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
##  $ continent        : chr  "Europe" "Europe" "Europe" "Europe" ...

Exploratory Data Analysis

Plots for exploratory data analysis were done using “per 100k population” as a basis. This was done to normalize suicide number data due to the variation in population size across different countries.

Global Suicide per 100k Population against Year An overall plot showing global suicide rate per 100k population across 30 years indicates a high rate of increase from 1990 to 1995. Since 1995, the overall rate of suicide appears to be gradually decreasing, with the lowest being in 2015. Current rates are now returning to the rates in the 90s. However, data from pre-90s are limited so it may be challenging to confirm if these are representative of actual rates.

overallglobalsuicide<- suicidedataclean %>%
  select(year, suicide.number, population) %>%
  group_by(year) %>%
  summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2)) 

suicide.over.year <- ggplot(data = overallglobalsuicide,
                            aes(x = year, y = suicide.capita, colour = suicide.capita))

suicide.over.year + 
    scale_x_continuous(breaks = seq(1985, 2015, 5)) +
    geom_line() +
    geom_point(size = 3) +
    labs(title = 'Global Suicides 1985-2015',
         x = 'Year',
         y = 'Suicides per 100k people') +
    theme_grey()

Global Suicide per 100k Population by Sex When comparing suicide rate across different genders, it is apparent that suicide rate in males is higher than females. The trend shows that suicide cases among males are up to 3.5 times higher than females at a given year. However, it is worth noting that in the 1980s, this ratio is low at about 2.7.

overallglobalsuicidesex<- suicidedataclean %>%
  select(year, sex, suicide.number, population) %>%
  group_by(sex,year) %>%
  summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

suicide.over.year.sex <- ggplot(data = overallglobalsuicidesex,
                            aes(x = year, y = suicide.capita, group=sex, colour = sex))

suicide.over.year.sex + 
    scale_x_continuous(breaks = seq(1985, 2015, 5)) +
    geom_line() +
    geom_point(size = 3) +
    labs(title = 'Global Suicides 1985-2015 by Sex',
         x = 'Year',
         y = 'Suicides per 100k people') +
    theme_grey()

Global Suicide per 100k Population by Age Age group also has a significant factor in suicide rates. From the trends below, we can see that suicide rates increase as age group increase. In short, older age seems to have high correlation with high suicide rates.A notable observation would be that the rate of suicide for age 75+ has dropped at least 50% since 1990.

overallglobalsuicidesage<- suicidedataclean %>%
  select(year, age, suicide.number, population) %>%
  group_by(age,year) %>%
  summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))

## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.

suicide.over.year.age <- ggplot(data = overallglobalsuicidesage,
                            aes(x = year, y = suicide.capita, group=age, colour = age))

suicide.over.year.age + 
    scale_x_continuous(breaks = seq(1985, 2015, 5)) +
    geom_line() +
    geom_point(size = 3) +
    labs(title = 'Global Suicides 1985-2015 by Age',
         x = 'Year',
         y = 'Suicides per 100k people') +
    theme_grey()

Global Suicide per 100k Population by Country

When lining up each country’s suicide rate, we can see that Lithuania, Russian Federation, Sri Lanka, Belarus and Hungary are the top 5 countries with high suicide rates. In contrast, Oman, Jamaica and Maldives have among the lowest suicide rates. This proves that suicide rates differ across different countries.

overallglobalsuicidescountry<- suicidedataclean %>%
    select(country, suicide.number, population) %>%
    group_by(country) %>%
    summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))

suicides.by.country <- ggplot(data = overallglobalsuicidescountry,
                              aes(x = reorder(country, +suicide.capita), 
                                  y = suicide.capita,
                                  fill = country))

suicides.by.country + 
    geom_bar(stat = 'identity', show.legend = F) + 
    coord_flip() +
    labs(title = 'Global Suicides 1985-2015 by Country',
         x = 'Country',
         y = 'Suicides per 100k people') +
     theme(axis.title = element_text(size = 15, colour = 'Black'),
          plot.title = element_text(size = 15, colour = 'Black', hjust = 0.5),
          axis.text = element_text(size = 10))

Global Suicide per 100k Population by Continent From continent based analysis, we see that Europe shows a significant drop since 1995, which is in line with our overall global suicide rates. Here, we also see a significant drop in African continent since 1995 to about 0-1 per 100k population. This could be due to varying factors which include misreports, but for this analysis we choose to disregard outside factors. Asia and Ocenia shows varying rates of increase and decrease across 30 year. Finally, Americas has a relatively steady rate over time with low overall rates compared to Europe, Asia and Ocenia.

continent <- suicidedataclean %>%
  group_by(continent) %>%
  summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000) %>%
  arrange(suicide.capita)

continent$continent <- factor(continent$continent, ordered = T, levels = continent$continent)

continent_plot <- ggplot(continent, aes(x = continent, y = suicide.capita, fill = continent)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Global Suicides 1985-2015 by Continent",
  x = "Continent", 
  y = "Suicides per 100k people", 
  fill = "Continent") +
  theme(legend.position = "none", title = element_text(size = 10)) + 
  scale_y_continuous(breaks = seq(0, 20, 1), minor_breaks = F)


continent_time <- suicidedataclean %>%
  group_by(year, continent) %>%
  summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000)

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

continent_time$continent <- factor(continent_time$continent, ordered = T, levels = continent$continent)

continent_time_plot <- ggplot(continent_time, aes(x = year, y = suicide.capita, col = factor(continent))) + 
  facet_grid(continent ~ ., scales = "free_y") + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time by Continent", 
       x = "Year", 
       y = "Suicides per 100k people", 
       color = "Continent") + 
  theme(legend.position = "none", title = element_text(size = 10)) + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)

grid.arrange(continent_plot, continent_time_plot, ncol = 2)

Global Suicide per 100k Population Country Heatmap

A heatmap across global regions confirms our finding on continent analysis. Here, we see a higher suicide rate in Norhtern Asia represented by red, and lower rates.

country <- suicidedataclean %>%
  group_by(country) %>%
  summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000)

countrydata <- joinCountryData2Map(country, joinCode = "NAME", nameJoinColumn = "country")

par(mar=c(0, 0, 0, 0)) # margins

mapCountryData(countrydata, 
nameColumnToPlot="suicide.capita", 
mapTitle="", 
colourPalette = "heat", 
oceanCol="white", 
missingCountryCol="grey60", 
catMethod = "pretty")

Multiple Linear Regression

Multiple linear regression (MLR) model is a type of statistical technique used to predict the outcome of dependent variable. The dependent variable is affected by two or more regressors (independent variable). The main objective of this model is to model the linear relationship between the variables. There are a few assumptions made when using MLR. The assumptions are: • There is a linear relationship present between the exploratory variable and response variable. • Independent variables are not too highly correlated with each other. • Residuals of the model must be normally distributed.

The results are interpreted by interpreting the p-values of each regressors. Pr(>|t|) shows the p-values. When p-values are lesser than 0.05, with the confidence interval of 95%, null hypothesis is rejected. Thus, concluding the regressor has effect on the dependent variable.

#Select columns of independent variables
suicide.data2 <- suicidedataclean %>% 
  select(country, year, sex, age, suicides.100k.pop, gdp.per.capita,population, generation)

#Datasets were split into test and train 
set.seed(1, sample.kind="Rounding")

## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

test_index <- createDataPartition(y = suicide.data2$suicides.100k.pop, times = 1, 
                                  p = 0.2, list = FALSE)
train <- suicide.data2[-test_index,]
test <- suicide.data2[test_index,]


#Multiple linear regression(MLR) model used. suicides.100k.pop is the dependent variable. sex, age, generation,population, gdp.per.capita, country , year 
#Tested as independent variables. MLR used to test if suicides.100k.pop is dependent on the multi independent variables on the train dataset
suicide.lm <- lm(suicides.100k.pop~ sex + age + generation + gdp.per.capita + population + year + country, data = train)
#summary(suicide.lm)


#Test dataset was used to predict
test$lm <- predict(suicide.lm, test)
 

#RMSE calculated for predicted and actual values
RMSE(test$lm, test$suicides.100k.pop)

## [1] 13.27364

##MLR run on test data to retrieve coefficents of the regressors 
suicide.lm2 <- lm(suicides.100k.pop~ sex + age + generation + gdp.per.capita + population + year + country, data = test)
#summary(suicide.lm2)


#Data Visualization of predicted and actual values 
test %>% 
  gather(key=valuetype, value=rate, suicides.100k.pop, lm) %>%
  mutate(suicides=rate*population/100000) %>%
  group_by(year, valuetype) %>%
  mutate(rate=sum(suicides)*100000/sum(population)) %>%
  ggplot(aes(year, rate, col=valuetype)) +
  geom_line(size= 1.5) +
  geom_point(size = 3, color = "cornsilk4") +
  scale_color_manual(values = wes_palette("GrandBudapest1", n = 2))+
  scale_x_continuous(breaks = seq(1985, 2016, 2)) +
  labs(title = 'Actual vs Predicted(MLR)',
       x = 'Year',
       y = 'Suicides per 100k people') +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(color = "gray7", size = 16, face = "bold", hjust= 0.5))

Random Forest Regression

Random Forest Regression is a supervised learning algorithm that uses ensemble learning method for regression. Ensemble learning method is a technique that combines predictions from multiple machine learning algorithms to make a more accurate prediction than a single model.

A Random Forest operates by constructing several decision trees during training time and outputting the mean of the classes as the prediction of all the trees. Random forest works well for features with linear and non-linear relationships.In this project, since the target variable is far from normality, a transformation is considered to scale the distribution to symmetry in which a log function was applied to the target output.

The model was created using the formula below: Suicides.100k.pop_log <- population + country + sex + year + age+ gdp.per.capita + generation + continent

### Prepping the Data for the Model
#Since the target variable is far from normality, a transformation is considered in order to scale the distribution to symmetry. 
#Many transformations are discussed in the Statistical literature for the purpose of transforming a skewed variable to a symmetric one, however, not all of them are suitable for the `suicides.100k.pop` variable. 
#The natural 'log' transformation produces infinite values since `suicides.100k.pop` include zero values. 
#For this reason, the value 1 is added to the variable prior to applying the natural 'log' transformation.
# Variable transformation

suicidedataclean <- suicidedataclean %>%
mutate(suicides.100k.pop_log=log(1+suicides.100k.pop))

# Specify explanatory and outcome variables and model formula
vars <- c("population", "country", "sex", "year", "age", "gdp.per.capita","generation","continent")
outcome <- "suicides.100k.pop_log"
(fmla <- as.formula(paste(outcome, "~", paste(vars, collapse = " + "))))

## suicides.100k.pop_log ~ population + country + sex + year + age + 
##     gdp.per.capita + generation + continent

fmla

## suicides.100k.pop_log ~ population + country + sex + year + age + 
##     gdp.per.capita + generation + continent

# Split to training and testing data sets
set.seed(1, sample.kind="Rounding")

## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

test_index <- createDataPartition(y = suicidedataclean$suicides.100k.pop_log, times = 1, 
                                  p = 0.2, list = FALSE)
train <- suicidedataclean[-test_index,]
test <- suicidedataclean[test_index,]

#### Random Forests
#Fitted using the `ranger` package
# Random forests
set.seed(1, sample.kind="Rounding")

## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

rf1 <- ranger(fmla, # formula 
              train, # data
              num.trees = 500, 
              respect.unordered.factors = "order",
              seed = 1)
rf1

## Ranger result
## 
## Call:
##  ranger(fmla, train, num.trees = 500, respect.unordered.factors = "order",      seed = 1) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      22254 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.223804 
## R squared (OOB):                  0.8669882

### Test the Model
#After this, the model is applied to the test data. This is done through the `predict` function. 
#To test the performance of the model, the Root Mean Squared Errors (RMSE) is considered. 
#After generating the model predictions, the RMSE is calculated by comparing the model predictions against the true value of the suicide rates.
#Generate predictions using the test data

test$rf <- predict(rf1, test)$predictions


# Calculate RMSE
case1 <- test %>% gather(key=model, value=log_pred, rf) %>%
  mutate(pred=exp(log_pred),
         residuals=suicides.100k.pop-pred) %>%
  group_by(model) %>%
  summarize(rmse=sqrt(mean(residuals^2)))
case1

## # A tibble: 1 x 2
##   model  rmse
##   <chr> <dbl>
## 1 rf     6.86

#Visualization to compare the actual and predicted results
test %>% mutate(rf=exp(rf)) %>%
  gather(key=valuetype, value=rate, suicides.100k.pop, rf) %>%
  mutate(suicides=rate*population/100000) %>%
  group_by(year, valuetype) %>%
  mutate(rate_year=sum(suicides)*100000/sum(population)) %>%
  ggplot(aes(year, rate_year, col=valuetype)) +
  geom_line(size=1.5) +
  geom_point(size = 3, color = "cornsilk4") +
  scale_color_manual(values = wes_palette("GrandBudapest1", n = 2)) +
  scale_x_continuous(breaks = seq(1985, 2016, 2)) +
   labs(title = 'Actual vs Predicted (Random Forest)',
       x = 'Year',
       y = 'Suicides per 100k people') +
  theme(axis.text.x = element_text(angle = 90),
  plot.title = element_text(color = "gray7", size = 16, face = "bold", hjust= 0.5))

Conclusion

Based on random forest comparison graph, we can see the graphs over line each other at certain points. Whereas, in MLR there is a distinct difference of the predicted and actual values.

The first model using multiple linear regression (MLR) yields an RMSE of 13.27, while the second model using random forest regression has an RMSE of 6.86.

Thus, concluding that random forest prediction is more accurate and outperforms MLR for suicide rate prediction.

Reference

References:

Bakshi, C. (2020, June 9). Random Forest Regression - Level Up Coding. Medium. https://levelup.gitconnected.com/random-forest-regression-209c0f354c84#:%7E:text=Random%20Forest%20Regression%20is%20a%20supervised%20learning%20algorithm,ensemble%20learning%20method%20for%20regression.&text=A%20Random%20Forest%20operates%20by,prediction%20of%20all%20the%20trees.
Health status - Suicide rates - OECD Data. (n.d.). TheOECD. https://data.oecd.org/healthstat/suicide-rates.htm
How to tune hyperparameters in a random forest. (2018, May 3). Cross Validated. https://stats.stackexchange.com/questions/344220/how-to-tune-hyperparameters-in-a-random-forest Mwiti, D. (2021, May 11). Random Forest Regression: When Does It Fail and Why? Neptune.Ai. https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why
Ritchie, H. (2015, June 15). Suicide. Our World in Data. https://ourworldindata.org/suicide SUICIDE RATE. (2017, June). World Health Organization. https://www.un.org/esa/sustdev/natlinfo/indicators/methodology_sheets/health/suicide_rate.pdf
Wikipedia contributors. (2021, May 22). Suicide. Wikipedia. https://en.wikipedia.org/wiki/Suicide
Example of Multiple Linear Regression in R https://datatofish.com/multiple-linear-regression-in-r
Multiple (Linear) Regression https://www.statmethods.net/stats/regression.html
Multiple Linear Regression(MLR) https://www.investopedia.com/terms/m/mlr.asp
Sanderson, M., Bulloch, A., Wang, J., Williamson, T., & Patten, S. (2020). Predicting death by suicide using administrative health care system data: Can recurrent neural network, one-dimensional convolutional neural network, and gradient boosted trees models improve prediction performance?. Journal Of Affective Disorders, 264, 107-114. doi: 10.1016/j.jad.2019.12.024.https://www.sciencedirect.com/science/article/pii/S0165032719331775
Grendas, L., Rojas, S., Puppo, S., Vidjen, P., Portela, A., & Chiapella, L. et al. (2019). Interaction between prospective risk factors in the prediction of suicide risk. Journal Of Affective Disorders, 258, 144-150. doi: 10.1016/j.jad.2019.07.071.https://www.sciencedirect.com/science/article/abs/pii/S0165032719309887

Suicide Rate Prediction

Ainaa Najihah, Thivyaa Moses, Vinnithra Vasu

5/23/2021

Background

Data Processing

Exploratory Data Analysis

Multiple Linear Regression

Multiple Linear Regression

Random Forest Regression

Random Forest Regression

Conclusion

Conclusion

Reference

References: