Introduction and Objective
Suicide rates are defined as the deaths deliberately initiated and performed by a person in the full knowledge or expectation of its fatal outcome, per 100,000 population. In other words, suicide is a death caused by intentional self-harm. Suicide rate is an important proxy for the prevalence of mental health disorders in many countries in the world.
This project objective is to create a machine learning algorithm for suicide rate prediction. It uses country level suicide rates data from the World Health Organization and other country level indicators from The World Bank (WB) and The United Nations Development Program (UNDP), from 1985 to 2016. The data contains information on suicide rates per 100k population, age, sex, generation, country Gross Domestic Product (GDP), population size, and Human Development Index (HDI) country score.
Key Facts
Literature Review
According to previous research, prediction of suicide rate was done by using Kaplan-Meier method to estimate the events and Cox Proportional Hazards Regression was used to examine the predictors of suicide and suicides reattempt (L.N Grendas, et al., 2019). The aim was to evaluate the prospective predictors and the correlation with the factors that effects suicide rates. Based on the prospective prediction of suicide attempts (M. Miche, et al., 2020), four prediction models were used: Logistic regression, random forest, lasso, ridge (variants of the logistic regression). All four yielded good accuracies. The predictive performance of random forest (ML algorithm) depends on the sample size, with bigger sample size leading to increased performance results. In predicting death by suicide using administrative health care system data (M.Sanderson, et al., 2019), they have compared the performance of recurrent neural networks, one-dimensional convolutional neural networks, and gradient boosted trees, with logistic regression and feedforward neural networks. In this study, they have concluded, the recurrent neural network model configuration, one-dimensional convolutional neural network configuration has outperformed logistic regression.
Data Set
Data set for this project was obtained from Kaggle Suicide Rates Overview 1985-2016. This dataset was built using a combination of four other datasets namely, United Nations Development Program, World Bank, World Health Organization and 2017 Journal article.The dataset consists of 27820 rows with 12 columns as follows :
Limitations
In this project, a transformation for the target variable is introduced which is the log transformation after adding +1 to the variable. This particular transformation and the value 1 was are selected based on convenience. Other transformations can be explored and optimized for the purpose of predicting suicide rates. Besides that, is the parameter tuning for the Random Forests algorithm. The number of trees used for the model was set at 500. This number can be changes to further optimize the performance of the model.
Installing Libraries
All of the following libraries are pre-installed in R Studio. Thus, only packages loading are necessary for use in analysis and prediction modeling. Short description on what each pacakge does is shown below.
library(tidyverse) #general data tidying
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr) #general
library(ggalt) #dumbbell plots
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
library(countrycode) #continent
library(rworldmap) #country-level heatmaps
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(gridExtra) #plots
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret) #test and training data split
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(wesanderson) #visualization theme
library(ranger) #random forest
library(broom) #convert statistical objects into tidy tibbles
Loading Dataset
Working directory is set to allow for dataexport into R studio seamlessly. Dataset which is in csv format is loaded into R environment for further data processing.
#set working directory
setwd("/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Group Project")
getwd()
## [1] "/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Group Project"
#load and view data
suicidedata<-read.csv("master.csv", header = TRUE, sep = ",")
colSums(is.na(suicidedata))
## country year sex age
## 0 0 0 0
## suicides_no population suicides.100k.pop country.year
## 0 0 0 0
## HDI.for.year gdp_for_year.... gdp_per_capita.... generation
## 19456 0 0 0
head(suicidedata)
## country year sex age suicides_no population suicides.100k.pop
## 1 Albania 1987 male 15-24 years 21 312900 6.71
## 2 Albania 1987 male 35-54 years 16 308000 5.19
## 3 Albania 1987 female 15-24 years 14 289700 4.83
## 4 Albania 1987 male 75+ years 1 21800 4.59
## 5 Albania 1987 male 25-34 years 9 274300 3.28
## 6 Albania 1987 female 75+ years 1 35600 2.81
## country.year HDI.for.year gdp_for_year.... gdp_per_capita.... generation
## 1 Albania1987 NA 2,156,624,900 796 Generation X
## 2 Albania1987 NA 2,156,624,900 796 Silent
## 3 Albania1987 NA 2,156,624,900 796 Generation X
## 4 Albania1987 NA 2,156,624,900 796 G.I. Generation
## 5 Albania1987 NA 2,156,624,900 796 Boomers
## 6 Albania1987 NA 2,156,624,900 796 G.I. Generation
str(suicidedata)
## 'data.frame': 27820 obs. of 12 variables:
## $ country : chr "Albania" "Albania" "Albania" "Albania" ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : chr "male" "male" "female" "male" ...
## $ age : chr "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
## $ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides.100k.pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country.year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
## $ HDI.for.year : num NA NA NA NA NA NA NA NA NA NA ...
## $ gdp_for_year.... : chr "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
## $ gdp_per_capita....: int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : chr "Generation X" "Silent" "Generation X" "G.I. Generation" ...
Data Cleaning
Prior to data analysis and model building, data cleaning was done on the dataset. Here, HDI column was removed due to insufficient data. More than 70% of dataset in HDI column have missing values. After HDI column was removed, missing values from the remaining columns were dropped. Only one row with missing data was dropped. After that, annual gdp column which was initially in character format was changed to numerical by removing commas and reformatting its data type.Next, some of the column were renamed for simplifaction purposes. A new category was added to represent continent, using “countrycode” package.
#data cleaning & tidying
suicidedataclean<-suicidedata %>%
subset(select =-c(HDI.for.year)) %>% #remove HDI column
drop_na() %>% #drop any missing values
mutate(across(all_of("gdp_for_year...."), ~gsub(",", "",.) %>% as.numeric)) %>% #remove commas and change gdp column to numeric
rename(suicide.number=suicides_no) %>% #rename suicide number column
rename(annual.gdp=gdp_for_year....) %>% #rename annual gdp column
rename(gdp.per.capita=gdp_per_capita....) #rename gdp per capita column
#adding new category
suicidedataclean$continent <- countrycode(sourcevar = suicidedataclean[, "country"],
origin = "country.name",
destination = "continent")
#converting categorical values such as gender variable as factor
#categorical values are then given indicators. Eg: female=0 and male=1
suicidedataclean <- suicidedataclean %>%
mutate(sex= as.factor(sex)) %>%
mutate(age= as.factor(age)) %>%
mutate(country= as.factor(country)) %>%
mutate(generation= as.factor(generation))
#preview clean data
head(suicidedataclean)
## country year sex age suicide.number population suicides.100k.pop
## 1 Albania 1987 male 15-24 years 21 312900 6.71
## 2 Albania 1987 male 35-54 years 16 308000 5.19
## 3 Albania 1987 female 15-24 years 14 289700 4.83
## 4 Albania 1987 male 75+ years 1 21800 4.59
## 5 Albania 1987 male 25-34 years 9 274300 3.28
## 6 Albania 1987 female 75+ years 1 35600 2.81
## country.year annual.gdp gdp.per.capita generation continent
## 1 Albania1987 2156624900 796 Generation X Europe
## 2 Albania1987 2156624900 796 Silent Europe
## 3 Albania1987 2156624900 796 Generation X Europe
## 4 Albania1987 2156624900 796 G.I. Generation Europe
## 5 Albania1987 2156624900 796 Boomers Europe
## 6 Albania1987 2156624900 796 G.I. Generation Europe
str(suicidedataclean)
## 'data.frame': 27820 obs. of 12 variables:
## $ country : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
## $ age : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
## $ suicide.number : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides.100k.pop: num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country.year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
## $ annual.gdp : num 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
## $ gdp.per.capita : int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
## $ continent : chr "Europe" "Europe" "Europe" "Europe" ...
Plots for exploratory data analysis were done using “per 100k population” as a basis. This was done to normalize suicide number data due to the variation in population size across different countries.
Global Suicide per 100k Population against Year An overall plot showing global suicide rate per 100k population across 30 years indicates a high rate of increase from 1990 to 1995. Since 1995, the overall rate of suicide appears to be gradually decreasing, with the lowest being in 2015. Current rates are now returning to the rates in the 90s. However, data from pre-90s are limited so it may be challenging to confirm if these are representative of actual rates.
overallglobalsuicide<- suicidedataclean %>%
select(year, suicide.number, population) %>%
group_by(year) %>%
summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))
suicide.over.year <- ggplot(data = overallglobalsuicide,
aes(x = year, y = suicide.capita, colour = suicide.capita))
suicide.over.year +
scale_x_continuous(breaks = seq(1985, 2015, 5)) +
geom_line() +
geom_point(size = 3) +
labs(title = 'Global Suicides 1985-2015',
x = 'Year',
y = 'Suicides per 100k people') +
theme_grey()
Global Suicide per 100k Population by Sex When comparing suicide rate across different genders, it is apparent that suicide rate in males is higher than females. The trend shows that suicide cases among males are up to 3.5 times higher than females at a given year. However, it is worth noting that in the 1980s, this ratio is low at about 2.7.
overallglobalsuicidesex<- suicidedataclean %>%
select(year, sex, suicide.number, population) %>%
group_by(sex,year) %>%
summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
suicide.over.year.sex <- ggplot(data = overallglobalsuicidesex,
aes(x = year, y = suicide.capita, group=sex, colour = sex))
suicide.over.year.sex +
scale_x_continuous(breaks = seq(1985, 2015, 5)) +
geom_line() +
geom_point(size = 3) +
labs(title = 'Global Suicides 1985-2015 by Sex',
x = 'Year',
y = 'Suicides per 100k people') +
theme_grey()
Global Suicide per 100k Population by Age Age group also has a significant factor in suicide rates. From the trends below, we can see that suicide rates increase as age group increase. In short, older age seems to have high correlation with high suicide rates.A notable observation would be that the rate of suicide for age 75+ has dropped at least 50% since 1990.
overallglobalsuicidesage<- suicidedataclean %>%
select(year, age, suicide.number, population) %>%
group_by(age,year) %>%
summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
suicide.over.year.age <- ggplot(data = overallglobalsuicidesage,
aes(x = year, y = suicide.capita, group=age, colour = age))
suicide.over.year.age +
scale_x_continuous(breaks = seq(1985, 2015, 5)) +
geom_line() +
geom_point(size = 3) +
labs(title = 'Global Suicides 1985-2015 by Age',
x = 'Year',
y = 'Suicides per 100k people') +
theme_grey()
Global Suicide per 100k Population by Country
When lining up each country’s suicide rate, we can see that Lithuania, Russian Federation, Sri Lanka, Belarus and Hungary are the top 5 countries with high suicide rates. In contrast, Oman, Jamaica and Maldives have among the lowest suicide rates. This proves that suicide rates differ across different countries.
overallglobalsuicidescountry<- suicidedataclean %>%
select(country, suicide.number, population) %>%
group_by(country) %>%
summarise(suicide.capita = round((sum(suicide.number) / sum(population)) * 100000, 2))
suicides.by.country <- ggplot(data = overallglobalsuicidescountry,
aes(x = reorder(country, +suicide.capita),
y = suicide.capita,
fill = country))
suicides.by.country +
geom_bar(stat = 'identity', show.legend = F) +
coord_flip() +
labs(title = 'Global Suicides 1985-2015 by Country',
x = 'Country',
y = 'Suicides per 100k people') +
theme(axis.title = element_text(size = 15, colour = 'Black'),
plot.title = element_text(size = 15, colour = 'Black', hjust = 0.5),
axis.text = element_text(size = 10))
Global Suicide per 100k Population by Continent From continent based analysis, we see that Europe shows a significant drop since 1995, which is in line with our overall global suicide rates. Here, we also see a significant drop in African continent since 1995 to about 0-1 per 100k population. This could be due to varying factors which include misreports, but for this analysis we choose to disregard outside factors. Asia and Ocenia shows varying rates of increase and decrease across 30 year. Finally, Americas has a relatively steady rate over time with low overall rates compared to Europe, Asia and Ocenia.
continent <- suicidedataclean %>%
group_by(continent) %>%
summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000) %>%
arrange(suicide.capita)
continent$continent <- factor(continent$continent, ordered = T, levels = continent$continent)
continent_plot <- ggplot(continent, aes(x = continent, y = suicide.capita, fill = continent)) +
geom_bar(stat = "identity") +
labs(title = "Global Suicides 1985-2015 by Continent",
x = "Continent",
y = "Suicides per 100k people",
fill = "Continent") +
theme(legend.position = "none", title = element_text(size = 10)) +
scale_y_continuous(breaks = seq(0, 20, 1), minor_breaks = F)
continent_time <- suicidedataclean %>%
group_by(year, continent) %>%
summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
continent_time$continent <- factor(continent_time$continent, ordered = T, levels = continent$continent)
continent_time_plot <- ggplot(continent_time, aes(x = year, y = suicide.capita, col = factor(continent))) +
facet_grid(continent ~ ., scales = "free_y") +
geom_line() +
geom_point() +
labs(title = "Trends Over Time by Continent",
x = "Year",
y = "Suicides per 100k people",
color = "Continent") +
theme(legend.position = "none", title = element_text(size = 10)) +
scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)
grid.arrange(continent_plot, continent_time_plot, ncol = 2)
Global Suicide per 100k Population Country Heatmap
A heatmap across global regions confirms our finding on continent analysis. Here, we see a higher suicide rate in Norhtern Asia represented by red, and lower rates.
country <- suicidedataclean %>%
group_by(country) %>%
summarize(suicide.capita = (sum(as.numeric(suicide.number)) / sum(as.numeric(population))) * 100000)
countrydata <- joinCountryData2Map(country, joinCode = "NAME", nameJoinColumn = "country")
par(mar=c(0, 0, 0, 0)) # margins
mapCountryData(countrydata,
nameColumnToPlot="suicide.capita",
mapTitle="",
colourPalette = "heat",
oceanCol="white",
missingCountryCol="grey60",
catMethod = "pretty")
Multiple linear regression (MLR) model is a type of statistical technique used to predict the outcome of dependent variable. The dependent variable is affected by two or more regressors (independent variable). The main objective of this model is to model the linear relationship between the variables. There are a few assumptions made when using MLR. The assumptions are: • There is a linear relationship present between the exploratory variable and response variable. • Independent variables are not too highly correlated with each other. • Residuals of the model must be normally distributed.
The results are interpreted by interpreting the p-values of each regressors. Pr(>|t|) shows the p-values. When p-values are lesser than 0.05, with the confidence interval of 95%, null hypothesis is rejected. Thus, concluding the regressor has effect on the dependent variable.
#Select columns of independent variables
suicide.data2 <- suicidedataclean %>%
select(country, year, sex, age, suicides.100k.pop, gdp.per.capita,population, generation)
#Datasets were split into test and train
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
test_index <- createDataPartition(y = suicide.data2$suicides.100k.pop, times = 1,
p = 0.2, list = FALSE)
train <- suicide.data2[-test_index,]
test <- suicide.data2[test_index,]
#Multiple linear regression(MLR) model used. suicides.100k.pop is the dependent variable. sex, age, generation,population, gdp.per.capita, country , year
#Tested as independent variables. MLR used to test if suicides.100k.pop is dependent on the multi independent variables on the train dataset
suicide.lm <- lm(suicides.100k.pop~ sex + age + generation + gdp.per.capita + population + year + country, data = train)
#summary(suicide.lm)
#Test dataset was used to predict
test$lm <- predict(suicide.lm, test)
#RMSE calculated for predicted and actual values
RMSE(test$lm, test$suicides.100k.pop)
## [1] 13.27364
##MLR run on test data to retrieve coefficents of the regressors
suicide.lm2 <- lm(suicides.100k.pop~ sex + age + generation + gdp.per.capita + population + year + country, data = test)
#summary(suicide.lm2)
#Data Visualization of predicted and actual values
test %>%
gather(key=valuetype, value=rate, suicides.100k.pop, lm) %>%
mutate(suicides=rate*population/100000) %>%
group_by(year, valuetype) %>%
mutate(rate=sum(suicides)*100000/sum(population)) %>%
ggplot(aes(year, rate, col=valuetype)) +
geom_line(size= 1.5) +
geom_point(size = 3, color = "cornsilk4") +
scale_color_manual(values = wes_palette("GrandBudapest1", n = 2))+
scale_x_continuous(breaks = seq(1985, 2016, 2)) +
labs(title = 'Actual vs Predicted(MLR)',
x = 'Year',
y = 'Suicides per 100k people') +
theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(color = "gray7", size = 16, face = "bold", hjust= 0.5))
Random Forest Regression is a supervised learning algorithm that uses ensemble learning method for regression. Ensemble learning method is a technique that combines predictions from multiple machine learning algorithms to make a more accurate prediction than a single model.
A Random Forest operates by constructing several decision trees during training time and outputting the mean of the classes as the prediction of all the trees. Random forest works well for features with linear and non-linear relationships.In this project, since the target variable is far from normality, a transformation is considered to scale the distribution to symmetry in which a log function was applied to the target output.
The model was created using the formula below: Suicides.100k.pop_log <- population + country + sex + year + age+ gdp.per.capita + generation + continent
### Prepping the Data for the Model
#Since the target variable is far from normality, a transformation is considered in order to scale the distribution to symmetry.
#Many transformations are discussed in the Statistical literature for the purpose of transforming a skewed variable to a symmetric one, however, not all of them are suitable for the `suicides.100k.pop` variable.
#The natural 'log' transformation produces infinite values since `suicides.100k.pop` include zero values.
#For this reason, the value 1 is added to the variable prior to applying the natural 'log' transformation.
# Variable transformation
suicidedataclean <- suicidedataclean %>%
mutate(suicides.100k.pop_log=log(1+suicides.100k.pop))
# Specify explanatory and outcome variables and model formula
vars <- c("population", "country", "sex", "year", "age", "gdp.per.capita","generation","continent")
outcome <- "suicides.100k.pop_log"
(fmla <- as.formula(paste(outcome, "~", paste(vars, collapse = " + "))))
## suicides.100k.pop_log ~ population + country + sex + year + age +
## gdp.per.capita + generation + continent
fmla
## suicides.100k.pop_log ~ population + country + sex + year + age +
## gdp.per.capita + generation + continent
# Split to training and testing data sets
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
test_index <- createDataPartition(y = suicidedataclean$suicides.100k.pop_log, times = 1,
p = 0.2, list = FALSE)
train <- suicidedataclean[-test_index,]
test <- suicidedataclean[test_index,]
#### Random Forests
#Fitted using the `ranger` package
# Random forests
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
rf1 <- ranger(fmla, # formula
train, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = 1)
rf1
## Ranger result
##
## Call:
## ranger(fmla, train, num.trees = 500, respect.unordered.factors = "order", seed = 1)
##
## Type: Regression
## Number of trees: 500
## Sample size: 22254
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.223804
## R squared (OOB): 0.8669882
### Test the Model
#After this, the model is applied to the test data. This is done through the `predict` function.
#To test the performance of the model, the Root Mean Squared Errors (RMSE) is considered.
#After generating the model predictions, the RMSE is calculated by comparing the model predictions against the true value of the suicide rates.
#Generate predictions using the test data
test$rf <- predict(rf1, test)$predictions
# Calculate RMSE
case1 <- test %>% gather(key=model, value=log_pred, rf) %>%
mutate(pred=exp(log_pred),
residuals=suicides.100k.pop-pred) %>%
group_by(model) %>%
summarize(rmse=sqrt(mean(residuals^2)))
case1
## # A tibble: 1 x 2
## model rmse
## <chr> <dbl>
## 1 rf 6.86
#Visualization to compare the actual and predicted results
test %>% mutate(rf=exp(rf)) %>%
gather(key=valuetype, value=rate, suicides.100k.pop, rf) %>%
mutate(suicides=rate*population/100000) %>%
group_by(year, valuetype) %>%
mutate(rate_year=sum(suicides)*100000/sum(population)) %>%
ggplot(aes(year, rate_year, col=valuetype)) +
geom_line(size=1.5) +
geom_point(size = 3, color = "cornsilk4") +
scale_color_manual(values = wes_palette("GrandBudapest1", n = 2)) +
scale_x_continuous(breaks = seq(1985, 2016, 2)) +
labs(title = 'Actual vs Predicted (Random Forest)',
x = 'Year',
y = 'Suicides per 100k people') +
theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(color = "gray7", size = 16, face = "bold", hjust= 0.5))
Based on random forest comparison graph, we can see the graphs over line each other at certain points. Whereas, in MLR there is a distinct difference of the predicted and actual values.
The first model using multiple linear regression (MLR) yields an RMSE of 13.27, while the second model using random forest regression has an RMSE of 6.86.
Thus, concluding that random forest prediction is more accurate and outperforms MLR for suicide rate prediction.
Bakshi, C. (2020, June 9). Random Forest Regression - Level Up Coding. Medium. https://levelup.gitconnected.com/random-forest-regression-209c0f354c84#:%7E:text=Random%20Forest%20Regression%20is%20a%20supervised%20learning%20algorithm,ensemble%20learning%20method%20for%20regression.&text=A%20Random%20Forest%20operates%20by,prediction%20of%20all%20the%20trees.
Health status - Suicide rates - OECD Data. (n.d.). TheOECD. https://data.oecd.org/healthstat/suicide-rates.htm
How to tune hyperparameters in a random forest. (2018, May 3). Cross Validated. https://stats.stackexchange.com/questions/344220/how-to-tune-hyperparameters-in-a-random-forest Mwiti, D. (2021, May 11). Random Forest Regression: When Does It Fail and Why? Neptune.Ai. https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why
Ritchie, H. (2015, June 15). Suicide. Our World in Data. https://ourworldindata.org/suicide SUICIDE RATE. (2017, June). World Health Organization. https://www.un.org/esa/sustdev/natlinfo/indicators/methodology_sheets/health/suicide_rate.pdf
Wikipedia contributors. (2021, May 22). Suicide. Wikipedia. https://en.wikipedia.org/wiki/Suicide
Example of Multiple Linear Regression in R https://datatofish.com/multiple-linear-regression-in-r
Multiple (Linear) Regression https://www.statmethods.net/stats/regression.html
Multiple Linear Regression(MLR) https://www.investopedia.com/terms/m/mlr.asp
Sanderson, M., Bulloch, A., Wang, J., Williamson, T., & Patten, S. (2020). Predicting death by suicide using administrative health care system data: Can recurrent neural network, one-dimensional convolutional neural network, and gradient boosted trees models improve prediction performance?. Journal Of Affective Disorders, 264, 107-114. doi: 10.1016/j.jad.2019.12.024.https://www.sciencedirect.com/science/article/pii/S0165032719331775
Grendas, L., Rojas, S., Puppo, S., Vidjen, P., Portela, A., & Chiapella, L. et al. (2019). Interaction between prospective risk factors in the prediction of suicide risk. Journal Of Affective Disorders, 258, 144-150. doi: 10.1016/j.jad.2019.07.071.https://www.sciencedirect.com/science/article/abs/pii/S0165032719309887