This R Markdown aims at predicting suicide rates or what indicates them through Linear Regression. It uses country level suicide rates data from the World Health Organization (WHO) and other country level indicators from The World Bank (WB) and The United Nations Development Program (UNDP), from 1985 to 2016. The data contains information on suicide rates per 100k population with respect to population age, sex, generation, country Gross Domestic Product (GDP), population size, and Human Development Index (HDI) country score.
# Download and clean the data
dl <- tempfile()
download.file("https://raw.githubusercontent.com/cordelljones1996/suicide/main/master.csv",
dl)
suicide <- read.csv(dl, col.names = c("country", "year", "sex", "age", "suicides_no",
"population", "suicide_rate", "country.year", "HDI.for.year", "gdp_for_year",
"gdp_per_capita", "generation"))
glimpse(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
The dataset includes 12 variables: 1 target variable and 11 features. The target variable is suicide_rate, which is the suicide rate per 100k population. The features, respectively, are: country name, year, sex, age, population size, HDI score, GDP and GDP per capita and generation.
suicide_rategdp.percapita($)The distribution of the suicide_rate variable seems to be extremely positively skewed to the right, with a spike at the first bin closest to the value 0. This shape suggests that the majority of the observations have a very small value of suicide rate with a small number of observations with very high values, causing a very long positive tail to the shape of the distribution. This severe skewness might suggest the use of a transformation. This topic will be discussed further later on in the report.
Suicide rate varies from one country to another. This is confirmed by the following bar chart. For easier interpretation, the figure shows the top 25 countries in terms of suicide rates. Lithuania is universally the top country in terms of suicide at a little over 40 suicides per 100k population. This rate is extremely high especially that Lithuania is not a big country like Russia, for example, which comes second after Lithuania in the ranking. In fact, the average population size in Lithuania between 1985 and 2016 is 40.4155725 compared to 34.8923765. This disproportion in the case of Lithuania induces further exploration of the country population size.
A continent variable is not present in the dataset but it can easily be added using the countrycode package.
# Add continent variable
suicide$continent <- countrycode(sourcevar = suicide[, "country"],
origin = "country.name",
destination = "continent")
The distribution of suicide rate varies from one continent to another. The highest suicide rates are in Europe and the distribution is severely skewed to the right. We see similar skeweness in Asia and the Americas as well with the presense of a few extreme values in each continent.
There seems to be some sort of positive linear association between the country population size and its corresponding suicide rate. Lithuania and the United States can be considered as two outliers in 2 opposite directions in the data. Presence of such extreme values affects the prediction process.
## Warning: Width not defined. Set with `position_dodge(width = ?)`
It seems that not all years included in the dataset have the same number of observations. The following plot shows that the year 2016 has the least number of observations. For this reason, year 2016 will be excluded later in the analysis
Suicide rates vary from one year to another. The following timeplot shows that before 1995, there was a global ascending trend of suicide. The opposite is true after 1995, as we see suicide rates decrease in a downward trend.
The next plot shows suicide rates plotted against per capita GDP of the countries in the dataset. It seems that there is a positive linear correlation between suicide rates per 100k population and the country’s GDP per capita. There exists some outlier values in the case of Lithuania, where we have relatively small per capita GDP and very high suicide rate (the highest as seen earlier).
Now after taking an overview on suicide variability by several country and nation level variables, we move to demographic variables that characterise the populations of these countries. First, we look at suicide variability by age goup. The following plot shows that suicide rates distribution varies from one age group to another. The highest suicide rates are found in individuals aged 75+ and the lowest suicide rates are found in individuals aged between 5-14.
When looking at suicide rates by sex, the data shows that suicide is more prevalent among males than females, with universal suicide rate of almost 21 versus about 6 suicides per 100k male population.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Suicide rate trends across time varies by sex as well. The following plot shows that the suicide rates for females exhibit a universal descending trend across time while the trend for males fluctuates around 1995 which, resembles the trend we saw before.
From the exploratory analysis performed above, two main observations emerge:
Lithuania was found to have an extreme value for suicide rates per 100k population.
The year 2016 included the least number of observations.
The target variable, suicide_rate, is severely positively skewed
Due to these factors, predicting suicide rates can be highly jeopardized if the data is used as is. Several model attempts are made and performance of each attempt is measured.
Since the target variable, suicide_rate is a continuous variable, multiple linear regression algorithm is considered. A regression model is fitted that takes into account country, population size, per capita CGP, year, sex and age group effects on suicide rates per 100k population. The following model is considers:
\(log(suicide\_rate_{c,p,g,y,s,a}) = \mu + \beta_c + \beta_{ct} + \beta_p + \beta_g + \beta_y + \beta_s + \beta_a + \epsilon_{c,p,g,y,s,a}\)
Where \(\beta_c\) is the country effect, \(\beta_{ct}\) is the continent effect, \(\beta_p\) is the population effect, \(\beta_g\) is the per capita GDP effect, \(\beta_y\) is the year effect, \(\beta_s\) is the sex effect, \(\beta_a\) is the age group effect and \(\epsilon\) is the model error term.
In addition to linear regression, Random Forests algorithm is attempted as well. Performance of the two models is evaluated.
# Specify explanatory and outcome variables and model formula
vars = c("continent", "population", "country", "sex", "year", "age", "gdp_per_capita")
outcome = "suicide_rate_log"
(fmla = as.formula(paste(outcome, "~", paste(vars, collapse = " + "))))
## suicide_rate_log ~ continent + population + country + sex + year +
## age + gdp_per_capita
Since the target variable is far from normality, a transformation is considered in order to scale the distribution to symmetry. Many transformations are discussed in the Staitstical literature for the purpose of transforming a skewed variable to a symmetric one, however, not all of them are suitable for the suicide_rate variable. The natural \(log\) transformation produces infinte values since suicide_rates include zero values. For this reason, the value 1 is added to the variable prior to appling the natural \(log\) transformation.
# Variable transformation
suicide <- suicide %>%
mutate(suicide_rate_log=log(1+suicide_rate))
Training and testing datasets are created. Testing set is 20% of the entire dataset.
# Split to training and testing datasets
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
test_index = createDataPartition(y = suicide$suicide_rate_log, times = 1,
p = 0.2, list = FALSE)
train = suicide[-test_index,]
test = suicide[test_index,]
A linear regression model is fitted using the train data using the lm function
# Linear regression
lm1 <- train %>%
lm(fmla, data=.)
Fitted using the ranger package
# Random forests
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
rf1 = ranger(fmla, # formula
train, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = 1)
After this, the model is applied to the test data. This is done through the predict function. To test the performance of the model, the Root Mean Squared Errors (RMSE) is considered. After generating the model predictions, the RMSE is calculated by comparing the model predictions against the true value of the suicide rates.
The model with the lowest RMSE in this case is random forests.
Other models are attempted with the following changes in each attempt:
Eliminate the year 2016
Eliminate Lithuania
Training and testing data sets are generated again to accommodate the above changes for each case and the models are refitted. Comparison between models performance is done through RMSE values.
The suicide rates per 100k populations prediction algorithm includes several country-level variables in addition to population demographic characteristics. The model used for prediction is a multiple linear regression model fitted to the training data and tested on the testing data. Random Forests model is also attempted at number of trees of 500. Models are fitted based on 2 different cases. Assessment of the perfomance is based on the value of RMSE.
The following table summarizes the performance of each model at the 2 specified cases:
case1
## # A tibble: 2 × 2
## model rmse
## <chr> <dbl>
## 1 lm 12.4
## 2 rf 6.56
case2
## # A tibble: 2 × 2
## model rmse
## <chr> <dbl>
## 1 lm 12.5
## 2 rf 7.20
The Random Forests model yielded the least value of RMSE and is considered the best model for predicting suicide rates per 100k population. Eliminating data of the year 2016 and of Lithuania did not improve the value of RMSE.
The following graph shows predictions from the linear regression model and the random forests model compared to the true values of suicide rates in the test dataset for each year. The plot shows that annual predictions generated by random forests model are closer to the true value of suicide rates than those predicted by the linear regression model.
test %>% mutate(lm=exp(lm), rf=exp(rf)) %>%
gather(key=valuetype, value=rate, suicide_rate, lm, rf) %>%
mutate(suicides=rate*population/100000) %>%
group_by(year, valuetype) %>%
mutate(rate_year=sum(suicides)*100000/sum(population)) %>%
ggplot(aes(year, rate_year, col=valuetype)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(1985, 2016, 2)) +
theme(legend.position = 'bottom', axis.text.x = element_text(angle = 45))
Additional modeling attempts were made. They are based on the concept of regularization; to penalize the least squares estimates using the parameter lambda to optimize for lambda that minimizes the RMSE. The attempt yielded value of zero for lambda: suggesting that no penalty for the least squares estimates of the model would further enhance the model performance. The attempt is not discussed in the report but it is included in the R script file.
There are other variables that exist in the dataset but not included in the analysis and the modeling. These variables are:
gdp_for_year, which is the GDP of the country at a given year. It is eliminated as it is highly correlated with gdp_per_capita to eliminate multicollinearity in the model. gdp_per_capita was selected over gdp_for_year as it is a better measure for the GDP and wealth of the nations that takes into account population size.
generation, which is a categorical variable for the generation of the population. It is left out as it is highly correlated with age. age is selected over generation to include in the analysis as it more easily and intuitively understood.
HDI.for.year, as over two thirds of the variable is missing.