MATH 624: HW3 Linear Regression

Introduction

This R Markdown aims at predicting suicide rates or what indicates them through Linear Regression. It uses country level suicide rates data from the World Health Organization (WHO) and other country level indicators from The World Bank (WB) and The United Nations Development Program (UNDP), from 1985 to 2016. The data contains information on suicide rates per 100k population with respect to population age, sex, generation, country Gross Domestic Product (GDP), population size, and Human Development Index (HDI) country score.

# Download and clean the data
dl <- tempfile()
download.file("https://raw.githubusercontent.com/cordelljones1996/suicide/main/master.csv",
              dl)
suicide <- read.csv(dl, col.names = c("country", "year", "sex", "age", "suicides_no", 
        "population", "suicide_rate", "country.year", "HDI.for.year", "gdp_for_year", 
        "gdp_per_capita", "generation"))

Data Exploration

glimpse(data)

## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
##     envir = .GlobalEnv, overwrite = TRUE)

The dataset includes 12 variables: 1 target variable and 11 features. The target variable is suicide_rate, which is the suicide rate per 100k population. The features, respectively, are: country name, year, sex, age, population size, HDI score, GDP and GDP per capita and generation.

The Response/Target: `suicide_rate`

The Predictors/Features: `gdp.percapita($)`

The distribution of the suicide_rate variable seems to be extremely positively skewed to the right, with a spike at the first bin closest to the value 0. This shape suggests that the majority of the observations have a very small value of suicide rate with a small number of observations with very high values, causing a very long positive tail to the shape of the distribution. This severe skewness might suggest the use of a transformation. This topic will be discussed further later on in the report.

Exploring Suicide Rate Variability by Country and Continent

Suicide rate varies from one country to another. This is confirmed by the following bar chart. For easier interpretation, the figure shows the top 25 countries in terms of suicide rates. Lithuania is universally the top country in terms of suicide at a little over 40 suicides per 100k population. This rate is extremely high especially that Lithuania is not a big country like Russia, for example, which comes second after Lithuania in the ranking. In fact, the average population size in Lithuania between 1985 and 2016 is 40.4155725 compared to 34.8923765. This disproportion in the case of Lithuania induces further exploration of the country population size.

A continent variable is not present in the dataset but it can easily be added using the countrycode package.

# Add continent variable
suicide$continent <- countrycode(sourcevar = suicide[, "country"],
                                 origin = "country.name",
                                 destination = "continent")

The distribution of suicide rate varies from one continent to another. The highest suicide rates are in Europe and the distribution is severely skewed to the right. We see similar skeweness in Asia and the Americas as well with the presense of a few extreme values in each continent.

Suicide Rate Variability by Population Size

There seems to be some sort of positive linear association between the country population size and its corresponding suicide rate. Lithuania and the United States can be considered as two outliers in 2 opposite directions in the data. Presence of such extreme values affects the prediction process.

## Warning: Width not defined. Set with `position_dodge(width = ?)`

It seems that not all years included in the dataset have the same number of observations. The following plot shows that the year 2016 has the least number of observations. For this reason, year 2016 will be excluded later in the analysis

Suicide rates vary from one year to another. The following timeplot shows that before 1995, there was a global ascending trend of suicide. The opposite is true after 1995, as we see suicide rates decrease in a downward trend.

Effect of GDP on Suicide Rates

The next plot shows suicide rates plotted against per capita GDP of the countries in the dataset. It seems that there is a positive linear correlation between suicide rates per 100k population and the country’s GDP per capita. There exists some outlier values in the case of Lithuania, where we have relatively small per capita GDP and very high suicide rate (the highest as seen earlier).

Population Characteristics Variability

Now after taking an overview on suicide variability by several country and nation level variables, we move to demographic variables that characterise the populations of these countries. First, we look at suicide variability by age goup. The following plot shows that suicide rates distribution varies from one age group to another. The highest suicide rates are found in individuals aged 75+ and the lowest suicide rates are found in individuals aged between 5-14.

When looking at suicide rates by sex, the data shows that suicide is more prevalent among males than females, with universal suicide rate of almost 21 versus about 6 suicides per 100k male population.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Suicide rate trends across time varies by sex as well. The following plot shows that the suicide rates for females exhibit a universal descending trend across time while the trend for males fluctuates around 1995 which, resembles the trend we saw before.

Method and Analysis

From the exploratory analysis performed above, two main observations emerge:

Lithuania was found to have an extreme value for suicide rates per 100k population.
The year 2016 included the least number of observations.
The target variable, suicide_rate, is severely positively skewed

Due to these factors, predicting suicide rates can be highly jeopardized if the data is used as is. Several model attempts are made and performance of each attempt is measured.

Specifying the Model

Since the target variable, suicide_rate is a continuous variable, multiple linear regression algorithm is considered. A regression model is fitted that takes into account country, population size, per capita CGP, year, sex and age group effects on suicide rates per 100k population. The following model is considers:

$log(suicide\_rate_{c,p,g,y,s,a}) = \mu + \beta_c + \beta_{ct} + \beta_p + \beta_g + \beta_y + \beta_s + \beta_a + \epsilon_{c,p,g,y,s,a}$

Where $\beta_c$ is the country effect, $\beta_{ct}$ is the continent effect, $\beta_p$ is the population effect, $\beta_g$ is the per capita GDP effect, $\beta_y$ is the year effect, $\beta_s$ is the sex effect, $\beta_a$ is the age group effect and $\epsilon$ is the model error term.

In addition to linear regression, Random Forests algorithm is attempted as well. Performance of the two models is evaluated.

# Specify explanatory and outcome variables and model formula

vars = c("continent", "population", "country", "sex", "year", "age", "gdp_per_capita")
outcome = "suicide_rate_log"
(fmla = as.formula(paste(outcome, "~", paste(vars, collapse = " + "))))

## suicide_rate_log ~ continent + population + country + sex + year + 
##     age + gdp_per_capita

Prepping the Data for the Model

Since the target variable is far from normality, a transformation is considered in order to scale the distribution to symmetry. Many transformations are discussed in the Staitstical literature for the purpose of transforming a skewed variable to a symmetric one, however, not all of them are suitable for the suicide_rate variable. The natural $log$ transformation produces infinte values since suicide_rates include zero values. For this reason, the value 1 is added to the variable prior to appling the natural $log$ transformation.

# Variable transformation

suicide <- suicide %>%
  mutate(suicide_rate_log=log(1+suicide_rate))

Create Training and Testing sets

Training and testing datasets are created. Testing set is 20% of the entire dataset.

# Split to training and testing datasets

set.seed(1, sample.kind="Rounding")

## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

test_index = createDataPartition(y = suicide$suicide_rate_log, times = 1, 
            p = 0.2, list = FALSE)
train = suicide[-test_index,]
test = suicide[test_index,]

Train the Model

Linear Regression

A linear regression model is fitted using the train data using the lm function

# Linear regression

lm1 <- train %>% 
  lm(fmla, data=.)

Random Forests

Fitted using the ranger package

# Random forests

set.seed(1, sample.kind="Rounding")

## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

rf1 = ranger(fmla, # formula 
       train, # data
       num.trees = 500, 
       respect.unordered.factors = "order",
       seed = 1)

Test the Model

After this, the model is applied to the test data. This is done through the predict function. To test the performance of the model, the Root Mean Squared Errors (RMSE) is considered. After generating the model predictions, the RMSE is calculated by comparing the model predictions against the true value of the suicide rates.

The model with the lowest RMSE in this case is random forests.

Other models are attempted with the following changes in each attempt:

Eliminate the year 2016
Eliminate Lithuania

Training and testing data sets are generated again to accommodate the above changes for each case and the models are refitted. Comparison between models performance is done through RMSE values.

Results and Conclusions

The suicide rates per 100k populations prediction algorithm includes several country-level variables in addition to population demographic characteristics. The model used for prediction is a multiple linear regression model fitted to the training data and tested on the testing data. Random Forests model is also attempted at number of trees of 500. Models are fitted based on 2 different cases. Assessment of the perfomance is based on the value of RMSE.

The following table summarizes the performance of each model at the 2 specified cases:

case1

## # A tibble: 2 × 2
##   model  rmse
##   <chr> <dbl>
## 1 lm    12.4 
## 2 rf     6.56

case2

## # A tibble: 2 × 2
##   model  rmse
##   <chr> <dbl>
## 1 lm    12.5 
## 2 rf     7.20

The Random Forests model yielded the least value of RMSE and is considered the best model for predicting suicide rates per 100k population. Eliminating data of the year 2016 and of Lithuania did not improve the value of RMSE.

Linear Regression & Random Forests Models vs Actual Visualization

The following graph shows predictions from the linear regression model and the random forests model compared to the true values of suicide rates in the test dataset for each year. The plot shows that annual predictions generated by random forests model are closer to the true value of suicide rates than those predicted by the linear regression model.

test %>% mutate(lm=exp(lm), rf=exp(rf)) %>%
  gather(key=valuetype, value=rate, suicide_rate, lm, rf) %>%
  mutate(suicides=rate*population/100000) %>%
  group_by(year, valuetype) %>%
  mutate(rate_year=sum(suicides)*100000/sum(population)) %>%
  ggplot(aes(year, rate_year, col=valuetype)) +
  geom_line() +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1985, 2016, 2)) +
  theme(legend.position = 'bottom', axis.text.x = element_text(angle = 45))

Additional modeling attempts were made. They are based on the concept of regularization; to penalize the least squares estimates using the parameter lambda to optimize for lambda that minimizes the RMSE. The attempt yielded value of zero for lambda: suggesting that no penalty for the least squares estimates of the model would further enhance the model performance. The attempt is not discussed in the report but it is included in the R script file.

There are other variables that exist in the dataset but not included in the analysis and the modeling. These variables are:

gdp_for_year, which is the GDP of the country at a given year. It is eliminated as it is highly correlated with gdp_per_capita to eliminate multicollinearity in the model. gdp_per_capita was selected over gdp_for_year as it is a better measure for the GDP and wealth of the nations that takes into account population size.
generation, which is a categorical variable for the generation of the population. It is left out as it is highly correlated with age. age is selected over generation to include in the analysis as it more easily and intuitively understood.
HDI.for.year, as over two thirds of the variable is missing.

MATH 624: HW3 Linear Regression

Suicide Rates EDA & Prediction

Cordell Jones

13 September 2021

Introduction

Data Exploration

The Response/Target: `suicide_rate`

The Predictors/Features: `gdp.percapita($)`

Exploring Suicide Rate Variability by Country and Continent

Suicide Rate Variability by Population Size

Effect of GDP on Suicide Rates

Population Characteristics Variability

Method and Analysis

Specifying the Model

Prepping the Data for the Model

Create Training and Testing sets

Train the Model

Linear Regression

Random Forests

Test the Model

Results and Conclusions

Linear Regression & Random Forests Models vs Actual Visualization

MATH 624: HW3 Linear Regression

Suicide Rates EDA & Prediction

Cordell Jones

13 September 2021

Introduction

Data Exploration

The Response/Target: suicide_rate

The Predictors/Features: gdp.percapita($)

Exploring Suicide Rate Variability by Country and Continent

Suicide Rate Variability by Population Size

Effect of GDP on Suicide Rates

Population Characteristics Variability

Method and Analysis

Specifying the Model

Prepping the Data for the Model

Create Training and Testing sets

Train the Model

Linear Regression

Random Forests

Test the Model

Results and Conclusions

Linear Regression & Random Forests Models vs Actual Visualization

The Response/Target: `suicide_rate`

The Predictors/Features: `gdp.percapita($)`