Suicide in the World, 1985-2016

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Introduction

I have chosen the dataset “Suicide Rates Overview 1985 to 2016”; the source is: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

The dataset covers the years 1985 to 2016. It includes demographic variables: country, sex, population size, age group, and generation name. The economic variables are GDP for the year and GDP per capita. The suicide data consist of the raw number of suicides and the number of suicides per 100,000 population.

The dataset includes five character variables and seven numeric variables.

I have chosen the variable ‘suicides/100k pop’ (named suicides_100k_pop in the imported data) rather than ‘suicides_no’ for my analysis. ‘suicides_no’ is the raw number of reported suicides for an age/sex group in a given country and year. ‘suicides/100k pop’ is that number divided by the population of the same group and multiplied by 100,000. Because it is normalized by population size, it is better suited to comparisons, and I use it throughout the analysis.
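As a quick check of this definition, the sketch below recomputes the rate for the first three rows of the dataset (Albania, 1987). The helper name rate_per_100k is hypothetical and introduced only for illustration; it is not part of the dataset or of any package.

```r
# Hypothetical helper (an assumption for this sketch):
# suicides per 100,000 people in one age/sex group.
rate_per_100k <- function(suicides_no, population) {
  suicides_no / population * 100000
}

# First three rows of the dataset (Albania, 1987)
suicides_no <- c(21, 16, 14)
population  <- c(312900, 308000, 289700)

# Rounded to two decimals, as in the suicides_100k_pop column
rates <- round(rate_per_100k(suicides_no, population), 2)
rates
```

The recomputed values match the suicides_100k_pop column for those rows (6.71, 5.19, 4.83).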

getwd()

## [1] "C:/Users/small/Desktop/MCollege/2021/DATA110/week6"

setwd("C:/Users/small/Desktop/MCollege/2021/DATA110/Datasets")
suicide_world <- read_csv("Suicide_world.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double(),
##   suicides_100k_pop = col_double(),
##   country_year = col_character(),
##   `hdi for year` = col_double(),
##   gdp_for_year = col_number(),
##   gdp_per_capita = col_double(),
##   generation = col_character()
## )

suicide_world

## # A tibble: 27,820 x 12
##    country  year sex   age   suicides_no population suicides_100k_p~
##    <chr>   <dbl> <chr> <chr>       <dbl>      <dbl>            <dbl>
##  1 Albania  1987 male  15-2~          21     312900             6.71
##  2 Albania  1987 male  35-5~          16     308000             5.19
##  3 Albania  1987 fema~ 15-2~          14     289700             4.83
##  4 Albania  1987 male  75+ ~           1      21800             4.59
##  5 Albania  1987 male  25-3~           9     274300             3.28
##  6 Albania  1987 fema~ 75+ ~           1      35600             2.81
##  7 Albania  1987 fema~ 35-5~           6     278800             2.15
##  8 Albania  1987 fema~ 25-3~           4     257200             1.56
##  9 Albania  1987 male  55-7~           1     137500             0.73
## 10 Albania  1987 fema~ 5-14~           0     311000             0   
## # ... with 27,810 more rows, and 5 more variables: country_year <chr>, `hdi for
## #   year` <dbl>, gdp_for_year <dbl>, gdp_per_capita <dbl>, generation <chr>

Structure of the dataset and types of variables.

In this chunk I check the types of the variables and the structure of the dataset.

str(suicide_world)

## spec_tbl_df [27,820 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country          : chr [1:27820] "Albania" "Albania" "Albania" "Albania" ...
##  $ year             : num [1:27820] 1987 1987 1987 1987 1987 ...
##  $ sex              : chr [1:27820] "male" "male" "female" "male" ...
##  $ age              : chr [1:27820] "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
##  $ suicides_no      : num [1:27820] 21 16 14 1 9 1 6 4 1 0 ...
##  $ population       : num [1:27820] 312900 308000 289700 21800 274300 ...
##  $ suicides_100k_pop: num [1:27820] 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country_year     : chr [1:27820] "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ hdi for year     : num [1:27820] NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year     : num [1:27820] 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
##  $ gdp_per_capita   : num [1:27820] 796 796 796 796 796 796 796 796 796 796 ...
##  $ generation       : chr [1:27820] "Generation X" "Silent" "Generation X" "G.I. Generation" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double(),
##   ..   suicides_100k_pop = col_double(),
##   ..   country_year = col_character(),
##   ..   `hdi for year` = col_double(),
##   ..   gdp_for_year = col_number(),
##   ..   gdp_per_capita = col_double(),
##   ..   generation = col_character()
##   .. )

General world trend

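To start, I wanted an overall picture of suicide in the world. For that, I grouped the dataset by country and summed the suicides per 100K population over all years and age/sex groups. Because this is a sum of rates, countries that report more years accumulate larger totals, so it works as a rough ranking index rather than as a true rate.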

I also arranged the countries from the largest total to the smallest. The top ten countries are the Russian Federation, Lithuania, Hungary, Kazakhstan, Republic of Korea, Austria, Ukraine, Japan, Finland, and Belgium. Four of these ten countries are former Soviet republics.

# Sum the per-100k rates for each country and rank from highest to lowest
suicide_countries <- suicide_world %>%
  group_by(country) %>%
  summarise(total_suicide = sum(suicides_100k_pop)) %>%
  arrange(desc(total_suicide))
suicide_countries

## # A tibble: 101 x 2
##    country            total_suicide
##    <chr>                      <dbl>
##  1 Russian Federation        11305.
##  2 Lithuania                 10589.
##  3 Hungary                   10156.
##  4 Kazakhstan                 9520.
##  5 Republic of Korea          9350.
##  6 Austria                    9076.
##  7 Ukraine                    8932.
##  8 Japan                      8025.
##  9 Finland                    7924.
## 10 Belgium                    7900.
## # ... with 91 more rows
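One caveat about ranking by a sum of rates can be sketched with invented toy data: two countries with the same yearly rate get different totals if one of them reports more years. The data frame toy and its values are assumptions made up for this illustration; the mean is shown only as one possible alternative, not as the method used in this report.

```r
library(dplyr)

# Toy data (invented): both countries have a rate of 10 every year,
# but country B reports twice as many years as country A.
toy <- tibble(
  country = c("A", "A", "B", "B", "B", "B"),
  year    = c(2000, 2001, 2000, 2001, 2002, 2003),
  suicides_100k_pop = c(10, 10, 10, 10, 10, 10)
)

toy_summary <- toy %>%
  group_by(country) %>%
  summarise(
    total_suicide = sum(suicides_100k_pop),  # grows with the number of rows
    mean_suicide  = mean(suicides_100k_pop)  # does not depend on the number of rows
  )
toy_summary
```

Here the sum ranks B above A even though their rates are identical, while the mean treats them equally.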

I then plotted the general worldwide trend from 1985 to 2016 across all countries. I selected the variables year, country, and suicides per 100K population, grouped by year to see the change over time, and summed the rates. The plot shows that the total rises from 1985 to 1995, stays at about the same level until 2002, and then declines.

# Sum the per-100k rates for each year
suicide_year <- suicide_world %>%
  select(country, year, suicides_100k_pop) %>%
  group_by(year) %>%
  summarise(total_suicide = sum(suicides_100k_pop))
suicide_year

## # A tibble: 32 x 2
##     year total_suicide
##  * <dbl>         <dbl>
##  1  1985         6812.
##  2  1986         6580.
##  3  1987         7545.
##  4  1988         7473.
##  5  1989         8037.
##  6  1990         9879.
##  7  1991        10321.
##  8  1992        10529.
##  9  1993        10790.
## 10  1994        11484.
## # ... with 22 more rows

ggplot(data = suicide_year, aes(x = year, y = total_suicide)) +
  geom_line(color = 'red') +
  # shape 3 (a plus sign) has no fill, so it is styled with `color`
  geom_point(color = 'black', shape = 3) +
  labs(title = 'Rate of Suicide in the World', subtitle = 'From 1985 to 2016',
       x = 'Year', y = 'Total suicides') +
  theme_minimal()

Focus on the trends in each of the top 10 countries.

I created a new dataset to focus on the top ten countries with the highest rates of suicide.

# Keep only the ten countries with the highest totals
top_10_countries <- c('Russian Federation', 'Lithuania', 'Hungary', 'Kazakhstan',
                      'Republic of Korea', 'Austria', 'Ukraine', 'Japan',
                      'Finland', 'Belgium')
suicide_top_10 <- suicide_world %>%
  filter(country %in% top_10_countries)
suicide_top_10

## # A tibble: 3,390 x 12
##    country  year sex   age   suicides_no population suicides_100k_p~
##    <chr>   <dbl> <chr> <chr>       <dbl>      <dbl>            <dbl>
##  1 Austria  1985 male  75+ ~         152     156535            97.1 
##  2 Austria  1985 male  55-7~         355     584253            60.8 
##  3 Austria  1985 male  35-5~         515     940526            54.8 
##  4 Austria  1985 male  25-3~         232     548783            42.3 
##  5 Austria  1985 fema~ 75+ ~         110     339223            32.4 
##  6 Austria  1985 male  15-2~         207     653728            31.7 
##  7 Austria  1985 fema~ 55-7~         220     842978            26.1 
##  8 Austria  1985 fema~ 35-5~         186     936799            19.8 
##  9 Austria  1985 fema~ 25-3~          56     544765            10.3 
## 10 Austria  1985 fema~ 15-2~          50     633592             7.89
## # ... with 3,380 more rows, and 5 more variables: country_year <chr>, `hdi for
## #   year` <dbl>, gdp_for_year <dbl>, gdp_per_capita <dbl>, generation <chr>

To create the plots, I made a new data frame grouped by year, country, and GDP per capita, and summed the suicides per 100K population in each group.

suicide_10_by_year <- suicide_top_10 %>%
  group_by(year, country, gdp_per_capita) %>%
  summarise(total_suicide = sum(suicides_100k_pop))

## `summarise()` has grouped output by 'year', 'country'. You can override using the `.groups` argument.

suicide_10_by_year

## # A tibble: 283 x 4
## # Groups:   year, country [283]
##     year country           gdp_per_capita total_suicide
##    <dbl> <chr>                      <dbl>         <dbl>
##  1  1985 Austria                     9759          385.
##  2  1985 Belgium                     9356          332.
##  3  1985 Japan                      12401          300.
##  4  1985 Republic of Korea           2731          147.
##  5  1986 Austria                    13911          402.
##  6  1986 Belgium                    12992          315.
##  7  1986 Japan                      18288          324.
##  8  1986 Republic of Korea           3078          143.
##  9  1987 Austria                    17415          406.
## 10  1987 Belgium                    16165          328.
## # ... with 273 more rows

The first plot for the top ten countries shows the suicide trend by year. The curve for each country (except the Republic of Korea) follows the general trend for the whole world.

ggplot(data = suicide_10_by_year) +
  geom_line(aes(x = year, y = total_suicide, group = country, color = country)) +
    labs(title = 'Rate of Suicide in Top 10 Countries', subtitle = 'From 1985 to 2016', x = 'Year', y = 'Total suicide', color = 'Country') + theme_minimal()

Demographic aspects of suicide in the top 10 countries.

To see the distribution by sex, I grouped by sex and summed the suicides per 100K population. Across the top 10 countries, the male total is about 3.7 times the female total (73,018 vs. 19,760), so sex is strongly associated with suicide risk.

suicide_10_sex <- suicide_top_10 %>%
  group_by(sex) %>%
  summarise(total_by_gender = sum(suicides_100k_pop))
suicide_10_sex

## # A tibble: 2 x 2
##   sex    total_by_gender
## * <chr>            <dbl>
## 1 female          19760.
## 2 male            73018.
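The male-to-female ratio can be computed directly from the two totals in this table. This is a minimal sketch that copies the values as printed (rounded); the vector name total_by_gender simply mirrors the column name.

```r
# Totals copied from the table above (rounded as printed)
total_by_gender <- c(female = 19760, male = 73018)

# How many times larger the male total is than the female total
ratio <- total_by_gender[["male"]] / total_by_gender[["female"]]
round(ratio, 1)
```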

The box plot by sex shows that the distribution of male suicide rates lies clearly higher than the distribution of female rates.

plot1 <- suicide_top_10 %>%
  ggplot(aes(sex, suicides_100k_pop, fill = sex)) +
  ggtitle("The Distribution of Suicide by Gender") +
  xlab("Sex") +
  # the y axis shows the rate itself, not a count of observations
  ylab("Suicides per 100k population") +
  geom_boxplot() +
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male")) +
  theme_minimal()
plot1

To see the distribution by age, I grouped by age and summed the suicides per 100K population. Across the top 10 countries, the total rises with each older age group, so age is also associated with suicide risk.

suicide_10_age <- suicide_top_10 %>%
  group_by(age) %>%
  summarise(total_by_age = sum(suicides_100k_pop)) %>%
  arrange(desc(total_by_age))
suicide_10_age

## # A tibble: 6 x 2
##   age         total_by_age
##   <chr>              <dbl>
## 1 75+ years         29821.
## 2 55-74 years       20239.
## 3 35-54 years       19019.
## 4 25-34 years       14117.
## 5 15-24 years        9038.
## 6 5-14 years          545.
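A small sketch to confirm that the totals rise with every step up in age: the values are copied from the table above (rounded as printed) and listed from the youngest group to the oldest. The names total_by_age and increasing_with_age are used only in this illustration.

```r
# Totals copied from the table above, ordered from youngest to oldest
total_by_age <- c(
  "5-14 years"  = 545,
  "15-24 years" = 9038,
  "25-34 years" = 14117,
  "35-54 years" = 19019,
  "55-74 years" = 20239,
  "75+ years"   = 29821
)

# diff() gives the change between neighbouring age groups;
# all positive differences mean the total increases with age.
increasing_with_age <- all(diff(total_by_age) > 0)
increasing_with_age
```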

Correlation between the rate of suicide and economic factor.

I checked the correlation between two variables for the top 10 countries. The independent variable is GDP per capita, and the dependent variable is the summed suicide rate. There is a moderate negative correlation (about -0.56), which means that country-years with lower GDP per capita tend to have higher suicide totals. Correlation alone does not show that one causes the other.

cor(suicide_10_by_year$gdp_per_capita, suicide_10_by_year$total_suicide)

## [1] -0.5586889
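For reference, cor() returns the Pearson correlation coefficient by default, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on invented toy values; the numbers in gdp and total are assumptions for illustration and are not taken from the dataset.

```r
# Toy values (invented): the total falls as GDP per capita rises
gdp   <- c(1000, 5000, 10000, 20000, 40000)
total <- c(500, 420, 380, 300, 250)

# Pearson correlation computed by hand ...
r_manual <- cov(gdp, total) / (sd(gdp) * sd(total))

# ... and with the built-in function (method = "pearson" is the default)
r_builtin <- cor(gdp, total)

c(manual = r_manual, builtin = r_builtin)
```

Both values agree, and the negative sign reflects that one variable decreases as the other increases.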

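For the regression plot, I added a second layer (stat_smooth() with method = "lm") to draw the line of best fit and turned off the confidence band (se = FALSE). Because the formula is y ~ log(x), the fit is linear in the logarithm of GDP per capita and appears as a curve on the plot.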

ggplot(data = suicide_10_by_year, aes(x = gdp_per_capita, y = total_suicide)) +
  geom_point(aes(color = country)) +
  # least-squares fit of total_suicide on log(gdp_per_capita), no confidence band
  stat_smooth(method = "lm", formula = y ~ log(x), linetype = 5, color = 'red', se = FALSE) +
  labs(title = "Regression Between Suicide and GDP per Capita", subtitle = 'For top 10 countries, from 1985 to 2016', y = 'Total suicide', x = 'GDP per capita', color = 'Country') +
  theme_minimal()
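The model that the smoothing layer draws can also be fitted directly with lm() to inspect its coefficients. This is a minimal sketch on invented, noise-free toy data (the values 700 and -40 are assumptions chosen so the result is easy to verify); with the real data the same formula would be applied to suicide_10_by_year.

```r
# Toy data (invented): the total is an exact linear function of log(GDP)
gdp_per_capita <- c(1000, 5000, 10000, 20000, 40000)
total_suicide  <- 700 - 40 * log(gdp_per_capita)

# Same model form as formula = y ~ log(x) in the plot
fit <- lm(total_suicide ~ log(gdp_per_capita))

# Intercept and slope on the log scale
coefs <- unname(coef(fit))
coefs
```

A negative slope means the fitted total falls as GDP per capita grows, and the fall per extra dollar becomes smaller at higher GDP levels because of the logarithm.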

Conclusion:

The top ten countries with the largest suicide totals are the Russian Federation, Lithuania, Hungary, Kazakhstan, Republic of Korea, Austria, Ukraine, Japan, Finland, and Belgium. Four of these ten countries are from the former USSR. A follow-up to this analysis could examine mental health policies and economic factors in the former Soviet republics. My hypothesis is that the traumatic experience of the collapse of the USSR and the difficult transition to a market economy may be a major reason for the high suicide rates in these countries.
The plot shows that the trend rises from 1985 to 1995, stays at about the same level from 1995 to 2002, and then declines. To follow up on this analysis, it would be interesting to research what happened in the world (for instance, in the pharmaceutical industry) that may have lowered the suicide rate after 2002. Another way to explain the peak in 1995 and the decline after 2002 is to analyze the ‘generation’ variable. The G.I. (Greatest) Generation is the cohort of people born from 1901 to 1927. This generation lived through the Great Depression in the USA and two World Wars. By 1995 (the peak of suicide), people from the G.I. Generation were 68 to 94 years old. The oldest age groups have the highest rates of suicide. Their traumatic life experience might be a reason for the high level of suicide toward the end of their lives, once the primary obligation of raising their children had been fulfilled. Looking back at the period of decline in 2002 – 2016, the people who died by suicide were from the Silent and Baby Boomer Generations. These two generations have significantly different characteristics from the G.I. Generation, and there is probably a collective pattern of generational behavior at the macro level.
The plot of the suicide trend by country and year looks similar to the general trend (except for Korea). The countries rise by different degrees before 2012 and then decline in a very similar way. The top 10 include both developing and developed countries that differ significantly in political, social, demographic, and historical factors; however, suicide is decreasing in all of them. There is one unifying factor: the same generations are present in the period under review.
There are demographic trends: a) the summed suicide rate for men is about 3.7 times that for women; b) people in the 75+ age group are at the highest risk of suicide.
The correlation and regression analysis shows a moderate negative relationship between GDP per capita and the suicide total. I expected this result after seeing that four of the top 10 countries are from the former USSR. The standard of living in the former Soviet republics is, in general, lower than in developed countries. Thus, greater national wealth appears to be associated with a somewhat lower suicide rate.

Project1

Olga Tolchinsky

9 March, 2021