DATA 110, Project # 3

Introduction

I have chosen the dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

This dataset includes information about the suicide rates in each country, together with socio-economic data, from 1985 until 2016. It was compiled from four other datasets linked by time and place (see References below).

Suicide is a global social issue that is correlated with various factors. According to the World Health Organization, nearly 800,000 people die from suicide every year. For every completed suicide, there are many more people who attempt suicide. Suicide is the third leading cause of death in 15-19-year-olds. 79% of global suicides occur in low- and middle-income countries. Ingestion of pesticides, hanging, and firearms are among the most common methods of suicide globally.

This project continues my earlier work with the same dataset. I decided to keep analyzing global suicide rates because my previous research showed that the highest number of suicides occurred in the countries of the former Soviet Union, in the period immediately following the collapse of the USSR. I lived in that region at the time, and I remember the country’s disintegration as one of the most severe social catastrophes, one that had a tremendous impact on the lives of all Soviet people. Since I am from Russia (a former USSR republic), I have a strong personal connection to the findings in this project.

I decided to continue working with this dataset and research the following questions:

What do the suicide trends look like as a function of age, gender, and geographical location?
Which continents and countries have the highest suicide rate?
What is the relationship between the economic factors and the suicide rates?

This dataset includes 12 variables: country, year, sex, age group, number of suicides, population, number of suicides per 100k of population, country-year (country and year combined), Human Development Index (HDI) for the year, GDP for the year, GDP per capita, and generation.

Load libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(countrycode)

## Warning: package 'countrycode' was built under R version 4.0.5

library(tinytex)

## Warning: package 'tinytex' was built under R version 4.0.5

library(dplyr)
library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 4.0.5

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(tidyr)
library(plyr)

## Warning: package 'plyr' was built under R version 4.0.5

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:plotly':
## 
##     arrange, mutate, rename, summarise

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(highcharter)

## Warning: package 'highcharter' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)
library(treemap)
library(dbplyr)

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.0.5

Set working directory and read the csv file

getwd() #check the working directory

## [1] "C:/Users/small/Desktop/MCollege/2021/DATA110/Project3"

setwd("C:/Users/small/Desktop/MCollege/2021/DATA110/Project3") #set the working directory
data <- read_csv("master.csv") #read the file

## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double(),
##   suicides_100k_pop = col_double(),
##   country_year = col_character(),
##   hdi_for_year = col_double(),
##   gdp_for_year = col_number(),
##   gdp_per_capita = col_double(),
##   generation = col_character()
## )

head(data) #look at the variables

## # A tibble: 6 x 12
##   country  year sex   age   suicides_no population suicides_100k_p~ country_year
##   <chr>   <dbl> <chr> <chr>       <dbl>      <dbl>            <dbl> <chr>       
## 1 Albania  1987 male  15-2~          21     312900             6.71 Albania1987 
## 2 Albania  1987 male  35-5~          16     308000             5.19 Albania1987 
## 3 Albania  1987 fema~ 15-2~          14     289700             4.83 Albania1987 
## 4 Albania  1987 male  75+ ~           1      21800             4.59 Albania1987 
## 5 Albania  1987 male  25-3~           9     274300             3.28 Albania1987 
## 6 Albania  1987 fema~ 75+ ~           1      35600             2.81 Albania1987 
## # ... with 4 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## #   gdp_per_capita <dbl>, generation <chr>
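
As a quick sanity check, the suicides_100k_pop column can be reproduced from suicides_no and population. The sketch below uses the values from the first row printed above (Albania, 1987, males aged 15-24); the variable names are only illustrative and are not part of the original analysis.

```r
# Values copied from the first row of head(data): Albania, 1987, male, 15-24 years
suicides_no <- 21
population  <- 312900

# Suicides per 100,000 people in this group
rate_per_100k <- suicides_no / population * 100000
round(rate_per_100k, 2) # 6.71, the value shown in suicides_100k_pop
```

The same formula is used later in this report whenever a rate is recomputed from summed counts.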

Structure of the dataset and types of variables

First, I examine the structure of the data object. The str() function reports the number of rows (observations) and columns (variables), along with the name and class of each column and its first few values. The dataset contains data from 101 countries: 27,820 observations of 12 variables.

str(data)

## spec_tbl_df [27,820 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country          : chr [1:27820] "Albania" "Albania" "Albania" "Albania" ...
##  $ year             : num [1:27820] 1987 1987 1987 1987 1987 ...
##  $ sex              : chr [1:27820] "male" "male" "female" "male" ...
##  $ age              : chr [1:27820] "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
##  $ suicides_no      : num [1:27820] 21 16 14 1 9 1 6 4 1 0 ...
##  $ population       : num [1:27820] 312900 308000 289700 21800 274300 ...
##  $ suicides_100k_pop: num [1:27820] 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country_year     : chr [1:27820] "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ hdi_for_year     : num [1:27820] NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year     : num [1:27820] 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
##  $ gdp_per_capita   : num [1:27820] 796 796 796 796 796 796 796 796 796 796 ...
##  $ generation       : chr [1:27820] "Generation X" "Silent" "Generation X" "G.I. Generation" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double(),
##   ..   suicides_100k_pop = col_double(),
##   ..   country_year = col_character(),
##   ..   hdi_for_year = col_double(),
##   ..   gdp_for_year = col_number(),
##   ..   gdp_per_capita = col_double(),
##   ..   generation = col_character()
##   .. )

The summary() function shows a set of descriptive statistics for every variable, depending on its type. For the numerical variables of this dataset (year, suicides_no, population, suicides_100k_pop, hdi_for_year, gdp_for_year, and gdp_per_capita), it gives the minimum, maximum, quartiles, median, and mean; for the character variables, it gives only the length, class, and mode. It also shows that hdi_for_year is missing for 19,456 of the 27,820 observations.

summary(data)

##    country               year          sex                age           
##  Length:27820       Min.   :1985   Length:27820       Length:27820      
##  Class :character   1st Qu.:1995   Class :character   Class :character  
##  Mode  :character   Median :2002   Mode  :character   Mode  :character  
##                     Mean   :2001                                        
##                     3rd Qu.:2008                                        
##                     Max.   :2016                                        
##                                                                         
##   suicides_no        population       suicides_100k_pop country_year      
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00    Length:27820      
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92    Class :character  
##  Median :   25.0   Median :  430150   Median :  5.99    Mode  :character  
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82                      
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62                      
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97                      
##                                                                           
##   hdi_for_year    gdp_for_year       gdp_per_capita    generation       
##  Min.   :0.483   Min.   :4.692e+07   Min.   :   251   Length:27820      
##  1st Qu.:0.713   1st Qu.:8.985e+09   1st Qu.:  3447   Class :character  
##  Median :0.779   Median :4.811e+10   Median :  9372   Mode  :character  
##  Mean   :0.777   Mean   :4.456e+11   Mean   : 16866                     
##  3rd Qu.:0.855   3rd Qu.:2.602e+11   3rd Qu.: 24874                     
##  Max.   :0.944   Max.   :1.812e+13   Max.   :126352                     
##  NA's   :19456
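
The NA count for hdi_for_year is easier to judge as a share of all rows. The sketch below only uses the two counts printed above; the variable names are illustrative.

```r
# Counts taken from the str() and summary() output above
n_rows   <- 27820  # total observations
n_hdi_na <- 19456  # missing values in hdi_for_year

# Percentage of rows without an HDI value
pct_missing <- 100 * n_hdi_na / n_rows
round(pct_missing, 1) # 69.9

# On the real data the same share could be computed as
# mean(is.na(data$hdi_for_year)) * 100
```

Roughly seven out of ten rows have no HDI value, which is worth keeping in mind before using that variable.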

Global world trends in 1985 - 2016

To begin, I wanted to see how suicides changed from year to year across all countries. I created a new variable, sum_suicides, that adds up suicides_100k_pop over every country, sex, and age group within each year, and stored it in a new dataset.

suicides_by_years <- data %>% #new dataset
   dplyr::group_by(year) %>% #group by year
  dplyr::mutate(sum_suicides = sum(suicides_100k_pop)) #summarize suicides per 100k population over the years
head(suicides_by_years)

## # A tibble: 6 x 13
## # Groups:   year [1]
##   country  year sex   age   suicides_no population suicides_100k_p~ country_year
##   <chr>   <dbl> <chr> <chr>       <dbl>      <dbl>            <dbl> <chr>       
## 1 Albania  1987 male  15-2~          21     312900             6.71 Albania1987 
## 2 Albania  1987 male  35-5~          16     308000             5.19 Albania1987 
## 3 Albania  1987 fema~ 15-2~          14     289700             4.83 Albania1987 
## 4 Albania  1987 male  75+ ~           1      21800             4.59 Albania1987 
## 5 Albania  1987 male  25-3~           9     274300             3.28 Albania1987 
## 6 Albania  1987 fema~ 75+ ~           1      35600             2.81 Albania1987 
## # ... with 5 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## #   gdp_per_capita <dbl>, generation <chr>, sum_suicides <dbl>

I looked at the descriptive statistics of the new variable “sum_suicides”, which serves as a yearly index of suicides in the world. Because it adds up the per-100k rates of every country, sex, and age group, it is an aggregate index rather than a rate for a single population:

The mean yearly value across the 101 countries from 1985 to 2016 is 11,142.
The smallest yearly value across the 101 countries from 1985 to 2016 is 2,147.
The highest yearly value across the 101 countries from 1985 to 2016 is 14,660.

summary_by_years <- suicides_by_years %>%
  dplyr::group_by(year) %>% #group by year
  dplyr::summarise(sum_suicides = sum(suicides_100k_pop)) %>% #summarize the suicides per 100k population over the years
  arrange(-sum_suicides) #sort rows in descending order

summary(summary_by_years$sum_suicides) #statistical summary of the total suicides per 100k population over the years

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2147   10200   11664   11142   13675   14660
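
To make clear what a sum of per-100k rates measures, the sketch below compares it with a pooled rate for two rows taken from head(data) (Albania, 1987, males aged 15-24 and 35-54). This comparison is not part of the original analysis; it is only meant to show why the yearly sums are so much larger than any single rate.

```r
# Two rows from head(data): Albania, 1987, males aged 15-24 and 35-54
suicides   <- c(21, 16)
population <- c(312900, 308000)

# Per-group rates per 100,000, as stored in suicides_100k_pop
group_rates <- suicides / population * 100000

# Adding the group rates gives an index that grows with the number of groups
sum_of_rates <- sum(group_rates)

# Pooling the counts first gives a rate for the combined population
pooled_rate <- sum(suicides) / sum(population) * 100000

round(c(sum_of_rates = sum_of_rates, pooled_rate = pooled_rate), 2)
```

The sum of the two rates is about twice the pooled rate, so sum_suicides is best read as a relative index for comparing years.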

The largest values occurred from 1994 to 2005, with the peak in 1995. Because the dataset ends in 2016, we can assume that the data for 2016 were incomplete. Even though the 2016 data are incomplete, the number of suicides clearly decreases starting from 2010.

suicides_by_years_line <- ggplot(summary_by_years, aes(x = year, y = sum_suicides)) + # one row per year: year vs. yearly sum of suicides per 100k
  geom_line(color = "red", size = 0.8) + # set the color and size of the line
  geom_point() +
  geom_hline(yintercept = mean(summary_by_years$sum_suicides), linetype = "dashed", color = "black", size = 0.5) + # dashed line at the mean of the yearly values
  annotate("text", x = 1987, y = mean(summary_by_years$sum_suicides), label = "Mean Line", vjust = -0.5, color = "red") + # label placed just above the mean line
  labs(x = "Year", y = "Number of Suicides", title = "Time Graph of the Suicide Rate in the World") + # set the title and the names of the axes
  theme_minimal() # set the type of graph theme
fig1 <- ggplotly(suicides_by_years_line) # apply the "plotly" function for interactivity in the graph
fig1

The following plot uses geom_bar() with stat = "identity", so the height of each bar is taken from the y values rather than from a count of cases. It gives another view of the global trend during 1985 - 2016.

summary_by_years %>% # one row per year, so each bar shows a single yearly value
  ggplot(aes(x = year, y = sum_suicides)) + # plot by years and the yearly sum of suicides per 100k
  geom_bar(stat = "identity", width = 0.7, aes(fill = sum_suicides)) + # set the width of the bars
  scale_fill_gradient(name = "Total suicide", low = "yellow", high = "red") + #set the name of the legend and the colors of the bars
  labs(title = "Distribution of the Suicide Rate in the World",
       subtitle = "Trend over time, 1985 - 2016",
       x = "Year",
       y = "Frequency") + #set the title, subtitle, and the names of the axes
  theme_minimal() + #set the type of theme
  coord_flip() #flip the plot 90 degrees

Global distribution of the suicide rates by gender

The next step is to look at the distribution of the suicide rates by gender. To visualize the gender trends, I created a new dataset with the rate of suicides per 100K of the population.

data_sex <- data %>% 
  dplyr::select(suicides_no, population, sex, year) %>% #select 4 variables
  dplyr::mutate(rate = (suicides_no / population) * 100000) %>% #create new variable with rate of suicides per 100K of the population
  group_by(sex, year) %>% #group by sex and year
  dplyr::summarize(avg_rate = mean(rate)) #summarized mean of the rate

## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

data_sex

## # A tibble: 64 x 3
## # Groups:   sex [2]
##    sex     year avg_rate
##    <chr>  <dbl>    <dbl>
##  1 female  1985     5.78
##  2 female  1986     5.81
##  3 female  1987     5.73
##  4 female  1988     6.32
##  5 female  1989     6.18
##  6 female  1990     5.91
##  7 female  1991     6.03
##  8 female  1992     6.25
##  9 female  1993     6.03
## 10 female  1994     6.24
## # ... with 54 more rows

The following plot visualizes the distribution of suicide rates by gender. We can observe a significant difference between the male and female distributions. The chart of male suicides has a peak around 1990 - 2014. The chart of female suicides does not have any peaks, and it shows a lower suicide rate than the chart for males.

plot_sex <- data_sex %>%
  ggplot(mapping = aes(x = year, y = avg_rate, col = sex)) + #plot the average suicide rate by sex
  geom_line() +
  geom_point() +
  labs(title = "Rate of the Total Suicides by Gender",
       x = "Year",
       y = "Suicides per 100k",
       col = "Sex") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") #set color palette from the RColorBrewer
fig2 <- ggplotly(plot_sex)
fig2

The following plot shows the distribution of the suicide rate by a combination of two factors: gender and generation. As we can see, for men, the highest rates occurred in the GI and Silent Generations. For women, the highest rate occurred in the GI Generation.

American Generations Timeline:

GI Generation, born 1901-1924 *They were teenagers during the Great Depression and fought in World War II. Sometimes called the greatest generation (following a book by journalist Tom Brokaw) or the swing generation because of their jazz music.
Silent Generation, born 1925-1942 *They were too young to see action in World War II and too old to participate in the fun of the Summer of Love. This label describes their conformist tendencies and belief that following the rules was a sure ticket to success.
Baby Boomers, born 1943-1964 *The boomers were born during an economic and baby boom following World War II. These hippie kids protested against the Vietnam War and participated in the civil rights movement, all with rock ‘n’ roll music blaring in the background.
Generation X, born 1965-1979 *They were originally called the baby busters because fertility rates fell after the boomers. As teenagers, they experienced the AIDS epidemic and the fall of the Berlin Wall. Sometimes called the MTV Generation, the “X” in their name refers to this generation’s desire not to be defined.
Millennials, born 1980-2000 *They experienced the rise of the Internet, Sept. 11 and the wars that followed. Sometimes called Generation Y. Because of their dependence on technology, they are said to be entitled and narcissistic.
Generation Z, born 2001-2013 *These kids were the first born with the Internet and are suspected to be the most individualistic and technology-dependent generation. Sometimes referred to as the iGeneration.

suicides_by_sex_1 <- data %>%
  dplyr::group_by(sex, generation) %>% #group by sex and generation
  dplyr::summarize(population = sum(population), 
            suicides = sum(suicides_no), 
            suicides_per_100k = (suicides / population) * 100000)

## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

ggplot(suicides_by_sex_1, aes(sex, suicides_per_100k, fill = generation)) + #plot the distribution by gender and generation
  geom_bar(stat = "identity") +
  labs(title = "Rate of the Total Suicides by Gender and Generations",
       x = "Gender",
       y = "Suicides per 100k",
       fill = "Generation") + #the legend belongs to the fill aesthetic
  theme_minimal() +
  scale_fill_brewer(palette = "Dark2") #set color palette from the RColorBrewer

Global distribution of the suicide rates by age

To visualize trends by age, I created a new dataset.

data_age <- data %>% 
  dplyr::select(suicides_no, population, age, year) %>% #select 4 variables
  dplyr::mutate(rate = (suicides_no / population) * 100000) %>% #create new variable as the rate of number suicides by number population
  group_by(age, year) %>% #group by age and year
  dplyr::summarize(avg_rate = mean(rate)) #summarize mean of the rate

## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.

data_age

## # A tibble: 191 x 3
## # Groups:   age [6]
##    age          year avg_rate
##    <chr>       <dbl>    <dbl>
##  1 15-24 years  1985     8.43
##  2 15-24 years  1986     8.15
##  3 15-24 years  1987     7.49
##  4 15-24 years  1988     8.75
##  5 15-24 years  1989     9.16
##  6 15-24 years  1990     9.12
##  7 15-24 years  1991     9.91
##  8 15-24 years  1992     9.19
##  9 15-24 years  1993     9.33
## 10 15-24 years  1994     9.49
## # ... with 181 more rows

The following plot confirms the findings reported in the previous bar chart that the highest number of suicides occurred in the GI and Silent Generations (age 55+).

plot_age <- data_age %>%
  ggplot(mapping = aes(x = year, y = avg_rate, col = age)) + #plot the average suicide rate by age
  geom_line() +
  geom_point() +
  labs(title = "Rate of the Total Suicides by Age",
       x = "Year",
       y = "Suicides per 100k",
       col = "Age") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") # set color palette
fig3 <- ggplotly(plot_age)
fig3

Global distribution by continents and countries

After looking at the demographic factors such as gender and age, I wanted to visualize the geographical distribution of the number of suicides. For this task, I created a new dataset with the new variable called “continent”. The new variable has all the countries grouped by the continents.

data_continents <- data %>%
  dplyr::mutate(continent = countrycode(sourcevar = country, origin = "country.name", destination = "continent")) #convert each country name into its continent
head(data_continents)

## # A tibble: 6 x 13
##   country  year sex   age   suicides_no population suicides_100k_p~ country_year
##   <chr>   <dbl> <chr> <chr>       <dbl>      <dbl>            <dbl> <chr>       
## 1 Albania  1987 male  15-2~          21     312900             6.71 Albania1987 
## 2 Albania  1987 male  35-5~          16     308000             5.19 Albania1987 
## 3 Albania  1987 fema~ 15-2~          14     289700             4.83 Albania1987 
## 4 Albania  1987 male  75+ ~           1      21800             4.59 Albania1987 
## 5 Albania  1987 male  25-3~           9     274300             3.28 Albania1987 
## 6 Albania  1987 fema~ 75+ ~           1      35600             2.81 Albania1987 
## # ... with 5 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## #   gdp_per_capita <dbl>, generation <chr>, continent <chr>
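
The idea behind the conversion is a simple lookup from country name to continent. The sketch below imitates it in base R with a small hand-written table; continent_lookup and countries are hypothetical names, and the report itself relies on countrycode() rather than on this table.

```r
# Hypothetical hand-written lookup; the report itself uses countrycode()
continent_lookup <- c(
  "Albania"            = "Europe",
  "Japan"              = "Asia",
  "Uruguay"            = "Americas",
  "Russian Federation" = "Europe"
)

# A few country values, repeated the way they are in the long-format data
countries <- c("Japan", "Albania", "Albania", "Uruguay")

# Index the named vector by country name to get one continent per row
continents <- unname(continent_lookup[countries])
continents # "Asia" "Europe" "Europe" "Americas"
```

A name that is missing from the lookup would come back as NA, so it is worth checking for NA values in the new column after any such conversion.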

The bar plot shows the distribution of the total suicides by continent. As we can see, the highest number of suicides was in Europe. The next region with the highest number of suicides was the Americas.

plot_continents <- ggplot(data_continents, aes(year, suicides_100k_pop, fill = continent)) +
  geom_bar(stat = "identity") + #stacked bars of the suicide rates by continent
  labs(title = "Distribution of the Suicide Rates by Continents",
       subtitle = "Trend over time, 1985 - 2016",
       x = "Year",
       y = "Rates of Suicide") +
  theme_minimal() +
  scale_fill_brewer(palette = "Dark2")
fig4 <- ggplotly(plot_continents)
fig4

Here is another visualization of the distribution by country. The treemap visually displays the total number of suicides in each country by varying the size of the square representing the country. As we can see, the Russian Federation has the highest number of suicides in the world.

treemap(data_continents, index="country", vSize="suicides_100k_pop", #set size
        vColor="suicides_100k_pop", #set color
        type="value",
        title="Treemap of Suicide Rates by Countries", palette="Spectral") #set color palette from the RColorBrewer

There is one more visualization of how the number of suicides is distributed by country. The colors represent continents, so we can observe that the countries with the highest number of suicides are mostly in Europe.

plot_countries <- data_continents %>%
 dplyr::group_by(country, continent) %>% #group by countries and continents
dplyr::summarize(n = n(), suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
  arrange(desc(suicide_per_100k)) #sort rows in descending order

## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

plot_countries$country <- factor(plot_countries$country, ordered = T, levels = rev(plot_countries$country))

ggplot(plot_countries, aes(x = country, y = suicide_per_100k, fill = continent)) + 
  geom_bar(stat = "identity") + 
  
  labs(title = "Distribution of the Suicide Rates by Countries",
       x = "Country", 
       y = "Suicide Rates", 
       fill = "Continent") +
  coord_flip() + #flip the plot 90 degrees
  scale_y_continuous(breaks = seq(0, 45, 5)) + #scale y-axis
  theme_minimal() +
  theme(legend.position = "bottom") + #set position of the legend
  scale_fill_brewer(palette = "Dark2") #set color palette from the RColorBrewer
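
The factor() call before the plot controls the order of the bars. The sketch below shows the effect on three countries, with rates rounded from this report's country-level results; the data frame name rates is illustrative.

```r
# Toy data, already sorted from the highest to the lowest rate
rates <- data.frame(
  country = c("Lithuania", "Russian Federation", "Sri Lanka"),
  suicide_per_100k = c(41.2, 32.8, 30.5),
  stringsAsFactors = FALSE
)

# Reversing the level order puts the highest rate last;
# coord_flip() draws the last level at the top of the chart
rates$country <- factor(rates$country, ordered = TRUE, levels = rev(rates$country))

levels(rates$country) # "Sri Lanka" "Russian Federation" "Lithuania"
```

Without the reordering, ggplot2 would sort the countries alphabetically and the ranking would be hard to read.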

As we can see in the table below, seven former Soviet Union republics are among the top 20 countries when the suicide rates per 100k are summed over all years and groups.

top_20_countries <- data %>%
dplyr::group_by(country) %>%
dplyr::summarise(total_suicide = sum(suicides_100k_pop))%>%
arrange(desc(total_suicide)) %>%
slice(1:20)
top_20_countries

## # A tibble: 20 x 2
##    country            total_suicide
##    <chr>                      <dbl>
##  1 Russian Federation        11305.
##  2 Lithuania                 10589.
##  3 Hungary                   10156.
##  4 Kazakhstan                 9520.
##  5 Republic of Korea          9350.
##  6 Austria                    9076.
##  7 Ukraine                    8932.
##  8 Japan                      8025.
##  9 Finland                    7924.
## 10 Belgium                    7900.
## 11 Belarus                    7831.
## 12 France                     7803.
## 13 Latvia                     7373.
## 14 Suriname                   7162.
## 15 Bulgaria                   7016.
## 16 Slovenia                   7013.
## 17 Estonia                    6874.
## 18 Guyana                     6656.
## 19 Uruguay                    6539.
## 20 Singapore                  6341.
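
The count of former Soviet republics can be checked directly against the table above. In the sketch below, top_20 is copied from that table, while former_ussr is a list I typed in by hand; the spellings of countries that do not appear in the table (for example "Republic of Moldova") are assumptions about how the dataset names them.

```r
# Countries copied from the top_20_countries table above
top_20 <- c("Russian Federation", "Lithuania", "Hungary", "Kazakhstan",
            "Republic of Korea", "Austria", "Ukraine", "Japan", "Finland",
            "Belgium", "Belarus", "France", "Latvia", "Suriname", "Bulgaria",
            "Slovenia", "Estonia", "Guyana", "Uruguay", "Singapore")

# The 15 former Soviet republics (spellings partly assumed, see above)
former_ussr <- c("Armenia", "Azerbaijan", "Belarus", "Estonia", "Georgia",
                 "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania",
                 "Republic of Moldova", "Russian Federation", "Tajikistan",
                 "Turkmenistan", "Ukraine", "Uzbekistan")

# Keep the top-20 countries that are also former Soviet republics
in_both <- top_20[top_20 %in% former_ussr]
in_both
length(in_both) # 7
```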

The relation between the economic factors and the suicide rate

The economic factors might have a significant impact on the suicide rates. Since the current dataset contains GDP per capita, I decided to check the relationship between suicide rates and GDP per capita. I created a new dataset to visualize the suicide rates by GDP per capita.

gdp_suicide <- data_continents %>%
  dplyr::group_by(country, continent) %>% #group by country and continent
  dplyr::summarise(population = sum(population),
                   suicides = sum(suicides_no),
                   suicides_per_100k = (suicides / population) * 100000,
                   gdp_per_capita = mean(gdp_per_capita)) %>%
  ungroup() %>% #remove the grouping
  arrange(desc(suicides_per_100k)) #sort from the highest rate to the lowest

## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

gdp_suicide

## # A tibble: 101 x 6
##    country         continent population suicides suicides_per_10~ gdp_per_capita
##    <chr>           <chr>          <dbl>    <dbl>            <dbl>          <dbl>
##  1 Lithuania       Europe      68085210    28039             41.2          9281.
##  2 Russian Federa~ Europe    3690802620  1209742             32.8          6519.
##  3 Sri Lanka       Asia       182525626    55641             30.5           904.
##  4 Belarus         Europe     197372292    59892             30.3          3334.
##  5 Hungary         Europe     248644256    73891             29.7          9370.
##  6 Latvia          Europe      44852640    12770             28.5          8961.
##  7 Kazakhstan      Asia       377513869   101546             26.9          5329.
##  8 Slovenia        Europe      40268619    10615             26.4         18642.
##  9 Estonia         Europe      27090810     7034             26.0         11376.
## 10 Ukraine         Europe    1286469184   319950             24.9          1868.
## # ... with 91 more rows

The plot below shows the relationship between the suicide rate and GDP per capita for each country. The colors represent the continents. The size of the circles represents the suicide rate per 100k.

fig5 <- plot_ly(data = gdp_suicide, x = ~gdp_per_capita, y = ~suicides_per_100k, color = ~continent, size = ~suicides_per_100k, colors = "Dark2",
 text = ~paste("GDP:", gdp_per_capita, '<br>Suicides:', suicides_per_100k)) %>%
  layout(title = 'Suicide Rates vs GDP per Capita by Continents',
         xaxis = list(title = "GDP per Capita"),
         yaxis = list(title = "Suicide Rates"))
fig5

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## Warning: `line.width` does not currently support multiple values.

The correlation coefficient and statistical summary

The previous visualization of the relationship between the suicide rates and GDP shows a large variation in the suicide rates as a function of the economic factors. Thus, I decided to fit a linear regression for each continent and to check the correlation coefficient and the statistical summary.

Visually the correlations between the suicides rate and GDP per capita are weak for each continent. There are positive (Americas and Oceania) and negative (Africa, Asia, and Europe) correlations.

ggplot(gdp_suicide, aes(gdp_per_capita, suicides_per_100k)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "red", size = 0.5, fill = "#cccccc") + #linear fit with its confidence band
  labs(x = "GDP per capita", y = "Suicide Rates", title = "Correlation between Suicide Rates and GDP per Capita", subtitle = "By Continents") +
  facet_wrap(. ~ continent) + #one panel per continent
  theme_bw()

The European continent has the highest rate of suicides. Thus, I decided to check the statistical summary for this region.

gdp_suicide_europe <- gdp_suicide %>%
  filter(continent == 'Europe') #filter Europe

The results show that the correlation coefficient is -0.21, which indicates a weak negative correlation.

The fitted model is: suicides_per_100k = -1.003e-04(gdp_per_capita) + 1.871e+01

The Adjusted R-Squared value (0.01707) means that only about 1.7% of the variation in the observations is explained by this model.

The p-value (0.2082) is well above the usual 0.05 significance level, so the predictor (gdp_per_capita) is not statistically significant for predicting the suicide rate in this model.
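
To see how small the fitted effect is, the model equation above can be evaluated for two GDP values. This is only a sketch: predict_rate is a hypothetical helper, and the GDP values of 10,000 and 50,000 are chosen for illustration.

```r
# Coefficients copied from the model equation above
intercept <- 1.871e+01
slope     <- -1.003e-04

# Hypothetical helper: predicted suicides per 100k for a given GDP per capita
predict_rate <- function(gdp_per_capita) {
  intercept + slope * gdp_per_capita
}

# Predicted rates at a GDP per capita of 10,000 and 50,000
predict_rate(c(10000, 50000))
```

According to this line, a difference of 40,000 in GDP per capita corresponds to a difference of only about 4 suicides per 100k, and the model summary shows that even this slope is not statistically significant.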

cor(gdp_suicide_europe$suicides_per_100k, gdp_suicide_europe$gdp_per_capita) # correlation coefficient

## [1] -0.2088996

data_sum <- lm(suicides_per_100k ~ gdp_per_capita, data = gdp_suicide_europe) # fit the linear model

summary(data_sum)

## 
## Call:
## lm(formula = suicides_per_100k ~ gdp_per_capita, data = gdp_suicide_europe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3585  -7.1371  -0.4603   5.2490  23.4072 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.871e+01  2.301e+00   8.129 1.15e-09 ***
## gdp_per_capita -1.003e-04  7.825e-05  -1.282    0.208    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.894 on 36 degrees of freedom
## Multiple R-squared:  0.04364,    Adjusted R-squared:  0.01707 
## F-statistic: 1.643 on 1 and 36 DF,  p-value: 0.2082
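
The numbers in this summary are tied together by standard formulas for a regression with one predictor, so they can be cross-checked by hand. The sketch below uses only values printed above; the variable names are illustrative.

```r
r <- -0.2088996  # correlation coefficient from cor()
n <- 36 + 2      # residual degrees of freedom plus 2 estimated coefficients
p <- 1           # number of predictors

# With one predictor, R-squared is the squared correlation coefficient
r_squared <- r^2

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# t statistic and two-sided p-value for the slope
t_value <- -1.003e-04 / 7.825e-05
p_value <- 2 * pt(-abs(t_value), df = n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 5)
round(c(t_value = t_value, p_value = p_value), 3)
```

The results agree with the Multiple R-squared, Adjusted R-squared, t value, and p-value reported by summary(data_sum), up to the rounding of the printed coefficients.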

Conclusion

Suicide is a global social issue. According to the World Health Organization, nearly 800,000 people die from suicide every year. It is a very serious issue that the public has to take seriously. The first step in researching social issues is studying the data from the past. It is especially concerning that seven countries formed after the collapse of the Soviet Union are among the top 20 countries with the highest rates of suicide worldwide. Russia, my native country, is at the top of that list. Thus, I was most interested in researching the suicide rate dynamics. The dataset I used is from Kaggle, and it contains information about 101 countries, including the suicide rates per country and per year and the socio-economic factors during 1985 – 2016.

*Global trends of the suicide rates during 1985 - 2016.

The global trend during 1985 – 2016 was not uniform. The suicide rate increased dramatically from 1985 to 1995. From 1994 to 2010, it was higher than average, with a peak in 1995. Starting from 2010, it began to decrease. The mean yearly sum of the suicides-per-100k values across the 101 countries from 1985 to 2016 was 11,142.

*Distribution of the suicide rate by gender.

We can observe a significant difference between the distributions of suicide for men and women. The suicide rate is higher for men than for women (76.9% for men and 23.1% for women). The chart of male suicides has a peak around 1990 - 2014. The chart of female suicides does not have any peaks, and it shows a lower suicide rate than the chart for males.

*Distribution of the suicide rate by age.

The highest suicide rate occurred in the age category of 75+. Additionally, the plotline for age 75+ has a peak similar to the one in a global trend in 1995.

*Countries with the highest suicide rate.

Europe is the continent with the highest suicide rate. The top 20 countries with the highest suicide rates in the world include seven countries from the former USSR (Russian Federation, Lithuania, Kazakhstan, Ukraine, Belarus, Latvia, and Estonia).

*The relation between the economic factors and the suicide rates.

GDP per capita is a widely used measure of a country’s standard of living. The visualization of the relationship between the suicide rates and GDP per capita shows a large variation in the suicide rates as a function of the economic factors. The correlations between the suicide rates and GDP per capita are weak for each continent. There are positive correlations (in the Americas and Oceania) and negative correlations (in Africa, Asia, and Europe). For example, in Europe, the correlation coefficient between GDP per capita and the suicide rate is -0.21, which indicates a weak negative correlation.

References:

Suicide in the Twenty-First Century [dataset]. (2017). Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

World Health Organization. (2019). Suicide. Retrieved from https://www.who.int/news-room/fact-sheets/detail/suicide

NPR, Hear Every Voice https://www.npr.org/

Suicide Rates in the World in 1985 - 2016

Olga Tolchinsky

11 May, 2021