I have chosen the dataset from Kaggle: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
This dataset includes information about the suicide rates in each country, together with socio-economic data, from 1985 to 2016. It was compiled from four other datasets linked by time and place (see References below).
Suicide is a global social issue that is correlated with various factors. According to the World Health Organization, nearly 800,000 people die from suicide every year, and for every completed suicide there are many more people who attempt it. Suicide is the third leading cause of death among 15-19-year-olds. 79% of global suicides occur in low- and middle-income countries. Ingestion of pesticides, hanging, and firearms are among the most common methods of suicide globally.
The current project is a continuation of my earlier work with the same dataset. I decided to continue analyzing the global suicide rates because my previous research showed that the highest number of suicides occurred in the countries of the former Soviet Union, in the period immediately following the collapse of the USSR. I lived in that region at the time. I remember the country’s disintegration as one of the most severe social catastrophes, one that had a tremendous impact on the lives of all Soviet people. Since I am from Russia (a former USSR republic), I have a strong personal connection to the findings in this project.
I decided to continue working with this dataset and research the following questions:
What do the suicide trends look like as a function of age, gender, and geographical location?
Which continents and countries have the highest suicide rate?
What is the relationship between the economic factors and the suicide rates?
This dataset includes 12 variables: country name, year, sex, age group, number of suicides, population, number of suicides per 100k of population, GDP for the year, GDP per capita, generation, Human Development Index for the year, and the combined country-year key (country_year).
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(countrycode)
## Warning: package 'countrycode' was built under R version 4.0.5
library(tinytex)
## Warning: package 'tinytex' was built under R version 4.0.5
library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(plyr)
## Warning: package 'plyr' was built under R version 4.0.5
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
library(treemap)
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.5
getwd() #check the working directory
## [1] "C:/Users/small/Desktop/MCollege/2021/DATA110/Project3"
setwd("C:/Users/small/Desktop/MCollege/2021/DATA110/Project3") #set the working directory
data <- read_csv("master.csv") #read the file
##
## -- Column specification --------------------------------------------------------
## cols(
## country = col_character(),
## year = col_double(),
## sex = col_character(),
## age = col_character(),
## suicides_no = col_double(),
## population = col_double(),
## suicides_100k_pop = col_double(),
## country_year = col_character(),
## hdi_for_year = col_double(),
## gdp_for_year = col_number(),
## gdp_per_capita = col_double(),
## generation = col_character()
## )
head(data) #look at the first rows and the variables
## # A tibble: 6 x 12
## country year sex age suicides_no population suicides_100k_p~ country_year
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Albania 1987 male 15-2~ 21 312900 6.71 Albania1987
## 2 Albania 1987 male 35-5~ 16 308000 5.19 Albania1987
## 3 Albania 1987 fema~ 15-2~ 14 289700 4.83 Albania1987
## 4 Albania 1987 male 75+ ~ 1 21800 4.59 Albania1987
## 5 Albania 1987 male 25-3~ 9 274300 3.28 Albania1987
## 6 Albania 1987 fema~ 75+ ~ 1 35600 2.81 Albania1987
## # ... with 4 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## # gdp_per_capita <dbl>, generation <chr>
First, I examine the structure of the data object. The str() function reports the number of rows (observations) and columns (variables), along with the name and class of each column and its first few values. The dataset contains data from 101 countries: 27,820 observations of 12 variables.
str(data)
## spec_tbl_df [27,820 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:27820] "Albania" "Albania" "Albania" "Albania" ...
## $ year : num [1:27820] 1987 1987 1987 1987 1987 ...
## $ sex : chr [1:27820] "male" "male" "female" "male" ...
## $ age : chr [1:27820] "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
## $ suicides_no : num [1:27820] 21 16 14 1 9 1 6 4 1 0 ...
## $ population : num [1:27820] 312900 308000 289700 21800 274300 ...
## $ suicides_100k_pop: num [1:27820] 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country_year : chr [1:27820] "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
## $ hdi_for_year : num [1:27820] NA NA NA NA NA NA NA NA NA NA ...
## $ gdp_for_year : num [1:27820] 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
## $ gdp_per_capita : num [1:27820] 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : chr [1:27820] "Generation X" "Silent" "Generation X" "G.I. Generation" ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double(),
## .. suicides_100k_pop = col_double(),
## .. country_year = col_character(),
## .. hdi_for_year = col_double(),
## .. gdp_for_year = col_number(),
## .. gdp_per_capita = col_double(),
## .. generation = col_character()
## .. )
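The counts quoted above can also be requested from R directly instead of being read off the str() output. The following is only a minimal sketch on a small invented data frame (`toy` and its values are hypothetical, not the Kaggle file); on the real data the same calls would be applied to `data`.

```r
# Hypothetical stand-in for the real dataset (all values are invented)
toy <- data.frame(
  country     = c("Albania", "Albania", "Austria", "Austria", "Japan"),
  year        = c(1987, 1988, 1987, 1988, 1987),
  suicides_no = c(21, 16, 140, 150, 900),
  stringsAsFactors = FALSE
)

n_rows      <- nrow(toy)                    # number of observations
n_cols      <- ncol(toy)                    # number of variables
n_countries <- length(unique(toy$country))  # number of distinct countries
year_range  <- range(toy$year)              # first and last year covered

n_rows
n_cols
n_countries
year_range
```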
The summary() function returns a set of descriptive statistics for every variable, depending on its type. For the numerical variables of this dataset (year, suicides_no, population, suicides_100k_pop, hdi_for_year, gdp_for_year, and gdp_per_capita) it gives the minimum, maximum, quartiles, median, and mean; for the character variables it reports only the length, class, and mode. Note that hdi_for_year is missing (NA) in 19,456 of the 27,820 rows.
summary(data)
## country year sex age
## Length:27820 Min. :1985 Length:27820 Length:27820
## Class :character 1st Qu.:1995 Class :character Class :character
## Mode :character Median :2002 Mode :character Mode :character
## Mean :2001
## 3rd Qu.:2008
## Max. :2016
##
## suicides_no population suicides_100k_pop country_year
## Min. : 0.0 Min. : 278 Min. : 0.00 Length:27820
## 1st Qu.: 3.0 1st Qu.: 97498 1st Qu.: 0.92 Class :character
## Median : 25.0 Median : 430150 Median : 5.99 Mode :character
## Mean : 242.6 Mean : 1844794 Mean : 12.82
## 3rd Qu.: 131.0 3rd Qu.: 1486143 3rd Qu.: 16.62
## Max. :22338.0 Max. :43805214 Max. :224.97
##
## hdi_for_year gdp_for_year gdp_per_capita generation
## Min. :0.483 Min. :4.692e+07 Min. : 251 Length:27820
## 1st Qu.:0.713 1st Qu.:8.985e+09 1st Qu.: 3447 Class :character
## Median :0.779 Median :4.811e+10 Median : 9372 Mode :character
## Mean :0.777 Mean :4.456e+11 Mean : 16866
## 3rd Qu.:0.855 3rd Qu.:2.602e+11 3rd Qu.: 24874
## Max. :0.944 Max. :1.812e+13 Max. :126352
## NA's :19456
To begin, I wanted to see the trend of suicides by year across all countries. I set up a new dataset with a new variable, sum_suicides, that holds the sum of the suicides_100k_pop values over all rows of each year.
suicides_by_years <- data %>% #new dataset
dplyr::group_by(year) %>% #group by year
dplyr::mutate(sum_suicides = sum(suicides_100k_pop)) #add the yearly sum of suicides per 100k population to every row
head(suicides_by_years)
## # A tibble: 6 x 13
## # Groups: year [1]
## country year sex age suicides_no population suicides_100k_p~ country_year
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Albania 1987 male 15-2~ 21 312900 6.71 Albania1987
## 2 Albania 1987 male 35-5~ 16 308000 5.19 Albania1987
## 3 Albania 1987 fema~ 15-2~ 14 289700 4.83 Albania1987
## 4 Albania 1987 male 75+ ~ 1 21800 4.59 Albania1987
## 5 Albania 1987 male 25-3~ 9 274300 3.28 Albania1987
## 6 Albania 1987 fema~ 75+ ~ 1 35600 2.81 Albania1987
## # ... with 5 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## # gdp_per_capita <dbl>, generation <chr>, sum_suicides <dbl>
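Because mutate() was used, the new column is repeated on every row of a year rather than stored once per year. The sketch below illustrates the difference between mutate() and summarise() on a tiny invented data frame (`toy` and its values are hypothetical); it is meant as an illustration, not as a replacement for the analysis code.

```r
library(dplyr)

# Hypothetical toy data: three rows for 1990, two rows for 1991
toy <- data.frame(
  year              = c(1990, 1990, 1990, 1991, 1991),
  suicides_100k_pop = c(1, 2, 3, 4, 6)
)

# mutate() keeps all five rows and repeats each year's total on every row
with_total <- toy %>%
  dplyr::group_by(year) %>%
  dplyr::mutate(sum_suicides = sum(suicides_100k_pop)) %>%
  dplyr::ungroup()

# summarise() collapses the data to one row per year
per_year <- toy %>%
  dplyr::group_by(year) %>%
  dplyr::summarise(sum_suicides = sum(suicides_100k_pop))

nrow(with_total)  # one row per original observation
nrow(per_year)    # one row per year

# The mean of the repeated column is weighted by the number of rows per year,
# so it can differ from the mean of the yearly totals
mean(with_total$sum_suicides)
mean(per_year$sum_suicides)
```

One row per year is usually the more convenient shape for a yearly line or bar chart and for a mean line.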
I looked at the descriptive statistics of the new variable “sum_suicides”, which I use as the yearly index of suicides in the world:
summary_by_years <- suicides_by_years %>%
dplyr::group_by(year) %>% #group by year
dplyr::summarise(sum_suicides = sum(suicides_100k_pop)) %>% #summarize the suicides per 100k population over the years
arrange(-sum_suicides) #sort rows in descending order
summary(summary_by_years$sum_suicides) #statistical summary of the total suicides per 100k population over the years
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2147 10200 11664 11142 13675 14660
The yearly sum was largest from 1994 to 2005, with the peak in 1995. Because the dataset ends in 2016, the data for that year are most likely incomplete. Even setting 2016 aside, the sum has been decreasing since 2010.
suicides_by_years_line <- suicides_by_years
suicides_by_years_line <- ggplot(suicides_by_years_line,aes(x = year, y = sum_suicides))+ # plot by years and the total number of suicides
geom_line(color = "red", size = 0.8)+ # set the color and size of the line
geom_point() +
geom_hline(yintercept = mean(suicides_by_years$sum_suicides), linetype = "dashed", color= "black",size = 0.5)+ #set the mean line of the total number of suicides
annotate("text", x = 1987, y = 12500, label = "Mean Line", vjust = -0.5,color= "red")+ # set the label "Mean Line" on the plot
labs(x = "Year",y = "Number of Suicides", title="Time Graph of the Suicide Rate in the World")+ ## set the title and the name of axis
theme_minimal() # set the type of graph theme
fig1<-ggplotly(suicides_by_years_line) # apply "plotly" function for interactivity in graph
fig1
The following plot uses geom_bar() with stat = "identity", which takes the height of each bar from the y values instead of counting the number of cases at each x position. It gives another view of the global trend during 1985 - 2016.
suicides_by_years %>%
ggplot(aes(x = year, y = sum_suicides)) + # plot by years and the total number of suicides
geom_bar(stat = "identity", width=0.7, aes(fill = sum_suicides)) + # set the width of the bars
scale_fill_gradient(name = "Total suicide", low = "yellow", high = "red") + #set the name of the legend and applying colors for bars
labs(title = "Distribution of the Suicide Rate in the World",
subtitle = "Trend over time, 1985 - 2015",
x = "Year",
y = "Frequency") + #set the title, subtitle, and the name of axis
theme_minimal() + #set the type of theme
coord_flip() #flip the plot 90 degrees
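One point worth checking with geom_bar(stat = "identity") is how many rows feed each bar: when a yearly total is repeated on several rows, the default stacking adds the repeats together. The sketch below demonstrates this on invented data (`repeated`, `one_row`, and their values are hypothetical); geom_col() is the ggplot2 shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical data: the yearly total (10) is repeated on each of 3 rows,
# as happens after group_by() + mutate()
repeated <- data.frame(year = c(2000, 2000, 2000), sum_suicides = c(10, 10, 10))
# The same total stored once, as after group_by() + summarise()
one_row  <- data.frame(year = 2000, sum_suicides = 10)

p_stacked <- ggplot(repeated, aes(x = year, y = sum_suicides)) +
  geom_bar(stat = "identity")  # the three rows are stacked on top of each other

p_single <- ggplot(one_row, aes(x = year, y = sum_suicides)) +
  geom_col()                   # a single row gives a single bar segment

# ggplot_build() exposes the computed bar geometry, including the bar tops
top_stacked <- max(ggplot_build(p_stacked)$data[[1]]$ymax)
top_single  <- max(ggplot_build(p_single)$data[[1]]$ymax)

top_stacked
top_single
```

If the bar heights are meant to equal the yearly totals, plotting a one-row-per-year table is the safer choice.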
The next step is to look at the distribution of the suicide rates by gender. To visualize the gender trends, I created a new dataset with the rate of suicides per 100K of the population.
data_sex <- data %>%
dplyr::select(suicides_no, population, sex, year) %>% #select 4 variables
dplyr::mutate(rate = (suicides_no / population) * 100000) %>% #create new variable with rate of suicides per 100K of the population
group_by(sex, year) %>% #group by sex and year
dplyr::summarize(avg_rate = mean(rate)) #summarized mean of the rate
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
data_sex
## # A tibble: 64 x 3
## # Groups: sex [2]
## sex year avg_rate
## <chr> <dbl> <dbl>
## 1 female 1985 5.78
## 2 female 1986 5.81
## 3 female 1987 5.73
## 4 female 1988 6.32
## 5 female 1989 6.18
## 6 female 1990 5.91
## 7 female 1991 6.03
## 8 female 1992 6.25
## 9 female 1993 6.03
## 10 female 1994 6.24
## # ... with 54 more rows
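The avg_rate above is an unweighted mean of the row-level rates, so a small population group counts as much as a large one. A pooled rate (total suicides divided by total population) can give a different number. The sketch below compares the two on invented values (`toy` is hypothetical) and is only meant to make the distinction concrete.

```r
# Hypothetical toy data: one small and one very large population group
toy <- data.frame(
  sex         = c("male", "male"),
  suicides_no = c(10, 100),
  population  = c(100000, 10000000)
)

# Row-level rates per 100k: 10 for the small group, 1 for the large group
rate <- toy$suicides_no / toy$population * 100000

# Unweighted mean of the row-level rates (the approach used for avg_rate)
mean_of_rates <- mean(rate)

# Pooled rate: total suicides over total population, per 100k
pooled_rate <- sum(toy$suicides_no) / sum(toy$population) * 100000

mean_of_rates
pooled_rate
```

Neither number is wrong; they answer different questions, so it helps to state which one a chart shows.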
The following plot visualizes the suicide rates by gender over time. We can observe a large difference between the male and female curves. The male curve has a broad peak between about 1990 and 2014, whereas the female curve has no pronounced peak. The female rate is also lower than the male rate.
plot_sex <- data_sex %>%
ggplot(mapping = aes(x = year, y = avg_rate, col = sex)) + #plot the average suicide rate per 100k by sex
geom_line() +
geom_point() +
labs(title = "Rate of the Total Suicides by Gender",
x = "Year",
y = "Suicides per 100k",
col = "Sex") +
theme_minimal() +
scale_color_brewer(palette = "Set1") #set color palette from the RColorBrewer
fig2 <-ggplotly(plot_sex)
fig2
The following plot shows the suicide rate by a combination of two factors: gender and generation. As we can see, for men, the highest rates occurred in the GI and Silent Generations. For women, the highest rate occurred in the GI Generation.
American Generations Timeline:
GI Generation, born 1901-1924: They were teenagers during the Great Depression and fought in World War II. Sometimes called the greatest generation (following a book by journalist Tom Brokaw) or the swing generation because of their jazz music.
Silent Generation, born 1925-1942: They were too young to see action in World War II and too old to participate in the fun of the Summer of Love. This label describes their conformist tendencies and belief that following the rules was a sure ticket to success.
Baby Boomers, born 1943-1964: The boomers were born during an economic and baby boom following World War II. These hippie kids protested against the Vietnam War and participated in the civil rights movement, all with rock ‘n’ roll music blaring in the background.
Generation X, born 1965-1979: They were originally called the baby busters because fertility rates fell after the boomers. As teenagers, they experienced the AIDS epidemic and the fall of the Berlin Wall. Sometimes called the MTV Generation, the “X” in their name refers to this generation’s desire not to be defined.
Millennials, born 1980-2000: They experienced the rise of the Internet, Sept. 11 and the wars that followed. Sometimes called Generation Y. Because of their dependence on technology, they are said to be entitled and narcissistic.
Generation Z, born 2001-2013: These kids were the first born with the Internet and are suspected to be the most individualistic and technology-dependent generation. Sometimes referred to as the iGeneration.
suicides_by_sex_1 <- data %>%
dplyr::group_by(sex, generation) %>% #group by sex and generation
dplyr::summarize(population = sum(population),
suicides = sum(suicides_no),
suicides_per_100k = (suicides / population) * 100000)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
ggplot(suicides_by_sex_1, aes(sex, suicides_per_100k,fill = generation))+ #plot the distribution by gender and generation
geom_bar(stat = "identity")+
labs(title = "Rate of the Total Suicides by Gender and Generations",
x = "Gender",
y = "Suicides per 100k",
fill = "Generation") +
theme_minimal() +
scale_fill_brewer(palette = "Dark2") #set color palette from the RColorBrewer
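The message about grouped output printed by summarise() is informational. If an ungrouped result is wanted, the `.groups` argument it mentions can be set explicitly. The sketch below shows this on invented data (`toy` and its values are hypothetical) and assumes dplyr 1.0.0 or later, where `.groups` is available.

```r
library(dplyr)

# Hypothetical toy data with one row per sex and generation
toy <- data.frame(
  sex         = c("male", "male", "female", "female"),
  generation  = c("Silent", "Boomers", "Silent", "Boomers"),
  suicides_no = c(30, 20, 10, 5),
  population  = c(100000, 200000, 100000, 250000)
)

by_sex_gen <- toy %>%
  dplyr::group_by(sex, generation) %>%
  dplyr::summarise(
    population        = sum(population),
    suicides          = sum(suicides_no),
    suicides_per_100k = suicides / population * 100000,
    .groups           = "drop"   # return an ungrouped table, no message
  )

by_sex_gen
```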
To visualize trends by age, I created a new dataset.
data_age <- data %>%
dplyr::select(suicides_no, population, age, year) %>% #select 4 variables
dplyr::mutate(rate = (suicides_no / population) * 100000) %>% #create new variable as the rate of number suicides by number population
group_by(age, year) %>% #group by age and year
dplyr::summarize(avg_rate = mean(rate)) #summarize mean of the rate
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
data_age
## # A tibble: 191 x 3
## # Groups: age [6]
## age year avg_rate
## <chr> <dbl> <dbl>
## 1 15-24 years 1985 8.43
## 2 15-24 years 1986 8.15
## 3 15-24 years 1987 7.49
## 4 15-24 years 1988 8.75
## 5 15-24 years 1989 9.16
## 6 15-24 years 1990 9.12
## 7 15-24 years 1991 9.91
## 8 15-24 years 1992 9.19
## 9 15-24 years 1993 9.33
## 10 15-24 years 1994 9.49
## # ... with 181 more rows
The following plot confirms the findings reported in the previous bar chart that the highest number of suicides occurred in the GI and Silent Generations (age 55+).
plot_age <- data_age %>%
ggplot(mapping = aes(x = year, y = avg_rate, col = age)) + #plot the average suicide rate per 100k by age
geom_line() +
geom_point() +
labs(title = "Rate of the Total Suicides by Age",
x = "Year",
y = "Suicides per 100k",
col = "Age") +
theme_minimal() +
scale_color_brewer(palette = "Dark2") # set color palette
fig3 <-ggplotly(plot_age)
fig3
After looking at the demographic factors such as gender and age, I wanted to visualize the geographical distribution of suicides. For this task, I created a new dataset with a new variable called “continent”, which assigns each country to its continent.
data_continents <- data %>%
as.data.frame() #convert the tibble to a plain data frame
data_continents$continent <- #take the variable from the dataset
countrycode(sourcevar = data_continents[, "country"],origin = "country.name", destination = "continent") #convert the list of the countries into list of continents
data_continents <- as_tibble(data_continents) #set dataset into tibble
head(data_continents)
## # A tibble: 6 x 13
## country year sex age suicides_no population suicides_100k_p~ country_year
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Albania 1987 male 15-2~ 21 312900 6.71 Albania1987
## 2 Albania 1987 male 35-5~ 16 308000 5.19 Albania1987
## 3 Albania 1987 fema~ 15-2~ 14 289700 4.83 Albania1987
## 4 Albania 1987 male 75+ ~ 1 21800 4.59 Albania1987
## 5 Albania 1987 male 25-3~ 9 274300 3.28 Albania1987
## 6 Albania 1987 fema~ 75+ ~ 1 35600 2.81 Albania1987
## # ... with 5 more variables: hdi_for_year <dbl>, gdp_for_year <dbl>,
## # gdp_per_capita <dbl>, generation <chr>, continent <chr>
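The conversion step can be tried on a plain character vector first. The sketch below uses the same countrycode() arguments as above on a few example country names chosen here for illustration; names that cannot be matched come back as NA with a warning, so a quick NA check is worthwhile on real data.

```r
library(countrycode)

# A few example country names (chosen for illustration only)
countries <- c("Albania", "Japan", "Brazil", "Australia")

# Map English country names to continent names
continents <- countrycode(sourcevar   = countries,
                          origin      = "country.name",
                          destination = "continent")

continents
anyNA(continents)  # TRUE would mean at least one name was not matched
```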
The bar plot shows the distribution of the total suicides by continent. As we can see, the highest number of suicides was in Europe. The next region with the highest number of suicides was the Americas.
plot_continents<-data_continents %>%
group_by(continent) #group by continent
plot_continents <- ggplot(data_continents, aes(year, suicides_100k_pop, fill = continent))+
geom_bar(stat = "identity") + #bar plot of the suicide rate by continent
labs(title = "Distribution of the Suicide Rates by Continents",
subtitle = "Trend over time, 1985 - 2015",
x = "Year",
y = "Rates of Suicide") +
theme_minimal() +
scale_fill_brewer(palette = "Dark2")
fig4 <-ggplotly(plot_continents)
fig4
Here is another visualization, this time by country. The treemap sizes and colors each country’s rectangle by the sum of its suicides_100k_pop values. As we can see, the Russian Federation has the largest total in the world.
treemap(data_continents, index="country", vSize="suicides_100k_pop", #set size
vColor="suicides_100k_pop", #set color
type="value",
title="Treemap of Suicide Rates by Countries", palette="Spectral") #set color palette from the RColorBrewer
Here is one more visualization of how the suicide rates are distributed by country. The colors represent continents, so we can observe that the countries with the highest rates are mostly in Europe.
plot_countries <- data_continents %>%
dplyr::group_by(country, continent) %>% #group by countries and continents
dplyr::summarize(n = n(), suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
arrange(desc(suicide_per_100k)) #sort rows in descending order
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
plot_countries$country <- factor(plot_countries$country, ordered = T, levels = rev(plot_countries$country))
ggplot(plot_countries, aes(x = country, y = suicide_per_100k, fill = continent)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of the Suicide Rates by Countries",
x = "Country",
y = "Suicide Rates",
fill = "Continent") +
coord_flip() + #flip the plot 90 degrees
scale_y_continuous(breaks = seq(0, 45, 5)) + #scale y-axis
theme_minimal() +
theme(legend.position = "bottom") + #set position of the legend
scale_fill_brewer(palette = "Dark2") #set color palette from the RColorBrewer
As the table below shows, seven former Soviet republics are among the 20 countries with the highest summed suicide rates.
top_20_countries <- data %>%
dplyr::group_by(country) %>%
dplyr::summarise(total_suicide = sum(suicides_100k_pop))%>%
arrange(desc(total_suicide)) %>%
slice(1:20)
top_20_countries
## # A tibble: 20 x 2
## country total_suicide
## <chr> <dbl>
## 1 Russian Federation 11305.
## 2 Lithuania 10589.
## 3 Hungary 10156.
## 4 Kazakhstan 9520.
## 5 Republic of Korea 9350.
## 6 Austria 9076.
## 7 Ukraine 8932.
## 8 Japan 8025.
## 9 Finland 7924.
## 10 Belgium 7900.
## 11 Belarus 7831.
## 12 France 7803.
## 13 Latvia 7373.
## 14 Suriname 7162.
## 15 Bulgaria 7016.
## 16 Slovenia 7013.
## 17 Estonia 6874.
## 18 Guyana 6656.
## 19 Uruguay 6539.
## 20 Singapore 6341.
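The count of former Soviet republics in this table can be checked with a set lookup. In the sketch below, `top20` is copied from the printed table; `former_ussr` is a list of the fifteen former Soviet republics typed in by hand, and its spellings are an assumption that may not match the dataset's country names in every case.

```r
# Country names copied from the top-20 table above
top20 <- c("Russian Federation", "Lithuania", "Hungary", "Kazakhstan",
           "Republic of Korea", "Austria", "Ukraine", "Japan", "Finland",
           "Belgium", "Belarus", "France", "Latvia", "Suriname", "Bulgaria",
           "Slovenia", "Estonia", "Guyana", "Uruguay", "Singapore")

# The fifteen former Soviet republics (spellings are an assumption)
former_ussr <- c("Armenia", "Azerbaijan", "Belarus", "Estonia", "Georgia",
                 "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania",
                 "Republic of Moldova", "Russian Federation", "Tajikistan",
                 "Turkmenistan", "Ukraine", "Uzbekistan")

# Keep the top-20 entries that are former Soviet republics
in_top20 <- top20[top20 %in% former_ussr]

in_top20
length(in_top20)
```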
The economic factors might have a significant impact on the suicide rates. Since the current dataset contains GDP per capita, I decided to check the relationship between suicide rates and GDP per capita. I created a new dataset to visualize the suicide rates by GDP per capita.
gdp_suicide <- data_continents %>%
dplyr::group_by(country, continent)%>% #group by country and continent
dplyr:: summarise(population = sum(population), suicides = sum(suicides_no), suicides_per_100k = (suicides / population) * 100000,
gdp_per_capita = mean(gdp_per_capita)) %>%
ungroup() %>% #ungroup to the original dataset
arrange(desc(suicides_per_100k)) #sort from a larger value to lower
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
gdp_suicide
## # A tibble: 101 x 6
## country continent population suicides suicides_per_10~ gdp_per_capita
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Lithuania Europe 68085210 28039 41.2 9281.
## 2 Russian Federa~ Europe 3690802620 1209742 32.8 6519.
## 3 Sri Lanka Asia 182525626 55641 30.5 904.
## 4 Belarus Europe 197372292 59892 30.3 3334.
## 5 Hungary Europe 248644256 73891 29.7 9370.
## 6 Latvia Europe 44852640 12770 28.5 8961.
## 7 Kazakhstan Asia 377513869 101546 26.9 5329.
## 8 Slovenia Europe 40268619 10615 26.4 18642.
## 9 Estonia Europe 27090810 7034 26.0 11376.
## 10 Ukraine Europe 1286469184 319950 24.9 1868.
## # ... with 91 more rows
The plot below shows the relationship between the suicide rate and GDP per capita for each country. The colors represent the continents. The size of the circles represents the suicide rate per 100k.
fig5 <- plot_ly(data = gdp_suicide, x = ~gdp_per_capita, y = ~suicides_per_100k, color = ~continent, size = ~suicides_per_100k, colors = "Dark2",
text = ~paste("GDP:", gdp_per_capita, '<br>Suicides:', suicides_per_100k)) %>%
layout(title = 'Suicide Rates vs GDP per Capita by Continents',
xaxis = list(title = "GDP per Capita"),
yaxis = list(title = "Suicide Rates"))
fig5
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
The previous visualization of the relationship between the suicide rates and GDP per capita shows a large variation in the suicide rates across economic levels. Thus, I decided to fit a linear regression model and check the correlation coefficient and the statistical summary by continent.
Visually the correlations between the suicides rate and GDP per capita are weak for each continent. There are positive (Americas and Oceania) and negative (Africa, Asia, and Europe) correlations.
fgdp_suicide_plot <- gdp_suicide
ggplot(gdp_suicide, aes(gdp_per_capita, suicides_per_100k))+
geom_point()+
geom_smooth(method = "lm", formula=y~x, color="red", size = 0.5, fill="#cccccc") +
labs(x="GDP per capita",y="Suicide Rates", title = "Correlation between Suicide Rates and GDP per Capita", subtitle = "By Continents") +
facet_wrap(.~ continent) +
theme_bw()
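The direction and strength seen in each facet can also be expressed as one correlation coefficient per continent. The sketch below shows one way to compute this with base R on invented data (`toy`, its group labels "A" and "B", and its values are hypothetical); on the real data the same pattern would be applied to `gdp_suicide`.

```r
# Hypothetical toy data: two groups with opposite trends
toy <- data.frame(
  continent         = rep(c("A", "B"), each = 4),
  gdp_per_capita    = c(1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000),
  suicides_per_100k = c(5, 7, 9, 11, 12, 9, 8, 4)
)

# Split the rows by continent and compute Pearson's r within each group
cor_by_continent <- sapply(
  split(toy, toy$continent),
  function(d) cor(d$gdp_per_capita, d$suicides_per_100k)
)

round(cor_by_continent, 2)
```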
The European continent has the highest rate of suicides. Thus, I decided to check the statistical summary for this region.
gdp_suicide_europe <- gdp_suicide %>%
filter(continent == 'Europe') #filter Europe
The results show that the correlation coefficient is -0.21, which indicates a weak negative correlation.
The model equation is: suicides_per_100k = -1.003e-04 * gdp_per_capita + 1.871e+01
The Adjusted R-Squared value (0.01707) indicates that only about 1.7% of the variation in the observations is explained by this model.
The p-value (0.2082) is well above the usual 0.05 threshold, so the predictor (gdp_per_capita) is not statistically significant for the suicide rate in this model.
cor(gdp_suicide_europe$suicides_per_100k, gdp_suicide_europe$gdp_per_capita) # correlation coefficient
## [1] -0.2088996
data_sum <- lm(suicides_per_100k ~ gdp_per_capita, data = gdp_suicide_europe) # fit the linear model for the statistical summary
summary(data_sum)
##
## Call:
## lm(formula = suicides_per_100k ~ gdp_per_capita, data = gdp_suicide_europe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3585 -7.1371 -0.4603 5.2490 23.4072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.871e+01 2.301e+00 8.129 1.15e-09 ***
## gdp_per_capita -1.003e-04 7.825e-05 -1.282 0.208
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.894 on 36 degrees of freedom
## Multiple R-squared: 0.04364, Adjusted R-squared: 0.01707
## F-statistic: 1.643 on 1 and 36 DF, p-value: 0.2082
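For a regression with a single predictor, the quantities in this summary follow from the correlation coefficient and the sample size alone. The sketch below recomputes them from the reported r as a consistency check; n = 38 is inferred from the 36 residual degrees of freedom plus the two estimated coefficients.

```r
r <- -0.2088996  # correlation coefficient reported above
n <- 38          # 36 residual degrees of freedom + 2 estimated coefficients

# In simple linear regression, multiple R-squared equals r squared
r_squared <- r^2

# Adjusted R-squared with one predictor
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

# t statistic of the slope and its two-sided p-value
t_value <- r * sqrt(n - 2) / sqrt(1 - r_squared)
p_value <- 2 * pt(-abs(t_value), df = n - 2)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared,
        t_value = t_value, p_value = p_value), 4)
```

The rounded values should agree with the Multiple R-squared, Adjusted R-squared, t value, and p-value printed by summary(data_sum).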
Suicide is a global social issue. According to the World Health Organization, nearly 800,000 people die from suicide every year. It is a very serious issue that the public has to take seriously. The first step in researching social issues is studying the data from the past. It is especially concerning that seven countries formed after the collapse of the Soviet Union are among the top 20 countries with the highest rates of suicide worldwide. Russia, my native country, ranks first by the summed rate per 100k and second, after Lithuania, by the rate computed from total suicides and total population. Thus, I was most interested in researching the suicide rate dynamics. The dataset I used is from Kaggle; it contains information about 101 countries, including the suicide rates per country and per year and socio-economic factors, for 1985 – 2016.
*Global trends of the suicide rates during 1985 - 2016.
The global trend during 1985 – 2016 was not uniform. The yearly sum of the suicide rates increased sharply from 1985 to 1995. From 1994 to 2010, it stayed above the average, with a peak in 1995. After 2010, it started to decrease. The mean yearly sum of the suicides_100k_pop values across the 101 countries from 1985 to 2016 was 11,142; because this figure adds up the rates of every country, sex, and age group, it is an index rather than a rate per 100,000 people.
*Distribution of the suicide rate by gender.
We can observe a large difference between the distributions of suicide for men and women. The suicide rate is higher for men than for women (76.9% for men and 23.1% for women). The male curve has a broad peak between about 1990 and 2014, whereas the female curve has no pronounced peak and stays below the male curve.
*Distribution of the suicide rate by age.
The highest suicide rate occurred in the age category of 75+. Additionally, the plotline for age 75+ has a peak similar to the one in a global trend in 1995.
*Countries with the highest suicide rate.
Europe is the continent with the highest suicide rate. The top 20 countries with the highest suicide rates in the world include seven countries from the former USSR (Russian Federation, Lithuania, Kazakhstan, Ukraine, Belarus, Latvia, and Estonia).
*The relation between the economic factors and the suicide rates.
GDP per capita is a commonly used measure of a country’s standard of living. The visualization of the relationship between the suicide rates and GDP per capita shows a large variation in the suicide rates across economic levels. The correlations between the suicide rates and GDP per capita are weak for each continent. There are positive correlations (in the Americas and Oceania) and negative correlations (in Africa, Asia, and Europe). For example, in Europe, the correlation coefficient between GDP per capita and the suicide rate is -0.21, which indicates a weak negative correlation.
References:
Suicide in the Twenty-First Century [dataset]. (2017). Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
World Health Organization. (2019). Suicide. Retrieved from https://www.who.int/news-room/fact-sheets/detail/suicide
NPR, Hear Every Voice https://www.npr.org/