Introduction

Good day. Today we will look at a Kaggle dataset on suicide deaths from 1985 to 2016. The dataset combines four different datasets and was created to find signals related to increased suicide rates in various groups globally, across the socio-economic spectrum. We will be looking at variables such as age range, gender, suicide total, and generation.

This is the link to the dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016?select=master.csv

References

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

Inspiration

Suicide Prevention.

Let's start by installing the required packages

install.packages("tidyverse")
install.packages("highcharter")
install.packages("RColorBrewer")
install.packages("plotly")
install.packages("treemap")

Load packages

library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("dplyr")
library("highcharter")
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library("RColorBrewer")
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("treemap")

Set working directory to load the dataset

setwd("C:/Users/User/Documents/Data_Science/datasets")
suicide_dataset <- read_csv("suicide_rates.csv") 
## 
## -- Column specification --------------------------------------------------------
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double(),
##   `suicides/100k pop` = col_double(),
##   `country-year` = col_character(),
##   `HDI for year` = col_double(),
##   `gdp_for_year ($)` = col_number(),
##   `gdp_per_capita ($)` = col_double(),
##   generation = col_character()
## )

Use base R's summary() function to get a summary of suicide_dataset

summary(suicide_dataset)
##    country               year          sex                age           
##  Length:27820       Min.   :1985   Length:27820       Length:27820      
##  Class :character   1st Qu.:1995   Class :character   Class :character  
##  Mode  :character   Median :2002   Mode  :character   Mode  :character  
##                     Mean   :2001                                        
##                     3rd Qu.:2008                                        
##                     Max.   :2016                                        
##                                                                         
##   suicides_no        population       suicides/100k pop country-year      
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00    Length:27820      
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92    Class :character  
##  Median :   25.0   Median :  430150   Median :  5.99    Mode  :character  
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82                      
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62                      
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97                      
##                                                                           
##   HDI for year   gdp_for_year ($)    gdp_per_capita ($)  generation       
##  Min.   :0.483   Min.   :4.692e+07   Min.   :   251     Length:27820      
##  1st Qu.:0.713   1st Qu.:8.985e+09   1st Qu.:  3447     Class :character  
##  Median :0.779   Median :4.811e+10   Median :  9372     Mode  :character  
##  Mean   :0.777   Mean   :4.456e+11   Mean   : 16866                       
##  3rd Qu.:0.855   3rd Qu.:2.602e+11   3rd Qu.: 24874                       
##  Max.   :0.944   Max.   :1.812e+13   Max.   :126352                       
##  NA's   :19456
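
The summary reports 19,456 missing values in `HDI for year` out of 27,820 rows, which is roughly 70% of the column. A quick way to check missingness across every column is sketched below. The data frame `toy_dataset` and its values are invented stand-ins, not rows from the Kaggle file.

```r
# Toy stand-in for suicide_dataset; values are invented for illustration only.
toy_dataset <- data.frame(
  country = c("A", "A", "B", "B"),
  suicides_no = c(3, 25, 131, 0),
  hdi_for_year = c(NA, 0.713, NA, NA)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy_dataset))

# Share of missing values in each column, as a proportion of rows.
na_share <- na_counts / nrow(toy_dataset)

na_counts
na_share
```

Running the same two lines on suicide_dataset should show which columns are complete enough to use in later plots.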

Rename the “sex” variable to “gender” and “suicides_no” to “suicide_total”

suicide_data <- suicide_dataset %>%
  rename(
    gender = sex,
    suicide_total = suicides_no
  )

Which gender had the most suicides from 1985 to 2016?

Let’s make a plot of the total suicide deaths by gender

plot1 <- ggplot(suicide_data, aes(x = year, y = suicide_total, color=gender)) +
  xlab("Year") +
  ylab("Suicide deaths count") + 
  theme_minimal(base_size = 12)
plot1 + geom_point()

The plot above shows that suicide deaths are more common among males than females.
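
The scatter plot shows one point per row rather than an overall total, so a numeric check can help confirm the comparison. Below is a minimal sketch that sums deaths by gender. It assumes the renamed columns `gender` and `suicide_total`, and `toy_data` is a made-up stand-in for suicide_data.

```r
library(dplyr)

# Toy stand-in for suicide_data; the numbers are made up for illustration.
toy_data <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  gender = c("female", "male", "female", "male"),
  suicide_total = c(10, 35, 12, 40)
)

# Sum suicide deaths per gender and sort from highest to lowest.
gender_totals <- toy_data %>%
  group_by(gender) %>%
  summarise(suicide_total = sum(suicide_total, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(suicide_total))

gender_totals
```

Swapping toy_data for suicide_data would give the totals for the full 1985 to 2016 period.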

Which Country and Generation had the most suicides?

generation_suicide_plot <- ggplot(suicide_data, aes(x = year, y = suicide_total, color = generation, text = paste("Country:", country))) +
  geom_point(alpha = 0.5) +
  ggtitle("Suicide deaths") +
  xlab("Year") +
  ylab("Suicide deaths count") +
  theme_minimal(base_size = 12)
generation_suicide_plot <- ggplotly(generation_suicide_plot)

generation_suicide_plot

The highest number of suicides from 1985 to 2016 was recorded in Russia, in the Boomers generation, within the 35-54 age range.
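
Reading the top point off an interactive plot identifies the single highest row. An aggregated check is a slightly different question: which country and generation pair has the highest summed total? The sketch below shows one way to ask it, using a made-up `toy_data` frame in place of suicide_data.

```r
library(dplyr)

# Toy stand-in for suicide_data; countries and counts are invented.
toy_data <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  generation = c("Boomers", "Boomers", "Silent", "Boomers", "Silent", "Silent"),
  suicide_total = c(100, 120, 40, 30, 20, 10)
)

# Sum deaths for each country and generation pair, then keep the largest.
top_groups <- toy_data %>%
  group_by(country, generation) %>%
  summarise(suicide_total = sum(suicide_total), .groups = "drop") %>%
  slice_max(suicide_total, n = 1)

top_groups
```

Raising `n` would list the top few pairs instead of only the first.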

Filter by Country

Filter the total number of suicide deaths by four different countries: Japan, United States, France, and Italy

suicide_by_country <- suicide_data %>%
  filter(country %in% c("Japan", "United States", "France", "Italy")) %>%
  arrange(suicide_total)
suicide_by_country
## # A tibble: 1,476 x 12
##    country  year gender age        suicide_total population `suicides/100k pop`
##    <chr>   <dbl> <chr>  <chr>              <dbl>      <dbl>               <dbl>
##  1 Italy    2006 female 5-14 years             0    2692378                0   
##  2 Italy    1995 female 5-14 years             1    2778756                0.04
##  3 Italy    1998 female 5-14 years             1    2707786                0.04
##  4 Italy    2010 female 5-14 years             1    2743696                0.04
##  5 France   1996 female 5-14 years             2    3562100                0.06
##  6 Italy    1991 female 5-14 years             2    3110000                0.06
##  7 Italy    2000 female 5-14 years             2    2688032                0.07
##  8 Italy    2005 female 5-14 years             2    2656241                0.08
##  9 Italy    2008 male   5-14 years             2    2866857                0.07
## 10 Italy    2009 female 5-14 years             2    2725983                0.07
## # ... with 1,466 more rows, and 5 more variables: country-year <chr>,
## #   HDI for year <dbl>, gdp_for_year ($) <dbl>, gdp_per_capita ($) <dbl>,
## #   generation <chr>

Plot a bar chart of suicide deaths by year and age group

# geom_col() draws bar heights from the y values; geom_histogram() is meant for binned counts
suicide_bar_chart <- suicide_by_country %>%
  ggplot() +
  geom_col(aes(x = year, y = suicide_total, fill = age),
           position = "dodge") +
  ggtitle("Total number of suicides") +
  xlab("Year") +
  ylab("Number of deaths by suicide") +
  labs(fill = "Ages")
suicide_bar_chart

The age group with the highest suicide total was 35-54 years.

Make a treemap

treemap(suicide_by_country, index = "country", vSize = "population",
        vColor = "suicide_total", type = "value",
        palette = "RdYlBu",
        title = "Suicide deaths by country and population")
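
Because the treemap sizes each country by population, raw death counts and population are hard to separate by eye. A rate per 100,000 people is one possible way to compare countries of different sizes. The sketch below uses a made-up `toy_data` frame, and the column `rate_per_100k` is a name introduced here for illustration. On the real data, summing `population` across years gives person-years, so the result would be an average annual rate.

```r
library(dplyr)

# Toy stand-in for suicide_by_country; all values are invented.
toy_data <- data.frame(
  country = c("A", "A", "B", "B"),
  suicide_total = c(50, 150, 30, 30),
  population = c(500000, 500000, 100000, 100000)
)

# Sum deaths and population per country, then compute deaths per 100,000 people.
country_rates <- toy_data %>%
  group_by(country) %>%
  summarise(
    suicide_total = sum(suicide_total),
    population = sum(population),
    .groups = "drop"
  ) %>%
  mutate(rate_per_100k = suicide_total / population * 1e5) %>%
  arrange(desc(rate_per_100k))

country_rates
```

In this toy example, country A has more deaths in total, but country B has the higher rate.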

Conclusion:

Taking a deeper look at the United States, Japan, France, and Italy, we see that the United States has the largest population and the most suicide deaths from 1985 to 2016. The topic of the data is suicide prevention. Through this analysis of suicide deaths, we can research further why suicide is so high in that country.

The variables used in this analysis were age, gender, population, generation, suicide total, and country. The country and generation variables are categorical; population and suicide total are quantitative. The dataset came from Kaggle, but the data originated from the World Bank, the World Health Organization, Szamil, and the United Nations Development Program. I cleaned the data by renaming the least descriptive variables: “sex” to “gender” and “suicides_no” to “suicide_total”.

The World Health Organization goes into more detail about the reality of suicide, stating: “More than 700 000 people die by suicide every year. Furthermore, for each suicide, there are more than 20 suicide attempts. Suicides and suicide attempts have a ripple effect that impacts on families, friends, colleagues, communities and societies.” (World Health Organization, 2019). Most importantly, suicide can be prevented.

The visualizations above represent an analysis of the total count of suicide deaths across different countries, generations, genders, and age ranges. The surprising but scary reality is that the United States is above most of the countries listed. One thing I wish I could have worked out was an alluvial plot for generation in this dataset.

References:

World Health Organization. (2019). Suicide. Retrieved from https://www.who.int/health-topics/suicide#tab=tab_1