Good day. Today we will look at a Kaggle dataset on suicide deaths from 1985 to 2016. The dataset combines 4 different datasets and was created to find signals related to increased suicide rates in various groups globally, across the socio-economic spectrum. We will be looking at variables such as age range, gender, suicide total, and generation.
This is the link to the dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016?select=master.csv
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
Suicide Prevention.
install.packages("tidyverse")
install.packages("highcharter")
install.packages("plotly")
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("dplyr")
library("highcharter")
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library("RColorBrewer")
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library("treemap")
Set working directory to load the dataset
setwd("C:/Users/User/Documents/Data_Science/datasets")
suicide_dataset <- read_csv("suicide_rates.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## country = col_character(),
## year = col_double(),
## sex = col_character(),
## age = col_character(),
## suicides_no = col_double(),
## population = col_double(),
## `suicides/100k pop` = col_double(),
## `country-year` = col_character(),
## `HDI for year` = col_double(),
## `gdp_for_year ($)` = col_number(),
## `gdp_per_capita ($)` = col_double(),
## generation = col_character()
## )
Use base R’s summary() function to get a summary of suicide_dataset
summary(suicide_dataset)
## country year sex age
## Length:27820 Min. :1985 Length:27820 Length:27820
## Class :character 1st Qu.:1995 Class :character Class :character
## Mode :character Median :2002 Mode :character Mode :character
## Mean :2001
## 3rd Qu.:2008
## Max. :2016
##
## suicides_no population suicides/100k pop country-year
## Min. : 0.0 Min. : 278 Min. : 0.00 Length:27820
## 1st Qu.: 3.0 1st Qu.: 97498 1st Qu.: 0.92 Class :character
## Median : 25.0 Median : 430150 Median : 5.99 Mode :character
## Mean : 242.6 Mean : 1844794 Mean : 12.82
## 3rd Qu.: 131.0 3rd Qu.: 1486143 3rd Qu.: 16.62
## Max. :22338.0 Max. :43805214 Max. :224.97
##
## HDI for year gdp_for_year ($) gdp_per_capita ($) generation
## Min. :0.483 Min. :4.692e+07 Min. : 251 Length:27820
## 1st Qu.:0.713 1st Qu.:8.985e+09 1st Qu.: 3447 Class :character
## Median :0.779 Median :4.811e+10 Median : 9372 Mode :character
## Mean :0.777 Mean :4.456e+11 Mean : 16866
## 3rd Qu.:0.855 3rd Qu.:2.602e+11 3rd Qu.: 24874
## Max. :0.944 Max. :1.812e+13 Max. :126352
## NA's :19456
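The summary reports 19456 missing values in `HDI for year` out of 27820 rows, which is roughly 70% of the column. A minimal sketch of how missing values could be counted for every column at once is shown below. The `toy` data frame and its values are made up for illustration and only stand in for suicide_dataset.

```r
# Toy stand-in for suicide_dataset; the values are invented for illustration.
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  suicides_no = c(3, 25, 0, 131),
  `HDI for year` = c(NA, 0.713, NA, NA),
  check.names = FALSE  # keep the column name with spaces as-is
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of rows with a missing value, per column
na_share <- colMeans(is.na(toy))
na_share
```

Running the same two calls on suicide_dataset should show whether any column besides `HDI for year` has gaps before it is used in a plot.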
Rename the “sex” variable to “gender” and “suicides_no” to “suicide_total”
suicide_data <- suicide_dataset %>%
rename(
gender = sex,
suicide_total = suicides_no
)
Let’s make a plot of the total suicide deaths by gender
plot1 <- ggplot(suicide_data, aes(x = year, y = suicide_total, color=gender)) +
xlab("Year") +
ylab("Suicide deaths count") +
theme_minimal(base_size = 12)
plot1 + geom_point()
From the visual above we see that suicide death counts are higher among males than among females.
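Each point in the scatter plot is a single country, year, gender, and age group row, so it does not show yearly totals directly. One possible way to get a total per year and gender is to aggregate first, as in the sketch below. The `toy` tibble is invented for illustration and only mirrors the column names of suicide_data.

```r
library(dplyr)

# Toy stand-in for suicide_data; the values are invented for illustration.
toy <- tibble(
  year = c(2000, 2000, 2000, 2000, 2001, 2001),
  gender = c("male", "male", "female", "female", "male", "female"),
  suicide_total = c(10, 20, 5, 3, 40, 12)
)

# Sum the deaths within each year and gender combination
yearly_by_gender <- toy %>%
  group_by(year, gender) %>%
  summarise(total = sum(suicide_total), .groups = "drop")

yearly_by_gender
```

The aggregated table could then be passed to ggplot() with geom_line() to compare the two genders over time.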
generation_suicide_plot <- ggplot(suicide_data, aes(x = year, y = suicide_total,color= generation, text = paste("Country:", country))) +
geom_point(alpha = 0.5) +
ggtitle("Suicide deaths") +
xlab("Year") +
ylab("Suicide deaths count") +
theme_minimal(base_size = 12)
generation_suicide_plot <- ggplotly(generation_suicide_plot)
generation_suicide_plot
From 1985 to 2016, the highest suicide counts were recorded in Russia, in the Boomers generation, whose ages ranged from 35 to 54.
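Reading the largest values off an interactive plot works, but the rows with the highest counts can also be pulled out directly. The sketch below uses dplyr's slice_max() on a made-up `toy` tibble; the countries and numbers are placeholders, not values from the dataset.

```r
library(dplyr)

# Toy stand-in for suicide_data; countries and values are placeholders.
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1994, 1995, 1994, 1995),
  generation = c("Boomers", "Generation X", "Boomers", "Silent"),
  suicide_total = c(900, 120, 300, 80)
)

# Keep the two rows with the largest suicide_total, largest first
top_rows <- toy %>%
  slice_max(suicide_total, n = 2)

top_rows
```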
Filter the total number of suicide deaths by four different countries: Japan, United States, France, and Italy
suicide_by_country <- suicide_data %>%
filter( country == "Japan" | country == "United States" | country == "France" | country == "Italy")%>%
arrange(suicide_total)
suicide_by_country
## # A tibble: 1,476 x 12
## country year gender age suicide_total population `suicides/100k pop`
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Italy 2006 female 5-14 years 0 2692378 0
## 2 Italy 1995 female 5-14 years 1 2778756 0.04
## 3 Italy 1998 female 5-14 years 1 2707786 0.04
## 4 Italy 2010 female 5-14 years 1 2743696 0.04
## 5 France 1996 female 5-14 years 2 3562100 0.06
## 6 Italy 1991 female 5-14 years 2 3110000 0.06
## 7 Italy 2000 female 5-14 years 2 2688032 0.07
## 8 Italy 2005 female 5-14 years 2 2656241 0.08
## 9 Italy 2008 male 5-14 years 2 2866857 0.07
## 10 Italy 2009 female 5-14 years 2 2725983 0.07
## # ... with 1,466 more rows, and 5 more variables: country-year <chr>,
## # HDI for year <dbl>, gdp_for_year ($) <dbl>, gdp_per_capita ($) <dbl>,
## # generation <chr>
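The chained `|` comparisons in the filter can also be written with `%in%`, which tends to be easier to extend when more countries are added. A minimal sketch on a made-up `toy` tibble follows; the `selected` vector is a name introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for suicide_data; the values are invented for illustration.
toy <- tibble(
  country = c("Japan", "Brazil", "Italy", "France", "Spain", "United States"),
  suicide_total = c(5, 9, 1, 3, 7, 2)
)

# Countries of interest, kept in one place
selected <- c("Japan", "United States", "France", "Italy")

# Keep only rows whose country is in the list, then sort by count
toy_selected <- toy %>%
  filter(country %in% selected) %>%
  arrange(suicide_total)

toy_selected
```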
suicide_histogram <- suicide_by_country %>%
ggplot() +
# geom_col() plots the y values as given, so stat = "identity" is not needed
geom_col(aes(x = year, y = suicide_total, fill = age),
position = "dodge") +
ggtitle("Total number of suicides") +
ylab("Number of deaths by suicide") +
labs(fill = "Ages")
suicide_histogram
The age group with the highest suicide total was 35-54 years.
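A claim like this can be checked numerically rather than read off the chart. The sketch below uses dplyr's count() with a weight to total the deaths per age group and sort them; the `toy` tibble is made up for illustration.

```r
library(dplyr)

# Toy stand-in for suicide_by_country; the values are invented for illustration.
toy <- tibble(
  age = c("35-54 years", "15-24 years", "35-54 years", "75+ years", "15-24 years"),
  suicide_total = c(100, 20, 150, 40, 10)
)

# Sum suicide_total within each age group and sort from largest to smallest
totals_by_age <- toy %>%
  count(age, wt = suicide_total, sort = TRUE, name = "total")

totals_by_age
```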
treemap(suicide_by_country, index="country", vSize="population",
vColor="suicide_total", type="value",
palette="RdYlBu",
title = "Suicidal deaths by country and population")
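The treemap sizes each country by population and colors it by suicide count, so a large country will tend to show a large count. To compare countries on a more equal footing, a rate per 100,000 people could be computed as well. The sketch below is one way to do that on a made-up `toy` tibble. Summing population over all rows is an assumption made here for simplicity; it treats each row as a separate group of people.

```r
library(dplyr)

# Toy stand-in for suicide_by_country; the values are invented for illustration.
toy <- tibble(
  country = c("A", "A", "B", "B"),
  suicide_total = c(50, 150, 30, 10),
  population = c(1e6, 1e6, 1e5, 1e5)
)

# Total deaths and total population per country, then deaths per 100k people
rate_by_country <- toy %>%
  group_by(country) %>%
  summarise(
    deaths = sum(suicide_total),
    population = sum(population),
    .groups = "drop"
  ) %>%
  mutate(rate_per_100k = deaths / population * 1e5)

rate_by_country
```

In this toy data, country A has more deaths in total but a lower rate than country B, which is the kind of difference a count-based view can hide.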
Taking a deeper look at the United States, Japan, France, and Italy, we see that the United States has the largest population and the most suicide deaths of the four from 1985 to 2016. The topic of the data is suicide prevention, and this analysis of suicide deaths is a starting point for researching why suicide is so high in that country.
The variables used in this analysis were age, gender, population, generation, suicide total, and country. The country and generation variables are categorical; population and suicide total are quantitative. The dataset used in this project came from Kaggle, but the data originated from the World Bank, the World Health Organization, Szamil, and the United Nations Development Program. I cleaned the data by renaming the least descriptive variables: “sex” to “gender” and “suicides_no” to “suicide_total”.
The World Health Organization goes into more detail about the reality of suicide, stating: “More than 700 000 people die by suicide every year. Furthermore, for each suicide, there are more than 20 suicide attempts. Suicides and suicide attempts have a ripple effect that impacts on families, friends, colleagues, communities and societies.” (World Health Organization, 2019) Most importantly, suicide can be prevented.
The visualizations above present an analysis of the total count of suicide deaths across countries, generations, genders, and age ranges. The surprising but scary reality is that the United States is above most of the countries listed. One thing I wish I could have worked out was an alluvial plot for generation in this dataset.
References:
World Health Organization. (2019). Suicide. Retrieved from https://www.who.int/health-topics/suicide#tab=tab_1