library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
I have chosen the dataset “Suicide Rates Overview 1985 to 2016”; the source is: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
The dataset covers the years 1985 to 2016. Its demographic variables are country, sex, age group, population, and generation. Its economic variables are GDP for the year and GDP per capita. The suicide data consist of the raw number of suicides and the number of suicides per 100,000 population.
The dataset includes five character variables and seven numeric variables.
I have chosen the variable ‘suicides/100k pop’ instead of ‘suicides_no’ for my analysis. ‘suicides_no’ is the raw number of reported suicides for an age/gender group in a given country and year. ‘suicides/100k pop’ is the number of suicides per 100,000 people of that age/gender group in that country and year. I think ‘suicides/100k pop’ works better as a normalized measure for comparisons, so I use it throughout the analysis.
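As a small check of the normalization described above, the sketch below recomputes the rate per 100k from ‘suicides_no’ and ‘population’ for the first five Albania 1987 rows printed later in this report. The names `albania_1987`, `rate_100k`, and `pooled_rate` are illustrative and are not part of the analysis; the pooled rate is shown only as one possible alternative way to combine groups.

```r
library(dplyr)

# First five rows of the printed tibble (Albania, 1987)
albania_1987 <- tibble(
  sex         = c("male", "male", "female", "male", "male"),
  suicides_no = c(21, 16, 14, 1, 9),
  population  = c(312900, 308000, 289700, 21800, 274300)
)

# Rate per 100k population for each age/gender group
rates <- albania_1987 %>%
  mutate(rate_100k = round(suicides_no / population * 1e5, 2))

rates$rate_100k
# 6.71 5.19 4.83 4.59 3.28 (matches suicides_100k_pop in the dataset)

# Alternative (not used in this report): one pooled rate across the groups
pooled_rate <- sum(albania_1987$suicides_no) / sum(albania_1987$population) * 1e5
```

The per-group values reproduce the ‘suicides_100k_pop’ column, which suggests the column is simply suicides divided by population and scaled to 100,000.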
getwd()
## [1] "C:/Users/small/Desktop/MCollege/2021/DATA110/week6"
setwd("C:/Users/small/Desktop/MCollege/2021/DATA110/Datasets")
suicide_world <- read_csv("Suicide_world.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## country = col_character(),
## year = col_double(),
## sex = col_character(),
## age = col_character(),
## suicides_no = col_double(),
## population = col_double(),
## suicides_100k_pop = col_double(),
## country_year = col_character(),
## `hdi for year` = col_double(),
## gdp_for_year = col_number(),
## gdp_per_capita = col_double(),
## generation = col_character()
## )
suicide_world
## # A tibble: 27,820 x 12
## country year sex age suicides_no population suicides_100k_p~
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Albania 1987 male 15-2~ 21 312900 6.71
## 2 Albania 1987 male 35-5~ 16 308000 5.19
## 3 Albania 1987 fema~ 15-2~ 14 289700 4.83
## 4 Albania 1987 male 75+ ~ 1 21800 4.59
## 5 Albania 1987 male 25-3~ 9 274300 3.28
## 6 Albania 1987 fema~ 75+ ~ 1 35600 2.81
## 7 Albania 1987 fema~ 35-5~ 6 278800 2.15
## 8 Albania 1987 fema~ 25-3~ 4 257200 1.56
## 9 Albania 1987 male 55-7~ 1 137500 0.73
## 10 Albania 1987 fema~ 5-14~ 0 311000 0
## # ... with 27,810 more rows, and 5 more variables: country_year <chr>, `hdi for
## # year` <dbl>, gdp_for_year <dbl>, gdp_per_capita <dbl>, generation <chr>
In this chunk I check the types of the variables and the structure of the dataset.
str(suicide_world)
## spec_tbl_df [27,820 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:27820] "Albania" "Albania" "Albania" "Albania" ...
## $ year : num [1:27820] 1987 1987 1987 1987 1987 ...
## $ sex : chr [1:27820] "male" "male" "female" "male" ...
## $ age : chr [1:27820] "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
## $ suicides_no : num [1:27820] 21 16 14 1 9 1 6 4 1 0 ...
## $ population : num [1:27820] 312900 308000 289700 21800 274300 ...
## $ suicides_100k_pop: num [1:27820] 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country_year : chr [1:27820] "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
## $ hdi for year : num [1:27820] NA NA NA NA NA NA NA NA NA NA ...
## $ gdp_for_year : num [1:27820] 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
## $ gdp_per_capita : num [1:27820] 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : chr [1:27820] "Generation X" "Silent" "Generation X" "G.I. Generation" ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double(),
## .. suicides_100k_pop = col_double(),
## .. country_year = col_character(),
## .. `hdi for year` = col_double(),
## .. gdp_for_year = col_number(),
## .. gdp_per_capita = col_double(),
## .. generation = col_character()
## .. )
To begin, I wanted to see the overall picture of suicide in the world. For that, I grouped the dataset by country and summed the suicides per 100k population across all years and demographic groups.
I then arranged the countries from the largest total to the smallest. The top ten countries are the Russian Federation, Lithuania, Hungary, Kazakhstan, Republic of Korea, Austria, Ukraine, Japan, Finland, and Belgium. Four of these ten (the Russian Federation, Lithuania, Kazakhstan, and Ukraine) are former Soviet republics.
suicide_countries <- suicide_world %>%
  group_by(country) %>%
  summarise(total_suicide = sum(suicides_100k_pop)) %>%
  arrange(desc(total_suicide))
suicide_countries
## # A tibble: 101 x 2
## country total_suicide
## <chr> <dbl>
## 1 Russian Federation 11305.
## 2 Lithuania 10589.
## 3 Hungary 10156.
## 4 Kazakhstan 9520.
## 5 Republic of Korea 9350.
## 6 Austria 9076.
## 7 Ukraine 8932.
## 8 Japan 8025.
## 9 Finland 7924.
## 10 Belgium 7900.
## # ... with 91 more rows
I created a plot to see the general worldwide trend from 1985 to 2016 across all countries. I selected year, country, and suicides per 100k population, grouped by year to see the change over time, and summed the rates. The plot shows that the trend rises from 1985 to 1995, stays at about the same level until 2002, and then declines.
suicide_year <- suicide_world %>%
  select(country, year, suicides_100k_pop) %>%
  group_by(year) %>%
  summarise(total_suicide = sum(suicides_100k_pop))
suicide_year
## # A tibble: 32 x 2
## year total_suicide
## * <dbl> <dbl>
## 1 1985 6812.
## 2 1986 6580.
## 3 1987 7545.
## 4 1988 7473.
## 5 1989 8037.
## 6 1990 9879.
## 7 1991 10321.
## 8 1992 10529.
## 9 1993 10790.
## 10 1994 11484.
## # ... with 22 more rows
ggplot(data = suicide_year, aes(x = year, y = total_suicide)) +
  geom_line(color = 'red') +
  # shape 3 (a plus sign) has no fill, so the point color is set with `color`
  geom_point(color = 'black', shape = 3) +
  labs(title = 'Rate of Suicide in the World', subtitle = 'From 1985 to 2016',
       x = 'Year', y = 'Total suicides') +
  theme_minimal()
I created a new dataset to focus on the top ten countries with the highest rates of suicide.
suicide_top_10 <- suicide_world %>%
  filter(country %in% c('Russian Federation', 'Lithuania', 'Hungary', 'Kazakhstan',
                        'Republic of Korea', 'Austria', 'Ukraine', 'Japan',
                        'Finland', 'Belgium'))
suicide_top_10
## # A tibble: 3,390 x 12
## country year sex age suicides_no population suicides_100k_p~
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Austria 1985 male 75+ ~ 152 156535 97.1
## 2 Austria 1985 male 55-7~ 355 584253 60.8
## 3 Austria 1985 male 35-5~ 515 940526 54.8
## 4 Austria 1985 male 25-3~ 232 548783 42.3
## 5 Austria 1985 fema~ 75+ ~ 110 339223 32.4
## 6 Austria 1985 male 15-2~ 207 653728 31.7
## 7 Austria 1985 fema~ 55-7~ 220 842978 26.1
## 8 Austria 1985 fema~ 35-5~ 186 936799 19.8
## 9 Austria 1985 fema~ 25-3~ 56 544765 10.3
## 10 Austria 1985 fema~ 15-2~ 50 633592 7.89
## # ... with 3,380 more rows, and 5 more variables: country_year <chr>, `hdi for
## # year` <dbl>, gdp_for_year <dbl>, gdp_per_capita <dbl>, generation <chr>
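The ten country names above were typed by hand from the ranking. As a hedged alternative, the names could be derived from the ranking itself. The sketch below uses a made-up stand-in table (`toy`, with invented values and country labels) and takes the top 2 instead of the top 10; it assumes dplyr 1.0.0 or later for `slice_max()`.

```r
library(dplyr)

# Toy stand-in for suicide_world (values are invented for illustration)
toy <- tibble(
  country           = c("A", "A", "B", "B", "C", "C", "D", "D"),
  suicides_100k_pop = c(10, 20, 1, 2, 30, 40, 5, 5)
)

# Rank countries by their summed rate and keep the names of the top n
top_names <- toy %>%
  group_by(country) %>%
  summarise(total_suicide = sum(suicides_100k_pop)) %>%
  slice_max(total_suicide, n = 2) %>%
  pull(country)

# Keep only the rows that belong to those countries
toy_top <- toy %>%
  filter(country %in% top_names)
```

With the real data, replacing `toy` with `suicide_world` and `n = 2` with `n = 10` should select the same ten countries, provided there are no ties at the cutoff.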
For the plots that follow, I built a new data frame grouped by year, country, and GDP per capita, and summed the suicides per 100k population within each group.
suicide_10_by_year <- suicide_top_10 %>%
  group_by(year, country, gdp_per_capita) %>%
  summarise(total_suicide = sum(suicides_100k_pop))
## `summarise()` has grouped output by 'year', 'country'. You can override using the `.groups` argument.
suicide_10_by_year
## # A tibble: 283 x 4
## # Groups: year, country [283]
## year country gdp_per_capita total_suicide
## <dbl> <chr> <dbl> <dbl>
## 1 1985 Austria 9759 385.
## 2 1985 Belgium 9356 332.
## 3 1985 Japan 12401 300.
## 4 1985 Republic of Korea 2731 147.
## 5 1986 Austria 13911 402.
## 6 1986 Belgium 12992 315.
## 7 1986 Japan 18288 324.
## 8 1986 Republic of Korea 3078 143.
## 9 1987 Austria 17415 406.
## 10 1987 Belgium 16165 328.
## # ... with 273 more rows
The first plot for the top ten countries shows the suicide trend by year. The rate of each country (except the Republic of Korea) follows the general worldwide trend.
ggplot(data = suicide_10_by_year) +
  geom_line(aes(x = year, y = total_suicide, group = country, color = country)) +
  labs(title = 'Rate of Suicide in Top 10 Countries', subtitle = 'From 1985 to 2016',
       x = 'Year', y = 'Total suicide', color = 'Country') +
  theme_minimal()
To see the distribution by gender, I grouped by sex and summed the suicides per 100k population. Across the top 10 countries, the male total is about 3.7 times the female total (73,018 versus 19,760). Gender is therefore a factor in suicide.
suicide_10_sex <- suicide_top_10 %>%
  group_by(sex) %>%
  summarise(total_by_gender = sum(suicides_100k_pop))
suicide_10_sex
## # A tibble: 2 x 2
## sex total_by_gender
## * <chr> <dbl>
## 1 female 19760.
## 2 male 73018.
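The male-to-female ratio quoted in this section can be recomputed directly from the two totals printed above. This is a minimal sketch that re-enters the rounded totals by hand; `totals` and `ratio` are illustrative names.

```r
# Totals copied from the printed summary (rounded to whole numbers)
totals <- c(female = 19760, male = 73018)

# Ratio of the male total to the female total
ratio <- unname(totals["male"] / totals["female"])
round(ratio, 2)
# 3.7
```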
The box plot by gender shows that the distribution of male suicide rates is considerably higher than that of female rates.
plot1 <- suicide_top_10 %>%
  ggplot(aes(sex, suicides_100k_pop, fill = sex)) +
  ggtitle("The Distribution of Suicide by Gender") +
  xlab("Sex") +
  # the y axis shows the rate itself, not a count of observations
  ylab("Suicides per 100k population") +
  geom_boxplot() +
  scale_fill_discrete(name = "Sex", labels = c("Female", "Male")) +
  theme_minimal()
plot1
To see the distribution by age, I grouped by age and summed the suicides per 100k population. Across the top 10 countries, the suicide rate rises with age. Age is therefore a factor in suicide.
suicide_10_age <- suicide_top_10 %>%
  group_by(age) %>%
  summarise(total_by_age = sum(suicides_100k_pop)) %>%
  arrange(desc(total_by_age))
suicide_10_age
## # A tibble: 6 x 2
## age total_by_age
## <chr> <dbl>
## 1 75+ years 29821.
## 2 55-74 years 20239.
## 3 35-54 years 19019.
## 4 25-34 years 14117.
## 5 15-24 years 9038.
## 6 5-14 years 545.
I checked the correlation between two variables for the top 10 countries. The independent variable is GDP per capita, and the dependent variable is the summed suicide rate. There is a moderate negative correlation, which means that lower GDP per capita tends to go with higher suicide rates.
cor(suicide_10_by_year$gdp_per_capita, suicide_10_by_year$total_suicide)
## [1] -0.5586889
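To make the meaning of this coefficient concrete, the sketch below shows that `cor()` returns Pearson's r, the covariance divided by the product of the standard deviations. It also computes the correlation on log-transformed GDP. The vectors `gdp` and `rate` are invented values, not taken from the dataset.

```r
# Hypothetical values, for illustration only
gdp  <- c(1000, 2000, 4000, 8000, 16000)
rate <- c(50, 42, 37, 30, 22)

# Pearson correlation, built in and by hand
r_builtin <- cor(gdp, rate)
r_manual  <- cov(gdp, rate) / (sd(gdp) * sd(rate))

# Correlation with log-transformed GDP
r_log <- cor(log(gdp), rate)

c(r_builtin = r_builtin, r_manual = r_manual, r_log = r_log)
```

In this toy example the relationship is closer to linear on the log scale, so `r_log` is stronger than `r_builtin`. Whether the same holds for `suicide_10_by_year` would need to be checked on the real data.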
For the regression plot, I added a second layer (with stat_smooth) to draw the line of best fit, here a linear model of y on log(x), and turned off the confidence band with the se argument.
ggplot(data = suicide_10_by_year, aes(x = gdp_per_capita, y = total_suicide)) +
  geom_point(aes(color = country)) +
  stat_smooth(method = "lm", formula = y ~ log(x), linetype = 5, color = 'red', se = FALSE) +
  labs(title = "Regression Between Suicide and GDP per Capita",
       subtitle = 'For top 10 countries, from 1985 to 2016',
       y = 'Total suicide', x = 'GDP per capita', color = 'Country') +
  theme_minimal()
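The smoothing layer draws the fitted line but does not report its coefficients. A minimal sketch of fitting the same kind of model with `lm()` follows. The data frame `toy_gdp` and its values are invented for illustration, and only the column names are borrowed from the report.

```r
# Hypothetical data, for illustration only
toy_gdp <- data.frame(
  gdp_per_capita = c(1000, 2000, 4000, 8000, 16000, 32000),
  total_suicide  = c(400, 350, 310, 250, 210, 150)
)

# Same model form as the smoothing layer: y regressed on log(x)
fit <- lm(total_suicide ~ log(gdp_per_capita), data = toy_gdp)

# Intercept and slope; a negative slope means the rate falls as GDP rises
coef(fit)

# Predicted value at a GDP per capita of 10,000
predict(fit, newdata = data.frame(gdp_per_capita = 10000))
```

Running `lm(total_suicide ~ log(gdp_per_capita), data = suicide_10_by_year)` on the real data frame would give the coefficients behind the red dashed line in the plot.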
The top ten countries by summed suicide rate are the Russian Federation, Lithuania, Hungary, Kazakhstan, Republic of Korea, Austria, Ukraine, Japan, Finland, and Belgium. Four of the ten are from the former USSR. A follow-up to this analysis might research mental health policies and economic factors in the former Soviet republics. My suggestion is that the traumatic experience of the collapse of the USSR and the difficult transition to a market economy may be a main reason for the high suicide rates in these countries.
The plot shows that the trend rises from 1985 to 1995, stays at about the same level from 1995 to 2002, and then declines. To follow up on this analysis, it would be interesting to research what happened in the world (for instance, in the pharmaceutical industry) that may have lowered the suicide rate after 2002.
Another approach that might help explain the 1995 peak and the decline after 2002 is to analyze the ‘generation’ variable. The G.I. (Greatest) Generation is the cohort of people born from 1901 to 1927. This generation lived through the Great Depression in the USA and two World Wars. In 1995, at the peak, people from the G.I. Generation were 68 to 94 years old, and this age range has the highest suicide rate. Their traumatic life experience might be a reason for the high level of suicide at the end of their lives, once the primary obligation of raising their children had been fulfilled. Looking back at the decline from 2002 to 2016, the people who committed suicide were from the Silent and Baby Boomer generations. These two generations differ significantly from the G.I. Generation, so there is probably a collective pattern of generational behavior at the macro level.
The plot of the suicide trend by country and year looks similar to the general trend (except for Korea). The countries differ in how steeply their rates rise before 2012, but afterward they decline in a very similar way. The top 10 include both developing and developed countries, which differ significantly in political, social, demographic, and historical factors; nevertheless, the suicide rate tends to decrease in all of them. There is one unifying factor: the same generations in the period under review.
There are two demographic trends: a) the summed suicide rate for men is about 3.7 times that for women; b) people aged 75 and over are at the highest risk of suicide.
The regression analysis shows a moderate negative relationship between GDP per capita and the suicide rate. I expected this result after seeing that four of the top 10 countries are from the former USSR. The standard of living in the former Soviet republics is, in general, lower than in developed countries. Thus, greater national wealth appears to go with a somewhat lower suicide rate.