library(ggplot2)
library(knitr)
library(imager)
library(DATA606)

Suicide Rates around the world in 2015

Introduction

Suicide is a significant public health problem worldwide. Globalization, unemployment and other factors have been suggested as predictors of suicide. Many factors can contribute to suicide, but in this study I take only a few variables from the data set and examine their relationship with the number of suicides globally. Suicides occur among both males and females and in almost every age group. According to some research, financial circumstances or a country’s overall economy may play a role.

The project’s objective is to find out whether the number of suicides differs by sex and by age group. The results could give a broad picture of some predictors of suicide and so help governments and relevant institutions working to reduce suicide rates. Because the data set contains a limited number of variables, only a few are discussed.

Data

The data come from a free, publicly available Kaggle dataset, which can be downloaded from https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016. It was compiled from four datasets linked by time and place to give a better understanding of suicides globally. The sources are the WHO, the World Bank, the UNDP and a dataset published on Kaggle.

The dataset contains the number of suicides from 1985 to 2016 for different countries around the world. It has 27,820 observations and 12 variables, including both the predictors and the dependent variables. Number of suicides and suicides per 100K population are the outcomes; year, sex, age, population, GDP, GDP per capita and generation are the predictors. Country is a string variable that identifies the country of each observation.

This is an observational study: the data set already exists, and we run a few statistical tests on it to draw conclusions. The main objective is to find out whether suicides differed by sex and by age group globally in 2015. For simplicity we do not analyse the time series and focus on 2015 only, which makes the data cross-sectional: many countries and several variables observed in a single year. As discussed above, the results could help governments and relevant institutions make policies to prevent suicides. There are many other predictors of suicide, but because of the limitations of the data set only the variables listed above are considered.
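The restriction to 2015 described above can be sketched as follows. This is only an illustration on a made-up toy table: the data frame `full`, its values, and the commented-out `write.csv()` call are assumptions, not the steps actually used to produce the file read below.

```r
# Toy stand-in for the full Kaggle table; all values are made up for illustration.
full <- data.frame(
  country     = c("A", "A", "B", "B"),
  year        = c(2014, 2015, 2015, 2016),
  sex         = c("male", "female", "male", "female"),
  age         = c("15-24 years", "15-24 years", "35-54 years", "75+ years"),
  suicides_no = c(10, 4, 25, 7),
  stringsAsFactors = FALSE
)

# Keep only the 2015 rows, i.e. a single-year cross-section.
only_2015 <- subset(full, year == 2015)

# Optionally save the subset (hypothetical step; file name mirrors the one used in this report).
# write.csv(only_2015, "master2015.csv", row.names = FALSE)

print(only_2015)
```

Filtering before saving keeps the analysis file small and guarantees that every later summary refers to one year only.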

# fileEncoding = "UTF-8-BOM" strips the byte-order mark that otherwise garbles the first column name
suicide2 <- read.csv("master2015.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE,
                     fileEncoding = "UTF-8-BOM")

Exploratory Data analysis

First, let’s check the summary statistics of the data.

summary(suicide2)
##     country              year          sex                age           
##  Length:744         Min.   :2015   Length:744         Length:744        
##  Class :character   1st Qu.:2015   Class :character   Class :character  
##  Mode  :character   Median :2015   Mode  :character   Mode  :character  
##                     Mean   :2015                                        
##                     3rd Qu.:2015                                        
##                     Max.   :2015                                        
##   suicides_no        population       suicides.100k.pop country.year      
##  Min.   :    0.0   Min.   :    1076   Min.   :  0.000   Length:744        
##  1st Qu.:    5.0   1st Qu.:  178922   1st Qu.:  1.308   Class :character  
##  Median :   34.0   Median :  580235   Median :  6.065   Mode  :character  
##  Mean   :  273.7   Mean   : 2385293   Mean   : 11.094                     
##  3rd Qu.:  155.2   3rd Qu.: 2317470   3rd Qu.: 14.232                     
##  Max.   :11634.0   Max.   :41658010   Max.   :140.740                     
##  HDI.for.year   gdp_for_year....    gdp_per_capita....  generation       
##  Mode:logical   Min.   :7.567e+08   Min.   :  1285     Length:744        
##  NA's:744       1st Qu.:3.580e+10   1st Qu.:  9431     Class :character  
##                 Median :1.811e+11   Median : 15116     Mode  :character  
##                 Mean   :7.802e+11   Mean   : 26231                       
##                 3rd Qu.:4.979e+11   3rd Qu.: 42830                       
##                 Max.   :1.812e+13   Max.   :107456

Now let’s explore the data to see the pattern of suicides with reference to gender and age group in 2015.

ggplot(suicide2, aes(y=suicides_no, x=age))+geom_boxplot()+coord_flip()

ggplot(suicide2, aes(x=suicides_no))+geom_freqpoly(mapping=aes(color=age), binwidth=100)

The graphs above show a few extremely high values that look like outliers. After some research, we found that they are not data errors: a large number of people committed suicide in the US in 2015, the highest level in the last few decades (http://theconversation.com/in-2015-more-people-committed-suicide-in-u-s-jails-than-over-the-last-decade-45196).

ggplot(suicide2, aes(y=suicides_no, x=sex))+geom_boxplot()+coord_flip()

ggplot(suicide2, aes(x=suicides_no))+geom_freqpoly(mapping=aes(color=sex), binwidth=100)

The largest value in the data, 11634, is the number of suicides recorded for a single sex and age group in the United States in 2015. We cannot simply remove this so-called outlier because it is a genuine part of the data. The number of suicides per observation ranges from 0 up to 11634 across the countries in 2015.
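One way to see which observations produce the extreme values is to sort by the count and inspect the top rows. The sketch below uses a made-up data frame `toy`; only the column names follow this report, and the numbers are invented for illustration.

```r
# Made-up rows standing in for suicide2; values are illustrative only.
toy <- data.frame(
  country     = c("A", "B", "C", "D"),
  sex         = c("male", "male", "female", "male"),
  age         = c("15-24 years", "35-54 years", "75+ years", "55-74 years"),
  suicides_no = c(12, 9500, 40, 300),
  stringsAsFactors = FALSE
)

# Sort from largest to smallest count and keep the two largest rows.
top <- head(toy[order(toy$suicides_no, decreasing = TRUE), ], 2)
print(top)
```

Looking at the actual rows, rather than only at the boxplot, makes it easier to decide whether an extreme value is an error or a real observation.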

The following map shows suicides in different countries in 2015. The size of each point reflects the suicide figure for that country. The United States, Russia and Japan stand out as the countries with the highest numbers of suicides in 2015.

# Load the saved map image from its local path and display it.
# system.file() only resolves files inside an installed package, so it is not needed here.
im <- load.image('C:/Users/hukha/OneDrive/Desktop/MS - Data Science/Data 606 - Statistics and Probability for data science/Data 606/3.png')
plot(im)

Inference

ANOVA - To Check the difference in suicide rates among the age groups

Ho: The mean number of suicides in 2015 is the same across all age groups globally

Ha: At least one age group has a different mean number of suicides in 2015 globally

Before running the ANOVA, let’s look at the mean number of suicides in each age group to see whether a difference is apparent.

by(suicide2$suicides_no, suicide2$age, mean)
## suicide2$age: 15-24 years
## [1] 176.1371
## -------------------------------------------------------- 
## suicide2$age: 25-34 years
## [1] 259.0161
## -------------------------------------------------------- 
## suicide2$age: 35-54 years
## [1] 574.9839
## -------------------------------------------------------- 
## suicide2$age: 5-14 years
## [1] 13.55645
## -------------------------------------------------------- 
## suicide2$age: 55-74 years
## [1] 432.1613
## -------------------------------------------------------- 
## suicide2$age: 75+ years
## [1] 186.4032

These group means suggest that the number of suicides differs across age groups. The 5-14 age group has the lowest mean number of suicides, while the 35-54 age group has the highest.

Now let’s test the difference with ANOVA. Before running the analysis, we have to check that its conditions are met.

Conditions

Let’s check the length of each group before discussing the conditions.

by(suicide2$suicides_no, suicide2$age, length)
## suicide2$age: 15-24 years
## [1] 124
## -------------------------------------------------------- 
## suicide2$age: 25-34 years
## [1] 124
## -------------------------------------------------------- 
## suicide2$age: 35-54 years
## [1] 124
## -------------------------------------------------------- 
## suicide2$age: 5-14 years
## [1] 124
## -------------------------------------------------------- 
## suicide2$age: 55-74 years
## [1] 124
## -------------------------------------------------------- 
## suicide2$age: 75+ years
## [1] 124

1- Normality / Sample size - The data do not look normal, but with 124 observations in each group the sample is large enough that we treat this condition as met.

2- Independence between the groups - Each observation belongs to exactly one age group, so this condition is met too.

3- Independence within groups - The observations come from different countries, and the maximum number of suicides (11634) is far lower than any country’s population. This condition is met too.
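A quick numerical look at group sizes and spreads can complement the checks above. The sketch below runs on simulated right-skewed counts for three hypothetical groups, not on the report’s data. The comparison of the largest and smallest group standard deviations is a common rule of thumb for ANOVA and is an addition here, not a condition discussed in this report.

```r
set.seed(1)

# Simulated right-skewed counts for three hypothetical groups (not the report's data).
toy <- data.frame(
  group = rep(c("g1", "g2", "g3"), each = 124),
  count = round(c(rexp(124, rate = 1 / 20),
                  rexp(124, rate = 1 / 50),
                  rexp(124, rate = 1 / 80)))
)

# Number of observations and standard deviation per group.
group_n  <- tapply(toy$count, toy$group, length)
group_sd <- tapply(toy$count, toy$group, sd)

# Ratio of the largest to the smallest group standard deviation.
sd_ratio <- max(group_sd) / min(group_sd)

print(group_n)
print(round(group_sd, 1))
print(round(sd_ratio, 2))
```

A ratio far above 1 would suggest very unequal spreads, in which case the ANOVA result should be read with more caution.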

Since all the conditions are met, let’s now apply ANOVA to see whether there is a significant difference in suicides among the age groups.

anova <- aov(suicides_no ~ age, data= suicide2)
summary(anova)
##              Df    Sum Sq Mean Sq F value   Pr(>F)    
## age           5  24913023 4982605   6.859 2.88e-06 ***
## Residuals   738 536145304  726484                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (2.88e-06) is far below 0.05, so there is a significant difference in the number of suicides among the age groups, in line with the earlier boxplot. Let’s plot it again to visualize the difference among the age groups.

ggplot(suicide2, aes(x=age, y=suicides_no))+geom_boxplot(aes(color=age))+coord_flip()

Because of the extreme values, the boxplot does not show the difference clearly. As justified earlier, those values come from a few countries such as Japan, the US and Russia where a lot of people committed suicide, so we could not exclude them. The difference is nevertheless supported by the p-value.

Hence, we reject the null hypothesis which says that there is no difference in the suicide rates among the age groups.
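The F value and p-value in the ANOVA table above follow directly from its sums of squares and degrees of freedom. The short check below recomputes them; the numbers are copied from the table, and the variable names are chosen here for illustration.

```r
# Values copied from the ANOVA table above.
ss_age <- 24913023;  df_age <- 5
ss_res <- 536145304; df_res <- 738

# Mean square = sum of squares / degrees of freedom.
ms_age <- ss_age / df_age
ms_res <- ss_res / df_res

# F statistic = between-group mean square / residual mean square.
f_stat <- ms_age / ms_res

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_age, df_res, lower.tail = FALSE)

cat("F =", round(f_stat, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```

Recomputing the statistic this way is a simple guard against misreading the table.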

Tukey’s test shows which pairs of age groups differ in their mean number of suicides.

tukeystest <- TukeyHSD(anova, ordered=TRUE)
tukeystest
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = suicides_no ~ age, data = suicide2)
## 
## $age
##                              diff         lwr      upr     p adj
## 15-24 years-5-14 years  162.58065 -146.705900 471.8672 0.6631532
## 75+ years-5-14 years    172.84677 -136.439771 482.1333 0.6009499
## 25-34 years-5-14 years  245.45968  -63.826868 554.7462 0.2087730
## 55-74 years-5-14 years  418.60484  109.318294 727.8914 0.0016661
## 35-54 years-5-14 years  561.42742  252.140874 870.7140 0.0000041
## 75+ years-15-24 years    10.26613 -299.020416 319.5527 0.9999989
## 25-34 years-15-24 years  82.87903 -226.407513 392.1656 0.9731388
## 55-74 years-15-24 years 256.02419  -53.262351 565.3107 0.1699887
## 35-54 years-15-24 years 398.84677   89.560229 708.1333 0.0033443
## 25-34 years-75+ years    72.61290 -236.673642 381.8994 0.9850776
## 55-74 years-75+ years   245.75806  -63.528480 555.0446 0.2076010
## 35-54 years-75+ years   388.58065   79.294100 697.8672 0.0047377
## 55-74 years-25-34 years 173.14516 -136.141384 482.4317 0.5991195
## 35-54 years-25-34 years 315.96774    6.681197 625.2543 0.0419846
## 35-54 years-55-74 years 142.82258 -166.463964 452.1091 0.7742828

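The diff column gives the difference in mean number of suicides for each pair of age groups, and p adj gives the adjusted p-value. Five of the fifteen pairs differ significantly at the 5% level: 35-54 years differs from 5-14, 15-24, 25-34 and 75+ years, and 55-74 years differs from 5-14 years. The remaining pairs do not differ significantly. This agrees with the boxplot above.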
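The significant pairs can also be pulled out of a TukeyHSD result programmatically instead of being read off the printed table. The sketch below uses R’s built-in `PlantGrowth` data so that it runs on its own; it is not the report’s data, and the 0.05 cut-off is the usual convention rather than something fixed by the method.

```r
# Built-in example data: plant weight for a control and two treatment groups.
fit <- aov(weight ~ group, data = PlantGrowth)

# TukeyHSD returns a list with one matrix per factor; columns are diff, lwr, upr, p adj.
tk <- TukeyHSD(fit)$group

# Keep only the pairs whose adjusted p-value is below 0.05.
sig <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]

print(round(sig, 4))
```

With the report’s objects, the same filter would be applied to `tukeystest$age`.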

t-test - To check the difference in suicides between male and female

Our next objective is to check the difference in the number of suicides between males and females, for which we use an independent-samples t-test. Before moving forward, we check whether the conditions of the test are met; if they are not, we will rely on simulation to get the results.

Ho: There is no difference in the mean number of suicides between males and females

Ha: There is a difference in the mean number of suicides between males and females

Conditions

There are two conditions for the independent-samples t-test:

1- All the observations are independent - This condition is met because even the maximum number of suicides (11634, in the US in 2015) is far lower than 10% of that country’s population.

2- Normality - We are not much concerned about the normality of the data because both groups are large. Hence this condition is met too.

by(suicide2$suicides_no, suicide2$sex, length)
## suicide2$sex: female
## [1] 372
## -------------------------------------------------------- 
## suicide2$sex: male
## [1] 372

Before moving forward with the analysis, let’s look at the mean number of suicides for males and females to get a basic understanding of the data.

by(suicide2$suicides_no, suicide2$sex, mean)
## suicide2$sex: female
## [1] 127.0108
## -------------------------------------------------------- 
## suicide2$sex: male
## [1] 420.4086

Before the hypothesis test, let’s draw a boxplot of suicides for males and females to better understand the data.

ggplot(suicide2, aes(x=sex, y=suicides_no))+geom_boxplot(aes(color=sex))+coord_flip()

It shows that males have more suicides than females. To confirm this, let’s apply the t-test.

inference(y=suicide2$suicides_no, x=suicide2$sex, est="mean", type="ht", null=0,
          alternative="twosided", method="theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_female = 372, mean_female = 127.0108, sd_female = 365.1276
## n_male = 372, mean_male = 420.4086, sd_male = 1155.773
## Observed difference between means (female-male) = -293.3978
## 
## H0: mu_female - mu_male = 0 
## HA: mu_female - mu_male != 0 
## Standard error = 62.843 
## Test statistic: Z =  -4.669 
## p-value =  0

The p-value is reported as 0 (Z = -4.669), so we reject the null hypothesis that there is no difference between males and females. The observed difference between means (female - male) is -293.3978: the mean number of suicides for females was far lower than for males in 2015.
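The standard error and test statistic in the output above can be reproduced from the printed summary statistics. The sketch below does that arithmetic with the values copied from the output. It uses the normal distribution for the p-value because the output reports a Z statistic, and it shows that the p-value is very small rather than exactly zero.

```r
# Summary statistics copied from the inference() output above.
n_f <- 372; mean_f <- 127.0108; sd_f <- 365.1276
n_m <- 372; mean_m <- 420.4086; sd_m <- 1155.773

# Standard error of the difference between two independent means.
se <- sqrt(sd_f^2 / n_f + sd_m^2 / n_m)

# Test statistic for H0: mu_female - mu_male = 0.
z <- (mean_f - mean_m) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

cat("SE =", round(se, 3), "\n")
cat("Z  =", round(z, 3), "\n")
cat("p  =", signif(p_value, 3), "\n")
```

The printed 0 is therefore a matter of display precision.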

Simulation

For the sake of the project, let’s now run the same test by simulation to see whether the results differ. The sample is large overall and in each group, which satisfies the basic conditions, but we would still like to see the simulation results.

inference(y=suicide2$suicides_no, x=suicide2$sex, est="mean", type="ht", null=0,
          alternative="twosided", method="simulation")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_female = 372, mean_female = 127.0108, sd_female = 365.1276
## n_male = 372, mean_male = 420.4086, sd_male = 1155.773
## Observed difference between means (female-male) = -293.3978
## 
## H0: mu_female - mu_male = 0 
## HA: mu_female - mu_male != 0

## p-value =  0

The simulation-based test gives the same result. The p-value is again reported as 0, so we reject the null hypothesis in favor of the alternative hypothesis. There is a significant difference in the number of suicides between males and females. As noted before, males had more suicides than females globally in 2015.
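To make the idea of a simulation-based test concrete, here is a minimal permutation test in base R. It is a generic sketch on simulated toy data and is not claimed to be what `inference()` does internally; the group sizes, the exponential distributions and the number of shuffles are all assumptions chosen for illustration.

```r
set.seed(606)

# Simulated right-skewed toy data for two hypothetical groups (not the report's data).
n <- 50
y <- c(rexp(n, rate = 1 / 100), rexp(n, rate = 1 / 300))
g <- rep(c("female", "male"), each = n)

# Observed difference between group means (female - male).
obs_diff <- mean(y[g == "female"]) - mean(y[g == "male"])

# Shuffle the group labels many times to build the null distribution of the difference.
n_sim <- 2000
sim_diff <- replicate(n_sim, {
  shuffled <- sample(g)
  mean(y[shuffled == "female"]) - mean(y[shuffled == "male"])
})

# Two-sided p-value: share of shuffles at least as extreme as the observed difference.
p_value <- mean(abs(sim_diff) >= abs(obs_diff))

cat("Observed difference:", round(obs_diff, 2), "\n")
cat("Permutation p-value:", p_value, "\n")
```

When no shuffle is as extreme as the observed difference, the estimate comes out as 0, which should be read as "smaller than 1 / n_sim" rather than as exactly zero.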

Conclusion

The objective of the project was to see whether there is a significant difference in suicides between males and females and among age groups. We used a t-test for the difference between males and females and ANOVA for the difference among age groups. There were a few extreme values in the dataset, but we did not remove them: they were not mistakes, and external references confirmed the high number of suicides in those countries. The results showed that males had more suicides than females globally. There was also a significant difference among age groups: the 5-14 age group had the lowest number of suicides and the 35-54 age group the highest. Japan, the United States and Russia were the countries with the most suicides in 2015.

This project has its limitations. Future work could explore other aspects of suicide globally, including the time-series data. Other variables such as unemployment, the economy and stress levels should also be considered as possible reasons behind suicides around the world.

References

OpenIntro Statistics, Third Edition. Diez, D. et al. 2015

Wickham, H., & Grolemund, G. (2016) R for Data Science. O’Reilly.

https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

http://theconversation.com/in-2015-more-people-committed-suicide-in-u-s-jails-than-over-the-last-decade-45196

https://www.nytimes.com/2016/04/22/health/us-suicide-rate-surges-to-a-30-year-high.html