Introduction & Problem Statement

Suicide has a devastating impact on families and whole communities. Fortunately, suicides can be prevented. This report investigates whether Australia’s suicide rate was lower in 2015 than it was 30 years earlier, in 1985. A hypothesis test is conducted on the 1985 and 2015 data to determine whether the rate has fallen.

Suicide rate: the number of suicides per 100,000 people.
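As a quick check of this definition, the sketch below recomputes the rate from raw counts. The figures are the first two 1985 records visible in the str() output later in this report; the variable names are illustrative.

```r
# Suicide rate per 100,000 people = suicides / population * 100,000
suicides_no <- c(67, 357)          # first two 1985 rows in the data
population  <- c(219000, 1299100)  # matching population counts

rates <- suicides_no / population * 100000
round(rates, 2)
```

The results agree, up to rounding, with the 30.6 and 27.5 shown for the rates column.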

Data

The dataset used for this report is the publicly available Kaggle dataset at https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016.

This compiled dataset was created to find signals correlated with increased suicide rates among different groups globally, across the socio-economic spectrum. For this investigation, data relating to other countries are filtered out. The Australian data were also aggregated separately, without the sex and age grouping, to show the overall trend.

library(readxl)  # read_xls()
library(dplyr)   # %>%, filter()
suicide <- read_xls("master.xls")
suicide_oz <- suicide %>% filter(country == "Australia")
suicide_oz %>% head()

Data Cont.

The suicide data comprise 12 variables and 360 observations. The variables are:

Country: only Australia is used in this investigation.
Year: 1985 to 2015.
Sex: male and female.
Age: age is divided into 6 groups: 5-14 years, 15-24 years, 25-34 years, 35-54 years, 55-74 years, 75+ years.
Suicides_no: the number of suicides.
Population: the Australian population in the given sex and age group.
Rates: the number of suicides per 100k population.
Country-year: combination variable of country and year.
HDI for year: Human development index (HDI).
Gdp_for_year($): GDP in USD$.
Gdp_per_capita($): GDP per capita in USD$.
Generation: generation label, based on the average of the age group.

Data Cont.

When the data are read into R, sex, age and generation are read as character vectors rather than factors.

str(suicide_oz)

## Classes 'tbl_df', 'tbl' and 'data.frame':    360 obs. of  12 variables:
##  $ country           : chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ year              : num  1985 1985 1985 1985 1985 ...
##  $ sex               : chr  "male" "male" "male" "male" ...
##  $ age               : chr  "75+ years" "25-34 years" "55-74 years" "15-24 years" ...
##  $ suicides_no       : num  67 357 282 315 411 36 109 143 69 64 ...
##  $ population        : num  219000 1299100 1177400 1355800 1906800 ...
##  $ rates             : num  30.6 27.5 23.9 23.2 21.6 ...
##  $ country-year      : chr  "Australia1985" "Australia1985" "Australia1985" "Australia1985" ...
##  $ HDI for year      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year ($)  : num  1.8e+11 1.8e+11 1.8e+11 1.8e+11 1.8e+11 ...
##  $ gdp_per_capita ($): num  12374 12374 12374 12374 12374 ...
##  $ generation        : chr  "G.I. Generation" "Boomers" "G.I. Generation" "Generation X" ...

Data Cont.

The following steps convert sex, age and generation into factors and specify their levels, and the order of those levels where one exists.

suicide_oz$sex <- suicide_oz$sex %>% as.factor()
suicide_oz$sex %>% head()

## [1] male   male   male   male   male   female
## Levels: female male

suicide_oz$age <- factor(suicide_oz$age, levels = c("5-14 years", "15-24 years", "25-34 years", "35-54 years", "55-74 years", "75+ years"), ordered = TRUE)
suicide_oz$age %>% head()

## [1] 75+ years   25-34 years 55-74 years 15-24 years 35-54 years 75+ years  
## 6 Levels: 5-14 years < 15-24 years < 25-34 years < ... < 75+ years

suicide_oz$generation <- factor(suicide_oz$generation, levels = c("G.I. Generation", "Silent", "Boomers", "Generation X", "Millenials", "Generation Z"), ordered = TRUE)
suicide_oz$generation %>% head()

## [1] G.I. Generation Boomers         G.I. Generation Generation X   
## [5] Silent          G.I. Generation
## 6 Levels: G.I. Generation < Silent < Boomers < ... < Generation Z

Descriptive Statistics and Visualisation

Before further analysis, the data need to be checked for missing values and outliers. As shown below, no missing values or outliers are detected.

This report focuses on one variable: rates. To gain a visual overview of suicide rates over time and relative to GDP per capita, aggregated data without the sex and age grouping are imported.

sum(is.na(suicide_oz$rates))

## [1] 0

rates_outliers <- suicide_oz$rates %>% boxplot()
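The boxplot is inspected visually here. A numeric check is also possible with base R’s boxplot.stats(), which by default returns the points lying more than 1.5 times the interquartile range beyond the hinges. The vector below is made-up illustrative data, not the Australian rates.

```r
# Hypothetical values with one obvious outlier (not the report's data)
x <- c(5, 6, 7, 8, 9, 10, 50)

# $out holds the points that fall beyond the boxplot whiskers
outliers <- boxplot.stats(x)$out
outliers
length(outliers)  # a length of 0 would mean no outliers were flagged
```

Applied to this report’s data, the equivalent call would be boxplot.stats(suicide_oz$rates)$out.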

Descriptive Statistics Cont.

gdp_outliers <- suicide_oz$`gdp_per_capita ($)` %>% boxplot()

australia <- read_xls("master.xls", sheet = 2)
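The second sheet of master.xls holds the pre-aggregated yearly data; how it was built is not shown in this report. One plausible way to derive yearly totals from the grouped data, sketched here on a small made-up table, is to sum suicides and population within each year and recompute the rate. The 2015 figures and the names toy and yearly are hypothetical.

```r
library(dplyr)

# Toy stand-in for suicide_oz: two groups per year (2015 values are made up)
toy <- tibble(
  year        = c(1985, 1985, 2015, 2015),
  suicides_no = c(67, 357, 50, 100),
  population  = c(219000, 1299100, 300000, 1500000)
)

# Sum counts within each year, then recompute the rate per 100,000
yearly <- toy %>%
  group_by(year) %>%
  summarise(
    suicides_no = sum(suicides_no),
    population  = sum(population),
    rates       = suicides_no / population * 100000
  )
yearly
```

A rate computed this way is population-weighted, so it generally differs from the simple mean of the group rates.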

Descriptive Statistics Cont.

The trendline() function shows the overall trend of Australia’s suicide rate from 1985 to 2015. The graph shows that the rate does not follow a clearly linear trend: it declines from 1985 to around 2005 and then slowly increases.

trendline(australia$year, australia$rates)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9771 -10.2942   1.1903   8.5782  26.4003 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 3413.98781  503.60037  6.7792 2.311e-07 ***
## x             -1.62917    0.25182 -6.4696 5.222e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.475 on 28 degrees of freedom
## Multiple R-squared:  0.59918,    Adjusted R-squared:  0.58486 
## F-statistic: 41.856 on 1 and 28 DF,  p-value: 5.2219e-07
## 
## 
## N: 30 , AIC: 240.49 , BIC:  244.69 
## Residual Sum of Squares:  4357.5
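The printed summary is that of a straight-line fit, lm(y ~ x). One way to check the visual impression of non-linearity, offered as a sketch rather than as part of the original analysis, is to compare the linear fit with a quadratic one in base R. The data below are simulated to dip around 2005; they are not the Australian series.

```r
set.seed(1)

# Simulated U-shaped series (illustrative only, not the report's data)
year  <- 1985:2014
rates <- 12 + 0.02 * (year - 2005)^2 + rnorm(length(year), sd = 0.5)

fit_linear    <- lm(rates ~ year)
fit_quadratic <- lm(rates ~ poly(year, 2))

# F-test: does the quadratic term explain significantly more variance?
comparison <- anova(fit_linear, fit_quadratic)
comparison
```

A small p-value in the second row indicates curvature that a straight line misses. With the real data, year and rates would be australia$year and australia$rates.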

Descriptive Statistics Cont.

The scatter plots show the relationship between suicide rates and GDP per capita, to check whether the two variables are correlated. The relationship is only weakly linear, even after a logarithmic transformation. However, suicide rates appear to be higher when GDP per capita is below $25,000.

plot(australia$rates ~ australia$`gdp_per_capita ($)`)

plot(log(australia$rates) ~ log(australia$`gdp_per_capita ($)`))
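The strength of the relationship can also be quantified rather than judged by eye. The sketch below uses cor.test() from base R with Pearson’s and Spearman’s coefficients; Spearman’s is rank-based, so it gives the same value before and after a log transformation. The two vectors are simulated stand-ins for the rates and GDP per capita columns.

```r
set.seed(42)

# Simulated stand-ins (not the report's data)
gdp_per_capita <- seq(12000, 60000, length.out = 30)
rates          <- 14 - 0.00003 * gdp_per_capita + rnorm(30, sd = 1)

# Pearson measures linear association
pearson <- cor.test(rates, gdp_per_capita, method = "pearson")

# Spearman measures monotonic association and is unchanged by log()
spearman     <- cor.test(rates, gdp_per_capita, method = "spearman")
spearman_log <- cor.test(log(rates), log(gdp_per_capita), method = "spearman")

pearson$estimate
spearman$estimate
```

With the real data, the inputs would be australia$rates and australia$`gdp_per_capita ($)`.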

Descriptive Statistics Cont.

The table below summarises the descriptive statistics of suicide rates in different years.

suicide_oz %>% group_by(year) %>% summarise(Min = min(rates,na.rm = TRUE),
                                           Q1 = quantile(rates,probs = .25,na.rm = TRUE),
                                           Median = median(rates, na.rm = TRUE),
                                           Q3 = quantile(rates,probs = .75,na.rm = TRUE),
                                           Max = max(rates,na.rm = TRUE),
                                           Mean = mean(rates, na.rm = TRUE),
                                           SD = sd(rates, na.rm = TRUE),
                                           n = n()) -> table1
table1 %>% head()

Hypothesis Testing

To address the primary question, we test whether the mean suicide rates in 1985 and 2015 differ, using the original dataset suicide_oz. The null hypothesis H0 states that the mean rates in 1985 and 2015 are equal.

\[H_0: \mu_1=\mu_2 \]

\[H_A: \mu_1 \ne\mu_2\]

Hypothesis Testing Cont.

The independent-samples t-test first assumes that the population data are normally distributed. Because the suicide rates in the original data are grouped by age and sex, the sample size for each year is the number of age and sex groups (12). This is small (<30), so a Q-Q plot is used to inspect normality. As shown below, all data points fall inside the 95% CI for the normal quantiles, so it is reasonable to assume normality.

suicide_1985 <- suicide_oz %>% filter(year == 1985)
suicide_1985$rates %>% qqPlot(dist="norm")

## [1] 1 2

Hypothesis Testing Cont.

suicide_2015 <- suicide_oz %>% filter(year==2015)
suicide_2015$rates %>% qqPlot(dist="norm")

## [1] 1 2
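With only 12 observations per year, a formal test can complement the Q-Q plots. A minimal sketch using shapiro.test() from base R follows. The vector is hypothetical, and with samples this small the test has little power, so it supplements the plot rather than replacing it.

```r
# Hypothetical group rates for one year (12 values, not the report's data)
rates_year <- c(30.6, 27.5, 23.9, 23.2, 21.6, 8.1, 7.4, 6.9, 5.2, 4.8, 0.9, 0.3)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(rates_year)
sw$p.value  # a p-value above 0.05 gives no evidence against normality
```

For this report’s data the calls would be shapiro.test(suicide_1985$rates) and shapiro.test(suicide_2015$rates).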

Hypothesis Testing Cont.

The second assumption of the independent-samples t-test is homogeneity of variance in the populations, which is checked with Levene’s test. As p = 0.8014 > 0.05, equal variances are assumed.

suicide_new <- suicide_oz %>% filter(year==1985| year==2015)
suicide_new$year <- factor(suicide_new$year, levels=c("1985", "2015"), ordered = T)
leveneTest(suicide_new$rates~suicide_new$year)
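leveneTest() comes from the car package. The same idea can be sketched in base R: Levene’s test with median centring (to our knowledge the default in car) is a one-way ANOVA on the absolute deviations of each value from its group median. The data below are simulated, and the object names are illustrative.

```r
set.seed(7)

# Simulated rates for two years, 12 groups each (not the report's data)
rates <- c(rnorm(12, mean = 13.6, sd = 10), rnorm(12, mean = 12.8, sd = 10))
year  <- factor(rep(c("1985", "2015"), each = 12))

# Absolute deviation of every value from its own group's median
abs_dev <- abs(rates - ave(rates, year, FUN = median))

# One-way ANOVA on the deviations; its F-test is the Levene statistic
levene <- anova(lm(abs_dev ~ year))
levene[["Pr(>F)"]][1]
```

A p-value above 0.05 gives no evidence against equal variances, which is the condition for using var.equal = TRUE in t.test().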

Hypothesis Testing Cont.

Now we can perform the hypothesis test with t.test(), using a 5% significance level.

The estimated difference between means is 13.61750 - 12.84833 = 0.76917. The test did not find a statistically significant difference between the mean suicide rates in 1985 and 2015, t(df=22) = 0.178, p = 0.8607, 95% CI for the difference in means [-8.214148, 9.752482]. Because p > 0.05 and the confidence interval contains 0, our decision is to fail to reject H0.

t.test(
  suicide_new$rates ~ suicide_new$year,
  var.equal=T,
  alternative = "two.sided"
)

## 
##  Two Sample t-test
## 
## data:  suicide_new$rates by suicide_new$year
## t = 0.17757, df = 22, p-value = 0.8607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.214148  9.752482
## sample estimates:
## mean in group 1985 mean in group 2015 
##           13.61750           12.84833
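The reported figures can be cross-checked from the summary statistics alone. The sketch below takes the two group means, the t statistic and the degrees of freedom from the output above, recovers the standard error, and recomputes the two-sided p-value and the 95% confidence interval. It also computes the one-sided p-value that matches the directional question of whether the 1985 mean was higher than the 2015 mean. Variable names are illustrative.

```r
# Values copied from the t.test() output
mean_1985 <- 13.61750
mean_2015 <- 12.84833
t_stat    <- 0.17757
df        <- 22

diff <- mean_1985 - mean_2015  # estimated difference in means
se   <- diff / t_stat          # standard error implied by t = diff / se

# Two-sided p-value and 95% confidence interval
p_two_sided <- 2 * pt(-abs(t_stat), df)
ci <- diff + c(-1, 1) * qt(0.975, df) * se

# One-sided p-value for H_A: mean_1985 > mean_2015
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

p_two_sided
ci
p_one_sided
```

Up to rounding, this reproduces the reported p-value and interval. The one-sided p-value is half the two-sided one, so it is also far above 0.05.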

Discussion

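Our investigation does not provide evidence that Australia’s suicide rate in 2015 differed from the rate in 1985. The observed mean rate was slightly lower in 2015 (12.85 compared with 13.62 per 100,000), but the difference is not statistically significant. In other words, this test alone does not show a sign of effective suicide prevention.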

A strength of this investigation is that the dataset meets the assumptions of the test (normality and equal variances), which makes it suitable for hypothesis testing. We use “rates” instead of the number of suicides to remove the influence of changes in population size. The dataset is also grouped by sex and age, which allows the data to be analysed in different ways. A major limitation of this dataset is that the data collection process is unknown: it is unclear how the information was gathered and which sampling method was used.

In future analysis, we could look at the correlation between suicide rates and the suicide prevention measures in place, which may help explain changes in suicide rates over time.

References

Data source:

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

Australia’s Suicide Rates Overview 1985 to 2015
