Introduction

Mental health has become an increasingly prominent issue. Over the past decade, awareness efforts such as support groups and celebrity endorsements have been on the rise.
As part of our analysis, we want to understand the impact of these campaigns on society.
To do so, we compared worldwide suicide rates in 2000 with those in 2014.
This gap is long enough to show whether recent suicide rates differ from those observed historically.
The analysis uses the number of suicides per country over several time periods, compiled by the World Health Organization and obtained from Kaggle (open-source data).

Problem Statement

To quantitatively verify whether there is a significant difference between the suicide rates observed in 2000 and in 2014.
The difference in suicide rates for each matching set of characteristics was computed with the mutate function, and this column is the basis for the paired hypothesis test.
For this analysis, each combination of country, gender and age is compared across the two time periods using a paired-sample hypothesis test to check whether the suicide rates differ.

Data

The data covers 142 countries, reported from 1979 to 2016 and divided by gender and age group. It was obtained from https://www.kaggle.com/szamil/who-suicide-statistics.
The data is open source and available in CSV format.
It contains country, year (observation years from 1979 to 2016), sex (Male or Female), age (listed in 6 buckets: 5-14, 15-24, 25-34, 35-54, 55-74 and >75), suicides_no (number of suicides) and population (for a particular sex and age group at that point in time).
Prior to importing the data into R, the data was pre-processed using Microsoft Excel.
The data is imported into R using the readr package.
The data originally comprised 43776 observation rows and 6 columns.

# Read the data into R and confirm the import by displaying its dimensions
library(readr)
who <- read_csv("C:/Users/Syed Hassan Afsar/Downloads/RMIT 1st Semester/Intro to Statistics/Assignment 3/who_suicide_statistics.csv")

dim(who)

## [1] 43776     6

Data Preprocessing - Steps (1/2)

The original data was preprocessed in Microsoft Excel before the final file was uploaded for analysis.
Data for the years 2000 and 2014 was retained. All other entries were deleted.
A new variable “Suicide%” was derived from the existing variables as “suicides_no” / “population”.
A unique identifier was created by concatenating country, age and gender. It makes it possible to split the suicide % into two columns, one per time period, for the same identifier.
Using a simple VLOOKUP, we mapped the suicide rate for each year to the unique identifier.
If data for a unique identifier was not available for one of the time periods, it was excluded from the analysis.
Only the columns required for this study were retained (UniqueIdentifier, Suicide%_2000 and Suicide%_2014).
The reduced dataset has 1566 rows and 3 columns.
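
The Excel steps above could also be reproduced in R. The following is only a minimal sketch on a made-up toy table, not the procedure actually used: the `who_raw` values are invented, the multiplication by 100 is an assumption inferred from the rate being described as a percentage, and any rounding applied in Excel is not reproduced.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw file (column names as described in the Data
# section; all values are made up for illustration).
who_raw <- tibble(
  country     = c("A", "A", "A", "B", "B"),
  year        = c(2000, 2014, 1999, 2000, 2014),
  sex         = c("male", "male", "male", "female", "female"),
  age         = c("15-24 years", "15-24 years", "15-24 years",
                  "25-34 years", "25-34 years"),
  suicides_no = c(10, 8, 12, 5, NA),
  population  = c(100000, 100000, 100000, 50000, 50000)
)

who_processed <- who_raw %>%
  filter(year %in% c(2000, 2014)) %>%              # keep the two study years
  mutate(rate = suicides_no / population * 100,    # rate in percent (assumed)
         UniqueIdentifier = paste(country, age, sex, sep = "_")) %>%
  select(UniqueIdentifier, year, rate) %>%
  pivot_wider(names_from = year, values_from = rate,
              names_prefix = "Suicide%_") %>%      # one column per year
  drop_na()                                        # drop identifiers missing a year

who_processed
```

In this toy table, country B has no suicide count for 2014, so its identifier is dropped, mirroring the exclusion rule described above.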

#Importing reduced dataset used for analysis

who_final <- read_csv("C:/Users/Syed Hassan Afsar/Downloads/RMIT 1st Semester/Intro to Statistics/Assignment 3/Who_processed.csv")

dim(who_final)

## [1] 1566    3

Data Preprocessing - Steps (2/2)

At the outset of the analysis, three variables were used:

UniqueIdentifier - Combination of the country, age and sex variables from the original dataset.
Suicide%_2000 (expressed as a number without the % sign) - The suicide rate for the year 2000, derived as suicides_no/population expressed as a percentage, for a particular age, sex and country bracket.
Suicide%_2014 (expressed as a number without the % sign) - The suicide rate for the year 2014, derived as suicides_no/population expressed as a percentage, for a particular age, sex and country bracket.

Missing values: Since this analysis relies on observed figures, excluding the entire row is our preferred method of dealing with missing values.

#Re-checking missing values
any(is.na(who_final))

## [1] FALSE

The data prepared contains no missing values.

Descriptive Statistics

For the paired hypothesis test, a difference column “suicrate_diff” was added with mutate, computed as (suicide rate in 2014 - suicide rate in 2000).
Summary statistics were then computed and analyzed for each of the numeric variables.

# Columns can be referenced directly inside mutate(); no need for who_final$
who_final <- who_final %>% mutate(suicrate_diff = `Suicide%_2014` - `Suicide%_2000`)

table1 <- who_final %>% summarise(Mean_2000 = mean(`Suicide%_2000`),SD_2000 = sd(`Suicide%_2000`),
                                  Q1_2000 = quantile(`Suicide%_2000`,probs = .25),Q3_2000 = quantile(`Suicide%_2000`,probs = 0.75), Max_2000 = max(`Suicide%_2000`), Min_2000 = min(`Suicide%_2000`),
                                  IQR_2000=Q3_2000 - Q1_2000,Range_2000 = Max_2000 - Min_2000, n = n())

table2 <- who_final %>% summarise(Mean_2014 = mean(`Suicide%_2014`),SD_2014 = sd(`Suicide%_2014`),
                                  Q1_2014 = quantile(`Suicide%_2014`,probs = .25),Q3_2014 = quantile(`Suicide%_2014`,probs = 0.75), Max_2014 = max(`Suicide%_2014`), Min_2014 = min(`Suicide%_2014`),
                                  IQR_2014=Q3_2014 - Q1_2014,Range_2014 = Max_2014 - Min_2014, n = n())

table3 <- who_final %>% summarise(Mean_Diff = mean(`suicrate_diff`),SD_Diff = sd(`suicrate_diff`),
                                  Q1_Diff = quantile(`suicrate_diff`,probs = .25),Q3_Diff = quantile(`suicrate_diff`,probs = 0.75), Max_Diff = max(`suicrate_diff`), Min_Diff = min(`suicrate_diff`),
                                  IQR_Diff=Q3_Diff - Q1_Diff,Range_Diff = Max_Diff - Min_Diff, n = n())

knitr::kable(table1) %>% kable_styling()

Mean_2000	SD_2000	Q1_2000	Q3_2000	Max_2000	Min_2000	IQR_2000	Range_2000	n
0.0168455	0.0223623	0	0.02	0.14	0	0.02	0.14	1566

knitr::kable(table2) %>% kable_styling()

Mean_2014	SD_2014	Q1_2014	Q3_2014	Max_2014	Min_2014	IQR_2014	Range_2014	n
0.0126692	0.0159921	0	0.02	0.12	0	0.02	0.12	1566

Descriptive Statistics

knitr::kable(table3) %>% kable_styling()

Mean_Diff	SD_Diff	Q1_Diff	Q3_Diff	Max_Diff	Min_Diff	IQR_Diff	Range_Diff	n
-0.0041762	0.0110441	-0.01	0	0.04	-0.06	0.01	0.1	1566

Visualization

The suicide rates in 2000 and 2014 were further visualized using a matplot.

matplot(t(data.frame(who_final$`Suicide%_2000`,
who_final$`Suicide%_2014`)),
type="b", pch=19, col=1, lty=1, xlab= "", ylab="Suicide Rate(%)",
xaxt = "n")
axis(1, at=1:2, labels=c("2000","2014"))

Descriptive Statistics and Visualization

The summary statistics and the matplot suggest a difference between the mean suicide rates observed in 2000 and 2014.
Suicide rates in 2000 appear to be higher than those observed in 2014.
This is verified further using a paired-sample hypothesis test.

Outlier Detection

Outliers in the difference in suicide rates were checked using a boxplot.

boxplot(who_final$suicrate_diff,col = "blue", ylab = "Difference in suicide rates(%)", main = "Outlier detection for difference in suicide rates")

- As visible in the boxplot, the difference in suicide rates shows some values that might be considered outliers.

Outlier Filtering

A z-score approach was used to check for outliers. Differences in suicide rates with an absolute z-score greater than 3 were excluded. 23 observations were removed by subsetting with base functions.
The boxplot after the exclusions is displayed to check the result.
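
The filtering code itself is not shown on the slide. A minimal sketch of a z-score filter with base subsetting, run on made-up data rather than the WHO differences, might look like this (`toy` and `toy_filtered` are hypothetical names):

```r
# Toy stand-in for the difference column (made-up values plus one extreme point)
set.seed(1)
toy <- data.frame(suicrate_diff = c(rnorm(50, mean = 0, sd = 0.01), 0.5))

# z-score of each difference relative to the sample mean and standard deviation
z <- (toy$suicrate_diff - mean(toy$suicrate_diff)) / sd(toy$suicrate_diff)

# Keep rows whose absolute z-score is at most 3 (base subsetting)
toy_filtered <- toy[abs(z) <= 3, , drop = FALSE]

nrow(toy) - nrow(toy_filtered)   # number of rows removed
```

Applied to the real data, the same pattern on `who_final$suicrate_diff` would presumably yield `who_final_filtered`, although the exact call used in the analysis is not given.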

boxplot(who_final_filtered$suicrate_diff,col = "blue", ylab = "Difference in suicide rates(%)", main = "Outlier detection for difference in suicide rates")

- The exclusions substantially reduced the number of visible outliers.

Normality (1/2)

Normality of the difference in suicide rates was checked using:

Visualization : QQ - Plot
Quantification : Shapiro-Wilk test

who_final_filtered$suicrate_diff %>% qqPlot(dist="norm",main = "QQ-Plot for difference in suicide rates", ylab = "Difference in suicide rates(%)")

## [1] 628 639

Normality (2/2)

shapiro.test(who_final_filtered$suicrate_diff)

## 
##  Shapiro-Wilk normality test
## 
## data:  who_final_filtered$suicrate_diff
## W = 0.71697, p-value < 2.2e-16

Neither technique supports the assumption that the differences are normally distributed.
However, the sample contains 1543 observations after outlier removal, and the central limit theorem suggests that for n > 30 the sampling distribution of the mean is approximately normal.
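
The central limit theorem argument can be illustrated with a small simulation. This is a sketch on a hypothetical right-skewed population (exponential with rate 1), not on the WHO data, and the sample size of 500 is chosen arbitrarily:

```r
set.seed(42)

# Draw 2000 samples of size 500 from a skewed population and keep each mean
sample_means <- replicate(2000, mean(rexp(500, rate = 1)))

# The means cluster around the population mean (1) with a standard
# deviation close to 1 / sqrt(500)
c(mean = mean(sample_means), sd = sd(sample_means), expected_sd = 1 / sqrt(500))

# QQ-plot of the sample means against the normal distribution
qqnorm(sample_means)
qqline(sample_means)
```

Although the individual observations are far from normal, the QQ-plot of the sample means should lie close to the reference line.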

Hypothesis Testing

A paired t-test was used to determine whether there is a difference between the suicide rates observed in 2000 and those observed in 2014. Because observations with the same characteristics were collected in “pairs” across two time periods and the units being analyzed are the same, this is the appropriate t-test.
Null Hypothesis

\[H_0:\mu_\Delta = 0\]

Alternate Hypothesis

\[H_A:\mu_\Delta \ne 0\]

Significance level: 5%
Assumptions:

We are comparing the population average difference or change between two matched samples.
The sampling distribution of the mean difference is assumed to be approximately normal, given n > 30.

Decision Rules:

Reject H0:

If the p-value < 0.05 (significance level)

If the CI of the mean difference does not capture the hypothesized mean (0)

Otherwise, fail to reject H0.
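
To make the decision rules concrete, the paired t statistic, p-value and confidence interval can be computed by hand. The sketch below uses made-up paired values (`before` and `after` are hypothetical, not the WHO data) and applies the 5% rule stated above:

```r
# Hypothetical paired measurements (made-up numbers)
before <- c(0.020, 0.015, 0.030, 0.010, 0.025, 0.018)
after  <- c(0.016, 0.014, 0.024, 0.011, 0.020, 0.015)

d <- after - before                          # per-pair differences
n <- length(d)
se <- sd(d) / sqrt(n)                        # standard error of the mean difference

t_stat  <- mean(d) / se                      # test statistic under H0: mean difference = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% confidence interval

reject_h0 <- p_value < 0.05                  # decision rule at the 5% level
```

For a two-sided test, the two rules agree: the p-value falls below 0.05 exactly when the 95% interval excludes 0. The same quantities are returned by `t.test(after, before, paired = TRUE)`.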

Hypothesis Testing Cont.

pttest <- t.test(who_final_filtered$`Suicide%_2014`,
who_final_filtered$`Suicide%_2000`,
paired = TRUE,
alternative = "two.sided",
conf.level = .95
)

pttest

## 
##  Paired t-test
## 
## data:  who_final_filtered$`Suicide%_2014` and who_final_filtered$`Suicide%_2000`
## t = -14.908, df = 1542, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.004759514 -0.003652670
## sample estimates:
## mean of the differences 
##            -0.004206092
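
As a quick consistency check, the confidence interval can be recovered from the other numbers in the output. This is a sketch that only reuses the reported mean difference, t statistic and degrees of freedom, so small deviations from rounding of the printed t value are expected:

```r
# Values copied from the t-test output above
mean_diff <- -0.004206092
t_stat    <- -14.908
df        <- 1542

se     <- mean_diff / t_stat        # standard error implied by t = mean / se
margin <- qt(0.975, df) * se        # half-width of the 95% interval
ci     <- mean_diff + c(-1, 1) * margin
n      <- df + 1                    # paired test: df = n - 1

ci
n
```

The recomputed interval should agree with the reported one to about six decimal places, and the implied number of pairs equals the 1566 rows minus the 23 excluded observations.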

Discussion

The p-value of the paired t-test (<2.2e-16) is less than the significance level. The 95% confidence interval does not capture the null hypothesized mean of 0.
As a result, the test is statistically significant and we reject the null hypothesis that there is no difference between the suicide rates observed in 2000 and 2014.
The mean of the differences (suicide rates in 2014 - suicide rates in 2000) is -0.0042.
In the sample statistics, the mean suicide rate in 2014 was lower than the mean suicide rate observed in 2000.
As an incidental finding, suicide rates were observed to be higher in countries where mental health awareness programs were less common.
Strengths: Data from most countries and age groups was captured in this analysis. Even after excluding records where data capture might have been inaccurate (population blank, suicide number blank), the sample was large enough to represent the overall population well.
Limitations and possible improvements

The data does not explicitly capture reasons for suicide, which could be a crucial factor in this analysis.
2014 was used as the benchmark for this exercise. Mental health awareness first gained momentum in the latter part of the previous decade, so figures as of 2018 could be a better measure; however, data constraints prevented this.

Overall, we can conclude with a high level of confidence that the impact of the mental health awareness campaigns has been positive and that society is learning ways to deal with the demon inside.

Mental Health Awareness

An impact study
