Syed Hassan Afsar - 3734089, Arjun Khopkar - 3729445, Siddharth Sharma - 3738019
Last updated: 28 October, 2018
Mental health has become an extremely important issue in recent times. Over the past decade, awareness efforts such as support groups and celebrity shout-outs have been on the rise.
As a part of our analysis, we want to understand the impact of these campaigns on society.
We have compared the suicide rates in 2000 vis-a-vis suicide rates in 2014 worldwide to conduct this study.
This gap between the two time points is needed to assess whether recent suicide rates differ from those observed historically.
The analysis uses the number of suicides per country over several years, compiled from the World Health Organization and procured from Kaggle (open-source data).
The objective is to quantitatively verify whether there is a significant difference between the suicide rates observed in 2000 and those observed in 2014.
The difference in suicide rates for the same combination of characteristics has been computed in the dataset using the mutate function, and this column is used as the basis for the paired hypothesis test.
For the purpose of this analysis, we compare each combination of country, gender and age across the two time periods using paired-sample hypothesis testing and check whether there is a difference in suicide rates.
The data covers 142 countries, reported from 1979 to 2016 and divided by gender and age group. It was procured from https://www.kaggle.com/szamil/who-suicide-statistics.
The data is open source and available in CSV format.
The data contains the variables country, year (observation years from 1979 to 2016), sex (male or female), age (listed in 6 buckets: 5-14, 15-24, 25-34, 35-54, 55-74 and >75), suicides_no (number of suicides) and population (for a particular sex and age group at that point in time).
Prior to importing the data into R, the data was pre-processed using Microsoft Excel.
The data is imported into R using the readr package.
The data originally comprised 43776 observation rows and 6 columns.
#Reading the data into R and checking successful import by displaying dimensions
library(readr)
who <- read_csv("C:/Users/Syed Hassan Afsar/Downloads/RMIT 1st Semester/Intro to Statistics/Assignment 3/who_suicide_statistics.csv")
dim(who)
## [1] 43776 6
The original data was preprocessed in Microsoft Excel before the final file was uploaded for analysis.
Data for the years 2000 and 2014 was retained. All other entries were deleted.
A new variable “Suicide%” was derived from the existing variables as “suicides_no” / “population”.
A unique identifier was created by concatenating country, age and gender. This made it possible to separate Suicide% into two columns, one per time period, for the same unique identifier.
Using a simple VLOOKUP, we mapped the corresponding suicide rate for each year to the unique identifier.
Where data for a unique identifier was not available for one of the time periods, that identifier was excluded from the analysis.
Only the columns required for this study were retained (UniqueIdentifier, Suicide%_2000 and Suicide%_2014).
The reduced dataset has 1566 rows and 3 columns.
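The Excel preprocessing steps above could also be reproduced in R. The following is only a minimal sketch in base R on a made-up toy table: the values, the "_" separator in the identifier and the factor of 100 used to express the rate as a percentage are assumptions for illustration, not taken from the original workbook.

```r
# Toy stand-in for the raw WHO table (all values are made up for illustration).
who_raw <- data.frame(
  country     = c("A", "A", "A", "A", "B", "B"),
  year        = c(2000, 2014, 2000, 2014, 2000, 1999),
  sex         = c("male", "male", "female", "female", "male", "male"),
  age         = rep("15-24 years", 6),
  suicides_no = c(10, 8, 4, NA, 5, 6),
  population  = c(100000, 100000, 100000, 100000, 50000, 50000),
  stringsAsFactors = FALSE
)

# Keep only the two years of interest.
who_sub <- subset(who_raw, year %in% c(2000, 2014))

# Suicide rate as a percentage of the bracket's population
# (the multiplication by 100 is an assumption about the original derivation).
who_sub$rate <- 100 * who_sub$suicides_no / who_sub$population

# Unique identifier: country, age and sex concatenated (separator is assumed).
who_sub$UniqueIdentifier <- paste(who_sub$country, who_sub$age, who_sub$sex, sep = "_")

# One table per year, then an inner join on the identifier so that
# identifiers missing in either year drop out.
y2000 <- who_sub[who_sub$year == 2000, c("UniqueIdentifier", "rate")]
y2014 <- who_sub[who_sub$year == 2014, c("UniqueIdentifier", "rate")]
who_wide <- merge(y2000, y2014, by = "UniqueIdentifier", suffixes = c("_2000", "_2014"))
names(who_wide) <- c("UniqueIdentifier", "Suicide%_2000", "Suicide%_2014")

# Drop identifiers whose rate is blank in either year.
who_wide <- who_wide[complete.cases(who_wide), ]
print(who_wide)
```

In this toy table only one identifier has usable data in both years, so a single row survives. The inner join plays the role of the VLOOKUP step, and `complete.cases()` mirrors the exclusion of identifiers with blank values.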
#Importing reduced dataset used for analysis
who_final <- read_csv("C:/Users/Syed Hassan Afsar/Downloads/RMIT 1st Semester/Intro to Statistics/Assignment 3/Who_processed.csv")
dim(who_final)
## [1] 1566 3
UniqueIdentifier - Combination of the country, age and sex variables from the original dataset.
Suicide%_2000 (expressed in numeric terms without the % sign) - The suicide rate for the year 2000, derived as suicides_no/population expressed as a percentage, for a particular age, sex and country bracket.
Suicide%_2014 (expressed in numeric terms without the % sign) - The suicide rate for the year 2014, derived as suicides_no/population expressed as a percentage, for a particular age, sex and country bracket.
#Re-checking missing values
any(is.na(who_final))
## [1] FALSE
For the purpose of paired hypothesis testing, a new column “suicrate_diff” was created with mutate, computed as (suicide rate in 2014 - suicide rate in 2000).
Summary statistics for each of the numeric variables were then computed and analyzed.
library(dplyr)
who_final <- who_final %>% mutate(suicrate_diff = `Suicide%_2014` - `Suicide%_2000`)
table1 <- who_final %>% summarise(Mean_2000 = mean(`Suicide%_2000`),SD_2000 = sd(`Suicide%_2000`),
Q1_2000 = quantile(`Suicide%_2000`,probs = .25),Q3_2000 = quantile(`Suicide%_2000`,probs = 0.75), Max_2000 = max(`Suicide%_2000`), Min_2000 = min(`Suicide%_2000`),
IQR_2000=Q3_2000 - Q1_2000,Range_2000 = Max_2000 - Min_2000, n = n())
table2 <- who_final %>% summarise(Mean_2014 = mean(`Suicide%_2014`),SD_2014 = sd(`Suicide%_2014`),
Q1_2014 = quantile(`Suicide%_2014`,probs = .25),Q3_2014 = quantile(`Suicide%_2014`,probs = 0.75), Max_2014 = max(`Suicide%_2014`), Min_2014 = min(`Suicide%_2014`),
IQR_2014=Q3_2014 - Q1_2014,Range_2014 = Max_2014 - Min_2014, n = n())
table3 <- who_final %>% summarise(Mean_Diff = mean(`suicrate_diff`),SD_Diff = sd(`suicrate_diff`),
Q1_Diff = quantile(`suicrate_diff`,probs = .25),Q3_Diff = quantile(`suicrate_diff`,probs = 0.75), Max_Diff = max(`suicrate_diff`), Min_Diff = min(`suicrate_diff`),
IQR_Diff=Q3_Diff - Q1_Diff,Range_Diff = Max_Diff - Min_Diff, n = n())
library(kableExtra)
knitr::kable(table1) %>% kable_styling()

| Mean_2000 | SD_2000 | Q1_2000 | Q3_2000 | Max_2000 | Min_2000 | IQR_2000 | Range_2000 | n |
|---|---|---|---|---|---|---|---|---|
| 0.0168455 | 0.0223623 | 0 | 0.02 | 0.14 | 0 | 0.02 | 0.14 | 1566 |

knitr::kable(table2) %>% kable_styling()

| Mean_2014 | SD_2014 | Q1_2014 | Q3_2014 | Max_2014 | Min_2014 | IQR_2014 | Range_2014 | n |
|---|---|---|---|---|---|---|---|---|
| 0.0126692 | 0.0159921 | 0 | 0.02 | 0.12 | 0 | 0.02 | 0.12 | 1566 |

knitr::kable(table3) %>% kable_styling()

| Mean_Diff | SD_Diff | Q1_Diff | Q3_Diff | Max_Diff | Min_Diff | IQR_Diff | Range_Diff | n |
|---|---|---|---|---|---|---|---|---|
| -0.0041762 | 0.0110441 | -0.01 | 0 | 0.04 | -0.06 | 0.01 | 0.1 | 1566 |
matplot(t(data.frame(who_final$`Suicide%_2000`,
who_final$`Suicide%_2014`)),
type="b", pch=19, col=1, lty=1, xlab= "", ylab="Suicide Rate(%)",
xaxt = "n")
axis(1, at=1:2, labels=c("2000","2014"))
The summary statistics and the matplot suggest that there is a difference in the mean suicide rates observed in 2000 and 2014.
It appears that suicide rates in 2000 were greater than those observed in 2014.
This is verified further using paired-sample hypothesis testing.
boxplot(who_final$suicrate_diff,col = "blue", ylab = "Difference in suicide rates(%)", main = "Outlier detection for difference in suicide rates")
As visible in the boxplot, the difference in suicide rates includes some values that might be considered outliers.
A z-score approach was used to check for outliers. Differences in suicide rates with an absolute z-score greater than 3 were excluded. 23 observations were removed by subsetting with base functions.
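The subsetting code that produced the filtered data is not shown in this report. A minimal sketch of the z-score rule described above, using base R on made-up differences, might look like the following. The toy values and the names `toy` and `toy_filtered` are hypothetical; only the column name `suicrate_diff` and the threshold of 3 come from the text.

```r
# Toy differences (made up): twenty zeros and one extreme value.
toy <- data.frame(suicrate_diff = c(rep(0, 20), 1))

# z-score of each difference relative to the sample mean and standard deviation.
z <- (toy$suicrate_diff - mean(toy$suicrate_diff)) / sd(toy$suicrate_diff)

# Keep only rows whose absolute z-score does not exceed 3.
toy_filtered <- toy[abs(z) <= 3, , drop = FALSE]

# Number of rows removed by the rule.
n_removed <- nrow(toy) - nrow(toy_filtered)
print(n_removed)
```

Here the single extreme value has a z-score of about 4.4 and is the only row removed. Note that the mean and standard deviation are themselves affected by the outliers, so the rule is a rough screen rather than a formal test.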
The boxplot after exclusions is displayed to check the outlier removal.
boxplot(who_final_filtered$suicrate_diff,col = "blue", ylab = "Difference in suicide rates(%)", main = "Outlier detection for difference in suicide rates")
The exclusions noticeably reduced the number of visible outliers.
library(car)
who_final_filtered$suicrate_diff %>% qqPlot(dist="norm",main = "QQ-Plot for difference in suicide rates", ylab = "Difference in suicide rates(%)")
## [1] 628 639
shapiro.test(who_final_filtered$suicrate_diff)
##
## Shapiro-Wilk normality test
##
## data: who_final_filtered$suicrate_diff
## W = 0.71697, p-value < 2.2e-16
Normality of the differences could not be established using the Q-Q plot or the Shapiro-Wilk test (p-value < 2.2e-16).
However, the sample is large (1543 observations after exclusions, well above n = 30), so by the central limit theorem the sampling distribution of the mean difference will be approximately normal.
A paired t-test has been used to determine whether there is a difference between the suicide rates observed in 2000 and those observed in 2014. Since observations with the same characteristics were collected in “pairs” across the two time periods and the same units are analyzed in both, this is the appropriate t-test to use.
Null Hypothesis
\[H_0:\mu_\Delta = 0\]
Alternate Hypothesis
\[H_A:\mu_\Delta \ne 0\]
Significance level: 5%
Assumptions:
We are comparing the population average difference or change between two matched samples.
The sampling distribution of the mean difference is assumed to be approximately normal, given n > 30.
Reject H0:
If p-value < 0.05 (significance level)
If the CI of the mean difference does not capture the hypothesized mean (0)
Otherwise, fail to reject H0.
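As a sketch of what the paired test computes, the t statistic is the mean of the paired differences divided by its standard error, sd(d) / sqrt(n), with n - 1 degrees of freedom. The example below checks this by hand against `t.test()` on five made-up pairs. The rates and variable names are hypothetical and are not taken from the WHO data.

```r
# Made-up rates for five matched country/age/sex brackets (illustrative only).
rate_2000 <- c(0.02, 0.03, 0.01, 0.04, 0.05)
rate_2014 <- c(0.01, 0.02, 0.02, 0.03, 0.03)

# Paired differences, in the same direction as the report (2014 minus 2000).
d <- rate_2014 - rate_2000
n <- length(d)

# t statistic computed by hand from the differences.
t_manual <- mean(d) / (sd(d) / sqrt(n))

# The same statistic from t.test(): paired, and one-sample on the differences.
t_paired <- unname(t.test(rate_2014, rate_2000, paired = TRUE)$statistic)
t_onesample <- unname(t.test(d, mu = 0)$statistic)

# 95% confidence interval for the mean difference, computed by hand.
ci_manual <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)

print(c(manual = t_manual, paired = t_paired, one_sample = t_onesample))
print(ci_manual)
```

All three t statistics agree, which is why the test can equally be read as a one-sample t-test on the difference column. The decision rule then reduces to checking whether 0 lies inside the confidence interval.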
pttest <- t.test(who_final_filtered$`Suicide%_2014`,
who_final_filtered$`Suicide%_2000`,
paired = TRUE,
alternative = "two.sided",
conf.level = .95
)
pttest
##
## Paired t-test
##
## data: who_final_filtered$`Suicide%_2014` and who_final_filtered$`Suicide%_2000`
## t = -14.908, df = 1542, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.004759514 -0.003652670
## sample estimates:
## mean of the differences
## -0.004206092
The p-value of the paired t-test (< 2.2e-16) is less than the significance level of 0.05. The 95% confidence interval [-0.00476, -0.00365] does not capture the null hypothesized mean difference of 0.
We therefore conclude that the result is statistically significant and reject the null hypothesis that there is no difference between the suicide rates observed in 2000 and 2014.
The mean of the differences (suicide rate in 2014 - suicide rate in 2000) is -0.0042.
Consistent with the sample statistics, the mean suicide rate in 2014 was lower than the mean suicide rate observed in 2000.
As an incidental finding, it was observed that countries in which mental health awareness programs were less common had higher suicide rates.
Strengths: Data from most countries and age groups was captured in this analysis. Even after excluding records where data capture might have been inaccurate (blank population or blank suicide count), the sample was large enough to give a good representation of the overall population.
Limitations and possible improvements
The data does not explicitly capture reasons for suicide, which could be a crucial factor in this analysis.
2014 was used as the benchmark year for this exercise. Mental health awareness first gained momentum in the latter part of the previous decade, so figures as of 2018 could be a better measure; however, data constraints prevented their use.