The following packages were used to produce this report:
# Useful for data manipulation
library(dplyr)
# Useful for value imputation
library(Hmisc)
# Useful for creating nice tables
library(knitr)
# Useful to build complex tables
library(kableExtra)
# Useful to create neat records of analysis
library(rmarkdown)
# Useful for tidying up the dataset
library(tidyr)
library(forecast)
# Useful for date manipulation
library(lubridate)
# Useful for finding correlation
library(psychometric)
# Useful for detecting outliers
library(outliers)
The aim of this report was to investigate whether there is a correlation between suicide rate and mental health disorder rate over a period of time. To aid our investigation, we took three datasets and preprocessed the data as follows.
- The datasets “suicide_death_rates”, “mental_health_disorders” and “population”, recorded by country for the years 1990 to 2017, were considered.
- The columns were renamed for clarity.
- The “mental_health_disorders” data was joined to the “suicide_death_rates” data on the “year”, “country_name” and “country_code” variables.
- On the joined dataset, data type verification and conversion were performed.
- For better understanding, the average suicide rate and the average mental disorder count were calculated, and the suicide rate was expressed as a percentage of the mental disorder rate.
- All variables were checked for missing and special values, and missing values were imputed with their respective mean values.
- All numerical values were checked for outliers using z-scores, and outliers were handled with a capping technique that replaces them with the nearest acceptable values.
- To understand the distribution of the suicide rate and percentage variables, histograms were plotted. The data was found to be right-skewed and was then transformed using a square root transformation.
Source of population data : https://ourworldindata.org/grapher/population
The data is recorded from 1990 to 2017.
The data is recorded from 1990 to 2016.
The population_count column consists of population by country, available from 1800 to 2019, based on Gapminder data, HYDE, and UN Population Division (2019) estimates.
The final data frame ‘joined_rates_data’ has the following variables:
- country_name : name of the country
- country_code : abbreviated code for the country name
- year : year in which the data was collected
- suicide_rate : deaths due to self-harm, both male and female, per 100,000 individuals
- mental_disorder_count : number of people with mental health disorders
- population_count : number of people in a particular country in the respective year
# Reading and importing suicide_death_rates data
suicide_death_rates <-read.csv("suicide-death-rates.csv",header = TRUE,
stringsAsFactors = FALSE,
col.names = c("country_name","country_code","year","suicide_rate"))
str(suicide_death_rates)
## 'data.frame': 6468 obs. of 4 variables:
## $ country_name: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ country_code: chr "AFG" "AFG" "AFG" "AFG" ...
## $ year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ suicide_rate: num 10.3 10.3 10.3 10.4 10.6 ...
head(suicide_death_rates)
# Reading and importing mental_health_disorders data
mental_health_disorders <-read.csv("people-with-mental-health-disorders.csv", header = TRUE,
stringsAsFactors = FALSE,
col.names = c("country_name","country_code","year","mental_disorder_count") )
str(mental_health_disorders)
## 'data.frame': 6156 obs. of 4 variables:
## $ country_name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ country_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ mental_disorder_count: num 1669742 1774327 1913721 2067860 2221090 ...
head(mental_health_disorders)
# Reading and importing population data
population <-read.csv("population.csv", header = TRUE,
stringsAsFactors = FALSE,
col.names = c("country_name","country_code","year","population_count") )
str(population)
## 'data.frame': 46883 obs. of 4 variables:
## $ country_name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ country_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ year : int 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ...
## $ population_count: num 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 ...
head(population)
# joining suicide_death_rates and mental_health_disorders datasets
joined_rates_data <- right_join(suicide_death_rates, mental_health_disorders,
by = c('country_name','country_code','year'))
# joining joined_rates_data and population
joined_rates_data <- left_join(joined_rates_data, population,
by= c('country_name','country_code','year'))
head(joined_rates_data)
#Structure of joined data
str(joined_rates_data)
## 'data.frame': 6156 obs. of 6 variables:
## $ country_name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ country_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ suicide_rate : num 10.3 10.3 10.3 10.4 10.6 ...
## $ mental_disorder_count: num 1669742 1774327 1913721 2067860 2221090 ...
## $ population_count : num 12412000 13299000 14486000 15817000 17076000 ...
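The behaviour of the two joins can be checked on a small example. The sketch below uses made-up tables (the names `rates` and `counts` and all values are hypothetical) to show that `right_join()` keeps every row of the second table and fills unmatched columns with NA, and that `anti_join()` lists the keys that found no match.

```r
library(dplyr)

# Toy tables (made-up values, not taken from the report's data)
rates <- data.frame(country_code = c("AAA", "BBB"), year = c(1990, 1990),
                    suicide_rate = c(10.3, 7.1))
counts <- data.frame(country_code = c("AAA", "CCC"), year = c(1990, 1990),
                     mental_disorder_count = c(1500, 900))

# right_join() keeps every row of the second table; unmatched columns become NA
joined <- right_join(rates, counts, by = c("country_code", "year"))
print(joined)

# anti_join() lists the rows of `counts` that found no partner in `rates`
unmatched <- anti_join(counts, rates, by = c("country_code", "year"))
print(unmatched)
```

Running a similar `anti_join()` against the population table would be one way to see which country and year combinations end up with a missing population_count.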
# Applying type conversion
joined_rates_data$country_name <- as.factor(joined_rates_data$country_name)
joined_rates_data$country_code <- as.factor(joined_rates_data$country_code)
joined_rates_data$mental_disorder_count <- as.integer(joined_rates_data$mental_disorder_count)
joined_rates_data$population_count <- as.integer(joined_rates_data$population_count)
# Verifying type of variables
class(joined_rates_data$country_name)
## [1] "factor"
class(joined_rates_data$country_code)
## [1] "factor"
class(joined_rates_data$mental_disorder_count)
## [1] "integer"
class(joined_rates_data$population_count)
## [1] "integer"
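One point worth checking when converting counts with `as.integer()` is the upper limit of R's integer type. The short sketch below uses illustrative values only: a number of the same magnitude as the largest population in this data (about 1.414e+09) still fits, whereas a larger value would silently become NA with a warning.

```r
# as.integer() can only represent values up to .Machine$integer.max
print(.Machine$integer.max)

ok  <- as.integer(1.414e9)                # fits, converted exactly
bad <- suppressWarnings(as.integer(3e9))  # too large for an integer, becomes NA

print(c(ok = ok, bad = bad))
```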
The dataset was observed to satisfy the tidy data principles already: each variable has its own column, each observation has its own row, and each value has its own cell. Therefore, no further reshaping was required.
* The country_code variable was removed, as it can be derived from the country name and is not required for the analysis.
* The average suicide rate and the average number of people with a mental disorder were calculated using the group_by() and summarise() functions.
# removing country_code column
joined_rates_data <- subset(joined_rates_data,
select= c('country_name','year','suicide_rate','mental_disorder_count','population_count'))
joined_rates_data%>% group_by(country_name) %>% summarise(avg_suicide_rate = round(mean(suicide_rate),3),
avg_person_mental_disorder = round(mean(mental_disorder_count,na.rm = TRUE),3))
head(joined_rates_data)
To find the mental health disorder rate, each mental_disorder_count value is divided by its corresponding population_count value and multiplied by 100,000, because the suicide rate is also expressed per 100,000 population.
A percentage column is then added, expressing suicide_rate as a percentage of mental_health_disorders_rate, to help understand the relation between the two.
# Mutating mental_health_disorders_rate
joined_rates_data <- mutate(joined_rates_data,
mental_health_disorders_rate =((mental_disorder_count/population_count)*100000))
head(joined_rates_data)
#Mutating percentage
joined_rates_data <- mutate(joined_rates_data, percentage = (suicide_rate/mental_health_disorders_rate)*100)
head(joined_rates_data)
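As a worked example of the two formulas, the sketch below applies them by hand to the first row of the data (Afghanistan, 1990), using the values displayed in the str() output earlier. The suicide rate is taken as the rounded 10.3 shown there, so the result is approximate.

```r
# Values for Afghanistan, 1990, as displayed in the str() output of the joined data
mental_disorder_count <- 1669742
population_count      <- 12412000
suicide_rate          <- 10.3   # rounded as displayed; deaths per 100,000

# Disorder prevalence per 100,000 people
mental_health_disorders_rate <- mental_disorder_count / population_count * 100000

# Suicide rate as a percentage of the disorder rate
percentage <- suicide_rate / mental_health_disorders_rate * 100

print(round(mental_health_disorders_rate, 2))
print(round(percentage, 4))
```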
Missing population_count values were imputed with the column mean. The derived variables mental_health_disorders_rate and percentage were then recalculated only for the rows where they were missing and merged into the original columns.
Infinite and NaN values were scanned for using a helper function, is.special, and none were found.
#Finding sum of missing values in each column
colSums(is.na(joined_rates_data))
## country_name year
## 0 0
## suicide_rate mental_disorder_count
## 0 0
## population_count mental_health_disorders_rate
## 837 837
## percentage
## 837
# mean imputation (for numerical variables)
joined_rates_data$population_count<- impute(joined_rates_data$population_count, fun = mean)
joined_rates_data <- mutate(joined_rates_data,
mental_health_disorders_rate = ifelse(is.na(mental_health_disorders_rate),
( mental_disorder_count/population_count)*100000, mental_health_disorders_rate))
joined_rates_data <- mutate(joined_rates_data,
percentage = ifelse(is.na(percentage),(suicide_rate/mental_health_disorders_rate)*100,
percentage))
#re verifying for na values
sum(is.na(joined_rates_data$population_count))
## [1] 0
sum(is.na(joined_rates_data$mental_health_disorders_rate))
## [1] 0
sum(is.na(joined_rates_data$percentage))
## [1] 0
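For readers unfamiliar with mean imputation, the following base R sketch shows the idea on a made-up vector. It is assumed to be equivalent in effect to the `impute(..., fun = mean)` call above: every missing value receives the same overall mean of the observed values.

```r
# Toy vector with gaps (made-up values)
population_count <- c(100, NA, 300, NA, 200)

# Replace each NA with the mean of the observed values
observed_mean <- mean(population_count, na.rm = TRUE)
imputed <- ifelse(is.na(population_count), observed_mean, population_count)

print(imputed)
```

Note that a single overall mean is used for all rows, so in the report every country with a missing population receives the same imputed value.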
# special function to find NaN and inconsistent values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
#finding number of inconsistent values in data
sapply(joined_rates_data, function(x) sum(is.special(x)))
## country_name year
## 0 0
## suicide_rate mental_disorder_count
## 0 0
## population_count mental_health_disorders_rate
## 0 0
## percentage
## 0
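The helper can be tried on a small made-up data frame to see exactly what it counts. In this sketch (the data frame `toy` is hypothetical), Inf and NaN are counted, a plain NA is not, and non-numeric columns contribute zero.

```r
# Same helper as in the report
is.special <- function(x) {
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

# Toy data frame (made-up values) with one Inf, one NaN and one plain NA
toy <- data.frame(a = c(1, Inf, 3), b = c(NaN, 2, NA), c = c("x", "y", "z"))

# Non-numeric columns yield NULL from is.special(), and sum(NULL) is 0
special_counts <- sapply(toy, function(x) sum(is.special(x)))
print(special_counts)
```

This is why the NA check with `is.na()` and the special-value check are run as two separate steps.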
# Subsetting joined_rates_data
joined_data_sub <- joined_rates_data %>% dplyr:: select(suicide_rate,mental_disorder_count,population_count,
mental_health_disorders_rate,percentage)
summary(joined_data_sub)
##
## 837 values imputed to 34177499
## suicide_rate mental_disorder_count population_count
## Min. : 1.527 Min. : 4579 Min. :4.500e+04
## 1st Qu.: 6.722 1st Qu.: 288429 1st Qu.:2.351e+06
## Median :10.658 Median : 1129996 Median :9.686e+06
## Mean :12.076 Mean : 14791266 Mean :3.418e+07
## 3rd Qu.:14.791 3rd Qu.: 5076646 3rd Qu.:3.418e+07
## Max. :98.832 Max. :947272757 Max. :1.414e+09
## mental_health_disorders_rate percentage
## Min. : 706.5 Min. :0.0003633
## 1st Qu.: 11405.9 1st Qu.:0.0449718
## Median : 12553.5 Median :0.0763073
## Mean : 43092.9 Mean :0.1012998
## 3rd Qu.: 14317.9 3rd Qu.:0.1148571
## Max. :2771626.9 Max. :1.4140575
#z-score method for detecting outliers
z.scores <- joined_data_sub %>% scores(type = "z")
length(which( abs(z.scores) >3 ))
## [1] 422
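The z-score rule can be illustrated in base R. The sketch below assumes that `scores(type = "z")` standardises each value as (x - mean) / sd, and applies the same |z| > 3 threshold to a made-up sample.

```r
# Toy sample (made-up values): nineteen typical observations and one extreme one
x <- c(rep(10, 19), 100)

# z-score: distance from the mean in units of the standard deviation
z <- (x - mean(x)) / sd(x)

# Positions whose absolute z-score exceeds 3
outlier_idx <- which(abs(z) > 3)
print(outlier_idx)
```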
# capping for imputing outliers
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ))
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
# Apply a user defined function "cap" to a data frame
joined_data_capped <- as.data.frame(sapply(joined_data_sub, FUN = cap ))
summary(joined_data_capped)
## suicide_rate mental_disorder_count population_count
## Min. : 1.527 Min. : 4579 Min. : 45000
## 1st Qu.: 6.722 1st Qu.: 288429 1st Qu.: 2351250
## Median :10.658 Median : 1129996 Median : 9685500
## Mean :11.601 Mean :10942720 Mean : 21166783
## 3rd Qu.:14.791 3rd Qu.: 5076646 3rd Qu.: 34177499
## Max. :26.854 Max. :66248822 Max. :103103750
## mental_health_disorders_rate percentage
## Min. : 8478 Min. :0.0003633
## 1st Qu.: 11406 1st Qu.:0.0449718
## Median : 12554 Median :0.0763073
## Mean : 30404 Mean :0.0887515
## 3rd Qu.: 14318 3rd Qu.:0.1148571
## Max. :169071 Max. :0.2462393
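To make the capping rule concrete, the sketch below applies the same `cap()` function to a made-up vector. As written, the function flags values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and replaces them with the 5th or 95th percentile respectively.

```r
# Capping function as defined in the report
cap <- function(x) {
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Toy sample (made-up values) with one extreme observation
x <- c(1:19, 100)
capped <- cap(x)

print(max(x))       # largest value before capping
print(max(capped))  # largest value after capping
```

Only the extreme value changes; values inside the fences are left untouched, which matches the unchanged quartiles in the summary above.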
# Histogram
suicide_rate <- joined_data_capped$suicide_rate
hist(suicide_rate )
#Square root Transformation
suicide_rate_transformed<- sqrt(joined_data_capped$suicide_rate)
hist(suicide_rate_transformed)
#Histogram
percentage <- joined_data_capped$percentage
hist(percentage)
#Square root Transformation
percentage_transformed <- sqrt(joined_data_capped$percentage)
hist(percentage_transformed)
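The effect of the square root transformation on right skew can also be checked numerically. The sketch below uses a made-up sample and a small moment-based `skewness()` helper, which is hypothetical and not part of the report.

```r
# Moment-based sample skewness (hypothetical helper, not part of the report)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy right-skewed sample (made-up values)
x <- c(1, 4, 9, 16, 100)

skew_before <- skewness(x)
skew_after  <- skewness(sqrt(x))

# The transformed sample should be less right-skewed than the original
print(c(before = skew_before, after = skew_after))
```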
## Analysis
pttest <- t.test(joined_data_capped$suicide_rate, joined_data_capped$mental_health_disorders_rate,
paired = TRUE,alternative = "two.sided",conf.level = .95)
pttest
##
## Paired t-test
##
## data: joined_data_capped$suicide_rate and joined_data_capped$mental_health_disorders_rate
## t = -47.752, df = 6155, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31639.77 -29144.39
## sample estimates:
## mean of the differences
## -30392.08
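A paired t-test works on the pairwise differences and tests whether their mean is zero, which is why the output reports a "mean of the differences". The sketch below, on made-up paired values, reproduces the test statistic by hand and compares it with `t.test()`.

```r
# Toy paired measurements (made-up values)
a <- c(12, 15, 11, 18, 14)
b <- c(10, 14, 12, 15, 11)

# Pairwise differences and the t statistic computed by hand
d <- a - b
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# The same statistic from the built-in paired test
t_builtin <- unname(t.test(a, b, paired = TRUE)$statistic)

print(c(manual = t_manual, builtin = t_builtin))
```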
The findings of the hypothesis test, based on the p-value and the confidence interval, are as follows.
- A paired-samples t-test was used for the analysis.
- The t-test resulted in a p-value < 2.2e-16, which is less than the alpha value of 0.05.
- The 95% CI for the mean difference (-31639.77 to -29144.39) does not include 0.
- The result is statistically significant, so it supports the alternative hypothesis.
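Hence we come to the conclusion that:
- The paired-samples t-test found a statistically significant mean difference between suicide rate and mental health disorder rate. Because a paired t-test compares means, this result alone does not measure the strength of the correlation between the two.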
The analysis could be improved by repeating it for different age groups and for different circumstances, such as income problems, marriage problems and so on.
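Since the stated aim concerns correlation, a correlation test could also complement the paired t-test. The following is only a sketch on synthetic data generated for illustration (the values and the assumed linear relation are made up); on the real data the two columns of joined_data_capped would take their place.

```r
set.seed(1)

# Synthetic data, made up for illustration only
mental_health_disorders_rate <- runif(50, min = 9000, max = 17000)
suicide_rate <- 2 + 0.0007 * mental_health_disorders_rate + rnorm(50, sd = 2)

# Pearson correlation test between the two rates
result <- cor.test(suicide_rate, mental_health_disorders_rate, method = "pearson")

print(result$estimate)  # correlation coefficient r
print(result$conf.int)  # 95% confidence interval for r
print(result$p.value)   # p-value for the null hypothesis r = 0
```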