Data preprocessing - Suicide rate and Mental health Datasets

Required packages

The following packages were used to produce this report:

# Useful for data manipulation
library(dplyr)
# Useful for value imputation
library(Hmisc)
# Useful for creating nice tables
library(knitr)
# Useful for building complex tables
library(kableExtra)
# Useful for creating neat records of analysis
library(rmarkdown)
# Useful for tidying up the dataset
library(tidyr)
library(forecast)
# Useful for date manipulation
library(lubridate)
# Useful for finding correlation
library(psychometric)
# Useful for detecting outliers
library(outliers)

Executive Summary

The aim of this report was to investigate whether suicide rate and mental health disorder rate are correlated over a period of time. To aid the investigation, three datasets were taken and preprocessed as follows:

- The datasets “suicide_death_rates”, “mental_health_disorders” and “population”, recorded by country and covering the years 1990 to 2017, were considered.
- The columns were renamed for clarity.
- The “mental_health_disorders” data was joined to the “suicide_death_rates” data on the “year”, “country_name” and “country_code” variables.
- On the joined dataset, data type verification and conversion were performed.
- For better understanding, the average suicide rate and the average number of people with mental disorders were calculated, and the suicide rate was expressed as a percentage of the mental health disorder rate.
- All missing and special values were checked, and missing values were imputed with their respective mean values.
- All numerical values were checked for outliers using the Z-score method, and outliers were handled by replacing them with their nearest permitted values using the capping technique.
- To understand the distribution of the suicide rate and percentage variables, histograms were plotted, which showed that the data is right-skewed. The data was then transformed using a square root transformation.

A paired t-test was then performed to understand the relation between suicide rate and mental health disorder rate.
Based on this test, we found a statistically significant relation between suicide rate and mental health disorder rate.

Data

The datasets used for this investigation were taken from the sources below:
Source of suicide death rates data: https://ourworldindata.org/grapher/suicide-death-rates
Source of mental health disorders data: https://ourworldindata.org/grapher/people-with-mental-health-disorders
Source of population data: https://ourworldindata.org/grapher/population
The suicide death rates data consists of 6468 observations of 4 variables: country name, country code, year and suicide death rate.
The suicide death rates column holds the age-standardized death rate from suicide, measured as the number of deaths per 100,000 individuals. Age-standardization assumes a constant population age structure, which allows comparisons between countries and over time without the effects of a changing age distribution within a population (e.g. aging).
The data is recorded from 1990 to 2017.
The mental health disorders data consists of 6156 observations of 4 variables: country name, country code, year and mental disorder count.
The mental disorder count column holds the number of people with mental health and neurodevelopmental disorders, not including alcohol and drug use disorders. The figures attempt to provide a true estimate of prevalence (going beyond reported diagnosis) based on medical and epidemiological data, surveys and meta-regression modelling.
The data is recorded from 1990 to 2016.
The population dataset consists of 46883 observations of 4 variables: country name, country code, year and population count.
The population count column holds the population by country, available from 1800 to 2019, based on Gapminder data, HYDE, and UN Population Division (2019) estimates.
Initially, the ‘Suicide death rates’ and ‘Mental health disorders’ datasets were joined using a right join, because mental_health_disorders does not have data for the year 2017. The two datasets were joined on the variables ‘country_name’, ‘country_code’ and ‘year’, and the new dataframe ‘joined_rates_data’ was created. Population data was then added to this dataframe using a left join on the same three variables.
The final dataframe ‘joined_rates_data’ has the following variables:
country_name : names of countries across the world
country_code : abbreviated code of the country name
year : year in which the data was collected
suicide_rate : deaths due to self-harm, both male and female, per 100,000 individuals
mental_disorder_count : number of people with mental health disorders
population_count : number of people in a particular country in the respective year

# Reading and importing suicide_death_rates data
suicide_death_rates <-read.csv("suicide-death-rates.csv",header = TRUE, 
                    stringsAsFactors = FALSE,
                    col.names = c("country_name","country_code","year","suicide_rate"))
str(suicide_death_rates)

## 'data.frame':    6468 obs. of  4 variables:
##  $ country_name: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code: chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year        : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ suicide_rate: num  10.3 10.3 10.3 10.4 10.6 ...

head(suicide_death_rates)

# Reading and importing mental_health_disorders data
mental_health_disorders <-read.csv("people-with-mental-health-disorders.csv", header = TRUE,
                                   stringsAsFactors = FALSE,
                    col.names = c("country_name","country_code","year","mental_disorder_count") )
str(mental_health_disorders)

## 'data.frame':    6156 obs. of  4 variables:
##  $ country_name         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code         : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ mental_disorder_count: num  1669742 1774327 1913721 2067860 2221090 ...

head(mental_health_disorders)

# Reading and importing population data
population <-read.csv("population.csv",  header = TRUE,
                      stringsAsFactors = FALSE,
                   col.names = c("country_name","country_code","year","population_count") )
str(population)

## 'data.frame':    46883 obs. of  4 variables:
##  $ country_name    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code    : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year            : int  1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ...
##  $ population_count: num  3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 ...

head(population)

# joining suicide_death_rates and mental_health_disorders datasets
joined_rates_data <- right_join(suicide_death_rates, mental_health_disorders,
                                by = c('country_name','country_code','year'))

# joining  joined_rates_data and population 
joined_rates_data <- left_join(joined_rates_data, population,
                               by= c('country_name','country_code','year'))


head(joined_rates_data)
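
The effect of the two joins can be illustrated on a toy example. The sketch below uses base R's merge() as a stand-in for dplyr's right_join() and left_join(); the data frames and their values are hypothetical and are not taken from the report's datasets. It shows that the right join drops the year missing from the second table, and that the left join leaves NA where the population table has no matching row.

```r
# Hypothetical toy data; country, years and values are illustrative only
suicide <- data.frame(country_name = "A", country_code = "AAA",
                      year = 2015:2017, suicide_rate = c(10.1, 10.2, 10.3))
mental <- data.frame(country_name = "A", country_code = "AAA",
                     year = 2015:2016, mental_disorder_count = c(1000, 1100))
pop <- data.frame(country_name = "A", country_code = "AAA",
                  year = 2015, population_count = 50000)

keys <- c("country_name", "country_code", "year")

# all.y = TRUE keeps every row of the second table (right join),
# so the 2017 row that exists only in `suicide` is dropped
joined <- merge(suicide, mental, by = keys, all.y = TRUE)

# all.x = TRUE keeps every row of the first table (left join),
# so 2016 stays in the result with population_count set to NA
joined <- merge(joined, pop, by = keys, all.x = TRUE)

nrow(joined)
sum(is.na(joined$population_count))
```

Unmatched rows in a left join are one way missing values can enter a joined dataframe, which is why the joined data is scanned for NA values later on.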

Understand

To understand the data structure of the ‘joined_rates_data’ data frame, the str() function was used. Based on the existing structure, it is observed that the variables ‘country_code’ and ‘country_name’ can be converted from character to factor, as there are only 195 possible countries.
The year column is left as integer, as manipulation can be performed on the year without date conversion.
The ‘country_name’ and ‘country_code’ variables were converted to factor using as.factor().
The variables ‘mental_disorder_count’ and ‘population_count’ were converted to integer using as.integer().
The data types of the converted variables were viewed using class().

#Structure of joined data
str(joined_rates_data)

## 'data.frame':    6156 obs. of  6 variables:
##  $ country_name         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code         : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ suicide_rate         : num  10.3 10.3 10.3 10.4 10.6 ...
##  $ mental_disorder_count: num  1669742 1774327 1913721 2067860 2221090 ...
##  $ population_count     : num  12412000 13299000 14486000 15817000 17076000 ...

# Applying type conversion

joined_rates_data$country_name <- as.factor(joined_rates_data$country_name)
joined_rates_data$country_code <- as.factor(joined_rates_data$country_code)
joined_rates_data$mental_disorder_count <- as.integer(joined_rates_data$mental_disorder_count)
joined_rates_data$population_count <- as.integer(joined_rates_data$population_count)

# Verifying type of variables
class(joined_rates_data$country_name)

## [1] "factor"

class(joined_rates_data$country_code)

## [1] "factor"

class(joined_rates_data$mental_disorder_count)

## [1] "integer"

class(joined_rates_data$population_count)

## [1] "integer"
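
Before converting large counts with as.integer(), it can help to confirm that the values fit in R's integer type. The following is a minimal sketch of such a check; the helper name fits_integer is hypothetical, and the two sample values are the first population count and the maximum population count shown in this report's output.

```r
# Largest value R's integer type can hold
.Machine$integer.max

# Values above that limit become NA (with a warning) under as.integer()
too_big <- suppressWarnings(as.integer(3e9))
is.na(too_big)

# as.integer() also drops any fractional part (truncation toward zero)
as.integer(2.9)

# Hypothetical helper: TRUE when every non-missing value fits in an integer
fits_integer <- function(x) all(abs(x) <= .Machine$integer.max, na.rm = TRUE)
fits_integer(c(12412000, 1.414e9))
```

Under this check, counts of the size seen in the population column stay within the integer range, so the conversion does not introduce NA values by overflow.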

Tidy & Manipulate Data I

In this step, the dataset is checked against the following tidy data characteristics established by Hadley Wickham and Garrett Grolemund:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

It is observed that the dataset satisfies the tidy principles and was readily available in a tidy format. Therefore, no reshaping was required. The following manipulations were applied:
* The country_code variable was removed, as it can be derived from the country name and is not required for analysis.
* The average suicide rate and the average number of people with a mental disorder were calculated per country using the group_by() and summarise() functions.

# removing country_code column
joined_rates_data <- subset(joined_rates_data,
                     select = c('country_name','year','suicide_rate','mental_disorder_count','population_count'))

# average suicide rate and average mental disorder count per country
joined_rates_data %>%
  group_by(country_name) %>%
  summarise(avg_suicide_rate = round(mean(suicide_rate), 3),
            avg_person_mental_disorder = round(mean(mental_disorder_count, na.rm = TRUE), 3))

head(joined_rates_data)
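
The per-country averages above are printed but not stored. As a sketch of how they could be kept for later use, the example below computes the same kind of grouped mean with base R's aggregate() on hypothetical toy data; the data frame toy and the name country_avgs are assumptions made for illustration.

```r
# Hypothetical toy data with two countries and two years each
toy <- data.frame(country_name = c("A", "A", "B", "B"),
                  suicide_rate = c(10, 12, 4, 6),
                  mental_disorder_count = c(1000, 1200, 300, 500))

# Mean of both columns within each country, stored in a new data frame
country_avgs <- aggregate(cbind(suicide_rate, mental_disorder_count) ~ country_name,
                          data = toy, FUN = mean)
country_avgs
```

Assigning the result to its own object leaves the row-level data frame unchanged, which matches how joined_rates_data is used in the following steps.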

Tidy & Manipulate Data II

To find the mental health disorder rate, each ‘mental_disorder_count’ value is divided by its respective population_count value and multiplied by 100,000, since the suicide rate is also expressed per 100,000 population.
A percentage column is added to express suicide_rate as a percentage of mental_health_disorders_rate.

# Mutating mental_health_disorders_rate
joined_rates_data <- mutate(joined_rates_data,
                            mental_health_disorders_rate =((mental_disorder_count/population_count)*100000))

head(joined_rates_data)

#Mutating percentage
joined_rates_data <- mutate(joined_rates_data, percentage = (suicide_rate/mental_health_disorders_rate)*100)


head(joined_rates_data)
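
As a worked example of the two calculations, the sketch below applies them to a single row, using the first values displayed by str() earlier in this report (Afghanistan, 1990). The suicide rate of 10.3 is the rounded value shown in that output, so the resulting percentage is approximate.

```r
# First-row values as displayed by str(); suicide_rate is rounded there
mental_disorder_count <- 1669742
population_count <- 12412000
suicide_rate <- 10.3

# People with a mental disorder per 100,000 population
mental_health_disorders_rate <- (mental_disorder_count / population_count) * 100000

# Suicide rate as a percentage of the mental health disorder rate
percentage <- (suicide_rate / mental_health_disorders_rate) * 100

mental_health_disorders_rate
percentage
```

The rate comes out at roughly 13,453 per 100,000 and the percentage at roughly 0.077, which is consistent with the range of the percentage column summarised later.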

Scan I

In this step the ‘joined_rates_data’ dataframe was scanned for missing values, inconsistencies and obvious errors.
Missing values in each column were counted using the colSums() function.
Based on the output, it was observed that the columns population_count, mental_health_disorders_rate and percentage each have 837 missing values.
As the missing values originate in population_count, and mental_health_disorders_rate and percentage were calculated from population_count, the missing values of population_count were imputed using the mean.
The NA values were imputed rather than removed in order to preserve the data.
The remaining variables, mental_health_disorders_rate and percentage, were recalculated only where they were missing and merged back into the original columns.
Infinite and NaN values were scanned using a special function, and no special values were found.

#Finding sum of missing values in each column
colSums(is.na(joined_rates_data))

##                 country_name                         year 
##                            0                            0 
##                 suicide_rate        mental_disorder_count 
##                            0                            0 
##             population_count mental_health_disorders_rate 
##                          837                          837 
##                   percentage 
##                          837

# mean imputation (for numerical variables)


joined_rates_data$population_count<- impute(joined_rates_data$population_count, fun = mean)



joined_rates_data <- mutate(joined_rates_data, 
                            mental_health_disorders_rate = ifelse(is.na(mental_health_disorders_rate),
                           ( mental_disorder_count/population_count)*100000, mental_health_disorders_rate))


joined_rates_data <- mutate(joined_rates_data,
                            percentage = ifelse(is.na(percentage),(suicide_rate/mental_health_disorders_rate)*100,
                            percentage))

#re verifying for na values
sum(is.na(joined_rates_data$population_count))

## [1] 0

sum(is.na(joined_rates_data$mental_health_disorders_rate))

## [1] 0

sum(is.na(joined_rates_data$percentage))

## [1] 0

# special function to find NaN and inconsistent values
is.special <- function(x){
 if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

 #finding number of inconsistent values in data
sapply(joined_rates_data, function(x) sum(is.special(x)))

##                 country_name                         year 
##                            0                            0 
##                 suicide_rate        mental_disorder_count 
##                            0                            0 
##             population_count mental_health_disorders_rate 
##                            0                            0 
##                   percentage 
##                            0
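
What mean imputation and the special-value check do can be seen on a small vector. This is a base R sketch on hypothetical values, not the Hmisc impute() call used above; it only mirrors the idea of replacing NA with the column mean and of flagging infinite or NaN entries.

```r
# Hypothetical column with one missing value
population_count <- c(100, 200, NA, 300)

# Mean of the observed values, then fill the gap with it
pop_mean <- mean(population_count, na.rm = TRUE)
imputed <- ifelse(is.na(population_count), pop_mean, population_count)
imputed

# Special values: Inf and NaN are flagged, a plain NA is not
x <- c(1, Inf, NaN, NA)
special <- is.infinite(x) | is.nan(x)
sum(special)
```

Because a plain NA is not counted as special, the missing-value scan and the special-value scan report different things and both are needed.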

Scan II

To detect outliers, joined_rates_data is subsetted to its numeric variables: suicide_rate, mental_disorder_count, population_count, mental_health_disorders_rate and percentage.
The new dataframe joined_data_sub is used for detecting outliers via the Z-score method.
Since each numeric variable is examined on its own, this is a univariate outlier problem, for which the Z-score method is suitable.
There were 422 values with an absolute Z-score above 3. Even though the outlier percentage was close to 5%, the outliers were imputed using the capping method rather than removed, in order to preserve all the data. The cap function flags values lying more than 1.5 × IQR beyond the quartiles and replaces them with the 5th or 95th percentile.
Z-score and capping methods were used for identifying and imputing the outliers as they work well on continuous values.

# Subsetting joined_rates_data
joined_data_sub <- joined_rates_data %>% dplyr:: select(suicide_rate,mental_disorder_count,population_count,
                                mental_health_disorders_rate,percentage) 

summary(joined_data_sub)

## 
##  837 values imputed to 34177499

##   suicide_rate    mental_disorder_count population_count   
##  Min.   : 1.527   Min.   :     4579     Min.   :4.500e+04  
##  1st Qu.: 6.722   1st Qu.:   288429     1st Qu.:2.351e+06  
##  Median :10.658   Median :  1129996     Median :9.686e+06  
##  Mean   :12.076   Mean   : 14791266     Mean   :3.418e+07  
##  3rd Qu.:14.791   3rd Qu.:  5076646     3rd Qu.:3.418e+07  
##  Max.   :98.832   Max.   :947272757     Max.   :1.414e+09  
##  mental_health_disorders_rate   percentage       
##  Min.   :    706.5            Min.   :0.0003633  
##  1st Qu.:  11405.9            1st Qu.:0.0449718  
##  Median :  12553.5            Median :0.0763073  
##  Mean   :  43092.9            Mean   :0.1012998  
##  3rd Qu.:  14317.9            3rd Qu.:0.1148571  
##  Max.   :2771626.9            Max.   :1.4140575

# z-score method for detecting outliers
z.scores <- joined_data_sub  %>% scores(type = "z")
length(which( abs(z.scores) >3 ))

## [1] 422

# capping for imputing outliers: values beyond the 1.5 * IQR fences
# are replaced with the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ))
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

# Apply a user defined function "cap" to a data frame
joined_data_capped <-  as.data.frame(sapply(joined_data_sub, FUN = cap ))
summary(joined_data_capped)

##   suicide_rate    mental_disorder_count population_count   
##  Min.   : 1.527   Min.   :    4579      Min.   :    45000  
##  1st Qu.: 6.722   1st Qu.:  288429      1st Qu.:  2351250  
##  Median :10.658   Median : 1129996      Median :  9685500  
##  Mean   :11.601   Mean   :10942720      Mean   : 21166783  
##  3rd Qu.:14.791   3rd Qu.: 5076646      3rd Qu.: 34177499  
##  Max.   :26.854   Max.   :66248822      Max.   :103103750  
##  mental_health_disorders_rate   percentage       
##  Min.   :  8478               Min.   :0.0003633  
##  1st Qu.: 11406               1st Qu.:0.0449718  
##  Median : 12554               Median :0.0763073  
##  Mean   : 30404               Mean   :0.0887515  
##  3rd Qu.: 14318               3rd Qu.:0.1148571  
##  Max.   :169071               Max.   :0.2462393
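
The two steps, detection by Z-score and capping at percentiles, can be followed on a toy vector. The sketch below computes the Z-scores by hand in base R rather than through outliers::scores(), and cap_sketch repeats the logic of the cap function above under a different, hypothetical name; the data is invented for illustration.

```r
# Hypothetical data: twenty ordinary values and one extreme value
x <- c(rep(c(10, 11, 12, 13, 14), 4), 95)

# Z-score rule: flag values more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
z_flagged <- which(abs(z) > 3)
z_flagged

# Capping: values outside the 1.5 * IQR fences are replaced
# with the 5th percentile (low side) or 95th percentile (high side)
cap_sketch <- function(v) {
  q <- quantile(v, c(0.05, 0.25, 0.75, 0.95))
  v[v < q[2] - 1.5 * IQR(v)] <- q[1]
  v[v > q[3] + 1.5 * IQR(v)] <- q[4]
  v
}
x_capped <- cap_sketch(x)
range(x_capped)
```

Only the extreme value is flagged and pulled in to the 95th percentile, while the other values are left as they were. This is the same pattern seen in the capped summary above, where the maxima shrink but the quartiles stay the same.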

Transform

To understand the distribution of the suicide rate and percentage columns, histograms were plotted, which showed that the data is right-skewed.
To reduce the skewness of the suicide rate and percentage columns, a square root transformation was applied, after which the histograms appear approximately normal.

# Histogram
suicide_rate <- joined_data_capped$suicide_rate
hist(suicide_rate )

#Square root Transformation
suicide_rate_transformed<- sqrt(joined_data_capped$suicide_rate)
hist(suicide_rate_transformed)

#Histogram
percentage <- joined_data_capped$ percentage
hist( percentage)

#Square root Transformation
percentage_transformed <- sqrt(joined_data_capped$ percentage)
hist(percentage_transformed)
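
The visual impression from the histograms can be backed by a number. The sketch below defines a simple moment-based sample skewness (the helper name skewness is an assumption, not a function from the packages loaded above) and applies it to simulated right-skewed data before and after a square root transformation; the simulated data stands in for the report's columns and is not taken from them.

```r
set.seed(1)
# Simulated right-skewed stand-in for a column such as suicide_rate
x <- rexp(1000, rate = 0.1)

# Hypothetical helper: moment-based sample skewness
# (0 for a symmetric distribution, positive for a right-skewed one)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_before <- skewness(x)
skew_after <- skewness(sqrt(x))
c(before = skew_before, after = skew_after)
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution, which is the effect the square root transformation is meant to achieve.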

Analysis

To understand the relation between suicide rate and mental health disorder rate, a paired t-test was conducted, assuming that the data is normally distributed.

pttest <- t.test(joined_data_capped$suicide_rate, joined_data_capped$mental_health_disorders_rate,
                 paired = TRUE,alternative = "two.sided",conf.level = .95)
pttest

## 
##  Paired t-test
## 
## data:  joined_data_capped$suicide_rate and joined_data_capped$mental_health_disorders_rate
## t = -47.752, df = 6155, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31639.77 -29144.39
## sample estimates:
## mean of the differences 
##               -30392.08
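
A paired t-test compares the means of the two variables. Since the stated aim of the report is to investigate correlation, a correlation test is one complementary check that could be run alongside it. The sketch below uses base R's cor.test() on simulated data in which a positive association is built in by construction; the variable values are invented, and the output says nothing about the report's actual data.

```r
set.seed(42)
# Simulated stand-ins; mental_rate is constructed to rise with suicide_rate
suicide_rate <- rnorm(100, mean = 12, sd = 4)
mental_rate <- 12000 + 150 * suicide_rate + rnorm(100, sd = 300)

# Pearson correlation test: estimate of r and p-value for H0: r = 0
ct <- cor.test(suicide_rate, mental_rate, method = "pearson")
ct$estimate
ct$p.value
```

On the report's data, the same call could be applied to joined_data_capped$suicide_rate and joined_data_capped$mental_health_disorders_rate.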

Interpretation

Null hypothesis: There is no relation between suicide rate and mental health disorder rate.
Alternate hypothesis: There is a relation between suicide rate and mental health disorder rate.

The findings from the hypothesis test, based on the p-value and confidence interval, are:
- A paired-samples t-test was used for the analysis.
- The t-test resulted in a p-value < 2.2e-16, which is less than the alpha value of 0.05.
- The 95% confidence interval for the mean difference (-31639.77 to -29144.39) does not contain 0.
- The result is statistically significant, so it supports the alternate hypothesis.

Hence we come to the conclusion that:

- The paired-samples t-test found a statistically significant relation between suicide rate and mental health disorder rate.

Discussion:

The analysis could be improved by repeating it for different age groups and for different circumstances, such as income problems, marriage problems and so on.

Reference:

[1] Source to Suicide death rates data: https://ourworldindata.org/grapher/suicide-death-rates
[2] Source to Mental health disorders data: https://ourworldindata.org/grapher/people-with-mental-health-disorders
[3] Source of population data: https://ourworldindata.org/grapher/population
[4] https://www.stats.indiana.edu/vitals/CalculatingARate.pdf