Required packages

The following packages were used to produce this report.

# Useful for data manipulation
library(dplyr)
# Useful for value imputation
library(Hmisc)
# Useful for creating nice tables
library(knitr)
# Useful for building complex tables
library(kableExtra)
# Useful for creating neat records of analysis
library(rmarkdown)
# Useful for tidying up the dataset
library(tidyr)
# Useful for time series forecasting
library(forecast)
# Useful for date manipulation
library(lubridate)
# Useful for finding correlation
library(psychometric)
# Useful for detecting outliers
library(outliers)

Executive Summary

The aim of this report was to investigate whether there is a correlation between suicide rate and mental health disorder rate over a period of time. To aid the investigation, three datasets were taken and preprocessed as follows:

- The datasets “suicide_death_rates”, “mental_health_disorders” and “population”, recorded by country for the years 1990 to 2017, were considered.
- The columns were renamed for clarity.
- The “mental_health_disorders” data was joined to the “suicide_death_rates” data on the “country_name”, “country_code” and “year” variables, and the “population” data was then joined on the same variables.
- On the joined dataset, data types were verified and converted.
- For better understanding, the average suicide rate and the average mental disorder count were calculated, and the suicide rate was expressed as a percentage of the mental health disorder rate.
- All missing and special values were checked, and missing values were imputed with their respective mean values.
- All numerical values were checked for outliers using z-scores, and outliers were handled by capping, which replaces them with the 5th or 95th percentile.
- To understand the distribution of the suicide rate and percentage variables, histograms were plotted. They showed that the data is right-skewed, so a square root transformation was applied.

Data

# Reading and importing suicide_death_rates data
suicide_death_rates <- read.csv("suicide-death-rates.csv", header = TRUE,
                                stringsAsFactors = FALSE,
                                col.names = c("country_name", "country_code", "year", "suicide_rate"))
str(suicide_death_rates)
## 'data.frame':    6468 obs. of  4 variables:
##  $ country_name: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code: chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year        : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ suicide_rate: num  10.3 10.3 10.3 10.4 10.6 ...
head(suicide_death_rates)
# Reading and importing mental_health_disorders data
mental_health_disorders <- read.csv("people-with-mental-health-disorders.csv", header = TRUE,
                                    stringsAsFactors = FALSE,
                                    col.names = c("country_name", "country_code", "year", "mental_disorder_count"))
str(mental_health_disorders)
## 'data.frame':    6156 obs. of  4 variables:
##  $ country_name         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code         : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ mental_disorder_count: num  1669742 1774327 1913721 2067860 2221090 ...
head(mental_health_disorders)
# Reading and importing population data
population <- read.csv("population.csv", header = TRUE,
                       stringsAsFactors = FALSE,
                       col.names = c("country_name", "country_code", "year", "population_count"))
str(population)
## 'data.frame':    46883 obs. of  4 variables:
##  $ country_name    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code    : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year            : int  1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ...
##  $ population_count: num  3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 3280000 ...
head(population)
# joining suicide_death_rates and mental_health_disorders datasets
joined_rates_data <- right_join(suicide_death_rates, mental_health_disorders,
                                by = c('country_name','country_code','year'))

# joining  joined_rates_data and population 
joined_rates_data <- left_join(joined_rates_data, population,
                               by= c('country_name','country_code','year'))


head(joined_rates_data)
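
The effect of the two joins can be illustrated on a small example. The sketch below uses hypothetical toy data frames (toy_suicide, toy_mental and toy_population are illustrative names and values, not part of the report's datasets). It shows that right_join() keeps every row of the second table, that left_join() keeps every row of the first, and that unmatched rows are filled with NA.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the report's datasets)
toy_suicide <- data.frame(
  country_name = c("A", "A", "B"),
  year         = c(2000L, 2001L, 2000L),
  suicide_rate = c(10.1, 10.3, 7.5)
)
toy_mental <- data.frame(
  country_name          = c("A", "A", "C"),
  year                  = c(2000L, 2001L, 2000L),
  mental_disorder_count = c(1200, 1250, 800)
)
toy_population <- data.frame(
  country_name     = c("A", "A"),
  year             = c(2000L, 2001L),
  population_count = c(10000, 10100)
)

# right_join keeps every row of toy_mental (country "B" is dropped, "C" is kept)
toy_joined <- right_join(toy_suicide, toy_mental, by = c("country_name", "year"))

# left_join keeps every row of toy_joined, even without a population match
toy_joined <- left_join(toy_joined, toy_population, by = c("country_name", "year"))

# Rows without a match are kept and their missing fields are set to NA
colSums(is.na(toy_joined))
```

Unmatched country and year combinations are one way in which missing values can enter a joined dataset.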

Understand

#Structure of joined data
str(joined_rates_data)
## 'data.frame':    6156 obs. of  6 variables:
##  $ country_name         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code         : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ suicide_rate         : num  10.3 10.3 10.3 10.4 10.6 ...
##  $ mental_disorder_count: num  1669742 1774327 1913721 2067860 2221090 ...
##  $ population_count     : num  12412000 13299000 14486000 15817000 17076000 ...
# Applying type conversion

joined_rates_data$country_name <- as.factor(joined_rates_data$country_name)
joined_rates_data$country_code <- as.factor(joined_rates_data$country_code)
joined_rates_data$mental_disorder_count <- as.integer(joined_rates_data$mental_disorder_count)
joined_rates_data$population_count <- as.integer(joined_rates_data$population_count)

# Verifying type of variables
class(joined_rates_data$country_name)
## [1] "factor"
class(joined_rates_data$country_code)
## [1] "factor"
class(joined_rates_data$mental_disorder_count)
## [1] "integer"
class(joined_rates_data$population_count)
## [1] "integer"
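
Converting counts with as.integer() is only safe when every value fits in R's 32-bit integer range. The following is a minimal sketch of such a check, assuming hypothetical count values (counts, fits_integer and safe_counts are illustrative names, not objects from this report).

```r
# Hypothetical counts; the last one exceeds the 32-bit integer range
counts <- c(1669742, 947272757, 3e9)

# .Machine$integer.max is the largest value an R integer can hold
fits_integer <- abs(counts) <= .Machine$integer.max
fits_integer

# Convert only when every value fits; otherwise keep the numeric (double) type
safe_counts <- if (all(fits_integer)) as.integer(counts) else counts
class(safe_counts)
```

Values outside the range would otherwise be converted to NA with a warning.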

Tidy & Manipulate Data I

The tidy data principles are:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

The dataset satisfies these principles and was already available in a tidy format, so no reshaping was required. The following manipulations were applied:

* The country_code variable was removed, as it can be derived from the country name and is not required for the analysis.
* The average suicide rate and the average number of people with a mental disorder were calculated per country using the group_by() and summarise() functions.

# Removing the country_code column

joined_rates_data <- subset(joined_rates_data,
                            select = c('country_name', 'year', 'suicide_rate',
                                       'mental_disorder_count', 'population_count'))

# Average suicide rate and average mental disorder count per country
# (the summary is printed here, not stored)
joined_rates_data %>%
  group_by(country_name) %>%
  summarise(avg_suicide_rate = round(mean(suicide_rate), 3),
            avg_person_mental_disorder = round(mean(mental_disorder_count, na.rm = TRUE), 3))
head(joined_rates_data)
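
If the per-country averages are needed later, the result of summarise() has to be assigned to an object. The sketch below shows this on hypothetical toy data (toy and toy_avg are illustrative names, and the values are made up).

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(
  country_name          = c("A", "A", "B", "B"),
  suicide_rate          = c(10, 12, 6, NA),
  mental_disorder_count = c(100, 110, 50, 70)
)

# Assigning the result keeps the per-country summary for later use;
# na.rm = TRUE stops a single missing value from turning a group mean into NA
toy_avg <- toy %>%
  group_by(country_name) %>%
  summarise(avg_suicide_rate = round(mean(suicide_rate, na.rm = TRUE), 3),
            avg_person_mental_disorder = round(mean(mental_disorder_count, na.rm = TRUE), 3))

toy_avg
```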

Tidy & Manipulate Data II

# Mutating mental_health_disorders_rate (cases per 100,000 people)
joined_rates_data <- mutate(joined_rates_data,
                            mental_health_disorders_rate = (mental_disorder_count / population_count) * 100000)

head(joined_rates_data)
# Mutating percentage (suicide rate as a percentage of the mental health disorder rate)
joined_rates_data <- mutate(joined_rates_data,
                            percentage = (suicide_rate / mental_health_disorders_rate) * 100)

head(joined_rates_data)
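
As a worked example of the two formulas, the sketch below applies them to the first row shown in the str() output above (Afghanistan, 1990). The inputs are the values as displayed there, so the suicide rate is rounded and the results are approximate.

```r
# Values as displayed in the str() output for Afghanistan, 1990
mental_disorder_count <- 1669742
population_count      <- 12412000
suicide_rate          <- 10.3      # displayed value, rounded

# Mental health disorder rate per 100,000 people (approximately 13452.6)
mental_health_disorders_rate <- (mental_disorder_count / population_count) * 100000
mental_health_disorders_rate

# Suicide rate as a percentage of the mental health disorder rate (approximately 0.0766)
percentage <- (suicide_rate / mental_health_disorders_rate) * 100
percentage
```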

Scan I

#Finding sum of missing values in each column
colSums(is.na(joined_rates_data))
##                 country_name                         year 
##                            0                            0 
##                 suicide_rate        mental_disorder_count 
##                            0                            0 
##             population_count mental_health_disorders_rate 
##                          837                          837 
##                   percentage 
##                          837
# Mean imputation of the missing population counts
joined_rates_data$population_count <- impute(joined_rates_data$population_count, fun = mean)

# Recalculating the derived variables where they were missing
joined_rates_data <- mutate(joined_rates_data,
                            mental_health_disorders_rate = ifelse(is.na(mental_health_disorders_rate),
                                                                  (mental_disorder_count / population_count) * 100000,
                                                                  mental_health_disorders_rate))

joined_rates_data <- mutate(joined_rates_data,
                            percentage = ifelse(is.na(percentage),
                                                (suicide_rate / mental_health_disorders_rate) * 100,
                                                percentage))

# Re-verifying for NA values
sum(is.na(joined_rates_data$population_count))
## [1] 0
sum(is.na(joined_rates_data$mental_health_disorders_rate))
## [1] 0
sum(is.na(joined_rates_data$percentage))
## [1] 0
# Helper function to flag infinite and NaN values in numeric columns
is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

# Finding the number of special values in each column
sapply(joined_rates_data, function(x) sum(is.special(x)))
##                 country_name                         year 
##                            0                            0 
##                 suicide_rate        mental_disorder_count 
##                            0                            0 
##             population_count mental_health_disorders_rate 
##                            0                            0 
##                   percentage 
##                            0
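
The imputation and the special-value check can be sketched in base R on a small example. The data frame below is hypothetical (toy, pop_mean and n_special are illustrative names), and the mean replacement is written out by hand rather than with impute(), so it is an approximation of the steps above and not the exact code used in this report.

```r
# Toy data (hypothetical values) with one missing population count
toy <- data.frame(
  mental_disorder_count = c(500, 600, 700),
  population_count      = c(10000, NA, 30000)
)

# Mean of the observed population counts
pop_mean <- mean(toy$population_count, na.rm = TRUE)

# Replace the missing value with that mean
toy$population_count[is.na(toy$population_count)] <- pop_mean

# Recompute the derived rate per 100,000 people from the imputed counts
toy$rate <- toy$mental_disorder_count / toy$population_count * 100000

# Count infinite or NaN values in the derived rate
n_special <- sum(is.infinite(toy$rate) | is.nan(toy$rate))

toy
n_special
```

Because the mean is taken over all rows, an imputed population can be far from the true population of a small or large country, which in turn affects any rate derived from it.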

Scan II

# Subsetting the numeric variables of joined_rates_data
joined_data_sub <- joined_rates_data %>%
  dplyr::select(suicide_rate, mental_disorder_count, population_count,
                mental_health_disorders_rate, percentage)

summary(joined_data_sub)
## 
##  837 values imputed to 34177499
##   suicide_rate    mental_disorder_count population_count   
##  Min.   : 1.527   Min.   :     4579     Min.   :4.500e+04  
##  1st Qu.: 6.722   1st Qu.:   288429     1st Qu.:2.351e+06  
##  Median :10.658   Median :  1129996     Median :9.686e+06  
##  Mean   :12.076   Mean   : 14791266     Mean   :3.418e+07  
##  3rd Qu.:14.791   3rd Qu.:  5076646     3rd Qu.:3.418e+07  
##  Max.   :98.832   Max.   :947272757     Max.   :1.414e+09  
##  mental_health_disorders_rate   percentage       
##  Min.   :    706.5            Min.   :0.0003633  
##  1st Qu.:  11405.9            1st Qu.:0.0449718  
##  Median :  12553.5            Median :0.0763073  
##  Mean   :  43092.9            Mean   :0.1012998  
##  3rd Qu.:  14317.9            3rd Qu.:0.1148571  
##  Max.   :2771626.9            Max.   :1.4140575
# z-score method for detecting outliers
z.scores <- joined_data_sub %>% scores(type = "z")
length(which(abs(z.scores) > 3))
## [1] 422
# Capping: values beyond the 1.5 * IQR fences are replaced by the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Apply the user-defined function "cap" to each column of the data frame
joined_data_capped <- as.data.frame(sapply(joined_data_sub, FUN = cap))
summary(joined_data_capped)
##   suicide_rate    mental_disorder_count population_count   
##  Min.   : 1.527   Min.   :    4579      Min.   :    45000  
##  1st Qu.: 6.722   1st Qu.:  288429      1st Qu.:  2351250  
##  Median :10.658   Median : 1129996      Median :  9685500  
##  Mean   :11.601   Mean   :10942720      Mean   : 21166783  
##  3rd Qu.:14.791   3rd Qu.: 5076646      3rd Qu.: 34177499  
##  Max.   :26.854   Max.   :66248822      Max.   :103103750  
##  mental_health_disorders_rate   percentage       
##  Min.   :  8478               Min.   :0.0003633  
##  1st Qu.: 11406               1st Qu.:0.0449718  
##  Median : 12554               Median :0.0763073  
##  Mean   : 30404               Mean   :0.0887515  
##  3rd Qu.: 14318               3rd Qu.:0.1148571  
##  Max.   :169071               Max.   :0.2462393
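
The detection and capping steps can be followed on a single toy vector. In the sketch below, the z-scores are computed by hand in base R instead of with scores(), and the vector x is hypothetical. cap_values is an illustrative copy of the cap function above.

```r
# Toy vector (hypothetical values) with one extreme observation
x <- c(1:20, 200)

# z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)
n_outliers <- sum(abs(z) > 3)
n_outliers

# Same capping rule as in the report: values outside the 1.5 * IQR fences
# are replaced by the 5th or the 95th percentile
cap_values <- function(v) {
  q <- quantile(v, c(0.05, 0.25, 0.75, 0.95))
  v[v < q[2] - 1.5 * IQR(v)] <- q[1]
  v[v > q[3] + 1.5 * IQR(v)] <- q[4]
  v
}

capped <- cap_values(x)
range(capped)
```

Detection uses z-scores and capping uses the IQR fences, so the two rules do not necessarily flag the same observations.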

Transform

# Histogram of the suicide rate
suicide_rate <- joined_data_capped$suicide_rate
hist(suicide_rate)

# Square root transformation
suicide_rate_transformed <- sqrt(joined_data_capped$suicide_rate)
hist(suicide_rate_transformed)

# Histogram of the percentage
percentage <- joined_data_capped$percentage
hist(percentage)

# Square root transformation
percentage_transformed <- sqrt(joined_data_capped$percentage)
hist(percentage_transformed)
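
The effect of a square root transformation on right skew can also be checked numerically. The sketch below computes a simple moment-based skewness on a hypothetical right-skewed vector. The skew() helper and the vector x are assumptions for illustration and are not part of the report's analysis.

```r
# Moment-based sample skewness (positive values indicate right skew)
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical right-skewed data: squares of evenly spaced values
x <- (1:50)^2 / 100

skew_before <- skew(x)        # positive: long right tail
skew_after  <- skew(sqrt(x))  # close to zero: roughly symmetric

skew_before
skew_after
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.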

Analysis

pttest <- t.test(joined_data_capped$suicide_rate, joined_data_capped$mental_health_disorders_rate,
                 paired = TRUE, alternative = "two.sided", conf.level = .95)
pttest
## 
##  Paired t-test
## 
## data:  joined_data_capped$suicide_rate and joined_data_capped$mental_health_disorders_rate
## t = -47.752, df = 6155, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31639.77 -29144.39
## sample estimates:
## mean of the differences 
##               -30392.08

Interpretation

The findings were drawn from the results of the hypothesis test, using the p-value and the confidence interval.

- A paired-samples t-test was used for the analysis.
- The t-test resulted in a p-value of less than 2.2e-16, which is below the significance level of 0.05.
- The 95% confidence interval for the mean difference (-31639.77 to -29144.39) does not contain 0.
- The result is statistically significant, so it supports the alternative hypothesis that the true difference in means is not equal to 0.

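Hence we come to the conclusion that:

- The paired-samples t-test found a statistically significant difference between the mean suicide rate and the mean mental health disorder rate. This test compares means and does not by itself measure the correlation between the two variables.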
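
Since the stated aim concerns correlation, a correlation test could complement the paired t-test. The following is a minimal sketch, assuming Pearson correlation via cor.test() on hypothetical toy values. The vectors toy_suicide_rate and toy_disorder_rate are made up and no result from the report's data is implied.

```r
# Toy data (hypothetical values, not results from this report)
toy_suicide_rate  <- c(5, 7, 9, 11, 13, 15)
toy_disorder_rate <- c(100, 118, 135, 160, 172, 195)

# Paired t-test: asks whether the mean difference between the pairs is zero
toy_ttest <- t.test(toy_suicide_rate, toy_disorder_rate, paired = TRUE)

# Pearson correlation test: asks whether the two variables vary together
toy_cor <- cor.test(toy_suicide_rate, toy_disorder_rate, method = "pearson")

toy_ttest$p.value
toy_cor$estimate   # correlation coefficient r
toy_cor$p.value
```

The two tests answer different questions, so a significant paired t-test does not establish that the variables are correlated, and the reverse also holds.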

Discussion:

The analysis could be improved by repeating it for different age groups and for different circumstances, such as income problems, marital problems and so on.

Reference: