Save someone you know from breaking apart

INTRODUCTION :

Suicide is a complex and devastating phenomenon that has far-reaching impacts on individuals, families, and communities worldwide. Despite significant efforts to increase awareness and prevention, suicide remains a serious public health issue.

The aim of this study is to investigate the correlation between sex, location, and time period and suicide rates in the global population in the hopes of providing insights to the public and possibly contributing to the mental health community.

To accomplish this, we will use R for data cleaning, analysis, and statistical correlation; apply Pearson correlation to detect relationships; use an ARIMA prediction model to forecast trends; visualize the data with R and Tableau; and publish the findings on Tableau Public and RPubs.

Our ultimate goal is to publish our findings to share the analysis with others in the hopes of contributing to the development of more effective suicide prevention strategies and policies.

Before starting this study, we assumed that the suicide rate was trending upward, that there was no correlation between suicide rate and gender, and that all countries and regions were equally impacted.

Data collection and preparation:

During this phase we will gather and prepare data for analysis.

Looking for data source:

  • Data Source: https://www.who.int/data/gho/data/themes/mental-health/suicide-rates
  • Data period: 2000-2019
  • Data Last updated: 2021-07-06.
  • Indicator name: Age-standardized mortality rate (per 100 000 population)
  • Short name: Age-standardized mortality rate (per 100 000 population)
  • Data type: Rate
  • Indicator Id: 78
  • Topic: Mortality and burden of disease

When it comes to global health data, the most reliable source that comes to mind is the World Health Organization (WHO). The WHO is a specialized agency of the United Nations responsible for international public health. Headquartered in Geneva, Switzerland, it was founded on 7 April 1948 and currently has six regional offices and 150 field offices around the world.

Limitations to consider:

  • Cultural influence - cultural norms and beliefs may affect how suicide is perceived and how accurately it is reported.

  • Stigma and stereotypes - people may be ashamed to report a suicide or a suicide attempt, or to seek help openly, which may result in underreporting or misreporting.

  • Underreporting - happens when a case is not reported; it may be driven by cultural beliefs, stigma, and stereotypes, or by an inadequate reporting system.

  • Misclassification - happens when a suicide is wrongly recorded as another cause of death, e.g. an accidental overdose.

By considering these limitations we stay aware of what the available data can and cannot tell us, which gives us a bird’s-eye view of the data set.

Setting up the R environment

Now that I have gathered the data from the source, the next step is to prepare the R working environment. The functions below will help us set up R for analysis and visualization.

Packages installed and their uses:

  • “readr” : reading delimited files such as CSV
  • “dplyr” : data manipulation
  • “tidyverse” : collection of packages for tidying and manipulating data
  • “skimr” : summary statistics and data overview
  • “janitor” : cleaning data
  • “ggplot2” : graphical presentation
  • “here” : file referencing

Installing the packages

install.packages("readr")     # reading delimited files such as CSV
install.packages("dplyr")     # data manipulation
install.packages("tidyverse") # collection of packages for tidying and manipulating data
install.packages("skimr")     # summary statistics and data overview
install.packages("janitor")   # cleaning data
install.packages("ggplot2")   # graphical presentation
install.packages("here")      # file referencing

Activating the installed packages

library("readr")     # reading delimited files such as CSV
library("dplyr")     # data manipulation
library("tidyverse") # collection of packages for tidying and manipulating data
library("skimr")     # summary statistics and data overview
library("janitor")   # cleaning data
library("ggplot2")   # graphical presentation
library("here")      # file referencing

Importing the data

In this stage, we will walk through the import process, after downloading the open data from https://www.who.int/data/gho/data/themes/mental-health/suicide-rates.

We will use the read_excel() function from the “readxl” package to read the Excel (.xlsx) file.

For the function to be available, we must first install and load the “readxl” library.

install.packages("readxl")
library(readxl)

We will then use read_excel() to import the data.

Global_suicide_rates_WHO_2000_2019_RAW <- read_excel("Global_suicide_rates_WHO_2000-2019_RAW.xlsx")
View(Global_suicide_rates_WHO_2000_2019_RAW)

Snapshot view

Data manipulation

Using select() and View(), we will create a new data frame that contains only the columns needed for the analysis:

  • ParentLocationCode

  • ParentLocation

  • Location

  • FactValueNumeric

  • Period

  • Dim1

We will use the “select()” function to create a data frame named data that holds only the desired fields for this case study.

library(dplyr)
# Start from the imported raw table and keep only the columns we need
data <- Global_suicide_rates_WHO_2000_2019_RAW %>%
  select(ParentLocationCode, ParentLocation, Location, FactValueNumeric, Period, Dim1)
View(data)

Using the colnames() function, we will rename the columns so the data makes more sense to our audience, as some of the original names are too technical.

Renaming syntax: colnames(data)[colnames(data) == “current column name”] <- “new column name”, applied to the columns one by one.

colnames(data)[colnames(data) == "ParentLocationCode"] <- "Region_code"
colnames(data)[colnames(data) == "ParentLocation"] <- "Region"
colnames(data)[colnames(data) == "Location"] <- "Country"
colnames(data)[colnames(data) == "FactValueNumeric"] <- "Suicide_rate"
colnames(data)[colnames(data) == "Period"] <- "Year"
colnames(data)[colnames(data) == "Dim1"] <- "Sex"
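The renaming idiom above can be tried on a small stand-in data frame. The sketch below is illustrative only: `toy` and its values are invented, and `lookup` is a hypothetical helper vector that applies all the renames in one pass instead of one line per column.

```r
# Toy data frame with the WHO-style column names (values are invented)
toy <- data.frame(
  ParentLocationCode = c("AFR", "EUR"),
  ParentLocation     = c("Africa", "Europe"),
  Location           = c("Country A", "Country B"),
  FactValueNumeric   = c(10.5, 12.1),
  Period             = c(2000, 2001),
  Dim1               = c("Male", "Female")
)

# Hypothetical lookup: names are the current columns, values are the new names
lookup <- c(
  ParentLocationCode = "Region_code",
  ParentLocation     = "Region",
  Location           = "Country",
  FactValueNumeric   = "Suicide_rate",
  Period             = "Year",
  Dim1               = "Sex"
)

# Replace every column name that appears in the lookup
matched <- colnames(toy) %in% names(lookup)
colnames(toy)[matched] <- unname(lookup[colnames(toy)[matched]])

print(colnames(toy))
```

Columns that are not listed in `lookup` keep their original names, so the same pattern is safe to reuse if the raw file gains extra fields.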

We use the skim() function from the “skimr” package to get a summary view of the data. Now we can check for nulls, blank spaces, etc.

install.packages("skimr")
library("skimr")
skim(data)

Summary view

ORGANIZING THE DATA

During this stage, our aim is to organize the data into 3 separate tables:

  • data - contains the values for both Male and Female

  • Male_data - contains only the values for Male

  • Female_data - contains only the values for Female

Filtering the data to acquire the desired rows.

# Drop the aggregate 'Both sexes' rows so data holds only Male and Female
data <- data[data$Sex != 'Both sexes', ]
# Drop the 'Female' rows to build Male_data
Male_data <- data[data$Sex != 'Female', ]
# Drop the 'Male' rows to build Female_data
Female_data <- data[data$Sex != 'Male', ]

Remove missing and blank values from the data frames

To eliminate rows with nulls or blank cells (which are stored as NA), we use the complete.cases() function. It returns a logical vector indicating which rows have no missing values, and we keep only those rows.

data <- data[complete.cases(data), ]
Male_data <- Male_data[complete.cases(Male_data), ]
Female_data <- Female_data[complete.cases(Female_data), ]
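A small example may make the behaviour of complete.cases() clearer. The sketch below uses an invented data frame named `toy`. It also illustrates one caveat: complete.cases() only detects NA, so an empty string is not treated as missing until it is converted to NA.

```r
# Invented data: row 2 has a missing rate, row 3 has a blank Sex value
toy <- data.frame(
  Country      = c("A", "B", "C", "D"),
  Suicide_rate = c(10.2, NA, 7.5, 3.1),
  Sex          = c("Male", "Female", "", "Female"),
  stringsAsFactors = FALSE
)

# complete.cases() flags only rows containing NA; the blank string passes
before <- complete.cases(toy)

# Convert blank strings to NA so they count as missing too
toy$Sex[toy$Sex == ""] <- NA
after <- complete.cases(toy)

# Keep only the fully populated rows
toy_clean <- toy[after, ]

print(before)
print(after)
print(toy_clean)
```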

After all the manipulation and cleaning, we have 3 data frames ready for analysis.

data table

Male data

Female data

DESCRIPTIVE STATISTICS :

We will provide a statistical summary of the data and visualize it to give an overview. Using R, we will retrieve the mean, median, and standard deviation of the suicide rate, and the distribution of the suicide rate among males and females.

Using R we will create a summary view of the descriptive statistics of the data.

summary(data)
mean(data$Suicide_rate)
median(data$Suicide_rate)
sd(data$Suicide_rate)
table(data$Sex)

The image below shows a summary view of the descriptive statistics of the data.

Descriptive statistics

Using R, we will compute the mean suicide rate by sex and plot it.

library(dplyr)
library(ggplot2)

# Mean suicide rate for each sex
mean_suicide_rate <- data %>%
  group_by(Sex) %>%
  summarize(mean_rate = mean(Suicide_rate))

# Bar chart of the mean rate, one bar per sex
ggplot(mean_suicide_rate, aes(x = Sex, y = mean_rate, color = Sex)) +
  geom_col(fill = "gray") +
  labs(x = "Sex", y = "Mean Suicide Rate", title = "Mean Suicide Rate by Sex") +
  theme_minimal()

The image below shows the mean suicide rate for Male and Female.

Descriptive statistics distribution of suicide cases among Male and Female

Using R, we will compute the mean suicide rate by region and plot it.

library(dplyr)
library(ggplot2)

# Mean suicide rate for each region
mean_suicide_rate <- data %>%
  group_by(Region) %>%
  summarize(mean_rate = mean(Suicide_rate))

# Bar chart of the mean rate, one bar per region
ggplot(mean_suicide_rate, aes(x = Region, y = mean_rate, color = Region)) +
  geom_col(fill = "gray") +
  labs(x = "Region", y = "Mean Suicide Rate", title = "Mean Suicide Rate by Region") +
  theme_minimal()

The image below shows the mean suicide rate by region.

Descriptive statistics distribution of suicide cases by region

CORRELATION ANALYSIS :

Now that the data cleaning is done, we can look for possible correlations between numeric variables using the Pearson method. We will test whether there is a correlation between the suicide rate and the time period in years.

The correlation coefficient ranges from -1 to +1. A value of -1 indicates a perfect negative correlation, while a value of +1 indicates a perfect positive correlation. A value of 0 indicates no linear correlation between the two variables.
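To make the coefficient concrete, the sketch below computes Pearson's r by hand and compares it with base R's cor(). The `year` and `rate` values are invented for illustration and are not taken from the WHO data.

```r
# Invented example: a rate that falls steadily over six years
year <- 2000:2005
rate <- c(12.0, 11.6, 11.5, 11.1, 10.8, 10.4)

# Deviations of each variable from its own mean
dx <- year - mean(year)
dy <- rate - mean(rate)

# Pearson's r: sum of cross-products divided by the
# square root of the product of the sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same statistic from base R
r_builtin <- cor(year, rate, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Because the invented rate drops almost linearly, both values come out close to -1, the "perfect negative" end of the range.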

The data summary for the suicide rate from 2000-2019 shows a decline in the suicide rate over the years.

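library(dplyr)
library(ggstatsplot) # provides ggscatterstats()

# Yearly mean suicide rate and row count for each sex
data_summary <- data %>%
  group_by(Sex, Year) %>%
  summarize(count_by_siteyear = n(),
            Suicide_rate = mean(Suicide_rate))

# Scatter plot of every observation with the correlation statistics
ggscatterstats(data = data,
               x    = Year,
               y    = Suicide_rate,
               type = "R")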

Suicide rate vs year

This shows a correlation of -0.06. As the plotted blue line shows, there is an almost unnoticeable decrease in the suicide rate from 2000 to 2019. The plot also suggests that some regions or countries have rates well above the rest.

To get a more concrete view of the trend, we will break the values down into separate male and female trends.

We will create a table to see the correlation between the male suicide rate and year over the period 2000-2019.

Summary_M_Y <- Male_data %>%
group_by(Sex, Year) %>%
summarize(count_by_siteyear =  n(),
         Suicide_rate = mean(Suicide_rate)) 

ggscatterstats(data = Summary_M_Y,
           x     = Year,
           y     = Suicide_rate,
           type  = "R")

This chart shows the correlation between the male suicide rate and year over the period 2000-2019, indicating a downward trend.

Male Suicide rate vs year

We will create a table to see the correlation between the female suicide rate and year over the period 2000-2019.

Summary_F_Y <- Female_data %>%
group_by(Sex, Year) %>%
summarize(count_by_siteyear =  n(),
        Suicide_rate = mean(Suicide_rate)) 

ggscatterstats(data = Summary_F_Y,
            x    = Year,
            y    = Suicide_rate,
            type = "R")

This chart shows the correlation between the female suicide rate and year over the period 2000-2019, indicating a downward trend.

Female Suicide rate vs year

When we break the values down by sex, we can clearly see a downward trend in the suicide rate. The female correlation for suicide rate vs. time period shows -0.99, -0.96; the male correlation shows -0.99, -0.95.
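One likely reason the overall correlation (-0.06) is so much weaker than the per-sex figures is that the per-sex plots use yearly averages, which removes the spread between countries. The simulation below illustrates that effect with invented numbers only. The variables `baseline`, `sim` and `yearly`, the slope and the noise level are all assumptions chosen for the demonstration, not estimates from the WHO data.

```r
set.seed(42)

years       <- 2000:2019
n_countries <- 50

# Each simulated country has its own baseline level (large spread)
baseline <- runif(n_countries, min = 2, max = 40)

# One row per country-year; every country declines slowly at the same pace
sim <- expand.grid(Year = years, Country = seq_len(n_countries))
sim$Suicide_rate <- baseline[sim$Country] -
  0.1 * (sim$Year - 2000) +
  rnorm(nrow(sim), sd = 0.5)

# Correlation on the raw rows: the trend is hidden by between-country spread
r_raw <- cor(sim$Year, sim$Suicide_rate)

# Correlation on yearly means: the spread averages out and the trend stands out
yearly   <- aggregate(Suicide_rate ~ Year, data = sim, FUN = mean)
r_yearly <- cor(yearly$Year, yearly$Suicide_rate)

print(c(raw = r_raw, yearly_means = r_yearly))
```

The two numbers describe the same downward slope. The aggregated value says the yearly average falls very consistently, not that year explains most of the variation between individual countries.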

Next, we look at the relationship between suicide rates and region to identify which areas are suicide hotspots.

In this case, we will use “ggplot” to create graphs that facet the data by region, isolating the areas with the highest suicide rates.
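The faceting code itself is not shown in this write-up. As a rough sketch of the idea, and not the original plot, the example below builds a small invented data frame with the same column names as the cleaned data and splits it into one panel per region with facet_wrap(). The region labels and rates are placeholders.

```r
library(ggplot2)

# Invented stand-in for the cleaned data (same column names, made-up values)
toy <- expand.grid(
  Year   = 2000:2004,
  Region = c("Region A", "Region B"),
  Sex    = c("Male", "Female"),
  stringsAsFactors = FALSE
)
toy$Suicide_rate <- 10 +
  5 * (toy$Sex == "Male") +
  3 * (toy$Region == "Region B") -
  0.2 * (toy$Year - 2000)

# One panel per region, points coloured by sex
p <- ggplot(toy, aes(x = Year, y = Suicide_rate, color = Sex)) +
  geom_point() +
  facet_wrap(~ Region) +
  labs(x = "Year", y = "Suicide Rate", title = "Suicide Rate by Region and Sex") +
  theme_minimal()
```

Calling print(p) draws the chart. Replacing `toy` with the cleaned data frame would give one panel for each WHO region.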

Though this graph shows a decrease in the suicide trend across all regions, the African region shows the highest recorded suicide rate, 195.20 in the year 2015. It is also noticeable that the highest rates correspond to males.

To better understand the relationship between suicide rate, location, and sex, we will create charts that show the percentage by region and the percentage by sex.

# Sum the suicide rates within each region
pivot_table_Region <- data %>%
  group_by(Region) %>%
  summarize(Suicide_rate = sum(Suicide_rate))

# Each region's share of the overall total, in percent
pivot_table_Region <- pivot_table_Region %>%
  mutate(Percentage = round(Suicide_rate / sum(Suicide_rate) * 100, 1))

# Pie chart: a stacked bar drawn in polar coordinates
ggplot(pivot_table_Region, aes(x = "", y = Suicide_rate, fill = Region)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Region") +
  ggtitle("Suicide Rate percentage by Region") +
  labs(caption = "Source: World Health Organization (WHO)") +
  geom_text(aes(label = paste0(Percentage, "%")), position = position_stack(vjust = 0.5), size = 4)

The plot below indicates that the African Region is the most impacted by suicide of all regions.

Percentage by Region
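When reading these percentages, it may help to remember that they are shares of summed rates, so a region with more country rows can get a larger slice even when its typical rate is the same. The sketch below shows this with invented numbers. The names `toy`, `sums` and `means` are hypothetical and the values are not WHO figures.

```r
# Invented data: two regions with identical rates but different numbers of countries
toy <- data.frame(
  Region       = c(rep("Region A", 4), rep("Region B", 2)),
  Suicide_rate = rep(10, 6)
)

# Share of the summed rates, computed as in the pie chart
sums <- aggregate(Suicide_rate ~ Region, data = toy, FUN = sum)
sums$Percentage <- round(sums$Suicide_rate / sum(sums$Suicide_rate) * 100, 1)

# Mean rate per region, for comparison
means <- aggregate(Suicide_rate ~ Region, data = toy, FUN = mean)

print(sums)
print(means)
```

Region A receives about two thirds of the pie purely because it has twice as many rows, while the mean rate is the same in both regions. Comparing the pie chart with the mean-rate bar chart from the descriptive statistics gives a fuller picture.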

library(dplyr)
library(ggplot2)

pivot_table_sex <- data %>% 
group_by(Sex) %>% 
summarize(Suicide_rate = sum(Suicide_rate))

pivot_table_sex <- pivot_table_sex %>% 
mutate(Percentage = round(Suicide_rate / sum(Suicide_rate) * 100, 1))

ggplot(pivot_table_sex, aes(x = "", y = Suicide_rate, fill = Sex)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void() +
theme(legend.position = "right") +
scale_fill_discrete(name = "Sex") +
ggtitle("Suicide Rate by Sex") +
labs(caption = "Source: World Health Organization (WHO)") +
geom_text(aes(label = paste0(Percentage, "%")), position = position_stack(vjust = 0.5), size = 4)

The plot below indicates that males are at higher risk of suicide than females.

Percentage by Sex

Trend Prediction

Next, we will try to predict the trend of the suicide rate until the year 2025. To do this we will use the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a statistical method used for time series analysis and forecasting.

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)

# Keep the study period only
data_filtered <- data %>%
  filter(Year >= 2000 & Year <= 2019)

# Average suicide rate per year and region
summary_table <- data_filtered %>%
  group_by(Year, Region) %>%
  summarize(avg_suicide_rate = mean(Suicide_rate), .groups = "drop")

# Yearly average across regions as a time series (one value per year)
yearly_avg <- summary_table %>%
  group_by(Year) %>%
  summarize(avg_suicide_rate = mean(avg_suicide_rate))
ts_data <- ts(yearly_avg$avg_suicide_rate, start = 2000, end = 2019, frequency = 1)

ggplot(summary_table, aes(x = Year, y = avg_suicide_rate, color = Region)) +
  geom_line() +
  labs(x = "Year", y = "Average Suicide Rate", title = "Suicide Rate by Region, 2000-2019") +
  theme_economist()

# Note: lm() fits a straight-line trend on Year; it is not an ARIMA fit
model <- lm(avg_suicide_rate ~ Year, data = summary_table)

# Extend the fitted trend to 2020-2025
new_years <- data.frame(Year = c(2020, 2021, 2022, 2023, 2024, 2025))
predicted_data <- predict(model, newdata = new_years)

# Append the predictions as an extra "Predicted" series
combined_data <- bind_rows(summary_table, data.frame(Year = new_years$Year,
                                                     avg_suicide_rate = predicted_data,
                                                     Region = rep("Predicted", 6)))

ggplot(combined_data, aes(x = Year, y = avg_suicide_rate, color = Region)) +
  geom_line() +
  labs(x = "Year", y = "Average Suicide Rate", title = "Suicide Rate by Region, 2000-2025") +
  theme_economist()

According to the plot below the suicide rate is decreasing.

ARIMA Trend Prediction
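The prediction code above extends a straight line fitted with lm(). For comparison, a forecast from an actual ARIMA model could look roughly like the sketch below, which uses base R's arima() and predict(). Everything in it is an assumption made for illustration: the yearly series is simulated, and the ARIMA(0,1,0) order with a drift term (supplied through a time index in `xreg`) was picked for simplicity, not selected from the WHO data.

```r
set.seed(1)

# Simulated yearly average rate, 2000-2019: a random walk drifting downward
steps <- -0.15 + rnorm(19, sd = 0.1)
rate  <- 12 + c(0, cumsum(steps))
ts_rate <- ts(rate, start = 2000, frequency = 1)

# ARIMA(0,1,0) with drift: the time index in xreg becomes the drift term
# once the series is differenced
t_index <- seq_along(rate)
fit <- arima(ts_rate, order = c(0, 1, 0), xreg = t_index)

# Forecast six years ahead (2020-2025), extending the time index
h  <- 6
fc <- predict(fit, n.ahead = h, newxreg = length(rate) + seq_len(h))

print(fc$pred) # point forecasts
print(fc$se)   # standard errors, which widen with the horizon
```

Unlike the straight-line extension, this forecast starts from the last observed value and comes with standard errors that grow over the forecast horizon, which is useful when judging how far ahead the trend can be trusted.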

After visualizing, we now have a clearer view of the correlations.

  • As the years passed, suicide rates declined.
  • The African region accounts for the largest share of suicide rates at 37%, followed by Europe at 28%.
  • The South-East Asia region has the lowest share at 4%, followed by the Eastern Mediterranean at 5%.
  • Males account for the highest percentage, 77.6%, which shows that males are more prone to suicide than females.
  • Based on the prediction model, the suicide rate is declining and is expected to keep declining until 2025.

Below is a graphical summary view I created in Tableau. Here’s the link: https://public.tableau.com/app/profile/robert.faciolan/viz/SuicideRatesGlobalDahsboard/Story1?publish=yes

Public Suicide Rate Dashboard

CONCLUSION :

Based on our overall findings and the ARIMA prediction, suicide rates have been decreasing over time and are expected to continue trending down. However, there are still significant variations in suicide rates across regions and genders, with the African region and males having the highest rates.

This highlights the importance of continued efforts to address the underlying causes of suicide, such as mental health issues, social and economic factors, and access to resources and support. It may also be worth exploring targeted interventions or policies to address the specific needs of high-risk groups, such as males in certain regions.

Overall, our findings and the ARIMA prediction suggest that progress is being made in reducing suicide rates. However, there is still much work to be done to achieve further improvements in global mental health. For now, I will be publishing this case study on RPubs.

Spread awareness, save lives.

CREDITS

http://brewminate.com/ and https://imgs.search.brave.com/ - for the artwork

https://www.who.int/ - for the open source data set

https://www.coursera.org/ and - For the Course sponsorship

https://public.tableau.com/ - for providing free visualization Tools.

https://www.r-project.org/ - for providing an open source programming language