
RPubs link: http://rpubs.com/beancounter/430586
Motion Chart link: http://jasonperez.learningnomad.com/2018/10/suicide-rate-vs-gdp-motion-chart.html (works only in the Internet Explorer browser)
Geomap link: http://jasonperez.learningnomad.com/2018/10/suicide-rate-per-country-geomap.html
Packages used
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(MVN)
library(googleVis)
library(gridExtra)
Executive Summary
To understand whether suicide rates depend on a country’s GDP, we took two data sets from different sources. Before drawing any conclusion, the data were pre-processed to produce a clean data set for visualisation and further analysis. The steps taken were:
- Imported the data into R and inspected their structure, variable types, etc.
- Selected and filtered only the variables needed, since not all of them are required for the analysis and final results.
- Organised the data and transformed them from wide to long format.
- Added new variables to the data frame using mutate.
- Merged the two data sets into one data frame.
- Dropped the missing values and identified the outliers.
- Transformed each country’s GDP per capita using the \(log10\) function to compress its wide range (referred to in this report as normalising).
After the pre-processing tasks, several visualisations are presented. The graphics suggest a non-linear relationship between suicide rates and a country’s GDP per capita. Several plots are presented together with statistical methods for analysis. There are also confounding variables that may explain why people die by suicide, such as seasonal weather (winter, summer, etc.), mental health issues or, at a more contextual level, cultural or spiritual aspects. Economic factors such as GDP are among the drivers, but GDP is only one of many factors affecting suicide.
Lastly, it is good to note that the suicide rate per country has been decreasing as people become more aware of the issue, and the regular campaigns promoted by charitable institutions play a big role. If we know someone who needs help, we should talk to them. It may make a big difference; it may save someone’s life.
Data
Suicide remains one of the leading causes of death among Australians aged between 15 and 44.\(^1\) It remains a problem in every country and is considered a worldwide phenomenon. Regardless of whether a country is rich or poor, or how high or low its living standards are (all of which can be measured through GDP), no country is exempt from this global problem. Based on studies, about 90% of people who die by suicide have a mental illness, with depression one of the top factors; other mental health disorders, such as bipolar disorder or schizophrenia, also contribute to an individual’s suicidal tendencies.\(^2\)
To gain insight into the relationship between suicide and GDP per country, two data sets were taken from different websites: global data on suicide cases per country, and data on every country’s GDP. The suicide data include country, year (from 2000 to 2015), gender, age group, number of suicide cases and population. The GDP data include country and GDP per year.
The suicide data were taken from Kaggle, a public data platform, and were originally sourced from the WHO (World Health Organisation).\(^3\) The GDP data were taken from World Bank Open Data.\(^4\)
Limitations and Assumptions:
- The data cover 2000 to 2015; years prior to 2000 are excluded.
- Only countries with complete suicide and GDP data from 2000 to 2015 were included in this analysis. Some countries have missing data for one or more years. Missing values were identified and excluded from the analysis; the code used to handle them is shown below.
- Screenshots of the googleVis charts are presented, but the interactive graphics can be viewed in RPubs using the link. Make sure it is opened in Internet Explorer, as the googleVis package works in Internet Explorer and NOT in Chrome or Microsoft Edge.
- Suicide cases are assumed to be normally distributed and are expressed as the number of cases per 100,000 population. GDP per capita is calculated as GDP divided by population and then normalised using the \(log10\) function.
- Age groups and genders are ignored in the summary for simplicity, but these variables were used to demonstrate tidying of the data, i.e., formatting, variable ordering, etc.
- Outliers were identified but kept in the report, as they are the actual statistics reported by the source. The code used to handle these outliers is shown below.
- The significance level is set to 0.05, and homogeneity of variance is assumed across the selected countries.
- Several functions were created that are useful for repetitive reporting/visualisation tasks.
Understand
After importing the two data sets into R, the following steps were taken to inspect the data, show their structure and merge the data sets:
- Inspected the data structure and showed the first 6 rows of the data frame.
- In the suicide data set, kept the years from 2000 onwards and dropped the information prior to 2000 (years after 2015 have no match in the selected GDP columns and drop out at the merge).
- In the GDP data set, only the Country Name and the year columns from 2000 to 2015 were selected.
- The \(gather\) function was used to reshape the year columns into a single Year column, with GDP as the value.
- Merged the two data sets with \(left\_join\), using the country name and the year as the join keys.
setwd("C:/Users/JP/Desktop/Analytics/Data Preprocessing/Assignment 3")
suicide_stats <- read.csv("who_suicide_statistics.csv") # Import the suicide statistics per country
head(suicide_stats) # Show the first 6 rows
str(suicide_stats) #Inspect the data structure
'data.frame': 40368 obs. of 6 variables:
$ country : Factor w/ 127 levels "Albania","Anguilla",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
$ sex : int 2 2 2 2 2 2 1 1 1 1 ...
$ age : Factor w/ 6 levels "15-24 years",..: 1 2 3 4 5 6 1 2 3 4 ...
$ suicides_no: int NA NA NA NA NA NA NA NA NA NA ...
$ population : int 277900 246800 267500 298300 138700 34200 301400 264200 296700 325800 ...
suicide_stats_2000_to_2015 <- subset(suicide_stats, suicide_stats$year >= 2000) # Filter out data earlier than year 2000
gdp <- read.csv("w_gdp2.csv", header = TRUE, skip = 3, check.names = FALSE) # Import the world GDP
gdp1 <- gdp %>% select(`Country Name`, c("2000":"2015")) # Select only information needed
str(gdp1) # Inspect the data structure
'data.frame': 219 obs. of 17 variables:
$ Country Name: Factor w/ 219 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ 2000 : num NA 3.63e+09 5.48e+10 NA 1.43e+09 ...
$ 2001 : num 2.46e+09 4.06e+09 5.47e+10 NA 1.50e+09 ...
$ 2002 : num 4.13e+09 4.44e+09 5.68e+10 5.14e+08 1.73e+09 ...
$ 2003 : num 4.58e+09 5.75e+09 6.79e+10 5.27e+08 2.40e+09 ...
$ 2004 : num 5.29e+09 7.31e+09 8.53e+10 5.12e+08 2.94e+09 ...
$ 2005 : num 6.28e+09 8.16e+09 1.03e+11 5.03e+08 3.26e+09 ...
$ 2006 : num 7.06e+09 8.99e+09 1.17e+11 4.96e+08 3.54e+09 ...
$ 2007 : num 9.84e+09 1.07e+10 1.35e+11 5.20e+08 4.02e+09 ...
$ 2008 : num 1.02e+10 1.29e+10 1.71e+11 5.63e+08 4.01e+09 ...
$ 2009 : num 1.25e+10 1.20e+10 1.37e+11 6.78e+08 3.66e+09 ...
$ 2010 : num 1.59e+10 1.19e+10 1.61e+11 5.76e+08 3.36e+09 ...
$ 2011 : num 1.79e+10 1.29e+10 2.00e+11 5.74e+08 3.44e+09 ...
$ 2012 : num 2.05e+10 1.23e+10 2.09e+11 6.44e+08 3.16e+09 ...
$ 2013 : num 2.00e+10 1.28e+10 2.10e+11 6.41e+08 3.28e+09 ...
$ 2014 : num 2.01e+10 1.32e+10 2.14e+11 6.43e+08 3.35e+09 ...
$ 2015 : num 1.92e+10 1.13e+10 1.66e+11 6.59e+08 2.81e+09 ...
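gdp2 <- gdp1 %>% gather(c("2000":"2015"), key = "Year", value = "GDP") # Reshape wide to long: one Year column, GDP as the value
# Note: suicide_stats2000to2015_country is created in the Tidy & Manipulate Data II section, so that step must run before this join
suicide_gdp_country <- suicide_stats2000to2015_country %>%
left_join(gdp2, ., by = c("Country Name" = "country", "Year" = "year")) # Merge the two data sets, keeping every GDP row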
Column `Country Name`/`country` joining factors with different levels, coercing to character vector
Column `Year`/`year` joining character vector and factor, coercing into character vector
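The reshape-and-merge steps above can be sketched on toy data. This is only an illustration under stated assumptions: the data frames `gdp_wide` and `rates` are made up, and `pivot_longer` (the newer tidyr alternative to `gather`) is swapped in rather than the function used in this report. Converting `Year` to integer before the join is one way to avoid the type-coercion warnings shown above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the World Bank or WHO files
gdp_wide <- data.frame(
  `Country Name` = c("A", "B"),
  `2000` = c(100, 200),
  `2001` = c(110, 210),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
rates <- data.frame(
  country = c("A", "A", "B"),
  year = c(2000L, 2001L, 2000L),
  suicides_no = c(5, 6, 7),
  stringsAsFactors = FALSE
)

# Wide to long: one row per country and year
gdp_long <- gdp_wide %>%
  pivot_longer(cols = -`Country Name`, names_to = "Year", values_to = "GDP") %>%
  mutate(Year = as.integer(Year)) # Match the integer type of rates$year

# Left join keeps every GDP row; unmatched rows get NA for suicides_no
merged <- left_join(gdp_long, rates,
                    by = c("Country Name" = "country", "Year" = "year"))
print(merged)
```

Country B has no suicide record for 2001 in the toy data, so that row survives the left join with a missing `suicides_no`.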
Tidy & Manipulate Data I
The following steps were taken to tidy and manipulate the data:
- Kept only the rows whose number of suicide cases is neither missing nor zero.
- Converted the year variable to a factor.
- Converted the gender variable to a factor and labelled it accordingly.
suicide_stats_2000_to_2015_clean <- suicide_stats_2000_to_2015 %>%
filter(!is.na(suicides_no) & suicides_no != 0) # Keep the rows where suicides_no is neither NA nor zero
suicide_stats_2000_to_2015_clean[,"year"] <- as.factor(suicide_stats_2000_to_2015_clean[,"year"]) # Convert the year (int) to factor
levels(suicide_stats_2000_to_2015_clean$year)
[1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010"
[12] "2011" "2012" "2013" "2014" "2015" "2016"
suicide_stats_2000_to_2015_clean$sex <- suicide_stats_2000_to_2015_clean$sex %>% factor(levels = c(1, 2), labels = c("Male", "Female")) # Convert the gender codes to a labelled factor (1 = Male, 2 = Female)
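A minimal base R sketch of the filtering and labelling steps above, on a made-up data frame `toy`. It also shows why missing values are best tested with `is.na`: comparing `NA` with the string "NA" yields `NA` rather than `TRUE` or `FALSE`.

```r
# Hypothetical toy data with one missing and one zero count
toy <- data.frame(sex = c(1L, 2L, 1L, 2L), suicides_no = c(3L, NA, 0L, 8L))

# Comparing NA with a string does not give TRUE or FALSE
na_comparison <- NA != "NA"

# Explicit test: keep rows that are neither missing nor zero
keep <- !is.na(toy$suicides_no) & toy$suicides_no != 0
clean <- toy[keep, ]

# Label the gender codes, following the 1 = Male, 2 = Female coding used above
clean$sex <- factor(clean$sex, levels = c(1, 2), labels = c("Male", "Female"))
print(clean)
```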
Tidy & Manipulate Data II
Further tidying and manipulation steps were:
- Aggregated the suicide cases and population per country per year.
- Added new variables using the \(mutate\) function. The new variables are called Suicide_Rate, GDP_per_capita and the normalised GDP_per_capita_log.
- Selected the variables to use in the analysis.
- Converted the year variable to numeric and renamed the Country Name variable.
suicide_stats_2000_to_2015_clean1 <- suicide_stats_2000_to_2015_clean %>% select(country, year, suicides_no, population) # Select the variables needed (must run before the aggregation)
suicide_stats2000to2015_country <- aggregate(. ~ country + year, data = suicide_stats_2000_to_2015_clean1, sum, na.rm = TRUE) # Aggregate the suicide cases and population per country per year
# Note: suicide_gdp_country_clean is the merged data with NAs omitted, created in the Scan I (Missing Values) section
suicide_gdp_country_clean1 <- mutate(suicide_gdp_country_clean, Suicide_Rate = (suicides_no/population*100000), GDP_per_capita = (GDP/population), GDP_per_capita_log = log10(GDP/population)) # Add suicide rate per 100,000, GDP per capita and log10 GDP per capita as new columns
suicide_gdp_country_clean1$Year <- as.numeric(as.character(suicide_gdp_country_clean1$Year)) # Converted year to numeric
colnames(suicide_gdp_country_clean1)[colnames(suicide_gdp_country_clean1)=="Country Name"] <- "Country" # Rename the variable
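As a worked example of the derived variables, the sketch below applies the same three formulas to one made-up row. The helper name `add_rates` and the input figures are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical helper mirroring the three derived columns used in this report
add_rates <- function(df) {
  df$Suicide_Rate <- df$suicides_no / df$population * 100000 # Cases per 100,000 people
  df$GDP_per_capita <- df$GDP / df$population                # GDP per person
  df$GDP_per_capita_log <- log10(df$GDP_per_capita)          # log10 of GDP per person
  df
}

# Made-up figures: 50 cases, 1 million people, GDP of 10 billion
toy <- data.frame(suicides_no = 50, population = 1e6, GDP = 1e10)
out <- add_rates(toy)
print(out)
```

With these inputs the rate is 50 / 1,000,000 × 100,000 = 5 per 100,000, GDP per capita is 10,000, and its log10 is 4.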
Scan I (Missing Values)
Missing values were found in the data frame. The locations of the NAs can be identified, but since the data frame has many rows, only the code is shown. The total number of missing values in the data frame and the number per column are reported.
Rows with missing values were intentionally omitted from the data frame because, as mentioned in the Limitations and Assumptions, only countries with full suicide and GDP data are included. Countries with incomplete data were excluded. Although there are several methods of handling missing data, such as imputation, exclusion was chosen because of the sensitive nature of the data.
#which(is.na(suicide_gdp_country)) # Identify the location of NAs
sum(is.na(suicide_gdp_country)) # Sum up the total of NAs
[1] 4645
colSums(is.na(suicide_gdp_country)) # Identify the number of NAs in each column
Country Name Year GDP suicides_no population
0 0 253 2196 2196
suicide_gdp_country_clean <- na.omit(suicide_gdp_country) # Omit NAs in the data frame
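The same three calls can be traced on a small made-up data frame, which makes it easier to see that `na.omit` removes every row containing at least one missing value. The data frame `toy` is hypothetical.

```r
# Hypothetical toy data: B lacks GDP, C lacks the suicide count
toy <- data.frame(
  Country = c("A", "B", "C"),
  GDP = c(1, NA, 3),
  suicides_no = c(10, 20, NA)
)

total_na <- sum(is.na(toy))          # Total number of missing cells
na_per_column <- colSums(is.na(toy)) # Missing cells in each column
complete_rows <- na.omit(toy)        # Drop any row with a missing value

print(na_per_column)
print(complete_rows)
```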
Scan II (Outliers)
The plots below show the presence of outliers. Even though one variable has been transformed, outliers are still present. Because the data are sensitive and the aim is to present the actual figures, all countries with full suicide and GDP data are shown. Although there are several methods of handling outliers, such as imputing, capping or excluding them, the outliers identified were intentionally kept since this report relies heavily on the data source. In practice, outliers would call for further investigation before deciding how to handle them. This is also listed in the Limitations and Assumptions of the study.
suicide_gdp_country_clean1 %>% boxplot(Suicide_Rate ~ Year, data = ., main="Boxplot of Suicide Rate", ylab = "Suicide Rate per 100,000", xlab = "Year", col="lightblue")
grid()

suicide_gdp_out <- suicide_gdp_country_clean1 %>% select(GDP_per_capita_log, Suicide_Rate)
suicide_gdp_out_Contour <- mvn(data = suicide_gdp_out, multivariateOutlierMethod = "quan", multivariatePlot = "contour")

grid()
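For comparison, the capping option mentioned above can be sketched with the univariate boxplot rule (Tukey fences at 1.5 × IQR). This is not the multivariate method used in this report; the vector `x` and the helper `tukey_fences` are hypothetical.

```r
# Hypothetical helper: lower and upper Tukey fences for a numeric vector
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 2, 3, 4, 5, 100) # Made-up values with one extreme point
fences <- tukey_fences(x)

# Flag values outside the fences
is_outlier <- x < fences["lower"] | x > fences["upper"]

# Capping: pull flagged values back to the nearest fence
x_capped <- pmin(pmax(x, fences["lower"]), fences["upper"])

print(fences)
print(x[is_outlier])
print(x_capped)
```

Here the quartiles are 2.25 and 4.75, so the fences are -1.5 and 8.5; only the value 100 is flagged, and capping replaces it with 8.5.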

Visualisation Part I
suicide_gdp_year <- suicide_gdp_country_clean1 %>% select(Year, Suicide_Rate, GDP_per_capita_log)
suicide_gdp_year1<- aggregate(. ~`Year`, data=suicide_gdp_year, sum)
plot(suicide_gdp_year1, col = "darkblue")

ggplot(suicide_gdp_country_clean1, aes(x = GDP_per_capita_log, y = Suicide_Rate)) + geom_jitter(alpha = 0.3) + geom_smooth(lwd = 0.85, alpha = 0.15) + ggtitle("Scatter Plot of Suicide Rate and GDP per Capita") # Scatter plot

ggplot(suicide_gdp_country_clean1, aes(x = GDP_per_capita_log, y = Suicide_Rate, colour = factor(Year))) + geom_line(alpha=0.5) + ggtitle("Plot of Yearly Suicide Rate and GDP per Capita") # Line plot coloured by year

yearly_suicide <- ggplot(suicide_gdp_country_clean1, aes(x = Year, y = Suicide_Rate)) + geom_jitter() + geom_smooth(lwd = 2, se = FALSE, method = "lm") + ggtitle("Scatter Plot per Year") # Scatter plot per year
yearly_GDP <- ggplot(suicide_gdp_country_clean1, aes(x = Year, y = GDP_per_capita_log)) +geom_jitter() +geom_smooth(lwd = 2, se = FALSE, method = "lm") + ggtitle("Scatter Plot per Year") # Scatter plot per year
grid.arrange(yearly_suicide, yearly_GDP, ncol=2)

Visualisation part II (including the Functions created)
A couple of functions were created to select a country either randomly or specifically. A few plots show the trend in the suicide rate per year, and the suicide rate against GDP per capita. Lastly, the top 4 countries by suicide rate are presented with some graphical presentations.
countries <- c("Argentina","Aruba","Austria","Belgium","Belize","Brazil","Brunei Darussalam","Chile","Colombia","Croatia","Cuba","Czech Republic","Denmark","Ecuador","Egypt, Arab Rep.","Estonia","Finland","Germany","Greece","Guatemala","Hong Kong SAR, China","Hungary","Iceland","Israel","Italy","Japan","Kazakhstan","Korea, Rep.","Kyrgyz Republic","Latvia","Lithuania","Luxembourg","Malta","Mauritius","Mexico","Moldova","Netherlands","Norway","Panama","Poland","Puerto Rico","Romania","Russian Federation","Serbia","Singapore","Slovenia","South Africa","Spain","Sweden","Switzerland","Turkmenistan","United Kingdom","United States")
selected_countries <- suicide_gdp_country_clean1 %>% filter(`Country`%in% countries)
plot_per_country <- function(i) {selected_countries %>% filter(`Country`== i) %>% ggplot(aes(Year,Suicide_Rate)) + geom_line(alpha=0.5)+ ggtitle(i)} # Function to plot the suicide rate per year for a given country
plot_per_country("Brazil")

for(i in selected_countries$'Country'[c(800,816)]){
print(ggplot(filter(selected_countries, selected_countries$`Country`== i), aes(x = Year, y = Suicide_Rate)) + geom_line(alpha=0.5) + ggtitle(i)) # Loop to plot the suicide rate per year for the countries at the given row positions
}


for(i in selected_countries$'Country'[c(814,832)]){
print(ggplot(filter(selected_countries, selected_countries$`Country`== i), aes(x = GDP_per_capita_log, y = Suicide_Rate)) + geom_line(alpha=0.5) + ggtitle(i)) # Loop to plot the suicide rate against GDP for the countries at the given row positions
}
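The loops above pick countries by row position, which depends on how the data frame happens to be sorted. One possible alternative, sketched here with a hypothetical helper `plot_countries` and made-up data, is to pass the country names directly and collect the plots in a named list.

```r
library(ggplot2)

# Hypothetical helper: one suicide-rate line plot per requested country
plot_countries <- function(df, country_names) {
  plots <- lapply(country_names, function(name) {
    ggplot(df[df$Country == name, ], aes(x = Year, y = Suicide_Rate)) +
      geom_line(alpha = 0.5) +
      ggtitle(name)
  })
  names(plots) <- country_names
  plots
}

# Made-up data with the same column names as the report's data frame
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(2000:2002, times = 2),
  Suicide_Rate = c(10, 11, 12, 5, 4, 6)
)

plots <- plot_countries(toy, c("A", "B"))
# print(plots[["A"]]) would draw the plot for country A
```

A random selection could then be made with something like `plot_countries(df, sample(unique(df$Country), 2))`.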


four_countries <- c("Korea, Rep.","Latvia","Lithuania","Slovenia") # Select the top 4 countries of suicide rate
top4_countries <- suicide_gdp_country_clean1 %>% filter(`Country`%in% four_countries) # Select the data for top 4 countries of suicide rate
top4_suicide <- ggplot(top4_countries, aes(x = Year, y = Suicide_Rate, colour = Country)) + geom_line(alpha=1) # Visualise top 4
top4_GDP <- ggplot(top4_countries, aes(x = Year, y = GDP_per_capita_log, colour = Country)) + geom_line(alpha=1) # Visualise top 4
grid.arrange(top4_suicide,top4_GDP, nrow = 2, ncol = 1)

top4_Suicide_GDP <- ggplot(top4_countries, aes(x = GDP_per_capita_log, y = Suicide_Rate, colour = Year)) + geom_line(alpha=1) + ggtitle("Suicide_GDP_Top4_countries") # Suicide rate against log10 GDP per capita for the top 4
top4_Suicide_GDP # Visualise top 4

Visualisation using Motion Chart and Maps
To view the interactive charts, particularly the motion chart created with the googleVis package, use the Internet Explorer browser. Note that the motion chart does not work in Google Chrome, Firefox or Edge.
The code used to create the visuals is shown below, together with screenshots of the output.
The motion chart presents the suicide rate and GDP per capita per year, where the size of each bubble represents the population. The map shows the suicide rate per country geographically, with the colour intensity based on the rate. Links to the interactive charts are shown below:
http://jasonperez.learningnomad.com/2018/10/suicide-rate-vs-gdp-motion-chart.html (Use Internet Explorer to view this)
http://jasonperez.learningnomad.com/2018/10/suicide-rate-per-country-geomap.html
data_motion <- gvisMotionChart(suicide_gdp_country_clean1, idvar = "Country", timevar = "Year", xvar = "GDP_per_capita_log", yvar = "Suicide_Rate", sizevar = "population")
plot(data_motion)
data_geomap <- gvisGeoChart(suicide_gdp_country_clean1, "Country", "Suicide_Rate",options=list(width=600, height=400))
suicide_gdp_country_clean2 <- suicide_gdp_country_clean1 %>% select(Country, Suicide_Rate) # Select the variable
data_table <- gvisTable(suicide_gdp_country_clean2, options=list(width=300, height=200)) # Show the data table
geomap_table <- gvisMerge(data_geomap, data_table, horizontal=TRUE) # Place the map and the table side by side
plot(geomap_table) # Plot



Statistical Conclusion
Using statistical tools, we can identify the average suicide rate across the sampled countries. We can also run a regression using the \(lm\) function; the results are presented below. The previous graphs suggest a non-linear relationship between the suicide rate and a country’s GDP per capita. Note that the model below regresses the log GDP per capita on the suicide rate; in a simple linear regression the slope’s p-value is the same whichever variable is treated as the response, and it matches the Pearson correlation test (t = 1.1725, p = 0.2412). Since the p-value (0.2412) is > 0.05, we fail to reject the null hypothesis of no linear association: for the selected countries, the data provide no statistically significant evidence of a linear relationship between a country’s suicide rate and its GDP per capita. This does not rule out a non-linear relationship.
Although this analysis does not show that GDP directly affects suicide, it could still be one of many contextual factors, as it relates to economic conditions. Based on this analysis, however, there is no statistically significant evidence that a country’s GDP per capita contributes to its suicide rate. Many confounding factors influence why people die by suicide, as mentioned in the Executive Summary. If we each do our part to address the issue, we could be saving the life of a friend, a family member, a relative or even a stranger.
summary(suicide_gdp_country_clean1$Suicide_Rate) # Summary statistics for Suicide Rate
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05183 6.24922 11.81422 12.95302 18.29047 53.23583
summary(suicide_gdp_country_clean1$GDP_per_capita_log) # Summary statistics for normalised (log10) GDP per capita
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.496 3.717 4.180 4.132 4.612 5.824
summary(suicide_gdp_country_clean1$GDP_per_capita) # Summary statistics for raw GDP per capita (before the log10 transform)
Min. 1st Qu. Median Mean 3rd Qu. Max.
313.5 5216.9 15123.6 27267.4 40900.4 666228.4
summary(lm(suicide_gdp_country_clean1$GDP_per_capita_log~suicide_gdp_country_clean1$Suicide_Rate,data=suicide_gdp_country_clean1))
Call:
lm(formula = suicide_gdp_country_clean1$GDP_per_capita_log ~
suicide_gdp_country_clean1$Suicide_Rate, data = suicide_gdp_country_clean1)
Residuals:
Min 1Q Median 3Q Max
-1.63348 -0.41355 0.03416 0.48038 1.64644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.104898 0.028251 145.300 <2e-16 ***
suicide_gdp_country_clean1$Suicide_Rate 0.002125 0.001813 1.173 0.241
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.567 on 1300 degrees of freedom
Multiple R-squared: 0.001056, Adjusted R-squared: 0.000288
F-statistic: 1.375 on 1 and 1300 DF, p-value: 0.2412
cor.test(suicide_gdp_country_clean1$GDP_per_capita_log, suicide_gdp_country_clean1$Suicide_Rate, method = "pearson")
Pearson's product-moment correlation
data: suicide_gdp_country_clean1$GDP_per_capita_log and suicide_gdp_country_clean1$Suicide_Rate
t = 1.1725, df = 1300, p-value = 0.2412
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02186329 0.08667626
sample estimates:
cor
0.03250232
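The regression output and the correlation test above report the same p-value (0.2412). The sketch below, on simulated data rather than the report's data, illustrates that in a simple linear regression the slope's p-value equals the Pearson test's p-value and does not depend on which variable is the response. The variables `x` and `y` are arbitrary random draws.

```r
set.seed(1)
x <- rnorm(50) # Simulated predictor, unrelated to the report's data
y <- rnorm(50) # Simulated response

# Slope p-value with y as the response
p_y_on_x <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]

# Slope p-value with the roles swapped
p_x_on_y <- summary(lm(x ~ y))$coefficients["y", "Pr(>|t|)"]

# Pearson correlation test p-value
p_cor <- cor.test(x, y, method = "pearson")$p.value

print(c(p_y_on_x, p_x_on_y, p_cor))
```

All three values agree because each test reduces to the same t statistic, r × sqrt(n - 2) / sqrt(1 - r²).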
