Libraries

library(rmdformats)
library(readr)
library(tidyverse)
library(dplyr)
library(psych)
library(knitr)
library(ggthemes)
library(kableExtra)
library(plotly)
library(forecast)
library(RMySQL)
library(dbConnect)

Introduction

Mental illness is a serious disease that affects millions of people worldwide. One unfortunate outcome is some individuals decide to commit suicide. This study will attempt to understand suicide rates for different countries, and sex.

Data

The data is taken from kaggle.com’s https://www.kaggle.com/szamil/who-suicide-statistics/home. The dataset consists of the following variables: country, year, sex, age, suicides_no, and population. It also consists of 43776 rows of data. This dataset was aggregated from multiple datasets to include only the previous variables. The data was uploaded on my github account.

suicide1<- dbGetQuery(mydb, "select * from who_suicide_statistics limit 10")

head (suicide1)
##   country year    sex         age suicides_no population
## 1 Albania 1985 female 15-24 years                 277900
## 2 Albania 1985 female 25-34 years                 246800
## 3 Albania 1985 female 35-54 years                 267500
## 4 Albania 1985 female  5-14 years                 298300
## 5 Albania 1985 female 55-74 years                 138700
## 6 Albania 1985 female   75+ years                  34200

Exploratory Data Analysis

A preiminilary analysis shows that the # of people in each age group are distributed evenly:

agecount <- suicide %>%
  group_by(age) %>%
  summarize(n=n())

agecount
## # A tibble: 6 x 2
##   age             n
##   <chr>       <int>
## 1 15-24 years  7296
## 2 25-34 years  7296
## 3 35-54 years  7296
## 4 5-14 years   7296
## 5 55-74 years  7296
## 6 75+ years    7296
ggplot(agecount, aes(agecount$age, agecount$n)) +
   stat_summary(fun.y = sum,
               geom = "bar") 

The # of values for each country in the data varies. The min of data recorded appears 12 times, while the max for a country appears 456 times. Some countries have more recorded data than others- this can be due to the size of the country, its population, the ease of acquiring that data.

countrycount <- suicide %>%
  group_by(country) %>%
  summarize(n=n())

summary(countrycount)
##    country                n        
##  Length:141         Min.   : 12.0  
##  Class :character   1st Qu.:204.0  
##  Mode  :character   Median :372.0  
##                     Mean   :310.5  
##                     3rd Qu.:432.0  
##                     Max.   :456.0

If we filter the dataset by number of suicides greater than 1000, the data becomes easier to see which countries have the greatest sucidie numbers

countrysuicide <- suicide %>%
  filter(suicides_no > 1000) 

p <- ggplot(data = countrysuicide, aes(reorder(countrysuicide$country, -countrysuicide$suicides_no), countrysuicide$suicides_no)) + 
  stat_summary(fun.y = sum,
               geom = "bar", fill = "grey") +
  theme_fivethirtyeight() +
  xlab("Country") + 
  ylab("Count") +
  coord_flip() 

ggplotly(p)

It appears that the Russian Federation has the most # of suicides (1450349), followed by the United States of America (1138176)

We can also break down by gender and country

countrysuicide <- suicide %>%
  filter(suicide$suicides_no > 1000)

p <- ggplot(data = countrysuicide, aes(countrysuicide$`sex`,countrysuicide$suicides_no)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill=sex)) +
  xlab("Sex") + 
  ylab("Count") +
  facet_wrap(~countrysuicide$country) +
  ggtitle("Suicide vs. Country and Sex")

ggplotly(p)

According to this graph, in both Russa and the United States, Males outnumber the women in suicide #s.

Age Group

United States

countrysuicide <- suicide %>%
  filter(country == "United States of America")

p <- ggplot(data = countrysuicide, aes(countrysuicide$age,countrysuicide$suicides_no)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill=sex)) +
  xlab("Age") + 
  ylab("Count") +
  ggtitle("Suicide vs. Country and Sex") 

ggplotly(p)
## Warning: Removed 12 rows containing non-finite values (stat_summary).

Russian Federation

countrysuicide <- suicide %>%
  filter(country == "Russian Federation")

p <- ggplot(data = countrysuicide, aes(countrysuicide$age,countrysuicide$suicides_no)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill=sex)) +
  xlab("Age") + 
  ylab("Count") +
  ggtitle("Suicide vs. Country and Sex")

ggplotly(p)
## Warning: Removed 24 rows containing non-finite values (stat_summary).

In both the United States and Russia, Males outnumber the females in sucidie. The age group most vulnerable is the 35-54 age group.

Suicide over the years in the Dataset

p <- ggplot(data = suicide, aes(suicide$year, suicide$suicides_no)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill=year)) +
  theme(legend.position="none") +
  xlab("Year") +
  ylab("Count") +
  ggtitle("Suicide") 

ggplotly(p)
## Warning: Removed 2256 rows containing non-finite values (stat_summary).

The data shows an increasing trend until around 2000, where it beings to flatten and slow decrease. The year 2016 has a large difference - it seems as if the data for 2016 is incomplete.

tssuicide <- suicide %>%
  dplyr::select(year, suicides_no) 

tssuicide$suicides_no[is.na(tssuicide$suicides_no)] <- 0

tssuicide <- tssuicide %>% 
  group_by(year) %>%
  summarize_all(sum) 

tssuicide$year <- as.character(tssuicide$year)

myts <- ts(tssuicide, start = 1985, end = 2016, frequency = 1)

autoplot(myts) 

Population #s

p <- ggplot(data = suicide, aes(suicide$year, suicide$population)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill=year)) +
  theme(legend.position="none") +
  xlab("Year") +
  ylab("Count") +
  ggtitle("Suicide") 

ggplotly(p)
## Warning: Removed 5460 rows containing non-finite values (stat_summary).

Throughout the dataset, you can see a steady increasing trend in population

ANOVA Model

We can run an ANOVA test on different data within our dataset to see if there is a statisticaly significant difference between the years, and countries in the data.

\[H_0: \mu_{1985} = \mu_{1986} = \mu_1{987} ... = \mu_{2016}\]

\[H_a: \mu_{1985} \neq \mu_{1986} \neq \mu_{1987} ... \neq \mu_{2016} \]

model <- lm(suicides_no ~ year, data= suicide)

anova(model)
## Analysis of Variance Table
## 
## Response: suicides_no
##              Df     Sum Sq Mean Sq F value Pr(>F)  
## year          1 3.8225e+06 3822457  5.9645 0.0146 *
## Residuals 41518 2.6608e+10  640868                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is low (less than 0.05) - null must go; meaning, the \(H_0\) is rejected and the \(H_a\) is accepted. There is significant evidence that number of suicides is different throughout the years

\[H_0: \mu_{country1} = \mu_{country2} = \mu_1{country3} ... = \mu_{countryn}\]

\[H_a: \mu_{country1} \neq \mu_{country2} \neq \mu_{country3} ... \neq \mu_{countryn} \]

model <- lm(suicides_no ~ country, data= suicide)

anova(model)
## Analysis of Variance Table
## 
## Response: suicides_no
##              Df     Sum Sq  Mean Sq F value    Pr(>F)    
## country     140 1.1131e+10 79509087  212.53 < 2.2e-16 ***
## Residuals 41379 1.5480e+10   374105                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions

After analyzing the suicide data from Kaggle, the data shows that on average, Males are more likly to commit suicide than females, the number of suicides over the years has increased, the number of population overall has also increased.

Russia and the United States have the highest sucide rates within the dataset. In both the United States and Russia, Males outnumber the females in suicide The age group most vulnerable is the 35-54 age group.

Reference:

https://www.kaggle.com/szamil/who-suicide-statistics