library(rmdformats)
library(readr)
library(tidyverse)
library(dplyr)
library(psych)
library(knitr)
library(ggthemes)
library(kableExtra)
library(plotly)
library(forecast)
library(RMySQL)
library(dbConnect)
Mental illness is a serious health problem that affects millions of people worldwide. One tragic outcome is that some individuals die by suicide. This study attempts to understand how suicide numbers differ across countries and between the sexes.
The data comes from a Kaggle dataset: https://www.kaggle.com/szamil/who-suicide-statistics/home. It contains 43776 rows and the following variables: country, year, sex, age, suicides_no, and population. The dataset was aggregated from multiple datasets to include only these variables. A copy of the data was uploaded to my GitHub account.
# `mydb` is a database connection opened earlier (the connection code is not shown)
suicide1 <- dbGetQuery(mydb, "select * from who_suicide_statistics limit 10")
head(suicide1)
## country year sex age suicides_no population
## 1 Albania 1985 female 15-24 years 277900
## 2 Albania 1985 female 25-34 years 246800
## 3 Albania 1985 female 35-54 years 267500
## 4 Albania 1985 female 5-14 years 298300
## 5 Albania 1985 female 55-74 years 138700
## 6 Albania 1985 female 75+ years 34200
A preliminary analysis shows that the number of rows is the same for every age group:
# Count the rows in each age group
agecount <- suicide %>%
  group_by(age) %>%
  summarize(n = n())
agecount
## # A tibble: 6 x 2
## age n
## <chr> <int>
## 1 15-24 years 7296
## 2 25-34 years 7296
## 3 35-54 years 7296
## 4 5-14 years 7296
## 5 55-74 years 7296
## 6 75+ years 7296
# Refer to columns by name inside aes(); `fun` replaces the deprecated `fun.y`
ggplot(agecount, aes(age, n)) +
  stat_summary(fun = sum,
               geom = "bar")
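Identical counts per age group are what one would expect if every country-year-sex combination carries one row for each of the six age groups. The sketch below illustrates this on a made-up grid (the data frame `toy` and its dimensions are assumptions for illustration, not the WHO data) and then checks the reported figures.

```r
library(dplyr)

# Hypothetical toy grid: 2 countries x 3 years x 2 sexes x 6 age groups.
# Names and sizes are illustrative assumptions, not the WHO data.
toy <- expand.grid(
  country = c("A", "B"),
  year = 2000:2002,
  sex = c("female", "male"),
  age = c("5-14 years", "15-24 years", "25-34 years",
          "35-54 years", "55-74 years", "75+ years"),
  stringsAsFactors = FALSE
)

# Every age group appears once per country-year-sex combination,
# so the counts are identical by construction.
toy_agecount <- toy %>% count(age)
toy_agecount

# The same arithmetic applied to the reported figures:
# 6 age groups x 7296 rows each should give the full row count.
6 * 7296  # 43776
```

If this reading is right, the even distribution is a property of how the table is laid out rather than a finding about the population.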
The number of rows per country varies, from a minimum of 12 to a maximum of 456. Some countries have more recorded data than others; this may be due to the size of the country, its population, or the ease of acquiring the data.
# Count the rows for each country
countrycount <- suicide %>%
  group_by(country) %>%
  summarize(n = n())
summary(countrycount)
## country n
## Length:141 Min. : 12.0
## Class :character 1st Qu.:204.0
## Mode :character Median :372.0
## Mean :310.5
## 3rd Qu.:432.0
## Max. :456.0
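As a cross-check on the summary, the mean number of rows per country should equal the total row count divided by the number of countries. The following is a minimal sketch on invented data (the data frame `toy` is hypothetical); it also shows `count()`, which is dplyr shorthand for the `group_by()` plus `summarize(n = n())` pattern.

```r
library(dplyr)

# Hypothetical toy data; country names and row counts are made up.
toy <- data.frame(
  country = c(rep("A", 12), rep("B", 24), rep("C", 36)),
  stringsAsFactors = FALSE
)

# count() is shorthand for group_by() + summarize(n = n()).
toy_counts <- toy %>% count(country)
summary(toy_counts$n)

# The mean rows per country is total rows / number of countries.
all.equal(mean(toy_counts$n), nrow(toy) / nrow(toy_counts))

# Applied to the reported figures: 43776 rows over 141 countries.
round(43776 / 141, 1)  # 310.5
```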
# Keep only rows with more than 1000 suicides
countrysuicide <- suicide %>%
  filter(suicides_no > 1000)
# Order the countries by their summed count so the order matches the bar lengths
p <- ggplot(data = countrysuicide, aes(reorder(country, -suicides_no, FUN = sum), suicides_no)) +
  stat_summary(fun = sum,
               geom = "bar", fill = "grey") +
  theme_fivethirtyeight() +
  xlab("Country") +
  ylab("Count") +
  coord_flip()
ggplotly(p)
The Russian Federation has the highest number of suicides (1450349), followed by the United States of America (1138176). Because of the filter above, these totals include only rows with more than 1000 suicides.
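The totals read off the plot can also be computed directly. The sketch below uses invented rows (the data frame `toy` and its values are hypothetical) to show how totals restricted to rows above 1000 can rank countries differently from totals over all rows.

```r
library(dplyr)

# Hypothetical toy rows; values are invented for illustration.
toy <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  suicides_no = c(1500, 800, 2000, NA, 1200),
  stringsAsFactors = FALSE
)

# Totals restricted to rows above 1000, mirroring the filter used for the plot.
toy_filtered <- toy %>%
  filter(suicides_no > 1000) %>%
  group_by(country) %>%
  summarize(total = sum(suicides_no)) %>%
  arrange(desc(total))

# Totals over all rows, ignoring missing values.
toy_all <- toy %>%
  group_by(country) %>%
  summarize(total = sum(suicides_no, na.rm = TRUE)) %>%
  arrange(desc(total))

toy_filtered
toy_all
```

In this toy case country B leads under the filter while country A leads overall, so it may be worth checking both versions on the real data.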
countrysuicide <- suicide %>%
  filter(suicides_no > 1000)
# Use bare column names so that faceting by country splits the data correctly
p <- ggplot(data = countrysuicide, aes(sex, suicides_no)) +
  stat_summary(fun = sum,
               geom = "bar", aes(fill = sex)) +
  xlab("Sex") +
  ylab("Count") +
  facet_wrap(~country) +
  ggtitle("Suicide vs. Country and Sex")
ggplotly(p)
According to this graph, in both Russia and the United States, male suicides outnumber female suicides.
# Suicides by age group and sex for the United States
countrysuicide <- suicide %>%
  filter(country == "United States of America")
p <- ggplot(data = countrysuicide, aes(age, suicides_no)) +
  stat_summary(fun = sum,
               geom = "bar", aes(fill = sex)) +
  xlab("Age") +
  ylab("Count") +
  ggtitle("Suicide vs. Age and Sex: United States of America")
ggplotly(p)
## Warning: Removed 12 rows containing non-finite values (stat_summary).
# Suicides by age group and sex for the Russian Federation
countrysuicide <- suicide %>%
  filter(country == "Russian Federation")
p <- ggplot(data = countrysuicide, aes(age, suicides_no)) +
  stat_summary(fun = sum,
               geom = "bar", aes(fill = sex)) +
  xlab("Age") +
  ylab("Count") +
  ggtitle("Suicide vs. Age and Sex: Russian Federation")
ggplotly(p)
## Warning: Removed 24 rows containing non-finite values (stat_summary).
In both the United States and Russia, male suicides outnumber female suicides. The 35-54 age group has the highest number of suicides in both countries.
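The comparison above is based on raw counts, and the 35-54 group spans twenty years of age while groups such as 15-24 span ten. One way to put the groups on a comparable footing is a rate per 100,000 people, using the `population` column. The following is a minimal sketch on invented numbers (the data frame `toy` and the column names `suicides`, `pop`, and `rate_per_100k` are hypothetical), not a result from the WHO data.

```r
library(dplyr)

# Hypothetical toy rows for one country-year; numbers are invented.
toy <- data.frame(
  age = c("15-24 years", "35-54 years", "75+ years"),
  suicides_no = c(100, 300, 60),
  population = c(1000000, 2000000, 200000),
  stringsAsFactors = FALSE
)

toy_rates <- toy %>%
  # Keep only rows where both the count and the population are known.
  filter(!is.na(suicides_no), !is.na(population)) %>%
  group_by(age) %>%
  summarize(
    suicides = sum(suicides_no),
    pop = sum(population)
  ) %>%
  # Suicides per 100,000 people in each age group.
  mutate(rate_per_100k = suicides / pop * 1e5)

toy_rates
```

In this toy example the 35-54 group has the largest count while the 75+ group has the highest rate, which is why counts and rates can lead to different conclusions about vulnerability.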
# Total suicides per year across all countries
p <- ggplot(data = suicide, aes(year, suicides_no)) +
  stat_summary(fun = sum,
               geom = "bar", aes(fill = year)) +
  theme(legend.position = "none") +
  xlab("Year") +
  ylab("Count") +
  ggtitle("Suicide")
ggplotly(p)
## Warning: Removed 2256 rows containing non-finite values (stat_summary).
The data shows an increasing trend until around 2000, where it begins to flatten and then slowly decrease. The total for 2016 is far lower than for the preceding years, which suggests that the data for 2016 is incomplete.
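# Total suicides per year; missing counts are ignored (equivalent to treating them as 0)
tssuicide <- suicide %>%
  dplyr::select(year, suicides_no) %>%
  group_by(year) %>%
  summarize(suicides_no = sum(suicides_no, na.rm = TRUE))
# Build the series from the yearly totals only and take the start year from the data,
# so the time index stays aligned with the actual years
myts <- ts(tssuicide$suicides_no, start = min(tssuicide$year), frequency = 1)
autoplot(myts)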
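The construction of a yearly series can be sketched on invented totals (the data frame `toy_totals` and its values are hypothetical). The sketch passes only the numeric series to `ts()`, derives the start year from the data, and shows how a final year that looks incomplete could be dropped with `window()` before any modelling.

```r
# Hypothetical yearly totals; the values are invented for illustration.
toy_totals <- data.frame(
  year = 2000:2004,
  suicides_no = c(120, 135, 128, 140, 90)
)

# Pass only the numeric series to ts(); the start year comes from the data,
# so the time index cannot drift from the actual years.
toy_ts <- ts(toy_totals$suicides_no, start = min(toy_totals$year), frequency = 1)

toy_ts
time(toy_ts)

# Optionally drop a final year that looks incomplete before modelling.
toy_ts_trimmed <- window(toy_ts, end = 2003)
toy_ts_trimmed
```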
# Total recorded population per year across all countries
p <- ggplot(data = suicide, aes(year, population)) +
  stat_summary(fun = sum,
               geom = "bar", aes(fill = year)) +
  theme(legend.position = "none") +
  xlab("Year") +
  ylab("Population") +
  ggtitle("Population")
ggplotly(p)
## Warning: Removed 5460 rows containing non-finite values (stat_summary).
Throughout the dataset, the total recorded population shows a steadily increasing trend.
We can run an ANOVA test to see whether there is a statistically significant difference in the number of suicides between years and between countries.
\[H_0: \mu_{1985} = \mu_{1986} = \mu_{1987} = \dots = \mu_{2016}\]
\[H_a: \text{at least one yearly mean differs from the others}\]
model <- lm(suicides_no ~ year, data= suicide)
anova(model)
## Analysis of Variance Table
##
## Response: suicides_no
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 3.8225e+06 3822457 5.9645 0.0146 *
## Residuals 41518 2.6608e+10 640868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (0.0146) is below 0.05, we reject \(H_0\) in favor of \(H_a\). Because year enters the model as a single numeric predictor (1 degree of freedom in the table), the evidence is specifically for a linear trend in the number of suicides across years.
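To compare the yearly means as stated in the hypotheses, year would need to be treated as a factor rather than a number. The following is a minimal sketch on simulated data (the data frame `toy` and the model names `trend_fit` and `means_fit` are hypothetical); it shows how the degrees of freedom change between the two specifications.

```r
set.seed(1)

# Simulated data: 3 years with 20 observations each; purely illustrative.
toy <- data.frame(
  year = rep(c(2000, 2001, 2002), each = 20),
  suicides_no = rnorm(60, mean = 100, sd = 10)
)

# Numeric year: one slope, 1 degree of freedom (a test for linear trend).
trend_fit <- lm(suicides_no ~ year, data = toy)
anova(trend_fit)

# Factor year: one mean per year, k - 1 = 2 degrees of freedom
# (the one-way ANOVA matching hypotheses about yearly means).
means_fit <- lm(suicides_no ~ factor(year), data = toy)
anova(means_fit)
```

Applied to the real data, `lm(suicides_no ~ factor(year), data = suicide)` would be the analogous call, with one fewer degree of freedom than the number of distinct years.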
\[H_0: \mu_{country1} = \mu_{country2} = \mu_{country3} = \dots = \mu_{countryn}\]
\[H_a: \text{at least one country mean differs from the others}\]
model <- lm(suicides_no ~ country, data= suicide)
anova(model)
## Analysis of Variance Table
##
## Response: suicides_no
## Df Sum Sq Mean Sq F value Pr(>F)
## country 140 1.1131e+10 79509087 212.53 < 2.2e-16 ***
## Residuals 41379 1.5480e+10 374105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
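The F values in both tables can be reproduced by hand: each is the mean square of the term divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The sketch below uses the figures printed in the two tables (the variable names are illustrative); small differences can arise because the printed mean squares are rounded.

```r
# Values copied from the two ANOVA tables above.
f_year <- 3822457 / 640868        # Mean Sq (year) / Mean Sq (Residuals)
f_country <- 79509087 / 374105    # Mean Sq (country) / Mean Sq (Residuals)

round(f_year, 4)
round(f_country, 2)

# The upper-tail probability of the F distribution gives the p-value.
p_year <- pf(f_year, df1 = 1, df2 = 41518, lower.tail = FALSE)
p_country <- pf(f_country, df1 = 140, df2 = 41379, lower.tail = FALSE)

signif(p_year, 3)
p_country < 2.2e-16
```

For the country model the p-value is below the smallest value R prints, so \(H_0\) is rejected here as well: mean suicide counts differ between countries.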
After analyzing the suicide data from Kaggle, we find that more males than females die by suicide, that the number of recorded suicides rose until around 2000 before levelling off, and that the total recorded population has increased over the years.
Russia and the United States have the highest suicide counts in the dataset. In both the United States and Russia, male suicides outnumber female suicides. The 35-54 age group has the highest number of suicides.