Suicide is death caused by injuring oneself with the intent to die. A suicide attempt is when someone harms themselves with the intent to end their life, but they do not die as a result of their actions. Several factors can increase the risk for suicide or protect against it. Suicide is connected to other forms of injury and violence, and causes serious health and economic consequences.
In this small project, we explore a dataset of suicide rates from 1985 to 2016. The dataset contains several variables (columns) that serve as the basis for extracting useful information.
Not all of the information in the dataset is explored in this project. Instead, we set a few simple information goals that guide the work. These objectives are:
# Read data
data <- read.csv("suicide_rate_overview_1985to2016.csv", sep =",")
data
dim(data)
## [1] 27820 12
names(data)
## [1] "ï..country" "year" "sex"
## [4] "age" "suicides_no" "population"
## [7] "suicides.100k.pop" "country.year" "HDI.for.year"
## [10] "gdp_for_year...." "gdp_per_capita...." "generation"
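The `ï..` prefix on the first column name is usually a sign that the CSV begins with a UTF-8 byte-order mark (BOM) that was read as ordinary text. The following is a minimal sketch under that assumption: passing `fileEncoding = "UTF-8-BOM"` to `read.csv` removes the mark on import. The temporary file and its contents are made up for illustration; the rest of this project keeps the original `ï..country` name and renames it later.

```r
# Hypothetical demo file that starts with a UTF-8 BOM (bytes EF BB BF)
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("country,year\nAlbania,1987\n")), tmp)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
unlink(tmp)
```

If the import is changed this way, any later step that refers to `ï..country` would need to use `country` instead.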
From this inspection we can conclude:
colSums(is.na(data))
##         ï..country               year                sex                age
##                  0                  0                  0                  0
##        suicides_no         population  suicides.100k.pop       country.year
##                  0                  0                  0                  0
##       HDI.for.year   gdp_for_year.... gdp_per_capita....         generation
##              19456                  0                  0                  0
mean(is.na(data$HDI.for.year))*100
## [1] 69.9353
The NA check above shows that about 70% of the values in the HDI.for.year column are missing. Because that is more than 50%, we drop the column using the dplyr package.
library(dplyr)
data <- data %>%
  select(-c(HDI.for.year, country.year))
head(data)
The country.year column is dropped as well: it carries no additional information because it is already represented by the year and country columns.
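The 50% rule can also be applied programmatically instead of naming the column by hand. The sketch below uses base R and a made-up toy data frame (`toy`, `hdi`, and `rate` are illustrative names, not columns of the real dataset) to compute the share of missing values per column and keep only the columns at or below the threshold.

```r
# Toy data frame (made-up values) with one mostly-missing column
toy <- data.frame(
  year = c(1987, 1988, 1989, 1990),
  hdi  = c(NA, NA, NA, 0.7),    # 75% missing
  rate = c(6.7, 5.2, NA, 4.6)   # 25% missing
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep only the columns with at most 50% missing values
toy_clean <- toy[, na_share <= 0.5, drop = FALSE]
names(toy_clean)
```

Here `hdi` is removed while `rate` survives, because only `hdi` exceeds the threshold.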
str(data)
## 'data.frame': 27820 obs. of 10 variables:
## $ ï..country : chr "Albania" "Albania" "Albania" "Albania" ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : chr "male" "male" "female" "male" ...
## $ age : chr "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
## $ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides.100k.pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ gdp_for_year.... : chr "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
## $ gdp_per_capita....: int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : chr "Generation X" "Silent" "Generation X" "G.I. Generation" ...
# Rename columns, convert categorical columns to factors, and parse GDP as numeric
data <- data %>%
  rename("country" = "ï..country",
         "suicides(/100k.pop)" = "suicides.100k.pop",
         "gdp.peryear($)" = "gdp_for_year....",
         "gdp.percapita($)" = "gdp_per_capita....") %>%
  mutate_at(c("country", "sex", "age", "generation"), as.factor) %>%
  # Set the factor level order of the "generation" column
  mutate(generation = factor(generation, levels = c("G.I. Generation",
                                                    "Silent",
                                                    "Boomers",
                                                    "Generation X",
                                                    "Millenials",
                                                    "Generation Z"))) %>%
  # Set the factor level order of the "age" column
  mutate(age = factor(age, levels = c("5-14 years",
                                      "15-24 years",
                                      "25-34 years",
                                      "35-54 years",
                                      "55-74 years",
                                      "75+ years"))) %>%
  # Strip the thousands separators so the yearly GDP can be stored as a number
  mutate(`gdp.peryear($)` = as.numeric(gsub(",", "", `gdp.peryear($)`, fixed = TRUE)))
levels(data$age)
## [1] "5-14 years" "15-24 years" "25-34 years" "35-54 years" "55-74 years"
## [6] "75+ years"
levels(data$generation)
## [1] "G.I. Generation" "Silent" "Boomers" "Generation X"
## [5] "Millenials" "Generation Z"
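Two of the conversions above are easy to check on small inputs. The sketch below, using made-up vectors, shows how removing the commas lets a GDP string be parsed as a number, and how passing explicit `levels` to `factor()` fixes the order. It also illustrates a caveat worth keeping in mind: any value not listed in `levels` becomes `NA`, so the level names must match the data exactly.

```r
# Parse comma-formatted numbers, as done for gdp.peryear($)
gdp_text <- c("2,156,624,900", "796")
gdp_num <- as.numeric(gsub(",", "", gdp_text, fixed = TRUE))
gdp_num

# Explicit levels fix the order; values missing from `levels` become NA
age <- c("15-24 years", "5-14 years", "75+ years", "unknown")
age_f <- factor(age, levels = c("5-14 years", "15-24 years", "75+ years"))
levels(age_f)
is.na(age_f)
```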
American Generations Timeline
Though there is a consensus on the general time period for generations, there is not an agreement on the exact year that each generation begins and ends.
1. GI Generation
Born 1901-1924 (Age 90+)
*They were teenagers during the Great Depression and fought in World War II. Sometimes called the greatest generation (following a book by journalist Tom Brokaw) or the swing generation because of their jazz music.
2. Silent Generation
Born 1925-1942 (Age 72-89)
*They were too young to see action in World War II and too old to participate in the fun of the Summer of Love. This label describes their conformist tendencies and belief that following the rules was a sure ticket to success.
3. Baby Boomers
Born 1943-1964 (Age 50-71)
*The boomers were born during an economic and baby boom following World War II. These hippie kids protested against the Vietnam War and participated in the civil rights movement, all with rock ‘n’ roll music blaring in the background.
4. Generation X
Born 1965-1979 (Age 35-49)
*They were originally called the baby busters because fertility rates fell after the boomers. As teenagers, they experienced the AIDS epidemic and the fall of the Berlin Wall. Sometimes called the MTV Generation, the “X” in their name refers to this generation’s desire not to be defined.
5. Millennials
Born 1980-2000 (Age 14-34)
*They experienced the rise of the Internet, Sept. 11 and the wars that followed. Sometimes called Generation Y. Because of their dependence on technology, they are said to be entitled and narcissistic.
6. Generation Z
Born 2001-2013 (Age 1-13)
*These kids were the first born with the Internet and are suspected to be the most individualistic and technology-dependent generation. Sometimes referred to as the iGeneration.
source : https://www.npr.org/
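The birth-year ranges in the timeline above can be sketched as a simple lookup with `cut()`. This is an illustration only: `generation_of` is a hypothetical helper, and the dataset's own generation column is assigned per age group and year, so it need not follow these exact cutoffs.

```r
# Birth-year boundaries taken from the timeline above
breaks <- c(1900, 1924, 1942, 1964, 1979, 2000, 2013)
labels <- c("G.I. Generation", "Silent", "Boomers",
            "Generation X", "Millennials", "Generation Z")

# Hypothetical helper: map a birth year to a generation label.
# cut() uses right-closed intervals, e.g. (1900, 1924] for the G.I. Generation.
generation_of <- function(birth_year) {
  as.character(cut(birth_year, breaks = breaks, labels = labels))
}

generation_of(c(1919, 1950, 1985))
```

For example, someone aged 75 or older in 1994 was born in 1919 or earlier, which falls in the G.I. Generation range.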
The year 2016 is sparsely covered: it contributes only 160 of the 27,820 rows. We therefore drop all data from 2016.
data %>%
  group_by(year) %>%
  summarise(n = n())
data <- data %>%
  filter(year != 2016)
summary(data)
## country year sex age
## Argentina: 372 Min. :1985 female:13830 5-14 years :4610
## Austria : 372 1st Qu.:1994 male :13830 15-24 years:4610
## Belgium : 372 Median :2002 25-34 years:4610
## Brazil : 372 Mean :2001 35-54 years:4610
## Chile : 372 3rd Qu.:2008 55-74 years:4610
## Colombia : 372 Max. :2015 75+ years :4610
## (Other) :25428
## suicides_no population suicides(/100k.pop) gdp.peryear($)
## Min. : 0.0 Min. : 278 Min. : 0.00 Min. :4.692e+07
## 1st Qu.: 3.0 1st Qu.: 97535 1st Qu.: 0.91 1st Qu.:8.976e+09
## Median : 25.0 Median : 430725 Median : 5.98 Median :4.801e+10
## Mean : 243.4 Mean : 1850689 Mean : 12.81 Mean :4.471e+11
## 3rd Qu.: 132.0 3rd Qu.: 1491041 3rd Qu.: 16.60 3rd Qu.:2.602e+11
## Max. :22338.0 Max. :43805214 Max. :224.97 Max. :1.812e+13
##
## gdp.percapita($) generation
## Min. : 251 G.I. Generation:2744
## 1st Qu.: 3436 Silent :6332
## Median : 9283 Boomers :4958
## Mean : 16816 Generation X :6376
## 3rd Qu.: 24796 Millenials :5780
## Max. :126352 Generation Z :1470
##
str(data)
## 'data.frame': 27660 obs. of 10 variables:
## $ country : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
## $ age : Factor w/ 6 levels "5-14 years","15-24 years",..: 2 4 2 6 3 6 4 3 5 1 ...
## $ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides(/100k.pop): num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ gdp.peryear($) : num 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
## $ gdp.percapita($) : int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : Factor w/ 6 levels "G.I. Generation",..: 4 2 4 1 3 1 2 3 1 4 ...
Variable explanation
country : The name of the country listed in the dataset.
year : The year of the record (1985-2016 in the raw data).
sex : Sex (male or female).
age : Age, binned into age groups (factor).
suicides_no : Number of suicide cases.
population : Number of people in the group defined by country, year, sex, and age group (and hence generation).
suicides(/100k.pop) : Number of suicides per 100,000 population.
gdp.peryear($) : The total monetary or market value of all the finished goods and services produced within a country’s borders in a given year.
gdp.percapita($) : A measure of a country’s economic output that accounts for its number of people. It divides the country’s gross domestic product by its total population.
generation : All of the people born and living at about the same time, regarded collectively.
# number of countries
length(unique(data$country))
## [1] 100
# country list
unique(data$country)
## [1] Albania Antigua and Barbuda
## [3] Argentina Armenia
## [5] Aruba Australia
## [7] Austria Azerbaijan
## [9] Bahamas Bahrain
## [11] Barbados Belarus
## [13] Belgium Belize
## [15] Bosnia and Herzegovina Brazil
## [17] Bulgaria Cabo Verde
## [19] Canada Chile
## [21] Colombia Costa Rica
## [23] Croatia Cuba
## [25] Cyprus Czech Republic
## [27] Denmark Dominica
## [29] Ecuador El Salvador
## [31] Estonia Fiji
## [33] Finland France
## [35] Georgia Germany
## [37] Greece Grenada
## [39] Guatemala Guyana
## [41] Hungary Iceland
## [43] Ireland Israel
## [45] Italy Jamaica
## [47] Japan Kazakhstan
## [49] Kiribati Kuwait
## [51] Kyrgyzstan Latvia
## [53] Lithuania Luxembourg
## [55] Macau Maldives
## [57] Malta Mauritius
## [59] Mexico Montenegro
## [61] Netherlands New Zealand
## [63] Nicaragua Norway
## [65] Oman Panama
## [67] Paraguay Philippines
## [69] Poland Portugal
## [71] Puerto Rico Qatar
## [73] Republic of Korea Romania
## [75] Russian Federation Saint Kitts and Nevis
## [77] Saint Lucia Saint Vincent and Grenadines
## [79] San Marino Serbia
## [81] Seychelles Singapore
## [83] Slovakia Slovenia
## [85] South Africa Spain
## [87] Sri Lanka Suriname
## [89] Sweden Switzerland
## [91] Thailand Trinidad and Tobago
## [93] Turkey Turkmenistan
## [95] Ukraine United Arab Emirates
## [97] United Kingdom United States
## [99] Uruguay Uzbekistan
## 101 Levels: Albania Antigua and Barbuda Argentina Armenia Aruba ... Uzbekistan
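There are 100 countries in the dataset, which does not cover every country in the world. The factor still reports 101 levels because dropping the 2016 rows left one level without any remaining observations.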
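The gap between the number of factor levels and the number of distinct values can be reproduced on a toy factor. In the sketch below (made-up labels `A`, `B`, `C`), filtering removes every observation of one level, yet the level itself stays until `droplevels()` is called.

```r
# Toy factor with three levels; filtering removes every "B"
country <- factor(c("A", "B", "C"))
kept <- country[country != "B"]

nlevels(kept)              # still counts the unused level
length(unique(kept))       # counts only the values actually present
nlevels(droplevels(kept))  # unused level removed
```

Dropping unused levels is optional here, but it keeps summaries and plot legends from listing a country with no rows.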
From the data inspection and cleansing above, we can conclude:
In this section, we try to answer the information objectives defined above.
worldyear <- data %>%
group_by(year) %>%
summarise(sum_suicides = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicides) %>%
mutate(year = as.factor(year))
worldyear
# visualization
library(ggplot2)
worldyear %>%
  ggplot(aes(x = sum_suicides, y = reorder(year, sum_suicides), fill = year)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none")
We can see that the year with the highest total is 1995, with a summed rate of 14,660.26. Note that this figure is the sum of the per-100,000 rates over every country, sex, and age group in that year, not a single worldwide rate per 100,000 population.
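Summing per-100,000 rates gives a small group the same weight as a large one. As a complementary view, and not the method used in this project, a population-weighted rate divides the total number of cases by the total population. The sketch below contrasts the two on made-up rows that reuse the `suicides_no` and `population` column names.

```r
# Made-up rows for one year: two groups of very different size
toy <- data.frame(
  suicides_no = c(10, 300),
  population  = c(100000, 10000000)
)
toy$rate <- toy$suicides_no / toy$population * 1e5  # 10 and 3 per 100k

# Sum of the group rates (the approach used in this project)
sum_of_rates <- sum(toy$rate)

# Population-weighted rate: total cases over total population
weighted_rate <- sum(toy$suicides_no) / sum(toy$population) * 1e5

sum_of_rates
weighted_rate
```

The two figures answer different questions, so rankings based on one need not match rankings based on the other.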
Descriptive Statistics
summaryyear <- data %>%
group_by(year) %>%
summarise(sum_suicides = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicides)
summary(summaryyear$sum_suicides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6580 10314 11844 11432 13723 14660
worldsex <- data %>%
group_by(sex) %>%
summarise(sum_suicides = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicides)
worldsex
# visualization
y <- worldsex$sum_suicides
z <- worldsex$sex
piepercent <- round(100*y/sum(y), 1)
pie(y, labels = piepercent, main = "Share of summed suicide rates by sex (%)", col = rainbow(length(y)))
# Take the legend labels from the data so they always match the slice order
legend("topright", legend = as.character(z), cex = 0.8,
       fill = rainbow(length(y)))
We can see that males account for the larger share: a summed rate of 279,767.16 per 100,000 population (78.9% of the total), against 74,629.28 per 100,000 population (21.1%) for females.
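The percentages quoted above follow directly from the two summed rates. As a quick check, the sketch below recomputes them from the figures in the text (the vector name `sum_rate` is illustrative).

```r
# Summed rates by sex, copied from the text above
sum_rate <- c(male = 279767.16, female = 74629.28)

# Share of each sex in the total, rounded to one decimal place
share <- round(100 * sum_rate / sum(sum_rate), 1)
share
```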
data %>%
group_by(age) %>%
summarise(sum_suicides = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicides)
We can see that the age group with the highest summed suicide rate (110,532.19 per 100,000 population) is 75+ years.
data %>%
group_by(generation) %>%
summarise(sum_suicides = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicides)
We can see that the generation with the highest summed suicide rate (116,548.73 per 100,000 population) is the Silent generation.
data %>%
group_by(country) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide) %>%
head(3)
We can see that the country with the highest summed suicide rate from 1985 to 2015 is the Russian Federation. Next, we explore the suicide data for the Russian Federation.
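The group, sum, and sort pattern repeats for year, sex, age, generation, and country. One way to avoid the repetition is a small helper; the sketch below uses base R so that it runs without extra packages. `sum_by` is a hypothetical name, and the toy rows are made up, reusing only the `suicides(/100k.pop)` column name.

```r
# Hypothetical helper: sum `value` within each level of `group`, largest first
sum_by <- function(df, group, value) {
  totals <- tapply(df[[value]], df[[group]], sum)
  sort(totals, decreasing = TRUE)
}

# Made-up rows with the same rate column name as the cleaned data
toy <- data.frame(
  country = c("A", "A", "B", "C"),
  `suicides(/100k.pop)` = c(1.5, 2.5, 10, 3),
  check.names = FALSE
)

res <- sum_by(toy, "country", "suicides(/100k.pop)")
head(res, 3)
```

The same call with `"sex"`, `"age"`, or `"generation"` as the grouping column would cover the other breakdowns.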
data %>%
filter(country == "Russian Federation") %>%
group_by(year) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
In 1994, the summed suicide rate reached 567.64 per 100,000 population, the highest yearly value for the Russian Federation.
data %>%
filter(country == "Russian Federation",
year == 1994) %>%
group_by(sex) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
Grouping the 1994 Russian data by sex shows that the highest figure is for men (477.82 per 100,000 population), while the figure for women is 89.82 per 100,000 population.
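Because the yearly total is the sum over both sexes, the two figures above should add up to the 1994 total of 567.64. The sketch below checks this with the numbers quoted in the text (variable names are illustrative); `all.equal()` is used because exact `==` comparison is unreliable for decimal sums.

```r
# Figures copied from the text above (Russian Federation, 1994)
by_sex <- c(male = 477.82, female = 89.82)
year_total <- 567.64

# Compare with a small numerical tolerance
totals_match <- isTRUE(all.equal(sum(by_sex), year_total))
totals_match
```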
The breakdowns below look further at the distribution of the 1994 suicide data for the Russian Federation in three groups: both sexes, male, and female.
data %>%
filter(country == "Russian Federation",
year == 1994) %>%
group_by(age) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
The data above show that the age group with the highest suicide rate is 75+ years. Next, we find the generation with the most suicides in the 75+ years age group for both sexes.
data %>%
filter(country == "Russian Federation",
year == 1994,
age == "75+ years") %>%
group_by(generation) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
The G.I. Generation is the generation with the most suicides in the 75+ years age group for both sexes (142.31 per 100,000 population).
data %>%
filter(country == "Russian Federation",
year == 1994,
sex == "male") %>%
group_by(age) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
The data above show that the age group with the highest suicide rate among males is 35-54 years. Next, we find the generation with the most suicides in the 35-54 years age group for males.
data %>%
filter(country == "Russian Federation",
year == 1994,
sex == "male",
age == "35-54 years") %>%
group_by(generation) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
Boomers are the generation with the most suicides in the 35-54 years age group for males (142.31 per 100,000 population).
data %>%
filter(country == "Russian Federation",
year == 1994,
sex == "female") %>%
group_by(age) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
The data above show that the age group with the highest suicide rate among females is 75+ years. Next, we find the generation with the most suicides in the 75+ years age group for females.
data %>%
filter(country == "Russian Federation",
year == 1994,
sex == "female",
age == "75+ years") %>%
group_by(generation) %>%
summarise(sum_suicide = sum(`suicides(/100k.pop)`)) %>%
arrange(-sum_suicide)
The G.I. Generation is the generation with the most suicides in the 75+ years age group for females (142.31 per 100,000 population).
Suicide occurs more often in older than in younger people, but is still one of the leading causes of death in late childhood and adolescence worldwide. In this dataset, the year with the highest summed suicide rate between 1985 and 2015 is 1995. Over the same period, the sex with the most suicides is male, the age group with the most suicides is 75+ years, and the Silent generation is the generation with the highest figure.
This dataset shows that the Russian Federation was the country with the highest summed suicide rate from 1985 to 2015, at 11,305.13 per 100,000 population over the whole period. Within that period, 1994 was the year with the highest figure for the Russian Federation (567.64 per 100,000 population), of which 477.82 per 100,000 population came from men and 89.82 per 100,000 population from women. The 75+ years age group accounts for the largest share and is dominated by the G.I. Generation.