Loading data

Suicide Rates

This dataset was compiled from four other datasets linked by time and place. It was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum.
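The first column appears as `п.їcountry` in the outputs that follow. That pattern is typical of a UTF-8 byte-order mark (BOM) being read as part of the first header name. The loading code is not shown in this report, so the following is only a minimal sketch, assuming the data comes from a CSV file with a BOM; the temporary file and its two columns are made up for illustration.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data only)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                 # the BOM bytes
writeBin(charToRaw("country,year\nAlbania,1987\n"), con)  # header and one row
close(con)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM while reading,
# so the first column name stays clean
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
print(names(demo))
```

If the real file is read this way, the first column would be named `country`, and any later code that refers to `п.їcountry` would need to use the clean name instead.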

General dataset parameters

First 5 rows

library(knitr)
kable(SuicideRates[1:5,], caption = "dataset parameters")
dataset parameters

| п.їcountry | year | sex | age | suicides_no | population | suicides.100k.pop | country.year | HDI.for.year | gdp_for_year.... | gdp_per_capita.... | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NA | 2,156,624,900 | 796 | Generation X |
| Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NA | 2,156,624,900 | 796 | Silent |
| Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NA | 2,156,624,900 | 796 | Generation X |
| Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NA | 2,156,624,900 | 796 | G.I. Generation |
| Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NA | 2,156,624,900 | 796 | Boomers |

Dataset dimension

SuicideRates.dim <- dim(SuicideRates)

Dataset contains 27820 rows and 12 columns

Columns list

str(SuicideRates)
## 'data.frame':    27820 obs. of  12 variables:
##  $ п.їcountry        : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
##  $ sex               : chr  "male" "male" "female" "male" ...
##  $ age               : chr  "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
##  $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
##  $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
##  $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country.year      : chr  "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year....  : chr  "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
##  $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
##  $ generation        : chr  "Generation X" "Silent" "Generation X" "G.I. Generation" ...
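The `str()` output shows `gdp_for_year....` stored as character because the values contain thousands separators, which also explains why the column is missing from the numeric summaries later on. A minimal sketch of one way to convert such strings to numbers is given here; the first value is taken from the output above, the second is made up for illustration.

```r
# Character GDP values with comma thousands separators
gdp_chr <- c("2,156,624,900", "1,000")  # second value is illustrative only

# Strip the commas, then convert; as.numeric() is used rather than
# as.integer() because 2156624900 exceeds R's 32-bit integer maximum
gdp_num <- as.numeric(gsub(",", "", gdp_chr, fixed = TRUE))
print(gdp_num)
```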

Basic statistical characteristics of the dataset

summary(SuicideRates)
##   п.їcountry             year          sex                age           
##  Length:27820       Min.   :1985   Length:27820       Length:27820      
##  Class :character   1st Qu.:1995   Class :character   Class :character  
##  Mode  :character   Median :2002   Mode  :character   Mode  :character  
##                     Mean   :2001                                        
##                     3rd Qu.:2008                                        
##                     Max.   :2016                                        
##                                                                         
##   suicides_no        population       suicides.100k.pop country.year      
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00    Length:27820      
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92    Class :character  
##  Median :   25.0   Median :  430150   Median :  5.99    Mode  :character  
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82                      
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62                      
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97                      
##                                                                           
##   HDI.for.year   gdp_for_year....   gdp_per_capita....  generation       
##  Min.   :0.483   Length:27820       Min.   :   251     Length:27820      
##  1st Qu.:0.713   Class :character   1st Qu.:  3447     Class :character  
##  Median :0.779   Mode  :character   Median :  9372     Mode  :character  
##  Mean   :0.777                      Mean   : 16866                       
##  3rd Qu.:0.855                      3rd Qu.: 24874                       
##  Max.   :0.944                      Max.   :126352                       
##  NA's   :19456
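The summary reports 19456 missing values in `HDI.for.year` out of 27820 rows. The sketch below turns those two reported counts into a percentage and shows how the same share could be computed per column; the small `toy` data frame is made up for illustration.

```r
# Counts reported in the summary above
n_rows <- 27820
n_na   <- 19456

# Share of missing HDI values, in percent
pct_missing <- round(100 * n_na / n_rows, 1)
print(pct_missing)

# The same idea applied to every column of a data frame (illustrative values)
toy <- data.frame(
  HDI.for.year = c(NA, NA, 0.7, NA),
  year         = c(1987, 1988, 1989, 1990)
)
na_share <- colMeans(is.na(toy))
print(na_share)
```

With roughly 70% of the values missing, `HDI.for.year` needs separate handling before it can be used alongside the other numeric features.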

Data visualization

library(ggplot2)
ggplot(SuicideRates, aes(x=suicides.100k.pop)) + geom_histogram(binwidth = 1)

ggplot(SuicideRates, aes(x=year, y=suicides_no)) + geom_point()

ggplot(SuicideRates, aes(x=generation, y=suicides_no)) + geom_point()

ggplot(SuicideRates, aes(x=suicides.100k.pop, y=п.їcountry)) + geom_point()

pairs(~  year + suicides_no + population + suicides.100k.pop + HDI.for.year + gdp_per_capita...., data = SuicideRates, main = 'Suicide Rates Data')

Feature correlation

## corrplot 0.84 loaded
##                            year  suicides_no  population suicides.100k.pop
## year                1.000000000 -0.004545958 0.008850170      -0.039036797
## suicides_no        -0.004545958  1.000000000 0.616162268       0.306604451
## population          0.008850170  0.616162268 1.000000000       0.008284973
## suicides.100k.pop  -0.039036797  0.306604451 0.008284973       1.000000000
## gdp_per_capita....  0.339134280  0.061329749 0.081509858       0.001785134
##                    gdp_per_capita....
## year                      0.339134280
## suicides_no               0.061329749
## population                0.081509858
## suicides.100k.pop         0.001785134
## gdp_per_capita....        1.000000000
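The code that produced this matrix is not shown; the `corrplot 0.84 loaded` message suggests the corrplot package was used for plotting. A minimal sketch of how such a matrix could be computed with base R `cor()` follows. The `toy` data frame and its values are made up, and the `method = "circle"` choice is an assumption.

```r
# Toy data frame with made-up values; names mirror some of the columns above
toy <- data.frame(
  year        = c(1987, 1990, 1995, 2000, 2005, 2010),
  suicides_no = c(21, 16, 14, 1, 9, 30),
  population  = c(312900, 308000, 289700, 21800, 274300, 400000),
  country     = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Keep numeric columns only; cor() does not accept character columns
num_cols <- vapply(toy, is.numeric, logical(1))
cor_mat  <- cor(toy[, num_cols], use = "complete.obs")
print(round(cor_mat, 3))

# Optional plot, drawn only if the corrplot package is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

Note that `HDI.for.year` and `gdp_for_year....` do not appear in the matrix above: the former is mostly `NA` and the latter is stored as character.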

Normalization

## Loading required package: lattice
##       year         suicides_no          population       suicides.100k.pop 
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.3226   1st Qu.:0.0001343   1st Qu.:0.002219   1st Qu.:0.004089  
##  Median :0.5484   Median :0.0011192   Median :0.009813   Median :0.026626  
##  Mean   :0.5245   Mean   :0.0108593   Mean   :0.042107   Mean   :0.056968  
##  3rd Qu.:0.7419   3rd Qu.:0.0058644   3rd Qu.:0.033920   3rd Qu.:0.073876  
##  Max.   :1.0000   Max.   :1.0000000   Max.   :1.000000   Max.   :1.000000  
##  gdp_per_capita....
##  Min.   :0.00000   
##  1st Qu.:0.02534   
##  Median :0.07233   
##  Mean   :0.13176   
##  3rd Qu.:0.19526   
##  Max.   :1.00000
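Every feature in the output above has a minimum of 0 and a maximum of 1, which is consistent with min-max scaling, (x - min) / (max - min). The normalization code is not shown; the `lattice` load message suggests a package such as caret may have been used, but that is an assumption. A minimal base R sketch, checked against the `year` quartiles reported earlier (1985, 1995, 2002, 2008, 2016), is:

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Quartiles of `year` from the summary of the raw data
year_quartiles <- c(1985, 1995, 2002, 2008, 2016)
scaled <- min_max(year_quartiles)
print(round(scaled, 4))
```

The rounded results, 0, 0.3226, 0.5484, 0.7419 and 1, match the `year` column of the normalized summary. For a whole data frame, the same helper could be applied column by column, for example with `lapply()` over the numeric columns.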

Conclusion

The exploratory data analysis produced the main statistical characteristics of the dataset features, determined whether noise needs to be removed from the data, and identified highly correlated parameters for elimination.
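As an illustration of the last point, the sketch below rebuilds the correlation matrix from the values reported under "Feature correlation" (rounded to four decimals) and lists the feature pairs whose absolute correlation exceeds a cutoff. The cutoff of 0.6 is a hypothetical threshold chosen for this example, not one stated in the analysis.

```r
vars <- c("year", "suicides_no", "population",
          "suicides.100k.pop", "gdp_per_capita....")

# Correlations copied from the reported matrix, rounded to four decimals
cor_mat <- matrix(c(
   1.0000, -0.0045, 0.0089, -0.0390, 0.3391,
  -0.0045,  1.0000, 0.6162,  0.3066, 0.0613,
   0.0089,  0.6162, 1.0000,  0.0083, 0.0815,
  -0.0390,  0.3066, 0.0083,  1.0000, 0.0018,
   0.3391,  0.0613, 0.0815,  0.0018, 1.0000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.6  # hypothetical threshold, an assumption for this sketch

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

With this cutoff the only flagged pair is `suicides_no` and `population`, so one of the two could be a candidate for removal; a stricter or looser cutoff would change that list.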