Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
********************************************************
Note: As of version 1.0.0, cowplot does not change the
default ggplot2 theme anymore. To recover the previous
behavior, execute:
theme_set(theme_cowplot())
********************************************************
Attaching package: ‘cowplot’
The following object is masked from ‘package:lubridate’:
stamp
Registered S3 method overwritten by 'xts':
method from
as.zoo.xts zoo
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Registered S3 methods overwritten by 'forecast':
method from
fitted.fracdiff fracdiff
residuals.fracdiff fracdiff
Two socio-economic datasets, Suicide Rates Overview 1985 to 2016 and World Bank Data (1960 to 2016) were retrieved from Kaggle as text files.
The Life Expectancy table from the World Bank Data was tidied by converting the table from wide to long format and then left joined to the suicide rates data.
The resulting data’s structure is then inspected. Variables with incorrect data types were cast to the appropriate data types and the data was subset.
The data is then checked to see if the data is tidy. From previously tidying the data before joining, the data is determined to be tidy.
A new variable suicide_pop_prop is created from mutating the variables suicides_no and population by dividing the two values to create the new variable.
The dataset were checked for NaN, Inf and NA values.
Missing values were identified in the variables hdi_per_year and lifeexpectancy. Missing values in the variables were imputed and mutated to form new variables such that an analyst may refer to the raw data where appropriate.
Outliers for each numeric column are winsorised to the 5th and 95th percentile and stored in a new data frame. There are large number of outliers for each variable and so the variables are not directly replaced, since imputation or removal may impact the results of the analysis.
The lifeexpectancy variable is transformed using the Box-Cox transformation to remove the effect of skewness.
The World Bank Data (1960 to 2016) was sourced from Kaggle (https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016) which contains three datasets:
Life Expectancy at Birth: number of years a newborn would if the patterns of mortality at the time of birth remain the same throughout his life.
Fertility rate: number of children a woman should give birth to during her childbearing years.
Country population: a total number of residents regardless of legal status or citizenship (midyear estimates)
The Suicide Rates Overview 1985 to 2016 data was also sourced from Kaggle (https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016) compares socio-economic information with suicide rates by year and country and is composed of one file.
First we import and preview the data. Since they are both csv/text format we call the appropriate function for these file formats.
suiciderates <- read_csv("suiciderates.csv")
Parsed with column specification:
cols(
country = [31mcol_character()[39m,
year = [32mcol_double()[39m,
sex = [31mcol_character()[39m,
age = [31mcol_character()[39m,
suicides_no = [32mcol_double()[39m,
population = [32mcol_double()[39m,
`suicides/100k pop` = [32mcol_double()[39m,
`country-year` = [31mcol_character()[39m,
`HDI for year` = [32mcol_double()[39m,
`gdp_for_year ($)` = [32mcol_number()[39m,
`gdp_per_capita ($)` = [32mcol_double()[39m,
generation = [31mcol_character()[39m
)
lifeexpectancy <- read_csv("life_expectancy.csv")
Parsed with column specification:
cols(
.default = col_double(),
`Country Name` = [31mcol_character()[39m,
`Country Code` = [31mcol_character()[39m,
`Indicator Name` = [31mcol_character()[39m,
`Indicator Code` = [31mcol_character()[39m
)
See spec(...) for full column specifications.
suiciderates %>% head()
lifeexpectancy %>% head()
We see that the life expectancy table is untidy because the year values are variables rather than observations.
We clean the life expactancy dataset first by converting it to long format to tidy the data and to make it easier to join the two datasets, since year is a common factor between the two.
lifeexpectancy %>% head()
lifeexpectancy <- lifeexpectancy %>% gather(c(5:length(lifeexpectancy)), key = "year", value = "lifeexpectancy")
lifeexpectancy %>% head()
We create a new variable to join on so the data can be matched appropriately. We then join the two datasets using a left join on the suicide rates dataset.
#Mutate new variable, country-year
lifeexpectancy <- lifeexpectancy %>% mutate(`country-year` = paste0(`Country Name`, year))
#Left join lifeexpectancy on suiciderates
suicide_life <- suiciderates %>% left_join(lifeexpectancy, by = "country-year")
#Working dataset
suicide_life %>% head()
First we check the structure of our dataset.
str(suicide_life)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 27820 obs. of 18 variables:
$ country : chr "Albania" "Albania" "Albania" "Albania" ...
$ year.x : num 1987 1987 1987 1987 1987 ...
$ sex : chr "male" "male" "female" "male" ...
$ age : chr "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
$ suicides_no : num 21 16 14 1 9 1 6 4 1 0 ...
$ population : num 312900 308000 289700 21800 274300 ...
$ suicides/100k pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
$ country-year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
$ HDI for year : num NA NA NA NA NA NA NA NA NA NA ...
$ gdp_for_year ($) : num 2.16e+09 2.16e+09 2.16e+09 2.16e+09 2.16e+09 ...
$ gdp_per_capita ($): num 796 796 796 796 796 796 796 796 796 796 ...
$ generation : chr "Generation X" "Silent" "Generation X" "G.I. Generation" ...
$ Country Name : chr "Albania" "Albania" "Albania" "Albania" ...
$ Country Code : chr "ALB" "ALB" "ALB" "ALB" ...
$ Indicator Name : chr "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" ...
$ Indicator Code : chr "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" ...
$ year.y : chr "1987" "1987" "1987" "1987" ...
$ lifeexpectancy : num 71.8 71.8 71.8 71.8 71.8 ...
We see that we have some variables that are not the data type they should be, so we convert them to the appropriate data types.
#Rename columns
colnames(suicide_life)[colnames(suicide_life) == "year.x"] <- "year"
colnames(suicide_life)[colnames(suicide_life) == "HDI for year"] <- "hdi_for_year"
colnames(suicide_life)[colnames(suicide_life) == "suicides/100k pop"] <- "suicides_per_100k_pop"
#Factor sex
suicide_life$sex <- suicide_life$sex %>% factor(labels = unique(suicide_life$sex))
#Factor age
suicide_life$age <- suicide_life$age %>% factor(levels = c("5-14 years", "15-24 years", "25-34 years", "35-54 years", "55-74 years", "75+ years"), labels = unique(suicide_life$age), ordered = TRUE)
#Factor generation
suicide_life$generation <-
suicide_life$generation %>% factor(labels = unique(suicide_life$generation))
#Convert year to integer
suicide_life$year <- as.integer(suicide_life$year)
#Subset the data
suicide_life <- suicide_life %>% select(-c("year.y", "Country Name", "Country Code", "Indicator Name", "Indicator Code", "gdp_for_year ($)", "gdp_per_capita ($)"))
#Check the structure again
str(suicide_life)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 27820 obs. of 11 variables:
$ country : chr "Albania" "Albania" "Albania" "Albania" ...
$ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
$ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 1 1 1 2 1 ...
$ age : Ord.factor w/ 6 levels "15-24 years"<..: 2 4 2 6 3 6 4 3 5 1 ...
$ suicides_no : num 21 16 14 1 9 1 6 4 1 0 ...
$ population : num 312900 308000 289700 21800 274300 ...
$ suicides_per_100k_pop: num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
$ country-year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
$ hdi_for_year : num NA NA NA NA NA NA NA NA NA NA ...
$ generation : Factor w/ 6 levels "Generation X",..: 3 6 3 2 1 2 6 1 2 3 ...
$ lifeexpectancy : num 71.8 71.8 71.8 71.8 71.8 ...
Previously to join the datasets, the datasets were tidied in order to join better, so the current working dataset is clean.
suicide_life %>% head()
The data is tidy since each variable has its own column (after being tidied in a previous step), each observation has its own row and each value has its own cell.
We will create a new variable suicide_pop_prop by mutating to existing variables. To create the proportion variable we mutate by dividing the suicides_no variable by the population variable.
suicide_life <- suicide_life %>% mutate(suicide_pop_prop = suicides_no/population)
Let’s check which variables have missing values. We also check for any Inf or NaN values.
#NA check
colSums(is.na(suicide_life))
country year
0 0
sex age
0 0
suicides_no population
0 0
suicides_per_100k_pop country-year
0 0
hdi_for_year generation
19456 0
lifeexpectancy suicide_pop_prop
1980 0
#NaN check
suicide_life %>% sapply(function (x) sum(is.nan(x)))
country year
0 0
sex age
0 0
suicides_no population
0 0
suicides_per_100k_pop country-year
0 0
hdi_for_year generation
0 0
lifeexpectancy suicide_pop_prop
0 0
#Inf check
suicide_life %>% sapply(function (x) sum(is.infinite(x)))
country year
0 0
sex age
0 0
suicides_no population
0 0
suicides_per_100k_pop country-year
0 0
hdi_for_year generation
0 0
lifeexpectancy suicide_pop_prop
0 0
We see that the lifeexpectancy variable has 1980 missing values and the hdi_for_year variable has 19456 values missing.
#Check number of observations
dim(suicide_life)
[1] 27820 12
#Is missing data more than 5%?
sum(is.na(suicide_life$hdi_for_year))/dim(suicide_life)[1]
[1] 0.699353
sum(is.na(suicide_life$lifeexpectancy))/dim(suicide_life)[1]
[1] 0.07117182
Since both variables have more than 5% of the observations that are missing, excluding the data could bias the analysis.
We check the distribution of the variables with outliers to determine the best method of imputation to handle these.
ggplot(data = suicide_life, aes(x = hdi_for_year)) + geom_histogram()
ggplot(data = suicide_life, aes(x = lifeexpectancy)) + geom_histogram()
Both values are negatively skewed, so we will use the median to impute the missing values and mutate them into a new variable.
By retaining the original data an analyst may refer back to the original, since the hdi_for_year column may be heavily biased by the sheer volume of imputed values (19456 of 27820 total observations, more than 69% of the values).
suicide_life$hdiforyearfilter <- impute(suicide_life$hdi_for_year, fun = median)
suicide_life$lifeexpectancyfilter <- impute(suicide_life$lifeexpectancy, fun = median)
colSums(is.na(suicide_life))
country year
0 0
sex age
0 0
suicides_no population
0 0
suicides_per_100k_pop country-year
0 0
hdi_for_year generation
19456 0
lifeexpectancy suicide_pop_prop
1980 0
hdiforyearfilter lifeexpectancyfilter
0 0
The mutated variables hdiforyearfilter and lifeexpectancyfilter are free from missing values.
To find outliers we plot boxplots for each numeric value.
str(suicide_life)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 27820 obs. of 14 variables:
$ country : chr "Albania" "Albania" "Albania" "Albania" ...
$ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
$ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 1 1 1 2 1 ...
$ age : Ord.factor w/ 6 levels "15-24 years"<..: 2 4 2 6 3 6 4 3 5 1 ...
$ suicides_no : num 21 16 14 1 9 1 6 4 1 0 ...
$ population : num 312900 308000 289700 21800 274300 ...
$ suicides_per_100k_pop: num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
$ country-year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
$ hdi_for_year : num NA NA NA NA NA NA NA NA NA NA ...
$ generation : Factor w/ 6 levels "Generation X",..: 3 6 3 2 1 2 6 1 2 3 ...
$ lifeexpectancy : num 71.8 71.8 71.8 71.8 71.8 ...
$ suicide_pop_prop : num 6.71e-05 5.19e-05 4.83e-05 4.59e-05 3.28e-05 ...
$ hdiforyearfilter : 'impute' num 0.779 0.779 0.779 0.779 0.779 0.779 0.779 0.779 0.779 0.779 ...
..- attr(*, "imputed")= int 1 2 3 4 5 6 7 8 9 10 ...
$ lifeexpectancyfilter : 'impute' num 71.8 71.8 71.8 71.8 71.8 ...
..- attr(*, "imputed")= int 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 ...
p1 <- ggplot(data = suicide_life, aes(x = factor(0), y = suicides_no)) + geom_boxplot()
p2 <- ggplot(data = suicide_life, aes(x = factor(0), y = population)) + geom_boxplot()
p3 <- ggplot(data = suicide_life, aes(x = factor(0), y = suicides_per_100k_pop)) + geom_boxplot()
p4 <- ggplot(data = suicide_life, aes(x = factor(0), y = hdiforyearfilter)) + geom_boxplot()
p5 <- ggplot(data = suicide_life, aes(x = factor(0), y = lifeexpectancyfilter)) + geom_boxplot()
p6 <- ggplot(data = suicide_life, aes(x = factor(0), y = suicide_pop_prop)) + geom_boxplot()
plot_grid(p1, p2, p3, p4, p5, p6)
Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
All numeric values in the dataset appear to have some outliers.
For many of the variables, it appears (visually) that for some variables, there is a large amount of outliers. For this reason, the cleaned variables are stored in a new dataframe to allow an analyst to refer to the original if needed.
Numeric values are subset into new dataframe and the outliers windsorised to the 5th and 95th percentiles.
#original data
summary(suicide_life)
19456 values imputed to 0.779
1980 values imputed to 74.678
country year sex
Length:27820 Min. :1985 male :13910
Class :character 1st Qu.:1995 female:13910
Mode :character Median :2002
Mean :2001
3rd Qu.:2008
Max. :2016
age suicides_no
15-24 years:4610 Min. : 0.0
35-54 years:4642 1st Qu.: 3.0
75+ years :4642 Median : 25.0
25-34 years:4642 Mean : 242.6
55-74 years:4642 3rd Qu.: 131.0
5-14 years :4642 Max. :22338.0
population suicides_per_100k_pop
Min. : 278 Min. : 0.00
1st Qu.: 97498 1st Qu.: 0.92
Median : 430150 Median : 5.99
Mean : 1844794 Mean : 12.82
3rd Qu.: 1486143 3rd Qu.: 16.62
Max. :43805214 Max. :224.97
country-year hdi_for_year
Length:27820 Min. :0.483
Class :character 1st Qu.:0.713
Mode :character Median :0.779
Mean :0.777
3rd Qu.:0.855
Max. :0.944
NA's :19456
generation lifeexpectancy
Generation X :4990 Min. :52.57
Silent :2744 1st Qu.:70.90
G.I. Generation:6408 Median :74.68
Boomers :1470 Mean :74.14
Millenials :5844 3rd Qu.:77.88
Generation Z :6364 Max. :83.79
NA's :1980
suicide_pop_prop hdiforyearfilter
Min. :0.000e+00 Min. :0.4830
1st Qu.:9.187e-06 1st Qu.:0.7790
Median :5.991e-05 Median :0.7790
Mean :1.282e-04 Mean :0.7783
3rd Qu.:1.662e-04 3rd Qu.:0.7790
Max. :2.250e-03 Max. :0.9440
lifeexpectancyfilter
Min. :52.57
1st Qu.:71.22
Median :74.68
Mean :74.18
3rd Qu.:77.61
Max. :83.79
capped <- suicide_life %>% select(c("suicides_no", "population", "suicides_per_100k_pop", "hdiforyearfilter", "lifeexpectancyfilter", "suicide_pop_prop"))
capped$suicides_no <- suicide_life$suicides_no %>% cap()
capped$population <- suicide_life$population %>% cap()
capped$suicides_per_100k_pop <- suicide_life$suicides_per_100k_pop %>% cap()
capped$hdiforyearfilter <- suicide_life$hdiforyearfilter %>% cap()
capped$lifeexpectancyfilter <- capped$lifeexpectancyfilter %>% cap()
capped$suicide_pop_prop <- suicide_life$suicide_pop_prop %>% cap()
#windsorised data
summary(capped)
19456 values imputed to 0.779
1980 values imputed to 74.678
suicides_no population
Min. : 0.0 Min. : 278
1st Qu.: 3.0 1st Qu.: 97498
Median : 25.0 Median : 430150
Mean : 189.1 Mean :1850021
3rd Qu.: 131.0 3rd Qu.:1486143
Max. :1050.0 Max. :8850240
suicides_per_100k_pop hdiforyearfilter
Min. : 0.00 Min. :0.6750
1st Qu.: 0.92 1st Qu.:0.7790
Median : 5.99 Median :0.7790
Mean :11.64 Mean :0.7789
3rd Qu.:16.62 3rd Qu.:0.7790
Max. :50.53 Max. :0.8820
lifeexpectancyfilter suicide_pop_prop
Min. :61.73 Min. :0.000e+00
1st Qu.:71.22 1st Qu.:9.187e-06
Median :74.68 Median :5.991e-05
Mean :74.27 Mean :1.164e-04
3rd Qu.:77.61 3rd Qu.:1.662e-04
Max. :83.79 Max. :5.053e-04
Looking at the distribution of lifeexpectancyfilter_capped variable, we see that it is slightly left skewed.
ggplot(capped, aes(x = lifeexpectancyfilter)) + geom_histogram()
We will transform the data to minimise the effect of skewness in analysis by applying a Box-Cox transformation on the variable.
capped$lifeexpectancyfilter_boxcox <- BoxCox(capped$lifeexpectancyfilter, lambda = "auto")
NA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive valueNA/Inf replaced by maximum positive value
ggplot(data = capped, aes(x = lifeexpectancyfilter_boxcox)) + geom_histogram()
The resulting distribution is less skewed and ready for analysis.