The following are the required packaged use for the preprocessing of the data sets:
library(readr)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Suicide is a global phenomenon which accounts for 1.4% of all deaths worldwide (Suicide across the world (2016), 2020). One person dies from it every 40 seconds (Suicide across the world (2016), 2020).
This reports aims to prepare a set of data to determine if there is a correlation between suicide rates and a country’s standard of living.
As real world data are commonly noisy and incomplete, the data have to be preprocess before any statistical analysis can be done. These steps are taken to preprocess the data:
According to the World Health Organization(WHO), one person die from suicide every 40 seconds, this numbers adds up close to 800,000 suicidal deaths in a year (Suicide across the world (2016), 2020).
The WHO further reports that the suicidal rates have increased by 60% worldwide in the last 45 years. This global phenomenon have caused pain to hundreds of thousands of people around the world and should not be overlooked.
The GDP per capita is used to measure a country’s standard of living. The following two sets of data are use to determine if a country’s standard of living would impact its suicide rates:
WHO Suicide Statistics: https://www.kaggle.com/szamil/who-suicide-statistics?select=who_suicide_statistics.csv
The WHO Suicide statistics is a csv file which contains the number of suicides by country, year, sex and age groups.
It contains 6 variables with 43,776 observations. The 6 variables are as follows:
* Country: 141 countries where the suicide statistics are measured
* Year: The year of the observation (1979-2016)
* Sex: Male or Female
* Age: 6 unique age group (5-14 Years, 15-24 Years, 25-34 Years, 35-54 Years, 55-74 Years, 75+ Years)
* Suicide_No: Number of suicides
* Population: Population by country, sex and age group
GDP per Capita Data: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
The GDP per capita Data is a XLS file which contains the GDP per Capita data of 264 countries.
It contains 64 Variables with 264 observations. The 64 variables are as follows:
* Country Name: Name of the 264 observed countries
* Country Code: Short alphabetic codes to represent countries. (Not used in final dataset)
* Indicator Name: The indicator used is the GDP per capita (current US$) for all 264 countries (Not used in final dataset)
* Variable 4-64 (Years from 1960 to 2019): GDP per capita in USD of the particular year
The tidying and merging of data will be shown at the next section: Tidy & Manipulate Data I
#Step 1
gdp <- read_excel("API_NY.GDP.PCAP.CD_DS2_en_excel_v2_1345208.xls", sheet = 1, skip = 3)
#Step 2
suicidestat <- read_csv("who_suicide_statistics 2.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## sex = col_character(),
## age = col_character(),
## suicides_no = col_double(),
## population = col_double()
## )
#Step 3
str(gdp)
## tibble [264 × 64] (S3: tbl_df/tbl/data.frame)
## $ Country Name : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Country Code : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
## $ Indicator Name: chr [1:264] "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" ...
## $ Indicator Code: chr [1:264] "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" ...
## $ 1960 : num [1:264] NA 59.8 NA NA NA ...
## $ 1961 : num [1:264] NA 59.9 NA NA NA ...
## $ 1962 : num [1:264] NA 58.5 NA NA NA ...
## $ 1963 : num [1:264] NA 78.7 NA NA NA ...
## $ 1964 : num [1:264] NA 82.1 NA NA NA ...
## $ 1965 : num [1:264] NA 101 NA NA NA ...
## $ 1966 : num [1:264] NA 138 NA NA NA ...
## $ 1967 : num [1:264] NA 161 NA NA NA ...
## $ 1968 : num [1:264] NA 129 NA NA NA ...
## $ 1969 : num [1:264] NA 129 NA NA NA ...
## $ 1970 : num [1:264] NA 157 NA NA 3239 ...
## $ 1971 : num [1:264] NA 160 NA NA 3498 ...
## $ 1972 : num [1:264] NA 135 NA NA 4217 ...
## $ 1973 : num [1:264] NA 143 NA NA 5342 ...
## $ 1974 : num [1:264] NA 174 NA NA 6320 ...
## $ 1975 : num [1:264] NA 187 NA NA 7169 ...
## $ 1976 : num [1:264] NA 197 NA NA 7152 ...
## $ 1977 : num [1:264] NA 224 NA NA 7751 ...
## $ 1978 : num [1:264] NA 247 NA NA 9130 ...
## $ 1979 : num [1:264] NA 276 NA NA 11821 ...
## $ 1980 : num [1:264] NA 273 711 NA 12377 ...
## $ 1981 : num [1:264] NA 264 642 NA 10372 ...
## $ 1982 : num [1:264] NA NA 620 NA 9610 ...
## $ 1983 : num [1:264] NA NA 623 NA 8023 ...
## $ 1984 : num [1:264] NA NA 638 639 7729 ...
## $ 1985 : num [1:264] NA NA 758 640 7774 ...
## $ 1986 : num [1:264] 6473 NA 685 694 10362 ...
## $ 1987 : num [1:264] 7886 NA 756 675 12616 ...
## $ 1988 : num [1:264] 9765 NA 792 653 14304 ...
## $ 1989 : num [1:264] 11392 NA 891 698 15166 ...
## $ 1990 : num [1:264] 12307 NA 948 617 18879 ...
## $ 1991 : num [1:264] 13496 NA 866 337 19533 ...
## $ 1992 : num [1:264] 14047 NA 656 201 20548 ...
## $ 1993 : num [1:264] 14937 NA 441 367 16516 ...
## $ 1994 : num [1:264] 16241 NA 329 586 16235 ...
## $ 1995 : num [1:264] 16439 NA 397 751 18461 ...
## $ 1996 : num [1:264] 16586 NA 523 1010 19017 ...
## $ 1997 : num [1:264] 17928 NA 514 717 18353 ...
## $ 1998 : num [1:264] 19078 NA 424 814 18895 ...
## $ 1999 : num [1:264] 19356 NA 388 1033 19262 ...
## $ 2000 : num [1:264] 20621 NA 557 1127 21854 ...
## $ 2001 : num [1:264] 20669 NA 527 1282 22972 ...
## $ 2002 : num [1:264] 20437 179 872 1425 25067 ...
## $ 2003 : num [1:264] 20834 191 983 1846 32272 ...
## $ 2004 : num [1:264] 22570 211 1256 2374 37969 ...
## $ 2005 : num [1:264] 23300 242 1902 2674 40066 ...
## $ 2006 : num [1:264] 24045 264 2600 2973 42676 ...
## $ 2007 : num [1:264] 25835 360 3122 3595 47804 ...
## $ 2008 : num [1:264] 27085 365 4081 4371 48718 ...
## $ 2009 : num [1:264] 24630 438 3123 4114 43503 ...
## $ 2010 : num [1:264] 23513 543 3588 4094 40853 ...
## $ 2011 : num [1:264] 24986 591 4615 4437 43335 ...
## $ 2012 : num [1:264] 24714 642 5100 4248 38686 ...
## $ 2013 : num [1:264] 26189 637 5255 4413 39539 ...
## $ 2014 : num [1:264] 26648 614 5408 4579 41304 ...
## $ 2015 : num [1:264] 27981 578 4167 3953 35763 ...
## $ 2016 : num [1:264] 28281 547 3506 4124 37475 ...
## $ 2017 : num [1:264] 29008 556 4096 4531 38963 ...
## $ 2018 : num [1:264] NA 524 3290 5284 41793 ...
## $ 2019 : num [1:264] NA 502 2974 5353 40886 ...
str(suicidestat)
## tibble [43,776 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:43776] "Albania" "Albania" "Albania" "Albania" ...
## $ year : num [1:43776] 1985 1985 1985 1985 1985 ...
## $ sex : chr [1:43776] "female" "female" "female" "female" ...
## $ age : chr [1:43776] "15-24 years" "25-34 years" "35-54 years" "5-14 years" ...
## $ suicides_no: num [1:43776] NA NA NA NA NA NA NA NA NA NA ...
## $ population : num [1:43776] 277900 246800 267500 298300 138700 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double()
## .. )
#Step 4
head(gdp)
head(suicidestat)
#Step 5
gdpset <- gdp %>% select(`Country Name`, c("1979":"2016")) #Selecting the relevant variables
head(gdpset)
To ensure a successful/proper join, we will have to first tidy the data to create the common key to merge.
A data is deemed as tidy if:
1. Each Variable have its own column
2. Each Observation has its own row
3. Each Value has its own cell
Based on the criteria of tidy data, the suicide data is tidy. However, the gdpset data is not. The gdpset data is untidy because the column headers are values and not variable names (variables 1979 to 2016 do not have its own column (it is not in one column) and the variable GDP Per Capita is spread over the 37 columns). Thus the wide data frame should be converted in to a long format using the gather() function by converting the “1979”….“2016” columns into “Years” and its values as “GDP Per Capita”.
Now that both data are tidied, we will merge both data sets using an inner join. An inner join is used because to determine if a country’s standard of living would impact its suicide rates, it would be required to retain only rows that are in both data sets.
#Steps Taken to ensure a tidy data:
gdpset<- gdpset %>% gather(c("1979":"2016"), key = "Year", value = "GDP Per Capita")
head(gdpset)
## Steps Taken to merge both data sets using inner join:
##Step 1
gdpset <- gdpset %>% rename(country=`Country Name`, year= Year)
##Step 2
class(gdpset$country) #Class = Character
## [1] "character"
class(suicidestat$country) #Class = Character
## [1] "character"
class(gdpset$year) #Class = Chracter
## [1] "character"
class(suicidestat$year) #Class = Numeric
## [1] "numeric"
typeof(suicidestat$year) #Type of Numeric = Double
## [1] "double"
##Step 3
gdpset$year <- as.integer(gdpset$year)
suicidestat$year <- as.integer(suicidestat$year)
##Step 4
class(gdpset$year) #Class = Integer
## [1] "integer"
class(suicidestat$year) #Class = Integer
## [1] "integer"
##Step 5
suicideGDP <- suicidestat %>% inner_join(gdpset, by= c("country", "year" )) %>% filter(suicides_no!=0)
head(suicideGDP, 10)
Now that the data sets are joined, we will have to check the attributes in the data and apply proper data type conversions.
The results of the attributes() function is checked using the console and is not included in the report as the data set is too large to fit in the report. However, the code is displayed for reference.
After checking the structure of the data types, we can see that these variables are not in the correct format: sex, age, suicide_no and population.
##Checking the attributes in the data:
dim(suicideGDP)
## [1] 28052 7
#attributes(suicideGDP)
str(suicideGDP)
## tibble [28,052 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:28052] "Albania" "Albania" "Albania" "Albania" ...
## $ year : int [1:28052] 1987 1987 1987 1987 1987 1987 1987 1987 1987 1988 ...
## $ sex : chr [1:28052] "female" "female" "female" "female" ...
## $ age : chr [1:28052] "15-24 years" "25-34 years" "35-54 years" "75+ years" ...
## $ suicides_no : num [1:28052] 14 4 6 1 21 9 16 1 1 8 ...
## $ population : num [1:28052] 289700 257200 278800 35600 312900 ...
## $ GDP Per Capita: num [1:28052] 675 675 675 675 675 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double()
## .. )
##Applying proper data conversion
##Step 1
suicideGDP$suicides_no <- as.integer(suicideGDP$suicides_no)
suicideGDP$population <- as.integer(suicideGDP$population)
##Step 2
suicideGDP$sex <- factor(suicideGDP$sex, levels= c("male", "female"))
suicideGDP$age <- factor(suicideGDP$age, levels = c("5-14 years", "15-24 years","25-34 years", "35-54 years", "55-74 years", "75+ years" ))
##Step 3
levels(suicideGDP$sex)
## [1] "male" "female"
levels(suicideGDP$age)
## [1] "5-14 years" "15-24 years" "25-34 years" "35-54 years" "55-74 years"
## [6] "75+ years"
str(suicideGDP)
## tibble [28,052 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country : chr [1:28052] "Albania" "Albania" "Albania" "Albania" ...
## $ year : int [1:28052] 1987 1987 1987 1987 1987 1987 1987 1987 1987 1988 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 1 1 1 2 ...
## $ age : Factor w/ 6 levels "5-14 years","15-24 years",..: 2 3 4 6 2 3 4 5 6 2 ...
## $ suicides_no : int [1:28052] 14 4 6 1 21 9 16 1 1 8 ...
## $ population : int [1:28052] 289700 257200 278800 35600 312900 274300 308000 137500 21800 295600 ...
## $ GDP Per Capita: num [1:28052] 675 675 675 675 675 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double()
## .. )
head(suicideGDP)
As different countries have different population, the suicide_no cannot be compared directly, therefore it is important to convert the suicide_no in to rates. This is done so by dividing suicide_no by the population. As the results will be very small, we will determine the suicide rate per 100,000 people instead. The mutate() function from the dplyr package is used for this operation.
#step 1
suicideGDP <- suicideGDP %>% mutate(Suicide_Rate = suicides_no/population*100000)
#Step 2
head(suicideGDP)
To prevent the results from getting too long, the which(is.na()) function is done in the console to determine the location of the missing values in each column.
# This is the R chunk for the Scan I
#Step 1
sum(is.na(suicideGDP))
## [1] 5687
##Step 2
colSums(is.na(suicideGDP))
## country year sex age suicides_no
## 0 0 0 0 0
## population GDP Per Capita Suicide_Rate
## 1962 1763 1962
##Step 3: user defined function to check for special values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sum(unlist(sapply(suicideGDP, is.special))) #Total sum of special values = 0
## [1] 0
The missing values are found to be from the variables:
* population: 1962
* GDP Per Capita: 1763
* Suicide_Rate: 1962
The variables that are required for the analysis would include: country, year, sex, age, GDP Per Capital and suicide rate. There are no missing values in variables country, year, sex and age, therefore no actions are needed to be taken for these variables.
On the other hand, the Suicide_Rate is computed using suicides_no/population, therefore in each row, if either suicides_no or population are missing, the resulting Suicide_Rate would also turn out to be NA. Both suicides_no and population is specific to the particular country by year, age and gender. The missing datas are usually missing as a whole by country. (eg:All population data of Zimbabwe is missing). Also, the missing GDP Per Capita is specific to each country. The missing GDP Per Capita is NA for the whole column of the specific country (eg: Uzbekistan do not possess any GDP Per Capita data).
Therefore, as the suicide_no, GDP Per Capital and thus Suicide_Rates are highly sensitive data, and the analysis would require data of both the GDP Per Capita and Suicide_Rate, it was opted to omit the rows with missing data.
# This is the R chunk for the Scan II
suicideGDP <- na.omit(suicideGDP)
head(suicideGDP)
The ColSums() function together with the is.na() function is used to check the data set for missing values again. This time, all missing values have been handled and therefore each rows are cleared of missing values.
colSums(is.na(suicideGDP))
## country year sex age suicides_no
## 0 0 0 0 0
## population GDP Per Capita Suicide_Rate
## 0 0 0
After dealing with missing values, special values and obvious errors, we will have to scan the numeric data for outliers.
Although outliers are observed, however, these outliers should not be dropped or handled as they are legitimate observations. These outliers are not results of data entry errors, measurement errors, experimental errors or intentional errors. The suicide rate is a highly sensitive data and should not be remove, impute or capped as they might provide valuable information or insights to the analyst and is therefore chosen to be kept for further investigation.
# Step 1
hist(suicideGDP$Suicide_Rate, main="Distribution of Suicide Rate", xlab = "Suicide Rate")
hist(suicideGDP$`GDP Per Capita`, main="Distribution of GDP Per Capita", xlab = "GDP Per Capita")
# Step 2: investigate using univariate - Turkey's method of outlier detection
par(mfrow=c(1,2))
boxplot(suicideGDP$Suicide_Rate, main="Box Plot of Suicide_Rate", ylab="Suicide_Rate", col = "grey")
boxplot(suicideGDP$`GDP Per Capita`, main="Box Plot of GDP per Capita", ylab="GDP Per Capita", col = "grey")
# Step 3: Multivariate Outlier Detection Method
par(mfrow=c(1,1))
suicideGDP %>% plot(Suicide_Rate~`GDP Per Capita`, data = ., main="Boxplot of Suicide Rate", ylab = "Suicide Rate per 100,000", xlab = "GDP Per Capita", col="blue")
Before doing an analysis, the data will be transformed before it is used for modeling. Usually, a symmetrical distribution will be preferred over a skewed distribution. Multiple statistical analysis techniques also requires a normal distribution and the homogeneity of variance.
# This is the R chunk for the Transform Section
#Step 1:
par(mfrow=c(1,2))
hist(suicideGDP$Suicide_Rate, main="Distribution of Suicide_Rate", xlab="Suicide_Rate", col = "grey")
hist(suicideGDP$`GDP Per Capita`, main="Distribution of GDP Per Capita", xlab="GDP Per Capita", col = "grey")
log10Suicide_Rate <- log10(suicideGDP$Suicide_Rate)
log10GDP <- log10(suicideGDP$`GDP Per Capita`)
logSuicide_Rate <- log(suicideGDP$Suicide_Rate)
loggdp <- log(suicideGDP$`GDP Per Capita`)
par(mfrow=c(2,2))
hist(log10Suicide_Rate, main = "Distribution of log10 Suicide_Rate", xlab ="log10 Suicide_Rate" )
hist(log10GDP ,main = "Distribution of log10 GDP Per Capita", xlab = "log10 GDP Per Capita")
hist(logSuicide_Rate, main = "Distribution of log Suicide_Rate", xlab = "log Suicide_Rate" )
hist(loggdp, main = "Distribution of log GDP Per Capita", xlab = "log GDP Per Capita")
par(mfrow=c(2,2))
qqnorm(suicideGDP$Suicide_Rate, main = "QQ Plot of Suicide_Rate")
qqline(suicideGDP$Suicide_Rate, col = 'red')
qqnorm(suicideGDP$`GDP Per Capita`, main = "QQ Plot of GDP per Capita")
qqline(suicideGDP$`GDP Per Capita`, col = 'red')
qqnorm(log10Suicide_Rate, main = "QQ Plot of Log10 Suicide_Rate")
qqline(log10Suicide_Rate, col = 'red')
qqnorm(log10GDP, main = "QQ Plot of Log10 GDP per Capita ")
qqline(log10GDP, col = 'red')
suicideGDP$Suicide_Rate <- log10(suicideGDP$Suicide_Rate)
suicideGDP$`GDP Per Capita` <- log10(suicideGDP$`GDP Per Capita`)
Dolgun, A 2020, ‘Module 4 Tidy and Manipulate: Tidy Data Principles and Manipulating Data’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_04.html.
Dolgun, A 2020, ‘Module 5 Scan: Missing Values’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_05.html.
Dolgun, A 2020, ‘Scan: Outliers’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_06.html.
Dolgun, A 2020, ‘Transform: Data Transformation, Standardisation, and Reduction’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_07.html.
Kaggle.com. 2020. WHO Suicide Statistics. [online] Available at: https://www.kaggle.com/szamil/who-suicide-statistics?select=who_suicide_statistics.csv [Accessed 12 October 2020].
The World Bank. 2020. GDP Per Capita (Current US$). [online] Available at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD [Accessed 12 October 2020].
World Health Organization. 2020. Suicide Across The World (2016). [online] Available at: https://www.who.int/mental_health/prevention/suicide/suicideprevent/en/ [Accessed 1 October 2020].