Required packages

The following are the required packaged use for the preprocessing of the data sets:

library(readr)
library(readxl)
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr) 

Executive Summary

Suicide is a global phenomenon which accounts for 1.4% of all deaths worldwide (Suicide across the world (2016), 2020). One person dies from it every 40 seconds (Suicide across the world (2016), 2020).

This reports aims to prepare a set of data to determine if there is a correlation between suicide rates and a country’s standard of living.

As real world data are commonly noisy and incomplete, the data have to be preprocess before any statistical analysis can be done. These steps are taken to preprocess the data:

  1. GET: Two data sets (WHO Suicide Statistics and GDP per Capita) are sourced and imported using the read_excel() and read_csv() function respectively.
  2. Since not all attributes are required for the analysis, the select() function is use to extract the relevant variables from the GDP per Capita data.
  3. TIDY AND MANIPULATE(I): Before the data sets are merged, they are tidied & Manipulated by transforming them from a wide to a long format to create a common key to merge.
  4. UNDERSTAND: The attributes, structure and dimension of the data are checked and applied proper data type conversions. The data have to be understood and ensured to be in the right format before any type of preprocessing/analysis can be done.
  5. TIDY AND MANIPULATE (II): A new variable, Suicide_Rate is created to aid us in the analysis.
  6. SCAN(I): The data is scanned for missing values, special values and obvious errors. As the data values are highly sensitive to each country by year, gender and age, it was opted to omit the rows with missing data.
  7. SCAN(II): Next, the data is scanned for outliers using both Turkey’s and the scatter plot method. Although outliers are observed, they should not be dropped or handled as they are legitimate observations.
  8. TRANSFORM: A normal distribution is preferred in many statistical analysis. The distribution of the variables are checked using a histogram and are transformed to an approximate normal distribution using the log10 transformation.

Data

According to the World Health Organization(WHO), one person die from suicide every 40 seconds, this numbers adds up close to 800,000 suicidal deaths in a year (Suicide across the world (2016), 2020).

The WHO further reports that the suicidal rates have increased by 60% worldwide in the last 45 years. This global phenomenon have caused pain to hundreds of thousands of people around the world and should not be overlooked.

The GDP per capita is used to measure a country’s standard of living. The following two sets of data are use to determine if a country’s standard of living would impact its suicide rates:

Data 1

WHO Suicide Statistics: https://www.kaggle.com/szamil/who-suicide-statistics?select=who_suicide_statistics.csv
The WHO Suicide statistics is a csv file which contains the number of suicides by country, year, sex and age groups.

It contains 6 variables with 43,776 observations. The 6 variables are as follows:
* Country: 141 countries where the suicide statistics are measured
* Year: The year of the observation (1979-2016)
* Sex: Male or Female
* Age: 6 unique age group (5-14 Years, 15-24 Years, 25-34 Years, 35-54 Years, 55-74 Years, 75+ Years)
* Suicide_No: Number of suicides
* Population: Population by country, sex and age group

Data 2

GDP per Capita Data: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
The GDP per capita Data is a XLS file which contains the GDP per Capita data of 264 countries.

It contains 64 Variables with 264 observations. The 64 variables are as follows:
* Country Name: Name of the 264 observed countries
* Country Code: Short alphabetic codes to represent countries. (Not used in final dataset)
* Indicator Name: The indicator used is the GDP per capita (current US$) for all 264 countries (Not used in final dataset)
* Variable 4-64 (Years from 1960 to 2019): GDP per capita in USD of the particular year

Steps Taken:
  1. The function read_csv() function of the readr package is use to read the Suicide data. This data is stored as suicidestat.
  2. The function read_excel() function of the readxl package is used to read the GDP per Capital Data. The data is stored as gdp.
  3. The data structure of each data set are checked and inspected using the str() function.
  4. The data, gdp and suicidestat are displayed using the head() function
  5. Before the data sets can be joined, we will have to extract the relevant data from the 2 data sets and the data should be tidied to ensure that the join will merge accurately. Since the WHO Suicide Statistics data only contain information from 1979-2016, the GDP data should also reflect only these relevant years. The Select function is use to extract these the relevant variables from the GDP data and is stored as gdpset. The gdpset data is displayed using the head() function.
  6. Before the data can be merged, it should be tidied to create a common key to merge.

The tidying and merging of data will be shown at the next section: Tidy & Manipulate Data I

#Step 1
gdp <- read_excel("API_NY.GDP.PCAP.CD_DS2_en_excel_v2_1345208.xls", sheet = 1, skip = 3)

#Step 2
suicidestat <- read_csv("who_suicide_statistics 2.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double()
## )
#Step 3
str(gdp)
## tibble [264 × 64] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
##  $ Indicator Name: chr [1:264] "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" "GDP per capita (current US$)" ...
##  $ Indicator Code: chr [1:264] "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" "NY.GDP.PCAP.CD" ...
##  $ 1960          : num [1:264] NA 59.8 NA NA NA ...
##  $ 1961          : num [1:264] NA 59.9 NA NA NA ...
##  $ 1962          : num [1:264] NA 58.5 NA NA NA ...
##  $ 1963          : num [1:264] NA 78.7 NA NA NA ...
##  $ 1964          : num [1:264] NA 82.1 NA NA NA ...
##  $ 1965          : num [1:264] NA 101 NA NA NA ...
##  $ 1966          : num [1:264] NA 138 NA NA NA ...
##  $ 1967          : num [1:264] NA 161 NA NA NA ...
##  $ 1968          : num [1:264] NA 129 NA NA NA ...
##  $ 1969          : num [1:264] NA 129 NA NA NA ...
##  $ 1970          : num [1:264] NA 157 NA NA 3239 ...
##  $ 1971          : num [1:264] NA 160 NA NA 3498 ...
##  $ 1972          : num [1:264] NA 135 NA NA 4217 ...
##  $ 1973          : num [1:264] NA 143 NA NA 5342 ...
##  $ 1974          : num [1:264] NA 174 NA NA 6320 ...
##  $ 1975          : num [1:264] NA 187 NA NA 7169 ...
##  $ 1976          : num [1:264] NA 197 NA NA 7152 ...
##  $ 1977          : num [1:264] NA 224 NA NA 7751 ...
##  $ 1978          : num [1:264] NA 247 NA NA 9130 ...
##  $ 1979          : num [1:264] NA 276 NA NA 11821 ...
##  $ 1980          : num [1:264] NA 273 711 NA 12377 ...
##  $ 1981          : num [1:264] NA 264 642 NA 10372 ...
##  $ 1982          : num [1:264] NA NA 620 NA 9610 ...
##  $ 1983          : num [1:264] NA NA 623 NA 8023 ...
##  $ 1984          : num [1:264] NA NA 638 639 7729 ...
##  $ 1985          : num [1:264] NA NA 758 640 7774 ...
##  $ 1986          : num [1:264] 6473 NA 685 694 10362 ...
##  $ 1987          : num [1:264] 7886 NA 756 675 12616 ...
##  $ 1988          : num [1:264] 9765 NA 792 653 14304 ...
##  $ 1989          : num [1:264] 11392 NA 891 698 15166 ...
##  $ 1990          : num [1:264] 12307 NA 948 617 18879 ...
##  $ 1991          : num [1:264] 13496 NA 866 337 19533 ...
##  $ 1992          : num [1:264] 14047 NA 656 201 20548 ...
##  $ 1993          : num [1:264] 14937 NA 441 367 16516 ...
##  $ 1994          : num [1:264] 16241 NA 329 586 16235 ...
##  $ 1995          : num [1:264] 16439 NA 397 751 18461 ...
##  $ 1996          : num [1:264] 16586 NA 523 1010 19017 ...
##  $ 1997          : num [1:264] 17928 NA 514 717 18353 ...
##  $ 1998          : num [1:264] 19078 NA 424 814 18895 ...
##  $ 1999          : num [1:264] 19356 NA 388 1033 19262 ...
##  $ 2000          : num [1:264] 20621 NA 557 1127 21854 ...
##  $ 2001          : num [1:264] 20669 NA 527 1282 22972 ...
##  $ 2002          : num [1:264] 20437 179 872 1425 25067 ...
##  $ 2003          : num [1:264] 20834 191 983 1846 32272 ...
##  $ 2004          : num [1:264] 22570 211 1256 2374 37969 ...
##  $ 2005          : num [1:264] 23300 242 1902 2674 40066 ...
##  $ 2006          : num [1:264] 24045 264 2600 2973 42676 ...
##  $ 2007          : num [1:264] 25835 360 3122 3595 47804 ...
##  $ 2008          : num [1:264] 27085 365 4081 4371 48718 ...
##  $ 2009          : num [1:264] 24630 438 3123 4114 43503 ...
##  $ 2010          : num [1:264] 23513 543 3588 4094 40853 ...
##  $ 2011          : num [1:264] 24986 591 4615 4437 43335 ...
##  $ 2012          : num [1:264] 24714 642 5100 4248 38686 ...
##  $ 2013          : num [1:264] 26189 637 5255 4413 39539 ...
##  $ 2014          : num [1:264] 26648 614 5408 4579 41304 ...
##  $ 2015          : num [1:264] 27981 578 4167 3953 35763 ...
##  $ 2016          : num [1:264] 28281 547 3506 4124 37475 ...
##  $ 2017          : num [1:264] 29008 556 4096 4531 38963 ...
##  $ 2018          : num [1:264] NA 524 3290 5284 41793 ...
##  $ 2019          : num [1:264] NA 502 2974 5353 40886 ...
str(suicidestat)
## tibble [43,776 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country    : chr [1:43776] "Albania" "Albania" "Albania" "Albania" ...
##  $ year       : num [1:43776] 1985 1985 1985 1985 1985 ...
##  $ sex        : chr [1:43776] "female" "female" "female" "female" ...
##  $ age        : chr [1:43776] "15-24 years" "25-34 years" "35-54 years" "5-14 years" ...
##  $ suicides_no: num [1:43776] NA NA NA NA NA NA NA NA NA NA ...
##  $ population : num [1:43776] 277900 246800 267500 298300 138700 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double()
##   .. )
#Step 4
head(gdp) 
head(suicidestat)
#Step 5
gdpset <- gdp %>% select(`Country Name`, c("1979":"2016")) #Selecting the relevant variables
head(gdpset) 

Tidy & Manipulate Data I

Tidy Data

To ensure a successful/proper join, we will have to first tidy the data to create the common key to merge.

A data is deemed as tidy if:
1. Each Variable have its own column
2. Each Observation has its own row
3. Each Value has its own cell

Based on the criteria of tidy data, the suicide data is tidy. However, the gdpset data is not. The gdpset data is untidy because the column headers are values and not variable names (variables 1979 to 2016 do not have its own column (it is not in one column) and the variable GDP Per Capita is spread over the 37 columns). Thus the wide data frame should be converted in to a long format using the gather() function by converting the “1979”….“2016” columns into “Years” and its values as “GDP Per Capita”.

Steps Taken to ensure a tidy data:
  1. The gather() function from the tidyr package is used to transform the data from a wide to the long format.

Merge Data

Now that both data are tidied, we will merge both data sets using an inner join. An inner join is used because to determine if a country’s standard of living would impact its suicide rates, it would be required to retain only rows that are in both data sets.

Steps Taken to merge both data sets using inner join:
  1. Renaming the variables Year into year and Country Name into country in the gdpset data set to create a common key name with the suicidestat data set. Both Year and Country Name will be used as keys to merge both data sets.
  2. Check the types of variables and data structures, to ensure that the keys used to join the data sets are of compatible types. The class() function is use to check both the country and year variable from both data sets to ensure they are of compatible types.
  3. The country variable in both gdpset and suicidestat data set are in the correct format, they both have class Character. However, the class for year in the gdpset is denoted as character, while that of the sucidestat set is denoted as numeric with type Double. The as.integer() function will be use to convert the year variable from both the data sets into a integer type.
  4. The class of the year variables from both data steps are checked again to ensure that they are of integer type.
  5. Both data sets are now joined using the inner_join function from the dplyr package. These data sets are joined with the common key name, “country” and “year” and is stored as suicideGDP. As the data is prepared to study if a country’s standard of living would impact its suicide rates, therefore we will remove observations where suicides_no = 0. This is done using the filter() function of the dplyr package and the outcome is stored as suicideGDP.
#Steps Taken to ensure a tidy data:
gdpset<- gdpset %>% gather(c("1979":"2016"), key = "Year", value = "GDP Per Capita")
head(gdpset)
## Steps Taken to merge both data sets using inner join:
##Step 1
gdpset <- gdpset %>% rename(country=`Country Name`, year= Year)

##Step 2
class(gdpset$country) #Class = Character
## [1] "character"
class(suicidestat$country) #Class = Character
## [1] "character"
class(gdpset$year) #Class = Chracter
## [1] "character"
class(suicidestat$year) #Class = Numeric
## [1] "numeric"
typeof(suicidestat$year) #Type of Numeric = Double 
## [1] "double"
##Step 3
gdpset$year <- as.integer(gdpset$year)
suicidestat$year <- as.integer(suicidestat$year)

##Step 4
class(gdpset$year) #Class = Integer
## [1] "integer"
class(suicidestat$year) #Class = Integer
## [1] "integer"
##Step 5
suicideGDP <- suicidestat %>% inner_join(gdpset, by= c("country", "year" )) %>% filter(suicides_no!=0)
head(suicideGDP, 10)

Understand

Now that the data sets are joined, we will have to check the attributes in the data and apply proper data type conversions.

Checking the attributes in the data:

  1. Checking the dimensions of the data frame using dim() function gives 28052 observations and 7 variables.
  2. Using the str() and attributes() function we have summarized and checked the types of variables in the data set.

The results of the attributes() function is checked using the console and is not included in the report as the data set is too large to fit in the report. However, the code is displayed for reference.

Applying proper data conversion

After checking the structure of the data types, we can see that these variables are not in the correct format: sex, age, suicide_no and population.

The following steps are take to ensure proper data type conversion:
  1. The suicide_no and population variable should be in integer format as they should be in whole numbers. The as.integer() function is used to convert the data type from a double to an integer.
  2. The sex and age variables are to be converted to a factor as they are categorical variables. The sex variable have 2 specific categories, “female” and “male”. On the other hand, the age variable has 6 unique categories, 5-14 years, 15-24 years, 25-34 years, 35-54 years, 55-74 years and 75+ years. The factor() function is used to convert the data type from a character to a factor type.
  3. After the conversion has been made, the levels of the factor variables,sex and age, are checked and the overall structure of the variables are checked again to ensure to be in the right format.
##Checking the attributes in the data:
dim(suicideGDP)
## [1] 28052     7
#attributes(suicideGDP)
str(suicideGDP) 
## tibble [28,052 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country       : chr [1:28052] "Albania" "Albania" "Albania" "Albania" ...
##  $ year          : int [1:28052] 1987 1987 1987 1987 1987 1987 1987 1987 1987 1988 ...
##  $ sex           : chr [1:28052] "female" "female" "female" "female" ...
##  $ age           : chr [1:28052] "15-24 years" "25-34 years" "35-54 years" "75+ years" ...
##  $ suicides_no   : num [1:28052] 14 4 6 1 21 9 16 1 1 8 ...
##  $ population    : num [1:28052] 289700 257200 278800 35600 312900 ...
##  $ GDP Per Capita: num [1:28052] 675 675 675 675 675 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double()
##   .. )
##Applying proper data conversion
##Step 1
suicideGDP$suicides_no <- as.integer(suicideGDP$suicides_no)
suicideGDP$population <- as.integer(suicideGDP$population)

##Step 2
suicideGDP$sex <- factor(suicideGDP$sex, levels= c("male", "female"))
suicideGDP$age <- factor(suicideGDP$age, levels = c("5-14 years", "15-24 years","25-34 years", "35-54 years", "55-74 years", "75+ years" ))

##Step 3
levels(suicideGDP$sex)
## [1] "male"   "female"
levels(suicideGDP$age)
## [1] "5-14 years"  "15-24 years" "25-34 years" "35-54 years" "55-74 years"
## [6] "75+ years"
str(suicideGDP) 
## tibble [28,052 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country       : chr [1:28052] "Albania" "Albania" "Albania" "Albania" ...
##  $ year          : int [1:28052] 1987 1987 1987 1987 1987 1987 1987 1987 1987 1988 ...
##  $ sex           : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 1 1 1 2 ...
##  $ age           : Factor w/ 6 levels "5-14 years","15-24 years",..: 2 3 4 6 2 3 4 5 6 2 ...
##  $ suicides_no   : int [1:28052] 14 4 6 1 21 9 16 1 1 8 ...
##  $ population    : int [1:28052] 289700 257200 278800 35600 312900 274300 308000 137500 21800 295600 ...
##  $ GDP Per Capita: num [1:28052] 675 675 675 675 675 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double()
##   .. )
head(suicideGDP)

Tidy & Manipulate Data II

As different countries have different population, the suicide_no cannot be compared directly, therefore it is important to convert the suicide_no in to rates. This is done so by dividing suicide_no by the population. As the results will be very small, we will determine the suicide rate per 100,000 people instead. The mutate() function from the dplyr package is used for this operation.

The steps are as follows:

  1. The mutate() function is used to add a new variable Suicide_Rate.
  2. The data is displayed using the head() function
#step 1
suicideGDP <- suicideGDP %>% mutate(Suicide_Rate = suicides_no/population*100000)
#Step 2
head(suicideGDP)

Scan I

To scan the data for missing values:

  1. The sum(), together with the is.na() function is used to determine the total sum of NAs within the dataframe. This resulted in having a total of 5687 missing values.
  2. The colSums() together with the is.na() function is then used to determine the total missing values in each column. This resulted in knowing that the missing values come from the variables, population: 1962, GDP Per Capita: 1763 and Suicide_Rate: 1962
  3. The data is then checked for special values using a user defined function called is.special. This function checks every numerical column for any infinite or NaN values and returns a logical value. The is.special function is used together with the sum() function to determine the total sum of special values. The results show that no special values are detected from the data.

To prevent the results from getting too long, the which(is.na()) function is done in the console to determine the location of the missing values in each column.

# This is the R chunk for the Scan I
#Step 1
sum(is.na(suicideGDP)) 
## [1] 5687
##Step 2
colSums(is.na(suicideGDP)) 
##        country           year            sex            age    suicides_no 
##              0              0              0              0              0 
##     population GDP Per Capita   Suicide_Rate 
##           1962           1763           1962
##Step 3: user defined function to check for special values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sum(unlist(sapply(suicideGDP, is.special))) #Total sum of special values = 0 
## [1] 0

The missing values are found to be from the variables:
* population: 1962
* GDP Per Capita: 1763
* Suicide_Rate: 1962

The variables that are required for the analysis would include: country, year, sex, age, GDP Per Capital and suicide rate. There are no missing values in variables country, year, sex and age, therefore no actions are needed to be taken for these variables.

On the other hand, the Suicide_Rate is computed using suicides_no/population, therefore in each row, if either suicides_no or population are missing, the resulting Suicide_Rate would also turn out to be NA. Both suicides_no and population is specific to the particular country by year, age and gender. The missing datas are usually missing as a whole by country. (eg:All population data of Zimbabwe is missing). Also, the missing GDP Per Capita is specific to each country. The missing GDP Per Capita is NA for the whole column of the specific country (eg: Uzbekistan do not possess any GDP Per Capita data).

Therefore, as the suicide_no, GDP Per Capital and thus Suicide_Rates are highly sensitive data, and the analysis would require data of both the GDP Per Capita and Suicide_Rate, it was opted to omit the rows with missing data.

# This is the R chunk for the Scan II
suicideGDP <- na.omit(suicideGDP)
head(suicideGDP)

The ColSums() function together with the is.na() function is used to check the data set for missing values again. This time, all missing values have been handled and therefore each rows are cleared of missing values.

colSums(is.na(suicideGDP)) 
##        country           year            sex            age    suicides_no 
##              0              0              0              0              0 
##     population GDP Per Capita   Suicide_Rate 
##              0              0              0

Scan II

After dealing with missing values, special values and obvious errors, we will have to scan the numeric data for outliers.

The following are the steps taken and results of the scan for outliers:

  1. The histogram of the Suicide_Rate and GDP Per Capita is displayed.
    • Both histogram can be seen to be non- normal. They are both right-skewed. As the data are non normal, we are not able to use the z-score or the Mahalanobis distance method to detect outliers as they require the data to be approximately normal. Also, as our sample size >5000, the Mahalanobis distance method cannot be used.
  2. We will start off with investigating the univariate outliers by using the Turkey’s method of outlier detection for both Suicide_Rate and GDP Per Capita data.
    • According to the Turkey’s method, both the Suicide_Rate and GDP Per Capita appears to have many outliers.
  3. As both Suicide_Rate and GDP Per Capita data are quantitative variables, we can use the scatter plot method to visualize the relationship between the two variables. This method can be used to detect obvious outliers.
    • According to the scatter plot, there might be some possible outliers at the upper left side of the scatter plot.

Although outliers are observed, however, these outliers should not be dropped or handled as they are legitimate observations. These outliers are not results of data entry errors, measurement errors, experimental errors or intentional errors. The suicide rate is a highly sensitive data and should not be remove, impute or capped as they might provide valuable information or insights to the analyst and is therefore chosen to be kept for further investigation.

# Step 1
hist(suicideGDP$Suicide_Rate, main="Distribution of Suicide Rate", xlab = "Suicide Rate") 

hist(suicideGDP$`GDP Per Capita`, main="Distribution of GDP Per Capita", xlab = "GDP Per Capita") 

# Step 2: investigate using univariate - Turkey's method of outlier detection
par(mfrow=c(1,2))
boxplot(suicideGDP$Suicide_Rate, main="Box Plot of Suicide_Rate", ylab="Suicide_Rate", col = "grey")
boxplot(suicideGDP$`GDP Per Capita`, main="Box Plot of GDP per Capita", ylab="GDP Per Capita", col = "grey")

# Step 3: Multivariate Outlier Detection Method
par(mfrow=c(1,1))
suicideGDP %>% plot(Suicide_Rate~`GDP Per Capita`, data = ., main="Boxplot of Suicide Rate", ylab = "Suicide Rate per 100,000", xlab = "GDP Per Capita", col="blue")

Transform

Before doing an analysis, the data will be transformed before it is used for modeling. Usually, a symmetrical distribution will be preferred over a skewed distribution. Multiple statistical analysis techniques also requires a normal distribution and the homogeneity of variance.

The following steps are taken to decrease the skewness of the distribution and convert the distribution into a normal distribution:
  1. The distribution of Suicide_Rate and GDP Per Capital are checked using the histogram.
    • Both histogram can be seen to be non-normal. They are both right-skewed.
  2. Different Data Transformation method such as the Mathematical Operations methods and the Box-Cox Transformation are tested on the data and the Log transformation(log10 and loge) seemed to provide the best result in transforming the data into an approximately normal distribution. The Log transformation is commonly used to reduce right skewness. Also, the discrepancies of the GDP Per Capital and the population data used to calculate the Suicide_Rate are large between countries, by using the Log transformation, it can compress high values and spread low values, spreading the magnitude of the data.
    • Both log10 and loge are tested on the Suicide_Rate and GDP Per Capita. Looking at the histogram the log10 transformation seemed to perform slightly better and is therefore chosen for the transformation.
  3. A qqplot is used to check the assumption of normality, it plots the distribution of the data against the expected normal distribution.
    • From the qqplot, it could be seen that after the log10 tranformation, both Suicide_Rate and GDP Per Capita are approximately normal.
# This is the R chunk for the Transform Section
#Step 1: 
par(mfrow=c(1,2))
hist(suicideGDP$Suicide_Rate, main="Distribution of Suicide_Rate", xlab="Suicide_Rate", col = "grey")
hist(suicideGDP$`GDP Per Capita`, main="Distribution of GDP Per Capita", xlab="GDP Per Capita", col = "grey")

log10Suicide_Rate <- log10(suicideGDP$Suicide_Rate)
log10GDP <- log10(suicideGDP$`GDP Per Capita`)
logSuicide_Rate <- log(suicideGDP$Suicide_Rate)
loggdp <- log(suicideGDP$`GDP Per Capita`)

par(mfrow=c(2,2))
hist(log10Suicide_Rate, main = "Distribution of log10 Suicide_Rate", xlab ="log10 Suicide_Rate" )
hist(log10GDP ,main = "Distribution of log10 GDP Per Capita", xlab = "log10 GDP Per Capita")
hist(logSuicide_Rate, main = "Distribution of log Suicide_Rate", xlab = "log Suicide_Rate" )
hist(loggdp, main = "Distribution of log GDP Per Capita", xlab = "log GDP Per Capita")

par(mfrow=c(2,2))
qqnorm(suicideGDP$Suicide_Rate, main = "QQ Plot of Suicide_Rate")
qqline(suicideGDP$Suicide_Rate, col = 'red')
qqnorm(suicideGDP$`GDP Per Capita`, main = "QQ Plot of GDP per Capita")
qqline(suicideGDP$`GDP Per Capita`, col = 'red')
qqnorm(log10Suicide_Rate, main = "QQ Plot of Log10 Suicide_Rate")
qqline(log10Suicide_Rate, col = 'red')
qqnorm(log10GDP, main = "QQ Plot of Log10 GDP per Capita ")
qqline(log10GDP, col = 'red')

suicideGDP$Suicide_Rate <- log10(suicideGDP$Suicide_Rate)
suicideGDP$`GDP Per Capita` <- log10(suicideGDP$`GDP Per Capita`)

References

Dolgun, A 2020, ‘Module 4 Tidy and Manipulate: Tidy Data Principles and Manipulating Data’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_04.html.

Dolgun, A 2020, ‘Module 5 Scan: Missing Values’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_05.html.

Dolgun, A 2020, ‘Scan: Outliers’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_06.html.

Dolgun, A 2020, ‘Transform: Data Transformation, Standardisation, and Reduction’, Module notes, MATH2349, RMIT University, viewed 6 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_07.html.

Kaggle.com. 2020. WHO Suicide Statistics. [online] Available at: https://www.kaggle.com/szamil/who-suicide-statistics?select=who_suicide_statistics.csv [Accessed 12 October 2020].

The World Bank. 2020. GDP Per Capita (Current US$). [online] Available at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD [Accessed 12 October 2020].

World Health Organization. 2020. Suicide Across The World (2016). [online] Available at: https://www.who.int/mental_health/prevention/suicide/suicideprevent/en/ [Accessed 1 October 2020].