Required packages
The packages shown below are used in my report for generating outputs.
# This is the R chunk for the required packages
library(readr)
library(readxl)
library(foreign)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(validate)
library(forecast)
library(stringr)
library(outliers)
library(MASS)
library(ggplot2)
library(knitr)
Executive Summary
We are analyzing the percentage of suicide within Australia over the time period 2009 – 2015. The dataset having the number of death and suicide is taken from the site of Australian Bureau of Statistics and we are analyzing the suicide rate in Australia comparing to the overall death rate in different provinces. The analysis is relevant as if we know the pattern in the suicide rate we can get to know if the state’s suitability for living. As we know the data, we got is raw and it will be noisy. So, before we start with the analysis, we need to do data cleansing. The data we have will include missing values, outliers etc.
We will first read the 2 files, one for the death details and other with suicide rates, using the read_xlsx(). We will then combine both the files using left join and create a single dataset. As now we have a single dataset with the suicide count and death count, we will calculate the percentage of suicide and add it to the data using mutate function. We now have all the details required, we need to check for missing values and outliers. The function is.na() was used to check the missing values, in this case there were no missing values in the dataset. If we had any missing values, we could check for the skewness and then based on that can replace them with median or mean. Next we will check for the outliers. The data has outliers and we need to handle them. We will check if the data is less than the lower fence, if so, it will be replaced by Q1 and if it is greater than the upper fence, we will replace the outliers with Q4 and now our data is ready for analysis.Finaling we can do data transformation to transform complex non-linear relationships into linear relationships.
Data
The datasets are read into R as excel files. There are two datasets used in this report the first one is a list of deaths registered by sex, states and territories and the second datasets intentional self-harm by State and territory of usual residence and Sex, Age-standardized death rates, 2009-2015.The death dataset is imported from “https://www.abs.gov.au/ausstats/abs@.nsf/mf/3105.0.65.001/” and the suicide dataset is imported from “https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3303.02018?OpenDocument”.This is an open source of datasets.
setwd("/Users/joeljvc/Desktop/R ")
The working directory was changed to /Users/joeljvc/Desktop/R inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
getwd()
[1] "/Users/joeljvc/Desktop/R "
death_1 <-read_xls("Total_death.xls",sheet="Table 5.1",skip = 4)
New names:
* `` -> ...1
suicides<-read_xls(" intentional selfharm.xls",sheet="Table 11.6",skip=6)
New names:
* `` -> ...1
* `` -> ...12
* `` -> ...13
Understand
The field and its description is given below:
**Data variables
Data 1 - death_1
1.States - Name of each states in Australia.
2.Year - Time period from 2009-2015.
3.Total number of deaths from 2009-2015 - This shows the total number of deaths.
Data 2 -suicides
1.States -Name of states.
2.Year -Time period from 2009-2015.
3.Number of suicides -Number of Suicides of each states.
death_2<- death_1[-c(1:23,32:55),-c(2:186)]
str(death_2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8 obs. of 8 variables:
$ ...1: chr "NSW" "Vic." "Qld" "SA" ...
$ 2009: num 46974 35640 26316 12468 12566 ...
$ 2010: num 47945 35623 27289 12957 12720 ...
$ 2011: num 50661 36552 27414 12665 12724 ...
$ 2012: num 49314 35760 28300 13178 13339 ...
$ 2013: num 50396 35916 27901 12804 13414 ...
$ 2014: num 52320 38042 28704 13262 13787 ...
$ 2015: num 53600 39904 29782 13647 14448 ...
names(death_2)[names(death_2) == "...1"] <- "States"
suicides_1<- suicides[-c(1:25,34:47),-c(9:13)]
str(suicides_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8 obs. of 8 variables:
$ ...1: chr "NSW" "Vic." "Qld" "SA" ...
$ 2009: chr "623" "576" "525" "185" ...
$ 2010: chr "674" "558" "588" "197" ...
$ 2011: chr "617" "526" "578" "212" ...
$ 2012: chr "727" "514" "631" "198" ...
$ 2013: chr "718" "533" "676" "203" ...
$ 2014: chr "832" "658" "658" "243" ...
$ 2015: chr "839" "678" "761" "234" ...
Tidy & Manipulate Data I
Both of the datasets were untidy and to tidy this dataset, we need to gather the columns into a new pair of variables and with gather() function with the parameters year values 2009-2015, and the column name is given as year, the number of suicides is given as values, since that is the one which spread over the suicide file and total number of deaths is given as values for death files.
suicides_1<-suicides_1 %>%
gather('2009','2010','2011','2012','2013','2014','2015',key = "year",value = "Number of suicides ")
names(suicides_1)[names(suicides_1) == "...1"] <-"States"
death_5<-gather(death_2,'2009','2010','2011','2012','2013','2014','2015', key = "year",value = "Total Number of deaths from 2009-2015")
colnames(death_5)
[1] "States" "year"
[3] "Total Number of deaths from 2009-2015"
##Joining Datasets and Type Conversion
since we have change the datasets to the required format we can join both the datasets using left_join() functions over States and year columns. Using str() we check the columns are correct format and could find that the column “Number of suicides” is in character data type even though it clearly holds numerical data. So we will convert it into numeric using as.numeric().We have the state column having the state details and that can be converted to levels using factor().
x<-left_join(suicides_1,death_5,by= c("States","year"))
str(x)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 56 obs. of 4 variables:
$ States : chr "NSW" "Vic." "Qld" "SA" ...
$ year : chr "2009" "2009" "2009" "2009" ...
$ Number of suicides : chr "623" "576" "525" "185" ...
$ Total Number of deaths from 2009-2015: num 46974 35640 26316 12468 12566 ...
x$`Number of suicides `<- as.numeric(x$`Number of suicides `)
x$States<-factor(x$States,levels=c("NSW","Vic.","Qld","SA","WA","Tas.","NT","ACT"),ordered = TRUE)
levels(x$States)
[1] "NSW" "Vic." "Qld" "SA" "WA" "Tas." "NT" "ACT"
Tidy & Manipulate Data II
1.By using mutate() function we can add the new variable “Percentage of suicide” while preserving the existing variables.
2.The new variable “Percentage of suicide” is created to know the percentage value of both “Number of suicides” and “Total Number of deaths from 2009-2015”.
y<-mutate(x,"Percentage of suicide"=(x$`Number of suicides ` / x$`Total Number of deaths from 2009-2015`)*100)
Scan I
1.With the is.na() functions the missing values is being checked and during the data preprocessing method this datasets has no missing values, special values.
2.Obvious errors or inconsistencies are also checked in the dataset.
colSums(is.na(y))
States year Number of suicides
0 0 0
Total Number of deaths from 2009-2015 Percentage of suicide
0 0
Scan II
As we are done with a handling missing values, we will check for outliers by generating boxplot for the attributes. There are no outliers for “Number of suicides” and “Total number of deaths from 2009-2015”, but we can see that in the column "percentage of suicides there are six outliers which have to be handled. We can replace them using the 5th and 95th quartile based on if the value is less than the lower fence or greater than the upper fence respectively using cap() function.
boxplot(y$`Number of suicides `)

boxplot(y$`Total Number of deaths from 2009-2015`)

z<-boxplot(y$`Percentage of suicide`)

z$out
[1] 3.882476 4.587156 4.564315 4.624277 4.794521 4.099061
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
y$`Percentage of suicide`<-cap(y$`Percentage of suicide`)
boxplot(y$`Percentage of suicide`)

Transform
To further proceed with statistical analysis, data transformation is requisite.
Below is the situation where we might need transformations.
1.For better understanding we may need to change the scale or standardize the values of a variable.
2.We may need to transform complex non-linear relationships into linear relationships.
3.In measurable induction, normal distribution is preferred over skewed distribution. Along these lines, at whatever point we have skewed distribution or heterogeneous of variances, we can use transformations which can decrease skewness or heterogeneity of variances.
4.By using hist() function we can observe “Number of suicides”, “Total Number of deaths from 2009-2015” and “Percentage of suicide” having a right-skewed distribution.
5.We are using boxcox transformation for “Number of suicides” which is an non-normal data and the Box-Cox transformation is a type of power transformation to transform non-normal data into a normal distribution.
6.Another transformation is the square root transformation. It is additionally utilized for decreasing right skewness, and furthermore has the advantage that it can be applied to zero values. The square root transformation is applied to “Total Number of deaths from 2009-2015” and “Percentage of suicide” for reducing right skewness.
hist(y$`Number of suicides `, main = "Histogram of number of suicides", xlab = "Suicides")

hist(y$`Total Number of deaths from 2009-2015`, main = "Total number of deaths", xlab = "Deaths")

hist(y$`Percentage of suicide`, main = "Percentage of suicides",xlab = "Percentage of suicides")

#Boxcox transformation on 'number of suicides'
boxcox_suicides<- BoxCox(y$`Number of suicides `,lambda = "auto")
hist(boxcox_suicides)

#square root transformation on total number of deaths
sqrt_deaths<-sqrt(y$`Total Number of deaths from 2009-2015`)
hist(sqrt_deaths,main = "Histogram after transformation - Total number of deaths", xlab = "Total number of deaths")

# square root transformation on percentage of suicides
sqrt_percentage<-sqrt(y$`Percentage of suicide`)
hist(sqrt_percentage,main = "Histogram after transformation - Percentage of suicide", xlab = "Percentage of suicides")

NA
NA
---
title: "MATH2349 Semester 1, 2020"
author: "JOEL VARGHESE CHERIAN (S3808033)"
subtitle: Assignment 2
output:
  html_notebook: default
---


## Required packages 

The packages shown below are used in my report for generating outputs.

```{r}
# This is the R chunk for the required packages
library(readr)
library(readxl)
library(foreign)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(validate)
library(forecast)
library(stringr)
library(outliers)
library(MASS)
library(ggplot2)
library(knitr) 

```


## Executive Summary 

We are analyzing the percentage of suicide within Australia over the time period 2009 – 2015. The dataset having the number of death and suicide is taken from the site of Australian Bureau of Statistics and we are analyzing the suicide rate in Australia comparing to the overall death rate in different provinces. The analysis is relevant as if we know the pattern in the suicide rate we can get to know if the state’s suitability for living. As we know the data, we got is raw and it will be noisy. So, before we start with the analysis, we need to do data cleansing. The data we have will include missing values, outliers etc. 

We will first read the 2 files, one for the death details and other with suicide rates, using the read_xlsx(). We will then combine both the files using left join and create a single dataset. As now we have a single dataset with the suicide count and death count, we will calculate the percentage of suicide and add it to the data using mutate function. We now have all the details required, we need to check for missing values and outliers. The function is.na() was used to check the missing values, in this case there were no missing values in the dataset. If we had any missing values, we could check for the skewness and then based on that can replace them with median or mean. Next we will check for the outliers. The data has outliers and we need to handle them. We will check if the data is less than the lower fence, if so, it will be replaced by Q1 and if it is greater than the upper fence, we will replace the outliers with Q4 and now our data is ready for analysis.Finaling we can do data transformation to transform complex non-linear relationships into linear relationships.


## Data 

The datasets are read into R as excel files. There are two datasets used in this report the first one is a list of deaths registered by sex, states and territories and the second datasets intentional self-harm by State and territory of usual residence and Sex, Age-standardized death rates, 2009-2015.The death dataset is imported from "https://www.abs.gov.au/ausstats/abs@.nsf/mf/3105.0.65.001/" and the suicide dataset is imported from "https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3303.02018?OpenDocument".This is an open source of datasets.


```{r}

setwd("/Users/joeljvc/Desktop/R ")
getwd()

death_1 <-read_xls("Total_death.xls",sheet="Table 5.1",skip = 4)

suicides<-read_xls(" intentional selfharm.xls",sheet="Table 11.6",skip=6)

```

## Understand 

The field and its description is given below:

**Data variables

Data 1 - death_1          
1.States - Name of each states in Australia.                                                                           
2.Year - Time period from 2009-2015.                                                     
3.Total number of deaths from 2009-2015 - This shows the total number of deaths.                            
                                                  
Data 2 -suicides                                                             
1.States -Name of states.                                                                                       
2.Year -Time period from 2009-2015.                                                  
3.Number of suicides -Number of Suicides of each states.                                                                   


```{r}

death_2<- death_1[-c(1:23,32:55),-c(2:186)]
str(death_2)
names(death_2)[names(death_2) == "...1"] <- "States"

suicides_1<- suicides[-c(1:25,34:47),-c(9:13)]
str(suicides_1)

```

##	Tidy & Manipulate Data I 

Both of the datasets were untidy and to tidy this dataset, we need to gather the columns into a new pair of variables and with gather() function with the parameters year values 2009-2015, and the column name is given as year, the number of suicides is given as values, since that is the one which spread over the suicide file and total number of deaths is given as values for death files.

```{r}
 
suicides_1<-suicides_1 %>%
  gather('2009','2010','2011','2012','2013','2014','2015',key = "year",value = "Number of suicides ")
names(suicides_1)[names(suicides_1) == "...1"] <-"States"

death_5<-gather(death_2,'2009','2010','2011','2012','2013','2014','2015', key = "year",value = "Total Number of deaths from 2009-2015")
colnames(death_5)

```

##Joining Datasets and Type Conversion

since we have change the datasets to the required format we can join both the datasets using left_join() functions over States and year columns. Using str() we check the columns are correct format and could find that the column "Number of suicides" is in character data type even though it clearly holds numerical data. So we will convert it into numeric using as.numeric().We have the state column having the state details and that can be converted to levels using factor().

```{r}
x<-left_join(suicides_1,death_5,by= c("States","year"))
str(x)
x$`Number of suicides `<- as.numeric(x$`Number of suicides `)
x$States<-factor(x$States,levels=c("NSW","Vic.","Qld","SA","WA","Tas.","NT","ACT"),ordered = TRUE)
levels(x$States)

```

##	Tidy & Manipulate Data II 

1.By using mutate() function we can add the new variable "Percentage of suicide" while preserving the existing variables.      
                                                                      
2.The new variable "Percentage of suicide" is created to know the percentage value of both "Number of suicides" and "Total Number of deaths from 2009-2015".                                 

```{r}


y<-mutate(x,"Percentage of suicide"=(x$`Number of suicides ` / x$`Total Number of deaths from 2009-2015`)*100)

```


##	Scan I             
                          
1.With the is.na() functions the missing values is being checked and during the data preprocessing method this datasets has no missing values, special values.                           
                                           
2.Obvious errors or inconsistencies are also checked in the dataset.


```{r}

colSums(is.na(y))

```


##	Scan II
      
As we are done with a handling missing values, we will check for outliers by generating boxplot for the attributes. There are no outliers for "Number of suicides" and "Total number of deaths from 2009-2015", but we can see that in the column "percentage of suicides there are six outliers which have to be handled. We can replace them using the 5th and 95th quartile based on if the value is less than the lower fence or greater than the upper fence respectively using cap() function.           

```{r}

boxplot(y$`Number of suicides `)
boxplot(y$`Total Number of deaths from 2009-2015`)
z<-boxplot(y$`Percentage of suicide`)
z$out

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
y$`Percentage of suicide`<-cap(y$`Percentage of suicide`)
boxplot(y$`Percentage of suicide`)

```


##	Transform 

To further proceed with statistical analysis, data transformation is requisite.                                        
                                          
Below is the situation where we might need transformations.                                       
1.For better understanding we may need to change the scale or standardize the values of a variable.                             
2.We may need to transform complex non-linear relationships into linear relationships.                          
3.In measurable induction, normal distribution is preferred over skewed distribution. Along these lines, at whatever point we have skewed distribution or heterogeneous of variances, we can use transformations which can decrease skewness or heterogeneity of variances.                                                     
4.By using hist() function we can observe "Number of suicides”, “Total Number of deaths from 2009-2015" and "Percentage of suicide" having a right-skewed distribution.                                      
5.We are using boxcox transformation for "Number of suicides" which is an non-normal data and the Box-Cox transformation is a type of power transformation to transform non-normal data into a normal distribution.                                
6.Another transformation is the square root transformation. It is additionally utilized for decreasing right skewness, and furthermore has the advantage that it can be applied to zero values. The square root transformation is applied to "Total Number of deaths from 2009-2015" and "Percentage of suicide" for reducing right skewness.                                           


```{r}

hist(y$`Number of suicides `, main = "Histogram of number of suicides", xlab = "Suicides")
hist(y$`Total Number of deaths from 2009-2015`, main = "Total number of deaths", xlab = "Deaths")
hist(y$`Percentage of suicide`, main = "Percentage of suicides",xlab = "Percentage of suicides")

#Boxcox transformation on 'number of suicides'
boxcox_suicides<- BoxCox(y$`Number of suicides `,lambda = "auto")
hist(boxcox_suicides)

#square root transformation on total number of deaths
sqrt_deaths<-sqrt(y$`Total Number of deaths from 2009-2015`) 
hist(sqrt_deaths,main = "Histogram after transformation - Total number of deaths", xlab = "Total number of deaths")

# square root transformation on percentage of suicides 
sqrt_percentage<-sqrt(y$`Percentage of suicide`)
hist(sqrt_percentage,main = "Histogram after transformation - Percentage of suicide", xlab = "Percentage of suicides")


```

## Reference List

https://www.abs.gov.au/ausstats/abs@.nsf/mf/3105.0.65.001/                                                                      
https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3303.02018?               
http://rare-phoenix-161610.appspot.com/secured/index.html                    

<br>
<br>
