Required packages

# Loading the required packages.

library(readr)
library(tidyr)
library (dplyr)
library(forecast)


Data Description

The data used in this assignment was sourced from the WHO data repository website (https://www.who.int/gho/database/en/).

The first dataset contains the Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%) categorised by country. The variables being: Country, Year, and CHE.

The second dataset contains the Out-of-pocket expenditure (OPE) as percentage of current health expenditure (CHE) (%) categorised by country. The variables being: Country, Year, and OPE.

The datasets were merged in order to find out the Out-of-pocket expenditure (OPE_GDP) as percentage of gross domestic product (GDP) (%). The final variables being: Country, Year, CHE, OPE, OPE_GDP.


Executive Summary

The datasets were read into R and tidied up accordingly such that they can be merged successfully. The country and year values were common attributes of the datasets.

Once they were merged, the class of each variable was checked and modified where necessary. The data was then amended with the calculated OPE_GDP variable. Next, the data was searched for values which were NA (not available) as well as special values (NaN, Inf, -Inf). Furthermore, it was searched for values which were either 0 or 100 (both unusual values for the variables in the dataset). The values were dealt with accordingly.

The OPE_GDP variable was then checked for any outliers; the outliers were capped in order to minimise their influence on the data. Lastly the variable was transformed by applying the square root transformation so that it is close to being normally distributed and thus, ready to be analysed.


Data Importing

# The data was imported and named appropriately.

che_gdp <- read_csv("CHE_GDP.csv", skip = 1)
Parsed with column specification:
cols(
  Country = col_character(),
  `2017` = col_double(),
  `2016` = col_double(),
  `2015` = col_double(),
  `2014` = col_double(),
  `2013` = col_double(),
  `2012` = col_double(),
  `2011` = col_double(),
  `2010` = col_double(),
  `2009` = col_double(),
  `2008` = col_double(),
  `2007` = col_double(),
  `2006` = col_double(),
  `2005` = col_double(),
  `2004` = col_double(),
  `2003` = col_double(),
  `2002` = col_double(),
  `2001` = col_double(),
  `2000` = col_double()
)
head(che_gdp) # Providing an overview of the data using the header function.
ope_che <- read_csv("OPE_CHE.csv", skip = 1)
Parsed with column specification:
cols(
  Country = col_character(),
  `2017` = col_double(),
  `2016` = col_double(),
  `2015` = col_double(),
  `2014` = col_double(),
  `2013` = col_double(),
  `2012` = col_double(),
  `2011` = col_double(),
  `2010` = col_double(),
  `2009` = col_double(),
  `2008` = col_double(),
  `2007` = col_double(),
  `2006` = col_double(),
  `2005` = col_double(),
  `2004` = col_double(),
  `2003` = col_double(),
  `2002` = col_double(),
  `2001` = col_double(),
  `2000` = col_double()
)
head(ope_che)


Tidy & Manipulate Data I

As observed in the output above, the data is untidy. The columns contain the values of the year (which should be a variable) while each row represents several observations and not one. The gather function is used to tidy up the data in both datasets. Once the datasets were tidied, it was possible to merge them.

# Tidying the datasets.
che <- che_gdp %>% gather(2:19, key ="Year", value = "CHE(%)")
head (che)
ope <- ope_che %>% gather(2:19, key ="Year", value = "OPE(%)")
head(ope)
# The datasets were mereged using the full_join() function.
merge = che %>% full_join(ope)
Joining, by = c("Country", "Year")
head(merge)


Understand

It can be observed from the header output of the merged datasets that the data type for the variables country and year is charachter (chr) and the data type for the OPE variable is double (dbl). In my opinion, the year should be stored as a factor since it is categorical variable. Thus, it was converted appropriately.

lapply(merge, class) # Double checking the class of the variables using lappy and class functions.
$Country
[1] "character"

$Year
[1] "character"

$`CHE(%)`
[1] "numeric"

$`OPE(%)`
[1] "numeric"
merge$Year <- as.factor(merge$Year) # Converting the Year variable to the datatype factor.
levels(merge$Year) # Checking the levels of the factor Year. 
 [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010"
[12] "2011" "2012" "2013" "2014" "2015" "2016" "2017"

It can be observed from the outputs above that the conversion was successful. Furthermore, the levels of the variable were found to be in the correct order and do not need to be changed.


Tidy & Manipulate Data II

In order to find out the Out-of-pocket expenditure (OPE_GDP) as percentage of gross domestic product, the Out-of-pocket expenditure (OPE) as percentage of current health expenditure (CHE) was multiplied with the CHE as percentage of the GDP.

# The new column is added to the dataset with the help of the mutate function. 
ope_gdp <- merge %>% mutate(`OPE_GDP(%)` = (`OPE(%)`/100)*`CHE(%)`)
ope_gdp$`OPE_GDP(%)` <- round(ope_gdp$`OPE_GDP(%)`, digits = 2) # Rounding to 2 decimal places.
head(ope_gdp)

It can be observed from the output that the OPE_GDP was successfully calculated an added to the dataset.


Scan I

The data is now scanned for missing values(NAs), special values(NaN, Inf, -Inf) and obvious errors (0,100). It is certain that not all health expenditure is out of pocket nor can it be fully funded, thus the values 0 and 100 are considered as errors.

which(is.na(ope_gdp)) # Scanning for missing values.
  [1]  6937  7006  7028  7127  7189  7196  7218  7317  7379  7386  7507  7569  7576
 [14]  7697  7759  7766  7887  7949  8139  8329  8519  8550  8709  8740  8899  8930
 [27]  9089  9120  9279  9310  9469  9500  9567  9659  9690  9757  9770  9849  9859
 [40]  9880  9881  9947  9960 10039 10049 10070 10071 10137 10150 10229 10239 10260
 [53] 10357 10426 10448 10547 10609 10616 10638 10737 10799 10806 10927 10989 10996
 [66] 11117 11179 11186 11307 11369 11559 11749 11939 11970 12129 12160 12319 12350
 [79] 12509 12540 12699 12730 12889 12920 12987 13079 13110 13177 13190 13269 13279
 [92] 13300 13301 13367 13380 13459 13469 13490 13491 13557 13570 13649 13659 13680
[105] 13777 13846 13868 13967 14029 14036 14058 14157 14219 14226 14347 14409 14416
[118] 14537 14599 14606 14727 14789 14979 15169 15359 15390 15549 15580 15739 15770
[131] 15929 15960 16119 16150 16309 16340 16407 16499 16530 16597 16610 16689 16699
[144] 16720 16721 16787 16800 16879 16889 16910 16911 16977 16990 17069 17079 17100
sum(is.na(ope_gdp))   # Finding the total number of missing values.
[1] 156
sum(is.nan(as.matrix(ope_gdp))) # Finding the total number of NaN values.
[1] 0
sum(is.infinite(as.matrix(ope_gdp))) # Finding the total number of +/- infinite values.
[1] 0
ope_gdp %>% filter(`CHE(%)`== 0 | `CHE(%)` == 100 |`OPE(%)` == 0 | `OPE(%)` == 100) # Searching for inconsistencies.    
ope_gdp <- na.omit(ope_gdp) # Removing all the observations containing missing values.

It is observed from the outputs above that there are 156 missing values. No special values or obvious errors were detected. Since, the data varies by country and by year, it would be difficult to determine a single constant for all the missing values. This also means that the missing values can’t be replaced by the mean or median. Therefore, the rows containing the missing values were removed using the na.omit() function.


Scan II

The OPE_GDP data is now scanned for outliers so that it can be ready to be analysed. To visualize the outliers in the overall data consisting of several countries in the world, we can consider two variables, OPE_GDP and Year. Thus, a bivariate box plot can be used to detect outliers.

ope_gdp %>% plot(`OPE_GDP(%)` ~ Year, data = ., ylab="OPE_GDP(%)", xlab="Year", main="Detecting the Outliers")

It can be observed from the initial box plot that there were several outliers in the data. Thus, these outliers cannot be excluded. It can be understood that this may be because some countries may have a high OPE for health for health expenditure while others might have a very low value depending on their public health systems. However, since the outliers can significantly skew the data, they have been eliminated by Capping.

# Defining the cap function which will be used to cap the outliers.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

ope_gdp$`OPE_GDP(%)`<- ope_gdp$`OPE_GDP(%)` %>% cap() # Applying cap() function to the OPE_GDP values.

ope_gdp %>% plot(`OPE_GDP(%)` ~ Year, data = ., ylab="OPE_GDP(%)", xlab="Year", main="Detecting the Outliers(After Capping)")

It can be observed from the second box plot that the outliers were successfully dealt with.


Transform

The final step of preprocessing the OPE_GDP variable requires for it to achieve normality. It can be viewed from the initial histogram that the variable is slightly skewed to the right. Initially a Box-Cox transformation was applied to normalise the data.

hist(ope_gdp$`OPE_GDP(%)`) # Histogram of OPE_GDP variable.

boxcox<- BoxCox(ope_gdp$`OPE_GDP(%)` ,lambda = "auto") # Applying Box-Cox Transformation.
hist(boxcox) # Histogram of OPE_GDP variable after Box-Cox Transformation.

Upon reflecting on the histogram of the transformed data, it can be seen that it is slightly skewed to the left. In order to find a better fit, the logarithmic transformation (base 10) and the square root transformation were applied to the data.

log_o <- log10(ope_gdp$`OPE_GDP(%)`) # Applying logarithmic transformation.
hist(log_o)


sqrt <- ope_gdp$`OPE_GDP(%)`^(0.5)  # Applying square root transformation.
hist(sqrt)

As observed from the histogram above, the square root transformation was most successful in achieving near normality. It is safe to assume that the data has been pre-proccessed and the OPE_GDP variable is ready to be analysed.


References

1.) Dolgun,A 2020, Modules 1-8 , lecture notes, Math2349, Data Wrangling ,RMIT University.

2.) World Health organisation (WHO) 2020, Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%), Data by country, .csv file, viewed 7 June 2020, < https://apps.who.int/gho/data/node.main.GHEDCHEGDPSHA2011?lang=en >

2.) World Health organisation (WHO) 2020, Out-of-pocket expenditure as percentage of current health expenditure (CHE) (%), Data by country, .csv file, viewed 7 June 2020, < https://apps.who.int/gho/data/view.main.GHEDOOPSCHESHA2011v?lang=en >



---
title: "MATH2349 Semester 1, 2020"
author: "Harshik Dinesh, Student No. S3818758"
subtitle: Assignment 2
output:
  html_notebook: default
---
<br>

## Required packages 

```{r}
# Loading the required packages.

library(readr)
library(tidyr)
library (dplyr)
library(forecast)

```

<br>

## Data Description


The data used in this assignment was sourced from the WHO data repository website (https://www.who.int/gho/database/en/). 

The first dataset contains the Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%) categorised by country. The variables being: Country, Year, and CHE.

The second dataset contains the Out-of-pocket expenditure (OPE) as percentage of current health expenditure (CHE) (%) categorised by country. The variables being: Country, Year, and OPE.

The datasets were merged in order to find out the Out-of-pocket expenditure (OPE_GDP) as percentage of gross domestic product (GDP) (%). The final variables being: Country, Year, CHE, OPE, OPE_GDP.

<br>


## Executive Summary 


The datasets were read into R and tidied up accordingly such that they can be merged successfully. The country and year values were common attributes of the datasets. 

Once they were merged, the class of each variable was checked and modified where necessary. The data was then amended with the calculated OPE_GDP variable. Next, the data was searched for values which were NA (not available) as well as special values (NaN, Inf, -Inf). Furthermore, it was searched for values which were either 0 or 100 (both unusual values for the variables in the dataset). The values were dealt with accordingly. 

The OPE_GDP variable was then checked for any outliers; the outliers were capped in order to minimise their influence on the data. Lastly the variable was transformed by applying the square root transformation so that it is close to being normally distributed and thus, ready to be analysed.


<br>


## Data Importing


```{r}
# The data was imported and named appropriately.

che_gdp <- read_csv("CHE_GDP.csv", skip = 1)
head(che_gdp) # Providing an overview of the data using the header function.
ope_che <- read_csv("OPE_CHE.csv", skip = 1)
head(ope_che)
```
<br>


##	Tidy & Manipulate Data I 

As observed in the output above, the data is untidy. The columns contain the values of the year (which should be a variable) while each row represents several observations and not one. The gather function is used to tidy up the data in both datasets. Once the datasets were tidied, it was possible to merge them.

```{r}
# Tidying the datasets.
che <- che_gdp %>% gather(2:19, key ="Year", value = "CHE(%)")
head (che)
ope <- ope_che %>% gather(2:19, key ="Year", value = "OPE(%)")
head(ope)
# The datasets were mereged using the full_join() function.
merge = che %>% full_join(ope)
head(merge)
```

<br>

## Understand 

It can be observed from the header output of the merged datasets that the data type for the variables country and year is charachter (chr) and the data type for the OPE variable is double (dbl). In my opinion, the year should be stored as a factor since it is categorical variable. Thus, it was converted appropriately.


```{r}
lapply(merge, class) # Double checking the class of the variables using lappy and class functions.
merge$Year <- as.factor(merge$Year) # Converting the Year variable to the datatype factor.
levels(merge$Year) # Checking the levels of the factor Year. 
```

It can be observed from the outputs above that the conversion was successful. Furthermore, the levels of the variable were found to be in the correct order and do not need to be changed.

<br>

##	Tidy & Manipulate Data II 

In order to find out the Out-of-pocket expenditure (OPE_GDP) as percentage of gross domestic product, the Out-of-pocket expenditure (OPE) as percentage of current health expenditure (CHE) was multiplied with the CHE as percentage of the GDP.


```{r}
# The new column is added to the dataset with the help of the mutate function. 
ope_gdp <- merge %>% mutate(`OPE_GDP(%)` = (`OPE(%)`/100)*`CHE(%)`)
ope_gdp$`OPE_GDP(%)` <- round(ope_gdp$`OPE_GDP(%)`, digits = 2) # Rounding to 2 decimal places.
head(ope_gdp)
```

It can be observed from the output that the OPE_GDP was successfully calculated an added to the dataset.


<br>

##	Scan I 

The data is now scanned for missing values(NAs), special values(NaN, Inf, -Inf) and obvious errors (0,100). It is certain that not all health expenditure is out of pocket nor can it be fully funded, thus the values 0 and 100 are considered as errors. 


```{r}
which(is.na(ope_gdp)) # Scanning for missing values.
sum(is.na(ope_gdp))   # Finding the total number of missing values.
sum(is.nan(as.matrix(ope_gdp))) # Finding the total number of NaN values.
sum(is.infinite(as.matrix(ope_gdp))) # Finding the total number of +/- infinite values.
ope_gdp %>% filter(`CHE(%)`== 0 | `CHE(%)` == 100 |`OPE(%)` == 0 | `OPE(%)` == 100) # Searching for inconsistencies.    
ope_gdp <- na.omit(ope_gdp) # Removing all the observations containing missing values.

```

It is observed from the outputs above that there are 156 missing values. No special values or obvious errors were detected. Since, the data varies by country and by year, it would be difficult to determine a single constant for all the missing values. This also means that the missing values can't be replaced by the mean or median. Therefore, the rows containing the missing values were removed using the na.omit() function.

<br>

##	Scan II

The OPE_GDP data is now scanned for outliers so that it can be ready to be analysed. To visualize the outliers in the overall data consisting of several countries in the world, we can consider two variables, OPE_GDP and Year. Thus, a bivariate box plot can be used to detect outliers. 

```{r}
ope_gdp %>% plot(`OPE_GDP(%)` ~ Year, data = ., ylab="OPE_GDP(%)", xlab="Year", main="Detecting the Outliers")
```


It can be observed from the initial box plot that there were several outliers in the data. Thus, these outliers cannot be excluded. It can be understood that this may be because some countries may have a high OPE for health for health expenditure while others might have a very low value depending on their public health systems. However, since the outliers can significantly skew the data, they have been eliminated by Capping.

```{r}
# Defining the cap function which will be used to cap the outliers.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

ope_gdp$`OPE_GDP(%)`<- ope_gdp$`OPE_GDP(%)` %>% cap() # Applying cap() function to the OPE_GDP values.

ope_gdp %>% plot(`OPE_GDP(%)` ~ Year, data = ., ylab="OPE_GDP(%)", xlab="Year", main="Detecting the Outliers(After Capping)")

```

It can be observed from the second box plot that the outliers were successfully dealt with.

<br>

##	Transform 

The final step of preprocessing the OPE_GDP variable requires for it to achieve normality. It can be viewed from the initial histogram that the variable is slightly skewed to the right. Initially a Box-Cox transformation was applied to normalise the data. 


```{r}
hist(ope_gdp$`OPE_GDP(%)`) # Histogram of OPE_GDP variable.
boxcox<- BoxCox(ope_gdp$`OPE_GDP(%)` ,lambda = "auto") # Applying Box-Cox Transformation.
hist(boxcox) # Histogram of OPE_GDP variable after Box-Cox Transformation.
```
Upon reflecting on the histogram of the transformed data, it can be seen that it is slightly skewed to the left. In order to find a better fit, the logarithmic transformation (base 10) and the square root transformation were applied to the data.

```{r}
log_o <- log10(ope_gdp$`OPE_GDP(%)`) # Applying logarithmic transformation.
hist(log_o)

sqrt <- ope_gdp$`OPE_GDP(%)`^(0.5)  # Applying square root transformation.
hist(sqrt)

```

As observed from the histogram above, the square root transformation was most successful in achieving near normality. It is safe to assume that the data has been pre-proccessed and the OPE_GDP variable is ready to be analysed.

<br>

##	References 


1.) Dolgun,A 2020, Modules 1-8 , lecture notes, Math2349, Data Wrangling ,RMIT University.

2.) World Health organisation (WHO) 2020, Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%), Data by country, .csv file, viewed 7
June 2020, < https://apps.who.int/gho/data/node.main.GHEDCHEGDPSHA2011?lang=en >

2.) World Health organisation (WHO) 2020, Out-of-pocket expenditure as percentage of current health expenditure (CHE) (%), Data by country, .csv file, viewed 7
June 2020, < https://apps.who.int/gho/data/view.main.GHEDOOPSCHESHA2011v?lang=en >

<br>
<br>
