Required packages
The below packages have been used for this analysis:
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(hms)
library(outliers)
library(ggplot2)
library(forecast)
Executive Summary
- Data preprocessing is a process and the collection of operations needed to prepare all forms of untidy data for statistical analysis.
- There are 5 major tasks for data preprocessing framework, namely: Get, Understand, Tidy & Manipulate, Scan and Transform.
- We load the necessary packages into R required for our processes.
- The data sets which was downloaded from an open source website is loaded into R using read_csv() as they were in csv format. The head of the datasets are also displayed.
- Next, we join the 2 datasets using inner_join() function with the common variable being “Country” and display the head of the newly joined data frame.
- We check the structure of the data frame to get an overview of the data structures and variable types. We make necessary data type conversions on the data set.
- While observing the dataset, we realised that the way country coordinates were represented was incorrect and tidying was necessary to join the latitude and longitude values into a single column. We use unite() function to join the columns giving country coordinates in the correct format.
- A new column is mutated from Dystopia Residual column to get the Residual value of each column (Residual Value = Dystopia Residual -1.85).
- Once we are done with the tidying and manipulating the dataset, we scan for missing values. It is seen that there are no missing values hence we move on to scan for outliers.
- We use boxplot method to scan for outliers on numeric variables and use capping function to cap those values to 5th and 95th percentile values of the data.
- Lastly, data transformation is performed on 3 variables having non-normal distribution. The transformations were done using Square root method and BoxCox method inorder to reduce the skewness and transforming the data into a normal distribution.
Data
The first dataset used is “World Happiness Report” from the year 2015. This report is a landmark survey of the state of global happiness of various countries across the globe. The data consist of a happiness score which is calculated on a scale of 0 to 10 where 10 is the happiest and the respective countries ranking amongst others. Switzerland is currently ranking 1st with a happiness score of 7.59. The different columns are described as below:
- Country – Name of the country
- Region – Region the country belongs to
- Happiness Rank – Country’s rank based on happiness score
- Happiness Score – Score measured by asking sampled people how they would rate their happiness.
- Standard Error – The standard error of the happiness score
- Economy (GDP per Capita) – The extent to which GDP contributes to the calculation of happiness score
- Family - The extent to which Family contributes to the calculation of happiness score
- Health (Life Expectancy) - The extent to which health contributes to the calculation of happiness score
- Freedom - The extent to which Freedom contributes to the calculation of happiness score
- Trust (Government Corruption) - The extent to which perception of corruption contributes to the calculation of happiness score
- Generosity – The extent to which Generosity contributes to the calculation of happiness score
- Dystopia Residual – It is the sum of the dystopia happiness score (1.85) ie score of a hypothetical country having rank lower than the lowest ranking country in the report, plus the residual value of each country which is a number left over from the normalization of the variables which cannot be explained).
By adding up all these factors including the Dystopia Residual, we get the happiness score for each country.
The second dataset used here is “Countries Longitude and Latitude” which gives the latitude and longitude details of each country which is used as a reference in several cases. The latitude and longitude of each country is given in Simple Decimal Standard in separate columns. The columns in the dataset can be explained as below:
- (Blank column title) – Gives the serial number of countries
- Latitude – Gives the respective country’s latitudinal coordinates
- Longitude – Gives the respective country’s longitudinal coordinates
- Country – Name of the country
The datasets used were obtained from an open source website: “World Happiness Report” (https://www.kaggle.com/unsdsn/world-happiness) and “Countries Longitude and Latitude” (https://www.kaggle.com/folaraz/world-countries-and-continents-details?select=Countries+Longitude+and+Latitude.csv).
- First the working directory was set to my preference using setwd() function.
- The two datasets which were in csv format was read into R using read_csv(). The happiness report dataset is stored in data frame “Happy” and countries coordinates in “Country” (column with no title consisting of serial number was excluded as it possessed no relevance to our analysis).
- The head of datasets is displayed using head().The two data sets have Country names as the common variable.
- Although there were different approaches in joining the 2 datasets, I went ahead using inner_join(). The reason being that dataset “Happy” consisted of 158 observations and after using inner join, the resultant dataset had 147 observations meaning that 11 countries in the first dataset did not have coordinates details from the second dataset. As this is only a small percentage of the dataset, I decided to omit those observation, hence using inner join to join the datasets.
#Setting the working directory
setwd("C:/Users/User/Desktop/RMIT class/Data Wrangling/Assignment 2")
getwd()
[1] "C:/Users/User/Desktop/RMIT class/Data Wrangling/Assignment 2"
#Reading the datasets
Happy <- read_csv("2015.csv")
Parsed with column specification:
cols(
Country = [31mcol_character()[39m,
Region = [31mcol_character()[39m,
`Happiness Rank` = [32mcol_double()[39m,
`Happiness Score` = [32mcol_double()[39m,
`Standard Error` = [32mcol_double()[39m,
`Economy (GDP per Capita)` = [32mcol_double()[39m,
Family = [32mcol_double()[39m,
`Health (Life Expectancy)` = [32mcol_double()[39m,
Freedom = [32mcol_double()[39m,
`Trust (Government Corruption)` = [32mcol_double()[39m,
Generosity = [32mcol_double()[39m,
`Dystopia Residual` = [32mcol_double()[39m
)
head(Happy)
Country1 <- read_csv("Countries.csv")
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
X1 = [32mcol_double()[39m,
latitude = [32mcol_double()[39m,
longitude = [32mcol_double()[39m,
Country = [31mcol_character()[39m
)
Country <- Country1[,c(2,3,4)]
head(Country)
#Joining the 2 datasets
join <- inner_join(Happy, Country, by = "Country")
head(join)
NA
Understand
- The data types of the data frame were checked using str() function.
- The variable ‘Region’ was factorised and levels were set with region names. Also the variable ‘Rank’ was converted to ordered factor.
- The structure of the data frame was checked after the data type conversions and all the variables were in the correct format.
str(join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 147 obs. of 14 variables:
$ Country : chr "Switzerland" "Iceland" "Denmark" "Norway" ...
$ Region : chr "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
$ Happiness Rank : num 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness Score : num 7.59 7.56 7.53 7.52 7.43 ...
$ Standard Error : num 0.0341 0.0488 0.0333 0.0388 0.0355 ...
$ Economy (GDP per Capita) : num 1.4 1.3 1.33 1.46 1.33 ...
$ Family : num 1.35 1.4 1.36 1.33 1.32 ...
$ Health (Life Expectancy) : num 0.941 0.948 0.875 0.885 0.906 ...
$ Freedom : num 0.666 0.629 0.649 0.67 0.633 ...
$ Trust (Government Corruption): num 0.42 0.141 0.484 0.365 0.33 ...
$ Generosity : num 0.297 0.436 0.341 0.347 0.458 ...
$ Dystopia Residual : num 2.52 2.7 2.49 2.47 2.45 ...
$ latitude : num 46.8 65 56.3 60.5 56.1 ...
$ longitude : num 8.23 -19.02 9.5 8.47 -106.35 ...
#Data type conversion
join$Region <- factor(join$Region, order = FALSE, levels = c("Australia and New Zealand", "Central and Eastern Europe", "Eastern Asia", "Latin America and Caribbean", "Middle East and Northern Africa", "North America", "Southeastern Asia", "Southern Asia", "Sub-Saharan Africa", "Western Europe"))
join$`Happiness Rank` <- factor(join$`Happiness Rank`, order = TRUE, levels = c(1:158))
str(join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 147 obs. of 14 variables:
$ Country : chr "Switzerland" "Iceland" "Denmark" "Norway" ...
$ Region : Factor w/ 10 levels "Australia and New Zealand",..: 10 10 10 10 6 10 10 10 1 1 ...
$ Happiness Rank : Ord.factor w/ 158 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness Score : num 7.59 7.56 7.53 7.52 7.43 ...
$ Standard Error : num 0.0341 0.0488 0.0333 0.0388 0.0355 ...
$ Economy (GDP per Capita) : num 1.4 1.3 1.33 1.46 1.33 ...
$ Family : num 1.35 1.4 1.36 1.33 1.32 ...
$ Health (Life Expectancy) : num 0.941 0.948 0.875 0.885 0.906 ...
$ Freedom : num 0.666 0.629 0.649 0.67 0.633 ...
$ Trust (Government Corruption): num 0.42 0.141 0.484 0.365 0.33 ...
$ Generosity : num 0.297 0.436 0.341 0.347 0.458 ...
$ Dystopia Residual : num 2.52 2.7 2.49 2.47 2.45 ...
$ latitude : num 46.8 65 56.3 60.5 56.1 ...
$ longitude : num 8.23 -19.02 9.5 8.47 -106.35 ...
Tidy & Manipulate Data I
- In my data frame (join) the columns latitude and longitude are untidy.
- Location coordinates are represented in various formats. For example Decimal coordinates (20.5936832, 78.962883), Decimal Degrees Coordinates (20.5937° N 78.9629° E) and DMS (Degrees, Minutes and Seconds) Coordinates (20°35’37.3’’ N 78°57’46.4’’ E). When you look up any location on google maps, the address of that location would also give the coordinates written in Decimal Degrees format.
- Location coordinates by default is written with latitude and longitude together and not seperately. This link shows the formats in which location coordinates are written (https://www.wikihow.com/Write-Latitude-and-Longitude#:~:text=Start%20with%20your%20line%20of,or%20West%20as%20the%20direction.)
- If the coordinates are written in the correct format, it would be helpful while exporting the data into a GPS software for plotting them on the map.
- In my data frame, the latitude and longitude is written seperately. Therefore, I unite those two columns using unite() with sep=“,”.
- The new column created by uniting the latitude and longitude is called “Country Coordinates”.
- The columns have been re-arranged to place the new column next to the country names.
join1 <- join %>% unite(`Country Coordinates`, latitude, longitude, sep = ", ")
join2 <- join1[, c(1,13,2:12)]
Tidy & Manipulate Data II
- In our dataset, Dystopia Residual is measured as the sum of Dystopia happiness score and residual value of each country.
- The Dystopia happiness score as mentioned earlier is a constant value (1.85) which is same for all countries.
- Inorder to check the residual value of each counry, we use mutate() function to create a new variable by preserving the existing variable.
- We name the new variable “Residual Value” and it is calculated as the difference between Dystopia residual of each country and 1.85.
- The residual value is directly proportional to the dystopia residual meaning as the residual value increases, dystopia residual increases.
join3 <- join2 %>% mutate(`Residual Value` = (join$`Dystopia Residual` - 1.85))
Scan I
- We check for NA values in our dataset by incorporating is.na() within colSums() so as to get the sum of NA values in each column.
- By running the code, we see that there are no NA values in our dataset.
- The next step is to check for special values (-Inf, Inf, Nan).
- Inorder to check all numeric columns for the presence of special values, we create a function called “special”. The function checks all obervations whether if its a special value and the result is in logical form (TRUE if special values are detected, FALSE if not a special value).
- To get the sum of special values in each column, we use sapply() function as given below. This would give the count of special values in each column of our dataset.
#Checking for NA values
colSums(is.na(join3))
Country Country Coordinates Region
0 0 0
Happiness Rank Happiness Score Standard Error
0 0 0
Economy (GDP per Capita) Family Health (Life Expectancy)
0 0 0
Freedom Trust (Government Corruption) Generosity
0 0 0
Dystopia Residual Residual Value
0 0
#Checking for special values
special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))}
#Sum of special values in each column
sapply(join3, function(x){if (is.numeric(x)) sum(special(x))})
$Country
NULL
$`Country Coordinates`
NULL
$Region
NULL
$`Happiness Rank`
NULL
$`Happiness Score`
[1] 0
$`Standard Error`
[1] 0
$`Economy (GDP per Capita)`
[1] 0
$Family
[1] 0
$`Health (Life Expectancy)`
[1] 0
$Freedom
[1] 0
$`Trust (Government Corruption)`
[1] 0
$Generosity
[1] 0
$`Dystopia Residual`
[1] 0
$`Residual Value`
[1] 0
Scan II
- Once we have scanned for NA values and special values, the next step is to handle the outliers present in our dataset.
- We use boxplot method to detect the univariate variables (as our variables are not dependent on each other)
- By observing the boxplots, we see that 6 out of 10 variables are having outliers. We use capping method to handle the outliers.
- Capping process involves in replacing the outliers with the nearest neighbouring values that are not outliers. As an example, for outliers that lie outside the outlier fence we cap it by replacing those values outside the lower limit with the value of 5 th percentile and for those that lie above the upper limit with the 95 th percentile value.
- We check the boxplots once again after applying capping function on the variables having outliers.
- While observing the boxplots, we see that the ‘Standard Error’ and ‘Trust (Government Corruption)’ are showing the presence of outliers on the boxplot. They are actually the outlier fence value which acts as a barrier between outliers and non outlier values. In order to confirm this, we use length() on the boxplots of those two variables.
#Boxplots before handling outliers
box_1 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`,
join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"),
main = "Boxplot to detect outliers")

box_2 <- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`,
join3$`Trust (Government Corruption)`, join3$Generosity,
names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
main = "Boxplot to detect outliers")

#Capping function
cap <- function(x){
quantiles <- quantile(x, c(.05, .25, .75, .95) )
x[ x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
x}
join3$`Standard Error` <- join3$`Standard Error` %>% cap()
join3$Family <- join3$Family %>% cap()
join3$`Trust (Government Corruption)` <- join3$`Trust (Government Corruption)` %>% cap()
join3$Generosity <- join3$Generosity %>% cap()
join3$`Dystopia Residual` <- join3$`Dystopia Residual` %>% cap()
join3$`Residual Value` <- join3$`Residual Value` %>% cap()
#Boxplots after removing outliers
box_3 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`,
join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"),
main = "Boxplot to detect outliers")

box_4<- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`,
join3$`Trust (Government Corruption)`, join3$Generosity,
names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
main = "Boxplot to detect outliers")

boxplot(join3$`Standard Error`, main = "Boxplot of Standard Error")$out
[1] 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768
[12] 0.078768

boxplot(join3$`Trust (Government Corruption)`, main = "Boxplot of Trust")$out
[1] 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353
[12] 0.405353 0.405353

References
Kaggle.com. 2019. World Happiness Report. [online] Available at: https://www.kaggle.com/unsdsn/world-happiness
Kaggle.com. 2017. World Countries And Continents Details. [online] Available at: https://www.kaggle.com/folaraz/world-countries-and-continents-details?select=Countries+Longitude+and+Latitude.csv
Dolgun, Anil 2010, ‘Data Wrangling’, lecture notes, MATH2349, RMIT University, http://rare-phoenix-161610.appspot.com/secured/index.html
Baglin, James 2010, ‘Applied Analytics’, lecture notes, MATH1324, RMIT University, https://astral-theory-157510.appspot.com/secured/index.html
wikiHow. 2020. How To Write Latitude And Longitude. [online] Available at: https://www.wikihow.com/Write-Latitude-and-Longitude#:~:text=Start%20with%20your%20line%20of,or%20West%20as%20the%20direction.
---
title: "MATH2349 Semester 1, 2020"
author: "Siddharth Jacob (S3825234)"
subtitle: Assignment 2
output:
  html_notebook: default
---



## Required packages 

The below packages have been used for this analysis:
```{r}
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(hms)
library(outliers)
library(ggplot2)
library(forecast)

```
<br>
<br>

## Executive Summary 

* Data preprocessing is a process and the collection of operations needed to prepare all forms of untidy data for statistical analysis. 
* There are 5 major tasks for data preprocessing framework, namely: Get, Understand, Tidy & Manipulate, Scan and Transform.
* We load the necessary packages into R required for our processes.
* The data sets which was downloaded from an open source website is loaded into R using read_csv() as they were in csv format. The head of the datasets are also displayed.
* Next, we join the 2 datasets using inner_join() function with the common variable being “Country” and display the head of the newly joined data frame.
* We check the structure of the data frame to get an overview of the data structures and variable types. We make necessary data type conversions on the data set.
* While observing the dataset, we realised that the way country coordinates were represented was incorrect and tidying was necessary to join the latitude and longitude values into a single column. We use unite() function to join the columns giving country coordinates in the correct format.
* A new column is mutated from Dystopia Residual column to get the Residual value of each column (Residual Value = Dystopia Residual -1.85).
* Once we are done with the tidying and manipulating the dataset, we scan for missing values. It is seen that there are no missing values hence we move on to scan for outliers.
* We use boxplot method to scan for outliers on numeric variables and use capping function to cap those values to 5th and 95th percentile values of the data. 
* Lastly, data transformation is performed on 3 variables having non-normal distribution. The transformations were done using Square root method and BoxCox method inorder to reduce the skewness and transforming the data into a normal distribution.

<br>
<br>

## Data 

The first dataset used is “World Happiness Report” from the year 2015. This report is a landmark survey of the state of global happiness of various countries across the globe. The data consist of a happiness score which is calculated on a scale of 0 to 10 where 10 is the happiest and the respective countries ranking amongst others. Switzerland is currently ranking 1st with a happiness score of 7.59. The different columns are described as below:

1.	Country – Name of the country
2.	Region – Region the country belongs to
3.	Happiness Rank – Country’s rank based on happiness score
4.	Happiness Score – Score measured by asking sampled people how they would rate their happiness.
5.	Standard Error – The standard error of the happiness score
6.	Economy (GDP per Capita) – The extent to which GDP contributes to the calculation of happiness     score
7.	Family - The extent to which Family contributes to the calculation of happiness score
8.	Health (Life Expectancy) - The extent to which health contributes to the calculation of           happiness score
9.	Freedom - The extent to which Freedom contributes to the calculation of happiness score
10.	Trust (Government Corruption) - The extent to which perception of corruption contributes to       the calculation of happiness score
11.	Generosity – The extent to which Generosity contributes to the calculation of happiness score
12.	Dystopia Residual – It is the sum of the dystopia happiness score (1.85) ie score of a            hypothetical country having rank lower than the lowest ranking country in the report, plus the     residual value of each country which is a number left over from the normalization of the          variables which cannot be explained).

By adding up all these factors including the Dystopia Residual, we get the happiness score for each country. 
<br>
The second dataset used here is “Countries Longitude and Latitude” which gives the latitude and longitude details of each country which is used as a reference in several cases. The latitude and longitude of each country is given in Simple Decimal Standard in separate columns. The columns in the dataset can be explained as below:

1.	(Blank column title) – Gives the serial number of countries
2.	Latitude – Gives the respective country’s latitudinal coordinates
3.	Longitude – Gives the respective country’s longitudinal coordinates
4.	Country – Name of the country

The datasets used were obtained from an open source website: “World Happiness Report” (https://www.kaggle.com/unsdsn/world-happiness) and “Countries Longitude and Latitude” (https://www.kaggle.com/folaraz/world-countries-and-continents-details?select=Countries+Longitude+and+Latitude.csv).

* First the working directory was set to my preference using setwd() function.
* The two datasets which were in csv format was read into R using read_csv(). The happiness report dataset is stored in data frame “Happy” and countries coordinates in “Country” (column with no title consisting of serial number was excluded as it possessed no relevance to our analysis).
* The head of datasets is displayed using head().The two data sets have Country names as the common variable.
* Although there were different approaches in joining the 2 datasets, I went ahead using inner_join(). The reason being that dataset “Happy” consisted of 158 observations and after using inner join, the resultant dataset had 147 observations meaning that 11 countries in the first dataset did not have coordinates details from the second dataset. As this is only a small percentage of the dataset, I decided to omit those observation, hence using inner join to join the datasets.

```{r}
#Setting the working directory
setwd("C:/Users/User/Desktop/RMIT class/Data Wrangling/Assignment 2")
getwd()

#Reading the datasets
Happy <- read_csv("2015.csv")
head(Happy)

Country1 <- read_csv("Countries.csv")
Country <- Country1[,c(2,3,4)]
head(Country)

#Joining the 2 datasets
join <- inner_join(Happy, Country, by = "Country")
head(join)

```
<br>
<br>

## Understand 

* The data types of the data frame were checked using str() function. 
* The variable 'Region' was factorised and levels were set with region names. Also the variable 'Rank' was converted to ordered factor.
* The structure of the data frame was checked after the data type conversions and all the variables were in the correct format.

```{r}
str(join)

#Data type conversion

join$Region <- factor(join$Region, order = FALSE, levels = c("Australia and New Zealand", "Central and Eastern Europe", "Eastern Asia", "Latin America and Caribbean", "Middle East and Northern Africa", "North America", "Southeastern Asia", "Southern Asia", "Sub-Saharan Africa", "Western Europe"))

join$`Happiness Rank` <- factor(join$`Happiness Rank`, order = TRUE, levels = c(1:158))

str(join)
```
<br>
<br>

##	Tidy & Manipulate Data I 

* In my data frame (join) the columns latitude and longitude are untidy.
* Location coordinates are represented in various formats. For example Decimal coordinates (20.5936832, 78.962883), Decimal Degrees Coordinates (20.5937° N 78.9629° E) and DMS (Degrees, Minutes and Seconds) Coordinates (20°35'37.3'' N 78°57'46.4'' E). When you look up any location on google maps, the address of that location would also give the coordinates written in Decimal Degrees format. 
* Location coordinates by default is written with latitude and longitude together and not seperately. This link shows the formats in which location coordinates are written (https://www.wikihow.com/Write-Latitude-and-Longitude#:~:text=Start%20with%20your%20line%20of,or%20West%20as%20the%20direction.)
* If the coordinates are written in the correct format, it would be helpful while exporting the data into a GPS software for plotting them on the map.
* In my data frame, the latitude and longitude is written seperately. Therefore, I unite those two columns using unite() with sep=", ".
* The new column created by uniting the latitude and longitude is called "Country Coordinates".
* The columns have been re-arranged to place the new column next to the country names.

```{r}
join1 <- join %>% unite(`Country Coordinates`, latitude, longitude, sep = ", ")
join2 <- join1[, c(1,13,2:12)]

```
<br>
<br>

##	Tidy & Manipulate Data II 

* In our dataset, Dystopia Residual is measured as the sum of Dystopia happiness score and residual value of each country.
* The Dystopia happiness score as mentioned earlier is a constant value (1.85) which is same for all countries.
* Inorder to check the residual value of each counry, we use mutate() function to create a new variable by preserving the existing variable. 
* We name the new variable "Residual Value" and it is calculated as the difference between Dystopia residual of each country and 1.85.
* The residual value is directly proportional to the dystopia residual meaning as the residual value increases, dystopia residual increases.


```{r}
join3 <- join2 %>% mutate(`Residual Value` = (join$`Dystopia Residual` - 1.85))

```
<br>
<br>

##	Scan I 

* We check for NA values in our dataset by incorporating is.na() within colSums() so as to get the sum of NA values in each column. 
* By running the code, we see that there are no NA values in our dataset.
* The next step is to check for special values (-Inf, Inf, Nan).
* Inorder to check all numeric columns for the presence of special values, we create a function called "special". The function checks all obervations whether if its a special value and the result is in logical form (TRUE if special values are detected, FALSE if not a special value).
* To get the sum of special values in each column, we use sapply() function as given below. This would give the count of special values in each column of our dataset.


```{r}
#Checking for NA values
colSums(is.na(join3))

#Checking for special values
special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))}

#Sum of special values in each column
sapply(join3, function(x){if (is.numeric(x)) sum(special(x))})

```
<br>
<br>

##	Scan II

* Once we have scanned for NA values and special values, the next step is to handle the outliers present in our dataset.
* We use boxplot method to detect the univariate variables (as our variables are not dependent on each other)
* By observing the boxplots, we see that 6 out of 10 variables are having outliers. We use capping method to handle the outliers.
* Capping process involves in replacing the outliers with the nearest neighbouring values that are not outliers. As an example, for outliers that lie outside the outlier fence we cap it by replacing those values outside the lower limit with the value of 5 th percentile and for those that lie above the upper limit with the 95 th percentile value.
* We check the boxplots once again after applying capping function on the variables having outliers. 
* While observing the boxplots, we see that the 'Standard Error' and 'Trust (Government Corruption)' are showing the presence of outliers on the boxplot. They are actually the outlier fence value which acts as a barrier between outliers and non outlier values. In order to confirm this, we use length() on the boxplots of those two variables. 

```{r}
#Boxplots before handling outliers
box_1 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`, 
                 join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
                 names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"), 
                 main = "Boxplot to detect outliers")

box_2 <- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`, 
                 join3$`Trust (Government Corruption)`, join3$Generosity,
                 names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
                 main = "Boxplot to detect outliers")

#Capping function
cap <- function(x){
  quantiles <- quantile(x, c(.05, .25, .75, .95) )
  x[ x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
  x}

join3$`Standard Error` <- join3$`Standard Error` %>% cap()
join3$Family <- join3$Family %>% cap()
join3$`Trust (Government Corruption)` <- join3$`Trust (Government Corruption)` %>% cap()
join3$Generosity <- join3$Generosity %>% cap()
join3$`Dystopia Residual` <- join3$`Dystopia Residual` %>% cap()
join3$`Residual Value` <- join3$`Residual Value` %>% cap()

#Boxplots after removing outliers
box_3 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`, 
                 join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
                 names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"), 
                 main = "Boxplot to detect outliers")

box_4<- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`, 
                join3$`Trust (Government Corruption)`, join3$Generosity,
                names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
                main = "Boxplot to detect outliers")

boxplot(join3$`Standard Error`, main = "Boxplot of Standard Error")$out
boxplot(join3$`Trust (Government Corruption)`, main = "Boxplot of Trust")$out

```
<br>
<br>

##	Transform 

* The last step in our analysis is to perform transformation. We perform data transformation to standardise the variables, to make the relationships linear and even to make the distribution less skewed.
* We perform data transformation on 3 variables here: Residual Value, Dystopia Residual and Trust (Government Corruption).
* On checking the histogram for Trust, we see that it is skewed towards the right. We use square root transformation method as it can reduce right skewness and can also be applied to zero values. The function sqrt() will apply square root tranformation and a histogram is plotted post transformation where the distribution is seen be normal.
* Next we check the histogram for Residual Value. Although the distribution seems to be normal, it is not perfect. Box-Cox method is a type of power transformation used to transform non-normal data into normal distribution. We use the package "forecast" to perform this function. To get the best Box-Cox transformation, we use BoxCox() function using lambda="auto" argument. Upon plotting the histogram after the transformation, we see that the distribution is perfectly normal.
* We check the histogram for Dystopia Residual. This also appears to be normally distributed but not perfectly symmetric. Here also we perform BoxCox() function with labmda="auto". The histogram after transformation gives a much better symmetric looking distribution.

```{r}
#Trust (Government Corruption)
hist(join3$`Trust (Government Corruption)`, main = "Histogram of Trust",
     xlab = "Trust (Government Corruption)")
sqrt_trust<- sqrt(join3$`Trust (Government Corruption)`)
hist(sqrt_trust, main = "Trust (after Transformation)",
     xlab = "Trust (Government Corruption)")

# Residual value
hist(join3$`Residual Value`, main = "Histogram of Residual Value", 
     xlab = "Residual Value")
box_residual <- BoxCox(join3$`Residual Value`, lambda = "auto")
hist(box_residual, main = "Residual Value (after Transformation)",
     xlab = "Residual Value")

# Dystopia residual
hist(join3$`Dystopia Residual`, main = "Histogram of Dystopia Residual",
     xlab = "Dystopia Residual")
box_dystopia <- BoxCox(join3$`Dystopia Residual`, lambda = "auto")
hist(box_dystopia, main = "Dystopia Residual (after Transformation)",
     xlab = "Dystopia Residual")

```
<br>
<br>

## References

* Kaggle.com. 2019. World Happiness Report. [online] Available at: <https://www.kaggle.com/unsdsn/world-happiness>

* Kaggle.com. 2017. World Countries And Continents Details. [online] Available at: <https://www.kaggle.com/folaraz/world-countries-and-continents-details?select=Countries+Longitude+and+Latitude.csv>

* Dolgun, Anil 2010, ‘Data Wrangling’, lecture notes, MATH2349, RMIT University, <http://rare-phoenix-161610.appspot.com/secured/index.html>

* Baglin, James 2010, ‘Applied Analytics’, lecture notes, MATH1324, RMIT University, <https://astral-theory-157510.appspot.com/secured/index.html>

* wikiHow. 2020. How To Write Latitude And Longitude. [online] Available at: <https://www.wikihow.com/Write-Latitude-and-Longitude#:~:text=Start%20with%20your%20line%20of,or%20West%20as%20the%20direction.>

<br>
<br>
