Required packages

The following packages were used for this analysis:

library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(hms)
library(outliers)
library(ggplot2)
library(forecast)



Executive Summary



Data

The first dataset is the “World Happiness Report” for 2015, a landmark survey of the state of global happiness. For each country it gives a happiness score on a scale of 0 to 10, where 10 is the happiest, together with the country's rank relative to the others. In the 2015 data, Switzerland ranks 1st with a happiness score of 7.59. The columns are described below:

  1. Country – Name of the country
  2. Region – Region the country belongs to
  3. Happiness Rank – Country’s rank based on happiness score
  4. Happiness Score – Score measured by asking sampled people how they would rate their happiness
  5. Standard Error – The standard error of the happiness score
  6. Economy (GDP per Capita) – The extent to which GDP contributes to the calculation of the happiness score
  7. Family – The extent to which family contributes to the calculation of the happiness score
  8. Health (Life Expectancy) – The extent to which health contributes to the calculation of the happiness score
  9. Freedom – The extent to which freedom contributes to the calculation of the happiness score
  10. Trust (Government Corruption) – The extent to which the perception of corruption contributes to the calculation of the happiness score
  11. Generosity – The extent to which generosity contributes to the calculation of the happiness score
  12. Dystopia Residual – The sum of the Dystopia happiness score (1.85), i.e. the score of a hypothetical country ranked below the lowest-ranking country in the report, and the country's residual value, which is the unexplained component left over from the normalization of the variables

By adding up all these factors including the Dystopia Residual, we get the happiness score for each country.
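
As a quick worked check of this additive relationship, the sketch below sums the rounded values for Switzerland that appear in the str() output later in this report. The variable names are illustrative, and because the inputs are rounded the sum is only expected to agree with the reported score to about two decimal places.

```r
# Rounded component values for Switzerland, copied from the str() output
switzerland <- c(
  economy           = 1.40,
  family            = 1.35,
  health            = 0.941,
  freedom           = 0.666,
  trust             = 0.42,
  generosity        = 0.297,
  dystopia_residual = 2.52
)

# Sum of the six factors plus the Dystopia Residual
reconstructed <- sum(switzerland)

# Happiness score reported for Switzerland
reported <- 7.59

# TRUE when the two agree up to rounding error
abs(reconstructed - reported) < 0.01
```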

The second dataset is “Countries Longitude and Latitude”, which gives the latitude and longitude of each country and serves as a geographic reference. The coordinates are given in simple decimal form (decimal degrees), in separate columns. The columns are described below:

  1. (Blank column title) – Serial number of the country (read in as X1)
  2. Latitude – Latitude of the country
  3. Longitude – Longitude of the country
  4. Country – Name of the country

Both datasets were obtained from Kaggle, an open data website: “World Happiness Report” (https://www.kaggle.com/unsdsn/world-happiness) and “Countries Longitude and Latitude” (https://www.kaggle.com/folaraz/world-countries-and-continents-details?select=Countries+Longitude+and+Latitude.csv).

#Setting the working directory
setwd("C:/Users/User/Desktop/RMIT class/Data Wrangling/Assignment 2")
getwd()
[1] "C:/Users/User/Desktop/RMIT class/Data Wrangling/Assignment 2"
#Reading the datasets
Happy <- read_csv("2015.csv")
Parsed with column specification:
cols(
  Country = col_character(),
  Region = col_character(),
  `Happiness Rank` = col_double(),
  `Happiness Score` = col_double(),
  `Standard Error` = col_double(),
  `Economy (GDP per Capita)` = col_double(),
  Family = col_double(),
  `Health (Life Expectancy)` = col_double(),
  Freedom = col_double(),
  `Trust (Government Corruption)` = col_double(),
  Generosity = col_double(),
  `Dystopia Residual` = col_double()
)
head(Happy)

Country1 <- read_csv("Countries.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  latitude = col_double(),
  longitude = col_double(),
  Country = col_character()
)
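#Keeping latitude, longitude and Country (dropping the serial-number column X1)
Country <- Country1[, c("latitude", "longitude", "Country")]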
head(Country)

#Joining the 2 datasets
join <- inner_join(Happy, Country, by = "Country")
head(join)
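
An inner join keeps only the countries whose names match exactly in both tables. The joined table has 147 rows while the rank levels used later run to 158, which suggests that some countries found no match. A minimal sketch of how unmatched rows could be listed with anti_join(), using made-up tables; the names `scores`, `coords` and the country names are hypothetical.

```r
library(dplyr)

# Toy stand-ins for the two datasets (hypothetical names and values)
scores <- tibble(
  Country = c("Aland", "Bolivar", "Cordia"),
  score   = c(7.1, 6.2, 5.3)
)
coords <- tibble(
  Country   = c("Aland", "Cordia", "Dorne"),
  latitude  = c(10, 20, 30),
  longitude = c(40, 50, 60)
)

# Rows present in both tables
matched <- inner_join(scores, coords, by = "Country")

# Rows of `scores` that the inner join silently drops
unmatched <- anti_join(scores, coords, by = "Country")
unmatched$Country
```

Differences in spelling between the two sources are a common cause of such drops, so inspecting the unmatched names before continuing may be worthwhile.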



Understand

str(join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    147 obs. of  14 variables:
 $ Country                      : chr  "Switzerland" "Iceland" "Denmark" "Norway" ...
 $ Region                       : chr  "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
 $ Happiness Rank               : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Happiness Score              : num  7.59 7.56 7.53 7.52 7.43 ...
 $ Standard Error               : num  0.0341 0.0488 0.0333 0.0388 0.0355 ...
 $ Economy (GDP per Capita)     : num  1.4 1.3 1.33 1.46 1.33 ...
 $ Family                       : num  1.35 1.4 1.36 1.33 1.32 ...
 $ Health (Life Expectancy)     : num  0.941 0.948 0.875 0.885 0.906 ...
 $ Freedom                      : num  0.666 0.629 0.649 0.67 0.633 ...
 $ Trust (Government Corruption): num  0.42 0.141 0.484 0.365 0.33 ...
 $ Generosity                   : num  0.297 0.436 0.341 0.347 0.458 ...
 $ Dystopia Residual            : num  2.52 2.7 2.49 2.47 2.45 ...
 $ latitude                     : num  46.8 65 56.3 60.5 56.1 ...
 $ longitude                    : num  8.23 -19.02 9.5 8.47 -106.35 ...
#Data type conversion

#Region is a nominal (unordered) factor
join$Region <- factor(join$Region, ordered = FALSE, levels = c("Australia and New Zealand", "Central and Eastern Europe", "Eastern Asia", "Latin America and Caribbean", "Middle East and Northern Africa", "North America", "Southeastern Asia", "Southern Asia", "Sub-Saharan Africa", "Western Europe"))

#Happiness Rank is an ordered factor (1 is the happiest)
join$`Happiness Rank` <- factor(join$`Happiness Rank`, ordered = TRUE, levels = 1:158)

str(join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    147 obs. of  14 variables:
 $ Country                      : chr  "Switzerland" "Iceland" "Denmark" "Norway" ...
 $ Region                       : Factor w/ 10 levels "Australia and New Zealand",..: 10 10 10 10 6 10 10 10 1 1 ...
 $ Happiness Rank               : Ord.factor w/ 158 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Happiness Score              : num  7.59 7.56 7.53 7.52 7.43 ...
 $ Standard Error               : num  0.0341 0.0488 0.0333 0.0388 0.0355 ...
 $ Economy (GDP per Capita)     : num  1.4 1.3 1.33 1.46 1.33 ...
 $ Family                       : num  1.35 1.4 1.36 1.33 1.32 ...
 $ Health (Life Expectancy)     : num  0.941 0.948 0.875 0.885 0.906 ...
 $ Freedom                      : num  0.666 0.629 0.649 0.67 0.633 ...
 $ Trust (Government Corruption): num  0.42 0.141 0.484 0.365 0.33 ...
 $ Generosity                   : num  0.297 0.436 0.341 0.347 0.458 ...
 $ Dystopia Residual            : num  2.52 2.7 2.49 2.47 2.45 ...
 $ latitude                     : num  46.8 65 56.3 60.5 56.1 ...
 $ longitude                    : num  8.23 -19.02 9.5 8.47 -106.35 ...
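
The conversion above relies on two properties of factor(): values that are not among the supplied levels become NA, and an ordered factor supports comparisons between its values. A minimal base R sketch on made-up vectors; the names `regions`, `region_levels` and `ranks` are illustrative only.

```r
# A value that is not listed in `levels` is converted to NA
regions <- c("Western Europe", "North America", "Oceania")
region_levels <- c("North America", "Western Europe")
region_factor <- factor(regions, levels = region_levels, ordered = FALSE)
is.na(region_factor)

# An ordered factor allows rank comparisons
ranks <- factor(c(3, 1, 2), levels = 1:3, ordered = TRUE)
ranks[1] > ranks[2]
```

Since explicit levels silently turn unexpected values into NA, checking for NA values after the conversion is a cheap safeguard.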



Tidy & Manipulate Data I

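#Combining latitude and longitude into a single column
join1 <- join %>% unite(`Country Coordinates`, latitude, longitude, sep = ", ")
#Moving Country Coordinates (column 13) next to Country
join2 <- join1[, c(1,13,2:12)]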
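
A minimal sketch of what unite() does to the two coordinate columns, and how separate() could reverse it if the numeric values are needed again. The data frame `demo` is illustrative and uses the rounded coordinates of the first two rows of the str() output.

```r
library(tidyr)

# Two rows with rounded coordinates (illustrative data frame)
demo <- data.frame(
  Country   = c("Switzerland", "Iceland"),
  latitude  = c(46.8, 65.0),
  longitude = c(8.23, -19.02)
)

# unite() pastes the columns together, so the result is a character column
united <- unite(demo, "Country Coordinates", latitude, longitude, sep = ", ")
united$`Country Coordinates`

# separate() splits the text again; convert = TRUE restores numeric columns
restored <- separate(united, "Country Coordinates",
                     into = c("latitude", "longitude"),
                     sep = ", ", convert = TRUE)
```

The united column is text, so it suits display purposes; any calculation on the coordinates needs the separated numeric columns.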



Tidy & Manipulate Data II

#Residual Value = Dystopia Residual minus the Dystopia happiness score (1.85)
join3 <- join2 %>% mutate(`Residual Value` = `Dystopia Residual` - 1.85)



Scan I

#Checking for NA values
colSums(is.na(join3))
                      Country           Country Coordinates                        Region 
                            0                             0                             0 
               Happiness Rank               Happiness Score                Standard Error 
                            0                             0                             0 
     Economy (GDP per Capita)                        Family      Health (Life Expectancy) 
                            0                             0                             0 
                      Freedom Trust (Government Corruption)                    Generosity 
                            0                             0                             0 
            Dystopia Residual                Residual Value 
                            0                             0 
#Checking for special values
special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))}

#Sum of special values in each column
sapply(join3, function(x){if (is.numeric(x)) sum(special(x))})
$Country
NULL

$`Country Coordinates`
NULL

$Region
NULL

$`Happiness Rank`
NULL

$`Happiness Score`
[1] 0

$`Standard Error`
[1] 0

$`Economy (GDP per Capita)`
[1] 0

$Family
[1] 0

$`Health (Life Expectancy)`
[1] 0

$Freedom
[1] 0

$`Trust (Government Corruption)`
[1] 0

$Generosity
[1] 0

$`Dystopia Residual`
[1] 0

$`Residual Value`
[1] 0
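
The NULL entries above appear because the check returns nothing for non-numeric columns. A minimal base R sketch of a variant that restricts the check to numeric columns and returns a named integer vector; the helper name `count_special` and the data frame `demo` are illustrative assumptions, not part of the original analysis.

```r
# Count infinite and NaN values in every numeric column of a data frame
count_special <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[num_cols],
         function(x) sum(is.infinite(x) | is.nan(x)),
         integer(1))
}

# Made-up data: `score` holds one Inf and one NaN, `rank` holds none
demo <- data.frame(
  name  = c("a", "b", "c"),
  score = c(1.5, Inf, NaN),
  rank  = c(1, 2, 3)
)

special_counts <- count_special(demo)
special_counts
```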



Scan II

#Boxplots before handling outliers
box_1 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`, 
                 join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
                 names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"), 
                 main = "Boxplot to detect outliers")


box_2 <- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`, 
                 join3$`Trust (Government Corruption)`, join3$Generosity,
                 names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
                 main = "Boxplot to detect outliers")


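#Capping function: values beyond the 1.5*IQR fences are replaced by the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile(x, c(.05, .25, .75, .95) )
  x[ x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]  #below the lower fence
  x[ x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]  #above the upper fence
  x}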

join3$`Standard Error` <- join3$`Standard Error` %>% cap()
join3$Family <- join3$Family %>% cap()
join3$`Trust (Government Corruption)` <- join3$`Trust (Government Corruption)` %>% cap()
join3$Generosity <- join3$Generosity %>% cap()
join3$`Dystopia Residual` <- join3$`Dystopia Residual` %>% cap()
join3$`Residual Value` <- join3$`Residual Value` %>% cap()

#Boxplots after capping outliers
box_3 <- boxplot(join3$`Happiness Score`, join3$Family, join3$`Dystopia Residual`, 
                 join3$`Residual Value`, join3$`Economy (GDP per Capita)`,
                 names = c("Happiness", "Family", "Dys Residual", "Res Value", "Economy"), 
                 main = "Boxplot after capping outliers")


box_4<- boxplot(join3$`Health (Life Expectancy)`, join3$Freedom, join3$`Standard Error`, 
                join3$`Trust (Government Corruption)`, join3$Generosity,
                names = c("Health", "Freedom", "Standard Err", "Trust", "Generosity"),
                main = "Boxplot after capping outliers")


boxplot(join3$`Standard Error`, main = "Boxplot of Standard Error")$out
 [1] 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768 0.078768
[12] 0.078768

boxplot(join3$`Trust (Government Corruption)`, main = "Boxplot of Trust")$out
 [1] 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353 0.405353
[12] 0.405353 0.405353
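
The `$out` values above are all identical within each variable, which suggests that the capped points still lie outside the boxplot whiskers: cap() moves them to the 5th or 95th percentile, and that percentile can itself lie beyond the 1.5 × IQR fence. A minimal sketch of this effect on made-up data, together with one possible alternative that clamps values to the whisker ends instead. The vector `x` and the helper `clamp_to_whiskers` are illustrative assumptions, not part of the original analysis.

```r
# Same capping rule as used in this report
cap <- function(x) {
  quantiles <- quantile(x, c(.05, .25, .75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Alternative (assumption): clamp to the whisker ends reported by boxplot.stats()
clamp_to_whiskers <- function(x) {
  whiskers <- boxplot.stats(x)$stats[c(1, 5)]
  pmin(pmax(x, whiskers[1]), whiskers[2])
}

# Made-up data: twenty ordinary values and three large ones
x <- c((1:20) / 10, 8, 9, 10)

capped  <- cap(x)
clamped <- clamp_to_whiskers(x)

# Number of points a boxplot would still flag as outliers
n_out_capped  <- length(boxplot.stats(capped)$out)
n_out_clamped <- length(boxplot.stats(clamped)$out)
c(capped = n_out_capped, clamped = n_out_clamped)
```

Whether the remaining flagged points matter depends on the goal: percentile capping limits the influence of extreme values, while clamping to the whiskers removes them from the boxplot altogether at the cost of pulling them further in.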



Transform

#Trust (Government Corruption)
hist(join3$`Trust (Government Corruption)`, main = "Histogram of Trust",
     xlab = "Trust (Government Corruption)")

sqrt_trust<- sqrt(join3$`Trust (Government Corruption)`)
hist(sqrt_trust, main = "Trust (after Transformation)",
     xlab = "Trust (Government Corruption)")


# Residual value
hist(join3$`Residual Value`, main = "Histogram of Residual Value", 
     xlab = "Residual Value")

box_residual <- BoxCox(join3$`Residual Value`, lambda = "auto")
hist(box_residual, main = "Residual Value (after Transformation)",
     xlab = "Residual Value")


# Dystopia residual
hist(join3$`Dystopia Residual`, main = "Histogram of Dystopia Residual",
     xlab = "Dystopia Residual")

box_dystopia <- BoxCox(join3$`Dystopia Residual`, lambda = "auto")
hist(box_dystopia, main = "Dystopia Residual (after Transformation)",
     xlab = "Dystopia Residual")
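
For reference, the Box-Cox transform maps a positive value x to (x^λ − 1)/λ when λ ≠ 0 and to log(x) when λ = 0; with lambda = "auto", the forecast package estimates λ from the data. The base R sketch below only illustrates the formula for a fixed λ. The helper name `box_cox` is hypothetical and it does not reproduce the automatic choice of λ.

```r
# Box-Cox transform for a fixed lambda (illustrative helper, positive input only)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 4, 9)

# lambda = 0.5 is a shifted and rescaled square root
half <- box_cox(x, 0.5)   # 0 2 4

# lambda = 0 is the natural logarithm
zero <- box_cox(x, 0)
```

The transform is defined for positive values. Residual Value is obtained by subtracting 1.85 from Dystopia Residual, so it may be worth confirming that it contains no zero or negative values before interpreting its transformed histogram.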



References



