Required packages

The following packages are installed and loaded because they provide most of the functions used in this assignment.

#To load the necessary packages
library(readr)
library(magrittr)
library(tidyr)
library(dplyr)
library(car)
library(forecast)

Executive Summary

Data preprocessing covers the steps taken to transform raw data into a clean, analysable format. The chosen datasets record the happiness score, along with a few other parameters such as Happiness Rank, Health (Life Expectancy) and GDP per capita, for countries around the world in 2016 and 2017. Our goal is to compare the prosperity of these nations between 2016 and 2017.

To achieve this, we joined the two datasets by country. The joined dataset contained columns that were outside our interest, so to obtain the final dataset we selected the variables Country, Happiness_Rank_2017, Happiness_Score_2017, Happiness_Rank_2016, Happiness_Score_2016, GDP_Per_Capita_2017, GDP_Per_Capita_2016 and Health_Life_Expectancy. The further steps were: renaming the variables and converting their types, creating the Diff_In_Happiness variable, scanning for missing and special values, capping outliers, and transforming Happiness_Score_2016 towards a normal distribution.

Data

The World Happiness Report is a landmark survey of the state of global happiness. Both datasets were downloaded from https://www.kaggle.com/unsdsn/world-happiness. They review the state of happiness in the world for 2016 and 2017.

2016.csv

  1. Country : Name of the country.
  2. Region : Region the country belongs to.
  3. Happiness Rank : Rank of the country based on the Happiness Score.
  4. Happiness Score : A metric measured in 2016 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest”.
  5. Lower Confidence Interval : Lower Confidence Interval of the Happiness Score.
  6. Upper Confidence Interval : Upper Confidence Interval of the Happiness Score.
  7. Economy (GDP per Capita) : The extent to which GDP contributes to the calculation of the Happiness Score.
  8. Family : The extent to which Family contributes to the calculation of the Happiness Score.
  9. Health (Life Expectancy) : The extent to which Life expectancy contributed to the calculation of the Happiness Score.
  10. Freedom : The extent to which Freedom contributed to the calculation of the Happiness Score.
  11. Trust (Government Corruption) : The extent to which Perception of Corruption contributes to Happiness Score.
  12. Generosity : The extent to which Generosity contributed to the calculation of the Happiness Score.
  13. Dystopia Residual : The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.

2017.csv

  1. Country : Name of the country.
  2. Happiness.Rank : Rank of the country based on the Happiness Score.
  3. Happiness.Score : A metric measured in 2017 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest”.
  4. Whisker.high : Upper bound of the confidence interval of the Happiness Score.
  5. Whisker.low : Lower bound of the confidence interval of the Happiness Score.
  6. Economy..GDP.per.Capita. : The extent to which GDP contributes to the calculation of the Happiness Score.
  7. Family : The extent to which Family contributes to the calculation of the Happiness Score.
  8. Health..Life.Expectancy. : The extent to which Life expectancy contributes to the calculation of the Happiness Score.
  9. Freedom : The extent to which Freedom contributes to the calculation of the Happiness Score.
  10. Generosity : The extent to which Generosity contributes to the calculation of the Happiness Score.
  11. Trust..Government.Corruption. : The extent to which Perception of Corruption contributes to the Happiness Score.
  12. Dystopia.Residual : The extent to which Dystopia Residual contributes to the calculation of the Happiness Score.

We joined the two datasets with the left_join function and then used the select function to keep the variables of interest: Country, Happiness.Rank, Happiness.Score, Happiness Rank, Happiness Score, Economy..GDP.per.Capita., Economy (GDP per Capita) and Health..Life.Expectancy. (the last taken from the 2017 data).

#To read the .csv file
happy2017 <- read_csv("2017.csv")
Parsed with column specification:
cols(
  Country = col_character(),
  Happiness.Rank = col_double(),
  Happiness.Score = col_double(),
  Whisker.high = col_double(),
  Whisker.low = col_double(),
  Economy..GDP.per.Capita. = col_double(),
  Family = col_double(),
  Health..Life.Expectancy. = col_double(),
  Freedom = col_double(),
  Generosity = col_double(),
  Trust..Government.Corruption. = col_double(),
  Dystopia.Residual = col_double()
)
happy2016 <- read_csv("2016.csv")
Parsed with column specification:
cols(
  Country = col_character(),
  Region = col_character(),
  `Happiness Rank` = col_double(),
  `Happiness Score` = col_double(),
  `Lower Confidence Interval` = col_double(),
  `Upper Confidence Interval` = col_double(),
  `Economy (GDP per Capita)` = col_double(),
  Family = col_double(),
  `Health (Life Expectancy)` = col_double(),
  Freedom = col_double(),
  `Trust (Government Corruption)` = col_double(),
  Generosity = col_double(),
  `Dystopia Residual` = col_double()
)
#To join the two datasets by country, keeping every row of the 2017 data
country_combined2017_16 <- left_join(happy2017, happy2016, by = "Country")
#To keep only the variables of interest
happiness_comparision <- country_combined2017_16 %>% select(Country, Happiness.Rank, Happiness.Score, `Happiness Rank`, `Happiness Score`, Economy..GDP.per.Capita., `Economy (GDP per Capita)`, Health..Life.Expectancy.)
print(happiness_comparision)
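
A left join keeps every row of the 2017 table, so a country that is missing from 2016 appears with NA in the 2016 columns. The sketch below illustrates this on two made-up tibbles (toy_2017 and toy_2016 are hypothetical stand-ins, not the real files) and uses anti_join to list the countries that found no match.

```r
library(dplyr)

# Toy stand-ins for the two files; the values are illustrative only
toy_2017 <- tibble(Country = c("A", "B", "C"),
                   Happiness.Score = c(7.5, 6.1, 4.2))
toy_2016 <- tibble(Country = c("A", "B"),
                   `Happiness Score` = c(7.4, 6.3))

# left_join() keeps all 2017 rows; country "C" gets NA for the 2016 score
joined <- left_join(toy_2017, toy_2016, by = "Country")

# anti_join() returns the 2017 rows that have no partner in 2016
unmatched <- anti_join(toy_2017, toy_2016, by = "Country")

print(joined)
print(unmatched)
```

Checking the unmatched rows before selecting columns makes it easier to explain any missing values found later.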

Understand

We renamed the variables with the colnames function to improve readability. The str function shows the structure of the data frame and the type of each variable.

Furthermore, we converted Health_Life_Expectancy, Happiness_Rank_2017 and Happiness_Rank_2016 from numeric to ordered factors. Health_Life_Expectancy was first multiplied by 100 and binned into five intervals with the cut function; the factor function was then used with ordered = TRUE.

Country is a character variable and the remaining variables are numeric.

#To change the name of variables
colnames(happiness_comparision)[2] <- "Happiness_Rank_2017"
colnames(happiness_comparision)[3] <- "Happiness_Score_2017"
colnames(happiness_comparision)[4] <- "Happiness_Rank_2016"
colnames(happiness_comparision)[5] <- "Happiness_Score_2016"
colnames(happiness_comparision)[6] <- "GDP_Per_Capita_2017"
colnames(happiness_comparision)[7] <- "GDP_Per_Capita_2016"
colnames(happiness_comparision)[8] <- "Health_Life_Expectancy"
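
As an alternative to renaming by position, the rename function from dplyr maps each new name to the old one explicitly, so it keeps working if the column order changes. This is a minimal sketch on a hypothetical tibble (toy and its values are made up), not the code used in this report.

```r
library(dplyr)

toy <- tibble(
  Country = c("A", "B"),
  Happiness.Rank = c(1, 2),
  `Happiness Rank` = c(2, 1)
)

# rename() takes new_name = old_name pairs; backticks quote names with spaces
toy_renamed <- toy %>%
  rename(
    Happiness_Rank_2017 = Happiness.Rank,
    Happiness_Rank_2016 = `Happiness Rank`
  )

names(toy_renamed)
```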

str(happiness_comparision)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    155 obs. of  8 variables:
 $ Country               : chr  "Norway" "Denmark" "Iceland" "Switzerland" ...
 $ Happiness_Rank_2017   : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Happiness_Score_2017  : num  7.54 7.52 7.5 7.49 7.47 ...
 $ Happiness_Rank_2016   : num  4 1 3 2 5 7 6 8 10 9 ...
 $ Happiness_Score_2016  : num  7.5 7.53 7.5 7.51 7.41 ...
 $ GDP_Per_Capita_2017   : num  1.62 1.48 1.48 1.56 1.44 ...
 $ GDP_Per_Capita_2016   : num  1.58 1.44 1.43 1.53 1.41 ...
 $ Health_Life_Expectancy: num  0.797 0.793 0.834 0.858 0.809 ...
#To rescale Health_Life_Expectancy to 0-100 and bin it into five intervals
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy*100
breaks <- c(0,20,40,60,80,100)
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy %>% cut(breaks = breaks, include.lowest = TRUE)
levels(happiness_comparision$Health_Life_Expectancy)
[1] "[0,20]"   "(20,40]"  "(40,60]"  "(60,80]"  "(80,100]"
#To relabel the intervals and make the factor ordered
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy %>% factor(levels = c("[0,20]", "(20,40]" , "(40,60]" , "(60,80]" , "(80,100]"), labels = c("<=20","21-40","41-60","61-80",">80"), ordered = TRUE)
levels(happiness_comparision$Health_Life_Expectancy)
[1] "<=20"  "21-40" "41-60" "61-80" ">80"  
class(happiness_comparision$Health_Life_Expectancy)
[1] "ordered" "factor" 
#To convert the ranks to ordered factors
happiness_comparision$Happiness_Rank_2017 <- happiness_comparision$Happiness_Rank_2017 %>% factor(ordered = TRUE)
levels(happiness_comparision$Happiness_Rank_2017)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14" 
 [15] "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28" 
 [29] "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42" 
 [43] "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56" 
 [57] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70" 
 [71] "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
 [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98" 
 [99] "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112"
[113] "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
[127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140"
[141] "141" "142" "143" "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
[155] "155"
class(happiness_comparision$Happiness_Rank_2017)
[1] "ordered" "factor" 
happiness_comparision$Happiness_Rank_2016 <- happiness_comparision$Happiness_Rank_2016 %>% factor(ordered = TRUE)
levels(happiness_comparision$Happiness_Rank_2016)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14" 
 [15] "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29" 
 [29] "30"  "31"  "32"  "33"  "34"  "36"  "37"  "38"  "39"  "41"  "42"  "43"  "44"  "45" 
 [43] "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "59"  "60" 
 [57] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74" 
 [71] "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88"  "89" 
 [85] "90"  "91"  "92"  "93"  "94"  "95"  "96"  "98"  "99"  "100" "101" "103" "104" "105"
 [99] "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[113] "120" "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132" "133"
[127] "134" "135" "136" "137" "139" "140" "141" "142" "143" "144" "145" "147" "148" "149"
[141] "150" "151" "152" "153" "154" "155" "156" "157"
class(happiness_comparision$Happiness_Rank_2016)
[1] "ordered" "factor" 
class(happiness_comparision$Country)
[1] "character"
class(happiness_comparision$Happiness_Score_2017)
[1] "numeric"
class(happiness_comparision$Happiness_Score_2016)
[1] "numeric"
class(happiness_comparision$GDP_Per_Capita_2017)
[1] "numeric"
class(happiness_comparision$GDP_Per_Capita_2016)
[1] "numeric"
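
The binning and relabelling of Health_Life_Expectancy can also be done in a single call, because cut accepts the labels and ordered_result arguments. A minimal base R sketch on made-up scores (toy_health is hypothetical):

```r
# Hypothetical life-expectancy contributions, already rescaled to 0-100
toy_health <- c(79.7, 35, 5, 95)

breaks <- c(0, 20, 40, 60, 80, 100)
labels <- c("<=20", "21-40", "41-60", "61-80", ">80")

# labels names the bins; ordered_result = TRUE returns an ordered factor
binned <- cut(toy_health, breaks = breaks, labels = labels,
              include.lowest = TRUE, ordered_result = TRUE)

print(binned)
class(binned)
```

This avoids typing the interval strings such as "(20,40]" a second time, where a small mismatch would silently produce NA values.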

Tidy & Manipulate Data I

A dataset is tidy when it follows three interrelated rules (Wickham and Grolemund, 2016). In tidy data:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Our dataset obeys all of the above principles. The first 10 rows are displayed using the print() function to show that the dataset is tidy.

print(happiness_comparision[ , (1:4)])
print(happiness_comparision[ , -(1:4)])
dim(happiness_comparision)
[1] 155   8
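
The layout above keeps one row per country, with the year encoded in the column names. If one row per country and year were preferred, for example for plotting, the year-specific columns could be reshaped with pivot_longer from tidyr. This is only a sketch on a made-up tibble (toy_wide is hypothetical) and is not a step taken in this report.

```r
library(dplyr)
library(tidyr)

toy_wide <- tibble(
  Country = c("A", "B"),
  Happiness_Score_2017 = c(7.5, 6.1),
  Happiness_Score_2016 = c(7.4, 6.3),
  GDP_Per_Capita_2017 = c(1.6, 1.2),
  GDP_Per_Capita_2016 = c(1.5, 1.1)
)

# ".value" keeps the part of the name before the year as the column name;
# the trailing four digits become the Year column
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -Country,
    names_to = c(".value", "Year"),
    names_pattern = "(.+)_(\\d{4})$"
  )

print(toy_long)
```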

Tidy & Manipulate Data II

We created a new variable, Diff_In_Happiness, with the mutate function. It is the absolute difference between the happiness scores of 2017 and 2016.

#To create a new variable
happiness_comparision <- happiness_comparision %>% mutate(Diff_In_Happiness = abs(Happiness_Score_2017 - Happiness_Score_2016))

print(happiness_comparision[ , c(1,3,5,9)])
dim(happiness_comparision)
[1] 155   9
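
Because abs discards the sign, Diff_In_Happiness shows how much a score changed but not whether it rose or fell. A signed change could be kept alongside it, as in this sketch on made-up scores (toy_scores and Change_In_Happiness are hypothetical names, not part of the dataset):

```r
library(dplyr)

toy_scores <- tibble(
  Country = c("A", "B"),
  Happiness_Score_2017 = c(7.5, 6.0),
  Happiness_Score_2016 = c(7.0, 6.5)
)

toy_scores <- toy_scores %>%
  mutate(
    # Positive means happier in 2017, negative means less happy
    Change_In_Happiness = Happiness_Score_2017 - Happiness_Score_2016,
    # Magnitude only, as in the report
    Diff_In_Happiness = abs(Change_In_Happiness)
  )

print(toy_scores)
```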

Scan I

To scan for missing values, inconsistencies and obvious errors, we used the is.na, is.infinite and is.nan functions. We found 20 missing values and no special values (infinite or NaN).

The missing values make up 1.433692% of the data (20 out of 155 x 9 = 1395 values). Since this is less than 5%, we removed the affected rows with na.omit. The 5 omitted rows are the countries that are present in the 2017 dataset but not in the 2016 one. Each of them lacks the four values that depend on 2016 data (Happiness_Rank_2016, Happiness_Score_2016, GDP_Per_Capita_2016 and Diff_In_Happiness), which accounts for the 20 missing values.

#To find sum of missing values
sum(is.na(happiness_comparision))
[1] 20
#To find sum of infinite values
sum(sapply(happiness_comparision, is.infinite))
[1] 0
#To find sum of not a number(NAN) values
sum(sapply(happiness_comparision, is.nan))
[1] 0
Total_value <- nrow(happiness_comparision) * ncol(happiness_comparision)
NA_Percentage <- sum(is.na(happiness_comparision)) * 100 / Total_value
NA_Percentage
[1] 1.433692
happiness_comparision <- na.omit(happiness_comparision)
sum(is.na(happiness_comparision))
[1] 0
dim(happiness_comparision)
[1] 150   9
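
To see where the missing values sit before dropping them, the counts can be broken down by column and the incomplete rows listed. A minimal base R sketch on a made-up data frame (toy_na is hypothetical):

```r
toy_na <- data.frame(
  Country = c("A", "B", "C"),
  Happiness_Score_2017 = c(7.5, 6.1, 4.2),
  Happiness_Score_2016 = c(7.4, NA, 4.0),
  GDP_Per_Capita_2016 = c(1.5, NA, 0.6)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_na))

# Rows that na.omit() would drop
incomplete_rows <- toy_na[!complete.cases(toy_na), ]

# Percentage of missing cells, computed the same way as in the report
na_percentage <- sum(is.na(toy_na)) * 100 / (nrow(toy_na) * ncol(toy_na))

print(na_per_column)
print(incomplete_rows)
```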

Scan II

We used the Boxplot function from the car package to check for outliers in the numeric variables. Figure-5 shows that the Diff_In_Happiness variable has outliers. We handled them by capping: values below Q1 - 1.5*IQR were replaced with the 5th percentile and values above Q3 + 1.5*IQR with the 95th percentile. Figure-6 shows that no outliers remain in Diff_In_Happiness after capping.

#To visualize outliers
Boxplot(happiness_comparision$Happiness_Score_2016)

Figure - 1

Boxplot(happiness_comparision$Happiness_Score_2017)

Figure - 2

Boxplot(happiness_comparision$GDP_Per_Capita_2016)

Figure - 3

Boxplot(happiness_comparision$GDP_Per_Capita_2017)

Figure - 4

Boxplot(happiness_comparision$Diff_In_Happiness)
[1]  52  80 103 137 141

Figure - 5

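#To cap outliers: values beyond the 1.5*IQR fences are replaced
#with the 5th or 95th percentile
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  #Compute the IQR once so both fences use the original data
  iqr <- IQR(x)
  x[ x < quantiles[2] - 1.5*iqr ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*iqr ] <- quantiles[4]
  x
}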

happiness_comparision$Diff_In_Happiness <- happiness_comparision$Diff_In_Happiness %>% cap()
Boxplot(happiness_comparision$Diff_In_Happiness)

Figure - 6
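
The effect of capping can be checked on a small made-up vector. The sketch below uses a version of the cap function so that it runs on its own; toy_x is hypothetical, and boxplot.stats from base R is used here in place of Boxplot to list the points a box plot would flag.

```r
cap <- function(x) {
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  iqr <- IQR(x)
  # Below the lower fence: replace with the 5th percentile
  x[x < quantiles[2] - 1.5 * iqr] <- quantiles[1]
  # Above the upper fence: replace with the 95th percentile
  x[x > quantiles[3] + 1.5 * iqr] <- quantiles[4]
  x
}

# Nineteen ordinary values and one extreme value
toy_x <- c(1:19, 100)

outliers_before <- boxplot.stats(toy_x)$out
toy_capped <- cap(toy_x)
outliers_after <- boxplot.stats(toy_capped)$out

print(outliers_before)
print(outliers_after)
```

Capping does not guarantee an outlier-free result in every dataset, since the 95th percentile can itself lie beyond the fence, so re-plotting after capping (as in Figure-6) is a sensible check.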

Transform

To reduce the skewness of Happiness_Score_2016 (shown in Figure-7) and bring its distribution closer to normal, we applied a Box-Cox transformation. The resulting distribution (shown in Figure-8) was still right-skewed, so we applied a square root transformation to boxcox_Happiness_Score_2016 to obtain an approximately normal distribution (shown in Figure-9). The BoxCox and sqrt functions were used for the transformations, and the hist function was used to visualize the distributions.

The output of the summary function supports this: the mean (3.395) and the median (3.364) are almost equal, differing by only 0.031, which is well below 0.5. This indicates an approximately symmetric distribution, consistent with the near-normal shape in Figure-9.

The variable Tranformed_Happiness_Score_2016 can therefore be treated as approximately normally distributed.

#To visualize distribution
hist(happiness_comparision$Happiness_Score_2016)

Figure - 7

happiness_comparision$boxcox_Happiness_Score_2016 <- BoxCox(happiness_comparision$`Happiness_Score_2016`,lambda = "auto")
hist(happiness_comparision$boxcox_Happiness_Score_2016)

Figure - 8

happiness_comparision$Tranformed_Happiness_Score_2016 <- sqrt(happiness_comparision$boxcox_Happiness_Score_2016)
hist(happiness_comparision$Tranformed_Happiness_Score_2016)

Figure - 9

summary(happiness_comparision$Tranformed_Happiness_Score_2016)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.829   2.798   3.364   3.395   3.927   4.699 
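
For positive data, the Box-Cox transformation is (x^lambda - 1) / lambda, with log(x) as the limiting case at lambda = 0. The base R sketch below writes this out and adds a Shapiro-Wilk test as a numerical complement to the histograms. Here box_cox is a hypothetical helper written for illustration, lambda = 0.5 is an arbitrary choice rather than the value that BoxCox selects with lambda = "auto", and toy_values is simulated data.

```r
# Hypothetical helper: Box-Cox transformation for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

set.seed(1)
# Simulated right-skewed positive values
toy_values <- rlnorm(150, meanlog = 1.5, sdlog = 0.3)

transformed <- box_cox(toy_values, lambda = 0.5)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro_result <- shapiro.test(transformed)
print(shapiro_result$p.value)
```

A test like this, or a normal Q-Q plot from qqnorm, gives a more direct check of normality than comparing the mean and median alone.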

Reference

