Required packages
The packages below are installed and loaded because they contain many of the functions used in this assignment.
#To load the necessary packages
library(readr)
library(magrittr)
library(tidyr)
library(dplyr)
library(car)
library(forecast)
Executive Summary
Data preprocessing covers the steps taken to transform raw data into an understandable format. The chosen datasets report the happiness score, along with a few other measures such as Happiness Rank, Health Life Expectancy and GDP Per Capita, for countries around the world for the years 2016 and 2017. Our goal is to compare the prosperity of these nations between 2016 and 2017.
To achieve this, we joined the two datasets by country. The joined dataset contained columns that were outside the scope of our analysis, so to obtain the working dataset we kept only the variables of interest: Country, Happiness_Rank_2017, Happiness_Score_2017, Happiness_Rank_2016, Happiness_Score_2016, GDP_Per_Capita_2017, GDP_Per_Capita_2016 and Health_Life_Expectancy. The further steps taken are as follows:
- As part of tidying the dataset, we started by renaming the variables.
- The data type of Happiness_Rank_2016 and Happiness_Rank_2017 was converted to an ordered factor. The variable Health_Life_Expectancy, which was stored as a decimal fraction, was converted to a percentage and then grouped into bands as an ordered factor for better understanding. This step encodes the categorical values.
- To compare the data from 2016 and 2017, we created a new column, Diff_In_Happiness, which gives the absolute difference between the Happiness Scores of the two years.
- Before drawing conclusions from the dataset, it is crucial to deal with missing data. The percentage of missing values in our dataset was calculated and found to be less than 5%, so the rows containing them were omitted on the basis that this would have little impact on the analysis.
- Handling outliers is also an important step in data preprocessing. Among the numeric variables, only Diff_In_Happiness had outliers. These were handled by capping them with the nearest quantile value.
- Almost all numerical variables were normally distributed, but Happiness_Rank_2016 showed skewness. We applied Box-Cox and square root transformations to this variable to bring it closer to a normal distribution.
Data
The World Happiness Report is a landmark survey of the state of global happiness. Both datasets were downloaded from https://www.kaggle.com/unsdsn/world-happiness. They review the state of happiness in the world for 2016 and 2017.
2016.csv
- Country : Name of the country.
- Region : Region the country belongs to.
- Happiness Rank : Rank of the country based on the Happiness Score.
- Happiness Score : A metric measured in 2016 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest”.
- Lower Confidence Interval : Lower Confidence Interval of the Happiness Score.
- Upper Confidence Interval : Upper Confidence Interval of the Happiness Score.
- Economy (GDP per Capita) : The extent to which GDP contributes to the calculation of the Happiness Score.
- Family : The extent to which Family contributes to the calculation of the Happiness Score.
- Health (Life Expectancy) : The extent to which Life expectancy contributed to the calculation of the Happiness Score.
- Freedom : The extent to which Freedom contributed to the calculation of the Happiness Score.
- Trust (Government Corruption) : The extent to which Perception of Corruption contributes to Happiness Score.
- Generosity : The extent to which Generosity contributed to the calculation of the Happiness Score.
- Dystopia Residual : The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.
2017.csv
- Country : Name of the country.
- Happiness.Rank : Rank of the country based on the Happiness Score.
- Happiness.Score : A metric measured in 2017 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest”.
- Whisker.high : Upper whisker (upper bound) reported for the Happiness Score.
- Whisker.low : Lower whisker (lower bound) reported for the Happiness Score.
- Economy..GDP.per.Capita. : The extent to which GDP contributes to the calculation of the Happiness Score.
- Freedom : The extent to which Freedom contributed to the calculation of the Happiness Score.
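- Family : The extent to which Family contributes to the calculation of the Happiness Score.
- Health..Life.Expectancy. : The extent to which Life expectancy contributed to the calculation of the Happiness Score.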
- Generosity : The extent to which Generosity contributed to the calculation of the Happiness Score.
- Trust..Government.Corruption. : The extent to which Perception of Corruption contributes to Happiness Score.
- Dystopia.Residual : The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.
We joined the two datasets by Country using the left_join function and then used the select function to keep the variables of interest: Country, Happiness.Rank, Happiness.Score, Happiness Rank, Happiness Score, Economy..GDP.per.Capita., Economy (GDP per Capita) and Health..Life.Expectancy.
#To read the .csv file
happy2017 <- read_csv("2017.csv")
Parsed with column specification:
cols(
Country = col_character(),
Happiness.Rank = col_double(),
Happiness.Score = col_double(),
Whisker.high = col_double(),
Whisker.low = col_double(),
Economy..GDP.per.Capita. = col_double(),
Family = col_double(),
Health..Life.Expectancy. = col_double(),
Freedom = col_double(),
Generosity = col_double(),
Trust..Government.Corruption. = col_double(),
Dystopia.Residual = col_double()
)
happy2016 <- read_csv("2016.csv")
Parsed with column specification:
cols(
Country = col_character(),
Region = col_character(),
`Happiness Rank` = col_double(),
`Happiness Score` = col_double(),
`Lower Confidence Interval` = col_double(),
`Upper Confidence Interval` = col_double(),
`Economy (GDP per Capita)` = col_double(),
Family = col_double(),
`Health (Life Expectancy)` = col_double(),
Freedom = col_double(),
`Trust (Government Corruption)` = col_double(),
Generosity = col_double(),
`Dystopia Residual` = col_double()
)
#To join the two datasets by country and select the variables of interest
country_combined2017_16 <- left_join(happy2017, happy2016, by = "Country")
happiness_comparision <- country_combined2017_16 %>% select(Country, Happiness.Rank, Happiness.Score, `Happiness Rank`, `Happiness Score`, Economy..GDP.per.Capita., `Economy (GDP per Capita)`, Health..Life.Expectancy.)
print(happiness_comparision)
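A left join keeps every row of the first table and fills the columns of the second table with NA where no match is found. The sketch below illustrates this on two small made-up tibbles rather than the Kaggle files; the names toy2017 and toy2016 and all values are assumptions for illustration only. It also uses anti_join to list the countries that have no match.

```r
library(dplyr)

# Toy stand-ins for the two yearly tables (values invented for illustration)
toy2017 <- tibble(Country = c("A", "B", "C"), Happiness.Score = c(7.5, 6.1, 4.2))
toy2016 <- tibble(Country = c("A", "B"), `Happiness Score` = c(7.4, 6.3))

# left_join keeps all rows of toy2017; unmatched 2016 columns become NA
joined <- left_join(toy2017, toy2016, by = "Country")
print(joined)

# anti_join returns the rows of toy2017 that have no match in toy2016
unmatched <- anti_join(toy2017, toy2016, by = "Country")
print(unmatched$Country)
```

Running the same anti_join on the real tables would be one way to see which countries will later show up as missing values.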
Understand
We renamed the variables to improve readability using the colnames function. The str function shows the structure of the data frame and the type of each variable.
Furthermore, we converted Health_Life_Expectancy, Happiness_Rank_2017 and Happiness_Rank_2016 from numeric to ordered factor. Health_Life_Expectancy was first multiplied by 100 and binned with the cut function; the factor function was then used with ordered = TRUE.
Country is of character type and the remaining variables are numeric.
#To change the name of variables
colnames(happiness_comparision)[2] <- "Happiness_Rank_2017"
colnames(happiness_comparision)[3] <- "Happiness_Score_2017"
colnames(happiness_comparision)[4] <- "Happiness_Rank_2016"
colnames(happiness_comparision)[5] <- "Happiness_Score_2016"
colnames(happiness_comparision)[6] <- "GDP_Per_Capita_2017"
colnames(happiness_comparision)[7] <- "GDP_Per_Capita_2016"
colnames(happiness_comparision)[8] <- "Health_Life_Expectancy"
str(happiness_comparision)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 155 obs. of 8 variables:
$ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
$ Happiness_Rank_2017 : num 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness_Score_2017 : num 7.54 7.52 7.5 7.49 7.47 ...
$ Happiness_Rank_2016 : num 4 1 3 2 5 7 6 8 10 9 ...
$ Happiness_Score_2016 : num 7.5 7.53 7.5 7.51 7.41 ...
$ GDP_Per_Capita_2017 : num 1.62 1.48 1.48 1.56 1.44 ...
$ GDP_Per_Capita_2016 : num 1.58 1.44 1.43 1.53 1.41 ...
$ Health_Life_Expectancy: num 0.797 0.793 0.834 0.858 0.809 ...
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy*100
breaks <- c(0,20,40,60,80,100)
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy %>% cut(breaks = breaks, include.lowest = TRUE)
levels(happiness_comparision$Health_Life_Expectancy)
[1] "[0,20]" "(20,40]" "(40,60]" "(60,80]" "(80,100]"
happiness_comparision$Health_Life_Expectancy <- happiness_comparision$Health_Life_Expectancy %>% factor(levels = c("[0,20]", "(20,40]" , "(40,60]" , "(60,80]" , "(80,100]"), labels = c("<=20","21-40","41-60","61-80",">80"), ordered = TRUE)
levels(happiness_comparision$Health_Life_Expectancy)
[1] "<=20" "21-40" "41-60" "61-80" ">80"
class(happiness_comparision$Health_Life_Expectancy)
[1] "ordered" "factor"
happiness_comparision$Happiness_Rank_2017 <- happiness_comparision$Happiness_Rank_2017 %>% factor(ordered = TRUE)
levels(happiness_comparision$Happiness_Rank_2017)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
[15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
[29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
[43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
[71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
[85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
[99] "99" "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112"
[113] "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
[127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140"
[141] "141" "142" "143" "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
[155] "155"
class(happiness_comparision$Happiness_Rank_2017)
[1] "ordered" "factor"
happiness_comparision$Happiness_Rank_2016 <- happiness_comparision$Happiness_Rank_2016 %>% factor(ordered = TRUE)
levels(happiness_comparision$Happiness_Rank_2016)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
[15] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[29] "30" "31" "32" "33" "34" "36" "37" "38" "39" "41" "42" "43" "44" "45"
[43] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "59" "60"
[57] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74"
[71] "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
[85] "90" "91" "92" "93" "94" "95" "96" "98" "99" "100" "101" "103" "104" "105"
[99] "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[113] "120" "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132" "133"
[127] "134" "135" "136" "137" "139" "140" "141" "142" "143" "144" "145" "147" "148" "149"
[141] "150" "151" "152" "153" "154" "155" "156" "157"
class(happiness_comparision$Happiness_Rank_2016)
[1] "ordered" "factor"
class(happiness_comparision$Country)
[1] "character"
class(happiness_comparision$Happiness_Score_2017)
[1] "numeric"
class(happiness_comparision$Happiness_Score_2016)
[1] "numeric"
class(happiness_comparision$GDP_Per_Capita_2017)
[1] "numeric"
class(happiness_comparision$GDP_Per_Capita_2016)
[1] "numeric"
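As a possible shortcut for the Health_Life_Expectancy conversion, cut can bin, label and order in a single call through its labels and ordered_result arguments. The sketch below uses a short invented vector (health) in place of the real column, so it is an illustration of the technique rather than the code used in this report.

```r
# Toy proportions standing in for Health_Life_Expectancy (invented values)
health <- c(0.15, 0.45, 0.797, 0.858)

# Scale to percent, then bin and label in one call;
# ordered_result = TRUE returns an ordered factor directly
health_band <- cut(health * 100,
                   breaks = c(0, 20, 40, 60, 80, 100),
                   labels = c("<=20", "21-40", "41-60", "61-80", ">80"),
                   include.lowest = TRUE,
                   ordered_result = TRUE)

print(health_band)
print(is.ordered(health_band))
```

Because the result is ordered, comparisons such as health_band[1] < health_band[4] are meaningful.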
Tidy & Manipulate Data I
Three interrelated rules make a dataset tidy (Wickham and Grolemund 2016). In tidy data:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Our dataset obeys all of the above principles. The first 10 rows are displayed using the print() function to show that the dataset is tidy.
print(happiness_comparision[ , (1:4)])
print(happiness_comparision[ , -(1:4)])
dim(happiness_comparision)
[1] 155 8
Tidy & Manipulate Data II
Using the mutate function, we created a new variable, Diff_In_Happiness, which is the absolute difference between the happiness scores of 2017 and 2016.
#To create a new variable
happiness_comparision <- happiness_comparision %>% mutate(Diff_In_Happiness = abs(Happiness_Score_2017 - Happiness_Score_2016))
print(happiness_comparision[ , c(1,3,5,9)])
dim(happiness_comparision)
[1] 155 9
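An absolute difference records how much a score moved but not in which direction. The sketch below, on invented values, shows how mutate can create several columns at once; Change_In_Happiness is a hypothetical column that does not exist in this report and is included only to contrast the signed change with the absolute one.

```r
library(dplyr)

# Toy scores for two countries (values invented for illustration)
toy <- tibble(
  Country = c("A", "B"),
  Happiness_Score_2017 = c(7.5, 6.1),
  Happiness_Score_2016 = c(7.4, 6.3)
)

toy <- toy %>%
  mutate(
    # Hypothetical column: signed change, negative when the score fell
    Change_In_Happiness = Happiness_Score_2017 - Happiness_Score_2016,
    # Same definition as Diff_In_Happiness in the report
    Diff_In_Happiness = abs(Change_In_Happiness)
  )

print(toy)
```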
Scan I
To scan for missing values, inconsistencies and obvious errors, we used the is.na, is.infinite and is.nan functions. We found 20 missing values and no special values (infinite or NaN).
The missing values make up 1.433692% of the data (20 out of 1395 values). Since this is less than 5%, we omitted the affected rows using na.omit before proceeding. The 5 omitted rows correspond to countries that are present in the 2017 dataset but not in the 2016 dataset.
#To find sum of missing values
sum(is.na(happiness_comparision))
[1] 20
#To find sum of infinite values
sum(sapply(happiness_comparision, is.infinite))
[1] 0
#To find sum of not a number(NAN) values
sum(sapply(happiness_comparision, is.nan))
[1] 0
Total_value <- nrow(happiness_comparision) * ncol(happiness_comparision)
NA_Percentage <- sum(is.na(happiness_comparision)) * 100 / Total_value
NA_Percentage
[1] 1.433692
happiness_comparision <- na.omit(happiness_comparision)
sum(is.na(happiness_comparision))
[1] 0
dim(happiness_comparision)
[1] 150 9
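To see where missing values are concentrated, it can help to count them per column and to list the incomplete rows before dropping them. The following is a minimal base R sketch on an invented data frame (toy); the column names Score_2017 and Score_2016 are placeholders, not the report's variables.

```r
# Toy data frame with missing values in the 2016 column only (invented values)
toy <- data.frame(
  Country    = c("A", "B", "C", "D"),
  Score_2017 = c(7.5, 6.1, 4.2, 5.0),
  Score_2016 = c(7.4, NA, 4.0, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that na.omit() would drop
incomplete <- toy[!complete.cases(toy), ]
print(incomplete$Country)

# Share of rows lost versus share of cells missing
pct_rows_lost  <- 100 * nrow(incomplete) / nrow(toy)
pct_cells_lost <- 100 * sum(is.na(toy)) / (nrow(toy) * ncol(toy))
print(c(rows = pct_rows_lost, cells = pct_cells_lost))
```

The two percentages answer different questions. In this report the cell-level figure is 20 out of 1395, while the row-level figure is 5 out of 155 rows, roughly 3.2%; both are below the 5% threshold.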
Scan II
We used the Boxplot function to check the numeric variables for outliers. Figure-5 shows that the Diff_In_Happiness variable has outliers. We handled these by capping: values beyond the boxplot fences (more than 1.5 times the IQR below the first quartile or above the third quartile) were replaced with the 5th or 95th percentile, respectively. Figure-6 shows that there are no outliers in Diff_In_Happiness after capping.
#To visualize outliers
Boxplot(happiness_comparision$Happiness_Score_2016)

Figure - 1
Boxplot(happiness_comparision$Happiness_Score_2017)

Figure - 2
Boxplot(happiness_comparision$GDP_Per_Capita_2016)

Figure - 3
Boxplot(happiness_comparision$GDP_Per_Capita_2017)

Figure - 4
Boxplot(happiness_comparision$Diff_In_Happiness)
[1] 52 80 103 137 141

Figure - 5
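cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  # Values below Q1 - 1.5*IQR are replaced with the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  # Values above Q3 + 1.5*IQR are replaced with the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}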
happiness_comparision$Diff_In_Happiness <- happiness_comparision$Diff_In_Happiness %>% cap()
Boxplot(happiness_comparision$Diff_In_Happiness)

Figure - 6
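The capping rule can be checked by hand on a small vector. The sketch below repeats the cap function so that it runs on its own and applies it to an invented vector x with one extreme value; upper_fence is a helper name introduced here for illustration.

```r
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Toy vector with one obviously extreme value (invented for illustration)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Upper boxplot fence: third quartile plus 1.5 times the IQR
upper_fence <- quantile(x, 0.75) + 1.5 * IQR(x)

capped <- cap(x)

print(x[x > upper_fence])  # values flagged as outliers before capping
print(max(capped))         # the extreme value is replaced by the 95th percentile
```

In a small or heavily skewed sample the 95th percentile can itself lie beyond the fence, so a capped value may still be flagged when the boxplot is redrawn. Re-plotting after capping, as done for Figure-6, is therefore a useful check.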
Reference
- Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
