“Happiness is when what you think, what you say, and what you do are in harmony.” Mahatma Gandhi
Two datasets are used on this Homework:
The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2018, ranks 156 countries by their happiness levels, and 117 countries by the happiness of their immigrants.
http://worldhappiness.report/ed/2018/
The second report is the “Human Development Report” from the United Nations Development Programme. This report is a composite index measuring average achievement in three basic dimensions of human development-a long and healthy life, knowledge and a decent standard of living.
http://hdr.undp.org/en/data (Go to Data>Table 1: Human Development Index and its components)
The codes below reads onto the two the data sources.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(ggfortify)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(tools)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(ggthemes)
library(NbClust)
library(readxl)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:cluster':
##
## votes.repub
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.3
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 3.6.3
library(summarytools)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
library(data.table)
## Warning: package 'data.table' was built under R version 3.6.3
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(rgl)
## Registered S3 method overwritten by 'shiny':
## method from
## print.key_missing fastmap
library(plot3D)
library(ggiraph)
library(RColorBrewer)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
Reading the data for the two sources:
library(readr)
## Warning: package 'readr' was built under R version 3.6.3
library("readxl")
Reading Human development Index and World Happiness 2018
human_development <- read_excel("F:/MS.c_Statistics/Kansas_University/806 Data Visualization/Lesson 8/Homework/2018_Statistical_Annex_Table_1.xlsx")
## New names:
## * `HDI rank` -> `HDI rank...1`
## * `HDI rank` -> `HDI rank...9`
X2018<-read_excel("F:/MS.c_Statistics/Kansas_University/806 Data Visualization/Lesson 8/Homework/WHR2018.xls")
summary(human_development)
## HDI rank...1 Country Human Development Index (HDI)
## Length:261 Length:261 Length:261
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Life expectancy at birth Expected years of schooling Mean years of schooling
## Length:261 Length:261 Length:261
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Gross national income (GNI) per capita GNI per capita rank minus HDI rank
## Length:261 Length:261
## Class :character Class :character
## Mode :character Mode :character
## HDI rank...9
## Length:261
## Class :character
## Mode :character
view(dfSummary(human_development))
## Switching method to 'browser'
## Output file written: C:\Users\Oswaldo\AppData\Local\Temp\RtmpKWdT6D\file4aa85eba78a1.html
Which cols are incorrectly classed?
sapply(human_development, class) # which cols are incorrectly classed?
## HDI rank...1 Country
## "character" "character"
## Human Development Index (HDI) Life expectancy at birth
## "character" "character"
## Expected years of schooling Mean years of schooling
## "character" "character"
## Gross national income (GNI) per capita GNI per capita rank minus HDI rank
## "character" "character"
## HDI rank...9
## "character"
Changing from character to
human_development[, c(3: 9)]<- sapply(human_development[ , c(3:9)], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
human_development$Country = as.character(human_development$Country)
human_development$`HDI rank..1`=as.numeric(human_development$`HDI rank...1`)
## Warning: NAs introduced by coercion
sapply(human_development, class) # which cols are incorrectly classed?
## HDI rank...1 Country
## "character" "character"
## Human Development Index (HDI) Life expectancy at birth
## "numeric" "numeric"
## Expected years of schooling Mean years of schooling
## "numeric" "numeric"
## Gross national income (GNI) per capita GNI per capita rank minus HDI rank
## "numeric" "numeric"
## HDI rank...9 HDI rank..1
## "numeric" "numeric"
We can see that all variables are character. We need to convert most of them to numeric type with the exception of country, that should be character.
In the case of X2008…
summary(X2018)
## Country year Life Ladder Log GDP per capita
## Length:1562 Min. :2005 Min. :2.662 Min. : 6.377
## Class :character 1st Qu.:2009 1st Qu.:4.606 1st Qu.: 8.311
## Mode :character Median :2012 Median :5.333 Median : 9.399
## Mean :2012 Mean :5.434 Mean : 9.221
## 3rd Qu.:2015 3rd Qu.:6.271 3rd Qu.:10.191
## Max. :2017 Max. :8.019 Max. :11.770
## NA's :27
## Social support Healthy life expectancy at birth Freedom to make life choices
## Min. :0.2902 Min. :37.77 Min. :0.2575
## 1st Qu.:0.7483 1st Qu.:57.30 1st Qu.:0.6338
## Median :0.8330 Median :63.80 Median :0.7480
## Mean :0.8107 Mean :62.25 Mean :0.7290
## 3rd Qu.:0.9043 3rd Qu.:68.10 3rd Qu.:0.8436
## Max. :0.9873 Max. :76.54 Max. :0.9852
## NA's :13 NA's :9 NA's :29
## Generosity Perceptions of corruption Positive affect
## Min. :-0.32295 Min. :0.0352 Min. :0.3625
## 1st Qu.:-0.11431 1st Qu.:0.6974 1st Qu.:0.6215
## Median :-0.02264 Median :0.8081 Median :0.7174
## Mean : 0.00008 Mean :0.7536 Mean :0.7090
## 3rd Qu.: 0.09465 3rd Qu.:0.8801 3rd Qu.:0.8009
## Max. : 0.67777 Max. :0.9833 Max. :0.9436
## NA's :80 NA's :90 NA's :18
## Negative affect Confidence in national government Democratic Quality
## Min. :0.08343 Min. :0.06877 Min. :-2.4482
## 1st Qu.:0.20412 1st Qu.:0.33473 1st Qu.:-0.7720
## Median :0.25180 Median :0.46314 Median :-0.2259
## Mean :0.26317 Mean :0.48021 Mean :-0.1266
## 3rd Qu.:0.31151 3rd Qu.:0.61072 3rd Qu.: 0.6659
## Max. :0.70459 Max. :0.99360 Max. : 1.5401
## NA's :12 NA's :161 NA's :171
## Delivery Quality Standard deviation of ladder by country-year
## Min. :-2.14497 Min. :0.863
## 1st Qu.:-0.71746 1st Qu.:1.738
## Median :-0.21014 Median :1.960
## Mean : 0.00495 Mean :2.004
## 3rd Qu.: 0.71800 3rd Qu.:2.216
## Max. : 2.18472 Max. :3.528
## NA's :171
## Standard deviation/Mean of ladder by country-year
## Min. :0.1339
## 1st Qu.:0.3097
## Median :0.3698
## Mean :0.3873
## 3rd Qu.:0.4518
## Max. :1.0228
##
## GINI index (World Bank estimate)
## Min. :0.2410
## 1st Qu.:0.3070
## Median :0.3490
## Mean :0.3728
## 3rd Qu.:0.4335
## Max. :0.6480
## NA's :979
## GINI index (World Bank estimate), average 2000-15
## Min. :0.2288
## 1st Qu.:0.3216
## Median :0.3710
## Mean :0.3870
## 3rd Qu.:0.4331
## Max. :0.6260
## NA's :176
## gini of household income reported in Gallup, by wp5-year
## Min. :0.2235
## 1st Qu.:0.3685
## Median :0.4254
## Mean :0.4452
## 3rd Qu.:0.5086
## Max. :0.9614
## NA's :357
view(dfSummary(X2018))
## Switching method to 'browser'
## Output file written: C:\Users\Oswaldo\AppData\Local\Temp\RtmpKWdT6D\file4aa85eff3fd8.html
In the case of X2008, which cols are incorrectly classed?
sapply(X2018, class) # which cols are incorrectly classed?
## Country
## "character"
## year
## "numeric"
## Life Ladder
## "numeric"
## Log GDP per capita
## "numeric"
## Social support
## "numeric"
## Healthy life expectancy at birth
## "numeric"
## Freedom to make life choices
## "numeric"
## Generosity
## "numeric"
## Perceptions of corruption
## "numeric"
## Positive affect
## "numeric"
## Negative affect
## "numeric"
## Confidence in national government
## "numeric"
## Democratic Quality
## "numeric"
## Delivery Quality
## "numeric"
## Standard deviation of ladder by country-year
## "numeric"
## Standard deviation/Mean of ladder by country-year
## "numeric"
## GINI index (World Bank estimate)
## "numeric"
## GINI index (World Bank estimate), average 2000-15
## "numeric"
## gini of household income reported in Gallup, by wp5-year
## "numeric"
The information located in the X2008 data set are in the correct data type
First, we are creating a file with all the information, including NA’s values. This file will be used later.
HDIvsHap_all <- human_development %>%
left_join(X2018 , by = c("Country")) %>%
mutate(Country = factor(Country)) %>%
select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, # Human development variables
`Expected years of schooling`, `Mean years of schooling`, `Gross national income (GNI) per capita`,`GNI per capita rank minus HDI rank`,
#Happines (X2018)
`Life Ladder`, `Log GDP per capita` , `Social support`, `Healthy life expectancy at birth`,`Freedom to make life choices`, Generosity,
`Perceptions of corruption`, `Positive affect`, `Negative affect`, `Confidence in national government`, `Democratic Quality`, `Delivery Quality`,
`Standard deviation of ladder by country-year`, `Standard deviation/Mean of ladder by country-year`, `GINI index (World Bank estimate)` )
The presense of many highly intercorrelated explanatory variables may substantially:
Because of the above reasons, an Principal Component Analysis has been performed on the data.
First, joining the data in the new dataset named HDIvsHap.
We observed some NA’s so we have to remove NA’s
human_development <-na.omit(human_development)
HDIvsHap <- human_development %>%
left_join(X2018 , by = "Country") %>%
mutate(Country = factor(Country)) %>%
select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, # Human development variables
`Expected years of schooling`, `Mean years of schooling`, `Gross national income (GNI) per capita`,`GNI per capita rank minus HDI rank`,
#Happines (X2018)
`Life Ladder`, `Log GDP per capita` , `Social support`, `Healthy life expectancy at birth`,`Freedom to make life choices`, Generosity,
`Perceptions of corruption`, `Positive affect`, `Negative affect`, `Confidence in national government`, `Democratic Quality`, `Delivery Quality`,
`Standard deviation of ladder by country-year`, `Standard deviation/Mean of ladder by country-year`, `GINI index (World Bank estimate)` )
HDIvsHap2<-HDIvsHap
We observed some NA’s so we have to remove NA’s
HDIvsHap <-na.omit(HDIvsHap)
Converting all columns to numeric with the excepcion of the first column Country.
sapply(HDIvsHap, class) # which cols are incorrectly classed?
## Country
## "factor"
## Human Development Index (HDI)
## "numeric"
## Life expectancy at birth
## "numeric"
## Expected years of schooling
## "numeric"
## Mean years of schooling
## "numeric"
## Gross national income (GNI) per capita
## "numeric"
## GNI per capita rank minus HDI rank
## "numeric"
## Life Ladder
## "numeric"
## Log GDP per capita
## "numeric"
## Social support
## "numeric"
## Healthy life expectancy at birth
## "numeric"
## Freedom to make life choices
## "numeric"
## Generosity
## "numeric"
## Perceptions of corruption
## "numeric"
## Positive affect
## "numeric"
## Negative affect
## "numeric"
## Confidence in national government
## "numeric"
## Democratic Quality
## "numeric"
## Delivery Quality
## "numeric"
## Standard deviation of ladder by country-year
## "numeric"
## Standard deviation/Mean of ladder by country-year
## "numeric"
## GINI index (World Bank estimate)
## "numeric"
Now that we already have our dataset, we can take a closer look at it.
HDIvsHap.pca <- PCA(HDIvsHap[, 2:22], graph=FALSE)
eigenvalues <- HDIvsHap.pca$eig
fviz_screeplot(HDIvsHap.pca, addlabels = TRUE, ylim = c(0, 65))
pc <- princomp(HDIvsHap[, 2:22], cor=TRUE, scores=TRUE)
(HDIvsHap.pca$var$contrib)
## Dim.1 Dim.2
## Human Development Index (HDI) 8.43157195 3.75685785
## Life expectancy at birth 7.43704623 1.89820014
## Expected years of schooling 7.19569472 2.72844293
## Mean years of schooling 5.32533298 7.52189076
## Gross national income (GNI) per capita 8.51765248 0.07835988
## GNI per capita rank minus HDI rank 0.02005064 6.62603117
## Life Ladder 7.05412410 2.95059301
## Log GDP per capita 8.36121117 1.47572167
## Social support 5.06735278 0.53513300
## Healthy life expectancy at birth 7.10175789 2.49759048
## Freedom to make life choices 3.61051129 11.03834037
## Generosity 1.25829789 11.54427651
## Perceptions of corruption 3.24311593 4.93822928
## Positive affect 1.70435313 14.77972373
## Negative affect 1.13028109 7.39410440
## Confidence in national government 0.08523666 11.88152721
## Democratic Quality 6.83231266 0.57368393
## Delivery Quality 8.08341554 0.13303477
## Standard deviation of ladder by country-year 1.73937861 0.88229843
## Standard deviation/Mean of ladder by country-year 6.23219329 2.70135090
## GINI index (World Bank estimate) 1.56910895 4.06460959
## Dim.3 Dim.4
## Human Development Index (HDI) 3.195949e-01 0.002022310
## Life expectancy at birth 3.843151e+00 0.003163369
## Expected years of schooling 3.489551e-03 0.217496409
## Mean years of schooling 2.602819e+00 0.940241201
## Gross national income (GNI) per capita 5.132588e-02 4.614258639
## GNI per capita rank minus HDI rank 6.935929e+00 11.547734345
## Life Ladder 2.125901e+00 0.338756062
## Log GDP per capita 1.359271e+00 1.096276363
## Social support 1.577122e+00 15.239327959
## Healthy life expectancy at birth 3.154583e+00 0.043565014
## Freedom to make life choices 5.057390e-01 0.008426865
## Generosity 7.976327e-04 0.224304424
## Perceptions of corruption 3.035577e+00 17.738275842
## Positive affect 9.369760e+00 4.316767615
## Negative affect 9.762692e+00 16.631428778
## Confidence in national government 6.238764e+00 13.846709651
## Democratic Quality 3.989636e-01 1.120832018
## Delivery Quality 1.132419e-03 5.111634966
## Standard deviation of ladder by country-year 2.675454e+01 1.524915277
## Standard deviation/Mean of ladder by country-year 3.625731e+00 3.783954131
## GINI index (World Bank estimate) 1.833311e+01 1.649908761
## Dim.5
## Human Development Index (HDI) 1.267098e-01
## Life expectancy at birth 2.527776e+00
## Expected years of schooling 1.676589e+00
## Mean years of schooling 1.381827e+00
## Gross national income (GNI) per capita 3.578547e+00
## GNI per capita rank minus HDI rank 4.356439e+01
## Life Ladder 3.457342e-04
## Log GDP per capita 2.311771e+00
## Social support 3.193014e+00
## Healthy life expectancy at birth 3.304650e+00
## Freedom to make life choices 4.542502e+00
## Generosity 4.430438e-01
## Perceptions of corruption 8.812025e+00
## Positive affect 1.329891e+00
## Negative affect 2.178451e-01
## Confidence in national government 9.625652e+00
## Democratic Quality 8.950582e-01
## Delivery Quality 6.519867e-01
## Standard deviation of ladder by country-year 3.733824e+00
## Standard deviation/Mean of ladder by country-year 2.115811e+00
## GINI index (World Bank estimate) 5.966740e+00
Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. The contribution of variables was extracted above: The larger the value of the contribution, the more the variable contributes to the component. The variables selected for the model are:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
fviz_pca_var(HDIvsHap.pca, col.var="contrib",gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE )
variables in explaining the variations retained by the principal components.
number <- NbClust(HDIvsHap[, 2:22], distance="euclidean",
min.nc=2, max.nc=20, method='kmeans', index='all', alphaBeale = 0.1)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 3 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 18 as the best number of clusters
## * 4 proposed 20 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
set.seed(2018)
pam <- pam(HDIvsHap[, 2:22], diss=FALSE, 3, keep.data=TRUE, do.swap= TRUE)
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 94 0.61
## 2 2 161 0.58
## 3 3 221 0.64
fviz_cluster(pam, stand = FALSE, geom = "point",ellipse.type = "norm")
A World Map of three clusters
HDIvsHap['cluster'] <- as.factor(pam$clustering)
map <- map_data("world")
map <- left_join(map, HDIvsHap, by = c('region' = 'Country'))
ggplot() + geom_polygon(data = map, aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
labs(title = "Clustering Happy Planet Index", subtitle = "Based on data from:http://happyplanetindex.org/", x=NULL, y=NULL) +
theme_minimal()
Now that we have decided which variables contributes to the model, we will start with the outliers detection procedures.
Creating the data set with the newly selected variables
HDIvsHap2<-HDIvsHap2[-c(5:22)]
The analysis of outliers has been performed on multivariate an not in univariate data. The reason for that, is because some univariate outliers may be not extreme in multiple regression model, and conversely, some multivariate outliers may not be detectable in single-variables (Kutner, 2005).
Now we have the information in the correct way in order to perform Outliers determination. Let’s create different outliers detection functions:
#Creating and Outlier detection function (tukey's Fences or Boxplots)
tukey_outlier = function(x) {
x < quantile(x, 0.25) - 1.5*IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}
Calculating for each variable
tukey_HDI<-tukey_outlier(HDIvsHap2$`Human Development Index (HDI)`)
tukey_LEB<-tukey_outlier(HDIvsHap2$`Life expectancy at birth`)
tukey_EYS<-tukey_outlier(HDIvsHap2$`Expected years of schooling`)
HDIvsHap2<-mutate(HDIvsHap2, tukey_EYS, tukey_HDI, tukey_LEB)
sd_outlier = function(x) {
abs(scale(x)[,1]) > 3
}
Calculating for each variable
sd_outlier_HDI<- sd_outlier(HDIvsHap2$`Human Development Index (HDI)`)
sd_outlier_LEB<- sd_outlier(HDIvsHap2$`Life expectancy at birth`)
sd_outlier_EYS<- sd_outlier(HDIvsHap2$`Expected years of schooling`)
HDIvsHap2<-mutate(HDIvsHap2, sd_outlier_HDI, sd_outlier_LEB, sd_outlier_EYS)
mad_outlier = function(x) {
m = mad(x)
med = median(x)
x > med + 3 * m | x < med - 3 * m
}
Calculating for each variable
mad_HDI<-mad_outlier(HDIvsHap2$`Human Development Index (HDI)`)
mad_LEB<-mad_outlier(HDIvsHap2$`Life expectancy at birth`)
mad_EYS<-mad_outlier(HDIvsHap2$`Expected years of schooling`)
HDIvsHap2<-mutate(HDIvsHap2, mad_HDI, mad_LEB, mad_EYS)
Creating an outlier function
outlierKD <- function(dt, var) {
var_name <- eval(substitute(var),eval(dt))
tot <- round(sum(!is.na(var_name)),2)
na1 <- round(sum(is.na(var_name)),2)
m1 <- round(mean(var_name, na.rm = T),2)
par(mfrow=c(2, 2), oma=c(0,0,3,0))
boxplot(var_name, main="With outliers", col = "yellow")
hist(var_name, main="With outliers", xlab=NA, ylab=NA, col = "red")
outlier <- boxplot.stats(var_name)$out
mo <- round(mean(outlier),2)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
boxplot(var_name, main="Without outliers", col = "yellow")
hist(var_name, main="Without outliers", xlab=NA, ylab=NA, col = "red")
title("Outlier Check", outer=TRUE)
na2 <- round(sum(is.na(var_name)),2)
message("Outliers identified: ", na2 - na1, " from ", tot, " observations")
message("Proportion (%) of outliers: ", round((na2 - na1) / tot*100),2)
message("Mean of the outliers: ", mo)
m2 <- round(mean(var_name, na.rm = T),2)
message("Mean without removing outliers: ", m1)
message("Mean if we remove outliers: ", m2)
}
outlierKD(HDIvsHap2,HDIvsHap2$`Human Development Index (HDI)` )
## Outliers identified: 0 from 1406 observations
## Proportion (%) of outliers: 02
## Mean of the outliers: NaN
## Mean without removing outliers: 0.73
## Mean if we remove outliers: 0.73
Now, we need are going to identify the outliers from above graph.
which(tukey_HDI)
## integer(0)
Accordign to Tukey’s method, the outliers are located at rows 429 430 431 432 433 434 435 436 437 438 439 440.
Checking with other methods…
which(sd_outlier_HDI)
## integer(0)
According to the sd methos, the only outliers are 439 and 440
which(mad_HDI)
## integer(0)
For the Human Development Index (HDI) variable, al three methods classify points 439 and 440 as outliers. Both Tukey’s and SD identifid points 433 434 435 436 437 438 as outliers. Only Tukey’s identified points 429 430 431 432 as outliers. The initial criteria for removing outliers in this project is that at least two methods should have identified the points as outliers. In other words, points 433 434 435 436 437 438 439 and 440 will be removed.
HDIvsHap2<-HDIvsHap2[-c(433,434,435,436,437,438,439),]
outlierKD(HDIvsHap2,HDIvsHap2$`Human Development Index (HDI)` )
## Outliers identified: 0 from 1399 observations
## Proportion (%) of outliers: 02
## Mean of the outliers: NaN
## Mean without removing outliers: 0.73
## Mean if we remove outliers: 0.73
outlierKD(HDIvsHap2,HDIvsHap2$`Life expectancy at birth` )
## Outliers identified: 0 from 1399 observations
## Proportion (%) of outliers: 02
## Mean of the outliers: NaN
## Mean without removing outliers: 72.94
## Mean if we remove outliers: 72.94
Now, we need are going to identify the outliers from above graph.
which(tukey_LEB)
## integer(0)
Accordign to Tukey’s method, the outliers are located at rows 386 389 398 404 406 420 427 430 432 434 435 436 437 438.
Checking with other methods…
which(sd_outlier_LEB)
## integer(0)
According to the sd methos, the only outliers are 404 406 420 434 436 438
which(mad_LEB)
## integer(0)
For the Life expectancy at birth (LEB) variable, al three methods classify points 404 406 420 434 436 438 as outliers. Both MAD and SD identified the same set of points 404 406 420 434 436 438 as outliers. Only Tukey’s identified points 386 389 398 as outliers. The initial criteria for removing outliers in this project is that at least two methods have identified the points as outliers. In other words, points 404 406 420 434 436 438 will be removed.
HDIvsHap2<-HDIvsHap2[-c(404, 406, 420, 434, 436, 438),]
outlierKD(HDIvsHap2,HDIvsHap2$`Expected years of schooling` )
## Outliers identified: 28 from 1393 observations
## Proportion (%) of outliers: 22
## Mean of the outliers: 12.19
## Mean without removing outliers: 13.64
## Mean if we remove outliers: 13.67
Now, we need are going to identify the outliers from above graph.
which(tukey_EYS)
## [1] 15 16 17 18 19 20 21 22 23 24 25 1324 1386 1387 1388
## [16] 1389 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406
Accordign to Tukey’s method, the outliers are located at rows 5 417 422 429 437 438 439 440.
Checking with other methods…
which(sd_outlier_EYS)
## [1] 15 16 17 18 19 20 21 22 23 24 25
According to the sd methos, the outliers are 5 429 437 439 440.
which(mad_EYS)
## [1] 15 16 17 18 19 20 21 22 23 24 25
For the Expected years of schooling (EYS) variable, al three methods classify points 5 429 437 439 440 as outliers. Both MAD and Tukey’s identified the additional point 422 as outliers. Only Tukey’s identified point 417 as outliers. The initial criteria for removing outliers in this project is that at least two methods have identified the points as outliers. In other words, points 5 422 429 437 439 440 will be removed.
HDIvsHap2<-HDIvsHap2[-c(5, 422, 429, 437, 439, 440),]
Finally, we will take a look of the correlation coefficients
ggpairs(HDIvsHap2, columns = 2:4)
It looks like Life expectancy at birth still need some improvement. This variable shows a bimodal sampling distribution, suggesting non-normality and that probably important variables has not been included yet. ALso, Human Development Index (HDI) has an important skweness to the left, also indicating non-normality.This important aspect will be cover in the next step. Finally, Expected year of schooling suggest some degree of normality.
Now, let’s take a look of all these variables together.
Now, we will try to run simple linear regression on our dataset and create some models
fmla <- `Human Development Index (HDI)` ~ `Life expectancy at birth` + `Expected years of schooling`
Fit the model
Happiness_model <- lm(fmla, data = HDIvsHap2)
print Happiness_model and call summary()
Happiness_model
##
## Call:
## lm(formula = fmla, data = HDIvsHap2)
##
## Coefficients:
## (Intercept) `Life expectancy at birth`
## -0.39003 0.01062
## `Expected years of schooling`
## 0.02497
summary(Happiness_model)
##
## Call:
## lm(formula = fmla, data = HDIvsHap2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.126213 -0.031985 0.002543 0.029001 0.132839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3900302 0.0129470 -30.12 <2e-16 ***
## `Life expectancy at birth` 0.0106232 0.0002634 40.33 <2e-16 ***
## `Expected years of schooling` 0.0249736 0.0006695 37.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04345 on 1384 degrees of freedom
## Multiple R-squared: 0.9219, Adjusted R-squared: 0.9218
## F-statistic: 8171 on 2 and 1384 DF, p-value: < 2.2e-16
plot(Happiness_model)
The Normal QQ plot reveals that the linear regression has some degree of deviation from normality at the bottom left of the plot. However, an slightly devition from normality may be accepted. In the residuals plots, no megaphone effect, cyclicity, trend or sequence have been obseved.
Using the prediction option…
HDIvsHap2$prediction <- predict(Happiness_model)
Visualizing the information
ggplot(HDIvsHap2, aes(x = prediction, y = `Human Development Index (HDI)`)) +
geom_point() +
geom_abline(color = "blue")
ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
In the file X2018, the Happiness Score is called ‘Ladder’. During the Gallup World Poll (GWP), the english wording of the of the question is "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. This measure is referred as ‘life ladder’ in the X2018 Excel dataset.
Adding the variable Life Ladder (Happiness score) to the file HDIvsHap2…
HDIvsHap2 <- HDIvsHap2 %>%
left_join(X2018 , by = "Country") %>%
mutate(Country = factor(Country)) %>%
select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, `Expected years of schooling`, # Human development variables
#Happines variables (X2018)
`Life Ladder` )
HDIvsHap2<-HDIvsHap2
Visualizing the information
ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) +
geom_point(aes(fill= `Life Ladder`), size = 5, pch=21) +
geom_smooth(method = "lm")+
ggtitle("Association Between Happiness vs. Human Development")+
scale_fill_gradient(low="yellow", high="red")
## `geom_smooth()` using formula 'y ~ x'
sp<-ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) +
geom_point(aes(fill= `Life Ladder`), size = HDIvsHap2$`Life Ladder`, pch=21) +
geom_smooth(method = "lm")+
ggtitle("Association Between Happiness vs. Human Development")
Adding more channels to the graph…
sp+scale_fill_gradient(low="yellow", high="red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 46 rows containing missing values (geom_point).
From the model, we can see that in general terms, High scores of Happiness (Life Ladder) are associated with high values of ‘Human Development Index (HDI)’.
Creating an interactive graph
pl = ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) +
geom_point_interactive(aes(tooltip = HDIvsHap2$Country,
size= HDIvsHap2$`Life Ladder`,
fill=`Life Ladder`), pch = 21) +
geom_smooth(method = "lm")+
ggtitle("Association Between Happiness vs. Human Development")+
scale_fill_gradient(low="yellow", high="red")
ggiraph(code = print(pl), width = 1, height_svg = 4)
## Warning: Use of `HDIvsHap2$Country` is discouraged. Use `Country` instead.
## Warning: Use of `HDIvsHap2$`Life Ladder`` is discouraged. Use `Life Ladder`
## instead.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 46 rows containing missing values (geom_interactive_point).
According to the above interactive map, Countries like Australia, Canada, Austria and Finland are on the top of Happiness and Human Developmnet. Countries like Egypt, Indonesia, Paraguay and Belize are in the middle of Happiness and Human Development and Niger, Sierre Leone, Chad and Central African Republic in the bottom. Therefore, this correlation apparently correspont with geographic location. Europe and North America on the top, next South America and North Africa and on the bottom Africa.
Let’s have a look at the average happiness for each country and region for 2018.
dfavg <- HDIvsHap_all %>%
select(Country, `Life Ladder`) %>% #Remember Life Ladder = Happiness Score
group_by(Country) %>%
summarize(Average = mean(`Life Ladder`)) %>%
arrange(desc(Average))
head(dfavg,n=10)
## # A tibble: 10 x 2
## Country Average
## <fct> <dbl>
## 1 Denmark 7.70
## 2 Norway 7.56
## 3 Switzerland 7.54
## 4 Finland 7.52
## 5 Netherlands 7.47
## 6 Canada 7.45
## 7 Iceland 7.41
## 8 Sweden 7.37
## 9 New Zealand 7.32
## 10 Australia 7.31
dfregions<-HDIvsHap_all[1431:1436,1:6]
kable(dfregions)
| Country | Human Development Index (HDI) | Life expectancy at birth | Expected years of schooling | Mean years of schooling | Gross national income (GNI) per capita |
|---|---|---|---|---|---|
| Arab States | 0.6989547 | 71.50650 | 11.90614 | 7.034884 | 15836.653 |
| East Asia and the Pacific | 0.7334344 | 74.66739 | 13.29281 | 7.862161 | 13687.730 |
| Europe and Central Asia | 0.7707206 | 73.39237 | 14.08741 | 10.255684 | 15331.155 |
| Latin America and the Caribbean | 0.7579895 | 75.71654 | 14.44067 | 8.483807 | 13670.660 |
| South Asia | 0.6378350 | 69.26117 | 11.87420 | 6.411054 | 6473.295 |
| Sub-Saharan Africa | 0.5371360 | 60.71723 | 10.05516 | 5.554588 | 3399.249 |
Incorporating this information in the graph “Association Between Happiness vs. Human Development”
pl = ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) +
geom_point_interactive(aes(tooltip = HDIvsHap2$Country,
size= HDIvsHap2$`Life Ladder`,
fill=`Life Ladder`), pch = 21) +
geom_smooth(method = "lm")+
annotate("text", x= 90, y=0.40, label="Sub Saharan Africa")+
annotate("text", x= 65, y=0.70, label="Latin america")+
annotate("text", x= 75, y=0.95, label="Europe and Central Asia")+
ggtitle("Association Between Happiness vs. Human Development")+
scale_fill_gradient(low="yellow", high="red")
ggiraph(code = print(pl), width = 1, height_svg = 4)
## Warning: Use of `HDIvsHap2$Country` is discouraged. Use `Country` instead.
## Warning: Use of `HDIvsHap2$`Life Ladder`` is discouraged. Use `Life Ladder`
## instead.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 46 rows containing missing values (geom_interactive_point).
The above graph is not accurate. However, it shows tendency. It is necessary to depicts all the results on a map.
worldmap <- map_data("world")
happy_world <- HDIvsHap2
worldmap <- full_join(worldmap, HDIvsHap2, by = c('region' = 'Country'))
map_theme <- theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.background = element_rect(fill = "white"))
ggplot() + geom_polygon(data = worldmap, aes(x = long, y = lat, group = group, fill = `Life Ladder`)) +
scale_fill_continuous(low="thistle2", high="darkred", na.value="snow2") +
coord_quickmap() +
labs(title = "Happiness Around the World - 2018", subtitle = "Based on data from:http://happyplanetindex.org/", x=NULL, y=NULL) +
theme_minimal()
The darker the red, the higher the happiness score. Regions in gray do not have happiness data or there is some display problem, like in the cases of USA, Rusia and Venezuela. The happiest regions of the world appear to be in Europe, North and South America, and New Zealand. Africa appears to contain the lowest overall happiness scores.
The Human Development report, has statistics by Region, while the Happines report (X2018) do not.
Finally, showing the main factor of Happiness
dfwide<-HDIvsHap2 %>%
head(850)
dflong <-gather(dfwide, Factor, `Importance of Factor` , `Human Development Index (HDI)`:`Expected years of schooling`, factor_key = TRUE)
ggplot(data = dflong) +
geom_bar(stat = "identity",
aes(x = Country, y = `Importance of Factor`, fill = Factor)) +
coord_flip() +
theme_minimal() +
theme(legend.position = "top") +
labs(title = "Main Factors of Happiness Top 10 Countries") +
theme(plot.title = element_text(size = rel(1.5)),
axis.title = element_text(size = rel(1.5)))