“Happiness is when what you think, what you say, and what you do are in harmony.” Mahatma Gandhi

World Happiness Report

Two datasets are used on this Homework:

The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2018, ranks 156 countries by their happiness levels, and 117 countries by the happiness of their immigrants.

http://worldhappiness.report/ed/2018/

The second report is the “Human Development Report” from the United Nations Development Programme. This report is a composite index measuring average achievement in three basic dimensions of human development-a long and healthy life, knowledge and a decent standard of living.

http://hdr.undp.org/en/data (Go to Data>Table 1: Human Development Index and its components)

The codes below reads onto the two the data sources.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(ggfortify)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
library(tools)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

library(ggthemes)
library(NbClust)
library(readxl)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:cluster':
## 
##     votes.repub

library(devtools)

## Warning: package 'devtools' was built under R version 3.6.3

## Loading required package: usethis

## Warning: package 'usethis' was built under R version 3.6.3

library(summarytools)

## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp

library(data.table)

## Warning: package 'data.table' was built under R version 3.6.3

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(rgl)

## Registered S3 method overwritten by 'shiny':
##   method            from   
##   print.key_missing fastmap

library(plot3D)
library(ggiraph)
library(RColorBrewer)
library(tidyr)

## Warning: package 'tidyr' was built under R version 3.6.3

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

Reading the data for the two sources:

library(readr)

## Warning: package 'readr' was built under R version 3.6.3

library("readxl")

Reading Human development Index and World Happiness 2018

human_development <- read_excel("F:/MS.c_Statistics/Kansas_University/806 Data Visualization/Lesson 8/Homework/2018_Statistical_Annex_Table_1.xlsx")

## New names:
## * `HDI rank` -> `HDI rank...1`
## * `HDI rank` -> `HDI rank...9`

X2018<-read_excel("F:/MS.c_Statistics/Kansas_University/806 Data Visualization/Lesson 8/Homework/WHR2018.xls")

summary(human_development)

##  HDI rank...1         Country          Human Development Index (HDI)
##  Length:261         Length:261         Length:261                   
##  Class :character   Class :character   Class :character             
##  Mode  :character   Mode  :character   Mode  :character             
##  Life expectancy at birth Expected years of schooling Mean years of schooling
##  Length:261               Length:261                  Length:261             
##  Class :character         Class :character            Class :character       
##  Mode  :character         Mode  :character            Mode  :character       
##  Gross national income (GNI) per capita GNI per capita rank minus HDI rank
##  Length:261                             Length:261                        
##  Class :character                       Class :character                  
##  Mode  :character                       Mode  :character                  
##  HDI rank...9      
##  Length:261        
##  Class :character  
##  Mode  :character

view(dfSummary(human_development))

## Switching method to 'browser'

## Output file written: C:\Users\Oswaldo\AppData\Local\Temp\RtmpKWdT6D\file4aa85eba78a1.html

Which cols are incorrectly classed?

sapply(human_development, class) # which cols are incorrectly classed?

##                           HDI rank...1                                Country 
##                            "character"                            "character" 
##          Human Development Index (HDI)               Life expectancy at birth 
##                            "character"                            "character" 
##            Expected years of schooling                Mean years of schooling 
##                            "character"                            "character" 
## Gross national income (GNI) per capita     GNI per capita rank minus HDI rank 
##                            "character"                            "character" 
##                           HDI rank...9 
##                            "character"

Changing from character to

 human_development[, c(3: 9)]<- sapply(human_development[ , c(3:9)], as.numeric)

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

    human_development$Country = as.character(human_development$Country)
    human_development$`HDI rank..1`=as.numeric(human_development$`HDI rank...1`)

## Warning: NAs introduced by coercion

sapply(human_development, class) # which cols are incorrectly classed?

##                           HDI rank...1                                Country 
##                            "character"                            "character" 
##          Human Development Index (HDI)               Life expectancy at birth 
##                              "numeric"                              "numeric" 
##            Expected years of schooling                Mean years of schooling 
##                              "numeric"                              "numeric" 
## Gross national income (GNI) per capita     GNI per capita rank minus HDI rank 
##                              "numeric"                              "numeric" 
##                           HDI rank...9                            HDI rank..1 
##                              "numeric"                              "numeric"

We can see that all variables are character. We need to convert most of them to numeric type with the exception of country, that should be character.

In the case of X2008…

summary(X2018)

##    Country               year       Life Ladder    Log GDP per capita
##  Length:1562        Min.   :2005   Min.   :2.662   Min.   : 6.377    
##  Class :character   1st Qu.:2009   1st Qu.:4.606   1st Qu.: 8.311    
##  Mode  :character   Median :2012   Median :5.333   Median : 9.399    
##                     Mean   :2012   Mean   :5.434   Mean   : 9.221    
##                     3rd Qu.:2015   3rd Qu.:6.271   3rd Qu.:10.191    
##                     Max.   :2017   Max.   :8.019   Max.   :11.770    
##                                                    NA's   :27        
##  Social support   Healthy life expectancy at birth Freedom to make life choices
##  Min.   :0.2902   Min.   :37.77                    Min.   :0.2575              
##  1st Qu.:0.7483   1st Qu.:57.30                    1st Qu.:0.6338              
##  Median :0.8330   Median :63.80                    Median :0.7480              
##  Mean   :0.8107   Mean   :62.25                    Mean   :0.7290              
##  3rd Qu.:0.9043   3rd Qu.:68.10                    3rd Qu.:0.8436              
##  Max.   :0.9873   Max.   :76.54                    Max.   :0.9852              
##  NA's   :13       NA's   :9                        NA's   :29                  
##    Generosity       Perceptions of corruption Positive affect 
##  Min.   :-0.32295   Min.   :0.0352            Min.   :0.3625  
##  1st Qu.:-0.11431   1st Qu.:0.6974            1st Qu.:0.6215  
##  Median :-0.02264   Median :0.8081            Median :0.7174  
##  Mean   : 0.00008   Mean   :0.7536            Mean   :0.7090  
##  3rd Qu.: 0.09465   3rd Qu.:0.8801            3rd Qu.:0.8009  
##  Max.   : 0.67777   Max.   :0.9833            Max.   :0.9436  
##  NA's   :80         NA's   :90                NA's   :18      
##  Negative affect   Confidence in national government Democratic Quality
##  Min.   :0.08343   Min.   :0.06877                   Min.   :-2.4482   
##  1st Qu.:0.20412   1st Qu.:0.33473                   1st Qu.:-0.7720   
##  Median :0.25180   Median :0.46314                   Median :-0.2259   
##  Mean   :0.26317   Mean   :0.48021                   Mean   :-0.1266   
##  3rd Qu.:0.31151   3rd Qu.:0.61072                   3rd Qu.: 0.6659   
##  Max.   :0.70459   Max.   :0.99360                   Max.   : 1.5401   
##  NA's   :12        NA's   :161                       NA's   :171       
##  Delivery Quality   Standard deviation of ladder by country-year
##  Min.   :-2.14497   Min.   :0.863                               
##  1st Qu.:-0.71746   1st Qu.:1.738                               
##  Median :-0.21014   Median :1.960                               
##  Mean   : 0.00495   Mean   :2.004                               
##  3rd Qu.: 0.71800   3rd Qu.:2.216                               
##  Max.   : 2.18472   Max.   :3.528                               
##  NA's   :171                                                    
##  Standard deviation/Mean of ladder by country-year
##  Min.   :0.1339                                   
##  1st Qu.:0.3097                                   
##  Median :0.3698                                   
##  Mean   :0.3873                                   
##  3rd Qu.:0.4518                                   
##  Max.   :1.0228                                   
##                                                   
##  GINI index (World Bank estimate)
##  Min.   :0.2410                  
##  1st Qu.:0.3070                  
##  Median :0.3490                  
##  Mean   :0.3728                  
##  3rd Qu.:0.4335                  
##  Max.   :0.6480                  
##  NA's   :979                     
##  GINI index (World Bank estimate), average 2000-15
##  Min.   :0.2288                                   
##  1st Qu.:0.3216                                   
##  Median :0.3710                                   
##  Mean   :0.3870                                   
##  3rd Qu.:0.4331                                   
##  Max.   :0.6260                                   
##  NA's   :176                                      
##  gini of household income reported in Gallup, by wp5-year
##  Min.   :0.2235                                          
##  1st Qu.:0.3685                                          
##  Median :0.4254                                          
##  Mean   :0.4452                                          
##  3rd Qu.:0.5086                                          
##  Max.   :0.9614                                          
##  NA's   :357

view(dfSummary(X2018))

## Switching method to 'browser'

## Output file written: C:\Users\Oswaldo\AppData\Local\Temp\RtmpKWdT6D\file4aa85eff3fd8.html

In the case of X2008, which cols are incorrectly classed?

sapply(X2018, class) # which cols are incorrectly classed?

##                                                  Country 
##                                              "character" 
##                                                     year 
##                                                "numeric" 
##                                              Life Ladder 
##                                                "numeric" 
##                                       Log GDP per capita 
##                                                "numeric" 
##                                           Social support 
##                                                "numeric" 
##                         Healthy life expectancy at birth 
##                                                "numeric" 
##                             Freedom to make life choices 
##                                                "numeric" 
##                                               Generosity 
##                                                "numeric" 
##                                Perceptions of corruption 
##                                                "numeric" 
##                                          Positive affect 
##                                                "numeric" 
##                                          Negative affect 
##                                                "numeric" 
##                        Confidence in national government 
##                                                "numeric" 
##                                       Democratic Quality 
##                                                "numeric" 
##                                         Delivery Quality 
##                                                "numeric" 
##             Standard deviation of ladder by country-year 
##                                                "numeric" 
##        Standard deviation/Mean of ladder by country-year 
##                                                "numeric" 
##                         GINI index (World Bank estimate) 
##                                                "numeric" 
##        GINI index (World Bank estimate), average 2000-15 
##                                                "numeric" 
## gini of household income reported in Gallup, by wp5-year 
##                                                "numeric"

The information located in the X2008 data set are in the correct data type

First, we are creating a file with all the information, including NA’s values. This file will be used later.

HDIvsHap_all <- human_development %>%
  left_join(X2018 , by = c("Country")) %>%
  mutate(Country = factor(Country)) %>%
  select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, # Human development variables
          `Expected years of schooling`, `Mean years of schooling`, `Gross national income (GNI) per capita`,`GNI per capita rank minus HDI rank`,      
         
          #Happines (X2018)
         `Life Ladder`, `Log GDP per capita` , `Social support`, `Healthy life expectancy at birth`,`Freedom to make life choices`, Generosity,
          `Perceptions of corruption`,  `Positive affect`, `Negative affect`, `Confidence in national government`, `Democratic Quality`, `Delivery Quality`,
         `Standard deviation of ladder by country-year`, `Standard deviation/Mean of ladder by country-year`, `GINI index (World Bank estimate)` )

Principal Component Analysis

The presense of many highly intercorrelated explanatory variables may substantially:

Increase the sampling variation of the regression coefficients.
Detract model’s descriptive abilities.
Cause roundoff problems (Kuntner, 2005).

Because of the above reasons, an Principal Component Analysis has been performed on the data.

Creating a new joined data set

First, joining the data in the new dataset named HDIvsHap.

We observed some NA’s so we have to remove NA’s

human_development <-na.omit(human_development)

HDIvsHap <- human_development %>%
  left_join(X2018 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, # Human development variables
          `Expected years of schooling`, `Mean years of schooling`, `Gross national income (GNI) per capita`,`GNI per capita rank minus HDI rank`,      
         
          #Happines (X2018)
         `Life Ladder`, `Log GDP per capita` , `Social support`, `Healthy life expectancy at birth`,`Freedom to make life choices`, Generosity,
          `Perceptions of corruption`,  `Positive affect`, `Negative affect`, `Confidence in national government`, `Democratic Quality`, `Delivery Quality`,
         `Standard deviation of ladder by country-year`, `Standard deviation/Mean of ladder by country-year`, `GINI index (World Bank estimate)` ) 
 
HDIvsHap2<-HDIvsHap

We observed some NA’s so we have to remove NA’s

HDIvsHap <-na.omit(HDIvsHap)

Principal Component Analysis with all data joined

Converting all columns to numeric with the excepcion of the first column Country.

 sapply(HDIvsHap, class) # which cols are incorrectly classed?

##                                           Country 
##                                          "factor" 
##                     Human Development Index (HDI) 
##                                         "numeric" 
##                          Life expectancy at birth 
##                                         "numeric" 
##                       Expected years of schooling 
##                                         "numeric" 
##                           Mean years of schooling 
##                                         "numeric" 
##            Gross national income (GNI) per capita 
##                                         "numeric" 
##                GNI per capita rank minus HDI rank 
##                                         "numeric" 
##                                       Life Ladder 
##                                         "numeric" 
##                                Log GDP per capita 
##                                         "numeric" 
##                                    Social support 
##                                         "numeric" 
##                  Healthy life expectancy at birth 
##                                         "numeric" 
##                      Freedom to make life choices 
##                                         "numeric" 
##                                        Generosity 
##                                         "numeric" 
##                         Perceptions of corruption 
##                                         "numeric" 
##                                   Positive affect 
##                                         "numeric" 
##                                   Negative affect 
##                                         "numeric" 
##                 Confidence in national government 
##                                         "numeric" 
##                                Democratic Quality 
##                                         "numeric" 
##                                  Delivery Quality 
##                                         "numeric" 
##      Standard deviation of ladder by country-year 
##                                         "numeric" 
## Standard deviation/Mean of ladder by country-year 
##                                         "numeric" 
##                  GINI index (World Bank estimate) 
##                                         "numeric"

Now that we already have our dataset, we can take a closer look at it.

HDIvsHap.pca <- PCA(HDIvsHap[, 2:22], graph=FALSE)
eigenvalues <- HDIvsHap.pca$eig
fviz_screeplot(HDIvsHap.pca, addlabels = TRUE, ylim = c(0, 65))

pc <- princomp(HDIvsHap[, 2:22], cor=TRUE, scores=TRUE)

(HDIvsHap.pca$var$contrib)

##                                                        Dim.1       Dim.2
## Human Development Index (HDI)                     8.43157195  3.75685785
## Life expectancy at birth                          7.43704623  1.89820014
## Expected years of schooling                       7.19569472  2.72844293
## Mean years of schooling                           5.32533298  7.52189076
## Gross national income (GNI) per capita            8.51765248  0.07835988
## GNI per capita rank minus HDI rank                0.02005064  6.62603117
## Life Ladder                                       7.05412410  2.95059301
## Log GDP per capita                                8.36121117  1.47572167
## Social support                                    5.06735278  0.53513300
## Healthy life expectancy at birth                  7.10175789  2.49759048
## Freedom to make life choices                      3.61051129 11.03834037
## Generosity                                        1.25829789 11.54427651
## Perceptions of corruption                         3.24311593  4.93822928
## Positive affect                                   1.70435313 14.77972373
## Negative affect                                   1.13028109  7.39410440
## Confidence in national government                 0.08523666 11.88152721
## Democratic Quality                                6.83231266  0.57368393
## Delivery Quality                                  8.08341554  0.13303477
## Standard deviation of ladder by country-year      1.73937861  0.88229843
## Standard deviation/Mean of ladder by country-year 6.23219329  2.70135090
## GINI index (World Bank estimate)                  1.56910895  4.06460959
##                                                          Dim.3        Dim.4
## Human Development Index (HDI)                     3.195949e-01  0.002022310
## Life expectancy at birth                          3.843151e+00  0.003163369
## Expected years of schooling                       3.489551e-03  0.217496409
## Mean years of schooling                           2.602819e+00  0.940241201
## Gross national income (GNI) per capita            5.132588e-02  4.614258639
## GNI per capita rank minus HDI rank                6.935929e+00 11.547734345
## Life Ladder                                       2.125901e+00  0.338756062
## Log GDP per capita                                1.359271e+00  1.096276363
## Social support                                    1.577122e+00 15.239327959
## Healthy life expectancy at birth                  3.154583e+00  0.043565014
## Freedom to make life choices                      5.057390e-01  0.008426865
## Generosity                                        7.976327e-04  0.224304424
## Perceptions of corruption                         3.035577e+00 17.738275842
## Positive affect                                   9.369760e+00  4.316767615
## Negative affect                                   9.762692e+00 16.631428778
## Confidence in national government                 6.238764e+00 13.846709651
## Democratic Quality                                3.989636e-01  1.120832018
## Delivery Quality                                  1.132419e-03  5.111634966
## Standard deviation of ladder by country-year      2.675454e+01  1.524915277
## Standard deviation/Mean of ladder by country-year 3.625731e+00  3.783954131
## GINI index (World Bank estimate)                  1.833311e+01  1.649908761
##                                                          Dim.5
## Human Development Index (HDI)                     1.267098e-01
## Life expectancy at birth                          2.527776e+00
## Expected years of schooling                       1.676589e+00
## Mean years of schooling                           1.381827e+00
## Gross national income (GNI) per capita            3.578547e+00
## GNI per capita rank minus HDI rank                4.356439e+01
## Life Ladder                                       3.457342e-04
## Log GDP per capita                                2.311771e+00
## Social support                                    3.193014e+00
## Healthy life expectancy at birth                  3.304650e+00
## Freedom to make life choices                      4.542502e+00
## Generosity                                        4.430438e-01
## Perceptions of corruption                         8.812025e+00
## Positive affect                                   1.329891e+00
## Negative affect                                   2.178451e-01
## Confidence in national government                 9.625652e+00
## Democratic Quality                                8.950582e-01
## Delivery Quality                                  6.519867e-01
## Standard deviation of ladder by country-year      3.733824e+00
## Standard deviation/Mean of ladder by country-year 2.115811e+00
## GINI index (World Bank estimate)                  5.966740e+00

Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. The contribution of variables was extracted above: The larger the value of the contribution, the more the variable contributes to the component. The variables selected for the model are:

                                                  Dim.1      Dim.2        Dim.3        Dim.4        Dim.5

Human Development Index (HDI) 8.58865397 3.5500896 0.143669350 3.192140e-05 0.125451498
Life expectancy at birth 7.55539539 2.0912030 3.139821431 6.066089e-02 1.963118970
Expected years of schooling 7.24635749 2.5465918 0.003315004 6.003436e-01 1.386340930

fviz_pca_var(HDIvsHap.pca, col.var="contrib",gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE )

variables in explaining the variations retained by the principal components.

Using Clustering Analysis to group countries

number <- NbClust(HDIvsHap[, 2:22], distance="euclidean",
               min.nc=2, max.nc=20, method='kmeans', index='all', alphaBeale = 0.1)

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 3 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 18 as the best number of clusters 
## * 4 proposed 20 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

set.seed(2018)
pam <- pam(HDIvsHap[, 2:22], diss=FALSE, 3, keep.data=TRUE, do.swap= TRUE)
fviz_silhouette(pam)

##   cluster size ave.sil.width
## 1       1   94          0.61
## 2       2  161          0.58
## 3       3  221          0.64

fviz_cluster(pam, stand = FALSE, geom = "point",ellipse.type = "norm")

A World Map of three clusters

HDIvsHap['cluster'] <- as.factor(pam$clustering)
map <- map_data("world")
map <- left_join(map, HDIvsHap, by = c('region' = 'Country'))

ggplot() + geom_polygon(data = map, aes(x = long, y = lat, group = group, fill=cluster, color=cluster)) +
  labs(title = "Clustering Happy Planet Index", subtitle = "Based on data from:http://happyplanetindex.org/", x=NULL, y=NULL) + 
  theme_minimal()

Now that we have decided which variables contributes to the model, we will start with the outliers detection procedures.

Creating the data set with the newly selected variables

    HDIvsHap2<-HDIvsHap2[-c(5:22)]

Detecting Outliers

The analysis of outliers has been performed on multivariate an not in univariate data. The reason for that, is because some univariate outliers may be not extreme in multiple regression model, and conversely, some multivariate outliers may not be detectable in single-variables (Kutner, 2005).

Now we have the information in the correct way in order to perform Outliers determination. Let’s create different outliers detection functions:

Tukey’s fences function

#Creating and Outlier detection function (tukey's Fences or Boxplots)

tukey_outlier = function(x) {
  x < quantile(x, 0.25) - 1.5*IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

Calculating for each variable

tukey_HDI<-tukey_outlier(HDIvsHap2$`Human Development Index (HDI)`)
tukey_LEB<-tukey_outlier(HDIvsHap2$`Life expectancy at birth`)
tukey_EYS<-tukey_outlier(HDIvsHap2$`Expected years of schooling`)
HDIvsHap2<-mutate(HDIvsHap2, tukey_EYS, tukey_HDI, tukey_LEB)

Standard Deviation function

sd_outlier = function(x) {
  abs(scale(x)[,1]) > 3
}

Calculating for each variable

sd_outlier_HDI<- sd_outlier(HDIvsHap2$`Human Development Index (HDI)`)
sd_outlier_LEB<- sd_outlier(HDIvsHap2$`Life expectancy at birth`)
sd_outlier_EYS<- sd_outlier(HDIvsHap2$`Expected years of schooling`)

HDIvsHap2<-mutate(HDIvsHap2, sd_outlier_HDI, sd_outlier_LEB, sd_outlier_EYS)

mad_outlier = function(x) {
  m = mad(x)
  med = median(x)
  x > med + 3 * m | x < med - 3 * m
}

Calculating for each variable

mad_HDI<-mad_outlier(HDIvsHap2$`Human Development Index (HDI)`)
mad_LEB<-mad_outlier(HDIvsHap2$`Life expectancy at birth`)
mad_EYS<-mad_outlier(HDIvsHap2$`Expected years of schooling`)

HDIvsHap2<-mutate(HDIvsHap2, mad_HDI, mad_LEB, mad_EYS)

Creating an outlier function

outlierKD <- function(dt, var) {
  
  var_name <- eval(substitute(var),eval(dt))
  tot <- round(sum(!is.na(var_name)),2)
  na1 <- round(sum(is.na(var_name)),2)
  m1 <- round(mean(var_name, na.rm = T),2)
  par(mfrow=c(2, 2), oma=c(0,0,3,0))
  boxplot(var_name, main="With outliers", col = "yellow")
  hist(var_name, main="With outliers", xlab=NA, ylab=NA, col = "red")
  outlier <- boxplot.stats(var_name)$out
  mo <- round(mean(outlier),2)
  var_name <- ifelse(var_name %in% outlier, NA, var_name)
  boxplot(var_name, main="Without outliers", col = "yellow")
  hist(var_name, main="Without outliers", xlab=NA, ylab=NA, col = "red")
  title("Outlier Check", outer=TRUE)
  na2 <- round(sum(is.na(var_name)),2)
  message("Outliers identified: ", na2 - na1, " from ", tot, " observations")
  message("Proportion (%) of outliers: ", round((na2 - na1) / tot*100),2)
  message("Mean of the outliers: ", mo)
  m2 <- round(mean(var_name, na.rm = T),2)
  message("Mean without removing outliers: ", m1)
  message("Mean if we remove outliers: ", m2)
 
}

Outlier Analysis for Human Development Index (HDI)

outlierKD(HDIvsHap2,HDIvsHap2$`Human Development Index (HDI)` )

## Outliers identified: 0 from 1406 observations

## Proportion (%) of outliers: 02

## Mean of the outliers: NaN

## Mean without removing outliers: 0.73

## Mean if we remove outliers: 0.73

Now, we need are going to identify the outliers from above graph.

which(tukey_HDI)

## integer(0)

Accordign to Tukey’s method, the outliers are located at rows 429 430 431 432 433 434 435 436 437 438 439 440.

Checking with other methods…

which(sd_outlier_HDI)

## integer(0)

According to the sd methos, the only outliers are 439 and 440

which(mad_HDI)

## integer(0)

For the Human Development Index (HDI) variable, al three methods classify points 439 and 440 as outliers. Both Tukey’s and SD identifid points 433 434 435 436 437 438 as outliers. Only Tukey’s identified points 429 430 431 432 as outliers. The initial criteria for removing outliers in this project is that at least two methods should have identified the points as outliers. In other words, points 433 434 435 436 437 438 439 and 440 will be removed.

HDIvsHap2<-HDIvsHap2[-c(433,434,435,436,437,438,439),]

outlierKD(HDIvsHap2,HDIvsHap2$`Human Development Index (HDI)` )

## Outliers identified: 0 from 1399 observations

## Proportion (%) of outliers: 02

## Mean of the outliers: NaN

## Mean without removing outliers: 0.73

## Mean if we remove outliers: 0.73

Outlier Analysis for Life expectancy at birth (LEB)

outlierKD(HDIvsHap2,HDIvsHap2$`Life expectancy at birth` )

## Outliers identified: 0 from 1399 observations

## Proportion (%) of outliers: 02

## Mean of the outliers: NaN

## Mean without removing outliers: 72.94

## Mean if we remove outliers: 72.94

Now, we need are going to identify the outliers from above graph.

which(tukey_LEB)

## integer(0)

Accordign to Tukey’s method, the outliers are located at rows 386 389 398 404 406 420 427 430 432 434 435 436 437 438.

Checking with other methods…

which(sd_outlier_LEB)

## integer(0)

According to the sd methos, the only outliers are 404 406 420 434 436 438

which(mad_LEB)

## integer(0)

For the Life expectancy at birth (LEB) variable, al three methods classify points 404 406 420 434 436 438 as outliers. Both MAD and SD identified the same set of points 404 406 420 434 436 438 as outliers. Only Tukey’s identified points 386 389 398 as outliers. The initial criteria for removing outliers in this project is that at least two methods have identified the points as outliers. In other words, points 404 406 420 434 436 438 will be removed.

HDIvsHap2<-HDIvsHap2[-c(404, 406, 420, 434, 436, 438),]

Outlier Analysis for Expected years of schooling (EYS)

outlierKD(HDIvsHap2,HDIvsHap2$`Expected years of schooling` )

## Outliers identified: 28 from 1393 observations

## Proportion (%) of outliers: 22

## Mean of the outliers: 12.19

## Mean without removing outliers: 13.64

## Mean if we remove outliers: 13.67

Now, we need are going to identify the outliers from above graph.

which(tukey_EYS)

##  [1]   15   16   17   18   19   20   21   22   23   24   25 1324 1386 1387 1388
## [16] 1389 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406

Accordign to Tukey’s method, the outliers are located at rows 5 417 422 429 437 438 439 440.

Checking with other methods…

which(sd_outlier_EYS)

##  [1] 15 16 17 18 19 20 21 22 23 24 25

According to the sd methos, the outliers are 5 429 437 439 440.

which(mad_EYS)

##  [1] 15 16 17 18 19 20 21 22 23 24 25

For the Expected years of schooling (EYS) variable, al three methods classify points 5 429 437 439 440 as outliers. Both MAD and Tukey’s identified the additional point 422 as outliers. Only Tukey’s identified point 417 as outliers. The initial criteria for removing outliers in this project is that at least two methods have identified the points as outliers. In other words, points 5 422 429 437 439 440 will be removed.

HDIvsHap2<-HDIvsHap2[-c(5, 422, 429, 437, 439, 440),]

Finally, we will take a look of the correlation coefficients

ggpairs(HDIvsHap2, columns = 2:4)

It looks like Life expectancy at birth still need some improvement. This variable shows a bimodal sampling distribution, suggesting non-normality and that probably important variables has not been included yet. ALso, Human Development Index (HDI) has an important skweness to the left, also indicating non-normality.This important aspect will be cover in the next step. Finally, Expected year of schooling suggest some degree of normality.

Now, let’s take a look of all these variables together.

Creating a Linear Regression

Now, we will try to run simple linear regression on our dataset and create some models

fmla <- `Human Development Index (HDI)` ~ `Life expectancy at birth` + `Expected years of schooling`

Fit the model

Happiness_model <- lm(fmla, data = HDIvsHap2)

print Happiness_model and call summary()

Happiness_model

## 
## Call:
## lm(formula = fmla, data = HDIvsHap2)
## 
## Coefficients:
##                   (Intercept)     `Life expectancy at birth`  
##                      -0.39003                        0.01062  
## `Expected years of schooling`  
##                       0.02497

summary(Happiness_model)

## 
## Call:
## lm(formula = fmla, data = HDIvsHap2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.126213 -0.031985  0.002543  0.029001  0.132839 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -0.3900302  0.0129470  -30.12   <2e-16 ***
## `Life expectancy at birth`     0.0106232  0.0002634   40.33   <2e-16 ***
## `Expected years of schooling`  0.0249736  0.0006695   37.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04345 on 1384 degrees of freedom
## Multiple R-squared:  0.9219, Adjusted R-squared:  0.9218 
## F-statistic:  8171 on 2 and 1384 DF,  p-value: < 2.2e-16

plot(Happiness_model)

The Normal QQ plot reveals that the linear regression has some degree of deviation from normality at the bottom left of the plot. However, an slightly devition from normality may be accepted. In the residuals plots, no megaphone effect, cyclicity, trend or sequence have been obseved.

Using the prediction option…

HDIvsHap2$prediction <- predict(Happiness_model)

Visualizing the information

ggplot(HDIvsHap2, aes(x = prediction, y = `Human Development Index (HDI)`)) + 
  geom_point() +
  geom_abline(color = "blue")

ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) + 
  geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Rankings

In the file X2018, the Happiness Score is called ‘Ladder’. During the Gallup World Poll (GWP), the english wording of the of the question is "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. This measure is referred as ‘life ladder’ in the X2018 Excel dataset.

Adding the variable Life Ladder (Happiness score) to the file HDIvsHap2…

HDIvsHap2 <- HDIvsHap2 %>%
  left_join(X2018 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Country, `Human Development Index (HDI)`,`Life expectancy at birth`, `Expected years of schooling`, # Human development variables
               
         
          #Happines variables (X2018)
         `Life Ladder` ) 

 HDIvsHap2<-HDIvsHap2

Visualizing the information

  ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) + 
  geom_point(aes(fill= `Life Ladder`), size = 5, pch=21) +
  geom_smooth(method = "lm")+
  ggtitle("Association Between Happiness vs. Human Development")+
  scale_fill_gradient(low="yellow", high="red")

## `geom_smooth()` using formula 'y ~ x'

sp<-ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) + 
  geom_point(aes(fill= `Life Ladder`), size = HDIvsHap2$`Life Ladder`, pch=21) +
  geom_smooth(method = "lm")+
  ggtitle("Association Between Happiness vs. Human Development")

Adding more channels to the graph…

sp+scale_fill_gradient(low="yellow", high="red")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 46 rows containing missing values (geom_point).

From the model, we can see that in general terms, High scores of Happiness (Life Ladder) are associated with high values of ‘Human Development Index (HDI)’.

Creating an interactive graph

pl = ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) + 
     geom_point_interactive(aes(tooltip = HDIvsHap2$Country,
                             size= HDIvsHap2$`Life Ladder`,
                             fill=`Life Ladder`), pch = 21) +
     geom_smooth(method = "lm")+
     ggtitle("Association Between Happiness vs. Human Development")+
     scale_fill_gradient(low="yellow", high="red")

ggiraph(code = print(pl), width = 1, height_svg = 4)

## Warning: Use of `HDIvsHap2$Country` is discouraged. Use `Country` instead.

## Warning: Use of `HDIvsHap2$`Life Ladder`` is discouraged. Use `Life Ladder`
## instead.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 46 rows containing missing values (geom_interactive_point).

According to the above interactive map, Countries like Australia, Canada, Austria and Finland are on the top of Happiness and Human Developmnet. Countries like Egypt, Indonesia, Paraguay and Belize are in the middle of Happiness and Human Development and Niger, Sierre Leone, Chad and Central African Republic in the bottom. Therefore, this correlation apparently correspont with geographic location. Europe and North America on the top, next South America and North Africa and on the bottom Africa.

Let’s have a look at the average happiness for each country and region for 2018.

dfavg <- HDIvsHap_all %>%
  select(Country, `Life Ladder`) %>% #Remember Life Ladder = Happiness Score
  group_by(Country) %>%
  summarize(Average = mean(`Life Ladder`)) %>%
  arrange(desc(Average))

head(dfavg,n=10)

## # A tibble: 10 x 2
##    Country     Average
##    <fct>         <dbl>
##  1 Denmark        7.70
##  2 Norway         7.56
##  3 Switzerland    7.54
##  4 Finland        7.52
##  5 Netherlands    7.47
##  6 Canada         7.45
##  7 Iceland        7.41
##  8 Sweden         7.37
##  9 New Zealand    7.32
## 10 Australia      7.31

dfregions<-HDIvsHap_all[1431:1436,1:6]

kable(dfregions)

Country	Human Development Index (HDI)	Life expectancy at birth	Expected years of schooling	Mean years of schooling	Gross national income (GNI) per capita
Arab States	0.6989547	71.50650	11.90614	7.034884	15836.653
East Asia and the Pacific	0.7334344	74.66739	13.29281	7.862161	13687.730
Europe and Central Asia	0.7707206	73.39237	14.08741	10.255684	15331.155
Latin America and the Caribbean	0.7579895	75.71654	14.44067	8.483807	13670.660
South Asia	0.6378350	69.26117	11.87420	6.411054	6473.295
Sub-Saharan Africa	0.5371360	60.71723	10.05516	5.554588	3399.249

Incorporating this information in the graph “Association Between Happiness vs. Human Development”

pl = ggplot(HDIvsHap2, aes(x = `Expected years of schooling` + `Life expectancy at birth`, y = `Human Development Index (HDI)`)) + 
     geom_point_interactive(aes(tooltip = HDIvsHap2$Country,
                             size= HDIvsHap2$`Life Ladder`,
                             fill=`Life Ladder`), pch = 21) +
     geom_smooth(method = "lm")+
     annotate("text", x= 90, y=0.40, label="Sub Saharan Africa")+
     annotate("text", x= 65, y=0.70, label="Latin america")+
     annotate("text", x= 75, y=0.95, label="Europe and Central Asia")+
     ggtitle("Association Between Happiness vs. Human Development")+
     scale_fill_gradient(low="yellow", high="red")

ggiraph(code = print(pl), width = 1, height_svg = 4)

## Warning: Use of `HDIvsHap2$Country` is discouraged. Use `Country` instead.

## Warning: Use of `HDIvsHap2$`Life Ladder`` is discouraged. Use `Life Ladder`
## instead.

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 46 rows containing missing values (geom_interactive_point).

The above graph is not accurate. However, it shows tendency. It is necessary to depicts all the results on a map.

Creating a map

worldmap <- map_data("world")

happy_world <- HDIvsHap2 
    worldmap <- full_join(worldmap, HDIvsHap2, by = c('region' = 'Country')) 

map_theme <- theme(
    axis.title.x = element_blank(),
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank(),
    panel.background = element_rect(fill = "white")) 


ggplot() + geom_polygon(data = worldmap, aes(x = long, y = lat, group = group, fill = `Life Ladder`)) +
  scale_fill_continuous(low="thistle2", high="darkred", na.value="snow2") +
  coord_quickmap() +
  labs(title = "Happiness Around the World - 2018", subtitle = "Based on data from:http://happyplanetindex.org/", x=NULL, y=NULL) + 
  theme_minimal()

The darker the red, the higher the happiness score. Regions in gray do not have happiness data or there is some display problem, like in the cases of USA, Rusia and Venezuela. The happiest regions of the world appear to be in Europe, North and South America, and New Zealand. Africa appears to contain the lowest overall happiness scores.

The Human Development report, has statistics by Region, while the Happines report (X2018) do not.

Finally, showing the main factor of Happiness

dfwide<-HDIvsHap2 %>%
  head(850)

dflong <-gather(dfwide, Factor, `Importance of Factor` ,  `Human Development Index (HDI)`:`Expected years of schooling`,  factor_key = TRUE)

ggplot(data = dflong) +
  geom_bar(stat = "identity", 
           aes(x = Country, y = `Importance of Factor`, fill = Factor)) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "Main Factors of Happiness Top 10 Countries") +
  theme(plot.title = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5)))

World Happines Project

Oswaldo Salazar

March, 28 2019

World Happiness Report

Principal Component Analysis

Creating a new joined data set

Principal Component Analysis with all data joined

Using Clustering Analysis to group countries

Detecting Outliers

Creating a Linear Regression

Rankings

Creating a map