Introduction

The aim of this article is to apply PCA for dimension reduction to The World Happiness Report 2016 data. PCA is a dimensionality-reduction method that is often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set. As a result, this project helps to build a better understanding of the data and of the dependencies between the variables.

library(factoextra)
library(labdsv)
library(maptools)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(Rtsne)
library(psy)
library(scales)
library(kableExtra)
library(pdp)
library(corrplot)
library(gridExtra) # for grid.arrange()

Dataset

The data comes from The World Happiness Report 2016. It is a landmark survey of the state of global happiness, which ranks 157 countries.

Source: https://www.kaggle.com/unsdsn/world-happiness?select=2016.csv

dane <- read.csv("Happiness2016.csv", sep = ",", dec = ".", header = TRUE)
rownames(dane) <- dane[, 1]

# Drop the identifier columns and the confidence-interval bounds,
# keeping only the numeric variables used in the analysis
drop <- c("Country", "Region", "Happiness.Rank",
          "Lower.Confidence.Interval", "Upper.Confidence.Interval")
dane <- dane[, !(names(dane) %in% drop)]

head(dane) %>% kbl() %>% kable_paper("hover") %>% scroll_box(width = "910px")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Denmark | 7.526 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 | 2.73939 |
| Switzerland | 7.509 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 | 2.69463 |
| Iceland | 7.501 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 | 2.83137 |
| Norway | 7.498 | 1.57744 | 1.12690 | 0.79579 | 0.59609 | 0.35776 | 0.37895 | 2.66465 |
| Finland | 7.413 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 | 2.82596 |
| Canada | 7.404 | 1.44015 | 1.09610 | 0.82760 | 0.57370 | 0.31329 | 0.44834 | 2.70485 |
cat("Number of observations in the dataset:", nrow(dane))
## Number of observations in the dataset: 157
cat("Number of years variables in the analysis:", ncol(dane))
## Number of years variables in the analysis: 8

Check for missing values

missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)
##               Happiness.Score      Economy..GDP.per.Capita. 
##                          "0%"                          "0%" 
##                        Family      Health..Life.Expectancy. 
##                          "0%"                          "0%" 
##                       Freedom Trust..Government.Corruption. 
##                          "0%"                          "0%" 
##                    Generosity             Dystopia.Residual 
##                          "0%"                          "0%"

No missing values were found in the data.

summary(dane) %>% kbl() %>% kable_paper("hover")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Min. | 2.905 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.00000 | 0.0000 | 0.8179 |
| 1st Qu. | 4.404 | 0.6702 | 0.6418 | 0.3829 | 0.2575 | 0.06126 | 0.1546 | 2.0317 |
| Median | 5.314 | 1.0278 | 0.8414 | 0.5966 | 0.3975 | 0.10547 | 0.2225 | 2.2907 |
| Mean | 5.382 | 0.9539 | 0.7936 | 0.5576 | 0.3710 | 0.13762 | 0.2426 | 2.3258 |
| 3rd Qu. | 6.269 | 1.2796 | 1.0215 | 0.7299 | 0.4845 | 0.17554 | 0.3119 | 2.6646 |
| Max. | 7.526 | 1.8243 | 1.1833 | 0.9528 | 0.6085 | 0.50521 | 0.8197 | 3.8377 |

All variables in the dataset are continuous.

To begin, I compute the correlation matrix (its entries are the pairwise Pearson correlations between the variables) to see whether there are any relationships between them.

cor.matrix <- cor(dane, method = "pearson")
corrplot(cor.matrix, type = "lower", order = "alphabet",
         tl.col = "black", tl.cex = 1,
         col = colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))

The dataset contains positively correlated variables. Happiness.Score appears to be highly correlated with Economy..GDP.per.Capita., Health..Life.Expectancy. and Family. Moreover, a high correlation occurs between Economy..GDP.per.Capita. and Health..Life.Expectancy.
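
These observations can be verified numerically against the correlation matrix computed above (a minimal sketch):

# Correlations of Happiness.Score with all variables
round(cor.matrix["Happiness.Score", ], 2)
# Correlation between GDP per capita and life expectancy
round(cor.matrix["Economy..GDP.per.Capita.", "Health..Life.Expectancy."], 2)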

Principal Component Analysis (PCA)

PCA is a technique for reducing the dimensionality of datasets like this one, increasing interpretability while at the same time minimizing information loss. It does so by creating new, uncorrelated variables that successively maximize variance.

The data have to be standardized first, so that variables measured on larger scales do not dominate the components solely because of their units.
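
As a quick sanity check (a minimal sketch, using only the dane data frame defined above), PCA on standardized data has component variances equal to the eigenvalues of the correlation matrix:

# With standardized data, the PCA variances (sdev^2) coincide with
# the eigenvalues of the correlation matrix
all.equal(prcomp(scale(dane))$sdev^2, eigen(cor(dane))$values)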

data.pca <- prcomp(dane, center = TRUE, scale = TRUE)
summary(data.pca)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5    PC6    PC7
## Standard deviation     1.947 1.2061 1.0303 0.81897 0.72416 0.5967 0.3752
## Proportion of Variance 0.474 0.1818 0.1327 0.08384 0.06555 0.0445 0.0176
## Cumulative Proportion  0.474 0.6558 0.7885 0.87235 0.93790 0.9824 1.0000
##                              PC8
## Standard deviation     0.0002111
## Proportion of Variance 0.0000000
## Cumulative Proportion  1.0000000

The output of the summary function shows three statistics for every component: the standard deviation, the proportion of variance explained and the cumulative proportion. From the output we can see that PC1 explains 47% of the variance, PC2 explains 18%, PC3 explains 13%, and so on.
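
These numbers can be recovered directly from the component standard deviations (a minimal sketch):

# Each component's variance is sdev^2; normalizing by the total gives
# the proportion of variance explained
prop.var <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(prop.var, 3)
round(cumsum(prop.var), 3)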

eig.val <- get_eigenvalue(data.pca)
eig.val
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.791850e+00     4.739812e+01                    47.39812
## Dim.2 1.454735e+00     1.818419e+01                    65.58231
## Dim.3 1.061524e+00     1.326905e+01                    78.85136
## Dim.4 6.707090e-01     8.383862e+00                    87.23523
## Dim.5 5.244024e-01     6.555030e+00                    93.79026
## Dim.6 3.560158e-01     4.450198e+00                    98.24045
## Dim.7 1.407636e-01     1.759545e+00                   100.00000
## Dim.8 4.454561e-08     5.568201e-07                   100.00000
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")

Three components should be retained, because only their eigenvalues are greater than 1 (the Kaiser criterion).
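
The Kaiser rule can be checked directly on the eigenvalues computed above:

# Number of components with eigenvalue > 1 (from the table above: 3)
sum(eig.val[, 1] > 1)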

fviz_eig(data.pca,barfill = "#990066",barcolor = "#990066",linecolor = "red")

The scree plot represents graphically the percentage of variance explained by each component. PC1 alone explains 47% of the variation; three components are needed to explain 79% of the variance.

Correlation circle

fviz_pca_var(data.pca, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

The chart shows that the darker a variable is drawn, the more important it is. The variable names are hardly visible, but the most important ones are Happiness.Score, Economy..GDP.per.Capita., Health..Life.Expectancy. and Family. On the other hand, the variable Dystopia.Residual contributes the least.
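
The contributions encoded by the colors can also be inspected numerically with factoextra's get_pca_var() (a minimal sketch):

# Per-variable contributions (in %) to the first two principal components
var.info <- get_pca_var(data.pca)
round(var.info$contrib[, 1:2], 1)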

Below are loading plots showing how strongly each variable's loading contributes to a given principal component.
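
The raw loadings behind these plots are available in the rotation matrix returned by prcomp:

# Loadings (eigenvectors) of the first three principal components
round(data.pca$rotation[, 1:3], 2)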

PC1, PC2
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

PC2, PC3
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=20)


The plot below shows the cumulative percentage of variance explained as a function of the number of principal components.

plot(summary(data.pca)$importance[3, ], type = "b",
     xlab = "Number of components", ylab = "Cumulative proportion of variance")

fviz_pca_ind(data.pca, col.ind = "cos2", geom = "point",
             gradient.cols = c("#99FF33", "#CC0066", "black"))

The individuals plot above colors each country by cos2, i.e. by how well it is represented in the first two dimensions.

PC1 <- fviz_contrib(data.pca, choice = "var", axes = 1, fill = "#990066", color = "#990066")
PC2 <- fviz_contrib(data.pca, choice = "var", axes = 2, fill = "#990066", color = "#990066")
PC3 <- fviz_contrib(data.pca, choice = "var", axes = 3, fill = "#990066", color = "#990066")
grid.arrange(PC1, PC2, PC3, ncol = 3)

Four variables contribute most to the first component and four to the second, while Dystopia.Residual belongs to the third component, which is driven by this single variable.
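
The same reading can be made explicit by assigning each variable to the component it contributes to most (a sketch):

# For each variable, the component (among the first three) receiving
# its highest contribution
contrib <- get_pca_var(data.pca)$contrib[, 1:3]
apply(contrib, 1, which.max)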

Hierarchical clustering

data.standardized <- as.data.frame(lapply(dane, scale))

# Cluster the variables (columns), not the observations, hence the t()
distance.m <- dist(t(data.standardized))
hc <- hclust(distance.m, method = "complete")
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = '#99FF33')

sub_grp <- cutree(hc, k = 3)
fviz_cluster(list(data = t(data.standardized), cluster = sub_grp),
             palette = c("#339900", "#CC0066", "black"))

The results show that the variable groupings produced by the hierarchical tree agree with those suggested by the PCA components.
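
For reference, the membership of each variable in the three clusters can be listed directly (a minimal sketch):

# Variables falling into each of the three clusters
split(names(sub_grp), sub_grp)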

Summary

The main goal of this research was to examine whether happiness scores can be described by a smaller number of variables, using PCA for dimension reduction. The results of the analysis suggest that the latent concepts can be described by 3 dimensions. Hierarchical clustering was then applied to the same dataset to explore whether the two methods give the same results, and it was shown that the final results do not differ.