Introduction

The aim of this article is to apply PCA for dimension reduction to The World Happiness Report 2016 data. PCA is a dimensionality-reduction method that is often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set. As a result, this project helps to build a better understanding of the data and of the dependencies between the variables.

library(factoextra)
library(labdsv)
library(maptools)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(Rtsne)
library(psy)
library(scales)
library(kableExtra)
library(pdp)
library(corrplot)
library(gridExtra) # for grid.arrange()

Dataset

The data comes from The World Happiness Report 2016. It is a landmark survey of the state of global happiness, which ranks 157 countries.

Source: https://www.kaggle.com/unsdsn/world-happiness?select=2016.csv

dane <- read.csv("Happiness2016.csv", sep = ",", dec = ".", header = TRUE)
rownames(dane) <- dane[, 1]

# Drop the identifier columns and the confidence-interval bounds,
# keeping only the numeric variables used in the analysis
drop <- c("Country", "Region", "Happiness.Rank",
          "Lower.Confidence.Interval", "Upper.Confidence.Interval")
dane <- dane[, !(names(dane) %in% drop)]

head(dane) %>% kbl() %>% kable_paper("hover") %>% scroll_box(width = "910px")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Denmark | 7.526 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 | 2.73939 |
| Switzerland | 7.509 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 | 2.69463 |
| Iceland | 7.501 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 | 2.83137 |
| Norway | 7.498 | 1.57744 | 1.12690 | 0.79579 | 0.59609 | 0.35776 | 0.37895 | 2.66465 |
| Finland | 7.413 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 | 2.82596 |
| Canada | 7.404 | 1.44015 | 1.09610 | 0.82760 | 0.57370 | 0.31329 | 0.44834 | 2.70485 |
cat("Number of observations in the dataset:", nrow(dane))
## Number of observations in the dataset: 157
cat("Number of years variables in the analysis:", ncol(dane))
## Number of years variables in the analysis: 8

Check for missing values

missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)
##               Happiness.Score      Economy..GDP.per.Capita. 
##                          "0%"                          "0%" 
##                        Family      Health..Life.Expectancy. 
##                          "0%"                          "0%" 
##                       Freedom Trust..Government.Corruption. 
##                          "0%"                          "0%" 
##                    Generosity             Dystopia.Residual 
##                          "0%"                          "0%"

No missing values were found in the data.

summary(dane) %>% kbl() %>% kable_paper("hover")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Min. | 2.905 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.00000 | 0.0000 | 0.8179 |
| 1st Qu. | 4.404 | 0.6702 | 0.6418 | 0.3829 | 0.2575 | 0.06126 | 0.1546 | 2.0317 |
| Median | 5.314 | 1.0278 | 0.8414 | 0.5966 | 0.3975 | 0.10547 | 0.2225 | 2.2907 |
| Mean | 5.382 | 0.9539 | 0.7936 | 0.5576 | 0.3710 | 0.13762 | 0.2426 | 2.3258 |
| 3rd Qu. | 6.269 | 1.2796 | 1.0215 | 0.7299 | 0.4845 | 0.17554 | 0.3119 | 2.6646 |
| Max. | 7.526 | 1.8243 | 1.1833 | 0.9528 | 0.6085 | 0.50521 | 0.8197 | 3.8377 |

All variables in the dataset are continuous.

To begin, I compute the correlation matrix (its entries are the pairwise Pearson correlations between the variables) to see whether there are any relationships between them.

cor.matrix <- cor(dane, method = "pearson")
corrplot(cor.matrix, type = "lower", order = "alphabet",
         tl.col = "black", tl.cex = 1,
         col = colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))

The dataset contains positively correlated variables. Happiness.Score appears to be highly correlated with Economy..GDP.per.Capita., Health..Life.Expectancy. and Family. Moreover, a high correlation occurs between Economy..GDP.per.Capita. and Health..Life.Expectancy.
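
These observations can be verified numerically against the correlation matrix computed above (a minimal sketch):

# Correlations of Happiness.Score with all variables
round(cor.matrix["Happiness.Score", ], 2)
# Correlation between GDP per capita and life expectancy
round(cor.matrix["Economy..GDP.per.Capita.", "Health..Life.Expectancy."], 2)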

Principal Component Analysis (PCA)

PCA is a technique for reducing the dimensionality of datasets like this one, increasing interpretability while at the same time minimizing information loss. It does so by creating new, uncorrelated variables that successively maximize variance.

The data have to be standardized first, so that variables measured on larger scales do not dominate the components solely because of their units.
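
As a quick sanity check (a minimal sketch, using only the dane data frame defined above), PCA on standardized data has component variances equal to the eigenvalues of the correlation matrix:

# With standardized data, the PCA variances (sdev^2) coincide with
# the eigenvalues of the correlation matrix
all.equal(prcomp(scale(dane))$sdev^2, eigen(cor(dane))$values)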

data.pca <- prcomp(dane, center = TRUE, scale = TRUE)
summary(data.pca)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5    PC6    PC7
## Standard deviation     1.947 1.2061 1.0303 0.81897 0.72416 0.5967 0.3752
## Proportion of Variance 0.474 0.1818 0.1327 0.08384 0.06555 0.0445 0.0176
## Cumulative Proportion  0.474 0.6558 0.7885 0.87235 0.93790 0.9824 1.0000
##                              PC8
## Standard deviation     0.0002111
## Proportion of Variance 0.0000000
## Cumulative Proportion  1.0000000

The output of the summary function shows three statistics for every component: the standard deviation, the proportion of variance explained and the cumulative proportion. From the output we can see that PC1 explains 47% of the variance, PC2 explains 18%, PC3 explains 13%, and so on.
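
These numbers can be recovered directly from the component standard deviations (a minimal sketch):

# Each component's variance is sdev^2; normalizing by the total gives
# the proportion of variance explained
prop.var <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(prop.var, 3)
round(cumsum(prop.var), 3)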

eig.val <- get_eigenvalue(data.pca)
eig.val
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.791850e+00     4.739812e+01                    47.39812
## Dim.2 1.454735e+00     1.818419e+01                    65.58231
## Dim.3 1.061524e+00     1.326905e+01                    78.85136
## Dim.4 6.707090e-01     8.383862e+00                    87.23523
## Dim.5 5.244024e-01     6.555030e+00                    93.79026
## Dim.6 3.560158e-01     4.450198e+00                    98.24045
## Dim.7 1.407636e-01     1.759545e+00                   100.00000
## Dim.8 4.454561e-08     5.568201e-07                   100.00000
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")

Three components should be retained, because only their eigenvalues are greater than 1 (the Kaiser criterion).
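
The Kaiser rule can be checked directly on the eigenvalues computed above:

# Number of components with eigenvalue > 1 (from the table above: 3)
sum(eig.val[, 1] > 1)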

fviz_eig(data.pca,barfill = "#990066",barcolor = "#990066",linecolor = "red")

The scree plot represents graphically the percentage of variance explained by each component. PC1 alone explains 47% of the variation; three components are needed to explain 79% of the variance.

Correlation circle

fviz_pca_var(data.pca, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

The chart shows that the darker a variable is drawn, the more important it is. The variable names are hardly visible, but the most important ones are Happiness.Score, Economy..GDP.per.Capita., Health..Life.Expectancy. and Family. On the other hand, the variable Dystopia.Residual contributes the least.
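
The contributions encoded by the colors can also be inspected numerically with factoextra's get_pca_var() (a minimal sketch):

# Per-variable contributions (in %) to the first two principal components
var.info <- get_pca_var(data.pca)
round(var.info$contrib[, 1:2], 1)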

Below are loading plots showing how strongly each variable's loading contributes to a given principal component.
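
The raw loadings behind these plots are available in the rotation matrix returned by prcomp:

# Loadings (eigenvectors) of the first three principal components
round(data.pca$rotation[, 1:3], 2)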

PC1, PC2
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=11)

PC2, PC3
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066", 
                      high="black", midpoint=20)


The plot below shows the cumulative percentage of variance explained as a function of the number of principal components.

plot(summary(data.pca)$importance[3, ], type = "b",
     xlab = "Number of components", ylab = "Cumulative proportion of variance")

fviz_pca_ind(data.pca, col.ind = "cos2", geom = "point",
             gradient.cols = c("#99FF33", "#CC0066", "black"))

The individuals plot above colors each country by cos2, i.e. by how well it is represented in the first two dimensions.

PC1 <- fviz_contrib(data.pca, choice = "var", axes = 1, fill = "#990066", color = "#990066")
PC2 <- fviz_contrib(data.pca, choice = "var", axes = 2, fill = "#990066", color = "#990066")
PC3 <- fviz_contrib(data.pca, choice = "var", axes = 3, fill = "#990066", color = "#990066")
grid.arrange(PC1, PC2, PC3, ncol = 3)

Four variables contribute most to the first component and four to the second, while Dystopia.Residual belongs to the third component, which is driven by this single variable.
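
The same reading can be made explicit by assigning each variable to the component it contributes to most (a sketch):

# For each variable, the component (among the first three) receiving
# its highest contribution
contrib <- get_pca_var(data.pca)$contrib[, 1:3]
apply(contrib, 1, which.max)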

Hierarchical clustering

data.standardized <- as.data.frame(lapply(dane, scale))

# Cluster the variables (columns), not the observations, hence the t()
distance.m <- dist(t(data.standardized))
hc <- hclust(distance.m, method = "complete")
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = '#99FF33')

sub_grp <- cutree(hc, k = 3)
fviz_cluster(list(data = t(data.standardized), cluster = sub_grp),
             palette = c("#339900", "#CC0066", "black"))

The results show that the variable groupings produced by the hierarchical tree agree with those suggested by the PCA components.
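
For reference, the membership of each variable in the three clusters can be listed directly (a minimal sketch):

# Variables falling into each of the three clusters
split(names(sub_grp), sub_grp)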

Summary

The main goal of this research was to examine whether happiness scores can be described by a smaller number of variables, using PCA for dimension reduction. The results of the analysis suggest that the latent concepts can be described by 3 dimensions. Hierarchical clustering was then applied to the same dataset to explore whether the two methods give the same results, and it was shown that the final results do not differ.