Introduction
The aim of this article is to apply Principal Component Analysis (PCA) for dimension reduction to The World Happiness Report 2016 data. PCA is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. As a result, this project helps to gain a better understanding of the data and of the dependencies between variables.
library(factoextra)
library(labdsv)
library(maptools)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(Rtsne)
library(psy)
library(scales)
library(kableExtra)
library(pdp)
library(corrplot)
library(gridExtra) # provides grid.arrange(), used later to combine plots
Dataset
The data comes from The World Happiness Report 2016. It is a landmark survey of the state of global happiness, which ranks 157 countries.
dane <- read.csv("Happiness2016.csv", sep = ",", dec = ".", header = TRUE)
dane <- data.frame(dane)
# Use the country name as the row name
rownames(dane) <- dane[, 1]
# Drop the character columns and the variables not used in the analysis
drop <- c("Country", "Region", "Happiness.Rank", "Lower.Confidence.Interval", "Upper.Confidence.Interval")
dane <- dane[, !(names(dane) %in% drop)]
head(dane) %>% kbl() %>% kable_paper("hover") %>% scroll_box(width = "910px")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Denmark | 7.526 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 | 2.73939 |
| Switzerland | 7.509 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 | 2.69463 |
| Iceland | 7.501 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 | 2.83137 |
| Norway | 7.498 | 1.57744 | 1.12690 | 0.79579 | 0.59609 | 0.35776 | 0.37895 | 2.66465 |
| Finland | 7.413 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 | 2.82596 |
| Canada | 7.404 | 1.44015 | 1.09610 | 0.82760 | 0.57370 | 0.31329 | 0.44834 | 2.70485 |
cat("Number of observations in the dataset:", nrow(dane))
## Number of observations in the dataset: 157
cat("Number of years variables in the analysis:", ncol(dane))
## Number of years variables in the analysis: 8
Check for missing values
missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)
## Happiness.Score Economy..GDP.per.Capita.
## "0%" "0%"
## Family Health..Life.Expectancy.
## "0%" "0%"
## Freedom Trust..Government.Corruption.
## "0%" "0%"
## Generosity Dystopia.Residual
## "0%" "0%"
No missing values were found in the data.
summary(dane) %>% kbl() %>% kable_paper("hover")
| | Happiness.Score | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|
| Min. | 2.905 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.00000 | 0.0000 | 0.8179 |
| 1st Qu. | 4.404 | 0.6702 | 0.6418 | 0.3829 | 0.2575 | 0.06126 | 0.1546 | 2.0317 |
| Median | 5.314 | 1.0278 | 0.8414 | 0.5966 | 0.3975 | 0.10547 | 0.2225 | 2.2907 |
| Mean | 5.382 | 0.9539 | 0.7936 | 0.5576 | 0.3710 | 0.13762 | 0.2426 | 2.3258 |
| 3rd Qu. | 6.269 | 1.2796 | 1.0215 | 0.7299 | 0.4845 | 0.17554 | 0.3119 | 2.6646 |
| Max. | 7.526 | 1.8243 | 1.1833 | 0.9528 | 0.6085 | 0.50521 | 0.8197 | 3.8377 |
All variables in the dataset are continuous.
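This can be verified with a quick check of the column types (no additional packages needed):
# Confirm that every remaining column is numeric
sapply(dane, is.numeric)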
To begin with, I compute the correlation matrix (its entries are the pairwise Pearson correlations between the variables) to see whether there are any relationships between them.
cor.matrix <- cor(dane, method = c("pearson"))
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))

The correlation plot shows several positively correlated variables. Happiness.Score appears to be highly correlated with Economy.GDP.per.Capita, Health.Life.Expectancy and Family. Moreover, a high correlation occurs between Economy.GDP.per.Capita and Health.Life.Expectancy.
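These relationships can also be checked numerically; for example, the row of the correlation matrix for Happiness.Score can be printed directly:
# Pearson correlations of Happiness.Score with all variables (including itself)
round(cor.matrix["Happiness.Score", ], 2)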
Principal Component Analysis (PCA)
The summary function reports three statistics for each component: the standard deviation, the proportion of variance explained and the cumulative proportion of variance. From the output we can see that PC1 explains 47% of the variance, PC2 explains 18%, PC3 explains 13%, and so on.
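The data.pca object used below is not created in the code shown so far. A minimal sketch of how it can be obtained, assuming a correlation-based PCA (variables centred and scaled, which is consistent with the eigenvalues below summing to 8):
# PCA on the standardized variables (correlation-matrix PCA)
data.pca <- prcomp(dane, center = TRUE, scale. = TRUE)
summary(data.pca)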
eig.val <- get_eigenvalue(data.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.791850e+00 4.739812e+01 47.39812
## Dim.2 1.454735e+00 1.818419e+01 65.58231
## Dim.3 1.061524e+00 1.326905e+01 78.85136
## Dim.4 6.707090e-01 8.383862e+00 87.23523
## Dim.5 5.244024e-01 6.555030e+00 93.79026
## Dim.6 3.560158e-01 4.450198e+00 98.24045
## Dim.7 1.407636e-01 1.759545e+00 100.00000
## Dim.8 4.454561e-08 5.568201e-07 100.00000
scree.plot(eig.val[,1], title = "Scree Plot", type = "E", simu = "P")

Three components should be retained, because only their eigenvalues are greater than 1 (the Kaiser criterion).
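This rule of thumb can also be checked programmatically from the eig.val table computed above, for example:
# Number of components with an eigenvalue greater than 1 (Kaiser criterion)
sum(eig.val[, "eigenvalue"] > 1)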
fviz_eig(data.pca,barfill = "#990066",barcolor = "#990066",linecolor = "red")

The scree plot graphically represents the percentage of variance explained by each component. The results show that PC1 alone explains 47% of the variation, and three components are needed to explain 79% of the variance.
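The same conclusion can be read from the cumulative variance column of eig.val; as an illustration, the smallest number of components explaining at least 75% of the variance (an arbitrary threshold chosen here) can be found with:
# First row whose cumulative explained variance reaches 75%
which(eig.val[, "cumulative.variance.percent"] >= 75)[1]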
Correlation circle
fviz_pca_var(data.pca, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)

The chart shows that the darker the arrow, the more important the variable. Although the labels are hard to read, the most important variables are Happiness.Score, Economy.GDP.per.Capita, Health.Life.Expectancy and Family. On the other hand, Dystopia.Residual has the least influence.
Below are loading plots showing how strongly each variable contributes to a given principal component.
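As a numeric complement to the plots, the loadings themselves can also be printed (a small sketch assuming data.pca is the prcomp object defined earlier):
# Loadings (rotation matrix) of the first three principal components
round(data.pca$rotation[, 1:3], 2)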
PC1, PC2
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(1, 2)) +
labs(title="Variables loadings for PC1 and PC2", x="PC1", y="PC2")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)

PC2, PC3
fviz_pca_var(data.pca, col.var="contrib", repel = TRUE, axes = c(2, 3)) +
labs(title="Variables loadings for PC2 and PC3", x="PC2", y="PC3")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=20)
The plot below shows the cumulative percentage of variance explained for each number of principal components.
plot(summary(data.pca)$importance[3,])

fviz_pca_ind(data.pca, col.ind="cos2", geom = "point", gradient.cols = c("#99FF33", "#CC0066", "black" ))

PC1 <- fviz_contrib(data.pca, choice = "var", axes = 1,fill = "#990066",color = "#990066")
PC2 <- fviz_contrib(data.pca, choice = "var", axes = 2,fill = "#990066",color = "#990066")
PC3 <- fviz_contrib(data.pca, choice = "var", axes = 3,fill = "#990066",color = "#990066")
grid.arrange(PC1, PC2, PC3, ncol=3)

Four variables contribute most strongly to the first component and another four to the second, while Dystopia.Residual belongs to the third component, which essentially consists of this single variable.
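The exact contribution percentages behind these bar plots can be extracted with factoextra, for example:
# Variable contributions (in %) to the first three components
var.contrib <- get_pca_var(data.pca)$contrib
round(var.contrib[, 1:3], 1)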
Hierarchical clustering
# Standardize the variables before computing distances
data.standardized <- as.data.frame(lapply(dane, scale))
# Distance matrix between the variables (columns), hence the transpose
distance.m <- dist(t(data.standardized))
hc <- hclust(distance.m, method = "complete")
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = '#99FF33')

sub_grp<-cutree(hc, k=3)
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("#339900", "#CC0066", "black" ))

The results show that the variable clusters obtained from the hierarchical tree do not differ from the groupings suggested by the PCA components.
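This comparison can be made explicit by listing which variables fall into each of the three clusters (a small sketch using sub_grp from above):
# Variables grouped by their hierarchical cluster assignment
split(names(sub_grp), sub_grp)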
Summary
The main goal of this analysis was to examine whether happiness scores can be described by a smaller number of variables using PCA for dimension reduction. The results suggest that the latent concepts can be described by 3 dimensions. Hierarchical clustering was then applied to the same dataset to check whether the two methods give the same results; it was shown that the final results do not differ.