Principal Component Analysis (PCA) is a widely used data analysis technique that can be applied across many fields and applications. In this report I apply PCA to reduce the dimensionality of the World Happiness Report 2017, an annual survey that ranks countries according to their level of happiness and well-being. The 2017 edition of the report contains a wealth of data on factors that contribute to happiness, including economic, social, and political factors. By applying PCA to this dataset, it is possible to identify the key factors that contribute to happiness and well-being across different countries and regions.

# Load Libraries
library(tidyverse)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(ggplot2)
library(DALEX)
library(ggpubr)
library(fastDummies)
library(qdapTools)
library(gridExtra)
library(corrplot)
library(psych)
library(reshape2)
library(scales)
Happiness_Report <- read_csv("Data/World Happiness Report/2017.csv")
Happiness_Report
We can drop the columns that are not needed for the analysis: "Country", "Happiness.Rank", "Whisker.high", and "Whisker.low".
Happiness <- Happiness_Report %>%
select(-c(Country, Happiness.Rank, Whisker.high, Whisker.low))
Happiness
Scaling the data is an essential preprocessing step in PCA. The reason is that PCA is a variance-maximizing exercise, which means that it will identify the directions of maximum variance in the data. When data is not scaled, variables with higher variances will dominate the analysis, regardless of their importance in explaining the underlying structure of the data. By scaling the data, we ensure that each variable has equal importance in the analysis, which is crucial for correctly identifying the directions of maximum variance. Scaling also helps to avoid numerical problems that can arise when working with variables that have different scales or units of measurement. Overall, scaling is an important step in PCA that allows us to extract meaningful information from the data and make reliable inferences about the underlying structure of the variables.
Happiness_scaled <- center_scale(Happiness)
colnames(Happiness_scaled) <- colnames(Happiness)
head(Happiness_scaled)
## Happiness.Score Economy..GDP.per.Capita. Family Health..Life.Expectancy.
## [1,] 1.929741 1.501321 1.199688 1.034812
## [2,] 1.916481 1.182684 1.260949 1.017514
## [3,] 1.900569 1.178525 1.467911 1.190400
## [4,] 1.891729 1.378972 1.141860 1.294078
## [5,] 1.869629 1.090451 1.223092 1.087501
## [6,] 1.788302 1.233924 0.835616 1.093991
## Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
## [1,] 1.510938 0.85419543 1.8969355 0.8535293
## [2,] 1.448164 0.80424936 2.7311448 0.9268865
## [3,] 1.455871 1.69651389 0.2990966 0.9449014
## [4,] 1.408589 0.32397752 2.3990321 0.8529085
## [5,] 1.394458 -0.01039246 2.5525258 1.1598216
## [6,] 1.177345 1.65904262 1.5693552 0.8890821
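As a quick sanity check (a minimal sketch, not part of the original output), we can confirm that the standardization behaved as expected: every column of the scaled matrix should have a mean of approximately 0 and a standard deviation of approximately 1.
# Sanity check: each standardized column should have mean ~0 and sd ~1.
round(colMeans(Happiness_scaled), 2)
round(apply(Happiness_scaled, 2, sd), 2)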
happy_long <- melt(data.frame(Happiness_scaled), id.vars = NULL)
ggplot(data = happy_long, aes(x = value, fill = variable)) +
geom_histogram() +
theme(plot.title = element_text(hjust = 0.5, size = 15)) +
facet_wrap(.~ variable, scales = "free", ncol = 3)
cor_mat <- cor(Happiness)
corrplot(cor_mat, type = "lower", order = "hclust",
tl.col = "black", tl.cex = 0.5)
Principal Component Analysis (PCA) is a method used to decrease the complexity of data by reducing its dimensionality. This allows for easier interpretation while minimizing the amount of information lost. PCA achieves this by generating new variables that are not correlated with each other, while also maximizing the amount of variance present in the data.
happ_pca <- prcomp(Happiness, center = TRUE, scale = TRUE)
summary(happ_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9564 1.1812 1.0409 0.81890 0.72606 0.59984 0.36887
## Proportion of Variance 0.4785 0.1744 0.1354 0.08382 0.06589 0.04498 0.01701
## Cumulative Proportion 0.4785 0.6529 0.7883 0.87212 0.93802 0.98299 1.00000
## PC8
## Standard deviation 0.0002029
## Proportion of Variance 0.0000000
## Cumulative Proportion 1.0000000
There are three commonly used methods for selecting the number of components: the Kaiser rule, the scree plot, and the proportion of variance explained.
The Kaiser rule is a commonly used method for determining the number of components to retain in PCA. It focuses on the eigenvalues of each component, which represent the amount of variance explained by that component. The Kaiser rule suggests that only components with eigenvalues greater than 1 should be retained, as eigenvalues less than 1 indicate that the component explains less variance than a single variable. This approach is based on the idea that components with eigenvalues greater than 1 are more meaningful and contain more information than those with eigenvalues less than 1.
eig.val <- get_eigenvalue(happ_pca)
round(eig.val,3)
Based on the table above, components 1, 2, and 3 have eigenvalues greater than 1, so only those three components should be retained.
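Equivalently (a minimal cross-check of the table above), the eigenvalues can be read directly from the prcomp object: they are the squared standard deviations of the components, so the Kaiser rule reduces to counting how many squared standard deviations exceed 1.
# Cross-check: eigenvalues are the squared component standard deviations from prcomp().
eigenvalues <- happ_pca$sdev^2
round(eigenvalues, 3)
sum(eigenvalues > 1)  # number of components retained under the Kaiser rule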
A scree plot is a graphical tool used in PCA that displays the eigenvalue of each principal component against its component number. The scree plot helps to identify the number of significant principal components that should be retained for further analysis. The plot typically shows a steep drop in eigenvalues for the initial principal components, followed by a flattening (the "elbow"), after which additional components explain little extra variance.
fviz_eig(happ_pca, choice='eigenvalue')
This approach, like the Kaiser rule, indicates that the appropriate number of components is 3.
The number of components can also be determined from the amount of variance explained by each component. This method suggests that the chosen components should together explain over 2/3 of the total variance.
fviz_eig(happ_pca, choice='variance', addlabels = T)+
scale_y_continuous(breaks = seq(0, 100, by = 10),
labels = paste0(format(seq(0, 100, by = 10)),"%")) +
labs(title = "Percentage of Variance Explained") +
theme_bw()
The plot above shows that component 1 explains 47.8% of the variance, component 2 explains 17.4%, and component 3 explains 13.5%.
pca_df <- data.frame(eig.val)
pca_df$Component <- rownames(pca_df)
pca_df %>%
mutate(Component = substr(Component, 5,6),
Component = as.numeric(Component)) %>%
ggplot(aes(x = Component,
y = cumulative.variance.percent)) +
geom_line(col = "red") +
geom_bar(stat = "identity", fill = "steelblue", width = .9, alpha = 0.9)+
scale_y_continuous(breaks = seq(0, 100, by = 10),
labels = paste0(format(seq(0, 100, by = 10)),"%")) +
scale_x_continuous(n.breaks = 8) +
geom_text(aes(label = paste(round(cumulative.variance.percent,2),"%")),
size = 5) +
expand_limits(y = 0)+
labs(title = "Cumulative Variance explained") +
theme_bw()
From the cumulative variance plot above, we can see that components 1, 2, and 3 together explain 78.8% of the variance, which is already well above 2/3 of the total variance. Limiting our analysis to just these three components helps to avoid overfitting.
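The same cumulative figure can be verified directly from the eigenvalues (a short sketch, equivalent to the table produced by get_eigenvalue above): the cumulative proportion after the third component should come out at roughly 0.788.
# Cumulative proportion of variance computed directly from the eigenvalues.
round(cumsum(happ_pca$sdev^2) / sum(happ_pca$sdev^2), 3)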
The "cloud of points" plot below shows the quality of representation (cos2) of the individual observations in the space of the first two principal components.
fviz_pca_ind(happ_pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "yellow", "red" ))
fviz_pca_var(happ_pca, col.var = "red")
The plot above depicts both the interrelationships between the variables and the "quality" of each variable. Variables that are positively correlated are situated close together, whereas negatively correlated variables lie on opposite sides of the plot. The "quality" of each variable is indicated by its distance from the center of the plot.
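To back up this visual interpretation, the loadings (the rotation matrix returned by prcomp) can be inspected directly; this is a minimal sketch, and variables with loadings of the same sign on a component point in the same direction on the plot.
# Loadings of each variable on the first three principal components.
round(happ_pca$rotation[, 1:3], 3)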
PC1 <- fviz_contrib(happ_pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(happ_pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(happ_pca, choice = "var", axes = 3)
grid.arrange(PC1, PC2, PC3, ncol = 3)
The plots above show the variables that contribute most to each component (the exact contribution percentages can also be extracted numerically, as in the sketch below).
Component 1 contains: Happiness.Score, Economy..GDP.per.Capita., Health..Life.Expectancy., Family.
Component 2 contains: Generosity, Trust..Government.Corruption., Freedom, Dystopia.Residual.
Component 3 contains: Dystopia.Residual
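As a complement to the plots (a hedged sketch using factoextra's accessor functions), get_pca_var() returns, among other things, the percentage contribution of each variable to each component, which gives the numbers behind the bar charts above.
# Percentage contribution of each variable to the first three components.
var_info <- get_pca_var(happ_pca)
round(var_info$contrib[, 1:3], 2)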
Dimension reduction involves reducing the number of features in a dataset while preserving as much information as possible. In this analysis, keeping just three of the eight principal components explains 78.8% of the variance, retaining over three-quarters of the information contained in the original dataset. Dimension reduction techniques are particularly useful for the analysis and storage of large datasets, and can help to simplify and streamline complex data structures while retaining key information.
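As a final step, a minimal sketch of how the reduced dataset could be carried forward: the scores of the first three components can be extracted from the prcomp object and used in place of the original eight variables in downstream analyses such as clustering.
# Reduced representation: scores of the first three principal components.
Happiness_reduced <- as.data.frame(happ_pca$x[, 1:3])
head(Happiness_reduced)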