Research Question: How can we group the countries based on
indicators of social support, healthy life expectancy, freedom to make
decisions, generosity and perception of corruption?
Objective:
Identify homogeneous groups of countries and explore how they differ
from each other.
Use the results to analyze whether countries with similar
characteristics belong to specific regions or share levels of economic
development.
# Select the columns containing the numeric variables for clustering
mydata_clu_std <- as.data.frame(scale(X2018[c(5:9)])) # Standardize columns 5 to 9
head(mydata_clu_std)
## Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1 1.2477925 1.116026 1.388175
## 2 1.2146014 1.063673 1.418970
## 3 1.2411542 1.091863 1.400493
## 4 1.4203862 1.277114 1.363540
## 5 1.1050708 1.329468 1.258839
## 6 0.9026051 1.132135 1.123343
## Generosity Perceptions_of_corruption
## 1 0.2128356 2.912165
## 2 1.0631330 2.362896
## 3 1.0428878 3.067619
## 4 1.7413464 0.269453
## 5 0.7594554 2.539077
## 6 1.5388947 1.896535
Interpretation:
This step is performed after the variables have been described and
their key parameters explained, which ensures that the original values
and their distributions are understood before they are transformed.
Standardization is a necessary preparation for the following
analyses, specifically for calculating dissimilarity and performing
clustering.
Standardization converts each variable so that it has a mean of 0 and
a standard deviation of 1. This ensures that all variables have equal
weight in the dissimilarity and clustering analysis.
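As a quick check of what scale() does, the following sketch standardizes a small made-up table and confirms that every column ends up with mean 0 and standard deviation 1. The values and the names toy, toy_std, col_means and col_sds are illustrative assumptions, not taken from X2018.

```r
# Toy data: hypothetical values on very different scales (not from X2018)
toy <- data.frame(
  small_scale = c(0.2, 0.5, 0.9, 1.4, 1.6),
  large_scale = c(120, 450, 300, 980, 610)
)

# scale() subtracts the column mean and divides by the column standard deviation
toy_std <- as.data.frame(scale(toy))

col_means <- colMeans(toy_std)    # should be 0 up to rounding error
col_sds   <- sapply(toy_std, sd)  # should be 1

print(round(col_means, 10))
print(col_sds)
```

After this step both columns are on the same scale, so neither dominates a distance calculation just because of its units.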
# Calculate the dissimilarity (Euclidean distance from the origin)
# Note: this uses the original X2018 columns, not the standardized mydata_clu_std
X2018$Dissimilarity <- sqrt(
X2018$`Social_support`^2 +
X2018$`Healthy_life_expectancy`^2 +
X2018$`Freedom_to_make_life_choices`^2 +
X2018$`Generosity`^2 +
X2018$`Perceptions_of_corruption`^2
)
head(X2018[order(-X2018$Dissimilarity), c("Country_or_region", "Dissimilarity")], 10)
## # A tibble: 10 × 2
## Country_or_region Dissimilarity
## <chr> <dbl>
## 1 Iceland 2.03
## 2 New Zealand 2.02
## 3 Denmark 2.00
## 4 Finland 1.99
## 5 Australia 1.99
## 6 Norway 1.98
## 7 Switzerland 1.97
## 8 Ireland 1.96
## 9 Singapore 1.95
## 10 Canada 1.94
Interpretation:
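Calculating dissimilarity measures how far each country lies from the
origin of the variable space. With standardized variables the origin is
the overall average, so unusually large values point to possible
outliers. Note that the code above applies the formula to the original
X2018 columns rather than to mydata_clu_std, so the values shown are on
the original scale of the indicators. Checking for outliers is an
important step to ensure that the clustering results are not biased by
extreme values and that the groups formed reflect meaningful
relationships between observations.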
Dissimilarity values:
The highest dissimilarity values are between 1.94 and 2.03, so these
countries (Iceland, New Zealand, Denmark, etc.) lie furthest from the
origin in the space of the selected variables.
The ten largest values are close to each other and none of them stands
apart from the rest, which indicates that these countries are not
outliers.
I decided to keep them. By the criterion we saw in class there is no
evidence that they are extreme cases, and they represent an important
group of countries whose characteristics help the analysis.
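The same formula can be applied to a standardized table in one line. This is a minimal sketch on made-up data (toy_std and dist_from_centre are hypothetical names). Applied to mydata_clu_std, it would give each country's distance from the overall average of the five indicators.

```r
set.seed(123)  # only to make the made-up data reproducible

# Hypothetical data: 6 units, 3 variables, standardized with scale()
toy_std <- as.data.frame(scale(matrix(rnorm(18), nrow = 6)))

# Euclidean distance of each unit from the origin (the mean after standardization)
dist_from_centre <- sqrt(rowSums(toy_std^2))

# Units sorted from most to least distant; large values are outlier candidates
print(sort(dist_from_centre, decreasing = TRUE))
```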
# Graphical display of the dissimilarity matrix
library(factoextra)
## Loading required package: ggplot2
rownames(mydata_clu_std) <- X2018$`Country or region` # Assign country names to rows and columns
## Warning: Unknown or uninitialised column: `Country or region`.
Distance <- get_dist(mydata_clu_std, method = "euclidean")
fviz_dist(
Distance,
gradient = list(low = "darkred", mid = "grey95", high = "white")
) +
theme(
axis.text.x = element_text(size = 5, angle = 45, hjust = 1), # Diagonal rotation
axis.text.y = element_text(size = 5) # Keep Y-axis labels small
)

Interpretation:
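The country names did not appear on the axes. The warning printed
above points to the likely cause: X2018 has no column called
`Country or region` (the column is `Country_or_region`), so no row
names were assigned. With 155 countries the labels would also be hard
to read at this size. I tried to get help from ChatGPT, but I did not
want to use code that I do not fully understand.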
Basically, the matrix represents the Euclidean distances between the
selected countries based on the standardized variables.
With the gradient used here (low = "darkred", high = "white"), dark red
cells mark small distances, meaning that the countries are similar in
the selected variables, while light cells mark large distances (greater
dissimilarity).
As we saw in class, the pattern of dark and light areas hints at the
structure of the data. The matrix has many dark areas, and several dark
blocks separated by lighter regions suggest that the countries fall
into more than one group, which could result in multiple clusters when
performing the clustering analysis.
Finally I think that the initial number of clusters (based on the
visual analysis of the dissimilarity matrix) could be between 3-5
groups.
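For reference, the Euclidean distance behind each cell of the matrix can be written as follows. Here z_{ik} denotes the standardized value of variable k for country i, and p = 5 is the number of clustering variables. The symbols are notation introduced here for illustration, not taken from the course material.

```latex
d(i, j) = \sqrt{\sum_{k=1}^{p} \left( z_{ik} - z_{jk} \right)^{2}}
```

Two countries with identical standardized profiles have d(i, j) = 0, and the distance grows as their profiles diverge.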
library(factoextra)
# Calculate the Hopkins statistic
hopkins_result <- get_clust_tendency(
mydata_clu_std,
n = nrow(mydata_clu_std) - 1, # Number of samples
graph = FALSE # Without generating graph
)
# Show result
hopkins_result$hopkins_stat
## [1] 0.7375746
Interpretation:
According to what we saw in class, the Hopkins value is between 0 and
1.
Closer to 1: The data has a clear cluster structure. Closer to 0.5 or
less: The data is randomly distributed, with no clear tendency to form
clusters.
The value of 0.74 suggests that there is a moderate to strong
tendency for the data to form clusters. This means that clustering is a
valid approach to analyzing this data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
# Calculate the Euclidean distance and apply the Ward method
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
# Show the WARD object
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 155
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation:
If we look at the structure of the dendrogram, by cutting around a
height of 15, we can form 3 main groups. This cutting point divides the
largest branches without subdividing too much.
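Cutting a dendrogram into a fixed number of groups can be done with cutree(). A minimal sketch, assuming the built-in USArrests data as a stand-in for the country table (toy_std, toy_ward and toy_groups are illustrative names):

```r
# Stand-in data: built-in USArrests (50 states, 4 variables), standardized
toy_std <- as.data.frame(scale(USArrests))

# Same recipe as above: Euclidean distances, then Ward's method
toy_ward <- hclust(dist(toy_std, method = "euclidean"), method = "ward.D2")

# Cut the tree into 3 groups; cutree(toy_ward, h = ...) would cut at a height instead
toy_groups <- cutree(toy_ward, k = 3)

# Number of units in each group
print(table(toy_groups))
```

Asking for k = 3 avoids having to read the exact cutting height off the plot.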
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")

Interpretation:
The graph suggests that grouping the data into 3 clusters is adequate
to capture the variability in the data without overfitting the model.
However, it is important to complement this analysis with other metrics,
such as the silhouette index, to confirm the validity of this
choice.
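The curve in an elbow plot is the total within-cluster sum of squares for each candidate number of clusters. The sketch below computes it by hand, with the built-in USArrests data as a stand-in (the seed, the range 1 to 10 and the object names are assumptions for illustration):

```r
set.seed(42)  # k-means uses random starts, so fix the seed for reproducibility

toy_std <- scale(USArrests)  # stand-in data, standardized

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy_std, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the values stop dropping sharply
print(data.frame(k = k_values, wss = round(wss, 2)))
```

Plotting wss against k_values, for example with plot(k_values, wss, type = "b"), gives the shape of the elbow curve.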
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")

Interpretation:
The two methods (Elbow and Silhouette) do not contradict each other,
but they offer different perspectives:
The first method, Elbow Method, focuses on reducing internal
variability. The second method, Silhouette Analysis, evaluates the
separation and cohesion of the clusters.
Based on this, if I am looking for higher quality in the separation,
3 clusters would be the most robust option.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##

## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 0.4036 62.4545 62.8998 -2.8519 222.6243 22932062709 14722.489 546.7976
## 3 4.6551 75.0214 26.1866 -0.0641 404.6440 15945013959 6192.934 387.4948
## 4 0.7731 66.9165 24.2662 0.5435 502.4819 15078921107 5012.477 330.5478
## 5 1.3119 63.8932 19.2961 1.7854 601.2691 12456570479 3274.897 284.7823
## 6 15.4723 61.1389 8.9736 2.5689 697.8498 9619477309 2562.778 252.3232
## 7 0.0630 55.1405 0.6334 1.8839 738.9063 10046365181 2235.662 237.9902
## 8 0.6234 47.2348 19.9321 0.0417 743.1549 12766989838 2189.347 236.9760
## 9 18.5976 49.0898 6.9700 1.7663 823.7610 9606017111 1585.708 208.6805
## 10 0.3027 46.1745 7.9087 1.4636 859.0216 9446297583 1567.307 199.1721
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 3.4099 1.4082 0.3562 1.0719 0.3632 0.6719 62.4911 1.5162 0.3653
## 3 6.4650 1.9871 0.3455 1.2605 0.3215 0.6658 27.6054 1.5522 0.4015
## 4 7.7110 2.3295 0.3367 1.3243 0.2672 1.4731 -25.3730 -0.9765 0.3763
## 5 9.3317 2.7038 0.3411 1.2773 0.2727 1.9743 -38.4915 -1.4977 0.3547
## 6 11.2207 3.0516 0.3622 1.2103 0.2914 1.0623 -1.1136 -0.1732 0.3346
## 7 12.1060 3.2354 0.3555 1.1296 0.2955 2.1951 -29.4001 -1.6385 0.3141
## 8 12.1746 3.2493 0.3144 1.2223 0.2499 1.5095 -6.0752 -0.9809 0.2940
## 9 13.8368 3.6899 0.3814 1.1862 0.2657 1.0007 -0.0132 -0.0021 0.2846
## 10 14.7371 3.8660 0.3759 1.2123 0.2356 1.1795 -5.0212 -0.4572 0.2722
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 273.3988 0.5155 0.8954 0.2396 0.1042 0.0034 1.5716 1.7401 1.1886
## 3 129.1649 0.5893 1.2694 0.8466 0.1118 0.0036 1.6570 1.4542 0.7824
## 4 82.6370 0.5227 0.6573 1.4675 0.1250 0.0038 1.7715 1.3313 0.6625
## 5 56.9565 0.4939 -0.1848 1.9734 0.1201 0.0038 1.7355 1.2383 0.4547
## 6 42.0539 0.5318 0.0300 1.8964 0.1361 0.0043 1.7551 1.1843 0.3897
## 7 33.9986 0.5368 -6.3236 1.9103 0.1361 0.0044 1.6294 1.1582 0.3701
## 8 29.6220 0.4675 0.0072 2.5503 0.1090 0.0042 1.5117 1.1474 0.3397
## 9 23.1867 0.4857 1.1836 2.5925 0.1407 0.0047 1.6617 1.0918 0.3302
## 10 19.9172 0.4675 4.3507 2.8599 0.1407 0.0049 1.8185 1.0640 0.3063
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7102 52.2320 0.1826
## 3 0.6717 26.8772 0.1725
## 4 0.5639 61.0918 1.0000
## 5 0.5550 62.5495 1.0000
## 6 0.4477 23.4420 1.0000
## 7 0.5162 50.6206 1.0000
## 8 0.3943 27.6451 1.0000
## 9 0.4360 24.5757 1.0000
## 10 0.5094 31.7873 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 9.0000 3.0000 3.0000 6.0000 3.0000 3 3.000
## Value_Index 18.5976 75.0214 36.7131 2.5689 182.0196 6120955896 8529.555
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 9.0000 8.0000 2.0000 2.0000 4.0000
## Value_Index 102.3559 3.0551 -0.2644 0.3144 1.0719 0.3632 1.4731
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 4.000 2.0000 3.0000 3.0000 3.0000 1 2.0000
## Value_Index -25.373 1.5162 0.4015 144.2339 0.5893 NA 0.2396
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 9.0000 0 8.0000 0 10.0000
## Value_Index 0.1407 0 1.5117 0 0.3063
##
## $Best.partition
## [1] 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 3 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1
## [38] 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [75] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 3 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1
## [112] 2 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2
## [149] 2 3 2 2 2 2 2
Interpretation:
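Three clusters is the most frequently proposed number: 10 of the 23
indices that gave a proposal chose 3, ahead of 2 clusters (4 indices)
and 9 clusters (3 indices). This supports the choice of 3 clusters
under the majority rule used by NbClust.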
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of different positions of initial leaders
Clustering
## K-means clustering with 3 clusters of sizes 25, 45, 85
##
## Cluster means:
## Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1 0.8500304 0.9256207 1.06175389
## 2 -1.1340743 -1.2124972 -0.55954120
## 3 0.3503833 0.3696689 -0.01605286
## Generosity Perceptions_of_corruption
## 1 1.2153767 1.8177715
## 2 0.1633474 -0.2137200
## 3 -0.4439418 -0.4214928
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 3 1 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3
## [38] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
## [75] 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 1 2 2 3 3 3 3 3 3 2 2 2 2 2 3 3
## [112] 2 2 2 3 2 2 3 3 2 3 2 2 2 2 2 2 3 1 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2
## [149] 2 1 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 61.19617 147.56179 178.73682
## (between_SS / total_SS = 49.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Interpretation:
Here we can see the group sizes and the distribution of units. We have
25 countries in group 1, 45 in group 2, and 85 in group 3. These
results indicate that the data set is divided into three groups with
distinct characteristics based on the selected variables.
Approximately 49.7% of the total variability lies between the clusters
(between_SS / total_SS), which is a reasonable start but also indicates
that the groups are not extremely distinct.
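The 49.7% can be reproduced from the printed output. This sketch assumes that all five columns were standardized with scale() over the 155 countries, in which case each column contributes 155 - 1 to the total sum of squares. The object names are illustrative.

```r
# Within-cluster sums of squares, copied from the kmeans output above
within_ss <- c(61.19617, 147.56179, 178.73682)

# Total sum of squares of 5 standardized columns over 155 rows (assumption stated above)
total_ss <- (155 - 1) * 5

# between_SS / total_SS = 1 - within_SS / total_SS
between_share <- 1 - sum(within_ss) / total_ss
print(round(100 * between_share, 1))
```

This prints 49.7, which matches the line (between_SS / total_SS = 49.7 %) in the output.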
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)

X2018 <- X2018 %>%
filter(!`Country_or_region` %in% c("Myanmar", "Rwanda", "Somalia","Central African Republic"))
mydata_clu_std <- as.data.frame(scale(X2018[c(5:9)]))
rownames(mydata_clu_std) <- X2018$`Country_or_region` # Assign country names to rows and columns
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of different positions of initial leaders
Clustering
## K-means clustering with 3 clusters of sizes 44, 23, 84
##
## Cluster means:
## Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1 -1.1559245 -1.2116799 -0.587682376
## 2 0.9864587 1.0554353 1.089572595
## 3 0.3353825 0.3457012 0.009498272
## Generosity Perceptions_of_corruption
## 1 0.1688719 -0.2174719
## 2 1.2255174 1.9095231
## 3 -0.4240151 -0.4089317
##
## Clustering vector:
## Finland Norway Denmark
## 2 2 2
## Iceland Switzerland Netherlands
## 2 2 2
## Canada New Zealand Sweden
## 2 2 2
## Australia United Kingdom Austria
## 2 3 2
## Costa Rica Ireland Germany
## 3 2 2
## Belgium Luxembourg United States
## 2 2 2
## Israel Czech Republic Malta
## 2 3 2
## France Mexico Chile
## 3 3 3
## Taiwan Panama Brazil
## 3 3 3
## Argentina Guatemala Uruguay
## 3 3 3
## Qatar Saudi Arabia Singapore
## 2 3 2
## Malaysia Spain Colombia
## 3 3 3
## Trinidad & Tobago Slovakia El Salvador
## 3 3 3
## Nicaragua Poland Bahrain
## 3 3 3
## Uzbekistan Kuwait Thailand
## 2 3 3
## Italy Ecuador Belize
## 3 3 3
## Lithuania Slovenia Romania
## 3 3 3
## Latvia Japan Mauritius
## 3 3 3
## Jamaica South Korea Northern Cyprus
## 3 3 3
## Russia Kazakhstan Cyprus
## 3 3 3
## Bolivia Estonia Paraguay
## 3 3 3
## Peru Kosovo Moldova
## 3 3 3
## Turkmenistan Hungary Libya
## 3 3 3
## Philippines Honduras Belarus
## 3 3 3
## Turkey Pakistan Hong Kong
## 3 1 2
## Portugal Serbia Greece
## 3 3 3
## Lebanon Montenegro Croatia
## 3 3 3
## Dominican Republic Algeria Morocco
## 3 3 3
## China Azerbaijan Tajikistan
## 3 3 3
## Macedonia Jordan Nigeria
## 3 3 1
## Kyrgyzstan Bosnia and Herzegovina Mongolia
## 3 3 3
## Vietnam Indonesia Bhutan
## 3 1 2
## Cameroon Bulgaria Nepal
## 1 3 3
## Venezuela Gabon Palestinian Territories
## 3 3 3
## South Africa Iran Ivory Coast
## 3 1 1
## Ghana Senegal Laos
## 1 1 1
## Tunisia Albania Sierra Leone
## 3 3 1
## Congo (Brazzaville) Bangladesh Sri Lanka
## 1 1 3
## Iraq Mali Namibia
## 1 1 3
## Cambodia Burkina Faso Egypt
## 3 1 1
## Mozambique Kenya Zambia
## 1 1 1
## Mauritania Ethiopia Georgia
## 1 1 1
## Armenia Chad Congo (Kinshasa)
## 3 1 1
## India Niger Uganda
## 1 1 1
## Benin Sudan Ukraine
## 1 1 3
## Togo Guinea Lesotho
## 1 1 1
## Angola Madagascar Zimbabwe
## 1 1 1
## Afghanistan Botswana Malawi
## 1 3 1
## Haiti Liberia Syria
## 1 1 1
## Yemen Tanzania South Sudan
## 1 1 1
## Burundi
## 1
##
## Within cluster sum of squares by cluster:
## [1] 139.07804 38.92111 187.71952
## (between_SS / total_SS = 51.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Interpretation:
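After removing four countries (Myanmar, Rwanda, Somalia and the Central
African Republic) and repeating the standardization and the K-means
clustering on the remaining 151 countries, we have new results. The
clusters now explain 51.2% of the total variance of the data, which is
an acceptable proportion for the problem we want to analyze.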
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1 -1.1559245 -1.2116799 -0.587682376
## 2 0.9864587 1.0554353 1.089572595
## 3 0.3353825 0.3457012 0.009498272
## Generosity Perceptions_of_corruption
## 1 0.1688719 -0.2174719
## 2 1.2255174 1.9095231
## 3 -0.4240151 -0.4089317
Interpretation:
These results show the standardized averages of the variables in each
of the 3 clusters. Basically, they tell us the main characteristics
that define each group:
Cluster 1: Negative values in most variables: social support (-1.16),
life expectancy (-1.21), freedom of choice (-0.59), and perceptions of
corruption (-0.22). Only generosity (0.17) is slightly above average.
This group highlights countries with lower development and significant
challenges in social, health, and individual freedom indicators.
Cluster 2: Positive values across all variables: social support
(0.99), life expectancy (1.06), freedom of choice (1.09), generosity
(1.23), and perceptions of corruption (1.91). This cluster represents
countries with high overall indicators in the studied dimensions,
likely indicating well-developed nations with strong systems in place
for social and individual well-being.
Cluster 3: Moderate values: social support (0.34) and life expectancy
(0.35) are slightly positive, freedom of choice (0.01) is at the
average, and generosity (-0.42) and perceptions of corruption (-0.41)
are below average. This group represents average-performing countries,
combining positive scores on some indicators with weaker scores on
generosity and perceptions of corruption.
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"))
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$ImeF <- factor(Figure$name,
levels = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"),
labels = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"))
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette="Set1") +
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Interpretation:
In this graph, we are visualizing the means of the variables for each
of the three groups defined by the K-means clustering analysis. This
allows us to compare and characterize the groups based on the variables
used for clustering.
Group 1 (red dots): This group has negative values across most
variables, especially in social support and healthy life expectancy,
with generosity slightly above the average. These countries may face
significant socio-economic difficulties and represent lower levels of
development and support systems.
Group 2 (blue triangles): This group consistently exhibits high
positive values in all variables, indicating strong social support,
long healthy life expectancy, significant freedom of choice, high
generosity, and low perceptions of corruption. This likely represents
countries with better socio-economic conditions and stability.
Group 3 (green squares): These countries have mixed performance, with
most variables near the average but negative values in “Generosity” and
“Perceptions of corruption.” This indicates countries with moderate
development and some challenges in specific areas.
X2018$Group <- Clustering$cluster #Assigning units to groups
# Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(`Social_support`, `Healthy_life_expectancy`, `Freedom_to_make_life_choices`, `Generosity`, `Perceptions_of_corruption`) ~ as.factor(Group),
data = X2018)
summary(fit)
## Response Social_support :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 7.3181 3.6590 112.93 < 2.2e-16 ***
## Residuals 148 4.7952 0.0324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Healthy_life_expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.9074 2.9537 149.16 < 2.2e-16 ***
## Residuals 148 2.9308 0.0198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom_to_make_life_choices :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.1167 0.55836 29.264 1.954e-11 ***
## Residuals 148 2.8239 0.01908
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.44866 0.224331 38.009 4.77e-14 ***
## Residuals 148 0.87351 0.005902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Perceptions_of_corruption :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.85545 0.42772 147.97 < 2.2e-16 ***
## Residuals 148 0.42782 0.00289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:
The very low p-values (< 0.001) indicate that the clusters are
significantly different from each other across all analyzed variables.
This confirms that the clustering successfully groups entities with
distinct profiles.
aggregate(X2018$`GDP_per_capita`,
by = list(X2018$Group),
FUN = mean)
## Group.1 x
## 1 1 0.4787955
## 2 2 1.3315217
## 3 3 1.0028571
Interpretation:
In this part, I checked criterion validity: I evaluated whether a
variable not used in the clustering (in this case, GDP per capita)
differs between the created groups. The group means are 0.48 for group
1, 1.33 for group 2, and 1.00 for group 3.
If the means are significantly different between groups, this
supports the validity of the clustering, since the groups then reflect
real differences in a variable not included in the initial clustering
process.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(X2018$`GDP_per_capita`, as.factor(X2018$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 2.7799 0.06529 .
## 148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:
Levene’s test checks whether the variances of the groups are
homogeneous.
H₀: The variances of the groups are equal (homogeneity of variance).
H₁: The variances of the groups are not equal (heterogeneity of
variance).
The p-value of 0.065 is greater than the significance level of 0.05.
Thus, we fail to reject H₀, meaning there is no significant evidence
to suggest that the variances across the groups are different.
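With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal sketch on simulated data (the data, the seed and the object names are made up for illustration and do not reproduce the result above):

```r
set.seed(1)  # reproducible made-up data

# Hypothetical data: three groups of 30 values with similar spread
toy <- data.frame(
  group = factor(rep(c("1", "2", "3"), each = 30)),
  value = c(rnorm(30, mean = 0.5, sd = 0.3),
            rnorm(30, mean = 1.3, sd = 0.3),
            rnorm(30, mean = 1.0, sd = 0.3))
)

# Absolute deviation of each value from the median of its own group
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the deviations: a small p-value signals unequal variances
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
print(levene_by_hand)
```

Because the test works on deviations from the median rather than from the mean, it is less sensitive to skewed groups.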
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
X2018 %>% group_by(as.factor(X2018$Group)) %>%
shapiro_test(`GDP_per_capita`)
## # A tibble: 3 × 4
## `as.factor(X2018$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 GDP_per_capita 0.970 0.298
## 2 2 GDP_per_capita 0.759 0.0000908
## 3 3 GDP_per_capita 0.981 0.265
Interpretation:
The Shapiro-Wilk test assesses whether the data within each group
follow a normal distribution.
H₀: the data are normally distributed. H₁: the data are not normally
distributed.
From the Shapiro-Wilk test, the p-values for each group are:
Group 1: p = 0.298, do not reject H₀ → no evidence against
normality.
Group 2: p = 0.0000908, reject H₀ → the data is not normally
distributed.
Group 3: p = 0.265, do not reject H₀ → no evidence against
normality.
Since Group 2 does not meet normality, using parametric tests such as
ANOVA might not be appropriate. In this case, we might consider using a
nonparametric test such as Kruskal-Wallis, which does not assume
normality in the data.
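The decision rule described here (check normality per group, then fall back to Kruskal-Wallis if any group fails) can be sketched in base R as follows. The data are simulated and the names are hypothetical. Only shapiro.test(), kruskal.test() and aov() from the stats package are used.

```r
set.seed(2018)  # reproducible made-up data

# Hypothetical data: group 2 is drawn from a skewed distribution
toy <- data.frame(
  group = factor(rep(c("1", "2", "3"), each = 30)),
  value = c(rnorm(30, mean = 0.5, sd = 0.2),
            1 + rexp(30, rate = 3),
            rnorm(30, mean = 1.0, sd = 0.2))
)

# Shapiro-Wilk p-value for each group
shapiro_p <- tapply(toy$value, toy$group, function(v) shapiro.test(v)$p.value)
print(round(shapiro_p, 4))

# If any group departs from normality at the 5% level, use Kruskal-Wallis;
# otherwise a one-way ANOVA is an option
if (any(shapiro_p < 0.05)) {
  result <- kruskal.test(value ~ group, data = toy)
} else {
  result <- summary(aov(value ~ group, data = toy))
}
print(result)
```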
# Perform Kruskal Wallis
kruskal.test(GDP_per_capita ~ Group,
data = X2018)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_per_capita by Group
## Kruskal-Wallis chi-squared = 92.733, df = 2, p-value < 2.2e-16
Interpretation:
Since the p-value is extremely low (p < 2.2e-16, far below 0.05), we
reject H₀, which assumes that the distributions of GDP per capita are
the same across all groups. This means there are significant
differences in GDP per capita among the groups identified through
clustering.
This shows that the groups obtained from the clustering also differ in
GDP per capita, a variable that was not used to form them, which
supports the criterion validity of the solution.
CONCLUSION
This analysis aimed to explore how countries can be effectively
grouped based on key social and development indicators, including social
support, healthy life expectancy, freedom to make decisions, generosity,
and perceptions of corruption. By employing clustering techniques, the
study identified three distinct groups of countries, each defined by
unique patterns in these dimensions.
Based on the results, here are the clusters and their
characteristics:
Cluster 1: Countries in this cluster score below average on most
dimensions, highlighting struggles in areas like social support, life
expectancy, and freedom of choice, with generosity slightly above
average. They may represent regions with fewer resources and higher
social and economic difficulties.
Cluster 2: These countries exhibit consistently high levels in all
indicators, representing strong social support systems, longer healthy
life expectancies, high freedom of choice, notable generosity, and low
perceptions of corruption. They reflect regions with robust stability
and well-being.
Cluster 3: This group includes countries with moderate scores in
social and development indicators. While they show progress in areas
like social support and life expectancy, they face challenges in metrics
like generosity and perceptions of corruption. These nations may
represent those striving toward greater social and economic
development.