#HOMEWORK 2

Instructions:

  1. Find your data.

  2. Import data into R Studio.

  3. Display your data using the head function.

  4. Explain your data (what is the unit of observation, sample size, definition of all variables, units of measurement, etc.).

  5. Name the source of the data.

  6. Carry out data manipulation if necessary (e.g., clean the data, convert categorical variables to factors, etc.). Show the descriptive statistics and explain a few estimates of parameters (1 point).

library(readr)
X2018 <- read_csv("~/Documents/MASTER IMB /Multivariate Analysis/Homework 2 /2018.csv")
## Rows: 156 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country or region, Perceptions of corruption
## dbl (7): Overall rank, Score, GDP per capita, Social support, Healthy life e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(X2018)
## # A tibble: 6 × 9
##   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
##            <dbl> <chr>               <dbl>            <dbl>            <dbl>
## 1              1 Finland              7.63             1.30             1.59
## 2              2 Norway               7.59             1.46             1.58
## 3              3 Denmark              7.56             1.35             1.59
## 4              4 Iceland              7.50             1.34             1.64
## 5              5 Switzerland          7.49             1.42             1.55
## 6              6 Netherlands          7.44             1.36             1.49
## # ℹ 4 more variables: `Healthy life expectancy` <dbl>,
## #   `Freedom to make life choices` <dbl>, Generosity <dbl>,
## #   `Perceptions of corruption` <chr>

Explanation of the data

Observation unit: Each row represents a country or region.

Sample size: 156 countries or regions.

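Variables (nine columns, as listed in the output above): Overall rank, Country or region, Score, GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption.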

Link of the source: https://www.kaggle.com/datasets/unsdsn/world-happiness

# Convert 'Perceptions of corruption' to numeric
X2018$`Perceptions of corruption` <- as.numeric(X2018$`Perceptions of corruption`)
## Warning: NAs introduced by coercion

Descriptive statistics

summary(X2018[c(5:9)])
##  Social support  Healthy life expectancy Freedom to make life choices
##  Min.   :0.000   Min.   :0.0000          Min.   :0.0000              
##  1st Qu.:1.067   1st Qu.:0.4223          1st Qu.:0.3560              
##  Median :1.255   Median :0.6440          Median :0.4870              
##  Mean   :1.213   Mean   :0.5973          Mean   :0.4545              
##  3rd Qu.:1.463   3rd Qu.:0.7772          3rd Qu.:0.5785              
##  Max.   :1.644   Max.   :1.0300          Max.   :0.7240              
##                                                                      
##    Generosity     Perceptions of corruption
##  Min.   :0.0000   Min.   :0.000            
##  1st Qu.:0.1095   1st Qu.:0.051            
##  Median :0.1740   Median :0.082            
##  Mean   :0.1810   Mean   :0.112            
##  3rd Qu.:0.2390   3rd Qu.:0.137            
##  Max.   :0.5980   Max.   :0.457            
##                   NA's   :1
X2018$`Perceptions of corruption` <- as.numeric(X2018$`Perceptions of corruption`)

# Show the rows where the conversion introduced NA
na_rows <- which(is.na(X2018$`Perceptions of corruption`))
X2018[na_rows, ]
## # A tibble: 1 × 9
##   `Overall rank` `Country or region`  Score `GDP per capita` `Social support`
##            <dbl> <chr>                <dbl>            <dbl>            <dbl>
## 1             20 United Arab Emirates  6.77             2.10            0.776
## # ℹ 4 more variables: `Healthy life expectancy` <dbl>,
## #   `Freedom to make life choices` <dbl>, Generosity <dbl>,
## #   `Perceptions of corruption` <dbl>

This code converts the Perceptions of corruption column to a numeric format and checks if missing values (NA) were introduced during the process. If rows with NA values are detected, they are listed for identification.
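
As a minimal sketch of what happens in this step, the toy vector below (hypothetical values, not taken from the data set) shows how one non-numeric entry is turned into NA by as.numeric() and how its position can be located. The string "N/A" is an assumption for illustration only; the actual raw entry in 2018.csv is not shown in this report.

```r
# Toy column: numbers stored as text, with one non-numeric entry (assumed "N/A")
corruption_chr <- c("0.393", "0.340", "N/A", "0.138")

# suppressWarnings() hides the "NAs introduced by coercion" warning
corruption_num <- suppressWarnings(as.numeric(corruption_chr))

# Positions whose original text could not be parsed as a number
bad_positions <- which(is.na(corruption_num) & !is.na(corruption_chr))

print(corruption_num)
print(corruption_chr[bad_positions])
```

Looking at the original text of the affected entries, as in the last line, helps to decide whether the row should be dropped or the value corrected by hand.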

# Remove all rows that contain missing values (NA) in any column of the X2018 dataframe.
X2018 <- na.omit(X2018)
summary(X2018[c(5:9)])
##  Social support  Healthy life expectancy Freedom to make life choices
##  Min.   :0.000   Min.   :0.0000          Min.   :0.0000              
##  1st Qu.:1.075   1st Qu.:0.4205          1st Qu.:0.3575              
##  Median :1.258   Median :0.6430          Median :0.4930              
##  Mean   :1.216   Mean   :0.5969          Mean   :0.4556              
##  3rd Qu.:1.464   3rd Qu.:0.7785          3rd Qu.:0.5790              
##  Max.   :1.644   Max.   :1.0300          Max.   :0.7240              
##    Generosity    Perceptions of corruption
##  Min.   :0.000   Min.   :0.000            
##  1st Qu.:0.109   1st Qu.:0.051            
##  Median :0.173   Median :0.082            
##  Mean   :0.181   Mean   :0.112            
##  3rd Qu.:0.240   3rd Qu.:0.137            
##  Max.   :0.598   Max.   :0.457

Social support

Median (1.258): The median indicates that half of the countries have a level of social support greater than or equal to 1.258. This value is useful because it represents the central point of the distribution and is not affected by extreme values (outliers). Because the median lies closer to the maximum (1.644) than to the minimum (0.000), most countries sit in the upper part of the range.

Healthy life expectancy

Maximum (1.030): The maximum of 1.030 is the highest value of healthy life expectancy in the data set. Such values plausibly correspond to countries with advanced health systems and very good living conditions.

Freedom to make life choices

Minimum (0.000): A minimum value of 0.000 signals the complete absence of perceived personal freedom. This is likely to be observed in countries with authoritarian regimes, where citizens’ decisions are strictly controlled by the government or institutions.

Generosity

Average (0.181): The mean of 0.181 is low relative to the maximum of 0.598, so generosity is moderate or low in the typical country. This may reflect cultural and economic differences in people’s willingness to donate or engage in altruistic activities.

Perceptions of corruption

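1st Quartile (0.051): The first quartile indicates that 25% of countries have a value less than or equal to 0.051. The direction of this variable should be read with care: Finland, Norway and Denmark have some of the highest values in the data set (see the standardized values further below), which suggests that higher values go together with less perceived corruption. The countries in the bottom quartile are therefore more likely those where corruption is perceived to be widespread.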

names(X2018) <- gsub(" ", "_", names(X2018))  # Replaces spaces with underscores
names(X2018)  # View updated column names
## [1] "Overall_rank"                 "Country_or_region"           
## [3] "Score"                        "GDP_per_capita"              
## [5] "Social_support"               "Healthy_life_expectancy"     
## [7] "Freedom_to_make_life_choices" "Generosity"                  
## [9] "Perceptions_of_corruption"

I used this code to replace spaces in column names with underscores ("_").

Research Question: How can we group the countries based on indicators of social support, healthy life expectancy, freedom to make decisions, generosity and perception of corruption?

Objective:

Identify homogeneous groups of countries and explore how they differ from each other.

Use the results to analyze whether countries with similar characteristics belong to specific regions or share levels of economic development.

# Select the columns containing the numeric variables for clustering
mydata_clu_std <- as.data.frame(scale(X2018[c(5:9)]))  # Standardize columns 5 to 9

head(mydata_clu_std)
##   Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1      1.2477925                1.116026                     1.388175
## 2      1.2146014                1.063673                     1.418970
## 3      1.2411542                1.091863                     1.400493
## 4      1.4203862                1.277114                     1.363540
## 5      1.1050708                1.329468                     1.258839
## 6      0.9026051                1.132135                     1.123343
##   Generosity Perceptions_of_corruption
## 1  0.2128356                  2.912165
## 2  1.0631330                  2.362896
## 3  1.0428878                  3.067619
## 4  1.7413464                  0.269453
## 5  0.7594554                  2.539077
## 6  1.5388947                  1.896535

Interpretation:

This step is performed after the variables have been described and their key parameters explained, which ensures that the original values and their distributions are understood before they are transformed.

Standardization is a necessary preparation for the following analyses, specifically for calculating dissimilarity and performing clustering.

Standardization converts each variable so that it has a mean of 0 and a standard deviation of 1. This ensures that all variables have equal weight in the dissimilarity and clustering analysis.
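
To illustrate the transformation, here is a minimal sketch on a toy data frame (the names `toy`, `support` and `income` and all values are hypothetical). It computes the z-scores by hand and compares them with the result of scale().

```r
# Toy data with two variables on very different scales (hypothetical values)
toy <- data.frame(
  support = c(1.2, 1.5, 0.9, 1.4, 1.0),
  income  = c(12000, 45000, 8000, 39000, 15000)
)

# Manual z-scores: subtract the column mean, divide by the sample standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same calculation column by column
z_scale <- as.data.frame(scale(toy))

print(round(colMeans(z_scale), 10))   # column means after standardization
print(sapply(z_scale, sd))            # column standard deviations after standardization

# Largest absolute difference between the two versions
print(max(abs(as.matrix(z_manual) - as.matrix(z_scale))))
```

Without this step the `income` column would dominate any Euclidean distance simply because its numbers are larger.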

# Calculate the dissimilarity as the Euclidean norm of the five variables
# (note: these are the original X2018 columns, not the standardized mydata_clu_std)
X2018$Dissimilarity <- sqrt(
  X2018$`Social_support`^2 + 
  X2018$`Healthy_life_expectancy`^2 + 
  X2018$`Freedom_to_make_life_choices`^2 + 
  X2018$`Generosity`^2 + 
  X2018$`Perceptions_of_corruption`^2
)

head(X2018[order(-X2018$Dissimilarity), c("Country_or_region", "Dissimilarity")], 10)
## # A tibble: 10 × 2
##    Country_or_region Dissimilarity
##    <chr>                     <dbl>
##  1 Iceland                    2.03
##  2 New Zealand                2.02
##  3 Denmark                    2.00
##  4 Finland                    1.99
##  5 Australia                  1.99
##  6 Norway                     1.98
##  7 Switzerland                1.97
##  8 Ireland                    1.96
##  9 Singapore                  1.95
## 10 Canada                     1.94

Interpretation:

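Calculating dissimilarity allows us to measure how far each country lies from a common reference point, which helps to identify outliers. With standardized variables that reference point is the overall mean of the data (not the center of a cluster, since no clusters have been formed yet); the code above uses the original columns of X2018, so there the distance is measured from zero. It is an important step to ensure that the clustering results are not biased by extreme values and that the groups formed reflect meaningful relationships between observations.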

Dissimilarity values:

The highest dissimilarity values are between 1.94 and 2.03, indicating that these countries (Iceland, New Zealand, Denmark, etc.) are the most different compared to the rest of the data set.

The values are not extremely high, indicating that these countries are not necessarily outliers, but simply move a little further away from the average in the space of the selected variables.

I decided to keep them. There is no evidence that they are extreme cases of the kind we saw in class, so I think it is better to include them in the analysis: they represent an important group of countries whose characteristics help the analysis.
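
The dissimilarity in the code above is computed from the original columns of X2018. A minimal sketch of the same calculation on standardized variables, where the distance is measured from the overall mean (the zero vector after scaling), could look as follows. The data frame `toy`, its values and the unit labels A to F are hypothetical; with the real data, `toy` would be replaced by X2018[c(5:9)].

```r
# Hypothetical stand-in for two of the clustering variables
toy <- data.frame(
  Social_support = c(1.59, 1.58, 1.20, 0.50, 1.10, 0.90),
  Generosity     = c(0.20, 0.29, 0.10, 0.25, 0.05, 0.60)
)
rownames(toy) <- c("A", "B", "C", "D", "E", "F")  # hypothetical unit labels

# Standardize first, so that every variable has mean 0 and standard deviation 1
toy_std <- as.data.frame(scale(toy))

# Euclidean distance of each unit from the centre (0, 0) of the standardized data
dissimilarity <- sqrt(rowSums(toy_std^2))

# Units sorted from most to least distant
print(sort(dissimilarity, decreasing = TRUE))
```

Units at the top of the sorted list are the candidates to inspect as possible outliers.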

# Graphical display of the dissimilarity matrix
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
rownames(mydata_clu_std) <- X2018$Country_or_region  # Assign country names to rows (column was renamed above)
Distance <- get_dist(mydata_clu_std, method = "euclidean")

fviz_dist(
  Distance,
  gradient = list(low = "darkred", mid = "grey95", high = "white")
) +
theme(
  axis.text.x = element_text(size = 5, angle = 45, hjust = 1),  # Diagonal rotation
  axis.text.y = element_text(size = 5)  # Keep Y-axis labels small
)

Interpretation:

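We had a problem with the names on the axes. The row names must be taken from the renamed column Country_or_region; the old name `Country or region` no longer exists after the renaming step, so it returns nothing and the axes show row numbers instead of country names. With 155 countries the labels are also very crowded. I tried to get help from ChatGPT, but I did not want to use code that I do not fully understand.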

Basically, the matrix represents the Euclidean distances between the selected countries based on the standardized variables.

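With the gradient used in the code (low = "darkred", high = "white"), dark red indicates small distances, meaning that the countries are similar in the selected variables, while light colors indicate greater dissimilarity (significant difference).

As we saw in class, the pattern of dark and light areas hints at the group structure. Dark blocks mark sets of countries that are close to one another, and the matrix shows many dark areas separated by lighter ones, which suggests several groups of similar countries and could result in multiple clusters when performing the clustering analysis.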

Finally, I think that the initial number of clusters (based on the visual analysis of the dissimilarity matrix) could be between 3 and 5 groups.
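
To make the content of the distance matrix more concrete, the following minimal sketch prints a small one. It uses five rows of the built-in USArrests data as a stand-in for the countries, which is an assumption made only so that the example runs without 2018.csv.

```r
std <- scale(USArrests[1:5, ])   # five built-in rows stand in for the countries

# Pairwise Euclidean distances as a symmetric matrix with a zero diagonal
d_mat <- as.matrix(dist(std, method = "euclidean"))
print(round(d_mat, 2))

# Most similar pair: smallest off-diagonal distance
d_off <- d_mat
diag(d_off) <- NA
closest <- which(d_off == min(d_off, na.rm = TRUE), arr.ind = TRUE)[1, ]
print(rownames(d_mat)[closest])
```

Each cell of the plotted matrix corresponds to one such pairwise distance, colored according to its size.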

library(factoextra)

# Calculate the Hopkins statistic
hopkins_result <- get_clust_tendency(
  mydata_clu_std,
  n = nrow(mydata_clu_std) - 1,  # Number of samples
  graph = FALSE                  # Without generating graph
)

# Show result
hopkins_result$hopkins_stat
## [1] 0.7375746

Interpretation:

According to what we saw in class, the Hopkins value is between 0 and 1.

Closer to 1: The data has a clear cluster structure.
Closer to 0.5 or less: The data is randomly distributed, with no clear tendency to form clusters.

The value of 0.74 suggests that there is a moderate to strong tendency for the data to form clusters. This means that clustering is a valid approach to analyzing this data.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(factoextra)

# Calculate the Euclidean distance and apply the Ward method
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

# Show the WARD object
WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 155
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation:

If we look at the structure of the dendrogram, by cutting around a height of 15, we can form 3 main groups. This cutting point divides the largest branches without subdividing too much.
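
Instead of reading the cut height from the plot, the tree can also be cut into a fixed number of groups with cutree(). The sketch below is only an illustration: it uses the built-in USArrests data as a stand-in for mydata_clu_std, so the group sizes it prints are not those of the country data.

```r
# Built-in USArrests data stand in for the standardized country data
std <- scale(USArrests)

# Same recipe as above: Euclidean distances, then Ward's method
ward <- hclust(dist(std, method = "euclidean"), method = "ward.D2")

# Cut the tree into exactly 3 groups instead of choosing a height by eye
groups <- cutree(ward, k = 3)

print(table(groups))   # number of units per group
print(head(groups))    # group membership of the first units
```

With the objects from this report, the equivalent call would be cutree(WARD, k = 3).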

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

Interpretation:

The graph suggests that grouping the data into 3 clusters is adequate to capture the variability in the data without overfitting the model. However, it is important to complement this analysis with other metrics, such as the silhouette index, to confirm the validity of this choice.
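
The quantity behind the elbow plot is the total within-cluster sum of squares for each candidate number of clusters. As a rough sketch of what fviz_nbclust(..., method = "wss") plots, the code below computes it by hand, again using the built-in USArrests data as an assumed stand-in for mydata_clu_std.

```r
set.seed(1)                      # k-means uses random starting points
std <- scale(USArrests)          # stand-in for the standardized country data

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(std, centers = k, nstart = 25)$tot.withinss
})

print(round(wss, 1))

# Reduction gained by each additional cluster; the elbow is where it becomes small
print(round(-diff(wss), 1))
```

Calling plot(k_values, wss, type = "b") draws the same curve as the elbow graph.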

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

Interpretation:

The two methods (Elbow and Silhouette) do not contradict each other, but they offer different perspectives:

The first method, Elbow Method, focuses on reducing internal variability. The second method, Silhouette Analysis, evaluates the separation and cohesion of the clusters.

Based on this, if I am looking for higher quality in the separation, 3 clusters would be the most robust option.
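
The silhouette criterion can also be computed directly, which makes it easier to compare the candidate values of k as numbers rather than by eye. This is a minimal sketch that assumes the cluster package is installed (it usually ships with R) and uses the built-in USArrests data as a stand-in for mydata_clu_std.

```r
library(cluster)                 # provides silhouette()

set.seed(1)
std <- scale(USArrests)          # stand-in for the standardized country data
d   <- dist(std)

# Average silhouette width for k = 2, ..., 6
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(std, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- k_values

print(round(avg_sil, 3))
print(names(which.max(avg_sil)))  # k with the highest average silhouette width
```

Higher average widths mean that units are, on average, closer to their own cluster than to the nearest other cluster.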

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
## $All.index
##         KL      CH Hartigan     CCC    Scott     Marriot    TrCovW   TraceW
## 2   0.4036 62.4545  62.8998 -2.8519 222.6243 22932062709 14722.489 546.7976
## 3   4.6551 75.0214  26.1866 -0.0641 404.6440 15945013959  6192.934 387.4948
## 4   0.7731 66.9165  24.2662  0.5435 502.4819 15078921107  5012.477 330.5478
## 5   1.3119 63.8932  19.2961  1.7854 601.2691 12456570479  3274.897 284.7823
## 6  15.4723 61.1389   8.9736  2.5689 697.8498  9619477309  2562.778 252.3232
## 7   0.0630 55.1405   0.6334  1.8839 738.9063 10046365181  2235.662 237.9902
## 8   0.6234 47.2348  19.9321  0.0417 743.1549 12766989838  2189.347 236.9760
## 9  18.5976 49.0898   6.9700  1.7663 823.7610  9606017111  1585.708 208.6805
## 10  0.3027 46.1745   7.9087  1.4636 859.0216  9446297583  1567.307 199.1721
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2    3.4099 1.4082 0.3562 1.0719     0.3632 0.6719  62.4911  1.5162    0.3653
## 3    6.4650 1.9871 0.3455 1.2605     0.3215 0.6658  27.6054  1.5522    0.4015
## 4    7.7110 2.3295 0.3367 1.3243     0.2672 1.4731 -25.3730 -0.9765    0.3763
## 5    9.3317 2.7038 0.3411 1.2773     0.2727 1.9743 -38.4915 -1.4977    0.3547
## 6   11.2207 3.0516 0.3622 1.2103     0.2914 1.0623  -1.1136 -0.1732    0.3346
## 7   12.1060 3.2354 0.3555 1.1296     0.2955 2.1951 -29.4001 -1.6385    0.3141
## 8   12.1746 3.2493 0.3144 1.2223     0.2499 1.5095  -6.0752 -0.9809    0.2940
## 9   13.8368 3.6899 0.3814 1.1862     0.2657 1.0007  -0.0132 -0.0021    0.2846
## 10  14.7371 3.8660 0.3759 1.2123     0.2356 1.1795  -5.0212 -0.4572    0.2722
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  273.3988     0.5155  0.8954  0.2396 0.1042 0.0034  1.5716 1.7401 1.1886
## 3  129.1649     0.5893  1.2694  0.8466 0.1118 0.0036  1.6570 1.4542 0.7824
## 4   82.6370     0.5227  0.6573  1.4675 0.1250 0.0038  1.7715 1.3313 0.6625
## 5   56.9565     0.4939 -0.1848  1.9734 0.1201 0.0038  1.7355 1.2383 0.4547
## 6   42.0539     0.5318  0.0300  1.8964 0.1361 0.0043  1.7551 1.1843 0.3897
## 7   33.9986     0.5368 -6.3236  1.9103 0.1361 0.0044  1.6294 1.1582 0.3701
## 8   29.6220     0.4675  0.0072  2.5503 0.1090 0.0042  1.5117 1.1474 0.3397
## 9   23.1867     0.4857  1.1836  2.5925 0.1407 0.0047  1.6617 1.0918 0.3302
## 10  19.9172     0.4675  4.3507  2.8599 0.1407 0.0049  1.8185 1.0640 0.3063
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7102            52.2320       0.1826
## 3          0.6717            26.8772       0.1725
## 4          0.5639            61.0918       1.0000
## 5          0.5550            62.5495       1.0000
## 6          0.4477            23.4420       1.0000
## 7          0.5162            50.6206       1.0000
## 8          0.3943            27.6451       1.0000
## 9          0.4360            24.5757       1.0000
## 10         0.5094            31.7873       1.0000
## 
## $Best.nc
##                      KL      CH Hartigan    CCC    Scott    Marriot   TrCovW
## Number_clusters  9.0000  3.0000   3.0000 6.0000   3.0000          3    3.000
## Value_Index     18.5976 75.0214  36.7131 2.5689 182.0196 6120955896 8529.555
##                   TraceW Friedman   Rubin Cindex     DB Silhouette   Duda
## Number_clusters   3.0000   3.0000  9.0000 8.0000 2.0000     2.0000 4.0000
## Value_Index     102.3559   3.0551 -0.2644 0.3144 1.0719     0.3632 1.4731
##                 PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters    4.000 2.0000    3.0000   3.0000     3.0000    1  2.0000
## Value_Index      -25.373 1.5162    0.4015 144.2339     0.5893   NA  0.2396
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 9.0000      0  8.0000      0 10.0000
## Value_Index     0.1407      0  1.5117      0  0.3063
## 
## $Best.partition
##   [1] 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 3 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1
##  [38] 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
##  [75] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 3 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1
## [112] 2 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2
## [149] 2 3 2 2 2 2 2

Interpretation:

I confirm that 3 clusters is the best option: it is the number proposed most often (by 10 of the 23 indices that proposed a value), which is also NbClust's conclusion under its majority rule.
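
The majority rule is a simple vote count. As a small worked check, the sketch below re-enters the counts from the "Among all indices" summary printed above (the vector `votes` is typed in by hand, not extracted from the NbClust object) and computes the winner and its share of the votes.

```r
# Votes copied from the "Among all indices" summary above
votes <- c("2" = 4, "3" = 10, "4" = 2, "6" = 1, "8" = 2, "9" = 3, "10" = 1)

best_k <- names(which.max(votes))   # number of clusters with the most votes
share  <- max(votes) / sum(votes)   # its share of all votes cast

print(best_k)
print(sum(votes))
print(round(share, 2))
```

The winning value has the most votes but less than half of them, so it is worth checking it against the dendrogram and the elbow plot as done above.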

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of different positions of initial leaders

Clustering
## K-means clustering with 3 clusters of sizes 25, 45, 85
## 
## Cluster means:
##   Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1      0.8500304               0.9256207                   1.06175389
## 2     -1.1340743              -1.2124972                  -0.55954120
## 3      0.3503833               0.3696689                  -0.01605286
##   Generosity Perceptions_of_corruption
## 1  1.2153767                 1.8177715
## 2  0.1633474                -0.2137200
## 3 -0.4439418                -0.4214928
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 3 1 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3
##  [38] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
##  [75] 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 1 2 2 3 3 3 3 3 3 2 2 2 2 2 3 3
## [112] 2 2 2 3 2 2 3 3 2 3 2 2 2 2 2 2 3 1 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2
## [149] 2 1 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  61.19617 147.56179 178.73682
##  (between_SS / total_SS =  49.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Interpretation:

Here we can see the group sizes and distribution of units. We have 25 in group 1, 45 in group 2, and 85 in group 3. These results indicate that the data set is divided into three groups with distinct characteristics based on the selected variables.

Approximately 49.7% of the variability is explained by clustering, which is a reasonable start but could indicate that the groups are not extremely distinct.
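
The percentage reported by kmeans() is the ratio of the between-cluster sum of squares to the total sum of squares. The sketch below shows where it comes from, using the built-in USArrests data as an assumed stand-in for mydata_clu_std (so the printed percentage differs from the 49.7% above).

```r
set.seed(1)
std <- scale(USArrests)          # stand-in for the standardized country data
fit <- kmeans(std, centers = 3, nstart = 25)

# Share of the total variability that lies between the clusters
explained <- fit$betweenss / fit$totss

print(fit$size)
print(round(100 * explained, 1))

# The total splits exactly into a between part and a pooled within part
print(all.equal(fit$totss, fit$betweenss + fit$tot.withinss))
```

With the object from this report, the same ratio is Clustering$betweenss / Clustering$totss.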

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = TRUE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

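# Remove four countries from the data set
X2018 <- X2018 %>%
  filter(!`Country_or_region` %in% c("Myanmar", "Rwanda", "Somalia","Central African Republic"))

# Standardize again, because means and standard deviations change after the removal
mydata_clu_std <- as.data.frame(scale(X2018[c(5:9)]))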

rownames(mydata_clu_std) <- X2018$`Country_or_region`  # Assign country names to rows and columns
Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of different positions of initial leaders

Clustering
## K-means clustering with 3 clusters of sizes 44, 23, 84
## 
## Cluster means:
##   Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1     -1.1559245              -1.2116799                 -0.587682376
## 2      0.9864587               1.0554353                  1.089572595
## 3      0.3353825               0.3457012                  0.009498272
##   Generosity Perceptions_of_corruption
## 1  0.1688719                -0.2174719
## 2  1.2255174                 1.9095231
## 3 -0.4240151                -0.4089317
## 
## Clustering vector:
##                 Finland                  Norway                 Denmark 
##                       2                       2                       2 
##                 Iceland             Switzerland             Netherlands 
##                       2                       2                       2 
##                  Canada             New Zealand                  Sweden 
##                       2                       2                       2 
##               Australia          United Kingdom                 Austria 
##                       2                       3                       2 
##              Costa Rica                 Ireland                 Germany 
##                       3                       2                       2 
##                 Belgium              Luxembourg           United States 
##                       2                       2                       2 
##                  Israel          Czech Republic                   Malta 
##                       2                       3                       2 
##                  France                  Mexico                   Chile 
##                       3                       3                       3 
##                  Taiwan                  Panama                  Brazil 
##                       3                       3                       3 
##               Argentina               Guatemala                 Uruguay 
##                       3                       3                       3 
##                   Qatar            Saudi Arabia               Singapore 
##                       2                       3                       2 
##                Malaysia                   Spain                Colombia 
##                       3                       3                       3 
##       Trinidad & Tobago                Slovakia             El Salvador 
##                       3                       3                       3 
##               Nicaragua                  Poland                 Bahrain 
##                       3                       3                       3 
##              Uzbekistan                  Kuwait                Thailand 
##                       2                       3                       3 
##                   Italy                 Ecuador                  Belize 
##                       3                       3                       3 
##               Lithuania                Slovenia                 Romania 
##                       3                       3                       3 
##                  Latvia                   Japan               Mauritius 
##                       3                       3                       3 
##                 Jamaica             South Korea         Northern Cyprus 
##                       3                       3                       3 
##                  Russia              Kazakhstan                  Cyprus 
##                       3                       3                       3 
##                 Bolivia                 Estonia                Paraguay 
##                       3                       3                       3 
##                    Peru                  Kosovo                 Moldova 
##                       3                       3                       3 
##            Turkmenistan                 Hungary                   Libya 
##                       3                       3                       3 
##             Philippines                Honduras                 Belarus 
##                       3                       3                       3 
##                  Turkey                Pakistan               Hong Kong 
##                       3                       1                       2 
##                Portugal                  Serbia                  Greece 
##                       3                       3                       3 
##                 Lebanon              Montenegro                 Croatia 
##                       3                       3                       3 
##      Dominican Republic                 Algeria                 Morocco 
##                       3                       3                       3 
##                   China              Azerbaijan              Tajikistan 
##                       3                       3                       3 
##               Macedonia                  Jordan                 Nigeria 
##                       3                       3                       1 
##              Kyrgyzstan  Bosnia and Herzegovina                Mongolia 
##                       3                       3                       3 
##                 Vietnam               Indonesia                  Bhutan 
##                       3                       1                       2 
##                Cameroon                Bulgaria                   Nepal 
##                       1                       3                       3 
##               Venezuela                   Gabon Palestinian Territories 
##                       3                       3                       3 
##            South Africa                    Iran             Ivory Coast 
##                       3                       1                       1 
##                   Ghana                 Senegal                    Laos 
##                       1                       1                       1 
##                 Tunisia                 Albania            Sierra Leone 
##                       3                       3                       1 
##     Congo (Brazzaville)              Bangladesh               Sri Lanka 
##                       1                       1                       3 
##                    Iraq                    Mali                 Namibia 
##                       1                       1                       3 
##                Cambodia            Burkina Faso                   Egypt 
##                       3                       1                       1 
##              Mozambique                   Kenya                  Zambia 
##                       1                       1                       1 
##              Mauritania                Ethiopia                 Georgia 
##                       1                       1                       1 
##                 Armenia                    Chad        Congo (Kinshasa) 
##                       3                       1                       1 
##                   India                   Niger                  Uganda 
##                       1                       1                       1 
##                   Benin                   Sudan                 Ukraine 
##                       1                       1                       3 
##                    Togo                  Guinea                 Lesotho 
##                       1                       1                       1 
##                  Angola              Madagascar                Zimbabwe 
##                       1                       1                       1 
##             Afghanistan                Botswana                  Malawi 
##                       1                       3                       1 
##                   Haiti                 Liberia                   Syria 
##                       1                       1                       1 
##                   Yemen                Tanzania             South Sudan 
##                       1                       1                       1 
##                 Burundi 
##                       1 
## 
## Within cluster sum of squares by cluster:
## [1] 139.07804  38.92111 187.71952
##  (between_SS / total_SS =  51.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Interpretation:

After these corrections, we have new results. The three clusters account for 51.2% of the total variation in the data (between_SS / total_SS), which is an acceptable proportion for the problem we want to analyze.
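The ratio reported by `kmeans` can be recomputed from the components of the fitted object. The following is a minimal sketch on the built-in `USArrests` data (an assumption made only for illustration; `demo_std` and `demo_km` are hypothetical names, not objects from this homework):

```r
# Illustrative data only: standardize a built-in dataset and cluster it.
set.seed(1)
demo_std <- scale(USArrests)
demo_km  <- kmeans(demo_std, centers = 3, nstart = 25)

# Share of total variation that lies between the clusters.
ratio <- demo_km$betweenss / demo_km$totss
round(100 * ratio, 1)

# The total sum of squares splits into a between part and a within part.
all.equal(demo_km$totss, demo_km$betweenss + demo_km$tot.withinss)
```

The same two components (`betweenss` and `totss`) are listed under "Available components" in the output above, so the 51.2% can be checked in the same way on the `Clustering` object.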

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = TRUE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
##   Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## 1     -1.1559245              -1.2116799                 -0.587682376
## 2      0.9864587               1.0554353                  1.089572595
## 3      0.3353825               0.3457012                  0.009498272
##   Generosity Perceptions_of_corruption
## 1  0.1688719                -0.2174719
## 2  1.2255174                 1.9095231
## 3 -0.4240151                -0.4089317

Interpretation:

These results show the standardized averages of the variables in each of the 3 clusters. Positive values lie above the overall sample mean and negative values lie below it, so the table describes the main characteristics that define each group:

Cluster 1: Negative values for most variables: social support (-1.16), healthy life expectancy (-1.21), freedom to make life choices (-0.59), and perceptions of corruption (-0.22). Only generosity is slightly above average (0.17). This group highlights countries with lower development and significant challenges in social, health, and individual freedom indicators.

Cluster 2: Positive values across all variables: social support (0.99), healthy life expectancy (1.06), freedom to make life choices (1.09), generosity (1.23), and perceptions of corruption (1.91). This cluster represents countries with high overall indicators in the studied dimensions, likely indicating well-developed nations with strong systems in place for social and individual well-being.

Cluster 3: Moderate values: social support (0.34), healthy life expectancy (0.35), freedom to make life choices (0.01), generosity (-0.42), and perceptions of corruption (-0.41). This group represents average-performing countries, with slightly above-average scores in social support and life expectancy but below-average scores in generosity and perceptions of corruption.
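To make the meaning of "standardized averages" concrete, the sketch below checks that the `centers` returned by `kmeans` are the per-cluster means of the standardized variables. It uses the built-in `USArrests` data as a stand-in (an assumption for illustration; `demo_std`, `demo_km`, and `manual_centers` are hypothetical names):

```r
# Illustrative data only: standardize and cluster a built-in dataset.
set.seed(1)
demo_std <- scale(USArrests)
demo_km  <- kmeans(demo_std, centers = 3, nstart = 25)

# Average each standardized variable within each cluster by hand.
manual_centers <- aggregate(as.data.frame(demo_std),
                            by = list(cluster = demo_km$cluster),
                            FUN = mean)

# Compare with the centers stored in the kmeans object (values only).
all.equal(as.matrix(manual_centers[, -1]), demo_km$centers,
          check.attributes = FALSE)

# Each standardized variable has overall mean 0, so a center above 0
# means the cluster is above the sample average on that variable.
round(colMeans(demo_std), 10)
```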

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"),
              labels = c("Social_support", "Healthy_life_expectancy", "Freedom_to_make_life_choices", "Generosity", "Perceptions_of_corruption"))


library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette="Set1") +
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Interpretation:

In this graph, we are visualizing the means of the variables for each of the three groups defined by the K-means clustering analysis. This allows us to compare and characterize the groups based on the variables used for clustering.

Group 1 (red dots): This group has negative values for most variables, especially social support and healthy life expectancy, with only generosity slightly above the average. These countries may face significant socio-economic difficulties and represent lower levels of development and support systems.

Group 2 (blue triangles): This group consistently exhibits high positive values in all variables, indicating strong social support, long healthy life expectancy, significant freedom of choice, high generosity, and the highest scores on the perceptions-of-corruption variable. This likely represents countries with better socio-economic conditions and stability.

Group 3 (green squares): These countries have mixed performance, with social support, healthy life expectancy, and freedom of choice near or slightly above the average but below-average values for “Generosity” and “Perceptions of corruption.” This indicates countries with moderate development and some challenges in specific areas.

X2018$Group <- Clustering$cluster #Assigning units to groups
 # Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(`Social_support`, `Healthy_life_expectancy`, `Freedom_to_make_life_choices`, `Generosity`, `Perceptions_of_corruption`) ~ as.factor(Group), 
           data = X2018)

summary(fit)
##  Response Social_support :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 7.3181  3.6590  112.93 < 2.2e-16 ***
## Residuals        148 4.7952  0.0324                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Healthy_life_expectancy :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 5.9074  2.9537  149.16 < 2.2e-16 ***
## Residuals        148 2.9308  0.0198                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Freedom_to_make_life_choices :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 1.1167 0.55836  29.264 1.954e-11 ***
## Residuals        148 2.8239 0.01908                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Generosity :
##                   Df  Sum Sq  Mean Sq F value   Pr(>F)    
## as.factor(Group)   2 0.44866 0.224331  38.009 4.77e-14 ***
## Residuals        148 0.87351 0.005902                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Perceptions_of_corruption :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.85545 0.42772  147.97 < 2.2e-16 ***
## Residuals        148 0.42782 0.00289                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

The very low p-values (< 0.001) indicate that the clusters are significantly different from each other across all analyzed variables. This confirms that the clustering successfully groups entities with distinct profiles.

aggregate(X2018$`GDP_per_capita`, 
          by = list(X2018$Group), 
          FUN = mean)
##   Group.1         x
## 1       1 0.4787955
## 2       2 1.3315217
## 3       3 1.0028571

Interpretation:

In this part, I check criterion validity: I evaluate whether a variable that was not used in the clustering (in this case, GDP per capita) differs between the created groups. The group means are 0.48 for Group 1, 1.33 for Group 2, and 1.00 for Group 3.

If the differences between the groups are statistically significant, this supports the validity of the clustering, since the groups then reflect real differences in a variable that was not part of the clustering process.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(X2018$`GDP_per_capita`, as.factor(X2018$Group))
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  2.7799 0.06529 .
##       148                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

Levene’s test checks whether the variances of the groups are homogeneous.

H₀: The variances of the groups are equal (homogeneity of variance). H₁: The variances of the groups are not equal (heterogeneity of variance).

The p-value of 0.06529 is greater than the significance level of 0.05. Thus, we fail to reject H₀, meaning there is no significant evidence to suggest that the variances across the groups are different.
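What the median-centered Levene test computes can be sketched in base R: it is a one-way ANOVA on the absolute deviations of each observation from its group median. The data below are simulated (an assumption for illustration; `demo`, `abs_dev`, and the other names are hypothetical):

```r
# Simulated data only: three groups with slightly different spreads.
set.seed(42)
demo <- data.frame(
  value = c(rnorm(30, sd = 1), rnorm(30, sd = 1.2), rnorm(30, sd = 0.9)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Absolute deviation of each observation from its own group median.
group_median <- ave(demo$value, demo$group, FUN = median)
demo$abs_dev <- abs(demo$value - group_median)

# One-way ANOVA on those deviations gives the Levene F statistic and p-value.
levene_tab <- anova(lm(abs_dev ~ group, data = demo))
levene_F   <- levene_tab[["F value"]][1]
levene_p   <- levene_tab[["Pr(>F)"]][1]
c(F = levene_F, p = levene_p)
```

With `car` installed, `car::leveneTest(demo$value, demo$group)` should report the same F value and p-value, since its default is `center = median`.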

library(dplyr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
X2018 %>% group_by(as.factor(X2018$Group)) %>%
  shapiro_test(`GDP_per_capita`)
## # A tibble: 3 × 4
##   `as.factor(X2018$Group)` variable       statistic         p
##   <fct>                    <chr>              <dbl>     <dbl>
## 1 1                        GDP_per_capita     0.970 0.298    
## 2 2                        GDP_per_capita     0.759 0.0000908
## 3 3                        GDP_per_capita     0.981 0.265

Interpretation:

The Shapiro-Wilk test assesses whether the data within each group follow a normal distribution.

H₀: the data are normally distributed. H₁: the data are not normally distributed.

From the Shapiro-Wilk test, the p-values for each group are:

Group 1: p = 0.298, do not reject H₀ → no evidence against normality.

Group 2: p = 0.0000908, reject H₀ → the data are not normally distributed.

Group 3: p = 0.265, do not reject H₀ → no evidence against normality.

Since Group 2 does not meet the normality assumption, a parametric test such as ANOVA might not be appropriate. In this case, we use a nonparametric test, the Kruskal-Wallis test, which does not assume normality in the data.
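The decision rule described here can be sketched in base R: run the Shapiro-Wilk test within each group and fall back to Kruskal-Wallis if any group rejects normality. The data are simulated and all names (`demo`, `p_by_group`, `chosen_test`) are hypothetical; the 0.05 threshold is the significance level used in this report:

```r
# Simulated data only: the second group is drawn from a skewed distribution.
set.seed(7)
demo <- data.frame(
  value = c(rnorm(40), rexp(40), rnorm(40, mean = 1)),
  group = factor(rep(1:3, each = 40))
)

# Shapiro-Wilk p-value within each group.
p_by_group <- tapply(demo$value, demo$group,
                     function(v) shapiro.test(v)$p.value)
p_by_group

# Use ANOVA only if no group rejects normality at the 5% level.
all_normal  <- all(p_by_group > 0.05)
chosen_test <- if (all_normal) "one-way ANOVA" else "Kruskal-Wallis"
chosen_test
```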

# Perform Kruskal Wallis
kruskal.test(GDP_per_capita ~ Group, 
             data = X2018)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  GDP_per_capita by Group
## Kruskal-Wallis chi-squared = 92.733, df = 2, p-value < 2.2e-16

Interpretation:

Since the p-value is extremely low (< 0.05), we reject the H₀, which assumes that the distributions of GDP per capita are the same across all groups. This means there are significant differences in GDP per capita among the groups identified through clustering.

GDP per capita was not used to form the clusters, so this result supports the criterion validity of the clustering: the groups also differ on an external variable.
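For readers who want to see where the Kruskal-Wallis chi-squared value comes from, the sketch below recomputes the statistic from rank sums and compares it with `kruskal.test`. The data are simulated continuous values without ties (an assumption that lets the tie correction be ignored); all names are hypothetical:

```r
# Simulated data only: three groups of unequal size, no tied values.
set.seed(3)
value <- c(rnorm(20, 0), rnorm(25, 0.8), rnorm(15, 1.5))
group <- factor(rep(c("g1", "g2", "g3"), times = c(20, 25, 15)))

N         <- length(value)
r         <- rank(value)                 # ranks over the pooled sample
rank_sums <- tapply(r, group, sum)       # sum of ranks per group
n_j       <- tapply(r, group, length)    # group sizes

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
H        <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_j) - 3 * (N + 1)
p_manual <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

# Built-in test for comparison.
kw <- kruskal.test(x = value, g = group)
all.equal(unname(kw$statistic), H)
all.equal(kw$p.value, p_manual)
```

Because the statistic depends only on ranks, it is not affected by the non-normal distribution of GDP per capita found in one of the groups.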

CONCLUSION

This analysis aimed to explore how countries can be grouped based on key social and development indicators: social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. Using K-means clustering, the study identified three distinct groups of countries, each defined by its own pattern across these dimensions.

Based on the results, here are the clusters and their characteristics:

Cluster 1: Countries in this cluster score lower on most dimensions, highlighting struggles in areas like social support, life expectancy, and freedom of choice, although generosity is slightly above average. They also have the lowest average GDP per capita and may represent regions with fewer resources and greater social and economic difficulties.

Cluster 2: These countries exhibit consistently high levels in all indicators, representing strong social support systems, longer healthy life expectancies, high freedom of choice, notable generosity, and the highest scores on the perceptions-of-corruption variable. They also have the highest average GDP per capita and reflect regions with robust stability and well-being.

Cluster 3: This group includes countries with moderate scores in social and development indicators. While they are slightly above average in areas like social support and life expectancy, they score below average on generosity and perceptions of corruption. These nations may represent those striving toward greater social and economic development.