1. DATA IMPORT AND DESCRIPTION

mydata <- read.table("~/Documents/Country-data.csv",
                     header = TRUE, sep = ",", dec = ".")

head(mydata, 10)

##                country child_mort exports health imports income inflation
## 1          Afghanistan       90.2    10.0   7.58    44.9   1610     9.440
## 2              Albania       16.6    28.0   6.55    48.6   9930     4.490
## 3              Algeria       27.3    38.4   4.17    31.4  12900    16.100
## 4               Angola      119.0    62.3   2.85    42.9   5900    22.400
## 5  Antigua and Barbuda       10.3    45.5   6.03    58.9  19100     1.440
## 6            Argentina       14.5    18.9   8.10    16.0  18700    20.900
## 7              Armenia       18.1    20.8   4.40    45.3   6700     7.770
## 8            Australia        4.8    19.8   8.73    20.9  41400     1.160
## 9              Austria        4.3    51.3  11.00    47.8  43200     0.873
## 10          Azerbaijan       39.2    54.3   5.88    20.7  16000    13.800
##    life_expec total_fer  gdpp
## 1        56.2      5.82   553
## 2        76.3      1.65  4090
## 3        76.5      2.89  4460
## 4        60.1      6.16  3530
## 5        76.8      2.13 12200
## 6        75.8      2.37 10300
## 7        73.3      1.69  3220
## 8        82.0      1.93 51900
## 9        80.5      1.44 46900
## 10       69.1      1.92  5840

str(mydata)

## 'data.frame':    167 obs. of  10 variables:
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

EXPLANATION OF THE DATA:

Unit of observation: A country.
Sample size: 167 units of observation.

DESCRIPTION OF THE VARIABLES:

country: Name of the country.
child_mort: Death of children under 5 years of age per 1000 live births exports (number).
exports: Exports of goods and services per capita (percentage of the GDP per capita).
health: Total health spending per capita (percentage of GDP per capita).
imports: Imports of goods and services per capita (percentage of the GDP per capita).
income: Net income per person (USD).
inflation: The measurement of the annual growth rate of the Total GDP (percentage).
life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same (years).
total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same (number).
gdpp: The GDP per capita (total GDP divided by the total population, USD per capita).

DATA SOURCE: Data was retrieved from Kaggle.com from: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/data, a dataset that is used to categorise the countries using socio-economic and health factors that determine the overall development of the country. The data is based on HELP International, that is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.

2. DESCRIPTIVE STATISTICS

# Adding ID:

mydata$ID <- seq(1, nrow(mydata))
head(mydata)

##               country child_mort exports health imports income inflation
## 1         Afghanistan       90.2    10.0   7.58    44.9   1610      9.44
## 2             Albania       16.6    28.0   6.55    48.6   9930      4.49
## 3             Algeria       27.3    38.4   4.17    31.4  12900     16.10
## 4              Angola      119.0    62.3   2.85    42.9   5900     22.40
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## 6           Argentina       14.5    18.9   8.10    16.0  18700     20.90
##   life_expec total_fer  gdpp ID
## 1       56.2      5.82   553  1
## 2       76.3      1.65  4090  2
## 3       76.5      2.89  4460  3
## 4       60.1      6.16  3530  4
## 5       76.8      2.13 12200  5
## 6       75.8      2.37 10300  6

summary(mydata[ , -c(1,11)])

##    child_mort        exports            health          imports        
##  Min.   :  2.60   Min.   :  0.109   Min.   : 1.810   Min.   :  0.0659  
##  1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920   1st Qu.: 30.2000  
##  Median : 19.30   Median : 35.000   Median : 6.320   Median : 43.3000  
##  Mean   : 38.27   Mean   : 41.109   Mean   : 6.816   Mean   : 46.8902  
##  3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600   3rd Qu.: 58.7500  
##  Max.   :208.00   Max.   :200.000   Max.   :17.900   Max.   :174.0000  
##      income         inflation         life_expec      total_fer    
##  Min.   :   609   Min.   : -4.210   Min.   :32.10   Min.   :1.150  
##  1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30   1st Qu.:1.795  
##  Median :  9960   Median :  5.390   Median :73.10   Median :2.410  
##  Mean   : 17145   Mean   :  7.782   Mean   :70.56   Mean   :2.948  
##  3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80   3rd Qu.:3.880  
##  Max.   :125000   Max.   :104.000   Max.   :82.80   Max.   :7.490  
##       gdpp       
##  Min.   :   231  
##  1st Qu.:  1330  
##  Median :  4660  
##  Mean   : 12964  
##  3rd Qu.: 14050  
##  Max.   :105000

#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata[ , -c(1,11)]), 1)

##              child_mort exports health imports      income inflation life_expec
## nbr.val           167.0   167.0  167.0   167.0       167.0     167.0      167.0
## nbr.null            0.0     0.0    0.0     0.0         0.0       0.0        0.0
## nbr.na              0.0     0.0    0.0     0.0         0.0       0.0        0.0
## min                 2.6     0.1    1.8     0.1       609.0      -4.2       32.1
## max               208.0   200.0   17.9   174.0    125000.0     104.0       82.8
## range             205.4   199.9   16.1   173.9    124391.0     108.2       50.7
## sum              6391.1  6865.2 1138.2  7830.7   2863163.0    1299.6    11782.8
## median             19.3    35.0    6.3    43.3      9960.0       5.4       73.1
## mean               38.3    41.1    6.8    46.9     17144.7       7.8       70.6
## SE.mean             3.1     2.1    0.2     1.9      1491.8       0.8        0.7
## CI.mean.0.95        6.2     4.2    0.4     3.7      2945.3       1.6        1.4
## var              1626.4   751.4    7.5   586.1 371643894.2     111.7       79.1
## std.dev            40.3    27.4    2.7    24.2     19278.1      10.6        8.9
## coef.var            1.1     0.7    0.4     0.5         1.1       1.4        0.1
##              total_fer        gdpp
## nbr.val          167.0       167.0
## nbr.null           0.0         0.0
## nbr.na             0.0         0.0
## min                1.1       231.0
## max                7.5    105000.0
## range              6.3    104769.0
## sum              492.3   2165014.0
## median             2.4      4660.0
## mean               2.9     12964.2
## SE.mean            0.1      1418.3
## CI.mean.0.95       0.2      2800.3
## var                2.3 335941420.0
## std.dev            1.5     18328.7
## coef.var           0.5         1.4

Mean of Exports: The average percentage of the GDP per capita from exports for all countries is approximately 41.1%.
Median of Exports: Half of countries have exports lower than 35 percent of GDP per capita, and half have higher than 35.
Minimum and Maximum of Health: The minimum spending per capita on health services among all countries is 1.8 percent of the GDP per capita, and the maximum spending per capita is 17.9.
Comparing Mean and Median of Income: mean = 17144.7, median = 9960.0 Since the numbers differ, and mean is more than (to the right of) the median, then the dataset for Income should be skewed to the right The difference is quite significant, so it is likely that there are outliers.
Range of Inflation: The maximum difference between the inflation of two countries is 108.2 percent.
Third quantile of GDP per capita: approximately 75 percent of countries have less than 14050 GDP per capita on average. And approximately 25 percent of participants have more than 14050 GDP per capita on average.
SE.mean of Imports: standard error of the mean 1.9 tells how much the sample mean could vary if the samples of 167 observations of countries are repeatedly taken. Since sample mean is relatively high (46.9), the SE.mean suggests the sample mean is quite a good estimate of the population mean.
Coef.Var of Life Expectancy: Coefficient of variation (CV) compares relative variability or the dispersion of data across different datasets, regardless their units; low coefficient of variation implies that data is clustered around the mean, as in this case, where CV is 0.1.

3. DATA CLUSTERING

RESEARCH QUESTION: How can countries be classified on the basis of the following seven cluster variables: child mortality, exports, imports, health expenditures, inflation, life expectancy, and total fertility?

# Standardization of cluster variables:

mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))

3.1 DEALING WITH OUTLIERS:

# Calculation of the variable Dissimilarity to assess which units are potential outliers and could be removed:

mydata$Dissimilarity = sqrt(mydata_clu_std$child_mort^2 + mydata_clu_std$exports^2 + mydata_clu_std$health^2 + mydata_clu_std$imports^2 + mydata_clu_std$inflation^2 + mydata_clu_std$life_expec^2 + mydata_clu_std$total_fer^2)

# Showing countries ordered by descending Dissimilarity, in order to see outliers:

head(mydata[order(-mydata$Dissimilarity), c(11, 1, 12)], 10)

##      ID                  country Dissimilarity
## 114 114                  Nigeria      9.755748
## 134 134                Singapore      8.175757
## 92   92               Luxembourg      6.523653
## 99   99                    Malta      6.303252
## 67   67                    Haiti      6.160549
## 133 133             Sierra Leone      4.632482
## 160 160            United States      4.614717
## 32   32 Central African Republic      4.439482
## 88   88                  Lesotho      4.152258
## 33   33                     Chad      4.088458

# Removing countries with IDs 114 (Nigeria) and 134 (Singapore), which seem to be outliers:

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>% filter(!country %in% c("Nigeria", "Singapore"))


# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))


# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))


head(mydata[order(-mydata$Dissimilarity), c(11, 1, 12)], 10)

##      ID                  country Dissimilarity
## 92   92               Luxembourg      6.523653
## 99   99                    Malta      6.303252
## 67   67                    Haiti      6.160549
## 132 132             Sierra Leone      4.632482
## 158 158            United States      4.614717
## 32   32 Central African Republic      4.439482
## 88   88                  Lesotho      4.152258
## 33   33                     Chad      4.088458
## 113 113                    Niger      4.031854
## 162 162                Venezuela      3.978946

3.2 TESTING CLUSTERABILITY:

# Finding EUCLIDIAN DISTANCES, based on 7 cluster variables, and saving them into object Distances:

#install.packages("factoextra")
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distances <- get_dist(mydata_clu_std, method = "euclidian")


# Graphical representation of DISSIMILARITY MATRIX, which is a matrix of all distances between objects, with dimentions of 165*165:
# We can see a darker line of zeros, and some groups appearing along the line.

fviz_dist(Distances,
          gradient = list(low = "darkred", mid = "grey95", high = "white"))

# Measuring the HOPKINS STATISTIC, that tells whether natural groups exist. It should be higher than 0.5 and closest to 1, in order for data to be considered clusterable:

# In this case, H is 0.82, which means that data can be clustered.

library(factoextra)
get_clust_tendency(mydata_clu_std,
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8181938
## 
## $plot
## NULL

3.3 FINDING NUMBER OF CLUSTERS:

# Testing how many groups can be formed, based on the HIERARCHICAL (agglomerate) approach:

library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidian") %>%
  hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 165

# Showing Hierarchical approach visualization or the Dendrogram:

# We can see that the largest difference between distances can be achieved by cutting the data into two groups.

library(factoextra)
fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Testing how many groups can be formed, based on the K-MEANS approach:

# 1) Elbow method: We can see a break (change of slope) at 2 clusters, and then small break at 5 and 7.

#install.packages("NbClust")
library(NbClust)
library(factoextra)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

# Testing how many groups can be formed, based on the K-MEANS approach:

# 2) Silhouette analysis: The highest point is at 2 clusters, and then smaller at 5.

library(NbClust)
library(factoextra)
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette analysis")

# Testing how many groups can be formed, based on the Indices:

# The most proposed number of clusters is 2. 

library(NbClust)
NbClust(mydata_clu_std,
        distance = "euclidean",
        min.nc = 2, max.nc = 10,
        method = "kmeans",
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

## $All.index
##         KL      CH Hartigan     CCC    Scott      Marriot    TrCovW   TraceW
## 2   3.5819 77.9674  31.5760 -0.5605 242.4149 4.664570e+13 20181.047 776.5532
## 3   0.7109 61.9412  31.0220 -1.6471 383.1495 4.472687e+13 13628.344 650.5332
## 4   2.1012 59.1748  18.9468 -1.1595 492.9166 4.088173e+13  8073.676 545.9813
## 5  15.0168 54.0031   9.1034 -1.0191 591.7487 3.509252e+13  6237.537 488.4943
## 6   0.0627 47.1845  16.2425 -1.9679 658.5469 3.370996e+13  5811.509 462.1970
## 7   1.1730 45.7547  14.1681 -1.1773 768.9352 2.350174e+13  5007.710 419.3579
## 8   2.5805 44.4758   8.5085 -0.5363 828.9952 2.133049e+13  4131.354 384.8479
## 9   0.7331 41.8208   9.1162 -0.6894 883.2107 1.943596e+13  3691.191 365.0636
## 10  0.2452 40.1006  24.0308 -0.5801 944.1423 1.658607e+13  3347.439 344.9081
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2   12.2040 1.4783 0.2823 1.3771     0.3425 0.9352   8.3201  0.3137    0.3470
## 3   14.8689 1.7647 0.2612 1.5585     0.2324 0.7198  31.9237  1.7554    0.3635
## 4   17.0715 2.1026 0.3198 1.3890     0.2423 0.9607   2.0837  0.1814    0.3572
## 5   18.6064 2.3501 0.3675 1.4284     0.2228 1.4336 -18.7519 -1.3436    0.3369
## 6   19.6617 2.4838 0.3577 1.3880     0.2222 1.5752  -9.4945 -1.5562    0.3139
## 7   22.1248 2.7375 0.3395 1.3930     0.2180 1.0659  -1.7938 -0.2623    0.2998
## 8   23.1962 2.9830 0.3442 1.3520     0.1940 1.1632  -6.3144 -0.6207    0.2876
## 9   24.5894 3.1447 0.3426 1.4280     0.1803 1.1246  -1.4404 -0.3373    0.2746
## 10  26.4065 3.3284 0.3372 1.4113     0.2009 0.6968  14.3580  1.9181    0.2638
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  388.2766     0.5183  1.9249  0.4855 0.0774 0.0019  2.0939 1.9527 0.9501
## 3  216.8444     0.4455  0.7463  1.1107 0.0695 0.0019  2.0565 1.7757 0.7509
## 4  136.4953     0.4140  0.1825  1.8668 0.0714 0.0019  1.8026 1.6145 0.7403
## 5   97.6989     0.4239 -0.0285  2.1464 0.0974 0.0023  1.9754 1.5422 0.5437
## 6   77.0328     0.4344  0.3215  2.1749 0.0974 0.0025  2.0231 1.5106 0.5443
## 7   59.9083     0.4284  0.6744  2.5104 0.1101 0.0026  1.9135 1.4449 0.4573
## 8   48.1060     0.4136  5.5546  2.8224 0.0906 0.0030  2.1616 1.3989 0.4421
## 9   40.5626     0.3764  0.5554  3.4806 0.0831 0.0030  2.2779 1.3567 0.4139
## 10  34.4908     0.3660 -0.0493  3.7933 0.0492 0.0031  2.3988 1.3065 0.3131
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7577            38.3637       0.9479
## 3          0.7303            30.2759       0.0939
## 4          0.6446            28.1144       0.9890
## 5          0.6481            33.6586       1.0000
## 6          0.5070            25.2865       1.0000
## 7          0.4938            29.7319       1.0000
## 8          0.6291            26.5343       1.0000
## 9          0.1049           110.9776       1.0000
## 10         0.6154            20.6234       0.0685
## 
## $Best.nc
##                      KL      CH Hartigan     CCC    Scott      Marriot   TrCovW
## Number_clusters  5.0000  2.0000  10.0000  8.0000   3.0000 7.000000e+00    3.000
## Value_Index     15.0168 77.9674  14.9146 -0.5363 140.7345 8.036961e+12 6552.703
##                  TraceW Friedman   Rubin Cindex    DB Silhouette   Duda
## Number_clusters  4.0000   3.0000  5.0000 3.0000 8.000     2.0000 2.0000
## Value_Index     47.0649   2.6649 -0.1137 0.2612 1.352     0.3425 0.9352
##                 PseudoT2  Beale Ratkowsky     Ball PtBiserial   Frey McClain
## Number_clusters   2.0000 2.0000    3.0000   3.0000     2.0000 2.0000  2.0000
## Value_Index       8.3201 0.3137    0.3635 171.4322     0.5183 1.9249  0.4855
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 7.0000      0  4.0000      0 10.0000
## Value_Index     0.1101      0  1.8026      0  0.3131
## 
## $Best.partition
##   [1] 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 2 1 2 1 1 2 2 1 1 1 2
##  [38] 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 2 1 1 2 1
##  [75] 1 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 1 1 2 1 1 2 2 2 2 1
## [112] 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2
## [149] 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 2

3.4 CLUSTERING:

# Clustering the countries into 2 groups, by Hierarchical, K-Means, and Indices method:

# Ratio is 32.4%, which indicates the percent of variability explained by the classification.

Clustering <- kmeans(mydata_clu_std,
                     centers = 2,
                     nstart = 25)

Clustering

## K-means clustering with 2 clusters of sizes 52, 113
## 
## Cluster means:
##   child_mort    exports     health     imports  inflation life_expec  total_fer
## 1  1.2225529 -0.4892033 -0.2312156 -0.20805482  0.5470284 -1.1597291  1.1904205
## 2 -0.5625907  0.2251201  0.1064001  0.09574204 -0.2517299  0.5336806 -0.5478041
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 2 1 2 2 1 1 2 2 2 1
##  [38] 1 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2 1 2 2 1 2 2 1 2
##  [75] 2 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 1 1 1 1 2
## [112] 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 1 1 2 1
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 295.7289 480.8242
##  (between_SS / total_SS =  32.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# Showing the visualization of clustering:

library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

# Removing outliers that are evident from the clustering graph:

library(dplyr)
mydata <- mydata %>% filter(!ID %in% c(92, 99, 67))


# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))


# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))

Clustering <- kmeans(mydata_clu_std,
                     centers = 2,
                     nstart = 25)

library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

# Removing more outliers that are evident from the clustering graph:

library(dplyr)
mydata <- mydata %>% filter(!ID %in% c(128, 73))


# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))


# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))

Clustering <- kmeans(mydata_clu_std,
                     centers = 2,
                     nstart = 25)

library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

# Checking the clustering again - we see that the ratio increased (improved) slightly from 32.4 to 32.9%, so the removal of outliers was a correct decision:

Clustering

## K-means clustering with 2 clusters of sizes 51, 109
## 
## Cluster means:
##   child_mort    exports     health     imports  inflation life_expec  total_fer
## 1   1.232497 -0.4966399 -0.2307947 -0.16585337  0.5366107 -1.1686245  1.1824603
## 2  -0.576673  0.2323728  0.1079865  0.07760112 -0.2510748  0.5467876 -0.5532612
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 2 1 2 2 1 1 2 2 2 1
##  [38] 1 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2 2 2 1 2 2 1 2 2
##  [75] 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 1 2 2 1 1 2 2 2 1 2 2 1 1 1 1 2 2 1 2 2
## [112] 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 1 1 2 1 1 2 2 2 2
## [149] 1 2 2 2 2 2 2 2 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 310.0788 436.2213
##  (between_SS / total_SS =  32.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

3.5 DESCRIBING THE GROUPS:

# Average values of cluster variables to describe three groups:

Averages <- Clustering$centers
Averages

##   child_mort    exports     health     imports  inflation life_expec  total_fer
## 1   1.232497 -0.4966399 -0.2307947 -0.16585337  0.5366107 -1.1686245  1.1824603
## 2  -0.576673  0.2323728  0.1079865  0.07760112 -0.2510748  0.5467876 -0.5532612

# Showing a visualisation chart for description of two groups:

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

  
#install.packages("tidyr")
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:pastecs':
## 
##     extract

Figure <- pivot_longer(Figure, cols = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"))

Figure$Group <- factor(Figure$id, levels = c(1,2), labels = c("1", "2"))

Figure$ImeF <- factor(Figure$name, levels = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"), labels = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"))

#install.packages("ggplot2")
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_classic() +
  geom_point(aes(shape = Group, col = Group), size = 4) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette = "Set1") +
  ylim(-2, 2) +
  theme(axis.test.x = element_text(angle = 45, vjust = 0.50, size = 10))

## Warning in plot_theme(plot): The `axis.test.x` theme element is not defined in
## the element hierarchy.

# Assigning countries to groups:

mydata$Group <- Clustering$cluster

3.6 CHECKING APPROPRITENESS OF VARIABLES:

Implementing one-way ANOVA for each cluster variable, in order to check whether clustering variables successfully differentiate between groups. If there are differences between means of two groups for each cluster variable, then the chosen cluster variables are appropriate.

# H0: Mean of group 1 is equal to mean of group 2 (for each variable).
# H1: Mean of group 1 is not equal to mean of group 2 (for each variable).

fit <- aov(cbind(child_mort, exports, health, imports, inflation, life_expec, total_fer) ~ as.factor(Group),
           data = mydata)

summary(fit)

##  Response child_mort :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1 162685  162685  396.81 < 2.2e-16 ***
## Residuals        158  64777     410                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response exports :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1   6979  6978.8   20.76 1.036e-05 ***
## Residuals        158  53115   336.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response health :
##                   Df  Sum Sq Mean Sq F value  Pr(>F)  
## as.factor(Group)   1   30.59 30.5882  4.0645 0.04549 *
## Residuals        158 1189.06  7.5257                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response imports :
##                   Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group)   1    687  686.86  2.0732 0.1519
## Residuals        158  52347  331.31               
## 
##  Response inflation :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1 1207.7 1207.65  24.781 1.665e-06 ***
## Residuals        158 7699.9   48.73                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response life_expec :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1 7128.3  7128.3  284.59 < 2.2e-16 ***
## Residuals        158 3957.5    25.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response total_fer :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1 238.94 238.942  304.43 < 2.2e-16 ***
## Residuals        158 124.01   0.785                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reject H0 for child_mort (p<0.001), exports (p<0.001), health (p=0.045), inflation (p<0.001), life_expec (p<0.001), and total_fer(p<0.001). For these variables, we found differences in means between two clusters. Hence, the variables are appropriate to use as cluster variables.
For imports, p-value is higher than 0.05, H0 is not rejected, and therefore this variable is not appropriate to use as cluster variable - it should be removed, and clustering should be performed again.

3.7 CHECKING THE CRITERION VALIDITY:

Checking the criterion validity with a variable that was not used in the clustering process: in this case, GDP per capita (gdpp), by a WELCH TWO-SAMPLE T-TEST or WILCOXON RANK SUM TEST for two independent samples, with which we would attempt to find differences in means of GDP per capita of two clusters.

# Calculating means of two clusters:

aggregate(mydata$gdpp,
          by = list(mydata$Group),
          FUN = mean)

##   Group.1         x
## 1       1  2101.529
## 2       2 16721.505

Means seem to be very different.

# Checking for data normality by Shapiro-Wilk normality test:

#H0: Variables in two groups of countries are normally distributed.
#H0: Variables in two groups of countries are not normally distributed.

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>% 
  group_by(Group) %>%
  shapiro_test(gdpp)

## # A tibble: 2 × 4
##   Group variable statistic        p
##   <int> <chr>        <dbl>    <dbl>
## 1     1 gdpp         0.556 3.49e-11
## 2     2 gdpp         0.783 2.21e-11

Reject H0 for both groups at p<0.001.
Normality assumption is violated, and therefore we make a non-parametric test.

# Performing Wilcoxon Rank Sum Test:

# H0: Location distribution of countries in group 1 is equal to location distribution of countries in group 2.
# H1: Location distribution of countries in group 1 is not equal to location distribution of countries in group 2.

fit <- wilcox.test(gdpp ~ as.factor(Group),
                   data = mydata)

fit

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  gdpp by as.factor(Group)
## W = 509.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Reject H0 at p<0.001.
We assume there are differences in location distributions of countries in two groups.
We validated the classification by founding differences between groups.

3.8 CONCLUSION:

Based on 7 standardized variables (child mortality, exports, health expenditures, imports, inflation, life expectancy, and total fertility), we divided 160 countries into 2 groups, using Hierarchical, K-means, and Indices clustering. However, the data showed than variable imports does not influence clustering division, and could be ommitted.
Group 1 is a smaller group that consists of 51 countries (32%). Countries in this cluster are classified with higher child mortality rates (𝜇=1.23), lower than average exports of goods and services (𝜇=-0.49), health expenditures (𝜇=-0.23), and imports (𝜇=-0.16), high inflation rate (𝜇=0.53), significantly low life expectancy (𝜇=-1.16), and very high total fertility (𝜇=1.18). The countries in this group also have lower GDP per capita on average, with mean equal to approximately 2101 USD per capita. It can be assumed that countries classified into this cluster have lower quality of life.
Group 2 is a larger group that consists of 109 countries (68%). Countries in this cluster are classified with lower child mortality rates (𝜇=-0.57), higher exports of goods and services (𝜇=0.23), health expenditures (𝜇=0.11), and import (𝜇=0.08), lower inflation rate (𝜇=-0.25), higher than average life expectancy (𝜇=0.55), and lower fertility rates (𝜇=-0.55). The countries in this group also have very significantly higher GDP per capita on average, with mean equal to approximately 16721 USD per capita. All the factors lead to an assumption that countries classified into this cluster have high quality of life.

Homework №2 for AFM

Author: Anna Kuznetsova

2025-01-21