mydata <- read.table("~/Documents/Country-data.csv",
header = TRUE, sep = ",", dec = ".")
head(mydata, 10)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.440
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.490
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.100
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.400
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.440
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.900
## 7 Armenia 18.1 20.8 4.40 45.3 6700 7.770
## 8 Australia 4.8 19.8 8.73 20.9 41400 1.160
## 9 Austria 4.3 51.3 11.00 47.8 43200 0.873
## 10 Azerbaijan 39.2 54.3 5.88 20.7 16000 13.800
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
## 7 73.3 1.69 3220
## 8 82.0 1.93 51900
## 9 80.5 1.44 46900
## 10 69.1 1.92 5840
str(mydata)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
EXPLANATION OF THE DATA:
DESCRIPTION OF THE VARIABLES:
DATA SOURCE: Data was retrieved from Kaggle.com from: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/data, a dataset that is used to categorise the countries using socio-economic and health factors that determine the overall development of the country. The data is based on HELP International, that is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
# Adding ID:
mydata$ID <- seq(1, nrow(mydata))
head(mydata)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp ID
## 1 56.2 5.82 553 1
## 2 76.3 1.65 4090 2
## 3 76.5 2.89 4460 3
## 4 60.1 6.16 3530 4
## 5 76.8 2.13 12200 5
## 6 75.8 2.37 10300 6
summary(mydata[ , -c(1,11)])
## child_mort exports health imports
## Min. : 2.60 Min. : 0.109 Min. : 1.810 Min. : 0.0659
## 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920 1st Qu.: 30.2000
## Median : 19.30 Median : 35.000 Median : 6.320 Median : 43.3000
## Mean : 38.27 Mean : 41.109 Mean : 6.816 Mean : 46.8902
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600 3rd Qu.: 58.7500
## Max. :208.00 Max. :200.000 Max. :17.900 Max. :174.0000
## income inflation life_expec total_fer
## Min. : 609 Min. : -4.210 Min. :32.10 Min. :1.150
## 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30 1st Qu.:1.795
## Median : 9960 Median : 5.390 Median :73.10 Median :2.410
## Mean : 17145 Mean : 7.782 Mean :70.56 Mean :2.948
## 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80 3rd Qu.:3.880
## Max. :125000 Max. :104.000 Max. :82.80 Max. :7.490
## gdpp
## Min. : 231
## 1st Qu.: 1330
## Median : 4660
## Mean : 12964
## 3rd Qu.: 14050
## Max. :105000
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata[ , -c(1,11)]), 1)
## child_mort exports health imports income inflation life_expec
## nbr.val 167.0 167.0 167.0 167.0 167.0 167.0 167.0
## nbr.null 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## nbr.na 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## min 2.6 0.1 1.8 0.1 609.0 -4.2 32.1
## max 208.0 200.0 17.9 174.0 125000.0 104.0 82.8
## range 205.4 199.9 16.1 173.9 124391.0 108.2 50.7
## sum 6391.1 6865.2 1138.2 7830.7 2863163.0 1299.6 11782.8
## median 19.3 35.0 6.3 43.3 9960.0 5.4 73.1
## mean 38.3 41.1 6.8 46.9 17144.7 7.8 70.6
## SE.mean 3.1 2.1 0.2 1.9 1491.8 0.8 0.7
## CI.mean.0.95 6.2 4.2 0.4 3.7 2945.3 1.6 1.4
## var 1626.4 751.4 7.5 586.1 371643894.2 111.7 79.1
## std.dev 40.3 27.4 2.7 24.2 19278.1 10.6 8.9
## coef.var 1.1 0.7 0.4 0.5 1.1 1.4 0.1
## total_fer gdpp
## nbr.val 167.0 167.0
## nbr.null 0.0 0.0
## nbr.na 0.0 0.0
## min 1.1 231.0
## max 7.5 105000.0
## range 6.3 104769.0
## sum 492.3 2165014.0
## median 2.4 4660.0
## mean 2.9 12964.2
## SE.mean 0.1 1418.3
## CI.mean.0.95 0.2 2800.3
## var 2.3 335941420.0
## std.dev 1.5 18328.7
## coef.var 0.5 1.4
Mean of Exports: The average percentage of the GDP per capita from exports for all countries is approximately 41.1%.
Median of Exports: Half of countries have exports lower than 35 percent of GDP per capita, and half have higher than 35.
Minimum and Maximum of Health: The minimum spending per capita on health services among all countries is 1.8 percent of the GDP per capita, and the maximum spending per capita is 17.9.
Comparing Mean and Median of Income: mean = 17144.7, median = 9960.0 Since the numbers differ, and mean is more than (to the right of) the median, then the dataset for Income should be skewed to the right The difference is quite significant, so it is likely that there are outliers.
Range of Inflation: The maximum difference between the inflation of two countries is 108.2 percent.
Third quantile of GDP per capita: approximately 75 percent of countries have less than 14050 GDP per capita on average. And approximately 25 percent of participants have more than 14050 GDP per capita on average.
SE.mean of Imports: standard error of the mean 1.9 tells how much the sample mean could vary if the samples of 167 observations of countries are repeatedly taken. Since sample mean is relatively high (46.9), the SE.mean suggests the sample mean is quite a good estimate of the population mean.
Coef.Var of Life Expectancy: Coefficient of variation (CV) compares relative variability or the dispersion of data across different datasets, regardless their units; low coefficient of variation implies that data is clustered around the mean, as in this case, where CV is 0.1.
RESEARCH QUESTION: How can countries be classified on the basis of the following seven cluster variables: child mortality, exports, imports, health expenditures, inflation, life expectancy, and total fertility?
# Standardization of cluster variables:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))
# Calculation of the variable Dissimilarity to assess which units are potential outliers and could be removed:
mydata$Dissimilarity = sqrt(mydata_clu_std$child_mort^2 + mydata_clu_std$exports^2 + mydata_clu_std$health^2 + mydata_clu_std$imports^2 + mydata_clu_std$inflation^2 + mydata_clu_std$life_expec^2 + mydata_clu_std$total_fer^2)
# Showing countries ordered by descending Dissimilarity, in order to see outliers:
head(mydata[order(-mydata$Dissimilarity), c(11, 1, 12)], 10)
## ID country Dissimilarity
## 114 114 Nigeria 9.755748
## 134 134 Singapore 8.175757
## 92 92 Luxembourg 6.523653
## 99 99 Malta 6.303252
## 67 67 Haiti 6.160549
## 133 133 Sierra Leone 4.632482
## 160 160 United States 4.614717
## 32 32 Central African Republic 4.439482
## 88 88 Lesotho 4.152258
## 33 33 Chad 4.088458
# Removing countries with IDs 114 (Nigeria) and 134 (Singapore), which seem to be outliers:
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>% filter(!country %in% c("Nigeria", "Singapore"))
# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))
# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))
head(mydata[order(-mydata$Dissimilarity), c(11, 1, 12)], 10)
## ID country Dissimilarity
## 92 92 Luxembourg 6.523653
## 99 99 Malta 6.303252
## 67 67 Haiti 6.160549
## 132 132 Sierra Leone 4.632482
## 158 158 United States 4.614717
## 32 32 Central African Republic 4.439482
## 88 88 Lesotho 4.152258
## 33 33 Chad 4.088458
## 113 113 Niger 4.031854
## 162 162 Venezuela 3.978946
# Finding EUCLIDIAN DISTANCES, based on 7 cluster variables, and saving them into object Distances:
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(mydata_clu_std, method = "euclidian")
# Graphical representation of DISSIMILARITY MATRIX, which is a matrix of all distances between objects, with dimentions of 165*165:
# We can see a darker line of zeros, and some groups appearing along the line.
fviz_dist(Distances,
gradient = list(low = "darkred", mid = "grey95", high = "white"))
# Measuring the HOPKINS STATISTIC, that tells whether natural groups exist. It should be higher than 0.5 and closest to 1, in order for data to be considered clusterable:
# In this case, H is 0.82, which means that data can be clustered.
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8181938
##
## $plot
## NULL
# Testing how many groups can be formed, based on the HIERARCHICAL (agglomerate) approach:
library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidian") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 165
# Showing Hierarchical approach visualization or the Dendrogram:
# We can see that the largest difference between distances can be achieved by cutting the data into two groups.
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Testing how many groups can be formed, based on the K-MEANS approach:
# 1) Elbow method: We can see a break (change of slope) at 2 clusters, and then small break at 5 and 7.
#install.packages("NbClust")
library(NbClust)
library(factoextra)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
# Testing how many groups can be formed, based on the K-MEANS approach:
# 2) Silhouette analysis: The highest point is at 2 clusters, and then smaller at 5.
library(NbClust)
library(factoextra)
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
# Testing how many groups can be formed, based on the Indices:
# The most proposed number of clusters is 2.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 3.5819 77.9674 31.5760 -0.5605 242.4149 4.664570e+13 20181.047 776.5532
## 3 0.7109 61.9412 31.0220 -1.6471 383.1495 4.472687e+13 13628.344 650.5332
## 4 2.1012 59.1748 18.9468 -1.1595 492.9166 4.088173e+13 8073.676 545.9813
## 5 15.0168 54.0031 9.1034 -1.0191 591.7487 3.509252e+13 6237.537 488.4943
## 6 0.0627 47.1845 16.2425 -1.9679 658.5469 3.370996e+13 5811.509 462.1970
## 7 1.1730 45.7547 14.1681 -1.1773 768.9352 2.350174e+13 5007.710 419.3579
## 8 2.5805 44.4758 8.5085 -0.5363 828.9952 2.133049e+13 4131.354 384.8479
## 9 0.7331 41.8208 9.1162 -0.6894 883.2107 1.943596e+13 3691.191 365.0636
## 10 0.2452 40.1006 24.0308 -0.5801 944.1423 1.658607e+13 3347.439 344.9081
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 12.2040 1.4783 0.2823 1.3771 0.3425 0.9352 8.3201 0.3137 0.3470
## 3 14.8689 1.7647 0.2612 1.5585 0.2324 0.7198 31.9237 1.7554 0.3635
## 4 17.0715 2.1026 0.3198 1.3890 0.2423 0.9607 2.0837 0.1814 0.3572
## 5 18.6064 2.3501 0.3675 1.4284 0.2228 1.4336 -18.7519 -1.3436 0.3369
## 6 19.6617 2.4838 0.3577 1.3880 0.2222 1.5752 -9.4945 -1.5562 0.3139
## 7 22.1248 2.7375 0.3395 1.3930 0.2180 1.0659 -1.7938 -0.2623 0.2998
## 8 23.1962 2.9830 0.3442 1.3520 0.1940 1.1632 -6.3144 -0.6207 0.2876
## 9 24.5894 3.1447 0.3426 1.4280 0.1803 1.1246 -1.4404 -0.3373 0.2746
## 10 26.4065 3.3284 0.3372 1.4113 0.2009 0.6968 14.3580 1.9181 0.2638
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 388.2766 0.5183 1.9249 0.4855 0.0774 0.0019 2.0939 1.9527 0.9501
## 3 216.8444 0.4455 0.7463 1.1107 0.0695 0.0019 2.0565 1.7757 0.7509
## 4 136.4953 0.4140 0.1825 1.8668 0.0714 0.0019 1.8026 1.6145 0.7403
## 5 97.6989 0.4239 -0.0285 2.1464 0.0974 0.0023 1.9754 1.5422 0.5437
## 6 77.0328 0.4344 0.3215 2.1749 0.0974 0.0025 2.0231 1.5106 0.5443
## 7 59.9083 0.4284 0.6744 2.5104 0.1101 0.0026 1.9135 1.4449 0.4573
## 8 48.1060 0.4136 5.5546 2.8224 0.0906 0.0030 2.1616 1.3989 0.4421
## 9 40.5626 0.3764 0.5554 3.4806 0.0831 0.0030 2.2779 1.3567 0.4139
## 10 34.4908 0.3660 -0.0493 3.7933 0.0492 0.0031 2.3988 1.3065 0.3131
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.7577 38.3637 0.9479
## 3 0.7303 30.2759 0.0939
## 4 0.6446 28.1144 0.9890
## 5 0.6481 33.6586 1.0000
## 6 0.5070 25.2865 1.0000
## 7 0.4938 29.7319 1.0000
## 8 0.6291 26.5343 1.0000
## 9 0.1049 110.9776 1.0000
## 10 0.6154 20.6234 0.0685
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 5.0000 2.0000 10.0000 8.0000 3.0000 7.000000e+00 3.000
## Value_Index 15.0168 77.9674 14.9146 -0.5363 140.7345 8.036961e+12 6552.703
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 4.0000 3.0000 5.0000 3.0000 8.000 2.0000 2.0000
## Value_Index 47.0649 2.6649 -0.1137 0.2612 1.352 0.3425 0.9352
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 2.0000 2.0000 2.0000
## Value_Index 8.3201 0.3137 0.3635 171.4322 0.5183 1.9249 0.4855
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 7.0000 0 4.0000 0 10.0000
## Value_Index 0.1101 0 1.8026 0 0.3131
##
## $Best.partition
## [1] 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 2 1 2 1 1 2 2 1 1 1 2
## [38] 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 2 1 1 2 1
## [75] 1 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 1 1 2 1 1 2 2 2 2 1
## [112] 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2
## [149] 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 2
# Clustering the countries into 2 groups, by Hierarchical, K-Means, and Indices method:
# Ratio is 32.4%, which indicates the percent of variability explained by the classification.
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 52, 113
##
## Cluster means:
## child_mort exports health imports inflation life_expec total_fer
## 1 1.2225529 -0.4892033 -0.2312156 -0.20805482 0.5470284 -1.1597291 1.1904205
## 2 -0.5625907 0.2251201 0.1064001 0.09574204 -0.2517299 0.5336806 -0.5478041
##
## Clustering vector:
## [1] 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 2 1 2 2 1 1 2 2 2 1
## [38] 1 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2 1 2 2 1 2 2 1 2
## [75] 2 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 1 1 1 1 2
## [112] 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 1 1 2 1
## [149] 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 295.7289 480.8242
## (between_SS / total_SS = 32.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Showing the visualization of clustering:
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
# Removing outliers that are evident from the clustering graph:
library(dplyr)
mydata <- mydata %>% filter(!ID %in% c(92, 99, 67))
# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))
# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
# Removing more outliers that are evident from the clustering graph:
library(dplyr)
mydata <- mydata %>% filter(!ID %in% c(128, 73))
# Setting ID anew:
mydata$ID <- seq(1, nrow(mydata))
# Standardizing values again:
mydata_clu_std <- as.data.frame(scale(mydata[c(2,3,4,5,7,8,9)]))
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = mydata_clu_std)
# Checking the clustering again - we see that the ratio increased (improved) slightly from 32.4 to 32.9%, so the removal of outliers was a correct decision:
Clustering
## K-means clustering with 2 clusters of sizes 51, 109
##
## Cluster means:
## child_mort exports health imports inflation life_expec total_fer
## 1 1.232497 -0.4966399 -0.2307947 -0.16585337 0.5366107 -1.1686245 1.1824603
## 2 -0.576673 0.2323728 0.1079865 0.07760112 -0.2510748 0.5467876 -0.5532612
##
## Clustering vector:
## [1] 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 2 1 2 2 1 1 2 2 2 1
## [38] 1 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2 2 2 1 2 2 1 2 2
## [75] 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 1 2 2 1 1 2 2 2 1 2 2 1 1 1 1 2 2 1 2 2
## [112] 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 1 1 2 1 1 2 2 2 2
## [149] 1 2 2 2 2 2 2 2 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 310.0788 436.2213
## (between_SS / total_SS = 32.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Average values of cluster variables to describe three groups:
Averages <- Clustering$centers
Averages
## child_mort exports health imports inflation life_expec total_fer
## 1 1.232497 -0.4966399 -0.2307947 -0.16585337 0.5366107 -1.1686245 1.1824603
## 2 -0.576673 0.2323728 0.1079865 0.07760112 -0.2510748 0.5467876 -0.5532612
# Showing a visualisation chart for description of two groups:
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
#install.packages("tidyr")
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:pastecs':
##
## extract
Figure <- pivot_longer(Figure, cols = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"))
Figure$Group <- factor(Figure$id, levels = c(1,2), labels = c("1", "2"))
Figure$ImeF <- factor(Figure$name, levels = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"), labels = c("child_mort", "exports", "health", "imports", "inflation", "life_expec", "total_fer"))
#install.packages("ggplot2")
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_classic() +
geom_point(aes(shape = Group, col = Group), size = 4) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette = "Set1") +
ylim(-2, 2) +
theme(axis.test.x = element_text(angle = 45, vjust = 0.50, size = 10))
## Warning in plot_theme(plot): The `axis.test.x` theme element is not defined in
## the element hierarchy.
# Assigning countries to groups:
mydata$Group <- Clustering$cluster
Implementing one-way ANOVA for each cluster variable, in order to check whether clustering variables successfully differentiate between groups. If there are differences between means of two groups for each cluster variable, then the chosen cluster variables are appropriate.
# H0: Mean of group 1 is equal to mean of group 2 (for each variable).
# H1: Mean of group 1 is not equal to mean of group 2 (for each variable).
fit <- aov(cbind(child_mort, exports, health, imports, inflation, life_expec, total_fer) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response child_mort :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 162685 162685 396.81 < 2.2e-16 ***
## Residuals 158 64777 410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response exports :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 6979 6978.8 20.76 1.036e-05 ***
## Residuals 158 53115 336.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response health :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 30.59 30.5882 4.0645 0.04549 *
## Residuals 158 1189.06 7.5257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response imports :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 687 686.86 2.0732 0.1519
## Residuals 158 52347 331.31
##
## Response inflation :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 1207.7 1207.65 24.781 1.665e-06 ***
## Residuals 158 7699.9 48.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response life_expec :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 7128.3 7128.3 284.59 < 2.2e-16 ***
## Residuals 158 3957.5 25.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response total_fer :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 238.94 238.942 304.43 < 2.2e-16 ***
## Residuals 158 124.01 0.785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reject H0 for child_mort (p<0.001), exports (p<0.001), health (p=0.045), inflation (p<0.001), life_expec (p<0.001), and total_fer(p<0.001). For these variables, we found differences in means between two clusters. Hence, the variables are appropriate to use as cluster variables.
For imports, p-value is higher than 0.05, H0 is not rejected, and therefore this variable is not appropriate to use as cluster variable - it should be removed, and clustering should be performed again.
Checking the criterion validity with a variable that was not used in the clustering process: in this case, GDP per capita (gdpp), by a WELCH TWO-SAMPLE T-TEST or WILCOXON RANK SUM TEST for two independent samples, with which we would attempt to find differences in means of GDP per capita of two clusters.
# Calculating means of two clusters:
aggregate(mydata$gdpp,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 2101.529
## 2 2 16721.505
# Checking for data normality by Shapiro-Wilk normality test:
#H0: Variables in two groups of countries are normally distributed.
#H0: Variables in two groups of countries are not normally distributed.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Group) %>%
shapiro_test(gdpp)
## # A tibble: 2 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 gdpp 0.556 3.49e-11
## 2 2 gdpp 0.783 2.21e-11
# Performing Wilcoxon Rank Sum Test:
# H0: Location distribution of countries in group 1 is equal to location distribution of countries in group 2.
# H1: Location distribution of countries in group 1 is not equal to location distribution of countries in group 2.
fit <- wilcox.test(gdpp ~ as.factor(Group),
data = mydata)
fit
##
## Wilcoxon rank sum test with continuity correction
##
## data: gdpp by as.factor(Group)
## W = 509.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Based on 7 standardized variables (child mortality, exports, health expenditures, imports, inflation, life expectancy, and total fertility), we divided 160 countries into 2 groups, using Hierarchical, K-means, and Indices clustering. However, the data showed than variable imports does not influence clustering division, and could be ommitted.
Group 1 is a smaller group that consists of 51 countries (32%). Countries in this cluster are classified with higher child mortality rates (𝜇=1.23), lower than average exports of goods and services (𝜇=-0.49), health expenditures (𝜇=-0.23), and imports (𝜇=-0.16), high inflation rate (𝜇=0.53), significantly low life expectancy (𝜇=-1.16), and very high total fertility (𝜇=1.18). The countries in this group also have lower GDP per capita on average, with mean equal to approximately 2101 USD per capita. It can be assumed that countries classified into this cluster have lower quality of life.
Group 2 is a larger group that consists of 109 countries (68%). Countries in this cluster are classified with lower child mortality rates (𝜇=-0.57), higher exports of goods and services (𝜇=0.23), health expenditures (𝜇=0.11), and import (𝜇=0.08), lower inflation rate (𝜇=-0.25), higher than average life expectancy (𝜇=0.55), and lower fertility rates (𝜇=-0.55). The countries in this group also have very significantly higher GDP per capita on average, with mean equal to approximately 16721 USD per capita. All the factors lead to an assumption that countries classified into this cluster have high quality of life.