The objective of this project is to perform a clustering analysis on a dataset of 167 countries, using features such as health, income, and inflation to identify each country's level of development. The goal is to group the countries according to their similarities and differences in these features, so that Non-Governmental Organizations (NGOs) can use the resulting clusters to allocate their budgets and aid efforts where they will do the most good for people in underdeveloped countries.
Clustering is an unsupervised learning method that groups objects according to the similarity of their attributes. Different clustering algorithms have distinct advantages and disadvantages, and the appropriate choice depends on the specific problem and dataset being examined. In this project, several algorithms and distance metrics will be analysed.
Before applying clustering algorithms to a data set, it is crucial to conduct an initial analysis to ensure that the data is appropriate for clustering. One important consideration is that clustering algorithms tend to work best with numerical data. If the dataset includes categorical or text-based variables, these may need to be transformed or removed to facilitate clustering analysis.
Additionally, clustering algorithms often require complete data sets without missing values. If a data set contains missing values, they may need to be imputed or removed before clustering; the countries data contains no missing values. Furthermore, if the dataset includes different types of variables, such as numerical and categorical ones, appropriate transformations may be needed to put all variables on a comparable scale.
Another important consideration is the potential benefit of scaling the data. Scaling helps to ensure that all variables are weighted equally in the clustering analysis, which leads to more accurate and meaningful results. Overall, this initial analysis ensures that the data is appropriately prepared for clustering and that the resulting clusters are meaningful.
Data Analysis
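The code in this report relies on several packages that are presumably attached in a setup chunk not shown here; the following is a minimal sketch of the libraries the calls below appear to need.
library(tidyverse)         # dplyr, tidyr, ggplot2, tibble: piping, gather(), plots
library(cluster)           # daisy(), silhouette()
library(factoextra)        # get_clust_tendency(), fviz_nbclust(), eclust(), fviz_*()
library(flexclust)         # cclust(), kcca()
library(FeatureImpCluster) # FeatureImpCluster()
library(data.table)        # as.data.table()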
countries <- read.csv("Country-data.csv")
head(countries)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
The structure of the data set is shown below. It contains only one character variable, the country name, which should be removed before the clustering algorithms are applied.
str(countries)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
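As noted above, the data set has no missing values and only one non-numeric column; a quick check (a minimal sketch) confirms this.
colSums(is.na(countries)) # number of missing values per column (all zero)
sapply(countries, class)  # country is character; all other variables are numeric or integer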
Statistical analysis of variables
summary(countries)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
Distribution of variables
countries %>%
gather(Features, value, 2:10) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "white", colour = "black") +
facet_wrap(~Features, scales = "free_x") +
labs(x = "Values", y = "Frequency")
Conversion of integer variables to numeric variables
countries$income <- as.numeric(countries$income)
countries$gdpp <- as.numeric(countries$gdpp)
df <- countries
names(df) <- NULL # drop the column names so the feature matrix below is built without them
Converting dataframe to matrix and scaling values
# build a numeric matrix of the nine features, label the rows with the country names,
# and standardise each feature to mean 0 and standard deviation 1
matrix1 <- as.matrix(df[, 2:10])
newdf <- t(matrix1)
colnames(newdf) <- countries$country
countries_matrix <- scale(t(newdf), TRUE, TRUE)
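As a side note, an equivalent and more direct construction (a sketch, not the report's original code) also keeps the feature names as column labels, assuming the country name is the first column.
# standardise the nine numeric features and label the rows with the country names
countries_matrix_alt <- scale(as.matrix(countries[, -1]))
rownames(countries_matrix_alt) <- countries$country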
The Hopkins statistic is a measure used to assess whether a data set is suitable for clustering. If a data set has a meaningful clustering structure, its points tend to be closer to other points in the same cluster than to points in other clusters.
A Hopkins statistic close to 1 indicates that the data set is likely to have a clustering structure, a value close to 0.5 means it is unclear whether such a structure exists, and a value close to 0 indicates a lack of clustering structure.
For the country data, the Hopkins statistic is 0.83, which means the data is highly clusterable.
get_clust_tendency(countries_matrix, 2, graph = TRUE, gradient = list(low="blue", high="red"), seed = 1234)
## $hopkins_stat
## [1] 0.8325281
##
## $plot
These graphs show the optimal number of clusters. The first, the elbow method, plots the total within-cluster sum of squares; beyond three clusters there is no significant decrease in this value. The silhouette method indicates five clusters as optimal, but the difference from three clusters is small. Choosing three clusters is therefore the most reasonable solution.
fviz_nbclust(countries_matrix, kmeans, method = "wss", k.max = 5) + theme_minimal() + ggtitle("The Elbow Method")
fviz_nbclust(countries_matrix, kmeans, method = "silhouette", k.max = 5) + theme_minimal() + ggtitle("Shilhouette Method")
Experiment Results for Clustering
K-means Clustering
km<-eclust(countries_matrix, "kmeans", hc_metric = "euclidean",k =3)
fviz_silhouette(km)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
PAM Clustering
pam<-eclust(countries_matrix, "pam", hc_metric="euclidean",k=3)
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
CLARA Clustering
clara<-eclust(countries_matrix, "clara", hc_metric="euclidean",k=3)
fviz_silhouette(clara)
## cluster size ave.sil.width
## 1 1 39 0.28
## 2 2 94 0.34
## 3 3 34 0.18
Hierarchical Clustering
dist_matrix <- dist(countries_matrix, method = "euclidean")
hc <- hclust(dist_matrix, method = "single")
plot(hc)
sil_cl <- silhouette(cutree(hc, h=3) , dist_matrix)
plot(sil_cl)
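Note that cutree(hc, h = 3) cuts the dendrogram at height 3 rather than into three groups. The small sketch below (an added illustration, not part of the original analysis) forces exactly three clusters, matching the number used elsewhere in the report, and reports their average silhouette width.
hc_groups <- cutree(hc, k = 3)               # exactly three clusters
sil_hc <- silhouette(hc_groups, dist_matrix)
mean(sil_hc[, "sil_width"])                  # average silhouette width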
km <- kmeans(countries_matrix, 3) # stats::kmeans
plot(countries_matrix, col = km$cluster, pch = ".", cex = 3) # base plot shows only the first two features
points(km$centers, col = 1:3, pch = 8, cex = 2, lwd = 2) # mark the three cluster centers
fviz_cluster(list(data=countries_matrix, cluster=km$cluster), ellipse.type = "norm", geom = "point", stand = FALSE, palette="jco", ggtheme=theme_classic())
D<-daisy(countries_matrix) # calculates the dissimilarity matrix, cluster::
plot(silhouette(km$cluster, D), col=1:2, border=NA)
sil<-silhouette(km$cluster, dist(countries_matrix))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
km2<-cclust(countries_matrix, k = 3, simple = FALSE, save.data = TRUE)
plot(km2)
This graph shows the relative importance of each variable in the clustering analysis, with higher scores indicating greater importance. It helps identify which variables contribute most to forming the clusters.
km<-kcca(countries_matrix, k=3)
FeatureImp_km<-FeatureImpCluster(km, as.data.table(countries_matrix))
plot(FeatureImp_km)
K-means is a popular clustering algorithm. The procedure starts with randomly selected cluster centers, assigns each data point to its nearest centroid, and then updates the centroids as the mean of the assigned data points. Clustering is an NP-hard optimization problem, which means that finding the optimal solution is extremely hard. Heuristic algorithms such as Simulated Annealing, Genetic Algorithms, and Tabu Search were developed to tackle NP-hard optimization problems. In this project, the Simulated Annealing algorithm is used to improve the performance of the k-means algorithm.
Simulated annealing is a heuristic algorithm that is inspired by the physical process of annealing in metallurgy, where a material is heated and slowly cooled to increase its durability and remove defects. In simulated annealing, the optimization problem is treated as a thermodynamic system, and the algorithm simulates the annealing process by gradually reducing the temperature and allowing the system to settle into a minimum-energy state.
Simulated annealing process for k-means:
Start the k-means algorithm with randomly selected cluster centers.
Define the objective function; here, the sum of the distances between the data points and their assigned cluster centers is used.
Define a temperature and a cooling rate. The temperature controls the probability of accepting a worse solution, while the cooling rate determines how quickly the temperature decreases.
Repeat the following steps until the stopping criteria are met: choose a neighbor solution by randomly selecting a data point and reassigning it to a different cluster.
Calculate the objective function for the neighbor solution.
Calculate the difference in the objective function between the current and neighbor solutions.
If the difference is negative (i.e., the neighbor solution is better), accept the neighbor solution as the new current solution.
If the difference is positive (i.e., the neighbor solution is worse), compute an acceptance probability from the current temperature and the magnitude of the difference, and accept the worse solution with that probability.
Update the temperature according to the cooling schedule.
Return the final clustering solution with the lowest objective function value found during the simulated annealing process.
The objective of using simulated annealing with k-means clustering is to find a global optimum solution that minimizes the objective function. By allowing the algorithm to accept worse solutions with a certain probability, simulated annealing is able to explore a larger search space and avoid getting stuck in local optima.
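As a small illustration of this acceptance rule, the sketch below implements the standard Metropolis criterion. It is a generic illustration only (the function and argument names are not taken from the report's code); the implementation further below uses a fixed acceptance threshold instead.
# Metropolis acceptance rule: always accept an improving move; accept a
# worsening move with probability exp(-delta / temperature), where delta is
# the change in the objective (candidate minus current). Illustrative only.
accept_move <- function(delta, temperature) {
  if (delta < 0) return(TRUE)
  runif(1) < exp(-delta / temperature)
}
# example: a move that worsens the objective by 5 while the temperature is 10
accept_move(5, 10)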
k_means <- function(data, k, clusters) {
# vectors that will contain the cluster assignments for each observation
cluster_vec <- c()
last_vec <- c(0)
#iteration
iter <- 1
stop <- 0
#data frame that will contain all the assignments for every iteration
all_df <- data.frame()
# data frames that will contain the centroids for every iteration
all_center_df <- data.frame()
while (stop == 0) {
#loop through each observation
for (i in 1:nrow(data)) {
#find the distance of the ith observation to each of the clusters
dist <- data[i,] %>%
rbind(clusters) %>%
dist()
#find which cluster the ith observation has the smallest distance with
i_cluster <- dist[1:k] %>%
which.min()
#add the cluster assignment for the ith observation to a vector
#containing the cluster assignments of all observations
cluster_vec[i] <- i_cluster
}
#check to see if the cluster assignments have changed at all since
#the last iteration
if (all(cluster_vec == last_vec)) {
stop <- 1
}
#save the cluster assignments from this iteration to another object
#so we can check to see if cluster assignments change in the next iteration
last_vec <- cluster_vec
#save this iteration's clustering vector and original data to a new data frame
df <- data %>%
add_column(cluster = cluster_vec) %>%
add_column(iteration = iter)
#save this iteration's centroids to a new data frame with the iteration number too
center_df <- clusters %>%
add_column(cluster = c(1:k)) %>%
add_column(iteration = iter)
#add this iteration's cluster assignments to the data frame containing all the
#assignments of all previous iterations
all_df <- rbind(all_df, df)
#add this iteration's centroids to the data frame containing all the
#centroids of all the previous iterations
all_center_df <- rbind(all_center_df, center_df)
#group the observations into their assigned clusters and find the means
#of all the columns to use as the new cluster centers
clusters <- data %>%
cbind(cluster_vec) %>%
group_by(cluster_vec) %>%
summarize_all(mean)
#remove the first column that contains the cluster number
clusters <- clusters[, -1]
#clear the data frames containing information on only the current iteration
df <- data.frame()
center_df <- data.frame()
#add to the iteration counter
iter <- iter + 1
#find sizes and clusters once the algorithm is finished
if (stop == 1) {
sizes <- data %>%
cbind(cluster_vec) %>%
count(cluster_vec) %>%
pull(n)
clusters <- data %>%
cbind(cluster_vec) %>%
group_by(cluster_vec) %>%
summarize_all(mean)
}
}
result <- list("Sizes" = sizes,
"Cluster Means" = clusters,
"Clustering Vector" = cluster_vec,
"All Assignments" = all_df,
"All Centroids" = all_center_df)
return(result)
}
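A minimal usage sketch of the k_means() function defined above, run on the scaled country data (the object names below are illustrative; the initial centers are three randomly chosen rows):
set.seed(42)
dat <- as.data.frame(countries_matrix)      # scaled features as a data frame
init_centers <- dat[sample(nrow(dat), 3), ] # three random rows as starting centers
fit <- k_means(dat, k = 3, clusters = init_centers)
fit$Sizes                                   # number of countries in each cluster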
sse <- function(data, assignments, cluster_means){
data$assigned_cluster <- assignments
withinss <- c()
# loop over the clusters (cluster_means is only used for the number of clusters)
for (i in 1:nrow(cluster_means)){
sse <- 0
cluster_i <- data[data$assigned_cluster == i,]
for (j in 1:nrow(cluster_i)){
cluster <- cluster_i %>%
dplyr::select(-assigned_cluster)
# squared distance between observation j and the i-th observation of the cluster
# (note that the distance is not measured to the cluster mean itself)
distance <- dist(rbind(cluster[j,], cluster[i,]))
sse <- (sse + (distance)^2)
}
# accumulated sum of squared distances for cluster i
withinss <- c(withinss, sse)
}
return(withinss)
}
centers <- sample(1:nrow(countries), 3)
simulated_annealing_k_means <- function(data, k, centers){
# best objective value found so far, initialised to a very large number
solution <- 2147483647
# starting temperature and cooling rate of the annealing schedule
Temperature <- 10
cooling_rate <- 0.1
# acceptance threshold (computed once from the initial temperature)
acceptance_criteria <- 0.5 * (100 / Temperature)
# Get current time
now <- Sys.time()
# Use time in seconds as seed for random number generator
set.seed(as.integer(now, units = "secs"))
# draw three random rows as the initial cluster centers (this overrides the centers argument)
centers <- sample(1:nrow(data), 3)
while(Temperature > 0.01){
# Get current time
now <- Sys.time()
# Use time in seconds as seed for random number generator
set.seed(as.integer(now, units = "secs"))
random_element <- sample(centers, size = 1)
new_element <- sample(1:nrow(data),1)
if(new_element %in% centers){
next
}
iteration_centers <- replace(centers, centers == random_element, new_element)
#print(centers)
clusters <- data[iteration_centers,]
clustering <- k_means(data,k,clusters)
sse <- sse(data,clustering$`Clustering Vector`,clustering$`Cluster Means`)
#print(sse)
if(is.na(sum(sse)) || sum(sse) == 0){
next
}
if(sum(sse) < solution || acceptance_criteria < runif(1)){
solution <- sum(sse)
centers <- iteration_centers
}
Temperature <- Temperature * (1 - cooling_rate)
print(solution)
}
return(solution)
}
simulated_annealing_k_means(countries,3,centers)
## [1] 1243.193
## [1] 1010.648
## [1] 1007.564
The best objective value found so far is printed once per annealing iteration; only the distinct values are shown above. The solution improves from 1243.19 to 1010.65 and finally to 1007.56, where it remains until the temperature falls below 0.01.
Following data preprocessing and conversion, four clustering algorithms (k-means, PAM, CLARA, and hierarchical clustering) were tested using the Euclidean distance metric. After conducting the appropriate tests, the optimal number of clusters was determined to be three. Among the four algorithms, hierarchical clustering provided the best solution, with an average silhouette width of 0.35. Additionally, a heuristic approach was introduced to address the problem of local optima, which led to an improvement of approximately 20% over the classical k-means algorithm.
References
Selim, S. Z., & Alsultan, K. (1991). A simulated annealing algorithm for the clustering problem. Pattern Recognition. https://doi.org/10.1016/0031-3203(91)90097-O
https://rpubs.com/hasiegler/926806
ChatGPT