The objective of this project is to perform a clustering analysis on a dataset of 167 countries, using features such as health, income, and inflation to identify each country's level of development. The goal is to group the countries according to their similarities and differences in these features, so that Non-Governmental Organizations (NGOs) can use the resulting clusters to allocate their budgets and aid efforts where they will do the most good for people in underdeveloped countries.
Clustering is an unsupervised learning method that groups objects according to the similarity of their attributes. Different clustering algorithms have distinct advantages and disadvantages, and the appropriate choice depends on the specific problem and dataset being examined. In this project, several algorithms and distance metrics will be analysed.
Before applying clustering algorithms to a data set, it is crucial to conduct an initial analysis to ensure that the data is appropriate for clustering. One important consideration is that clustering algorithms tend to work best with numerical data. If the dataset includes categorical or text-based variables, these may need to be transformed or removed to facilitate clustering analysis.
Additionally, clustering algorithms often require complete data sets without missing values. If a data set contains missing values, they may need to be imputed or removed before clustering; the countries data contains no missing values. Furthermore, if the dataset includes different types of variables, such as numerical and categorical ones, appropriate transformations may be needed to put all variables on a comparable scale.
Another important consideration is the potential benefit of scaling the data. Scaling helps to ensure that all variables are weighted equally in the clustering analysis, which leads to more accurate and meaningful results. Overall, this initial analysis ensures that the data is appropriately prepared for clustering and that the resulting clusters are meaningful.
Data Analysis
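The code in this report relies on several packages that are presumably attached in a setup chunk not shown here; the following is a minimal sketch of the libraries the calls below appear to need.
library(tidyverse)         # dplyr, tidyr, ggplot2, tibble: piping, gather(), plots
library(cluster)           # daisy(), silhouette()
library(factoextra)        # get_clust_tendency(), fviz_nbclust(), eclust(), fviz_*()
library(flexclust)         # cclust(), kcca()
library(FeatureImpCluster) # FeatureImpCluster()
library(data.table)        # as.data.table()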
countries <- read.csv("Country-data.csv")
head(countries)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
The structure of the data set is shown below. It contains only one character variable, the country name, which should be removed before the clustering algorithms are applied.
str(countries)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
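As noted above, the data set has no missing values and only one non-numeric column; a quick check (a minimal sketch) confirms this.
colSums(is.na(countries)) # number of missing values per column (all zero)
sapply(countries, class)  # country is character; all other variables are numeric or integer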
Statistical analysis of variables
summary(countries)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
Distribution of variables
countries %>%
gather(Features, value, 2:10) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = "white", colour = "black") +
facet_wrap(~Features, scales = "free_x") +
labs(x = "Values", y = "Frequency")
Conversion of integer variables to numeric variables
countries$income <- as.numeric(countries$income)
countries$gdpp <- as.numeric(countries$gdpp)
df <- countries
names(df) <- NULL # drop the column names so the feature matrix below is built without them
Converting dataframe to matrix and scaling values
# build a numeric matrix of the nine features, label the rows with the country names,
# and standardise each feature to mean 0 and standard deviation 1
matrix1 <- as.matrix(df[, 2:10])
newdf <- t(matrix1)
colnames(newdf) <- countries$country
countries_matrix <- scale(t(newdf), TRUE, TRUE)
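As a side note, an equivalent and more direct construction (a sketch, not the report's original code) also keeps the feature names as column labels, assuming the country name is the first column.
# standardise the nine numeric features and label the rows with the country names
countries_matrix_alt <- scale(as.matrix(countries[, -1]))
rownames(countries_matrix_alt) <- countries$country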
The Hopkins statistic is a measure used to assess whether a data set is suitable for clustering. If a data set has a meaningful clustering structure, its points tend to be closer to other points in the same cluster than to points in other clusters.
A Hopkins statistic close to 1 indicates that the data set is likely to have a clustering structure, a value close to 0.5 means it is unclear whether such a structure exists, and a value close to 0 indicates a lack of clustering structure.
For the country data, the Hopkins statistic is 0.83, which means the data is highly clusterable.
get_clust_tendency(countries_matrix, 2, graph = TRUE, gradient = list(low="blue", high="red"), seed = 1234)
## $hopkins_stat
## [1] 0.8325281
##
## $plot
These graphs show the optimal number of clusters. The first, the elbow method, plots the total within-cluster sum of squares; beyond three clusters there is no significant decrease in this value. The silhouette method indicates five clusters as optimal, but the difference from three clusters is small. Choosing three clusters is therefore the most reasonable solution.
fviz_nbclust(countries_matrix, kmeans, method = "wss", k.max = 5) + theme_minimal() + ggtitle("The Elbow Method")
fviz_nbclust(countries_matrix, kmeans, method = "silhouette", k.max = 5) + theme_minimal() + ggtitle("Shilhouette Method")
Experiment Results for Clustering
K-means Clustering
km<-eclust(countries_matrix, "kmeans", hc_metric = "euclidean",k =3)
fviz_silhouette(km)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
PAM Clustering
pam<-eclust(countries_matrix, "pam", hc_metric="euclidean",k=3)
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
CLARA Clustering
clara<-eclust(countries_matrix, "clara", hc_metric="euclidean",k=3)
fviz_silhouette(clara)
## cluster size ave.sil.width
## 1 1 39 0.28
## 2 2 94 0.34
## 3 3 34 0.18
Hierarchical Clustering
dist_matrix <- dist(countries_matrix, method = "euclidean")
hc <- hclust(dist_matrix, method = "single")
plot(hc)
sil_cl <- silhouette(cutree(hc, h=3) , dist_matrix)
plot(sil_cl)
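Note that cutree(hc, h = 3) cuts the dendrogram at height 3 rather than into three groups. The small sketch below (an added illustration, not part of the original analysis) forces exactly three clusters, matching the number used elsewhere in the report, and reports their average silhouette width.
hc_groups <- cutree(hc, k = 3)               # exactly three clusters
sil_hc <- silhouette(hc_groups, dist_matrix)
mean(sil_hc[, "sil_width"])                  # average silhouette width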
km <- kmeans(countries_matrix, 3) # stats::kmeans
plot(countries_matrix, col = km$cluster, pch = ".", cex = 3) # base plot shows only the first two features
points(km$centers, col = 1:3, pch = 8, cex = 2, lwd = 2) # mark the three cluster centers
fviz_cluster(list(data=countries_matrix, cluster=km$cluster), ellipse.type = "norm", geom = "point", stand = FALSE, palette="jco", ggtheme=theme_classic())
D<-daisy(countries_matrix) # calculates the dissimilarity matrix, cluster::
plot(silhouette(km$cluster, D), col=1:2, border=NA)
sil<-silhouette(km$cluster, dist(countries_matrix))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
km2<-cclust(countries_matrix, k = 3, simple = FALSE, save.data = TRUE)
plot(km2)
This graph shows the relative importance of each variable in the clustering analysis, with higher scores indicating greater importance. It helps identify which variables contribute most to forming the clusters.
km<-kcca(countries_matrix, k=3)
FeatureImp_km<-FeatureImpCluster(km, as.data.table(countries_matrix))
plot(FeatureImp_km)
K-means is a popular clustering algorithm. The procedure starts with randomly selected cluster centers, assigns each data point to its nearest centroid, and then updates the centroids as the mean of the assigned data points. Clustering is an NP-hard optimization problem, which means that finding the optimal solution is extremely hard. Heuristic algorithms such as Simulated Annealing, Genetic Algorithms, and Tabu Search were developed to tackle NP-hard optimization problems. In this project, the Simulated Annealing algorithm is used to improve the performance of the k-means algorithm.
Simulated annealing is a heuristic algorithm that is inspired by the physical process of annealing in metallurgy, where a material is heated and slowly cooled to increase its durability and remove defects. In simulated annealing, the optimization problem is treated as a thermodynamic system, and the algorithm simulates the annealing process by gradually reducing the temperature and allowing the system to settle into a minimum-energy state.
Simulated annealing process for k-means:
Start the k-means algorithm with randomly selected cluster centers.
Define the objective function; here, the sum of the distances between the data points and their assigned cluster centers is used.
Define a temperature and a cooling rate. The temperature controls the probability of accepting a worse solution, while the cooling rate determines how quickly the temperature decreases.
Repeat the following steps until the stopping criteria are met: choose a neighbor solution by randomly selecting a data point and reassigning it to a different cluster.
Calculate the objective function for the neighbor solution.
Calculate the difference in the objective function between the current and neighbor solutions.
If the difference is negative (i.e., the neighbor solution is better), accept the neighbor solution as the new current solution.
If the difference is positive (i.e., the neighbor solution is worse), compute an acceptance probability from the current temperature and the magnitude of the difference, and accept the worse solution with that probability.
Update the temperature according to the cooling schedule.
Return the final clustering solution with the lowest objective function value found during the simulated annealing process.
The objective of using simulated annealing with k-means clustering is to find a global optimum solution that minimizes the objective function. By allowing the algorithm to accept worse solutions with a certain probability, simulated annealing is able to explore a larger search space and avoid getting stuck in local optima.
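As a small illustration of this acceptance rule, the sketch below implements the standard Metropolis criterion. It is a generic illustration only (the function and argument names are not taken from the report's code); the implementation further below uses a fixed acceptance threshold instead.
# Metropolis acceptance rule: always accept an improving move; accept a
# worsening move with probability exp(-delta / temperature), where delta is
# the change in the objective (candidate minus current). Illustrative only.
accept_move <- function(delta, temperature) {
  if (delta < 0) return(TRUE)
  runif(1) < exp(-delta / temperature)
}
# example: a move that worsens the objective by 5 while the temperature is 10
accept_move(5, 10)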
k_means <- function(data, k, clusters) {
# vectors that will contain the cluster assignments for each observation
cluster_vec <- c()
last_vec <- c(0)
#iteration
iter <- 1
stop <- 0
#data frame that will contain all the assignments for every iteration
all_df <- data.frame()
# data frames that will contain the centroids for every iteration
all_center_df <- data.frame()
while (stop == 0) {
#loop through each observation
for (i in 1:nrow(data)) {
#find the distance of the ith observation to each of the clusters
dist <- data[i,] %>%
rbind(clusters) %>%
dist()
#find which cluster the ith observation has the smallest distance with
i_cluster <- dist[1:k] %>%
which.min()
#add the cluster assignment for the ith observation to a vector
#containing the cluster assignments of all observations
cluster_vec[i] <- i_cluster
}
#check to see if the cluster assignments have changed at all since
#the last iteration
if (all(cluster_vec == last_vec)) {
stop <- 1
}
#save the cluster assignments from this iteration to another object
#so we can check to see if cluster assignments change in the next iteration
last_vec <- cluster_vec
#save this iteration's clustering vector and original data to a new data frame
df <- data %>%
add_column(cluster = cluster_vec) %>%
add_column(iteration = iter)
#save this iteration's centroids to a new data frame with the iteration number too
center_df <- clusters %>%
add_column(cluster = c(1:k)) %>%
add_column(iteration = iter)
#add this iteration's cluster assignments to the data frame containing all the
#assignments of all previous iterations
all_df <- rbind(all_df, df)
#add this iteration's centroids to the data frame containing all the
#centroids of all the previous iterations
all_center_df <- rbind(all_center_df, center_df)
#group the observations into their assigned clusters and find the means
#of all the columns to use as the new cluster centers
clusters <- data %>%
cbind(cluster_vec) %>%
group_by(cluster_vec) %>%
summarize_all(mean)
#remove the first column that contains the cluster number
clusters <- clusters[, -1]
#clear the data frames containing information on only the current iteration
df <- data.frame()
center_df <- data.frame()
#add to the iteration counter
iter <- iter + 1
#find sizes and clusters once the algorithm is finished
if (stop == 1) {
sizes <- data %>%
cbind(cluster_vec) %>%
count(cluster_vec) %>%
pull(n)
clusters <- data %>%
cbind(cluster_vec) %>%
group_by(cluster_vec) %>%
summarize_all(mean)
}
}
result <- list("Sizes" = sizes,
"Cluster Means" = clusters,
"Clustering Vector" = cluster_vec,
"All Assignments" = all_df,
"All Centroids" = all_center_df)
return(result)
}
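A minimal usage sketch of the k_means() function defined above, run on the scaled country data (the object names below are illustrative; the initial centers are three randomly chosen rows):
set.seed(42)
dat <- as.data.frame(countries_matrix)      # scaled features as a data frame
init_centers <- dat[sample(nrow(dat), 3), ] # three random rows as starting centers
fit <- k_means(dat, k = 3, clusters = init_centers)
fit$Sizes                                   # number of countries in each cluster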
sse <- function(data, assignments, cluster_means){
data$assigned_cluster <- assignments
withinss <- c()
# loop over the clusters (cluster_means is only used for the number of clusters)
for (i in 1:nrow(cluster_means)){
sse <- 0
cluster_i <- data[data$assigned_cluster == i,]
for (j in 1:nrow(cluster_i)){
cluster <- cluster_i %>%
dplyr::select(-assigned_cluster)
# squared distance between observation j and the i-th observation of the cluster
# (note that the distance is not measured to the cluster mean itself)
distance <- dist(rbind(cluster[j,], cluster[i,]))
sse <- (sse + (distance)^2)
}
# accumulated sum of squared distances for cluster i
withinss <- c(withinss, sse)
}
return(withinss)
}
centers <- sample(1:nrow(countries), 3)
simulated_annealing_k_means <- function(data, k, centers){
# best objective value found so far, initialised to a very large number
solution <- 2147483647
# starting temperature and cooling rate of the annealing schedule
Temperature <- 10
cooling_rate <- 0.1
# acceptance threshold (computed once from the initial temperature)
acceptance_criteria <- 0.5 * (100 / Temperature)
# Get current time
now <- Sys.time()
# Use time in seconds as seed for random number generator
set.seed(as.integer(now, units = "secs"))
# draw three random rows as the initial cluster centers (this overrides the centers argument)
centers <- sample(1:nrow(data), 3)
while(Temperature > 0.01){
# Get current time
now <- Sys.time()
# Use time in seconds as seed for random number generator
set.seed(as.integer(now, units = "secs"))
random_element <- sample(centers, size = 1)
new_element <- sample(1:nrow(data),1)
if(new_element %in% centers){
next
}
iteration_centers <- replace(centers, centers == random_element, new_element)
#print(centers)
clusters <- data[iteration_centers,]
clustering <- k_means(data,k,clusters)
sse <- sse(data,clustering$`Clustering Vector`,clustering$`Cluster Means`)
#print(sse)
if(is.na(sum(sse)) || sum(sse) == 0){
next
}
if(sum(sse) < solution || acceptance_criteria < runif(1)){
solution <- sum(sse)
centers <- iteration_centers
}
Temperature <- Temperature * (1 - cooling_rate)
print(solution)
}
return(solution)
}
simulated_annealing_k_means(countries,3,centers)
## [1] 1243.193
## [1] 1010.648
## [1] 1007.564
The best objective value found so far is printed once per annealing iteration; only the distinct values are shown above. The solution improves from 1243.19 to 1010.65 and finally to 1007.56, where it remains until the temperature falls below 0.01.
Following data preprocessing and conversion, four clustering algorithms (k-means, PAM, CLARA, and hierarchical clustering) were tested using the Euclidean distance metric. After conducting the appropriate tests, the optimal number of clusters was determined to be three. Among the four algorithms, hierarchical clustering provided the best solution, with an average silhouette width of 0.35. Additionally, a heuristic approach was introduced to address the problem of local optima, which led to an improvement of approximately 20% over the classical k-means algorithm.
References
Selim, S. Z., & Alsultan, K. (1991). A simulated annealing algorithm for the clustering problem. Pattern Recognition. https://doi.org/10.1016/0031-3203(91)90097-O
https://rpubs.com/hasiegler/926806
ChatGPT