Introduction

Clustering analysis is a statistical technique used to group a set of objects based on their similarities. It is widely used in data analysis, machine learning, and pattern recognition to identify the underlying structure of a data set.

The goal of clustering analysis is to uncover the inherent structure of the data and to identify patterns, trends, and relationships in the data. Clustering is an unsupervised learning technique, meaning that it is used to find structure in data without using prior knowledge or labels. The resulting groups, or clusters, are typically based on similarities in the data set and can provide valuable insights into the relationships between objects.

There are various methods of clustering analysis, including K-Means, Hierarchical Clustering, and Density-Based Clustering. The K-Means method clusters data by partitioning the data set into a specified number of clusters, while Hierarchical Clustering builds a hierarchy of clusters by successively merging smaller clusters into larger ones. Density-Based Clustering focuses on dense regions of the data set and forms clusters around these regions.
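
As a minimal illustration of the first two approaches using only base R (density-based clustering would need an additional package such as dbscan, so it is omitted here), consider two obviously separated groups of points:

# Toy data: two well-separated groups in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
kmeans(toy, centers = 2)$cluster      # partitioning into a chosen number of clusters
cutree(hclust(dist(toy)), k = 2)      # hierarchical clustering cut into 2 groups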

Dataset

The “Forbes top 2000 companies” dataset, which is provided by Kaggle (https://www.kaggle.com/datasets/ash316/forbes-top-2000-companies), includes information about the largest and most successful companies in the world. The data include various important attributes such as market value, profit, revenue, assets, and number of employees. These attributes provide a comprehensive overview of a company’s financial performance and can be used to make informed decisions about the company. The dataset is in CSV format, so we can read it with the read.csv() function.

# Read the downloaded Forbes Top 2000 CSV file into a data frame
forbestop2000 <- read.csv("C:\\Users\\PC\\AppData\\Local\\Temp\\Rar$DIa16748.24528\\Forbes_2000_top_company_CLNQ11.csv")

Next, we load the libraries that provide the functions needed for the analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(corrplot)
## corrplot 0.92 loaded
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(DT)

Preparing the dataset for clustering analysis

datatable(forbestop2000, options = list(scrollX = TRUE))
str(forbestop2000)
## 'data.frame':    1999 obs. of  11 variables:
##  $ X2022.Ranking          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Organization.Name      : chr  "Berkshire Hathaway" "ICBC" "Saudi Arabian Oil Company (Saudi Aramco)" "JPMorgan Chase" ...
##  $ Industry               : chr  "Diversified Financials" "Banking" "Oil & Gas Operations" "Diversified Financials" ...
##  $ Country                : chr  "United States" "China" "Saudi Arabia" "United States" ...
##  $ Year.Founded           : int  1939 1984 1933 2000 2014 1994 1976 1979 1998 1937 ...
##  $ CEO                    : chr  "Warren Edward Buffett" "Shu Gu" "Amin bin Hasan Al-Nasser" "Jamie Dimon" ...
##  $ Revenue..Billions.     : num  276 208 400 125 202 ...
##  $ Profits..Billions.     : num  89.8 54 105.4 42.1 46.9 ...
##  $ Assets..Billions.      : num  959 5519 576 3955 4747 ...
##  $ Market.Value..Billions.: num  741 214 2292 374 181 ...
##  $ Total.Employees        : chr  "372000.0" "449296" "68493.0" "271025.0" ...

To perform clustering analysis on this data, we first need to ensure that all columns are in a suitable format. Specifically, the “Total.Employees” column must be converted to a numeric data type.

forbestop2000$Total.Employees = as.numeric(forbestop2000$Total.Employees)
## Warning: NAs introduced by coercion

Checking for missing values

any(is.na(forbestop2000))
## [1] TRUE
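
Before dropping the incomplete rows, it can be useful to count them; a quick optional check using base R’s complete.cases():

# Count rows containing at least one missing value
sum(!complete.cases(forbestop2000))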
forbestop2000 <- na.omit(forbestop2000)

Removing labels (keeping only the numeric variables)

# Keep Year.Founded and the financial variables (columns 5 and 7 to 11); drop the label columns
mydata <- select(forbestop2000, c(5, 7:11))

Before applying clustering methods, we should scale and normalize the data. This is a crucial step in clustering analysis, as it ensures the data are in a suitable format and reduces the impact of variables measured on different scales. By scaling the data and using Euclidean distance, the clustering results become more accurate and reliable.

# Standardize each variable (mean 0, standard deviation 1)
mydata_scale <- scale(mydata)

# Euclidean distance matrix between the scaled observations
mydata <- dist(mydata_scale)
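
As a quick optional sanity check, scale() standardizes each column to a z-score, i.e. (x - mean(x)) / sd(x); we can verify this on the Revenue column (the column name comes from the str() output above):

# The scaled Revenue column should match manually computed z-scores
rev_col <- forbestop2000$Revenue..Billions.
all.equal(as.numeric(mydata_scale[, "Revenue..Billions."]), (rev_col - mean(rev_col)) / sd(rev_col))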

Correlation and Assessing clustering tendency

Correlation matrix

corrplot(cor(mydata_scale, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="forbes2000", mar=c(0,0,1,0))

In our example, all pairs of variables have positive correlations, which means that as one variable increases, the other variable tends to increase as well. Correlations between “Year.Founded” and all other variables are very weak, with coefficients ranging from only 0.05 to 0.07. On the other hand, the correlation between “Profits..Billions.” and “Revenue..Billions.” is very strong, with a coefficient of 0.8.

Assessing clustering tendency:

It refers to evaluating whether the data contain structure that would make them suitable for clustering analysis. Assessing clustering tendency is important because it helps to determine whether clustering is an appropriate method for the data and whether the results of the analysis will be meaningful. There are several methods for assessing clustering tendency; the Hopkins statistic is one of them. The Hopkins statistic ranges from 0 to 1, with values close to 1 indicating a strong clustering tendency and values close to 0 indicating a weak one.

get_clust_tendency(mydata_scale, 2, graph=TRUE, gradient=list(low="green",  high="white"), seed=1234)
## $hopkins_stat
## [1] 0.9743571
## 
## $plot

A Hopkins statistic of 0.97 for the given data is a very positive result. It is considered to be very high and is a strong indication that the data has a clear structure that would make it suitable for clustering.
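
Note that the second argument above (2) is the number of points sampled when computing the Hopkins statistic. A larger sample gives a more stable estimate; a small optional re-check (n = 100 is an arbitrary choice, not part of the original analysis):

# Hopkins statistic with a larger sample of points
get_clust_tendency(mydata_scale, n = 100, graph = FALSE, seed = 1234)$hopkins_stat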

Optimal number of clusters

Determining the optimal number of clusters is an important step in clustering analysis. It refers to the number of clusters that best represents the underlying structure of the data. The optimal number of clusters is a trade-off between having too few clusters, which may result in loss of information, and having too many clusters, which may result in over-segmentation of the data. There are several methods for determining the optimal number of clusters:

Elbow method

The elbow method involves plotting the sum of squared distances between data points and their cluster centroids (known as the within-cluster sum of squares) against the number of clusters. The optimal number of clusters is identified as the number of clusters where the within-cluster sum of squares starts to decrease at a slower rate.

w1 <- fviz_nbclust(mydata_scale, FUNcluster = kmeans, method = "wss") + ggtitle("K-means")
w2 <- fviz_nbclust(mydata_scale, FUNcluster = cluster::pam, method = "wss") + ggtitle("PAM")
w3 <- fviz_nbclust(mydata_scale, FUNcluster = cluster::clara, method = "wss") + ggtitle("Clara")

grid.arrange(w1, w2, w3, ncol=2, top = "Optimal number of clusters (wss)")

The plots show how many clusters are optimal for each clustering method. According to the elbow method, the optimal number appears to be 2 or 3. To be more precise, we can use another method.

Silhouette method

The silhouette method involves computing a silhouette score for each data point, which measures how well the data point is assigned to its own cluster compared to other clusters. The optimal number of clusters is identified as the number of clusters where the average silhouette score is highest.

s1 <- fviz_nbclust(mydata_scale, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("K-means")
s2 <- fviz_nbclust(mydata_scale, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("PAM")
s3 <- fviz_nbclust(mydata_scale, FUNcluster = cluster::clara, method = "silhouette") + 
  ggtitle("Clara")

grid.arrange(s1, s2, s3, ncol=2, top= "Optimal number of clusters (silhouette)")

The silhouette method shows that the optimal number of clusters is 2 for each clustering method.

Clustering

The results of the two methods agree, so we can cluster the data into two groups using several clustering algorithms: k-means, PAM and CLARA.

K-means

It groups similar data points together into k clusters, where k is the number of clusters defined by the user. The algorithm iteratively updates the centroid of each cluster and reassigns data points to the nearest centroid until convergence. The goal is to minimize the sum of squared distances between data points and their corresponding cluster centroids. k-means is widely used for tasks such as market segmentation, document classification, and image compression.
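
To make the objective concrete, base R’s kmeans() reports the quantity being minimized; a small optional check on our scaled data (nstart = 25 restarts the algorithm from several random initializations and keeps the best solution):

# Total within-cluster sum of squares and cluster sizes for k = 2
km_base <- kmeans(mydata_scale, centers = 2, nstart = 25)
km_base$tot.withinss
km_base$size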

km2 <- eclust(mydata_scale, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

km2c <- fviz_cluster(km2, data=mydata_scale, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
km2s <- fviz_silhouette(km2)
##   cluster size ave.sil.width
## 1       1 1882          0.84
## 2       2   59         -0.09
grid.arrange(km2c, km2s, ncol=2)

The average silhouette width for each cluster shows that the results of the k-means clustering algorithm cannot be considered successful overall.

A high silhouette coefficient means that a data point is well matched to its own cluster and poorly matched to other clusters, indicating a good clustering solution. A low silhouette coefficient means that the data point is poorly matched to its own cluster and may be better matched to a different cluster. Here the average width for the second cluster is below 0, while it is 0.84 for the first one, so we can conclude that the k-means method is not well suited to this specific dataset.
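
The same figures can be recomputed directly with cluster::silhouette(); a brief optional sketch that reuses the Euclidean distance matrix stored in mydata earlier:

# Average silhouette width per cluster for the k-means solution
sil_km <- silhouette(km2$cluster, mydata)
summary(sil_km)$clus.avg.widths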

PAM

PAM (Partitioning Around Medoids) is a clustering algorithm used to partition a dataset into multiple groups. Unlike k-means, which uses the mean value of each cluster as its center point, PAM uses a medoid, which is the most centrally located data point in a cluster. One advantage of PAM over k-means is that it is more robust to outliers, as it uses the medoid instead of the mean as the center of a cluster. In addition, PAM can work with arbitrary dissimilarity measures, which makes it more flexible than k-means when the data are not purely numeric.

pam2 <- eclust(mydata_scale, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

p2c <- fviz_cluster(pam2, data=mydata_scale, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
p2s <- fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1 1847          0.68
## 2       2   94          0.88
grid.arrange(p2c, p2s, ncol=2)

The average silhouette width for each cluster shows that the results of the PAM clustering algorithm can be considered successful. The coefficients are close to 1 for each cluster (0.68 and 0.88).
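
Since PAM represents each cluster by an actual observation (its medoid), we can also look up which companies play that role; a small optional sketch (pam objects store the medoid row indices in id.med, and row order still matches forbestop2000 because we only removed incomplete rows):

# The companies acting as medoids (central representatives) of the two clusters
forbestop2000[pam2$id.med, c("Organization.Name", "Industry", "Country")]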

CLARA

The results of the PAM method are quite good. Finally, let’s check the CLARA method to see whether it gives a different result. CLARA (Clustering Large Applications) is an efficient and effective algorithm for clustering large datasets: it improves on PAM by repeatedly applying it to samples of the data and keeping the best set of medoids, which makes it practical for large datasets.

c2 <- eclust(mydata_scale, k=2, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)

c2c <- fviz_cluster(c2, data=mydata_scale, ellipse.type="convex", geom=c("point")) + ggtitle("Clara with 2 clusters")
c2s <- fviz_silhouette(c2)
##   cluster size ave.sil.width
## 1       1 1847          0.68
## 2       2   94          0.88
grid.arrange(c2c, c2s, ncol=2)

Since CLARA is essentially PAM applied to samples of the data, it produced the same clustering as the PAM method for this dataset.
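
Because CLARA works on samples, its result can depend on the sampling settings. A minimal optional sketch of calling cluster::clara() directly, where samples and sampsize are its own arguments (the values 50 and 200 are arbitrary choices for illustration):

# Run CLARA with more and larger samples than the defaults
clara_direct <- clara(mydata_scale, k = 2, samples = 50, sampsize = 200, metric = "euclidean")
clara_direct$clusinfo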

Distribution of data points within clusters

In the following tables, we can see the distribution of data points in the clusters (the results of the PAM and CLARA methods) by country and by industry.

mytable <- table(forbestop2000$Country, c2$cluster)
datatable(mytable, options = list(scrollX = TRUE))
mytable1 <-table(forbestop2000$Industry, c2$cluster) 
datatable(mytable1, options = list(scrollX = TRUE))
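
One optional way to read these tables is to look at the share of each industry’s companies that fall into the second (small) cluster; the code below is only a convenience, not part of the original analysis:

# Proportion of companies in cluster 2, by industry, largest first
round(sort(prop.table(mytable1, margin = 1)[, 2], decreasing = TRUE), 2)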

Conclusion

In conclusion, the clustering analysis of the “Forbes top 2000 companies” dataset aimed to group similar companies based on certain characteristics. The study utilized three different clustering algorithms: k-means, PAM and CLARA.

The k-means algorithm, which is a widely used method for clustering, failed to produce meaningful results in this case. This may have been due to the nature of the data and the specific parameters used in the algorithm, which did not effectively capture the underlying structure of the data.

On the other hand, both the PAM and CLARA algorithms produced the same results and successfully grouped similar companies. These methods are based on a different approach to clustering, which may have better accommodated the characteristics of the data.

Overall, the results of this analysis provide insight into the clustering of the Forbes top 2000 companies and suggest that the PAM and CLARA methods are effective for this specific dataset.