This section explains the conceptual elements of statistical clustering. While it introduces the basic concepts, it primarily focuses on the clustering methods recommended for the SPP-RAF. For those interested in a deeper exploration, Zelterman (2015) provides an excellent resource, particularly Chapters 11 and 10 (in that order).
The problem of clustering is relatively easy to state: given a set of individuals with different traits, the goal is to group them into clusters where the characteristics within each group are similar, and the characteristics between groups are distinct. However, this seemingly simple definition raises two key challenges:
Each of these questions presents multiple alternatives and methodological challenges. As with previous topics, there is no single correct answer. Indeed, the “Ugly Duckling” theorem (Watanabe, 1969) demonstrates that clustering always involves a degree of subjectivity. Consequently, analysts must rely on their domain knowledge to interpret data effectively within the given context.
The first step in clustering is defining a measure of distance between individuals. Several distance metrics can be used, each with different properties and implications. Among the most common is the Euclidean metric, which measures the straight-line distance between two points \(p\) and \(q\), computed using the Pythagorean Theorem.
Ahmed is 40 years old with 20 years of work experience, whereas Mohammed is 37.5 years old with 15 years of experience. The Euclidean distance between them is: \[ \sqrt{(40 - 37.5)^2 + (20 - 15)^2} = 5.59 \]
Other distance metrics include: - Manhattan distance, which is more robust to outliers because it does not square differences. - Jaccard index, which measures similarity between categorical data. The Jaccard distance is derived by subtracting the index value from 1.
One crucial aspect when working with distances is the scale of the observations. Consider the following individuals: - Ahmed: 40 years old, 20 years of work experience - Mona: 40 years old, 15 years of work experience - Maryam: 37.5 years old, 20 years of work experience - Mohammed: 37.5 years old, 15 years of work experience
Using the Euclidean metric, the distance between Maryam and Mona is also 5.59, the same as between Ahmed and Mohammed. However, if we measure age in days instead of years, the distances become 16,303 for the men and 15,572 for the women. This shift alters the ratio between the two distances (previously 1, now 1.05), affecting the clustering outcome. Thus, proper standardization or transformation of variables is necessary to maintain consistency.
Another issue arises from the relationship between age and work experience. As shown in Figure 2, these variables are highly correlated. While geometrically the four individuals appear equidistant, conceptually they are not. Ahmed and Mohammed follow a common pattern, whereas Maryam and Mona deviate from it. Proper adjustments using covariance transformations can address this issue.
A more sophisticated alternative to Euclidean distance is the Mahalanobis distance (Mahalanobis, 1936), which accounts for correlations among variables. This metric is invariant to scale and incorporates covariance structures, making it particularly useful for clustering multivariate data. The later sections explain how to implement Mahalanobis distance in R.
Additionally, when assessing distances between multivariate datasets, it is useful to consider not just the direct distance between two points but also their distance relative to a distribution. The Mahalanobis metric achieves this by incorporating the inverse covariance matrix, making it effective for imbalanced datasets and one-class classification.
The clustering method proposed here is hierarchical clustering, using a bottom-up approach. The code implements hierarchical clustering with Ward linkage, which minimizes variance within clusters, leading to balanced groupings.
Other linkage methods include: - Single linkage (nearest neighbor linkage): Useful for identifying irregularly shaped clusters. - Complete linkage (farthest neighbor linkage): Produces more compact clusters.
These methods are not implemented in the provided code but could be useful depending on the dataset.
There is no universal guideline for selecting a clustering method; it is recommended to test different linkages to determine which best reflects the data. However, Ward linkage is often preferred when the goal is to obtain balanced clusters, as other methods may produce highly unbalanced structures.
Once a decision tree is generated, the next challenge at the policy level is determining the number of clusters required for profiling. For instance, if policymakers aim to create three distinct targeting strategies, the corresponding dendrogram (Figure 3) can help determine the appropriate cut-off.
While clustering helps group individuals, it does not explain why individuals are classified in a particular way. To address this, two techniques are used: 1. A descriptive approach (explained via examples). 2. A technical approach using classification and regression trees (CART) (Breiman, 1984), a fundamental technique in machine learning.
A target variable has four classes represented by different colors: - Yellow (4 observations) - Orange (5 observations) - Green (8 observations) - Blue (10 observations)
The largest group is blue, so if no further information is available, the default classification is blue. However, by incorporating additional variables—such as work experience—the algorithm refines the classification. If an individual has over 25 years of experience, they are assigned to the orange category. Further refinements occur with additional splits.
While it is possible to continue dividing the tree until each node contains only one class, this leads to overfitting, where the model captures noise rather than meaningful patterns. Overfitting can result in misclassifications caused by minor data inconsistencies (e.g., typographical errors). Therefore, the balance between accuracy and generalization must be considered. For avid readers, the book of Gareth et al. (2017) provides an in-depth analysis of this issue. However for the readers interested in the concept but not in the details, the following link provides practical explanations about the decision tree depth
Thus, the code will be there, but it will not be explained in the manual but the codes are commented XXXXX.
This section introduces the R code used for clustering, identified as
Ch5_Functions_Commented030325. In previous methods, the
number of clusters was determined automatically, often resulting in
groups with minimal observations. In contrast, the new approach
specifies both the number of cuts and the final number of clusters
(lines 78-79) to ensure meaningful segmentation.
The provided R code is a custom implementation for clustering and
decision tree generation. Unlike traditional packages, this approach: -
Handles both numeric and categorical data. - Includes a custom metric
(findingX) to optimize cluster splits. - Ensures a minimum
cluster size (basketLimit). - Uses squared Euclidean
distance for robustness. - Incorporates decision trees
(rpart) for classification.
Three key functions are used: 1.
findingX: Identifies optimal cut points
for clustering. 2. treeGraph: Generates a
network visualization of clusters. 3.
clusteringFunction: Executes clustering,
refines clusters, and builds a decision tree.
This custom framework enables clustering across 46 variables for 150 observations, specifically for Governorate A. The distance matrix applies to PMT variables, but clusters are profiled based on additional attributes (e.g., gender, region).
Despite its advantages, this method is computationally intensive.
Users seeking a simpler approach may consider standard algorithms such
as kmeans or hclust. However, this tailored
approach ensures policy-relevant constraints and maintains cluster
interpretability.
For the purposes of the code, we assume that the government of Nemey is interested in 5 groups or cluster, meaning that Nemey would like to categorize at least 5 type of beneficiaries. At the beginning of the code there are some commands to prepare the datasets according to PMT and scoring estimated previously.
basketLimit=max(15,ceiling(dim(BeneficiariesData)[1]*0.025))
PolicyGroups=5
source()
#Clustering Function'results are saved in result List
resultList=clusteringFunction(BeneficiariesSelectionData, BeneficiariesDistanceData, basketLimit, PolicyGroups)
After the custom version of the clustering is executed, it is suggested to use decision trees which serve as both a refinement tool and an explanation mechanism for the clustering results.
In the code the first stage builds a constrained-depth decision tree
(max depth = 4) to establish broad cluster definitions, while the second
stage creates a more granular tree (minimum 2 observations per node) to
capture finer patterns. By using extremely small complexity parameters
(cp ≈ 0), the algorithm prioritizes capturing the true cluster structure
over preventing overfitting. The final output includes both a visual
decision tree (via rpart.plot) that explains cluster
membership rules in terms of the original variables, and a frequency
table showing the distribution of refined clusters. This approach is
particularly valuable when you need uneerstand beneficiaries
characteristics.
This code generates an Excel report with visualizations and tables summarizing the results of a clustering analysis. It first creates a workbook and then adds three main sections: (1) a dendrogram showing how data points are grouped into clusters, (2) a decision tree explaining the clustering rules, and (3) descriptive graphs comparing cluster characteristics—bar charts for categorical variables and average values for numerical variables. Each section includes formatted plots (saved as PNGs) alongside supporting data tables, with text dynamically adjusted for either English or Arabic based on a language flag. For further information about this type of code and use of dictionary please refer to XXX CHAPTER 4…
The code automates the entire reporting process—from creating plots
with consistent styling (fonts, colors, labels) to organizing them in
Excel with proper spacing. It handles scaling, positioning, and
language-specific formatting, then saves two versions of the final
report: one in English (CH 1 EN.xlsx) and one in Arabic
(CH 1 AR.xlsx). The output is a polished, ready-to-share
analysis of how clusters were formed and what distinguishes them.
Decision cluster
Descriptive statistics