Clustering

This section explains the conceptual elements of statistical clustering. While it introduces the basic concepts, it primarily focuses on the clustering methods recommended for the SPP-RAF. For those interested in a deeper exploration, Zelterman (2015) provides an excellent resource, particularly Chapters 11 and 10 (in that order).

The problem of clustering is relatively easy to state: given a set of individuals with different traits, the goal is to group them into clusters where the characteristics within each group are similar, and the characteristics between groups are distinct. However, this seemingly simple definition raises two key challenges:

  1. What criteria should be used to define similarity or dissimilarity between two individuals (Step 1)?
  2. How can we determine when two individuals are too different to belong to the same group (Step 2)?

Each of these questions presents multiple alternatives and methodological challenges. As with previous topics, there is no single correct answer. Indeed, the “Ugly Duckling” theorem (Watanabe, 1969) demonstrates that clustering always involves a degree of subjectivity. Consequently, analysts must rely on their domain knowledge to interpret data effectively within the given context.

Step 1: Distances

The first step in clustering is defining a measure of distance between individuals. Several distance metrics can be used, each with different properties and implications. Among the most common is the Euclidean metric, which measures the straight-line distance between two points \(p\) and \(q\), computed using the Pythagorean Theorem.

Example

Ahmed is 40 years old with 20 years of work experience, whereas Mohammed is 37.5 years old with 15 years of experience. The Euclidean distance between them is: \[ \sqrt{(40 - 37.5)^2 + (20 - 15)^2} = 5.59 \]

Other distance metrics include: - Manhattan distance, which is more robust to outliers because it does not square differences. - Jaccard index, which measures similarity between categorical data. The Jaccard distance is derived by subtracting the index value from 1.

Scale Considerations

One crucial aspect when working with distances is the scale of the observations. Consider the following individuals: - Ahmed: 40 years old, 20 years of work experience - Mona: 40 years old, 15 years of work experience - Maryam: 37.5 years old, 20 years of work experience - Mohammed: 37.5 years old, 15 years of work experience

Using the Euclidean metric, the distance between Maryam and Mona is also 5.59, the same as between Ahmed and Mohammed. However, if we measure age in days instead of years, the distances become 16,303 for the men and 15,572 for the women. This shift alters the ratio between the two distances (previously 1, now 1.05), affecting the clustering outcome. Thus, proper standardization or transformation of variables is necessary to maintain consistency.

Another issue arises from the relationship between age and work experience. As shown in Figure 2, these variables are highly correlated. While geometrically the four individuals appear equidistant, conceptually they are not. Ahmed and Mohammed follow a common pattern, whereas Maryam and Mona deviate from it. Proper adjustments using covariance transformations can address this issue.

Mahalanobis Distance

A more sophisticated alternative to Euclidean distance is the Mahalanobis distance (Mahalanobis, 1936), which accounts for correlations among variables. This metric is invariant to scale and incorporates covariance structures, making it particularly useful for clustering multivariate data. The later sections explain how to implement Mahalanobis distance in R.

Additionally, when assessing distances between multivariate datasets, it is useful to consider not just the direct distance between two points but also their distance relative to a distribution. The Mahalanobis metric achieves this by incorporating the inverse covariance matrix, making it effective for imbalanced datasets and one-class classification.

Step 2: Clustering

The clustering method proposed here is hierarchical clustering, using a bottom-up approach. The code implements hierarchical clustering with Ward linkage, which minimizes variance within clusters, leading to balanced groupings.

Other linkage methods include: - Single linkage (nearest neighbor linkage): Useful for identifying irregularly shaped clusters. - Complete linkage (farthest neighbor linkage): Produces more compact clusters.

These methods are not implemented in the provided code but could be useful depending on the dataset.

There is no universal guideline for selecting a clustering method; it is recommended to test different linkages to determine which best reflects the data. However, Ward linkage is often preferred when the goal is to obtain balanced clusters, as other methods may produce highly unbalanced structures.

Once a decision tree is generated, the next challenge at the policy level is determining the number of clusters required for profiling. For instance, if policymakers aim to create three distinct targeting strategies, the corresponding dendrogram (Figure 3) can help determine the appropriate cut-off.

Step 3: Classification

While clustering helps group individuals, it does not explain why individuals are classified in a particular way. To address this, two techniques are used: 1. A descriptive approach (explained via examples). 2. A technical approach using classification and regression trees (CART) (Breiman, 1984), a fundamental technique in machine learning.

Example

A target variable has four classes represented by different colors: - Yellow (4 observations) - Orange (5 observations) - Green (8 observations) - Blue (10 observations)

The largest group is blue, so if no further information is available, the default classification is blue. However, by incorporating additional variables—such as work experience—the algorithm refines the classification. If an individual has over 25 years of experience, they are assigned to the orange category. Further refinements occur with additional splits.

While it is possible to continue dividing the tree until each node contains only one class, this leads to overfitting, where the model captures noise rather than meaningful patterns. Overfitting can result in misclassifications caused by minor data inconsistencies (e.g., typographical errors). Therefore, the balance between accuracy and generalization must be considered. For avid readers, the book of Gareth et al. (2017) provides an in-depth analysis of this issue. However for the readers interested in the concept but not in the details, the following link provides practical explanations about the decision tree depth

Thus, the code will be there, but it will not be explained in the manual but the codes are commented  XXXXX.

Custom implementation for clustering (optional)

This section introduces the R code used for clustering, identified as Ch5_Functions_Commented030325. In previous methods, the number of clusters was determined automatically, often resulting in groups with minimal observations. In contrast, the new approach specifies both the number of cuts and the final number of clusters (lines 78-79) to ensure meaningful segmentation.

The provided R code is a custom implementation for clustering and decision tree generation. Unlike traditional packages, this approach: - Handles both numeric and categorical data. - Includes a custom metric (findingX) to optimize cluster splits. - Ensures a minimum cluster size (basketLimit). - Uses squared Euclidean distance for robustness. - Incorporates decision trees (rpart) for classification.

Three key functions are used: 1. findingX: Identifies optimal cut points for clustering. 2. treeGraph: Generates a network visualization of clusters. 3. clusteringFunction: Executes clustering, refines clusters, and builds a decision tree.

This custom framework enables clustering across 46 variables for 150 observations, specifically for Governorate A. The distance matrix applies to PMT variables, but clusters are profiled based on additional attributes (e.g., gender, region).

Despite its advantages, this method is computationally intensive. Users seeking a simpler approach may consider standard algorithms such as kmeans or hclust. However, this tailored approach ensures policy-relevant constraints and maintains cluster interpretability.

For the purposes of the code, we assume that the government of Nemey is interested in 5 groups or cluster, meaning that Nemey would like to categorize at least 5 type of beneficiaries. At the beginning of the code there are some commands to prepare the datasets according to PMT and scoring estimated previously.


basketLimit=max(15,ceiling(dim(BeneficiariesData)[1]*0.025))
PolicyGroups=5
source()
#Clustering Function'results are saved in  result List
resultList=clusteringFunction(BeneficiariesSelectionData, BeneficiariesDistanceData, basketLimit, PolicyGroups)

After the custom version of the clustering is executed, it is suggested to use decision trees which serve as both a refinement tool and an explanation mechanism for the clustering results.

In the code the first stage builds a constrained-depth decision tree (max depth = 4) to establish broad cluster definitions, while the second stage creates a more granular tree (minimum 2 observations per node) to capture finer patterns. By using extremely small complexity parameters (cp ≈ 0), the algorithm prioritizes capturing the true cluster structure over preventing overfitting. The final output includes both a visual decision tree (via rpart.plot) that explains cluster membership rules in terms of the original variables, and a frequency table showing the distribution of refined clusters. This approach is particularly valuable when you need uneerstand beneficiaries characteristics.

4. Descriptive statistics to assess the clusters

This code generates an Excel report with visualizations and tables summarizing the results of a clustering analysis. It first creates a workbook and then adds three main sections: (1) a dendrogram showing how data points are grouped into clusters, (2) a decision tree explaining the clustering rules, and (3) descriptive graphs comparing cluster characteristics—bar charts for categorical variables and average values for numerical variables. Each section includes formatted plots (saved as PNGs) alongside supporting data tables, with text dynamically adjusted for either English or Arabic based on a language flag. For further information about this type of code and use of dictionary please refer to XXX CHAPTER 4…

The code automates the entire reporting process—from creating plots with consistent styling (fonts, colors, labels) to organizing them in Excel with proper spacing. It handles scaling, positioning, and language-specific formatting, then saves two versions of the final report: one in English (CH 1 EN.xlsx) and one in Arabic (CH 1 AR.xlsx). The output is a polished, ready-to-share analysis of how clusters were formed and what distinguishes them.

Decision cluster

Descriptive statistics