Introduction

In this analysis, I delve into a dermatology dataset to uncover potential patterns through clustering techniques. The dataset includes various clinical and histopathological features, and my goal is to identify groups that may correspond to distinct types of skin diseases.

Hypothesis

My hypothesis is that clustering analysis will reveal subgroups of patients with similar clinical and histopathological characteristics. These subgroups may align with different types of skin diseases, offering insights not easily discerned through traditional diagnostic methods.

Dermatology Dataset Overview

The dermatology dataset is designed for the differential diagnosis of “erythemato-squamous” skin diseases. “Erythematous” refers to redness of the skin, and “squamous” refers to the scale-like appearance of the lesions, so “erythemato-squamous” diseases are a group of skin conditions characterized by redness and scaling. Because these diseases share many clinical features, their differential diagnosis is a challenging problem in dermatology. The dataset provides a comprehensive set of features, combining both clinical and histopathological information, to aid in the accurate diagnosis of these conditions.

Dataset Details:
- Number of Instances: 366
- Number of Attributes: 34
- Missing Attribute Values: 8 (in the ‘Age’ attribute), marked with ‘?’.

Feature Information:

Clinical Attributes:
- Erythema
- Scaling
- Definite borders
- Itching
- Koebner phenomenon
- Polygonal papules
- Follicular papules
- Oral mucosal involvement
- Knee and elbow involvement
- Scalp involvement
- Family history (0 or 1)
- Age (linear)

Histopathological Attributes:
- Melanin incontinence
- Eosinophils in the infiltrate
- PNL infiltrate
- Fibrosis of the papillary dermis
- Exocytosis
- Acanthosis
- Hyperkeratosis
- Parakeratosis
- Clubbing of the rete ridges
- Elongation of the rete ridges
- Thinning of the suprapapillary epidermis
- Spongiform pustule
- Munro microabscess
- Focal hypergranulosis
- Disappearance of the granular layer
- Vacuolization and damage of basal layer
- Spongiosis
- Saw-tooth appearance of retes
- Follicular horn plug
- Perifollicular parakeratosis
- Inflammatory mononuclear infiltrate
- Band-like infiltrate

Data Processing:
Clinical and histopathological features are graded on a scale from 0 to 3, where 0 indicates the feature is absent, 3 indicates the largest amount possible, and 1 or 2 indicate intermediate values.
The ‘Family history’ feature is binary (0 or 1).
The ‘Age’ feature is linear and records the patient’s age.


Data Loading and Preprocessing
We begin by loading the necessary libraries: tidyverse, cluster, ggplot2, and plot3D (used later for the 3D visualization). The dataset is read from a CSV file, and preprocessing includes handling missing values in the ‘age’ attribute by replacing ‘?’ with NA and imputing the missing values with the column mean. The features are then standardized using the scale() function to make them comparable.

# Install 'tidyverse' package
install.packages("tidyverse")

# Install 'cluster' package
install.packages("cluster")

# Install 'plot3D' package
install.packages("plot3D")

# Install 'ggplot2' package
install.packages("ggplot2")

library(tidyverse)
library(cluster)
library(ggplot2)
library(plot3D)

Read the dataset

data <- read.csv("dermatology_database_1.csv")

Handle missing values in ‘age’ attribute

data$age <- as.numeric(ifelse(data$age == "?", NA, data$age))
data$age <- ifelse(is.na(data$age), mean(data$age, na.rm = TRUE), data$age)

I convert the missing values in the ‘age’ attribute (represented as “?”) to NA using the ifelse() function, then replace the NA values with the mean age, calculated using mean(data$age, na.rm = TRUE). This ensures that the ‘age’ attribute is handled appropriately and that the data are ready for further analysis, including clustering.

Standardize/normalize features

data_scaled <- scale(data[, 1:33])

I use the scale() function to standardize the features. The data[, 1:33] part restricts the standardization to the 33 graded clinical and histopathological features: the class column, which is the target variable, is excluded, and so is the age column, so the clustering is driven purely by the graded features.

Here’s why I exclude the target variable when standardizing features:

Target Variable: The class label is what we ultimately hope the clusters will correspond to, not an input to the clustering, so it should not be standardized or allowed to influence the distance computations between data points.

Standardization Purpose: Standardization is typically applied to ensure that all features contribute equally to the analysis, especially when they are on different scales. Including the target variable in the standardization process may introduce unintended effects.

Therefore, by specifying 1:33, I standardize only the input features used in the clustering analysis. This keeps the scaling focused on the features themselves and ensures that they all contribute equally to the analysis, even when they would otherwise sit on different scales.
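
A quick way to see the effect of scale() is to confirm that each standardized column now has a mean of approximately 0 and a standard deviation of 1:

# Each standardized feature column should have mean ~0 and standard deviation ~1
round(colMeans(data_scaled), 3)
round(apply(data_scaled, 2, sd), 3)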

Determining Optimal Number of Clusters (k)

To identify the optimal number of clusters (k), we employ the elbow method, a widely used technique in clustering analysis. This method helps us find the point at which the addition of more clusters ceases to provide significant improvement in within-cluster cohesion.

The process involves running the k-means clustering algorithm for different values of k (ranging from 1 to 10 in this case) and calculating the within-cluster sum of squares (WSS) for each k. The WSS represents the sum of squared distances between each data point and its assigned cluster center.

Elbow method
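
A minimal sketch of this step, assuming the standardized matrix data_scaled from the preprocessing above and using base R’s kmeans() and plot():

# Run k-means for k = 1 to 10 and record the total within-cluster sum of squares
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(data_scaled, centers = k)$tot.withinss)

# Plot WSS against k; the 'elbow' marks the point of diminishing returns
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")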

In the resulting plot, the ‘elbow’ represents a point where the reduction in WSS begins to diminish, indicating that additional clusters do not significantly improve the model’s fit.

K-Means Clustering

Based on the insight gained from the elbow method, which suggests that k = 3 is an optimal choice, we proceed with the K-Means clustering algorithm. K-Means is a partitioning method that assigns each data point to one of k clusters. The goal is to minimize the within-cluster sum of squares, making data points within each cluster as similar as possible.

Choose K = 3 based on the elbow method

k <- 3

Perform k-means clustering

set.seed(123)
kmeans_result <- kmeans(data_scaled, centers = k)

I set the seed for reproducibility using set.seed(123). The kmeans() function is then employed to perform K-Means clustering on the standardized data, specifying k = 3 as the number of clusters.

Assign cluster labels to the original dataset

data$cluster_label <- as.factor(kmeans_result$cluster)
A preview of the first rows of the augmented dataset (the 33 graded features, followed by age, class, and the new cluster_label column) confirms that each patient now carries a cluster assignment; for example, the first record (age 55, class 2) is placed in cluster 2, while the second (age 8, class 1) falls into cluster 3.

Cluster labels are assigned to the original dataset under the variable cluster_label. The resulting clusters group the data points according to their similarity in the feature space.

Now we know which patients are in which group! That’s what we wanted :D, so we can proceed to analyze and visualize these clusters.

Cluster Analysis and Visualization

After performing K-Means clustering and assigning cluster labels to the dataset, I delve into understanding the distribution of data points across these clusters.

Explore the distribution of clusters

cluster_distribution <- table(data$cluster_label)
print(cluster_distribution)
Cluster  Count
1         72
2        183
3        111

Here, I employ the table() function to examine the distribution of data points among the identified clusters. The output provides a count of instances within each cluster, offering a preliminary understanding of the size and balance of the clusters.

Now, let’s visualize these clusters in a 2D plot using ggplot2:

Visualize clusters in 2D using ggplot2

I create a new dataframe (cluster_data) that includes the first two principal components (PC1 and PC2) and the assigned cluster labels. A scatter plot is generated using ‘ggplot2’, where each point is colored based on its cluster membership. The resulting 2D visualization provides an intuitive representation of the identified clusters in the feature space.
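
A minimal sketch of this step, assuming the principal components are obtained with prcomp() applied to data_scaled:

# Principal component analysis on the standardized features
pca_result <- prcomp(data_scaled)

# Data frame with the first two principal components and the cluster labels
cluster_data <- data.frame(PC1 = pca_result$x[, 1],
                           PC2 = pca_result$x[, 2],
                           cluster_label = data$cluster_label)

# 2D scatter plot colored by cluster membership
ggplot(cluster_data, aes(x = PC1, y = PC2, color = cluster_label)) +
  geom_point(alpha = 0.7) +
  labs(title = "K-Means clusters (PCA projection)",
       x = "Principal Component 1",
       y = "Principal Component 2")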

I also decided to make a 3D visualization because it provides a more intuitive representation. It is particularly valuable when exploring multivariate datasets and principal components: while 2D plots offer simplicity, 3D visualization adds depth, aiding in a more comprehensive interpretation of the clustering patterns.
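
One possible sketch of this 3D view uses the plot3D package loaded earlier, reuses pca_result from the 2D step, and adds the third principal component; the color palette and viewing angles below are arbitrary choices:

# 3D scatter of the first three principal components, colored by cluster
scatter3D(x = pca_result$x[, 1],
          y = pca_result$x[, 2],
          z = pca_result$x[, 3],
          colvar = as.integer(data$cluster_label),
          col = c("#E41A1C", "#377EB8", "#4DAF4A"),  # one color per cluster
          pch = 19, theta = 30, phi = 20,
          xlab = "PC1", ylab = "PC2", zlab = "PC3",
          main = "K-Means clusters in 3D (PCA projection)")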

Conclusion

In conclusion, the clustering analysis has successfully revealed three distinct clusters within the dermatology dataset. These clusters suggest the existence of potential subgroups of patients with similar clinical and histopathological features. The identification of such clusters can provide valuable insights for dermatologists, aiding in the understanding and categorization of skin diseases that share common characteristics.

The elbow method, employed to determine the optimal number of clusters, guided our choice of k = 3. This decision was supported by the observed ‘elbow’ in the plot.

Limitations and Considerations:

The interpretation is based on observed patterns and associations in the data, and causation cannot be directly inferred. Clustering might be influenced by shared features that may not necessarily represent direct causal relationships. Collaborative efforts with dermatologists are essential to validate the observed clusters and understand the clinical context. This explanation offers insights into why each cluster may have formed based on shared characteristics, emphasizing common pathogenic mechanisms, anatomical predilections, and potential genetic influences. Adjustments can be made based on specific insights gained from the data and any additional context or domain knowledge available.