In this analysis, I delve into a dermatology dataset to uncover potential patterns through clustering techniques. The dataset includes various clinical and histopathological features, and my goal is to identify groups that may correspond to distinct types of skin diseases.
My hypothesis is that clustering analysis will reveal subgroups of patients with similar clinical and histopathological characteristics. These subgroups may align with different types of skin diseases, offering insights not easily discerned through traditional diagnostic methods.
The dermatology dataset is designed for the differential diagnosis of “erythemato-squamous” skin diseases. “Erythematous” refers to redness of the skin, and “squamous” refers to the scale-like appearance of the skin lesions, so “erythemato-squamous” diseases are a group of skin conditions characterized by redness and scaling. Because these diseases share many clinical features, telling them apart is a challenging problem in dermatology. The dataset combines clinical and histopathological features to support an accurate diagnosis.
Dataset Details:
- Number of Instances: 366
- Number of Attributes: 34
- Missing Attribute Values: 8 (in the ‘Age’ attribute), marked with ‘?’.
Clinical Attributes:
- Erythema
- Scaling
- Definite borders
- Itching
- Koebner phenomenon
- Polygonal papules
- Follicular papules
- Oral mucosal involvement
- Knee and elbow involvement
- Scalp involvement
- Family history (0 or 1)
- Age (linear)
Histopathological Attributes:
- Melanin incontinence
- Eosinophils in the infiltrate
- PNL infiltrate
- Fibrosis of the papillary dermis
- Exocytosis
- Acanthosis
- Hyperkeratosis
- Parakeratosis
- Clubbing of the rete ridges
- Elongation of the rete ridges
- Thinning of the suprapapillary epidermis
- Spongiform pustule
- Munro microabscess
- Focal hypergranulosis
- Disappearance of the granular layer
- Vacuolization and damage of basal layer
- Spongiosis
- Saw-tooth appearance of retes
- Follicular horn plug
- Perifollicular parakeratosis
- Inflammatory mononuclear infiltrate
- Band-like infiltrate
Data Processing:
The clinical and histopathological features are graded on a scale from 0 to 3: 0 indicates that the feature is absent, 3 indicates the largest amount possible, and 1 or 2 indicate intermediate values.
The ‘Family history’ feature is binary (0 or 1).
The ‘Age’ feature represents the age of the patient.
Data Loading and Preprocessing
We begin by
loading necessary libraries—tidyverse, cluster, and ggplot2. The dataset
is read from a CSV file, and preprocessing includes handling missing
values in the ‘age’ attribute by replacing ‘?’ with NA and imputing
missing values with the mean of the column. Features are standardized
using the scale() function to ensure comparability.
# Install the required packages (only needed once)
install.packages(c("tidyverse", "cluster", "plot3D", "ggplot2"))
# Load the libraries
library(tidyverse)
library(cluster)
library(ggplot2)
# Read the dataset
data <- read.csv("dermatology_database_1.csv")
# Convert '?' in 'age' to NA and make the column numeric
data$age <- as.numeric(ifelse(data$age == "?", NA, data$age))
# Impute the missing ages with the mean age
data$age <- ifelse(is.na(data$age), mean(data$age, na.rm = TRUE), data$age)
I convert the missing values in the ‘age’ attribute (represented as
“?”) to NA using the ifelse() function and turn the column into numbers
with as.numeric(). Then I replace the NA values with the mean age,
calculated using mean(data$age, na.rm = TRUE). After this step the ‘age’
column is numeric and has no missing values, which prepares the data for
further analysis.
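To make the two imputation steps concrete, here is a minimal sketch on a made-up vector. The values in age_raw are hypothetical and only stand in for the ‘age’ column as it arrives from the CSV (text, with “?” for missing entries).

```r
# Hypothetical stand-in for the 'age' column as read from the CSV
age_raw <- c("55", "8", "?", "40", "?", "26")

# Step 1: turn '?' into NA and convert the text to numbers
age_num <- as.numeric(ifelse(age_raw == "?", NA, age_raw))

# Step 2: replace each NA with the mean of the observed ages
age_imputed <- ifelse(is.na(age_num), mean(age_num, na.rm = TRUE), age_num)

print(age_imputed)
```

A possible alternative is to pass na.strings = "?" to read.csv(), which marks those entries as NA while the file is being read, so the first step is no longer needed.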
data_scaled <- scale(data[, 1:33])
I use the scale() function to standardize the features, so that each
column has mean 0 and standard deviation 1. The data[, 1:33] part selects
columns 1 to 33 of the dataset, which are the 33 graded clinical and
histopathological features. In this file, column 34 is age and column 35
is class (the target variable, or labels), so neither of them is passed
to scale() or used in the clustering. To cluster on the imputed age as
well, the selection would need to be data[, 1:34].
Here’s why I exclude the target variable when standardizing features:
Target Variable: The target variable, which is often the variable we are trying to predict or classify, does not need to be standardized. This is because it is not a feature used for prediction and its scale doesn’t impact the computations of distance or similarity between data points.
Standardization Purpose: Standardization is typically applied to ensure that all features contribute equally to the analysis, especially when they are on different scales. Including the target variable in the standardization process may introduce unintended effects.
Therefore, by specifying 1:33, I standardize only the input features
used in the clustering analysis and leave the target variable out.
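As a quick illustration of what scale() does, the sketch below standardizes a small made-up data frame with two columns on very different scales. The toy values are hypothetical and are not taken from the dermatology data.

```r
# Hypothetical toy features on different scales
toy <- data.frame(
  grade = c(0, 1, 2, 3, 3, 1),
  age   = c(55, 8, 26, 40, 45, 33)
)

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy)

# Check the result: column means are (numerically) 0, standard deviations are 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

One thing to watch for: a column that is constant has standard deviation 0, and scale() then returns NaN for it, so such columns are best dropped before clustering.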
To identify the optimal number of clusters (k), we employ the elbow method, a widely used technique in clustering analysis. This method helps us find the point at which the addition of more clusters ceases to provide significant improvement in within-cluster cohesion.
The process involves running the k-means clustering algorithm for different values of k (ranging from 1 to 10 in this case) and calculating the within-cluster sum of squares (WSS) for each k. The WSS is the sum of squared distances between each data point and its assigned cluster center.
In the resulting plot, the ‘elbow’ represents a point where the reduction in WSS begins to diminish, indicating that additional clusters do not significantly improve the model’s fit.
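The elbow procedure described above can be sketched as follows. This is a minimal illustration, not the exact code used for the analysis: toy_scaled is synthetic data standing in for data_scaled, and nstart = 10 is my own choice to make each k-means run less dependent on its random start.

```r
set.seed(123)

# Synthetic stand-in for data_scaled: three separated groups in 4 dimensions
centers <- rbind(c(0, 0, 0, 0), c(5, 5, 0, 0), c(0, 5, 5, 5))
toy_x <- do.call(rbind, lapply(1:3, function(i) {
  noise <- matrix(rnorm(50 * 4, sd = 0.7), ncol = 4)
  noise + matrix(centers[i, ], nrow = 50, ncol = 4, byrow = TRUE)
}))
toy_scaled <- scale(toy_x)

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

With the real data, toy_scaled would simply be replaced by data_scaled. The elbow is read off by eye, so it is a heuristic rather than a formal test.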
Based on the insight gained from the elbow method, which suggests
that k = 3 is an optimal choice, we proceed with the K-Means
clustering algorithm. K-Means is a partitioning method that assigns each
data point to one of k clusters. The goal is to minimize
the within-cluster sum of squares, making data points within each
cluster as similar as possible.
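To show what the within-cluster sum of squares actually measures, the sketch below recomputes it by hand and compares it with the value that kmeans() reports. The data are hypothetical toy points, and nstart = 10 is an assumption of mine rather than a setting from the analysis.

```r
set.seed(123)

# Hypothetical toy data: two groups of 20 points in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# For every point, look up the center of the cluster it was assigned to
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared distances between each point and its cluster center
wss_by_hand <- sum((toy - assigned_centers)^2)

print(c(by_hand = wss_by_hand, reported = fit$tot.withinss))
```

The two numbers agree, which confirms that tot.withinss is the quantity K-Means tries to make as small as possible.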
# Choose k = 3 based on the elbow method
k <- 3
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(data_scaled, centers = k)
I set the seed for reproducibility using set.seed(123).
The kmeans() function is employed to perform K-Means
clustering on the standardized data, specifying k = 3 as
the number of clusters.
# Assign cluster labels to the original dataset
data$cluster_label <- as.factor(kmeans_result$cluster)
| erythema | scaling | definite_borders | itching | koebner_phenomenon | polygonal_papules | follicular_papules | oral_mucosal_involvement | knee_and_elbow_involvement | scalp_involvement | family_history | melanin_incontinence | eosinophils_infiltrate | PNL_infiltrate | fibrosis_papillary_dermis | exocytosis | acanthosis | hyperkeratosis | parakeratosis | clubbing_rete_ridges | elongation_rete_ridges | thinning_suprapapillary_epidermis | spongiform_pustule | munro_microabcess | focal_hypergranulosis | disappearance_granular_layer | vacuolisation_damage_basal_layer | spongiosis | saw_tooth_appearance_retes | follicular_horn_plug | perifollicular_parakeratosis | inflammatory_mononuclear_infiltrate | band_like_infiltrate | age | class | cluster_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 55 | 2 | 2 |
| 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 1 | 3 |
| 2 | 1 | 2 | 3 | 1 | 3 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 26 | 3 | 1 |
| 2 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 3 | 2 | 2 | 2 | 2 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 40 | 1 | 3 |
| 2 | 3 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 3 | 2 | 3 | 0 | 0 | 2 | 3 | 45 | 3 | 1 |
Cluster labels are assigned to the original dataset under the
variable cluster_label. The resulting clusters offer a
grouping of data points based on their similarity in the feature
space.
Now we know which patients are in which group, and that’s exactly what we wanted :D, so we can proceed to analyze and visualize these clusters.
After performing K-Means clustering and assigning cluster labels to the dataset, I delve into understanding the distribution of data points across these clusters.
# Explore the distribution of clusters
cluster_distribution <- table(data$cluster_label)
print(cluster_distribution)
| Var1 | Freq |
|---|---|
| 1 | 72 |
| 2 | 183 |
| 3 | 111 |
Here, I employ the table() function to examine the
distribution of data points among the identified clusters. The output
provides a count of instances within each cluster, offering a
preliminary understanding of the size and balance of the clusters.
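The same table() function can also cross-tabulate the cluster labels against the known class column, which is one way to check the hypothesis that clusters line up with disease types. The sketch below uses a small made-up data frame; the values are hypothetical and only the column names class and cluster_label follow the dataset.

```r
# Hypothetical toy labels standing in for data$class and data$cluster_label
toy <- data.frame(
  class         = c(1, 1, 1, 2, 2, 3, 3, 3),
  cluster_label = factor(c(3, 3, 3, 2, 2, 1, 1, 2))
)

# Rows are clusters, columns are the known classes
cross_tab <- table(cluster = toy$cluster_label, class = toy$class)
print(cross_tab)

# Share of all points that falls into each cluster
print(round(prop.table(table(toy$cluster_label)), 2))
```

A row dominated by a single column means that the cluster mostly contains one class. Cluster numbers are arbitrary, so cluster 3 matching class 1 is as good a match as any other pairing.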
Now, let’s visualize these clusters in a 2D plot using ggplot2:
Visualize clusters in 2D using ggplot2
I create a new dataframe (cluster_data) that includes
the first two principal components (PC1 and PC2) and the assigned
cluster labels. A scatter plot is generated using ‘ggplot2’, where each
point is colored based on its cluster membership. The resulting 2D
visualization provides an intuitive representation of the identified
clusters in the feature space.
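A sketch of the 2D plot described above follows. The names cluster_data, PC1, PC2 and cluster_label come from the text; using prcomp() for the principal components is an assumption on my part, and toy_scaled and toy_clusters are synthetic stand-ins for data_scaled and the k-means labels.

```r
set.seed(123)

# Synthetic stand-in for data_scaled and its cluster labels
toy_x <- rbind(
  matrix(rnorm(60 * 5, mean = 0), ncol = 5),
  matrix(rnorm(60 * 5, mean = 3), ncol = 5),
  matrix(rnorm(60 * 5, mean = -3), ncol = 5)
)
toy_scaled <- scale(toy_x)
toy_clusters <- kmeans(toy_scaled, centers = 3, nstart = 10)$cluster

# Principal component analysis on the standardized features
pca_result <- prcomp(toy_scaled)

# Data frame with the first two principal components and the cluster labels
cluster_data <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  cluster_label = as.factor(toy_clusters)
)

# Scatter plot colored by cluster (drawn only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cluster_data, aes(x = PC1, y = PC2, color = cluster_label)) +
    geom_point() +
    labs(title = "Clusters on the first two principal components",
         color = "Cluster")
  print(p)
}
```

With the real data, toy_scaled becomes data_scaled and toy_clusters becomes kmeans_result$cluster.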
I also decided to make a 3D visualization, because a third principal component can show separation between clusters that a 2D plot hides. This is particularly valuable when exploring multivariate datasets and principal components. While 2D plots offer simplicity, a 3D view adds depth, which helps when interpreting clustering patterns.
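One possible way to draw the 3D view is with scatter3D() from the plot3D package installed earlier. This is a sketch under assumptions: the original plotting code is not shown, the data below are synthetic stand-ins, and the colors and labels are my own choices.

```r
set.seed(123)

# Synthetic stand-in for data_scaled and its cluster labels
toy_x <- rbind(
  matrix(rnorm(60 * 5, mean = 0), ncol = 5),
  matrix(rnorm(60 * 5, mean = 3), ncol = 5),
  matrix(rnorm(60 * 5, mean = -3), ncol = 5)
)
toy_scaled <- scale(toy_x)
toy_clusters <- kmeans(toy_scaled, centers = 3, nstart = 10)$cluster

# Scores on the first three principal components
pca_result <- prcomp(toy_scaled)
scores <- pca_result$x[, 1:3]

# Cumulative share of variance captured by the first three components
var_share <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
print(round(var_share[1:3], 3))

# 3D scatter plot colored by cluster (drawn only if plot3D is installed)
if (requireNamespace("plot3D", quietly = TRUE)) {
  plot3D::scatter3D(
    x = scores[, 1], y = scores[, 2], z = scores[, 3],
    colvar = toy_clusters, col = c("red", "darkgreen", "blue"),
    pch = 19, colkey = FALSE,
    xlab = "PC1", ylab = "PC2", zlab = "PC3",
    main = "Clusters on the first three principal components"
  )
}
```

The printed variance shares indicate how much of the data the three plotted components actually represent, which is worth checking before reading much into the picture.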
In conclusion, the clustering analysis has successfully revealed three distinct clusters within the dermatology dataset. These clusters suggest the existence of potential subgroups of patients with similar clinical and histopathological features. The identification of such clusters can provide valuable insights for dermatologists, aiding in the understanding and categorization of skin diseases that share common characteristics.
The elbow method, employed to determine the optimal number of
clusters, guided our choice of k = 3. This decision was
supported by the observed ‘elbow’ in the plot.
The interpretation is based on observed patterns and associations in the data, and causation cannot be directly inferred. Clustering might be influenced by shared features that do not necessarily represent direct causal relationships. Collaborative efforts with dermatologists are essential to validate the observed clusters and understand the clinical context. Any explanation of why each cluster formed, for example through common pathogenic mechanisms, anatomical predilections, or potential genetic influences, remains a hypothesis until it is checked against additional context and domain knowledge.