Hierarchical Clustering with Julia

This tutorial introduces hierarchical clustering in Julia: building a distance matrix, clustering it with hclust, cutting the resulting tree into clusters, and plotting the results. Experiment with different distance metrics and linkage methods to see how they change the clusters. The dendrogram shows the order in which clusters merge and the distance at which each merge happens, which makes it the main tool for reading the hierarchical structure of your data. Choose the metric and linkage method that suit your data and the goals of your analysis.

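using Pkg
# Install the required packages (Random ships with Julia).
# StatsPlots supplies the plot recipe that draws an Hclust tree as a dendrogram.
Pkg.add(["Clustering", "DataFrames", "Plots", "StatsPlots", "Distances"])
using Clustering
using DataFrames
using Plots
using StatsPlots
using Distances
using Random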

# 1. Generate or Load Data

# Option A: Generate synthetic data (for demonstration)
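Random.seed!(123)
n_clusters = 3
n_points = 100 # points per cluster
# Stack one n_points × 2 block per cluster. Block i is uniform on [5i, 5i + 5) in
# both coordinates, so rows are observations and columns are features.
data = vcat([(rand(n_points, 2) .+ i) .* 5 for i in 1:n_clusters]...)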

# Option B: Load data from a file (replace with your actual file path and column names)
# using CSV # requires Pkg.add("CSV")
# df = CSV.read("your_data.csv", DataFrame)
# data = Matrix(df[:, [:column1, :column2]])

# 2. Perform Hierarchical Clustering

# Choose a distance metric (see previous tutorial for options)
distance_metric = Euclidean() # Example: Euclidean distance

# Calculate the distance matrix (dims=1 treats each row of data as one observation)
distance_matrix = pairwise(distance_metric, data, dims=1)

# Perform hierarchical clustering using a linkage method
# Common linkage methods: :single, :complete, :average, :ward
linkage_method = :complete # Example: Complete linkage

# Perform the clustering (Clustering.jl takes the linkage method as the `linkage` keyword)
tree = hclust(distance_matrix, linkage=linkage_method)

# 3. Analyze Results

# Cut the dendrogram to get cluster assignments
k = 3 # Desired number of clusters
cluster_assignments = cutree(tree, k=k)

# Dendrogram (plotting an Hclust object relies on the recipe provided by StatsPlots)
p_dendrogram = plot(tree,
    xticks = false,
    xlabel = "Data Points",
    ylabel = "Distance",
    title = "Dendrogram",
    linecolor=:black,
    linewidth=1)
display(p_dendrogram)

# 4. Visualize Results

# Scatter plot of data points, colored by cluster assignment
p_scatter = scatter(data[:, 1], data[:, 2],
          group = cluster_assignments,
          xlabel = "X", ylabel = "Y",
          title = "Hierarchical Clustering (Complete Linkage)",
          markersize = 5,
          markerstrokewidth = 0,
          alpha = 0.8)

display(p_scatter)


# 5. Exploring Different Linkage Methods

# Compare the results with different linkage methods
linkage_methods = [:single, :complete, :average, :ward]
for method in linkage_methods
    # Loop-local names keep the tree and assignments from steps 2 and 3 intact
    method_tree = hclust(distance_matrix, linkage=method)
    method_assignments = cutree(method_tree, k=k)

    p = scatter(data[:, 1], data[:, 2],
          group = method_assignments,
          xlabel = "X", ylabel = "Y",
          title = "Hierarchical Clustering ($(method) Linkage)",
          markersize = 5,
          markerstrokewidth = 0,
          alpha = 0.8)
    display(p)
end

# 6. Using DataFrames (Optional)

# If you have your data in a DataFrame:
# df.Cluster = cluster_assignments # add the assignments as a new column
# groupby(df, :Cluster) # Analyze data within each cluster

# 7. Plot the Dendrogram and Clusters Side by Side
combined_plot = plot(p_dendrogram, p_scatter, layout = @layout [a{0.5w} b])
display(combined_plot)

Explanation:

  • Package Management: Includes all necessary packages in Pkg.add.
  • Data Generation/Loading: Provides both synthetic data generation and instructions for loading from a file.
  • Distance Metric: Explicitly shows how to choose a distance metric. Refers back to the previous distance metric tutorial.
  • Linkage Methods: Explains the different linkage methods (:single, :complete, :average, :ward) and provides examples of each.
  • Dendrogram: Plots the dendrogram, which is crucial for visualizing the hierarchical relationships.
  • Cutree: Shows how to use cutree to get cluster assignments based on a desired number of clusters (k).
  • Visualization: Creates scatter plots of the clustered data, colored by cluster assignment.
  • Comparison of Linkage Methods: Loops through the different linkage methods and displays the resulting clusterings, allowing for easy comparison.
  • DataFrames Integration: Shows how to add cluster assignments back to a DataFrame.
  • Combined Plot: Shows how to plot the dendrogram and the scatter plot of the clustered data side-by-side for better visualization.
  • Comments: Each step is commented to explain what it does.
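
The per-cluster analysis mentioned in the list above can also be done without a DataFrame. The following is a minimal sketch under stated assumptions: toy_data and toy_assignments are made-up stand-ins for data and cluster_assignments, and cluster_summary is a hypothetical helper, not part of Clustering.jl.

```julia
using Statistics

# Toy stand-ins for `data` and `cluster_assignments` (illustrative values only)
toy_data = [1.0 2.0;
            1.5 1.8;
            8.0 8.0;
            9.0 9.5;
            1.2 2.2]
toy_assignments = [1, 1, 2, 2, 1]

# Hypothetical helper: size and centroid of each cluster, keyed by cluster label.
# Rows of X are observations; labels holds one cluster label per row.
function cluster_summary(X::AbstractMatrix, labels::AbstractVector{<:Integer})
    size(X, 1) == length(labels) ||
        throw(ArgumentError("labels must contain one entry per row of X"))
    return Dict(label => (n = count(==(label), labels),
                          centroid = vec(mean(X[labels .== label, :], dims=1)))
                for label in unique(labels))
end

stats = cluster_summary(toy_data, toy_assignments)
for label in sort(collect(keys(stats)))
    println("cluster ", label, ": n = ", stats[label].n,
            ", centroid = ", stats[label].centroid)
end
```

With the tutorial's variables, the same call would be cluster_summary(data, cluster_assignments), assuming data keeps one observation per row.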

Key Concepts:

  • Distance Matrix: Hierarchical clustering starts by calculating a distance matrix between all pairs of data points.
  • Linkage Methods: Linkage methods define how the distance between clusters is calculated. Common methods include:
    • Single Linkage: Distance between the closest points in two clusters.
    • Complete Linkage: Distance between the farthest points in two clusters.
    • Average Linkage: Average distance between all pairs of points in two clusters.
    • Ward’s Method: Merges the pair of clusters whose union gives the smallest increase in total within-cluster variance.
  • Dendrogram: A tree-like diagram that shows the hierarchical relationships between clusters.
  • Cutree: A function that cuts the dendrogram at a specified level to create a given number of clusters.
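
To make the linkage definitions above concrete, here is a small sketch that computes each between-cluster distance by hand, using only Julia's standard library. The four points and the two clusters A and B are made up for illustration. The Ward quantity uses the standard formula for the increase in the within-cluster sum of squares when two clusters are merged.

```julia
using Statistics

# Toy data: rows are observations (illustrative values, not the tutorial's data)
points = [0.0 0.0;
          0.0 1.0;
          3.0 0.0;
          3.0 4.0]

# Pairwise Euclidean distance matrix, computed by hand
n = size(points, 1)
D = [sqrt(sum(abs2, points[i, :] - points[j, :])) for i in 1:n, j in 1:n]

# Two hypothetical clusters, given as row indices into `points`
A = [1, 2]
B = [3, 4]

# All distances from a point in A to a point in B
between = D[A, B]

single_link   = minimum(between) # distance between the closest pair
complete_link = maximum(between) # distance between the farthest pair
average_link  = mean(between)    # mean distance over all pairs

# Ward's criterion: increase in the within-cluster sum of squares if A and B merge
centroid(idx) = vec(mean(points[idx, :], dims=1))
ward_increase = length(A) * length(B) / (length(A) + length(B)) *
                sum(abs2, centroid(A) - centroid(B))

println("single = ", single_link, ", complete = ", complete_link,
        ", average = ", average_link, ", Ward increase = ", ward_increase)
```

At every step, agglomerative clustering merges the two clusters with the smallest such distance, so the choice of linkage decides which pair merges next.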

Running the Code:

  1. Install Julia and the required packages.
  2. Copy and paste the code into the Julia REPL or run it in an IDE.
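
Before running the script, it can help to check which packages are still missing. The sketch below is one possible way to do that. missing_packages is a hypothetical helper, and it relies on Base.find_package, a documented but non-exported function that returns nothing when a package cannot be found in the active environment. Including StatsPlots in the list is an assumption: it is the package that provides the dendrogram plot recipe.

```julia
# Packages used by the tutorial (StatsPlots is an assumption; it provides the
# recipe for plotting an Hclust tree)
required = ["Clustering", "DataFrames", "Plots", "StatsPlots", "Distances"]

# Hypothetical helper: names that cannot be found in the active environment
missing_packages(names) = [name for name in names if Base.find_package(name) === nothing]

not_installed = missing_packages(required)
if isempty(not_installed)
    println("All required packages are available.")
else
    println("Missing packages: ", join(not_installed, ", "))
    println("Install them with: using Pkg; Pkg.add(", repr(not_installed), ")")
end
```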