LA -1

Author

Shreya S Vakkund

Introduction

  • Data analysis helps identify patterns in data
  • Clustering groups similar data points
  • This study uses hierarchical clustering to analyze passenger similarity

Objective

  • Analyze similarity among passengers
  • Use:
    • Dendrograms
    • Hierarchical clustering
  • Based on features:
    • Age, Fare, Pclass, SibSp, Parch

Dataset Overview

  • Dataset: Titanic (Kaggle)
  • Contains passenger information
  • Focus on numerical features for clustering

Methodology

  1. Data preprocessing
  2. Feature selection
  3. Data scaling
  4. Distance calculation
  5. Hierarchical clustering
  6. Visualization & interpretation

Data Preprocessing

library(tidyverse)
data <- read.csv("D:/College/Data visualization/titanic/train.csv")

df <- data %>%
  select(Age, Fare, Pclass, SibSp, Parch) %>%
  na.omit()
  • Selected relevant numeric features
  • Removed every row with a missing value using na.omit()
  • Left 714 complete passenger records for clustering
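Before dropping rows, it can help to see where the missing values are. The sketch below uses a small hypothetical data frame (`toy`, with invented values) rather than the Titanic file, and shows that `na.omit()` removes a row if any of its columns is missing.

```r
# Hypothetical toy data with Titanic-style columns (values invented)
toy <- data.frame(
  Age    = c(22, NA, 26),
  Fare   = c(7.25, 71.28, NA),
  Pclass = c(3, 1, 3)
)

# Count missing values per column before removing anything
missing_per_col <- colSums(is.na(toy))

# na.omit() keeps only rows that are complete in every column
clean <- na.omit(toy)

print(missing_per_col)
print(nrow(clean))
```

Here two of the three rows are lost because each has one missing field, so it is worth checking how many observations survive before clustering.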

Data Scaling

df_scaled <- scale(df)
  • Standardized features
  • Mean = 0, SD = 1
  • Keeps large-valued features such as Fare from dominating the distance calculation
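As a minimal check of what standardization does, the sketch below scales a hypothetical two-column data frame (`toy`, invented values) and compares `scale()` with the manual z-score formula (x - mean) / sd.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  Age  = c(22, 38, 26, 35, 54),
  Fare = c(7.25, 71.28, 7.93, 53.10, 51.86)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# Manual z-score for one column, for comparison
age_manual <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

col_means <- colMeans(toy_scaled)      # numerically zero
col_sds   <- apply(toy_scaled, 2, sd)  # one, up to rounding

print(round(col_means, 10))
print(col_sds)
```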

Distance Calculation

dist_matrix <- dist(df_scaled, method = "euclidean")
  • Measures dissimilarity between each pair of passengers on the scaled features
  • Smaller distance → more similar
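For two passengers, the Euclidean distance is the square root of the summed squared differences across features. A small worked example with two hypothetical, already-scaled passengers (`a` and `b`, invented values):

```r
# Two hypothetical passengers described by scaled features
p <- rbind(
  a = c(Age = 0.5,  Fare = -1.0, Pclass = 1.0),
  b = c(Age = -0.5, Fare = 1.0,  Pclass = -1.0)
)

# Built-in distance
d_builtin <- as.numeric(dist(p, method = "euclidean"))

# Manual calculation: differences are 1, -2, 2, so sqrt(1 + 4 + 4) = 3
d_manual <- sqrt(sum((p["a", ] - p["b", ])^2))

print(d_builtin)
print(d_manual)
```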

Hierarchical Clustering

hc <- hclust(dist_matrix, method = "ward.D2")
  • Uses Ward’s method
  • At each step, merges the two clusters whose union adds the least within-cluster variance, which favors compact clusters
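To see the method behave on data where the answer is obvious, the sketch below clusters six hypothetical points (`pts`, invented coordinates) that form two tight groups. It is an illustration only, not part of the passenger analysis.

```r
# Six hypothetical points: three near the origin, three far away
pts <- rbind(
  c(0, 0),   c(0, 1),   c(1, 0),
  c(10, 10), c(10, 11), c(11, 10)
)

hc_toy <- hclust(dist(pts), method = "ward.D2")

# Cutting the tree into two clusters should recover the two groups
groups <- cutree(hc_toy, k = 2)

# Merge heights: the final merge joins the two distant groups,
# so it is much higher than the merges inside each group
heights <- hc_toy$height

print(groups)
print(heights)
```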

Dendrogram

plot(hc, labels = FALSE, main = "Dendrogram of Passenger Similarity")

  • Tree representation of clustering
  • Shows the order in which groups merge; the height of a merge reflects how dissimilar the merged groups are
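One way to judge how faithfully a dendrogram represents the original distances is the cophenetic correlation: the correlation between the pairwise distances and the heights at which each pair is first joined in the tree. The sketch below uses the built-in `USArrests` data as a stand-in, since the Titanic file is not bundled with R; the names `x`, `d` and `hc_ex` are illustrative.

```r
# Stand-in data: built-in USArrests, standardized
x <- scale(USArrests)
d <- dist(x, method = "euclidean")
hc_ex <- hclust(d, method = "ward.D2")

# cophenetic() returns, for each pair, the height of their first common merge
coph_cor <- cor(as.vector(d), as.vector(cophenetic(hc_ex)))

print(coph_cor)
```

Values closer to 1 suggest the tree preserves the distance structure well; the same two lines could be applied to `dist_matrix` and `hc`.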

Enhanced Dendrogram

library(factoextra)
fviz_dend(hc, k = 4, rect = TRUE)

  • Divides data into 4 clusters
  • Highlights cluster boundaries

Cluster Formation

clusters <- cutree(hc, k = 4)
df$Cluster <- as.factor(clusters)

table(df$Cluster)

  1   2   3   4 
387 186  35 106 
  • Assigns each passenger to a cluster
  • Cluster sizes are uneven: 387, 186, 35 and 106 passengers
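The choice of k = 4 can be compared against nearby values. As a rough sketch, `cutree()` accepts a vector of k values, and the largest merge heights give a heuristic for where a cut is natural. The built-in `USArrests` data stands in for the passenger data here, and `hc_ex`, `cuts` and `sizes` are illustrative names.

```r
# Stand-in data: built-in USArrests, standardized
x <- scale(USArrests)
hc_ex <- hclust(dist(x), method = "ward.D2")

# One column of cluster labels per candidate k
cuts <- cutree(hc_ex, k = 2:4)

# Cluster sizes for each candidate k
sizes <- lapply(colnames(cuts), function(k) table(cuts[, k]))
names(sizes) <- paste0("k=", colnames(cuts))

# The five highest merges, largest first; a big drop between
# consecutive heights hints at a reasonable number of clusters
merge_heights <- rev(hc_ex$height)[1:5]

print(sizes)
print(merge_heights)
```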

Cluster Analysis

df %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Age = mean(Age),
    Avg_Fare = mean(Fare),
    Avg_Class = mean(Pclass)
  )
# A tibble: 4 × 4
  Cluster Avg_Age Avg_Fare Avg_Class
  <fct>     <dbl>    <dbl>     <dbl>
1 1         29.5      12.4      2.68
2 2         38.2      88.0      1   
3 3          7.21     31.7      3   
4 4         22.9      23.7      2.53
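  • Cluster 2: first class only (Avg_Class = 1), highest average fare (88.0), oldest on average (38.2)
  • Cluster 3: third class only, very young passengers (average age 7.21)
  • Clusters 1 and 4: mostly second and third class, lower average fares (12.4 and 23.7)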
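The summary can be extended to family size, which is part of the clustering features. The sketch below is a base-R version on a hypothetical data frame (`toy`, invented values and cluster labels); `FamilySize` is an assumed helper column defined as SibSp + Parch.

```r
# Hypothetical toy data with invented values and cluster labels
toy <- data.frame(
  Age     = c(30, 40, 5, 6, 25, 45),
  Fare    = c(10, 90, 30, 32, 12, 86),
  SibSp   = c(0, 1, 3, 4, 0, 1),
  Parch   = c(0, 0, 2, 2, 0, 1),
  Cluster = factor(c(1, 2, 3, 3, 1, 2))
)

# Assumed helper column: relatives travelling with the passenger
toy$FamilySize <- toy$SibSp + toy$Parch

# Mean of each feature within each cluster
profile <- aggregate(cbind(Age, Fare, FamilySize) ~ Cluster,
                     data = toy, FUN = mean)

print(profile)
```

The same idea applies to `df` by adding `mean(SibSp)` and `mean(Parch)` inside the existing `summarise()` call.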

Cluster Visualization

fviz_cluster(list(data = df_scaled, cluster = clusters))

  • Plots the clusters on the first two principal components of the scaled data
  • Different colors indicate different groups
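With more than two features, the cluster plot is a two-dimensional projection, so it helps to know how much of the total variance those two dimensions carry. A minimal sketch with `prcomp()`, again using the built-in `USArrests` data as a stand-in (`pca`, `var_explained` and `coords` are illustrative names):

```r
# Stand-in data: built-in USArrests, standardized
x <- scale(USArrests)

# Principal component analysis of the scaled features
pca <- prcomp(x)

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each observation on the first two components
coords <- pca$x[, 1:2]

print(round(var_explained, 3))
print(head(coords))
```

If the first two components explain only a small share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.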

Key Insights

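  • Clusters separate mainly by:
    • Fare (economic status)
    • Passenger class
    • Age
    • Family size (SibSp and Parch are clustering features, though not shown in the summary table)
  • The four groups differ clearly in average fare, class and age
  • These differences point to meaningful similarity patterns among passengers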

Conclusion

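  • Hierarchical clustering with Ward’s method grouped 714 passengers into 4 clusters
  • The dendrogram showed how passenger groups merge as dissimilarity increases
  • The clusters differ in fare, class and age, so they are interpretable
  • Demonstrates the usefulness of clustering for exploratory data analysis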

Thank You