LA-1

Author

Shreya S Vakkund

Objective:

Use dendrograms and hierarchical clustering to analyze similarity among passengers in the Titanic dataset based on selected attributes such as Age, Fare, Passenger Class, and family size.

Step 1: Loading Required Libraries

  • To load required libraries for data manipulation, clustering, and visualization.
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
Warning: package 'cluster' was built under R version 4.5.3
library(factoextra)
Warning: package 'factoextra' was built under R version 4.5.3
  • tidyverse: data manipulation and visualization
  • cluster: clustering algorithms
  • factoextra: dendrogram and cluster visualization

Step 2: Loading the Dataset

  • To import the Titanic dataset into R.
data <- read.csv("D:/College/Data visualization/titanic/train.csv")

head(data)
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q
str(data)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
  • read.csv() loads the dataset into a data frame
  • head() previews the first six rows
  • str() shows the structure: 891 observations of 12 variables, with their data types

Step 3: Data Preprocessing

  • To clean data and select relevant features.
df <- data %>%
  select(Age, Fare, Pclass, SibSp, Parch)

colSums(is.na(df))
   Age   Fare Pclass  SibSp  Parch 
   177      0      0      0      0 
df <- na.omit(df)

summary(df)
      Age             Fare            Pclass          SibSp       
 Min.   : 0.42   Min.   :  0.00   Min.   :1.000   Min.   :0.0000  
 1st Qu.:20.12   1st Qu.:  8.05   1st Qu.:1.000   1st Qu.:0.0000  
 Median :28.00   Median : 15.74   Median :2.000   Median :0.0000  
 Mean   :29.70   Mean   : 34.69   Mean   :2.237   Mean   :0.5126  
 3rd Qu.:38.00   3rd Qu.: 33.38   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :80.00   Max.   :512.33   Max.   :3.000   Max.   :5.0000  
     Parch       
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.4314  
 3rd Qu.:1.0000  
 Max.   :6.0000  
  • Selected five numeric features: Age, Fare, Pclass, SibSp, Parch
  • Found 177 missing values, all in Age, and removed those rows with na.omit()
  • 714 complete rows remain for clustering
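
The effect of na.omit() on the row count can be checked on a small example. The sketch below uses a made-up data frame (the name toy and its values are assumptions for illustration, not Titanic data):

```r
# Toy data frame (hypothetical values) with one missing Age
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 8.46, 7.93, 53.10)
)

# Count missing values per column, as in the step above
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)

# Rows removed = rows before - rows after
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```

Applied to the Titanic data, the same arithmetic gives 891 - 177 = 714 rows, which matches the total of the cluster sizes reported in Step 7.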

Step 4: Data Scaling

  • To standardize features for fair comparison.
df_scaled <- scale(df)

head(df_scaled)
         Age       Fare    Pclass      SibSp      Parch
1 -0.5300051 -0.5186143  0.910594  0.5242027 -0.5055408
2  0.5714304  0.6914121 -1.475329  0.5242027 -0.5055408
3 -0.2546462 -0.5058589  0.910594 -0.5513166 -0.5055408
4  0.3649113  0.3478053 -1.475329  0.5242027 -0.5055408
5  0.3649113 -0.5034968  0.910594 -0.5513166 -0.5055408
7  1.6728659  0.3244205 -1.475329 -0.5513166 -0.5055408
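  • scale() centers each column to mean 0 and rescales it to SD 1
  • Prevents features with large ranges, such as Fare (0 to 512.33), from dominating the distance calculation
  • Row labels keep the original row numbers, so 6 is absent (that row was dropped for its missing Age)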
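
To make the standardization concrete, the sketch below reproduces scale() by hand on a small matrix. The values are illustrative assumptions (loosely based on the first rows shown earlier), not the full dataset:

```r
# Toy matrix (hypothetical values) with columns on very different scales
x <- cbind(Age = c(22, 38, 26, 35), Fare = c(7.25, 71.28, 7.93, 53.10))

# Manual standardization: subtract the column mean, divide by the column SD
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with default arguments performs the same computation
auto <- scale(x)

# Compare values only; scale() also attaches centering/scaling attributes
same <- isTRUE(all.equal(manual, auto, check.attributes = FALSE))
print(same)

# After scaling, every column has mean 0 and SD 1 (up to rounding error)
print(round(colMeans(auto), 10))
print(apply(auto, 2, sd))
```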

Step 5: Distance Matrix Calculation

  • To measure similarity between data points.
dist_matrix <- dist(df_scaled, method = "euclidean")
  • dist() computes the Euclidean distance between every pair of scaled rows
  • Smaller distance → more similar passengers
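
As a worked example, the Euclidean distance between two passengers can be computed by hand and compared with dist(). The two vectors below are rounded from the first two rows of head(df_scaled); the names p1 and p2 are introduced here only for illustration:

```r
# Two scaled passengers (values rounded from the head(df_scaled) output)
p1 <- c(Age = -0.53, Fare = -0.52, Pclass =  0.91, SibSp = 0.52, Parch = -0.51)
p2 <- c(Age =  0.57, Fare =  0.69, Pclass = -1.48, SibSp = 0.52, Parch = -0.51)

# Euclidean distance: square root of the summed squared differences
by_hand <- sqrt(sum((p1 - p2)^2))

# dist() computes the same quantity for every pair of rows
d <- dist(rbind(p1, p2), method = "euclidean")

print(by_hand)
print(as.numeric(d))
```

Most of the distance here comes from Pclass, since the two passengers share the same SibSp and Parch values.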

Step 6: Hierarchical Clustering & Dendrogram

  • To apply clustering and visualize relationships.
hc <- hclust(dist_matrix, method = "ward.D2")

plot(hc, labels = FALSE, main = "Dendrogram of Passenger Similarity")

fviz_dend(hc, k = 4, rect = TRUE, main = "Hierarchical Clustering Dendrogram")

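  • Ward’s method (ward.D2) merges, at each step, the two clusters whose union gives the smallest increase in within-cluster variance, which tends to produce compact clusters
  • The dendrogram shows the order and height at which clusters merge
  • k = 4 in fviz_dend() draws rectangles around 4 clusters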
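
With 714 passengers the dendrogram is dense, so it can help to look at what hclust() returns on a tiny input. The following is a minimal sketch on six made-up points (pts and hc_toy are hypothetical names, not part of the analysis above):

```r
# Six toy points (hypothetical values) forming two visibly separate groups
pts <- rbind(
  a = c(0.0, 0.1), b = c(0.2, 0.0), c = c(0.1, 0.3),
  d = c(5.0, 5.1), e = c(5.2, 4.9), f = c(4.9, 5.3)
)

# Same linkage as in the step above
hc_toy <- hclust(dist(pts), method = "ward.D2")

# Each row of $merge is one merge step; $height is the level it happened at
print(hc_toy$merge)
print(hc_toy$height)

# Cutting the tree into two groups recovers the two visible groups
groups <- cutree(hc_toy, k = 2)
print(groups)
```

Six points give five merges. The last height is much larger than the others because it joins the two distant groups, and a tall final branch in a dendrogram is read the same way.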

Step 7: Forming Clusters & Analysis

  • To divide and analyze clusters.
clusters <- cutree(hc, k = 4)

df$Cluster <- as.factor(clusters)

table(df$Cluster)

  1   2   3   4 
387 186  35 106 
cluster_summary <- df %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Age = mean(Age),
    Avg_Fare = mean(Fare),
    Avg_Class = mean(Pclass),
    Avg_SibSp = mean(SibSp),
    Avg_Parch = mean(Parch)
  )

print(cluster_summary)
# A tibble: 4 × 6
  Cluster Avg_Age Avg_Fare Avg_Class Avg_SibSp Avg_Parch
  <fct>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
1 1         29.5      12.4      2.68     0.214     0    
2 2         38.2      88.0      1        0.452     0.409
3 3          7.21     31.7      3        3.66      1.57 
4 4         22.9      23.7      2.53     0.670     1.67 
  • cutree() cuts the tree into 4 clusters (387, 186, 35, and 106 passengers)
  • Cluster labels are added to df as a factor column
  • Each cluster is summarized by its feature means
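
The report fixes k = 4 without stating how it was chosen. One common check, shown here only as a possible sketch and not as the method used above, is to compare the average silhouette width across several values of k with silhouette() from the cluster package. The data are simulated (an assumption), so the numbers say nothing about the Titanic clusters:

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Simulated data (an assumption for illustration): three separated groups
sim <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

d_sim  <- dist(sim)
hc_sim <- hclust(d_sim, method = "ward.D2")

# Average silhouette width for each candidate k (higher = better separated)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc_sim, k = k), d_sim)
  mean(sil[, "sil_width"])
})

print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))

# Candidate k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

The same loop could be run with hc and dist_matrix from Steps 5 and 6 to see whether 4 is a reasonable choice for the passenger data.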

Step 8: Cluster Visualization & Conclusion

  • To visualize clusters and interpret results.
fviz_cluster(list(data = df_scaled, cluster = clusters),
             main = "Cluster Visualization of Passengers")

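  • Displays clusters in 2D; with more than two features, fviz_cluster() plots the first two principal components
  • Helps judge how well the clusters separate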
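
Because five features are reduced to two axes, it is worth knowing how much variance the 2D view keeps. The sketch below builds a comparable projection with base R's prcomp(). It uses the built-in iris data as a stand-in (an assumption, since the Titanic file is not bundled with R), and the variable names are illustrative:

```r
# Built-in dataset used as a stand-in (an assumption; not the Titanic data)
x <- scale(iris[, 1:4])

# Principal components of the scaled features
pca <- prcomp(x)

# Coordinates of each observation on the first two components
coords <- pca$x[, 1:2]

# Share of total variance captured by the 2D view
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(sum(var_share[1:2]), 3))

# Colour the projected points by hierarchical cluster membership
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)
plot(coords, col = cl, pch = 19, xlab = "PC1", ylab = "PC2")
```

If the first two components capture only a small share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.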

Conclusion

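Hierarchical clustering with Ward’s method was applied to 714 passengers using scaled Age, Fare, Pclass, SibSp, and Parch. Cutting the dendrogram at k = 4 gave groups that differ mainly in fare, class, and family structure. Cluster 2 is entirely first class and has the highest average fare (88.0) and age (38.2). Cluster 3 is entirely third class, with a low average age (7.21) and many siblings or spouses aboard (average SibSp 3.66). Cluster 1, the largest, has no parents or children aboard and the lowest average fare (12.4). Cluster 4 consists of younger passengers travelling with parents or children (average Parch 1.67).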