LA-1

Author

Shreya S Vakkund

Objective:

Use dendrograms and hierarchical clustering to analyze similarity among passengers in the Titanic dataset based on selected attributes such as Age, Fare, Passenger Class, and family size.

Step 1: Loading Required Libraries

Load the libraries needed for data manipulation, clustering, and visualization.

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cluster)

Warning: package 'cluster' was built under R version 4.5.3

library(factoextra)

Warning: package 'factoextra' was built under R version 4.5.3

Welcome to factoextra!
Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/

tidyverse: data manipulation and visualization
cluster: clustering algorithms (dist() and hclust(), used below, come from base R's stats package)
factoextra: dendrogram and cluster visualization

Step 2: Loading the Dataset

Import the Titanic training dataset into R and inspect it.

data <- read.csv("D:/College/Data visualization/titanic/train.csv")

head(data)

  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
5           5        0      3
6           6        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q

str(data)

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

read.csv() loads the dataset into a data frame
head() previews the first six rows
str() shows the structure and data types (891 observations, 12 variables)

Step 3: Data Preprocessing

Select the relevant features and remove rows with missing values.

df <- data %>%
  select(Age, Fare, Pclass, SibSp, Parch)

colSums(is.na(df))

   Age   Fare Pclass  SibSp  Parch 
   177      0      0      0      0

df <- na.omit(df)

summary(df)

      Age             Fare            Pclass          SibSp       
 Min.   : 0.42   Min.   :  0.00   Min.   :1.000   Min.   :0.0000  
 1st Qu.:20.12   1st Qu.:  8.05   1st Qu.:1.000   1st Qu.:0.0000  
 Median :28.00   Median : 15.74   Median :2.000   Median :0.0000  
 Mean   :29.70   Mean   : 34.69   Mean   :2.237   Mean   :0.5126  
 3rd Qu.:38.00   3rd Qu.: 33.38   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :80.00   Max.   :512.33   Max.   :3.000   Max.   :5.0000  
     Parch       
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.4314  
 3rd Qu.:1.0000  
 Max.   :6.0000

Selected five numeric features: Age, Fare, Pclass, SibSp, and Parch
Found 177 missing values, all in Age, and removed those rows with na.omit()
714 of the original 891 passengers remain for clustering
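Dropping rows is only one way to handle missing values. The sketch below contrasts na.omit() with median imputation on a small made-up data frame; the toy values and the names toy, toy_complete, and toy_imputed are assumptions for illustration, and imputation is an alternative that this report does not use.

```r
# Toy data frame standing in for two of the selected columns (values are made up).
toy <- data.frame(
  Age  = c(22, NA, 26, 35, NA),
  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05)
)

# Count missing values per column, as colSums(is.na(df)) does above.
na_counts <- colSums(is.na(toy))

# Option 1: drop every row that has at least one missing value.
toy_complete <- na.omit(toy)

# Option 2 (alternative, not used in this report): fill Age with its median.
toy_imputed <- toy
toy_imputed$Age[is.na(toy_imputed$Age)] <- median(toy$Age, na.rm = TRUE)

na_counts
nrow(toy_complete)  # rows left after dropping
nrow(toy_imputed)   # rows left after imputing
```

Dropping rows keeps only observed values but shrinks the sample; imputing keeps every passenger at the cost of introducing estimated ages.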

Step 4: Data Scaling

Standardize the features so that each one contributes comparably to the distance calculation.

df_scaled <- scale(df)

head(df_scaled)

         Age       Fare    Pclass      SibSp      Parch
1 -0.5300051 -0.5186143  0.910594  0.5242027 -0.5055408
2  0.5714304  0.6914121 -1.475329  0.5242027 -0.5055408
3 -0.2546462 -0.5058589  0.910594 -0.5513166 -0.5055408
4  0.3649113  0.3478053 -1.475329  0.5242027 -0.5055408
5  0.3649113 -0.5034968  0.910594 -0.5513166 -0.5055408
7  1.6728659  0.3244205 -1.475329 -0.5513166 -0.5055408

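Converts each column to mean = 0 and SD = 1
Prevents features with large ranges, such as Fare, from dominating the distances
Row 6 is absent from the output because it was dropped for a missing Age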
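As a quick check of what scale() computes, the sketch below standardizes a small made-up vector by hand and compares the result with scale(). The vector x and the names z_manual and z_scale are illustrative assumptions, not part of the analysis.

```r
# A small made-up numeric vector.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; flatten it for comparison.
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # the two approaches should agree
mean(z_scale)                 # approximately 0
sd(z_scale)                   # approximately 1
```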

Step 5: Distance Matrix Calculation

Measure the dissimilarity between every pair of passengers.

dist_matrix <- dist(df_scaled, method = "euclidean")

Uses Euclidean distance
Smaller distance → more similar
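To make the distance concrete, the sketch below computes the Euclidean distance between the first two passengers by hand and compares it with dist(). The two vectors are copied from the head(df_scaled) output in Step 4; the names p1, p2, d_manual, and d_dist are illustrative.

```r
# Scaled feature values for passengers 1 and 2, copied from head(df_scaled).
p1 <- c(Age = -0.5300051, Fare = -0.5186143, Pclass =  0.910594,
        SibSp = 0.5242027, Parch = -0.5055408)
p2 <- c(Age =  0.5714304, Fare =  0.6914121, Pclass = -1.475329,
        SibSp = 0.5242027, Parch = -0.5055408)

# Euclidean distance: square root of the summed squared differences.
d_manual <- sqrt(sum((p1 - p2)^2))

# The same distance via dist(), which expects observations in rows.
d_dist <- as.numeric(dist(rbind(p1, p2), method = "euclidean"))

d_manual
d_dist
```

The two passengers have identical SibSp and Parch values, so the distance comes entirely from their differences in Age, Fare, and Pclass.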

Step 6: Hierarchical Clustering & Dendrogram

Apply hierarchical clustering and visualize the resulting merge tree.

hc <- hclust(dist_matrix, method = "ward.D2")

plot(hc, labels = FALSE, main = "Dendrogram of Passenger Similarity")

fviz_dend(hc, k = 4, rect = TRUE, main = "Hierarchical Clustering Dendrogram")

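Ward's method (ward.D2) merges, at each step, the two clusters whose union gives the smallest increase in within-cluster variance, which favors compact clusters
The dendrogram shows the order and height at which clusters merge
k = 4 cuts the tree into 4 clusters, which rect = TRUE outlines with rectangles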
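One common way to judge a linkage method is the cophenetic correlation, which measures how faithfully the dendrogram's merge heights preserve the original pairwise distances. The sketch below compares four linkage methods on R's built-in USArrests data rather than the Titanic file, so it runs without the CSV; the choice of dataset, the set of methods, and the names x, d, methods, and coph are assumptions for illustration, and this check is not part of the original analysis.

```r
# Built-in dataset used as a stand-in so the example runs without train.csv.
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# Linkage methods to compare (all are valid values of hclust's method argument).
methods <- c("ward.D2", "complete", "average", "single")

# Cophenetic correlation: correlation between the original distances and
# the heights at which each pair of observations is first merged.
coph <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})

round(coph, 3)
```

A higher value means the tree distorts the distances less. It is one criterion among several: Ward's method is often chosen for its compact, similarly sized clusters even when another linkage scores higher here.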

Step 7: Forming Clusters & Analysis

Cut the tree into clusters and summarize each one.

clusters <- cutree(hc, k = 4)

df$Cluster <- as.factor(clusters)

table(df$Cluster)


  1   2   3   4 
387 186  35 106

cluster_summary <- df %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Age = mean(Age),
    Avg_Fare = mean(Fare),
    Avg_Class = mean(Pclass),
    Avg_SibSp = mean(SibSp),
    Avg_Parch = mean(Parch)
  )

print(cluster_summary)

# A tibble: 4 × 6
  Cluster Avg_Age Avg_Fare Avg_Class Avg_SibSp Avg_Parch
  <fct>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
1 1         29.5      12.4      2.68     0.214     0    
2 2         38.2      88.0      1        0.452     0.409
3 3          7.21     31.7      3        3.66      1.57 
4 4         22.9      23.7      2.53     0.670     1.67

cutree() divides the passengers into 4 clusters
The cluster labels are added to df as a factor
group_by() and summarise() report the mean of each feature per cluster
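The report fixes k = 4 without showing how that number might be chosen. A minimal sketch of one option, the average silhouette width, follows. It uses the built-in USArrests data so that it is self-contained; the dataset, the candidate range 2 to 6, and the names hc_demo, ks, avg_sil, and best_k are assumptions for illustration, not the author's procedure.

```r
library(cluster)  # provides silhouette()

# Built-in dataset used as a stand-in so the example runs without train.csv.
x <- scale(USArrests)
d <- dist(x)
hc_demo <- hclust(d, method = "ward.D2")

# Candidate numbers of clusters.
ks <- 2:6

# Average silhouette width for each k: values near 1 indicate well-separated
# clusters, values near 0 indicate overlapping clusters.
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc_demo, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

round(avg_sil, 3)
best_k <- ks[which.max(avg_sil)]
best_k
```

Applied to the Titanic data, the same loop would take hc and dist_matrix in place of hc_demo and d.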

Step 8: Cluster Visualization & Conclusion

Visualize the clusters and interpret the results.

fviz_cluster(list(data = df_scaled, cluster = clusters),
             main = "Cluster Visualization of Passengers")

Displays the clusters in 2D by projecting the five scaled features onto the first two principal components
Helps assess how well the clusters separate
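When the data have more than two variables, fviz_cluster() draws the observations on their first two principal components. The sketch below reproduces that idea with base R only, again on the built-in USArrests data; the dataset and the names cl, pca, and var_explained are assumptions for illustration. It also reports how much variance the two plotted axes retain, which indicates how much of the separation the 2D view can show.

```r
# Built-in dataset used as a stand-in so the example runs without train.csv.
x <- scale(USArrests)
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 4)

# Principal component analysis of the already scaled data.
pca <- prcomp(x)

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Scatter plot of the first two components, colored by cluster.
plot(pca$x[, 1:2], col = cl, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]),
     main = "Clusters on the first two principal components")
```

If the first two components explain only a modest share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.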

Conclusion

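Hierarchical clustering with Ward's method was applied to 714 passengers using scaled Age, Fare, Pclass, SibSp, and Parch. Cutting the dendrogram at k = 4 gave groups that differ mainly in fare, class, and family structure. Cluster 1 (387 passengers) is predominantly third class, with the lowest average fare (12.4) and no parents or children aboard. Cluster 2 (186) is entirely first class, with the highest average fare (88.0) and age (38.2). Cluster 3 (35) is young third-class passengers (average age 7.2) traveling with several siblings. Cluster 4 (106) is younger passengers (average age 22.9) traveling with parents or children.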