Use dendrograms and hierarchical clustering to analyze similarity among passengers in the Titanic dataset based on selected attributes such as Age, Fare, Passenger Class, and family size.
Step 1: Loading Required Libraries
To load required libraries for data manipulation, clustering, and visualization.
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
Warning: package 'cluster' was built under R version 4.5.3
library(factoextra)
Warning: package 'factoextra' was built under R version 4.5.3
tidyverse: data manipulation and visualization
cluster: clustering algorithms
factoextra: dendrogram and cluster visualization
Step 2: Loading the Dataset
To import the Titanic dataset into R.
data <- read.csv("D:/College/Data visualization/titanic/train.csv")
head(data)
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
Name Sex Age SibSp Parch
1 Braund, Mr. Owen Harris male 22 1 0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
3 Heikkinen, Miss. Laina female 26 0 0
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
5 Allen, Mr. William Henry male 35 0 0
6 Moran, Mr. James male NA 0 0
Ticket Fare Cabin Embarked
1 A/5 21171 7.2500 S
2 PC 17599 71.2833 C85 C
3 STON/O2. 3101282 7.9250 S
4 113803 53.1000 C123 S
5 373450 8.0500 S
6 330877 8.4583 Q
str(data)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
read.csv() loads the dataset into a data frame
head() previews the first six rows
str() shows the structure and data type of each column
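As a small, self-contained illustration of these three functions, the sketch below writes a tiny stand-in CSV to a temporary file and reads it back. The file contents and the name toy are made up for the example and are not the real Titanic records. It also suggests why, in the output above, a missing Age appears as NA while a missing Cabin appears as an empty string: by default, read.csv() treats blank fields as missing in numeric columns but keeps them as "" in character columns.

```r
# Tiny stand-in file so the example runs anywhere (illustrative values only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Age,Fare,Cabin",
  "1,22,7.25,",
  "2,,71.28,C85",
  "3,26,7.93,"
), tmp)

# Read the file into a data frame, keeping text columns as character.
toy <- read.csv(tmp, stringsAsFactors = FALSE)

head(toy, 2)  # preview the first two rows
str(toy)      # column names, types, and leading values

# Blank numeric field becomes NA; blank character field stays "".
is.na(toy$Age[2])
toy$Cabin[1] == ""
```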
Step 3: Data Preprocessing
To clean data and select relevant features.
df <- data %>% select(Age, Fare, Pclass, SibSp, Parch)
colSums(is.na(df))
Age Fare Pclass SibSp Parch
177 0 0 0 0
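Only Age has missing values, so dropping incomplete rows should leave 891 - 177 = 714 passengers. The sketch below shows, on a made-up data frame (toy, with illustrative values rather than real records), how na.omit() performs this listwise deletion and how complete.cases() can be used to check the row count beforehand.

```r
# Illustrative data with two missing ages (not real Titanic records).
toy <- data.frame(
  Age  = c(22, NA, 26, 35, NA),
  Fare = c(7.25, 8.46, 7.93, 53.10, 13.00)
)

# Missing values per column, as in the report.
colSums(is.na(toy))

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(toy))

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)

nrow(toy_clean) == n_complete
```

Listwise deletion is simple, but it discards the other attributes of passengers whose age is unknown; imputing Age would be an alternative that this report does not use.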
df <- na.omit(df)
summary(df)
Age Fare Pclass SibSp
Min. : 0.42 Min. : 0.00 Min. :1.000 Min. :0.0000
1st Qu.:20.12 1st Qu.: 8.05 1st Qu.:1.000 1st Qu.:0.0000
Median :28.00 Median : 15.74 Median :2.000 Median :0.0000
Mean :29.70 Mean : 34.69 Mean :2.237 Mean :0.5126
3rd Qu.:38.00 3rd Qu.: 33.38 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :80.00 Max. :512.33 Max. :3.000 Max. :5.0000
Parch
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.4314
3rd Qu.:1.0000
Max. :6.0000
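The fviz_cluster() call that follows relies on two objects, df_scaled and clusters, whose construction does not appear in this write-up. The sketch below is one plausible way to produce them and to draw the dendrogram, not necessarily the method originally used. It assumes standardized features, Euclidean distance, Ward linkage ("ward.D2"), and a cut into k = 3 clusters; all four choices are assumptions. The fallback data frame holds illustrative values only, so that the block runs on its own.

```r
# Fallback toy data (illustrative values, not real Titanic records).
# Without the report's data frame, the name df resolves to the stats::df
# function, so is.data.frame() is FALSE and the toy data is used instead.
if (!is.data.frame(df)) {
  df <- data.frame(
    Age    = c(22, 38, 26, 35, 54, 2, 27, 14),
    Fare   = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
    Pclass = c(3, 1, 3, 1, 1, 3, 3, 2),
    SibSp  = c(1, 1, 0, 1, 0, 3, 0, 1),
    Parch  = c(0, 0, 0, 0, 0, 1, 2, 0)
  )
}

# Standardize each column so Fare (up to 512) does not dominate the distances.
df_scaled <- scale(df)

# Pairwise Euclidean distances between passengers.
dist_mat <- dist(df_scaled, method = "euclidean")

# Agglomerative hierarchical clustering (linkage method is an assumption).
hc <- hclust(dist_mat, method = "ward.D2")

# Dendrogram; leaf labels are hidden because there are many passengers.
plot(hc, labels = FALSE, hang = -1,
     main = "Dendrogram of Passengers", xlab = "", sub = "")

# Cut the tree into k groups (k = 3 is an assumption) and outline them.
k <- 3
rect.hclust(hc, k = k, border = 2:4)
clusters <- cutree(hc, k = k)

table(clusters)  # cluster sizes
```

Large vertical gaps between merges in the dendrogram are the usual guide for choosing k, so the value above should be revisited after inspecting the plot.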
fviz_cluster(list(data = df_scaled, cluster = clusters), main = "Cluster Visualization of Passengers")
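Displays the clusters in 2D; with more than two variables, fviz_cluster() plots the first two principal components
Helps assess how well the clusters are separated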
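To make the 2D view less of a black box, the sketch below reproduces the underlying idea with base R only: run PCA on the standardized features, keep the first two components, and color the points by cluster. It uses a made-up data frame (toy, with illustrative values) and assumes Ward linkage with three clusters; it approximates what the plot shows and is not the internals of fviz_cluster().

```r
# Illustrative data (not real Titanic records).
toy <- data.frame(
  Age    = c(22, 38, 26, 35, 54, 2, 27, 14),
  Fare   = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07),
  Pclass = c(3, 1, 3, 1, 1, 3, 3, 2),
  SibSp  = c(1, 1, 0, 1, 0, 3, 0, 1),
  Parch  = c(0, 0, 0, 0, 0, 1, 2, 0)
)

# Standardize, cluster, and cut into three groups (assumed settings).
toy_scaled <- scale(toy)
grp <- cutree(hclust(dist(toy_scaled), method = "ward.D2"), k = 3)

# PCA on the standardized data; keep the scores on the first two components.
pca <- prcomp(toy_scaled)
pcs <- pca$x[, 1:2]

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)

# Scatter plot of the 2D projection, colored by cluster membership.
plot(pcs, col = grp, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters on the First Two Principal Components")
```

If the first two components explain only a modest share of the variance, clusters that overlap in this plot may still be well separated in the full feature space.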
Conclusion
Hierarchical clustering was applied to analyze similarity among Titanic passengers using Age, Fare, Pclass, SibSp, and Parch, after removing the records with a missing Age. The dendrogram and cluster visualization revealed distinct groups based on fare, class, and family structure.