This analysis explores how four numeric variables in the Titanic dataset (Age, SibSp, Parch, and Fare) relate to one another, using the correlation matrix, the covariance matrix, and an eigendecomposition of the covariance matrix.
# Load the dataset
Titanic.Dataset <- read.csv("C:/Users/LOQ/Downloads/archive/Titanic-Dataset.csv")
# Select specific variables
vars <- c("Age", "SibSp", "Parch", "Fare")
dat <- Titanic.Dataset[, vars]
# Handle missing values: keep only rows where all four variables are present
dat_clean <- dat[complete.cases(dat), ]
# Check the first few rows
head(dat_clean)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 7  54     0     0 51.8625
summary(dat_clean)
##       Age            SibSp            Parch             Fare
##  Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00
##  1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  8.05
##  Median :28.00   Median :0.0000   Median :0.0000   Median : 15.74
##  Mean   :29.70   Mean   :0.5126   Mean   :0.4314   Mean   : 34.69
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 33.38
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000   Max.   :512.33
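Before relying on the complete-case filter, it can help to see how many values are missing per column and how many rows the filter removes. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the Titanic file) with the same four column names:

```r
# Hypothetical toy data standing in for the four selected columns
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 2),
  Fare  = c(7.25, 71.28, 7.93, NA, 8.05)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows that survive listwise deletion, and how many were dropped
toy_clean <- toy[complete.cases(toy), ]
n_dropped <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(n_dropped)
```

Running the same two checks on `dat` would show how much of the original sample the later statistics are based on.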
# Pearson correlation matrix
cor_matrix <- cor(dat_clean)
cor_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
Conclusion: The strongest relationship is between the family-related variables SibSp and Parch (r ≈ 0.38), followed by a moderate negative correlation between Age and SibSp (r ≈ −0.31). Age and Fare are almost uncorrelated (r ≈ 0.10).
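To make the ranking of relationships explicit, the pairwise correlations can be listed once each and sorted by absolute value. This is a minimal sketch that re-enters the correlation matrix printed above by hand so it runs on its own; the name `pair_tbl` is an illustrative choice, and in the actual session the existing `cor_matrix` object could be used directly.

```r
# Correlation matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_matrix <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort by absolute correlation, strongest first
pair_tbl <- pair_tbl[order(abs(pair_tbl$r), decreasing = TRUE), ]
print(pair_tbl)
```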
# Sample covariance matrix (variances on the diagonal)
cov_matrix <- cov(dat_clean)
cov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
Diagonal (Variance): Each diagonal entry is the variance of a variable, in squared units (e.g., years² for Age, currency² for Fare).
- Fare variance ≈ 2800.41 → much larger than that of Age (≈ 211.02), SibSp (≈ 0.86), or Parch (≈ 0.73), indicating that ticket prices varied far more widely, in raw units, than the other features.
Off-diagonal (Covariance): These values show the direction of each relationship, but their magnitude depends on the variables’ scales, so they cannot be compared directly.
- Cov(Age, SibSp) = −4.16 → negative relationship: older passengers tended to have fewer siblings/spouses aboard.
- Cov(Age, Fare) = 73.85 → positive but scale-driven: the correlation is weak (r ≈ 0.10), yet Fare’s large variance inflates the covariance value.
Conclusion: Variance highlights Fare as the most dispersed variable, while covariance mainly signals direction of relationships but is less interpretable without considering scale.
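The link between the two matrices is that a correlation is a covariance divided by the two standard deviations, r(X, Y) = Cov(X, Y) / (sd(X) · sd(Y)). The sketch below re-enters the covariance matrix printed above (so that it runs on its own) and shows that the large Cov(Age, Fare) collapses to the small correlation reported earlier once scale is removed:

```r
# Covariance matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal entries
sds <- sqrt(diag(cov_matrix))

# r(Age, Fare) = Cov(Age, Fare) / (sd(Age) * sd(Fare))
r_age_fare <- cov_matrix["Age", "Fare"] / (sds[["Age"]] * sds[["Fare"]])

# cov2cor() applies the same rescaling to every entry at once
cor_from_cov <- cov2cor(cov_matrix)

print(round(r_age_fare, 4))
print(round(cor_from_cov, 4))
```

Up to rounding in the printed values, `cor_from_cov` should reproduce the correlation matrix shown earlier.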
# Eigendecomposition of the covariance matrix
eig <- eigen(cov_matrix)
# Eigenvalues
eig$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
# Eigenvectors
eig$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
# Percentage of total variance explained by each component
eig$values / sum(eig$values) * 100
## [1] 93.0149541 6.9378309 0.0313266 0.0158884
Eigenvalues:
- λ₁ ≈ 2802.56 → ~93% of total variance
- λ₂ ≈ 209.04 → ~7%
- λ₃ ≈ 0.94 → ~0.03%
- λ₄ ≈ 0.48 → ~0.02%
- Total ≈ 3013.02 → equals the trace of the covariance matrix (the sum of the four variances), as it must for a correct decomposition.
Dimensionality Reduction:
- PC1 + PC2 together explain ~99.95% of the variance.
- The data can be reduced to 2 dimensions with negligible loss of variance, as measured in the variables’ original units.
Eigenvectors (Loadings):
- PC1 dominated by Fare (loading ≈ 0.9996) → essentially a “Fare component.”
- PC2 dominated by Age (loading ≈ 0.9993) → essentially an “Age component.”
- PC3 & PC4 reflect SibSp and Parch combinations, but their eigenvalues are tiny, so they explain almost no variance.
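The three readings above (trace check, cumulative variance, dominant loadings) can be reproduced in a few lines. This is a minimal sketch that re-enters the printed covariance matrix so it runs on its own; the names `load_mat` and `dominant` are illustrative. Because `eigen()` returns unlabeled vectors, the sketch assumes the rows follow the column order of the input (Age, SibSp, Parch, Fare) and attaches those names.

```r
# Covariance matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)
eig <- eigen(cov_matrix, symmetric = TRUE)

# The sum of the eigenvalues should equal the trace (sum of the variances)
total_variance <- sum(diag(cov_matrix))
trace_matches <- isTRUE(all.equal(sum(eig$values), total_variance))

# Cumulative percentage of variance explained by the leading components
cumulative <- cumsum(eig$values) / total_variance * 100

# Label rows with variable names and columns with component names
load_mat <- eig$vectors
dimnames(load_mat) <- list(vars, paste0("PC", seq_along(vars)))

# Variable with the largest absolute loading on each component
# (abs() is used because the sign of an eigenvector is arbitrary)
dominant <- vars[apply(abs(load_mat), 2, which.max)]
names(dominant) <- colnames(load_mat)

print(trace_matches)
print(round(cumulative, 2))
print(dominant)
```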
Conclusion: PCA on the covariance matrix shows that most of the dataset’s variability is driven by Fare (PC1), followed by Age (PC2). Family-related variables (SibSp, Parch) contribute only marginal variance, making them negligible for dimensionality reduction on the unstandardized data.
Fare, which reflects socio-economic class, accounts for by far the largest share of variance in the dataset, overshadowing age and the family-related variables. Because the decomposition uses raw covariances, this dominance partly reflects Fare’s much larger numeric scale.
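Since a covariance-based decomposition is sensitive to the units of each variable, a common cross-check is to repeat the eigendecomposition on the correlation matrix, which is equivalent to standardizing every variable first. This step is not part of the analysis above; it is a hedged sketch that re-enters the printed correlation matrix, and `eig_cor` and `explained_cor` are illustrative names.

```r
# Correlation matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_matrix <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Eigendecomposition of the correlation matrix (standardized variables)
eig_cor <- eigen(cor_matrix, symmetric = TRUE)

# Each standardized variable has variance 1, so the eigenvalues sum to 4
explained_cor <- eig_cor$values / sum(eig_cor$values) * 100

# Label the loadings so each component can be read by variable
load_cor <- eig_cor$vectors
dimnames(load_cor) <- list(vars, paste0("PC", seq_along(vars)))

print(round(explained_cor, 1))
print(round(load_cor, 3))
```

Comparing these percentages with the covariance-based ones indicates how much of Fare’s dominance comes from its scale rather than from its relationships with the other variables.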