Introduction

This study explores the Titanic dataset to investigate how the variables Age, SibSp, Parch, and Fare are interrelated. The analysis uses correlation matrices, variance-covariance matrices, and eigen-decomposition.

1. Data Preparation

# Load the dataset
Titanic.Dataset <- read.csv("C:/Users/LOQ/Downloads/archive/Titanic-Dataset.csv")

# Select specific variables
vars <- c("Age", "SibSp", "Parch", "Fare")
dat <- Titanic.Dataset[, vars]

# Handle missing values
dat_clean <- dat[complete.cases(dat), ]

# Check the first few rows
head(dat_clean)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 7  54     0     0 51.8625
summary(dat_clean)
##       Age            SibSp            Parch             Fare       
##  Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  8.05  
##  Median :28.00   Median :0.0000   Median :0.0000   Median : 15.74  
##  Mean   :29.70   Mean   :0.5126   Mean   :0.4314   Mean   : 34.69  
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 33.38  
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000   Max.   :512.33
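
The row names in the head() output jump from 5 to 7, which suggests that row 6 had a missing value and was removed by complete.cases(). A minimal sketch of how the effect of this listwise deletion could be checked is shown below. The `toy` data frame and its values are made up for illustration and are not taken from the Titanic file.

```r
# Hypothetical toy data standing in for the four selected columns
# (values are illustrative only, not the real Titanic records)
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 0, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 2),
  Fare  = c(7.25, 8.46, 7.925, 53.1, 21.075)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Listwise deletion: keep only rows with no missing values
toy_clean <- toy[complete.cases(toy), ]

# Number of rows removed by the deletion
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_counts)
print(rows_dropped)
```

Running the same two checks on `dat` and `dat_clean` would show which variable is responsible for the dropped rows and how many observations the later matrices are based on.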

2. Statistical Analysis

2.1 Correlation Matrix

cor_matrix <- cor(dat_clean)
cor_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
  • Age vs SibSp (−0.308) → moderate negative correlation: older passengers tended to travel with fewer siblings/spouses.
  • Age vs Parch (−0.189) → weak negative correlation.
  • Age vs Fare (0.096) → almost no linear relationship between age and ticket price.
  • SibSp vs Parch (0.384) → moderate positive correlation: larger families often had both siblings/spouses and parents/children on board.
  • SibSp vs Fare (0.138), Parch vs Fare (0.205) → small positive correlations: family size had only a minor link to ticket price.

Conclusion: The strongest relationship is between the family-related variables (SibSp–Parch, 0.384), followed by Age–SibSp (−0.308). Age and Fare are nearly uncorrelated (0.096).
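
The ranking of pairwise relationships described above can be sketched as follows. The matrix `cor_printed` is re-entered by hand from the printed output (so it carries the printed rounding), and the names `cor_printed`, `pairs_idx`, and `ranked` are introduced here only for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix copied from the printed output above
cor_printed <- matrix(c(
   1.00000000, -0.30824680, -0.18911930, 0.09606669,
  -0.30824680,  1.00000000,  0.38381990, 0.13832879,
  -0.18911930,  0.38381990,  1.00000000, 0.20511888,
   0.09606669,  0.13832879,  0.20511888, 1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Index each variable pair once (upper triangle only)
pairs_idx <- which(upper.tri(cor_printed), arr.ind = TRUE)

# Build a table of pairs and sort by absolute correlation
ranked <- data.frame(
  var1 = vars[pairs_idx[, "row"]],
  var2 = vars[pairs_idx[, "col"]],
  r    = cor_printed[pairs_idx]
)
ranked <- ranked[order(-abs(ranked$r)), ]

print(ranked)
```

Sorting by absolute value keeps negative relationships such as Age–SibSp in the ranking instead of pushing them to the bottom.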

2.2 Variance-Covariance Matrix

cov_matrix <- cov(dat_clean)
cov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

Diagonal (Variance): Each diagonal entry shows the variance of a variable in squared units (e.g., years² for Age, currency² for Fare).
  • Fare variance ≈ 2800.41 → much larger than Age, SibSp, or Parch, indicating ticket prices varied far more widely than other features.

Off-diagonal (Covariance): These values show the direction of each relationship, but their magnitude depends on the variables' scales, so they cannot be compared directly.
  • Cov(Age, SibSp) = −4.16 → negative relationship: older passengers tended to have fewer siblings/spouses aboard.
  • Cov(Age, Fare) = 73.85 → positive but scale-driven: the correlation is weak (0.096), yet Fare's large variance inflates the covariance value.

Conclusion: Variance highlights Fare as the most dispersed variable, while covariance mainly signals direction of relationships but is less interpretable without considering scale.
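
To illustrate why covariance magnitudes are scale-driven, the sketch below rescales the covariance matrix by the standard deviations and recovers the correlations. The matrix `cov_printed` is re-entered by hand from the printed output, so results agree with the correlation matrix only up to printed rounding; the object names are introduced here for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the printed output above
cov_printed <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339, 0.8644973,  0.3045128,    6.8062117,
   -2.3441911, 0.3045128,  0.7281027,    9.2621760,
   73.849030,  6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Manual rescaling: cor(X, Y) = cov(X, Y) / (sd(X) * sd(Y))
sd_vec     <- sqrt(diag(cov_printed))
cor_manual <- cov_printed / outer(sd_vec, sd_vec)

# Same conversion using the built-in helper from the stats package
cor_from_cov <- cov2cor(cov_printed)

print(round(cor_from_cov, 4))
```

Dividing Cov(Age, Fare) = 73.85 by the product of the two standard deviations shrinks it to roughly 0.096, which is why the large covariance does not indicate a strong relationship.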

3. Eigen-Decomposition

eig <- eigen(cov_matrix)
# Eigenvalues
eig$values
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214
# Eigenvectors
eig$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
# Percentage
eig$values / sum(eig$values) * 100
## [1] 93.0149541  6.9378309  0.0313266  0.0158884

Eigenvalues:
  • λ₁ ≈ 2802.56 → ~93% of total variance
  • λ₂ ≈ 209.04 → ~7%
  • λ₃ ≈ 0.94 → ~0.03%
  • λ₄ ≈ 0.48 → ~0.02%
  • Total ≈ 3013.02 → equals the trace of the covariance matrix (the sum of the four variances), as expected for an eigen-decomposition.

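Dimensionality Reduction:
  • PC1 + PC2 together explain ~99.95% of the total variance.
  • The data can be reduced to 2 dimensions with negligible loss of variance. Because the decomposition uses the unstandardized covariance matrix, this result reflects the large scales of Fare and Age.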
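
The trace check and the cumulative share of variance can be reproduced from the printed numbers alone. This is a small sketch; `eig_values` and `variances` are copied by hand from the output above, and the variable names are introduced here for illustration.

```r
# Eigenvalues and covariance diagonal copied from the printed output above
eig_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)
variances  <- c(Age = 211.019125, SibSp = 0.8644973,
                Parch = 0.7281027, Fare = 2800.413100)

# The sum of the eigenvalues should equal the trace of the covariance matrix
total_from_eig   <- sum(eig_values)
total_from_trace <- sum(variances)

# Percentage of variance per component and its running total
pct     <- eig_values / total_from_eig * 100
cum_pct <- cumsum(pct)

print(c(eigen_sum = total_from_eig, trace = total_from_trace))
print(round(cum_pct, 2))
```

The second entry of `cum_pct` is the share of variance retained when only PC1 and PC2 are kept.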

Eigenvectors (Loadings):
  • PC1 dominated by Fare (loading ≈ 0.9996) → essentially a “Fare component.”
  • PC2 dominated by Age (loading ≈ 0.9993) → essentially an “Age component.”
  • PC3 & PC4 reflect SibSp and Parch combinations, but their eigenvalues are tiny, so they explain almost no variance.
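
The reading of the loadings above can be checked with a short sketch. It assumes the covariance matrix is re-entered by hand from the printed output (`cov_printed`), labels the eigenvector rows with the variable names (eigen() returns them unlabelled), and picks the variable with the largest absolute loading in each component. Eigenvector signs are arbitrary, so absolute values are used.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the printed output above
cov_printed <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339, 0.8644973,  0.3045128,    6.8062117,
   -2.3441911, 0.3045128,  0.7281027,    9.2621760,
   73.849030,  6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

eig_chk  <- eigen(cov_printed, symmetric = TRUE)
loadings <- eig_chk$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))

# Variable with the largest absolute loading in each component
dominant <- vars[apply(abs(loadings), 2, which.max)]
names(dominant) <- colnames(loadings)

# Defining property of an eigenpair: Sigma %*% v = lambda * v
v1        <- loadings[, "PC1"]
residual1 <- max(abs(cov_printed %*% v1 - eig_chk$values[1] * v1))

print(dominant)
print(residual1)
```

A residual close to zero confirms that the first column is an eigenvector of the covariance matrix with eigenvalue λ₁.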

Conclusion: The eigen-decomposition of the unstandardized covariance matrix shows that most of the dataset's variability is driven by Fare (PC1), followed by Age (PC2). The family-related variables (SibSp, Parch) contribute only marginal variance on their original scales, making them negligible in this dimensionality reduction.

4. Conclusion

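Ticket Fare, a proxy for socio-economic class, accounted for most of the variance in the dataset, overshadowing Age and the family-related variables (SibSp, Parch). The strongest linear relationship among the four variables was between SibSp and Parch.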