Load Data

load("Titanic_Task_Data.RData")

1. Data Preparation

Selected_Dataset <- Titanic_Dataset[, c("Age", "SibSp", "Parch", "Fare")]
Cleaned_Dataset <- na.omit(Selected_Dataset)
str(Cleaned_Dataset)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 714 obs. of  4 variables:
##  $ Age  : num  22 38 26 35 35 54 2 27 14 4 ...
##  $ SibSp: num  1 1 0 1 0 0 3 0 1 1 ...
##  $ Parch: num  0 0 0 0 0 0 1 2 0 1 ...
##  $ Fare : num  7.25 71.28 7.92 53.1 8.05 ...
##  - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
##   ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...

This step selects 4 variables from the Titanic dataset:

  1. Age: Passenger’s age in years
  2. SibSp: Number of siblings/spouses aboard
  3. Parch: Number of parents/children aboard
  4. Fare: Ticket price in British pounds

Missing values were removed using na.omit(), resulting in 714 complete observations from 891 original records.
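The effect of na.omit() can be checked by counting missing values per column and complete rows beforehand. A minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative assumptions, not the Titanic data):

```r
# Toy data frame standing in for Selected_Dataset (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# na.omit() keeps exactly those complete rows
toy_clean <- na.omit(toy)

na_per_column
n_complete
nrow(toy_clean)
```

Running the same three checks on Selected_Dataset would show which columns contribute the dropped rows.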

2. Correlation Matrix

correlation_matrix <- cor(Cleaned_Dataset)
correlation_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

Result

The correlation matrix shows:

  1. Age vs SibSp (-0.308): Older passengers tend to travel with fewer siblings/spouses
  2. SibSp vs Parch (0.384): Passengers with siblings/spouses also tend to have parents/children aboard (family groups)
  3. Fare: Weak positive correlations with all other variables (0.096 to 0.205), indicating ticket price has little linear association with age or family size

All correlations are weak to moderate (the largest in absolute value is 0.384), meaning the variables measure different aspects of passenger characteristics.

3. Variance-Covariance Matrix

cov_matrix <- cov(Cleaned_Dataset)
cov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

Result

The diagonal of the variance-covariance matrix gives the variance of each variable:

  1. Age: 211.02 (moderate variability, standard deviation of about 14.5 years)
  2. SibSp: 0.865 (low - most passengers have 0-2 siblings/spouses)
  3. Parch: 0.728 (low - most passengers have 0-2 parents/children)
  4. Fare: 2800.41 (extremely high, standard deviation of about 52.9 - ticket prices vary dramatically)

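The Fare variable has by far the highest variance: about 13 times that of Age and more than 3,000 times that of SibSp or Parch. Because the eigen-decomposition in the next step is applied to the covariance matrix rather than to standardized variables, Fare will dominate the principal component analysis.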
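The relationship between this matrix and the correlation matrix from step 2 can be checked directly: each correlation is a covariance divided by the product of the two standard deviations. The sketch below rebuilds the covariance matrix from the values printed above so that it runs without the data file (`cov_printed` is a name introduced here for illustration) and rescales it with base R's cov2cor():

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the printed output above
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal entries
std_devs <- sqrt(diag(cov_printed))

# Rescale: cor[i, j] = cov[i, j] / (sd[i] * sd[j])
cor_from_cov <- cov2cor(cov_printed)

round(std_devs, 2)
round(cor_from_cov, 3)
```

The rescaled matrix should agree with correlation_matrix up to rounding. Running eigen() on it instead of on the covariance matrix would give every variable unit variance, which removes the scale advantage of Fare.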

4. Eigen Values and Vectors

eigen_result <- eigen(cov_matrix)
eigen_result$values
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214
eigen_result$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

Eigenvalues

Each eigenvalue is the variance captured by the corresponding principal component, and together they sum to the total variance (about 3013.02). The first eigenvalue (2802.564) is far larger than the others and accounts for about 93.0% of the total variance. The second eigenvalue (209.039) is much smaller, at about 6.9%, but still notable compared to the third and fourth. Those last two eigenvalues (0.944 and 0.479) together account for less than 0.1%, indicating that these components capture minimal variance.

The large difference between the first eigenvalue and the others suggests that one principal component dominates the data structure.
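The share of variance carried by each component is its eigenvalue divided by the sum of all eigenvalues. A short sketch using the eigenvalues printed above (the names `eigenvalues`, `prop_var` and `cum_var` are introduced here for illustration):

```r
# Eigenvalues copied from the printed output above
eigenvalues <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Proportion of total variance explained by each component
prop_var <- eigenvalues / sum(eigenvalues)

# Cumulative proportion across components
cum_var <- cumsum(prop_var)

round(100 * prop_var, 2)
round(100 * cum_var, 2)
```

With the original objects in the session, the same calculation would be eigen_result$values / sum(eigen_result$values).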

Eigenvectors

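The rows of the eigenvector matrix follow the variable order Age, SibSp, Parch, Fare, and each column is one principal component. The sign of an eigenvector is arbitrary, so only the relative signs within a column matter. The eigenvectors show:

  1. PC1 is dominated by Fare (coefficient = 0.9996), meaning the first principal component largely represents the Fare variable.
  2. PC2 is dominated by Age (coefficient = 0.9993), meaning the second principal component is essentially the Age variable.
  3. PC3 loads on SibSp (-0.774) and Parch (-0.633) with the same sign, so it represents overall family size aboard.
  4. PC4 contrasts SibSp (0.633) with Parch (-0.774), separating passengers travelling with siblings/spouses from those travelling with parents/children.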
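Because eigen() returns an unlabelled matrix, it is easy to misread which variable a coefficient belongs to. The sketch below attaches variable and component names to the eigenvectors and picks out the largest absolute loading per component. It rebuilds the covariance matrix from the printed values so it runs without the data file; `cov_printed`, `loadings` and `dominant` are names introduced here for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the printed output in step 3
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# symmetric = TRUE tells eigen() to treat the matrix as symmetric,
# ignoring tiny rounding differences between the two triangles
eig <- eigen(cov_printed, symmetric = TRUE)

# Rows are variables, columns are principal components
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))

# Variable with the largest absolute loading on each component
# (absolute value because eigenvector signs are arbitrary)
dominant <- vars[apply(abs(loadings), 2, which.max)]
names(dominant) <- colnames(loadings)

round(loadings, 3)
dominant
```

In the original session, the same labels could be applied to eigen_result$vectors with rownames() and colnames().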