load("Titanic_Task_Data.RData")
Selected_Dataset <- Titanic_Dataset[, c("Age", "SibSp", "Parch", "Fare")]
Cleaned_Dataset <- na.omit(Selected_Dataset)
str(Cleaned_Dataset)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 714 obs. of 4 variables:
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
## $ SibSp: num 1 1 0 1 0 0 3 0 1 1 ...
## $ Parch: num 0 0 0 0 0 0 1 2 0 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
## ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
This step selects four variables from the Titanic dataset:
1. Age: Passenger’s age in years
2. SibSp: Number of siblings/spouses aboard
3. Parch: Number of parents/children aboard
4. Fare: Ticket price in British pounds

Missing values were removed using na.omit(), resulting in 714 complete observations from 891 original records.
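As a minimal sketch of what na.omit() does here, the example below uses a small made-up data frame (the `toy` values are hypothetical and are not Titanic records). It counts the rows with at least one missing value and confirms that the cleaned data frame has that many fewer rows, which is the same check that gives 891 - 177 = 714 above.

```r
# Hypothetical toy data for illustration only (not the Titanic records)
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, 53.10)
)

# Rows containing at least one NA are the ones na.omit() will drop
n_missing_rows <- sum(!complete.cases(toy))

# na.omit() performs listwise deletion: any row with an NA is removed
cleaned <- na.omit(toy)

# The row counts should reconcile: original - dropped = remaining
nrow(toy) - n_missing_rows == nrow(cleaned)
```

Because na.omit() drops a row if any selected column is missing, it helps to select the columns of interest first, as done above, so that missing values in unused columns do not remove extra rows.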
correlation_matrix <- cor(Cleaned_Dataset)
correlation_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
The correlation matrix shows:
1. Age vs SibSp (-0.308): Older passengers tend to travel with fewer siblings/spouses
2. SibSp vs Parch (0.384): Passengers with siblings/spouses aboard also tend to have parents/children aboard (family groups)
3. Fare: Shows only weak positive correlations with the other variables (0.096 to 0.205), indicating that ticket price has little linear association with age or family size
All correlations are weak to moderate, meaning the variables measure different aspects of passenger characteristics.
cov_matrix <- cov(Cleaned_Dataset)
cov_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
The variance-covariance matrix shows the following variances on its diagonal:
1. Age: 211.02 (moderate variability)
2. SibSp: 0.865 (low - most passengers have 0-2 siblings/spouses)
3. Parch: 0.728 (low - most passengers have 0-2 parents/children)
4. Fare: 2800.41 (extremely high - ticket prices vary dramatically)
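The Fare variable has by far the highest variance (2800.41, compared with 211.02 for Age and less than 1 for SibSp and Parch). Because the principal component analysis that follows is based on the covariance matrix, which is sensitive to the scale of each variable, Fare will dominate the first principal component.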
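To illustrate this scale sensitivity, the sketch below re-enters the covariance matrix from the printed output (rounded as shown, so results are approximate), converts it to a correlation matrix with cov2cor(), and compares the share of variance taken by each component under the two matrices. This is an illustrative comparison, not a step from the original analysis.

```r
# Covariance values copied from the printed cov() output (rounded as printed)
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
    -4.1633339,  0.8644973,  0.3045128,    6.8062117,
    -2.3441911,  0.3045128,  0.7281027,    9.2621760,
    73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# cov2cor() rescales each entry by the two standard deviations involved,
# which should recover the correlation matrix shown earlier
cor_matrix <- cov2cor(cov_matrix)

# Share of total variance per component: unstandardized vs standardized
share_cov <- eigen(cov_matrix, symmetric = TRUE)$values / sum(diag(cov_matrix))
share_cor <- eigen(cor_matrix, symmetric = TRUE)$values / sum(diag(cor_matrix))

round(cor_matrix, 4)
round(share_cov, 4)
round(share_cor, 4)
```

With the covariance matrix, the first component takes almost all of the variance because Fare is measured on a much larger scale. With the correlation matrix, every variable has unit variance, so the variance is spread far more evenly across components. Which matrix is appropriate depends on whether the original units are meant to carry weight in the analysis.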
eigen_result <- eigen(cov_matrix)
eigen_result$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
eigen_result$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
The eigenvalues represent the amount of variance captured by each principal component, and they sum to the total variance of the four variables (about 3013.02). The first eigenvalue (2802.564) is far larger than the others and accounts for about 93.0% of the total variance. The second eigenvalue (209.039) is much smaller, at about 6.9% of the total, but is still notable compared to the third and fourth eigenvalues. Lastly, the third and fourth eigenvalues (0.944 and 0.479) are very small, together accounting for less than 0.1% of the total, indicating that these components capture minimal variance.
The large difference between the first eigenvalue and the others suggests that one principal component dominates the data structure.
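The proportion of variance attributed to each component can be computed directly from the eigenvalues. The sketch below re-enters the eigenvalues printed above; the names `prop_var`, `cum_var`, and `variance_table` are introduced here for illustration only.

```r
# Eigenvalues copied from the printed eigen_result$values output
eigenvalues <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Each eigenvalue divided by their sum gives that component's share of variance
prop_var <- eigenvalues / sum(eigenvalues)

# Running total: how much variance the first k components capture together
cum_var <- cumsum(prop_var)

variance_table <- data.frame(
  Component  = paste0("PC", seq_along(eigenvalues)),
  Eigenvalue = eigenvalues,
  Proportion = round(prop_var, 4),
  Cumulative = round(cum_var, 4)
)
print(variance_table)
```

With these eigenvalues, the first component accounts for about 93.0% of the total variance, and the first two together account for more than 99.9%.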
The eigenvectors show: