titanic <- read.csv("data/Titanic-Dataset.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(titanic)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
dim(titanic)
## [1] 891 12
colnames(titanic)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
data_mv <- titanic[, c("Age", "SibSp", "Parch", "Fare")]
str(data_mv)
## 'data.frame': 891 obs. of 4 variables:
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp: int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch: int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
colSums(is.na(data_mv))
## Age SibSp Parch Fare
## 177 0 0 0
data_mv_clean <- na.omit(data_mv)
colSums(is.na(data_mv_clean))
## Age SibSp Parch Fare
## 0 0 0 0
dim(data_mv_clean)
## [1] 714 4
cor_matrix <- cor(data_mv_clean)
cor_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
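The correlation matrix shows that all pairwise linear relationships are weak to moderate. The strongest is between SibSp and Parch (r = 0.38), which is expected because both variables count accompanying family members. Age is negatively correlated with both family variables (r = −0.31 with SibSp and r = −0.19 with Parch), so younger passengers tended to travel with more relatives. Age and Fare are only weakly related (r = 0.10), and Fare is weakly positively related to SibSp (r = 0.14) and Parch (r = 0.21). Passenger age therefore has little linear association with ticket fare, while its association with the number of family members is moderate at most. All coefficients are computed on the 714 passengers with a recorded age.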
pairs(data_mv_clean)
The scatter plot matrix is consistent with the correlation matrix: no pair of variables shows a strong linear pattern. SibSp and Parch show the clearest positive association, although both take only a few integer values, so many points overlap.
cov_matrix <- cov(data_mv_clean)
cov_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
Based on the variance–covariance matrix, Fare has by far the largest variance, at 2800.41 (a standard deviation of about 52.9), so ticket fares differ widely among passengers. The Age variance of 211.02 (a standard deviation of about 14.5 years) also indicates considerable spread in passenger ages. The SibSp and Parch variances are small (0.86 and 0.73), indicating that the number of siblings/spouses or parents/children aboard is concentrated at small values. Each variance is expressed in the squared units of its own variable, so these magnitudes are not directly comparable across variables.
The covariance between SibSp and Parch is positive (0.3045), indicating that passengers with more siblings or spouses aboard also tended to travel with more parents or children. The covariances between Age and SibSp (−4.16) and between Age and Parch (−2.34) are negative, indicating that younger passengers tended to travel with more family members. Fare has a positive covariance with both SibSp (6.81) and Parch (9.26), so passengers traveling with more family members tended to pay higher fares. The size of a covariance depends on the units of both variables, so the strength of these relationships should be read from the correlation matrix, where the correlations of Fare with SibSp and Parch are only 0.14 and 0.21. Overall, the variance–covariance matrix provides an initial overview of the patterns of variation and co-variation between variables in their original units.
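To compare the strength of these relationships on a common scale, a covariance matrix can be rescaled to a correlation matrix. The following is a minimal sketch that assumes the rounded values printed above; `cov_printed` and `cor_from_cov` are illustrative names, not objects from the original analysis.

```r
# Covariance matrix re-entered from the printed output (rounded values).
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# cov2cor() divides each covariance by the product of the two
# standard deviations, giving unit-free correlations.
cor_from_cov <- cov2cor(cov_printed)
round(cor_from_cov, 3)
```

Up to rounding, the result should agree with the matrix returned by `cor(data_mv_clean)`.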
heatmap(cov_matrix, symm = TRUE, main = "Variance–Covariance Matrix Heatmap")
eigen_result <- eigen(cov_matrix)
eigen_result$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
eigen_result$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
The eigendecomposition of the variance–covariance matrix shows that the first eigenvalue, at 2802.56, is much larger than the others. This indicates that the first principal component accounts for most of the total variance in the four selected Titanic variables. The second eigenvalue, at 209.04, accounts for most of the remainder, while the third and fourth eigenvalues are very small, at 0.94 and 0.48, respectively, so the contributions of these components to the total variance are negligible.
These findings indicate that the variance structure of the data is dominated by one principal dimension, and nearly all of the total variance can be represented by one or two principal components. Because the decomposition is applied to the covariance matrix of unstandardized variables, this dominance largely reflects the much larger variances of Fare and Age compared with SibSp and Parch. This should be kept in mind when using these results as a basis for dimensionality reduction techniques such as Principal Component Analysis (PCA).
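One way to check the decomposition is to confirm that the eigenvalues sum to the total variance, that is, the trace of the covariance matrix. The sketch below assumes the rounded covariance values printed earlier; `S`, `ev`, and `total_variance` are illustrative names.

```r
# Symmetric covariance matrix re-entered from the printed output.
vars <- c("Age", "SibSp", "Parch", "Fare")
S <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
    -4.1633339,  0.8644973,  0.3045128,    6.8062117,
    -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

ev <- eigen(S, symmetric = TRUE)$values

# The trace (sum of the variances) equals the sum of the eigenvalues.
total_variance <- sum(diag(S))
all.equal(sum(ev), total_variance)
```

The eigenvalues therefore split the same total variance across new axes rather than adding or removing any.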
The eigenvectors give the weights of the linear combinations of Age, SibSp, Parch, and Fare that form each principal component; the rows of the output follow that variable order. In the eigenvector associated with the largest eigenvalue, Fare has a loading of about 0.9996, indicating that the first principal component is almost entirely ticket fare. Age, SibSp, and Parch contribute very little to this component (loadings of about 0.03 or less).
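The second principal component is dominated by Age (loading of about 0.9993), with negligible contributions from the other variables. The third and fourth components, which have very small eigenvalues, are formed almost entirely by the family variables: the third loads on SibSp and Parch with the same sign (−0.77 and −0.63) and reflects overall family size, while the fourth loads on them with opposite signs (0.63 and −0.77) and contrasts siblings/spouses with parents/children. Overall, the eigenvectors show that ticket fare differences dominate the variation in the data, age forms the second dimension, and family structure contributes very little to the total variance in the original units.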
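Because Fare dominates the covariance-based decomposition, it can be informative to repeat the eigendecomposition on the correlation matrix, which is equivalent to standardizing each variable first. The following is a rough sketch that assumes the rounded correlations printed earlier; `cor_printed` and `eig_cor` are illustrative names, and this is a complementary check rather than part of the original analysis.

```r
# Correlation matrix re-entered from the printed output (rounded values).
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig_cor <- eigen(cor_printed, symmetric = TRUE)
rownames(eig_cor$vectors) <- vars

# Proportion of (standardized) variance per component.
eig_cor$values / sum(eig_cor$values)

# Loadings of each variable on each component.
round(eig_cor$vectors, 3)
```

With every variable scaled to unit variance, the eigenvalues sum to 4 and no single variable can dominate through its measurement scale alone.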
plot(
  eigen_result$values,
  type = "b",
  xlab = "Principal Component",
  ylab = "Eigenvalue",
  main = "Scree Plot"
)
prop_variance <- eigen_result$values / sum(eigen_result$values)
prop_variance
## [1] 0.930149541 0.069378309 0.000313266 0.000158884
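The proportion of variance shows that the first principal component explains about 93.0% of the total variance and the second about 6.9%, so the first two components together account for more than 99.9%. The third and fourth components together explain less than 0.05%.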
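The cumulative proportion makes it easier to see how many components are needed to reach a given share of the total variance. The sketch below uses the eigenvalues printed above; the 95% threshold and the names `eig_vals`, `cum_prop`, and `n_keep` are assumptions for illustration only.

```r
# Eigenvalues copied from the printed output of eigen_result$values.
eig_vals <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

prop <- eig_vals / sum(eig_vals)
cum_prop <- cumsum(prop)
round(cum_prop, 4)

# Smallest number of components whose cumulative share reaches
# an illustrative 95% threshold.
n_keep <- which(cum_prop >= 0.95)[1]
n_keep
```

Under this threshold two components are retained, since the first alone explains about 93.0% and the first two together exceed 99.9%.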