data <- read.csv("Titanic-Dataset.csv")
head(data)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
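# Keep only the four numeric variables used in the analysis
data_selected <- data[, c("Age", "SibSp", "Parch", "Fare")]
# Drop every row with a missing value in any of them (e.g. Age is NA for passenger 6)
data_clean <- na.omit(data_selected)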
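Because na.omit() removes a whole row as soon as one selected column is missing, it can be worth checking how many rows are lost. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration and are not taken from the Titanic file):

```r
# Hypothetical toy data, NOT from Titanic-Dataset.csv
toy <- data.frame(
  Age   = c(22, 38, NA, 35),
  SibSp = c(1, 1, 0, 1),
  Fare  = c(7.25, 71.28, 8.46, NA)
)

# na.omit() keeps only the rows that have no NA in any column
toy_clean <- na.omit(toy)

# How many rows were removed, and which columns held the missing values
n_dropped <- nrow(toy) - nrow(toy_clean)
missing_per_column <- colSums(is.na(toy))

n_dropped
missing_per_column
```

Running the same two checks on data_selected and data_clean would show how much of the original sample the analysis actually uses.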
A correlation matrix measures the strength and direction of the linear relationship between each pair of variables (cor() computes the Pearson coefficient by default). Correlation values range from -1 to 1: values close to 1 indicate a strong positive relationship, values close to -1 a strong negative relationship, and values close to 0 a weak linear relationship.
cor_matrix <- cor(data_clean)
cor_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
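Based on these results, the linear relationships among Age, SibSp, Parch, and Fare are weak to moderate. The strongest is between SibSp and Parch (0.38), followed by Age and SibSp (-0.31); every other pair has an absolute correlation of about 0.2 or less. No significance test was run here, so these values describe the strength of the relationships only.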
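To make the comparison between pairs easier to read, the correlations can be listed once per pair and sorted by absolute value. This is a small sketch that re-enters the matrix printed above by hand (`cor_reported` and `pairs` are names introduced here for illustration); with the real data, cor_matrix could be used directly.

```r
# Correlation matrix copied from the cor(data_clean) output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_reported <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# The matrix is symmetric, so keep each pair once (upper triangle only)
idx <- which(upper.tri(cor_reported), arr.ind = TRUE)
pairs <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = cor_reported[idx]
)

# Sort by absolute strength, strongest pair first
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```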
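The covariance matrix shows the direction of the relationship between variables: positive values indicate that two variables tend to increase together, while negative values indicate that one tends to decrease as the other increases. Its diagonal holds the variance of each variable. Unlike correlations, covariances depend on the units of the variables, so their magnitudes cannot be compared directly across pairs.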
cov_matrix <- cov(data_clean)
cov_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
Fare has by far the largest variance (2800.41), which indicates that
Titanic ticket prices varied greatly. This is consistent with the large
price differences between passenger classes.
The covariance between Age and SibSp is negative (-4.16), which
indicates that as age increases, the number of siblings or spouses
aboard tends to decrease. The direction of this relationship is
consistent with the correlation matrix.
The covariances between Fare and Age, SibSp, and Parch are all
positive, so passengers with more expensive tickets tend to have higher
values for these variables. The corresponding correlations (0.10, 0.14,
and 0.21) show that these relationships are weak.
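The agreement between the covariance and correlation matrices can be checked directly, since each correlation is the covariance divided by the product of the two standard deviations. The sketch below re-enters the printed covariance matrix by hand (`cov_reported` is a name introduced here) and rescales it with the base R function cov2cor(); small differences from the printed correlations are expected because the inputs are rounded.

```r
# Covariance matrix copied from the cov(data_clean) output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_reported <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations: square roots of the variances on the diagonal
sds <- sqrt(diag(cov_reported))

# Rescale covariances to correlations: cov(i, j) / (sd_i * sd_j)
cor_from_cov <- cov2cor(cov_reported)

round(sds, 2)
round(cor_from_cov, 4)
```

The standard deviations also put the variances back into the original units, which makes the spread of Fare easier to compare with the spread of Age.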
Eigenvalues of the covariance matrix give the amount of variance explained by each principal component. A larger eigenvalue means that the component accounts for more of the total variation in the data.
eigen_result <- eigen(cov_matrix)
eigen_result$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
The first eigenvalue (2802.56) is far larger than the others and accounts for about 93% of the total variance (the eigenvalues sum to 3013.02). The second eigenvalue (209.04) still contributes, at about 7%, but much less than the first. The third and fourth eigenvalues are both below 1, so their contribution is negligible. Almost all of the variation can therefore be explained by the first two principal components. The dominance of the first component most likely comes from Fare, which has the highest variance; because the covariance matrix is computed on unscaled variables, this partly reflects the larger numeric scale of Fare.
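The share of variance carried by each component follows from dividing each eigenvalue by their sum. A minimal sketch, using the eigenvalues printed above entered by hand (`eig_values`, `prop_var`, and `cum_var` are names introduced here; with the real objects, eigen_result$values could be used instead):

```r
# Eigenvalues copied from the eigen_result$values output above
eig_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Proportion of total variance per component, and the running total
prop_var <- eig_values / sum(eig_values)
cum_var  <- cumsum(prop_var)

round(100 * prop_var, 2)  # percent of variance per component
round(100 * cum_var, 2)   # cumulative percent
```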
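Eigenvectors show how much each variable contributes to each principal component. Variables with larger absolute values in an eigenvector have a greater influence on that component. The rows of the output follow the column order of the data (Age, SibSp, Parch, Fare), and each column is one component.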
eigen_result$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
In the first column, Fare has the dominant value (0.9996), which means
that the first principal component is almost entirely ticket price.
Age, SibSp, and Parch contribute very little to it, so ticket price
differences are the main source of variation among these four
variables.
In the second principal component, Age has a very large contribution
(0.9993), indicating that this component represents variation in
passenger age. The third and fourth components are formed mainly by
SibSp and Parch, but they account for very little of the overall
variation.
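The reading of the eigenvectors can be made less error-prone by labelling the rows and columns and by checking the defining property C v = lambda v. The sketch below re-enters the printed covariance matrix by hand (`cov_reported`, `loadings`, and `dominant` are names introduced here), so the results match the report only up to rounding. The sign of an eigenvector is arbitrary, which is why absolute values are used.

```r
# Covariance matrix copied from the cov(data_clean) output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_reported <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Printed values are rounded slightly differently above and below the
# diagonal, so average the two halves to get an exactly symmetric matrix
cov_sym <- (cov_reported + t(cov_reported)) / 2

eig <- eigen(cov_sym, symmetric = TRUE)

# Label rows with variable names and columns with component names
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))
round(loadings, 4)

# Check C v = lambda v for all components at once (should be ~0)
residual <- cov_sym %*% eig$vectors - eig$vectors %*% diag(eig$values)
max(abs(residual))

# Variable with the largest absolute loading on each component
dominant <- vars[apply(abs(loadings), 2, which.max)]
dominant
```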
Based on the correlation matrix, covariance matrix, eigenvalue, and eigenvector analysis, it can be concluded that:
1. The linear relationships between the variables are weak to moderate.
2. Fare has the largest spread and is the dominant source of variation in the dataset.
3. Almost all of the variation can be explained by the first two principal components.
4. Age makes the second largest contribution to the variation.
5. SibSp and Parch describe passengers' family relationships, but they contribute far less to the variation than Fare and Age.