a. Import Dataset

data <- read.csv("Titanic-Dataset.csv")
head(data)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

b. Use the Age, SibSp, Parch, and Fare columns and delete rows with missing values.

data_selected <- data[, c("Age","SibSp","Parch","Fare")]
data_clean <- na.omit(data_selected)
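
Before relying on the cleaned data, it can help to check how many rows `na.omit()` actually removes. The sketch below is only an illustration: `toy` is a hypothetical four-row data frame standing in for the selected columns, not the real Titanic data.

```r
# Hypothetical toy data standing in for the four selected columns
toy <- data.frame(
  Age   = c(22, 38, NA, 35),
  SibSp = c(1, 1, 0, 0),
  Parch = c(0, 0, 0, 2),
  Fare  = c(7.25, 71.28, 8.46, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

The same two checks could be applied to `data_selected` and `data_clean` to see which column is responsible for the dropped rows.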

c. Correlation Matrix

A correlation matrix is used to determine the degree of linear relationship between variables. Correlation values range from -1 to 1. Values close to 1 indicate a strong positive relationship, values close to -1 indicate a strong negative relationship, while values close to 0 indicate a weak or absent linear relationship.

cor_matrix <- cor(data_clean)
cor_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

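Based on the analysis results, the linear relationships among Age, SibSp, Parch, and Fare are weak to moderate. The strongest are between SibSp and Parch (0.38) and between Age and SibSp (-0.31); every other pair has an absolute correlation below 0.21. No pair of variables is strongly linearly related.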
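
To make the ranking of relationships explicit, the variable pairs can be sorted by the absolute value of their correlation. This is a minimal sketch that re-enters the matrix printed above by hand so it runs on its own; the names `vars`, `idx`, and `pairs` are introduced only for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix copied from the output above
cor_matrix <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```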

d. Covariance Matrix

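The covariance matrix is used to see the direction of the relationship between variables. Positive values indicate a direct relationship, while negative values indicate an inverse relationship. The diagonal entries are the variances of the individual variables. Because covariance depends on the units of measurement, its magnitude alone does not indicate how strong a relationship is.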

cov_matrix <- cov(data_clean)
cov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

It can be seen that Fare has the largest variance (2800.41), which indicates that Titanic passenger ticket prices varied greatly. This is consistent with the actual conditions, where there were significant differences in passenger classes.
The covariance between Age and SibSp is negative (-4.16), which indicates that as age increases, the number of siblings/spouses brought along tends to decrease. The direction of this relationship is consistent with the results in the correlation matrix.
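The covariances between Fare and Age, SibSp, and Parch are all positive, so passengers with more expensive tickets tend to have higher values for these variables. However, the corresponding correlations (0.10, 0.14, and 0.21) show that these relationships are weak; the comparatively large covariance values mainly reflect the large variance of Fare.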
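
The link between the two matrices can be checked directly: dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below re-enters the printed covariance matrix by hand (so small rounding differences are expected) and uses `cov2cor()` from the stats package; `manual_age_sibsp` is a name introduced only for this illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the output above
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (variances)
std_dev <- sqrt(diag(cov_matrix))

# Rescale covariances to correlations
cor_from_cov <- cov2cor(cov_matrix)

# The same rescaling written out for a single pair
manual_age_sibsp <- cov_matrix["Age", "SibSp"] /
  (std_dev["Age"] * std_dev["SibSp"])

print(round(cor_from_cov, 4))
```

The result should agree with the correlation matrix in section c up to rounding, which confirms that the two matrices carry the same directional information on different scales.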

e. Eigen Analysis

1. Eigenvalues

Eigenvalues indicate the amount of variation in the data that is explained by each principal component. A larger eigenvalue means that the component contributes more to the total variation.

eigen_result <- eigen(cov_matrix)
eigen_result$values
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214

The first eigenvalue (2802.56) is far larger than the others and accounts for about 93% of the total variance (the sum of the eigenvalues, about 3013.02). The second eigenvalue (209.04) accounts for about 7%, much less than the first component. The third and fourth eigenvalues are very small (<1), so their contribution to the variation is negligible. Almost all of the information in the data can therefore be explained by only one or two principal components. This dominance most likely comes from the Fare variable, which has by far the largest variance; because the analysis is based on the covariance matrix, variables measured on larger scales carry more weight.
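
The share of variance attributed to each component follows from dividing each eigenvalue by their sum. The sketch below copies the eigenvalues printed above; `prop_var` and `cum_var` are names introduced only for this illustration.

```r
# Eigenvalues copied from the output above
eigen_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Proportion of total variance explained by each component
prop_var <- eigen_values / sum(eigen_values)

# Cumulative proportion across components
cum_var <- cumsum(prop_var)

print(round(100 * prop_var, 2))
print(round(100 * cum_var, 2))
```

With the full analysis objects in memory, the same calculation would be `eigen_result$values / sum(eigen_result$values)`.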

2. Eigenvectors

Eigenvectors show the contribution of each variable to each principal component. Variables with larger absolute values in an eigenvector have a greater influence on that component.

eigen_result$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

The rows of the output follow the variable order Age, SibSp, Parch, Fare. In the first column, the Fare variable (fourth row) has by far the largest value (0.9996), which means that the first principal component is driven almost entirely by ticket prices. The Age, SibSp, and Parch variables contribute very little to the first principal component, so differences in ticket price are the main source of variation among passengers in this dataset.
In the second principal component, the Age variable has a very large contribution (0.9993), indicating that this component represents variation related to passenger age. The third and fourth components are dominated by SibSp and Parch, but these components account for only a very small share of the overall variation in the data.
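
Because `eigen()` returns the eigenvectors without row labels, attaching the variable names makes the reading above easier to verify. The sketch below re-enters the printed covariance matrix by hand, so results may differ from the output above in the last decimals, and the sign of an eigenvector can flip between platforms. The names `loadings`, `dominant`, and `residual` are introduced only for this illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from section d
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eigen_result <- eigen(cov_matrix)

# Label rows with variable names and columns with component names
loadings <- eigen_result$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))

# Variable with the largest absolute loading on each component
dominant <- vars[apply(abs(loadings), 2, which.max)]

# Check the defining relation (covariance %*% v = lambda * v) for PC1
v1 <- loadings[, "PC1"]
residual <- max(abs(cov_matrix %*% v1 - eigen_result$values[1] * v1))

print(round(loadings, 4))
print(dominant)
```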

f. Conclusion

Based on the results of the correlation matrix, covariance matrix, eigenvalue, and eigenvector analysis, it can be concluded that:

  1. The linear relationship between variables is relatively weak to moderate.

  2. The Fare variable has the largest data spread and is the dominant factor in dataset variation.

  3. Most of the data variation can be explained by one or two main components.

  4. The Age variable makes the second largest contribution to data variation.

  5. The SibSp and Parch variables play a role in explaining passenger family relationships, but their contribution is not as significant as Fare and Age.