# Importing the data into R
titanic_data <- read.csv("Titanic-Dataset.csv")
# Select the Age, SibSp, Parch and Fare columns, then drop rows with missing values.
data_selected <- titanic_data[, c("Age", "SibSp", "Parch", "Fare")]
data_cleaned <- na.omit(data_selected)
summary(data_cleaned)
## Age SibSp Parch Fare
## Min. : 0.42 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:20.12 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 8.05
## Median :28.00 Median :0.0000 Median :0.0000 Median : 15.74
## Mean :29.70 Mean :0.5126 Mean :0.4314 Mean : 34.69
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 33.38
## Max. :80.00 Max. :5.0000 Max. :6.0000 Max. :512.33
# a) Correlation Matrix
correlation_matrix <- cor(data_cleaned)
correlation_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
# b) Variance-Covariance Matrix
var_covariance_matrix <- cov(data_cleaned)
var_covariance_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
# c) Eigenvalues and eigenvectors
eigen_result <- eigen(var_covariance_matrix)
print("Eigen Values:")
## [1] "Eigen Values:"
eigen_result$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
print("Eigen Vectors:")
## [1] "Eigen Vectors:"
eigen_result$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
The correlation matrix describes the linear relationship between pairs of variables, here Age, SibSp (siblings or spouses aboard), Parch (parents or children aboard) and Fare in the Titanic dataset. Correlation values range from −1 to 1: values close to 1 or −1 indicate a strong linear relationship, while values close to 0 indicate a weak or absent one.
The output shows a positive correlation of 0.3838 between SibSp and Parch, which indicates a positive relationship of moderate strength. Passengers who travel with more siblings or spouses also tend to travel with more parents or children. This is consistent with what one would expect, since family members often travel together.
There is also a negative correlation of -0.3082 between Age and SibSp, which means that younger passengers tend to have more siblings traveling with them than older passengers. The correlation between Age and Parch, -0.1891, is negative but weak. The correlation between Age and Fare is only 0.0961, a very weak relationship: passenger age has almost no linear association with ticket price. The correlation between SibSp and Fare is 0.1383, and between Parch and Fare it is 0.2051, both positive and weak. Overall, the linear relationships between the variables in this dataset are mostly weak to moderate.
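The ranking of pairs discussed above can be reproduced with a short sketch. It assumes the correlation values are typed in by hand from the printed output (rounded to four decimals) instead of being recomputed from the CSV file; the names `R` and `pair_table` are illustrative only.

```r
# Correlation matrix copied from the printed output (rounded to 4 decimals).
vars <- c("Age", "SibSp", "Parch", "Fare")
R <- matrix(c( 1.0000, -0.3082, -0.1891, 0.0961,
              -0.3082,  1.0000,  0.3838, 0.1383,
              -0.1891,  0.3838,  1.0000, 0.2051,
               0.0961,  0.1383,  0.2051, 1.0000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Take the upper triangle so that each pair of variables appears once.
idx <- which(upper.tri(R), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(R)[idx[, "row"]],
  var2 = colnames(R)[idx[, "col"]],
  r    = R[idx],
  row.names = NULL
)

# Order the pairs from the strongest to the weakest absolute correlation.
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting by absolute value puts SibSp and Parch first and Age and Fare last, which matches the reading of the matrix given above.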
The variance-covariance matrix shows how much each variable varies on its own (the variances on the diagonal) and how pairs of variables vary together (the covariances off the diagonal). A large variance means the values are widely spread around their mean.
From the output, the Fare variable has the highest variance, 2800.41. This means that ticket prices for Titanic passengers varied widely, with some tickets cheap and others far more expensive, which agrees with the summary above where fares range from 0.00 to 512.33. The Age variable has a variance of 211.02, which reflects a spread in passenger ages, though a much smaller one than the spread in ticket prices.
The SibSp and Parch variables have much smaller variances, 0.8645 and 0.7281 respectively. This shows that most passengers travel with a small number of family members, whether siblings, spouses, parents or children. The covariance between SibSp and Parch is 0.3045, which is positive and means that the two variables tend to increase together. The covariance between Age and SibSp, -4.1633, is negative, so the two tend to move in opposite directions. Because a covariance depends on the units of both variables, its magnitude cannot be compared directly across pairs; the correlation matrix gives the scale-free comparison.
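The link between the two matrices can be checked numerically: each correlation is the covariance divided by the product of the two standard deviations. The sketch below assumes the covariance matrix is entered by hand from the printed output; `S`, `std_dev`, `R_manual` and `R_builtin` are illustrative names.

```r
# Variance-covariance matrix copied from the printed output.
vars <- c("Age", "SibSp", "Parch", "Fare")
S <- matrix(c(211.019125, -4.1633339, -2.3441911,   73.849030,
               -4.1633339,  0.8644973,  0.3045128,    6.8062117,
               -2.3441911,  0.3045128,  0.7281027,    9.2621760,
               73.849030,   6.8062117,  9.2621760, 2800.413100),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Standard deviations are the square roots of the diagonal (the variances).
std_dev <- sqrt(diag(S))

# r_ij = cov_ij / (sd_i * sd_j), applied to every cell at once.
R_manual <- S / outer(std_dev, std_dev)

# Base R provides the same conversion as cov2cor().
R_builtin <- cov2cor(S)

round(R_manual, 4)
```

Up to rounding, the result should agree with the correlation matrix printed in part a), for example about -0.3082 for Age and SibSp.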
Eigenvalues show how much of the variation in the data is explained by each principal component. A larger eigenvalue means that the component accounts for a larger share of the total variance.
The output shows that the first component has the largest eigenvalue, 2802.56, so this component explains most of the variation in the data. The second component has an eigenvalue of 209.04, while the third and fourth components have eigenvalues of only 0.94 and 0.48 respectively. Relative to the total of about 3013.02, the first component accounts for roughly 93.0% of the variance and the second for roughly 6.9%, so the first two principal components together capture about 99.95% of the variation.
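The share of variance carried by each component follows directly from the eigenvalues. A minimal sketch, assuming the eigenvalues are copied from the printed output (the names `eigen_values` and `variance_table` are illustrative):

```r
# Eigenvalues copied from the printed output of eigen_result$values.
eigen_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Each eigenvalue divided by the total gives the proportion of variance.
prop_var <- eigen_values / sum(eigen_values)
cum_var  <- cumsum(prop_var)

variance_table <- data.frame(
  component  = paste0("PC", seq_along(eigen_values)),
  eigenvalue = eigen_values,
  percent    = round(100 * prop_var, 2),
  cumulative = round(100 * cum_var, 2)
)
variance_table
```

The sum of the eigenvalues equals the sum of the four variances on the diagonal of the covariance matrix, so these percentages are shares of the total variance of the data.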
The eigenvectors show how much each variable contributes to each component. In the first component, the Fare variable dominates with a coefficient of 0.9996, which means that almost all of the variation along the first component comes from ticket prices. In the second component, the Age variable dominates with a coefficient of 0.9993, which means that passenger age plays the main role in forming the second component. Because the decomposition is based on the covariance matrix rather than the correlation matrix, these two components largely mirror the two variables with the largest variances.
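The dominance of Fare and Age can be made explicit by squaring the eigenvector coefficients, since each eigenvector has unit length and its squared entries therefore sum to 1. The sketch below assumes the covariance matrix is typed in from the printed output; `S`, `eig` and `loadings_sq` are illustrative names. Squaring also removes the sign of the eigenvectors, which is arbitrary.

```r
# Variance-covariance matrix copied from the printed output.
vars <- c("Age", "SibSp", "Parch", "Fare")
S <- matrix(c(211.019125, -4.1633339, -2.3441911,   73.849030,
               -4.1633339,  0.8644973,  0.3045128,    6.8062117,
               -2.3441911,  0.3045128,  0.7281027,    9.2621760,
               73.849030,   6.8062117,  9.2621760, 2800.413100),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

eig <- eigen(S, symmetric = TRUE)

# Check the defining relation S v = lambda v for the first component.
v1 <- eig$vectors[, 1]
residual <- max(abs(S %*% v1 - eig$values[1] * v1))

# Squared coefficients: the share of each variable within each component.
loadings_sq <- eig$vectors^2
dimnames(loadings_sq) <- list(vars, paste0("PC", 1:4))
round(loadings_sq, 4)
```

Each column of `loadings_sq` sums to 1, so an entry near 1 (Fare in PC1, Age in PC2) indicates that the component is almost entirely aligned with that single variable.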
Based on the analysis of the Titanic dataset above, the relationships between Age, SibSp (siblings or spouses), Parch (parents or children) and Fare are mostly weak to moderate, with the clearest relationship between SibSp and Parch. The Fare variable has the greatest variance, Age has a moderate variance, and SibSp and Parch have relatively small variances. The eigenvalues and eigenvectors show that nearly all of the variation in the data is explained by two principal components, dominated by Fare and Age respectively.