# Multivariate Analysis Task 1
# Titanic Dataset
# Abdullah Al-Firdaus Nuzula (008) / INT24
# 1. Dataset Preparation
df <- read.csv("Titanic-Dataset.csv")
# 2. Preprocessing: Numerical Columns and Data Cleaning
data_numerical <- df[, c("Age", "SibSp", "Parch", "Fare")]
clean_data <- na.omit(data_numerical)
head(clean_data)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 7  54     0     0 51.8625
### The data is now prepared: we kept only the numeric columns and used na.omit() to remove every row containing a missing value (NA). This is why row 6 is absent from the output above.
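### A minimal sketch of what na.omit() does, using a small hypothetical data frame. The name `toy` and its values are assumptions for illustration only, not Titanic data.
toy <- data.frame(Age = c(22, NA, 26), Fare = c(7.25, 8.46, 7.925))
toy_clean <- na.omit(toy)            # drops every row that contains an NA
print(colSums(is.na(toy)))           # number of missing values per column
print(nrow(toy) - nrow(toy_clean))   # number of rows removed
print(rownames(toy_clean))           # original row labels are kept, so a gap marks a dropped row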
# 3. Correlation Matrix
cor_matrix <- cor(clean_data)
## Correlation Matrix
print(cor_matrix)
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
### Correlation coefficients always lie between -1 and +1. Because correlation is unit-free, it lets us compare the strength of relationships between variables measured in different units (for example, Age in years vs. Fare in currency).
### The strongest off-diagonal correlation is SibSp vs. Parch (+0.38). This suggests that passengers tended to travel in family groups: a passenger with several siblings or a spouse aboard (high SibSp) was also more likely to have parents or children aboard (high Parch). The relationship is only moderate, so many passengers do not follow this pattern.
### Age vs. SibSp shows a negative correlation (-0.308): younger passengers tend to have more siblings aboard, because children usually travel with their brothers and sisters. Adults more often travel alone or with only a partner, so their SibSp is typically low, whereas some children have 3-4 siblings aboard.
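### To illustrate why correlation suits variables with different units, here is a small sketch that uses only the six rows printed by head(clean_data) above, so its numbers are not the full-sample values. The factor of 100 is a hypothetical change of unit.
age  <- c(22, 38, 26, 35, 35, 54)
fare <- c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625)
fare_rescaled <- fare * 100          # same prices expressed in a unit 100 times smaller
print(cor(age, fare))                # correlation in the original unit
print(cor(age, fare_rescaled))       # unchanged by the rescaling
print(cov(age, fare))                # covariance in the original unit
print(cov(age, fare_rescaled))       # 100 times larger: covariance depends on the unit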
# 4. Variance-Covariance Matrix
cov_matrix <- cov(clean_data)
## Variance-Covariance Matrix
print(cov_matrix)
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
### The diagonal entries (top left to bottom right) are the variances of each variable. The variance of Fare (about 2800, a standard deviation of roughly 53) is by far the largest, because ticket prices range from very cheap to very expensive.
### SibSp and Parch, by contrast, have small variances (0.86 and 0.73) because they are small integer counts clustered near zero.
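### The covariance and correlation matrices are linked by r_ij = s_ij / (s_i * s_j). A minimal sketch, with the covariance values copied from the printed output above; the names S, sds, R_manual and R_builtin are introduced here only so the snippet runs on its own (in this script S corresponds to cov_matrix).
vars <- c("Age", "SibSp", "Parch", "Fare")
S <- matrix(c(211.019125, -4.1633339, -2.3441911,   73.849030,
               -4.1633339,  0.8644973,  0.3045128,    6.8062117,
               -2.3441911,  0.3045128,  0.7281027,    9.2621760,
               73.849030,   6.8062117,  9.2621760, 2800.413100),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))
sds <- sqrt(diag(S))                 # standard deviations from the diagonal variances
R_manual  <- S / outer(sds, sds)     # divide each covariance by the product of the two SDs
R_builtin <- cov2cor(S)              # base R helper that does the same rescaling
print(round(sds, 2))
print(round(R_manual, 4))            # should agree with cor_matrix up to rounding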
# 5. Eigenvalues and Eigenvectors
eigen_results <- eigen(cov_matrix)
## Eigen Values
print(eigen_results$values)
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
## Eigen Vectors
print(eigen_results$vectors)
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
### Each eigenvalue is the amount of variance (information) explained by the corresponding new component (eigenvector). The eigenvalues sum to the trace of the covariance matrix, i.e. the total of the variances on its main diagonal (about 3013.02 here). Because the eigenvalues were computed from the covariance matrix of unscaled data, the first eigenvalue (2802.56) alone accounts for about 93% of the total variance.
### Fare takes values up to the hundreds, while SibSp is a small count (0-8). Because the variance of Fare is so large (around 2800), the first principal component is pulled almost entirely toward Fare: its loading on Fare is 0.9996, while its loadings on the other three variables are all below 0.03.
### Nearly all of the variability in this dataset comes from differences in ticket price (Fare) and age (Age): the first two components together explain about 99.95% of the total variance, and the second component loads almost entirely on Age (0.9993). The family-size variables (SibSp and Parch) contribute very little; the last two eigenvalues are both less than 1.
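### The trace identity and the share of variance per component can be checked directly. A small sketch using the eigenvalues and diagonal variances printed above; the names eigen_values, variances and prop are introduced here for illustration.
eigen_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)
variances <- c(Age = 211.019125, SibSp = 0.8644973, Parch = 0.7281027, Fare = 2800.413100)
print(sum(eigen_values))             # sum of eigenvalues
print(sum(variances))                # trace of the covariance matrix (same up to rounding)
prop <- eigen_values / sum(eigen_values)
print(round(prop, 4))                # share of total variance per component
print(round(cumsum(prop), 4))        # cumulative share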
### The eigenvectors form a 4x4 matrix. Each column is one eigenvector, paired in order with the eigenvalues above, and the rows correspond to the original variables (Age, SibSp, Parch, Fare). The sign of each eigenvector is arbitrary; only its direction matters.
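### The defining properties of the eigenvectors can be verified numerically. A sketch under the assumption that the covariance values printed earlier are accurate enough to reuse; S, e, V and resid are names introduced here so the snippet runs on its own.
S <- matrix(c(211.019125, -4.1633339, -2.3441911,   73.849030,
               -4.1633339,  0.8644973,  0.3045128,    6.8062117,
               -2.3441911,  0.3045128,  0.7281027,    9.2621760,
               73.849030,   6.8062117,  9.2621760, 2800.413100),
            nrow = 4, byrow = TRUE)
e <- eigen(S, symmetric = TRUE)
V <- e$vectors
resid <- S %*% V - V %*% diag(e$values)      # S v = lambda v for every column v
print(max(abs(resid)))                       # close to zero
print(round(crossprod(V), 6))                # t(V) %*% V is the identity: columns are orthonormal
print(max(abs(V %*% diag(e$values) %*% t(V) - S)))   # S is recovered from its eigen decomposition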
### For statistical analysis such as PCA, the data should ideally be standardized first (equivalently, the eigen decomposition should use the correlation matrix instead of the covariance matrix), so that small-scale variables such as SibSp and Parch carry the same weight as Fare.
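### A sketch of the standardized alternative: the same eigen decomposition applied to the correlation matrix, with values copied from the printed output above. R_mat and e_cor are names introduced here for illustration. On the raw data, prcomp(clean_data, scale. = TRUE) should give the equivalent analysis.
vars <- c("Age", "SibSp", "Parch", "Fare")
R_mat <- matrix(c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
                  -0.3082468,   1.0000000,  0.3838199, 0.1383288,
                  -0.1891193,   0.3838199,  1.0000000, 0.2051189,
                   0.09606669,  0.1383288,  0.2051189, 1.00000000),
                nrow = 4, byrow = TRUE, dimnames = list(vars, vars))
e_cor <- eigen(R_mat, symmetric = TRUE)
rownames(e_cor$vectors) <- vars              # label rows with the original variables
print(e_cor$values)                          # sums to 4, the number of standardized variables
print(round(e_cor$values / sum(e_cor$values), 3))   # share of variance per component
print(round(e_cor$vectors, 3))               # loadings are no longer driven by the scale of Fare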