Main Process
data <- data_clean
Correlation Matrix
corr_matrix <- cor(data)
print(corr_matrix)
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
From the results above, we can conclude that:
The highest correlation is a positive one between ‘Parch’ and ‘SibSp’, with a score of 0.38, meaning that as the number of siblings or spouses increases, the number of parents or children also tends to increase. This suggests that passengers who travel with family tend to travel with several relatives at once.
The second highest correlation, which happens to be negative, is between ‘Age’ and ‘SibSp’, with a score of -0.31, meaning that as age increases, the number of siblings or spouses tends to decrease. This suggests that older passengers tend to travel alone, while younger ones tend to travel with siblings or a spouse.
Another notable correlation is a positive one between ‘Parch’ and ‘Fare’, with a score of 0.21, meaning that the more parents or children a passenger brings, the more expensive their ticket tends to be.
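The ranking used in the conclusions above can be reproduced with a short sketch. The matrix is copied by hand from the printed output (the original `data_clean` is not needed), and the name `ranked` is only illustrative.

```r
# Correlation matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
corr_matrix <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair of variables once by taking only the upper triangle
idx <- which(upper.tri(corr_matrix), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = corr_matrix[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute correlation, strongest first
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```

Sorting by the absolute value keeps strong negative correlations, such as the one between ‘Age’ and ‘SibSp’, near the top of the list.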
Variance-Covariance Matrix
cov_matrix <- cov(data)
print(cov_matrix)
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
From the results above, we can conclude that:
Variance:
1.1. Age & Fare: The variance is large, showing that the data are widely spread. For ‘Fare’ (variance of about 2800), it means that some passengers paid very little while others paid a fortune. For ‘Age’ (variance of about 211), it means that some passengers may be toddlers while others may be war veterans.
1.2. SibSp & Parch: The variance is small (below 1), showing that most passengers have similar values, which is reasonable. It is not every day that you see people with 16 siblings or 5 parents.
Covariance: Most of it is already covered in the correlation section, just unscaled. We will mainly use it to compute the eigenvalues and eigenvectors.
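To make the "just unscaled" remark concrete, here is a minimal sketch that rescales the covariance matrix back into the correlation matrix with base R's `cov2cor()`. The matrix is copied by hand from the printed output, and the names `sds` and `corr_from_cov` are only illustrative.

```r
# Covariance matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (the variances)
sds <- sqrt(diag(cov_matrix))
print(round(sds, 2))

# Dividing each covariance by the two matching standard deviations
# gives the correlation, which is what cov2cor() does
corr_from_cov <- cov2cor(cov_matrix)
print(round(corr_from_cov, 4))
```

The standard deviations (roughly 14.5 for ‘Age’ and 52.9 for ‘Fare’) are easier to read than the raw variances because they are in the same units as the variables themselves.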
Eigenvectors
Covariance Matrix
eigen_cov <- eigen(cov_matrix)
print("value")
## [1] "value"
print(eigen_cov$values)
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
print("vector")
## [1] "vector"
print(eigen_cov$vectors)
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
From the results above, we can conclude that:
- Eigenvalues: The first value, 2802.56, dominates, covering about 93% of the total variance in the dataset, while the second value contributes about 7%. The last two together account for less than 0.1% and barely matter.
- Eigenvectors: The rows follow the variable order Age, SibSp, Parch, Fare. The first component, which captures 93% of the variance, is dominated by ‘Fare’, with a loading of about 0.9996. The second component is dominated just as heavily by ‘Age’, with a loading of about 0.9993.
So how can ‘Age’ and ‘Fare’ be that dominant? It is because the covariance matrix uses the unstandardized values of each variable. Variables in most datasets are measured on different scales, so we are likely to see one variable with a range of 0-512 and another with a range of 0-8, just like ‘Fare’ and ‘SibSp’ here.
Thus, we will check the eigenvectors using the standardized version of the covariance matrix: the correlation matrix.
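The percentages quoted above come from dividing each eigenvalue by the sum of all eigenvalues. A minimal sketch, using values copied by hand from the printed output (the names `eig_values_cov`, `prop_var`, and `pc1` are only illustrative):

```r
# Eigenvalues of the covariance matrix, copied from the printed output
eig_values_cov <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Share of the total variance carried by each component, in percent
prop_var <- eig_values_cov / sum(eig_values_cov)
print(round(100 * prop_var, 2))

# Running total across components
print(round(100 * cumsum(prop_var), 2))

# First eigenvector (column 1 of the printed matrix), labelled by variable.
# Squared loadings sum to 1, so each one reads as a share of the component.
pc1 <- c(Age = 0.028477552, SibSp = 0.002386349,
         Parch = 0.003280818, Fare = 0.999586200)
print(round(pc1^2, 4))
```

Squaring the loadings is one common way to express how much of a component belongs to each variable, since the eigenvectors returned by `eigen()` are normalized to unit length.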
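The effect of measurement scale can be illustrated with a hypothetical change of units. The sketch below assumes ‘Age’ were recorded in months instead of years (this rescaling is not part of the original analysis) and reuses the covariance matrix copied from the printed output. `symmetric = TRUE` is passed to `eigen()` because the hand-copied values are rounded and therefore not exactly symmetric.

```r
# Covariance matrix copied from the printed output above
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical rescaling: Age in months instead of years.
# Multiplying a variable by s multiplies its covariances by s
# and its variance by s^2.
s <- c(Age = 12, SibSp = 1, Parch = 1, Fare = 1)
cov_months <- cov_matrix * outer(s, s)

# Which variable has the largest loading on the first component?
top_var <- function(m) {
  v <- eigen(m, symmetric = TRUE)$vectors[, 1]
  rownames(m)[which.max(abs(v))]
}
print(top_var(cov_matrix))   # original units
print(top_var(cov_months))   # Age rescaled to months

# The correlation matrix is unaffected by the change of units
print(all.equal(cov2cor(cov_matrix), cov2cor(cov_months)))
```

With the original units the first component is led by ‘Fare’, while after the rescaling it is led by ‘Age’, even though the underlying data are the same. The correlation matrix does not change, which is why it is the safer basis for comparing variables measured on different scales.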
Correlation Matrix
eigen_corr <- eigen(corr_matrix)
print("value")
## [1] "value"
print(eigen_corr$values)
## [1] 1.6367503 1.1071770 0.6694052 0.5866676
print("vector")
## [1] "vector"
print(eigen_corr$vectors)
##            [,1]       [,2]        [,3]        [,4]
## [1,]  0.4388714 -0.5962415  0.56095237  0.37043268
## [2,] -0.6250770  0.0732461  0.05500006  0.77517016
## [3,] -0.5908590 -0.1774532  0.60558695 -0.50265342
## [4,] -0.2599159 -0.7795136 -0.56175785 -0.09607493
From the results above, we can conclude that:
Eigenvalues: Unlike the covariance matrix, where the first two components heavily dominate, the correlation matrix gives a more balanced distribution. The components (in the same order) account for about 41%, 28%, 17%, and 15% of the total variance.
Eigenvectors: While the previous eigenvectors each showed the dominance of a single variable, the current ones show an interesting dynamic between the variables. Keep in mind that the overall sign of an eigenvector is arbitrary, so only the relative signs within a component matter.
2.1. In the first component, ‘Age’ has a moderately strong positive score (0.44), ‘SibSp’ and ‘Parch’ have strong negative scores (-0.63 and -0.59), and ‘Fare’ has a weak negative score (-0.26), in contrast with the covariance-based result where it dominated. This reveals something we were missing before: family structure. Older passengers tend to travel alone and buy a ticket just for themselves, while younger passengers tend to travel with their family, which goes along with a more expensive ticket.
2.2. In the second component, ‘Age’ and ‘Fare’ both have strong negative scores (-0.60 and -0.78), while the other two have weak scores. This also captures something: a wealth factor. Since ‘Age’ and ‘Fare’ move in the same direction while family size barely contributes, this component suggests that, for a similar family size, older passengers tend to buy more expensive tickets than younger ones.
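The proportions and loadings discussed above are easier to read once they are labelled. The sketch below copies the printed values by hand. The row labels rely on `eigen()` keeping the variable order of `corr_matrix` (Age, SibSp, Parch, Fare), and the `PC1` to `PC4` column names are only illustrative.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Eigenvalues of the correlation matrix, copied from the printed output
eig_values_corr <- c(1.6367503, 1.1071770, 0.6694052, 0.5866676)

# For a correlation matrix the eigenvalues sum to the number of variables
prop_var <- eig_values_corr / sum(eig_values_corr)
print(round(100 * prop_var, 1))

# Eigenvectors copied from the printed output, one column per component
loadings <- matrix(
  c( 0.4388714, -0.5962415,  0.56095237,  0.37043268,
    -0.6250770,  0.0732461,  0.05500006,  0.77517016,
    -0.5908590, -0.1774532,  0.60558695, -0.50265342,
    -0.2599159, -0.7795136, -0.56175785, -0.09607493),
  nrow = 4, byrow = TRUE,
  dimnames = list(vars, paste0("PC", 1:4))
)
print(round(loadings, 2))

# The sign of an eigenvector is arbitrary: flipping a whole column
# describes the same component, here with Age and Fare both positive
print(round(-loadings[, "PC2"], 2))
```

Flipping the second component makes the "wealth factor" reading more natural, since higher scores then correspond to older passengers with more expensive tickets.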
So, I guess that’s all. We could continue to PCA from here, but I don’t want to.