Muhammad Raffi Fahrezi
24031554100/24INT
Titanic <- read.csv("Titanic-Dataset.csv")
head(Titanic)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
First, we check the dataset for missing values.
colSums(is.na(Titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
The output above shows that the Age column contains 177 missing values. There are many ways to handle missing values in a dataset, but for this task we simply delete every row that contains one.
clean_titanic <- na.omit(Titanic)
colSums(is.na(clean_titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
As the output above shows, the cleaned dataset no longer contains any missing values. We then select the four numeric columns used in the rest of the analysis.
selected_cols <- clean_titanic[, c("Age", "SibSp", "Parch", "Fare")]
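One caveat about the missing-value count: `is.na()` only detects true `NA` entries. In the `head()` output, Cabin is blank for several passengers, yet its count is 0, which suggests that blank strings were read as empty text rather than `NA`. The following is a minimal sketch of how blanks could be counted as well; the data frame `toy` and its values are hypothetical and serve only as an illustration.

```r
# Hypothetical toy data: one true NA in Age, two blank strings in Cabin
toy <- data.frame(
  Age   = c(22, NA, 26),
  Cabin = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# is.na() sees only the true NA values
na_counts <- colSums(is.na(toy))

# Count blank strings as missing as well
blank_counts <- colSums(is.na(toy) | toy == "")

print(na_counts)
print(blank_counts)
```

Alternatively, passing `na.strings = c("", "NA")` to `read.csv()` would turn such blanks into `NA` at load time. Note that `na.omit()` would then also drop every row with a blank Cabin, so this choice affects how many rows survive the cleaning step.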
A correlation matrix is a two-dimensional table of the correlation coefficients between each pair of variables. Each coefficient measures the strength and direction of the linear relationship between two variables, written here as X and Y.
\(\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}\)
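The formula above can be checked by hand on a small example. The sketch below uses two hypothetical vectors `x` and `y` (not taken from the Titanic data) and compares the result of the definition with R's built-in `cor()`.

```r
# Hypothetical toy vectors, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson correlation from the definition: Cov(X, Y) / sqrt(Var(X) * Var(Y))
rho_manual <- cov(x, y) / sqrt(var(x) * var(y))

# Built-in equivalent
rho_builtin <- cor(x, y)

print(rho_manual)
print(all.equal(rho_manual, rho_builtin))
```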
cor_matrix <- cor(selected_cols, use="complete.obs")
print(cor_matrix)
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix,
method = "color",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black")
From the correlation matrix above, we conclude that the selected features of the Titanic dataset are only weakly linearly correlated: apart from the diagonal, no coefficient comes close to 1 or -1, and the largest in absolute value is 0.38 (SibSp and Parch).
One notable correlation is between SibSp (number of siblings/spouses aboard) and Age, with a coefficient of about -0.31, which indicates a negative relationship. In other words, younger passengers tended to travel with more siblings or spouses (higher SibSp), and older passengers with fewer. A plausible explanation is that younger passengers were more likely to travel with family members who could take care of them.
Variance measures how widely a single random variable spreads around its mean. Covariance measures how two random variables vary together, that is, the direction of their linear dependence.
cov_matrix <- cov(selected_cols)
print(cov_matrix)
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
To clarify the results above: the diagonal elements are the variances, and the off-diagonal elements are the covariances. Fare has by far the largest variance (about 2800), meaning that ticket prices varied dramatically on the Titanic.
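Because covariances depend on the units of each variable, they are hard to compare directly. Dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below illustrates this using the covariance values copied from the printed output; the names `cov_printed`, `sds`, and `cor_from_cov` are introduced here for illustration only.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix copied from the printed output above
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Divide each covariance by the product of the two standard deviations
sds <- sqrt(diag(cov_printed))
cor_from_cov <- cov_printed / outer(sds, sds)

print(round(cor_from_cov, 4))

# cov2cor() from the stats package performs the same rescaling
print(all.equal(cor_from_cov, cov2cor(cov_printed)))
```

Up to rounding, the result should agree with the correlation matrix printed earlier.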
An eigenvector tells us a direction along which the data vary. Think of it like throwing a ball, which can travel in many directions: forward, upward, sideways, and so on.
cor_eigen <- eigen(cor_matrix)
cov_eigen <- eigen(cov_matrix)
print("Correlation Eigen Vector:")
## [1] "Correlation Eigen Vector:"
cor_vectors <- cor_eigen$vectors
print(cor_vectors)
## [,1] [,2] [,3] [,4]
## [1,] 0.4388714 -0.5962415 0.56095237 0.37043268
## [2,] -0.6250770 0.0732461 0.05500006 0.77517016
## [3,] -0.5908590 -0.1774532 0.60558695 -0.50265342
## [4,] -0.2599159 -0.7795136 -0.56175785 -0.09607493
print("Covariance Eigen Vector:")
## [1] "Covariance Eigen Vector:"
cov_vectors <- cov_eigen$vectors
print(cov_vectors)
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
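We computed the eigenvectors of both the correlation and the covariance matrix to see which is better suited for interpretation. The rows of each output correspond to Age, SibSp, Parch, and Fare, in that order. In the correlation-based eigenvectors, the variables contribute more evenly: the first one, for example, contrasts Age (0.44) with SibSp (-0.63) and Parch (-0.59). In the covariance-based eigenvectors, the variable with the largest variance dominates: the first is almost entirely Fare (0.9996) and the second almost entirely Age (0.9993). The correlation-based result therefore suggests that age, family composition, and fare are interconnected, rather than merely reflecting differences in scale.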
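The eigenvector output is easier to read once its rows are labelled, and the defining property of an eigenpair (the matrix times the eigenvector equals the eigenvalue times the eigenvector) can be verified numerically. The sketch below does both for the correlation matrix, using values copied from the printed output. The names `cor_printed`, `loadings`, and the column labels `v1` to `v4` are assumptions made for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix copied from the printed output above
cor_printed <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# symmetric = TRUE because the printed values are symmetric only up to rounding
e <- eigen(cor_printed, symmetric = TRUE)

# Label rows by variable and columns by eigenvector for easier reading
loadings <- e$vectors
dimnames(loadings) <- list(vars, paste0("v", 1:4))
print(round(loadings, 3))

# Defining property: A %*% v equals lambda * v for each eigenpair
v1 <- e$vectors[, 1]
residual <- cor_printed %*% v1 - e$values[1] * v1
print(max(abs(residual)))
```

The sign of an eigenvector is arbitrary, so a column may appear with all signs flipped relative to the output above without changing its meaning.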
An eigenvalue tells us how important the direction given by its eigenvector is, that is, how much of the variance lies along it. If the eigenvector is the direction in which the ball is thrown, the eigenvalue is how far the ball travels.
cor_eigen <- eigen(cor_matrix)
cov_eigen <- eigen(cov_matrix)
print("Correlation Eigen Value:")
## [1] "Correlation Eigen Value:"
cor_value <- cor_eigen$values
print(cor_value)
## [1] 1.6367503 1.1071770 0.6694052 0.5866676
print("Covariance Eigen Value:")
## [1] "Covariance Eigen Value:"
cov_value <- cov_eigen$values
print(cov_value)
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
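The correlation eigenvalues are of comparable size (1.64, 1.11, 0.67, 0.59), so several directions carry a meaningful share of the variation. The covariance eigenvalues, by contrast, mainly reflect that Fare and Age have much larger numerical ranges than SibSp and Parch: the first two (about 2802.56 and 209.04) are nearly identical to the variances of Fare (2800.41) and Age (211.02). This suggests that the correlation-based analysis is better suited for interpretation here, because the variables are measured on very different scales.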
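To make the comparison concrete, each eigenvalue can be expressed as a share of the total. The sketch below uses the eigenvalues copied from the printed output; the names `cor_values`, `cov_values`, `cor_share`, and `cov_share` are introduced here for illustration.

```r
# Eigenvalues copied from the printed output above
cor_values <- c(1.6367503, 1.1071770, 0.6694052, 0.5866676)
cov_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Share of the total variance captured along each eigenvector direction
cor_share <- cor_values / sum(cor_values)
cov_share <- cov_values / sum(cov_values)

print(round(cor_share, 3))
print(round(cov_share, 3))
```

For the correlation matrix, the first direction accounts for roughly 41% of the total, whereas for the covariance matrix the first direction alone accounts for roughly 93%.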