Multivariate Analysis

Dataset Overview

Titanic <- read.csv("Titanic-Dataset.csv")
head(Titanic)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Data Cleaning

First, we check for missing data in the dataset.

colSums(is.na(Titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

The output above shows that the Age column contains 177 missing values. There are many approaches to handling missing values, but for this task we simply delete every row that contains a missing value.

clean_titanic <- na.omit(Titanic)
colSums(is.na(clean_titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

selected_cols <- clean_titanic[, c("Age", "SibSp", "Parch", "Fare")]

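As the colSums() output above shows, the dataset no longer contains any missing values. We then keep the four numeric columns Age, SibSp, Parch and Fare for the rest of the analysis.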
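One caveat is worth checking: is.na() only counts values that R parsed as NA. In the head() output above, Cabin is blank for several passengers, yet the missing-value count for Cabin is zero, which suggests the blanks were read as empty strings. The sketch below uses a small made-up CSV (not the real file) to show how the na.strings argument of read.csv() changes the count.

```r
# Toy CSV text standing in for the real file; the values are illustrative only.
csv_text <- "PassengerId,Age,Cabin
1,22,
2,38,C85
3,,
4,35,C123"

# Default behaviour: blank text fields stay as empty strings (""),
# while blank numeric fields become NA.
default_read <- read.csv(text = csv_text)
colSums(is.na(default_read))

# Treat empty strings as missing as well.
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))
colSums(is.na(strict_read))
```

If the real file were read this way, na.omit() on the whole data frame would also drop every row without a cabin, so it may be safer to select the four numeric columns first and remove missing rows afterwards.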

Correlation Matrix

A correlation matrix is a two-dimensional table that shows the correlation coefficient between every pair of variables. Each coefficient measures the strength and direction of the linear relationship between two variables X and Y, and ranges from -1 to 1.

\(\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}\)
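As a small illustration of the formula, the sketch below computes the sample covariance, the two sample variances and their ratio by hand, then compares the result with the built-in cor(). The vectors x and y are made up for this example and are not taken from the Titanic data.

```r
# Toy vectors, made up for illustration (not taken from the Titanic data).
x <- c(22, 38, 26, 35, 35, 54)
y <- c(1, 1, 0, 1, 0, 0)

n <- length(x)

# Sample covariance and variances with the n - 1 denominator,
# matching what cov() and var() compute.
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
var_x  <- sum((x - mean(x))^2) / (n - 1)
var_y  <- sum((y - mean(y))^2) / (n - 1)

# Correlation: covariance divided by the square root of the product of variances.
rho_manual <- cov_xy / sqrt(var_x * var_y)

rho_manual
cor(x, y)  # built-in Pearson correlation; should agree with rho_manual
```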

cor_matrix <- cor(selected_cols, use="complete.obs")
print(cor_matrix)

##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

library(corrplot)

## corrplot 0.95 loaded

corrplot(cor_matrix,
         method = "color", 
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black")

From the correlation matrix above, we can conclude that the selected features of the Titanic dataset are only weakly linearly correlated: excluding the diagonal, every coefficient is far from -1 or 1, and the largest in absolute value is 0.38 (between SibSp and Parch). Note that weak correlation does not by itself imply independence.

One notable correlation is between SibSp (siblings/spouses aboard) and Age, with a coefficient of -0.31, which is a moderate negative correlation. A plausible reading is that younger passengers were more likely to travel with siblings (higher SibSp), while older passengers more often travelled alone or with only a spouse.
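To rank the pairs without reading the matrix by eye, the following sketch re-enters the printed correlation values and sorts the off-diagonal entries by absolute size. The names cor_printed and pairs_df are our own, chosen for this illustration.

```r
# Correlation matrix re-entered from the printed output above.
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.00000000, -0.30824676, -0.18911926, 0.09606669,
    -0.30824676,  1.00000000,  0.38381990, 0.13832879,
    -0.18911926,  0.38381990,  1.00000000, 0.20511888,
     0.09606669,  0.13832879,  0.20511888, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by looking only at the upper triangle.
upper <- upper.tri(cor_printed)
pairs_df <- data.frame(
  var1 = rownames(cor_printed)[row(cor_printed)[upper]],
  var2 = colnames(cor_printed)[col(cor_printed)[upper]],
  r    = cor_printed[upper]
)

# Sort by absolute correlation, strongest first.
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df, row.names = FALSE)
```

With the values above, SibSp and Parch come first (0.38), followed by Age and SibSp (-0.31).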

Variance-Covariance Matrix

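Variance measures how much a single random variable spreads around its mean. Covariance measures how two random variables vary together: its sign gives the direction of their linear relationship, while its magnitude depends on the units of both variables.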

cov_matrix <- cov(selected_cols)
print(cov_matrix)

##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

In the matrix above, the diagonal elements are the variances and the off-diagonal elements are the covariances. Fare has by far the largest variance (about 2800, compared with about 211 for Age and less than 1 for SibSp and Parch), meaning ticket prices varied dramatically on the Titanic.
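The covariance and correlation matrices are linked by a simple rescaling: dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below re-enters the printed covariance values (the name cov_printed is ours) and uses the base R function cov2cor() to recover the correlation matrix shown earlier.

```r
# Covariance matrix re-entered from the printed output above.
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal variances.
sds <- sqrt(diag(cov_printed))
round(sds, 2)

# cov2cor() divides each covariance by the product of the two standard deviations.
cor_from_cov <- cov2cor(cov_printed)
round(cor_from_cov, 4)
```

The result should match the correlation matrix printed earlier up to rounding, for example -0.3082 for Age and SibSp.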

Eigenvectors

An eigenvector tells us a direction along which the data varies. Think of it like throwing a ball, which can travel in several directions, such as forwards, upwards, or sideways; each eigenvector is one such direction.

cor_eigen <- eigen(cor_matrix)
cov_eigen <- eigen(cov_matrix)

print("Correlation Eigen Vector:")

## [1] "Correlation Eigen Vector:"

# Columns are eigenvectors; rows follow the order Age, SibSp, Parch, Fare
cor_vectors <- cor_eigen$vectors
print(cor_vectors)

##            [,1]       [,2]        [,3]        [,4]
## [1,]  0.4388714 -0.5962415  0.56095237  0.37043268
## [2,] -0.6250770  0.0732461  0.05500006  0.77517016
## [3,] -0.5908590 -0.1774532  0.60558695 -0.50265342
## [4,] -0.2599159 -0.7795136 -0.56175785 -0.09607493

print("Covariance Eigen Vector:")

## [1] "Covariance Eigen Vector:"

# Columns are eigenvectors; rows follow the order Age, SibSp, Parch, Fare
cov_vectors <- cov_eigen$vectors
print(cov_vectors)

##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

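We computed eigenvectors from both the correlation and the covariance matrix to see which is better suited for interpretation. The rows of each output correspond to Age, SibSp, Parch and Fare, in that order. In the covariance case, the variables with the largest variance dominate: the first eigenvector is almost entirely Fare (loading 0.9996) and the second almost entirely Age (0.9993). In the correlation case, the variables contribute more evenly. For example, the first eigenvector has loadings of 0.44 on Age, -0.63 on SibSp, -0.59 on Parch and -0.26 on Fare, so it contrasts age with family size and fare.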
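Because eigen() returns unlabeled rows, it can help to attach the variable names and to check the defining property numerically. The sketch below re-enters the printed correlation matrix; the column labels EV1 to EV4 and the name cor_printed are our own illustrative choices. The sign of each eigenvector is arbitrary, so a column may appear multiplied by -1 on another machine.

```r
# Correlation matrix re-entered from the printed output above.
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.00000000, -0.30824676, -0.18911926, 0.09606669,
    -0.30824676,  1.00000000,  0.38381990, 0.13832879,
    -0.18911926,  0.38381990,  1.00000000, 0.20511888,
     0.09606669,  0.13832879,  0.20511888, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig <- eigen(cor_printed, symmetric = TRUE)

# Label rows with variable names and columns with illustrative names.
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("EV", 1:4))
round(loadings, 3)

# Defining property: A v = lambda v for each eigenvector/eigenvalue pair.
residual <- cor_printed %*% eig$vectors - eig$vectors %*% diag(eig$values)
max(abs(residual))  # should be numerically zero

# Eigenvectors of a symmetric matrix are orthonormal: t(V) %*% V = I.
max(abs(crossprod(eig$vectors) - diag(4)))  # should be numerically zero
```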

Eigenvalues

An eigenvalue tells us how important the direction given by its eigenvector is, that is, how much variance lies along that direction. If the eigenvector is the direction in which the ball is thrown, the eigenvalue is how far the ball is thrown.

cor_eigen <- eigen(cor_matrix)
cov_eigen <- eigen(cov_matrix)

print("Correlation Eigen Value:")

## [1] "Correlation Eigen Value:"

cor_value <- cor_eigen$values
print(cor_value)

## [1] 1.6367503 1.1071770 0.6694052 0.5866676

print("Covariance Eigen Value:")

## [1] "Covariance Eigen Value:"

cov_value <- cov_eigen$values
print(cov_value)

## [1] 2802.5636587  209.0385659    0.9438783    0.4787214

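The correlation eigenvalues are spread across several directions, which points to genuine structure shared among the variables. The covariance eigenvalues mostly reflect that Fare and Age have much larger numerical ranges than SibSp and Parch: the first eigenvalue (about 2802.6) is almost exactly the variance of Fare (2800.4), and the second (about 209.0) is close to the variance of Age (211.0). This suggests that the correlation-based analysis is the better choice for interpretation here, because the variables are measured on very different scales.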
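One way to quantify this is the share of total variance carried by each eigenvalue. The sketch below re-enters the printed eigenvalues (the variable names are ours) and divides each by the sum.

```r
# Eigenvalues re-entered from the printed output above.
cor_values <- c(1.6367503, 1.1071770, 0.6694052, 0.5866676)
cov_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Proportion of total variance along each eigenvector.
prop_cor <- cor_values / sum(cor_values)
prop_cov <- cov_values / sum(cov_values)
round(rbind(correlation = prop_cor, covariance = prop_cov), 4)

# Cumulative proportion, direction by direction.
round(rbind(correlation = cumsum(prop_cor), covariance = cumsum(prop_cov)), 4)

# The eigenvalues sum to the trace of the matrix they came from:
# 4 (the number of variables) for the correlation matrix,
# and the sum of the four variances for the covariance matrix.
sum(cor_values)
sum(cov_values)
```

With these values, the first covariance direction alone carries about 93% of the total variance, while the first correlation direction carries about 41% and the remaining three still hold roughly 28%, 17% and 15%.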

Titanic Dataset Overview

2026-02-03