Titanic Dataset Analysis: Correlation Matrix, Variance-Covariance Matrix, and Eigenvalues

1. Introduction

This document analyzes the Titanic dataset, focusing on statistical matrices and dimensionality reduction techniques. The data come from the Titanic Dataset on Kaggle.

What we’ll analyze:

Correlation Matrix - Shows relationships between variables
Variance-Covariance Matrix - Measures data spread and co-movement
Eigenvalues & Eigenvectors - Used for Principal Component Analysis (PCA)

1.1 Variables Used

We will analyze four numerical variables from the Titanic dataset:

Age: Passenger’s age in years
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Fare: Ticket price in British pounds

These variables represent key demographic and economic characteristics that might relate to passenger survival patterns.

2. Data Preparation

2.1 Load Libraries

library(tidyverse)
library(corrplot)
library(knitr)
library(DT)

2.2 Load the Dataset

titanic <- read.csv("Titanic-Dataset.csv")

cat("Dataset dimensions:", nrow(titanic), "rows x", ncol(titanic), "columns\n")

## Dataset dimensions: 891 rows x 12 columns

2.3 Dataset Preview

head(titanic) %>% 
  kable(caption = "First Rows of the Titanic Dataset") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

First Rows of the Titanic Dataset
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500		S
6	0	3	Moran, Mr. James	male	NA	0	0	330877	8.4583		Q

2.4 Variable Selection

Select only the 4 variables we need.

titanic_selected <- titanic %>%
  select(Age, SibSp, Parch, Fare)

head(titanic_selected)

##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  NA     0     0  8.4583

2.5 Handling Missing Values

titanic_clean <- titanic_selected %>%
  na.omit()

cat("Number of rows before removing NA:", nrow(titanic_selected), "\n")

## Number of rows before removing NA: 891

cat("Number of rows after removing NA:", nrow(titanic_clean), "\n")

## Number of rows after removing NA: 714

cat("Rows removed:", nrow(titanic_selected) - nrow(titanic_clean), "\n")

## Rows removed: 177
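
Before dropping rows, it can help to check which column the missing values come from. The following is a minimal sketch in base R on a small made-up data frame; the `toy` values are hypothetical and are not taken from the Titanic file.

```r
# Hypothetical toy data with the same four column names (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 2),
  Fare  = c(7.25, 71.28, 7.93, 53.10, 8.05)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops every row that has at least one NA in any column
toy_clean <- na.omit(toy)
rows_removed <- nrow(toy) - nrow(toy_clean)
cat("Rows removed:", rows_removed, "\n")
```

Running the same `colSums(is.na(...))` call on `titanic_selected` would show how the 177 removed rows are distributed across the four variables.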

3. Data Analysis

3.1 Descriptive Statistics

summary(titanic_clean) %>%
  kable(caption = "Descriptive Statistics") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Descriptive Statistics
Age	SibSp	Parch	Fare
Min. : 0.42	Min. :0.0000	Min. :0.0000	Min. : 0.00
1st Qu.:20.12	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.: 8.05
Median :28.00	Median :0.0000	Median :0.0000	Median : 15.74
Mean :29.70	Mean :0.5126	Mean :0.4314	Mean : 34.69
3rd Qu.:38.00	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.: 33.38
Max. :80.00	Max. :5.0000	Max. :6.0000	Max. :512.33

Min/Max: The smallest and largest values
Median: The middle value when the data are sorted
Mean: The average value
1st/3rd Quartile: The values below which 25% and 75% of the data fall

3.1.1 Visualization of Data Distribution

par(mfrow = c(2, 2))

hist(titanic_clean$Age, 
     breaks = 30,
     col = "#3498db",
     border = "white",
     main = "Distribution of Age",
     xlab = "Age (years)",
     ylab = "Frequency",
     cex.main = 1.3)
abline(v = mean(titanic_clean$Age), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Age), col = "orange", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "Median"), 
       col = c("red", "orange"), 
       lty = 2, lwd = 2,
       cex = 0.8)

hist(titanic_clean$SibSp, 
     breaks = seq(-0.5, max(titanic_clean$SibSp) + 0.5, 1),
     col = "#e74c3c",
     border = "white",
     main = "Distribution of Siblings/Spouses",
     xlab = "Number of Siblings/Spouses",
     ylab = "Frequency",
     cex.main = 1.3)

hist(titanic_clean$Parch, 
     breaks = seq(-0.5, max(titanic_clean$Parch) + 0.5, 1),
     col = "#2ecc71",
     border = "white",
     main = "Distribution of Parents/Children",
     xlab = "Number of Parents/Children",
     ylab = "Frequency",
     cex.main = 1.3)

hist(titanic_clean$Fare, 
     breaks = 50,
     col = "#f39c12",
     border = "white",
     main = "Distribution of Fare",
     xlab = "Fare (British Pounds)",
     ylab = "Frequency",
     cex.main = 1.3)
abline(v = mean(titanic_clean$Fare), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Fare), col = "orange", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "Median"), 
       col = c("red", "orange"), 
       lty = 2, lwd = 2,
       cex = 0.8)

3.2 Correlation Matrix

A correlation matrix shows the strength and direction of relationships between variables.

Values range from -1 to +1
+1 = perfect positive correlation (both variables increase together)
-1 = perfect negative correlation (one increases, the other decreases)
0 = no linear relationship
The diagonal is always 1 (a variable perfectly correlates with itself)
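
To make the definition concrete, here is a small sketch that computes the Pearson correlation by hand and compares it with `cor()`. The vectors `x` and `y` are made-up toy values, not Titanic data.

```r
# Made-up toy vectors for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))

# A perfectly linear decreasing relationship gives a correlation of -1
r_negative <- cor(x, -2 * x + 10)
print(r_negative)
```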

cor_matrix <- cor(titanic_clean)

cor_matrix %>%
  round(4) %>%
  kable(caption = "Correlation Matrix") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Correlation Matrix
	Age	SibSp	Parch	Fare
Age	1.0000	-0.3082	-0.1891	0.0961
SibSp	-0.3082	1.0000	0.3838	0.1383
Parch	-0.1891	0.3838	1.0000	0.2051
Fare	0.0961	0.1383	0.2051	1.0000

corrplot(cor_matrix, 
         method = "color", 
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         number.cex = 1,
         title = "Correlation Matrix - Titanic Dataset",
         mar = c(0,0,2,0),
         col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))

cor_matrix_no_diag <- cor_matrix
diag(cor_matrix_no_diag) <- NA

max_cor <- which(abs(cor_matrix_no_diag) == max(abs(cor_matrix_no_diag), na.rm = TRUE), arr.ind = TRUE)
max_cor_value <- cor_matrix[max_cor[1,1], max_cor[1,2]]

cat("Strongest Correlation:\n",
    rownames(cor_matrix)[max_cor[1,1]], "vs", 
    colnames(cor_matrix)[max_cor[1,2]], "=", 
    round(max_cor_value, 4), "\n\n")

## Strongest Correlation:
##  Parch vs SibSp = 0.3838

Based on the correlation matrix, we can observe several interesting patterns:

Parch vs SibSp (positive correlation, r = 0.38): The strongest relationship in the matrix, though still only moderate. Both variables measure family size, and passengers traveling with siblings/spouses often also travel with parents/children.
Age vs SibSp (negative correlation, r = -0.31): Older passengers tend to travel with fewer siblings/spouses. This is plausible: as people age, they are less likely to travel in large family groups.
Age vs Parch (negative correlation, r = -0.19): Similarly, older passengers tend to have fewer parents/children aboard. A plausible reason is that their children are independent adults who are not traveling with them.
Fare vs other variables (weak correlations, |r| ≤ 0.21): Ticket prices do not correlate strongly with age or family size. A plausible explanation is that fare depended mainly on cabin class, which is not among the four variables analyzed here.
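
The code above reports only the single strongest pair. As a sketch of how all six pairs could be ranked, the block below rebuilds the correlation matrix from the rounded values in the table above and sorts the pairs by absolute correlation. The names `cor_printed` and `pair_table` are introduced here for illustration only.

```r
# Correlation values copied from the table above (rounded to 4 decimals)
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.0000, -0.3082, -0.1891, 0.0961,
    -0.3082,  1.0000,  0.3838, 0.1383,
    -0.1891,  0.3838,  1.0000, 0.2051,
     0.0961,  0.1383,  0.2051, 1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cor_printed), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_printed)[idx[, "row"]],
  var2 = colnames(cor_printed)[idx[, "col"]],
  r    = cor_printed[idx]
)

# Sort from strongest to weakest by absolute value
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table, row.names = FALSE)
```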

3.3 Variance-Covariance Matrix

This matrix contains two types of information:

Diagonal elements (variance): Show how spread out each variable’s data is
Off-diagonal elements (covariance): Show how two variables change together

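Correlation tells us the direction and strength of a linear relationship on a unit-free scale. Covariance tells us the direction too, but its magnitude also depends on the units and spread of the two variables, so covariances are not directly comparable across pairs.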

Formulas (sample versions, with n - 1 in the denominator):

Variance:

s² = Σ(xᵢ - x̄)² / (n - 1)

Covariance:

Cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)

Key difference from correlation: Covariance is not standardized, so its values depend on the units of measurement. Correlation is standardized (-1 to +1).
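
The unit dependence can be seen in a short sketch. The two vectors below are made-up toy values, and the factor of 100 is an arbitrary rescaling chosen only to illustrate a change of units.

```r
# Made-up toy data (illustrative only, not from the Titanic file)
age_years   <- c(22, 38, 26, 35, 54)
fare_pounds <- c(7.25, 71.28, 7.93, 53.10, 51.86)

n <- length(age_years)

# Sample covariance from the formula above (denominator n - 1)
cov_manual <- sum((age_years - mean(age_years)) *
                  (fare_pounds - mean(fare_pounds))) / (n - 1)

# Rescaling one variable by 100 rescales the covariance by 100...
fare_scaled <- fare_pounds * 100
cov_scaled  <- cov(age_years, fare_scaled)

# ...but leaves the correlation unchanged
cor_original <- cor(age_years, fare_pounds)
cor_scaled   <- cor(age_years, fare_scaled)

# Correlation is covariance divided by both standard deviations
cor_from_cov <- cov_manual / (sd(age_years) * sd(fare_pounds))

print(c(cov = cov_manual, cov_scaled = cov_scaled))
print(c(cor = cor_original, cor_scaled = cor_scaled))
```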

cov_matrix <- cov(titanic_clean)

cov_matrix %>%
  round(4) %>%
  kable(caption = "Variance-Covariance Matrix") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Variance-Covariance Matrix
	Age	SibSp	Parch	Fare
Age	211.0191	-4.1633	-2.3442	73.8490
SibSp	-4.1633	0.8645	0.3045	6.8062
Parch	-2.3442	0.3045	0.7281	9.2622
Fare	73.8490	6.8062	9.2622	2800.4131

variances <- diag(cov_matrix)

variance_df <- data.frame(
  Variabel = names(variances),
  Varians = variances,
  Std_Dev = sqrt(variances)
) %>%
  arrange(desc(Varians))

variance_df %>%
  kable(caption = "Variance and Standard Deviation of Each Variable",
        digits = 4,
        row.names = FALSE,
        col.names = c("Variable", "Variance", "Std. Dev.")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Variance and Standard Deviation of Each Variable
Variable	Variance	Std. Dev.
Fare	2800.4131	52.9189
Age	211.0191	14.5265
SibSp	0.8645	0.9298
Parch	0.7281	0.8533

corrplot(cov_matrix, 
         method = "color", 
         is.corr = FALSE,
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         number.cex = 0.8,
         title = "Variance-Covariance Matrix",
         mar = c(0,0,2,0),
         col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))

Note: Unlike the correlation plot, colors here represent the magnitude of covariance (not standardized), so variables with larger scales will have more intense colors.

Key findings from the variance-covariance matrix:

Fare has the highest variance (2800.41): This indicates huge variability in ticket prices. Some passengers paid very little while others paid a lot, which likely reflects the different cabin classes (1st, 2nd, 3rd class). The standard deviation is about £52.9, which is large relative to the mean fare of £34.69.
Age has moderate variance (≈211): Ages range widely, from young children to elderly passengers. Standard deviation of ≈14.5 years means there’s good age diversity.
SibSp and Parch have low variance: Most passengers traveled alone or with few family members. Values close to 0 are most common, with occasional larger families.
Covariances confirm correlation patterns:
- Positive Cov(SibSp, Parch) = family members travel together
- Negative Cov(Age, SibSp) = older passengers have fewer siblings/spouses aboard

3.4 Eigenvalues and Eigenvectors

Imagine we have a dataset with multiple variables plotted in multi-dimensional space. Eigenvalues and eigenvectors help us find the “principal directions” in this space - the directions where data varies the most.

Eigenvector: A direction/axis in the data
Eigenvalue: How much variance exists along that direction

In Principal Component Analysis (PCA), we use eigenvalues and eigenvectors to:

Reduce dimensions: Transform 4 variables into 2-3 “principal components”
Remove noise: Keep components with high eigenvalues, discard those with low eigenvalues
Visualize data: Plot high-dimensional data in 2D or 3D
Avoid multicollinearity: Create uncorrelated components for regression analysis

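A common rule, the Kaiser criterion, is to keep only components with eigenvalue > 1. It applies to the correlation matrix, where each standardized variable has variance 1, so such components explain more variance than a single original variable.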
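
As a sketch of how eigenvalues relate to PCA in practice, the block below compares `eigen()` on a correlation matrix with `prcomp()` on standardized data. The data are simulated toy values (the column names `x1`, `x2`, `x3` and the seed are arbitrary choices), not the Titanic variables.

```r
set.seed(42)  # arbitrary seed so the toy data is reproducible

# Simulated toy data: 200 rows, 3 variables on different scales
x1 <- rnorm(200, mean = 30, sd = 15)
x2 <- 0.5 * x1 + rnorm(200, sd = 10)
x3 <- rnorm(200, mean = 35, sd = 50)
toy <- cbind(x1, x2, x3)

# Route 1: eigen-decomposition of the correlation matrix
eig <- eigen(cor(toy))

# Route 2: PCA on centered and standardized data
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Squared standard deviations of the components equal the eigenvalues
print(rbind(eigen = eig$values, prcomp = pca$sdev^2))

# The eigenvalues of a correlation matrix sum to the number of variables,
# so an eigenvalue above 1 means "more than one variable's worth" of variance
total <- sum(eig$values)
print(total)
```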

3.4.1 Eigenanalysis of Correlation Matrix

eigen_cor <- eigen(cor_matrix)

prop_var_cor <- eigen_cor$values / sum(eigen_cor$values) * 100

eigen_summary_cor <- data.frame(
  PC = paste0("PC", 1:length(eigen_cor$values)),
  Eigenvalue = eigen_cor$values,
  Proportion = prop_var_cor,
  Cumulative = cumsum(prop_var_cor)
)

eigen_summary_cor %>%
  kable(caption = "Eigenvalues and Proportion of Variance (Correlation Matrix)", 
        digits = 4,
        col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Eigenvalues and Proportion of Variance (Correlation Matrix)
Principal Component	Eigenvalue	Proportion (%)	Cumulative (%)
PC1	1.6368	40.9188	40.9188
PC2	1.1072	27.6794	68.5982
PC3	0.6694	16.7351	85.3333
PC4	0.5867	14.6667	100.0000

Principal Component: Component label (PC1, PC2, etc.)
Eigenvalue: Amount of variance this component captures
Proportion (%): Percentage of total variance explained by this component
Cumulative (%): Running total of variance explained

What does this mean?

For example, PC1 has an eigenvalue of 1.637 out of a total of 4 (one unit of variance per standardized variable), so this single component captures about 41% of the total variance in our 4 original variables.

eigen_vectors_cor <- eigen_cor$vectors
rownames(eigen_vectors_cor) <- colnames(cor_matrix)
colnames(eigen_vectors_cor) <- paste0("PC", 1:ncol(eigen_vectors_cor))

eigen_vectors_cor %>%
  round(4) %>%
  kable(caption = "Eigenvectors (Correlation Matrix)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Eigenvectors (Correlation Matrix)
	PC1	PC2	PC3	PC4
Age	0.4389	-0.5962	0.5610	0.3704
SibSp	-0.6251	0.0732	0.0550	0.7752
Parch	-0.5909	-0.1775	0.6056	-0.5027
Fare	-0.2599	-0.7795	-0.5618	-0.0961

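Each column represents one principal component. The values (loadings) show how strongly, and in which direction, each original variable contributes to that component. For example, PC1 contrasts Age (0.44) with SibSp (-0.63) and Parch (-0.59). The overall sign of an eigenvector is arbitrary, so only the relative signs within a column matter.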
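
To illustrate how eigenvectors turn variables into components, here is a sketch on simulated toy data (the columns `a`, `b`, `c` and the seed are arbitrary assumptions). It checks that the eigenvectors are orthonormal and that the resulting component scores are uncorrelated, with variances equal to the eigenvalues.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated toy data: two correlated columns and one independent column
a  <- rnorm(100)
b  <- 0.6 * a + rnorm(100, sd = 0.8)
c3 <- rnorm(100)
toy <- cbind(a = a, b = b, c = c3)

eig <- eigen(cor(toy))
V <- eig$vectors  # columns are eigenvectors

# Eigenvectors are orthonormal: t(V) %*% V is the identity matrix
orthonormal_check <- crossprod(V)

# Scores: each component is a weighted sum of the standardized variables,
# with the weights taken from one eigenvector column
scores <- scale(toy) %*% V

# The variance of each score column equals the matching eigenvalue,
# and different components are uncorrelated (off-diagonals near zero)
score_cov <- cov(scores)
print(round(score_cov, 4))
print(round(eig$values, 4))
```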

3.4.2 Eigenanalysis of Covariance Matrix

eigen_cov <- eigen(cov_matrix)

prop_var_cov <- eigen_cov$values / sum(eigen_cov$values) * 100

eigen_summary_cov <- data.frame(
  PC = paste0("PC", 1:length(eigen_cov$values)),
  Eigenvalue = eigen_cov$values,
  Proportion = prop_var_cov,
  Cumulative = cumsum(prop_var_cov)
)

eigen_summary_cov %>%
  kable(caption = "Eigenvalues and Proportion of Variance (Covariance Matrix)", 
        digits = 4,
        col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Eigenvalues and Proportion of Variance (Covariance Matrix)
Principal Component	Eigenvalue	Proportion (%)	Cumulative (%)
PC1	2802.5637	93.0150	93.0150
PC2	209.0386	6.9378	99.9528
PC3	0.9439	0.0313	99.9841
PC4	0.4787	0.0159	100.0000

When we use the covariance matrix, variables with larger scales dominate the analysis. Fare has a variance of ~2800, while SibSp has a variance of ~0.86. That’s a ratio of about 3200:1! So PC1 from the covariance matrix is almost entirely driven by Fare.

eigen_vectors_cov <- eigen_cov$vectors
rownames(eigen_vectors_cov) <- colnames(cov_matrix)
colnames(eigen_vectors_cov) <- paste0("PC", 1:ncol(eigen_vectors_cov))

eigen_vectors_cov %>%
  round(4) %>%
  kable(caption = "Eigenvectors (Covariance Matrix)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Eigenvectors (Covariance Matrix)
	PC1	PC2	PC3	PC4
Age	0.0285	0.9993	-0.0240	0.0036
SibSp	0.0024	-0.0209	-0.7737	0.6332
Parch	0.0033	-0.0125	-0.6331	-0.7740
Fare	0.9996	-0.0284	0.0046	0.0009

PC1 is almost entirely composed of Fare (loading ≈ 1.000), while other variables contribute very little. This confirms that Fare dominates when we don’t standardize variables.
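
The scale effect can be checked directly from the numbers already reported. The sketch below rebuilds the covariance matrix from the rounded values in Section 3.3 (so results agree with the tables only up to rounding). The names `cov_printed` and `var_share` are introduced here for illustration.

```r
# Covariance values copied from the table in Section 3.3 (rounded to 4 decimals)
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_printed <- matrix(
  c(211.0191, -4.1633, -2.3442,   73.8490,
     -4.1633,  0.8645,  0.3045,    6.8062,
     -2.3442,  0.3045,  0.7281,    9.2622,
     73.8490,  6.8062,  9.2622, 2800.4131),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Total variance is the trace, which equals the sum of the eigenvalues
total_var <- sum(diag(cov_printed))
eig_cov   <- eigen(cov_printed)

# Share of total variance contributed by each variable on its own (percent)
var_share <- diag(cov_printed) / total_var * 100
print(round(var_share, 2))

# Standardizing removes the scale effect: cov2cor() rescales to correlations
cor_from_cov <- cov2cor(cov_printed)
print(round(cor_from_cov, 4))
```

Fare alone accounts for about 93% of the total variance, which is why the first covariance-based component is essentially the Fare axis.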

3.4.3 Scree Plot

A scree plot shows eigenvalues in descending order. The name comes from “scree” - the rubble at the bottom of a cliff. We’re looking for the “cliff” (important components) vs the “scree” (unimportant ones).

par(mfrow = c(1, 2))

plot(1:length(eigen_cor$values), eigen_cor$values, 
     type = "b", 
     pch = 19,
     col = "blue",
     main = "Scree Plot (Correlation Matrix)",
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     cex.main = 1.3,
     lwd = 2)
abline(h = 1, col = "red", lty = 2, lwd = 2)
text(3, 1.1, "Kaiser Criterion (Eigenvalue = 1)", col = "red", cex = 0.8)
grid()

plot(1:length(eigen_cov$values), eigen_cov$values, 
     type = "b", 
     pch = 19,
     col = "darkgreen",
     main = "Scree Plot (Covariance Matrix)",
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     cex.main = 1.3,
     lwd = 2)
grid()

Based on the scree plot for the correlation matrix and the Kaiser criterion, we would typically keep 2 principal components (eigenvalues 1.64 and 1.11) for further analysis.
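
The selection rule can also be applied numerically. This sketch uses the rounded correlation-matrix eigenvalues from Section 3.4.1. The 80% cumulative threshold is an assumption added here for comparison and is not part of the analysis above.

```r
# Eigenvalues of the correlation matrix, copied from the table in Section 3.4.1
eigenvalues <- c(PC1 = 1.6368, PC2 = 1.1072, PC3 = 0.6694, PC4 = 0.5867)

# Kaiser criterion: keep components whose eigenvalue exceeds 1
kaiser_keep <- names(eigenvalues)[eigenvalues > 1]

# Cumulative share of variance, in percent
cum_pct <- cumsum(eigenvalues) / sum(eigenvalues) * 100

# Alternative rule (assumed threshold, for comparison only):
# smallest number of components whose cumulative share reaches 80%
threshold <- 80
n_for_threshold <- which(cum_pct >= threshold)[1]

print(kaiser_keep)
print(round(cum_pct, 2))
print(n_for_threshold)
```

The two rules need not agree: the Kaiser criterion keeps 2 components here, while an 80% threshold would require 3.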

4. Summary and Conclusion

Based on our analysis of the Titanic dataset, here are the main findings:

1. Correlation Analysis

Moderate correlation between SibSp and Parch (r = 0.38): Family members tend to travel together, which makes intuitive sense.
Negative correlation between Age and family size variables: Older passengers typically travel with fewer family members.
Correlations with Fare: Ticket price is only weakly correlated with age and family size (|r| ≤ 0.21), which is consistent with fare being determined more by cabin class than by these passenger characteristics.

2. Variance Analysis

Fare shows highest variability (Var ≈ 2800): Enormous range in ticket prices reflecting the class system on the Titanic (1st, 2nd, 3rd class).
Age shows moderate spread (SD ≈ 14.5 years): Good diversity in passenger ages.
Family size variables have low variance: Most passengers traveled alone or with small families, with occasional large families as outliers.

3. Principal Component Analysis (PCA)

Two principal components (eigenvalue > 1) explain approximately 68.6% of total variance.
PC1: Represents “Family Travel Pattern” - capturing family size and passenger age
PC2: Represents “Passenger Demographics & Economic Status” - capturing age and fare

This means we can reduce from 4 variables to 2 components while retaining about 69% of the variance. This is useful for:

Data visualization (plotting in 2D instead of 4D)
Simplifying machine learning models
Reducing computational complexity
Removing multicollinearity

Note: The difference between the correlation matrix and covariance matrix results demonstrates why standardization matters when variables have different scales. The correlation matrix gives equal weight to all variables, while the covariance matrix lets the largest-scale variable (Fare) dominate.

In conclusion,

This analysis successfully demonstrates three fundamental statistical techniques:

Correlation Matrix: Revealed relationships between variables, with family-related variables showing expected positive correlations.
Variance-Covariance Matrix: Showed that Fare has by far the highest variability, reflecting the economic diversity of Titanic passengers.
Eigenvalue Decomposition: Enabled us to reduce 4 variables to 2 principal components while retaining about 69% of the total variance, demonstrating the power of dimensionality reduction.

Thank you for reading this analysis. If you have questions or suggestions, feel free to reach out!

Note: To reproduce this analysis, ensure you have the Titanic-Dataset.csv file in your R working directory and all required packages installed.