Titanic Dataset Analysis: Correlation Matrix, Variance-Covariance Matrix, and Eigenvalues


1. Introduction

This document analyzes the Titanic dataset, focusing on statistical matrices and dimensionality reduction techniques. The data come from the Titanic Dataset on Kaggle.

What we’ll analyze:

  • Correlation Matrix - Shows relationships between variables

  • Variance-Covariance Matrix - Measures data spread and co-movement

  • Eigenvalues & Eigenvectors - Used for Principal Component Analysis (PCA)

1.1 Variables Used

We will analyze four numerical variables from the Titanic dataset:

  1. Age: Passenger’s age in years
  2. SibSp: Number of siblings/spouses aboard
  3. Parch: Number of parents/children aboard
  4. Fare: Ticket price in British pounds

These variables represent key demographic and economic characteristics that might relate to passenger survival patterns.


2. Data Preparation

2.1 Load Libraries

library(tidyverse)
library(corrplot)
library(knitr)
library(DT)

2.2 Load the Dataset

titanic <- read.csv("Titanic-Dataset.csv")

cat("Dataset dimension:", dim(titanic)[1], "row x", dim(titanic)[2], "columns\n")
## Dataset dimension: 891 row x 12 columns

2.3 Dataset Preview

head(titanic) %>% 
  kable(caption = "Head Rows of the Titanic Dataset") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Head Rows of the Titanic Dataset
PassengerId  Survived  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                              male    22   1      0      A/5 21171         7.2500          S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26   0      0      STON/O2. 3101282  7.9250          S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                             male    35   0      0      373450            8.0500          S
6            0         3       Moran, Mr. James                                     male    NA   0      0      330877            8.4583          Q

2.4 Variable Selection

Select only the 4 variables we need.

titanic_selected <- titanic %>%
  select(Age, SibSp, Parch, Fare)

head(titanic_selected)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  NA     0     0  8.4583

2.5 Handling Missing Values

titanic_clean <- titanic_selected %>%
  na.omit()

cat("Number of rows before removing NA:", nrow(titanic_selected), "\n")
## Number of rows before removing NA: 891
cat("Number of rows after removing NA:", nrow(titanic_clean), "\n")
## Number of rows after removing NA: 714
cat("Rows removed:", nrow(titanic_selected) - nrow(titanic_clean), "\n")
## Rows removed: 177
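
Before dropping rows, it can help to check which columns actually contain the missing values. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the real data) to show the per-column count next to the effect of na.omit(). In the preview above, the only visible NA is in Age.

```r
# Toy data frame standing in for titanic_selected (values are illustrative only)
toy <- data.frame(
  Age   = c(22, 38, NA, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.4583)
)

# Count missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Listwise deletion: drop every row that has at least one NA
toy_clean <- na.omit(toy)
rows_removed <- nrow(toy) - nrow(toy_clean)
cat("Rows removed:", rows_removed, "\n")
```

The same colSums(is.na(...)) call could be applied to titanic_selected to confirm where the 177 removed rows come from.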

3. Data Analysis

3.1 Descriptive Statistics

summary(titanic_clean) %>%
  kable(caption = "Descriptive Statistics") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Descriptive Statistics
     Age            SibSp            Parch             Fare
Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00
1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  8.05
Median :28.00   Median :0.0000   Median :0.0000   Median : 15.74
Mean   :29.70   Mean   :0.5126   Mean   :0.4314   Mean   : 34.69
3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 33.38
Max.   :80.00   Max.   :5.0000   Max.   :6.0000   Max.   :512.33

  • Min/Max: The smallest and largest values
  • Median: The middle value when data is sorted
  • Mean: The average value
  • 1st/3rd Quartile: 25% and 75% points in the data

3.1.1 Visualization of Data Distribution

par(mfrow = c(2, 2))

hist(titanic_clean$Age, 
     breaks = 30,
     col = "#3498db",
     border = "white",
     main = "Distribution of Age",
     xlab = "Age (years)",
     ylab = "Frequency",
     cex.main = 1.3)
abline(v = mean(titanic_clean$Age), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Age), col = "orange", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "Median"), 
       col = c("red", "orange"), 
       lty = 2, lwd = 2,
       cex = 0.8)

hist(titanic_clean$SibSp, 
     breaks = seq(-0.5, max(titanic_clean$SibSp) + 0.5, 1),
     col = "#e74c3c",
     border = "white",
     main = "Distribution of Siblings/Spouses",
     xlab = "Number of Siblings/Spouses",
     ylab = "Frequency",
     cex.main = 1.3)

hist(titanic_clean$Parch, 
     breaks = seq(-0.5, max(titanic_clean$Parch) + 0.5, 1),
     col = "#2ecc71",
     border = "white",
     main = "Distribution of Parents/Children",
     xlab = "Number of Parents/Children",
     ylab = "Frequency",
     cex.main = 1.3)

hist(titanic_clean$Fare, 
     breaks = 50,
     col = "#f39c12",
     border = "white",
     main = "Distribution of Fare",
     xlab = "Fare (British Pounds)",
     ylab = "Frequency",
     cex.main = 1.3)
abline(v = mean(titanic_clean$Fare), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Fare), col = "orange", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "Median"), 
       col = c("red", "orange"), 
       lty = 2, lwd = 2,
       cex = 0.8)


3.2 Correlation Matrix

A correlation matrix shows the strength and direction of relationships between variables.

  • Values range from -1 to +1
  • +1 = perfect positive correlation (both variables increase together)
  • -1 = perfect negative correlation (one increases, the other decreases)
  • 0 = no linear relationship
  • The diagonal is always 1 (a variable perfectly correlates with itself)
cor_matrix <- cor(titanic_clean)

cor_matrix %>%
  round(4) %>%
  kable(caption = "Correlation Matrix") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Correlation Matrix
           Age    SibSp    Parch     Fare
Age     1.0000  -0.3082  -0.1891   0.0961
SibSp  -0.3082   1.0000   0.3838   0.1383
Parch  -0.1891   0.3838   1.0000   0.2051
Fare    0.0961   0.1383   0.2051   1.0000

corrplot(cor_matrix, 
         method = "color", 
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         number.cex = 1,
         title = "Correlation Matrix - Titanic Dataset",
         mar = c(0,0,2,0),
         col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))

cor_matrix_no_diag <- cor_matrix
diag(cor_matrix_no_diag) <- NA

max_cor <- which(abs(cor_matrix_no_diag) == max(abs(cor_matrix_no_diag), na.rm = TRUE), arr.ind = TRUE)
max_cor_value <- cor_matrix[max_cor[1,1], max_cor[1,2]]

cat("Strongest Correlation:\n",
    rownames(cor_matrix)[max_cor[1,1]], "vs", 
    colnames(cor_matrix)[max_cor[1,2]], "=", 
    round(max_cor_value, 4), "\n\n")
## Strongest Correlation:
##  Parch vs SibSp = 0.3838

Based on the correlation matrix, we can observe several interesting patterns:

  1. Parch vs SibSp (positive correlation): This makes sense because both variables represent family size. Passengers traveling with siblings/spouses often also travel with parents/children.

  2. Age vs SibSp (negative correlation): Older passengers tend to travel with fewer siblings/spouses. This is logical - as people age, they’re less likely to travel with large family groups.

  3. Age vs Parch (negative correlation): Similar to above, older passengers have fewer parents/children aboard. This is expected since older passengers likely have independent adult children who aren’t traveling with them.

  4. Fare vs other variables: Ticket prices are only weakly correlated with the demographic variables (|r| ≤ 0.21). Fare may therefore depend more on factors not analyzed here, such as cabin class, than on age or family size.


3.3 Variance-Covariance Matrix

This matrix contains two types of information:

  • Diagonal elements (variance): Show how spread out each variable’s data is
  • Off-diagonal elements (covariance): Show how two variables change together

Correlation tells us the direction and strength of a linear relationship on a unit-free scale. Covariance has the same sign (direction), but its magnitude also depends on the units and spread of the two variables.

Formulas:

  • Variance:

    σ² = Σ(xᵢ - x̄)² / (n-1)

  • Covariance:

    Cov(X,Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n-1)
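
As a quick check of the two formulas, the sketch below computes them by hand on two short made-up vectors (`x` and `y` are illustrative, not Titanic columns) and compares the results with R's built-in var() and cov(), which also divide by n - 1.

```r
# Toy vectors (illustrative values, not taken from the Titanic data)
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: sum of cross-products of deviations divided by (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Compare the hand-computed values with R's built-in functions
print(all.equal(var_manual, var(x)))
print(all.equal(cov_manual, cov(x, y)))
```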

Key difference from correlation: Covariance is not standardized, so its values depend on the units of measurement. Correlation is standardized (-1 to +1).
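
To make the link between the two matrices concrete, here is a minimal sketch that standardizes a covariance matrix into a correlation matrix. It assumes the rounded covariance values reported in this section (typed in by hand as `cov_printed`, a name introduced only for this example) and divides each entry by the product of the two standard deviations. Base R's cov2cor() performs the same conversion.

```r
# Covariance matrix typed in from the rounded values reported in this section
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_printed <- matrix(
  c(211.0191, -4.1633, -2.3442,   73.8490,
     -4.1633,  0.8645,  0.3045,    6.8062,
     -2.3442,  0.3045,  0.7281,    9.2622,
     73.8490,  6.8062,  9.2622, 2800.4131),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standardize: r_ij = cov_ij / (sd_i * sd_j)
sds <- sqrt(diag(cov_printed))
cor_manual <- cov_printed / outer(sds, sds)

# cov2cor() from base R does the same conversion
cor_builtin <- cov2cor(cov_printed)

print(round(cor_manual, 4))
```

Up to rounding, the result should agree with the correlation matrix in Section 3.2, which shows that the two matrices carry the same information once scale is removed.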

cov_matrix <- cov(titanic_clean)

cov_matrix %>%
  round(4) %>%
  kable(caption = "Variance-Covariance Matrix") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Variance-Covariance Matrix
            Age    SibSp    Parch       Fare
Age    211.0191  -4.1633  -2.3442    73.8490
SibSp   -4.1633   0.8645   0.3045     6.8062
Parch   -2.3442   0.3045   0.7281     9.2622
Fare    73.8490   6.8062   9.2622  2800.4131

variances <- diag(cov_matrix)

# row.names = NULL stops the variable names from being repeated as row names
variance_df <- data.frame(
  Variable = names(variances),
  Variance = variances,
  Std_Dev = sqrt(variances),
  row.names = NULL
) %>%
  arrange(desc(Variance))

variance_df %>%
  kable(caption = "Variance and Standard Deviation of Each Variable", digits = 4) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Variance and Standard Deviation of Each Variable
Variable    Variance   Std_Dev
Fare       2800.4131   52.9189
Age         211.0191   14.5265
SibSp         0.8645    0.9298
Parch         0.7281    0.8533

corrplot(cov_matrix, 
         method = "color", 
         is.corr = FALSE,
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         number.cex = 0.8,
         title = "Variance-Covariance Matrix",
         mar = c(0,0,2,0),
         col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))

Note: Unlike the correlation plot, colors here represent the magnitude of covariance (not standardized), so variables with larger scales will have more intense colors.

Key findings from the variance-covariance matrix:

  1. Fare has the highest variance (2800.41): This indicates huge variability in ticket prices. Some passengers paid very little while others paid a lot, reflecting different cabin classes (1st, 2nd, 3rd class). The standard deviation is about £52.9, which is large compared with the median fare of £15.74.

  2. Age has moderate variance (≈211): Ages range widely, from young children to elderly passengers. Standard deviation of ≈14.5 years means there’s good age diversity.

  3. SibSp and Parch have low variance: Most passengers traveled alone or with few family members. Values close to 0 are most common, with occasional larger families.

  4. Covariances confirm correlation patterns:

    • Positive Cov(SibSp, Parch) = family members travel together
    • Negative Cov(Age, SibSp) = older passengers have fewer siblings/spouses aboard

3.4 Eigenvalues and Eigenvectors

Imagine we have a dataset with multiple variables plotted in multi-dimensional space. Eigenvalues and eigenvectors help us find the “principal directions” in this space - the directions where data varies the most.

  • Eigenvector: A direction/axis in the data
  • Eigenvalue: How much variance exists along that direction

In Principal Component Analysis (PCA), we use eigenvalues and eigenvectors to:

  1. Reduce dimensions: Transform 4 variables into 2-3 “principal components”
  2. Remove noise: Keep components with high eigenvalues, discard those with low eigenvalues
  3. Visualize data: Plot high-dimensional data in 2D or 3D
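  4. Avoid multicollinearity: Create uncorrelated components for regression analysis

A common rule of thumb, the Kaiser criterion, is to keep only components with eigenvalue > 1. It applies to the correlation matrix, where each standardized variable has variance 1, so a retained component explains more variance than any single original variable.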
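
The rule can be sketched in a few lines of R. The example assumes the rounded correlation-matrix eigenvalues reported in Section 3.4.1, typed in by hand; the names `eigenvalues`, `keep`, and `retained_share` are introduced only for this illustration.

```r
# Eigenvalues of the correlation matrix, copied (rounded) from Section 3.4.1
eigenvalues <- c(PC1 = 1.6368, PC2 = 1.1072, PC3 = 0.6694, PC4 = 0.5867)

# For a correlation matrix the eigenvalues sum to the number of variables
# (up to rounding), so the average eigenvalue is 1
print(sum(eigenvalues))

# Kaiser criterion: keep components whose eigenvalue exceeds 1
keep <- eigenvalues > 1
print(names(eigenvalues)[keep])

# Share of total variance captured by the retained components, in percent
retained_share <- sum(eigenvalues[keep]) / sum(eigenvalues) * 100
print(round(retained_share, 1))
```

The criterion is only a heuristic; it is usually read together with the scree plot and the cumulative proportion of variance.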


3.4.1 Eigenanalysis of Correlation Matrix

eigen_cor <- eigen(cor_matrix)

prop_var_cor <- eigen_cor$values / sum(eigen_cor$values) * 100

eigen_summary_cor <- data.frame(
  PC = paste0("PC", 1:length(eigen_cor$values)),
  Eigenvalue = eigen_cor$values,
  Proportion = prop_var_cor,
  Cumulative = cumsum(prop_var_cor)
)

eigen_summary_cor %>%
  kable(caption = "Eigenvalues and Proportion of Variance (Correlation Matrix)", 
        digits = 4,
        col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Eigenvalues and Proportion of Variance (Correlation Matrix)
Principal Component   Eigenvalue   Proportion (%)   Cumulative (%)
PC1                       1.6368          40.9188          40.9188
PC2                       1.1072          27.6794          68.5982
PC3                       0.6694          16.7351          85.3333
PC4                       0.5867          14.6667         100.0000

  • Principal Component: Component number (PC1, PC2, etc.)
  • Eigenvalue: How much total variance this component captures
  • Proportion (%): Percentage of total variance explained by this component
  • Cumulative (%): Running total of variance explained

What does this mean?

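For example, PC1 has an eigenvalue of 1.637 and explains about 41% of the total variance (1.637 / 4 ≈ 0.409, because the eigenvalues of a 4-variable correlation matrix sum to 4). In other words, this single component captures about 41% of the variance in our 4 standardized variables.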

eigen_vectors_cor <- eigen_cor$vectors
rownames(eigen_vectors_cor) <- colnames(cor_matrix)
colnames(eigen_vectors_cor) <- paste0("PC", 1:ncol(eigen_vectors_cor))

eigen_vectors_cor %>%
  round(4) %>%
  kable(caption = "Eigenvectors (Correlation Matrix)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Eigenvectors (Correlation Matrix)
           PC1      PC2      PC3      PC4
Age     0.4389  -0.5962   0.5610   0.3704
SibSp  -0.6251   0.0732   0.0550   0.7752
Parch  -0.5909  -0.1775   0.6056  -0.5027
Fare   -0.2599  -0.7795  -0.5618  -0.0961

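Each column represents one principal component. The values (loadings) are the weights of the original variables in that component; a larger absolute value means a larger contribution. The overall sign of a column is arbitrary, so only the relative signs within a column are meaningful.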
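
The defining properties of these eigenvectors can be checked numerically. The sketch below assumes the rounded correlation matrix from Section 3.2, typed in by hand as `cor_printed` (a name used only here), so its eigenvalues match the table above only up to rounding. It verifies that R v = λ v holds for every column, that the columns are orthonormal, and that flipping the sign of an eigenvector leaves the equation intact.

```r
# Correlation matrix typed in from the rounded values in Section 3.2
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.0000, -0.3082, -0.1891, 0.0961,
    -0.3082,  1.0000,  0.3838, 0.1383,
    -0.1891,  0.3838,  1.0000, 0.2051,
     0.0961,  0.1383,  0.2051, 1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

e <- eigen(cor_printed)
V <- e$vectors      # eigenvectors stored as columns
lambda <- e$values  # eigenvalues in decreasing order

# Defining property: R %*% v equals lambda * v for every eigenvector
residual <- cor_printed %*% V - V %*% diag(lambda)
print(max(abs(residual)))

# Eigenvectors of a symmetric matrix are orthonormal: t(V) %*% V is the identity
orthonormality_gap <- max(abs(crossprod(V) - diag(4)))
print(orthonormality_gap)

# The sign is arbitrary: -v satisfies the same eigenvalue equation
v1 <- V[, 1]
sign_gap <- max(abs(cor_printed %*% (-v1) - lambda[1] * (-v1)))
print(sign_gap)

print(round(lambda, 4))
```

Because of the sign ambiguity, another machine or R version may return some columns multiplied by -1 without changing the analysis.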


3.4.2 Eigenanalysis of Covariance Matrix

eigen_cov <- eigen(cov_matrix)

prop_var_cov <- eigen_cov$values / sum(eigen_cov$values) * 100

eigen_summary_cov <- data.frame(
  PC = paste0("PC", 1:length(eigen_cov$values)),
  Eigenvalue = eigen_cov$values,
  Proportion = prop_var_cov,
  Cumulative = cumsum(prop_var_cov)
)

eigen_summary_cov %>%
  kable(caption = "Eigenvalues and Proportion of Variance (Covariance Matrix)", 
        digits = 4,
        col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Eigenvalues and Proportion of Variance (Covariance Matrix)
Principal Component   Eigenvalue   Proportion (%)   Cumulative (%)
PC1                    2802.5637          93.0150          93.0150
PC2                     209.0386           6.9378          99.9528
PC3                       0.9439           0.0313          99.9841
PC4                       0.4787           0.0159         100.0000

When we use the covariance matrix, variables with larger scales dominate the analysis. Fare has a variance of ~2800, while SibSp has a variance of ~0.86, a ratio of more than 3000:1. As a result, PC1 from the covariance matrix is almost entirely driven by Fare.

eigen_vectors_cov <- eigen_cov$vectors
rownames(eigen_vectors_cov) <- colnames(cov_matrix)
colnames(eigen_vectors_cov) <- paste0("PC", 1:ncol(eigen_vectors_cov))

eigen_vectors_cov %>%
  round(4) %>%
  kable(caption = "Eigenvectors (Covariance Matrix)") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Eigenvectors (Covariance Matrix)
          PC1      PC2      PC3      PC4
Age    0.0285   0.9993  -0.0240   0.0036
SibSp  0.0024  -0.0209  -0.7737   0.6332
Parch  0.0033  -0.0125  -0.6331  -0.7740
Fare   0.9996  -0.0284   0.0046   0.0009

PC1 is almost entirely composed of Fare (loading ≈ 1.000), while other variables contribute very little. This confirms that Fare dominates when we don’t standardize variables.
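
The same contrast can be reproduced with base R's prcomp(). The sketch below is an illustration on simulated data, not the Titanic data: the data frame `sim` and its columns `small`, `large`, and `mixed` are made up for this example. It shows that prcomp() without scaling matches the eigenvalues of the covariance matrix, that prcomp() with scale. = TRUE matches those of the correlation matrix, and that the large-scale column dominates PC1 only in the unscaled case.

```r
set.seed(42)

# Simulated data with very different scales (hypothetical, not the Titanic data)
n <- 200
sim <- data.frame(
  small = rnorm(n, mean = 0, sd = 1),
  large = rnorm(n, mean = 0, sd = 50)
)
sim$mixed <- sim$small + rnorm(n, sd = 0.5)  # correlated with `small`

# PCA on the covariance matrix corresponds to prcomp() without scaling
pca_cov <- prcomp(sim, center = TRUE, scale. = FALSE)
eig_cov <- eigen(cov(sim))$values

# PCA on the correlation matrix corresponds to prcomp() with scaling
pca_cor <- prcomp(sim, center = TRUE, scale. = TRUE)
eig_cor <- eigen(cor(sim))$values

# prcomp() reports standard deviations, so square them to get eigenvalues
print(all.equal(pca_cov$sdev^2, eig_cov))
print(all.equal(pca_cor$sdev^2, eig_cor))

# Share of total variance carried by PC1 under each approach
share_cov <- eig_cov[1] / sum(eig_cov)
share_cor <- eig_cor[1] / sum(eig_cor)
print(round(c(covariance = share_cov, correlation = share_cor), 3))
```

For the Titanic variables, the analogous calls would be prcomp(titanic_clean) and prcomp(titanic_clean, scale. = TRUE).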


3.4.3 Scree Plot

A scree plot shows eigenvalues in descending order. The name comes from “scree” - the rubble at the bottom of a cliff. We’re looking for the “cliff” (important components) vs the “scree” (unimportant ones).

par(mfrow = c(1, 2))

plot(1:length(eigen_cor$values), eigen_cor$values, 
     type = "b", 
     pch = 19,
     col = "blue",
     main = "Scree Plot (Correlation Matrix)",
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     cex.main = 1.3,
     lwd = 2)
abline(h = 1, col = "red", lty = 2, lwd = 2)
text(3, 1.1, "Kaiser Criterion (Eigenvalue = 1)", col = "red", cex = 0.8)
grid()

plot(1:length(eigen_cov$values), eigen_cov$values, 
     type = "b", 
     pch = 19,
     col = "darkgreen",
     main = "Scree Plot (Covariance Matrix)",
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     cex.main = 1.3,
     lwd = 2)
grid()

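Based on the scree plot for the correlation matrix, where two eigenvalues lie above the Kaiser line, we would typically keep 2 principal components for further analysis. The covariance-matrix plot is less informative here because PC1 is dominated by the scale of Fare.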


4. Summary and Conclusion

Based on our analysis of the Titanic dataset, here are the main findings:

1. Correlation Analysis

  • Moderate correlation between SibSp and Parch (r = 0.38): Family members tend to travel together, which makes intuitive sense.
  • Negative correlation between Age and family size variables: Older passengers typically travel with fewer family members.
  • Correlations with Fare: Ticket price is only weakly correlated with the demographic variables, which suggests fare may have been driven more by cabin class than by passenger characteristics.

2. Variance Analysis

  • Fare shows highest variability (Var ≈ 2800): Enormous range in ticket prices reflecting the class system on the Titanic (1st, 2nd, 3rd class).
  • Age shows moderate spread (SD ≈ 14.5 years): Good diversity in passenger ages.
  • Family size variables have low variance: Most passengers traveled alone or with small families, with occasional large families as outliers.

3. Principal Component Analysis (PCA)

  • Two principal components (eigenvalue > 1) explain approximately 68.6% of total variance.
  • PC1: Represents “Family Travel Pattern” - capturing family size and passenger age
  • PC2: Represents “Passenger Demographics & Economic Status” - capturing age and fare

This means we can reduce from 4 variables to 2 components while retaining about 69% of the total variance. This is useful for:

  • Data visualization (plotting in 2D instead of 4D)
  • Simplifying machine learning models
  • Reducing computational complexity
  • Removing multicollinearity

Note: The difference between the correlation-matrix and covariance-matrix results demonstrates why standardization is crucial when variables have different scales. The correlation matrix gives equal importance to all variables, while the covariance matrix lets the largest-scale variable (Fare) dominate.

In conclusion, this analysis demonstrates three fundamental statistical techniques:

  1. Correlation Matrix: Revealed relationships between variables, with family-related variables showing expected positive correlations.

  2. Variance-Covariance Matrix: Showed that Fare has by far the highest variability, reflecting the economic diversity of Titanic passengers.

  3. Eigenvalue Decomposition: Enabled us to reduce 4 variables to 2 principal components while retaining about 69% of the total variance, demonstrating the power of dimensionality reduction.


Thank you for reading this analysis. If you have questions or suggestions, feel free to reach out!

Note: To reproduce this analysis, make sure the Titanic-Dataset.csv file is in your R working directory and that all required packages are installed, including kableExtra, which the code calls via kableExtra:: without loading it.