Titanic Dataset Analysis: Correlation Matrix, Variance-Covariance Matrix, and Eigenvalues
1. Introduction
This document presents an analysis of the Titanic dataset, focusing on statistical matrices and dimensionality reduction techniques. The dataset is sourced from Kaggle (Titanic Dataset).
What we’ll analyze:
- Correlation Matrix: Shows relationships between variables
- Variance-Covariance Matrix: Measures data spread and co-movement
- Eigenvalues & Eigenvectors: Used for Principal Component Analysis (PCA)
1.1 Variables Used
We will analyze four numerical variables from the Titanic dataset:
- Age: Passenger’s age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Fare: Ticket price in British pounds
These variables represent key demographic and economic characteristics that might relate to passenger survival patterns.
2. Data Preparation
2.2 Load the Dataset
titanic <- read.csv("Titanic-Dataset.csv")
cat("Dataset dimension:", dim(titanic)[1], "rows x", dim(titanic)[2], "columns\n")
## Dataset dimension: 891 rows x 12 columns
2.3 Dataset Preview
head(titanic) %>%
kable(caption = "Head Rows of the Titanic Dataset") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
2.4 Variable Selection
Select only the 4 variables we need.
## Age SibSp Parch Fare
## 1 22 1 0 7.2500
## 2 38 1 0 71.2833
## 3 26 0 0 7.9250
## 4 35 1 0 53.1000
## 5 35 0 0 8.0500
## 6 NA 0 0 8.4583
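The code for this step is not shown above. The later sections use a data frame called titanic_clean that contains no missing values (the summary reports no NA for Age, and cor() is called without a `use` argument), so the selection was presumably followed by removal of incomplete rows. A minimal sketch of that step, assuming base R subsetting and na.omit(); the toy data frame below is a hypothetical stand-in for titanic so the snippet runs on its own, and the author's actual code may differ:

```r
# Toy stand-in for the `titanic` data frame (values from the first six rows above).
titanic_toy <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA),
  SibSp = c(1, 1, 0, 1, 0, 0),
  Parch = c(0, 0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583),
  Sex   = c("male", "female", "female", "female", "male", "male")
)

# Keep only the four numeric variables of interest.
selected <- titanic_toy[, c("Age", "SibSp", "Parch", "Fare")]

# Drop every row that has at least one missing value (here: the row with Age = NA).
# Assumption: the original analysis handled missing ages by removing those rows.
clean_toy <- na.omit(selected)

print(dim(clean_toy))
```

Dropping rows is the simplest option; imputing Age (for example with the median) would keep all 891 passengers but would change the matrices reported below.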
3. Data Analysis
3.1 Descriptive Statistics
summary(titanic_clean) %>%
kable(caption = "Descriptive Statistics") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Age | SibSp | Parch | Fare |
|---|---|---|---|
| Min. : 0.42 | Min. :0.0000 | Min. :0.0000 | Min. : 0.00 |
| 1st Qu.:20.12 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.: 8.05 |
| Median :28.00 | Median :0.0000 | Median :0.0000 | Median : 15.74 |
| Mean :29.70 | Mean :0.5126 | Mean :0.4314 | Mean : 34.69 |
| 3rd Qu.:38.00 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.: 33.38 |
| Max. :80.00 | Max. :5.0000 | Max. :6.0000 | Max. :512.33 |
- Min/Max: The smallest and largest values
- Median: The middle value when data is sorted
- Mean: The average value
- 1st/3rd Quartile: 25% and 75% points in the data
3.1.1 Visualization of Data Distribution
par(mfrow = c(2, 2))
hist(titanic_clean$Age,
breaks = 30,
col = "#3498db",
border = "white",
main = "Distribution of Age",
xlab = "Age (years)",
ylab = "Frequency",
cex.main = 1.3)
abline(v = mean(titanic_clean$Age), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Age), col = "orange", lwd = 2, lty = 2)
legend("topright",
legend = c("Mean", "Median"),
col = c("red", "orange"),
lty = 2, lwd = 2,
cex = 0.8)
hist(titanic_clean$SibSp,
breaks = seq(-0.5, max(titanic_clean$SibSp) + 0.5, 1),
col = "#e74c3c",
border = "white",
main = "Distribution of Siblings/Spouses",
xlab = "Number of Siblings/Spouses",
ylab = "Frequency",
cex.main = 1.3)
hist(titanic_clean$Parch,
breaks = seq(-0.5, max(titanic_clean$Parch) + 0.5, 1),
col = "#2ecc71",
border = "white",
main = "Distribution of Parents/Children",
xlab = "Number of Parents/Children",
ylab = "Frequency",
cex.main = 1.3)
hist(titanic_clean$Fare,
breaks = 50,
col = "#f39c12",
border = "white",
main = "Distribution of Fare",
xlab = "Fare (British Pounds)",
ylab = "Frequency",
cex.main = 1.3)
abline(v = mean(titanic_clean$Fare), col = "red", lwd = 2, lty = 2)
abline(v = median(titanic_clean$Fare), col = "orange", lwd = 2, lty = 2)
legend("topright",
legend = c("Mean", "Median"),
col = c("red", "orange"),
lty = 2, lwd = 2,
cex = 0.8)
3.2 Correlation Matrix
A correlation matrix shows the strength and direction of relationships between variables.
- Values range from -1 to +1
- +1 = perfect positive correlation (both variables increase together)
- -1 = perfect negative correlation (one increases, the other decreases)
- 0 = no linear relationship
- The diagonal is always 1 (a variable perfectly correlates with itself)
cor_matrix <- cor(titanic_clean)
cor_matrix %>%
round(4) %>%
kable(caption = "Correlation Matrix") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Age | SibSp | Parch | Fare |
|---|---|---|---|---|
| Age | 1.0000 | -0.3082 | -0.1891 | 0.0961 |
| SibSp | -0.3082 | 1.0000 | 0.3838 | 0.1383 |
| Parch | -0.1891 | 0.3838 | 1.0000 | 0.2051 |
| Fare | 0.0961 | 0.1383 | 0.2051 | 1.0000 |
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 1,
title = "Correlation Matrix - Titanic Dataset",
mar = c(0,0,2,0),
col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))
cor_matrix_no_diag <- cor_matrix
diag(cor_matrix_no_diag) <- NA
max_cor <- which(abs(cor_matrix_no_diag) == max(abs(cor_matrix_no_diag), na.rm = TRUE), arr.ind = TRUE)
max_cor_value <- cor_matrix[max_cor[1,1], max_cor[1,2]]
cat("Strongest Correlation:\n",
rownames(cor_matrix)[max_cor[1,1]], "vs",
colnames(cor_matrix)[max_cor[1,2]], "=",
round(max_cor_value, 4), "\n\n")
## Strongest Correlation:
## Parch vs SibSp = 0.3838
Based on the correlation matrix, we can observe several interesting patterns:
- Parch vs SibSp (positive correlation, r = 0.38): This makes sense because both variables represent family size. Passengers traveling with siblings/spouses often also travel with parents/children.
- Age vs SibSp (negative correlation, r = -0.31): Older passengers tend to travel with fewer siblings/spouses. This is logical - as people age, they’re less likely to travel with large family groups.
- Age vs Parch (negative correlation, r = -0.19): Similar to above, older passengers have fewer parents/children aboard. This is expected since older passengers likely have independent adult children who aren’t traveling with them.
- Fare vs other variables (weak correlations, r between 0.10 and 0.21): Ticket prices don’t strongly correlate with the demographic variables. This suggests that fare was more related to cabin class than to passenger characteristics like age or family size.
3.3 Variance-Covariance Matrix
This matrix contains two types of information:
- Diagonal elements (variance): Show how spread out each variable’s data is
- Off-diagonal elements (covariance): Show how two variables change together
Correlation tells us the direction and strength of a linear relationship on a unit-free scale. Covariance gives the same direction, but its magnitude also depends on the units and spread of the two variables.
Formula:
Variance:
σ² = Σ(xᵢ - x̄)² / (n-1)
Covariance:
Cov(X,Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n-1)
Key difference from correlation: Covariance is not standardized, so its values depend on the units of measurement. Correlation is standardized (-1 to +1).
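The two formulas and the unit dependence can be checked by hand. The sketch below uses two short made-up vectors (hypothetical values, not taken from the Titanic data) and compares the manual results with R's built-in var(), cov(), and cor():

```r
# Two short illustrative vectors (hypothetical values, not from the dataset).
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample variance and covariance with the (n - 1) denominator.
var_x  <- sum((x - mean(x))^2) / (n - 1)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation is the covariance rescaled by both standard deviations.
cor_xy <- cov_xy / (sd(x) * sd(y))

# The manual results agree with the built-in functions.
print(all.equal(var_x, var(x)))
print(all.equal(cov_xy, cov(x, y)))
print(all.equal(cor_xy, cor(x, y)))

# Changing the unit of x (for example pounds to pence) multiplies the
# covariance by 100 but leaves the correlation unchanged.
cov_rescaled <- cov(100 * x, y)
cor_rescaled <- cor(100 * x, y)
print(c(cov_rescaled / cov_xy, cor_rescaled - cor_xy))
```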
cov_matrix <- cov(titanic_clean)
cov_matrix %>%
round(4) %>%
kable(caption = "Variance-Covariance Matrix") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Age | SibSp | Parch | Fare |
|---|---|---|---|---|
| Age | 211.0191 | -4.1633 | -2.3442 | 73.8490 |
| SibSp | -4.1633 | 0.8645 | 0.3045 | 6.8062 |
| Parch | -2.3442 | 0.3045 | 0.7281 | 9.2622 |
| Fare | 73.8490 | 6.8062 | 9.2622 | 2800.4131 |
variances <- diag(cov_matrix)
variance_df <- data.frame(
Variable = names(variances),
Variance = variances,
Std_Dev = sqrt(variances)
) %>%
arrange(desc(Variance))
variance_df %>%
kable(caption = "Variance and Standard Deviation of Each Variable", digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | Variable | Variance | Std_Dev |
|---|---|---|---|
| Fare | Fare | 2800.4131 | 52.9189 |
| Age | Age | 211.0191 | 14.5265 |
| SibSp | SibSp | 0.8645 | 0.9298 |
| Parch | Parch | 0.7281 | 0.8533 |
corrplot(cov_matrix,
method = "color",
is.corr = FALSE,
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.8,
title = "Variance-Covariance Matrix",
mar = c(0,0,2,0),
col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200))
Note: Unlike the correlation plot, colors here represent the magnitude of covariance (not standardized), so variables with larger scales will have more intense colors.
Key findings from the variance-covariance matrix:
Fare has the highest variance (2800.41): This indicates huge variability in ticket prices. Some passengers paid very little while others paid a lot, reflecting different cabin classes (1st, 2nd, 3rd class). The standard deviation is about £52.9, which is quite large.
Age has moderate variance (≈211): Ages range widely, from young children to elderly passengers. Standard deviation of ≈14.5 years means there’s good age diversity.
SibSp and Parch have low variance: Most passengers traveled alone or with few family members. Values close to 0 are most common, with occasional larger families.
Covariances confirm correlation patterns:
- Positive Cov(SibSp, Parch) = family members travel together
- Negative Cov(Age, SibSp) = older passengers have fewer siblings/spouses aboard
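The link between the two matrices can be verified directly: dividing each covariance by the product of the two standard deviations should give back the correlation matrix from section 3.2. A short sketch using base R's cov2cor(), with the covariance values typed in from the table above (so agreement is only expected up to rounding):

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix as printed in the table above (rounded to 4 decimals).
cov_published <- matrix(
  c(211.0191, -4.1633, -2.3442,   73.8490,
     -4.1633,  0.8645,  0.3045,    6.8062,
     -2.3442,  0.3045,  0.7281,    9.2622,
     73.8490,  6.8062,  9.2622, 2800.4131),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# cov2cor() divides entry (i, j) by sd_i * sd_j, where sd = sqrt(diagonal).
cor_from_cov <- cov2cor(cov_published)
print(round(cor_from_cov, 4))

# The same rescaling written out for a single pair.
r_age_sibsp <- cov_published["Age", "SibSp"] /
  sqrt(cov_published["Age", "Age"] * cov_published["SibSp", "SibSp"])
print(round(r_age_sibsp, 4))
```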
3.4 Eigenvalues and Eigenvectors
Imagine we have a dataset with multiple variables plotted in multi-dimensional space. Eigenvalues and eigenvectors help us find the “principal directions” in this space - the directions where data varies the most.
- Eigenvector: A direction/axis in the data
- Eigenvalue: How much variance exists along that direction
In Principal Component Analysis (PCA), we use eigenvalues and eigenvectors to:
- Reduce dimensions: Transform 4 variables into 2-3 “principal components”
- Remove noise: Keep components with high eigenvalues, discard those with low eigenvalues
- Visualize data: Plot high-dimensional data in 2D or 3D
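- Avoid multicollinearity: Create uncorrelated components for regression analysis
A common rule (the Kaiser criterion) is to keep only components with eigenvalue > 1 when the analysis is based on the correlation matrix. Each standardized variable has variance 1, so such a component explains more variance than a single original variable.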
3.4.1 Eigenanalysis of Correlation Matrix
eigen_cor <- eigen(cor_matrix)
prop_var_cor <- eigen_cor$values / sum(eigen_cor$values) * 100
eigen_summary_cor <- data.frame(
PC = paste0("PC", 1:length(eigen_cor$values)),
Eigenvalue = eigen_cor$values,
Proportion = prop_var_cor,
Cumulative = cumsum(prop_var_cor)
)
eigen_summary_cor %>%
kable(caption = "Eigenvalues and Proportion of Variance (Correlation Matrix)",
digits = 4,
col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Principal Component | Eigenvalue | Proportion (%) | Cumulative (%) |
|---|---|---|---|
| PC1 | 1.6368 | 40.9188 | 40.9188 |
| PC2 | 1.1072 | 27.6794 | 68.5982 |
| PC3 | 0.6694 | 16.7351 | 85.3333 |
| PC4 | 0.5867 | 14.6667 | 100.0000 |
- Principal Component: Component number (PC1, PC2, etc.)
- Eigenvalue: The amount of variance this component captures
- Proportion (%): Percentage of total variance explained by this component
- Cumulative (%): Running total of variance explained
What does this mean?
For example, PC1 has an eigenvalue of 1.637. The eigenvalues of a 4 x 4 correlation matrix sum to 4, so PC1 explains 1.637 / 4 ≈ 41% of the total variance in our 4 standardized variables.
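The proportion and cumulative columns, and the eigenvalue > 1 rule, follow from simple arithmetic on the eigenvalues. A small sketch using the four rounded eigenvalues from the table above (typed in by hand, so the sum is 4 only up to rounding):

```r
# Eigenvalues of the correlation matrix, as printed in the table above.
eigenvalues <- c(PC1 = 1.6368, PC2 = 1.1072, PC3 = 0.6694, PC4 = 0.5867)

# Each component's share of the total variance, in percent.
proportion <- eigenvalues / sum(eigenvalues) * 100

# Running total of the variance explained.
cumulative <- cumsum(proportion)

# Kaiser criterion: keep components whose eigenvalue exceeds 1.
kaiser_keep <- names(eigenvalues)[eigenvalues > 1]

print(round(rbind(proportion, cumulative), 2))
print(kaiser_keep)
```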
eigen_vectors_cor <- eigen_cor$vectors
rownames(eigen_vectors_cor) <- colnames(cor_matrix)
colnames(eigen_vectors_cor) <- paste0("PC", 1:ncol(eigen_vectors_cor))
eigen_vectors_cor %>%
round(4) %>%
kable(caption = "Eigenvectors (Correlation Matrix)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Age | 0.4389 | -0.5962 | 0.5610 | 0.3704 |
| SibSp | -0.6251 | 0.0732 | 0.0550 | 0.7752 |
| Parch | -0.5909 | -0.1775 | 0.6056 | -0.5027 |
| Fare | -0.2599 | -0.7795 | -0.5618 | -0.0961 |
Each column represents one principal component. The values show how much each original variable contributes to that component.
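Two properties of these eigenvectors are worth checking: the columns are orthonormal (unit length, mutually perpendicular), and each one satisfies R v = λ v for the correlation matrix R. The sketch below rebuilds the correlation matrix from the rounded values printed in section 3.2, so its eigenvalues match the table only approximately. The sign of an eigenvector is arbitrary, so a column may come out multiplied by -1 relative to the table.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix as printed in section 3.2 (rounded to 4 decimals).
cor_published <- matrix(
  c( 1.0000, -0.3082, -0.1891, 0.0961,
    -0.3082,  1.0000,  0.3838, 0.1383,
    -0.1891,  0.3838,  1.0000, 0.2051,
     0.0961,  0.1383,  0.2051, 1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

e <- eigen(cor_published)
V <- e$vectors

# Orthonormality: t(V) %*% V should be the 4 x 4 identity matrix.
print(round(crossprod(V), 6))

# Defining property for the first component: R v1 = lambda1 * v1.
lhs <- as.vector(cor_published %*% V[, 1])
rhs <- e$values[1] * V[, 1]
print(all.equal(lhs, rhs))

# The eigenvalues add up to the trace of the matrix (4 for four variables).
print(sum(e$values))
```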
3.4.2 Eigenanalysis of Covariance Matrix
eigen_cov <- eigen(cov_matrix)
prop_var_cov <- eigen_cov$values / sum(eigen_cov$values) * 100
eigen_summary_cov <- data.frame(
PC = paste0("PC", 1:length(eigen_cov$values)),
Eigenvalue = eigen_cov$values,
Proportion = prop_var_cov,
Cumulative = cumsum(prop_var_cov)
)
eigen_summary_cov %>%
kable(caption = "Eigenvalues and Proportion of Variance (Covariance Matrix)",
digits = 4,
col.names = c("Principal Component", "Eigenvalue", "Proportion (%)", "Cumulative (%)")) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Principal Component | Eigenvalue | Proportion (%) | Cumulative (%) |
|---|---|---|---|
| PC1 | 2802.5637 | 93.0150 | 93.0150 |
| PC2 | 209.0386 | 6.9378 | 99.9528 |
| PC3 | 0.9439 | 0.0313 | 99.9841 |
| PC4 | 0.4787 | 0.0159 | 100.0000 |
When we use the covariance matrix, variables with larger scales dominate the analysis. Fare has a variance of about 2800, while SibSp has a variance of about 0.86, a ratio of roughly 3200:1. As a result, PC1 from the covariance matrix is driven almost entirely by Fare.
eigen_vectors_cov <- eigen_cov$vectors
rownames(eigen_vectors_cov) <- colnames(cov_matrix)
colnames(eigen_vectors_cov) <- paste0("PC", 1:ncol(eigen_vectors_cov))
eigen_vectors_cov %>%
round(4) %>%
kable(caption = "Eigenvectors (Covariance Matrix)") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Age | 0.0285 | 0.9993 | -0.0240 | 0.0036 |
| SibSp | 0.0024 | -0.0209 | -0.7737 | 0.6332 |
| Parch | 0.0033 | -0.0125 | -0.6331 | -0.7740 |
| Fare | 0.9996 | -0.0284 | 0.0046 | 0.0009 |
PC1 is almost entirely composed of Fare (loading ≈ 1.000), while other variables contribute very little. This confirms that Fare dominates when we don’t standardize variables.
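The same contrast can be reproduced with base R's prcomp(), where scale. = FALSE corresponds to the covariance matrix and scale. = TRUE to the correlation matrix. The sketch below uses simulated data (hypothetical column names and values, not the Titanic data) with one large-scale variable to show the effect:

```r
set.seed(42)
n <- 200

# Simulated data: two small-scale, correlated variables and one
# large-scale variable (all names and values are illustrative only).
toy <- data.frame(
  small = rnorm(n, mean = 0, sd = 1),
  large = rnorm(n, mean = 30, sd = 50)
)
toy$mixed <- toy$small + rnorm(n, sd = 0.5)

# PCA on the covariance matrix (no scaling) and on the correlation matrix.
pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Component variances equal the eigenvalues of the matching matrix.
print(all.equal(pca_raw$sdev^2, eigen(cov(toy))$values))
print(all.equal(pca_scaled$sdev^2, eigen(cor(toy))$values))

# Share of total variance captured by PC1 in each case.
share_raw    <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
share_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)
print(round(c(raw = share_raw, scaled = share_scaled), 3))

# Without scaling, PC1 points almost exactly along the large-scale variable.
print(round(pca_raw$rotation[, 1], 3))
```

Without scaling, PC1 mostly restates which variable has the largest units; with scaling, it reflects the correlation structure instead.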
3.4.3 Scree Plot
A scree plot shows eigenvalues in descending order. The name comes from “scree” - the rubble at the bottom of a cliff. We’re looking for the “cliff” (important components) vs the “scree” (unimportant ones).
par(mfrow = c(1, 2))
plot(1:length(eigen_cor$values), eigen_cor$values,
type = "b",
pch = 19,
col = "blue",
main = "Scree Plot (Correlation Matrix)",
xlab = "Principal Component",
ylab = "Eigenvalue",
cex.main = 1.3,
lwd = 2)
abline(h = 1, col = "red", lty = 2, lwd = 2)
text(3, 1.1, "Kaiser Criterion (Eigenvalue = 1)", col = "red", cex = 0.8)
grid()
plot(1:length(eigen_cov$values), eigen_cov$values,
type = "b",
pch = 19,
col = "darkgreen",
main = "Scree Plot (Covariance Matrix)",
xlab = "Principal Component",
ylab = "Eigenvalue",
cex.main = 1.3,
lwd = 2)
grid()
Based on the scree plot, we would typically keep 2 principal components for further analysis.
4. Summary and Conclusion
Based on our analysis of the Titanic dataset, here are the main findings:
1. Correlation Analysis
- Moderate correlation between SibSp and Parch (r = 0.38): Family members tend to travel together, which makes intuitive sense.
- Negative correlation between Age and family size variables: Older passengers typically travel with fewer family members.
- Correlations with Fare: Ticket price is relatively independent of demographic characteristics, suggesting fare was determined more by cabin class choice than passenger characteristics.
2. Variance Analysis
- Fare shows highest variability (Var ≈ 2800): Enormous range in ticket prices reflecting the class system on the Titanic (1st, 2nd, 3rd class).
- Age shows moderate spread (SD ≈ 14.5 years): Good diversity in passenger ages.
- Family size variables have low variance: Most passengers traveled alone or with small families, with occasional large families as outliers.
3. Principal Component Analysis (PCA)
- Two principal components (eigenvalue > 1) explain approximately 68.6% of total variance.
- PC1: Represents “Family Travel Pattern” - capturing family size and passenger age
- PC2: Represents “Passenger Demographics & Economic Status” - capturing age and fare
This means we can reduce from 4 variables to 2 components while retaining about 69% of the total variance. This is useful for:
- Data visualization (plotting in 2D instead of 4D)
- Simplifying machine learning models
- Reducing computational complexity
- Removing multicollinearity
Note: The difference between the correlation matrix and covariance matrix results demonstrates why standardization is crucial when variables have different scales. The correlation matrix gives equal weight to all variables, while the covariance matrix lets the largest-scale variable (Fare) dominate.
In summary, this analysis demonstrates three fundamental statistical techniques:
Correlation Matrix: Revealed relationships between variables, with family-related variables showing expected positive correlations.
Variance-Covariance Matrix: Showed that Fare has by far the highest variability, reflecting the economic diversity of Titanic passengers.
Eigenvalue Decomposition: Enabled us to reduce 4 variables to 2 principal components while retaining most of the information, demonstrating the power of dimensionality reduction.
Thank you for reading this analysis. If you have questions or suggestions, feel free to reach out!
Note: To reproduce this analysis, ensure you have the Titanic-Dataset.csv file in your R working directory and all required packages installed (the code above uses dplyr, knitr, kableExtra, and corrplot).