Student ID: 24031554115

Class: INT24

Dataset Processing

Load the dataset, select the columns {Age, SibSp, Parch, Fare}, and remove the rows with missing values.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("Titanic-Dataset.csv") |>
  select(Age, SibSp, Parch, Fare) |>
  drop_na()
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 4
##     Age SibSp Parch  Fare
##   <dbl> <dbl> <dbl> <dbl>
## 1    22     1     0  7.25
## 2    38     1     0 71.3 
## 3    26     0     0  7.92
## 4    35     1     0 53.1 
## 5    35     0     0  8.05
## 6    54     0     0 51.9

Variance-Covariance Matrix

cov_matrix <- cov(df)
cov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

The variance-covariance matrix exposes a scaling problem in the dataset. Fare has a variance of approximately 2800, far larger than the variance of Age (about 211), SibSp (0.86), or Parch (0.73). This shows that the features are measured in very different units and ranges, so Fare would dominate any analysis based on the raw covariances.
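
The dependence of variance on units can be illustrated with a minimal sketch. It uses the six Fare values displayed by head(df) above (rounded as printed) and a hypothetical change of unit (multiplying by 100). The variable names and the rescaling are assumptions made for illustration and are not part of the original analysis.

```r
# Six Fare values as displayed by head(df) (rounded as printed)
fare <- c(7.25, 71.3, 7.92, 53.1, 8.05, 51.9)

# Hypothetical change of unit: the same fares expressed in hundredths
fare_hundredths <- fare * 100

# The variance scales with the square of the conversion factor (100^2)
var_ratio <- var(fare_hundredths) / var(fare)
print(var_ratio)
```

The data describe the same tickets in both cases, yet the variance grows by a factor of about 10000, which is why raw variances cannot be compared across variables with different units.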

Correlation Matrix

To remove this scaling problem, the correlation matrix normalizes each covariance by the standard deviations of the two variables involved, so every entry lies between -1 and 1.

cor_matrix <- cor(df)
cor_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
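
The normalization can be checked by hand. The sketch below rebuilds the covariance matrix from the printed output above (values copied by hand, so they are rounded) and divides each entry by the product of the two standard deviations. The names cov_printed, cor_by_hand and cor_builtin are chosen for illustration only; cov2cor() is the base R function that does the same rescaling.

```r
# Covariance matrix as printed above (rounded values copied by hand)
vars <- c("Age", "SibSp", "Parch", "Fare")
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (the variances)
sds <- sqrt(diag(cov_printed))

# Divide every covariance by the product of the two standard deviations
cor_by_hand <- cov_printed / outer(sds, sds)

# Base R's cov2cor() performs the same rescaling
cor_builtin <- cov2cor(cov_printed)

print(round(cor_by_hand, 4))
```

Up to rounding, the result should agree with the output of cor(df) shown above.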

The correlation matrix shows that SibSp and Parch have a moderate positive correlation (0.38): passengers travelling with siblings or spouses are also more likely to travel with parents or children.

Age and SibSp have a negative correlation (-0.31), so older passengers are less likely to travel with siblings or spouses. The correlation between Fare and Age is weak (0.10), meaning that older passengers did not necessarily buy more expensive tickets.

Eigenvalues and Eigenvectors

The eigen-decomposition is applied to the correlation matrix rather than the covariance matrix because the correlation matrix is already scaled, so differences in units do not distort the eigenvalues and eigenvectors.

eig_result <- eigen(cor_matrix)
eig_result$values
## [1] 1.6367503 1.1071770 0.6694052 0.5866676
eig_result$vectors
##            [,1]       [,2]        [,3]        [,4]
## [1,]  0.4388714 -0.5962415  0.56095237  0.37043268
## [2,] -0.6250770  0.0732461  0.05500006  0.77517016
## [3,] -0.5908590 -0.1774532  0.60558695 -0.50265342
## [4,] -0.2599159 -0.7795136 -0.56175785 -0.09607493

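The eigenvalues give the variance captured by each principal component, so they show how much of the information in the data each component represents. Because the four variables are standardized, the eigenvalues sum to 4. For example, PC1 has an eigenvalue of about 1.64, so it explains \(\frac{1.64}{4} \approx 0.41\), or around 41%, of the total variance.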
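
The same calculation can be done for all four components at once. This is a small sketch that starts from the eigenvalues printed above (copied by hand); the variable names are chosen for illustration.

```r
# Eigenvalues as printed by eig_result$values above
eigenvalues <- c(1.6367503, 1.1071770, 0.6694052, 0.5866676)
names(eigenvalues) <- paste0("PC", 1:4)

# For a correlation matrix the total variance equals the number of variables
total_variance <- sum(eigenvalues)

# Share of the total variance explained by each component
prop_variance <- eigenvalues / total_variance
print(round(100 * prop_variance, 1))
```

The first two shares come out at about 40.9% and 27.7%, the same percentages that the scree plot labels report.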

The eigenvectors give the weights (loadings) of the original features in each principal component. For example, in PC1 Age has a positive weight (0.44), while SibSp (-0.63) and Parch (-0.59) have strong negative weights. This suggests that PC1 separates passengers by life stage: positive values represent older, solitary travelers, while negative values represent younger passengers with large families.
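
One caveat when reading the signs: an eigenvector is only defined up to its sign. The sketch below rebuilds the correlation matrix from the printed output (hand-copied and averaged with its transpose, because the printed entries are rounded slightly differently) and checks that both v and -v satisfy the defining equation R v = lambda v. The object names are assumptions for illustration.

```r
# Correlation matrix as printed above (rounded values copied by hand)
vars <- c("Age", "SibSp", "Parch", "Fare")
cor_printed <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Average with the transpose so the matrix is exactly symmetric
cor_sym <- (cor_printed + t(cor_printed)) / 2

e <- eigen(cor_sym)
v1 <- e$vectors[, 1]      # first eigenvector (PC1 loadings)
lambda1 <- e$values[1]    # first eigenvalue

# Defining property of an eigenvector: R v = lambda v
residual_pos <- max(abs(cor_sym %*% v1 - lambda1 * v1))

# The sign-flipped vector satisfies the same equation
residual_neg <- max(abs(cor_sym %*% (-v1) - lambda1 * (-v1)))

print(c(residual_pos, residual_neg))
```

Both residuals are numerically zero, so the life-stage reading depends only on Age having the opposite sign to SibSp and Parch, not on which side happens to be positive.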

PCA

pca_matrix <- scale(df) %*% eig_result$vectors

# name the components, then convert to a tibble
colnames(pca_matrix) <- paste0("PC", 1:4)
pca_df <- as_tibble(pca_matrix)
head(pca_df)
## # A tibble: 6 × 4
##       PC1    PC2    PC3     PC4
##     <dbl>  <dbl>  <dbl>   <dbl>
## 1 -0.127   0.848 -0.283  0.514 
## 2  0.0421 -0.752 -0.345  0.806 
## 3  0.663   0.595 -0.195 -0.219 
## 4  0.0408 -0.361 -0.268  0.762 
## 5  0.934   0.224  0.151  0.0103
## 6  1.29   -1.20   0.420  0.415
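
Two properties of the projected scores are worth checking: each component's variance equals its eigenvalue, and the components are uncorrelated with each other. The sketch below shows this on USArrests, a dataset that ships with base R. It is only a stand-in so that the example runs without the Titanic CSV file; it is not the data analysed in this report.

```r
# Stand-in data: USArrests ships with base R; it is NOT the Titanic data
X <- scale(USArrests)          # centre and scale each column
e <- eigen(cor(USArrests))     # eigen-decomposition of the correlation matrix

# Same projection as above: standardized data times the eigenvectors
scores <- X %*% e$vectors
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

# Variance of each score column should match the eigenvalues
score_variances <- apply(scores, 2, var)
print(rbind(score_variances, eigenvalues = e$values))

# Correlations between different components should be numerically zero
score_cor <- cor(scores)
print(round(score_cor, 10))
```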

prcomp is used to confirm that the manual PCA gives the same result, and because factoextra needs a prcomp object for its visualizations.

pca_result <- prcomp(df, scale. = TRUE)
# the sign of each component is arbitrary, so compare absolute values
manual_abs <- round(abs(as.matrix(pca_df)), 5)
auto_abs   <- round(abs(pca_result$x), 5)

cat("Are the manual and prcomp PCA scores the same?", isTRUE(all.equal(manual_abs, auto_abs)))
## Are the manual and prcomp PCA scores the same? TRUE
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Scree Plot
fviz_eig(pca_result, addlabels = TRUE)
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

The scree plot shows which principal components are important and which can be dropped for a given use case. Usually we look for the elbow, the point where the line starts to flatten out, which here is at PC3, because components beyond it bring diminishing returns. In practice, the number of dimensions is often simply chosen to suit the application. For example, PC1 (40.9%) and PC2 (27.7%) together retain nearly 69% of the variance in the data, so we can use these two components instead of the four original variables and still keep roughly 70% of the information.
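
When the number of components is chosen from a target share of variance rather than from the elbow, the choice can be computed from the cumulative proportions. The sketch below uses the eigenvalues printed earlier; n_components_for is a hypothetical helper written for this illustration, and the targets 0.65 and 0.80 are arbitrary examples.

```r
# Eigenvalues as printed by eig_result$values earlier in the report
eigenvalues <- c(1.6367503, 1.1071770, 0.6694052, 0.5866676)

# Cumulative share of the total variance after 1, 2, 3, 4 components
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)
print(round(cum_prop, 3))

# Hypothetical helper: smallest number of components reaching the target share
n_components_for <- function(cum_prop, target) {
  which(cum_prop >= target)[1]
}

k_65 <- n_components_for(cum_prop, 0.65)   # arbitrary example target
k_80 <- n_components_for(cum_prop, 0.80)   # arbitrary example target
print(c(k_65 = k_65, k_80 = k_80))
```

With these eigenvalues, two components are enough for a 65% target, while an 80% target needs three.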

# Biplot
fviz_pca_biplot(pca_result, 
                label = "var",
                repel = TRUE,
                alpha.ind = 0.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

A biplot is used because it overlays two types of information on a single map: the variables (arrows) and the passengers (dots). Because the graph is two-dimensional, it uses PC1 and PC2 as its axes. PC1 acts as the life stage axis, since the Age arrow points right while the SibSp and Parch arrows point in the opposite direction, to the left. PC2 acts as the wealth axis, since its dominant variable is Fare. These axes can be used to define clusters of passengers; for example, the top-left quadrant represents wealthy families.
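
The reading of the two axes can be cross-checked against the loadings. The sketch below copies the first two eigenvector columns from the output of eig_result$vectors and attaches the variable names. The object names are assumptions for illustration, and the signs are those returned by eigen(); prcomp may flip the sign of a component, which mirrors the arrows in the plot without changing the interpretation.

```r
# First two eigenvector columns as printed by eig_result$vectors
vars <- c("Age", "SibSp", "Parch", "Fare")
loadings <- matrix(
  c( 0.4388714, -0.5962415,
    -0.6250770,  0.0732461,
    -0.5908590, -0.1774532,
    -0.2599159, -0.7795136),
  nrow = 4, byrow = TRUE, dimnames = list(vars, c("PC1", "PC2"))
)

# PC1: Age has the opposite sign to SibSp and Parch (the life stage contrast)
pc1_signs <- sign(loadings[, "PC1"])
print(pc1_signs)

# PC2: the variable with the largest absolute loading
dominant_pc2 <- vars[which.max(abs(loadings[, "PC2"]))]
print(dominant_pc2)

# Absolute PC2 loadings, largest first
print(sort(abs(loadings[, "PC2"]), decreasing = TRUE))
```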