Preparation

Data Insertion

First, we download the dataset from Kaggle: https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv

data <- read.csv("Titanic-Dataset.csv")
head(data, n=10)
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                   Name    Sex Age SibSp Parch
## 1                              Braund, Mr. Owen Harris   male  22     1     0
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                             Allen, Mr. William Henry   male  35     0     0
## 6                                     Moran, Mr. James   male  NA     0     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 8                       Palsson, Master. Gosta Leonard   male   2     3     1
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##              Ticket    Fare Cabin Embarked
## 1         A/5 21171  7.2500              S
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 5            373450  8.0500              S
## 6            330877  8.4583              Q
## 7             17463 51.8625   E46        S
## 8            349909 21.0750              S
## 9            347742 11.1333              S
## 10           237736 30.0708              C

Data Exploration

Next, we check the structure of the data.

print(dim(data))
## [1] 891  12

The output above means that we have 891 observations (rows) and 12 variables (columns).

Now, let’s look at the characteristics of each variable.

print(summary(data))
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

From the summary above, we can draw several insights:

  1. Only about 38% of the passengers survived (the mean of ‘Survived’ is 0.3838).

  2. There is a substantial number of missing values in the ‘Age’ column: 177 of 891 rows, or about 20%.

  3. ‘Fare’ is clearly right-skewed: its mean (32.20) is well above its median (14.45), and its maximum is 512.33.
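As a minimal sketch of how points 2 and 3 could be checked directly, the snippet below rebuilds the first ten rows of ‘Age’, ‘Fare’ and ‘Cabin’ from the head() output above (the data frame name `toy` is an assumption made for illustration). It counts empty strings as missing too, because the blank ‘Cabin’ entries shown above are not reported as NA's by summary().

```r
# Toy data: the first ten rows of Age, Fare and Cabin from the head() output.
toy <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625,
            21.075, 11.1333, 30.0708),
  Cabin = c("", "C85", "", "C123", "", "", "E46", "", "", ""),
  stringsAsFactors = FALSE
)

# Count values that are NA or an empty string, column by column.
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(missing_counts)

# A mean well above the median hints at right skew.
fare_mean   <- mean(toy$Fare)
fare_median <- median(toy$Fare)
print(c(mean = fare_mean, median = fare_median))
```

On the full dataset the same two checks would be applied to `data` instead of `toy`.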

Usually, variables classified as “character” have to be encoded first, but because none of those variables are used in the following steps, I will not encode them.
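For reference only, encoding a character variable could look roughly like the sketch below. The vector `sex` holds hypothetical example values and is not read from the dataset; the analysis that follows does not use this step.

```r
# Hypothetical example values; not read from the dataset file.
sex <- c("male", "female", "female", "female", "male")

# Convert to a factor with an explicit level order.
sex_factor <- factor(sex, levels = c("female", "male"))

# Derive a 0/1 indicator: 1 for "male", 0 for "female".
sex_male <- as.integer(sex_factor == "male")
print(sex_male)
```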

Data Cleaning and Slicing

For this assignment, we only need to use Age, SibSp, Parch, and Fare.

data_slice <- data[, c("Age", "SibSp", "Parch", "Fare")]
head(data_slice, n=10)
##    Age SibSp Parch    Fare
## 1   22     1     0  7.2500
## 2   38     1     0 71.2833
## 3   26     0     0  7.9250
## 4   35     1     0 53.1000
## 5   35     0     0  8.0500
## 6   NA     0     0  8.4583
## 7   54     0     0 51.8625
## 8    2     3     1 21.0750
## 9   27     0     2 11.1333
## 10  14     1     0 30.0708

We also have to drop the rows with missing values.

print("Before")
## [1] "Before"
print(dim(data_slice))
## [1] 891   4
data_clean <- na.omit(data_slice)
print("After")
## [1] "After"
print(dim(data_clean))
## [1] 714   4
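The drop from 891 to 714 rows matches the 177 missing ‘Age’ values reported by summary(). As a small sketch of what na.omit() does, the snippet below applies it to the first ten rows shown earlier (the name `toy` is an assumption for illustration): any row containing at least one NA is removed.

```r
# Toy data: the first ten rows of the four selected columns.
toy <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  SibSp = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625,
            21.075, 11.1333, 30.0708)
)

# Rows that contain at least one missing value.
rows_with_na <- sum(!complete.cases(toy))

# na.omit() removes exactly those rows.
toy_clean <- na.omit(toy)

print(rows_with_na)
print(dim(toy_clean))
```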

Main Process

data <- data_clean

Correlation Matrix

corr_matrix <- cor(data)
print(corr_matrix)
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

From the results above, we can conclude that:

  1. The strongest correlation is a positive one between ‘Parch’ and ‘SibSp’, with a coefficient of 0.38: as the number of siblings or spouses increases, the number of parents or children also tends to increase. This suggests that passengers travelling with one kind of relative often travel with other relatives too.

  2. The second strongest correlation is a negative one between ‘Age’ and ‘SibSp’, with a coefficient of -0.31: as age increases, the number of siblings or spouses tends to decrease. This suggests that older passengers more often travel without siblings or a spouse, while younger passengers more often travel with them.

  3. Another notable correlation is a positive one between ‘Parch’ and ‘Fare’, with a coefficient of 0.21: the more parents or children a passenger brings, the more expensive the ticket tends to be.

Variance-Covariance Matrix

cov_matrix <- cov(data)
print(cov_matrix)
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

From the results above, we can conclude that:

  1. Variance:

    1.1. Age & Fare: The variances are large (about 211 and 2800), showing that the data are widely spread. For ‘Fare’, it means that some passengers paid very little while others paid a fortune. For ‘Age’, it means that some passengers may be toddlers while others may be war veterans.

    1.2. SibSp & Parch: The variances are small (about 0.86 and 0.73), showing that most values are similar, which is reasonable. It is not every day that we see people with 16 siblings or 5 parents.

  2. Covariance: Most of this is already covered by the correlation matrix, which is the covariance matrix rescaled by the standard deviations. We will mainly use the covariance matrix for the eigenvalues.
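To illustrate the link between the two matrices, the sketch below retypes the covariance matrix from the printed output (the name `cov_printed` is an assumption for illustration) and rescales it with cov2cor(). Each entry is divided by the product of the two standard deviations, which should reproduce the correlation matrix shown earlier up to rounding.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance matrix retyped from the printed output above.
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Rescale: cor(i, j) = cov(i, j) / (sd(i) * sd(j)).
corr_from_cov <- cov2cor(cov_printed)
print(round(corr_from_cov, 4))
```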

Eigenvectors

Covariance Matrix

eigen_cov <- eigen(cov_matrix)
print("value")
## [1] "value"
print(eigen_cov$values)  
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214
print("vector")
## [1] "vector"
print(eigen_cov$vectors) 
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

From the results above, we can conclude that:

  1. Eigenvalues: The first eigenvalue, 2802.56, dominates, accounting for about 93% of the total variance in the dataset, while the second eigenvalue accounts for most of the remaining 7%. The last two are negligible.
  2. Eigenvectors: The first component, which captures 93% of the total variance, is dominated by ‘Fare’, with a loading of about 0.9996. The second component is dominated just as heavily by ‘Age’, with a loading of about 0.9993.
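The percentages quoted above can be reproduced with a short calculation: each eigenvalue divided by the sum of all eigenvalues. The sketch below retypes the printed eigenvalues and the diagonal of the covariance matrix (the names `eig_values` and `cov_diag` are assumptions for illustration) and also checks that the eigenvalues add up to the total variance.

```r
# Eigenvalues of the covariance matrix, retyped from the printed output.
eig_values <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Variances (diagonal of the covariance matrix), retyped from the output.
cov_diag <- c(Age = 211.019125, SibSp = 0.8644973,
              Parch = 0.7281027, Fare = 2800.413100)

# Share of the total variance captured by each component.
prop_var <- eig_values / sum(eig_values)
print(round(prop_var, 4))
print(round(cumsum(prop_var), 4))

# The eigenvalues sum to the trace, i.e. the total variance.
print(c(sum_eigen = sum(eig_values), trace = sum(cov_diag)))
```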

So how can ‘Age’ and ‘Fare’ be that dominant? It is because the covariance matrix uses the unstandardized values of each variable. The variables in most datasets are measured on different scales, so it is common to see one variable with a range of 0-512 and another with a range of 0-8, just like ‘Fare’ and ‘SibSp’ here. The variables with the largest variances then dominate the leading components.

Thus, we will also check the eigenvectors of the standardized version of the covariance matrix: the correlation matrix.
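A minimal sketch of why the correlation matrix counts as the standardized covariance matrix, using simulated data rather than the Titanic data (the columns `small` and `large` are hypothetical): the covariance of the standardized columns equals the correlation of the raw columns, and before standardizing the large-scale column dominates the first eigenvector.

```r
set.seed(1)

# Simulated data: two independent columns on very different scales.
x <- cbind(
  small = rnorm(200, mean = 0, sd = 1),
  large = rnorm(200, mean = 0, sd = 50)
)

# Unstandardized: the first eigenvector is almost entirely 'large'.
first_vec <- eigen(cov(x))$vectors[, 1]
print(round(first_vec, 4))

# Standardize each column (mean 0, standard deviation 1).
z <- scale(x)

# The covariance of the standardized data equals the correlation matrix.
same <- isTRUE(all.equal(cov(z), cor(x), check.attributes = FALSE))
print(same)
```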

Correlation Matrix

eigen_corr <- eigen(corr_matrix)
print("value")
## [1] "value"
print(eigen_corr$values)  
## [1] 1.6367503 1.1071770 0.6694052 0.5866676
print("vector")
## [1] "vector"
print(eigen_corr$vectors) 
##            [,1]       [,2]        [,3]        [,4]
## [1,]  0.4388714 -0.5962415  0.56095237  0.37043268
## [2,] -0.6250770  0.0732461  0.05500006  0.77517016
## [3,] -0.5908590 -0.1774532  0.60558695 -0.50265342
## [4,] -0.2599159 -0.7795136 -0.56175785 -0.09607493

From the results above, we can conclude that:

  1. Eigenvalues: Unlike with the covariance matrix, where the first two components heavily dominate, the distribution here is more balanced. The components, in order, account for about 40.9%, 27.7%, 16.7%, and 14.7% of the total variance.

  2. Eigenvectors: While the previous eigenvectors were each dominated by a single variable, the current ones show an interesting interplay between the variables.

    2.1. In the first component, ‘Age’ has a strong positive loading (0.44), ‘SibSp’ and ‘Parch’ have strong negative loadings (-0.63 and -0.59), and ‘Fare’ has a weak negative loading (-0.26), in contrast with the previous decomposition, where it dominated. This reveals something we were missing before: family structure. Older passengers tend to travel alone and buy a ticket just for themselves, while younger passengers tend to travel with their family, which makes their tickets more expensive.

    2.2. In the second component, ‘Age’ and ‘Fare’ both have strong negative loadings (-0.60 and -0.78), while the other two have weak loadings. This also captures something: a wealth factor. Along this component, which is largely independent of family size, older passengers tend to buy more expensive tickets than younger ones.
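The sketch below retypes the correlation matrix from the printed output (the name `corr_printed` is an assumption for illustration), recomputes the proportions of variance, and shows that flipping the sign of an eigenvector still gives a valid eigenvector. Only the relative signs within a component carry meaning, so "positive" and "negative" loadings could be swapped as a group without changing the interpretation.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix retyped from the printed output above.
corr_printed <- matrix(
  c( 1.00000000, -0.30824676, -0.18911926, 0.09606669,
    -0.30824676,  1.00000000,  0.38381990, 0.13832879,
    -0.18911926,  0.38381990,  1.00000000, 0.20511888,
     0.09606669,  0.13832879,  0.20511888, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig <- eigen(corr_printed)

# For a correlation matrix the eigenvalues sum to the number of variables.
prop_var <- eig$values / sum(eig$values)
print(round(prop_var, 3))

# Sign check: if v is an eigenvector, so is -v (same eigenvalue).
v1      <- eig$vectors[, 1]
lambda1 <- eig$values[1]
flip_ok <- isTRUE(all.equal(as.vector(corr_printed %*% (-v1)),
                            lambda1 * (-v1)))
print(flip_ok)
```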

So, I guess that’s all. We could continue on to PCA, but I will stop here.
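For readers who do want to continue, a rough sketch of the next step is given below. It uses simulated data, not the Titanic data, and the column names `a`, `b` and `c` are hypothetical. It illustrates that prcomp() with `scale. = TRUE` yields component variances equal to the eigenvalues of the correlation matrix, so the decomposition above is already most of a PCA.

```r
set.seed(42)
n <- 100

# Simulated data with three columns on different scales (hypothetical).
x <- data.frame(
  a = rnorm(n),
  b = rnorm(n, sd = 10),
  c = runif(n, min = 0, max = 500)
)

# PCA on centered and standardized columns.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigen decomposition of the correlation matrix, as done in this report.
eig <- eigen(cor(x))

# Component variances from prcomp() match the eigenvalues.
match_vals <- isTRUE(all.equal(pca$sdev^2, eig$values))
print(match_vals)
print(summary(pca)$importance)
```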