Lecture 1 - Titanic Analysis

Hello, my name is Arya Bintang Fauzildan, usually called Zildan, from the Department of Data Science at the State University of Surabaya. In this R publication, we’re going to talk about multivariate analysis. I’m excited to walk through a case built on the Titanic dataset from the Kaggle platform (https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv).

First things first: download the dataset in .csv format. The dataset has many variables that relate to each other, such as Age, Ticket, and PassengerId. Imagine we are detectives on duty, investigating and making predictions from the given data about the passengers who may have been lost in the Titanic tragedy, so that we can find the common thread in this case. For example, which age group made up the majority of those rescued, or which ticket class lost the most people in the tragedy.

Here are the steps of the analysis.

Import CSV into RStudio

To begin, load the .csv dataset into RStudio. Create a variable named “data” to store the dataset. Use the read.csv function and pass the path to the file on your machine, wrapped in quotes.

data <- read.csv("E:/ANALISIS_MULTIVARIAT/Titanic-Dataset.csv")

Check the data stored in the variable.

head(data)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
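Before choosing columns, it can help to see where the missing values are. The sketch below is only an illustration: it rebuilds the six rows printed above as inline CSV text (with a few columns left out to keep it short) rather than reading the real file, and the names csv_text, sample_rows, and na_counts are made up for this example. Passing na.strings = c("", "NA") is an assumption about how the blank Cabin cells should be treated.

```r
# The six rows from the head(data) output, retyped as inline CSV text.
# Name, SibSp, Parch and Ticket are left out to keep the example short.
csv_text <- "PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
1,0,3,male,22,7.2500,,S
2,1,1,female,38,71.2833,C85,C
3,1,3,female,26,7.9250,,S
4,1,1,female,35,53.1000,C123,S
5,0,3,male,35,8.0500,,S
6,0,3,male,NA,8.4583,,Q"

# Treat empty cells as missing, in addition to the literal string "NA".
sample_rows <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Column types at a glance.
str(sample_rows)

# Number of missing values in each column.
na_counts <- colSums(is.na(sample_rows))
na_counts
```

On the real file, the same two calls (str and colSums(is.na(...))) can be run directly on data.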

Use only the columns Age, SibSp, Parch, and Fare. Before manipulating the data, we should install a library that makes the process faster. In this case, we use the tidyr library, which can change our data with a single line of code; for example, its drop_na function works like dropna in pandas for Python.

Install the tidyr library with the code below. You only need to install it once.

# install.packages("tidyr")

Load the library.

library("tidyr")

Drop the columns we no longer need by taking a subset of the dataset. Remember, we keep only the columns Age, SibSp, Parch, and Fare. The syntax select = -c() removes the listed columns.

data <- subset(data, select = -c(PassengerId, Survived, Pclass, Name, Sex, Ticket, Cabin, Embarked))

head(data)

##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  NA     0     0  8.4583
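Dropping eight columns by name gives the same result as keeping four. As a minimal sketch, assuming a small stand-in data frame (sample_rows, typed from the rows shown above with only two of the dropped columns included), the two approaches can be compared like this. The names keep, by_keep, and by_drop are illustrative only.

```r
# Small stand-in for the dataset: values taken from the head(data) output.
sample_rows <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 1, 0, 0),
  Age         = c(22, 38, 26, 35, 35, NA),
  SibSp       = c(1, 1, 0, 1, 0, 0),
  Parch       = c(0, 0, 0, 0, 0, 0),
  Fare        = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Option 1: name the columns to keep.
keep <- c("Age", "SibSp", "Parch", "Fare")
by_keep <- sample_rows[, keep]

# Option 2: name the columns to drop, as in the step above.
by_drop <- subset(sample_rows, select = -c(PassengerId, Survived))

# Both routes should leave the same four columns.
names(by_keep)
names(by_drop)
```

Listing the columns to keep is shorter here and does not break if the file gains extra columns later.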

Delete the rows that contain missing values. With the tidyr library, the drop_na function removes every row that has any missing value.

data_clean <- drop_na(data)

head(data_clean)

##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  54     0     0 51.8625
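If tidyr is not available, base R can do the same kind of filtering. The sketch below swaps in complete.cases, which is not what this publication uses, and runs it on a six-row stand-in (sample_rows) typed from the output above. The names complete, sample_clean, and n_dropped are illustrative only.

```r
# Stand-in for the four selected columns (first six rows of the dataset).
sample_rows <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA),
  SibSp = c(1, 1, 0, 1, 0, 0),
  Parch = c(0, 0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# TRUE for rows that have no missing value in any column.
complete <- complete.cases(sample_rows)

# Keep only the complete rows.
sample_clean <- sample_rows[complete, ]

# How many rows were removed because of missing values.
n_dropped <- sum(!complete)
n_dropped
```

Counting the dropped rows is a quick check of how much data the cleaning step costs.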

Write R code that can generate:

Correlation Matrix
Variance-Covariance Matrix
Eigenvalues and eigenvectors
Interpretation of every output

correlation_matrix <- cor(data_clean, method = "pearson")

correlation_matrix

##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

The result shows that SibSp and Parch are positively correlated (about 0.38), indicating that passengers who travelled with siblings or spouses were also more likely to travel with parents or children.

Moreover, Age has only a weak correlation with Fare (about 0.10). That means a passenger’s age does not really explain the price of the ticket they bought.
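To read a correlation matrix without scanning it by eye, the pairs can be ranked by absolute correlation. This is a minimal sketch that retypes the printed correlation values into a matrix (corr) instead of recomputing them from the data. The names vars, idx, and pair_table are made up for the example.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation values copied from the correlation_matrix output.
corr <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once: upper triangle only, diagonal excluded.
idx <- which(upper.tri(corr), arr.ind = TRUE)

pair_table <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = corr[idx]
)

# Strongest relationships first, regardless of sign.
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

With the values above, SibSp and Parch come out on top and Age and Fare at the bottom, which matches the interpretation given in the text.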

covariance_matrix <- cov(data_clean)

covariance_matrix

##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

The largest variance is observed in the Fare variable, indicating that ticket prices have the widest spread compared to the other variables.

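The positive covariance between SibSp and Parch suggests that these variables tend to increase together. Age has negative covariance with SibSp and Parch and positive covariance with Fare. Because covariance depends on the units of each variable, its size alone does not show how strong a relationship is; the correlation matrix is the better guide for that.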
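The covariance and correlation matrices carry the same information on different scales: each correlation is the covariance divided by the product of the two standard deviations. The sketch below retypes the printed covariance values (as covariance) and rescales them, once by hand and once with cov2cor from base R. The names vars, std_dev, by_hand, and rescaled are illustrative only.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values copied from the covariance_matrix output.
covariance <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (the variances).
std_dev <- sqrt(diag(covariance))

# Divide each covariance by the product of the two standard deviations.
by_hand <- covariance / outer(std_dev, std_dev)

# cov2cor() performs the same rescaling in one call.
rescaled <- cov2cor(covariance)

round(rescaled, 4)
```

The rounded result should agree with the correlation matrix printed earlier, which is a useful consistency check between the two outputs.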

eigen_cov <- eigen(covariance_matrix)

eigen_cov$values

## [1] 2802.5636587  209.0385659    0.9438783    0.4787214

Eigenvalues give the amount of variance captured by each component, respectively. The component associated with the largest eigenvalue explains the greatest proportion of the total variability in the data. In this analysis, the first component is primarily influenced by the Fare variable, because Fare has by far the largest variance (about 2800).

The remaining components explain smaller proportions of variance: the second is dominated by Age, while the third and fourth are combinations of SibSp and Parch.

eigen_cov$vectors

##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

eigen_cov$values / sum(eigen_cov$values)

## [1] 0.930149541 0.069378309 0.000313266 0.000158884
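The proportions above can be extended to a cumulative view, and the eigenvalues can be checked against the total variance, since their sum equals the sum of the diagonal of the covariance matrix. This is a minimal sketch on the printed covariance values, retyped as covariance. The names eig, prop, cum_prop, and total_variance are made up for the example.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values copied from the covariance_matrix output.
covariance <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig <- eigen(covariance, symmetric = TRUE)

# Share of the total variance captured by each component.
prop <- eig$values / sum(eig$values)

# Running total: how much variance the first k components capture together.
cum_prop <- cumsum(prop)

# The eigenvalues should add up to the sum of the four variances.
total_variance <- sum(diag(covariance))

round(prop, 4)
round(cum_prop, 4)
c(sum_of_eigenvalues = sum(eig$values), total_variance = total_variance)
```

With these values the first two components together account for almost all of the variance, which reflects the large scales of Fare and Age rather than any special structure in SibSp and Parch.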

Eigenvectors indicate the direction of each principal component and the contribution of each variable to it. The rows follow the column order of the data (Age, SibSp, Parch, Fare), and variables with larger absolute values in an eigenvector have a stronger influence on the corresponding component.
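The eigenvector output is easier to read once the rows and columns are labelled. The sketch below works on the printed covariance values (retyped as covariance), attaches variable names, and picks the variable with the largest absolute weight in each component. The labels PC1 to PC4 and the names loadings and dominant are assumptions made for this example, not something eigen produces.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values copied from the covariance_matrix output.
covariance <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig <- eigen(covariance, symmetric = TRUE)

# eigen() returns an unlabelled matrix; rows follow the variable order.
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("PC", seq_along(vars)))

# For each component, the variable with the largest absolute weight.
dominant <- vars[apply(abs(loadings), 2, which.max)]
names(dominant) <- colnames(loadings)

round(loadings, 3)
dominant
```

The sign of an eigenvector is arbitrary, so a column may come out with all signs flipped on another machine; the absolute values are what matter for this reading.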

Arya Bintang Fauzildan

2026-02-05