Hello, my name is Arya Bintang Fauzildan, usually called Zildan, from the State University of Surabaya, Department of Data Science. In this R publication, we are going to talk about multivariate analysis. I am excited to walk through a case based on the Titanic dataset from the Kaggle platform (https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv).
First things first, we download the dataset in .csv format. The dataset has many variables that relate to each other, such as Age, Ticket, and PassengerId. Imagine we are detectives on duty, investigating and making predictions about the passengers who might have been lost in the Titanic tragedy based on the data given, so that we can connect the red strings of this case. For example, which age group was rescued most often, or which passenger class lost the most people in the tragedy.
From here, these are the steps of the analysis.
data <- read.csv("E:/ANALISIS_MULTIVARIAT/Titanic-Dataset.csv")
Check the data stored in the variable.
head(data)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
Install the tidyr library with the code below. You only need to install it once.
# install.packages("tidyr")
Activate the library.
library("tidyr")
Drop the columns that we no longer use. Remember, we only use the Age, SibSp, Parch, and Fare columns. The syntax select = -c() removes the listed columns.
data <- subset(data, select = -c(PassengerId, Survived, Pclass, Name, Sex, Ticket, Cabin, Embarked))
head(data)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  NA     0     0  8.4583
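As an alternative to listing every column to drop, the columns to keep can be selected by name. This is only a sketch: `toy` and `keep_cols` are hypothetical names, and `toy` is a small stand-in built from the first three rows printed above so that the example runs without the CSV file.

```r
# Small stand-in data frame; `toy` is illustrative only.
toy <- data.frame(
  PassengerId = 1:3,
  Sex   = c("male", "female", "female"),
  Age   = c(22, 38, 26),
  SibSp = c(1, 1, 0),
  Parch = c(0, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925)
)

# Keep columns by name instead of listing the ones to drop
keep_cols <- c("Age", "SibSp", "Parch", "Fare")
kept <- toy[, keep_cols]

# Dropping the unwanted columns with subset() leaves the same columns
dropped <- subset(toy, select = -c(PassengerId, Sex))
all.equal(kept, dropped)
```

Selecting by name tends to be shorter when only a few columns are needed, and it keeps working if the source file later gains extra columns.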
# Remove every row that contains a missing value (NA), such as the Age in row 6
data_clean <- drop_na(data)
head(data_clean)
##   Age SibSp Parch    Fare
## 1  22     1     0  7.2500
## 2  38     1     0 71.2833
## 3  26     0     0  7.9250
## 4  35     1     0 53.1000
## 5  35     0     0  8.0500
## 6  54     0     0 51.8625
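Before dropping rows, it can be useful to see how many values are missing and how many rows will be lost. The sketch below uses base R only; `sample_rows` is a hypothetical stand-in built from the six rows printed earlier, so the counts are not those of the full dataset.

```r
# Stand-in for the six rows shown earlier; `sample_rows` is illustrative only.
sample_rows <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA),
  SibSp = c(1, 1, 0, 1, 0, 0),
  Parch = c(0, 0, 0, 0, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(sample_rows))
na_per_column

# Base R way to keep only the rows without any NA
complete_rows <- sample_rows[complete.cases(sample_rows), ]

# Number of rows removed
rows_removed <- nrow(sample_rows) - nrow(complete_rows)
rows_removed
```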
# Pearson correlation between every pair of variables
correlation_matrix <- cor(data_clean, method = "pearson")
correlation_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
The result shows that the SibSp and Parch variables are positively correlated (r of about 0.38), which indicates that passengers who traveled with siblings or spouses also tended to travel with parents or children.
Moreover, the Age variable has only a weak correlation with the Fare variable (r of about 0.10). That means a passenger's age explains very little about the price of the ticket that was bought.
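The correlation matrix above was computed on rows without missing values. As a sketch of an alternative route, `cor()` can skip incomplete rows by itself through its `use = "complete.obs"` argument. The data frame `sample_rows` is a hypothetical stand-in built from the six rows printed earlier (Parch is left out because it is constant in those rows), so the numbers differ from the full dataset.

```r
# Stand-in rows with one missing Age; `sample_rows` is illustrative only.
sample_rows <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA),
  SibSp = c(1, 1, 0, 1, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Let cor() ignore rows that contain an NA
cor_complete_obs <- cor(sample_rows, use = "complete.obs", method = "pearson")

# Drop the incomplete rows first, then correlate
cor_after_drop <- cor(sample_rows[complete.cases(sample_rows), ],
                      method = "pearson")

# Both routes give the same matrix
all.equal(cor_complete_obs, cor_after_drop)
```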
# Covariance between every pair of variables (variances on the diagonal)
covariance_matrix <- cov(data_clean)
covariance_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
The largest variance is observed in the Fare variable, indicating that ticket prices have the widest spread compared to the other variables.
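Positive covariance between SibSp and Parch suggests that these variables tend to increase together. Age has negative covariance with SibSp and Parch and positive covariance with Fare. Because the size of a covariance depends on the scale of each variable, the strength of these relationships is better judged from the correlation matrix, where the relationships involving Age are weak.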
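Covariance and correlation are linked: dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below re-enters the covariance values printed above by hand (`covariance_values` is a hypothetical name) and converts them with `cov2cor()`, so it runs without the CSV file.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values copied from the output above
covariance_values <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
  -4.1633339,  0.8644973,  0.3045128,    6.8062117,
  -2.3441911,  0.3045128,  0.7281027,    9.2621760,
   73.849030,  6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Standard deviation of each variable (square root of the diagonal)
standard_deviations <- sqrt(diag(covariance_values))

# Rescale the covariances into correlations
correlation_from_cov <- cov2cor(covariance_values)
round(correlation_from_cov, 4)
```

The rescaled matrix should agree with the Pearson correlation matrix shown earlier, up to rounding of the copied values.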
# Eigen decomposition of the covariance matrix
eigen_cov <- eigen(covariance_matrix)
eigen_cov$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
Eigenvalues give the amount of variance explained by each component. The component associated with the largest eigenvalue explains the greatest proportion of the total variability in the data. In this analysis, the first component (eigenvalue of about 2802.56) is primarily influenced by the Fare variable, whose variance (about 2800.41) is by far the largest.
The subsequent components explain smaller proportions of variance. The second eigenvalue (about 209.04) is close to the variance of Age (about 211.02), while the last two, both below 1, are associated with combinations of SibSp and Parch.
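Since the text speaks of components, it may help to see that these eigenvalues are the component variances reported by a principal component analysis. The sketch below uses `prcomp()`, which is not part of the original analysis, on a hypothetical stand-in `sample_rows` made of the five complete rows printed earlier (Age, SibSp, and Fare only), so the numbers differ from the full dataset.

```r
# Five complete rows as a stand-in; `sample_rows` is illustrative only.
sample_rows <- data.frame(
  Age   = c(22, 38, 26, 35, 35),
  SibSp = c(1, 1, 0, 1, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05)
)

# Eigenvalues of the covariance matrix
eigen_sample <- eigen(cov(sample_rows))

# PCA on the centered, unscaled data
pca_sample <- prcomp(sample_rows, center = TRUE, scale. = FALSE)

# Squared component standard deviations equal the eigenvalues
all.equal(pca_sample$sdev^2, eigen_sample$values)
```

The directions in `pca_sample$rotation` correspond to the eigenvectors, although the sign of a column may be flipped.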
eigen_cov$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
# Proportion of the total variance explained by each component
eigen_cov$values / sum(eigen_cov$values)
## [1] 0.930149541 0.069378309 0.000313266 0.000158884
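The proportions just printed can be accumulated to see how many components are needed to reach a chosen share of the variance. The sketch below re-enters the eigenvalues printed earlier; the 95% threshold and the names `cumulative` and `n_components` are assumptions made for illustration.

```r
# Eigenvalues copied from the output of eigen_cov$values
eigenvalues <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Share of the total variance per component
proportion <- eigenvalues / sum(eigenvalues)

# Running total of the shares
cumulative <- cumsum(proportion)
round(cumulative, 4)

# First component at which the running total reaches 95% (assumed threshold)
n_components <- which(cumulative >= 0.95)[1]
n_components
```

With these values the first component alone stays just below the threshold, so the first two components together are needed.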
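Eigenvectors indicate the direction of each principal component and the contribution of each variable to it. The rows follow the order of the variables in the covariance matrix (Age, SibSp, Parch, Fare) and each column is one component. Variables with larger absolute values in an eigenvector have a stronger influence on the corresponding component: Fare dominates the first column (0.9996) and Age dominates the second (0.9993).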
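Reading the eigenvector matrix is easier when its rows carry the variable names. The sketch below re-enters the printed eigenvectors by hand and picks, for each component, the variable with the largest absolute value. It assumes the rows are in the same order as the columns of the covariance matrix; the labels PC1 to PC4 and the name `dominant` are added for illustration.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Eigenvectors copied from the output of eigen_cov$vectors
eigenvectors <- matrix(c(
  0.028477552,  0.99929943, -0.024018111,  0.0035788596,
  0.002386349, -0.02093144, -0.773693322,  0.6332099362,
  0.003280818, -0.01253786, -0.633088089, -0.7739712590,
  0.999586200, -0.02837826,  0.004609234,  0.0009266652
), nrow = 4, byrow = TRUE,
   dimnames = list(vars, paste0("PC", 1:4)))

# Variable with the largest absolute value in each column
dominant <- vars[apply(abs(eigenvectors), 2, which.max)]
names(dominant) <- colnames(eigenvectors)
dominant
```

Because Fare has a far larger variance than the other variables, it dominates the first component. Running `eigen()` on the correlation matrix instead would put the variables on a common scale, which is a different analysis from the one shown here.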