Multivariate Analysis of the Titanic Dataset

IMPORT DATASET

titanic <- read.csv("data/Titanic-Dataset.csv")
str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

head(titanic)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

dim(titanic)

## [1] 891  12

colnames(titanic)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

FEATURE SELECTION

data_mv <- titanic[, c("Age", "SibSp", "Parch", "Fare")]
str(data_mv)

## 'data.frame':    891 obs. of  4 variables:
##  $ Age  : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare : num  7.25 71.28 7.92 53.1 8.05 ...

MISSING VALUES

colSums(is.na(data_mv))

##   Age SibSp Parch  Fare 
##   177     0     0     0

HANDLING MISSING VALUES

data_mv_clean <- na.omit(data_mv)
colSums(is.na(data_mv_clean))

##   Age SibSp Parch  Fare 
##     0     0     0     0

dim(data_mv_clean)

## [1] 714   4
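The drop from 891 to 714 rows matches the 177 missing Age values, because na.omit() performs listwise deletion: any row with at least one NA is removed. A minimal sketch of that behavior on a small made-up data frame (the `toy` values are illustrative and are not taken from the Titanic file):

```r
# Toy data frame (made-up values, not the Titanic data) with gaps in Age
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Fare  = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

# Listwise deletion: na.omit() keeps only rows with no missing value
toy_clean <- na.omit(toy)

# complete.cases() selects the same rows through a logical filter
toy_cc <- toy[complete.cases(toy), ]

# Number of rows removed by the deletion
n_dropped <- nrow(toy) - nrow(toy_clean)
n_dropped
```

Listwise deletion is simple, but it discards the non-missing SibSp, Parch, and Fare values of every dropped row; whether that is acceptable depends on the purpose of the analysis.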

CORRELATION MATRIX

cor_matrix <- cor(data_mv_clean)
cor_matrix

##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000

The correlation matrix shows that all pairwise linear relationships are weak to moderate (|r| < 0.4). The strongest is between SibSp and Parch (r = 0.38), which is expected because both variables count family members aboard. Age is negatively correlated with both family variables (r = −0.31 with SibSp and r = −0.19 with Parch), so younger passengers tended to travel with more relatives. The weakest relationship is between Age and Fare (r = 0.10), which suggests that passenger age had little linear association with ticket fare.
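The ranking of relationships described above can be checked programmatically. The sketch below re-enters the printed correlation matrix by hand (so it runs without the CSV file) and sorts the six variable pairs by absolute correlation; the names `cor_printed` and `pairs_ranked` are introduced here for illustration only.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix copied from the printed output above
cor_printed <- matrix(
  c( 1.00000000, -0.3082468, -0.1891193, 0.09606669,
    -0.30824676,  1.0000000,  0.3838199, 0.13832879,
    -0.18911926,  0.3838199,  1.0000000, 0.20511888,
     0.09606669,  0.1383288,  0.2051189, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once (upper triangle) and rank by absolute correlation
idx <- which(upper.tri(cor_printed), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(cor_printed)[idx[, "row"]],
  var2 = colnames(cor_printed)[idx[, "col"]],
  r    = cor_printed[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

Ranking by absolute value treats a negative correlation such as Age and SibSp on the same footing as a positive one of equal size.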

pairs(data_mv_clean)

The scatterplot matrix produced by pairs() shows only weak linear patterns for most variable pairs; the exception is SibSp and Parch, which is consistent with the correlation matrix.

VARIANCE-COVARIANCE MATRIX

cov_matrix <- cov(data_mv_clean)
cov_matrix

##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100

Based on the variance–covariance matrix, Fare has by far the largest variance, at 2800.41 (a standard deviation of about 52.92), so ticket prices differed widely among passengers. Age has a variance of 211.02 (a standard deviation of about 14.53 years), reflecting a broad spread of passenger ages. The variances of SibSp (0.86) and Parch (0.73) are small because the number of siblings/spouses or parents/children aboard is mostly limited to small counts. Variances are expressed in squared original units, so these magnitudes are not directly comparable across variables measured on different scales.

The covariance between SibSp and Parch is positive (0.3045), indicating that passengers with more siblings or spouses aboard also tended to travel with more parents or children. Conversely, the covariances between Age and SibSp (−4.16) and between Age and Parch (−2.34) are negative, indicating that younger passengers tended to travel with more family members. The covariances between Fare and the family variables are positive (6.81 with SibSp and 9.26 with Parch), so passengers traveling with more family members tended to pay higher fares. The size of a covariance depends on the units of both variables, so the strength of these relationships is better read from the correlation matrix, where they are weak (0.14 and 0.21). Overall, the variance–covariance matrix provides an initial overview of the patterns of variation and interrelationships between variables in their original units.
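Covariances and correlations are linked by the standard deviations: each correlation is the covariance divided by the product of the two standard deviations. As a hedged illustration, the sketch below re-enters the printed covariance matrix by hand (the name `cov_printed` is introduced here, and the values carry the rounding of the printed output) and recovers the correlations with cov2cor().

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the printed output above
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations in original units (square roots of the diagonal)
sds <- sqrt(diag(cov_printed))

# cov2cor() divides each covariance by the product of the two
# corresponding standard deviations
cor_from_cov <- cov2cor(cov_printed)

round(sds, 2)
round(cor_from_cov, 3)
```

The rescaled matrix should agree with the output of cor() shown earlier up to rounding, which is why the large Fare covariances (6.81 and 9.26) correspond to only weak correlations.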

heatmap(cov_matrix, symm = TRUE, main = "Variance–Covariance Matrix Heatmap")

EIGENVALUES AND EIGENVECTORS

eigen_result <- eigen(cov_matrix)
eigen_result$values

## [1] 2802.5636587  209.0385659    0.9438783    0.4787214

eigen_result$vectors

##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652

The eigendecomposition of the variance–covariance matrix gives a first eigenvalue of 2802.56, far larger than the others, so the first principal component accounts for most of the total variance in the data. The second eigenvalue, 209.04, still makes a noticeable contribution, while the third and fourth eigenvalues (0.94 and 0.48) are very small, so the contribution of those components is negligible.

These findings indicate that the total variance is strongly dominated by one dimension, and most of it can be represented by one or two principal components. Because the decomposition is applied to the covariance matrix of variables in their original units, this dominance largely reflects the large variance of Fare; a decomposition of the correlation matrix would instead weight the four variables equally. The result provides a starting point for dimensionality reduction techniques such as Principal Component Analysis (PCA).
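How much the dominance of a single component depends on measurement units can be explored by decomposing the correlation matrix alongside the covariance matrix. The sketch below is an illustration under the assumption that the printed covariance matrix is re-entered by hand; `cov_printed`, `eig_cov`, and `eig_cor` are names introduced here and are not part of the original analysis.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the printed output above
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Decomposition in original units (as in the analysis above)
eig_cov <- eigen(cov_printed, symmetric = TRUE)

# Decomposition after standardizing every variable to unit variance
eig_cor <- eigen(cov2cor(cov_printed), symmetric = TRUE)

# Share of total variance carried by each component under both scalings
prop_cov <- eig_cov$values / sum(eig_cov$values)
prop_cor <- eig_cor$values / sum(eig_cor$values)
round(rbind(covariance = prop_cov, correlation = prop_cor), 3)
```

With standardized variables each one contributes a variance of 1, so the eigenvalues sum to 4 and no single variable can dominate through its scale alone. Which matrix is appropriate depends on whether differences in original units are meaningful for the question being asked.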

The eigenvectors give the coefficients of the linear combinations of Age, SibSp, Parch, and Fare that form each principal component; the rows of the printed matrix follow that variable order. In the eigenvector associated with the largest eigenvalue, Fare has by far the largest loading (0.9996), so variation in ticket fare is the main contributor to the first principal component. Age (0.028), SibSp (0.002), and Parch (0.003) contribute very little to this component.

The second eigenvector is dominated almost entirely by Age (loading 0.9993), so the second principal component is essentially passenger age. The family variables appear only in the third and fourth components. In the third, SibSp (−0.77) and Parch (−0.63) load with the same sign, reflecting overall family size; in the fourth they load with opposite signs (0.63 and −0.77), contrasting siblings/spouses with parents/children. Overall, the eigenvectors show that ticket fare is the dominant source of variation in the data, age is the second, and family structure accounts for only a very small share.
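Because eigen() returns the eigenvectors without row or column names, labeling them makes the loadings easier to read. A minimal sketch, assuming the printed covariance matrix is re-entered by hand (`cov_printed`, `loadings`, and `dominant` are names introduced here for illustration):

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the printed output above
cov_printed <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.163334,  0.8644973,  0.3045128,    6.806212,
     -2.344191,  0.3045128,  0.7281027,    9.262176,
     73.849030,  6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eig <- eigen(cov_printed, symmetric = TRUE)

# Rows follow the variable order of the matrix; columns are components
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))
round(loadings, 3)

# Variable with the largest absolute loading on each component
dominant <- vars[apply(abs(loadings), 2, which.max)]
names(dominant) <- colnames(loadings)
dominant
```

The sign of an eigenvector is arbitrary, so a whole column may appear with flipped signs on another platform; the absolute values and the relative signs within a column are what carry the interpretation.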

plot(
  eigen_result$values,
  type = "b",
  xlab = "Principal Component",
  ylab = "Eigenvalue",
  main = "Scree Plot"
)

prop_variance <- eigen_result$values / sum(eigen_result$values)
prop_variance

## [1] 0.930149541 0.069378309 0.000313266 0.000158884

The first principal component explains about 93.0% of the total variance and the second about 6.9%, so together the first two components account for roughly 99.95%. The third and fourth components together explain less than 0.05%.
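The cumulative proportion makes it easier to decide how many components to keep. The sketch below starts from the printed eigenvalues; the 95% threshold and the names `cum_prop`, `variance_table`, and `k` are illustrative assumptions, not part of the original analysis.

```r
# Eigenvalues copied from the printed output of eigen_result$values
eigenvalues <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Proportion and cumulative proportion of total variance
prop <- eigenvalues / sum(eigenvalues)
cum_prop <- cumsum(prop)

variance_table <- data.frame(
  component  = paste0("PC", seq_along(eigenvalues)),
  eigenvalue = eigenvalues,
  proportion = round(prop, 4),
  cumulative = round(cum_prop, 4)
)
variance_table

# Smallest number of components whose cumulative share reaches the
# threshold (0.95 is an arbitrary choice for this illustration)
k <- which(cum_prop >= 0.95)[1]
k
```

A fixed threshold is only one possible rule; the scree plot above offers a visual alternative by looking for the point where the eigenvalues level off.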