I. Multivariate Analysis on Titanic Dataset Using R

In real life, a single variable can rarely explain an event on its own. Most outcomes are influenced by several factors acting at the same time, which makes the data harder to analyze. Multivariate analysis was developed to handle this situation, where outcomes depend on more than one factor (variable) simultaneously.

Multivariate analysis helps us find relationships and patterns: how variables relate to one another and how they contribute to the variation and outcomes in the data. We can quantify these relationships with statistics such as correlation, covariance, and eigenvalues. These approaches are also essential because they form the basis for more advanced methods, for example dimensionality reduction.

This analysis applies some basic multivariate analysis approaches using the R programming language. The goals are to perform the statistical calculations and interpret them to draw insights from the dataset.

II. Dataset Description

The dataset used in this analysis is the Titanic Dataset obtained from Kaggle (source: https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv), which contains information about passengers aboard the Royal Mail Ship (RMS) Titanic. The dataset includes some variables that describe passenger backgrounds and conditions during the trip. Each observation represents an individual passenger. The dataset contains missing values and the variables are measured on different scales.

For the purpose of this analysis, only four variables were selected: Age, SibSp, Parch, and Fare.

  • The variable ‘Age’ represents the passenger’s age in years
  • The variable ‘SibSp’ represents the number of siblings/spouses aboard the Titanic
  • The variable ‘Parch’ represents the number of parents/children aboard the Titanic
  • The variable ‘Fare’ represents the ticket price paid by the passenger

These variables were chosen because they reflect passenger backgrounds and situations that may show relationships in the data.

III. Core Analysis

A. View Data

Before selecting variables, we first view the data and its details.

A.1. Read the dataset

This command imports the Titanic dataset from a CSV file into R and stores it as a data frame named titanic_dataset.

titanic_dataset <- read.csv(
  "C:/Users/ILLONA LAPTOP/Downloads/Learning R and Rpubs_Analysis on Titanic Dataset/Titanic-Dataset.csv"
)
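
The str() output in the next step shows empty strings ("") in the Cabin column, which read.csv() does not treat as missing by default. As a minimal sketch, assuming a small made-up file (the name toy_csv and its contents are hypothetical, not the Kaggle file), the na.strings argument can convert empty fields to NA on import:

```r
# Hypothetical toy file standing in for Titanic-Dataset.csv
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Age,Cabin,Fare",
  "1,22,,7.25",
  "2,38,C85,71.2833",
  "3,,,8.4583"
), toy_csv)

# Default import: an empty Cabin field stays an empty string ""
toy_default <- read.csv(toy_csv)

# Alternative import: empty strings are treated as missing values too
toy_na <- read.csv(toy_csv, na.strings = c("", "NA"))

sum(is.na(toy_default$Cabin))  # empty strings are not counted as NA
sum(is.na(toy_na$Cabin))       # empty strings are now counted as NA
```

This does not change the four numeric variables analyzed here, since blank numeric fields are already read as NA; it only matters if character columns such as Cabin are used later.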

A.2. Display the data structure

str() displays the internal structure of the dataset, such as variable names, data types, and the number of observations. We use it to check whether the variables are suitable for analysis.

str(titanic_dataset)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

A.3. Display the first and last few rows

head() and tail() display the first and last ten rows of the dataset, which lets us confirm that it contains the expected values.

head(titanic_dataset, 10)
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                   Name    Sex Age SibSp Parch
## 1                              Braund, Mr. Owen Harris   male  22     1     0
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                             Allen, Mr. William Henry   male  35     0     0
## 6                                     Moran, Mr. James   male  NA     0     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 8                       Palsson, Master. Gosta Leonard   male   2     3     1
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##              Ticket    Fare Cabin Embarked
## 1         A/5 21171  7.2500              S
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 5            373450  8.0500              S
## 6            330877  8.4583              Q
## 7             17463 51.8625   E46        S
## 8            349909 21.0750              S
## 9            347742 11.1333              S
## 10           237736 30.0708              C
tail(titanic_dataset, 10)
##     PassengerId Survived Pclass                                     Name    Sex
## 882         882        0      3                       Markun, Mr. Johann   male
## 883         883        0      3             Dahlberg, Miss. Gerda Ulrika female
## 884         884        0      2            Banfield, Mr. Frederick James   male
## 885         885        0      3                   Sutehall, Mr. Henry Jr   male
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch           Ticket    Fare Cabin Embarked
## 882  33     0     0           349257  7.8958              S
## 883  22     0     0             7552 10.5167              S
## 884  28     0     0 C.A./SOTON 34068 10.5000              S
## 885  25     0     0  SOTON/OQ 392076  7.0500              S
## 886  39     0     5           382652 29.1250              Q
## 887  27     0     0           211536 13.0000              S
## 888  19     0     0           112053 30.0000   B42        S
## 889  NA     1     2       W./C. 6607 23.4500              S
## 890  26     0     0           111369 30.0000  C148        C
## 891  32     0     0           370376  7.7500              Q

A.4. Check dataset size

  • dim() shows the number of rows and columns
  • nrow() shows the number of rows (observations)
  • ncol() shows the number of columns (variables)
dim(titanic_dataset)
## [1] 891  12
nrow(titanic_dataset)
## [1] 891
ncol(titanic_dataset)
## [1] 12

B. Data Preprocessing

After confirming that the data is suitable, we select the variables to analyze and clean them by handling their missing values.

B.1. Data selection

This command selects only the variables that we want to analyze. summary() then displays descriptive statistics for each selected variable: minimum, quartiles, median, mean, and maximum, plus the number of missing values (NA's) where there are any. It helps us identify missing values and see the data range.

selected_data <- titanic_dataset[, c("Age", "SibSp", "Parch", "Fare")]
summary(selected_data)
##       Age            SibSp           Parch             Fare       
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :28.00   Median :0.000   Median :0.0000   Median : 14.45  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000   Max.   :512.33  
##  NA's   :177
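
The summary above reports missing values only for Age. A more direct way to count them per column is colSums(is.na(...)). The sketch below uses a small made-up data frame (toy_selected and its values are hypothetical) so that it runs on its own; applied to selected_data, the same call would be expected to report 177 for Age and 0 for the other columns, in line with the summary.

```r
# Toy data frame (hypothetical values) with the same four column names
toy_selected <- data.frame(
  Age   = c(22, 38, NA, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 2),
  Fare  = c(7.25, 71.28, 8.46, 53.10, 23.45)
)

# is.na() marks each missing cell; colSums() counts them per column
missing_per_column <- colSums(is.na(toy_selected))
missing_per_column
```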

B.2. Data Cleaning

cleaned_data <- na.omit(selected_data) removes all rows containing a missing value in any of the selected variables. We can then compare the data before and after the removal.

  • nrow(selected_data) displays the total number of rows before data cleaning, including the rows with missing values
  • nrow(cleaned_data) displays the number of rows remaining after the rows with missing values are removed
  • cat() displays the same two counts in a more readable format

cleaned_data <- na.omit(selected_data)

nrow(selected_data)
## [1] 891
nrow(cleaned_data)
## [1] 714
cat("Number of rows before removing missing values:", nrow(selected_data))
## Number of rows before removing missing values: 891
cat("Number of rows after removing missing values:", nrow(cleaned_data))
## Number of rows after removing missing values: 714
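
The difference between the two counts (891 - 714 = 177) matches the number of missing Age values reported by summary(). As a sketch of what na.omit() does row by row, complete.cases() returns TRUE for every row without missing values. The data frame below is made up for illustration (toy_selected and its values are hypothetical).

```r
# Toy data frame (hypothetical values) with missing ages in rows 3 and 5
toy_selected <- data.frame(
  Age   = c(22, 38, NA, 35, NA),
  SibSp = c(1, 1, 0, 1, 0),
  Parch = c(0, 0, 0, 0, 2),
  Fare  = c(7.25, 71.28, 8.46, 53.10, 23.45)
)

# TRUE for rows with no missing value in any column
complete_rows <- complete.cases(toy_selected)

# Keeping only the complete rows gives the same rows as na.omit()
toy_cleaned  <- toy_selected[complete_rows, ]
rows_removed <- sum(!complete_rows)

cat("Rows before:", nrow(toy_selected), "\n")
cat("Rows after:", nrow(toy_cleaned), "\n")
cat("Rows removed:", rows_removed, "\n")
```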

C. Statistical Analysis

Once the data is cleaned, the next step is to compute its statistics and draw insights and conclusions from them, looking for patterns among the selected variables.

C.1. Correlation Matrix

The correlation matrix describes the strength and direction of linear relationships among the selected variables. It helps identify variables that increase or decrease together. Correlation values close to +1 or -1 indicate strong linear relationships, while values close to 0 indicate weak linear relationships. By default, cor() computes the Pearson correlation coefficient.

correlation_matrix <- cor(cleaned_data)
correlation_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
Output explanation:
  • The diagonal values are all equal to 1, which indicates perfect correlation of each variable with itself.
  • The matrix is symmetric, which means that the correlation between variable X and Y is the same as between Y and X.
  • Age and SibSp show a moderate negative correlation (-0.308), which suggests that younger passengers tended to travel with more siblings or spouses.
  • Age and Parch show a weaker negative correlation (-0.189), which suggests that younger passengers tended to travel with more parents or children.
  • SibSp and Parch show a moderate positive correlation (0.384), which suggests that passengers who traveled with siblings or spouses were also more likely to travel with parents or children.
  • Correlations involving Fare are all weak (between 0.096 and 0.205), which indicates that ticket price is only loosely related to age and family size.
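
To make the coefficient itself concrete, the following sketch computes the Pearson correlation by hand and compares it with cor(). The vectors hold the Age and SibSp values of the nine complete rows shown in the head() output above; they are only an illustration, so the result differs from the full-data value of -0.308. The variable names are introduced here for the example.

```r
# Age and SibSp from the first ten rows, skipping the row with a missing age
age   <- c(22, 38, 26, 35, 35, 54, 2, 27, 14)
sibsp <- c(1, 1, 0, 1, 0, 0, 3, 0, 1)

# Deviations from the mean
dev_age   <- age - mean(age)
dev_sibsp <- sibsp - mean(sibsp)

# Pearson r: sum of cross-products over the root of the product of sums of squares
r_manual  <- sum(dev_age * dev_sibsp) / sqrt(sum(dev_age^2) * sum(dev_sibsp^2))
r_builtin <- cor(age, sibsp)

all.equal(r_manual, r_builtin)
```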

C.2. Variance-Covariance Matrix

The variance-covariance matrix describes the variability of each variable and how pairs of variables vary together. The diagonal elements are the variances of each variable, while the off-diagonal elements are the covariances between pairs of variables.

varcov_matrix <- cov(cleaned_data)
varcov_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
Output explanation:

Variance:

  • Fare shows the largest variance (2800.41, a standard deviation of about 52.9), which indicates a wide spread in ticket prices among passengers.
  • Age also shows a large variance (211.02, a standard deviation of about 14.5 years), which indicates a wide spread in passengers’ ages, from young to old.
  • SibSp and Parch have small variances (0.86 and 0.73), which reflects the narrow range of these family-size counts: according to the summary above, at least three quarters of the passengers have values of 0 or 1.

Covariance:

  • Negative covariances between Age and SibSp (-4.16) and between Age and Parch (-2.34) indicate that younger passengers tended to travel with more family members.
  • The positive covariance between SibSp and Parch (0.30) indicates that the two family-size counts tend to increase together, and the positive covariances between Fare and SibSp (6.81) and between Fare and Parch (9.26) indicate that passengers traveling with family members tended to pay higher fares.
  • Because covariance depends on the scale of measurement, its magnitude cannot be compared directly across pairs, and variables with larger numeric ranges, such as Fare, dominate the covariance structure.
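
The link between the two matrices can be sketched with cov2cor(), which divides each covariance by the product of the two standard deviations. The matrix below is typed in from the printed output above (rounded values, stored under the name S introduced here for the example), so the result should agree with the correlation matrix in C.1 up to rounding.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the printed output (rounded values)
S <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339,  0.8644973,  0.3045128,    6.8062117,
   -2.3441911,  0.3045128,  0.7281027,    9.2621760,
   73.849030,   6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Rescale: r_ij = s_ij / sqrt(s_ii * s_jj)
R <- cov2cor(S)
round(R, 4)
```

After rescaling, the Fare column no longer stands out, which shows that its dominance in the covariance matrix comes from its measurement scale rather than from stronger relationships.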

C.3. Eigenvalues and Eigenvectors

Eigenvalues describe the amount of total variance explained by each principal component (PC), while eigenvectors describe the contribution (loading) of each selected variable to each PC. Here, eigen() is applied to the variance-covariance matrix. Because eigen() returns the eigenvectors without labels, the rownames() and colnames() assignments add meaningful ones: each column corresponds to a PC, and each row shows the contribution of a variable.

eigen_valvec <- eigen(varcov_matrix)
eigen_valvec$values
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214
eigen_valvec$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
rownames(eigen_valvec$vectors) <- colnames(cleaned_data)
colnames(eigen_valvec$vectors) <- paste0("PC", 1:ncol(cleaned_data))
eigen_valvec$vectors
##               PC1         PC2          PC3           PC4
## Age   0.028477552  0.99929943 -0.024018111  0.0035788596
## SibSp 0.002386349 -0.02093144 -0.773693322  0.6332099362
## Parch 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## Fare  0.999586200 -0.02837826  0.004609234  0.0009266652
Output explanation:

Eigenvalues:

  • The first eigenvalue (2802.56) is much larger than the others and accounts for about 93% of the total variance (the sum of the eigenvalues, about 3013.02), so the first PC alone explains most of the variance in the dataset.
  • The second eigenvalue (209.04) accounts for about 6.9% of the total variance, so the second PC explains far less.
  • The third and fourth eigenvalues (0.94 and 0.48) are very small by comparison and together account for less than 0.1% of the total variance, so the third and fourth PCs contribute minimally.
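
The share of variance explained by each PC is its eigenvalue divided by the sum of all eigenvalues. A minimal sketch, using the eigenvalues copied from the output above (the vector and its PC names are typed in for the example):

```r
# Eigenvalues copied from the eigen() output above
eigenvalues <- c(PC1 = 2802.5636587, PC2 = 209.0385659,
                 PC3 = 0.9438783,    PC4 = 0.4787214)

# Proportion of total variance per component, and the running total
proportion <- eigenvalues / sum(eigenvalues)
cumulative <- cumsum(proportion)

# Shown as percentages
round(100 * rbind(proportion, cumulative), 2)
```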

Eigenvectors:

  • The first principal component (PC1) has a very large loading for Fare (0.9996), while the loadings of Age, SibSp, and Parch are close to zero, which indicates that PC1 essentially represents ticket price variability.
  • PC2 has a very large loading for Age (0.9993) and near-zero loadings for the other variables, which indicates that it captures the age-related variation, largely separate from fare.
  • PC3 and PC4 are mainly influenced by SibSp and Parch, which indicates that they capture the variation in the family-size variables.
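
The properties behind this interpretation can be checked numerically. The sketch below rebuilds the variance-covariance matrix from the printed output (rounded values, stored under the name S introduced here for the example) and verifies three facts: each eigenvector v satisfies S v = lambda v, the eigenvectors are orthonormal, and the eigenvalues sum to the total variance (the sum of the diagonal of S).

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the printed output (rounded values)
S <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339,  0.8644973,  0.3045128,    6.8062117,
   -2.3441911,  0.3045128,  0.7281027,    9.2621760,
   73.849030,   6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

decomposition <- eigen(S)
V      <- decomposition$vectors  # one eigenvector per column
lambda <- decomposition$values   # eigenvalues in decreasing order

# 1. Defining equation: S %*% V equals V %*% diag(lambda) (difference near 0)
max(abs(S %*% V - V %*% diag(lambda)))

# 2. Orthonormality: t(V) %*% V is the identity matrix (difference near 0)
max(abs(crossprod(V) - diag(4)))

# 3. Total variance is preserved: both sums agree
c(sum_of_eigenvalues = sum(lambda), trace_of_S = sum(diag(S)))
```

The sign of an eigenvector is arbitrary, so a column of V may appear with all signs flipped relative to the output above without changing the interpretation.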

D. Findings

This analysis applied basic multivariate statistics to examine the relationships and variability among Age, SibSp, Parch, and Fare in the Titanic Dataset. The clearest relationships are among age and the family-related variables. Age is negatively related to SibSp and Parch, while SibSp and Parch are positively related to each other. This suggests that younger passengers were more likely to travel with family members, and that the family variables tend to increase together. Fare shows only weak linear relationships with the other variables (correlations between about 0.10 and 0.21), so ticket price is largely unrelated to age and family size in a linear sense: passengers of any age or family size appear at many different price levels.

As for variability, Fare has by far the largest variance, so it dominates the variance-covariance matrix and the first principal component (PC1). Age shows moderate variability and mainly contributes to the second principal component (PC2), while SibSp and Parch vary the least and only contribute to the minor components. This ordering reflects the measurement scales: Fare and Age span wide numeric ranges, while SibSp and Parch are small counts of family members.

The main findings of this analysis are that family-related variables are related to each other, age is negatively correlated with family-related variables, and fare differences drive most of the overall variation, while other variables contribute relatively little variation.

IV. Conclusion

This analysis shows how basic multivariate tools (the correlation matrix, the variance-covariance matrix, eigenvalues, and eigenvectors) can be used to understand the structure of multivariate data, in this case the Titanic Dataset. They identified the relationships among the variables as well as the variability of each variable. The results illustrate that multivariate analysis provides a framework for summarizing datasets and drawing insights from them. This analysis is a foundational step towards more advanced techniques, for example principal component analysis and dimensionality reduction.