In real life, a single variable rarely explains an event on its own. Most outcomes are influenced by several factors acting at the same time, which makes the data harder to analyze. Multivariate analysis was developed to handle this situation, where outcomes depend on more than one factor (variable) simultaneously.
Multivariate analysis helps us find relationships and patterns: how variables relate to one another and how they contribute to the variation in the data. We can quantify these relationships with statistics such as correlation, covariance, and eigenvalues. These measures are also essential because they form the basis for more advanced approaches, for example, dimensionality reduction.
This analysis applies some basic multivariate approaches using the R programming language. The goals are to perform the statistical calculations and interpret them in order to draw insights from the dataset.
The dataset used in this analysis is the Titanic Dataset obtained from Kaggle (source: https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv), which contains information about passengers aboard the Royal Mail Ship (RMS) Titanic. The dataset includes some variables that describe passenger backgrounds and conditions during the trip. Each observation represents an individual passenger. The dataset contains missing values and the variables are measured on different scales.
For the purpose of this analysis, only four variables were selected: Age, SibSp, Parch, and Fare.
- The variable ‘Age’ represents the passenger’s age in years
- The variable ‘SibSp’ represents the number of siblings/spouses aboard the Titanic
- The variable ‘Parch’ represents the number of parents/children aboard the Titanic
- The variable ‘Fare’ represents the ticket price paid by the passenger
These variables were chosen because they reflect passenger backgrounds and situations that may show relationships in the data.
Before selecting variables, we should first view the data and its details.
This command imports the Titanic dataset from a CSV file into R and stores it as a data frame named titanic_dataset.
titanic_dataset <- read.csv(
"C:/Users/ILLONA LAPTOP/Downloads/Learning R and Rpubs_Analysis on Titanic Dataset/Titanic-Dataset.csv"
)
str() displays the internal structure of the dataset, such as variable names, data types, and the number of observations. We use it to check whether the variables are suitable for analysis.
str(titanic_dataset)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
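The str() output above shows Cabin values stored as empty strings ("") rather than NA, so functions such as summary() and na.omit() would not treat them as missing. As a minimal sketch, assuming a tiny inline CSV invented here for illustration (it is not the Kaggle file, and the names csv_text, raw_import, and na_import are hypothetical), the na.strings argument of read.csv() can map empty strings to NA at import time:

```r
# Hypothetical three-row CSV, invented for illustration only.
csv_text <- "PassengerId,Age,Cabin
1,22,
2,38,C85
3,,"

# Default import: the empty Cabin fields stay as "" (character), not NA.
raw_import <- read.csv(text = csv_text)

# Treat both empty fields and the literal string "NA" as missing values.
na_import <- read.csv(text = csv_text, na.strings = c("", "NA"))

sum(raw_import$Cabin == "")   # number of empty-string cabins
colSums(is.na(na_import))     # missing values per column after the fix
```

This matters only if a text column such as Cabin is used later; the four numeric variables selected in this analysis are not affected.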
head() and tail() display the first and last ten rows of the dataset, and dim(), nrow(), and ncol() report its size. We use them to confirm that the dataset contains the expected values.
head(titanic_dataset, 10)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## 7 7 0 1
## 8 8 0 3
## 9 9 1 3
## 10 10 1 2
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 8 Palsson, Master. Gosta Leonard male 2 3 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
## 7 17463 51.8625 E46 S
## 8 349909 21.0750 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
tail(titanic_dataset, 10)
## PassengerId Survived Pclass Name Sex
## 882 882 0 3 Markun, Mr. Johann male
## 883 883 0 3 Dahlberg, Miss. Gerda Ulrika female
## 884 884 0 2 Banfield, Mr. Frederick James male
## 885 885 0 3 Sutehall, Mr. Henry Jr male
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 882 33 0 0 349257 7.8958 S
## 883 22 0 0 7552 10.5167 S
## 884 28 0 0 C.A./SOTON 34068 10.5000 S
## 885 25 0 0 SOTON/OQ 392076 7.0500 S
## 886 39 0 5 382652 29.1250 Q
## 887 27 0 0 211536 13.0000 S
## 888 19 0 0 112053 30.0000 B42 S
## 889 NA 1 2 W./C. 6607 23.4500 S
## 890 26 0 0 111369 30.0000 C148 C
## 891 32 0 0 370376 7.7500 Q
dim(titanic_dataset)
## [1] 891 12
nrow(titanic_dataset)
## [1] 891
ncol(titanic_dataset)
## [1] 12
After confirming that the data is suitable, we select the variables we want and clean them by handling their missing values.
The first command selects only the variables that we want to analyze. summary() then displays descriptive statistics for each selected variable, including the minimum, maximum, median, mean, and quartiles. It helps us identify missing values and see the data range. Here, only Age has missing values (177).
selected_data <- titanic_dataset[, c("Age", "SibSp", "Parch", "Fare")]
summary(selected_data)
## Age SibSp Parch Fare
## Min. : 0.42 Min. :0.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91
## Median :28.00 Median :0.000 Median :0.0000 Median : 14.45
## Mean :29.70 Mean :0.523 Mean :0.3816 Mean : 32.20
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :80.00 Max. :8.000 Max. :6.0000 Max. :512.33
## NA's :177
cleaned_data <- na.omit(selected_data) removes all rows containing a missing value in any of the selected variables. We can then compare the data before and after the removal.
- nrow(selected_data) displays the total number of rows before data cleaning, including the rows with missing values
- nrow(cleaned_data) displays the total number of rows after the rows with missing values are removed
- cat() displays the same two counts in a more readable format
cleaned_data <- na.omit(selected_data)
nrow(selected_data)
## [1] 891
nrow(cleaned_data)
## [1] 714
cat("Number of rows before removing missing values:", nrow(selected_data))
## Number of rows before removing missing values: 891
cat("Number of rows after removing missing values:", nrow(cleaned_data))
## Number of rows after removing missing values: 714
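The cleaning step above can be sketched on a small data frame. The values below are invented for illustration, and the names toy_data, toy_clean, and same_rows are hypothetical. colSums(is.na()) shows which columns hold the missing values before na.omit() drops the affected rows, and complete.cases() gives the same row filter explicitly.

```r
# Toy data, invented for illustration (not taken from the Titanic file).
toy_data <- data.frame(
  Age   = c(22, NA, 35, 54, NA),
  SibSp = c(1, 0, 0, 0, 3),
  Fare  = c(7.25, 8.46, 8.05, 51.86, 21.08)
)

# Count missing values per column before removing anything.
na_per_column <- colSums(is.na(toy_data))
na_per_column

# na.omit() drops every row with at least one NA in any column.
toy_clean <- na.omit(toy_data)

# complete.cases() returns the logical filter that na.omit() applies.
same_rows <- toy_data[complete.cases(toy_data), ]

# "\n" ends each line so consecutive cat() calls do not run together.
cat("Rows before: ", nrow(toy_data), "\n")
cat("Rows after:  ", nrow(toy_clean), "\n")
cat("Rows removed:", nrow(toy_data) - nrow(toy_clean), "\n")
```

Removing whole rows keeps the example simple, but it also discards the non-missing values in each removed row.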
Removing the 177 rows with a missing Age leaves 714 complete observations. With the data cleaned, the next step is to compute the statistics and interpret them to find patterns among the selected variables.
The correlation matrix describes the strength and direction of linear relationships among the selected variables (cor() computes the Pearson correlation by default). It helps identify variables that increase or decrease together. Correlation values close to +1 or -1 indicate strong linear relationships, while values close to 0 indicate weak linear relationships.
correlation_matrix <- cor(cleaned_data)
correlation_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
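To make the matrix easier to read, the pairs can be ranked by the absolute size of their correlation. The sketch below re-enters the printed correlation matrix by hand (rounded values, so it is an approximation of correlation_matrix) and the names cor_mat and cor_pairs are hypothetical.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix re-entered from the printed output (rounded values).
cor_mat <- matrix(
  c( 1.0000000, -0.3082468, -0.1891193, 0.0960667,
    -0.3082468,  1.0000000,  0.3838199, 0.1383288,
    -0.1891193,  0.3838199,  1.0000000, 0.2051189,
     0.0960667,  0.1383288,  0.2051189, 1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Each pair appears twice in a symmetric matrix, so keep the upper triangle.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Sort from the strongest to the weakest linear relationship.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

In the actual analysis the same steps could be applied directly to correlation_matrix instead of the re-entered copy.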
The variance-covariance matrix describes the variability of each variable and how pairs of variables vary together. The diagonal elements are the variances of each variable, while the off-diagonal elements are the covariances between pairs of variables. Unlike correlations, these values depend on the units in which each variable is measured.
varcov_matrix <- cov(cleaned_data)
varcov_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
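Variance: Fare has by far the largest variance (2800.41), followed by Age (211.02). SibSp (0.86) and Parch (0.73) vary the least.
Covariance: Age covaries negatively with SibSp (-4.16) and Parch (-2.34) and positively with Fare (73.85). SibSp and Parch covary positively (0.30), and both covary positively with Fare (6.81 and 9.26). Because covariances depend on the units of measurement, their magnitudes are not directly comparable across pairs.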
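The correlation and variance-covariance matrices are linked: each correlation is the covariance divided by the product of the two standard deviations, r_ij = s_ij / (s_i * s_j). As a minimal sketch, with the variance-covariance matrix re-entered from the printed output (so results match only up to rounding, and the names varcov, std_dev, cor_manual, and cor_builtin are hypothetical), the standard deviations and the correlation matrix can be recovered with base R's cov2cor():

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix re-entered from the printed output.
varcov <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (the variances).
std_dev <- sqrt(diag(varcov))
round(std_dev, 2)

# Manual conversion: divide each covariance by the product of the two SDs.
cor_manual <- varcov / outer(std_dev, std_dev)

# Built-in equivalent.
cor_builtin <- cov2cor(varcov)

all.equal(cor_manual, cor_builtin)
round(cor_builtin, 4)
```

The standard deviations are in the original units (years for Age, fare units for Fare), which often makes them easier to interpret than the variances.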
Eigenvalues describe the amount of total variance explained by each principal component (PC), while eigenvectors describe the contribution of each selected variable to each PC. The rownames() and colnames() calls assign meaningful labels to the eigenvector matrix, so that each column corresponds to a PC and each row shows the contribution of a variable.
eigen_valvec <- eigen(varcov_matrix)
eigen_valvec$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
eigen_valvec$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
rownames(eigen_valvec$vectors) <- colnames(cleaned_data)
colnames(eigen_valvec$vectors) <- paste0("PC", 1:ncol(cleaned_data))
eigen_valvec$vectors
## PC1 PC2 PC3 PC4
## Age 0.028477552 0.99929943 -0.024018111 0.0035788596
## SibSp 0.002386349 -0.02093144 -0.773693322 0.6332099362
## Parch 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## Fare 0.999586200 -0.02837826 0.004609234 0.0009266652
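Eigenvalues: The four eigenvalues sum to about 3013.02, which equals the total variance (the sum of the diagonal of the variance-covariance matrix). PC1 (2802.56) accounts for about 93.0% of it and PC2 (209.04) for about 6.9%, while PC3 and PC4 together account for less than 0.1%.
Eigenvectors: PC1 is almost entirely Fare (0.9996) and PC2 is almost entirely Age (0.9993). PC3 and PC4 are combinations of SibSp and Parch. The overall sign of an eigenvector is arbitrary, so only the relative signs within a column are meaningful.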
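A common way to read the eigenvalues is as shares of the total variance. The sketch below copies the eigenvalues and the diagonal variances from the printed output (the names eigenvalues, variances, prop_var, and cum_var are hypothetical) and computes the proportion and cumulative proportion of variance for each component.

```r
# Eigenvalues copied from the printed output above.
eigenvalues <- c(PC1 = 2802.5636587, PC2 = 209.0385659,
                 PC3 = 0.9438783,    PC4 = 0.4787214)

# Variances copied from the diagonal of the variance-covariance matrix.
variances <- c(Age = 211.019125, SibSp = 0.8644973,
               Parch = 0.7281027, Fare = 2800.413100)

# The eigenvalues sum to the total variance (the trace of the matrix).
sum(eigenvalues)
sum(variances)

# Share of the total variance carried by each component, and running total.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

round(100 * rbind(proportion = prop_var, cumulative = cum_var), 2)
```

In the actual analysis, eigen_valvec$values could be used in place of the copied vector.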
This analysis applied basic multivariate statistics to explore the relationships and variability among Age, SibSp, Parch, and Fare in the Titanic Dataset. None of the correlations is strong. The clearest relationships involve age and the family-related variables: Age is negatively correlated with SibSp (-0.31) and Parch (-0.19), while SibSp and Parch are positively correlated with each other (0.38). This suggests that younger passengers were more likely to travel with family members and that the two family variables tend to increase together. Fare shows only weak positive linear relationships with the other variables (0.10 with Age, 0.14 with SibSp, and 0.21 with Parch), so ticket price is largely, though not completely, linearly independent of age and family size. Correlation captures only linear association, so this does not rule out other kinds of dependence.
In terms of variability, Fare has the largest variance, dominating the variance-covariance matrix and the first principal component (PC1). Age shows moderate variability and mainly contributes to the second principal component (PC2), while SibSp and Parch vary the least and contribute only to the minor components. This ordering largely reflects the different measurement scales: Fare and Age span wide numeric ranges, while SibSp and Parch are small counts of family members. Because the eigen decomposition was applied to the variance-covariance matrix rather than the correlation matrix, the components are driven by these scale differences.
The main findings are that the family-related variables are positively related to each other, age is negatively correlated with the family-related variables, and fare differences drive most of the overall variation, while the other variables contribute relatively little.
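Since the variance of Fare is thousands of times larger than that of SibSp or Parch, a decomposition of the variance-covariance matrix mostly reflects measurement scale. One common alternative, shown here as a sketch and not as part of the original analysis, is to decompose the correlation matrix instead, which is equivalent to standardizing each variable first. The matrix below is re-entered from the printed correlation output, and the names cor_mat, eig_cor, and loadings are hypothetical.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix re-entered from the printed output (rounded values).
cor_mat <- matrix(
  c( 1.0000000, -0.3082468, -0.1891193, 0.0960667,
    -0.3082468,  1.0000000,  0.3838199, 0.1383288,
    -0.1891193,  0.3838199,  1.0000000, 0.2051189,
     0.0960667,  0.1383288,  0.2051189, 1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# symmetric = TRUE makes eigen() use the symmetric algorithm explicitly.
eig_cor <- eigen(cor_mat, symmetric = TRUE)

# Eigenvalues of a correlation matrix sum to the number of variables.
sum(eig_cor$values)

# Share of the standardized variance carried by each component.
round(eig_cor$values / sum(eig_cor$values), 3)

# Label the eigenvectors the same way as in the main analysis.
loadings <- eig_cor$vectors
dimnames(loadings) <- list(vars, paste0("PC", seq_along(vars)))
round(loadings, 3)
```

Which matrix to decompose is a modeling choice: the covariance version shows where the raw variance lies, while the correlation version gives every variable equal weight regardless of its units.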
This analysis shows how multivariate statistical tools, namely the correlation matrix, the variance-covariance matrix, eigenvalues, and eigenvectors, can be used to understand the structure of multivariate data, in this case the Titanic Dataset. They identify both the relationships among the variables and the variability of each variable. The results illustrate that multivariate analysis provides a framework for summarizing datasets and drawing insights from them. This analysis is a foundational step towards more advanced techniques, for example, principal component analysis and dimensionality reduction.
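As a pointer toward that next step, the sketch below illustrates that the eigen decomposition used in this analysis is what principal component analysis computes. It uses simulated data invented for illustration (the names sim_data, x1, x2, and x3 are hypothetical), not the Titanic file, and compares eigen(cov()) with base R's prcomp().

```r
set.seed(42)  # reproducible simulated data, invented for illustration
n  <- 200
x1 <- rnorm(n, mean = 30, sd = 14)
x2 <- 0.5 * x1 + rnorm(n, sd = 5)
x3 <- rnorm(n, sd = 1)
sim_data <- data.frame(x1, x2, x3)

eig <- eigen(cov(sim_data))
pca <- prcomp(sim_data)  # centers by default; scaling is off by default

# Squared standard deviations of the PCs equal the covariance eigenvalues.
all.equal(eig$values, pca$sdev^2)

# Loadings agree up to the sign of each column, which is arbitrary.
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))

# scale. = TRUE standardizes first, matching eigen() on the correlation matrix.
pca_scaled <- prcomp(sim_data, scale. = TRUE)
all.equal(eigen(cor(sim_data))$values, pca_scaled$sdev^2)
```

Applied to cleaned_data, prcomp() without scaling would be expected to reproduce the eigenvalues reported above, up to numerical precision.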