This document was prepared for an assignment on learning the R programming language and publishing an analysis with RPubs. The dataset is the Titanic dataset obtained from Kaggle. The analysis focuses on four numerical variables: Age, SibSp, Parch, and Fare.
These variables are analyzed using a correlation matrix, a variance–covariance matrix, and an eigendecomposition of the variance–covariance matrix.
The tidyverse is loaded because it provides a collection of packages whose functions simplify data manipulation, including data selection, cleaning, and transformation, and because it makes the pipe operator (%>%) available.
# Load tidyverse library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The Titanic dataset is imported from a CSV file using the read.csv() function. The argument stringsAsFactors = FALSE prevents character variables from being automatically converted into factors; this has been the default since R 4.0.0, so the argument is written out here only to make the behavior explicit. The head() function displays the first six rows of the dataset to confirm that the data has been loaded correctly.
titanic <- read.csv("Titanic-Dataset.csv", stringsAsFactors = FALSE)
# Display the first 6 rows
head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
The analysis uses four numerical variables: Age, SibSp, Parch, and Fare. Rows with a missing value in any of these variables are removed so that every statistic is computed on the same set of complete observations; 714 rows remain after cleaning.
# Select relevant numerical variables
titanic_selected <- titanic %>%
  select(Age, SibSp, Parch, Fare)
# Remove rows with missing values
titanic_clean <- na.omit(titanic_selected)
# Check dimensions of cleaned dataset
dim(titanic_clean)
## [1] 714 4
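Before dropping rows it can help to see which columns actually contain missing values. The following is a minimal sketch on made-up data; the data frame `toy` and its values are hypothetical stand-ins for the four selected columns, not rows from the Titanic file.

```r
# Hypothetical toy data standing in for the four selected columns
toy <- data.frame(
  Age   = c(22, 38, NA, 35),
  SibSp = c(1, 1, 0, 1),
  Parch = c(0, 0, 0, NA),
  Fare  = c(7.25, 71.28, 8.46, 53.10)
)

# Count missing values per column before removing anything
missing_per_column <- colSums(is.na(toy))

# na.omit() drops every row that has at least one NA in any column
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(missing_per_column)
print(rows_dropped)
```

Running the same two checks on titanic_selected would show how many of the removed rows are due to each variable.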
The correlation matrix is used to examine linear relationships between numerical variables. Correlation values range from -1 to 1, where values close to 1 indicate strong positive relationships, values close to -1 indicate strong negative relationships, and values close to 0 indicate weak linear relationships.
# Compute correlation matrix
correlation_matrix <- cor(titanic_clean)
correlation_matrix
## Age SibSp Parch Fare
## Age 1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676 1.0000000 0.3838199 0.13832879
## Parch -0.18911926 0.3838199 1.0000000 0.20511888
## Fare 0.09606669 0.1383288 0.2051189 1.00000000
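Calling cor() on data cleaned with na.omit() amounts to listwise deletion. As a hedged illustration on made-up data (the data frame `toy` and its columns x, y, z are hypothetical), the sketch below shows that the `use` argument of cor() can give the same result without a separate cleaning step, and that pairwise deletion can give a different one.

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 1, 4, 3, 5),
  z = c(5, NA, 3, 2, 1)
)

# Listwise deletion, done in two equivalent ways
listwise_a <- cor(na.omit(toy))
listwise_b <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair of columns uses all rows where both are present
pairwise <- cor(toy, use = "pairwise.complete.obs")

print(all.equal(listwise_a, listwise_b))
print(round(listwise_a, 4))
print(round(pairwise, 4))
```

Listwise deletion, as used in this analysis, keeps every entry of the matrix based on the same observations.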
The variance–covariance matrix describes the variability of each variable and the relationships between them. Diagonal elements represent variances, while off-diagonal elements represent covariances.
# Compute variance-covariance matrix
covariance_matrix <- cov(titanic_clean)
covariance_matrix
## Age SibSp Parch Fare
## Age 211.019125 -4.1633339 -2.3441911 73.849030
## SibSp -4.163334 0.8644973 0.3045128 6.806212
## Parch -2.344191 0.3045128 0.7281027 9.262176
## Fare 73.849030 6.8062117 9.2621760 2800.413100
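The two matrices are linked: dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below retypes the rounded covariance values printed above (so the data file is not needed) and uses cov2cor() to recover the correlation matrix; small differences in the last digits are expected because the inputs are rounded.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values retyped from the printed output (rounded)
covariance_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Standard deviations are the square roots of the diagonal (the variances)
standard_deviations <- sqrt(diag(covariance_matrix))

# cov2cor() rescales a covariance matrix into a correlation matrix
correlation_from_cov <- cov2cor(covariance_matrix)

print(round(standard_deviations, 3))
print(round(correlation_from_cov, 4))
```

The result should agree with the output of cor(titanic_clean) to the precision of the retyped values.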
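Eigenvalues and eigenvectors are computed from the variance–covariance matrix to analyze the structure of data variability. Each eigenvalue is the amount of variance along one component, and the corresponding eigenvector gives the weights of the four variables on that component. Because the covariance matrix is not standardized, variables with large variances (here Fare, followed by Age) dominate the leading components.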
# Compute eigenvalues and eigenvectors
eigen_result <- eigen(covariance_matrix)
# Eigenvalues
eigen_result$values
## [1] 2802.5636587 209.0385659 0.9438783 0.4787214
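The eigenvalues are easier to interpret as shares of the total variance. A minimal sketch, using the four eigenvalues retyped from the output above:

```r
# Eigenvalues retyped from the printed output
eigenvalues <- c(2802.5636587, 209.0385659, 0.9438783, 0.4787214)

# Share of total variance carried by each component
proportion_explained <- eigenvalues / sum(eigenvalues)

# Running total across components
cumulative_proportion <- cumsum(proportion_explained)

print(round(proportion_explained, 4))
print(round(cumulative_proportion, 4))
```

With these values the first component carries roughly 93% of the total variance, and the first two together carry nearly all of it.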
# Eigenvectors
eigen_result$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.028477552 0.99929943 -0.024018111 0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322 0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826 0.004609234 0.0009266652
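eigen() returns the eigenvector matrix without row or column names, so the rows above have to be read in the order Age, SibSp, Parch, Fare. The sketch below rebuilds the covariance matrix from the rounded values printed earlier, labels the eigenvectors (the column labels PC1 to PC4 are chosen here for illustration only), and checks the defining relation that the covariance matrix times each eigenvector equals the eigenvalue times that eigenvector. The sign of each column is arbitrary and may differ from the printed output.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Covariance values retyped from the printed output (rounded, kept symmetric)
covariance_matrix <- matrix(
  c(211.019125, -4.1633339, -2.3441911,   73.849030,
     -4.1633339,  0.8644973,  0.3045128,    6.8062117,
     -2.3441911,  0.3045128,  0.7281027,    9.2621760,
     73.849030,   6.8062117,  9.2621760, 2800.413100),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

eigen_result <- eigen(covariance_matrix, symmetric = TRUE)

# Attach variable names to rows; column labels are illustrative
loadings <- eigen_result$vectors
dimnames(loadings) <- list(vars, paste0("PC", seq_along(vars)))

# Check S %*% V = V %*% diag(lambda); the residual should be near zero
residual <- covariance_matrix %*% loadings -
  loadings %*% diag(eigen_result$values)
max_residual <- max(abs(residual))

print(round(loadings, 4))
print(max_residual)
```

Read with labels, the first component is almost entirely Fare and the second almost entirely Age, which matches the two largest variances on the diagonal of the covariance matrix.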