Introduction

This document was prepared for an assignment on learning the R programming language and publishing an analysis on RPubs. The dataset is the Titanic dataset obtained from Kaggle. The analysis focuses on four numerical variables: Age, SibSp, Parch, and Fare.

These variables are analyzed using a correlation matrix, a variance–covariance matrix, and an eigendecomposition of the variance–covariance matrix.


Import Library

The tidyverse is a collection of packages that simplify data manipulation, including data selection, cleaning, and transformation. It is loaded here for the select() function from dplyr and for the pipe operator (%>%), which passes the result of one step to the next.

# Load tidyverse library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import Dataset

The Titanic dataset is imported from a CSV file using the read.csv() function. The argument stringsAsFactors = FALSE prevents character variables from being automatically converted into factors; since R 4.0.0 this is already the default, so the argument only matters on older versions of R. The head() function displays the first six rows of the dataset to confirm that the data has been loaded correctly.

titanic <- read.csv("Titanic-Dataset.csv", stringsAsFactors = FALSE)

# Display the first 6 rows
head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
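The effect of stringsAsFactors can be seen on a small example. The sketch below is only an illustration: the inline CSV text and the object names (csv_text, as_chr, as_fct) are made up for the example and are not part of the Titanic file.

```r
# Hypothetical three-row CSV, supplied as text instead of a file
csv_text <- "Name,Sex,Age\nBraund,male,22\nCumings,female,38\nMoran,male,NA"

# Character columns stay character
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Character columns are converted to factors
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Sex)
class(as_fct$Sex)

# The literal NA in the Age column is read as a missing value
is.na(as_chr$Age)
```

Keeping text columns as character is usually the safer starting point, because a column can still be converted with factor() later if a categorical variable is needed.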

Variable Selection and Data Cleaning

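This analysis focuses on four numerical variables: Age, SibSp, Parch, and Fare. Some passengers have no recorded Age (for example, row 6 in the output above), and by default cor() and cov() return NA for any pair of variables that contains a missing value. Rows with missing values are therefore removed with na.omit() before the matrices are computed, which leaves 714 complete rows.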

# Select relevant numerical variables
titanic_selected <- titanic %>%
  select(Age, SibSp, Parch, Fare)

# Remove rows with missing values
titanic_clean <- na.omit(titanic_selected)

# Check dimensions of cleaned dataset
dim(titanic_clean)
## [1] 714   4
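Before dropping rows it can help to see where the missing values are. The following is a minimal sketch on a made-up data frame (toy is hypothetical and only mimics the four selected columns); it counts missing values per column and checks that na.omit() followed by cor() gives the same result as cor() with use = "complete.obs".

```r
# Hypothetical data with one missing Age, mimicking the selected columns
toy <- data.frame(
  Age   = c(22, 38, NA, 35, 54, 2),
  SibSp = c(1, 1, 0, 1, 0, 3),
  Parch = c(0, 0, 0, 0, 0, 1),
  Fare  = c(7.25, 71.28, 8.46, 53.10, 51.86, 21.08)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Listwise deletion: drop every row that has at least one NA
toy_clean <- na.omit(toy)
dim(toy_clean)

# cor() can perform the same listwise deletion internally
same_result <- all.equal(cor(toy_clean), cor(toy, use = "complete.obs"))
same_result
```

Removing the rows once with na.omit() has the advantage that every later step (correlation, covariance, eigendecomposition) works on exactly the same set of observations.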

Correlation Matrix

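The correlation matrix is used to examine linear relationships between numerical variables. By default, cor() computes the Pearson correlation coefficient. Correlation values range from -1 to 1, where values close to 1 indicate strong positive relationships, values close to -1 indicate strong negative relationships, and values close to 0 indicate weak linear relationships. In this dataset all relationships are weak to moderate: the strongest is between SibSp and Parch (0.38), and Age is negatively correlated with SibSp (-0.31).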

# Compute correlation matrix
correlation_matrix <- cor(titanic_clean)
correlation_matrix
##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.3082468 -0.1891193 0.09606669
## SibSp -0.30824676  1.0000000  0.3838199 0.13832879
## Parch -0.18911926  0.3838199  1.0000000 0.20511888
## Fare   0.09606669  0.1383288  0.2051189 1.00000000
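The strongest relationship can also be located programmatically rather than by eye. This is a small sketch, not part of the original analysis: the matrix r is typed in from the output above, and the names r_upper, idx, and strongest are illustrative.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Correlation matrix copied from the output above
r <- matrix(c(
   1.00000000, -0.3082468, -0.1891193, 0.09606669,
  -0.30824676,  1.0000000,  0.3838199, 0.13832879,
  -0.18911926,  0.3838199,  1.0000000, 0.20511888,
   0.09606669,  0.1383288,  0.2051189, 1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by blanking the diagonal and the lower triangle
r_upper <- r
r_upper[lower.tri(r_upper, diag = TRUE)] <- NA

# Position of the largest absolute correlation among the remaining pairs
idx <- which(abs(r_upper) == max(abs(r_upper), na.rm = TRUE), arr.ind = TRUE)

# Names of the two variables in that pair
strongest <- c(rownames(r)[idx[1, "row"]], colnames(r)[idx[1, "col"]])
strongest
```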

Variance–Covariance Matrix

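The variance–covariance matrix describes the variability of each variable and the relationships between them. Diagonal elements represent variances, while off-diagonal elements represent covariances. Unlike correlations, these values depend on the units of measurement, so a variable with a wide range such as Fare (variance of about 2800) has far larger entries than SibSp or Parch (variances below 1).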

# Compute variance-covariance matrix
covariance_matrix <- cov(titanic_clean)
covariance_matrix
##              Age      SibSp      Parch        Fare
## Age   211.019125 -4.1633339 -2.3441911   73.849030
## SibSp  -4.163334  0.8644973  0.3045128    6.806212
## Parch  -2.344191  0.3045128  0.7281027    9.262176
## Fare   73.849030  6.8062117  9.2621760 2800.413100
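The two matrices are directly related: dividing each covariance by the product of the two standard deviations gives the corresponding correlation. The sketch below checks this on the values printed above. The matrix S is typed in from that output, and the names S, sds, r_manual, and r_builtin are illustrative.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the output above
S <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339, 0.8644973,  0.3045128,    6.8062117,
   -2.3441911, 0.3045128,  0.7281027,    9.2621760,
   73.849030,  6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Standard deviations are the square roots of the diagonal
sds <- sqrt(diag(S))

# Manual rescaling: r_ij = s_ij / (sd_i * sd_j)
r_manual <- S / outer(sds, sds)

# Base R helper that performs the same rescaling
r_builtin <- cov2cor(S)

round(r_manual, 4)
all.equal(r_manual, r_builtin)
```

The rescaled values agree with the correlation matrix shown earlier, for example about -0.308 for Age and SibSp.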

Eigenvalues and Eigenvectors

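Eigenvalues and eigenvectors are computed from the variance–covariance matrix to analyze the structure of data variability. Each eigenvalue is the amount of variance along one component, and the corresponding eigenvector (a column of eigen_result$vectors) gives the weight of each variable in that component; the rows follow the variable order Age, SibSp, Parch, Fare. In the results below, the first component is almost entirely Fare (weight 0.9996) and accounts for about 93% of the total variance, and the second is almost entirely Age (weight 0.9993) and accounts for about 7%. This reflects the much larger variances of Fare and Age, since the covariance matrix is not standardized.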

# Compute eigenvalues and eigenvectors
eigen_result <- eigen(covariance_matrix)

# Eigenvalues
eigen_result$values
## [1] 2802.5636587  209.0385659    0.9438783    0.4787214
# Eigenvectors
eigen_result$vectors
##             [,1]        [,2]         [,3]          [,4]
## [1,] 0.028477552  0.99929943 -0.024018111  0.0035788596
## [2,] 0.002386349 -0.02093144 -0.773693322  0.6332099362
## [3,] 0.003280818 -0.01253786 -0.633088089 -0.7739712590
## [4,] 0.999586200 -0.02837826  0.004609234  0.0009266652
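The eigenvalues are easier to read as shares of the total variance, and the eigenvectors are easier to read with variable names attached. The following is a minimal sketch under the assumption that the covariance matrix is re-entered from the output above; the names S, eig, loadings, prop_var, and cum_var are illustrative and do not appear in the original analysis.

```r
vars <- c("Age", "SibSp", "Parch", "Fare")

# Variance-covariance matrix copied from the output above
S <- matrix(c(
  211.019125, -4.1633339, -2.3441911,   73.849030,
   -4.1633339, 0.8644973,  0.3045128,    6.8062117,
   -2.3441911, 0.3045128,  0.7281027,    9.2621760,
   73.849030,  6.8062117,  9.2621760, 2800.413100
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

eig <- eigen(S)

# Share of total variance carried by each component, and the running total
prop_var <- eig$values / sum(eig$values)
cum_var  <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)

# Attach variable and component names to the eigenvectors
loadings <- eig$vectors
dimnames(loadings) <- list(vars, paste0("PC", 1:4))
round(loadings, 3)

# Variable with the largest absolute weight in each component
dominant <- vars[apply(abs(loadings), 2, which.max)]
dominant
```

The sum of the eigenvalues equals the sum of the variances on the diagonal of the covariance matrix, which is why the shares add up to 1. The sign of an eigenvector is arbitrary, so only the absolute size of the weights should be interpreted.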