The Titanic dataset records, for each listed passenger, whether they did or did not survive. The data is split into two datasets: the training data includes the survival outcome, and the test data omits it. I am going to use the Titanic dataset from https://www.kaggle.com/competitions/titanic/data, specifically the train.csv file.
- PassengerId: The ID of each passenger.
- Survived: Survival status, where 0 means did not survive and 1 means survived.
- Pclass: Passenger class, one of three classes (1, 2, or 3).
- Name: Name of the passenger.
- Sex: Sex of the passenger.
- Age: Age of the passenger.
- SibSp: Number of siblings and spouses travelling with the passenger.
- Parch: Number of parents and children travelling with the passenger.
- Ticket: Ticket number of the passenger.
- Fare: Fare paid for the ticket.
- Cabin: Cabin of the passenger.
- Embarked: Port of embarkation.

library(caret)
library(dplyr)
library(gtools)
library(GGally)
titanic <- read.csv("titanic/train.csv")
head(titanic)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
tail(titanic)
## PassengerId Survived Pclass Name Sex
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 886 39 0 5 382652 29.125 Q
## 887 27 0 0 211536 13.000 S
## 888 19 0 0 112053 30.000 B42 S
## 889 NA 1 2 W./C. 6607 23.450 S
## 890 26 0 0 111369 30.000 C148 C
## 891 32 0 0 370376 7.750 Q
dim(titanic)
## [1] 891 12
colnames(titanic)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
The train.csv file has 891 rows and 12 columns. Each row describes one
passenger: ID, survival status, class, name, sex, age, number of
siblings/spouses aboard, number of parents/children aboard, ticket
number, fare, cabin, and port of embarkation. Unlike the test dataset,
the training dataset includes the survival outcome, so this analysis
focuses on train.csv.
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The glimpse() output shows several variables stored as integers or
characters that are better treated as factors: Survived,
Pclass, Sex, SibSp,
Parch, and Embarked. SibSp and Parch are counts, but they take only a
few distinct values, so here they are treated as categorical as well.
Converting these variables to factors lets us handle them as
categorical features during exploratory data analysis (EDA), so that
summaries report counts per level and their distributions and
relationships are easier to read.
titanic <- titanic %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         SibSp = as.factor(SibSp),
         Parch = as.factor(Parch),
         Embarked = as.factor(Embarked))
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <fct> male, female, female, female, male, male, male, male, fema…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
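The same conversion can also be written in base R without naming each column twice. The following is a minimal sketch on a toy data frame, not the real train.csv; the names `toy` and `cat_cols` and all values are made up for illustration.

```r
# Toy stand-in for a few Titanic columns (values are illustrative, not real rows).
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  Sex      = c("male", "female", "female", "male"),
  Fare     = c(7.25, 71.28, 7.92, 13.00),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical.
cat_cols <- c("Survived", "Pclass", "Sex")

# Convert all of them in one step; other columns are left untouched.
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Check the resulting column classes.
sapply(toy, class)

# Once the columns are factors, table() gives counts per level.
table(toy$Sex, toy$Survived)
```

Keeping the column names in one vector makes it easy to add or remove a variable from the conversion later.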
The PassengerId variable is only a row identifier. It carries no
information about a passenger and is not a useful predictor for our
analysis. We therefore remove the PassengerId variable from the dataset
to streamline the analysis, as doing so is unlikely to affect our
results.
titanic <- titanic %>%
  select(-PassengerId)
glimpse(titanic)
## Rows: 891
## Columns: 11
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Flore…
## $ Sex <fct> male, female, female, female, male, male, male, male, female,…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37345…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
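Dropping an ID column is safe only if it really is just a row label. A check of this kind could be run before the column is removed. This is a small sketch on toy values; `is_row_identifier` is a hypothetical helper introduced here, not part of any package.

```r
# Hypothetical helper: TRUE if `id` has no missing values and no repeated values.
is_row_identifier <- function(id) {
  !anyNA(id) && anyDuplicated(id) == 0
}

toy_ids <- c(1L, 2L, 3L, 4L)
is_row_identifier(toy_ids)          # unique and complete
is_row_identifier(c(1L, 2L, 2L))    # contains a repeated value
is_row_identifier(c(1L, NA, 3L))    # contains a missing value
```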
anyNA(titanic)
## [1] TRUE
The dataset contains missing values (NA), which can significantly impact the quality and reliability of data analysis. To ensure accurate and reliable insights from the dataset, it is essential to address these missing values through imputation or data cleaning techniques. Next, we will check which variables contain missing values to determine the scope of data cleaning and imputation needed.
colSums(is.na(titanic))
## Survived Pclass Name Sex Age SibSp Parch Ticket
## 0 0 0 0 177 0 0 0
## Fare Cabin Embarked
## 0 0 0
The Age column has 177 missing values. No other column contains NA.
There are two treatments that can be used:
- Remove the whole Age column, so that no rows are lost.
- Remove the rows with missing values in the Age column, so that Age
can still be used in the analysis.
titanic_no_age <- titanic %>%
  select(-Age) # Age column removed, all rows kept
titanic_clean <- na.omit(titanic) # Only rows with missing values removed
colSums(is.na(titanic_no_age))
## Survived Pclass Name Sex SibSp Parch Ticket Fare
## 0 0 0 0 0 0 0 0
## Cabin Embarked
## 0 0
colSums(is.na(titanic_clean))
## Survived Pclass Name Sex Age SibSp Parch Ticket
## 0 0 0 0 0 0 0 0
## Fare Cabin Embarked
## 0 0 0
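A third option, the imputation mentioned earlier, keeps both the Age column and every row. The sketch below uses median imputation on toy values. The choice of the median is an assumption made for illustration and is not a method used in this analysis; the variable names are also made up.

```r
# Toy ages with gaps (illustrative values, not taken from train.csv).
age <- c(22, 38, NA, 35, NA, 54)

# Keep a flag so imputed values can still be identified later.
age_was_missing <- is.na(age)

# Replace each NA with the median of the observed ages.
age_median  <- median(age, na.rm = TRUE)
age_imputed <- ifelse(age_was_missing, age_median, age)

age_imputed
```

Imputing a single value for all gaps shrinks the spread of Age, which is one reason to keep the `age_was_missing` flag alongside the filled-in column.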
glimpse(titanic_no_age)
## Rows: 891
## Columns: 10
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Flore…
## $ Sex <fct> male, female, female, female, male, male, male, male, female,…
## $ SibSp <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <fct> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37345…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
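One caveat on the missing-value check: is.na() only counts true NA values, while the glimpse() output shows Cabin entries stored as empty strings (""). Those blanks are not reported by colSums(is.na(...)). A minimal sketch of how they could be counted, using a toy data frame with made-up values:

```r
# Toy columns mimicking how blank Cabin and Embarked entries look after read.csv():
# blanks are empty strings, not NA (values are illustrative).
toy <- data.frame(
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# is.na() sees nothing missing here.
colSums(is.na(toy))

# Counting empty strings reveals the blanks.
blank_counts <- colSums(toy == "")
blank_counts

# Alternative: convert blanks to NA so the usual NA tools apply.
toy_na <- toy
toy_na[toy_na == ""] <- NA
colSums(is.na(toy_na))
```

When reading the file, passing `na.strings = c("", "NA")` to read.csv() has the same effect at load time.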
Brief Overview of the data
summary(titanic_no_age)
## Survived Pclass Name Sex SibSp Parch
## 0:549 1:216 Length:891 female:314 0:608 0:678
## 1:342 2:184 Class :character male :577 1:209 1:118
## 3:491 Mode :character 2: 28 2: 80
## 3: 16 3: 5
## 4: 18 4: 4
## 5: 5 5: 5
## 8: 7 6: 1
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 : 2
## Class :character 1st Qu.: 7.91 Class :character C:168
## Mode :character Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
summary(titanic_clean)
## Survived Pclass Name Sex Age SibSp
## 0:424 1:186 Length:714 female:261 Min. : 0.42 0:471
## 1:290 2:173 Class :character male :453 1st Qu.:20.12 1:183
## 3:355 Mode :character Median :28.00 2: 25
## Mean :29.70 3: 12
## 3rd Qu.:38.00 4: 18
## Max. :80.00 5: 5
## 8: 0
## Parch Ticket Fare Cabin Embarked
## 0:521 Length:714 Min. : 0.00 Length:714 : 2
## 1:110 Class :character 1st Qu.: 8.05 Class :character C:130
## 2: 68 Mode :character Median : 15.74 Mode :character Q: 28
## 3: 5 Mean : 34.69 S:554
## 4: 4 3rd Qu.: 33.38
## 5: 5 Max. :512.33
## 6: 1
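In the summary of titanic_clean, SibSp level 8 is listed with a count of 0: na.omit() removed every row with that value, but the factor still remembers the level. If empty levels get in the way of later tables or plots, they can be dropped. A small sketch on toy data (names and values are made up):

```r
# Toy factor standing in for SibSp; level "8" occurs only in a row with missing Age.
toy <- data.frame(
  SibSp = factor(c(0, 1, 0, 8)),
  Age   = c(22, 38, 26, NA)
)

toy_clean <- na.omit(toy)

# The unused level "8" is still listed, with a count of 0.
table(toy_clean$SibSp)

# droplevels() removes levels that no longer occur in the data.
toy_clean <- droplevels(toy_clean)
table(toy_clean$SibSp)
```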
hist(titanic_clean$Fare, breaks=20)
The Fare histogram is strongly right-skewed: as the
Fare (ticket price) value increases, the frequency of
passengers paying it decreases. Most passengers bought tickets at
lower prices, while tickets with higher prices were much less
common.
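The skew described here can also be put into numbers. The sketch below compares mean and median and plots the fares on a log scale. The fare values are made up to be right-skewed, and using log1p() is an assumption chosen for illustration, not a step from this analysis.

```r
# Illustrative right-skewed fares (made-up values, not the real Fare column).
fare <- c(7.25, 7.90, 8.05, 8.46, 13.00, 15.74, 26.00, 33.38, 83.16, 512.33)

# In a right-skewed distribution the mean sits above the median.
mean(fare)
median(fare)

# Share of passengers paying less than the mean fare.
share_below_mean <- mean(fare < mean(fare))
share_below_mean

# log1p() compresses the long right tail and copes with fares of 0.
log_fare <- log1p(fare)
hist(log_fare, breaks = 20, main = "log(1 + Fare)", xlab = "log(1 + Fare)")
```

The summary of titanic_clean shows the same pattern for the real column: a mean fare of 34.69 against a median of 15.74.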
boxplot(titanic_clean$Fare)
According to the boxplot, the “Fare” variable has a noticeable number
of outliers above the upper whisker. The dataset does not record why
these fares are so high. Explanations such as tickets resold at
exorbitant prices or acquired through auctions cannot be checked with
these columns. A simpler explanation that can be checked against
Pclass is that the highest fares belong to first-class passengers.
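The points a boxplot draws individually are the values beyond 1.5 times the interquartile range from the box. A minimal sketch of how they could be listed, on made-up fares rather than the real column:

```r
# Illustrative fares (made-up values, not the real Fare column).
fare <- c(7.25, 7.90, 8.05, 8.46, 13.00, 15.74, 26.00, 33.38, 83.16, 512.33)

# Upper fence of the 1.5 * IQR rule, based on quantile().
q <- quantile(fare, probs = c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Values beyond the fence.
fare[fare > upper_fence]

# boxplot.stats() returns the points that boxplot() would draw individually.
# It uses hinges rather than quantile(), so its fence can differ slightly.
boxplot.stats(fare)$out
```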
hist(titanic_clean$Age, breaks=20)
The Age histogram shows how passenger ages were distributed on the
Titanic. Passengers under the age of 20, including infants and
children, are relatively infrequent. The frequency peaks between 20
and 40 years old, the range that holds the largest share of
passengers, and declines gradually beyond 40. The histogram shape
suggests a right-skewed distribution: its long right tail shows that
older passengers were less common on the Titanic’s voyage.
boxplot(titanic_clean$Age)
Based on the boxplot and the summary statistics:
- Outliers appear at the upper end of the age distribution, above roughly the age of 60. These are the oldest passengers in the dataset.
- The mean age of passengers with a recorded age is approximately 29.7 years, and the median is 28.
- The lowest recorded age is 0.42, so there was at least one infant on board the Titanic during the voyage.
These insights into the age distribution provide a preliminary understanding of the passengers’ demographics, including the presence of older individuals and the inclusion of infants on the ship. Further analysis can explore how age might have influenced survival rates or other aspects of the Titanic tragedy.
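One way to start that further analysis is to bin ages into groups and compare the share of survivors per group. The sketch below uses a toy data frame with made-up values, and the break points are an arbitrary choice for illustration.

```r
# Toy data (illustrative values only, not rows from train.csv).
toy <- data.frame(
  Age      = c(0.42, 4, 17, 22, 28, 35, 38, 54, 66, 80),
  Survived = c(1,    1, 0,  0,  1,  1,  0,  0,  0,  1)
)

# Bin ages into groups; intervals are right-closed, e.g. (20, 40].
toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(0, 12, 20, 40, 60, Inf),
  labels = c("child", "teen", "20-40", "40-60", "60+")
)

# Passengers per group.
table(toy$AgeGroup)

# Share of survivors per group (mean of a 0/1 indicator).
survival_by_group <- tapply(toy$Survived, toy$AgeGroup, mean)
survival_by_group
```

In this analysis Survived has been converted to a factor, so on the real data the indicator would be something like `titanic_clean$Survived == "1"` rather than the column itself.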
plot(x = titanic_clean$Age, y = titanic_clean$Fare,
main="Scatter Plot Age vs. Fare",
xlab="Age",
ylab="Fare"
)
The
Age vs. Fare scatter plot shows no
clear linear relationship between age (Age) and the fare
(Fare) paid by passengers. Most points sit at low fares across the
whole age range, with a few high fares scattered among them. The plot
therefore gives no sign that age by itself determines the ticket
fare.
correlation <- cor(titanic_clean$Age, titanic_clean$Fare)
correlation
## [1] 0.09606669
The correlation coefficient between Age and
Fare is approximately 0.0961, which suggests a
very weak positive relationship between the two variables. This
indicates that as a passenger’s age increases, there is a slight
tendency for their fare to also increase, but the relationship is not
strong.
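Because Fare contains extreme values, a rank-based coefficient and a significance test can complement the Pearson coefficient. The following is a minimal sketch on made-up ages and fares. Using Spearman and cor.test() here is an assumption for illustration and was not part of this analysis.

```r
# Toy ages and fares (illustrative values only).
age  <- c(22, 38, 26, 35, 54, 2, 27, 14, 58, 40)
fare <- c(7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07, 26.55, 8.05)

# Pearson measures linear association and is sensitive to extreme fares.
pearson <- cor(age, fare, method = "pearson")

# Spearman works on ranks, so it is less affected by outliers.
spearman <- cor(age, fare, method = "spearman")

# cor.test() adds a p-value and, for Pearson, a confidence interval.
test_result <- cor.test(age, fare, method = "pearson")
test_result$estimate
test_result$conf.int
test_result$p.value
```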
In this analysis, we worked with the train.csv file. To prepare the
data for analysis and modeling, we transformed several variables into
factors. These variables include Survived, Pclass, Sex,
SibSp, Parch, and Embarked. This
transformation allows us to utilize these attributes effectively when
building models.