RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, England, to New York City, United States. RMS Titanic was the largest ship afloat at the time she entered service.
The sinking of the Titanic is one of the most infamous shipwrecks in history. More than 1,500 of the estimated 2,224 passengers and crew aboard died, making it the deadliest sinking of a single ship up to that time.
The dataset used in this LBB Project of Programming for Data Science (with R) is provided by Kaggle.
Kaggle provides two similar datasets that include passenger information such as name, age, gender, and socio-economic class.
One dataset is titled train.csv and the other is titled test.csv.
The train.csv dataset contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information to train.csv but does not disclose the “ground truth” for each passenger.
For the purpose of this LBB Project, the dataset used for further analysis is the combination of the two datasets above, without the “ground truth” information.
Before we can perform further analysis on our dataset, the first step is to prepare the dataset itself.
We set the groundwork by importing the necessary packages.
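The import step itself is not shown in this report. As a minimal sketch, assuming dplyr is the only add-on package needed (the merge code further down uses its %>% pipe and select() function):

```r
# Assumption: dplyr is the only add-on package the code in this report relies on
# (it supplies both the %>% pipe and select()).
if (!requireNamespace("dplyr", quietly = TRUE)) {
  stop("Package 'dplyr' is required; install it with install.packages(\"dplyr\")")
}
library(dplyr)
```

The requireNamespace() guard is optional; it only turns a missing package into a clearer error message.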
Before we create the dataset, we will check the information contained in each of the original Kaggle datasets, train.csv and test.csv.
# Read the first dataset
train <- read.csv("data_input/train.csv")
# Check the structure of first dataset
str(train)
#> 'data.frame': 891 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
#> $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
This dataset contains information for a total of 891 passengers.
# Read the second dataset
test <- read.csv("data_input/test.csv")
# Check the structure of second dataset
str(test)
#> 'data.frame': 418 obs. of 11 variables:
#> $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
#> $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
#> $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
#> $ Sex : chr "male" "female" "male" "male" ...
#> $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
#> $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
#> $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
#> $ Ticket : chr "330911" "363272" "240276" "315154" ...
#> $ Fare : num 7.83 7 9.69 8.66 12.29 ...
#> $ Cabin : chr "" "" "" "" ...
#> $ Embarked : chr "Q" "S" "Q" "S" ...
This dataset contains information for a total of 418 passengers.
Based on our exploration of the two original Kaggle datasets above, the additional “ground truth” column in the Train Dataset is named Survived. To create the dataset for this LBB Project, we will merge the two datasets above without the column Survived.
Using the dplyr function select(), we will take all columns of the Train Dataset except Survived, then merge it with the Test Dataset and save the new dataset as titanic, which should contain information for a total of 1,309 passengers.
# Merge the two datasets and save it to a new dataframe object named titanic
titanic <- rbind(train %>% select(-Survived), # Select Train Dataset without column Survived
test)
# Check the structure of the new dataset
str(titanic)
#> 'data.frame': 1309 obs. of 11 variables:
#> $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
#> $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
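A few sanity checks can confirm that a merge of this kind behaved as intended. The following is a minimal sketch on two made-up data frames (train_toy and test_toy are hypothetical stand-ins, not the Kaggle data), using base R instead of dplyr so that it runs without extra packages:

```r
# Hypothetical stand-ins for train and test (made-up values)
train_toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Pclass      = c(3L, 1L, 3L)
)
test_toy <- data.frame(
  PassengerId = 4:5,
  Pclass      = c(3L, 2L)
)

# Drop the ground-truth column, then stack the rows of both data frames
keep_cols  <- setdiff(names(train_toy), "Survived")
merged_toy <- rbind(train_toy[, keep_cols], test_toy)

# Sanity checks: no rows lost, no ground truth left, no duplicated passengers
stopifnot(nrow(merged_toy) == nrow(train_toy) + nrow(test_toy))
stopifnot(!"Survived" %in% names(merged_toy))
stopifnot(anyDuplicated(merged_toy$PassengerId) == 0)
```

Applied to the real data, the same row-count check corresponds to 891 + 418 = 1,309 passengers.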
Data Description:
- PassengerId: Row number
- Pclass: Ticket class and a proxy for socio-economic status (1 = 1st class (Upper), 2 = 2nd class (Middle), 3 = 3rd class (Lower))
- Name: Name of the passenger
- Sex: Gender of the passenger (male / female)
- Age: Age of the passenger in years; it is fractional if less than 1. If the age is estimated, it is in the form xx.5
- SibSp: Number of siblings / spouses aboard the Titanic, with family relations as follows
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)
- Parch: Number of parents / children aboard the Titanic, with family relations as follows
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore Parch = 0 for them
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Based on the Data Description of our dataset titanic, there are several columns that we suspect to be categorical, such as Pclass, Sex, SibSp, Parch, and Embarked.
Before we convert those columns to factors with as.factor(), let us confirm the unique values of each of those columns.
Parch
#> [1] 0 1 2 5 3 4 6 9
There are 8 unique values in column
Parch.
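The output above covers only Parch. As a minimal sketch of how several candidate columns could be inspected in one pass, the following uses a small made-up data frame (toy and its values are hypothetical, not the Kaggle data):

```r
# Hypothetical toy data with three of the candidate columns
toy <- data.frame(
  Pclass = c(3L, 1L, 3L, 2L),
  Sex    = c("male", "female", "female", "male"),
  Parch  = c(0L, 0L, 2L, 1L)
)

candidate_cols <- c("Pclass", "Sex", "Parch")

# unique() per column, then the number of distinct values per column
unique_values <- lapply(toy[candidate_cols], unique)
n_unique      <- sapply(unique_values, length)

print(unique_values)
print(n_unique)
```

A small number of distinct values relative to the number of rows is what makes a column a reasonable candidate for a factor.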
Using base R, we will use explicit coercion to change the data type of columns Pclass, Sex, SibSp, Parch, and Embarked to factor.
# Explicit Coercion to change data type to as.factor()
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$SibSp <- as.factor(titanic$SibSp)
titanic$Parch <- as.factor(titanic$Parch)
titanic$Embarked <- as.factor(titanic$Embarked)
# Re-Check the structure of the dataset after explicit coercion
str(titanic)
#> 'data.frame': 1309 obs. of 11 variables:
#> $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
#> $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
#> $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
#> $ Parch : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
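The five assignments above can also be written as a single step. This is a minimal sketch of an equivalent base R form on a made-up data frame (toy and factor_cols are hypothetical names, not part of the original code):

```r
# Hypothetical toy data: two categorical columns and one numeric column
toy <- data.frame(
  Pclass = c(3L, 1L, 3L),
  Sex    = c("male", "female", "female"),
  Age    = c(22, 38, 26)
)

# Columns to convert; all other columns are left untouched
factor_cols <- c("Pclass", "Sex")

# Apply as.factor() to each listed column and assign the results back
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Keeping the column names in one vector makes it easier to add or remove a column from the conversion later.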
We will use descriptive statistics to describe our dataset in general before proceeding to the next step of our Exploratory Data Analysis (EDA).
#> PassengerId Pclass Name Sex Age SibSp
#> Min. : 1 1:323 Length:1309 female:466 Min. : 0.17 0:891
#> 1st Qu.: 328 2:277 Class :character male :843 1st Qu.:21.00 1:319
#> Median : 655 3:709 Mode :character Median :28.00 2: 42
#> Mean : 655 Mean :29.88 3: 20
#> 3rd Qu.: 982 3rd Qu.:39.00 4: 22
#> Max. :1309 Max. :80.00 5: 6
#> NA's :263 8: 9
#> Parch Ticket Fare Cabin
#> 0 :1002 Length:1309 Min. : 0.000 Length:1309
#> 1 : 170 Class :character 1st Qu.: 7.896 Class :character
#> 2 : 113 Mode :character Median : 14.454 Mode :character
#> 3 : 8 Mean : 33.295
#> 4 : 6 3rd Qu.: 31.275
#> 5 : 6 Max. :512.329
#> (Other): 4 NA's :1
#> Embarked
#> : 2
#> C:270
#> Q:123
#> S:914
#>
#>
#>
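The table above has the layout produced by summary() applied to a data frame, the function this report refers to later on. As a minimal sketch on a made-up data frame (toy is hypothetical), summary() reports level counts for factor columns and quartiles, mean, and the number of NA values for numeric columns:

```r
# Hypothetical toy data: one factor column and one numeric column with a missing value
toy <- data.frame(
  Sex = factor(c("male", "female", "female")),
  Age = c(22, NA, 26)
)

# Factor columns are summarized as counts per level,
# numeric columns as min / quartiles / mean / max plus an NA count
toy_summary <- summary(toy)
print(toy_summary)
```

This is why converting the categorical columns to factors first makes the summary more informative than it would be for plain character or integer columns.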
Several initial insights about our dataset titanic that we can summarize from the descriptive statistics are as follows:
- There are more male passengers (843) than female passengers (466) aboard the Titanic.
- Passengers with Lower socio-economic status (3rd class) make up the largest group aboard the Titanic (709 passengers).
- There are missing values in our dataset in columns Age and Fare.
Alternatively, we can use the is.na() function to re-check for missing values and combine it with the colSums() function to count the missing values in each column.
#> PassengerId Pclass Name Sex Age SibSp
#> 0 0 0 0 263 0
#> Parch Ticket Fare Cabin Embarked
#> 0 0 1 0 0
The result above matches the output of the summary() function, which reported 263 and 1 missing values in columns Age and Fare respectively.
Even though the number of missing values in column Age is quite significant, we will keep the dataset as it is for now.
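One caveat: is.na() only flags values stored as NA. The structure output earlier shows Cabin values of "" and an Embarked level "", and empty strings like these are not counted by the check above. A minimal sketch on a made-up data frame (toy and its values are hypothetical) counts both kinds of gaps:

```r
# Hypothetical toy data with NA values and empty strings
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Fare     = c(7.25, 71.28, NA, 8.05),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count values stored as NA in each column
na_counts <- colSums(is.na(toy))

# Count empty strings in each column; NA comparisons are dropped by na.rm = TRUE
empty_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_counts)
print(empty_counts)
```

Whether empty strings should be treated as missing is a judgment call that depends on the analysis; the sketch only makes them visible.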
1. What is the total number of male and female passengers in each socio-economic class?
We would like to know more about our passengers based on their sex (Sex) within each socio-economic class (Pclass).
#>
#> 1 2 3
#> female 144 106 216
#> male 179 171 493
The insight we can draw is that the largest group aboard the Titanic, with a total of 493 passengers, is male passengers of Lower socio-economic status, or 3rd class.
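The call that produced the cross-tabulation is not shown; its layout is consistent with base R's table() applied to two columns. A minimal sketch on a made-up data frame (toy is hypothetical, not the Kaggle data):

```r
# Hypothetical toy data
toy <- data.frame(
  Sex    = c("male", "female", "male", "male", "female"),
  Pclass = c(3, 1, 3, 2, 3)
)

# Two-way contingency table: rows are Sex, columns are Pclass
counts <- table(toy$Sex, toy$Pclass)

# addmargins() appends row and column totals for a quick consistency check
print(addmargins(counts))

# Locate the largest cell (first one in case of ties)
largest       <- which(counts == max(counts), arr.ind = TRUE)
largest_sex   <- rownames(counts)[largest[1, 1]]
largest_class <- colnames(counts)[largest[1, 2]]
```

The margins are a useful cross-check: in the real data, the row totals should equal the per-sex counts reported by summary().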
2. What is the total Fare paid for tickets in each socio-economic class (Pclass)?
#> Pclass
#> 1 2 3
#> 28265.404 5866.637 9418.445
The insight we can draw is that the largest ticket sales, totaling $28,265.40, came from the 323 passengers of Upper socio-economic status, or 1st class, onboard the Titanic.
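The call behind these totals is not shown either. One way to compute a per-class sum is tapply(), sketched here on a made-up data frame (toy is hypothetical). Because Fare contains a missing value in the real data, the sketch passes na.rm = TRUE so that a single NA does not turn a whole class total into NA:

```r
# Hypothetical toy data, including one missing fare
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 3, 3)),
  Fare   = c(71.28, 53.10, 13.00, 7.25, NA)
)

# Sum Fare within each Pclass level, ignoring missing fares
total_fare <- tapply(toy$Fare, toy$Pclass, sum, na.rm = TRUE)
print(total_fare)

# Class with the highest total
top_class <- names(total_fare)[which.max(total_fare)]
```

Dividing each total by the number of passengers in that class would give an average fare, which is often easier to compare across classes of different sizes.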
Titanic Dataset: (https://www.kaggle.com/competitions/titanic/)
Titanic Wikipedia: (https://en.wikipedia.org/wiki/Titanic)