1 Intro

1.1 Titanic

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

This Titanic dataset is taken from Kaggle.

1.2 Brief

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)

2 Get the Data

titanic <- readr::read_csv('train.csv')
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(titanic)
## [1] 891  12

Our Titanic dataset contains 891 rows and 12 columns.

head(titanic)
## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
## # … with 1 more variable: Embarked <chr>

Column explanation

  • PassengerId = Passenger identity number
  • Survived = Survival ; 0 = No, 1 = Yes
  • Pclass = Ticket Class ; 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name = Name of passenger
  • Sex = Sex
  • Age = Age in years
  • SibSp = Siblings / spouses aboard the Titanic
  • Parch = Parents / children aboard the Titanic
  • Ticket = Ticket number
  • Fare = Passenger fare
  • Cabin = Cabin number
  • Embarked = Port of Embarkation ; C = Cherbourg, Q = Queenstown, S = Southampton

3 Data Cleansing

3.1 Check the Data Type

str(titanic)
## spec_tbl_df [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr [1:891] "male" "female" "female" "female" ...
##  $ Age        : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr [1:891] NA "C85" NA "C123" ...
##  $ Embarked   : chr [1:891] "S" "C" "S" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PassengerId = col_double(),
##   ..   Survived = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   SibSp = col_double(),
##   ..   Parch = col_double(),
##   ..   Ticket = col_character(),
##   ..   Fare = col_double(),
##   ..   Cabin = col_character(),
##   ..   Embarked = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

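From the result above, we can see that several columns are stored with an unsuitable data type: categorical variables such as Survived, Pclass, and Sex were read as numbers or character strings. We convert them to factors.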

titanic[,c("Survived", "Pclass", "Sex", "SibSp", "Parch", "Cabin", "Embarked")] <- 
  lapply(titanic[,c("Survived", "Pclass", "Sex", "SibSp", "Parch", "Cabin", "Embarked")],
         as.factor)

str(titanic)
## spec_tbl_df [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Ticket     : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PassengerId = col_double(),
##   ..   Survived = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   SibSp = col_double(),
##   ..   Parch = col_double(),
##   ..   Ticket = col_character(),
##   ..   Fare = col_double(),
##   ..   Cabin = col_character(),
##   ..   Embarked = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Each of the selected columns has now been converted to the desired data type.
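The same conversion can also be written with dplyr, which is already loaded. The following is only a sketch on a small made-up data frame (`toy` and `factor_cols` are illustrative names, not part of the original analysis); it assumes dplyr 1.0.0 or later, where `across()` is available.

```r
library(dplyr)

# Made-up rows standing in for a few Titanic columns (not real data)
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, 35),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categorical
factor_cols <- c("Survived", "Pclass", "Sex")

# Convert only the listed columns; Age stays numeric
toy <- toy %>%
  mutate(across(all_of(factor_cols), as.factor))

sapply(toy, class)
```

Listing the column names once in `factor_cols` avoids repeating the vector on both sides of the assignment.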

Next, we check whether the data contains missing values.

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Three columns contain missing values:

  • Age : 177 missing values
  • Cabin : 687 missing values
  • Embarked : 2 missing values
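Counts are easier to judge next to the share of rows they represent (here 177/891 is about 19.9%, 687/891 about 77.1%, and 2/891 about 0.2%). As a minimal sketch on made-up data (`toy` is an illustrative data frame), `colMeans()` on the same logical matrix gives that share directly:

```r
# Made-up data with a few gaps (not the real Titanic rows)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Embarked = c("S", "C", NA, "S"),
  Fare     = c(7.25, 71.3, 7.9, 53.1),
  stringsAsFactors = FALSE
)

na_count <- colSums(is.na(toy))          # number of missing values per column
na_share <- colMeans(is.na(toy)) * 100   # percentage of rows that are missing

data.frame(na_count, na_share)
```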

3.2 Handle Missing Value

3.2.1 Replace NA with Mean

For the Age column, we replace each missing value with the mean of the observed ages.

titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE) 
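Mean imputation has a known side effect: the column mean stays the same, but the spread shrinks because every filled value sits exactly at the centre. The sketch below illustrates this on a made-up vector (`age` and the derived names are illustrative); the median variant at the end is shown only as a common alternative, not as what this analysis uses.

```r
# Made-up ages with two gaps
age <- c(22, 38, NA, 35, NA, 54)

age_mean   <- mean(age, na.rm = TRUE)
age_filled <- ifelse(is.na(age), age_mean, age)

mean(age_filled)        # equal to age_mean
sd(age, na.rm = TRUE)   # spread of the observed values
sd(age_filled)          # smaller: imputed values add no variation

# Alternative: fill with the median, which is less sensitive to outliers
age_median_filled <- ifelse(is.na(age), median(age, na.rm = TRUE), age)
```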

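3.2.2 Change NA into String

For the Cabin column, we replace missing values with the label “Unknown”. Because Cabin is now a factor and “Unknown” is not one of its levels, we convert the column to character first, fill the gaps, and then convert it back to a factor.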

titanic$Cabin <- as.character(titanic$Cabin)
titanic$Cabin[is.na(titanic$Cabin)] <- "Unknown"

titanic$Cabin <- as.factor(titanic$Cabin)
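The round trip through character can be avoided by turning NA into a factor level and renaming that level. This is a base R sketch on a made-up factor (`cabin` is an illustrative name), offered as an alternative rather than as the method used above:

```r
# Made-up cabin values with two gaps
cabin <- factor(c("C85", NA, "C123", NA))

# Make NA an explicit level, then rename that level
cabin <- addNA(cabin)
levels(cabin)[is.na(levels(cabin))] <- "Unknown"

levels(cabin)
table(cabin)
```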

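3.2.3 Remove NA Rows

For the Embarked column, we remove the rows that contain missing values, because only two rows are affected. Since Age and Cabin have already been filled, Embarked is the only column that still contains NA, so na.omit() drops exactly those two rows.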

titanic <- na.omit(titanic)
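Note that `na.omit()` drops a row if any column is missing, so it only targets Embarked here because the other columns were filled first. A sketch of the difference on made-up data (`toy` and its `Deck` column are hypothetical) shows how to restrict the filter to one column:

```r
# Made-up data: one gap in Embarked, one gap in another column
toy <- data.frame(
  Embarked = c("S", NA, "C", "Q"),
  Deck     = c("A", "B", NA, "C"),
  stringsAsFactors = FALSE
)

all_cols      <- na.omit(toy)                 # drops rows with NA in any column
embarked_only <- toy[!is.na(toy$Embarked), ]  # drops rows with NA in Embarked only

nrow(all_cols)
nrow(embarked_only)
```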

We confirm that no missing values remain in the data.

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

The Titanic dataset is now ready to be processed and analyzed.

4 Data Information

4.1 Passenger

Total Passengers from Each Port of Embarkation

titanic$Embarked <- recode(titanic$Embarked,
                           "C" = "Cherbourg", "Q" = "Queenstown", "S" = "Southampton")
x <- data.frame(xtabs(~ titanic$Embarked))
x
##   titanic.Embarked Freq
## 1        Cherbourg  168
## 2       Queenstown   77
## 3      Southampton  644

Total Passengers in Each Class

table(titanic$Pclass)
## 
##   1   2   3 
## 214 184 491
pie(table(titanic$Pclass))

Average Age of Passengers in Each Class by Gender

aggregate(Age~Pclass+Sex,titanic, mean)
##   Pclass    Sex      Age
## 1      1 female 33.79665
## 2      2 female 28.74866
## 3      3 female 24.06849
## 4      1   male 39.28772
## 5      2   male 30.65391
## 6      3   male 27.37215

Percentage of Women and Men

  • Women and men who survived, as a percentage of all passengers
xtabs(~ Sex, titanic[titanic$Survived == 1,])['female']/nrow(titanic)*100
##   female 
## 25.98425
xtabs(~ Sex, titanic[titanic$Survived == 1,])['male']/nrow(titanic)*100
##     male 
## 12.26097
  • Women and men who did not survive, as a percentage of all passengers
xtabs(~ Sex, titanic[titanic$Survived == 0,])['female']/nrow(titanic)*100
##   female 
## 9.111361
xtabs(~ Sex, titanic[titanic$Survived == 0,])['male']/nrow(titanic)*100
##     male 
## 52.64342
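The four figures above share the same denominator (all passengers), so they sum to 100%. A cross-tabulation with `prop.table()` can produce them in one step, and its `margin` argument gives the survival rate within each sex instead. The sketch below uses made-up rows (`toy` is illustrative, not the real data):

```r
# Made-up passengers: 3 women, 5 men
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 0)
)

counts <- table(toy$Sex, toy$Survived, dnn = c("Sex", "Survived"))

# Each cell as a percentage of all passengers (cells sum to 100)
share_of_all <- prop.table(counts) * 100

# Each cell as a percentage of its row, i.e. survival rate within each sex
rate_within_sex <- prop.table(counts, margin = 1) * 100

share_of_all
rate_within_sex
```

The second table answers a different question: not "what share of everyone aboard were female survivors" but "what share of women survived".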

Correlation between Age and Survived

cor(titanic$Age, as.numeric(titanic$Survived))
## [1] -0.07467292
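One detail worth knowing here: `as.numeric()` on a factor returns the internal level codes (1 and 2), not the original 0 and 1. The correlation is unaffected because shifting a variable by a constant does not change it, but the distinction matters for other calculations. A small sketch on made-up values (`survived`, `age`, `codes`, and `values` are illustrative names):

```r
# Made-up outcome and ages
survived <- factor(c(0, 1, 1, 0, 1))
age      <- c(22, 38, 26, 35, 54)

codes  <- as.numeric(survived)                # level codes: 1 2 2 1 2
values <- as.numeric(as.character(survived))  # original values: 0 1 1 0 1

# Both give the same correlation, since codes = values + 1
cor(age, codes)
cor(age, values)
```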

4.2 Ticket

Average Ticket Prices for Each Class

aggregate(Fare~Pclass, titanic, mean)
##   Pclass     Fare
## 1      1 84.19352
## 2      2 20.66218
## 3      3 13.67555

Standard Deviation of Ticket Prices

sd(titanic$Fare)
## [1] 49.6975
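This figure is computed across all classes at once, so much of it reflects the gap between classes rather than variation within a class. As a sketch on made-up fares (`toy` is illustrative), the same `aggregate()` call used for the means can report the spread per class:

```r
# Made-up fares for three classes
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 2, 3, 3)),
  Fare   = c(80, 90, 20, 22, 7, 9)
)

fare_sd <- aggregate(Fare ~ Pclass, toy, sd)  # spread within each class
fare_sd

sd(toy$Fare)  # overall spread, inflated by the differences between classes
```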

5 Insight from the Data

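From the data processing above, several conclusions can be drawn:

  • The cleaned dataset contains 889 passengers (the original 891 rows minus the two with a missing port of embarkation)
  • More passengers embarked at Southampton (644) than at Cherbourg (168) and Queenstown (77) combined
  • There are more female survivors than male survivors (about 26% versus 12% of all passengers)
  • 1st class fares are the highest on average (84.19, compared with 20.66 for 2nd class and 13.68 for 3rd class); the standard deviation of fares across all passengers is 49.70
  • There is only a very weak negative correlation (about -0.07) between age and survival, so age alone says little about who survived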