Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Here we analyze which factors were most likely to contribute to passengers' deaths, and classify who was more likely to survive based on those features.

Purpose

The purpose of this project was to gain introductory exposure to programmatic data analysis concepts, by analysing the factors that determined whether a passenger survived the Titanic disaster or did not.

The data has been split into two groups:

training set (train.csv)
test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
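As a rough illustration of this train-then-predict workflow, here is a minimal sketch in base R. The two small data frames are invented stand-ins for train.csv and test.csv, and logistic regression via glm() is just one possible model choice, not necessarily the one used in this project.

```r
# Toy stand-ins for train.csv and test.csv (all values are invented).
train <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1),
  Sex      = factor(c("male", "female", "female", "male", "female", "male",
                      "female", "male", "male", "female", "female", "male")),
  Pclass   = c(3, 1, 2, 3, 1, 2, 3, 1, 1, 3, 1, 3)
)
test <- data.frame(
  Sex    = factor(c("female", "male"), levels = levels(train$Sex)),
  Pclass = c(1, 3)
)

# Fit a logistic regression on the labelled training rows.
model <- glm(Survived ~ Sex + Pclass, data = train, family = binomial)

# Predict survival probabilities for the unlabelled test rows,
# then threshold at 0.5 to obtain a 0/1 prediction per passenger.
prob <- predict(model, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
print(pred)
```

The test frame has no Survived column, mirroring the fact that the ground truth is withheld for the test set.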

Below is the library that we need:

library(ggplot2)

We read the data into R and store it in the ‘titanic’ object:

titanic <- read.csv("train.csv",stringsAsFactors = FALSE)
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Data Cleansing & Coercion

head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

From this result, we find that some columns do not have the correct data type, so we need to convert them to the correct type (data coercion).

# convert data types: categorical columns become factors
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Name <- as.character(titanic$Name)    # already chr (stringsAsFactors = FALSE)
titanic$Sex <- as.factor(titanic$Sex)
# titanic$Ticket <- as.character(titanic$Ticket)
titanic$Cabin <- as.character(titanic$Cabin)  # already chr (stringsAsFactors = FALSE)
titanic$Embarked <- as.factor(titanic$Embarked)
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
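When several columns need the same conversion, the coercion can also be applied in one step. The following is a minimal sketch on an invented toy data frame that reuses three of the column names; `factor_cols` is a hypothetical helper name.

```r
# Toy data frame with a few of the same column names (values are invented).
df <- data.frame(
  Survived = c(0L, 1L, 1L),
  Pclass   = c(3L, 1L, 3L),
  Sex      = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# Columns that represent categories rather than quantities.
factor_cols <- c("Survived", "Pclass", "Sex")

# lapply() returns a list of factors, which is assigned back column by column.
df[factor_cols] <- lapply(df[factor_cols], as.factor)

str(df)
```

This is equivalent to calling as.factor() on each column separately; it only saves repetition.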

Each column now has the desired data type. Next, check for missing (NA) values:

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Good. Now check for empty strings:

colSums(titanic=="")
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
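The NA reported for Age arises because comparing a missing value with "" yields NA rather than FALSE. A minimal sketch, on an invented toy data frame, of counting empty strings with na.rm = TRUE and then recoding them as NA so that is.na() catches both kinds of missing value:

```r
# Toy data frame (values are invented) with an NA and some empty strings.
df <- data.frame(
  Age      = c(22, NA, 26),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

# na.rm = TRUE skips the NA produced by comparing a missing Age with "".
empty_counts <- colSums(df == "", na.rm = TRUE)

# Recode empty strings as NA in the character columns only.
chr_cols <- sapply(df, is.character)
df[chr_cols] <- lapply(df[chr_cols], function(x) {
  x[x %in% ""] <- NA
  x
})

# Now is.na() sees both the original NA and the former empty strings.
na_counts <- colSums(is.na(df))
print(empty_counts)
print(na_counts)
```

Note that recoding empty strings as NA changes what complete.cases() removes, so it should be a deliberate choice.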

Find the mean of Age for data imputation:

# library(dplyr)
# summarise(titanic, Average = mean(Age, na.rm = T))
# mean(titanic$Age, na.rm = TRUE)  
summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

Don’t forget to delete the rows with missing values!

# x <- na.omit(titanic)
titanicNew <- titanic[complete.cases(titanic), ]

Roughly 20 percent of the Age values are missing (177 of 891). That proportion is likely small enough for reasonable replacement with some form of imputation, but here the rows with a missing Age are simply removed instead.
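For comparison, mean imputation could be sketched as follows. This is not what the analysis here does (it drops the rows); the vector `age` is an invented stand-in for titanic$Age.

```r
# Invented stand-in for titanic$Age, with two missing values.
age <- c(22, 38, NA, 35, NA, 54)

# Mean of the observed values only.
age_mean <- mean(age, na.rm = TRUE)

# Replace each NA with that mean and keep observed values unchanged.
age_imputed <- ifelse(is.na(age), age_mean, age)
print(age_imputed)
```

Imputation keeps every row at the cost of shrinking the variance of Age, whereas dropping rows keeps only observed values but loses data.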

About 687 of the 891 Cabin values are missing. However, the Cabin data is not important since it is just a unique room code, so I prefer to drop the Cabin column.
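Dropping a column can be sketched as below. The data frame `df` is an invented toy example, so this does not modify titanicNew.

```r
# Toy data frame (values are invented) with a mostly empty Cabin column.
df <- data.frame(
  PassengerId = 1:3,
  Cabin       = c("", "C85", ""),
  Fare        = c(7.25, 71.28, 7.92),
  stringsAsFactors = FALSE
)

# Assigning NULL removes the column in place.
df$Cabin <- NULL

names(df)
```

An equivalent form that tolerates an already missing column is `df[, setdiff(names(df), "Cabin")]`.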

dim(titanicNew)
## [1] 714  12
colSums(is.na(titanicNew))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

From that result, there are no missing values anymore, with 714 rows remaining.

summary(titanicNew)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:424    1:186   Length:714         female:261  
##  1st Qu.:222.2   1:290    2:173   Class :character   male  :453  
##  Median :445.0            3:355   Mode  :character               
##  Mean   :448.6                                                   
##  3rd Qu.:677.8                                                   
##  Max.   :891.0                                                   
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Length:714        
##  1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.5126   Mean   :0.4314                     
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000                     
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:714          :  2   
##  1st Qu.:  8.05   Class :character   C:130   
##  Median : 15.74   Mode  :character   Q: 28   
##  Mean   : 34.69                      S:554   
##  3rd Qu.: 33.38                              
##  Max.   :512.33

Data Explanation

Summary:

1. There are 891 passengers in total.
2. 177 Age values are missing.
3. 714 rows remain after removing rows with missing values.
4. Southampton, Cherbourg and Queenstown are the most popular ports of embarkation, in that order.
5. There are 184 types of cabins.
6. The maximum ticket fare is 512 and the minimum is 0 (free).
7. Passengers have a maximum of 5 siblings / spouses aboard.
8. Passengers have a maximum of 6 parents / children aboard the Titanic.
9. The oldest passenger is 80 years old and the youngest is under 1 year old.
10. Passengers are predominantly men: 453 men versus 261 women.
11. Pclass 3 is the most populated class.
12. 290 passengers survived and 424 died.
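Counts like these can be read off summary(), or computed directly. A minimal sketch on an invented toy data frame (the values are not taken from the Titanic data), using table(), prop.table() and range():

```r
# Toy data frame (values are invented).
df <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 0)),
  Sex      = factor(c("male", "female", "female", "male", "male")),
  Fare     = c(7.25, 71.28, 7.92, 0, 8.05)
)

# Number of passengers per outcome (0 = died, 1 = survived).
surv_counts <- table(df$Survived)

# Share of each outcome within each sex (each row sums to 1).
rate_by_sex <- prop.table(table(df$Sex, df$Survived), margin = 1)

# Minimum and maximum fare.
fare_range <- range(df$Fare)

print(surv_counts)
print(rate_by_sex)
print(fare_range)
```

Applied to titanicNew, the same calls would give the survival counts, the survival share per sex, and the fare extremes listed in the summary.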