Titanic data set - box plots and histograms

Load libraries

library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Functions defined

Here I define two helper functions that help organize the data.

  • na_sum - returns the number of NAs in a column

This function takes a column as input and returns the number of NA values in it.

na_sum<- function(value){
        # is.na() gives a vector of true and false
        x<- is.na(value)
        # as true is 1, sum() returns number of na
        y<- sum(x)
        # return the value
        y
}
  • full_name - puts the first name first and the last name last

This function takes a character value in “last name, first name” format and returns it in “first name last name” format. Any part in parentheses is dropped.

full_name<- function(value){
        # drop the part in parentheses
        x<- strsplit(value, split =  " (", fixed = TRUE)
        # keep only the "last name, first name" part
        y<- x[[1]][1]
        # split at ", " so that the last and first names are separated
        z<- strsplit(y, split =  ", ")
        # first name first, then last name
        p<- paste(z[[1]][2], z[[1]][1])
        # return the value
        p
}
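
As a rough sketch of how the two helpers behave, the block below repeats them in compact form and runs them on two names written in the same format as the data set. The `full_name_vec` function is a hypothetical vectorized variant based on `sub()`; it is not part of the original analysis and assumes every name follows the "Last, First (optional note)" pattern.

```r
# Compact copies of the two helpers (same logic as the definitions above)
na_sum <- function(value) sum(is.na(value))

full_name <- function(value) {
        y <- strsplit(value, split = " (", fixed = TRUE)[[1]][1]
        z <- strsplit(y, split = ", ")[[1]]
        paste(z[2], z[1])
}

# Hypothetical vectorized variant (an assumption, not in the original analysis)
full_name_vec <- function(value) {
        # drop everything from the first " (" onwards
        no_paren <- sub(" \\(.*$", "", value)
        # swap "Last, First" into "First Last"
        sub("^([^,]+), (.*)$", "\\2 \\1", no_paren)
}

names_in <- c("Braund, Mr. Owen Harris",
              "Cumings, Mrs. John Bradley (Florence Briggs Thayer)")

na_count  <- na_sum(c(22, NA, 26, NA))   # two NA values
one_name  <- full_name(names_in[2])      # works on a single string
all_names <- full_name_vec(names_in)     # works on a whole column at once

print(na_count)
print(one_name)
print(all_names)
```

Because `full_name` handles one string at a time, it has to be called through `sapply`; a vectorized variant like the one sketched here could be applied to the column directly.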

Set working directory

The working directory is set to the folder that contains the desired data file.

Load data

data1<- read.csv("titanic_train.csv", header = TRUE)

#——————————————————————————-

head(data1,10)
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                   Name    Sex Age SibSp Parch
## 1                              Braund, Mr. Owen Harris   male  22     1     0
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                             Allen, Mr. William Henry   male  35     0     0
## 6                                     Moran, Mr. James   male  NA     0     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 8                       Palsson, Master. Gosta Leonard   male   2     3     1
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##              Ticket    Fare Cabin Embarked
## 1         A/5 21171  7.2500              S
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 5            373450  8.0500              S
## 6            330877  8.4583              Q
## 7             17463 51.8625   E46        S
## 8            349909 21.0750              S
## 9            347742 11.1333              S
## 10           237736 30.0708              C
str(data1)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

To check which columns have NA values

sapply applies the na_sum function to each column and returns a named vector with one count per column.

nas<- sapply(data1, na_sum)
nas
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
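
The counts above cover only true NA values. The `str()` output shows Cabin values of `""`, so missing text fields were read as empty strings and are not detected by `is.na()`. The sketch below, on a small made-up data frame (`toy`, `toy2` and `csv_text` are illustrative names, not part of the analysis), shows one way to count empty strings as well, and how `na.strings` in `read.csv` could turn them into NA at load time.

```r
# Toy data frame (made-up values) with one NA and several empty strings
toy <- data.frame(Age      = c(22, NA, 26),
                  Cabin    = c("", "C85", ""),
                  Embarked = c("S", "", "C"),
                  stringsAsFactors = FALSE)

# NA count per column, equivalent to sapply(toy, na_sum)
na_count <- colSums(is.na(toy))

# count of empty strings per column, which is.na() does not detect
empty_count <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(na_count)
print(empty_count)

# Alternative: treat empty fields as NA while reading the data
csv_text <- "Age,Cabin\n22,\n,C85\n26,\n"
toy2 <- read.csv(text = csv_text, na.strings = c("", "NA"))
print(colSums(is.na(toy2)))
```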

Creating a new column Full_Name with the name reformatted as “first name last name”

data1$Full_Name<- sapply(data1$Name, FUN = full_name)
unique(data1$Survived)
## [1] 0 1

Survived is either 0 (no) or 1 (yes)

data1$Survived<- as.factor(data1$Survived)
levels(data1$Survived)
## [1] "0" "1"

The levels of the factor are 0 and 1, which are replaced with No and Yes

levels(data1$Survived)<- c("No", "Yes")
unique(data1$Pclass)
## [1] 3 1 2

Passenger class is first, second or third

data1$Pclass<- as.factor(data1$Pclass)
levels(data1$Pclass)
## [1] "1" "2" "3"

The levels of the factor are 1, 2 and 3, which are replaced with Upper, Middle and Lower

levels(data1$Pclass)<- c("Upper", "Middle", "Lower")
unique(data1$Sex)
## [1] "male"   "female"

Sex is a factor either female or male

data1$Sex<- as.factor(data1$Sex)
levels(data1$Sex)
## [1] "female" "male"
unique(data1$Embarked)
## [1] "S" "C" "Q" ""

Rows with a missing (empty) Embarked value are removed

data1<- data1[data1$Embarked!="", ]
data1$Embarked<- as.factor(data1$Embarked)
levels(data1$Embarked)
## [1] "C" "Q" "S"

Embarked is a factor variable. Its levels are C = Cherbourg, Q = Queenstown and S = Southampton

levels(data1$Embarked)<- c("Cherbourg", "Queenstown", "Southampton")
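
Assigning to `levels()` relies on the new labels being listed in the same order as the existing levels. As a hedged alternative, the sketch below states the code-to-label mapping explicitly through the `levels` and `labels` arguments of `factor()`. The input vectors are made-up values in the same coding as the data set.

```r
# Made-up values in the same coding as the data set
survived <- c(0, 1, 1, 0)
pclass   <- c(3, 1, 2, 3)
embarked <- c("S", "C", "Q", "S")

# each code in `levels` is paired with the label at the same position
survived_f <- factor(survived, levels = c(0, 1), labels = c("No", "Yes"))
pclass_f   <- factor(pclass, levels = c(1, 2, 3),
                     labels = c("Upper", "Middle", "Lower"))
embarked_f <- factor(embarked, levels = c("C", "Q", "S"),
                     labels = c("Cherbourg", "Queenstown", "Southampton"))

print(table(survived_f))
print(levels(pclass_f))
print(as.character(embarked_f))
```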

Age is converted to integer, which truncates fractional ages, and rows with NA in Age are removed.

class(data1$Age)
## [1] "numeric"
data1$Age<- as.integer(data1$Age)
data1<- data1[!is.na(data1$Age),]
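
A small sketch of what these two steps do, on made-up ages: `as.integer()` drops the fractional part rather than rounding, and the logical index keeps only the rows where Age is known.

```r
age <- c(0.42, 28.5, NA, 35)   # made-up ages, including a fractional infant age

# as.integer() truncates toward zero; NA stays NA
age_int <- as.integer(age)

# keep only the non-missing values
keep     <- !is.na(age)
age_kept <- age[keep]

print(age_int)
print(age_kept)
```

If fractional ages matter for the analysis, leaving Age as numeric and only filtering out the NA rows would be one option.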

Rearranging columns so that Full_Name follows Name

data1<- data1[,c(1:4,13,5:12)]

#——————————————————————————-

Box plot of survival on age of passengers

ggplot(data = data1, aes(x= Survived, y= Age, fill= Survived))+
        geom_boxplot()+
        theme_classic()+ 
        ggtitle("Box plot of survival on age")

From the box plot, the median age of passengers who survived is slightly lower than that of passengers who died. However, a box plot does not show the shape of the distribution.
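
The medians read off the box plot can also be computed directly. The following is a minimal sketch using `tapply()` on made-up values; with the real data the same call would use `data1$Age` and `data1$Survived`.

```r
# Made-up ages and outcomes, only to show the call
age      <- c(22, 30, 40, 4, 26, 35)
survived <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"),
                   levels = c("No", "Yes"))

# median age within each survival group
med <- tapply(age, survived, median)
print(med)
```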

A histogram shows the distribution better.

ggplot(data = data1, aes(x= Age, fill= Survived))+
        geom_histogram(bins = 100)+ 
        theme_classic()+ 
        ggtitle("Histogram of age stacked on survival")

For ages 0 to 10, the chance of survival decreases with age. For older passengers, both the Yes and No groups follow an approximately normal distribution.
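
A stacked histogram shows counts, so survival chance has to be judged by eye. One way to put numbers on it is to cut age into bands and take the share of survivors in each band. The sketch below uses made-up values and arbitrary band limits (both are assumptions chosen for illustration).

```r
# Made-up ages and outcomes
age      <- c(2, 4, 9, 15, 22, 25, 31, 38, 45, 60)
survived <- factor(c("Yes", "Yes", "No", "No", "No",
                     "Yes", "No", "Yes", "No", "No"),
                   levels = c("No", "Yes"))

# age bands [0,10), [10,20), [20,40), [40,80); limits are illustrative
age_band <- cut(age, breaks = c(0, 10, 20, 40, 80), right = FALSE)

# share of survivors in each band (mean of a logical vector)
rate <- tapply(survived == "Yes", age_band, mean)
print(rate)
```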

To check if there was any difference between males and females

ggplot(data = data1, aes(x= Sex, y= Age, fill= Survived))+
        geom_boxplot()+
        theme_classic()+ 
        ggtitle("Box plot of sex on age")

The box plot shows that females who survived tended to be older than females who died, which might be due to help from others; they might have been considered a high-priority group during rescue. Among males, more of the younger passengers below age 20 survived, which might be because they could save themselves or received help on the basis of their young age.

Another hypothesis is that most older females had a sibling, spouse, parent or child traveling with them who helped them, while older males who had a sibling, spouse, parent or child traveling with them helped those companions before saving themselves.

To check if sex really had any effect on survival

ggplot(data = data1, aes(x= Age, fill= Survived))+
        geom_histogram(bins = 100)+ 
        theme_classic()+ 
        facet_grid(~Sex)+
        ggtitle("Histogram of age stacked on survival, facet on sex")

The distribution shows that most females survived and most males died. However, a few women did not survive, and they are scattered across all ages. Travelling class might have an effect on that.
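
The proportions behind a statement like this can be tabulated with `table()` and `prop.table()`. This is a sketch on made-up values; with the real data the inputs would be `data1$Sex` and `data1$Survived`.

```r
# Made-up values, only to show the calls
sex      <- factor(c("female", "female", "female",
                     "male", "male", "male", "male", "male"))
survived <- factor(c("Yes", "Yes", "No",
                     "No", "No", "No", "Yes", "No"),
                   levels = c("No", "Yes"))

counts <- table(sex, survived)

# margin = 1 makes each row (each sex) sum to 1
rates <- prop.table(counts, margin = 1)
print(counts)
print(rates)
```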

Let's make a new column, Companion, by adding the values of SibSp and Parch, to check if it had any effect on survival.

To check if the number of companions had any effect on survival

data1$Companion<- data1$SibSp+data1$Parch
ggplot(data = data1, aes(x= Sex, y= Companion, fill= Survived))+
        geom_boxplot()+
        theme_classic()+ 
        ggtitle("Box plot of sex on companion")
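
To complement the box plot, the Companion column can be reduced to a travelling-alone flag and cross-tabulated. The sketch below uses a made-up data frame; `toy` and the `Alone` column are hypothetical names introduced only for this illustration.

```r
# Made-up rows with the same column names as the data set
toy <- data.frame(Sex      = c("female", "female", "male", "male", "male"),
                  SibSp    = c(0, 1, 0, 0, 3),
                  Parch    = c(0, 2, 0, 0, 1),
                  Survived = c("Yes", "No", "No", "Yes", "No"))

# companions = siblings/spouses + parents/children
toy$Companion <- toy$SibSp + toy$Parch

# hypothetical helper column: TRUE when travelling alone
toy$Alone <- toy$Companion == 0

# how many passengers of each sex travelled alone
alone_by_sex <- table(toy$Sex, toy$Alone)

# share of survivors by sex (rows) and alone flag (columns)
rate <- tapply(toy$Survived == "Yes", list(toy$Sex, toy$Alone), mean)

print(alone_by_sex)
print(rate)
```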

A few females with a higher number of companions did not survive.

To check the differential effect of companions on the survival of males and females

ggplot(data = data1, aes(x= Companion, fill= Survived))+
        geom_histogram(bins = 10)+ 
        theme_classic()+ 
        facet_grid(~Sex)+
        ggtitle("Histogram of companion count stacked on\nsurvival, facet on sex")
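
Companion takes only a few whole-number values, so a histogram with a fixed number of bins may group neighbouring counts together. As a hedged alternative, `geom_bar()` draws one bar per distinct value. The data frame below is made up purely so the sketch runs on its own.

```r
library(ggplot2)

# Made-up rows, only to make the sketch runnable
toy <- data.frame(
        Sex       = c("female", "female", "male", "male", "male", "female"),
        Companion = c(0, 1, 0, 0, 4, 2),
        Survived  = c("Yes", "Yes", "No", "Yes", "No", "No")
)

# factor(Companion) treats each count as its own category
p <- ggplot(data = toy, aes(x = factor(Companion), fill = Survived)) +
        geom_bar() +
        theme_classic() +
        facet_grid(~Sex) +
        xlab("Companion") +
        ggtitle("Bar chart of companion count stacked on survival, facet on sex")
print(p)
```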

Most females were travelling alone. They got priority because of their gender, and they could also save themselves. Most of the women who were in a large group died. The group might have contained children, younger siblings or older parents who were saved first, so those unfortunate females could not survive.

Most males were travelling alone, and most of them died. They may have selflessly saved other people (both family and strangers) and could not save themselves.

Effect of passenger class on survival

ggplot(data = data1, aes(x= Companion, fill= Survived))+
        geom_histogram(bins = 10)+ 
        theme_classic()+ 
        facet_grid(Sex~Pclass)+ 
        ggtitle("Histogram of companion stacked on survival,\nfacet on sex and passenger class")

For women, a higher passenger class is associated with better survival: upper-class women were saved most.
For men, a lower passenger class is associated with worse survival: lower-class men died most.
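
The faceted histogram can be backed up with a table of survival shares by sex and passenger class. The sketch below uses made-up rows, so its numbers are not results from the Titanic data; with the real data the inputs would be the Sex, Pclass and Survived columns of `data1`.

```r
# Made-up rows, only to show the calculation
sex      <- c("female", "female", "female", "female", "female",
              "male", "male", "male", "male", "male")
pclass   <- factor(c("Upper", "Upper", "Middle", "Lower", "Lower",
                     "Upper", "Upper", "Middle", "Lower", "Lower"),
                   levels = c("Upper", "Middle", "Lower"))
survived <- c("Yes", "Yes", "Yes", "Yes", "No",
              "Yes", "No", "No", "No", "No")

# share of survivors for each sex (rows) and class (columns)
rate <- tapply(survived == "Yes", list(sex, pclass), mean)
print(rate)
```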

#——————————————————————————————–

  • Most of the males died and most of the females survived. Although most of the males were traveling alone, they may have selflessly helped others and died. Males in higher classes survived better.

  • Women and children got priority during rescue.

  • Higher passenger classes had better survival.

  • A number of females in the lower class died. They might not have managed to get any help.

  • Women travelling with more companions survived less often.

Thank you for your patience