library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Here, I define some helper functions for organizing the data.
The first function takes a column as input and returns the number of NA values in it.
na_sum<- function(value){
# is.na() gives a vector of true and false
x<- is.na(value)
# as true is 1, sum() returns number of na
y<- sum(x)
# return the value
y
}
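As a quick check of na_sum, the sketch below runs it on a small made-up data frame (`toy` is a hypothetical name and not the Titanic data) and compares the result with base R's `colSums(is.na(...))`, which should give the same counts in a single call.

```r
# na_sum as defined above, repeated so the example runs on its own
na_sum <- function(value) {
  sum(is.na(value))
}

# Hypothetical toy data, not the Titanic file
toy <- data.frame(
  age  = c(22, NA, 26, NA),
  fare = c(7.25, 71.28, NA, 53.1)
)

# One NA count per column, returned as a named vector
per_column <- sapply(toy, na_sum)
per_column

# Base R alternative: is.na() on a data frame gives a logical matrix,
# and colSums() adds up the TRUE values column by column
colSums(is.na(toy))
```

Either form works for a first look at missing values; the helper is handy when only a single column needs checking.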
This function takes a character value holding a name in “last name, first name” format and returns it in “first name last name” format. Any title such as Mr. or Mrs. stays in front of the first name.
full_name<- function(value){
# removes the part in the parenthesis
x<- strsplit(value, split = " (", fixed = TRUE)
# extract only first and last name
y<- x[[1]][1]
# split at ", " so that last and first name are separated
z<- strsplit(y, split = ", ")
# first name first and then last name
p<- paste(z[[1]][2], z[[1]][1])
# returns the value
p
}
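To see what full_name returns, the sketch below applies it to two names taken from the head() output shown later in this document. It also includes a vectorised variant built on `sub()`; `full_name_vec` is a hypothetical helper of mine, and it is only checked against these two sample names, so treat it as a sketch rather than a drop-in replacement (names containing more than one comma could behave differently).

```r
# full_name as defined above, repeated so the example runs on its own
full_name <- function(value) {
  x <- strsplit(value, split = " (", fixed = TRUE)
  y <- x[[1]][1]
  z <- strsplit(y, split = ", ")
  paste(z[[1]][2], z[[1]][1])
}

# Two names copied from the head() output of the data
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
)

# One name at a time, as in the analysis; unname() drops the sapply names
one_by_one <- unname(sapply(sample_names, full_name))

# Hypothetical vectorised variant: drop " (..." first,
# then swap the parts around the first comma
full_name_vec <- function(value) {
  no_paren <- sub(" \\(.*$", "", value)
  sub("^([^,]+), (.*)$", "\\2 \\1", no_paren)
}

all_at_once <- full_name_vec(sample_names)

one_by_one
all_at_once
```

Both versions keep the title, so the first name comes out as "Mr. Owen Harris Braund".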
The working directory is set to the folder that contains the desired data file.
data1<- read.csv("titanic_train.csv", header = TRUE)
#---------------------------
head(data1,10)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## 7 7 0 1
## 8 8 0 3
## 9 9 1 3
## 10 10 1 2
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 8 Palsson, Master. Gosta Leonard male 2 3 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
## 7 17463 51.8625 E46 S
## 8 349909 21.0750 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
str(data1)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
sapply applies the na_sum function to each column and returns a named vector with one NA count per column.
nas<- sapply(data1, na_sum)
nas
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
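The counts above report 0 for Cabin and Embarked, although the head() output shows blank Cabin cells. Those blanks are empty strings, which is.na() does not treat as missing. A minimal sketch of a counter that also catches them is given below; `missing_sum` and the toy data frame are hypothetical names used only for illustration.

```r
# Hypothetical helper: counts NA values and, for character columns,
# empty or whitespace-only strings as well
missing_sum <- function(value) {
  if (is.character(value)) {
    sum(is.na(value) | trimws(value) == "")
  } else {
    sum(is.na(value))
  }
}

# Hypothetical toy data in the style of the Titanic columns
toy <- data.frame(
  Age = c(22, NA, 26),
  Cabin = c("", "C85", ""),
  Embarked = c("S", "", "Q"),
  stringsAsFactors = FALSE
)

missing_counts <- sapply(toy, missing_sum)
missing_counts
```

An alternative is to pass `na.strings = c("NA", "")` to read.csv so that blanks are read as NA from the start; the later filter on Embarked would then need is.na() instead of a comparison with "".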
Create a new column, Full_Name, holding the name reformatted as “first name last name”.
data1$Full_Name<- sapply(data1$Name, FUN = full_name)
unique(data1$Survived)
## [1] 0 1
Survived takes only two values: 0 (no) and 1 (yes).
data1$Survived<- as.factor(data1$Survived)
levels(data1$Survived)
## [1] "0" "1"
The factor levels are 0 and 1, which are replaced with No and Yes.
levels(data1$Survived)<- c("No", "Yes")
unique(data1$Pclass)
## [1] 3 1 2
Passenger class is first, second, or third.
data1$Pclass<- as.factor(data1$Pclass)
levels(data1$Pclass)
## [1] "1" "2" "3"
The factor levels are 1, 2, and 3, which are replaced with Upper, Middle, and Lower.
levels(data1$Pclass)<- c("Upper", "Middle", "Lower")
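Renaming levels by position works as long as the order of the new names matches the sorted order of the old levels. A sketch of an alternative is to state the mapping in one `factor()` call with `levels` and `labels`, so the pairing is visible in the code. The input vector here is made up for illustration.

```r
# Hypothetical values in the style of the Pclass column
pclass_raw <- c(3, 1, 3, 1, 2)

# levels and labels are paired by position: 1 -> Upper, 2 -> Middle, 3 -> Lower
pclass <- factor(
  pclass_raw,
  levels = c(1, 2, 3),
  labels = c("Upper", "Middle", "Lower")
)

pclass
table(pclass)
```

The same pattern would apply to Survived with levels 0 and 1 and labels No and Yes.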
unique(data1$Sex)
## [1] "male" "female"
Sex is a factor with two levels, female and male.
data1$Sex<- as.factor(data1$Sex)
levels(data1$Sex)
## [1] "female" "male"
unique(data1$Embarked)
## [1] "S" "C" "Q" ""
Embarked contains an empty string where the value is missing; those rows are removed.
data1<- data1[data1$Embarked!="", ]
data1$Embarked<- as.factor(data1$Embarked)
levels(data1$Embarked)
## [1] "C" "Q" "S"
Embarked is a factor variable. Its levels are C = Cherbourg, Q = Queenstown, S = Southampton.
levels(data1$Embarked)<- c("Cherbourg", "Queenstown", "Southampton")
Age is converted to integer (as.integer drops any fractional part) and rows with NA ages are removed.
class(data1$Age)
## [1] "numeric"
data1$Age<- as.integer(data1$Age)
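# keep only rows where Age is known
data1<- data1[!is.na(data1$Age), ]
# reorder columns so that Full_Name (column 13) follows Name (column 4)
data1<- data1[,c(1:4,13,5:12)]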
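The three cleanup steps above can be tried on a tiny example. The sketch below uses a made-up data frame (`toy`, with hypothetical values) to show that as.integer() truncates fractional ages, that the NA row is dropped, and how the column reordering could be written by name instead of by position, which keeps working if columns are added later.

```r
# Hypothetical toy data; column names mimic the Titanic data
toy <- data.frame(
  Id = 1:4,
  Name = c("A", "B", "C", "D"),
  Age = c(0.42, 28.5, NA, 35),
  Fare = c(8.5, 7.9, 8.1, 53.1),
  Full_Name = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

toy$Age <- as.integer(toy$Age)   # truncates toward zero: 0.42 -> 0, 28.5 -> 28
toy <- toy[!is.na(toy$Age), ]    # keep rows with a known age

# Put Full_Name right after Name, looking columns up by name
name_pos <- match("Name", names(toy))
before <- names(toy)[seq_len(name_pos)]
after <- setdiff(names(toy), c(before, "Full_Name"))
toy <- toy[, c(before, "Full_Name", after)]

toy
```

Note that truncation turns infants below one year into age 0; whether that is acceptable depends on the analysis.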
#---------------------------
ggplot(data = data1, aes(x= Survived, y= Age, fill= Survived))+
geom_boxplot()+
theme_classic()+
ggtitle("Box plot of survival on age")
The box plot shows that the median age of passengers who survived is slightly lower. However, a box plot does not show the shape of the distribution.
ggplot(data = data1, aes(x= Age, fill= Survived))+
geom_histogram(bins = 100)+
theme_classic()+
ggtitle("Histogram of age stacked on survival")
For ages 0 to 10, the chance of survival decreases with age. For the remaining ages, both the Yes and the No counts follow an approximately normal distribution.
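A reading like this is easier to check with numbers than by eye. The sketch below shows one way to compute the survival share per ten-year age band with `cut()` and `tapply()`. The data frame is made up for illustration, so the resulting rates say nothing about the real passengers.

```r
# Hypothetical toy data; Survived is coded like the cleaned column
toy <- data.frame(
  Age = c(2, 4, 9, 15, 22, 30, 38, 45, 60, 71),
  Survived = factor(
    c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No"),
    levels = c("No", "Yes")
  )
)

# Ten-year bands, closed on the left: [0,10), [10,20), ...
toy$AgeBand <- cut(toy$Age, breaks = seq(0, 80, by = 10), right = FALSE)

# Share of "Yes" within each band; bands without passengers give NA
rate <- tapply(toy$Survived == "Yes", toy$AgeBand, mean)
rate
```

Applied to data1, the same two lines would give one survival share per band, which can be compared directly with the histogram.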
ggplot(data = data1, aes(x= Sex, y= Age, fill= Survived))+
geom_boxplot()+
theme_classic()+
ggtitle("Box plot of sex on age")
The box plot shows that females who survived tended to be older, which might be due to help from others; they might have been considered a high-priority group during rescue. Among males, the survivors tended to be younger, with more of them below age 20, which might be because they could save themselves or were helped because of their young age.
Another hypothesis is that most older females had a sibling, spouse, parent, or child traveling with them who helped them, whereas older males traveling with a sibling, spouse, parent, or child helped those companions before saving themselves.
ggplot(data = data1, aes(x= Age, fill= Survived))+
geom_histogram(bins = 100)+
theme_classic()+
facet_grid(~Sex)+
ggtitle("Histogram of age stacked on survival, facet on sex")
The distribution shows that most of the females survived and most of the males died. However, a few women did not survive, and they are scattered across all ages. Travelling class might have an effect on that.
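The "most females survived, most males died" statement can be backed with row proportions. A minimal sketch using `table()` and `prop.table()` follows; the data are hypothetical and only show the mechanics.

```r
# Hypothetical toy data coded like the cleaned Sex and Survived columns
toy <- data.frame(
  Sex = factor(c("female", "female", "female", "male", "male", "male", "male")),
  Survived = factor(
    c("Yes", "Yes", "No", "No", "No", "Yes", "No"),
    levels = c("No", "Yes")
  )
)

# Counts per sex and outcome
counts <- table(toy$Sex, toy$Survived)

# margin = 1 divides each row by its total, so each sex sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

With the real data, `table(data1$Sex, data1$Survived)` would take the place of the toy table.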
Let's make a new column, Companion, that adds SibSp and Parch, to check whether the number of companions had any effect on survival.
data1$Companion<- data1$SibSp+data1$Parch
ggplot(data = data1, aes(x= Sex, y= Companion, fill= Survived))+
geom_boxplot()+
theme_classic()+
ggtitle("Box plot of sex on companion")
A few females with a higher number of companions did not survive. The next plot checks whether the number of companions affected the survival of males and females differently.
ggplot(data = data1, aes(x= Companion, fill= Survived))+
geom_histogram(bins = 10)+
theme_classic()+
facet_grid(~Sex)+
ggtitle("Histogram of companion count stacked on
survival, facet on sex")
Most females were travelling alone. They got priority because of their gender and may also have been able to save themselves. Most women who were in a large group died. The group might have contained children, younger siblings, or older parents who were saved first, so those unfortunate women could not survive.
Most males were travelling alone, and they account for most of the deaths. They may have selflessly saved other people (both family and strangers) and could not save themselves. The next plot looks at the effect of passenger class on survival.
ggplot(data = data1, aes(x= Companion, fill= Survived))+
geom_histogram(bins = 10)+
theme_classic()+
facet_grid(Sex~Pclass)+
ggtitle("Histogram of companion stacked on survival,
facet on sex and passenger class")
Women's survival improves with passenger class: upper-class women were saved most often.
Men show the same direction when read in terms of deaths: lower-class men died most often.
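The faceted histogram suggests these patterns, and a small rate table makes them explicit. The sketch below computes the survival share for each combination of sex and class with `aggregate()`. The toy data frame and the `SurvivedYes` column are hypothetical and chosen only to illustrate the call.

```r
# Hypothetical toy data coded like the cleaned columns
toy <- data.frame(
  Sex = c("female", "female", "female", "female", "male", "male", "male", "male"),
  Pclass = factor(
    c("Upper", "Upper", "Lower", "Lower", "Upper", "Upper", "Lower", "Lower"),
    levels = c("Upper", "Middle", "Lower")
  ),
  Survived = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No")
)

# Logical helper column so that mean() gives the share of survivors
toy$SurvivedYes <- toy$Survived == "Yes"

# One row per observed Sex/Pclass combination
rates <- aggregate(SurvivedYes ~ Sex + Pclass, data = toy, FUN = mean)
rates
```

Combinations with no passengers (Middle in this toy example) simply do not appear in the result.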
#---------------------------
Most of the males died and most of the females survived. Although most of the males were traveling alone, they may have selflessly helped others and died. Males in higher classes survived better.
Women and children got priority during rescue.
Higher passenger classes had better survival.
A number of females in the lower class died. They might not have managed to get any help.
Women travelling with more companions survived less often.
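The statement about companions can be checked with a two-way rate table. The sketch below groups the companion count into three bands and computes the survival share per sex and band. The band limits (alone, 1 to 3, 4 or more) are my own assumption for illustration, and the data are made up.

```r
# Hypothetical toy data coded like the cleaned columns
toy <- data.frame(
  Sex = c("female", "female", "female", "female", "male", "male", "male", "male"),
  SibSp = c(0, 1, 3, 4, 0, 0, 1, 5),
  Parch = c(0, 0, 2, 2, 0, 0, 1, 2),
  Survived = c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No")
)

# Same definition as in the analysis
toy$Companion <- toy$SibSp + toy$Parch

# Assumed bands: 0 = Alone, 1 to 3 = Small, 4 or more = Large
toy$Group <- cut(
  toy$Companion,
  breaks = c(-Inf, 0, 3, Inf),
  labels = c("Alone", "Small", "Large")
)

# Survival share for every Sex/Group cell
rate <- tapply(toy$Survived == "Yes", list(toy$Sex, toy$Group), mean)
rate
```

Different band limits may change the picture, so it is worth trying a few before drawing conclusions from the real data.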