WQD7004 - Assignment 1

Introduction

This is my submission for WQD7004 - Assignment 1.

In this notebook, I analyze the Titanic dataset (obtained from Kaggle) and carry out a simple Exploratory Data Analysis (EDA) to understand the data better and uncover interesting patterns in it.

First, some backstory on the dataset. During its maiden voyage in 1912, the British ocean liner RMS Titanic struck an iceberg and sank, even though it was widely considered unsinkable. The disaster, and this dataset, are named after the ship.

Packages Info

I will be importing these packages to ease the visualization and data handling in this analysis.

ggplot2 - library used for visualization

dplyr - library for data manipulation (loaded here, although the code below relies on base R)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Other than the packages imported above, I will only use the packages that are already installed with R and available in RStudio.

Data Preparation

Importing data

df <- read.csv("train.csv")
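
By default `read.csv` keeps blank text fields as empty strings rather than `NA`. A minimal sketch, assuming one would prefer blanks to be treated as missing, uses the `na.strings` argument. The inline `csv_text` below is a made-up two-row sample, not the Kaggle file, and the rest of this notebook keeps the plain import above.

```r
# Hypothetical two-row sample standing in for train.csv
csv_text <- "PassengerId,Age,Cabin\n1,22,\n2,,C85\n"

# na.strings lists the raw field values that should be read as NA
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Age and Cabin each report one missing value, PassengerId reports none
print(colSums(is.na(toy)))
```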

Understanding the data

Let’s look at the first few rows of the dataset.

head(df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Next, look at some simple summary statistics of the data.

summary(df)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

There are 177 missing values within the ‘Age’ attribute.

The structure of the data is shown below.

str(df)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Data Preprocessing

Based on a first look at the data, several columns need some preliminary processing.

Survived

Convert the numeric survival indicator into a categorical factor.

df$Survived <- ifelse(df$Survived==1,"Yes","No")
df$Survived <- as.factor(df$Survived)
head(df)
##   PassengerId Survived Pclass
## 1           1       No      3
## 2           2      Yes      1
## 3           3      Yes      3
## 4           4      Yes      1
## 5           5       No      3
## 6           6       No      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
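
The two steps above can also be written as a single `factor()` call, which additionally fixes the order of the levels. A minimal sketch on a made-up vector (`survived_raw` is hypothetical, not the dataset column):

```r
# Hypothetical 0/1 indicator values
survived_raw <- c(0, 1, 1, 0)

# levels gives the raw codes, labels gives the names shown for them
survived <- factor(survived_raw, levels = c(0, 1), labels = c("No", "Yes"))

print(survived)
print(levels(survived))
```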

Embarked

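Provide a clear indicator of the embarkation port instead of an abbreviation.

# Map the port codes to full names. Note that any value other than "S" or "C",
# including a blank entry, is labelled "Queenstown" by this nested ifelse().
df$Embarked <- ifelse(df$Embarked=="S","Southampton", ifelse(df$Embarked=="C","Cherbourg", "Queenstown"))
df$Embarked <- as.factor(df$Embarked)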
head(df)
##   PassengerId Survived Pclass
## 1           1       No      3
## 2           2      Yes      1
## 3           3      Yes      3
## 4           4      Yes      1
## 5           5       No      3
## 6           6       No      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin    Embarked
## 1        A/5 21171  7.2500       Southampton
## 2         PC 17599 71.2833   C85   Cherbourg
## 3 STON/O2. 3101282  7.9250       Southampton
## 4           113803 53.1000  C123 Southampton
## 5           373450  8.0500       Southampton
## 6           330877  8.4583        Queenstown
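
A nested `ifelse()` sends every value that is not "S" or "C" to the last branch. As an alternative sketch, a named lookup vector maps only the known codes and leaves anything else as `NA`. The names `port_names` and `embarked_raw` are assumptions made for this illustration, and the values are made up.

```r
# Lookup table from port code to full name
port_names <- c(S = "Southampton", C = "Cherbourg", Q = "Queenstown")

# Hypothetical raw codes, including one blank entry
embarked_raw <- c("S", "C", "", "Q")

# Indexing by name returns NA for codes that are not in the lookup table
embarked <- factor(unname(port_names[embarked_raw]), levels = unname(port_names))

print(embarked)
```

With this approach an unexpected or blank code shows up as a missing value instead of being silently counted under one of the ports.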

Converting categorical attributes from int to factor

df$Pclass <- as.factor(df$Pclass)
df$SibSp <- as.factor(df$SibSp)
df$Parch <- as.factor(df$Parch)
head(df)
##   PassengerId Survived Pclass
## 1           1       No      3
## 2           2      Yes      1
## 3           3      Yes      3
## 4           4      Yes      1
## 5           5       No      3
## 6           6       No      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin    Embarked
## 1        A/5 21171  7.2500       Southampton
## 2         PC 17599 71.2833   C85   Cherbourg
## 3 STON/O2. 3101282  7.9250       Southampton
## 4           113803 53.1000  C123 Southampton
## 5           373450  8.0500       Southampton
## 6           330877  8.4583        Queenstown

The summary above shows that the ‘Age’ column has 177 missing values. Because ‘Age’ contains outliers (the maximum is 80 while the median is 28), imputing with the mean is not ideal, since the mean is pulled towards extreme values. I have therefore chosen to impute the missing values with the median instead.

df$Age[is.na(df$Age)] <- round(median(df$Age, na.rm = TRUE))
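
To illustrate why the median is the more robust choice, here is a small sketch on a made-up vector (`ages` is hypothetical, not the dataset column) with one extreme value:

```r
# Hypothetical ages with one outlier (80) and one missing value
ages <- c(20, 22, 25, 28, 80, NA)

age_mean <- mean(ages, na.rm = TRUE)      # pulled upwards by the outlier
age_median <- median(ages, na.rm = TRUE)  # stays with the bulk of the data

# Fill the missing entry with the median, as done for df$Age above
ages_imputed <- ages
ages_imputed[is.na(ages_imputed)] <- age_median

print(c(mean = age_mean, median = age_median))
print(ages_imputed)
```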

Check the dataset again for any remaining missing values.

colSums(is.na(df))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Now there are no missing values left in the ‘Age’ attribute. Time to analyze each of the attributes!
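
`is.na()` only detects true `NA` values, so blank entries that were read as empty strings (the `str(df)` output shows `""` in ‘Cabin’) are not counted by the check above. A minimal sketch of a check that counts both, where `count_missing` is a hypothetical helper and `toy` is made-up data:

```r
# Count NA values plus empty strings in every column of a data frame
count_missing <- function(data) {
  vapply(data, function(col) {
    # Only character columns can hold empty strings
    blank <- if (is.character(col)) !is.na(col) & col == "" else FALSE
    sum(is.na(col) | blank)
  }, integer(1))
}

# Hypothetical sample: one NA in Age, two blank strings in Cabin
toy <- data.frame(Age = c(22, NA, 35),
                  Cabin = c("", "C85", ""),
                  stringsAsFactors = FALSE)

missing_counts <- count_missing(toy)
print(missing_counts)
```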

Data Analysis

Univariate data analysis

Let’s try and visualize each of the relevant attributes of the dataset and see what more information the data can tell us.

Pclass

ggplot(data = df, aes(x = Pclass, fill = Pclass)) +
  geom_bar(position = "dodge") +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(0.9), vjust = -0.2) +
  ylab("Number of Passengers")

Most passengers travelled in 3rd class (the cheapest tickets), followed by 1st and then 2nd class. Somewhat surprisingly, there were slightly more 1st class passengers than 2nd class passengers.

Survived

ggplot(data = df, aes(x = Survived, fill = Survived)) +
  geom_bar(position = "dodge") +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(0.9), vjust = -0.2) +
  ylab("Number of Passengers")

207 more passengers died than survived (549 versus 342).
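
The difference of 207 can be cross-checked against the `summary(df)` output, which reports 891 rows and a mean of 0.3838 for the original 0/1 ‘Survived’ column. A short worked calculation (the variable names are only for this illustration):

```r
n_passengers <- 891      # rows reported by str(df)
survival_rate <- 0.3838  # mean of the 0/1 Survived column in summary(df)

# The mean of a 0/1 column is the share of ones, so this recovers the count
n_survived <- round(n_passengers * survival_rate)
n_died <- n_passengers - n_survived
difference <- n_died - n_survived

print(c(survived = n_survived, died = n_died, difference = difference))
```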

Age

ggplot(data = df, aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  xlab("Age")

Passengers of all ages were aboard, from infants to the elderly (0.42 to 80 years old). The tallest bar is around 30 years old, but this peak is inflated by the 177 missing ages that were imputed with the median of 28.

Fare

ggplot(data = df, aes(x = Fare)) +
  geom_histogram(binwidth = 15) +
  xlab("Fare")

The fare distribution is consistent with the ticket classes: most tickets fall into the cheapest fare bins, which matches 3rd class being the largest group. The distribution is strongly right-skewed, with a median fare of 14.45 against a maximum of 512.33.

Bivariate data analysis

Survived and Age

ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Survived), binwidth = 2.5) +
ylab("Frequency")

Survived and Sex

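ggplot(df, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "dodge") +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(0.9), vjust = -0.2) +
  ylab("Number of Passengers") + xlab("Sex")

A much larger share of women than men survived, which is consistent with women being given priority for the lifeboats.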
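
The bar chart shows counts, which are hard to compare when the groups differ in size. A minimal sketch of turning counts into survival rates per group with `table()` and `prop.table()`; the `toy` data frame is made up for illustration and does not reproduce the Titanic figures:

```r
# Hypothetical sample of seven passengers
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No")
)

# Cross-tabulate the counts, then convert each row to proportions
counts <- table(toy$Sex, toy$Survived)
rates <- prop.table(counts, margin = 1)  # margin = 1 makes each row sum to 1

print(counts)
print(round(rates, 2))
```

The same call on `df$Sex` and `df$Survived` would give the survival rate within each sex.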

Survived and Pclass

ggplot(df, aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "dodge") +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(0.9), vjust = -0.2) +
  ylab("Number of Passengers") + xlab("Passenger Class")

Passengers who paid more for a 1st class ticket appear to have had a much better chance of survival than the others.

Age and Sex

ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Sex), binwidth = 2.5) +
ylab("Frequency")

Age and Pclass

ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Pclass), binwidth = 2.5) +
ylab("Frequency")

Regardless of ticket class, most passengers fall in a similar age group of around 30 years old, although the imputed ages again contribute to this peak.

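# Scatterplot matrix of the selected attributes
chosen <- c("Survived", "Pclass", "Sex", "Age", "Fare")
plot(df[, chosen])

Numerical attributes such as Age and Fare work well in a scatterplot matrix, whereas the categorical attributes are plotted as integer codes and only produce stripes of points, so their panels are much less informative.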
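
The reason is that `plot()` on a data frame with more than two columns draws a scatterplot matrix from `data.matrix()`, which replaces every category by an integer code. A small sketch on made-up data (`toy` is hypothetical) shows the conversion:

```r
# Hypothetical sample with one categorical and one numeric column
toy <- data.frame(Sex = c("male", "female", "female"),
                  Age = c(22, 38, 26),
                  stringsAsFactors = FALSE)

# Character columns become factors and then integer codes,
# assigned in alphabetical order of the levels (female = 1, male = 2)
coded <- data.matrix(toy)

print(coded)
```

Plots that keep the categories as groups, such as the bar charts and frequency polygons above, are usually a better fit for these attributes.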

Conclusion

Further relationships between the attributes could be explored, followed by some feature engineering, in order to build a machine learning model that predicts passenger survival.