This is my submission for WQD7004 - Assignment 1.
In this notebook, I analyze the Titanic data (obtained from Kaggle) with a simple Exploratory Data Analysis to understand the data better and uncover any interesting underlying information.
First and foremost, some backstory on the dataset. During its maiden voyage in 1912, the British ocean liner RMS Titanic struck an iceberg and sank, despite being widely considered unsinkable. The disaster is commonly referred to simply as the “Titanic”, after the ship.
I will be importing these packages to ease the visualization in this analysis.
ggplot2 - library used for visualization
dplyr - library used for data manipulation
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Other than the packages imported above, I will only be using the packages that ship with R for this analysis.
df <- read.csv("train.csv")
Let’s see the first several rows of the dataset:
head(df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
Next, let’s look at the summary statistics of the data:
summary(df)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
There are 177 missing values within the ‘Age’ attribute.
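Note that `summary()` only reports `NA` values, so empty strings in character columns (such as the blank ‘Cabin’ entries visible in `head(df)`) are not counted as missing. A minimal sketch of counting both kinds per column, using a small made-up data frame (`toy`, `na_count`, `blank_count` and the values are illustrative assumptions, not the Titanic data):

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_count <- colSums(is.na(toy))

# Number of empty strings in each column;
# comparing NA with "" yields NA, so those cells are dropped via na.rm
blank_count <- colSums(toy == "", na.rm = TRUE)

# Side-by-side overview of both kinds of missing value
missing_overview <- data.frame(na = na_count, blank = blank_count)
print(missing_overview)
```

The same two calls could be applied to `df` to check whether columns other than ‘Age’ contain blanks.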
The structure of the data is shown below:
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Based on this first look at the data, several columns need some preliminary processing.
Converting the numeric survival indicator to a factor (categorical):
df$Survived <- ifelse(df$Survived==1,"Yes","No")
df$Survived <- as.factor(df$Survived)
head(df)
## PassengerId Survived Pclass
## 1 1 No 3
## 2 2 Yes 1
## 3 3 Yes 3
## 4 4 Yes 1
## 5 5 No 3
## 6 6 No 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
Provide a clear indicator of the embarkation port instead of using abbreviations:
df$Embarked <- ifelse(df$Embarked=="S","Southampton", ifelse(df$Embarked=="C","Cherbourg", "Queenstown"))
df$Embarked <- as.factor(df$Embarked)
head(df)
## PassengerId Survived Pclass
## 1 1 No 3
## 2 2 Yes 1
## 3 3 Yes 3
## 4 4 Yes 1
## 5 5 No 3
## 6 6 No 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 Southampton
## 2 PC 17599 71.2833 C85 Cherbourg
## 3 STON/O2. 3101282 7.9250 Southampton
## 4 113803 53.1000 C123 Southampton
## 5 373450 8.0500 Southampton
## 6 330877 8.4583 Queenstown
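One caveat: the nested `ifelse()` above labels every value that is neither "S" nor "C" as "Queenstown", which would include blank or unexpected codes if any are present. A hedged alternative sketch uses a named lookup vector so that unmatched codes become `NA` instead. The names `port_names`, `codes` and `ports` are illustrative, and the input values are made up:

```r
# Lookup table from abbreviation to full port name
port_names <- c(S = "Southampton", C = "Cherbourg", Q = "Queenstown")

# Made-up codes, including one blank entry
codes <- c("S", "C", "Q", "", "S")

# match() returns NA for codes that are not in the lookup table,
# so unknown or blank codes stay NA rather than being relabelled
idx <- match(codes, names(port_names))
ports <- factor(unname(port_names[idx]), levels = unname(port_names))

print(table(ports, useNA = "ifany"))
```

This keeps the decision about how to handle unknown ports explicit instead of folding them into one category.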
Converting categorical attributes from int to factor
df$Pclass <- as.factor(df$Pclass)
df$SibSp <- as.factor(df$SibSp)
df$Parch <- as.factor(df$Parch)
head(df)
## PassengerId Survived Pclass
## 1 1 No 3
## 2 2 Yes 1
## 3 3 Yes 3
## 4 4 Yes 1
## 5 5 No 3
## 6 6 No 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 Southampton
## 2 PC 17599 71.2833 C85 Cherbourg
## 3 STON/O2. 3101282 7.9250 Southampton
## 4 113803 53.1000 C123 Southampton
## 5 373450 8.0500 Southampton
## 6 330877 8.4583 Queenstown
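When several columns need the same conversion, the repeated assignments can also be written once. A small sketch on a made-up data frame (`toy` and `cat_cols` are illustrative names, not part of the original analysis):

```r
# Toy data frame with integer-coded categorical columns (values are made up)
toy <- data.frame(
  Pclass = c(3L, 1L, 3L),
  SibSp  = c(1L, 1L, 0L),
  Parch  = c(0L, 0L, 2L),
  Age    = c(22, 38, 26)
)

# Columns that should be treated as categorical
cat_cols <- c("Pclass", "SibSp", "Parch")

# Apply as.factor() to each listed column; other columns are untouched
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```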
As the summary showed, the ‘Age’ column has 177 missing values. Because the ‘Age’ attribute contains outlier(s), imputing the missing values with the mean might not be a good idea: the mean is pulled towards extreme values, so the filled-in ages would be less representative. Hence, I have chosen to impute the missing values with the median instead.
df$Age[is.na(df$Age)] <- round(median(df$Age, na.rm = TRUE))
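To illustrate why the median is less sensitive to outliers than the mean, here is a small worked sketch with made-up ages (`ages` and `ages_with_outlier` are illustrative and not taken from the dataset):

```r
# Five made-up ages without any extreme value
ages <- c(22, 25, 28, 30, 35)

# The same ages plus one much older passenger
ages_with_outlier <- c(ages, 80)

# Without the outlier, mean and median agree
print(mean(ages))
print(median(ages))

# With the outlier, the mean moves much further than the median
print(mean(ages_with_outlier))
print(median(ages_with_outlier))
```

In this toy case a single extreme value shifts the mean by more than eight years but the median by only one.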
Checking every column again for missing values:
colSums(is.na(df))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
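Now there are no missing values within the ‘Age’ attribute. Note that this check only counts NA values, so empty strings such as the blank ‘Cabin’ entries are not flagged. Time to analyze each of the attributes!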
Let’s try and visualize each of the relevant attributes of the dataset and see what more information the data can tell us.
ggplot(data=df, aes(x=Pclass, fill = Pclass)) +
geom_bar(position = "dodge") +
geom_text(stat='count', aes(label=..count..), position = position_dodge(0.9),vjust=-0.2) +
ylab("Number of Passengers")
Most of the passengers travelled in 3rd class (the cheapest ticket to board the Titanic), followed by 1st and 2nd class. Surprisingly, there are slightly more 1st class passengers than 2nd class passengers.
ggplot(data=df, aes(x=Survived, fill = Survived)) +
geom_bar(position = "dodge") +
geom_text(stat='count', aes(label=..count..), position = position_dodge(0.9),vjust=-0.2) +
ylab("Number of Passengers")
Indeed, 207 more passengers died than survived the incident.
ggplot(data=df, aes(x=Age)) +
geom_histogram(binwidth = 5) +
xlab("Age")
Passengers of many ages were aboard the vessel, from infants to the elderly (0.42 to 80 years old). Most of them are around 30 years old, although this peak is inflated by the 177 missing ages that were imputed with the median (28).
ggplot(data=df, aes(x=Fare)) +
geom_histogram(binwidth = 15) +
xlab("Fare")
The fare distribution appears consistent with the ticket-class distribution: the cheapest tickets were purchased by far the most often.
ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Survived), binwidth = 2.5) +
ylab("Frequency")
ggplot(df, aes(x=Sex, fill=Survived)) +
geom_bar(position = "dodge") +
geom_text(stat='count', aes(label=..count..), position = position_dodge(0.9), vjust=-0.2) +
ylab("Number of Passengers") + xlab("Sex")
A far larger share of women survived than men, which is consistent with women being given priority for the lifeboats.
ggplot(df, aes(x=Pclass, fill=Survived)) +
geom_bar(position = "dodge") +
geom_text(stat='count', aes(label=..count..), position = position_dodge(0.9), vjust=-0.2) +
ylab("Number of Passengers") + xlab("Passenger Class")
It seems that passengers who paid more (Pclass 1) had a much better chance of survival than the others.
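Because the classes differ in size, raw counts can be hard to compare, and row-wise proportions make the comparison more direct. A minimal base R sketch on made-up values (`toy`, `counts` and `rates` are illustrative names, and the numbers are not the Titanic figures):

```r
# Toy data: ticket class and survival outcome for ten made-up passengers
toy <- data.frame(
  Pclass   = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)),
  Survived = factor(
    c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"),
    levels = c("No", "Yes")
  )
)

# Cross-tabulate class (rows) against survival (columns)
counts <- table(toy$Pclass, toy$Survived)

# margin = 1 divides each row by its row total,
# giving the share of survivors within each class
rates <- prop.table(counts, margin = 1)

print(round(rates, 2))
```

Applied to `df$Pclass` and `df$Survived`, the same two calls would give the survival share per class rather than the absolute counts shown in the bar chart.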
ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Sex), binwidth = 2.5) +
ylab("Frequency")
ggplot(df) + geom_freqpoly(mapping = aes(x = Age, color = Pclass), binwidth = 2.5) +
ylab("Frequency")
Most of the passengers, regardless of their ticket class, come from a similar age group, i.e. around 30 years old.
chosen <- c("Survived", "Pclass", "Sex","Age","Fare")
plot(df[,colnames(df) %in% chosen])
Numerical attributes like Age and Fare combine well into a scatterplot, whereas the categorical attributes are not displayed as clearly in this kind of plot.
As a next step, we can explore more relationships between the attributes and do some feature engineering to model and predict the survival of the passengers using machine learning.