Data Visualization - Titanic
1. Intro
The beginning
Hello everyone, welcome to my second RPubs post. I made this Rmd to fulfill my LBB assignment. I hope you like it :)
About the Data
This is the Titanic dataset that I got from Kaggle. It records which passengers of the Titanic survived, along with attributes such as Age, Sex and more. We’ll look into what set the survivors apart: does survival depend on Age, Sex or Pclass? Let’s explore it.
Set Up
We’ll use several R libraries: ggplot2, scales, ggthemes, tidyr and ggpubr. You can install each of them with the install.packages() command, for example install.packages("ggplot2"), and so on.
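Installed packages still have to be attached in every session before their functions can be called. The following is a minimal sketch, assuming the five packages named above; the names pkgs, installed and missing_pkgs are only illustrative, and packages that are not installed are reported rather than installed automatically.

```r
# Packages used in this document
pkgs <- c("ggplot2", "scales", "ggthemes", "tidyr", "ggpubr")

# Check which of them are installed (TRUE / FALSE per package)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Report anything that still needs install.packages()
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```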
2. Data Pre-Processing
First, let’s load the Titanic dataset
titanic <- read.csv("dataset/titanic.csv")
Done, let’s move to the next step
2.1 Dataset Inspection
Get the first 5 rows
head(titanic, 5)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
Get the last 5 rows
tail(titanic, 5)
## PassengerId Survived Pclass Name Sex
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 887 27 0 0 211536 13.00 S
## 888 19 0 0 112053 30.00 B42 S
## 889 NA 1 2 W./C. 6607 23.45 S
## 890 26 0 0 111369 30.00 C148 C
## 891 32 0 0 370376 7.75 Q
Get the total number of rows / observations
nrow(titanic)
## [1] 891
Get the total number of columns
ncol(titanic)
## [1] 12
Get all column names
names(titanic)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
Get the dimensions of the dataset
dim(titanic)
## [1] 891 12
From our inspection we can note a few things:
The Titanic dataset contains 891 rows and 12 columns
The column names are: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
2.2 Data Cleaning & Coercion
Check the data type of each column using str() (structure)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
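We can see from the result above that several columns have an unsuitable data type. Let’s convert them to the correct type (explicit coercion). Note that as.integer() drops the fractional part of Age, which is why the minimum Age in the summary below is 0.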
titanic$Sex <- as.factor(titanic$Sex)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Age <- as.integer(titanic$Age)
titanic$Survived <- as.factor(titanic$Survived)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : int 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.00 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.68 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 Length:891
## 1st Qu.: 7.91 Class :character Class :character
## Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Done. Each column now has the correct type. The summary is also more informative now, especially for the factor columns, which are reported as counts per level.
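To see what the coercion does to the values themselves, here is a small sketch on a toy data frame. The values are hypothetical and not taken from the Titanic file; only the as.factor() and as.integer() calls mirror the steps above.

```r
# Toy values (hypothetical, not from the Titanic file)
toy <- data.frame(
  Sex = c("male", "female", "female"),
  Age = c(22, 0.42, 28.5),
  stringsAsFactors = FALSE
)

toy$Sex <- as.factor(toy$Sex)   # character -> factor, levels sorted alphabetically
toy$Age <- as.integer(toy$Age)  # numeric -> integer, fractional part is dropped

levels(toy$Sex)   # "female" "male"
toy$Age           # 22 0 28
summary(toy$Sex)  # counts per level: female 2, male 1
```

If the fractional ages matter for an analysis, keeping Age as numeric is the safer choice.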
Check for missing values
colSums(is.na(titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
We have missing values in the Age column
anyNA(titanic)
## [1] TRUE
Here are the first 6 rows with a missing Age, shown by Name
head(titanic[which(is.na(titanic$Age)), c("Name", "Age")])
## Name Age
## 6 Moran, Mr. James NA
## 18 Williams, Mr. Charles Eugene NA
## 20 Masselmani, Mrs. Fatima NA
## 27 Emir, Mr. Farred Chehab NA
## 29 O'Dwyer, Miss. Ellen "Nellie" NA
## 30 Todoroff, Mr. Lalio NA
From the result above, we can see that the Age column has 177 missing values. Let’s drop those rows using the complete.cases() function in R
titanic <- titanic[complete.cases(titanic), ]
nrow(titanic)
## [1] 714
anyNA(titanic)
## [1] FALSE
Yeayyy, there are no missing values in our dataset now. Don’t worry, we only removed about 20% of the rows (177 of 891), so we can still extract important information from the dataset
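The share of removed rows can be checked with a little arithmetic on the row counts reported above. The second half of this sketch repeats the complete.cases() filter on a toy data frame whose values are hypothetical.

```r
# Row counts reported above
rows_before <- 891
rows_after <- 714

rows_removed <- rows_before - rows_after
share_removed <- round(100 * rows_removed / rows_before, 1)
rows_removed   # 177
share_removed  # 19.9 (percent)

# The same filter on a toy data frame (hypothetical values)
toy <- data.frame(Name = c("a", "b", "c"), Age = c(22, NA, 35))
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)  # 2, the row with the missing Age is gone
```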
Let’s subset our dataset, because we will not use all of the columns
titanic <- titanic[, !(colnames(titanic) %in% c("Ticket", "Fare", "Cabin", "Embarked"))]
head(titanic, 5)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
Done, we successfully removed the Ticket, Fare, Cabin and Embarked columns, because we only focus on survival by Age, Sex and Pclass
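The column filter works by building a logical vector with %in% and negating it. A small sketch on a toy data frame (hypothetical values; drop_cols and keep are illustrative names) makes the intermediate step visible:

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  Name = c("a", "b"),
  Age = c(22, 38),
  Fare = c(7.25, 71.28),
  Cabin = c("", "C85")
)

drop_cols <- c("Fare", "Cabin")
keep <- !(colnames(toy) %in% drop_cols)
keep  # TRUE TRUE FALSE FALSE

# drop = FALSE keeps a data.frame even if only one column remains
toy_small <- toy[, keep, drop = FALSE]
names(toy_small)  # "Name" "Age"
```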
3. Data Processing and Visualization
Let’s summarise our data with the summary() function in R
summary(titanic)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:424 1:186 Length:714 female:261
## 1st Qu.:222.2 1:290 2:173 Class :character male :453
## Median :445.0 3:355 Mode :character
## Mean :448.6
## 3rd Qu.:677.8
## Max. :891.0
## Age SibSp Parch
## Min. : 0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.68 Mean :0.5126 Mean :0.4314
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :80.00 Max. :5.0000 Max. :6.0000
Summary:
There are 424 deaths and 290 survivors in the cleaned data. That means more people died than survived
The Titanic had several passenger classes, ranked in order. Class 1 is the first (VIP) class, class 2 is the second class and class 3 is the third (ordinary) class. Most people were in class 3
There were more men than women on the Titanic: 453 male and 261 female
The average age of the Titanic’s passengers is about 30 years (mean 29.68). The oldest is 80 years old
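The counts in the summary are easier to compare as percentages of the 714 remaining passengers. A minimal sketch using only the numbers printed above; the helper pct() is an illustrative name, not part of the original analysis.

```r
# Counts taken from the summary() output above
n_total <- 714
n_survived <- 290
n_male <- 453
n_class3 <- 355

# Percentage of the total, rounded to one decimal
pct <- function(x, total) round(100 * x / total, 1)

pct(n_survived, n_total)  # 40.6 percent survived
pct(n_male, n_total)      # 63.4 percent were male
pct(n_class3, n_total)    # 49.7 percent travelled in class 3
```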
Let’s see whether death and survival on the Titanic depend on Age and Sex
First, let’s check the survivors by Sex and Age
# Calculate the frequency of Death and Live by Sex and by Age
sex_freq <- as.data.frame(table(titanic$Sex, titanic$Survived))
age_freq <- as.data.frame(table(titanic$Age, titanic$Survived))
# Sort both tables by frequency, highest first
sex_freq <- sex_freq[order(sex_freq$Freq, decreasing = TRUE),]
age_freq <- age_freq[order(age_freq$Freq, decreasing = TRUE),]
# Let's plot the sex_freq data.frame
# position = "fill" stacks the bars to 1, so the y axis shows proportions
ggplot(sex_freq, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_col(aes(fill = Var2), position = "fill") +
  labs(
    title = "Proportion of Male and Female",
    subtitle = " that Survived and Not",
    x = "Sex",
    y = "Proportion"
  ) +
  theme_solarized() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))
For better visualization and understanding, we’ll break the plot into one panel each for female and male using facets in the ggplot2 library
ggplot(sex_freq, aes(x = reorder(Var1, Freq) , y = Freq)) +
geom_col(aes(fill = Var2), position = "dodge") +
facet_wrap(~Var1, scales = "free_y") +
labs(
title = "Total Death and Live of Male and Female",
subtitle = "In Titanic Ship",
caption = "Source: Titanic dataset-kaggle",
x = "Sex",
y = "Total Survived"
) +
theme_solarized() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_blank(),
legend.title = element_blank()
) +
scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))
Females survived more often than males. This is evident from the number of female deaths, which is much lower than the number of male deaths. It is possible that the evacuation of the Titanic prioritized women over men, which would make sense if women were considered more vulnerable in that situation
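The comparison between the sexes can also be expressed as a survival rate per sex, which is what the stacked "fill" plot shows graphically. The sketch below uses prop.table() on a toy data frame; the values are hypothetical and are NOT the real Titanic counts.

```r
# Hypothetical toy data, NOT the real Titanic counts
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate Sex against Survived
counts <- table(toy$Sex, toy$Survived)

# margin = 1 divides each row by its row total, so every row sums to 1
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

Run on the real data with table(titanic$Sex, titanic$Survived), the column "1" would give the survival rate of each sex.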
Second, let’s check the survivors by Age
# take the 6 most frequent ages among those who died and those who survived
age_death <- head(age_freq[age_freq$Var2 == 0,])
age_death
## Var1 Var2 Freq
## 29 28 0 20
## 22 21 0 19
## 19 18 0 17
## 26 25 0 17
## 31 30 0 17
## 20 19 0 16
age_live <- head(age_freq[age_freq$Var2 == 1,])
age_live
## Var1 Var2 Freq
## 96 24 1 15
## 94 22 1 11
## 99 27 1 11
## 107 35 1 11
## 108 36 1 11
## 102 30 1 10
# create a reusable plotting function
plot_fn <- function(data, y_title, fill_color) {
  # return the plot directly; the fill is a fixed color, so no fill scale is needed
  ggplot(data, aes(x = Var1, y = Freq)) +
    geom_col(fill = fill_color) +
    labs(x = "Age",
         y = y_title) +
    theme_solarized() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5)
    )
}
# assign the plots to the variables plot_death and plot_live
plot_death <- plot_fn(age_death, "Total Death", "#e41a1c")
plot_live <- plot_fn(age_live, "Total Live", "#377eb8")
# put the plots together
figure_one <- ggarrange(plot_death, plot_live)
annotate_figure(
  figure_one,
  top = text_grob(
    "Total Death and Live by Age",
    color = "black",
    face = "bold",
    size = 14
  ),
  bottom = text_grob(
    "Data source: \n Titanic dataset-kaggle",
    color = "black",
    hjust = 1.1,
    x = 1,
    face = "italic",
    size = 10
  )
)
We can conclude from the bar plots above that the most frequent ages at death lie between 18 and 30 years old, while the most frequent ages among survivors lie between 22 and 36 years old. That means many young people, for example those aged 18, 21 and 28, died in the Titanic disaster, whereas the most common survivor ages range from 22 to 36 years old
Let’s check which ages are most common among the passengers on the Titanic
titanic_age <- as.data.frame(table(titanic$Age))
head(titanic_age[order(titanic_age$Freq, decreasing = TRUE),], 30)
## Var1 Freq
## 25 24 31
## 23 22 27
## 29 28 27
## 31 30 27
## 19 18 26
## 20 19 25
## 22 21 24
## 26 25 23
## 37 36 23
## 30 29 20
## 33 32 20
## 27 26 18
## 28 27 18
## 36 35 18
## 17 16 17
## 32 31 17
## 21 20 16
## 24 23 16
## 35 34 16
## 34 33 15
## 41 40 15
## 40 39 14
## 46 45 14
## 18 17 13
## 43 42 13
## 39 38 11
## 3 2 10
## 5 4 10
## 51 50 10
## 45 44 9
The most common ages on the Titanic range from about 20 to 40 years old. That’s why many of the survivors also fall in the age range of 22 to 36 years old
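Because the raw counts follow the age distribution, it can help to divide the survivor count of each age by the number of passengers of that age. The sketch below does this for the six ages in age_live, using only the counts printed in the tables above; the data frame names are illustrative.

```r
# Survivor counts per age, copied from the age_live output above
live <- data.frame(Age = c(24, 22, 27, 35, 36, 30),
                   Live = c(15, 11, 11, 11, 11, 10))

# Passenger counts for the same ages, copied from the titanic_age output above
total <- data.frame(Age = c(24, 22, 27, 35, 36, 30),
                    Total = c(31, 27, 18, 18, 23, 27))

# Join on Age and compute the share of survivors per age
rates <- merge(live, total, by = "Age")
rates$Rate <- round(rates$Live / rates$Total, 2)
rates[order(rates$Rate, decreasing = TRUE), ]
```

Ages 27 and 35 come out highest at 0.61, while age 30 is lowest of the six at 0.37, so a high survivor count does not always mean a high survival rate.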
Let’s check whether survival depends on Pclass
# Check how many people are in each Pclass
summary(titanic$Pclass)
## 1 2 3
## 186 173 355
As we can see, most people are in Pclass 3, the regular class
We can sum the total deaths and survivors on the Titanic by Pclass
pclass_death <- aggregate(Survived == 0 ~ Pclass, titanic, sum)
names(pclass_death) <- c("Pclass", "Death")
pclass_live <- aggregate(Survived == 1 ~ Pclass, titanic, sum)
names(pclass_live) <- c("Pclass", "Live")
Add the Live column from the pclass_live data.frame to pclass_death
pclass_combine <- cbind(pclass_death, Live = pclass_live$Live)
Let’s change the format of our table from wide format to long format
pclass_pivot <- pivot_longer(data = pclass_combine,
cols = c("Death", "Live"),
names_to = "Survived",
values_to = "Value")
pclass_pivot
## # A tibble: 6 × 3
## Pclass Survived Value
## <fct> <chr> <int>
## 1 1 Death 64
## 2 1 Live 122
## 3 2 Death 90
## 4 2 Live 83
## 5 3 Death 270
## 6 3 Live 85
Yeayy, now our table is in long format, so we can use a facet function in ggplot2 to split the plot by the values of the Survived column
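To make the wide-to-long reshaping less of a black box, here is a sketch that rebuilds the same long table with base R only. It is not how pivot_longer() is implemented; it just stacks the Death and Live columns by hand, using the values printed above. The names wide and long are illustrative.

```r
# Wide table: one row per Pclass, one column per outcome (values from the output above)
wide <- data.frame(
  Pclass = factor(c(1, 2, 3)),
  Death = c(64, 90, 270),
  Live = c(122, 83, 85)
)

# Long table: one row per Pclass / outcome pair
long <- data.frame(
  Pclass = rep(wide$Pclass, times = 2),
  Survived = rep(c("Death", "Live"), each = nrow(wide)),
  Value = c(wide$Death, wide$Live)
)

# Sort so the rows appear in the same order as in the printed tibble
long <- long[order(long$Pclass, long$Survived), ]
long
```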
Let’s visualize the data using a bar plot, with facet_wrap to separate the panels by the Survived column
ggplot(pclass_pivot, aes(x = Pclass, y = Value)) +
  geom_col(aes(fill = Survived), position = "dodge") +
  facet_wrap(~Survived, scales = "free_y") +
  labs(
    title = "Total Live and Death by Pclass",
    subtitle = "In Titanic Ship",
    caption = "Source: Titanic dataset-kaggle",
    x = "Pclass",
    y = "Total"
  ) +
  theme_solarized() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))
Wowww, as I guessed, Pclass has an influence on life or death. As the visualization above shows, most of the survivors are from Pclass 1 and most of the dead are from Pclass 3. It suggests that people in Pclass 1 got a higher priority during the evacuation than people in Pclass 3
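Since the three classes differ a lot in size, the survival rate per class is a useful complement to the raw totals. A minimal sketch computed from the values in the pclass_pivot table above; the data frame name pclass_rate is illustrative.

```r
# Deaths and survivors per class, copied from the pclass_pivot output above
pclass_rate <- data.frame(
  Pclass = c(1, 2, 3),
  Death = c(64, 90, 270),
  Live = c(122, 83, 85)
)

# Class size and percentage of survivors per class
pclass_rate$Total <- pclass_rate$Death + pclass_rate$Live
pclass_rate$LivePct <- round(100 * pclass_rate$Live / pclass_rate$Total, 1)
pclass_rate
# LivePct: 65.6 for class 1, 48.0 for class 2, 23.9 for class 3
```

The class totals (186, 173 and 355) match the summary of Pclass shown earlier.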
Let’s see whether there were many siblings and parents travelling with their families on the Titanic
# check for siblings / children and parents on the Titanic
# siblings: ages 1 - 17
sibling <- head(titanic[order(titanic$SibSp, decreasing = TRUE), c("Age", "SibSp")], 15)
sibling
## Age SibSp
## 60 11 5
## 72 16 5
## 387 1 5
## 481 9 5
## 684 14 5
## 17 2 4
## 51 7 4
## 69 17 4
## 120 2 4
## 165 1 4
## 172 4 4
## 183 9 4
## 234 5 4
## 262 3 4
## 267 16 4
# parents: ages 30 - 40
parent <- head(titanic[order(titanic$Parch, decreasing = TRUE), c("Age", "Parch")], 15)
parent
## Age Parch
## 679 43 6
## 14 39 5
## 26 38 5
## 611 39 5
## 639 41 5
## 886 39 5
## 168 45 4
## 361 40 4
## 439 64 4
## 568 29 4
## 87 16 3
## 438 24 3
## 737 48 3
## 775 54 3
## 859 24 3
From the result above, the passengers with the most siblings are 1 to 17 years old, and those with the highest Parch values are mostly around 30 to 40 years old. That means there were quite a lot of children and families on the Titanic
Let’s create a reusable function for plotting our data
plot_fn <- function(data, axis_y, y_title) {
  ggplot(data, aes(x = Age, y = axis_y, color = axis_y)) +
    geom_jitter(aes(size = axis_y), alpha = 0.5, show.legend = FALSE) +
    ylim(0, 7) +
    labs(y = y_title) +
    scale_color_gradient(low = "red3", high = "green2") +
    theme_solarized()
}
Then use the plot_fn function to plot the sibling data.frame and the parent data.frame
plot.one <- plot_fn(sibling, sibling$SibSp, "Siblings")
plot.one
plot.two <- plot_fn(parent, parent$Parch, "Parents")
plot.two
For our convenience in comparing the visuals, let’s put together the plots we have made using the ggarrange function from the ggpubr library
figure <- ggarrange(plot.one, plot.two)
annotate_figure(
figure,
top = text_grob(
"Total Siblings and Parents by Age",
color = "black",
face = "bold",
size = 14
),
bottom = text_grob(
"Data source: \n Titanic dataset-kaggle",
color = "black",
hjust = 1.1,
x = 1,
face = "italic",
size = 10
)
)
It turned out that there were quite a lot of children travelling with their siblings on the Titanic. The passengers with the most siblings (SibSp of 5) are 1, 9, 11, 14 and 16 years old. For parents, or people who are already married, the age range is roughly 30 to 40 years old.
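The ages read off the two tables can be summarised numerically. The sketch below re-enters the 15 rows of the sibling and parent outputs shown above and computes the mean age per SibSp value and the median age of the parent rows; it is only a check on the printed rows, not on the full dataset.

```r
# Rows copied from the sibling output above
sibling <- data.frame(
  Age = c(11, 16, 1, 9, 14, 2, 7, 17, 2, 1, 4, 9, 5, 3, 16),
  SibSp = c(rep(5, 5), rep(4, 10))
)

# Ages copied from the parent output above
parent_age <- c(43, 39, 38, 39, 41, 39, 45, 40, 64, 29, 16, 24, 48, 54, 24)

# Mean age for each SibSp value: 6.6 for SibSp 4, 10.2 for SibSp 5
sib_mean <- aggregate(Age ~ SibSp, data = sibling, FUN = mean)
sib_mean

range(sibling$Age)   # 1 17
median(parent_age)   # 39
```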
4. Final Conclusion
We can draw four conclusions from our data visualizations:
Looking at deaths by age on the Titanic, the frequency is highest at ages 18 to 30. That suggests most of the children were saved in the evacuation: the sibling data shows quite a lot of children on board, yet the most frequent ages at death are in the range of 18 to 30 years old, not 1 to 16 years old.
In the parent data plotted above, most ages fall in the range of 30 to 40 years. The survivor data also contains many people in the age range of 30 to 36 years. That suggests parents or families were given more priority in the evacuation of the Titanic
People in Pclass 1 (VIP class) were prioritized in the evacuation over people in Pclass 3 (regular / ordinary class), as seen from the number of survivors, which is higher in Pclass 1 than in Pclass 3.
Women were prioritized in the evacuation. This is seen in the survivor data, where females survived more often than males