Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective and Target Audience
The original objective was to ‘determine if with the other features/information about the passengers it is possible to determine those who are likely to survive’ (Loquarts, 2016).
The target audience wasn’t stated, but judging from the work as a whole, it is most likely the general public or anyone interested in the Titanic.
Main Issues
After Reconstruction
References
Titanic Ship, Amy Tikkanen (August 30, 2019), Encyclopaedia Britannica, https://www.britannica.com/topic/Titanic
Visualization — Learning From Disaster: Titanic, Loquarts (December 9, 2016), Towards Data Science, Medium, https://towardsdatascience.com/visualization-learning-from-disaster-titanic-42eeb99cdbdc
Data Visualisation Lecture Notes, Chapters 1 to 5, James Baglin (Semester 2, 2019), Canvas, RMIT University
Titanic: Machine Learning from Disaster, Training Set, Kaggle, https://www.kaggle.com/c/titanic/data
The following code was used to fix the issues identified in the original.
library(ggplot2)
library(tidyverse)
# import train.csv
rmstitanic <- read.csv("train.csv")
class(rmstitanic)
## [1] "data.frame"
# remove every row that contains a missing value (NA)
rmstitanic <- na.omit(rmstitanic)
# Replace the 0/1 codes in "Survived" with meaningful labels, then convert it to a factor
rmstitanic$Survived[rmstitanic$Survived == 1] <- "Survived"
rmstitanic$Survived[rmstitanic$Survived == 0] <- "Dead"
rmstitanic$Survived <- as.factor(rmstitanic$Survived)
rmstitanic$Survived %>% head()
## [1] Dead Survived Survived Survived Dead Dead
## Levels: Dead Survived
# Replace the codes in "Pclass" with meaningful labels, then convert it to a factor ordered from lowest to top class
rmstitanic$Pclass[rmstitanic$Pclass == 1] <- "Class 1" # top class
rmstitanic$Pclass[rmstitanic$Pclass == 2] <- "Class 2"
rmstitanic$Pclass[rmstitanic$Pclass == 3] <- "Class 3" # lowest class
# factor() with explicit levels sets the order without renaming any values;
# assigning to levels() instead would swap the "Class 1" and "Class 3" labels
rmstitanic$Pclass <- factor(rmstitanic$Pclass, levels = c("Class 3", "Class 2", "Class 1"))
rmstitanic$Pclass %>% head()
## [1] Class 3 Class 1 Class 3 Class 1 Class 3 Class 1
## Levels: Class 3 Class 2 Class 1
# rename "Survived" to "SurvivedOrDead" so that the name reflects the meaning of the variable more clearly
names(rmstitanic)[2] <- "SurvivedOrDead"
# plot four variables (age, fare, survival and sex) in one visualisation for a more comprehensive understanding of the relations between passenger ticket fare and survival
p1 <- ggplot(data = rmstitanic, aes(x = Age, y = Fare, colour = SurvivedOrDead)) +
  geom_point() +
  facet_grid(. ~ Sex) +
  labs(title = "Relations Between Ticket Fares and Survival Situations By Gender",
       x = "Passenger's Age", y = "Passenger's Ticket Fare")
p1
# plot relations between passenger ticket fare and ticket class
p2 <- ggplot(data = rmstitanic, aes(y = Fare, x = Pclass)) +
  geom_boxplot(outlier.shape = NA) + # hide boxplot outliers; the jittered points already show every fare
  geom_jitter(alpha = 2/5) +
  ggtitle("Relations Between Passenger Ticket Fare and Ticket Class") +
  labs(x = "Ticket Class", y = "Passenger's Ticket Fare") +
  # red point marks the mean fare; "fun" replaces the deprecated "fun.y" (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, colour = "red", geom = "point", shape = 20) +
  theme_minimal()
p2
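The ordering of the ticket classes on the x-axis depends on how R handles factor levels, which is easy to get wrong. The following is a minimal sketch on a made-up vector (`pclass_raw` and the other object names are hypothetical, not part of the Titanic data) showing the difference between assigning to `levels()` and passing `levels` to `factor()`:

```r
# Hypothetical toy vector standing in for the Pclass column
pclass_raw <- c(3, 1, 3, 1, 2)
labelled <- paste("Class", pclass_raw)

# Assigning to levels() renames the existing levels in place:
# the alphabetical levels "Class 1", "Class 2", "Class 3" are
# overwritten position by position, so the labels get swapped
relabelled <- as.factor(labelled)
levels(relabelled) <- c("Class 3", "Class 2", "Class 1")

# Passing levels to factor() changes only the order of the levels;
# each observation keeps its own label
reordered <- factor(labelled, levels = c("Class 3", "Class 2", "Class 1"))

as.character(relabelled) # labels no longer match pclass_raw
as.character(reordered)  # labels still match pclass_raw
levels(reordered)        # plotting order: Class 3, Class 2, Class 1
```

Both factors report the same level order, so the mistake is not visible from `levels()` alone; comparing a few values against the raw data is a quick way to check.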
Data Reference
The following plot fixes the main issues in the original.