Data Visualization - Titanic

1. Intro

The beginning

Hello everyone, welcome to my second RPubs post. I made this Rmd to fulfill my LBB assignment. I hope you like it :)

About the Data

This is the Titanic dataset from Kaggle. It records which passengers on the Titanic survived, along with attributes such as Age, Sex, and more. We’ll look at whether survival depended on Age, Sex, or Pclass. Let’s explore it.

Set Up

We’ll use several R packages: ggplot2, scales, ggthemes, tidyr, and ggpubr. You can install each of them with the install.packages() command, for example install.packages("ggplot2"), and so on.
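A minimal sketch of the setup step, assuming the five packages listed above are everything the later code needs. The helper name missing_packages is hypothetical. The sketch only reports what is missing and attaches the packages when all of them are available; it does not install anything by itself.

```r
# Packages used in this document (ggplot2 is the package that provides ggplot()).
pkgs <- c("ggplot2", "scales", "ggthemes", "tidyr", "ggpubr")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

if (length(to_install) > 0) {
  # Nothing is installed automatically; this only tells you what to install.
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
} else {
  # Attach every package so ggplot(), theme_solarized(), pivot_longer(),
  # and ggarrange() are available in the code that follows.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Attaching the packages once at the top keeps the plotting code later in the document free of repeated library() calls.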

2. Data Pre-Processing

First, let’s load the Titanic dataset

titanic <- read.csv("dataset/titanic.csv")

Done, let’s move to the next step

2.1 Dataset Inspection

Get first 5 rows

head(titanic, 5)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S

Get last 5 rows

tail(titanic, 5)

##     PassengerId Survived Pclass                                     Name    Sex
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket  Fare Cabin Embarked
## 887  27     0     0     211536 13.00              S
## 888  19     0     0     112053 30.00   B42        S
## 889  NA     1     2 W./C. 6607 23.45              S
## 890  26     0     0     111369 30.00  C148        C
## 891  32     0     0     370376  7.75              Q

Get total rows / observation

nrow(titanic)

## [1] 891

Get total columns

ncol(titanic)

## [1] 12

Get all columns names

names(titanic)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

Get dimension of dataset

dim(titanic)

## [1] 891  12

From our inspection we can gather a few pieces of information:

The Titanic dataset contains 891 rows and 12 columns
The column names are: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

2.2 Data Cleaning & Coercion

Check the data type of each column using str() (structure)

str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

We can see from the result above that several columns have an unsuitable data type. Let’s convert them to the correct type (explicit coercion). Note that as.integer() truncates fractional ages, so infants under one year old become 0

titanic$Sex <- as.factor(titanic$Sex)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Age <-  as.integer(titanic$Age)
titanic$Survived <-  as.factor(titanic$Survived)
str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : int  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

summary(titanic)

##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.00   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.00   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.68   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin             Embarked        
##  Min.   :  0.00   Length:891         Length:891        
##  1st Qu.:  7.91   Class :character   Class :character  
##  Median : 14.45   Mode  :character   Mode  :character  
##  Mean   : 32.20                                        
##  3rd Qu.: 31.00                                        
##  Max.   :512.33                                        
##

Done. Each column now has the correct type. The summary is now more informative, especially for the factor columns, where it reports a count per level

Check for missing values
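To illustrate why the coercion matters, here is a small sketch on a toy data frame (the values are made up, not taken from the Titanic file). It shows how summary() changes once a column becomes a factor.

```r
# Toy data (hypothetical values, not taken from the Titanic file).
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Sex = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# As numeric/character columns, summary() gives quartiles or just the length.
summary(toy)

# After explicit coercion to factor, summary() counts each level instead.
toy$Survived <- as.factor(toy$Survived)
toy$Sex <- as.factor(toy$Sex)
summary(toy)

# Levels are stored in sorted order: "0" before "1", "female" before "male".
levels(toy$Survived)
levels(toy$Sex)
```

The level order is also what determines the order of the legend labels in the plots, which is why "Death" is listed before "Live" there.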

colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

We have missing values in the Age column

anyNA(titanic)

## [1] TRUE

Here are the first 6 passengers with a missing Age, shown by Name

head(titanic[which(is.na(titanic$Age)),  c("Name", "Age")])

##                             Name Age
## 6               Moran, Mr. James  NA
## 18  Williams, Mr. Charles Eugene  NA
## 20       Masselmani, Mrs. Fatima  NA
## 27       Emir, Mr. Farred Chehab  NA
## 29 O'Dwyer, Miss. Ellen "Nellie"  NA
## 30           Todoroff, Mr. Lalio  NA

From the result above, the Age column has 177 missing values. Let’s remove those rows using the complete.cases() function in R

titanic <- titanic[complete.cases(titanic), ]

nrow(titanic)

## [1] 714

anyNA(titanic)

## [1] FALSE

Yeayyy, there are no missing values in our dataset now. Don’t worry: we removed 177 of 891 rows (about 20%), so we can still extract meaningful information from the remaining 714 rows

Let’s subset our dataset, because we will not use all of the columns
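The row-dropping step can be sketched on a toy data frame (made-up values) to show what complete.cases() returns and how the share of removed rows is computed. The last two lines repeat the arithmetic with the counts reported above.

```r
# Toy data frame with one missing Age (hypothetical values).
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Age = c(22, NA, 35, 54, 2)
)

# complete.cases() returns TRUE for rows that contain no NA in any column.
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]

# Share of rows that were dropped, as a percentage.
dropped_pct <- 100 * sum(!keep) / nrow(toy)
dropped_pct

# The same arithmetic with the counts reported above: 177 of 891 rows.
titanic_dropped_pct <- round(100 * 177 / 891, 1)
titanic_dropped_pct
```

Note that complete.cases() only looks at NA values. The empty strings in the Cabin column are not NA, which is why colSums(is.na(titanic)) reports 0 for Cabin.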

titanic <- titanic[, !(colnames(titanic) %in% c("Ticket", "Fare", "Cabin", "Embarked"))]

head(titanic, 5)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0

Done, we successfully removed the Ticket, Fare, Cabin, and Embarked columns, because we only focus on survival by Age, Sex, and Pclass

3. Data Processing and Visualization

Let’s summarise our data with the summary() function in R

summary(titanic)

##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:424    1:186   Length:714         female:261  
##  1st Qu.:222.2   1:290    2:173   Class :character   male  :453  
##  Median :445.0            3:355   Mode  :character               
##  Mean   :448.6                                                   
##  3rd Qu.:677.8                                                   
##  Max.   :891.0                                                   
##       Age            SibSp            Parch       
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:20.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :28.00   Median :0.0000   Median :0.0000  
##  Mean   :29.68   Mean   :0.5126   Mean   :0.4314  
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000

Summary:

Among the remaining passengers, 424 died and 290 survived, so more people died than survived
The Titanic had three passenger classes, ranked in order. Class 1 is the VIP class, class 2 is the normal class, and class 3 is the ordinary class, which held the most people (355)
There were more men than women on the Titanic: 453 male and 261 female
The average age of the passengers is about 29.7 years. The oldest is 80 years old

Let’s see whether death and survival on the Titanic depended on Age and Sex

First, let’s compute the survival counts by Sex and by Age, then plot the results by Sex

# Count deaths and survivors by Sex and by Age
sex_freq <- as.data.frame(table(titanic$Sex, titanic$Survived))
age_freq <- as.data.frame(table(titanic$Age, titanic$Survived))

# Sort each table from the highest to the lowest count
sex_freq <- sex_freq[order(sex_freq$Freq, decreasing = TRUE), ]
age_freq <- age_freq[order(age_freq$Freq, decreasing = TRUE), ]

# Plot sex_freq; position = "fill" scales each bar to proportions
ggplot(sex_freq, aes(x = reorder(Var1, Freq) , y = Freq)) +
  geom_col(aes(fill = Var2), position = "fill") +
  labs(
    title = "Proportion of Male and Female Passengers",
    subtitle = "Who Died and Survived",
    x = "Sex",
    y = "Proportion"
  ) +
  theme_solarized() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))

For a clearer view, we’ll split the plot into one panel per sex using facet_wrap() from the ggplot2 package

ggplot(sex_freq, aes(x = reorder(Var1, Freq) , y = Freq)) +
  geom_col(aes(fill = Var2), position = "dodge") +
  facet_wrap(~Var1, scales = "free_y") +
  labs(
    title = "Total Deaths and Survivors by Sex",
    subtitle = "On the Titanic",
    caption = "Source: Titanic dataset-kaggle",
    x = "Sex",
    y = "Number of Passengers"
  ) +
  theme_solarized() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_blank(),
    legend.title = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))

More females survived than males: the number of women who died is much lower than the number of men who died. One possible explanation is that the evacuation of the Titanic prioritized women over men, which would make sense if women were considered more vulnerable in that situation, although this data alone cannot confirm it

Second, let’s check survival by Age

# take the 6 ages with the most deaths, then the 6 ages with the most survivors
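Because the two sexes have different group sizes, comparing rates is often clearer than comparing counts. The following is a minimal sketch on toy data (made-up values, not the Titanic counts) of how a survival rate per sex could be computed with table() and prop.table().

```r
# Toy data (hypothetical values): 3 women and 5 men.
toy <- data.frame(
  Sex = factor(c("female", "female", "female",
                 "male", "male", "male", "male", "male")),
  Survived = factor(c(1, 1, 0, 0, 0, 0, 1, 0))
)

# Cross-tabulate Sex (rows) against Survived (columns).
counts <- table(Sex = toy$Sex, Survived = toy$Survived)
counts

# margin = 1 divides each row by its row total, so every row sums to 1.
rates <- prop.table(counts, margin = 1)
rates

# The column named "1" holds the share of each sex that survived.
survival_rate <- rates[, "1"]
survival_rate
```

Applied to the real data, the same two calls on titanic$Sex and titanic$Survived would give the shares that the position = "fill" plot draws.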
age_death <- head(age_freq[age_freq$Var2 == 0,])
age_death

##    Var1 Var2 Freq
## 29   28    0   20
## 22   21    0   19
## 19   18    0   17
## 26   25    0   17
## 31   30    0   17
## 20   19    0   16

age_live <- head(age_freq[age_freq$Var2 == 1,])
age_live

##     Var1 Var2 Freq
## 96    24    1   15
## 94    22    1   11
## 99    27    1   11
## 107   35    1   11
## 108   36    1   11
## 102   30    1   10

# create a reusable plotting function that returns a bar plot
plot_fn <- function(data, y_title, fill_color) {
  ggplot(data, aes(x = Var1, y = Freq)) +
    geom_col(fill = fill_color) +
    labs(x = "Age",
         y = y_title) +
    theme_solarized() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5)
    )
}


# call the plotting function and store the results in plot_death and plot_live

plot_death <- plot_fn(age_death, "Total Death", "#e41a1c")
plot_live <- plot_fn(age_live,  "Total Live", "#377eb8")

# combine the two plots into one figure
figure_one <- ggarrange(plot_death, plot_live)
annotate_figure(
  figure_one,
  top = text_grob(
    "Ages with the Most Deaths and Survivors",
    color = "black",
    face = "bold",
    size = 14
  ),
  bottom = text_grob(
    "Data source: \n Titanic dataset-kaggle",
    color = "black",
    hjust = 1.1,
    x = 1,
    face = "italic",
    size = 10
  )
)

From the bar plots above, the ages with the most deaths fall between 18 and 30 (18, 19, 21, 25, 28, and 30), while the ages with the most survivors fall between 22 and 36. So many young adults died in the Titanic disaster, and the survivors are concentrated in a similar, slightly older range. Keep in mind that these are counts, not rates, so they also reflect how many passengers of each age were on board

Let’s check which ages were most common among the passengers on the Titanic

titanic_age <- as.data.frame(table(titanic$Age))
head(titanic_age[order(titanic_age$Freq, decreasing = T),], 30)

##    Var1 Freq
## 25   24   31
## 23   22   27
## 29   28   27
## 31   30   27
## 19   18   26
## 20   19   25
## 22   21   24
## 26   25   23
## 37   36   23
## 30   29   20
## 33   32   20
## 27   26   18
## 28   27   18
## 36   35   18
## 17   16   17
## 32   31   17
## 21   20   16
## 24   23   16
## 35   34   16
## 34   33   15
## 41   40   15
## 40   39   14
## 46   45   14
## 18   17   13
## 43   42   13
## 39   38   11
## 3     2   10
## 5     4   10
## 51   50   10
## 45   44    9

Most passengers on the Titanic were roughly 18 to 40 years old. That’s why both the deaths and the survivors are concentrated in that age range, including the survivors aged 22 to 36

Let’s check whether survival depended on Pclass

# Check how many people are in each Pclass
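Since the most common ages dominate both the death and the survivor counts, a rate per age group separates the effect of group size from the effect of age. This is a minimal sketch on toy data (made-up values); the age bands 0-17, 18-40, and 41+ are an assumption chosen for illustration.

```r
# Toy data (hypothetical values). Survived is kept numeric (0/1) here so
# that mean() gives a rate; a factor column would need converting first,
# for example with as.integer(as.character(x)).
toy <- data.frame(
  Age = c(4, 9, 19, 22, 25, 28, 30, 35, 36, 52),
  Survived = c(1, 1, 0, 1, 0, 0, 1, 1, 0, 0)
)

# Bin ages into bands; include.lowest = TRUE keeps age 0 in the first band.
toy$AgeGroup <- cut(
  toy$Age,
  breaks = c(0, 17, 40, Inf),
  labels = c("0-17", "18-40", "41+"),
  include.lowest = TRUE
)

# Number of passengers per band (this is what raw counts mostly reflect).
n_per_group <- table(toy$AgeGroup)
n_per_group

# Survival rate per band: the mean of a 0/1 column is the share of 1s.
rate_per_group <- tapply(toy$Survived, toy$AgeGroup, mean)
rate_per_group
```

In this toy example the 18-40 band has the most survivors in absolute terms but not the highest survival rate, which is the distinction the counts alone cannot show.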
summary(titanic$Pclass)

##   1   2   3 
## 186 173 355

As we can see, most people are in Pclass 3, the regular class

We can sum the deaths and survivors on the Titanic by Pclass

# Survived == 0 is TRUE for deaths, so sum() counts them per Pclass
pclass_death <- aggregate(Survived == 0 ~ Pclass, titanic, sum)
names(pclass_death) <- c("Pclass", "Death")

# Survived == 1 is TRUE for survivors
pclass_live <- aggregate(Survived == 1 ~ Pclass, titanic, sum)
names(pclass_live) <- c("Pclass", "Live")

Add the Live column from pclass_live to the pclass_death data.frame

pclass_combine <- cbind(pclass_death, Live = pclass_live$Live)

Let’s change our table from wide format to long format

pclass_pivot <- pivot_longer(data = pclass_combine,
                           cols = c("Death", "Live"),
                           names_to = "Survived",
                           values_to = "Value")

pclass_pivot

## # A tibble: 6 × 3
##   Pclass Survived Value
##   <fct>  <chr>    <int>
## 1 1      Death       64
## 2 1      Live       122
## 3 2      Death       90
## 4 2      Live        83
## 5 3      Death      270
## 6 3      Live        85

Yeayy, our table is now in long format, so we can use a facet function in ggplot2 to draw one panel per value of the Survived column

Let’s visualize the data with a bar plot, using facet_wrap() to separate the panels by the Survived column
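To make the wide-to-long step concrete, here is a sketch that rebuilds the same long table by hand in base R, using the counts printed above. It is only an illustration of what the reshaping does, not a replacement for pivot_longer().

```r
# Wide format: one row per Pclass, one column per outcome.
# The counts are the ones printed in the tibble above.
wide <- data.frame(
  Pclass = c("1", "2", "3"),
  Death = c(64L, 90L, 270L),
  Live = c(122L, 83L, 85L)
)

# Long format: stack the two outcome columns into a single Value column
# and record which column each value came from in Survived.
long <- data.frame(
  Pclass = rep(wide$Pclass, times = 2),
  Survived = rep(c("Death", "Live"), each = nrow(wide)),
  Value = c(wide$Death, wide$Live)
)

# Sort so the rows appear in the same order as the tibble above.
long <- long[order(long$Pclass, long$Survived), ]
rownames(long) <- NULL
long
```

Each Pclass now appears once per outcome, which is the shape ggplot2 needs to map Survived to a fill color or a facet.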

ggplot(pclass_pivot, aes(x = Pclass , y = Value)) +
  geom_col(aes(fill = Survived), position = "dodge") +
  facet_wrap(~Survived, scales = "free_y") +
  labs(
    title = "Total Survivors and Deaths by Pclass",
    subtitle = "On the Titanic",
    caption = "Source: Titanic dataset-kaggle",
    x = "Pclass",
    y = "Total"
  ) +
  theme_solarized() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.title = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1", labels = c("Death", "Live"))

Wowww, as I guessed, Pclass is clearly related to survival. As the visualization above shows, the survivors come mostly from Pclass 1 (122 of 186 passengers) and the dead come mostly from Pclass 3 (270 of 355 passengers). A plausible explanation is that people in Pclass 1 got a higher priority during the evacuation process than people in Pclass 3

Let’s see whether there were many siblings and parents traveling with their families on the Titanic
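The comparison between classes can be made explicit by turning the counts above into survival rates. This sketch retypes the printed counts into a small data frame (pclass_counts is a hypothetical name) and divides survivors by the class total.

```r
# Deaths and survivors per class, copied from the table printed above.
pclass_counts <- data.frame(
  Pclass = c("1", "2", "3"),
  Death = c(64, 90, 270),
  Live = c(122, 83, 85)
)

# Total passengers per class, then the share of each class that survived.
pclass_counts$Total <- pclass_counts$Death + pclass_counts$Live
pclass_counts$SurvivalRate <- round(pclass_counts$Live / pclass_counts$Total, 3)

pclass_counts
```

The rates come out at roughly 0.66 for Pclass 1, 0.48 for Pclass 2, and 0.24 for Pclass 3, so the gap between classes holds even after accounting for how many passengers each class had.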

# check for siblings / children and parents on the Titanic

# passengers with the most siblings / spouses aboard (SibSp): ages 1 - 17
sibling <- head(titanic[order(titanic$SibSp, decreasing = TRUE), c("Age", "SibSp")], 15)
sibling

##     Age SibSp
## 60   11     5
## 72   16     5
## 387   1     5
## 481   9     5
## 684  14     5
## 17    2     4
## 51    7     4
## 69   17     4
## 120   2     4
## 165   1     4
## 172   4     4
## 183   9     4
## 234   5     4
## 262   3     4
## 267  16     4

# passengers with the most parents / children aboard (Parch): mostly ages 30 - 45
parent <- head(titanic[order(titanic$Parch, decreasing = TRUE), c("Age", "Parch")], 15)
parent

##     Age Parch
## 679  43     6
## 14   39     5
## 26   38     5
## 611  39     5
## 639  41     5
## 886  39     5
## 168  45     4
## 361  40     4
## 439  64     4
## 568  29     4
## 87   16     3
## 438  24     3
## 737  48     3
## 775  54     3
## 859  24     3

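From the result above, the passengers with the most siblings aboard are 1 to 17 years old, and the passengers with the most parents or children aboard (Parch) are mostly around 30 to 45 years old, which suggests they are parents. That means there were quite a few children and families on the Titanic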

Let’s create a reusable function for plotting our data

plot_fn <- function(data, y_col, y_title) {
  # y_col is the name of the column to plot, passed as a string
  ggplot(data, aes(x = Age, y = .data[[y_col]], color = .data[[y_col]])) +
    geom_jitter(aes(size = .data[[y_col]]), alpha = 0.5, show.legend = FALSE) +
    ylim(0, 7) +
    labs(y = y_title) +
    scale_color_gradient(low = "red3", high = "green2") +
    theme_solarized()
}

Then use the plot_fn function to plot the sibling and parent data frames

plot.one <- plot_fn(sibling, "SibSp", "Siblings")
plot.one

plot.two <- plot_fn(parent, "Parch", "Parents")
plot.two

To make the visuals easier to compare, let’s combine the plots we have made using the ggarrange() function from the ggpubr package

figure <- ggarrange(plot.one, plot.two)
annotate_figure(
  figure,
  top = text_grob(
    "Total Siblings and Parents by Age",
    color = "black",
    face = "bold",
    size = 14
  ),
  bottom = text_grob(
    "Data source: \n Titanic dataset-kaggle",
    color = "black",
    hjust = 1.1,
    x = 1,
    face = "italic",
    size = 10
  )
)

It turns out there were quite a lot of children traveling with their siblings on the Titanic: the passengers with the most siblings aboard are 1, 9, 11, 14, and 16 years old. The passengers with the most parents or children aboard, who are likely parents or married adults, are mostly around 30 to 45 years old.

4. Final Conclusion

We can draw four conclusions from our data visualizations:

Deaths are most frequent among passengers aged 18 to 30. Although there were quite a lot of children traveling with siblings, the ages with the most deaths are in the 18 to 30 range, not the 1 to 16 range, which suggests that most of the children were saved in the evacuation process.
The passengers traveling with the most parents or children are mostly around 30 to 45 years old, and many survivors also come from the 30 to 36 age range. That suggests parents or families may have been prioritized in the evacuation process on the Titanic
People in Pclass 1 (VIP class) appear to have been prioritized in the evacuation process over people in Pclass 3 (regular / ordinary class): more people survived from Pclass 1 (122) than from Pclass 3 (85), even though Pclass 3 had almost twice as many passengers.
Women appear to have been prioritized in the evacuation process: in the survival data, far fewer women died than men
