Titanic passenger survival data set
Data source: https://www.kaggle.com/c/titanic/data

Prepare data:

To visualize all of the Titanic data, combine ‘titanic_train’ and ‘titanic_test’ into one data frame.

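# Assumption: titanic_train and titanic_test come from the 'titanic' package (a copy of the Kaggle files)
library(titanic)
library(dplyr)    # bind_rows(), glimpse(), group_by(), summarise()
library(ggplot2)  # all plots below
titanic_all <- bind_rows(titanic_train, titanic_test)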
glimpse(titanic_all)
## Rows: 1,309
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6",...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
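
The combined data still has a ‘Survived’ column even though the Kaggle test file carries no survival labels, because bind_rows() fills columns missing from one input with NA. A minimal sketch of that behaviour, using two made-up toy data frames (train_toy and test_toy are hypothetical stand-ins, not the real data); the `.id` argument is optional and is used here only to record which input each row came from:

```r
library(dplyr)

# Toy stand-ins for titanic_train / titanic_test (made-up rows)
train_toy <- data.frame(PassengerId = 1:2,
                        Survived = c(0L, 1L),
                        Sex = c("male", "female"))
test_toy <- data.frame(PassengerId = 3:4,
                       Sex = c("female", "male"))  # no Survived column

# Named inputs plus .id add a "source" column holding "train" or "test"
all_toy <- bind_rows(train = train_toy, test = test_toy, .id = "source")

all_toy
sum(is.na(all_toy$Survived))  # rows that came without a survival label
```

Any summary of ‘Survived’ on the combined data therefore has to deal with these NA values, for example by working on the training rows only.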

A quick glimpse of data:

dim(titanic_all) 
## [1] 1309   12
names(titanic_all)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
head(titanic_all)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Variable   Definition                                   Key
Survived   Survival                                     0 = No, 1 = Yes
Pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
SibSp      # of siblings / spouses aboard the Titanic
Parch      # of parents / children aboard the Titanic
Ticket     Ticket number
Fare       Passenger fare
Cabin      Cabin number
Embarked   Port of embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Where are Jack & Rose?

Check whether Jack Dawson and Rose DeWitt Bukater are on board.

grep("Jack|Rose", titanic_all$Name, value = TRUE)
## [1] "Brewe, Dr. Arthur Jackson"                    
## [2] "Aks, Mrs. Sam (Leah Rosen)"                   
## [3] "Rosenbaum, Miss. Edith Louise"                
## [4] "Rosenshine, Mr. George (Mr George Thorne\")\""

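The only matches are names that merely contain “Jack” or “Rose” (Jackson, Rosen, Rosenbaum, Rosenshine), so Jack and Rose from the movie are not on the Titanic passenger list.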
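
The pattern "Jack|Rose" matches any name containing those letters, which is why “Jackson” and “Rosenbaum” show up. A minimal sketch of a stricter search that only accepts “Jack” or “Rose” as whole words, using word boundaries (`\\b`); the vector below reuses three names from the output above plus one made-up entry ("Dawson, Mr. Jack") that is not in the data and is there only to show a positive match:

```r
# Three real names from the output above plus one hypothetical entry
passenger_names <- c("Brewe, Dr. Arthur Jackson",
                     "Aks, Mrs. Sam (Leah Rosen)",
                     "Rosenbaum, Miss. Edith Louise",
                     "Dawson, Mr. Jack")  # made up for illustration

# \\b marks a word boundary, so "Jackson" and "Rosen" no longer match
whole_word <- grep("\\b(Jack|Rose)\\b", passenger_names,
                   value = TRUE, perl = TRUE)
whole_word
```

Only the hypothetical entry is returned; the three real names are filtered out.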

Bar Graph:

  1. Bar graph of the numbers of males and females
ggplot(data = titanic_all, aes(x = Sex)) + geom_bar()

The plot shows that almost twice as many males as females were on board.

  2. Bar graph of the numbers of males and females, stacked by ‘Pclass’
#ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) + 
#  geom_bar() 
  3. Bar graph of the numbers of males and females, with one bar per ‘Pclass’ placed side by side
#ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) +
#  geom_bar(position = "dodge") +
#  ggtitle('position = "dodge"') #add title
  4. Bar graph of the proportion of each ‘Pclass’ among males and among females
ggplot(data = titanic_all, aes(x = Sex, fill = factor(Pclass))) +
  geom_bar(position = "fill") +
  ggtitle('position = "fill"') #add title

For both females and males, class 3 (lower class) accounts for the largest share.
Note that the proportion of class 1 (upper class) passengers is clearly higher among females than among males.
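
The shares drawn by position = "fill" can also be read off as numbers. A minimal sketch using base R's table() and prop.table() on a small made-up data frame (toy is hypothetical; with the real data the same two calls would take titanic_all$Sex and titanic_all$Pclass):

```r
# Made-up rows, only to show the mechanics
toy <- data.frame(
  Sex    = c("female", "female", "female", "male", "male", "male", "male"),
  Pclass = c(1, 1, 3, 1, 3, 3, 3)
)

counts <- table(toy$Sex, toy$Pclass)       # passengers per Sex and Pclass
shares <- prop.table(counts, margin = 1)   # margin = 1: each Sex row sums to 1

round(shares, 2)
```

Each row of shares corresponds to one bar in the filled bar graph, and each cell to the height of one colored segment.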

Box Plot:

  1. Box plot: x = ‘Embarked’, y = ‘Fare’
ggplot(titanic_all, aes(x=Embarked, y=Fare)) +
  geom_boxplot()

The first box has no label because it holds the passengers with no recorded port of embarkation (‘Embarked’ is an empty string).
The median fares (the center line of each box) for ports ‘C’, ‘Q’ and ‘S’ look similar on this scale.
However, the box plot shows that fares at ‘C’ are more widely spread than at the other ports, and both ‘C’ and ‘S’ have many outliers.
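
By default, geom_boxplot() draws as separate points the observations that lie more than 1.5 times the interquartile range (IQR) beyond the edges of the box. A minimal sketch of that rule in base R, on made-up fares (not taken from the Titanic data); quantile() with its default settings is assumed here to be close enough to the hinges used in the plot:

```r
# Made-up fares, not taken from the Titanic data
fare <- c(7.25, 7.9, 8.05, 13, 26, 30, 512.3)

q   <- quantile(fare, c(0.25, 0.75))  # lower and upper quartile
iqr <- q[[2]] - q[[1]]                # interquartile range

lower <- q[[1]] - 1.5 * iqr           # lower fence
upper <- q[[2]] + 1.5 * iqr           # upper fence

outliers <- fare[fare < lower | fare > upper]
outliers
```

Because fares are strongly right-skewed, only the very expensive ticket falls outside the fences in this toy example.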

  2. Box plot: x = ‘Embarked’, y = ‘Fare’, split by ‘Pclass’
p1 <- ggplot(titanic_all,aes(x=Embarked, y=Fare, fill=factor(Pclass))) +
  geom_boxplot()
p1

#p1 + theme_gray() # the default
#p1 + theme_bw()
#p1 + theme_linedraw()
#p1 + theme_light()
#p1 + theme_dark()
#p1 + theme_minimal()
#p1 + theme_classic()
#p1 + theme_void()

Splitting each port by ‘Pclass’ shows that class 1 passengers who embarked at ‘C’ paid more for their tickets than class 1 passengers who embarked at ‘S’.

Density Plot:

  1. Density plot of ‘Fare’ for each ‘Pclass’, with a shared y scale.
ggplot(titanic_all, aes(x = Fare, fill = factor(Pclass))) +
  geom_density() +
  theme_light() +
  facet_grid(Pclass ~ .) #with the same scale of y

It is easy to see that passengers in the lower classes tended to buy cheaper tickets.

  2. Density plot of ‘Fare’ for each ‘Pclass’, with a free y scale.
#ggplot(titanic_all, aes(x = Fare, fill = factor(Pclass))) +
#  geom_density() +
#  theme_light() +
#  facet_grid(Pclass ~ . , scales = "free") #scales = "free"

Line Graph:

Passengers are grouped by ‘Sex’ and ‘Pclass’, and the mean of ‘Survived’ (the survival rate) is calculated for each group.
Only ‘titanic_train’ is used here because ‘titanic_test’ has no ‘Survived’ values.

lineData <- titanic_train %>% 
  group_by(Sex, Pclass) %>% # one group per combination of 'Sex' and 'Pclass'
  summarise(SurvivedAvg = mean(Survived)) # mean of a 0/1 column = survival rate
head(lineData)
## # A tibble: 6 x 3
## # Groups:   Sex [2]
##   Sex    Pclass SurvivedAvg
##   <chr>   <int>       <dbl>
## 1 female      1       0.968
## 2 female      2       0.921
## 3 female      3       0.5  
## 4 male        1       0.369
## 5 male        2       0.157
## 6 male        3       0.135
  1. Line graph of the average survival rate of passengers grouped by ‘Sex’ and ‘Pclass’.
ggplot(data = lineData, aes(x = Pclass, y = SurvivedAvg, color = Sex)) +
  geom_line() +
  geom_text(aes(label = round(SurvivedAvg, 2)), nudge_y = 0.05, show.legend = FALSE)

Consistent with intuition, the lower classes have lower average survival.
The average survival of females is far higher than that of males regardless of ‘Pclass’.
Moreover, female survival in classes 1 and 2 is close to 1, and it drops sharply when ‘Pclass’ changes from 2 to 3.
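
The values on the line graph are survival rates because the mean of a 0/1 column equals the share of ones. A minimal sketch of the same grouped mean in base R with tapply(), on a small made-up data frame (toy is hypothetical and far smaller than the real data):

```r
# Made-up rows, only to show that mean(0/1) is a proportion
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 1, 1, 3, 3)
)

# Rows are Sex, columns are Pclass, cells are the mean of Survived
rates <- tapply(toy$Survived, list(toy$Sex, toy$Pclass), mean)
rates
```

Here both toy females in class 1 survived, so their rate is 1, while one of the two toy females in class 3 survived, giving 0.5.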

Run chart:

  1. Make a scatter plot of ‘Survived’ against ‘Age’, colored by ‘Sex’ and faceted by ‘Pclass’. Then use geom_smooth() to show the trend in each panel.
ggplot(data = titanic_train, aes(x = Age, y = Survived, color = Sex)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE) + # se = FALSE hides the confidence band
  facet_grid(Pclass ~ .) +
  theme_classic() +
  theme(text = element_text(size=10)) +
  scale_color_manual(values = c("#EF5350", "#64B5F6"))

In all classes, the average survival of males shows a small dip at around 20 years of age.
Children and the elderly have relatively high average survival. This may indicate that young men gave priority to the elderly, women, and children when boarding the lifeboats.
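
Since ‘Survived’ only takes the values 0 and 1, a default smoother can drift outside that range. One possible alternative, shown here only as a sketch and not as the method used above, is a logistic regression, whose fitted values always stay strictly between 0 and 1. The data below are simulated (age and survived are made-up variables); in ggplot2 the counterpart would be geom_smooth(method = "glm", method.args = list(family = binomial)).

```r
set.seed(1)

# Simulated data: survival probability decreases slowly with age
age      <- runif(200, min = 1, max = 80)
survived <- rbinom(200, size = 1, prob = plogis(1 - 0.03 * age))

# Logistic regression of the 0/1 outcome on age
fit <- glm(survived ~ age, family = binomial)

# Fitted survival probabilities at three example ages
pred <- predict(fit, newdata = data.frame(age = c(5, 40, 75)),
                type = "response")
pred
```

The predictions are probabilities, so they can be read directly as estimated survival rates at each age.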

  2. Add the confidence interval.
#ggplot(data = titanic_train, aes(x = Age, y = Survived, color = Sex)) +
#  geom_point(alpha = 0.2) +
#  geom_smooth(se = TRUE) + #C.I. TRUE
#  facet_grid(Pclass ~ .) +
#  theme_classic() +
#  theme(text = element_text(size=10)) +
#  scale_color_manual(values = c("#EF5350", "#64B5F6"))

Reference:

http://biostat.tmu.edu.tw/oldFile/enews/ep_download/16rb.pdf