Loading the necessary libraries

I have already installed all the necessary packages. Now I am loading them.

library(devtools)
library(ggplot2)
library(statsr)
library(dplyr)

Reading the Titanic data from the computer

titanic <- read.csv("titanic.csv", stringsAsFactors = FALSE)
View(titanic)

Converting categorical variables to factors

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
summary(titanic)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 
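The four as.factor() calls above could also be written as a single step. The sketch below is only an illustration: it uses a small made-up data frame (called toy here, not the real data) so that it runs on its own, and with the real data the same lines would be applied to titanic. It also counts missing ages and blank embarkation codes, which appear in the summary above as 177 NA's and 2 empty values.

```r
# Toy stand-in for the Titanic data (values are made up for illustration only).
toy <- data.frame(
  Pclass   = c(1, 3, 2, 3),
  Survived = c(1, 0, 1, 0),
  Sex      = c("female", "male", "female", "male"),
  Embarked = c("C", "S", "", "Q"),
  Age      = c(29, NA, 35, 4),
  stringsAsFactors = FALSE
)

# Convert all categorical columns to factors in one step.
factor_cols <- c("Pclass", "Survived", "Sex", "Embarked")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Count missing ages and blank embarkation codes.
n_missing_age    <- sum(is.na(toy$Age))
n_blank_embarked <- sum(toy$Embarked == "")

str(toy)
```

Keeping the column names in one vector means a new categorical column only has to be added in one place.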

Including Plots

ggplot(titanic, aes(x = Survived)) +
  theme_bw() +
  geom_bar() +
  labs(y = "Passenger Count",
      title = "Titanic Survival Rates")

Taking the plot a step further

Let us find out how many males and females were in each Survived category.

ggplot(titanic, aes(x = Sex, fill = Survived)) +
    theme_bw() +
    geom_bar() +
    labs(y = "Passenger Count", 
         title = "Titanic Survival Rates by sex")

The bar chart shows the relative proportion of those who survived and those who died, by sex.

Among females, more passengers survived than died, whereas among males, more died than survived.
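Because the bars above are stacked counts, the proportions have to be judged by eye. One possible variation is to let ggplot2 rescale every bar to the same height with position = "fill". The sketch below assumes a small made-up data frame (toy) so that it runs on its own; the plot title and axis label are my own choices, and with the real data toy would be replaced by titanic.

```r
library(ggplot2)

# Toy stand-in for the Titanic data (values are made up for illustration only).
toy <- data.frame(
  Sex      = factor(c("female", "female", "female", "male", "male", "male")),
  Survived = factor(c(1, 1, 0, 0, 0, 1))
)

# position = "fill" rescales each bar to height 1, so the fill colours
# show the share of survivors within each sex rather than raw counts.
p <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar(position = "fill") +
  labs(y = "Proportion of Passengers",
       title = "Titanic Survival Proportions by sex")

print(p)
```

The trade-off is that group sizes are no longer visible, so this view complements the count chart rather than replacing it.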

Did class of ticket affect the survival rate?

Apart from gender, the class of ticket may also have played a role.

ggplot(titanic, aes(x = Pclass, fill = Survived)) +
    theme_bw() +
    geom_bar() +
    labs(y = "Passenger Count", 
         title = "Titanic Survival Rates by class")

People in Class 3 had a markedly poor survival ratio. On the other hand, most people in Class 1 survived. Interestingly, the survival rate of 2nd Class passengers is almost 50%.
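The rates read off the chart can also be computed directly. The following is a minimal sketch using dplyr, assuming a small made-up data frame (toy) so that it runs on its own; the column names passengers and survival_rate are hypothetical, and with the real data toy would be replaced by titanic.

```r
library(dplyr)

# Toy stand-in for the Titanic data (values are made up for illustration only).
toy <- data.frame(
  Pclass   = factor(c(1, 1, 2, 2, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1))
)

# Survived is a factor, so compare against the level "1" and take the mean
# of the resulting TRUE/FALSE values to get the share of survivors.
rates <- toy %>%
  group_by(Pclass) %>%
  summarise(
    passengers    = n(),
    survival_rate = mean(Survived == "1")
  )

print(rates)
```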

Visual Drilldown

ggplot(titanic, aes(x = Sex, fill = Survived)) +
    theme_bw() +
    facet_wrap(~ Pclass) +
    geom_bar() +
    labs(y = "Passenger Count", 
         title = "Titanic Survival Rates by class and sex")

There are three panels, one for each class, and each panel shows females and males. Based on the chart above, females in first class overwhelmingly survived, and the same is true for second class. In third class, however, women had roughly equal chances of survival and death. Sadly, men had the lower survival rate in all three classes.

Let's examine the age of the passengers

Like gender and class, age may also have affected survivability. I have set a bin width of five years.

ggplot(titanic, aes(x = Age)) +
    theme_bw() +
    geom_histogram(binwidth = 5) +
    labs(y = "Passenger Count",
         x = "Age with binwidth 5 years",
         title = "Titanic passengers age distribution")
## Warning: Removed 177 rows containing non-finite values (stat_bin).

It can be seen that passengers in their 20s form the largest group. There were also some elderly passengers, up to about 80 years old, as well as some children.

Relative proportion of survivors and deaths by age

ggplot(titanic, aes(x = Age, fill = Survived)) +
    theme_bw() +
    geom_histogram(binwidth = 5) +
    labs(y = "Passenger Count",
         x = "Age with binwidth 5 years",
         title = "Titanic passengers age distribution")
## Warning: Removed 177 rows containing non-finite values (stat_bin).

Children, especially those between 0 and 5 years old, had high survivability. At the higher end of the age distribution, however, the death rate is noticeably high, especially between 50 and 70 (with the exception of one 80-year-old passenger who survived).
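To put numbers on these age groups, the ages can be cut into five-year bands and the share of survivors computed per band. This is only a sketch in base R on a small made-up data frame (toy); the band edges chosen here are an assumption and need not coincide with the bin edges ggplot2 picks for the histogram. With the real data, toy would be replaced by titanic.

```r
# Toy stand-in for the Titanic data (values are made up for illustration only).
toy <- data.frame(
  Age      = c(2, 4, 23, 27, 55, 62, NA),
  Survived = factor(c(1, 1, 0, 1, 0, 0, 1))
)

# Group ages into 5-year bands: [0,5), [5,10), ..., [75,80].
toy$AgeBand <- cut(toy$Age, breaks = seq(0, 80, by = 5),
                   right = FALSE, include.lowest = TRUE)

# Drop passengers with unknown age, discard empty bands,
# then compute the share of survivors in each remaining band.
known <- toy[!is.na(toy$Age), ]
known$AgeBand <- droplevels(known$AgeBand)
band_rates <- tapply(known$Survived == "1", known$AgeBand, mean)

print(band_rates)
```

Passengers with a missing age are dropped explicitly here, which mirrors what the histogram does when it warns about removed rows.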

Histogram of survival rate by age, class, and sex

ggplot(titanic, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  geom_histogram(alpha = 0.5) +
  labs(y = "Passenger Count",
       x = "Age",
       title = "Survival Rates by Age, Pclass, and Sex")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).

Females in first class had an extremely high survival rate, with only isolated deaths. The picture is almost the same in second class. Females in third class, however, had lower survivability.

For males, on the other hand, the survival rate was highest in first class, followed by second class. The survival rate for third class males was extremely low.
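The same comparison can be tabulated instead of read from the panels. The sketch below uses base R table() and prop.table() on a small made-up data frame (toy) so that it runs on its own; with the real data, toy would be replaced by titanic.

```r
# Toy stand-in for the Titanic data (values are made up for illustration only).
toy <- data.frame(
  Sex      = factor(c("female", "female", "female", "male",
                      "male", "male", "male", "female")),
  Pclass   = factor(c(1, 2, 3, 1, 2, 3, 3, 3)),
  Survived = factor(c(1, 1, 0, 1, 0, 0, 0, 1))
)

# Three-way table of counts: Sex x Pclass x Survived.
counts <- table(toy$Sex, toy$Pclass, toy$Survived,
                dnn = c("Sex", "Pclass", "Survived"))

# margin = c(1, 2) makes the shares within each Sex x Pclass cell sum to 1.
shares <- prop.table(counts, margin = c(1, 2))

# Keep only the slice for survivors (level "1").
survived_share <- shares[, , "1"]

print(round(survived_share, 2))
```

Any Sex and Pclass combination with no passengers would give NaN here, so such cells would need separate handling.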

The Titanic data is available on the Kaggle website (Kaggle competitions): https://www.kaggle.com/c/titanic/data