INTRODUCTION





The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Some of the images of the disaster are shown below:



 

 



In this report I analyse the Titanic dataset and apply predictive modelling to predict the survival of passengers.



IMPORTING LIBRARIES AND LOADING DATA


# Importing Packages
library(ggplot2)       # Visualization
library(ggthemes)      # Visualization
library(scales)        # Visualization
library(dplyr)         # Data Manipulation
library(mice)          # Imputation
library(randomForest)  # Classification Algorithm


# Loading Data
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)
full <- bind_rows(train, test)     # Bind training and test data



DATA INSPECTION


glimpse(full)
Observations: 1,309
Variables: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1...
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1...
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2...
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence...
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "male", "m...
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2,...
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0, 0...
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0...
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450",...
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21.07...
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103", ...
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "S",...


Here we see that the combined data has 1309 observations of 12 variables, along with each variable's type and its first few values.

Variables description:

  • PassengerId : ID of the Passenger.
  • Survived : No (0) Yes (1).
  • Pclass : Passenger’s class (1, 2 or 3).
  • Name : Name of the Passenger.
  • Sex : Gender of the Passenger.
  • Age : Age of the Passenger.
  • SibSp : Number of Siblings / Spouses aboard.
  • Parch : Number of Parents / Children aboard.
  • Ticket : Ticket number.
  • Fare : Fare of the Ticket.
  • Cabin : Cabin number.
  • Embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).



FEATURE ENGINEERING


WHAT’S IN A NAME?


In this data we can break the variable Name into more meaningful variables. First we extract the passenger's title from the name.

# Grab Title from the Passenger Names
full$Title <- gsub("(.*, )|(\\..*)","",full$Name)
# Show Title Counts by Sex
table(full$Sex, full$Title)
        
         Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs  Ms Rev
  female    0   0   0    1   1        0    1     0      0  260    2   1   0 197   2   0
  male      1   4   1    0   7        1    0     2     61    0    0   0 757   0   0   8
        
         Sir the Countess
  female   0            1
  male     1            0
# Titles with very low cell counts are combined into a single "Rare Title" level
rareTitle <- c("Dona", "Lady","the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer")
# Also reassign "Mlle", "Mme" and "Ms" accordingly
full$Title[full$Title == "Mlle"] <- "Miss"
full$Title[full$Title == "Ms"] <- "Miss"
full$Title[full$Title == "Mme"] <- "Mrs"
full$Title[full$Title %in% rareTitle] <- "Rare Title"
# Show Title Counts by Sex again
table(full$Sex, full$Title)
        
         Master Miss  Mr Mrs Rare Title
  female      0  264   0 198          4
  male       61    0 757   0         25
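
To see what the regular expression does, here is a minimal sketch on three sample names. Only the first name comes from the data shown earlier; the other two are hypothetical strings made up for illustration. The pattern deletes everything up to the comma and space, and everything from the first period onward, which leaves the title.

```r
# Sample names: the first is from the dataset, the other two are hypothetical
sample_names <- c("Braund, Mr. Owen Harris",
                  "Doe, Mlle. Jane",
                  "Example, the Countess. of (Jane Doe)")

# Same pattern as above: drop "<surname>, " and ".<rest of name>"
titles <- gsub("(.*, )|(\\..*)", "", sample_names)
print(titles)
```

Multi-word titles such as "the Countess" survive intact because only the text before the comma and after the period is removed.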


Finally we will extract Surnames from the Passenger Names.

full$Surnames <- sapply(full$Name, function(x) strsplit(x, split = "[., ]")[[1]][1])
count <-  nlevels(factor(full$Surnames))
count
[1] 866


We have 866 unique surnames.
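
One detail worth checking: the split pattern above includes a space, so a surname made of several words keeps only its first word. The sketch below compares that pattern with one that omits the space. The sample name and the helper `first_piece` are illustrative assumptions, and the second pattern is an alternative, not what the analysis above uses.

```r
# Illustrative name with a two-word surname (assumption for this sketch)
sample_name <- "Vander Planke, Mr. Leo"

# Hypothetical helper: return the first piece of a name after splitting
first_piece <- function(name, pattern) strsplit(name, split = pattern)[[1]][1]

with_space    <- first_piece(sample_name, "[., ]")  # pattern used above
without_space <- first_piece(sample_name, "[.,]")   # alternative without the space

print(c(with_space = with_space, without_space = without_space))
```

Switching patterns would change the number of unique surnames, so the count of 866 applies only to the pattern with the space.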


DO FAMILIES SINK OR SWIM TOGETHER?


Now we will create the Family Size variable based on the number of siblings / spouses (maybe someone has more than one spouse!) and the number of parents / children.

# Create a Family Size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1
# Create a Family variable
full$Family <- paste(full$Surnames, full$Fsize, sep = "_")


Let us look at the Family Size variable and how it may relate to survival with the help of a visualization.

# Relationship between Family Size and Survival (Training Data)
ggplot(data = full[1:891,], aes(x = Fsize, fill = factor(Survived))) + 
  geom_bar(stat = "count", position = "dodge") + 
  scale_x_continuous(breaks = c(1:11)) + 
  ylim(c(0,400)) +
  labs(x = "Family Size") + 
  theme_bw()


Here we see that singletons are by far the largest group, so they account for the most survivors and also the most deaths in absolute terms.

Now we create a discretized family size variable.

# Discretized family size
full$FsizeD[full$Fsize == 1] <- "Singleton"
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- "Small"
full$FsizeD[full$Fsize >= 5] <- "Large"
# Show Family size by Survival using a Mosaic plot
mosaicplot(table(full$FsizeD, full$Survived), main = "Family Size by Survival", shade = TRUE)

This plot shows the relationship between Family size and Survival.
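
As a side note, the same three bins can be produced in one step with base R's `cut()`. This is only a sketch of an alternative on a toy vector of family sizes (the values are made up), not a change to the analysis above.

```r
# Toy family sizes (assumed values for illustration)
fsize <- c(1, 2, 4, 5, 11)

# Intervals are (0,1], (1,4], (4,Inf], matching Singleton / Small / Large
fsize_d <- cut(fsize,
               breaks = c(0, 1, 4, Inf),
               labels = c("Singleton", "Small", "Large"))
print(data.frame(fsize, fsize_d))
```

Unlike the three assignments above, `cut()` returns a factor directly, so no later conversion is needed.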


TREATING A FEW MORE VARIABLES…


Now we will treat the variable Cabin, which holds potentially useful information such as the passenger's deck.

Let’s have a look….

# This variable appears to have a lot of missing values
full$Cabin[1:28]
 [1] ""            "C85"         ""            "C123"        ""            ""           
 [7] "E46"         ""            ""            ""            "G6"          "C103"       
[13] ""            ""            ""            ""            ""            ""           
[19] ""            ""            ""            "D56"         ""            "A6"         
[25] ""            ""            ""            "C23 C25 C27"
# The first letter is the "Deck". For Example:
strsplit(full$Cabin[2], NULL)[[1]]
[1] "C" "8" "5"
# Create a Deck variable. Get passenger deck A - F:
full$Deck <- factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
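
The deck extraction can also be written without `sapply()`. The following sketch uses `substr()` on a few cabin values taken from the output above and turns blanks into NA explicitly; it is an equivalent formulation under that assumption, shown on a small vector rather than the full data.

```r
# A few cabin values from the output above
cabin <- c("", "C85", "C23 C25 C27", "G6")

# The first character is the deck letter; blank cabins become NA
deck_letter <- substr(cabin, 1, 1)
deck_letter[deck_letter == ""] <- NA
deck <- factor(deck_letter)
print(deck)
```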



DATA CLEANING


Now we will treat the missing values. Because this dataset is very small, we will not remove any rows or columns that contain missing values. Instead we will replace missing values with sensible values given the distribution of the data, e.g., the mean, median or mode. We will also use prediction for the same purpose.


SENSIBLE VALUE IMPUTATION


# Let's look into the data to see how many missing (NA) values there are:
miss <- function(x){
  for(i in seq_len(ncol(x)))
  {
    cat("In column", colnames(x)[i], "total NA values are:", sum(is.na(x[[i]])), "\n")
  }
}
miss(full)
In column PassengerId total NA values are: 0 
In column Survived total NA values are: 418 
In column Pclass total NA values are: 0 
In column Name total NA values are: 0 
In column Sex total NA values are: 0 
In column Age total NA values are: 263 
In column SibSp total NA values are: 0 
In column Parch total NA values are: 0 
In column Ticket total NA values are: 0 
In column Fare total NA values are: 1 
In column Cabin total NA values are: 0 
In column Embarked total NA values are: 0 
In column Title total NA values are: 0 
In column Surnames total NA values are: 0 
In column Fsize total NA values are: 0 
In column Family total NA values are: 0 
In column FsizeD total NA values are: 0 
In column Deck total NA values are: 1014 
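# Count blank ("") values; a column that contains NA reports NA because the comparison propagates NA
blank <- function(x){
  for(i in seq_len(ncol(x)))
  {
    cat("In column", colnames(x)[i], "total blank values are:", sum(x[[i]] == ""), "\n")
  }
}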
blank(full)
In column PassengerId total blank values are: 0 
In column Survived total blank values are: NA 
In column Pclass total blank values are: 0 
In column Name total blank values are: 0 
In column Sex total blank values are: 0 
In column Age total blank values are: NA 
In column SibSp total blank values are: 0 
In column Parch total blank values are: 0 
In column Ticket total blank values are: 0 
In column Fare total blank values are: NA 
In column Cabin total blank values are: 1014 
In column Embarked total blank values are: 2 
In column Title total blank values are: 0 
In column Surnames total blank values are: 0 
In column Fsize total blank values are: 0 
In column Family total blank values are: 0 
In column FsizeD total blank values are: 0 
In column Deck total blank values are: NA 
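
The two loops above can be condensed into vectorised calls. Here is a minimal sketch on a toy data frame (the values are made up for illustration); passing `na.rm = TRUE` makes the blank count return a number even for columns that contain NA.

```r
# Toy data frame with one NA and a few blank strings (assumed values)
toy <- data.frame(Age      = c(22, NA, 35),
                  Cabin    = c("", "C85", ""),
                  Embarked = c("S", "", "C"),
                  stringsAsFactors = FALSE)

# NA count per column
na_counts <- colSums(is.na(toy))

# Blank count per column, ignoring NA values
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```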

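Now the variables of interest are Embarked, Survived, Age, Fare and Deck. (The 418 missing Survived values are simply the test-set passengers whose survival we will predict.)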


# Now examine which passenger has missing Embarked 
full$PassengerId[full$Embarked == ""]
[1]  62 830

Passengers 62 and 830 have missing Embarked values.

# Now we check which class they belong to and how much fare they paid:
full$Pclass[full$PassengerId == 62]
[1] 1
full$Fare[full$PassengerId == 62]
[1] 80
full$Pclass[full$PassengerId == 830]
[1] 1
full$Fare[full$PassengerId == 830]
[1] 80


Here we see that both passengers are in class 1 and paid a fare of $80. So from where did they embark?

# Exclude the two passengers with missing Embarked
embarkFare <- full %>%
  filter(PassengerId != 62 & PassengerId != 830)
# Use ggplot2 to visualize embarkment, passenger class, & median fare
ggplot(data = embarkFare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) + 
  geom_boxplot() + 
  geom_hline(aes(yintercept = 80), 
      colour = "red", linetype = "dashed", lwd = 2) +
  scale_y_continuous(labels = dollar_format()) + 
  theme_bw()


From this plot we see that the median fare for first-class passengers who embarked at C (Cherbourg) coincides nicely with the $80 paid by the two passengers whose Embarked is missing. So we can reasonably replace the blank values with C.

# Replace the blank Embarked values with C for passengers 62 and 830
full$Embarked[full$PassengerId %in% c(62, 830)] <- "C"
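
The visual comparison can be backed by a numeric one: compute the median fare per port and class, then see which median is closest to $80. The sketch below does this with `aggregate()` on a toy data frame; the fares are made-up values, so it shows the technique only and does not reproduce the real medians.

```r
# Toy first-class fares by port (assumed values, not the real data)
toy <- data.frame(Embarked = c("C", "C", "C", "S", "S", "Q"),
                  Pclass   = c(1, 1, 1, 1, 1, 1),
                  Fare     = c(70, 80, 90, 50, 54, 90))

# Median fare for every Embarked / Pclass combination
med <- aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)

# Port whose median is closest to the $80 paid by the two passengers
closest <- med$Embarked[which.min(abs(med$Fare - 80))]
print(med)
print(closest)
```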


# Now examine which passenger has missing Fare
full$PassengerId[is.na(full$Fare)]
[1] 1044

Passenger 1044 has missing Fare.

# Let's see the detail of this passenger
full[1044, ]


This is a third-class passenger who departed from Southampton (S).

# Let's visualize Fares among the "3rd" class and "S" embarked
ggplot(data = full[full$Pclass == 3 & full$Embarked == "S", ], aes(x = Fare)) + 
  geom_density(fill = "#99d6ff", alpha = 0.4) + 
  geom_vline(aes(xintercept = median(Fare, na.rm = T)), 
             colour = "red", linetype = "dashed", lwd = 1) +
  scale_x_continuous(labels = dollar_format()) + 
  theme_bw()


From this plot it seems quite reasonable to replace the missing Fare value with the median for this passenger's class and port of embarkation, which is $8.05.

# Replace the missing fare value with the median fare for class / embarked
full$Fare[1044] <- median(full[full$Pclass == 3 & full$Embarked == "S", ]$Fare, na.rm = TRUE)
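
If more than one fare were missing, the same idea generalises to "fill each NA with the median of its own group". Below is a minimal sketch under that assumption; `impute_group_median` is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper: fill NA values with the median of their group
impute_group_median <- function(values, groups) {
  for (g in unique(groups)) {
    in_group <- groups == g
    fill <- median(values[in_group], na.rm = TRUE)
    values[in_group & is.na(values)] <- fill
  }
  values
}

# Toy data (assumed values): one missing third-class fare from S
toy <- data.frame(Pclass   = c(3, 3, 3, 1),
                  Embarked = c("S", "S", "S", "C"),
                  Fare     = c(7, NA, 9, 80))

# Group by class and port, as in the imputation above
groups <- paste(toy$Pclass, toy$Embarked, sep = "_")
toy$Fare <- impute_group_median(toy$Fare, groups)
print(toy)
```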


PREDICTIVE IMPUTATION


Finally, we will treat the variable Age, which has 263 missing values. We will build a model that predicts ages based on the other variables.

# Convert the categorical variables into factors
factorVars <- c("PassengerId", "Pclass", "Sex", "Embarked", "Title", "Surnames", "Family", "FsizeD")
full[factorVars] <- lapply(full[factorVars], function(x) as.factor(x))
# Set a random seed
set.seed(129)
# Perform mice imputation, excluding certain less-than-useful variables
miceMod <- mice(full[, !names(full) %in% c("PassengerId", "Name", "Ticket", "Cabin", "Family", "Survived", "Surnames")], method = "rf")

 iter imp variable
  1   1  Age  Deck
  1   2  Age  Deck
  1   3  Age  Deck
  1   4  Age  Deck
  1   5  Age  Deck
  2   1  Age  Deck
  2   2  Age  Deck
  2   3  Age  Deck
  2   4  Age  Deck
  2   5  Age  Deck
  3   1  Age  Deck
  3   2  Age  Deck
  3   3  Age  Deck
  3   4  Age  Deck
  3   5  Age  Deck
  4   1  Age  Deck
  4   2  Age  Deck
  4   3  Age  Deck
  4   4  Age  Deck
  4   5  Age  Deck
  5   1  Age  Deck
  5   2  Age  Deck
  5   3  Age  Deck
  5   4  Age  Deck
  5   5  Age  Deck
Number of logged events: 50
# Save the completed output (complete() returns the first imputed dataset by default)
miceOutput <- complete(miceMod)


Let’s compare the results with the original distribution of passenger ages to ensure that nothing has gone completely awry.

# Plot age distribution
par(mfrow = c(1, 2))
hist(full$Age, freq = F, main = "Age: Original Data", col = "darkgreen", ylim = c(0, 0.04))
hist(miceOutput$Age, freq = F, main = "Age: MICE Output", col = "lightgreen", ylim = c(0, 0.04))


The imputed age distribution looks similar to the original one, so we can replace the Age variable with the mice output.

# Replace Age variable with mice model
full$Age <- miceOutput$Age
# Show new number of missing Age values
sum(is.na(full$Age))
[1] 0
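
Besides histograms, a quick numeric check is to put the quantiles of the original and completed ages side by side. The sketch below uses the first few non-missing ages from the glimpse output and, as a stand-in for the mice draws, resamples from those observed values. That resampling is an assumption made only so the example runs without the mice package.

```r
# First few non-missing ages from the glimpse output
observed_age <- c(22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58)

# Stand-in for imputed values (assumption): resample from the observed ages
set.seed(129)
imputed_age <- sample(observed_age, size = 4, replace = TRUE)
completed_age <- c(observed_age, imputed_age)

# Side-by-side five-number summaries
comparison <- rbind(original  = quantile(observed_age),
                    completed = quantile(completed_age))
print(comparison)
```

On the real data the same comparison would be `quantile(full$Age, na.rm = TRUE)` against `quantile(miceOutput$Age)`, run before Age is overwritten.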


We have now finished imputing the variables we are interested in. Next we will do a bit more feature engineering.


FEATURE ENGINEERING : ROUND 2


Now we will create a few age-dependent variables: Child and Mother. A child is simply someone under 18 years of age, and a mother is a passenger who:

  • Is female
  • Is over 18
  • Has more than 0 children (Parch > 0)
  • Does not have the “Miss” title
# First we will look at the relationship between age and survival
ggplot(data = full[1:891, ], aes(x = Age, fill = factor(Survived))) + 
  geom_histogram() + 
  facet_grid(.~Sex) + 
  theme_bw()


# Create a column child, and indicate whether child or adult
full$Child[full$Age < 18] <- "Child"
full$Child[full$Age >= 18] <- "Adult"
# Show counts
table(full$Child, full$Survived)
       
          0   1
  Adult 482 272
  Child  67  70


Here we see that children had roughly even odds of survival (70 of 137 survived), whereas only 272 of 754 adults survived. Now we will create the variable Mother, in the hope that mothers were more likely to survive on the Titanic.

# Adding Mother variable
full$Mother <- "Not Mother"
full$Mother[full$Sex == "female" & full$Parch > 0 & full$Age > 18 & full$Title != "Miss"] <- "Mother"
# Show counts 
table(full$Mother, full$Survived)
            
               0   1
  Mother      16  39
  Not Mother 533 303
# Factorizing the Child and Mother variables
full$Child <- factor(full$Child)
full$Mother <- factor(full$Mother)
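
The two rules can also be expressed with `ifelse()`, which makes the conditions easy to check on a handful of rows. This is a sketch on a toy data frame with made-up passengers, not a replacement for the code above.

```r
# Toy passengers (assumed values for illustration)
toy <- data.frame(Sex   = c("female", "female", "male", "female"),
                  Age   = c(30, 16, 40, 45),
                  Parch = c(2, 1, 1, 0),
                  Title = c("Mrs", "Miss", "Mr", "Mrs"),
                  stringsAsFactors = FALSE)

# Child: under 18; everyone else is an adult
toy$Child <- ifelse(toy$Age < 18, "Child", "Adult")

# Mother: female, over 18, Parch > 0 and not titled "Miss"
toy$Mother <- ifelse(toy$Sex == "female" & toy$Parch > 0 &
                       toy$Age > 18 & toy$Title != "Miss",
                     "Mother", "Not Mother")
print(toy)
```

Only the first row satisfies all four conditions; the fourth passenger is excluded because Parch is 0.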


Let us double-check which missing values remain in the dataset.

md.pattern(full)
    PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
204           1      1    1   1   1     1     1      1    1     1        1     1
687           1      1    1   1   1     1     1      1    1     1        1     1
91            1      1    1   1   1     1     1      1    1     1        1     1
327           1      1    1   1   1     1     1      1    1     1        1     1
              0      0    0   0   0     0     0      0    0     0        0     0
    Surnames Fsize Family FsizeD Child Mother Survived Deck     
204        1     1      1      1     1      1        1    1    0
687        1     1      1      1     1      1        1    0    1
91         1     1      1      1     1      1        0    1    1
327        1     1      1      1     1      1        0    0    2
           0     0      0      0     0      0      418 1014 1432

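We have now treated the missing values that matter for modelling. The only remaining gaps are Survived, which is missing for the 418 test passengers we want to predict, and Deck, which is not used in the model. We also created some new variables that should help in building a model that reliably predicts survival.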



PREDICTION


Now we will predict who survives among the passengers of the Titanic from the variables we treated and created. We will use the randomForest classification algorithm for our prediction.


SPLIT INTO TRAIN AND TEST DATASETS


# Split the data back into a train set and a test set
train <- full[1:891,]
test <- full[892:1309,]


BUILDING THE MODEL


Now we will build our model by using randomForest on the train set.

# Set a random seed
set.seed(754)
# Build the model (note: not all possible variables are used)
model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FsizeD + Child + Mother, data = train)
# Show model error
plot(model, ylim = c(0, 0.36))
legend("topright", colnames(model$err.rate), col = 1:3, fill = 1:3)


The black line shows the overall error rate which falls below 20%. The red and green lines show the error rate for ‘died’ and ‘survived’ respectively.
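
To make the three error lines concrete, the sketch below computes an overall error rate and per-class error rates from a confusion matrix in base R. The labels are made up, and in the real model these rates come from out-of-bag predictions rather than from a fixed vector like this one.

```r
# Toy actual and predicted labels (assumed values): 0 = died, 1 = survived
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 0, 1, 0, 1, 0, 1, 1), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes
conf <- table(actual, predicted)

# Overall error: share of all cases that are misclassified
overall_error <- 1 - sum(diag(conf)) / sum(conf)

# Class error: share of each actual class that is misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

print(conf)
print(overall_error)
print(class_error)
```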


VARIABLE IMPORTANCE


Let’s look at relative variable importance by plotting the mean decrease in Gini calculated across all trees.

# Get importance (stored as imp to avoid masking the importance() function)
imp           <- importance(model)
varImportance <- data.frame(Variables = row.names(imp), 
                            Importance = round(imp[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
    y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
    hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_bw()


From this we see that the variable Title is very important.
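
For intuition about the measure itself, here is a small worked example of the Gini decrease for a single split, using made-up labels. As far as I understand the randomForest documentation, MeanDecreaseGini totals such decreases over every split on a variable and averages them across trees; the helper `gini` below is a hypothetical illustration, not the package's internal code.

```r
# Hypothetical helper: Gini impurity of a set of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Toy node with 8 passengers, split into two children (assumed values)
parent <- c(0, 0, 0, 0, 1, 1, 1, 1)
left   <- c(0, 0, 0, 1)
right  <- c(0, 1, 1, 1)

# Impurity decrease: parent impurity minus size-weighted child impurities
decrease <- gini(parent) -
  (length(left)  / length(parent)) * gini(left) -
  (length(right) / length(parent)) * gini(right)
print(decrease)
```

A variable that often produces purer child nodes accumulates a larger total decrease and therefore ranks higher in the plot.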


PREDICTION

# Predict using the test set
prediction <- predict(model, test)
# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
# Write the solution to file
write.csv(solution, file = "Solution.csv", row.names = FALSE)
# Count
table(solution$Survived)

  0   1 
266 152 


CONCLUSION

In this analysis of the Titanic dataset, the model predicts that 266 of the 418 passengers in the test set died. Together with the 549 recorded deaths in the training set, that gives an estimated 815 deaths among the 1309 passengers in the dataset.
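
The arithmetic behind the combined figure can be checked from the tables shown earlier: the Child/Adult table gives the recorded deaths in the training set, and the prediction table gives the predicted deaths in the test set. A minimal sketch:

```r
# Recorded deaths in the training set: adults + children with Survived = 0
train_died <- 482 + 67

# Predicted deaths in the test set (from the prediction table)
test_pred_died <- 266

# Combined estimate over all passengers in the dataset
total_died <- train_died + test_pred_died
share_died <- total_died / (891 + 418)

print(total_died)
print(round(share_died, 3))
```

Note that this figure mixes observed outcomes with model predictions, so it is an estimate for this dataset rather than a count for the whole ship.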

