1.Introduction

I am new to data science and machine learning, and I was looking for a simple introduction to the Kaggle prediction competitions. This is my first competition on Kaggle.

2.Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we are asked to complete the analysis of what sorts of people were likely to survive. In particular, we are asked to apply the tools of machine learning to predict which passengers survived the tragedy.

3.Overview

The data has been split into two groups:

training set (train.csv)
test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
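
As a rough illustration of the gender-based baseline described above, the sketch below builds such predictions for a small made-up data frame. The rows and the names `toy_test` and `baseline` are assumptions for illustration; only the column names PassengerId, Sex and Survived come from the competition files.

```r
# Toy stand-in for test.csv (made-up rows, for illustration only)
toy_test <- data.frame(
  PassengerId = c(892, 893, 894, 895),
  Sex = c("male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# Gender baseline: predict 1 (survived) for females, 0 otherwise
baseline <- data.frame(
  PassengerId = toy_test$PassengerId,
  Survived = ifelse(toy_test$Sex == "female", 1, 0)
)
print(baseline)
```

A data frame of this shape, written with write.csv(..., row.names = FALSE), has the two columns a submission file needs.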

4.Data Dictionary

Variable: Definition (Key)

survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

I have downloaded the data set from Kaggle and would like to examine it further.

library(ggplot2)
library(dplyr)
#download.file("https://www.kaggle.com/c/titanic/download/test.csv")
#download.file("https://www.kaggle.com/c/titanic/download/train.csv")
train <- read.csv(file="./Data/train.csv", header=TRUE, sep=",",stringsAsFactors = FALSE)
test <- read.csv(file="./Data/test.csv", header=TRUE, sep=",",stringsAsFactors = FALSE)

5.Exploratory Data Analysis

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
summary(test)
##   PassengerId         Pclass          Name               Sex           
##  Min.   : 892.0   Min.   :1.000   Length:418         Length:418        
##  1st Qu.: 996.2   1st Qu.:1.000   Class :character   Class :character  
##  Median :1100.5   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :1100.5   Mean   :2.266                                        
##  3rd Qu.:1204.8   3rd Qu.:3.000                                        
##  Max.   :1309.0   Max.   :3.000                                        
##                                                                        
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000                     
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
##  NA's   :86                                                          
##       Fare            Cabin             Embarked        
##  Min.   :  0.000   Length:418         Length:418        
##  1st Qu.:  7.896   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character  
##  Mean   : 35.627                                        
##  3rd Qu.: 31.500                                        
##  Max.   :512.329                                        
##  NA's   :1
set.seed(123)

We will combine both the test and train data for exploring and processing the data.

# Test labels are unknown; set a placeholder of 0 so the columns match for rbind
test$Survived <- 0

finalData <- rbind(train,test)
str(finalData)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(finalData)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.2613   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                                                                    
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin          
##  Length:1309        Min.   :  0.000   Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character  
##                     Mean   : 33.295                     
##                     3rd Qu.: 31.275                     
##                     Max.   :512.329                     
##                     NA's   :1                           
##    Embarked        
##  Length:1309       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

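So Age and Fare have missing values among the numeric columns: 263 for Age and 1 for Fare. We also have to examine the non-numeric columns, where missing values appear as empty strings (for example in Cabin and Embarked).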
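
Since summary() only reports NA values, empty strings in character columns go unnoticed. A minimal sketch of counting both kinds of gap per column is given below; the data frame `toy` holds made-up values and is an assumption for illustration, not the Titanic data.

```r
# Toy data frame (made-up values) mixing NA and empty-string gaps
toy <- data.frame(
  Age = c(22, NA, 26, NA),
  Fare = c(7.25, 71.28, NA, 53.1),
  Cabin = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count a value as missing if it is NA or an empty string
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(missing_counts)
```

The same sapply() call could be pointed at the combined data frame to see all columns at once.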

6.Visualize the missing data

library(Amelia)    # a package to visualize missing data 
missmap(finalData,main="Titanic Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)

#no. of survived and no. of expired people
as.data.frame(table(finalData$Survived))
##   Var1 Freq
## 1    0  967
## 2    1  342

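So in the combined data 342 passengers are marked as survived and 967 as not survived. Note that the 967 includes the 418 test passengers, whose Survived value is only the placeholder 0; in the training data alone, 342 of 891 passengers survived and 549 did not.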
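
A minimal sketch of how the count could be restricted to the labelled rows, so that placeholder values from the test set are not counted as deaths. The data frame `toy` and the boundary `n_train` are assumptions for illustration.

```r
# Toy combined data (made-up): the last two rows mimic test rows whose
# Survived value is only a placeholder 0
toy <- data.frame(
  PassengerId = 1:6,
  Survived = c(0, 1, 1, 0, 0, 0)
)
n_train <- 4  # hypothetical number of labelled (training) rows

# Tabulate outcomes on the labelled rows only
labelled <- toy[toy$PassengerId <= n_train, ]
counts <- table(labelled$Survived)
print(counts)
```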

How many female and male survivors are there in each age range?

survived <- finalData[finalData$Survived == 1, ]


ggplot(survived, aes(x = Age, fill = factor(Sex))) +
  geom_histogram()

#survival rate for lone travellers

travelled_alone <- survived[survived$SibSp == 0 & survived$Parch == 0,]
nrow(travelled_alone) # no. of passengers who travelled alone and survived
## [1] 163
as.data.frame(table(travelled_alone$Sex))
##     Var1 Freq
## 1 female   99
## 2   male   64
tot_travelled_alone <- train[train$SibSp == 0 & train$Parch == 0,]
as.data.frame(table(tot_travelled_alone$Sex))
##     Var1 Freq
## 1 female  126
## 2   male  411

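Conclusion: Female lone travellers were far more likely to survive than male lone travellers: 99 of 126 women travelling alone survived (about 79%), compared with 64 of 411 men (about 16%).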
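
The rates behind this comparison can be checked with a few lines of arithmetic. The counts are the ones printed in the tables above; the variable names are illustrative.

```r
# Counts taken from the tables above (lone travellers only)
survivors <- c(female = 99, male = 64)
total     <- c(female = 126, male = 411)

# Survival rate per sex, as a percentage rounded to one decimal
rate <- round(100 * survivors / total, 1)
print(rate)
```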

#Give dummy variables for sex column
finalData$Sex <- ifelse(finalData$Sex == "male",0,1)
#Find how many travelled together, atleast the immediate family members count.
finalData$Familycount <- finalData$SibSp + finalData$Parch + 1 
head(finalData)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   0  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)   1  38     1     0
## 3                              Heikkinen, Miss. Laina   1  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35     1     0
## 5                            Allen, Mr. William Henry   0  35     0     0
## 6                                    Moran, Mr. James   0  NA     0     0
##             Ticket    Fare Cabin Embarked Familycount
## 1        A/5 21171  7.2500              S           2
## 2         PC 17599 71.2833   C85        C           2
## 3 STON/O2. 3101282  7.9250              S           1
## 4           113803 53.1000  C123        S           2
## 5           373450  8.0500              S           1
## 6           330877  8.4583              Q           1

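Survival probability in different classes

# Use training rows only: test rows carry the placeholder Survived = 0
finalData %>% dplyr::filter(PassengerId <= 891) %>% ggplot(aes(x=Pclass,fill=factor(Survived)))+geom_bar(stat="count",position="fill")

Survival probability by sex

finalData %>% dplyr::filter(PassengerId <= 891) %>% ggplot(aes(x=Sex,fill=factor(Survived)))+geom_bar(stat="count",position="fill")

Survival probability depending on Family Size

finalData$Familycount <- finalData$SibSp + finalData$Parch + 1 
# factor(Pclass) gives one pair of boxes per class instead of one pair overall
finalData %>% dplyr::filter(PassengerId <= 891) %>% ggplot()+geom_boxplot(aes(x=factor(Pclass),y=Familycount,fill=as.factor(Survived))) + theme_classic()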

We will see the correlation between variables.
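
One way to look at such correlations is a pairwise correlation matrix over the numeric columns. The sketch below is an assumption about how this could be done, using made-up values rather than the Titanic data; `toy` and `cor_matrix` are illustrative names.

```r
# Toy numeric data (made-up values) with one missing Age
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 2),
  Age      = c(22, 38, 26, 35, NA, 54),
  Fare     = c(7.25, 71.28, 7.92, 53.1, 8.05, 13)
)

# Pairwise Pearson correlations, skipping NA values pair by pair
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

With use = "pairwise.complete.obs", a missing Age only removes that row from the pairs that involve Age, so the other correlations still use every row.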

7.Process Data

finalData$Familycount <- finalData$SibSp + finalData$Parch + 1 

Unite Family members with their family group

#To do that split names into title, Surname, FirstName and Maiden name only for Mrs.
library(dplyr)
finalData <- finalData %>% 
        mutate(tempcol = strsplit(Name, "[,.]")) %>% 
        rowwise() %>% 
        mutate(Surname = unlist(tempcol)[1], Title = unlist(tempcol[2]), Firstname = unlist(tempcol[3]), 
        Maidenname = unlist(ifelse(Title == " Mrs",tail(strsplit(strsplit(tempcol[3],"[\\(\\)]")[[1]][2]," ")[[1]],1) ,"U")))  %>%
        select(-c(Name,tempcol))

Get family ids

#second with surname and family count create familyid..no family for single travellers
finalData$Familyid <- ifelse( finalData$Familycount == 1,"single",paste(as.character(finalData$Familycount), finalData$Surname, sep=""))

unique(finalData$Title)
##  [1] " Mr"           " Mrs"          " Miss"         " Master"      
##  [5] " Don"          " Rev"          " Dr"           " Mme"         
##  [9] " Ms"           " Major"        " Lady"         " Sir"         
## [13] " Mlle"         " Col"          " Capt"         " the Countess"
## [17] " Jonkheer"     " Dona"
finalData$Title[finalData$Title %in% c( ' Mlle', ' Ms')] <- ' Miss'
finalData$Title[finalData$Title %in% c(' Capt', ' Don', ' Major', ' Sir',' Col')] <- ' Sir'
finalData$Title[finalData$Title %in% c(' Dona', ' Lady', ' the Countess', ' Jonkheer', ' Mme')] <- ' Lady'
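
The name-splitting step above relies on the "Surname, Title. First names" pattern. As a hedged alternative sketch in base R (not the dplyr pipeline used above), the surname and title can be pulled out with regular expressions. Unlike the pipeline above, this version trims the leading space from the title; the example names are copied from the str() output shown earlier.

```r
# Example names copied from the str() output shown earlier
names_vec <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# Surname: everything before the first comma
surname <- sub(",.*$", "", names_vec)
# Title: text between the comma and the first period, trimmed
title <- trimws(sub("^[^,]*,([^.]*)\\..*$", "\\1", names_vec))

print(data.frame(surname, title))
```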

8.Impute Missing Values

# Number of Rows with missing values 
nrow(finalData[!complete.cases(finalData),])
## [1] 266
library(Hmisc)

#aregImpute() automatically identifies the variable type and treats them accordingly.
finalData$Title <- as.factor(finalData$Title)
#finalData$Embarked <- as.factor(finalData$Embarked)
impute_arg <- aregImpute(~  Age + Sex + Pclass + SibSp + Parch + Fare + Title   , data = finalData, n.impute = 10, nk = 0 )
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
## Iteration 9
## Iteration 10
## Iteration 11
## Iteration 12
## Iteration 13
# Get the imputed values
imputed <-impute.transcan(impute_arg, data=finalData, imputation = 5, list.out=TRUE, pr=FALSE, check=FALSE)

# convert the list to the database
imputed.data <- as.data.frame(do.call(cbind,imputed))

# arrange the columns accordingly
#as of now consider age alone...and fare can be imputed manually as it is missing only one value, for passenger_id 1044
finalData <- cbind(finalData,imputed.data$Age)
finalData$Age <- NULL
names(finalData)[names(finalData) == 'imputed.data$Age'] <- 'Age'

finalData$Fare[1044] <- imputed.data$Fare[1044]

# Two Embarked values are empty. Both passengers are ladies with the same ticket id, Pclass = 1, ages 38 and 62, and not related (might be friends).
# They shared the same cabin (B28) and both survived. Use this information to find the port of embarkation.
finalData[finalData$Embarked == "", ]
##     PassengerId Survived Pclass Sex SibSp Parch Ticket Fare Cabin Embarked
## 62           62        1      1   1     0     0 113572   80   B28         
## 830         830        1      1   1     0     0 113572   80   B28         
##     Familycount Surname Title                      Firstname Maidenname
## 62            1   Icard  Miss                         Amelie          U
## 830           1   Stone   Mrs  George Nelson (Martha Evelyn)     Evelyn
##     Familyid Age
## 62    single  38
## 830   single  62
prop.table(table(finalData$Embarked,finalData$Survived),1)
##    
##             0         1
##     0.0000000 1.0000000
##   C 0.6555556 0.3444444
##   Q 0.7560976 0.2439024
##   S 0.7625821 0.2374179
# With the above info, check the whole data set
table(finalData$Embarked[finalData$Pclass == 1 & finalData$Sex == 1 & finalData$Age >= 38 & finalData$Age <= 62 & finalData$Survived == 1])
## 
##     C  S 
##  2 22 14
# Most passengers matching this profile embarked at "C", therefore fill the missing values with "C"
finalData$Embarked[62] <- "C"
finalData$Embarked[830] <- "C"

#rows of title = Mrs and no Maiden name has to be filled. Not of much importance, so consider it later

head(finalData)
##   PassengerId Survived Pclass Sex SibSp Parch           Ticket    Fare
## 1           1        0      3   0     1     0        A/5 21171  7.2500
## 2           2        1      1   1     1     0         PC 17599 71.2833
## 3           3        1      3   1     0     0 STON/O2. 3101282  7.9250
## 4           4        1      1   1     1     0           113803 53.1000
## 5           5        0      3   0     0     0           373450  8.0500
## 6           6        0      3   0     0     0           330877  8.4583
##   Cabin Embarked Familycount   Surname Title
## 1              S           2    Braund    Mr
## 2   C85        C           2   Cumings   Mrs
## 3              S           1 Heikkinen  Miss
## 4  C123        S           2  Futrelle   Mrs
## 5              S           1     Allen    Mr
## 6              Q           1     Moran    Mr
##                                Firstname Maidenname  Familyid Age
## 1                            Owen Harris          U   2Braund  22
## 2  John Bradley (Florence Briggs Thayer)     Thayer  2Cumings  38
## 3                                  Laina          U    single  26
## 4          Jacques Heath (Lily May Peel)       Peel 2Futrelle  35
## 5                          William Henry          U    single  35
## 6                                  James          U    single  63
#Extract cabin level info..if unknown mark 'U'
finalData$Cabin <- sub("^$", "U" , finalData$Cabin)
finalData <- finalData %>% rowwise() %>% mutate(CabinLevel = substring(Cabin,1,1))


#create ticket category

finalData <- finalData %>% rowwise() %>% mutate(Ticketcategory = strsplit(Ticket, " ")[[1]][1])
finalData$Ticketcategory <- gsub("^\\d*$","XX",finalData$Ticketcategory)

# create family category
# if family size = 2 & 3  then small family  else big family
finalData <- finalData %>% rowwise() %>% mutate(Familycategory = ifelse(Familycount %in% c(2,3), "small" , ifelse(Familycount %in% c(1), "single", "big")))
#check people with similar familyid and family count matches and their age,sex, cabin, 
#Groupby familyid can be used to fill in missing values, coz people in same family tend to be in same cabin..embark at same place..etc..

df <- finalData %>% group_by(Surname) %>% summarise(n())

#Categorise age
finalData <- finalData %>% mutate(Agecategory = ifelse(Age < 18, "kid",ifelse(Age>60, "old", "adult")))

#Save family by finding family members later
#There are 9 Andersson but only 7 family members..so two are other Andersson..change family id. Later we can combine them with their family
#How come they also have come with 7 family members each...find it
finalData$Familyid[finalData$Surname == "Andersson" & finalData$Familycount == 7 & finalData$Ticket == "3101281"] <- "Andersson1"
finalData$Familyid[finalData$Surname == "Andersson" & finalData$Familycount == 7 & finalData$Ticket == "347091"] <- "Andersson2"

#Put all Richards together..they have diff familyid, same ticketid so same family..find other two siblings of Mrs. Richards ( 438 )
finalData$Familyid[finalData$Ticket == "29106" ]  <- "Richards" # might be hocking is sibiling of Richards..lil bit confusing..leaving it for now

#Find two siblings of Mr. Kink-Heilmann and give them the same family id
finalData$Familyid[finalData$Ticket == "315153" ] <- "Kink-Heilmann"

#Hocking family all confusion..two children missing for Mrs. Hocking..n find her sibling also
finalData$Familyid[finalData$Ticket == "29105" ] <- "Hocking"

#Find sibling of MR Vander Planke , united Mr and MRs...find his sibling
finalData$Familyid[finalData$Ticket == "345763" ] <- "Vander Planke"
finalData$Familyid[finalData$Ticket == "345764" ] <- "Vander Planke" # Same surname and Fare 

finalData$Familyid[finalData$Ticket == "31027" ] <- "Renouf"

#Unite Mr and Mrs
finalData$Familyid[finalData$Ticket == "243847" ] <- "Jacobsohn"


finalData$Familyid[finalData$Ticket =="F.C. 12750" ] <- "Davidson" # Find her 2 children ?? Mr. Davidson..no children..so correct the value

finalData$Familyid[finalData$Ticket =="3101278" ] <- "Backstrom" # Find 2 siblings of Mrs Bakstrom

finalData$Familyid[finalData$Ticket =="345763" ] <-  "Vander Planke" # Some error with other Vander Planke..find out correct count of Vander Planke family members

finalData$Familyid[finalData$Ticket =="2625" ] <- "Thomas" # Find her sibling of Mrs. Thomas..she didnt come with spouse

finalData$Familyid[finalData$Ticket =="347054" ] <- "Strom" #Find sibling 

finalData$Familyid[finalData$Ticket =="C.A. 33112" & finalData$Surname == "Davies" ] <- "Davies" # might be find parent of Mrs.. or correct Parch count
finalData$Familyid[finalData$Ticket %in% c("A/4 48871", "A/4 48873") & finalData$Surname == "Davies" ] <- "Davies1" # match either ticket number

finalData$Familyid[finalData$Surname =="Brown" & finalData$Ticket == "29750" ] <- "3Brown1" # Find sibling of 1248 another brown not this family

finalData$Familyid[finalData$Ticket =="11769"] <- "3Appleton" # find one sibling

#Wilkes sibling missing...893 id
# p id 19 Mrs. Vander Planke  sibling missing check with ticket id 345763

# Mrs stephenson sibling missing ticket 36947 pid 592

# Uniting mother n child coz mothers surname and daughters maiden name same & same ticket id
finalData$Familyid[finalData$Ticket =="230433"] <- "Parrish"

# Might be father and daughter
finalData$Familyid[finalData$Ticket =="13236"] <- "Mock"

finalData$Familyid[finalData$Ticket == "31027"] <- "2Renouf"

#Sisters unite now
finalData$Familyid[finalData$Ticket == "13502" & finalData$PassengerId == "276"] <- "Andrews"
finalData$Familyid[finalData$Ticket == "13502" & finalData$PassengerId == "766"] <- "Andrews"

finalData$Familyid[finalData$Ticket == "3101298"] <- "Hirvonen"

finalData$Familyid[finalData$Ticket == "PC 17572" & finalData$Surname == "Harper"] <- "Harper1"

#Find sibling of Miss. Eustis pid 497

finalData$Familyid[finalData$PassengerId == "311"] <- "Potters"
finalData$Familyid[finalData$PassengerId == "1042"] <- "Potters" # mother n daughter

finalData$Familyid[finalData$Ticket == "113505"] <- "Bowermans"


#Fill in Maidenname
# if no maidenname for Mrs..fill with U
finalData$Maidenname[finalData$PassengerId %in% c(20,257,707,798,911,1148)] <- "U"

finalData <- finalData[finalData$CabinLevel != "T", ]

# 3. Splitting Data Manually as it was before

# Do not convert every numeric/integer column to a factor: that yields factors with more than 53 levels, which random forest (used by rfe) cannot handle
#finalData[] <- lapply(finalData, factor)

finalData$Pclass <- ifelse(finalData$Pclass == 1, "One", ifelse(finalData$Pclass == 2,"Two","Three"))
finalData$Survived <- ifelse(finalData$Survived == 1, "Y","N")


finalData$Survived <- as.factor(finalData$Survived)
finalData$Pclass <- as.factor(finalData$Pclass)
finalData$Sex <- as.factor(finalData$Sex)
finalData$Embarked <- as.factor(finalData$Embarked)
finalData$Title <- as.factor(finalData$Title)
finalData$CabinLevel <- as.factor(finalData$CabinLevel)
finalData$Ticketcategory <- as.factor(finalData$Ticketcategory)
finalData$Familycategory <- as.factor(finalData$Familycategory)
finalData$Agecategory <- as.factor(finalData$Agecategory)
finalData$Familyid <- as.factor(finalData$Familyid)
finalData$Familyid_num <- as.numeric(finalData$Familyid)

9. Variable Importance - Method 1

library(caret)  # provides rfeControl(), rfe() and rfFuncs
control <- rfeControl(functions = rfFuncs,method = "repeatedcv",number = 10)
predictors <- c("Pclass","Sex","SibSp","Parch","Fare","Embarked","Familycount","Title","Age","CabinLevel","Ticketcategory","Familycategory","Agecategory")
Survival_Pred_Profile <- rfe(finalData[ ,predictors], finalData$Survived, sizes = c(1:13),
                         rfeControl = control,metric = "Accuracy")
print(Survival_Pred_Profile) # Top 5 : Title, CabinLevel, Sex, Fare, Pclass
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy   Kappa AccuracySD KappaSD Selected
##          1   0.7240 0.13103   0.030757 0.09805         
##          2   0.7439 0.15768   0.030370 0.10985         
##          3   0.7523 0.20380   0.030086 0.09924         
##          4   0.7654 0.28003   0.044146 0.18248         
##          5   0.7860 0.36214   0.033897 0.15734        *
##          6   0.7815 0.33242   0.041048 0.20652         
##          7   0.7769 0.26684   0.035222 0.22929         
##          8   0.7547 0.12868   0.027758 0.21292         
##          9   0.7509 0.09319   0.028572 0.19672         
##         10   0.7462 0.05937   0.024846 0.16952         
##         11   0.7462 0.06132   0.022415 0.16140         
##         12   0.7462 0.05868   0.022337 0.16182         
##         13   0.7393 0.01259   0.009989 0.03392         
## 
## The top 5 variables (out of 5):
##    Title, CabinLevel, Sex, Fare, Pclass
# list the chosen features
predictors(Survival_Pred_Profile)
## [1] "Title"      "CabinLevel" "Sex"        "Fare"       "Pclass"
# plot the results
plot(Survival_Pred_Profile, type=c("g", "o"))

10. Variable Importance - Method 2

control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Survived ~ Pclass+Sex+SibSp+Parch+Fare+Embarked+Familycount+Title+Age+CabinLevel+Ticketcategory+Familycategory,
               data = finalData, method="rf", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

11. Building model and Prediction Using Ensemble ( using caret package )

There are four broad ways to improve model performance: with data, with algorithms, with algorithm tuning, and with ensembles. We have done some basic data improvements. After trying several algorithms, such as random forest, logistic regression and ranger, I finally built a model using an ensemble with the caret package.

#Use the selected features (11 in total) in the predictions and split the data into train and test sets.
predictors <- c("Pclass","Sex","Parch","Embarked","Familycount","Title","Age","CabinLevel","Fare","Ticketcategory","Familyid_num")

selectedData <- finalData[,predictors]
selectedData <- cbind(selectedData,finalData$Survived)
names(selectedData)[12] <- "Survived"
names(finalData)[names(finalData) == 'finalData$Survived'] <- 'Survived'

#Splitting data
anyNA(selectedData)
## [1] FALSE
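# Split on PassengerId rather than row position: the Cabin 'T' row was dropped above,
# so row positions no longer match passenger ids
# (the str() output below was produced with the original positional split 1:891 / 892:1309)
trainSet <- selectedData[finalData$PassengerId <= 891,]
testSet <- selectedData[finalData$PassengerId > 891,]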
str(trainSet)
## 'data.frame':    891 obs. of  12 variables:
##  $ Pclass        : Factor w/ 3 levels "One","Three",..: 2 1 2 1 2 2 1 2 2 3 ...
##  $ Sex           : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Parch         : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Embarked      : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  $ Familycount   : num  2 2 1 2 1 1 1 5 3 2 ...
##  $ Title         : Factor w/ 8 levels " Dr"," Lady",..: 5 6 4 6 5 5 5 3 6 6 ...
##  $ Age           : num  22 38 26 35 35 63 54 2 27 14 ...
##  $ CabinLevel    : Factor w/ 8 levels "A","B","C","D",..: 8 3 8 3 8 8 5 8 8 8 ...
##  $ Fare          : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Ticketcategory: Factor w/ 51 levels "A.","A./5.","A.5.",..: 6 23 44 51 51 51 51 51 51 51 ...
##  $ Familyid_num  : num  13 26 215 39 215 215 215 190 149 81 ...
##  $ Survived      : Factor w/ 2 levels "N","Y": 1 2 2 2 1 1 1 1 2 2 ...
str(testSet)
## 'data.frame':    418 obs. of  12 variables:
##  $ Pclass        : Factor w/ 3 levels "One","Three",..: 2 3 2 2 2 2 3 2 2 2 ...
##  $ Sex           : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 1 2 1 1 ...
##  $ Parch         : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ Embarked      : Factor w/ 3 levels "C","Q","S": 3 2 3 3 3 2 3 1 3 3 ...
##  $ Familycount   : num  2 1 1 3 1 1 3 1 3 1 ...
##  $ Title         : Factor w/ 8 levels " Dr"," Lady",..: 6 5 5 6 5 4 5 6 5 5 ...
##  $ Age           : num  47 62 27 22 14 30 26 18 21 19 ...
##  $ CabinLevel    : Factor w/ 8 levels "A","B","C","D",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Fare          : num  7 9.69 8.66 12.29 9.22 ...
##  $ Ticketcategory: Factor w/ 51 levels "A.","A./5.","A.5.",..: 51 51 51 51 51 51 51 51 4 51 ...
##  $ Familyid_num  : num  114 215 215 207 215 215 124 215 132 215 ...
##  $ Survived      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
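
Because one row (CabinLevel "T") is dropped before the split, positional indices no longer line up with the original train/test boundary. The sketch below shows the effect on made-up data and how splitting on the id avoids it; `toy` and `last_train_id` are assumptions for illustration.

```r
# Toy combined data (made-up): ids 1-4 are "train", ids 5-6 are "test"
toy <- data.frame(
  PassengerId = 1:6,
  CabinLevel = c("U", "T", "C", "U", "U", "B"),
  stringsAsFactors = FALSE
)
last_train_id <- 4  # hypothetical boundary between train and test ids

# Dropping a row shifts every later row up by one position
toy <- toy[toy$CabinLevel != "T", ]

# Positional split: the first test passenger leaks into the training rows
by_position <- toy[1:last_train_id, ]
# Id-based split: unaffected by the dropped row
by_id <- toy[toy$PassengerId <= last_train_id, ]

print(by_position$PassengerId)
print(by_id$PassengerId)
```

In the toy data the positional split picks up passenger 5, which belongs to the test part, while the id-based split keeps only passengers 1, 3 and 4.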

Stacking: In stacking, multiple layers of machine learning models are placed one over another. Each model passes its predictions to the model in the layer above it, and the top-layer model makes its decision based on the outputs of the models in the layers below. Two criteria matter for the base models: individual model accuracy and the correlation between their predictions. If the predictions of the base models are highly correlated, then stacking these three models might not give better results than the individual models.

STEPS for STACKING

1. Train the individual base layer models on training data.
2. Predict using each base layer model for training data and test data.
3. Now train the top layer model again on the predictions of the bottom layer models that have been made on the training data.
4. Finally, predict using the top layer model with the predictions of bottom layer models that have been made for testing data.
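
The four steps above can be sketched by hand in base R. This is only an illustration under stated assumptions: it uses simulated data rather than the Titanic data, two logistic regressions as base models instead of rpart, gbm and treebag, and out-of-fold predictions as one common way of obtaining the training-data predictions in step 2. All names (`base_formulas`, `oof`, `meta`, and so on) are hypothetical.

```r
set.seed(42)

# Simulated binary-classification data (not the Titanic data)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 - x2))
dat <- data.frame(y, x1, x2)
train_df <- dat[1:150, ]
test_df <- dat[151:200, ]

# Two deliberately simple base models, each seeing one feature
base_formulas <- list(m1 = y ~ x1, m2 = y ~ x2)
fit_base <- function(f, d) glm(f, data = d, family = binomial)

# Steps 1-2: out-of-fold base predictions on the training data
k <- 5
fold <- sample(rep(1:k, length.out = nrow(train_df)))
oof <- data.frame(m1 = numeric(nrow(train_df)), m2 = numeric(nrow(train_df)))
for (i in 1:k) {
  held_out <- fold == i
  for (m in names(base_formulas)) {
    fit <- fit_base(base_formulas[[m]], train_df[!held_out, ])
    oof[held_out, m] <- predict(fit, train_df[held_out, ], type = "response")
  }
}

# Step 3: top-layer model trained on the base predictions
oof$y <- train_df$y
meta <- glm(y ~ m1 + m2, data = oof, family = binomial)

# Step 4: base models refit on all training data, then stacked on test data
test_base <- as.data.frame(lapply(base_formulas, function(f) {
  predict(fit_base(f, train_df), test_df, type = "response")
}))
stack_prob <- predict(meta, test_base, type = "response")
stack_pred <- ifelse(stack_prob > 0.5, 1, 0)
print(mean(stack_pred == test_df$y))  # hold-out accuracy of the stack
```

Using out-of-fold predictions keeps the top-layer model from being trained on predictions that the base models made for rows they had already seen.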

library(caret)
library(caretEnsemble)
library(rpart)
fitControl <- trainControl(method="repeatedcv", number=10, repeats=10, savePredictions=TRUE,classProbs=TRUE)

set.seed(1234)

algorithmList <- c( 'rpart','gbm','treebag')
models <- caretList(trainSet[,1:11],trainSet$Survived , trControl=fitControl, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

# correlation between results
modelCor(results)
splom(results)

# stack using gbm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(1234)
stack.gbm <- caretStack(models, method="gbm", metric="Accuracy", trControl=stackControl)
print(stack.gbm)


stack1Pred <- predict(stack.gbm,testSet)

Pred <- ifelse(stack1Pred == "Y", 1,0)

out <- data.frame(PassengerId=test$PassengerId,Survived=Pred,row.names=NULL)
write.csv(x=out,file="submission_ensemble.csv",row.names=FALSE,quote=FALSE)

12.My Score in Kaggle

[Image: SCORE]

13. Conclusion

Thank you for taking the time to read through this.