Introduction

Titanic is a good kick-off for any machine learning enthusiast: the data set contains only a few features and, more than that, who doesn’t know the Titanic tragedy and the love story of Jack and Rose? The Titanic is the tragic story of a ship that sank on its very first journey. It is a striking example of what happens when nature gives human beings just an hour between death and survival:

Does a man try to save himself first, or his family and children?
Does priority go to rich folks or poor, and who among the poor survives?
Do smaller families have a better survival rate than bigger ones?

With the above in mind, let's start the script. The data is taken from these sources:

Data loading and consolidation

## Load the required libraries one by one

library('ggplot2') 
library('caret')

## Warning: package 'caret' was built under R version 3.5.2

library('dplyr') 
library('randomForest') 
library('rpart')
library('rpart.plot')
library('car')
library('e1071')


## Let's load the raw data in its original form by setting stringsAsFactors = F

train.tit <- read.csv('train.csv', stringsAsFactors = F)
test.tit  <- read.csv('test.csv', stringsAsFactors = F)
test.tit$Survived <- NA

##Combine both test and train
full_titanic <- rbind(train.tit, test.tit)

##Check the structure
str(full_titanic)

## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Missing value imputation

### Are there any missing observations?
colSums(is.na(full_titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0

####Empty data
colSums(full_titanic=='')

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0          NA           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0          NA        1014           2

## The summary shows that Age is missing 263 values, Cabin also has a lot of missing values (1014 empty strings), and Embarked just 2
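The `NA` entries in the empty-string check come from columns that already contain `NA`, because `NA == ''` evaluates to `NA`. A minimal sketch, using a toy data frame and a hypothetical helper named `count_missing` (neither is part of the original script), that counts both kinds of missing value in one pass:

```r
# Toy data standing in for full_titanic (values are illustrative only)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-column count of entries that are NA or ""
count_missing <- function(df) {
  sapply(df, function(x) sum(is.na(x) | x == ""))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

Because `TRUE | NA` is `TRUE`, the `NA` rows are counted instead of propagating into the sum.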

### Let's replace the missing Embarked values with the most frequent observation

table(full_titanic$Embarked)

## 
##       C   Q   S 
##   2 270 123 914

full_titanic$Embarked[full_titanic$Embarked==""]="S"
table(full_titanic$Embarked)

## 
##   C   Q   S 
## 270 123 916
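The replacement value `"S"` was read off the table by eye. As a sketch under the assumption that the most frequent non-empty level is always the one wanted, the same step can be written without hard-coding the level (toy vector, illustrative values only):

```r
# Toy vector standing in for full_titanic$Embarked
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Most frequent non-empty level
counts <- table(embarked[embarked != ""])
most_common <- names(counts)[which.max(counts)]

# Replace the empty strings with that level
embarked[embarked == ""] <- most_common
print(table(embarked))
```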

## As Age and Cabin have too many missing values, I will check them during the analysis phase

### Check the number of unique values to see which variables we can convert to factors for our analysis

apply(full_titanic,2, function(x) length(unique(x)))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##        1309           3           3        1307           2          99 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           8         929         282         187           3

### Convert the variables below into factors for analysis

cols=c("Survived","Pclass","Sex","Embarked")
for (i in cols){
  full_titanic[,i]=as.factor(full_titanic[,i])
}
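The loop above works as written. For comparison, a sketch of the same conversion with `lapply`, shown on a toy data frame (the data values are illustrative only):

```r
# Toy data standing in for full_titanic
toy <- data.frame(
  Survived = c(0, 1, NA),
  Pclass   = c(3, 1, 2),
  Sex      = c("male", "female", "female"),
  Age      = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Convert only the chosen columns; the others keep their type
cols <- c("Survived", "Pclass", "Sex")
toy[cols] <- lapply(toy[cols], as.factor)
str(toy)
```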

Exploratory analysis and Feature engineering

Exploratory Analysis on Pclass

## Hypothesis: rich folks have a much better survival rate than poor folks. Is the Titanic any different?

### Visualize Pclass, which is the best proxy for rich and poor

# geom_bar() draws the counts; without a geom layer ggplot renders an empty panel
ggplot(full_titanic[1:891,],aes(x = Pclass,fill=factor(Survived))) +
  geom_bar()

## The Titanic is no different: the first-class survival rate is far better than the third-class rate
## No doubt rich people had a better survival rate than the poor

# Visualize the 3-way relationship of sex, pclass, and survival
ggplot(full_titanic[1:891,], aes(x = Sex, fill = Survived)) +
  geom_bar() +
  facet_wrap(~Pclass)

## In every class, the female survival rate is better than the male rate

Exploratory Analysis on Title

The first variable that catches my attention is passenger name, because we can break it down into additional meaningful variables which can feed predictions or be used in the creation of new variables. For instance, passenger title is contained within the passenger name variable, and we can use surname to represent families. Let’s do some feature engineering!

head(full_titanic$Name)

## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

## Let's extract the title and check whether it has predictive power
names <- full_titanic$Name
title <-  gsub("^.*, (.*?)\\..*$", "\\1", names)

full_titanic$title <- title

table(title)

## title
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1
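To make the regular expression concrete, here is a small worked example on four of the names printed earlier. The surname extraction with `sub` is an assumption added for illustration; the script itself only extracts the title.

```r
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
)

# Same pattern as the script: drop everything up to ", ",
# capture lazily up to the first ".", and drop the rest
titles <- gsub("^.*, (.*?)\\..*$", "\\1", sample_names)

# Assumed helper step: the surname is the text before the first comma
surnames <- sub(",.*$", "", sample_names)

print(data.frame(surname = surnames, title = titles))
```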

### Miss, Mrs, Master and Mr account for most of the passengers

### It is better to group the other titles into bigger baskets, checking gender and survival rate, to avoid overfitting


full_titanic$title[full_titanic$title == 'Mlle']        <- 'Miss' 
full_titanic$title[full_titanic$title == 'Ms']          <- 'Miss'
full_titanic$title[full_titanic$title == 'Mme']         <- 'Mrs' 
full_titanic$title[full_titanic$title == 'Lady']          <- 'Miss'
full_titanic$title[full_titanic$title == 'Dona']          <- 'Miss'

## I am afraid that creating a new category from so little data can cause overfitting
## However, I think merging the titles below into the original categories may lose some predictive power, as they are all army folks, doctors and nobility

full_titanic$title[full_titanic$title == 'Capt']        <- 'Officer' 
full_titanic$title[full_titanic$title == 'Col']        <- 'Officer' 
full_titanic$title[full_titanic$title == 'Major']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Dr']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Rev']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Don']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Sir']   <- 'Officer'
full_titanic$title[full_titanic$title == 'the Countess']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Jonkheer']   <- 'Officer'


# Let's check who among Mr, Master and Miss has a better survival rate
ggplot(full_titanic[1:891,],aes(x = title,fill=factor(Survived))) +
  geom_bar()

## If you are a Mr on the Titanic, you have less chance of survival; Miss and Mrs have a better survival rate than Master and Officer


### Visualize the 3-way relationship of Title, Pclass, and Survival

ggplot(full_titanic[1:891,], aes(x = title, fill = Survived)) +
  geom_bar() +
  facet_wrap(~Pclass)

## Master in 1st and 2nd class has 100% survival, whereas Mrs and Miss have a 90% chance of survival in 1st and 2nd class
## Since title mostly depends on age (with a few exceptions), I will use title in place of Age, which has 263 missing observations

Exploratory Analysis on Family

# Let's create a family size using SibSp and Parch

full_titanic$FamilySize <-full_titanic$SibSp + full_titanic$Parch + 1

full_titanic$FamilySized[full_titanic$FamilySize == 1]   <- 'Single'
full_titanic$FamilySized[full_titanic$FamilySize < 5 & full_titanic$FamilySize >= 2]   <- 'Small'
full_titanic$FamilySized[full_titanic$FamilySize >= 5]   <- 'Big'

full_titanic$FamilySized=as.factor(full_titanic$FamilySized)
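The three bucket assignments can be collapsed into one call to `cut`. A sketch with toy family sizes, assuming the same boundaries as above (1 is Single, 2 to 4 is Small, 5 or more is Big):

```r
# Toy family sizes standing in for full_titanic$FamilySize
family_size <- c(1, 2, 4, 5, 11)

# Intervals are (0,1], (1,4], (4,Inf], matching the three rules above
family_sized <- cut(family_size,
                    breaks = c(0, 1, 4, Inf),
                    labels = c("Single", "Small", "Big"))

print(table(family_sized))
```

One difference worth noting: `cut` keeps the levels in the order given (Single, Small, Big), whereas `as.factor` on a character vector orders them alphabetically.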


### Let's visualize the survival rate by family size
ggplot(full_titanic[1:891,],aes(x = FamilySized,fill=factor(Survived))) +
  geom_bar()

### Big families on the Titanic have a worse survival rate than small families and people travelling alone

#### Why do big families have a problem? Check the visualization below

ggplot(full_titanic[1:891,], aes(x = FamilySized, fill = Survived)) +
  geom_bar() +
  facet_wrap(~title)

## If you are a Master in a big family, your survival rate is absolutely nil, even though the overall survival rate of Master is very good

### I am very surprised to see Single make up the bulk; however, there is a chance that they came with friends or servants
## I thought to extract those groups using passengers who share the same ticket number.


##Engineer features based on all the passengers with the same ticket
ticket.unique <- rep(0, nrow(full_titanic))
tickets <- unique(full_titanic$Ticket)

for (i in 1:length(tickets)) {
  current.ticket <- tickets[i]
  party.indexes <- which(full_titanic$Ticket == current.ticket)
  
  
  for (k in 1:length(party.indexes)) {
    ticket.unique[party.indexes[k]] <- length(party.indexes)
  }
}

full_titanic$ticket.unique <- ticket.unique
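The nested loop assigns to every passenger the number of passengers sharing the same ticket. As a sketch, the same count can be obtained by indexing a frequency table with the ticket values. The toy tickets below reuse ticket strings from the output above, with repeats added purely for illustration:

```r
# Toy vector standing in for full_titanic$Ticket
tickets <- c("A/5 21171", "PC 17599", "113803", "PC 17599", "113803", "113803")

# Frequency of each ticket, then look up each passenger's ticket by name
ticket_counts <- table(tickets)
party_size <- as.integer(ticket_counts[tickets])

print(data.frame(ticket = tickets, party_size = party_size))
```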


full_titanic$ticket.size[full_titanic$ticket.unique == 1]   <- 'Single'
full_titanic$ticket.size[full_titanic$ticket.unique < 5 & full_titanic$ticket.unique>= 2]   <- 'Small'
full_titanic$ticket.size[full_titanic$ticket.unique >= 5]   <- 'Big'

## Let's check the ticket size in a graph
ggplot(full_titanic[1:891,],aes(x = ticket.size,fill=factor(Survived))) +
  geom_bar()

## Let's check the ticket size together with title in a graph
ggplot(full_titanic[1:891,], aes(x = ticket.size, fill = Survived)) +
  geom_bar() +
  facet_wrap(~title)

## We can't see a huge difference between ticket size and family size; maybe we will use whichever of them contributes more

Now that passenger name has been split into some new variables, the family variables above take it a step further: family size is based on the number of siblings/spouse(s) (maybe someone has more than one spouse?) and the number of children/parents.

Exploratory Analysis on Embarked

### Is there any association between survival rate and where the passenger boarded the ship?
ggplot(full_titanic[1:891,],aes(x = Embarked,fill=factor(Survived))) +
  geom_bar()

## Let's further divide the graph by Pclass
ggplot(full_titanic[1:891,], aes(x = Embarked, fill = Survived)) +
  geom_bar() +
  facet_wrap(~Pclass)

## Haha, I don't think there is a correlation between survival rate and Embarked

## There are a lot of missing values in Cabin, so I don't think it is a good idea to use it
## As mentioned earlier, I will use title in place of Age
## Fare is definitely correlated with Pclass, so I am not going to use that either

full_titanic$ticket.size <- as.factor(full_titanic$ticket.size)
full_titanic$title <- as.factor(full_titanic$title)

## From the exploratory analysis we have decided to use the variables below for model building

##"Pclass", "title","Sex","Embarked","FamilySized","ticket.size"

## Any redundant variable among the above will be dropped in the course of the analysis

What does our family size variable look like? To help us understand how it may relate to survival, let’s plot it among the training data.

Divide data into train and test sets for internal validation

### Let's prepare the data and keep it in the proper format

feauter1<-full_titanic[1:891, c("Pclass", "title","Sex","Embarked","FamilySized","ticket.size")]
response <- as.factor(train.tit$Survived)
feauter1$Survived=as.factor(train.tit$Survived)


### For validation purposes, I will keep 20% of the data aside from my original train set
## This is just to check how well my model works on unseen data
set.seed(500)
ind=createDataPartition(feauter1$Survived,times=1,p=0.8,list=FALSE)
train_val=feauter1[ind,]
test_val=feauter1[-ind,]
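The point of a stratified split is that both parts keep the class proportions of the full set. The sketch below shows the idea in base R on a toy response. It is a simplified stand-in written for illustration, not caret's implementation of `createDataPartition`.

```r
set.seed(500)

# Toy response: 60 non-survivors and 40 survivors
y <- factor(rep(c(0, 1), times = c(60, 40)))

# Sample 80% of the row indices separately within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts keep the 60/40 class balance
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```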

#### Check the proportion of survival in the original training data and in the current training and testing data
round(prop.table(table(train.tit$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4

round(prop.table(table(train_val$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4

round(prop.table(table(test_val$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4

Predictive Analysis and Cross Validation

Whoa, glad we made our title variable! It has the highest relative importance out of all of our predictor variables. I think I’m most surprised to see that passenger but maybe that’s just bias coming from watching the movie Titanic too many times as a kid.

Decision tree

## Random forest is far better than a single tree; however, a single tree is very easy to use and illustrate
set.seed(1234)
Model_DT=rpart(Survived~.,data=train_val,method="class")


rpart.plot(Model_DT,extra =  3,fallen.leaves = T)

### Surprise: check out the plot. Our single tree model uses only title, Pclass and ticket.size and discards the rest
### Let's predict on the train data and check the accuracy of the single tree

PRE_TDT=predict(Model_DT,newdata=train_val,type="class")
confusionMatrix(PRE_TDT,train_val$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 395  71
##          1  45 203
##                                           
##                Accuracy : 0.8375          
##                  95% CI : (0.8084, 0.8639)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6502          
##                                           
##  Mcnemar's Test P-Value : 0.02028         
##                                           
##             Sensitivity : 0.8977          
##             Specificity : 0.7409          
##          Pos Pred Value : 0.8476          
##          Neg Pred Value : 0.8185          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5532          
##    Detection Prevalence : 0.6527          
##       Balanced Accuracy : 0.8193          
##                                           
##        'Positive' Class : 0               
##

##### Accuracy is 0.8375
#### Not bad at all for a single tree and just 3 features
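The statistics reported by `confusionMatrix` can be reproduced by hand from the four counts, which helps when reading the output. A short sketch using the training-set matrix printed above (caret treats class `0` as the positive class here):

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(395, 45, 71, 203), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # actual 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # actual 1s predicted as 1

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The three values round to 0.8375, 0.8977 and 0.7409, matching the printed output.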

## There is a chance of overfitting with a single tree, so I will use cross validation with the 10-fold technique
set.seed(1234)
cv.10 <- createMultiFolds(train_val$Survived, k = 10, times = 10)

# Control
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                       index = cv.10)

##Train the data
Model_CDT <- train(x = train_val[,-7], y = train_val[,7], method = "rpart", tuneLength = 30,
                   trControl = ctrl)


## Check the accuracy
## Accuracy using 10-fold cross validation of the single tree is 0.8139
## The earlier single tree, with an accuracy of 0.83, seems to have been overfitted

# Check the variable importance: is it the same as in the single tree?
rpart.plot(Model_CDT$finalModel,extra =  3,fallen.leaves = T)

## Yes, there is no change in the importance of the variables


### Let's validate the accuracy using the data that was kept aside for testing
PRE_VDTS=predict(Model_CDT$finalModel,newdata=test_val,type="class")
confusionMatrix(PRE_VDTS,test_val$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 20
##          1 12 48
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.784e-09       
##                                           
##                   Kappa : 0.6093          
##                                           
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.8899          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8291          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5480          
##    Detection Prevalence : 0.6610          
##       Balanced Accuracy : 0.7979          
##                                           
##        'Positive' Class : 0               
##

### There it is: the cross-validated accuracy (0.8139) and the test accuracy (0.8192) match closely

Random Forest

set.seed(1234)
rf.1 <- randomForest(x = train_val[,-7],y=train_val[,7], importance = TRUE, ntree = 1000)
rf.1

## 
## Call:
##  randomForest(x = train_val[, -7], y = train_val[, 7], ntree = 1000,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.51%
## Confusion matrix:
##     0   1 class.error
## 0 395  45   0.1022727
## 1  80 194   0.2919708

varImpPlot(rf.1)

#### Random forest accuracy is 82.49% (100 - 17.51 OOB error), which is about 1% better than the decision tree
#### Let's remove 2 redundant variables and do the modeling again
train_val1=train_val[,-4:-5]
test_val1=test_val[,-4:-5]

set.seed(1234)
rf.2 <- randomForest(x = train_val1[,-5],y=train_val1[,5], importance = TRUE, ntree = 1000)
rf.2

## 
## Call:
##  randomForest(x = train_val1[, -5], y = train_val1[, 5], ntree = 1000,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 15.97%
## Confusion matrix:
##     0   1 class.error
## 0 395  45   0.1022727
## 1  69 205   0.2518248

varImpPlot(rf.2)

### See the magic now: accuracy increases just by removing 2 variables; accuracy is now 84.03
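The accuracy of each forest follows directly from its out-of-bag output, since OOB accuracy is one minus the OOB error rate. A short check using the two confusion matrices printed above (rows are actual classes, columns are predicted classes; the helper name `oob_accuracy` is hypothetical):

```r
# OOB confusion matrices copied from the printed output of rf.1 and rf.2
conf_rf1 <- matrix(c(395, 80, 45, 194), nrow = 2,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
conf_rf2 <- matrix(c(395, 69, 45, 205), nrow = 2,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Hypothetical helper: share of correctly classified out-of-bag cases
oob_accuracy <- function(conf) sum(diag(conf)) / sum(conf)

acc_rf1 <- oob_accuracy(conf_rf1)
acc_rf2 <- oob_accuracy(conf_rf2)

print(round(100 * c(rf.1 = acc_rf1, rf.2 = acc_rf2), 2))
```

This gives 82.49 for the six-variable forest and 84.03 for the four-variable one, in line with the OOB error rates of 17.51% and 15.97%.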

## Even though random forest is so powerful, we accept the model only after cross validation


set.seed(2348)
cv10_1 <- createMultiFolds(train_val1[,5], k = 10, times = 10)

# Set up caret's trainControl object per above.
ctrl_1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                      index = cv10_1)



set.seed(1234)
rf.5<- train(x = train_val1[,-5], y = train_val1[,5], method = "rf", tuneLength = 3,
              ntree = 1000, trControl =ctrl_1)

rf.5

## Random Forest 
## 
## 714 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 642, 643, 642, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8392390  0.6538299
##   3     0.8381162  0.6515589
##   4     0.8360133  0.6469927
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

## Cross validation gives us an accuracy rate of 0.8392

### Let's predict the test data

pr.rf=predict(rf.5,newdata = test_val1)

confusionMatrix(pr.rf,test_val1$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 20
##          1 12 48
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.784e-09       
##                                           
##                   Kappa : 0.6093          
##                                           
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.8899          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8291          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5480          
##    Detection Prevalence : 0.6610          
##       Balanced Accuracy : 0.7979          
##                                           
##        'Positive' Class : 0               
##

#### Accuracy rate is 0.8192, lower than what we expected

All of the variables we care about should be taken care of and there should be no missing data. I’m going to double check just to be sure.

Support Vector Machine

Linear Support vector Machine

### Before building the model, let's tune the cost parameter

set.seed(1274)
liner.tune=tune.svm(Survived~.,data=train_val1,kernel="linear",cost=c(0.01,0.1,0.2,0.5,0.7,1,2,3,5,10,15,20,50,100))

liner.tune

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     3
## 
## - best performance: 0.1736502

### Best performance is at cost=3, with a cross-validated accuracy of about 82.6%


### Let's get the best linear model
best.linear=liner.tune$best.model

## Predict survival using the test data

best.test=predict(best.linear,newdata=test_val1,type="class")
confusionMatrix(best.test,test_val1$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 21
##          1 12 47
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.7483, 0.8681)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.058e-08       
##                                           
##                   Kappa : 0.5959          
##                                           
##  Mcnemar's Test P-Value : 0.1637          
##                                           
##             Sensitivity : 0.8899          
##             Specificity : 0.6912          
##          Pos Pred Value : 0.8220          
##          Neg Pred Value : 0.7966          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5480          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.7905          
##                                           
##        'Positive' Class : 0               
##

###Linear model accuracy is 0.8136

Radial Support vector Machine

###### Let's move to a non-linear SVM with a radial kernel
set.seed(1274)

rd.poly=tune.svm(Survived~.,data=train_val1,kernel="radial",gamma=seq(0.1,5))

summary(rd.poly)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma
##    2.1
## 
## - best performance: 0.166608 
## 
## - Detailed performance results:
##   gamma     error dispersion
## 1   0.1 0.1680164 0.04245604
## 2   1.1 0.1680164 0.03983673
## 3   2.1 0.1666080 0.04166448
## 4   3.1 0.1666080 0.04166448
## 5   4.1 0.1666080 0.04166448

best.rd=rd.poly$best.model

### The non-linear kernel gives us a slightly better cross-validated accuracy
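The gamma values in the tuning table come from `seq(0.1, 5)`, which steps by 1 from 0.1 and therefore tries only five values. The sketch below shows that grid next to a log-spaced grid. The log-spaced grid is an assumption offered as a common alternative, not something the run above used.

```r
# Grid used in the tuning call: default step of 1, starting at 0.1
gamma_grid <- seq(0.1, 5)
print(gamma_grid)

# Assumed alternative: powers of ten from 0.01 to 10
log_grid <- 10^seq(-2, 1, by = 1)
print(log_grid)
```

A vector like `log_grid` could be passed as the `gamma` argument of `tune.svm` in the same way as `gamma_grid`.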

## Let's predict the test data
pre.rd=predict(best.rd,newdata = test_val1)

confusionMatrix(pre.rd,test_val1$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 20
##          1 12 48
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 3.784e-09       
##                                           
##                   Kappa : 0.6093          
##                                           
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.8899          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8291          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5480          
##    Detection Prevalence : 0.6610          
##       Balanced Accuracy : 0.7979          
##                                           
##        'Positive' Class : 0               
##

#### Accuracy on the test data using the non-linear model is 0.8192
#### It could be because we are using a small sample for the testing data

We could definitely use rpart (recursive partitioning for regression) to predict missing ages, but I’m going to use the mice package for this task just for something different. You can read more about multiple imputation using chained equations. Since we haven’t done it yet, I’ll first factorize the factor variables and then perform mice imputation.

Logistic Regression

contrasts(train_val1$Sex)

##        male
## female    0
## male      1

contrasts(train_val1$Pclass)

##   2 3
## 1 0 0
## 2 1 0
## 3 0 1

## The above shows how the factor levels are coded

## Let's run the logistic regression model
log.mod <- glm(Survived ~ ., family = binomial(link=logit), 
               data = train_val1)
### Check the summary
summary(log.mod)

## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = logit), 
##     data = train_val1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4187  -0.5944  -0.3937   0.5805   3.0414  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        16.8752   624.0921   0.027 0.978428    
## Pclass2            -1.1968     0.3129  -3.824 0.000131 ***
## Pclass3            -2.1324     0.2721  -7.838 4.58e-15 ***
## titleMiss         -16.1021   624.0921  -0.026 0.979416    
## titleMr            -3.7422     0.5216  -7.175 7.24e-13 ***
## titleMrs          -16.0186   624.0921  -0.026 0.979523    
## titleOfficer       -4.3752     0.8595  -5.090 3.58e-07 ***
## Sexmale           -15.6157   624.0919  -0.025 0.980038    
## ticket.sizeSingle   2.0968     0.4082   5.137 2.79e-07 ***
## ticket.sizeSmall    2.0356     0.3870   5.260 1.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 589.82  on 704  degrees of freedom
## AIC: 609.82
## 
## Number of Fisher Scoring iterations: 13

confint(log.mod)

##                        2.5 %     97.5 %
## (Intercept)       -80.363813         NA
## Pclass2            -1.821261 -0.5924824
## Pclass3            -2.676712 -1.6082641
## titleMiss                 NA 81.1580568
## titleMr            -4.806899 -2.7544131
## titleMrs                  NA 81.2072009
## titleOfficer       -6.200669 -2.7777761
## Sexmale                   NA 81.9299127
## ticket.sizeSingle   1.318753  2.9224495
## ticket.sizeSmall    1.294852  2.8160324

### Predict train data
train.probs <- predict(log.mod, newdata=train_val1,type =  "response")
table(train_val1$Survived,train.probs>0.5)

##    
##     FALSE TRUE
##   0   395   45
##   1    70  204

(395+204)/(395+204+70+45)

## [1] 0.8389356

### Logistic regression predicted the train data with an accuracy rate of 0.8389

test.probs <- predict(log.mod, newdata=test_val1,type =  "response")
table(test_val1$Survived,test.probs>0.5)

##    
##     FALSE TRUE
##   0    97   12
##   1    21   47

(97+47)/(97+12+21+47)

## [1] 0.8135593

### Accuracy rate on the test data is 0.8136

Conclusion

When I submitted the predicted survival data from the various models built along the way to the Kaggle competition, I got approximately the same score from each. Now I realize why data scientists spend most of their time on feature engineering and exploratory analysis compared to actual model building. The model we use is definitely important; however, understanding our data and feature engineering matter even more.

Thanks

I would like to use this opportunity to thank all the people who inspired me to learn data science, especially the people in the Kaggle community who share their scripts openly, which helps many new data scientists like me. I am ending my first Kernel here. I know I could have done more feature engineering, especially estimating the missing Age observations and doing EDA on Fare. I decided not to use Age directly, as I am not very comfortable with missing value estimation. As mentioned at the start, I would much appreciate any questions, corrections and suggestions.

Beginners Titanic

Swamy S M

14th Oct 2022