Titanic is a good kick-off for any machine learning enthusiast: the dataset contains only a few features, and who doesn't know the Titanic tragedy and the love story of Jack and Rose? The Titanic is the tragic story of a ship that sank on its very first journey. It is a striking example of what happens when nature gives human beings just an hour between death and survival.
With that introduction, let's start the script. The data is taken from these sources:
## Load the required libraries one by one
library('ggplot2')
library('caret')
## Warning: package 'caret' was built under R version 3.5.2
library('dplyr')
library('randomForest')
library('rpart')
library('rpart.plot')
library('car')
library('e1071')
## Let's load the raw data in its original form by setting stringsAsFactors = F
train.tit <- read.csv('train.csv', stringsAsFactors = F)
test.tit <- read.csv('test.csv', stringsAsFactors = F)
test.tit$Survived <- NA
##Combine both test and train
full_titanic <- rbind(train.tit, test.tit)
##Check the structure
str(full_titanic)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
### Are there any missing observations?
colSums(is.na(full_titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
####Empty data
colSums(full_titanic=='')
## PassengerId Survived Pclass Name Sex Age
## 0 NA 0 0 0 NA
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 NA 1014 2
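The two checks above count `NA` values and empty strings separately, and the second one returns `NA` for any column that contains `NA`. A minimal sketch of a combined count follows; `count_missing` and the `demo` data frame are hypothetical names used only for illustration, not part of the original script.

```r
# Hypothetical helper: per column, count entries that are NA or an empty string.
count_missing <- function(df) {
  # %in% never returns NA, so columns that contain NA are still counted cleanly
  sapply(df, function(col) sum(is.na(col) | col %in% ""))
}

# Small made-up data frame that mimics the three problem columns
demo <- data.frame(
  Age = c(22, NA, 26),
  Cabin = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(demo)
print(missing_counts)
```

On the real data this would be called as `count_missing(full_titanic)`, which should report Age, Fare, Cabin and Embarked in a single pass.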
## Summary: Age is missing 263 values, Cabin has a lot of empty values (1014) and Embarked has just 2
### Let's replace the empty Embarked values with the most frequent observation
table(full_titanic$Embarked)
##
## C Q S
## 2 270 123 914
full_titanic$Embarked[full_titanic$Embarked==""]="S"
table(full_titanic$Embarked)
##
## C Q S
## 270 123 916
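The replacement above hard-codes "S" after reading the table by eye. As a sketch, the most frequent value can also be looked up programmatically; `most_frequent`, `impute_mode` and `embarked_demo` are hypothetical names introduced here, and the demo vector is made up.

```r
# Hypothetical helper: most frequent non-missing, non-empty value of a character vector.
most_frequent <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(which.max(table(x)))
}

# Hypothetical helper: fill NA and empty entries with the most frequent value.
impute_mode <- function(x) {
  x[is.na(x) | x %in% ""] <- most_frequent(x)
  x
}

embarked_demo <- c("S", "C", "", "S", "Q", "S", "")
imputed <- impute_mode(embarked_demo)
print(table(imputed))
```

If two values tie for most frequent, `which.max` picks the first one in table order, so the result should be checked by hand in that case.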
## As Age and Cabin have too many missing values, I will look at them during the analysis phase
### Check the number of unique values to see which variables we can convert to factors for our analysis
apply(full_titanic,2, function(x) length(unique(x)))
## PassengerId Survived Pclass Name Sex Age
## 1309 3 3 1307 2 99
## SibSp Parch Ticket Fare Cabin Embarked
## 7 8 929 282 187 3
### Convert the variables below into factors for analysis
cols=c("Survived","Pclass","Sex","Embarked")
for (i in cols){
full_titanic[,i]=as.factor(full_titanic[,i])
}
## Hypothesis: rich passengers survive at a much higher rate than poor passengers. Was the Titanic any different?
### Visualize Pclass, which is the best proxy for rich and poor
ggplot(full_titanic[1:891,], aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "fill") +  # assumed layer: share of survivors per class
  ylab("Proportion")
## The Titanic was no different: the first-class survival rate is far better than the third-class rate
## No doubt rich people had a better survival rate than the poor
# Visualize the 3-way relationship of sex, pclass, and survival
ggplot(full_titanic[1:891,], aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +  # assumed layers: bars per sex, one panel per class
  facet_wrap(~Pclass)
## In every class the female survival rate is better than the male rate
The first variable that catches my attention is passenger name, because we can break it down into additional meaningful variables which can feed predictions or be used in the creation of new variables. For instance, passenger title is contained within the passenger name variable, and we can use surname to represent families. Let's do some feature engineering!
head(full_titanic$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
## Let's extract the title and check whether it has predictive power
names <- full_titanic$Name
title <- gsub("^.*, (.*?)\\..*$", "\\1", names)
full_titanic$title <- title
table(title)
## title
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
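To make the regular expression used above easier to follow, here is a small sketch that applies the same pattern to three of the names printed by `head()`. The wrapper `extract_title` is a hypothetical name; the pattern itself is the one from the script.

```r
# Hypothetical wrapper around the pattern used above.
# "^.*, "   skips the surname and the comma,
# "(.*?)"   lazily captures the title,
# "\\..*$"  drops the first period and everything after it.
extract_title <- function(full_name) {
  sub("^.*, (.*?)\\..*$", "\\1", full_name)
}

sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

titles_demo <- extract_title(sample_names)
print(titles_demo)
```

Because the capture group is lazy, it stops at the first period, which is why multi-word titles such as "the Countess" come out whole in the table above.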
### Miss, Mrs, Master and Mr account for most of the passengers
### It is better to group the other titles into bigger baskets, checking gender and survival rate, to avoid overfitting
full_titanic$title[full_titanic$title == 'Mlle'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Ms'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Mme'] <- 'Mrs'
full_titanic$title[full_titanic$title == 'Lady'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Dona'] <- 'Miss'
## I am afraid that creating a new variable from so little data can cause overfitting
## However, my thinking is that merging the titles below into the existing groups may lose some predictive power, as they are all army folks, doctors and noble people
full_titanic$title[full_titanic$title == 'Capt'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Col'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Major'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Dr'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Rev'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Don'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Sir'] <- 'Officer'
full_titanic$title[full_titanic$title == 'the Countess'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Jonkheer'] <- 'Officer'
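The one-assignment-per-title recoding above can also be expressed as a lookup table, which keeps the whole grouping visible in one place. This is only a sketch: `title_map` and `collapse_titles` are hypothetical names, and the mapping simply mirrors the assignments made above.

```r
# Lookup table mirroring the recoding above: names are raw titles, values are groups.
title_map <- c(
  Mlle = "Miss", Ms = "Miss", Lady = "Miss", Dona = "Miss",
  Mme = "Mrs",
  Capt = "Officer", Col = "Officer", Major = "Officer", Dr = "Officer",
  Rev = "Officer", Don = "Officer", Sir = "Officer",
  "the Countess" = "Officer", Jonkheer = "Officer"
)

# Hypothetical helper: map rare titles to their group, leave all others unchanged.
collapse_titles <- function(titles) {
  mapped <- title_map[titles]  # NA wherever a title has no entry in the map
  unname(ifelse(is.na(mapped), titles, mapped))
}

collapsed_demo <- collapse_titles(c("Mr", "Mlle", "Dr", "Master", "the Countess"))
print(collapsed_demo)
```

With this in place the recoding would be a single line, `full_titanic$title <- collapse_titles(full_titanic$title)`, and regrouping a title later only means editing the map.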
# Let's check which of the titles (Mr, Master, Miss, ...) has the better survival rate
ggplot(full_titanic[1:891,], aes(x = title, fill = factor(Survived))) +
  geom_bar(position = "fill") +  # assumed layer: share of survivors per title
  ylab("Proportion")
## On the Titanic, a Mr has a lower chance of survival; Miss and Mrs have better survival rates than Master and Officer
### Visualize the 3-way relationship of title, Pclass, and survival
ggplot(full_titanic[1:891,], aes(x = title, fill = Survived)) +
  geom_bar(position = "fill") +  # assumed layers: bars per title, one panel per class
  facet_wrap(~Pclass)
## Master in 1st and 2nd class has 100% survival, whereas Mrs and Miss have about a 90% chance of survival in 1st and 2nd class
## Since title mostly depends on age (with a few exceptions), I will use title in place of Age, which has 263 missing observations
# Let's create a family size using SibSp and Parch
full_titanic$FamilySize <-full_titanic$SibSp + full_titanic$Parch + 1
full_titanic$FamilySized[full_titanic$FamilySize == 1] <- 'Single'
full_titanic$FamilySized[full_titanic$FamilySize < 5 & full_titanic$FamilySize >= 2] <- 'Small'
full_titanic$FamilySized[full_titanic$FamilySize >= 5] <- 'Big'
full_titanic$FamilySized=as.factor(full_titanic$FamilySized)
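The three assignments above bin family size into Single (1), Small (2 to 4) and Big (5 or more). A sketch of the same binning with `cut()` follows; `size_band` is a hypothetical helper name.

```r
# Hypothetical helper: bin a party size into the same three bands used above.
# Intervals are (0,1], (1,4] and (4,Inf], i.e. 1, 2-4 and 5+.
size_band <- function(n) {
  cut(n, breaks = c(0, 1, 4, Inf), labels = c("Single", "Small", "Big"))
}

bands_demo <- size_band(c(1, 2, 4, 5, 11))
print(bands_demo)
```

One difference to be aware of: `cut()` orders the factor levels as Single, Small, Big, whereas `as.factor()` on the character column orders them alphabetically. The same helper could be reused for the ticket-based party size.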
### Let's visualize the survival rate by family size
ggplot(full_titanic[1:891,], aes(x = FamilySized, fill = factor(Survived))) +
  geom_bar(position = "fill") +  # assumed layer: share of survivors per family size
  ylab("Proportion")
### Big families on the Titanic have a worse survival rate than small families and passengers travelling alone
#### Why do big families have a problem? Check the visualization below
ggplot(full_titanic[1:891,], aes(x = FamilySized, fill = Survived)) +
  geom_bar(position = "fill") +  # assumed layers: bars per family size, one panel per title
  facet_wrap(~title)
## If you are a Master in a big family, your survival rate is practically nil, even though the overall survival rate of Master is very good
### I am very surprised to see that Single makes up the bulk; however, there is a chance that these passengers travelled with friends or servants
## I thought of identifying such groups through passengers who share the same ticket number
## Engineer features based on all the passengers with the same ticket
ticket.unique <- rep(0, nrow(full_titanic))
tickets <- unique(full_titanic$Ticket)
for (i in 1:length(tickets)) {
current.ticket <- tickets[i]
party.indexes <- which(full_titanic$Ticket == current.ticket)
for (k in 1:length(party.indexes)) {
ticket.unique[party.indexes[k]] <- length(party.indexes)
}
}
full_titanic$ticket.unique <- ticket.unique
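The nested loop above assigns to every passenger the number of passengers sharing the same ticket. As a sketch, the same count can be obtained without explicit loops using base R's `ave()`; `ticket_party_size` is a hypothetical helper name and the demo tickets are a made-up selection.

```r
# Hypothetical helper: for each passenger, the number of passengers on the same ticket.
ticket_party_size <- function(ticket) {
  # ave() applies length() within each group of identical ticket strings
  ave(seq_along(ticket), ticket, FUN = length)
}

tickets_demo <- c("113803", "PC 17599", "113803", "A/5 21171", "113803")
party_sizes <- ticket_party_size(tickets_demo)
print(party_sizes)
```

On the full data this would read `full_titanic$ticket.unique <- ticket_party_size(full_titanic$Ticket)` and should give the same values as the loop.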
full_titanic$ticket.size[full_titanic$ticket.unique == 1] <- 'Single'
full_titanic$ticket.size[full_titanic$ticket.unique < 5 & full_titanic$ticket.unique>= 2] <- 'Small'
full_titanic$ticket.size[full_titanic$ticket.unique >= 5] <- 'Big'
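## Let's check the ticket size through a graph
ggplot(full_titanic[1:891,], aes(x = ticket.size, fill = factor(Survived))) +
  geom_bar(position = "fill") +  # assumed layer: share of survivors per ticket size
  ylab("Proportion")
## Let's check the ticket size and title through a graph
ggplot(full_titanic[1:891,], aes(x = ticket.size, fill = Survived)) +
  geom_bar(position = "fill") +  # assumed layers: bars per ticket size, one panel per title
  facet_wrap(~title)
## We can't see a huge difference between ticket size and family size; maybe we will use whichever of them contributes more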
Now that we’ve taken care of splitting passenger name into some new variables, we can take it a step further and make some new family variables. First we’re going to make a family size variable based on number of siblings/spouse(s) (maybe someone has more than one spouse?) and number of children/parents.
### Is there any association between survival rate and the port where a passenger boarded the ship?
ggplot(full_titanic[1:891,], aes(x = Embarked, fill = factor(Survived))) +
  geom_bar(position = "fill") +  # assumed layer: share of survivors per port
  ylab("Proportion")
## Let's further divide the graph by Pclass
ggplot(full_titanic[1:891,], aes(x = Embarked, fill = Survived)) +
  geom_bar(position = "fill") +  # assumed layers: bars per port, one panel per class
  facet_wrap(~Pclass)
## Haha.. I don't think there is a correlation between survival rate and Embarked
## There are a lot of missing values in Cabin; I don't think it's a good idea to use it
## As mentioned earlier, I will use title in place of Age
## Fare is definitely correlated with Pclass, so I am not going to use it either
full_titanic$ticket.size <- as.factor(full_titanic$ticket.size)
full_titanic$title <- as.factor(full_titanic$title)
## From the exploratory analysis we have decided to use the variables below for model building
## "Pclass", "title","Sex","Embarked","FamilySized","ticket.size"
## Any redundant variable among the above will be dropped in the course of the analysis
What does our family size variable look like? To help us understand how it may relate to survival, let’s plot it among the training data.
### Let's prepare the data and keep it in the proper format
feauter1<-full_titanic[1:891, c("Pclass", "title","Sex","Embarked","FamilySized","ticket.size")]
response <- as.factor(train.tit$Survived)
feauter1$Survived=as.factor(train.tit$Survived)
### For validation purposes I will keep 20% of the data aside from my original train set
## This is just to check how well my model works on unseen data
set.seed(500)
ind=createDataPartition(feauter1$Survived,times=1,p=0.8,list=FALSE)
train_val=feauter1[ind,]
test_val=feauter1[-ind,]
#### Check the proportion of survivors in the original training data, the current training data and the testing data
round(prop.table(table(train.tit$Survived)*100),digits = 1)
##
## 0 1
## 0.6 0.4
round(prop.table(table(train_val$Survived)*100),digits = 1)
##
## 0 1
## 0.6 0.4
round(prop.table(table(test_val$Survived)*100),digits = 1)
##
## 0 1
## 0.6 0.4
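In the three calls above, multiplying the table by 100 before `prop.table()` has no effect, because `prop.table()` divides by the total anyway; that is why the output shows 0.6 and 0.4 rather than percentages. A minimal sketch of the percentage version follows. The vector `survived_demo` is a stand-in built from the class counts of the 714-row training split (440 and 274) that appear in the confusion matrix reported for the single tree.

```r
# Stand-in for train_val$Survived: 440 non-survivors and 274 survivors
survived_demo <- rep(c(0, 1), times = c(440, 274))

# Scaling before prop.table() cancels out ...
same_as_unscaled <- prop.table(table(survived_demo) * 100)

# ... so scale after normalising to get percentages
pct <- round(100 * prop.table(table(survived_demo)), digits = 1)
print(pct)
```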
Whoa, glad we made our title variable! It has the highest relative importance out of all of our predictor variables. I think I’m most surprised to see that passenger but maybe that’s just bias coming from watching the movie Titanic too many times as a kid.
## Random forest is far better than a single tree; however, a single tree is very easy to use and illustrate
set.seed(1234)
Model_DT=rpart(Survived~.,data=train_val,method="class")
rpart.plot(Model_DT,extra = 3,fallen.leaves = T)
### Surprise: check out the plot. Our single-tree model uses only title, Pclass and ticket.size and discards the rest
### Let's predict the train data and check the accuracy of the single tree
PRE_TDT=predict(Model_DT,newdata=train_val,type="class")
confusionMatrix(PRE_TDT,train_val$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 395 71
## 1 45 203
##
## Accuracy : 0.8375
## 95% CI : (0.8084, 0.8639)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6502
##
## Mcnemar's Test P-Value : 0.02028
##
## Sensitivity : 0.8977
## Specificity : 0.7409
## Pos Pred Value : 0.8476
## Neg Pred Value : 0.8185
## Prevalence : 0.6162
## Detection Rate : 0.5532
## Detection Prevalence : 0.6527
## Balanced Accuracy : 0.8193
##
## 'Positive' Class : 0
##
#####Accuracy is 0.8375
#### Not bad at all for a single tree using just 3 features
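To show where the headline numbers in the `confusionMatrix()` output come from, the sketch below recomputes accuracy, sensitivity and specificity by hand from the four cell counts printed above. The positive class is "0", as in the output; the variable names are chosen here for illustration.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(
  c(395, 45, 71, 203),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # true "0" found among all actual "0"
specificity <- cm["1", "1"] / sum(cm[, "1"])    # true "1" found among all actual "1"

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Since the positive class is "0" (did not survive), the lower specificity means the tree is weaker at recognising survivors than non-survivors.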
## There is a chance of overfitting with a single tree, so I will cross validate using the 10-fold technique
set.seed(1234)
cv.10 <- createMultiFolds(train_val$Survived, k = 10, times = 10)
# Control
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
index = cv.10)
##Train the data
Model_CDT <- train(x = train_val[,-7], y = train_val[,7], method = "rpart", tuneLength = 30,
trControl = ctrl)
## Check the accuracy
## Accuracy of the single tree under 10-fold cross validation is 0.8139
## The earlier single tree seems to have overfitted: its accuracy on the train data was 0.8375
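For readers unfamiliar with stratified folds, the following base R sketch illustrates the idea behind one repeat of 10-fold cross validation: each class is spread evenly over the folds so that every fold keeps roughly the same survival ratio. This is an illustration only, not caret's implementation; `make_stratified_folds` is a hypothetical helper, and `y_demo` reuses the 440/274 class counts of the training split.

```r
# Hypothetical helper: assign each observation to one of k folds, class by class.
make_stratified_folds <- function(y, k = 10) {
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Deal fold labels 1..k out evenly within the class, then shuffle them
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

set.seed(1234)
y_demo <- rep(c(0, 1), times = c(440, 274))
folds_demo <- make_stratified_folds(y_demo, k = 10)

# Rows: fold number, columns: class. Each fold holds a similar class mix.
print(table(fold = folds_demo, class = y_demo))
```

Each fold ends up with 71 or 72 rows, which lines up with training sizes of 642 and 643 out of 714. Repeating the assignment with different seeds gives the repeated version of the scheme.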
# Check the variable importance: is it the same as in the single tree?
rpart.plot(Model_CDT$finalModel,extra = 3,fallen.leaves = T)
## Yes, there is no change in the importance of the variables
### Let's validate the accuracy using the data that was kept aside for testing
PRE_VDTS=predict(Model_CDT$finalModel,newdata=test_val,type="class")
confusionMatrix(PRE_VDTS,test_val$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 20
## 1 12 48
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.784e-09
##
## Kappa : 0.6093
##
## Mcnemar's Test P-Value : 0.2159
##
## Sensitivity : 0.8899
## Specificity : 0.7059
## Pos Pred Value : 0.8291
## Neg Pred Value : 0.8000
## Prevalence : 0.6158
## Detection Rate : 0.5480
## Detection Prevalence : 0.6610
## Balanced Accuracy : 0.7979
##
## 'Positive' Class : 0
##
### There it is: the cross-validated accuracy on the train data (0.8139) closely matches the accuracy on the held-out test data (0.8192)
set.seed(1234)
rf.1 <- randomForest(x = train_val[,-7],y=train_val[,7], importance = TRUE, ntree = 1000)
rf.1
##
## Call:
## randomForest(x = train_val[, -7], y = train_val[, 7], ntree = 1000, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.51%
## Confusion matrix:
## 0 1 class.error
## 0 395 45 0.1022727
## 1 80 194 0.2919708
varImpPlot(rf.1)
#### Random forest accuracy is 82.49% (100 minus the 17.51% OOB error), about 1% better than the cross-validated decision tree
#### Let's remove 2 redundant variables and do the modeling again
train_val1=train_val[,-4:-5]
test_val1=test_val[,-4:-5]
set.seed(1234)
rf.2 <- randomForest(x = train_val1[,-5],y=train_val1[,5], importance = TRUE, ntree = 1000)
rf.2
##
## Call:
## randomForest(x = train_val1[, -5], y = train_val1[, 5], ntree = 1000, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 15.97%
## Confusion matrix:
## 0 1 class.error
## 0 395 45 0.1022727
## 1 69 205 0.2518248
varImpPlot(rf.2)
### We can see the magic now: accuracy increases just by removing 2 variables and is now 84.03%
## Even though random forest is so powerful, we accept the model only after cross validation
set.seed(2348)
cv10_1 <- createMultiFolds(train_val1[,5], k = 10, times = 10)
# Set up caret's trainControl object per above.
ctrl_1 <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
index = cv10_1)
set.seed(1234)
rf.5<- train(x = train_val1[,-5], y = train_val1[,5], method = "rf", tuneLength = 3,
ntree = 1000, trControl =ctrl_1)
rf.5
## Random Forest
##
## 714 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 642, 643, 642, 643, 643, 643, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8392390 0.6538299
## 3 0.8381162 0.6515589
## 4 0.8360133 0.6469927
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Cross validation gives us an accuracy of 0.8392
### Let's predict the test data
pr.rf=predict(rf.5,newdata = test_val1)
confusionMatrix(pr.rf,test_val1$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 20
## 1 12 48
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.784e-09
##
## Kappa : 0.6093
##
## Mcnemar's Test P-Value : 0.2159
##
## Sensitivity : 0.8899
## Specificity : 0.7059
## Pos Pred Value : 0.8291
## Neg Pred Value : 0.8000
## Prevalence : 0.6158
## Detection Rate : 0.5480
## Detection Prevalence : 0.6610
## Balanced Accuracy : 0.7979
##
## 'Positive' Class : 0
##
#### The accuracy is 0.8192, lower than we expected
All of the variables we care about should be taken care of and there should be no missing data. I’m going to double check just to be sure.
## Support Vector Machine
### Linear Support Vector Machine
### Before fitting the model, let's tune the cost parameter
set.seed(1274)
liner.tune=tune.svm(Survived~.,data=train_val1,kernel="linear",cost=c(0.01,0.1,0.2,0.5,0.7,1,2,3,5,10,15,20,50,100))
liner.tune
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 3
##
## - best performance: 0.1736502
### Best performance is at cost = 3, where the cross-validation error is 0.1737, i.e. an accuracy of about 82.6%
### Let's get the best linear model
best.linear=liner.tune$best.model
## Predict survival using the test data
best.test=predict(best.linear,newdata=test_val1,type="class")
confusionMatrix(best.test,test_val1$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 21
## 1 12 47
##
## Accuracy : 0.8136
## 95% CI : (0.7483, 0.8681)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.058e-08
##
## Kappa : 0.5959
##
## Mcnemar's Test P-Value : 0.1637
##
## Sensitivity : 0.8899
## Specificity : 0.6912
## Pos Pred Value : 0.8220
## Neg Pred Value : 0.7966
## Prevalence : 0.6158
## Detection Rate : 0.5480
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.7905
##
## 'Positive' Class : 0
##
### Linear model accuracy is 0.8136
###### Let's move on to a non-linear SVM with a radial kernel
set.seed(1274)
rd.poly=tune.svm(Survived~.,data=train_val1,kernel="radial",gamma=seq(0.1,5))
summary(rd.poly)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma
## 2.1
##
## - best performance: 0.166608
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.1 0.1680164 0.04245604
## 2 1.1 0.1680164 0.03983673
## 3 2.1 0.1666080 0.04166448
## 4 3.1 0.1666080 0.04166448
## 5 4.1 0.1666080 0.04166448
best.rd=rd.poly$best.model
### The non-linear kernel gives us a better cross-validation accuracy
## Let's predict the test data
pre.rd=predict(best.rd,newdata = test_val1)
confusionMatrix(pre.rd,test_val1$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 20
## 1 12 48
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 3.784e-09
##
## Kappa : 0.6093
##
## Mcnemar's Test P-Value : 0.2159
##
## Sensitivity : 0.8899
## Specificity : 0.7059
## Pos Pred Value : 0.8291
## Neg Pred Value : 0.8000
## Prevalence : 0.6158
## Detection Rate : 0.5480
## Detection Prevalence : 0.6610
## Balanced Accuracy : 0.7979
##
## 'Positive' Class : 0
##
#### Accuracy on the test data using the non-linear model is 0.8192
#### The gap to the cross-validation accuracy could be due to the small sample used for testing
We could definitely use rpart (recursive partitioning for regression) to predict missing ages, but I’m going to use the mice package for this task just for something different. You can read more about multiple imputation using chained equations. Since we haven’t done it yet, I’ll first factorize the factor variables and then perform mice imputation.
contrasts(train_val1$Sex)
## male
## female 0
## male 1
contrasts(train_val1$Pclass)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
## The above shows how the levels of each variable are coded
## Let's run a logistic regression model
log.mod <- glm(Survived ~ ., family = binomial(link=logit),
data = train_val1)
###Check the summary
summary(log.mod)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = logit),
## data = train_val1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4187 -0.5944 -0.3937 0.5805 3.0414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.8752 624.0921 0.027 0.978428
## Pclass2 -1.1968 0.3129 -3.824 0.000131 ***
## Pclass3 -2.1324 0.2721 -7.838 4.58e-15 ***
## titleMiss -16.1021 624.0921 -0.026 0.979416
## titleMr -3.7422 0.5216 -7.175 7.24e-13 ***
## titleMrs -16.0186 624.0921 -0.026 0.979523
## titleOfficer -4.3752 0.8595 -5.090 3.58e-07 ***
## Sexmale -15.6157 624.0919 -0.025 0.980038
## ticket.sizeSingle 2.0968 0.4082 5.137 2.79e-07 ***
## ticket.sizeSmall 2.0356 0.3870 5.260 1.44e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 589.82 on 704 degrees of freedom
## AIC: 609.82
##
## Number of Fisher Scoring iterations: 13
confint(log.mod)
## 2.5 % 97.5 %
## (Intercept) -80.363813 NA
## Pclass2 -1.821261 -0.5924824
## Pclass3 -2.676712 -1.6082641
## titleMiss NA 81.1580568
## titleMr -4.806899 -2.7544131
## titleMrs NA 81.2072009
## titleOfficer -6.200669 -2.7777761
## Sexmale NA 81.9299127
## ticket.sizeSingle 1.318753 2.9224495
## ticket.sizeSmall 1.294852 2.8160324
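The coefficients in the summary above are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate them into odds ratios. The sketch below does this for the well-estimated terms, with the estimates copied from the summary; `log_odds` and `odds_ratio` are names chosen here for illustration.

```r
# Estimates copied from the summary above (log-odds scale)
log_odds <- c(
  Pclass2 = -1.1968,
  Pclass3 = -2.1324,
  titleMr = -3.7422,
  ticket.sizeSingle = 2.0968,
  ticket.sizeSmall = 2.0356
)

# exp() turns a log-odds difference into a multiplicative change in the odds
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

An odds ratio below 1 means lower odds of survival than the reference level (1st class, title Master, big ticket group), and above 1 means higher odds. The intercept, titleMiss, titleMrs and Sexmale terms are left out on purpose: their standard errors of about 624 and the `NA` confidence limits suggest they are not reliably estimated, possibly because sex is almost fully determined by title.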
### Predict the train data
train.probs <- predict(log.mod, newdata=train_val1,type = "response")
table(train_val1$Survived,train.probs>0.5)
##
## FALSE TRUE
## 0 395 45
## 1 70 204
(395+204)/(395+204+70+45)
## [1] 0.8389356
### Logistic regression predicts the train data with an accuracy of 0.8389
test.probs <- predict(log.mod, newdata=test_val1,type = "response")
table(test_val1$Survived,test.probs>0.5)
##
## FALSE TRUE
## 0 97 12
## 1 21 47
(97+47)/(97+12+21+47)
## [1] 0.8135593
### Accuracy on the test data is 0.8136
When I submitted the predicted survival data from the various models built in this script to the Kaggle competition, I got approximately the same score for each. Now I realize why data scientists spend most of their time on feature engineering and exploratory analysis compared to actual model building. The model we use is definitely important; however, understanding our data and engineering good features is even more crucial.
I would like to use this opportunity to thank all the people who inspired me to learn data science, especially the people in the Kaggle community who share their scripts openly, which helps many new data scientists like me. I am ending my first kernel here. I know I could have done more feature engineering, especially to estimate Age for the missing observations, and more EDA on Fare. I decided not to use Age directly as I am not very comfortable with missing value estimation. As mentioned at the start, I would much appreciate any questions, corrections and suggestions.