Given some incompleted passenger information:
1. Find out what sorts of people were likely to survive.
2. Predict which passengers survived the tragedy.
Survival: Survival (0 = No; 1 = Yes)
Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name: Name
Sex: Sex
Age: Age
Sibsp: Number of Siblings/Spouses Aboard
Parch: Number of Parents/Children Aboard
Ticket: Ticket Number
Fare: Passenger Fare
Cabin: Cabin
Embarked: Port of Embarkation
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Passenger Age distribution is close to normal distribution.
Children (<12 years old) have bigger change to survive.
ggplot(subset(train,Age!=is.na(Age)),aes(x=Age,fill=factor(Survived)))+geom_density(alpha=0.6)+labs(title='Passenger Age Distribution')
Fare has a skewed distribution, majority has the Fare < $50.
There is higher change of perishing, for those with Fare < $20.
ggplot(train,aes(x=Fare,fill=factor(Survived)))+geom_density(alpha=0.6,na.rm=TRUE)+labs(title='Ticket Fare Distribution')
Higher Passenger class implies higher surival rate.
Female has higher survival rate.
byPclass<-aggregate(Survived~Pclass,train,mean)
bySex<-aggregate(Survived~Sex,train,mean)
par(mfrow=c(1,2))
barplot(byPclass$Survived,names.arg=byPclass$Pclass,col='red',main='Survival rate by Pclass')
barplot(bySex$Survived,names.arg=bySex$Sex,col='blue',main='Survival rate by Sex')
train$Cat<-'train'
test$Cat<-'test'
test$Survived<-NA
full<-rbind(train,test)
Passenger Names are in format of Surname, Title. Firstname
full$Name[1]
## [1] "Braund, Mr. Owen Harris"
Split the text to get Title
full$Title<-sapply(full$Name,function(x) strsplit(x,'[.,]')[[1]][2])
full$Title<-gsub(' ','',full$Title)
table(full$Title)
##
## Capt Col Don Dona Dr Jonkheer
## 1 4 1 1 8 1
## Lady Major Master Miss Mlle Mme
## 1 2 61 260 2 1
## Mr Mrs Ms Rev Sir theCountess
## 757 197 2 8 1 1
Aggregate some less frequent Titles
full$Title[full$Title %in% c('Capt', 'Don', 'Major','Jonkheer')] <- 'Sir'
full$Title[full$Title %in% c('Dona','the Countess' )] <- 'Lady'
full$Title[full$Title %in% c('Mlle','Mme','Ms')] <- 'Miss'
FamilySie = Parch (No.of Parents/Children) + SibSp (No. of Sibling/Spouse)+1
full$FamilySize<-full$Parch+full$SibSp+1
characterize mother as female who are older than 18 years old and travelled with child
full$Mother<-0
full$Mother[full$Sex=='female' & full$Parch>0 & full$Age>18 & full$Title!='Miss']<-1
characterize child as who are not older than 12 years old
full$Child<-0
full$Child[full$Parch>0 & full$Age<=12]<-1
library(Amelia)
missmap(train,col=c('yellow','black'),main='Titanic Train Data',legend=F,y.labels=NULL,y.at=NULL)
Cabin, Age, and Embarked are identified with missing values.
Only two Embarked are missing, simply assign them to the majority.
table(full$Embarked)
##
## C Q S
## 270 123 914
full$Embarked[is.na(full$Embarked)]<-'S'
~30% is missing, using decision tree (regression) method to predict missing Age data
library(rpart)
fit.Age<-rpart(Age~Pclass+Title+Sex+SibSp+Parch+Fare,data=subset(full,Age!=is.na(Age)),method='anova')
full$Age[is.na(full$Age)]<-predict(fit.Age,full[is.na(full$Age),])
Over 70% is missing, predicting cabin here may be inappropiate.
Hence cabin information is ignored at this stage of analysis.
train<-full[full$Cat=='train',]
test<-full[full$Cat=='test',]
train$Survived<-factor(train$Survived)
#levels(train$Survived)<-c('Perished','Survived')
To evaluate different models, 10 folds repeated cv is used.
Selection metrics is accuracy.
library(caret)
control=trainControl(method='repeatedcv',number=10,repeats=10)
set.seed(99)
fit.bl<-train(Survived~FamilySize+Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+Mother+Child,data=train,method='bayesglm',trControl=control,family=binomial(link='logit'))
set.seed(99)
fit.rp<-train(Survived~FamilySize+Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+Mother+Child,data=train,method='rpart',trControl=control)
set.seed(99)
fit.rf<-train(Survived~FamilySize+Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+Mother+Child,data=train,method='rf',trControl=control)
set.seed(99)
fit.nnet<-train(Survived~FamilySize+Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+Mother+Child,data=train,method='nnet',trControl=control,maxit=1000,trace=F,tuneGrid=grid.nnet)
Nerual Network is the best in terms of accuracy with 10 folds cross validation.
acc.bl<-paste0(round(max(fit.bl$results$Accuracy),4)*100,'%')
acc.rp<-paste0(round(max(fit.rp$results$Accuracy),4)*100,"%")
acc.rf<-paste0(round(max(fit.rf$results$Accuracy),4)*100,"%")
acc.nnet<-paste0(round(max(fit.nnet$results$Accuracy),4)*100,'%')
List the accuracy for different models:
Bayesian Logistic Regression: 82.68%
Decision Tree: 81.22%
Random Frost: 82.76%
Neural Network: 82.85%
test$Survived<-predict(fit.nnet,test)
submission<-test[,1:2]
write.csv(submission,'submission.csv',row.names=F)
The most likely to survive on the Titanic are:
Pclass 1 & 2 female and children. (95% survived)
The most unlikely to survive on the Titanic are:
Pclass 3 female/children with family size bigger than 4. (only 9% survived)
library(rpart)
fit.rpart<-rpart(Survived~FamilySize+Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+Mother+Child,data=train,control=rpart.control(cp=0.01))
library(rattle)
fancyRpartPlot(fit.rpart,main='Decision Tree of Survival on the Titanic',cex=0.8)