This is my Titanic analysis.
##   Class    Sex   Age Survived
## 1   1st Female Adult      Yes
## 2   1st Female Adult      Yes
## 3  Crew   Male Adult       No
## 4   1st Female Adult      Yes
## 5  Crew   Male Adult       No
## 6   3rd   Male Adult       No
Here is a snapshot of the survival counts (rows: Survived) by sex and by age
##
##       Female Male
##   No     126 1364
##   Yes    344  367
##
##       Adult Child
##   No   1438    52
##   Yes   654    57
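The raw counts above are easier to compare as within-group survival rates. The following is a minimal base R sketch that re-enters the counts by hand (the object names `by_sex`, `by_age`, `rate_by_sex` and `rate_by_age` are illustrative, not part of the original analysis) and normalises each column.

```r
# Counts copied from the two tables above (rows: Survived, columns: group).
# matrix() fills column by column, so each pair is (No, Yes) for one group.
by_sex <- matrix(c(126, 344, 1364, 367), nrow = 2,
                 dimnames = list(Survived = c("No", "Yes"),
                                 Sex = c("Female", "Male")))
by_age <- matrix(c(1438, 654, 52, 57), nrow = 2,
                 dimnames = list(Survived = c("No", "Yes"),
                                 Age = c("Adult", "Child")))

# margin = 2 normalises each column, giving rates within each group.
rate_by_sex <- prop.table(by_sex, margin = 2)
rate_by_age <- prop.table(by_age, margin = 2)

print(round(rate_by_sex, 3))
print(round(rate_by_age, 3))
```

Read this way, roughly 73% of women survived against about 21% of men, and about 52% of children against about 31% of adults, which suggests Sex is the stronger of the two predictors.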
Data Exploration
Data Preparation
print(class(dataset$Class))
## [1] "character"
dataset$Class = as.factor(dataset$Class)
dataset$Sex = as.factor(dataset$Sex)
dataset$Age = as.factor(dataset$Age)
dataset$Survived = as.factor(dataset$Survived)
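The four conversions above can also be written as a single pass over the character columns. This is only a sketch on a hypothetical three-row data frame `toy` standing in for `dataset`; the helper name `is_chr` is likewise an assumption.

```r
# Hypothetical toy data standing in for `dataset` (all columns character).
toy <- data.frame(Class = c("1st", "Crew", "3rd"),
                  Sex = c("Female", "Male", "Male"),
                  Age = c("Adult", "Adult", "Child"),
                  Survived = c("Yes", "No", "No"),
                  stringsAsFactors = FALSE)

# Find the character columns, then convert all of them to factors at once.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

The column-by-column version is just as valid; the one-pass form mainly helps when there are many columns to convert.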
The data is divided into training (70%) and validation (30%) sets
set.seed(1)
# floor() makes the training-set size an explicit integer
trainrows = sample(row.names(dataset), floor(nrow(dataset) * 0.7))
traindataset = dataset[trainrows,]
validrows = setdiff(row.names(dataset),trainrows)
validdataset = dataset[validrows,]
print(head(traindataset))
##      Class    Sex   Age Survived
## 1017   3rd   Male Adult       No
## 679   Crew   Male Adult      Yes
## 2177  Crew   Male Adult      Yes
## 930    3rd Female Adult       No
## 1533   1st Female Adult      Yes
## 471    2nd   Male Adult       No
print(head(validdataset))
##    Class    Sex   Age Survived
## 2    1st Female Adult      Yes
## 3   Crew   Male Adult       No
## 4    1st Female Adult      Yes
## 6    3rd   Male Adult       No
## 10   1st   Male Adult      Yes
## 12  Crew   Male Adult       No
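A quick sanity check on a split like this is that the two row sets are disjoint and together cover every row. The sketch below repeats the same sampling steps on a hypothetical placeholder data frame `toy` with 2201 rows (the total of the counts in the snapshot tables); only the row count matters, and the names `n_train`, `train_rows` and `valid_rows` are illustrative.

```r
# Hypothetical stand-in for `dataset`: only the row count (2201) matters here.
toy <- data.frame(id = seq_len(2201))

set.seed(1)
n_train <- floor(nrow(toy) * 0.7)   # 70% of the rows, rounded down
train_rows <- sample(row.names(toy), n_train)
valid_rows <- setdiff(row.names(toy), train_rows)

# The two sets should be disjoint and together cover every row.
stopifnot(length(intersect(train_rows, valid_rows)) == 0)
stopifnot(length(train_rows) + length(valid_rows) == nrow(toy))

print(c(train = length(train_rows), valid = length(valid_rows)))
```

With 2201 rows this gives 1540 training rows and 661 validation rows, which agrees with the 661 cases counted in the confusion matrices later on.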
Fitting the models to the training dataset
library(e1071)  # provides naiveBayes()
model = naiveBayes(Survived ~ ., data = traindataset)
print(model)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.6753247 0.3246753
##
## Conditional probabilities:
##      Class
## Y            1st        2nd        3rd       Crew
##   No  0.08557692 0.11250000 0.35961538 0.44230769
##   Yes 0.27000000 0.16800000 0.25000000 0.31200000
##
##      Sex
## Y         Female       Male
##   No  0.09230769 0.90769231
##   Yes 0.46800000 0.53200000
##
##      Age
## Y          Adult      Child
##   No  0.96634615 0.03365385
##   Yes 0.92400000 0.07600000
print(summary(model))
##           Length Class  Mode
## apriori   2      table  numeric
## tables    3      -none- list
## levels    2      -none- character
## isnumeric 3      -none- logical
## call      4      -none- call
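library(C50)         # provides C5.0()
library(rpart)       # provides rpart()
library(rpart.plot)  # provides rpart.plot()
library(party)       # assumed source of ctree(); partykit also provides one
# C5.0 decision tree
model2 = C5.0(Survived ~ ., data = traindataset)
plot(model2)
# CART tree via rpart
model3 = rpart(Survived ~ ., data = traindataset)
rpart.plot(model3, type = 0)
# Conditional inference tree
model4 = ctree(Survived ~ ., data = traindataset)
plot(model4)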
#model5 = glm(Survived~.,data = traindataset,family = "binomial")
Applying the models to the validation dataset to get the predictions
pred = predict(model, validdataset)
pred2 = predict(model2,validdataset)
pred3 = predict(model3,validdataset,type = "class")
pred4 = predict(model4,validdataset)
#pred5 = ifelse(predict.glm(model5,validdataset,type = "response")>0.5,"Yes","No")
print(head(pred))
## [1] Yes No Yes No No No
## Levels: No Yes
validdataset$SurvivedpredictionNaive = pred
validdataset$Survivedpredictionc5 = pred2
validdataset$Survivedpredictionrpart = pred3
validdataset$Survivedpredictionctree = pred4
#validdataset$Survivedpredictionlr= as.factor(pred5)
print(head(validdataset))
##    Class    Sex   Age Survived SurvivedpredictionNaive Survivedpredictionc5
## 2    1st Female Adult      Yes                     Yes                  Yes
## 3   Crew   Male Adult       No                      No                   No
## 4    1st Female Adult      Yes                     Yes                  Yes
## 6    3rd   Male Adult       No                      No                   No
## 10   1st   Male Adult      Yes                      No                   No
## 12  Crew   Male Adult       No                      No                   No
##    Survivedpredictionrpart Survivedpredictionctree
## 2                      Yes                     Yes
## 3                       No                      No
## 4                      Yes                     Yes
## 6                       No                      No
## 10                      No                      No
## 12                      No                      No
Evaluation of the models
library(caret)  # provides confusionMatrix()
confusionMatrix(validdataset$SurvivedpredictionNaive, validdataset$Survived, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 420 99
## Yes 30 112
##
## Accuracy : 0.8048
## 95% CI : (0.7725, 0.8344)
## No Information Rate : 0.6808
## P-Value [Acc > NIR] : 6.218e-13
##
## Kappa : 0.5083
##
## Mcnemar's Test P-Value : 2.137e-09
##
## Sensitivity : 0.5308
## Specificity : 0.9333
## Pos Pred Value : 0.7887
## Neg Pred Value : 0.8092
## Prevalence : 0.3192
## Detection Rate : 0.1694
## Detection Prevalence : 0.2148
## Balanced Accuracy : 0.7321
##
## 'Positive' Class : Yes
##
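The statistics reported above all follow from the four cell counts of the confusion matrix. As a worked check, this base R sketch recomputes them by hand for the Naive Bayes model; the variable names (`tn`, `fn`, `fp`, `tp`, `cohen_kappa`, and so on) are illustrative and not part of caret.

```r
# Cell counts copied from the Naive Bayes confusion matrix above
# ("Yes" is the positive class).
tn <- 420  # predicted No,  actually No
fn <- 99   # predicted No,  actually Yes
fp <- 30   # predicted Yes, actually No
tp <- 112  # predicted Yes, actually Yes
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)  # share of actual survivors predicted Yes
specificity <- tn / (tn + fp)  # share of actual non-survivors predicted No
ppv         <- tp / (tp + fp)  # Pos Pred Value
npv         <- tn / (tn + fn)  # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance    <- ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced = balanced, kappa = cohen_kappa), 4))
```

The No Information Rate is simply the share of the majority class in the reference labels, here 450 / 661, so a model has to beat about 68% accuracy to add anything over always predicting "No".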
confusionMatrix(validdataset$Survivedpredictionc5, validdataset$Survived,positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 447 131
## Yes 3 80
##
## Accuracy : 0.7973
## 95% CI : (0.7646, 0.8273)
## No Information Rate : 0.6808
## P-Value [Acc > NIR] : 1.549e-11
##
## Kappa : 0.444
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3791
## Specificity : 0.9933
## Pos Pred Value : 0.9639
## Neg Pred Value : 0.7734
## Prevalence : 0.3192
## Detection Rate : 0.1210
## Detection Prevalence : 0.1256
## Balanced Accuracy : 0.6862
##
## 'Positive' Class : Yes
##
confusionMatrix(validdataset$Survivedpredictionrpart, validdataset$Survived,positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 447 126
## Yes 3 85
##
## Accuracy : 0.8048
## 95% CI : (0.7725, 0.8344)
## No Information Rate : 0.6808
## P-Value [Acc > NIR] : 6.218e-13
##
## Kappa : 0.4687
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4028
## Specificity : 0.9933
## Pos Pred Value : 0.9659
## Neg Pred Value : 0.7801
## Prevalence : 0.3192
## Detection Rate : 0.1286
## Detection Prevalence : 0.1331
## Balanced Accuracy : 0.6981
##
## 'Positive' Class : Yes
##
confusionMatrix(validdataset$Survivedpredictionctree, validdataset$Survived,positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 447 126
## Yes 3 85
##
## Accuracy : 0.8048
## 95% CI : (0.7725, 0.8344)
## No Information Rate : 0.6808
## P-Value [Acc > NIR] : 6.218e-13
##
## Kappa : 0.4687
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4028
## Specificity : 0.9933
## Pos Pred Value : 0.9659
## Neg Pred Value : 0.7801
## Prevalence : 0.3192
## Detection Rate : 0.1286
## Detection Prevalence : 0.1331
## Balanced Accuracy : 0.6981
##
## 'Positive' Class : Yes
##
#confusionMatrix(validdataset$Survivedpredictionlr, validdataset$Survived,positive = "Yes")
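To compare the four models side by side, the counts from the confusion matrices above can be collected into one small table. This is a sketch in base R with the counts re-entered by hand; the names `counts`, `cm` and `summary_tbl` are illustrative.

```r
# Counts copied from the four confusion matrices above
# (tn, fn, fp, tp with "Yes" as the positive class).
counts <- rbind(
  naive_bayes = c(tn = 420, fn = 99,  fp = 30, tp = 112),
  c50         = c(tn = 447, fn = 131, fp = 3,  tp = 80),
  rpart       = c(tn = 447, fn = 126, fp = 3,  tp = 85),
  ctree       = c(tn = 447, fn = 126, fp = 3,  tp = 85)
)
cm <- as.data.frame(counts)

# One row of summary metrics per model.
summary_tbl <- data.frame(
  accuracy    = (cm$tp + cm$tn) / rowSums(cm),
  sensitivity = cm$tp / (cm$tp + cm$fn),
  specificity = cm$tn / (cm$tn + cm$fp),
  row.names   = rownames(cm)
)
summary_tbl$balanced <- (summary_tbl$sensitivity + summary_tbl$specificity) / 2

print(round(summary_tbl, 4))
```

On this validation set, Naive Bayes ties rpart and ctree on accuracy (0.8048) but finds more of the actual survivors (sensitivity 0.53 against 0.40), at the cost of more false positives; the tree models are far more conservative about predicting "Yes".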
Deployment: using the model to predict for a new dataset
Class = "1st"
Age = "Adult"
Sex = "Male"
newdataset = data.frame(Class,Sex,Age)
predictionnew = predict(model,newdataset)
predictprob = data.frame(predict(model,newdataset,type = "raw"))
print(predictprob)
## No Yes
## 1 0.5405193 0.4594807
newdataset$Survivedprediction = predictionnew
newdataset$ProbNo = predictprob$No
newdataset$ProbYes = predictprob$Yes
print(newdataset)
## Class Sex Age Survivedprediction ProbNo ProbYes
## 1 1st Male Adult No 0.5405193 0.4594807
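The probabilities above can be reproduced by hand from the a-priori and conditional probabilities printed by `print(model)`. The following sketch applies the Naive Bayes rule directly for a 1st-class adult male; the vector names (`prior`, `p_class_1st`, `p_sex_male`, `p_age_adult`, `score`, `posterior`) are illustrative.

```r
# Values copied from the print(model) output earlier in the document.
prior       <- c(No = 0.6753247,  Yes = 0.3246753)
p_class_1st <- c(No = 0.08557692, Yes = 0.27000000)  # P(Class = 1st | Y)
p_sex_male  <- c(No = 0.90769231, Yes = 0.53200000)  # P(Sex = Male | Y)
p_age_adult <- c(No = 0.96634615, Yes = 0.92400000)  # P(Age = Adult | Y)

# Naive Bayes: prior times the product of the per-feature likelihoods,
# then normalise so the two class scores sum to 1.
score     <- prior * p_class_1st * p_sex_male * p_age_adult
posterior <- score / sum(score)

print(round(posterior, 4))
```

The result should agree with the reported 0.5405 for "No" and 0.4595 for "Yes" up to rounding of the copied inputs. The prediction is "No" mainly because of the larger prior and the high share of males among non-survivors, even though 1st class on its own favours "Yes".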