This page uses a dataset of Titanic passengers who were on board during the disaster. I obtained the data from kaggle.com/datasets.
I am going to predict whether each passenger of the ship survived or not, using machine learning algorithms in the R programming language.
Happy Reading!
First of all, I have to read the data and store it into train and test
Combine data
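A minimal sketch of these two steps, assuming the Kaggle files train.csv and test.csv sit in the working directory (the file names are my assumption):

library(dplyr)

# Read the Kaggle files (assumed names)
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

# test has no Survived column, so add it as NA before stacking the two sets
test$Survived <- NA
comb <- bind_rows(train, test)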
Now, I will fix the dataset
The following fixes the SibSp/Parch values for two passengers (Id=280 and Id=1284) according to this kernel, because a 16-year-old can’t have a 13-year-old son! So I will fix those values manually.
comb$SibSp[comb$PassengerId==280] = 0
comb$Parch[comb$PassengerId==280] = 2
comb$SibSp[comb$PassengerId==1284] = 1
comb$Parch[comb$PassengerId==1284] = 1

Next, I will fix the classes of the columns into the correct ones.
comb <- comb %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Embarked = as.factor(Embarked),
Sex = as.factor(Sex))

Check if there are any missing values
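The counts below come from a check like this (my reconstruction):

colSums(is.na(comb))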
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
There are missing values which have to be imputed, and after that we can split our data again into training and testing. We can use the recipes library.
rec<- recipe(Survived~., training(split)) %>%
step_meanimpute(Age, Fare) %>%
step_modeimpute(Survived) %>%
prep()

Split the data into learn and exam
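A sketch of how learn and exam can be recovered from the prepped recipe, assuming the split object was created earlier with rsample::initial_split (that object and its proportion are not shown in this post):

library(rsample)
library(recipes)

# juice returns the imputed training portion; bake applies the same steps to the holdout
learn <- juice(rec)
exam <- bake(rec, new_data = testing(split))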
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 0 0 0 0
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 0 0 0 0
There are no NA or missing values in the datasets.
Let’s check the class proportions of learn first.
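The proportion below comes from a call like:

prop.table(table(learn$Survived))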
##
## 0 1
## 0.7429854 0.2570146
We found that the class ratio is imbalanced, hence it has to be balanced by using the upSample() function.
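A minimal sketch of the upsampling step, assuming caret’s upSample() is the function meant (the learn_up name matches the rest of this post):

library(caret)

# upSample resamples the minority class until both classes have equal counts;
# yname keeps the target column named Survived instead of the default Class
learn_up <- upSample(x = learn %>% select(-Survived),
                     y = learn$Survived,
                     yname = "Survived")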
Check again the proportion
##
## 0 1
## 0.5 0.5
Yup, the data is balanced now.
In this part, I am going to explore my data. I will look at the correlations between the predictors and the target, so I can decide how to build the model.
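The preview and the correlation plot below come from calls like these (ggcorr is from the GGally package; the exact ggcorr call appears in the warning further down):

library(GGally)

head(learn_up)
ggcorr(learn_up, label = T)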
## PassengerId Pclass Name Sex Age SibSp
## 1 5 3 Allen, Mr. William Henry male 35.00000 0
## 2 6 3 Moran, Mr. James male 30.15165 0
## 3 8 3 Palsson, Master. Gosta Leonard male 2.00000 3
## 4 13 3 Saundercock, Mr. William Henry male 20.00000 0
## 5 14 3 Andersson, Mr. Anders Johan male 39.00000 1
## 6 15 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.00000 0
## Parch Ticket Fare Cabin Embarked Survived
## 1 0 373450 8.0500 S 0
## 2 0 330877 8.4583 Q 0
## 3 1 349909 21.0750 S 0
## 4 0 A/5. 2151 8.0500 S 0
## 5 5 347082 31.2750 S 0
## 6 0 350406 7.8542 S 0
## Warning in ggcorr(learn_up, label = T): data in column(s) 'Pclass', 'Name',
## 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Survived' are not numeric and were
## ignored
It seems that PassengerId has no correlation with the other variables.
model1<- glm (formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare +
Embarked, family = "binomial",
data = learn_up)
summary(model1)

##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked, family = "binomial", data = learn_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2261 -0.7977 -0.1501 0.8538 2.0261
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.290578 353.286877 0.046 0.96322
## Pclass2 -0.683230 0.208522 -3.277 0.00105 **
## Pclass3 -1.407711 0.201693 -6.979 2.96e-12 ***
## Sexmale -1.883793 0.136971 -13.753 < 2e-16 ***
## Age -0.031988 0.005621 -5.690 1.27e-08 ***
## SibSp -0.194806 0.076142 -2.558 0.01051 *
## Parch -0.020145 0.083601 -0.241 0.80958
## Fare -0.001161 0.001397 -0.831 0.40573
## EmbarkedC -13.107603 353.286786 -0.037 0.97040
## EmbarkedQ -13.532995 353.286830 -0.038 0.96944
## EmbarkedS -13.427360 353.286774 -0.038 0.96968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1835.5 on 1323 degrees of freedom
## Residual deviance: 1469.4 on 1313 degrees of freedom
## AIC: 1491.4
##
## Number of Fisher Scoring iterations: 13
pred1<- predict(object = model1, newdata = exam, type ="response")
pred_round1 <- as.factor(ifelse(pred1 >= 0.5, "1", "0"))

Save the data frame to a data object
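A sketch of that step and the confusion-matrix call (the data1 name and its columns are my assumption; confusionMatrix is caret’s, and positive = "1" matches the output below):

library(caret)

data1 <- data.frame(prediction = pred_round1, actual = exam$Survived)
confusionMatrix(data = data1$prediction, reference = data1$actual, positive = "1")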
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 236 27
## 1 69 86
##
## Accuracy : 0.7703
## 95% CI : (0.727, 0.8098)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 0.033
##
## Kappa : 0.4788
##
## Mcnemar's Test P-Value : 2.857e-05
##
## Sensitivity : 0.7611
## Specificity : 0.7738
## Pos Pred Value : 0.5548
## Neg Pred Value : 0.8973
## Prevalence : 0.2703
## Detection Rate : 0.2057
## Detection Prevalence : 0.3708
## Balanced Accuracy : 0.7674
##
## 'Positive' Class : 1
##
From the matrix above, it can be concluded that model1 already has a good prediction accuracy of 0.77 ~ 77%.
If 77% accuracy is not enough, we can still try to improve it. One way to do it is by adjusting the threshold of the prediction. Raising the threshold makes the model stricter about predicting survival, which trades sensitivity for specificity; I will try a threshold of 0.7.
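A sketch of the re-thresholded prediction (pred_round2 is my name for it; confusionMatrix is caret’s, as above):

pred_round2 <- as.factor(ifelse(pred1 >= 0.7, "1", "0"))
confusionMatrix(data = pred_round2, reference = exam$Survived, positive = "1")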
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 281 58
## 1 24 55
##
## Accuracy : 0.8038
## 95% CI : (0.7625, 0.8408)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 0.0002688
##
## Kappa : 0.4507
##
## Mcnemar's Test P-Value : 0.0002682
##
## Sensitivity : 0.4867
## Specificity : 0.9213
## Pos Pred Value : 0.6962
## Neg Pred Value : 0.8289
## Prevalence : 0.2703
## Detection Rate : 0.1316
## Detection Prevalence : 0.1890
## Balanced Accuracy : 0.7040
##
## 'Positive' Class : 1
##
As we see, the accuracy of the model increased to 80%. Great! Note, though, that the sensitivity dropped to 0.49, which is the cost of raising the threshold.
Store it in data2
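A sketch of that object, assuming it holds the holdout passengers’ ids and the 0.7-threshold predictions (which would match the final output at the end of this post):

data2 <- data.frame(PassengerId = exam$PassengerId,
                    Survived = pred_round2)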
Besides logistic regression, there is a method called knn (k-nearest neighbors) which can be used for classification.
I will split my learn and exam data into learn_x (for predictors) and learn_y (for the target variable Survived), and exam_x (for predictors) and exam_y (for the target variable Survived).
First of all, I will scale my learn and exam data. Since the knn method is effective only with numeric variables, I eliminate all factor and character variables from the data.
learn_z <- learn_up %>%
select(-c(PassengerId, Pclass, Name , Sex, Ticket, Cabin, Embarked)) %>%
mutate_if(is.numeric, scale)
exam_z <- exam %>%
select(-c(PassengerId, Pclass, Name , Sex, Ticket, Cabin, Embarked)) %>%
mutate_if(is.numeric, scale)

Split each set into predictors and target, as sketched below.
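A minimal sketch of that split (the _x/_y names follow the prose above; strictly speaking, exam_z should be scaled with the training set’s means and standard deviations rather than its own):

learn_x <- learn_z %>% select(-Survived)
learn_y <- learn_z$Survived
exam_x <- exam_z %>% select(-Survived)
exam_y <- exam_z$Survived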
To get the k value for knn, I will take the square root of my total number of rows. The k is then adjusted against the number of levels of Survived: if the number of levels is even I will choose an odd k, and if it is odd I will choose an even k, so that votes can never tie.
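The value below comes from a call like:

sqrt(nrow(learn_x))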
## [1] 36.38681
Since my Survived variable consists of 2 levels, “0” and “1”, I will pick an odd k, which is 37.
Now build the knn model using the class library
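A sketch of the call (class::knn fits and predicts in one step, so there is no separate model object; pred_knn is my name for the result):

library(class)

pred_knn <- knn(train = learn_x, test = exam_x, cl = learn_y, k = 37)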
Using the confusion matrix once again
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 198 44
## 1 107 69
##
## Accuracy : 0.6388
## 95% CI : (0.5907, 0.6849)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.221
##
## Mcnemar's Test P-Value : 4.524e-07
##
## Sensitivity : 0.6106
## Specificity : 0.6492
## Pos Pred Value : 0.3920
## Neg Pred Value : 0.8182
## Prevalence : 0.2703
## Detection Rate : 0.1651
## Detection Prevalence : 0.4211
## Balanced Accuracy : 0.6299
##
## 'Positive' Class : 1
##
As we see in the confusion matrix, the accuracy of this model is only 63%. It can be concluded that the knn method doesn’t work well on the Titanic data.
To make the predictions, I used two different machine learning classification methods: logistic regression and knn. Based on those results, logistic regression is the better choice here because of its higher prediction accuracy. This happens because the knn model uses only the numeric variables and discards all factor and character columns, including strong predictors such as Sex and Pclass; as a result, it produces poor predictions.
Here is the final result from data2, the model output with the best accuracy.
## PassengerId Survived
## 1 1 0
## 2 7 0
## 3 9 0
## 4 11 1
## 5 17 0
## 6 20 0
I will store it in a csv file
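A sketch of the export (the file name is my assumption):

write.csv(data2, "titanic_prediction.csv", row.names = FALSE)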