The dataset used here is the TitanicSurvival dataset, in which we have to predict the survivors of the Titanic disaster. The code below shows how to import the dataset and inspect its variable types. The View function is used to view the imported dataset.
titanicsurvival
Error: object 'titanicsurvival' not found
We assign the dataset to a new variable for our analysis.
The structure function, str, tells us whether each variable is numeric or character.
str(my_data_titanic)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 261 obs. of 8 variables:
$ obs : num 1 2 3 4 5 6 7 8 9 10 ...
$ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
The summary of the dataset gives us the minimum, maximum, quartiles, mean and median of each variable, along with the number of missing values. This gives us a basic understanding of our dataset.
summary(my_data_titanic)
      obs       PassengerId     Survived          Pclass
 Min.   :  1   Min.   :  1   Min.   :0.0000   Min.   :1.000
 1st Qu.: 66   1st Qu.: 66   1st Qu.:0.0000   1st Qu.:2.000
 Median :131   Median :131   Median :0.0000   Median :3.000
 Mean   :131   Mean   :131   Mean   :0.3487   Mean   :2.406
 3rd Qu.:196   3rd Qu.:196   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :261   Max.   :261   Max.   :1.0000   Max.   :3.000
     Sex                 Age            SibSp           Parch
 Length:261         Min.   : 0.83   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.:19.00   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                    Mean   :28.61   Mean   :0.636   Mean   :0.3985
                    3rd Qu.:37.00   3rd Qu.:1.000   3rd Qu.:0.0000
                    Max.   :71.00   Max.   :8.000   Max.   :5.0000
                    NA's   :51
The View function is used to view our dataset.
The head function returns the first 6 observations of the dataset.
Missing values are filled with the median of that variable, as we cannot carry out the analysis with missing values. In this dataset only Age has missing values (51 of them).
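The median fill described above can be sketched as follows. This is a minimal illustration on a made-up data frame (toy_titanic and its values are assumptions, not the real data); the same pattern would apply to the Age column of my_data_titanic.

```r
# Toy data frame standing in for my_data_titanic (values are made up).
toy_titanic <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Age      = c(22, 38, NA, 35, NA, 54)
)

# Median of the observed ages, ignoring the missing values.
age_median <- median(toy_titanic$Age, na.rm = TRUE)

# Replace every NA in Age with that median.
toy_titanic$Age[is.na(toy_titanic$Age)] <- age_median

print(toy_titanic$Age)
```

The median is preferred over the mean here because it is less sensitive to a few very young or very old passengers.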
Converting our dependent variable, Survived, to a factor lets us group and colour plots by it.
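A minimal sketch of that conversion, again on a made-up data frame (toy_titanic is an assumption used only for illustration):

```r
# Toy data frame: Survived is stored as a number, as in the str output.
toy_titanic <- data.frame(Survived = c(0, 1, 1, 0))

# Convert the 0/1 outcome to a factor with explicit levels.
toy_titanic$Survived <- factor(toy_titanic$Survived, levels = c(0, 1))

str(toy_titanic)
```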
The ggplot2 library is used to plot graphs in colour. A dot plot is used here to check the relationship between the Survived and Age variables.
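One way to draw such a dot plot is shown below. The data frame and the binwidth value are assumptions chosen for illustration, and the exact plot options used in the original analysis are not known.

```r
library(ggplot2)

# Toy data (made up): Survived is already a factor so it can drive the colour.
toy_titanic <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1)),
  Age      = c(22, 38, 26, 35, 27, 54, 2, 14)
)

# One dot per passenger, stacked along the Age axis and split by Survived.
dot_plot <- ggplot(toy_titanic, aes(x = Survived, y = Age, fill = Survived)) +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 2)

print(dot_plot)
```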
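Here we split our data randomly into 80% training data and 20% test data for prediction, using the caTools library. Note that the outputs below show 195 training observations (194 null degrees of freedom plus one) and 66 test predictions, which is closer to a 75/25 split of the 261 rows.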
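A random split of this kind can be sketched with sample.split from caTools. The data frame, the seed and the names split_flag, train_toy and test_toy are assumptions for illustration only.

```r
library(caTools)

set.seed(123)  # arbitrary seed so the split is reproducible

# Toy data (made up): 60 non-survivors and 40 survivors.
toy_titanic <- data.frame(
  Survived = rep(c(0, 1), times = c(60, 40)),
  Age      = round(runif(100, 1, 70))
)

# sample.split returns a logical vector: TRUE marks training rows.
# It samples within each outcome class, so both sets keep a similar survival rate.
split_flag <- sample.split(toy_titanic$Survived, SplitRatio = 0.8)

train_toy <- subset(toy_titanic, split_flag == TRUE)
test_toy  <- subset(toy_titanic, split_flag == FALSE)

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```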
We train our logistic regression model using all variables. We should not remove variables based on p-values alone; instead, we use the AIC value to decide which variables to remove.
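The call shown in the output below, glm with family = "binomial", can be tried on simulated data as follows. Everything about the data here (the columns, the coefficients used to simulate survival, the seed) is made up for illustration.

```r
set.seed(42)
n <- 200

# Simulated predictors (made up).
train_toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(c("male", "female"), n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)

# Hypothetical log-odds of survival, used only to generate an outcome.
log_odds <- 3 - 0.8 * train_toy$Pclass -
  2 * (train_toy$Sex == "male") - 0.02 * train_toy$Age
train_toy$Survived <- rbinom(n, 1, plogis(log_odds))

# Logistic regression on all remaining columns, as in Survived ~ .
model_toy <- glm(Survived ~ ., family = "binomial", data = train_toy)

print(summary(model_toy))
print(AIC(model_toy))
```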
summary(model_titanic)
Call:
glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9882 -0.6168 -0.3831 0.6152 2.3366
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.166809 1.122195 4.604 4.14e-06 ***
obs -0.001629 0.002577 -0.632 0.527388
PassengerId NA NA NA NA
Pclass -1.043573 0.270168 -3.863 0.000112 ***
Sexmale -2.970411 0.433374 -6.854 7.17e-12 ***
Age -0.043542 0.017889 -2.434 0.014935 *
SibSp -0.615172 0.236748 -2.598 0.009365 **
Parch 0.213787 0.230008 0.929 0.352641
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 250.92 on 194 degrees of freedom
Residual deviance: 165.64 on 188 degrees of freedom
AIC: 179.64
Number of Fisher Scoring iterations: 5
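We remove variables one at a time and check the overall AIC of each reduced model. Three variables (Parch, obs and PassengerId) are removed one by one and the AIC is checked each time. A lower AIC indicates a better balance between fit and complexity, so removing a variable should not increase the AIC.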
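The drop-one-variable comparison can be sketched as a small loop. The data, the Noise column and the helper names are assumptions; the original analysis fitted each reduced model by hand, as the outputs below show.

```r
set.seed(1)
n <- 200

# Simulated data (made up); Noise has no real effect on the outcome.
train_toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70)),
  Noise  = rnorm(n)
)
train_toy$Survived <- rbinom(
  n, 1, plogis(2 - 0.8 * train_toy$Pclass - 0.02 * train_toy$Age)
)

predictors <- c("Pclass", "Age", "Noise")

# Full model with every predictor.
full_model <- glm(reformulate(predictors, response = "Survived"),
                  family = "binomial", data = train_toy)

# Refit with each predictor left out in turn and record the AIC.
aic_after_drop <- sapply(predictors, function(dropped) {
  reduced <- glm(reformulate(setdiff(predictors, dropped), response = "Survived"),
                 family = "binomial", data = train_toy)
  AIC(reduced)
})

print(c(full = AIC(full_model), aic_after_drop))
```

Dropping one coefficient can lower the AIC by at most 2, which happens when the variable explains nothing; a large increase means the variable matters.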
summary(model_calculation_1)
Call:
glm(formula = Survived ~ . - Parch, family = "binomial", data = train_titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9692 -0.6057 -0.3642 0.6151 2.3735
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.198737 1.116334 4.657 3.21e-06 ***
obs -0.001535 0.002565 -0.598 0.549611
PassengerId NA NA NA NA
Pclass -1.028297 0.268861 -3.825 0.000131 ***
Sexmale -2.991456 0.432742 -6.913 4.75e-12 ***
Age -0.043816 0.017796 -2.462 0.013813 *
SibSp -0.549252 0.222571 -2.468 0.013596 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 250.92 on 194 degrees of freedom
Residual deviance: 166.50 on 189 degrees of freedom
AIC: 178.5
Number of Fisher Scoring iterations: 5
summary(model_calculation_2)
Call:
glm(formula = Survived ~ . - obs, family = "binomial", data = train_titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9882 -0.6168 -0.3831 0.6152 2.3366
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.166809 1.122195 4.604 4.14e-06 ***
PassengerId -0.001629 0.002577 -0.632 0.527388
Pclass -1.043573 0.270168 -3.863 0.000112 ***
Sexmale -2.970411 0.433374 -6.854 7.17e-12 ***
Age -0.043542 0.017889 -2.434 0.014935 *
SibSp -0.615172 0.236748 -2.598 0.009365 **
Parch 0.213787 0.230008 0.929 0.352641
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 250.92 on 194 degrees of freedom
Residual deviance: 165.64 on 188 degrees of freedom
AIC: 179.64
Number of Fisher Scoring iterations: 5
summary(model_calculation_3)
Call:
glm(formula = Survived ~ . - PassengerId, family = "binomial",
data = train_titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9882 -0.6168 -0.3831 0.6152 2.3366
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.166809 1.122195 4.604 4.14e-06 ***
obs -0.001629 0.002577 -0.632 0.527388
Pclass -1.043573 0.270168 -3.863 0.000112 ***
Sexmale -2.970411 0.433374 -6.854 7.17e-12 ***
Age -0.043542 0.017889 -2.434 0.014935 *
SibSp -0.615172 0.236748 -2.598 0.009365 **
Parch 0.213787 0.230008 0.929 0.352641
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 250.92 on 194 degrees of freedom
Residual deviance: 165.64 on 188 degrees of freedom
AIC: 179.64
Number of Fisher Scoring iterations: 5
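After comparing the AIC values, we remove the obs variable from our model. obs is an exact copy of PassengerId, which is why one coefficient was not defined because of singularities, and removing it leaves the AIC unchanged at 179.64.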
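The message "1 not defined because of singularities" in the earlier outputs is what glm reports when one predictor is an exact copy of another. A small sketch on made-up data (the data frame toy and its values are assumptions) reproduces the effect:

```r
set.seed(7)
n <- 100

# Toy data (made up): PassengerId is an exact copy of obs.
toy <- data.frame(obs = 1:n, Age = round(runif(n, 1, 70)))
toy$PassengerId <- toy$obs
toy$Survived <- rbinom(n, 1, plogis(1 - 0.03 * toy$Age))

# With both copies present, the second one gets an NA coefficient.
model_dup <- glm(Survived ~ ., family = "binomial", data = toy)
print(coef(model_dup))

# Dropping one copy gives the same fit and the same AIC.
model_no_obs <- glm(Survived ~ . - obs, family = "binomial", data = toy)
print(c(with_both = AIC(model_dup), without_obs = AIC(model_no_obs)))
```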
summary(model_final_titanic)
Call:
glm(formula = Survived ~ . - obs, family = "binomial", data = train_titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9882 -0.6168 -0.3831 0.6152 2.3366
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.166809 1.122195 4.604 4.14e-06 ***
PassengerId -0.001629 0.002577 -0.632 0.527388
Pclass -1.043573 0.270168 -3.863 0.000112 ***
Sexmale -2.970411 0.433374 -6.854 7.17e-12 ***
Age -0.043542 0.017889 -2.434 0.014935 *
SibSp -0.615172 0.236748 -2.598 0.009365 **
Parch 0.213787 0.230008 0.929 0.352641
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 250.92 on 194 degrees of freedom
Residual deviance: 165.64 on 188 degrees of freedom
AIC: 179.64
Number of Fisher Scoring iterations: 5
Now we use our trained model to predict survival probabilities for the test data.
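Probabilities like the ones printed below come from predict with type = "response". A minimal sketch on simulated data follows; make_toy, its coefficients and the seed are hypothetical and used only for illustration.

```r
set.seed(10)

# Hypothetical helper that simulates n passengers (all values made up).
make_toy <- function(n) {
  d <- data.frame(
    Pclass = sample(1:3, n, replace = TRUE),
    Age    = round(runif(n, 1, 70))
  )
  d$Survived <- rbinom(n, 1, plogis(2 - 0.8 * d$Pclass - 0.02 * d$Age))
  d
}

train_toy <- make_toy(150)
test_toy  <- make_toy(50)

model_toy <- glm(Survived ~ ., family = "binomial", data = train_toy)

# type = "response" returns probabilities instead of log-odds.
predict_toy <- predict(model_toy, newdata = test_toy, type = "response")

print(head(predict_toy))
```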
predict_result
1 2 3 4 5 6
0.71076730 0.07824510 0.80886019 0.13867549 0.51004099 0.19016744
7 8 9 10 11 12
0.09997748 0.68345138 0.32332508 0.09852130 0.09765669 0.75685141
13 14 15 16 17 18
0.02740911 0.78383854 0.92944878 0.12008716 0.84655226 0.29951068
19 20 21 22 23 24
0.07945105 0.09288400 0.66412226 0.90033888 0.08744285 0.16565174
25 26 27 28 29 30
0.83912949 0.65755090 0.72059595 0.05916360 0.75185989 0.01485273
31 32 33 34 35 36
0.10703554 0.23356966 0.07013690 0.30107277 0.13500569 0.73396214
37 38 39 40 41 42
0.08711358 0.21498113 0.08271845 0.74719077 0.08851373 0.02949691
43 44 45 46 47 48
0.14408431 0.78740737 0.18413554 0.01849268 0.47430841 0.03275988
49 50 51 52 53 54
0.86872796 0.07767353 0.06034641 0.11385083 0.08923421 0.09628556
55 56 57 58 59 60
0.91475826 0.12015382 0.25203262 0.25970226 0.21106102 0.05690354
61 62 63 64 65 66
0.17518991 0.06662949 0.07160154 0.12358793 0.89820044 0.07052627
After prediction, we need to check the accuracy of our predictions. The ROCR library is used to compare the actual values with the predicted values and assess performance. From the ROCR graph we can choose a threshold value for classifying a passenger as a survivor. Here, a threshold of 0.66 gives 84% accuracy on the test data, so the model is good to use.
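The evaluation step can be sketched as below. The actual and predicted values are simulated (made up), so the resulting numbers will not match the 84% reported above; only the 0.66 threshold is taken from the text.

```r
library(ROCR)

set.seed(99)

# Simulated outcomes and predicted probabilities (made up for illustration).
actual         <- rbinom(100, 1, 0.4)
predicted_prob <- plogis(rnorm(100, mean = ifelse(actual == 1, 1, -1)))

# ROC curve: true positive rate against false positive rate.
rocr_pred <- prediction(predicted_prob, actual)
roc_perf  <- performance(rocr_pred, "tpr", "fpr")
plot(roc_perf, colorize = TRUE)

# Area under the ROC curve as a threshold-free summary.
auc_value <- performance(rocr_pred, "auc")@y.values[[1]]

# Classify with the threshold from the text and compute accuracy.
threshold       <- 0.66
predicted_class <- as.integer(predicted_prob > threshold)
confusion       <- table(Actual = actual, Predicted = predicted_class)
accuracy        <- mean(predicted_class == actual)

print(confusion)
print(c(auc = auc_value, accuracy = accuracy))
```

Accuracy is the share of correct predictions in the confusion matrix, so it changes with the threshold; the colorized ROC curve shows which threshold each point on the curve corresponds to.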