Importing our dataset from Excel.
We assign the dataset to a new variable, etit.
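A minimal sketch of the import step, assuming the readxl package and a hypothetical file name titanic.xlsx (the original path and import call are not shown):

```r
# Hypothetical file name; replace it with the actual location of the workbook.
path <- "titanic.xlsx"

# read_excel() returns a tibble (class 'tbl_df'), which is also a data.frame.
# The guard lets the sketch run even when the package or file is missing.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  etit <- readxl::read_excel(path)
}
```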
The structure function, str(), tells us whether each variable is numeric or character.
str(etit)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 891 obs. of 12 variables:
$ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr NA "C85" NA "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
The summary of the dataset gives us the minimum, maximum, quartiles, mean and median of each numeric variable. This gives us a basic understanding of our dataset.
summary(etit)
  PassengerId       Survived          Pclass          Name               Sex
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891         Length:891
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character   Mode  :character
 Mean   :446.0   Mean   :0.3838   Mean   :2.309
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :891.0   Max.   :1.0000   Max.   :3.000
      Age            SibSp           Parch            Ticket               Fare
 Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891         Min.   :  0.00
 1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character   1st Qu.:  7.91
 Median :28.00   Median :0.000   Median :0.0000   Mode  :character   Median : 14.45
 Mean   :29.70   Mean   :0.523   Mean   :0.3816                      Mean   : 32.20
 3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                      3rd Qu.: 31.00
 Max.   :80.00   Max.   :8.000   Max.   :6.0000                      Max.   :512.33
 NA's   :177
    Cabin             Embarked
 Length:891         Length:891
 Class :character   Class :character
 Mode  :character   Mode  :character
The head() function shows the first 6 observations of our dataset.
Now we count the NA values in each column.
colSums(is.na(etit))
PassengerId    Survived      Pclass        Name         Sex         Age       SibSp
          0           0           0           0           0         177           0
      Parch      Ticket        Fare       Cabin    Embarked
          0           0           0         687           2
We remove the variables PassengerId, Name, Ticket and Cabin from our dataset.
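A minimal sketch of dropping columns by name in base R. The toy data frame and its values are made up; only the column names follow the dataset, and the original call used for this step is not shown.

```r
# Toy stand-in with the same column names as the dataset (values are made up).
toy <- data.frame(
  PassengerId = 1:3, Survived = c(0, 1, 1), Pclass = c(3, 1, 3),
  Name = c("A", "B", "C"), Ticket = c("T1", "T2", "T3"),
  Cabin = c(NA, "C85", NA), stringsAsFactors = FALSE
)

# Keep every column except the four identifier-like ones.
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy)
```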
Now we view our dataset.
We omit the rows with NA values from our dataset and check the NA counts again.
colSums(is.na(etit))
Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked
       0        0        0        0        0        0        0        0
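A minimal sketch of the omit-and-recheck step on a made-up data frame. na.omit() drops every row that contains at least one NA and records the dropped rows in an "na.action" attribute.

```r
# Made-up data with one NA in each column.
toy <- data.frame(Age = c(22, NA, 26, 35),
                  Embarked = c("S", "C", NA, "S"),
                  stringsAsFactors = FALSE)

colSums(is.na(toy))          # NA count per column before cleaning

toy_clean <- na.omit(toy)    # rows 2 and 3 are dropped
colSums(is.na(toy_clean))    # all zero after cleaning
nrow(toy_clean)
```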
We have to convert our character variables, and the Survived response, to factors for further analysis.
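A minimal sketch of the conversion with factor(), on made-up data. The recoding of the levels to "0" and "1" used in the original is not shown, so this sketch keeps the original labels.

```r
# Made-up data: one numeric 0/1 response and one character variable.
toy <- data.frame(Survived = c(0, 1, 1, 0),
                  Sex = c("male", "female", "female", "male"),
                  stringsAsFactors = FALSE)

toy$Survived <- factor(toy$Survived)   # levels "0", "1"
toy$Sex      <- factor(toy$Sex)        # levels sorted alphabetically
str(toy)
```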
Now we check the structure and see that all variables are numeric or factor.
str(etit)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 712 obs. of 8 variables:
$ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
$ Pclass : num 3 1 3 1 3 1 3 3 2 3 ...
$ Sex : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
$ Age : num 22 38 26 35 35 54 2 27 14 4 ...
$ SibSp : num 1 1 0 1 0 0 3 0 1 1 ...
$ Parch : num 0 0 0 0 0 0 1 2 0 1 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Embarked: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 2 ...
- attr(*, "na.action")= 'omit' Named int 6 18 20 27 29 30 32 33 37 43 ...
..- attr(*, "names")= chr "6" "18" "20" "27" ...
This dot plot shows survival at different ages.
We plot the number of passengers in each passenger class.
This plot shows how many male and female passengers survived.
We see the distribution of the passengers' ages.
We see survival by age and sex.
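The plotting code is not shown, so the following is only a sketch of comparable plots in base R graphics on simulated data. The plotting functions chosen here are an assumption, not the original ones.

```r
set.seed(42)
n <- 100
# Simulated stand-in data; only the column names follow the dataset.
toy <- data.frame(
  Survived = factor(sample(c(0, 1), n, replace = TRUE)),
  Pclass   = sample(1:3, n, replace = TRUE),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 70))
)

# Draw into a temporary PDF so the sketch also runs non-interactively.
pdf(tempfile(fileext = ".pdf"))

# Dot plot of age for each survival group.
stripchart(Age ~ Survived, data = toy, method = "jitter",
           xlab = "Age", ylab = "Survived")

# Number of passengers per class.
barplot(table(toy$Pclass), xlab = "Pclass", ylab = "Count")

# Survival counts for each sex.
sex_by_survival <- table(toy$Survived, toy$Sex)
barplot(sex_by_survival, beside = TRUE, legend.text = TRUE,
        xlab = "Sex", ylab = "Count")

# Distribution of ages.
hist(toy$Age, xlab = "Age", main = "Age distribution")

# Age by sex and survival.
boxplot(Age ~ Sex + Survived, data = toy, ylab = "Age")

dev.off()
```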
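We split the data into training and test sets. Judging by the model output below (534 training rows) and the 178 predictions, the split is 75% train data and 25% test data of the 712 rows.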
split_etit
[1] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Splitting our training data.
Splitting our test data.
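The split call itself is not shown. A minimal base-R sketch that produces the same kind of TRUE/FALSE vector as split_etit and then subsets on it could look like this. The data are made up, and train_frac is an assumed parameter; 0.75 matches the 534 training and 178 test rows seen in the outputs. Unlike some split helpers, this sketch does not stratify on the response.

```r
set.seed(123)                                        # reproducible split
toy <- data.frame(id = 1:20, y = rep(c(0, 1), 10))   # made-up data

train_frac <- 0.75                                   # assumed fraction
train_idx  <- sample(nrow(toy), size = round(train_frac * nrow(toy)))
split_toy  <- seq_len(nrow(toy)) %in% train_idx      # TRUE = training row

train_toy <- toy[split_toy, ]
test_toy  <- toy[!split_toy, ]
c(train = nrow(train_toy), test = nrow(test_toy))
```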
Fitting our logistic regression model.
summary(model_etit)
Call:
glm(formula = Survived ~ ., family = "binomial", data = train_etit)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6684 -0.7069 -0.3881 0.6738 2.4250
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.429803 0.724821 7.491 6.82e-14 ***
Pclass -1.238310 0.189149 -6.547 5.88e-11 ***
Sex1 -2.431730 0.246964 -9.846 < 2e-16 ***
Age -0.042573 0.009435 -4.512 6.42e-06 ***
SibSp -0.451719 0.149039 -3.031 0.00244 **
Parch -0.019863 0.135369 -0.147 0.88334
Fare 0.001182 0.003054 0.387 0.69867
Embarked1 -0.262844 0.276103 -0.952 0.34111
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 716.62 on 533 degrees of freedom
Residual deviance: 488.65 on 526 degrees of freedom
AIC: 504.65
Number of Fisher Scoring iterations: 5
We remove the variables whose p-values are large, i.e. not statistically significant (Parch, Fare and Embarked), and refit the model.
summary(final_model)
Call:
glm(formula = Survived ~ . - Parch - Fare - Embarked, family = "binomial",
data = train_etit)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6866 -0.6889 -0.3861 0.6593 2.4198
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.423446 0.629048 8.622 < 2e-16 ***
Pclass -1.301043 0.163682 -7.949 1.89e-15 ***
Sex1 -2.445973 0.241339 -10.135 < 2e-16 ***
Age -0.043269 0.009416 -4.595 4.32e-06 ***
SibSp -0.454408 0.141757 -3.206 0.00135 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 716.62 on 533 degrees of freedom
Residual deviance: 489.91 on 529 degrees of freedom
AIC: 499.91
Number of Fisher Scoring iterations: 5
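To see how the coefficients of the final model turn into a survival probability, here is a worked sketch for one hypothetical passenger. The coefficients are copied from the summary above. Sex level "1" appears to correspond to male, judging from the first rows of the two str() outputs. The linear predictor comes out at about -2.33 on the log-odds scale, which is a survival probability of a little under 10%.

```r
# Coefficients copied from summary(final_model).
coefs <- c(Intercept = 5.423446, Pclass = -1.301043, Sex1 = -2.445973,
           Age = -0.043269, SibSp = -0.454408)

# Hypothetical passenger: third class, Sex level "1", aged 22, one sibling/spouse.
x <- c(Intercept = 1, Pclass = 3, Sex1 = 1, Age = 22, SibSp = 1)

log_odds <- sum(coefs * x)     # linear predictor on the log-odds scale
prob     <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
round(c(log_odds = log_odds, prob = prob), 3)
```

Each negative coefficient lowers the log-odds of survival: a higher class number, Sex level "1", greater age and more siblings or spouses aboard all reduce the predicted probability.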
We use our model to predict the survival probabilities for the test data.
predict_result
1 2 3 4 5 6 7 8
0.88326901 0.59756114 0.70947426 0.83379670 0.24249631 0.25053284 0.07724415 0.50261957
9 10 11 12 13 14 15 16
0.90353584 0.66778840 0.87132253 0.10350865 0.14832816 0.26252322 0.11842914 0.58406267
17 18 19 20 21 22 23 24
0.11181104 0.85370771 0.19849970 0.66401399 0.10553352 0.64831948 0.13772528 0.01841237
25 26 27 28 29 30 31 32
0.34071313 0.13018811 0.33040634 0.34982489 0.14832816 0.40048084 0.93796816 0.03465286
33 34 35 36 37 38 39 40
0.11398571 0.15959624 0.73548657 0.13772528 0.35969258 0.79367809 0.44798160 0.90190295
41 42 43 44 45 46 47 48
0.15387784 0.80748947 0.13266682 0.28440911 0.31154919 0.04179059 0.10151821 0.10178857
49 50 51 52 53 54 55 56
0.29329685 0.13266682 0.45291416 0.26770984 0.05349962 0.26347448 0.11842914 0.74025898
57 58 59 60 61 62 63 64
0.10553352 0.16548559 0.95245648 0.94510514 0.14832816 0.76534564 0.10553352 0.12569121
65 66 67 68 69 70 71 72
0.10968825 0.82102075 0.42743405 0.50804746 0.84822063 0.30234498 0.12776661 0.11618081
73 74 75 76 77 78 79 80
0.39488945 0.08016667 0.67359786 0.50682080 0.09027544 0.89600153 0.04538981 0.93540236
81 82 83 84 85 86 87 88
0.06829611 0.11398571 0.13772528 0.90379357 0.10151821 0.46365579 0.15959624 0.28060784
89 90 91 92 93 94 95 96
0.14294483 0.07857999 0.16291192 0.59347457 0.41162978 0.15426369 0.63839183 0.17506016
97 98 99 100 101 102 103 104
0.53501425 0.89600153 0.13772528 0.79107992 0.60901470 0.80114568 0.41162978 0.79416247
105 106 107 108 109 110 111 112
0.65878634 0.06829611 0.94397604 0.95970801 0.82416631 0.05689586 0.83076570 0.87873321
113 114 115 116 117 118 119 120
0.09027544 0.09051887 0.09027544 0.78333656 0.13266682 0.08016667 0.70559362 0.05575022
121 122 123 124 125 126 127 128
0.34941829 0.08016667 0.35542125 0.09574250 0.14367177 0.04179059 0.14294483 0.95622531
129 130 131 132 133 134 135 136
0.88518875 0.13772528 0.06559370 0.41162978 0.47020289 0.06577539 0.62437412 0.14294483
137 138 139 140 141 142 143 144
0.96992027 0.47020289 0.06048488 0.94730726 0.46483689 0.95970801 0.83930780 0.08678434
145 146 147 148 149 150 151 152
0.53992822 0.23852992 0.26677993 0.54041187 0.82102075 0.59866436 0.47302335 0.10553352
153 154 155 156 157 158 159 160
0.16548559 0.81172462 0.15387784 0.78650253 0.67731693 0.04354479 0.54462256 0.09763900
161 162 163 164 165 166 167 168
0.09389254 0.49723083 0.36425772 0.09574939 0.50804746 0.58711316 0.87873321 0.13772528
169 170 171 172 173 174 175 176
0.06048488 0.08016667 0.67731693 0.84824111 0.73185326 0.76828961 0.76049752 0.70501132
177 178
0.30234498 0.11842914
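The predict call is not shown. A minimal sketch of fitting a binomial glm and predicting probabilities on held-out rows, using simulated data with made-up coefficients, could look like this. With type = "response", predict() returns probabilities between 0 and 1 like the values above; the default returns log-odds.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; only the column names follow the dataset.
toy <- data.frame(Pclass = sample(1:3, n, replace = TRUE),
                  Age    = round(runif(n, 1, 70)))
# Simulated outcome, generated from made-up coefficients.
toy$Survived <- factor(rbinom(n, 1, plogis(3 - 1 * toy$Pclass - 0.03 * toy$Age)))

train_toy <- toy[1:150, ]
test_toy  <- toy[151:200, ]

fit <- glm(Survived ~ Pclass + Age, family = "binomial", data = train_toy)

# Probabilities for the held-out rows.
prob <- predict(fit, newdata = test_toy, type = "response")
head(round(prob, 3))
```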
We form a confusion table of actual against predicted values, using a threshold of 0.64.
table(actual=test_etit$Survived,predicted=predict_result>0.64)
predicted
actual FALSE TRUE
0 99 2
1 26 51
We check the accuracy of our model.
my_accu
[1] 0.8426966
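The accuracy can be recomputed directly from the confusion table: correct predictions sit on the diagonal, so accuracy is (99 + 51) / 178. The sketch below copies the counts from the table and also derives sensitivity and specificity, which are not part of the original output.

```r
# Counts copied from the confusion table (rows = actual, columns = predicted).
conf_mat <- matrix(c(99, 26, 2, 51), nrow = 2,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)            # (99 + 51) / 178
sensitivity <- conf_mat["1", "TRUE"]  / sum(conf_mat["1", ])  # 51 / 77
specificity <- conf_mat["0", "FALSE"] / sum(conf_mat["0", ])  # 99 / 101
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```

At this threshold the model identifies non-survivors far more reliably than survivors: 26 of the 77 actual survivors are predicted as FALSE.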
After prediction we need to check the accuracy of our predictions. The ROCR library is used to plot the actual values against the predicted values to test our performance. From the ROCR graph we can choose the threshold value for our predictions. Here, with a threshold of 0.64, we get 84% accuracy, so our model is good to use.
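The ROCR code is not shown. As a sketch of how a threshold can be chosen, the following sweeps candidate cutoffs in base R and records the accuracy at each one, using simulated labels and scores. The ROCR part runs only if the package is installed; prediction() and performance() are its standard entry points, but the measures chosen here are an assumption about what the original plotted.

```r
set.seed(7)
n      <- 300
actual <- rbinom(n, 1, 0.4)                      # simulated 0/1 labels
# Simulated scores: higher on average for the positive class.
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, 1, -1)))

# Accuracy at each candidate cutoff (base R, no extra packages).
cutoffs  <- seq(0.05, 0.95, by = 0.01)
accuracy <- sapply(cutoffs, function(k) mean((prob > k) == (actual == 1)))
best_cutoff <- cutoffs[which.max(accuracy)]

# The same information via ROCR, if the package is installed.
if (requireNamespace("ROCR", quietly = TRUE)) {
  library(ROCR)
  pred <- prediction(prob, actual)
  acc  <- performance(pred, measure = "acc")                      # accuracy vs. cutoff
  roc  <- performance(pred, measure = "tpr", x.measure = "fpr")   # ROC curve
  if (interactive()) plot(roc, colorize = TRUE)
}
```

Choosing the cutoff on the same data that is used to report accuracy tends to give an optimistic figure, so a separate validation set is a safer place to pick the threshold.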