library(readr)
library(dplyr)
library(glmnet)
##讀檔
tbl_titanic <- read_csv("titanic3.csv", na = c("", "NA"))
先對資料做前置處理
##轉換資料形態
tbl_titanic <- tbl_titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_at(.vars = vars(pclass, survived, sex, embarked),
.funs = funs(as.factor))
summary(tbl_titanic)
survived pclass sex age sibsp
0:809 1:323 female:466 Min. : 0.17 Min. :0.0000
1:500 2:277 male :843 1st Qu.:21.00 1st Qu.:0.0000
3:709 Median :28.00 Median :0.0000
Mean :29.88 Mean :0.4989
3rd Qu.:39.00 3rd Qu.:1.0000
Max. :80.00 Max. :8.0000
NA's :263
parch fare embarked
Min. :0.000 Min. : 0.000 C :270
1st Qu.:0.000 1st Qu.: 7.896 Q :123
Median :0.000 Median : 14.454 S :914
Mean :0.385 Mean : 33.295 NA's: 2
3rd Qu.:0.000 3rd Qu.: 31.275
Max. :9.000 Max. :512.329
NA's :1
##處理missing data
tbl_titanic <- tbl_titanic %>%
mutate(age_na = is.na(age), age = replace(age, is.na(age), 0),
fare = replace(fare, is.na(fare), median(fare, na.rm = T)),
embarked = replace(embarked, is.na(embarked), "S"))
summary(tbl_titanic)
survived pclass sex age sibsp
0:809 1:323 female:466 Min. : 0.00 Min. :0.0000
1:500 2:277 male :843 1st Qu.: 7.00 1st Qu.:0.0000
3:709 Median :24.00 Median :0.0000
Mean :23.88 Mean :0.4989
3rd Qu.:35.00 3rd Qu.:1.0000
Max. :80.00 Max. :8.0000
parch fare embarked age_na
Min. :0.000 Min. : 0.000 C:270 Mode :logical
1st Qu.:0.000 1st Qu.: 7.896 Q:123 FALSE:1046
Median :0.000 Median : 14.454 S:916 TRUE :263
Mean :0.385 Mean : 33.281
3rd Qu.:0.000 3rd Qu.: 31.275
Max. :9.000 Max. :512.329
建構 “survived ~ sex + pclass” logistic regression model
set.seed(56500)
idc_train <- sample(c(T, F), nrow(tbl_titanic), replace = T, prob = c(.7, .3))
glm_fit_1 <- tbl_titanic %>% glm(survived ~ sex+pclass, data = .,
family = binomial(link = "logit"),
subset = idc_train)
summary(glm_fit_1)
Call:
glm(formula = survived ~ sex + pclass, family = binomial(link = "logit"),
data = ., subset = idc_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0974 -0.6883 -0.4580 0.7398 2.1479
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0822 0.2043 10.192 < 2e-16 ***
sexmale -2.4756 0.1743 -14.202 < 2e-16 ***
pclass2 -0.9260 0.2343 -3.953 7.73e-05 ***
pclass3 -1.8085 0.2055 -8.798 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1227.86 on 927 degrees of freedom
Residual deviance: 886.13 on 924 degrees of freedom
AIC: 894.13
Number of Fisher Scoring iterations: 4
建構 “survived ~ sex * pclass” logistic regression model
glm_fit_2 <- tbl_titanic %>% glm(survived ~ sex*pclass, data = .,
family = binomial(link = "logit"),
subset = idc_train)
summary(glm_fit_2)
Call:
glm(formula = survived ~ sex * pclass, family = binomial(link = "logit"),
data = ., subset = idc_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5601 -0.5669 -0.5605 0.5207 1.9640
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2387 0.5095 6.356 2.07e-10 ***
sexmale -3.9071 0.5445 -7.175 7.21e-13 ***
pclass2 -1.3088 0.6220 -2.104 0.035358 *
pclass3 -3.3654 0.5339 -6.304 2.90e-10 ***
sexmale:pclass2 0.2303 0.6968 0.331 0.741002
sexmale:pclass3 2.2623 0.5876 3.850 0.000118 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1227.86 on 927 degrees of freedom
Residual deviance: 854.13 on 922 degrees of freedom
AIC: 866.13
Number of Fisher Scoring iterations: 5
已知 \(P(y=1|x)=\frac{exp(\beta_0+\sum_{p=1}^{P}\beta_p x_p)}{1+exp(\beta_0+\sum_{p=1}^{P}\beta_p x_p)}\)
model 1 沒有 sex 及 pclass 的交互作用,而 Coefficients 分別為
| (Intercept) | sexmale | pclass2 | pclass3 |
|---|---|---|---|
| 2.0822 | -2.4756 | -0.9260 | -1.8085 |
其中 sexfemale 以及 pclass1 為 0
\(P(y=1|x=female,x=pclass1)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=\frac{exp(2.0822)}{1+exp(2.0822)}=0.8896\)
\(P(y=1|x=female,x=pclass2)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(2.0822-0.9260)}{1+exp(2.0822-0.9260)}=0.7606\)
\(P(y=1|x=female,x=pclass3)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(2.0822-1.8085)}{1+exp(2.0822-1.8085)}=0.5680\)
\(P(y=1|x=male,x=pclass1)=\frac{exp(\beta_0+\beta_1x_1)}{1+exp(\beta_0+\beta_1x_1)}=\frac{exp(2.0822-2.4756)}{1+exp(2.0822-2.4756)}=0.4029\)
\(P(y=1|x=male,x=pclass2)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2)}=\frac{exp(2.0822-2.4756-0.9260)}{1+exp(2.0822-2.4756-0.9260)}=0.2109\)
\(P(y=1|x=male,x=pclass3)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2)}=\frac{exp(2.0822-2.4756-1.8085)}{1+exp(2.0822-2.4756-1.8085)}=0.0996\)
model 2 有 sex 及 pclass 的交互作用,而 Coefficients 分別為
| (Intercept) | sexmale | pclass2 | pclass3 | sexmale:pclass2 | sexmale:pclass3 |
|---|---|---|---|---|---|
| 3.2387 | -3.9071 | -1.3088 | -3.3654 | 0.2303 | 2.2623 |
其中 sexfemale 以及 pclass1 為 0
\(P(y=1|x=female,x=pclass1)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=\frac{exp(3.2387)}{1+exp(3.2387)}=0.9623\)
\(P(y=1|x=female,x=pclass2)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(3.2387-1.3088)}{1+exp(3.2387-1.3088)}=0.8732\)
\(P(y=1|x=female,x=pclass3)=\frac{exp(\beta_0+\beta_2x_2)}{1+exp(\beta_0+\beta_2 x_2)}=\frac{exp(3.2387-3.3654)}{1+exp(3.2387-3.3654)}=0.4684\)
\(P(y=1|x=male,x=pclass1)=\frac{exp(\beta_0+\beta_1x_1)}{1+exp(\beta_0+\beta_1x_1)}=\frac{exp(3.2387-3.9071)}{1+exp(3.2387-3.9071)}=0.3389\)
\(P(y=1|x=male,x=pclass2)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}=\frac{exp(3.2387-3.9071-1.3088+0.2303)}{1+exp(3.2387-3.9071-1.3088+0.2303)}=0.1484\)
\(P(y=1|x=male,x=pclass3)=\frac{exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}{1+exp(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}=\frac{exp(3.2387-3.9071-3.3654+2.2623)}{1+exp(3.2387-3.9071-3.3654+2.2623)}=0.1454\)
tbl_titanic[idc_train, ] %>% select(survived, sex, pclass) %>% ftable()
pclass 1 2 3
survived sex
0 female 4 9 84
male 80 109 294
1 female 102 62 74
male 41 19 50
由上表得知
\(P(y=1|x=female,x=pclass1)=102/106=0.9623\)
\(P(y=1|x=female,x=pclass2)=62/71=0.8732\)
\(P(y=1|x=female,x=pclass3)=74/158=0.4684\)
\(P(y=1|x=male,x=pclass1)=41/121=0.3388\)
\(P(y=1|x=male,x=pclass2)=19/128=0.1484\)
\(P(y=1|x=male,x=pclass3)=50/344=0.1453\)
由 2 與 3 的結果可以得知,在第一個 model 中雖然沒有辦法得到 training set 正確的機率,但也算非常接近,而 model 2 的結果可以很好的代表整個 training set,不過哪個比較好還是要使用 testing set 做驗證,但在 training set 中 model 2 確實做的比較好。