\[ \begin{align} \mbox{log(odds)}=\mbox{logit}(p)=\mbox{log}\left(\frac{\mbox{Pr}(y=1)}{1-\mbox{Pr}(y=1)}\right)=\beta_0+\sum_{i=1}^n\beta_ix_i \end{align} \]
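
To make the link between probability, odds, and log-odds concrete, here is a minimal sketch in base R. It uses `qlogis` (the logit) and `plogis` (its inverse) from the stats package. The probability 0.25 is an arbitrary illustrative value, not taken from the data.

```r
# Illustrative probability (an assumption, not from the Titanic data)
p <- 0.25
odds <- p / (1 - p)         # odds of the event: 1/3
log_odds <- log(odds)       # the logit of p
# qlogis() is the logit function; plogis() is its inverse
stopifnot(all.equal(log_odds, qlogis(p)))
p_back <- plogis(log_odds)  # recovers the original probability
print(c(odds = odds, log_odds = log_odds, p_back = p_back))
```
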
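library(data.table)
library(dplyr)  # provides the %>% pipe used later
# Read the first 1317 rows of the passenger data
x <- fread("titanic.csv", nrows = 1317)
# The value 9999 marks a missing entry; recode it as NA
x[fare == 9999, fare := NA]
x[age == 9999, age := NA]
knitr::kable(head(x))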
| name | gender | age | class | fare | group | joined | job | boat | survival |
|---|---|---|---|---|---|---|---|---|---|
| ALLEN, Miss Elisabeth Walton | 1 | 29 | 1 | 211 | | Southampton | | 2 | 1 |
| ALLISON, Mr Hudson Joshua Creighton | 0 | 30 | 1 | 151 | | Southampton | Businessman | | 0 |
| ALLISON, Mrs Bessie Waldo | 1 | 25 | 1 | 151 | | Southampton | | | 0 |
| ALLISON, Miss Helen Loraine | 1 | 2 | 1 | 151 | | Southampton | | | 0 |
| ALLISON, Master Hudson Trevor | 0 | 1 | 1 | 151 | | Southampton | | 11 | 1 |
| ANDERSON, Mr Harry | 0 | 47 | 1 | 26 | | Southampton | Stockbroker | 3 | 1 |
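The logistic model describes how fare relates to survival in terms of probability. The coefficient of fare is 0.013 with a p-value below 0.05, so each additional pound of fare raises the log-odds of survival by 0.013 (not the survival probability itself), a positive association. (Note: correlation does not imply causation.)
model <- glm(survival ~ fare, data = x, family = binomial(link = "logit"), na.action = na.exclude)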
summary(model)
##
## Call:
## glm(formula = survival ~ fare, family = binomial(link = "logit"),
## data = x, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2790 -0.8817 -0.8486 1.3470 1.5703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.927778 0.076671 -12.101 < 2e-16 ***
## fare 0.013108 0.001646 7.961 1.7e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1710.1 on 1290 degrees of freedom
## Residual deviance: 1617.2 on 1289 degrees of freedom
## (26 observations deleted due to missingness)
## AIC: 1621.2
##
## Number of Fisher Scoring iterations: 4
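
As a rough check on the summary output, the Wald z statistic, an approximate 95% confidence interval for the fare coefficient, and a likelihood-ratio test can be recomputed from the printed numbers. This is only a sketch: it copies the rounded values from the output, so it agrees with R up to rounding, and the variable names are illustrative.

```r
# Rounded estimate and standard error copied from the summary output
est <- 0.013108
se <- 0.001646
z <- est / se                              # Wald z statistic, about 7.96
ci <- est + c(-1, 1) * qnorm(0.975) * se   # approximate 95% Wald interval

# Likelihood-ratio test: drop in deviance compared with chi-square on 1 df
lr_stat <- 1710.1 - 1617.2
lr_df <- 1290 - 1289
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(c(z = z, lower = ci[1], upper = ci[2]), 4))
print(lr_p)
```

Both checks point the same way as the printed p-value: the interval excludes zero and the drop in deviance is far larger than chance would explain.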
The lowest fare is 3 pounds. Setting aside the correlation caveat above, the model gives a passenger who paid 3 pounds a survival probability of 0.2914:
\[ \begin{align} \pi(x)&=\frac{\mbox{exp}(-0.927778+0.013108x)}{1+\mbox{exp}(-0.927778+0.013108x)}\\ \pi(3)&=\frac{\mbox{exp}(-0.927778+0.013108\times3)}{1+\mbox{exp}(-0.927778+0.013108\times3)}=0.2914 \end{align} \]
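
The same calculation can be sketched in R by plugging the printed coefficients into the inverse logit. The names `b0`, `b1`, and `pi_hat` are illustrative, and the coefficients are the rounded values from the summary output.

```r
# Rounded coefficients copied from the summary output
b0 <- -0.927778
b1 <- 0.013108
# plogis(eta) computes exp(eta) / (1 + exp(eta))
pi_hat <- function(fare) plogis(b0 + b1 * fare)
round(pi_hat(3), 4)
# 0.2914
```
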
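Passing type="response" to predict makes R return the probability \(P(y=1|x)\), that is, the probability of survival \((y=1)\) at a given fare \((x)\). Unlike linear regression, logistic regression is fit by maximum likelihood rather than by least squares: the estimated parameters, -0.928 and 0.013, are the values that maximize the overall likelihood of the observed outcomes under the predicted \(\pi(x)\).
library(popbio)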
predict(model,type = "response",newdata = data.frame(fare = 3))
## 1
## 0.2914288
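
The maximum-likelihood idea can be sketched by minimizing the negative Bernoulli log-likelihood directly and comparing the result with `glm`. The data below are synthetic (the fares, the true coefficients -1 and 0.02, and all variable names are assumptions made for illustration), so the numbers are unrelated to the Titanic fit.

```r
set.seed(1)                                           # reproducible synthetic data
n <- 500
fare_sim <- runif(n, 0, 100)                          # hypothetical fares
y_sim <- rbinom(n, 1, plogis(-1 + 0.02 * fare_sim))   # hypothetical outcomes

# Negative Bernoulli log-likelihood for beta = (intercept, slope)
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * fare_sim
  # plogis(..., log.p = TRUE) returns log(p) and log(1 - p) in a stable way
  -sum(y_sim * plogis(eta, log.p = TRUE) +
       (1 - y_sim) * plogis(-eta, log.p = TRUE))
}

fit_optim <- optim(c(0, 0), neg_loglik, method = "BFGS")
fit_glm <- glm(y_sim ~ fare_sim, family = binomial(link = "logit"))

# The two sets of estimates should agree closely
print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
```

`glm` reaches the same maximum with Fisher scoring (iteratively reweighted least squares) rather than a general-purpose optimizer, which is why its output reports a number of Fisher scoring iterations.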
plot(x[,c("fare","survival")])
curve(predict(model,type = "response",newdata = data.frame(fare = x)),add = TRUE)
\[ \begin{align} \mbox{log(odds)}&=\mbox{logit}(p)=\beta_0+\beta_1 x\\ &=-0.927778+0.013108x\\ \mbox{odds}&=\mbox{exp}(\beta_0+\beta_1x)\\ \mbox{odds ratio}&=\frac{\mbox{odds}_1}{\mbox{odds}_2}=\frac{\mbox{exp}(\beta_0+\beta_1x_1)}{\mbox{exp}(\beta_0+\beta_1x_2)}=\mbox{exp}(\beta_1(x_1-x_2)) \end{align} \]
At a fare of 1 pound the odds of surviving versus not surviving are \(\mbox{exp}(\beta_0+\beta_1\times1)\approx0.4006\), and at 2 pounds they are \(\mbox{exp}(\beta_0+\beta_1\times2)\approx0.4059\). The odds ratio is \(0.4059/0.4006\approx1.013\), which matches \(\mbox{exp}(\beta_1\times1)=1.013194\): each additional pound multiplies the odds of survival by about 1.013. Odds are the exponential of the linear predictor, so predict is called with type = "link"; exponentiating the probabilities returned by type = "response" does not give odds.
predict(model,type = "link",newdata = data.frame(fare = c(1,2))) %>% exp() %>% round(4)
## 1 2
## 0.4006 0.4059
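
Because the odds are \(\mbox{exp}(\beta_0+\beta_1x)\), the ratio of the odds at two fares depends only on the difference between them. The sketch below illustrates this for a 10-pound difference, using the rounded coefficients from the summary output; `b0`, `b1`, and `odds` are illustrative names.

```r
# Rounded coefficients copied from the summary output
b0 <- -0.927778
b1 <- 0.013108
odds <- function(fare) exp(b0 + b1 * fare)

# A 10-pound difference gives the same odds ratio at any starting fare
or_10_low <- odds(13) / odds(3)
or_10_high <- odds(110) / odds(100)
stopifnot(all.equal(or_10_low, or_10_high),
          all.equal(or_10_low, exp(10 * b1)))
round(or_10_low, 2)
# 1.14
```
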