Logistic Regression Model

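Logistic regression is a discriminative model commonly used in machine learning for categorical response data. It comes in two forms, two-class classification and multi-class classification, which differ in how many categories the response variable can take. Other common two-class classifiers include SVM, Bayes classifiers, decision trees, and neural networks.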

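Logistic regression is based on the binomial distribution. In the model below, \(p=\Pr(y=1)\), \(\beta_0\) is the intercept, and \(\beta_i\) is the slope of the \(i\)-th independent variable:

\[ \begin{align} \mbox{log(odds)}=\mbox{logit}(p)=\log\left(\frac{\Pr(y=1)}{1-\Pr(y=1)}\right)=\beta_0+\sum_{i=1}^n\beta_ix_i \end{align} \]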
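
As a quick check of the logit definition above, base R's `qlogis` and `plogis` compute the logit and its inverse. The probability 0.3 used here is an arbitrary illustrative value, not something taken from the data.

```r
# Illustrative probability (an assumed value, not from the Titanic data)
p <- 0.3

# logit(p) = log(p / (1 - p)); qlogis() is base R's logit
log_odds <- log(p / (1 - p))
all.equal(log_odds, qlogis(p))

# plogis() is the inverse logit and recovers p from the log-odds
all.equal(plogis(log_odds), p)
```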

Two-class Logistic Regression

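This example uses the Titanic data to analyze the relationship between ticket fare and survival, for example whether higher fares go together with more survivors. The most direct way to tell how many classes a problem has is to count the labels of the response variable: survival has only two outcomes, so this is a two-class problem. Only the first 1317 rows are read, and the missing-value code 9999 in fare and age is recoded to NA.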
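library(data.table)
library(dplyr)  # provides the %>% pipe used later
# Read only the first 1317 rows
x <- fread("titanic.csv",nrows = 1317)
# 9999 is the missing-value code; recode it to NA by reference
# (this avoids attach(), which leaves stale copies of the columns on the search path)
x[fare == 9999, fare := NA]
x[age == 9999, age := NA]
knitr::kable(head(x))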
| name | gender | age | class | fare | group | joined | job | boat | survival |
|---|---|---|---|---|---|---|---|---|---|
| ALLEN, Miss Elisabeth Walton | 1 | 29 | 1 | 211 | | Southampton | | 2 | 1 |
| ALLISON, Mr Hudson Joshua Creighton | 0 | 30 | 1 | 151 | | Southampton | Businessman | | 0 |
| ALLISON, Mrs Bessie Waldo | 1 | 25 | 1 | 151 | | Southampton | | | 0 |
| ALLISON, Miss Helen Loraine | 1 | 2 | 1 | 151 | | Southampton | | | 0 |
| ALLISON, Master Hudson Trevor | 0 | 1 | 1 | 151 | | Southampton | | 11 | 1 |
| ANDERSON, Mr Harry | 0 | 47 | 1 | 26 | | Southampton | Stockbroker | 3 | 1 |
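
To confirm that a sentinel recoding like the one above worked, one option is to count the NA values per column afterwards. The sketch below uses a small hypothetical data frame in base R (the `toy` table and its values are made up for illustration) so that it runs without the CSV file.

```r
# Hypothetical toy data; 9999 is the missing-value code, as in the text
toy <- data.frame(fare = c(211, 9999, 26), age = c(29, 30, 9999))

# Recode the sentinel to NA wherever it appears
toy[toy == 9999] <- NA

# Count missing values per column
colSums(is.na(toy))
```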
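A formula has a left-hand side and a right-hand side. In terms of the core idea of logistic regression, the model explains how fare affects survival, and it does so as a probability model. The coefficient of fare is 0.013 with a p-value below 0.05, so each additional pound of fare raises the log-odds of survival by 0.013, a positive association. (Note: correlation does not imply causation.)
Image source: https://en.wikipedia.org/wiki/Correlation_and_dependence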
model <- glm(survival ~ fare,data = x, family = binomial(link = "logit"),na.action = na.exclude)
summary(model)
## 
## Call:
## glm(formula = survival ~ fare, family = binomial(link = "logit"), 
##     data = x, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2790  -0.8817  -0.8486   1.3470   1.5703  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.927778   0.076671 -12.101  < 2e-16 ***
## fare         0.013108   0.001646   7.961  1.7e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1710.1  on 1290  degrees of freedom
## Residual deviance: 1617.2  on 1289  degrees of freedom
##   (26 observations deleted due to missingness)
## AIC: 1621.2
## 
## Number of Fisher Scoring iterations: 4
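Plugging the estimates into the formula, we use \(x=3\) to interpret the model: the lowest fare is 3 pounds, and a passenger who paid 3 pounds has an estimated survival probability of 0.2914 (setting aside the correlation caveat above):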

\[ \begin{align} \pi(x)&=\frac{\mbox{exp}(-0.927778+0.013108x)}{1+\mbox{exp}(-0.927778+0.013108x)}\\ \pi(3)&=\frac{\mbox{exp}(-0.927778+0.013108\times3)}{1+\mbox{exp}(-0.927778+0.013108\times3)}=0.2914 \end{align} \]
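
The calculation above can be reproduced by hand. The sketch below hard-codes the rounded estimates from the summary output, so it runs without the data, and also evaluates \(\beta_1\pi(x)(1-\pi(x))\), the derivative of the inverse logit, to show how much the survival probability changes per pound at this fare. The variable names are illustrative.

```r
# Rounded estimates copied from the summary output above
b0 <- -0.927778
b1 <- 0.013108

# Inverse logit written out explicitly
pi_hat <- function(x) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))

p3 <- pi_hat(3)
round(p3, 4)   # 0.2914, matching the text

# Slope of the probability curve at x = 3: beta_1 * pi * (1 - pi)
slope3 <- b1 * p3 * (1 - p3)
round(slope3, 4)   # about 0.0027 per pound
```

At this fare, one extra pound raises the estimated survival probability by roughly 0.0027; the coefficient 0.013 itself applies to the log-odds, not to the probability.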

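Setting type="response" in predict makes R return the probability \(P(y=1|x)\), that is, the probability of survival \((y=1)\) at a given fare \((x)\). Logistic regression is not fitted by the least squares method of linear regression but by maximum likelihood: the estimated parameters, -0.928 and 0.013, are the values under which the fitted \(\pi(x)\) makes the observed outcomes, taken together, most probable.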
library(popbio)
predict(model,type = "response",newdata = data.frame(fare = 3))
##         1 
## 0.2914288
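# Scatter plot of the data, then overlay the fitted probability curve.
# Inside curve(), x is the plotting variable (a grid of fares), not the data table.
plot(x[,c("fare","survival")])
curve(predict(model,type = "response",newdata = data.frame(fare = x)),add = TRUE)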
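
The maximum-likelihood idea described above can be sketched directly: write down the Bernoulli log-likelihood under the logit link, maximize it numerically, and compare with `glm`, which reaches the same maximum by Fisher scoring. The data here are simulated (the predictor `z`, the sample size, and the true parameters are assumptions for illustration), so the sketch does not depend on the Titanic file.

```r
set.seed(1)

# Simulated data (hypothetical; not the Titanic data)
n <- 500
z <- rnorm(n)                            # a standardized predictor
y <- rbinom(n, 1, plogis(-1 + 0.8 * z))  # outcomes from assumed true parameters

# Negative Bernoulli log-likelihood under the logit link:
# log L = sum(y * eta - log(1 + exp(eta))), with eta = b0 + b1 * z
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * z
  -sum(y * eta - log1p(exp(eta)))
}

# Maximize the likelihood numerically, then compare with glm()
fit_optim <- optim(c(0, 0), negloglik, method = "BFGS")
fit_glm <- glm(y ~ z, family = binomial(link = "logit"))

rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
```

The two rows should agree to within the optimizer's tolerance, which illustrates that the coefficients reported by `glm` are the maximizers of this likelihood.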

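Another important property of logistic regression concerns the odds and the odds ratio. As above, \(p\) is the probability that the response \(y=1\) (survival) and \(x\) is a continuous variable (fare). Then \(\beta_0\) is the log-odds of \(y=1\) versus \(y=0\) when \(x=0\), and \(\beta_i\) is the contribution (weight) of \(x_i\) to the log-odds. In this example, each additional pound of fare multiplies the odds of survival by exp(\(\beta_1\))=1.013194.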

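\[ \begin{align} \mbox{log(odds)}&=\mbox{logit}(p)=\beta_0+\beta_1 x\\ &=-0.927778+0.013108x\\ \mbox{odds}&=\mbox{exp}(\beta_0+\beta_1x)\\ \mbox{odds ratio}&=\frac{\mbox{odds}_2}{\mbox{odds}_1}=\frac{\mbox{exp}(\beta_0+\beta_1x_2)}{\mbox{exp}(\beta_0+\beta_1x_1)}=\mbox{exp}(\beta_1(x_2-x_1)) \end{align}\\ \mbox{exp}(\beta_1\times1)=1.013194:\mbox{ each additional pound multiplies the odds of survival by }1.013194\\ \mbox{exp}(\beta_0+\beta_1\times1)\approx0.4006:\mbox{ at 1 pound, the odds of survival versus death are about }0.4006\\ \mbox{exp}(\beta_0+\beta_1\times2)\approx0.4059:\mbox{ at 2 pounds, the odds of survival versus death are about }0.4059\\ \frac{\mbox{exp}(\beta_0+\beta_1\times2)}{\mbox{exp}(\beta_0+\beta_1\times1)}=\mbox{exp}(\beta_1)=1.013194\\ \mbox{odds ratio: the odds of survival at 2 pounds are }1.013194\mbox{ times the odds at 1 pound} \]

# type = "link" returns the log-odds, so exponentiating gives the odds
# (about 0.4006 at 1 pound and 0.4059 at 2 pounds, using the rounded coefficients above)
predict(model,type = "link",newdata = data.frame(fare = c(1,2))) %>% exp()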
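
Because the model is linear in the log-odds, the odds ratio for a given fare difference does not depend on the starting fare. As a sketch, the code below hard-codes the rounded estimates from the summary output and checks this for a 10-pound difference; the fares 3, 13, 103, and 113 are arbitrary illustrative values.

```r
# Rounded estimates copied from the summary output
b0 <- -0.927778
b1 <- 0.013108

# Odds of survival at a given fare
odds <- function(fare) exp(b0 + b1 * fare)

# Odds ratios for a 10-pound increase, starting from 3 and from 103 pounds
ratio <- odds(c(13, 113)) / odds(c(3, 103))
ratio

# Both equal exp(10 * b1), roughly 1.14
exp(10 * b1)
```

In other words, under this model a fare 10 pounds higher multiplies the odds of survival by about 1.14 wherever the comparison starts.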