Introduction

邏輯斯迴歸模型是在機器學習中常見的分類模型，專門用來處理類別型資料，而其中又分為Binary Classification以及Multi-class Classification

現在我們要使用ISLR套件中的信用卡違約資料，目的是要預測客戶是否會拖欠卡債，反應變數由「是」、「否」構成了二元類別型態的資料

library(ISLR)
library(tibble)
library(ggplot2)
library(gridExtra)
data = as.data.frame(Default)
str(data)

## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

balance 代表信用卡結算餘額，通常指的就是當月該付的錢

summary(data)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

檢查資料是否有遺漏值

# 檢查遺漏值數量
table(is.na(data))

## 
## FALSE 
## 40000

為了後續的預測以及避免模型under/over-fitting，我們將資料切分訓練集以及測試集

set.seed(42)
# 先把資料區分成 train=0.8, test=0.2 
train.index = sample(x=1:nrow(data), size=ceiling(0.8*nrow(data) ))
train = data[train.index, ]
test = data[-train.index, ]

將反應變數轉成0和1

train$default = as.numeric(train$default) - 1 
test$default = as.numeric(test$default) - 1

在建立模型前，資料視覺化是一項很重要的工作，可以觀察該筆資料的特性，由於我們的反應變數是兩個類別，我們便可以畫連續型態變數的分布在反應變數兩個類別上的差異

p1 = ggplot(data = data, aes(x = balance)) +   
      # 散布圖對應的函式是geom_density()
      geom_density(aes(fill=default,  # 用aes()，描繪散布圖內的各種屬性
                       alpha=0.1)) +
      labs(title="Density plot", 
          subtitle="# Balance Distribution")

p2 = ggplot(data = data, aes(x = income)) +   
      geom_density(aes(fill=default,  
                       alpha=0.1)) +
      labs(title="Density plot", 
          subtitle="# Income Distribution")


grid.arrange(p1, p2, nrow = 1)

可以看到balance是一個良好的變數，其反應變數呈現兩種差異很大的分配，接下來我們便以balance當作例子

The Logistic Function

以反應變數default和自變數balance作圖，由圖中可以看到資料點只集中在圖中的兩條水平線上，因為這是類別型態的資料，所以若我們用線性迴歸去配適我們的資料時，效果會很差，這時候邏輯斯迴歸就能解決這樣的問題，他能配適一條像是“S”的曲線，明顯比線性迴歸效果好

這條藍色的logit function代表：

當一單位的balance增加，default接近1的機率就靠近一些(0代表NO，1代表Yes)
這條機率曲線界在0到1之間

簡單以數學解釋 \[ p(x) = Pr(being　1)= P(Y = 1 \mid {X = x}) \]

羅吉斯迴歸用到的logit函數，也被稱作Sigmoid函數，在深度學習領域常被拿來使用或是比較，可以將任意輸入值輸出成0到1之間的函數

\[ \sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} \]

寫過來我們的機率函數變成 \[ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}} = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) \]

轉換成邏輯斯迴歸的形式

\[ logit(p) = log(odds) = \log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p. \] 羅吉斯回歸方程式將類別型態的反應變數轉換為事件的log odds值，也就是\(log⟮\frac{Pi}{1−Pi})\)⟯，來預測與自變數(X1~Xn) 的線性關係，接下來邏輯斯迴歸模型將會用最大概似估計法求解，並且找出最佳的Beta估計值，當然這部分交給R內建的GLM套件處理就好

Logistic Regression with `glm()`

使用R裡面的 glm()套件(Generalized Linear Models)，裡面的參數設定family = "binomial"，即我們要的二元邏輯斯迴歸
若你也想要用 glm()套件作線性迴歸，只要將參數改成family = "gaussian"就會等於之前的 lm()函數

model_glm = glm(default ~ balance, data = train, family = "binomial")
summary(model_glm)

## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3063  -0.1438  -0.0568  -0.0212   3.7822  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.081e+01  4.099e-01  -26.36   <2e-16 ***
## balance      5.601e-03  2.493e-04   22.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2374.1  on 7999  degrees of freedom
## Residual deviance: 1274.4  on 7998  degrees of freedom
## AIC: 1278.4
## 
## Number of Fisher Scoring iterations: 8

Interpretation of Model Summary

Call：

模型的長相，範例中我們的自變數只放了一個，因此會邏輯斯迴歸會估計一個\(\beta_1\)項，預設還會估計一個截距項\(\beta_0\)
Deviance Residuals：

類似線性迴歸中殘差的概念，即配適的迴歸線和觀測值的差異，以數學式子理解，\(\epsilon_i\)就是我們估計的殘差 \[ \log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = \beta_0 + \beta_1 x_i + \epsilon_i \] 表中提供了偏差殘差的簡單敘述統計如中位數、第一以及第三四分位數
Coefficients：

估計的係數可以列成以下式子： \[ \begin{align} logit(p) = \log\left(\frac{p(x_i)}{1 - p(x_i)}\right) &= \beta_0 + \beta_1 x \\&= -10.81 + 0.005601x \end{align} \] 當balance每一單位的增加，拖欠貸款(balance = 1)的log odds就會增加0.005601
Deviance：

在廣義線性模型中用來衡量 goodness of fit，越高代表擬合的越差
- Null deviance：
  僅有截距項的模型和飽和模型(saturated model)比較得到的偏差統計量的值，自由度就是N-1，N為觀測值的數量
- Residual deviance：
  包括截距項以及balance變數的模型和飽和模型(saturated model)比較得到的偏差統計量的值，值越小代表模型含有balance變數的配式較好
AIC (Akaike Information Criteria):
- AIC是一個指標用來評估統計模型的複雜度和衡量統計模型goodness of fit，類似於線性迴歸中的\(adjusted R^2\)，當模型太過複雜可能導致過度擬合給予懲罰
- 當比較多個模型時，選擇AIC越小的模型較佳
Number of Fisher Scoring iterations:

Fisher’s scoring algorithm類似於牛頓法用來求解最大概似估計，表中可以看到需要經過八次的迭代演算法才能收斂

Prediction

type = “link”

預設的情況，predict glm()會使用type = "link"

head(predict(model_glm))

##      2369      5273      9290      1252      8826       356 
## -5.523309 -5.005989 -5.153738 -4.109757 -6.722855 -6.788078

head(predict(model_glm, type = "link"))

##      2369      5273      9290      1252      8826       356 
## -5.523309 -5.005989 -5.153738 -4.109757 -6.722855 -6.788078

其實就是每個樣本分別代入 \[ \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \] 的結果

type = “response”

預測機率需使用type = "response"

head(predict(model_glm, type='response'))

##        2369        5273        9290        1252        8826         356 
## 0.003976737 0.006653152 0.005744575 0.016146766 0.001201653 0.001125864

\[ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 )}} = \frac{1}{1 + e^{-(-10.81 + 0.005601x )}} \] 若今天某個客戶這個月的balance為2300元，他會欠卡債的機率是多少呢？我們將模型代入\(x=2300\) \[ p(x = 2300) = \frac{1}{1 + e^{-(-10.81 + 0.005601 \times 2300 )}} = 0.8886683 \]

predict(model_glm,type = "response",newdata = data.frame(balance = 2300))

##         1 
## 0.8886683

得到機率後並還沒完成我們的目標：分類，因此需給定一個閥值，將機率0.5以上的視作分類成1的結果

result = predict(model_glm, type='response')
result = ifelse(result > 0.5,1,0)
head(result)

## 2369 5273 9290 1252 8826  356 
##    0    0    0    0    0    0

將train的預測結果繪製混淆矩陣confusion matrix

# Making predictions on the train set.
trn_tab = table(predicted = result, actual = train$default)
trn_tab

##          actual
## predicted    0    1
##         0 7691  186
##         1   37   86

使用事前切割好的test資料來預測，這次使用caret套件來觀察混淆矩陣

## Predicting Test Data
result = predict(model_glm,newdata=test,type='response')
result = ifelse(result > 0.5,1,0)

## Confusion matrix and statistics
library(caret)
confusionMatrix(data=as.factor(result), reference=as.factor(test$default))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1930   46
##          1    9   15
##                                           
##                Accuracy : 0.9725          
##                  95% CI : (0.9644, 0.9792)
##     No Information Rate : 0.9695          
##     P-Value [Acc > NIR] : 0.2403          
##                                           
##                   Kappa : 0.3416          
##                                           
##  Mcnemar's Test P-Value : 1.208e-06       
##                                           
##             Sensitivity : 0.9954          
##             Specificity : 0.2459          
##          Pos Pred Value : 0.9767          
##          Neg Pred Value : 0.6250          
##              Prevalence : 0.9695          
##          Detection Rate : 0.9650          
##    Detection Prevalence : 0.9880          
##       Balanced Accuracy : 0.6206          
##                                           
##        'Positive' Class : 0               
##

現在觀察混淆矩陣的每個欄位在醫學或統計學上常常用的指標：

	實際False(0)	實際True(1)
預測False(0)	True Negative(TN) 真陰性	False Negative(FN) 偽陰性
預測True(1)	False Positive (FP) 偽陽性	True Positive (TP) 真陽性

\[ Sensitivity 靈敏度=True Positive Rate (TPR) = \frac{TP}{TP + FN} \] \[ Specificity 特異度=True negative rate (TNR) = \frac{TN}{TN + FP} \]

ROC曲線

全名為Receiver Operating Characteristics，在醫學上，會通過一些閥值來斷定病人是否有得此病，而這個閥值就會影響Sensitivity和Specificity，如前面範例中我們的閥值設定為0.5，當我們考慮所有閥值的可能性並記錄下Sensitivity和Specificity的分布情況，就可以畫成ROC Curve

library("pROC")
test_prob = predict(model_glm, newdata = test, type = "response")
par(pty = "s") #指定畫布繪製成正方形的大小
test_roc = roc(test$default ~ test_prob, plot = TRUE, print.auc = TRUE, legacy.axes=TRUE) #legacy.axes=TRUE 將 x軸改成 1 - Specificity

在觀察ROC曲線得好壞時，除了觀察曲線的上的變化外，還可以計算曲線下的面積，我們稱為AUC(Area Under Curve)，AUC面積越大越好
Caption
圖來源：wikipedia

Reference

A work by Ginger Zhan

bmw2142@gmail.com

Logistic Regression

Ginger Zhan

2019-12-04

Introduction

The Logistic Function

Logistic Regression with glm()

Interpretation of Model Summary

Call：

Deviance Residuals：

Coefficients：

Deviance：

AIC (Akaike Information Criteria):

Number of Fisher Scoring iterations: