Logistic Regression and Generalized Linear Modeling

今回は 非線形モデリング の入り口として，ロジスティック回帰分析 に触れてみましょう。

回帰分析はすでにやってみましたね。回帰分析ではデータを説明するために関数を与えるのでした。特に一番簡単な線形関数，\(y=ax+b\)を使うのが(線形)(単)回帰分析ですが，この関数の形がロジスティック関数になっているのが，ロジスティック回帰分析です。

ロジスティック関数の形を見てみましょう。

x <- seq(-4,4,0.01)
plot(x,exp(x)/(1+exp(x)),type="l")

縦軸が0から1になっていることに注目してください。この関数は0から1までの滑らかなカーブを描いているわけです。

ロジスティック回帰分析は，従属変数が0/1のような二値変数になるデータに使われる回帰分析です。二値変数は「男性/女性」や「成功/失敗」，「正答/誤答」など二分する市ていることを表す名義尺度水準の変数です。

ロジスティック回帰分析はこのように，従属変数が名義尺度水準であるときに使われる回帰分析であると言えます。実際の例を見てみましょう。MASSパッケージに入っているbirthwtデータを使います。

##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

このデータは、1986年マサチューセッツ州で得られた，低体重児が出産される原因を調査するデータです。変数lowが1になっているのがTRUE=低体重児，0が健常児です。説明に使われる他の変数は次の通り。

母親の年齢(age)
母親の体重(lwt)
人種(race,1=white,2=black,3=other)
喫煙するかどうか(smoke)
早産の回数(plt)
高血圧かどうか(ht)
子宮過敏症の有無(ui)
妊娠初期に検査した回数(ftv)
生まれた時の体重(bwt)

です。

ロジスティック回帰の実際

低体重かどうかが0/1のデータですので，これを従属変数にし，母親の体重を独立変数にした回帰を考えます。通常の回帰は次のようになります。

result.lm <- lm(low~lwt,data=birthwt)
summary(result.lm)

## 
## Call:
## lm(formula = low ~ lwt, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4277 -0.3375 -0.2859  0.6136  0.8687 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.646733   0.146012   4.429 1.61e-05 ***
## lwt         -0.002577   0.001095  -2.354   0.0196 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4591 on 187 degrees of freedom
## Multiple R-squared:  0.02877,    Adjusted R-squared:  0.02358 
## F-statistic:  5.54 on 1 and 187 DF,  p-value: 0.01962

plot(birthwt$lwt,birthwt$low)
abline(result.lm)

従属変数が二値ですから，線形回帰が当てはまるはずもありません。では次にロジスティック回帰分析をしてみます。

result.glm <- glm(low~lwt,data=birthwt,family=binomial)
summary(result.glm)

## 
## Call:
## glm(formula = low ~ lwt, family = binomial, data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0951  -0.9022  -0.8018   1.3609   1.9821  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.99831    0.78529   1.271   0.2036  
## lwt         -0.01406    0.00617  -2.279   0.0227 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 228.69  on 187  degrees of freedom
## AIC: 232.69
## 
## Number of Fisher Scoring iterations: 4

plot(birthwt$lwt,birthwt$low)
par(new=T)
plot(birthwt$lwt,result.glm$fitted.value,col="red")

幾分，滑らかな当てはまりになりました。

出力結果が0/1の間にありますので，予測値は「確率」のように解釈することができます。予測値の一部を見てみますと，それぞれが低体重児を出産する確率であると考えられます。

head(result.glm$fitted.value)

##        85        86        87        88        89        91 
## 0.1736052 0.2349235 0.3827710 0.3728574 0.3761505 0.3219314

また，回帰係数はそのまま解釈するのではなく，ロジスティック関数に入れて考える必要があります。次のように入力しましょう。

exp(result.glm$coefficients)

## (Intercept)         lwt 
##   2.7137035   0.9860401

ここから，体重が一単位(1ポンド=454g=0.45kg)増えると，低体重児が生まれる確率が0.98倍になる，と解釈できます(こうした確率の比率のことを特に オッズ比 と言います)。

使い方は回帰分析と同じ

説明変数が複数になっても，基本的な考え方は回帰分析と同じです。

result.glm2 <- glm(low~age+lwt+smoke+ptl+ht+ui,data=birthwt,family=binomial)
summary(result.glm2)

## 
## Call:
## glm(formula = low ~ age + lwt + smoke + ptl + ht + ui, family = binomial, 
##     data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0765  -0.8086  -0.6246   1.0280   2.0222  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.381863   1.088914   1.269  0.20443   
## age         -0.042226   0.034585  -1.221  0.22212   
## lwt         -0.014318   0.006654  -2.152  0.03141 * 
## smoke        0.550765   0.343648   1.603  0.10900   
## ptl          0.593158   0.348433   1.702  0.08869 . 
## ht           1.863640   0.686376   2.715  0.00662 **
## ui           0.736751   0.456508   1.614  0.10655   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 208.77  on 182  degrees of freedom
## AIC: 222.77
## 
## Number of Fisher Scoring iterations: 4

exp(result.glm2$coefficients)

## (Intercept)         age         lwt       smoke         ptl          ht 
##   3.9823150   0.9586532   0.9857836   1.7345794   1.8096941   6.4471597 
##          ui 
##   2.0891364

result.glm3 <- step(result.glm2)

## Start:  AIC=222.77
## low ~ age + lwt + smoke + ptl + ht + ui
## 
##         Df Deviance    AIC
## - age    1   210.31 222.31
## <none>       208.77 222.77
## - ui     1   211.33 223.33
## - smoke  1   211.33 223.33
## - ptl    1   211.78 223.78
## - lwt    1   213.97 225.97
## - ht     1   216.54 228.54
## 
## Step:  AIC=222.31
## low ~ lwt + smoke + ptl + ht + ui
## 
##         Df Deviance    AIC
## <none>       210.31 222.31
## - ptl    1   212.83 222.83
## - smoke  1   213.01 223.01
## - ui     1   213.15 223.15
## - lwt    1   216.63 226.63
## - ht     1   218.45 228.45

summary(result.glm3)

## 
## Call:
## glm(formula = low ~ lwt + smoke + ptl + ht + ui, family = binomial, 
##     data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0738  -0.7877  -0.6416   1.0657   1.9836  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.562843   0.859815   0.655  0.51272   
## lwt         -0.015493   0.006597  -2.348  0.01886 * 
## smoke        0.563972   0.342368   1.647  0.09950 . 
## ptl          0.533933   0.341417   1.564  0.11785   
## ht           1.905592   0.685990   2.778  0.00547 **
## ui           0.769617   0.452910   1.699  0.08927 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 210.31  on 183  degrees of freedom
## AIC: 222.31
## 
## Number of Fisher Scoring iterations: 4

統計的に有意な変数は何か，それぞれの係数がどの程度オッズを変えるのか，モデルの適合度はどうか(AIC)などをみて考えることができます。変数選択をすることもできます。

一般化線形モデル(GLM)という考え方

回帰分析(lm)とロジスティック回帰分析(glm)は，Rの関数の形がほとんど違わなかったことに注目してください。違うところは，familyというオプションの指定だけです。

ここでいうfamilyは族と訳され，分布がどういうグループに入るか，という意味です。普通の回帰分析は正規分布(ガウス)グループにはいるのですが，二値変数は二項分布(binomial)グループ，という別の族にはいるわけです。このグループは他にも，ガンマ分布，，ポアソン分布などがあり，必要に応じて使い分けるだけで，基本的な考え方は線形モデル，これを一般化するだけです。だからGLMというのですね。

ガンマ分布の例

airqualityデータを使って，ガンマ分布を族としたGLMの例を示します。ガンマ分布は正の実数をとる値の分布です。言い換えると，負の数は取らないので，そういうデータに対して当てはめます。

data(airquality)
result.glm.gamma <- glm(Ozone~Solar.R+Wind+Temp,data=airquality,family=Gamma)
summary(result.glm.gamma)

## 
## Call:
## glm(formula = Ozone ~ Solar.R + Wind + Temp, family = Gamma, 
##     data = airquality)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.91385  -0.44403  -0.09604   0.29322   1.25132  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.061e-01  1.530e-02   6.934 3.23e-10 ***
## Solar.R     -6.825e-05  1.779e-05  -3.836 0.000212 ***
## Wind         1.442e-03  3.471e-04   4.156 6.55e-05 ***
## Temp        -9.627e-04  1.569e-04  -6.137 1.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2610159)
## 
##     Null deviance: 71.950  on 110  degrees of freedom
## Residual deviance: 29.177  on 107  degrees of freedom
##   (42 observations deleted due to missingness)
## AIC: 939.88
## 
## Number of Fisher Scoring iterations: 6

ちなみに同じデータにガウス分布を当てはめた場合と比較します。

result.glm.gauss <- glm(Ozone~Solar.R+Wind+Temp,data=airquality,family=gaussian)
summary(result.glm.gauss)

## 
## Call:
## glm(formula = Ozone ~ Solar.R + Wind + Temp, family = gaussian, 
##     data = airquality)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -40.485  -14.219   -3.551   10.097   95.619  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 448.6242)
## 
##     Null deviance: 121802  on 110  degrees of freedom
## Residual deviance:  48003  on 107  degrees of freedom
##   (42 observations deleted due to missingness)
## AIC: 998.72
## 
## Number of Fisher Scoring iterations: 2

ガウス分布は普通の回帰分析と同じことなのでした。

result.glm.lm    <- lm(Ozone~Solar.R+Wind+Temp,data=airquality)
summary(result.glm.lm)

## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

これらのモデルは，AICをつかって当てはめを評価することができます。

AIC(result.glm.gamma)

## [1] 939.8778

AIC(result.glm.gauss)

## [1] 998.7171

AIC(result.glm.lm)

## [1] 998.7171

AIC(赤池情報量基準)は小さいほど当てはまりが良いため，ガンマ分布を使ったモデルの方が良いモデルだと言えます。

ポアソン分布の例

ポアソン分布は0以上の整数をとる，いわゆるカウントデータに使われます。ホームランの本数と運動能力からなるサンプルデータで試してみましょう。サンプルデータはMoodleサイトからダウンロードしてください。

sport2 <- read.csv("sport2.csv",head=T)
summary(sport2)

##        HR              M1              M2              M3       
##  Min.   : 0.00   Min.   :26.27   Min.   :16.90   Min.   :20.62  
##  1st Qu.: 2.00   1st Qu.:43.54   1st Qu.:43.81   1st Qu.:44.88  
##  Median : 4.50   Median :49.13   Median :49.25   Median :53.88  
##  Mean   : 5.84   Mean   :48.94   Mean   :49.55   Mean   :51.43  
##  3rd Qu.: 7.25   3rd Qu.:54.50   3rd Qu.:56.34   3rd Qu.:59.29  
##  Max.   :36.00   Max.   :68.97   Max.   :73.02   Max.   :75.93

hist(sport2$HR)

これは100人の野球選手のダミーデータで，HRはホームランの本数，M1は腕力，M2は脚力，M3は素早さに関するスポーツテストの偏差値です。ホームランを腕力で予測してみましょう。

result.glm.pois <- glm(HR~M1,data=sport2,family="poisson")
summary(result.glm.pois)

## 
## Call:
## glm(formula = HR ~ M1, family = "poisson", data = sport2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5424  -0.9707   0.1806   0.7589   2.4599  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.719752   0.282457  -9.629   <2e-16 ***
## M1           0.085796   0.005044  17.008   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 450.01  on 99  degrees of freedom
## Residual deviance: 137.40  on 98  degrees of freedom
## AIC: 458.21
## 
## Number of Fisher Scoring iterations: 5

当てはまりを考えると，

result.glm.lm2 <- lm(HR~M1,data=sport2)
summary(result.glm.lm2)

## 
## Call:
## lm(formula = HR ~ M1, data = sport2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8980 -2.4354  0.1976  1.4405 20.9591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.63715    2.09838  -7.929  3.6e-12 ***
## M1            0.45930    0.04216  10.894  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.819 on 98 degrees of freedom
## Multiple R-squared:  0.5477, Adjusted R-squared:  0.5431 
## F-statistic: 118.7 on 1 and 98 DF,  p-value: < 2.2e-16

AIC(result.glm.pois)

## [1] 458.2146

AIC(result.glm.lm2)

## [1] 555.7452

ポアソン分布を使った回帰分析の方が，適合度が高いことがわかります。

まとめ

データの分布に適した族を使ったGLMをやろう

課題

sport2のデータを使って，HRを予測する他の変数を使ったGLM式を作りなさい
同じくM2を予測するのに適した回帰式を作りなさい