載入pima
library(faraway)
## Warning: 套件 'faraway' 是用 R 版本 4.3.1 來建造的
data(pima)
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
glucose, diastolic, triceps, insulin, bmi 等變數等於 0 不合理, 將這些變數為 0 的資料設為 missing
pima$diastolic[pima$diastolic==0] <- NA
pima$glucose[pima$glucose==0] <- NA
pima$triceps[pima$triceps==0] <- NA
pima$insulin[pima$insulin==0] <- NA
pima$bmi[pima$bmi==0] <- NA
r<-glm(test~.,family="binomial",data=pima)
summary(r)
##
## Call:
## glm(formula = test ~ ., family = "binomial", data = pima)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
## pregnant 8.216e-02 5.543e-02 1.482 0.13825
## glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
## diastolic -1.420e-03 1.183e-02 -0.120 0.90446
## triceps 1.122e-02 1.708e-02 0.657 0.51128
## insulin -8.253e-04 1.306e-03 -0.632 0.52757
## bmi 7.054e-02 2.734e-02 2.580 0.00989 **
## diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
## age 3.395e-02 1.838e-02 1.847 0.06474 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 344.02 on 383 degrees of freedom
## (因為不存在,376 個觀察量被刪除了)
## AIC: 362.02
##
## Number of Fisher Scoring iterations: 5
此模型的 deviance 為 344.02 (383 degrees of freedom)
Goodness of fit test: \(H_0\): 所有變數係數=0 vs \(H_1\): 至少有一個變數係數不等於 0
\[T = \text{null deviance} - \text{deviance} = 498.10 - 344.02 \sim \chi^2_{df=391-383}\]
T<-r$null.deviance-r$deviance #test statistics
T.df<-r$df.null-r$df.residual
1-pchisq(T,T.df) # p-value
## [1] 0
The p-value\(\approx\) 0, reject \(H_0\). This model fits the data.
summary(pima$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.20 27.50 32.30 32.46 36.60 67.10 11
r$coefficient[7]
## bmi
## 0.07053758
Woman A: bmi = 1st quantile \(\approx\) 27.5
Woman B: bmi = 3rd quantile \(\approx\) 36.6
The coefficient of bmi \(\approx\) 0.07
\[ \log \frac{p_B}{1-p_B} - \log \frac{p_A}{1-p_A} = (36.6-27.5) \times \hat{\beta} =9.1\cdot \hat{\beta} \approx 0.642\] \[ \frac{\frac{p_B}{1-p_B}}{\frac{p_A}{1-p_A}} \approx e^{0.642} \approx 1.9 \]
\[ se(9.1\hat{\beta}) =9.1\cdot se(\hat{\beta}) \]
The 95% CI for \(9.1 \cdot \hat{\beta}\) would be
se.beta<-sqrt(summary(r)$cov.unscaled[7,7])
ci.diff<-9.1*r$coefficient[7]+c(-1,1)*qnorm(0.975)*9.1*se.beta
round(ci.diff, digits=3)
## [1] 0.154 1.130
round(exp(ci.diff),digits=3)
## [1] 1.167 3.094
假設其他變數值固定的情況下,bmi 從 1st quantile (27.5) 增加到 3rd quantile (36.6), 則 the log odds of testing positive for diabetes 將增加 0.642, 95% 信賴區間為 (0.154,1.130)
或可解釋為: 假設其他變數值固定的情況下,bmi 從 1st quantile (27.5) 增加到 3rd quantile (36.6), 則 the odds of testing positive for diabetes 將為原來的 1.9 倍, 95% 信賴區間為 (1.167, 3.094)。
diastolic 係數的估計值為 -0.00142, 因此根據模型估計結果,test positive 的人會有較低的 diastolic。
mean(pima$diastolic[pima$test==1],na.rm=TRUE)
## [1] 75.32143
mean(pima$diastolic[pima$test==0],na.rm=TRUE)
## [1] 70.87734
由敘述統計可知,test positive 的人,diastolic 平均值為 75.32,高於 test negative 的人 (70.88)
在模型中diastolic 係數為負 (-0.00142),代表 diastolic 增加會降低 test positive 的機率。這個估計值的結果看似和資料的敘述統計結果相矛盾,但此估計值的 p-value 為 0.9,不顯著,代表沒有足夠的證據顯示 diastolic 會影響到 test positive 的機率,因此模型估計結果與敘述統計結果並不矛盾。
x.new<-c( 1, 99, 64, 22, 76, 27, 0.25, 25)
data.new<-data.frame(t(x.new))
colnames(data.new)<-colnames(pima)[1:8]
xb.p<-predict(r,data.new,se.fit =T)
ilogit(xb.p$fit)
## 1
## 0.04573331
round(ilogit(xb.p$fit+c(-1,1)*qnorm(0.975)*xb.p$se.fit),digits=3)
## [1] 0.025 0.082
test positive 的機率預測值為 0.046,其 95% CI 為 (0.025,0.082)
在執行 step() 之前,先刪除變數有 missing 的樣本。
pima.rm.na<-pima[rowSums(is.na(pima))==0,]
dim(pima.rm.na)
## [1] 392 9
資料完整的樣本數為 392。
r.rm.na<-glm(test~.,family="binomial",data=pima.rm.na)
r.rm.na.f<-step(r.rm.na)
## Start: AIC=362.02
## test ~ pregnant + glucose + diastolic + triceps + insulin + bmi +
## diabetes + age
##
## Df Deviance AIC
## - diastolic 1 344.04 360.04
## - insulin 1 344.42 360.42
## - triceps 1 344.45 360.45
## <none> 344.02 362.02
## - pregnant 1 346.24 362.24
## - age 1 347.55 363.55
## - bmi 1 350.89 366.89
## - diabetes 1 351.58 367.58
## - glucose 1 396.95 412.95
##
## Step: AIC=360.04
## test ~ pregnant + glucose + triceps + insulin + bmi + diabetes +
## age
##
## Df Deviance AIC
## - insulin 1 344.42 358.42
## - triceps 1 344.46 358.46
## <none> 344.04 360.04
## - pregnant 1 346.24 360.24
## - age 1 347.60 361.60
## - bmi 1 351.28 365.28
## - diabetes 1 351.67 365.67
## - glucose 1 397.31 411.31
##
## Step: AIC=358.42
## test ~ pregnant + glucose + triceps + bmi + diabetes + age
##
## Df Deviance AIC
## - triceps 1 344.89 356.89
## <none> 344.42 358.42
## - pregnant 1 346.74 358.74
## - age 1 347.87 359.87
## - bmi 1 351.32 363.32
## - diabetes 1 351.90 363.90
## - glucose 1 411.11 423.11
##
## Step: AIC=356.89
## test ~ pregnant + glucose + bmi + diabetes + age
##
## Df Deviance AIC
## <none> 344.89 356.89
## - pregnant 1 347.23 357.23
## - age 1 348.72 358.72
## - diabetes 1 352.72 362.72
## - bmi 1 360.44 370.44
## - glucose 1 411.85 421.85
summary(r.rm.na.f)
##
## Call:
## glm(formula = test ~ pregnant + glucose + bmi + diabetes + age,
## family = "binomial", data = pima.rm.na)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.992080 1.086866 -9.193 < 2e-16 ***
## pregnant 0.083953 0.055031 1.526 0.127117
## glucose 0.036458 0.004978 7.324 2.41e-13 ***
## bmi 0.078139 0.020605 3.792 0.000149 ***
## diabetes 1.150913 0.424242 2.713 0.006670 **
## age 0.034360 0.017810 1.929 0.053692 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 344.89 on 386 degrees of freedom
## AIC: 356.89
##
## Number of Fisher Scoring iterations: 5
backward variable selection 依序刪除 diastolic, insulin, triceps 等三個變數,最後得到的模型為
\[ \log \frac{\hat{p}_i}{1-\hat{p}_i} = -9.992 + 0.084\cdot (pregnamt) +0.036 \cdot (glucose) +0.078 \cdot (bmi) +1.151 \cdot (diabetes) + 0.034 \cdot (age) \]
test.obs<-pima.rm.na$test
test.pred<- ifelse(r.rm.na.f$fitted.values>=0.5,1,0)
table(test.pred,test.obs)
## test.obs
## test.pred 0 1
## 0 232 55
## 1 30 75
(sens<-75/130)
## [1] 0.5769231
(spes<-232/262)
## [1] 0.8854962
在 test positive的機率值 p 的 cutoff 為 0.5 的情況下,sensitivity = 0.577, specificity = 0.885。
library(pROC)
## Warning: 套件 'pROC' 是用 R 版本 4.3.2 來建造的
## Type 'citation("pROC")' for a citation.
##
## 載入套件:'pROC'
## 下列物件被遮斷自 'package:stats':
##
## cov, smooth, var
r.roc<-roc(test.obs,r.rm.na.f$fitted.values)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
r.roc$auc
## Area under the curve: 0.8631
Area under the ROC curve 為 0.8631
ROC curve 的圖形為:
plot(r.roc)