載入pima

library(faraway)
## Warning: 套件 'faraway' 是用 R 版本 4.3.1 來建造的
data(pima)

(a)

summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

glucose, diastolic, triceps, insulin, bmi 等變數等於 0 不合理, 將這些變數為 0 的資料設為 missing

pima$diastolic[pima$diastolic==0] <- NA
pima$glucose[pima$glucose==0] <- NA
pima$triceps[pima$triceps==0] <- NA
pima$insulin[pima$insulin==0] <- NA
pima$bmi[pima$bmi==0] <- NA

(b)

r<-glm(test~.,family="binomial",data=pima)
summary(r)
## 
## Call:
## glm(formula = test ~ ., family = "binomial", data = pima)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
## pregnant     8.216e-02  5.543e-02   1.482  0.13825    
## glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
## diastolic   -1.420e-03  1.183e-02  -0.120  0.90446    
## triceps      1.122e-02  1.708e-02   0.657  0.51128    
## insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
## bmi          7.054e-02  2.734e-02   2.580  0.00989 ** 
## diabetes     1.141e+00  4.274e-01   2.669  0.00760 ** 
## age          3.395e-02  1.838e-02   1.847  0.06474 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.02  on 383  degrees of freedom
##   (因為不存在,376 個觀察量被刪除了)
## AIC: 362.02
## 
## Number of Fisher Scoring iterations: 5

此模型的 deviance 為 344.02 (383 degrees of freedom)

Goodness of fit test: \(H_0\): 所有變數係數=0 vs \(H_1\): 至少有一個變數係數不等於 0

\[T = \text{null deviance} - \text{deviance} = 498.10 - 344.02 \sim \chi^2_{df=391-383}\]

T<-r$null.deviance-r$deviance #test statistics
T.df<-r$df.null-r$df.residual
1-pchisq(T,T.df) # p-value
## [1] 0

The p-value\(\approx\) 0, reject \(H_0\). This model fits the data.

(c)

summary(pima$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.20   27.50   32.30   32.46   36.60   67.10      11
r$coefficient[7]
##        bmi 
## 0.07053758

Woman A: bmi = 1st quantile \(\approx\) 27.5

Woman B: bmi = 3rd quantile \(\approx\) 36.6

The coefficient of bmi \(\approx\) 0.07

\[ \log \frac{p_B}{1-p_B} - \log \frac{p_A}{1-p_A} = (36.6-27.5) \times \hat{\beta} =9.1\cdot \hat{\beta} \approx 0.642\] \[ \frac{\frac{p_B}{1-p_B}}{\frac{p_A}{1-p_A}} \approx e^{0.642} \approx 1.9 \]

\[ se(9.1\hat{\beta}) =9.1\cdot se(\hat{\beta}) \]

The 95% CI for \(9.1 \cdot \hat{\beta}\) would be

se.beta<-sqrt(summary(r)$cov.unscaled[7,7])
ci.diff<-9.1*r$coefficient[7]+c(-1,1)*qnorm(0.975)*9.1*se.beta
round(ci.diff, digits=3)
## [1] 0.154 1.130
round(exp(ci.diff),digits=3)
## [1] 1.167 3.094

假設其他變數值固定的情況下,bmi 從 1st quantile (27.5) 增加到 3rd quantile (36.6), 則 the log odds of testing positive for diabetes 將增加 0.642, 95% 信賴區間為 (0.154,1.130)

或可解釋為: 假設其他變數值固定的情況下,bmi 從 1st quantile (27.5) 增加到 3rd quantile (36.6), 則 the odds of testing positive for diabetes 將為原來的 1.9 倍, 95% 信賴區間為 (1.167, 3.094)。

(d)

diastolic 係數的估計值為 -0.00142, 因此根據模型估計結果,test positive 的人會有較低的 diastolic。

mean(pima$diastolic[pima$test==1],na.rm=TRUE)
## [1] 75.32143
mean(pima$diastolic[pima$test==0],na.rm=TRUE)
## [1] 70.87734

由敘述統計可知,test positive 的人,diastolic 平均值為 75.32,高於 test negative 的人 (70.88)

在模型中diastolic 係數為負 (-0.00142),代表 diastolic 增加會降低 test positive 的機率。這個估計值的結果看似和資料的敘述統計結果相矛盾,但此估計值的 p-value 為 0.9,不顯著,代表沒有足夠的證據顯示 diastolic 會影響到 test positive 的機率,因此模型估計結果與敘述統計結果並不矛盾。

(e)

x.new<-c( 1, 99, 64, 22, 76, 27, 0.25, 25)
data.new<-data.frame(t(x.new))
colnames(data.new)<-colnames(pima)[1:8]
xb.p<-predict(r,data.new,se.fit =T)
ilogit(xb.p$fit)
##          1 
## 0.04573331
round(ilogit(xb.p$fit+c(-1,1)*qnorm(0.975)*xb.p$se.fit),digits=3)
## [1] 0.025 0.082

test positive 的機率預測值為 0.046,其 95% CI 為 (0.025,0.082)

(f)

在執行 step() 之前,先刪除變數有 missing 的樣本。

pima.rm.na<-pima[rowSums(is.na(pima))==0,]
dim(pima.rm.na)
## [1] 392   9

資料完整的樣本數為 392。

r.rm.na<-glm(test~.,family="binomial",data=pima.rm.na)
r.rm.na.f<-step(r.rm.na)
## Start:  AIC=362.02
## test ~ pregnant + glucose + diastolic + triceps + insulin + bmi + 
##     diabetes + age
## 
##             Df Deviance    AIC
## - diastolic  1   344.04 360.04
## - insulin    1   344.42 360.42
## - triceps    1   344.45 360.45
## <none>           344.02 362.02
## - pregnant   1   346.24 362.24
## - age        1   347.55 363.55
## - bmi        1   350.89 366.89
## - diabetes   1   351.58 367.58
## - glucose    1   396.95 412.95
## 
## Step:  AIC=360.04
## test ~ pregnant + glucose + triceps + insulin + bmi + diabetes + 
##     age
## 
##            Df Deviance    AIC
## - insulin   1   344.42 358.42
## - triceps   1   344.46 358.46
## <none>          344.04 360.04
## - pregnant  1   346.24 360.24
## - age       1   347.60 361.60
## - bmi       1   351.28 365.28
## - diabetes  1   351.67 365.67
## - glucose   1   397.31 411.31
## 
## Step:  AIC=358.42
## test ~ pregnant + glucose + triceps + bmi + diabetes + age
## 
##            Df Deviance    AIC
## - triceps   1   344.89 356.89
## <none>          344.42 358.42
## - pregnant  1   346.74 358.74
## - age       1   347.87 359.87
## - bmi       1   351.32 363.32
## - diabetes  1   351.90 363.90
## - glucose   1   411.11 423.11
## 
## Step:  AIC=356.89
## test ~ pregnant + glucose + bmi + diabetes + age
## 
##            Df Deviance    AIC
## <none>          344.89 356.89
## - pregnant  1   347.23 357.23
## - age       1   348.72 358.72
## - diabetes  1   352.72 362.72
## - bmi       1   360.44 370.44
## - glucose   1   411.85 421.85
summary(r.rm.na.f)
## 
## Call:
## glm(formula = test ~ pregnant + glucose + bmi + diabetes + age, 
##     family = "binomial", data = pima.rm.na)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.992080   1.086866  -9.193  < 2e-16 ***
## pregnant     0.083953   0.055031   1.526 0.127117    
## glucose      0.036458   0.004978   7.324 2.41e-13 ***
## bmi          0.078139   0.020605   3.792 0.000149 ***
## diabetes     1.150913   0.424242   2.713 0.006670 ** 
## age          0.034360   0.017810   1.929 0.053692 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.89  on 386  degrees of freedom
## AIC: 356.89
## 
## Number of Fisher Scoring iterations: 5

backward variable selection 依序刪除 diastolic, insulin, triceps 等三個變數,最後得到的模型為

\[ \log \frac{\hat{p}_i}{1-\hat{p}_i} = -9.992 + 0.084\cdot (pregnamt) +0.036 \cdot (glucose) +0.078 \cdot (bmi) +1.151 \cdot (diabetes) + 0.034 \cdot (age) \]

(g)

test.obs<-pima.rm.na$test
test.pred<- ifelse(r.rm.na.f$fitted.values>=0.5,1,0)
table(test.pred,test.obs)
##          test.obs
## test.pred   0   1
##         0 232  55
##         1  30  75
(sens<-75/130)
## [1] 0.5769231
(spes<-232/262)
## [1] 0.8854962

在 test positive的機率值 p 的 cutoff 為 0.5 的情況下,sensitivity = 0.577, specificity = 0.885。

(h)

library(pROC)
## Warning: 套件 'pROC' 是用 R 版本 4.3.2 來建造的
## Type 'citation("pROC")' for a citation.
## 
## 載入套件:'pROC'
## 下列物件被遮斷自 'package:stats':
## 
##     cov, smooth, var
r.roc<-roc(test.obs,r.rm.na.f$fitted.values)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
r.roc$auc
## Area under the curve: 0.8631

Area under the ROC curve 為 0.8631

ROC curve 的圖形為:

plot(r.roc)