The dataset esdcomp was recorded on 44 doctors working in an emergency service at a hospital to study the factors affecting the number of complaints recieved.
library(faraway)
## Warning: 套件 'faraway' 是用 R 版本 4.3.1 來建造的
data(esdcomp)
rate<-esdcomp$complaints/esdcomp$visits
par(mfrow=c(2,2))
plot(rate~residency,data=esdcomp)
plot(rate~gender,data=esdcomp)
plot(rate~revenue,data=esdcomp)
plot(rate~hours,data=esdcomp)
由圖形可以看出,residency, gender, revenue 似是和 rate of complaints 大小無關,hours 看起來有正相關,但也不明顯。
Perform this check numerically and graphically.
bmod<-glm(cbind(complaints,visits-complaints)~.,family=binomial, data=esdcomp)
summary(bmod)
##
## Call:
## glm(formula = cbind(complaints, visits - complaints) ~ ., family = binomial,
## data = esdcomp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1208523 0.8508618 -9.544 <2e-16 ***
## residencyY -0.2092939 0.2012992 -1.040 0.2985
## genderM 0.1956918 0.2182965 0.896 0.3700
## revenue 0.0015785 0.0028317 0.557 0.5772
## hours 0.0007028 0.0003508 2.004 0.0451 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63.523 on 43 degrees of freedom
## Residual deviance: 54.594 on 39 degrees of freedom
## AIC: 187.31
##
## Number of Fisher Scoring iterations: 5
利用 goodness-of-fit test
\[H_0: \text{Null model (所有係數均為零) vs }H_1: \text{Current model(至少有一個係數不為零) } \]
test statistic: T = Deviance(Null model) - Deviance(current model)
\[ T \sim \chi_{df_T} \text{, where } df_T=df_{null}-df_{current} \]
t<-bmod$null.deviance-bmod$deviance
t.df<-bmod$df.null-bmod$df.residual
pchisq(t, t.df,lower=FALSE)
## [1] 0.06288113
p-value = 0.062>0.05, not reject \(H_0 \Rightarrow\) 此模型配適不好。
phat<-bmod$fitted.values
plot(rate, phat, ylab="Predicted rate", xlab="observed rate")
abline(a=0,b=1)
上圖為每個醫生的 rate of complaints。由上圖可看出,預測值和觀測值無明顯線性相關,因此也可以得到此模型的配適結果不好。
從模型配適結果,我們可以知道 residency, gender, revenue 這三個變數的影響均不顯著,若採取 backward selection, 只保留顯著的變數在模型當中,則我們可以得到的模型如下:
sumary(bmod)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.12085231 0.85086178 -9.5443 < 2e-16
## residencyY -0.20929392 0.20129919 -1.0397 0.29847
## genderM 0.19569176 0.21829652 0.8964 0.37001
## revenue 0.00157849 0.00283167 0.5574 0.57723
## hours 0.00070279 0.00035075 2.0037 0.04511
##
## n = 44 p = 5
## Deviance = 54.59353 Null Deviance = 63.52322 (Difference = 8.92969)
令 p 代表 complaint 的比例
\[\log \frac{\hat{p}}{1-\hat{p}} = -8.1209 - 0.2093 \cdot I(residency=Y) + 0.1957 \cdot I(gender=M ) +0.0016 \cdot (revenue) +0.0007 \cdot hours\]
在考慮所有變數的模型中,只有 hours 的 p-value < 0.05。
hours 的係數估計值為 0.0007, 代表在其他變數固定不變的情況下,醫生的工作時數 (hours) 每增加一單位, rate of complaints 的 log odds 增加 0.0007。
或可解釋為,在其他變數固定不變的情況下,醫生的工作時數 (hours) 每增加一單位, rate of complaints 的 log odds 增加 \(\exp(0.0007) \approx 1\)。
pmod<-glm(complaints~.,family=poisson, data=esdcomp)
summary(pmod)
##
## Call:
## glm(formula = complaints ~ ., family = poisson, data = esdcomp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0803448 1.1542122 -0.070 0.94450
## visits 0.0009499 0.0003386 2.806 0.00502 **
## residencyY -0.2319740 0.2029388 -1.143 0.25301
## genderM 0.1122391 0.2235043 0.502 0.61554
## revenue -0.0033827 0.0041553 -0.814 0.41560
## hours -0.0001569 0.0006634 -0.237 0.81298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 49.995 on 38 degrees of freedom
## AIC: 184.77
##
## Number of Fisher Scoring iterations: 5
t.p<-pmod$null.deviance-pmod$deviance
t.p.df<-pmod$df.null-pmod$df.residual
pchisq(t.p, t.p.df,lower=FALSE)
## [1] 1.925686e-07
goodness of fit test 下
\[H_0: \text{Null model (所有係數均為零) vs }H_1: \text{Current model(至少有一個係數不為零) } \] p-value 為 1.93 e-07 < 0.05, reject \(H_0 \Rightarrow\) 此模型至少有個係數不為零。 故 this model fit the data。
plot(esdcomp$complaints,pmod$fitted.values,xlab="observed numbers of complains",ylab="predicted numbers of complaints")
abline(a=0,b=1)
由圖形可看出,預測次數和真實次數接近 45度線,因此模型有預測效果,this model fit the data。
在考慮所有變數的模型中,只有 visits 的 p-value = 0.005 < 0.05。
visits 的係數估計值約為 0.001, 代表在其他變數固定不變的情況下,病人看病人數 (visits) 每增加 1000人, rate of complaints 的次數預期增加 \(e^1 \approx 2.7\) 次。
或在其他變數固定不變的情況下,醫生的工作時數 (hours) 每增加 1000人, complaints 次取取 log 預期增加 1。
pmod.step<-step(pmod)
## Start: AIC=184.77
## complaints ~ visits + residency + gender + revenue + hours
##
## Df Deviance AIC
## - hours 1 50.051 182.83
## - gender 1 50.251 183.03
## - revenue 1 50.665 183.44
## - residency 1 51.319 184.10
## <none> 49.995 184.78
## - visits 1 57.568 190.35
##
## Step: AIC=182.83
## complaints ~ visits + residency + gender + revenue
##
## Df Deviance AIC
## - gender 1 50.404 181.18
## - revenue 1 50.739 181.52
## - residency 1 51.321 182.10
## <none> 50.051 182.83
## - visits 1 75.947 206.73
##
## Step: AIC=181.18
## complaints ~ visits + residency + revenue
##
## Df Deviance AIC
## - revenue 1 50.882 179.66
## - residency 1 52.109 180.89
## <none> 50.404 181.18
## - visits 1 76.427 205.21
##
## Step: AIC=179.66
## complaints ~ visits + residency
##
## Df Deviance AIC
## <none> 50.882 179.66
## - residency 1 54.246 181.03
## - visits 1 83.303 210.08
summary(pmod.step)
##
## Call:
## glm(formula = complaints ~ visits + residency, family = poisson,
## data = esdcomp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7274574 0.3972932 -1.831 0.0671 .
## visits 0.0008101 0.0001429 5.668 1.45e-08 ***
## residencyY -0.3121729 0.1725024 -1.810 0.0703 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.882 on 41 degrees of freedom
## AIC: 179.66
##
## Number of Fisher Scoring iterations: 5
經過 variable selection,模型中變數剩下 visits, residency, 預期 complaints 次數為 \(\mu\)
\[ \log (\hat{\mu}) = -0.7274574 + 0.0008101 \cdot (visits) -0.3121729 \cdot I(residency=Y) \]