HW12-30

The dataset esdcomp was recorded on 44 doctors working in an emergency service at a hospital to study the factors affecting the number of complaints recieved.

library(faraway)

## Warning: 套件 'faraway' 是用 R 版本 4.3.1 來建造的

data(esdcomp)

(a) Consider the rate of complaints in terms of the number received in relation to the number of patient visits made. Plot this against each of the four potential predictors and comment on the relationships.

rate<-esdcomp$complaints/esdcomp$visits
par(mfrow=c(2,2))
plot(rate~residency,data=esdcomp)
plot(rate~gender,data=esdcomp)
plot(rate~revenue,data=esdcomp)
plot(rate~hours,data=esdcomp)

由圖形可以看出，residency, gender, revenue 似是和 rate of complaints 大小無關，hours 看起來有正相關，但也不明顯。

(b) Fit a binomial GLM for the number of complaints out of the number of visits with the other four variables as predictors. Does this model fit the data?

Perform this check numerically and graphically.

bmod<-glm(cbind(complaints,visits-complaints)~.,family=binomial, data=esdcomp) 

summary(bmod)

## 
## Call:
## glm(formula = cbind(complaints, visits - complaints) ~ ., family = binomial, 
##     data = esdcomp)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.1208523  0.8508618  -9.544   <2e-16 ***
## residencyY  -0.2092939  0.2012992  -1.040   0.2985    
## genderM      0.1956918  0.2182965   0.896   0.3700    
## revenue      0.0015785  0.0028317   0.557   0.5772    
## hours        0.0007028  0.0003508   2.004   0.0451 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63.523  on 43  degrees of freedom
## Residual deviance: 54.594  on 39  degrees of freedom
## AIC: 187.31
## 
## Number of Fisher Scoring iterations: 5

利用 goodness-of-fit test

\[H_0: \text{Null model (所有係數均為零) vs }H_1: \text{Current model(至少有一個係數不為零) } \]

test statistic: T = Deviance(Null model) - Deviance(current model)

\[ T \sim \chi_{df_T} \text{, where } df_T=df_{null}-df_{current} \]

t<-bmod$null.deviance-bmod$deviance
t.df<-bmod$df.null-bmod$df.residual
pchisq(t, t.df,lower=FALSE)

## [1] 0.06288113

p-value = 0.062>0.05, not reject \(H_0 \Rightarrow\) 此模型配適不好。

phat<-bmod$fitted.values
plot(rate, phat, ylab="Predicted rate", xlab="observed rate")
abline(a=0,b=1)

上圖為每個醫生的 rate of complaints。由上圖可看出，預測值和觀測值無明顯線性相關，因此也可以得到此模型的配適結果不好。

從模型配適結果，我們可以知道 residency, gender, revenue 這三個變數的影響均不顯著，若採取 backward selection, 只保留顯著的變數在模型當中，則我們可以得到的模型如下:

(c) Check the significance of the four predictors in the binomial model. Give a numerical interpretation to the effect of any significant predictors.

sumary(bmod)

##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept) -8.12085231  0.85086178 -9.5443  < 2e-16
## residencyY  -0.20929392  0.20129919 -1.0397  0.29847
## genderM      0.19569176  0.21829652  0.8964  0.37001
## revenue      0.00157849  0.00283167  0.5574  0.57723
## hours        0.00070279  0.00035075  2.0037  0.04511
## 
## n = 44 p = 5
## Deviance = 54.59353 Null Deviance = 63.52322 (Difference = 8.92969)

令 p 代表 complaint 的比例

\[\log \frac{\hat{p}}{1-\hat{p}} = -8.1209 - 0.2093 \cdot I(residency=Y) + 0.1957 \cdot I(gender=M ) +0.0016 \cdot (revenue) +0.0007 \cdot hours\]

在考慮所有變數的模型中，只有 hours 的 p-value < 0.05。

hours 的係數估計值為 0.0007, 代表在其他變數固定不變的情況下，醫生的工作時數 (hours) 每增加一單位， rate of complaints 的 log odds 增加 0.0007。

或可解釋為，在其他變數固定不變的情況下，醫生的工作時數 (hours) 每增加一單位， rate of complaints 的 log odds 增加 \(\exp(0.0007) \approx 1\)。

(d) Fit an appropriate Poisson rate model for number of complaints that takes a comparable form to the binomial GLM fitted earlier. Does this model fit the data? Again check this numerically and graphically.

pmod<-glm(complaints~.,family=poisson, data=esdcomp) 
summary(pmod)

## 
## Call:
## glm(formula = complaints ~ ., family = poisson, data = esdcomp)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.0803448  1.1542122  -0.070  0.94450   
## visits       0.0009499  0.0003386   2.806  0.00502 **
## residencyY  -0.2319740  0.2029388  -1.143  0.25301   
## genderM      0.1122391  0.2235043   0.502  0.61554   
## revenue     -0.0033827  0.0041553  -0.814  0.41560   
## hours       -0.0001569  0.0006634  -0.237  0.81298   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 49.995  on 38  degrees of freedom
## AIC: 184.77
## 
## Number of Fisher Scoring iterations: 5

t.p<-pmod$null.deviance-pmod$deviance
t.p.df<-pmod$df.null-pmod$df.residual
pchisq(t.p, t.p.df,lower=FALSE)

## [1] 1.925686e-07

goodness of fit test 下

\[H_0: \text{Null model (所有係數均為零) vs }H_1: \text{Current model(至少有一個係數不為零) } \] p-value 為 1.93 e-07 < 0.05, reject \(H_0 \Rightarrow\) 此模型至少有個係數不為零。故 this model fit the data。

plot(esdcomp$complaints,pmod$fitted.values,xlab="observed numbers of complains",ylab="predicted numbers of complaints")
abline(a=0,b=1)

由圖形可看出，預測次數和真實次數接近 45度線，因此模型有預測效果，this model fit the data。

(e) Again check the significance of the predictors and provide a numerical interpretation. Compare the conclusions of the two models.

在考慮所有變數的模型中，只有 visits 的 p-value = 0.005 < 0.05。

visits 的係數估計值約為 0.001, 代表在其他變數固定不變的情況下，病人看病人數 (visits) 每增加 1000人， rate of complaints 的次數預期增加 \(e^1 \approx 2.7\) 次。

或在其他變數固定不變的情況下，醫生的工作時數 (hours) 每增加 1000人， complaints 次取取 log 預期增加 1。

(f) Do the variable selection for the Poisson regression model. Write down the model.

pmod.step<-step(pmod)

## Start:  AIC=184.77
## complaints ~ visits + residency + gender + revenue + hours
## 
##             Df Deviance    AIC
## - hours      1   50.051 182.83
## - gender     1   50.251 183.03
## - revenue    1   50.665 183.44
## - residency  1   51.319 184.10
## <none>           49.995 184.78
## - visits     1   57.568 190.35
## 
## Step:  AIC=182.83
## complaints ~ visits + residency + gender + revenue
## 
##             Df Deviance    AIC
## - gender     1   50.404 181.18
## - revenue    1   50.739 181.52
## - residency  1   51.321 182.10
## <none>           50.051 182.83
## - visits     1   75.947 206.73
## 
## Step:  AIC=181.18
## complaints ~ visits + residency + revenue
## 
##             Df Deviance    AIC
## - revenue    1   50.882 179.66
## - residency  1   52.109 180.89
## <none>           50.404 181.18
## - visits     1   76.427 205.21
## 
## Step:  AIC=179.66
## complaints ~ visits + residency
## 
##             Df Deviance    AIC
## <none>           50.882 179.66
## - residency  1   54.246 181.03
## - visits     1   83.303 210.08

summary(pmod.step)

## 
## Call:
## glm(formula = complaints ~ visits + residency, family = poisson, 
##     data = esdcomp)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.7274574  0.3972932  -1.831   0.0671 .  
## visits       0.0008101  0.0001429   5.668 1.45e-08 ***
## residencyY  -0.3121729  0.1725024  -1.810   0.0703 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.882  on 41  degrees of freedom
## AIC: 179.66
## 
## Number of Fisher Scoring iterations: 5

經過 variable selection，模型中變數剩下 visits, residency, 預期 complaints 次數為 \(\mu\)

\[ \log (\hat{\mu}) = -0.7274574 + 0.0008101 \cdot (visits) -0.3121729 \cdot I(residency=Y) \]