The data set provides information from a study based on 196 persons selected in a probability sample within two sectors in a city. Each line of the data set has an identification number and provides information on 5 other variables for a single person. The 6 variables are:
| Variable Number | Variable Name | Description |
|---|---|---|
| 1 | Identification | number 1-196 |
| 2 | Age | Age of person (in years) |
| 3 | Socioeconomic status | 1 = upper, 2 = middle, 3 = lower |
| 4 | Sector | Sector within city, where: 1 = sector 1, 2 = sector 2 |
| 5 | Disease status | 1 = with disease, 0 = without disease |
| 6 | Savings account status | 1 = has savings account, 0 = does not have savings account |
Data is from H. G. Dantes, J. S. Koopman, C. L. Addy, et al., “Dengue Epidemics on the Pacific Coast of Mexico”, International Journal of Epidemiology 17 (1988), pp. 178-186
Savings account status is the response variable, while age, socioeconomic status, city sector and disease status are the predictor variables. Cases 1-98 are to be utilized for developing the logistic regression model.
Fit logistic regression model. State the fitted response function. Decide which predictor variables can be dropped from the regression model and justify your choice. Use the remaining data to predict savings account status. Obtain the ROC curve to assess the model’s predictive power and identify the optimal cutoff.
# 读取数据,并将解释变量预先因子化。account变量在回归时会自动因子化,方便起见可不预先因子化。
data = read.table("Disease.txt", col.names = c("id", "age", "socioecon", "sector",
"disease", "account"), header = F)
data$socioecon = factor(data$socioecon)
data$sector = factor(data$sector)
data$disease = factor(data$disease)
# 将1-98行数据作为训练集,余下的作为测试集
nrow(data)
## [1] 196
train = data[1:98, ]
test = data[-c(1:98), ]
glm.full = glm(account ~ age + socioecon + sector + disease, family = binomial(link = "logit"),
data = train)
summary(glm.full)
##
## Call:
## glm(formula = account ~ age + socioecon + sector + disease, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.976 -0.830 0.410 0.818 1.905
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0588 0.5659 0.10 0.9173
## age 0.0325 0.0146 2.22 0.0263 *
## socioecon2 -1.3479 0.6097 -2.21 0.0270 *
## socioecon3 -1.7607 0.5652 -3.12 0.0018 **
## sector2 0.4709 0.5269 0.89 0.3716
## disease1 0.6938 0.5662 1.23 0.2205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.83 on 97 degrees of freedom
## Residual deviance: 106.88 on 92 degrees of freedom
## AIC: 118.9
##
## Number of Fisher Scoring iterations: 4
# 暂且取0.5为预测值的分界点,计算预测准确度
# 注意:这个准确度既包括对savings account
# status=1的点的预测准确度,也包括对status=0的点的预测准确度,因此不同于下面ROC图中的TPR和FPR。下同。
pred.full = predict(glm.full, test)
pred.assign = as.numeric(pred.full >= 0.5)
(precise.full = sum(pred.assign == test$account)/nrow(test))
## [1] 0.7143
glm.step = step(glm.full)
## Start: AIC=118.9
## account ~ age + socioecon + sector + disease
##
## Df Deviance AIC
## - sector 1 108 118
## - disease 1 108 118
## <none> 107 119
## - age 1 112 122
## - socioecon 2 118 126
##
## Step: AIC=117.7
## account ~ age + socioecon + disease
##
## Df Deviance AIC
## <none> 108 118
## - disease 1 110 118
## - age 1 113 121
## - socioecon 2 120 126
summary(glm.step)
##
## Call:
## glm(formula = account ~ age + socioecon + disease, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.037 -0.850 0.454 0.804 1.880
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2159 0.5372 0.40 0.6878
## age 0.0322 0.0145 2.22 0.0262 *
## socioecon2 -1.3382 0.6064 -2.21 0.0273 *
## socioecon3 -1.8286 0.5592 -3.27 0.0011 **
## disease1 0.8560 0.5361 1.60 0.1104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.83 on 97 degrees of freedom
## Residual deviance: 107.68 on 93 degrees of freedom
## AIC: 117.7
##
## Number of Fisher Scoring iterations: 4
# 用逐步回归模型预测savings account status,计算准确率
pred.step = predict(glm.step, test)
pred.assign = as.numeric(pred.step >= 0.5)
(precise.step = sum(pred.assign == test$account)/nrow(test))
## [1] 0.7041
glm.opt = update(glm.step, . ~ . - disease)
summary(glm.opt)
##
## Call:
## glm(formula = account ~ age + socioecon, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.979 -0.927 0.412 0.896 1.843
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3711 0.5174 0.72 0.4732
## age 0.0368 0.0139 2.64 0.0083 **
## socioecon2 -1.2555 0.5892 -2.13 0.0331 *
## socioecon3 -1.9040 0.5552 -3.43 0.0006 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.83 on 97 degrees of freedom
## Residual deviance: 110.31 on 94 degrees of freedom
## AIC: 118.3
##
## Number of Fisher Scoring iterations: 4
# 用精简模型预测,计算准确率,准确率有所提高
pred.opt = predict(glm.opt, test)
pred.assign = as.numeric(pred.opt >= 0.5)
(precise.opt = sum(pred.assign == test$account)/nrow(test))
## [1] 0.7551
# 如果不存在ROCR包就安装
require(ROCR) || install.packages("ROCR", quietly = T)
ROCR.pred.full <- prediction(pred.full, test$account)
ROCR.perf.full <- performance(ROCR.pred.full, "tpr", "fpr")
ROCR.pred.step <- prediction(pred.step, test$account)
ROCR.perf.step <- performance(ROCR.pred.step, "tpr", "fpr")
ROCR.pred.opt <- prediction(pred.opt, test$account)
ROCR.perf.opt <- performance(ROCR.pred.opt, "tpr", "fpr")
plot(ROCR.perf.full, col = "green", lty = 1, lwd = 3, main = "ROC curves")
plot(ROCR.perf.step, add = TRUE, col = "blue", lty = 1, lwd = 3)
plot(ROCR.perf.opt, add = TRUE, col = "red", lty = 1, lwd = 3)
legend(0.7, 0.3, c("Full", "Stepwise", "Optimum"), col = c("green", "blue",
"red"), lwd = 3)
方法参考:徐林发等,2011. 应用ROC曲线求解最佳切点的方法介绍. 中国卫生统计,2011年12月第28卷第6期
代码参考:http://stackoverflow.com/questions/16347507/obtaining-threshold-values-from-a-roc-curve-r-rocr
# 计算线性门槛值cut(没有经过Logit逆函数转换为概率值),以及相应的FPR和TPR
cutoffs <- data.frame(cut = ROCR.perf.opt@alpha.values[[1]], fpr = ROCR.perf.opt@x.values[[1]],
tpr = ROCR.perf.opt@y.values[[1]])
# 对TPR+(1-FPR)的值排序,取第一位的门槛值
(optim = head(cutoffs[order(cutoffs$tpr + 1 - cutoffs$fpr, decreasing = TRUE),
], 1))
## cut fpr tpr
## 34 0.5918 0.1333 0.6604
# 由此,我们选取0.5918072作为Logistic回归预测的门槛值
pred.assign = as.numeric(pred.opt >= optim[1, 1])
# 准确度:
(precise.opt2 = sum(pred.assign == test$account)/nrow(test))
## [1] 0.7551
# 经验证,在这个门槛值下,FPR和TPR值与ROCR包给出的值完全一致:
(fpr.opt = sum(pred.assign * (1 - test$account))/sum(1 - test$account))
## [1] 0.1333
(tpr.opt = sum(pred.assign * test$account)/sum(test$account))
## [1] 0.6604
# 重新画一幅ROC图
ROCR.pred.full <- prediction(pred.full, test$account)
ROCR.perf.full <- performance(ROCR.pred.full, "tpr", "fpr")
ROCR.pred.step <- prediction(pred.step, test$account)
ROCR.perf.step <- performance(ROCR.pred.step, "tpr", "fpr")
ROCR.pred.opt <- prediction(pred.opt, test$account)
ROCR.perf.opt <- performance(ROCR.pred.opt, "tpr", "fpr")
plot(ROCR.perf.full, col = "green", lty = 1, lwd = 3, main = "ROC curves")
plot(ROCR.perf.step, add = TRUE, col = "blue", lty = 1, lwd = 3)
plot(ROCR.perf.opt, add = TRUE, col = "red", lty = 1, lwd = 3)
legend(0.7, 0.3, c("Full", "Stepwise", "Optimum"), col = c("green", "blue",
"red"), lwd = 3)
# 标出最佳分界点opt. cutoff在图中的位置
lines(x = c(optim[1, 2], optim[1, 2]), y = c(-0.05, optim[1, 3]), col = "brown",
lty = 3)
lines(x = c(-0.05, optim[1, 2]), y = c(optim[1, 3], optim[1, 3]), col = "brown",
lty = 3)
points(x = optim[1, 2], y = optim[1, 3], pch = 16, cex = 1, bg = "black")
text(x = optim[1, 2], y = optim[1, 3], labels = "opt. cutoff", adj = c(0.5,
-1))
glm.full.all = glm(account ~ age + socioecon + sector + disease, family = binomial(link = "logit"),
data = data)
glm.step.all = step(glm.full.all)
## Start: AIC=227.4
## account ~ age + socioecon + sector + disease
##
## Df Deviance AIC
## - disease 1 215 225
## <none> 215 227
## - sector 1 220 230
## - age 1 230 240
## - socioecon 2 242 250
##
## Step: AIC=225.4
## account ~ age + socioecon + sector
##
## Df Deviance AIC
## <none> 215 225
## - sector 1 221 229
## - age 1 230 238
## - socioecon 2 242 248
summary(glm.step.all)
##
## Call:
## glm(formula = account ~ age + socioecon + sector, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.290 -0.865 0.389 0.815 1.989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0541 0.3814 0.14 0.88730
## age 0.0355 0.0098 3.62 0.00029 ***
## socioecon2 -1.1743 0.4178 -2.81 0.00494 **
## socioecon3 -1.9536 0.4026 -4.85 1.2e-06 ***
## sector2 0.7895 0.3486 2.27 0.02351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 270.06 on 195 degrees of freedom
## Residual deviance: 215.36 on 191 degrees of freedom
## AIC: 225.4
##
## Number of Fisher Scoring iterations: 4