HW3 Logistic Regression

Disease Outbreak

The data set provides information from a study based on 196 persons selected in a probability sample within two sectors in a city. Each line of the data set has an identification number and provides information on 5 other variables for a single person. The 6 variables are:

Variable Number	Variable Name	Description
1	Identification	number 1-196
2	Age	Age of person (in years)
3	Socioeconomic status	1 = upper, 2 = middle, 3 = lower
4	Sector	Sector within city, where: 1 = sector 1, 2 = sector 2
5	Disease status	1 = with disease, 0 = without disease
6	Savings account status	1 = has savings account, 0 = does not have savings account

Data is from H. G. Dantes, J. S. Koopman, C. L. Addy, et al., “Dengue Epidemics on the Pacific Coast of Mexico”, International Journal of Epidemiology 17 (1988), pp. 178-186

Savings account status is the response variable, while age, socioeconomic status, city sector and disease status are the predictor variables. Cases 1-98 are to be utilized for developing the logistic regression model.

Fit logistic regression model. State the fitted response function. Decide which predictor variables can be dropped from the regression model and justify your choice. Use the remaining data to predict savings account status. Obtain the ROC curve to assess the model’s predictive power and identify the optimal cutoff.

一、数据预处理

# 读取数据，并将解释变量预先因子化。account变量在回归时会自动因子化，方便起见可不预先因子化。

data = read.table("Disease.txt", col.names = c("id", "age", "socioecon", "sector", 
    "disease", "account"), header = F)

data$socioecon = factor(data$socioecon)
data$sector = factor(data$sector)
data$disease = factor(data$disease)

# 将1-98行数据作为训练集，余下的作为测试集
nrow(data)

## [1] 196

train = data[1:98, ]
test = data[-c(1:98), ]

二、Logistic回归

1、全变量Logistic回归模型

glm.full = glm(account ~ age + socioecon + sector + disease, family = binomial(link = "logit"), 
    data = train)
summary(glm.full)

## 
## Call:
## glm(formula = account ~ age + socioecon + sector + disease, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.976  -0.830   0.410   0.818   1.905  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.0588     0.5659    0.10   0.9173   
## age           0.0325     0.0146    2.22   0.0263 * 
## socioecon2   -1.3479     0.6097   -2.21   0.0270 * 
## socioecon3   -1.7607     0.5652   -3.12   0.0018 **
## sector2       0.4709     0.5269    0.89   0.3716   
## disease1      0.6938     0.5662    1.23   0.2205   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.83  on 97  degrees of freedom
## Residual deviance: 106.88  on 92  degrees of freedom
## AIC: 118.9
## 
## Number of Fisher Scoring iterations: 4


# 暂且取0.5为预测值的分界点，计算预测准确度

# 注意：这个准确度既包括对savings account
# status=1的点的预测准确度，也包括对status=0的点的预测准确度，因此不同于下面ROC图中的TPR和FPR。下同。

pred.full = predict(glm.full, test)
pred.assign = as.numeric(pred.full >= 0.5)
(precise.full = sum(pred.assign == test$account)/nrow(test))

## [1] 0.7143

2、逐步回归模型：筛选变量，sector变量被排除

glm.step = step(glm.full)

## Start:  AIC=118.9
## account ~ age + socioecon + sector + disease
## 
##             Df Deviance AIC
## - sector     1      108 118
## - disease    1      108 118
## <none>              107 119
## - age        1      112 122
## - socioecon  2      118 126
## 
## Step:  AIC=117.7
## account ~ age + socioecon + disease
## 
##             Df Deviance AIC
## <none>              108 118
## - disease    1      110 118
## - age        1      113 121
## - socioecon  2      120 126

summary(glm.step)

## 
## Call:
## glm(formula = account ~ age + socioecon + disease, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.037  -0.850   0.454   0.804   1.880  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.2159     0.5372    0.40   0.6878   
## age           0.0322     0.0145    2.22   0.0262 * 
## socioecon2   -1.3382     0.6064   -2.21   0.0273 * 
## socioecon3   -1.8286     0.5592   -3.27   0.0011 **
## disease1      0.8560     0.5361    1.60   0.1104   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.83  on 97  degrees of freedom
## Residual deviance: 107.68  on 93  degrees of freedom
## AIC: 117.7
## 
## Number of Fisher Scoring iterations: 4


# 用逐步回归模型预测savings account status，计算准确率
pred.step = predict(glm.step, test)
pred.assign = as.numeric(pred.step >= 0.5)
(precise.step = sum(pred.assign == test$account)/nrow(test))

## [1] 0.7041

3、精简模型：鉴于变量disease不显著，去除该变量，AIC值上升很少

glm.opt = update(glm.step, . ~ . - disease)
summary(glm.opt)

## 
## Call:
## glm(formula = account ~ age + socioecon, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.979  -0.927   0.412   0.896   1.843  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.3711     0.5174    0.72   0.4732    
## age           0.0368     0.0139    2.64   0.0083 ** 
## socioecon2   -1.2555     0.5892   -2.13   0.0331 *  
## socioecon3   -1.9040     0.5552   -3.43   0.0006 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.83  on 97  degrees of freedom
## Residual deviance: 110.31  on 94  degrees of freedom
## AIC: 118.3
## 
## Number of Fisher Scoring iterations: 4


# 用精简模型预测，计算准确率，准确率有所提高
pred.opt = predict(glm.opt, test)
pred.assign = as.numeric(pred.opt >= 0.5)
(precise.opt = sum(pred.assign == test$account)/nrow(test))

## [1] 0.7551

精简模型的所有变量都是显著的，预测准确度最高，而AIC并不高，所以我们认为这个模型是最好的。

三、用ROCR包画出ROC曲线

# 如果不存在ROCR包就安装
require(ROCR) || install.packages("ROCR", quietly = T)

ROCR.pred.full <- prediction(pred.full, test$account)
ROCR.perf.full <- performance(ROCR.pred.full, "tpr", "fpr")

ROCR.pred.step <- prediction(pred.step, test$account)
ROCR.perf.step <- performance(ROCR.pred.step, "tpr", "fpr")

ROCR.pred.opt <- prediction(pred.opt, test$account)
ROCR.perf.opt <- performance(ROCR.pred.opt, "tpr", "fpr")

plot(ROCR.perf.full, col = "green", lty = 1, lwd = 3, main = "ROC curves")
plot(ROCR.perf.step, add = TRUE, col = "blue", lty = 1, lwd = 3)
plot(ROCR.perf.opt, add = TRUE, col = "red", lty = 1, lwd = 3)

legend(0.7, 0.3, c("Full", "Stepwise", "Optimum"), col = c("green", "blue", 
    "red"), lwd = 3)

plot of chunk unnamed-chunk-6

四、用Youden index法确定最佳的cutoff点，使得TPR+(1-FPR)的值最大

方法参考：徐林发等，2011. 应用ROC曲线求解最佳切点的方法介绍. 中国卫生统计，2011年12月第28卷第6期

代码参考：http://stackoverflow.com/questions/16347507/obtaining-threshold-values-from-a-roc-curve-r-rocr

# 计算线性门槛值cut（没有经过Logit逆函数转换为概率值），以及相应的FPR和TPR
cutoffs <- data.frame(cut = ROCR.perf.opt@alpha.values[[1]], fpr = ROCR.perf.opt@x.values[[1]], 
    tpr = ROCR.perf.opt@y.values[[1]])

# 对TPR+(1-FPR)的值排序，取第一位的门槛值
(optim = head(cutoffs[order(cutoffs$tpr + 1 - cutoffs$fpr, decreasing = TRUE), 
    ], 1))

##       cut    fpr    tpr
## 34 0.5918 0.1333 0.6604


# 由此，我们选取0.5918072作为Logistic回归预测的门槛值
pred.assign = as.numeric(pred.opt >= optim[1, 1])

# 准确度：
(precise.opt2 = sum(pred.assign == test$account)/nrow(test))

## [1] 0.7551


# 经验证，在这个门槛值下，FPR和TPR值与ROCR包给出的值完全一致：
(fpr.opt = sum(pred.assign * (1 - test$account))/sum(1 - test$account))

## [1] 0.1333

(tpr.opt = sum(pred.assign * test$account)/sum(test$account))

## [1] 0.6604

# 重新画一幅ROC图
ROCR.pred.full <- prediction(pred.full, test$account)
ROCR.perf.full <- performance(ROCR.pred.full, "tpr", "fpr")

ROCR.pred.step <- prediction(pred.step, test$account)
ROCR.perf.step <- performance(ROCR.pred.step, "tpr", "fpr")

ROCR.pred.opt <- prediction(pred.opt, test$account)
ROCR.perf.opt <- performance(ROCR.pred.opt, "tpr", "fpr")

plot(ROCR.perf.full, col = "green", lty = 1, lwd = 3, main = "ROC curves")
plot(ROCR.perf.step, add = TRUE, col = "blue", lty = 1, lwd = 3)
plot(ROCR.perf.opt, add = TRUE, col = "red", lty = 1, lwd = 3)

legend(0.7, 0.3, c("Full", "Stepwise", "Optimum"), col = c("green", "blue", 
    "red"), lwd = 3)

# 标出最佳分界点opt. cutoff在图中的位置
lines(x = c(optim[1, 2], optim[1, 2]), y = c(-0.05, optim[1, 3]), col = "brown", 
    lty = 3)
lines(x = c(-0.05, optim[1, 2]), y = c(optim[1, 3], optim[1, 3]), col = "brown", 
    lty = 3)
points(x = optim[1, 2], y = optim[1, 3], pch = 16, cex = 1, bg = "black")
text(x = optim[1, 2], y = optim[1, 3], labels = "opt. cutoff", adj = c(0.5, 
    -1))

plot of chunk unnamed-chunk-8

五、余论

在上面的模型中，sector变量不显著。然而在整个样本中，sector变量是显著的：

glm.full.all = glm(account ~ age + socioecon + sector + disease, family = binomial(link = "logit"), 
    data = data)
glm.step.all = step(glm.full.all)

## Start:  AIC=227.4
## account ~ age + socioecon + sector + disease
## 
##             Df Deviance AIC
## - disease    1      215 225
## <none>              215 227
## - sector     1      220 230
## - age        1      230 240
## - socioecon  2      242 250
## 
## Step:  AIC=225.4
## account ~ age + socioecon + sector
## 
##             Df Deviance AIC
## <none>              215 225
## - sector     1      221 229
## - age        1      230 238
## - socioecon  2      242 248

summary(glm.step.all)

## 
## Call:
## glm(formula = account ~ age + socioecon + sector, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.290  -0.865   0.389   0.815   1.989  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.0541     0.3814    0.14  0.88730    
## age           0.0355     0.0098    3.62  0.00029 ***
## socioecon2   -1.1743     0.4178   -2.81  0.00494 ** 
## socioecon3   -1.9536     0.4026   -4.85  1.2e-06 ***
## sector2       0.7895     0.3486    2.27  0.02351 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.06  on 195  degrees of freedom
## Residual deviance: 215.36  on 191  degrees of freedom
## AIC: 225.4
## 
## Number of Fisher Scoring iterations: 4