What are possible risk factors for heart disease? What determines whether an employee is a desirable hire for a firm? How can we tell whether an Amazon review is real or fake? How can we model a categorical variable conveniently and efficiently? Logistic regression models are among the most commonly used methods to model the probability of an event, and they automatically give us linear classification rules. Various criteria for evaluating classifiers are introduced. Analogous to the least squares solution for ordinary regression models, we use maximum likelihood estimation.
Heart disease is the leading cause of death in the United States: one out of four deaths is due to heart disease. It is therefore important to identify risk factors for Coronary Heart Disease. Many studies have indicated that high blood pressure, high cholesterol, age, gender, and race are among the major risk factors.
Starting in the late 1940s, the National Heart, Lung, and Blood Institute (NHLBI) launched its famous Framingham Heart Study. By now, subjects from three generations, together with other participants, have been monitored and followed in the study. Thousands of research papers have been published using these longitudinal data sets. More details can be found on the study's website.
Using a piece of the data gathered at the beginning of the study, we illustrate how to identify risk factors for heart disease and how to classify whether someone has heart disease.
i) Data: 1,406 participants. Conditions gathered at the beginning of the study (early 1950s).
| Variable | Description |
|---|---|
| Heart Disease | Indicator of having heart disease or not |
| AGE | Age |
| SEX | Gender |
| SBP | Systolic blood pressure |
| DBP | Diastolic blood pressure |
| CHOL | Cholesterol level |
| FRW | Age- and gender-adjusted weight |
| CIG | Self-reported number of cigarettes smoked each week |
ii) Goal:
Predict whether one has heart disease (HD) given the information in hand.
In particular,
Predict \(Prob(HD = 1)\) for Alice, who is:
| AGE | SEX | SBP | DBP | CHOL | FRW | CIG |
|---|---|---|---|---|---|---|
| 45 | FEMALE | 100 | 80 | 180 | 110 | 5 |
where \[HD =\left\{\begin{array}{ll} 1 & \mbox{ if heart disease} \\ 0 & \mbox{ if normal} \end{array}\right.\]
Read Framingham.dat as fram_data
fram_data <- read.csv("Framingham.dat", sep=",", header=T, as.is=T)
str(fram_data)
names(fram_data)
summary(fram_data)

Rename and set variables to their correct type
# names(fram_data)[1] <- "HD"
# fram_data$HD <- as.factor(fram_data$HD)
# fram_data$SEX <- as.factor(fram_data$SEX)

# dplyr way
fram_data %<>%
rename(HD = Heart.Disease.) %>%
mutate(HD = as.factor(HD),
         SEX = as.factor(SEX))

Structure of the data
## 'data.frame': 1407 obs. of 8 variables:
## $ HD : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ AGE : int 45 49 47 48 49 47 48 48 46 45 ...
## $ SEX : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 2 2 2 2 2 2 2 ...
## $ SBP : int 90 100 100 108 108 108 108 110 110 110 ...
## $ DBP : int 50 64 70 70 75 68 70 70 80 72 ...
## $ CHOL: int 216 237 215 340 149 165 196 229 204 183 ...
## $ FRW : int 76 97 86 93 95 88 79 85 112 93 ...
## $ CIG : int 5 0 50 0 0 0 20 25 0 20 ...
We have created the last row for Alice, which will be used for prediction. Notice that the response HD is missing here.
## HD AGE SEX SBP DBP CHOL FRW CIG
## 1407 <NA> 45 FEMALE 100 80 180 110 5
We take out the last row and save it to fram_data.new to avoid any potential problems in our analyses.
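A minimal sketch of this step (assuming Alice's record is the last row):

fram_data.new <- fram_data[nrow(fram_data), ]   # save Alice's record for later prediction
fram_data <- fram_data[-nrow(fram_data), ]      # keep the 1,406 complete observations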
We have 311 observations with HD = 1 and 1,095 with HD = 0.
## HD AGE SEX SBP DBP
## 0:1095 Min. :45.00 FEMALE:737 Min. : 90.0 Min. : 50.00
## 1: 311 1st Qu.:48.00 MALE :669 1st Qu.:130.0 1st Qu.: 80.00
## Median :52.00 Median :142.0 Median : 90.00
## Mean :52.45 Mean :148.1 Mean : 90.14
## 3rd Qu.:56.00 3rd Qu.:160.0 3rd Qu.: 98.00
## Max. :62.00 Max. :300.0 Max. :160.00
##
## CHOL FRW CIG
## Min. : 96.0 Min. : 52.0 Min. : 0.000
## 1st Qu.:200.0 1st Qu.: 94.0 1st Qu.: 0.000
## Median :230.5 Median :103.0 Median : 0.000
## Mean :234.7 Mean :105.4 Mean : 8.048
## 3rd Qu.:264.0 3rd Qu.:114.0 3rd Qu.:20.000
## Max. :430.0 Max. :222.0 Max. :60.000
## NA's :11 NA's :2
Notice the missing values in FRW and CIG. We won't remove those observations from the data set.
For simplicity we start with a simple question: How does HD relate to SBP?
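One way to produce the group means below (a sketch, assuming dplyr is loaded):

fram_data %>%
  group_by(HD) %>%
  summarise(mean(SBP))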
## # A tibble: 2 x 2
## HD `mean(SBP)`
## <fct> <dbl>
## 1 0 145.
## 2 1 159.
On average, SBP seems to be higher among those with HD = 1.
We can see the distribution of SBP through back-to-back box plots. Again, SBP seems to be higher when HD = 1.
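A sketch of one way to draw such box plots (the colors and layout here are choices, not from the original code):

boxplot(SBP ~ HD, data = fram_data,
        horizontal = TRUE,               # SBP on the x-axis, HD groups on the y-axis
        col = c("grey", "red"),
        xlab = "SBP", ylab = "HD")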
Next we explore how the proportion of HD = 1 relates to SBP. Note that HD of type factor is coerced into numeric 1 and 2. In fact, heart disease can be coded in many ways: as 1, "Yes", "Bad", etc.
Standard plots:
plot(jitter(as.numeric(fram_data$SBP), factor=.5),
jitter(as.numeric(fram_data$HD), factor=.5),
col=fram_data$HD,
xlab = "SBP",
ylab = "HD")
legend("right",
legend=c("0", "1"),
lty=c(1,1),
lwd=c(2,2),
       col=unique(fram_data$HD))

Problem: many observations overlap, so they are not all visible.
Solution: use jitter() to spread out observations with similar SBP values.
plot(jitter(as.numeric(fram_data$HD), factor=.5) ~ fram_data$SBP,
pch=4,
col=fram_data$HD,
ylab="HD",
xlab="SBP")
legend("right", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

Alternatively, we could use stripchart() as follows:
stripchart(fram_data$SBP~fram_data$HD, method="jitter",
col=c("black", "red"), pch=4,
           ylab="HD", xlab="SBP")
legend("topright", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

geom_jitter() from ggplot2 provides similar functionality.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
  geom_jitter(height = .05, aes(color = factor(HD)))

None of the plots above shows the proportion of "1"s vs. "0"s as a function of SBP.
Can we use a linear model?
For a binary response with a \(0/1\) coding such as HD, it can be shown that \(X \hat\beta\) is in fact an estimate of \(P(HD=1 | X)\), and we can predict heart disease if \(\hat P(HD=1 | X) > 0.5\) and normal otherwise. However, some of our estimates will fall outside \([0,1]\), which makes no sense to interpret as probabilities.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "lm", se = FALSE) +
  ylab("Prob(HD=1)")

We clearly would like to model the probability of having HD given SBP. One of the most popular models for this is the logistic regression model.
In the logistic regression model, we model the probability of being sick as follows:
\[P(HD=1\vert SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}}\]
where \(\beta_0\) and \(\beta_1\) are unknown parameters. We see the following properties immediately:
\[0 < P(HD=1\vert SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}} < 1\]
We then get \(P(HD=0\vert SBP)\):
\[P(HD=0\vert SBP) = 1-P(HD=1\vert SBP) = \frac{1}{1+e^{\beta_0+\beta_1 SBP}}\]
The log odds of having HD: \[logit(P(HD=1|SBP)) =\log\left(\frac{P(HD=1\vert SBP)}{P(HD=0\vert SBP)}\right)=\beta_0+\beta_1 \times SBP\]
The interpretation of \(\beta_1\) is the change in log odds for a unit change in SBP.
\(P(HD=1|SBP)\) is a monotone function of SBP, depending on the sign of \(\beta_1\). \(P(HD=1|SBP)\) is an increasing function of SBP if \(\beta_1 >0\).
For our setting, the response is a categorical variable, so the notion of least squares error does not apply. We need some sensible function to minimize or maximize in order to estimate the unknown parameters \(\beta\). We introduce the Likelihood Function of \(\beta_0\) and \(\beta_1\) given the data, namely the probability of seeing the actual outcomes in the data.
Take a look at a piece of the data, randomly chosen from our dataset. We can then see part of the likelihood function.
HD SBP
975 0 135
710 0 124
774 0 136
416 1 130
392 1 120
273 0 134
1373 0 154
1228 0 150
1321 1 210
690 0 118
We use this to illustrate the notion of likelihood function.
\[\begin{split} \mathcal{Lik}(\beta_0, \beta_1 \vert {\text Data}) &= {Prob\text {(the outcome of the data)}}\\ &=Prob((HD=0|SBP=135), (HD=0|SBP=124), \ldots, (HD=1|SBP=130), \ldots ) \\ &=Prob(HD=0|SBP=135) \times Prob(HD=0|SBP=124) \times \ldots \times Prob(HD=1|SBP=130) \times \ldots \\ &= \frac{1}{1+e^{\beta_0 + 135 \beta_1}}\cdot\frac{1}{1+e^{\beta_0 + 124\beta_1}}\cdots\frac{e^{\beta_0 + 130 \beta_1}}{1 + e^{\beta_0 + 130 \beta_1}} \cdots \end{split}\]
MLE: The estimates \((\hat \beta_0, \hat \beta_1)\) that maximize the likelihood function are called the Maximum Likelihood Estimators:
\[(\hat \beta_0, \hat \beta_1) = \arg\max_{\beta_0, \beta_1} \mathcal{Lik}(\beta_0, \beta_1 \vert Data)\]

Remark:
\(\hat \beta_0\) and \(\hat \beta_1\) are obtained through \(\max\log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))\) or equivalently through \(\min -\{\log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))\}\)
The MLE can in general only be obtained through numerical computation.
glm() will be used to fit the logistic regression. It is very similar to lm(), but some of the output is different. The default link for the binomial family in glm() is the logit link.
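The call producing the output below (the object name fit1 is used throughout the text):

fit1 <- glm(HD ~ SBP, fram_data, family = binomial(logit))
summary(fit1)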
##
## Call:
## glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6687 -0.7109 -0.6253 -0.5244 2.1072
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
## SBP 0.01590 0.00221 7.196 6.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1485.9 on 1405 degrees of freedom
## Residual deviance: 1432.8 on 1404 degrees of freedom
## AIC: 1436.8
##
## Number of Fisher Scoring iterations: 4
To see the probability function estimated by glm():
logit = -3.66 + 0.0159 SBP
\[\begin{split} \hat P(HD = 1 \vert SBP) &= \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}} \\ \hat P(HD = 0 \vert SBP) &= \frac{1}{1+e^{-3.66+0.0159 \times SBP}} \end{split}\]
Now, to estimate \(P(HD=1)\) for Alice, we can plug her SBP=100 into the logit function:
Based on fit1 we plug in SBP value into the prob equation. \[\hat P(HD = 1 \vert SBP=100) = \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}} = \frac{e^{-3.66+0.0159 \times 100}}{1+e^{-3.66+0.0159 \times 100}} \approx 0.112\]
We can also use the predict() function.
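A sketch of the predict() call generating the output below, using Alice's record fram_data.new:

predict(fit1, fram_data.new, type = "response")  # type = "response" returns the probability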
## 1407
## 0.1117059
Let us see what we can say about the risk of HD when SBP increases.
logit = -3.66 + 0.0159 SBP. That means the log odds increase by 0.0159 when SBP increases by 1 unit.
Notice that \(Prob(HD = 1)\) is an increasing function of SBP since \(\hat \beta_1 = .0159 > 0\): when SBP increases, the chance of having HD is higher.
Unfortunately we do not have a nice linear interpretation of \(\beta_1\) over \(Prob(HD = 1)\) anymore.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
# geom_smooth(method = "lm",
# color = "red",
# se = FALSE) +
  ylab("Prob(HD=1)")

Alternatively, we can plot the probability directly through \(\frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}\):
x <- seq(100, 300, by=1)
y <- exp(-3.66+0.0159*x)/(1+exp(-3.66+0.0159*x))
plot(x, y, pch=16, type = "l",
xlab = "SBP",
     ylab = "Prob of P(Y=1|SBP)" )

How can we tell whether the true \(\beta_1\) differs from 0? We need to provide either confidence intervals or hypothesis tests for the unknown parameters.
Facts about MLE:
##
## Call:
## glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6687 -0.7109 -0.6253 -0.5244 2.1072
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
## SBP 0.01590 0.00221 7.196 6.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1485.9 on 1405 degrees of freedom
## Residual deviance: 1432.8 on 1404 degrees of freedom
## AIC: 1436.8
##
## Number of Fisher Scoring iterations: 4
Usual z-intervals for the coefficients (based on the output from the summary):
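These Wald intervals can be obtained with (a sketch):

confint.default(fit1)  # estimate +/- 1.96 * standard error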
2.5 % 97.5 %
(Intercept) -4.34161096 -2.98522712
SBP 0.01156901 0.02023071
Similar to F-tests in OLS (Ordinary Least Squares), we have the likelihood ratio test to check whether a collective set of variables is needed.
Here:
\[H_0: \beta_{SBP}=0 \mbox{ v.s. } H_1: \beta_{SBP} \not= 0\]
Facts about Likelihood Ratio Tests
Under \(H_0\), the likelihood ratio (modified with log) \[\begin{split} \text {Testing stat} = \chi^2 & = -2\times \log \frac{\max _{H_0} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}{\max_{H_1} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}\\ &=-2\log(\mathcal{Lik}_{H_0}) - (-2\log(\mathcal{Lik}_{H_1}))\\ &\sim \chi^2_{df=1} \end{split}\]
Testing stat = Null Deviance - Residual Deviance. Here Deviance = \(-2\log(\mathcal{L})\)
The p-value is computed from the \(\chi^2\) distribution: \(p_{value}=P(\chi^2_{df} > \chi^2)\).
glm() outputs the two terms
See more about \(\chi^2\) distribution in the appendix
Call:
glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6687 -0.7109 -0.6253 -0.5244 2.1072
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
SBP 0.01590 0.00221 7.196 6.22e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1485.9 on 1405 degrees of freedom
Residual deviance: 1432.8 on 1404 degrees of freedom
AIC: 1436.8
Number of Fisher Scoring iterations: 4
Notice:
Null deviance = 1485.9
Residual deviance: 1432.8
\(\chi^2 = 1485.9-1432.8= 53.1\)
We can then get the p-value of the likelihood ratio test.
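A one-line computation of this p-value (a sketch):

pchisq(1485.9 - 1432.8, df = 1, lower.tail = FALSE)  # upper-tail chi-square probability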
We can also use anova() or car::Anova() to get the \(\chi^2\) tests. Here the null hypothesis \(H_0:~ \beta_{SBP}=0\).
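A sketch of the anova() call producing the table below:

anova(fit1, test = "Chisq")  # sequential chi-square tests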
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: HD
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1405 1485.9
## SBP 1 53.08 1404 1432.8 3.203e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is analogous to the F-test in the lm() setup.
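A sketch of the corresponding call (assuming the car package is loaded):

Anova(fit1)  # Type II likelihood ratio tests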
## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## SBP 53.08 1 3.203e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By inverting the likelihood ratio tests we can also get confidence intervals for the coefficients. These are similar to, but not the same as, the Wald (z) intervals.
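These profile-likelihood intervals are typically obtained with confint() (a sketch):

confint(fit1)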
## 2.5 % 97.5 %
## (Intercept) -4.34161096 -2.98522712
## SBP 0.01156901 0.02023071
Given the prediction for the probability of Alice having heart disease, i.e. \(\hat P(HD=1 | SBP=100)=.11\), how do we decide whether Alice will have heart disease or not? In general, how do we classify \(\hat Y = 1\) given \(\hat P(Y=1 | X)\)?
A sensible way to predict \(Y\) given \(x\) is to set a threshold on \(\hat P(Y=1|x)\). Once we have an estimated equation for \(Prob(HD=1|SBP)\), we can immediately obtain classification rules by setting a threshold.
Rule 1: Thresholding probability by 1/2
By definition, the larger \(\hat P(HD=1 | SBP)\) is, the more likely Alice will have heart disease. We start with a classifier that classifies \(\widehat{HD} = 1\) if \(\hat P(HD=1 \vert SBP) > 1/2\). To be more specific,
\[\widehat{HD}=1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) = \frac{e^{-3.66+.0159\cdot SBP}}{1+e^{-3.66+.0159\cdot SBP}} > \frac{1}{2}.\]
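A sketch of this classification for Alice (the object name alice.prob is hypothetical):

alice.prob <- predict(fit1, fram_data.new, type = "response")
ifelse(alice.prob > 1/2, "1", "0")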
## 1407
## "0"
Thus we classify Alice as not having heart disease.
Linear boundary:
The classification rule above is equivalent to
\[ \begin{split} \widehat{HD}=1 ~~~~ \text{if} ~~~~ -3.66+.0159 \cdot SBP &= \log (\text{odds ratio}) \\ & = \log \Bigg(\frac{P(HD=1 \vert SBP)}{P(HD=0 \vert SBP)} \Bigg) \\ &> \log \Bigg(\frac{1/2}{1/2}\Bigg) = \log 1 = 0 \end{split} \]
Simple algebra yields \[\hat Y=1 ~~~~ \text{if} ~~~~ SBP > \frac{3.66}{.0159}=230.18.\] This is called a linear classification boundary, and it has a simple interpretation. Since Alice has \(SBP = 100 < 230.18\), we classify Alice as not having heart disease.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
geom_vline(xintercept = 230.18, col="red") +
ggtitle("Classifier: HD = 1 if prob > 1/2") +
  ylab("Prob(HD=1)")

Based on the above linear boundary, we classify everyone as \(\hat Y=1\) if their SBP is on the right side of the vertical line!
Rule 2: Thresholding probability by 1/3
Let’s redo the exercise by thresholding \(\hat P(HD=1 | SBP) > 1/3\).
\[\widehat{HD}=1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) = \frac{e^{-3.66+.0159\cdot SBP}}{1+e^{-3.66+.0159\cdot SBP}} > \frac{1}{3}.\]
## 1407
## "0"
We will still classify Alice as not having heart disease.
Alternatively, the linear classification rule is
\[ \begin{split} \widehat{HD}=1 ~~~~ \text{if} ~~~~ -3.66+.0159 \cdot SBP &= \log (\text{odds ratio}) \\ & = \log \Bigg(\frac{P(HD=1 \vert SBP)}{P(HD=0 \vert SBP)} \Bigg) \\ &> \log \Bigg(\frac{1/3}{2/3}\Bigg) = \log (1/2) = -0.693 \end{split} \]
Simple algebra yields \[\hat Y=1 ~~~~ \text{if} ~~~~ SBP > \frac{3.66 + \log(1/2)}{.0159} = 187\]
We now compare the two classifiers.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
geom_line(aes(x = 230.18), col="red") +
geom_line(aes(x = 187), col="green") +
ggtitle("Green: HD = 1 if prob > 1/3; Red: HD = 1 if prob > 1/2") +
  ylab("Prob(HD=1)")

Given two classifiers, how can we tell which one is better? There are several criteria for evaluating the performance of a classifier.
Given the actual status and the status predicted by a classifier, we can summarize how well the rule works in a two-way table called a confusion matrix.
Let’s use our first classifier, \[\widehat{HD} = 1 \mbox{ if } \hat P(HD=1 \vert SBP) > 1/2.\]
Taking a few participants from the data, we compare the truth and the predictions:
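The predictions fit1.pred.5 used below were presumably created by thresholding the fitted probabilities at 1/2; a sketch:

fit1.pred.5 <- ifelse(fit1$fitted > 1/2, "1", "0")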
set.seed(10)
output1 <- data.frame(fram_data$HD, fit1.pred.5, fit1$fitted)[sample(1406, 10),]
names(output1) <- c( "HD", "Predicted HD", "Prob")
output1

##      HD Predicted HD      Prob
## 491 0 0 0.1919423
## 1354 0 0 0.1684801
## 368 1 0 0.1684801
## 439 0 0 0.1473629
## 344 0 0 0.2611611
## 1295 0 0 0.4351549
## 143 0 0 0.2461149
## 938 0 0 0.1576343
## 1405 1 0 0.2232822
## 930 0 0 0.1514034
We see that 2 of these 10 sampled participants (rows 368 and 1405) are mislabeled.
Confusion matrix: a 2-by-2 table that summarizes the counts of correct and incorrect labels.
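A sketch of the call producing the table below (the object name cm.5 is used later in the text):

cm.5 <- table(fit1.pred.5, fram_data$HD)  # rows: predictions, columns: truth
cm.5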
##
## fit1.pred.5 0 1
## 0 1084 302
## 1 11 9
Note that the rows are \(\hat y\) and the columns are \(y\). These four numbers reflect different criteria.
| \(Y=0\) | \(Y=1\) | |
|---|---|---|
| \(\hat{Y}=0\) | Specificity (true negative) | False negative |
| \(\hat{Y}=1\) | False positive | Sensitivity (true positive) |
Sensitivity:
The sensitivity is defined as \[P(\hat Y = 1 \vert Y = 1)\]
This is also called True Positive Rate: the proportion of correct positive classification.
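Computed from the confusion matrix (a sketch):

sensitivity <- cm.5["1", "1"] / sum(cm.5[, "1"])  # 9 / 311
sensitivity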
## [1] 0.02893891
Specificity:
The specificity is defined as \[P(\hat Y = 0| Y = 0)\]
Specificity measures the proportion of correct negative classification.
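Again from the confusion matrix (a sketch):

specificity <- cm.5["0", "0"] / sum(cm.5[, "0"])  # 1084 / 1095
specificity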
## [1] 0.9899543
False Positive:
A related measure is the false positive rate.
\[1 - \text{Specificity} = P(\hat Y=1 \vert Y=0)\]
False Positive measures the proportion of incorrect positive classification (given the actual status being negative).
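From the confusion matrix (a sketch):

false.positive <- cm.5["1", "0"] / sum(cm.5[, "0"])  # 11 / 1095
false.positive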
## [1] 0.01004566
Given a logistic model, we can obtain different classifiers by changing the classification rule, i.e., by changing the threshold. Each classifier has its own sensitivity and specificity. Plotting the (specificity, sensitivity) pairs for all thresholds traces out a curve, termed the ROC curve.
The ROC curve is helpful when choosing a classifier. We want both high specificity and high sensitivity at the same time, which means classifying both \(Y=0\) and \(Y=1\) correctly. In general, however, we will NOT have a perfect classifier and need to strike a balance between the two.
We use the roc() function from the package pROC to obtain detailed information for each classifier. Notice the ROC curve here plots Sensitivity vs. Specificity. Some prefer using the false positive rate as the x-axis so as to have an ascending x-axis.
**Plotting the ROC curve:**
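A sketch of the calls behind the plot (assuming pROC is loaded; the messages below are printed by roc()):

fit1.roc <- roc(fram_data$HD, fit1$fitted)
plot(fit1.roc)  # by default plots Sensitivity vs. Specificity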
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Note that the higher the specificity, the lower the sensitivity. A perfect classifier will have both \(\text{Specificity}=1\) and \(\text{Sensitivity}=1\).
Compare the following ROC curve using false positive as x-axis.
plot(1-fit1.roc$specificities, fit1.roc$sensitivities, col="red", pch=16,
xlab="False Positive",
     ylab="Sensitivity")

We can extract the specificity and sensitivity at different thresholds from fit1.roc.
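For example, list its components (a sketch):

names(fit1.roc)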
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"
We can also plot a curve that shows the probability thresholds used and the corresponding False Positive rate.
plot(fit1.roc$thresholds, 1-fit1.roc$specificities, col="green", pch=16,
xlab="Threshold on prob",
ylab="False Positive",
     main = "Thresholds vs. False Positive")

AUC measures the area under the ROC curve. It is used to measure the performance of the logistic model as a whole: the larger the better. Why?
Given a specificity, the model with the higher sensitivity is more desirable. If a model has higher sensitivity at every specificity, it is a better model. An extreme case is a model whose sensitivity is constantly 1, which gives \(AUC=1\).
fit1.roc$auc
pROC::auc(fit1.roc)
### if you get "Error in rank(prob) : argument "prob" is missing, with no default"
### then it is possible that you are using the auc() in glmnet
### use pROC::auc(fit1.roc) to specify we want to use auc() in pROC

## Area under the curve: 0.637
## Area under the curve: 0.637
Misclassification error (MCE): the mean number of misclassifications,
\[MCE= \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat y_i \neq y_i\}\]
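A sketch of the computations behind the two numbers below:

mce <- mean(fit1.pred.5 != fram_data$HD)  # misclassification error
mce
1 - mce                                   # accuracy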
## [1] 0.2226174
## [1] 0.7773826
The false discovery rate measures the expected proportion of false discoveries (cases incorrectly classified as positive).
FDR = \(P(Y=0 | \hat Y = 1)\)
Two related concepts are the positive prediction and negative prediction values.
Positive Prediction (true positive):
Positive Prediction is a measure of the accuracy given the predictions.
Positive Prediction = \(P(Y = 1 | \hat Y = 1)\)
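From the confusion matrix (a sketch):

cm.5["1", "1"] / sum(cm.5["1", ])  # 9 / 20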
## [1] 0.45
Negative Prediction:
Negative Prediction = \(P(Y = 0 | \hat Y = 0)\)
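Similarly (a sketch):

cm.5["0", "0"] / sum(cm.5["0", ])  # 1084 / 1386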
## [1] 0.7821068
We can use confusionMatrix() from the caret package to get the confusion table, accuracy (1 - MCE), sensitivity, specificity, etc.
confusionMatrix(data = as.factor(fit1.pred.5),       # predicted values
                reference = fram_data$HD,            # true results as reference
                positive = levels(fram_data$HD)[2])  # the positive class

# Alternatively, pass the confusion table directly:
confusionMatrix(data = cm.5,                         # the confusion table
                positive = levels(fram_data$HD)[2])  # the positive class

## Confusion Matrix and Statistics
##
##
## fit1.pred.5 0 1
## 0 1084 302
## 1 11 9
##
## Accuracy : 0.7774
## 95% CI : (0.7547, 0.7989)
## No Information Rate : 0.7788
## P-Value [Acc > NIR] : 0.5661
##
## Kappa : 0.0284
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.028939
## Specificity : 0.989954
## Pos Pred Value : 0.450000
## Neg Pred Value : 0.782107
## Prevalence : 0.221195
## Detection Rate : 0.006401
## Detection Prevalence : 0.014225
## Balanced Accuracy : 0.509447
##
## 'Positive' Class : 1
##
Note: we can compare Rule 1 with Rule 2:
\[\widehat{HD}= 1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) > 1/3\]
fit1.pred.33 <- ifelse(fram_data$SBP > 187, "1", "0")
fit1.pred.33 <- as.factor(fit1.pred.33)
confusionMatrix(data = fit1.pred.33,
reference = fram_data$HD,
                positive = levels(fram_data$HD)[2])

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1029 262
## 1 66 49
##
## Accuracy : 0.7667
## 95% CI : (0.7437, 0.7886)
## No Information Rate : 0.7788
## P-Value [Acc > NIR] : 0.8693
##
## Kappa : 0.1256
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.15756
## Specificity : 0.93973
## Pos Pred Value : 0.42609
## Neg Pred Value : 0.79706
## Prevalence : 0.22119
## Detection Rate : 0.03485
## Detection Prevalence : 0.08179
## Balanced Accuracy : 0.54864
##
## 'Positive' Class : 1
##
Notice the trade-off between Sensitivity and the False Positive Rate as we change the threshold.
We have introduced the elements of logistic regression and classification using only SBP. We can immediately extend all of these concepts to include more possible risk factors.
For simplicity, we delete all cases with missing values. In general, this is not recommended!
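A sketch of this step (fram_data.f is the name used in the code below):

fram_data.f <- na.omit(fram_data)  # 1,393 complete cases remain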
Denote \(x = (x_1, x_2, \ldots, x_p)\). The logit link for the full model is
\[logit(P(HD = 1 | x)) = \log \Bigg(\frac{P(HD=1 | x)}{P(HD=0 | x)} \Bigg) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\] where \[P(HD = 1 | x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}\]
Similarly,
MLE's are obtained by maximizing the log-likelihood function
Wald tests/intervals hold for each coefficient
Likelihood Ratio tests hold for a set of predictors
We show next the logistic regression model with all possible features.
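A sketch of the fit (the name fit2 is used throughout the rest of the text):

fit2 <- glm(HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG, fram_data.f, family = binomial)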
We now test whether all the risk factors are not useful. To be specific, we test against the null hypothesis \[H_0: \beta_1 = \ldots = \beta_p = 0.\]
We first obtain the chi-square statistic and then the p-value. Notice that under the null hypothesis, the test statistic is approximately \(\chi^2_7\), i.e., a chi-square distribution with 7 degrees of freedom, because \(p=7\).
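A sketch of the computation producing the p-value below:

chi.sq <- fit2$null.deviance - fit2$deviance  # 1469.3 - 1343.1
pchisq(chi.sq, df = 7, lower.tail = FALSE)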
## [1] 4.300522e-24
We reject the null hypothesis with any reasonable small \(\alpha\).
Alternatively we may use anova to get the \(\chi^2\) test by fitting the reduced model vs. the full model:
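A sketch of the call (the null-model name fit0 is hypothetical):

fit0 <- glm(HD ~ 1, fram_data.f, family = binomial)  # intercept-only model
anova(fit0, fit2, test = "Chisq")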
Analysis of Deviance Table
Model 1: HD ~ 1
Model 2: HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1392 1469.3
2 1385 1343.1 7 126.19 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude from the Wald tests that DBP and FRW are each, by themselves, not significant. Can we also conclude \(\beta_{DBP} = \beta_{FRW} = 0\) jointly? We use the likelihood ratio test for this null hypothesis.
fit3 <- update(fit2, .~. -DBP -FRW)
# equivalently: fit3 <- glm(HD ~ AGE + SEX + SBP + CHOL + CIG, fram_data.f, family = binomial(logit))
summary(fit3)

##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7541 -0.7288 -0.5538 -0.3438 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
## AGE 0.061358 0.014753 4.159 3.20e-05 ***
## SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
## SBP 0.017085 0.002372 7.203 5.88e-13 ***
## CHOL 0.004398 0.001499 2.935 0.00334 **
## CIG 0.011358 0.006058 1.875 0.06083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1345.5 on 1387 degrees of freedom
## AIC: 1357.5
##
## Number of Fisher Scoring iterations: 4
Similarly, we obtain the chi-square statistic and then the p-value. Under this null hypothesis, the test statistic is approximately \(\chi^2_2\).
chi.sq.2 <- fit3$deviance-fit2$deviance
hist(rchisq(10000, 2), freq=FALSE, breaks=100) # a chi-square stat with df=2
abline(v = chi.sq.2, col="blue")
abline(v = qchisq(0.95, 2), col="red")
legend("right", c("test stat", "95-th quantile"),
       col=c("blue", "red"), lty=1)

## [1] 0.3022805
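The p-value printed above can be reproduced with (a sketch):

pchisq(chi.sq.2, df = 2, lower.tail = FALSE)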
We fail to reject the \(H_0\) at \(0.05\) level.
We can also use anova().
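A sketch of the call producing the table below:

anova(fit3, fit2, test = "Chisq")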
## Analysis of Deviance Table
##
## Model 1: HD ~ AGE + SEX + SBP + CHOL + CIG
## Model 2: HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1387 1345.5
## 2 1385 1343.1 2 2.3928 0.3023
Once again we don't have evidence to reject the null at the .05 level.
We now introduce some quick approaches to model selection, without going into details.
Backward selection is the easiest approach (it requires no extra packages) when p is not too large. As in lm(), we eliminate the variable with the highest p-value.
Backward selection: kick out DBP (the variable with the largest p-value).
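A sketch of this step (the intermediate name fit2.1 is an assumption):

fit2.1 <- update(fit2, . ~ . - DBP)
summary(fit2.1)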
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + FRW + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7066 -0.7279 -0.5517 -0.3343 2.4501
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.227856 0.996153 -9.263 < 2e-16 ***
## AGE 0.061529 0.014775 4.164 3.12e-05 ***
## SEXMALE 0.911274 0.157117 5.800 6.63e-09 ***
## SBP 0.015966 0.002487 6.420 1.37e-10 ***
## CHOL 0.004493 0.001503 2.990 0.00279 **
## FRW 0.006039 0.004004 1.508 0.13151
## CIG 0.012279 0.006088 2.017 0.04369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1343.3 on 1386 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
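Next we remove FRW, now the variable with the largest p-value (a sketch; the name fit2.2 is an assumption):

fit2.2 <- update(fit2.1, . ~ . - FRW)
summary(fit2.2)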
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7541 -0.7288 -0.5538 -0.3438 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
## AGE 0.061358 0.014753 4.159 3.20e-05 ***
## SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
## SBP 0.017085 0.002372 7.203 5.88e-13 ***
## CHOL 0.004398 0.001499 2.935 0.00334 **
## CIG 0.011358 0.006058 1.875 0.06083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1345.5 on 1387 degrees of freedom
## AIC: 1357.5
##
## Number of Fisher Scoring iterations: 4
CIG didn't really add much to decreasing the residual deviance, so we drop it next.
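A sketch of this step (fit2.3 is the name referenced below):

fit2.3 <- update(fit2.2, . ~ . - CIG)
summary(fit2.3)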
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6069 -0.7348 -0.5523 -0.3476 2.4344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.408724 0.908600 -9.255 < 2e-16 ***
## AGE 0.056639 0.014500 3.906 9.38e-05 ***
## SEXMALE 0.989870 0.145053 6.824 8.84e-12 ***
## SBP 0.016956 0.002362 7.179 7.02e-13 ***
## CHOL 0.004480 0.001495 2.996 0.00274 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1349.0 on 1388 degrees of freedom
## AIC: 1359
##
## Number of Fisher Scoring iterations: 4
Prediction for Alice using fit2.3
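A sketch of the prediction call (output not shown here):

predict(fit2.3, fram_data.new, type = "response")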
Notice the difference between the predictions from fit2.3 and fit1!
bestglm()

Recall that \[AIC = - 2\log(\mathcal{L}) + 2d\] where \(d\) is the total number of parameters. We look for a model with a small AIC.
All predictors
In fit2, \[\text{Residual Deviance} = -2\log(\mathcal{L}).\]
Therefore, \[\begin{split} AIC &= -2\log(\mathcal{L}) + 2d \\ &= \text{Residual Deviance} + 2*8 \\ &= 1343.1 + 16 = 1359.1 \end{split}\]
We use bestglm() to find the model(s) with the smallest AIC.
bestglm() vs. regsubsets()

# Get the design matrix without the column of 1's and without HD
Xy <- model.matrix(HD ~.+0, fram_data.f)
# Attach y as the last column.
Xy <- data.frame(Xy, fram_data.f$HD)
fit.all <- bestglm(Xy, family = binomial, method = "exhaustive", IC = "AIC", nvmax = 10)  # method can be "exhaustive", "forward", or "backward"

## Morgan-Tatar search since family is non-gaussian.
## [1] "BestModel" "BestModels" "Bestq" "qTable" "Subsets"
## [6] "Title" "ModelReport"
List the top 5 models. In principle, any one of the following models could be used.
## AGE SEXFEMALE SEXMALE SBP DBP CHOL FRW CIG Criterion
## 1 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE 1355.285
## 2 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE 1355.285
## 3 TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE 1355.536
## 4 TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE 1355.536
## 5 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE 1357.011
Pick one among BestModels.
##
## Call: glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Coefficients:
## (Intercept) AGE SEXMALE SBP CHOL
## -9.227856 0.061529 0.911274 0.015966 0.004493
## FRW CIG
## 0.006039 0.012279
##
## Degrees of Freedom: 1392 Total (i.e. Null); 1386 Residual
## Null Deviance: 1469
## Residual Deviance: 1343 AIC: 1357
Are all variables in the model with the smallest AIC significant?
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + FRW + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7066 -0.7279 -0.5517 -0.3343 2.4501
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.227856 0.996153 -9.263 < 2e-16 ***
## AGE 0.061529 0.014775 4.164 3.12e-05 ***
## SEXMALE 0.911274 0.157117 5.800 6.63e-09 ***
## SBP 0.015966 0.002487 6.420 1.37e-10 ***
## CHOL 0.004493 0.001503 2.990 0.00279 **
## FRW 0.006039 0.004004 1.508 0.13151
## CIG 0.012279 0.006088 2.017 0.04369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1343.3 on 1386 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
As seen above, FRW is not significant so we can eliminate it.
fit.final.0 <- glm(HD~AGE+SEX+SBP+CHOL+FRW+CIG, family=binomial, data=fram_data.f)
summary(fit.final.0)
summary(update(fit.final.0, .~. -FRW))

Note that FRW is not significant in fit.final.0. Once we drop FRW, all the remaining factors seem fine (CIG shows only weak evidence of being useful in this final model). Nevertheless, we decided to retain all the variables in the final model.
Call:
glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
data = fram_data.f)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7541 -0.7288 -0.5538 -0.3438 2.4470
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
AGE 0.061358 0.014753 4.159 3.20e-05 ***
SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
SBP 0.017085 0.002372 7.203 5.88e-13 ***
CHOL 0.004398 0.001499 2.935 0.00334 **
CIG 0.011358 0.006058 1.875 0.06083 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1469.3 on 1392 degrees of freedom
Residual deviance: 1345.5 on 1387 degrees of freedom
AIC: 1357.5
Number of Fisher Scoring iterations: 4
So based on the study above, we may say that collectively AGE, SBP, CHOL, and CIG are all positively related to the chance of HD, while males have a higher chance of HD compared with females, controlling for all other factors in the model.
We can also check whether interactions with SEX are needed; neither SBP:SEX nor AGE:SEX appears significant:

## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## AGE 16.540 1 4.763e-05 ***
## SBP 56.950 1 4.470e-14 ***
## SEX 44.787 1 2.197e-11 ***
## SBP:SEX 0.289 1 0.5910
## AGE:SEX 0.832 1 0.3617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CIG as a categorical variable?

regsubsets()

If you use regsubsets() here, it will treat HD as a continuous variable and fit lm(HD ~ .) using the two response values 0 and 1. It will not return the results coming from glm().
## Subset selection object
## Call: regsubsets.formula(HD ~ ., fram_data, nvmax = 10, method = "exhaustive")
## 7 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## SEXMALE FALSE FALSE
## SBP FALSE FALSE
## DBP FALSE FALSE
## CHOL FALSE FALSE
## FRW FALSE FALSE
## CIG FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## AGE SEXMALE SBP DBP CHOL FRW CIG
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " "
## 4 ( 1 ) "*" "*" "*" " " "*" " " " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 6 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The recipe for choosing a classifier with multiple logistic regression is similar to that using SBP alone.
Before using the final model, let’s use only SBP and AGE to take a look at the linear boundary.
Get the logit function and examine some specific rules
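A sketch of the fit behind the logit below (the object name fit.sbp.age is hypothetical):

fit.sbp.age <- glm(HD ~ SBP + AGE, fram_data.f, family = binomial)
coef(fit.sbp.age)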
\[logit=-6.554+0.0144SBP+.0589AGE\]
Rule 1: threshold the probability at 2/3
\[\begin{split} \widehat{HD}=1 \mbox{ if } SBP &>\frac{-0.0589}{0.0144} Age + \frac{\log(2)+6.554}{0.0144} \\ &=-4.09 Age + 503 \end{split}\]
Rule 2: threshold the probability at 1/2
\[\begin{split} \widehat{HD}=1 \mbox{ if } SBP &>\frac{-0.0589}{0.0144} Age + \frac{\log(1)+6.554}{0.0144} \\ &=-4.09 Age + 455 \end{split}\]
Let’s put two linear boundaries together.
fram_data.f %>%
ggplot(aes(x=AGE, y=SBP)) +
geom_jitter(width = 0.2, aes(color = factor(HD))) +
geom_abline(intercept = 503, slope = -4.09, col = "blue") +
geom_abline(intercept = 455, slope = -4.09, col = "red") +
  ggtitle("Blue: HD = 1 if prob > 2/3; Red: HD = 1 if prob > 1/2")

Now we compare three models, fit1, fit2, and fit.final, using ROC curves and AUC.
fit1 <- glm(HD~SBP, fram_data.f, family=binomial(logit))
fit1.roc <- roc(fram_data.f$HD, fit1$fitted)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
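The other two ROC objects are constructed the same way (a sketch, assuming fit2 and fit.final were fit on fram_data.f):

fit2.roc <- roc(fram_data.f$HD, fit2$fitted)
fit.final.roc <- roc(fram_data.f$HD, fit.final$fitted)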
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(1-fit1.roc$specificities,
fit1.roc$sensitivities, col="red", lwd=3, type="l",
xlab="False Positive",
ylab="Sensitivity")
lines(1-fit2.roc$specificities, fit2.roc$sensitivities, col="blue", lwd=3)
lines(1-fit.final.roc$specificities, fit.final.roc$sensitivities, col="green", lwd=3)
legend("bottomright",
c(paste0("fit1 AUC=", round(fit1.roc$auc,2)),
paste0("fit2 AUC=", round(fit2.roc$auc, 2)),
paste0("fit.final AUC=", round(fit.final.roc$auc, 2))),
col=c("red", "blue", "green"),
       lty=1)

From both the ROC curves and the AUCs, we can say that our final model is indeed better in terms of overall performance. After deciding on the model, we choose the classification rule depending on the goal.
Given the model, what threshold should we use so that the MCE is minimized? The answer is to use \(0.5\) as the threshold, assuming we treat each type of mistake the same. On the other hand, if the two types of mistakes cost differently, we ought to weight them differently.
For example, if a lower false negative rate is more important, we can put more weight on false negatives. A concrete example is disease diagnosis: falsely diagnosing someone as disease-free is worse than falsely diagnosing them as positive. We can achieve this through the Bayes rule.
We first define the loss function.
\(a_{1,0}=L(Y=1, \hat Y=0)\), the loss (cost) of classifying a "1" as a "0" (false negative)
\(a_{0,1}=L(Y=0, \hat Y=1)\), the loss of classifying a "0" as a "1" (false positive)
\(a_{0, 0} = a_{1, 1}=0\) (correct classification)
Then \[\begin{split} E(L(Y, \hat Y=1)) &= P(Y=1) \cdot L(Y=1, \hat Y=1) + P(Y=0) \cdot L(Y=0, \hat Y=1) \\ &= a_{0,1} \cdot P(Y=0) \end{split} \]
Similarly, \[E(L(Y, \hat Y=0))=a_{1,0} \cdot P(Y=1)\]
To minimize the two mean losses, we choose \[\begin{split} \hat Y=1 ~~~~ \text{if} ~~~~ &\quad E(L(Y, \hat Y=1)) < E(L(Y, \hat Y=0)) \\ & \Leftrightarrow a_{0,1} \cdot P(Y=0) < a_{1,0} \cdot P(Y=1) \end{split} \]
We have the following optimal rule:
\[\begin{split} \hat Y=1 ~~~~ \text{if} ~~~~ &\quad \frac{P(Y=1 \vert X)}{P(Y=0\vert X)} > \frac{a_{0,1}}{a_{1,0}} \\ & \Leftrightarrow P(Y=1 \vert X) > \frac{\frac{a_{0,1}}{a_{1,0}}}{1 + \frac{a_{0,1}}{a_{1,0}}} \end{split} \]
The above rule is called Bayes’ rule.
Weighted classification error:
The weighted misclassification error (may not be a number between 0 and 1) is then
\[MCE=\frac{a_{1,0} \sum_{i} \mathbf{1}\{\hat y_i = 0, y_i=1\} + a_{0,1} \sum_{i} \mathbf{1}\{\hat y_i = 1, y_i=0\}}{n}\]
Example: \(5a_{0,1}=a_{1,0}\)
If the cost of misclassifying a "1" as a "0" is 5 times that of misclassifying a "0" as a "1", then the Bayes rule thresholds at \[\hat P(Y=1 \vert x) > \frac{0.2}{1+0.2}=0.17\] or equivalently \[logit > \log\Big(\frac{0.17}{0.83}\Big)=-1.59.\] The linear boundary is
\[SBP \geq -4.09Age + 344.7\]
Let’s draw the linear boundary of the Bayes rule when \(\frac{a_{1,0}}{a_{0,1}}=5\) and compare with thresholding probability by 1/2 and 2/3.
fram_data.f %>%
ggplot(aes(x=AGE, y=SBP)) +
geom_jitter(width = 0.2, aes(color = factor(HD))) +
geom_abline(intercept = 503, slope = -4.09, col = "blue") +
geom_abline(intercept = 455, slope = -4.09, col = "red") +
geom_abline(intercept = 344.7, slope = -4.09, col = "green") +
  ggtitle("Green: Bayes rule a10/a01 = 5; Blue: HD = 1 if prob > 2/3; Red: HD = 1 if prob > 1/2")

Finally, we compute the weighted misclassification error (which need not be a number between 0 and 1):
\[MCE=\frac{a_{1,0} \sum_{i} \mathbf{1}\{\hat y_i = 0, y_i=1\} + a_{0,1} \sum_{i} \mathbf{1}\{\hat y_i = 1, y_i=0\}}{n}\]
Let's compare the weighted misclassification error of the Bayes rule against the naive 1/2 threshold, using fit2. The weighted MCE of the Bayes-rule classifier should be smaller, as expected.
fit2.pred.bayes <- as.factor(ifelse(fit2$fitted > .17, "1", "0"))
MCE.bayes <- (5*sum(fit2.pred.bayes[fram_data.f$HD == "1"] != "1")
+ sum(fit2.pred.bayes[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.bayes

## [1] 0.7020818
What happens with .5 as the threshold?
fit2.pred.5 <- as.factor(ifelse(fit2$fitted > .5, "1", "0"))
MCE.bayes.5 <- (5*sum(fit2.pred.5[fram_data.f$HD == "1"] != "1")
+ sum(fit2.pred.5[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.bayes.5

## [1] 1.064609
Now compute the weighted MCE of the Bayes-rule classifier based on fit.final. Its weighted MCE is smaller than that of fit2.
fit.final.bayes <- as.factor(ifelse(fit.final$fitted > .17, "1", "0"))
MCE.final.bayes <- (5*sum(fit.final.bayes[fram_data.f$HD == "1"] != "1")
+ sum(fit.final.bayes[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.final.bayes

## [1] 0.6116296
Depending on the goal, we need to choose a criterion. Then to evaluate the performance, we use testing data.
We may split the data into two sub-samples.
Choose the model by some criterion. For example, the one with the largest AUC.
We first split the data.
# Split the data:
N <- length(fram_data.f$HD)
set.seed(10)
# Take a random sample of n=1000 from 1 to N=1393
index.train <- sample(N, 1000)
# Set the 1000 randomly chosen subjects as a training data
data.train <- fram_data.f[index.train,]
# The remaining subjects will be reserved for testing purposes
data.test <- fram_data.f[-index.train,]

Compare two models with testing data
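First fit both models on the training data (a sketch, mirroring the loop at the end of this section):

fit1.train <- glm(HD ~ SBP, data = data.train, family = binomial)
fit5.train <- glm(HD ~ SBP + SEX + AGE + CHOL + CIG, data = data.train, family = binomial)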
Get the fitted probabilities using the testing data
fit1.fitted.test <- predict(fit1.train, data.test, type="response") # get the prob's
fit5.fitted.test <- predict(fit5.train, data.test, type="response")
data.frame(fit1.fitted.test, fit5.fitted.test)[1:10, ]

## fit1.fitted.test fit5.fitted.test
## 6 0.1310456 0.08427391
## 14 0.1383692 0.10074764
## 21 0.1421581 0.09981915
## 24 0.1460333 0.17082459
## 25 0.1460333 0.11123836
## 27 0.1499956 0.13101216
## 30 0.1499956 0.12530310
## 37 0.1540460 0.14660777
## 38 0.1540460 0.27803701
## 42 0.1540460 0.14380806
Look at the first 10 rows. Notice that the row names are the subject numbers chosen for the testing data. The estimated probabilities differ between fit1 and fit5.
Compare the performances using the testing data.
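Construct the ROC objects on the testing data (a sketch; the roc() calls print the messages below):

fit1.test.roc <- roc(data.test$HD, fit1.fitted.test)
fit5.test.roc <- roc(data.test$HD, fit5.fitted.test)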
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(1-fit5.test.roc$specificities, fit5.test.roc$sensitivities,
col="red", type="l", lwd=3,
xlab=paste("AUC(fit5.test) =",
round(pROC::auc(fit5.test.roc),2),
" AUC(fit1.test) =",
round(pROC::auc(fit1.test.roc),2)),
ylab="Sensitivities")
lines(1-fit1.test.roc$specificities, fit1.test.roc$sensitivities, col="blue", lwd=3)
legend("bottomright", legend=c("fit5.test w five variables", "fit1.test w one variable"),
lty=c(1,1), lwd=c(2,2), col=c("red", "blue"))
title("Comparison of two models using testing data")

Conclusion: we will use fit5 because of its larger testing AUC.
Note: If you comment out set.seed and repeat running the above training/testing session, what will you observe from the two ROC curves and the two AUC values?
Are you surprised?
for(i in 1:10) {
# set seed
set.seed(10*i)
# split data
index.train <- sample(N, 1000)
data.train <- fram_data.f[index.train,]
data.test <- fram_data.f[-index.train,]
# fit
fit1.train <- glm(HD~SBP, data=data.train, family=binomial)
fit5.train <- glm(HD~SBP+SEX+AGE+CHOL+CIG, data=data.train, family=binomial)
# predict
fit1.fitted.test <- predict(fit1.train, data.test, type="response")
fit5.fitted.test <- predict(fit5.train, data.test, type="response")
# roc
fit1.test.roc <- roc(data.test$HD, fit1.fitted.test)
fit5.test.roc <- roc(data.test$HD, fit5.fitted.test)
# plot
plot(1-fit5.test.roc$specificities, fit5.test.roc$sensitivities,
col="red", type="l", lwd=3,
xlab=paste("AUC(fit5.test) =",
round(pROC::auc(fit5.test.roc),2),
" AUC(fit1.test) =",
round(pROC::auc(fit1.test.roc),2)),
ylab="Sensitivities")
lines(1-fit1.test.roc$specificities, fit1.test.roc$sensitivities, col="blue", lwd=3)
legend("bottomright", legend=c("fit5.test w five variables", "fit1.test w one variable"),
lty=c(1,1), lwd=c(2,2), col=c("red", "blue"))
title("Comparison of two models using testing data")
# pause for one second
Sys.sleep(1)
}

What does a \(\chi^2\) distribution look like?
\(\chi^2\) distributions with different degrees of freedom.
par(mfrow=c(1,3))
hist(rchisq(10000, 1), freq=FALSE, breaks=20) # the second arg = df
hist(rchisq(10000, 10), freq=FALSE, breaks=20) # the second arg = df
hist(rchisq(10000, 20), freq=FALSE, breaks=20) # when df is large, the chi-square distribution approaches a normal distribution

When the degrees of freedom get larger, the \(\chi^2\) distribution is approximately normal (why?)