What are possible risk factors for heart disease? What determines whether an employee is a desirable hire for a firm? How can we tell whether an Amazon review is real or fake? How can we model a categorical variable conveniently and efficiently? Logistic regression models are among the most commonly used methods to model the probability of an event, and they automatically give us linear classification rules. Various criteria for evaluating classifiers are introduced. Analogous to the least squares solution for ordinary regression models, we use maximum likelihood estimation.
Heart disease is the leading cause of death in the United States: one out of four deaths is due to heart disease. It is therefore important to identify risk factors for Coronary Heart Disease. Many studies have indicated that high blood pressure, high cholesterol, age, gender, and race are among the major risk factors.
Starting in the late 1940s, the National Heart, Lung, and Blood Institute (NHLBI) launched its famous Framingham Heart Study. By now, subjects from three generations, together with other participants, have been monitored and followed in the study. Thousands of research papers have been published using these longitudinal data sets. More details can be found on the study's website.
Using a piece of the data gathered at the beginning of the study, we illustrate how to identify risk factors for heart disease and how to classify whether someone has heart disease.
i) Data: 1,406 participants. Conditions gathered at the beginning of the study (early 1950s).
| Variable | Description |
|---|---|
| Heart Disease | Indicator of having heart disease or not |
| AGE | Age |
| SEX | Gender |
| SBP | Systolic blood pressure |
| DBP | Diastolic blood pressure |
| CHOL | Cholesterol level |
| FRW | Age- and gender-adjusted weight |
| CIG | Self-reported number of cigarettes smoked each week |
ii) Goal:
Predict whether one has heart disease (HD) given the information in hand.
In particular,
Predict \(Prob(HD = 1)\) for Alice, who is:
| AGE | SEX | SBP | DBP | CHOL | FRW | CIG |
|---|---|---|---|---|---|---|
| 45 | FEMALE | 100 | 80 | 180 | 110 | 5 |
where \[HD =\left\{\begin{array}{ll} 1 & \mbox{ if heart disease} \\ 0 & \mbox{ if normal} \end{array}\right.\]
Read Framingham.dat as fram_data
fram_data <- read.csv("Framingham.dat", sep=",", header=T, as.is=T)
str(fram_data)
names(fram_data)
summary(fram_data)

Rename and set variables to their correct type
# names(fram_data)[1] <- "HD"
# fram_data$HD <- as.factor(fram_data$HD)
# fram_data$SEX <- as.factor(fram_data$SEX)

# dplyr way
fram_data %<>%
rename(HD = Heart.Disease.) %>%
mutate(HD = as.factor(HD),
         SEX = as.factor(SEX))

Structure of the data
## 'data.frame': 1407 obs. of 8 variables:
## $ HD : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ AGE : int 45 49 47 48 49 47 48 48 46 45 ...
## $ SEX : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 2 2 2 2 2 2 2 ...
## $ SBP : int 90 100 100 108 108 108 108 110 110 110 ...
## $ DBP : int 50 64 70 70 75 68 70 70 80 72 ...
## $ CHOL: int 216 237 215 340 149 165 196 229 204 183 ...
## $ FRW : int 76 97 86 93 95 88 79 85 112 93 ...
## $ CIG : int 5 0 50 0 0 0 20 25 0 20 ...
We have created the last row for Alice, which will be used for prediction. Notice that the response HD is missing here.
## HD AGE SEX SBP DBP CHOL FRW CIG
## 1407 <NA> 45 FEMALE 100 80 180 110 5
We take out the last row and save it to fram_data.new to avoid any potential problems in our analyses.
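A minimal sketch of this step (assuming Alice's record is the last row):

fram_data.new <- fram_data[nrow(fram_data), ]   # save Alice's record for later prediction
fram_data <- fram_data[-nrow(fram_data), ]      # keep the 1,406 complete observations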
We have 311 observations with HD = 1 and 1,095 with HD = 0.
## HD AGE SEX SBP DBP
## 0:1095 Min. :45.00 FEMALE:737 Min. : 90.0 Min. : 50.00
## 1: 311 1st Qu.:48.00 MALE :669 1st Qu.:130.0 1st Qu.: 80.00
## Median :52.00 Median :142.0 Median : 90.00
## Mean :52.45 Mean :148.1 Mean : 90.14
## 3rd Qu.:56.00 3rd Qu.:160.0 3rd Qu.: 98.00
## Max. :62.00 Max. :300.0 Max. :160.00
##
## CHOL FRW CIG
## Min. : 96.0 Min. : 52.0 Min. : 0.000
## 1st Qu.:200.0 1st Qu.: 94.0 1st Qu.: 0.000
## Median :230.5 Median :103.0 Median : 0.000
## Mean :234.7 Mean :105.4 Mean : 8.048
## 3rd Qu.:264.0 3rd Qu.:114.0 3rd Qu.:20.000
## Max. :430.0 Max. :222.0 Max. :60.000
## NA's :11 NA's :2
Notice the missing values in FRW and CIG. We won't remove those observations from the data set.
For simplicity we start with a simple question: How does HD relate to SBP?
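One way to produce the group means below (a sketch, assuming dplyr is loaded):

fram_data %>%
  group_by(HD) %>%
  summarise(mean(SBP))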
## # A tibble: 2 x 2
## HD `mean(SBP)`
## <fct> <dbl>
## 1 0 145.
## 2 1 159.
On average, SBP seems to be higher among those with HD = 1.
We can see the distribution of SBP through back-to-back box plots. Again, SBP seems to be higher when HD = 1.
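A sketch of one way to draw such box plots (the colors and layout here are choices, not from the original code):

boxplot(SBP ~ HD, data = fram_data,
        horizontal = TRUE,               # SBP on the x-axis, HD groups on the y-axis
        col = c("grey", "red"),
        xlab = "SBP", ylab = "HD")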
Next we explore how the proportion of HD = 1 relates to SBP. Note that HD of type factor is coerced into numeric 1 and 2. In fact, heart disease can be coded in many ways: as 1, "Yes", "Bad", etc.
Standard plots:
plot(jitter(as.numeric(fram_data$SBP), factor=.5),
jitter(as.numeric(fram_data$HD), factor=.5),
col=fram_data$HD,
xlab = "SBP",
ylab = "HD")
legend("right",
legend=c("0", "1"),
lty=c(1,1),
lwd=c(2,2),
       col=unique(fram_data$HD))

Problem: many observations overlap, so they are not all visible.
Solution: use jitter() to spread out observations with similar SBP values.
plot(jitter(as.numeric(fram_data$HD), factor=.5) ~ fram_data$SBP,
pch=4,
col=fram_data$HD,
ylab="HD",
xlab="SBP")
legend("right", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

Alternatively, we could use stripchart() as follows:
stripchart(fram_data$SBP~fram_data$HD, method="jitter",
col=c("black", "red"), pch=4,
           ylab="HD", xlab="SBP")
legend("topright", legend=c("0", "1"),
       lty=c(1,1), lwd=c(2,2), col=unique(fram_data$HD))

geom_jitter() from ggplot2 provides similar functionality.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
  geom_jitter(height = .05, aes(color = factor(HD)))

None of the plots above shows the proportion of "1"s vs. "0"s as a function of SBP.
Can we use a linear model?
For a binary response with a \(0/1\) coding such as HD, it can be shown that \(X \hat\beta\) is in fact an estimate of \(P(HD=1 | X)\), and we can predict heart disease if \(\hat P(HD=1 | X) > 0.5\) and normal otherwise. However, some of our estimates will fall outside \([0,1]\), which makes no sense to interpret as probabilities.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "lm", se = FALSE) +
  ylab("Prob(HD=1)")

We clearly would like to model the probability of having HD given SBP. One of the most popular models for this is the logistic regression model.
In the logistic regression model, we model the probability of being sick as follows:
\[P(HD=1\vert SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}}\]
where \(\beta_0\) and \(\beta_1\) are unknown parameters. We see the following properties immediately:
\[0 < P(HD=1\vert SBP) = \frac{e^{\beta_0 + \beta_1 SBP}}{1+e^{\beta_0+\beta_1 SBP}} < 1\]
We then get \(P(HD=0\vert SBP)\):
\[P(HD=0\vert SBP) = 1-P(HD=1\vert SBP) = \frac{1}{1+e^{\beta_0+\beta_1 SBP}}\]
The log odds of having HD: \[logit(P(HD=1|SBP)) =\log\left(\frac{P(HD=1\vert SBP)}{P(HD=0\vert SBP)}\right)=\beta_0+\beta_1 \times SBP\]
The interpretation of \(\beta_1\) is the change in log odds for a unit change in SBP.
\(P(HD=1|SBP)\) is a monotone function of SBP, depending on the sign of \(\beta_1\). \(P(HD=1|SBP)\) is an increasing function of SBP if \(\beta_1 >0\).
For our setting, the response is a categorical variable, so the notion of least squares error does not apply. We need some sensible function to minimize or maximize in order to estimate the unknown parameters \(\beta\). We introduce the Likelihood Function of \(\beta_0\) and \(\beta_1\) given the data, namely the probability of seeing the actual outcomes in the data.
Take a look at a piece of the data, randomly chosen from our dataset. We can then see part of the likelihood function.
HD SBP
975 0 135
710 0 124
774 0 136
416 1 130
392 1 120
273 0 134
1373 0 154
1228 0 150
1321 1 210
690 0 118
We use this to illustrate the notion of likelihood function.
\[\begin{split} \mathcal{Lik}(\beta_0, \beta_1 \vert {\text Data}) &= {Prob\text {(the outcome of the data)}}\\ &=Prob((HD=0|SBP=135), (HD=0|SBP=124), \ldots, (HD=1|SBP=130), \ldots ) \\ &=Prob(HD=0|SBP=135) \times Prob(HD=0|SBP=124) \times \ldots \times Prob(HD=1|SBP=130) \times \ldots \\ &= \frac{1}{1+e^{\beta_0 + 135 \beta_1}}\cdot\frac{1}{1+e^{\beta_0 + 124\beta_1}}\cdots\frac{e^{\beta_0 + 130 \beta_1}}{1 + e^{\beta_0 + 130 \beta_1}} \cdots \end{split}\]
MLE: The estimates \((\hat \beta_0, \hat \beta_1)\) that maximize the likelihood function are called the Maximum Likelihood Estimators:
\[(\hat \beta_0, \hat \beta_1) = \arg\max_{\beta_0, \beta_1} \mathcal{Lik}(\beta_0, \beta_1 \vert Data)\]

Remark:
\(\hat \beta_0\) and \(\hat \beta_1\) are obtained through \(\max\log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))\) or equivalently through \(\min -\{\log(\mathcal{Lik}(\beta_0, \beta_1 \vert D))\}\)
The MLE can in general only be obtained through numerical computation.
glm() will be used to fit the logistic regression. It is very similar to lm(), but some of the output is different. The default link for the binomial family in glm() is the logit link.
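The call producing the output below (the object name fit1 is used throughout the text):

fit1 <- glm(HD ~ SBP, fram_data, family = binomial(logit))
summary(fit1)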
##
## Call:
## glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6687 -0.7109 -0.6253 -0.5244 2.1072
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
## SBP 0.01590 0.00221 7.196 6.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1485.9 on 1405 degrees of freedom
## Residual deviance: 1432.8 on 1404 degrees of freedom
## AIC: 1436.8
##
## Number of Fisher Scoring iterations: 4
To see the probability function estimated by glm():
logit = -3.66 + 0.0159 SBP
\[\begin{split} \hat P(HD = 1 \vert SBP) &= \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}} \\ \hat P(HD = 0 \vert SBP) &= \frac{1}{1+e^{-3.66+0.0159 \times SBP}} \end{split}\]
Now, to estimate \(P(HD=1)\) for Alice, we can plug her SBP=100 into the logit function:
Based on fit1 we plug in SBP value into the prob equation. \[\hat P(HD = 1 \vert SBP=100) = \frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}} = \frac{e^{-3.66+0.0159 \times 100}}{1+e^{-3.66+0.0159 \times 100}} \approx 0.112\]
We can also use the predict() function.
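A sketch of the predict() call generating the output below, using Alice's record fram_data.new:

predict(fit1, fram_data.new, type = "response")  # type = "response" returns the probability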
## 1407
## 0.1117059
Let us see what we can say about the risk of HD when SBP increases.
logit = -3.66 + 0.0159 SBP. That means the log odds increase by 0.0159 when SBP increases by 1 unit.
Notice that \(Prob(HD = 1)\) is an increasing function of SBP since \(\hat \beta_1 = .0159 > 0\): when SBP increases, the chance of having HD is higher.
Unfortunately we do not have a nice linear interpretation of \(\beta_1\) over \(Prob(HD = 1)\) anymore.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
# geom_smooth(method = "lm",
# color = "red",
# se = FALSE) +
  ylab("Prob(HD=1)")

Alternatively, we can plot the probability directly through \(\frac{e^{-3.66+0.0159 \times SBP}}{1+e^{-3.66+0.0159 \times SBP}}\):
x <- seq(100, 300, by=1)
y <- exp(-3.66+0.0159*x)/(1+exp(-3.66+0.0159*x))
plot(x, y, pch=16, type = "l",
xlab = "SBP",
     ylab = "Prob of P(Y=1|SBP)" )

How can we tell whether the true \(\beta_1\) differs from 0? We need to provide either confidence intervals or hypothesis tests for the unknown parameters.
Facts about MLE:
##
## Call:
## glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6687 -0.7109 -0.6253 -0.5244 2.1072
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
## SBP 0.01590 0.00221 7.196 6.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1485.9 on 1405 degrees of freedom
## Residual deviance: 1432.8 on 1404 degrees of freedom
## AIC: 1436.8
##
## Number of Fisher Scoring iterations: 4
Usual z-intervals for the coefficients (based on the output from the summary):
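These Wald intervals can be obtained with (a sketch):

confint.default(fit1)  # estimate +/- 1.96 * standard error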
2.5 % 97.5 %
(Intercept) -4.34161096 -2.98522712
SBP 0.01156901 0.02023071
Similar to F-tests in OLS (Ordinary Least Squares), we have the likelihood ratio test to check whether a collective set of variables is needed.
Here:
\[H_0: \beta_{SBP}=0 \mbox{ v.s. } H_1: \beta_{SBP} \not= 0\]
Facts about Likelihood Ratio Tests
Under \(H_0\), the likelihood ratio (modified with log) \[\begin{split} \text {Testing stat} = \chi^2 & = -2\times \log \frac{\max _{H_0} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}{\max_{H_1} \mathcal{Lik}(\beta_0, \beta_1 \vert D)}\\ &=-2\log(\mathcal{Lik}_{H_0}) - (-2\log(\mathcal{Lik}_{H_1}))\\ &\sim \chi^2_{df=1} \end{split}\]
Testing stat = Null Deviance - Residual Deviance. Here Deviance = \(-2\log(\mathcal{L})\)
The p-value is computed from the \(\chi^2\) distribution: \(p_{value}=P(\chi^2_{df} > \chi^2)\).
glm() outputs the two terms
See more about \(\chi^2\) distribution in the appendix
Call:
glm(formula = HD ~ SBP, family = binomial(logit), data = fram_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6687 -0.7109 -0.6253 -0.5244 2.1072
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.66342 0.34602 -10.587 < 2e-16 ***
SBP 0.01590 0.00221 7.196 6.22e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1485.9 on 1405 degrees of freedom
Residual deviance: 1432.8 on 1404 degrees of freedom
AIC: 1436.8
Number of Fisher Scoring iterations: 4
Notice:
Null deviance = 1485.9
Residual deviance: 1432.8
\(\chi^2 = 1485.9-1432.8= 53.1\)
We can then get the p-value of the likelihood ratio test.
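A one-line computation of this p-value (a sketch):

pchisq(1485.9 - 1432.8, df = 1, lower.tail = FALSE)  # upper-tail chi-square probability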
We can also use anova() or car::Anova() to get the \(\chi^2\) tests. Here the null hypothesis \(H_0:~ \beta_{SBP}=0\).
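A sketch of the anova() call producing the table below:

anova(fit1, test = "Chisq")  # sequential chi-square tests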
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: HD
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1405 1485.9
## SBP 1 53.08 1404 1432.8 3.203e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is analogous to the F-test in the lm() setup.
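A sketch of the corresponding call (assuming the car package is loaded):

Anova(fit1)  # Type II likelihood ratio tests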
## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## SBP 53.08 1 3.203e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By inverting the likelihood ratio tests we can also get confidence intervals for the coefficients. These are similar to, but not the same as, the Wald (z) intervals.
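These profile-likelihood intervals are typically obtained with confint() (a sketch):

confint(fit1)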
## 2.5 % 97.5 %
## (Intercept) -4.34161096 -2.98522712
## SBP 0.01156901 0.02023071
Given the prediction for the probability of Alice having heart disease, i.e. \(\hat P(HD=1 | SBP=100)=.11\), how do we decide whether Alice will have heart disease or not? In general, how do we classify \(\hat Y = 1\) given \(\hat P(Y=1 | X)\)?
A sensible way to predict \(Y\) given \(x\) is to set a threshold on \(\hat P(Y=1|x)\). Once we have an estimated equation for \(Prob(HD=1|SBP)\), we can immediately obtain classification rules by setting a threshold.
Rule 1: Thresholding probability by 1/2
By definition, the larger \(\hat P(HD=1 | SBP)\) is, the more likely Alice will have heart disease. We start with a classifier that classifies \(\widehat{HD} = 1\) if \(\hat P(HD=1 \vert SBP) > 1/2\). To be more specific,
\[\widehat{HD}=1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) = \frac{e^{-3.66+.0159\cdot SBP}}{1+e^{-3.66+.0159\cdot SBP}} > \frac{1}{2}.\]
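A sketch of this classification for Alice (the object name alice.prob is hypothetical):

alice.prob <- predict(fit1, fram_data.new, type = "response")
ifelse(alice.prob > 1/2, "1", "0")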
## 1407
## "0"
Thus we classify Alice as not having heart disease.
Linear boundary:
The classification rule above is equivalent to
\[ \begin{split} \widehat{HD}=1 ~~~~ \text{if} ~~~~ -3.66+.0159 \cdot SBP &= \log (\text{odds ratio}) \\ & = \log \Bigg(\frac{P(HD=1 \vert SBP)}{P(HD=0 \vert SBP)} \Bigg) \\ &> \log \Bigg(\frac{1/2}{1/2}\Bigg) = \log 1 = 0 \end{split} \]
Simple algebra yields \[\hat Y=1 ~~~~ \text{if} ~~~~ SBP > \frac{3.66}{.0159}=230.18.\] This is called a linear classification boundary, and it has a simple interpretation. Since Alice has \(SBP = 100 < 230.18\), we classify Alice as not having heart disease.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
geom_vline(xintercept = 230.18, col="red") +
ggtitle("Classifier: HD = 1 if prob > 1/2") +
  ylab("Prob(HD=1)")

Based on the above linear boundary, we classify everyone as \(\hat Y=1\) if their SBP is on the right side of the vertical line!
Rule 2: Thresholding probability by 1/3
Let’s redo the exercise by thresholding \(\hat P(HD=1 | SBP) > 1/3\).
\[\widehat{HD}=1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) = \frac{e^{-3.66+.0159\cdot SBP}}{1+e^{-3.66+.0159\cdot SBP}} > \frac{1}{3}.\]
## 1407
## "0"
We will still classify Alice as not having heart disease.
Alternatively, the linear classification rule is
\[ \begin{split} \widehat{HD}=1 ~~~~ \text{if} ~~~~ -3.66+.0159 \cdot SBP &= \log (\text{odds ratio}) \\ & = \log \Bigg(\frac{P(HD=1 \vert SBP)}{P(HD=0 \vert SBP)} \Bigg) \\ &> \log \Bigg(\frac{1/3}{2/3}\Bigg) = \log (1/2) = -0.693 \end{split} \]
Simple algebra yields \[\hat Y=1 ~~~~ \text{if} ~~~~ SBP > \frac{3.66 + \log(1/2)}{.0159} = 187\]
We now compare the two classifiers.
fram_data %>% mutate(HD = as.numeric(HD)-1) %>%
ggplot(aes(x=SBP, y=HD)) +
geom_jitter(height = .05, aes(color = factor(HD))) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE) +
geom_line(aes(x = 230.18), col="red") +
geom_line(aes(x = 187), col="green") +
ggtitle("Green: HD = 1 if prob > 1/3; Red: HD = 1 if prob > 1/2") +
  ylab("Prob(HD=1)")

Given two classifiers, how can we tell which one is better? There are several criteria for evaluating the performance of a classifier.
Given the actual status and the status predicted by a classifier, we can summarize how well the rule works in a two-way table called a confusion matrix.
Let’s use our first classifier, \[\widehat{HD} = 1 \mbox{ if } \hat P(HD=1 \vert SBP) > 1/2.\]
Taking a few participants from the data, we compare the truth and the predictions:
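The predictions fit1.pred.5 used below were presumably created by thresholding the fitted probabilities at 1/2; a sketch:

fit1.pred.5 <- ifelse(fit1$fitted > 1/2, "1", "0")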
set.seed(10)
output1 <- data.frame(fram_data$HD, fit1.pred.5, fit1$fitted)[sample(1406, 10),]
names(output1) <- c( "HD", "Predicted HD", "Prob")
output1

##      HD Predicted HD      Prob
## 491 0 0 0.1919423
## 1354 0 0 0.1684801
## 368 1 0 0.1684801
## 439 0 0 0.1473629
## 344 0 0 0.2611611
## 1295 0 0 0.4351549
## 143 0 0 0.2461149
## 938 0 0 0.1576343
## 1405 1 0 0.2232822
## 930 0 0 0.1514034
We see that 2 of these 10 sampled participants (rows 368 and 1405) are mislabeled.
Confusion matrix: a 2-by-2 table that summarizes the counts of correct and incorrect labels.
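A sketch of the call producing the table below (the object name cm.5 is used later in the text):

cm.5 <- table(fit1.pred.5, fram_data$HD)  # rows: predictions, columns: truth
cm.5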
##
## fit1.pred.5 0 1
## 0 1084 302
## 1 11 9
Note that the rows are \(\hat y\) and the columns are \(y\). These four numbers reflect different criteria.
| \(Y=0\) | \(Y=1\) | |
|---|---|---|
| \(\hat{Y}=0\) | Specificity (true negative) | False negative |
| \(\hat{Y}=1\) | False positive | Sensitivity (true positive) |
Sensitivity:
The sensitivity is defined as \[P(\hat Y = 1 \vert Y = 1)\]
This is also called True Positive Rate: the proportion of correct positive classification.
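Computed from the confusion matrix (a sketch):

sensitivity <- cm.5["1", "1"] / sum(cm.5[, "1"])  # 9 / 311
sensitivity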
## [1] 0.02893891
Specificity:
The specificity is defined as \[P(\hat Y = 0| Y = 0)\]
Specificity measures the proportion of correct negative classification.
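Again from the confusion matrix (a sketch):

specificity <- cm.5["0", "0"] / sum(cm.5[, "0"])  # 1084 / 1095
specificity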
## [1] 0.9899543
False Positive:
A related measure is the false positive rate.
\[1 - \text{Specificity} = P(\hat Y=1 \vert Y=0)\]
False Positive measures the proportion of incorrect positive classification (given the actual status being negative).
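From the confusion matrix (a sketch):

false.positive <- cm.5["1", "0"] / sum(cm.5[, "0"])  # 11 / 1095
false.positive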
## [1] 0.01004566
Given a logistic model, we can obtain different classifiers by changing the classification rule, i.e., by changing the threshold. Each classifier has its own sensitivity and specificity. Plotting the (specificity, sensitivity) pairs for all thresholds traces out a curve, termed the ROC curve.
The ROC curve is helpful when choosing a classifier. We want both high specificity and high sensitivity at the same time, which means classifying both \(Y=0\) and \(Y=1\) correctly. In general, however, we will NOT have a perfect classifier and need to strike a balance between the two.
We use the roc() function from the package pROC to obtain detailed information for each classifier. Notice the ROC curve here plots Sensitivity vs. Specificity. Some prefer using the false positive rate as the x-axis so as to have an ascending x-axis.
**Plotting the ROC curve:**
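A sketch of the calls behind the plot (assuming pROC is loaded; the messages below are printed by roc()):

fit1.roc <- roc(fram_data$HD, fit1$fitted)
plot(fit1.roc)  # by default plots Sensitivity vs. Specificity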
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Note that the higher the specificity, the lower the sensitivity. A perfect classifier will have both \(\text{Specificity}=1\) and \(\text{Sensitivity}=1\).
Compare the following ROC curve using false positive as x-axis.
plot(1-fit1.roc$specificities, fit1.roc$sensitivities, col="red", pch=16,
xlab="False Positive",
     ylab="Sensitivity")

We can extract the specificity and sensitivity at different thresholds from fit1.roc.
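For example, list its components (a sketch):

names(fit1.roc)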
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"
We can also plot a curve that shows the probability thresholds used and the corresponding False Positive rate.
plot(fit1.roc$thresholds, 1-fit1.roc$specificities, col="green", pch=16,
xlab="Threshold on prob",
ylab="False Positive",
     main = "Thresholds vs. False Positive")

AUC measures the area under the ROC curve. It is used to measure the performance of the logistic model as a whole: the larger the better. Why?
Given a specificity, the model with the higher sensitivity is more desirable. If a model has higher sensitivity at every specificity, it is a better model. An extreme case is a model whose sensitivity is constantly 1, which gives \(AUC=1\).
fit1.roc$auc
pROC::auc(fit1.roc)
### if you get "Error in rank(prob) : argument "prob" is missing, with no default"
### then it is possible that you are using the auc() in glmnet
### use pROC::auc(fit1.roc) to specify we want to use auc() in pROC

## Area under the curve: 0.637
## Area under the curve: 0.637
Misclassification error (MCE): the mean number of misclassifications,
\[MCE= \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat y_i \neq y_i\}\]
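A sketch of the computations behind the two numbers below:

mce <- mean(fit1.pred.5 != fram_data$HD)  # misclassification error
mce
1 - mce                                   # accuracy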
## [1] 0.2226174
## [1] 0.7773826
The false discovery rate measures the expected proportion of false discoveries (cases incorrectly classified as positive).
FDR = \(P(Y=0 | \hat Y = 1)\)
Two related concepts are the positive prediction and negative prediction values.
Positive Prediction (true positive):
Positive Prediction is a measure of the accuracy given the predictions.
Positive Prediction = \(P(Y = 1 | \hat Y = 1)\)
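From the confusion matrix (a sketch):

cm.5["1", "1"] / sum(cm.5["1", ])  # 9 / 20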
## [1] 0.45
Negative Prediction:
Negative Prediction = \(P(Y = 0 | \hat Y = 0)\)
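Similarly (a sketch):

cm.5["0", "0"] / sum(cm.5["0", ])  # 1084 / 1386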
## [1] 0.7821068
We can use confusionMatrix() from the caret package to get the confusion table, accuracy (1 - MCE), sensitivity, specificity, etc.
confusionMatrix(data = as.factor(fit1.pred.5),       # predicted values
                reference = fram_data$HD,            # true results as reference
                positive = levels(fram_data$HD)[2])  # the positive class

# Alternatively, pass the confusion table directly:
confusionMatrix(data = cm.5,                         # the confusion table
                positive = levels(fram_data$HD)[2])  # the positive class

## Confusion Matrix and Statistics
##
##
## fit1.pred.5 0 1
## 0 1084 302
## 1 11 9
##
## Accuracy : 0.7774
## 95% CI : (0.7547, 0.7989)
## No Information Rate : 0.7788
## P-Value [Acc > NIR] : 0.5661
##
## Kappa : 0.0284
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.028939
## Specificity : 0.989954
## Pos Pred Value : 0.450000
## Neg Pred Value : 0.782107
## Prevalence : 0.221195
## Detection Rate : 0.006401
## Detection Prevalence : 0.014225
## Balanced Accuracy : 0.509447
##
## 'Positive' Class : 1
##
Note: we can compare Rule 1 with Rule 2:
\[\widehat{HD}= 1 ~~~~ \text{if} ~~~~ \hat P(HD=1 \vert SBP) > 1/3\]
fit1.pred.33 <- ifelse(fram_data$SBP > 187, "1", "0")
fit1.pred.33 <- as.factor(fit1.pred.33)
confusionMatrix(data = fit1.pred.33,
reference = fram_data$HD,
                positive = levels(fram_data$HD)[2])

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1029 262
## 1 66 49
##
## Accuracy : 0.7667
## 95% CI : (0.7437, 0.7886)
## No Information Rate : 0.7788
## P-Value [Acc > NIR] : 0.8693
##
## Kappa : 0.1256
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.15756
## Specificity : 0.93973
## Pos Pred Value : 0.42609
## Neg Pred Value : 0.79706
## Prevalence : 0.22119
## Detection Rate : 0.03485
## Detection Prevalence : 0.08179
## Balanced Accuracy : 0.54864
##
## 'Positive' Class : 1
##
Notice the trade-off between Sensitivity and the False Positive Rate as we change the threshold.
We have introduced the elements of logistic regression and classification using only SBP. We can immediately extend all of these concepts to include more possible risk factors.
For simplicity, we delete all cases with missing values. In general, this is not recommended!
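A sketch of this step (fram_data.f is the name used in the code below):

fram_data.f <- na.omit(fram_data)  # 1,393 complete cases remain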
Denote \(x = (x_1, x_2, \ldots, x_p)\). The logit link for the full model is
\[logit(P(HD = 1 | x)) = \log \Bigg(\frac{P(HD=1 | x)}{P(HD=0 | x)} \Bigg) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\] where \[P(HD = 1 | x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}\]
Similarly,
MLE's are obtained by maximizing the log-likelihood function
Wald tests/intervals hold for each coefficient
Likelihood Ratio tests hold for a set of predictors
We show next the logistic regression model with all possible features.
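A sketch of the fit (the name fit2 is used throughout the rest of the text):

fit2 <- glm(HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG, fram_data.f, family = binomial)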
We now test whether all the risk factors are not useful. To be specific, we test against the null hypothesis \[H_0: \beta_1 = \ldots = \beta_p = 0.\]
We first obtain the chi-square statistic and then the p-value. Notice that under the null hypothesis, the test statistic is approximately \(\chi^2_7\), i.e., a chi-square distribution with 7 degrees of freedom, because \(p=7\).
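A sketch of the computation producing the p-value below:

chi.sq <- fit2$null.deviance - fit2$deviance  # 1469.3 - 1343.1
pchisq(chi.sq, df = 7, lower.tail = FALSE)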
## [1] 4.300522e-24
We reject the null hypothesis with any reasonable small \(\alpha\).
Alternatively we may use anova to get the \(\chi^2\) test by fitting the reduced model vs. the full model:
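A sketch of the call (the null-model name fit0 is hypothetical):

fit0 <- glm(HD ~ 1, fram_data.f, family = binomial)  # intercept-only model
anova(fit0, fit2, test = "Chisq")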
Analysis of Deviance Table
Model 1: HD ~ 1
Model 2: HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1392 1469.3
2 1385 1343.1 7 126.19 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude from the Wald tests that DBP and FRW are each, by themselves, not significant. Can we also conclude \(\beta_{DBP} = \beta_{FRW} = 0\) jointly? We use the likelihood ratio test for this null hypothesis.
fit3 <- update(fit2, .~. -DBP -FRW)
# equivalently: fit3 <- glm(HD ~ AGE + SEX + SBP + CHOL + CIG, fram_data.f, family = binomial(logit))
summary(fit3)

##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7541 -0.7288 -0.5538 -0.3438 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
## AGE 0.061358 0.014753 4.159 3.20e-05 ***
## SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
## SBP 0.017085 0.002372 7.203 5.88e-13 ***
## CHOL 0.004398 0.001499 2.935 0.00334 **
## CIG 0.011358 0.006058 1.875 0.06083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1345.5 on 1387 degrees of freedom
## AIC: 1357.5
##
## Number of Fisher Scoring iterations: 4
Similarly, we obtain the chi-square statistic and then the p-value. Under this null hypothesis, the test statistic is approximately \(\chi^2_2\).
chi.sq.2 <- fit3$deviance-fit2$deviance
hist(rchisq(10000, 2), freq=FALSE, breaks=100) # a chi-square stat with df=2
abline(v = chi.sq.2, col="blue")
abline(v = qchisq(0.95, 2), col="red")
legend("right", c("test stat", "95-th quantile"),
       col=c("blue", "red"), lty=1)

## [1] 0.3022805
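The p-value printed above can be reproduced with (a sketch):

pchisq(chi.sq.2, df = 2, lower.tail = FALSE)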
We fail to reject the \(H_0\) at \(0.05\) level.
We can also use anova().
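A sketch of the call producing the table below:

anova(fit3, fit2, test = "Chisq")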
## Analysis of Deviance Table
##
## Model 1: HD ~ AGE + SEX + SBP + CHOL + CIG
## Model 2: HD ~ AGE + SEX + SBP + DBP + CHOL + FRW + CIG
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1387 1345.5
## 2 1385 1343.1 2 2.3928 0.3023
Once again we don't have evidence to reject the null at the .05 level.
We now introduce some quick approaches to model selection, without going into details.
Backward selection is the easiest approach (it requires no extra packages) when p is not too large. As in lm(), we eliminate the variable with the highest p-value.
Backward selection: kick out DBP (the variable with the largest p-value).
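A sketch of this step (the intermediate name fit2.1 is an assumption):

fit2.1 <- update(fit2, . ~ . - DBP)
summary(fit2.1)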
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + FRW + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7066 -0.7279 -0.5517 -0.3343 2.4501
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.227856 0.996153 -9.263 < 2e-16 ***
## AGE 0.061529 0.014775 4.164 3.12e-05 ***
## SEXMALE 0.911274 0.157117 5.800 6.63e-09 ***
## SBP 0.015966 0.002487 6.420 1.37e-10 ***
## CHOL 0.004493 0.001503 2.990 0.00279 **
## FRW 0.006039 0.004004 1.508 0.13151
## CIG 0.012279 0.006088 2.017 0.04369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1343.3 on 1386 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
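Next we remove FRW, now the variable with the largest p-value (a sketch; the name fit2.2 is an assumption):

fit2.2 <- update(fit2.1, . ~ . - FRW)
summary(fit2.2)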
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7541 -0.7288 -0.5538 -0.3438 2.4470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
## AGE 0.061358 0.014753 4.159 3.20e-05 ***
## SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
## SBP 0.017085 0.002372 7.203 5.88e-13 ***
## CHOL 0.004398 0.001499 2.935 0.00334 **
## CIG 0.011358 0.006058 1.875 0.06083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1345.5 on 1387 degrees of freedom
## AIC: 1357.5
##
## Number of Fisher Scoring iterations: 4
CIG didn't really add much to decreasing the residual deviance, so we drop it next.
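A sketch of this step (fit2.3 is the name referenced below):

fit2.3 <- update(fit2.2, . ~ . - CIG)
summary(fit2.3)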
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6069 -0.7348 -0.5523 -0.3476 2.4344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.408724 0.908600 -9.255 < 2e-16 ***
## AGE 0.056639 0.014500 3.906 9.38e-05 ***
## SEXMALE 0.989870 0.145053 6.824 8.84e-12 ***
## SBP 0.016956 0.002362 7.179 7.02e-13 ***
## CHOL 0.004480 0.001495 2.996 0.00274 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1349.0 on 1388 degrees of freedom
## AIC: 1359
##
## Number of Fisher Scoring iterations: 4
Prediction for Alice using fit2.3
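A sketch of the prediction call (output not shown here):

predict(fit2.3, fram_data.new, type = "response")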
Notice the difference between the predictions from fit2.3 and fit1!
bestglm()

Recall that \[AIC = - 2\log(\mathcal{L}) + 2d\] where \(d\) is the total number of parameters. We look for a model with a small AIC.
All predictors
In fit2, \[\text{Residual Deviance} = -2\log(\mathcal{L}).\]
Therefore, \[\begin{split} AIC &= -2\log(\mathcal{L}) + 2d \\ &= \text{Residual Deviance} + 2*8 \\ &= 1343.1 + 16 = 1359.1 \end{split}\]
We use bestglm() to find the model(s) with the smallest AIC.
bestglm() vs. regsubsets()

# Get the design matrix without the column of 1's and without HD
Xy <- model.matrix(HD ~.+0, fram_data.f)
# Attach y as the last column.
Xy <- data.frame(Xy, fram_data.f$HD)
fit.all <- bestglm(Xy, family = binomial, method = "exhaustive", IC = "AIC", nvmax = 10)  # method can be "exhaustive", "forward", or "backward"

## Morgan-Tatar search since family is non-gaussian.
## [1] "BestModel" "BestModels" "Bestq" "qTable" "Subsets"
## [6] "Title" "ModelReport"
List the top 5 models. In principle, any one of the following models could be used.
## AGE SEXFEMALE SEXMALE SBP DBP CHOL FRW CIG Criterion
## 1 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE 1355.285
## 2 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE 1355.285
## 3 TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE 1355.536
## 4 TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE 1355.536
## 5 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE 1357.011
Pick one among BestModels.
##
## Call: glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Coefficients:
## (Intercept) AGE SEXMALE SBP CHOL
## -9.227856 0.061529 0.911274 0.015966 0.004493
## FRW CIG
## 0.006039 0.012279
##
## Degrees of Freedom: 1392 Total (i.e. Null); 1386 Residual
## Null Deviance: 1469
## Residual Deviance: 1343 AIC: 1357
Are all variables in the model with the smallest AIC significant?
##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + FRW + CIG, family = binomial,
## data = fram_data.f)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7066 -0.7279 -0.5517 -0.3343 2.4501
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.227856 0.996153 -9.263 < 2e-16 ***
## AGE 0.061529 0.014775 4.164 3.12e-05 ***
## SEXMALE 0.911274 0.157117 5.800 6.63e-09 ***
## SBP 0.015966 0.002487 6.420 1.37e-10 ***
## CHOL 0.004493 0.001503 2.990 0.00279 **
## FRW 0.006039 0.004004 1.508 0.13151
## CIG 0.012279 0.006088 2.017 0.04369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1469.3 on 1392 degrees of freedom
## Residual deviance: 1343.3 on 1386 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
As seen above, FRW is not significant so we can eliminate it.
fit.final.0 <- glm(HD~AGE+SEX+SBP+CHOL+FRW+CIG, family=binomial, data=fram_data.f)
summary(fit.final.0)
summary(update(fit.final.0, .~. -FRW))

Note that FRW is not significant in fit.final.0. Once we drop FRW, all the remaining factors seem fine (CIG shows only weak evidence of being useful in this final model). Nevertheless, we decided to retain all the variables in the final model.
Call:
glm(formula = HD ~ AGE + SEX + SBP + CHOL + CIG, family = binomial,
data = fram_data.f)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7541 -0.7288 -0.5538 -0.3438 2.4470
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.702282 0.926831 -9.389 < 2e-16 ***
AGE 0.061358 0.014753 4.159 3.20e-05 ***
SEXMALE 0.885748 0.155792 5.685 1.30e-08 ***
SBP 0.017085 0.002372 7.203 5.88e-13 ***
CHOL 0.004398 0.001499 2.935 0.00334 **
CIG 0.011358 0.006058 1.875 0.06083 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1469.3 on 1392 degrees of freedom
Residual deviance: 1345.5 on 1387 degrees of freedom
AIC: 1357.5
Number of Fisher Scoring iterations: 4
So based on the study above, we may say that collectively AGE, SBP, CHOL, and CIG are all positively related to the chance of HD, while males have a higher chance of HD compared with females, controlling for all other factors in the model.
We can also check whether interactions with SEX are needed; neither SBP:SEX nor AGE:SEX appears significant:

## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## AGE 16.540 1 4.763e-05 ***
## SBP 56.950 1 4.470e-14 ***
## SEX 44.787 1 2.197e-11 ***
## SBP:SEX 0.289 1 0.5910
## AGE:SEX 0.832 1 0.3617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CIG as a categorical variable?

regsubsets()

If you use regsubsets() here, it will treat HD as a continuous variable and fit lm(HD ~ .) using the two response values 0 and 1. It will not return the results coming from glm().
## Subset selection object
## Call: regsubsets.formula(HD ~ ., fram_data, nvmax = 10, method = "exhaustive")
## 7 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## SEXMALE FALSE FALSE
## SBP FALSE FALSE
## DBP FALSE FALSE
## CHOL FALSE FALSE
## FRW FALSE FALSE
## CIG FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## AGE SEXMALE SBP DBP CHOL FRW CIG
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " "
## 4 ( 1 ) "*" "*" "*" " " "*" " " " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 6 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The recipe for choosing a classifier with multiple logistic regression is similar to that using SBP alone.
Before using the final model, let’s use only SBP and AGE to take a look at the linear boundary.
Get the logit function and examine some specific rules
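A sketch of the fit behind the logit below (the object name fit.sbp.age is hypothetical):

fit.sbp.age <- glm(HD ~ SBP + AGE, fram_data.f, family = binomial)
coef(fit.sbp.age)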
\[logit=-6.554+0.0144SBP+.0589AGE\]
Rule 1: threshold the probability at 2/3
\[\begin{split} \widehat{HD}=1 \mbox{ if } SBP &>\frac{-0.0589}{0.0144} Age + \frac{\log(2)+6.554}{0.0144} \\ &=-4.09 Age + 503 \end{split}\]
Rule 2: threshold the probability at 1/2
\[\begin{split} \widehat{HD}=1 \mbox{ if } SBP &>\frac{-0.0589}{0.0144} Age + \frac{\log(1)+6.554}{0.0144} \\ &=-4.09 Age + 455 \end{split}\]
Let’s put two linear boundaries together.
fram_data.f %>%
ggplot(aes(x=AGE, y=SBP)) +
geom_jitter(width = 0.2, aes(color = factor(HD))) +
geom_abline(intercept = 503, slope = -4.09, col = "blue") +
geom_abline(intercept = 455, slope = -4.09, col = "red") +
  ggtitle("Blue: HD = 1 if prob > 2/3; Red: HD = 1 if prob > 1/2")

Now we compare three models, fit1, fit2, and fit.final, using ROC curves and AUC.
fit1 <- glm(HD~SBP, fram_data.f, family=binomial(logit))
fit1.roc <- roc(fram_data.f$HD, fit1$fitted)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
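The other two ROC objects are constructed the same way (a sketch, assuming fit2 and fit.final were fit on fram_data.f):

fit2.roc <- roc(fram_data.f$HD, fit2$fitted)
fit.final.roc <- roc(fram_data.f$HD, fit.final$fitted)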
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(1-fit1.roc$specificities,
fit1.roc$sensitivities, col="red", lwd=3, type="l",
xlab="False Positive",
ylab="Sensitivity")
lines(1-fit2.roc$specificities, fit2.roc$sensitivities, col="blue", lwd=3)
lines(1-fit.final.roc$specificities, fit.final.roc$sensitivities, col="green", lwd=3)
legend("bottomright",
c(paste0("fit1 AUC=", round(fit1.roc$auc,2)),
paste0("fit2 AUC=", round(fit2.roc$auc, 2)),
paste0("fit.final AUC=", round(fit.final.roc$auc, 2))),
col=c("red", "blue", "green"),
       lty=1)

From both the ROC curves and the AUCs, we can say that our final model is indeed better in terms of overall performance. After deciding on the model, we choose the classification rule depending on the goal.
Given the model, what threshold should we use so that the MCE is minimized? The answer is to use \(0.5\) as the threshold, assuming we treat each type of mistake the same. On the other hand, if the two types of mistakes cost differently, we ought to weight them differently.
For example, if a lower false negative rate is more important, we can put more weight on false negatives. A concrete example is disease diagnosis: falsely diagnosing someone as disease-free is worse than falsely diagnosing them as positive. We can achieve this through the Bayes rule.
We first define the loss function.
\(a_{1,0}=L(Y=1, \hat Y=0)\), the loss (cost) of classifying a "1" as a "0" (false negative)
\(a_{0,1}=L(Y=0, \hat Y=1)\), the loss of classifying a "0" as a "1" (false positive)
\(a_{0, 0} = a_{1, 1}=0\) (correct classification)
Then \[\begin{split} E(L(Y, \hat Y=1)) &= P(Y=1) \cdot L(Y=1, \hat Y=1) + P(Y=0) \cdot L(Y=0, \hat Y=1) \\ &= a_{0,1} \cdot P(Y=0) \end{split} \]
Similarly, \[E(L(Y, \hat Y=0))=a_{1,0} \cdot P(Y=1)\]
To minimize the two mean losses, we choose \[\begin{split} \hat Y=1 ~~~~ \text{if} ~~~~ &\quad E(L(Y, \hat Y=1)) < E(L(Y, \hat Y=0)) \\ & \Leftrightarrow a_{0,1} \cdot P(Y=0) < a_{1,0} \cdot P(Y=1) \end{split} \]
We have the following optimal rule:
\[\begin{split} \hat Y=1 ~~~~ \text{if} ~~~~ &\quad \frac{P(Y=1 \vert X)}{P(Y=0\vert X)} > \frac{a_{0,1}}{a_{1,0}} \\ & \Leftrightarrow P(Y=1 \vert X) > \frac{\frac{a_{0,1}}{a_{1,0}}}{1 + \frac{a_{0,1}}{a_{1,0}}} \end{split} \]
The above rule is called Bayes’ rule.
Weighted classification error:
The weighted misclassification error (may not be a number between 0 and 1) is then
\[MCE=\frac{a_{1,0} \sum_{i} \mathbf{1}\{\hat y_i = 0, y_i=1\} + a_{0,1} \sum_{i} \mathbf{1}\{\hat y_i = 1, y_i=0\}}{n}\]
Example: \(5a_{0,1}=a_{1,0}\)
If the cost of misclassifying a "1" as a "0" is 5 times that of misclassifying a "0" as a "1", then the Bayes rule thresholds at \[\hat P(Y=1 \vert x) > \frac{0.2}{1+0.2}=0.17\] or equivalently \[logit > \log\Big(\frac{0.17}{0.83}\Big)=-1.59.\] The linear boundary is
\[SBP \geq -4.09Age + 344.7\]
Let’s draw the linear boundary of the Bayes rule when \(\frac{a_{1,0}}{a_{0,1}}=5\) and compare with thresholding probability by 1/2 and 2/3.
fram_data.f %>%
ggplot(aes(x=AGE, y=SBP)) +
geom_jitter(width = 0.2, aes(color = factor(HD))) +
geom_abline(intercept = 503, slope = -4.09, col = "blue") +
geom_abline(intercept = 455, slope = -4.09, col = "red") +
geom_abline(intercept = 344.7, slope = -4.09, col = "green") +
  ggtitle("Green: Bayes rule a10/a01 = 5; Blue: HD = 1 if prob > 2/3; Red: HD = 1 if prob > 1/2")

Finally, we compute the weighted misclassification error (which need not be a number between 0 and 1):
\[MCE=\frac{a_{1,0} \sum_{i} \mathbf{1}\{\hat y_i = 0, y_i=1\} + a_{0,1} \sum_{i} \mathbf{1}\{\hat y_i = 1, y_i=0\}}{n}\]
Let's compare the weighted misclassification error of the Bayes rule against the naive 1/2 threshold, using fit2. The weighted MCE of the Bayes-rule classifier should be smaller, as expected.
fit2.pred.bayes <- as.factor(ifelse(fit2$fitted > .17, "1", "0"))
MCE.bayes <- (5*sum(fit2.pred.bayes[fram_data.f$HD == "1"] != "1")
+ sum(fit2.pred.bayes[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.bayes

## [1] 0.7020818
What happens with .5 as the threshold?
fit2.pred.5 <- as.factor(ifelse(fit2$fitted > .5, "1", "0"))
MCE.bayes.5 <- (5*sum(fit2.pred.5[fram_data.f$HD == "1"] != "1")
+ sum(fit2.pred.5[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.bayes.5

## [1] 1.064609
Now compute the weighted MCE of the Bayes-rule classifier based on fit.final. Its weighted MCE is smaller than that of fit2.
fit.final.bayes <- as.factor(ifelse(fit.final$fitted > .17, "1", "0"))
MCE.final.bayes <- (5*sum(fit.final.bayes[fram_data.f$HD == "1"] != "1")
+ sum(fit.final.bayes[fram_data.f$HD == "0"] != "0"))/length(fram_data.f$HD)
MCE.final.bayes

## [1] 0.6116296
Depending on the goal, we need to choose a criterion. Then to evaluate the performance, we use testing data.
We may split the data into two sub-samples.
Choose the model by some criterion. For example, the one with the largest AUC.
We first split the data.
# Split the data:
N <- length(fram_data.f$HD)
set.seed(10)
# Take a random sample of n=1000 from 1 to N=1393
index.train <- sample(N, 1000)
# Set the 1000 randomly chosen subjects as a training data
data.train <- fram_data.f[index.train,]
# The remaining subjects will be reserved for testing purposes
data.test <- fram_data.f[-index.train,]

Compare two models with testing data
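First fit both models on the training data (a sketch, mirroring the loop at the end of this section):

fit1.train <- glm(HD ~ SBP, data = data.train, family = binomial)
fit5.train <- glm(HD ~ SBP + SEX + AGE + CHOL + CIG, data = data.train, family = binomial)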
Get the fitted probabilities using the testing data
fit1.fitted.test <- predict(fit1.train, data.test, type="response") # get the prob's
fit5.fitted.test <- predict(fit5.train, data.test, type="response")
data.frame(fit1.fitted.test, fit5.fitted.test)[1:10, ]

## fit1.fitted.test fit5.fitted.test
## 6 0.1310456 0.08427391
## 14 0.1383692 0.10074764
## 21 0.1421581 0.09981915
## 24 0.1460333 0.17082459
## 25 0.1460333 0.11123836
## 27 0.1499956 0.13101216
## 30 0.1499956 0.12530310
## 37 0.1540460 0.14660777
## 38 0.1540460 0.27803701
## 42 0.1540460 0.14380806
Look at the first 10 rows. Notice that the row names are the subject numbers chosen for the testing data. The estimated probabilities differ between fit1 and fit5.
Compare the performances using the testing data.
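Construct the ROC objects on the testing data (a sketch; the roc() calls print the messages below):

fit1.test.roc <- roc(data.test$HD, fit1.fitted.test)
fit5.test.roc <- roc(data.test$HD, fit5.fitted.test)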
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(1-fit5.test.roc$specificities, fit5.test.roc$sensitivities,
col="red", type="l", lwd=3,
xlab=paste("AUC(fit5.test) =",
round(pROC::auc(fit5.test.roc),2),
" AUC(fit1.test) =",
round(pROC::auc(fit1.test.roc),2)),
ylab="Sensitivities")
lines(1-fit1.test.roc$specificities, fit1.test.roc$sensitivities, col="blue", lwd=3)
legend("bottomright", legend=c("fit5.test w five variables", "fit1.test w one variable"),
lty=c(1,1), lwd=c(2,2), col=c("red", "blue"))
title("Comparison of two models using testing data")

Conclusion: we will use fit5 because of its larger testing AUC.
Note: If you comment out set.seed and repeat running the above training/testing session, what will you observe from the two ROC curves and the two AUC values?
Are you surprised?
for(i in 1:10) {
# set seed
set.seed(10*i)
# split data
index.train <- sample(N, 1000)
data.train <- fram_data.f[index.train,]
data.test <- fram_data.f[-index.train,]
# fit
fit1.train <- glm(HD~SBP, data=data.train, family=binomial)
fit5.train <- glm(HD~SBP+SEX+AGE+CHOL+CIG, data=data.train, family=binomial)
# predict
fit1.fitted.test <- predict(fit1.train, data.test, type="response")
fit5.fitted.test <- predict(fit5.train, data.test, type="response")
# roc
fit1.test.roc <- roc(data.test$HD, fit1.fitted.test)
fit5.test.roc <- roc(data.test$HD, fit5.fitted.test)
# plot
plot(1-fit5.test.roc$specificities, fit5.test.roc$sensitivities,
col="red", type="l", lwd=3,
xlab=paste("AUC(fit5.test) =",
round(pROC::auc(fit5.test.roc),2),
" AUC(fit1.test) =",
round(pROC::auc(fit1.test.roc),2)),
ylab="Sensitivities")
lines(1-fit1.test.roc$specificities, fit1.test.roc$sensitivities, col="blue", lwd=3)
legend("bottomright", legend=c("fit5.test w five variables", "fit1.test w one variable"),
lty=c(1,1), lwd=c(2,2), col=c("red", "blue"))
title("Comparison of two models using testing data")
# pause for one second
Sys.sleep(1)
}

What does a \(\chi^2\) distribution look like?
\(\chi^2\) distributions with different degrees of freedom.
par(mfrow=c(1,3))
hist(rchisq(10000, 1), freq=FALSE, breaks=20) # the second arg = df
hist(rchisq(10000, 10), freq=FALSE, breaks=20) # the second arg = df
hist(rchisq(10000, 20), freq=FALSE, breaks=20) # when df is large, the chi-square distribution approaches a normal distribution

When the degrees of freedom get larger, the \(\chi^2\) distribution is approximately normal (why?)