Applied Logistic Regression

Exercise One

For this exercise, we need the Myopia Study dataset. The data is available by Professor Lemeshow’s course on Canvas Network.

filename = "MYOPIA.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461762/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("MYOPIA.csv", header = T, sep = ",")

The six first observations:

ztable(head(data))


	ID	STUDYYEAR	MYOPIC	AGE	GENDER	SPHEQ	AL	ACD	LT	VCD	SPORTHR	READHR	COMPHR	STUDYHR	TVHR	DIOPTERHR	MOMMY	DADMY
1	1	1992	1	6	1	-0.05	21.89	3.69	3.50	14.70	45	8	0	0	10	34	1	1
2	2	1995	0	6	1	0.61	22.38	3.70	3.39	15.29	4	0	1	1	7	12	1	1
3	3	1991	0	6	1	1.18	22.49	3.46	3.51	15.52	14	0	2	0	10	14	0	0
4	4	1990	1	6	1	0.52	22.20	3.86	3.61	14.73	18	11	0	0	4	37	0	1
5	5	1995	0	5	0	0.70	23.29	3.68	3.45	16.16	14	0	0	0	4	4	1	0
6	6	1995	0	6	0	1.74	22.14	3.22	3.56	15.36	10	6	2	1	19	44	0	1

Last six observations of data:

ztable(tail(data))


	ID	STUDYYEAR	MYOPIC	AGE	GENDER	SPHEQ	AL	ACD	LT	VCD	SPORTHR	READHR	COMPHR	STUDYHR	TVHR	DIOPTERHR	MOMMY	DADMY
613	613	1990	0	9	0	-0.03	22.96	3.76	3.34	15.86	9	9	5	0	16	53	1	0
614	614	1995	1	6	0	0.68	22.40	3.66	3.80	14.93	2	0	7	3	14	37	1	0
615	615	1993	0	6	1	0.67	22.50	3.57	3.38	15.56	6	0	1	0	8	10	1	1
616	616	1995	0	6	0	1.83	22.94	3.62	3.42	15.89	8	0	0	0	4	4	1	1
617	617	1991	0	6	1	0.67	21.92	3.69	3.60	14.64	12	2	1	0	15	23	0	0
618	618	1994	0	6	0	0.80	22.26	3.53	3.48	15.25	25	0	2	0	10	14	1	1

One variable that is clearly important is the initial value of spherical equivalent refraction (SPHEQ).

Complete the following:

A. Write down the equation for the logistic regression model of MYOPIA on SPHEQ. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIA and SPHEQ?

B. Form a scatter plot of MYOPIA vs. SPHEQ.

C. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=618, data. Obtain expressions for the two likelihood equations.

D. Obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=618, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatter plots in parts (B) and (C).

Solving

A. Equation for the logistic regression model of MYOPIA on SPHEQ:

\[\pi(SPHEQ) = \dfrac{e^{\beta_0 + \beta_{1}\times SPHEQ}}{1+e^{\beta_0 + \beta_{1}\times SPHEQ}},\] where \(\pi(SPHEQ)\) is the conditional mean of MYOPIC given SPHEQ (\(E[MYOPIC ~|~ SPHEQ]\)) and can be estimate as below¹:

library(dplyr)
estimate = data %>%
        select(MYOPIC, SPHEQ) %>%
        group_by(SPHEQ, MYOPIC) %>%
        summarise( m = mean(MYOPIC))

Head and tail of estimate for MYOPIC given SPHEQ:

pander::pander(cbind(head(estimate), tail(estimate)))

SPHEQ	MYOPIC	m	SPHEQ
-0.699	1	1	3.422
-0.600	1	1	3.491
-0.571	1	1	3.615
-0.524	1	1	3.731
-0.502	1	1	4.228
-0.491	1	1	4.372

Equation for the logit transformation of this logistic regression model: \[\log \left(\dfrac{\pi(SPHEQ)}{1-\pi(SPHEQ)}\right) = \beta_0 + \beta_{1}\times SPHEQ,\] where log is the logarithm neperian function.

Myopic is a binary outcome variable, that is, the conditional mean of MYOPIC in regression equation is bounded between zero and one. So it’s appropriate for logistic regression.

B. We can form a scatter plot for MYOPIA vs. SPHEQ with the values computed in (A) as estimate for \(E[MYOPIC ~|~ SPHEQ]\):

library(ggplot2)
ggplot(estimate, aes(SPHEQ, m)) +
        geom_point() +
        xlab("SPHEQ") + 
        ggtitle("SPHEQ (X) x MYOPIC (Y)") +
        ylab("Estimate for E[MYOPIC  | SPHEQ]")

C. Likelihood function for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\): \[ {\cal l}(\beta) = \prod_{i=1}^{618} \left\{ \pi(SPHEQ_{i})^{MYOPIC_{i}}\times [1-\pi(SPHEQ_{i})]^{1- MYOPIC_{i}} \right\}.\]

Log-likelihood function for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\):

\[ {\cal L}(\beta) = log({\cal l}(\beta)) = \sum_{i=1}^{618} \left\{ MYOPIC_{i}\times log(\pi(SPHEQ_{i})) + [1- MYOPIC_{i}] \times log(1-\pi(SPHEQ_{i})) \right\}.\]

Likelihood equations for pair \((SPHEQ_{i}, MYOPIC_{i}), \quad 1 \leq i \leq 618\):

\[\sum_{i=1}^{618} \left\{ MYOPIC_{i} - \pi(SPHEQ_{i}) \right\} = 0\] and \[\sum_{i=1}^{618} \left\{SPHEQ_{i} \times [ MYOPIC_{i} - \pi(SPHEQ_{i})] \right\} = 0.\]

D. In R² we can obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A) with code below:

mod = glm(MYOPIC ~ SPHEQ,
                  data = data,
                  family = binomial(link = "logit"))
ztable(mod)


	Estimate	Std. Error	z value	Pr(>\|z\|)	OR	lcl	ucl
(Intercept)	0.0540	0.2067	0.26	0.7940	1.06	0.71	1.59
SPHEQ	-3.8331	0.4184	-9.16	0.0000	0.02	0.01	0.05
Call: glm(formula = MYOPIC ~ SPHEQ, family = binomial(link = “logit”), data = data)

So \(\hat{\beta_{0}}= 0.05\) and \(\hat{\beta_{1}}= -3.83\) are maximum likelihood estimates of the parameters of the logistic regression model in part (A).

Another way to obtain these predictors in R²:

library(bbmle)
(bino = mle2(MYOPIC~dbinom(prob=(exp(beta0+beta1*SPHEQ)/(1+exp(beta0+beta1*SPHEQ))), size = 1), 
            start = list(beta0=coef(mod)[1], beta1=coef(mod)[2]),
            data = data))

## 
## Call:
## mle2(minuslogl = MYOPIC ~ dbinom(prob = (exp(beta0 + beta1 * 
##     SPHEQ)/(1 + exp(beta0 + beta1 * SPHEQ))), size = 1), start = list(beta0 = coef(mod)[1], 
##     beta1 = coef(mod)[2]), data = data)
## 
## Coefficients:
##       beta0       beta1 
##  0.05397313 -3.83309759 
## 
## Log-likelihood: -168.67

Likelihood profiles:

library(sads)
plot(profile(bino))

Equation for the estimated logistic probabilities:

\[\hat{\pi}(SPHEQ) = \dfrac{e^{0.05 + (-3.83)\times SPHEQ}}{1+e^{0.05 + (-3.83)\times SPHEQ}}.\]

Finally, the plot of equation for the fitted values on the axes used in the scatter plots in parts (B) and (C):

ggplot(data, aes(SPHEQ, MYOPIC)) +
        geom_point() + 
        geom_smooth(method = "glm", 
                    method.args = list(family = "binomial"), se = F) +
        ggtitle("SPHEQ (X) x MYOPIC (Y)")

Exercise Two

For this exercise, we need the ICU dataset. The data is available by Professor Lemeshow’s course on Canvas Network.

filename = "icu.csv"
if (!file.exists(filename)) {
        fileURL = "https://learn.canvas.net/courses/1179/files/461760/download?download_frd=1"
        download.file(fileURL, filename)
        }
data = read.csv("icu.csv", header = T, sep = ",")

The six first observations:

ztable(head(data))


	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
1	552	0	16	0	1	1	0	0	0	0	100	140	0	1	1	0	0	0	0	0	0
2	102	0	16	1	1	0	0	0	0	0	104	111	0	1	0	0	0	0	0	0	0
3	837	0	17	1	3	0	0	0	0	0	130	140	0	1	0	0	0	0	0	0	0
4	863	0	17	0	3	1	0	0	0	0	136	78	0	1	0	0	0	0	0	0	0
5	829	0	17	0	1	1	0	0	0	0	140	78	0	1	1	0	0	0	0	0	0
6	379	0	18	0	1	1	0	0	0	0	112	76	0	1	1	0	0	0	0	0	0

Last six observations of data:

ztable(tail(data))


	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
195	222	1	88	0	1	0	0	0	1	1	141	140	0	1	0	0	0	0	0	0	0
196	877	0	88	1	1	0	0	1	0	0	190	88	0	1	0	0	0	0	0	0	0
197	881	0	89	1	1	1	0	0	0	0	190	114	0	1	0	0	0	1	0	0	2
198	204	1	91	0	1	0	0	0	1	0	64	125	0	1	0	0	0	1	0	0	0
199	674	0	91	0	1	0	0	1	1	0	152	125	0	1	0	0	0	0	0	0	0
200	165	1	92	0	1	0	0	0	1	0	124	80	0	1	0	0	0	0	1	0	0

The ICU dataset consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. A number of publications have appeared which have focused on various facets of this problem.

Complete the following:

A. Write down the equation for the logistic regression model of STA on AGE. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?

B. Form a scatter plot of STA versus AGE.

C. Write down an expression for the likelihood and log likelihood for the logistic regression model in part (A) using the ungrouped, n=200, data. Obtain expressions for the two likelihood equations.

D. Obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A). These estimates should be based on the ungrouped, n=200, data. Using these estimates, write down the equation for the fitted values, that is, the estimated logistic probabilities. Plot the equation for the fitted values on the axes used in the scatter plots in part (B).

E. Summarize (describe in words) the results presented in the plot obtained from parts (B) and (D).

Solving

A. Equation for the logistic regression model of STA on AGE:

\[\pi(AGE) = \dfrac{e^{\beta_0 + \beta_{1}\times AGE}}{1+e^{\beta_0 + \beta_{1}\times AGE}},\] where \(\pi(STA)\) is the conditional mean of STA given AGE (\(E[STA ~|~ AGE]\)) and can be estimate as below¹:

est = data %>%
        select(AGE, STA) %>%
        group_by(AGE, STA) %>%
        summarise( m = mean(STA))

Head and tail of estimate for STA given AGE:

pander::pander(cbind(head(est), tail(est)))

AGE	STA	m	AGE	STA	m
16	0	0	88	0	0
17	0	0	88	1	1
18	0	0	89	0	0
19	0	0	91	0	0
19	1	1	91	1	1
20	0	0	92	1	1

Equation for the logit transformation of this logistic regression model: \[\log \left(\dfrac{\pi(AGE)}{1-\pi(AGE)}\right) = \beta_0 + \beta_{1}\times AGE,\] where log is the logarithm neperian function.

AGE is a binary outcome variable, that is, the conditional mean of AGE in regression equation is bounded between zero and one. So it’s appropriate for logistic regression.

B. We can form a scatter plot for AGE vs. STA with the values computed in (A) as estimate for \(E[STA ~|~ AGE]\):

ggplot(est, aes(AGE, m)) +
        geom_point() +
        xlab("AGE") + 
        ggtitle("AGE (X) x STA (Y)") +
        ylab("Estimate for E[STA  | AGE]")

C. Likelihood function for pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\): \[ {\cal l}(\beta) = \prod_{i=1}^{200} \left\{ \pi(AGE_{i})^{STA_{i}}\times [1-\pi(AGE_{i})]^{1- STA_{i}} \right\}.\]

Log-likelihood function for pair \((AGE_{i}, STA_{i}), \quad 1 \leq i \leq 200\):

\[ {\cal L}(\beta) = log({\cal l}(\beta)) = \sum_{i=1}^{200} \left\{ STA_{i}\times log(\pi(AGE_{i})) + [1- STA_{i}] \times log(1-\pi(AGE_{i})) \right\}.\]

Likelihood equations for pair \((AGE_{i},STA_{i}), \quad 1 \leq i \leq 200\):

\[\sum_{i=1}^{200} \left\{ STA_{i} - \pi(AGE_{i}) \right\} = 0\] and \[\sum_{i=1}^{200} \left\{STA_{i} \times [ STA_{i} - \pi(AGE_{i})] \right\} = 0.\]

D. In R² we can obtain the maximum likelihood estimates of the parameters of the logistic regression model in part (A) with code below:

(mod = glm(STA ~ AGE, data = data, family = binomial(link = "logit")))

## 
## Call:  glm(formula = STA ~ AGE, family = binomial(link = "logit"), data = data)
## 
## Coefficients:
## (Intercept)          AGE  
##    -3.05851      0.02754  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       200.2 
## Residual Deviance: 192.3     AIC: 196.3

Below, plot of equation for the fitted values on the axes used in the scatter plots in parts (B) and (C):

ggplot(data, aes(AGE, STA)) +
        geom_point() + 
        geom_smooth(method = "glm", 
                    method.args = list(family = "binomial"), se = F) +
        ggtitle("AGE (X) x STAR (Y)")

E. In plots we can see that old patients are more susceptible to die if admitted to an adult intensive care unit (ICU).

References

1. Hosmer DW, Lemeshow S, Sturdivant R. Applied Logistic regression. 2013.(http://onlinelibrary.wiley.com/book/10.1002/9781118548387)

2. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.(https://www.R-project.org/)

Applied Logistic Regression

Homework - Week 1

August/2016

Exercise One

Complete the following:

Solving

Exercise Two

Complete the following:

Solving

References

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
1	552	0	16	0	1	1	0	0	0	0	100	140	0	1	1	0	0	0	0	0	0
2	102	0	16	1	1	0	0	0	0	0	104	111	0	1	0	0	0	0	0	0	0
3	837	0	17	1	3	0	0	0	0	0	130	140	0	1	0	0	0	0	0	0	0
4	863	0	17	0	3	1	0	0	0	0	136	78	0	1	0	0	0	0	0	0	0
5	829	0	17	0	1	1	0	0	0	0	140	78	0	1	1	0	0	0	0	0	0
6	379	0	18	0	1	1	0	0	0	0	112	76	0	1	1	0	0	0	0	0	0

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
195	222	1	88	0	1	0	0	0	1	1	141	140	0	1	0	0	0	0	0	0	0
196	877	0	88	1	1	0	0	1	0	0	190	88	0	1	0	0	0	0	0	0	0
197	881	0	89	1	1	1	0	0	0	0	190	114	0	1	0	0	0	1	0	0	2
198	204	1	91	0	1	0	0	0	1	0	64	125	0	1	0	0	0	1	0	0	0
199	674	0	91	0	1	0	0	1	1	0	152	125	0	1	0	0	0	0	0	0	0
200	165	1	92	0	1	0	0	0	1	0	124	80	0	1	0	0	0	0	1	0	0

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
1	552	0	16	0	1	1	0	0	0	0	100	140	0	1	1	0	0	0	0	0	0
2	102	0	16	1	1	0	0	0	0	0	104	111	0	1	0	0	0	0	0	0	0
3	837	0	17	1	3	0	0	0	0	0	130	140	0	1	0	0	0	0	0	0	0
4	863	0	17	0	3	1	0	0	0	0	136	78	0	1	0	0	0	0	0	0	0
5	829	0	17	0	1	1	0	0	0	0	140	78	0	1	1	0	0	0	0	0	0
6	379	0	18	0	1	1	0	0	0	0	112	76	0	1	1	0	0	0	0	0	0

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
195	222	1	88	0	1	0	0	0	1	1	141	140	0	1	0	0	0	0	0	0	0
196	877	0	88	1	1	0	0	1	0	0	190	88	0	1	0	0	0	0	0	0	0
197	881	0	89	1	1	1	0	0	0	0	190	114	0	1	0	0	0	1	0	0	2
198	204	1	91	0	1	0	0	0	1	0	64	125	0	1	0	0	0	1	0	0	0
199	674	0	91	0	1	0	0	1	1	0	152	125	0	1	0	0	0	0	0	0	0
200	165	1	92	0	1	0	0	0	1	0	124	80	0	1	0	0	0	0	1	0	0

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
1	552	0	16	0	1	1	0	0	0	0	100	140	0	1	1	0	0	0	0	0	0
2	102	0	16	1	1	0	0	0	0	0	104	111	0	1	0	0	0	0	0	0	0
3	837	0	17	1	3	0	0	0	0	0	130	140	0	1	0	0	0	0	0	0	0
4	863	0	17	0	3	1	0	0	0	0	136	78	0	1	0	0	0	0	0	0	0
5	829	0	17	0	1	1	0	0	0	0	140	78	0	1	1	0	0	0	0	0	0
6	379	0	18	0	1	1	0	0	0	0	112	76	0	1	1	0	0	0	0	0	0

	ID	STA	AGE	SEX	RACE	SER	CAN	CRN	INF	CPR	SYS	HRA	PRE	TYP	FRA	PO2	PH	PCO	BIC	CRE	LOC
195	222	1	88	0	1	0	0	0	1	1	141	140	0	1	0	0	0	0	0	0	0
196	877	0	88	1	1	0	0	1	0	0	190	88	0	1	0	0	0	0	0	0	0
197	881	0	89	1	1	1	0	0	0	0	190	114	0	1	0	0	0	1	0	0	2
198	204	1	91	0	1	0	0	0	1	0	64	125	0	1	0	0	0	1	0	0	0
199	674	0	91	0	1	0	0	1	1	0	152	125	0	1	0	0	0	0	0	0	0
200	165	1	92	0	1	0	0	0	1	0	124	80	0	1	0	0	0	0	1	0	0