Logit Regression and COX Proportional Hazard Regression of Intensive Care Unit Dataset

1. Introduction of Intensive Care Unit data.

The ICU data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to predict the probability of survival to hospital discharge of these patients and to study the risk factors associated with ICU mortality. (Hosmer Jr and Sturdivant. 2013)

2. The variables of Intensive Care Unit data.

There are 200 observations of 21 variables.
ID : Identification Number
STA : Vital Status(0=Lived, 1=Died)
AGE : Years
GENDER: 0=Male, 1=Female
RACE: 1=White, 2=Black, 3=Other
SER : Service at ICU Admission(0=Medical, 1=Surgical)
CAN : Cancer Part of Present Problem (0=No, 1=Yes)
CRN : History of Chronic Renal Failure(0=No, 1=Yes)
INF : Infection Probable at ICU Admission(0=No, 1=Yes)
CPR : CPR Prior to ICU Admission (0=No, 1=Yes)
SYS : Systolic Blood Pressure(mm Hg)
HRA : Heart Rate at ICU Admission(Beats/min)
PRE : Previous Admission to an ICU within 6 Months(0=No, 1=Yes)
TYP : Type of Admission(0=Elective, 1=Emergency)
FRA : Long Bone, Multiple, Neck, Single Area, or Hip Fracture(0=No, 1=Yes)
PO2 : PO2 from Initial Blood Gases( 0 = >60, 1 = <60)
PH : PH from Initial Blood Gases(0 =>7.25, 1 =< 7.25)
PCO2: PCO2 from initial Blood(0 =<45, 1 = > 45)
BIC : Bicarbonate from Initial Blood Gases(0=>18, 1 =< 18)
CRE : Creatinine from Initial Blood(0 =<2.0, 1 => 2.0)
LOC : Level of Consciousness at ICU Admission (0=No Coma,or Stupor, 1=Deep Stupor, 2=Coma)

3. Estimation of Logistic Regression Models using “Zelig” Package

library(Zelig)
icu <- read.table("icu.txt", header=TRUE)

icu.z1 <- zelig(STA ~ AGE+GENDER+RACE+SER+CAN+CRN+INF+CPR+SYS+HRA+PRE+TYP+FRA+PO2+PH+PCO+BIC+CRE+LOC, data=icu, model="logit", x=TRUE)  
icu.z2 <- zelig(STA ~ AGE+TYP, data=icu, model="logit",x=TRUE)
icu.z3 <- zelig(STA ~ AGE+TYP + AGE:TYP, data=icu, model="logit",x=TRUE)

library(stargazer)
stargazer(icu.z1, icu.z2, icu.z3, type="html")


	Dependent variable:

	STA
	(1)	(2)	(3)

AGE	0.052^***	0.034^***	0.074
	(0.017)	(0.011)	(0.088)

GENDER	-0.555
	(0.501)

RACE	-0.005
	(0.528)

SER	-0.545
	(0.575)

CAN	2.758^***
	(0.980)

CRN	-0.102
	(0.762)

INF	-0.056
	(0.535)

CPR	0.978
	(0.983)

SYS	-0.012
	(0.008)

HRA	-0.004
	(0.009)

PRE	0.929
	(0.629)

TYP	2.744^***	2.454^***	5.308
	(0.995)	(0.753)	(6.413)

FRA	1.152
	(0.999)

PO2	0.387
	(0.851)

PH	2.415^**
	(1.231)

PCO	-3.173^**
	(1.386)

BIC	-0.794
	(0.916)

CRE	0.233
	(1.075)

LOC	2.706^***
	(0.752)

AGE:TYP			-0.040
			(0.088)

Constant	-5.414^**	-5.509^***	-8.315
	(2.122)	(1.034)	(6.375)


Observations	200	200	200
Log Likelihood	-64.800	-86.538	-86.416
Akaike Inf. Crit.	169.600	179.076	180.832

Note:	p<0.1; p<0.05; p<0.01

“Zelig” package is used to estimate logistic regresssion models of patient’s Vital Status(0=Lived, 1=Died) at ICU(Intensive Care Unit) as a function of explanatory variables.(Kosuke Imai and Lau 2015)
The first model uses all the variables in the dataset and shows only 6 variables are significant among 19 explanatory variables.
For the second model, only 2 explanatory variables: AGE and TYP(Type of Admission, 0=Elective, 1=Emergency) are included to explain the relationships of Patient’s age and type of admission on the response variable vital status. The coefficients for AGE is estimated as 0.034 which means that the ICU patient’s log odds of dying(Vital Status =1) increases 0.034 times higher for additonal age increment with type of admission to the ICU being equal. Exponentiating the logistic regression model removes the log for the odds and changes it from a model that is additive in the log-odds scale to one that is multiplicative in the odds scale. Therefore, we also can interpret that when type of admission to the ICU is elective(TYP=0) the ICU patient’s odds of dying(Vital Status =1) increases 1.0346(=e^0.034) times higher for additonal age increment. The third model includes AGE, TYP and their interaction term AGE:TYP to measure the interaction effect of AGE and TYP.
Then 3 logit models are compared using “stargazer” package.(Hlavac 2014)
The result shows that the interaction term of AGE and TYP is not statistically significant.

4. Simulating the logistic models using the Zelig package

The next Simulation-based approach produces more accurate results than conventional mathematical technique for many reasons.(King, Tomz, and Wittenberg 2000) (Zelner 2009) The first simulation table shows that probability of dying when the type of admission to the ICU is elective(TYP=0) increases by 0.03487779 for additional age increment.
Similary, the second simulation table shows that when type of admission to the ICU is emergency(TYP=1) the probability of ICU patient’s dying increases 0.2538559 times higher for additonal age increment.
This results are matched with the logit regression result of model2 in the above table in (3). Then the graphs visualizes the relationship of expected value(Vital Status) and age. The second graph shows that probabilty of dying increases much steeper than the first graph since it is the case in which type of admission is emergency.

library(stargazer)
set.seed(11111)

x1 <- setx(icu.z2, TYP = 0)
s1 <- sim(icu.z2, x=x1)
summary(s1)

## 
##   Model: logit 
##   Number of simulations: 1000 
## 
## Values of X 
##   (Intercept)    AGE TYP
## 1           1 57.545   0
## 
## Expected Values: E(Y|X)
##         mean         sd        2.5%     97.5%
## 1 0.03487779 0.02781144 0.007549068 0.1103397
## 
## Predicted Values: Y|X
##       0     1
## 1 0.968 0.032

x2 <- setx(icu.z2, TYP = 1)
s2 <- sim(icu.z2, x=x2)
summary(s2)

## 
##   Model: logit 
##   Number of simulations: 1000 
## 
## Values of X 
##   (Intercept)    AGE TYP
## 1           1 57.545   1
## 
## Expected Values: E(Y|X)
##        mean         sd     2.5%     97.5%
## 1 0.2538559 0.03770622 0.187763 0.3308108
## 
## Predicted Values: Y|X
##       0     1
## 1 0.754 0.246

AGE.range <- min(icu$AGE):max(icu$AGE)
x1 <- setx(icu.z2, TYP = 0, AGE=AGE.range)
s1 <- sim(icu.z2, x=x1)
plot.ci(s1)

AGE.range <- min(icu$AGE):max(icu$AGE)
x2 <- setx(icu.z2, TYP = 1, AGE=AGE.range)
s2 <- sim(icu.z2, x=x2)
plot.ci(s2)

5. Statistical significance test of the interaction effect

The third logistic model icu.z3 includes the interaction term of AGE and STA. To test the significance of the interaction effect, the following “difference in differences” method is conducted. The Zelig packages versions 4 and 5 had the systematic error with the simulated expected joint probabilities or expected values. Therefore, Zelig 3.5-2 package is used instead.

library(mvtnorm)
set.seed(12345)

xh1 <- setx(icu.z3, AGE = mean(icu$AGE)+sd(icu$AGE), TYP=1)
xl1 <- setx(icu.z3, AGE = mean(icu$AGE), TYP=1)
xh0 <- setx(icu.z3, AGE = mean(icu$AGE)+sd(icu$AGE), TYP =0)
xl0 <- setx(icu.z3, AGE = mean(icu$AGE), TYP=0)

zh1 <- sim(icu.z3, x=xh1)
zl1 <- sim(icu.z3, x=xl1)
zh0 <- sim(icu.z3, x=xh0)
zl0 <- sim(icu.z3, x=xl0)

eff <- (zh1$qi$ev - zl1$qi$ev) -(zh0$qi$ev - zl0$qi$ev)

quantile(eff, c(.025, .975))

##       2.5%      97.5% 
## -0.1395350  0.3470681

hist(eff)

As a result, 95% Confidence interval(-0.1395, 0.3471) of the distribution of the random values for the interaction effect includes 0. Which implies the interaction term AGE:TYP has no significant effect on the vital status of patient. This result is the same as the logistic regression using “Zelig” package in (3) above.
Histogram of the distribution of the vlaues for “the difference in differences” gives a better picture of no interaction effect including 0.

6. Cox Proportional Hazard model

Next, the cox proprotional hazards regression model is used to analyze the relationship between the survival time of patients(AGE) and risk factors assessing their conditons at the ICU. Cox PH model is the most common statistical methods for censored data and it does not require any particular probability distribution of survival times. This fact relieves a concern about the unknown distribution of survival times for this ICU dataset. In this Cox Proportional Hazard model, the variable STA is used for censored observation. That is, if STA=1, then the patient is dead at the ICU and the survival duration of the patient is not censored and if STA=0, then the patient is alive and the survival duration of the patient is censored.

z.out0 <- zelig(Surv(AGE, STA) ~ GENDER+RACE+SER+CAN+CRN+INF+CPR+SYS+HRA+PRE+TYP+FRA+PO2+PH+PCO+BIC+CRE+LOC, data = icu, model = "coxph")

stargazer(z.out0, type = "html")


	Dependent variable:

	AGE

GENDER	-0.942^**
	(0.466)

RACE	0.609
	(0.405)

SER	-0.289
	(0.497)

CAN	2.680^***
	(0.735)

CRN	-0.372
	(0.513)

INF	-0.083
	(0.414)

CPR	0.428
	(0.620)

SYS	-0.003
	(0.006)

HRA	-0.004
	(0.008)

PRE	1.288^**
	(0.570)

TYP	2.238^**
	(0.871)

FRA	1.180
	(0.897)

PO2	0.069
	(0.651)

PH	1.909^**
	(0.845)

PCO	-1.621^*
	(0.863)

BIC	-0.540
	(0.686)

CRE	-0.318
	(0.637)

LOC	1.255^***
	(0.331)


Observations	200
R²	0.251
Max. Possible R²	0.798
Log Likelihood	-131.029
Wald Test	49.750^*** (df = 18)
LR Test	57.815^*** (df = 18)
Score (Logrank) Test	72.416^*** (df = 18)

Note:	p<0.1; p<0.05; p<0.01

The result of Cox PH regression analysis shows that explanatory variables, GENDER,CAN,PRE,TYP,PH,PCO,LOC are statistically significant for the explanation of the response variable AGE. Next, only set of significant explanatory variables are used to set up a new Cox PH model. Then each explanatory variable is used to explain the relationship between the response variable AGE. From the comparison table, only TYP and LOC become significant.

z.out1 <- zelig(Surv(AGE, STA) ~ GENDER+CAN+PRE+TYP+PH+PCO+LOC, data = icu, model = "coxph")
z.out2 <- zelig(Surv(AGE, STA) ~ GENDER, data = icu, model = "coxph")
z.out3 <- zelig(Surv(AGE, STA) ~ CAN, data = icu, model = "coxph")
z.out4 <- zelig(Surv(AGE, STA) ~ PRE, data = icu, model = "coxph")
z.out5 <- zelig(Surv(AGE, STA) ~ TYP, data = icu, model = "coxph")
z.out6 <- zelig(Surv(AGE, STA) ~ PH, data = icu, model = "coxph")
z.out7 <- zelig(Surv(AGE, STA) ~ LOC, data = icu, model = "coxph")

stargazer(z.out1, z.out2, z.out3, z.out4, z.out5, z.out6, z.out7, type = "html")


	Dependent variable:

	AGE
	(1)	(2)	(3)	(4)	(5)	(6)	(7)

GENDER	-0.781^**	-0.162
	(0.380)	(0.331)

CAN	2.491^***		0.439
	(0.635)		(0.536)

PRE	1.183^**			0.218
	(0.498)			(0.422)

TYP	2.266^***				2.056^***
	(0.773)				(0.728)

PH	1.624^***					0.496
	(0.605)					(0.530)

PCO	-1.292^**
	(0.601)

LOC	1.308^***						0.885^***
	(0.238)						(0.187)


Observations	200	200	200	200	200	200	200
R²	0.222	0.001	0.003	0.001	0.071	0.004	0.080
Max. Possible R²	0.798	0.798	0.798	0.798	0.798	0.798	0.798
Log Likelihood	-134.803	-159.816	-159.636	-159.809	-152.541	-159.552	-151.624
Wald Test	44.700^*** (df = 7)	0.240 (df = 1)	0.670 (df = 1)	0.270 (df = 1)	7.960^*** (df = 1)	0.880 (df = 1)	22.360^*** (df = 1)
LR Test	50.269^*** (df = 7)	0.242 (df = 1)	0.602 (df = 1)	0.256 (df = 1)	14.793^*** (df = 1)	0.770 (df = 1)	16.627^*** (df = 1)
Score (Logrank) Test	60.157^*** (df = 7)	0.241 (df = 1)	0.682 (df = 1)	0.269 (df = 1)	11.149^*** (df = 1)	0.894 (df = 1)	28.120^*** (df = 1)

Note:	p<0.1; p<0.05; p<0.01

Next, the interaction effect was checked using the Zelig coxph model again. The below table shows that the interaction term of TYP and LOC is also significant. Final model is determined to include the interaction term of TYP:LOC

z.out11 <- zelig(Surv(AGE, STA) ~ GENDER+CAN+PRE+TYP+PH+PCO+LOC, data = icu, model = "coxph")
z.out21 <- zelig(Surv(AGE, STA) ~ TYP+LOC, data = icu, model = "coxph")
z.out31 <- zelig(Surv(AGE, STA) ~ TYP+LOC+TYP:LOC, data = icu, model = "coxph")
z.out41 <- zelig(Surv(AGE, STA) ~ TYP:LOC, data = icu, model="coxph")

stargazer(z.out11, z.out21, z.out31, z.out41, type = "html")


	Dependent variable:

	AGE
	(1)	(2)	(3)	(4)

GENDER	-0.781^**
	(0.380)

CAN	2.491^***
	(0.635)

PRE	1.183^**
	(0.498)

TYP	2.266^***	1.814^**	2.485^**
	(0.773)	(0.736)	(1.021)

PH	1.624^***
	(0.605)

PCO	-1.292^**
	(0.601)

LOC	1.308^***	0.732^***	3.975^***
	(0.238)	(0.188)	(1.421)

TYP:LOC			-3.275^**	0.859^***
			(1.432)	(0.189)


Observations	200	200	200	200
R²	0.222	0.125	0.142	0.074
Max. Possible R²	0.798	0.798	0.798	0.798
Log Likelihood	-134.803	-146.564	-144.623	-152.273
Wald Test	44.700^*** (df = 7)	24.320^*** (df = 2)	22.690^*** (df = 3)	20.690^*** (df = 1)
LR Test	50.269^*** (df = 7)	26.746^*** (df = 2)	30.629^*** (df = 3)	15.328^*** (df = 1)
Score (Logrank) Test	60.157^*** (df = 7)	34.533^*** (df = 2)	36.476^*** (df = 3)	25.817^*** (df = 1)

Note:	p<0.1; p<0.05; p<0.01

7. Simulation of Cox PH model

The depedent variable of Cox PH model is the set of Surv(AGE, STA) where AGE is the duration of ICU patients andSTA is a binary status such that STA =0 if the patient is alive(censored), and STA=1 if the patient is dead(not censored). 2 explatory variables TYP(Type of admission) and LOC(Level of Consciousness at ICU Admission, 0=No Coma,or Stupor, 1=Deep Stupor, 2=Coma) and their interaction effect term are used for the survival function.

The PH regression result also shows that each coefficient of the predictors is significant at 0.01 levels. The coefficient estimate of TYP:LOC has negative value, -3.27508. This means that increase of TYP:LOC gives negative effect for death of a patient unlike the other predictors. As a result, when one more unit of TYP:LOC is increased, a patient is 26.45(=1/0.03781) times more likely to survive.
However, interpretation of the interaction effect has problems because of the collinearity and characteristics of 2 categorical variables. In addition, the interaction term alone has postive coefficient value for the death of a patient in the model 4 in the above table although it has negative coefficient value in the model 3 in the above table.

z.surv <- zelig(Surv(AGE, STA) ~ TYP+LOC+TYP:LOC, data = icu, model = "coxph")

summary(z.surv)

## Call:
## zelig(formula = Surv(AGE, STA) ~ TYP + LOC + TYP:LOC, model = "coxph", 
##     data = icu)
## 
##   n= 200, number of events= 40 
## 
##             coef exp(coef) se(coef)      z Pr(>|z|)   
## TYP      2.48499  12.00094  1.02061  2.435  0.01490 * 
## LOC      3.97549  53.27601  1.42054  2.799  0.00513 **
## TYP:LOC -3.27508   0.03781  1.43220 -2.287  0.02221 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## TYP      12.00094    0.08333  1.623582   88.7067
## LOC      53.27601    0.01877  3.291290  862.3773
## TYP:LOC   0.03781   26.44531  0.002283    0.6262
## 
## Concordance= 0.737  (se = 0.048 )
## Rsquare= 0.142   (max possible= 0.798 )
## Likelihood ratio test= 30.63  on 3 df,   p=1.018e-06
## Wald test            = 22.69  on 3 df,   p=4.683e-05
## Score (logrank) test = 36.48  on 3 df,   p=5.939e-08

However, for the other variables the increased amount implies to have more risk of death. So the TYP=1(Emergency admission) increases the risk of death 12.00094 times higher than TYP=0(elctive admission), and LOC has much higher hazard rates than TYP. The increase of one level of LOC increases the risk of death 53.27601 higher(Patients with No Coma VS Patients with Stupor or Patients with Stupor VS Patients with Coma).

The next simulation of the survival function starts with defining the range of TYP variable(TYP=0, TYP=1). Then simulation shows that simuated hazard ratio distribution “s.surv\(qi\)hr” returns the estimation of the hazard ratio for the TYP variable. So the result shows that mean hazard rate is 11.5919 which implies that the patients with emergency admission has 11.5919 times higher risk of death than the patiens with elective admission. This results is very close to the hazard rate estimated in the PH model using Zelig.

The plots also visualizes the simulated survival function and estimated cumulated hazard and hazard rate over time. The plots show the typical behavior of survival function. As time goes by, the survival rate falls to 0, the cumulative hazard increases while the hazard rate being constant over time, which is required by the Cox PH model assumption.

x.survlow1 <- setx(z.surv, TYP=0)
x.survhigh1 <- setx(z.surv, TYP=1)
s.surv <- sim(z.surv, x=x.survlow1, x1=x.survhigh1)
summary(s.surv$qi$hr)

##        1           
##  Min.   :  0.6441  
##  1st Qu.:  4.6084  
##  Median :  8.0260  
##  Mean   : 11.5919  
##  3rd Qu.: 14.0998  
##  Max.   :209.0712

plot(s.surv)

References

Hlavac, Marek(2014). 2014. Stargazer:LaTex Code and ASCII Text for Well-Formatted Regression and Summary Staistics Tables. http://CRAN.R-project.org/package=stargazer.

Hosmer Jr, Stanley Lemeshow, David W., and Rodney X. Sturdivant. 2013. Applied Logistic Regression. John Wiley & Sons.

King, G., M. Tomz, and J. Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 341–55.

Kosuke Imai, Gary King, and Olivia Lau. 2015. Logit: Logistic Regression for Dichotomous Dependent Variables. http://gking.harvard.edu/zelig.

Zelner, Bennet A. 2009. “Using Simulation to Interpret Results from Logit, Probit, and Other Nonlinear Models.” Strategic Management Journal.