Math 536 Exam 2

Problem/Task Description

We are interested in the risks of anastomotic leaking following a colectomy associated with unhealthy weight levels (measured by BMI). Meredith Grey has contacted us to
1.) articulate the risks associated with BMI
2.) identify any other potentially importatnt predictors of anastomotic leaking following a colectomy
3.) analyze two case studies (two patients) with two different sets of health data predictors and graphically represent the dangers associated with obesity.

We first examine the distribution of BMI in our patients. Note that per CDC guidelines, BMI greater than or equal to 30 is obese.

Before building a model that will include BMI and the other predictor variables, we will first partition our data into patients who are not obese (Underweight, Normal, and Overweight per CDC guidelines) and those who are obese.

Figure 2

As we can see above in Figure 2, patients with BMI greater than or 30 (CDC guideline for obesity) experienced a higher rate of anastomotic leak than those who were not considered obese. This suggests that BMI (in particular BMI greater than or equal to 30) increases the risk of anastomotic leak following a colectomy.

We now will attempt to create a logistic regression model based on the predictors. We will first start with full model that incorporates all the predictors and then run R’s step() function to determine the predictors that are most important and that give the lowest AIC score.

model.all = glm(Anastamotic.Leak~Gender + BMI + Age + Race + Tobacco + DM + CAD.PAD + Cancer + Albumin..g.dL. + Operative.Length, family='binomial')
step(model.all)

After execution, the step() function decides on the following model.

model.AIC = glm(Anastamotic.Leak~Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + Operative.Length, family='binomial')
summary(model.AIC)

## 
## Call:
## glm(formula = Anastamotic.Leak ~ Gender + BMI + Age + Tobacco + 
##     DM + Albumin..g.dL. + Operative.Length, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7600  -0.5141  -0.3220  -0.1457   2.4042  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.53272    1.89872  -2.914 0.003569 ** 
## Gender           -0.77174    0.49998  -1.544 0.122699    
## BMI               0.08909    0.03041   2.930 0.003389 ** 
## Age               0.08276    0.02350   3.522 0.000428 ***
## Tobacco           0.78186    0.53601   1.459 0.144655    
## DM               -0.86507    0.63472  -1.363 0.172911    
## Albumin..g.dL.   -1.38995    0.35815  -3.881 0.000104 ***
## Operative.Length  7.71134    3.46733   2.224 0.026148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 148.35  on 178  degrees of freedom
## Residual deviance: 113.93  on 171  degrees of freedom
## AIC: 129.93
## 
## Number of Fisher Scoring iterations: 6

In other words, via interpretation of the log odds of a logistic regression model, we have \[ \hat{p}^{*} = \frac{\exp(X^{*}\beta)}{1+\exp(X^{*}\beta)} \] where \(X^{*} = [1, Gender, BMI, Age, Tobacco, DM, Albumin, Operative Length]\) and \(\beta = [-5.53273, -0.77174, 0.08909, 0.08276, 0.78186, -0.86507, -1.38995, 7.71134]\).

For more details on the variable selection see Appendix: Variable Selection. For more details on the validity of this model and residual diagnostics see Appendix: Residual Diagnostics

In particular, the coefficient for BMI is 0.08909. We interpret this coefficient as follows.

Since \(\exp(0.08909) = 1.093179\), on average, a one unit increase in BMI results in a 9.3179% increase in odds of an anastomotic leak following a colectomy. This agrees with our earlier statement that obese patients had a higher rate of anastomitic leak than non-obese patients.

Furthermore, from this logistic regression model we see that the following variables were deemed important.
-Gender
-BMI
-Tobacco
-DM
-Albumin
-Operative Length

This means that the following variables were deemed unimportant
-Race
-CAD/PAD
-Cancer

These variables may have been highly correlated with the others and so the information they were providing the model was already embedded in the important variables. For example, CAD/PAD may have been embedded in Tobacco use since Tobacco use has association with higher rates of Coronary or Pulminary Artery Disease. The same could be said for Cancer.

This concludes task (1.) and task (2.). We have identified important predictors. We have identified unimportant predictors. We have identified that a higher BMI increases the odds of an anastomotic leak via both a logistic regression model and through a simple analysis of the sample proportions of obese vs non-obese patients.

Task (3.) Case Studies 1 and 2

Case Study 1 Ms. Arizona Robbins

Figure 3

As we can see in Figure 3 above, Ms. Arizona Robbins has a low probability of having an anastomotic leak post colectomy. Given Arizona’s attributes, it is unlikely that she had a BMI that exceeded the obesity threshold of 30. Thus, our 95% prediction intervals for Arizona are in the range of (0 - 0.03) for BMI less than 30.

That being said, it is clear that as BMI exceeds the obesity threshold the probability (especially the upper bounds on our 95% prediction intervals) enter into a territory that is no longer just highly unlikely. For example, a BMI of 50 has a range of probabilities between 0 and 0.08.

Case Study 2 Mr. Richard Webber

Figure 4

As we can see in Figure 4 above, Mr. Richard Webber exhibits between a 0.05 and 0.62 probability of anastomotic leak. This cann be explained by Richard’s set of attributes which include Diabetes, Tobacco use, older Age, etc.

Again, just like in Arizona’s graph, it is clear that exceeding the obesity threshold presents a much higher probability than if a patient were to stay below.

Appendix Variable Selection

As stated earlier in the report we begin with all predictor variables in the model and use the step() function in R to determine the important predictors based on AIC.

model.all = glm(Anastamotic.Leak~Gender + BMI + Age + Race + Tobacco + DM + CAD.PAD + Cancer + Albumin..g.dL. + Operative.Length, family='binomial')
step(model.all)

## Start:  AIC=133.56
## Anastamotic.Leak ~ Gender + BMI + Age + Race + Tobacco + DM + 
##     CAD.PAD + Cancer + Albumin..g.dL. + Operative.Length
## 
##                    Df Deviance    AIC
## - Race              1   111.77 131.77
## - CAD.PAD           1   112.47 132.47
## - Cancer            1   112.61 132.60
## - DM                1   113.43 133.43
## <none>                  111.56 133.56
## - Gender            1   114.16 134.16
## - Tobacco           1   114.99 134.99
## - Operative.Length  1   115.32 135.32
## - BMI               1   120.51 140.51
## - Age               1   123.62 143.62
## - Albumin..g.dL.    1   125.75 145.75
## 
## Step:  AIC=131.77
## Anastamotic.Leak ~ Gender + BMI + Age + Tobacco + DM + CAD.PAD + 
##     Cancer + Albumin..g.dL. + Operative.Length
## 
##                    Df Deviance    AIC
## - CAD.PAD           1   112.59 130.59
## - Cancer            1   112.89 130.89
## - DM                1   113.51 131.51
## <none>                  111.77 131.77
## - Gender            1   114.36 132.36
## - Tobacco           1   115.07 133.07
## - Operative.Length  1   116.01 134.01
## - BMI               1   120.51 138.51
## - Age               1   123.62 141.62
## - Albumin..g.dL.    1   125.88 143.88
## 
## Step:  AIC=130.59
## Anastamotic.Leak ~ Gender + BMI + Age + Tobacco + DM + Cancer + 
##     Albumin..g.dL. + Operative.Length
## 
##                    Df Deviance    AIC
## - Cancer            1   113.93 129.93
## - DM                1   114.46 130.46
## <none>                  112.59 130.59
## - Gender            1   114.67 130.67
## - Tobacco           1   115.57 131.57
## - Operative.Length  1   116.75 132.75
## - BMI               1   120.68 136.68
## - Age               1   123.63 139.63
## - Albumin..g.dL.    1   126.10 142.10
## 
## Step:  AIC=129.93
## Anastamotic.Leak ~ Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + 
##     Operative.Length
## 
##                    Df Deviance    AIC
## <none>                  113.93 129.93
## - DM                1   115.95 129.95
## - Tobacco           1   116.15 130.15
## - Gender            1   116.39 130.39
## - Operative.Length  1   118.71 132.71
## - BMI               1   122.49 136.49
## - Age               1   128.93 142.93
## - Albumin..g.dL.    1   130.12 144.12

## 
## Call:  glm(formula = Anastamotic.Leak ~ Gender + BMI + Age + Tobacco + 
##     DM + Albumin..g.dL. + Operative.Length, family = "binomial")
## 
## Coefficients:
##      (Intercept)            Gender               BMI               Age  
##         -5.53272          -0.77174           0.08909           0.08276  
##          Tobacco                DM    Albumin..g.dL.  Operative.Length  
##          0.78186          -0.86507          -1.38995           7.71134  
## 
## Degrees of Freedom: 178 Total (i.e. Null);  171 Residual
## Null Deviance:       148.3 
## Residual Deviance: 113.9     AIC: 129.9

We will run a cross-validation simulation to further prove that this reduction in variables results in a better model. Our objective measure of fit is the log-likelihood.

CV.LL.full = rep(0, 1000)
for (j in 1:1000) {
  CV.LL = 0
  folds = createFolds(1:length(sub.data$Anastamotic.Leak), k=10)
  for(i in 1:10){
    train.data = sub.data[-folds[[i]],]
    test.data = sub.data[folds[[i]],]
    model.lreg = glm(Anastamotic.Leak~Gender + BMI + Age + Race + Tobacco + DM + CAD.PAD + Cancer + Albumin..g.dL. + Operative.Length, family='binomial', data = train.data)
    log.odd = predict(model.lreg,newdata=test.data)
    p.hat = exp(log.odd)/(1+exp(log.odd))
    y = test.data$Anastamotic.Leak
    CV.LL = CV.LL + sum(y*log(p.hat) +  (1-y)*log(1-p.hat))
  }
  CV.LL.full[j] = CV.LL
}

CV.LL.AIC = rep(0, 1000)
for (j in 1:1000) {
  CV.LL = 0
  folds = createFolds(1:length(sub.data$Anastamotic.Leak), k=10)
  for(i in 1:10){
    train.data = sub.data[-folds[[i]],]
    test.data = sub.data[folds[[i]],]
    model.lreg = glm(Anastamotic.Leak~Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + Operative.Length, family='binomial', data = train.data)
    log.odd = predict(model.lreg,newdata=test.data)
    p.hat = exp(log.odd)/(1+exp(log.odd))
    y = test.data$Anastamotic.Leak
    CV.LL = CV.LL + sum(y*log(p.hat) +  (1-y)*log(1-p.hat))
  }
  CV.LL.AIC[j] = CV.LL
}

mean(CV.LL.full)

## [1] -71.77379

mean(CV.LL.AIC)

## [1] -67.65915

As we can see, the log-likelihood of the full model is less than the log-likelihood of the reduced model.

Appendix Residual Diagnostics

We will look at the residual plots of our model and assess the validity of model assumptions.

model.AIC = glm(Anastamotic.Leak~Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + Operative.Length, family='binomial')
binnedplot(fitted(model.AIC), residuals(model.AIC, type='response'), nclass=NULL, xlab='Expected Probabilities', ylab='Average residual', main = 'Binned residual plot')

#binnedplot(sub.data$BMI, residuals(model.AIC, type='response'), xlab='BMI', ylab='Average Residual', main='Binned residual plot BMI')

Looking at the Figure above (Binned residual plot), we see that the model struggles below 0.1, all of which have negative average residual. This means that in these bins the model is predicting a higher average rate of anastomotic leak than is actually the case. While this seems troubling, we are mostly focused on those patients who are high at risk. Thus, the fact that our model is conservative and over-predicting patients in the very low probability (less than 0.10) is not an issue.

Appendix Bootstrapping a Prediction Interval

See below code for bootstrapping a prediction interval for each \(\hat{p}^{*}\) that we predict for the range of BMI values for both case studies.

orig.model = model.AIC
#bootstrap Arizona
numsim = 1000
phat.mat = matrix(-10, nrow=30, ncol = numsim)

arizona.df = data.frame(Gender = rep(1,30), BMI=BMI.seq, Age=rep(35,30), Tobacco=rep(0,30), DM=rep(0,30), Albumin..g.dL.=rep(4.2,30), Operative.Length=rep(0.06025,30))
for(j in 1:numsim) {
  sample.index = sample(1:nrow(sub.data), replace=TRUE)
  x.bs = sub.data[sample.index, 1:10]
  
  log.odds.bs = predict(orig.model, newdata = x.bs)
  prob.bs = exp(log.odds.bs)/(1 + exp(log.odds.bs))
  y.bs = rep(0,nrow(sub.data))
  for(i in 1:nrow(sub.data)){y.bs[i] = sample(c(1,0),1,prob=c(prob.bs[i],1-prob.bs[i]))}
  model.bs = glm(y.bs~Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + Operative.Length,family='binomial', data=x.bs)
  log.odds.arizona = predict.glm(model.bs, newdata=arizona.df)
  phat.mat[,j] = exp(log.odds.arizona)/(1+exp(log.odds.arizona))
}

arizona.probs = data.frame(point.est = rep(-10,30), lower.est=rep(-10,30), upper.est=rep(-10,30))
arizona.logs = predict.glm(orig.model, newdata = arizona.df)
arizona.probs$point.est = exp(arizona.logs)/(1+exp(arizona.logs))
arizona.probs$lower.est = apply(phat.mat, 1, quantile, probs=c(0.025))
arizona.probs$upper.est = apply(phat.mat, 1, quantile, probs=c(0.975))
plot(1, main = 'Case Study 1 Ms. Arizona Robbins', type='n', xlab='BMI', xlim=c(min.BMI, max.BMI), ylim=c(0,0.3),  ylab='Probability of Anast Leak')
lines(BMI.seq, arizona.probs$point.est, col='green')
lines(BMI.seq, arizona.probs$lower.est, col='purple')
lines(BMI.seq, arizona.probs$upper.est, col='blue')
legend('topleft', legend=c("Upper", "Point Estimate", "Lower"), text.col=c('blue', 'green', 'purple'))


#----------------------------------------------------------------
orig.model = model.AIC
#bootstrap Richard
numsim = 1000
phat.mat = matrix(-10, nrow=30, ncol = numsim)

richard.df = data.frame(Gender = rep(0,30), BMI=BMI.seq, Age=rep(62,30), Tobacco=rep(1,30), DM=rep(1,30), Albumin..g.dL.=rep(2.8,30), Operative.Length=rep(210/1440,30))
for(j in 1:numsim) {
  sample.index = sample(1:nrow(sub.data), replace=TRUE)
  x.bs = sub.data[sample.index, 1:10]
  
  log.odds.bs = predict(orig.model, newdata = x.bs)
  prob.bs = exp(log.odds.bs)/(1 + exp(log.odds.bs))
  y.bs = rep(0,nrow(sub.data))
  for(i in 1:nrow(sub.data)){y.bs[i] = sample(c(1,0),1,prob=c(prob.bs[i],1-prob.bs[i]))}
  model.bs = glm(y.bs~Gender + BMI + Age + Tobacco + DM + Albumin..g.dL. + Operative.Length,family='binomial', data=x.bs)
  log.odds.richard = predict.glm(model.bs, newdata=richard.df)
  phat.mat[,j] = exp(log.odds.richard)/(1+exp(log.odds.richard))
}

richard.probs = data.frame(point.est = rep(-10,30), lower.est=rep(-10,30), upper.est=rep(-10,30))
richard.logs = predict.glm(orig.model, newdata = richard.df)
richard.probs$point.est = exp(richard.logs)/(1+exp(richard.logs))
richard.probs$lower.est = apply(phat.mat, 1, quantile, probs=c(0.025))
richard.probs$upper.est = apply(phat.mat, 1, quantile, probs=c(0.975))
plot(1, main = 'Case Study 2 Mr. Richard Webber', type='n', xlab='BMI', xlim=c(min.BMI, max.BMI), ylim=c(0,1),  ylab='Probability of Anast Leak')
lines(BMI.seq, richard.probs$point.est, col='green')
lines(BMI.seq, richard.probs$lower.est, col='purple')
lines(BMI.seq, richard.probs$upper.est, col='blue')
legend('topleft', legend=c("Upper", "Point Estimate", "Lower"), text.col=c('blue', 'green', 'purple'))
abline(v=30, col='red')