Model fitting

First, I will fit my unadjusted model. I use ordinal logistic regression (the polr function from the MASS package, MASS::polr) because I am regressing onto an ordinal outcome variable (dindo) with levels 0 through 5.

unadjfit <- MASS::polr(dindo ~ hxcopd, data = df_m)
unadjfit.or <- exp(cbind(coef(unadjfit), t(confint(unadjfit))))
#> Waiting for profiling to be done...
#> 
#> Re-fitting to get Hessian
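
For reference, here is a minimal, self-contained sketch of the same kind of fit on simulated data. The toy data frame, the seed, the cut-points, and the effect size of 0.4 are all assumptions for illustration and are not taken from df_m. It shows that polr expects the response to be an ordered factor, and that passing Hess = TRUE stores the Hessian so that summary() does not need to re-fit the model.

```r
# Simulated stand-in for the matched data (NOT the real df_m)
set.seed(2024)
n <- 500
toy <- data.frame(hxcopd = rbinom(n, 1, 0.3) == 1)

# Latent-variable construction: a logistic error plus a shift of 0.4 for COPD
latent <- 0.4 * toy$hxcopd + rlogis(n)

# polr requires a factor response; cut() builds an ordered factor here
toy$dindo <- cut(latent, breaks = c(-Inf, 1, 2, 3, Inf),
                 labels = 0:3, ordered_result = TRUE)

# Hess = TRUE keeps the Hessian, which summary() and confint() rely on
fit <- MASS::polr(dindo ~ hxcopd, data = toy, Hess = TRUE)

# Odds ratio with profile-likelihood 95% confidence limits
or_table <- exp(cbind(OR = coef(fit), t(confint(fit))))
print(or_table)
```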

The model is summarized:

summary(unadjfit)
#> 
#> Re-fitting to get Hessian
#> Call:
#> MASS::polr(formula = dindo ~ hxcopd, data = df_m)
#> 
#> Coefficients:
#>             Value Std. Error t value
#> hxcopdTRUE 0.4003     0.1323   3.027
#> 
#> Intercepts:
#>     Value   Std. Error t value
#> 0|1  3.6128  0.1018    35.4742
#> 1|2  3.7429  0.1045    35.8053
#> 2|3  4.2549  0.1181    36.0240
#> 3|4  4.9046  0.1443    33.9888
#> 4|5  6.0542  0.2276    26.6017
#> 
#> Residual Deviance: 2887.225 
#> AIC: 2899.225

polr fits the ordinal logistic regression parameterized as:

\[logit (P(Y \le j)) = \beta_{j0} - \eta_{1}x_1 - \cdots - \eta_{p} x_p\]

Due to the parallel lines assumption, even though we have six categories, the coefficient of COPD (hxcopd) stays the same across the five cumulative cut-points (j = 0 through 4). The two equations for hxcopd = 1 and hxcopd = 0 are

\[ \begin{eqnarray} logit (P(Y \le j | x_1=1)) & = & \beta_{j0} - \eta_{1} \\ logit (P(Y \le j | x_1=0)) & = & \beta_{j0} \end{eqnarray} \]

Therefore:

\[logit (P(Y \le j |x_1=1)) - logit (P(Y \le j |x_1=0)) = - \eta_{1}.\]
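
This identity can be checked numerically. The sketch below uses simulated data (the toy data frame, seed, and cut-points are assumptions, not the study data), fits polr, and compares the difference in cumulative logits at every cut-point with the reported coefficient.

```r
# Simulated data with one binary covariate (illustrative only)
set.seed(7)
n <- 1000
toy <- data.frame(x1 = rbinom(n, 1, 0.4))
toy$y <- cut(0.4 * toy$x1 + rlogis(n), breaks = c(-Inf, 0, 1, 2, Inf),
             labels = 0:3, ordered_result = TRUE)

fit <- MASS::polr(y ~ x1, data = toy, Hess = TRUE)

# Fitted category probabilities for x1 = 0 (row 1) and x1 = 1 (row 2)
probs <- predict(fit, newdata = data.frame(x1 = c(0, 1)), type = "probs")

# Cumulative probabilities P(Y <= j); the last column is always 1, so drop it
cum <- t(apply(probs, 1, cumsum))[, 1:3]

# logit P(Y <= j | x1 = 1) - logit P(Y <= j | x1 = 0), one value per cut-point
logit_diff <- qlogis(cum[2, ]) - qlogis(cum[1, ])

eta_hat <- coef(fit)[["x1"]]
print(rbind(logit_diff = logit_diff, minus_eta = -eta_hat))
```

Every entry of logit_diff should equal the negative of the coefficient that polr reports, which is the sign convention discussed here.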

Normally, I could get the odds ratio by exponentiating both sides of this equation and using \(log(b)-log(a) = log(b/a)\):

\[\frac{P(Y \le j |x_1=1)}{P(Y>j|x_1=1)} / \frac{P(Y \le j |x_1=0)}{P(Y>j|x_1=0)} = exp( -\eta_{1}).\]

Writing \(p_1 = P(Y \le j | x_1=1)\) and \(p_0 = P(Y \le j | x_1=0)\), and noting that \(P(Y>j) = 1 - P(Y \le j)\), the two cumulative odds simplify to:

\[\frac{P(Y \le j |x_1=1)}{P(Y>j|x_1=1)} = p_1 / (1-p_1) \]

\[\frac{P(Y \le j |x_1=0)}{P(Y>j|x_1=0)} = p_0 / (1-p_0)\]

The odds ratio is defined as:

\[\frac{p_1 / (1-p_1) }{p_0 / (1-p_0)} = exp( -\eta_{1}).\]

However, the coefficient reported by summary(unadjfit) is actually \(\eta\), not \(-\eta\). So I will have to flip the odds ratio:

Since \(exp(-\eta_{1}) = \frac{1}{exp(\eta_{1})}\),

\[exp(\eta_{1}) = \frac{p_0 / (1-p_0) }{p_1 / (1-p_1)}.\]

From the output, \(\hat{\eta}_1=0.4003126\), so the odds ratio \(exp(\hat{\eta}_1)=1.4922911\) is actually \(\frac{p_0 / (1-p_0) }{p_1 / (1-p_1)}\).

This can be interpreted as “patients without COPD have 1.4922911 times the odds of being in a category \(\leq j\) vs. \(>j\) when compared to patients with COPD” for any cut-point \(j \in \{0,1,2,3,4\}\) (for \(j = 5\) the probability \(P(Y \le j)\) is 1, so the odds are not defined).

To make this more interpretable, you can switch around some of the signs:

\[ \begin{eqnarray} exp(-\eta_{1}) & = & \frac{p_1 / (1-p_1)}{p_0/(1-p_0)} \\ & = & \frac{p_1 (1-p_0)}{p_0(1-p_1)} \\ & = & \frac{(1-p_0)/p_0}{(1-p_1)/p_1} \\ & = & \frac{P (Y >j | x_1=0)/P(Y \le j|x_1=0)}{P(Y > j | x_1=1)/P(Y \le j | x_1=1)}. \end{eqnarray} \]

Since \(exp(-\eta_{1}) = \frac{1}{exp(\eta_{1})}\),

\[\frac{P (Y >j | x_1=1)/P(Y \le j|x_1=1)}{P(Y > j | x_1=0)/P(Y \le j | x_1=0)} = exp(\eta_{1}).\]

Instead of interpreting the odds of being in a category \(\leq j\), we can interpret the odds of being in a category \(>j\): “patients with COPD have 1.4922911 times the odds of being in a category \(>j\) compared to patients without COPD.”
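
The two readings are reciprocals of one another, which is quick to confirm from the reported coefficient alone. In this small sketch the names eta_hat, or_higher, and or_lower are mine; only the value 0.4003126 comes from the model output.

```r
# Coefficient for hxcopdTRUE as reported by summary(unadjfit)
eta_hat <- 0.4003126

# COPD vs. no COPD, odds of being in a higher category (Y > j)
or_higher <- exp(eta_hat)

# COPD vs. no COPD, odds of being in a lower category (Y <= j)
or_lower <- exp(-eta_hat)

print(c(or_higher = or_higher, or_lower = or_lower))
```

The first value reproduces the 1.4922911 quoted in the text, and the second is roughly 0.67, its reciprocal.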

The 95% confidence interval can be easily calculated:

exp(confint(unadjfit))
#> Waiting for profiling to be done...
#> 
#> Re-fitting to get Hessian
#>    2.5 %   97.5 % 
#> 1.153489 1.938509
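
confint() on a polr fit uses profile likelihood. As a rough cross-check, a Wald interval can be computed by hand from the estimate and standard error printed by summary(unadjfit). This is only a sketch of the Wald approximation, not what confint() does internally, and the variable names are my own.

```r
# Estimate and standard error for hxcopdTRUE, copied from summary(unadjfit)
est <- 0.4003
se  <- 0.1323

# Wald 95% interval on the log-odds scale, then exponentiated
z <- qnorm(0.975)
wald_ci <- exp(est + c(-1, 1) * z * se)
names(wald_ci) <- c("2.5 %", "97.5 %")
print(wald_ci)
```

This gives roughly 1.15 to 1.93, very close to the profile-likelihood interval of 1.153 to 1.939, which is what I would expect with a well-behaved likelihood and a sample of this size.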

In order to fit the adjusted model, I start by supplying all the variables included in the match:

adjfit <- MASS::polr(dindo ~ hxcopd + sex + race + age + diabetes + smoke + dyspnea + fnstatus2 + ascites + hxchf + hypermed + renafail + dialysis + steroid + bleeddis + wtloss + lap, data = df_m)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(adjfit)
#> 
#> Re-fitting to get Hessian
#> Call:
#> MASS::polr(formula = dindo ~ hxcopd + sex + race + age + diabetes + 
#>     smoke + dyspnea + fnstatus2 + ascites + hxchf + hypermed + 
#>     renafail + dialysis + steroid + bleeddis + wtloss + lap, 
#>     data = df_m)
#> 
#> Coefficients:
#>                                             Value Std. Error    t value
#> hxcopdTRUE                                0.42863  1.342e-01  3.193e+00
#> sexTRUE                                  -0.06410  1.940e-01 -3.303e-01
#> raceAsian                                -1.00369  1.245e+00 -8.063e-01
#> raceBlack                                 0.22738  1.036e+00  2.195e-01
#> raceNative Hawaiian or Pacific islander -10.96370  6.278e-05 -1.746e+05
#> raceWhite                                -0.10687  1.021e+00 -1.046e-01
#> age                                       0.02040  6.960e-03  2.931e+00
#> diabetesTRUE                              0.57178  1.670e-01  3.424e+00
#> smokeTRUE                                 0.16700  1.515e-01  1.103e+00
#> dyspneaTRUE                               0.17248  1.399e-01  1.233e+00
#> fnstatus2Partially dependent              0.78438  3.369e-01  2.328e+00
#> fnstatus2Totally dependent              -14.00459  8.157e-09 -1.717e+09
#> ascitesTRUE                               1.82496  5.805e-01  3.144e+00
#> hxchfTRUE                                 0.99521  2.560e-01  3.888e+00
#> hypermedTRUE                              0.04481  1.508e-01  2.972e-01
#> renafailTRUE                              0.35707  1.070e+00  3.338e-01
#> dialysisTRUE                              0.98040  3.492e-01  2.807e+00
#> steroidTRUE                               0.07941  2.370e-01  3.350e-01
#> bleeddisTRUE                              0.80356  2.006e-01  4.007e+00
#> wtlossTRUE                                0.97155  4.148e-01  2.342e+00
#> lapTRUE                                   0.24056  1.564e-01  1.538e+00
#> 
#> Intercepts:
#>     Value         Std. Error    t value      
#> 0|1  5.421500e+00  1.156200e+00  4.689200e+00
#> 1|2  5.554000e+00  1.156400e+00  4.802700e+00
#> 2|3  6.074700e+00  1.157800e+00  5.246600e+00
#> 3|4  6.732100e+00  1.160900e+00  5.799100e+00
#> 4|5  7.888900e+00  1.174300e+00  6.718100e+00
#> 
#> Residual Deviance: 2786.892 
#> AIC: 2838.892

However, note the warning. The level of fnstatus2 called Totally dependent has only 5 records, and each of these records has the same Clavien-Dindo classification of 0. This leads to complete separation of fnstatus2Totally dependent with respect to the outcome variable dindo, so the maximum likelihood estimate for that coefficient does not exist. Another hint is the large coefficient of -14.0045892 paired with an implausibly small standard error (8.157e-09) and an enormous t value, which signal that the Hessian is unreliable for this term. The raceNative Hawaiian or Pacific islander coefficient (-10.96370, standard error 6.278e-05) shows the same pattern, so that level is probably sparse as well.
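
A cross-tabulation of the covariate against the outcome is a simple way to spot this kind of separation. The sketch below uses a small made-up data set (the counts are assumptions chosen to mimic the situation described, not the real df_m) and flags any covariate level whose records all fall into a single outcome category.

```r
# Made-up data: 5 "Totally dependent" records, all with outcome 0
toy <- data.frame(
  fnstatus2 = factor(c(rep("Independent", 60),
                       rep("Partially dependent", 20),
                       rep("Totally dependent", 5))),
  dindo = factor(c(rep(0:2, length.out = 60),
                   rep(0:2, length.out = 20),
                   rep(0, 5)),
                 levels = 0:2, ordered = TRUE)
)

# Rows are covariate levels, columns are outcome categories
tab <- table(toy$fnstatus2, toy$dindo)
print(tab)

# Levels that occupy exactly one outcome category are candidates for separation
separated <- rownames(tab)[rowSums(tab > 0) == 1]
print(separated)
```

On the real data the equivalent call would be table(df_m$fnstatus2, df_m$dindo), and the same check could be repeated for race.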

We can resolve this complete separation by collapsing fnstatus2 into two levels (“Independent” and “Dependent”):

df_m$fnstatus2 <- factor(df_m$fnstatus2)
levels(df_m$fnstatus2) <- list(Independent = "Independent", "Dependent" = c("Partially dependent", "Totally dependent"))
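
To see what this levels()<- assignment with a named list does, here is the same idiom on a four-element toy factor (the vector f is made up for illustration):

```r
# Toy factor with the three original levels
f <- factor(c("Independent", "Partially dependent",
              "Totally dependent", "Independent"))

# Each list name becomes a new level; its values are the old levels it absorbs
levels(f) <- list(Independent = "Independent",
                  Dependent = c("Partially dependent", "Totally dependent"))

print(table(f))
```

Both dependent levels end up in a single Dependent level, and Independent stays first, so it remains the reference category in the model.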

adjfit <- MASS::polr(dindo ~ hxcopd + sex + race + age + diabetes + smoke + dyspnea + fnstatus2 + ascites + hxchf + hypermed + renafail + dialysis + steroid + bleeddis + wtloss + lap, data = df_m)

summary(adjfit)
#> 
#> Re-fitting to get Hessian
#> Call:
#> MASS::polr(formula = dindo ~ hxcopd + sex + race + age + diabetes + 
#>     smoke + dyspnea + fnstatus2 + ascites + hxchf + hypermed + 
#>     renafail + dialysis + steroid + bleeddis + wtloss + lap, 
#>     data = df_m)
#> 
#> Coefficients:
#>                                             Value Std. Error    t value
#> hxcopdTRUE                                0.42908  1.342e-01  3.197e+00
#> sexTRUE                                  -0.06567  1.940e-01 -3.385e-01
#> raceAsian                                -1.01655  1.245e+00 -8.166e-01
#> raceBlack                                 0.22811  1.036e+00  2.202e-01
#> raceNative Hawaiian or Pacific islander -10.71234  8.076e-05 -1.326e+05
#> raceWhite                                -0.10699  1.021e+00 -1.048e-01
#> age                                       0.02040  6.956e-03  2.932e+00
#> diabetesTRUE                              0.57169  1.670e-01  3.424e+00
#> smokeTRUE                                 0.16755  1.515e-01  1.106e+00
#> dyspneaTRUE                               0.17326  1.398e-01  1.239e+00
#> fnstatus2Dependent                        0.75604  3.359e-01  2.251e+00
#> ascitesTRUE                               1.82831  5.803e-01  3.150e+00
#> hxchfTRUE                                 0.99739  2.559e-01  3.897e+00
#> hypermedTRUE                              0.04599  1.508e-01  3.050e-01
#> renafailTRUE                              0.35625  1.070e+00  3.330e-01
#> dialysisTRUE                              0.98028  3.492e-01  2.807e+00
#> steroidTRUE                               0.08093  2.370e-01  3.415e-01
#> bleeddisTRUE                              0.80512  2.005e-01  4.015e+00
#> wtlossTRUE                                0.97516  4.147e-01  2.352e+00
#> lapTRUE                                   0.24020  1.564e-01  1.536e+00
#> 
#> Intercepts:
#>     Value        Std. Error   t value     
#> 0|1       5.4216       1.1561       4.6896
#> 1|2       5.5541       1.1564       4.8031
#> 2|3       6.0748       1.1578       5.2470
#> 3|4       6.7322       1.1608       5.7995
#> 4|5       7.8889       1.1742       6.7186
#> 
#> Residual Deviance: 2787.425 
#> AIC: 2837.425

To look for a more parsimonious model, I use step-wise model selection by AIC in both the backward and forward directions:

adjfit.step <- MASS::stepAIC(adjfit, direction = "both", trace = FALSE)

summary(adjfit.step)
#> 
#> Re-fitting to get Hessian
#> Call:
#> MASS::polr(formula = dindo ~ hxcopd + age + diabetes + fnstatus2 + 
#>     ascites + hxchf + dialysis + bleeddis + wtloss, data = df_m)
#> 
#> Coefficients:
#>                      Value Std. Error t value
#> hxcopdTRUE         0.42097   0.133888   3.144
#> age                0.01624   0.006175   2.631
#> diabetesTRUE       0.57029   0.164486   3.467
#> fnstatus2Dependent 0.77247   0.336009   2.299
#> ascitesTRUE        1.73700   0.580973   2.990
#> hxchfTRUE          1.04844   0.250773   4.181
#> dialysisTRUE       1.00015   0.342407   2.921
#> bleeddisTRUE       0.81298   0.198346   4.099
#> wtlossTRUE         0.97523   0.415840   2.345
#> 
#> Intercepts:
#>     Value   Std. Error t value
#> 0|1  5.0570  0.4572    11.0611
#> 1|2  5.1893  0.4578    11.3346
#> 2|3  5.7091  0.4612    12.3777
#> 3|4  6.3653  0.4687    13.5808
#> 4|5  7.5204  0.5007    15.0188
#> 
#> Residual Deviance: 2799.438 
#> AIC: 2827.438
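
The AIC values in these summaries can be reproduced from the residual deviance, since for these fits AIC is the deviance plus twice the number of estimated parameters (coefficients plus the five intercepts). The helper aic_from_deviance below is a hypothetical convenience function of mine; the deviances and parameter counts are read off the printed summaries.

```r
# AIC = residual deviance + 2 * (number of coefficients + number of intercepts)
aic_from_deviance <- function(deviance, n_coef, n_intercepts = 5) {
  deviance + 2 * (n_coef + n_intercepts)
}

aic_unadj <- aic_from_deviance(2887.225, n_coef = 1)   # hxcopd only
aic_full  <- aic_from_deviance(2787.425, n_coef = 20)  # full model, fnstatus2 collapsed
aic_step  <- aic_from_deviance(2799.438, n_coef = 9)   # model chosen by stepAIC

print(c(unadjusted = aic_unadj, full = aic_full, stepwise = aic_step))
```

The step-wise model has a slightly higher deviance than the full model but eleven fewer parameters, which is why its AIC is lower.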

Exponentiating the coefficients to obtain odds ratios:

broom::tidy(adjfit.step, conf.int = TRUE, p.values = TRUE, exponentiate = TRUE)
#> 
#> Re-fitting to get Hessian
#> # A tibble: 14 x 8
#>    term       estimate std.error statistic conf.low conf.high  p.value coef.type
#>    <chr>         <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl> <chr>    
#>  1 hxcopdTRUE     1.52   0.134        3.14     1.17      1.99  1.51e-3 coeffici~
#>  2 age            1.02   0.00618      2.63     1.00      1.03  7.67e-3 coeffici~
#>  3 diabetesT~     1.77   0.164        3.47     1.27      2.42  9.76e-4 coeffici~
#>  4 fnstatus2~     2.17   0.336        2.30     1.06      4.01  3.49e-2 coeffici~
#>  5 ascitesTR~     5.68   0.581        2.99     1.57     16.1   1.14e-2 coeffici~
#>  6 hxchfTRUE      2.85   0.251        4.18     1.70      4.56  1.72e-4 coeffici~
#>  7 dialysisT~     2.72   0.342        2.92     1.32      5.09  8.66e-3 coeffici~
#>  8 bleeddisT~     2.25   0.198        4.10     1.50      3.28  1.60e-4 coeffici~
#>  9 wtlossTRUE     2.65   0.416        2.35     1.09      5.66  3.29e-2 coeffici~
#> 10 0|1          157.     0.457       11.1     NA        NA    NA       scale    
#> 11 1|2          179.     0.458       11.3     NA        NA    NA       scale    
#> 12 2|3          302.     0.461       12.4     NA        NA    NA       scale    
#> 13 3|4          581.     0.469       13.6     NA        NA    NA       scale    
#> 14 4|5         1845.     0.501       15.0     NA        NA    NA       scale

The adjusted odds ratio of 1.5234386 is very similar to the unadjusted odds ratio of 1.4922911. This is expected: the matching balanced the covariates between the two groups, so adjusting for them changes the estimate very little. Nonetheless, ordinal logistic regression uses the ordering of the outcome categories, which generally gives it more power than a chi-square test, and unlike a chi-square test it also tells us the size of the effect.

I can test the parallel regression assumption with a Brant test:

brant::brant(adjfit.step)
#> ---------------------------------------------------- 
#> Test for              X2       df  probability 
#> ---------------------------------------------------- 
#> Omnibus               -756.67  36  1
#> hxcopdTRUE            0.44     4   0.98
#> age                   6.25     4   0.18
#> diabetesTRUE          1.7      4   0.79
#> fnstatus2Dependent    15.68    4   0
#> ascitesTRUE           127.76   4   0
#> hxchfTRUE             1.52     4   0.82
#> dialysisTRUE          19.19    4   0
#> bleeddisTRUE          2.21     4   0.7
#> wtlossTRUE            40.17    4   0
#> ---------------------------------------------------- 
#> 
#> H0: Parallel Regression Assumption holds
#> Warning in brant::brant(adjfit.step): 1402 combinations in table(dv,ivs) do not
#> occur. Because of that, the test results might be invalid.

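The omnibus statistic is negative (-756.67), which is not a valid chi-square value, and the warning reports 1402 empty combinations in the table, so the omnibus result should not be relied on. The per-covariate tests suggest the assumption is violated for fnstatus2, ascites, dialysis, and wtloss, while there is no evidence of a violation for hxcopd (p = 0.98), the variable of interest. I’m still trying to figure out how to interpret whether or not this means the assumption holds for the model as a whole.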
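
One informal way to look at the parallel regression assumption (and, as far as I understand, the idea the Brant test builds on) is to fit a separate binary logistic regression for each split \(Y > j\) and compare the slopes. The sketch below does this on simulated data that satisfies the assumption by construction. The data, seed, and variable names are assumptions for illustration, and this is not a substitute for the formal test.

```r
# Simulated proportional-odds data with a true log odds ratio of 0.5
set.seed(42)
n <- 2000
x <- rbinom(n, 1, 0.5)
latent <- 0.5 * x + rlogis(n)
y <- cut(latent, breaks = c(-Inf, 0, 1, 2, Inf), labels = 0:3,
         ordered_result = TRUE)
y_int <- as.integer(y) - 1L  # outcome coded 0, 1, 2, 3

# One binary logistic regression per cut-point: is Y > j?
slopes <- sapply(0:2, function(j) {
  d <- data.frame(above = as.integer(y_int > j), x = x)
  coef(glm(above ~ x, family = binomial, data = d))[["x"]]
})
names(slopes) <- paste0("Y>", 0:2)
print(round(slopes, 3))
```

If the assumption holds, the slopes should be similar across cut-points, up to sampling noise. Slopes that drift systematically for a covariate would point to the kind of violation the per-covariate Brant tests flag.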

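In addition, the Hosmer and Lemeshow test is non-significant. This is a goodness-of-fit test rather than a test of the parallel regression assumption, so it gives no evidence of poor overall fit, although the warning about expected frequencies below 1 means the chi-square approximation may be unreliable: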

generalhoslem::logitgof(df_m$dindo, fitted(adjfit.step), ord = TRUE)
#> Warning in generalhoslem::logitgof(df_m$dindo, fitted(adjfit.step), ord =
#> TRUE): At least one cell in the expected frequencies table is < 1. Chi-square
#> approximation may be incorrect.
#> 
#>  Hosmer and Lemeshow test (ordinal model)
#> 
#> data:  df_m$dindo, fitted(adjfit.step)
#> X-squared = 32.485, df = 44, p-value = 0.9001

Alternatively, many recommend against step-wise model selection because it can produce substantial bias. Instead, they advocate selecting covariates based on clinical expertise and insight. For that reason, I would build a model with the covariates that I really think are strongly related to the outcome. I would argue that sex, race, dyspnea, hypermed, and weight loss are probably not strongly related to the outcomes we are examining. I realize you could make the argument that any of these are related in some way, but I’m trying to pick out the ones I think are strongly related while also minimizing the number of covariates in our model.

adjfit <- MASS::polr(dindo ~ hxcopd + age + diabetes + smoke + fnstatus2 + ascites + hxchf + renafail + dialysis + steroid + bleeddis + lap, data = df_m)
broom::tidy(adjfit, conf.int = TRUE, p.values = TRUE, exponentiate = TRUE)
#> 
#> Re-fitting to get Hessian
#> # A tibble: 17 x 8
#>    term      estimate std.error statistic conf.low conf.high   p.value coef.type
#>    <chr>        <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl> <chr>    
#>  1 hxcopdTR~     1.51   0.134       3.10    1.17        1.97   1.79e-3 coeffici~
#>  2 age           1.02   0.00683     3.06    1.01        1.04   1.89e-3 coeffici~
#>  3 diabetes~     1.77   0.165       3.48    1.27        2.43   9.40e-4 coeffici~
#>  4 smokeTRUE     1.21   0.151       1.25    0.896       1.62   2.14e-1 coeffici~
#>  5 fnstatus~     2.29   0.335       2.48    1.13        4.23   2.39e-2 coeffici~
#>  6 ascitesT~     6.65   0.571       3.32    1.87       18.5    6.05e-3 coeffici~
#>  7 hxchfTRUE     3.03   0.248       4.47    1.82        4.82   6.62e-5 coeffici~
#>  8 renafail~     1.52   1.06        0.393   0.0814      8.23   7.10e-1 coeffici~
#>  9 dialysis~     3.05   0.340       3.28    1.49        5.70   3.57e-3 coeffici~
#> 10 steroidT~     1.11   0.236       0.428   0.677       1.72   6.73e-1 coeffici~
#> 11 bleeddis~     2.30   0.198       4.19    1.53        3.34   1.16e-4 coeffici~
#> 12 lapTRUE       1.23   0.156       1.33    0.900       1.66   1.91e-1 coeffici~
#> 13 0|1         245.     0.534      10.3    NA          NA     NA       scale    
#> 14 1|2         280.     0.534      10.5    NA          NA     NA       scale    
#> 15 2|3         470.     0.537      11.4    NA          NA     NA       scale    
#> 16 3|4         907.     0.544      12.5    NA          NA     NA       scale    
#> 17 4|5        2883.     0.572      13.9    NA          NA     NA       scale

lap, a variable that indicates whether a patient underwent laparoscopic or open repair, is not significantly associated with the primary endpoint (odds ratio 1.23, 95% confidence interval 0.900 to 1.66).

What else do we need here?
