Single variable plots

Index plots

ggplot(df) +
  geom_bar(aes(x = col_anastomotic))


ggplot(df) +
  geom_bar(aes(x = orgspcssi))


ggplot(df) +
  geom_bar(aes(x = wndinfd))


df %>% mutate(comp = df$orgspcssi | df$wndinfd | df$col_anastomotic) %>%
  ggplot() +
  geom_bar(aes(x = comp))

These plots demonstrate no weird discrepancies in the outcome variables. By tabling them:

table(df$col_anastomotic)
#> 
#>  FALSE   TRUE 
#> 161850   5610
table(df$orgspcssi)
#> 
#>  FALSE   TRUE 
#> 159615   7845
table(df$wndinfd)
#> 
#>  FALSE   TRUE 
#> 166195   1265
table(df$orgspcssi | df$wndinfd | df$col_anastomotic)
#> 
#>  FALSE   TRUE 
#> 157191  10269

To get a clearer idea of how to the composite outcome breaks down:

df %>%
  select(orgspcssi, wndinfd, col_anastomotic) %>%
  group_by(orgspcssi, wndinfd, col_anastomotic) %>%
  summarise(count = n())
#> `summarise()` regrouping output by 'orgspcssi', 'wndinfd' (override with `.groups` argument)
#> # A tibble: 8 x 4
#> # Groups:   orgspcssi, wndinfd [4]
#>   orgspcssi wndinfd col_anastomotic  count
#>   <lgl>     <lgl>   <lgl>            <int>
#> 1 FALSE     FALSE   FALSE           157191
#> 2 FALSE     FALSE   TRUE              1268
#> 3 FALSE     TRUE    FALSE              955
#> 4 FALSE     TRUE    TRUE               201
#> 5 TRUE      FALSE   FALSE             3662
#> 6 TRUE      FALSE   TRUE              4074
#> 7 TRUE      TRUE    FALSE               42
#> 8 TRUE      TRUE    TRUE                67

It seems like the vast majorty of adverse outcomes are due to organ-space SSI. Most of those patients with an organ-space SSI seem to have an associated leak. Also interesting that of people with a leak, most get an organ-space SSI. We can looks at this more closely:

df %>%
  select(orgspcssi, col_anastomotic) %>%
  group_by(col_anastomotic, orgspcssi) %>%
  summarise(count = n()) %>%
  mutate(rate = count/sum(count))
#> `summarise()` regrouping output by 'col_anastomotic' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   col_anastomotic [2]
#>   col_anastomotic orgspcssi  count   rate
#>   <lgl>           <lgl>      <int>  <dbl>
#> 1 FALSE           FALSE     158146 0.977 
#> 2 FALSE           TRUE        3704 0.0229
#> 3 TRUE            FALSE       1469 0.262 
#> 4 TRUE            TRUE        4141 0.738

It seems approximately 74% of patients that have a leak will develop an organ-space SSI. This is good background for our hypothesis. If we can reduce rates of leak with ICG, we should theoretically reduce rates of organ-space SSI. We will see if this pans out in the data.

Given this is a contigency table, we can run a Chi-square test for independence:

chisq.test(table(df$col_anastomotic, df$orgspcssi))
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(df$col_anastomotic, df$orgspcssi)
#> X-squared = 62106, df = 1, p-value < 2.2e-16

This significant p-value suggests we can reject the \(H_0\) hypothesis that anastomotic leaks are independent of organ-space SSIs. This makes clinical sense.

Explanatory variables

This section is to get an overview of the potential explanatory variables in our data and look for any discrepancies or outliers in our data. It should also generate additional questions.

ICG

df %>%
  group_by(icg, col_anastomotic) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n))
#> `summarise()` regrouping output by 'icg' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   icg [2]
#>   icg   col_anastomotic      n   rate
#>   <lgl> <lgl>            <int>  <dbl>
#> 1 FALSE FALSE           160781 0.966 
#> 2 FALSE TRUE              5580 0.0335
#> 3 TRUE  FALSE             1069 0.973 
#> 4 TRUE  TRUE                30 0.0273

This table suggests a lower rate of patients have an anastomotic leak when using ICG.

A quick Chi-square test of independence:

chisq.test(table(df$icg, df$col_anastomotic))
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(df$icg, df$col_anastomotic)
#> X-squared = 1.1289, df = 1, p-value = 0.288

There is not yet enough to reject the \(H_0\). We can check for organ-space surgical site infection too.

df %>% group_by(icg, orgspcssi) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) 
#> `summarise()` regrouping output by 'icg' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   icg [2]
#>   icg   orgspcssi      n   rate
#>   <lgl> <lgl>      <int>  <dbl>
#> 1 FALSE FALSE     158564 0.953 
#> 2 FALSE TRUE        7797 0.0469
#> 3 TRUE  FALSE       1051 0.956 
#> 4 TRUE  TRUE          48 0.0437

These rates look pretty close too. We can check a quick Chi-square:

chisq.test(table(df$icg, df$orgspcssi))
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(df$icg, df$orgspcssi)
#> X-squared = 0.18275, df = 1, p-value = 0.669

There is not yet enough to reject the \(H_0\).

Age

Let’s see if age is related to rates of anastomotic leak and organ-space SSI.

df %>%
  ggplot(aes(x = age, y = col_anastomotic, fill = ..x..)) +
  geom_density_ridges_gradient() +
  scale_fill_viridis(option = "B") +
  theme(legend.position="none",
        panel.spacing = unit(0.1, "lines"),
        strip.text.x = element_text(size = 8))
#> Picking joint bandwidth of 1.82


df %>%
  ggplot(aes(x = age, y = orgspcssi, fill = ..x..)) +
  geom_density_ridges_gradient() +
  scale_fill_viridis(option = "B") +
  theme(legend.position="none",
        panel.spacing = unit(0.1, "lines"),
        strip.text.x = element_text(size = 8))
#> Picking joint bandwidth of 1.78

The distributions do not seem markedly different. Perhaps boxplots will convice you:

df %>%
  ggplot(aes(x = age, y = col_anastomotic, group = col_anastomotic)) +
  geom_boxplot()
#> Warning: Removed 1 rows containing non-finite values (stat_boxplot).


df %>%
  ggplot(aes(x = age, y = orgspcssi, group = orgspcssi)) +
  geom_boxplot()
#> Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Is age related to the choice to use ICG?

df %>%
  ggplot(aes(x = age, y = icg, group = icg)) +
  geom_boxplot()
#> Warning: Removed 1 rows containing non-finite values (stat_boxplot).

It would appear not.

x <- glm(col_anastomotic ~ age, data = df, family = binomial(link = "logit"))
summary(x)
#> 
#> Call:
#> glm(formula = col_anastomotic ~ age, family = binomial(link = "logit"), 
#>     data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.2828  -0.2651  -0.2598  -0.2550   2.6459  
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -3.1311534  0.0565792 -55.341  < 2e-16 ***
#> age         -0.0037610  0.0009002  -4.178 2.94e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 49135  on 167458  degrees of freedom
#> Residual deviance: 49118  on 167457  degrees of freedom
#>   (1 observation deleted due to missingness)
#> AIC: 49122
#> 
#> Number of Fisher Scoring iterations: 6
exp(coef(x))
#> (Intercept)         age 
#>   0.0436674   0.9962460

x <- glm(orgspcssi ~ age, data = df, family = binomial(link = "logit"))
summary(x)
#> 
#> Call:
#> glm(formula = orgspcssi ~ age, family = binomial(link = "logit"), 
#>     data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.3795  -0.3215  -0.3051  -0.2908   2.5824  
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -2.4187676  0.0466476  -51.85   <2e-16 ***
#> age         -0.0097715  0.0007559  -12.93   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 63342  on 167458  degrees of freedom
#> Residual deviance: 63177  on 167457  degrees of freedom
#>   (1 observation deleted due to missingness)
#> AIC: 63181
#> 
#> Number of Fisher Scoring iterations: 6
exp(coef(x))
#> (Intercept)         age 
#>  0.08903127  0.99027604

Nonetheless, a simple unadjusted regression of age on the outcomes suggest increasing age has a very small but significant protective effect.

Pre-operative albumin

Pre-operative albumin seems likely to have an effect on anastomotic leaks.

First, we can look at the distrubtion of pre-operative albumin:

df %>%
  ggplot(aes(x = pralbum)) + 
  geom_histogram(bins = 40)

This seems fairly negatively skewed. We can check skewness:

moments::skewness(df$pralbum, na.rm = TRUE)
#> [1] -0.7439432

Indeed it is. A Q-Q plot demonstrates this graphically:

qqnorm(df$pralbum)
qqline(df$pralbum)

We may be able to reduce the skew by transforming the variable.

moments::skewness(df$pralbum^(1/3), na.rm = TRUE)
#> [1] -1.305736
moments::skewness(df$pralbum^2, na.rm = TRUE)
#> [1] 0.200398

A square transformation seems to reduce the skewness quite markedly, with the exception of the extreme values:

df %>%
  ggplot(aes(x = pralbum^2)) + 
  geom_histogram(bins = 20) +
  xlim(0,30)

qqnorm(df$pralbum^2)
qqline(df$pralbum^2)

df %>%
  ggplot(aes(x = pralbum^2, y = col_anastomotic, group = col_anastomotic)) +
  geom_boxplot()
#> Warning: Removed 49801 rows containing non-finite values (stat_boxplot).


df %>%
  ggplot(aes(x = pralbum^2, y = orgspcssi, group = orgspcssi)) +
  geom_boxplot()
#> Warning: Removed 49801 rows containing non-finite values (stat_boxplot).

We can regress the albumin squared in an unadjusted model.

x <- glm(col_anastomotic ~ pralbum^2, family = binomial(link = "logit"), data = df)
summary(x)
#> 
#> Call:
#> glm(formula = col_anastomotic ~ pralbum^2, family = binomial(link = "logit"), 
#>     data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.4212  -0.2750  -0.2566  -0.2395   2.9171  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -2.02613    0.08566  -23.65   <2e-16 ***
#> pralbum     -0.35146    0.02307  -15.24   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 35248  on 117658  degrees of freedom
#> Residual deviance: 35028  on 117657  degrees of freedom
#>   (49801 observations deleted due to missingness)
#> AIC: 35032
#> 
#> Number of Fisher Scoring iterations: 6
exp(coef(x))
#> (Intercept)     pralbum 
#>   0.1318453   0.7036626

x <- glm(orgspcssi ~ pralbum^2, family = binomial(link = "logit"), data = df)
summary(x)
#> 
#> Call:
#> glm(formula = orgspcssi ~ pralbum^2, family = binomial(link = "logit"), 
#>     data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.6485  -0.3396  -0.3051  -0.2740   3.2386  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.90366    0.06753  -13.38   <2e-16 ***
#> pralbum     -0.54878    0.01854  -29.60   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 47343  on 117658  degrees of freedom
#> Residual deviance: 46524  on 117657  degrees of freedom
#>   (49801 observations deleted due to missingness)
#> AIC: 46528
#> 
#> Number of Fisher Scoring iterations: 6
exp(coef(x))
#> (Intercept)     pralbum 
#>   0.4050849   0.5776557

This makes sense. A 1 unit increase in albumin squared results in 30% decrease in the odds of having an anastomotic leak and nearly 50% decrease in the odds of an organ-space surgical site infection. This will definitely be in our match and model.

Diabetes

Diabetes seems like another likely risk factor for leak and infection.

df %>%
  group_by(diabetes, col_anastomotic) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'diabetes' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   diabetes [2]
#>   diabetes col_anastomotic      n   rate
#>   <lgl>    <lgl>            <int>  <dbl>
#> 1 FALSE    FALSE           137318 0.967 
#> 2 FALSE    TRUE              4675 0.0329
#> 3 TRUE     FALSE            24532 0.963 
#> 4 TRUE     TRUE               935 0.0367
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 9.4633, df = 1, p-value = 0.002096

Despite these rates seeming close, there is enough of a sample size to result in a significant Chi-square test. Diabetes will definitely go in our match and our model.

df %>%
  group_by(diabetes, orgspcssi) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'diabetes' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   diabetes [2]
#>   diabetes orgspcssi      n   rate
#>   <lgl>    <lgl>      <int>  <dbl>
#> 1 FALSE    FALSE     135404 0.954 
#> 2 FALSE    TRUE        6589 0.0464
#> 3 TRUE     FALSE      24211 0.951 
#> 4 TRUE     TRUE        1256 0.0493
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 4.0443, df = 1, p-value = 0.04432

The same is true as diabetes relates to organ-space SSI.

Insulin

df %>%
  group_by(insulin, col_anastomotic) %>%
  mutate(insulin = tidyr::replace_na(insulin, FALSE)) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'insulin' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   insulin [2]
#>   insulin col_anastomotic      n   rate
#>   <lgl>   <lgl>            <int>  <dbl>
#> 1 FALSE   FALSE           153832 0.967 
#> 2 FALSE   TRUE              5296 0.0333
#> 3 TRUE    FALSE             8018 0.962 
#> 4 TRUE    TRUE               314 0.0377
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 4.609, df = 1, p-value = 0.0318

df %>%
  group_by(insulin, orgspcssi) %>%
  mutate(insulin = tidyr::replace_na(insulin, FALSE)) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'insulin' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   insulin [2]
#>   insulin orgspcssi      n   rate
#>   <lgl>   <lgl>      <int>  <dbl>
#> 1 FALSE   FALSE     151749 0.954 
#> 2 FALSE   TRUE        7379 0.0464
#> 3 TRUE    FALSE       7866 0.944 
#> 4 TRUE    TRUE         466 0.0559
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 15.983, df = 1, p-value = 6.39e-05

Insulin is obviously highly correlated to the diagnosis of diabetes and it is likely there is an interaction between diabetes and insulin. We should check for an interaction effect:

tapply(df$col_anastomotic, list(df$diabetes, tidyr::replace_na(df$insulin, FALSE)), mean)
#>            FALSE       TRUE
#> FALSE 0.03292416         NA
#> TRUE  0.03624161 0.03768603
tapply(df$orgspcssi, list(df$diabetes, tidyr::replace_na(df$insulin, FALSE)), mean)
#>            FALSE       TRUE
#> FALSE 0.04640370         NA
#> TRUE  0.04610446 0.05592895

The rows refer to diabetes and the columns to insulin. The values are the rates of anastomotic leak/organ-space SSI. This makes it clear that insulin contributes predominantly to the rise in organ-space SSI seen in patients with diabetes. The rate of organ-space SSI in patients with diabetes but not taking insulin is similar to the rate of organ-space SSI in patients without diabetes.

We will need to either just use insulin or include an interaction term in our model.

Height/Weight

It is probably easiest to deal with height and weight as BMI. First we can check the distribution:

df %>%
  mutate(bmi = nsqipBileSpill::bmi(height, weight)) %>%
  ggplot(aes(x = bmi)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 2679 rows containing non-finite values (stat_bin).

Seems to be right skewed:

moments::skewness(nsqipBileSpill::bmi(df$height, df$weight), na.rm = TRUE)
#> [1] 1.181602

qqnorm(nsqipBileSpill::bmi(df$height, df$weight))
qqline(nsqipBileSpill::bmi(df$height, df$weight))

We can correct positive skewness with a log transformation or square root transformation:

moments::skewness(log(nsqipBileSpill::bmi(df$height, df$weight)), na.rm = TRUE)
#> [1] 0.2802827
moments::skewness(sqrt(nsqipBileSpill::bmi(df$height, df$weight)), na.rm = TRUE)
#> [1] 0.6944194

qqnorm(log(nsqipBileSpill::bmi(df$height, df$weight)))
qqline(log(nsqipBileSpill::bmi(df$height, df$weight)))

df %>%
  mutate(bmi = log(nsqipBileSpill::bmi(height, weight))) %>%
  ggplot(aes(x = bmi, y = col_anastomotic, group = col_anastomotic)) +
  geom_boxplot()
#> Warning: Removed 2679 rows containing non-finite values (stat_boxplot).


df %>%
  mutate(bmi = log(nsqipBileSpill::bmi(height, weight))) %>%
  ggplot(aes(x = bmi, y = orgspcssi, group = orgspcssi)) +
  geom_boxplot()
#> Warning: Removed 2679 rows containing non-finite values (stat_boxplot).

We can regress the log-transformed BMI in an unadjusted model.

x <- glm(col_anastomotic ~ log(nsqipBileSpill::bmi(height, weight)), family = binomial(link = "logit"), data = df)
summary(x)
#> 
#> Call:
#> glm(formula = col_anastomotic ~ log(nsqipBileSpill::bmi(height, 
#>     weight)), family = binomial(link = "logit"), data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.2691  -0.2606  -0.2596  -0.2586   2.6275  
#> 
#> Coefficients:
#>                                          Estimate Std. Error z value Pr(>|z|)
#> (Intercept)                              -3.55000    0.20508 -17.311   <2e-16
#> log(nsqipBileSpill::bmi(height, weight))  0.05332    0.06145   0.868    0.386
#>                                             
#> (Intercept)                              ***
#> log(nsqipBileSpill::bmi(height, weight))    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 47977  on 164780  degrees of freedom
#> Residual deviance: 47976  on 164779  degrees of freedom
#>   (2679 observations deleted due to missingness)
#> AIC: 47980
#> 
#> Number of Fisher Scoring iterations: 6
exp(coef(x))
#>                              (Intercept) 
#>                               0.02872473 
#> log(nsqipBileSpill::bmi(height, weight)) 
#>                               1.05476223

x <- glm(orgspcssi ~ log(nsqipBileSpill::bmi(height, weight)), family = binomial(link = "logit"), data = df)
summary(x)
#> 
#> Call:
#> glm(formula = orgspcssi ~ log(nsqipBileSpill::bmi(height, weight)), 
#>     family = binomial(link = "logit"), data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.3394  -0.3116  -0.3077  -0.3034   2.5585  
#> 
#> Coefficients:
#>                                          Estimate Std. Error z value Pr(>|z|)
#> (Intercept)                               -2.4307     0.1753 -13.863  < 2e-16
#> log(nsqipBileSpill::bmi(height, weight))  -0.1789     0.0527  -3.394  0.00069
#>                                             
#> (Intercept)                              ***
#> log(nsqipBileSpill::bmi(height, weight)) ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 61788  on 164780  degrees of freedom
#> Residual deviance: 61776  on 164779  degrees of freedom
#>   (2679 observations deleted due to missingness)
#> AIC: 61780
#> 
#> Number of Fisher Scoring iterations: 5
exp(coef(x))
#>                              (Intercept) 
#>                               0.08797511 
#> log(nsqipBileSpill::bmi(height, weight)) 
#>                               0.83622534

Smoking

Smoking seems a likely candidate as a good explanatory variable.

df %>%
  group_by(smoke, col_anastomotic) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'smoke' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   smoke [2]
#>   smoke col_anastomotic      n   rate
#>   <lgl> <lgl>            <int>  <dbl>
#> 1 FALSE FALSE           135637 0.969 
#> 2 FALSE TRUE              4323 0.0309
#> 3 TRUE  FALSE            26213 0.953 
#> 4 TRUE  TRUE              1287 0.0468
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 179.25, df = 1, p-value < 2.2e-16

df %>%
  group_by(smoke, orgspcssi) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'smoke' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   smoke [2]
#>   smoke orgspcssi      n   rate
#>   <lgl> <lgl>      <int>  <dbl>
#> 1 FALSE FALSE     133859 0.956 
#> 2 FALSE TRUE        6101 0.0436
#> 3 TRUE  FALSE      25756 0.937 
#> 4 TRUE  TRUE        1744 0.0634
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 201.91, df = 1, p-value < 2.2e-16

I’d like to control for packs, but the data is missing for most of our records:

summary(df$packs)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    0.00    0.00    0.00    0.79    0.00   62.00  167352

Alcohol

Alcohol would be a good predictor I think, but unfortunately the data is missing for most of our records:

summary(df$etoh)
#>    Mode   FALSE    NA's 
#> logical     147  167313

Pre-operative functional status

fnstatus2 is almost always related to poor outcomes.

table(df$fnstatus2)
#> 
#>         Independent Partially dependent   Totally dependent 
#>              162863                3289                 626

Given the relatively low numbers of totally dependent (which will get even smaller after inclusion/exclusion criteria and the match), I almost always have to reduce this variable to a boolean (Independent vs. not independent):

df %>%
  mutate(dependent = fnstatus2 != "Independent") %>%
  drop_na(dependent) %>%
  group_by(dependent, col_anastomotic) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'dependent' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   dependent [2]
#>   dependent col_anastomotic      n   rate
#>   <lgl>     <lgl>            <int>  <dbl>
#> 1 FALSE     FALSE           157470 0.967 
#> 2 FALSE     TRUE              5393 0.0331
#> 3 TRUE      FALSE             3722 0.951 
#> 4 TRUE      TRUE               193 0.0493
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 30.434, df = 1, p-value = 3.454e-08

df %>%
  mutate(dependent = fnstatus2 != "Independent") %>%
  drop_na(dependent) %>%
  group_by(dependent, orgspcssi) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'dependent' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   dependent [2]
#>   dependent orgspcssi      n   rate
#>   <lgl>     <lgl>      <int>  <dbl>
#> 1 FALSE     FALSE     155348 0.954 
#> 2 FALSE     TRUE        7515 0.0461
#> 3 TRUE      FALSE       3633 0.928 
#> 4 TRUE      TRUE         282 0.0720
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 56.912, df = 1, p-value = 4.558e-14

We will include dependent in our model. Alternatively, these patients could excluded but we should be able to match and then adjust for the effect.

Pre-operative ventilator use

table(df$ventilat)
#> 
#>  FALSE   TRUE 
#> 166539    921

There are not many patients on ventilators. But I would guess its a significant risk factor.

df %>%
  group_by(ventilat, col_anastomotic) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'ventilat' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   ventilat [2]
#>   ventilat col_anastomotic      n   rate
#>   <lgl>    <lgl>            <int>  <dbl>
#> 1 FALSE    FALSE           160995 0.967 
#> 2 FALSE    TRUE              5544 0.0333
#> 3 TRUE     FALSE              855 0.928 
#> 4 TRUE     TRUE                66 0.0717
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 40.475, df = 1, p-value = 1.991e-10

df %>%
  group_by(ventilat, orgspcssi) %>%
  summarise(n = n()) %>%
  mutate(rate = n/sum(n)) %>%
  chi()
#> `summarise()` regrouping output by 'ventilat' (override with `.groups` argument)
#> # A tibble: 4 x 4
#> # Groups:   ventilat [2]
#>   ventilat orgspcssi      n   rate
#>   <lgl>    <lgl>      <int>  <dbl>
#> 1 FALSE    FALSE     158810 0.954 
#> 2 FALSE    TRUE        7729 0.0464
#> 3 TRUE     FALSE        805 0.874 
#> 4 TRUE     TRUE         116 0.126
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  .
#> X-squared = 128, df = 1, p-value < 2.2e-16

Turns out this seems to be the biggest risk factor we’ve explored so far. Let’s explore any relationship between ICG and pre-operative ventilator use. Are surgeons using ICG more in patients on a ventilator, maybe because they have a high a priori suspicision?

table(df$icg, df$ventilat)
#>        
#>          FALSE   TRUE
#>   FALSE 165441    920
#>   TRUE    1098      1

Actually, ICG was only used on one ventilator-dependent patient. Unfortunately we won’t be able to make any inferences about the utility of ICG in this population.

ventilat will definitely go in our model and be matched for (unless you exclude these patients).

Other variables

To speed up the rest, I’ll make a function and test all exposures simultaneously. This is only for categorical exposures because I am running Chi-square tests (so not age, height, weight, etc. We can test those separately):

testexposure <- function(df, exposure, outcome) {
  exposure <- rlang::ensym(exposure)
  outcome <- rlang::ensym(outcome)
  
  x <- df %>%
    drop_na(!!exposure) %>%  
    group_by(!!exposure, !!outcome) %>%
    summarise(n = n()) %>%
    pull(n) %>%
    matrix(2, 2) %>%
    chisq.test()
  
  sig <- if(x$p.value < 0.001) {"***"} else if(x$p.value < 0.01) {"**"} else if(x$p.value < 0.05) {"*"} else {""}
  
  tibble(exposure = deparse(substitute(!!exposure)),
         n = sum(x$observed),
         nExposureFALSE = sum(x$observed[1]),
         outcomeExposureFALSE = x$observed[2,1]/sum(x$observed[,1]), 
         nExposureTRUE = sum(x$observed[,2]),
         outcomeExposureTRUE = x$observed[2,2]/sum(x$observed[,2]), 
         p = x$p.value,
         sig = sig)
}

testexposures <- function(df, ..., outcome) {
  exposures <- rlang::ensyms(...)
  outcome <- rlang::ensym(outcome)
  
  purrr::map_dfr(exposures, function(x) {
    df %>%
      testexposure(!!x, !!outcome)
  })
}

df %>%
  mutate(dependent = fnstatus2 != "Independent",
         insulin = tidyr::replace_na(insulin, FALSE),
         lap = col_approach != "Open",
         cancer = grepl("Colon cancer", col_indication)) %>%
  testexposures(icg, insulin, smoke, dependent, ventilat, hxcopd, hxchf, hxmi, hxangina, hypermed, renafail, dialysis, discancr, wndinf, steroid, wtloss, bleeddis, transfus, chemo, radio, prsepis, emergncy, lap, col_steroid, col_mech_bowel_prep, col_oral_antibiotic, col_chemo, col_emergent, cancer, outcome = col_anastomotic) %>%
  knitr::kable()

exposure	n	nExposureFALSE	outcomeExposureFALSE	nExposureTRUE	outcomeExposureTRUE	p	sig
icg	167460	160781	0.0335415	1099	0.0272975	0.2880172
insulin	167460	153832	0.0332814	8332	0.0376860	0.0318039	*
smoke	167460	135637	0.0308874	27500	0.0468000	0.0000000	***
dependent	166778	157470	0.0331137	3915	0.0492976	0.0000000	***
ventilat	167460	160995	0.0332895	921	0.0716612	0.0000000	***
hxcopd	167460	153845	0.0329688	8370	0.0436081	0.0000002	***
hxchf	167460	160182	0.0333420	1753	0.0484883	0.0005837	***
hxmi	294	139	0.0544218	147	0.0544218	1.0000000
hxangina	294	139	0.0544218	147	0.0544218	1.0000000
hypermed	167460	84206	0.0327153	80406	0.0343507	0.0651112
renafail	167460	161232	0.0333929	658	0.0607903	0.0001510	***
dialysis	167460	160438	0.0332614	1502	0.0599201	0.0000000	***
discancr	167460	152543	0.0325050	9792	0.0495302	0.0000000	***
wndinf	167460	159536	0.0330331	2474	0.0646726	0.0000000	***
steroid	167460	149773	0.0324679	12661	0.0461259	0.0000000	***
wtloss	167460	155186	0.0326148	7042	0.0536779	0.0000000	***
bleeddis	167460	155976	0.0327190	6208	0.0538015	0.0000000	***
transfus	167460	158064	0.0331117	3983	0.0494602	0.0000000	***
chemo	294	139	0.0544218	147	0.0544218	1.0000000
radio	232	112	0.0344828	116	0.0344828	1.0000000
prsepis	167460	151319	0.0316018	11203	0.0599839	0.0000000	***
emergncy	167460	147145	0.0315713	15518	0.0523908	0.0000000	***
lap	167434	46539	0.0478538	118556	0.0275819	0.0000000	***
col_steroid	165708	151466	0.0327039	9121	0.0462669	0.0000000	***
col_mech_bowel_prep	144430	49963	0.0452322	92100	0.0265689	0.0000000	***
col_oral_antibiotic	146697	77603	0.0417608	65712	0.0234813	0.0000000	***
col_chemo	165764	152056	0.0324026	8616	0.0522284	0.0000000	***
col_emergent	6608	897	0.0567823	5657	0.0549761	0.8816393
cancer	167460	91816	0.0345321	72360	0.0321448	0.0074578	**


df %>%
  mutate(dependent = fnstatus2 != "Independent",
         insulin = tidyr::replace_na(insulin, FALSE),
         lap = col_approach != "Open",
         cancer = grepl("Colon cancer", col_indication)) %>%
  testexposures(icg, insulin, smoke, dependent, ventilat, hxcopd, hxchf, hxmi, hxangina, hypermed, renafail, dialysis, discancr, wndinf, steroid, wtloss, bleeddis, transfus, chemo, radio, prsepis, emergncy, lap, col_steroid, col_mech_bowel_prep, col_oral_antibiotic, col_chemo, cancer, outcome = orgspcssi) %>%
  knitr::kable()

exposure	n	nExposureFALSE	outcomeExposureFALSE	nExposureTRUE	outcomeExposureTRUE	p	sig
icg	167460	158564	0.0468680	1099	0.0436761	0.6690174
insulin	167460	151749	0.0463715	8332	0.0559289	0.0000639	***
smoke	167460	133859	0.0435910	27500	0.0634182	0.0000000	***
dependent	166778	155348	0.0461431	3915	0.0720307	0.0000000	***
ventilat	167460	158810	0.0464095	921	0.1259501	0.0000000	***
hxcopd	167460	151749	0.0461437	8370	0.0602151	0.0000000	***
hxchf	167460	157972	0.0466788	1753	0.0627496	0.0018663	**
hxmi	294	140	0.0476190	147	0.0476190	1.0000000
hxangina	294	140	0.0476190	147	0.0476190	1.0000000
hypermed	167460	82852	0.0482689	80406	0.0453076	0.0043230	**
renafail	167460	159031	0.0465882	658	0.1124620	0.0000000	***
dialysis	167460	158252	0.0464334	1502	0.0925433	0.0000000	***
discancr	167460	150600	0.0448284	9792	0.0793505	0.0000000	***
wndinf	167460	157425	0.0458281	2474	0.1147939	0.0000000	***
steroid	167460	147940	0.0443091	12661	0.0778769	0.0000000	***
wtloss	167460	153121	0.0454874	7042	0.0778188	0.0000000	***
bleeddis	167460	153895	0.0456242	6208	0.0786082	0.0000000	***
transfus	167460	155918	0.0462389	3983	0.0718052	0.0000000	***
chemo	294	140	0.0476190	147	0.0476190	1.0000000
radio	232	113	0.0258621	116	0.0258621	1.0000000
prsepis	167460	149815	0.0412270	11203	0.1252343	0.0000000	***
emergncy	167460	145561	0.0419963	15518	0.0943421	0.0000000	***
lap	167434	45233	0.0745734	118556	0.0354179	0.0000000	***
col_steroid	165708	149524	0.0451059	9121	0.0792676	0.0000000	***
col_mech_bowel_prep	144430	48689	0.0695777	92100	0.0353529	0.0000000	***
col_oral_antibiotic	146697	76117	0.0601099	65712	0.0326729	0.0000000	***
col_chemo	165764	150037	0.0452503	8616	0.0763695	0.0000000	***
cancer	167460	90103	0.0525447	72360	0.0393588	0.0000000	***

It is clear that many of these exposures are related to both of these outcomes. Given that organ-space SSI and anastomotic leak are inextricably linked (leak is almost surely a significant risk factor for organ-space SSI), it doesn’t really make sense to look at the impact of ICG on both outcomes. This doesn’t make sense clinically. No one would use ICG to reduce rates of organ-space SSI, they would use it to reduce rates of anastomotic leak. To avoid this weird interaction between your two outcomes, I would narrow the question to a single outcome - does ICG reduce rates of anastomotic leak? I think this is clearer study and will be much simpler for readers to understand. It also aligns with the ICG use case.

From here on out, I will be considering anastomotic leak as the only outcome.

There are a lot of variables in the above analysis that are significant but there are probably some interactions, so we should dig into it. This is definitely non-exhaustive and the decision to investigate various variables or interaction effects requires some clinical knowledge, so if you think I have missed something, please let me know.

Colon cancer vs. oral antibiotics/mechanical preparation

I figure that patients with cancer are getting elective surgery, so they have time for oral antibiotics and mechanical preparation. This should explain why cancer seems to be associated with lower rates of anastomotic leak.

interactions <- function(outcome, exposure1, exposure2) {
  
  x <- tapply(outcome, list(exposure1, exposure2), mean)
  names(dimnames(x)) <- c(deparse(substitute(exposure1)), deparse(substitute(exposure2)))
  x
}

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$col_oral_antibiotic)
#>                                         df$col_oral_antibiotic
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.04305001 0.02352713
#>                                    TRUE  0.03996812 0.02342507

Interestingly, it seems like cancer has a benefit in the absence of oral antibiotic prep and that the oral antibiotic preparation acutally mitigates the negative effect of not having cancer.

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$col_mech_bowel_prep)
#>                                         df$col_mech_bowel_prep
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.04670531 0.02605949
#>                                    TRUE  0.04276215 0.02715016

The same seems to be true for mechanical bowel preparation.

Cancer vs. emergent surgery

Maybe the “benefit” of cancer is present because actually most patients with an indication of cancer are having surgery electively rather than emergently.

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$emergncy)
#>                                         df$emergncy
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.03182400 0.05286812
#>                                    TRUE  0.03126809 0.05060976
interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$electsurg)
#>                                         df$electsurg
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.05246723 0.02838033
#>                                    TRUE  0.04220976 0.02991308

Unfortunately, emergency vs. elective surgery does not seem to explain the interaction.

Cancer vs. steroids

Maybe steroid use is higher in non-cancer cases (such as for IBD) and relatively low in cancer cases. This might lead to higher rates of breakdown in the non-cancer cases and a relative protective effect of cancer.

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$col_steroid)
#>                                         df$col_steroid
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.03333917 0.04684535
#>                                    TRUE  0.03193649 0.03918723

Still no good explanation.

Cancer vs. chemotherapy

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$col_chemo)
#>                                         df$col_chemo
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.03409054 0.05565118
#>                                    TRUE  0.02997689 0.05136040

Cancer vs. disseminated cancer

interactions(df$col_anastomotic, grepl("Colon cancer", df$col_indication), df$discancr)
#>                                         df$discancr
#> grepl("Colon cancer", df$col_indication)      FALSE       TRUE
#>                                    FALSE 0.03403294 0.05570776
#>                                    TRUE  0.03031286 0.04775059

Well I’m not sure why cancer seems to have a protective effect. Maybe you have other ideas. I think it might be simply due to what the other indications are:

unique(df$col_indication)
#>  [1] "Colon cancer"                 "Chrohn's disease"            
#>  [3] "Other"                        "Acute diverticulitis"        
#>  [5] "Chronic diverticular disease" "Ulcerative colitis"          
#>  [7] "Non-malignant polyp"          "Colon cancer w/ obstruction" 
#>  [9] "Volvulus"                     "Bleeding"                    
#> [11] NA                             "Enterocolitis"

The other conditions are pretty nasty conditions that probably lead to inflammation or devascularation at the anastomotic site. Cancer is a relatively well localized lesion with other healthy tissue at the anastomosis. Maybe this explains why cancer has a relative protective effect.

Mechanical bowel preparation and oral antibiotic

interactions(df$col_anastomotic, df$col_oral_antibiotic, df$col_mech_bowel_prep)
#>                       df$col_mech_bowel_prep
#> df$col_oral_antibiotic      FALSE       TRUE
#>                  FALSE 0.04695034 0.03399189
#>                  TRUE  0.03489771 0.02215336

There is certainly an interaction to account for here. Patients that have both antibiotic and mechanical prepration had the lowest rates of anastomotic leak and patients that had neither had the highest. This interaction will definitely need to be controlled for.

Renal failure and dialysis

interactions(df$col_anastomotic, df$renafail, df$dialysis)
#>            df$dialysis
#> df$renafail      FALSE       TRUE
#>       FALSE 0.03320101 0.05791506
#>       TRUE  0.05543237 0.07246377

There is an interaction to account for here. Patients that have both acute renal failure and dialysis have the highest rates. Patients with neither have the lowest. Patients having one or the other are in-between at relatively equal rates. This makes clinical sense, which is reassuring. This interaction will definitely need to be controlled for.

Exploratory Data Analysis