Finding great data to explore methods in Health Data Science can be difficult in a climate where personal data is intensely private by definition. Recent advances are being made in differential privacy techniques and federated data analysis, while others make data available by generating synthetic data based on the features of underlying real health data.

SyntheaTM is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government. (https://github.com/synthetichealth/synthea/wiki)

The data provided by Synthea is detailed, workable, and can be downloaded from https://synthea.mitre.org/downloads. Additional information about the data can be found at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

In this series of posts, I will be using a subset of the Synthea data, focussing on the recent COVID pandemic. In this post, I pose the question -

How do pre-existing coditions affect the risk of COVID death?

Since all data in this set are suspected COVID patients, this might be re-phrased, how do pre-existing conditions affect the risk of COVID death, given one is infected and symptomatic enough to seek medical help. The active period covers 27-05-2019 to 27-05-2020.

# Import some packages
library(tidyverse)
library(data.table)
library(survival)
library(caret)
library(glmnet)

# And some handy functions
putZeros<- function(df) {
  df<- df %>% mutate_at(vars(-group_cols()),~replace(.,is.na(.),0))
  return(df)
}

not_including <- function(x,elements){
  y<- x[!x %in% elements]
  return(y)
}

# And read in the data that I cleaned and wrangled in a previous session

IP<- readRDS("Dev Data 0/IP_0.rds") # 1766943 row

Let’s begin

In this post I will be making use of the data in two main parts.

  1. Patient characterists (age, sex, etc. - one row per patient)
  2. Patient histories (observations, procedures etc. - many rows per patient)
# Create a patient summary set

PID<- IP %>% select(PID, age, sex, Status, futime, ethnicity_fac, allergy) %>% distinct()
#9040 patients

# Add some labels
PID$over_60<-      ifelse(PID$age >= 60, "Over 60", "Under 60")
PID$death_status<- ifelse(PID$Status== 0, "Alive", "Died")

# Mosaic plot - what we can see already about COVID death
ptable<-           table(PID$death_status, PID$over_60)
mosaicplot(ptable, main= "COVID Death Counts by Age", color = c("#ef5675", "#374c80"),
           cex.axis= 1)

We can see already that patients that are over the age of 60 at the start of the study are much more likely to die from COVID infection than patients that are under 60. But what other factors might be at play?

# Extract patient histories from the main set

history<- IP %>% filter(type %like% "history", 
                        type %like% "condition|observation|procedure")

# The patient histories are a collection of codes and descriptions for past conditions,
# observations, disorders and procedures.

# I used this mapping table to put these tags into groups
groupRef<- read_csv("group.csv")
groupRef %>% head(20)

After joining the grouping reference with the patient histories, we can see that there are a number of conditions present prior to the active COVID period.

# Join the reference list with patient histories
history<- left_join(history,groupRef, by="DESCRIPTION")

history %>% count(Group) %>% filter(!is.na(Group)) %>%
  mutate(Condition= fct_reorder(Group, n)) %>%
  ggplot(aes(x= n, y= Condition, fill= n)) +
  geom_col() +
  labs(title= "Condition mentions in patient histories", x= "Count of mentions") +
  theme(legend.position = "none") +
  scale_fill_gradient(low= "#7a5195", high="#ff764a")

This is great but we need to create a matrix that is good for modelling. In other words, we need one row per patient, and an indicator matrix with a binary flag as to whether the patient has a record of each condition or not.

# Create an indicator matrix
tmat<- history %>% count(PID,Group) %>% 
          filter(!is.na(Group)) %>% spread(Group, n) %>% putZeros()

conditions<- names(tmat) %>% not_including(c( "PID")) %>% print()
 [1] "Allergic_rhinitis"    "Alzheimers_disease"   "Anemia"              
 [4] "Appendicitis"         "Arthritis"            "Asthma_bronchitis"   
 [7] "Cancer"               "Cardiac"              "Chronic_pain"        
[10] "Chronic_respiratory"  "Dermatitis"           "Diabetes"            
[13] "Drug_abuse"           "ENT"                  "High_cholesterol"    
[16] "Hypertension"         "Kidney_urinary_etc"   "Mental"              
[19] "Metabolic"            "Neuro"                "Obesity"             
[22] "Orthopaedic_disorder" "Sinusitis"            "Smoking"             
# Then attach the indicator matrix to the patient characteristics set, and fill in the zeros
PID<-              left_join(PID, tmat, by="PID")
PID[,conditions]<- putZeros(PID[, conditions])
PID[,conditions]<- apply(PID[,conditions],2, function(x) ifelse(x > 1, 1, x))

Survival analysis

Fantastic! Now that we have our features ready, we can set up our survival outcome variable, which relates the survival time futime to death status Status. Also to simplify the ethnicity factor, I will recode this a binary predictor, since the patients in the “Black” category had a somewhat higher risk that the other groups in previous EDA.

My first step will be to run each pre-existing condition in turn, in a basic model including age, sex and ethnicity as covariates. I also make use of the purrr package to streamline modelling.

# Create survival object
PID$Surv<-      with(PID, Surv(futime, Status))
PID$ethnicity<- ifelse(PID$ethnicity_fac %like% "Black", 1, 0)

# Create formulas
forms<-         paste("Surv ~ age + sex + ethnicity +",conditions)
formulas<-      map(forms, as.formula)

# Run the models, and tidy them up
models<-        map(formulas, ~coxph(.x,data= PID))
tidies<-        map(models, broom::tidy)
out<-           bind_rows(tidies, .id= "ID")
out %>%         head(20)

Where do we see a signal? It might be easier to visualise the adjusted univariate effects in a plot…

out_plot<- out %>% filter(!term %in% c("sex","age","ethnicity")) %>%
    mutate(ExpE= exp(estimate), ExpL= exp(conf.low), ExpU= exp(conf.high),
           Significant= ifelse(p.value < 0.05, "Significant", "Not significant"),
           alpha= ifelse(p.value < 0.05, 1, 0.3),
           Predictor= fct_reorder(term, ExpE))
out_plot$alpha<- I(out_plot$alpha)

ggplot(out_plot, aes(x= ExpE, y= Predictor, fill= estimate, alpha= alpha)) +
  geom_col() +
  scale_x_continuous( expand = c(0, 0)) +
  theme_bw() +
  theme(legend.position= "none",
        axis.ticks.y= element_blank(),
        axis.text.y = element_text(size=10),
        panel.grid.major.y = element_blank()) +
  scale_fill_gradient(low= "#7a5195", high="#ff764a")+
  labs(x= "\nOdds ratio - Increase in risk if condition is present", y= "Pre-existing condition",
       title= "Univariate effects of Pre-existing Conditions on COVID Death",
       subtitle = "Effects in light shading indicate non-significant coefficients.")

The Smoking effect really stands out! This is much higher than what we have found in our real COVID research. This could be because the Smoking observation in the patient histories is recorded alongside current, daily smoking or tobacco use.

More focused survival modelling

Now we will pull through the conditions that have shown some signal, and begin survival modelling in earnest. This includes splitting the data into a training and test set.

# Identify variables with significant univariate association
sigs<- out %>% filter(p.value < 0.05) %>% pull(term) %>% unique() 

# Modelling data set
XY<-   PID %>% select(Status, Surv, all_of(sigs)) %>% na.omit()

# Create test-train-split
set.seed(123)
trainIndex<- createDataPartition(XY$Status, p=0.7, list=FALSE, times=1)

X<-           XY %>% select(-c(Surv, Status))  %>% as.matrix() %>% scale()
y<-           XY$Surv
X_train<-     X[trainIndex,]
X_test<-      X[-trainIndex,]
y_train<-     y[trainIndex]
y_test<-      XY$Status[-trainIndex]
PID_train<-   PID[trainIndex,]
PID_test<-    PID[-trainIndex, ]

# Use glmnet to apply regularised cox proportional hazards
fit <-        glmnet(X_train, y_train, family="cox")
plot(fit, label=TRUE, cex= 1.5)

cv.fit <-     cv.glmnet(X_train, y_train, family="cox")
plot(cv.fit)

Cross-validated on the cox ph suggests model with between 4 and 8 predictors. Lets see what coefficients enter the model in those first 4 - 8 steps. I like to examine how the coefficient matrix evolves.

# Extract the coefficient matrix and sort by absolute value to get the predictors into a sensible order

mmat<-          as.matrix(coef(fit)) %>% data.frame() 
mmat$row_sums<- rowSums(abs(mmat))
mmat<-          mmat %>% arrange(desc(row_sums))

print(rownames(mmat))
 [1] "age"                  "Smoking"              "Hypertension"        
 [4] "Chronic_respiratory"  "Anemia"               "Appendicitis"        
 [7] "Kidney_urinary_etc"   "Sinusitis"            "sex"                 
[10] "Orthopaedic_disorder" "Asthma_bronchitis"    "Diabetes"            
[13] "ENT"                 

So this gives us the approximate order in which the various conditions enter the multivariate model. I tend to use the columns of the coefficient matrix to explore different models in the optimal region until I find one that looks sensible, with all predictors having a significant p-value.

# Create in indicator matrix to show model sizes at each step
Zmat<-        mmat!=0
model_sizes<- Zmat %>% colSums() %>% print()
      s0       s1       s2       s3       s4       s5       s6       s7       s8       s9 
       0        1        1        1        1        1        1        2        2        2 
     s10      s11      s12      s13      s14      s15      s16      s17      s18      s19 
       2        2        2        2        2        2        2        2        2        3 
     s20      s21      s22      s23      s24      s25      s26      s27      s28      s29 
       4        4        4        4        4        5        5        5        6        6 
     s30      s31      s32      s33      s34      s35      s36      s37      s38      s39 
       7        8        8        8        8        8        8        8        8        8 
     s40      s41      s42      s43      s44      s45      s46      s47      s48      s49 
       8        9        9        9        9       10       11       11       11       11 
     s50      s51      s52 row_sums 
      11       11       11       11 
# I use this to iteractively explore the model selections at each step
trialSet<-   row.names(Zmat)[Zmat[,28]] %>% print()
[1] "age"                 "Smoking"             "Hypertension"        "Chronic_respiratory"
[5] "Anemia"             
formula<-    as.formula(paste("Surv ~",str_c(trialSet, collapse= " + "))) %>% print()
Surv ~ age + Smoking + Hypertension + Chronic_respiratory + Anemia
m2<- coxph(formula,  data= PID_train)
summary(m2)
Call:
coxph(formula = formula, data = PID_train)

  n= 6328, number of events= 247 

                         coef exp(coef)  se(coef)      z Pr(>|z|)    
age                  0.044822  1.045842  0.003161 14.179  < 2e-16 ***
Smoking              2.359141 10.581855  0.157834 14.947  < 2e-16 ***
Hypertension         0.567629  1.764080  0.130118  4.362 1.29e-05 ***
Chronic_respiratory  0.786792  2.196339  0.225826  3.484 0.000494 ***
Anemia               0.271028  1.311312  0.132843  2.040 0.041329 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                    exp(coef) exp(-coef) lower .95 upper .95
age                     1.046     0.9562     1.039     1.052
Smoking                10.582     0.0945     7.766    14.418
Hypertension            1.764     0.5669     1.367     2.277
Chronic_respiratory     2.196     0.4553     1.411     3.419
Anemia                  1.311     0.7626     1.011     1.701

Concordance= 0.881  (se = 0.01 )
Likelihood ratio test= 622.1  on 5 df,   p=<2e-16
Wald test            = 809.6  on 5 df,   p=<2e-16
Score (logrank) test = 1949  on 5 df,   p=<2e-16

Assess predictions from the suggested model

The above analysis indicates a multivariate model with five features. How does this model do predicting against the test set?

m3<-        coxph(Surv ~ age + Smoking + Hypertension + Anemia + Chronic_respiratory,  
                  data= PID_train)
preds<-     predict(m3, newdata = PID_test, type= "risk")

pROC_obj <- pROC::roc(y_test, preds,
                smoothed = TRUE,
                ci=TRUE, ci.alpha=0.9, stratified=FALSE, auc.polygon.col="#e6cbdb",
                plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
                print.auc=TRUE, show.thres=TRUE)

Great! The model suggested by regularised cox ph has good predictive ability on the test set, with an AUC of 0.899.

Final model and discussion

Now we put all the data back together to estimate the effects with the final model.

# Put together final model to get best estimates of the coefficients

# Lets predict the effect of a 10 year increment in age, to get a clear odds ratio interval
PID$Age_10_yrs<- PID$age/10

final<- coxph(Surv ~ Age_10_yrs + Smoking + Hypertension + Anemia + Chronic_respiratory,  
              data= PID)
final_sum<- broom::tidy(final) %>% mutate( ExpE= exp(estimate),
                                           ExpL= exp(conf.low),
                                           ExpU= exp(conf.high),
                                           Order= 5:1)
final_sum<- final_sum %>% mutate(Comment= paste0(sprintf("%2.1f",ExpE),"  [",
                                                sprintf("%2.1f",ExpL),", ",
                                                sprintf("%2.1f",ExpU),"]"))

cols5<- c( "#374c80", "#7a5195", "#bc5090", "#ef5675", "#ff764a")

ggplot(final_sum,aes(x= ExpE,y=Order, color= factor(Order))) +
  geom_point(size=2) +
  geom_errorbarh(aes(xmin=ExpL,xmax= ExpU),height=0.2)+
  geom_vline(xintercept= 1, col="grey30",size=.1, lty="dashed") +
  geom_text(aes(x=ExpE,y=Order+0.25,label= term), fontface= "bold") +
  geom_text(aes(x=ExpE,y=Order-0.25,label= Comment)) +
  theme_bw() +
  expand_limits(x=c(0,2))+
  theme(title= element_text(size=12),
        axis.title.x = element_text(margin = margin(t = 5, r = 20, b = 0, l = 0)),
        axis.text.y= element_blank(),axis.ticks.y=element_blank(),
        axis.text.x= element_text(size=10,lineheight = 0.3),
        legend.position = "none",
        panel.grid= element_blank(),
        strip.background= element_rect(fill="grey20"),
        strip.text= element_text(color="white",face="bold",size=12),
        panel.spacing = unit(1, "lines"),
        panel.border= element_rect(colour= "grey30")) +
  labs(title= "Inreased Risk of COVID Death (Odds Ratio) by Predictor", 
       x= "Change in Odds associated with each predictor",
       y= element_blank()) +
  scale_color_manual(values= cols5)

Many thanks to Synthea for producing such a detailed synthetic data set for demonstrating various modelling methods.

Again, Smoking looks to be the real villain here, with current, daily smoking increasing the odds of death (once infected with COVID) by about ten-fold.

Conditions under the Chronic_Respiratory heading include pulmonary emphysema, chronic obstructive bronchitis, suspected lung cancer, pulmonary rehabilitation (regime/therapy) and cystic fibrosis. Patients with these conditions might be expected to have an increased risk of death once infected with COVID of between 1.4x and 2.9x compared with those who do not.

Every additional 10 years of age of the patient at admission is seen to increase the risk of death by between 50% to 70%, while those with hypertension also have about a 70% increased risk.

References

Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079

---
title: 'Synthea: COVID Death and Pre-Existing Conditions'
author: "Cel McCracken"
date: "04/07/2020"
output: html_notebook
---

Finding great data to explore methods in Health Data Science can be difficult in a climate where personal data is intensely private by definition. 
Recent advances are being made in differential privacy techniques and federated data analysis, while others make data available by generating synthetic data based on the features of underlying real health data.

> SyntheaTM is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government. (https://github.com/synthetichealth/synthea/wiki)

The data provided by Synthea is detailed, workable, and can be downloaded from [https://synthea.mitre.org/downloads](https://synthea.mitre.org/downloads).  Additional information about the data can be found at [https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary](https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary).

In this series of posts, I will be using a subset of the Synthea data, focussing on the recent COVID pandemic. In this post, I pose the question -

> How do pre-existing coditions affect the risk of COVID death?

Since all data in this set are suspected COVID patients, this might be re-phrased, how do pre-existing conditions affect the risk of COVID death, given one is infected and symptomatic enough to seek medical help. The active period covers 27-05-2019 to 27-05-2020.

```{r cars}
# Import some packages
library(tidyverse)
library(data.table)
library(survival)
library(caret)
library(glmnet)

# And some handy functions
putZeros<- function(df) {
  df<- df %>% mutate_at(vars(-group_cols()),~replace(.,is.na(.),0))
  return(df)
}

not_including <- function(x,elements){
  y<- x[!x %in% elements]
  return(y)
}

# And read in the data that I cleaned and wrangled in a previous session

IP<- readRDS("Dev Data 0/IP_0.rds") # 1766943 rows
```

### Let's begin

In this post I will be making use of the data in two main parts. 

1. Patient characterists (age, sex, etc. - one row per patient)
2. Patient histories (observations, procedures etc. - many rows per patient)

```{r create_patient_set, message=FALSE}
# Create a patient summary set

PID<- IP %>% select(PID, age, sex, Status, futime, ethnicity_fac, allergy) %>% distinct()
#9040 patients

# Add some labels
PID$over_60<-      ifelse(PID$age >= 60, "Over 60", "Under 60")
PID$death_status<- ifelse(PID$Status== 0, "Alive", "Died")

# Mosaic plot - what we can see already about COVID death
ptable<-           table(PID$death_status, PID$over_60)
mosaicplot(ptable, main= "COVID Death Counts by Age", color = c("#ef5675", "#374c80"),
           cex.axis= 1)
```
We can see already that patients that are over the age of 60 at the start of the study are much more likely to die from COVID infection than patients that are under 60.  But what other factors might be at play?

```{r message=FALSE, warning= FALSE}
# Extract patient histories from the main set

history<- IP %>% filter(type %like% "history", 
                        type %like% "condition|observation|procedure")

# The patient histories are a collection of codes and descriptions for past conditions,
# observations, disorders and procedures.

# I used this mapping table to put these tags into groups
groupRef<- read_csv("group.csv")
groupRef %>% head(20)
```
After joining the grouping reference with the patient histories, we can see that there are a number of conditions present prior to the active COVID period.
```{r  message= FALSE}
# Join the reference list with patient histories
history<- left_join(history,groupRef, by="DESCRIPTION")

history %>% count(Group) %>% filter(!is.na(Group)) %>%
  mutate(Condition= fct_reorder(Group, n)) %>%
  ggplot(aes(x= n, y= Condition, fill= n)) +
  geom_col() +
  labs(title= "Condition mentions in patient histories", x= "Count of mentions") +
  theme(legend.position = "none") +
  scale_fill_gradient(low= "#7a5195", high="#ff764a")
```
This is great but we need to create a matrix that is good for modelling. In other words, we need one row per patient, and an indicator matrix with a binary flag as to whether the patient has a record of each condition or not.

```{r warning=FALSE}
# Create an indicator matrix
tmat<- history %>% count(PID,Group) %>% 
          filter(!is.na(Group)) %>% spread(Group, n) %>% putZeros()

conditions<- names(tmat) %>% not_including(c( "PID")) %>% print()

# Then attach the indicator matrix to the patient characteristics set, and fill in the zeros
PID<-              left_join(PID, tmat, by="PID")
PID[,conditions]<- putZeros(PID[, conditions])
PID[,conditions]<- apply(PID[,conditions],2, function(x) ifelse(x > 1, 1, x))
```

### Survival analysis
Fantastic! Now that we have our features ready, we can set up our survival outcome variable, which relates the survival time `futime` to death status `Status`. Also to simplify the ethnicity factor, I will recode this a binary predictor, since the patients in the "Black" category had a somewhat higher risk that the other groups in previous EDA.

My first step will be to run each pre-existing condition in turn, in a basic model including age, sex and ethnicity as covariates. I also make use of the `purrr` package to streamline modelling.

```{r}
# Create survival object
PID$Surv<-      with(PID, Surv(futime, Status))
PID$ethnicity<- ifelse(PID$ethnicity_fac %like% "Black", 1, 0)

# Create formulas
forms<-         paste("Surv ~ age + sex + ethnicity +",conditions)
formulas<-      map(forms, as.formula)

# Run the models, and tidy them up
models<-        map(formulas, ~coxph(.x,data= PID))
tidies<-        map(models, broom::tidy)
out<-           bind_rows(tidies, .id= "ID")
out %>%         head(20)
```
Where do we see a signal? It might be easier to visualise the adjusted univariate effects in a plot...
```{r message= FALSE}
out_plot<- out %>% filter(!term %in% c("sex","age","ethnicity")) %>%
    mutate(ExpE= exp(estimate), ExpL= exp(conf.low), ExpU= exp(conf.high),
           Significant= ifelse(p.value < 0.05, "Significant", "Not significant"),
           alpha= ifelse(p.value < 0.05, 1, 0.3),
           Predictor= fct_reorder(term, ExpE))
out_plot$alpha<- I(out_plot$alpha)

ggplot(out_plot, aes(x= ExpE, y= Predictor, fill= estimate, alpha= alpha)) +
  geom_col() +
  scale_x_continuous( expand = c(0, 0)) +
  theme_bw() +
  theme(legend.position= "none",
        axis.ticks.y= element_blank(),
        axis.text.y = element_text(size=10),
        panel.grid.major.y = element_blank()) +
  scale_fill_gradient(low= "#7a5195", high="#ff764a")+
  labs(x= "\nOdds ratio - Increase in risk if condition is present", y= "Pre-existing condition",
       title= "Univariate effects of Pre-existing Conditions on COVID Death",
       subtitle = "Effects in light shading indicate non-significant coefficients.")
```


The `Smoking` effect really stands out! This is much higher than what we have found in our real COVID research. This could be because the `Smoking` observation in the patient histories is recorded alongside current, daily smoking or tobacco use.

### More focused survival modelling

Now we will pull through the conditions that have shown some signal, and begin survival modelling in earnest.  This includes splitting the data into a training and test set.

```{r message=FALSE, warning=FALSE}
# Identify variables with significant univariate association
sigs<- out %>% filter(p.value < 0.05) %>% pull(term) %>% unique() 

# Modelling data set
XY<-   PID %>% select(Status, Surv, all_of(sigs)) %>% na.omit()

# Create test-train-split
set.seed(123)
trainIndex<- createDataPartition(XY$Status, p=0.7, list=FALSE, times=1)

X<-           XY %>% select(-c(Surv, Status))  %>% as.matrix() %>% scale()
y<-           XY$Surv
X_train<-     X[trainIndex,]
X_test<-      X[-trainIndex,]
y_train<-     y[trainIndex]
y_test<-      XY$Status[-trainIndex]
PID_train<-   PID[trainIndex,]
PID_test<-    PID[-trainIndex, ]

# Use glmnet to apply regularised cox proportional hazards
fit <-        glmnet(X_train, y_train, family="cox")
plot(fit, label=TRUE, cex= 1.5)
cv.fit <-     cv.glmnet(X_train, y_train, family="cox")
plot(cv.fit)
```
Cross-validated on the cox ph suggests model with between 4 and 8 predictors. Lets see what coefficients enter the model in those first 4 - 8 steps. I like to examine how the coefficient matrix evolves.

```{r}
# Extract the coefficient matrix and sort by absolute value to get the predictors into a sensible order

mmat<-          as.matrix(coef(fit)) %>% data.frame() 
mmat$row_sums<- rowSums(abs(mmat))
mmat<-          mmat %>% arrange(desc(row_sums))

print(rownames(mmat))
```
So this gives us the approximate order in which the various conditions enter the multivariate model. I tend to use the columns of the coefficient matrix to explore different models in the optimal region until I find one that looks sensible, with all predictors having a significant p-value.

```{r}
# Create in indicator matrix to show model sizes at each step
Zmat<-        mmat!=0
model_sizes<- Zmat %>% colSums() %>% print()

# I use this to iteractively explore the model selections at each step
trialSet<-   row.names(Zmat)[Zmat[,28]] 
formula<-    as.formula(paste("Surv ~",str_c(trialSet, collapse= " + "))) 

m2<- coxph(formula,  data= PID_train)
summary(m2)
```

### Assess predictions from the suggested model

The above analysis indicates a multivariate model with five features. How does this model do predicting against the test set?

```{r message= FALSE}
m3<-        coxph(Surv ~ age + Smoking + Hypertension + Anemia + Chronic_respiratory,  
                  data= PID_train)
preds<-     predict(m3, newdata = PID_test, type= "risk")

pROC_obj <- pROC::roc(y_test, preds,
                smoothed = TRUE,
                ci=TRUE, ci.alpha=0.9, stratified=FALSE, auc.polygon.col="#e6cbdb",
                plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
                print.auc=TRUE, show.thres=TRUE)
```
Great! The model suggested by regularised cox ph has good predictive ability on the test set, with an AUC of 0.899.  

### Final model and discussion

Now we put all the data back together to estimate the effects with the final model.
```{r }
# Put together final model to get best estimates of the coefficients

# Lets predict the effect of a 10 year increment in age, to get a clear odds ratio interval
PID$Age_10_yrs<- PID$age/10

final<- coxph(Surv ~ Age_10_yrs + Smoking + Hypertension + Anemia + Chronic_respiratory,  
              data= PID)
final_sum<- broom::tidy(final) %>% mutate( ExpE= exp(estimate),
                                           ExpL= exp(conf.low),
                                           ExpU= exp(conf.high),
                                           Order= 5:1)
final_sum<- final_sum %>% mutate(Comment= paste0(sprintf("%2.1f",ExpE),"  [",
                                                sprintf("%2.1f",ExpL),", ",
                                                sprintf("%2.1f",ExpU),"]"))

cols5<- c( "#374c80", "#7a5195", "#bc5090", "#ef5675", "#ff764a")

ggplot(final_sum,aes(x= ExpE,y=Order, color= factor(Order))) +
  geom_point(size=2) +
  geom_errorbarh(aes(xmin=ExpL,xmax= ExpU),height=0.2)+
  geom_vline(xintercept= 1, col="grey30",size=.1, lty="dashed") +
  geom_text(aes(x=ExpE,y=Order+0.25,label= term), fontface= "bold") +
  geom_text(aes(x=ExpE,y=Order-0.25,label= Comment)) +
  theme_bw() +
  expand_limits(x=c(0,2))+
  theme(title= element_text(size=12),
        axis.title.x = element_text(margin = margin(t = 5, r = 20, b = 0, l = 0)),
        axis.text.y= element_blank(),axis.ticks.y=element_blank(),
        axis.text.x= element_text(size=10,lineheight = 0.3),
        legend.position = "none",
        panel.grid= element_blank(),
        strip.background= element_rect(fill="grey20"),
        strip.text= element_text(color="white",face="bold",size=12),
        panel.spacing = unit(1, "lines"),
        panel.border= element_rect(colour= "grey30")) +
  labs(title= "Inreased Risk of COVID Death (Odds Ratio) by Predictor", 
       x= "Change in Odds associated with each predictor",
       y= element_blank()) +
  scale_color_manual(values= cols5)
```
Many thanks to Synthea for producing such a detailed synthetic data set for demonstrating various modelling methods.

Again, `Smoking` looks to be the real villain here, with current, daily smoking increasing the odds of death (once infected with COVID) by about ten-fold.  

Conditions under the `Chronic_Respiratory` heading include pulmonary emphysema, 
chronic obstructive bronchitis, suspected lung cancer, pulmonary rehabilitation (regime/therapy) and cystic fibrosis. Patients with these conditions might be expected to have an increased risk of death once infected with COVID of between 1.4x and 2.9x compared with those who do not.

Every additional 10 years of age of the patient at admission is seen to increase the risk of death by between 50% to 70%, while those with hypertension also have about a 70% increased risk.


### References

Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079
