Variable analysis COVID-19 Imaging database

In this document I take a quick look at the variables that we decided to collect in the remaining patients, focusing on variable counts and missing values. This lets us reevaluate our list of variables to recollect, see the weaknesses in the data collected so far, and avoid the same mistakes in future data collection.

I think an important thing to consider is that even if we collect an additional 300 patients and end up with ~400, there will not be much room left in our models: once we include the imaging and demographic variables and important cardiovascular variables such as HTN and smoking, adding more variables will quickly lead to overfitting. Also, for every missing value in any variable in the model, the whole observation is dropped from the model. Therefore I believe that in general we will not use many variables beyond the imaging, demographic, and typical cardiovascular ones. Tracy, please correct me if I am getting something wrong.
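To illustrate how quickly complete-case (listwise) deletion shrinks the usable sample, here is a minimal Python sketch. The records, variable names, and values are invented for illustration and are not taken from our database; Python is used only for the sketch and is not necessarily what produced the output in this document.

```python
# Toy records; None marks a missing value. All values are invented for illustration.
patients = [
    {"age": 64, "htn": "Yes", "smoking": "No"},
    {"age": 71, "htn": None, "smoking": "Yes"},
    {"age": 58, "htn": "No", "smoking": None},
    {"age": 49, "htn": "No", "smoking": "No"},
    {"age": 80, "htn": None, "smoking": None},
]


def complete_cases(rows, variables):
    """Keep only rows that have a recorded value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# Each added variable can only keep or reduce the number of usable rows.
for model_vars in (["age"], ["age", "htn"], ["age", "htn", "smoking"]):
    kept = complete_cases(patients, model_vars)
    print(model_vars, "->", len(kept), "of", len(patients), "rows usable")
```

In this toy example a single incomplete variable already removes two of the five rows, which is the effect described above.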

Link to the Variables to choose spreadsheet: https://docs.google.com/spreadsheets/d/1vj9DYLBjopVMnk5B_sInAX56oxSK1bmCzrAMzrzkAsQ/edit#gid=587039776


Level of care at time of admission and Highest level of care during hospital stay

Here I combined level_of_care_at_time_of_admission with highest_level_of_care_during_hospital_stay for each patient with " - " between them:

##   [1] "Med/surg - Med/surg"           "Step-down/PCU - Step-down/PCU"
##   [3] "Step-down/PCU - Step-down/PCU" "Med/surg - Med/surg"          
##   [5] "Med/surg - Med/surg"           "Step-down/PCU - Step-down/PCU"
##   [7] "Med/surg - ICU"                "NA - NA"                      
##   [9] "ICU - ICU"                     "Med/surg - Med/surg"          
##  [11] "ICU - ICU"                     "Med/surg - Step-down/PCU"     
##  [13] "Med/surg - Med/surg"           "Step-down/PCU - ICU"          
##  [15] "Step-down/PCU - ICU"           "Step-down/PCU - Med/surg"     
##  [17] "Step-down/PCU - Step-down/PCU" "Med/surg - ICU"               
##  [19] "Med/surg - Med/surg"           "ICU - ICU"                    
##  [21] "Med/surg - Med/surg"           "ICU - ICU"                    
##  [23] "Med/surg - Step-down/PCU"      "Step-down/PCU - Step-down/PCU"
##  [25] "Med/surg - Med/surg"           "ICU - ICU"                    
##  [27] "Step-down/PCU - Step-down/PCU" "Med/surg - Step-down/PCU"     
##  [29] "NA - NA"                       "NA - NA"                      
##  [31] "Step-down/PCU - ICU"           "Med/surg - Med/surg"          
##  [33] "NA - NA"                       "Med/surg - Step-down/PCU"     
##  [35] "NA - NA"                       "Med/surg - Med/surg"          
##  [37] "Med/surg - Med/surg"           "NA - NA"                      
##  [39] "NA - NA"                       "NA - NA"                      
##  [41] "NA - NA"                       "NA - NA"                      
##  [43] "NA - NA"                       "NA - NA"                      
##  [45] "NA - NA"                       "NA - NA"                      
##  [47] "NA - NA"                       "NA - NA"                      
##  [49] "NA - NA"                       "NA - NA"                      
##  [51] "NA - NA"                       "NA - NA"                      
##  [53] "NA - NA"                       "NA - NA"                      
##  [55] "NA - NA"                       "NA - NA"                      
##  [57] "NA - NA"                       "NA - NA"                      
##  [59] "NA - NA"                       "NA - NA"                      
##  [61] "NA - NA"                       "NA - NA"                      
##  [63] "NA - NA"                       "NA - NA"                      
##  [65] "NA - NA"                       "NA - NA"                      
##  [67] "NA - NA"                       "NA - NA"                      
##  [69] "Step-down/PCU - NA"            "NA - NA"                      
##  [71] "NA - NA"                       "NA - NA"                      
##  [73] "NA - NA"                       "NA - NA"                      
##  [75] "NA - NA"                       "NA - NA"                      
##  [77] "NA - NA"                       "ICU - ICU"                    
##  [79] "NA - NA"                       "Step-down/PCU - Step-down/PCU"
##  [81] "Med/surg - Med/surg"           "NA - NA"                      
##  [83] "NA - NA"                       "ICU - ICU"                    
##  [85] "Med/surg - Med/surg"           "Med/surg - Med/surg"          
##  [87] "Med/surg - Med/surg"           "Med/surg - Med/surg"          
##  [89] "NA - NA"                       "NA - NA"                      
##  [91] "NA - NA"                       "Med/surg - Med/surg"          
##  [93] "Med/surg - Med/surg"           "Med/surg - Med/surg"          
##  [95] "Step-down/PCU - Step-down/PCU" "NA - NA"                      
##  [97] "NA - NA"                       "Med/surg - Med/surg"          
##  [99] "Step-down/PCU - Step-down/PCU" "Med/surg - Med/surg"          
## [101] "Med/surg - Med/surg"           "NA - NA"                      
## [103] "NA - NA"                       "NA - NA"                      
## [105] "Med/surg - Med/surg"           "NA - NA"                      
## [107] "NA - NA"                       "NA - NA"                      
## [109] "NA - NA"                       "NA - NA"                      
## [111] "NA - NA"                       "NA - NA"                      
## [113] "NA - NA"                       "NA - NA"                      
## [115] "NA - NA"                       "NA - NA"                      
## [117] "NA - NA"                       "NA - NA"                      
## [119] "NA - NA"                       "NA - NA"                      
## [121] "NA - NA"                       "NA - NA"                      
## [123] "NA - NA"                       "NA - NA"                      
## [125] "NA - NA"                       "NA - NA"                      
## [127] "NA - NA"                       "Med/surg - ICU"
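The combined vector above is easier to read as counts. The following Python sketch shows one way to build such " - " pairs and tabulate them. The six toy values are illustrative only, the helper name `combine` is hypothetical, and this is not the code that produced the output above.

```python
from collections import Counter

# Toy values for illustration; None marks a missing entry.
admission = ["Med/surg", "Step-down/PCU", "Med/surg", None, "ICU", "Step-down/PCU"]
highest = ["Med/surg", "Step-down/PCU", "ICU", None, "ICU", None]


def combine(a, b, sep=" - "):
    """Join two values, writing missing ones as the literal string 'NA'."""
    return sep.join("NA" if v is None else v for v in (a, b))


pairs = [combine(a, h) for a, h in zip(admission, highest)]
counts = Counter(pairs)

# Among patients with both values recorded, count those whose level did not change.
both_recorded = [(a, h) for a, h in zip(admission, highest)
                 if a is not None and h is not None]
unchanged = sum(a == h for a, h in both_recorded)

for label, n in counts.most_common():
    print(f"{label}: {n}")
print(f"unchanged: {unchanged} of {len(both_recorded)} with both recorded")
```

Running the same tabulation on the real columns would give an exact figure for how often the two variables agree.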

I would suggest keeping just one of them, as they are mostly the same, with a few exceptions. With our data granularity I don’t think these small differences will have much influence on outcomes. Given how many values are missing, we could also consider removing both.


Admission diagnosis

primary_admission_diagnosis n
COVID-19 - confirmed 17
COVID-19 - rule-out 33
Other 11
NA 67

I don’t think that distinguishing rule-out vs. confirmed COVID would make much difference to our outcomes, so I would recommend dropping this variable.


Other chronic lung disease

other_chronic_lung_disease n
No 55
Yes 1
NA 72

I would suggest dropping this variable: only 1 patient has it, and it is very non-specific.


OSA and Asthma

obstructive_sleep_apnea n
No 53
Yes 5
NA 70

asthma n
No 46
Yes 12
NA 70

Both are very important predictors of COVID severity. I would suggest keeping asthma, as a solid number of patients have it (12), but removing OSA, as only 5 patients have it.
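For context, the counts in the two tables above can be expressed as the share of "Yes" among patients with a recorded value. This is a small Python sketch of that arithmetic; the counts are copied from the tables, and the helper name `yes_rate` is hypothetical.

```python
# Counts copied from the two tables above.
counts = {
    "obstructive_sleep_apnea": {"No": 53, "Yes": 5, "NA": 70},
    "asthma": {"No": 46, "Yes": 12, "NA": 70},
}


def yes_rate(c):
    """Percentage of 'Yes' among patients with a recorded (non-missing) value."""
    recorded = c["No"] + c["Yes"]
    return 100 * c["Yes"] / recorded


for name, c in counts.items():
    recorded = c["No"] + c["Yes"]
    print(f"{name}: {c['Yes']} of {recorded} recorded ({yes_rate(c):.1f}%)")
```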


Baseline EF

baseline_ejection_fraction_prior_to_presentation n
28 1
30 1
51 1
53 1
61 1
65 1
70 1
75 1
77 1
82 1
NA 118

I would drop this variable, important as it is, because of so many missing values.


Important baseline characteristics with a lot of missing values

We can see in the Variables to choose spreadsheet that several of our very important cardiac baseline variables have many missing values (perc_missing).

For instance, the CAD variable looks like this across patients:

##   [1] "No"  NA    NA    NA    "No"  "No"  NA    NA    NA    "No"  NA    NA   
##  [13] "No"  NA    NA    NA    "No"  "No"  NA    NA    NA    NA    "Yes" "Yes"
##  [25] "No"  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    "No" 
##  [37] "Yes" NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
##  [49] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
##  [61] NA    NA    NA    NA    NA    NA    NA    NA    NA    "No"  NA    NA   
##  [73] NA    NA    NA    NA    NA    "No"  "Yes" "No"  "No"  "No"  "No"  "No" 
##  [85] "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No" 
##  [97] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  NA    NA    NA   
## [109] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [121] NA    NA    NA    NA    NA    NA    NA    "Yes"

The pattern is similar across other important variables such as hypertension and smoking. We badly need these variables, and they need to be complete, because when they are used in modeling each missing value drops the whole observation (as I wrote above). I also believe that if so many values are missing, reviewers will question the quality of our data. We should discuss with the data collection team what problems they encountered in collecting these variables and how to ensure that the percentage of missing values in these crucial variables is < 5%. My guess is that some of the data collectors wrote “NA” when they did not find any history of the condition, so potentially these values should be “0” instead?
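A check against the 5% target could be automated along the following lines. This Python sketch uses invented toy data; the variable names and the constant `MAX_PERC_MISSING` are assumptions for illustration, not part of our actual pipeline.

```python
# Toy data; None marks a missing value. Names and values are illustrative only.
data = {
    "cad": ["No", None, None, "Yes", "No", None, "No", None, None, "No"],
    "htn": ["Yes", "No", None, "Yes", "No", "No", "Yes", "No", "Yes", "No"],
    "age": [64, 71, 58, 49, 80, 66, 73, 55, 61, 77],
}

MAX_PERC_MISSING = 5.0  # the "< 5%" target mentioned in the text


def perc_missing(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


report = {name: perc_missing(values) for name, values in data.items()}

# Variables that fail the target and should be raised with the data collectors.
too_sparse = [name for name, pct in report.items() if pct >= MAX_PERC_MISSING]

for name, pct in report.items():
    flag = "CHECK" if name in too_sparse else "ok"
    print(f"{name}: {pct:.1f}% missing [{flag}]")
```

Running such a check after each collection batch would surface problem variables early rather than at the modeling stage.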

Of the variables in the image above, the one I would consider dropping is cancer, given how few patients have it:

malignancy_active_treatment_or_within_last_year n
No 58
Yes 3
NA 67


Other missing variables

There are a lot of missing values in other important variables that seem very easy to extract from Epic, such as SBP:

SBP_n_missing n_observatiions perc_miss
66 128 51.5625
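The percentage in the row above follows directly from the two counts, as this short Python check shows (the variable names mirror the column headers and are otherwise arbitrary):

```python
# Values copied from the SBP row above.
n_missing = 66
n_observations = 128

perc_miss = 100 * n_missing / n_observations
n_recorded = n_observations - n_missing

print(f"{perc_miss}% missing, {n_recorded} patients with SBP recorded")
```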

We should discuss with the data collection team any issues with obtaining these variables, to make sure they are collected for future patients.


Some of the outcomes

tracheostomy n
Yes 1
NA 127

We might consider dropping this variable, given how few patients have it.


suspected_confirmed_myocarditis n
No 55
Yes 2
NA 71

We might consider dropping this variable, given how few patients have it.


Readmissions

was_the_patient_re_admitted_after_the_index_hospitalization_including_to_outside_hospital_if_records_available n
No 33
Yes 14
NA 81

We could consider updating this variable for the previous 125 patients after we pick a specific “end of follow-up” date. It could be useful as a potential outcome variable or as part of a composite outcome.

Please let me know your thoughts.