In this document I take a quick look at the variables that we decided to collect for the remaining patients, focusing on counts per variable and missing values. This lets us re-evaluate our list of variables to recollect, see the weaknesses in the data collected so far, and avoid the same mistakes in future data collection.
I think an important thing to consider is that even if we collect an additional 300 patients and end up with ~400, once we include imaging and demographic variables and important cardiovascular variables such as HTN and smoking, there will not be much room to add other variables before we start overfitting our models. Also, for every missing value in any variable in the model, the whole observation is dropped from the model. Therefore I believe that, in general, we will not be using many variables beyond our imaging, demographic and typical cardiovascular ones. Tracy, please correct me if I am getting something wrong.
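To make the point about dropped observations concrete, here is a minimal sketch in base R. The data frame `toy` and all of its values are made up for illustration, and `lm()` simply stands in for whichever model we end up using; the only thing that matters here is the NA pattern.

```r
# Toy data: all values are made up; only the NA pattern matters.
toy <- data.frame(
  y       = seq(1, 20) + rep(c(0, 3), 10),
  age     = seq(40, 78, by = 2),
  htn     = rep(c("No", "Yes", "Yes", "No"), 5),
  smoking = rep(c("No", "No", "Yes", "Yes", "No"), 4),
  stringsAsFactors = FALSE
)

# Seven missing cells spread over six different rows (row 5 has two).
toy$age[c(1, 2)]     <- NA
toy$htn[c(3, 4, 5)]  <- NA
toy$smoking[c(5, 6)] <- NA

n_total    <- nrow(toy)                   # 20 rows collected
n_complete <- sum(complete.cases(toy))    # 14 rows with no NA in any variable

# With R's default na.action (na.omit), lm() silently fits on complete rows only.
fit    <- lm(y ~ age + htn + smoking, data = toy)
n_used <- nobs(fit)                       # 14, not 20
```

In this toy case a handful of scattered missing cells costs 6 of 20 observations, and the loss grows with every additional variable that has its own missing values.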
Link to the Variables to choose spreadsheet: https://docs.google.com/spreadsheets/d/1vj9DYLBjopVMnk5B_sInAX56oxSK1bmCzrAMzrzkAsQ/edit#gid=587039776
Here I combined level_of_care_at_time_of_admission with highest_level_of_care_during_hospital_stay for each patient with " - " between them:
## [1] "Med/surg - Med/surg" "Step-down/PCU - Step-down/PCU"
## [3] "Step-down/PCU - Step-down/PCU" "Med/surg - Med/surg"
## [5] "Med/surg - Med/surg" "Step-down/PCU - Step-down/PCU"
## [7] "Med/surg - ICU" "NA - NA"
## [9] "ICU - ICU" "Med/surg - Med/surg"
## [11] "ICU - ICU" "Med/surg - Step-down/PCU"
## [13] "Med/surg - Med/surg" "Step-down/PCU - ICU"
## [15] "Step-down/PCU - ICU" "Step-down/PCU - Med/surg"
## [17] "Step-down/PCU - Step-down/PCU" "Med/surg - ICU"
## [19] "Med/surg - Med/surg" "ICU - ICU"
## [21] "Med/surg - Med/surg" "ICU - ICU"
## [23] "Med/surg - Step-down/PCU" "Step-down/PCU - Step-down/PCU"
## [25] "Med/surg - Med/surg" "ICU - ICU"
## [27] "Step-down/PCU - Step-down/PCU" "Med/surg - Step-down/PCU"
## [29] "NA - NA" "NA - NA"
## [31] "Step-down/PCU - ICU" "Med/surg - Med/surg"
## [33] "NA - NA" "Med/surg - Step-down/PCU"
## [35] "NA - NA" "Med/surg - Med/surg"
## [37] "Med/surg - Med/surg" "NA - NA"
## [39] "NA - NA" "NA - NA"
## [41] "NA - NA" "NA - NA"
## [43] "NA - NA" "NA - NA"
## [45] "NA - NA" "NA - NA"
## [47] "NA - NA" "NA - NA"
## [49] "NA - NA" "NA - NA"
## [51] "NA - NA" "NA - NA"
## [53] "NA - NA" "NA - NA"
## [55] "NA - NA" "NA - NA"
## [57] "NA - NA" "NA - NA"
## [59] "NA - NA" "NA - NA"
## [61] "NA - NA" "NA - NA"
## [63] "NA - NA" "NA - NA"
## [65] "NA - NA" "NA - NA"
## [67] "NA - NA" "NA - NA"
## [69] "Step-down/PCU - NA" "NA - NA"
## [71] "NA - NA" "NA - NA"
## [73] "NA - NA" "NA - NA"
## [75] "NA - NA" "NA - NA"
## [77] "NA - NA" "ICU - ICU"
## [79] "NA - NA" "Step-down/PCU - Step-down/PCU"
## [81] "Med/surg - Med/surg" "NA - NA"
## [83] "NA - NA" "ICU - ICU"
## [85] "Med/surg - Med/surg" "Med/surg - Med/surg"
## [87] "Med/surg - Med/surg" "Med/surg - Med/surg"
## [89] "NA - NA" "NA - NA"
## [91] "NA - NA" "Med/surg - Med/surg"
## [93] "Med/surg - Med/surg" "Med/surg - Med/surg"
## [95] "Step-down/PCU - Step-down/PCU" "NA - NA"
## [97] "NA - NA" "Med/surg - Med/surg"
## [99] "Step-down/PCU - Step-down/PCU" "Med/surg - Med/surg"
## [101] "Med/surg - Med/surg" "NA - NA"
## [103] "NA - NA" "NA - NA"
## [105] "Med/surg - Med/surg" "NA - NA"
## [107] "NA - NA" "NA - NA"
## [109] "NA - NA" "NA - NA"
## [111] "NA - NA" "NA - NA"
## [113] "NA - NA" "NA - NA"
## [115] "NA - NA" "NA - NA"
## [117] "NA - NA" "NA - NA"
## [119] "NA - NA" "NA - NA"
## [121] "NA - NA" "NA - NA"
## [123] "NA - NA" "NA - NA"
## [125] "NA - NA" "NA - NA"
## [127] "NA - NA" "Med/surg - ICU"
I would suggest keeping just one of them, as they are mostly the same, with few exceptions. At our data granularity I don’t think these small differences will have much influence on outcomes. Given so many missing values, we could also consider removing both.
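Instead of reading through the pasted strings, the agreement between the two variables could be summarized with a cross-tabulation. This is only a sketch: the vectors `admit` and `highest` below hold hypothetical toy values, standing in for level_of_care_at_time_of_admission and highest_level_of_care_during_hospital_stay.

```r
# Hypothetical toy values, not our real data.
admit   <- c("Med/surg", "Step-down/PCU", "Med/surg", NA, "ICU",
             "Med/surg", "Step-down/PCU", NA)
highest <- c("Med/surg", "Step-down/PCU", "ICU", NA, "ICU",
             "Step-down/PCU", NA, NA)

# Cross-tabulation of the two variables, keeping NA as its own category.
xtab <- table(admit, highest, useNA = "ifany")
print(xtab)

# Agreement among patients with both values recorded.
both      <- !is.na(admit) & !is.na(highest)
n_both    <- sum(both)
n_same    <- sum(admit[both] == highest[both])
prop_same <- n_same / n_both
```

Running the same few lines on the real columns would give an exact number for how often the two variables agree, which would support the decision to keep only one of them.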
primary_admission_diagnosis | n |
---|---|
COVID-19 - confirmed | 17 |
COVID-19 - rule-out | 33 |
Other | 11 |
NA | 67 |
I don’t think that distinguishing rule-out vs. confirmed COVID would make much difference in our outcomes, so I would recommend dropping this variable.
other_chronic_lung_disease | n |
---|---|
No | 55 |
Yes | 1 |
NA | 72 |
I would suggest dropping this variable: only 1 patient has it, and it is also very non-specific.
obstructive_sleep_apnea | n |
---|---|
No | 53 |
Yes | 5 |
NA | 70 |
asthma | n |
---|---|
No | 46 |
Yes | 12 |
NA | 70 |
Both are very important predictors of COVID severity. I would suggest keeping asthma, as a solid number of patients have it (12), but removing OSA, as only 5 patients have it.
baseline_ejection_fraction_prior_to_presentation | n |
---|---|
28 | 1 |
30 | 1 |
51 | 1 |
53 | 1 |
61 | 1 |
65 | 1 |
70 | 1 |
75 | 1 |
77 | 1 |
82 | 1 |
NA | 118 |
I would drop this important variable because it has so many missing values.
We can see in the Variables to choose spreadsheet that several of our very important cardiac baseline variables have many missing values (perc_missing).
For instance, CAD variable for each patient looks like this:
## [1] "No" NA NA NA "No" "No" NA NA NA "No" NA NA
## [13] "No" NA NA NA "No" "No" NA NA NA NA "Yes" "Yes"
## [25] "No" NA NA NA NA NA NA NA NA NA NA "No"
## [37] "Yes" NA NA NA NA NA NA NA NA NA NA NA
## [49] NA NA NA NA NA NA NA NA NA NA NA NA
## [61] NA NA NA NA NA NA NA NA NA "No" NA NA
## [73] NA NA NA NA NA "No" "Yes" "No" "No" "No" "No" "No"
## [85] "No" "No" "Yes" "No" "No" "No" "No" "No" "Yes" "No" "No" "No"
## [97] "No" "No" "No" "Yes" "No" "No" "No" "No" "No" NA NA NA
## [109] NA NA NA NA NA NA NA NA NA NA NA NA
## [121] NA NA NA NA NA NA NA "Yes"
The pattern is similar across other important variables such as hypertension and smoking. We badly need these variables, and they need to be complete, because in modeling each missing value drops the whole observation (as I wrote above). I also believe that if so many values are missing, reviewers will question the quality of our data. We should discuss with the data collection team what problems they encountered in collecting these variables and how to keep the percentage of missing values in these crucial variables below 5%. My guess is that some data collectors wrote “NA” when they found no history of the condition, so some of these values could potentially be “0” (No) instead?
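To show how much the “NA means No” question matters, here is a small sketch using CAD. The counts (34 No, 8 Yes, 86 NA) are my hand tally of the printed CAD vector above, so please verify them against the data; the vector `cad` is rebuilt from those counts and does not preserve patient order. Recoding NA to "No" is an assumption that we would need to confirm with the data collectors, not something the data can tell us.

```r
# Counts tallied by hand from the printed CAD vector (verify against the data).
cad <- c(rep("No", 34), rep("Yes", 8), rep(NA, 86))

n_total   <- length(cad)               # 128
perc_miss <- 100 * mean(is.na(cad))    # 67.1875

# Scenario A: NA is truly unknown, so only recorded values are used.
prev_observed <- mean(cad == "Yes", na.rm = TRUE)   # 8 / 42

# Scenario B (assumption): NA means "no history found" and is recoded to "No".
cad_recoded  <- ifelse(is.na(cad), "No", cad)
prev_recoded <- mean(cad_recoded == "Yes")          # 8 / 128
```

The apparent CAD prevalence is about 19% under scenario A and 6.25% under scenario B, so the two readings of “NA” lead to quite different pictures of our cohort.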
Of the variables in the image above, the one I would consider dropping is cancer, given that so few patients have it:
malignancy_active_treatment_or_within_last_year | n |
---|---|
No | 58 |
Yes | 3 |
NA | 67 |
There are a lot of missing values in other important variables that seem very easy to extract from Epic, such as SBP:
SBP_n_missing | n_observations | perc_miss |
---|---|---|
66 | 128 | 51.5625 |
We should discuss with the data collection team any issues with obtaining these variables, to make sure they are collected for future patients.
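A check like the one below could be rerun after each batch of newly collected patients to flag variables that exceed the 5% missingness target. It is a sketch only: `perc_missing()` is a hypothetical helper, and the toy data frame uses placeholder values, with NA counts chosen to match the SBP table above (66 of 128) and my hand tally for CAD (86 of 128).

```r
# Hypothetical helper: missing-value summary for every column of a data frame.
perc_missing <- function(df) {
  data.frame(
    variable       = names(df),
    n_missing      = unname(colSums(is.na(df))),
    n_observations = nrow(df),
    perc_miss      = unname(100 * colMeans(is.na(df))),
    stringsAsFactors = FALSE
  )
}

# Toy data frame with placeholder values; only the NA counts are meaningful.
toy <- data.frame(
  sbp = c(rep(NA, 66), rep(120, 62)),
  cad = c(rep(NA, 86), rep("No", 42)),
  stringsAsFactors = FALSE
)

miss    <- perc_missing(toy)
flagged <- miss$variable[miss$perc_miss >= 5]   # variables above the 5% target
print(miss)
```

For the `sbp` column this reproduces the 51.5625% shown in the table above.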
tracheostomy | n |
---|---|
Yes | 1 |
NA | 127 |
We might consider dropping this variable, given that only 1 patient has it.
suspected_confirmed_myocarditis | n |
---|---|
No | 55 |
Yes | 2 |
NA | 71 |
We might consider dropping this variable, given that so few patients have it.
was_the_patient_re_admitted_after_the_index_hospitalization_including_to_outside_hospital_if_records_available | n |
---|---|
No | 33 |
Yes | 14 |
NA | 81 |
We could consider updating this variable for the previous 125 patients after we pick a specific “end of follow-up” date. It could be useful as a potential outcome variable or as part of a composite outcome variable.
Please let me know your thoughts.