Janssen Programming Assignment: Charlson Index Using the CDM

Assignment

Write a program (in SAS, SQL or R) that could be applied to the OMOP common data model to calculate the Charlson index for all people in an exposed population, indexed on the date of first exposure. The program would take as input a list of drug identifiers (e.g. NDCs or RxNorm CUIs) that define the exposure, and output a table that contains two fields: PERSON_ID, and the Charlson index score.

Step 1: Read, Research, and Plan Approach

In an effort to not have to re-invent the wheel, I identified the comorbidity R package (documentation found here: https://cran.r-project.org/web/packages/comorbidity/index.html). The author cites and uses some of the exact papers/sources referenced in the Introduction. The author has created a function that will automatically calculate the Charlson comorbidity index (CCI) when passing in a series of patient IDs and diagnostic codes. There are options for calculating the Charlson or Elixhauser score, but for the purpose of this assignment, I will only calculate the CCI.

However, using this function to calculate the CCI will be one of the last steps I do to complete this assignment. First, I need to use the data and columns contained in the tables within the OMOP CDM to arrive at a usable data frame. The CCI is calculated from presence of certain diagnoses, but the assignment asks me to calculate the CCI using only a list of drug codes. I will need to transition from a list of patients with a particular set of drug codes to patients who have comorbidities of interest.

Approach:

Clearly define exposure variable within DRUG_EXPOSURE table. For the purpose of this assignment, I will assume that the input, which is a list of RxNorm/NDC drug codes, have already been mapped to the CDM and are now defined as drug_concept_id values in the DRUG_EXPOSURE table.
Identify first (earliest) exposure for each drug of interest, for each patient. It is unclear in the assignment instructions whether “indexed on the date of first exposure” should be applied at the patient-drug level, or just the patient level.
For only these patients exposed to any of these drugs of interest, join the appropriate diagnosis data from the CONDITION_OCCURRENCE table with the data from (1) and (2), ensuring that the diagnosis/diagnoses occur on or after the date of earliest exposure. This is to make sure that exposure comes before diagnosis date. It is unclear in the assignment instructions whether or not an additional check needs to be made relating to the OBSERVATION_PERIOD table. I can envision a check where the drug date and/or the diagnosis date needs to occur within the time period where the “observation_start_date” and the “observation_end_date” are the lower and upper time limits for inclusion, respectfully.
Now that we have diagnosis codes for all patients that met exposure criteria, pass the “person_id” and the “condition_concept_id” values into the comorbidity() function to return a data frame containing the scores of each of the individual comorbidities. One important note from the function documentation to make sure of is: “ICD-10 and ICD-9 codes must be in upper case and with alphanumeric characters only in order to be properly recognised”.
Lastly, we can use the package’s score() function to calculate the weighted CCI value for each individual patient.

Each of the above steps will be executed on sample data, to illustrate that my solutions work. Then, the function that is asked for in the assignment will be written.

Step 2: Code and Achieve Desired Result with Sample/Example Data

Before touching the data, I will load in relevant R packages that I will likely need to use. If I need more later, I will load them later. I will also create a few user-defined functions that I often use, in case I need them. These steps are typical of my general coding process, but are not always necessary.

#I have already installed these pkgs, so I will just load them

library(tidyverse) #an essential for just about everything!
library(janitor) #good at quickly cleaning up messy data tables (column names, etc.)
library(gtsummary) #creates nice summary tables
library(comorbidity) #will eventually calculate CCI
library(randomNames) #generates random names etc. for simulated patient data
library(stringi) #generates random strings
library(zoo) #can help with dates

#define 'not in' function
'%!in%' <- function(x,y)!('%in%'(x,y))

To illustrate this assignment with a working example, I will generate sample data that can be used throughout each step of the assignment. This step would not be executed in a real situation where I have access to the data, where I would just load in the data instead. I will first use the sample data to make sure the approach and coding solutions work as intended, then I will write the function that is asked for in the assignment which will mimic the steps used in the example to return the desired output.

#create sample data for each table
#first I will set the seed so that everything is reproducible
set.seed(999)

#1.------PERSON--------

#use randomNames package to generate sample patient data (I will create 40 sample patients)
dat_pers <- randomNames(n=40,
                       return.complete.data=T)

#need to tack on a unique person_id, ethnicity_concept_id, and year_of_birth to truly mimic the PERSON data
#need to drop the name column as well, likely PHI
dat_pers2 <- dat_pers %>%
  mutate(person_id = sample(x=seq(100000,999999,1), size=40, replace=F), #random 6 digit unique patient id 
         year_of_birth = sample(x=seq(1916, 2016, 1), size=40, replace=T), #restrict to <= 100 years old between 1916-2016
         ethnicity_concept_id = sample(x=c(1,2,3), size=40, replace=T) #the values of 1, 2, & 3 would have a lookup table in the CDM
         ) %>%
  rename(gender_concept_id = gender,
         race_concept_id = ethnicity) %>%
  select(person_id, gender_concept_id, year_of_birth,
         race_concept_id, ethnicity_concept_id)

#quick view of first 5 persons:
head(dat_pers2, 5)

##    person_id gender_concept_id year_of_birth race_concept_id
## 1:    713073                 0          1982               5
## 2:    145211                 1          1958               3
## 3:    459729                 0          1985               2
## 4:    776641                 0          1941               1
## 5:    969577                 0          1918               1
##    ethnicity_concept_id
## 1:                    2
## 2:                    3
## 3:                    3
## 4:                    3
## 5:                    3

#2.------OBSERVATION_PERIOD--------

#need to use same person_ids from above
#need to tack on any missing cols
dat_obs <- data.frame(
  observation_period_id = sample(x=seq(1000000,9999999,1), size=40, replace=F), #random 7 digit unique id, assume only 1 per person
  person_id = sample(x=dat_pers2$person_id, size=40, replace=F) #random order for person_ids
)

#before adding observation_period_start_date and end date, need to make sure these dates come on or after each patient's DOB

dat_obs_pers2_fordate <- inner_join(dat_pers2, dat_obs) %>%
  select(person_id, year_of_birth, observation_period_id) 

#now can add the obs dates on
dat_obs2 <- dat_obs_pers2_fordate %>%
  group_by(person_id) %>%
  mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1),
    observation_period_start_date = 
      sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (365 * 5)), by="day"), 1), #start date is either on or within 5 years of year of birth for each person
    observation_period_end_date = 
      sample(seq((observation_period_start_date+365), Sys.Date(), by="day"), 1)            #end date is at least 1 year after start date and spans to current day today
    ) %>%
  select(-c(year_of_birth_lastday, year_of_birth)) #remove year of birth because it is not in this table

#quick view of first 5 observations:
head(dat_obs2, 5)

## # A tibble: 5 x 4
## # Groups:   person_id [5]
##   person_id observation_period_~ observation_period_star~ observation_period_en~
##       <dbl>                <dbl> <date>                   <date>                
## 1    713073              7223144 1987-02-18               1989-12-12            
## 2    145211              1236938 1963-08-20               1967-12-27            
## 3    459729              5724923 1987-02-14               1993-04-24            
## 4    776641              4748263 1945-05-31               1986-04-22            
## 5    969577              5106453 1919-02-08               1953-10-05

#3.------DRUG_EXPOSURE--------

#need to use same person_ids from above
#need to tack on any missing cols
#this table will have > 1 row per patient, assuming that some patients may be on > 1 drug, let's say length 500 total
dat_drug <- data.frame(
  drug_exposure_id = sample(x=seq(1000000,9999999,1), size=500, replace=F), #random 7 digit unique id 
  person_id = sample(x=dat_pers2$person_id, size=500, replace=T), #random order for person_ids
  drug_concept_id = sample(x=seq(1,20,1), size=500, replace=T) #17 conditions that make up the CCI, so we will randomly sample 1 through 20, simulating (for this example) 20 different drug options for 17 different conditions, but only 17/20 drug ids will match with the conditions
)

#before adding drug_exposure_start_date and end date, need to make sure these dates come on or after each patient's DOB
#but not necessarily within the observation period, in case we want to do an exercise later where we perform this additional check

dat_drug_pers2_fordate <- inner_join(dat_pers2, dat_drug) %>%
  select(person_id, year_of_birth, drug_exposure_id, drug_concept_id) 

dat_drug2 <- dat_drug_pers2_fordate %>%
  group_by(person_id, drug_concept_id, drug_exposure_id) %>% #making sure that each drug and drug ID per patient has its own dates
  mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1), # end of year in year of birth
    drug_exposure_start_date = 
      sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (360 * 5)), by="day"), 1), #start date is either on or within5 years of year of birth for each person
    drug_exposure_end_date = 
      sample(seq((drug_exposure_start_date +365), Sys.Date(), by="day"), 1) #end date is at least 1 year after start date and spans to current day today
    ) %>%
  select(-c(year_of_birth_lastday, year_of_birth)) %>% #remove year of birth because it is not in this table
  select(drug_exposure_id, person_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date)

#quick view of first 5 drug records:
head(dat_drug2, 5)

## # A tibble: 5 x 5
## # Groups:   person_id, drug_concept_id, drug_exposure_id [5]
##   drug_exposure_id person_id drug_concept_id drug_exposure_sta~ drug_exposure_e~
##              <dbl>     <dbl>           <dbl> <date>             <date>          
## 1          4399072    713073              10 1987-01-09         2021-11-15      
## 2          1491789    713073              20 1985-06-10         1988-10-10      
## 3          2342278    713073               1 1985-11-29         2002-08-27      
## 4          1387146    713073               2 1986-11-25         1996-05-15      
## 5          5871878    713073               6 1984-05-09         1994-11-17

#4.------CONDITION_OCCURRENCE--------
#need to use same person_ids from above
#need to tack on any missing cols


#this table will have > 1 row per patient, assuming that some patients may have > 1 condition, let's say length 500 total
dat_cond <- data.frame(
  condition_occurrence_id = sample(x=seq(1000000,9999999,1), size=500, replace=F), #random 7 digit unique id 
  person_id = sample(x=dat_pers2$person_id, size=500, replace=T), #random order for person_ids
  condition_concept_id = sample(x=seq(1,20,1), size=500, replace=T) #17 conditions that make up the CCI, so we will randomly sample 1 through 20, simulating (for this example) 20 different condition options for 17 different conditions of interest, but only 17/20 condition ids will match with the conditions we need
)

#before adding condition_start_date and end date, need to make sure these dates come on or after each patient's DOB
#but not necessarily within the observation period, in case we want to do an exercise later where we perform this additional check

dat_cond_pers2_fordate <- inner_join(dat_pers2, dat_cond) %>%
  select(person_id, year_of_birth, condition_occurrence_id, condition_concept_id) 

dat_cond2 <- dat_cond_pers2_fordate %>%
  group_by(person_id, condition_concept_id, condition_occurrence_id) %>% #making sure that each condition and condition ID per patient has its own dates
  mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1), # end of year in year of birth
    condition_start_date = 
      sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (360 * 5)), by="day"), 1), #start date is either on or within5 years of year of birth for each person
    condition_end_date = 
      sample(seq((condition_start_date +365), Sys.Date(), by="day"), 1) #end date is at least 1 year after start date and spans to current day today
    ) %>%
  select(-c(year_of_birth_lastday, year_of_birth)) %>% #remove year of birth because it is not in this table
  select(condition_occurrence_id, person_id, condition_concept_id, condition_start_date, condition_end_date)

#quick view of first 5 condition records:
head(dat_cond2, 5)

## # A tibble: 5 x 5
## # Groups:   person_id, condition_concept_id, condition_occurrence_id [5]
##   condition_occurr~ person_id condition_conce~ condition_start~ condition_end_d~
##               <dbl>     <dbl>            <dbl> <date>           <date>          
## 1           5520661    713073               19 1987-08-19       1996-02-13      
## 2           8833374    713073                4 1983-08-10       1995-10-07      
## 3           7687341    713073               19 1984-11-09       2004-07-26      
## 4           2361338    713073                5 1986-01-16       2002-11-21      
## 5           6519796    713073               18 1986-07-10       2002-03-31

#it is possible that a patient's drug_exposure_start_date is before their condition_start_date, which is likely not possible in practice and would likely need to be removed/filtered. But for the purpose of this assignment, we only need to look at drug date, but this additional check could be performed if needed

Next, now that the sample data are generated, I can define the exposure at the drug-level within the drug exposure (dat_drug2) table as drug_concept_id (step 1 of Approach). This variable is an integer, and is defined as “A foreign key that refers to a Standard Concept identifier in the Standardized Vocabularies belonging to the ‘Drug’ domain”. A person may have > 1 record of the same type of drug, for each of which a unique drug_exposure_id would be present. In the simulated drug data, each patient was assigned a random number of drug_concept_id values, ranging from 1-20. For the purpose of this assignment, given that there are 17 different conditions/diagnoses that contribute to the CCI score, I will define drugs matching conditions of interest as those that have the same valued integer for both drug_concept_id and condition_concept_id from the condition occurrence (dat_cond2) table. I purposely added 3 extra drugs that should not be selected for (drug_concept_id values 18-20), and similarly 3 extra conditions that should not be selected for (condition_concept_id values 18-20). To enable myself to utilize the comborbidity package later to calculate the CCI, I will also need to create an accurate lookup table, which assigns CCI-specific ICD codes to the condition_concept_id values 1-17, and non-CCI-specific ICD codes to the the condition_concept_id values 18-20. I will create this lookup table next, so that I can use it later.

The following resource was used to create the lookup table:

http://mchp-appserv.cpe.umanitoba.ca/concept/Charlson%20Comorbidities%20-%20Coding%20Algorithms%20for%20ICD-9-CM%20and%20ICD-10.pdf.

I will just select the first-occurring ICD-10 code for each of the 17 CCI-specific comorbidities for simplicity’s sake, and use 3 made up non-CCI-specific codes for the remaining 3:

#make the lookup table
dat_condition_lookup <- data.frame(
  condition_concept_id = seq(1,20,1),
  condition_descr = c("I21", "I428", "I70", "G45", "F00", "I278", "M05", "K25", "B18", #all CCI-speific codes
                      "E100", "E102", "G041", "I120", "C00", "I850", "C77", "B20",
                      
                      "N0P3", "N00", "N3G" #all non-CCI-specific_codes
                      )
)

#check to make sure that it is identifying each corerct code, and not identifying each incorrect code. There should be a "1" present for each row 1:17 in an identity-matrix pattern (i.e. row 1 has a 1 in first column, row 2 has a 1 in second column, etc.), with the pattern breaking by containing all zeroes starting at row 18
comorbidity(x = dat_condition_lookup, id = "condition_concept_id",
            code = "condition_descr", map = "charlson_icd10_quan", assign0 = FALSE)

##    condition_concept_id ami chf pvd cevd dementia copd rheumd pud mld diab
## 1                     1   1   0   0    0        0    0      0   0   0    0
## 2                     2   0   1   0    0        0    0      0   0   0    0
## 3                     3   0   0   1    0        0    0      0   0   0    0
## 4                     4   0   0   0    1        0    0      0   0   0    0
## 5                     5   0   0   0    0        1    0      0   0   0    0
## 6                     6   0   0   0    0        0    1      0   0   0    0
## 7                     7   0   0   0    0        0    0      1   0   0    0
## 8                     8   0   0   0    0        0    0      0   1   0    0
## 9                     9   0   0   0    0        0    0      0   0   1    0
## 10                   10   0   0   0    0        0    0      0   0   0    1
## 11                   11   0   0   0    0        0    0      0   0   0    0
## 12                   12   0   0   0    0        0    0      0   0   0    0
## 13                   13   0   0   0    0        0    0      0   0   0    0
## 14                   14   0   0   0    0        0    0      0   0   0    0
## 15                   15   0   0   0    0        0    0      0   0   0    0
## 16                   16   0   0   0    0        0    0      0   0   0    0
## 17                   17   0   0   0    0        0    0      0   0   0    0
## 18                   18   0   0   0    0        0    0      0   0   0    0
## 19                   19   0   0   0    0        0    0      0   0   0    0
## 20                   20   0   0   0    0        0    0      0   0   0    0
##    diabwc hp rend canc msld metacanc aids
## 1       0  0    0    0    0        0    0
## 2       0  0    0    0    0        0    0
## 3       0  0    0    0    0        0    0
## 4       0  0    0    0    0        0    0
## 5       0  0    0    0    0        0    0
## 6       0  0    0    0    0        0    0
## 7       0  0    0    0    0        0    0
## 8       0  0    0    0    0        0    0
## 9       0  0    0    0    0        0    0
## 10      0  0    0    0    0        0    0
## 11      1  0    0    0    0        0    0
## 12      0  1    0    0    0        0    0
## 13      0  0    1    0    0        0    0
## 14      0  0    0    1    0        0    0
## 15      0  0    0    0    1        0    0
## 16      0  0    0    0    0        1    0
## 17      0  0    0    0    0        0    1
## 18      0  0    0    0    0        0    0
## 19      0  0    0    0    0        0    0
## 20      0  0    0    0    0        0    0

The lookup table looks good and the comorbidity() function appears to work well, so we will move on to step 2 of the Approach.

It is unclear in the assignment instructions whether “indexed on the date of first exposure” should be applied at the patient-drug level, or just the patient level. To capture more data per patient (and as a result potentially more comorbidities), I will apply it at the patient_drug level by taking the earliest (first) record of each drug_exposure_id per person_id, per drug_concept_id.

#create data frame that identifies first-occuring drug_exposure_id per person, per drug
dat_drug_first <- dat_drug2 %>%
  group_by(person_id, drug_concept_id) %>% #group appropriately
  mutate(first_exposure = case_when(                                #create binary variable that tells us if indeed this is the first exposure to this drug per patient, per drug
    drug_exposure_start_date==min(drug_exposure_start_date) ~ 1,
    T ~ 0
  ),
    first_exposure_id = drug_exposure_id[which.min(drug_exposure_start_date)]  #index the drug_exposure_id on the minimum (earliest) drug start date, per patient, per drug
)
  
dat_drug_first2 <- dat_drug_first %>%
  ungroup() %>%  #ungroup the data
  filter(first_exposure==1)   #take only rows where it was a first exposure to a drug

As an aside, 123 total observations contributed by 39 distinct persons were removed from the drug data when applying the above filtering criteria. Each of these patients had multiple records (as defined by drug_exposure_id) of the same type of drug (as defined by drug_concept_id) on file, so we took only the earliest (first-occurring) one in these cases. Below is a quick look at the first 6 rows of these data, which displays the logic used for one person in particular as an example:

dat_drug_first_filtered_persons <- dat_drug_first %>%
  filter(0 %in% first_exposure) %>%
  arrange(desc(person_id))

head(dat_drug_first_filtered_persons, 6)

## # A tibble: 6 x 7
## # Groups:   person_id, drug_concept_id [3]
##   drug_exposure_id person_id drug_concept_id drug_exposure_sta~ drug_exposure_e~
##              <dbl>     <dbl>           <dbl> <date>             <date>          
## 1          9073493    969577               2 1920-01-13         2000-07-13      
## 2          9356546    969577              10 1923-03-29         2014-07-12      
## 3          8730686    969577              10 1922-11-08         1978-12-28      
## 4          9965029    969577               3 1920-11-29         1975-08-29      
## 5          3035880    969577               2 1920-07-20         1967-10-26      
## 6          9076775    969577               3 1921-04-29         2001-10-14      
## # ... with 2 more variables: first_exposure <dbl>, first_exposure_id <dbl>

Next, step 3 of the Approach involves joining the appropriate diagnosis data from the dat_cond2 (conditions) table with the data from the above step, ensuring that the diagnosis/diagnoses occur on or after the date of earliest exposure. This is to make sure that exposure comes before diagnosis date. The only variables (for the purpose of this assignment) needed to join to the condition table are person_id, drug_concept_id, and drug_exposure_start_date.

#subset the new filtered drug data to only relevant columns
dat_drug_first3 <- dat_drug_first2 %>%
  select(person_id, drug_concept_id, drug_exposure_start_date)

#then, join each data frame together, by person_id
dat_drug_and_cond <- inner_join(dat_drug_first3, dat_cond2)

#as stated previously, given that there are 17 different conditions/diagnoses that contribute to the CCI score, I will define drugs matching conditions of interest as those that have the same valued integer for both `drug_concept_id` and `condition_concept_id`. as such, we can first filter out those persons and observations where the drug ids and condition ids do not match

dat_drug_and_cond_match <- dat_drug_and_cond %>%
  group_by(person_id, drug_concept_id) %>%
  mutate(condition_matches_drug_id = case_when(
    condition_concept_id %in% drug_concept_id ~ 1,
    T ~ 0
  )
)

#to have the most accurate and conservative CCI estimates possible at the end of this analysis, we should only include data from patients whose drug id codes exactly match the corresponding condition id codes (in this simulated example, this is interpreted as only pulling data for those patients whose drug type actually is a drug used to treat the condition it is intended for). So, next we filter out all rows that did not have a match:

dat_drug_and_cond_match2 <- dat_drug_and_cond_match %>%
  ungroup() %>%
  filter(condition_matches_drug_id==1)

#the next check we need to perform is that the diagnosis/diagnoses occur on or after the date of earliest drug exposure, so we do that next

dat_drug_and_cond_match3 <- dat_drug_and_cond_match2 %>%
  group_by(person_id, drug_concept_id) %>%
  mutate(drug_on_or_after_condition = case_when(
    condition_start_date <= drug_exposure_start_date ~ 1,
    T ~ 0
  )
)

#and similarly we filter out those observations that failed the above criteria


dat_drug_and_cond_match4 <- dat_drug_and_cond_match3 %>%
  ungroup() %>%
  filter(drug_on_or_after_condition==1)

#now, some patients may actually have > 1 row in the remaining data frame if they have multiple different diagnosis records of the same condition, recorded on different dates, for which each diagnosis date was prior to the drug exposure date. For the purpose of this assignment, we will treat a patient EVER having a condition (even if they have multiple onset records) as a single observation for that patient. The resulting data frame should have only 1 row per patient per unique condition, via the below coding logic:

dat_conds_by_person <- distinct(dat_drug_and_cond_match4 %>%
  select(person_id, condition_concept_id)
)

#this gives us a nice, 2-column summary of the distinct conditions per person. We can look at the resulting data for the first few people using the code below:

head(dat_conds_by_person, 13)

## # A tibble: 13 x 2
##    person_id condition_concept_id
##        <dbl>                <dbl>
##  1    713073                   10
##  2    713073                   15
##  3    145211                    6
##  4    145211                   20
##  5    145211                    1
##  6    776641                    8
##  7    776641                   13
##  8    969577                   10
##  9    969577                   19
## 10    969577                   12
## 11    969577                   16
## 12    439790                   10
## 13    439790                    4

Before we can move on to step 4 of the Approach and use the comorbidity() function to check for the presence or absence of CCI-relevant conditions for each of these patients, we need to map the condition_concept_id values that we have in the CDM to the actual ICD codes that are used to identify Charlson comorbidities. We can use the lookup table we made earlier to make this easy:

#join the lookup table/values with our final list of persons + conditions, then remove the condition ids because we can't use them to calculate the CCIs anymore
dat_conds_by_person_decoded <- inner_join(dat_conds_by_person, dat_condition_lookup) %>%
  select(-c(condition_concept_id))

Now we can use the comorbidity() function to check for the presence or absence of CCI-relevant conditions for each of these patients. There are 100 total comorbidities contributed by 37 distinct patients in our final list.

#utilizing the documentation and examples of this package:

# Charlson comorbidity presence based on ICD-10 diagnostic codes:
dat_comorbid <- comorbidity(x = dat_conds_by_person_decoded,   #specify the data frame from which to look at codes
            id = "person_id",                                  # the identification variable
            code = "condition_descr",                          # the column containing the condition codes (using ICD-10 codes as example)
            map = "charlson_icd10_quan",                       # mapping algorithm to use, we will use based on Quan paper
            assign0 = F)                                       # should an heirarchy of comorbidities be used (leave out for this example)

The resulting table is a patient-wise summary of the presence or absence of each of the 17 Charlson comorbidities. It’s worthwhile to tabulate display the frequency across all 37 persons nicely below:

#remove person_id from summary stats
dat_comorbid_fortab <- dat_comorbid %>%
  select(-c(person_id))

#make the table using the nice tbl_summary() function from gtsummary pkg
tab_comorbid_blowout <- dat_comorbid_fortab %>%
  tbl_summary(statistic = list(all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
              digits = all_continuous() ~ 2,
              label = list(ami ~ "Acute Myocardial Infarction",
                           chf ~ "Congestive Heart Failure",
                           pvd ~ "Peripheral Vascular Disease",
                           cevd ~ "Cerebrovascular Disease",
                           dementia ~ "Dementia",
                           copd ~ "Chronic Obstructive Pulmonary Disease",
                           rheumd ~ "Rheumatoid Disease",
                           pud ~ "Peptic Ulcer Disease",
                           mld ~ "Mild Liver Disease",
                           diab ~ "Diabetes Without Complications",
                           diabwc ~ "Diabetes With Complications",
                           hp ~ "Hemiplegia or Paraplegia",
                           rend ~ "Renal Disease",
                           canc ~ "Cancer (any malignancy)",
                           msld ~ "Moderate or Sever Liver Disease",
                           metacanc ~ "Metastatic Solid Tumour",
                           aids ~ "AIDS/HIV"
                           ),
              
              missing_text = "(Missing)") %>%
  modify_header(label ~ "**Charlson Comorbidity**") %>%
  modify_footnote(
    all_stat_cols() ~ "Frequency (%); (% missing)"
  ) %>%
  bold_labels()

#print the table
tab_comorbid_blowout

Charlson Comorbidity	N = 37¹
Acute Myocardial Infarction	4 / 37; (11%); (0%)
Congestive Heart Failure	2 / 37; (5.4%); (0%)
Peripheral Vascular Disease	4 / 37; (11%); (0%)
Cerebrovascular Disease	4 / 37; (11%); (0%)
Dementia	9 / 37; (24%); (0%)
Chronic Obstructive Pulmonary Disease	6 / 37; (16%); (0%)
Rheumatoid Disease	7 / 37; (19%); (0%)
Peptic Ulcer Disease	6 / 37; (16%); (0%)
Mild Liver Disease	4 / 37; (11%); (0%)
Diabetes Without Complications	5 / 37; (14%); (0%)
Diabetes With Complications	2 / 37; (5.4%); (0%)
Hemiplegia or Paraplegia	6 / 37; (16%); (0%)
Renal Disease	3 / 37; (8.1%); (0%)
Cancer (any malignancy)	4 / 37; (11%); (0%)
Moderate or Sever Liver Disease	5 / 37; (14%); (0%)
Metastatic Solid Tumour	7 / 37; (19%); (0%)
AIDS/HIV	6 / 37; (16%); (0%)
¹ Frequency (%); (% missing)

Lastly, per step 5 of the Approach, we can use the comorbidity package’s score() function to calculate the weighted CCI value for each individual patient.

#add on the score for each patient to the comoridity table to get the weighted cci per person, and drop all individual comorbidity identifier columns because these are not asked for as a final product

dat_person_w_score <- dat_comorbid %>%
  mutate(cci_score = score(x = dat_comorbid,
                           weights = "quan",
                           assign0 = F)
         ) %>%
  select(person_id, cci_score)


#let's have a look at the final product:
print.data.frame(dat_person_w_score)

##    person_id cci_score
## 1     137849         6
## 2     145211         1
## 3     154190         1
## 4     172100         2
## 5     263052         5
## 6     292854         2
## 7     319861         8
## 8     336900         6
## 9     337686         6
## 10    368820         6
## 11    387533         4
## 12    420245        13
## 13    439790         0
## 14    488828         1
## 15    494779         8
## 16    499843         4
## 17    521908         3
## 18    542546         1
## 19    548564        10
## 20    559703         5
## 21    572346         7
## 22    614986         4
## 23    633091         2
## 24    637565         0
## 25    701780         6
## 26    713073         4
## 27    715271         0
## 28    720913        15
## 29    746542         1
## 30    776641         1
## 31    813350         4
## 32    815826         1
## 33    853946         0
## 34    880844         2
## 35    888559         3
## 36    960193         4
## 37    969577         8

Step 3: Write a Program That Achieves Desired Result

Now that I have achieved the desired result via a working example using simulated data, I will write a function (call it drugs_to_person_and_cci) that explicitly addresses what is being asked in the assignment, using the assumed OMOP CDM tables and column names.

First, I will load in all available tables. Depending on the computing setup and database connections available, some work may need to be done in SQL first. For the purpose of this assignment, I will assume that the data have already been queried in SQL and I can load all data tables in via CSV files. I will also assume the CSV versions of the data have the same names as the table names provided in the Instructions.

#load in the data
dat_pers_cdm <- read.csv("PERSON.csv")
dat_obs_cdm <- read.csv("OBSERVATION_PERIOD.csv")
dat_drug_cdm <- read.csv("DRUG_EXPOSURE.csv")
dat_cond_cdm <- read.csv("CONDITION_OCCURRENCE.csv")

Now, to write the function:

drugs_to_person_and_cci <- function(drug_list){
  
  #first, filter the cdm drug data to only those drugs that are in our input list
  dat_drug2 <- drug_cdm %>%
    filter(drug_concept_id %in% drug_list)
  
  #next, create data frame that identifies first-occuring drug_exposure_id per person, per drug
dat_drug_first <- dat_drug2 %>%
  group_by(person_id, drug_concept_id) %>% #group appropriately
  mutate(first_exposure = case_when(                                #create binary variable that tells us if indeed this is the first exposure to this drug per patient, per drug
    drug_exposure_start_date==min(drug_exposure_start_date) ~ 1,
    T ~ 0
  ),
    first_exposure_id = drug_exposure_id[which.min(drug_exposure_start_date)]  #index the drug_exposure_id on the minimum (earliest) drug start date, per patient, per drug
)
  
dat_drug_first2 <- dat_drug_first %>%
  ungroup() %>%  #ungroup the data
  filter(first_exposure==1)   #take only rows where it was a first exposure to a drug
  
  

#subset the new filtered drug data to only relevant columns
dat_drug_first3 <- dat_drug_first2 %>%
  select(person_id, drug_concept_id, drug_exposure_start_date)

#then, join each data frame together, by person_id
dat_drug_and_cond <- inner_join(dat_drug_first3, dat_cond_cdm)


#to have the most accurate and conservative CCI estimates possible at the end of this analysis, we should only include data from patients whose drug id codes exactly match the corresponding condition id codes (in this simulated example, this is interpreted as only pulling data for those patients whose drug type actually is a drug used to treat the condition it is intended for). So, next we filter out all rows that did not have a match:

dat_drug_and_cond2 <-  dat_drug_and_cond %>%
  group_by(person_id, drug_concept_id) %>%
  mutate(drug_on_or_after_condition = case_when(
    condition_start_date <= drug_exposure_start_date ~ 1,
    T ~ 0
  )
)

#and similarly we filter out those observations that failed the above criteria


dat_drug_and_cond3 <- dat_drug_and_cond2 %>%
  ungroup() %>%
  filter(drug_on_or_after_condition==1)

#now, some patients may actually have > 1 row in the remaining data frame if they have multiple different diagnosis records of the same condition, recorded on different dates, for which each diagnosis date was prior to the drug exposure date. For the purpose of this assignment, we will treat a patient EVER having a condition (even if they have multiple onset records) as a single observation for that patient. The resulting data frame should have only 1 row per patient per unique condition, via the below coding logic:

dat_conds_by_person <- distinct(dat_drug_and_cond3 %>%
  select(person_id, condition_concept_id)
)


#In the OMOP CDM, I am sure there exists some form of lookup table for the condition_concept_id  integer values, where researchers are able to eventually know which condition_concept_id codes specifically relate to the corresponding ICD-9/10 codes, which can be used for the eventual CCI calculation. For the rest of the function, it's probable to assume that the Janssen team has done exactly this already, and we can take the condition_concept_id codes and get the corresponding ICD-9/10 codes to then directly get a binary value for each of the 17 comorbidities, indicating yes/no whether or not the condition_concept_id for that person taking that drug matches up with a Charlson-specific comorbidity. The rest of the code in this function will operate under this assumption. Let's call this hypothetical CDM lookup table "dat_cond_cdm_icd", and the icd code-containing column "icd_code"

dat_conds_icds_by_person <- inner_join(dat_conds_by_person, dat_cond_cdm_icd) %>%
  select(person_id, icd_code)

#now, this data frame can be passed to the comorbidity function. Since we want a more robust function that can return cci's for both icd-9 and 10 classification systems, we will run it separately based on the coding system, then bind the results together.

#utilizing the documentation and examples of this package:

# Charlson comorbidity presence based on ICD-10 diagnostic codes:
dat_comorbid_icd10 <- comorbidity(x = dat_conds_icds_by_person,   #specify the data frame from which to look at codes
            id = "person_id",                                  # the identification variable
            code = "icd_code",                                 # the column containing the condition codes 
            map = "charlson_icd10_quan",                       # mapping algorithm to use, we will use based on Quan paper
            assign0 = F)                                       # should an hierarchy of comorbidities be used (leave out for this example)

# Charlson comorbidity presence based on ICD-9 diagnostic codes:
dat_comorbid_icd9 <- comorbidity(x = dat_conds_icds_by_person,   #specify the data frame from which to look at codes
            id = "person_id",                                  # the identification variable
            code = "icd_code",                                 # the column containing the condition codes
            map = "charlson_icd9_quan",                       # mapping algorithm to use, we will use based on Quan paper
            assign0 = F)       


#add on the score for each patient to the comoridity table to get the weighted cci per person, and drop all individual comorbidity identifier columns because these are not asked for as a final product

dat_person_w_score_icd10 <- dat_comorbid_icd10 %>%
  mutate(cci_score = score(x = dat_comorbid_icd10,
                           weights = "quan",
                           assign0 = F)
         ) %>%
  select(person_id, cci_score)


dat_person_w_score_icd9 <- dat_comorbid_icd9 %>%
  mutate(cci_score = score(x = dat_comorbid_icd9,
                           weights = "quan",
                           assign0 = F)
         ) %>%
  select(person_id, cci_score)

#lastly, we can bind these two resulting tables together to get what we ultimately want the function to return to us:

dat_person_w_score_icd9_and_10 <- rbind(dat_person_w_score_icd10, dat_person_w_score_icd9)

#one important note: if a person has both icd 9 and 10 records on file, they will have a different cci score calculated for them based on each coding system. It is unclear if this is what would be desired in practice, but for the purpose of this assignment, I will assume this is the preferred option

return(dat_person_w_score_icd9_and_10)

  
}

The above function works as intended when tested back on the example data. In a test of the function, we can apply it to a new list of “drug codes”, which are currently just integers ranging from 1 to 20. Let’s calculate CCI values for anyone taking simulated RxNorm/NDC drug codes valued at drug_concept_id = 1, 5, 7, 12, and 16:

#re-assign data that would in practice come from the OMOP CDM to the simulated data I already have, since i do not have access to Janssen's data
drug_cdm <- dat_drug2
dat_cond_cdm_icd <- dat_condition_lookup
dat_cond_cdm_icd <- dat_cond_cdm_icd %>%
    rename(icd_code = condition_descr)
    dat_cond_cdm <- dat_cond2

#define the drug codes of interest (the input):
drugs_of_interest = c(1, 5, 7, 12, 16)

#execute the function to get the desired output:
drugs_to_person_and_cci(drugs_of_interest)

##    person_id cci_score
## 1     137849         4
## 2     145211         1
## 3     154190        11
## 4     172100         6
## 5     263052         5
## 6     292854         7
## 7     319861        23
## 8     336900        12
## 9     337686        11
## 10    368820        11
## 11    387533        20
## 12    420245        22
## 13    439790         6
## 14    459729        12
## 15    494779        12
## 16    499843         7
## 17    521908         8
## 18    542546         6
## 19    548564        10
## 20    559703         5
## 21    572346        12
## 22    614986        11
## 23    633091         7
## 24    701780        16
## 25    715271         4
## 26    720913        21
## 27    746542        11
## 28    802165         1
## 29    813350         6
## 30    815826         7
## 31    853946         5
## 32    865077        11
## 33    880844         9
## 34    888559        11
## 35    960193        15
## 36    969577        16
## 37    137849         0
## 38    145211         0
## 39    154190         0
## 40    172100         0
## 41    263052         0
## 42    292854         0
## 43    319861         0
## 44    336900         0
## 45    337686         0
## 46    368820         0
## 47    387533         0
## 48    420245         0
## 49    439790         0
## 50    459729         0
## 51    494779         0
## 52    499843         0
## 53    521908         0
## 54    542546         0
## 55    548564         0
## 56    559703         0
## 57    572346         0
## 58    614986         0
## 59    633091         0
## 60    701780         0
## 61    715271         0
## 62    720913         0
## 63    746542         0
## 64    802165         0
## 65    813350         0
## 66    815826         0
## 67    853946         0
## 68    865077         0
## 69    880844         0
## 70    888559         0
## 71    960193         0
## 72    969577         0

And since only ICD-10 codes were used in creating the simulated data, that is why all persons have a score of 0 for the ICD-9-specific CCI calculations, which is expected.

Supplemental Results

1. OMOP CDM Usage with My Function/Program: Extensions

Per the OMOP CDM (https://ohdsi.github.io/CommonDataModel/index.html), it appears as though the table names where actual ICD identifiers and codes can be found are within the “Vocabulary Tables” set of tables. Here, in the “CONCEPT” table, the column “concept_name” seems a likely candidate to contain the diagnosis codes themselves (e.g. I21% for Myocardial Infarction in ICD-10 coding system, etc.). In the “VOCABULARY” table, the column “vocabulary_id” seems a likely candidate to contain the coding systems from which the diagnosis codes are from (e.g. ICD-10 for International Classification of Diseases, 10th Edition). These tables would likely need to be used and referenced in order to pull the actual diagnosis codes for the 17 comorbidities of interest, unless the team at Janssen has already done this.

An even more robust/useful way to calculate the CCI score for each person exposed to a list of drugs of interest would be to consider diagnoses of the 17 comorbidities that make up the CCI across all coding systems present in the OMOP CDM. That is, across ICD-9, ICD-10, SNOMED, and all other diagnosis coding systems that Janssen has in their data. Ideally, this would mean that a singular condition_concept_id code would map to multiple vocabulary_id codes, each of which points to the same underlying condition (e.g. Myocardial Infarction). With this approach, the comorbidity() R package for quickly calculating CCI scores may not be of use to us anymore, since per the package’s documentation, it seems to only be able to handle ICD codes. This is fine though, because we can write a function to manually calculate the weighted CCI scores since the equation is simple. This approach is actually what my worked example with the simulated data was aiming at. The 17 distinct CCI-specific integers for condition_concept_id were meant to represent exactly this. Each of these singular, standardized codes would represent one of the 17 Charlson comorbidities, and any records of each of these 17 comorbidities in the raw data, regardless of which coding system the diagnosis uses/comes from, the OMOP CDM processes the raw code and turns it into one of the standardized ones.

The following illustration I made can provide a visual explanation of what I mean:

We can quickly write a user-defined function to manually calculate CCI scores (based on original weights), rather than use the comorbidity package. It will take a data frame and the columns of patients and their diagnosis codes as inputs, and output a new data frame containing CCI scores for each of the patients. The same 17 distinct CCI-specific integers (numbers 1 through 17) will be the placeholder values for each CCI-specific condition_concept_id value. Below, we can write the function:

get_CCI <- function(df, patient_col, diagnosis_col) {
  
  patientCol <- enquo(patient_col)
  diagnosisCol <- enquo(diagnosis_col)
  
  df2 <- df %>%
    group_by(!!patientCol) %>%
    mutate(ami_score = case_when(   #acute myocardial infarction
     1 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     chf_score = case_when(   #congestive heart failure
     2 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     pvd_score = case_when(   #peripheral vascular disease
     3 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     cevd_score = case_when(   #cerebrovascular disease
     4 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     dementia_score = case_when(   #dementia
     5 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     copd_score = case_when(   #chronic obstructive pulmonary disease
     6 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     rheumd_score = case_when(   #rheumatoid disease
     7 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     pud_score = case_when(   #peptic ulcer disease
     8 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     mld_score = case_when(   #mild liver disease
     9 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     diab_score = case_when(   #diabetes without complications
     10 %in% !!diagnosisCol ~ 1,
     T ~ 0,
       ),
     diabwc_score = case_when(   #diabetes with complications
     11 %in% !!diagnosisCol ~ 2,
     T ~ 0,
       ),
     hp_score = case_when(   #hemiplegia or paraplegia
     12 %in% !!diagnosisCol ~ 2,
     T ~ 0,
       ),
     rend_score = case_when(   #renal disease
     13 %in% !!diagnosisCol ~ 2,
     T ~ 0,
       ),
     canc_score = case_when(   #cancer (any malignancy)
     14 %in% !!diagnosisCol ~ 2,
     T ~ 0,
       ),
     msld_score = case_when(   #moderate or severe liver disease
     15 %in% !!diagnosisCol ~ 3,
     T ~ 0,
       ),
     metacanc_score = case_when(   #metastatic solid tumour
     16 %in% !!diagnosisCol ~ 6,
     T ~ 0,
       ),
     aids_score = case_when(   #AIDS/HIV
     17 %in% !!diagnosisCol ~ 6,
     T ~ 0,
       ),
     
     #now calcualte the score by summing over the weighted presence values, divided by the number of patient comorbidities
     tot_cci_score = sum(ami_score, chf_score, pvd_score, cevd_score, dementia_score,
                     copd_score, rheumd_score, pud_score, mld_score, diabwc_score,
                     diabwc_score, hp_score, rend_score, canc_score, msld_score, metacanc_score,
                     aids_score, na.rm = T) ,  #remove NAs to make function impervious to missing data, which may be the case
     n=n(),    #count number of patient comorbidities
     cci_score = tot_cci_score/n   #calculate final cci score
    ) %>% ungroup()
  
  df3 <- distinct(df2 %>%    #select distinct persons and scores 
    select(!!patientCol, cci_score)
  )
  
  
  
  #return the desired result
  return(df3)

}

#test the function against simulated data from 5 persons
set.seed(978)
testdf <- data.frame(id = sample(x=c("a", "b", "c", "d", "e"), size=12, replace=T),
                     diag = sample(x=seq(1,17,1), size=12, replace=T)) %>%
  arrange(desc(id))

#view the simulated data to see what the expected scores SHOULD be if the function works properly:
print.data.frame(testdf)

##    id diag
## 1   e    3
## 2   e    3
## 3   e    9
## 4   d    5
## 5   d    8
## 6   c   12
## 7   b    1
## 8   b   17
## 9   b    9
## 10  b    3
## 11  a    2
## 12  a   15

#person a should have a score of 4 (chf (1) + msld (3) = 4)
#person b should have a score of 9 (ami (1) + pvd (1) + mld(1) + aids(6) = 9)
#person c should have a score of 2 (hp(2) = 2)
#person d should have a score of 2 (dementia (1) + pud (1) = 2)
#person e should have a score of 2 (pvd (1) + mld (1) = 2)

get_CCI(testdf, id, diag)

## # A tibble: 5 x 2
##   id    cci_score
##   <chr>     <dbl>
## 1 e             2
## 2 d             2
## 3 c             2
## 4 b             9
## 5 a             4

The function works as intended - the values match!

2. Generating Potentially Relevant Tables Using Additional OMOP CDM Data

It’s plausible that it is of particular interest to investigate patients’ CCI scores when stratifying by demographics and/or other covariates. To explore this a bit, I will use the simulated data and CCI results from earlier in this assignment.

#left join the person data onto the final result table from the worked example. We will use a left join, because not all person_ids in the person data had cci scores estimated for them. This is because 3 persons failed one of the criteria, which is either: 
#(1) drugs matching conditions of interest as those that have the same valued integer for both `drug_concept_id` and `condition_concept_id` or 
#(2) the diagnosis/diagnoses occur on or after the date of earliest drug exposure

dat_cci_dems <- left_join(dat_pers2, dat_person_w_score)


#can calculate age, as well as give simulated actual categories for all other demographics
set.seed(978)

dat_cci_dems2 <- dat_cci_dems %>%
  mutate(age = lubridate::year(Sys.Date()) - year_of_birth,
         gender = case_when(
           gender_concept_id==1 ~ "Female",
           T ~ "Male"
          ),
         race = case_when(
           race_concept_id==1 ~ "Asian",
           race_concept_id==2 ~ "Black",
           race_concept_id==3 ~ "Native Hawaiian or Other Pacific Islander",
           race_concept_id==4 ~ "White",
           race_concept_id==5 ~ "American Indian or Alaska Native",
           race_concept_id==6 ~ "Other"
          ),
         ethnicity = case_when(
           ethnicity_concept_id==1 ~ "Hispanic/Latino",
           ethnicity_concept_id==2 ~ "Non-Hispanic/Latino",
           ethnicity_concept_id==3 ~ "Other"
         )
  ) %>%
  select(age, gender, race, ethnicity, cci_score) #removing id cols, and now also person_id to de-identify results

#let's present the descriptive statistics in a nice table:
dat_cci_dems_tab <- dat_cci_dems2 %>%
  tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
                               all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
              label = list(age ~ "Age (years)",
                           gender ~ "Gender",
                           race ~ "Race",
                           ethnicity ~ "Ethnicity",
                           cci_score ~ "Charlson Comorbidity Index (CCI) Score"
                           ),

              missing_text = "(Missing)") %>%
  modify_header(label ~ "**Variable**") %>%  
  modify_footnote(
    all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
    or Frequency (%); (% missing) for categorical variables."
  ) %>%
  bold_labels()

#print the table
dat_cci_dems_tab

Variable	N = 40¹
Age (years)	68 (24); (17-104); 70; (0%)
Gender
Female	15 / 40; (38%); (0%)
Male	25 / 40; (62%); (0%)
Race
American Indian or Alaska Native	7 / 40; (18%); (0%)
Asian	9 / 40; (22%); (0%)
Black	8 / 40; (20%); (0%)
Native Hawaiian or Other Pacific Islander	5 / 40; (12%); (0%)
Other	2 / 40; (5.0%); (0%)
White	9 / 40; (22%); (0%)
Ethnicity
Hispanic/Latino	9 / 40; (22%); (0%)
Non-Hispanic/Latino	14 / 40; (35%); (0%)
Other	17 / 40; (42%); (0%)
Charlson Comorbidity Index (CCI) Score	4.2 (3.6); (0.0-15.0); 4.0; (7.5%)
(Missing)	3
¹ Mean (SD); (Min-Max); Median for numerical variables; (% missing), or Frequency (%); (% missing) for categorical variables.

This gives a nice breakdown of the demographic characteristics of the 40 patients in the sample cohort.

Perhaps prior studies have shown that there exists differences in CCI scores between males and females. Then, we may be interested in the same breakdown but stratified by gender. Let’s have a look:

#let's present the descriptive statistics in a nice table, now stratified by gender:
dat_cci_dems_tab_by_gender <- dat_cci_dems2 %>%
  tbl_summary(by = gender,  #the "by" argument let's us stratify the table easily
    statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
                               all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
              label = list(age ~ "Age (years)",
                           race ~ "Race",
                           ethnicity ~ "Ethnicity",
                           cci_score ~ "Charlson Comorbidity Index (CCI) Score"
                           ),

              missing_text = "(Missing)") %>%
  add_p(test = list(all_continuous() ~ "t.test", 
                     all_categorical() ~ "chisq.test"),
         pvalue_fun = ~style_pvalue(.x, digits = 2)) %>%
   add_overall() %>%
  modify_header(label ~ "**Variable**") %>%  
  modify_footnote(
    all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
    or Frequency (%); (% missing) for categorical variables."
  ) %>%
  bold_labels()

#print the table
dat_cci_dems_tab_by_gender

Variable	Overall, N = 40¹	Female, N = 15¹	Male, N = 25¹	p-value²
Age (years)	68 (24); (17-104); 70; (0%)	79 (16); (59-104); 76; (0%)	61 (26); (17-104); 58; (0%)	0.010
Race				0.77
American Indian or Alaska Native	7 / 40; (18%); (0%)	2 / 15; (13%); (0%)	5 / 25; (20%); (0%)
Asian	9 / 40; (22%); (0%)	2 / 15; (13%); (0%)	7 / 25; (28%); (0%)
Black	8 / 40; (20%); (0%)	3 / 15; (20%); (0%)	5 / 25; (20%); (0%)
Native Hawaiian or Other Pacific Islander	5 / 40; (12%); (0%)	2 / 15; (13%); (0%)	3 / 25; (12%); (0%)
Other	2 / 40; (5.0%); (0%)	1 / 15; (6.7%); (0%)	1 / 25; (4.0%); (0%)
White	9 / 40; (22%); (0%)	5 / 15; (33%); (0%)	4 / 25; (16%); (0%)
Ethnicity				0.91
Hispanic/Latino	9 / 40; (22%); (0%)	3 / 15; (20%); (0%)	6 / 25; (24%); (0%)
Non-Hispanic/Latino	14 / 40; (35%); (0%)	5 / 15; (33%); (0%)	9 / 25; (36%); (0%)
Other	17 / 40; (42%); (0%)	7 / 15; (47%); (0%)	10 / 25; (40%); (0%)
Charlson Comorbidity Index (CCI) Score	4.2 (3.6); (0.0-15.0); 4.0; (7.5%)	1.9 (1.8); (0.0-6.0); 1.0; (13%)	5.4 (3.7); (0.0-15.0); 5.0; (4.0%)	<0.001
(Missing)	3	2	1
¹ Mean (SD); (Min-Max); Median for numerical variables; (% missing), or Frequency (%); (% missing) for categorical variables. ² Welch Two Sample t-test; Pearson's Chi-squared test

We notice that while there doesn’t appear to be a significant in CCI score between males and females, we do notice that there is a significant difference in mean age between males and females (p = 0.010), indicating that females in our cohort were, on average, significantly older (mean age 79) than males (mean age 61). We can also notice that the range of CCI values in our patient cohort is 0-15. Lastly, it may be of interest to categorize the score ranges in an ordinal fashion, and stratify the results by those groups:

#create the new variable
dat_cci_dems3 <- dat_cci_dems2 %>%
  mutate(cci_cat = factor(case_when(
    cci_score <= 3 ~ "0-3",
    cci_score > 3 & cci_score <= 7 ~ "4-7",
    cci_score > 7 & cci_score <= 11 ~ "8-11",
    cci_score > 11 ~ "12-15",
    ), levels = c("0-3", "4-7", "8-11", "12-15"))
  )

#create the new stratified table on cci score categorization group

dat_cci_dems_tab_by_cci_cat <- dat_cci_dems3 %>%
  tbl_summary(by = cci_cat,  #the "by" argument let's us stratify the table easily
    statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
                               all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
              label = list(age ~ "Age (years)",
                           gender ~ "Gender",
                           race ~ "Race",
                           ethnicity ~ "Ethnicity",
                           cci_score ~ "Charlson Comorbidity Index (CCI) Score"
                           ),

              missing_text = "(Missing)") %>%
  add_p(test = list(all_continuous() ~ "aov", 
                     all_categorical() ~ "chisq.test"),
         pvalue_fun = ~style_pvalue(.x, digits = 2)) %>%
   add_overall() %>%
  modify_header(label ~ "**Variable**") %>%  
  modify_footnote(
    all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
    or Frequency (%); (% missing) for categorical variables."
  ) %>%
  bold_labels()

#print the table
dat_cci_dems_tab_by_cci_cat

Variable	Overall, N = 37¹	0-3, N = 17¹	4-7, N = 14¹	8-11, N = 4¹	12-15, N = 2¹	p-value²
Age (years)	68 (25); (17-104); 69; (0%)	71 (24); (17-104); 73; (0%)	60 (25); (18-98); 56; (0%)	78 (31); (38-104); 84; (0%)	84 (0); (84-84); 84; (0%)	0.37
Gender						0.035
Female	13 / 37; (35%); (0%)	10 / 17; (59%); (0%)	3 / 14; (21%); (0%)	0 / 4; (0%); (0%)	0 / 2; (0%); (0%)
Male	24 / 37; (65%); (0%)	7 / 17; (41%); (0%)	11 / 14; (79%); (0%)	4 / 4; (100%); (0%)	2 / 2; (100%); (0%)
Race						0.87
American Indian or Alaska Native	6 / 37; (16%); (0%)	2 / 17; (12%); (0%)	3 / 14; (21%); (0%)	1 / 4; (25%); (0%)	0 / 2; (0%); (0%)
Asian	9 / 37; (24%); (0%)	4 / 17; (24%); (0%)	4 / 14; (29%); (0%)	1 / 4; (25%); (0%)	0 / 2; (0%); (0%)
Black	7 / 37; (19%); (0%)	4 / 17; (24%); (0%)	2 / 14; (14%); (0%)	0 / 4; (0%); (0%)	1 / 2; (50%); (0%)
Native Hawaiian or Other Pacific Islander	5 / 37; (14%); (0%)	3 / 17; (18%); (0%)	1 / 14; (7.1%); (0%)	1 / 4; (25%); (0%)	0 / 2; (0%); (0%)
Other	2 / 37; (5.4%); (0%)	2 / 17; (12%); (0%)	0 / 14; (0%); (0%)	0 / 4; (0%); (0%)	0 / 2; (0%); (0%)
White	8 / 37; (22%); (0%)	2 / 17; (12%); (0%)	4 / 14; (29%); (0%)	1 / 4; (25%); (0%)	1 / 2; (50%); (0%)
Ethnicity						0.87
Hispanic/Latino	9 / 37; (24%); (0%)	4 / 17; (24%); (0%)	3 / 14; (21%); (0%)	1 / 4; (25%); (0%)	1 / 2; (50%); (0%)
Non-Hispanic/Latino	13 / 37; (35%); (0%)	5 / 17; (29%); (0%)	6 / 14; (43%); (0%)	2 / 4; (50%); (0%)	0 / 2; (0%); (0%)
Other	15 / 37; (41%); (0%)	8 / 17; (47%); (0%)	5 / 14; (36%); (0%)	1 / 4; (25%); (0%)	1 / 2; (50%); (0%)
Charlson Comorbidity Index (CCI) Score	4.2 (3.6); (0.0-15.0); 4.0; (0%)	1.2 (1.0); (0.0-3.0); 1.0; (0%)	5.1 (1.1); (4.0-7.0); 5.0; (0%)	8.5 (1.0); (8.0-10.0); 8.0; (0%)	14.0 (1.4); (13.0-15.0); 14.0; (0%)	<0.001
¹ Mean (SD); (Min-Max); Median for numerical variables; (% missing), or Frequency (%); (% missing) for categorical variables. ² One-way ANOVA; Pearson's Chi-squared test

When stratifying by the categorized version of CCI score, we do not notice any significant differences between score category and any of the other demographics.