The Charlson comorbidity index was developed to predict mortality rates based on prior diagnoses(1). Several researchers have applied the Charlson index to administrative claims databases and other observational sources(2-8). In short, the index is a weighted sum of the occurrence of 17 medical conditions of interest.
The Observational Medical Outcomes Partnership (OMOP) Common Data Model has been designed to support disparate observational databases (including administrative claims and electronic heath records). Details about OMOP common data model version 5.4 are available at https://ohdsi.github.io/CommonDataModel/cdm54.html Briefly, for this exercise, four tables of interest (with potentially relevant fields) are PERSON, OBSERVATION_PERIOD, DRUG_EXPOSURE, and CONDITION_OCCURRENCE.
Write a program (in SAS, SQL or R) that could be applied to the OMOP common data model to calculate the Charlson index for all people in an exposed population, indexed on the date of first exposure. The program would take as input a list of drug identifiers (e.g. NDCs or RxNorm CUIs) that define the exposure, and output a table that contains two fields: PERSON_ID, and the Charlson index score.
In an effort to not have to re-invent the wheel, I identified the comorbidity R package (documentation found here: https://cran.r-project.org/web/packages/comorbidity/index.html). The author cites and uses some of the exact papers/sources referenced in the Introduction. The author has created a function that will automatically calculate the Charlson comorbidity index (CCI) when passing in a series of patient IDs and diagnostic codes. There are options for calculating the Charlson or Elixhauser score, but for the purpose of this assignment, I will only calculate the CCI.
However, using this function to calculate the CCI will be one of the last steps I do to complete this assignment. First, I need to use the data and columns contained in the tables within the OMOP CDM to arrive at a usable data frame. The CCI is calculated from presence of certain diagnoses, but the assignment asks me to calculate the CCI using only a list of drug codes. I will need to transition from a list of patients with a particular set of drug codes to patients who have comorbidities of interest.
drug_concept_id values in the DRUG_EXPOSURE table.comorbidity() function to return a data frame containing the scores of each of the individual comorbidities. One important note from the function documentation to make sure of is: “ICD-10 and ICD-9 codes must be in upper case and with alphanumeric characters only in order to be properly recognised”.score() function to calculate the weighted CCI value for each individual patient.Each of the above steps will be executed on sample data, to illustrate that my solutions work. Then, the function that is asked for in the assignment will be written.
Before touching the data, I will load in relevant R packages that I will likely need to use. If I need more later, I will load them later. I will also create a few user-defined functions that I often use, in case I need them. These steps are typical of my general coding process, but are not always necessary.
#I have already installed these pkgs, so I will just load them
library(tidyverse) #an essential for just about everything!
library(janitor) #good at quickly cleaning up messy data tables (column names, etc.)
library(gtsummary) #creates nice summary tables
library(comorbidity) #will eventually calculate CCI
library(randomNames) #generates random names etc. for simulated patient data
library(stringi) #generates random strings
library(zoo) #can help with dates
#define 'not in' function
'%!in%' <- function(x,y)!('%in%'(x,y))
To illustrate this assignment with a working example, I will generate sample data that can be used throughout each step of the assignment. This step would not be executed in a real situation where I have access to the data, where I would just load in the data instead. I will first use the sample data to make sure the approach and coding solutions work as intended, then I will write the function that is asked for in the assignment which will mimic the steps used in the example to return the desired output.
#create sample data for each table
#first I will set the seed so that everything is reproducible
set.seed(999)
#1.------PERSON--------
#use randomNames package to generate sample patient data (I will create 40 sample patients)
dat_pers <- randomNames(n=40,
return.complete.data=T)
#need to tack on a unique person_id, ethnicity_concept_id, and year_of_birth to truly mimic the PERSON data
#need to drop the name column as well, likely PHI
dat_pers2 <- dat_pers %>%
mutate(person_id = sample(x=seq(100000,999999,1), size=40, replace=F), #random 6 digit unique patient id
year_of_birth = sample(x=seq(1916, 2016, 1), size=40, replace=T), #restrict to <= 100 years old between 1916-2016
ethnicity_concept_id = sample(x=c(1,2,3), size=40, replace=T) #the values of 1, 2, & 3 would have a lookup table in the CDM
) %>%
rename(gender_concept_id = gender,
race_concept_id = ethnicity) %>%
select(person_id, gender_concept_id, year_of_birth,
race_concept_id, ethnicity_concept_id)
#quick view of first 5 persons:
head(dat_pers2, 5)
## person_id gender_concept_id year_of_birth race_concept_id
## 1: 713073 0 1982 5
## 2: 145211 1 1958 3
## 3: 459729 0 1985 2
## 4: 776641 0 1941 1
## 5: 969577 0 1918 1
## ethnicity_concept_id
## 1: 2
## 2: 3
## 3: 3
## 4: 3
## 5: 3
#2.------OBSERVATION_PERIOD--------
#need to use same person_ids from above
#need to tack on any missing cols
dat_obs <- data.frame(
observation_period_id = sample(x=seq(1000000,9999999,1), size=40, replace=F), #random 7 digit unique id, assume only 1 per person
person_id = sample(x=dat_pers2$person_id, size=40, replace=F) #random order for person_ids
)
#before adding observation_period_start_date and end date, need to make sure these dates come on or after each patient's DOB
dat_obs_pers2_fordate <- inner_join(dat_pers2, dat_obs) %>%
select(person_id, year_of_birth, observation_period_id)
#now can add the obs dates on
dat_obs2 <- dat_obs_pers2_fordate %>%
group_by(person_id) %>%
mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1),
observation_period_start_date =
sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (365 * 5)), by="day"), 1), #start date is either on or within 5 years of year of birth for each person
observation_period_end_date =
sample(seq((observation_period_start_date+365), Sys.Date(), by="day"), 1) #end date is at least 1 year after start date and spans to current day today
) %>%
select(-c(year_of_birth_lastday, year_of_birth)) #remove year of birth because it is not in this table
#quick view of first 5 observations:
head(dat_obs2, 5)
## # A tibble: 5 x 4
## # Groups: person_id [5]
## person_id observation_period_~ observation_period_star~ observation_period_en~
## <dbl> <dbl> <date> <date>
## 1 713073 7223144 1987-02-18 1989-12-12
## 2 145211 1236938 1963-08-20 1967-12-27
## 3 459729 5724923 1987-02-14 1993-04-24
## 4 776641 4748263 1945-05-31 1986-04-22
## 5 969577 5106453 1919-02-08 1953-10-05
#3.------DRUG_EXPOSURE--------
#need to use same person_ids from above
#need to tack on any missing cols
#this table will have > 1 row per patient, assuming that some patients may be on > 1 drug, let's say length 500 total
dat_drug <- data.frame(
drug_exposure_id = sample(x=seq(1000000,9999999,1), size=500, replace=F), #random 7 digit unique id
person_id = sample(x=dat_pers2$person_id, size=500, replace=T), #random order for person_ids
drug_concept_id = sample(x=seq(1,20,1), size=500, replace=T) #17 conditions that make up the CCI, so we will randomly sample 1 through 20, simulating (for this example) 20 different drug options for 17 different conditions, but only 17/20 drug ids will match with the conditions
)
#before adding drug_exposure_start_date and end date, need to make sure these dates come on or after each patient's DOB
#but not necessarily within the observation period, in case we want to do an exercise later where we perform this additional check
dat_drug_pers2_fordate <- inner_join(dat_pers2, dat_drug) %>%
select(person_id, year_of_birth, drug_exposure_id, drug_concept_id)
dat_drug2 <- dat_drug_pers2_fordate %>%
group_by(person_id, drug_concept_id, drug_exposure_id) %>% #making sure that each drug and drug ID per patient has its own dates
mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1), # end of year in year of birth
drug_exposure_start_date =
sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (360 * 5)), by="day"), 1), #start date is either on or within5 years of year of birth for each person
drug_exposure_end_date =
sample(seq((drug_exposure_start_date +365), Sys.Date(), by="day"), 1) #end date is at least 1 year after start date and spans to current day today
) %>%
select(-c(year_of_birth_lastday, year_of_birth)) %>% #remove year of birth because it is not in this table
select(drug_exposure_id, person_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date)
#quick view of first 5 drug records:
head(dat_drug2, 5)
## # A tibble: 5 x 5
## # Groups: person_id, drug_concept_id, drug_exposure_id [5]
## drug_exposure_id person_id drug_concept_id drug_exposure_sta~ drug_exposure_e~
## <dbl> <dbl> <dbl> <date> <date>
## 1 4399072 713073 10 1987-01-09 2021-11-15
## 2 1491789 713073 20 1985-06-10 1988-10-10
## 3 2342278 713073 1 1985-11-29 2002-08-27
## 4 1387146 713073 2 1986-11-25 1996-05-15
## 5 5871878 713073 6 1984-05-09 1994-11-17
#4.------CONDITION_OCCURRENCE--------
#need to use same person_ids from above
#need to tack on any missing cols
#this table will have > 1 row per patient, assuming that some patients may have > 1 condition, let's say length 500 total
dat_cond <- data.frame(
condition_occurrence_id = sample(x=seq(1000000,9999999,1), size=500, replace=F), #random 7 digit unique id
person_id = sample(x=dat_pers2$person_id, size=500, replace=T), #random order for person_ids
condition_concept_id = sample(x=seq(1,20,1), size=500, replace=T) #17 conditions that make up the CCI, so we will randomly sample 1 through 20, simulating (for this example) 20 different condition options for 17 different conditions of interest, but only 17/20 condition ids will match with the conditions we need
)
#before adding condition_start_date and end date, need to make sure these dates come on or after each patient's DOB
#but not necessarily within the observation period, in case we want to do an exercise later where we perform this additional check
dat_cond_pers2_fordate <- inner_join(dat_pers2, dat_cond) %>%
select(person_id, year_of_birth, condition_occurrence_id, condition_concept_id)
dat_cond2 <- dat_cond_pers2_fordate %>%
group_by(person_id, condition_concept_id, condition_occurrence_id) %>% #making sure that each condition and condition ID per patient has its own dates
mutate(year_of_birth_lastday = as.Date(as.yearmon(year_of_birth) + 11/12, frac = 1), # end of year in year of birth
condition_start_date =
sample(seq(year_of_birth_lastday, (year_of_birth_lastday + (360 * 5)), by="day"), 1), #start date is either on or within5 years of year of birth for each person
condition_end_date =
sample(seq((condition_start_date +365), Sys.Date(), by="day"), 1) #end date is at least 1 year after start date and spans to current day today
) %>%
select(-c(year_of_birth_lastday, year_of_birth)) %>% #remove year of birth because it is not in this table
select(condition_occurrence_id, person_id, condition_concept_id, condition_start_date, condition_end_date)
#quick view of first 5 condition records:
head(dat_cond2, 5)
## # A tibble: 5 x 5
## # Groups: person_id, condition_concept_id, condition_occurrence_id [5]
## condition_occurr~ person_id condition_conce~ condition_start~ condition_end_d~
## <dbl> <dbl> <dbl> <date> <date>
## 1 5520661 713073 19 1987-08-19 1996-02-13
## 2 8833374 713073 4 1983-08-10 1995-10-07
## 3 7687341 713073 19 1984-11-09 2004-07-26
## 4 2361338 713073 5 1986-01-16 2002-11-21
## 5 6519796 713073 18 1986-07-10 2002-03-31
#it is possible that a patient's drug_exposure_start_date is before their condition_start_date, which is likely not possible in practice and would likely need to be removed/filtered. But for the purpose of this assignment, we only need to look at drug date, but this additional check could be performed if needed
Next, now that the sample data are generated, I can define the exposure at the drug-level within the drug exposure (dat_drug2) table as drug_concept_id (step 1 of Approach). This variable is an integer, and is defined as “A foreign key that refers to a Standard Concept identifier in the Standardized Vocabularies belonging to the ‘Drug’ domain”. A person may have > 1 record of the same type of drug, for each of which a unique drug_exposure_id would be present. In the simulated drug data, each patient was assigned a random number of drug_concept_id values, ranging from 1-20. For the purpose of this assignment, given that there are 17 different conditions/diagnoses that contribute to the CCI score, I will define drugs matching conditions of interest as those that have the same valued integer for both drug_concept_id and condition_concept_id from the condition occurrence (dat_cond2) table. I purposely added 3 extra drugs that should not be selected for (drug_concept_id values 18-20), and similarly 3 extra conditions that should not be selected for (condition_concept_id values 18-20). To enable myself to utilize the comborbidity package later to calculate the CCI, I will also need to create an accurate lookup table, which assigns CCI-specific ICD codes to the condition_concept_id values 1-17, and non-CCI-specific ICD codes to the the condition_concept_id values 18-20. I will create this lookup table next, so that I can use it later.
The following resource was used to create the lookup table:
I will just select the first-occurring ICD-10 code for each of the 17 CCI-specific comorbidities for simplicity’s sake, and use 3 made up non-CCI-specific codes for the remaining 3:
#make the lookup table
dat_condition_lookup <- data.frame(
condition_concept_id = seq(1,20,1),
condition_descr = c("I21", "I428", "I70", "G45", "F00", "I278", "M05", "K25", "B18", #all CCI-speific codes
"E100", "E102", "G041", "I120", "C00", "I850", "C77", "B20",
"N0P3", "N00", "N3G" #all non-CCI-specific_codes
)
)
#check to make sure that it is identifying each corerct code, and not identifying each incorrect code. There should be a "1" present for each row 1:17 in an identity-matrix pattern (i.e. row 1 has a 1 in first column, row 2 has a 1 in second column, etc.), with the pattern breaking by containing all zeroes starting at row 18
comorbidity(x = dat_condition_lookup, id = "condition_concept_id",
code = "condition_descr", map = "charlson_icd10_quan", assign0 = FALSE)
## condition_concept_id ami chf pvd cevd dementia copd rheumd pud mld diab
## 1 1 1 0 0 0 0 0 0 0 0 0
## 2 2 0 1 0 0 0 0 0 0 0 0
## 3 3 0 0 1 0 0 0 0 0 0 0
## 4 4 0 0 0 1 0 0 0 0 0 0
## 5 5 0 0 0 0 1 0 0 0 0 0
## 6 6 0 0 0 0 0 1 0 0 0 0
## 7 7 0 0 0 0 0 0 1 0 0 0
## 8 8 0 0 0 0 0 0 0 1 0 0
## 9 9 0 0 0 0 0 0 0 0 1 0
## 10 10 0 0 0 0 0 0 0 0 0 1
## 11 11 0 0 0 0 0 0 0 0 0 0
## 12 12 0 0 0 0 0 0 0 0 0 0
## 13 13 0 0 0 0 0 0 0 0 0 0
## 14 14 0 0 0 0 0 0 0 0 0 0
## 15 15 0 0 0 0 0 0 0 0 0 0
## 16 16 0 0 0 0 0 0 0 0 0 0
## 17 17 0 0 0 0 0 0 0 0 0 0
## 18 18 0 0 0 0 0 0 0 0 0 0
## 19 19 0 0 0 0 0 0 0 0 0 0
## 20 20 0 0 0 0 0 0 0 0 0 0
## diabwc hp rend canc msld metacanc aids
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0
## 11 1 0 0 0 0 0 0
## 12 0 1 0 0 0 0 0
## 13 0 0 1 0 0 0 0
## 14 0 0 0 1 0 0 0
## 15 0 0 0 0 1 0 0
## 16 0 0 0 0 0 1 0
## 17 0 0 0 0 0 0 1
## 18 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0
The lookup table looks good and the comorbidity() function appears to work well, so we will move on to step 2 of the Approach.
It is unclear in the assignment instructions whether “indexed on the date of first exposure” should be applied at the patient-drug level, or just the patient level. To capture more data per patient (and as a result potentially more comorbidities), I will apply it at the patient_drug level by taking the earliest (first) record of each drug_exposure_id per person_id, per drug_concept_id.
#create data frame that identifies first-occuring drug_exposure_id per person, per drug
dat_drug_first <- dat_drug2 %>%
group_by(person_id, drug_concept_id) %>% #group appropriately
mutate(first_exposure = case_when( #create binary variable that tells us if indeed this is the first exposure to this drug per patient, per drug
drug_exposure_start_date==min(drug_exposure_start_date) ~ 1,
T ~ 0
),
first_exposure_id = drug_exposure_id[which.min(drug_exposure_start_date)] #index the drug_exposure_id on the minimum (earliest) drug start date, per patient, per drug
)
dat_drug_first2 <- dat_drug_first %>%
ungroup() %>% #ungroup the data
filter(first_exposure==1) #take only rows where it was a first exposure to a drug
As an aside, 123 total observations contributed by 39 distinct persons were removed from the drug data when applying the above filtering criteria. Each of these patients had multiple records (as defined by drug_exposure_id) of the same type of drug (as defined by drug_concept_id) on file, so we took only the earliest (first-occurring) one in these cases. Below is a quick look at the first 6 rows of these data, which displays the logic used for one person in particular as an example:
dat_drug_first_filtered_persons <- dat_drug_first %>%
filter(0 %in% first_exposure) %>%
arrange(desc(person_id))
head(dat_drug_first_filtered_persons, 6)
## # A tibble: 6 x 7
## # Groups: person_id, drug_concept_id [3]
## drug_exposure_id person_id drug_concept_id drug_exposure_sta~ drug_exposure_e~
## <dbl> <dbl> <dbl> <date> <date>
## 1 9073493 969577 2 1920-01-13 2000-07-13
## 2 9356546 969577 10 1923-03-29 2014-07-12
## 3 8730686 969577 10 1922-11-08 1978-12-28
## 4 9965029 969577 3 1920-11-29 1975-08-29
## 5 3035880 969577 2 1920-07-20 1967-10-26
## 6 9076775 969577 3 1921-04-29 2001-10-14
## # ... with 2 more variables: first_exposure <dbl>, first_exposure_id <dbl>
Next, step 3 of the Approach involves joining the appropriate diagnosis data from the dat_cond2 (conditions) table with the data from the above step, ensuring that the diagnosis/diagnoses occur on or after the date of earliest exposure. This is to make sure that exposure comes before diagnosis date. The only variables (for the purpose of this assignment) needed to join to the condition table are person_id, drug_concept_id, and drug_exposure_start_date.
#subset the new filtered drug data to only relevant columns
dat_drug_first3 <- dat_drug_first2 %>%
select(person_id, drug_concept_id, drug_exposure_start_date)
#then, join each data frame together, by person_id
dat_drug_and_cond <- inner_join(dat_drug_first3, dat_cond2)
#as stated previously, given that there are 17 different conditions/diagnoses that contribute to the CCI score, I will define drugs matching conditions of interest as those that have the same valued integer for both `drug_concept_id` and `condition_concept_id`. as such, we can first filter out those persons and observations where the drug ids and condition ids do not match
dat_drug_and_cond_match <- dat_drug_and_cond %>%
group_by(person_id, drug_concept_id) %>%
mutate(condition_matches_drug_id = case_when(
condition_concept_id %in% drug_concept_id ~ 1,
T ~ 0
)
)
#to have the most accurate and conservative CCI estimates possible at the end of this analysis, we should only include data from patients whose drug id codes exactly match the corresponding condition id codes (in this simulated example, this is interpreted as only pulling data for those patients whose drug type actually is a drug used to treat the condition it is intended for). So, next we filter out all rows that did not have a match:
dat_drug_and_cond_match2 <- dat_drug_and_cond_match %>%
ungroup() %>%
filter(condition_matches_drug_id==1)
#the next check we need to perform is that the diagnosis/diagnoses occur on or after the date of earliest drug exposure, so we do that next
dat_drug_and_cond_match3 <- dat_drug_and_cond_match2 %>%
group_by(person_id, drug_concept_id) %>%
mutate(drug_on_or_after_condition = case_when(
condition_start_date <= drug_exposure_start_date ~ 1,
T ~ 0
)
)
#and similarly we filter out those observations that failed the above criteria
dat_drug_and_cond_match4 <- dat_drug_and_cond_match3 %>%
ungroup() %>%
filter(drug_on_or_after_condition==1)
#now, some patients may actually have > 1 row in the remaining data frame if they have multiple different diagnosis records of the same condition, recorded on different dates, for which each diagnosis date was prior to the drug exposure date. For the purpose of this assignment, we will treat a patient EVER having a condition (even if they have multiple onset records) as a single observation for that patient. The resulting data frame should have only 1 row per patient per unique condition, via the below coding logic:
dat_conds_by_person <- distinct(dat_drug_and_cond_match4 %>%
select(person_id, condition_concept_id)
)
#this gives us a nice, 2-column summary of the distinct conditions per person. We can look at the resulting data for the first few people using the code below:
head(dat_conds_by_person, 13)
## # A tibble: 13 x 2
## person_id condition_concept_id
## <dbl> <dbl>
## 1 713073 10
## 2 713073 15
## 3 145211 6
## 4 145211 20
## 5 145211 1
## 6 776641 8
## 7 776641 13
## 8 969577 10
## 9 969577 19
## 10 969577 12
## 11 969577 16
## 12 439790 10
## 13 439790 4
Before we can move on to step 4 of the Approach and use the comorbidity() function to check for the presence or absence of CCI-relevant conditions for each of these patients, we need to map the condition_concept_id values that we have in the CDM to the actual ICD codes that are used to identify Charlson comorbidities. We can use the lookup table we made earlier to make this easy:
#join the lookup table/values with our final list of persons + conditions, then remove the condition ids because we can't use them to calculate the CCIs anymore
dat_conds_by_person_decoded <- inner_join(dat_conds_by_person, dat_condition_lookup) %>%
select(-c(condition_concept_id))
Now we can use the comorbidity() function to check for the presence or absence of CCI-relevant conditions for each of these patients. There are 100 total comorbidities contributed by 37 distinct patients in our final list.
#utilizing the documentation and examples of this package:
# Charlson comorbidity presence based on ICD-10 diagnostic codes:
dat_comorbid <- comorbidity(x = dat_conds_by_person_decoded, #specify the data frame from which to look at codes
id = "person_id", # the identification variable
code = "condition_descr", # the column containing the condition codes (using ICD-10 codes as example)
map = "charlson_icd10_quan", # mapping algorithm to use, we will use based on Quan paper
assign0 = F) # should an heirarchy of comorbidities be used (leave out for this example)
The resulting table is a patient-wise summary of the presence or absence of each of the 17 Charlson comorbidities. It’s worthwhile to tabulate display the frequency across all 37 persons nicely below:
#remove person_id from summary stats
dat_comorbid_fortab <- dat_comorbid %>%
select(-c(person_id))
#make the table using the nice tbl_summary() function from gtsummary pkg
tab_comorbid_blowout <- dat_comorbid_fortab %>%
tbl_summary(statistic = list(all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
digits = all_continuous() ~ 2,
label = list(ami ~ "Acute Myocardial Infarction",
chf ~ "Congestive Heart Failure",
pvd ~ "Peripheral Vascular Disease",
cevd ~ "Cerebrovascular Disease",
dementia ~ "Dementia",
copd ~ "Chronic Obstructive Pulmonary Disease",
rheumd ~ "Rheumatoid Disease",
pud ~ "Peptic Ulcer Disease",
mld ~ "Mild Liver Disease",
diab ~ "Diabetes Without Complications",
diabwc ~ "Diabetes With Complications",
hp ~ "Hemiplegia or Paraplegia",
rend ~ "Renal Disease",
canc ~ "Cancer (any malignancy)",
msld ~ "Moderate or Sever Liver Disease",
metacanc ~ "Metastatic Solid Tumour",
aids ~ "AIDS/HIV"
),
missing_text = "(Missing)") %>%
modify_header(label ~ "**Charlson Comorbidity**") %>%
modify_footnote(
all_stat_cols() ~ "Frequency (%); (% missing)"
) %>%
bold_labels()
#print the table
tab_comorbid_blowout
| Charlson Comorbidity | N = 371 |
|---|---|
| Acute Myocardial Infarction | 4 / 37; (11%); (0%) |
| Congestive Heart Failure | 2 / 37; (5.4%); (0%) |
| Peripheral Vascular Disease | 4 / 37; (11%); (0%) |
| Cerebrovascular Disease | 4 / 37; (11%); (0%) |
| Dementia | 9 / 37; (24%); (0%) |
| Chronic Obstructive Pulmonary Disease | 6 / 37; (16%); (0%) |
| Rheumatoid Disease | 7 / 37; (19%); (0%) |
| Peptic Ulcer Disease | 6 / 37; (16%); (0%) |
| Mild Liver Disease | 4 / 37; (11%); (0%) |
| Diabetes Without Complications | 5 / 37; (14%); (0%) |
| Diabetes With Complications | 2 / 37; (5.4%); (0%) |
| Hemiplegia or Paraplegia | 6 / 37; (16%); (0%) |
| Renal Disease | 3 / 37; (8.1%); (0%) |
| Cancer (any malignancy) | 4 / 37; (11%); (0%) |
| Moderate or Sever Liver Disease | 5 / 37; (14%); (0%) |
| Metastatic Solid Tumour | 7 / 37; (19%); (0%) |
| AIDS/HIV | 6 / 37; (16%); (0%) |
|
1
Frequency (%); (% missing)
|
|
Lastly, per step 5 of the Approach, we can use the comorbidity package’s score() function to calculate the weighted CCI value for each individual patient.
#add on the score for each patient to the comoridity table to get the weighted cci per person, and drop all individual comorbidity identifier columns because these are not asked for as a final product
dat_person_w_score <- dat_comorbid %>%
mutate(cci_score = score(x = dat_comorbid,
weights = "quan",
assign0 = F)
) %>%
select(person_id, cci_score)
#let's have a look at the final product:
print.data.frame(dat_person_w_score)
## person_id cci_score
## 1 137849 6
## 2 145211 1
## 3 154190 1
## 4 172100 2
## 5 263052 5
## 6 292854 2
## 7 319861 8
## 8 336900 6
## 9 337686 6
## 10 368820 6
## 11 387533 4
## 12 420245 13
## 13 439790 0
## 14 488828 1
## 15 494779 8
## 16 499843 4
## 17 521908 3
## 18 542546 1
## 19 548564 10
## 20 559703 5
## 21 572346 7
## 22 614986 4
## 23 633091 2
## 24 637565 0
## 25 701780 6
## 26 713073 4
## 27 715271 0
## 28 720913 15
## 29 746542 1
## 30 776641 1
## 31 813350 4
## 32 815826 1
## 33 853946 0
## 34 880844 2
## 35 888559 3
## 36 960193 4
## 37 969577 8
Now that I have achieved the desired result via a working example using simulated data, I will write a function (call it drugs_to_person_and_cci) that explicitly addresses what is being asked in the assignment, using the assumed OMOP CDM tables and column names.
First, I will load in all available tables. Depending on the computing setup and database connections available, some work may need to be done in SQL first. For the purpose of this assignment, I will assume that the data have already been queried in SQL and I can load all data tables in via CSV files. I will also assume the CSV versions of the data have the same names as the table names provided in the Instructions.
#load in the data
dat_pers_cdm <- read.csv("PERSON.csv")
dat_obs_cdm <- read.csv("OBSERVATION_PERIOD.csv")
dat_drug_cdm <- read.csv("DRUG_EXPOSURE.csv")
dat_cond_cdm <- read.csv("CONDITION_OCCURRENCE.csv")
Now, to write the function:
drugs_to_person_and_cci <- function(drug_list){
#first, filter the cdm drug data to only those drugs that are in our input list
dat_drug2 <- drug_cdm %>%
filter(drug_concept_id %in% drug_list)
#next, create data frame that identifies first-occuring drug_exposure_id per person, per drug
dat_drug_first <- dat_drug2 %>%
group_by(person_id, drug_concept_id) %>% #group appropriately
mutate(first_exposure = case_when( #create binary variable that tells us if indeed this is the first exposure to this drug per patient, per drug
drug_exposure_start_date==min(drug_exposure_start_date) ~ 1,
T ~ 0
),
first_exposure_id = drug_exposure_id[which.min(drug_exposure_start_date)] #index the drug_exposure_id on the minimum (earliest) drug start date, per patient, per drug
)
dat_drug_first2 <- dat_drug_first %>%
ungroup() %>% #ungroup the data
filter(first_exposure==1) #take only rows where it was a first exposure to a drug
#subset the new filtered drug data to only relevant columns
dat_drug_first3 <- dat_drug_first2 %>%
select(person_id, drug_concept_id, drug_exposure_start_date)
#then, join each data frame together, by person_id
dat_drug_and_cond <- inner_join(dat_drug_first3, dat_cond_cdm)
#to have the most accurate and conservative CCI estimates possible at the end of this analysis, we should only include data from patients whose drug id codes exactly match the corresponding condition id codes (in this simulated example, this is interpreted as only pulling data for those patients whose drug type actually is a drug used to treat the condition it is intended for). So, next we filter out all rows that did not have a match:
dat_drug_and_cond2 <- dat_drug_and_cond %>%
group_by(person_id, drug_concept_id) %>%
mutate(drug_on_or_after_condition = case_when(
condition_start_date <= drug_exposure_start_date ~ 1,
T ~ 0
)
)
#and similarly we filter out those observations that failed the above criteria
dat_drug_and_cond3 <- dat_drug_and_cond2 %>%
ungroup() %>%
filter(drug_on_or_after_condition==1)
#now, some patients may actually have > 1 row in the remaining data frame if they have multiple different diagnosis records of the same condition, recorded on different dates, for which each diagnosis date was prior to the drug exposure date. For the purpose of this assignment, we will treat a patient EVER having a condition (even if they have multiple onset records) as a single observation for that patient. The resulting data frame should have only 1 row per patient per unique condition, via the below coding logic:
dat_conds_by_person <- distinct(dat_drug_and_cond3 %>%
select(person_id, condition_concept_id)
)
#In the OMOP CDM, I am sure there exists some form of lookup table for the condition_concept_id integer values, where researchers are able to eventually know which condition_concept_id codes specifically relate to the corresponding ICD-9/10 codes, which can be used for the eventual CCI calculation. For the rest of the function, it's probable to assume that the Janssen team has done exactly this already, and we can take the condition_concept_id codes and get the corresponding ICD-9/10 codes to then directly get a binary value for each of the 17 comorbidities, indicating yes/no whether or not the condition_concept_id for that person taking that drug matches up with a Charlson-specific comorbidity. The rest of the code in this function will operate under this assumption. Let's call this hypothetical CDM lookup table "dat_cond_cdm_icd", and the icd code-containing column "icd_code"
dat_conds_icds_by_person <- inner_join(dat_conds_by_person, dat_cond_cdm_icd) %>%
select(person_id, icd_code)
#now, this data frame can be passed to the comorbidity function. Since we want a more robust function that can return cci's for both icd-9 and 10 classification systems, we will run it separately based on the coding system, then bind the results together.
#utilizing the documentation and examples of this package:
# Charlson comorbidity presence based on ICD-10 diagnostic codes:
dat_comorbid_icd10 <- comorbidity(x = dat_conds_icds_by_person, #specify the data frame from which to look at codes
id = "person_id", # the identification variable
code = "icd_code", # the column containing the condition codes
map = "charlson_icd10_quan", # mapping algorithm to use, we will use based on Quan paper
assign0 = F) # should an hierarchy of comorbidities be used (leave out for this example)
# Charlson comorbidity presence based on ICD-9 diagnostic codes:
dat_comorbid_icd9 <- comorbidity(x = dat_conds_icds_by_person, #specify the data frame from which to look at codes
id = "person_id", # the identification variable
code = "icd_code", # the column containing the condition codes
map = "charlson_icd9_quan", # mapping algorithm to use, we will use based on Quan paper
assign0 = F)
#add on the score for each patient to the comoridity table to get the weighted cci per person, and drop all individual comorbidity identifier columns because these are not asked for as a final product
dat_person_w_score_icd10 <- dat_comorbid_icd10 %>%
mutate(cci_score = score(x = dat_comorbid_icd10,
weights = "quan",
assign0 = F)
) %>%
select(person_id, cci_score)
dat_person_w_score_icd9 <- dat_comorbid_icd9 %>%
mutate(cci_score = score(x = dat_comorbid_icd9,
weights = "quan",
assign0 = F)
) %>%
select(person_id, cci_score)
#lastly, we can bind these two resulting tables together to get what we ultimately want the function to return to us:
dat_person_w_score_icd9_and_10 <- rbind(dat_person_w_score_icd10, dat_person_w_score_icd9)
#one important note: if a person has both icd 9 and 10 records on file, they will have a different cci score calculated for them based on each coding system. It is unclear if this is what would be desired in practice, but for the purpose of this assignment, I will assume this is the preferred option
return(dat_person_w_score_icd9_and_10)
}
The above function works as intended when tested back on the example data. In a test of the function, we can apply it to a new list of “drug codes”, which are currently just integers ranging from 1 to 20. Let’s calculate CCI values for anyone taking simulated RxNorm/NDC drug codes valued at drug_concept_id = 1, 5, 7, 12, and 16:
#re-assign data that would in practice come from the OMOP CDM to the simulated data I already have, since i do not have access to Janssen's data
drug_cdm <- dat_drug2
dat_cond_cdm_icd <- dat_condition_lookup
dat_cond_cdm_icd <- dat_cond_cdm_icd %>%
rename(icd_code = condition_descr)
dat_cond_cdm <- dat_cond2
#define the drug codes of interest (the input):
drugs_of_interest = c(1, 5, 7, 12, 16)
#execute the function to get the desired output:
drugs_to_person_and_cci(drugs_of_interest)
## person_id cci_score
## 1 137849 4
## 2 145211 1
## 3 154190 11
## 4 172100 6
## 5 263052 5
## 6 292854 7
## 7 319861 23
## 8 336900 12
## 9 337686 11
## 10 368820 11
## 11 387533 20
## 12 420245 22
## 13 439790 6
## 14 459729 12
## 15 494779 12
## 16 499843 7
## 17 521908 8
## 18 542546 6
## 19 548564 10
## 20 559703 5
## 21 572346 12
## 22 614986 11
## 23 633091 7
## 24 701780 16
## 25 715271 4
## 26 720913 21
## 27 746542 11
## 28 802165 1
## 29 813350 6
## 30 815826 7
## 31 853946 5
## 32 865077 11
## 33 880844 9
## 34 888559 11
## 35 960193 15
## 36 969577 16
## 37 137849 0
## 38 145211 0
## 39 154190 0
## 40 172100 0
## 41 263052 0
## 42 292854 0
## 43 319861 0
## 44 336900 0
## 45 337686 0
## 46 368820 0
## 47 387533 0
## 48 420245 0
## 49 439790 0
## 50 459729 0
## 51 494779 0
## 52 499843 0
## 53 521908 0
## 54 542546 0
## 55 548564 0
## 56 559703 0
## 57 572346 0
## 58 614986 0
## 59 633091 0
## 60 701780 0
## 61 715271 0
## 62 720913 0
## 63 746542 0
## 64 802165 0
## 65 813350 0
## 66 815826 0
## 67 853946 0
## 68 865077 0
## 69 880844 0
## 70 888559 0
## 71 960193 0
## 72 969577 0
And since only ICD-10 codes were used in creating the simulated data, that is why all persons have a score of 0 for the ICD-9-specific CCI calculations, which is expected.
Per the OMOP CDM (https://ohdsi.github.io/CommonDataModel/index.html), it appears as though the table names where actual ICD identifiers and codes can be found are within the “Vocabulary Tables” set of tables. Here, in the “CONCEPT” table, the column “concept_name” seems a likely candidate to contain the diagnosis codes themselves (e.g. I21% for Myocardial Infarction in ICD-10 coding system, etc.). In the “VOCABULARY” table, the column “vocabulary_id” seems a likely candidate to contain the coding systems from which the diagnosis codes are from (e.g. ICD-10 for International Classification of Diseases, 10th Edition). These tables would likely need to be used and referenced in order to pull the actual diagnosis codes for the 17 comorbidities of interest, unless the team at Janssen has already done this.
An even more robust/useful way to calculate the CCI score for each person exposed to a list of drugs of interest would be to consider diagnoses of the 17 comorbidities that make up the CCI across all coding systems present in the OMOP CDM. That is, across ICD-9, ICD-10, SNOMED, and all other diagnosis coding systems that Janssen has in their data. Ideally, this would mean that a singular condition_concept_id code would map to multiple vocabulary_id codes, each of which points to the same underlying condition (e.g. Myocardial Infarction). With this approach, the comorbidity() R package for quickly calculating CCI scores may not be of use to us anymore, since per the package’s documentation, it seems to only be able to handle ICD codes. This is fine though, because we can write a function to manually calculate the weighted CCI scores since the equation is simple. This approach is actually what my worked example with the simulated data was aiming at. The 17 distinct CCI-specific integers for condition_concept_id were meant to represent exactly this. Each of these singular, standardized codes would represent one of the 17 Charlson comorbidities, and any records of each of these 17 comorbidities in the raw data, regardless of which coding system the diagnosis uses/comes from, the OMOP CDM processes the raw code and turns it into one of the standardized ones.
The following illustration I made can provide a visual explanation of what I mean:
We can quickly write a user-defined function to manually calculate CCI scores (based on original weights), rather than use the comorbidity package. It will take a data frame and the columns of patients and their diagnosis codes as inputs, and output a new data frame containing CCI scores for each of the patients. The same 17 distinct CCI-specific integers (numbers 1 through 17) will be the placeholder values for each CCI-specific condition_concept_id value. Below, we can write the function:
get_CCI <- function(df, patient_col, diagnosis_col) {
patientCol <- enquo(patient_col)
diagnosisCol <- enquo(diagnosis_col)
df2 <- df %>%
group_by(!!patientCol) %>%
mutate(ami_score = case_when( #acute myocardial infarction
1 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
chf_score = case_when( #congestive heart failure
2 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
pvd_score = case_when( #peripheral vascular disease
3 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
cevd_score = case_when( #cerebrovascular disease
4 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
dementia_score = case_when( #dementia
5 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
copd_score = case_when( #chronic obstructive pulmonary disease
6 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
rheumd_score = case_when( #rheumatoid disease
7 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
pud_score = case_when( #peptic ulcer disease
8 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
mld_score = case_when( #mild liver disease
9 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
diab_score = case_when( #diabetes without complications
10 %in% !!diagnosisCol ~ 1,
T ~ 0,
),
diabwc_score = case_when( #diabetes with complications
11 %in% !!diagnosisCol ~ 2,
T ~ 0,
),
hp_score = case_when( #hemiplegia or paraplegia
12 %in% !!diagnosisCol ~ 2,
T ~ 0,
),
rend_score = case_when( #renal disease
13 %in% !!diagnosisCol ~ 2,
T ~ 0,
),
canc_score = case_when( #cancer (any malignancy)
14 %in% !!diagnosisCol ~ 2,
T ~ 0,
),
msld_score = case_when( #moderate or severe liver disease
15 %in% !!diagnosisCol ~ 3,
T ~ 0,
),
metacanc_score = case_when( #metastatic solid tumour
16 %in% !!diagnosisCol ~ 6,
T ~ 0,
),
aids_score = case_when( #AIDS/HIV
17 %in% !!diagnosisCol ~ 6,
T ~ 0,
),
#now calcualte the score by summing over the weighted presence values, divided by the number of patient comorbidities
tot_cci_score = sum(ami_score, chf_score, pvd_score, cevd_score, dementia_score,
copd_score, rheumd_score, pud_score, mld_score, diabwc_score,
diabwc_score, hp_score, rend_score, canc_score, msld_score, metacanc_score,
aids_score, na.rm = T) , #remove NAs to make function impervious to missing data, which may be the case
n=n(), #count number of patient comorbidities
cci_score = tot_cci_score/n #calculate final cci score
) %>% ungroup()
df3 <- distinct(df2 %>% #select distinct persons and scores
select(!!patientCol, cci_score)
)
#return the desired result
return(df3)
}
#test the function against simulated data from 5 persons
set.seed(978)
testdf <- data.frame(id = sample(x=c("a", "b", "c", "d", "e"), size=12, replace=T),
diag = sample(x=seq(1,17,1), size=12, replace=T)) %>%
arrange(desc(id))
#view the simulated data to see what the expected scores SHOULD be if the function works properly:
print.data.frame(testdf)
## id diag
## 1 e 3
## 2 e 3
## 3 e 9
## 4 d 5
## 5 d 8
## 6 c 12
## 7 b 1
## 8 b 17
## 9 b 9
## 10 b 3
## 11 a 2
## 12 a 15
#person a should have a score of 4 (chf (1) + msld (3) = 4)
#person b should have a score of 9 (ami (1) + pvd (1) + mld(1) + aids(6) = 9)
#person c should have a score of 2 (hp(2) = 2)
#person d should have a score of 2 (dementia (1) + pud (1) = 2)
#person e should have a score of 2 (pvd (1) + mld (1) = 2)
get_CCI(testdf, id, diag)
## # A tibble: 5 x 2
## id cci_score
## <chr> <dbl>
## 1 e 2
## 2 d 2
## 3 c 2
## 4 b 9
## 5 a 4
The function works as intended - the values match!
It’s plausible that it is of particular interest to investigate patients’ CCI scores when stratifying by demographics and/or other covariates. To explore this a bit, I will use the simulated data and CCI results from earlier in this assignment.
#left join the person data onto the final result table from the worked example. We will use a left join, because not all person_ids in the person data had cci scores estimated for them. This is because 3 persons failed one of the criteria, which is either:
#(1) drugs matching conditions of interest as those that have the same valued integer for both `drug_concept_id` and `condition_concept_id` or
#(2) the diagnosis/diagnoses occur on or after the date of earliest drug exposure
dat_cci_dems <- left_join(dat_pers2, dat_person_w_score)
#can calculate age, as well as give simulated actual categories for all other demographics
set.seed(978)
dat_cci_dems2 <- dat_cci_dems %>%
mutate(age = lubridate::year(Sys.Date()) - year_of_birth,
gender = case_when(
gender_concept_id==1 ~ "Female",
T ~ "Male"
),
race = case_when(
race_concept_id==1 ~ "Asian",
race_concept_id==2 ~ "Black",
race_concept_id==3 ~ "Native Hawaiian or Other Pacific Islander",
race_concept_id==4 ~ "White",
race_concept_id==5 ~ "American Indian or Alaska Native",
race_concept_id==6 ~ "Other"
),
ethnicity = case_when(
ethnicity_concept_id==1 ~ "Hispanic/Latino",
ethnicity_concept_id==2 ~ "Non-Hispanic/Latino",
ethnicity_concept_id==3 ~ "Other"
)
) %>%
select(age, gender, race, ethnicity, cci_score) #removing id cols, and now also person_id to de-identify results
#let's present the descriptive statistics in a nice table:
dat_cci_dems_tab <- dat_cci_dems2 %>%
tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
label = list(age ~ "Age (years)",
gender ~ "Gender",
race ~ "Race",
ethnicity ~ "Ethnicity",
cci_score ~ "Charlson Comorbidity Index (CCI) Score"
),
missing_text = "(Missing)") %>%
modify_header(label ~ "**Variable**") %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables."
) %>%
bold_labels()
#print the table
dat_cci_dems_tab
| Variable | N = 401 |
|---|---|
| Age (years) | 68 (24); (17-104); 70; (0%) |
| Gender | |
| Female | 15 / 40; (38%); (0%) |
| Male | 25 / 40; (62%); (0%) |
| Race | |
| American Indian or Alaska Native | 7 / 40; (18%); (0%) |
| Asian | 9 / 40; (22%); (0%) |
| Black | 8 / 40; (20%); (0%) |
| Native Hawaiian or Other Pacific Islander | 5 / 40; (12%); (0%) |
| Other | 2 / 40; (5.0%); (0%) |
| White | 9 / 40; (22%); (0%) |
| Ethnicity | |
| Hispanic/Latino | 9 / 40; (22%); (0%) |
| Non-Hispanic/Latino | 14 / 40; (35%); (0%) |
| Other | 17 / 40; (42%); (0%) |
| Charlson Comorbidity Index (CCI) Score | 4.2 (3.6); (0.0-15.0); 4.0; (7.5%) |
| (Missing) | 3 |
|
1
Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables.
|
|
This gives a nice breakdown of the demographic characteristics of the 40 patients in the sample cohort.
Perhaps prior studies have shown that there exists differences in CCI scores between males and females. Then, we may be interested in the same breakdown but stratified by gender. Let’s have a look:
#let's present the descriptive statistics in a nice table, now stratified by gender:
dat_cci_dems_tab_by_gender <- dat_cci_dems2 %>%
tbl_summary(by = gender, #the "by" argument let's us stratify the table easily
statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
label = list(age ~ "Age (years)",
race ~ "Race",
ethnicity ~ "Ethnicity",
cci_score ~ "Charlson Comorbidity Index (CCI) Score"
),
missing_text = "(Missing)") %>%
add_p(test = list(all_continuous() ~ "t.test",
all_categorical() ~ "chisq.test"),
pvalue_fun = ~style_pvalue(.x, digits = 2)) %>%
add_overall() %>%
modify_header(label ~ "**Variable**") %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables."
) %>%
bold_labels()
#print the table
dat_cci_dems_tab_by_gender
| Variable | Overall, N = 401 | Female, N = 151 | Male, N = 251 | p-value2 |
|---|---|---|---|---|
| Age (years) | 68 (24); (17-104); 70; (0%) | 79 (16); (59-104); 76; (0%) | 61 (26); (17-104); 58; (0%) | 0.010 |
| Race | 0.77 | |||
| American Indian or Alaska Native | 7 / 40; (18%); (0%) | 2 / 15; (13%); (0%) | 5 / 25; (20%); (0%) | |
| Asian | 9 / 40; (22%); (0%) | 2 / 15; (13%); (0%) | 7 / 25; (28%); (0%) | |
| Black | 8 / 40; (20%); (0%) | 3 / 15; (20%); (0%) | 5 / 25; (20%); (0%) | |
| Native Hawaiian or Other Pacific Islander | 5 / 40; (12%); (0%) | 2 / 15; (13%); (0%) | 3 / 25; (12%); (0%) | |
| Other | 2 / 40; (5.0%); (0%) | 1 / 15; (6.7%); (0%) | 1 / 25; (4.0%); (0%) | |
| White | 9 / 40; (22%); (0%) | 5 / 15; (33%); (0%) | 4 / 25; (16%); (0%) | |
| Ethnicity | 0.91 | |||
| Hispanic/Latino | 9 / 40; (22%); (0%) | 3 / 15; (20%); (0%) | 6 / 25; (24%); (0%) | |
| Non-Hispanic/Latino | 14 / 40; (35%); (0%) | 5 / 15; (33%); (0%) | 9 / 25; (36%); (0%) | |
| Other | 17 / 40; (42%); (0%) | 7 / 15; (47%); (0%) | 10 / 25; (40%); (0%) | |
| Charlson Comorbidity Index (CCI) Score | 4.2 (3.6); (0.0-15.0); 4.0; (7.5%) | 1.9 (1.8); (0.0-6.0); 1.0; (13%) | 5.4 (3.7); (0.0-15.0); 5.0; (4.0%) | <0.001 |
| (Missing) | 3 | 2 | 1 | |
|
1
Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables.
2
Welch Two Sample t-test; Pearson's Chi-squared test
|
||||
We notice that while there doesn’t appear to be a significant in CCI score between males and females, we do notice that there is a significant difference in mean age between males and females (p = 0.010), indicating that females in our cohort were, on average, significantly older (mean age 79) than males (mean age 61). We can also notice that the range of CCI values in our patient cohort is 0-15. Lastly, it may be of interest to categorize the score ranges in an ordinal fashion, and stratify the results by those groups:
#create the new variable
dat_cci_dems3 <- dat_cci_dems2 %>%
mutate(cci_cat = factor(case_when(
cci_score <= 3 ~ "0-3",
cci_score > 3 & cci_score <= 7 ~ "4-7",
cci_score > 7 & cci_score <= 11 ~ "8-11",
cci_score > 11 ~ "12-15",
), levels = c("0-3", "4-7", "8-11", "12-15"))
)
#create the new stratified table on cci score categorization group
dat_cci_dems_tab_by_cci_cat <- dat_cci_dems3 %>%
tbl_summary(by = cci_cat, #the "by" argument let's us stratify the table easily
statistic = list(all_continuous() ~ "{mean} ({sd}); ({min}-{max}); {median}; ({p_miss}%)",
all_categorical() ~ "{n} / {N}; ({p}%); ({p_miss}%)"),
label = list(age ~ "Age (years)",
gender ~ "Gender",
race ~ "Race",
ethnicity ~ "Ethnicity",
cci_score ~ "Charlson Comorbidity Index (CCI) Score"
),
missing_text = "(Missing)") %>%
add_p(test = list(all_continuous() ~ "aov",
all_categorical() ~ "chisq.test"),
pvalue_fun = ~style_pvalue(.x, digits = 2)) %>%
add_overall() %>%
modify_header(label ~ "**Variable**") %>%
modify_footnote(
all_stat_cols() ~ "Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables."
) %>%
bold_labels()
#print the table
dat_cci_dems_tab_by_cci_cat
| Variable | Overall, N = 371 | 0-3, N = 171 | 4-7, N = 141 | 8-11, N = 41 | 12-15, N = 21 | p-value2 |
|---|---|---|---|---|---|---|
| Age (years) | 68 (25); (17-104); 69; (0%) | 71 (24); (17-104); 73; (0%) | 60 (25); (18-98); 56; (0%) | 78 (31); (38-104); 84; (0%) | 84 (0); (84-84); 84; (0%) | 0.37 |
| Gender | 0.035 | |||||
| Female | 13 / 37; (35%); (0%) | 10 / 17; (59%); (0%) | 3 / 14; (21%); (0%) | 0 / 4; (0%); (0%) | 0 / 2; (0%); (0%) | |
| Male | 24 / 37; (65%); (0%) | 7 / 17; (41%); (0%) | 11 / 14; (79%); (0%) | 4 / 4; (100%); (0%) | 2 / 2; (100%); (0%) | |
| Race | 0.87 | |||||
| American Indian or Alaska Native | 6 / 37; (16%); (0%) | 2 / 17; (12%); (0%) | 3 / 14; (21%); (0%) | 1 / 4; (25%); (0%) | 0 / 2; (0%); (0%) | |
| Asian | 9 / 37; (24%); (0%) | 4 / 17; (24%); (0%) | 4 / 14; (29%); (0%) | 1 / 4; (25%); (0%) | 0 / 2; (0%); (0%) | |
| Black | 7 / 37; (19%); (0%) | 4 / 17; (24%); (0%) | 2 / 14; (14%); (0%) | 0 / 4; (0%); (0%) | 1 / 2; (50%); (0%) | |
| Native Hawaiian or Other Pacific Islander | 5 / 37; (14%); (0%) | 3 / 17; (18%); (0%) | 1 / 14; (7.1%); (0%) | 1 / 4; (25%); (0%) | 0 / 2; (0%); (0%) | |
| Other | 2 / 37; (5.4%); (0%) | 2 / 17; (12%); (0%) | 0 / 14; (0%); (0%) | 0 / 4; (0%); (0%) | 0 / 2; (0%); (0%) | |
| White | 8 / 37; (22%); (0%) | 2 / 17; (12%); (0%) | 4 / 14; (29%); (0%) | 1 / 4; (25%); (0%) | 1 / 2; (50%); (0%) | |
| Ethnicity | 0.87 | |||||
| Hispanic/Latino | 9 / 37; (24%); (0%) | 4 / 17; (24%); (0%) | 3 / 14; (21%); (0%) | 1 / 4; (25%); (0%) | 1 / 2; (50%); (0%) | |
| Non-Hispanic/Latino | 13 / 37; (35%); (0%) | 5 / 17; (29%); (0%) | 6 / 14; (43%); (0%) | 2 / 4; (50%); (0%) | 0 / 2; (0%); (0%) | |
| Other | 15 / 37; (41%); (0%) | 8 / 17; (47%); (0%) | 5 / 14; (36%); (0%) | 1 / 4; (25%); (0%) | 1 / 2; (50%); (0%) | |
| Charlson Comorbidity Index (CCI) Score | 4.2 (3.6); (0.0-15.0); 4.0; (0%) | 1.2 (1.0); (0.0-3.0); 1.0; (0%) | 5.1 (1.1); (4.0-7.0); 5.0; (0%) | 8.5 (1.0); (8.0-10.0); 8.0; (0%) | 14.0 (1.4); (13.0-15.0); 14.0; (0%) | <0.001 |
|
1
Mean (SD); (Min-Max); Median for numerical variables; (% missing),
or Frequency (%); (% missing) for categorical variables.
2
One-way ANOVA; Pearson's Chi-squared test
|
||||||
When stratifying by the categorized version of CCI score, we do not notice any significant differences between score category and any of the other demographics.