Understand how antineoplastic drugs are being given to patients with cancer

Overview

In this exercise, I will use synthetic data from a cancer clinic to understand how anticancer drugs are administered to patients. To do this, I will use the following three tables:

Dataset containing each patient’s diagnostic information (Patient_Diagnosis.csv)
Dataset containing demographic information for each patient (Patient_Demographics.csv)
Dataset containing information about the drugs received by the patient (Patient_Treatment.csv)

Questions

Number and proportion of patients diagnosed with each type of cancer.
How long after their earliest diagnosis do patients start treatment? Provide the mean, median, minimum and maximum for each of the cancer groups.
Which treatment regimens would be indicated as first-line treatment for patients with:
- breast cancer only?
- colon cancer only?
- both breast and colon cancer?
Generate a table showing the age, sex, geographic region and stage at diagnosis of the clinic’s patients, stratified by the patient’s first diagnosis (breast, colon).

Question 1: Calculate the number and proportion of patients diagnosed with each type of cancer

Assumptions

I will assume that patients with multiple diagnoses of the same cancer type are counted once and classified based on the distinct cancer types present in their diagnosis history.
I will assume that if a patient has diagnoses of more than one cancer type, they will be counted in a single category, Both.
I will assume that the cancer types are mutually exclusive and that each patient can only be classified into one category based on their diagnosis history.
I will assume that patients diagnosed with the same cancer type at different stages are treated as having a single cancer type diagnosis. Stage information is not considered in the cancer group classification, only the cancer type (breast vs colon) is used.
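
The classification rule implied by these assumptions can be illustrated on a small example. This is a minimal sketch on made-up toy data, not the clinic data; `toy_dx` and `classify_types()` are hypothetical names introduced only for illustration, and base R is used so the block runs without any packages.

```r
# Toy diagnosis history (made-up values): patient 1 has two breast cancer
# diagnoses, patient 2 has one colon cancer diagnosis, patient 3 has both types.
toy_dx <- data.frame(
  patient_id = c(1, 1, 2, 3, 3),
  diagnosis  = c("Breast Cancer", "Breast Cancer", "Colon Cancer",
                 "Breast Cancer", "Colon Cancer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one distinct cancer type keeps its name,
# more than one distinct type is labelled "Both".
classify_types <- function(types) {
  distinct_types <- unique(types)
  if (length(distinct_types) == 1) distinct_types else "Both"
}

# One group label per patient, whatever the number of diagnosis rows
groups <- vapply(split(toy_dx$diagnosis, toy_dx$patient_id),
                 classify_types, character(1))

# Number and proportion of patients in each group
counts      <- table(groups)
proportions <- round(prop.table(counts), 2)

print(groups)
print(counts)
print(proportions)
```

Each patient appears exactly once in `groups`, so the three categories are mutually exclusive by construction.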

Before answering this question, I will go through these intermediate steps: installing and loading the packages, setting up the working directory, and then exploring the data to determine if there are any missing values, the number of columns and rows, and the variable types for each dataset.

Install and load libraries

# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install (if needed) and load the relevant packages
pacman::p_load(
  tidyverse,  # core tidy tools (includes dplyr, tidyr, readr, etc.)
  gtsummary,  # for creating summary tables (used in Question 4)
  here,       # for constructing file paths
  rmarkdown,  # for rendering R Markdown documents
  lubridate   # for working with dates and times
)

Set up the directory, load and explore the data

# Set the directory containing the data files
data_dir <- "~/Desktop/Flatiron_assessment"

# Load the data from the specified directory
patients_diagnosis <- read_csv(file.path(data_dir, "Patient_Diagnosis.csv"))

patients_demographics <- read_csv(file.path(data_dir, "Patient_Demographics.csv"))

patients_treatment <- read_csv(file.path(data_dir, "Patient_Treatment.csv"))

# Explore the data with str() or glimpse() to understand the structure of each dataset and the types of variables they contain (uncomment to run)

#str(patients_diagnosis)
#str(patients_demographics)
#str(patients_treatment)
#glimpse(patients_diagnosis)
#glimpse(patients_demographics)
#glimpse(patients_treatment)
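
The exploration step (rows, columns, variable types and missing values) can also be summarised in one small table per dataset. The following is a minimal sketch on made-up toy data; `describe_columns()` and `toy` are hypothetical names that do not appear in the analysis, and only base R is used.

```r
# Hypothetical helper: one row per column with its type and number of
# missing values.
describe_columns <- function(df) {
  data.frame(
    column    = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for one of the CSV files (values are made up)
toy <- data.frame(
  patient_id = c(1, 2, 3),
  diagnosis  = c("Breast Cancer", NA, "Colon Cancer"),
  stringsAsFactors = FALSE
)

dim(toy)                            # number of rows and columns
overview <- describe_columns(toy)   # type and missing count per column
print(overview)
```

The same helper could be applied to `patients_diagnosis`, `patients_demographics` and `patients_treatment` once they are loaded.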

Calculate the number and proportion of patients diagnosed with each type of cancer

# Classify each patient by the distinct cancer types in their diagnosis history, then keep only the row with the earliest diagnosis date for each patient
patients_diagnosis_distinct <- patients_diagnosis |>
  group_by(patient_id) |>
  mutate(class_diagnosis = case_when(
    n_distinct(diagnosis) == 1 ~ first(diagnosis),
    n_distinct(diagnosis) > 1 ~ "Both"
  )) |>
  arrange(diagnosis_date) |>
  slice(1)|>
  ungroup()

# Calculate the number and proportion of patients in each cancer group
result <- patients_diagnosis_distinct |>
  group_by(class_diagnosis) |>
  summarise(
    n_patients = n(),
    proportion = round(n() / nrow(patients_diagnosis_distinct),2)
  )
 
print(result)

# A tibble: 3 × 3
  class_diagnosis n_patients proportion
  <chr>                <int>      <dbl>
1 Both                     5       0.11
2 Breast Cancer           31       0.66
3 Colon Cancer            11       0.23

Table 1. Number and proportion of patients diagnosed with each type of cancer

Type of cancer	Number of patients (N)	Proportion of patients (%)
Breast cancer	31	66
Colon cancer	11	23
Both	5	11

Question 2: How long after their earliest diagnosis do patients start treatment? Provide the mean, median, minimum and maximum for each of the cancer groups

Assumptions

I will assume that for patients with both cancer types, the reference diagnosis date used to calculate the time to treatment is the earliest diagnosis date across all cancer types, regardless of which cancer was diagnosed first.
I will assume that if a patient has multiple treatments, the earliest treatment date is used for the calculation of time to treatment.

# Format the date variable
patients_early_diagnosis <- patients_diagnosis_distinct |>
  dplyr::mutate(diagnosis_date = as.Date(diagnosis_date, 
                                         format = "%m/%d/%y"))

# Format the date variable and keep only the earliest treatment date for each patient
patients_early_treatment <- patients_treatment |>
  mutate(treatment_date = as.Date(treatment_date, 
                                  format = "%m/%d/%y"))|>
  group_by(patient_id) |>
  arrange(treatment_date) |>
  slice(1) |>
  ungroup()

# Merge the datasets to have the diagnosis date and treatment date in the same dataset
all_data <- patients_early_diagnosis|>
  left_join(patients_early_treatment, by = "patient_id")
  
# Calculate the time from diagnosis to treatment for each patient
all_data <- all_data |>
  mutate(time_to_treatment = as.numeric(difftime(treatment_date, diagnosis_date, units = "days")))

# Calculate the mean, median, minimum and maximum time to treatment for each cancer group
time_to_treatment_summary <- all_data |>
  group_by(class_diagnosis) |>
  summarise(
    mean_time = floor(mean(time_to_treatment, na.rm = TRUE)),
    median_time = floor(median(time_to_treatment, na.rm = TRUE)),
    min_time = floor(min(time_to_treatment, na.rm = TRUE)),
    max_time = floor(max(time_to_treatment, na.rm = TRUE))
  )

time_to_treatment_summary

# A tibble: 3 × 5
  class_diagnosis mean_time median_time min_time max_time
  <chr>               <dbl>       <dbl>    <dbl>    <dbl>
1 Both                    8           7        7       11
2 Breast Cancer           5           5       -3       20
3 Colon Cancer           30           4        0      304

Table 2. Mean, median, minimum and maximum time to treatment for each cancer group

Type of cancer	Mean (days)	Median (days)	Minimum (days)	Maximum (days)
Breast cancer	5	5	-3	20
Colon cancer	30	4	0	304
Both	8	7	7	11

Keep in mind

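The patient with ID 4256 was diagnosed with stage III breast cancer but has no treatment record in the dataset, so their time to treatment is missing and is excluded from the summary (na.rm = TRUE). The minimum of -3 days in the breast cancer group means that at least one patient has a first recorded treatment date earlier than their earliest diagnosis date.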
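
Two kinds of records deserve a check before interpreting Table 2: patients with no treatment row, and patients whose interval is negative (such as the -3 day minimum). A minimal sketch of both checks follows, on made-up toy data and in base R; `toy_dx`, `toy_tx`, `untreated_ids` and `negative_ids` are hypothetical names, and the dates are invented.

```r
# Toy diagnosis and treatment tables (made-up values): patient 102 has no
# treatment row, patient 103 is treated 3 days before the diagnosis date.
toy_dx <- data.frame(
  patient_id     = c(101, 102, 103),
  diagnosis_date = as.Date(c("2012-01-10", "2012-02-01", "2012-03-15"))
)
toy_tx <- data.frame(
  patient_id     = c(101, 103),
  treatment_date = as.Date(c("2012-01-15", "2012-03-12"))
)

# Patients present in the diagnosis table but absent from the treatment table
untreated_ids <- setdiff(toy_dx$patient_id, toy_tx$patient_id)

# Left join (all.x = TRUE keeps untreated patients with NA treatment dates)
merged <- merge(toy_dx, toy_tx, by = "patient_id", all.x = TRUE)
merged$time_to_treatment <- as.numeric(
  difftime(merged$treatment_date, merged$diagnosis_date, units = "days")
)

# Patients whose first treatment precedes their earliest diagnosis
negative_ids <- merged$patient_id[
  !is.na(merged$time_to_treatment) & merged$time_to_treatment < 0
]

print(untreated_ids)
print(negative_ids)
```

Listing these identifiers makes it possible to decide explicitly whether such patients should stay in the summary or be reviewed as possible data-entry issues.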

Question 3: Which treatment regimens would be indicated as first-line treatment for patients with:

- breast cancer only?
- colon cancer only?
- both breast and colon cancer?

Assumptions

I will assume that the first-line treatment regimen for each patient is determined based on the earliest treatment date recorded in the dataset, regardless of the specific treatment type or regimen.
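
If a patient can have more than one drug recorded on that earliest treatment date (an assumption about the data that is not verified here), keeping a single row per patient would capture only one of them. The following minimal sketch, on made-up toy data and in base R, keeps every drug given on each patient's earliest date; `toy_tx`, `first_day` and `regimen` are hypothetical names.

```r
# Toy treatment table (made-up values): patient 1 receives drugs A and B on
# the first date and A again later; patient 2 receives drug C twice.
toy_tx <- data.frame(
  patient_id     = c(1, 1, 1, 2, 2),
  treatment_date = as.Date(c("2012-01-05", "2012-01-05", "2012-02-01",
                             "2013-03-10", "2013-03-17")),
  drug_code      = c("A", "B", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# Earliest treatment date per patient, repeated on every row of that patient
date_num   <- as.numeric(toy_tx$treatment_date)
first_date <- ave(date_num, toy_tx$patient_id, FUN = min)

# Keep all rows recorded on the earliest date (not just the first row)
first_day <- toy_tx[date_num == first_date, ]

# Collapse the distinct drugs of that date into one label per patient
regimen <- vapply(
  split(first_day$drug_code, first_day$patient_id),
  function(drugs) paste(sort(unique(drugs)), collapse = " + "),
  character(1)
)

print(regimen)
```

Whether drugs given on the same day should be read as one combination regimen is a clinical judgement that the data alone cannot settle.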

# Identify the first-line treatment regimens for each cancer group 

first_line_treatments <- all_data |>
  filter(!is.na(treatment_date)) |>
  group_by(class_diagnosis) |>
  summarise(first_line_regimens = list(unique(drug_code)))

first_line_treatments

# A tibble: 3 × 2
  class_diagnosis first_line_regimens
  <chr>           <list>             
1 Both            <chr [2]>          
2 Breast Cancer   <chr [3]>          
3 Colon Cancer    <chr [3]>

Table 3. First-line treatment regimens for each cancer group

Types of cancer	First-line regimens
Breast Cancer	(A, B, C)
Colon Cancer	(A, D, B)
Both	(C, A)

Question 4: Generate a table showing the age, sex, geographic region and stage at diagnosis of the clinic’s patients stratified by the patient’s first diagnosis.

# Keep only the patient's first diagnosis

patients_first_diagnosis <- patients_diagnosis |>
  group_by(patient_id) |>
  arrange(diagnosis_date) |>
  slice(1) |>
  ungroup()


# Merge the demographic data with the diagnosis data to have all relevant information in one dataset and calculate age at diagnosis

demographics_diagnosis <- patients_demographics |>
  left_join(patients_first_diagnosis, by = "patient_id") |>
  mutate(diagnosis_date = ymd(diagnosis_date), 
         birth_date = ymd(birth_date), 
         age = floor(as.numeric(difftime(diagnosis_date, birth_date, units = "days")) / 365.25))


# Generate the table using the gtsummary package

tble <- demographics_diagnosis |>
  select(Age = age, Sex = birth_sex, Region = region, Stage = stage_dx, diagnosis) |>
  gtsummary::tbl_summary(
    by = diagnosis,
    statistic = gtsummary::all_categorical() ~ "{n} ({p}%)",
    percent = "column",
    missing = "no"
  ) |>
  gtsummary::add_overall(last = TRUE) |>
  gtsummary::modify_header(label ~ "**Types of cancer**") |>
  gtsummary::bold_labels()

tble

Types of cancer	Breast Cancer N = 32¹	Colon Cancer N = 15¹	Overall N = 47¹
Age	47 (33, 62)	35 (33, 73)	45 (33, 63)
Sex
F	27 (84%)	10 (67%)	37 (79%)
M	5 (16%)	5 (33%)	10 (21%)
Region
Mid-west	12 (38%)	1 (6.7%)	13 (28%)
Northeast	8 (25%)	5 (33%)	13 (28%)
South	7 (22%)	6 (40%)	13 (28%)
West	5 (16%)	3 (20%)	8 (17%)
Stage
I	7 (22%)	2 (13%)	9 (19%)
II	12 (38%)	4 (27%)	16 (34%)
III	9 (28%)	2 (13%)	11 (23%)
IV	4 (13%)	7 (47%)	11 (23%)
¹ Median (Q1, Q3); n (%)