Understand how antineoplastic drugs are given to patients with cancer
Overview
In this exercise, I will use synthetic data from a cancer clinic to understand how anticancer drugs are administered to patients. To do this, I will use the following three tables:
Dataset containing each patient’s diagnostic information (Patient_Diagnosis.csv)
Dataset containing demographic information for each patient (Patient_Demographics.csv)
Dataset containing information about the drugs received by each patient (Patient_Treatment.csv)
Questions
Number and proportion of patients diagnosed with each type of cancer.
How long after their earliest diagnosis do patients start treatment? Provide the mean, median, minimum and maximum for each of the cancer groups.
Which treatment regimens would be indicated as first-line treatment for patients with:
breast cancer only?
colon cancer only?
both breast and colon cancer?
Generate a table showing the age, sex, geographic region and stage at diagnosis of the clinic’s patients, stratified by the patient’s first diagnosis (breast, colon).
Question 1: Calculate the number and proportion of patients diagnosed with each type of cancer
Assumptions
I will assume that patients with multiple diagnoses of the same cancer type are counted once and classified based on the distinct cancer types present in their diagnosis history.
I will assume that a patient diagnosed with more than one cancer type (both breast and colon cancer) is counted once, in a single category labelled Both.
I will assume that the resulting cancer groups (Breast Cancer, Colon Cancer, Both) are mutually exclusive, so each patient is classified into exactly one group based on their diagnosis history.
I will assume that patients diagnosed with the same cancer type at different stages are treated as having a single cancer type diagnosis. Stage information is not considered in the cancer group classification; only the cancer type (breast vs colon) is used.
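The classification rule described by these assumptions can be sketched on a small example. This is a minimal illustration, not the analysis itself: the `toy_diagnosis` table, its patient IDs and its rows are made up, and only the column names `patient_id` and `diagnosis` are taken from the diagnosis dataset.

```r
library(dplyr)

# Hypothetical toy data; patient IDs and rows are made up for illustration.
toy_diagnosis <- data.frame(
  patient_id = c(1, 1, 2, 3, 3),
  diagnosis  = c("Breast Cancer", "Breast Cancer", "Colon Cancer",
                 "Breast Cancer", "Colon Cancer")
)

# One row per patient: a single distinct cancer type keeps its name,
# more than one distinct type is labelled "Both".
toy_groups <- toy_diagnosis |>
  group_by(patient_id) |>
  summarise(
    class_diagnosis = if (n_distinct(diagnosis) == 1) first(diagnosis) else "Both",
    .groups = "drop"
  )

toy_groups
```

Patient 1 has two breast cancer records and is still counted once as Breast Cancer, while patient 3 falls into Both.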
Before answering this question, I will go through these intermediate steps: installing and loading the packages, setting up the working directory, and then exploring the data to determine if there are any missing values, the number of columns and rows, and the variable types for each dataset.
# Install pacman only if it's not already installed
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install (if needed) and load the relevant packages
pacman::p_load(
  tidyverse, # core tidy tools (includes dplyr, tidyr, readr, etc.)
  gtsummary, # for creating summary tables (called as gtsummary:: in Question 4)
  here,      # for constructing file paths
  rmarkdown, # for rendering R Markdown documents
  lubridate  # for working with dates and times
)
# Set up the directory containing the data files
# directory <- here::here(directory)
data_dir <- "~/Desktop/Flatiron_assessment"

# Load the data from the specified directory
patients_diagnosis    <- read_csv(file.path(data_dir, "Patient_Diagnosis.csv"))
patients_demographics <- read_csv(file.path(data_dir, "Patient_Demographics.csv"))
patients_treatment    <- read_csv(file.path(data_dir, "Patient_Treatment.csv"))

# Explore the structure of each dataset and the types of variables it contains
# (uncomment to run)
# str(patients_diagnosis)
# str(patients_demographics)
# str(patients_treatment)
# glimpse(patients_diagnosis)
# glimpse(patients_demographics)
# glimpse(patients_treatment)
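The exploration step mentioned earlier (missing values, number of rows and columns, variable types) can be done with a few base R calls. The sketch below uses a made-up `toy` data frame rather than the clinic files; the same calls would apply to any of the three loaded tables.

```r
# Hypothetical toy table; values are made up for illustration.
toy <- data.frame(
  patient_id = c(1, 2, 3),
  stage_dx   = c("I", NA, "III")
)

n_missing <- colSums(is.na(toy)) # number of missing values per column
dims      <- dim(toy)            # number of rows, then number of columns
types     <- sapply(toy, class)  # variable type of each column

n_missing
dims
types
```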
Calculate the number and proportion of patients diagnosed with each type of cancer
# Classify each patient by the distinct cancer types in their diagnosis history,
# then keep only the earliest diagnosis record for each patient.
# Note: arrange() sorts chronologically only if diagnosis_date is stored as a Date.
patients_diagnosis_distinct <- patients_diagnosis |>
  group_by(patient_id) |>
  mutate(class_diagnosis = case_when(
    n_distinct(diagnosis) == 1 ~ first(diagnosis),
    n_distinct(diagnosis) > 1 ~ "Both"
  )) |>
  arrange(diagnosis_date) |>
  slice(1) |>
  ungroup()

# Calculate the number and proportion of patients in each cancer group
result <- patients_diagnosis_distinct |>
  group_by(class_diagnosis) |>
  summarise(
    n_patients = n(),
    proportion = round(n() / nrow(patients_diagnosis_distinct), 2)
  )

print(result)
# A tibble: 3 × 3
class_diagnosis n_patients proportion
<chr> <int> <dbl>
1 Both 5 0.11
2 Breast Cancer 31 0.66
3 Colon Cancer 11 0.23
Table 1. Number and proportion of patients diagnosed with each type of cancer
Type of cancer    Number of patients (N)    Proportion of patients (%)
Breast cancer     31                        66
Colon cancer      11                        23
Both              5                         11
Question 2: How long after their earliest diagnosis do patients start treatment? Provide the mean, median, minimum and maximum for each of the cancer groups
Assumptions
I will assume that for patients with both cancer types, the reference diagnosis date used to calculate the time to treatment is the earliest diagnosis date across all cancer types, regardless of which cancer was diagnosed first.
I will assume that if a patient has multiple treatments, the earliest treatment date is used for the calculation of time to treatment.
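Both assumptions depend on picking the earliest date correctly, which requires the date columns to be real dates rather than text. The sketch below assumes dates stored as month/day/year strings (the format string used in this analysis); the two values are made up. If `read_csv()` has already parsed the columns as dates, the ordering issue shown here does not arise.

```r
# Hypothetical date strings in month/day/year format.
raw_dates <- c("12/01/11", "02/15/12")

# Sorting text compares characters left to right, so the 2012 date comes first.
earliest_as_text <- sort(raw_dates)[1]

# Parsing first gives chronological order, so the 2011 date comes first.
parsed_dates     <- as.Date(raw_dates, format = "%m/%d/%y")
earliest_as_date <- min(parsed_dates)

# Days between the two parsed dates, computed the same way as time to treatment.
days_between <- as.numeric(difftime(parsed_dates[2], parsed_dates[1], units = "days"))

earliest_as_text
earliest_as_date
days_between
```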
# Format the date variable
patients_early_diagnosis <- patients_diagnosis_distinct |>
  dplyr::mutate(diagnosis_date = as.Date(diagnosis_date, format = "%m/%d/%y"))

# Format the date variable and keep only the earliest treatment date for each patient
patients_early_treatment <- patients_treatment |>
  mutate(treatment_date = as.Date(treatment_date, format = "%m/%d/%y")) |>
  group_by(patient_id) |>
  arrange(treatment_date) |>
  slice(1) |>
  ungroup()

# Merge the datasets to have the diagnosis date and treatment date in the same dataset
all_data <- patients_early_diagnosis |>
  left_join(patients_early_treatment, by = "patient_id")

# Calculate the time from diagnosis to treatment for each patient
all_data <- all_data |>
  mutate(time_to_treatment = as.numeric(difftime(treatment_date, diagnosis_date, units = "days")))

# Calculate the mean, median, minimum and maximum time to treatment for each cancer group
# (patients without a treatment date are dropped by na.rm = TRUE)
time_to_treatment_summary <- all_data |>
  group_by(class_diagnosis) |>
  summarise(
    mean_time   = floor(mean(time_to_treatment, na.rm = TRUE)),
    median_time = floor(median(time_to_treatment, na.rm = TRUE)),
    min_time    = floor(min(time_to_treatment, na.rm = TRUE)),
    max_time    = floor(max(time_to_treatment, na.rm = TRUE))
  )

time_to_treatment_summary
# A tibble: 3 × 5
class_diagnosis mean_time median_time min_time max_time
<chr> <dbl> <dbl> <dbl> <dbl>
1 Both 8 7 7 11
2 Breast Cancer 5 5 -3 20
3 Colon Cancer 30 4 0 304
Table 2. Mean, Median, Minimum and Maximum time to treatment for each cancer group
Type of cancer    Mean (days)    Median (days)    Minimum (days)    Maximum (days)
Breast cancer     5              5                -3                20
Colon cancer      30             4                0                 304
Both              8              7                7                 11
Keep in mind
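The patient with ID 4256 was diagnosed with stage III breast cancer but has no treatment records in the dataset. No time to treatment can be calculated for this patient, so they are excluded from the summary above.
The minimum of -3 days in the breast cancer group indicates that at least one patient has a first treatment date recorded before their earliest diagnosis date.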
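Patients who have a diagnosis but no treatment record can be listed with an anti-join. This is a minimal sketch on made-up tables (`toy_dx`, `toy_tx` and their IDs are hypothetical); it only illustrates the technique and is not the check that was run on the clinic data.

```r
library(dplyr)

# Hypothetical toy tables; IDs and drug codes are made up.
toy_dx <- data.frame(patient_id = c(101, 102, 103))
toy_tx <- data.frame(patient_id = c(101, 103), drug_code = c("A", "B"))

# anti_join() keeps the rows of toy_dx that have no match in toy_tx.
untreated <- anti_join(toy_dx, toy_tx, by = "patient_id")

untreated$patient_id
```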
Question 3: Which treatment regimens would be indicated as first-line treatment for patients with:
breast cancer only?
colon cancer only?
both breast and colon cancer?
Assumptions
I will assume that the first-line treatment regimen for each patient is determined based on the earliest treatment date recorded in the dataset, regardless of the specific treatment type or regimen.
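One consequence of this assumption is worth checking: if a patient receives several drugs on their first treatment date, keeping a single row per patient retains only one of them. The sketch below shows an alternative that keeps every drug given on the earliest date, using `slice_min()` with `with_ties = TRUE`. The `toy_tx` table is made up; only the column names come from the treatment dataset.

```r
library(dplyr)

# Hypothetical toy data: patient 1 receives drugs A and B on the same first date.
toy_tx <- data.frame(
  patient_id     = c(1, 1, 1, 2),
  treatment_date = as.Date(c("2012-01-10", "2012-01-10", "2012-02-01", "2012-03-05")),
  drug_code      = c("A", "B", "C", "D")
)

# Keep all rows that share the earliest treatment date within each patient,
# then collapse the drug codes into one regimen string per patient.
first_line <- toy_tx |>
  group_by(patient_id) |>
  slice_min(treatment_date, n = 1, with_ties = TRUE) |>
  summarise(
    regimen = paste(sort(unique(drug_code)), collapse = " + "),
    .groups = "drop"
  )

first_line
```

Whether a combination such as "A + B" should count as one regimen is a clinical judgement that the data alone cannot settle.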
# Identify the first-line treatment regimens for each cancer group
first_line_treatments <- all_data |>
  filter(!is.na(treatment_date)) |>
  group_by(class_diagnosis) |>
  summarise(first_line_regimens = list(unique(drug_code)))

first_line_treatments
# A tibble: 3 × 2
class_diagnosis first_line_regimens
<chr> <list>
1 Both <chr [2]>
2 Breast Cancer <chr [3]>
3 Colon Cancer <chr [3]>
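The printed tibble shows only the length of each list element, not the drug codes themselves. A minimal sketch of one way to make a list column readable, assuming a made-up `toy` table with the same column names:

```r
library(dplyr)

# Hypothetical toy data; group labels and drug codes are made up.
toy <- data.frame(
  class_diagnosis = c("Both", "Both", "Colon Cancer"),
  drug_code       = c("C", "A", "D")
)

toy_summary <- toy |>
  group_by(class_diagnosis) |>
  summarise(first_line_regimens = list(unique(drug_code)), .groups = "drop")

# Collapse each list element into a single comma-separated string.
readable <- toy_summary |>
  mutate(first_line_regimens = vapply(first_line_regimens, paste,
                                      character(1), collapse = ", "))

readable
```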
Table 3. First-line treatment regimens for each cancer group
Type of cancer    First-line regimens
Breast Cancer     (A, B, C)
Colon Cancer      (A, D, B)
Both              (C, A)
Question 4: Generate a table showing the age, sex, geographic region and stage at diagnosis of the clinic’s patients, stratified by the patient’s first diagnosis.
# Keep only the patient's first diagnosis
patients_first_diagnosis <- patients_diagnosis |>
  group_by(patient_id) |>
  arrange(diagnosis_date) |>
  slice(1) |>
  ungroup()

# Merge the demographic data with the diagnosis data to have all relevant
# information in one dataset, and calculate age at diagnosis
demographics_diagnosis <- patients_demographics |>
  left_join(patients_first_diagnosis, by = "patient_id") |>
  mutate(
    diagnosis_date = ymd(diagnosis_date),
    birth_date     = ymd(birth_date),
    age            = floor(as.numeric(difftime(diagnosis_date, birth_date, units = "days")) / 365.25)
  )

# Generate the table using the gtsummary package
# (stage_dx is labelled "Stage", since the question asks for stage at diagnosis)
tble <- demographics_diagnosis |>
  select(Age = age, Sex = birth_sex, Region = region, Stage = stage_dx, diagnosis) |>
  gtsummary::tbl_summary(
    by = diagnosis,
    statistic = gtsummary::all_categorical() ~ "{n} ({p}%)",
    percent = "column",
    missing = "no"
  ) |>
  gtsummary::add_overall(last = TRUE) |>
  gtsummary::modify_header(label ~ "**Types of cancer**") |>
  gtsummary::bold_labels()

tble
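The age calculation above divides elapsed days by 365.25, which is an approximation. The sketch below compares it with a calendar-based calculation using lubridate's `interval()` and `time_length()`. The two dates are made up and chosen so that the diagnosis falls exactly on a birthday, where the two approaches can disagree by one year.

```r
library(lubridate)

# Hypothetical dates: diagnosis exactly on the patient's 50th birthday.
birth_date     <- ymd("1961-01-01")
diagnosis_date <- ymd("2011-01-01")

# Approximation: elapsed days divided by the average year length.
age_approx <- floor(as.numeric(difftime(diagnosis_date, birth_date, units = "days")) / 365.25)

# Calendar-based alternative: completed years in the interval between the dates.
age_calendar <- floor(time_length(interval(birth_date, diagnosis_date), "years"))

age_approx
age_calendar
```

For most patients the two values agree; the difference only matters when the diagnosis date is on or very close to a birthday.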