Clementine Djouka - Flatiron Technical Exercise

Data Analysis Questions

Question 1: Cancer Distribution

To describe the cancer distribution, we use the Patient Diagnosis dataset. The count function from plyr gives a breakdown by diagnosis: of the 57 diagnosis records, 39 are breast cancer and 18 are colon cancer. A bar plot, created here with ggplot2, is a good way to visualize this.

library(readxl)
Patient_Diagnosis <- read_excel("Downloads/Patient_Diagnosis.xlsx")
library(plyr)
count(Patient_Diagnosis, "diagnosis")

##       diagnosis freq
## 1 Breast Cancer   39
## 2  Colon Cancer   18

library(ggplot2)
# Refer to the column by name inside aes() rather than through Patient_Diagnosis$
Q1 <- ggplot(Patient_Diagnosis, aes(x = factor(diagnosis))) + geom_bar(stat = "count", fill = "purple") + theme_classic() + xlab("Cancer Type") + ylab("Number of Cases") + ggtitle("Cancer Distribution")
Q1

Note that some patients appear more than once, either with two types of cancer or with the same cancer under two different diagnosis codes. I chose to keep these observations rather than treat them as duplicates: a patient can plausibly have two cancers, or two forms of the same cancer, and each one is a separate case that belongs in a cancer distribution analysis.
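
The difference between counting cases and counting patients can be checked with a few lines of base R. This is a minimal sketch on made-up records, not the exercise data; the column diagnosis_code and its values are assumptions used only for illustration.

```r
# Hypothetical diagnosis records; diagnosis_code and its values are invented.
dx <- data.frame(
  patient_id = c(1, 1, 2, 3, 3),
  diagnosis_code = c("code_1", "code_9", "code_1", "code_2", "code_1"),
  diagnosis = c("Breast Cancer", "Colon Cancer", "Breast Cancer",
                "Breast Cancer", "Breast Cancer"),
  stringsAsFactors = FALSE
)

# Number of diagnosis rows versus number of distinct patients
n_records <- nrow(dx)
n_patients <- length(unique(dx$patient_id))

# Patients that appear on more than one diagnosis row
rows_per_patient <- table(dx$patient_id)
multi_dx <- names(rows_per_patient)[rows_per_patient > 1]

# Cases (rows) per cancer type versus distinct patients per cancer type
cases_by_type <- table(dx$diagnosis)
patients_by_type <- tapply(dx$patient_id, dx$diagnosis,
                           function(ids) length(unique(ids)))
```

Keeping every row, as done above, corresponds to cases_by_type; patients_by_type would be the alternative if each patient were to be counted once per cancer type.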

Question 2: Time Elapsed Between Diagnosis and Beginning of Treatment

To find the time elapsed between diagnosis and first treatment, we first need to work on the Patient Treatment dataset. This dataset records every treatment administered to each patient, with the associated dates. For this question, we only need the first treatment date for each patient. From that date, we can then subtract the diagnosis date found in the Patient Diagnosis file.

So let’s start by creating our adjusted Patient Treatment file with only the first date of treatment per patient.

library(readxl)
Patient_Treatment <- read_excel("Downloads/Patient_Treatment.xlsx")
library(data.table)
# Sort by treatment date, then keep the first row within each patient_id group
PT2 <- setDT(Patient_Treatment)[order(treatment_date), head(.SD, 1L), by = patient_id]
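
The same "earliest row per patient" logic can be sketched in base R, which makes each step easy to inspect. The records below are hypothetical and only reuse the column names of the Patient Treatment file.

```r
# Hypothetical treatment records (not the exercise data)
tx <- data.frame(
  patient_id = c(1, 1, 2, 2, 3),
  treatment_date = as.Date(c("2020-03-10", "2020-03-01", "2020-04-05",
                             "2020-04-20", "2020-05-15")),
  drug_code = c("B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Sort by date, then keep the first row encountered for each patient
tx_sorted <- tx[order(tx$treatment_date), ]
first_tx <- tx_sorted[!duplicated(tx_sorted$patient_id), ]
```

One design point to keep in mind: if a patient received two drugs on the same first date, this approach (like head(.SD, 1L)) keeps only one of them, chosen by row order.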

Next we combine both datasets in order to compute the time interval. We notice that this file contains 46 unique patients. Some patients also have several diagnosis dates because they have more than one type of cancer. I chose to keep these rows in the final dataset, even though they may look like duplicate data.

names(PT2) <- c("patient_id", "first_treatment_date", "drug_code")
Question2_data <- merge(Patient_Diagnosis, PT2, by = "patient_id")
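
By default, merge keeps only the patient_id values present in both tables, so diagnosed patients with no treatment record drop out silently. The sketch below, on hypothetical tables, shows how all.x = TRUE could be used to see which patients those are.

```r
# Hypothetical tables sharing the patient_id key (not the exercise data)
dx <- data.frame(
  patient_id = c(1, 2, 3, 4),
  diagnosis = c("Breast Cancer", "Colon Cancer", "Breast Cancer", "Colon Cancer"),
  stringsAsFactors = FALSE
)
first_tx <- data.frame(
  patient_id = c(1, 2, 4),
  first_treatment_date = as.Date(c("2020-01-10", "2020-02-03", "2020-02-20")),
  drug_code = c("A", "B", "D"),
  stringsAsFactors = FALSE
)

# Default merge: inner join, patient 3 disappears
inner <- merge(dx, first_tx, by = "patient_id")

# Left join: every diagnosis row is kept, missing treatments become NA
left <- merge(dx, first_tx, by = "patient_id", all.x = TRUE)
untreated <- left$patient_id[is.na(left$first_treatment_date)]
```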

Now we need to find the elapsed time:

Question2_data$elapsed_time <- difftime(Question2_data$first_treatment_date, Question2_data$diagnosis_date, units = "days")

It would be good to plot it and see what insights we can get from the elapsed time. I chose a boxplot to get a nice overview.

# Convert the difftime to a plain number of days
Question2_data$elapsed_time <- as.numeric(Question2_data$elapsed_time, units = "days")
boxplot(Question2_data$elapsed_time, main="Time Difference Between Diagnosis and Treatment in Days")

The first boxplot shows one clear outlier, so I removed it and built a second boxplot to get a clearer picture of the distribution of the elapsed times.

# Drop row 10, the outlier identified in the first boxplot
Q2_2 <- Question2_data[-10, ]
boxplot(Q2_2$elapsed_time, main ="Time Difference Between Diagnosis and Treatment in Days")

summary(Question2_data$elapsed_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.00    3.50    5.00   10.11    6.00  304.00

The second boxplot suggests that the time elapsed between diagnosis and first treatment mostly falls between 0 and 7-8 days, with a typical value of about 5 days. The summary statistics provide additional information. Some values are below zero, which would mean that treatment started before the diagnosis; these observations need to be double-checked for accuracy. The median time between diagnosis and first treatment is 5 days. It is more representative than the mean of about 10 days, which is pulled upward by the extreme outlier of 304 days.
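
Negative intervals and extreme values can also be flagged programmatically instead of by row position. The sketch below uses hypothetical elapsed times (chosen to include a negative value and one very large value) and the common 1.5 x IQR rule, which is an assumption here and not the rule used in the analysis above.

```r
# Hypothetical elapsed times in days (not the exercise data)
elapsed <- c(-6, 2, 3, 4, 5, 5, 5, 6, 6, 7, 304)

# Treatment recorded before diagnosis: candidates for a data-quality check
negative_idx <- which(elapsed < 0)

# 1.5 x IQR rule, the same convention boxplot() uses for its whiskers
q <- unname(quantile(elapsed, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
outlier_idx <- which(elapsed < q[1] - fence | elapsed > q[2] + fence)

# The median barely moves when the flagged values are dropped; the mean does
trimmed <- elapsed[-outlier_idx]
center <- c(median_all = median(elapsed), mean_all = mean(elapsed),
            median_trimmed = median(trimmed), mean_trimmed = mean(trimmed))
```

Filtering on a rule like this keeps working if the row order changes, which a hard-coded row index does not.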

Question 3: First Line of Treatment for Breast Cancer vs. Colon Cancer

To see what insights the data can give us, I first looked at which drugs are administered depending on the cancer type. Then, to go a step further, I plotted the timeline of drug administrations broken down by cancer type, which helps us see which types of drugs tend to be administered first. Both graphs were created using ggplot.

library(ggplot2)
# Stacked bars: one bar per cancer type, filled by the first drug administered
Q3 <- ggplot(Question2_data, aes(x = factor(diagnosis), fill = drug_code)) + geom_bar() + theme_classic() + labs(fill = "Drug") + xlab("Cancer Type") + ylab("Number of Treatments") + ggtitle("Distribution of each Treatment per Cancer Type")
Q3

# First treatment date versus drug, colored by cancer type
Q3_2 <- ggplot(Question2_data, aes(x = first_treatment_date, y = drug_code, color = diagnosis)) + geom_point() + labs(color = "Cancer_Type") + xlab("Date") + ylab("Drug")
Q3_2

The bar graph tells us that chemotherapy drugs are used more often overall, and that the immunotherapy drug D is only used for colon cancer cases. Based on the timeline of drug administrations broken down by cancer type, chemotherapy drugs appear to be the first-line treatment for breast cancer. For colon cancer, chemotherapy and immunotherapy drugs appear to be used together as first-line treatment: the earliest A, B, C and D treatments for colon cancer all occur around the same period.
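
A contingency table gives the numbers behind a stacked bar chart of this kind. The sketch below uses hypothetical first-line records with the same column names; the counts are invented and say nothing about the exercise data.

```r
# Hypothetical first-line treatments, one row per diagnosis (not the exercise data)
first_line <- data.frame(
  diagnosis = c("Breast Cancer", "Breast Cancer", "Breast Cancer",
                "Colon Cancer", "Colon Cancer", "Colon Cancer"),
  drug_code = c("A", "A", "B", "A", "D", "D"),
  stringsAsFactors = FALSE
)

# Counts of each first drug per cancer type
counts <- table(first_line$diagnosis, first_line$drug_code)

# Row-wise shares: within each cancer type, the fraction starting on each drug
shares <- prop.table(counts, margin = 1)
```

Row-wise shares make the two cancer types comparable even when one has many more cases than the other.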

Question 4: Duration of Therapy for Breast Cancer Patients Regimen A vs. Regimen B

We first need to build the right dataset for the analysis. This question only concerns breast cancer treatment data, so colon cancer cases must be removed from the Patient Treatment dataset. Because the Patient Treatment dataset has no information on diagnosis, I first merge both datasets and then remove the colon cancer cases. After selecting only the breast cancer cases, I retrieve the first and last treatment dates for each patient to compute the treatment duration. I then merge the result with the adjusted Patient Treatment dataset (PT2) to get the drug administered as first treatment, since we are looking at first-line therapy. With this final dataset, I create boxplots broken down by regimen to see whether there is a difference.

Q4_dataprep <- merge(Patient_Diagnosis, Patient_Treatment, by = "patient_id")
Q4_data <- Q4_dataprep[which(Q4_dataprep$diagnosis== "Breast Cancer"), ]
last_treatment_date <- aggregate(Q4_data$treatment_date, list(Q4_data$patient_id), max)
first_treatment_date <- aggregate(Q4_data$treatment_date, list(Q4_data$patient_id), min)
names(last_treatment_date) <- c("patient_id", "last_treatment_date")
names(first_treatment_date) <- c("patient_id", "first_treatment_date")
duration_data <- merge(first_treatment_date, last_treatment_date, by = "patient_id")
duration_data$duration <- difftime(duration_data$last_treatment_date, duration_data$first_treatment_date, units = "days")
finalQ4 <- merge(duration_data, PT2, by = "patient_id")
# Convert the difftime to a plain number of days
finalQ4$duration <- as.numeric(finalQ4$duration, units = "days")
Viz4 <- boxplot(duration ~ drug_code, data = finalQ4, main="Treatment Duration Broken Down by Regimen", xlab = "Regimen", ylab="Treatment Duration (days)")

Viz4

## $stats
##       [,1] [,2] [,3]
## [1,]  41.0   74   73
## [2,]  56.5   74   73
## [3,]  75.5   76   78
## [4,] 161.5   80   84
## [5,] 179.0   80   87
## 
## $n
## [1] 20  9  6
## 
## $conf
##           [,1]  [,2]     [,3]
## [1,]  38.40363 72.84 70.90464
## [2,] 112.59637 79.16 85.09536
## 
## $out
## [1] 2584   38 1001    0  194   47
## 
## $group
## [1] 1 2 2 2 2 3
## 
## $names
## [1] "A" "B" "C"

# Keep only durations under 200 days
Viz4_2data <- subset(finalQ4, duration < 200)
Viz4_2 <- boxplot(duration ~ drug_code, data = Viz4_2data, main="Treatment Duration Broken Down by Regimen", xlab = "Regimen", ylab="Treatment Duration (days)")

Viz4_2

## $stats
##       [,1] [,2] [,3]
## [1,]  41.0 38.0   73
## [2,]  56.5 56.0   73
## [3,]  74.0 76.0   78
## [4,] 150.5 78.5   84
## [5,] 179.0 80.0   87
## 
## $n
## [1] 19  8  6
## 
## $conf
##           [,1]     [,2]     [,3]
## [1,]  39.92718 63.43118 70.90464
## [2,] 108.07282 88.56882 85.09536
## 
## $out
## [1]   0 194  47
## 
## $group
## [1] 2 2 3
## 
## $names
## [1] "A" "B" "C"

The first set of boxplots contains extreme outliers that make it difficult to read, so I recreated the boxplots after removing durations of 200 days or more. The second set of boxplots shows that the range of treatment durations is much wider for Regimen A than for Regimen B as a first-line therapy. The median duration is close in both cases (74 days for Regimen A and 76 days for Regimen B). The next step is to see whether the difference between the two regimens is statistically significant.

# Durations for patients whose first drug was Regimen A
duration.A <- finalQ4$duration[finalQ4$drug_code == "A"]
duration.A

##  [1] 2584   41   45   51   52   53   60  157   70  144  166  167  173  179
## [15]   73   69   74   77   77   93

# Durations for patients whose first drug was Regimen B
duration.B <- finalQ4$duration[finalQ4$drug_code == "B"]
duration.B

## [1]   38 1001    0   74   76   76   77   80  194

t.test(duration.A, duration.B)

## 
##  Welch Two Sample t-test
## 
## data:  duration.A and duration.B
## t = 0.25031, df = 25.424, p-value = 0.8044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -293.8613  375.2502
## sample estimates:
## mean of x mean of y 
##  220.2500  179.5556

For simplicity, we assume here that treatment durations are normally distributed. With a p-value of 0.8044, well above the usual 0.05 threshold, and a 95 percent confidence interval (-293.9 to 375.3 days) that includes zero, the t-test does not let us reject the null hypothesis that the mean durations are equal. In other words, we find no statistically significant difference in treatment duration between patients using Regimen A and Regimen B as first-line therapy. Note that the test was run on all observations, including the extreme durations of 2584 and 1001 days, which inflate both the means and the variances. It is also important to point out that in some cases both drugs are administered simultaneously or at very short intervals, which could change the conclusion. The analysis would be more reliable if it were restricted to patients who were administered only one of the two drugs.
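
To make the reading of the test output explicit, the sketch below reruns the Welch test on the durations printed above and compares the p-value with a significance level of 0.05 (a conventional choice, assumed here). It also runs a Wilcoxon rank-sum test as one possible check that does not rely on the normality assumption; that test is an addition for illustration and was not part of the original analysis.

```r
# Durations in days, copied from the duration.A and duration.B output above
duration_A <- c(2584, 41, 45, 51, 52, 53, 60, 157, 70, 144, 166, 167, 173, 179,
                73, 69, 74, 77, 77, 93)
duration_B <- c(38, 1001, 0, 74, 76, 76, 77, 80, 194)

# Welch two-sample t-test, as in the analysis above
welch <- t.test(duration_A, duration_B)

# Decision rule: reject "equal means" only if the p-value is below alpha
alpha <- 0.05  # assumed significance level
significant <- welch$p.value < alpha

# Rank-based alternative, less sensitive to values such as 2584 and 1001;
# exact = FALSE because the data contain ties
rank_test <- wilcox.test(duration_A, duration_B, exact = FALSE)
```

Here significant evaluates to FALSE, since the Welch p-value of about 0.80 is far above 0.05.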
