1. When presented with a new dataset or database, what steps do you generally take to evaluate it prior to working with it?
It is very important to verify data integrity when presented with new data, so that any analysis run on it is sound. I typically do the following:
- Check for missing or incoherent data (such as a typo or a date that doesn’t make sense). A manual check is sufficient for small datasets, but for larger ones I would use a function from a statistical programming tool, such as is.na in R.
- Check for outliers. A rough check can be done with a scatterplot, or any other visualization of the data, to see whether anything looks out of the ordinary.
- Check for duplicates and make sure they are removed from the dataset. A unique observation identifier (such as patient_id) is very helpful here.
While we run these checks, we also need to think about how to address what they find. Duplicates should be removed, but for missing data, outliers and incoherent data we have options. We can replace the value with a dataset average, an industry average, or a specific value (such as 0), depending on what makes sense given the context. We also need to decide whether to keep the observation in the analysis at all.
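As a minimal sketch of these three checks, the snippet below runs them in base R on a small made-up data frame. The data and the days_to_treat column are hypothetical and only illustrate the idea; they are not taken from the assignment files.

```r
# Hypothetical toy data: values are made up for illustration only.
toy <- data.frame(
  patient_id    = c(1, 2, 2, 3, 4, 5, 6, 7),
  diagnosis     = c("Breast Cancer", "Colon Cancer", "Colon Cancer",
                    "Breast Cancer", NA, "Breast Cancer",
                    "Colon Cancer", "Breast Cancer"),
  days_to_treat = c(5, 4, 4, 6, 3, 300, 5, 4),
  stringsAsFactors = FALSE
)

# 1. Missing values: number of NAs in each column.
missing_per_column <- colSums(is.na(toy))

# 2. Outliers: values beyond the boxplot whiskers (1.5 x IQR rule).
outlier_values <- boxplot.stats(toy$days_to_treat)$out

# 3. Duplicates: rows that exactly repeat an earlier row.
duplicate_rows <- duplicated(toy)
toy_clean <- toy[!duplicate_rows, ]
```

Whether a flagged value is then dropped, replaced, or kept remains a judgment call that depends on the context of the analysis.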
2. Based on the information provided above and the attached dataset, what three questions would you like to understand prior to conducting any analysis of the data?
I would like to understand the following:
- What is the goal of the analysis? "Understanding how the drugs are given" is a bit vague; for what purpose?
- I noticed in the Patient Diagnosis dataset several diagnosis codes for breast cancer and several for colon cancer (174.3, 174.1, etc.). What do they mean?
- Is it typical to have such a wide date range for cancer treatment (2010-2017)? And is there a reason why there is no data for any of the patients for the years 2015 and 2016 in the Patient Treatment dataset? If so, what is it?
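A gap such as the missing years can be spotted by tabulating records per calendar year. The sketch below uses made-up dates, not the actual Patient Treatment file, and only illustrates one way to do it.

```r
# Hypothetical treatment dates, made up for illustration only.
treatment_date <- as.Date(c("2012-03-01", "2012-03-15", "2013-07-09",
                            "2014-01-20", "2017-05-02"))

# Tabulate over every year in the observed range so that
# years with no records show up with a count of 0.
treatment_year <- as.integer(format(treatment_date, "%Y"))
all_years <- seq(min(treatment_year), max(treatment_year))
records_per_year <- table(factor(treatment_year, levels = all_years))

# Years inside the range that have no records at all.
years_without_data <- names(records_per_year)[records_per_year == 0]
```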
Question 1: Cancer Distribution
In order to provide Cancer Distribution information, we use the Patient Diagnosis dataset. The count function gives us a breakdown of the diagnosis. We see that, out of our 57 cases, we have 18 cases of colon cancer and 39 cases of breast cancer. A great way to visualize this is a bar plot, which I created using ggplot.
library(readxl)
Patient_Diagnosis <- read_excel("Downloads/Patient_Diagnosis.xlsx")
library(plyr)
count(Patient_Diagnosis, "diagnosis")
## diagnosis freq
## 1 Breast Cancer 39
## 2 Colon Cancer 18
library(ggplot2)
# Bar plot of the number of cases per cancer type
Q1 <- ggplot(Patient_Diagnosis, aes(x = factor(diagnosis))) + geom_bar(stat = "count", fill = "purple") + theme_classic() + xlab("Cancer Type") + ylab("Number of Cases") + ggtitle("Cancer Distribution")
Q1
I want to point out that there are cases where the same patient has two types of cancer, or the same cancer with two different diagnosis codes. I chose to leave these observations as is and not treat them as duplicates to be removed. It is plausible for someone to have two cancers, or different types of the same cancer, and each would constitute a separate case to be counted in a cancer distribution analysis.
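One way to tell such repeated patients apart from true duplicates is sketched below on made-up data. The diagnosis_code column name and all values are assumptions for illustration, not taken from the Patient Diagnosis file.

```r
# Hypothetical toy data; diagnosis_code is an assumed column name.
toy_dx <- data.frame(
  patient_id     = c(101, 101, 102, 103, 103, 104),
  diagnosis      = c("Breast Cancer", "Colon Cancer", "Breast Cancer",
                     "Breast Cancer", "Breast Cancer", "Colon Cancer"),
  diagnosis_code = c("174.1", "153.2", "174.3", "174.1", "174.3", "153.2"),
  stringsAsFactors = FALSE
)

# Patients that appear on more than one diagnosis row.
rows_per_patient <- table(toy_dx$patient_id)
multi_dx_ids <- names(rows_per_patient)[rows_per_patient > 1]

# Exact duplicate rows would point to a data-entry problem;
# here every repeated patient differs in cancer type or code.
exact_duplicates <- sum(duplicated(toy_dx))
```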
Question 2: Time Elapsed Between Diagnosis and Beginning of Treatment
To find the time elapsed between diagnosis and first treatment, we first need to work on our Patient Treatment dataset. This dataset contains a record of all the treatments administered to patients and the associated dates. For this particular request, we only need the first date of treatment for each patient. From there, we can subtract the diagnosis date, which we have in the Patient Diagnosis file, from that first treatment date.
So let’s start by creating our adjusted Patient Treatment file with only the first date of treatment per patient.
library(readxl)
Patient_Treatment <- read_excel("Downloads/Patient_Treatment.xlsx")
library(data.table)
# Sort by treatment date, then keep the earliest record for each patient
PT2 <- setDT(Patient_Treatment)[order(treatment_date), head(.SD, 1L), by = patient_id]
Then we need to combine both datasets in order to compute the time interval. We notice that we have 46 unique patients in this file. We also have some patients with several diagnosis dates, due to the different types of cancer they may have. I chose to leave them in the final dataset as is, even though they may look like duplicate data.
names(PT2) <- c("patient_id", "first_treatment_date", "drug_code")
Question2_data <- merge(Patient_Diagnosis, PT2, by = "patient_id")
Now we need to find the elapsed time:
Question2_data$elapsed_time <- difftime(Question2_data$first_treatment_date, Question2_data$diagnosis_date, units = "days")
It would be good to plot it and see what insights we can get from the elapsed time. I chose a boxplot to get a nice overview.
# Convert the difftime to a plain number of days for plotting
Question2_data$elapsed_time <- as.numeric(Question2_data$elapsed_time, units = "days")
boxplot(Question2_data$elapsed_time, main = "Time Difference Between Diagnosis and Treatment in Days")
Looking at the first boxplot, we notice a clear outlier, so I decided to remove it and build a second boxplot to get a better picture of the distribution of the elapsed times.
# Drop the single most extreme observation instead of relying on a hard-coded row number
Q2_2 <- Question2_data[-which.max(Question2_data$elapsed_time), ]
boxplot(Q2_2$elapsed_time, main = "Time Difference Between Diagnosis and Treatment in Days")
summary(Question2_data$elapsed_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.00 3.50 5.00 10.11 6.00 304.00
The second boxplot suggests that the time elapsed between diagnosis and treatment mostly falls between 0 and 7-8 days, with a typical value of about 5 days. The summary statistics provide additional information. We see some values below zero, which would mean that treatment started before diagnosis, so these observations need to be double-checked for accuracy. The median time between diagnosis and first treatment is 5 days. This is more representative than the mean of about 10 days, which is pulled up by the extreme outlier of 304 days present in our data.
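The two points above, negative intervals and the sensitivity of the mean to a single extreme value, can be illustrated with a small sketch. The dates below are made up; only the column names follow the ones used in this analysis.

```r
# Hypothetical toy data, made up for illustration only.
toy <- data.frame(
  patient_id           = 1:5,
  diagnosis_date       = as.Date(c("2011-01-10", "2011-02-01", "2011-03-05",
                                   "2011-04-20", "2011-05-15")),
  first_treatment_date = as.Date(c("2011-01-15", "2011-01-26", "2011-03-09",
                                   "2012-02-18", "2011-05-21"))
)

# Elapsed time in days, as a plain number.
toy$elapsed_time <- as.numeric(
  difftime(toy$first_treatment_date, toy$diagnosis_date, units = "days")
)

# Rows where treatment is recorded before diagnosis need a manual review.
needs_review <- toy[toy$elapsed_time < 0, ]

# The median barely reacts to the one extreme value; the mean does.
median_days <- median(toy$elapsed_time)
mean_days   <- mean(toy$elapsed_time)
```

In this toy example the median is 5 days while the mean is above 60 days, which is the same pattern seen in the real summary statistics, only more pronounced.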
Question 3: First Line of Treatment for Breast Cancer vs. Colon Cancer
In order to see what insights we can get from our data, I chose to first look at which drugs are administered depending on the cancer type. Then, to get a deeper sense of things, I built a time series of drug administrations broken down by cancer type, which helps us see which types of drugs tend to be administered first. Both graphs were created using ggplot.
library(ggplot2)
# Stacked bars: number of first treatments per cancer type, filled by drug
Q3 <- ggplot(Question2_data, aes(x = factor(diagnosis), fill = drug_code)) + geom_bar() + theme_classic() + labs(fill = "Drug") + xlab("Cancer Type") + ylab("Number of Treatments") + ggtitle("Distribution of each Treatment per Cancer Type")
Q3
# First treatment date against drug, colored by cancer type
Q3_2 <- ggplot(Question2_data, aes(x = first_treatment_date, y = drug_code, color = diagnosis)) + geom_point() + labs(color = "Cancer_Type") + xlab("Date") + ylab("Drug")
Q3_2
The bar graph tells us that chemotherapy drugs are used more often in general, and we also see that the immunotherapy drug D is only used for colon cancer cases. Based on the time series of drug administrations broken down by cancer type, it looks like chemotherapy drugs are used as the first line of treatment for breast cancer. For colon cancer, it appears that both chemotherapy and immunotherapy drugs tend to be used more simultaneously as the first line of treatment (we can see that the earliest A, B, C and D treatments for colon cancer happen around the same period).
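The counts behind a stacked bar chart like this can also be read as a cross-tabulation. The sketch below uses made-up first-line records, so the numbers do not reflect the actual data; only the diagnosis and drug_code column names come from this analysis.

```r
# Hypothetical first-line treatment records, made up for illustration only.
toy_first_line <- data.frame(
  diagnosis = c("Breast Cancer", "Breast Cancer", "Breast Cancer",
                "Colon Cancer", "Colon Cancer", "Colon Cancer"),
  drug_code = c("A", "A", "B", "A", "D", "C"),
  stringsAsFactors = FALSE
)

# Counts of each drug per cancer type (rows: cancer type, columns: drug).
drug_by_cancer <- table(toy_first_line$diagnosis, toy_first_line$drug_code)

# Share of each drug within a cancer type; every row sums to 1.
row_shares <- prop.table(drug_by_cancer, margin = 1)
```

A zero in the table, such as drug D for breast cancer in this toy example, is the tabular counterpart of a color missing from one of the bars.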
Question 4: Duration of Therapy for Breast Cancer Patients Regimen A vs. Regimen B
We first need to build the right dataset for the analysis. This question only concerns breast cancer treatment data, so we need to remove any colon cancer cases from the Patient Treatment dataset. The Patient Treatment dataset has no information on diagnosis, so I first merged both datasets before removing the colon cancer cases. After selecting only breast cancer cases, I retrieved the first and last treatment dates for each patient in order to compute the treatment duration. From there, I merged this dataset with the adjusted Patient Treatment dataset (PT2) in order to get the drug administered as first treatment (as we are looking at first-line therapy). With this final dataset, I created boxplots broken down by regimen in order to see if there is a difference.
# Attach the diagnosis to every treatment record, then keep breast cancer only
Q4_dataprep <- merge(Patient_Diagnosis, Patient_Treatment, by = "patient_id")
Q4_data <- Q4_dataprep[which(Q4_dataprep$diagnosis == "Breast Cancer"), ]
# Last and first treatment date per patient
last_treatment_date <- aggregate(Q4_data$treatment_date, list(Q4_data$patient_id), max)
first_treatment_date <- aggregate(Q4_data$treatment_date, list(Q4_data$patient_id), min)
names(last_treatment_date) <- c("patient_id", "last_treatment_date")
names(first_treatment_date) <- c("patient_id", "first_treatment_date")
# Treatment duration in days, joined with the drug given as first treatment
duration_data <- merge(first_treatment_date, last_treatment_date, by = "patient_id")
duration_data$duration <- difftime(duration_data$last_treatment_date, duration_data$first_treatment_date, units = "days")
finalQ4 <- merge(duration_data, PT2, by = "patient_id")
finalQ4$duration <- as.numeric(finalQ4$duration, units = "days")
Viz4 <- boxplot(duration ~ drug_code, data = finalQ4, main = "Treatment Duration Broken Down by Regimen", xlab = "Regimen", ylab = "Treatment Duration (Days)")
Viz4
## $stats
## [,1] [,2] [,3]
## [1,] 41.0 74 73
## [2,] 56.5 74 73
## [3,] 75.5 76 78
## [4,] 161.5 80 84
## [5,] 179.0 80 87
##
## $n
## [1] 20 9 6
##
## $conf
## [,1] [,2] [,3]
## [1,] 38.40363 72.84 70.90464
## [2,] 112.59637 79.16 85.09536
##
## $out
## [1] 2584 38 1001 0 194 47
##
## $group
## [1] 1 2 2 2 2 3
##
## $names
## [1] "A" "B" "C"
# Keep only durations under 200 days to drop the most extreme values
Viz4_2data <- subset(finalQ4, duration < 200)
Viz4_2 <- boxplot(duration ~ drug_code, data = Viz4_2data, main = "Treatment Duration Broken Down by Regimen", xlab = "Regimen", ylab = "Treatment Duration (Days)")
Viz4_2
## $stats
## [,1] [,2] [,3]
## [1,] 41.0 38.0 73
## [2,] 56.5 56.0 73
## [3,] 74.0 76.0 78
## [4,] 150.5 78.5 84
## [5,] 179.0 80.0 87
##
## $n
## [1] 19 8 6
##
## $conf
## [,1] [,2] [,3]
## [1,] 39.92718 63.43118 70.90464
## [2,] 108.07282 88.56882 85.09536
##
## $out
## [1] 0 194 47
##
## $group
## [1] 2 2 3
##
## $names
## [1] "A" "B" "C"
We detect outliers in the first boxplot, which make it difficult to read. Therefore I recreated the boxplots after removing the durations of 200 days or more. The second set of boxplots shows that the range of treatment durations is much wider for Regimen A than for Regimen B as a first-line therapy. The median durations are close in both cases (74 and 76 days). The next step is to see whether the difference between the two regimens is statistically significant.
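To make the outlier rule explicit, the sketch below applies boxplot.stats, which uses the same 1.5 x IQR rule as boxplot, to the Regimen B durations printed in this report. The duration_B name and the comparison with the 200-day cutoff are added here for illustration.

```r
# Regimen B durations in days, copied from the duration.B output in this report.
duration_B <- c(38, 1001, 0, 74, 76, 76, 77, 80, 194)

# Points more than 1.5 x IQR beyond the hinges are reported as outliers.
stats_B <- boxplot.stats(duration_B)
stats_B$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats_B$out    # values drawn as individual points: 38, 1001, 0, 194

# A fixed cutoff of 200 days removes only the largest of these values.
duration_B_trimmed <- duration_B[duration_B < 200]
```

This matches the second column of the first boxplot output above. It also shows that the fixed 200-day cutoff is a looser rule than the whisker rule, which is why some points still appear as outliers in the second set of boxplots.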
# Durations for patients whose first-line drug is A
duration.A <- finalQ4$duration[finalQ4$drug_code == "A"]
duration.A
## [1] 2584 41 45 51 52 53 60 157 70 144 166 167 173 179
## [15] 73 69 74 77 77 93
# Durations for patients whose first-line drug is B
duration.B <- finalQ4$duration[finalQ4$drug_code == "B"]
duration.B
## [1] 38 1001 0 74 76 76 77 80 194
t.test(duration.A, duration.B)
##
## Welch Two Sample t-test
##
## data: duration.A and duration.B
## t = 0.25031, df = 25.424, p-value = 0.8044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -293.8613 375.2502
## sample estimates:
## mean of x mean of y
## 220.2500 179.5556
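In this particular case, for simplicity we assume that the durations follow a normal distribution. The t-test returns a p-value of 0.8044, and the 95 percent confidence interval for the difference in means (-293.86 to 375.25) includes zero, so we cannot reject the null hypothesis that the two means are equal. In other words, these data show no statistically significant difference in treatment duration between patients using Regimen A and Regimen B as first-line therapy. (The line "true difference in means is not equal to 0" in the output states the alternative hypothesis being tested, not the result.) Note also that the test was run on the full data, including the extreme durations of 2584 and 1001 days, which inflate both the means and the variance. Another important point is that in some cases both drugs are administered simultaneously or at very short intervals, which could modify the final conclusion of our analysis. For this analysis to be more relevant, a situation where patients are administered either drug, but not both, would provide more accurate results.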
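Because the normality assumption is a simplification and both groups contain extreme values, a rank-based test is one possible cross-check. The sketch below reuses the durations printed above, reads the Welch t-test against an assumed significance level of 0.05, and runs a Wilcoxon rank-sum test as an alternative. The alpha value and the choice of the Wilcoxon test are assumptions added here, not part of the original analysis.

```r
# Durations in days, copied from the duration.A and duration.B output above.
duration_A <- c(2584, 41, 45, 51, 52, 53, 60, 157, 70, 144, 166, 167, 173, 179,
                73, 69, 74, 77, 77, 93)
duration_B <- c(38, 1001, 0, 74, 76, 76, 77, 80, 194)

alpha <- 0.05  # assumed significance level

# Welch t-test, as in the analysis: significant only if p-value < alpha.
welch <- t.test(duration_A, duration_B)
welch_significant <- welch$p.value < alpha
ci_contains_zero  <- welch$conf.int[1] < 0 && welch$conf.int[2] > 0

# Rank-based alternative that is less sensitive to extreme values.
# exact = FALSE requests the normal approximation, since the data contain ties.
rank_test <- wilcox.test(duration_A, duration_B, exact = FALSE)

# Medians are a more robust summary than means for these skewed durations.
group_medians <- c(A = median(duration_A), B = median(duration_B))
```

Comparing rank_test$p.value with alpha gives a second reading of the same question that does not rely on the two extreme durations behaving like normal data.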