SLee Flatiron Interview Answers

Preface

This report contains the code snippets needed to reproduce the results. The attached script ‘explore.R’ contains the same code. The code below imports the data into the R workspace (please create a subdirectory ./data/ and keep the CSV files there).

library(dplyr)
library(reshape2)
library(ggplot2)
library(coin) # for permutation test
# ------------------- Dataset import and some cleaning ------------------ 
# read the diagnosis and treatment data
diag <- read.csv('./data/Patient_Diagnosis.csv',stringsAsFactors = FALSE)
trt <- read.csv('./data/Patient_Treatment.csv',stringsAsFactors = FALSE)
# Some cleaning
# convert date columns to appropriate format
diag$diagnosis_date <- as.Date(diag$diagnosis_date,"%m/%d/%Y")
trt$treatment_date <- as.Date(trt$treatment_date,"%m/%d/%Y") 
# remove duplicate rows
diag = unique(diag)
trt = unique(trt)
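
The summaries later in this report print years such as 0010 and 0017, which suggests that the CSV files store two-digit years and that "%Y" read them literally. The sketch below uses a made-up date string (an assumption, not a value copied from the CSV files) to show the difference between "%Y" and "%y". The date differences used in this report are not affected, because both date columns are shifted by the same 2000 years.

```r
# Hypothetical date string with a two-digit year (not taken from the real CSV)
x <- "1/20/10"

as_literal_year <- as.Date(x, "%m/%d/%Y")  # "%Y" reads "10" as the year 10
as_two_digit    <- as.Date(x, "%m/%d/%y")  # "%y" maps 00-68 to 2000-2068

# The two results differ by exactly 2000 years
as.numeric(as_two_digit - as_literal_year)
as_two_digit
```

If four-digit years are wanted in the printed output, switching the format string in the import code to "%m/%d/%y" would be one way to get them, assuming the raw files do use two-digit years.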

General questions

When presented with a new dataset or database, what steps do you generally take to evaluate it prior to working with it?

The first step of any data analysis, regardless of the hypothesis or purpose, is to understand the dataset: How many rows (data entries) and columns (variables) does it have? What does each row represent (for example, one patient, or one hospital visit for one patient)? What kinds of columns are present? How is each variable coded (ordinal, categorical, numeric, character, date)? Is any variable dominated by a single value?
The next step is to assess the quality of the data. One important indicator is the percentage of missing values. If a large share of a column is missing, we need to find out why, for example by checking whether the missingness of one variable is related to the values of other variables.
The last step is to transform the data, if necessary, so that it can be used quantitatively; for example, by splitting a string into its components.
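
As a rough illustration of the first two steps, the sketch below profiles each column of a data frame: its class, the percentage of missing values, and the share taken by its most common value. The function name `profile_columns` and the toy data frame are made up for this example.

```r
# Summarise each column: class, % missing, and % of non-missing rows
# taken by the most common value (a high value means one value dominates)
profile_columns <- function(df) {
  data.frame(
    column = names(df),
    class = vapply(df, function(x) class(x)[1], character(1)),
    pct_missing = vapply(df, function(x) 100 * mean(is.na(x)), numeric(1)),
    pct_top_value = vapply(df, function(x) {
      counts <- table(x, useNA = "no")
      if (length(counts) == 0) return(NA_real_)  # column is entirely missing
      100 * max(counts) / sum(counts)
    }, numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data (hypothetical values)
toy <- data.frame(
  id = 1:4,
  drug = c("A", "A", "A", NA),
  dose = c(1.5, NA, 2.0, 2.5),
  stringsAsFactors = FALSE
)

dim(toy)                      # number of rows and columns
prof <- profile_columns(toy)
prof
```

In the toy table, `drug` has 25% missing values and a single value covering all non-missing rows, which is the kind of column that deserves a closer look.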

Based on the information provided above and the attached dataset, what three questions would you like to understand prior to conducting any analysis of the data?

What are the columns, and what does each row represent, in the diagnosis and treatment datasets? Knowing this is necessary for deciding how to merge the two datasets. Both datasets have a patient ID column, which I used as the merge key. I used a full join so that no data points are dropped by the merge.

# join the two datasets into a single data frame
dt <- full_join(trt,diag,by="patient_id")

The next step is to understand how the columns are encoded. I use the R command ‘summary’ for this purpose (it also shows how many values are missing in each column). Note that the treatment and diagnosis date columns were already converted to the Date format (see the attached explore.R code).

summary(dt)

##    patient_id   treatment_date        drug_code         diagnosis_date      
##  Min.   :2038   Min.   :0010-01-20   Length:1208        Min.   :0010-01-09  
##  1st Qu.:2961   1st Qu.:0011-09-17   Class :character   1st Qu.:0011-04-19  
##  Median :4692   Median :0012-06-22   Mode  :character   Median :0012-04-27  
##  Mean   :5093   Mean   :0012-05-07                      Mean   :0012-02-11  
##  3rd Qu.:6877   3rd Qu.:0013-01-23                      3rd Qu.:0012-11-15  
##  Max.   :9489   Max.   :0017-02-20                      Max.   :0013-08-23  
##                 NA's   :2                                                   
##  diagnosis_code   diagnosis        
##  Min.   :153.3   Length:1208       
##  1st Qu.:153.9   Class :character  
##  Median :174.4   Mode  :character  
##  Mean   :168.4                     
##  3rd Qu.:174.8                     
##  Max.   :174.9                     
##

# for non-numeric columns, what are the distinct values?  
table(dt$drug_code)

## 
##   A   B   C   D 
## 394 394 398  20

table(dt$diagnosis)

## 
## Breast Cancer  Colon Cancer 
##           854           354

From the results, I was able to tell that there are four drugs and two cancer types.

Sanity check: how much data is missing? According to the output of ‘summary’ (see above), two values of treatment date are missing. This is a result of the full join: one patient appears to have two diagnoses but no treatment data.

dt[is.na(dt$treatment_date),]

##      patient_id treatment_date drug_code diagnosis_date diagnosis_code
## 1207       4256           <NA>      <NA>     0011-11-07          174.5
## 1208       4256           <NA>      <NA>     0011-11-07          174.8
##          diagnosis
## 1207 Breast Cancer
## 1208 Breast Cancer
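
One way to anticipate rows like these before merging is to compare the patient IDs in the two tables. The sketch below uses two small made-up tables (`diag_toy` and `trt_toy`, hypothetical names and values) rather than the real data.

```r
# Toy tables with hypothetical IDs (not the real data)
diag_toy <- data.frame(
  patient_id = c(1, 2, 3),
  diagnosis = c("Breast Cancer", "Colon Cancer", "Breast Cancer")
)
trt_toy <- data.frame(
  patient_id = c(1, 1, 2),
  drug_code = c("A", "B", "B")
)

# Patients with a diagnosis but no treatment record
diag_only <- setdiff(diag_toy$patient_id, trt_toy$patient_id)
# Patients with a treatment record but no diagnosis
trt_only <- setdiff(trt_toy$patient_id, diag_toy$patient_id)

diag_only
trt_only
```

With dplyr loaded, `anti_join(diag_toy, trt_toy, by = "patient_id")` returns the unmatched rows themselves instead of only the IDs.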

Data analysis questions

First, the clinic would like to know the distribution of cancer types across their patients. Please provide the clinic with this information.

# Because the treatment dataset contains several data points for each patient,
# will only use the diagnosis dataset
diag$diagnosis %>% table(exclude=NULL) %>% prop.table()*100

## .
## Breast Cancer  Colon Cancer 
##      68.42105      31.57895

Most diagnoses (~68%) were breast cancer; colon cancer accounted for the rest (~32%).
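
These percentages are computed per diagnosis record. The merged data shown earlier includes a patient with two breast cancer diagnosis codes, so a per-patient distribution may differ slightly. The sketch below, on made-up data, shows one way to count each patient once per cancer type before tabulating; the toy values are assumptions and the result is not the clinic's actual distribution.

```r
# Toy diagnosis table (hypothetical values); patient 1 has two codes
diag_toy <- data.frame(
  patient_id = c(1, 1, 2, 3),
  diagnosis_code = c(174.5, 174.8, 153.3, 174.9),
  diagnosis = c("Breast Cancer", "Breast Cancer", "Colon Cancer", "Breast Cancer")
)

# Share per diagnosis record (what the code above computes)
per_record <- prop.table(table(diag_toy$diagnosis)) * 100

# Share per patient: keep one row per patient and cancer type first
patients <- unique(diag_toy[, c("patient_id", "diagnosis")])
per_patient <- prop.table(table(patients$diagnosis)) * 100

per_record
per_patient
```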

The clinic wants to know how long it takes for patients to start therapy after being diagnosed, which they consider to be helpful in understanding the quality of care for the patient. How long after being diagnosed do patients start treatment?

# First create a column for the number of days elapsed since the diagnosis
dt <- mutate(dt,date_diff = treatment_date-diagnosis_date)
dt$date_diff <- as.numeric(dt$date_diff)
# now find when treatment started: the smallest number of days to any treatment
dt_days <- dt %>% group_by(patient_id,diagnosis) %>% summarize(days_to_txt = min(date_diff))
dt_days_stats <- dt_days %>% group_by(diagnosis) %>% summarize(min = min(days_to_txt,na.rm=TRUE),
                                                               quartile1 = quantile(days_to_txt,0.25,na.rm=TRUE),
                                                               quartile2 = quantile(days_to_txt,0.5,na.rm=TRUE),
                                                               quartile3 = quantile(days_to_txt,0.75,na.rm=TRUE),
                                                               max = max(days_to_txt,na.rm=TRUE),
                                                               missing = sum(is.na(days_to_txt)))
dt_days_stats

## # A tibble: 2 x 7
##   diagnosis       min quartile1 quartile2 quartile3   max missing
##   <chr>         <dbl>     <dbl>     <dbl>     <dbl> <dbl>   <int>
## 1 Breast Cancer    -6      3.5          5         6    20       1
## 2 Colon Cancer      0      2.25         5         7   304       0

The table above shows the distribution of days to first treatment for each diagnosis group. Judging from the median (quartile2), treatment typically starts about 5 days after diagnosis for both cancer types. The box plot below visualizes this:

ggplot(dt_days,aes(x=diagnosis,y=days_to_txt)) + geom_boxplot()

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

The box plot shows a few outliers, drawn as dots: negative values in the breast cancer cohort and one very large value in the colon cancer cohort. More information on these data points is required before excluding them from further analyses.
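
To list such points for follow-up instead of reading them off the plot, a small helper can flag them. The sketch below is an assumption-laden illustration: `flag_suspect` is a made-up name, the input values are invented, and the 1.5 × IQR rule is similar to, but not necessarily identical to, the rule the box plot uses for its whiskers.

```r
# Flag values that are negative (treatment before diagnosis) or that lie
# more than 1.5 * IQR outside the first/third quartile
flag_suspect <- function(days) {
  q <- quantile(days, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  data.frame(
    days = days,
    negative = !is.na(days) & days < 0,
    beyond_fences = !is.na(days) & (days < lower | days > upper)
  )
}

# Invented days-to-treatment values, including one missing value
toy_days <- c(-6, 3, 4, 5, 5, 6, 7, 304, NA)
flags <- flag_suspect(toy_days)
flags[flags$negative | flags$beyond_fences, ]
```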

Which treatment regimens [i.e., drug(s)] do you think would be indicated to be used as first-line of treatment for breast cancer? What about colon cancer?

# Find the smallest number of days to treatment for each patient, diagnosis, and drug
dt_days_drug <- dt %>% 
  group_by(patient_id,diagnosis,drug_code) %>% 
  summarize(days_to_txt = min(date_diff))
ggplot(na.omit(dt_days_drug),aes(x=drug_code,y=days_to_txt)) + 
  geom_boxplot() + facet_grid(diagnosis ~ .,scales="free")

The boxplot above shows that the first-line treatments are drug A and B for breast cancer, and B and D for colon cancer.
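
The same question can also be answered with counts instead of a plot: for each patient, keep the drug(s) given on the earliest treatment day and tabulate them by diagnosis. The sketch below uses invented data with the same column names as `dt`; the counts it prints are not results from the clinic's data.

```r
# Invented treatment records (hypothetical values)
toy <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 3, 3),
  diagnosis  = c(rep("Breast Cancer", 5), rep("Colon Cancer", 2)),
  drug_code  = c("A", "B", "C", "B", "C", "B", "D"),
  date_diff  = c(5, 5, 40, 3, 30, 2, 60)
)

# Drop rows without a treatment date so min() is well defined
toy <- toy[!is.na(toy$date_diff), ]

# Earliest treatment day per patient, repeated on every row of that patient
first_day <- ave(toy$date_diff, toy$patient_id, FUN = min)

# Keep only the drugs given on that earliest day, one row per patient and drug
first_rows <- unique(
  toy[toy$date_diff == first_day, c("patient_id", "diagnosis", "drug_code")]
)

# Number of patients starting on each drug, by diagnosis
first_line_counts <- table(first_rows$diagnosis, first_rows$drug_code)
first_line_counts
```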

Do the patients taking Regimen A vs. Regimen B as first-line therapy for breast cancer vary in terms of duration of therapy? Please include statistical tests and visualizations, as appropriate.

# Below is a function that returns which drug is the first-line drug 
# given the time points that each drug is administered
find_first_drug <- function(drug,d_diff){
  # first line drug is identified as the drug/s with the smallest treatment day
  prim_raw <-  drug[d_diff == min(d_diff)]
  # the same drug can be selected twice as a primary drug 
  # due to the drug prescribed to multiple types of breast cancer
  # so, remove duplicates in the list
  # sort() is to fix c("B","A") to c("A","B")
  prim <- sort(unique(prim_raw)) 
  return(paste(prim,collapse=","))
}
# Apply that function to the breast cancer subset of the data 
# Also calculate the duration of the treatment (taken from the largest number of days for any treatment)
dt_brca_duration <- dt %>% 
  filter(diagnosis=="Breast Cancer") %>%
  group_by(patient_id) %>%
  summarize(duration = max(date_diff),
            first_drug = find_first_drug(drug_code,date_diff))
# show the distribution of treatment duration under the regimen A vs B
# excluded the patients that 1) received A and B at the same time 2) received C or D
dt_brca_duration_AB <- filter(dt_brca_duration,first_drug %in% c("A","B"))
dt_brca_duration_AB

## # A tibble: 11 x 3
##    patient_id duration first_drug
##         <int>    <dbl> <chr>     
##  1       2120       52 B         
##  2       2238     1001 B         
##  3       2475        0 B         
##  4       2607       94 B         
##  5       2720       87 B         
##  6       2762       79 B         
##  7       3025       83 B         
##  8       7937       80 A         
##  9       7976       82 A         
## 10       8480       83 A         
## 11       8827       90 A

dt_brca_duration_AB %>%  
  ggplot(aes(y = duration,x = first_drug)) + geom_boxplot()

As the plot shows, group B has one outlier (patient 2238) with a very long duration. The patient could have been diagnosed a second time (which can happen as a result of metastasis) and received a second course of treatment. I checked the diagnosis and treatment entries for this patient.

dt[dt$patient_id == "2238",]

##     patient_id treatment_date drug_code diagnosis_date diagnosis_code
## 2         2238     0010-01-21         B     0010-01-21          174.9
## 18        2238     0010-01-31         B     0010-01-21          174.9
## 30        2238     0010-02-10         B     0010-01-21          174.9
## 42        2238     0010-02-20         B     0010-01-21          174.9
## 56        2238     0010-03-02         B     0010-01-21          174.9
## 62        2238     0010-03-12         B     0010-01-21          174.9
## 70        2238     0010-03-22         B     0010-01-21          174.9
## 76        2238     0010-04-01         B     0010-01-21          174.9
## 695       2238     0012-09-18         B     0010-01-21          174.9
## 713       2238     0012-09-28         B     0010-01-21          174.9
## 729       2238     0012-10-08         B     0010-01-21          174.9
## 743       2238     0012-10-18         B     0010-01-21          174.9
##         diagnosis date_diff
## 2   Breast Cancer         0
## 18  Breast Cancer        10
## 30  Breast Cancer        20
## 42  Breast Cancer        30
## 56  Breast Cancer        40
## 62  Breast Cancer        50
## 70  Breast Cancer        60
## 76  Breast Cancer        70
## 695 Breast Cancer       971
## 713 Breast Cancer       981
## 729 Breast Cancer       991
## 743 Breast Cancer      1001

The output shows two series of treatment B separated by a gap of more than two years. I will assume that the diagnosis record for a second cancer is missing, and will exclude this patient from the statistics until we know more about the case.
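
An alternative to dropping the patient would be to split the treatment dates into separate courses whenever the gap between consecutive treatments exceeds a threshold, and to measure only the first course. The sketch below applies this to the `date_diff` values printed above for patient 2238; the 90-day threshold `max_gap` is an arbitrary assumption, not something given in the data.

```r
# date_diff values for patient 2238, copied from the output above
days <- c(0, 10, 20, 30, 40, 50, 60, 70, 971, 981, 991, 1001)

max_gap <- 90  # assumed threshold in days; a clinical definition may differ

# Start a new course whenever the gap to the previous treatment exceeds max_gap
days <- sort(days)
course <- cumsum(c(TRUE, diff(days) > max_gap))

n_courses <- max(course)
# Days from diagnosis to the last treatment of the first course
first_course_duration <- max(days[course == 1])

n_courses
first_course_duration
```

Under this assumption the patient has two courses, and the first one ends 70 days after diagnosis.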

dt_brca_duration_AB_clean <- filter(dt_brca_duration_AB,patient_id != "2238")
dt_brca_duration_AB_clean %>% 
  group_by(first_drug) %>% 
  summarize(n_patients = n(),
            minimum = min(duration),
            maximum = max(duration),
            median = median(duration))

## # A tibble: 2 x 5
##   first_drug n_patients minimum maximum median
##   <chr>           <int>   <dbl>   <dbl>  <dbl>
## 1 A                   4      80      90   82.5
## 2 B                   6       0      94   81

dt_brca_duration_AB_clean %>%  
  ggplot(aes(y = duration,x = first_drug)) + geom_boxplot()

The table and the box plot above show the distribution of duration after the outlier is removed. One caveat: the box plot is included for visualization only, because the sample is too small (N=4 for drug A, N=6 for drug B) to estimate quartiles reliably. The treatment lasted a median of 82.5 days when drug A was the first-line drug and 81 days when drug B was. The difference between the two groups was tested with the independence test from the coin package (run below; the output header shows that the asymptotic version was used), and it was not statistically significant (p≈0.32). A larger sample is needed to confirm this result.

# the sample is too small to assume normality, so a t-test is not appropriate
# use the permutation-based independence test from the coin package instead
# (note: with default settings coin reports an asymptotic p-value, as the output header shows)
independence_test(duration ~ as.factor(first_drug),
                  data = dt_brca_duration_AB_clean)

## 
##  Asymptotic General Independence Test
## 
## data:  duration by as.factor(first_drug) (A, B)
## Z = 0.99088, p-value = 0.3217
## alternative hypothesis: two.sided


Sangkyu Lee

4/2/2020
