Number of cancer diagnoses over time, by cancer type

| Diagnosis Year | Breast Cancer | Colon Cancer |
|---|---|---|
| All | 36 | 16 |
| 2010 | 10 | 0 |
| 2011 | 7 | 8 |
| 2012 | 10 | 6 |
| 2013 | 9 | 2 |
flatiron-technical-test
Introduction to the problem
A cancer clinic wants to understand how four antineoplastic (i.e., anti-cancer) drugs are being given. Drugs A and B are chemotherapy drugs (sometimes given in combination) and Drugs C and D are immunotherapy drugs. The clinic has provided us with two datasets: one gives diagnoses by patient and the other dataset gives treatment dates for these patients for the drugs of interest. None of the patients in this cohort have died to date, and no data is missing.
General questions
1. When presented with a new dataset or database, what steps do you generally take to evaluate it prior to working with it?
There are a couple of main areas that I’m usually interested in when looking at the appropriateness of a given RWD database. These are:
- Missingness and bias.
How complete is the dataset for the most important question we’re trying to answer? For example, for specific cancer analyses, how complete is the biomarker test information (such as EGFR testing for lung cancer patients)? For RWE datasets, how complete is the patient journey: how much follow-up do we have, do patients drop in or out of the network and the data, and do we have closed or open claims data available? How representative is the data: how close are the rates of different biomarker statuses in the dataset to the known rates (e.g. how do the HER2-positive vs HER2-negative rates by ethnicity compare to published literature)? We can assess this with simple summary statistics, counts of groups, and functions that look for missingness (e.g. the R package naniar).
- Time scales.
What time periods are covered by the data? A lot of oncology studies I’m currently working on are focused on drugs approved in the last few years. If the data is focused on historic samples, then we may not be able to answer the questions we’re interested in. Again, to assess this we can look for simple counts per year or other similar summary statistics.
Apart from these, I generally have a set of study-specific goals that allow me to focus on finding the most appropriate data sources to answer the questions that we’re interested in asking.
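The two checks above (missingness and time coverage) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R tooling mentioned above, and the records, column names, and the `missing_fraction` and `counts_per_year` helpers are all hypothetical, not taken from the clinic's data.

```python
from collections import Counter
from datetime import date

# Hypothetical diagnosis records; None marks a missing value.
records = [
    {"patient_id": 1, "diagnosis_date": date(2010, 3, 1), "diagnosis": "Breast Cancer", "her2_status": "positive"},
    {"patient_id": 2, "diagnosis_date": date(2011, 6, 15), "diagnosis": "Colon Cancer", "her2_status": None},
    {"patient_id": 3, "diagnosis_date": date(2011, 9, 2), "diagnosis": "Breast Cancer", "her2_status": None},
    {"patient_id": 4, "diagnosis_date": date(2013, 1, 20), "diagnosis": "Breast Cancer", "her2_status": "negative"},
]


def missing_fraction(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {col: sum(row[col] is None for row in rows) / len(rows) for col in columns}


def counts_per_year(rows, date_column="diagnosis_date"):
    """Count records per calendar year of the given date column."""
    return Counter(row[date_column].year for row in rows if row[date_column] is not None)


missingness = missing_fraction(records)
yearly = counts_per_year(records)
print(missingness)
print(sorted(yearly.items()))
```

A year with no records (2012 in this toy example) simply does not appear in the counts, which is itself a quick way to spot gaps in time coverage.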
2. Based on the information provided above and the attached dataset, what three questions would you like to understand prior to conducting any analysis of the data?
- What is the goal of the study? What is the overall question that we’re trying to answer?
I find that it can be very easy to continually add more analyses to a study if we haven’t defined the study goals - but this is not best practice, either scientifically or timewise. If we know what research question we are trying to answer, then that can inform all of our analyses and will help with requirements gathering and protocol writing.
- Why this dataset?
The dataset we’re using here is very sparse. However, in the real world there could be a good reason for that: some conditions are inherently rare, so we may end up working with small sample sizes. But it’s important that everyone involved in the study design is on the same page about the possible limitations of working with very small datasets (e.g. wider confidence intervals, harder to draw firm conclusions). In some cases other datasets are more appropriate and can be used, but other team members may not have realised why we might need to do so.
- What timescale and output are you looking for?
In project work, the earlier we can agree on timescales, deliverables, and what is expected from all parties, the more successful the analysis project will be. Expectations must be met on all sides, so if we have them in writing and know what the next steps are, we can work much more efficiently.
Data analysis
I’ve included two output files here - one is just the output, and one is the output with the code included too. There are multiple comments and talking points throughout the code.
1. First, the clinic would like to know the distribution of cancer types across their patients. Please provide the clinic with this information.
2. The clinic wants to know how long it takes for patients to start therapy after being diagnosed, which they consider to be helpful in understanding the quality of care for the patient. How long after being diagnosed do patients start treatment?
Time from initial cancer diagnosis to start of therapy (days), by cancer type

| Statistic | Breast Cancer | Colon Cancer |
|---|---|---|
| N | 30 | 11 |
| Mean (SD) | 5.6 (3.9) | 30.3 (90.8) |
| Median (Range) | 5 (0-20) | 4 (0-304) |
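As a rough illustration of how a summary like this can be derived, the sketch below takes each patient's earliest treatment date, subtracts the diagnosis date, and summarises the resulting delays. The patient IDs, dates, and the `days_to_first_treatment` helper are hypothetical, and this is plain Python rather than the code in the output files.

```python
from datetime import date
from statistics import mean, median, stdev

# Hypothetical inputs: one diagnosis date per patient, many treatment dates.
diagnoses = {
    "p1": date(2012, 1, 10),
    "p2": date(2012, 5, 3),
    "p3": date(2013, 2, 14),
}
treatments = [
    ("p1", date(2012, 1, 15)),
    ("p1", date(2012, 2, 1)),
    ("p2", date(2012, 5, 3)),
    ("p3", date(2013, 2, 24)),
    ("p3", date(2013, 3, 10)),
]


def days_to_first_treatment(diagnoses, treatments):
    """Days from diagnosis to the earliest treatment date, per treated patient."""
    first = {}
    for patient_id, treatment_date in treatments:
        if patient_id not in first or treatment_date < first[patient_id]:
            first[patient_id] = treatment_date
    return {pid: (first[pid] - diagnoses[pid]).days for pid in first if pid in diagnoses}


delays = days_to_first_treatment(diagnoses, treatments)
values = sorted(delays.values())
summary = {
    "n": len(values),
    "mean": mean(values),
    "sd": stdev(values),
    "median": median(values),
    "range": (values[0], values[-1]),
}
print(delays)   # {'p1': 5, 'p2': 0, 'p3': 10}
print(summary)
```

Patients with no treatment record are left out of this sketch. With a skewed distribution like the colon cancer column above, the median and range are usually more informative than the mean.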
3. Which treatment regimens [i.e., drug(s)] do you think would be indicated to be used as first-line of treatment for breast cancer? What about colon cancer? (For more information between first-line and second-line treatments (applicable between chemotherapy drugs as well as chemo v immuno therapies), please reference https://www.cancer.gov/publications/dictionaries/cancer-terms?cdrid=346494)
I’ll just focus on drugs, rather than surgery or other therapies available.
Typical first-line therapies for breast cancer are Doxorubicin, Cyclophosphamide, Carboplatin, Paclitaxel or Docetaxel. HER2-positive breast cancer patients may get trastuzumab or pertuzumab. TNBC patients may get pembrolizumab. Some patients, especially those who are HR+, will receive an aromatase inhibitor (AI) such as anastrozole.
For colon cancer, typical first-line therapies include Capecitabine, Fluorouracil, Irinotecan, and combinations of these and other drugs. Pembrolizumab is also a first-line treatment for some unresectable colon cancers.
4. Do the patients taking Regimen A vs. Regimen B as first-line therapy for breast cancer vary in terms of duration of therapy? Please include statistical tests and visualizations, as appropriate.
Some caveats here:
The duration of therapy is defined as the time from the earliest therapy start date to the latest therapy start date within a specific line.
We’re only interested in first-line therapies here. We’ve defined first-line therapy as the first therapy given at any point from 14 days before the diagnosis date onwards.
A line of therapy can span multiple treatment records if there is a gap of less than 28 days between consecutive records.
Many patients have both therapy A and B, at overlapping times. These can’t be considered as mono-therapies, and have been included here as combination therapies.
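The caveats above can be sketched as a small function. This is a minimal illustration in plain Python, not the code in the output files: the `first_line` helper, the constants, and the example records are hypothetical, and labelling any line that contains both drugs as "A+B" is my reading of the combination rule.

```python
from datetime import date, timedelta

MAX_GAP_DAYS = 28        # records closer together than this share a line
PRE_DIAGNOSIS_DAYS = 14  # treatments this soon before diagnosis still count


def first_line(diagnosis_date, records):
    """Return (regimen, duration_days) for a patient's first line of therapy.

    `records` is a list of (treatment_date, drug_code) tuples. Returns None
    when no treatment falls inside the eligible window.
    """
    window_start = diagnosis_date - timedelta(days=PRE_DIAGNOSIS_DAYS)
    eligible = sorted(r for r in records if r[0] >= window_start)
    if not eligible:
        return None

    line = [eligible[0]]
    for record in eligible[1:]:
        gap = (record[0] - line[-1][0]).days
        if gap >= MAX_GAP_DAYS:
            break  # a gap of 28 days or more ends the first line
        line.append(record)

    # Assumption: any line containing both drugs is a combination regimen.
    regimen = "+".join(sorted({drug for _, drug in line}))
    # Duration runs from the earliest to the latest start date in the line.
    duration = (line[-1][0] - line[0][0]).days
    return regimen, duration


example_records = [
    (date(2012, 1, 5), "A"),
    (date(2012, 1, 19), "A"),
    (date(2012, 1, 19), "B"),
    (date(2012, 2, 2), "B"),
    (date(2012, 4, 1), "A"),  # 59 days after the previous record: a new line
]
print(first_line(date(2012, 1, 10), example_records))  # ('A+B', 28)
```

Because duration is measured between start dates only, a patient with a single treatment record gets a duration of zero days, which is worth keeping in mind when reading the distributions.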
Treatment A and B look very similar to each other visually, while combination therapy A and B together has a much wider distribution.
To test for differences in duration of first-line therapy, I would typically look at time to treatment discontinuation (rwTTD). However, here we don’t have enough information to do this.
Instead, a generalised linear model may give us an answer. We’ll use a Poisson distribution, treating the duration in whole days as a count; Poisson regression is also closely related to some time-to-event models. As this is supposed to be a short technical test, we’re not going to go into model checking here, but I’m very happy to talk about that more!
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| drug_code | | | |
| drug_codeA+B | -14 | -35, 7.3 | 0.2 |
| drug_codeB | -12 | -37, 13 | 0.3 |

¹ CI = Confidence Interval
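Our simple Poisson model provides no evidence of a difference in first-line duration for drug B compared to drug A (p = 0.3), or for combination therapy A+B compared to A alone (p = 0.2); both 95% confidence intervals include zero. With a cohort this small, that is not the same as showing the durations are equal.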