Why oncology drug data?

In oncology drug response datasets, IC50 (the concentration that cuts tumor cell viability by half) and dose–response AUC (the overall effect of the drug across all tested doses) do not have a linear relationship. Typically it is sigmoidal. This is because different tumor cells are uniquely sensitive (or not sensitive) to a given drug. That’s why the IC50–AUC scatter is typically heteroskedastic (uneven spread): due to tumor cell properties, the dose-response AUC varies more in some IC50 regions than others.

A standard four-parameter, S-shaped (4PS) model logistic, a form of NONLINEAR regression, is used to fit the dose–response data as seen below:

4Ps equation (with dose \(x\)):

\[\text{Response}(x) = \text{Bottom} + \frac{\text{Top} - \text{Bottom}}{1 + \left(\frac{x}{\text{IC50}}\right)^{\text{Hill}}}\]

This model is particularly well-suited for oncology assays because:

100% |  Top ←---------------
     |       \
     |        \ ← Hill slope controls steepness
50%  |         × ← IC50 
     |          \
     |           \
0%   |            Bottom →---------
     |________________________
       Low dose        High dose
  

AUC = Shaded area under this sigmoidal curve

Step 1: Dataset exploration, cleaning, and drug selection process

The selection of litronesib for this quantile regression analysis followed these steps:

Step 1a: Dataset creation

Two source files from depmap portal (https://depmap.org/portal/) were merged using an inner join in R to retain only records with both dose-response data (AUC, IC50) and suffiicnet tumor type information. Unnecessary columns were culled.

Step 1b: Heteroskedasticity screening - all drugs

All 1343 drugs, each with response and concentration data in up to 22 tumor types, were tested for heteroskedasticity using two methods:

\[H_0: \text{Var}(\varepsilon_i) = \sigma^2 \text{ (homoskedasticity)}\] \[H_1: \text{Var}(\varepsilon_i) = \sigma^2 h(x_i) \text{ (heteroskedasticity)}\]

\[Test statistic: $BP = n \cdot R^2_{aux} \sim \chi^2_p\]

\[\text{IQR Ratio} = \frac{IQR(IC50 | AUC \in Q_4)}{IQR(IC50 | AUC \in Q_1)}\]

Step 1c: B-P and IQR results: Selection of litronesib

BP and IQR generation for litronesib for deomnstration purposes ONLY (this was done for the full dataset of all 1343 drugs in making the drug selection):

# packages 
library(readxl) #to read my onc dataset
library(quantreg) #for QR fit  #Koenker 2023 implements quantile regression using linear programming methods
library(lmtest)  # for Breusch-Pagan test

# litronesib data
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")  

# log10 transformations for ease of interpretation 
litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)

# Remove rows with NA values 
litro_clean = litro[complete.cases(litro$ic50_log10, litro$auc_log10), ]

# OLS 
ols_model = lm(ic50_log10 ~ auc_log10, data = litro_clean)

# B-P
bp_test = bptest(ols_model)

# IQR (quantiles by AUC)
quartiles = quantile(litro_clean$auc_log10, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Q1 IC50s
q1_data = litro_clean[litro_clean$auc_log10 >= quartiles[1] & litro_clean$auc_log10 <= quartiles[2], ]
iqr_q1 = IQR(q1_data$ic50_log10, na.rm = TRUE)

# Q4 IC50s
q4_data = litro_clean[litro_clean$auc_log10 >= quartiles[4] & litro_clean$auc_log10 <= quartiles[5], ]
iqr_q4 = IQR(q4_data$ic50_log10, na.rm = TRUE)

#  IQR ratio
iqr_ratio = iqr_q4 / iqr_q1

# overall IQR for IC50
overall_iqr = IQR(litro_clean$ic50_log10, na.rm = TRUE)

# results  
het_results = data.frame(
  Metric = c("Breusch-Pagan statistic", 
             "Breusch-Pagan p-value", 
             "Overall IQR", 
             "IQR ratio (Q4/Q1)"),
  Value = c(round(bp_test$statistic, 2),
            format.pval(bp_test$p.value, digits = 3),
            round(overall_iqr, 3),
            round(iqr_ratio, 2))
)

print(het_results)
##                    Metric  Value
## 1 Breusch-Pagan statistic  71.65
## 2   Breusch-Pagan p-value <2e-16
## 3             Overall IQR  0.618
## 4       IQR ratio (Q4/Q1)   4.81

Step 2: OLS fit for all tumor/cell types for litronesib

This analysis examines the relationship between the IC50 (drug concentration at 50% potency) and AUC (dose-response area under the curve) for litronesib across tumor types. Heteroskedasticity, and subsequently the need for a non-mean based regression apporoach, is suspected based on examination of the dataset, but first this section checks what OLS fit looks like so we can see if we still suspect that quantile regression would be best.

simple linear regression model is:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

where:

In litronesib drug concentration-response data:

Step 2a: Load data and create log10 variables

# litronesib data 
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")

litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)

Step 2b: Fit the OLS model

ols_model = lm(ic50_log10 ~ auc_log10, data = litro)
summary(ols_model)
## 
## Call:
## lm(formula = ic50_log10 ~ auc_log10, data = litro)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13595 -0.33931 -0.03473  0.22448  1.99912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.50258    0.04895  -30.70   <2e-16 ***
## auc_log10    1.78673    0.11354   15.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.523 on 545 degrees of freedom
## Multiple R-squared:  0.3124, Adjusted R-squared:  0.3112 
## F-statistic: 247.6 on 1 and 545 DF,  p-value: < 2.2e-16

Step 2c: Plot the data with OLS line

# parameters
intercept = round(coef(ols_model)[1], 3)
slope = round(coef(ols_model)[2], 3)
r2 = round(summary(ols_model)$r.squared, 3)

# plot
plot(litro$auc_log10, litro$ic50_log10, 
     xlab = "AUC(log10) - Predictor", 
     ylab = "IC50(log10) - Response",
     main = "Litronesib: OLS fit for all tumor types",
     pch = 16, col = "gray")
abline(ols_model, col = "red", lwd = 2)  

Step 2d: What does this tell us?

From the model output:

Looking at the output, The model suggetsts AUC explains 31.2% of the variance in IC50, leaving a lot of the relationship on the table -AUC is a meaningful predictor, but about 69% of the variation in IC50 remains unexplained by AUC alone.

Additonaly that maximum residual of 2 suggests some observations are poorly predicted by OLS fit.

Upon visual inspection, substantial scatter appears around the regression line, espeically at higher AUC/IC50, suggesting a mean-based regression will not fully capture the relationship between IC50 and AUC for litronesib.

Step 2e: Check OLS by tumor

Now let’s examine the extent to which heteroskedasticity may be tumor-driven: Different tumor types have different drug response curves

Find the top 4 tumor types

Fit separate OLS models for the top 4 tumor/cell types

# top tumors
tumor_counts = sort(table(litro$OncotreeLineage), decreasing = TRUE)
top_4 = c("Lung", "Skin", "CNS/Brain", "Esophagus/Stomach")
litro_top4 = litro[litro$OncotreeLineage %in% top_4, ]
tumor_results = data.frame()

# fit each tumor
for(tumor in top_4) {
  # get data for this tumor
  tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
 
  new_row = data.frame(
    Tumor = tumor,
    n = nrow(tumor_data),
    Intercept = round(coef(tumor_model)[1], 3),
    Slope = round(coef(tumor_model)[2], 3),
    R_squared = round(summary(tumor_model)$r.squared, 3)
  )
  tumor_results = rbind(tumor_results, new_row)
}


print(tumor_results)
##                          Tumor   n Intercept Slope R_squared
## (Intercept)               Lung 119    -1.703 1.409     0.268
## (Intercept)1              Skin  44    -1.301 2.058     0.374
## (Intercept)2         CNS/Brain  36    -1.287 2.165     0.405
## (Intercept)3 Esophagus/Stomach  45    -1.808 1.310     0.279

Plot each tumor type scatter/regression line separately

# plots
par(mfrow = c(2, 2))

for(i in 1:4) {
  tumor = top_4[i]

  tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
  
  plot(tumor_data$auc_log10, tumor_data$ic50_log10,
       xlab = "AUC(log10) - Predictor", 
       ylab = "IC50(log10) - Response",
       main = tumor,
       pch = 16, col = "gray")
  tumor_reds = c("red", "firebrick", "darkred", "indianred")
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
  abline(tumor_model, col = tumor_reds[i], lwd = 2)
}

Step 2f: Compare to overall model

# plot all one plot
plot(litro$auc_log10, litro$ic50_log10, 
     col = "lightgray", pch = 16,
     xlab = "AUC(log10) - Predictor", 
     ylab = "IC50(log10) - Response",
     main = "All vs Top 4 Tumors")

# tumor lines
tumor_reds = c("firebrick", "darkred", "indianred", "salmon")
for(i in 1:4) {
  tumor_data = litro_top4[litro_top4$OncotreeLineage == top_4[i], ]
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
  abline(tumor_model, col = tumor_reds[i], lwd = 2)
  
  # OLS 
abline(ols_model, col = "red", lwd = 3)
}

legend("topleft", c("All (OLS)", top_4), 
       col = c("red", tumor_reds), 
       lwd = 2, cex = 0.7)

Steo 2g: Results and interpretation

To determine if heteroskedasticity is tumor-driven vs general, the analysis examined:

  1. Spread of residuals: Do some tumor types show wider spread than others?
  2. R² values: Lower R² suggests more unexplained variance
  3. Slope differences: Do different tumor types have different relationships to the drug?

Findings:

Finding 1: Weak relationships across all four tumor types

All four tumor types show low R² values (0.268 to 0.405), so AUC explains less than half of the variance in IC50 - even in the best case (CNS/Brain at 40.5%). This substantial unexplained variance confirms heteroskedasticity is generally present across litronesib’s AUC:IC50 relationship.

Finding 2: STILL, some tumor-driven differences exist

While all relationships are weak, CNS/Brain shows the strongest fit (R² = 0.405) and Lung shows the weakest (R² = 0.268). The slopes also vary: CNS/Brain and Skin show steeper slopes (around 2.0-2.2) compared to Lung and Esophagus (around 1.3-1.4), indicating tumor-specific drug responses.

Conclusions:

Heteroskedasticity in litronesib appears both general (present across all tumors) and partially tumor-driven (more pronounced in some tumor types). The consistently low R² values and visible scatter suggest that OLS regression, which models only the mean, is insufficient. This motivates quantile regression, which can model the entire distribution of IC50 values at different AUC levels.

Step 3: Quantile regression for litronesib IC50:AUC

Quantile regression allows modeling of the predictor-response relationships at different points in the distribution, not just the average, which can better account for the changing dynamic between IC50 and AUC.

The quantile regression model for the τ-th quantile is:

\[ Q_\tau(y_i | x_i) = \beta_{0,\tau} + \beta_{1,\tau} x_i \]

where:

Key difference from OLS:

For the litronesib data:

This gives 5 different regression lines:

\[ \begin{aligned} Q_{0.10}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.10} + \beta_{1, 0.10} \times \text{AUC} \quad \text{(10th percentile: high-sensitivity cells)} \\ Q_{0.25}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.25} + \beta_{1, 0.25} \times \text{AUC} \quad \text{(25th percentile)} \\ Q_{0.50}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.50} + \beta_{1, 0.50} \times \text{AUC} \quad \text{(50th percentile: median)} \\ Q_{0.75}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.75} + \beta_{1, 0.75} \times \text{AUC} \quad \text{(75th percentile)} \\ Q_{0.90}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.90} + \beta_{1, 0.90} \times \text{AUC} \quad \text{(90th percentile: low-sensitivity cells)} \end{aligned} \]

What this tells us:

Step 3a: Fit quantile regression model

The analysis fits models at 5 quantiles: τ = 0.10, 0.25, 0.50, 0.75, 0.90

#  quantiles
quantiles = c(0.10, 0.25, 0.50, 0.75, 0.90)
# results 
qr_results = data.frame()
qr_results_ci = data.frame()
# quantile regression for each tau
for(tau in quantiles) {
  qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = tau)
  
  # 95% CIs using bootstrap (1000 reps)
  qr_summary = summary(qr_model, se = "boot", R = 1000)
  
  #  slope estimates and 95% CIs
  slope_est = qr_summary$coefficients[2, 1]
  slope_se = qr_summary$coefficients[2, 2]
  slope_lower = slope_est - 1.96 * slope_se
  slope_upper = slope_est + 1.96 * slope_se
  
  #  results 
  new_row = data.frame(
    Quantile = tau,
    Intercept = round(coef(qr_model)[1], 3),
    Slope = round(coef(qr_model)[2], 3)
  )
  qr_results = rbind(qr_results, new_row)
  
  # results with bootstrap CIs
  new_row_ci = data.frame(
    Quantile = tau,
    CI_Lower = round(slope_lower, 3),
    CI_Upper = round(slope_upper, 3)
  )
  qr_results_ci = rbind(qr_results_ci, new_row_ci)
}

print(qr_results)
##              Quantile Intercept Slope
## (Intercept)      0.10    -2.508 0.586
## (Intercept)1     0.25    -2.185 0.832
## (Intercept)2     0.50    -1.786 1.266
## (Intercept)3     0.75    -1.130 2.075
## (Intercept)4     0.90    -0.342 3.031
print(qr_results_ci)
##   Quantile CI_Lower CI_Upper
## 1     0.10    0.295    0.877
## 2     0.25    0.636    1.029
## 3     0.50    1.137    1.395
## 4     0.75    1.669    2.482
## 5     0.90    2.655    3.408

Step 3b: What do these coefficients mean?

From the results above, different equations exist for different quantiles:

If the slopes are different across quantiles, this confirms heteroskedasticity - the strength of the AUC-IC50 relationship depends on where a cell line falls in the sensitivity distribution.

Bootstrap confidence interval interpretations:

The bootstrap CIs provide evidence that these slope differences are real, not just sampling variation. Looking at the bootstrap CI results:

These CIs do not overlap,particularly between τ = 0.10 and τ = 0.90. This non-overlap means: 1. The slope differences are statistically significant at the 95% CI 2. The drug behaves fundamentally differently for sensitive versus resistant cell lines (tumor types) 3. Using a single OLS slope would be misleading - it would overestimate the relationship for sensitive cells and underestimate it for resistant cells

Step 3c: Visualizing the quantile regression lines

#  plot
plot(litro$auc_log10, litro$ic50_log10,
     col = "lightgray", pch = 16,
     xlab = "AUC(log10) - Predictor",
     ylab = "IC50(log10) - Response",
     main = "Quantile Regression: Litronesib IC50 vs AUC")

# lines 
abline(ols_model, col = "red", lwd = 2)
qr_colors = c("lightblue", "skyblue", "blue", "darkblue", "navy")
for(i in 1:length(quantiles)) {
  qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = quantiles[i])
  abline(qr_model, col = qr_colors[i], lwd = 2)
}

# legend
legend("topleft", 
       c(paste("Quantile τ =", quantiles), "OLS mean"),
       col = c(qr_colors, "red"),
       lty = c(rep(1, 5), 2),
       lwd = 2, cex = 0.7)

Step 3d: Compare slopes across quantiles

plot(qr_results$Quantile, qr_results$Slope,
     type = "b", col = "blue", lwd = 2, pch = 16,
     xlab = "Quantile", ylab = "Slope",
     main = "How Slopes Change Across Quantiles")


# legend
legend("topright", c("Quantile slopes"), 
       col = c("blue"), lwd = 2, pch = c(16, NA))

Key questions to answer:

  1. Are slopes different across quantiles? If slopes at τ = 0.90 and τ = 0.10 differ substantially, this confirms our suspision that high-sensitivity and low-sensitivity cell lines respond differently to changes in drug exposure.

  2. Does the median slope (τ = 0.50) differ from the OLS mean slope? If yes, the distribution is skewed, and OLS may give misleading average results.

Limitations:

  1. Both IC50 and AUC are derived from the same dose-response curve, making their relationship correlational rather than causal. Neither metric truly predicts the other; they represent different summary statistics of the same underlying drug response.

  2. The R quantreg package assumes linear relationships at each quantile on the log10 scale. While this appears reasonable for litronesib, non-linear patterns at extreme quantiles might be missed. Smoothing-based quantile regression methods could reveal more complex relationships.

  3. This analysis treats all cell lines equally, but some tumor types are overrepresented with 98 lung samples versus 33 esophagus/stomach samples. This imbalance might bias the overall quantile estimates toward patterns in better-represented tumor types. While we demonstrate heteroskedasticity exists and varies by tumor type, we don’t identify the biological mechanisms driving this variance. The varying spread could result from differences in drug uptake, metabolism, target expression, or resistance mechanisms that aren’t measured in this dataset.