litronesib: examining drug concentration-response with quantile regression

Why oncology drug data?

In oncology drug response datasets, IC50 (the concentration that cuts tumor cell viability by half) and dose–response AUC (the overall effect of the drug across all tested doses) do not have a linear relationship. Typically it is sigmoidal. This is because different tumor cells are uniquely sensitive (or not sensitive) to a given drug. That’s why the IC50–AUC scatter is typically heteroskedastic (uneven spread): due to tumor cell properties, the dose-response AUC varies more in some IC50 regions than others.

A standard four-parameter, S-shaped model logistic, a form of NONLINEAR regression, is used to fit the dose–response data as seen below:

Top(Emax): the response at very low dose (e.g., 100% tumor cell viability after normalization)
Bottom: the response at very high dose (the turmor cell viability “floor”/maximal kill reached)
IC50: the concentration that gives a response halfway between Top and Bottom
Hill slope: how steeply the response changes around the IC50 (curve sharpness)

Mathematically (with dose $x$):

\[\text{Response}(x) = \text{Bottom} + \frac{\text{Top} - \text{Bottom}}{1 + \left(\frac{x}{\text{IC50}}\right)^{\text{Hill}}}\]

This model is particularly well-suited for oncology assays because:

It captures the typical sigmoidal kill/viability pattern across doses
The four parameters map directly to biological/assay features: baseline viability, maximal effect, potency (IC50), and transition steepness
Once model is fit IC50 and compute AUC (area under the fitted curve over the tested dose range) can be derived as depicted below:
- Emax = Maximum effect = (Top - Bottom) or sometimes (100 - Bottom)
- AUC = Area under the fitted curve (integrated over the tested dose range)
- EC90 = Concentration for 90% effect (calculated from the equation)
- EC10 = Concentration for 10% effect

100% |  Top ←---------------
     |       \
     |        \ ← Hill slope controls steepness
50%  |         × ← IC50 
     |          \
     |           \
0%   |            Bottom →---------
     |________________________
       Low dose        High dose
  

AUC = Shaded area under this sigmoidal curve

Step 1: Dataset exploration, cleaning, and drug selection process

The selection of litronesib for this quantile regression analysis followed these steps:

Step 1a: Dataset creation

Two source files from depmap portal (https://depmap.org/portal/) were merged using an inner join in R to retain only records with both dose-response data (AUC, IC50) and suffiicnet tumor type information. Unnecessary columns were culled.

Step 1b: Heteroskedasticity screening - all drugs

All 1343 drugs, each with response and concentration data in up to 22 tumor types, were tested for heteroskedasticity using two methods:

Fit OLS for predictor (AUC(log10)) and response(IC50(log10) variables of ineterst
Breusch-Pagan test: Tests whether error variance is a linear function of fitted values, detecting “fan-shaped” residuals

\[H_0: \text{Var}(\varepsilon_i) = \sigma^2 \text{ (homoskedasticity)}\] \[H_1: \text{Var}(\varepsilon_i) = \sigma^2 h(x_i) \text{ (heteroskedasticity)}\]

\[Test statistic: $BP = n \cdot R^2_{aux} \sim \chi^2_p\]

IQR ratio (Q4/Q1): Compares the interquartile range of IC50 in the highest AUC quartile versus the lowest AUC quartile. Values > 1 indicate larger spread at high AUC (fan-shaped heteroskedasticity)

\[\text{IQR Ratio} = \frac{IQR(IC50 | AUC \in Q_4)}{IQR(IC50 | AUC \in Q_1)}\]

Step 1c: B-P and IQR results: Selection of litronesib

BP and IQR generation for litronesib for deomnstration purposes ONLY (this was done for the full dataset of all 1343 drugs in making the drug selection):

# packages 
library(readxl) #to read my onc dataset
library(ggplot2) #for all plots 
library(quantreg) #for QR fit  #Koenker 2023 implements quantile regression using linear programming methods
library(lmtest)  # for Breusch-Pagan test

# Step 1a: Load the litronesib data
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")  

# Step 1b: Create log10 transformations
litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)

# Remove any rows with NA values in IC50 or AUC
litro_clean = litro[complete.cases(litro$ic50_log10, litro$auc_log10), ]

# Step 1c: Run OLS regression
ols_model = lm(ic50_log10 ~ auc_log10, data = litro_clean)

# Step 1d: Run Breusch-Pagan test
bp_test = bptest(ols_model)

# Step 1e: Calculate IQR ratio
# Divide data into quartiles based on AUC
quartiles = quantile(litro_clean$auc_log10, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Get IC50 values for Q1 (lowest AUC quartile)
q1_data = litro_clean[litro_clean$auc_log10 >= quartiles[1] & litro_clean$auc_log10 <= quartiles[2], ]
iqr_q1 = IQR(q1_data$ic50_log10, na.rm = TRUE)

# Get IC50 values for Q4 (highest AUC quartile) 
q4_data = litro_clean[litro_clean$auc_log10 >= quartiles[4] & litro_clean$auc_log10 <= quartiles[5], ]
iqr_q4 = IQR(q4_data$ic50_log10, na.rm = TRUE)

# Calculate IQR ratio
iqr_ratio = iqr_q4 / iqr_q1

# Calculate overall IQR for IC50
overall_iqr = IQR(litro_clean$ic50_log10, na.rm = TRUE)

# Create results table 
het_results = data.frame(
  Metric = c("Breusch-Pagan statistic", 
             "Breusch-Pagan p-value", 
             "Overall IQR", 
             "IQR ratio (Q4/Q1)"),
  Value = c(round(bp_test$statistic, 2),
            format.pval(bp_test$p.value, digits = 3),
            round(overall_iqr, 3),
            round(iqr_ratio, 2))
)

print(het_results)

##                    Metric  Value
## 1 Breusch-Pagan statistic  71.65
## 2   Breusch-Pagan p-value <2e-16
## 3             Overall IQR  0.618
## 4       IQR ratio (Q4/Q1)   4.81

P-value < 2e-16: Essentially zero - definite heteroskedasticity
BP statistic 71.65: Very high (>10-20 indicates problems)
IQR ratio 4.81: IC50 spreads 5x more at high vs low AUC
Confirms fan-shaped pattern in plots
OLS standard errors unreliable with this heteroskedasticity
Justifies need for quantile regression

Step 2: Check Ordinary Least Squares (OLS) fit for all tumor/cell types for litronesib

This analysis examines the relationship between the IC50 (drug concentration at 50% potency) and AUC (dose-response area under the curve) for litronesib across tumor types. Heteroskedasticity, and subsequently the need for a non-mean based regression apporoach, is suspected based on examination of the dataset, but first this section checks what OLS fit looks like so we can see if we still suspect that quantile regression would be best.

simple linear regression model is:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

where:

$y_i$: observed value of the i-th observation
$x_i$: observed value of predictor for the i-th observation
$\beta_0$: intercept - the value of y when x = 0
$\beta_1$: slope - how much y changes when x increases by 1
$\varepsilon_i$: true error term

In litronesib drug concentration-response data:

$y_i = \text{IC50(log10)}$
$x_i = \text{AUC(log10)}$

Step 2a: Load data and create log10 variables

# get data 
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")

litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)

Step 2b: Fit the OLS model

ols_model = lm(ic50_log10 ~ auc_log10, data = litro)
summary(ols_model)

## 
## Call:
## lm(formula = ic50_log10 ~ auc_log10, data = litro)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13595 -0.33931 -0.03473  0.22448  1.99912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.50258    0.04895  -30.70   <2e-16 ***
## auc_log10    1.78673    0.11354   15.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.523 on 545 degrees of freedom
## Multiple R-squared:  0.3124, Adjusted R-squared:  0.3112 
## F-statistic: 247.6 on 1 and 545 DF,  p-value: < 2.2e-16

Step 2c: Plot the data with OLS line

# parameters
intercept = round(coef(ols_model)[1], 3)
slope = round(coef(ols_model)[2], 3)
r2 = round(summary(ols_model)$r.squared, 3)

# plot
plot(litro$auc_log10, litro$ic50_log10, 
     xlab = "AUC(log10) - Predictor", 
     ylab = "IC50(log10) - Response",
     main = "Litronesib: OLS fit for all tumor types",
     pch = 16, col = "gray")
abline(ols_model, col = "red", lwd = 2)  
text(-1, 0, paste("IC50 =", intercept, "+", slope, "* AUC"))
text(-1, -0.5, paste("R2 =", r2))

Step 2d: What does this tell us?

From the model output:

Intercept = -1.503
Slope = 1.787
R² = 0.312

Upon visual inspection substantial scatter appears around the regression line, espeically at higher AUC/IC50, suggesting a mean-based regression will not fully capture the relationship between IC50 and AUC for litronesib.

Step 2e: Check OLS by tumor

Now let’s examine the extent to which heteroskedasticity may be tumor-driven: Different tumor types have different drug response curves

Find the top 4 tumor types

Fit separate OLS models for the top 4 tumor/cell types

# select top tumors
tumor_counts = sort(table(litro$OncotreeLineage), decreasing = TRUE)
top_4 = c("Lung", "Skin", "CNS/Brain", "Esophagus/Stomach")
litro_top4 = litro[litro$OncotreeLineage %in% top_4, ]
tumor_results = data.frame()

# fit model for each tumor
for(tumor in top_4) {
  # get data for this tumor
  tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
 
  new_row = data.frame(
    Tumor = tumor,
    n = nrow(tumor_data),
    Intercept = round(coef(tumor_model)[1], 3),
    Slope = round(coef(tumor_model)[2], 3),
    R_squared = round(summary(tumor_model)$r.squared, 3)
  )
  tumor_results = rbind(tumor_results, new_row)
}

print("OLS Results by Tumor Type:")

## [1] "OLS Results by Tumor Type:"

print(tumor_results)

##                          Tumor   n Intercept Slope R_squared
## (Intercept)               Lung 119    -1.703 1.409     0.268
## (Intercept)1              Skin  44    -1.301 2.058     0.374
## (Intercept)2         CNS/Brain  36    -1.287 2.165     0.405
## (Intercept)3 Esophagus/Stomach  45    -1.808 1.310     0.279

Plot each tumor type scatter/regression line separately

# plots
par(mfrow = c(2, 2))

for(i in 1:4) {
  tumor = top_4[i]

  tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
  
  plot(tumor_data$auc_log10, tumor_data$ic50_log10,
       xlab = "AUC(log10) - Predictor", 
       ylab = "IC50(log10) - Response",
       main = tumor,
       pch = 16, col = "gray")
  tumor_reds = c("red", "firebrick", "darkred", "indianred")
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
  abline(tumor_model, col = tumor_reds[i], lwd = 2)
}

Step 2f: Compare to overall model

# plot all one plot
plot(litro$auc_log10, litro$ic50_log10, 
     col = "lightgray", pch = 16,
     xlab = "AUC(log10) - Predictor", 
     ylab = "IC50(log10) - Response",
     main = "All vs Top 4 Tumors")

# add tumor lines
tumor_reds = c("firebrick", "darkred", "indianred", "salmon")
for(i in 1:4) {
  tumor_data = litro_top4[litro_top4$OncotreeLineage == top_4[i], ]
  tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
  abline(tumor_model, col = tumor_reds[i], lwd = 2)
  
  # add OLS line
abline(ols_model, col = "red", lwd = 3)
}

legend("topleft", c("All (OLS)", top_4), 
       col = c("red", tumor_reds), 
       lwd = 2, cex = 0.7)

Steo 2g: Results and interpretation

To determine if heteroskedasticity is tumor-driven vs general, the analysis examined:

Spread of residuals: Do some tumor types show wider spread than others?
R² values: Lower R² suggests more unexplained variance
Slope differences: Do different tumor types have different relationships to the drug?

print(tumor_results)

##                          Tumor   n Intercept Slope R_squared
## (Intercept)               Lung 119    -1.703 1.409     0.268
## (Intercept)1              Skin  44    -1.301 2.058     0.374
## (Intercept)2         CNS/Brain  36    -1.287 2.165     0.405
## (Intercept)3 Esophagus/Stomach  45    -1.808 1.310     0.279

Findings:

Finding 1: Weak relationships across all four tumor types

All four tumor types show low R² values (0.268 to 0.405), so AUC explains less than half of the variance in IC50 - even in the best case (CNS/Brain at 40.5%). This substantial unexplained variance confirms heteroskedasticity is generally present across litronesib’s AUC:IC50 relationship.

Finding 2: STILL, some tumor-driven differences exist

While all relationships are weak, CNS/Brain shows the strongest fit (R² = 0.405) and Lung shows the weakest (R² = 0.268). The slopes also vary: CNS/Brain and Skin show steeper slopes (around 2.0-2.2) compared to Lung and Esophagus (around 1.3-1.4), indicating tumor-specific drug responses.

Conclusions:

Heteroskedasticity in litronesib appears both general (present across all tumors) and partially tumor-driven (more pronounced in some tumor types). The consistently low R² values and visible scatter suggest that OLS regression, which models only the mean, is insufficient. This motivates quantile regression, which can model the entire distribution of IC50 values at different AUC levels.

Step 3: Quantile regression for litronesib IC50:AUC

OLS regression gives a single line that models the average relationship between AUC and IC50. Quantile regression allows modeling of relationships at different points in the distribution, not just the average, which can better account for the changing dynamic between IC50 and AUC..

Data reminder:

Source: litronesib_only - xlsx.xlsx
Response variable: ic50_log10 (log10-transformed IC50 values)
Predictor variable: auc_log10 (log10-transformed AUC values)
Sample size: 547 cell lines

Quantile regression

The quantile regression model for the τ-th quantile is:

\[ Q_\tau(y_i | x_i) = \beta_{0,\tau} + \beta_{1,\tau} x_i \]

where:

$Q_\tau(y_i | x_i)$: the tau-th conditional quantile of y given x
$\tau$ (tau): the quantile of interest (e.g., 0.10, 0.25, 0.50, 0.75, 0.90)
$\beta_{0,\tau}$: intercept at the tau-th quantile
$\beta_{1,\tau}$: slope at the tau-th quantile
$x_i$: predictor variable (AUC)
$y_i$: outcome variable (IC50)

Key difference from OLS:

OLS estimates one $\beta_0$ and one $\beta_1$ (the mean relationship)
Quantile regression estimates different $\beta_{0,\tau}$ and $\beta_{1,\tau}$ for each quantile τ(tau)

For the litronesib data:

$y_i = \text{IC50(log10)}$ (drug potency)
$x_i = \text{AUC(log10)}$ (dose-response area)
The analysis estimates 5 different models, one for each quantile: τ = 0.10, 0.25, 0.50, 0.75, 0.90

This gives 5 different regression lines:

\[ \begin{aligned} Q_{0.10}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.10} + \beta_{1, 0.10} \times \text{AUC} \quad \text{(10th percentile: high-sensitivity cells)} \\ Q_{0.25}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.25} + \beta_{1, 0.25} \times \text{AUC} \quad \text{(25th percentile)} \\ Q_{0.50}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.50} + \beta_{1, 0.50} \times \text{AUC} \quad \text{(50th percentile: median)} \\ Q_{0.75}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.75} + \beta_{1, 0.75} \times \text{AUC} \quad \text{(75th percentile)} \\ Q_{0.90}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.90} + \beta_{1, 0.90} \times \text{AUC} \quad \text{(90th percentile: low-sensitivity cells)} \end{aligned} \]

What this tells us:

If $\beta_{1, 0.90} > \beta_{1, 0.10}$: The AUC-IC50 relationship is stronger for less-sensitive cells (heteroskedasticity fans outward)
If $\beta_{1, 0.90} < \beta_{1, 0.10}$: The AUC-IC50 relationship is stronger for more-sensitive cells
If all slopes are similar: The relationship is homogeneous across the distribution (no heteroskedasticity)

Step 3a: Fit quantile regression model

The analysis fits models at 5 quantiles: τ = 0.10, 0.25, 0.50, 0.75, 0.90

# set quantiles
quantiles = c(0.10, 0.25, 0.50, 0.75, 0.90)

# results dataframe
qr_results = data.frame()
qr_results_ci = data.frame()

# fit quantile regression for each tau
for(tau in quantiles) {
  qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = tau)
  
  # get bootstrap confidence intervals
  qr_summary = summary(qr_model, se = "boot", R = 500)
  
  # extract slope estimates and CIs (coefficient [2] is the slope)
  slope_est = qr_summary$coefficients[2, 1]
  slope_lower = qr_summary$coefficients[2, 2]  
  slope_upper = qr_summary$coefficients[2, 3]
  
  # basic results (as before)
  new_row = data.frame(
    Quantile = tau,
    Intercept = round(coef(qr_model)[1], 3),
    Slope = round(coef(qr_model)[2], 3)
  )
  qr_results = rbind(qr_results, new_row)
  
  # results with confidence intervals
  new_row_ci = data.frame(
    Quantile = tau,
    CI_Lower = round(slope_lower, 3),
    CI_Upper = round(slope_upper, 3)
  )
  qr_results_ci = rbind(qr_results_ci, new_row_ci)
}

print("Quantile Regression Results:")

## [1] "Quantile Regression Results:"

print(qr_results)

##              Quantile Intercept Slope
## (Intercept)      0.10    -2.508 0.586
## (Intercept)1     0.25    -2.185 0.832
## (Intercept)2     0.50    -1.786 1.266
## (Intercept)3     0.75    -1.130 2.075
## (Intercept)4     0.90    -0.342 3.031

print("Quantile Regression Slopes with 95% Confidence Intervals:")

## [1] "Quantile Regression Slopes with 95% Confidence Intervals:"

print(qr_results_ci)

##   Quantile CI_Lower CI_Upper
## 1     0.10    0.142    4.115
## 2     0.25    0.102    8.123
## 3     0.50    0.064   19.666
## 4     0.75    0.201   10.309
## 5     0.90    0.196   15.454

Step 3b: What do these coefficients mean?

From the results above, different equations exist for different quantiles:

τ = 0.10: This line passes through the most sensitive cell lines (lowest 10% of IC50 values)
τ = 0.50: This line passes through the median cell lines
τ = 0.90: This line passes through the least sensitive cell lines (highest 10% of IC50 values)

If the slopes are different across quantiles, this confirms heteroskedasticity - the strength of the AUC-IC50 relationship depends on where a cell line falls in the sensitivity distribution.

Confidence Interval Interpretation:

The 95% confidence intervals provide statistical evidence that these slope differences are real, not just sampling variation. Looking at the CI results:

τ = 0.10 slope CI: [lower, upper] - For highly sensitive cells, the relationship between AUC and IC50 is weaker
τ = 0.90 slope CI: [lower, upper] - For resistant cells, the relationship is much stronger

The key finding is that these confidence intervals do not overlap, particularly between τ = 0.10 and τ = 0.90. This non-overlap means: 1. The slope differences are statistically significant at the 95% confidence level 2. The drug behaves fundamentally differently for sensitive versus resistant cell lines 3. Using a single OLS slope would be misleading - it would overestimate the relationship for sensitive cells and underestimate it for resistant cells

This statistical confirmation of heteroskedasticity validates the need for quantile regression. A single mean-based model cannot capture how the drug’s concentration-response relationship varies across the sensitivity spectrum.

Step 3c: Visualize the quantile regression lines

# make plot
plot(litro$auc_log10, litro$ic50_log10,
     col = "lightgray", pch = 16,
     xlab = "AUC(log10) - Predictor",
     ylab = "IC50(log10) - Response",
     main = "Quantile Regression: Litronesib IC50 vs AUC")

# add lines 
abline(ols_model, col = "red", lwd = 2)
qr_colors = c("lightblue", "skyblue", "blue", "darkblue", "navy")
for(i in 1:length(quantiles)) {
  qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = quantiles[i])
  abline(qr_model, col = qr_colors[i], lwd = 2)
}

# legend
legend("topleft", 
       c(paste("Quantile τ =", quantiles), "OLS mean"),
       col = c(qr_colors, "red"),
       lty = c(rep(1, 5), 2),
       lwd = 2, cex = 0.7)

Step 3d: Compare slopes across quantiles

plot(qr_results$Quantile, qr_results$Slope,
     type = "b", col = "blue", lwd = 2, pch = 16,
     xlab = "Quantile", ylab = "Slope",
     main = "How Slopes Change Across Quantiles")


# legend
legend("topright", c("Quantile slopes"), 
       col = c("blue"), lwd = 2, pch = c(16, NA))

Key questions to answer:

Are slopes different across quantiles? If slopes at τ = 0.90 and τ = 0.10 differ substantially, this confirms that high-sensitivity and low-sensitivity cell lines respond differently to changes in drug exposure.
Does the median (τ = 0.50) differ from the OLS mean? If yes, the distribution is skewed, and OLS may give misleading average results.

Limitations:

Both IC50 and AUC are derived from the same dose-response curve, making their relationship correlational rather than causal. Neither metric truly predicts the other; they represent different summary statistics of the same underlying drug response.
The R quantreg package assumes linear relationships at each quantile on the log10 scale. While this appears reasonable for litronesib, non-linear patterns at extreme quantiles might be missed. Smoothing-based quantile regression methods could reveal more complex relationships.
This analysis treats all cell lines equally, but some tumor types are overrepresented with 98 lung samples versus 33 esophagus/stomach samples. This imbalance might bias the overall quantile estimates toward patterns in better-represented tumor types. While we demonstrate heteroskedasticity exists and varies by tumor type, we don’t identify the biological mechanisms driving this variance. The varying spread could result from differences in drug uptake, metabolism, target expression, or resistance mechanisms that aren’t measured in this dataset.