In oncology drug response datasets, IC50 (the concentration that cuts tumor cell viability by half) and dose–response AUC (the overall effect of the drug across all tested doses) do not have a linear relationship. Typically it is sigmoidal. This is because different tumor cells are uniquely sensitive (or not sensitive) to a given drug. That’s why the IC50–AUC scatter is typically heteroskedastic (uneven spread): due to tumor cell properties, the dose-response AUC varies more in some IC50 regions than others.
A standard four-parameter, S-shaped model logistic, a form of NONLINEAR regression, is used to fit the dose–response data as seen below:
Mathematically (with dose \(x\)):
\[\text{Response}(x) = \text{Bottom} + \frac{\text{Top} - \text{Bottom}}{1 + \left(\frac{x}{\text{IC50}}\right)^{\text{Hill}}}\]
This model is particularly well-suited for oncology assays because:
100% | Top ←---------------
| \
| \ ← Hill slope controls steepness
50% | × ← IC50
| \
| \
0% | Bottom →---------
|________________________
Low dose High dose
AUC = Shaded area under this sigmoidal curve
The selection of litronesib for this quantile regression analysis followed these steps:
Step 1a: Dataset creation
Two source files from depmap portal (https://depmap.org/portal/) were merged using an inner join in R to retain only records with both dose-response data (AUC, IC50) and suffiicnet tumor type information. Unnecessary columns were culled.
Step 1b: Heteroskedasticity screening - all drugs
All 1343 drugs, each with response and concentration data in up to 22 tumor types, were tested for heteroskedasticity using two methods:
Fit OLS for predictor (AUC(log10)) and response(IC50(log10) variables of ineterst
Breusch-Pagan test: Tests whether error variance is a linear function of fitted values, detecting “fan-shaped” residuals
\[H_0: \text{Var}(\varepsilon_i) = \sigma^2 \text{ (homoskedasticity)}\] \[H_1: \text{Var}(\varepsilon_i) = \sigma^2 h(x_i) \text{ (heteroskedasticity)}\]
\[Test statistic: $BP = n \cdot R^2_{aux} \sim \chi^2_p\]
\[\text{IQR Ratio} = \frac{IQR(IC50 | AUC \in Q_4)}{IQR(IC50 | AUC \in Q_1)}\]
Step 1c: B-P and IQR results: Selection of litronesib
BP and IQR generation for litronesib for deomnstration purposes ONLY (this was done for the full dataset of all 1343 drugs in making the drug selection):
# packages
library(readxl) #to read my onc dataset
library(ggplot2) #for all plots
library(quantreg) #for QR fit #Koenker 2023 implements quantile regression using linear programming methods
library(lmtest) # for Breusch-Pagan test
# Step 1a: Load the litronesib data
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")
# Step 1b: Create log10 transformations
litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)
# Remove any rows with NA values in IC50 or AUC
litro_clean = litro[complete.cases(litro$ic50_log10, litro$auc_log10), ]
# Step 1c: Run OLS regression
ols_model = lm(ic50_log10 ~ auc_log10, data = litro_clean)
# Step 1d: Run Breusch-Pagan test
bp_test = bptest(ols_model)
# Step 1e: Calculate IQR ratio
# Divide data into quartiles based on AUC
quartiles = quantile(litro_clean$auc_log10, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
# Get IC50 values for Q1 (lowest AUC quartile)
q1_data = litro_clean[litro_clean$auc_log10 >= quartiles[1] & litro_clean$auc_log10 <= quartiles[2], ]
iqr_q1 = IQR(q1_data$ic50_log10, na.rm = TRUE)
# Get IC50 values for Q4 (highest AUC quartile)
q4_data = litro_clean[litro_clean$auc_log10 >= quartiles[4] & litro_clean$auc_log10 <= quartiles[5], ]
iqr_q4 = IQR(q4_data$ic50_log10, na.rm = TRUE)
# Calculate IQR ratio
iqr_ratio = iqr_q4 / iqr_q1
# Calculate overall IQR for IC50
overall_iqr = IQR(litro_clean$ic50_log10, na.rm = TRUE)
# Create results table
het_results = data.frame(
Metric = c("Breusch-Pagan statistic",
"Breusch-Pagan p-value",
"Overall IQR",
"IQR ratio (Q4/Q1)"),
Value = c(round(bp_test$statistic, 2),
format.pval(bp_test$p.value, digits = 3),
round(overall_iqr, 3),
round(iqr_ratio, 2))
)
print(het_results)
## Metric Value
## 1 Breusch-Pagan statistic 71.65
## 2 Breusch-Pagan p-value <2e-16
## 3 Overall IQR 0.618
## 4 IQR ratio (Q4/Q1) 4.81
This analysis examines the relationship between the IC50 (drug concentration at 50% potency) and AUC (dose-response area under the curve) for litronesib across tumor types. Heteroskedasticity, and subsequently the need for a non-mean based regression apporoach, is suspected based on examination of the dataset, but first this section checks what OLS fit looks like so we can see if we still suspect that quantile regression would be best.
simple linear regression model is:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
where:
In litronesib drug concentration-response data:
Step 2a: Load data and create log10 variables
# get data
litro = read_xlsx("C:/Users/amcewen/OneDrive - Bentley University/Desktop/quantile regression data/litronesib_only - xlsx.xlsx")
litro$ic50_log10 = log10(litro$ic50)
litro$auc_log10 = log10(litro$auc)
Step 2b: Fit the OLS model
ols_model = lm(ic50_log10 ~ auc_log10, data = litro)
summary(ols_model)
##
## Call:
## lm(formula = ic50_log10 ~ auc_log10, data = litro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13595 -0.33931 -0.03473 0.22448 1.99912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.50258 0.04895 -30.70 <2e-16 ***
## auc_log10 1.78673 0.11354 15.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.523 on 545 degrees of freedom
## Multiple R-squared: 0.3124, Adjusted R-squared: 0.3112
## F-statistic: 247.6 on 1 and 545 DF, p-value: < 2.2e-16
Step 2c: Plot the data with OLS line
# parameters
intercept = round(coef(ols_model)[1], 3)
slope = round(coef(ols_model)[2], 3)
r2 = round(summary(ols_model)$r.squared, 3)
# plot
plot(litro$auc_log10, litro$ic50_log10,
xlab = "AUC(log10) - Predictor",
ylab = "IC50(log10) - Response",
main = "Litronesib: OLS fit for all tumor types",
pch = 16, col = "gray")
abline(ols_model, col = "red", lwd = 2)
text(-1, 0, paste("IC50 =", intercept, "+", slope, "* AUC"))
text(-1, -0.5, paste("R2 =", r2))
Step 2d: What does this tell us?
From the model output:
Upon visual inspection substantial scatter appears around the regression line, espeically at higher AUC/IC50, suggesting a mean-based regression will not fully capture the relationship between IC50 and AUC for litronesib.
Step 2e: Check OLS by tumor
Now let’s examine the extent to which heteroskedasticity may be tumor-driven: Different tumor types have different drug response curves
Find the top 4 tumor types
Fit separate OLS models for the top 4 tumor/cell types
# select top tumors
tumor_counts = sort(table(litro$OncotreeLineage), decreasing = TRUE)
top_4 = c("Lung", "Skin", "CNS/Brain", "Esophagus/Stomach")
litro_top4 = litro[litro$OncotreeLineage %in% top_4, ]
tumor_results = data.frame()
# fit model for each tumor
for(tumor in top_4) {
# get data for this tumor
tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
new_row = data.frame(
Tumor = tumor,
n = nrow(tumor_data),
Intercept = round(coef(tumor_model)[1], 3),
Slope = round(coef(tumor_model)[2], 3),
R_squared = round(summary(tumor_model)$r.squared, 3)
)
tumor_results = rbind(tumor_results, new_row)
}
print("OLS Results by Tumor Type:")
## [1] "OLS Results by Tumor Type:"
print(tumor_results)
## Tumor n Intercept Slope R_squared
## (Intercept) Lung 119 -1.703 1.409 0.268
## (Intercept)1 Skin 44 -1.301 2.058 0.374
## (Intercept)2 CNS/Brain 36 -1.287 2.165 0.405
## (Intercept)3 Esophagus/Stomach 45 -1.808 1.310 0.279
Plot each tumor type scatter/regression line separately
# plots
par(mfrow = c(2, 2))
for(i in 1:4) {
tumor = top_4[i]
tumor_data = litro_top4[litro_top4$OncotreeLineage == tumor, ]
plot(tumor_data$auc_log10, tumor_data$ic50_log10,
xlab = "AUC(log10) - Predictor",
ylab = "IC50(log10) - Response",
main = tumor,
pch = 16, col = "gray")
tumor_reds = c("red", "firebrick", "darkred", "indianred")
tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
abline(tumor_model, col = tumor_reds[i], lwd = 2)
}
Step 2f: Compare to overall model
# plot all one plot
plot(litro$auc_log10, litro$ic50_log10,
col = "lightgray", pch = 16,
xlab = "AUC(log10) - Predictor",
ylab = "IC50(log10) - Response",
main = "All vs Top 4 Tumors")
# add tumor lines
tumor_reds = c("firebrick", "darkred", "indianred", "salmon")
for(i in 1:4) {
tumor_data = litro_top4[litro_top4$OncotreeLineage == top_4[i], ]
tumor_model = lm(ic50_log10 ~ auc_log10, data = tumor_data)
abline(tumor_model, col = tumor_reds[i], lwd = 2)
# add OLS line
abline(ols_model, col = "red", lwd = 3)
}
legend("topleft", c("All (OLS)", top_4),
col = c("red", tumor_reds),
lwd = 2, cex = 0.7)
Steo 2g: Results and interpretation
To determine if heteroskedasticity is tumor-driven vs general, the analysis examined:
print(tumor_results)
## Tumor n Intercept Slope R_squared
## (Intercept) Lung 119 -1.703 1.409 0.268
## (Intercept)1 Skin 44 -1.301 2.058 0.374
## (Intercept)2 CNS/Brain 36 -1.287 2.165 0.405
## (Intercept)3 Esophagus/Stomach 45 -1.808 1.310 0.279
Findings:
Finding 1: Weak relationships across all four tumor types
All four tumor types show low R² values (0.268 to 0.405), so AUC explains less than half of the variance in IC50 - even in the best case (CNS/Brain at 40.5%). This substantial unexplained variance confirms heteroskedasticity is generally present across litronesib’s AUC:IC50 relationship.
Finding 2: STILL, some tumor-driven differences exist
While all relationships are weak, CNS/Brain shows the strongest fit (R² = 0.405) and Lung shows the weakest (R² = 0.268). The slopes also vary: CNS/Brain and Skin show steeper slopes (around 2.0-2.2) compared to Lung and Esophagus (around 1.3-1.4), indicating tumor-specific drug responses.
Conclusions:
Heteroskedasticity in litronesib appears both general (present across all tumors) and partially tumor-driven (more pronounced in some tumor types). The consistently low R² values and visible scatter suggest that OLS regression, which models only the mean, is insufficient. This motivates quantile regression, which can model the entire distribution of IC50 values at different AUC levels.
OLS regression gives a single line that models the average relationship between AUC and IC50. Quantile regression allows modeling of relationships at different points in the distribution, not just the average, which can better account for the changing dynamic between IC50 and AUC..
Data reminder:
Quantile regression
The quantile regression model for the τ-th quantile is:
\[ Q_\tau(y_i | x_i) = \beta_{0,\tau} + \beta_{1,\tau} x_i \]
where:
Key difference from OLS:
For the litronesib data:
This gives 5 different regression lines:
\[ \begin{aligned} Q_{0.10}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.10} + \beta_{1, 0.10} \times \text{AUC} \quad \text{(10th percentile: high-sensitivity cells)} \\ Q_{0.25}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.25} + \beta_{1, 0.25} \times \text{AUC} \quad \text{(25th percentile)} \\ Q_{0.50}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.50} + \beta_{1, 0.50} \times \text{AUC} \quad \text{(50th percentile: median)} \\ Q_{0.75}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.75} + \beta_{1, 0.75} \times \text{AUC} \quad \text{(75th percentile)} \\ Q_{0.90}(\text{IC50} | \text{AUC}) &= \beta_{0, 0.90} + \beta_{1, 0.90} \times \text{AUC} \quad \text{(90th percentile: low-sensitivity cells)} \end{aligned} \]
What this tells us:
Step 3a: Fit quantile regression model
The analysis fits models at 5 quantiles: τ = 0.10, 0.25, 0.50, 0.75, 0.90
# set quantiles
quantiles = c(0.10, 0.25, 0.50, 0.75, 0.90)
# results dataframe
qr_results = data.frame()
qr_results_ci = data.frame()
# fit quantile regression for each tau
for(tau in quantiles) {
qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = tau)
# get bootstrap confidence intervals
qr_summary = summary(qr_model, se = "boot", R = 500)
# extract slope estimates and CIs (coefficient [2] is the slope)
slope_est = qr_summary$coefficients[2, 1]
slope_lower = qr_summary$coefficients[2, 2]
slope_upper = qr_summary$coefficients[2, 3]
# basic results (as before)
new_row = data.frame(
Quantile = tau,
Intercept = round(coef(qr_model)[1], 3),
Slope = round(coef(qr_model)[2], 3)
)
qr_results = rbind(qr_results, new_row)
# results with confidence intervals
new_row_ci = data.frame(
Quantile = tau,
CI_Lower = round(slope_lower, 3),
CI_Upper = round(slope_upper, 3)
)
qr_results_ci = rbind(qr_results_ci, new_row_ci)
}
print("Quantile Regression Results:")
## [1] "Quantile Regression Results:"
print(qr_results)
## Quantile Intercept Slope
## (Intercept) 0.10 -2.508 0.586
## (Intercept)1 0.25 -2.185 0.832
## (Intercept)2 0.50 -1.786 1.266
## (Intercept)3 0.75 -1.130 2.075
## (Intercept)4 0.90 -0.342 3.031
print("Quantile Regression Slopes with 95% Confidence Intervals:")
## [1] "Quantile Regression Slopes with 95% Confidence Intervals:"
print(qr_results_ci)
## Quantile CI_Lower CI_Upper
## 1 0.10 0.142 4.115
## 2 0.25 0.102 8.123
## 3 0.50 0.064 19.666
## 4 0.75 0.201 10.309
## 5 0.90 0.196 15.454
Step 3b: What do these coefficients mean?
From the results above, different equations exist for different quantiles:
If the slopes are different across quantiles, this confirms heteroskedasticity - the strength of the AUC-IC50 relationship depends on where a cell line falls in the sensitivity distribution.
Confidence Interval Interpretation:
The 95% confidence intervals provide statistical evidence that these slope differences are real, not just sampling variation. Looking at the CI results:
The key finding is that these confidence intervals do not overlap, particularly between τ = 0.10 and τ = 0.90. This non-overlap means: 1. The slope differences are statistically significant at the 95% confidence level 2. The drug behaves fundamentally differently for sensitive versus resistant cell lines 3. Using a single OLS slope would be misleading - it would overestimate the relationship for sensitive cells and underestimate it for resistant cells
This statistical confirmation of heteroskedasticity validates the need for quantile regression. A single mean-based model cannot capture how the drug’s concentration-response relationship varies across the sensitivity spectrum.
Step 3c: Visualize the quantile regression lines
# make plot
plot(litro$auc_log10, litro$ic50_log10,
col = "lightgray", pch = 16,
xlab = "AUC(log10) - Predictor",
ylab = "IC50(log10) - Response",
main = "Quantile Regression: Litronesib IC50 vs AUC")
# add lines
abline(ols_model, col = "red", lwd = 2)
qr_colors = c("lightblue", "skyblue", "blue", "darkblue", "navy")
for(i in 1:length(quantiles)) {
qr_model = rq(ic50_log10 ~ auc_log10, data = litro, tau = quantiles[i])
abline(qr_model, col = qr_colors[i], lwd = 2)
}
# legend
legend("topleft",
c(paste("Quantile τ =", quantiles), "OLS mean"),
col = c(qr_colors, "red"),
lty = c(rep(1, 5), 2),
lwd = 2, cex = 0.7)
Step 3d: Compare slopes across quantiles
plot(qr_results$Quantile, qr_results$Slope,
type = "b", col = "blue", lwd = 2, pch = 16,
xlab = "Quantile", ylab = "Slope",
main = "How Slopes Change Across Quantiles")
# legend
legend("topright", c("Quantile slopes"),
col = c("blue"), lwd = 2, pch = c(16, NA))
Are slopes different across quantiles? If slopes at τ = 0.90 and τ = 0.10 differ substantially, this confirms that high-sensitivity and low-sensitivity cell lines respond differently to changes in drug exposure.
Does the median (τ = 0.50) differ from the OLS mean? If yes, the distribution is skewed, and OLS may give misleading average results.
Both IC50 and AUC are derived from the same dose-response curve, making their relationship correlational rather than causal. Neither metric truly predicts the other; they represent different summary statistics of the same underlying drug response.
The R quantreg package assumes linear relationships at each quantile on the log10 scale. While this appears reasonable for litronesib, non-linear patterns at extreme quantiles might be missed. Smoothing-based quantile regression methods could reveal more complex relationships.
This analysis treats all cell lines equally, but some tumor types are overrepresented with 98 lung samples versus 33 esophagus/stomach samples. This imbalance might bias the overall quantile estimates toward patterns in better-represented tumor types. While we demonstrate heteroskedasticity exists and varies by tumor type, we don’t identify the biological mechanisms driving this variance. The varying spread could result from differences in drug uptake, metabolism, target expression, or resistance mechanisms that aren’t measured in this dataset.