Does trial sponsor type (commercial vs non-commercial) predict how long Phase 2 respiratory disease trials take to complete?
We examine trials initiated from March 2019 through December 2022, a period spanning the onset and evolution of COVID-19. It is theorized that pandemic-era disruption may have inflated the effect of sponsor type on time-to-completion. Respiratory trials are particularly interesting during this period because:
To do this, we take a sample from the AACT (ClinicalTrials.gov) database.
Regular linear regression (OLS) assumes:
Time-to-trial completion violates both:
A GLM has three components:
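As a minimal sketch of these components in R, assuming the conventional decomposition into a response distribution, a linear predictor, and a link function: a family object bundles the distribution (through its variance function) with the link, while the linear predictor comes from the model formula. The numeric inputs below are illustrative values, not results from this analysis.

```r
# A family object carries two of the GLM components: the response
# distribution (via its variance function) and the link function.
# The third component, the linear predictor, is set by the model formula.
fam <- Gamma(link = "log")

fam$family          # name of the response distribution
fam$link            # name of the link function
fam$linkfun(650)    # link: maps a mean to the linear-predictor scale, log(650)
fam$linkinv(6.5)    # inverse link: maps a linear predictor back to a mean, exp(6.5)
fam$variance(650)   # variance function V(mu) = mu^2 for the Gamma family
```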
The AACT (Aggregate Analysis of ClinicalTrials.gov) database is a publicly available relational database maintained by the Clinical Trials Transformation Initiative (CTTI). It contains all information from ClinicalTrials.gov in a structured format that can be queried using SQL.
To access AACT, you need a free account: https://aact.ctti-clinicaltrials.org/users/sign_up
library(RPostgres) # Database connection to AACT (PostgreSQL)
library(dplyr) # Data manipulation (filtering, grouping, summarizing)
library(ggplot2) # Data visualization (histograms, boxplots, scatterplots)
library(scales) # Formatting for plot axes (commas, percentages)
library(statmod) # Inverse Gaussian utilities (the inverse.gaussian() GLM family itself is in base R's stats package)
library(htmltools) # For embedding HTML diagrams
# ============================================================================
# NOTE: This code chunk connects to the live AACT database.
# Set eval=TRUE to run it (requires internet connection and AACT credentials)
# ============================================================================
username <- "MY USERNAME"
password <- "MY AACT PASSWORD"
con <- dbConnect(
RPostgres::Postgres(),
dbname = "aact",
host = "aact-db.ctti-clinicaltrials.org",
port = 5432,
user = username,
password = password
)
A query with the following parameters was used to pull Phase 2 interventional trials for chronic respiratory conditions during the period of interest.
Included trials for these chronic respiratory conditions:
COPD (Chronic Obstructive Pulmonary Disease)
Asthma
Pulmonary Hypertension
Pulmonary Fibrosis (including Idiopathic Pulmonary Fibrosis/IPF)
Cystic Fibrosis
Interstitial Lung Disease
Bronchiectasis
Emphysema
Chronic Bronchitis
Start date: March 1, 2019 through December 31, 2022
This captures approximately one year pre-pandemic and nearly three years into the pandemic, allowing us to observe how trial operations adapted over time.
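The window lengths can be checked with simple date arithmetic. This is a small sketch, assuming the same 2020-03-01 reference date that the analysis code uses for the COVID timing variable; the variable names are illustrative.

```r
window_start <- as.Date("2019-03-01")
window_end   <- as.Date("2022-12-31")
covid_ref    <- as.Date("2020-03-01")   # reference date used for the timing covariate

pre_days  <- as.numeric(covid_ref - window_start)   # length of the pre-pandemic window in days
post_days <- as.numeric(window_end - covid_ref)     # length of the pandemic-era window in days

pre_days / 365.25    # roughly 1 year
post_days / 365.25   # roughly 2.8 years
```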
query <- "
SELECT DISTINCT
s.nct_id, s.brief_title, s.overall_status, s.phase,
s.enrollment, s.start_date, s.completion_date, s.number_of_arms,
sp.agency_class as sponsor_type,
(SELECT COUNT(*) FROM facilities f WHERE f.nct_id = s.nct_id) as num_sites
FROM studies s
LEFT JOIN sponsors sp ON s.nct_id = sp.nct_id AND sp.lead_or_collaborator = 'lead'
INNER JOIN conditions c ON s.nct_id = c.nct_id
WHERE s.phase = 'PHASE2'
AND s.study_type = 'INTERVENTIONAL'
AND s.start_date >= '2019-03-01'
AND s.start_date <= '2022-12-31'
AND (
LOWER(c.downcase_name) LIKE '%copd%'
OR LOWER(c.downcase_name) LIKE '%chronic obstructive pulmonary%'
OR LOWER(c.downcase_name) LIKE '%asthma%'
OR LOWER(c.downcase_name) LIKE '%pulmonary hypertension%'
OR LOWER(c.downcase_name) LIKE '%pulmonary fibrosis%'
OR LOWER(c.downcase_name) LIKE '%ipf%'
OR LOWER(c.downcase_name) LIKE '%cystic fibrosis%'
OR LOWER(c.downcase_name) LIKE '%interstitial lung%'
OR LOWER(c.downcase_name) LIKE '%bronchiectasis%'
OR LOWER(c.downcase_name) LIKE '%emphysema%'
OR LOWER(c.downcase_name) LIKE '%chronic bronchitis%'
)
AND NOT (
LOWER(c.downcase_name) LIKE '%lung cancer%'
OR LOWER(c.downcase_name) LIKE '%lung neoplasm%'
OR LOWER(c.downcase_name) LIKE '%non-small cell%'
OR LOWER(c.downcase_name) LIKE '%small cell lung%'
)
AND NOT (
LOWER(c.downcase_name) LIKE '%covid%'
OR LOWER(c.downcase_name) LIKE '%coronavirus%'
OR LOWER(c.downcase_name) LIKE '%pneumonia%'
OR LOWER(c.downcase_name) LIKE '%influenza%'
)
"
trials <- dbGetQuery(con, query)
dbDisconnect(con)
save(trials, file = "phase2_respiratory_trials.RData")
write.csv(trials, "phase2_respiratory_trials_raw.csv", row.names = FALSE)
load("phase2_respiratory_trials.RData") # relative path to the file written by save() above
172 trials included from the AACT database.
# Create analysis dataset and variables
mydata <- trials
# Create sponsor type: Commercial vs Non-Commercial
# INDUSTRY = pharmaceutical and biotech companies (Commercial)
# Everything else (NIH, academic institutions, etc.) = Non-Commercial
mydata$sponsor <- ifelse(mydata$sponsor_type == "INDUSTRY", "Commercial", "Non-Commercial")
mydata$sponsor <- factor(mydata$sponsor, levels = c("Commercial", "Non-Commercial"))
# Calculate time-to-completion in days
mydata$start_date <- as.Date(mydata$start_date)
mydata$completion_date <- as.Date(mydata$completion_date)
mydata$completion_time <- as.numeric(mydata$completion_date - mydata$start_date)
# Define the COVID start date for the relative-to-COVID start timing variable
covid_start <- as.Date("2020-03-01")
# Calculate months from COVID start
# Negative values = trial started BEFORE pandemic declaration
# Positive values = trial started AFTER pandemic declaration
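# Note: dividing by 31 is an approximation; the average month is about 30.44 days,
# so this slightly understates the number of elapsed months
mydata$months_from_covid <- as.numeric(mydata$start_date - covid_start) / 31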
# Make num_sites numeric as count from listing
mydata$num_sites <- as.numeric(mydata$num_sites)
# Track sample at each step
n_start <- nrow(mydata)
n_completed <- sum(mydata$overall_status == "COMPLETED")
mydata <- mydata[mydata$overall_status == "COMPLETED", ]
mydata <- mydata[mydata$completion_time > 0, ]
n_valid_time <- nrow(mydata)
mydata <- mydata[!is.na(mydata$num_sites), ]
mydata <- mydata[!is.na(mydata$enrollment), ]
mydata <- mydata[!is.na(mydata$number_of_arms), ]
n_final <- nrow(mydata)
# Display as table
filtering_steps <- data.frame(
Step = c("Starting sample",
"After keeping only COMPLETED trials",
"After removing invalid completion times",
"After removing missing covariate data"),
N_Trials = c(n_start, n_completed, n_valid_time, n_final)
)
print(filtering_steps)
## Step N_Trials
## 1 Starting sample 172
## 2 After keeping only COMPLETED trials 74
## 3 After removing invalid completion times 74
## 4 After removing missing covariate data 73
The final sample is 73 completed respiratory trials.
The continuous covariates are on very different scales. Standardizing them (mean 0, SD 1) helps the model converge and makes coefficients comparable from one covariate to the next.
mydata$num_sites_z <- scale(mydata$num_sites)[,1]
mydata$enrollment_z <- scale(mydata$enrollment)[,1]
mydata$arms_z <- scale(mydata$number_of_arms)[,1]
mydata$covid_timing_z <- scale(mydata$months_from_covid)[,1]
# Show what 1 SD means in original units
scaling_reference <- data.frame(
Variable = c("Number of Sites", "Enrollment", "Number of Arms", "Months from COVID"),
One_SD_Equals = c(
paste(round(sd(mydata$num_sites), 1), "sites"),
paste(round(sd(mydata$enrollment), 1), "participants"),
paste(round(sd(mydata$number_of_arms), 2), "arms"),
paste(round(sd(mydata$months_from_covid), 1), "months")
)
)
print(scaling_reference)
## Variable One_SD_Equals
## 1 Number of Sites 36.3 sites
## 2 Enrollment 189.6 participants
## 3 Number of Arms 1.63 arms
## 4 Months from COVID 13.9 months
Scaling does NOT affect our main predictor (sponsor type) because it is categorical.
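A short sketch of what the scaling step does, using made-up site counts rather than the trial data: scale() subtracts the mean and divides by the standard deviation, while a two-level factor enters the model as a 0/1 indicator and is left untouched.

```r
x <- c(2, 5, 10, 40, 120)     # made-up site counts, for illustration only
z <- scale(x)[, 1]            # same call pattern as in the analysis code

all.equal(z, (x - mean(x)) / sd(x))   # scale() centers on the mean and divides by the SD
mean(z)                               # approximately 0
sd(z)                                 # 1

# A categorical predictor becomes a 0/1 dummy column; the first level is the reference
sponsor <- factor(c("Commercial", "Non-Commercial"),
                  levels = c("Commercial", "Non-Commercial"))
mm <- model.matrix(~ sponsor)
mm
```

Because the sponsor coefficient compares two categories directly, it reads as "non-commercial versus commercial" rather than "per SD".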
write.csv(mydata, "phase2_respiratory_trials_clean.csv", row.names = FALSE)
Saved cleaned dataset as phase2_respiratory_trials_clean.csv.
sponsor_summary <- data.frame(
Sponsor_Type = c("Commercial", "Non-Commercial", "Total"),
N_Trials = c(
sum(mydata$sponsor == "Commercial"),
sum(mydata$sponsor == "Non-Commercial"),
nrow(mydata)
),
Percent = c(
round(sum(mydata$sponsor == "Commercial") / nrow(mydata) * 100, 1),
round(sum(mydata$sponsor == "Non-Commercial") / nrow(mydata) * 100, 1),
100
)
)
print(sponsor_summary)
## Sponsor_Type N_Trials Percent
## 1 Commercial 53 72.6
## 2 Non-Commercial 20 27.4
## 3 Total 73 100.0
mean_time <- mean(mydata$completion_time)
median_time <- median(mydata$completion_time)
ggplot(mydata, aes(x = completion_time)) +
geom_histogram(bins = 30, fill = "blue", color = "white") +
geom_vline(xintercept = mean_time, color = "red", linetype = "dashed", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
geom_vline(xintercept = median_time, color = "green", linetype = "solid", linewidth = 1) +
labs(title = "Distribution of Trial Completion Time",
subtitle = "Red dashed = Mean | Green solid = Median | Mean > Median confirms right skew",
x = "Completion Time (days)",
y = "Count") +
annotate("text", x = mean_time + 50, y = Inf, label = paste("Mean:", round(mean_time, 0)),
vjust = 2, hjust = 0, color = "red", size = 3.5) +
annotate("text", x = median_time - 50, y = Inf, label = paste("Median:", round(median_time, 0)),
vjust = 2, hjust = 1, color = "green", size = 3.5)
The distribution is right-skewed (mean > median), which motivates a Gamma or Inverse Gaussian model rather than a Normal one.
ggplot(mydata, aes(x = sponsor, y = completion_time, fill = sponsor)) +
geom_boxplot() +
scale_fill_manual(values = c("Commercial" = "blue", "Non-Commercial" = "green")) +
labs(title = "Completion Time by Sponsor Type",
x = "Sponsor Type",
y = "Completion Time (days)") +
theme(legend.position = "none")
summary_by_sponsor <- mydata %>%
group_by(sponsor) %>%
summarise(
N = n(),
Mean = round(mean(completion_time), 0),
Median = round(median(completion_time), 0),
SD = round(sd(completion_time), 0),
Min = min(completion_time),
Max = max(completion_time)
) %>%
as.data.frame()
print(summary_by_sponsor)
## sponsor N Mean Median SD Min Max
## 1 Commercial 53 681 637 330 89 1646
## 2 Non-Commercial 20 1011 954 534 322 1917
For Gamma and Inverse Gaussian to be appropriate, variance should increase with the mean.
simple_fit <- lm(completion_time ~ sponsor + num_sites + enrollment, data = mydata)
plot_data <- data.frame(
fitted = fitted(simple_fit),
abs_resid = abs(residuals(simple_fit))
)
ggplot(plot_data, aes(x = fitted, y = abs_resid)) +
geom_point(size = 2, color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Heteroscedasticity Check",
subtitle = "Upward slope indicates variance increases with predicted completion time",
x = "Fitted Values (Predicted Completion Time)",
y = "Absolute Residuals")
The plot suggests a modest upward trend, though with N = 73 and sparse data at higher fitted values, the pattern is not definitive. However, duration data like trial completion time typically exhibits increasing variance: trials expected to take longer have more opportunities for delays and variability. This theoretical expectation, combined with the positive values and right skew observed in the histogram, justifies using Gamma or Inverse Gaussian GLMs. The diagnostic plots after model fitting will further validate this choice.
What is it?
Key assumption: Variance is proportional to the mean squared: Var(Y) ∝ μ²
When to use: Moderate right skew, no extreme outliers
What is it?
Key assumption: Variance is proportional to the mean cubed: Var(Y) ∝ μ³
When to use: Heavy right tails, extreme outliers, when data represents “time to reach a goal”
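To make the two variance assumptions concrete, here is a rough sketch using only the group means and SDs from the summary-by-sponsor table. It ignores the covariates and rests on just two groups, so it is a back-of-the-envelope check rather than a test; the variable names are illustrative.

```r
# Group means and SDs of completion time (days), copied from the summary table
grp_mean <- c(Commercial = 681, NonCommercial = 1011)
grp_sd   <- c(Commercial = 330, NonCommercial = 534)

mean_ratio <- grp_mean[["NonCommercial"]] / grp_mean[["Commercial"]]

# Variance ratio each family would predict if Var(Y) = phi * mu^p
gamma_ratio <- mean_ratio^2    # Gamma: p = 2
ig_ratio    <- mean_ratio^3    # Inverse Gaussian: p = 3

# Variance ratio actually observed between the two groups
observed_ratio <- (grp_sd[["NonCommercial"]] / grp_sd[["Commercial"]])^2

# Power p implied by the two groups: solve observed_ratio = mean_ratio^p
implied_power <- log(observed_ratio) / log(mean_ratio)

round(c(gamma = gamma_ratio, inverse_gaussian = ig_ratio,
        observed = observed_ratio, implied_power = implied_power), 2)
```

With these raw summaries the implied power falls between 2 and 3, so the two group summaries alone do not settle the choice; the formal comparison relies on the fitted models.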
Both models use a log link:
\[\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]
This means:
\[\mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...}\]
Why log link?
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{Gamma}(\mu_i, \phi)\]
Each trial’s completion time follows a Gamma distribution with mean μᵢ and dispersion parameter φ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
The log of the expected completion time is a linear combination of our predictors.
Putting it together:
\[\mu_i = e^{\beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + ...}\]
The Gamma distribution determines the shape of the error (right-skewed, variance ∝ μ²). The log link ensures predictions are always positive. The β coefficients tell us the effect of each predictor.
gamma_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = Gamma(link = "log"))
summary(gamma_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = Gamma(link = "log"), data = mydata)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47790 0.06435 100.664 < 2e-16 ***
## sponsorNon-Commercial 0.46917 0.13306 3.526 0.000767 ***
## num_sites_z 0.16226 0.09315 1.742 0.086124 .
## enrollment_z 0.04680 0.09039 0.518 0.606376
## arms_z -0.06303 0.05813 -1.084 0.282170
## covid_timing_z -0.11725 0.05529 -2.120 0.037672 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2052868)
##
## Null deviance: 23.535 on 72 degrees of freedom
## Residual deviance: 17.401 on 67 degrees of freedom
## AIC: 1066.5
##
## Number of Fisher Scoring iterations: 6
For a log link, we exponentiate coefficients to get multiplicative effects:
coefs <- coef(gamma_model)
pvals <- summary(gamma_model)$coefficients[, 4]
results_gamma <- data.frame(
Variable = c("(Intercept)", "Sponsor: Non-Commercial", "Number of Sites",
"Enrollment", "Number of Arms", "COVID Timing"),
Coefficient = round(coefs, 4),
Exp_Coef = round(exp(coefs), 4),
Pct_Change = round((exp(coefs) - 1) * 100, 1),
P_Value = round(pvals, 4),
Significant = ifelse(pvals < 0.05, "Yes", ifelse(pvals < 0.10, "Borderline", "No"))
)
row.names(results_gamma) <- NULL
print(results_gamma)
## Variable Coefficient Exp_Coef Pct_Change P_Value Significant
## 1 (Intercept) 6.4779 650.6019 64960.2 0.0000 Yes
## 2 Sponsor: Non-Commercial 0.4692 1.5987 59.9 0.0008 Yes
## 3 Number of Sites 0.1623 1.1762 17.6 0.0861 Borderline
## 4 Enrollment 0.0468 1.0479 4.8 0.6064 No
## 5 Number of Arms -0.0630 0.9389 -6.1 0.2822 No
## 6 COVID Timing -0.1172 0.8894 -11.1 0.0377 Yes
How to interpret:
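As a worked sketch of the arithmetic, using coefficients copied from the Gamma model summary: the intercept is the log of the expected completion time for a commercial trial with every standardized covariate at its mean (z = 0), and each other coefficient multiplies that expectation by exp(coefficient).

```r
# Coefficients copied from the Gamma model summary (log scale)
b_intercept <- 6.47790    # commercial sponsor, all scaled covariates at z = 0
b_noncomm   <- 0.46917    # shift for non-commercial sponsors
b_covid     <- -0.11725   # shift per 1 SD later start date

mu_comm    <- exp(b_intercept)               # expected days, commercial
mu_noncomm <- exp(b_intercept + b_noncomm)   # expected days, non-commercial

mu_noncomm / mu_comm          # equals exp(b_noncomm): the multiplicative effect
(exp(b_noncomm) - 1) * 100    # percent change, about +59.9%
(exp(b_covid) - 1) * 100      # percent change per SD, about -11.1%
```

Under these coefficients, an otherwise average commercial trial is expected to take about 651 days and an otherwise average non-commercial trial about 1,040 days.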
par(mfrow = c(1, 2))
plot(fitted(gamma_model), residuals(gamma_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Gamma: Residuals vs Fitted", col = "blue")
abline(h = 0, col = "red", lty = 2)
qqnorm(residuals(gamma_model, type = "deviance"), main = "Gamma: Q-Q Plot", col = "blue")
qqline(residuals(gamma_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
What we look for:
gamma_fit_stats <- data.frame(
Metric = c("AIC", "BIC", "Null Deviance", "Residual Deviance", "Deviance Explained", "Dispersion Parameter"),
Value = c(
round(AIC(gamma_model), 1),
round(BIC(gamma_model), 1),
round(gamma_model$null.deviance, 2),
round(gamma_model$deviance, 2),
paste0(round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%"),
round(summary(gamma_model)$dispersion, 4)
)
)
print(gamma_fit_stats)
## Metric Value
## 1 AIC 1066.5
## 2 BIC 1082.6
## 3 Null Deviance 23.54
## 4 Residual Deviance 17.4
## 5 Deviance Explained 26.1%
## 6 Dispersion Parameter 0.2053
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{InverseGaussian}(\mu_i, \lambda)\]
This says: Each trial’s completion time follows an Inverse Gaussian distribution with mean μᵢ and shape parameter λ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
Key difference from Gamma: The Inverse Gaussian has a heavier right tail and assumes variance ∝ μ³ (instead of μ²). Everything else stays the same.
ig_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = inverse.gaussian(link = "log"),
control = glm.control(maxit = 100))
summary(ig_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = inverse.gaussian(link = "log"),
## data = mydata, control = glm.control(maxit = 100))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.48267 0.06198 104.591 <2e-16 ***
## sponsorNon-Commercial 0.44796 0.14317 3.129 0.0026 **
## num_sites_z 0.14782 0.09805 1.508 0.1364
## enrollment_z 0.07461 0.10873 0.686 0.4949
## arms_z -0.05447 0.05568 -0.978 0.3315
## covid_timing_z -0.09898 0.05546 -1.785 0.0788 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.0002911946)
##
## Null deviance: 0.041750 on 72 degrees of freedom
## Residual deviance: 0.034137 on 67 degrees of freedom
## AIC: 1082.1
##
## Number of Fisher Scoring iterations: 9
coefs_ig <- coef(ig_model)
pvals_ig <- summary(ig_model)$coefficients[, 4]
results_ig <- data.frame(
Variable = c("(Intercept)", "Sponsor: Non-Commercial", "Number of Sites",
"Enrollment", "Number of Arms", "COVID Timing"),
Coefficient = round(coefs_ig, 4),
Exp_Coef = round(exp(coefs_ig), 4),
Pct_Change = round((exp(coefs_ig) - 1) * 100, 1),
P_Value = round(pvals_ig, 4),
Significant = ifelse(pvals_ig < 0.05, "Yes", ifelse(pvals_ig < 0.10, "Borderline", "No"))
)
row.names(results_ig) <- NULL
print(results_ig)
## Variable Coefficient Exp_Coef Pct_Change P_Value Significant
## 1 (Intercept) 6.4827 653.7148 65271.5 0.0000 Yes
## 2 Sponsor: Non-Commercial 0.4480 1.5651 56.5 0.0026 Yes
## 3 Number of Sites 0.1478 1.1593 15.9 0.1364 No
## 4 Enrollment 0.0746 1.0775 7.7 0.4949 No
## 5 Number of Arms -0.0545 0.9470 -5.3 0.3315 No
## 6 COVID Timing -0.0990 0.9058 -9.4 0.0788 Borderline
par(mfrow = c(1, 2))
plot(fitted(ig_model), residuals(ig_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Inverse Gaussian: Residuals vs Fitted", col = "blue")
abline(h = 0, col = "red", lty = 2)
qqnorm(residuals(ig_model, type = "deviance"), main = "Inverse Gaussian: Q-Q Plot", col = "blue")
qqline(residuals(ig_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
ig_fit_stats <- data.frame(
Metric = c("AIC", "BIC", "Null Deviance", "Residual Deviance", "Deviance Explained", "Dispersion Parameter"),
Value = c(
round(AIC(ig_model), 1),
round(BIC(ig_model), 1),
round(ig_model$null.deviance, 4),
round(ig_model$deviance, 4),
paste0(round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%"),
round(summary(ig_model)$dispersion, 6)
)
)
print(ig_fit_stats)
## Metric Value
## 1 AIC 1082.1
## 2 BIC 1098.2
## 3 Null Deviance 0.0418
## 4 Residual Deviance 0.0341
## 5 Deviance Explained 18.2%
## 6 Dispersion Parameter 0.000291
comparison_table <- data.frame(
Metric = c("AIC", "BIC", "Deviance Explained"),
Gamma = c(
round(AIC(gamma_model), 1),
round(BIC(gamma_model), 1),
paste0(round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%")
),
Inverse_Gaussian = c(
round(AIC(ig_model), 1),
round(BIC(ig_model), 1),
paste0(round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%")
)
)
print(comparison_table)
## Metric Gamma Inverse_Gaussian
## 1 AIC 1066.5 1082.1
## 2 BIC 1082.6 1098.2
## 3 Deviance Explained 26.1% 18.2%
The Gamma model has lower AIC (1066.5 vs 1082.1) and lower BIC (1082.6 vs 1098.2), indicating better fit.
This makes sense because the data show moderate right skew without many extreme outliers. The Gamma’s variance assumption (Var ∝ μ²) fits the observed mean-variance relationship. The Inverse Gaussian assumes variance proportional to the mean CUBED, which is too extreme for these data.
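The size of the gap can be put on a more interpretable scale with Akaike weights. This is a sketch built from the rounded AIC and BIC values in the comparison table, not a refit of the models, and the weights only describe relative support within this two-model candidate set.

```r
aic <- c(Gamma = 1066.5, Inverse_Gaussian = 1082.1)   # values from the comparison table
bic <- c(Gamma = 1082.6, Inverse_Gaussian = 1098.2)

delta_aic <- aic - min(aic)   # difference from the best-scoring model
delta_bic <- bic - min(bic)   # same gap as AIC: both models have the same number of parameters

# Akaike weights: relative support for each model within this candidate set
weights <- exp(-delta_aic / 2) / sum(exp(-delta_aic / 2))

round(delta_aic, 1)
round(weights, 4)
```

A difference of about 15.6 AIC units leaves almost all of the weight on the Gamma model.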
Does sponsor type predict Phase 2 respiratory disease trial completion time?
sample_chars <- data.frame(
Characteristic = c("Total Completed Trials", "Commercial Sponsors", "Non-Commercial Sponsors",
"Time Period", "Therapeutic Areas"),
Value = c(
nrow(mydata),
sum(mydata$sponsor == "Commercial"),
sum(mydata$sponsor == "Non-Commercial"),
"March 2019 - December 2022",
"COPD, Asthma, Pulmonary Hypertension, Pulmonary Fibrosis, Cystic Fibrosis, ILD, Bronchiectasis"
)
)
print(sample_chars)
## Characteristic
## 1 Total Completed Trials
## 2 Commercial Sponsors
## 3 Non-Commercial Sponsors
## 4 Time Period
## 5 Therapeutic Areas
## Value
## 1 73
## 2 53
## 3 20
## 4 March 2019 - December 2022
## 5 COPD, Asthma, Pulmonary Hypertension, Pulmonary Fibrosis, Cystic Fibrosis, ILD, Bronchiectasis
Both the Gamma and Inverse Gaussian GLMs converged successfully. The Gamma model provided better fit based on AIC and BIC. This is consistent with the mean-variance relationship observed in exploratory analysis: variance increases with the mean, but not as dramatically as the Inverse Gaussian’s cubic assumption would require. The Gamma’s quadratic variance assumption (Var ∝ μ²) is a better match for these data.
H₀: Sponsor type does not predict trial completion time
H₁: Sponsor type does predict trial completion time
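# Assumed definitions: best_pct and best_pvals are used below, but the chunk that
# creates them is not shown; here they are taken from the better-fitting Gamma model
best_pct <- (exp(coef(gamma_model)) - 1) * 100
best_pvals <- summary(gamma_model)$coefficients[, 4]
sponsor_result <- data.frame(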
Metric = c("Effect Size", "Direction", "P-Value", "Decision"),
Value = c(
paste0(round(best_pct["sponsorNon-Commercial"], 1), "%"),
ifelse(best_pct["sponsorNon-Commercial"] > 0, "Non-commercial trials take LONGER", "Non-commercial trials complete FASTER"),
round(best_pvals["sponsorNon-Commercial"], 4),
ifelse(best_pvals["sponsorNon-Commercial"] < 0.05, "REJECT H0", "FAIL TO REJECT H0")
)
)
print(sponsor_result)
## Metric Value
## 1 Effect Size 59.9%
## 2 Direction Non-commercial trials take LONGER
## 3 P-Value 8e-04
## 4 Decision REJECT H0
Non-commercial trials take significantly longer to complete than commercial trials. Specifically, trials with non-commercial lead sponsors (academic institutions, government agencies, and other non-industry sponsors) take approximately 60% longer than industry-sponsored trials, holding the other covariates constant (p = 0.0008).
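The report gives a point estimate and p-value for the sponsor effect; an approximate 95% interval can be sketched from the estimate and standard error in the Gamma model summary. This assumes a Wald interval with a t reference distribution on the residual degrees of freedom, matching the t-tests in that summary; it is not an interval reported by the original analysis.

```r
est <- 0.46917     # sponsorNon-Commercial estimate (log scale), from the Gamma summary
se  <- 0.13306     # its standard error
df_resid <- 67     # residual degrees of freedom

t_crit   <- qt(0.975, df = df_resid)        # two-sided 95% critical value
ci_log   <- est + c(-1, 1) * t_crit * se    # interval on the log scale
ci_ratio <- exp(ci_log)                     # interval for the ratio of mean completion times

round(exp(est), 2)    # point estimate of the ratio
round(ci_ratio, 2)    # lower and upper bounds
```

Under these assumptions the interval runs from roughly 1.2 to 2.1 times the commercial completion time, so the direction of the effect is clear while its size remains fairly uncertain.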
covid_result <- data.frame(
Metric = c("Effect Size (per 1 SD)", "1 SD equals", "Direction", "P-Value", "Significant"),
Value = c(
paste0(round(best_pct["covid_timing_z"], 1), "%"),
paste(round(sd(mydata$months_from_covid), 1), "months"),
ifelse(best_pct["covid_timing_z"] < 0, "Trials starting LATER completed FASTER", "Trials starting LATER took LONGER"),
round(best_pvals["covid_timing_z"], 4),
ifelse(best_pvals["covid_timing_z"] < 0.05, "Yes", "No")
)
)
print(covid_result)
## Metric Value
## 1 Effect Size (per 1 SD) -11.1%
## 2 1 SD equals 13.9 months
## 3 Direction Trials starting LATER completed FASTER
## 4 P-Value 0.0377
## 5 Significant Yes
The COVID timing variable is statistically significant (p = 0.0377). The negative coefficient indicates that trials starting later relative to March 2020 completed faster: for each 13.9-month (1 SD) later start date, expected completion time decreased by approximately 11%, holding the other covariates constant.
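The per-SD effect can be re-expressed per month or per year of later start. This sketch uses the coefficient and SD reported above; note that "month" here means the 31-day unit used when months_from_covid was constructed, and the variable names are illustrative.

```r
b_per_sd  <- -0.11725   # covid_timing_z coefficient from the Gamma model (log scale)
sd_months <- 13.9       # 1 SD of months_from_covid, from the scaling reference table

b_per_month   <- b_per_sd / sd_months               # log-scale change per one-month later start
pct_per_month <- (exp(b_per_month) - 1) * 100
pct_per_year  <- (exp(12 * b_per_month) - 1) * 100

round(c(per_month = pct_per_month, per_12_months = pct_per_year), 1)
```

On this scale, starting twelve months later corresponds to an expected completion time roughly 10% shorter.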
This is consistent with sponsors adapting to pandemic conditions over time. Early in the pandemic, respiratory trials faced severe disruptions:
As the pandemic progressed, sponsors developed workarounds: remote monitoring, decentralized trial designs, and revised safety protocols. Trials initiating later benefited from these adaptations.
covariate_summary <- data.frame(
Variable = c("Sponsor Type (Non-Commercial)", "Number of Sites", "Enrollment",
"Number of Arms", "COVID Timing"),
Effect = c(
paste0(ifelse(best_pct["sponsorNon-Commercial"] > 0, "+", ""), round(best_pct["sponsorNon-Commercial"], 1), "% vs Commercial"),
paste0(ifelse(best_pct["num_sites_z"] > 0, "+", ""), round(best_pct["num_sites_z"], 1), "% per SD"),
paste0(ifelse(best_pct["enrollment_z"] > 0, "+", ""), round(best_pct["enrollment_z"], 1), "% per SD"),
paste0(ifelse(best_pct["arms_z"] > 0, "+", ""), round(best_pct["arms_z"], 1), "% per SD"),
paste0(ifelse(best_pct["covid_timing_z"] > 0, "+", ""), round(best_pct["covid_timing_z"], 1), "% per SD")
),
P_Value = round(best_pvals[-1], 4),
Significant = ifelse(best_pvals[-1] < 0.05, "Yes", ifelse(best_pvals[-1] < 0.10, "Borderline", "No"))
)
print(covariate_summary)
## Variable Effect
## sponsorNon-Commercial Sponsor Type (Non-Commercial) +59.9% vs Commercial
## num_sites_z Number of Sites +17.6% per SD
## enrollment_z Enrollment +4.8% per SD
## arms_z Number of Arms -6.1% per SD
## covid_timing_z COVID Timing -11.1% per SD
## P_Value Significant
## sponsorNon-Commercial 0.0008 Yes
## num_sites_z 0.0861 Borderline
## enrollment_z 0.6064 No
## arms_z 0.2822 No
## covid_timing_z 0.0377 Yes
implications <- data.frame(
Finding = c("Sponsor Selection", "Pandemic Adaptation", "Trial Design"),
Implication = c(
"Non-commercial trials take ~60% longer. This affects drug development timelines, budgeting, and academic-industry partnership decisions.",
"Trials starting later in the pandemic completed faster, demonstrating that the clinical trial enterprise can adapt to major disruptions.",
"Enrollment size and number of arms were not significant. Number of sites showed a borderline effect, suggesting multi-site complexity may extend timelines."
)
)
print(implications)
## Finding
## 1 Sponsor Selection
## 2 Pandemic Adaptation
## 3 Trial Design
## Implication
## 1 Non-commercial trials take ~60% longer. This affects drug development timelines, budgeting, and academic-industry partnership decisions.
## 2 Trials starting later in the pandemic completed faster, demonstrating that the clinical trial enterprise can adapt to major disruptions.
## 3 Enrollment size and number of arms were not significant. Number of sites showed a borderline effect, suggesting multi-site complexity may extend timelines.
Sample Size: With N = 73 trials, this analysis has limited power, so smaller effects may not be detectable.
Group Imbalance: The sample was imbalanced between commercial (n = 53) and non-commercial (n = 20) sponsors, reflecting the composition of the Phase 2 respiratory trial population. The smaller non-commercial group may result in less precise estimates for that category, though the effect remained highly significant.
Observational Design: We cannot establish causation. Unmeasured confounders (trial complexity, regulatory pathway, therapeutic novelty) may explain sponsor differences.
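Completion Bias: We analyzed only completed trials. Ongoing or terminated trials may differ systematically. In particular, trials that started later had less calendar time in which to finish, so restricting to completed trials favors shorter trials among late starters, which may contribute to the negative COVID timing coefficient.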
Time Period: Results reflect the COVID-19 era (2019-2022) and may not generalize to other periods.