10.1.4 Main Finding
Non-commercial vs Commercial sponsors:
- Effect: 111.3% longer completion time
- p-value: 6e-04 (0.0006)
- Decision: REJECT null hypothesis - sponsor type DOES predict completion time
Does trial sponsor type (commercial vs non-commercial) predict how long Phase 2 respiratory disease trials take to complete?
We examine trials initiated between March 2019 and March 2021, a window spanning the onset of COVID-19, on the theory that the pandemic might amplify any effect of sponsor type on time-to-completion. Respiratory trials are particularly interesting during this period because pulmonary function testing and other in-person assessments were heavily restricted once the pandemic began.
To answer this question, we draw a sample of trials from the AACT (ClinicalTrials.gov) database.
Regular linear regression (OLS) assumes:
- The outcome, conditional on the predictors, is normally distributed
- The error variance is constant (homoscedasticity)
Time-to-trial completion violates both: completion times are strictly positive and right-skewed, and their variance grows with the mean (illustrated in the short simulation sketch below).
A GLM has three components:
- A random component: an exponential-family distribution for the outcome (here Gamma or Inverse Gaussian)
- A linear predictor: η = β₀ + β₁X₁ + β₂X₂ + ...
- A link function g(·) connecting the mean to the linear predictor: g(μ) = η (here the log link)
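To make the contrast concrete, here is that minimal simulated sketch (illustrative data only, not the AACT sample): OLS forces symmetric, constant-variance errors, while a Gamma GLM with a log link matches positive, right-skewed data whose variance grows with the mean.
# Illustrative simulation only (not trial data)
set.seed(123)
x  <- runif(200, 0, 2)
mu <- exp(5 + 0.8 * x)                        # mean depends on x multiplicatively
y  <- rgamma(200, shape = 4, rate = 4 / mu)   # positive, right-skewed, Var(y) proportional to mu^2
ols_fit   <- lm(y ~ x)                                  # assumes Normal errors, constant variance
gamma_fit <- glm(y ~ x, family = Gamma(link = "log"))   # matches the data-generating process
exp(coef(gamma_fit)["x"])   # should be close to the true multiplicative effect exp(0.8), about 2.23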
The AACT (Aggregate Analysis of ClinicalTrials.gov) database is a publicly available relational database maintained by the Clinical Trials Transformation Initiative (CTTI). It contains all information from ClinicalTrials.gov in a structured format that can be queried using SQL.
To access AACT, you need a free account: https://aact.ctti-clinicaltrials.org/users/sign_up
# Load required packages
library(RPostgres) # Database connection
library(dplyr) # Data manipulation
library(ggplot2) # Plotting
library(scales) # Nice axis labels
library(statmod) # For Inverse Gaussian
# ============================================================================
# NOTE: This code chunk connects to the live AACT database.
# Set eval=TRUE to run it (requires internet connection and AACT credentials)
# ============================================================================
# Enter your AACT credentials
username <- "amgent06" # Replace with your username
password <- "agMORGAN2027!" # Replace with your password
# Connect to the database
con <- dbConnect(
RPostgres::Postgres(),
dbname = "aact",
host = "aact-db.ctti-clinicaltrials.org",
port = 5432,
user = username,
password = password
)
cat("Connected to AACT database successfully!\n")
## Connected to AACT database successfully!
We construct a query to pull Phase 2 interventional trials for chronic respiratory conditions.
We include trials for these chronic respiratory conditions: COPD (chronic obstructive pulmonary disease), asthma, pulmonary hypertension, pulmonary fibrosis (including IPF), cystic fibrosis, interstitial lung disease, bronchiectasis, emphysema, and chronic bronchitis.
We exclude: lung cancers (lung and pulmonary neoplasms, small cell and non-small cell), as well as COVID-19 and other acute respiratory infections (coronavirus/SARS-CoV, pneumonia, acute respiratory conditions, influenza).
query <- "
SELECT DISTINCT
s.nct_id,
s.brief_title,
s.overall_status,
s.phase,
s.enrollment,
s.start_date,
s.completion_date,
s.number_of_arms,
sp.agency_class as sponsor_type,
f.number_of_facilities as num_sites
FROM studies s
LEFT JOIN sponsors sp ON s.nct_id = sp.nct_id AND sp.lead_or_collaborator = 'lead'
LEFT JOIN facilities f ON s.nct_id = f.nct_id
INNER JOIN conditions c ON s.nct_id = c.nct_id
WHERE s.phase = 'PHASE2'
AND s.study_type = 'INTERVENTIONAL'
AND s.start_date >= '2019-03-01'
AND s.start_date <= '2021-03-31'
AND (
-- Include chronic respiratory conditions
LOWER(c.downcase_name) LIKE '%copd%'
OR LOWER(c.downcase_name) LIKE '%chronic obstructive pulmonary%'
OR LOWER(c.downcase_name) LIKE '%asthma%'
OR LOWER(c.downcase_name) LIKE '%pulmonary hypertension%'
OR LOWER(c.downcase_name) LIKE '%pulmonary fibrosis%'
OR LOWER(c.downcase_name) LIKE '%ipf%'
OR LOWER(c.downcase_name) LIKE '%cystic fibrosis%'
OR LOWER(c.downcase_name) LIKE '%interstitial lung%'
OR LOWER(c.downcase_name) LIKE '%bronchiectasis%'
OR LOWER(c.downcase_name) LIKE '%emphysema%'
OR LOWER(c.downcase_name) LIKE '%chronic bronchitis%'
)
AND NOT (
-- Exclude lung cancer
LOWER(c.downcase_name) LIKE '%lung cancer%'
OR LOWER(c.downcase_name) LIKE '%lung neoplasm%'
OR LOWER(c.downcase_name) LIKE '%pulmonary neoplasm%'
OR LOWER(c.downcase_name) LIKE '%non-small cell%'
OR LOWER(c.downcase_name) LIKE '%small cell lung%'
)
AND NOT (
-- Exclude COVID-19 and acute infections
LOWER(c.downcase_name) LIKE '%covid%'
OR LOWER(c.downcase_name) LIKE '%coronavirus%'
OR LOWER(c.downcase_name) LIKE '%sars-cov%'
OR LOWER(c.downcase_name) LIKE '%pneumonia%'
OR LOWER(c.downcase_name) LIKE '%acute respiratory%'
OR LOWER(c.downcase_name) LIKE '%influenza%'
)
"
# Execute the query
trials <- dbGetQuery(con, query)
# Disconnect from database
dbDisconnect(con)
# Save the raw data
save(trials, file = "phase2_respiratory_trials.RData")
write.csv(trials, "phase2_respiratory_trials_raw.csv", row.names = FALSE)
cat("Downloaded", nrow(trials), "trials\n")
cat("Saved as 'phase2_respiratory_trials.RData' and 'phase2_respiratory_trials_raw.csv'\n")
If you have already run the query above and saved the data, you can load it here:
# Load the previously downloaded data
load("C:/Users/amcewen/OneDrive - Bentley University/Documents/quant pres #2/phase2_respiratory_trials.RData")
cat("Loaded", nrow(trials), "trials from AACT database\n")
## Loaded 82 trials from AACT database
# Define the COVID start date for our timing variable
# The WHO declared COVID-19 a pandemic in March 2020; we use March 1, 2020 as the cutoff
covid_start <- as.Date("2020-03-01")
# Create analysis dataset
mydata <- trials
# Create sponsor type: Commercial vs Non-Commercial
# INDUSTRY = pharmaceutical and biotech companies (Commercial)
# Everything else (NIH, academic institutions, etc.) = Non-Commercial
mydata$sponsor <- ifelse(mydata$sponsor_type == "INDUSTRY",
"Commercial",
"Non-Commercial")
mydata$sponsor <- factor(mydata$sponsor, levels = c("Commercial", "Non-Commercial"))
# Calculate time-to-completion in days
mydata$start_date <- as.Date(mydata$start_date)
mydata$completion_date <- as.Date(mydata$completion_date)
mydata$completion_time <- as.numeric(mydata$completion_date - mydata$start_date)
# Calculate months from COVID start (approximate months: days / 31)
# Negative values = trial started BEFORE pandemic declaration
# Positive values = trial started AFTER pandemic declaration
mydata$months_from_covid <- as.numeric(mydata$start_date - covid_start) / 31
# Make num_sites numeric
mydata$num_sites <- as.numeric(mydata$num_sites)
We only include trials that have actually completed (not ongoing or terminated) because we need actual completion dates.
# Track how many trials we lose at each step
cat("=== Data Filtering Steps ===\n\n")
## === Data Filtering Steps ===
cat("Starting sample:", nrow(mydata), "trials\n")
## Starting sample: 82 trials
# Keep only completed trials
mydata <- mydata[mydata$overall_status == "COMPLETED", ]
cat("After keeping only COMPLETED trials:", nrow(mydata), "\n")
## After keeping only COMPLETED trials: 39
# Remove trials with invalid completion times
mydata <- mydata[mydata$completion_time > 0, ]
cat("After removing invalid completion times (<=0 days):", nrow(mydata), "\n")
## After removing invalid completion times (<=0 days): 39
# Remove trials with missing covariate data
mydata <- mydata[!is.na(mydata$num_sites), ]
mydata <- mydata[!is.na(mydata$enrollment), ]
mydata <- mydata[!is.na(mydata$number_of_arms), ]
cat("After removing trials with missing covariate data:", nrow(mydata), "\n")
## After removing trials with missing covariate data: 38
cat("\n=== Final Sample ===\n")
##
## === Final Sample ===
cat("N =", nrow(mydata), "completed respiratory trials\n")
## N = 38 completed respiratory trials
The continuous variables (number of sites, enrollment, number of arms, and months from COVID onset) are on very different scales.
Scaling (standardizing) them helps the model converge and makes coefficients comparable.
What does scaling do? For each variable, we subtract the mean and divide by the standard deviation. This transforms each variable to have mean = 0 and standard deviation = 1.
# Scale continuous variables (subtract mean, divide by SD)
mydata$num_sites_z <- scale(mydata$num_sites)[,1]
mydata$enrollment_z <- scale(mydata$enrollment)[,1]
mydata$arms_z <- scale(mydata$number_of_arms)[,1]
mydata$covid_timing_z <- scale(mydata$months_from_covid)[,1]
# What does "1 SD" mean in original units?
cat("=== Scaling Reference ===\n")
## === Scaling Reference ===
cat("What '1 SD increase' means in original units:\n\n")
## What '1 SD increase' means in original units:
cat(" num_sites: 1 SD =", round(sd(mydata$num_sites), 1), "sites\n")
## num_sites: 1 SD = 30.4 sites
cat(" enrollment: 1 SD =", round(sd(mydata$enrollment), 1), "participants\n")
## enrollment: 1 SD = 133.4 participants
cat(" number_of_arms: 1 SD =", round(sd(mydata$number_of_arms), 2), "arms\n")
## number_of_arms: 1 SD = 1.86 arms
cat(" months_from_covid: 1 SD =", round(sd(mydata$months_from_covid), 1), "months\n")
## months_from_covid: 1 SD = 6.9 months
Note: Scaling does NOT affect our main predictive variable (sponsor type) because it’s categorical.
# Save cleaned dataset as CSV for reference
write.csv(mydata, "phase2_respiratory_trials_clean.csv", row.names = FALSE)
cat("Saved cleaned dataset as 'phase2_respiratory_trials_clean.csv'\n")
## Saved cleaned dataset as 'phase2_respiratory_trials_clean.csv'
cat("This file contains", nrow(mydata), "trials with", ncol(mydata), "variables.\n")
## This file contains 38 trials with 17 variables.
table(mydata$sponsor)
##
## Commercial Non-Commercial
## 27 11
ggplot(mydata, aes(x = completion_time)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Trial Completion Time",
subtitle = "Notice the right skew - most trials finish within a typical timeframe but some take much longer",
x = "Completion Time (days)",
y = "Count")
Key observation: The distribution is right-skewed. This is why we need Gamma or Inverse Gaussian, not Normal.
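As a quick numeric check of that skew (an optional sketch using only base R), we can compute the sample skewness of completion time; a clearly positive value confirms the right tail seen in the histogram.
# Sample skewness: mean of cubed standardized deviations (rough check, no extra packages)
ct <- mydata$completion_time
round(mean((ct - mean(ct))^3) / sd(ct)^3, 2)   # positive value indicates right skew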
ggplot(mydata, aes(x = sponsor, y = completion_time, fill = sponsor)) +
geom_boxplot() +
labs(title = "Completion Time by Sponsor Type",
x = "Sponsor Type",
y = "Completion Time (days)") +
theme(legend.position = "none")
# Summary statistics by sponsor type
aggregate(completion_time ~ sponsor, data = mydata,
FUN = function(x) c(Mean = mean(x), Median = median(x), SD = sd(x), N = length(x)))
## sponsor completion_time.Mean completion_time.Median completion_time.SD
## 1 Commercial 694.4444 639.0000 377.0197
## 2 Non-Commercial 1257.2727 1479.0000 506.8639
## completion_time.N
## 1 27.0000
## 2 11.0000
For Gamma and Inverse Gaussian to be appropriate, variance should increase with the mean.
# Split data into bins and calculate mean and variance for each
mydata$bin <- cut(mydata$completion_time, breaks = 10)
mv <- aggregate(completion_time ~ bin, data = mydata,
FUN = function(x) c(mean = mean(x), var = var(x)))
mv <- data.frame(mean = mv$completion_time[,1], var = mv$completion_time[,2])
ggplot(mv, aes(x = mean, y = var)) +
geom_point(size = 3, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(title = "Mean-Variance Relationship",
subtitle = "Upward slope confirms variance increases with mean",
x = "Mean Completion Time (days)",
y = "Variance")
Conclusion: Variance increases with the mean, supporting use of Gamma or Inverse Gaussian.
What is it? The Gamma distribution is a continuous distribution for strictly positive, right-skewed data.
Key assumption: Variance is proportional to the mean squared: Var(Y) ∝ μ²
When to use: Moderate right skew, no extreme outliers
What is it? The Inverse Gaussian distribution is a continuous distribution for strictly positive data, originally derived as the first-passage time of Brownian motion, i.e. the time to reach a goal.
Key assumption: Variance is proportional to the mean cubed: Var(Y) ∝ μ³
When to use: Heavy right tails, extreme outliers, when the data represent “time to reach a goal”
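To see the practical difference, here is a small simulated sketch (illustrative numbers, not the trial data) comparing draws from the two distributions with the same mean and variance; the Inverse Gaussian puts noticeably more mass in the far right tail. rinvgauss() comes from the statmod package loaded earlier.
# Illustrative comparison at a typical completion time of about 800 days
set.seed(42)
mu <- 800
g  <- rgamma(10000, shape = 4, rate = 4 / mu)               # Gamma: Var = mu^2 / 4
ig <- statmod::rinvgauss(10000, mean = mu, shape = 4 * mu)  # Inverse Gaussian: same mean and variance
quantile(g,  c(0.5, 0.99))   # Gamma median and 99th percentile
quantile(ig, c(0.5, 0.99))   # Inverse Gaussian: the 99th percentile sits further out (heavier tail)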
Both models use a log link:
\[\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]
This means:
\[\mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...}\]
Why log link? It guarantees the predicted mean is always positive (as a completion time must be), and it makes exponentiated coefficients interpretable as multiplicative (percentage) effects.
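A tiny worked example with hypothetical coefficients shows why: on the log scale effects add, so on the original (days) scale they multiply.
# Hypothetical coefficients for illustration only
b0 <- 6.5; b1 <- 0.75
exp(b0)                   # expected days for the reference group (about 665)
exp(b0 + b1)              # expected days when the predictor switches on
exp(b0 + b1) / exp(b0)    # ratio = exp(b1), about 2.12, i.e. about 112% longer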
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{Gamma}(\mu_i, \phi)\]
This says: Each trial’s completion time follows a Gamma distribution with mean μᵢ and dispersion parameter φ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
This says: The log of the expected completion time is a linear combination of our predictors.
Putting it together:
\[\mu_i = e^{\beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + ...}\]
The Gamma distribution determines the shape of the error (right-skewed, variance ∝ μ²). The log link ensures predictions are always positive. The β coefficients tell us the effect of each predictor.
# Fit Gamma GLM with log link
gamma_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = Gamma(link = "log"))
# View results
summary(gamma_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = Gamma(link = "log"), data = mydata)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47634 0.09804 66.059 < 2e-16 ***
## sponsorNon-Commercial 0.74801 0.19793 3.779 0.000649 ***
## num_sites_z 0.24021 0.11463 2.096 0.044119 *
## enrollment_z -0.03470 0.11204 -0.310 0.758798
## arms_z -0.07446 0.09392 -0.793 0.433776
## covid_timing_z -0.03532 0.08599 -0.411 0.683974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.24049)
##
## Null deviance: 14.2652 on 37 degrees of freedom
## Residual deviance: 9.6294 on 32 degrees of freedom
## AIC: 570.3
##
## Number of Fisher Scoring iterations: 9
For a log link, we exponentiate coefficients to get multiplicative effects:
# Get coefficients
coefs <- coef(gamma_model)
pvals <- summary(gamma_model)$coefficients[, 4]
# Create interpretation table
results_gamma <- data.frame(
Variable = names(coefs),
Coefficient = round(coefs, 4),
Exp_Coefficient = round(exp(coefs), 4),
Percent_Change = round((exp(coefs) - 1) * 100, 1),
P_Value = round(pvals, 4)
)
print(results_gamma)
## Variable Coefficient Exp_Coefficient
## (Intercept) (Intercept) 6.4763 649.5905
## sponsorNon-Commercial sponsorNon-Commercial 0.7480 2.1128
## num_sites_z num_sites_z 0.2402 1.2715
## enrollment_z enrollment_z -0.0347 0.9659
## arms_z arms_z -0.0745 0.9282
## covid_timing_z covid_timing_z -0.0353 0.9653
## Percent_Change P_Value
## (Intercept) 64859.0 0.0000
## sponsorNon-Commercial 111.3 0.0006
## num_sites_z 27.2 0.0441
## enrollment_z -3.4 0.7588
## arms_z -7.2 0.4338
## covid_timing_z -3.5 0.6840
How to interpret: an exponentiated coefficient is a multiplicative effect on expected completion time, holding the other predictors constant. For example, exp(0.748) ≈ 2.11 for non-commercial sponsorship (about 111% longer expected completion time), and exp(0.240) ≈ 1.27 for num_sites_z (about 27% longer per one-SD increase of roughly 30 sites).
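A useful follow-up (a hedged sketch using the Wald approximation from confint.default) is to put a 95% confidence interval around the sponsor effect on the multiplicative scale.
# Wald 95% CI for the sponsor coefficient, transformed to a ratio of completion times
ci <- confint.default(gamma_model)["sponsorNon-Commercial", ]
round(exp(ci), 2)                # ratio scale: roughly 1.4 to 3.1 given the estimates above
round((exp(ci) - 1) * 100, 1)    # the same interval on the percent-change scale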
# Residuals vs Fitted plot
par(mfrow = c(1, 2))
# Plot 1: Residuals vs Fitted
plot(fitted(gamma_model), residuals(gamma_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Gamma: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)
# Plot 2: Q-Q Plot
qqnorm(residuals(gamma_model, type = "deviance"), main = "Gamma: Q-Q Plot")
qqline(residuals(gamma_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
What we look for: deviance residuals scattered around zero with no strong trend or funnel shape in the residuals-vs-fitted plot, and points lying close to the reference line in the Q-Q plot.
cat("Gamma Model Fit Statistics\n")
## Gamma Model Fit Statistics
cat("==========================\n")
## ==========================
cat("AIC:", AIC(gamma_model), "\n")
## AIC: 570.3046
cat("BIC:", BIC(gamma_model), "\n")
## BIC: 581.7677
cat("Null Deviance:", gamma_model$null.deviance, "\n")
## Null Deviance: 14.26525
cat("Residual Deviance:", gamma_model$deviance, "\n")
## Residual Deviance: 9.62944
cat("Deviance Explained:", round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%\n")
## Deviance Explained: 32.5 %
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{InverseGaussian}(\mu_i, \lambda)\]
This says: Each trial’s completion time follows an Inverse Gaussian distribution with mean μᵢ and shape parameter λ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
Key difference from Gamma: The Inverse Gaussian has a heavier right tail and assumes variance ∝ μ³ (instead of μ²). Everything else stays the same.
# Fit Inverse Gaussian GLM with log link
# We increase max iterations because IG can be harder to fit
ig_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = inverse.gaussian(link = "log"),
control = glm.control(maxit = 100))
# View results
summary(ig_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = inverse.gaussian(link = "log"),
## data = mydata, control = glm.control(maxit = 100))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.46912 0.09329 69.344 < 2e-16 ***
## sponsorNon-Commercial 0.78885 0.23450 3.364 0.00201 **
## num_sites_z 0.27770 0.12250 2.267 0.03029 *
## enrollment_z -0.10431 0.11076 -0.942 0.35337
## arms_z -0.06281 0.09257 -0.678 0.50237
## covid_timing_z -0.07668 0.09150 -0.838 0.40822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.0003387172)
##
## Null deviance: 0.022773 on 37 degrees of freedom
## Residual deviance: 0.017098 on 32 degrees of freedom
## AIC: 577.54
##
## Number of Fisher Scoring iterations: 16
# Get coefficients
coefs_ig <- coef(ig_model)
pvals_ig <- summary(ig_model)$coefficients[, 4]
# Create interpretation table
results_ig <- data.frame(
Variable = names(coefs_ig),
Coefficient = round(coefs_ig, 4),
Exp_Coefficient = round(exp(coefs_ig), 4),
Percent_Change = round((exp(coefs_ig) - 1) * 100, 1),
P_Value = round(pvals_ig, 4)
)
print(results_ig)
## Variable Coefficient Exp_Coefficient
## (Intercept) (Intercept) 6.4691 644.9140
## sponsorNon-Commercial sponsorNon-Commercial 0.7889 2.2009
## num_sites_z num_sites_z 0.2777 1.3201
## enrollment_z enrollment_z -0.1043 0.9009
## arms_z arms_z -0.0628 0.9391
## covid_timing_z covid_timing_z -0.0767 0.9262
## Percent_Change P_Value
## (Intercept) 64391.4 0.0000
## sponsorNon-Commercial 120.1 0.0020
## num_sites_z 32.0 0.0303
## enrollment_z -9.9 0.3534
## arms_z -6.1 0.5024
## covid_timing_z -7.4 0.4082
par(mfrow = c(1, 2))
# Plot 1: Residuals vs Fitted
plot(fitted(ig_model), residuals(ig_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Inverse Gaussian: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)
# Plot 2: Q-Q Plot
qqnorm(residuals(ig_model, type = "deviance"), main = "Inverse Gaussian: Q-Q Plot")
qqline(residuals(ig_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
cat("Inverse Gaussian Model Fit Statistics\n")
## Inverse Gaussian Model Fit Statistics
cat("=====================================\n")
## =====================================
cat("AIC:", AIC(ig_model), "\n")
## AIC: 577.5408
cat("BIC:", BIC(ig_model), "\n")
## BIC: 589.0039
cat("Null Deviance:", ig_model$null.deviance, "\n")
## Null Deviance: 0.02277328
cat("Residual Deviance:", ig_model$deviance, "\n")
## Residual Deviance: 0.01709837
cat("Deviance Explained:", round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%\n")
## Deviance Explained: 24.9 %
cat("MODEL COMPARISON\n")
## MODEL COMPARISON
cat("================\n\n")
## ================
cat("AIC (lower is better):\n")
## AIC (lower is better):
cat(" Gamma:", round(AIC(gamma_model), 1), "\n")
## Gamma: 570.3
cat(" Inverse Gaussian:", round(AIC(ig_model), 1), "\n")
## Inverse Gaussian: 577.5
cat(" Winner:", ifelse(AIC(gamma_model) < AIC(ig_model), "Gamma", "Inverse Gaussian"), "\n\n")
## Winner: Gamma
cat("BIC (lower is better):\n")
## BIC (lower is better):
cat(" Gamma:", round(BIC(gamma_model), 1), "\n")
## Gamma: 581.8
cat(" Inverse Gaussian:", round(BIC(ig_model), 1), "\n")
## Inverse Gaussian: 589
cat(" Winner:", ifelse(BIC(gamma_model) < BIC(ig_model), "Gamma", "Inverse Gaussian"), "\n")
## Winner: Gamma
cat("SPONSOR TYPE EFFECT\n")
## SPONSOR TYPE EFFECT
cat("===================\n\n")
## ===================
# Gamma model
gamma_effect <- round((exp(coef(gamma_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
gamma_p <- round(summary(gamma_model)$coefficients["sponsorNon-Commercial", 4], 4)
# IG model
ig_effect <- round((exp(coef(ig_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
ig_p <- round(summary(ig_model)$coefficients["sponsorNon-Commercial", 4], 4)
cat("Gamma Model:\n")
## Gamma Model:
cat(" Non-commercial trials take", gamma_effect, "% longer than commercial\n")
## Non-commercial trials take 111.3 % longer than commercial
cat(" p-value:", gamma_p, "\n\n")
## p-value: 6e-04
cat("Inverse Gaussian Model:\n")
## Inverse Gaussian Model:
cat(" Non-commercial trials take", ig_effect, "% longer than commercial\n")
## Non-commercial trials take 120.1 % longer than commercial
cat(" p-value:", ig_p, "\n")
## p-value: 0.002
Does the effect of sponsor type depend on how big the trial is (number of sites)?
Maybe commercial sponsors are especially efficient at running large multi-site trials, but the difference disappears for small trials.
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{[Gamma or InverseGaussian]}(\mu_i, \phi)\]
We use whichever distribution fit better in the comparison above.
Part 2 — Link Function and Linear Predictor (with interaction):
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{NonCommercial}_i \times \text{NumSites}_i) + \beta_4(\text{Enrollment}_i) + \beta_5(\text{Arms}_i) + \beta_6(\text{CovidTiming}_i)\]
What’s new? The term \(\beta_3(\text{NonCommercial} \times \text{NumSites})\)
How to interpret: β₃ measures how much the (log-scale) non-commercial effect changes for each one-SD increase in the number of sites. If β₃ is significant, the lines in our plot will NOT be parallel, meaning the sponsor type effect depends on trial size.
We use whichever distribution won the comparison above.
# Use the better-fitting distribution
if (AIC(gamma_model) <= AIC(ig_model)) {
best_family <- Gamma(link = "log")
best_name <- "Gamma"
} else {
best_family <- inverse.gaussian(link = "log")
best_name <- "Inverse Gaussian"
}
cat("Using", best_name, "distribution for interaction model\n\n")
## Using Gamma distribution for interaction model
# Fit model with interaction (sponsor * num_sites)
interaction_model <- glm(completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = best_family,
control = glm.control(maxit = 100))
summary(interaction_model)
##
## Call:
## glm(formula = completion_time ~ sponsor * num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = best_family, data = mydata,
## control = glm.control(maxit = 100))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47846 0.09806 66.065 < 2e-16 ***
## sponsorNon-Commercial 1.24650 0.40351 3.089 0.00421 **
## num_sites_z 0.28048 0.11642 2.409 0.02211 *
## enrollment_z -0.12035 0.12173 -0.989 0.33049
## arms_z -0.09401 0.09505 -0.989 0.33028
## covid_timing_z -0.02931 0.08597 -0.341 0.73541
## sponsorNon-Commercial:num_sites_z 0.91307 0.63094 1.447 0.15789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2401902)
##
## Null deviance: 14.265 on 37 degrees of freedom
## Residual deviance: 9.112 on 31 degrees of freedom
## AIC: 570.12
##
## Number of Fisher Scoring iterations: 9
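Before testing it formally, a short sketch shows what the estimated interaction would imply if taken at face value: the multiplicative non-commercial effect evaluated at small, average, and large trials, in SD units of the number of sites (1 SD is roughly 30 sites). As the test below shows, the interaction is not statistically significant, so these numbers are illustrative only.
# Non-commercial effect at num_sites_z = -0.5, 0, and +1, i.e. exp(beta1 + beta3 * z)
b <- coef(interaction_model)
z <- c(small = -0.5, average = 0, large = 1)
round(exp(b["sponsorNon-Commercial"] + b["sponsorNon-Commercial:num_sites_z"] * z), 2)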
We compare the main effects model to the interaction model:
# Get the main effects model (whichever was better)
main_model <- if(AIC(gamma_model) <= AIC(ig_model)) gamma_model else ig_model
# Analysis-of-deviance F test comparing the nested models
# (an F test is used because the Gamma dispersion parameter is estimated)
lr_test <- anova(main_model, interaction_model, test = "F")
print(lr_test)
## Analysis of Deviance Table
##
## Model 1: completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z +
## covid_timing_z
## Model 2: completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z +
## covid_timing_z
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 32 9.6294
## 2 31 9.1120 1 0.51746 2.1544 0.1522
# Get p-value
interaction_pvalue <- lr_test$`Pr(>F)`[2]
cat("\n")
cat("INTERACTION TEST RESULT\n")
## INTERACTION TEST RESULT
cat("=======================\n")
## =======================
cat("F-test p-value:", round(interaction_pvalue, 4), "\n\n")
## F-test p-value: 0.1522
if (interaction_pvalue < 0.05) {
cat("CONCLUSION: The interaction IS significant (p < 0.05)\n")
cat("The effect of sponsor type DOES depend on trial size.\n")
} else {
cat("CONCLUSION: The interaction is NOT significant (p >= 0.05)\n")
cat("The effect of sponsor type does NOT depend on trial size.\n")
cat("The simpler main effects model is preferred.\n")
}
## CONCLUSION: The interaction is NOT significant (p >= 0.05)
## The effect of sponsor type does NOT depend on trial size.
## The simpler main effects model is preferred.
# Create prediction data
pred_data <- expand.grid(
sponsor = c("Commercial", "Non-Commercial"),
num_sites_z = seq(-0.5, 2, by = 0.1),
enrollment_z = 0,
arms_z = 0,
covid_timing_z = 0
)
# Get predictions
pred_data$predicted <- predict(interaction_model, newdata = pred_data, type = "response")
# Convert num_sites back to original scale for plotting
pred_data$num_sites <- pred_data$num_sites_z * sd(mydata$num_sites) + mean(mydata$num_sites)
# Plot
ggplot(pred_data, aes(x = num_sites, y = predicted, color = sponsor)) +
geom_line(linewidth = 1.2) +
labs(title = "Predicted Completion Time by Sponsor Type and Trial Size",
subtitle = "Other variables held at their means",
x = "Number of Sites",
y = "Predicted Completion Time (days)",
color = "Sponsor Type") +
theme(legend.position = "bottom")
# Determine best model
if (AIC(gamma_model) <= AIC(ig_model)) {
best_model <- gamma_model
best_dist <- "Gamma"
} else {
best_model <- ig_model
best_dist <- "Inverse Gaussian"
}
# Get sponsor effect from best model
sponsor_effect <- round((exp(coef(best_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
sponsor_p <- round(summary(best_model)$coefficients["sponsorNon-Commercial", 4], 4)
interaction_p <- round(interaction_pvalue, 4)
Does sponsor type predict Phase 2 respiratory disease trial completion time?
Non-commercial vs Commercial sponsors:
if (sponsor_p < 0.05 & sponsor_effect > 0) {
cat("Non-commercial trials (academic and government sponsors) take significantly\n")
cat("longer to complete than commercial (industry) trials. Specifically, they take\n")
cat("approximately", sponsor_effect, "% longer (p =", sponsor_p, ").\n\n")
cat("This may reflect:\n")
cat("- Differences in operational infrastructure and dedicated staff\n")
cat("- Resource constraints in academic settings\n")
cat("- The pandemic's disproportionate impact on academic medical centers\n")
cat("- Restrictions on pulmonary function testing during COVID-19\n")
} else if (sponsor_p < 0.05 & sponsor_effect < 0) {
cat("Non-commercial trials complete significantly FASTER than commercial trials.\n")
} else {
cat("No significant difference was found between commercial and non-commercial\n")
cat("sponsors in trial completion time (p =", sponsor_p, ").\n")
}
## Non-commercial trials (academic and government sponsors) take significantly
## longer to complete than commercial (industry) trials. Specifically, they take
## approximately 111.3 % longer (p = 6e-04 ).
##
## This may reflect:
## - Differences in operational infrastructure and dedicated staff
## - Resource constraints in academic settings
## - The pandemic's disproportionate impact on academic medical centers
## - Restrictions on pulmonary function testing during COVID-19
if (best_dist == "Gamma") {
cat("The Gamma distribution provided better fit because:\n\n")
cat("1. The completion time distribution has moderate right skew\n")
cat("2. There are not extreme outliers that require the heavier tail of Inverse Gaussian\n")
cat("3. The Gamma's variance assumption (Var proportional to mean squared) matched our data\n")
cat("\nThe Inverse Gaussian assumes variance proportional to the mean CUBED, which\n")
cat("would require much more dramatic increases in variability for longer trials.\n")
cat("Our data did not show this pattern, so the Gamma was preferred.\n")
} else {
cat("The Inverse Gaussian distribution provided better fit because:\n\n")
cat("1. The completion time distribution has a heavy right tail\n")
cat("2. Some trials take much longer than expected (extreme values)\n")
cat("3. Trial completion can be thought of as 'time to reach a goal' which\n")
cat(" is exactly what the Inverse Gaussian models (first passage time)\n")
}
## The Gamma distribution provided better fit because:
##
## 1. The completion time distribution has moderate right skew
## 2. There are not extreme outliers that require the heavier tail of Inverse Gaussian
## 3. The Gamma's variance assumption (Var proportional to mean squared) matched our data
##
## The Inverse Gaussian assumes variance proportional to the mean CUBED, which
## would require much more dramatic increases in variability for longer trials.
## Our data did not show this pattern, so the Gamma was preferred.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] statmod_1.5.1 scales_1.4.0 ggplot2_4.0.0 dplyr_1.1.4
## [5] RPostgres_1.4.8
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-3 bit_4.6.0 gtable_0.3.6 jsonlite_2.0.0
## [5] compiler_4.5.1 tidyselect_1.2.1 blob_1.2.4 jquerylib_0.1.4
## [9] splines_4.5.1 yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.50
## [17] tibble_3.3.0 lubridate_1.9.4 DBI_1.2.3 bslib_0.9.0
## [21] pillar_1.11.0 RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
## [25] xfun_0.52 sass_0.4.10 S7_0.2.0 bit64_4.6.0-1
## [29] timechange_0.3.0 cli_3.6.5 mgcv_1.9-3 withr_3.0.2
## [33] magrittr_2.0.3 digest_0.6.37 grid_4.5.1 rstudioapi_0.17.1
## [37] hms_1.1.3 nlme_3.1-168 lifecycle_1.0.4 vctrs_0.6.5
## [41] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.30
## [45] tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1