1 Introduction

1.1 Research Question

Does trial sponsor type (commercial vs non-commercial) predict how long Phase 2 respiratory disease trials take to complete?

We examine trials initiated from March 2019 through December 2022, a period spanning the onset and evolution of COVID-19. Pandemic disruption could plausibly inflate the effect of sponsor type on time-to-completion. Respiratory trials are particularly interesting during this period because:

Respiratory patients may have avoided hospitals during COVID-19
Pulmonary function testing was often restricted (aerosol-generating procedures)
Hospital capacity was consumed by COVID-19 patients, potentially delaying other respiratory research

To do this we will take a sample from the AACT (ClinicalTrials.gov) database.

The dependent variable will be time-to-trial completion.
The independent variable is sponsor type (commercial or non-commercial).
Covariates will be number of clinical trial sites, total participants enrolled, number of trial treatment arms, and the number of months between the COVID start date (March 2020) and the trial start date.

1.2 Why Not Use Regular Linear Regression?

Regular linear regression (OLS) assumes:

The errors (the outcome variable, conditional on the predictors) are normally distributed
Variance is constant across all predicted values

Time-to-trial completion violates both:

Time is always positive (can’t be negative)
Time-to-trial completion is substantially right-skewed (most trials finish in a typical timeframe, but some drag on much longer)
Longer trials have more variable completion times (variance increases with the time-to-complete mean)

1.3 What is a GLM?

A GLM has three components:

Random Component: The probability distribution of the outcome (in this instance we will consider Gamma or Inverse Gaussian)
Systematic Component: The linear combination of predictors (β₀ + β₁X₁ + β₂X₂ + …)
Link Function: How the mean of the outcome relates to the predictors (we will use log link)
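
As a minimal sketch of how the three components fit together, the simulation below generates Gamma-distributed outcomes whose mean depends on one predictor through a log link, then recovers the coefficients with glm(). The data, the true coefficients (6 and 0.4), and the shape value are illustrative assumptions, not values from the trial data.

```r
# Simulated illustration only; none of these values come from AACT
set.seed(42)
n <- 500
x <- rnorm(n)

# Systematic component + log link: log(mu) = 6 + 0.4 * x
mu <- exp(6 + 0.4 * x)

# Random component: Gamma with mean mu and dispersion 1 / shape
shape <- 5
y <- rgamma(n, shape = shape, rate = shape / mu)

fit <- glm(y ~ x, family = Gamma(link = "log"))
print(round(coef(fit), 3))    # estimates should land near 6 and 0.4
print(all(fitted(fit) > 0))   # the log link keeps predicted means positive
```

The same three pieces (family, formula, link) appear in the model calls later in this document; only the data and predictors differ.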

2 Data Acquisition from AACT Database

2.1 What is AACT?

The AACT (Aggregate Analysis of ClinicalTrials.gov) database is a publicly available relational database maintained by the Clinical Trials Transformation Initiative (CTTI). It contains all information from ClinicalTrials.gov in a structured format that can be queried using SQL.

To access AACT, you need a free account: https://aact.ctti-clinicaltrials.org/users/sign_up

2.2 Load All Required Libraries

library(RPostgres)  # Database connection to AACT (PostgreSQL)
library(dplyr)      # Data manipulation (filtering, grouping, summarizing)
library(ggplot2)    # Data visualization (histograms, boxplots, scatterplots)
library(scales)     # Formatting for plot axes (commas, percentages)
library(statmod)    # Inverse Gaussian distribution functions (the GLM family inverse.gaussian() itself is in base stats)
library(htmltools)  # For embedding HTML diagrams

2.3 Connecting to the AACT Database

# ============================================================================
# NOTE: This code chunk connects to the live AACT database.
# Set eval=TRUE to run it (requires internet connection and AACT credentials)
# ============================================================================

username <- "MY USERNAME"
password <- "MY AACT PASSWORD"

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "aact",
  host = "aact-db.ctti-clinicaltrials.org",
  port = 5432,
  user = username,
  password = password
)
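
To keep credentials out of the script, one option is to read them from environment variables instead of typing them into the source. This is a minimal sketch: the helper name get_credential and the variable names AACT_USER and AACT_PASSWORD are assumptions for illustration, not something AACT or RPostgres defines.

```r
# Hypothetical helper: fetch a credential from an environment variable
# and fail loudly if it has not been set
get_credential <- function(var) {
  value <- Sys.getenv(var, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable '", var, "' is not set")
  }
  value
}

# Example usage (variable names are assumptions; set them in ~/.Renviron):
# username <- get_credential("AACT_USER")
# password <- get_credential("AACT_PASSWORD")
```

The values returned this way could then be passed to dbConnect() exactly as username and password are above.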

2.4 SQL Query

A query with the following parameters was used to pull Phase 2 interventional trials for chronic respiratory conditions during the period of interest.

2.4.1 Trial Inclusion Criteria

Included trials for these chronic respiratory conditions:

COPD (Chronic Obstructive Pulmonary Disease)
Asthma
Pulmonary Hypertension
Pulmonary Fibrosis (including Idiopathic Pulmonary Fibrosis/IPF)
Cystic Fibrosis
Interstitial Lung Disease
Bronchiectasis
Emphysema
Chronic Bronchitis

Start date: March 1, 2019 through December 31, 2022. This captures approximately one year pre-pandemic and nearly three years into the pandemic, allowing us to observe how trial operations adapted over time.

2.4.2 The Query

query <- "
SELECT DISTINCT
    s.nct_id, s.brief_title, s.overall_status, s.phase,
    s.enrollment, s.start_date, s.completion_date, s.number_of_arms,
    sp.agency_class as sponsor_type,
    (SELECT COUNT(*) FROM facilities f WHERE f.nct_id = s.nct_id) as num_sites
FROM studies s
LEFT JOIN sponsors sp ON s.nct_id = sp.nct_id AND sp.lead_or_collaborator = 'lead'
INNER JOIN conditions c ON s.nct_id = c.nct_id
WHERE s.phase = 'PHASE2'
  AND s.study_type = 'INTERVENTIONAL'
  AND s.start_date >= '2019-03-01'
  AND s.start_date <= '2022-12-31'
  AND (
    LOWER(c.downcase_name) LIKE '%copd%'
    OR LOWER(c.downcase_name) LIKE '%chronic obstructive pulmonary%'
    OR LOWER(c.downcase_name) LIKE '%asthma%'
    OR LOWER(c.downcase_name) LIKE '%pulmonary hypertension%'
    OR LOWER(c.downcase_name) LIKE '%pulmonary fibrosis%'
    OR LOWER(c.downcase_name) LIKE '%ipf%'
    OR LOWER(c.downcase_name) LIKE '%cystic fibrosis%'
    OR LOWER(c.downcase_name) LIKE '%interstitial lung%'
    OR LOWER(c.downcase_name) LIKE '%bronchiectasis%'
    OR LOWER(c.downcase_name) LIKE '%emphysema%'
    OR LOWER(c.downcase_name) LIKE '%chronic bronchitis%'
  )
  AND NOT (
    LOWER(c.downcase_name) LIKE '%lung cancer%'
    OR LOWER(c.downcase_name) LIKE '%lung neoplasm%'
    OR LOWER(c.downcase_name) LIKE '%non-small cell%'
    OR LOWER(c.downcase_name) LIKE '%small cell lung%'
  )
  AND NOT (
    LOWER(c.downcase_name) LIKE '%covid%'
    OR LOWER(c.downcase_name) LIKE '%coronavirus%'
    OR LOWER(c.downcase_name) LIKE '%pneumonia%'
    OR LOWER(c.downcase_name) LIKE '%influenza%'
  )
"

2.5 Execute Query and Download Data

trials <- dbGetQuery(con, query)
dbDisconnect(con)

save(trials, file = "phase2_respiratory_trials.RData")
write.csv(trials, "phase2_respiratory_trials_raw.csv", row.names = FALSE)

2.6 Load Previously Downloaded Data

# Load from the working directory, where the file was saved in section 2.5
load("phase2_respiratory_trials.RData")

172 trials included from the AACT database.
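
One detail of the query in section 2.4.2 may be worth checking. Its WHERE clause is evaluated once per row of the conditions table, so a trial listing both a target condition and an excluded condition can still be returned through its target row. The toy example below uses hypothetical trial IDs and condition names to contrast that row-level filter with a trial-level exclusion; it is a sketch of the logic, not a rerun of the AACT query.

```r
# Hypothetical conditions table: one row per (trial, condition)
conditions <- data.frame(
  nct_id = c("T1", "T1", "T2", "T3"),
  name = c("asthma", "covid-19", "copd", "covid-19 pneumonia"),
  stringsAsFactors = FALSE
)

is_target <- grepl("asthma|copd", conditions$name)
is_excluded <- grepl("covid|pneumonia", conditions$name)

# Row-level filter (mirrors WHERE target AND NOT excluded on each row)
row_level <- unique(conditions$nct_id[is_target & !is_excluded])

# Trial-level filter: drop every trial that has any excluded condition row
excluded_trials <- unique(conditions$nct_id[is_excluded])
trial_level <- setdiff(unique(conditions$nct_id[is_target]), excluded_trials)

print(row_level)    # T1 is kept because its asthma row passes the filter
print(trial_level)  # T1 is dropped because it also lists covid-19
```

If trial-level exclusion is the intent, the usual SQL counterpart is a NOT EXISTS subquery on the conditions table; switching to it could change the 172-trial count reported here.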

3 Dataset Creation & Cleaning

# Create analysis dataset and variables
mydata <- trials

# Create sponsor type: Commercial vs Non-Commercial
# INDUSTRY = pharmaceutical and biotech companies (Commercial)
# Everything else (NIH, academic institutions, etc.) = Non-Commercial
mydata$sponsor <- ifelse(mydata$sponsor_type == "INDUSTRY", "Commercial", "Non-Commercial")
mydata$sponsor <- factor(mydata$sponsor, levels = c("Commercial", "Non-Commercial"))

# Calculate time-to-completion in days
mydata$start_date <- as.Date(mydata$start_date)
mydata$completion_date <- as.Date(mydata$completion_date)
mydata$completion_time <- as.numeric(mydata$completion_date - mydata$start_date)

# Define the COVID start date for the relative-to-COVID-start timing variable
covid_start <- as.Date("2020-03-01")

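# Calculate approximate months from COVID start (days / 31; an average month is
# ~30.44 days, so months are slightly understated; z-scoring later removes the divisor's effect on the model)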
# Negative values = trial started BEFORE pandemic declaration
# Positive values = trial started AFTER pandemic declaration
mydata$months_from_covid <- as.numeric(mydata$start_date - covid_start) / 31

# Make num_sites numeric as count from listing
mydata$num_sites <- as.numeric(mydata$num_sites)

# Track sample at each step
n_start <- nrow(mydata)
n_completed <- sum(mydata$overall_status == "COMPLETED")

mydata <- mydata[mydata$overall_status == "COMPLETED", ]
mydata <- mydata[mydata$completion_time > 0, ]
n_valid_time <- nrow(mydata)

mydata <- mydata[!is.na(mydata$num_sites), ]
mydata <- mydata[!is.na(mydata$enrollment), ]
mydata <- mydata[!is.na(mydata$number_of_arms), ]
n_final <- nrow(mydata)

# Display as table
filtering_steps <- data.frame(
  Step = c("Starting sample", 
           "After keeping only COMPLETED trials", 
           "After removing invalid completion times", 
           "After removing missing covariate data"),
  N_Trials = c(n_start, n_completed, n_valid_time, n_final)
)

print(filtering_steps)

##                                      Step N_Trials
## 1                         Starting sample      172
## 2     After keeping only COMPLETED trials       74
## 3 After removing invalid completion times       74
## 4   After removing missing covariate data       73

The final sample is 73 completed respiratory trials.

The continuous covariates are on very different scales. Scaling them helps the model converge and makes the coefficients comparable from one covariate to the next.

mydata$num_sites_z <- scale(mydata$num_sites)[,1]
mydata$enrollment_z <- scale(mydata$enrollment)[,1]
mydata$arms_z <- scale(mydata$number_of_arms)[,1]
mydata$covid_timing_z <- scale(mydata$months_from_covid)[,1]

# Show what 1 SD means in original units
scaling_reference <- data.frame(
  Variable = c("Number of Sites", "Enrollment", "Number of Arms", "Months from COVID"),
  One_SD_Equals = c(
    paste(round(sd(mydata$num_sites), 1), "sites"),
    paste(round(sd(mydata$enrollment), 1), "participants"),
    paste(round(sd(mydata$number_of_arms), 2), "arms"),
    paste(round(sd(mydata$months_from_covid), 1), "months")
  )
)

print(scaling_reference)

##            Variable      One_SD_Equals
## 1   Number of Sites         36.3 sites
## 2        Enrollment 189.6 participants
## 3    Number of Arms          1.63 arms
## 4 Months from COVID        13.9 months

Scaling does NOT affect our main predictor (sponsor type) because it is categorical.
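
Because z-scoring is a linear rescaling, a per-SD coefficient is simply the per-unit coefficient multiplied by the covariate's SD. The sketch below checks this on simulated data (the variable sites and all parameter values are illustrative assumptions), which also shows how a per-SD effect could be converted back to original units.

```r
# Simulated illustration only; not the trial data
set.seed(7)
n <- 300
sites <- rpois(n, 20) + 1
mu <- exp(6 + 0.01 * sites)
y <- rgamma(n, shape = 5, rate = 5 / mu)

sites_z <- scale(sites)[, 1]

fit_raw <- glm(y ~ sites, family = Gamma(link = "log"))
fit_z <- glm(y ~ sites_z, family = Gamma(link = "log"))

beta_raw <- unname(coef(fit_raw)["sites"])
beta_z <- unname(coef(fit_z)["sites_z"])

# The two parameterizations describe the same fit
print(c(per_sd = beta_z, per_unit_times_sd = beta_raw * sd(sites)))

# Converting a per-SD coefficient back to a per-unit coefficient
per_unit_from_z <- beta_z / sd(sites)
print(per_unit_from_z)
```

Dividing a reported per-SD coefficient by the SD in the table above gives the effect per site, participant, arm, or month on the log scale.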

3.1 Save Final Dataset

write.csv(mydata, "phase2_respiratory_trials_clean.csv", row.names = FALSE)

Saved cleaned dataset as phase2_respiratory_trials_clean.csv.

4 Exploratory Data Analysis

4.1 How Many Trials in Each Group?

sponsor_summary <- data.frame(
  Sponsor_Type = c("Commercial", "Non-Commercial", "Total"),
  N_Trials = c(
    sum(mydata$sponsor == "Commercial"),
    sum(mydata$sponsor == "Non-Commercial"),
    nrow(mydata)
  ),
  Percent = c(
    round(sum(mydata$sponsor == "Commercial") / nrow(mydata) * 100, 1),
    round(sum(mydata$sponsor == "Non-Commercial") / nrow(mydata) * 100, 1),
    100
  )
)

print(sponsor_summary)

##     Sponsor_Type N_Trials Percent
## 1     Commercial       53    72.6
## 2 Non-Commercial       20    27.4
## 3          Total       73   100.0

4.2 What Does Completion Time Look Like?

mean_time <- mean(mydata$completion_time)
median_time <- median(mydata$completion_time)

ggplot(mydata, aes(x = completion_time)) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = mean_time, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_time, color = "green", linetype = "solid", linewidth = 1) +
  labs(title = "Distribution of Trial Completion Time",
       subtitle = "Red dashed = Mean | Green solid = Median | Mean > Median confirms right skew",
       x = "Completion Time (days)",
       y = "Count") +
  annotate("text", x = mean_time + 50, y = Inf, label = paste("Mean:", round(mean_time, 0)), 
           vjust = 2, hjust = 0, color = "red", size = 3.5) +
  annotate("text", x = median_time - 50, y = Inf, label = paste("Median:", round(median_time, 0)), 
           vjust = 2, hjust = 1, color = "green", size = 3.5)

The distribution is right-skewed (mean > median), which is why we consider a Gamma or Inverse Gaussian distribution rather than a Normal one.

4.3 Completion Time by Sponsor Type

ggplot(mydata, aes(x = sponsor, y = completion_time, fill = sponsor)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Commercial" = "blue", "Non-Commercial" = "green")) +
  labs(title = "Completion Time by Sponsor Type",
       x = "Sponsor Type",
       y = "Completion Time (days)") +
  theme(legend.position = "none")

summary_by_sponsor <- mydata %>%
  group_by(sponsor) %>%
  summarise(
    N = n(),
    Mean = round(mean(completion_time), 0),
    Median = round(median(completion_time), 0),
    SD = round(sd(completion_time), 0),
    Min = min(completion_time),
    Max = max(completion_time)
  ) %>%
  as.data.frame()

print(summary_by_sponsor)

##          sponsor  N Mean Median  SD Min  Max
## 1     Commercial 53  681    637 330  89 1646
## 2 Non-Commercial 20 1011    954 534 322 1917

4.4 Does Variance Increase with Mean?

For Gamma and Inverse Gaussian to be appropriate, variance should increase with the mean.

simple_fit <- lm(completion_time ~ sponsor + num_sites + enrollment, data = mydata)

plot_data <- data.frame(
  fitted = fitted(simple_fit),
  abs_resid = abs(residuals(simple_fit))
)

ggplot(plot_data, aes(x = fitted, y = abs_resid)) +
  geom_point(size = 2, color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Heteroscedasticity Check",
       subtitle = "Upward slope indicates variance increases with predicted completion time",
       x = "Fitted Values (Predicted Completion Time)",
       y = "Absolute Residuals")

The plot suggests a modest upward trend, though with N = 73 and sparse data at higher fitted values, the pattern is not definitive. However, duration data like trial completion time typically exhibits increasing variance: trials expected to take longer have more opportunities for delays and variability. This theoretical expectation, combined with the positive values and right skew observed in the histogram, justifies using Gamma or Inverse Gaussian GLMs. The diagnostic plots after model fitting will further validate this choice.
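
One way to probe the mean-variance relationship more directly is to compare variance and mean on the log scale: a slope near 2 is what a Gamma model implies, and a slope near 3 is what an Inverse Gaussian implies. The sketch below demonstrates the idea on simulated Gamma data (the group means and the shape value are assumptions); with only 73 trials, the same check on the real data would be much noisier.

```r
# Simulated illustration only; not the trial data
set.seed(1)
mu_levels <- c(100, 200, 400, 800, 1600)
shape <- 5

# Draw a large Gamma sample at each mean level
sims <- lapply(mu_levels, function(m) rgamma(5000, shape = shape, rate = shape / m))

group_mean <- sapply(sims, mean)
group_var <- sapply(sims, var)

# Slope of log(variance) on log(mean) estimates the variance power
slope <- unname(coef(lm(log(group_var) ~ log(group_mean)))[2])
print(round(slope, 2))   # expected to be close to 2 for Gamma data
```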

5 Understanding Our Two GLM Distributions

5.1 The Gamma Distribution

What is it?

A continuous probability distribution for positive values
Right-skewed (has a long right tail)
Commonly used for waiting times, survival times, and durations

Key assumption: Variance is proportional to the mean squared: Var(Y) ∝ μ²

When to use: Moderate right skew, no extreme outliers

5.2 The Inverse Gaussian Distribution

What is it?

Also a continuous distribution for positive values
Has a heavier right tail than Gamma
Arises naturally as the time for a process to reach a threshold (“first passage time”)

Key assumption: Variance is proportional to the mean cubed: Var(Y) ∝ μ³

When to use: Heavy right tails, extreme outliers, when data represents “time to reach a goal”

5.3 The Log Link Function

Both models use a log link:

\[\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]

This means:

\[\mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...}\]

Why log link?

Ensures predicted values are always positive (can’t predict negative time)
Coefficients have a multiplicative interpretation: exp(β) is the factor by which expected completion time is multiplied, and (exp(β) − 1) × 100 is the percent change
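
The conversion from a log-link coefficient to a percent change can be sketched as follows. The helper name pct_change is an assumption for illustration; the two coefficients are the sponsor and COVID timing estimates reported in the Gamma model output later in this document.

```r
# Hypothetical helper: percent change in the expected outcome implied by
# a coefficient on the log scale
pct_change <- function(beta) (exp(beta) - 1) * 100

beta_sponsor <- 0.46917   # sponsorNon-Commercial (Gamma model)
beta_covid <- -0.11725    # covid_timing_z (Gamma model)

print(round(exp(beta_sponsor), 4))          # multiplicative factor
print(round(pct_change(beta_sponsor), 1))   # percent change vs Commercial
print(round(pct_change(beta_covid), 1))     # percent change per 1 SD
```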

6 Model 1: Gamma GLM

6.1 The Model Equation

Part 1 — Distribution:

\[\text{CompletionTime}_i \sim \text{Gamma}(\mu_i, \phi)\]

Each trial’s completion time follows a Gamma distribution with mean μᵢ and dispersion parameter φ.

Part 2 — Link Function and Linear Predictor:

\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]

The log of the expected completion time is a linear combination of our predictors.

Putting it together:

\[\mu_i = e^{\beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + ...}\]

The Gamma distribution determines the shape of the error (right-skewed, variance ∝ μ²). The log link ensures predictions are always positive. The β coefficients tell us the effect of each predictor.

Gamma GLM Structure: Predicting Trial Completion Time

Distribution Assumption: CompletionTime ~ Gamma(μ, φ), with Variance ∝ μ² (variance increases with mean squared)

Log Link Function: log(μ) = β₀ + β₁(NonCommercial) + β₂(NumSites) + β₃(Enrollment) + β₄(Arms) + β₅(CovidTiming)

Categorical predictor (dummy coded): Sponsor Type
• Commercial = 0 (reference)
• Non-Commercial = 1
β₁ = difference from reference

Continuous predictors (scaled):
• Number of Sites (z-scored)
• Enrollment (z-scored)
• Number of Arms (z-scored)
• COVID Timing (z-scored)
β = change per 1 SD increase

Interpretation (exponentiate for multiplicative effects):
exp(β₁) = 1.60 → Non-commercial trials take 60% longer than Commercial
exp(β₅) = 0.89 → Each SD later from COVID start, trials complete 11% faster

6.2 Fitting the Model

gamma_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
                   data = mydata,
                   family = Gamma(link = "log"))

summary(gamma_model)

## 
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z + 
##     arms_z + covid_timing_z, family = Gamma(link = "log"), data = mydata)
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.47790    0.06435 100.664  < 2e-16 ***
## sponsorNon-Commercial  0.46917    0.13306   3.526 0.000767 ***
## num_sites_z            0.16226    0.09315   1.742 0.086124 .  
## enrollment_z           0.04680    0.09039   0.518 0.606376    
## arms_z                -0.06303    0.05813  -1.084 0.282170    
## covid_timing_z        -0.11725    0.05529  -2.120 0.037672 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2052868)
## 
##     Null deviance: 23.535  on 72  degrees of freedom
## Residual deviance: 17.401  on 67  degrees of freedom
## AIC: 1066.5
## 
## Number of Fisher Scoring iterations: 6

6.3 Interpret the Coefficients

For a log link, we exponentiate coefficients to get multiplicative effects:

coefs <- coef(gamma_model)
pvals <- summary(gamma_model)$coefficients[, 4]

results_gamma <- data.frame(
  Variable = c("(Intercept)", "Sponsor: Non-Commercial", "Number of Sites", 
               "Enrollment", "Number of Arms", "COVID Timing"),
  Coefficient = round(coefs, 4),
  Exp_Coef = round(exp(coefs), 4),
  Pct_Change = round((exp(coefs) - 1) * 100, 1),
  P_Value = round(pvals, 4),
  Significant = ifelse(pvals < 0.05, "Yes", ifelse(pvals < 0.10, "Borderline", "No"))
)
row.names(results_gamma) <- NULL

print(results_gamma)

##                  Variable Coefficient Exp_Coef Pct_Change P_Value Significant
## 1             (Intercept)      6.4779 650.6019    64960.2  0.0000         Yes
## 2 Sponsor: Non-Commercial      0.4692   1.5987       59.9  0.0008         Yes
## 3         Number of Sites      0.1623   1.1762       17.6  0.0861  Borderline
## 4              Enrollment      0.0468   1.0479        4.8  0.6064          No
## 5          Number of Arms     -0.0630   0.9389       -6.1  0.2822          No
## 6            COVID Timing     -0.1172   0.8894      -11.1  0.0377         Yes

How to interpret:

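Exp_Coef = multiplicative effect on expected completion time
Pct_Change = percent increase (positive) or decrease (negative), computed as (Exp_Coef − 1) × 100
For the intercept, Exp_Coef (about 651 days) is the expected completion time for a commercial trial with all covariates at their sample means; its Pct_Change is not meaningful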
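
A point estimate of the multiplicative effect is easier to judge alongside an interval. The sketch below builds an approximate Wald-type 95% interval for the sponsor effect from the estimate, standard error, and residual degrees of freedom printed in the summary above; using the t quantile is an assumption made to match the t-tests in that output, and a profile-likelihood interval could differ slightly.

```r
# Values copied from summary(gamma_model) above
est <- 0.46917      # sponsorNon-Commercial estimate (log scale)
se <- 0.13306       # its standard error
df_resid <- 67      # residual degrees of freedom

# Wald-type interval on the log scale, then exponentiate
crit <- qt(0.975, df = df_resid)
ci_log <- est + c(-1, 1) * crit * se
ratio_ci <- exp(ci_log)

print(round(exp(est), 2))    # point estimate of the ratio
print(round(ratio_ci, 2))    # approximate 95% interval for the ratio
```

An interval that excludes 1 on the ratio scale corresponds to a coefficient that differs from 0 at the 5% level.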

6.4 Check Model Fit

par(mfrow = c(1, 2))

plot(fitted(gamma_model), residuals(gamma_model, type = "deviance"),
     xlab = "Fitted Values", ylab = "Deviance Residuals",
     main = "Gamma: Residuals vs Fitted", col = "blue")
abline(h = 0, col = "red", lty = 2)

qqnorm(residuals(gamma_model, type = "deviance"), main = "Gamma: Q-Q Plot", col = "blue")
qqline(residuals(gamma_model, type = "deviance"), col = "red")

par(mfrow = c(1, 1))

What we look for:

Residuals vs Fitted: Random scatter around zero (no patterns)
Q-Q Plot: Points should follow the diagonal line

6.5 Model Fit Statistics

gamma_fit_stats <- data.frame(
  Metric = c("AIC", "BIC", "Null Deviance", "Residual Deviance", "Deviance Explained", "Dispersion Parameter"),
  Value = c(
    round(AIC(gamma_model), 1),
    round(BIC(gamma_model), 1),
    round(gamma_model$null.deviance, 2),
    round(gamma_model$deviance, 2),
    paste0(round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%"),
    round(summary(gamma_model)$dispersion, 4)
  )
)

print(gamma_fit_stats)

##                 Metric  Value
## 1                  AIC 1066.5
## 2                  BIC 1082.6
## 3        Null Deviance  23.54
## 4    Residual Deviance   17.4
## 5   Deviance Explained  26.1%
## 6 Dispersion Parameter 0.2053

7 Model 2: Inverse Gaussian GLM

7.1 The Model Equation

Part 1 — Distribution:

\[\text{CompletionTime}_i \sim \text{InverseGaussian}(\mu_i, \lambda)\]

This says: Each trial’s completion time follows an Inverse Gaussian distribution with mean μᵢ and shape parameter λ.

Part 2 — Link Function and Linear Predictor:

\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]

Key difference from Gamma: The Inverse Gaussian has a heavier right tail and assumes variance ∝ μ³ (instead of μ²). Everything else stays the same.

Inverse Gaussian GLM Structure: Predicting Trial Completion Time

Distribution Assumption: CompletionTime ~ Inverse Gaussian(μ, λ), with Variance ∝ μ³ (variance increases with mean cubed; heavier tails than Gamma)

Log Link Function: log(μ) = β₀ + β₁(NonCommercial) + β₂(NumSites) + β₃(Enrollment) + β₄(Arms) + β₅(CovidTiming)

Categorical predictor (dummy coded): Sponsor Type
• Commercial = 0 (reference)
• Non-Commercial = 1
β₁ = difference from reference

Continuous predictors (scaled):
• Number of Sites (z-scored)
• Enrollment (z-scored)
• Number of Arms (z-scored)
• COVID Timing (z-scored)
β = change per 1 SD increase

Interpretation (exponentiate for multiplicative effects):
exp(β₁) = 1.57 → Non-commercial trials take 57% longer than Commercial
exp(β₅) = 0.91 → Each SD later from COVID start, trials complete 9% faster

Why Gamma was preferred: Inverse Gaussian assumes variance grows with μ³, which is too extreme for these data. The Gamma's μ² assumption provided better fit (lower AIC/BIC).

7.2 Fit the Model

ig_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
                data = mydata,
                family = inverse.gaussian(link = "log"),
                control = glm.control(maxit = 100))

summary(ig_model)

## 
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z + 
##     arms_z + covid_timing_z, family = inverse.gaussian(link = "log"), 
##     data = mydata, control = glm.control(maxit = 100))
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.48267    0.06198 104.591   <2e-16 ***
## sponsorNon-Commercial  0.44796    0.14317   3.129   0.0026 ** 
## num_sites_z            0.14782    0.09805   1.508   0.1364    
## enrollment_z           0.07461    0.10873   0.686   0.4949    
## arms_z                -0.05447    0.05568  -0.978   0.3315    
## covid_timing_z        -0.09898    0.05546  -1.785   0.0788 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.0002911946)
## 
##     Null deviance: 0.041750  on 72  degrees of freedom
## Residual deviance: 0.034137  on 67  degrees of freedom
## AIC: 1082.1
## 
## Number of Fisher Scoring iterations: 9

7.3 Interpret the Coefficients

coefs_ig <- coef(ig_model)
pvals_ig <- summary(ig_model)$coefficients[, 4]

results_ig <- data.frame(
  Variable = c("(Intercept)", "Sponsor: Non-Commercial", "Number of Sites", 
               "Enrollment", "Number of Arms", "COVID Timing"),
  Coefficient = round(coefs_ig, 4),
  Exp_Coef = round(exp(coefs_ig), 4),
  Pct_Change = round((exp(coefs_ig) - 1) * 100, 1),
  P_Value = round(pvals_ig, 4),
  Significant = ifelse(pvals_ig < 0.05, "Yes", ifelse(pvals_ig < 0.10, "Borderline", "No"))
)
row.names(results_ig) <- NULL

print(results_ig)

##                  Variable Coefficient Exp_Coef Pct_Change P_Value Significant
## 1             (Intercept)      6.4827 653.7148    65271.5  0.0000         Yes
## 2 Sponsor: Non-Commercial      0.4480   1.5651       56.5  0.0026         Yes
## 3         Number of Sites      0.1478   1.1593       15.9  0.1364          No
## 4              Enrollment      0.0746   1.0775        7.7  0.4949          No
## 5          Number of Arms     -0.0545   0.9470       -5.3  0.3315          No
## 6            COVID Timing     -0.0990   0.9058       -9.4  0.0788  Borderline

7.4 Check Model Fit

par(mfrow = c(1, 2))

plot(fitted(ig_model), residuals(ig_model, type = "deviance"),
     xlab = "Fitted Values", ylab = "Deviance Residuals",
     main = "Inverse Gaussian: Residuals vs Fitted", col = "blue")
abline(h = 0, col = "red", lty = 2)

qqnorm(residuals(ig_model, type = "deviance"), main = "Inverse Gaussian: Q-Q Plot", col = "blue")
qqline(residuals(ig_model, type = "deviance"), col = "red")

par(mfrow = c(1, 1))

7.5 Model Fit Statistics

ig_fit_stats <- data.frame(
  Metric = c("AIC", "BIC", "Null Deviance", "Residual Deviance", "Deviance Explained", "Dispersion Parameter"),
  Value = c(
    round(AIC(ig_model), 1),
    round(BIC(ig_model), 1),
    round(ig_model$null.deviance, 4),
    round(ig_model$deviance, 4),
    paste0(round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%"),
    round(summary(ig_model)$dispersion, 6)
  )
)

print(ig_fit_stats)

##                 Metric    Value
## 1                  AIC   1082.1
## 2                  BIC   1098.2
## 3        Null Deviance   0.0418
## 4    Residual Deviance   0.0341
## 5   Deviance Explained    18.2%
## 6 Dispersion Parameter 0.000291

8 Gamma vs Inverse Gaussian Comparison

comparison_table <- data.frame(
  Metric = c("AIC", "BIC", "Deviance Explained"),
  Gamma = c(
    round(AIC(gamma_model), 1),
    round(BIC(gamma_model), 1),
    paste0(round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%")
  ),
  Inverse_Gaussian = c(
    round(AIC(ig_model), 1),
    round(BIC(ig_model), 1),
    paste0(round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%")
  )
)

print(comparison_table)

##               Metric  Gamma Inverse_Gaussian
## 1                AIC 1066.5           1082.1
## 2                BIC 1082.6           1098.2
## 3 Deviance Explained  26.1%            18.2%

The Gamma model has lower AIC (1066.5 vs 1082.1) and lower BIC (1082.6 vs 1098.2), indicating better fit.

This makes sense because the data show moderate right skew without many extreme outliers. The Gamma’s variance assumption (Var ∝ μ²) appears to fit the observed mean-variance relationship better, while the Inverse Gaussian’s assumption that variance is proportional to the mean cubed appears too extreme for these data.
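
To put the size of the AIC gap in perspective, a common rule of thumb converts the difference into a relative likelihood, exp(−ΔAIC / 2). The sketch below applies it to the two AIC values reported above; treating this quantity as a rough measure of relative support is an assumption of the illustration, not a formal test.

```r
# AIC values copied from the comparison table above
aic_gamma <- 1066.5
aic_ig <- 1082.1

delta_aic <- aic_ig - aic_gamma

# Relative likelihood of the Inverse Gaussian model versus the Gamma model
rel_lik <- exp(-delta_aic / 2)

print(round(delta_aic, 1))
print(signif(rel_lik, 2))
```

Differences of more than about 10 AIC units are conventionally read as strong support for the lower-AIC model.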

GLM Model Comparison: Variable Treatment

How Variables Enter the Model

Categorical: Sponsor Type
• Commercial = 0 (reference)
• Non-Commercial = 1
R creates a dummy variable; β = difference from reference

Continuous (z-scored):
• Number of Sites
• Enrollment
• Number of Arms
• COVID Timing
Standardized: mean = 0, SD = 1; β = change per 1 SD increase

Both Models Use the Same Link Function

log(μ) = β₀ + β₁(NonCommercial) + β₂(NumSites) + β₃(Enrollment) + β₄(Arms) + β₅(CovidTiming)

Interpretation: exp(β) = multiplicative effect on completion time

GAMMA
Distribution: CompletionTime ~ Gamma(μ, φ)
Variance ∝ μ²
Best for: Moderate right skew, no extreme outliers
Fit: AIC = 1066.5, BIC = 1082.6

INVERSE GAUSSIAN
Distribution: CompletionTime ~ IG(μ, λ)
Variance ∝ μ³
Best for: Heavy tails, extreme outliers
Fit: AIC = 1082.1, BIC = 1098.2

Model Selection: Gamma is preferred. Its lower AIC/BIC is consistent with variance increasing with μ² rather than μ³.

9 Conclusions

9.1 Summary of Findings

9.1.1 Research Question

Does sponsor type predict Phase 2 respiratory disease trial completion time?

9.1.2 Sample Characteristics

sample_chars <- data.frame(
  Characteristic = c("Total Completed Trials", "Commercial Sponsors", "Non-Commercial Sponsors",
                     "Time Period", "Therapeutic Areas"),
  Value = c(
    nrow(mydata),
    sum(mydata$sponsor == "Commercial"),
    sum(mydata$sponsor == "Non-Commercial"),
    "March 2019 - December 2022",
    "COPD, Asthma, Pulmonary Hypertension, Pulmonary Fibrosis, Cystic Fibrosis, ILD, Bronchiectasis"
  )
)

print(sample_chars)

##            Characteristic
## 1  Total Completed Trials
## 2     Commercial Sponsors
## 3 Non-Commercial Sponsors
## 4             Time Period
## 5       Therapeutic Areas
##                                                                                            Value
## 1                                                                                             73
## 2                                                                                             53
## 3                                                                                             20
## 4                                                                     March 2019 - December 2022
## 5 COPD, Asthma, Pulmonary Hypertension, Pulmonary Fibrosis, Cystic Fibrosis, ILD, Bronchiectasis

9.1.3 Model Selection

Both the Gamma and Inverse Gaussian GLMs converged successfully. The Gamma model provided better fit based on AIC and BIC. This is consistent with the mean-variance relationship observed in exploratory analysis: variance increases with the mean, but not as dramatically as the Inverse Gaussian’s cubic assumption would require. The Gamma’s quadratic variance assumption (Var ∝ μ²) is a better match for these data.

9.1.4 Hypothesis Test: Sponsor Type

H₀: Sponsor type does not predict trial completion time

H₁: Sponsor type does predict trial completion time

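# best_pct and best_pvals are not defined in the code shown; they are assumed here
# to come from the preferred (Gamma) model, which reproduces the values printed below
best_pct <- (exp(coef(gamma_model)) - 1) * 100
best_pvals <- summary(gamma_model)$coefficients[, 4]

sponsor_result <- data.frame(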
  Metric = c("Effect Size", "Direction", "P-Value", "Decision"),
  Value = c(
    paste0(round(best_pct["sponsorNon-Commercial"], 1), "%"),
    ifelse(best_pct["sponsorNon-Commercial"] > 0, "Non-commercial trials take LONGER", "Non-commercial trials complete FASTER"),
    round(best_pvals["sponsorNon-Commercial"], 4),
    ifelse(best_pvals["sponsorNon-Commercial"] < 0.05, "REJECT H0", "FAIL TO REJECT H0")
  )
)

print(sponsor_result)

##        Metric                             Value
## 1 Effect Size                             59.9%
## 2   Direction Non-commercial trials take LONGER
## 3     P-Value                             8e-04
## 4    Decision                         REJECT H0

Non-commercial trials take significantly longer to complete than commercial trials. Specifically, holding the other covariates constant, trials sponsored by academic institutions or government agencies take approximately 60% longer than industry-sponsored trials (p = 0.0008).

9.1.5 Effect of COVID Timing

covid_result <- data.frame(
  Metric = c("Effect Size (per 1 SD)", "1 SD equals", "Direction", "P-Value", "Significant"),
  Value = c(
    paste0(round(best_pct["covid_timing_z"], 1), "%"),
    paste(round(sd(mydata$months_from_covid), 1), "months"),
    ifelse(best_pct["covid_timing_z"] < 0, "Trials starting LATER completed FASTER", "Trials starting LATER took LONGER"),
    round(best_pvals["covid_timing_z"], 4),
    ifelse(best_pvals["covid_timing_z"] < 0.05, "Yes", "No")
  )
)

print(covid_result)

##                   Metric                                  Value
## 1 Effect Size (per 1 SD)                                 -11.1%
## 2            1 SD equals                            13.9 months
## 3              Direction Trials starting LATER completed FASTER
## 4                P-Value                                 0.0377
## 5            Significant                                    Yes

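The COVID timing variable is statistically significant (p = 0.038). The negative coefficient indicates that trials starting later relative to the March 2020 pandemic declaration completed faster: for each 13.9-month (1 SD) later start date, expected completion time decreased by approximately 11%.

One interpretation is that sponsors adapted to pandemic conditions over time. Because only completed trials are analyzed, later-starting trials could enter the sample only if they finished quickly, so part of this effect may reflect selection rather than adaptation. Early in the pandemic, respiratory trials faced severe disruptions: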

Pulmonary function testing was restricted (aerosol-generating)
Patients avoided hospital visits
Healthcare systems were overwhelmed

As the pandemic progressed, sponsors developed workarounds: remote monitoring, decentralized trial designs, and revised safety protocols. Trials initiating later benefited from these adaptations.

9.1.6 All Covariate Effects

# Format a percent effect with an explicit "+" sign for positive values
fmt_pct <- function(term, suffix) {
  pct <- best_pct[term]
  paste0(ifelse(pct > 0, "+", ""), round(pct, 1), suffix)
}

# Model terms (intercept excluded), indexed by name so rows cannot fall out of order
model_terms <- c("sponsorNon-Commercial", "num_sites_z", "enrollment_z", "arms_z", "covid_timing_z")

covariate_summary <- data.frame(
  Variable = c("Sponsor Type (Non-Commercial)", "Number of Sites", "Enrollment",
               "Number of Arms", "COVID Timing"),
  Effect = c(
    fmt_pct(model_terms[1], "% vs Commercial"),
    fmt_pct(model_terms[-1], "% per SD")
  ),
  P_Value = round(best_pvals[model_terms], 4),
  Significant = ifelse(best_pvals[model_terms] < 0.05, "Yes",
                       ifelse(best_pvals[model_terms] < 0.10, "Borderline", "No"))
)

print(covariate_summary)

##                                            Variable               Effect
## sponsorNon-Commercial Sponsor Type (Non-Commercial) +59.9% vs Commercial
## num_sites_z                         Number of Sites        +17.6% per SD
## enrollment_z                             Enrollment         +4.8% per SD
## arms_z                               Number of Arms         -6.1% per SD
## covid_timing_z                         COVID Timing        -11.1% per SD
##                       P_Value Significant
## sponsorNon-Commercial  0.0008         Yes
## num_sites_z            0.0861  Borderline
## enrollment_z           0.6064          No
## arms_z                 0.2822          No
## covid_timing_z         0.0377         Yes
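
The vectors best_pct and best_pvals are used above without their construction being shown in this section. As a minimal sketch of how such vectors can be obtained from a log-link Gamma GLM, the example below fits a model to simulated data. The data, the coefficient values used in the simulation, and the names sim, fit, sim_pct, and sim_pvals are all hypothetical and are not the report's actual objects or results.

```r
set.seed(42)

# Simulated stand-in data with the same group sizes as the report (53 vs 20)
n <- 73
sim <- data.frame(
  sponsor = factor(rep(c("Commercial", "Non-Commercial"), times = c(53, 20))),
  covid_timing_z = as.numeric(scale(runif(n, min = 0, max = 45)))
)

# Hypothetical true model on the log scale, used only to generate data
mu <- exp(3 + 0.47 * (sim$sponsor == "Non-Commercial") - 0.12 * sim$covid_timing_z)
sim$months <- rgamma(n, shape = 4, rate = 4 / mu)

# Gamma GLM with log link
fit <- glm(months ~ sponsor + covid_timing_z,
           family = Gamma(link = "log"), data = sim)

# Coefficient table: estimates are on the log scale
coefs <- coef(summary(fit))

# Percent change in expected completion time per unit of each predictor
sim_pct   <- 100 * (exp(coefs[, "Estimate"]) - 1)
sim_pvals <- coefs[, "Pr(>|t|)"]

print(round(sim_pct[-1], 1))    # drop the intercept, as in the table above
print(round(sim_pvals[-1], 4))
```

Indexing these vectors by term name, rather than by position, keeps the summary table aligned with the model even if the formula is later reordered.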

9.1.7 Practical Implications

implications <- data.frame(
  Finding = c("Sponsor Selection", "Pandemic Adaptation", "Trial Design"),
  Implication = c(
    "Non-commercial trials take ~60% longer. This affects drug development timelines, budgeting, and academic-industry partnership decisions.",
    "Trials starting later in the pandemic completed faster, demonstrating that the clinical trial enterprise can adapt to major disruptions.",
    "Enrollment size and number of arms were not significant. Number of sites showed a borderline effect, suggesting multi-site complexity may extend timelines."
  )
)

print(implications)

##               Finding
## 1   Sponsor Selection
## 2 Pandemic Adaptation
## 3        Trial Design
##                                                                                                                                                   Implication
## 1                    Non-commercial trials take ~60% longer. This affects drug development timelines, budgeting, and academic-industry partnership decisions.
## 2                    Trials starting later in the pandemic completed faster, demonstrating that the clinical trial enterprise can adapt to major disruptions.
## 3 Enrollment size and number of arms were not significant. Number of sites showed a borderline effect, suggesting multi-site complexity may extend timelines.

9.1.8 Limitations

Sample Size: With N = 73 trials, this analysis has only moderate power, so smaller effects (such as those of enrollment or number of arms) may go undetected.
Group Imbalance: The sample was imbalanced between commercial (n = 53) and non-commercial (n = 20) sponsors, reflecting the composition of the Phase 2 respiratory trial population. The smaller non-commercial group may result in less precise estimates for that category, though the effect remained highly significant.
Observational Design: We cannot establish causation. Unmeasured confounders (trial complexity, regulatory pathway, therapeutic novelty) may explain sponsor differences.
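Completion Bias: We analyzed only completed trials. Ongoing or terminated trials may differ systematically. In particular, trials that started later could enter the sample only if they finished quickly, which may contribute to the negative COVID timing effect.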
Time Period: Results reflect the COVID-19 era (2019-2022) and may not generalize to other periods.

Does Sponsor Type Predict Clinical Trial Completion Time?

Exec PHD Quantitative Analysis I

Amanda McEwen

2025-11-24
