10.1.4 Main Finding
Non-commercial vs Commercial sponsors:
- Effect: 111.3% longer completion time
- p-value: 6e-04 (0.0006)
- Decision: REJECT null hypothesis - sponsor type DOES predict completion time
Does trial sponsor type (commercial vs non-commercial) predict how long Phase 2 respiratory disease trials take to complete?
We examine trials initiated between March 2019 and March 2021, a window spanning the onset of COVID-19, on the theory that the pandemic might amplify any effect of sponsor type on time-to-completion. Respiratory trials are particularly interesting during this period because pulmonary function testing and other in-person assessments were heavily restricted once the pandemic began.
To answer this question, we draw a sample of trials from the AACT (ClinicalTrials.gov) database.
Regular linear regression (OLS) assumes:
- The outcome, conditional on the predictors, is normally distributed
- The error variance is constant (homoscedasticity)
Time-to-trial completion violates both: completion times are strictly positive and right-skewed, and their variance grows with the mean (illustrated in the short simulation sketch below).
A GLM has three components:
- A random component: an exponential-family distribution for the outcome (here Gamma or Inverse Gaussian)
- A linear predictor: η = β₀ + β₁X₁ + β₂X₂ + ...
- A link function g(·) connecting the mean to the linear predictor: g(μ) = η (here the log link)
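To make the contrast concrete, here is that minimal simulated sketch (illustrative data only, not the AACT sample): OLS forces symmetric, constant-variance errors, while a Gamma GLM with a log link matches positive, right-skewed data whose variance grows with the mean.
# Illustrative simulation only (not trial data)
set.seed(123)
x  <- runif(200, 0, 2)
mu <- exp(5 + 0.8 * x)                        # mean depends on x multiplicatively
y  <- rgamma(200, shape = 4, rate = 4 / mu)   # positive, right-skewed, Var(y) proportional to mu^2
ols_fit   <- lm(y ~ x)                                  # assumes Normal errors, constant variance
gamma_fit <- glm(y ~ x, family = Gamma(link = "log"))   # matches the data-generating process
exp(coef(gamma_fit)["x"])   # should be close to the true multiplicative effect exp(0.8), about 2.23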
The AACT (Aggregate Analysis of ClinicalTrials.gov) database is a publicly available relational database maintained by the Clinical Trials Transformation Initiative (CTTI). It contains all information from ClinicalTrials.gov in a structured format that can be queried using SQL.
To access AACT, you need a free account: https://aact.ctti-clinicaltrials.org/users/sign_up
# Load required packages
library(RPostgres) # Database connection
library(dplyr) # Data manipulation
library(ggplot2) # Plotting
library(scales) # Nice axis labels
library(statmod) # For Inverse Gaussian
# ============================================================================
# NOTE: This code chunk connects to the live AACT database.
# Set eval=TRUE to run it (requires internet connection and AACT credentials)
# ============================================================================
# Enter your AACT credentials
username <- "amgent06" # Replace with your username
password <- "agMORGAN2027!" # Replace with your password
# Connect to the database
con <- dbConnect(
RPostgres::Postgres(),
dbname = "aact",
host = "aact-db.ctti-clinicaltrials.org",
port = 5432,
user = username,
password = password
)
cat("Connected to AACT database successfully!\n")
## Connected to AACT database successfully!
We construct a query to pull Phase 2 interventional trials for chronic respiratory conditions.
We include trials for these chronic respiratory conditions: COPD (chronic obstructive pulmonary disease), asthma, pulmonary hypertension, pulmonary fibrosis (including IPF), cystic fibrosis, interstitial lung disease, bronchiectasis, emphysema, and chronic bronchitis.
We exclude: lung cancers (lung and pulmonary neoplasms, small cell and non-small cell), as well as COVID-19 and other acute respiratory infections (coronavirus/SARS-CoV, pneumonia, acute respiratory conditions, influenza).
query <- "
SELECT DISTINCT
s.nct_id,
s.brief_title,
s.overall_status,
s.phase,
s.enrollment,
s.start_date,
s.completion_date,
s.number_of_arms,
sp.agency_class as sponsor_type,
f.number_of_facilities as num_sites
FROM studies s
LEFT JOIN sponsors sp ON s.nct_id = sp.nct_id AND sp.lead_or_collaborator = 'lead'
LEFT JOIN facilities f ON s.nct_id = f.nct_id
INNER JOIN conditions c ON s.nct_id = c.nct_id
WHERE s.phase = 'PHASE2'
AND s.study_type = 'INTERVENTIONAL'
AND s.start_date >= '2019-03-01'
AND s.start_date <= '2021-03-31'
AND (
-- Include chronic respiratory conditions
LOWER(c.downcase_name) LIKE '%copd%'
OR LOWER(c.downcase_name) LIKE '%chronic obstructive pulmonary%'
OR LOWER(c.downcase_name) LIKE '%asthma%'
OR LOWER(c.downcase_name) LIKE '%pulmonary hypertension%'
OR LOWER(c.downcase_name) LIKE '%pulmonary fibrosis%'
OR LOWER(c.downcase_name) LIKE '%ipf%'
OR LOWER(c.downcase_name) LIKE '%cystic fibrosis%'
OR LOWER(c.downcase_name) LIKE '%interstitial lung%'
OR LOWER(c.downcase_name) LIKE '%bronchiectasis%'
OR LOWER(c.downcase_name) LIKE '%emphysema%'
OR LOWER(c.downcase_name) LIKE '%chronic bronchitis%'
)
AND NOT (
-- Exclude lung cancer
LOWER(c.downcase_name) LIKE '%lung cancer%'
OR LOWER(c.downcase_name) LIKE '%lung neoplasm%'
OR LOWER(c.downcase_name) LIKE '%pulmonary neoplasm%'
OR LOWER(c.downcase_name) LIKE '%non-small cell%'
OR LOWER(c.downcase_name) LIKE '%small cell lung%'
)
AND NOT (
-- Exclude COVID-19 and acute infections
LOWER(c.downcase_name) LIKE '%covid%'
OR LOWER(c.downcase_name) LIKE '%coronavirus%'
OR LOWER(c.downcase_name) LIKE '%sars-cov%'
OR LOWER(c.downcase_name) LIKE '%pneumonia%'
OR LOWER(c.downcase_name) LIKE '%acute respiratory%'
OR LOWER(c.downcase_name) LIKE '%influenza%'
)
"
# Execute the query
trials <- dbGetQuery(con, query)
# Disconnect from database
dbDisconnect(con)
# Save the raw data
save(trials, file = "phase2_respiratory_trials.RData")
write.csv(trials, "phase2_respiratory_trials_raw.csv", row.names = FALSE)
cat("Downloaded", nrow(trials), "trials\n")
cat("Saved as 'phase2_respiratory_trials.RData' and 'phase2_respiratory_trials_raw.csv'\n")
If you have already run the query above and saved the data, you can load it here:
# Load the previously downloaded data
load("C:/Users/amcewen/OneDrive - Bentley University/Documents/quant pres #2/phase2_respiratory_trials.RData")
cat("Loaded", nrow(trials), "trials from AACT database\n")
## Loaded 82 trials from AACT database
# Define the COVID start date for our timing variable
# The WHO declared COVID-19 a pandemic in March 2020; we use March 1, 2020 as the cutoff
covid_start <- as.Date("2020-03-01")
# Create analysis dataset
mydata <- trials
# Create sponsor type: Commercial vs Non-Commercial
# INDUSTRY = pharmaceutical and biotech companies (Commercial)
# Everything else (NIH, academic institutions, etc.) = Non-Commercial
mydata$sponsor <- ifelse(mydata$sponsor_type == "INDUSTRY",
"Commercial",
"Non-Commercial")
mydata$sponsor <- factor(mydata$sponsor, levels = c("Commercial", "Non-Commercial"))
# Calculate time-to-completion in days
mydata$start_date <- as.Date(mydata$start_date)
mydata$completion_date <- as.Date(mydata$completion_date)
mydata$completion_time <- as.numeric(mydata$completion_date - mydata$start_date)
# Calculate months from COVID start (approximate months: days / 31)
# Negative values = trial started BEFORE pandemic declaration
# Positive values = trial started AFTER pandemic declaration
mydata$months_from_covid <- as.numeric(mydata$start_date - covid_start) / 31
# Make num_sites numeric
mydata$num_sites <- as.numeric(mydata$num_sites)
We only include trials that have actually completed (not ongoing or terminated) because we need actual completion dates.
# Track how many trials we lose at each step
cat("=== Data Filtering Steps ===\n\n")
## === Data Filtering Steps ===
cat("Starting sample:", nrow(mydata), "trials\n")
## Starting sample: 82 trials
# Keep only completed trials
mydata <- mydata[mydata$overall_status == "COMPLETED", ]
cat("After keeping only COMPLETED trials:", nrow(mydata), "\n")
## After keeping only COMPLETED trials: 39
# Remove trials with invalid completion times
mydata <- mydata[mydata$completion_time > 0, ]
cat("After removing invalid completion times (<=0 days):", nrow(mydata), "\n")
## After removing invalid completion times (<=0 days): 39
# Remove trials with missing covariate data
mydata <- mydata[!is.na(mydata$num_sites), ]
mydata <- mydata[!is.na(mydata$enrollment), ]
mydata <- mydata[!is.na(mydata$number_of_arms), ]
cat("After removing trials with missing covariate data:", nrow(mydata), "\n")
## After removing trials with missing covariate data: 38
cat("\n=== Final Sample ===\n")
##
## === Final Sample ===
cat("N =", nrow(mydata), "completed respiratory trials\n")
## N = 38 completed respiratory trials
The continuous variables (number of sites, enrollment, number of arms, and months from COVID onset) are on very different scales.
Scaling (standardizing) them helps the model converge and makes coefficients comparable.
What does scaling do? For each variable, we subtract the mean and divide by the standard deviation. This transforms each variable to have mean = 0 and standard deviation = 1.
# Scale continuous variables (subtract mean, divide by SD)
mydata$num_sites_z <- scale(mydata$num_sites)[,1]
mydata$enrollment_z <- scale(mydata$enrollment)[,1]
mydata$arms_z <- scale(mydata$number_of_arms)[,1]
mydata$covid_timing_z <- scale(mydata$months_from_covid)[,1]
# What does "1 SD" mean in original units?
cat("=== Scaling Reference ===\n")
## === Scaling Reference ===
cat("What '1 SD increase' means in original units:\n\n")
## What '1 SD increase' means in original units:
cat(" num_sites: 1 SD =", round(sd(mydata$num_sites), 1), "sites\n")
## num_sites: 1 SD = 30.4 sites
cat(" enrollment: 1 SD =", round(sd(mydata$enrollment), 1), "participants\n")
## enrollment: 1 SD = 133.4 participants
cat(" number_of_arms: 1 SD =", round(sd(mydata$number_of_arms), 2), "arms\n")
## number_of_arms: 1 SD = 1.86 arms
cat(" months_from_covid: 1 SD =", round(sd(mydata$months_from_covid), 1), "months\n")
## months_from_covid: 1 SD = 6.9 months
Note: Scaling does NOT affect our main predictive variable (sponsor type) because it’s categorical.
# Save cleaned dataset as CSV for reference
write.csv(mydata, "phase2_respiratory_trials_clean.csv", row.names = FALSE)
cat("Saved cleaned dataset as 'phase2_respiratory_trials_clean.csv'\n")
## Saved cleaned dataset as 'phase2_respiratory_trials_clean.csv'
cat("This file contains", nrow(mydata), "trials with", ncol(mydata), "variables.\n")
## This file contains 38 trials with 17 variables.
table(mydata$sponsor)
##
## Commercial Non-Commercial
## 27 11
ggplot(mydata, aes(x = completion_time)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Trial Completion Time",
subtitle = "Notice the right skew - most trials finish within a typical timeframe but some take much longer",
x = "Completion Time (days)",
y = "Count")
Key observation: The distribution is right-skewed. This is why we need Gamma or Inverse Gaussian, not Normal.
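As a quick numeric check of that skew (an optional sketch using only base R), we can compute the sample skewness of completion time; a clearly positive value confirms the right tail seen in the histogram.
# Sample skewness: mean of cubed standardized deviations (rough check, no extra packages)
ct <- mydata$completion_time
round(mean((ct - mean(ct))^3) / sd(ct)^3, 2)   # positive value indicates right skew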
ggplot(mydata, aes(x = sponsor, y = completion_time, fill = sponsor)) +
geom_boxplot() +
labs(title = "Completion Time by Sponsor Type",
x = "Sponsor Type",
y = "Completion Time (days)") +
theme(legend.position = "none")
# Summary statistics by sponsor type
aggregate(completion_time ~ sponsor, data = mydata,
FUN = function(x) c(Mean = mean(x), Median = median(x), SD = sd(x), N = length(x)))
## sponsor completion_time.Mean completion_time.Median completion_time.SD
## 1 Commercial 694.4444 639.0000 377.0197
## 2 Non-Commercial 1257.2727 1479.0000 506.8639
## completion_time.N
## 1 27.0000
## 2 11.0000
For Gamma and Inverse Gaussian to be appropriate, variance should increase with the mean.
# Split data into bins and calculate mean and variance for each
mydata$bin <- cut(mydata$completion_time, breaks = 10)
mv <- aggregate(completion_time ~ bin, data = mydata,
FUN = function(x) c(mean = mean(x), var = var(x)))
mv <- data.frame(mean = mv$completion_time[,1], var = mv$completion_time[,2])
ggplot(mv, aes(x = mean, y = var)) +
geom_point(size = 3, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(title = "Mean-Variance Relationship",
subtitle = "Upward slope confirms variance increases with mean",
x = "Mean Completion Time (days)",
y = "Variance")
Conclusion: Variance increases with the mean, supporting use of Gamma or Inverse Gaussian.
What is it? The Gamma distribution is a continuous distribution for strictly positive, right-skewed data.
Key assumption: Variance is proportional to the mean squared: Var(Y) ∝ μ²
When to use: Moderate right skew, no extreme outliers
What is it? The Inverse Gaussian distribution is a continuous distribution for strictly positive data, originally derived as the first-passage time of Brownian motion, i.e. the time to reach a goal.
Key assumption: Variance is proportional to the mean cubed: Var(Y) ∝ μ³
When to use: Heavy right tails, extreme outliers, when the data represent “time to reach a goal”
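To see the practical difference, here is a small simulated sketch (illustrative numbers, not the trial data) comparing draws from the two distributions with the same mean and variance; the Inverse Gaussian puts noticeably more mass in the far right tail. rinvgauss() comes from the statmod package loaded earlier.
# Illustrative comparison at a typical completion time of about 800 days
set.seed(42)
mu <- 800
g  <- rgamma(10000, shape = 4, rate = 4 / mu)               # Gamma: Var = mu^2 / 4
ig <- statmod::rinvgauss(10000, mean = mu, shape = 4 * mu)  # Inverse Gaussian: same mean and variance
quantile(g,  c(0.5, 0.99))   # Gamma median and 99th percentile
quantile(ig, c(0.5, 0.99))   # Inverse Gaussian: the 99th percentile sits further out (heavier tail)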
Both models use a log link:
\[\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]
This means:
\[\mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...}\]
Why log link? It guarantees the predicted mean is always positive (as a completion time must be), and it makes exponentiated coefficients interpretable as multiplicative (percentage) effects.
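A tiny worked example with hypothetical coefficients shows why: on the log scale effects add, so on the original (days) scale they multiply.
# Hypothetical coefficients for illustration only
b0 <- 6.5; b1 <- 0.75
exp(b0)                   # expected days for the reference group (about 665)
exp(b0 + b1)              # expected days when the predictor switches on
exp(b0 + b1) / exp(b0)    # ratio = exp(b1), about 2.12, i.e. about 112% longer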
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{Gamma}(\mu_i, \phi)\]
This says: Each trial’s completion time follows a Gamma distribution with mean μᵢ and dispersion parameter φ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
This says: The log of the expected completion time is a linear combination of our predictors.
Putting it together:
\[\mu_i = e^{\beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + ...}\]
The Gamma distribution determines the shape of the error (right-skewed, variance ∝ μ²). The log link ensures predictions are always positive. The β coefficients tell us the effect of each predictor.
# Fit Gamma GLM with log link
gamma_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = Gamma(link = "log"))
# View results
summary(gamma_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = Gamma(link = "log"), data = mydata)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47634 0.09804 66.059 < 2e-16 ***
## sponsorNon-Commercial 0.74801 0.19793 3.779 0.000649 ***
## num_sites_z 0.24021 0.11463 2.096 0.044119 *
## enrollment_z -0.03470 0.11204 -0.310 0.758798
## arms_z -0.07446 0.09392 -0.793 0.433776
## covid_timing_z -0.03532 0.08599 -0.411 0.683974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.24049)
##
## Null deviance: 14.2652 on 37 degrees of freedom
## Residual deviance: 9.6294 on 32 degrees of freedom
## AIC: 570.3
##
## Number of Fisher Scoring iterations: 9
For a log link, we exponentiate coefficients to get multiplicative effects:
# Get coefficients
coefs <- coef(gamma_model)
pvals <- summary(gamma_model)$coefficients[, 4]
# Create interpretation table
results_gamma <- data.frame(
Variable = names(coefs),
Coefficient = round(coefs, 4),
Exp_Coefficient = round(exp(coefs), 4),
Percent_Change = round((exp(coefs) - 1) * 100, 1),
P_Value = round(pvals, 4)
)
print(results_gamma)
## Variable Coefficient Exp_Coefficient
## (Intercept) (Intercept) 6.4763 649.5905
## sponsorNon-Commercial sponsorNon-Commercial 0.7480 2.1128
## num_sites_z num_sites_z 0.2402 1.2715
## enrollment_z enrollment_z -0.0347 0.9659
## arms_z arms_z -0.0745 0.9282
## covid_timing_z covid_timing_z -0.0353 0.9653
## Percent_Change P_Value
## (Intercept) 64859.0 0.0000
## sponsorNon-Commercial 111.3 0.0006
## num_sites_z 27.2 0.0441
## enrollment_z -3.4 0.7588
## arms_z -7.2 0.4338
## covid_timing_z -3.5 0.6840
How to interpret: an exponentiated coefficient is a multiplicative effect on expected completion time, holding the other predictors constant. For example, exp(0.748) ≈ 2.11 for non-commercial sponsorship (about 111% longer expected completion time), and exp(0.240) ≈ 1.27 for num_sites_z (about 27% longer per one-SD increase of roughly 30 sites).
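A useful follow-up (a hedged sketch using the Wald approximation from confint.default) is to put a 95% confidence interval around the sponsor effect on the multiplicative scale.
# Wald 95% CI for the sponsor coefficient, transformed to a ratio of completion times
ci <- confint.default(gamma_model)["sponsorNon-Commercial", ]
round(exp(ci), 2)                # ratio scale: roughly 1.4 to 3.1 given the estimates above
round((exp(ci) - 1) * 100, 1)    # the same interval on the percent-change scale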
# Residuals vs Fitted plot
par(mfrow = c(1, 2))
# Plot 1: Residuals vs Fitted
plot(fitted(gamma_model), residuals(gamma_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Gamma: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)
# Plot 2: Q-Q Plot
qqnorm(residuals(gamma_model, type = "deviance"), main = "Gamma: Q-Q Plot")
qqline(residuals(gamma_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
What we look for: deviance residuals scattered around zero with no strong trend or funnel shape in the residuals-vs-fitted plot, and points lying close to the reference line in the Q-Q plot.
cat("Gamma Model Fit Statistics\n")
## Gamma Model Fit Statistics
cat("==========================\n")
## ==========================
cat("AIC:", AIC(gamma_model), "\n")
## AIC: 570.3046
cat("BIC:", BIC(gamma_model), "\n")
## BIC: 581.7677
cat("Null Deviance:", gamma_model$null.deviance, "\n")
## Null Deviance: 14.26525
cat("Residual Deviance:", gamma_model$deviance, "\n")
## Residual Deviance: 9.62944
cat("Deviance Explained:", round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%\n")
## Deviance Explained: 32.5 %
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{InverseGaussian}(\mu_i, \lambda)\]
This says: Each trial’s completion time follows an Inverse Gaussian distribution with mean μᵢ and shape parameter λ.
Part 2 — Link Function and Linear Predictor:
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]
Key difference from Gamma: The Inverse Gaussian has a heavier right tail and assumes variance ∝ μ³ (instead of μ²). Everything else stays the same.
# Fit Inverse Gaussian GLM with log link
# We increase max iterations because IG can be harder to fit
ig_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = inverse.gaussian(link = "log"),
control = glm.control(maxit = 100))
# View results
summary(ig_model)
##
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = inverse.gaussian(link = "log"),
## data = mydata, control = glm.control(maxit = 100))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.46912 0.09329 69.344 < 2e-16 ***
## sponsorNon-Commercial 0.78885 0.23450 3.364 0.00201 **
## num_sites_z 0.27770 0.12250 2.267 0.03029 *
## enrollment_z -0.10431 0.11076 -0.942 0.35337
## arms_z -0.06281 0.09257 -0.678 0.50237
## covid_timing_z -0.07668 0.09150 -0.838 0.40822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.0003387172)
##
## Null deviance: 0.022773 on 37 degrees of freedom
## Residual deviance: 0.017098 on 32 degrees of freedom
## AIC: 577.54
##
## Number of Fisher Scoring iterations: 16
# Get coefficients
coefs_ig <- coef(ig_model)
pvals_ig <- summary(ig_model)$coefficients[, 4]
# Create interpretation table
results_ig <- data.frame(
Variable = names(coefs_ig),
Coefficient = round(coefs_ig, 4),
Exp_Coefficient = round(exp(coefs_ig), 4),
Percent_Change = round((exp(coefs_ig) - 1) * 100, 1),
P_Value = round(pvals_ig, 4)
)
print(results_ig)
## Variable Coefficient Exp_Coefficient
## (Intercept) (Intercept) 6.4691 644.9140
## sponsorNon-Commercial sponsorNon-Commercial 0.7889 2.2009
## num_sites_z num_sites_z 0.2777 1.3201
## enrollment_z enrollment_z -0.1043 0.9009
## arms_z arms_z -0.0628 0.9391
## covid_timing_z covid_timing_z -0.0767 0.9262
## Percent_Change P_Value
## (Intercept) 64391.4 0.0000
## sponsorNon-Commercial 120.1 0.0020
## num_sites_z 32.0 0.0303
## enrollment_z -9.9 0.3534
## arms_z -6.1 0.5024
## covid_timing_z -7.4 0.4082
par(mfrow = c(1, 2))
# Plot 1: Residuals vs Fitted
plot(fitted(ig_model), residuals(ig_model, type = "deviance"),
xlab = "Fitted Values", ylab = "Deviance Residuals",
main = "Inverse Gaussian: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)
# Plot 2: Q-Q Plot
qqnorm(residuals(ig_model, type = "deviance"), main = "Inverse Gaussian: Q-Q Plot")
qqline(residuals(ig_model, type = "deviance"), col = "red")
par(mfrow = c(1, 1))
cat("Inverse Gaussian Model Fit Statistics\n")
## Inverse Gaussian Model Fit Statistics
cat("=====================================\n")
## =====================================
cat("AIC:", AIC(ig_model), "\n")
## AIC: 577.5408
cat("BIC:", BIC(ig_model), "\n")
## BIC: 589.0039
cat("Null Deviance:", ig_model$null.deviance, "\n")
## Null Deviance: 0.02277328
cat("Residual Deviance:", ig_model$deviance, "\n")
## Residual Deviance: 0.01709837
cat("Deviance Explained:", round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%\n")
## Deviance Explained: 24.9 %
cat("MODEL COMPARISON\n")
## MODEL COMPARISON
cat("================\n\n")
## ================
cat("AIC (lower is better):\n")
## AIC (lower is better):
cat(" Gamma:", round(AIC(gamma_model), 1), "\n")
## Gamma: 570.3
cat(" Inverse Gaussian:", round(AIC(ig_model), 1), "\n")
## Inverse Gaussian: 577.5
cat(" Winner:", ifelse(AIC(gamma_model) < AIC(ig_model), "Gamma", "Inverse Gaussian"), "\n\n")
## Winner: Gamma
cat("BIC (lower is better):\n")
## BIC (lower is better):
cat(" Gamma:", round(BIC(gamma_model), 1), "\n")
## Gamma: 581.8
cat(" Inverse Gaussian:", round(BIC(ig_model), 1), "\n")
## Inverse Gaussian: 589
cat(" Winner:", ifelse(BIC(gamma_model) < BIC(ig_model), "Gamma", "Inverse Gaussian"), "\n")
## Winner: Gamma
cat("SPONSOR TYPE EFFECT\n")
## SPONSOR TYPE EFFECT
cat("===================\n\n")
## ===================
# Gamma model
gamma_effect <- round((exp(coef(gamma_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
gamma_p <- round(summary(gamma_model)$coefficients["sponsorNon-Commercial", 4], 4)
# IG model
ig_effect <- round((exp(coef(ig_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
ig_p <- round(summary(ig_model)$coefficients["sponsorNon-Commercial", 4], 4)
cat("Gamma Model:\n")
## Gamma Model:
cat(" Non-commercial trials take", gamma_effect, "% longer than commercial\n")
## Non-commercial trials take 111.3 % longer than commercial
cat(" p-value:", gamma_p, "\n\n")
## p-value: 6e-04
cat("Inverse Gaussian Model:\n")
## Inverse Gaussian Model:
cat(" Non-commercial trials take", ig_effect, "% longer than commercial\n")
## Non-commercial trials take 120.1 % longer than commercial
cat(" p-value:", ig_p, "\n")
## p-value: 0.002
Does the effect of sponsor type depend on how big the trial is (number of sites)?
Maybe commercial sponsors are especially efficient at running large multi-site trials, but the difference disappears for small trials.
Part 1 — Distribution:
\[\text{CompletionTime}_i \sim \text{[Gamma or InverseGaussian]}(\mu_i, \phi)\]
We use whichever distribution fit better in the comparison above.
Part 2 — Link Function and Linear Predictor (with interaction):
\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{NonCommercial}_i \times \text{NumSites}_i) + \beta_4(\text{Enrollment}_i) + \beta_5(\text{Arms}_i) + \beta_6(\text{CovidTiming}_i)\]
What’s new? The term \(\beta_3(\text{NonCommercial} \times \text{NumSites})\)
How to interpret: β₃ measures how much the (log-scale) non-commercial effect changes for each one-SD increase in the number of sites. If β₃ is significant, the lines in our plot will NOT be parallel, meaning the sponsor type effect depends on trial size.
We use whichever distribution won the comparison above.
# Use the better-fitting distribution
if (AIC(gamma_model) <= AIC(ig_model)) {
best_family <- Gamma(link = "log")
best_name <- "Gamma"
} else {
best_family <- inverse.gaussian(link = "log")
best_name <- "Inverse Gaussian"
}
cat("Using", best_name, "distribution for interaction model\n\n")
## Using Gamma distribution for interaction model
# Fit model with interaction (sponsor * num_sites)
interaction_model <- glm(completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z + covid_timing_z,
data = mydata,
family = best_family,
control = glm.control(maxit = 100))
summary(interaction_model)
##
## Call:
## glm(formula = completion_time ~ sponsor * num_sites_z + enrollment_z +
## arms_z + covid_timing_z, family = best_family, data = mydata,
## control = glm.control(maxit = 100))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47846 0.09806 66.065 < 2e-16 ***
## sponsorNon-Commercial 1.24650 0.40351 3.089 0.00421 **
## num_sites_z 0.28048 0.11642 2.409 0.02211 *
## enrollment_z -0.12035 0.12173 -0.989 0.33049
## arms_z -0.09401 0.09505 -0.989 0.33028
## covid_timing_z -0.02931 0.08597 -0.341 0.73541
## sponsorNon-Commercial:num_sites_z 0.91307 0.63094 1.447 0.15789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2401902)
##
## Null deviance: 14.265 on 37 degrees of freedom
## Residual deviance: 9.112 on 31 degrees of freedom
## AIC: 570.12
##
## Number of Fisher Scoring iterations: 9
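Before testing it formally, a short sketch shows what the estimated interaction would imply if taken at face value: the multiplicative non-commercial effect evaluated at small, average, and large trials, in SD units of the number of sites (1 SD is roughly 30 sites). As the test below shows, the interaction is not statistically significant, so these numbers are illustrative only.
# Non-commercial effect at num_sites_z = -0.5, 0, and +1, i.e. exp(beta1 + beta3 * z)
b <- coef(interaction_model)
z <- c(small = -0.5, average = 0, large = 1)
round(exp(b["sponsorNon-Commercial"] + b["sponsorNon-Commercial:num_sites_z"] * z), 2)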
We compare the main effects model to the interaction model:
# Get the main effects model (whichever was better)
main_model <- if(AIC(gamma_model) <= AIC(ig_model)) gamma_model else ig_model
# Analysis-of-deviance F test comparing the nested models
# (an F test is used because the Gamma dispersion parameter is estimated)
lr_test <- anova(main_model, interaction_model, test = "F")
print(lr_test)
## Analysis of Deviance Table
##
## Model 1: completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z +
## covid_timing_z
## Model 2: completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z +
## covid_timing_z
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 32 9.6294
## 2 31 9.1120 1 0.51746 2.1544 0.1522
# Get p-value
interaction_pvalue <- lr_test$`Pr(>F)`[2]
cat("\n")
cat("INTERACTION TEST RESULT\n")
## INTERACTION TEST RESULT
cat("=======================\n")
## =======================
cat("F-test p-value:", round(interaction_pvalue, 4), "\n\n")
## F-test p-value: 0.1522
if (interaction_pvalue < 0.05) {
cat("CONCLUSION: The interaction IS significant (p < 0.05)\n")
cat("The effect of sponsor type DOES depend on trial size.\n")
} else {
cat("CONCLUSION: The interaction is NOT significant (p >= 0.05)\n")
cat("The effect of sponsor type does NOT depend on trial size.\n")
cat("The simpler main effects model is preferred.\n")
}
## CONCLUSION: The interaction is NOT significant (p >= 0.05)
## The effect of sponsor type does NOT depend on trial size.
## The simpler main effects model is preferred.
# Create prediction data
pred_data <- expand.grid(
sponsor = c("Commercial", "Non-Commercial"),
num_sites_z = seq(-0.5, 2, by = 0.1),
enrollment_z = 0,
arms_z = 0,
covid_timing_z = 0
)
# Get predictions
pred_data$predicted <- predict(interaction_model, newdata = pred_data, type = "response")
# Convert num_sites back to original scale for plotting
pred_data$num_sites <- pred_data$num_sites_z * sd(mydata$num_sites) + mean(mydata$num_sites)
# Plot
ggplot(pred_data, aes(x = num_sites, y = predicted, color = sponsor)) +
geom_line(linewidth = 1.2) +
labs(title = "Predicted Completion Time by Sponsor Type and Trial Size",
subtitle = "Other variables held at their means",
x = "Number of Sites",
y = "Predicted Completion Time (days)",
color = "Sponsor Type") +
theme(legend.position = "bottom")
# Determine best model
if (AIC(gamma_model) <= AIC(ig_model)) {
best_model <- gamma_model
best_dist <- "Gamma"
} else {
best_model <- ig_model
best_dist <- "Inverse Gaussian"
}
# Get sponsor effect from best model
sponsor_effect <- round((exp(coef(best_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
sponsor_p <- round(summary(best_model)$coefficients["sponsorNon-Commercial", 4], 4)
interaction_p <- round(interaction_pvalue, 4)
Does sponsor type predict Phase 2 respiratory disease trial completion time?
Non-commercial vs Commercial sponsors:
if (sponsor_p < 0.05 & sponsor_effect > 0) {
cat("Non-commercial trials (academic and government sponsors) take significantly\n")
cat("longer to complete than commercial (industry) trials. Specifically, they take\n")
cat("approximately", sponsor_effect, "% longer (p =", sponsor_p, ").\n\n")
cat("This may reflect:\n")
cat("- Differences in operational infrastructure and dedicated staff\n")
cat("- Resource constraints in academic settings\n")
cat("- The pandemic's disproportionate impact on academic medical centers\n")
cat("- Restrictions on pulmonary function testing during COVID-19\n")
} else if (sponsor_p < 0.05 & sponsor_effect < 0) {
cat("Non-commercial trials complete significantly FASTER than commercial trials.\n")
} else {
cat("No significant difference was found between commercial and non-commercial\n")
cat("sponsors in trial completion time (p =", sponsor_p, ").\n")
}
## Non-commercial trials (academic and government sponsors) take significantly
## longer to complete than commercial (industry) trials. Specifically, they take
## approximately 111.3 % longer (p = 6e-04 ).
##
## This may reflect:
## - Differences in operational infrastructure and dedicated staff
## - Resource constraints in academic settings
## - The pandemic's disproportionate impact on academic medical centers
## - Restrictions on pulmonary function testing during COVID-19
if (best_dist == "Gamma") {
cat("The Gamma distribution provided better fit because:\n\n")
cat("1. The completion time distribution has moderate right skew\n")
cat("2. There are not extreme outliers that require the heavier tail of Inverse Gaussian\n")
cat("3. The Gamma's variance assumption (Var proportional to mean squared) matched our data\n")
cat("\nThe Inverse Gaussian assumes variance proportional to the mean CUBED, which\n")
cat("would require much more dramatic increases in variability for longer trials.\n")
cat("Our data did not show this pattern, so the Gamma was preferred.\n")
} else {
cat("The Inverse Gaussian distribution provided better fit because:\n\n")
cat("1. The completion time distribution has a heavy right tail\n")
cat("2. Some trials take much longer than expected (extreme values)\n")
cat("3. Trial completion can be thought of as 'time to reach a goal' which\n")
cat(" is exactly what the Inverse Gaussian models (first passage time)\n")
}
## The Gamma distribution provided better fit because:
##
## 1. The completion time distribution has moderate right skew
## 2. There are not extreme outliers that require the heavier tail of Inverse Gaussian
## 3. The Gamma's variance assumption (Var proportional to mean squared) matched our data
##
## The Inverse Gaussian assumes variance proportional to the mean CUBED, which
## would require much more dramatic increases in variability for longer trials.
## Our data did not show this pattern, so the Gamma was preferred.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] statmod_1.5.1 scales_1.4.0 ggplot2_4.0.0 dplyr_1.1.4
## [5] RPostgres_1.4.8
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-3 bit_4.6.0 gtable_0.3.6 jsonlite_2.0.0
## [5] compiler_4.5.1 tidyselect_1.2.1 blob_1.2.4 jquerylib_0.1.4
## [9] splines_4.5.1 yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.50
## [17] tibble_3.3.0 lubridate_1.9.4 DBI_1.2.3 bslib_0.9.0
## [21] pillar_1.11.0 RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
## [25] xfun_0.52 sass_0.4.10 S7_0.2.0 bit64_4.6.0-1
## [29] timechange_0.3.0 cli_3.6.5 mgcv_1.9-3 withr_3.0.2
## [33] magrittr_2.0.3 digest_0.6.37 grid_4.5.1 rstudioapi_0.17.1
## [37] hms_1.1.3 nlme_3.1-168 lifecycle_1.0.4 vctrs_0.6.5
## [41] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.30
## [45] tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1