1 Introduction

1.1 Research Question

Does trial sponsor type (commercial vs non-commercial) predict how long Phase 2 oncology trials take to complete?

We examine trials initiated during March 2019 through March 2021, a period spanning the onset of COVID-19, where it is theorized that this period might inflate the effect of sponsor on time-to-completion. To do this we will take a sample from the AACT (ClinicalTrials.gov) database.

1.2 Why Not Use Regular Linear Regression?

Regular linear regression (OLS) assumes:

The dependent/outcome variable is normally distributed
Variance is constant across all predicted values

Time-to-trial completion violates both:

Time is always positive (can’t be negative)
Time-to-trial completion is substantially right-skewed (most trials finish in a typical timeframe, but some drag on much longer)
Longer trials have more variable completion times (variance increases with the time-to-compleyte mean)

1.3 What is a GLM?

A GLM has three components:

Random Component: The probability distribution of the outcome (In this instance we will consider Gamma or Inverse Gaussian)
Systematic Component: The linear combination of predictors (β₀ + β₁X₁ + β₂X₂ + …)
Link Function: How the mean of the outcome relates to the predictors (we will use log link)

2 Load Packages and Data

# Load required packages
library(dplyr)    # Data manipulation
library(ggplot2)  # Plotting
library(scales)   # Nice axis labels

# Load the data
load("C:/Users/amcewen/OneDrive - Bentley University/Documents/quant pres #2/phase2_solid_tumor_trials.RData")

3 Data Preparation

3.1 Step 1: Create Variables

# Define the COVID start date for our timing variable
covid_start <- as.Date("2020-03-01")

# Create analysis dataset
mydata <- trials

# Create sponsor type: Commercial vs Non-Commercial
mydata$sponsor <- ifelse(mydata$sponsor_type == "INDUSTRY", 
                          "Commercial", 
                          "Non-Commercial")
mydata$sponsor <- factor(mydata$sponsor, levels = c("Commercial", "Non-Commercial"))

# Calculate time-to-completion in days
mydata$start_date <- as.Date(mydata$start_date)
mydata$completion_date <- as.Date(mydata$completion_date)
mydata$completion_time <- as.numeric(mydata$completion_date - mydata$start_date)

# Calculate months from COVID start 
mydata$months_from_covid <- as.numeric(mydata$start_date - covid_start) / 31

# Make num_sites numeric
mydata$num_sites <- as.numeric(mydata$num_sites)

3.2 Step 2: Filter to Completed Trials

We only include trials that have actually completed (not ongoing or terminated).

# Keep only completed trials with valid data
mydata <- mydata[mydata$overall_status == "COMPLETED", ]
mydata <- mydata[mydata$completion_time > 0, ]
mydata <- mydata[!is.na(mydata$num_sites), ]
mydata <- mydata[!is.na(mydata$enrollment), ]
mydata <- mydata[!is.na(mydata$number_of_arms), ]

# How many trials do we have?
nrow(mydata)

## [1] 444

The final sample is 444 completed trials.

3.3 Step 3: Scale the Continuous Predictors

The continuous variables are on very different scales:

num_sites: 0 to 1000+
enrollment: 0 to thousands
number_of_arms: 1 to 10
months_from_covid: -12 to +12

Scaling (standardizing) them helps the model converge and makes coefficients comparable.

# Scale continuous variables (subtract mean, divide by SD)
mydata$num_sites_z <- scale(mydata$num_sites)[,1]
mydata$enrollment_z <- scale(mydata$enrollment)[,1]
mydata$arms_z <- scale(mydata$number_of_arms)[,1]
mydata$covid_timing_z <- scale(mydata$months_from_covid)[,1]

# What does "1 SD" mean in original units?
cat("1 SD of num_sites =", round(sd(mydata$num_sites), 1), "sites\n")

## 1 SD of num_sites = 69.6 sites

cat("1 SD of enrollment =", round(sd(mydata$enrollment), 1), "participants\n")

## 1 SD of enrollment = 71.7 participants

cat("1 SD of number_of_arms =", round(sd(mydata$number_of_arms), 2), "arms\n")

## 1 SD of number_of_arms = 1.18 arms

cat("1 SD of months_from_covid =", round(sd(mydata$months_from_covid), 1), "months\n")

## 1 SD of months_from_covid = 7.5 months

Note: Scaling does NOT affect our main predictive variable (sponsor type) because it’s categorical.

4 Exploratory Data Analysis

4.1 How Many Trials in Each Group?

table(mydata$sponsor)

## 
##     Commercial Non-Commercial 
##            114            330

4.2 What Does Completion Time Look Like?

ggplot(mydata, aes(x = completion_time)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Trial Completion Time",
       subtitle = "Notice the right skew - most trials finish within 1500 days but some take much longer",
       x = "Completion Time (days)",
       y = "Count")

Key observation: The distribution is right-skewed. This is why we need Gamma or Inverse Gaussian, not Normal.

4.3 Completion Time by Sponsor Type

ggplot(mydata, aes(x = sponsor, y = completion_time, fill = sponsor)) +
  geom_boxplot() +
  labs(title = "Completion Time by Sponsor Type",
       x = "Sponsor Type",
       y = "Completion Time (days)") +
  theme(legend.position = "none")

# Summary statistics by sponsor type
aggregate(completion_time ~ sponsor, data = mydata, 
          FUN = function(x) c(Mean = mean(x), Median = median(x), SD = sd(x), N = length(x)))

##          sponsor completion_time.Mean completion_time.Median completion_time.SD
## 1     Commercial            1176.2368              1148.5000           429.6435
## 2 Non-Commercial            1223.4667              1236.0000           460.3835
##   completion_time.N
## 1          114.0000
## 2          330.0000

4.4 Does Variance Increase with Mean?

For Gamma and Inverse Gaussian to be appropriate, variance should increase with the mean.

# Split data into bins and calculate mean and variance for each
mydata$bin <- cut(mydata$completion_time, breaks = 10)
mv <- aggregate(completion_time ~ bin, data = mydata, 
                FUN = function(x) c(mean = mean(x), var = var(x)))
mv <- data.frame(mean = mv$completion_time[,1], var = mv$completion_time[,2])

ggplot(mv, aes(x = mean, y = var)) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Mean-Variance Relationship",
       subtitle = "Upward slope confirms variance increases with mean",
       x = "Mean Completion Time (days)",
       y = "Variance")

Conclusion: Variance increases with the mean, supporting use of Gamma or Inverse Gaussian.

5 Understanding Our Two GLM Distributions

5.1 The Gamma Distribution

What is it?

A continuous probability distribution for positive values
Right-skewed (has a long right tail)
Commonly used for waiting times, survival times, and durations

Key assumption: Variance is proportional to the mean squared: Var(Y) ∝ μ²

When to use: Moderate right skew, no extreme outliers

5.2 The Inverse Gaussian Distribution

What is it?

Also a continuous distribution for positive values
Has a heavier right tail than Gamma
Arises naturally as the time for a process to reach a threshold (“first passage time”)

Key assumption: Variance is proportional to the mean cubed: Var(Y) ∝ μ³

When to use: Heavy right tails, extreme outliers, when data represents “time to reach a goal”

5.3 The Log Link Function

Both models use a log link:

\[\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]

This means:

\[\mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...}\]

Why log link?

Ensures predicted values are always positive (can’t predict negative time)
Coefficients have a multiplicative interpretation: exp(β) = percent change

6 Model 1: Gamma GLM

6.1 The Model Equation

Part 1 — Distribution:

\[\text{CompletionTime}_i \sim \text{Gamma}(\mu_i, \phi)\]

This says: Each trial’s completion time follows a Gamma distribution with mean μᵢ and dispersion parameter φ.

Part 2 — Link Function and Linear Predictor:

\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]

This says: The log of the expected completion time is a linear combination of our predictors.

Putting it together:

\[\mu_i = e^{\beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + ...}\]

The Gamma distribution determines the shape of the error (right-skewed, variance ∝ μ²). The log link ensures predictions are always positive. The β coefficients tell us the effect of each predictor.

6.2 Fit the Model

# Fit Gamma GLM with log link
gamma_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
                   data = mydata,
                   family = Gamma(link = "log"))

# View results
summary(gamma_model)

## 
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z + 
##     arms_z + covid_timing_z, family = Gamma(link = "log"), data = mydata)
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.086976   0.035362 200.413  < 2e-16 ***
## sponsorNon-Commercial  0.010169   0.041675   0.244  0.80733    
## num_sites_z            0.050988   0.017317   2.944  0.00341 ** 
## enrollment_z          -0.021633   0.018233  -1.187  0.23607    
## arms_z                 0.009169   0.017765   0.516  0.60602    
## covid_timing_z        -0.078083   0.017377  -4.493 8.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.129226)
## 
##     Null deviance: 88.442  on 443  degrees of freedom
## Residual deviance: 84.048  on 438  degrees of freedom
## AIC: 6764.9
## 
## Number of Fisher Scoring iterations: 5

6.3 Interpret the Coefficients

For a log link, we exponentiate coefficients to get multiplicative effects:

# Get coefficients
coefs <- coef(gamma_model)
pvals <- summary(gamma_model)$coefficients[, 4]

# Create interpretation table
results_gamma <- data.frame(
  Variable = names(coefs),
  Coefficient = round(coefs, 4),
  Exp_Coefficient = round(exp(coefs), 4),
  Percent_Change = round((exp(coefs) - 1) * 100, 1),
  P_Value = round(pvals, 4)
)

print(results_gamma)

##                                    Variable Coefficient Exp_Coefficient
## (Intercept)                     (Intercept)      7.0870       1196.2851
## sponsorNon-Commercial sponsorNon-Commercial      0.0102          1.0102
## num_sites_z                     num_sites_z      0.0510          1.0523
## enrollment_z                   enrollment_z     -0.0216          0.9786
## arms_z                               arms_z      0.0092          1.0092
## covid_timing_z               covid_timing_z     -0.0781          0.9249
##                       Percent_Change P_Value
## (Intercept)                 119528.5  0.0000
## sponsorNon-Commercial            1.0  0.8073
## num_sites_z                      5.2  0.0034
## enrollment_z                    -2.1  0.2361
## arms_z                           0.9  0.6060
## covid_timing_z                  -7.5  0.0000

How to interpret:

Exp(Coefficient) = multiplicative effect on completion time
Percent_Change = percent increase (positive) or decrease (negative)
Example: If Exp_Coefficient = 1.10, trials take 10% longer

6.4 Check Model Fit

# Residuals vs Fitted plot
par(mfrow = c(1, 2))

# Plot 1: Residuals vs Fitted
plot(fitted(gamma_model), residuals(gamma_model, type = "deviance"),
     xlab = "Fitted Values", ylab = "Deviance Residuals",
     main = "Gamma: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

# Plot 2: Q-Q Plot
qqnorm(residuals(gamma_model, type = "deviance"), main = "Gamma: Q-Q Plot")
qqline(residuals(gamma_model, type = "deviance"), col = "red")

par(mfrow = c(1, 1))

What we look for:

Residuals vs Fitted: Random scatter around zero (no patterns)
Q-Q Plot: Points should follow the diagonal line

6.5 Model Fit Statistics

cat("Gamma Model Fit Statistics\n")

## Gamma Model Fit Statistics

cat("==========================\n")

## ==========================

cat("AIC:", AIC(gamma_model), "\n")

## AIC: 6764.899

cat("BIC:", BIC(gamma_model), "\n")

## BIC: 6793.57

cat("Null Deviance:", gamma_model$null.deviance, "\n")

## Null Deviance: 88.44229

cat("Residual Deviance:", gamma_model$deviance, "\n")

## Residual Deviance: 84.04837

cat("Deviance Explained:", round((1 - gamma_model$deviance/gamma_model$null.deviance) * 100, 1), "%\n")

## Deviance Explained: 5 %

7 Model 2: Inverse Gaussian GLM

7.1 The Model Equation

Part 1 — Distribution:

\[\text{CompletionTime}_i \sim \text{InverseGaussian}(\mu_i, \lambda)\]

This says: Each trial’s completion time follows an Inverse Gaussian distribution with mean μᵢ and shape parameter λ.

Part 2 — Link Function and Linear Predictor:

\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{Enrollment}_i) + \beta_4(\text{Arms}_i) + \beta_5(\text{CovidTiming}_i)\]

Key difference from Gamma: The Inverse Gaussian has a heavier right tail and assumes variance ∝ μ³ (instead of μ²). Everything else stays the same.

7.2 Fit the Model

# Fit Inverse Gaussian GLM with log link
# We increase max iterations because IG can be harder to fit
ig_model <- glm(completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + covid_timing_z,
                data = mydata,
                family = inverse.gaussian(link = "log"),
                control = glm.control(maxit = 100))

# View results
summary(ig_model)

## 
## Call:
## glm(formula = completion_time ~ sponsor + num_sites_z + enrollment_z + 
##     arms_z + covid_timing_z, family = inverse.gaussian(link = "log"), 
##     data = mydata, control = glm.control(maxit = 100))
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            2.302e+01  5.472e+17       0        1
## sponsorNon-Commercial  2.630e+01  6.682e+17       0        1
## num_sites_z           -2.024e+01  1.678e+17       0        1
## enrollment_z           2.311e+01  3.560e+17       0        1
## arms_z                -2.613e+02  2.199e+17       0        1
## covid_timing_z         1.367e+01  3.191e+17       0        1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 4.240697e+52)
## 
##     Null deviance: 1.8086e-01  on 443  degrees of freedom
## Residual deviance: 2.8573e+36  on 438  degrees of freedom
## AIC: 45162
## 
## Number of Fisher Scoring iterations: 16

7.3 Interpret the Coefficients

# Get coefficients
coefs_ig <- coef(ig_model)
pvals_ig <- summary(ig_model)$coefficients[, 4]

# Create interpretation table
results_ig <- data.frame(
  Variable = names(coefs_ig),
  Coefficient = round(coefs_ig, 4),
  Exp_Coefficient = round(exp(coefs_ig), 4),
  Percent_Change = round((exp(coefs_ig) - 1) * 100, 1),
  P_Value = round(pvals_ig, 4)
)

print(results_ig)

##                                    Variable Coefficient Exp_Coefficient
## (Intercept)                     (Intercept)     23.0231      9972869242
## sponsorNon-Commercial sponsorNon-Commercial     26.3008    264426562036
## num_sites_z                     num_sites_z    -20.2377               0
## enrollment_z                   enrollment_z     23.1149     10931830637
## arms_z                               arms_z   -261.3229               0
## covid_timing_z               covid_timing_z     13.6734          867563
##                       Percent_Change P_Value
## (Intercept)             9.972869e+11       1
## sponsorNon-Commercial   2.644266e+13       1
## num_sites_z            -1.000000e+02       1
## enrollment_z            1.093183e+12       1
## arms_z                 -1.000000e+02       1
## covid_timing_z          8.675620e+07       1

7.4 Check Model Fit

par(mfrow = c(1, 2))

# Plot 1: Residuals vs Fitted
plot(fitted(ig_model), residuals(ig_model, type = "deviance"),
     xlab = "Fitted Values", ylab = "Deviance Residuals",
     main = "Inverse Gaussian: Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

# Plot 2: Q-Q Plot
qqnorm(residuals(ig_model, type = "deviance"), main = "Inverse Gaussian: Q-Q Plot")
qqline(residuals(ig_model, type = "deviance"), col = "red")

par(mfrow = c(1, 1))

7.5 Model Fit Statistics

cat("Inverse Gaussian Model Fit Statistics\n")

## Inverse Gaussian Model Fit Statistics

cat("=====================================\n")

## =====================================

cat("AIC:", AIC(ig_model), "\n")

## AIC: 45161.98

cat("BIC:", BIC(ig_model), "\n")

## BIC: 45190.65

cat("Null Deviance:", ig_model$null.deviance, "\n")

## Null Deviance: 0.1808587

cat("Residual Deviance:", ig_model$deviance, "\n")

## Residual Deviance: 2.857272e+36

cat("Deviance Explained:", round((1 - ig_model$deviance/ig_model$null.deviance) * 100, 1), "%\n")

## Deviance Explained: -1.579837e+39 %

8 Model Comparison

8.1 Which Model Fits Better?

cat("MODEL COMPARISON\n")

## MODEL COMPARISON

cat("================\n\n")

## ================

cat("AIC (lower is better):\n")

## AIC (lower is better):

cat("  Gamma:", round(AIC(gamma_model), 1), "\n")

##   Gamma: 6764.9

cat("  Inverse Gaussian:", round(AIC(ig_model), 1), "\n")

##   Inverse Gaussian: 45162

cat("  Winner:", ifelse(AIC(gamma_model) < AIC(ig_model), "Gamma", "Inverse Gaussian"), "\n\n")

##   Winner: Gamma

cat("BIC (lower is better):\n")

## BIC (lower is better):

cat("  Gamma:", round(BIC(gamma_model), 1), "\n")

##   Gamma: 6793.6

cat("  Inverse Gaussian:", round(BIC(ig_model), 1), "\n")

##   Inverse Gaussian: 45190.7

cat("  Winner:", ifelse(BIC(gamma_model) < BIC(ig_model), "Gamma", "Inverse Gaussian"), "\n")

##   Winner: Gamma

8.2 Do Both Models Agree on the Main Finding?

cat("SPONSOR TYPE EFFECT\n")

## SPONSOR TYPE EFFECT

cat("===================\n\n")

## ===================

# Gamma model
gamma_effect <- round((exp(coef(gamma_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
gamma_p <- round(summary(gamma_model)$coefficients["sponsorNon-Commercial", 4], 4)

# IG model
ig_effect <- round((exp(coef(ig_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
ig_p <- round(summary(ig_model)$coefficients["sponsorNon-Commercial", 4], 4)

cat("Gamma Model:\n")

## Gamma Model:

cat("  Non-commercial trials take", gamma_effect, "% longer than commercial\n")

##   Non-commercial trials take 1 % longer than commercial

cat("  p-value:", gamma_p, "\n\n")

##   p-value: 0.8073

cat("Inverse Gaussian Model:\n")

## Inverse Gaussian Model:

cat("  Non-commercial trials take", ig_effect, "% longer than commercial\n")

##   Non-commercial trials take 2.644266e+13 % longer than commercial

cat("  p-value:", ig_p, "\n")

##   p-value: 1

9 Interaction Model

9.1 Research Question

Does the effect of sponsor type depend on how big the trial is (number of sites)?

Maybe commercial sponsors are especially efficient at running large multi-site trials, but the difference disappears for small trials.

9.2 The Interaction Model Equation

Part 1 — Distribution:

\[\text{CompletionTime}_i \sim \text{[Gamma or InverseGaussian]}(\mu_i, \phi)\]

We use whichever distribution fit better in the comparison above.

Part 2 — Link Function and Linear Predictor (with interaction):

\[\log(\mu_i) = \beta_0 + \beta_1(\text{NonCommercial}_i) + \beta_2(\text{NumSites}_i) + \beta_3(\text{NonCommercial}_i \times \text{NumSites}_i) + \beta_4(\text{Enrollment}_i) + \beta_5(\text{Arms}_i) + \beta_6(\text{CovidTiming}_i)\]

What’s new? The term \(\beta_3(\text{NonCommercial} \times \text{NumSites})\)

How to interpret:

β₁ = effect of non-commercial sponsor when NumSites = 0 (at the mean, since we scaled)
β₂ = effect of adding sites for commercial sponsors
β₃ = how much the site effect DIFFERS for non-commercial vs commercial sponsors

If β₃ is significant, the lines in our plot will NOT be parallel — meaning the sponsor type effect depends on trial size.

What’s new? The term \(\beta_3(\text{NonCommercial} \times \text{NumSites})\)

How to interpret:

β₁ = effect of non-commercial sponsor when NumSites = 0 (at the mean, since we scaled)
β₂ = effect of adding sites for commercial sponsors
β₃ = how much the site effect DIFFERS for non-commercial vs commercial sponsors

If β₃ is significant, the lines in our plot will NOT be parallel — meaning the sponsor type effect depends on trial size.

9.3 Fit the Interaction Model

We use whichever distribution won the comparison above.

# Use the better-fitting distribution
if (AIC(gamma_model) <= AIC(ig_model)) {
  best_family <- Gamma(link = "log")
  best_name <- "Gamma"
} else {
  best_family <- inverse.gaussian(link = "log")
  best_name <- "Inverse Gaussian"
}

cat("Using", best_name, "distribution for interaction model\n\n")

## Using Gamma distribution for interaction model

# Fit model with interaction (sponsor * num_sites)
interaction_model <- glm(completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z + covid_timing_z,
                         data = mydata,
                         family = best_family,
                         control = glm.control(maxit = 100))

summary(interaction_model)

## 
## Call:
## glm(formula = completion_time ~ sponsor * num_sites_z + enrollment_z + 
##     arms_z + covid_timing_z, family = best_family, data = mydata, 
##     control = glm.control(maxit = 100))
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        7.063874   0.036032 196.045  < 2e-16 ***
## sponsorNon-Commercial              0.031207   0.042042   0.742  0.45830    
## num_sites_z                        0.269399   0.089675   3.004  0.00282 ** 
## enrollment_z                      -0.027341   0.018315  -1.493  0.13620    
## arms_z                             0.005772   0.017728   0.326  0.74488    
## covid_timing_z                    -0.080548   0.017338  -4.646 4.49e-06 ***
## sponsorNon-Commercial:num_sites_z -0.228690   0.091275  -2.505  0.01259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1279069)
## 
##     Null deviance: 88.442  on 443  degrees of freedom
## Residual deviance: 83.228  on 437  degrees of freedom
## AIC: 6762.4
## 
## Number of Fisher Scoring iterations: 5

9.4 Is the Interaction Significant?

We compare the main effects model to the interaction model:

# Get the main effects model (whichever was better)
main_model <- if(AIC(gamma_model) <= AIC(ig_model)) gamma_model else ig_model

# Likelihood ratio test
lr_test <- anova(main_model, interaction_model, test = "F")
print(lr_test)

## Analysis of Deviance Table
## 
## Model 1: completion_time ~ sponsor + num_sites_z + enrollment_z + arms_z + 
##     covid_timing_z
## Model 2: completion_time ~ sponsor * num_sites_z + enrollment_z + arms_z + 
##     covid_timing_z
##   Resid. Df Resid. Dev Df Deviance      F  Pr(>F)  
## 1       438     84.048                             
## 2       437     83.228  1  0.82033 6.4135 0.01167 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Get p-value
interaction_pvalue <- lr_test$`Pr(>F)`[2]

cat("\n")

cat("INTERACTION TEST RESULT\n")

## INTERACTION TEST RESULT

cat("=======================\n")

## =======================

cat("F-test p-value:", round(interaction_pvalue, 4), "\n\n")

## F-test p-value: 0.0117

if (interaction_pvalue < 0.05) {
  cat("CONCLUSION: The interaction IS significant (p < 0.05)\n")
  cat("The effect of sponsor type DOES depend on trial size.\n")
} else {
  cat("CONCLUSION: The interaction is NOT significant (p >= 0.05)\n")
  cat("The effect of sponsor type does NOT depend on trial size.\n")
  cat("The simpler main effects model is preferred.\n")
}

## CONCLUSION: The interaction IS significant (p < 0.05)
## The effect of sponsor type DOES depend on trial size.

9.5 Visualize the Interaction

# Create prediction data
pred_data <- expand.grid(
  sponsor = c("Commercial", "Non-Commercial"),
  num_sites_z = seq(-0.5, 2, by = 0.1),
  enrollment_z = 0,
  arms_z = 0,
  covid_timing_z = 0
)

# Get predictions
pred_data$predicted <- predict(interaction_model, newdata = pred_data, type = "response")

# Convert num_sites back to original scale for plotting
pred_data$num_sites <- pred_data$num_sites_z * sd(mydata$num_sites) + mean(mydata$num_sites)

# Plot
ggplot(pred_data, aes(x = num_sites, y = predicted, color = sponsor)) +
  geom_line(size = 1.2) +
  labs(title = "Predicted Completion Time by Sponsor Type and Trial Size",
       subtitle = "Other variables held at their means",
       x = "Number of Sites",
       y = "Predicted Completion Time (days)",
       color = "Sponsor Type") +
  theme(legend.position = "bottom")

10 Conclusions

10.1 Summary

# Determine best model
if (AIC(gamma_model) <= AIC(ig_model)) {
  best_model <- gamma_model
  best_dist <- "Gamma"
} else {
  best_model <- ig_model
  best_dist <- "Inverse Gaussian"
}

# Get sponsor effect from best model
sponsor_effect <- round((exp(coef(best_model)["sponsorNon-Commercial"]) - 1) * 100, 1)
sponsor_p <- round(summary(best_model)$coefficients["sponsorNon-Commercial", 4], 4)
interaction_p <- round(interaction_pvalue, 4)

10.1.1 Research Question

Does sponsor type predict Phase 2 oncology trial completion time?

10.1.2 Sample

N = 444 completed trials
Period: March 2019 to March 2021

10.1.3 Best Model

Gamma GLM with log link
AIC: Gamma = 6764.9, Inverse Gaussian = 4.5162^{4}

10.1.4 Main Finding

Non-commercial vs Commercial sponsors:

Effect: 1% longer completion time
p-value: 0.8073
Decision: FAIL TO REJECT null hypothesis - no significant difference

10.1.5 Interaction Effect

p-value: 0.0117
Conclusion: Significant - the sponsor effect depends on trial size

10.1.6 Interpretation

if (sponsor_p < 0.05 & sponsor_effect > 0) {
  cat("Non-commercial trials (academic and government sponsors) take significantly\n")
  cat("longer to complete than commercial (industry) trials. Specifically, they take\n")
  cat("approximately", sponsor_effect, "% longer (p =", sponsor_p, ").\n\n")
  cat("This may reflect:\n")
  cat("- Differences in operational infrastructure and dedicated staff\n")
  cat("- Resource constraints in academic settings\n")
  cat("- The pandemic's disproportionate impact on academic medical centers\n")
} else if (sponsor_p < 0.05 & sponsor_effect < 0) {
  cat("Non-commercial trials complete significantly FASTER than commercial trials.\n")
} else {
  cat("No significant difference was found between commercial and non-commercial\n")
  cat("sponsors in trial completion time (p =", sponsor_p, ").\n")
}

## No significant difference was found between commercial and non-commercial
## sponsors in trial completion time (p = 0.8073 ).

10.1.7 Why Gamma?

if (best_dist == "Gamma") {
  cat("The Gamma distribution provided better fit because:\n\n")
  cat("1. The completion time distribution has moderate right skew\n")
  cat("2. There are not extreme outliers that require the heavier tail of Inverse Gaussian\n")
  cat("3. The Gamma's variance assumption (Var proportional to mean squared) matched our data\n")
} else {
  cat("The Inverse Gaussian distribution provided better fit because:\n\n")
  cat("1. The completion time distribution has a heavy right tail\n")
  cat("2. Some trials take much longer than expected (extreme values)\n")
  cat("3. Trial completion can be thought of as 'time to reach a goal' which\n")
  cat("   is exactly what the Inverse Gaussian models (first passage time)\n")
}

## The Gamma distribution provided better fit because:
## 
## 1. The completion time distribution has moderate right skew
## 2. There are not extreme outliers that require the heavier tail of Inverse Gaussian
## 3. The Gamma's variance assumption (Var proportional to mean squared) matched our data

Does Sponsor Type Predict Clinical Trial Completion Time?

Exec PHD Quantitative Analysis I

Amanda McEwen

2025-11-24

1 Introduction

1.1 Research Question

1.2 Why Not Use Regular Linear Regression?

1.3 What is a GLM?

2 Load Packages and Data

3 Data Preparation

3.1 Step 1: Create Variables

3.2 Step 2: Filter to Completed Trials

3.3 Step 3: Scale the Continuous Predictors

4 Exploratory Data Analysis

4.1 How Many Trials in Each Group?

4.2 What Does Completion Time Look Like?

4.3 Completion Time by Sponsor Type

4.4 Does Variance Increase with Mean?

5 Understanding Our Two GLM Distributions

5.1 The Gamma Distribution

5.2 The Inverse Gaussian Distribution

5.3 The Log Link Function

6 Model 1: Gamma GLM

6.1 The Model Equation

6.2 Fit the Model

6.3 Interpret the Coefficients

6.4 Check Model Fit

6.5 Model Fit Statistics

7 Model 2: Inverse Gaussian GLM

7.1 The Model Equation

7.2 Fit the Model

7.3 Interpret the Coefficients

7.4 Check Model Fit

7.5 Model Fit Statistics

8 Model Comparison

8.1 Which Model Fits Better?

8.2 Do Both Models Agree on the Main Finding?

9 Interaction Model

9.1 Research Question

9.2 The Interaction Model Equation

9.3 Fit the Interaction Model

9.4 Is the Interaction Significant?

9.5 Visualize the Interaction

10 Conclusions

10.1 Summary

10.1.1 Research Question

10.1.2 Sample

10.1.3 Best Model

10.1.4 Main Finding

10.1.5 Interaction Effect

10.1.6 Interpretation

10.1.7 Why Gamma?