Clinical Trials Assignment

Author

Dimitris Anagnostou

Part A

Exercise 1

1) In a 3+3 design, escalation to the next dose level can occur in two ways: (i) 0 of the first 3 patients experience a dose-limiting toxicity (DLT), or (ii) exactly 1 of the first 3 patients experiences a DLT, in which case 3 additional patients are treated at the same dose level and escalation occurs only if no more than 1 of the 6 patients in total experiences a DLT (i.e. no DLTs in the expansion cohort). Using basic combinatorial principles, the per-dose escalation probabilities are computed as follows:

p_a = c(0.08, 0.18, 0.30, 0.45) # scenario A, toxicity probabilities
p_b = c(0.12, 0.22, 0.28, 0.40) # scenario B

# We implement a 3+3 design. The escalation (transition) probability at a dose
# with DLT probability p is P(0/3 DLTs) + P(1/3 DLTs) * P(0/3 DLTs in the
# expansion cohort); dbinom is vectorized over p, so each pi_* is a length-4 vector.

pi_a = dbinom(0, 3, p_a) + dbinom(1, 3, p_a) * dbinom(0, 3, p_a) # P(transition in A)
pi_b = dbinom(0, 3, p_b) + dbinom(1, 3, p_b) * dbinom(0, 3, p_b) # P(transition in B)

transitions = data.frame(Dose=1:4, Pi_A=pi_a, Pi_B=pi_b)
transitions
  Dose      Pi_A      Pi_B
1    1 0.9368676 0.8714555
2    2 0.7515675 0.6651055
3    3 0.4942630 0.5357811
4    4 0.2343184 0.3093120

2) \(P(\text{stop before 4})=P(\text{stop at 1})+P(\text{stop at 2})+P(\text{stop at 3})\), where \(P(\text{stop at 1})=1-\pi_1\), \(P(\text{stop at 2})=\pi_1(1-\pi_2)\) and \(P(\text{stop at 3})=\pi_1\pi_2(1-\pi_3)\), with \(\pi_k\) denoting the transition probability at dose \(k\).

# For scenario A
p4a = (1-pi_a[1])+pi_a[1]*(1-pi_a[2]) + pi_a[1]*pi_a[2]*(1-pi_a[3])
# alternate equivalent formula: 1- pi_a[3]*pi_a[2]*pi_a[1]
paste0("For scenario A P(stop before 4): ", round(p4a,4))
[1] "For scenario A P(stop before 4): 0.652"
# For scenario B
p4b =(1-pi_b[1])+pi_b[1]*(1-pi_b[2]) + pi_b[1]*pi_b[2]*(1-pi_b[3])
paste0("For scenario B P(stop before 4): ", round (p4b,4))
[1] "For scenario B P(stop before 4): 0.6895"

3) Using the same reasoning as in question 2, we find the probability of stopping at each dose:

p_stop1_a = 1*(1-pi_a[1])
p_stop2_a = pi_a[1]*(1-pi_a[2])
p_stop3_a = pi_a[1]*pi_a[2]*(1-pi_a[3])
p_stop4_a = pi_a[1]*pi_a[2]*pi_a[3]*(1-pi_a[4])

p_stop_a = c(p_stop1_a,p_stop2_a,p_stop3_a,p_stop4_a)

p_stop1_b = 1*(1-pi_b[1])
p_stop2_b = pi_b[1]*(1-pi_b[2])
p_stop3_b = pi_b[1]*pi_b[2]*(1-pi_b[3])
p_stop4_b = pi_b[1]*pi_b[2]*pi_b[3]*(1-pi_b[4])
p_stop_b = c(p_stop1_b,p_stop2_b,p_stop3_b,p_stop4_b)


data.frame(Stop_At=1:4, Scenario_A=p_stop_a, Scenario_B=p_stop_b)
  Stop_At Scenario_A Scenario_B
1       1 0.06313243  0.1285445
2       2 0.23274834  0.2918456
3       3 0.35609915  0.2690659
4       4 0.26647258  0.2144890
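
Note that the four stopping probabilities do not sum to 1: the remaining mass is the probability of clearing even the highest dose (\(\pi_1\pi_2\pi_3\pi_4\approx 0.082\) in scenario A). As an optional cross-check, a small Monte Carlo simulation of the simplified escalation rule used above should reproduce these figures within simulation error (sim_stop_dose is our own helper function, not from a package):

sim_stop_dose <- function(p_tox, n_sim = 1e5) {
  stop_at <- integer(n_sim)
  for (i in seq_len(n_sim)) {
    dose <- 1
    while (dose <= length(p_tox)) {
      dlt <- rbinom(1, 3, p_tox[dose])  # DLTs among the first 3 patients
      escalate <- (dlt == 0) ||
        (dlt == 1 && rbinom(1, 3, p_tox[dose]) == 0)  # expansion cohort of 3
      if (!escalate) break
      dose <- dose + 1
    }
    stop_at[i] <- dose  # length(p_tox) + 1 means all doses were cleared
  }
  prop.table(table(factor(stop_at, levels = 1:(length(p_tox) + 1))))
}

set.seed(1)
sim_stop_dose(p_a)  # compare the first four entries with p_stop_a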

4) We plot the cumulative probability of stopping at or before each dose, for each scenario.

results_stop <- data.frame(
  stop_at = 1:4,
  cum_stop_A = cumsum(p_stop_a),
  cum_stop_B = cumsum(p_stop_b)
)

plot(
  results_stop$stop_at,
  results_stop$cum_stop_A,
  type = "b",
  pch = 21,
  xlab = "Dose Level",
  ylab = "Cumulative Probability of Stopping",
  xaxt = "n",
  ylim = c(0, 1)
)

axis(1, at = 1:4)

lines(
  results_stop$stop_at,
  results_stop$cum_stop_B,
  type = "b",
  pch = 16
)

legend(
  "topleft",
  legend = c("Scenario A", "Scenario B"),
  pch = c(21, 16),
  lty = 1
)

We know that the 3rd dose is the target dose in each scenario (in both, its DLT probability is closest to the commonly targeted 30%). Assisted by the plot, we note that scenario A has a lower probability of stopping at or before the 2nd dose (≈0.30 vs ≈0.42) and is therefore more likely to reach the 3rd dose. In this sense, scenario A is preferable.

5) Scenario A has marginally better operating characteristics, as the probability of termination increases more slowly at lower doses, allowing safer escalation. Its dose–toxicity relationship is closer to a sigmoid curve, which is biologically more plausible. By contrast, Scenario B is steeper, with toxicity accumulating more rapidly at early doses, leading to higher termination probabilities sooner and less favorable operating behavior.

Consistency check: If toxicity probabilities increase, this should first be reflected in the transition probabilities, which would become smaller. Higher toxicity would therefore make transitions to the next dose level less likely.

dbinom(0, 3, p_a) + dbinom(1, 3, p_a) * dbinom(0, 3, p_a) # initial transition probabilities
[1] 0.9368676 0.7515675 0.4942630 0.2343184
dbinom(0, 3, p_a + 0.05) + dbinom(1, 3, p_a + 0.05) * dbinom(0, 3, p_a + 0.05) # after increasing toxicity probabilities by 0.05
[1] 0.8528872 0.6433011 0.3964555 0.1718750

Exercise 2

1) We construct a single-stage design for each scenario via the following code:

library(clinfun)

d1 = ph2single(pu=0.15, pa=0.40, ep1=0.1, ep2=0.2)
d1
   n r Type I error Type II error
1 16 4   0.07905130    0.16656738
2 17 4   0.09871000    0.12599913
3 19 5   0.05369611    0.16292248
4 20 5   0.06730797    0.12559897
5 21 5   0.08273475    0.09574016
d2 = ph2single(pu=0.20, pa=0.40, ep1=0.05, ep2=0.1)
d2
   n  r Type I error Type II error
1 47 14   0.03663689    0.09877433
2 48 14   0.04373256    0.08125713
3 50 15   0.03080342    0.09550171
4 51 15   0.03678715    0.07888304
5 52 15   0.04356872    0.06475717

We note that the smallest sample size satisfying the error-rate constraints is 16 in the first scenario (with r = 4, i.e. at least 5 responders required to declare the treatment promising). In the second scenario, the corresponding minimum is 47 patients, requiring at least 15 responders.
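
As a quick sanity check: the tabulated r is the rejection cutoff (the drug is declared ineffective if at most r responses are observed), so the error rates of the n = 16, r = 4 design can be recomputed directly from the binomial distribution:

n1s <- 16; r1s <- 4
1 - pbinom(r1s, n1s, 0.15) # type I error:  P(>= 5 responses | p = 0.15), ~0.0791
pbinom(r1s, n1s, 0.40)     # type II error: P(<= 4 responses | p = 0.40), ~0.1666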

2) We construct Simon two-stage designs using the following code:

library(clinfun)

d21 = ph2simon(pu=0.15, pa=0.40, ep1=0.1, ep2=0.2)
d21

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.15 
Desirable response rate:  0.4 
Error rates: alpha =  0.1 ; beta =  0.2 

        r1 n1 r  n EN(p0) PET(p0)   qLo   qHi
Minimax  1  9 4 16  11.80  0.5995 0.457 1.000
Optimal  1  7 4 18  10.12  0.7166 0.000 0.457
d22 = ph2simon(pu=0.20, pa=0.40, ep1=0.05, ep2=0.1)
d22

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.2 
Desirable response rate:  0.4 
Error rates: alpha =  0.05 ; beta =  0.1 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax     5 24 13 45  31.23  0.6559 0.108 1.000
Admissible  4 20 14 49  30.74  0.6296 0.058 0.108
Optimal     4 19 15 54  30.43  0.6733 0.000 0.058

So the design parameters are as follows:

Scenario 1

  • Optimal: r1=1, n1=7, r=4, n=18, EN(p0)=10.12

  • Minimax: r1=1, n1=9, r=4, n=16, EN(p0)=11.80

Scenario 2

  • Optimal: r1=4, n1=19, r=15, n=54, EN(p0)=30.43

  • Minimax: r1=5, n1=24, r=13, n=45, EN(p0)=31.23
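
These summaries can be reproduced by hand: under \(H_0\) the trial terminates early when at most \(r_1\) responses occur among the first \(n_1\) patients, so \(PET(p_0)=P(X_1\le r_1)\) with \(X_1\sim\text{Bin}(n_1,p_0)\), and \(EN(p_0)=n_1+(1-PET(p_0))\,(n-n_1)\). For the scenario 1 Minimax design:

r1 <- 1; n1 <- 9; n <- 16; p0 <- 0.15
pet0 <- pbinom(r1, n1, p0)          # P(early termination under H0), ~0.5995
en0  <- n1 + (1 - pet0) * (n - n1)  # expected sample size under H0, ~11.80
c(PET0 = pet0, EN0 = en0)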

3) From the previous questions we construct the following table:

Design    Scenario 1                           Scenario 2
Single    n = 16, EN0 = 16, PET0 = 0           n = 47, EN0 = 47, PET0 = 0
          (cannot terminate early)             (cannot terminate early)
Optimal   n = 18, EN0 = 10.12, PET0 = 71.67%   n = 54, EN0 = 30.43, PET0 = 67.33%
Minimax   n = 16, EN0 = 11.80, PET0 = 59.95%   n = 45, EN0 = 31.23, PET0 = 65.59%

4) For Scenario 1, there is no reason to suggest a single-stage design, as the Minimax design uses the same sample size (n = 16) while allowing early stopping for futility. The Optimal design would enroll up to two additional patients (n = 18), while reducing the expected number exposed under H0 by fewer than two (EN0 11.80 - 10.12 = 1.68), albeit with a higher probability of early termination (71.7% vs 60.0%). Overall, the Minimax design appears to be the most appropriate choice; however, if the clinician prioritizes a higher likelihood of early stopping, the Optimal design is a reasonable alternative.

For scenario 2, there is again no reason to prefer the single-stage design. The Optimal design would expose up to 9 additional patients (n = 54 vs 45) for an expected saving of less than one patient under H0 (EN0 31.23 - 30.43 = 0.80) and only a negligible increase in PET0 (67.3% vs 65.6%). The Minimax design seems the best choice here.

5) We create a helper function for power calculation in the two-stage designs. The trial continues past stage 1 only if \(X_1 > r_1\), and the drug is declared promising if the total number of responses exceeds \(r\), so \(\text{power}(p)=\sum_{x_1=r_1+1}^{n_1} P(X_1=x_1)\,P(X_2 > r-x_1)\), with \(X_1\sim\text{Bin}(n_1,p)\) and \(X_2\sim\text{Bin}(n-n_1,p)\):

simon_power <- function(n1, r1, n, r, p_true) {
  
  n2 <- n - n1 # how many in second stage
  
  x1 <- (r1 + 1):n1 # responses in first stage that allow continuing

  # P(X1 = x1)
  prob_x1 <- dbinom(x1, size = n1, prob = p_true)
  
  # P(X2 > r - x1)
  prob_stage2 <- 1 - pbinom(r - x1, size = n2, prob = p_true)
  
  sum(prob_x1 * prob_stage2)
}
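
As a consistency check (our own addition), evaluating the helper at \(p_{\text{true}}=p_u\) and \(p_{\text{true}}=p_a\) should recover the scenario 1 Minimax design's type I error and power, which by construction satisfy \(\alpha \le 0.10\) and \(1-\beta \ge 0.80\):

simon_power(n1=9, r1=1, n=16, r=4, p_true=0.15) # type I error, expected <= 0.10
simon_power(n1=9, r1=1, n=16, r=4, p_true=0.40) # power, expected >= 0.80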

In the first scenario, Minimax design:

simon_power(n1=9, r1=1, n=16, r=4, p_true=0.3)
[1] 0.5293164

In the second scenario, Minimax design:

simon_power(n1=24, r1=5, n=45, r=13, p_true=0.3)
[1] 0.4683275

Consistency check: If the significance level \(\alpha\) was incorrectly treated as two-sided, effectively halving the tail probability available for rejection, stricter decision boundaries would be required. This would lead to larger critical values \(r\) (and \(r_1\)) and, if the designs were re-optimized to maintain the same power, to larger total sample sizes across all designs.
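
This could be verified empirically by rerunning the design search with ep1 halved, treating the one-sided level as \(\alpha/2\) (outputs omitted here; both the critical values and the sample sizes would be expected to grow):

ph2simon(pu=0.15, pa=0.40, ep1=0.05, ep2=0.2)   # scenario 1 with halved alpha
ph2simon(pu=0.20, pa=0.40, ep1=0.025, ep2=0.1)  # scenario 2 with halved alpha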

Exercise 3

1) The theoretical probability of perfect balance (exactly 40 of 80 patients per arm) is \(\binom{80}{40}(0.5)^{80}\):

paste0("Probability of perfect balance: ", round(dbinom(40, 80, 0.5), 2))
[1] "Probability of perfect balance: 0.09"

2) Simulation of 1000 repeats:

# note: no seed is set, so the simulated quantiles vary slightly between runs
n_sim = 1000
n = 80

prop_A = numeric(n_sim)

for (i in 1:n_sim) {
  assignments = rbinom(n, size=1, prob=0.5)
  prop_A[i] = mean(assignments)
}
quantile(prop_A, probs=c(0.025,0.975))
  2.5%  97.5% 
0.4000 0.6125 

So the simulated 95% interval for the proportion assigned to A is (0.400, 0.6125).
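
As a cross-check, the exact Binomial(80, 0.5) quantiles give a very similar theoretical interval:

qbinom(c(0.025, 0.975), size = 80, prob = 0.5) / 80  # approximately (0.3875, 0.6125)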

3)

library(blockrand)
library(dplyr)

rand_list = blockrand(n = 80, num.levels = 2, levels = c("A", "B"), block.sizes = 2)
# block.sizes is specified in multiples of num.levels, so 2 yields blocks of size 4


# show the first 20 blocks with their within-block treatment sequences
data.frame(rand_list) %>%
  select(block.id, treatment) %>%
  group_by(block.id) %>%
  summarise(treatment = paste(treatment, collapse = ", ")) %>%
  head(20)
# A tibble: 20 × 2
   block.id treatment 
   <fct>    <chr>     
 1 1        B, A, B, A
 2 2        B, A, B, A
 3 3        B, A, B, A
 4 4        B, B, A, A
 5 5        B, A, B, A
 6 6        A, A, B, B
 7 7        A, A, B, B
 8 8        A, A, B, B
 9 9        A, B, B, A
10 10       A, B, A, B
11 11       B, A, A, B
12 12       B, B, A, A
13 13       A, A, B, B
14 14       B, A, B, A
15 15       B, A, A, B
16 16       B, A, A, B
17 17       A, B, A, B
18 18       B, B, A, A
19 19       B, A, B, A
20 20       B, B, A, A

4) In simple randomization, the two group sizes are equal in expectation, since each allocation is equally probable; however, chance imbalances may still occur in a finite sample. Blocked randomization guarantees balance throughout enrollment (exact balance at the end of every complete block), but introduces a risk of allocation predictability, particularly with fixed and known block sizes.

5) In an open-label study, simple randomization or blocked randomization with sufficiently large or randomly varying block sizes is preferable, as this minimizes the predictability of treatment assignment and thus reduces the risk of selection bias when allocation is known. In a double-blind study, blocked randomization is appropriate, since allocation concealment is preserved and balance between treatment groups can be maintained throughout enrollment without increasing the risk of selection bias.

Consistency check: The first feature that would reveal the error is the imbalance in group sizes within blocks, since with a block size of 3 it is impossible to achieve a 1:1 allocation in each block. This would manifest as alternating imbalances (e.g. 2A:1B or 1A:2B), instead of the expected equal distribution.

Exercise 4

1) We write the following code using the gsDesign package:

library(gsDesign)

# O’Brien–Fleming
obf <- gsDesign(
  k = 3,
  alpha = 0.025,
  beta = 0.20,
  timing = c(0.30, 0.60),
  test.type = 2,   # two-sided symmetric
  sfu = "OF"
)
gsBoundSummary(obf)
               Analysis               Value Efficacy Futility
              IA 1: 30%                   Z   3.6383  -3.6383
  N/Fixed design N: 0.3         p (1-sided)   0.0001   0.0001
                            ~delta at bound   2.3549  -2.3549
                        P(Cross) if delta=0   0.0001   0.0001
                        P(Cross) if delta=1   0.0182   0.0000
              IA 2: 60%                   Z   2.5727  -2.5727
 N/Fixed design N: 0.61         p (1-sided)   0.0050   0.0050
                            ~delta at bound   1.1775  -1.1775
                        P(Cross) if delta=0   0.0051   0.0051
                        P(Cross) if delta=1   0.3497   0.0000
                  Final                   Z   1.9928  -1.9928
 N/Fixed design N: 1.01         p (1-sided)   0.0231   0.0231
                            ~delta at bound   0.7065  -0.7065
                        P(Cross) if delta=0   0.0250   0.0250
                        P(Cross) if delta=1   0.8000   0.0000
# Efficacy Z boundaries and two-sided p-values
bs_obf <- gsBoundSummary(obf)
cbind(Z = bs_obf$Efficacy[bs_obf$Value == "Z"],
      P = 2 * bs_obf$Efficacy[bs_obf$Value == "p (1-sided)"])
          Z      P
[1,] 3.6383 0.0002
[2,] 2.5727 0.0100
[3,] 1.9928 0.0462
# Pocock
poc <- gsDesign(
  k = 3,
  alpha = 0.025,
  beta = 0.20,
  timing = c(0.30, 0.60),
  test.type = 2,   # two-sided symmetric
  sfu = "Pocock"
)
gsBoundSummary(poc)
               Analysis               Value Efficacy Futility
              IA 1: 30%                   Z   2.2991  -2.2991
 N/Fixed design N: 0.35         p (1-sided)   0.0107   0.0107
                            ~delta at bound   1.3796  -1.3796
                        P(Cross) if delta=0   0.0107   0.0107
                        P(Cross) if delta=1   0.2635   0.0000
              IA 2: 60%                   Z   2.2991  -2.2991
 N/Fixed design N: 0.71         p (1-sided)   0.0107   0.0107
                            ~delta at bound   0.9755  -0.9755
                        P(Cross) if delta=0   0.0185   0.0185
                        P(Cross) if delta=1   0.5536   0.0000
                  Final                   Z   2.2991  -2.2991
 N/Fixed design N: 1.18         p (1-sided)   0.0107   0.0107
                            ~delta at bound   0.7557  -0.7557
                        P(Cross) if delta=0   0.0250   0.0250
                        P(Cross) if delta=1   0.8000   0.0000
# Efficacy Z boundaries and two-sided p-values
bs_poc <- gsBoundSummary(poc)
cbind(Z = bs_poc$Efficacy[bs_poc$Value == "Z"],
      P = 2 * bs_poc$Efficacy[bs_poc$Value == "p (1-sided)"])
          Z      P
[1,] 2.2991 0.0214
[2,] 2.2991 0.0214
[3,] 2.2991 0.0214

2) At the first interim look, the O’Brien–Fleming design applies a much more stringent threshold than the Pocock design (Z = 3.64, two-sided p ≈ 0.0002 vs Z = 2.30, p ≈ 0.021), making early stopping far less likely. At the second look, the O’Brien–Fleming boundary relaxes substantially (Z = 2.57, p ≈ 0.010) but remains somewhat stricter than the Pocock boundary, which stays constant across looks. At the final analysis, the O’Brien–Fleming design applies the least stringent threshold (Z = 1.99, p ≈ 0.046), whereas the Pocock design keeps the same stricter threshold as at earlier looks, illustrating the trade-off between strong protection against early stopping and greater leniency at the final analysis.

The following plots depict the two boundary shapes:

library(ggplot2)

# Boundary plot for the O'Brien-Fleming design (plot.gsDesign returns a ggplot)
plot(obf, main = "") +
  theme(legend.position = "none",
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 15, face = "bold"))

# Boundary plot for the Pocock design
plot(poc, main = "") +
  theme(legend.position = "none",
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 15, face = "bold"))

3) At the second interim analysis (60% of information), the observed test statistic is Z = 2.30. Under the O’Brien–Fleming design, the efficacy boundary at this look is Z ≈ 2.57; since 2.30 does not exceed this threshold, the trial would not be stopped. Under the Pocock design, the boundary is Z ≈ 2.2991 at every look; since the observed Z just exceeds it, the trial would be stopped for efficacy.
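
This can also be read directly from the fitted design objects, which store the efficacy boundaries in their upper$bound component:

z_obs <- 2.30
c(OF = obf$upper$bound[2], Pocock = poc$upper$bound[2])  # efficacy boundaries at look 2
c(OF = z_obs >= obf$upper$bound[2], Pocock = z_obs >= poc$upper$bound[2])  # stop for efficacy?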

4) If the sponsor is particularly concerned about controlling the Type I error while still allowing the possibility of early stopping, I would recommend the O’Brien–Fleming group sequential design. This approach imposes very stringent stopping boundaries at early interim analyses, thereby strongly protecting against false-positive conclusions, while gradually relaxing the boundary so that early stopping for true efficacy remains possible.

Consistency check: No, the Z boundaries do not change. A two-sided test with α = 0.05 allocates 0.025 to the upper (or lower) tail, which is identical to a one-sided test with α = 0.025. Since the efficacy boundaries depend only on the upper (or lower) tail Type I error, the resulting Z thresholds remain the same.

Part B

1) Main drawbacks

  • Due to the small number of patients exposed at each dose, toxicity estimates are highly variable and imprecise

  • Rule-based escalation ignores information from previously tested doses

  • Can stop early after a few toxicities, resulting in conservative dose selection (as we saw in Exercise 1 of Part A, there was a ~30% probability in scenario A and ~42% in scenario B of stopping before the ideal 3rd dose)

  • The MTD from a 3+3 design often corresponds to a DLT probability below the target level, hence systematic underestimation

Model based designs:

  • Aim to improve accuracy of MTD estimation by explicitly modeling dose-toxicity relationship

  • Use all accumulated toxicity data rather than only the current dose level. Can also incorporate covariates (e.g. age, sex) and offer a personalized dose-toxicity profile.

  • Target a pre-specified toxicity probability, reducing the danger of underestimation.

2)

  • Phase II trials are not powered for definitive hypothesis testing or confirmatory inference.

  • Endpoints are often surrogate or intermediate rather than clinically definitive.

  • Lack of randomization (in many Phase II designs) increases bias and false-positive risk.

  • Treating Phase II as confirmatory can lead to overinterpretation of noisy efficacy signals.

  • Numerical results often show inflated effect estimates that do not replicate in Phase III.

\(p_0\) is the highest response rate considered clinically uninteresting: it defines the threshold at or below which the treatment is deemed ineffective. \(p_1\) represents a clinically meaningful response rate that would warrant further investigation, and the gap between \(p_0\) and \(p_1\) drives the design’s power to detect a treatment effect.

An overly optimistic \(p_1\) leads to an underestimated sample size, since large effects require fewer patients to detect. If the true effect is more modest than assumed, the study ends up underpowered, increasing the risk of false-negative decisions and of discarding potentially useful treatments.
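
To see this concretely with the clinfun function used in Part A, one could compare the sample sizes returned for a modest versus an optimistic alternative (outputs omitted; the larger assumed effect necessarily yields a smaller required n):

ph2single(pu=0.20, pa=0.35, ep1=0.05, ep2=0.10) # modest effect -> larger n
ph2single(pu=0.20, pa=0.50, ep1=0.05, ep2=0.10) # optimistic effect -> smaller n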

3) Simple (“coin toss”) randomization can lead to imbalances between groups purely by chance, and such imbalances are more pronounced in small samples.

By dividing enrollment into small, predefined blocks and ensuring equal treatment allocation within each block, the trial remains balanced at regular intervals throughout the enrollment period, thereby mitigating systematic time-related bias.
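
A minimal sketch of the contrast (our own illustration, not part of the exercises): with balanced blocks of four, the running imbalance between arms can never exceed two, whereas under simple randomization it can drift much further:

set.seed(7)
simple  <- rbinom(80, 1, 0.5)                             # coin-toss allocation
blocked <- as.vector(replicate(20, sample(rep(0:1, 2))))  # 20 balanced blocks of 4
max(abs(cumsum(simple == 1) - cumsum(simple == 0)))       # simple: can drift
max(abs(cumsum(blocked == 1) - cumsum(blocked == 0)))     # blocked: at most 2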

4) part 1.

In a coin toss, the probability of success at each toss is the same \(p\), but the probability of observing at least 1 success increases as the number of tosses increases. \(P(\text{0 successes in n tosses})=(1-p)^n\), where \(n\) is the number of tosses, hence \(P(\text{at least 1 success in n tosses})=1-(1-p)^n\). This is illustrated in the following graph:

# Parameters
p <- 0.5
n <- 1:20

# Probability of at least one success
prob_at_least_one <- 1 - (1 - p)^n

# Plot
plot(
  n,
  prob_at_least_one,
  type = "b",
  pch = 19,
  xlab = "Number of tosses (n)",
  ylab = "P(at least one success)",
  main = "Probability of ≥1 success vs number of tosses"
)

abline(h = 1, lty = 2)

Similar reasoning applied to the Type I error probability shows why multiple interim looks inflate the overall Type I error: each additional test is another chance for a false positive. Strictly speaking, successive looks at accumulating data are positively correlated, so the true inflation is smaller than this independent-test bound, but it still grows far beyond the nominal level. We illustrate the bound below:

# Parameters
alpha <- 0.05
n <- 1:20   # number of looks / tests

# Probability of at least one Type I error
type1_error <- 1 - (1 - alpha)^n

# Plot
plot(
  n,
  type1_error,
  type = "b",
  pch = 19,
  xlab = "Number of looks (k)",
  ylab = "P(at least one Type I error)",
  main = "Inflation of Type I Error with Multiple Looks"
)

abline(h = alpha, lty = 2)  # nominal alpha

4) part 2.

Data Safety and Monitoring Committee (DSMC)

(also called Treatment Effects Monitoring Committee – TEMC)

Role and responsibilities

  • Serves an advisory (not executive) role

  • Assists with protocol design

  • Monitors data quality and study timelines

  • Reviews drug toxicity and adverse events (patient safety)

  • Assesses treatment efficacy

  • Provides recommendations on: continuation of the study, protocol modification, early termination, and dissemination of results

  • Is intellectually and financially independent from the investigators

  • Advises trial sponsors and/or investigators

  • Holds meetings assessing group comparability, design assumptions, dropout rates, recruitment rates, funding availability, data quality, timelines, patient eligibility, and protocol deviations

4) part 3.

The greatest practical risk when interim results are widely known is the introduction of operational bias, as knowledge of emerging treatment effects can influence the behavior of investigators, clinicians, and participants, either consciously or unconsciously. This may lead to biased patient recruitment, differential care or monitoring between treatment groups, increased dropout from the perceived inferior arm, and biased outcome assessment, ultimately compromising the internal validity and credibility of the trial’s results.

5) Intention-to-Treat (ITT) analysis is considered more conservative and closer to real-world clinical practice because it analyzes participants according to their original randomization, regardless of treatment adherence, protocol deviations, or treatment crossover. By preserving the benefits of randomization, ITT minimizes selection and attrition bias and avoids artificially inflating treatment effects that can occur when non-adherent patients are excluded. At the same time, because noncompliance and deviations are common in routine clinical care, ITT reflects the effectiveness of a treatment under practical, real-world conditions, rather than its efficacy under ideal circumstances.

Per-protocol (PP) analysis can lead to biased conclusions when treatment adherence or protocol compliance is related to prognosis or outcomes. By excluding participants who deviate from the protocol (e.g., non-adherence, treatment crossover, early withdrawal), PP analysis breaks the original randomization and may introduce selection bias, because the remaining participants are often healthier, more motivated, or respond better to treatment. As a result, treatment effects may be overestimated, reflecting outcomes among a selected subgroup rather than the population originally randomized.