1) In a 3+3 design, escalation to the next dose level can occur in two ways: (i) 0 of the first 3 patients experience a dose-limiting toxicity (DLT), or (ii) 1 of the first 3 patients experiences a DLT, in which case 3 additional patients are treated at the same dose level and escalation occurs only if no more than 1 of the 6 patients in total experiences a DLT. Using basic combinatorial principles, we compute the per-dose transition (escalation) probabilities with the following code:
p_a = c(0.08, 0.18, 0.30, 0.45) # scenario A, toxicity probabilities
p_b = c(0.12, 0.22, 0.28, 0.40) # scenario B
# We implement a 3+3 design. Transition probabilities are:
pi_a = dbinom(0, 3, p_a) + dbinom(1, 3, p_a) * dbinom(0, 3, p_a) # P(transition in A)
pi_b = dbinom(0, 3, p_b) + dbinom(1, 3, p_b) * dbinom(0, 3, p_b) # P(transition in B)
transitions = data.frame(Dose = 1:4, Pi_A = pi_a, Pi_B = pi_b)
transitions
2) \(P(\text{stop before 4})=P(\text{stop at 1})+P(\text{stop at 2})+P(\text{stop at 3})\), where \(P(\text{stop at 1})=1-\pi_1\), \(P(\text{stop at 2})=\pi_1(1-\pi_2)\), and \(P(\text{stop at 3})=\pi_1\pi_2(1-\pi_3)\), with \(\pi_k\) denoting the transition probability at dose \(k\) computed above.
# For scenario A
p4a = (1 - pi_a[1]) + pi_a[1] * (1 - pi_a[2]) + pi_a[1] * pi_a[2] * (1 - pi_a[3])
# alternate equivalent formula: 1 - pi_a[1] * pi_a[2] * pi_a[3]
paste0("For scenario A P(stop before 4): ", round(p4a, 4))
[1] "For scenario A P(stop before 4): 0.652"
# For scenario B
p4b = (1 - pi_b[1]) + pi_b[1] * (1 - pi_b[2]) + pi_b[1] * pi_b[2] * (1 - pi_b[3])
paste0("For scenario B P(stop before 4): ", round(p4b, 4))
[1] "For scenario B P(stop before 4): 0.6895"
3) Using reasoning similar to question 2, we find the probability of stopping at each dose:
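The computation and plot for this step are not shown above; the following is a minimal sketch (it assumes pi_a and pi_b from question 1 are still in scope, and stop_probs is a hypothetical helper name):

# P(stop at dose k) = P(reach dose k) * P(no transition at dose k)
stop_probs <- function(pi) {
  reach <- c(1, cumprod(pi)[1:3]) # P(reach dose 1..4)
  reach * (1 - pi)                # P(stop at dose 1..4)
}
stop_a <- stop_probs(pi_a)
stop_b <- stop_probs(pi_b)
barplot(rbind(stop_a, stop_b), beside = TRUE,
        names.arg = paste("Dose", 1:4),
        legend.text = c("Scenario A", "Scenario B"),
        ylab = "P(stop at dose)", main = "Stopping probability by dose")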
4) We know that the 3rd dose is the ideal dose in each scenario. Assisted by the plot, we note that scenario A has a lower probability of stopping at the 2nd dose and is therefore more likely to continue to the 3rd dose. In this sense, scenario A is preferable.
5) Scenario A has marginally better operating characteristics, as the probability of termination increases more slowly at lower doses, allowing safer escalation. Its dose–toxicity relationship is closer to a sigmoid curve, which is biologically more plausible. By contrast, Scenario B is steeper, with toxicity accumulating more rapidly at early doses, leading to higher termination probabilities sooner and less favorable operating behavior.
Consistency check: If toxicity probabilities increase, this should first be reflected in the transition probabilities, which would become smaller. Higher toxicity would therefore make transitions to the next dose level less likely.
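Exercise 2

1) The search code for the single-stage designs is not shown above; the following is a minimal sketch of an exact binomial search that could produce such a table (single_stage_search is a hypothetical helper, and the rates p0 and p1 are placeholders, since the scenario definitions are not restated in this section):

# Exact single-stage search: reject H0 (declare the treatment promising)
# if the number of responders exceeds the critical value r
single_stage_search <- function(p0, p1, alpha = 0.05, beta = 0.10, n_max = 100) {
  out <- NULL
  for (n in 1:n_max) {
    for (r in 0:n) {
      t1 <- 1 - pbinom(r, n, p0) # Type I error: P(X > r | p0)
      t2 <- pbinom(r, n, p1)     # Type II error: P(X <= r | p1)
      if (t1 <= alpha && t2 <= beta)
        out <- rbind(out, data.frame(n = n, r = r, TypeI = t1, TypeII = t2))
    }
  }
  out
}
single_stage_search(p0 = 0.2, p1 = 0.4) # hypothetical rates

Candidate (n, r) pairs from such a search: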
n r Type I error Type II error
1 47 14 0.03663689 0.09877433
2 48 14 0.04373256 0.08125713
3 50 15 0.03080342 0.09550171
4 51 15 0.03678715 0.07888304
5 52 15 0.04356872 0.06475717
We note that the smallest number of patients that satisfies the error-rate constraints is 16 in the first scenario, requiring at least 5 responders. In the second scenario, the corresponding minimum is 47 patients, requiring at least 15 responders.
2) We construct the two-stage designs using the following code:
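The original chunk is not reproduced here; a minimal sketch using clinfun::ph2simon, where the response rates and error levels are placeholders to be replaced by each scenario's actual values:

library(clinfun)
# Simon two-stage designs: pu = p0 (uninteresting rate), pa = p1 (promising rate)
design_s1 <- ph2simon(pu = 0.15, pa = 0.45, ep1 = 0.10, ep2 = 0.10) # scenario 1 (assumed rates)
design_s2 <- ph2simon(pu = 0.20, pa = 0.40, ep1 = 0.05, ep2 = 0.10) # scenario 2 (assumed rates)
design_s1 # prints the Optimal and Minimax designs
design_s2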
3) From the previous questions we construct the following table:
| Design  | Scenario 1                                                  | Scenario 2                                                  |
|---------|-------------------------------------------------------------|-------------------------------------------------------------|
| Single  | Sample size: 16; EN0: 16; PET0: 0 (can't terminate early)   | Sample size: 47; EN0: 47; PET0: 0 (can't terminate early)   |
| Optimal | Sample size: 18; EN0: 10.1; PET0: 71.67%                    | Sample size: 54; EN0: 30.43; PET0: 67.3%                    |
| Minimax | Sample size: 16; EN0: 11.8; PET0: 59.95%                    | Sample size: 45; EN0: 31.23; PET0: 65.59%                   |
4) For Scenario 1, there is no reason to suggest a single-stage design, as the Minimax design uses the same sample size while allowing early stopping for futility. Although the Optimal design would expose two additional patients, the expected reduction in patients exposed under H0 is less than two (EN0 10.1 vs 11.8), albeit with a higher probability of early termination. Overall, the Minimax design appears to be the most appropriate choice; however, if the clinician prioritizes the increased likelihood of early stopping, the Optimal design is also a reasonable alternative.
For Scenario 2, there is again no reason to prefer the single-stage design. The Optimal design would expose up to 9 additional patients for an expected saving of less than one patient under H0 (EN0 30.43 vs 31.23) and a negligible increase in PET0. The Minimax design seems the best choice here.
5) We create a helper function for power calculation in the two-stage designs:
simon_power <- function(n1, r1, n, r, p_true) {
  n2 <- n - n1      # how many in second stage
  x1 <- (r1 + 1):n1 # first-stage response counts that allow continuing
  # P(X1 = x1)
  prob_x1 <- dbinom(x1, size = n1, prob = p_true)
  # P(X2 > r - x1)
  prob_stage2 <- 1 - pbinom(r - x1, size = n2, prob = p_true)
  sum(prob_x1 * prob_stage2)
}
In the first scenario, Minimax design:
simon_power(n1=9, r1=1, n=16, r=4, p_true=0.3)
[1] 0.5293164
In the second scenario, Minimax design:
simon_power(n1=24, r1=5, n=45, r=13, p_true=0.3)
[1] 0.4683275
Consistency check: If the significance level \(\alpha\) was incorrectly treated as two-sided, effectively halving the tail probability available for rejection, stricter decision boundaries would be required. This would lead to larger critical values \(r\) (and \(r_1\)) and, if the designs were re-optimized to maintain the same power, to larger total sample sizes across all designs.
Exercise 3
1) Theoretical probability of perfect balance:
paste0("Probability for perfect balabnce: ", round(dbinom(40,80,0.5),2))
# A tibble: 20 × 2
block.id treatment
<fct> <chr>
1 1 B, A, B, A
2 2 B, A, B, A
3 3 B, A, B, A
4 4 B, B, A, A
5 5 B, A, B, A
6 6 A, A, B, B
7 7 A, A, B, B
8 8 A, A, B, B
9 9 A, B, B, A
10 10 A, B, A, B
11 11 B, A, A, B
12 12 B, B, A, A
13 13 A, A, B, B
14 14 B, A, B, A
15 15 B, A, A, B
16 16 B, A, A, B
17 17 A, B, A, B
18 18 B, B, A, A
19 19 B, A, B, A
20 20 B, B, A, A
4) In simple randomization, the two groups are the same size on average (in expectation), since each allocation has equal probability; however, chance imbalances may still occur in a finite sample. Blocked randomization guarantees equal group sizes throughout enrollment, but introduces a risk of allocation predictability, particularly with fixed and known block sizes.
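A quick simulation (a sketch; the 80-patient total mirrors question 1) illustrates how often simple randomization leaves a noticeable imbalance:

set.seed(1) # arbitrary seed
# Group sizes under simple randomization of 80 patients
n_A <- rbinom(10000, size = 80, prob = 0.5)
imbalance <- abs(n_A - (80 - n_A)) # |n_A - n_B|
mean(imbalance >= 10) # chance of an imbalance of 10 or more patients
mean(imbalance == 0)  # chance of perfect balance, cf. question 1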
5) In an open-label study, simple randomization or blocked randomization with sufficiently large or randomly varying block sizes is preferable, as this minimizes the predictability of treatment assignment and thus reduces the risk of selection bias when allocation is known. In a double-blind study, blocked randomization is appropriate, since allocation concealment is preserved and balance between treatment groups can be maintained throughout enrollment without increasing the risk of selection bias.
Consistency check: The first feature that would reveal the error is the imbalance in group sizes within blocks, since with a block size of 3 it is impossible to achieve a 1:1 allocation in each block. This would manifest as alternating imbalances (e.g. 2A:1B or 1A:2B), instead of the expected equal distribution.
Exercise 4
1) We write the following code using the gsDesign package:
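The call itself is not shown; a minimal sketch consistent with the summary below, with the settings (three looks at 30%, 60% and 100% of information, symmetric two-sided bounds, one-sided alpha = 0.025, 80% power) inferred from the printed output:

library(gsDesign)
# O'Brien-Fleming efficacy/futility bounds via the classic "OF" boundary
obf <- gsDesign(k = 3, test.type = 2, alpha = 0.025, beta = 0.2,
                timing = c(0.3, 0.6, 1), sfu = "OF", delta = 1)
gsBoundSummary(obf)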
Analysis Value Efficacy Futility
IA 1: 30% Z 3.6383 -3.6383
N/Fixed design N: 0.3 p (1-sided) 0.0001 0.0001
~delta at bound 2.3549 -2.3549
P(Cross) if delta=0 0.0001 0.0001
P(Cross) if delta=1 0.0182 0.0000
IA 2: 60% Z 2.5727 -2.5727
N/Fixed design N: 0.61 p (1-sided) 0.0050 0.0050
~delta at bound 1.1775 -1.1775
P(Cross) if delta=0 0.0051 0.0051
P(Cross) if delta=1 0.3497 0.0000
Final Z 1.9928 -1.9928
N/Fixed design N: 1.01 p (1-sided) 0.0231 0.0231
~delta at bound 0.7065 -0.7065
P(Cross) if delta=0 0.0250 0.0250
P(Cross) if delta=1 0.8000 0.0000
# Z's and p-values
cbind(
  Z = gsBoundSummary(obf)$Eff[gsBoundSummary(obf)$Val == "Z"],
  P = 2 * gsBoundSummary(obf)$Eff[gsBoundSummary(obf)$Val == "p (1-sided)"]
)
Z P
[1,] 3.6383 0.0002
[2,] 2.5727 0.0100
[3,] 1.9928 0.0462
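The corresponding Pocock call (again a sketch, with the same inferred settings):

poc <- gsDesign(k = 3, test.type = 2, alpha = 0.025, beta = 0.2,
                timing = c(0.3, 0.6, 1), sfu = "Pocock", delta = 1)
gsBoundSummary(poc)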
Analysis Value Efficacy Futility
IA 1: 30% Z 2.2991 -2.2991
N/Fixed design N: 0.35 p (1-sided) 0.0107 0.0107
~delta at bound 1.3796 -1.3796
P(Cross) if delta=0 0.0107 0.0107
P(Cross) if delta=1 0.2635 0.0000
IA 2: 60% Z 2.2991 -2.2991
N/Fixed design N: 0.71 p (1-sided) 0.0107 0.0107
~delta at bound 0.9755 -0.9755
P(Cross) if delta=0 0.0185 0.0185
P(Cross) if delta=1 0.5536 0.0000
Final Z 2.2991 -2.2991
N/Fixed design N: 1.18 p (1-sided) 0.0107 0.0107
~delta at bound 0.7557 -0.7557
P(Cross) if delta=0 0.0250 0.0250
P(Cross) if delta=1 0.8000 0.0000
# Z's and p-values
cbind(
  Z = gsBoundSummary(poc)$Eff[gsBoundSummary(poc)$Val == "Z"],
  P = 2 * gsBoundSummary(poc)$Eff[gsBoundSummary(poc)$Val == "p (1-sided)"]
)
Z P
[1,] 2.2991 0.0214
[2,] 2.2991 0.0214
[3,] 2.2991 0.0214
2) At the first interim look, the O’Brien–Fleming design applies a much more stringent threshold than the Pocock design (Z = 3.64, two-sided p ≈ 0.0002 vs Z = 2.30, p ≈ 0.021), making early stopping far less likely. At the second look, the O’Brien–Fleming boundary relaxes substantially (Z = 2.57, p ≈ 0.010) but remains somewhat more stringent than the Pocock boundary, which is constant across looks. At the final analysis, the ordering reverses: the O’Brien–Fleming threshold (Z = 1.99, p ≈ 0.046) is the least stringent, whereas the Pocock design keeps the same stricter threshold as at earlier looks, illustrating the trade-off between strong protection against early stopping and greater leniency at the final analysis.
The following plot attempts to depict this situation:
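A minimal sketch of such a plot, using the boundary values from the summaries above:

info  <- c(0.3, 0.6, 1.0)          # information fractions of the three looks
z_obf <- c(3.6383, 2.5727, 1.9928) # O'Brien-Fleming efficacy bounds
z_poc <- rep(2.2991, 3)            # Pocock efficacy bound (constant)
plot(info, z_obf, type = "b", pch = 19, ylim = c(1.5, 4),
     xlab = "Information fraction", ylab = "Efficacy boundary (Z)",
     main = "O'Brien-Fleming vs Pocock efficacy boundaries")
lines(info, z_poc, type = "b", pch = 17, lty = 2)
legend("topright", legend = c("O'Brien-Fleming", "Pocock"),
       pch = c(19, 17), lty = c(1, 2))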
3) At the second interim analysis (60% of events), the observed test statistic is Z = 2.30. Under the O’Brien–Fleming design, the efficacy boundary at this look is Z ≈ 2.57; since 2.30 does not exceed this threshold, the trial would not be stopped. Under the Pocock design, the boundary is Z = 2.2991 at all looks; since the observed Z marginally exceeds it, the trial would be stopped.
4) If the sponsor is particularly concerned about controlling the Type I error while still allowing the possibility of early stopping, I would recommend the O’Brien–Fleming group sequential design. This approach imposes very stringent stopping boundaries at early interim analyses, thereby strongly protecting against false-positive conclusions, while gradually relaxing the boundary so that early stopping for true efficacy remains possible.
Consistency check: No, the Z boundaries do not change. A two-sided test with α = 0.05 allocates 0.025 to the upper (or lower) tail, which is identical to a one-sided test with α = 0.025. Since the efficacy boundaries depend only on the upper (or lower) tail Type I error, the resulting Z thresholds remain the same.
Part B
1) Main drawbacks of the 3+3 design:
- Because only a small number of patients is exposed at each dose, toxicity estimates are highly variable and imprecise.
- Rule-based escalation ignores information from previously tested doses.
- The design can stop early after a few toxicities, resulting in conservative dose selection (as we saw in Exercise 1 of Part A, the probability of stopping before the ideal 3rd dose was roughly 30–40%, depending on the scenario).
- The MTD from a 3+3 design often corresponds to a DLT probability below the target level, so it is systematically underestimated.
Model-based designs:
- Aim to improve the accuracy of MTD estimation by explicitly modeling the dose–toxicity relationship.
- Use all accumulated toxicity data rather than only the current dose level, and can incorporate covariates (e.g., age, sex) to offer a personalized dose–toxicity profile.
- Target a pre-specified toxicity probability, reducing the danger of underestimation.
2)
- Phase II trials are not powered for definitive hypothesis testing or confirmatory inference.
- Endpoints are often surrogate or intermediate rather than clinically definitive.
- The lack of randomization (in many Phase II designs) increases bias and false-positive risk.
- Treating Phase II as confirmatory can lead to overinterpretation of noisy efficacy signals.
- Numerical results often show inflated effect estimates that do not replicate in Phase III.
\(p_0\) is the highest response rate that is still clinically uninteresting: it defines the threshold at or below which the treatment is considered ineffective. \(p_1\) is the lowest response rate that represents a clinically meaningful improvement worth further investigation, and it drives the design’s power to detect a treatment effect.
An overly optimistic \(p_1\) leads to an underestimated sample size, since large effects require fewer patients to be detected. This results in low power if the true effect is more modest than assumed and increases the risk of false-negative decisions, discarding potentially useful treatments; see the sketch below.
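A minimal sketch of this sensitivity (smallest_n is a hypothetical helper; the rates and error levels are illustrative placeholders):

# Smallest exact single-stage design satisfying alpha <= 0.05, beta <= 0.10
smallest_n <- function(p0, p1, alpha = 0.05, beta = 0.10, n_max = 200) {
  for (n in 1:n_max) {
    for (r in 0:n) {
      if (1 - pbinom(r, n, p0) <= alpha && pbinom(r, n, p1) <= beta)
        return(c(n = n, r = r)) # first hit is the smallest n
    }
  }
  NA
}
smallest_n(p0 = 0.2, p1 = 0.5)  # optimistic alternative: modest n
smallest_n(p0 = 0.2, p1 = 0.35) # more realistic alternative: much larger n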
3) Simple (“coin toss”) randomization can lead to imbalances between groups purely due to chance, with such chance effects being more pronounced in small sample sizes.
By dividing enrollment into small, predefined blocks and ensuring equal treatment allocation within each block, the trial remains balanced at regular intervals throughout the enrollment period, thereby mitigating systematic time-related bias.
4) part 1.
In a coin toss, the probability of success at each toss is the same \(p\), but the probability of observing at least 1 success increases as the number of tosses increases. \(P(\text{0 successes in n tosses})=(1-p)^n\), where \(n\) is the number of tosses, hence \(P(\text{at least 1 success in n tosses})=1-(1-p)^n\). This is illustrated in the following graph:
# Parameters
p <- 0.5
n <- 1:20
# Probability of at least one success
prob_at_least_one <- 1 - (1 - p)^n
# Plot
plot(n, prob_at_least_one, type = "b", pch = 19,
     xlab = "Number of tosses (n)",
     ylab = "P(at least one success)",
     main = "Probability of ≥1 success vs number of tosses")
abline(h = 1, lty = 2)
Applying similar reasoning to the Type I error probability, we can understand why multiple looks inflate the Type I error rate. We illustrate this below:
# Parameters
alpha <- 0.05
k <- 1:20 # number of looks / tests
# Probability of at least one Type I error across k looks
type1_error <- 1 - (1 - alpha)^k
# Plot
plot(k, type1_error, type = "b", pch = 19,
     xlab = "Number of looks (k)",
     ylab = "P(at least one Type I error)",
     main = "Inflation of Type I Error with Multiple Looks")
abline(h = alpha, lty = 2) # nominal alpha
4) part 2.
Data Safety and Monitoring Committee (DSMC)
(also called Treatment Effects Monitoring Committee – TEMC)
Role and responsibilities
- Serves in an advisory (not executive) role
- Assists with protocol design
- Monitors data quality and study timelines
- Reviews drug toxicity and adverse events (patient safety)
- Assesses treatment efficacy
- Provides recommendations on continuation of the study, protocol modification, early termination, and dissemination of results
- Is intellectually and financially independent from the investigators
- Advises trial sponsors and/or investigators
- Holds meetings assessing group comparability, design assumptions, dropout rates, recruitment rates, funding availability, data quality, timelines, patient eligibility, and protocol deviations
4) part 3.
The greatest practical risk when interim results are widely known is the introduction of operational bias, as knowledge of emerging treatment effects can influence the behavior of investigators, clinicians, and participants, either consciously or unconsciously. This may lead to biased patient recruitment, differential care or monitoring between treatment groups, increased dropout from the perceived inferior arm, and biased outcome assessment, ultimately compromising the internal validity and credibility of the trial’s results.
5) Intention-to-Treat (ITT) analysis is considered more conservative and closer to real-world clinical practice because it analyzes participants according to their original randomization, regardless of treatment adherence, protocol deviations, or treatment crossover. By preserving the benefits of randomization, ITT minimizes selection and attrition bias and avoids artificially inflating treatment effects that can occur when non-adherent patients are excluded. At the same time, because noncompliance and deviations are common in routine clinical care, ITT reflects the effectiveness of a treatment under practical, real-world conditions, rather than its efficacy under ideal circumstances.
Per-protocol (PP) analysis can lead to biased conclusions when treatment adherence or protocol compliance is related to prognosis or outcomes. By excluding participants who deviate from the protocol (e.g., non-adherence, treatment crossover, early withdrawal), PP analysis breaks the original randomization and may introduce selection bias, because the remaining participants are often healthier, more motivated, or respond better to treatment. As a result, treatment effects may be overestimated, reflecting outcomes among a selected subgroup rather than the population originally randomized.