DTEBOP2: An R Package for Designing Two-Arm Multi-Stage Survival Trials with Delayed Treatment Effects

1. Introduction

We have developed a novel design method that extends the Bayesian Optimal Phase II (BOP2) design (Zhou et al. 2020) to address delayed treatment effects in a two-arm randomized controlled trial (RCT) setting. The corresponding R package, DTEBOP2, is described below; this vignette serves as its user manual.

The DTEBOP2 package serves multiple purposes. Clinical researchers can leverage the functions within this package when conducting clinical trials of immunotherapies or molecularly targeted therapies for cancer, where a delayed treatment effect (DTE) is anticipated. The following steps describe how to use the proposed method when designing such a study.

  1. Step 1: Overall Median Survival Times
    Provide the overall median survival times for the standard of care (SOC) and experimental arms, denoted \(\bar{\mu}_j, j=0,1\) (0 for SOC, 1 for the experimental arm). These values correspond to the null and alternative hypotheses in a conventional setting, for example, 2.8 months vs. 3.5 months.

  2. Step 2: Prior Distribution and Most Likely Value for the Delayed Treatment Effect (DTE) Timepoint \(S\)
    Specify the prior distribution for the delayed separation timepoint \(S\) as a truncated Gamma distribution, i.e.,
    \[ S \sim \text{Gamma}(s_1, s_2)I(L, U) \] and the best-guessed most likely \(S\).

  3. Step 3: Sample Size Calculation and Optimal Parameter Determination for Two-Stage Design
    For a two-stage design, we can obtain the calculated sample sizes for the interim and final analyses using the function Two_stage_sample_size(.). The function also returns the optimal parameters \(\lambda\) and \(\gamma\) such that
    \[ P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S) > 1 - \lambda\left(\frac{n}{N}\right)^{\gamma}. \] If your study plans to have more than two stages, skip Step 3 and go directly to Step 4. Note: (\(\tilde{\mu}_0,\tilde{\mu}_1\)) denote the medians of the time-to-event distribution before and after the separation timepoint \(S\); a more detailed explanation is provided later. A short sketch of this decision boundary is given after this list.

  4. Step 4: Optimal Parameters Determination
    If your study plans to have more than two stages, then, similar to the BOP2 design, once a user provides the sample sizes at the interim and final stages, we can compute the optimal parameters \(\lambda\) and \(\gamma\) satisfying \[ P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S) > 1 - \lambda\left(\frac{n}{N}\right)^{\gamma} \] with the R function get.optimal_2arm_piecewise(.).

  5. Step 5: Conduct Interim and Final Analyses
    Once we have the data, we can use the function conduct(.) to perform the interim and/or final analyses and make Go/No-Go decisions based on the optimal parameters \(\lambda\) and \(\gamma\) obtained in Step 3 or 4.
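
As referenced in Step 3, the Go/No-Go rule compares a posterior probability with the boundary \(1 - \lambda(n/N)^{\gamma}\), which depends on the current sample size \(n\) and the maximum sample size \(N\). Below is a minimal sketch of this boundary in R; the function name futility_boundary and the illustrative values are our own, not part of the package:

futility_boundary <- function(n, N, lambda, gamma) {
  # No-Go at an analysis with n of N patients if
  # P(mu1_tilde < mu0_tilde | D_n, S) exceeds this value
  1 - lambda * (n / N)^gamma
}
futility_boundary(n = 28, N = 40, lambda = 0.95, gamma = 1)  # 0.335 at the interim
futility_boundary(n = 40, N = 40, lambda = 0.95, gamma = 1)  # 0.050 at the final analysis

The boundary decreases as \(n\) approaches \(N\), so stronger evidence in favor of the experimental arm is required at later analyses.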

Below, we provide a step-by-step explanation of the procedures outlined above.

2. Eliciting the Truncated Gamma Prior for DTE Separation Timepoint \(S\)

In Step 2 above, we require the trialist to specify the prior distribution for the DTE timepoint \(S\) as a truncated Gamma distribution, i.e., \[ S \sim \text{Gamma}(s_1, s_2)I(L, U), \] together with a best-guessed most likely DTE timepoint \(S\).

To specify \(L\) and \(U\), we can ask domain experts:

“What would be reasonable estimates/guesses for the lower and upper bounds of the delayed treatment effect timepoint S in population-level survival curves?”

This part is relatively straightforward, especially if historical data or studies are available for reference.

Meanwhile, to obtain a best-guess estimate of the most likely DTE timepoint \(S\), a straightforward method is to use the midpoint of the lower and upper bounds \(L\) and \(U\).

Next, we introduce the specification of \(s_1\) and \(s_2\) for the Gamma distribution. Although default values of \(s_1=1\) and \(s_2=1\) are used in this package, more accurate estimates can be obtained by asking domain experts the following question:

“From your experience, can you provide at least one of the following statistics for \(S\)?”

If experts can provide any of these statistics (the mean, median, standard deviation, or the 2.5% and 97.5% quantiles of \(S\)), we can estimate \(s_1\) and \(s_2\) by minimizing the following weighted least squares (WLS) objective:

\[ w_1\sum_{i=1}^n(\text{mean}(S)-\text{mean from Expert } i)^2 + w_2\sum_{i=1}^n(\text{median}(S)-\text{median from Expert } i)^2 + w_3\sum_{i=1}^n(\text{sd}(S)-\text{sd from Expert } i)^2 + w_4\sum_{i=1}^n(\text{quantile}(S,0.025)-\text{0.025 quantile from Expert } i)^2 + w_5\sum_{i=1}^n(\text{quantile}(S,0.975)-\text{0.975 quantile from Expert } i)^2 \] where \(S \sim \text{Gamma}(s_1, s_2)I(L, U)\), and \(w_i\) (\(i=1,2,3,4,5\)) represents the relative importance of each statistic.

Since not all experts may provide every statistic, we can still proceed by using, for example, just one available statistic to estimate the optimal parameters.

The function trunc_gamma_para(.) implements this WLS optimization to estimate \(s_1\) and \(s_2\).
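
For intuition, here is a minimal sketch of such an objective using only expert means and medians; the helper wls_objective and the optimizer settings are our own illustration, not the package's internal implementation:

library(truncdist)  # dtrunc() and qtrunc() for truncated distributions

wls_objective <- function(par, L, U, expert_means, expert_medians, w = c(1, 1)) {
  # Mean of the truncated Gamma(s1, s2)I(L, U) prior, by numerical integration
  m <- integrate(function(x) x * dtrunc(x, "gamma", a = L, b = U,
                                        shape = par[1], scale = par[2]), L, U)$value
  # Median of the truncated prior
  med <- qtrunc(0.5, "gamma", a = L, b = U, shape = par[1], scale = par[2])
  w[1] * sum((m - expert_means)^2) + w[2] * sum((med - expert_medians)^2)
}

# Hypothetical elicitation (the same values reappear in Example 1 below):
optim(c(1, 1), wls_objective, L = 2, U = 2.5,
      expert_means = c(2.2, 2.1, 2.3), expert_medians = c(2.27, 2.3, 2.31),
      method = "L-BFGS-B", lower = c(0.01, 0.01))$par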

The lower and upper bounds \(L\) and \(U\), together with the estimated Gamma prior parameters, give a complete picture of the DTE timepoint. Biostatisticians may visualize the resulting prior to facilitate discussion and decision-making.


3. Calculating Median Survival Times Before and After Separation, Given the DTE Timepoint \(S\)

From the above, we have the most likely DTE timepoint \(S\); from the study's proposed null and alternative hypotheses, we also have the overall median survival times for the SOC (\(\bar{\mu}_0\)) and experimental (\(\bar{\mu}_1\)) arms. We can then back-calculate the median survival times before and after the separation timepoint, denoted (\(\tilde{\mu}_0, \tilde{\mu}_1\)), using the following formula:

\[ \begin{aligned} \bar{\mu}_0 &= \tilde{\mu}_0 \\ \bar{\mu}_1 &= \begin{cases} \tilde{\mu}_0, & \text{if } \tilde{\mu}_0 < S, \\ (1 - \frac{\tilde{\mu}_1}{\tilde{\mu}_0}) S + \tilde{\mu}_1, & \text{if } \tilde{\mu}_0 \geq S. \end{cases} \end{aligned} \]

The median survival times for pre and post DTE separation timepoints will be used inherently in our proposed algorithm to obtain optimal estimates for design parameters \(\lambda\) and \(\gamma\), which are crucial for controlling the type I error rate.
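
As a concrete check, the back-calculation can be done in a few lines of R; the helper post_sep_median below is our own illustration, not a DTEBOP2 function:

post_sep_median <- function(mu0_bar, mu1_bar, S) {
  # Invert mu1_bar = (1 - mu1_tilde/mu0_tilde) * S + mu1_tilde (case mu0_tilde >= S)
  mu0_tilde <- mu0_bar                             # SOC median is unchanged
  mu1_tilde <- (mu1_bar - S) / (1 - S / mu0_tilde)
  c(mu0_tilde = mu0_tilde, mu1_tilde = mu1_tilde)
}
post_sep_median(mu0_bar = 2.8, mu1_bar = 3.5, S = 2.28)  # mu1_tilde is about 6.57

These are the values (\(\tilde{\mu}_0 = 2.8\), \(\tilde{\mu}_1 = 6.57\)) used in the sensitivity analyses of Example 1 below.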


4. Determination of Sample Size in a Two-Stage Design Setting

The sample size for a two-stage design is determined by minimizing the following weighted expected sample size function, while ensuring that the power is at least 1− (Type II error) when \(S \leq U\). To balance two competing objectives—minimizing patient exposure to futile treatments and maximizing access to effective treatments—we define the function as follows: \[ EN = w \frac{PS_{H_0}\times n_1 + (1 - PS_{H_0})(n - n_1)}{n} + (1 - w) \left(1 - \frac{PS_{H_1} \times n_1 + (1 - PS_{H_1})(n - n_1)}{n} \right) \]

Here, \(w\in[0,1]\) is a tuning parameter representing the trade-off between two goals:

  1. The first term is the (normalized) expected sample size under the null hypothesis. Minimizing it emphasizes early stopping when the treatment is futile, thereby reducing unnecessary patient exposure to ineffective therapies.

  2. The second term shifts the focus to maximizing the number of patients who can access a truly effective treatment. By minimizing \(1 - E(N \mid H_1)\), we encourage continuation of the trial when the treatment is promising, avoiding premature termination that could deny access to beneficial therapy.

By varying \(w\), this design allows flexible prioritization between safety (minimizing exposure to ineffective treatment) and efficacy (maximizing access to effective treatment), while ensuring adequate statistical power.

Although this sample size is derived only for a two-stage design, it can be extended to sample size calculations for trials with multiple stages. Such extensions are not explored in this package.

Sometimes, the probability of early stopping under the null hypothesis \(PS_{H_0}\) at the first stage, as computed purely by the algorithm, may be too low (e.g., 25%). This indicates that the design lacks the ability to stop the study early for futility, which is generally undesirable, especially in phase II trials. To address this, our method provides users with the option to specify a minimum desired \(PS_{H_0}\). Based on this user-specified threshold instead of just minimizing \(EN\) above, we can still compute and provide the corresponding sample sizes for both the interim and final stages while keeping power at least \(1 -\) (Type II error) when \(S \leq U\). Note: Given that \(PS_{H_0}\) increases with \(n_1\) for a fixed \(n\), our proposed algorithm restricts the ratio \(n_1/n\) to be less than 0.75 by default.
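
To make the objective concrete, here is a direct transcription of \(EN\) into R; the stopping probabilities used below are hypothetical placeholders, since in practice they are computed internally by the design algorithm:

EN <- function(w, PS_H0, PS_H1, n1, n) {
  # Weighted trade-off between early stopping under H0 and continuation under H1
  w       * (PS_H0 * n1 + (1 - PS_H0) * (n - n1)) / n +
  (1 - w) * (1 - (PS_H1 * n1 + (1 - PS_H1) * (n - n1)) / n)
}
# Hypothetical illustration with n1 = 28, n = 40, equal weight, and assumed
# stopping probabilities PS_H0 = 0.3 and PS_H1 = 0.07:
EN(w = 0.5, PS_H0 = 0.3, PS_H1 = 0.07, n1 = 28, n = 40)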


5. Conduct the Trial: Make Go/No-Go Decision

From the above, with the obtained optimal \(\lambda\) and \(\gamma\), once we have the data, we can calculate \(P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S)\) to determine whether to proceed with the trial (Go/No-Go decision).


6. Summary

This vignette provides a structured, step-by-step approach to designing two-arm multi-stage survival trials with delayed treatment effects. In summary, the process includes: specifying the overall median survival times under the null and alternative hypotheses; eliciting a truncated Gamma prior and a most likely value for the DTE timepoint \(S\); determining sample sizes and the optimal design parameters \(\lambda\) and \(\gamma\); and conducting the interim and final analyses to make Go/No-Go decisions.

When a delayed treatment effect (DTE) is anticipated, the proposed design method can reduce the required sample size compared to methods (e.g., the conventional BOP2) that do not account for DTE. The details can be found in Cai, Mori, and Pan (2025).

7. Example 1

We consider the CheckMate 017 study (Brahmer et al. 2015) as an example. This randomized, open-label, international phase 3 trial evaluated the efficacy and safety of nivolumab, a fully human IgG4 programmed death-1 (PD-1) immune checkpoint inhibitor, compared to docetaxel. The study focused on patients with squamous-cell non-small-cell lung cancer (NSCLC) following prior platinum-based chemotherapy. Here, we focus on progression-free survival (PFS), a secondary endpoint, which has demonstrated a delayed treatment effect.

We now use the DTEBOP2 R package to design this trial, assuming the PFS endpoint to be the primary endpoint, and also demonstrate the process for conducting the interim and final analyses once the data are available.

Preparation Step: Extract Data from the Published PFS Kaplan-Meier Plot

First, we reconstruct the Kaplan-Meier (KM) plot for PFS using PlotDigitizer. The datasets extracted from PlotDigitizer for the treatment and control arms are stored in our R package as “Experimental” and “SOC”.

library(DTEBOP2)
library(invgamma)
library(truncdist)
library(doParallel)
library(survRM2adapt)
library(reconstructKM)
library(ggplot2)
data("Experimental")
data("SOC")
Experiment_NAR <- data.frame(time=seq(from=0, to=24, by=3), NAR=c(135, 68, 48, 33,  21, 15, 6, 2,0))
SOC_NAR <- data.frame(time=seq(from=0, to=24, by=3), NAR=c(137, 62, 26, 9, 6, 2,1, 0, 0))
Experiment_aug <- format_raw_tabs(raw_NAR=Experiment_NAR,
                              raw_surv=Experimental) 
SOC_aug <- format_raw_tabs(raw_NAR=SOC_NAR,
                           raw_surv=SOC) 

# reconstruct by calling KM_reconstruct()
Experiment_recon <- KM_reconstruct(aug_NAR=Experiment_aug$aug_NAR, aug_surv=Experiment_aug$aug_surv)
SOC_recon <- KM_reconstruct(aug_NAR=SOC_aug$aug_NAR, aug_surv=SOC_aug$aug_surv)

# put the treatment and control arms into one dataset
Experiment_IPD <- data.frame(arm=1, time=Experiment_recon$IPD_time, status=Experiment_recon$IPD_event)
SOC_IPD <- data.frame(arm=0, time=SOC_recon$IPD_time, status=SOC_recon$IPD_event)
allIPD <- rbind(Experiment_IPD, SOC_IPD)

KM_fit <- survival::survfit(survival::Surv(time, status) ~ arm, data=allIPD)

KM_plot <- survminer::ggsurvplot(
  fit              = KM_fit,
  data             = allIPD,
  risk.table       = TRUE,    
  palette          = c('red', 'blue'),
  legend           = c(0.8, 0.9),
  legend.title     = '',
  legend.labs      = c('Docetaxel', 'Nivolumab'),
  title            = 'PFS',
  ylab             = 'Survival (%)',
  xlab             = 'Time (Month)',
  risk.table.title = 'Number at Risk',
  break.time.by    = 3,
  tables.height    = 0.4,
  risk.table.fontsize = 5
)
print(KM_plot)

In the original paper, the hazard ratio for nivolumab versus docetaxel was 0.62 (\(95\%\) CI, 0.47 to 0.81), indicating that nivolumab performed statistically significantly better than the SOC.

Based on the figure above, it is evident that a delayed treatment effect is present, with the two PFS curves visibly diverging around 2.2 months.

In the following, we assume the results are unknown and provide a step-by-step guide to designing a study with our proposed method while anticipating the delayed treatment effect.

Step 1: Hypothesis Setup Based on Overall Median Survival Times

To design a study, one key component is establishing the null and alternative hypotheses. Here, we assume \(\bar{\mu}_0 = 2.8\) months for the standard of care (SOC) arm and \(\bar{\mu}_1 = 3.5\) months for the experimental arm.

Step 2.1: Prior Specification for Delayed Time \(S\)

Next, we derive the most likely value and prior distribution of the delayed separation timepoint \(S\).

For the prior of \(S\), we seek expert opinion to establish the lower (\(L\)) and upper (\(U\)) bounds. For example, the clinical team may find a range with \(L=2\) months and \(U=2.5\) months for the delayed separation timepoint \(S\) based on literature and their experiences.

If experts have greater confidence in the separation timepoint and can provide specific estimates, as in the code below, we can obtain the estimated prior parameters \(s_1\) and \(s_2\) using the previously introduced optimization algorithm. As a recap, we assume a truncated Gamma prior for \(S\), i.e., \(\text{Gamma}(s_1, s_2)I(L, U)\). In the following code, weights=c(4,4,2,1,1) indicates that the expert-provided mean and median are given equal weight and more emphasis than the other quantities.

expert_data_correct <- list(
  list(mean = 2.2, median = 2.27, sd = NULL, q25 = NULL, q975 = NULL), # expert A
  list(mean = 2.1, median = 2.3, sd = NULL, q25 = NULL, q975 = NULL),  # expert B
  list(mean = 2.3, median = 2.31, sd = NULL, q25 = NULL, q975 = NULL) # expert C
 )
 param<-trunc_gamma_para(L = 2, U = 2.5, expert_data = expert_data_correct,weights = c(4,4,2,1,1) ,num_cores = 4,seed=123) # num_cores specifies the number of cores used for parallel computing.
 print(param)

The estimated \(s_1, s_2\) are 12.86 and 0.19, respectively. Below is the density plot of the truncated Gamma distribution \(\text{Gamma}(12.86, 0.19)I(2, 2.5)\):

plot(density(rtrunc_gamma(1000,L=2,U=2.5,shape=12.86,scale=0.19)),xlab="",main="")

The plot suggests that the prior is close to a uniform distribution over \([2, 2.5]\).

Step 2.2: Determining the Most Likely DTE timepoint \(S\) Using Historical Data

When expert opinions are unavailable, the most likely \(S\) can be estimated by the mean of the truncated Gamma prior or by the midpoint of \([L, U]\), i.e., \((L+U)/2\) (see the short sketch at the end of this subsection). In this specific example, CheckMate 017, since the data are already available, we treat them as historical data. Our goal here is to show how to use these "historical" data to identify the most likely timepoint \(S\), that is, the point at which the Kaplan-Meier (KM) curves begin to diverge.

The following R code identifies this separation timepoint \(S\). We specify several survival probabilities and obtain the corresponding survival times for the two arms; \(S\) is identified as the timepoint at which the survival times begin to differ appreciably.

y_prob <-  c(0.54, 0.53, 0.52,0.51)     #y-axis survival probability to explore

summary_fit <- summary(KM_fit)

arm_times <- list()

for (i in 1:length(KM_fit$strata)) {
  
  time_points <- summary_fit$time[summary_fit$strata == names(KM_fit$strata)[i]]
  survival_probs <- summary_fit$surv[summary_fit$strata == names(KM_fit$strata)[i]]
  
  closest_times <- sapply(y_prob, function(tp) {
    closest_index <- which.min(abs(survival_probs - tp))  # match on the observed survival probabilities
    return(time_points[closest_index])
  })
  

  arm_times[[names(KM_fit$strata)[i]]] <- setNames(closest_times, y_prob)
}

print(arm_times)
#> $`arm=0`
#>    0.54    0.53    0.52    0.51 
#> 0.43813 0.68154 0.92495 1.02231 
#> 
#> $`arm=1`
#>    0.54    0.53    0.52    0.51 
#> 0.27692 0.38769 0.53538 0.77538

Based on the above results, when the survival probability on the y-axis reaches 0.53, the survival times of the two arms show a clear difference; the corresponding value on the x-axis is 2.28, which is the separation timepoint. We therefore select \(S=2.28\) as the most likely delayed timepoint. If you have a historical dataset, the above procedure offers a simpler alternative to the traditional changepoint methods described in the literature.
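
When no historical data are available either, the two simpler choices mentioned at the start of this subsection can be computed directly. A quick sketch using the elicited prior \(\text{Gamma}(12.86, 0.19)I(2, 2.5)\) from Step 2.1:

library(truncdist)
# Mean of the truncated Gamma prior, by numerical integration:
integrate(function(x) x * dtrunc(x, "gamma", a = 2, b = 2.5,
                                 shape = 12.86, scale = 0.19), 2, 2.5)$value
# Midpoint of [L, U]:
(2 + 2.5) / 2

Either value falls near the 2.28 identified from the reconstructed curves above.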

Next, we assemble the design inputs: the hypothesized median survival times of 3.5 months for the experimental arm and 2.8 months for the SOC arm; the best-guessed delayed separation timepoint of 2.28 months within the range \(L = 2\) to \(U = 2.5\) months; an accrual rate of 6 patients per month; a minimum follow-up of 6 months for the last enrolled patient; a type I error controlled at 10%; and a type II error controlled at 15% (i.e., 85% power). If the study is designed to have two stages, the estimated sample sizes for the interim and final stages and the optimal parameters \(\lambda, \gamma\) can be computed with the function Two_stage_sample_size(.) as follows.

median.1=2.8              #\bar{mu}_0
median.2=3.5             #\bar{mu}_1
L=2                      # Lower bound of S
U=2.5                    # Upper bound of S
S_likely=2.28                   # The most likely delayed time
trunc.para=c(12.86,0.19) # prior parameters of S
rate=6                   # accrual rate; arrival times follow a Poisson process with this rate
FUP=6                    # follow-up time of the last enrolled patient
err1 = 0.1               # type I error 
err2=0.15                # type II error

Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, weight=0.5,earlystop_prob =NULL,seed=123) 
  #calculate the sample size for two-stage design
#> [1] 28 40

The interim and final sample sizes are 28 and 40 per arm, respectively. The optimal parameters, \(\lambda\) and \(\gamma\), are 0.95 and 1, respectively, with an average type I error rate of 0.0871 and power of 0.8667.

Note: In the code above, weight\(=0.5\) indicates that equal importance is given to the probability of early stopping under the null hypothesis (\(H_0\)) and under the alternative hypothesis (\(H_1\)). Meanwhile, if we are interested in the corresponding numbers of events at the two stages, we can use the function event_fun(.):

event_fun(median.1 = 2.8,median.2 = 3.5,S_likely = 2.28,n.interim = c(28,40),rate = 6,FUP = 6,n.sim = 1000,seed=123)
#>             Events
#> 1st interim     23
#> Final stage     68

We anticipate observing approximately 23 events at the interim analysis stage and a total of 68 events by the final analysis.

With the optimal parameters and sample size, we can evaluate the operating characteristics, including the probability of early stopping, the rejection rate (the type I error under \(H_0\) and the power under \(H_1\)), the average number of patients, and the trial duration, for different true unknown separation timepoints \(S\). For the sensitivity analyses (SA) below, we assume that the most likely separation timepoint, S_likely=2.28 months, remains unchanged. Under this condition, the median survival times before and after the separation timepoint for the control (SOC) and treatment (experimental) groups, \(\tilde{\mu}_0=2.8\) and \(\tilde{\mu}_1=6.57\) respectively, also remain unchanged, because these quantities depend solely on the assumed most likely separation timepoint (here, S_likely=2.28 months).

Impact of True S Variation Under a Fixed Assumed S_likely

We now consider two true unknown values for \(S\), 2 and 2.5 months, to examine their effects on the design. Since the overall median survival time \(\bar{\mu}_1\) depends on the true but unknown separation timepoint \(S\), its value will adjust accordingly.

Sensitivity Analysis 1.1: If true unknown DTE separation timepoint \(S = 2\)

In the following code, we set \(L=U=2\). In real trial design practice, of course, we would also set S_likely=2. Nevertheless, in the current setting, S_likely is kept at 2.28 as before; the goal here is only to check the robustness of our proposed method.

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2957
#> 
#> $reject
#> [1] 0.0841
#> 
#> $average.patients
#> [1] 36.4516
#> 
#> $trial.duration
#> [1] 10.33529
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0711
#> 
#> $reject
#> [1] 0.8804
#> 
#> $average.patients
#> [1] 39.1468
#> 
#> $trial.duration
#> [1] 12.09543

Sensitivity Analysis 1.2: If true unknown DTE separation timepoint: \(S=2.5\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2379
#> 
#> $reject
#> [1] 0.0873
#> 
#> $average.patients
#> [1] 37.1452
#> 
#> $trial.duration
#> [1] 10.79129
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0763
#> 
#> $reject
#> [1] 0.8549
#> 
#> $average.patients
#> [1] 39.0844
#> 
#> $trial.duration
#> [1] 12.05357

From the above results, we observe that when S_likely = 2.28 remains unchanged, the type I error rate is still controlled below 0.1, and the power remains above 0.85 across different true DTE separation timepoints. Additionally, when \(S=2.5\), the power is lower than when \(S=2\), given that our best-estimated separation timepoint is S_likely = 2.28. This is expected: when \(S=2.5\) and S_likely = 2.28, the unseparated portion of the two curves is included in the comparison, leading to a loss of power. Conversely, if the true separation occurs earlier than S_likely, the power increases.

Sensitivity Analysis: Varying the Assumed Separation Timepoint S_likely

Next, we examine scenarios where the best-guessed separation timepoint is S_likely\(=2\) or S_likely\(=2.5\), to evaluate the robustness of our proposed method, while keeping the overall median survival times at \(\bar{\mu}_0=2.8\) and \(\bar{\mu}_1=3.5\). In these scenarios, the pre- and post-separation median survival times, \(\tilde{\mu}_0\) and \(\tilde{\mu}_1\), change with S_likely accordingly. We report the average type I error and power to assess the performance of our method under these scenarios.

Specifically, we examine two extreme scenarios to assess the robustness of our method. The optimal design parameters were initially determined assuming the separation time point falls within the range of \(L = 2\) months to \(U = 2.5\) months.

If the operating characteristics (OCs) are still satisfied in both cases, it would demonstrate the robustness of our proposed method.

Sensitivity Analysis 1.3: If best-guessed DTE separation timepoint: S_likely \(= 2\)

##When H0 holds, the reject rate is the average type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2729
#> 
#> $reject
#> [1] 0.0871
#> 
#> $average.patients
#> [1] 36.7252
#> 
#> $trial.duration
#> [1] 10.52729
##When H1 holds, the reject rate is the average power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.102
#> 
#> $reject
#> [1] 0.7226
#> 
#> $average.patients
#> [1] 38.776
#> 
#> $trial.duration
#> [1] 11.8653

Sensitivity Analysis 1.4: If best-guessed DTE separation timepoint: S_likely \(= 2.5\)

##When H0 holds, the reject rate is the average type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2729
#> 
#> $reject
#> [1] 0.0871
#> 
#> $average.patients
#> [1] 36.7252
#> 
#> $trial.duration
#> [1] 10.52729
##When H1 holds, the reject rate is the average power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0424
#> 
#> $reject
#> [1] 0.95
#> 
#> $average.patients
#> [1] 39.4912
#> 
#> $trial.duration
#> [1] 12.33141

From these results, we can see that the average type I errors remain controlled below 0.1. However, when S_likely\(=2\), the average power falls below 0.85. This is expected because the sample size was originally determined based on S_likely\(=2.28\), with corresponding \(\tilde{\mu}_0=2.8, \tilde{\mu}_1=6.57\), whereas when S_likely\(=2\), these values change to \(\tilde{\mu}_0=2.8, \tilde{\mu}_1=5.25\). That is, if the best-guessed separation timepoint shifts earlier toward the beginning of the trial, the assumed treatment effect is diluted.

Conversely, when the best-guessed separation timepoint S_likely is larger than 2.28, as in the case of S_likely\(=2.5\) (\(\tilde{\mu}_0=2.8, \tilde{\mu}_1=9.33\)), the average power remains above 0.85, ensuring adequate performance.

As a further exploration, we now set S_likely\(=2\) to determine the required sample size. The results are as follows:

Example 2: Impact of Setting S_likely\(=2\)

median.1=2.8              #\bar{mu}_0
median.2=3.5             #\bar{mu}_1
L=2                      # Lower bound of S
U=2.5                    # Upper bound of S
S_likely=2                 # The most likely delayed time
trunc.para=c(12.86,0.19)        # prior parameters of S
rate=6                   # accrual rate; arrival times follow a Poisson process with this rate
FUP=6                    # follow-up time of the last enrolled patient
err1 = 0.1               # type I error 
err2=0.15                # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, weight=0.5,earlystop_prob =NULL,seed=123)   #calculate the sample size for two-stage design
#> [1] 44 63

The interim and final sample sizes are 44 and 63 per arm, respectively. The optimal parameters in \(P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S)> 1-\lambda(n/N)^{\gamma}\) are \(\lambda=0.95,\gamma=1\) with average type I error 0.0813 and power 0.8786. For the number of events for two stages, we have:

event_fun(median.1 = 2.8,median.2 = 3.5,S_likely = 2,n.interim = c(44,63),rate = 6,FUP = 6,n.sim = 1000,seed=123)
#>             Events
#> 1st interim     47
#> Final stage    112

That is, we expect approximately 47 events at the interim stage and 112 events in total at the final stage.

Taking a further step, with the optimal parameters \(\lambda=0.95, \gamma=1\), we consider four scenarios by varying both S_likely and \(S\).

Sensitivity Analysis 2.1: \(S=2\), S_likely\(=2\)

##When H0 holds, the reject rate is the average type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.4087
#> 
#> $reject
#> [1] 0.078
#> 
#> $average.patients
#> [1] 55.2347
#> 
#> $trial.duration
#> [1] 12.80447
##When H1 holds, the reject rate is the average power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0544
#> 
#> $reject
#> [1] 0.8943
#> 
#> $average.patients
#> [1] 61.9664
#> 
#> $trial.duration
#> [1] 16.00119

Sensitivity Analysis 2.2: \(S=2.5\), S_likely\(=2\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.3778
#> 
#> $reject
#> [1] 0.0807
#> 
#> $average.patients
#> [1] 55.8218
#> 
#> $trial.duration
#> [1] 13.08364
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0651
#> 
#> $reject
#> [1] 0.8687
#> 
#> $average.patients
#> [1] 61.7631
#> 
#> $trial.duration
#> [1] 15.90049

Sensitivity Analysis 2.3: \(S=2\), S_likely\(=2.5\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.4087
#> 
#> $reject
#> [1] 0.078
#> 
#> $average.patients
#> [1] 55.2347
#> 
#> $trial.duration
#> [1] 12.80447
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0045
#> 
#> $reject
#> [1] 0.9955
#> 
#> $average.patients
#> [1] 62.9145
#> 
#> $trial.duration
#> [1] 16.44764

Sensitivity Analysis 2.4: \(S=2.5\), S_likely\(=2.5\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.3778
#> 
#> $reject
#> [1] 0.0807
#> 
#> $average.patients
#> [1] 55.8218
#> 
#> $trial.duration
#> [1] 13.08364
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0066
#> 
#> $reject
#> [1] 0.9932
#> 
#> $average.patients
#> [1] 62.8746
#> 
#> $trial.duration
#> [1] 16.4295

Summary of the above results

Across the four scenarios above, the average type I error remains controlled below 0.1 (0.078 to 0.081), and the average power stays above 0.85 in all cases, rising above 0.99 when S_likely\(=2.5\).

Comparison with Log-Rank Test Without DTE Consideration

Compared with the log-rank test, which does not incorporate information about \(S\), and given the event rate \(P=0.85\), the total sample size under the log-rank test is:

\[ \frac{4(z_{0.95} + z_{0.85})^2}{P\left(\log\left(\frac{2.8}{3.5}\right)\right)^2} = 680 \]

Without incorporating information about \(S\), the required sample size is 340 per arm. In contrast, if we employ the post-separation median survival time of 6.57 months with S_likely\(=2.28\), the total number of patients required under the log-rank test is only \[ \frac{4(z_{0.95} + z_{0.85})^2}{P\left(\log\left(\frac{2.8}{6.57}\right)\right)^2} = 48 \]
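
These two totals can be reproduced directly; the helper logrank_n below is our own shorthand for the displayed formula:

logrank_n <- function(m0, m1, P = 0.85) {
  # z_{0.95} and z_{0.85} as in the displayed formula; P is the event rate
  4 * (qnorm(0.95) + qnorm(0.85))^2 / (P * log(m0 / m1)^2)
}
logrank_n(2.8, 3.5)   # about 679.5, i.e., 680 in total when S is ignored
logrank_n(2.8, 6.57)  # about 46.5, i.e., 48 in total (24 per arm) using the post-separation median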

In other words, if we assume that the true separation time is \(S = 0\), then the required sample size is at least 24 subjects per arm. On the other hand, under scenarios where the separation time is unknown, the sample size could increase up to 340 per arm. Therefore, the sample size required by our proposed method is expected to lie between 24 and 340, depending on the actual separation time. This observation highlights the value of incorporating prior information about the separation time into the design. By leveraging this information, our method can substantially reduce the required sample size, thereby improving efficiency and lowering resource demands in clinical trial planning.

Our method significantly reduces the required sample size by leveraging information about the separation time \(S\). The key reason for this reduction is the use of the ratio of \(\tilde{\mu}_1\) to \(\tilde{\mu}_0\), rather than \(\bar{\mu}_1\) to \(\bar{\mu}_0\), for sample size determination. The former is the more appropriate choice when a delayed treatment effect (DTE) is anticipated. This approach allows for a more efficient design, achieving meaningful sample size savings while maintaining statistical power.

Example 3: Applying conduct(·) for Go/No-Go Decisions with Mock Data

We now generate two mock datasets and use the conduct(·) function to guide Go/No-Go decision-making at the interim and final stages based on observed data.

Example 3.1: Mock Data Scenario Assuming \(H_0\) to Be True

The first dataset is generated under the null hypothesis \(H_0\).

set.seed(126)
n.interim = c(44,63)
nmax=63
FUP=6
rate=6
L=2
U=2.5
median.1=2.8
median.2=3.5
lambda_1=log(2)/2.8  #hazard rate before S
lambda_2=log(2)/2.8  #hazard rate after S
nobs = n.interim+1
nobs[length(nobs)] = 63
wait.t = rexp(nmax,rate = rate)
arrival.t = cumsum(wait.t)
tobs = arrival.t[nobs]
tobs[length(tobs)] = tobs[length(tobs)] + FUP
S_likely=2.1  
event.t.E = generate_pe(nmax,t=S_likely,lambda1=lambda_1,lambda2=lambda_1)
event.t.C=rexp(nmax,rate=lambda_1)


####first interim
k = 1
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0)   ## event indicator for the treatment arm, 1: event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E  = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
  t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
  t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E) ## treatment dataset
data.C=data.frame(Time=t.event.C,Status=Ind_event_C) ## control dataset
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#>         Time Status
#> 1 1.48614261      1
#> 2 6.41029766      1
#> 3 0.18085940      1
#> 4 0.08234503      1
#> 5 2.26719651      1
#> 6 0.50314936      1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#>        Time Status
#> 1 4.6947190      1
#> 2 6.6259936      0
#> 3 0.2985375      1
#> 4 0.7839282      1
#> 5 2.0822781      1
#> 6 2.7224489      1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=63,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Go"

The result is Go. Assuming that we now have all the data, we move on to the final analysis.

k = 2

Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0)   ## event indicator for the treatment arm, 1: event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E  = rep(0,n.interim[k])


t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
  t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
  t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])

}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E)
data.C=data.frame(Time=t.event.C,Status=Ind_event_C)
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#>         Time Status
#> 1 1.48614261      1
#> 2 6.41029766      1
#> 3 0.18085940      1
#> 4 0.08234503      1
#> 5 2.26719651      1
#> 6 0.50314936      1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#>         Time Status
#> 1  4.6947190      1
#> 2 15.7118917      0
#> 3  0.2985375      1
#> 4  0.7839282      1
#> 5  2.0822781      1
#> 6  2.7224489      1

conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=nmax,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Not Reject Null hypothesis"

The result is Not Reject Null Hypothesis, which meets our expectation.

Example 3.2: Mock Data Scenario Assuming \(H_1\) to Be True

The second mock dataset is generated under the alternative hypothesis \(H_1\).

set.seed(126)
n.interim = c(44,63)
nmax=63
FUP=6
rate=6
L=2
U=2.5
median.1=2.8         
median.2=3.5         
lambda_1=log(2)/2.8  #hazard rate before S
lambda_2=log(2)/6.6  #hazard rate after S
nobs = n.interim+1
nobs[length(nobs)] = 63
wait.t = rexp(nmax,rate = rate)
arrival.t = cumsum(wait.t)
tobs = arrival.t[nobs]
tobs[length(tobs)] = tobs[length(tobs)] + FUP
S_likely=2.1  
event.t.E = generate_pe(nmax,t=S_likely,lambda1=lambda_1,lambda2=lambda_2)
event.t.C=rexp(nmax,rate=lambda_1)


####first interim
k = 1
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0)   ## event indicator for the treatment arm, 1: event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E  = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
  t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
  t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E) ## treatment dataset
data.C=data.frame(Time=t.event.C,Status=Ind_event_C) ## control dataset
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#>         Time Status
#> 1 1.48614261      1
#> 2 6.62599365      0
#> 3 0.18085940      1
#> 4 0.08234503      1
#> 5 2.49410605      1
#> 6 0.50314936      1

print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#>        Time Status
#> 1 4.6947190      1
#> 2 6.6259936      0
#> 3 0.2985375      1
#> 4 0.7839282      1
#> 5 2.0822781      1
#> 6 2.7224489      1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=63,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Go"

The result is Go. Assuming we have the full dataset now, we can conduct the final analysis.

k = 2

Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0)   ## event indicator for the treatment arm, 1: event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E  = rep(0,n.interim[k])


t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
  t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
  t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])

}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E)
data.C=data.frame(Time=t.event.C,Status=Ind_event_C)
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#>          Time Status
#> 1  1.48614261      1
#> 2 12.25998734      1
#> 3  0.18085940      1
#> 4  0.08234503      1
#> 5  2.49410605      1
#> 6  0.50314936      1

print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#>         Time Status
#> 1  4.6947190      1
#> 2 15.7118917      0
#> 3  0.2985375      1
#> 4  0.7839282      1
#> 5  2.0822781      1
#> 6  2.7224489      1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=nmax,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Reject Null hypothesis"

The result is Reject Null hypothesis, which meets our expectations.

8. Incorporating Early Stopping Probability into Sample Size Determination

Now suppose we require the probability of early stopping to be no less than 0.4. We calculate the interim and final sample sizes using the function Two_stage_sample_size(.) with earlystop_prob =0.4, under two scenarios: S_likely=2.28 and S_likely=2. In both cases, the resulting sample sizes ensure that the probability of early stopping (PS) is at least 0.4 when \(S\leq U\).

Example 8.1: S_likely\(=2.28\)

median.1=2.8              #\bar{mu}_0
median.2=3.5             #\bar{mu}_1
L=2                      # Lower bound of S
U=2.5                    # Upper bound of S
S_likely=2.28                 # The most likely delayed time
trunc.para=c(12.86,0.19) #  prior of S
rate=6                   # accrual rate; arrival times follow a Poisson process with this rate
FUP=6                    # Follow-up time of the last patient
err1 = 0.1               # type I error 
err2=0.15                # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, earlystop_prob =0.4,seed=123) 
#> [1] 39 52

With these interim and final sample sizes, we can empirically check the early stopping rate under \(H_0\) when \(S=2\) and \(S=2.5\), as follows.

##S=2
n.interim=c(39,52)
getoc_2arm_piecewise(median.true=c(median.1,median.1),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = n.interim,S_likely=2.28,L=2,U=2,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)$earlystop
#> [1] 0.448
##S=2.5
getoc_2arm_piecewise(median.true=c(median.1,median.1),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = n.interim,S_likely=2.28,L=2.5,U=2.5,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)$earlystop
#> [1] 0.4156

The results indicate that, in order to ensure an early stopping probability of at least 0.4 under \(H_0\), the final sample size must increase from 40 to 52. This aligns with expectations: the original probability of early stopping under \(H_0\) was approximately 0.24 when \(S=2.5\), so achieving an increase of roughly 16 percentage points in the probability of early stopping requires a larger sample size.

Example 8.2: S_likely\(=2\)

median.1=2.8              #\bar{mu}_0
median.2=3.5             #\bar{mu}_1
L=2                      # Lower bound of S
U=2.5                    # Upper bound of S
S_likely=2                # The most likely delayed time
trunc.para=c(12.86,0.19) #  prior of S
rate=6                   # accrual rate; arrival times follow a Poisson process with this rate
FUP=6                    # Follow-up time of the last patient
err1 = 0.1               # type I error 
err2=0.15                # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, earlystop_prob =0.4,seed=123) 
#> [1] 46 65

The results indicate that, to ensure a probability of early stopping of at least 0.4 in this case, the final sample size must also increase, from 63 to 65. This aligns with our expectations, as the initial early stopping probability was approximately 0.37 when \(S=2.5\); a modest increase in sample size yields an improvement of roughly 4 percentage points in the probability of early stopping.

9. Application of the Proposed Method in a Three-Stage Design

Previously, we considered only a two-stage design. Here, we demonstrate how our method can be applied in a multi-stage setting using a three-stage design. Specifically, we set the interim sample sizes to n.interim\(= c(13, 28, 40)\) and first obtain the optimal values of \(\lambda\) and \(\gamma\) below.

n.interim=c(13,28,40)
get.optimal_2arm_piecewise(median.1=median.1,median.2=median.2, L=L,U=U,S_likely=2.28,Uniform=FALSE,trunc.para=c(1,1), err1=0.1, n.interim=n.interim, rate=rate, FUP=FUP,track=TRUE,nsim = 10000,control = T,control.point = c(L,U))  #Get the optimal lambda and gamma
#> [1] 0.9500 1.0000 0.0875 0.8678

The optimal values of \(\lambda\) and \(\gamma\) are 0.95 and 1, respectively. Next, we evaluate the operating characteristics when \(S=2\) and \(S=2.5\) like before.

Example 9.1: \(S = 2\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2962
#> 
#> $reject
#> [1] 0.0841
#> 
#> $average.patients
#> [1] 36.3901
#> 
#> $trial.duration
#> [1] 10.32178
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0722
#> 
#> $reject
#> [1] 0.8793
#> 
#> $average.patients
#> [1] 39.1036
#> 
#> $trial.duration
#> [1] 12.08181

Example 9.2: \(S = 2.5\)

##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2385
#> 
#> $reject
#> [1] 0.0873
#> 
#> $average.patients
#> [1] 37.114
#> 
#> $trial.duration
#> [1] 10.78268
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0773
#> 
#> $reject
#> [1] 0.854
#> 
#> $average.patients
#> [1] 39.0514
#> 
#> $trial.duration
#> [1] 12.04232

Compared to the two-stage design, the three-stage design exhibits similar operating characteristics, with only a slight increase in probability of early stopping under \(H_0\). This suggests that incorporating an additional interim analysis enables earlier stopping without substantially altering the overall statistical properties of the design.

10. When an expert-elicited prior with high certainty is used

In this subsection, we conduct a further sensitivity analysis under two prior specifications for \(S\): an expert-elicited prior with high certainty (Example 10.1) and the default vague prior (Example 10.2).

Example 10.1: Expert-Elicited Prior for \(S\)

Suppose we have three experts providing the following prior information:

expert_data_correct <- list(
  list(mean = 2.28, median = NULL, sd = NULL, q25 = NULL, q975 = NULL), # Expert A
  list(mean = NULL, median = 2.4, sd = NULL, q25 = NULL, q975 = NULL), # Expert B
  list(mean = 2.4, median = 2.3, sd = NULL, q25 = NULL, q975 = NULL)   # Expert C
)

param_1 <- trunc_gamma_para(
  L = 2, 
  U = 2.5, 
  expert_data = expert_data_correct, 
  weights = c(4,4,2,1,1), 
  num_cores = 4,seed=123  # Number of cores for parallel computing
)

print(param_1)

The estimated \(s_1\), \(s_2\) are 19.49 and 0.27, and the density plot of the prior of \(S\) is shown below:

plot(density(rtrunc_gamma(1000,L=2,U=2.5,shape=19.49,scale=0.27)),xlab="",main="")

Compared with the prior we used before, the new prior is more concentrated around 2.42. Now, we set S_likely\(=2.28\), and the optimal parameters \(\lambda, \gamma\) are obtained as follows:

n.interim = c(28,40)
S_likely=2.28
trunc.para=param_1
get.optimal_2arm_piecewise(median.1=median.1,median.2=median.2, L=L,U=U,S_likely=S_likely,Uniform=FALSE,trunc.para=trunc.para, err1=err1, n.interim=n.interim, rate=rate, FUP=FUP,track=TRUE,nsim = 10000,control = T,control.point = c(L,U))  #Get the optimal lambda and gamma
#> [1] 0.9500 1.0000 0.0870 0.8628

From the results, the optimal values of \(\lambda\) and \(\gamma\) remain the same as those obtained under the default prior, with slightly different average type I error and power. Consequently, the operating characteristics, including type I error and power, are nearly identical to those under the default prior. We give one example to illustrate this, supposing S_likely=2.28 and \(S=2.5\):

median.1=2.8              #\bar{mu}_0
median.2=3.5             #\bar{mu}_1
L=2                      # Lower bound of S
U=2.5                    # Upper bound of S
S_likely=2.28            # The most likely delayed time
trunc.para=c(19.49,0.27) #  prior of S
rate=6                   # accrual rate; arrival times follow a Poisson process with this rate
FUP=6                    # Follow-up time of the last patient
err1 = 0.1               # type I error 
err2=0.15 
##When H0 holds, the reject rate is the type I error. 
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=S_likely,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2379
#> 
#> $reject
#> [1] 0.0873
#> 
#> $average.patients
#> [1] 37.1452
#> 
#> $trial.duration
#> [1] 10.79129
##When H1 holds, the reject rate is the power. 
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=S_likely,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0763
#> 
#> $reject
#> [1] 0.8549
#> 
#> $average.patients
#> [1] 39.0844
#> 
#> $trial.duration
#> [1] 12.05357

This suggests that the operating characteristics are stable with respect to the expert-elicited prior.

Example 10.2: Default Prior for \(S\)

If sufficient information is unavailable to determine the prior parameters \(s_1\) and \(s_2\) of \(S\), we recommend using the default prior with \(s_1 = 1\) and \(s_2 = 1\). As shown below, the resulting prior distribution is approximately uniform over \([L, U]\), indicating a vague prior for \(S\).

plot(density(rtrunc_gamma(1000,L=2,U=2.5,shape=1,scale=1)),xlab="",main="")

In addition, the optimal parameters, as shown below, are identical to those obtained using the expert-elicited prior and yield comparable average type I error and power.

#> [1] 0.9500 1.0000 0.0876 0.8694

Therefore, if no expert data are available for \(S\), we may use the default prior.

Conclusion

This work presents a flexible and efficient framework for designing two-arm multi-stage survival trials with delayed treatment effects (DTE). By incorporating prior knowledge about the DTE separation timepoint \(S\), our method enables substantial reductions in the required sample size while preserving type I error control and achieving the desired power. The use of the ratio \(\tilde{\mu}_1/\tilde{\mu}_0\), instead of \(\bar{\mu}_1/\bar{\mu}_0\), in sample size determination offers a more appropriate and robust basis for design when a DTE is anticipated.

Through comparison with the traditional log-rank test, we demonstrated that our method can achieve the same operating characteristics with far fewer events and patients. Sensitivity analyses further confirmed the robustness of our approach under varying prior specifications and design parameters. Additionally, we illustrated how to apply the method to three-stage designs given stage-wise sample sizes, and how to conduct the trial, e.g., making Go/No-Go decisions.

Overall, this approach offers a principled, data-informed, and computationally feasible solution for designing efficient survival trials under delayed treatment effects, and serves as a valuable tool for researchers and practitioners in clinical trial planning.

References

Brahmer, Julie, Karen L. Reckamp, Paul Baas, Lucio Crinò, Wilfried E. E. Eberhardt, Elena Poddubskaya, Scott Antonia, and David R. Spigel. 2015. “Nivolumab Versus Docetaxel in Advanced Squamous-Cell Non–Small-Cell Lung Cancer.” The New England Journal of Medicine 373: 123–35.
Cai, Zhongheng, Tomi Mori, and Haitao Pan. 2025. “Bayesian Optimal Phase II Randomized Clinical Trial Design for Immunotherapy with Delayed Outcomes.” Statistics in Biopharmaceutical Research (Submitted).
Zhou, Heng, Cong Chen, Linda Sun, and Ying Yuan. 2020. "Bayesian Optimal Phase II Clinical Trial Design with Time-to-Event Endpoint." Pharmaceutical Statistics 19: 776–86.