We have developed a novel design method to extend the Bayesian Optimal Phase II (BOP2) design (Zhou et al. 2020) to address delayed treatment effects in a two-arm RCT setting. Below is the corresponding R package, DTEBOP2, along with its user manual.
The DTEBOP2 package serves multiple purposes. Clinical researchers can leverage the functions within this package when designing and conducting clinical trials of immunotherapies or molecularly targeted therapies for cancer, where a delayed treatment effect (DTE) is anticipated. The following steps show how to use the proposed method when designing a study in which a DTE is anticipated.
Step 1: Overall Median Survival Times
Provide the overall median survival times for the standard of care (SOC) and experimental arms, denoted \(\bar{\mu}_j, j=0,1\) (0 for SOC and 1 for the experimental arm). These values correspond to the null and alternative hypotheses in a conventional setting, for example, 2.8 months vs. 3.5 months.
Step 2: Prior Distribution for the Delayed Treatment Effect (DTE) Timepoint \(S\) and the Most Likely DTE Timepoint \(S\)
Specify the prior distribution for the delayed separation timepoint \(S\) as a truncated Gamma distribution, i.e., \[ S \sim \text{Gamma}(s_1, s_2)I(L, U) \] and provide a best guess of the most likely \(S\).
Step 3: Sample Size Calculation and Optimal Parameter Determination for a Two-Stage Design
For a two-stage design, the function Two_stage_sample_size(.) provides the calculated sample sizes for the interim and final analyses. The function also returns the optimal parameters \(\lambda\) and \(\gamma\) such that \[ P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S) > 1 - \lambda\left(\frac{n}{N}\right)^{\gamma}. \] If your study plans to have more than two stages, skip Step 3 and go directly to Step 4. Note: (\(\tilde{\mu}_0,\tilde{\mu}_1\)) denote the median survival times before and after the separation timepoint \(S\); a more detailed explanation is provided later.
Step 4: Optimal Parameter Determination
If your study plans to have more than two stages then, similar to the BOP2 design, given user-provided sample sizes at the interim and final stages, we can compute the optimal parameters \(\lambda\) and \(\gamma\) satisfying \[ P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S) > 1 - \lambda\left(\frac{n}{N}\right)^{\gamma} \] with the R function get.optimal_2arm_piecewise(.).
Step 5: Conduct Interim and Final Analyses
Once the data are available, we can use the function conduct(.) to perform interim and/or final analyses and make Go/No-Go decisions based on the optimal parameters \(\lambda\) and \(\gamma\) obtained in Step 3 or 4.
Below, we provide a step-by-step explanation of the procedures outlined above.
In Step 2 above, we require the trialist to specify the prior distribution for the DTE timepoint \(S\) as a truncated Gamma distribution, i.e., \[ S \sim \text{Gamma}(s_1, s_2)I(L, U) \] along with a best-guessed most likely DTE timepoint \(S\).
To specify \(L\) and \(U\), we can ask domain experts:
“What would be reasonable estimates/guesses for the lower and upper bounds of the delayed treatment effect timepoint S in population-level survival curves?”
This part is relatively straightforward, especially if historical data or studies are available for reference.
Meanwhile, to obtain a best-guess estimate of the most likely DTE timepoint \(S\), a straightforward method is to use the midpoint of the lower and upper bounds \(L\) and \(U\).
Next, we introduce the specification of \(s_1\) and \(s_2\) for the Gamma distribution. Although default values of \(s_1=1\) and \(s_2=1\) are used in this package, more accurate estimates can be obtained by asking domain experts the following question:
“From your experience, can you provide at least one of the following statistics for \(S\)?”
If answers to any of the above are available, we can use the following proposed method to estimate \(s_1\) and \(s_2\) by minimizing the weighted least squares (WLS) objective:
\[
\begin{aligned}
&w_1\sum_{i=1}^n\left(\text{mean}(S)-\text{mean from Expert } i\right)^2
+ w_2\sum_{i=1}^n\left(\text{median}(S)-\text{median from Expert } i\right)^2 \\
&\quad + w_3\sum_{i=1}^n\left(\text{sd}(S)-\text{sd from Expert } i\right)^2
+ w_4\sum_{i=1}^n\left(\text{quantile}(S,0.025)-\text{0.025 quantile from Expert } i\right)^2 \\
&\quad + w_5\sum_{i=1}^n\left(\text{quantile}(S,0.975)-\text{0.975 quantile from Expert } i\right)^2
\end{aligned}
\]
where \(S \sim \text{Gamma}(s_1, s_2)I(L, U)\) and \(w_i\) (\(i=1,\dots,5\)) represents the relative importance of each statistic.
Since not all experts may provide every statistic, we can still proceed by using, for example, just one available statistic to estimate the optimal parameters.
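To make the criterion concrete, below is a minimal sketch (ours, not part of DTEBOP2) that fits \(s_1\) and \(s_2\) using only the mean and median terms of the WLS objective, with the truncated-Gamma mean obtained by numerical integration. It assumes the shape-scale parameterization of the Gamma distribution, and the expert summaries and optimization bounds are purely illustrative.
library(truncdist)
L <- 2; U <- 2.5
expert_mean <- c(2.2, 2.1, 2.3)       # illustrative expert-provided means
expert_median <- c(2.27, 2.30, 2.31)  # illustrative expert-provided medians
trunc_mean <- function(s1, s2) {      # mean of Gamma(s1, s2)I(L, U) by numerical integration
  integrate(function(x) x * dtrunc(x, "gamma", a = L, b = U, shape = s1, scale = s2), L, U)$value
}
trunc_median <- function(s1, s2) qtrunc(0.5, "gamma", a = L, b = U, shape = s1, scale = s2)
wls <- function(par) { # mean and median terms of the WLS objective, both with weight 4
  4 * sum((trunc_mean(par[1], par[2]) - expert_mean)^2) +
    4 * sum((trunc_median(par[1], par[2]) - expert_median)^2)
}
optim(c(10, 0.25), wls, method = "L-BFGS-B", lower = c(1, 0.1), upper = c(30, 1))$par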
The function trunc_gamma_para(.) can help us implement this WLS optimization and estimate \(s_1\) and \(s_2\).
The lower and upper bounds \(L\) and \(U\), along with the estimated Gamma prior parameters, give a complete picture of the DTE timepoint prior. Biostatisticians may visualize the final prior to facilitate discussion and decision-making.
From the above, we have the most likely DTE timepoint \(S\), and from the study's proposed null and alternative hypotheses we have the overall median survival times for the SOC (\(\bar{\mu}_0\)) and experimental (\(\bar{\mu}_1\)) arms. We can then back-calculate the median survival times before and after the DTE separation timepoint, denoted (\(\tilde{\mu}_0, \tilde{\mu}_1\)), using the following formula:
\[ \begin{aligned} \bar{\mu}_0 &= \tilde{\mu}_0 \\ \bar{\mu}_1 &= \begin{cases} \tilde{\mu}_0, & \text{if } \tilde{\mu}_0 < S, \\ (1 - \frac{\tilde{\mu}_1}{\tilde{\mu}_0}) S + \tilde{\mu}_1, & \text{if } \tilde{\mu}_0 \geq S. \end{cases} \end{aligned} \]
The median survival times before and after the DTE separation timepoint are used internally by our proposed algorithm to obtain optimal estimates of the design parameters \(\lambda\) and \(\gamma\), which are crucial for controlling the type I error rate.
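To make the back-calculation concrete, the following is a minimal sketch (a hypothetical helper, not part of DTEBOP2, which performs this step internally) that inverts the formula above for \(\tilde{\mu}_1\):
# Recover the post-separation median from the overall medians and an assumed
# separation timepoint S, by inverting bar_mu1 = (1 - mu1/mu0) * S + mu1.
# Only meaningful when S <= mu0 (otherwise the curves separate after the median).
post_sep_median <- function(mu0, mu1_bar, S) {
  stopifnot(S <= mu0)
  (mu1_bar - S) / (1 - S / mu0)
}
post_sep_median(2.8, 3.5, 2.28) # ~6.57, the value used later in this vignette
post_sep_median(2.8, 3.5, 2)    # ~5.25
post_sep_median(2.8, 3.5, 2.5)  # ~9.33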
The sample size for a two-stage design is determined by minimizing the following weighted expected sample size function, while ensuring that the power is at least \(1 -\) (Type II error) when \(S \leq U\). To balance two competing objectives, minimizing patient exposure to futile treatments and maximizing access to effective treatments, we define the function as follows: \[ EN = w \frac{PS_{H_0}\times n_1 + (1 - PS_{H_0})(n - n_1)}{n} + (1 - w) \left(1 - \frac{PS_{H_1} \times n_1 + (1 - PS_{H_1})(n - n_1)}{n} \right) \]
Here, \(w\in[0,1]\) is a tuning parameter representing the trade-off between two goals:
The first term reflects the expected sample size (as a fraction of \(n\)) under the null hypothesis. Minimizing it emphasizes early stopping when the treatment is futile, thereby reducing unnecessary patient exposure to ineffective therapies.
The second term shifts the focus to maximizing the number of patients who can access a truly effective treatment: by minimizing one minus the expected sample-size fraction under \(H_1\), we encourage continuation of the trial when the treatment is promising, avoiding premature termination that could deny access to beneficial therapy. By varying \(w\), the design allows flexible prioritization between safety (minimizing exposure to ineffective treatment) and efficacy (maximizing access to effective treatment), while ensuring adequate statistical power even when the true signal is weak.
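As a concrete illustration, the sketch below (ours; Two_stage_sample_size(.) carries out this search internally) evaluates the weighted objective for hypothetical stopping probabilities.
# Weighted objective EN above; PS_H0 and PS_H1 are hypothetical placeholders here.
EN_objective <- function(PS_H0, PS_H1, n1, n, w = 0.5) {
  w * (PS_H0 * n1 + (1 - PS_H0) * (n - n1)) / n +
    (1 - w) * (1 - (PS_H1 * n1 + (1 - PS_H1) * (n - n1)) / n)
}
EN_objective(PS_H0 = 0.3, PS_H1 = 0.07, n1 = 28, n = 40)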
Although this sample size is derived only for a two-stage design, it can be extended to sample size calculations for trials with multiple stages. Such extensions are not explored in this package.
Sometimes, the probability of early stopping under the null hypothesis \(PS_{H_0}\) at the first stage, as computed purely by the algorithm, may be too low (e.g., 25%). This indicates that the design lacks the ability to stop the study early for futility, which is generally undesirable, especially in phase II trials. To address this, our method provides users with the option to specify a minimum desired \(PS_{H_0}\). Based on this user-specified threshold instead of just minimizing \(EN\) above, we can still compute and provide the corresponding sample sizes for both the interim and final stages while keeping power at least \(1 -\) (Type II error) when \(S \leq U\). Note: Given that \(PS_{H_0}\) increases with \(n_1\) for a fixed \(n\), our proposed algorithm restricts the ratio \(n_1/n\) to be less than 0.75 by default.
From the above, with the obtained optimal \(\lambda\) and \(\gamma\), once we have the data, we can calculate \(P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S)\) to determine whether to proceed with the trial (Go/No-Go decision).
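For intuition, the futility boundary can be sketched as follows (a simplified illustration; the conduct(.) function from Step 5 implements the actual analyses):
# Simplified sketch of the stage-wise rule: stop (No-Go) when
# P(mu1_tilde < mu0_tilde | D_n, S) exceeds 1 - lambda * (n/N)^gamma.
go_decision <- function(post_prob, n, N, lambda = 0.95, gamma = 1) {
  if (post_prob > 1 - lambda * (n / N)^gamma) "No-Go" else "Go"
}
go_decision(post_prob = 0.40, n = 28, N = 40) # boundary = 1 - 0.95 * (28/40) = 0.335, so "No-Go"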
This vignette provides a structured, step-by-step approach to designing two-arm multi-stage survival trials with delayed treatment effects; in summary, the process follows the steps outlined above.
When a delayed treatment effect (DTE) is anticipated, the proposed design method can reduce the required sample size compared to methods (e.g., the conventional BOP2) that do not account for DTE. The details can be found in Cai, Mori, and Pan (2025).
We consider the CheckMate 017 study (Brahmer et al. 2015) as an example. This randomized, open-label, international phase 3 trial evaluated the efficacy and safety of nivolumab, a fully human IgG4 programmed death-1 (PD-1) immune checkpoint inhibitor, compared with docetaxel. The study focused on patients with squamous-cell non-small-cell lung cancer (NSCLC) following prior platinum-based chemotherapy. Here, we focus on progression-free survival (PFS), a secondary endpoint, which demonstrated a delayed treatment effect.
We now use the DTEBOP2 R package to design this trial, treating the PFS endpoint as the primary endpoint, and also demonstrate the process of conducting interim and final analyses once the data are available.
Preparation Step: Extract Data from the Published PFS Kaplan-Meier Plot
First, we reconstruct the Kaplan-Meier (KM) plot for PFS using PlotDigitizer. The datasets extracted from PlotDigitizer for the treatment and control arms are stored in our R package as “Experimental” and “SOC”.
library(DTEBOP2)
library(invgamma)
library(truncdist)
library(doParallel)
library(survRM2adapt)
library(reconstructKM)
library(ggplot2)
data("Experimental")
data("SOC")
Experiment_NAR <- data.frame(time=seq(from=0, to=24, by=3), NAR=c(135, 68, 48, 33, 21, 15, 6, 2,0))
SOC_NAR <- data.frame(time=seq(from=0, to=24, by=3), NAR=c(137, 62, 26, 9, 6, 2,1, 0, 0))
Experiment_aug <- format_raw_tabs(raw_NAR=Experiment_NAR,
raw_surv=Experimental)
SOC_aug <- format_raw_tabs(raw_NAR=SOC_NAR,
raw_surv=SOC)
# reconstruct by calling KM_reconstruct()
Experiment_recon <- KM_reconstruct(aug_NAR=Experiment_aug$aug_NAR, aug_surv=Experiment_aug$aug_surv)
SOC_recon <- KM_reconstruct(aug_NAR=SOC_aug$aug_NAR, aug_surv=SOC_aug$aug_surv)
# put the treatment and control arms into one dataset
Experiment_IPD <- data.frame(arm=1, time=Experiment_recon$IPD_time, status=Experiment_recon$IPD_event)
SOC_IPD <- data.frame(arm=0, time=SOC_recon$IPD_time, status=SOC_recon$IPD_event)
allIPD <- rbind(Experiment_IPD, SOC_IPD)
KM_fit <- survival::survfit(survival::Surv(time, status) ~ arm, data=allIPD)
KM_plot <- survminer::ggsurvplot(
fit = KM_fit,
data = allIPD,
risk.table = TRUE,
palette = c('red', 'blue'),
legend = c(0.8, 0.9),
legend.title = '',
legend.labs = c('Docetaxel', 'Nivolumab'),
title = 'PFS',
ylab = 'Survival (%)',
xlab = 'Time (Month)',
risk.table.title = 'Number at Risk',
break.time.by = 3,
tables.height = 0.4,
risk.table.fontsize = 5
)
print(KM_plot,fig.align='center')
In the original paper, the hazard ratio of nivolumab versus docetaxel was 0.62 (95% CI, 0.47 to 0.81), meaning nivolumab performed statistically significantly better than the SOC.
Based on the figure above, it is evident that a delayed treatment effect is present, with the two PFS curves visibly diverging around 2.2 months.
In the following, we assume the results are unknown and provide a step-by-step guide to using our proposed method to design a study in which a delayed treatment effect is anticipated.
Step 1: Hypothesis Setup Based on Overall Median Survival Times
To design a study, one key component is establishing the null and alternative hypotheses. Here, we assume \(\bar{\mu}_0 = 2.8\) months for the standard of care (SOC) arm and \(\bar{\mu}_1 = 3.5\) months for the experimental arm.
Step 2.1: Prior Specification for Delayed Time \(S\)
Next, we derive the most likely value and prior distribution of the delayed separation timepoint \(S\).
For the prior of \(S\), we seek expert opinion to establish the lower (\(L\)) and upper (\(U\)) bounds. For example, the clinical team may find a range with \(L=2\) months and \(U=2.5\) months for the delayed separation timepoint \(S\) based on literature and their experiences.
If experts have greater confidence in the separation timepoint, for example, by providing specific estimates as shown below:
We can use the following code to obtain the estimated prior parameters \(s_1\) and \(s_2\) based on the previously introduced optimization algorithm. As a recap, we assume a truncated Gamma prior for \(S\), i.e., \(\text{Gamma}(s_1, s_2)I(L, U)\). In the following code, weights=c(4,4,2,1,1) indicates that the expert-provided mean and median are given equal weight and more emphasis than the other quantities.
expert_data_correct <- list(
list(mean = 2.2, median = 2.27, sd = NULL, q25 = NULL, q975 = NULL), # expert A
list(mean = 2.1, median = 2.3, sd = NULL, q25 = NULL, q975 = NULL), # expert B
list(mean = 2.3, median = 2.31, sd = NULL, q25 = NULL, q975 = NULL) # expert C
)
param<-trunc_gamma_para(L = 2, U = 2.5, expert_data = expert_data_correct,weights = c(4,4,2,1,1) ,num_cores = 4,seed=123) # num_cores specifies the number of cores used for parallel computing.
print(param)
The estimated \(s_1, s_2\) are 12.86 and 0.19, respectively. Below is the density plot of the truncated Gamma distribution \(\text{Gamma}(12.86, 0.19)I(2, 2.5)\):
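The plot can be reproduced with the truncdist package; the shape-scale parameterization of the Gamma prior is assumed here (our reading of the near-uniform shape noted below; verify against the DTEBOP2 documentation).
s_grid <- seq(2, 2.5, length.out = 200)
dens <- truncdist::dtrunc(s_grid, spec = "gamma", a = 2, b = 2.5, shape = 12.86, scale = 0.19)
plot(s_grid, dens, type = "l", xlab = "S (months)", ylab = "Density",
     main = "Truncated Gamma prior for S")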
The plot suggests that the prior is close to a uniform distribution over \([2, 2.5]\).
Step 2.2: Determining the Most Likely DTE timepoint \(S\) Using Historical Data
When expert opinions are unavailable, the most likely \(S\) can be estimated using the mean of the truncated Gamma prior or the midpoint of \([L, U]\), i.e., \((L+U)/2\). In this specific example, CheckMate 017, since the data are already available, we treat them as historical data. Our goal here is to show how to use these “historical” data to identify the most likely timepoint \(S\), that is, the point at which the Kaplan-Meier (KM) curves begin to diverge.
The following R code identifies this separation timepoint \(S\). We consider several survival probabilities and obtain the corresponding survival times for the two arms; \(S\) is identified as the timepoint where the survival times begin to differ markedly.
y_prob <- c(0.54, 0.53, 0.52,0.51) #y-axis survival probability to explore
summary_fit <- summary(KM_fit)
arm_times <- list()
for (i in 1:length(KM_fit$strata)) {
time_points <- summary_fit$time[summary_fit$strata == names(KM_fit$strata)[i]]
survival_probs <- summary_fit$surv[summary_fit$strata == names(KM_fit$strata)[i]]
closest_times <- sapply(y_prob, function(tp) {
closest_index <- which.min(abs(survival_probs - tp)) # KM point whose survival probability is closest to the target
return(time_points[closest_index])
})
arm_times[[names(KM_fit$strata)[i]]] <- setNames(closest_times, y_prob)
}
print(arm_times)
#> $`arm=0`
#> 0.54 0.53 0.52 0.51
#> 0.43813 0.68154 0.92495 1.02231
#>
#> $`arm=1`
#> 0.54 0.53 0.52 0.51
#> 0.27692 0.38769 0.53538 0.77538
Based on the above results, when the survival probability reaches 0.53, the survival times begin to differ markedly; the corresponding value on the \(x\)-axis is 2.28, which is the separation timepoint. Therefore, we select \(S=2.28\) as the most likely delayed timepoint. If a historical dataset is available, this procedure offers a simpler alternative to the traditional changepoint methods described in the literature.
Next, given the hypothesized median survival times of 3.5 months for the experimental arm and 2.8 months for the SOC arm, the best-guessed delayed separation timepoint of 2.28 months within the range \(L = 2\) to \(U = 2.5\) months, an accrual rate of 6 patients per month, a minimum follow-up of 6 months for the last enrolled patient, a type I error controlled at 10%, and a type II error controlled at 15% (i.e., 85% power), the estimated sample sizes for the interim and final stages of a two-stage design and the optimal parameters \(\lambda,\gamma\) can be computed with the function Two_stage_sample_size(.).
median.1=2.8 #\bar{mu}_0
median.2=3.5 #\bar{mu}_1
L=2 # Lower bound of S
U=2.5 # Upper bound of S
S_likely=2.28 # The most likely delayed time
trunc.para=c(12.86,0.19) # prior parameters of S
rate=6 # accrual rate; arrivals are assumed to follow a Poisson process with this rate
FUP=6 # Follow-up time of the last enrolled patient
err1 = 0.1 # type I error
err2=0.15 # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, weight=0.5,earlystop_prob =NULL,seed=123)
#calculate the sample size for two-stage design
#> [1] 28 40
The interim and final sample sizes are 28 and 40 per arm, respectively. The optimal parameters, \(\lambda\) and \(\gamma\), are 0.95 and 1, respectively, with an average type I error rate of 0.0871 and power of 0.8667.
Note: In the code above, weight \(=0.5\) indicates that equal importance is given to the probability of early stopping under the null hypothesis (\(H_0\)) and under the alternative hypothesis (\(H_1\)).
If we are also interested in the corresponding number of events at the two stages, we can use the function event_fun(.):
event_fun(median.1 = 2.8,median.2 = 3.5,S_likely = 2.28,n.interim = c(28,40),rate = 6,FUP = 6,n.sim = 1000,seed=123)
#> Events
#> 1st interim 23
#> Final stage 68
We anticipate observing approximately 23 events at the interim analysis stage and a total of 68 events by the final analysis.
With the optimal parameters and sample size, we can evaluate the operating characteristics for different true but unknown separation timepoints \(S\). For the sensitivity analyses (SA) below, we assume that the most likely separation timepoint, S_likely=2.28 months, remains unchanged. Under this condition, the median survival times before and after the separation timepoint for the control (SOC) and treatment (experimental) groups, \(\tilde{\mu}_0=2.8\) and \(\tilde{\mu}_1=6.57\) respectively, also remain unchanged, because the pre- and post-separation median survival times depend solely on the assumed most likely separation timepoint (here, S_likely=2.28 months).
Impact of True S Variation Under a Fixed Assumed S_likely
We now consider two true unknown values of \(S\), 2 and 2.5 months, to examine their effects on the design. Since the overall median survival time \(\bar{\mu}_1\) depends on the true but unknown separation timepoint \(S\), its value will adjust accordingly.
Sensitivity Analysis 1.1: If true unknown DTE separation timepoint \(S = 2\)
In the following code, we set \(L=U=2\). In real trial design practice we would, of course, also set S_likely=2. Nevertheless, in the current setting S_likely is kept at 2.28 as before, since the goal here is only to check the robustness of our proposed method.
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2957
#>
#> $reject
#> [1] 0.0841
#>
#> $average.patients
#> [1] 36.4516
#>
#> $trial.duration
#> [1] 10.33529
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0711
#>
#> $reject
#> [1] 0.8804
#>
#> $average.patients
#> [1] 39.1468
#>
#> $trial.duration
#> [1] 12.09543
Sensitivity Analysis 1.2: If true unknown DTE separation timepoint: \(S=2.5\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2379
#>
#> $reject
#> [1] 0.0873
#>
#> $average.patients
#> [1] 37.1452
#>
#> $trial.duration
#> [1] 10.79129
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0763
#>
#> $reject
#> [1] 0.8549
#>
#> $average.patients
#> [1] 39.0844
#>
#> $trial.duration
#> [1] 12.05357
From the above results, we observe that when S_likely = 2.28 remains unchanged, the type I error rates are still controlled under 0.1 and power stays above 0.85 across different true DTE separation timepoints. Additionally, we see that when \(S=2.5\) the power is lower than when \(S=2\), given that our best-estimated separation timepoint is S_likely = 2.28. This is expected: when \(S=2.5\) and S_likely = 2.28, the unseparated portion of the two curves is included in the comparison, leading to a loss of power. Conversely, if the true separation occurs earlier than S_likely, the power will increase.
Sensitivity Analysis: Varying the Assumed Separation Timepoint S_likely
Next, we examine scenarios where the best-guessed separation timepoint S_likely \(=2\) or \(=2.5\), to evaluate the robustness of our proposed method, while keeping the overall median survival times at \(\bar{\mu}_0=2.8\) and \(\bar{\mu}_1=3.5\). In these scenarios, the pre- and post-separation median survival times, \(\tilde{\mu}_0, \tilde{\mu}_1\), change with S_likely accordingly. We report the average type I error and power to assess the performance of our method under these scenarios.
Specifically, we examine two extreme scenarios to assess the robustness of our method. The optimal design parameters were initially determined assuming the separation time point falls within the range of \(L = 2\) months to \(U = 2.5\) months.
If the operating characteristics (OCs) are still satisfied in both cases, it would demonstrate the robustness of our proposed method.
Sensitivity Analysis 1.3: If best-guessed DTE separation timepoint: S_likely \(= 2\)
##When H0 holds, the reject rate is the average type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2729
#>
#> $reject
#> [1] 0.0871
#>
#> $average.patients
#> [1] 36.7252
#>
#> $trial.duration
#> [1] 10.52729
##When H1 holds, the reject rate is the average power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.102
#>
#> $reject
#> [1] 0.7226
#>
#> $average.patients
#> [1] 38.776
#>
#> $trial.duration
#> [1] 11.8653
Sensitivity Analysis 1.4: If best-guessed DTE separation timepoint: S_likely \(= 2.5\)
##When H0 holds, the reject rate is the average type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2729
#>
#> $reject
#> [1] 0.0871
#>
#> $average.patients
#> [1] 36.7252
#>
#> $trial.duration
#> [1] 10.52729
##When H1 holds, the reject rate is the average power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0424
#>
#> $reject
#> [1] 0.95
#>
#> $average.patients
#> [1] 39.4912
#>
#> $trial.duration
#> [1] 12.33141
From these results, we can see that the average type I errors remain controlled under 0.1. However, when S_likely \(=2\), the average power falls below 0.85. This is expected because the sample size was originally determined based on S_likely \(=2.28\), with corresponding \(\tilde{\mu}_0=2.8,\tilde{\mu}_1=6.57\), whereas for S_likely \(=2\) these values change to \(\tilde{\mu}_0=2.8,\tilde{\mu}_1=5.25\). That is, if the best-guessed separation timepoint shifts earlier toward the beginning of the trial, the treatment effect is diluted. Conversely, when the best-guessed separation timepoint S_likely is larger than 2.28, as in the case of S_likely \(=2.5\) (\(\tilde{\mu}_0=2.8,\tilde{\mu}_1=9.33\)), the average power remains above 0.85, ensuring adequate performance. As another exploration, we now set S_likely \(=2\) to determine the required sample size. The results are as follows:
Example 2: Impact of Setting S_likely \(=2\)
median.1=2.8 #\bar{mu}_0
median.2=3.5 #\bar{mu}_1
L=2 # Lower bound of S
U=2.5 # Upper bound of S
S_likely=2 # The most likely delayed time
trunc.para=c(12.86,0.19) # default prior parameters of S
rate=6 # accrual rate; arrivals are assumed to follow a Poisson process with this rate
FUP=6 # Follow-up time of the last patient
err1 = 0.1 # type I error
err2=0.15 # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, weight=0.5,earlystop_prob =NULL,seed=123) #calculate the sample size for two-stage design
#> [1] 44 63
The interim and final sample sizes are 44 and 63 per arm, respectively. The optimal parameters in \(P(\tilde{\mu}_1 < \tilde{\mu}_0 \mid \mathcal{D}_n, S)> 1-\lambda(n/N)^{\gamma}\) are \(\lambda=0.95, \gamma=1\), with average type I error 0.0813 and power 0.8786. For the number of events at the two stages, we have:
event_fun(median.1 = 2.8,median.2 = 3.5,S_likely = 2,n.interim = c(44,63),rate = 6,FUP = 6,n.sim = 1000,seed=123)
#> Events
#> 1st interim 47
#> Final stage 112
We thus expect approximately 47 events at the interim stage and 112 events in total at the final stage.
Taking a further step, with the optimal parameters \(\lambda=0.95, \gamma=1\), we consider four scenarios by varying both S_likely and \(S\).
Sensitivity Analysis 2.1: \(S=2\), S_likely \(=2\)
##When H0 holds, the reject rate is the average type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.4087
#>
#> $reject
#> [1] 0.078
#>
#> $average.patients
#> [1] 55.2347
#>
#> $trial.duration
#> [1] 12.80447
##When H1 holds, the reject rate is the average power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0544
#>
#> $reject
#> [1] 0.8943
#>
#> $average.patients
#> [1] 61.9664
#>
#> $trial.duration
#> [1] 16.00119
Sensitivity Analysis 2.2: \(S=2.5\), S_likely \(=2\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.3778
#>
#> $reject
#> [1] 0.0807
#>
#> $average.patients
#> [1] 55.8218
#>
#> $trial.duration
#> [1] 13.08364
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0651
#>
#> $reject
#> [1] 0.8687
#>
#> $average.patients
#> [1] 61.7631
#>
#> $trial.duration
#> [1] 15.90049
Sensitivity Analysis 2.3: \(S=2\), S_likely \(=2.5\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.4087
#>
#> $reject
#> [1] 0.078
#>
#> $average.patients
#> [1] 55.2347
#>
#> $trial.duration
#> [1] 12.80447
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2,U=2,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0045
#>
#> $reject
#> [1] 0.9955
#>
#> $average.patients
#> [1] 62.9145
#>
#> $trial.duration
#> [1] 16.44764
Sensitivity Analysis 2.4: \(S=2.5\), S_likely \(=2.5\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.3778
#>
#> $reject
#> [1] 0.0807
#>
#> $average.patients
#> [1] 55.8218
#>
#> $trial.duration
#> [1] 13.08364
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(44,63),L=2.5,U=2.5,S_likely=2.5,FUP=6,trunc.para=c(12.86,0.19),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0066
#>
#> $reject
#> [1] 0.9932
#>
#> $average.patients
#> [1] 62.8746
#>
#> $trial.duration
#> [1] 16.4295
Summary of the above results
With a maximum sample size of 63 per arm, the type I error is controlled at 0.1 and power remains above 0.85 across all combinations of \(S\) and S_likely, demonstrating a doubly robust (DR) property: robustness to both the true but unknown separation timepoint \(S\) and the user-specified best guess S_likely, as long as both lie in \([L, U]\).
A larger upper bound \(U\) increases the required sample size. Meanwhile, a smaller \(L\) results in a smaller best-guessed \(S\) (that is, S_likely), which means the undiverged region of the two curves is also included in the comparison, so a larger sample size is needed to maintain power. In short, a larger interval \([L, U]\) reflects greater uncertainty in the specification and leads to an increased required sample size.
The lower bound \(L\) and upper bound \(U\) should be chosen carefully based on prior knowledge and experience, as both influence the maximum required sample size. The guiding principle is to choose \(L\) and \(U\) such that the true separation time is covered with as much certainty as possible.
Comparison with Log-Rank Test Without DTE Consideration
Compared with the log-rank test, which does not incorporate information about \(S\), given the event rate \(P=0.85\), the total sample size under the log-rank test is:
\[ \frac{4(z_{0.95} + z_{0.85})^2}{P\left(\log\left(\frac{2.8}{3.5}\right)\right)^2} = 680 \]
Without incorporating information about \(S\), the required sample size is 340 per arm. In contrast, if we employ the post-separation median survival time, 6.57, with S_likely \(=2.28\), the total sample size under the log-rank test is only \[ \frac{4(z_{0.95} + z_{0.85})^2}{P\left(\log\left(\frac{2.8}{6.57}\right)\right)^2} = 48 \]
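As a quick numerical check, both totals can be reproduced with the short sketch below; the z-quantiles come from qnorm, and rounding the second result up to an even total (so it splits equally across arms) is our reading of how 48 was obtained.
logrank_n <- function(m0, m1, P = 0.85) { # total sample size for the log-rank test with event rate P
  4 * (qnorm(0.95) + qnorm(0.85))^2 / (P * log(m0 / m1)^2)
}
logrank_n(2.8, 3.5)  # ~679.4, i.e., 680 in total (340 per arm)
logrank_n(2.8, 6.57) # ~46.5, i.e., 48 in total (24 per arm)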
In other words, if we assume that the true separation time is \(S = 0\), then the required sample size is at least 24 subjects per arm. On the other hand, under scenarios where the separation time is unknown, the sample size could increase up to 340 per arm. Therefore, the sample size required by our proposed method is expected to lie between 24 and 340, depending on the actual separation time. This observation highlights the value of incorporating prior information about the separation time into the design. By leveraging this information, our method can substantially reduce the required sample size, thereby improving efficiency and lowering resource demands in clinical trial planning.
Our method significantly reduces the required sample size by leveraging information about the separation time \(S\). The key reason for this reduction is the use of the ratio of \(\tilde{\mu}_1\) to \(\tilde{\mu}_0\), instead of \(\bar{\mu}_1\) to \(\bar{\mu}_0\), for sample size determination; the former is the more appropriate choice when a delayed treatment effect (DTE) is anticipated. This approach allows for a more efficient design, achieving meaningful sample size savings while maintaining statistical power.
Example 3: Applying conduct(.) for Go/No-Go Decisions with Mock Data
We now generate two mock datasets and use the conduct(.) function to guide Go/No-Go decision-making at the interim and final stages based on the observed data.
Example 3.1: Mock Data Scenario Assuming \(H_0\) to Be True
The first dataset is generated under the null hypothesis \(H_0\).
set.seed(126)
n.interim = c(44,63)
nmax=63
FUP=6
rate=6
L=2
U=2.5
median.1=2.8
median.2=3.5
lambda_1=log(2)/2.8 #hazard rate before S
lambda_2=log(2)/2.8 #hazard rate after S
nobs = n.interim+1 # the interim look is taken when the (n.interim+1)-th patient arrives
nobs[length(nobs)] = 63
wait.t = rexp(nmax,rate = rate) # inter-arrival times
arrival.t = cumsum(wait.t)
tobs = arrival.t[nobs] # calendar times of the interim and final analyses
tobs[length(tobs)] = tobs[length(tobs)] + FUP # final analysis after the minimum follow-up
S_likely=2.1
event.t.E = generate_pe(nmax,t=S_likely,lambda1=lambda_1,lambda2=lambda_1) # piecewise-exponential event times; under H0 both segments use the control hazard
event.t.C=rexp(nmax,rate=lambda_1)
####first interim
k = 1
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0) ## event indicator for the treatment arm; 1 = event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E) ## treatment dataset
data.C=data.frame(Time=t.event.C,Status=Ind_event_C) ## control dataset
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#> Time Status
#> 1 1.48614261 1
#> 2 6.41029766 1
#> 3 0.18085940 1
#> 4 0.08234503 1
#> 5 2.26719651 1
#> 6 0.50314936 1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#> Time Status
#> 1 4.6947190 1
#> 2 6.6259936 0
#> 3 0.2985375 1
#> 4 0.7839282 1
#> 5 2.0822781 1
#> 6 2.7224489 1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=63,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Go"
The result is Go. Assuming that we have all the data, we now move to the final analysis.
k = 2
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0) ## event indicator for the treatment arm; 1 = event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E)
data.C=data.frame(Time=t.event.C,Status=Ind_event_C)
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#> Time Status
#> 1 1.48614261 1
#> 2 6.41029766 1
#> 3 0.18085940 1
#> 4 0.08234503 1
#> 5 2.26719651 1
#> 6 0.50314936 1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#> Time Status
#> 1 4.6947190 1
#> 2 15.7118917 0
#> 3 0.2985375 1
#> 4 0.7839282 1
#> 5 2.0822781 1
#> 6 2.7224489 1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=nmax,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Not Reject Null hypothesis"
The result is Not Reject Null Hypothesis, which meets our expectation.
Example 3.2: Mock Data Scenario Assuming \(H_1\) to Be True
The second mock dataset is generated under the alternative hypothesis \(H_1\).
set.seed(126)
n.interim = c(44,63)
nmax=63
FUP=6
rate=6
L=2
U=2.5
median.1=2.8
median.2=3.5
lambda_1=log(2)/2.8 #hazard rate before S
lambda_2=log(2)/6.6 #hazard rate after S
nobs = n.interim+1
nobs[length(nobs)] = 63
wait.t = rexp(nmax,rate = rate)
arrival.t = cumsum(wait.t)
tobs = arrival.t[nobs]
tobs[length(tobs)] = tobs[length(tobs)] + FUP
S_likely=2.1
event.t.E = generate_pe(nmax,t=S_likely,lambda1=lambda_1,lambda2=lambda_2)
event.t.C=rexp(nmax,rate=lambda_1)
####first interim
k = 1
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0) ## event indicator for the treatment arm; 1 = event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E) ## treatment dataset
data.C=data.frame(Time=t.event.C,Status=Ind_event_C) ## control dataset
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#> Time Status
#> 1 1.48614261 1
#> 2 6.62599365 0
#> 3 0.18085940 1
#> 4 0.08234503 1
#> 5 2.49410605 1
#> 6 0.50314936 1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#> Time Status
#> 1 4.6947190 1
#> 2 6.6259936 0
#> 3 0.2985375 1
#> 4 0.7839282 1
#> 5 2.0822781 1
#> 6 2.7224489 1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=63,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Go"
The result is Go. Assuming we now have the full dataset, we can conduct the final analysis.
k = 2
Ind_event_E = ifelse(arrival.t[1:n.interim[k]] + event.t.E[1:n.interim[k]] <= tobs[k],1,0) ## event indicator for the treatment arm; 1 = event.
Ind_event_C = ifelse(arrival.t[1:n.interim[k]] + event.t.C[1:n.interim[k]] <= tobs[k],1,0)
t.event.E = rep(0,n.interim[k])
t.event.C=rep(0,n.interim[k])
for(l in 1:length(t.event.E)){
t.event.E[l] = ifelse(arrival.t[l]+event.t.E[l]<=tobs[k],event.t.E[l],tobs[k]-arrival.t[l])
t.event.C[l] = ifelse(arrival.t[l]+event.t.C[l]<=tobs[k],event.t.C[l],tobs[k]-arrival.t[l])
}
data.E=data.frame(Time=t.event.E,Status=Ind_event_E)
data.C=data.frame(Time=t.event.C,Status=Ind_event_C)
print("Treatment Dataset:")
#> [1] "Treatment Dataset:"
print(head(data.E))
#> Time Status
#> 1 1.48614261 1
#> 2 12.25998734 1
#> 3 0.18085940 1
#> 4 0.08234503 1
#> 5 2.49410605 1
#> 6 0.50314936 1
print("Control Dataset:")
#> [1] "Control Dataset:"
print(head(data.C))
#> Time Status
#> 1 4.6947190 1
#> 2 15.7118917 0
#> 3 0.2985375 1
#> 4 0.7839282 1
#> 5 2.0822781 1
#> 6 2.7224489 1
conduct(data.E,data.C,median.1=median.1,median.2=median.2,lambda=0.95,gamma=1,nmax=nmax,S_likely=S_likely,L=L,U=U,trunc.para=c(12.86,0.19))[[4]]
#> [1] "Decision: Reject Null hypothesis"
The result is Reject Null hypothesis, which meets our expectations.
Now suppose we require the probability of early stopping to be no less than 0.4. We calculate the interim and final sample sizes using the function Two_stage_sample_size(.) with earlystop_prob = 0.4, under two scenarios: S_likely=2.28 and S_likely=2. In both cases, the resulting sample sizes ensure that the probability of early stopping (PS) is at least 0.4 when \(S \leq U\).
Example 8.1: S_likely \(=2.28\)
median.1=2.8 #\bar{mu}_0
median.2=3.5 #\bar{mu}_1
L=2 # Lower bound of S
U=2.5 # Upper bound of S
S_likely=2.28 # The most likely delayed time
trunc.para=c(12.86,0.19) # prior of S
rate=6 # accrual rate; arrivals are assumed to follow a Poisson process with this rate
FUP=6 # Follow-up time of the last patient
err1 = 0.1 # type I error
err2=0.15 # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, earlystop_prob =0.4,seed=123)
#> [1] 39 52
With the interim and final sample sizes, we can empirically check the early stopping rate under \(H_0\) when \(S=2\) and \(S=2.5\), as follows.
##S=2
n.interim=c(39,52)
getoc_2arm_piecewise(median.true=c(median.1,median.1),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = n.interim,S_likely=2.28,L=2,U=2,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)$earlystop
#> [1] 0.448
##S=2.5
getoc_2arm_piecewise(median.true=c(median.1,median.1),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = n.interim,S_likely=2.28,L=2.5,U=2.5,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)$earlystop
#> [1] 0.4156
The results indicate that, to ensure an early stopping probability of at least 0.4 under \(H_0\), the final sample size must increase from 40 to 52. This aligns with expectations, as the original probability of early stopping under \(H_0\) was approximately 0.24 when \(S=2.5\); achieving an increase of roughly 16 percentage points in the probability of early stopping requires a larger sample size.
Example 8.2: S_likely \(=2\)
median.1=2.8 #\bar{mu}_0
median.2=3.5 #\bar{mu}_1
L=2 # Lower bound of S
U=2.5 # Upper bound of S
S_likely=2 # The most likely delayed time
trunc.para=c(12.86,0.19) # prior of S
rate=6 # accrual rate; arrivals are assumed to follow a Poisson process with this rate
FUP=6 # Follow-up time of the last patient
err1 = 0.1 # type I error
err2=0.15 # type II error
Two_stage_sample_size(median.1=median.1 , median.2=median.2,L=L,U=U,S_likely=S_likely,err1=err1,err2=err2,trunc.para=trunc.para, FUP=FUP, rate=rate, earlystop_prob =0.4,seed=123)
#> [1] 46 65
The results indicate that, to ensure a probability of early stopping of at least 0.4 in this case, the final sample size must also increase, from 63 to 65. This aligns with our expectations, as the initial early stopping rate was approximately 0.37 when \(S=2.5\); a modest increase in sample size yields an improvement of roughly 4 percentage points in the probability of early stopping.
Previously, we only considered a two-stage design. Here, we demonstrate how our method can be applied in a multi-stage setting using a three-stage design. Specifically, we set the interim sample sizes as n.interim \(= c(13, 28, 40)\) and first obtain the optimal values of \(\lambda\) and \(\gamma\) below.
n.interim=c(13,28,40)
get.optimal_2arm_piecewise(median.1=median.1,median.2=median.2, L=L,U=U,S_likely=2.28,Uniform=FALSE,trunc.para=c(1,1), err1=0.1, n.interim=n.interim, rate=rate, FUP=FUP,track=TRUE,nsim = 10000,control = T,control.point = c(L,U)) #Get the optimal lambda and gamma
#> [1] 0.9500 1.0000 0.0875 0.8678
The optimal values of \(\lambda\) and \(\gamma\) are 0.95 and 1, respectively. Next, we evaluate the operating characteristics when \(S=2\) and \(S=2.5\), as before.
Example 9.1: \(S = 2\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2962
#>
#> $reject
#> [1] 0.0841
#>
#> $average.patients
#> [1] 36.3901
#>
#> $trial.duration
#> [1] 10.32178
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2,U=2,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0722
#>
#> $reject
#> [1] 0.8793
#>
#> $average.patients
#> [1] 39.1036
#>
#> $trial.duration
#> [1] 12.08181
Example 9.2: \(S = 2.5\)
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2385
#>
#> $reject
#> [1] 0.0873
#>
#> $average.patients
#> [1] 37.114
#>
#> $trial.duration
#> [1] 10.78268
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(13,28,40),L=2.5,U=2.5,S_likely=2.28,FUP=6,trunc.para=c(1,1),rate=6,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0773
#>
#> $reject
#> [1] 0.854
#>
#> $average.patients
#> [1] 39.0514
#>
#> $trial.duration
#> [1] 12.04232
Compared to the two-stage design, the three-stage design exhibits similar operating characteristics, with only a slight increase in probability of early stopping under \(H_0\). This suggests that incorporating an additional interim analysis enables earlier stopping without substantially altering the overall statistical properties of the design.
In this subsection, we conduct a further sensitivity analysis under the following scenarios:
Example 10.1: Expert-Elicited Prior for \(S\)
Suppose we have three experts providing the following prior information:
expert_data_correct <- list(
list(mean = 2.28, median = NULL, sd = NULL, q25 = NULL, q975 = NULL), # Expert A
list(mean = NULL, median = 2.4, sd = NULL, q25 = NULL, q975 = NULL), # Expert B
list(mean = 2.4, median = 2.3, sd = NULL, q25 = NULL, q975 = NULL) # Expert C
)
param_1 <- trunc_gamma_para(
L = 2,
U = 2.5,
expert_data = expert_data_correct,
weights = c(4,4,2,1,1),
num_cores = 4,seed=123 # Number of cores for parallel computing
)
print(param_1)
The estimated \(s_1, s_2\) are 19.49 and 0.27; the density of this prior for \(S\) can be plotted as before with these parameters.
Compared with the prior we used before, the new prior is more concentrated around 2.42. Now, we set S_likely \(=2.28\) and obtain the optimal parameters \(\lambda, \gamma\):
n.interim = c(28,40)
trunc.para=param_1
get.optimal_2arm_piecewise(median.1=median.1,median.2=median.2, L=L,U=U,S_likely=S_likely,Uniform=FALSE,trunc.para=trunc.para, err1=err1, n.interim=n.interim, rate=rate, FUP=FUP,track=TRUE,nsim = 10000,control = T,control.point = c(L,U)) #Get the optimal lambda and gamma
#> [1] 0.9500 1.0000 0.0870 0.8628
From the results, the optimal values of \(\lambda\) and \(\gamma\) remain the same as those obtained under the default prior, with slightly different average type I error and power. Consequently, the operating characteristics, including type I error and power, are nearly identical to those under the default prior. We give one example to illustrate this: suppose S_likely=2.28 and \(S=2.5\).
median.1=2.8 #\bar{mu}_0
median.2=3.5 #\bar{mu}_1
L=2 # Lower bound of S
U=2.5 # Upper bound of S
S_likely=2.28 # The most likely delayed time
trunc.para=c(19.49,0.27) # prior of S
rate=6 # accrual rate; arrivals are assumed to follow a Poisson process with this rate
FUP=6 # Follow-up time of the last patient
err1 = 0.1 # type I error
err2=0.15 # type II error
##When H0 holds, the reject rate is the type I error.
getoc_2arm_piecewise(median.true=c(2.8,2.8),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=S_likely,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.2379
#>
#> $reject
#> [1] 0.0873
#>
#> $average.patients
#> [1] 37.1452
#>
#> $trial.duration
#> [1] 10.79129
##When H1 holds, the reject rate is the power.
getoc_2arm_piecewise(median.true=c(2.8,3.5),Uniform=FALSE,lambda=0.95,gamma=1,n.interim = c(28,40),L=2.5,U=2.5,S_likely=S_likely,FUP=FUP,trunc.para=trunc.para,rate=rate,nsim=10000,seed=10)
#> $earlystop
#> [1] 0.0763
#>
#> $reject
#> [1] 0.8549
#>
#> $average.patients
#> [1] 39.0844
#>
#> $trial.duration
#> [1] 12.05357
This suggests that the operating characteristics are stable with respect to the expert-elicited prior.
Example 10.2: Default Prior for \(S\)
If sufficient information is unavailable to determine the prior parameters \(s_1\) and \(s_2\) of \(S\), we recommend using the default prior with \(s_1 = 1\) and \(s_2 = 1\). As shown below, the resulting prior distribution is approximately uniform over \([L, U]\), indicating a vague prior for \(S\).
In addition, the optimal parameters, as shown below, are identical to those obtained using the expert-elicited prior and yield comparable average type I error and power.
#> [1] 0.9500 1.0000 0.0876 0.8694
Therefore, if we have no expert data for the \(S\), we may use the default prior for \(S\).
This work presents a flexible and efficient framework for designing two-arm multi-stage survival trials with delayed treatment effects (DTE). By incorporating prior knowledge about the DTE separation timepoint \(S\), our method enables substantial reductions in the required sample size while preserving type I error control and achieving the desired power. The use of the ratio \(\tilde{\mu}_1/\tilde{\mu}_0\) instead of \(\bar{\mu}_1/\bar{\mu}_0\) in sample size determination offers a more appropriate and robust basis for design when a DTE is anticipated.
Through comparison with the traditional log-rank test, we demonstrated that our method can achieve the same operating characteristics with far fewer events and patients. Sensitivity analyses further confirmed the robustness of our approach under varying prior specifications and design parameters. Additionally, we illustrated how the method applies to three-stage designs given the sample size at each stage, and how to conduct the trial, e.g., making Go/No-Go decisions.
Overall, this approach offers a principled, data-informed, and computationally feasible solution for designing efficient survival trials under delayed treatment effects, and serves as a valuable tool for researchers and practitioners in clinical trial planning.