In the SimTOST R package, which is specifically designed for sample size estimation in bioequivalence studies, hypothesis testing is based on the Two One-Sided Tests (TOST) procedure (Sozu et al. 2015b). In TOST, the equivalence test is framed as a comparison between the null hypothesis that 'the new product is worse by a clinically relevant quantity' and the alternative hypothesis that 'the difference between products is too small to be clinically relevant'.
The null and alternative hypotheses for the equivalence test are presented below for two different approaches:
One common approach for assessing bioequivalence involves comparing pharmacokinetic (PK) measures between the test and reference products. This comparison is expressed through the following interval (null) hypothesis:
Null Hypothesis (\(H_0\)): At least one endpoint does not meet the equivalence criteria:
\[H_0: m_T^{(j)} - m_R^{(j)} \le \delta_L ~~ \text{or}~~ m_T^{(j)} - m_R^{(j)} \ge \delta_U \quad \text{for at least one}\;j\]
Alternative Hypothesis (\(H_1\)): All endpoints meet the equivalence criteria:
\[H_1: \delta_L<m_{T}^{(j)}-m_{R}^{(j)} <\delta_U \quad\text{for all}\;j\]
Here, \(m_T\) and \(m_R\) represent the mean endpoints for the test product (the proposed biosimilar) and the reference product, respectively. The equivalence limits, \(\delta_L\) and \(\delta_U\), are typically chosen to be symmetric, such that \(\delta = - \delta_L = \delta_U\).
The null hypothesis (\(H_0\)) is rejected if, and only if, all null hypotheses associated with the \(K\) primary endpoints are rejected at a significance level of \(\alpha\). This ensures that equivalence is established across all endpoints simultaneously.
The DOM test can be implemented in sampleSize() by setting ctype = "DOM" and lognorm = FALSE.
For pharmacokinetic (PK) outcomes, such as the area under the curve (AUC) and maximum concentration (Cmax), log-transformation is commonly applied to achieve normality. To perform this transformation, the logarithm of the geometric mean should be provided to mu_list, while the logarithmic variance can be derived from the coefficient of variation (CV) using the formula:
\[ \text{Logarithmic Variance} = \log\left(1 + {\text{CV}^2}\right) \]
Equivalence limits must also be specified on the log scale to align with the transformed data.
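As an illustration, the log-scale inputs can be derived in base R. The geometric mean, CV, and equivalence bounds below are hypothetical values chosen for this sketch, not defaults from the package:

```r
# Hypothetical geometric mean and coefficient of variation for AUC
geo_mean <- 105    # geometric mean
cv       <- 0.30   # coefficient of variation (30%)

# Log-scale inputs: the log of the geometric mean is supplied to mu_list,
# and the logarithmic variance is derived from the CV
log_mean <- log(geo_mean)
log_var  <- log(1 + cv^2)

# Equivalence limits must also be expressed on the log scale,
# e.g. the conventional 0.80-1.25 bioequivalence bounds
log_limits <- log(c(0.80, 1.25))

round(c(log_mean = log_mean, log_var = log_var), 4)
round(log_limits, 4)
```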
The equivalence hypotheses can also be expressed as a Ratio of Means (ROM):
Null Hypothesis (\(H_0\)): At least one endpoint does not meet the equivalence criteria:
\[H_0: \frac{\mu_T^{(j)}}{\mu_R^{(j)}} \le E_L ~~ \text{or}~~ \frac{\mu_T^{(j)}}{\mu_R^{(j)}} \ge E_U \quad \text{for at least one}\;j\]
Alternative Hypothesis (\(H_1\)): All endpoints meet the equivalence criteria:
\[H_1: E_L< \frac{\mu_{T}^{(j)}}{\mu_{R}^{(j)}} < E_U \quad\text{for all}\;j\]
Here, \(\mu_T\) and \(\mu_R\) represent the arithmetic mean endpoints for the test product (the proposed biosimilar) and the reference product, respectively.
The ROM test can be implemented in sampleSize() by setting ctype = "ROM" and lognorm = TRUE. Note that the mu_list argument should contain the arithmetic means of the endpoints, while sigma_list should contain their corresponding variances.
The ROM test is converted to a Difference of Means (DOM) test by log-transforming the data and the equivalence limits. The variance on the log scale is calculated using the normalized variance formula:
\[ \text{Logarithmic Variance} = \log\left(1 + \frac{\text{Arithmetic Variance}}{\text{Arithmetic Mean}^2}\right) \]
The logarithmic mean is then calculated as:
\[\text{Logarithmic Mean} = \log(\text{Arithmetic Mean}) - \frac{1}{2}(\text{Logarithmic Variance})\]
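This conversion can be reproduced directly in base R; the arithmetic mean and variance used below are hypothetical:

```r
# Hypothetical arithmetic summaries for one endpoint
arith_mean <- 100   # arithmetic mean
arith_var  <- 900   # arithmetic variance (SD = 30)

# Convert to log-scale (DOM) parameters
log_var  <- log(1 + arith_var / arith_mean^2)
log_mean <- log(arith_mean) - 0.5 * log_var

# Equivalence limits on the ratio scale (e.g. 0.80-1.25)
# become additive limits on the log scale
log_limits <- log(c(0.80, 1.25))

round(c(log_mean = log_mean, log_var = log_var), 4)
round(log_limits, 4)
```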
When evaluating bioequivalence, certain statistical and methodological requirements must be adhered to, as outlined in the European Medicines Agency’s bioequivalence guidelines (Committee for Medicinal Products for Human Use (CHMP) 2010). These requirements ensure that the test and reference products meet predefined criteria for equivalence in terms of PK parameters. The key considerations are summarized below:
When conducting a DOM test, the FDA recommends that the equivalence acceptance criterion (EAC) be defined as \(\delta = \text{EAC} = 1.5 \sigma_R\), where \(\sigma_R\) represents the variability of the log-transformed endpoint for the reference product.
Assessment of equivalence is often required for more than one primary variable (Sozu et al. 2015b). For example, the EMA recommends demonstrating equivalence for both AUC and Cmax.
A decision must be made as to whether it is desirable to demonstrate equivalence for all primary endpoints (co-primary endpoints) or only for a specified number of them, as the two situations have different implications for error control.
When a trial defines multiple co-primary endpoints, equivalence must be demonstrated for all of them to claim overall treatment equivalence. In this setting, each endpoint is tested separately at the usual significance level (\(\alpha\)), and equivalence is established only if all individual tests are statistically significant. Because conclusions require rejecting all null hypotheses, a formal multiplicity adjustment is not needed to control the Type I error rate (Committee for Proprietary Medicinal Products (CPMP) 2002). However, as the number of co-primary endpoints (\(K\)) increases, the likelihood of failing to meet equivalence on at least one endpoint also rises, resulting in a higher Type II error rate (i.e., a greater risk of incorrectly concluding non-equivalence) (Mielke et al. 2018).
This has several important implications for the power and required sample size of the study.
When a trial aims to establish equivalence for at least \(k\) primary endpoints, adjustments are necessary to account for the increased risk of Type I error due to multiple hypothesis testing (Sozu et al. 2015a). Without such adjustments, the likelihood of incorrectly concluding equivalence for at least one endpoint increases as the number of endpoints grows.
For example, if a study includes \(m = 3\) independent primary endpoints and uses a significance level of \(\alpha = 5\%\) for each test, the overall probability of falsely concluding equivalence for at least one endpoint is:
\[ 1 - (1-\alpha)^m = 1 - (1-0.05)^3 = 0.1426. \] This means that the overall probability of making any false positive error, also known as the family-wise error rate (FWER), increases to approximately 14%.
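The same calculation can be checked in base R for several values of \(m\):

```r
alpha <- 0.05
m <- 1:5

# FWER for m independent tests, each performed at level alpha
fwer <- 1 - (1 - alpha)^m
round(setNames(fwer, paste0("m=", m)), 4)
#>    m=1    m=2    m=3    m=4    m=5
#> 0.0500 0.0975 0.1426 0.1855 0.2262
```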
To address this issue, the significance level must be adjusted when testing multiple endpoints; various methods have been proposed for this purpose. In SimTOST, the following approaches are included:
The most common and easiest procedure for multiplicity adjustment to control the FWER is the Bonferroni method. Each hypothesis is tested at level
\[\alpha_{bon}= \alpha/m\]
where \(m\) is the total number of tests. Although simple, this method is highly conservative, particularly when tests are correlated, as it assumes all tests are independent. This conservativeness remains pronounced even for \(k=1\), where only one of the \(m\) hypotheses needs to be rejected (Mielke et al. 2018).
In the sampleSize() function, the Bonferroni correction can be applied by setting adjust = "bon".
The Sidak correction is an alternative method for controlling the FWER. Like the Bonferroni correction, it assumes that tests are independent. However, the Sidak correction accounts for the joint probability of all tests being non-significant, making it mathematically less conservative than the Bonferroni method. The adjusted significance level is calculated as:
\[\alpha_{sid}= 1-(1-\alpha)^ {1/m}\]
The Sidak correction can be implemented by specifying adjust = "sid" in the sampleSize() function.
This correction explicitly accounts for the scenario where equivalence is required for only \(k\) out of \(m\) endpoints. Unlike the Bonferroni and Sidak corrections, which assume that all \(m\) tests contribute equally to the overall Type I error rate, the k-adjustment directly incorporates the number of endpoints (\(k\)) required for equivalence into the adjustment. The adjusted significance level is calculated as:
\[\alpha_k= \frac{k \, \alpha}{m}\]
where \(k\) is the number of endpoints required for equivalence, and \(m\) is the total number of endpoints evaluated.
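The three adjusted significance levels can be compared side by side in base R; the values of \(m\) and \(k\) below are purely illustrative:

```r
alpha <- 0.05  # nominal significance level
m <- 4         # total number of endpoints evaluated
k <- 2         # endpoints required to demonstrate equivalence

alpha_bon <- alpha / m               # Bonferroni
alpha_sid <- 1 - (1 - alpha)^(1 / m) # Sidak
alpha_k   <- k * alpha / m           # k-adjustment

round(c(bonferroni = alpha_bon, sidak = alpha_sid, k_adjusted = alpha_k), 4)
# Bonferroni 0.0125, Sidak 0.0127, k-adjusted 0.0250
```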
Hierarchical testing is an approach to multiple endpoint testing where endpoints are tested in a predefined order, typically based on their clinical or regulatory importance. A fallback testing strategy is applied, allowing sequential hypothesis testing. If a hypothesis earlier in the sequence fails to be rejected, testing stops, and subsequent hypotheses are not evaluated (Chowdhry et al. 2024).
To implement hierarchical testing in SimTOST, the user specifies adjust = "seq" in the sampleSize() function and defines primary and secondary endpoints using the type_y vector argument. The significance level (\(\alpha\)) is adjusted separately for each group of endpoints, ensuring strong control of the Family-Wise Error Rate (FWER) while maintaining interpretability.
Secondary endpoints are evaluated only if equivalence has been demonstrated for the primary endpoints, and overall equivalence requires that at least k secondary endpoints demonstrate equivalence. An example of hierarchical testing can be found in this vignette.
In certain cases, it may be necessary to compare multiple treatments simultaneously. This can be achieved by specifying multiple comparators in the mu_list and sigma_list parameters. The sampleSize() function can accommodate multiple treatments, allowing for the evaluation of equivalence across different products or formulations.
Although trials with multiple arms are common, there is no clear consensus in the literature as to whether statistical corrections should be applied for testing multiple primary hypotheses in such analyses. In SimTOST, no adjustments are made for trials involving more than two treatment arms.