Wilcoxon TOST

The Wilcoxon group of tests (includes Mann-Whitney U-test) provide a non-parametric test of differences between groups, or within samples, based on ranks. This provides a test of location shift, which is a fancy way of saying differences in the center of the distribution (i.e., in parametric tests the location is mean). With TOST, there are two separate tests of directional location shift to determine if the location shift is within (equivalence) or outside (minimal effect). The exact calculations can be explored via the documentation of the wilcox.test function.

TOSTER’s version is the wilcox_TOST function. Overall, this function operates extremely similar to the t_TOST function. However, the standardized mean difference (SMD) is not calculated. Instead the rank-biserial correlation is calculated for all types of comparisons (e.g., two sample, one sample, and paired samples). Also, there is no plotting capability at this time for the output of this function.

As an example, we can use the sleep data to make a non-parametric comparison of equivalence.

data('sleep')
library(TOSTER)

test1 = wilcox_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
                      eqb = .5)


print(test1)

## 
## Wilcoxon rank sum test with continuity correction
## 
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##            Test Statistic p.value
## NHST                 25.5   0.069
## TOST Lower           34.0   0.894
## TOST Upper           20.0   0.013
## 
## Effect Sizes 
##                           Estimate               C.I. Conf. Level
## Median of Differences       -1.346       [-3.4, -0.1]         0.9
## Rank-Biserial Correlation   -0.490 [-0.7493, -0.1005]         0.9

Rank-Biserial Correlation

The standardized effect size reported for the wilcox_TOST procedure is the rank-biserial correlation. This is a fairly intuitive measure of effect size which has the same interpretation of the common language effect size (Kerby 2014). However, instead of assuming normality and equal variances, the rank-biserial correlation calculates the number of favorable (positive) and unfavorable (negative) pairs based on their respective ranks.

For the two sample case, the correlation is calculated as the proportion of favorable pairs minus the unfavorable pairs.

\[ r_{biserial} = f_{pairs} - u_{pairs} \]

For the one sample or paired samples cases, the correlation is calculated with ties (values equal to zero) not being dropped. This provides a conservative estimate of the rank biserial correlation.

It is calculated in the following steps wherein \(z\) represents the values or difference between paired observations:

Calculate signed ranks:

\[ r_j = -1 \cdot sign(z_j) \cdot rank(|z_j|) \] 2. Calculate the positive and negative sums:

\[ R_{+} = \sum_{1\le i \le n, \space z_i > 0}r_j \]

\[ R_{-} = \sum_{1\le i \le n, \space z_i < 0}r_j \] 3. Determine the smaller of the two rank sums:

\[ T = min(R_{+}, \space R_{-}) \]

\[ S = \begin{cases} -4 & R_{+} \ge R_{-} \\ 4 & R_{+} < R_{-} \end{cases} \] 4. Calculate rank-biserial correlation:

\[ r_{biserial} = S \cdot | \frac{\frac{T - \frac{(R_{+} + R_{-})}{2}}{n}}{n + 1} | \]

Confidence Intervals

The Fisher approximation is used to calculate the confidence intervals.

For paired samples, or one sample, the standard error is calculated as the following:

\[ SE_r = \sqrt{ \frac {(2 \cdot nd^3 + 3 \cdot nd^2 + nd) / 6} {(nd^2 + nd) / 2} } \]

wherein, nd represents the total number of observations (or pairs).

For independent samples, the standard error is calculated as the following:

\[ SE_r = \sqrt{\frac {(n1 + n2 + 1)} { (3 \cdot n1 \cdot n2)}} \]

The confidence intervals can then be calculated by transforming the estimate.

\[ r_z = atanh(r_{biserial}) \]

Then the confidence interval can be calculated and back transformed.

\[ r_{CI} = tanh(r_z \pm Z_{(1 - \alpha / 2)} \cdot SE_r) \]

Conversion to other effect sizes

Two other effect sizes can be calculated for non-parametric tests. First, there is the concordance probability, which is also known at the c-statistic, c-index, or probability of superiority¹. The c-statistic is converted from the correlation using the following formula:

\[ c = \frac{(r_{biserial} + 1)}{2} \]

The Wilcoxon-Mann-Whitney odds (O’Brien and Castelloe 2006), also known as the “Generalized Odds Ratio” (Agresti 1980), is calculated by converting the c-statistic using the following formula:

\[ WMW_{odds} = e^{logit(c)} \]

Either effect size is available by simply modifying the ses argument for the wilcox_TOST function.

# Rank biserial
wilcox_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
                      ses = "r",
                      eqb = .5)

## 
## Wilcoxon rank sum test with continuity correction
## 
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##            Test Statistic p.value
## NHST                 25.5   0.069
## TOST Lower           34.0   0.894
## TOST Upper           20.0   0.013
## 
## Effect Sizes 
##                           Estimate               C.I. Conf. Level
## Median of Differences       -1.346       [-3.4, -0.1]         0.9
## Rank-Biserial Correlation   -0.490 [-0.7493, -0.1005]         0.9

# Odds

wilcox_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
                      ses = "o",
                      eqb = .5)

## 
## Wilcoxon rank sum test with continuity correction
## 
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##            Test Statistic p.value
## NHST                 25.5   0.069
## TOST Lower           34.0   0.894
## TOST Upper           20.0   0.013
## 
## Effect Sizes 
##                       Estimate             C.I. Conf. Level
## Median of Differences  -1.3464     [-3.4, -0.1]         0.9
## WMW Odds                0.3423 [0.1433, 0.8173]         0.9

# Concordance

wilcox_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
                      ses = "c",
                      eqb = .5)

## 
## Wilcoxon rank sum test with continuity correction
## 
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##            Test Statistic p.value
## NHST                 25.5   0.069
## TOST Lower           34.0   0.894
## TOST Upper           20.0   0.013
## 
## Effect Sizes 
##                       Estimate             C.I. Conf. Level
## Median of Differences   -1.346     [-3.4, -0.1]         0.9
## Concordance              0.255 [0.1254, 0.4497]         0.9

Brunner-Munzel

As Karch (2021) explained, there are some reasons to dislike the WMW family of tests as the non-parametric alternative to the t-test. Regardless of the underlying statistical arguments², it can be argued that the interpretation of the WMW tests, especially when involving equivalence testing, is a tad difficult. Some may want a non-parametric alternative to the WMW test, and the Brunner-Munzel test(s) may be a useful option.

The Brunner-Munzel test (Brunner and Munzel 2000; Neubert and Brunner 2007) is based on calculating the “stochastic superiority” (Karch 2021, i.e., probability of superiority), which is usually referred to as the relative effect, based on the ranks of the two groups being compared (X and Y). A Brunner-Munzel type test is then a directional test of an effect, and answers a question akin to “what is the probability that a randomly sampled value of X will be greater than Y?”

\[ \hat p = P(X>Y) + 0.5 \cdot P(X=Y) \]

Basics of the Calculative Approach

In this section, I will quickly detail the calculative approach that underlies the Brunner-Munzel test in TOSTER.

A studentized test statistic can be calculated as:

\[ t = \sqrt{N} \cdot \frac{\hat p -p_{null}}{s} \] Wherein the default null hypothesis \(p_{null}\) is typically 0.5 (50% probability of superiority is the default null), and \(\sigma_N\) refers the rank-based Brunner-Munzel standard error. The null can be modified therefore allowing for equivalence testing directly based on the relative effect. Additionally, for paired samples the probability of superiority is based on a hypothesis of exchangability and is not based on the differences scores³.

For more details on the calculative approach, I suggest reading Karch (2021). At small sample sizes, it is recommended that the permutation version of the test (perm = TRUE) be used rather than the basic test statistic approach.

Example

The interface for the function is very similar to the t.test function. The brunner_munzel function itself does not allow for equivalence tests, but you can set an alternative hypothesis for “two.sided”, “less”, or “greater”.

# studentized test
brunner_munzel(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE)

## Sample size in at least one group is small. Permutation test (perm = TRUE) is highly recommended.

## 
##  two-sample Brunner-Munzel test
## 
## data:  extra by group
## t = -2.1447, df = 16.898, p-value = 0.04682
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
##  0.01387048 0.49612952
## sample estimates:
## p(X>Y) + .5*P(X=Y) 
##              0.255

# permutation
brunner_munzel(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
               perm = TRUE)

## 
##  two-sample Brunner-Munzel permutation test
## 
## data:  extra by group
## t-observed = -2.1447, df = 16.898, p-value = 0.0482
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
##  0.00826466 0.49593164
## sample estimates:
## p(X>Y) + .5*P(X=Y) 
##              0.255

The simple_htest function allows TOST tests using a Brunner-Munzel test by setting the alternative to “equivalence” or “minimal.effect”. The equivalence bounds, based on the relative effect, can be set with the mu argument.

# permutation based Brunner-Munzel test of equivalence
simple_htest(formula = extra ~ group,
             test = "brunner",
             data = sleep,
             paired = FALSE,
             alternative = "equ",
             mu = .7,
             perm = TRUE)

## 
##  two-sample Brunner-Munzel permutation test
## 
## data:  extra by group
## t-observed = -3.8954, df = 16.898, p-value = 0.9974
## alternative hypothesis: equivalence
## null values:
## relative effect relative effect 
##             0.3             0.7 
## 90 percent confidence interval:
##  0.05824864 0.45564662
## sample estimates:
## p(X>Y) + .5*P(X=Y) 
##              0.255

Bootstrap TOST

The bootstrap is a simulation based technique, derived from re-sampling with replacement, designed for statistical estimation and inference. Bootstrapping techniques are very useful because they are considered somewhat robust to the violations of assumptions for a simple t-test. Therefore we added a bootstrap option, boot_t_TOST to the package to provide another robust alternative to the t_TOST function.

In this function, we provide the percentile bootstrap solution outlined by Efron and Tibshirani (1993) (see chapter 16, page 220). The bootstrapped p-values are derived from the “studentized” version of a test of mean differences outlined by Efron and Tibshirani (1993). Overall, the results should be similar to the results of t_TOST. However, for paired samples, the Cohen’s d(rm) effect size cannot be calculated.

Two Sample Algorithm

Form B bootstrap data sets from x* and y* wherein x* is sampled with replacement from \(\tilde x_1,\tilde x_2, ... \tilde x_n\) and y* is sampled with replacement from \(\tilde y_1,\tilde y_2, ... \tilde y_n\)
t is then evaluated on each sample, but the mean of each sample (y or x) and the overall average (z) are subtracted from each

\[ t(z^{*b}) = \frac {(\bar x^*-\bar x - \bar z) - (\bar y^*-\bar y - \bar z)}{\sqrt {sd_y^*/n_y + sd_x^*/n_x}} \]

An approximate p-value can then be calculated as the number of bootstrapped results greater than the observed t-statistic from the sample.

\[ p_{boot} = \frac {\#t(z^{*b}) \ge t_{sample}}{B} \]

The same process is completed for the one sample case but with the one sample solution for the equation outlined by \(t(z^{*b})\). The paired sample case in this bootstrap procedure is equivalent to the one sample solution because the test is based on the difference scores.

Example

We can use the sleep data to see the bootstrapped results. Notice that the plots show how the re-sampling via bootstrapping indicates the instability of Hedges’s d_z.

data('sleep')

test1 = boot_t_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = TRUE,
                      eqb = .5,
                    R = 999)


print(test1)

## 
## Bootstrapped Paired t-test
## 
## The equivalence test was non-significant, t(9) = -2.777, p = 1e+00
## The null hypothesis test was significant, t(9) = -4.062, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##                 t df p.value
## t-test     -4.062  9 < 0.001
## TOST Lower -2.777  9       1
## TOST Upper -5.348  9 < 0.001
## 
## Effect Sizes 
##               Estimate     SE               C.I. Conf. Level
## Raw             -1.580 0.3699 [-2.9073, -1.0601]         0.9
## Hedges's g(z)   -1.174 0.6937   [-1.405, 1.0716]         0.9
## Note: studentized bootstrap ci method utilized.

plot(test1)

Ratio of Difference (Log Transformed)

In many bioequivalence studies, the differences between drugs are compared on the log scale (He et al. 2022). The log scale allows researchers to compare the ratio of two means.

\[ log ( \frac{y}{x} ) = log(y) - log(x) \] The United States Food and Drug Administration (FDA)⁴ hs stated a rationale for using the log transformed values:

Using logarithmic transformation, the general linear statistical model employed in the analysis of BE data allows inferences about the difference between the two means on the log scale, which can then be retransformed into inferences about the ratio of the two averages (means or medians) on the original scale. Logarithmic transformation thus achieves a general comparison based on the ratio rather than the differences.

In addition, the FDA, considers two drugs as bioequivalent when the ratio between x and y is less than 1.25 and greater than 0.8 (1/1.25), which is the default equivalence bound for the log functions.

log_TOST

For example, we could compare whether the cars of different transmissions are “equivalent” with regards to gas mileage. We can use the default equivalence bounds (eqb = 1.25).

log_TOST(
  mpg ~ am,
  data = mtcars
)

## 
## Log-transformed Welch Two Sample t-test
## 
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.07e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 8.19e-04
## NHST: reject null significance hypothesis that the effect is equal to one 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##                 t    df p.value
## t-test     -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96   0.907
## TOST Upper -6.288 23.96 < 0.001
## 
## Effect Sizes 
##                  Estimate      SE               C.I. Conf. Level
## log(Means Ratio)  -0.3466 0.09061 [-0.5017, -0.1916]         0.9
## Means Ratio        0.7071      NA   [0.6055, 0.8256]         0.9

Note, that the function produces t-tests similar to the t_TOST function, but provides two effect sizes. The means ratio on the log scale (the scale of the test statistics), and the means ratio. The means ratio is missing standard error because the confidence intervals and estimate are simply the log scale results exponentiated.

Bootstrap + Log

However, it has been noted in the statistics literature that t-tests on the logarithmic scale can be biased, and it is recommended that bootstrapped tests be utilized instead. Therefore, the boot_log_TOST function can be utilized to perform a more precise test.

boot_log_TOST(
  mpg ~ am,
  data = mtcars,
  R = 499
)

## 
## Bootstrapped Log Welch Two Sample t-test
## 
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.46e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to 1 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##                 t    df p.value
## t-test     -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96   0.946
## TOST Upper -6.288 23.96 < 0.001
## 
## Effect Sizes 
##                  Estimate      SE              C.I. Conf. Level
## log(Means Ratio)  -0.3466 0.08641 [-0.5216, -0.159]         0.9
## Means Ratio        0.7071 0.06104   [0.5936, 0.853]         0.9
## Note: studentized bootstrap ci method utilized.

Just Estimate an Effect Size

It was requested that a function be provided that only calculates a robust effect size. Therefore, I created the ses_calc and boot_ses_calc functions as robust effect size calculation⁵. The interface is almost the same as wilcox_TOST but you don’t set an equivalence bound.

ses_calc(formula = extra ~ group,
         data = sleep,
         paired = TRUE,
         ses = "r")

##                            estimate lower.ci  upper.ci conf.level
## Rank-Biserial Correlation 0.9818182 0.928369 0.9954785       0.95

# Setting bootstrap replications low to
## reduce compiling time of vignette
boot_ses_calc(formula = extra ~ group,
         data = sleep,
         paired = TRUE,
         R = 199,
         boot_ci = "perc", # recommend percentile bootstrap for paired SES
         ses = "r")

## Bootstrapped results contain extreme results (i.e., no overlap), caution advised interpreting the with confidence intervals.

##    estimate bias         SE  lower.ci upper.ci conf.level boot_ci
## 1 0.9818182    0 0.03267924 0.8909091        1       0.95    perc

Directly inspired by this blog post from Professor Frank Harrell https://hbiostat.org/blog/post/wpo/↩︎
I would like to note that I think the statistical properties of the WMW tests are sound, and Frank Harrell has written many blogposts outlined their sound application in biomedicine. ↩︎
This means the relative effect will not match the concordance probability provided by ses_calc↩︎
Food and Drug Administration (2014). Bioavailability and Bioequivalence Studies Submitted in NDAs or INDs — General Considerations.Center for Drug Evaluation and Research. Docket: FDA-2014-D-0204↩︎
The results differ greatly because the bootstrap CI method, basic bootstrap, is more conservative than the parametric method. This difference is more apparent with extremely small samples like that in the sleep dataset.↩︎

Robust TOST Procedures

Aaron R. Caldwell

2025-02-06

Wilcoxon TOST

Rank-Biserial Correlation

Confidence Intervals

Conversion to other effect sizes

Brunner-Munzel

Basics of the Calculative Approach

Example

Bootstrap TOST

Two Sample Algorithm

Example

Ratio of Difference (Log Transformed)

log_TOST

Bootstrap + Log

Just Estimate an Effect Size

References