The Beta-Binomial Test for Count Data

Quick start

The user needs to install the package once, most likely by entering install.packages("countdata") in the R console, and load the package before use.

library(countdata)

The following are three complete examples, from reading input data in a tab-deliminated text format to writing the result to a local text file output.txt. The output column pval contains the p-values of the test and the column pval.BH contains the p-values adjusted for multiple testing using the Benjamini-Hochberg method (Benjamini & Hochberg, 1995).

Two-group beta-binomial test (Pham et al., Bioinformatics 2010)

d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

head(d)
#>    a1  a2  a3  b1  b2  b3  c1  c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367  81 118

# compare the first 3 samples against the next three samples
out <- countdata::bb.test(d[, 1:6], 
                          colSums(d[, 1:6]), 
                          c(rep("a", 3), rep("b", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 2, no. of samples = 6...
#> 5%
#> 11%
#> 18%
#> 23%
#> 30%
#> 35%
#> 41%
#> 49%
#> 55%
#> 64%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.

d.norm <-  countdata::normalize(d[, 1:8])

write.table(cbind(d, d.norm, 
                  fc = countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6]),
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")

Three-group beta-binomial test (Pham et al., Bioinformatics 2010)

d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

head(d)
#>    a1  a2  a3  b1  b2  b3  c1  c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367  81 118

# compare the first 3 samples, the next three samples, and the last two samples.
out <- countdata::bb.test(d[, 1:8],
                          colSums(d[, 1:8]),
                          c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 10%
#> 16%
#> 23%
#> 33%
#> 40%
#> 45%
#> 51%
#> 58%
#> 63%
#> 70%
#> 76%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.

d.norm <-  countdata::normalize(d[, 1:8])

write.table(cbind(d, d.norm, 
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")

Two-group paired beta-binomial test (Pham & Jimenez, Bioinformatics 2012)

d <- read.delim("https://tvpham.github.io/data/example-paired.txt")

head(d)
#>   pre.1 pre.2 pre.3 post.1 post.2 post.3
#> 1   575   179   335    505    172    204
#> 2   294   245   256    396    390    265
#> 3   293   282   320    372    240    204
#> 4   303   282   250    307    243    227
#> 5   396   271   171    327    216    103
#> 6   238   261   271    245    234    215

out <- countdata::ibb.test(d[, 1:6], 
                           colSums(d[, 1:6]), 
                           c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 26%
#> 31%
#> 36%
#> 41%
#> 46%
#> 51%
#> 57%
#> 63%
#> 68%
#> 73%
#> 78%
#> 83%
#> 88%
#> 94%
#> 99%
#> Done.

d.norm <-  countdata::normalize(d[, 1:6])

write.table(cbind(d, d.norm, 
                  fc = out$fc,
                  pval = out$p.value,
                  pval.BH = p.adjust(out$p.value, method = "BH")),
            file = "output.txt", row.names = FALSE, sep = "\t")

Unpaired test

The beta-binomial test (Pham et al., Bioinformatics 2010) is used for significance analysis of independent samples for two or more groups. Suppose that a vector x contains the count numbers and a vector tx the corresponding normalization sizes, for example the total spectral counts per sample in a proteomics experiment. In addition, the group information is specified in a vector group. We perform a beta-binomial test as follows.

x <- c(1, 5, 1, 10, 9, 11, 2, 8)

tx <- c(19609, 19053, 19235, 19374, 18868, 19018, 18844, 19271)

group <- c(rep("cancer", 3), rep("normal", 5))

countdata::bb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of groups = 2, no. of samples = 8...
#> 100%
#> Done.
#> $p.value
#> [1] 0.01568598

The test can be applied to a data matrix row by row. The following example compares three groups, the first group consisting of columns 1 to 3 of a data matrix, the second columns 4 to 6, and the third columns 7 and 8.

d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")

# compare 3 groups, using all available CPU cores
out <- countdata::bb.test(d[, 1:8], 
                          colSums(d[, 1:8]), 
                          c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 28%
#> 34%
#> 40%
#> 46%
#> 52%
#> 58%
#> 77%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.

The output out$p.value contains p-values for each row of d. For multiple testing correction, use the p.adjust function in R. For example, to apply the Benjamini & Hochberg method for p-value adjustment

pval.BH <- p.adjust(out$p.value, method = "BH")

Data normalization

The beta-binomial test takes raw counts as input and normalizes the values internally. To obtain normalized values, we multiply each sample by a scaling factor so that the total counts are equal across samples. This procedure is implemented in the normalize function.

d.norm <-  countdata::normalize(d[, 1:8])

# check -- all values should be equal
colSums(d.norm)
#>    a1    a2    a3    b1    b2    b3    c1    c2 
#> 19159 19159 19159 19159 19159 19159 19159 19159

The matrix d.norm contains the normalized data. We can combine the normalized data and the result of the beta-binomial test by cbind(d.norm, out$p.value), and subsequently save or view the resulting matrix

head(cbind(d.norm, out$p.value))
#>         a1       a2       a3       b1       b2       b3        c1       c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209  82.35401 117.3142
#>    out$p.value
#> 1 0.0001164498
#> 2 0.0016150240
#> 3 0.4501729545
#> 4 0.0003105657
#> 5 0.2521494212
#> 6 0.0001057092

Fold change calculation

The package provides a convenient function to calculate the ratio between the averages of two sample groups row by row. For example, to calculate the fold change between all group pairs

fc12 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6])
fc13 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 7:8])
fc23 <- countdata::fold.change(d.norm[, 4:6], d.norm[, 7:8], BIG = 100)

head(cbind(d.norm, out$p.value, fc12, fc13, fc23))
#>         a1       a2       a3       b1       b2       b3        c1       c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209  82.35401 117.3142
#>    out$p.value      fc12      fc13      fc23
#> 1 0.0001164498 -1.360633 -1.746148 -1.283335
#> 2 0.0016150240 -1.938309 -2.298320 -1.185735
#> 3 0.4501729545 -1.029825 -1.248907 -1.212738
#> 4 0.0003105657 -1.342337 -1.808716 -1.347438
#> 5 0.2521494212 -1.245834 -1.252351 -1.005232
#> 6 0.0001057092 -1.244604 -4.071068 -3.270976

Note that a positive fold change value means up-regulation where the average of the second group is higher than that of the first group. Conversely, a negative value means down-regulation where the average of the first group is higher than that of the second. If one group contains all zeros, a positive or negative big value is returned (default BIG = 10000).

Paired test

The inverted beta-binomial test (Pham & Jimenez, Bioinformatics 2012) is used for paired sample testing, for example between pre-treatment and post-treatment data. The following is an example of a paired test.

x <- c(33, 32, 86, 51, 52, 149)

tx <- c(7742608, 15581382, 20933491, 7126839, 13842297, 14760103)

group <- c(rep("cancer", 3), rep("normal", 3))

countdata::ibb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of pair(s) = 3...
#> 100%
#> Done.
#> $p.value
#> [1] 0.004103636
#> 
#> $fc
#> [1] 2.137632

Analogously to the unpaired situation, the test can be performed for a data matrix row by row. The following example compares two groups where columns 1 to 3 are respectively paired with columns 4 to 6.

d <- read.delim("https://tvpham.github.io/data/example-paired.txt")

# perform a paired test for all rows
out <- countdata::ibb.test(d[, 1:6], 
                           colSums(d[, 1:6]), 
                           c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 10%
#> 16%
#> 21%
#> 26%
#> 31%
#> 37%
#> 42%
#> 47%
#> 52%
#> 57%
#> 62%
#> 67%
#> 73%
#> 78%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.

The result out is a list of two elements where p.value is the p-value of the test and fc an estimation of the common fold change. We can calculate the normalized data and write out the result as follows.

d.norm <- countdata::normalize(d[, 1:6])

head(cbind(d.norm, out$p.value, out$fc))
#>      pre.1    pre.2    pre.3   post.1   post.2   post.3 out$p.value    out$fc
#> 1 594.2861 176.8823 347.1786 490.0689 164.2960 208.5462 0.064856368 -1.297663
#> 2 303.8611 242.1015 265.3067 384.2916 372.5316 270.9056 0.072525346  1.259570
#> 3 302.8275 278.6637 331.6333 361.0012 229.2502 208.5462 0.334902070 -1.168634
#> 4 313.1629 278.6637 259.0885 297.9231 232.1159 232.0588 0.045979358 -1.117328
#> 5 409.2823 267.7939 177.2166 317.3317 206.3252 105.2954 0.006665643 -1.356184
#> 6 245.9828 257.9122 280.8520 237.7562 223.5190 219.7913 0.048216659 -1.151104

References

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300.
Pham TV, Piersma SR, Warmoes M, Jimenez CR (2010) On the beta binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. Bioinformatics, 26(3):363-369.
Pham TV, Jimenez CR (2012) An accurate paired sample test for count data. Bioinformatics, 28(18):i596-i602.