A number of technology platforms in proteomics and genomics produce count data for quantitative analysis. In proteomics, the number of MS/MS events observed for a protein in a mass spectrometry experiment has been shown to correlate strongly with the protein’s abundance. In genomics, next-generation sequencing technologies use read counts as a measure of the abundance of the target transcripts. The R package countdata contains functions for statistical significance analysis of count data for both paired and unpaired designs.
The user needs to install the package once, most likely by entering install.packages("countdata")
in the R console, and load the package before use.
The following are three complete examples, from reading input data in a tab-deliminated text format to writing the result to a local text file output.txt
. The output column pval
contains the p-values of the test and the column pval.BH
contains the p-values adjusted for multiple testing using the Benjamini-Hochberg method (Benjamini & Hochberg, 1995).
Two-group beta-binomial test (Pham et al., Bioinformatics 2010)
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")
head(d)
#> a1 a2 a3 b1 b2 b3 c1 c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367 81 118
# compare the first 3 samples against the next three samples
out <- countdata::bb.test(d[, 1:6],
colSums(d[, 1:6]),
c(rep("a", 3), rep("b", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 2, no. of samples = 6...
#> 5%
#> 11%
#> 18%
#> 23%
#> 30%
#> 35%
#> 41%
#> 49%
#> 55%
#> 64%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.
d.norm <- countdata::normalize(d[, 1:8])
write.table(cbind(d, d.norm,
fc = countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6]),
pval = out$p.value,
pval.BH = p.adjust(out$p.value, method = "BH")),
file = "output.txt", row.names = FALSE, sep = "\t")
Three-group beta-binomial test (Pham et al., Bioinformatics 2010)
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")
head(d)
#> a1 a2 a3 b1 b2 b3 c1 c2
#> 1 624 496 509 414 394 375 325 288
#> 2 615 854 930 341 523 360 359 329
#> 3 553 560 745 819 490 481 480 500
#> 4 525 412 401 354 321 310 258 228
#> 5 484 284 315 268 282 307 270 298
#> 6 482 348 400 242 365 367 81 118
# compare the first 3 samples, the next three samples, and the last two samples.
out <- countdata::bb.test(d[, 1:8],
colSums(d[, 1:8]),
c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 10%
#> 16%
#> 23%
#> 33%
#> 40%
#> 45%
#> 51%
#> 58%
#> 63%
#> 70%
#> 76%
#> 81%
#> 87%
#> 93%
#> 98%
#> Done.
d.norm <- countdata::normalize(d[, 1:8])
write.table(cbind(d, d.norm,
pval = out$p.value,
pval.BH = p.adjust(out$p.value, method = "BH")),
file = "output.txt", row.names = FALSE, sep = "\t")
Two-group paired beta-binomial test (Pham & Jimenez, Bioinformatics 2012)
d <- read.delim("https://tvpham.github.io/data/example-paired.txt")
head(d)
#> pre.1 pre.2 pre.3 post.1 post.2 post.3
#> 1 575 179 335 505 172 204
#> 2 294 245 256 396 390 265
#> 3 293 282 320 372 240 204
#> 4 303 282 250 307 243 227
#> 5 396 271 171 327 216 103
#> 6 238 261 271 245 234 215
out <- countdata::ibb.test(d[, 1:6],
colSums(d[, 1:6]),
c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 26%
#> 31%
#> 36%
#> 41%
#> 46%
#> 51%
#> 57%
#> 63%
#> 68%
#> 73%
#> 78%
#> 83%
#> 88%
#> 94%
#> 99%
#> Done.
d.norm <- countdata::normalize(d[, 1:6])
write.table(cbind(d, d.norm,
fc = out$fc,
pval = out$p.value,
pval.BH = p.adjust(out$p.value, method = "BH")),
file = "output.txt", row.names = FALSE, sep = "\t")
The beta-binomial test (Pham et al., Bioinformatics 2010) is used for significance analysis of independent samples for two or more groups. Suppose that a vector x
contains the count numbers and a vector tx
the corresponding normalization sizes, for example the total spectral counts per sample in a proteomics experiment. In addition, the group information is specified in a vector group
. We perform a beta-binomial test as follows.
x <- c(1, 5, 1, 10, 9, 11, 2, 8)
tx <- c(19609, 19053, 19235, 19374, 18868, 19018, 18844, 19271)
group <- c(rep("cancer", 3), rep("normal", 5))
countdata::bb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of groups = 2, no. of samples = 8...
#> 100%
#> Done.
#> $p.value
#> [1] 0.01568598
The test can be applied to a data matrix row by row. The following example compares three groups, the first group consisting of columns 1 to 3 of a data matrix, the second columns 4 to 6, and the third columns 7 and 8.
d <- read.delim("https://tvpham.github.io/data/example-3groups.txt")
# compare 3 groups, using all available CPU cores
out <- countdata::bb.test(d[, 1:8],
colSums(d[, 1:8]),
c(rep("a", 3), rep("b", 3), rep("c", 2)))
#> Using 11 thread(s) ...
#> No. of data rows = 1786, no. of groups = 3, no. of samples = 8...
#> 0%
#> 5%
#> 11%
#> 16%
#> 21%
#> 28%
#> 34%
#> 40%
#> 46%
#> 52%
#> 58%
#> 77%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.
The output out$p.value
contains p-values for each row of d
. For multiple testing correction, use the p.adjust
function in R. For example, to apply the Benjamini & Hochberg method for p-value adjustment
The beta-binomial test takes raw counts as input and normalizes the values internally. To obtain normalized values, we multiply each sample by a scaling factor so that the total counts are equal across samples. This procedure is implemented in the normalize
function.
d.norm <- countdata::normalize(d[, 1:8])
# check -- all values should be equal
colSums(d.norm)
#> a1 a2 a3 b1 b2 b3 c1 c2
#> 19159 19159 19159 19159 19159 19159 19159 19159
The matrix d.norm
contains the normalized data. We can combine the normalized data and the result of the beta-binomial test by cbind(d.norm, out$p.value)
, and subsequently save or view the resulting matrix
head(cbind(d.norm, out$p.value))
#> a1 a2 a3 b1 b2 b3 c1 c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209 82.35401 117.3142
#> out$p.value
#> 1 0.0001164498
#> 2 0.0016150240
#> 3 0.4501729545
#> 4 0.0003105657
#> 5 0.2521494212
#> 6 0.0001057092
The package provides a convenient function to calculate the ratio between the averages of two sample groups row by row. For example, to calculate the fold change between all group pairs
fc12 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 4:6])
fc13 <- countdata::fold.change(d.norm[, 1:3], d.norm[, 7:8])
fc23 <- countdata::fold.change(d.norm[, 4:6], d.norm[, 7:8], BIG = 100)
head(cbind(d.norm, out$p.value, fc12, fc13, fc23))
#> a1 a2 a3 b1 b2 b3 c1 c2
#> 1 609.6800 498.7595 506.9889 409.4057 400.0766 377.7803 330.43276 286.3262
#> 2 600.8866 858.7512 926.3254 337.2158 531.0662 362.6691 365.00111 327.0879
#> 3 540.3094 563.1155 742.0564 809.9113 497.5572 484.5661 488.02377 497.0941
#> 4 512.9520 414.2921 399.4156 350.0715 325.9508 312.2983 262.31278 226.6749
#> 5 472.8929 285.5800 313.7554 265.0259 286.3493 309.2761 274.51337 296.2681
#> 6 470.9388 349.9361 398.4195 239.3144 370.6294 369.7209 82.35401 117.3142
#> out$p.value fc12 fc13 fc23
#> 1 0.0001164498 -1.360633 -1.746148 -1.283335
#> 2 0.0016150240 -1.938309 -2.298320 -1.185735
#> 3 0.4501729545 -1.029825 -1.248907 -1.212738
#> 4 0.0003105657 -1.342337 -1.808716 -1.347438
#> 5 0.2521494212 -1.245834 -1.252351 -1.005232
#> 6 0.0001057092 -1.244604 -4.071068 -3.270976
Note that a positive fold change value means up-regulation where the average of the second group is higher than that of the first group. Conversely, a negative value means down-regulation where the average of the first group is higher than that of the second. If one group contains all zeros, a positive or negative big value is returned (default BIG = 10000).
The inverted beta-binomial test (Pham & Jimenez, Bioinformatics 2012) is used for paired sample testing, for example between pre-treatment and post-treatment data. The following is an example of a paired test.
x <- c(33, 32, 86, 51, 52, 149)
tx <- c(7742608, 15581382, 20933491, 7126839, 13842297, 14760103)
group <- c(rep("cancer", 3), rep("normal", 3))
countdata::ibb.test(x, tx, group)
#> Using a single CPU core ...
#> No. of data rows = 1, no. of pair(s) = 3...
#> 100%
#> Done.
#> $p.value
#> [1] 0.004103636
#>
#> $fc
#> [1] 2.137632
Analogously to the unpaired situation, the test can be performed for a data matrix row by row. The following example compares two groups where columns 1 to 3 are respectively paired with columns 4 to 6.
d <- read.delim("https://tvpham.github.io/data/example-paired.txt")
# perform a paired test for all rows
out <- countdata::ibb.test(d[, 1:6],
colSums(d[, 1:6]),
c(rep("pre_treatment", 3), rep("post_treatment", 3)))
#> Using 11 thread(s) ...
#> No. of data rows = 2919, no. of pair(s) = 3...
#> 0%
#> 5%
#> 10%
#> 16%
#> 21%
#> 26%
#> 31%
#> 37%
#> 42%
#> 47%
#> 52%
#> 57%
#> 62%
#> 67%
#> 73%
#> 78%
#> 83%
#> 88%
#> 93%
#> 98%
#> Done.
The result out
is a list of two elements where p.value
is the p-value of the test and fc
an estimation of the common fold change. We can calculate the normalized data and write out the result as follows.
d.norm <- countdata::normalize(d[, 1:6])
head(cbind(d.norm, out$p.value, out$fc))
#> pre.1 pre.2 pre.3 post.1 post.2 post.3 out$p.value out$fc
#> 1 594.2861 176.8823 347.1786 490.0689 164.2960 208.5462 0.064856368 -1.297663
#> 2 303.8611 242.1015 265.3067 384.2916 372.5316 270.9056 0.072525346 1.259570
#> 3 302.8275 278.6637 331.6333 361.0012 229.2502 208.5462 0.334902070 -1.168634
#> 4 313.1629 278.6637 259.0885 297.9231 232.1159 232.0588 0.045979358 -1.117328
#> 5 409.2823 267.7939 177.2166 317.3317 206.3252 105.2954 0.006665643 -1.356184
#> 6 245.9828 257.9122 280.8520 237.7562 223.5190 219.7913 0.048216659 -1.151104
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300.
Pham TV, Piersma SR, Warmoes M, Jimenez CR (2010) On the beta binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. Bioinformatics, 26(3):363-369.
Pham TV, Jimenez CR (2012) An accurate paired sample test for count data. Bioinformatics, 28(18):i596-i602.