General Region Utilities

is.valid.region
bedr.sort.region & is.sorted.region
bedr.merge.region & is.merged.region
in.region & order.region
bedr.subtract.region
flank.region & grow.region
intersectRegion
bedr.join.region & bedr.join.multiple.region
get.RegionSize
get.fastaRegion
read.vcf
groupby
test.region.similarity (jaccard & reldist)

These utilities are the main components of the package and can be combined to perform powerful genome arithmetic. Description of these utilities and associated parameters can be found in R docs. Working examples of some of the key functions is covered below:

Load bedr Library

Loading of bedr package will indicate the presensece of bedtools, bedops and tabix libraries on your system. If these are not present, there is very little value in using bedr package. If you have installed these in a non-standard path/directory of your computer, please add these to PATH environment variable.

# load bedr library
library("bedr");

## 
## 
## ######################
## #### bedr v1.0.7 ####
## ######################
## 
## checking binary availability...
##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
##   * Checking path for bedops... PASS
##     /usr/local/bin/bedops
##   * Checking path for tabix... PASS
##     /usr/local/bin/tabix
## tests and examples will be skipped on R CMD check if binaries are missing

N.B. Log info printed through verbose = T (default, in almost all bedr methods) should be carefully assessed. Often, status=FAIL may just highlight a potential problem, and hence code continues to execute without a graceful failure. Therefore, please do read the log messages carefully.

Validate Region

First check if the regions are valid. This involves checking for “chr” prefix, data types, end > start position and for compliance to bed formats zero based start position. All checks can be turned off as required. The “chr” check is useful due to various human reference formats (NCBI vs UCSC) having different standards. This can result in unexpected results if comapring across specifications. Similarly, bed format uses a zero based start postion, but vcf’s use a one based position. Therefore, a snp at position 100 would be chr1:100-100 in one based but chr1:99-100 in zero based format. Another common mistake relates to overlaps between adjacent intervals, for example in zero based setups chr1:10-100 does not intersect with chr1:100-110.

if (check.binary("bedtools")) {

    # get example regions
    index <- get.example.regions();
    a <- index[[1]];
    b <- index[[2]];

    # region validation
    is.a.valid  <- is.valid.region(a);
    is.b.valid  <- is.valid.region(b);
    a <- a[is.a.valid];
    b <- b[is.b.valid];
        
    # print
    cat(" REGION a: ", a, "\n");
    cat(" REGION b: ", b, "\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input is in index format
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input is in index format
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  REGION a:  chr1:10-100 chr1:101-200 chr1:200-210 chr1:211-212 chr2:10-50 chr10:50-100 chr2:40-60 chr20:1-5 
##  REGION b:  chr1:1-10 chr1:111-250 chr1:2000-2010 chr2:1-5 chr10:100-150 chr2:40-60 chr20:6-10

Sort Region

Generally, it’s a good idea to confirm that you’ve sorted your inputs to avoid unexpected results and take advantage of optimized algorithms. For example merging and clustering require adjacent intervals. is.sorted.region is convenient for explicit evaluation although it’s done internally for core operations.

if (check.binary("bedtools")) {

    # check if already sorted
    is.sorted <- is.sorted.region(a);

    # sort lexographically
    a.sort <- bedr.sort.region(a);

    # sort naturally
    a.sort.natural <- bedr.sort.region(a, method = "natural");

    # sort - explicit call using primary API function bedr()
    b.sort <- bedr(
        engine = "bedtools", 
        input = list(i = b), 
        method = "sort", 
        params = ""
        );

    # print
    cat(" REGION a: ", a.sort, "\n");
    cat(" REGION b: ", a.sort.natural, "\n");
    cat(" REGION c: ", b.sort, "\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## SORTING
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input is in index format
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
## SORTING
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input is in index format
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##     Natural sorting is done in R which could be memory intensive for large files
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##    bedtools sort -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d5aa85040.bed 
##  REGION a:  chr1:10-100 chr1:101-200 chr1:200-210 chr1:211-212 chr10:50-100 chr2:10-50 chr2:40-60 chr20:1-5 
##  REGION b:  chr1:10-100 chr1:101-200 chr1:200-210 chr1:211-212 chr2:10-50 chr2:40-60 chr10:50-100 chr20:1-5 
##  REGION c:  chr1:1-10 chr1:111-250 chr1:2000-2010 chr10:100-150 chr2:1-5 chr2:40-60 chr20:6-10

Merge Regions

Similarly, merging adjacent or overlapping regions is often required to avoid redundancy. Performing an intersection/join when you have redundant regions can cause unexpected results and hence best is to merge redundant regions.

if (check.binary("bedtools")) {

    # check if already merged (non-overlapping regions)
    is.merged <- is.merged.region(a.sort);
    is.merged <- is.merged.region(b.sort);

    # merge
    a.merge <- bedr.merge.region(a.sort);

    # merge - explicit call using primary API function bedr()
    b.merge <- bedr(
        engine = "bedtools", 
        input = list(i = b.sort), 
        method = "merge", 
        params = ""
        );

    # print
    cat(" REGION a: ", a.merge, "\n");
    cat(" REGION b: ", b.merge, "\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## MERGING
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input is in index format
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
##    bedtools merge -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d56c47903a.bed -d 0 
##  * Collapsing 8 --> 6 regions... NOTE
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools merge -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d564c61622.bed 
##  REGION a:  chr1:10-100 chr1:101-210 chr1:211-212 chr10:50-100 chr2:10-60 chr20:1-5 
##  REGION b:  chr1:1-10 chr1:111-250 chr1:2000-2010 chr10:100-150 chr2:1-5 chr2:40-60 chr20:6-10

Subtract Region

The subtract utility identifies regions exclusive to one (first) set of regions. For instance, one might be interested in removing non-coding regions from gene regions.

if (check.binary("bedtools")) {

    # subtract
    a.sub1 <- bedr.subtract.region(a.merge, b.merge);

    # subtract - explicit call using primary API function bedr()
    a.sub2 <- bedr(
        input = list(a = a.merge, b = b.merge), 
        method = "subtract", 
        params = "-A"
        );

    # print
    cat(" REGION a - sub1: ", a.sub1, "\n");
    cat(" REGION a - sub2: ", a.sub2, "\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## SUBTRACTING
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools subtract -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d559cb854a.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d526d131ec.bed   -A 
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools subtract -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d541e679fd.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d57bf9cf72.bed -A
##  REGION a - sub1:  chr1:10-100 chr10:50-100 chr20:1-5 
##  REGION a - sub2:  chr1:10-100 chr10:50-100 chr20:1-5

in Region

Finding genomic regions which overlap partially with another set of regions is an extremely useful requirement for various sequence analysis tasks e.g finding genes/SNPs which may fall in particular segments (e.g chromosomal bands).

if (check.binary("bedtools")) {

    # check if present in a region
    is.region <- in.region(a.merge, b.merge);

    # or alternatively R-like in command
    is.region <- a.merge %in.region% b.merge

    # print
    cat(" is.region: ", is.region, "\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
##  is.region:  FALSE TRUE TRUE FALSE TRUE FALSE

Intersect/Join Regions

The intersect/join function identifies sub-regions of first set which overlap with the second set of regions, and returns a table with chromosome name along with coordinates of overlapping sub-region. The bedr.join.multiple.region function creates an intersect table for multiple regions displaying the overlapping regions along with a fast access truth table for all versus all regions.

if (check.binary("bedtools")) {

    # intersect / join
    a.int1 <- bedr.join.region(a.merge, b.merge);
    a.int2 <- bedr(
        input = list(a = a.sort, b = b.sort), 
        method = "intersect", 
        params = "-loj -sorted"
        );

    # multiple join
    d <- get.random.regions(15, chr="chr1", sort = TRUE);
    a.mult <- bedr.join.multiple.region(
        x = list(a.merge, b.merge, bedr.sort.region(d))
        );

    # print
    cat(" REGION a intersect: \n"); print(a.int1); cat("\n");
    cat(" REGION multi (a,b,c) intersect: \n"); print(a.mult); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## JOINING
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools intersect -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d56f26d513.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d513a50087.bed -loj -sorted
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools intersect -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d512911742.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d526d284e5.bed -loj -sorted
## JOINING
## SORTING
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Overlapping regions can cause unexpected results.
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): 
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (3): 
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in bed format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools multiinter -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d5267bd72a.bed /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/_62d57fa3b4fe.bed /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/_62d531630376.bed  -names  a b c -g /private/var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T/RtmptsBwzQ/Rinst62962227cd5c/bedr/genomes/human.hg19.genome -filler . 
##  REGION a intersect: 
##          index   V4  V5  V6
## 1  chr1:10-100    .  -1  -1
## 2 chr1:101-210 chr1 111 250
## 3 chr1:211-212 chr1 111 250
## 4 chr10:50-100    .  -1  -1
## 5   chr2:10-60 chr2  40  60
## 6    chr20:1-5    .  -1  -1
## 
##  REGION multi (a,b,c) intersect: 
##                       index n.overlaps names a b c
## 1                 chr1:1-10          1     b 0 1 0
## 2               chr1:10-100          1     a 1 0 0
## 3              chr1:101-111          1     a 1 0 0
## 4              chr1:111-210          2   a,b 1 1 0
## 5              chr1:210-211          1     b 0 1 0
## 6              chr1:211-212          2   a,b 1 1 0
## 7              chr1:212-250          1     b 0 1 0
## 8            chr1:2000-2010          1     b 0 1 0
## 9    chr1:36536614-36567843          1     c 0 0 1
## 10   chr1:39531896-39551738          1     c 0 0 1
## 11   chr1:96595818-96615660          1     c 0 0 1
## 12 chr1:106145780-106177820          1     c 0 0 1
## 13 chr1:114839604-114858053          1     c 0 0 1
## 14 chr1:157500398-157528956          1     c 0 0 1
## 15 chr1:157946862-157967135          1     c 0 0 1
## 16 chr1:159613032-159634262          1     c 0 0 1
## 17 chr1:173615144-173638442          1     c 0 0 1
## 18 chr1:182664052-182696092          1     c 0 0 1
## 19 chr1:184743714-184757316          1     c 0 0 1
## 20 chr1:190087700-190109266          1     c 0 0 1
## 21 chr1:204724529-204744802          1     c 0 0 1
## 22 chr1:225534034-225555264          1     c 0 0 1
## 23 chr1:247803076-247824306          1     c 0 0 1
## 24             chr10:50-100          1     a 1 0 0
## 25            chr10:100-150          1     b 0 1 0
## 26                 chr2:1-5          1     b 0 1 0
## 27               chr2:10-40          1     a 1 0 0
## 28               chr2:40-60          2   a,b 1 1 0
## 29                chr20:1-5          1     a 1 0 0
## 30               chr20:6-10          1     b 0 1 0

Statistically Quantify Regions’ Similarity (jaccard and reldist)

When comparing (pairwise) the extent of overlap between a large collection of genomic regions, it becomes necessary to report the similarity (intersect) using a single quantitative measure. Bedtools implements two such statitics; jaccard and reldist (Favorov A et al.) which can be called through bedr as shown below:

if (check.binary("bedtools")) {

    # compare a and b set of sequences
    jaccard.stats <- jaccard(a.sort, b.sort);
    reldist.stats <- reldist(a.sort, b.sort);

    # print
    cat(" JACCARD a & b: \n"); print(jaccard.stats); cat("\n");
    cat(" RELDIST a & b: \n"); print(reldist.stats); cat("\n");

    # even better way to run both jaccard and reldist, as well as estimate P value through random permutations
    jaccard.reldist.stats <- test.region.similarity(a.sort, b.sort, n = 40);

    cat(" JACCARD/RELDIST a & b: \n"); print(jaccard.reldist.stats); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## JACCARD METRIC
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools jaccard -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d537cd1a2.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d56f1728b4.bed -f 1e-09
## RELATIVE DISTANCE
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools reldist -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d52697383b.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d53f4c187e.bed 
##  JACCARD a & b: 
##     intersection union-intersection  jaccard n_intersections
## 120          120                420 0.285714               3
## 
##  RELDIST a & b: 
##      reldist count total fraction
## 0.00    0.00     2     8    0.250
## 0.01    0.01     2     8    0.250
## 0.17    0.17     1     8    0.125
## 0.28    0.28     1     8    0.125
## 0.40    0.40     1     8    0.125
## 0.42    0.42     1     8    0.125
## 
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in index format
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
## JACCARD METRIC
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in bed format
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in bed format
##    bedtools jaccard -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d553fd84df.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d5291e778d.bed -f 1e-09
## RELATIVE DISTANCE
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in bed format
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input is in bed format
##    bedtools reldist -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d5f3adf12.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d5620b1e8d.bed 
## PERMUTATION TEST
##  JACCARD/RELDIST a & b: 
## $results
##                                                 test   effect  p
## jaccard.perm.gt                         jaccard.stat 0.285714  0
## reldist.perm.median.lt                reldist.median     0.01  0
## reldist.perm.fraction.zero.gt  reldist.fraction.zero    0.143  0
## reldist.perm.fraction.fifty.gt   reldist.fraction.50        0 10
## 
## $n
## [1] 40
## 
## $relative.distance
##      reldist count total fraction
## 0.00    0.00     1     7    0.143
## 0.01    0.01     2     7    0.286
## 0.17    0.17     1     7    0.143
## 0.28    0.28     1     7    0.143
## 0.40    0.40     1     7    0.143
## 0.42    0.42     1     7    0.143

GroupBy

For multiple genomic features/regions mapping to the exact locus (e.g SNPs), it is often useful to collapse their additional annotations into a single record. To collapse, one may be interested in using quantitative operations such as sum, mean, median (see bedtools for all the options) or simple concatenate multiple entries.

if (check.binary("bedtools")) {

    # read example regions file
    regions.file <- system.file("extdata/example-a-region.bed", package = "bedr");
    a <- read.table(regions.file, header = FALSE, stringsAsFactors = FALSE);
    colnames(a) <- c("a.CHROM", "a.START", "a.END", "Score");

    # sort
    a <- bedr.sort.region(a);

    # group by (on first three columns) the score in column 4. Concatenate scores
    a.collapsed <- bedr(
        input = list(i = a), 
        method = "groupby", 
        params = "-g 1,2,3 -c 4 -o collapse"
        );

    # group by (on first three columns) the score in column 4. Compute mean
    a.mean <- bedr(
        input = list(i = a), 
        method = "groupby", 
        params = "-g 1,2,3 -c 4 -o mean"
        );

    # print
    cat(" REGION a groupby (collapsed): \n"); print(a.collapsed); cat("\n");
    cat(" REGION a groupby (mean): \n"); print(a.mean); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## SORTING
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools groupby -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d5bd7e7b9.bed -g 1,2,3 -c 4 -o collapse
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools groupby -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d527e5a447.bed -g 1,2,3 -c 4 -o mean
##  REGION a groupby (collapsed): 
##              a.CHROM a.START a.END          Score
## chr1:10-100     chr1      10   100 98.1,92.7,87.1
## chr1:101-200    chr1     101   200           94.2
## chr1:200-210    chr1     200   210     95.08,90.1
## chr1:211-212    chr1     211   212          80.76
## chr10:50-100   chr10      50   100           77.5
## chr2:10-50      chr2      10    50     76.1,93.98
## chr2:40-60      chr2      40    60          71.32
## chr20:1-5      chr20       1     5    32.65,39.84
## 
##  REGION a groupby (mean): 
##              a.CHROM a.START a.END       Score
## chr1:10-100     chr1      10   100 92.63333333
## chr1:101-200    chr1     101   200        94.2
## chr1:200-210    chr1     200   210       92.59
## chr1:211-212    chr1     211   212       80.76
## chr10:50-100   chr10      50   100        77.5
## chr2:10-50      chr2      10    50       85.04
## chr2:40-60      chr2      40    60       71.32
## chr20:1-5      chr20       1     5      36.245

Example Workflow 1: Compare Variant Callers

The example workflow below reads two VCF files from different variant callers, further limiting to structural variants only and finally identifying common calls.

if (check.binary("bedtools")) {
    VALID.SV.TYPES <- c('BND', 'CNV', 'DEL', 'DUP', 'INS', 'INV');
    POSITION.COLUMNS <- c('CHROM', 'POS', 'END');

    callerA.filename <- system.file("extdata/callerA.vcf.gz", package = "bedr");
    callerB.filename <- system.file("extdata/callerB.vcf.gz", package = "bedr");

    # read the VCF file
    callerA <- read.vcf(callerA.filename, split.info = TRUE)$vcf;
    callerB <- read.vcf(callerB.filename, split.info = TRUE)$vcf;

    # focus on SVs
    callerA <- callerA[which(callerA$SVTYPE %in% VALID.SV.TYPES), ];
    callerB <- callerB[which(callerB$SVTYPE %in% VALID.SV.TYPES), ];

    # convert to zero-based coordinates
    callerA$POS <- callerA$POS - 1;
    callerB$POS <- callerB$POS - 1;

    # find all overlapping pairs, retrieve size of overlap (bp)
    overlapping.pairs <- bedr.join.region(
        callerA[, POSITION.COLUMNS],
        callerB[, POSITION.COLUMNS],
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(overlapping.pairs) <- c(
        'a.CHROM', 'a.POS', 'a.END',
        'b.CHROM', 'b.POS', 'b.END',
        'Overlap'
        );
    overlapping.pairs$b.POS <- as.numeric(overlapping.pairs$b.POS);
    overlapping.pairs$b.END <- as.numeric(overlapping.pairs$b.END);

    # compute a distance between overlapping pairs
    min.breakpoint.distances <- cbind(
        overlapping.pairs$a.POS - overlapping.pairs$b.POS,
        overlapping.pairs$a.END - overlapping.pairs$b.END
        );

    min.breakpoint.distances <- apply(
        abs(min.breakpoint.distances),
        1,
        min
        );
    a.length <- overlapping.pairs$a.END - overlapping.pairs$a.POS;
    b.length <- overlapping.pairs$b.END - overlapping.pairs$b.POS;

    overlapping.pairs$distance  <- (min.breakpoint.distances + abs(a.length - b.length)) / 2;

    # print
    cat(" OVERLAPPING PAIRS: \n"); print(head(overlapping.pairs)); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## READING VCF
##  * checking if file exists... PASS
##  * Reading vcf header...
##    Done
##  * Reading vcf body...
##    Done
##  * Parse vcf header...
##    Done
##  * Split info...
##  * Done
## READING VCF
##  * checking if file exists... PASS
##  * Reading vcf header...
##    Done
##  * Reading vcf body...
##    Done
##  * Parse vcf header...
##    Done
##  * Split info...
##  * Done
## JOINING
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##    bedtools intersect -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d56dd518d.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d5355fb35e.bed -wo -sorted
##  OVERLAPPING PAIRS: 
##   a.CHROM     a.POS     a.END b.CHROM     b.POS     b.END Overlap distance
## 1       1 247204800 247209976       1 247205240 247209984    4736    220.0
## 2       1 247874052 247886918       1 247874057 247877965    3908   4481.5
## 3      10   5304636   5309227      10   5202123   5324657    4591  66686.5
## 4      10  12504806  12506154      10  12504806  12506113    1307     20.5
## 5      10  19431941  19433784      10  19431472  19434296    1843    725.0
## 6      10  36026185  36030048      10  36026100  36030053    3863     47.5

Example Workflow 2: Exome Target Processing

This example builds on the Example Workflow 1, and annotates the structural variants with RefSeq gene identifiers if the variant falls within the gene body for instance. Limiting to chromosome 1 and 2 only. Lastly compare which genes were common between the calls generated by two variant callers.

if (check.binary("bedtools")) {

    # get Human RefSeq genes (Hg19) in BED format
    refseq.file <- system.file("extdata/ucsc.hg19.RefSeq.chr1-2.txt.gz", package = "bedr");
    refseq <- read.table(refseq.file, header = FALSE, stringsAsFactors = FALSE);

    # sort Refseq and remove chr prefix
    refseq.sorted <- bedr.sort.region(refseq[, 1:4]);
    colnames(refseq.sorted) <- c(POSITION.COLUMNS, "Gene");
    refseq.sorted$CHROM <- gsub("^chr", "", refseq.sorted$CHROM);

    # reuse the SV calls from workflow 1 and add gene identifiers to callerA
    callerA.annotated <- bedr.join.region(
        callerA[, POSITION.COLUMNS],
        refseq.sorted,
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(callerA.annotated) <- c(
        'a.CHROM', 'a.POS', 'a.END',
        'Gene.CHROM', 'Gene.POS', 'Gene.END', 'Gene', 'Overlap'
        );

    # reuse the SV calls from workflow 1 and add gene identifiers to callerB
    callerB.annotated <- bedr.join.region(
        callerB[, POSITION.COLUMNS],
        refseq.sorted,
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(callerB.annotated) <- c(
        'b.CHROM', 'b.POS', 'b.END',
        'Gene.CHROM', 'Gene.POS', 'Gene.END', 'Gene', 'Overlap'
        );

    # reinstate chr prefix to chromosome names
    callerA.annotated$a.CHROM <- paste('chr', callerA.annotated$a.CHROM, sep = "");
    callerB.annotated$b.CHROM <- paste('chr', callerB.annotated$b.CHROM, sep = "");

    # print
    cat(" CALLER A GENES (chr 1,2): \n"); print(head(callerA.annotated)); cat("\n");
    cat(" CALLER B GENES (chr 1,2): \n"); print(head(callerB.annotated)); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
## SORTING
## VALIDATE REGIONS
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
## JOINING
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools intersect -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d56d3eda2d.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d57d941328.bed -wo -sorted
## JOINING
##  * Processing input (1): a
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... PASS
##  * Checking for overlapping 'contiguous' regions... PASS
##  * Processing input (2): b
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools intersect -a /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/a_62d57fde5e8d.bed -b /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/b_62d5226e2506.bed -wo -sorted
##  CALLER A GENES (chr 1,2): 
##   a.CHROM    a.POS    a.END Gene.CHROM Gene.POS Gene.END         Gene
## 1    chr1 32176186 32193834          1 32192705 32229664 NM_001294336
## 2    chr1 32176186 32193834          1 32192705 32229664 NM_001294335
## 3    chr1 36419037 36421454          1 36396682 36522063    NM_024852
## 4    chr1 36419037 36421454          1 36396682 36522063    NM_177422
## 5    chr1 40302119 40310286          1 40306705 40349177    NM_017646
## 6    chr1 52748187 52752298          1 52607765 52812358    NM_007324
##   Overlap
## 1    1129
## 2    1129
## 3    2417
## 4    2417
## 5    3581
## 6    4111
## 
##  CALLER B GENES (chr 1,2): 
##   b.CHROM     b.POS     b.END Gene.CHROM  Gene.POS  Gene.END         Gene
## 1    chr1 235776110 235776340          1 235676124 235881413    NR_039973
## 2    chr1 235776110 235776340          1 235710984 235813293 NM_001098722
## 3    chr1 235776110 235776340          1 235710984 235814054 NM_001098721
## 4    chr1 235776110 235776340          1 235710984 235814054    NM_004485
## 5    chr1 236025376 236025711          1 235824330 236030227    NM_000081
## 6    chr1 236025376 236025711          1 235824330 236047008 NM_001301365
##   Overlap
## 1     230
## 2     230
## 3     230
## 4     230
## 5     335
## 6     335

Next, you may want to collapse multiple gene entries in one row. This can be done using groupby utility.

options("warn" = -1);
if (check.binary("bedtools")) {

    # collapse column 7 (Gene) against unique composite key (column 1, 2 and 3)
    callerA.annotated.grouped <- bedr(
        input = list(i = callerA.annotated), 
        method = "groupby", 
        params = "-g 1,2,3 -c 7 -o collapse"
        );

    # collapse column 7 (Gene) against unique composite key (column 1, 2 and 3)
    callerB.annotated.grouped <- bedr(
        input = list(i = callerB.annotated), 
        method = "groupby", 
        params = "-g 1,2,3 -c 7 -o collapse"
        );

    # print
    cat(" CALLER A GENES (chr 1,2) GROUPED: \n"); print(head(callerA.annotated.grouped)); cat("\n");
    cat(" CALLER B GENES (chr 1,2) GROUPED: \n"); print(head(callerB.annotated.grouped)); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools groupby -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d568d6f6c8.bed -g 1,2,3 -c 7 -o collapse
##  * Processing input (1): i
## CONVERT TO BED
##  * Checking input type... PASS
##    Input seems to be in bed format but chr/start/end column names are missing
## VALIDATE REGIONS
##  * Check if index is a string... PASS
##  * Check index pattern... PASS
##  * Check for missing values... PASS
##  * Check for larger start position... PASS.
##  * Check if zero based... PASS
##  * Checking sort order... FAIL
##    The input for object is not *lexographically* ordered!
##    This can cause unexpected results for some set operations.
##    try: x <- bedr.sort.region(x)
##  * Checking for overlapping 'contiguous' regions... FAIL
##    The input for object has overlapping features!
##    This can cause unexpected results for some set operations.
##    i.e. x <- bedr.merge.region(x)
##    bedtools groupby -i /var/folders/ch/h71q7lwx2wdb07ry2mlp337r000k4d/T//RtmpZs9Ufg/i_62d54353986.bed -g 1,2,3 -c 7 -o collapse
##  CALLER A GENES (chr 1,2) GROUPED: 
##                          V1       V2       V3                        V4
## chr1:32176186-32193834 chr1 32176186 32193834 NM_001294336,NM_001294335
## chr1:36419037-36421454 chr1 36419037 36421454       NM_024852,NM_177422
## chr1:40302119-40310286 chr1 40302119 40310286                 NM_017646
## chr1:52748187-52752298 chr1 52748187 52752298       NM_007324,NM_004799
## chr1:62421145-62424426 chr1 62421145 62424426                 NM_176877
## chr1:67068898-67075815 chr1 67068898 67075815    NM_001308203,NM_032291
## 
##  CALLER B GENES (chr 1,2) GROUPED: 
##                            V1        V2        V3
## chr1:235776110-235776340 chr1 235776110 235776340
## chr1:236025376-236025711 chr1 236025376 236025711
## chr1:236225022-236225405 chr1 236225022 236225405
## chr1:236440947-236441376 chr1 236440947 236441376
## chr1:236594960-236595252 chr1 236594960 236595252
## chr1:236882374-236882744 chr1 236882374 236882744
##                                                                     V4
## chr1:235776110-235776340 NR_039973,NM_001098722,NM_001098721,NM_004485
## chr1:236025376-236025711              NM_000081,NM_001301365,NR_102436
## chr1:236225022-236225405                                     NM_002508
## chr1:236440947-236441376                                     NM_019891
## chr1:236594960-236595252                           NM_145861,NM_080738
## chr1:236882374-236882744           NM_001278343,NM_001278344,NM_001103

You may also want to summarise the counts of overlapping and unique sub-regions. This can be done by bedr.plot.region method which displays the summary as a venn diagram.

if (check.binary("bedtools")) {

    # merge and plot overlapping regions between the two callers with genes 
    # on chromosome 1 and 2
    callerA.merged <- callerA.annotated[, c('a.CHROM', 'a.POS', 'a.END')];
    callerB.merged <- callerB.annotated[, c('b.CHROM', 'b.POS', 'b.END')];
    callerA.merged <- bedr.merge.region(callerA.merged);
    callerB.merged <- bedr.merge.region(callerB.merged);

    # plot (sub-regions exclusive to a and b, and sub-regions in common)
    bedr.plot.region(
        input = list(
            a = callerA.merged, 
            b = callerB.merged
            ),
        params = list(lty = 2, label.col = "black", main = "Genes Overlap"),
        feature = 'interval',
        verbose = FALSE
        );
    }

The venn diagram shows the number of sub-regions on chromosome 1 and 2 which are common between the two callers, and the ones which are exclusive to the callers.

Example Workflow 3: Copy Number Recurrence

In this example, we show bedr can be used to process CNA segmented data to estimate percent genome altered and identify minimal common regions. The copy-number data used is a subset of TCGA COAD study. Note: For efficiency reason, this is a very coarse example of calling recurrent copy number regions, and only to demonstrate how bedr can be used alongside other copy number tools.

if (check.binary("bedtools")) {

    # read copy number segmented (regions) data
    cna.file <- system.file("extdata/CNA.segmented.txt.gz", package = "bedr");
    cna.data <- read.table(cna.file, header = TRUE, stringsAsFactors = FALSE);
    cna.data$Chromosome <- paste("chr", cna.data$Chromosome, sep = "");

    # check if regions are valid
    valid.segments <- is.valid.region(cna.data[, c("Chromosome", "Start", "End")]);
    cat(" VALID REGIONS: ", length(which(valid.segments) ==  TRUE), "/", length(valid.segments), "\n");

    # restrict to copy number regions with |log-ratio| > 0.10
    cna.data <- cna.data[which(abs(cna.data$Segment_Mean) > 0.10), ];

    # create a list data-structure for every patient's copy number data
    # and sort them
    cna.data.gain <- list();
    cna.data.loss <- list();
    pga.gain <- list();
    pga.loss <- list();
    sample.ids <- unique(cna.data$Sample);
    hg19.size <- 3137161264;
    for (sample.id in sample.ids) {

        # extract sample specific gains
        gain.index <- which(cna.data$Sample == sample.id & cna.data$Segment_Mean > 0);
        if (length(gain.index) > 0) {

            cna.data.gain[[sample.id]] <- cna.data[
                gain.index,
                2:4
                ];

            # sort
            cna.data.gain[[sample.id]] <- bedr.sort.region(
                x = cna.data.gain[[sample.id]],
                method = "natural",
                verbose = FALSE
                );

            # estimate percent genome gained
            pga.gain[[sample.id]] <- sum(
                apply(
                    cna.data.gain[[sample.id]],
                    1,
                    FUN = function(x) { return( (as.numeric(x[3]) - as.numeric(x[2])) ); }
                    )
                );
            pga.gain[[sample.id]] <- pga.gain[[sample.id]]/hg19.size*100;
            }

        # extract sample specific losses
        loss.index <- which(cna.data$Sample == sample.id & cna.data$Segment_Mean < 0);
        if (length(loss.index) > 0) {

            cna.data.loss[[sample.id]] <- cna.data[
                loss.index,
                2:4
                ];

            # sort
            cna.data.loss[[sample.id]] <- bedr.sort.region(
                x = cna.data.loss[[sample.id]],
                method = "natural",
                verbose = FALSE
                );

            # estimate percent genome loss
            pga.loss[[sample.id]] <- sum(
                apply(
                    cna.data.loss[[sample.id]],
                    1,
                    FUN = function(x) { return( (as.numeric(x[3]) - as.numeric(x[2])) ); }
                    )
                );
            pga.loss[[sample.id]] <- pga.loss[[sample.id]]/hg19.size*100;
            }
        }
    }

You may want to see the distribution of Percent Genome Altered (Gain and Loss), as shown below:

if (check.binary("bedtools")) {

    # set graphics params
    par(mfrow = c(1, 2), las = 1);

    # plot histograms of Gain and Loss frequencies
    hist(unlist(pga.gain), xlab = "Percent Gain", main = "Percent Genome Altered (Gain)");
    hist(unlist(pga.loss), xlab = "Percent Loss", main = "Percent Genome Altered (Loss)");
    }

Lets find minimal common regions either gained and deleted across patients:

if (check.binary("bedtools")) {

    # find minimal common regions (gain)
    mcr.gain <- bedr.join.multiple.region(
        x = cna.data.gain,
        species = "human",
        build = "hg19",
        check.valid = FALSE,
        check.sort = FALSE,
        check.merge = FALSE,
        verbose = FALSE
        );

    # find minimal common regions (loss)
    mcr.loss <- bedr.join.multiple.region(
        x = cna.data.loss,
        species = "human",
        build = "hg19",
        check.valid = FALSE,
        check.sort = FALSE,
        check.merge = FALSE,
        verbose = FALSE
        );

    # reorder by frequency of recurrence
    mcr.gain <- mcr.gain[order(as.numeric(mcr.gain$n.overlaps), decreasing = TRUE), ];
    mcr.loss <- mcr.loss[order(as.numeric(mcr.loss$n.overlaps), decreasing = TRUE), ];

    # print
    cat(" RECURRENT CNA GAINS \n"); print(head(mcr.gain[, 1:5])); cat("\n");
    cat(" RECURRENT CNA LOSSES \n"); print(head(mcr.loss[, 1:5])); cat("\n");
    }

##   * Checking path for bedtools... PASS
##     /Users/shaider/Desktop/phd_work/SW/bedtools2/bin/bedtools
##  RECURRENT CNA GAINS 
##        V1        V2        V3 n.overlaps
## 1219 chr8 121908427 121952407         19
## 844  chr7  40179918  40214196         18
## 1218 chr8 121903972 121908427         18
## 1220 chr8 121952407 121955784         18
## 791  chr7   4233222   4233356         17
## 796  chr7   7676343   7767752         17
##                                                                                                                                                                                        names
## 1219 Sample.2,Sample.4,Sample.6,Sample.7,Sample.8,Sample.9,Sample.10,Sample.11,Sample.12,Sample.14,Sample.19,Sample.25,Sample.28,Sample.34,Sample.37,Sample.43,Sample.46,Sample.49,Sample.52
## 844          Sample.2,Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.20,Sample.21,Sample.22,Sample.25,Sample.28,Sample.37,Sample.40,Sample.43,Sample.46,Sample.49
## 1218          Sample.2,Sample.4,Sample.6,Sample.7,Sample.8,Sample.10,Sample.11,Sample.12,Sample.14,Sample.19,Sample.25,Sample.28,Sample.34,Sample.37,Sample.43,Sample.46,Sample.49,Sample.52
## 1220          Sample.2,Sample.4,Sample.6,Sample.8,Sample.9,Sample.10,Sample.11,Sample.12,Sample.14,Sample.19,Sample.25,Sample.28,Sample.34,Sample.37,Sample.43,Sample.46,Sample.49,Sample.52
## 791                    Sample.2,Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.25,Sample.28,Sample.34,Sample.37,Sample.40,Sample.43,Sample.46,Sample.49,Sample.52
## 796                    Sample.2,Sample.4,Sample.6,Sample.8,Sample.10,Sample.11,Sample.14,Sample.16,Sample.19,Sample.25,Sample.28,Sample.37,Sample.40,Sample.43,Sample.46,Sample.49,Sample.52
## 
##  RECURRENT CNA LOSSES 
##        V1        V2        V3 n.overlaps
## 1760 chr8  19285726  19286253         16
## 738  chr4  27782716  27784717         15
## 744  chr4  29318239  29318249         15
## 969  chr4 156319821 156320512         15
## 975  chr4 158414039 158463259         15
## 977  chr4 158465669 158468864         15
##                                                                                                                                                              names
## 1760 Sample.4,Sample.8,Sample.10,Sample.12,Sample.14,Sample.15,Sample.16,Sample.19,Sample.25,Sample.28,Sample.34,Sample.37,Sample.40,Sample.43,Sample.46,Sample.52
## 738             Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.25,Sample.28,Sample.35,Sample.37,Sample.40,Sample.43,Sample.49,Sample.52
## 744             Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.23,Sample.25,Sample.35,Sample.37,Sample.40,Sample.43,Sample.49,Sample.52
## 969               Sample.2,Sample.3,Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.25,Sample.28,Sample.40,Sample.43,Sample.46,Sample.49,Sample.52
## 975              Sample.2,Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.25,Sample.28,Sample.40,Sample.43,Sample.46,Sample.49,Sample.52
## 977              Sample.2,Sample.4,Sample.6,Sample.8,Sample.10,Sample.14,Sample.16,Sample.19,Sample.25,Sample.28,Sample.40,Sample.43,Sample.46,Sample.49,Sample.52

Introduction to bedr package

Author(s): Syed Haider, Daryl Waggott, Emilie Lalonde, Clement Fung, Paul C. Boutros

Dated: 2019-04-01

Table of Contents

Introduction

Third Party Tools