Standard Workflow

Preprocessing

The preprocessing pipeline consists of the following steps:

Filtering Gencode annotations to extract “transcript” annotations.
Distinguishing between protein-coding (MANE_Select) and long non-coding (lncRNA, Ensembl_canonical) transcripts.
Dividing transcripts into windows of a user-defined size (windsize).
Processing bedgraph files to retrieve signal values, excluding blacklisted regions, and retaining scores within high-mappability intervals.
Generating a final annotated table incorporating the scores derived from the preceding steps.

structure

Important: The preprocessing function allows for saving intermediate objects, which prevents recomputation in cases of failure due to memory or time limits typically set in HPC parameters. See the saveobjectpath and deletetmp parameters in the ?preprocessing help documentation for details. Resource requirements for processing the complete dataset are provided below.

nb CPU	RAM	Time
15	113.5 G	3h47
10	78.9 G	4h24
7	57.4 G	5h27
5	43G	6h58
3	28.7G	9h02

The testing dataset below is limited to a portion of the chromosome 13 and one control replicate.

Input Files

The following input files are required for the analysis:

exptab: A table describing the experiments, containing the columns “condition,” “replicate,” “direction,” “strand,” and “path.” The “direction” column should contain the values “forward” and “reverse,” and the “strand” column should contain “plus” and “minus.” The ?checkexptab utility function can be used to verify the table’s format.
gencode: The Gencode annotation file, available at https://www.gencodegenes.org/.
maptrack: Files containing low-mappability intervals for the genome of interest. For hg38, the track k50.Unique.Mappability.bb was downloaded from the UCSC server, and scores below 0.8 were set to NA.
blacklist: Blacklist files for various organisms are available at https://github.com/Boyle-Lab/Blacklist/tree/master/lists.

Retrieving Transcript Annotations from Gencode

The ?retrieveanno function performs the following steps:

Reads and validates experimental data from the provided exptab CSV file.
Reads genomic annotations from the Gencode file and filters for “transcript” entries.
Processes protein-coding and long non-coding RNA (lncRNA) transcripts separately:
- For protein-coding genes, it selects the most representative transcript (MANE_Select or Ensembl_canonical).
- For lncRNAs, it filters out transcripts with undesirable evidence levels (containing keywords “not_best_in_genome_evidence,” “transcript_support_level 5,” or “transcript_support_level 4”).
Combines the processed annotations into a single data frame, labeling each transcript with its biotype (protein-coding or lncRNA).
Optionally saves the resulting data frame as an RDS file in the specified saveobjectpath directory.
Optionally reports the total analysis time (if showtime = TRUE).

allannobed <- retrieveanno(expprepath, gencodepath, verbose = FALSE)
message("\n The result is:\n")

## 
##  The result is:

print(head(allannobed, 3))

##   chrom start  end           ensembl  symbol strand        biotype
## 1 chr13     1  800 ENST00000400113.8  TUBA3C      - protein-coding
## 2 chr13  1000 1600 ENST00000623511.1 FAM230C      +         lncRNA

Defining Windows for Each Transcript

The ?makewindows function performs the following operations:

Removes any Ensembl transcript names containing “PAR_Y.”
Filters out intervals smaller than windsize.
Divides each remaining transcript into windows of size windsize.

The output includes metadata for each window, such as its chromosome, start and end coordinates, associated gene, and window number.

allwindowsbed <- makewindows(allannobed, windsize, verbose = FALSE)
message("\n The result is:\n")

## 
##  The result is:

print(head(allwindowsbed, 3))

##          biotype   chr coor1 coor2        transcript   gene strand window
## 1 protein-coding chr13     1     5 ENST00000400113.8 TUBA3C      -      1
## 2 protein-coding chr13     5     9 ENST00000400113.8 TUBA3C      -      2
## 3 protein-coding chr13     9    13 ENST00000400113.8 TUBA3C      -      3

Retrieving Bedgraph Scores and Filtering Blacklisted and Low-Mappability Regions

The ?blacklisthighmap function iterates through each chromosome, processing genomic scores by removing those that overlap with blacklisted and low-mappability regions. It also calculates weighted means for scores within each window. This function leverages parallel processing for efficiency and supports saving (saveobjectpath) and reloading (reload) intermediate results to optimize the workflow.

The main steps include:

Reading and processing bedGraph values.
Removing scores overlapping with blacklisted and low-mappability regions.
Calculating weighted means for overlapping scores within each genomic window.
Saving the processed results to a specified temporary folder (tmpfold).

If you did not execute the “quick start” section code, you need to first copy the bedgraph files to your current folder:

expprepath <-  system.file("extdata", "exptab-preprocessing.csv", package="tepr")
expdfpre <- read.csv(expprepath)
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
    package = "tepr"))
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE, quote = FALSE)
exptabpath <- "exptab-preprocessing.csv"

Note that this function does not return a value; instead, it saves the processed files to a temporary folder for use by the subsequent ?createtablescores function.

## Retrieving bedgraph scores, removing those in black list and low mappability intervals
blacklisthighmap(maptrackpath, blacklistpath, expprepath, nbcputrans = 1, allwindowsbed, windsize, genomename = "hg38", tmpfold = file.path(tempdir(), "tmptepr"))

Creating the Final Table

The ?createtablescores function merges the files stored in the previously produced temporary folder that belong to the same experiment and direction. These files are combined into a single table, with two columns per experiment: one containing the experiment name and the other containing the corresponding scores. The resulting table also includes annotations for each transcript.

finaltab <- createtablescores(tmpfold = file.path(tempdir(), "tmptepr"), expprepath, savefinaltable = FALSE)

The final table should look like:

##       V1    V2   V3   V4                V5      V6 V7 V8
## 1 lncRNA chr13 1000 1003 ENST00000623511.1 FAM230C  +  1
## 2 lncRNA chr13 1003 1006 ENST00000623511.1 FAM230C  +  2
## 3 lncRNA chr13 1006 1009 ENST00000623511.1 FAM230C  +  3
##                              V9               V10      V11               V12
## 1 ENST00000623511.1_FAM230C_+_1 ctrl_rep1.forward 3.536147 ctrl_rep1.reverse
## 2 ENST00000623511.1_FAM230C_+_2 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## 3 ENST00000623511.1_FAM230C_+_3 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
##   V13
## 1  NA
## 2  NA
## 3  NA

tepr Analysis

The quick start section demonstrates how to perform a tepr analysis in a single call using the ?tepr function. Below, we detail the individual steps performed by ?tepr using a subset of 6 transcripts for demonstration purposes. Plots generated from the complete annotation dataset are also provided. The complete table resulting from preprocessing can be downloaded here. See section 6.1 for notes on processing the complete dataset.

Input Files

Two tables are required as input:

transdf: The table generated by the ?preprocessing function (described in previous sections). This table contains information about each experiment, along with scores for each protein-coding and lncRNA annotation.
exptab: A table describing the experiments, containing the columns “condition”, “replicate”, “direction”, “strand”, and “path”. The “direction” column should contain the values “forward” and “reverse”, and the “strand” column should contain “plus” and “minus”. You can verify the table’s format using the ?checkexptab utility function.

These two tables are included with the package and can be accessed using the ?system.file command within the “extdata” directory. For this demonstration, we have sampled 6 transcripts from the transdf table to reduce computation time.

## Define manually the column names for display purpose
colnames(transdf) <- c("biotype", "chr", "coor1", "coor2", "transcript", "gene",
    "strand", "window", "id", "ctrl_rep1.plus", "ctrl_rep1.plus_score",
    "ctrl_rep1.minus", "ctrl_rep1.minus_score", "ctrl_rep2.plus",
    "ctrl_rep2.plus_score", "ctrl_rep2.minus", "ctrl_rep2.minus_score",
    "HS_rep1.plus", "HS_rep1.plus_score", "HS_rep1.minus",
    "HS_rep1.minus_score", "HS_rep2.plus", "HS_rep2.plus_score",
    "HS_rep2.minus", "HS_rep2.minus_score")

message("The table given by the preprocessing function is:\n")

## The table given by the preprocessing function is:

print(head(transdf, 3))

##          biotype  chr    coor1    coor2        transcript gene strand window
## 1 protein-coding chr7 55019017 55019980 ENST00000275493.7 EGFR      +      1
## 2 protein-coding chr7 55019980 55020943 ENST00000275493.7 EGFR      +      2
## 3 protein-coding chr7 55020943 55021906 ENST00000275493.7 EGFR      +      3
##                           id    ctrl_rep1.plus ctrl_rep1.plus_score
## 1 ENST00000275493.7_EGFR_+_1 ctrl_rep1.forward             1.356774
## 2 ENST00000275493.7_EGFR_+_2 ctrl_rep1.forward             3.657029
## 3 ENST00000275493.7_EGFR_+_3 ctrl_rep1.forward             8.161519
##     ctrl_rep1.minus ctrl_rep1.minus_score    ctrl_rep2.plus
## 1 ctrl_rep1.reverse                    NA ctrl_rep2.forward
## 2 ctrl_rep1.reverse                    NA ctrl_rep2.forward
## 3 ctrl_rep1.reverse                    NA ctrl_rep2.forward
##   ctrl_rep2.plus_score   ctrl_rep2.minus ctrl_rep2.minus_score    HS_rep1.plus
## 1             1.430225 ctrl_rep2.reverse                    NA HS_rep1.forward
## 2             5.461259 ctrl_rep2.reverse                    NA HS_rep1.forward
## 3             9.692524 ctrl_rep2.reverse                    NA HS_rep1.forward
##   HS_rep1.plus_score   HS_rep1.minus HS_rep1.minus_score    HS_rep2.plus
## 1           2.011272 HS_rep1.reverse                  NA HS_rep2.forward
## 2           8.927487 HS_rep1.reverse                  NA HS_rep2.forward
## 3          12.844230 HS_rep1.reverse                  NA HS_rep2.forward
##   HS_rep2.plus_score   HS_rep2.minus HS_rep2.minus_score
## 1           2.068358 HS_rep2.reverse                  NA
## 2           8.943509 HS_rep2.reverse                  NA
## 3          14.917819 HS_rep2.reverse                  NA

message("\n The expdf table contains information about each replicate (here limited to one):\n")

## 
##  The expdf table contains information about each replicate (here limited to one):

head(expdf)

##   condition replicate direction strand                       path
## 1      ctrl         1   forward   plus ctrl_rep1_chr13.forward.bg
## 2      ctrl         1   reverse  minus ctrl_rep1_chr13.reverse.bg
## 3      ctrl         2   forward   plus ctrl_rep2_chr13.forward.bg
## 4      ctrl         2   reverse  minus ctrl_rep2_chr13.reverse.bg
## 5        HS         1   forward   plus   HS_rep1_chr13.forward.bg
## 6        HS         1   reverse  minus   HS_rep1_chr13.reverse.bg

You can verify the table’s format using the ?checkexptab utility function. Note that this verification is also performed by ?averageandfilterexprs.

checkexptab(expdf)

Expressed Transcripts and Average Expression

To study changes in nascent RNA patterns, the initial step is to select expressed transcripts with ?averageandfilterexprs. The expression value of a gene is defined as the mean score across both strands. These scores are derived from the bedgraph files. A gene is considered expressed if its mean expression in all conditions exceeds a specified threshold. The default expression threshold is set to 0.1. If no expressed transcript is returned, an error is thrown.

resallexprs <- averageandfilterexprs(expdf, transdf, expthres, verbose = FALSE)

resallexprs is a list. Its first element is the input transdf with additional columns, such as the calculated mean expression values. The second element is a vector containing the IDs of the transcripts considered expressed. This vector will be used in subsequent downstream analyses.

print(resallexprs[[2]])

## [1] "ENST00000230895.11" "ENST00000274140.10" "ENST00000275493.7" 
## [4] "ENST00000527786.7"  "ENST00000549802.5"  "ENST00000615891.2"

NA Values per Transcript and Condition

During preprocessing, scores within user-defined blacklisted regions, low-mappability regions, or any other excluded regions are replaced with NA values. The ?countna function retrieves the number of missing scores for each transcript. This table can be used for further data filtering, such as removing transcripts with an excessive number of windows lacking scores.

rescountna <- countna(resallexprs, expdf, verbose = FALSE)
print(rescountna)

##                         gene         transcript strand Count_NA
## ENST00000230895.11       DAP ENST00000230895.11      -        0
## ENST00000274140.10   MARCHF6 ENST00000274140.10      +        0
## ENST00000275493.7       EGFR  ENST00000275493.7      +        0
## ENST00000527786.7       FLI1  ENST00000527786.7      +        0
## ENST00000549802.5  LINC01619  ENST00000549802.5      -        0
## ENST00000615891.2      AP5S1  ENST00000615891.2      +        8

Empirical Cumulative Distribution Function (ECDF) for Genes

The ?genesECDF function calculates the empirical cumulative distribution function (ECDF) for expressed genes across their transcripts. An ECDF is a probability distribution that estimates the Cumulative Distribution Function; here, we use the ?ecdf function from R. It generates an ECDF for each transcript, which are then used to assess differences in signal using a Kolmogorov-Smirnov test. The ECDF helps answer the question: How likely is it that we would observe a distribution of signals like this if the signals were drawn from a uniform probability distribution? In other words, this statistic helps identify specific loci where the transcription signal deviates significantly from a uniform density. In the following steps, the ECDF is used to compute the difference from the cumulative distribution (see ?meananddifference), which is subsequently used to calculate the area under the curve (AUC, see ?allauc) and the inflection point or knee (see ?kneeid).

resecdflist <- genesECDF(resallexprs, expdf, verbose = FALSE)

The ?genesECDF function returns a list. The first element is the main table with added ECDF columns, prefixed with Fx.

print(head(as.data.frame(resecdflist[[1]]), 3))

##          biotype  chr    coor1    coor2         transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11  DAP      -    200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11  DAP      -    199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11  DAP      -    198
##                             id         ctrl_rep1         ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
##           HS_rep1         HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse     1              2.424338
## 2 HS_rep1.reverse HS_rep2.reverse     2              9.451467
## 3 HS_rep1.reverse HS_rep2.reverse     3             13.218833
##   value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1             0.5111166            1.647340             2.20375
## 2            14.2957119           11.853574            13.22088
## 3            16.2775011            9.271255            14.74644
##   Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1        0.001792650       0.0002728066       0.00839895       0.01045131
## 2        0.008888557       0.0080750764       0.07086614       0.07315914
## 3        0.018748133       0.0169685727       0.11968504       0.14299287

The second element returns the number of windows per transcript, which is set to 200 by default during preprocessing.

nbwindows <- resecdflist[[2]]
print(nbwindows)

## [1] 200

Mean and Differences of Scores for Each Condition

The ?meandifference function computes three types of statistics: mean score values, mean ECDF for each replicate, and the difference between the mean ECDF and the uniform cumulative distribution. Mean values across replicates are calculated, resulting in one mean_value column per condition. These mean_value score columns are required for subsequent attenuation score calculations (see ?attenuation). The mean ECDF values are used to calculate the difference from the cumulative distribution, which we term diff_Fx. The diff_Fx statistic is used to calculate the area under the curve (AUC, see ?allauc) and the inflection point or knee (see ?kneeid).

The difference is calculated as: diff_Fx = meanECDF - cumvec / nbwindows

resmeandiff <- meandifference(resecdflist[[1]], expdf, nbwindows, verbose = FALSE)
print(head(resmeandiff, 3))

##          biotype  chr    coor1    coor2         transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11  DAP      -    200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11  DAP      -    199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11  DAP      -    198
##                             id         ctrl_rep1         ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
##           HS_rep1         HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse     1              2.424338
## 2 HS_rep1.reverse HS_rep2.reverse     2              9.451467
## 3 HS_rep1.reverse HS_rep2.reverse     3             13.218833
##   value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1             0.5111166            1.647340             2.20375
## 2            14.2957119           11.853574            13.22088
## 3            16.2775011            9.271255            14.74644
##   Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1        0.001792650       0.0002728066       0.00839895       0.01045131
## 2        0.008888557       0.0080750764       0.07086614       0.07315914
## 3        0.018748133       0.0169685727       0.11968504       0.14299287
##   mean_value_ctrl mean_Fx_ctrl diff_Fx_ctrl mean_value_HS  mean_Fx_HS
## 1        1.467727  0.001032728 -0.003967272      1.925545 0.009425128
## 2       11.873589  0.008481817 -0.001518183     12.537228 0.072012643
## 3       14.748167  0.017858353  0.002858353     12.008848 0.131338957
##    diff_Fx_HS Diff_meanvalue_ctrl_HS Diff_meanvalue_HS_ctrl Diff_meanFx_ctrl_HS
## 1 0.004425128             -0.4578180              0.4578180         -0.00839240
## 2 0.062012643             -0.6636391              0.6636391         -0.06353083
## 3 0.116338957              2.7393194             -2.7393194         -0.11348060
##   Diff_meanFx_HS_ctrl
## 1          0.00839240
## 2          0.06353083
## 3          0.11348060

The resulting table is then split into a list of tables, one per transcript. This enables parallel processing by transcript for subsequent operations (see nbcpu parameter in each function).

## Split the results by transcripts
bytranslistmean <- split(resmeandiff, factor(resmeandiff$transcript))

Area Under the Curve (AUC) and Differences in AUC for Transcript Data

The Area Under the Curve (AUC) values enable comparison of the nascent RNA signal to a uniform distribution (representing no signal attenuation, where x = y) or comparison of the difference in signal distribution between two conditions (dAUC). The difference between cumulative densities is estimated using a Kolmogorov-Smirnov test with adjusted p-value. Finally, ?allauc returns results by transcript, along with a mean value termed MeanValueFull.

For each condition (if more than one), the AUC of diff_Fx (described in the previous section) is computed using a trapezoidal approximation (see pracma::trapz):

AUC = pracma::trapz(transcoord, diff_Fx)

where transcoord represents the positions along the transcript and diff_Fx is the difference from the cumulative distribution.

## Calculate Area Under Curve (AUC) and Differences of AUC for Transcript Data
resauc <- allauc(bytranslistmean, expdf, nbwindows, verbose = FALSE)
print(head(resauc, 3))

##    gene         transcript strand window_size  AUC_ctrl  p_AUC_ctrl D_AUC_ctrl
## 1 AP5S1  ENST00000615891.2      +          41 19.405661 0.008635679      0.165
## 2   DAP ENST00000230895.11      -         410  5.830076 0.987410626      0.045
## 3  EGFR  ENST00000275493.7      +         963  7.515544 0.544142503      0.080
##   MeanValueFull_ctrl   AUC_HS     p_AUC_HS D_AUC_HS MeanValueFull_HS
## 1           3.001846 10.85386 1.122497e-01     0.12         2.561331
## 2           7.927029 65.58938 1.203856e-28     0.57         1.003308
## 3           6.210447 56.67197 3.857500e-22     0.50         3.330179
##   adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS dAUC_Diff_meanFx_HS_ctrl
## 1        0.02590704    1.346996e-01                -8.551806
## 2        0.98741063    3.611568e-28                59.759300
## 3        0.81621375    7.714999e-22                49.156430
##   p_dAUC_Diff_meanFx_HS_ctrl D_dAUC_Diff_meanFx_HS_ctrl
## 1               5.224189e-02                      0.135
## 2               9.396710e-26                      0.540
## 3               7.407064e-21                      0.485
##   adjFDR_p_dAUC_Diff_meanFx_HS_ctrl
## 1                      6.269027e-02
## 2                      5.638026e-25
## 3                      1.481413e-20

Knee and Max ECDF Differences for Each Transcript

To assess attenuation in the nascent RNA signal, it’s necessary to identify significant changes in signal distribution. tepr accomplishes this by identifying the maximum diff_Fx. In other words, it pinpoints the greatest variation in the signal progression of the empirical cumulative distribution (see ?genesECDF and ?meandifference). The window where this maximum change occurs is termed the knee and is calculated as: knee = max(diff_Fx). Thus, the knee position is defined as the maximum difference between the ECDF and the y=x curve.

?kneeid returns both the position (Knee_AUC) and the value (diff_Fx) of the knee.

resknee <- kneeid(bytranslistmean, expdf, verbose = FALSE)
print(resknee)

##                            transcript knee_AUC_ctrl max_diff_Fx_ctrl
## ENST00000230895.11 ENST00000230895.11            49       0.04433176
## ENST00000274140.10 ENST00000274140.10            49       0.04980327
## ENST00000275493.7   ENST00000275493.7            94       0.07642503
## ENST00000527786.7   ENST00000527786.7            61       0.10785215
## ENST00000549802.5   ENST00000549802.5            59       0.53104881
## ENST00000615891.2   ENST00000615891.2           100       0.16331981
##                    knee_AUC_HS max_diff_Fx_HS
## ENST00000230895.11          34     0.56852906
## ENST00000274140.10          76     0.07198283
## ENST00000275493.7           70     0.49569073
## ENST00000527786.7           78     0.49396784
## ENST00000549802.5           48     0.58517400
## ENST00000615891.2           65     0.11664132

Attenuation

The ?attenuation function calculates the mean score of all window means before the knee location and the mean score of all window means after the knee location. It then calculates an “attenuation score” using the following formula: att <- 100 - (downmean / upmean) x 100 (x for multiplication). The function returns three values, stored in the columns “UP_mean_,” “DOWN_mean_,” and “Attenuation_,” representing att, downmean, and upmean, respectively.

resatt <- attenuation(resauc, resknee, rescountna, bytranslistmean, expdf,
    resmeandiff, verbose = FALSE)
print(head(resatt, 3))

##           transcript    gene strand  chr    coor1    coor2   size window_size
## 1 ENST00000230895.11     DAP      - chr5 10679230 10761234  82005         410
## 2 ENST00000274140.10 MARCHF6      + chr5 10353695 10440388  86694         433
## 3  ENST00000275493.7    EGFR      + chr7 55019017 55211628 192612         963
##   AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl    AUC_HS     p_AUC_HS
## 1 5.830076  0.9874106      0.045           7.927029 65.589375 1.203856e-28
## 2 5.046713  0.9639452      0.050           8.032174  7.447405 6.271671e-01
## 3 7.515544  0.5441425      0.080           6.210447 56.671974 3.857500e-22
##   D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1    0.570         1.003308         0.9874106    3.611568e-28
## 2    0.075         9.129133         0.9874106    6.271671e-01
## 3    0.500         3.330179         0.8162138    7.714999e-22
##   dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1                59.759300               9.396710e-26
## 2                 2.400692               9.874106e-01
## 3                49.156430               7.407064e-21
##   D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1                      0.540                      5.638026e-25            49
## 2                      0.045                      9.874106e-01            49
## 3                      0.485                      1.481413e-20            94
##   max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1       0.04433176          34     0.56852906        0     9.375234
## 2       0.04980327          76     0.07198283        0     9.662655
## 3       0.07642503          70     0.49569073        0     7.209353
##   DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1       7.464112         20.38480   4.364260    0.3201154       92.66507
## 2       7.522987         22.14369  10.803586    8.1156322       24.88020
## 3       5.346316         25.84194   8.049177    0.8118397       89.91400

Defining the Universe and Groups

Universe

Prior to further downstream analysis, the set of transcripts to be studied must be defined. These transcripts are designated as belonging to the “Universe”. tepr selects a transcript for inclusion in the “Universe” if it meets the following criteria:

Its windows are sufficiently long.
It does not contain too many missing values (NA).
Its mean expression value is sufficiently high.
It has a significant Kolmogorov-Smirnov test adjusted p-value. This test compares the ECDF distribution to the theoretical cumulative distribution.

Analytically, this can be represented as:

window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
meanstress > meanstressthreshold &
pvaltheory > pvaltheorythres

If only one condition is provided, only the control threshold is used:

window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
pvaltheory > pvaltheorythres

Appropriate threshold values for each criterion depend on the specific experiments. The user can adjust these thresholds throughout the analysis. A transcript that satisfies the above criteria is defined as belonging to the “Universe” of transcripts to be analyzed. The Universe column in the data will contain TRUE or FALSE to indicate membership.

Groups

The transcripts in the Universe are further categorized into Groups. A transcript belongs to the “Attenuated” group if it meets the following criteria:

It belongs to the Universe (Universe == TRUE).
Its “stress” AUC value is sufficiently high (“stress” is opposed to “control” and can be renamed).
The negative base-10 logarithm of the adjusted p-value from the Kolmogorov-Smirnov test exceeds a given threshold.

Analytically, this is represented as:

Universe == TRUE &
aucstress > aucstressthreshold &
-log10(pvalks) > attenuatedpvalksthreshold

If only one condition is provided, no comparison via the Kolmogorov-Smirnov test with another condition is performed. A transcript will be considered attenuated if it deviates significantly from the theoretical cumulative distribution (y = x) and its AUC value is sufficiently high. The significant deviation has already been evaluated during the Universe calculation (pvaltheory > pvaltheorythres). Therefore, with only one condition, a transcript is considered attenuated if:

Universe == TRUE & aucctrl > aucctrlthreslower

A transcript is considered non-attenuated (“Outgroup”) if it meets the following criteria:

The Kolmogorov-Smirnov adjusted p-value is above a specified threshold.
The control AUC falls between two user-defined thresholds.

Analytically, this is represented as:

Universe == TRUE & pvalks > outgrouppvalksthreshold &
aucctrl > aucctrlthresoldhigher & aucctrl < aucctrlthresholdlower

For a single-condition analysis, where no Kolmogorov-Smirnov test against another condition is performed, a transcript belongs to the “Outgroup” if it belongs to the Universe and its control AUC falls between two user-defined thresholds:

Universe == TRUE & aucctrl > aucctrlthresoldhigher & aucctrl < aucctrlthresholdlower

Note that transcripts belonging to neither “Attenuated” nor “Outgroup” are assigned NA to the Group column. The Universe and Group columns are computed with the ?universegroup function:

res <- universegroup(resatt, expdf, verbose = FALSE)
print(head(res, 2))

##   Universe      Group         transcript    gene strand  chr    coor1    coor2
## 1     TRUE Attenuated ENST00000230895.11     DAP      - chr5 10679230 10761234
## 2     TRUE   Outgroup ENST00000274140.10 MARCHF6      + chr5 10353695 10440388
##    size window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl    AUC_HS
## 1 82005         410 5.830076  0.9874106      0.045           7.927029 65.589375
## 2 86694         433 5.046713  0.9639452      0.050           8.032174  7.447405
##       p_AUC_HS D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 1.203856e-28    0.570         1.003308         0.9874106    3.611568e-28
## 2 6.271671e-01    0.075         9.129133         0.9874106    6.271671e-01
##   dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1                59.759300               9.396710e-26
## 2                 2.400692               9.874106e-01
##   D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1                      0.540                      5.638026e-25            49
## 2                      0.045                      9.874106e-01            49
##   max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1       0.04433176          34     0.56852906        0     9.375234
## 2       0.04980327          76     0.07198283        0     9.662655
##   DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1       7.464112         20.38480    4.36426    0.3201154       92.66507
## 2       7.522987         22.14369   10.80359    8.1156322       24.88020

Plotting

tepr provides visualization capabilities for: the cumulative transcription density along a selected transcription unit (?plotecdf); the comparison of AUC between two conditions (?plotauc); the average transcription density of a user-selected set of transcripts (?plotmetagenes); and a histogram of the distance between knees and transcription start sites (TSS, ?plothistoknee).

`plotecdf`

The empirical cumulative distribution function (ECDF) for a specific transcript (EGFR in this example) can be visualized using:

colvec <- c("#90AFBB", "#10AFBB", "#FF9A04", "#FC4E07")
plotecdf(resmeandiff, res, expdf, "EGFR", colvec, plot = TRUE, verbose = FALSE)

`plotauc`

AUC values between conditions can be compared by highlighting the two conditions on the plot (only 6 points appearing):

plotauc(res, expdf, plottype = "groups", plot = TRUE)

The plot generated using the complete dataset is:

AUC group plot

It is also possible to compare the AUC by coloring points according to the Kolmogorov-Smirnov test adjusted p-values and by indicating graphically the points corresponding to a group of genes (here defined by genevec):

genevec <- c("EGFR", "DAP", "FLI1", "MARCHF6", "LINC01619")
plotauc(res, expdf, genevec, plot = TRUE)

The plot generated using the complete dataset is:

AUC pval plot

`plotmetagenes`

The average transcription density along a meta-transcript or meta-gene can be visualized as follows:

plotmetagenes(res, resmeandiff, expdf, plottype = "attenuation", plot = TRUE)

The plot on the complete dataset gives:

Attenuation meta

Note that the plottype parameter, set to “attenuation” in this example, specifies that the distribution should be plotted for transcripts identified as “attenuated” by the ?universegroup function. Other options for plottype include “outgroup” (for transcripts categorized as “outgroup”), “universe” (for all transcripts in the Universe), and “all” (for all transcripts in the initial table).

`plothistoknee`

The distribution of knee distances from the TSS can be visualized as a percentage:

plothistoknee(res, plot = TRUE)

The plot generated using the complete dataset is:

histo percent

The distances can also be visualized in kilobases (kb) by setting the plottype parameter to “kb”:

plothistoknee(res, plottype = "kb", plot = TRUE)

The plot generated using the complete dataset is:

histo kb

Analysis of More Than Two Conditions

`teprmulti` Analysis

Analyzing more than two experimental conditions is possible using the ?teprmulti function. This function performs an “all vs. all” comparison and returns a list. Each element of this list corresponds to a specific pairwise comparison and contains another list with two data frames:

resmeandiff_<comparison>: The results of the ?meandifference function for the specified two-condition comparison.
resunigroupatt_<comparison>: The results of the ?universegroup function for the specified two-condition comparison.

resteprmulti <- teprmulti(expdf, transdf, expthres)

The ?showallcomp function returns a vector of all comparison names that ?teprmulti will perform. This allows users to reduce the number of comparisons by using the dontcompare parameter. dontcompare accepts a vector of comparison names to exclude. The following code limits ?teprmulti to only three comparisons:

## Returns a vector with all the comparison names to exclude
dontcompvec <- showallcomp(expdf)
dontcompvec <- dontcompvec[- c(1,2,3)]

resteprmulti <- teprmulti(expdf, transdf, expthres, dontcompare = dontcompvec)

`plotmulti`

Using the list of results from ?teprmulti, all plots for all comparisons can be generated at once using ?plotmulti. This function iterates through each element of resteprmulti (each element representing a two-condition comparison) and calls the following functions:

?plotecdf: Generates a figure for each gene specified in ecdfgenevec.
?plotauc: Generates figures with options “group” and “p-value”. The “p-value” option figure is not generated if genaucvec = NA.
?plotmetagenes: Generates figures with options “attenuation”, “outgroup”, “universe”, and “all”.
?plothistoknee: Generates figures with options “percent” and “kb”.

Figures are written to the directory specified by outfold, and subfolders are automatically created for each comparison.

plotmulti(resteprmulti, expdf, ecdfgenevec = c("CDC27", "BCAR1", "TRAM2"), outfold = "./multicomp")

Calculating Knee for Each Condition

The ?kneeallconds function allows retrieving knee values processing each condition independently (see next section).

kneedf <- kneeallconds(alldf, expdf, expthres)

Single Condition Analysis

If your dataset contains only one condition (with one or more replicates), you can analyze the data by comparing the observed signal to a linear “theoretical” distribution representing a uniform signal y = x. The calculation of differences in means and AUC is skipped when running ?meandifference and ?allauc. The criteria for defining the “Universe” and “Group” columns also differ (see details in the previous section).

## The experiment table is limited to one condition and one replicate
expdfonecond <- expdf[which(expdf$condition == "HS" & expdf$replicate == 1), ]

## The table obtained by preprocessing is limited to the condition 'HS' and replicate 1
transdfonecond <- transdf[, c(seq_len(9), 18, 19, 20, 21)]

## Computing the object 'res' with tepr on one condition
resonecond <- tepr(expdfonecond, transdfonecond, expthres, controlcondname = "HS", verbose = FALSE)

Because AUC and metagene plots require two conditions, they cannot be generated in a single-condition analysis. However, the ECDF for a given gene can be obtained by considering the y = x distribution as follows:

plotecdf(resonecond[[1]], resonecond[[2]], expdfonecond, genename = "EGFR",
    colvec = c("#90AFBB"), plot = TRUE, verbose = FALSE)

The histogram of knee positions can be obtained with:

## Randomly marking 3 transcripts as attenuated as a mock example
idxatt <- sample(seq_len(6), 3)
resonecond[[2]][idxatt, "Group"] <- "Attenuated"
resonecond[[2]][idxatt, "Universe"] <- TRUE
plothistoknee(resonecond[[2]], kneename = "knee_AUC_HS", plottype = "percent", plot = TRUE, verbose = FALSE)

Transcription Elongation Profiling

Victor Billon, Nicolas Descostes, and Gael Cristofari

Introduction

Getting Help

Quick Start

Data

Description

Download

Standard Workflow

Preprocessing

Input Files

Retrieving Transcript Annotations from Gencode

Defining Windows for Each Transcript

Retrieving Bedgraph Scores and Filtering Blacklisted and Low-Mappability Regions

Creating the Final Table

tepr Analysis

Input Files

Expressed Transcripts and Average Expression

NA Values per Transcript and Condition

Empirical Cumulative Distribution Function (ECDF) for Genes

Mean and Differences of Scores for Each Condition

Area Under the Curve (AUC) and Differences in AUC for Transcript Data

Knee and Max ECDF Differences for Each Transcript

Attenuation

Defining the Universe and Groups

Plotting

`plotecdf`

`plotauc`

`plotmetagenes`

`plothistoknee`

Analysis of More Than Two Conditions

`teprmulti` Analysis

`plotmulti`

Calculating Knee for Each Condition

Single Condition Analysis

Annex

Note on Processing the Complete Dataset

Transcription Elongation Profiling

Victor Billon, Nicolas Descostes, and Gael Cristofari

Introduction

Getting Help

Quick Start

Data

Description

Download

Standard Workflow

Preprocessing

Input Files

Retrieving Transcript Annotations from Gencode

Defining Windows for Each Transcript

Retrieving Bedgraph Scores and Filtering Blacklisted and Low-Mappability Regions

Creating the Final Table

tepr Analysis

Input Files

Expressed Transcripts and Average Expression

NA Values per Transcript and Condition

Empirical Cumulative Distribution Function (ECDF) for Genes

Mean and Differences of Scores for Each Condition

Area Under the Curve (AUC) and Differences in AUC for Transcript Data

Knee and Max ECDF Differences for Each Transcript

Attenuation

Defining the Universe and Groups

Plotting

plotecdf

plotauc

plotmetagenes

plothistoknee

Analysis of More Than Two Conditions

teprmulti Analysis

plotmulti

Calculating Knee for Each Condition

Single Condition Analysis

Annex

Note on Processing the Complete Dataset

`plotecdf`

`plotauc`

`plotmetagenes`

`plothistoknee`

`teprmulti` Analysis

`plotmulti`