tepr (Transcription Elongation Profile) is an R package designed for analyzing data from nascent RNA sequencing technologies, such as TT-seq, mNET-seq, and PRO-seq. It calculates the probability distribution of nascent RNA sequencing signal across the gene body or transcription unit of a given gene. By comparing this profile to a uniform signal, tepr can identify transcription attenuation sites. Furthermore, it can detect increased or decreased transcription attenuation by comparing profiles across different conditions. Beyond its rigorous statistical testing and high sensitivity, a key strength of tepr is its ability to resolve the elongation pattern of individual genes, including the precise location of the primary attenuation point, when present. This capability allows users to visualize and refine genome-wide aggregated analyses, enabling the robust identification of effects specific to gene subsets. These metrics facilitate comparisons between genes within a condition, across conditions for the same gene, or against a theoretical model of perfect uniform elongation.
For any questions or bug reports, please open an issue on GitHub.
The following example demonstrates a complete run of the package. Preprocessing is performed on chromosome 13, and postprocessing is limited to the first 100 transcripts. The system.file command retrieves file paths from the package’s “extdata” directory.
library(tepr)
#########
# Pre-processing - takes ~ 21 seconds
#########
## Parameters
expprepath <- system.file("extdata", "exptab-preprocessing.csv", package="tepr")
windsize <- 200
## Read input tables
expdfpre <- read.csv(expprepath)
## Retrieving the bedgraph paths
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
package = "tepr"))
## Replace the path column of expdfpre with the previously retrieved paths
## and writing the new experiment file to the current folder
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE, quote = FALSE)
expprepath <- "exptab-preprocessing.csv"
gencodepath <- system.file("extdata", "gencode-chr13.gtf", package = "tepr")
maptrackpath <- system.file("extdata", "k50.umap.chr13.hg38.0.8.bed",
package = "tepr")
blacklistpath <- system.file("extdata", "hg38-blacklist-chr13.v2.bed",
package = "tepr")
genomename <- "hg38"
## The lines below are optional. The chromosome info can be retrieved automatically
## Make chromosome info retrieval explicit for building the vignette
chromtabtest <- rtracklayer::SeqinfoForUCSCGenome(genomename)
allchromvec <- GenomeInfoDb::seqnames(chromtabtest)
chromtabtest <- chromtabtest[allchromvec[which(allchromvec == "chr13")], ]
finaltab <- preprocessing(expprepath, gencodepath, windsize, maptrackpath,
blacklistpath, genomename, finaltabpath = tempdir(), finaltabname = "anno.tsv",
chromtab = chromtabtest, showtime = FALSE, verbose = FALSE)
#########
# tepr analysis - takes ~ 1 seconds
#########
## Parameters (transpath limited to 6 transcripts)
exppath <- system.file("extdata", "exptab.csv", package="tepr")
transpath <- system.file("extdata", "cugusi_6.tsv", package="tepr")
expthres <- 0.1
## Read input tables
expdf <- read.csv(exppath)
transdf <- read.delim(transpath, header = FALSE)
reslist <- tepr(expdf, transdf, expthres, showtime = FALSE, verbose = FALSE)
The table produced by the preprocessing function is saved without column headers. The resulting data, using the testing dataset, is:
## biotype chr coor1 coor2 transcript gene strand window
## 1 lncRNA chr13 1000 1003 ENST00000623511.1 FAM230C + 1
## 2 lncRNA chr13 1003 1006 ENST00000623511.1 FAM230C + 2
## 3 lncRNA chr13 1006 1009 ENST00000623511.1 FAM230C + 3
## id dataset.x score.x dataset.y
## 1 ENST00000623511.1_FAM230C_+_1 ctrl_rep1.forward 3.536147 ctrl_rep1.reverse
## 2 ENST00000623511.1_FAM230C_+_2 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## 3 ENST00000623511.1_FAM230C_+_3 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## score.y
## 1 NA
## 2 NA
## 3 NA
The tepr
function returns two tables that are used for plotting different information (see below for explanations and ?tepr
). On the testing dataset it gives:
## The table meandifference:
## biotype chr coor1 coor2 transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11 DAP - 200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11 DAP - 199
## id ctrl_rep1 ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## HS_rep1 HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse 1 2.424338
## 2 HS_rep1.reverse HS_rep2.reverse 2 9.451467
## value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1 0.5111166 1.64734 2.20375
## 2 14.2957119 11.85357 13.22088
## Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1 0.001792650 0.0002728066 0.00839895 0.01045131
## 2 0.008888557 0.0080750764 0.07086614 0.07315914
## mean_value_ctrl mean_Fx_ctrl diff_Fx_ctrl mean_value_HS mean_Fx_HS
## 1 1.467727 0.001032728 -0.003967272 1.925545 0.009425128
## 2 11.873589 0.008481817 -0.001518183 12.537228 0.072012643
## diff_Fx_HS Diff_meanvalue_ctrl_HS Diff_meanvalue_HS_ctrl Diff_meanFx_ctrl_HS
## 1 0.004425128 -0.4578180 0.4578180 -0.00839240
## 2 0.062012643 -0.6636391 0.6636391 -0.06353083
## Diff_meanFx_HS_ctrl
## 1 0.00839240
## 2 0.06353083
##
##
## The table universegroup:
## Universe Group transcript gene strand chr coor1 coor2
## 1 TRUE Attenuated ENST00000230895.11 DAP - chr5 10679230 10761234
## 2 TRUE Outgroup ENST00000274140.10 MARCHF6 + chr5 10353695 10440388
## size window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl AUC_HS
## 1 82005 410 5.830076 0.9874106 0.045 7.927029 65.589375
## 2 86694 433 5.046713 0.9639452 0.050 8.032174 7.447405
## p_AUC_HS D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 1.203856e-28 0.570 1.003308 0.9874106 3.611568e-28
## 2 6.271671e-01 0.075 9.129133 0.9874106 6.271671e-01
## dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1 59.759300 9.396710e-26
## 2 2.400692 9.874106e-01
## D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1 0.540 5.638026e-25 49
## 2 0.045 9.874106e-01 49
## max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1 0.04433176 34 0.56852906 0 9.375234
## 2 0.04980327 76 0.07198283 0 9.662655
## DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1 7.464112 20.38480 4.36426 0.3201154 92.66507
## 2 7.522987 22.14369 10.80359 8.1156322 24.88020
For a comprehensive analysis, we will use the data from Cugusi et al.. To validate our approach and explore the utility of tepr metrics in revealing gene-specific elongation characteristics, we re-analyzed previous experiments demonstrating that heat shock (HS) induces attenuation in human cultured cells. We compared TT-seq data from HS-stressed and control MRC5-VA cells. Stressed cells were cultured for 2 hours at 42°C, while control cells were maintained at 37°C. Both samples were subjected to a 15-minute pulse of 4-thiouridine (4sU) labeling of nascent transcripts, followed by purification and sequencing.
The bedgraph files, mappability track and black list necessary for performing the entire analysis can be downloaded from zenodo:
#!/usr/bin/sh
mkdir data
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.basic.annotation.gtf.gz -P data/ && gunzip data/gencode.v43.basic.annotation.gtf.gz
wget https://github.com/Boyle-Lab/Blacklist/raw/refs/heads/master/lists/hg38-blacklist.v2.bed.gz -P data/ && gunzip data/hg38-blacklist.v2.bed.gz
wget https://zenodo.org/records/15050723/files/k50.umap.hg38.0.8.bed -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep1.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep1.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep2.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/ctrl_rep2.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep1.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep1.reverse.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep2.forward.bg -P data/
wget https://zenodo.org/records/15050723/files/HS_rep2.reverse.bg -P data/
If one would like to skip the preprocessing, the preprocessing result file can be downloaded with:
The preprocessing pipeline consists of the following steps:
windsize
).structure
Important: The preprocessing
function allows for saving intermediate objects, which prevents recomputation in cases of failure due to memory or time limits typically set in HPC parameters. See the saveobjectpath
and deletetmp
parameters in the ?preprocessing
help documentation for details. Resource requirements for processing the complete dataset are provided below.
nb CPU | RAM | Time |
---|---|---|
15 | 113.5 G | 3h47 |
10 | 78.9 G | 4h24 |
7 | 57.4 G | 5h27 |
5 | 43G | 6h58 |
3 | 28.7G | 9h02 |
The testing dataset below is limited to a portion of the chromosome 13 and one control replicate.
The following input files are required for the analysis:
?checkexptab
utility function can be used to verify the table’s format.NA
.The ?retrieveanno
function performs the following steps:
exptab
CSV file.saveobjectpath
directory.showtime = TRUE
).##
## The result is:
## chrom start end ensembl symbol strand biotype
## 1 chr13 1 800 ENST00000400113.8 TUBA3C - protein-coding
## 2 chr13 1000 1600 ENST00000623511.1 FAM230C + lncRNA
The ?makewindows
function performs the following operations:
windsize
.windsize
.The output includes metadata for each window, such as its chromosome, start and end coordinates, associated gene, and window number.
##
## The result is:
## biotype chr coor1 coor2 transcript gene strand window
## 1 protein-coding chr13 1 5 ENST00000400113.8 TUBA3C - 1
## 2 protein-coding chr13 5 9 ENST00000400113.8 TUBA3C - 2
## 3 protein-coding chr13 9 13 ENST00000400113.8 TUBA3C - 3
The ?blacklisthighmap
function iterates through each chromosome, processing genomic scores by removing those that overlap with blacklisted and low-mappability regions. It also calculates weighted means for scores within each window. This function leverages parallel processing for efficiency and supports saving (saveobjectpath
) and reloading (reload
) intermediate results to optimize the workflow.
The main steps include:
tmpfold
).If you did not execute the “quick start” section code, you need to first copy the bedgraph files to your current folder:
expprepath <- system.file("extdata", "exptab-preprocessing.csv", package="tepr")
expdfpre <- read.csv(expprepath)
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
package = "tepr"))
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE, quote = FALSE)
exptabpath <- "exptab-preprocessing.csv"
Note that this function does not return a value; instead, it saves the processed files to a temporary folder for use by the subsequent ?createtablescores
function.
The ?createtablescores
function merges the files stored in the previously produced temporary folder that belong to the same experiment and direction. These files are combined into a single table, with two columns per experiment: one containing the experiment name and the other containing the corresponding scores. The resulting table also includes annotations for each transcript.
The final table should look like:
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 lncRNA chr13 1000 1003 ENST00000623511.1 FAM230C + 1
## 2 lncRNA chr13 1003 1006 ENST00000623511.1 FAM230C + 2
## 3 lncRNA chr13 1006 1009 ENST00000623511.1 FAM230C + 3
## V9 V10 V11 V12
## 1 ENST00000623511.1_FAM230C_+_1 ctrl_rep1.forward 3.536147 ctrl_rep1.reverse
## 2 ENST00000623511.1_FAM230C_+_2 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## 3 ENST00000623511.1_FAM230C_+_3 ctrl_rep1.forward 3.847245 ctrl_rep1.reverse
## V13
## 1 NA
## 2 NA
## 3 NA
The quick start section demonstrates how to perform a tepr analysis in a single call using the ?tepr
function. Below, we detail the individual steps performed by ?tepr
using a subset of 6 transcripts for demonstration purposes. Plots generated from the complete annotation dataset are also provided. The complete table resulting from preprocessing can be downloaded here. See section 6.1 for notes on processing the complete dataset.
Two tables are required as input:
transdf: The table generated by the ?preprocessing
function (described in previous sections). This table contains information about each experiment, along with scores for each protein-coding and lncRNA annotation.
exptab: A table describing the experiments, containing the columns “condition”, “replicate”, “direction”, “strand”, and “path”. The “direction” column should contain the values “forward” and “reverse”, and the “strand” column should contain “plus” and “minus”. You can verify the table’s format using the ?checkexptab
utility function.
These two tables are included with the package and can be accessed using the ?system.file
command within the “extdata” directory. For this demonstration, we have sampled 6 transcripts from the transdf
table to reduce computation time.
## Define manually the column names for display purpose
colnames(transdf) <- c("biotype", "chr", "coor1", "coor2", "transcript", "gene",
"strand", "window", "id", "ctrl_rep1.plus", "ctrl_rep1.plus_score",
"ctrl_rep1.minus", "ctrl_rep1.minus_score", "ctrl_rep2.plus",
"ctrl_rep2.plus_score", "ctrl_rep2.minus", "ctrl_rep2.minus_score",
"HS_rep1.plus", "HS_rep1.plus_score", "HS_rep1.minus",
"HS_rep1.minus_score", "HS_rep2.plus", "HS_rep2.plus_score",
"HS_rep2.minus", "HS_rep2.minus_score")
message("The table given by the preprocessing function is:\n")
## The table given by the preprocessing function is:
## biotype chr coor1 coor2 transcript gene strand window
## 1 protein-coding chr7 55019017 55019980 ENST00000275493.7 EGFR + 1
## 2 protein-coding chr7 55019980 55020943 ENST00000275493.7 EGFR + 2
## 3 protein-coding chr7 55020943 55021906 ENST00000275493.7 EGFR + 3
## id ctrl_rep1.plus ctrl_rep1.plus_score
## 1 ENST00000275493.7_EGFR_+_1 ctrl_rep1.forward 1.356774
## 2 ENST00000275493.7_EGFR_+_2 ctrl_rep1.forward 3.657029
## 3 ENST00000275493.7_EGFR_+_3 ctrl_rep1.forward 8.161519
## ctrl_rep1.minus ctrl_rep1.minus_score ctrl_rep2.plus
## 1 ctrl_rep1.reverse NA ctrl_rep2.forward
## 2 ctrl_rep1.reverse NA ctrl_rep2.forward
## 3 ctrl_rep1.reverse NA ctrl_rep2.forward
## ctrl_rep2.plus_score ctrl_rep2.minus ctrl_rep2.minus_score HS_rep1.plus
## 1 1.430225 ctrl_rep2.reverse NA HS_rep1.forward
## 2 5.461259 ctrl_rep2.reverse NA HS_rep1.forward
## 3 9.692524 ctrl_rep2.reverse NA HS_rep1.forward
## HS_rep1.plus_score HS_rep1.minus HS_rep1.minus_score HS_rep2.plus
## 1 2.011272 HS_rep1.reverse NA HS_rep2.forward
## 2 8.927487 HS_rep1.reverse NA HS_rep2.forward
## 3 12.844230 HS_rep1.reverse NA HS_rep2.forward
## HS_rep2.plus_score HS_rep2.minus HS_rep2.minus_score
## 1 2.068358 HS_rep2.reverse NA
## 2 8.943509 HS_rep2.reverse NA
## 3 14.917819 HS_rep2.reverse NA
##
## The expdf table contains information about each replicate (here limited to one):
## condition replicate direction strand path
## 1 ctrl 1 forward plus ctrl_rep1_chr13.forward.bg
## 2 ctrl 1 reverse minus ctrl_rep1_chr13.reverse.bg
## 3 ctrl 2 forward plus ctrl_rep2_chr13.forward.bg
## 4 ctrl 2 reverse minus ctrl_rep2_chr13.reverse.bg
## 5 HS 1 forward plus HS_rep1_chr13.forward.bg
## 6 HS 1 reverse minus HS_rep1_chr13.reverse.bg
You can verify the table’s format using the ?checkexptab
utility function. Note that this verification is also performed by ?averageandfilterexprs
.
To study changes in nascent RNA patterns, the initial step is to select expressed transcripts with ?averageandfilterexprs
. The expression value of a gene is defined as the mean score across both strands. These scores are derived from the bedgraph files. A gene is considered expressed if its mean expression in all conditions exceeds a specified threshold. The default expression threshold is set to 0.1. If no expressed transcript is returned, an error is thrown.
resallexprs
is a list. Its first element is the input transdf
with additional columns, such as the calculated mean expression values. The second element is a vector containing the IDs of the transcripts considered expressed. This vector will be used in subsequent downstream analyses.
## [1] "ENST00000230895.11" "ENST00000274140.10" "ENST00000275493.7"
## [4] "ENST00000527786.7" "ENST00000549802.5" "ENST00000615891.2"
During preprocessing, scores within user-defined blacklisted regions, low-mappability regions, or any other excluded regions are replaced with NA values. The ?countna
function retrieves the number of missing scores for each transcript. This table can be used for further data filtering, such as removing transcripts with an excessive number of windows lacking scores.
## gene transcript strand Count_NA
## ENST00000230895.11 DAP ENST00000230895.11 - 0
## ENST00000274140.10 MARCHF6 ENST00000274140.10 + 0
## ENST00000275493.7 EGFR ENST00000275493.7 + 0
## ENST00000527786.7 FLI1 ENST00000527786.7 + 0
## ENST00000549802.5 LINC01619 ENST00000549802.5 - 0
## ENST00000615891.2 AP5S1 ENST00000615891.2 + 8
The ?genesECDF
function calculates the empirical cumulative distribution function (ECDF) for expressed genes across their transcripts. An ECDF is a probability distribution that estimates the Cumulative Distribution Function; here, we use the ?ecdf
function from R. It generates an ECDF for each transcript, which are then used to assess differences in signal using a Kolmogorov-Smirnov test. The ECDF helps answer the question: How likely is it that we would observe a distribution of signals like this if the signals were drawn from a uniform probability distribution? In other words, this statistic helps identify specific loci where the transcription signal deviates significantly from a uniform density. In the following steps, the ECDF is used to compute the difference from the cumulative distribution (see ?meananddifference
), which is subsequently used to calculate the area under the curve (AUC, see ?allauc
) and the inflection point or knee (see ?kneeid
).
The ?genesECDF
function returns a list. The first element is the main table with added ECDF columns, prefixed with Fx
.
## biotype chr coor1 coor2 transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11 DAP - 200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11 DAP - 199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11 DAP - 198
## id ctrl_rep1 ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
## HS_rep1 HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse 1 2.424338
## 2 HS_rep1.reverse HS_rep2.reverse 2 9.451467
## 3 HS_rep1.reverse HS_rep2.reverse 3 13.218833
## value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1 0.5111166 1.647340 2.20375
## 2 14.2957119 11.853574 13.22088
## 3 16.2775011 9.271255 14.74644
## Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1 0.001792650 0.0002728066 0.00839895 0.01045131
## 2 0.008888557 0.0080750764 0.07086614 0.07315914
## 3 0.018748133 0.0169685727 0.11968504 0.14299287
The second element returns the number of windows per transcript, which is set to 200 by default during preprocessing.
## [1] 200
The ?meandifference
function computes three types of statistics: mean score values, mean ECDF for each replicate, and the difference between the mean ECDF and the uniform cumulative distribution. Mean values across replicates are calculated, resulting in one mean_value
column per condition. These mean_value
score columns are required for subsequent attenuation score calculations (see ?attenuation
). The mean ECDF values are used to calculate the difference from the cumulative distribution, which we term diff_Fx. The diff_Fx statistic is used to calculate the area under the curve (AUC, see ?allauc
) and the inflection point or knee (see ?kneeid
).
The difference is calculated as: diff_Fx = meanECDF - cumvec / nbwindows
resmeandiff <- meandifference(resecdflist[[1]], expdf, nbwindows, verbose = FALSE)
print(head(resmeandiff, 3))
## biotype chr coor1 coor2 transcript gene strand window
## 1 protein-coding chr5 10760820 10761234 ENST00000230895.11 DAP - 200
## 2 protein-coding chr5 10760410 10760820 ENST00000230895.11 DAP - 199
## 3 protein-coding chr5 10760000 10760410 ENST00000230895.11 DAP - 198
## id ctrl_rep1 ctrl_rep2
## 1 ENST00000230895.11_DAP_-_200 ctrl_rep1.reverse ctrl_rep2.reverse
## 2 ENST00000230895.11_DAP_-_199 ctrl_rep1.reverse ctrl_rep2.reverse
## 3 ENST00000230895.11_DAP_-_198 ctrl_rep1.reverse ctrl_rep2.reverse
## HS_rep1 HS_rep2 coord value_ctrl_rep1_score
## 1 HS_rep1.reverse HS_rep2.reverse 1 2.424338
## 2 HS_rep1.reverse HS_rep2.reverse 2 9.451467
## 3 HS_rep1.reverse HS_rep2.reverse 3 13.218833
## value_ctrl_rep2_score value_HS_rep1_score value_HS_rep2_score
## 1 0.5111166 1.647340 2.20375
## 2 14.2957119 11.853574 13.22088
## 3 16.2775011 9.271255 14.74644
## Fx_ctrl_rep1_score Fx_ctrl_rep2_score Fx_HS_rep1_score Fx_HS_rep2_score
## 1 0.001792650 0.0002728066 0.00839895 0.01045131
## 2 0.008888557 0.0080750764 0.07086614 0.07315914
## 3 0.018748133 0.0169685727 0.11968504 0.14299287
## mean_value_ctrl mean_Fx_ctrl diff_Fx_ctrl mean_value_HS mean_Fx_HS
## 1 1.467727 0.001032728 -0.003967272 1.925545 0.009425128
## 2 11.873589 0.008481817 -0.001518183 12.537228 0.072012643
## 3 14.748167 0.017858353 0.002858353 12.008848 0.131338957
## diff_Fx_HS Diff_meanvalue_ctrl_HS Diff_meanvalue_HS_ctrl Diff_meanFx_ctrl_HS
## 1 0.004425128 -0.4578180 0.4578180 -0.00839240
## 2 0.062012643 -0.6636391 0.6636391 -0.06353083
## 3 0.116338957 2.7393194 -2.7393194 -0.11348060
## Diff_meanFx_HS_ctrl
## 1 0.00839240
## 2 0.06353083
## 3 0.11348060
The resulting table is then split into a list of tables, one per transcript. This enables parallel processing by transcript for subsequent operations (see nbcpu parameter in each function).
The Area Under the Curve (AUC) values enable comparison of the nascent RNA signal to a uniform distribution (representing no signal attenuation, where x = y) or comparison of the difference in signal distribution between two conditions (dAUC). The difference between cumulative densities is estimated using a Kolmogorov-Smirnov test with adjusted p-value. Finally, ?allauc
returns results by transcript, along with a mean value termed MeanValueFull.
For each condition (if more than one), the AUC of diff_Fx (described in the previous section) is computed using a trapezoidal approximation (see pracma::trapz
):
AUC = pracma::trapz(transcoord, diff_Fx)
where transcoord
represents the positions along the transcript and diff_Fx is the difference from the cumulative distribution.
## Calculate Area Under Curve (AUC) and Differences of AUC for Transcript Data
resauc <- allauc(bytranslistmean, expdf, nbwindows, verbose = FALSE)
print(head(resauc, 3))
## gene transcript strand window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl
## 1 AP5S1 ENST00000615891.2 + 41 19.405661 0.008635679 0.165
## 2 DAP ENST00000230895.11 - 410 5.830076 0.987410626 0.045
## 3 EGFR ENST00000275493.7 + 963 7.515544 0.544142503 0.080
## MeanValueFull_ctrl AUC_HS p_AUC_HS D_AUC_HS MeanValueFull_HS
## 1 3.001846 10.85386 1.122497e-01 0.12 2.561331
## 2 7.927029 65.58938 1.203856e-28 0.57 1.003308
## 3 6.210447 56.67197 3.857500e-22 0.50 3.330179
## adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS dAUC_Diff_meanFx_HS_ctrl
## 1 0.02590704 1.346996e-01 -8.551806
## 2 0.98741063 3.611568e-28 59.759300
## 3 0.81621375 7.714999e-22 49.156430
## p_dAUC_Diff_meanFx_HS_ctrl D_dAUC_Diff_meanFx_HS_ctrl
## 1 5.224189e-02 0.135
## 2 9.396710e-26 0.540
## 3 7.407064e-21 0.485
## adjFDR_p_dAUC_Diff_meanFx_HS_ctrl
## 1 6.269027e-02
## 2 5.638026e-25
## 3 1.481413e-20
To assess attenuation in the nascent RNA signal, it’s necessary to identify significant changes in signal distribution. tepr accomplishes this by identifying the maximum diff_Fx. In other words, it pinpoints the greatest variation in the signal progression of the empirical cumulative distribution (see ?genesECDF
and ?meandifference
). The window where this maximum change occurs is termed the knee and is calculated as: knee = max(diff_Fx). Thus, the knee position is defined as the maximum difference between the ECDF and the y=x curve.
?kneeid
returns both the position (Knee_AUC) and the value (diff_Fx) of the knee.
## transcript knee_AUC_ctrl max_diff_Fx_ctrl
## ENST00000230895.11 ENST00000230895.11 49 0.04433176
## ENST00000274140.10 ENST00000274140.10 49 0.04980327
## ENST00000275493.7 ENST00000275493.7 94 0.07642503
## ENST00000527786.7 ENST00000527786.7 61 0.10785215
## ENST00000549802.5 ENST00000549802.5 59 0.53104881
## ENST00000615891.2 ENST00000615891.2 100 0.16331981
## knee_AUC_HS max_diff_Fx_HS
## ENST00000230895.11 34 0.56852906
## ENST00000274140.10 76 0.07198283
## ENST00000275493.7 70 0.49569073
## ENST00000527786.7 78 0.49396784
## ENST00000549802.5 48 0.58517400
## ENST00000615891.2 65 0.11664132
The ?attenuation
function calculates the mean score of all window means before the knee location and the mean score of all window means after the knee location. It then calculates an “attenuation score” using the following formula: att <- 100 - (downmean / upmean) x 100 (x for multiplication). The function returns three values, stored in the columns “UP_mean_,” “DOWN_mean_,” and “Attenuation_,” representing att, downmean, and upmean, respectively.
resatt <- attenuation(resauc, resknee, rescountna, bytranslistmean, expdf,
resmeandiff, verbose = FALSE)
print(head(resatt, 3))
## transcript gene strand chr coor1 coor2 size window_size
## 1 ENST00000230895.11 DAP - chr5 10679230 10761234 82005 410
## 2 ENST00000274140.10 MARCHF6 + chr5 10353695 10440388 86694 433
## 3 ENST00000275493.7 EGFR + chr7 55019017 55211628 192612 963
## AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl AUC_HS p_AUC_HS
## 1 5.830076 0.9874106 0.045 7.927029 65.589375 1.203856e-28
## 2 5.046713 0.9639452 0.050 8.032174 7.447405 6.271671e-01
## 3 7.515544 0.5441425 0.080 6.210447 56.671974 3.857500e-22
## D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 0.570 1.003308 0.9874106 3.611568e-28
## 2 0.075 9.129133 0.9874106 6.271671e-01
## 3 0.500 3.330179 0.8162138 7.714999e-22
## dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1 59.759300 9.396710e-26
## 2 2.400692 9.874106e-01
## 3 49.156430 7.407064e-21
## D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1 0.540 5.638026e-25 49
## 2 0.045 9.874106e-01 49
## 3 0.485 1.481413e-20 94
## max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1 0.04433176 34 0.56852906 0 9.375234
## 2 0.04980327 76 0.07198283 0 9.662655
## 3 0.07642503 70 0.49569073 0 7.209353
## DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1 7.464112 20.38480 4.364260 0.3201154 92.66507
## 2 7.522987 22.14369 10.803586 8.1156322 24.88020
## 3 5.346316 25.84194 8.049177 0.8118397 89.91400
Universe
Prior to further downstream analysis, the set of transcripts to be studied must be defined. These transcripts are designated as belonging to the “Universe”. tepr selects a transcript for inclusion in the “Universe” if it meets the following criteria:
Analytically, this can be represented as:
window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
meanstress > meanstressthreshold &
pvaltheory > pvaltheorythres
If only one condition is provided, only the control threshold is used:
window_size > windsizethreshold &
Count_NA < countnathreshold &
meanctrl > meanctrlthreshold &
pvaltheory > pvaltheorythres
Appropriate threshold values for each criterion depend on the specific experiments. The user can adjust these thresholds throughout the analysis. A transcript that satisfies the above criteria is defined as belonging to the “Universe” of transcripts to be analyzed. The Universe
column in the data will contain TRUE
or FALSE
to indicate membership.
Groups
The transcripts in the Universe are further categorized into Groups. A transcript belongs to the “Attenuated” group if it meets the following criteria:
Universe == TRUE
).Analytically, this is represented as:
If only one condition is provided, no comparison via the Kolmogorov-Smirnov test with another condition is performed. A transcript will be considered attenuated if it deviates significantly from the theoretical cumulative distribution (y = x) and its AUC value is sufficiently high. The significant deviation has already been evaluated during the Universe calculation (pvaltheory > pvaltheorythres
). Therefore, with only one condition, a transcript is considered attenuated if:
A transcript is considered non-attenuated (“Outgroup”) if it meets the following criteria:
Analytically, this is represented as:
Universe == TRUE & pvalks > outgrouppvalksthreshold &
aucctrl > aucctrlthresoldhigher & aucctrl < aucctrlthresholdlower
For a single-condition analysis, where no Kolmogorov-Smirnov test against another condition is performed, a transcript belongs to the “Outgroup” if it belongs to the Universe and its control AUC falls between two user-defined thresholds:
Note that transcripts belonging to neither “Attenuated” nor “Outgroup” are assigned NA to the Group
column. The Universe
and Group
columns are computed with the ?universegroup
function:
## Universe Group transcript gene strand chr coor1 coor2
## 1 TRUE Attenuated ENST00000230895.11 DAP - chr5 10679230 10761234
## 2 TRUE Outgroup ENST00000274140.10 MARCHF6 + chr5 10353695 10440388
## size window_size AUC_ctrl p_AUC_ctrl D_AUC_ctrl MeanValueFull_ctrl AUC_HS
## 1 82005 410 5.830076 0.9874106 0.045 7.927029 65.589375
## 2 86694 433 5.046713 0.9639452 0.050 8.032174 7.447405
## p_AUC_HS D_AUC_HS MeanValueFull_HS adjFDR_p_AUC_ctrl adjFDR_p_AUC_HS
## 1 1.203856e-28 0.570 1.003308 0.9874106 3.611568e-28
## 2 6.271671e-01 0.075 9.129133 0.9874106 6.271671e-01
## dAUC_Diff_meanFx_HS_ctrl p_dAUC_Diff_meanFx_HS_ctrl
## 1 59.759300 9.396710e-26
## 2 2.400692 9.874106e-01
## D_dAUC_Diff_meanFx_HS_ctrl adjFDR_p_dAUC_Diff_meanFx_HS_ctrl knee_AUC_ctrl
## 1 0.540 5.638026e-25 49
## 2 0.045 9.874106e-01 49
## max_diff_Fx_ctrl knee_AUC_HS max_diff_Fx_HS Count_NA UP_mean_ctrl
## 1 0.04433176 34 0.56852906 0 9.375234
## 2 0.04980327 76 0.07198283 0 9.662655
## DOWN_mean_ctrl Attenuation_ctrl UP_mean_HS DOWN_mean_HS Attenuation_HS
## 1 7.464112 20.38480 4.36426 0.3201154 92.66507
## 2 7.522987 22.14369 10.80359 8.1156322 24.88020
tepr provides visualization capabilities for: the cumulative transcription density along a selected transcription unit (?plotecdf
); the comparison of AUC between two conditions (?plotauc
); the average transcription density of a user-selected set of transcripts (?plotmetagenes
); and a histogram of the distance between knees and transcription start sites (TSS, ?plothistoknee
).
plotecdf
The empirical cumulative distribution function (ECDF) for a specific transcript (EGFR in this example) can be visualized using:
colvec <- c("#90AFBB", "#10AFBB", "#FF9A04", "#FC4E07")
plotecdf(resmeandiff, res, expdf, "EGFR", colvec, plot = TRUE, verbose = FALSE)
plotauc
AUC values between conditions can be compared by highlighting the two conditions on the plot (only 6 points appearing):
The plot generated using the complete dataset is:
AUC group plot
It is also possible to compare the AUC by coloring points according to the Kolmogorov-Smirnov test adjusted p-values and by indicating graphically the points corresponding to a group of genes (here defined by genevec
):
genevec <- c("EGFR", "DAP", "FLI1", "MARCHF6", "LINC01619")
plotauc(res, expdf, genevec, plot = TRUE)
The plot generated using the complete dataset is:
AUC pval plot
plotmetagenes
The average transcription density along a meta-transcript or meta-gene can be visualized as follows:
The plot on the complete dataset gives:
Attenuation meta
Note that the plottype
parameter, set to “attenuation” in this example, specifies that the distribution should be plotted for transcripts identified as “attenuated” by the ?universegroup
function. Other options for plottype
include “outgroup” (for transcripts categorized as “outgroup”), “universe” (for all transcripts in the Universe), and “all” (for all transcripts in the initial table).
plothistoknee
The distribution of knee distances from the TSS can be visualized as a percentage:
The plot generated using the complete dataset is:
histo percent
The distances can also be visualized in kilobases (kb) by setting the plottype
parameter to “kb”:
The plot generated using the complete dataset is:
histo kb
teprmulti
AnalysisAnalyzing more than two experimental conditions is possible using the ?teprmulti
function. This function performs an “all vs. all” comparison and returns a list. Each element of this list corresponds to a specific pairwise comparison and contains another list with two data frames:
resmeandiff_<comparison>
: The results of the ?meandifference
function for the specified two-condition comparison.resunigroupatt_<comparison>
: The results of the ?universegroup
function for the specified two-condition comparison.The ?showallcomp
function returns a vector of all comparison names that ?teprmulti
will perform. This allows users to reduce the number of comparisons by using the dontcompare
parameter. dontcompare
accepts a vector of comparison names to exclude. The following code limits ?teprmulti
to only three comparisons:
plotmulti
Using the list of results from ?teprmulti
, all plots for all comparisons can be generated at once using ?plotmulti
. This function iterates through each element of resteprmulti
(each element representing a two-condition comparison) and calls the following functions:
?plotecdf
: Generates a figure for each gene specified in ecdfgenevec
.?plotauc
: Generates figures with options “group” and “p-value”. The “p-value” option figure is not generated if genaucvec = NA
.?plotmetagenes
: Generates figures with options “attenuation”, “outgroup”, “universe”, and “all”.?plothistoknee
: Generates figures with options “percent” and “kb”.Figures are written to the directory specified by outfold
, and subfolders are automatically created for each comparison.
The ?kneeallconds
function allows retrieving knee values processing each condition independently (see next section).
If your dataset contains only one condition (with one or more replicates), you can analyze the data by comparing the observed signal to a linear “theoretical” distribution representing a uniform signal y = x. The calculation of differences in means and AUC is skipped when running ?meandifference
and ?allauc
. The criteria for defining the “Universe” and “Group” columns also differ (see details in the previous section).
## The experiment table is limited to one condition and one replicate
expdfonecond <- expdf[which(expdf$condition == "HS" & expdf$replicate == 1), ]
## The table obtained by preprocessing is limited to the condition 'HS' and replicate 1
transdfonecond <- transdf[, c(seq_len(9), 18, 19, 20, 21)]
## Computing the object 'res' with tepr on one condition
resonecond <- tepr(expdfonecond, transdfonecond, expthres, controlcondname = "HS", verbose = FALSE)
Because AUC and metagene plots require two conditions, they cannot be generated in a single-condition analysis. However, the ECDF for a given gene can be obtained by considering the y = x distribution as follows:
plotecdf(resonecond[[1]], resonecond[[2]], expdfonecond, genename = "EGFR",
colvec = c("#90AFBB"), plot = TRUE, verbose = FALSE)
The histogram of knee positions can be obtained with:
## Randomly marking 3 transcripts as attenuated as a mock example
idxatt <- sample(seq_len(6), 3)
resonecond[[2]][idxatt, "Group"] <- "Attenuated"
resonecond[[2]][idxatt, "Universe"] <- TRUE
plothistoknee(resonecond[[2]], kneename = "knee_AUC_HS", plottype = "percent", plot = TRUE, verbose = FALSE)
For processing the complete dataset, parallelization is highly recommended. We used 20 CPUs for our analysis. The code below assumes that the file “dTAG_Cugusi_stranded_20230810.tsv” is located in the current working directory. You can verify the format of the transdf
table using the checkexptab
utility function.
The approximate computation time for each function is indicated below.
library(tepr)
## After downloading with: wget https://zenodo.org/records/15050723/files/cugusi.tsv -P preprocessed/
transpath <- "preprocessed/cugusi.tsv"
exppath <- system.file("extdata", "exptab.csv", package="tepr")
## Reading input files
transdf <- read.delim(transpath, header = FALSE)
expdf <- read.csv(exppath)
reslist <- tepr(expdf, transdf, expthres = 0.1, nbcpu=5, showtime = TRUE, verbose = TRUE)
## Calculating average expression per transcript
Removing lines with values < expthres
-- Analysis performed in: 8.1 secs
## Counting NA values
Splitting the table by each transcript
-- Analysis performed in: 38 secs
## Computing ecdf
Filtering to keep only the expressed transcripts
Splitting the table by each transcript
Computing ecdf on each transcript
-- Analysis performed in: 3.2 mins
## Computing meandifference
Merging columns for condition ctrl
Calculating average and difference between replicates for columns 'value' of ctrl
Calculating average and difference between replicates for columns 'Fx' of ctrl
Merging columns for condition HS
Calculating average and difference between replicates for columns 'value' of HS
Calculating average and difference between replicates for columns 'Fx' of HS
Computing all differences on mean columns
-- Analysis performed in: 4.3 secs
## Split the results by transcripts
-- Analysis performed in: 6.3 secs
## Computing AUC and difference
Computing the differences (d or delta) of AUC
Computing the Area Under Curve (AUC)
Merging results
-- Analysis performed in: 1.6 mins
## Calculating knee
-- Analysis performed in: 26 secs
## Calculating attenuation
Merging tables
Building summary
Merging summary
Merging detailed mean table with summary
Splitting the previous table by transcript
Computing up and down mean
Merging attenuation to the complete table
-- Analysis performed in: 1.26459248860677
## Computing universe and group columns
-- Analysis performed in: 0.021 secs
-- tepr analysis performed in: 7.4 mins
## Retrieving results
resmeandiff <- reslist[[1]]
res <- reslist[[2]]
## Plotting and saving to current folder
colvec <- c("#90AFBB", "#10AFBB", "#FF9A04", "#FC4E07")
plotecdf(resmeandiff, res, expdf, "EGFR", colvec, formatname = "png")
plotauc(res, expdf, plottype = "groups", outfile = "AUCcompare_group",
formatname = "png")
genevec <- c("EGFR", "DAP", "FLI1", "MARCHF6", "LINC01619")
plotauc(res, expdf, genevec, plottype = "pval", outfile = "AUCcompare_pval",
formatname = "png")
plotmetagenes(res, resmeandiff, expdf, plottype = "attenuation", formatname = "png")
plothistoknee(res, formatname = "png")
plothistoknee(res, plottype = "kb", formatname = "png")