Abstract
bakR is a tool for comparing nucleotide recoding RNA-seq datasets (NR-seq). NR-seq refers to a class of methods (e.g., TimeLapse-seq, SLAM-seq, TUC-seq, etc.) which combine RNA-seq, metabolic labeling, and unique metabolic label recoding chemistries to assess the kinetics of gene expression. This vignette will provide a quick introduction to using bakR. A longer version of this vignette (“Differential kinetic analysis with bakR”) exists to provide a more in-depth discussion of each step of this process.
Install and load bakR to run the code in this vignette; instructions on how to do so can be found at this link. Also, you’ll want to set the seed so as to ensure the results you get reproduce those presented in the vignette.
The 1st step to using bakR is to create a bakRData object. A bakRData
object consists of two components: a cB data frame and a metadf data
frame. cB stands for counts binomial and contains all of the information
about mutations seen in sequencing reads in each sample sequenced.
metadf stands for metadata data frame and contains important information
about the experimental details of each sample (i.e., how long the
metabolic label feed was, which samples are reference samples, and which
are experimental samples). Examples of what these data structures are
available via calls to data("cB_small")
and
data("metadf")
, for the cB and metadf respectively.
A cB data frame consists of rows corresponding to groups of reads with identical data, where data corresponds to the values of the following four variables:
The last column of a cB data frame is n: Number of reads with identical data for the other 4 columns.
cB data frames are most easily obtained from the Snakemake implementation of a pipeline developed by our lab, called bam2bakR (available here).
The metadf has two columns:
The metadf data frame must also have row names corresponding to the sample name that the tl and Exp_ID entries describe, as the sample name appears in the cB data frame.
Once you have these two data frames correctly constructed, you can create a bakRData object:
bakR contains three different implementations of a statistical model of NR-seq data. You must always first run the most efficient implementation. The other two implementations can then be run using the output of this implementation. Below I will show how to run the efficient implementation using simulated data:
# Simulate a nucleotide recoding dataset
sim_data <- Simulate_bakRData(500)
# This will simulate 500 features, 2 experimental conditions
# and 3 replicates for each experimental condition
# See ?Simulate_bakRData for details regarding tunable parameters
# Extract simulated bakRData object
bakRData <- sim_data$bakRData
# Extract simualted ground truths
sim_truth <- sim_data$sim_list
# Run the efficient model
Fit <- bakRFit(bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 6 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0502 0.00100
#> 2 1 2 0.0500 0.00100
#> 3 1 3 0.0501 0.00100
#> 4 2 1 0.0501 0.00100
#> 5 2 2 0.0500 0.00100
#> 6 2 3 0.0503 0.00100
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
bakRFit() is used here as a wrapper for two functions in bakR:
cBprocess() and fast_analysis(). For more details on what these
functions do, run ?cBprocess
or fast_analysis
.
Alternatively, see the more highly detailed version of this vignette for
additional details.
While the fast implementation was running, it outputted a message regarding the estimated pnews and pold. The pnews are the estimated mutation rates of reads from new RNAs (new meaning RNAs synthesized after the start of s4U labeling) in each sample (muts = Exp_ID, and reps = a numerical replicate ID that corresponds to the order replicates appear in the cB), and polds are the estimates for the background mutation rate used in all analyses. For more details on how bakR estimates these mutation rates and alternative estimation strategies implemented in bakR, see the longer form version of this vignette as well as the vignette on troubleshooting analyses.
To run the heavier, more highly powered models, just rerun bakRFit() on the Fit object, but with either the StanFit or HybridFit parameters set to true.
# Run Hybrid model (This might take several minutes to run)
Fit <- bakRFit(Fit, HybridFit = TRUE)
# Run Full model (This might take ~10-30 minutes to run)
Fit <- bakRFit(Fit, StanFit = TRUE)
The Fit objects contain lists pertaining to the fits of each of the models. The possible contents include:
bakR provides a variety of easy to use functions for beginning to investigate your data. The visualizations are particularly aimed at revealing trends in RNA stabilization or destabilization. These include MA plots:
Volcano plots:
## Volcano Plot with Fast Fit; significance assessed relative to an FDR control of 0.05
plotVolcano(Fit$Fast_Fit)
and PCA plots:
This vignette provides the minimal amount of information to get up and running with bakR. If you would like a more thorough discussion of each step of this process, check out the long form version of this vignette (“Differential kinetic analysis with bakR”). In addition, there are a number of other vignettes that cover various topics not discussed in these intro vignettes:
Note, all but the “Differential synthesis rate analysis” vignette are only fully compatible with version 1.0.0 of bakR. Update bakR as necessary