Accurate QUAntification of circulaR Isoforms Using Model-based strategy
AQUARIUM (Accurate QUAntification of circulaR Isoforms Using Model-based strategy) is a bioinformatics analysis pipe line for circular RNA sequencing data.
This pipeline can determine the relative abundance of both linear and circRNA in RNA-seq data under the guidance of a configuration file. Since the relative expression abundunce of circular and linear RNAs are unified under the same standard, it avoids the need to introduce other parameters to describe the expression of cyclic RNA in other count-based circular RNA quantification tools.
It accepts output of circRNA identification tools (CIRI, CIRI-full) or a BED-format file to specify the circular RNA transcripts. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using salmon framework.
Code written in python and R, and published under mit license.
This pipeline does not need compilation. You can download the package directly from the project homepage as a zip file and unzip it manually.
If git is installed, you can also download this pipeline using following command.
git clone https://github.com/wanjun-group-seu/AQUARIUM
Python 3 and R need to be installed before you can proceed with the following steps.
This pipeline requires biopython and gffutils.
But if you are not sure whether you have installed the relevant package or not.
you can use a versatile helper script info.py
:
python path/to/your/AQUARIUM/info.py
When this script is invoked in command line, it first checks whether the software-related python packages are installed properly. It will report an error once any python package is not installed properly. You can check the error message to find the missing package.
These R package may be required for analysis: tidyverse
, Biostrings
, rtracklayer
, GenomicFeatures
, the latter 3 packages are part of bioconductor.
Install them in the R console:
# To install `tidyverse`:
install.packages("tidyverse")
# to install other 3 bioconductor packages
install.packages("BiocManager")
BiocManager::install("Biostrings")
BiocManager::install("rtracklayer")
BiocManager::install("GenomicFeatures")
If you need to install Biostrings
rtracklayer
GenomicFeatures
in an older version of R that does not work with BiocManager
, please check bioconductor
page of these packages.
gffread is part of Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). This GFF/GTF parsing utility is used to extract sequence.
In order to use this pipeline to analyze the data, a configuration file is essential.
But instead of writing it from scratch, you can call new_config.py
to create a new one, and then modify it.
The default configure template lies in the pysrc/body/default.cpysrc/body/default.cfg
.
When you invoke new_config.py, you’re actually making a copy of this file.
This template is almost empty after a fresh installation. You can modify this file to save your time in future use.
More details can be found in instructions on config file
The main command line interface for circRNA profiling is wf_profile_circRNA.py
.
You can luanch it as follows:
python wf_profile_circRNA.py config_file
The only argument is the path of config file. It contains the information needed to analyze a single sample, where is the input files , where is the output files, the parameters of certain steps, etc.
Before calling this interface, you can get a config copy and then modify it
Here we will show how to perform a simple but complete analysis after the installation of pipeline.
You can download a minimal dataset for testing from MEGA.
It contains a tiny genome( mt.fa, mt.gtf),pair end sequence data (r1.fq, r2.fq), and a circRNA detection report(circ_mt.report), also a sample config file(your.cfg).
Run this pipeline using that sample config file will cause error, because file path in that cfg file is not correct. And creating a new config file is a more preferable way. But you use it as a guide by searching “TODO” to find what needs to be edited.
By invoking new_config.py, you will get a copy of the template. then you can modify this config file to guide your analysis.
The format of invoking new_config.py is as following:
new_config.py target
target
is the place where your configuration file will be found.
If this target
is an existing folder, then a default.cfg
file will appear in this folder.
Before you modify the configuration file, it is recommended that you read instructions on configure file
If you have modified the template after installation, you may notice that the changes you have made are already in effect on this copy.
With the sample config as a guide, you can modify it by referring to instructions provided by info.py.
Here, our goal is to quantify the RNA in the sequencing data. so you can query info.py like this:
python path/to/AQUARIUM/info.py profile_circRNA
From its output, we can see that the relevant module is [CIRC_PROFILE].
After modifying the configuration file, it is time to try running the process.
python path/to/AQUARIUM/wf_profile_circRNA.py some.cfg
It will output a lot of log messages to the screen. So you can use the nohup
command to make it run in the background, like:
nohup python path/to/AQUARIUM/wf_profile_circRNA.py some.cfg &
You can find a quant.sf
with RNA expression levels in the folder specified by the -o
parameter in the CIRC_PROFILE
section of configration.
When an error occurs, first check the log message. The error message will be at the end of the screen output. Search the words ERROR and WARING in the LOG message would also help.
If you come across a strange looking error message or find a bug, please do let us know. You submit new issues here: https://github.com/wanjun-group-seu/AQUARIUM/issues
During the process, we will create a binary database file for GTF in order to query the genome information faster. This ‘.db’ database file will share base name with the GTF file and will locate in the same folder.
However, since there are different versions of GTF files for different species. So creating the database may encounter some unexpected problems.
For this reason, a prudent solution is to create the relevant database manually with following steps:
Enter the python interactive environment
python
call gffutils.create_db
to build the database.
import gffutils
gffutils.create_db(gtf_file, db_file_path)
here, gtf_file is path to your gtf file, db_file_path is where the .db file will locate.
In general it takes about 15 minutes to process the human genome GTF, But if it takes too long, or if it shows a warning, some parameters needs to be tweaked. Get more information by
help(gffutils.create_db)
We use an ini
format configuration file to set all parameters in the process.
The user can call new_config.py
to get a copy of the default configuration file, and modify it to meet specific requirements.
Typically, a single sample corresponds to a single configuration file.
For large-scale analysis of multiple samples, users can start by writing a configuration file template, and then use character substitution to obtain multiple configuration files.
The configuration file is a plain text file that follows the syntax of the ini
file.
The file is divided into sections, with the name of each section stored in square brackets, and pairs of key-value pairs under the square brackets. It suppports syntax like ${A:b}
to quote the value of b in another section A.
[section_name]
keyname1=value1
keyname2=value2
keyname3=${A:b}
You might see a lot of sections in this file, but don’t worry, only some of them will be used in each operation.
These sections can be divided into two categories, first three sections are required for all analysis modules, and the other sections correspond to different module of analysis.
The first three sections meta/global/custom store the paths to the necessary files for the analysis process.
Section META
contains parameters that are constant in the environment, such as the location of the executable and genome annotations.
Section GLOBAL
includes information that is invariant in the analysis of the same batch of samples, such as location of sequencing data.
Section CUSTOM
is used to store the information corresponding to a single sample. and some user-defined variables.
The subsequent sections are related to specific analysis modules, with more information available in info.py
.
In each section you can see a lot of keys reserved as placeholder. You need to refer to the results of info.py to determine which ones are useful.
To facilitate your modification of the parameters of each module, you need to use info.py
.
This is a multi-purpose script. During the installation process, you can invoke info.py to check whether the relevant package has been installed or not. After the installation is complete, info.py is also used to query the information for each analysis step.
If the software-related packages are installed correctly, invoking info.py without parameters will give you the following information:
python path/to/your/AQUARIUM/info.py
this pipeline now has following wrappers, choose one to see more information
bwa - BWA and BWA MEM
ciri - CIRI : a circular RNA detection tool
ciri2 - CIRI2 : a circular RNA detection tool, multiple core empowered
ciri_as - CIRI-AS: circular RNA Alternative Splicing Event detection tool
ciri_full - CIRI-FULL: a powerful circular RNA detection and rebuilding tool, which combines CIRI and CIRI-AS
detect_circRNA - home made pipeline for detecting circRNA
fastqc - fastqc , a QC tool
knife - KNIFE: a circular RNA detection tool
profile_circRNA - home made pipeline for profiling the circRNA
rsem - RSEM : a RNA seq quantification tool
sailfish - Sailfish: a RNA-seq quantification tool based on k-mer
salmon - Salmon: a RNA-seq quantification tool based on fragment
star - STAR : a junction sensitive aligner
you could use 'grep -v "^#\ " | grep -v "^$"' to filter out the options and use them in config file
Here, it prompts you to select an item in the list for more information. Each of these items corresponding to a section in the configuration file. profile_circRNA
is the main step of the process. It correspond to wf_profile_linear.py
Here we look up the parameters of profile_circRNA
:
-> % python path/to/your/AQUARIUM/info.py profile_circRNA
[CIRC_PROFILE]
# this section [CIRC_PROFILE] contains following options:
####### essential arguments
quantifier = salmon
# the back end quantifier: sailfish or salmon
-o =
# output folder that contains the index built by sailfish and quantification results
####### only one of these option pairs are needed
-g/--genomic_seqs_fasta
# path to genomic sequence fasta file
-a/--annotation
# path to gene annotation file, ie, .gtf or .gff files
-c/--ciri_bsj/--bed
# path to circRNA detection (file for ciri, folder for ciri-full) to specify circular RNA
####### optional arguments
ciri_as_prefix
# path prefix to CIRI-AS report of circular exons
--decoys
# path to decoy sequences for salmon
preserved_id_list
# path to preserved id list
--mll
# mean library length, this option is to fix up the effective length.
-k = 31
# k-mer size used by sailfish to built index. default is 21
-1
# path to raw pair-end reads, mate 1
-2
# path to raw pair-end reads, mate 2
-r
# path to single-end raw sequencing reads file.
additional_circ_ref
# path to additional circular RNA reference file (.fa),
additional_annotation
# path to additional circular RNA annotation file in gtf format
additional_linear_ref
# path to additional linear RNA reference file(.fa)
flag_use_linc_explicitly
# flag to specify whether linc RNA should be include in quantification result, commenting it out to
# set False
flag_reject_linear
# flag to specify whether to reject linear RNA during quantification, for example for a RNase R
# treated sample, comment it out to set False
####### only one of these following options are optional
####### these options are not allowed
# # -h
# # --help
As you can see, the output of the command line lists all parameters of the [CIRC_PROFILE] section of the configuration file, along with a brief description of each parameter.
In fact, you can just overwrite the relevant section of the profile with the texts above, and then fill in the key-value pairs afterwards.
Also, You may notice some commented out key-value pairs. With the help of info.py, you can change them to you need.
All path should be absolute path
A: Not all of these parameters are essential.
If you just want to use the existing ciri-full result to quantify the RNA-seq data. These paramters are required:
num_thread
, genome_fa
,genome_annotation
,salmon_bin
.
A: This depends on the situation you have encountered, You can read the following pages for complete information: Salmon Library Type
A: This make the config file “portable”. If you think the absolute paths are too long, and there are many duplicate parts. You can define some prefixes for the paths in custom, and refer to them later