How to use TransPi
1. Description
TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly
TransPi provides a useful resource for the generation of de novo transcriptome assemblies, with minimal user input but without losing the ability to perform a thorough analysis.
1.1. Programs used
List of programs used by TransPi:

- FastQC
- fastp
- SortMeRNA
- rnaSPAdes
- SOAP
- Trinity
- Velvet
- Oases
- TransABySS
- rnaQUAST
- EvidentialGene
- CD-HIT
- Exonerate
- BLAST
- BUSCO
- Psytrans
- TransDecoder
- Trinotate
- DIAMOND
- HMMER
- Bowtie2
- RNAmmer
- TMHMM
- SignalP
- iPath
- SQLite
- R
- Python

Databases used by TransPi:

- SwissProt
- UniProt custom database (e.g. all metazoan proteins)
- Pfam
2. Installing TransPi
2.1. Requirements
- System: Linux OS
- Data type: paired-end reads
  Example: IndA_R1.fastq.gz, IndA_R2.fastq.gz
  Make sure the reads end with _R1.fastq.gz and _R2.fastq.gz.
  Multiple individuals can be run at the same time.
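If your files use a different naming scheme, a quick rename can bring them into the expected pattern. A minimal sketch, assuming files currently end in _1.fq.gz and _2.fq.gz (hypothetical names; adapt the pattern to your data):

```shell
# Demo files with a hypothetical naming scheme; replace with your real reads.
touch IndA_1.fq.gz IndA_2.fq.gz

# Rename to the pattern TransPi expects: *_R1.fastq.gz / *_R2.fastq.gz
for f in *_1.fq.gz; do mv "$f" "${f%_1.fq.gz}_R1.fastq.gz"; done
for f in *_2.fq.gz; do mv "$f" "${f%_2.fq.gz}_R2.fastq.gz"; done
```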
2.2. Downloading TransPi
1- Clone the repository
git clone https://github.com/palmuc/TransPi.git
2- Move to the TransPi directory
cd TransPi
2.3. Configuration
TransPi requires various databases to run. The precheck script will install the databases and, if necessary, the software needed to run the tool. The precheck run needs a PATH as an argument, where it will install (locally) all the databases the pipeline needs.
bash precheck_TransPi.sh /YOUR/PATH/HERE/
This process may take a while depending on the options you select. The longest step is downloading, if desired, the entire set of metazoan proteins from UniProt (6GB). Other processes and databases are relatively fast, depending on your internet connection.
Once the precheck run is done it will create a file named nextflow.config that contains the various PATHs for the databases. If selected, it will also contain the PATH of the local conda environment. The nextflow.config file also has other important parameters for pipeline execution that are discussed in the following sections.
3. Running TransPi
3.1. Full analysis (--all)
After a successful run of the precheck script you are ready to run TransPi.
We recommend running TransPi with the option --all, which performs the complete analysis, from raw read filtering to annotation.
Other options are described below.
To run the complete pipeline:
nextflow run TransPi.nf --all --reads '/YOUR/READS/PATH/HERE/*_R[1,2].fastq.gz' \
--k 25,41,53 --maxReadLen 75 -profile conda
Argument explanations:
--all          Run the full TransPi analysis
--reads        PATH to the paired-end reads
--k            List of kmers to use for the assemblies
--maxReadLen   Maximum read length in the library
-profile       Engine to run the analyses (here conda)
If you combine multiple libraries of the same individual to create a reference transcriptome, to be used later in downstream analyses (e.g. differential expression), make sure the kmer list is based on the length of the shortest read library and that --maxReadLen is set to the longest. Example: when combining reads of 100bp with 125bp, base the kmer list on 100bp and set --maxReadLen to 125.
You can run multiple samples at the same time.
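When combining libraries, keep the R1 and R2 files separate. A minimal sketch (the library file names are hypothetical); note that concatenated gzip files are themselves a valid gzip stream:

```shell
# Tiny demo libraries (hypothetical names); in practice these are your runs.
printf '@r1\nACGT\n+\nIIII\n' | gzip > IndA_lib1_R1.fastq.gz
printf '@r2\nACGT\n+\nIIII\n' | gzip > IndA_lib2_R1.fastq.gz

# Combine all R1 files of the same individual into one file
# (repeat the same for the R2 files).
cat IndA_lib1_R1.fastq.gz IndA_lib2_R1.fastq.gz > IndA_R1.fastq.gz
```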
3.2. Other options
3.2.1. --onlyAsm
Run only the assemblies and the EvidentialGene analysis.
Example for --onlyAsm:
nextflow run TransPi.nf --onlyAsm --reads '/home/rrivera/TransPi/reads/*_R[1,2].fastq.gz' \
--k 25,41,53 --maxReadLen 75 -profile conda
You can run multiple samples at the same time.
3.2.2. --onlyEvi
Run only the EvidentialGene analysis.
Example for --onlyEvi:
nextflow run TransPi.nf --onlyEvi -profile conda
TransPi looks for a directory named onlyEvi. It expects one file per sample to perform the reduction. The file should have all the assemblies concatenated into one.
You can run multiple samples at the same time.
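A sketch of preparing the onlyEvi input (the sample and assembly file names are hypothetical; only the onlyEvi directory name is required by TransPi):

```shell
# Demo per-assembler assemblies (hypothetical names and contents).
printf '>trinity_t1\nACGTACGT\n' > SampleA.Trinity.fa
printf '>spades_t1\nTTGGCCAA\n' > SampleA.SPAdes.fa

# One concatenated file per sample inside the onlyEvi directory.
mkdir -p onlyEvi
cat SampleA.*.fa > onlyEvi/SampleA.combined.fa
```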
3.2.3. --onlyAnn
Run only the annotation analysis (starting from a final assembly).
Example for --onlyAnn:
nextflow run TransPi.nf --onlyAnn -profile conda
TransPi looks for a directory named onlyAnn. It expects one file per sample to perform the annotation.
You can run multiple samples at the same time.
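A sketch of preparing the onlyAnn input (the file names are hypothetical; only the onlyAnn directory name is required by TransPi):

```shell
# Demo final assembly (hypothetical name); in practice this is your
# final transcriptome for the sample.
printf '>t1\nACGTACGT\n' > SampleA.fasta

# One assembly file per sample inside the onlyAnn directory.
mkdir -p onlyAnn
cp SampleA.fasta onlyAnn/
```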
3.3. Using -profile
TransPi can also use Docker, Singularity, or individual conda installations (i.e. per process) to deploy the pipeline.
test               Run TransPi with a test dataset
conda              Run TransPi with conda
docker             Run TransPi with Docker containers
singularity        Run TransPi with Singularity containers
TransPiContainer   Run TransPi with a single container with all the necessary tools
Multiple profiles can be specified (comma separated).
Refer to Section 6 of this manual for further details on deployment of TransPi using other profiles.
4. Results
4.1. Directories
4.1.1. results
After a successful run of TransPi the results are saved in a directory called results. This directory is divided into multiple subdirectories, one for each major step of the pipeline.
Subdirectories are created based on the options selected for the pipeline execution.
fastqc          FastQC HTML files
filter          Filter step HTML files
rRNA_reads      Info and reads from the rRNA removal process
normalization   Normalized read files
saveReads       Reads saved from the filter and normalization processes
assemblies      All individual assemblies
evigene         Non-redundant final transcriptome
rnaQuast        rnaQUAST output
mapping         Mapping results
busco4          BUSCO V4 results
transdecoder    TransDecoder results
trinotate       Annotation results
report          Interactive report of TransPi
figures         Figures created by TransPi (BUSCO comparison, annotation, GO, etc.)
stats           Basic stats from all steps of TransPi
pipeline_info   Nextflow report, trace file, and others
RUN_INFO.txt    File with the versions of all tools used by TransPi, plus run info such as the command and PATHs
4.1.2. work
A directory called work is also created when running TransPi. It contains all the Nextflow working files, TransPi results, and intermediate files.
The work directory can be removed after the pipeline is done, since all important files are stored in the results directory.
4.2. Report
TransPi creates an interactive custom HTML report for easy data exploration.
Example report: sponge transcriptome.
5. Additional options
There are other parameters that can be changed when executing TransPi.
5.1. Output options
--outdir     Name of the output directory
-w, -work    Name of the working directory
--tracedir   Name of the directory to save pipeline trace files. Default "pipeline_info"
5.2. Additional analyses
--rRNAfilter         Remove rRNA from sequences. Requires option --rRNAdb
--rRNAdb             PATH to a database of rRNA sequences to use for rRNA filtering. Default ""
--filterSpecies      Perform psytrans filtering of the transcriptome. Requires options --host and --symbiont
--host               Host (or similar) protein file
--symbiont           Symbiont (or similar) protein file
--psyval             Psytrans value to train the model. Default "160"
--allBuscos          Run the BUSCO analysis on all assemblies
--rescueBusco        Generate the BUSCO distribution analysis
--minPerc            Minimum percentage of assemblers required for the BUSCO distribution. Default ".70"
--shortTransdecoder  Run TransDecoder without the homology searches
--withSignalP        Include SignalP in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--withTMHMM          Include TMHMM in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--tmhmm              PATH to the TMHMM software. Default ""
--withRnammer        Include RNAmmer in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--rnam               PATH to the RNAmmer software. Default ""
5.3. Skip options
--skipEvi     Skip the EvidentialGene run in the --onlyAsm option. Default "false"
--skipQC      Skip the FastQC step. Default "false"
--skipFilter  Skip the fastp filtering step. Default "false"
--skipKegg    Skip the KEGG analysis. Default "false"
--skipReport  Skip the generation of the final TransPi report. Default "false"
5.4. Other parameters
--minQual       Minimum quality score for fastp filtering. Default "25"
--pipeInstall   PATH to the TransPi directory. Default "". If the precheck is used, this is added to the nextflow.config automatically.
--envCacheDir   PATH to the environment cache directory (either conda or containers). Default: launch directory of the pipeline
6. Examples
Here are some examples of how to deploy TransPi depending on the method used (e.g. conda) and the analyses to be performed.
6.1. Profiles
You can use TransPi with: a local conda environment (from the precheck); individual conda environments per process; or Docker or Singularity containers.
6.1.1. Conda
This way of executing TransPi assumes that conda is installed locally. All of this is done automatically for you, if desired, by the precheck script.
Example:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile conda
-profile conda tells TransPi to use conda. An individual environment is used per process.
6.1.2. Containers
Docker or Singularity can also be used to deploy TransPi. You can either use individual containers for each process or a single TransPi container with all the tools.
Individual
To use individual containers:
Example for docker:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile docker
Example for singularity:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile singularity
Some individual containers can create problems. We are working on solving these issues. In the meantime you can use the TransPi container (see below).
TransPi container
To use the TransPi container with all the tools, use the profile TransPiContainer.
Example for docker:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile docker,TransPiContainer
Example for singularity:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile singularity,TransPiContainer
6.2. Other examples
The order of the arguments is not important.
6.2.1. Filtering
Scenario:
Sample:            Coral sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results_Acropora
Work directory:    work_acropora
Engine:            conda
Filter species:    on
host:              scleractinian proteins
symbiont:          symbiodinium proteins
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-w work_acropora -profile conda --filterSpecies \
--host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
--symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta
6.2.2. BUSCO distribution
Scenario:
Sample:              SampleA
Read length:         100bp
TransPi mode:        --all
Kmers:               25,41,57,67
Reads PATH:          /YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz
Output directory:    Results_SampleA
Engine:              conda
All BUSCOs:          on
BUSCO distribution:  on
Command:
nextflow run TransPi.nf --all --maxReadLen 100 --k 25,41,57,67 \
--outdir Results_SampleA --reads '/YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz' \
-profile conda --allBuscos --buscoDist
6.2.3. --onlyEvi
Scenario:
Sample:            Assemblies from multiple assemblers and kmers
Read length:       50bp
TransPi mode:      --onlyEvi
Kmers:             25,33,37
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Reduction_results
Engine:            conda
Command:
nextflow run TransPi.nf --onlyEvi --outdir Reduction_results \
-profile conda
6.2.4. --onlyAnn
Scenario:
Sample:            Transcriptome missing annotation
Read length:       100bp
TransPi mode:      --onlyAnn
Kmers:             25,41,57,67
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Annotation_results
Engine:            singularity
Container:         TransPi container
Command:
nextflow run TransPi.nf --onlyAnn --outdir Annotation_results \
-profile singularity,TransPiContainer
6.2.5. Skip options
Scenario:
Sample:            Coral sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results_Acropora
Work directory:    work_acropora
Engine:            docker
Container:         Individual containers
Skip QC:           on
Skip Filter:       on
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-w work_acropora -profile docker \
--skipQC --skipFilter
6.2.6. Extra annotation steps
Scenario:
Sample:            Mollusc sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results
Engine:            conda
Skip QC:           on
SignalP:           on
TMHMM:             on
RNAmmer:           on
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
-profile conda --skipQC --withSignalP --withTMHMM --withRnammer
6.2.7. Full run and extra annotation
Scenario:
Sample:              Coral sample
Read length:         150bp
TransPi mode:        --all
Kmers:               25,35,55,75,85
Reads PATH:          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:    Results
Engine:              conda
Skip QC:             on
SignalP:             on
TMHMM:               on
RNAmmer:             on
Filter species:      on
host:                scleractinian proteins
symbiont:            symbiodinium proteins
All BUSCOs:          on
BUSCO distribution:  on
Remove rRNA:         on
rRNA database:       /YOUR/PATH/HERE/silva_rRNA_file.fasta
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
-profile conda --skipQC --withSignalP --withTMHMM --withRnammer \
--host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
--symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta \
--allBuscos --buscoDist --rRNAfilter \
--rRNAdb "/YOUR/PATH/HERE/silva_rRNA_file.fasta"
7. Extra information
Here are some notes that can help with the execution of TransPi, plus some important considerations based on Nextflow settings. For more detailed information visit the Nextflow documentation.
7.1. -resume
If an error occurs and you need to resume the pipeline, just include the -resume option when calling the pipeline.
./nextflow run TransPi.nf --onlyAnn -profile conda -resume
7.2. template.nextflow.config
7.2.1. Resources
The template.nextflow.config file has different resource configurations for each process of the pipeline (e.g. some with many CPUs, others with few). You can modify these depending on the resources available on your system.
Example:
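A sketch of what such a label block looks like in template.nextflow.config (the CPU and memory values shown here are illustrative, not the shipped defaults):

```groovy
process {
    withLabel: big_cpus {
        cpus = 30
        memory = 120.GB
    }
}
```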
In this case, all the processes using the label big_cpus will use 30 CPUs. If your system only has 20, please modify these values accordingly to avoid errors.
Setting the correct CPUs and RAM of your system is important, because Nextflow will start as many jobs as possible while resources are available.
For example, on a VM with 120 CPUs, Nextflow will be able to start four processes with the label big_cpus at the same time.
7.2.2. Data
The precheck is designed to create a new nextflow.config every time it is run, with the PATHs to the databases.
Apply changes that should persist between runs to template.nextflow.config instead; this way you avoid making the same changes to nextflow.config after every precheck run.
Example: add your cluster info to template.nextflow.config to avoid repeating it in the future.
7.3. Custom profiles
We use SLURM as the workload manager on our server, so we have custom profiles for the submission of jobs. For example, our nextflow.config has the following lines in the profiles section.
profiles {
palmuc {
process {
executor='slurm'
clusterOptions='--clusters=inter --partition=bigNode --qos=low'
}
}
}
You can add your own custom profiles depending on the settings of your system and the workload manager you use (e.g. SGE, PBS, etc.). The clusterOptions line can be used to add any other option you usually use for your job submissions.
7.4. Local nextflow
To avoid calling the pipeline using ./nextflow, make the nextflow launcher executable (chmod +x nextflow) and place it in a directory on your PATH. For running the pipeline you then just need to use:
nextflow run TransPi.nf --onlyAnn -profile conda
7.5. Real Time Monitoring
To monitor your pipeline remotely, without connecting to the server via ssh, use Nextflow Tower. Make an account with your email and follow their instructions. After this, you can run the pipeline with the -with-tower option and follow the execution of the processes live. For example:
nextflow run TransPi.nf --onlyAnn -profile conda -with-tower
8. Issues
We tested TransPi using the following deployment methods:

- conda: individual conda environments per process
- docker: using the TransPi container (i.e. -profile docker,TransPiContainer)
- singularity: using the TransPi container (i.e. -profile singularity,TransPiContainer)

Using individual containers per process works for the majority of processes. However, we found a couple of issues with some containers (e.g. transabyss). We are working to find a solution for these issues.
8.1. Reporting an issue
If you find a problem or get an error please let us know by opening an issue in the repository.
8.2. Test dataset
We include a test profile to try TransPi using a small dataset. However, this can create issues in some of the processes (e.g. contamination removal by psytrans).