How to use TransPi
1. Description
TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly
TransPi provides a useful resource for the generation of de novo transcriptome assemblies, with minimal user input but without losing the ability to perform a thorough analysis.
1.1. Programs used
List of programs used by TransPi:

- FastQC
- fastp
- SortMeRNA
- rnaSPAdes
- SOAP
- Trinity
- Velvet
- Oases
- TransABySS
- rnaQUAST
- EvidentialGene
- CD-HIT
- Exonerate
- BLAST
- BUSCO
- Psytrans
- TransDecoder
- Trinotate
- DIAMOND
- HMMER
- Bowtie2
- RNAmmer
- TMHMM
- SignalP
- iPath
- SQLite
- R
- Python

Databases used by TransPi:

- SwissProt
- UniProt custom database (e.g. all metazoan proteins)
- Pfam
2. Installing TransPi
2.1. Requirements
- System: Linux OS
- Data type: paired-end reads
  Example: IndA_R1.fastq.gz, IndA_R2.fastq.gz
  Make sure the reads end with _R1.fastq.gz and _R2.fastq.gz.
  Multiple individuals can be run at the same time.
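If your files use a different naming scheme, a quick rename can bring them into the expected pattern. A minimal sketch, assuming files currently end in _1.fq.gz and _2.fq.gz (hypothetical names; adapt the pattern to your data):

```shell
# Demo files with a hypothetical naming scheme; replace with your real reads.
touch IndA_1.fq.gz IndA_2.fq.gz

# Rename to the pattern TransPi expects: *_R1.fastq.gz / *_R2.fastq.gz
for f in *_1.fq.gz; do mv "$f" "${f%_1.fq.gz}_R1.fastq.gz"; done
for f in *_2.fq.gz; do mv "$f" "${f%_2.fq.gz}_R2.fastq.gz"; done
```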
2.2. Downloading TransPi
1- Clone the repository
git clone https://github.com/palmuc/TransPi.git
2- Move to the TransPi directory
cd TransPi
2.3. Configuration
TransPi requires various databases to run. The precheck script will install the databases and, if necessary, the software needed to run the tool. The precheck run needs a PATH as an argument, where it will install (locally) all the databases the pipeline needs.
bash precheck_TransPi.sh /YOUR/PATH/HERE/
This process may take a while depending on the options you select. The longest step is downloading, if desired, the entire set of metazoan proteins from UniProt (6GB). Other processes and databases are relatively fast, depending on your internet connection.
Once the precheck run is done it will create a file named nextflow.config that contains the various PATHs for the databases. If selected, it will also contain the PATH of the local conda environment. The nextflow.config file also has other important parameters for pipeline execution that are discussed in the following sections.
3. Running TransPi
3.1. Full analysis (--all)
After a successful run of the precheck script you are ready to run TransPi.
We recommend running TransPi with the option --all, which performs the complete analysis, from raw read filtering to annotation.
Other options are described below.
To run the complete pipeline:
nextflow run TransPi.nf --all --reads '/YOUR/READS/PATH/HERE/*_R[1,2].fastq.gz' \
--k 25,41,53 --maxReadLen 75 -profile conda
Argument explanations:
--all          Run the full TransPi analysis
--reads        PATH to the paired-end reads
--k            List of kmers to use for the assemblies
--maxReadLen   Maximum read length in the library
-profile       Engine to run the analyses (here conda)
If you combine multiple libraries of the same individual to create a reference transcriptome, to be used later in downstream analyses (e.g. differential expression), make sure the kmer list is based on the length of the shortest read library and that --maxReadLen is set to the longest. Example: when combining reads of 100bp with 125bp, base the kmer list on 100bp and set --maxReadLen to 125.
You can run multiple samples at the same time.
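When combining libraries, keep the R1 and R2 files separate. A minimal sketch (the library file names are hypothetical); note that concatenated gzip files are themselves a valid gzip stream:

```shell
# Tiny demo libraries (hypothetical names); in practice these are your runs.
printf '@r1\nACGT\n+\nIIII\n' | gzip > IndA_lib1_R1.fastq.gz
printf '@r2\nACGT\n+\nIIII\n' | gzip > IndA_lib2_R1.fastq.gz

# Combine all R1 files of the same individual into one file
# (repeat the same for the R2 files).
cat IndA_lib1_R1.fastq.gz IndA_lib2_R1.fastq.gz > IndA_R1.fastq.gz
```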
3.2. Other options
3.2.1. --onlyAsm
Run only the assemblies and the EvidentialGene analysis.
Example for --onlyAsm:
nextflow run TransPi.nf --onlyAsm --reads '/home/rrivera/TransPi/reads/*_R[1,2].fastq.gz' \
--k 25,41,53 --maxReadLen 75 -profile conda
You can run multiple samples at the same time.
3.2.2. --onlyEvi
Run only the EvidentialGene analysis.
Example for --onlyEvi:
nextflow run TransPi.nf --onlyEvi -profile conda
TransPi looks for a directory named onlyEvi. It expects one file per sample to perform the reduction. The file should have all the assemblies concatenated into one.
You can run multiple samples at the same time.
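A sketch of preparing the onlyEvi input (the sample and assembly file names are hypothetical; only the onlyEvi directory name is required by TransPi):

```shell
# Demo per-assembler assemblies (hypothetical names and contents).
printf '>trinity_t1\nACGTACGT\n' > SampleA.Trinity.fa
printf '>spades_t1\nTTGGCCAA\n' > SampleA.SPAdes.fa

# One concatenated file per sample inside the onlyEvi directory.
mkdir -p onlyEvi
cat SampleA.*.fa > onlyEvi/SampleA.combined.fa
```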
3.2.3. --onlyAnn
Run only the annotation analysis (starting from a final assembly).
Example for --onlyAnn:
nextflow run TransPi.nf --onlyAnn -profile conda
TransPi looks for a directory named onlyAnn. It expects one file per sample to perform the annotation.
You can run multiple samples at the same time.
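A sketch of preparing the onlyAnn input (the file names are hypothetical; only the onlyAnn directory name is required by TransPi):

```shell
# Demo final assembly (hypothetical name); in practice this is your
# final transcriptome for the sample.
printf '>t1\nACGTACGT\n' > SampleA.fasta

# One assembly file per sample inside the onlyAnn directory.
mkdir -p onlyAnn
cp SampleA.fasta onlyAnn/
```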
3.3. Using -profile
TransPi can also use Docker, Singularity, or individual conda installations (i.e. per process) to deploy the pipeline.
test               Run TransPi with a test dataset
conda              Run TransPi with conda
docker             Run TransPi with Docker containers
singularity        Run TransPi with Singularity containers
TransPiContainer   Run TransPi with a single container with all the necessary tools
Multiple profiles can be specified (comma separated).
Refer to Section 6 of this manual for further details on deployment of TransPi using other profiles.
4. Results
4.1. Directories
4.1.1. results
After a successful run of TransPi the results are saved in a directory called results. This directory is divided into multiple subdirectories, one for each major step of the pipeline.
Subdirectories are created based on the options selected for the pipeline execution.
fastqc          FastQC HTML files
filter          Filter step HTML files
rRNA_reads      Info and reads from the rRNA removal process
normalization   Normalized read files
saveReads       Reads saved from the filter and normalization processes
assemblies      All individual assemblies
evigene         Non-redundant final transcriptome
rnaQuast        rnaQUAST output
mapping         Mapping results
busco4          BUSCO V4 results
transdecoder    TransDecoder results
trinotate       Annotation results
report          Interactive report of TransPi
figures         Figures created by TransPi (BUSCO comparison, annotation, GO, etc.)
stats           Basic stats from all steps of TransPi
pipeline_info   Nextflow report, trace file, and others
RUN_INFO.txt    File with the versions of all tools used by TransPi, plus run info such as the command and PATHs
4.1.2. work
A directory called work is also created when running TransPi. It contains all the Nextflow working files, TransPi results, and intermediate files.
The work directory can be removed after the pipeline is done, since all important files are stored in the results directory.
4.2. Report
TransPi creates an interactive custom HTML report for easy data exploration.
Example report: sponge transcriptome.
5. Additional options
There are other parameters that can be changed when executing TransPi.
5.1. Output options
--outdir     Name of the output directory
-w, -work    Name of the working directory
--tracedir   Name of the directory to save pipeline trace files. Default "pipeline_info"
5.2. Additional analyses
--rRNAfilter         Remove rRNA from sequences. Requires option --rRNAdb
--rRNAdb             PATH to a database of rRNA sequences to use for rRNA filtering. Default ""
--filterSpecies      Perform psytrans filtering of the transcriptome. Requires options --host and --symbiont
--host               Host (or similar) protein file
--symbiont           Symbiont (or similar) protein file
--psyval             Psytrans value to train the model. Default "160"
--allBuscos          Run the BUSCO analysis on all assemblies
--rescueBusco        Generate the BUSCO distribution analysis
--minPerc            Minimum percentage of assemblers required for the BUSCO distribution. Default ".70"
--shortTransdecoder  Run TransDecoder without the homology searches
--withSignalP        Include SignalP in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--withTMHMM          Include TMHMM in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--tmhmm              PATH to the TMHMM software. Default ""
--withRnammer        Include RNAmmer in the annotation. Needs manual installation of the CBS-DTU tools. Default "false"
--rnam               PATH to the RNAmmer software. Default ""
5.3. Skip options
--skipEvi     Skip the EvidentialGene run in the --onlyAsm option. Default "false"
--skipQC      Skip the FastQC step. Default "false"
--skipFilter  Skip the fastp filtering step. Default "false"
--skipKegg    Skip the KEGG analysis. Default "false"
--skipReport  Skip the generation of the final TransPi report. Default "false"
5.4. Other parameters
--minQual       Minimum quality score for fastp filtering. Default "25"
--pipeInstall   PATH to the TransPi directory. Default "". If the precheck is used, this is added to the nextflow.config automatically.
--envCacheDir   PATH to the environment cache directory (either conda or containers). Default: launch directory of the pipeline
6. Examples
Here are some examples of how to deploy TransPi depending on the method used (e.g. conda) and the analyses to be performed.
6.1. Profiles
You can use TransPi with: a local conda environment (from the precheck); individual conda environments per process; or Docker or Singularity containers.
6.1.1. Conda
This way of executing TransPi assumes that conda is installed locally. All of this is done automatically for you, if desired, by the precheck script.
Example:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile conda
-profile conda tells TransPi to use conda. An individual environment is used per process.
6.1.2. Containers
Docker or Singularity can also be used to deploy TransPi. You can either use individual containers for each process or a single TransPi container with all the tools.
Individual
To use individual containers:
Example for docker:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile docker
Example for singularity:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile singularity
Some individual containers can create problems. We are working on solving these issues. In the meantime you can use the TransPi container (see below).
TransPi container
To use the TransPi container with all the tools, use the profile TransPiContainer.
Example for docker:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile docker,TransPiContainer
Example for singularity:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-profile singularity,TransPiContainer
6.2. Other examples
The order of the arguments is not important.
6.2.1. Filtering
Scenario:
Sample:            Coral sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results_Acropora
Work directory:    work_acropora
Engine:            conda
Filter species:    on
host:              scleractinian proteins
symbiont:          symbiodinium proteins
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-w work_acropora -profile conda --filterSpecies \
--host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
--symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta
6.2.2. BUSCO distribution
Scenario:
Sample:              SampleA
Read length:         100bp
TransPi mode:        --all
Kmers:               25,41,57,67
Reads PATH:          /YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz
Output directory:    Results_SampleA
Engine:              conda
All BUSCOs:          on
BUSCO distribution:  on
Command:
nextflow run TransPi.nf --all --maxReadLen 100 --k 25,41,57,67 \
--outdir Results_SampleA --reads '/YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz' \
-profile conda --allBuscos --buscoDist
6.2.3. --onlyEvi
Scenario:
Sample:            Assemblies from multiple assemblers and kmers
Read length:       50bp
TransPi mode:      --onlyEvi
Kmers:             25,33,37
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Reduction_results
Engine:            conda
Command:
nextflow run TransPi.nf --onlyEvi --outdir Reduction_results \
-profile conda
6.2.4. --onlyAnn
Scenario:
Sample:            Transcriptome missing annotation
Read length:       100bp
TransPi mode:      --onlyAnn
Kmers:             25,41,57,67
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Annotation_results
Engine:            singularity
Container:         TransPi container
Command:
nextflow run TransPi.nf --onlyAnn --outdir Annotation_results \
-profile singularity,TransPiContainer
6.2.5. Skip options
Scenario:
Sample:            Coral sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results_Acropora
Work directory:    work_acropora
Engine:            docker
Container:         Individual containers
Skip QC:           on
Skip Filter:       on
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
-w work_acropora -profile docker \
--skipQC --skipFilter
6.2.6. Extra annotation steps
Scenario:
Sample:            Mollusc sample
Read length:       150bp
TransPi mode:      --all
Kmers:             25,35,55,75,85
Reads PATH:        /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:  Results
Engine:            conda
Skip QC:           on
SignalP:           on
TMHMM:             on
RNAmmer:           on
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
-profile conda --skipQC --withSignalP --withTMHMM --withRnammer
6.2.7. Full run and extra annotation
Scenario:
Sample:              Coral sample
Read length:         150bp
TransPi mode:        --all
Kmers:               25,35,55,75,85
Reads PATH:          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory:    Results
Engine:              conda
Skip QC:             on
SignalP:             on
TMHMM:               on
RNAmmer:             on
Filter species:      on
host:                scleractinian proteins
symbiont:            symbiodinium proteins
All BUSCOs:          on
BUSCO distribution:  on
Remove rRNA:         on
rRNA database:       /YOUR/PATH/HERE/silva_rRNA_file.fasta
Command:
nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
--reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
-profile conda --skipQC --withSignalP --withTMHMM --withRnammer \
--host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
--symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta \
--allBuscos --buscoDist --rRNAfilter \
--rRNAdb "/YOUR/PATH/HERE/silva_rRNA_file.fasta"
7. Extra information
Here are some notes that can help with the execution of TransPi, plus some important considerations based on Nextflow settings. For more detailed information visit the Nextflow documentation.
7.1. -resume
If an error occurs and you need to resume the pipeline, just include the -resume option when calling the pipeline.
./nextflow run TransPi.nf --onlyAnn -profile conda -resume
7.2. template.nextflow.config
7.2.1. Resources
The template.nextflow.config file has different resource configurations for each process of the pipeline (e.g. some with many CPUs, others with few). You can modify these depending on the resources available on your system.
Example:
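A sketch of what such a label block looks like in template.nextflow.config (the CPU and memory values shown here are illustrative, not the shipped defaults):

```groovy
process {
    withLabel: big_cpus {
        cpus = 30
        memory = 120.GB
    }
}
```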
In this case, all the processes using the label big_cpus will use 30 CPUs. If your system only has 20, please modify these values accordingly to avoid errors.
Setting the correct CPUs and RAM of your system is important, because Nextflow will start as many jobs as possible while resources are available.
For example, on a VM with 120 CPUs, Nextflow will be able to start four processes with the label big_cpus at the same time.
7.2.2. Data
The precheck is designed to create a new nextflow.config every time it is run, with the PATHs to the databases.
Apply changes that should persist between runs to template.nextflow.config instead; this way you avoid making the same changes to nextflow.config after every precheck run.
Example: add your cluster info to template.nextflow.config to avoid repeating it in the future.
7.3. Custom profiles
We use SLURM as the workload manager on our server, so we have custom profiles for the submission of jobs. For example, our nextflow.config has the following lines in the profiles section.
profiles {
palmuc {
process {
executor='slurm'
clusterOptions='--clusters=inter --partition=bigNode --qos=low'
}
}
}
You can add your own custom profiles depending on the settings of your system and the workload manager you use (e.g. SGE, PBS, etc.). The clusterOptions line can be used to add any other option you usually use for your job submissions.
7.4. Local nextflow
To avoid calling the pipeline using ./nextflow, make the nextflow launcher executable (chmod +x nextflow) and place it in a directory on your PATH. For running the pipeline you then just need to use:
nextflow run TransPi.nf --onlyAnn -profile conda
7.5. Real Time Monitoring
To monitor your pipeline remotely, without connecting to the server via ssh, use Nextflow Tower. Make an account with your email and follow their instructions. After this, you can run the pipeline with the -with-tower option and follow the execution of the processes live. For example:
nextflow run TransPi.nf --onlyAnn -profile conda -with-tower
8. Issues
We tested TransPi using the following deployment methods:

- conda: individual conda environments per process
- docker: using the TransPi container (i.e. -profile docker,TransPiContainer)
- singularity: using the TransPi container (i.e. -profile singularity,TransPiContainer)

Using individual containers per process works for the majority of processes. However, we found a couple of issues with some containers (e.g. transabyss). We are working to find a solution for these issues.
8.1. Reporting an issue
If you find a problem or get an error please let us know by opening an issue in the repository.
8.2. Test dataset
We include a test profile to try TransPi using a small dataset. However, this can create issues in some of the processes (e.g. contamination removal by psytrans).