How to use TransPi

1. Description

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

TransPi provides a useful resource for generating de novo transcriptome assemblies with minimal user input and without sacrificing the thoroughness of the analysis.

1.1. Programs used

  • List of programs used by TransPi:

    • FastQC

    • fastp

    • SortMeRNA

    • rnaSPAdes

    • SOAP

    • Trinity

    • Velvet

    • Oases

    • Trans-ABySS

    • rnaQUAST

    • EvidentialGene

    • CD-HIT

    • Exonerate

    • BLAST

    • BUSCO

    • Psytrans

    • TransDecoder

    • Trinotate

    • Diamond

    • HMMER

    • Bowtie2

    • RNAmmer

    • TMHMM

    • SignalP

    • iPath

    • SQLite

    • R

    • Python

  • Databases used by TransPi:

    • Swissprot

    • Uniprot custom database (e.g. all metazoan proteins)

    • Pfam

2. Installing TransPi

2.1. Requirements

  • System: Linux OS

  • Data type: Paired-end reads

    Example:
        IndA_R1.fastq.gz, IndA_R2.fastq.gz
Make sure reads end with _R1.fastq.gz and _R2.fastq.gz. Multiple individuals can be run at the same time.
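A quick shell check can confirm that every R1 file has its R2 mate before launching the pipeline. This is only a convenience sketch; the reads directory name is an example, so point it at your own folder:

```shell
# Verify that every *_R1.fastq.gz file has a matching *_R2.fastq.gz mate.
# "reads" is an example directory name; point it at your own reads folder.
check_pairs() {
    for r1 in "$1"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                     # no R1 files found
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -f "$r2" ] || echo "Missing pair for $r1"
    done
}
check_pairs reads
```

If the loop prints nothing, all pairs are complete.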

2.2. Downloading TransPi

1- Clone the repository

git clone https://github.com/palmuc/TransPi.git

2- Move to the TransPi directory

cd TransPi

2.3. Configuration

TransPi requires various databases to run. The precheck script will install the databases and, if necessary, the software needed to run the tool. The precheck run needs a PATH as an argument, where it will install (locally) all the databases the pipeline needs.

bash precheck_TransPi.sh /YOUR/PATH/HERE/
This process may take a while depending on the options you select. The longest step is downloading, if selected, all metazoan proteins from UniProt (~6 GB). The other processes and databases are relatively fast, depending on your internet connection.

Once the precheck run is done it will create a file named nextflow.config that contains the various PATH for the databases. If selected, it will also have the local conda environment PATH.

The nextflow.config file has also other important parameters for pipeline execution that will be discussed further in the following sections.

3. Running TransPi

3.1. Full analysis (--all)

After the successful run of the precheck script you are set to run TransPi.

We recommend running TransPi with the option --all, which performs the complete analysis, from raw-read filtering to annotation. Other options are described below.

To run the complete pipeline:

nextflow run TransPi.nf --all --reads '/YOUR/READS/PATH/HERE/*_R[1,2].fastq.gz' \
     --k 25,41,53 --maxReadLen 75 -profile conda

Argument explanations:

--all                Run the full TransPi analysis
--reads              PATH to the paired-end reads
--k                  List of kmers to use for the assemblies
--maxReadLen         Maximum read length in the library
-profile             Deployment profile to use (e.g. conda)

If you combine multiple libraries of the same individual to create a reference transcriptome, which will later be used in downstream analyses (e.g. differential expression), make sure the kmer list is based on the length of the shortest read library and maxReadLen on the longest read length.

Example: Combining reads of 100bp with 125bp

--k 25,41,53,61 --maxReadLen 125
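The rule above can be sketched in shell: start from a candidate kmer list and keep only values smaller than the shortest read length (the lengths and candidate kmers below are hypothetical example values):

```shell
# Shortest library is 100 bp, longest is 125 bp (example values).
SHORTEST=100
LONGEST=125
CANDIDATES="25 41 53 61 75 101 125"    # hypothetical candidate kmers
KLIST=""
for k in $CANDIDATES; do
    if [ "$k" -lt "$SHORTEST" ]; then  # kmers must fit the shortest reads
        KLIST="${KLIST:+$KLIST,}$k"
    fi
done
echo "--k $KLIST --maxReadLen $LONGEST"
# prints: --k 25,41,53,61,75 --maxReadLen 125
```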

You can run multiple samples at the same time

3.2. Other options

3.2.1. --onlyAsm

Run only the assemblies and the EvidentialGene analysis.

Example for --onlyAsm:

nextflow run TransPi.nf --onlyAsm --reads '/home/rrivera/TransPi/reads/*_R[1,2].fastq.gz' \
     --k 25,41,53 --maxReadLen 75 -profile conda
You can run multiple samples at the same time

3.2.2. --onlyEvi

Run only the EvidentialGene analysis.

Example for --onlyEvi:

nextflow run TransPi.nf --onlyEvi -profile conda
TransPi looks for a directory named onlyEvi. It expects one file per sample to perform the reduction; the file should contain all the assemblies for that sample concatenated into one.
You can run multiple samples at the same time
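One way to prepare that input is to concatenate all of a sample's assemblies into a single file inside onlyEvi. A minimal sketch, assuming a hypothetical <sample>.<assembler>.fa naming scheme (adjust the glob to your own files):

```shell
# combine_assemblies SampleA  -> writes onlyEvi/SampleA.fa
# Assumes the sample's assemblies are named <sample>.<assembler>.fa
# (hypothetical naming; adjust the glob to your own files).
combine_assemblies() {
    mkdir -p onlyEvi
    cat "$1".*.fa > "onlyEvi/$1.fa"
}
```

For example, `combine_assemblies SampleA` merges SampleA.Trinity.fa, SampleA.SPAdes.fa, etc. into onlyEvi/SampleA.fa.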

3.2.3. --onlyAnn

Run only the Annotation analysis (starting from a final assembly)

Example for --onlyAnn:

nextflow run TransPi.nf --onlyAnn -profile conda
TransPi looks for a directory named onlyAnn. It expects one file per sample to perform the annotation.
You can run multiple samples at the same time

3.3. Using -profiles

TransPi can also use docker, singularity, and individual conda installations (i.e. per process) to deploy the pipeline.

test                Run TransPi with a test dataset
conda               Run TransPi with conda
docker              Run TransPi with docker containers
singularity         Run TransPi with singularity containers
TransPiContainer    Run TransPi with a single container with all the necessary tools

Multiple profiles can be specified (comma separated)

Example: -profile test,singularity

Refer to Section 6 of this manual for further details on deployment of TransPi using other profiles.

4. Results

4.1. Directories

4.1.1. results

After a successful run of TransPi the results are saved in a directory called results. This directory is divided into multiple directories for each major step of the pipeline.

Directories will be created based on the options selected in the pipeline execution
fastqc          FastQC HTML files
filter          Filtering step HTML files
rRNA_reads      Info and reads from the rRNA removal process
normalization   Normalized read files
saveReads       Reads saved from the filtering and normalization processes
assemblies      All individual assemblies
evigene         Non-redundant final transcriptome (name ends with .combined.okay.fa)
rnaQuast        rnaQUAST output
mapping         Mapping results
busco4          BUSCO V4 results
transdecoder    TransDecoder results
trinotate       Annotation results
report          Interactive TransPi report
figures         Figures created by TransPi (BUSCO comparison, annotation, GO, etc.)
stats           Basic stats from all steps of TransPi
pipeline_info   Nextflow report, trace file, and others
RUN_INFO.txt    Versions of all tools used by TransPi, plus run info such as the command and PATHs

NOTES
  • Name of output directory can be changed by using the --outdir parameter when executing the pipeline. Example --outdir Results_SampleA.

  • If multiple samples are run, each directory will contain the files for all samples, each with a unique sample name.

4.1.2. work

A directory called work is also created when running TransPi. It contains all the Nextflow working files, TransPi results, and intermediate files.

The work directory can be removed after the pipeline is done, since all important files are stored in the results directory.

4.2. Figures

TransPi produces multiple figures that are stored in the results directory.

Example: the UniProt annotation figure in the figures directory.

4.3. Report

TransPi creates an interactive custom HTML report for easy data exploration.

NOTE
  • The example report here is a PDF file and not an HTML file. However, the original HTML file with interactive visualizations (i.e. as generated by TransPi) can be downloaded here

5. Additional options

There are other parameters that can be changed when executing TransPi.

5.1. Output options

--outdir

Name of the output directory. Example: --outdir Sponges_150. Default "results"

-w, -work

Name of the working directory. Example: -work Sponges_work. Only one dash is needed for -work, since it is a Nextflow option rather than a TransPi parameter.

--tracedir

Name for directory to save pipeline trace files. Default "pipeline_info"

5.2. Additional analyses

--rRNAfilter

Remove rRNA from sequences. Requires option --rRNAdb

--rRNAdb

PATH to database of rRNA sequences to use for filtering of rRNA. Default ""

--filterSpecies

Perform psytrans filtering of transcriptome. Requires options --host and --symbiont

--host

Host (or similar) protein file.

--symbiont

Symbiont (or similar) protein file.

--psyval

Psytrans value to train model. Default "160"

--allBuscos

Run BUSCO analysis on all assemblies

--rescueBusco

Generate BUSCO distribution analysis

--minPerc

Minimum percentage of assemblers required for the BUSCO distribution. Default ".70"

--shortTransdecoder

Run Transdecoder without the homology searches

--withSignalP

Include SignalP for the annotation. Needs manual installation of CBS-DTU tools. Default "false"

--signalp

PATH to SignalP software. Default ""

--withTMHMM

Include TMHMM for the annotation. Needs manual installation of CBS-DTU tools. Default "false"

--tmhmm

PATH to TMHMM software. Default ""

--withRnammer

Include Rnammer for the annotation. Needs manual installation of CBS-DTU tools. Default "false"

--rnam

PATH to Rnammer software. Default ""

5.3. Skip options

--skipEvi

Skip EvidentialGene run in --onlyAsm option. Default "false"

--skipQC

Skip FastQC step. Default "false"

--skipFilter

Skip fastp filtering step. Default "false"

--skipKegg

Skip KEGG analysis. Default "false"

--skipReport

Skip generation of final TransPi report. Default "false"

5.4. Other parameters

--minQual

Minimum quality score for fastp filtering. Default "25"

--pipeInstall

PATH to TransPi directory. Default "". If precheck is used this will be added to the nextflow.config automatically.

--envCacheDir

PATH for environment cache directory (either conda or containers). Default "Launch directory of pipeline"

6. Examples

Here are some examples of how to deploy TransPi depending on the method used (e.g. conda) and the analyses to be performed.

6.1. Profiles

You can use TransPi with either:

  • a local conda environment (from the precheck)

  • individual conda environments per process

  • docker or singularity containers

6.1.1. Conda

This way of executing TransPi assumes that you have conda installed locally. All of this is done automatically for you, if desired, by the precheck script.

Example:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -profile conda
-profile conda tells TransPi to use conda. An individual environment is used per process.

6.1.2. Containers

Docker or singularity can also be used to deploy TransPi. You can either use individual containers for each process or a single TransPi container with all the tools.

Individual

To use individual containers:

Example for docker:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -profile docker

Example for singularity:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -profile singularity
Some individual containers can create problems. We are working on solving these issues. In the meantime you can use the TransPi container (see below).
TransPi container

To use the TransPi container with all the tools you need to use the profile TransPiContainer.

Example for docker:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -profile docker,TransPiContainer

Example for singularity:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -profile singularity,TransPiContainer

6.2. Other examples

Order of commands is not important.

6.2.1. Filtering

Scenario:

Sample              Coral sample
Read length         150bp
TransPi mode        --all
Kmers               25,35,55,75,85
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Results_Acropora
Work directory      work_acropora
Engine              conda
Filter species      on
Host                scleractinian proteins
Symbiont            symbiodinium proteins

Command:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -w work_acropora -profile conda --filterSpecies \
    --host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
    --symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta

6.2.2. BUSCO distribution

Scenario:

Sample              SampleA
Read length         100bp
TransPi mode        --all
Kmers               25,41,57,67
Reads PATH          /YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz
Output directory    Results_SampleA
Engine              conda
All BUSCOs          on
BUSCO distribution  on

Command:

nextflow run TransPi.nf --all --maxReadLen 100 --k 25,41,57,67 \
    --outdir Results_SampleA --reads '/YOUR/PATH/HERE/SampleA/*_R[1,2].fastq.gz' \
    -profile conda --allBuscos --buscoDist

6.2.3. --onlyEvi

Scenario:

Sample              Assemblies from multiple assemblers and kmers
Read length         50bp
TransPi mode        --onlyEvi
Kmers               25,33,37
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Reduction_results
Engine              conda

Command:

nextflow run TransPi.nf --onlyEvi --outdir Reduction_results \
    -profile conda
NOTES
  • A directory named onlyEvi is needed for this option, containing the transcriptome(s) to reduce. You can do multiple transcriptomes at the same time; each file should have a unique name.

  • No need to specify reads PATH, length, cutoff, or kmers when using --onlyEvi.

6.2.4. --onlyAnn

Scenario:

Sample              Transcriptome missing annotation
Read length         100bp
TransPi mode        --onlyAnn
Kmers               25,41,57,67
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Annotation_results
Engine              singularity
Container           TransPi container

Command:

nextflow run TransPi.nf --onlyAnn --outdir Annotation_results \
    -profile singularity,TransPiContainer
NOTES
  • A directory named onlyAnn is needed for this option, containing the transcriptome(s) to annotate. You can do multiple transcriptomes (i.e. samples) at the same time; each file should have a unique name.

  • No need to specify reads PATH, length, cutoff, or kmers when using --onlyAnn.

6.2.5. Skip options

Scenario:

Sample              Coral sample
Read length         150bp
TransPi mode        --all
Kmers               25,35,55,75,85
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Results_Acropora
Work directory      work_acropora
Engine              docker
Container           Individual containers
Skip QC             on
Skip Filter         on

Command:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results_Acropora \
    -w work_acropora -profile docker \
    --skipQC --skipFilter

6.2.6. Extra annotation steps

Scenario:

Sample              Mollusc sample
Read length         150bp
TransPi mode        --all
Kmers               25,35,55,75,85
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Results
Engine              conda
Skip QC             on
SignalP             on
TMHMM               on
RNAmmer             on

Command:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
    -profile conda --skipQC --withSignalP --withTMHMM --withRnammer
NOTE
  • This option requires manual installation of the CBS-DTU tools: SignalP, TMHMM, and RNAmmer.

  • For more info visit the CBS-DTU tools website.

  • It also assumes that the PATHs for all the tools are in the nextflow.config file.

6.2.7. Full run and extra annotation

Scenario:

Sample              Coral sample
Read length         150bp
TransPi mode        --all
Kmers               25,35,55,75,85
Reads PATH          /YOUR/PATH/HERE/*_R[1,2].fastq.gz
Output directory    Results
Engine              conda
Skip QC             on
SignalP             on
TMHMM               on
RNAmmer             on
Filter species      on
Host                scleractinian proteins
Symbiont            symbiodinium proteins
All BUSCOs          on
BUSCO distribution  on
Remove rRNA         on
rRNA database       /YOUR/PATH/HERE/silva_rRNA_file.fasta

Command:

nextflow run TransPi.nf --all --maxReadLen 150 --k 25,35,55,75,85 \
    --reads '/YOUR/PATH/HERE/*_R[1,2].fastq.gz' --outdir Results \
    -profile conda --skipQC --withSignalP --withTMHMM --withRnammer \
    --host /YOUR/PATH/HERE/uniprot-Scleractinia.fasta \
    --symbiont /YOUR/PATH/HERE/uniprot-symbiodinium.fasta \
    --allBuscos --buscoDist --rRNAfilter \
    --rRNAdb "/YOUR/PATH/HERE/silva_rRNA_file.fasta"

7. Extra information

Here are some notes that can help with the execution of TransPi, as well as some important considerations based on Nextflow settings. For more detailed information, visit the Nextflow documentation

7.1. -resume

If an error occurs and you need to resume the pipeline just include the -resume option when calling the pipeline.

./nextflow run TransPi.nf --onlyAnn -profile conda -resume

7.2. template.nextflow.config

7.2.1. Resources

The template.nextflow.config file has different resource configurations for each process of the pipeline (e.g. some with many CPUs, others with few). You can modify these depending on the resources available on your system.

Example:

process {
    withLabel: big_cpus {
        cpus='30'
        memory='15 GB'
    }
}

In this case, all processes using the label big_cpus will use 30 CPUs. If your system only has 20, please modify these values accordingly to avoid errors.

Setting the correct number of CPUs and amount of RAM for your system is important because Nextflow will start as many jobs as possible while resources are available. For example, on a VM with 120 CPUs, Nextflow can start four processes with the label big_cpus at once.
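If your machine has fewer CPUs, you can override the labels in one place. A minimal sketch of such an override; the exact label names beyond big_cpus (e.g. low_cpus) are assumptions, so check which labels your template.nextflow.config actually defines:

```groovy
process {
    withLabel: big_cpus {
        cpus = '20'        // lowered from 30 to fit a 20-CPU machine
        memory = '15 GB'
    }
    withLabel: low_cpus {  // hypothetical label; verify it exists in your template
        cpus = '4'
        memory = '4 GB'
    }
}
```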

7.2.2. Data

The precheck is designed to create a new nextflow.config, with the PATHs to the databases, every time it is run. Make any changes that should apply to every analysis in template.nextflow.config instead; this way you avoid repeating the same edits to nextflow.config after each precheck run.

Example: add your cluster info to template.nextflow.config to avoid repeating it in the future.

7.3. Custom profiles

We use SLURM as the workload manager on our server, so we have custom profiles for job submission. For example, our nextflow.config has the following lines in the profiles section.

profiles {
    palmuc {
        process {
            executor='slurm'
            clusterOptions='--clusters=inter --partition=bigNode --qos=low'
        }
    }
}

You can add your own custom profiles depending on your system's settings and the workload manager you use (e.g. SGE, PBS, etc.).

The clusterOptions line can be used to add any other options you usually include in your job submissions.

7.4. Local nextflow

To avoid calling the pipeline with ./nextflow …​ you can make the nextflow binary executable (chmod +x nextflow) and add it to your PATH. Then, to run the pipeline, you just need:

nextflow run TransPi.nf …​

7.5. Real Time Monitoring

To monitor your pipeline remotely without connecting to the server via ssh, use Nextflow Tower. Make an account with your email and follow their instructions. After this, you can run the pipeline with the -with-tower option and follow the execution of the processes live.

nextflow run TransPi.nf --all -with-tower -profile conda

8. Issues

We tested TransPi using the following deployment methods:

  • conda = individual conda environments per process

  • docker = using TransPi container (i.e. -profile docker,TransPiContainer)

  • singularity = using TransPi container (i.e. -profile singularity,TransPiContainer)

Using individual containers per process works for the majority of processes. However, we found a couple of issues with some containers (e.g. transabyss). We are working on a solution for these issues.

8.1. Reporting an issue

If you find a problem or get an error please let us know by opening an issue in the repository.

8.2. Test dataset

We include a test profile for trying TransPi with a small dataset. However, this can create issues in some of the processes (e.g. contamination removal by psytrans).