espresso-caller

Automated and reproducible tool for identifying genomic variants at scale

Processing high-throughput sequencing data was the biggest challenge in the past, dealing with small datasets of large sample data. Genomics projects with thousands of samples have became popular due to decreasing in sequencing costs. Now we are facing a new challenge: how to process and store large-scale datasets efficiently.

The Sequence Alignment/Map Format Specification (SAM) and its binary version (BAM) is the most popular and recommended file format to store sequence alignment data. However, due to the increasing throughput of sequencing platforms, BAM files have lost their efficiency, as we are dealing more frequently with deeper whole-genome sequencing (WGS), whole-exome sequencing (WES) and other large-scale genomics data.

The CRAM file format is an improved version of SAM format. Instead of storing DNA sequences aligned to a reference genome, it stores only the differences between the sample and the genome sequence. With this feature, it reduces dramatically the file sizes. To use this new format, common tools, like Samtools, require the original genome file (normally in FASTA format).

On another front, different computational tools have been developed to address different problems and they are often chained together in a pipeline to solve a bigger problem. For example, a basic variant caller workflow will:

  1. map the reads from a FASTQ file to a reference genome;
  2. use the allelic counts at each site to call a genotype.

It is often the case that each of the aforementioned steps will be performed by a different tool (like BWA + GATK). To assist on this matter (combining different tools, linking the output of one step to the input of the following), languages that are agnostic to the choice of platform were developed. These languages allow us to describe tasks and combine them into workflows, using a dialect closer to the research domain and distant from the computer science domain. Common Workflow Language (CWL), Nextflow and Workflow Description Language (WDL) are the most popular languages to define workflows for processing genomics data. It is important to highlight that these languages only describe what needs to be done.

As an actual user, one expects that the aforementioned tasks/workflows are executed to produce results of interest. For this reason, researchers developed workflow execution systems, which combine workflow files (.cwl, .nf or .wdl) with parameter files (.json of .yaml) and input data to generate a dependency tree and actually execute the processing tasks. The most popular workflow executors are: Toil and Cromwell. These tools are flexible enough to support different computing backends such as local, cluster, custom environments and cloud.

Distributing bioinformatics tools and keeping track of their versions between several workflows and computing nodes have become a big problem for both system administrators (system updates breaks user tools) and users (updated version o tool outputs different results, compromising the reproducibility of the research – for more information see FAIR Guiding Principles). To solve this problem the research community adopted container technologies, mainly Singularity and Docker. Projects like BioContainers provide thousands of container images of bioinformatics tools. Also, Docker Hub is the main repository for popular open-source projects.

Back to workflows… the workflow file + inputs file have been working well for small datasets. With our wftools software, we can submit workflows (and their inputs) to execution services, such as Cromwell in server mode, check workflow executions status and logs, and collect output files copying them to desirable directory. The input JSON file must contain workflow-specific parameter to work, for example, a workflow that require FASTQ files as input we have to provide a list of absolute paths to FASTQ files (two list if paired-end), other workflows may require several resource files that can be specific for different genome versions. It is important to note that Cromwell checks the file existence immediately prior to its use within the task, and this is associated with several downstream errors that force the system to abort the execution of the workflow. Additionally, the input JSON files are manually produced.

Espresso-Caller is a tool that automates execution of workflows for identification of genomic variants from whole-exome and whole-genome datasets. It collects raw sequencing paired-end FASTQ or raw gVCF files, together with resources files (b37 or hg38 versions), assessing their existence, to generate the JSON file that is used as input of workflows for processing WES/WGS data. Our software takes some conventions to automatically find the required files and to define sample metadata. Internally it submits WDL files to Cromwell server through Cromwell API. The next section introduces the workflows, next section provides examples of command lines and explanation of arguments, conventions and outputs. The last section shows advanced uses of espresso This document ends with the actual usage help.

Workflows

espresso provides three workflows: hc (HaplotypeCalling) takes raw FASTQ files and their metadata as input plus resources files and generates per-sample raw gVCF alongside unmapped BAM and analysis-ready CRAM files. joint (JointDiscovery) takes raw gVCF files and merge into a single unified VCF; all executes all data processing steps: from raw FASTQs to unified VCF.

Our tool bundles in-house workflows and workflows defined by the Broad Institute as GATK Best Practices defined in WDL format. The workflow files are stored inside the tool package. The list below presents each workflow in the order that is executed by this tool.

HaplotypeCalling runs the following workflows:

  1. Convert raw paired-end FASTQ files to unmapped BAM format (uBAM)
  2. Map sequences to reference genome, recalibrate alignments and generate analysis-ready BAM files
  3. Produce sample-specific sufficient statistics to call variants
  4. Validate BAM files
  5. Convert BAM files to CRAM format

JointDiscovery, we combine the sample-specific sufficient statistics to produce final genotype calls. For this step, we use the original WDL file produced by the Broad Institute.

Espresso-Caller follows some convention over configuration (where configuration is the inputs JSON file). Your data files must files the following conventions to work:

For all and hc:

Resource files can be downloaded using the following scripts. Downloaded files will have the right names to work with espresso.

To download resource files run. It is important to inform the absolute path to destination directory.

bash download_resources_b37_files.sh /home/data/ref/b37
bash download_resources_hg38_files.sh /home/data/ref/hg38

The following command will run espresso with two directories containing raw FASTQ files. Samples files can be separated by directories to use different metadata for each group of samples (or batch). If two --fastq <fastq directory> argument is defined you have to inform the following arguments twice.

Also, sample name (SM), platform unit (PU) are extracted from FASTQ automatically by using predefined conventions.

Next we have to inform the path to resources files (--reference <resources directory>) and the reference genome version (only b37 and hg38 are supported), --version <genome version>.

Some computing environments do not support container technology. In this case we have to inform the absolute paths to this software. See installing required software script.

The --dont_run flag does espresso to not submit workflows to Cromwell server. In this mode, the tool will check all required files, copy required workflow files and generate inputs JSON file writing both to destination directory. It is very recommend that you run our tool in this mode and check JSON file before running in production. Also, it is useful when there are some change to do in the default JSON file and then submit workflow using espresso instead.

The optional --move flag will tell espresso to move output files from Cromwell execution directory to destination directory. It is useful for processing large-scale genomics datasets avoiding file duplication.

The last two arguments are required: callset name and destination directory to write all files.

espresso all \
	--fastq ~/raw/batch1 \
	--fastq ~/raw/batch2 \
	--library Batch1 \
	--library Batch2 \
	--date 2018-10-19 \
	--date 2019-02-07 \
	--platform ILLUMINA \
	--platform ILLUMINA \
	--center MyCenter \
	--center MyCenter \
	--reference ~/ref/b37 \
	--version b37 \
	--gatk_path_override ~/bin/gatk-4.1.0.0/gatk \
	--gotc_path_override ~/bin \
	--samtools_path_override ~/bin/samtools-1.9/samtools \
	--move \
	--dont_run \
	MyDataset.b37 \
	~/res/my_dataset
Starting the haplotype-calling workflow with reference genome version b37
Workflow file: /home/data/res/my_dataset/haplotype-calling.wdl
Inputs JSON file: /home/data/res/my_dataset/haplotype-calling.b37.inputs.json
Workflow will not be submitted to Cromwell. See workflow files in /home/data/res/my_dataset

If you run without --dont_run argument you will see the text below. As you can see, after workflow is submitted no output is presented until execution is finished. Then the tool will print the collected output files and exit.

In this version all run hc first then joint. TODO: write WDL file that do both

Starting haplotype-calling workflow with reference genome version b37
Workflow file: /home/data/res/my_dataset/haplotype-calling.wdl
Inputs JSON file: /home/data/res/my_dataset/haplotype-calling.b37.inputs.json
Workflow submitted to Cromwell Server (http://localhost:8000)
Workflow id: 9977f400-d1b6-41ff-ab92-7ebbbf7a30a8

Expected outputs

At the end of execution espresso will collect output files copying them to destination directory. These are the file you expect to see.

For each sample it should have.

Running hc or joint individually

It is possible to run only hc by espresso hc ... and joint by espresso joint .... Running hc is useful when we don’t want to generate the unified VCF for theses FASTQ files. The joint is useful when we already have raw gVCF files and we want to merge to or more data from batches or studies. However there are some conventions that your VCF files must follow (only if they weren’t generated by espresso).

Reproduce data processing

With WDL and JSON files written in the result directory it is possible to re-execute data processing without espresso. To do that we need the Cromwell binary file. It should also works on different workflow engine such as miniwdl or workflow submission toolm such as wf.

Run workflow Cromwell in Local mode

java -jar cromwell.jar run \
	-i /home/data/res/my_dataset/haplotype-calling.b37.inputs.json \
	/home/data/res/my_dataset/haplotype-calling.wdl

Development

Install latest development version.


Reproducible example

FASTQ_DIR=NA12878
REF_DIR=references/hg38
DATASET_NAME=NA12878
RESULT_DIR=results/hg38
GENOME_VERSION=hg38

espresso all \
    --fastq $FASTQ_DIR \
    --library Phase3 \
    --date 2015-07-30 \
    --platform Illumina \
    --center IGSR \
    --disable_platform_unit \
    --reference $REF_DIR \
    --version $GENOME_VERSION \
	--dont_run \
    $DATASET_NAME \
    $RESULT_DIR

espresso joint \
    --vcf $RESULT_DIR \
	--prefix "" \
    --reference $REF_DIR \
    --version $GENOME_VERSION \
	--dont_run \
    $DATASET_NAME \
    $RESULT_DIR

Required resources files

Broad References - Human genomics reference files used for sequencing analytics.

Container images

Memory requirements

Workflow.Task Default Espresso argument
ConvertPairedFastQsToUnmappedBamWf.PairedFastQsToUnmappedBAM 7 --fastq_bam_mem_size_gb
ConvertPairedFastQsToUnmappedBamWf.CreateFoFN    
PreProcessingForVariantDiscovery_GATK4.GetBwaVersion 1  
PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem 14 --align_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment 4 --merge_bam_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.SortAndFixTags 10 --sort_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.MarkDuplicates 7.5 --mark_duplicates_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.CreateSequenceGroupingTSV 2  
PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator 6 --baserecalibrator_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports 4  
PreProcessingForVariantDiscovery_GATK4.ApplyBQSR 4 --aplly_bqsr_mem_size_gb
PreProcessingForVariantDiscovery_GATK4.GatherBamFiles 3  
HaplotypeCallerGvcf_GATK4.CramToBamTask 15  
HaplotypeCallerGvcf_GATK4.HaplotypeCaller 7  
HaplotypeCallerGvcf_GATK4.MergeGVCFs 3  
ValidateBamsWf.ValidateBAM 4  
BamToCram.ConvertBamToCram    
JointGenotyping.GetNumberOfSamples 1  
JointGenotyping.ImportGVCFs 7  
JointGenotyping.GenotypeGVCFs 7  
JointGenotyping.HardFilterAndMakeSitesOnlyVcf 3.5  
JointGenotyping.IndelsVariantRecalibrator 26 --indels_variant_recalibrator_mem_size_gb
JointGenotyping.SNPsVariantRecalibratorCreateModel 104  
JointGenotyping.SNPsVariantRecalibrator 3.5 --snps_variant_recalibrator_mem_size_gb
JointGenotyping.GatherTranches 7  
JointGenotyping.ApplyRecalibration 7  
JointGenotyping.GatherVcfs 7  
JointGenotyping.CollectVariantCallingMetrics 7  
JointGenotyping.GatherMetrics 7  
JointGenotyping.DynamicallyCombineIntervals 3  

CPU requirements

Number of CPU cores are defined for all tasks. The PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem task requires 16 CPU cores by default. To change this value, for example to 10, use --align_num_cpu 10 argument. It is also required to set BWA command line --bwa_commandline_override "bwa mem -K 100000000 -p -v 3 -t 10 -Y \$bash_ref_fasta". This will instruct the software to use 10 threads (-t 10). The scape character in \$bash_ref_fasta is required!

Container images

Development

Install latest development version.

python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install click requests
pip install --force git+https://github.com/labbcb/espresso-caller.git

Clone and test package

git clone https://github.com/labbcb/espresso-caller.git
cd esresso-caller
pip install -e .

python -m unittest discover -s tests