hpexome

An automated tool for processing whole-exome sequencing data

Whole-exome sequencing has been widely used in clinical applications for identifying the genetic causes of several diseases. HPexome automates many data processing tasks for exome-sequencing data analysis of large-scale cohorts. Given analysis-ready alignment files, it breaks the input data into small genomic regions that are processed efficiently in parallel on cluster-computing environments. It relies on the Queue workflow execution engine and the GATK variant caller, following the GATK best practices, to output a high-confidence unified variant call file. Our workflow is shipped as a Python command-line tool, making it easy to install and use.

hpexome [OPTIONS] [DESTINATION]

OPTIONS

Required arguments

Optional arguments

Performance-specific arguments

System path to required software

DESTINATION Sets the directory in which the output files will be saved. If not set, outputs are written to the current working directory.
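
For example, a minimal invocation combining arguments from the reproducible example below with an explicit DESTINATION might look like this (file names are placeholders):

hpexome \
    --bam sample.sorted.rgfix.bam \
    --genome reference.fasta \
    --dbsnp dbsnp.vcf \
    --indels known_indels.vcf \
    output_directory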

Reproducible example

In this example we will download and process a whole-exome sequencing sample from the 1000 Genomes Project, along with the required reference files and software. First, create a directory to hold all the files.

mkdir hpexome
cd hpexome

HPexome requires only Python 3 and Java 8 to run, and optionally a DRMAA-supported batch processing system such as SGE. However, creating the input data requires aligning the raw sequencing reads to the reference genome and sorting them by coordinate. The required software is: BWA to align reads, samtools to convert the output to BAM, and Picard to sort reads and fix tags.

# HPexome
pip3 install hpexome
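# optional: verify the installation (assuming the CLI provides the conventional --help flag)
hpexome --help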

# BWA
wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2
tar xf bwa-0.7.17.tar.bz2 
make -C bwa-0.7.17 

# Samtools
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
tar xf samtools-1.10.tar.bz2
make -C samtools-1.10

# Picard
wget https://github.com/broadinstitute/picard/releases/download/2.21.7/picard.jar
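
Before continuing, it may help to confirm that the core prerequisites are available on the PATH:

python3 --version   # HPexome requires Python 3
java -version       # GATK/Queue requires Java 8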

Download raw sequencing data as FASTQ files.

wget \
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/SRR098401_1.filt.fastq.gz \
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/SRR098401_2.filt.fastq.gz

Download required reference files.

wget \
    https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.fasta \
    https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.fasta.fai \
    https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.dict \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.idx.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.idx.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.gz \
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.idx.gz

gunzip dbsnp_138.hg19.vcf.gz \
    dbsnp_138.hg19.vcf.idx.gz \
    Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
    Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz \
    1000G_phase1.indels.hg19.sites.vcf.gz \
    1000G_phase1.indels.hg19.sites.vcf.idx.gz \
    1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz \
    1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz \
    1000G_omni2.5.hg19.sites.vcf.gz \
    1000G_omni2.5.hg19.sites.vcf.idx.gz

Index reference genome.

bwa-0.7.17/bwa index ucsc.hg19.fasta

Align raw sequencing reads to the human reference genome.

# -K 100000000 processes a fixed number of input bases per batch (for reproducible output),
# -t 16 uses 16 threads, and -Y soft-clips supplementary alignments;
# samtools view -1 converts the SAM output to BAM with fast compression
bwa-0.7.17/bwa mem \
    -K 100000000 -t 16 -Y ucsc.hg19.fasta \
    SRR098401_1.filt.fastq.gz SRR098401_2.filt.fastq.gz \
    | samtools-1.10/samtools view -1 - > NA12878.bam

Sort aligned reads by genomic coordinates.

java -jar picard.jar SortSam \
    INPUT=NA12878.bam \
    OUTPUT=NA12878.sorted.bam \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true

Fix read group (RG) tags.

java -jar picard.jar AddOrReplaceReadGroups \
    I=NA12878.sorted.bam \
    O=NA12878.sorted.rgfix.bam \
    RGID=NA12878 \
    RGSM=NA12878 \
    RGLB=1kgenomes \
    RGPL=Illumina \
    RGPU=Unit1 \
    CREATE_INDEX=true
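
Optionally, the new read group can be checked in the BAM header using the samtools build from above; the expected output is a single @RG line containing the values set here.

samtools-1.10/samtools view -H NA12878.sorted.rgfix.bam | grep '^@RG'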

Some computing setups require the SGE_ROOT environment variable to be set to the Grid Engine installation directory, for example:

export SGE_ROOT=/var/lib/gridengine

Run HPexome.

# --scatter_count 16 splits the input into 16 genomic regions that are processed in parallel;
# --job_runner GridEngine submits the scattered jobs to Grid Engine
hpexome \
    --bam NA12878.sorted.rgfix.bam \
    --genome ucsc.hg19.fasta \
    --dbsnp dbsnp_138.hg19.vcf \
    --indels Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    --indels 1000G_phase1.indels.hg19.sites.vcf \
    --sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    --sites 1000G_omni2.5.hg19.sites.vcf \
    --scatter_count 16 \
    --job_runner GridEngine \
    result_files

The following output files are expected:

result_files/
├── HPexome.jobreport.txt
├── NA12878.sorted.rgfix.HC.raw.vcf
├── NA12878.sorted.rgfix.HC.raw.vcf.idx
├── NA12878.sorted.rgfix.HC.raw.vcf.out
├── NA12878.sorted.rgfix.intervals
├── NA12878.sorted.rgfix.intervals.out
├── NA12878.sorted.rgfix.realn.bai
├── NA12878.sorted.rgfix.realn.bam
├── NA12878.sorted.rgfix.realn.bam.out
├── NA12878.sorted.rgfix.recal.bai
├── NA12878.sorted.rgfix.recal.bam
├── NA12878.sorted.rgfix.recal.bam.out
├── NA12878.sorted.rgfix.recal.cvs
└── NA12878.sorted.rgfix.recal.cvs.out
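
As a quick sanity check, the number of variant records in the raw VCF can be counted (the exact number depends on the sample and reference bundle versions):

grep -vc '^#' result_files/NA12878.sorted.rgfix.HC.raw.vcf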

See the docs folder for information about performance and validation tests.

Development

Clone the Git repository from GitHub.

git clone https://github.com/labbcb/hpexome.git
cd hpexome

Create a virtual environment and install the development dependencies.

python3 -m venv venv
source venv/bin/activate
pip install --requirement requirements.txt
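
To work on hpexome itself, an editable install is one common option (a sketch assuming the standard setuptools layout with the setup.py used below):

pip install --editable .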

Publish a new hpexome version to PyPI.

pip install setuptools wheel twine
python setup.py sdist bdist_wheel
twine upload -u $PYPI_USER -p $PYPI_PASS dist/*