Whole-exome sequencing has been widely used in clinical applications for the identification of the genetic causes of several diseases. HPexome automates many data processing tasks for exome-sequencing data analysis of large-scale cohorts. Given ready-analysis alignment files it is capable of breaking input data into small genomic regions to efficiently process in parallel on cluster-computing environments. It relies on Queue workflow execution engine and GATK variant calling tool and its best practices to output high-confident unified variant calling file. Our workflow is shipped as Python command line tool making it easy to install and use.
hpexome [OPTIONS] [DESTINATION]
OPTIONS
Required arguments
-I, --bam
One or more sequence alignment files in BAM format or directories containing *.bam
files.-R, --genome
Reference genome in single FASTA file.--dbsnp
dbSNP file in VCF format. See dbSNP FTP.--sites
VCF files containing known polymorphic sites to skip over in the recalibration algorithm.Optional arguments
--indels
Inputs the VCF file with known insertions and deletions (indels) to be used.-L, --intervals
One or more genomic intervals over which to operate.--unified_vcf
Unify VCF files into a single one.-O, --output_file_name
Output file name for unified VCF. Default is unified.vcf
.--min_prunning
Minimum support to not prune paths in the graph. Default value is 2
.-stand_call_conf
Minimum phred-scaled confidence threshold at which variants should be called. Default is 30
.Performance-specific arguments
-nt, --num_data_threads
Controls the number of data consecutive threads sent to the processor that are used in the parallelization process. It is used in the Realigner Target Creator, and may not be used together with the scattercount option. If not set, the walker will run in serial.-nct, --num_threads_per_data_thread
Controls the number of CPU threads allocated to each data thread. It is used with the Base Recalibrator and the Print Reads, and may not be used together with the scattercount
option. If not set, the walkers will run in serial. --job_runner
Job executor engine (eg. Lsf706, Grid, PbsEngine).--scatter_count
Controls the number of parts in which the genetic sequence will be divided when sent to be parallelized by the Job executor engine. It is used in all walkers. It must be used with the -jobRuner
option, or else it will not use the GridEngine and the process will be run in serial.System path to required software
--java_path
Path to java. Use this to pass JVM-specific arguments. Default is java
.DESTINATION
Sets the directory in which the outputs will be saved. If not set, the outputs will be saved in the directory in which the process is running.
In this example we will download and process a whole-exome sequence sample from the 1000 Genomes Project and required reference files, as well as required software. Let’s create a directory to write all files.
mkdir hpexome
cd hpexome
HPexome only requires Python 3 and Java 8 to run, optionally DMRAA-supported batch processing system such as SGE. However, to create input data it is required to align raw sequencing reads to reference genome and sort those reads by coordinate. The required software are: BWA to align reads, amtools to convert output to BAM, and Picard to sort reads and fix tags.
# HPexome
pip3 install hpexome
# BWA
wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2
tar xf bwa-0.7.17.tar.bz2
make -C bwa-0.7.17
# Samtools
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
tar xf samtools-1.10.tar.bz2
make -C samtools-1.10
# Picard
wget https://github.com/broadinstitute/picard/releases/download/2.21.7/picard.jar
Download raw sequencing data as FASTQ files.
wget \
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/SRR098401_1.filt.fastq.gz \
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/SRR098401_2.filt.fastq.gz
Download required reference files.
wget \
https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.fasta \
https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.fasta.fai \
https://storage.googleapis.com/gatk-legacy-bundles/hg19/ucsc.hg19.dict \
ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.vcf.gz \
ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.vcf.idx.gz \
ftp://[email protected]/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
ftp://[email protected]/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz \
ftp://[email protected]/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
ftp://[email protected]/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.idx.gz \
ftp://[email protected]/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz \
ftp://[email protected]/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz \
ftp://[email protected]/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.gz \
ftp://[email protected]/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.idx.gz
gunzip dbsnp_138.hg19.vcf.gz \
dbsnp_138.hg19.vcf.idx.gz \
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz \
1000G_phase1.indels.hg19.sites.vcf.gz \
1000G_phase1.indels.hg19.sites.vcf.idx.gz \
1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz \
1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz \
1000G_omni2.5.hg19.sites.vcf.gz \
1000G_omni2.5.hg19.sites.vcf.idx.gz
Index reference genome.
bwa-0.7.17/bwa index ucsc.hg19.fasta
Align raw sequencing reads to the human reference genome.
bwa-0.7.17/bwa mem \
-K 100000000 -t 16 -Y ucsc.hg19.fasta \
SRR098401_1.filt.fastq.gz SRR098401_2.filt.fastq.gz \
| samtools-1.10/samtools view -1 - > NA12878.bam
Sort aligned reads by genomic coordinates.
java -jar picard.jar SortSam \
INPUT=NA12878.bam \
OUTPUT=NA12878.sorted.bam \
SORT_ORDER=coordinate \
CREATE_INDEX=true
Fix RG tags.
java -jar picard.jar AddOrReplaceReadGroups \
I=NA12878.sorted.bam \
O=NA12878.sorted.rgfix.bam \
RGID=NA12878 \
RGSM=NA12878 \
RGLB=1kgenomes \
RGPL=Illumina \
PU=Unit1 \
CREATE_INDEX=true
In some computing setups it will require to set SGE_ROOT
environment variable.
export SGE_ROOT=/var/lib/gridengine
Run HPexome.
hpexome \
--bam NA12878.sorted.rgfix.bam \
--genome ucsc.hg19.fasta \
--dbsnp dbsnp_138.hg19.vcf \
--indels Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
--indels 1000G_phase1.indels.hg19.sites.vcf \
--sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
--sites 1000G_omni2.5.hg19.sites.vcf \
--scatter_count 16 \
--job_runner GridEngine \
result_files
It is expected the following files.
result_files/
├── HPexome.jobreport.txt
├── NA12878.sorted.rgfix.HC.raw.vcf
├── NA12878.sorted.rgfix.HC.raw.vcf.idx
├── NA12878.sorted.rgfix.HC.raw.vcf.out
├── NA12878.sorted.rgfix.intervals
├── NA12878.sorted.rgfix.intervals.out
├── NA12878.sorted.rgfix.realn.bai
├── NA12878.sorted.rgfix.realn.bam
├── NA12878.sorted.rgfix.realn.bam.out
├── NA12878.sorted.rgfix.recal.bai
├── NA12878.sorted.rgfix.recal.bam
├── NA12878.sorted.rgfix.recal.bam.out
├── NA12878.sorted.rgfix.recal.cvs
└── NA12878.sorted.rgfix.recal.cvs.out
See docs folder for information about performance and validation tests.
Clone git repository from GitHub.
git clone https://github.com/labbcb/hpexome.git
cd hpexome
Create virtual environment and install development version.
python3 -m venv venv
source venv/bin/activate
pip install --requirement requirements.txt
Publish new hpexome version to Pypi.
pip install setuptools wheel twine
python setup.py sdist bdist_wheel
twine upload -u $PYPI_USER -p $PYPI_PASS dist/*