
PaCBAM: Fast and scalable processing of whole exome and targeted sequencing data



- Background: Interrogation of whole exome and targeted sequencing NGS data is rapidly becoming a preferred approach for the exploration of large cohorts in the research setting and importantly in the context of precision medicine.
- Single-base and genomic region level data retrieval and processing still constitute major bottlenecks in NGS data analysis.
- Fast and scalable tools are hence needed.
- PaCBAM computes depth of coverage and allele-specific pileup statistics, implements a fast and scalable multi-core computational engine, introduces an innovative and efficient on-the-fly read duplicates filtering strategy and provides comprehensive text output files and visual reports.
- Conclusions: PaCBAM is a fast and scalable tool designed to process genomic regions from NGS data files and generate coverage and pileup comprehensive statistics for downstream analysis.
- Genomic region and single-base level data retrieval and processing, which represent fundamental steps in genomic analyses such as copy number estimation, variant calling and quality control, still constitute one of the major bottlenecks in NGS data analysis.
- To deal with the computationally intensive task of calculating depth of coverage and pileup statistics at specific chromosomal regions and/or positions, different tools have been developed.
- Most of them, including specific modules of SAMtools [1] and BEDTools [2] and the most recent Mosdepth [3], only measure and optimize the computation of depth of sequencing coverage.
- Other tools instead provide statistics at single-base resolution.
- Specifically, PaCBAM computes depth of coverage and allele-specific pileup statistics at region and single-base resolution levels and provides data summary visual reporting utilities.
- PaCBAM also introduces an innovative and efficient on-the-fly read duplicates filtering approach.
- While most tools for read duplicates filtering work on SAM/BAM files sorted by read name [1, 7] or read position (Tarasov et al., 2015, broadinstitute.github.io/picard) and generate new SAM/BAM files, PaCBAM performs the filtering directly during processing, without requiring the creation of intermediate BAM/SAM files and fully exploiting parallel resources.
- The tool splits the list of regions provided in the BED file and spawns different threads to execute parallel computations using a shared and optimized data structure.
- The shared data structure collects both region and single-base level information and statistics which are processed and finally exposed through four different output options.
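As an illustration of this parallelization scheme (PaCBAM itself is a multi-threaded C engine, so what follows is only a simplified Python analogue under assumed file names), the regions of a BED file can be split across worker processes that each compute per-region coverage from the same coordinate-sorted, indexed BAM file:

```python
# Illustrative analogue of PaCBAM's parallel region processing: split the BED
# regions across worker processes and compute per-region mean coverage from a
# coordinate-sorted, indexed BAM file (requires pysam).
from multiprocessing import Pool
import pysam

BAM_PATH = "sample.bam"   # hypothetical input alignment file
BED_PATH = "target.bed"   # hypothetical captured regions

def read_bed(path):
    """Parse a minimal 3-column BED file into (chrom, start, end) tuples."""
    regions = []
    with open(path) as fh:
        for line in fh:
            if line.strip() and not line.startswith(("#", "track")):
                chrom, start, end = line.split()[:3]
                regions.append((chrom, int(start), int(end)))
    return regions

def region_mean_coverage(region):
    """Mean depth of coverage of one region (sum of A/C/G/T counts per base).
    For simplicity the BAM is reopened per region; a real implementation would
    keep one handle per worker."""
    chrom, start, end = region
    with pysam.AlignmentFile(BAM_PATH, "rb") as bam:
        counts = bam.count_coverage(chrom, start, end)   # four arrays: A, C, G, T
    per_base = [sum(column) for column in zip(*counts)]
    return chrom, start, end, sum(per_base) / max(len(per_base), 1)

if __name__ == "__main__":
    regions = read_bed(BED_PATH)
    with Pool(processes=4) as pool:                      # worker count ~ threads
        for chrom, start, end, mean in pool.map(region_mean_coverage, regions):
            print(f"{chrom}\t{start}\t{end}\t{mean:.2f}")
```

In PaCBAM the workers additionally fill a shared, optimized structure with both region- and base-level counters, which is what the output options described next are built from.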
- Each output mode provides the user with only the statistics of interest, generating a combination of the following text output files: a) depth of coverage of all genomic regions, which for each region provides the mean depth of coverage, the GC content and the mean depth of coverage of the sub-region (user specified, default 0.5 fraction) that maximizes the coverage peak signal, to account for the reduced coverage depth due to incomplete match of reads to the captured regions (Additional file 1: Figure S1).
- b) single-base resolution pileup, which provides, for each genomic position in the target, the read depth for the 4 possible bases (A, C, G and T), the total depth of coverage, the variant allelic fraction (VAF) and the strand bias information for each base.
- d) pileup of SNP positions, which extracts the pileup statistics for all SNPs specified in the input VCF file and uses the alternative alleles specified in the VCF file for the VAF calculation and the genotype assignment (see Additional file 1 for details).
- All output files are tab-delimited text files and their format details are provided in Additional file 1; a simplified illustration of the statistics they contain is sketched below.
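To make the statistics above concrete, the following sketch shows simplified Python versions of the quantities reported in outputs a), b) and d): mean coverage, peak sub-region coverage and GC content per region, VAF and a strand-bias proxy per position, and a threshold-based genotype call for SNP positions. The formulas and the 0.2/0.8 VAF cut-offs are illustrative assumptions, not PaCBAM's exact definitions (those are documented in Additional file 1).

```python
# Simplified, illustrative versions of the statistics described above.
# Base counts are assumed to be dictionaries such as
# {"A": 5, "C": 0, "G": 95, "T": 0} (forward + reverse reads), with an
# optional forward-only dictionary used for the strand-bias proxy.

def mean_coverage(depths):
    """Mean depth of coverage over a region given per-base total depths."""
    return sum(depths) / max(len(depths), 1)

def peak_subregion_coverage(depths, fraction=0.5):
    """Mean coverage of the contiguous sub-region (default: 0.5 of the region
    length) with the highest coverage, a proxy for the coverage peak that
    compensates for reads only partially overlapping the captured region."""
    if not depths:
        return 0.0
    window = max(int(len(depths) * fraction), 1)
    best = max(sum(depths[i:i + window]) for i in range(len(depths) - window + 1))
    return best / window

def gc_content(sequence):
    """Fraction of G/C bases in the region's reference sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / max(len(sequence), 1)

def vaf(counts, reference, alternative=None):
    """Variant allelic fraction: alternative-allele depth over total depth.
    Without an explicit alternative allele (plain pileup), the most covered
    non-reference base is used."""
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    if alternative is None:
        alternative = max((b for b in "ACGT" if b != reference),
                          key=lambda b: counts.get(b, 0))
    return counts.get(alternative, 0) / depth

def strand_bias(forward_counts, total_counts, base):
    """Illustrative strand-bias proxy: fraction of the reads supporting `base`
    that map on the forward strand (0.5 means no bias)."""
    total = total_counts.get(base, 0)
    return forward_counts.get(base, 0) / total if total else 0.5

def genotype(snp_vaf, hom_ref_max=0.2, hom_alt_min=0.8):
    """Toy genotype assignment from the VAF of a SNP's alternative allele;
    the thresholds are assumptions, not PaCBAM's actual cut-offs."""
    if snp_vaf < hom_ref_max:
        return "0/0"
    if snp_vaf > hom_alt_min:
        return "1/1"
    return "0/1"
```

For example, counts of {"A": 5, "C": 0, "G": 95, "T": 0} at a SNP with reference A and alternative G yield a VAF of 0.95 and a 1/1 call under these toy thresholds.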
- In addition, we implemented an efficient on-the-fly duplicate read filtering strategy, similar to the Picard MarkDuplicates approach, which applies the filter during region and single-base level information retrieval and processing without the need to create new BAM files (Additional file 1).
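The exact filtering criteria are given in Additional file 1; purely as a conceptual sketch of a Picard-like, on-the-fly approach, the snippet below skips alignments whose (reference, start, strand, mate start) signature has already been seen while streaming a coordinate-sorted BAM file, so no de-duplicated intermediate file is ever written. The signature fields and the keep-the-first-read rule are simplifications, not PaCBAM's implementation.

```python
# Conceptual sketch of on-the-fly duplicate filtering while streaming a
# coordinate-sorted BAM file with pysam: reads whose (reference, 5' start,
# strand, mate start) signature was already seen are skipped on the fly
# instead of being written to a de-duplicated BAM first.
import pysam

def pileup_without_duplicates(bam_path, chrom, start, end):
    seen = set()            # signatures of read pairs already counted
    kept = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            signature = (
                read.reference_id,
                read.reference_start,   # simplification: Picard uses the unclipped 5' end
                read.is_reverse,
                read.next_reference_start if read.is_paired else None,
            )
            if signature in seen:
                continue                # duplicate: ignored on the fly
            seen.add(signature)
            kept += 1
            # ...update coverage/pileup counters for `read` here...
    return kept
```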
- Reports include plots summarizing the distributions of region and per-base depth of coverage, SNP VAF distribution and genotyping, strand bias distribution, substitution spectra and region GC content (Additional file 1: Figures S2-S8).
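The reports are produced by PaCBAM's own utilities; as an assumed, minimal illustration of what one such plot conveys, a cumulative coverage distribution can be drawn from a column of per-base depths taken from the pileup output (file name and plotting choices below are hypothetical):

```python
# Illustrative rendering of a cumulative coverage distribution report:
# fraction of target bases covered by at least x reads, computed from a
# one-column file of per-base depths extracted from the pileup output.
import numpy as np
import matplotlib.pyplot as plt

depths = np.loadtxt("per_base_depths.txt")        # hypothetical input file
thresholds = np.arange(0, int(depths.max()) + 1)
cumulative = [(depths >= t).mean() for t in thresholds]

plt.plot(thresholds, cumulative)
plt.xlabel("Depth of coverage")
plt.ylabel("Fraction of target bases covered")
plt.title("Cumulative coverage distribution")
plt.savefig("cumulative_coverage.png")
```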
- To mimic different application scenarios, we measured the execution time and memory used by PaCBAM to compute pileups from multiple input BAM files spanning different depths of coverage and different target sizes (Additional file 1: Table S1) using an increasing number of threads.
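A minimal harness for this kind of benchmark could look as follows; it is only a sketch that records wall-clock time for a command run with an increasing number of threads, where the command is a placeholder wrapper script (the actual PaCBAM invocation and the memory measurements, e.g. via GNU time, are not reproduced here).

```python
# Sketch of a timing harness: run a placeholder benchmark command with an
# increasing number of threads and record the wall-clock time of each run.
# Peak memory could additionally be captured with an external tool such as
# GNU `time -v`; it is omitted to keep the sketch minimal.
import subprocess
import time

def run_and_time(command):
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    for threads in (1, 2, 5, 10, 20, 30):
        # hypothetical wrapper script that launches the pileup computation
        elapsed = run_and_time(["./run_pileup_benchmark.sh", str(threads)])
        print(f"{threads} threads\t{elapsed:.1f} s")
```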
- Of note, while PaCBAM pileup output files are of constant size, the output files of SAMtools, Sambamba and GATK have a size that is a function of the coverage: among all the experiments run in the performance analyses, PaCBAM output is up to 17.5x smaller than the outputs generated by the other tested tools.
- As shown in Fig. 1b and Additional file 1: Figures S12-S14, PaCBAM memory usage depends only on the target size, whereas Sambamba usage depends on both target size and number of threads and SAMtools usage is constant.
- An example of this performance comparison is the analysis of a BAM file with ~300X mean coverage and ~30Mbp target size using 30 threads (Fig. 1).
- Of note, in the sequencing scenarios considered here, PaCBAM demonstrates up to 100x execution time improvement and up to 90% less memory usage with respect to the single-base pileup module of our previous tool ASEQ (Additional file 1: Figure S15).
- Figure 2 caption (recovered panel descriptions): a Comparison of PaCBAM and GATK depth of coverage (left), with zoom in the coverage range [0,500] (right); the number of positions considered in the analysis and correlation results are reported. c Single-base coverage obtained by running either Picard MarkDuplicates + PaCBAM pileup or PaCBAM pileup with the duplicates filtering option active (left), with zoom in the coverage range [0,500] (right). d Regional mean depth of coverage obtained by running either Picard MarkDuplicates + PaCBAM pileup or PaCBAM pileup with the duplicates filtering option active. The figure focuses on the analysis of a BAM file with ~300X mean coverage and ~30Mbp target size using 30 threads.
- Depth of coverage and pileup statistics of PaCBAM pileup were compared to GATK results on a BAM file with ~300X average coverage and ~64Mbp target size, observing almost perfect concordance (Fig. 2a-b).
- The PaCBAM duplicates removal strategy was tested by comparing PaCBAM pileups obtained from a paired-end BAM file first processed with Picard MarkDuplicates or parallel Sambamba markdup, to PaCBAM pileups obtained from the same initial BAM file but using the embedded on-the-fly duplicates filtering.
- As shown in Fig. 2c-d and Additional file 1: Figure S16, both single-base and region level statistics are strongly concordant: the single-base total coverage difference (with respect to Picard) is below 10X in 99.94% of positions, and the single-base allelic fraction difference stays below a similarly small threshold in 99.95% of positions.
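A concordance check of this kind can be scripted directly on the tab-delimited pileup outputs; the sketch below assumes hypothetical column names (chr, pos, cov, af) and thresholds, so it illustrates the comparison rather than reproducing the published one:

```python
# Sketch of a pileup concordance check between two runs (e.g. Picard-filtered
# input vs on-the-fly filtering): join per-position tables and report how
# often coverage and allelic fraction differ by less than chosen thresholds.
# Column names and file names are assumptions, not PaCBAM's documented format.
import pandas as pd

a = pd.read_csv("pileup_picard_filtered.txt", sep="\t")
b = pd.read_csv("pileup_onthefly_filtered.txt", sep="\t")
merged = a.merge(b, on=["chr", "pos"], suffixes=("_a", "_b"))

cov_concordant = (merged["cov_a"] - merged["cov_b"]).abs() < 10    # e.g. <10X
af_concordant = (merged["af_a"] - merged["af_b"]).abs() < 0.01     # illustrative cut-off
print(f"coverage differences <10X:          {cov_concordant.mean():.2%} of positions")
print(f"allelic fraction differences <0.01: {af_concordant.mean():.2%} of positions")
```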
- Overall, these analyses demonstrate that PaCBAM exploits parallel computation resources better than existing tools, resulting in evident reductions of processing time and memory usage, which enable fast and efficient coverage and allele-specific characterization of large WES and targeted sequencing datasets.
- We presented PaCBAM, a fast and scalable tool to process genomic regions from NGS data files and generate coverage and pileup statistics for downstream analysis such as copy number estimation, variant calling and data quality control.
- PaCBAM generates both region and single-base level statistics and provides a fast and innovative on-the-fly read duplicates filtering strategy.
- Additional file.
- Cumulative coverage distribution report.
- Variant allelic fraction distribution report.
- SNP allelic fraction distribution report.
- Strand bias distribution report.
- Genomic regions depth of coverage distribution report.
- Genomic regions GC content distribution report.
- Run time comparison at 150X depth of coverage.
- Run time comparison at 230X depth of coverage.
- Run time comparison at 300X depth of coverage.
- Memory usage comparison at 150X depth of coverage.
- Memory usage comparison at 230X depth of coverage.
- Memory usage comparison at 300X depth of coverage.
- Memory usage comparison between PaCBAM pileup and the pileup module of ASEQ.
- Comparison of PaCBAM duplicates filtering strategy to Sambamba markdup and Picard MarkDuplicates modules.
- Table S1. Mean depth of coverage and target sizes of all BAM files used to test PaCBAM performance.
- Table S2. Time and memory usage of the duplicates filtering performance analyses.
