
The effect of variant interference on de novo assembly for viral deep sequencing



- Next-generation sequencing (NGS) approaches have surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored.
- This “variant interference” (VI) is highly consistent and reproducible across ten commonly used de novo assemblers, and occurs over a range of genome lengths, read lengths, and GC contents.
- Conclusions: These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
- Keywords: De novo assembly, Variant, Quasispecies, Virus, Microbe.
- The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
- 1 Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA 30329, USA. Full list of author information is available at the end of the article.
- In this study, we use the term “variant” to encompass both quasispecies and strains regardless of how the variants originated in the biological samples.
- Reference-mapping and de novo assembly are the two primary bioinformatic strategies for genome assembly.
- Reference-mapping requires a closely-related genome as input to align reads, while de novo assembly generates contigs without the use of a reference genome.
- Therefore, de novo assembly is the most suitable strategy for analyzing underexplored taxa [16] or for viruses with high mutation and/or recombination rates.
- The two most common graph algorithms employed by de novo assembly programs are: overlap graphs for overlap-layout-consensus (OLC) methods, and k-mer based graphs for de Bruijn graph (DBG) methods.
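To make the k-mer graph idea concrete, here is a minimal sketch of a de Bruijn graph built from two toy variant sequences that differ at one position. The sequences, the value of k, and the helper names (`de_bruijn_edges`, `branching_nodes`) are illustrative assumptions; real assemblers build the graph from reads and add error correction and graph simplification, none of which is shown here.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer contributes one edge: its prefix node -> its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor (points of ambiguity)."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy variants that differ only at index 7 (A vs C).
variant_a = "ATGGCGTACGTTAGC"
variant_b = "ATGGCGTCCGTTAGC"

single = de_bruijn_edges([variant_a], k=5)
mixed = de_bruijn_edges([variant_a, variant_b], k=5)

print(branching_nodes(single))  # no branch when only one variant is present
print(branching_nodes(mixed))   # the shared node just before the difference branches
```

With only one variant the graph is a simple path; adding the second variant makes the node immediately upstream of the differing base branch into two routes that rejoin downstream.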
- In this study, we first examined how often NGS and de novo assembly were applied for viral sequences deposited in the GenBank nucleotide database (www.ncbi..
- The rise of NGS and de novo assembler use in GenBank viral sequences.
- The most common combination observed was 454 and Sanger (18,124 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms (Fig.
- Use of de novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) has increased from less than 1% of viral sequence entries ≥2000 nt in 2012 to 20% of all viral sequence entries in 2019 (Fig.
- Multifunctional programs that offer both assembly options were the most common programs cited for the years, but since the exact sequence assembly strategy used for these records is unknown (Tables S1-S5), the contributions of de novo assembly are likely underestimated.
- An expanded summary of the sequencing technologies and assembly approaches used for viral GenBank records is available in the Supplement text and Supplement Tables S1-S6.
- Effect of variant assembly using popular de novo assemblers.
- After establishing the growing use of NGS technologies for viral sequencing, we next focused on understanding how the presence of viral variants may influence de novo assembly output.
- Fig. 1 Trends and patterns of sequencing technology and assembly methods of viral entries in the GenBank database.
- b Count of all viral entries with at least one Sequencing Technology documented for the years .
- d and e Percentage ratio graph of all viral entries with Sequencing Technology documented for the years with (d) and without (e) the Other category.
- f Percentage ratio graph of viral entries with length greater than 2000 nt that have been documented with one of the seven Sequencing Technologies for the years .
- g Percentage ratio graph of viral entries with length greater than 2000 nt and that have been documented with one of the six NGS as the Sequencing Technology for the years .
- For (h) and (i), the Other category describes assembly methods outside of the 18 most popular programs investigated.
- i Reclassification of panel (h) by the nature of the assembly methods.
- The programs can be grouped into de novo assembler, reference-mapping, and software that can perform both.
- 10 of the most used de novo assembly programs (Fig.
- The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA assemblers shared a few commonalities, demonstrated by a conceptual model in Fig.
- First, below a certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de novo assemblers produced two contigs correctly (Fig.
- As PID between the variants continued to increase, the de novo assemblers could no longer distinguish between the variants and assembled all the reads into a single contig, a phenomenon we define as “variant singularity” (VS).
- Slight differences in the variant interference patterns (relative to the canonical variant interference model) were observed for the 10 assemblers investigated.
- The PID range where VI was observed was distinct for each de novo assembler (Fig.
- For SOAPdenovo2, a larger number of contigs was produced regardless of the PID.
- For Experiment 2, we focused our study on evaluating whether VI observed in SPAdes de novo assembly is influenced by the GC content or genome length of the pathogen.
- Fig. 2 Workflow diagram of the investigation of variant simulated NGS reads through de novo assembly.
- In the second step, an artificial mutated variant genome was created.
- Mutated variant reads are also generated for each of the mutation parameters.
- In the third and fourth steps, the initial and mutated variants were then combined and used as input for de novo assembly for the three experiments, as detailed in Supplement Figure S1.
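The variant-creation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: it introduces substitutions only, at uniformly random positions, and the function names (`mutate_to_pid`, `percent_identity`) and the fixed seed are hypothetical.

```python
import random


def mutate_to_pid(genome, pid, seed=0):
    """Return a copy of `genome` with substitutions so that identity is ~`pid` percent."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_mut = round(len(genome) * (100.0 - pid) / 100.0)
    positions = rng.sample(range(len(genome)), n_mut)  # distinct positions
    bases = list(genome)
    for pos in positions:
        # Always substitute with a different base so every mutation lowers identity.
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


def percent_identity(a, b):
    """Percent identity of two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


initial = "ACGT" * 500  # hypothetical 2 kb artificial genome
mutated = mutate_to_pid(initial, pid=99.0)
print(percent_identity(initial, mutated))
```

Reads would then be simulated separately from `initial` and `mutated` and pooled before assembly, as the workflow describes.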
- It is also one of the leading assemblers for viral assembly (Fig.
- Two datasets were used for the evaluation: reads generated from four artificial genomes ranging in length from 2 Kb to 1 Mb, as well as from genome sequences of poliovirus (NC_002058;.
- No discernible correlation was observed between the GC content of variant genomes and the degree of VI for any of the simulated datasets (Supplemental Figure S1, p <.
- This PID threshold, 99.21%, marked the drastic transition from VI to VS, whereas the transition from VD to VI (i.e., the VD threshold) occurred at 98.99% PID (Fig.
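As an illustration of how the two thresholds partition the PID axis, the sketch below labels a PID value as VD, VI, or VS. The default thresholds are the values quoted above for that particular dataset; the function name `classify_pid` and the handling of values that fall exactly on a threshold are assumptions, and other assemblers or datasets would need their own thresholds.

```python
def classify_pid(pid, vd_threshold=98.99, vs_threshold=99.21):
    """Label a percent identity as 'VD', 'VI', or 'VS' using two thresholds.

    Boundary handling (a PID equal to a threshold falls in the higher-PID
    regime) is an assumption made for this sketch.
    """
    if not 0.0 <= pid <= 100.0:
        raise ValueError("pid must be between 0 and 100")
    if pid < vd_threshold:
        return "VD"  # variants resolved into separate contigs
    if pid < vs_threshold:
        return "VI"  # variant interference
    return "VS"      # variant singularity: reads collapse into one contig


for pid in (98.0, 99.1, 99.5):
    print(pid, classify_pid(pid))
```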
- Since read length is an important factor for de novo assembly success [23], we hypothesized that it may also influence the ability to distinguish viral variants.
- Also, with increasing read length, the width of the PID window where VI occurs gradually decreased from a 1.52%.
- Variant interference in 10 de novo assemblers.
- a Schematic diagram depicting concepts of the VD, VI, and VS, and their relationship to PID.
- The number of contigs produced by each de novo assembler at different variant PID ranges is shown.
- The y-axis denotes the number of contigs.
- When only reads from the major variant were assembled (M), full genomes were obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious.
- Conversely, assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of contigs for 9 of the 12 sample and assembly software combinations tested (Fig. 6), indicating that VI due to the addition of the minor variant reads likely.
- Fig. 4 The effect of genome length and read length on de novo assembly of simulated variants across a range of percentage identities (PID).
- For the simulated genome lengths of 2 kb, 10 kb, 100 kb, and 1 Mb, the average contig number at each PID was plotted.
- For all six genome lengths and each of the 13 iterations, VI consistently occurred in the same range of PID.
- c The relationship between genome length and the total number of contigs produced.
- Our analysis of the GenBank entries quantified the decade-long expansion of NGS technologies and de novo assembly for viral sequencing (Fig.
- The contig representation schematic showing the abundance and length of the generated contigs reveals the impact of variant interference on de novo assembly.
- The bar graphs show the UG50% metric and the length of the longest contig.
- UG50% is a percentage-based metric that estimates length of the unique, non-overlapping contigs as proportional to the length of the reference genome [24].
- current de novo assembly programs perform for datasets with viral variants.
- Viral variants are expected in biological samples, with the number of variants and the extent of the sequence divergence between variants related to the mutation rate of the virus and the types of specimens that are being investigated.
- Even though the de novo assembly graph linked the different contig fragments, the assembly could not differentiate the multiple routes of possible contig construction.
- We speculate this is the main reason why VI occurs in the context of de Bruijn graph assemblers.
- Through this study, we demonstrated that reads from related genome variants adversely affect de novo assembly.
- As NGS and de novo assembly have become essential for generating full-length viral genomes, future studies should investigate the combined effects of the number and relative proportion of minor variants, as well as additional assembly factors (e.g., error rates) to supplement this work.
- Analyzing NGS and assembler usage in the virus nucleotide collection in GenBank.
- The total number of viral sequences submitted annually in GenBank through December 2019 was calculated by filtering GenBank submissions by “virus,” followed by application of the following additional filtering steps:
- Fastq reads were combined in equal numbers for the initial and mutated variants and used as input for subsequent de novo.
- The same process was utilized to generate the artificial genomes, initial and mutated variant genomes, and reads for each of the experiments.
- Experiment 1: analyzing simulated reads from variants using different de novo assembly programs.
- The de novo assembly algorithms used were either overlap-layout-consensus (OLC) [Cap [41] and Mira [42, 43.
- The simulation settings for the reads were paired-end reads, 250 nt read length, and 50X coverage.
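To show what these settings imply for dataset size, the sketch below applies the usual coverage relation, coverage = (number of reads × read length) / genome length. The function name `reads_for_coverage` and the 10 kb example genome are illustrative assumptions; how a given read simulator interprets a coverage setting for paired-end data may differ.

```python
import math


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed to reach `coverage`, rounded up to a whole read."""
    return math.ceil(coverage * genome_length / read_length)


# Hypothetical example: a 10 kb genome at 50X with 250 nt reads.
n_reads = reads_for_coverage(genome_length=10_000, read_length=250, coverage=50)
n_pairs = n_reads // 2  # paired-end: two reads per sequenced fragment
print(n_reads, n_pairs)
```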
- The original GC content was kept constant for the poliovirus and coronavirus genomes.
- A total of 13,338 SPAdes assemblies were generated, which included 12,844 assemblies for the four artificial genomes (247 datasets per genome × 4 artificial genome lengths × 13 GC content proportions × 1 assembler) and 494 assemblies for the poliovirus and coronavirus datasets (247 datasets per genome × 2 genomes × 1 assembler) (Supplement Figure S1b).
- Since there was little statistical difference when comparing the contig numbers generated at varying percent GC for each of the four genome length datasets (Spearman’s ρ = 0.8299 to 0.9801, p <.
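The assembly counts quoted for this experiment can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those stated in the text.

```python
datasets_per_genome = 247  # datasets per genome, as stated
assemblers = 1             # SPAdes only

# Artificial genomes: 4 genome lengths x 13 GC content proportions.
artificial = datasets_per_genome * 4 * 13 * assemblers

# Viral genomes: poliovirus and coronavirus, original GC content kept.
viral = datasets_per_genome * 2 * assemblers

total = artificial + viral
print(artificial, viral, total)
```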
- A total of 538 SPAdes assemblies were generated and 247 datasets for the and 250 nt read lengths, respectively) (Supplement Figure S1c)..
- The datasets were analyzed using an in-house pipeline (VPipe) [25], which performs various quality control (QC) steps and de novo assembly using SPAdes.
- De novo assembly for each of the four clinical samples was performed for the following binned NGS datasets:.
- The length of the longest contig produced from each assembly and the performance metric UG50% [24].
- Supplemental Text: Analysis of viral GenBank records demonstrated the advent of NGS in viral sequencing in the last two decades.
- Workflow diagrams of simulated data from data creation through de novo assembly.
- Analysis of the final contig assembly graphs for a clinical sample.
- Total count of sequencing technologies for sequences >2000 nt in the NCBI GenBank non-redundant nucleotide database for years .
- Total count of assembly programs used to generate sequences >2000 nt in the NCBI GenBank non-redundant nucleotide database.
- The 10 de novo assemblers used for analysis of the simulated data, as categorized by their underlying assembly algorithms.
- Sequencing reads for the experiments conducted using clinical specimens are available through the NCBI Sequence Read Archive (SRA.
- beyond the analysis of viral diversity.
- High-throughput sequencing (HTS) for the analysis of viral populations.
- De novo assembly of highly diverse viral populations.
- A comprehensive study of De novo genome assemblers: current challenges and future prospective.
- Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era.
- An ensemble strategy that significantly improves de novo assembly of.
- Journal of clinical virology : the official publication of the Pan American Society for Clinical Virology.
- Bandage: interactive visualization of de novo genome assemblies.
- Structure, function and diversity of the healthy human microbiome.
- IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler.
- SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
- Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data
