
Assembly of chloroplast genomes with long- and short-read data: A comparison of approaches using Eucalyptus pauciflora as a test case


Summary

- Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats.
- Short-read-only assemblies generated three contigs (the long single copy, short single copy and inverted repeat regions) of the chloroplast genome.
- Full list of author information is available at the end of the article.
- In some species, one copy of the inverted repeats has been lost during evolution [13].
- The length of the inverted repeats usually ranges from 10 to 30 kb [9], although in extreme cases it can be as short as 114 bp [14] or as long as 76 kb [15].
- Initial assemblies of the chloroplast genome relied on Sanger sequencing, which produces highly accurate reads of around 1 kb in length [16–18].
- And when Sanger sequencing is used to confirm an assembly, this removes some of the benefits of using short-reads to assemble chloroplast genomes.
- New long-read sequencing technologies have the potential to allow for reference-free assembly of chloroplast genomes by combining some of the best features of Sanger and short-read sequencing.
- It is possible, therefore, for a single read to cover the entire chloroplast genome, or at least a very large section of it, suggesting that it should be feasible to use long-reads to perform reference-free assembly of the chloroplast genome.
- Hybrid assembly using a combination of long- and short-reads may be the best approach to assembling chloroplast genomes, because it can potentially combine the benefits of the length of long-reads and the accuracy of short-reads.
- This will be particularly important for chloroplast genomes with atypical structures, such as that of the chickpea, which has lost one of the inverted repeat regions [13], or Selaginella tamariscina, in which the repeats are in the same orientation instead of inverted [35].
- Parts of the chloroplast genome are frequently transferred into.
- Despite the potential advantages of long-reads for chloroplast genome assembly, we lack a comparison of the performance of long-read-only, short-read-only, and hybrid assembly approaches for the chloroplast genome.
- Eucalypts are widely distributed in Australia, accounting for roughly 75% of the forest areas [41].
- and Canu [53], one of the most popular long-read assemblers currently available.
- Of the 648 genome assemblies we performed, the best genome assemblies were the hybrid assemblies with at least 20x coverage of both long- and short-reads (Fig.
- In the following, we discuss each of the assembly categories in more detail.
- Surprisingly, the total length of the Canu assemblies varied widely with relatively minor changes in coverage (e.g.
- Validation of the long-read-only assemblies after manual curation showed that the assemblies produced by Hinge were more accurate than those produced by Canu (Fig.
- For the Hinge assemblies, the mapping rate of the validation reads increased predictably with the.
- For the Canu assemblies, the mapping rate of the validation reads was the highest.
- 1c and Additional file 2: Figure S1c), but varied substantially at lower coverage, reflecting the fact that many of the Canu assemblies were missing large portions of the chloroplast genome.
- Additional file 2: Figure S1c and d), at which the error rate was 0.0022 per base of the validation reads in the best assemblies.
- Since the expected error rate of the validation reads is likely to be at most 0.0010 (see above), this.
- Short-read correction.
- Short-read-only assemblies.
- Short-read-only assemblies with coverage ≥20x produced three complete contigs corresponding to the three major structural regions of the chloroplast genome: the long single copy, the short single copy and the inverted repeat (Additional file 3: Figure S2a).
- Fig. 1 Comparison of chloroplast genome assemblies.
- The coverage of the long- and short-reads is shown along the top and left-hand-side of each panel, respectively.
- The total coverage of the chloroplast genome across all contigs output by the assembler.
- inverted repeat), the short-read-only assemblies had high mapping rates of the validation reads (99.36%) and low rates of mismatches and indels between the validation reads and the assemblies (0.0007).
- This error rate is lower than the expected error rate of the validation reads, suggesting that the short-read-only assemblies may contain few or no errors.
- Assemblies using only 5x short-read coverage were missing ~ 40 kb of the chloroplast genome, regardless of the long-read coverage (Additional file 3: Figure S2b).
- Hybrid assembly with different lengths of long-reads.
- Hybrid assembly performance was highly dependent on the length of the long-reads.
- The differences between these six groups of assemblies fall into just three regions of the chloroplast genome (Fig.
- To distinguish which variant in each of these regions was most likely to be correct, we mapped all of the long- and short-reads to all six assemblies.
- 2), but neither the short- nor the long-reads provided clear preference for the length of the homopolymer.
- Roughly the same number of reads map successfully to this region regardless of whether the assembly has 14, 15, or 16 adenines in this homopolymer, and in all cases roughly 10% of short-reads disagree with the length of the homopolymer in the assembly.
- The mapping results suggest that the per-base accuracy of the final genome assembly is very high, and that the assembly is unlikely to contain any errors.
- The high number of sites with a non-reference allele at ≥10% for the mapped long-reads is expected given their much higher error rate and the biases associated with estimating the length of homopolymer runs.
- Our best assembly of the E. pauciflora chloroplast genome is a hybrid assembly, which has the best overall quality statistics and does not suffer from the minor errors at the junctions of the single copy and inverted repeat regions associated with short-read-only assemblies (see above).
- pauciflora chloroplast genome is.
- We also show that very similar accuracy can be obtained from short-read-only assemblies, although in this case the genome is assembled into three contigs representing the three main regions of the chloroplast genome.
- Short-read-only assemblies of the chloroplast genome were highly accurate, but were divided into the three regions of the chloroplast genome: the long single copy, short single copy, and inverted repeat.
- This will be beneficial for those who wish to obtain a reference-free estimate of the overall structure of the chloroplast genome (e.g.
- and for those who wish to accurately infer the sequence of the junctions between the major regions of the genome without additional Sanger sequencing.
- The limited length of short-reads (150 bp in this study) will increase the rate at which these reads map incorrectly, particularly given the existing copies of sections of the chloroplast genome in the mitochondrial or nuclear genome [36–39].
- Furthermore, analysis of reads that contain chloroplast DNA with nuclear flanking regions should enable nuclear copies of the chloroplast genome to be excluded from analyses, which may help clarify.
- pauciflora chloroplast genome.
- That study suggested that long-reads were beneficial based on the observation that short-read-only assemblies recovered just ~ 90% of the chloroplast genome assembled into seven contigs, with a relatively high proportion (0.12%) of uncertain sites.
- We were able to successfully assemble 100% of the chloroplast genome into three contigs (corresponding directly to the three major structural regions of the chloroplast genome) with very high accuracy, given just 20x coverage of short-read data.
- These improved genome assembly algorithms mitigate many of the previous limitations of short-read-only assemblies of the chloroplast genome, making the differences between approaches less severe than they were previously.
- We showed that it is necessary to include at least 5x coverage of such reads to gain the benefits of the hybrid assembly approach.
- The final genome assembly of the E.
- Scripts for this analysis are available in the 1_pre_assembly/1_quality_control/short_read folder of the GitHub repository.
- long_read folder of the GitHub repository.
- Simple linearization of the reference set would risk failing to capture reads that span the point at which the genomes were circularized.
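One common way to keep reads that span the circularization point is to map against a reference in which each circular genome is concatenated with itself. The summary does not state which workaround was used, so the sketch below is an assumption for illustration only; `double_circular`, `read_is_contained`, and the toy sequence are hypothetical, and the exact-substring check merely stands in for a real read mapper.

```python
def double_circular(seq: str) -> str:
    """Return the sequence concatenated with itself, so that any read
    spanning the original start/end junction occurs contiguously."""
    return seq + seq


def read_is_contained(read: str, reference: str) -> bool:
    """Exact-substring check standing in for a real read mapper."""
    return read in reference


# Toy circular genome (hypothetical) and a read that wraps around its end.
genome = "ACGTTGCAAGGCTTAC"
wrapping_read = genome[-5:] + genome[:5]  # last 5 bases followed by first 5

linear_hit = read_is_contained(wrapping_read, genome)
doubled_hit = read_is_contained(wrapping_read, double_circular(genome))
print(linear_hit, doubled_hit)
```

With a doubled reference, downstream steps need to account for every position appearing twice, for example when counting coverage.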
- Scripts for this analysis are available in the 1_pre_assembly/2_cpDNAExtraction folder of the GitHub repository.
- To do this, we randomly selected 100x coverage of paired-end Illumina reads (59,656 pairs of reads in total) from all of the chloroplast paired-end Illumina reads identified above.
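The relationship between a target coverage and the number of read pairs to draw can be sketched as follows. This is a minimal illustration, not the authors' script: the 160 kb genome length is a hypothetical placeholder, and the 150 bp read length is the untrimmed value, so the result is not expected to reproduce the 59,656 pairs reported above (quality trimming shortens reads). The function names are likewise assumptions.

```python
import math
import random


def pairs_for_coverage(genome_length: int, coverage: float, read_length: int) -> int:
    """Number of read pairs whose combined bases reach the target coverage.

    Each pair contributes 2 * read_length bases.
    """
    return math.ceil(coverage * genome_length / (2 * read_length))


def sample_pairs(n_available: int, n_wanted: int, seed: int = 0) -> list[int]:
    """Pick pair indices without replacement; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_available), n_wanted))


# ASSUMPTIONS: a 160 kb genome and untrimmed 150 bp reads (hypothetical values).
n_pairs = pairs_for_coverage(genome_length=160_000, coverage=100, read_length=150)
chosen = sample_pairs(n_available=500_000, n_wanted=n_pairs, seed=42)
print(n_pairs, len(chosen))
```

Sampling indices rather than records keeps both mates of a pair together when the same indices are applied to the forward and reverse read files.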
- Nevertheless, as long as the vast majority of the validation reads come from the chloroplast genome of E. pauciflora, it should be the case that the mapping rate of the validation reads will increase monotonically as the accuracy of the chloroplast genome assembly increases.
- We also expect the error rate of mapped validation reads to decrease monotonically as the accuracy of the genome assembly increases.
- Thus, if the error rate of the mapped reads is greater than ~ 0.0010 errors per base, this suggests that there are errors that cannot be attributed to sequencing error, and are likely to represent assembly errors.
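The threshold reasoning above can be written as a one-line check. The sketch below applies it to two error rates quoted elsewhere in this summary (0.0022 for the best long-read-only assemblies and 0.0007 for the short-read-only assemblies); the function name and dictionary labels are hypothetical.

```python
# Approximate per-base error rate expected from the validation reads alone
# (value taken from the text; treated here as a fixed threshold).
EXPECTED_READ_ERROR = 0.0010


def suggests_assembly_errors(observed_error_rate: float,
                             expected_read_error: float = EXPECTED_READ_ERROR) -> bool:
    """True when mapped reads show more errors than sequencing error explains."""
    return observed_error_rate > expected_read_error


rates = {"long-read-only (best)": 0.0022, "short-read-only": 0.0007}
flags = {name: suggests_assembly_errors(rate) for name, rate in rates.items()}
for name, flagged in flags.items():
    print(f"{name}: likely assembly errors = {flagged}")
```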
- Scripts for random read selection are available in the 2_assembly/randomSelection folder of the GitHub repository.
- Scripts for this analysis are available in the 2_assembly/longReadOnly folder of the GitHub repository.
- Short-read-only assembly.
- Scripts for this analysis are available in the 2_assembly/shortReadOnly folder of the GitHub repository.
- Scripts for this analysis are available in the 2_assembly/hybrid folder of the GitHub repository.
- For example, the long-read-only assemblies, short-read-only assemblies, and hybrid assemblies with <20x long-read coverage all tended to produce multiple contigs, occasionally with some regions of the chloroplast genome represented more than once.
- Scripts for this analysis are available in the 3_post_assembly/1_same_structure folder of the GitHub repository.
- Scripts for this analysis are available in the 3_post_assembly/2_polish folder of the GitHub repository.
- (ii) the average per-base coverage of the reference genome by the contigs (calculated by the sum of any part of any contig that aligns successfully to the E. regnans chloroplast genome divided by the length of the E.
- (iii) the sum of the length of all contigs output by the assembler.
- (iv) the percentage of the validation reads successfully mapped to the polished genome.
- (v) the error rate of the validation reads that successfully mapped to the polished genome.
- Scripts for this analysis are available in the 3_post_assembly/3_assembly_quality_control folder of the GitHub repository.
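Metrics (ii) to (v) above reduce to simple ratios once the alignment and mapping counts are available. The sketch below is a hedged illustration only: the `AssemblyStats` class, the `summarize` function, and every input number are hypothetical, and the counts would in practice come from an aligner's output rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    reference_coverage: float   # (ii) aligned contig bases / reference length
    total_contig_length: int    # (iii) sum of all contig lengths
    mapping_rate_pct: float     # (iv) percentage of validation reads mapped
    error_rate: float           # (v) mismatches + indels per mapped base


def summarize(aligned_lengths, contig_lengths, reference_length,
              reads_total, reads_mapped, errors, mapped_bases) -> AssemblyStats:
    """Combine raw counts into the four summary statistics."""
    return AssemblyStats(
        reference_coverage=sum(aligned_lengths) / reference_length,
        total_contig_length=sum(contig_lengths),
        mapping_rate_pct=100 * reads_mapped / reads_total,
        error_rate=errors / mapped_bases,
    )


# Hypothetical three-contig assembly against a hypothetical 160 kb reference.
stats = summarize(
    aligned_lengths=[88_000, 18_000, 26_000],
    contig_lengths=[88_000, 18_000, 26_000],
    reference_length=160_000,
    reads_total=1_000,
    reads_mapped=990,
    errors=70,
    mapped_bases=100_000,
)
print(stats)
```

A reference coverage below 1.0, as in this toy case, would be expected when a repeated region is assembled only once; values above 1.0 indicate regions represented more than once.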
- To further assess the assembly with lowest error rate, highest mapping rate, and coverage closest to 1.0, we attempted to estimate the number of possible errors and heteroplasmic sites in that assembly using all of the long- and short-read data available to us.
- For example, short-reads that derive from regions of the nuclear or mitochondrial genome that are similar to the chloroplast genome may map with high probability to the chloroplast genome, but this mapping is far less likely to occur with long-reads, which in our dataset are a minimum of 5 kb.
- Thus, if the frequency of the non-reference base of a mismatch or indel is similar between the short- and long-read mappings, then this is most likely to indicate either an assembly error or a heteroplasmic site.
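The comparison described above can be sketched as a filter over per-site non-reference allele frequencies. This is an illustrative assumption rather than the authors' procedure: the 10% minimum frequency echoes the threshold mentioned earlier in the summary, while the `max_diff` tolerance, the `Site` type, and the example sites are hypothetical.

```python
from typing import NamedTuple


class Site(NamedTuple):
    position: int
    short_freq: float  # non-reference allele frequency in short-read mapping
    long_freq: float   # non-reference allele frequency in long-read mapping


def candidate_sites(sites, min_freq=0.10, max_diff=0.05):
    """Keep sites where both read types support the variant at similar frequency.

    min_freq follows the 10% threshold in the text; max_diff is a hypothetical
    tolerance for what counts as "similar".
    """
    return [
        s for s in sites
        if s.short_freq >= min_freq
        and s.long_freq >= min_freq
        and abs(s.short_freq - s.long_freq) <= max_diff
    ]


sites = [
    Site(101, 0.12, 0.14),   # both agree: assembly error or heteroplasmy candidate
    Site(2050, 0.30, 0.02),  # short-reads only: consistent with mis-mapped nuclear or mitochondrial copies
    Site(4096, 0.01, 0.18),  # long-reads only: consistent with long-read sequencing error
]
kept = candidate_sites(sites)
print([s.position for s in kept])
```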
- Scripts for this analysis are available in the 3_post_assembly/4_SNP_call folder of the GitHub repository.
- We then produced 312 individual alignments of the 32 sequences (one alignment for each fragment) using Clustal Omega v1.2.4 [89].
- Scripts for this analysis are available in the phylogenetic_analysis folder of the GitHub repository.
- Chloroplast phylogeny of Cucurbita: evolution of the domesticated and wild species.
- A large-scale chloroplast phylogeny of the Lamiaceae sheds new light on its subfamilial classification.
- The evolution of the plastid chromosome in land plants: gene content, gene order, gene function.
- Inferring the evolutionary mechanism of the chloroplast genome size by comparing whole-chloroplast genome sequences in seed plants.
- Complete plastid genome sequence of the chickpea (Cicer arietinum) and the phylogenetic distribution of rps12 and clpP intron losses among legumes (Leguminosae).
- Complete nucleotide sequence of the Cryptomeria japonica D.
- organization and evolution of the largest and most highly rearranged chloroplast genome of land plants.
- The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression.
- Complete chloroplast genome of the genus Cymbidium: lights into the species identification, phylogenetic implications and population genetic analyses.
- Combined analysis of the chloroplast genome and transcriptome of the Antarctic vascular plant Deschampsia antarctica Desv.
- Benchmarking of the Oxford Nanopore MinION sequencing for.
- An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome.
- Genome analysis of the ancient Tracheophyte Selaginella tamariscina reveals evolutionary features relevant to the Acquisition of Desiccation Tolerance.
- Entire plastid phylogeny of the carrot genus (Daucus, Apiaceae): concordance with nuclear data and mitochondrial and nuclear DNA insertions to the plastid.
- The Complete Chloroplast genome sequence of the medicinal plant Swertia mussotii using the PacBio RS II platform.
- Evaluation of the impact of Illumina error correction tools on de novo genome assembly.
