
Hecaton: Reliably detecting copy number variation in plant genomes using short read sequencing data


Summary

- Results: To enable reliable and comprehensive detection of CNV in plant genomes, we developed Hecaton, a novel computational workflow tailored to plants, that integrates calls from multiple state-of-the-art algorithms through a machine-learning approach.
- Moreover, in contrast to several state-of-the-art tools, it correctly detects dispersed duplications, a type of CNV commonly found in plant species.
- Phenotypic variation between individuals of the same plant species is caused by a host of different types of genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions, and larger structural variation.
- CNV comprises a large part of the genetic variation found within plant populations and is thought to play a key role in adaptation and evolution [1].
- One clear example of such adaptive evolution is presented by the weed species Amaranthus palmeri, which rapidly became resistant to a widely used herbicide through amplification of the EPSPS gene, resulting in increased expression [2].
- Given the increasing interest of the plant research community in CNV, the question arises whether current methods accurately detect copy number variants (CNVs) in plants.
- 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0).
- Although current state-of-the-art CNV detection algorithms generally perform well when applied to human datasets [8], the typical complexity of plant data likely introduces false positive calls.
- The reads representing this sequence generally share high similarity with other assembled regions of the reference, to which they are incorrectly aligned as a result.
- We expect that this issue introduces a significant number of false positives when working with plant data, as duplication and transposition of genomic sequences are considered to be among the main drivers of adaptive evolution in plants [10].
- First, it makes use of a custom post-processing step to correct erroneously detected dispersed duplications, which are systematically mispredicted by some state-of-the-art tools.
- To maximize the performance of Hecaton, we combine predictions of a diverse set of popular, open-source tools that complement each other in terms of the signals and strategies used to call CNVs.
- Implementation of Hecaton.
- Fig. 1 Overview of Hecaton.
- The final output of the calling stage consists of four VCF files containing CNVs, one for each tool.
- Stage 2: Post-processing.
- The post-processing stage of Hecaton serves three purposes.
- A novel adjacency is defined as a pair of bases that are adjacent to each other in the genome of a sample of interest, but not in the genome of the reference to which the sample is compared.
- For example, it represents deletions as a single adjacency containing two breakends positioned on the 5' and 3' ends of the deleted sequence.
- Next, it clusters adjacencies generated by the same tool of which the breakpoints are located within 10 bp of each other on either the 5' end or 3' end, as these are likely to be part of the same variant.
- Finally, it converts each cluster to a deletion, insertion, tandem duplication, or dispersed duplication, based on the relative positions of the breakends and the orientation of the sequences that are joined in a cluster.
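The clustering step described above can be illustrated with a minimal Python sketch. It is not Hecaton's own code: the `Adjacency` class, the `linked` and `cluster_adjacencies` functions, and the example coordinates are all hypothetical, and the sketch assumes that both breakends lie on one chromosome and that "within 10 bp" means a distance of at most 10. Adjacencies are linked transitively (single linkage), which is one plausible reading of the description.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Adjacency:
    """Hypothetical record of one novel adjacency reported by a tool."""
    tool: str
    chrom: str
    start: int  # position of the 5' breakend
    end: int    # position of the 3' breakend


def linked(a, b, max_dist=10):
    """True if two adjacencies come from the same tool and share a breakend
    within max_dist bp on either the 5' or the 3' end."""
    if a.tool != b.tool or a.chrom != b.chrom:
        return False
    return abs(a.start - b.start) <= max_dist or abs(a.end - b.end) <= max_dist


def cluster_adjacencies(adjacencies, max_dist=10):
    """Group adjacencies into clusters using union-find (single linkage)."""
    parent = list(range(len(adjacencies)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(adjacencies)), 2):
        if linked(adjacencies[i], adjacencies[j], max_dist):
            parent[find(i)] = find(j)

    clusters = {}
    for i, adj in enumerate(adjacencies):
        clusters.setdefault(find(i), []).append(adj)
    return list(clusters.values())


# Illustrative input: the first two calls share a 5' breakend within 10 bp.
calls = [
    Adjacency("delly", "chr1", 1000, 2000),
    Adjacency("delly", "chr1", 1005, 5000),
    Adjacency("delly", "chr1", 9000, 9500),
    Adjacency("manta", "chr1", 1000, 2000),
]
clusters = cluster_adjacencies(calls)
for cluster in clusters:
    print([(a.tool, a.start, a.end) for a in cluster])
```

The pairwise comparison is quadratic in the number of adjacencies; sorting by position first would scale better on real call sets, but the simpler form keeps the linking rule easy to read.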
- Calls are merged if they fulfill all of the following conditions: they are of the same type.
- The regions of the merged calls are defined as the union of the regions of the "donor" calls.
- The number of discordantly aligned read pairs and split reads supporting a merged call are both defined as the median of the numbers of the donor calls.
- The final result of the post-processing stage is a single BEDPE file containing all merged calls.
- The rows of the matrix correspond to CNV calls and the columns correspond to features (Additional file 2: Table S1), which are extracted from the INFO and FORMAT fields of the VCF file containing the calls.
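As a rough illustration of how donor calls might be combined into one merged call, the sketch below takes the union of the donor regions and the median of the supporting read counts, as stated above. The `Call` class and `merge_calls` function are hypothetical names. The full set of merge conditions is not listed in this summary, so the function assumes its input has already been judged mergeable and only checks type and chromosome. It also assumes the donor regions overlap, so that their union is a single interval.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Call:
    """Hypothetical CNV call with its supporting evidence counts."""
    cnv_type: str
    chrom: str
    start: int
    end: int
    discordant_pairs: float
    split_reads: float


def merge_calls(donors):
    """Merge donor calls: union of regions, median of the support counts."""
    if not donors:
        raise ValueError("need at least one donor call")
    if len({(c.cnv_type, c.chrom) for c in donors}) != 1:
        raise ValueError("donor calls must share type and chromosome")
    return Call(
        cnv_type=donors[0].cnv_type,
        chrom=donors[0].chrom,
        # Assumes overlapping donors, so the union is one contiguous interval.
        start=min(c.start for c in donors),
        end=max(c.end for c in donors),
        discordant_pairs=median(c.discordant_pairs for c in donors),
        split_reads=median(c.split_reads for c in donors),
    )


donors = [
    Call("DEL", "chr2", 100, 250, discordant_pairs=4, split_reads=2),
    Call("DEL", "chr2", 110, 260, discordant_pairs=6, split_reads=3),
    Call("DEL", "chr2", 105, 255, discordant_pairs=10, split_reads=7),
]
merged = merge_calls(donors)
print(merged)
```

Using the median rather than the sum keeps the support counts of a merged call on the same scale as those of a call reported by a single tool.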
- The training and testing set were constructed by running the calling and post-processing stages of Hecaton on Illumina data of an Arabidopsis thaliana Col-0–Cvi-0 F1 hybrid and a sample of the Japonica rice Suijing18 cultivar (Additional file 2: Table S2).
- As we aimed to maximize the performance of the model for low coverage datasets in particular, we subsampled these datasets to 10x coverage using seqtk [26].
- Calls were labeled as true or false positives using long read data of the same samples (See Additional file 3: Supplementary Methods for details).
- The hyperparameters of the model (n_estimators, max_depth, and max_features) were selected by doing a grid search with 10-fold cross-validation on the training set, using the accuracy of the model on the validation data as optimization criterion.
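A grid search of this kind could look roughly as follows with scikit-learn, whose random forest parameter names match those given above. This is a sketch under stated assumptions: the feature matrix is synthetic, and the candidate values in `param_grid` are placeholders, since the grid actually searched is not given in this summary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the call-by-feature matrix and the true/false labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Placeholder candidate values; the real grid is an assumption here.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,               # 10-fold cross-validation on the training set
    scoring="accuracy",  # accuracy on the held-out folds as the criterion
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# The refitted best model yields a probability per call, which can serve as
# the score that is thresholded to accept or reject calls.
probabilities = search.best_estimator_.predict_proba(X)[:, 1]
```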
- The performance of Hecaton was compared to that of current state-of-the-art tools using short read data simulated from rearranged versions of the Solanum lycopersicum Heinz 1706 reference genome of tomato [28].
- In the first stage, it aligns short read WGS data to a reference genome of choice and calls CNVs from the resulting alignments using Delly, GRIDSS, LUMPY, and Manta, four state-of-the-art tools that complement each other in terms of their methodological set-up.
- Below, we first describe how the design of Hecaton allows it to outperform the current state-of-the-art and then present an application of Hecaton to crop data.
- Hecaton accurately detects dispersed duplications.
- To show the impact of this problem, we applied Delly, GRIDSS, LUMPY, and Manta to short read data simulated from modified versions of the S.
- Such signals consist of novel adjacencies, pairs of bases that are adjacent to each other in the genome of the sample of interest, but not in the genome of the reference to which the sample is compared.
- The post-processing step of Hecaton corrects dispersed duplications that are erroneously predicted by Delly, LUMPY, and Manta, which significantly improves their performance.
- The performance of the post-processing step improved with coverage (Fig. 3), as it fails to detect dispersed duplications if one or both of the adjacencies resulting from them are missing from the output of Delly, LUMPY, Manta, or GRIDSS.
- Performance metrics are reported as the mean over all 10 simulations with error bars depicting the standard error of the mean.
- The size distributions of the detected false positives are depicted as box plots.
- If only one of the two adjacencies could be detected, the post-processing script classified it as a false positive deletion, false positive tandem duplication, or generic breakend.
- Hecaton generally outperforms state-of-the-art CNV detection tools.
- The labels (true or false positive) of these CNVs were determined using long read data of the same samples.
- The machine-learning approach used during the filtering stage of Hecaton integrates calls of Delly, LUMPY, ...
- For example, at a precision level of 80%, Hecaton detected 43 true positive tandem duplications, while the best performing state-of-the-art tool, GRIDSS, detected only 19.
- Our results agree with previous work in which a method that carefully merges calls of different CNV calling tools attained a higher precision and recall than any of the individual tools [11].
- As the approach performed about equally well when using a random forest model trained on either 10x or 50x coverage data (Additional file 1: Figure S4), the random forest framework itself is the main driver of the improvement, rather than the sequencing coverage used to train the models.
- Dispersed duplications.
- Fig. 3 Performance of the post-processing step of Hecaton on data simulated from diploid rearranged tomato genomes.
- The post-processing script of Hecaton recalled dispersed duplications not originally found in the output of Delly, LUMPY, Manta.
- The post-processing stage of Hecaton significantly increased the precision of tools by replacing pairs of overlapping false positive deletions and tandem duplications by true positive intrachromosomal dispersed duplications.
- improved upon current state-of-the-art ensemble methods that are applicable to, but not specifically designed for plant data.
- The poor performance of MetaSV and SURVIVOR sharply contrasts with the good performance they showed in the benchmarks of the publications describing them [31, 32].
- As a large fraction of calls could not be validated using long read data, due to the highly repetitive nature of the Mo17 assembly (Additional File 2: Table S3), we only report performance metrics for calls that overlap for at least 50% of their length with genes or the 5000 bp interval upstream or downstream of genes.
- Consistent with the results of our previous benchmarks, Hecaton attained a better combination of recall and precision compared to both individual state-of-the-art tools and ensemble approaches (Fig.
- With some of the insertions, the mates of the soft-clipped reads mapped to a different chromosome, indicating that some may be interchromosomal transpositions instead.
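The 50% overlap criterion can be made concrete with a small sketch. The function name `covered_fraction`, the tuple layout, and the coordinates are hypothetical. The sketch assumes half-open intervals and measures the call against the union of all gene windows, each extended by 5000 bp on both sides; whether the original analysis combined genes in this way is an assumption.

```python
def covered_fraction(call, genes, flank=5000):
    """Fraction of a call's length covered by genes plus flanking intervals.

    call:  (chrom, start, end), half-open coordinates
    genes: iterable of (chrom, start, end), half-open coordinates
    """
    chrom, start, end = call
    # Extend every gene on the same chromosome by the flank on both sides.
    windows = sorted(
        (max(0, g_start - flank), g_end + flank)
        for g_chrom, g_start, g_end in genes
        if g_chrom == chrom
    )
    covered = 0
    cursor = start  # rightmost position already counted
    for w_start, w_end in windows:
        lo = max(w_start, cursor)
        hi = min(w_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi  # prevents double counting of overlapping windows
    return covered / (end - start)


call = ("chr1", 10000, 20000)
far_gene = [("chr1", 22000, 25000)]   # window covers 17000-20000 of the call
near_gene = [("chr1", 18000, 19000)]  # window covers 13000-20000 of the call

print(covered_fraction(call, far_gene))   # 0.3, call would be excluded
print(covered_fraction(call, near_gene))  # 0.7, call would be reported
```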
- The curve of Hecaton was produced by varying the threshold of the probabilistic score used to define calls as true positives.
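How such a curve arises from a probabilistic score can be shown with a short sketch. The function name, the scores, and the labels are made up for illustration. As a simplification, recall is computed relative to the true positives present in the scored call set, whereas a full benchmark would also count true CNVs that were never called.

```python
def precision_recall_points(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each score threshold.

    scores: model probabilities, one per call
    labels: 1 for a validated (true positive) call, 0 for a false positive
    """
    total_true = sum(labels)
    points = []
    for threshold in thresholds:
        # Keep only calls whose score reaches the threshold.
        kept = [label for score, label in zip(scores, labels) if score >= threshold]
        true_positives = sum(kept)
        precision = true_positives / len(kept) if kept else 1.0
        recall = true_positives / total_true if total_true else 0.0
        points.append((threshold, precision, recall))
    return points


scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]
points = precision_recall_points(scores, labels, thresholds=[0.3, 0.5, 0.9])
for threshold, precision, recall in points:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision, which is why a single tool with a fixed output appears as one point while a scored call set traces a curve.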
- The performance of Hecaton can be further improved, as it is relatively easy to extend it to include other CNV detection tools or to train new random forest models using additional plant data.
- Fig. 5 Performance of different CNV detection algorithms on short read data of the maize B73 accession.
- Although Hecaton has no upper limit in terms of the size of CNV it can detect, we were not able to evaluate its performance on CNVs that were larger than 1 Mb, as such calls tended to be falsely validated by one of our validation methods, VaPoR (Additional file 1: Figure S6).
- Finally, we were not able to robustly assess the performance of Hecaton in polyploid plant species, as we could not find polyploid samples of which both short and long read data were publicly available.
- We expect that the performance of Hecaton on polyploids should be comparable to the.
- Most of the CNVs between the tomato samples and the reference genome consisted of deletions (Table 3), following a similar trend as seen in the A.
- No insertions were reported for any of the samples (Table 3).
- In contrast to several state-of-the-art tools, Hecaton correctly detects dispersed duplications.
- Figure S3: Recall of the post-processing step of Hecaton for dispersed duplications simulated at different allele dosages in tetraploid tomato genomes, before and after post-processing.
- Figure S4: Performance of Hecaton on the test set containing Col-0–Cvi-0 and Suijing18 CNV events called from 10x coverage data, using random forest models trained on CNVs detected at different levels of sequencing coverage.
- Figure S13: Density plot of the fraction of N's in the 400 bp flanking regions of CNV events called in domesticated and wild tomato samples.
- Table S2: Description of the used datasets.
- Evaluating the performance of CNV detection tools using real data.
- RYW, SS, and DDR contributed to the design of Hecaton.
- The funding bodies had no role in the design of the study.
- Additional file 2: Table S2 provides a full description of the datasets, including NCBI Short Read Archive accession numbers.
- Gaines TA, Zhang W, Wang D, Bukun B, Chisholm ST, Shaner DL, et al.
- Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, et al.
- Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome.
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, et al.
- Chaisson MJ, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, et al.
- Lee AY, Ewing AD, Ellrott K, Hu Y, Houlahan KE, Bare JC, et al.
- Cameron DL, Schroeder J, Penington JS, Do H, Molania R, Dobrovic A, et al.
- Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Källberg M, et al.
- Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, et al.
- Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, et al.
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al.
- Mohiyuddin M, Mu JC, Li J, Bani Asadi N, Gerstein MB, Abyzov A, et al.
- Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, et al.
- Zarate S, Carroll A, Krashenina O, Sedlazeck FJ, Jun G, Salerno W, et al.
- Sun S, Zhou Y, Chen J, Shi J, Zhao H, Zhao H, et al.
- Ewing AD, Houlahan KE, Hu Y, Ellrott K, Caloian C, Yamaguchi TN, et al.
- Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al.
- Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, et al.
- Maron LG, Guimarães CT, Kirst M, Albert PS, Birchler JA, Bradbury PJ, et al.
- Sutton T, Baumann U, Hayes J, Collins NC, Shi BJ, Schnurbusch T, et al.
- Aflitos S, Schijlen E, de Jong H, de Ridder D, Smit S, Finkers R, et al.
- Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, et al.
- Nie SJ, Liu YQ, Wang CC, Gao SW, Xu TT, Liu Q, et al.
- Zapata L, Ding J, Willing EM, Hartwig B, Bezdan D, Jiao WB, et al.
- Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, et al.
