
Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data

- Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function.
- Recently, the applicability of the CNV detection methods for noninvasive prenatal testing samples with read depth of 0.2–0.3× was assessed [18].
- Presently, the ability of the methods to detect CNVs from such ultra-low-coverage sequencing data remains unclear.
- Compared to array-based and karyotyping-based benchmarking data, simulated CNVs provide the most accurate ground truth with respect to the exact breakpoints of the CNVs.
- In both parts of the comparison, we measure the performance using sensitivity, false discovery rate (FDR) and F1 score.
- Finally, we also compare run times of the methods.
- Figure 1 illustrates the main steps of the comparison process.
- Canvas and QDNAseq also correctly detected all the autosomal CNVs, but Canvas also produced some additional false positives, whereas QDNAseq produced some copy-number-neutral segments within some of the CNVs.
- Two of the tools predicted the correct location but an incorrect copy number for some of the CNVs.
- The results show that only BIC-seq2 was able to accurately detect all of the CNVs in both sex chromosomes, whereas the other tools had varying degrees of difficulty in predicting them.
- BIC-seq2 was the only algorithm that was able to accurately detect both of the CNVs in chromosome Y.
- All of the algorithms, except HMMcopy, were able to detect the duplication in chromosome Y.
- In order to assess how the coverage of the simulated WGS data affects the performance, we used nine different coverages and 0.0005.
- The original simulated dataset with coverage of 1× was downsampled to each of the nine different coverages 20 times.
- The average sensitivity, FDR and F1 score of the six CNV algorithms were
- Table 1 Summary of features for the algorithms.
- Fig. 1 Flowchart showing the main steps of our comparison, including preprocessing of the data, detection of copy number variations (CNVs) with six different algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) and evaluation and validation of the results.
- Overall, the choice of the evaluation criteria had no effect on the order of the best- and poor-performing tools, and there was no considerable variation in the inferred CNVs for any of the tools across the twenty subsets of the data.
- In general, when using either the stringent or loose criteria, all of the tools performed poorly with extremely low read coverages (0.0005× to 0.01×) and better with higher coverages.
- All of the tools achieved ≥ 50 % sensitivity with read coverages ≥ 0.01×.
- However, CNVnator achieved high sensitivity with many of the window sizes (Supplementary Fig.
- Fig. 2 Genomic map visualization of the copy number variations (CNVs) detected in the simulated dataset using the six algorithms (rows 1–6) along with the ground truth CNVs (row 7) in the respective chromosomal locations.
- The bottom part of the visualization depicts the depth of read coverage at each 50 kbp window.
- The read coverage of the data used in this visualization was 1×.
- Fig. 3 Performance evaluation of the six copy number variation (CNV) algorithms using the simulated data with the stringent criteria: at least 80 % overlap between the inferred and ground truth CNVs and no filtering by CNV length.
- a True positive rate (TPR), b False discovery rate (FDR), and c F1 score of the CNV detections achieved by the different tools when the read coverage is varied.
- Error bars denote the standard error of the results from 20 different random subsets.
- Next, five different window sizes and 2000 kbp) were tested to investigate the relationship between the coverage and the optimal choice of the window size.
- We used the F1 values of the window size comparison to select the optimal window size for each method at coverage of 0.1×, which we used in the cell line data benchmarking.
- It should be noted that some of the larger window sizes (1 Mbp, 2 Mbp) were likely too large for the identification of the smallest CNVs of 1 Mbp length.
- For QDNAseq the CNV detection is visualized using two different setups: inclusion and exclusion of the sex chromosomes X and Y.
- However, none of the tools met the minimum overlap criterion of >.
- All of the algorithms found the large chromosome 12 gain.
- The fragmented detection of QDNAseq and Canvas for the chromosome 12 gain can be explained by the exclusion of the blacklisted regions that both algorithms use by default.
- In this setting, most of the algorithms detected the gain in chromosomes 12 and 17 of the abnormal samples, and hence the sensitivity of the algorithms was similar (Fig.
- BIC-seq2 clearly had the best sensitivity with both the abnormal and normal data, because BIC-seq2 was also able to identify some of the smaller gains in chromosomes 7 and 20.
- With these stringent criteria none of the algorithms detected the only gain in the normal samples (Supplementary Fig.
- The analysis of the simulated data showed that QDNAseq achieved one of the best sensitivities in the comparison (Fig.
- When evaluating the CNVs by their exact copy number, no impact on the sensitivity, FDR or the F1 score was observed for five of the six tools, HMMcopy being the only exception.
- which is why we considered it the best method of the cell line benchmarking.
- Fig. 4 Visualization of the CNVs detected in the cell line data with the six algorithms along with the array-based benchmark CNVs in the respective chromosomal locations.
- a Karyotypically abnormal (H9-AB) and b normal (H9-NO) variants of the human embryonic stem cell line H9 were analysed.
- The CNV detection methods have differences in how they handle the centromeres, affecting the evaluation of the large gain in chromosome 12.
- However, this was not a significant issue in our comparison due to the small size of the centromere and our comparison approach that penalized for the redundant segmentation based on the size of the gaps.
- Fig. 5 Performance evaluation of the six algorithms using the cell line data with the criteria of ≥ 50 % overlap and no minimum length requirement for the detected CNVs.
- a, d True positive rate (TPR), b, e False discovery rate (FDR), and c, f F1 score of the CNV detections.
- The slowness of BIC-seq2 is attributable to the computationally demanding normalization step, accounting for 99.9 % of the run time.
- In terms of the real maximum memory consumption, FREEC and Canvas were the lowest and highest memory consumers, respectively.
- The failure rate at each coverage was estimated by calculating the proportion of the runs in the simulation experiment that failed to complete.
- All of the algorithms had zero failure rate with read coverages ≥ 0.01×.
- These tools were selected because they are commonly used either based on the number of citations or the number of downloads of the tool.
- Fig. 6 Performance evaluation of the six algorithms using the combined abnormal cell line samples while varying three evaluation parameters:
- a True positive rate (TPR), b False discovery rate (FDR), and c F1 score of the CNV detections for each of the six tools.
- Table 2 Mean and standard deviation (SD) of the running times in seconds and maximum memory consumption for each algorithm.
- All of the selected tools were read-depth-based algorithms.
- The two principal causes of systematic biases in the read alignment efficiency are the local GC-content and mappability of the different genomic regions [34].
- In the third step, segmentation of the counts into homogeneous regions with highly similar copy numbers is performed.
- All six algorithms were also coupled with convenient visualization functions that can be used in illustrating the effect of bias correction or in the interpretation of the results.
- Since BIC-seq2 is run in two steps, with the normalization step being considerably slow, it was clearly the slowest of the six tools.
- It should be noted that for some of the methods, the number of false positives can potentially be decreased by improving the filtering that excludes problematic regions.
- CNVnator would considerably benefit from blacklisting centromere regions, whereas QDNAseq would benefit from disabling some of the filters that cause copy-number-neutral gaps in the CNVs in the autosomes and false positives in the sex chromosomes.
- However, the window size was considered an important parameter in the present work, as it directly affects the size of the CNVs that can be identified.
- Furthermore, many of the tools were not readily usable with our 2 × 150 bp sequencing setup (BIC-seq2, FREEC, HMMcopy and QDNAseq).
- However, the drawback of this approach is that it decreases the accuracy of the read alignment, and hence it could decrease the accuracy of the CNV detection.
- In this work, we used simulated WGS data as well as WGS data from hESC samples to evaluate the performance of the CNV detection tools.
- To investigate the ability of the algorithms to identify CNVs in sex chromosomes, and to also acquire a more genuine ground truth for the purpose of benchmarking, we created simulated WGS data.
- The quality and quantity of the DNA was analyzed with Nanodrop and Qubit 2.0, and.
- The quality of the libraries was determined with Agilent 2100 Bioanalyzer.
- Quality control of the raw sequence data was performed using FastQC v0.11.4 (https://www.bioinformatics..
- Alignment of the reads was done with BWA-mem v0.7.16a [40] against the human reference genome hg19.
- The number and shape of chromosomes of the samples were determined, i.e.
- and the identification of the abnormal copy number regions.
- The performance of the method has been previously demonstrated on low-coverage data (0.1×) [25].
- Second, since half of the algorithms (CNVnator, QDNAseq, HMMcopy) have no default value for the window size and since the window size can be altered for all the tools except for Canvas, we investigated how the choice of the window size affected the performance.
- However, for QDNAseq we tested different window sizes that were available in the bin annotation of the R package and 1000 kbp).
- Next, we calculated the ratio of how many bases the two genomic region sets overlapped to the length of the ground truth CNV.
- Every inferred CNV that did not overlap with any of the ground truth CNVs was counted as a false positive (FP).
- Performance evaluation of the six copy number variation (CNV) algorithms using the simulated data with the loose criteria: at least 60 % overlap between the inferred and ground truth CNV segments and inclusion of ≥ 0.5 Mbp CNV segments.
- A) True positive rate (TPR), B) False discovery rate (FDR), and C) F1 score of the CNV detections achieved by the different tools when the read coverage is varied.
- Error bars denote the standard error of the results produced with 20 different random subsets.
- The coefficient of variation of 0.05 is the default value of the built-in method of FREEC for selecting the window size based on the coverage.
- Visualization of the CNVs detected in the H9-AB-p116 dataset using the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- The bottom part of the visualization depicts the depth of read coverage at each 50 kbp window.
- Visualization of the CNVs detected in the H9-AB-p113 dataset using the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- The bottom part of the visualization depicts the depth of read coverage at each 50 kbp window.
- Visualization of the CNVs detected in the H9-p38 dataset using the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in the H9-p41 dataset using the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-AB by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- The bottom part of the visualization depicts the depth of read coverage at each 50 kbp window.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-NO by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-AB-p116 by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-AB-p113 by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-NO-p41 by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Visualization of the CNVs detected in all the chromosomes in the combined sample H9-NO-p38 by the six algorithms along with the array-based benchmark CNV segments in the respective chromosomal locations.
- Performance evaluation of the six algorithms using the cell line data with the stringent criteria: at least 80 % overlap between the inferred and array-validated CNV segments and ≥ 0.5 Mbp CNV length requirement for the detected CNV segment.
- A,D) True positive rate, B,E) False discovery rate and C,F) F1 score of the CNV detections.
- Number of bases and read coverage of the cell line samples for each sample individually and for the combined samples.
- JS performed the evaluation of the tools.
- A copy number variation map of the human genome.
- Epigenetic Silencing of the Key Antioxidant Enzyme Catalase in Karyotypically Abnormal Human Pluripotent Stem Cells
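
The overlap-based evaluation summarized above (the ratio of overlapping bases to the length of the ground truth CNV, false positives for inferred CNVs with no overlap, and the sensitivity, FDR and F1 score) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `overlap_fraction` and `evaluate` are hypothetical, all intervals are assumed to lie on one chromosome as half-open `(start, end)` pairs, the calls are assumed not to overlap each other, and a ground truth CNV is assumed to count as a true positive when its covered fraction reaches the threshold (0.8 for the stringent criteria). The gap-size penalty for redundant segmentation and the copy number check mentioned in the summary are not modeled.

```python
from typing import Dict, List, Tuple

# Half-open interval [start, end) on a single chromosome (assumed convention).
Interval = Tuple[int, int]


def overlap_fraction(truth: Interval, calls: List[Interval]) -> float:
    """Fraction of one ground-truth CNV covered by the called intervals.

    Assumes the called intervals do not overlap each other.
    """
    t_start, t_end = truth
    covered = sum(
        max(0, min(t_end, c_end) - max(t_start, c_start))
        for c_start, c_end in calls
    )
    return covered / (t_end - t_start)


def evaluate(
    truth: List[Interval], calls: List[Interval], min_overlap: float = 0.8
) -> Dict[str, float]:
    """Sensitivity, FDR and F1 under an overlap threshold (assumed definitions)."""
    # A truth CNV is a true positive if enough of its length is covered.
    tp = sum(1 for t in truth if overlap_fraction(t, calls) >= min_overlap)
    fn = len(truth) - tp
    # A call that overlaps no truth CNV at all is a false positive.
    fp = sum(
        1
        for c_start, c_end in calls
        if all(min(c_end, t_end) <= max(c_start, t_start) for t_start, t_end in truth)
    )
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (fp + tp) if fp + tp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"sensitivity": sensitivity, "fdr": fdr, "f1": f1}


# Toy data (made up for illustration, not from the study).
truth_cnvs = [(0, 1000), (5000, 6000)]
called_cnvs = [(0, 900), (5000, 5500), (8000, 9000)]
print(evaluate(truth_cnvs, called_cnvs, min_overlap=0.8))
```

With the toy data, the first ground truth CNV is 90 % covered and the second only 50 %, so lowering `min_overlap` from 0.8 to 0.5 turns the second one from a false negative into a true positive, which is the kind of shift between stringent and loose criteria that the summary describes.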
