
Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines


Summary

- We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact.
- Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup.
- 1 Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany.
- Full list of author information is available at the end of the article.
- From sequencing through variant calling to subsequent statistical analysis, variation can be introduced at any step of the genomics workflow.
- Each sequencing technology produces its own imprint of systematic biases, imposing one of the most crucial bottlenecks in genomics research.
- Above all, the vast majority of previous studies were conducted on a single reference individual, or at most on a small number of individuals, which does not allow generalizable conclusions that can be translated to large cohorts.
- In order to disentangle the impact of the bioinformatic processing pipeline from the sequencing technology on.
- Throughout the study, we use the name of the sequencing platform (HSX, MOL, or CG) in combination with the mapping and variant calling pipeline to refer to the five distinct experimental setups we compared, namely CG, MOL + Isaac, HSX + Isaac, MOL + GATK, and HSX + GATK.
- Fig. 1: Graphical abstract of the study. (a) Schematic overview of the workflow of our analysis.
- The average number of concordant variants for different intersections was calculated and reported.
- Furthermore, we investigated the concordance in introns, exons, intergenic regions, repeat elements annotated with the RepeatMasker software, and bins with varying GC content.
- of variants in the respective experimental setups and all possible intersections.
- Subsequently, we analyzed the concordance of different setups within genomic regions, including exons, introns, repetitive elements, or genomic bins with varying GC content (Fig.
- In the first step of the analysis, we estimated the number of variants detected in each setup.
- HSX + Isaac was associated with the highest average number of SNPs, followed by HSX + GATK, MOL + Isaac, MOL + GATK, and then CG (Fig.
- The average number of indels seen in all setups was 214,730, corresponding to 23.9% of HSX.
- The distributions of the observed number of SNPs and indels differed significantly from the expected numbers (chi-square test, p <.
- However, the variants unique to each platform were highly overrepresented, especially in the case of SNPs, as demonstrated by expected-versus-observed-number-of-variants ratios that were significantly higher than 1 in all cases (Table 1).
- Distribution of variants along the genome.
- Fig. 2: Composition of variants detected under different experimental setups.
- However, the absolute number of exonic SNPs and indels was very similar for all investigated setups (Additional file 1, Fig.
- In contrast, the number of intronic and intergenic indels varied substantially, with the HSX + Isaac, MOL + GATK, and HSX + GATK strategies detecting considerably more indels.
- As expected, based on the distributions of the absolute numbers of variants, we observed a higher concordance for SNPs compared to indels (Fig.
- Then, we estimated the absolute number of variants detected by each setup in the respective repetitive region.
- We observed similar distributions for the number of SNPs in distinct repetitive regions under all experimental setups (Additional file 1, Fig.
- The majority of SNPs were detected in LINE, Alu, and LTR elements, whereas the number of SNPs in low complexity regions was much lower.
- Conversely, the absolute number of indels in the RepeatMasker annotated regions varied enormously between the setups.
- Notably, the highest number of indels was detected with the HSX + GATK strategy, and the difference was most prominent in Alu elements, LINE elements, and simple repeats (Additional file 1, Fig.
- Analysis of the Jaccard distance in pairwise comparisons of experimental approaches indicated that concordance of variant calls was consistently worse for indels compared to SNPs in all types of repetitive regions (Fig.
- This finding correlates with the fact that we observed the most considerable discrepancy in the number of indels under each setup in these two types of repetitive regions (Additional file 1, Fig. S1d).
- Distribution of variants in genomic bins with varying GC content.
- Table 1: Monte Carlo simulation-based comparison of the observed and expected number of variants uniquely detected by each experimental approach and in the intersection of all setups.
- Then, we estimated the non-parametric correlation between the number of SNPs and indels in the same genomic bins and the relative proportion of GC bases.
- We identified a weak positive correlation between the number of variants and the GC content in the 100 kbp genomic bins under all experimental conditions.
- In contrast, the positive correlation between the number of variants and GC content was weaker in bins with a GC content between ~37% and ~64%.
- It is important to mention that the 6th percentile of the GC content distribution was already at 33%, so bins with a lower GC content are relatively rare (cf.
- Consequently, the majority of variants were located in regions with medium GC content (Additional file 1, Fig.
- The differences between regions with varying GC content were not.
- 6a,b) compared to the remaining experimental setups.
- A higher number of variants in high confidence intervals could be indicative of increased sensitivity for the respective setup.
- Fig. 5: Relationship between GC content, number of variants, and concordance of variant calls between experimental setups.
- The correlation between GC content and the number of SNPs (a) and indels (b) based on genomic bins of 100 kbp was evaluated with the Spearman correlation coefficient r.
- 47%) GC content.
- Furthermore, numerous studies have led to conflicting estimates of the accuracy of preferred analysis pipelines for sequencing data, and challenges remain in benchmarking variant call datasets.
- This real-world data set allowed us to perform a practical comparison of the consistency of different sequencing and variant calling methods.
- First, the average number of SNPs consistently detected throughout all experimental setups corresponded to a range of 82 to 88% of SNPs for each method.
- A Monte Carlo simulation-based statistical test revealed that the observed number of variants called by all setups was significantly higher than expected by chance.
- Nevertheless, the observed number of variants unique to each method was also increased considerably relative to the expected quantity, which hints at platform- and processing-pipeline-specific biases.
- This finding also implies that the choice of the sequencing platform and the subsequent bioinformatic strategy directly affects variant calling and, in turn, accounts for between-study heterogeneity observed in downstream analyses such as GWAS [17–19].
- Upon inspection of the different subtypes of variants, we observed that all experimental setups demonstrated a greater concordance in SNP detection than in indel detection.
- Furthermore, indels accounted for the majority of variants unique to HSX + GATK.
- The discrepancies in indel numbers were far higher than the number of random mutations an individual should have.
- deletion event in a highly repetitive stretch of the genome [21].
- Notably, reprocessing the Illumina cohorts with the GATK pipeline was associated with a considerable increase in the number of indels in these repeat regions, introns, and genomic bins with medium GC content relative to the same sequencing cohort processed with the Isaac software.
- In our current study, we identified a weak positive correlation between higher GC content and an increased number of detected variants, as well as a slightly better concordance in bins with low GC content.
- Our correlation analysis between GC content and the number of detected variants revealed a very similar relationship for all investigated setups.
- However, it is not clear whether these findings indicate a higher overall sensitivity of the respective setup, when taking into account that these.
- Bar plots show the absolute number of SNPs (a) and indels (b) with MAF <.
- For instance, Cornish and colleagues identified GATK+UnifiedGenotyper as the most sensitive strategy to detect variants among 30 alternative pipelines, and results were comparable irrespective of the aligner used [27].
- Importantly, none of the strategies achieved an average sensitivity for indels higher than 33%.
- In our study, reprocessing the Illumina cohorts with GATK was associated with a significant increase in the number of detected indels.
- This technique could be used to determine an actual error rate for the different setups and to distinguish correct from incorrect variant calls in specific regions of the genome.
- Furthermore, a considerable number of recently published studies still use older sequencing data from a wide variety of sources, or use such data as test data for novel computational approaches [32, 33], and we believe this will continue to be the case.
- Our study suggests that while both the sequencing technology and the choice of data analysis method contribute considerably to heterogeneous results, the impact of the mapping and variant calling strategy might be more pronounced, especially for indels.
- All multiple nucleotide polymorphisms (MNPs) were decomposed into consecutive SNPs since they were not encoded as such in the VCF files of the Illumina HiSeq data set.
- BCFtools stats was used to retrieve information about the number of variants per sample.
- The mapping of the unaligned files was done with bwa mem according to the GATK best practice.
- GATK was used for the processing of the sequencing files for MOL and HSX.
- For variant calling, the NVIDIA Clara Parabricks HaplotypeCaller was used, a GPU-accelerated version of the GATK HaplotypeCaller.
- Bedtools nuc was used to calculate the GC content of 100 kbp bins for the whole genome.
- Additionally, we created BED files of the following features:
- The resulting file with the variants was used to filter each individual VCF file of the respective setup.
- By comparing the number of variants found in high confidence regions, one can estimate the sensitivity of distinct setups.
- GC content annotation.
- GC content was calculated as the proportion of GC bases in genomic bins with a length of 100 kbp using bedtools nuc.
- Genomic bins were annotated as having low GC content (less than 37%), medium GC content (between 37 and 47%), and high GC content (more than 47%).
- These cut-off values were based on the quantiles of the GC content distribution.
- Namely, 37% GC content corresponds to the 30th percentile of the distribution, meaning that approximately 1/3 of the bins have a lower GC content.
- GC content.
- However, this value corresponds to the 95th percentile of the observed GC content distribution.
- In order to balance the proportion of bins with a high GC content, we therefore chose the lower cut-off value of 47%, which corresponds approximately to the 90th percentile of the GC content distribution.
- The pairwise concordance between experimental setups was investigated with the Jaccard distance, which is calculated by dividing the intersection of two sets by the union of the sets and subtracting the resulting value from 1: $d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$.
- Values of the Jaccard distance vary between 0 and 1, and lower values correspond to a higher concordance between two experimental setups..
- The observed and expected number of unique variants for the five experimental setups and their intersection (Venn diagrams in Fig.
- Assuming the total average number of SNPs and indels observed as the “true”.
- We computed the null distribution for the expected number of variants in each set by randomly drawing the observed number of variants for each approach and determining the respective number of unique variants.
- This step was repeated 1000 times, and the mean values over all runs were taken as estimates for the null distribution of the number of unique variants.
- The observed number of unique.
- Furthermore, we calculated the ratio of the observed versus expected number of variants under each condition.
- These ratios were considered to be significantly higher than 1 if the lower bound of the 95% confidence interval did not cross 1.
- The confidence intervals were obtained by calculating the 2.5% and the 97.5% quantiles of the empirical distributions of the expected versus observed ratios from the 1000 simulation steps.
- The correlation between the number of variants and GC content in genomic bins of 100 kbp was evaluated with the non-parametric Spearman correlation coefficient.
- S1: Distribution of variants along the genome.
- S2: Correlation between GC content and number of variants detected..
- AT acknowledges funding from NIH-NCTAS UL1TR002550. The funders had no role in the research design, data collection and analysis, writing of the manuscript, or the decision to publish.
- Reanalysis of the data was formally governed via the Data Access Agreement “MTA-EGA” between the Scripps Institute and the Johannes Gutenberg University of Mainz and approved by the Ethics Board of the Università della Svizzera italiana (Study-Identifier INF-A).
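
The Jaccard distance used in the summary above for pairwise concordance (one minus the size of the intersection divided by the size of the union) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `jaccard_distance`, the representation of a variant as a (chromosome, position, ref, alt) tuple, and the toy call sets are all assumptions made for the example.

```python
from typing import Hashable, Set


def jaccard_distance(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of variant keys."""
    union = a | b
    if not union:
        # Two empty call sets are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy call sets keyed by (chrom, pos, ref, alt).
setup_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
setup_b = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"), ("chr3", 10, "T", "C")}

distance = jaccard_distance(setup_a, setup_b)
print(f"Jaccard distance: {distance:.2f}")  # 2 shared of 4 total -> 0.50
```

A value of 0 means the two setups called exactly the same variants, and a value of 1 means they share none, which matches the interpretation given in the text.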

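The Monte Carlo comparison of observed and expected numbers of unique variants described above can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes each setup draws its observed number of calls uniformly and without replacement from a shared pool of variants, and all numbers (`UNIVERSE`, `SETUP_SIZES`, `OBSERVED_UNIQUE`) and function names are hypothetical toy values chosen so the example runs quickly.

```python
import random
from collections import Counter
from statistics import mean


def simulate_counts(universe_size, setup_sizes, n_runs=1000, seed=0):
    """Per run, count variants seen by exactly one setup and by all setups."""
    rng = random.Random(seed)
    pool = range(universe_size)
    unique_counts, shared_counts = [], []
    for _ in range(n_runs):
        draws = [set(rng.sample(pool, size)) for size in setup_sizes]
        times_called = Counter(v for draw in draws for v in draw)
        unique_counts.append(
            [sum(1 for v in draw if times_called[v] == 1) for draw in draws]
        )
        shared_counts.append(
            sum(1 for n in times_called.values() if n == len(draws))
        )
    return unique_counts, shared_counts


def percentile(values, q):
    """Empirical quantile (0 <= q <= 1) with linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def observed_vs_expected(observed, simulated):
    """Observed/expected ratio plus the 2.5% and 97.5% quantiles of the ratios."""
    if min(simulated) == 0:
        raise ValueError("a simulated count of zero makes the ratio undefined")
    ratios = [observed / s for s in simulated]
    return observed / mean(simulated), percentile(ratios, 0.025), percentile(ratios, 0.975)


# Hypothetical toy numbers, far smaller than a genome-wide call set.
UNIVERSE = 2000
SETUP_SIZES = [600, 700, 800]
OBSERVED_UNIQUE = [400, 450, 500]

unique_counts, shared_counts = simulate_counts(UNIVERSE, SETUP_SIZES)
results = []
for i, observed in enumerate(OBSERVED_UNIQUE):
    simulated = [run[i] for run in unique_counts]
    ratio, lower, upper = observed_vs_expected(observed, simulated)
    results.append((ratio, lower, upper))
    # As in the text: significant if the lower bound does not cross 1.
    print(f"setup {i}: ratio={ratio:.2f}, interval=({lower:.2f}, {upper:.2f}), "
          f"significant={lower > 1}")
```

The ratio here is observed divided by expected, so values above 1 indicate more unique variants than random overlap would produce.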
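
The GC content annotation (low below 37%, medium between 37 and 47%, high above 47%) and the Spearman correlation between GC content and variant counts per 100 kbp bin can be illustrated with a short standard-library sketch. The per-bin GC fractions and variant counts below are invented toy values, and the helper names (`gc_class`, `average_ranks`, `spearman`) are assumptions for this example. In the study, GC fractions were computed with bedtools nuc.

```python
from math import sqrt


def gc_class(gc_fraction, low_cutoff=0.37, high_cutoff=0.47):
    """Label a bin as low, medium, or high GC using the cut-offs from the text."""
    if gc_fraction < low_cutoff:
        return "low"
    if gc_fraction > high_cutoff:
        return "high"
    return "medium"


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-bin values: GC fraction and number of variants in the bin.
gc_fractions = [0.33, 0.36, 0.40, 0.44, 0.48, 0.55]
variant_counts = [120, 110, 150, 160, 155, 200]

labels = [gc_class(gc) for gc in gc_fractions]
rho = spearman(gc_fractions, variant_counts)
print(labels)
print(f"Spearman r = {rho:.3f}")
```

Because the coefficient is rank-based, it captures any monotonic relationship between GC content and variant counts without assuming linearity.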