
How imputation can mitigate SNP ascertainment Bias



- influenced by a non-random selection of the SNPs included in the used genotyping arrays.
- Full correction for this bias requires detailed knowledge of the array design process, which is often not available in practice.
- Results: The strategy was first tested by simulating additional ascertainment bias with a set of 1566 chickens from 74 populations that were genotyped for the positions of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array.
- Full list of author information is available at the end of the article.
- when used for samples of other species, this can result in a lack of variable and thus informative SNPs on the array and therefore a shift of the frequency spectrum towards rare variants [16].
- For example, the shift in allele frequencies towards common variants leads to a systematic overestimation of the heterozygosity of populations [16, 17].
- The relative effect is stronger for populations that were part of the discovery set compared to populations that were not part of the discovery set [16].
- Systematic differences in allele frequency spectra further increase estimates of genetic distances between populations which were part of the discovery set and those which were not [10].
- The complex interaction between the size of the discovery panel and its restriction to a subset of populations makes the bias difficult to predict or outright correct for.
- Further, attempts to implement bias-reduced estimators require strong assumptions on the design process of the used SNP array [12], which is often not public knowledge or too complicated to be remodeled [18, 19].
- Due to strongly decreasing sequencing costs and the complexity of the ascertainment bias correction strategies, a growing number of studies have used WGS data for population genetic analysis in recent years [20–24].
- Imputation-based studies mostly either used a reference panel of the same population as the study set itself [36–38] or utilized large global reference panels such as the 1000 Genomes or 1000 Bull Genomes [26, 41] projects.
- Set 1: Individual sequence data of 68 chickens from 68 different populations, sequenced within the scope of the EU project Innovative Management of Animal Genetic.
- Set 3: Genotypes of 1566 chickens from 74 populations, either genotyped (sub-set of the Synbreed Chicken Diversity Panel.
- The intersection of the used data sets is shown in Fig. 1, and accession information of the raw data per sample can be found in Supplementary File 1.
- While individual sequences are considered to be the gold standard throughout this study, genotypes of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15] are biased towards variation which is common in the commercial chicken lines [16], and pooled sequences only allow for an estimate of population allele frequencies and show a slight bias due to sample size and coverage (Supplementary File .
- Calling of WGS SNPs and generation of genotype set: alignment of the raw sequencing reads against the latest chicken reference genome GRCg6a [52] and SNP calling were conducted for individual and pooled sequenced data following GATK best practices [53, 54].
- Additionally, all individual sequences were genotyped for the positions of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15].
- To ensure compatibility between Array- and WGS data, the genotypes of the Synbreed Chicken Diversity panel were lifted over from galGal5 to galGal6 and corrected for switches of reference and alternate alleles.
- The description of the detailed pipeline can be found in Supplementary File 2.
- A first comparison was based solely on the 15,868 SNPs of chromosome 10 of the genotype set which allowed for a high number of repetitions while still being based on a sufficiently sized chromosome.
- To simulate an ascertainment bias of known strength, an even more strongly biased array was designed in silico from the genotype set for each of the 74 populations (further called discovery populations) by using only SNPs with MAF >.
- Imputation of the in silico arrays to the reference set was performed by running Beagle 5.0 [35] with ne the genetic distances taken from Groenen et al.
- After the initial tests of the imputation strategies by the in silico designed arrays, we imputed the complete genotype set to sequence level, using the available individual sequences as the reference panel.
- 74_1perLine) which is equivalent to the first scenario allPop_74 of the in silico array imputation.
- This was needed as we observed low assembly quality and insufficient coverage of the array on the small chromosomes.
- (2), where D_xy accounts for the genetic distance between populations X and Y, while x_il and y_il represent the frequency of the i-th allele at the l-th locus in population X and Y, respectively.
- Fig. 2 Schematic representation of the workflow of creating and re-imputing the in silico arrays.
- We therefore could not dissect the effects of the two biases in the comparisons on sequence level and did not include F_ST there.
- The magnitude of the bias can therefore be defined as the distance of the estimates to that line.
- For between-population estimators (F_ST, D), a group describes the corresponding combination of the two involved population groups.
- Deviations of the estimated slopes from one and the correlation between heterozygosity and distance estimates from the biased and the true set within groups were used as indicators for the magnitude of bias and random estimation error.
- Comparisons of population estimates on sequence level are therefore limited to 45 populations out of the 74 populations which were used as study and reference set for the imputation process.
- In case of the imputation to sequence level, we used leave-one-out validation to assess per-animal imputation accuracy.
- As expected, the in silico ascertained sets showed a strong overestimation of the H_E for nearly all populations in all cases.
- To get an impression of the strength of the correction and the needed size of the reference panel, Fig. 4 compares the correlation by population group, the slope for the within-group regression of the true H_E and H_O vs.
- Due to smaller reference panels, the correction effect of the imputation was generally worse than for strategy allPop_.
- The accuracy was consistently higher for individuals which were part of the discovery population (Fig.
- However, note that there is also a slight effect of the remaining pooling bias, which cannot be separated from ascertainment bias for the pooled sequenced populations.
- Overall performance of the correction method.
- Fig. 5 Development of the per-animal imputation accuracy for the in silico array to genotype set imputation with an increasing number of reference animals per population.
- The lines show the trend of the median and outliers are not shown in the plot as they do not add valuable information due to the high number of repetitions.
- Note that there is also an effect of pooled sequencing which affects the ‘true’ values of the pooled sequenced populations.
- The results were less beneficial for the real WGS data, but also showed a strong decrease of the slope towards one.
- From the imputed in silico arrays, we could additionally observe a rapid closing of the gap between the stronger overestimation of heterozygosity within discovery populations and the less severe overestimation in non-discovery populations.
- An interesting side observation was that F_ST did not show any ascertainment bias on the real array data (Figure S10) when calculated in the form of summing the numerator across SNPs and dividing by the sum of the denominator as calculated in this study.
- are more representative for discovery populations than non-discovery populations, relatively more of the genetic variability in discovery populations is explained by the array and imputation is more accurate on average.
- Imputation was able to mitigate this SNP ascertainment bias in our samples for all studied estimators (H_E, H_O, F_ST, D), measured as correlation, average relative difference and slope of the regression line when comparing the biased estimators to the corresponding gold standard.
- Development of the per-animal imputation accuracy with an increasing number of reference animals per population.
- The lines show the trend of the median.
- Development of correlations within population group (r), slope and mean overestimation of the regression.
- Development of correlation within population group (r), slope and mean overestimation of the regression lines for Nei's Distance (D) and F_ST estimates and different reference panel strategies.
- A – HE estimated from array positions of the sequencing data vs.
- The plot therefore shows the magnitude of the bias introduced by pooled sequencing and the corresponding effect of the correction factor.
- The color again shows the values before and after implementing the correction of the pooled sequence estimates.
- C – H_E estimated from array positions of the sequencing data vs. H_E estimated from all positions of the sequencing data.
- Effect of pooled sequencing on the expression of the ascertainment bias in Nei's standard genetic distance (D).
- The biased D was either estimated directly from the array genotypes (D.arr, pooled bias + ascertainment bias) or from the array positions of the sequencing data (D.arr.seq, pure ascertainment bias), while the estimates from the complete sequence were assumed to be the true estimates.
- Effect of pooled sequencing on the expression of the ascertainment bias in Wright's fixation index (F_ST).
- The biased F_ST was either estimated directly from the array genotypes (FST.arr, pooled bias + ascertainment bias) or from the array positions of the sequencing data (FST.arr.seq, pure ascertainment bias), while the estimates from the complete sequence were assumed to be the true estimates.
- was not part of the reference set at all (only possible for commercial samples with multiple individual sequences per population and not for the scenario 158_all).
- Colour further indicates whether the test sample of the corresponding validation run was a commercial or a non-commercial chicken.
- CR and TP substantially contributed to design of the study, interpretation of the data and revision of the manuscript.
- SW substantially contributed to acquisition and interpretation of the data and revised the manuscript.
- AW substantially contributed to acquisition and curation of the data.
- HS substantially contributed to conception and design of the study, interpretation of the data and revision of the manuscript.
- The sampling, genotyping and sequencing of the chicken populations involved funding by the EC project AVIANDIV (EC Contract No..
- Further, the publication fees were covered by the Open Access Publication Funds of the University of Goettingen.
- www.aviandiv.fli.de) and later extended by samples of the project SYNBREED (FKZ 0315528E.
- Blood sampling was done in strict accordance with the German animal welfare regulations, with written consent of the animal owners, and was approved by the persons responsible for ethics at the Friedrich-Loeffler-Institut at the respective times.
- https://doi
- https://doi.org/10.1186/s z
- https://doi.org/10.1007/s
- https://doi.org/10.1126/
- https://doi.org/10.1371/journal
- https://doi.org
- https://doi.org/10.1534/genetics
- https://doi.org/10.1038/s y
- Whole-genome Resequencing of red Junglefowl and Indigenous Village chicken reveal new insights on the genome dynamics of the species.
- https://doi.org/10.3389/fgene
- https://doi.org/10.1038/nature15393
- https://doi.org/10.1038/nrg2796
- https://doi.org/10.1093/genetics
- https://doi.org/10.1038/ng2088
- https://doi.org/10.1534/g
- https://doi.org/10.1186/s y
- https://doi.org/10.1038/ncomms9658
- https://doi.org/10.1186/s
- https://doi.org/10.1017/S
- https://doi.org/10.1186/s x
- https://doi.org/10.1534/g
- https://doi.org/10.1186/s
- https://doi.org/10.1038/nrg3803
- https://doi.org bi1110s43
- A high-density SNP-based linkage map of the chicken genome reveals sequence features correlated with recombination rate.
- https://doi.org/10.1038/srep10442
- https://doi.org/10.1038/s
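
The summary describes in silico arrays built by keeping only SNPs above a MAF threshold, and reports that such ascertained sets overestimate H_E. The following is a minimal sketch of why that happens, not the authors' pipeline: it works on hypothetical population allele frequencies in a single population, and the threshold value and the Beta-distributed frequency spectrum are assumptions made purely for illustration.

```python
import random


def expected_heterozygosity(freqs):
    """Mean expected heterozygosity, 2p(1-p), over biallelic loci."""
    return sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)


def ascertain_by_maf(freqs, maf_threshold):
    """Keep only loci whose minor allele frequency exceeds the threshold."""
    return [p for p in freqs if min(p, 1.0 - p) > maf_threshold]


random.seed(1)
# Hypothetical U-shaped frequency spectrum: most variants are rare.
true_freqs = [random.betavariate(0.3, 0.3) for _ in range(10_000)]
MAF_THRESHOLD = 0.1  # illustrative value only, not taken from the source

array_freqs = ascertain_by_maf(true_freqs, MAF_THRESHOLD)
he_true = expected_heterozygosity(true_freqs)
he_array = expected_heterozygosity(array_freqs)
print(f"H_E all loci: {he_true:.3f}, H_E ascertained loci: {he_array:.3f}")
```

Because every discarded locus has a lower 2p(1-p) than every retained one, the mean over the ascertained loci can only move upwards, which matches the direction of the bias described in the text.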
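
The text defines x_il and y_il as the frequency of the i-th allele at the l-th locus in populations X and Y for the distance D_xy. Assuming this refers to Nei's standard genetic distance in its usual form, D = -ln(J_xy / sqrt(J_x * J_y)), with J_xy the sum of x_il * y_il over all alleles and loci, a small sketch could look as follows. The function name and the nested-list input layout are choices made for this example.

```python
import math


def nei_standard_distance(x, y):
    """Nei's standard genetic distance between two populations.

    x and y are lists of loci; each locus is a list of allele
    frequencies (same allele order in both populations).
    """
    j_xy = sum(xi * yi for lx, ly in zip(x, y) for xi, yi in zip(lx, ly))
    j_x = sum(xi * xi for lx in x for xi in lx)
    j_y = sum(yi * yi for ly in y for yi in ly)
    # Undefined (log of zero) if the populations share no allele at any locus.
    return -math.log(j_xy / math.sqrt(j_x * j_y))


# Hypothetical frequencies at two biallelic loci.
pop_x = [[0.5, 0.5], [0.9, 0.1]]
pop_y = [[0.1, 0.9], [0.9, 0.1]]
print(nei_standard_distance(pop_x, pop_y))
```

The common factor 1/L (number of loci) cancels in the ratio, so the sums are left unnormalised here.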
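
The text uses the deviation of a regression slope from one, together with the correlation between biased and true estimates, as indicators of bias and random estimation error. A minimal sketch of both quantities is given below. It assumes the biased estimates are regressed on the true ones (the direction is not recoverable from this excerpt), and the example values are invented.

```python
import math


def slope_and_correlation(true_vals, biased_vals):
    """Least-squares slope of biased on true values, and Pearson's r."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_b = sum(biased_vals) / n
    s_tt = sum((t - mean_t) ** 2 for t in true_vals)
    s_bb = sum((b - mean_b) ** 2 for b in biased_vals)
    s_tb = sum((t - mean_t) * (b - mean_b)
               for t, b in zip(true_vals, biased_vals))
    slope = s_tb / s_tt
    r = s_tb / math.sqrt(s_tt * s_bb)
    return slope, r


# Hypothetical per-population heterozygosities: biased = 1.5 * true + 0.05.
true_he = [0.10, 0.20, 0.30, 0.40]
biased_he = [0.20, 0.35, 0.50, 0.65]
slope, r = slope_and_correlation(true_he, biased_he)
print(f"slope = {slope:.2f}, r = {r:.2f}")
```

In this toy case the correlation is perfect while the slope is far from one, which illustrates why the two indicators are reported separately: a high correlation alone does not imply an unbiased estimator.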
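
The text notes that F_ST was calculated by summing the numerator across SNPs and dividing by the sum of the denominator. The sketch below contrasts this "ratio of sums" with a per-SNP "mean of ratios". It assumes a simple Nei-type estimator for two equally weighted populations (numerator H_T - H_S, denominator H_T), which is not necessarily the estimator used in the source; the frequencies are invented.

```python
def fst_components(p1, p2):
    """Per-SNP numerator (H_T - H_S) and denominator (H_T) for two populations."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return h_t - h_s, h_t


def fst_ratio_of_sums(freqs1, freqs2):
    """Sum numerators and denominators over SNPs, then divide once."""
    parts = [fst_components(a, b) for a, b in zip(freqs1, freqs2)]
    return sum(n for n, _ in parts) / sum(d for _, d in parts)


def fst_mean_of_ratios(freqs1, freqs2):
    """Average the per-SNP ratios, skipping SNPs with zero denominator."""
    parts = [fst_components(a, b) for a, b in zip(freqs1, freqs2)]
    ratios = [n / d for n, d in parts if d > 0]
    return sum(ratios) / len(ratios)


# Hypothetical frequencies: one common SNP, one rare SNP.
pop1 = [0.2, 0.0]
pop2 = [0.8, 0.1]
print(fst_ratio_of_sums(pop1, pop2), fst_mean_of_ratios(pop1, pop2))
```

In the ratio of sums, SNPs with little total variation contribute little to either sum, whereas in the mean of ratios every SNP counts equally. This is one plausible reason why the two forms can react differently to a shifted frequency spectrum.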
