
Investigating the accuracy of imputing autosomal variants in Nellore cattle using the ARS-UCD1.2 assembly of the bovine genome


Summary

- Background: Imputation accuracy depends, among other things, on the size of the reference panel, the marker's minor allele frequency (MAF), and the correct placement of single nucleotide polymorphisms (SNPs) on the reference genome assembly.
- Further, we compared the reliability of the model-based imputation quality score (Rsq) from Minimac3 to the empirical imputation accuracy.
- When the size of the reference panel increased from 250 to 2000, R²dose increased from 0.845 to 0.917, and the number of polymorphic markers in the imputed data set increased from 586,701 to 618,660.
- Imputation accuracy increased from 0.903 to 0.913, and the marker density in the imputed data increased from 593,239 to 595,570 when haplotypes were inferred in 500 and 2900 target animals.
- However, both metrics were positively correlated and the correlation increased with the size of the reference panel and MAF of imputed variants.
- The use of large reference and target panels improves the accuracy of the imputed genotypes and provides genotypes for more markers segregating at low frequency for downstream genomic analyses.
- The model-based imputation quality score from Minimac3 (Rsq) can be used to detect poorly imputed variants, but its reliability depends on the size of the reference panel and MAF of the imputed variants.
- Imputation accuracy depends on several factors including the size and composition of the reference panel [3, 7], the minor allele frequency (MAF) of imputed variants [8], and correct placement of variants on the reference genome assembly.
- The effect of the target panel size, on the other hand, has been barely studied.
- These tools are probably more sensitive to the size of the target panel because the phasing accuracy depends on the number and the relatedness of the samples considered [15–17].
- Recently, a highly contiguous version (ARS-UCD1.2) of the bovine reference sequence, assembled using long sequencing reads, filled gaps and resolved repetitive regions of the previous UMD3.1 assembly [21].
- These quality scores are correlated with true empirical estimates [22]; however, their reliability with respect to the size of the reference panel and MAF of the imputed variants has not been tested extensively.
- First, we investigate if the improvements in the new bovine genome assembly (ARS-UCD1.2) affect the imputation quality.
- In addition to the commonly used imputation quality metrics, we study the marker density in the imputed data set, i.e., the number of markers segregating in the target panel after imputation.
- Comparison of the accuracy of imputation between the UMD3.1 and ARS-UCD1.2 genome assemblies.
- Following quality control (QC) on the genotype data, we determined the position of 684,561 and 683,590 autosomal markers of the Illumina BovineHD BeadChip according to the UMD3.1 and ARS-UCD1.2 bovine genome assemblies, respectively.
- The genotypes of 1938 randomly selected animals (target) were reduced to the markers of the Illumina BovineSNP50 genotyping array to mimic low SNP density.
- Only a few SNPs showed large differences in the accuracy of imputation between the two assemblies (Fig.
- 762 of these passed our QC parameters and segregated in the target panel after imputation.
- Compared to the UMD3.1 assembly, 2874 markers were placed on a different chromosome in the ARS-UCD1.2 assembly.
- Of the 5447 putatively misplaced SNPs on the UMD3.1 assembly, 4769 segregated in a taurine cattle population [7], where using a slightly different approach of these markers were identified as likely misplaced.
- The drop in the accuracy was even more substantial (from 0.89 to 0.14) for.
- We used SNPs mapped to ARS-UCD1.2 to study the effect of reference panel size on the accuracy of imputation and marker density in the imputed data set.
- The correlation-based empirical accuracy metric R²dose does not reflect the realized marker density (number of informative markers) in the imputed data set.
- Therefore, we studied the marker density in the imputed dataset when the genotypes were imputed using different reference panels.
- Larger reference panels captured a greater fraction of genetic variation in the population, allowing the imputation of a larger number of variants (Fig.
- In the smallest (n = 250) and the largest (n = 2000) reference panels, 597,095 and 636,201 markers were polymorphic, respectively, of which 595,609 and 629,025 were also polymorphic in the 1938 target animals.
- Difference in imputation accuracy (R²dose) between the ARS-UCD1.2 and UMD3.1 assemblies for all markers (a), and for markers that were remapped to a different chromosome (b), when genotypes were aligned to the two bovine assemblies.
- When we reduced the marker density of the target animals to the content of the BovineSNP50 BeadChip genotyping array and imputed the missing genotypes using reference panels of increasing size, the number of informative markers increased from 586,701 to 618,660.
- The increase in the imputed marker density with larger reference panels was mainly.
- The additional markers available in the imputed data when genotypes were inferred from the largest reference panels (compared to the smallest reference panel) were imputed with a mean and median R²dose of 0.674 and 0.740, with 42% of the markers being imputed with R²dose >.
- Size of the reference panel; Rsq; low-frequency (MAF < 2%) markers segregating in the reference panel that were monomorphic/fixed in the target animals (considering the true genotypes).
- The absolute number of monomorphic markers in the target panel that were polymorphic post-imputation increased (526 vs. 990 for n = 250 and 2000, respectively), but the percentage (35 and 14%, respectively) decreased with the size of the reference panel.
- Interestingly, Minimac3 reported moderately high Rsq values for these markers, increasing with the size of the reference panel and 0.740 using reference panels with and 2000 animals respectively).
- The 50 K genotypes of the base animals were phased together with or 2400 additional animals, and subsequently imputed to 777 K using the same reference panel of 1000 animals.
- Imputation accuracy (R²dose) for markers grouped in MAF bins when imputing with reference panels of different sizes (a).
- 2%) in the full data set (n = 3938) that are polymorphic in the reference, target and imputed data, when imputing with reference panels of varying sizes (b).
- a Reference panel size.
- b Number of markers that were imputed as polymorphic but were monomorphic in the target animals (considering the true genotypes).
- In the 1000 reference animals, 625,538 markers were polymorphic of which 607,560 were also polymorphic in the 500 base animals.
- When we combined the base animals with different numbers of additional animals and 2400) and imputed the missing genotypes, the number of markers that were polymorphic in the imputed data increased with the number of additional animals used.
- For instance, the number of polymorphic markers in the imputed data increased by 2331 markers (from 593,239 to 595,570) when 2400 additional animals were considered to infer the haplotypes of the 500 base animals.
- Of 17,978 rare markers that had MAF less than 2% in the reference panel, 14.6% were monomorphic in the true genotypes of the target panel.
- markers were polymorphic in the imputed dataset likely due to erroneously imputed genotypes.
- The fraction of fixed markers in the target data that turned polymorphic after imputation increased with the number of additional animals used for phase inference in target animals (9.5 and 10.5% using 0 and 2400 additional animals for haplotype inference, respectively) (Table 4).
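The marker-density counts discussed in the bullets above can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0/1/2 allele counts and that a marker counts as polymorphic when both alleles are observed in the animals considered; the function names and the dictionary layout are hypothetical and not taken from the article.

```python
def is_polymorphic(genotypes):
    """True if both alleles are observed (genotypes coded as 0/1/2 allele counts)."""
    alt_alleles = sum(genotypes)
    return 0 < alt_alleles < 2 * len(genotypes)


def marker_density_summary(true_gt, imputed_gt):
    """Compare true and imputed (best-guess) genotypes of the same target animals.

    Both arguments map a marker ID to a list of 0/1/2 genotypes.
    Only markers present in both dictionaries are considered.
    """
    shared = set(true_gt) & set(imputed_gt)
    poly_true = {m for m in shared if is_polymorphic(true_gt[m])}
    poly_imputed = {m for m in shared if is_polymorphic(imputed_gt[m])}
    fixed_true = shared - poly_true
    return {
        # realized marker density after imputation
        "polymorphic_after_imputation": len(poly_imputed),
        # fixed in the true genotypes but segregating after imputation
        "fixed_in_truth_but_polymorphic_imputed": len(fixed_true & poly_imputed),
        # segregating in the true genotypes but imputed to a single allele
        "segregating_in_truth_but_fixed_imputed": len(poly_true - poly_imputed),
    }
```

The second count corresponds to markers that are monomorphic in the true genotypes but turn polymorphic after imputation; the third corresponds to real variation that is lost in the imputed data.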
- Reliability of the model-based quality score from Minimac3.
- Irrespective of the size of the reference panel and MAF of the imputed variants, the Rsq values were higher than R²dose and R²gt but lower than CR values (Table 1, Additional file 5).
- Imputation accuracy (R²dose) for markers grouped in MAF bins when genotypes were imputed into the haplotypes of 500 target samples that were inferred together with additional samples (a).
- 2%) in the full data (n = 3938) that are polymorphic in the reference (n = 1000), base target (n = 500), and imputed data (base), when genotypes were imputed into haplotypes of 500 base animals inferred with different numbers of additional samples (b).
- when either the MAF of the imputed variants or the size of the reference panels increased (Additional file 6, Table 1).
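One way to examine how well Rsq tracks the empirical accuracy across allele frequencies is to correlate the two scores within MAF bins. The sketch below is only an illustration: the bin edges (apart from the 2% low-frequency threshold mentioned in the summary), the input layout, and the function names are assumptions, and the article's exact procedure is not given here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; None if undefined (fewer than 2 points or zero variance)."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def rsq_reliability_by_maf(markers, bin_edges=(0.0, 0.02, 0.05, 0.1, 0.5)):
    """Correlate Rsq with empirical R2 within MAF bins.

    markers: iterable of (maf, rsq, r2_dose) tuples, one per imputed marker.
    bin_edges: hypothetical bin boundaries; bins are (lower, upper],
               so markers with MAF = 0 are left out.
    """
    bins = {}
    for maf, rsq, r2_dose in markers:
        for lower, upper in zip(bin_edges, bin_edges[1:]):
            if lower < maf <= upper:
                bins.setdefault((lower, upper), []).append((rsq, r2_dose))
                break
    return {
        b: pearson([v[0] for v in vals], [v[1] for v in vals])
        for b, vals in bins.items()
    }
```

Running the function separately on results from each reference panel size would show whether the correlation rises with panel size as well as with MAF.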
- In Bos taurus indicus cattle, the accuracy of imputing 50 K genotypes to higher density has been investigated only using the UMD3.1 assembly of the bovine genome [23, 24].
- To the best of our knowledge, our study is the first to evaluate the accuracy of imputing 50 K to 777 K genotypes in a Bos taurus indicus cattle breed using the ARS-UCD1.2 assembly of the bovine genome [21].
- Higher contiguity of the ARS-UCD1.2 genome assembly might improve haplotype inference in regions that contained phasing and imputation errors in the previous.
- However, we also identified SNPs that had considerably higher imputation accuracy when aligned to UMD3.1 than to ARS-UCD1.2, indicating that some physical ARS-UCD1.2 coordinates of the 777 K markers are wrong.
- These SNPs suffer from ascertainment bias as they are predominantly located in more accessible regions of the genome [25].
- Using map positions of the UMD3.1 assembly and the software FImpute [4], these studies assessed the correlation between true and imputed (best-guess) genotypes (Rgt).
- Using a pre-phasing-based approach that does not explicitly consider pedigree information, we obtained a lower mean
- Table 4 Number of polymorphic and monomorphic markers in n = 1000 reference and n = 500 target animals when different numbers of additional animals were included for phase inference in the 500 target animals.
- b Number of markers that were imputed as polymorphic but were monomorphic in the base target (considering the true genotypes).
- The slightly lower accuracy obtained in our study likely resulted from including more low-frequency markers, which are difficult to impute, rather than from differences in the imputation software used.
- Applying the same MAF filter used in the previous studies (MAF >.
- In our study, the mean accuracy of imputation (R²dose) increased by 8.52% when the size of the reference panel increased from 250 to 2000.
- Our findings also show that the size of the reference panel is correlated with the number of informative (i.e., polymorphic) markers in the imputed data set.
- Our approach allowed for an indirect investigation of the effect of sample size on phasing accuracy by studying the accuracy of imputing genotypes into the inferred haplotypes.
- We show that the size of the target panel also influences pre-phasing-based imputation quality.
- Both imputation accuracy and marker density in the target animals increased when their haplotypes were inferred in larger cohorts.
- If the size of the target panel increases, it seems advisable to infer target haplotypes and impute genotypes in the pooled data again instead of imputing only the new data.
- The correlation between empirical and model-based quality scores increased with the size of the reference panel.
- However, Minimac3 reported moderately high Rsq values (increasing with the size of the reference panel used.
- monomorphic in the target panel, considering the true genotypes, but polymorphic in the imputed data).
- The improvement in the new assembly affected the imputation of only a small fraction of SNPs present on the Bovine HD SNP chip.
- Accuracy of imputation and the number of informative markers in the imputed data benefit from large reference panels.
- The size of the target panel also has an influence on both metrics when genotypes are inferred using pre-phasing-based approaches.
- The map positions of the SNPs were determined according to the Bos taurus taurus reference genome assembly UMD3.1 [38].
- After filtering, 3938 animals with genotypes for 684,561 autosomal SNPs remained in the dataset.
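The gains quoted in the summary can be checked with simple arithmetic. The short Python sketch below uses only figures stated above; it assumes the 8.52% increase is relative to the R²dose obtained with the smallest reference panel, and the helper name is hypothetical.

```python
def relative_gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Mean R2dose with reference panels of 250 and 2000 animals
r2_gain = relative_gain_percent(0.845, 0.917)

# Polymorphic markers in the imputed data, reference panel 250 vs. 2000
reference_marker_gain = 618_660 - 586_701

# Polymorphic markers in the imputed data, 0 vs. 2400 additional target animals
target_marker_gain = 595_570 - 593_239

print(round(r2_gain, 2), reference_marker_gain, target_marker_gain)
# prints: 8.52 31959 2331
```

Under that assumption the reported accuracies reproduce the 8.52% figure, and the marker counts show that enlarging the reference panel adds far more informative markers than enlarging the target panel.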
- Liftover to ARS-UCD1.2.
- UMC_bovine_coordinates/) to determine the physical coordinates of the Illumina BovineHD BeadChip markers according to the ARS-UCD1.2 genome assembly.
- The marker density in the target panels was reduced to match the BovineSNP50 (version 2) BeadChip comprising 56,206 SNPs.
- Subsequently, the genotypes of the target panels were imputed to higher density using information from the reference panel using a pre-phasing-based imputation workflow.
- R²dose values were used to assess imputation quality in different imputation scenarios and all three empirical accuracies were used to investigate the reliability of the model-based quality score estimate from Minimac3 (Rsq).
- To study the effect of reference and target panel composition on the realized marker density in the imputed dataset, we also calculated the proportion of markers that are imputed to the major allele in all target samples but were segregating in the real data set.
- To study the effect of the reference panel, we imputed a target panel of 1938 animals from 50 K to 777 K using reference panels with and 2000 randomly sampled animals in 10 replicates.
- The imputation was carried out in 10-fold cross-validation, and the accuracy of imputation was assessed only for the 500 animals in the base panel.
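The three empirical accuracies named in the methods bullets above can be illustrated with a minimal Python sketch for a single marker. It assumes that R²dose is the squared Pearson correlation between true genotypes and imputed allele dosages, that R²gt is the same correlation computed with best-guess genotypes, and that CR is the fraction of best-guess genotypes that match the truth; the summary does not spell out these definitions, and the function names are hypothetical. Best-guess genotypes are approximated here by rounding dosages, which is a simplification.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; None if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def empirical_accuracy(true_gt, dosage):
    """Per-marker accuracy metrics across target animals.

    true_gt: true genotypes coded as 0/1/2 allele counts.
    dosage:  imputed expected allele counts in [0, 2].
    Returns (r2_dose, r2_gt, cr).
    """
    # Assumption: best-guess genotype = dosage rounded to the nearest integer
    # (Python's round() sends exact .5 ties to the even integer).
    best_guess = [min(2, max(0, round(d))) for d in dosage]
    r2_dose = pearson_r2(true_gt, dosage)
    r2_gt = pearson_r2(true_gt, best_guess)
    cr = sum(t == g for t, g in zip(true_gt, best_guess)) / len(true_gt)
    return r2_dose, r2_gt, cr
```

The correlations are undefined for markers that are fixed in either the true or the imputed data, which is one reason a correlation-based metric alone does not capture the marker density in the imputed data set.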
- Comparison between UMD3.1 and ARS-UCD1.2 genome assemblies.
- List of misplaced SNPs on the ARS-UCD1.2 reference genome assembly.
- All authors have agreed on the content of the manuscript.
- The funding body did not have any role in the study design, data collection, analysis and interpretation, the writing of the manuscript, or any influence on the content of the manuscript.
- HP is a member of the editorial board of BMC Genomics.
- Imputation of high-density genotypes in the Fleckvieh cattle population.
- Revealing misassembled segments in the bovine reference genome by high resolution linkage disequilibrium scan.
- De novo assembly of the cattle reference genome with single-molecule sequencing.
- Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle.
- A whole-genome assembly of the domestic cow, Bos taurus
