
Evaluation of whole exome sequencing as an alternative to BeadChip and whole genome sequencing in human population genetic analysis


Summary

- The largest publicly accessible set of human genomic sequence data available today originates from exome sequencing, which covers around 1.2% of the whole genome (approximately 30 million base pairs).
- Results: To compare the effect of SNP selection strategies in population genetic analysis without bias, we subsampled the variants of the same highly curated 1 K Genome dataset to mimic genome, exome sequencing and array data, eliminating the effect of their different chemistry and error profiles.
- Next we compared the application of the exome dataset to the array-based dataset and to the gold standard whole genome dataset using the same population genetic analysis methods.
- Conclusions: Our results draw attention to some of the inherent problems that arise from using pre-selected SNP sets for population genetic analysis.
- The investigation of the ethnogenesis of human populations is made possible by population genetic studies, through comparing genetic makeup and frequencies of the selected variants or alleles, and also by computing their genetic distance from the rest of the studied population or their level of admixture [1, 2].
- 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0).
- Ultimately, application of the Whole Exome Sequencing (WES) method has spread and gained popularity, as WES is cost effective for routine genetic diagnosis of rare inherited diseases, and extensive databases have been generated containing thousands of publicly accessible exomes (Exome Aggregation Consortium ~ 61000 exomes [5], Exome Variant Server ~ 6500 exomes [6]).
- Since exome data by definition contain a high proportion of the functional variants that are under selection pressure, in this study we explored whether this could lead to any bias in population genetic analysis.
- In order to compare the practicality of each strategy (the genome data as unbiased standard, the commonly used array data, and the potentially usable exome data), we generated three subsets of the same publicly available experimental data (HGP ~ 2,500 unrelated genomes), accordingly (GENOME, EXOME and BEADCHIP datasets).
- FST is the proportion of the total genetic variance contained in a subpopulation (S) relative to the total genetic variance (T).
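
As a rough illustration of the fixation index, the sketch below computes Wright's heterozygosity-based form, (H_T - H_S) / H_T, for a single biallelic SNP. This is an assumption made for illustration only: the text does not state which estimator the authors used, and the function names `expected_heterozygosity` and `wright_fst` are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def wright_fst(freqs, sizes=None):
    """Fixation index (H_T - H_S) / H_T for one biallelic SNP.

    freqs: frequency of the same allele in each subpopulation.
    sizes: optional relative subpopulation weights (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]

    # Heterozygosity of the pooled (total) population.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = expected_heterozygosity(p_bar)

    # Weighted mean heterozygosity within subpopulations.
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, freqs))

    if h_t == 0.0:
        return 0.0  # monomorphic SNP: no variance to partition
    return (h_t - h_s) / h_t
```

Two subpopulations with identical allele frequencies give 0, and two subpopulations fixed for opposite alleles give 1. Genome-wide estimates are usually obtained by combining many SNPs, which this single-SNP sketch does not do.
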
- The f3-statistic is used for testing the relationship of three populations [16]; it allows the detection of admixture in a population C from two other populations, A and B.
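
The idea behind the three-population test can be sketched as the average, over SNPs, of (c - a)(c - b), where a, b and c are the allele frequencies in populations A, B and C. The function below is a naive illustration under that assumption; the name `f3` and its input layout are hypothetical, and it omits the finite-sample correction and the standard-error estimation that dedicated tools apply.

```python
def f3(freqs_c, freqs_a, freqs_b):
    """Naive f3(C; A, B): mean over SNPs of (c - a) * (c - b).

    Each argument is a list of allele frequencies, one per SNP, in the same
    SNP order and for the same reference allele.
    """
    if not (len(freqs_c) == len(freqs_a) == len(freqs_b)):
        raise ValueError("all populations must cover the same SNPs")
    if not freqs_c:
        raise ValueError("at least one SNP is required")

    products = [(c - a) * (c - b) for c, a, b in zip(freqs_c, freqs_a, freqs_b)]
    return sum(products) / len(products)
```

A clearly negative value suggests that C carries ancestry from populations related to both A and B, because its allele frequencies tend to lie between theirs.
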
- Because of the complex genetic ancestry of human populations, there usually exists more than one possible model for any studied case.
- Visualization of admixture components offers an insight into the genetic structure of the studied populations.
- For each dataset, we calculated the pairwise FST value between each studied population and compared the results of the different datasets (Additional file 2: Table S3).
- However, in the EXOME dataset we found very small but systematic differences between the FST values of African (except in the LWK African population) and European populations and the African and East Asian populations.
- We observed that FST values calculated on the basis of the BEADCHIP dataset were systematically overestimated between populations originating from different super-populations.
- The relative positions of the super-populations are almost the same.
- The detailed PCA of the populations for each super-population showed that the indicated eigenvectors and values were very similar in all three datasets (Fig.
- The greatest difference between the three datasets was seen in the AFR super-population.
- The differences between the overlap of the historically known admixed ASW and ACB African populations and their relations to the other African populations indicated slightly different population affiliations in the BEADCHIP dataset compared to the GENOME and EXOME datasets.
- In order to compare the usability of the three datasets we calculated the f3-statistics for all possible combinations of population triads and plotted the resulting F3.
- The absolute values were significantly higher in the BEADCHIP dataset compared to the GENOME dataset, while the EXOME dataset was more similar to the GENOME dataset.
- Since the absolute CV error values were significantly higher in the BEADCHIP dataset compared to the GENOME and EXOME datasets, we displayed the relative CV values compared to the worst-fitting model (K = 3), which resulted in the highest CV error for each dataset (Additional file 1: Figure S2).
- The best fitting model of admixture indicated by the minimum of cross validation error was K = 6 in the GENOME, EXOME and HUN-EXOME datasets; however, the CV error of the BEADCHIP dataset indicated the K = 9 model as the best fitting model of admixture.
- The curve of the CV errors of the BEADCHIP dataset shows that this dataset resulted in very similar alternative models (ranging from K = 7 to 10) with almost identical CV errors.
- Since the analysis of the different datasets suggested different best fitting admixture models, we visualized both models (K = 6 and K = 9) for each dataset.
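
Choosing K by the minimum cross-validation error, and noticing when several K values are nearly tied, can be sketched as follows. The helper name `best_k`, the `tolerance` parameter and the input layout (a mapping from K to CV error) are assumptions for illustration, not part of the authors' pipeline.

```python
def best_k(cv_errors, tolerance=0.0):
    """Pick the K with the lowest cross-validation error.

    cv_errors: dict mapping K (int) to its CV error (float).
    tolerance: absolute CV-error margin within which other K values are
               reported as near-equivalent alternatives.
    Returns (best K, sorted list of every K within tolerance of the minimum).
    """
    if not cv_errors:
        raise ValueError("cv_errors must not be empty")

    # Ties on the error are broken in favour of the smaller K.
    k_best = min(cv_errors, key=lambda k: (cv_errors[k], k))
    floor = cv_errors[k_best]
    near = sorted(k for k, err in cv_errors.items() if err - floor <= tolerance)
    return k_best, near
```

A long list of near-equivalent K values is a sign that the minimum alone is weakly informative, which is the situation described for the BEADCHIP dataset.
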
- The analysis of the best fitting admixture model (K = 6) suggested by both the GENOME and EXOME datasets produced very similar admixture results (Additional file 1:.
- The analysis of the best fitting admixture model (K = 9) suggested by the BEADCHIP dataset produced very similar admixture results for each dataset (Additional file 1:.
- The EXOME dataset was again most similar to the GENOME dataset, and again, the admixture of the BEADCHIP dataset systematically overestimated the minor admix components compared to the other two datasets (some examples are highlighted by arrows in Additional file 1: Figure S4).
- Overrepresentation of AIMs in the BEADCHIP dataset: Our previous population genetic analyses suggested that the variant composition of the BEADCHIP dataset is different from the GENOME and EXOME datasets.
- frequencies (MAF) of the analyzed populations for each SNP and visualized it as a density plot (Fig.
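
One way to quantify how much a SNP's minor allele frequency differs between populations is the per-SNP variance across populations, sketched below. The exact measure of variation is not given in the text, so the use of the population variance, the function name `maf_variation` and the input layout are all assumptions.

```python
from statistics import pvariance


def maf_variation(freqs_by_pop):
    """Per-SNP variance of allele frequency across populations.

    freqs_by_pop: one list of allele frequencies per population, all in the
                  same SNP order and for the same allele.
    The variance is unchanged when every frequency p is replaced by 1 - p, so
    it does not matter whether the tracked allele is the globally minor one.
    """
    # zip(*...) regroups the data from per-population lists to per-SNP tuples.
    return [pvariance(snp) for snp in zip(*freqs_by_pop)]
```

SNPs with large values differ strongly between populations, the behaviour expected of ancestry informative markers; the resulting list is what a density plot would summarise.
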
- Analysis of the HUN-EXOME dataset.
- The PCA of the HUN-EXOME dataset (Additional file 1:.
- Since the analysis of the EXOME dataset was comparable to the other datasets, we also performed the admixture analysis of the HUN-EXOME dataset for both the K = 6 and K = 9 models.
- In the dataset of analyzed populations considering the K = 9 admixture model (Additional file 1: Figure S7B), Europeans display a North-South gradient by the indicated two European specific admix components (represented by the Finnish and Italian populations).
- In this study, we compared the use of these SNP-selection strategies for population genetic analysis, as each method differs in terms of the ratio in which it contains variants under natural selection.
- Our GENOME dataset mainly contains non-exonic variants, since more than 98% of the human genome consists of non-exonic regions and our coordinate-based selection was random.
- although only a portion of the exonic variants are functional.
- In order to make the various approaches comparable, we used the same curated HGP 1 kG genomic variant data to select the subsets of the GENOME, EXOME and BEADCHIP datasets (see in detail in the Methods section).
- Fig. 4 Density plot of the variation of the minor allele frequencies (MAF) of SNPs between the analyzed populations in the different datasets.
- 12%) remained after LD pruning, while in the GENOME dataset the ratio was ~ 43.
- We suppose that these differences are attributable to the tightly linked pre-selected AIMs in the BEADCHIP dataset.
- This is assumed to be a consequence of the organization of the human genome, where genes and exons are not homogeneously dispersed, but rather tend to be packed tightly in functionally active euchromatic chromosome regions.
- 38%) remained in the EXOME dataset out of the ~ 405 k unpruned variants.
- Comparing the PCA results of the EXOME dataset with the gold standard GENOME dataset, we refined the LD pruning parameters of exome data.
- while keeping the original 0.1 squared correlation threshold, in order to counter the effect of the packed exome variant composition and to eliminate the tightly linked markers.
- According to our results, this modification did not significantly alter the variants of the BEADCHIP dataset.
- The fixation index, a measure of population differentiation due to genetic structure, is one of the most commonly used statistics in population genetics.
- 1–3%) is still only a portion of what was observed in the BEADCHIP dataset.
- The comparison of FST values of the BEADCHIP dataset to the GENOME dataset revealed that the pairwise FST distances between populations of different super-populations were systematically larger, and that the extent of the difference appears to be correlated with the phylogenetic distance.
- We assume that this is due to a general overrepresentation of differentiating SNPs (AIMs) and imbalances in the selection and proportion of the marker composition in the pre-selected BEADCHIP dataset.
- This hypothesis is also supported by the observed FST values of the admixed ASW population.
- Accordingly, the BEADCHIP data places this population closer to its admix sources - the European (EUR) and Admixed American (AMR) super-populations - than the GENOME FST values do, indicating that the overrepresentation of AIMs distorts the true population distances.
- We observed that for the first two eigenvectors of the whole dataset (Fig. 1a-c) the eigenvalues were greater in the BEADCHIP dataset compared to the GENOME dataset, while the EXOME datasets were more similar to the GENOME dataset.
- An overestimation of the eigenvalues miscalculates the population growth rate.
- The largest differences between the detailed PCA were seen in the African super-population (Fig.
- We assume that again, the overrepresentation of AIMs in the BEADCHIP dataset is responsible for the increased eigenvalues and the different relations of the admixed ASW and ACB populations.
- Comparison of the three datasets showed that the GENOME and the EXOME data gave almost identical F3 values (r = 0.9998 Fig.
- On the other hand, the f3-statistics of the BEADCHIP dataset showed less correlation (r = 0.9911 Fig.
- The Z scores of the f3-statistics in the BEADCHIP dataset were also higher, even though the SNP count was about one third of that of the GENOME dataset, which indicates higher deviation from the mean.
- We assume that this bias is due to the overrepresentation of AIMs between the East Asian and African populations within the pre-selected variants of the BEADCHIP dataset.
- Nevertheless, the admixture analysis of the different datasets resulted in very similar admixture components for all of the tested models (Additional file 1:.
- The observed overrepresentation of minor admixture components in the BEADCHIP dataset may lead to overestimation of their admix ratios.
- In our comparative analyses, we used the same HGP dataset to make an unbiased assessment of the different strategies of variant selection.
- We also excluded potentially pseudogenic, conservative and repetitive regions where reads could be ambiguously mapped to multiple genomic regions and the proportion of the alternatively aligned sequencing reads may lead to differences depending on the threshold values or the pipeline applied for the analysis.
- Both the PCA and admixture analysis of the modern Hungarian exome dataset confirm that genetically, modern Hungarians are Europeans (Additional file 1:.
- Admixture analysis also detected a portion of a South Asian component in a few individuals of the Hungarian population, which in our view can be attributed to the Gypsies living as an ethnic minority.
- Historically, Gypsy tribes left India in the 9th and 10th centuries as a result of Muslim attacks in areas they inhabited and first appeared in the territory of the medieval Kingdom of Hungary in the 14th and 15th centuries, probably fleeing from the conquering Turks in the Balkans [23].
- First, the pre-selection of SNPs in the array is not necessarily a uniform representation of all of the analyzed populations, and second, the increased proportion of AIMs shifts the balance towards highlighting the differences between populations.
- However, it could also lead to deviations of the fixation index, specific cases of f3-statistics, PCA eigenvalues, suggested admixture model, and admixture rates.
- Based on the usability of the EXOME dataset we would encourage everyone to use and to publicly share WES sequences with the correct indication of ethnic and geographical origin, which could contribute towards a better understanding of the genetic relationships among human populations.
- Since some of the regions may contain repetitive elements or pseudogenic regions with non-unique sequences, we excluded all regions that had any MAPQ0 reads (mapping quality 0, indicating that the read could be mapped to multiple genomic regions).
- As a result, we generated a BED coordinate list (Additional file 1: Data/HighCov_HighQual_EXOME.bed) that contained the high coverage, unique genomic regions of the exome kit.
- HaplotypeCaller module, and --ts_filter_level 99.0 in the variant recalibration (ApplyRecalibration) module.
- We used the publicly available variants from the highly curated VCF files of the Human Genome Project 1 kG phase 3 dataset [30].
- We excluded the geographically inaccurate CEU population and all of the related individuals from our analysis.
- ITU - Indian Telugu in the UK.
- To compare the capabilities of the sequencing-based (WES and WGS) and array-based approaches, we created three SNP datasets (denoted as EXOME, GENOME and BEADCHIP) established upon the following rules:
- First, the EXOME dataset (Additional file 1: Data/EXOME) was prepared by filtering the public 1 kG variants with the bedtools intersect algorithm using the genomic coordinates of the high-coverage high-quality exome BED coordinate list that we established using the Hungarian exome data.
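
The effect of intersecting variants with a BED coordinate list can be illustrated in plain Python. This is a minimal sketch, not the authors' bedtools command: the function names are hypothetical, variants are assumed to be (chromosome, 1-based position) pairs as in VCF, and BED intervals are assumed to be 0-based and half-open.

```python
from bisect import bisect_right


def build_index(bed_intervals):
    """Group (chrom, start, end) BED intervals by chromosome, sorted and merged.

    Returns a dict: chrom -> (list of starts, list of ends).
    """
    by_chrom = {}
    for chrom, start, end in bed_intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or touching interval: extend the previous one.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_regions(index, chrom, pos):
    """True if the 1-based position pos lies inside any interval on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    zero_based = pos - 1  # convert the VCF coordinate to BED coordinates
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


def filter_variants(variants, bed_intervals):
    """Keep only the (chrom, pos) variants that fall inside the BED regions."""
    index = build_index(bed_intervals)
    return [v for v in variants if in_regions(index, v[0], v[1])]
```

The coordinate conversion is the detail most easily got wrong: a BED interval 100-200 covers VCF positions 101 to 200.
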
- The first and last 5 Mb of the telomeric regions of chromosomes were excluded in order to eliminate uncertain sequences, such as repetitive elements.
- The resulting unbiased genomic region coordinates were found to have a cumulative size (68.7 Mb) similar to the cumulative size of the high-coverage exome regions (68.9 Mb).
- Because of the structure of the human genome, it was expected that exome variants would yield non-homogeneously dispersed markers.
- In order to allow unbiased comparison of the three methods, we used the same LD pruning parameters (10,000 kb sliding window, 0 SNPs increment and r² threshold of 0.1) for each dataset.
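
To show what pruning on a window size and an r² threshold does, the sketch below applies a simple greedy rule: a SNP is kept only if its squared genotype correlation with every already kept SNP inside the window is at most the threshold. This is a simplified stand-in, not the algorithm of any particular pruning tool. It assumes a single chromosome and genotypes coded as 0, 1 or 2 copies of an allele, it ignores the step (increment) parameter, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(positions, genotypes, window_bp=10_000_000, r2_max=0.1):
    """Greedy LD pruning on one chromosome.

    positions: base-pair position of each SNP.
    genotypes: one list of 0/1/2 genotype codes per SNP (same sample order).
    Returns the indices of the SNPs that are kept, in position order.
    """
    kept = []
    for i in sorted(range(len(positions)), key=positions.__getitem__):
        linked = any(
            positions[i] - positions[j] <= window_bp
            and r_squared(genotypes[i], genotypes[j]) > r2_max
            for j in kept
        )
        if not linked:
            kept.append(i)
    return kept
```

The defaults mirror the 10,000 kb window and the 0.1 threshold quoted above. With tightly packed markers, as in exome data, most neighbours fall inside the same window, so a larger share of SNPs is removed.
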
- The F3 tests were carried out by the qp3Pop program of ADMIXTOOLS [8] for each population triad of the analyzed populations.
- F3 and F4 values of the tests were visualized by ggplot2.
- The best fitting model of admixture indicated was K=6 in the GENOME, EXOME and HUN-EXOME datasets; however, it was K=9 in case of the BEADCHIP dataset.
- PCA of the HUN-EXOME dataset confirms that genetically, modern Hungarians belong to the European super-population and within the Europeans, two genetic components exist with a North-South European gradient.
- CHB, YRI) values of the HUN-EXOME dataset indicate that Hungarians have a higher East Asian admixture component than the British but a lower one than the Finnish population.
- The difference between the FST matrix of the GENOME and EXOME, GENOME and BEADCHIP datasets.
- ITU: Indian Telugu in the UK; STU: Sri Lankan Tamil in the UK.
- The highly curated VCF files of the Human Genome Project 1 kG phase 3 dataset are available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/.
- DT, ZB, TK and ZM evaluated the results of the analyses.
- ZM and TK wrote the initial version of the manuscript while ZB, DT and MS contributed to subsequent versions.
- All authors reviewed and approved the final version of the manuscript.