
From reference genomes to population genomics: Comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species


Summary

- For both species, population structure inferences were influenced by the percentage of missing data.
- Conclusions: For studies of non-model species with a reference genome, we recommend combining Stacks output with further filtering (as included in our R pipeline) for population genetic studies, paying particular attention to potential impact of missing data thresholds.
- 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Whereas in the past tens of microsatellites may have been used to infer population structure and answer fundamental and applied questions, now thousands of single nucleotide polymorphisms (SNPs) can be generated and aligned to reference genomes [1, 2].
- One of the initial benefits of RRS approaches was the lack of a need for a reference genome [4].
- As concerns over the current biodiversity crisis deepen, there has been a call for the greater use of genetic and genomic data in the management of species both in captivity and the wild [20, 21].
- The devil has experienced a severe population crash due to the emergence of a contagious cancer, devil facial tumour disease (DFTD), in the 1990s [22, 23].
- To aid conservation of the species, the devil genome was sequenced in 2012 [24].
- which breeds in the Arctic and overwinters in Northern Europe and has a reference genome available [29].
- For the goose, we reanalyse a subset of the data made available by Pujolar et al.
- Unsurprisingly, considering the demographic histories of the two species, the number of SNPs returned for each differed substantially (Table 1), although we acknowledge that the laboratory methods for the two datasets were also different (see [8]).
- For both species, mean multilocus heterozygosity estimates obtained using Stacks and GATK were noticeably lower than for SAMtools (Table 1).
- Genotype ratios (ratios of genotypes called as either of the two.
- The error rate between technical repeats was 12.3% for both SAMtools and GATK.
- Table 1 Summary statistics for the resultant SNP loci datasets of three pipelines, filtered at a 70% call rate (see Additional file 1: Table S1 for data filtered on 30% call rate), for Tasmanian devil (N = 131) and pink-footed goose (N = 40), including total number of loci (total loci), average number of loci sequenced across individuals (mean loci), amount of missing data.
- b Error rates could not be calculated for the pink-footed goose dataset as no replicates were included in the current analysis.
- Table 2 Genotypic differences between loci common to the 3 pipelines for devils (155 loci) and geese (3283 loci).
- Homozygous → Homozygous refers to those calls where, for example, an AA in one pipeline is called a TT in the other pipeline for that sample at the same locus.
- Homozygous → Heterozygous are any genotype calls that are homozygous in the first pipeline but called heterozygous in the other for that sample at the same locus.
- Heterozygous → Homozygous are those calls that are heterozygous in one pipeline but called homozygous in the other for that sample at the same locus.
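The three categories above can be illustrated with a minimal Python sketch. It assumes diploid calls stored as two-character strings (for example "AA" or "AT") with `None` for missing data; the function name, the encoding, and the extra `concordant`, `missing` and `het_to_het` labels are illustrative assumptions and are not taken from the authors' R script.

```python
from collections import Counter


def classify_difference(call_a, call_b):
    """Classify one sample's calls at one locus from two pipelines.

    call_a is the call from the first pipeline, call_b from the other.
    Calls are two-character strings such as "AA" or "AT"; None is missing.
    """
    if call_a is None or call_b is None:
        return "missing"
    # Sort alleles so that "AT" and "TA" count as the same genotype.
    a = "".join(sorted(call_a))
    b = "".join(sorted(call_b))
    if a == b:
        return "concordant"
    a_hom = a[0] == a[1]
    b_hom = b[0] == b[1]
    if a_hom and b_hom:
        return "hom_to_hom"   # e.g. AA called as TT
    if a_hom:
        return "hom_to_het"   # homozygous in the first, heterozygous in the other
    if b_hom:
        return "het_to_hom"   # heterozygous in the first, homozygous in the other
    return "het_to_het"       # two different heterozygous calls (hypothetical extra label)


# Toy data: (first pipeline call, other pipeline call) for five sample/locus pairs.
pairs = [("AA", "TT"), ("AA", "AT"), ("AT", "TT"), ("AT", "TA"), ("AA", None)]
counts = Counter(classify_difference(a, b) for a, b in pairs)
```

Tallying the labels with `Counter` gives per-category totals of the kind reported in Table 2, restricted to loci common to both pipelines being compared.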
- For both species, differentiation visualised using a principal coordinates analysis (PCoA) was clearest with the GATK pipeline, relative to the Stacks and SAMtools pipelines (Fig.
- For devils, we also reanalysed our dataset with the addition of N = 66 captive animals (with a mixture of genetic heritage) and found that these fell intermediate to the two major populations, as expected (Additional file 1:.
- Nevertheless, patterns across the three analysis methods were similar for both species: data processed by all three pipelines provided F_ST values that were similar (Table 3).
- Each analysis method produced a varying amount of missing data (Table 1), but filtering less stringently (30% vs 70% call rate) to allow more missing data (and thus a greater number of loci) did not generally change the qualitative interpretation of our results by PCoA or F_ST.
- With an increasing proliferation of reference genomes, researchers skilled in the use of WGS alignment and assembly software (such as SAMtools [10] and GATK [11]) may prefer to use these tools when expanding their studies to include population-level RRS data.
- Although all of the analytical pipelines we examined were able to detect genetic structure between the two populations of both species, there were differences in the resultant datasets.
- Due to the greater number of SNPs obtained, GATK may perform better for analyses such as genome-wide associations that require a high marker density; however, we note that the computational resources required may be a limiting factor for use of GATK when studying non-model.
- 2 PCoAs of the two datasets after processing through three pipelines with a call rate of 70% and the custom R script as outlined in Fig.
- The tighter clustering of the two devil populations demonstrated by the SAMtools PCoA, and the lower estimates of pairwise F_ST relative to Stacks and GATK, are likely influenced by the greater proportion of heterozygous genotype calls in that dataset.
- Considering compute time and downstream population inferences, Stacks combined with the custom R script was the best performer of the three software packages we tested; it provided results that were independent of the number of loci or percentage of missing data for devils, but was influenced by missing data for geese.
- As shown here, there are differences between the three pipelines observed in the PCoAs and pairwise F_ST comparisons.
- These may result in different recommendations, which may impact the genetic outcomes of the populations in question.
- We used the same parameters for each species for the purpose of comparison and note that our MAF thresholds may not be suitable for both populations given expected levels of diversity and sample sizes.
- The sample sizes were quite different and may have resulted in more alleles being sampled in the devil dataset, which has likely influenced population-level results [35].
- In devils, Pop1 refers to the Western population (N = 47), Pop2 refers to the Eastern population (N = 18), and Pop3 refers to the insurance population (N = 66).
- In geese, Pop1 refers to the Iceland population (N = 20) and Pop2 refers to the Denmark population (N = 20).
- An additional feature designed specifically to make use of the technical replicates performed by DArT PL is the reproducibility filter and error rate calculation, which can be extended to any RRS project where replicates have been used.
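As a rough illustration of an error rate between technical replicates, the sketch below counts the proportion of mismatching genotype calls across loci called in both replicates. The 0/1/2 encoding (count of the alternate allele, `None` for missing) and the function name are assumptions made for this example; the authors' R script may define the error rate differently.

```python
def replicate_error_rate(rep_a, rep_b):
    """Proportion of discordant calls between two technical replicates.

    rep_a and rep_b are equal-length lists of genotype calls for the same
    loci, encoded 0/1/2 (alternate allele count) with None for missing.
    Only loci called in both replicates are compared.
    """
    compared = [(a, b) for a, b in zip(rep_a, rep_b)
                if a is not None and b is not None]
    if not compared:
        return None  # nothing to compare
    mismatches = sum(a != b for a, b in compared)
    return mismatches / len(compared)


# Toy replicate pair: 6 loci are called in both, and 1 of them differs.
rep_a = [0, 1, 2, None, 1, 0, 2, 1]
rep_b = [0, 1, 1, 2, 1, 0, 2, None]
rate = replicate_error_rate(rep_a, rep_b)
```

A reproducibility filter could then drop any locus whose calls disagree across replicate pairs more often than a chosen threshold.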
- While all pipelines performed well, they each have pros and cons which differ depending on the diversity present in the population and the amount of missing data.
- Stacks was less than optimal when missing data levels were high for goose as the populations could no longer be discriminated.
- Methods for the goose RRS are reported at [8].
- We used data [40] for the Iceland (“Population 1”, N = 20) and Denmark (“Population 2”, N = 20) sites, as reported in [8].
- For both species, we used the Burrows-Wheeler aligner (BWA) v0.7.15 ‘aln’ function [41] to align single-end reads (devil) or paired-end reads (goose), following [14], to the respective reference genome [24, 29].
- For our devil data, bias in per base sequence content was detected in the first 5 bases of reads (adaptor region) with FastQC, so these were trimmed during the genome alignment step (-B 5) to remove the restriction enzyme cut site (PstI-HpaII).
- SAMtools and GATK pipelines.
- For our goose data, cleaned reads were aligned to the pink-footed goose genome [29] with BWA ‘mem’ followed by SAMtools sort and local realignment with GATK as per the devil data.
- BCFtools merge [10] was used to merge single sample VCFs into a multi-sample VCF and filter on genotyping rate (min 70%, similar to Stacks -r) and MAF of 1% with VCFtools [46], to reflect the values used in the Stacks pipeline.
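To make the two thresholds concrete, here is a small Python sketch of a call-rate and minor allele frequency (MAF) filter applied to a toy genotype table. It does not reproduce the BCFtools or VCFtools commands; the 0/1/2 encoding (alternate allele count, `None` for missing) and all function names are assumptions for illustration only.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing call at one locus."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF at one biallelic locus from 0/1/2 alternate allele counts."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_loci(loci, min_call_rate=0.70, min_maf=0.01):
    """Keep loci that meet both the call-rate and the MAF threshold."""
    return {name: g for name, g in loci.items()
            if call_rate(g) >= min_call_rate
            and minor_allele_frequency(g) >= min_maf}


# Toy table: locus name -> calls for ten samples.
loci = {
    "snp1": [0, 1, 2, 0, None, 1, 0, 0, 1, 0],              # 90% called, polymorphic
    "snp2": [0, None, None, None, 1, None, 0, None, 2, 0],  # only 50% called
    "snp3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],                 # monomorphic, MAF = 0
}
kept = filter_loci(loci)
```

Lowering `min_call_rate` to 0.30 in this sketch mirrors the less stringent filtering used to explore the impact of missing data.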
- Custom R script.
- We calculated coverage difference as the percentage difference at each SNP between the read depth of the reference allele and SNP allele, and used a coverage difference of ≤80% as our cut-off.
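The text does not spell out the exact formula, so the following sketch assumes the difference is the absolute gap between the two allele depths expressed as a percentage of their combined depth. Both that denominator and the function names are assumptions; the authors' R script may normalise differently.

```python
def coverage_difference(ref_depth, alt_depth):
    """Percentage difference between reference and SNP allele read depth.

    Assumption: the gap is expressed relative to the combined depth.
    Returns None when neither allele has any reads.
    """
    total = ref_depth + alt_depth
    if total == 0:
        return None
    return 100 * abs(ref_depth - alt_depth) / total


def passes_coverage_filter(ref_depth, alt_depth, max_difference=80):
    """Keep a SNP when its coverage difference is at or below the cut-off."""
    diff = coverage_difference(ref_depth, alt_depth)
    return diff is not None and diff <= max_difference


balanced = passes_coverage_filter(90, 10)  # difference of exactly 80%
skewed = passes_coverage_filter(95, 5)     # difference of 90%
```

Under this definition, a SNP with 90 reference reads and 10 alternate reads sits exactly on the cut-off and is retained, while a 95 to 5 split is removed.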
- We had accurate sex data for all devil samples and could therefore identify and filter out SNPs that may be sex-linked if no heterozygotes were present in the heterogametic sex but at least one heterozygote was present in the homogametic sex.
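The sex-linkage rule described above can be sketched as a simple per-locus check. The 0/1/2 genotype encoding (1 = heterozygote, `None` = missing), the "M"/"F" sex labels, and the function name are illustrative assumptions, not the authors' code.

```python
def is_putatively_sex_linked(genotypes, sexes, heterogametic="M"):
    """Flag a locus with heterozygotes only in the homogametic sex.

    genotypes: per-sample calls encoded 0/1/2, None for missing.
    sexes: per-sample sex labels in the same order as genotypes.
    heterogametic: label of the heterogametic sex (assumed "M" here).
    """
    het_in_heterogametic = False
    het_in_homogametic = False
    for genotype, sex in zip(genotypes, sexes):
        if genotype != 1:
            continue  # only heterozygous calls are informative
        if sex == heterogametic:
            het_in_heterogametic = True
        else:
            het_in_homogametic = True
    return het_in_homogametic and not het_in_heterogametic


sexes = ["M", "M", "M", "F", "F", "F"]
flagged = is_putatively_sex_linked([0, 2, 0, 1, 0, 1], sexes)    # hets only in females
autosomal = is_putatively_sex_linked([1, 0, 2, 1, 0, 2], sexes)  # hets in both sexes
```

A locus with no heterozygotes at all is not flagged, since the rule requires at least one heterozygote in the homogametic sex.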
- The three resulting SNP datasets (Stacks, SAMtools and GATK) for each species were assessed for their ability to examine our study populations using a set of markers mapped to the genome.
- For each of the datasets, summary statistics of observed (H O ) and expected heterozygosity (H E ) across loci were calculated using the adegenet package for R [48, 49].
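For readers unfamiliar with these statistics, the sketch below computes per-locus observed and expected heterozygosity for a biallelic SNP. It is not the adegenet implementation; it assumes 0/1/2 genotype encoding with `None` for missing data, and takes expected heterozygosity as 2pq under Hardy-Weinberg proportions without a small-sample correction.

```python
def observed_heterozygosity(genotypes):
    """Proportion of called samples that are heterozygous (H_O)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(g == 1 for g in called) / len(called)


def expected_heterozygosity(genotypes):
    """Expected heterozygosity (H_E) as 2pq for a biallelic locus."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    p = sum(called) / (2 * len(called))  # alternate allele frequency
    return 2 * p * (1 - p)


# Toy locus: one 0/0, two 0/1, one 1/1, and one missing call.
locus = [0, 1, 1, 2, None]
h_o = observed_heterozygosity(locus)
h_e = expected_heterozygosity(locus)
```

Averaging these per-locus values across all retained loci gives dataset-level summaries comparable in spirit to those in Table 1.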
- For devils, two different analyses were performed for each of the three pipelines, the first including all individuals sequenced (N = 131), and the second only the founding wild-born individuals (N = 65).
- Impact of missing data.
- For both species, we refiltered all three pipelines less stringently (genotyping rate of 30% rather than 70%) to examine the impacts of missing data on population inference.
- Summary statistics for the resultant SNP loci datasets of three pipelines, filtered less stringently at a higher allowable missing data (30% call rate).
- PCoA of the devil dataset only for the three pipelines, considering all three populations.
- PCoAs of the two datasets after processing through three pipelines filtered less stringently, allowing more missing data (30% call rate).
- the amount of missing data of a sample for the b) Stacks, c) SAMtools and d) GATK pipelines.
- (ZIP 346 kb) Additional file 2: Custom R script.
- We thank the Save the Tasmanian Devil Program, the Zoo and Aquarium Association, and member zoos, for collection of ear biopsies for the devil gene bank.
- We acknowledge the many staff and students of the Australasian Wildlife Genomics Group (University of Sydney) who have contributed to the Tasmanian devil project over the years.
- We would also like to thank the authors of the goose data (JM Pujolar, L Dalén, MM Hansen and J Madsen) for making their data publicly available, which provided a valuable comparison to the devil data reported herein.
- All authors contributed to the writing and editing of the final manuscript.
- These projects contributed to the costs of the sequencing of all Tasmanian devil samples and the salary for BW.
- All authors were responsible for the projects' design and implementation.
- All devil samples were collected under Save the Tasmanian Devil Program standard operating procedures for routine management of the species.
- Maroso F, Hillen J, Pardo B, Gkagkavouzis K, Coscia I, Hermida M, et al.
- Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, et al.
- Shafer A, Peart CR, Tusso S, Maayan I, Brelsford A, Wheat CW, et al.
- Johnson RN, O'Meally D, Chen Z, Etherington GJ, Ho SYW, Nash WJ, et al.
- Ekblom R, Brechlin B, Persson J, Smeds L, Johansson M, Magnusson J, et al.
- Genome sequencing and conservation genomics in the Scandinavian wolverine population.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al.
- Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al.
- O'Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, et al.
- Shafer ABA, Wolf JBW, Alves PC, Bergström L, Bruford MW, Brännström I, et al.
- Lazenby BT, Tobler MW, Brown WE, Hawkins CE, Hocking GJ, Hume F, et al.
- Murchison EP, Schulz-Trieglaff OB, Ning Z, Alexandrov LB, Bauer MJ, Fu B, et al.
- Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer.
- Grueber CE, Fox S, McLennan EA, Gooley RM, Pemberton D, Hogg CJ, et al.
- Complex problems need detailed solutions: harnessing multiple data types to inform genetic management in the wild.
- Miller W, Hayes VM, Ratan A, Petersen DC, Wittekindt NE, Miller J, et al.
- Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil).
- Hendricks S, Epstein B, Schönfeld B, Wiench C, Hamede R, Jones M, et al.
- First de novo whole genome sequencing and assembly of the pink-footed goose.
- Hogg CJ, Ivy JA, Srb C, Hockley J, Lees C, Hibbard C, et al.
- Influence of genetic provenance and birth origin on productivity of the Tasmanian devil insurance population.
- Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, et al.
- Earth BioGenome project: sequencing life for the future of life.
- Hogg CJ, Grueber CE, Pemberton D, Fox S, Lee AV, Ivy JA, et al.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al.
- Adegenet: a R package for the multivariate analysis of genetic markers.
- Adegenet 1.3-1: new tools for the analysis of genome-wide SNP data.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al.
