
Phylogeny analysis from gene-order data with massive duplications


Summary

- Background: Gene order changes, under rearrangements, insertions, deletions and duplications, have been used as a new type of data source for phylogenetic reconstruction.
- Many computational methods exist for reconstructing gene-order phylogenies, including widely used maximum parsimony and maximum likelihood methods.
- However, both kinds of methods face challenges in handling large genomes with many duplicated genes, especially in the presence of whole genome duplication.
- Methods: In this paper, we present three simple yet powerful methods based on maximum-likelihood (ML) approaches that encode multiplicities of both gene adjacency and gene content information for phylogenetic reconstruction.
- We also evaluate our method on real whole-genome data from eleven mammals.
- Conclusions: Our new encoding schemes successfully incorporate the multiplicity information of gene adjacencies and gene content into an ML framework, and show promising results in reconstructing phylogenies for whole-genome data in the presence of massive duplications.
- Keywords: Phylogeny reconstruction, Maximum likelihood, Variable length binary encoding, Whole genome duplication.
- Phylogeny analysis is one of the key research areas in evolutionary biology.
- Currently, the dominant data source used in phylogenetic reconstruction is sequence data [1], which can be collected in large amounts at low cost (e.g., for coding genes).
- gene sequences) in phylogenetic reconstruction needs accurate inference of ortholog relationships and provides us only local information – different parts of the genome may.
- Large-scale changes on genomes may hold the key to building a coherent picture of the past history of contemporary species [2].
- As whole genomes are collected at increasing rates, whole-genome data has become a new and attractive type of data source for phylogenetic analysis [3–8].
- Since phylogenetic reconstruction is the key to ancestral reconstruction, a number of related methods [9–14] based on phylogenetic analysis have been studied extensively since the 2010s.
- 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0).
- Therefore we can use parsimony software such as TNT [15] and PAUP* [8] developed for molecular sequences to reconstruct gene order phylogeny.
- Although MPBE and MPME failed to compete with direct parsimonious approaches on whole-genome data, they show great speedup and pave the way for future improvements.
- Moreover, sequence data can be analyzed by searching the phylogeny with maximum likelihood score as suggested by Felsenstein [17] in 1981.
- Recent algorithmic developments and high-performance computing tools such as RAxML [18] have made the maximum likelihood approach feasible for analyzing very large collections of molecular sequences and for reconstructing better phylogenies than parsimonious methods.
- The first successful attempt to use maximum likelihood to reconstruct a phylogeny from the whole-genome data of bacteria was published [19] in 2011, but that method appeared to be too time-consuming to process eukaryotic genomes.
- [20] described a maximum-likelihood approach, MLWD, for phylogenetic analysis that takes into account genome rearrangements as well as duplications, insertions, and losses.
- Although the MLWD method outperforms distance-based methods [21, 22], it does not make full use of the copy number information of gene adjacency and gene content, and thus its performance degrades when genomes have experienced a large number of duplications, especially in the presence of whole genome duplications.
- In this paper, we propose new maximum-likelihood methods for phylogenetic reconstruction from whole-genome data that take into account copy number variations in both gene adjacency and gene content.
- Moreover, we also applied our new method to analyze real whole-genome data from eleven mammals.
- An adjacency of two consecutive genes a and b can take one of the following four forms: (a^t, b^h), …
- With the above notation, we can represent a genome by a multiset of adjacencies and telomeres (if any).
- e) as a multiset of adjacencies and telomeres S.
- Note that in the presence of duplicated genes, there is no one-to-one correspondence between genomes and multisets of genes, adjacencies, and telomeres [23].
- For example, the genome consisting of the linear chromosome (+a, −c, +a, +b, +a) and the circular one … will have the same multiset of adjacencies and telomeres as the above example.
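The multiset representation described above can be illustrated with a minimal Python sketch. The function names, the `_t`/`_h` suffixes for tail and head extremities, and the string format of signed genes (`"+a"`, `"-c"`) are assumptions made for this illustration, not notation taken from the authors' software. Each adjacency is stored as a sorted pair so that a chromosome and its reverse complement yield the same multiset.

```python
from collections import Counter

def extremities(gene):
    """Return the (entry, exit) extremities of a signed gene such as '+a' or '-c'."""
    sign, name = gene[0], gene[1:]
    tail, head = name + "_t", name + "_h"
    # A forward gene is read tail-to-head, a reversed gene head-to-tail.
    return (tail, head) if sign == "+" else (head, tail)

def adjacency_multiset(chromosome, circular=False):
    """Count adjacencies and telomeres of one non-empty chromosome (list of signed genes)."""
    adjacencies, telomeres = Counter(), Counter()
    ends = [extremities(g) for g in chromosome]
    for (_, left_exit), (right_entry, _) in zip(ends, ends[1:]):
        adjacencies[tuple(sorted((left_exit, right_entry)))] += 1
    if circular:
        # Close the circle: last gene's exit meets first gene's entry.
        adjacencies[tuple(sorted((ends[-1][1], ends[0][0])))] += 1
    else:
        telomeres[ends[0][0]] += 1
        telomeres[ends[-1][1]] += 1
    return adjacencies, telomeres

adjs, tels = adjacency_multiset(["+a", "-c", "+a", "+b", "+a"])
```

With duplicated genes, the same adjacency can occur more than once; for instance, `["+a", "+b", "+a", "+b"]` contains the adjacency `("a_h", "b_t")` twice.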
- Deletion, insertion and duplication not only change the ordering of genes, but also change the copy number of genes.
- A segmental duplication copies a single gene or a segment of genes from a genome and inserts the copy back into the genome.
- A whole genome duplication (WGD) is an operation on an ancestral node by which a genome is transformed into another by duplicating all of its chromosomes.
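As a rough illustration of how such events act on gene orders, the sketch below models a chromosome as a list of signed integers and a genome as a list of chromosomes. The function names, the index conventions, and the inclusion of an inversion (a rearrangement that reverses a segment and flips its signs) are assumptions for this example; the paper's own simulator is not shown in the source.

```python
def inversion(chrom, i, j):
    """Reverse the segment chrom[i:j] and flip the sign of every gene in it."""
    return chrom[:i] + [-g for g in reversed(chrom[i:j])] + chrom[j:]

def segmental_duplication(chrom, i, j, pos):
    """Copy the segment chrom[i:j] and insert the copy at index pos."""
    return chrom[:pos] + chrom[i:j] + chrom[pos:]

def whole_genome_duplication(genome):
    """Return a genome in which every chromosome appears twice."""
    return [list(c) for c in genome] + [list(c) for c in genome]

print(inversion([1, 2, 3, 4], 1, 3))                 # [1, -3, -2, 4]
print(segmental_duplication([1, 2, 3, 4], 1, 3, 4))  # [1, 2, 3, 4, 2, 3]
print(whole_genome_duplication([[1, 2], [3]]))       # [[1, 2], [3], [1, 2], [3]]
```

The inversion changes only gene order, whereas both duplication operations also raise gene copy numbers, which is the information the later encodings try to retain.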
- In this section, we first describe three versions of the Variable Length Binary Encoding (VLBE) scheme and then introduce Variable Length Binary Encoding based Phylogeny Reconstruction with Maximum Likelihood on Whole-Genome Data (VLWDx).
- In the MLWD approach [20], the copy number information of both gene adjacency and gene content is not fully reflected in the binary encoding.
- MLWD uses binary encoding to record the presence or absence of an adjacency or gene (1 for presence and 0 for absence), but it does not distinguish the number of copies of the same adjacency or gene in the genome.
- In this paper, we propose a new family of encoding schemes, Variable Length Binary Encoding (VLBE), which preserves as much gene order and gene content information as possible.
- We then incorporate a dedicated transition model and develop the phylogenetic reconstruction method, Variable Length Binary Encoding based Phylogeny Reconstruction with Maximum Likelihood on Whole-Genome Data (VLWDx), which aims to be more robust than MLWD [20], especially in the presence of a large number of duplications.
- For the rearrangement-only model, we apply VLBE 1 to encode the presence or absence of any adjacency or telomere in the genome.
- We take into account only the adjacencies and telomeres that appear in at least one of the given genomes.
- However, the number of adjacencies and telomeres that appear in at least one of the input genomes is usually much smaller; in fact, it is usually linear in n rather than quadratic [20].
- For the general model, which includes not only rearrangements but also duplications, insertions and deletions, we add the encoding of gene content to the encoding of adjacencies.
- In the following three subsections, we give details on the three encoding schemes, along with the resulting encodings for the genome given in Table 1(a).
- Variable length binary encoding 1 (VLBE 1)
- We count the maximum number of occurrences t for each adjacency a ∈ A among all the genomes.
- The encoding of each adjacency a is performed as follows: if genome D_i has k copies of the adjacency a, we append t − k 0’s and k 1’s to the sequence.
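The rule above (t − k zeros followed by k ones per adjacency) can be sketched in a few lines of Python. The function name `vlbe_encode`, the dictionary-of-counts input format, and the alphabetical ordering of adjacencies are assumptions made for this illustration; the source does not specify the column order.

```python
def vlbe_encode(counts_per_genome):
    """Encode each genome as a binary string with a variable-length block per feature.

    counts_per_genome maps a genome name to a dict {feature: copy number}.
    """
    features = sorted(set().union(*counts_per_genome.values()))
    # t: the maximum copy number of each feature over all genomes
    max_copies = {f: max(c.get(f, 0) for c in counts_per_genome.values())
                  for f in features}
    encodings = {}
    for name, counts in counts_per_genome.items():
        blocks = []
        for f in features:
            t, k = max_copies[f], counts.get(f, 0)
            blocks.append("0" * (t - k) + "1" * k)  # t - k zeros, then k ones
        encodings[name] = "".join(blocks)
    return encodings

adjacency_counts = {
    "G1": {("a_h", "b_t"): 2, ("a_t", "b_h"): 1},
    "G2": {("a_h", "b_t"): 1},
    "G3": {("a_t", "b_h"): 1, ("b_h", "c_t"): 1},
}
print(vlbe_encode(adjacency_counts))
# {'G1': '1110', 'G2': '0100', 'G3': '0011'}
```

Every genome receives a sequence of the same length (the sum of all t values), so the result can be treated as an alignment of binary characters.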
- Variable length binary encoding 2 (VLBE 2)
- We propose VLBE 2 to encode the multiplicity of adjacencies as well as the presence or absence of gene content.
- The encoding of each adjacency a is performed as follows: if genome D_i has k copies of the adjacency a, we append t − k 0’s and k 1’s to the sequence.
- We also append the encoding of gene content as follows: for each unique gene, if it is present in genome D_i, we append 1 to the encoding for genome D_i; otherwise we append 0 (see Table 2 for an example).
- Variable length binary encoding 3 (VLBE 3)
- We further explore whether variable length binary encoding of gene content would also make a difference in phylogeny reconstruction.
- VLBE 3 is aimed at encoding both adjacencies and gene content.
- We count the maximum number of occurrences t for each adjacency a ∈ A and encode each adjacency a as follows: if genome D_i has k copies of adjacency a, we append t − k 0’s and k 1’s to the encoding sequence for D_i.
- We also append the gene content encoding in the same way as for the adjacencies.
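The difference between VLBE 2 and VLBE 3 lies only in how gene content is appended, which the following sketch tries to make concrete. The helper names (`multiplicity_block`, `presence_block`, `vlbe2`, `vlbe3`) and the toy copy numbers are hypothetical and chosen for this example only.

```python
def multiplicity_block(counts_per_genome):
    """Variable-length block: t - k zeros then k ones for every feature."""
    features = sorted(set().union(*counts_per_genome.values()))
    t = {f: max(c.get(f, 0) for c in counts_per_genome.values()) for f in features}
    return {name: "".join("0" * (t[f] - c.get(f, 0)) + "1" * c.get(f, 0)
                          for f in features)
            for name, c in counts_per_genome.items()}

def presence_block(counts_per_genome):
    """Fixed-length block: a single 1/0 per feature for presence/absence."""
    features = sorted(set().union(*counts_per_genome.values()))
    return {name: "".join("1" if c.get(f, 0) > 0 else "0" for f in features)
            for name, c in counts_per_genome.items()}

def vlbe2(adjacency_counts, gene_counts):
    """Adjacency multiplicities followed by gene presence/absence."""
    adj, genes = multiplicity_block(adjacency_counts), presence_block(gene_counts)
    return {name: adj[name] + genes[name] for name in adj}

def vlbe3(adjacency_counts, gene_counts):
    """Adjacency multiplicities followed by gene multiplicities."""
    adj, genes = multiplicity_block(adjacency_counts), multiplicity_block(gene_counts)
    return {name: adj[name] + genes[name] for name in adj}

adjacency_counts = {
    "G1": {("a_h", "b_t"): 2, ("a_t", "b_h"): 1},
    "G2": {("a_h", "b_t"): 1},
    "G3": {("a_t", "b_h"): 1, ("b_h", "c_t"): 1},
}
gene_counts = {
    "G1": {"a": 2, "b": 2},
    "G2": {"a": 1, "b": 1},
    "G3": {"a": 1, "b": 1, "c": 1},
}
print(vlbe2(adjacency_counts, gene_counts)["G1"])  # 1110110
print(vlbe3(adjacency_counts, gene_counts)["G1"])  # 111011110
```

In this toy input, G1 and G2 carry the same genes but in different copy numbers: their gene content blocks are identical under the presence/absence encoding and differ under the multiplicity encoding.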
- As mentioned above, VLBE 1, VLBE 2 and VLBE 3 aim to transform gene order information into binary sequences without losing important genomic information.
- The key to phylogenetic reconstruction based on binary encoding is to determine the transition model for flipping a state (from 1 to 0 or from 0 to 1).
- Once we have built the encoding sequences for all of the input genomes, we use RAxML (version 7.2.8) to reconstruct a tree from these sequences.
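One plausible way to hand the encoded sequences to RAxML is to write them as a relaxed PHYLIP alignment and run RAxML with one of its binary-character models. The sketch below is an assumption about the workflow, not the authors' pipeline: the helper names and file names are hypothetical, the stock `BINGAMMA` model is used as a stand-in, and the dedicated transition model described in the paper is not reproduced here. The command is only constructed, not executed.

```python
def to_phylip(encodings):
    """Format {genome name: binary string} as a relaxed PHYLIP alignment."""
    lengths = {len(seq) for seq in encodings.values()}
    if len(lengths) != 1:
        raise ValueError("all encoded sequences must have the same length")
    header = f"{len(encodings)} {lengths.pop()}"
    rows = [f"{name} {seq}" for name, seq in encodings.items()]
    return "\n".join([header, *rows]) + "\n"

def raxml_command(alignment_path, run_name, seed=12345):
    """Build (but do not run) a RAxML call using a stock binary model.

    Assumption: the executable is named raxmlHPC and accepts -s, -n, -m, -p.
    """
    return ["raxmlHPC", "-s", alignment_path, "-n", run_name,
            "-m", "BINGAMMA", "-p", str(seed)]

alignment = to_phylip({"G1": "1110", "G2": "0100", "G3": "0011"})
print(alignment)
print(" ".join(raxml_command("genomes.phy", "vlbe_run")))
```

A real analysis needs at least four taxa for an unrooted tree to be informative; the three-genome input here only demonstrates the file layout.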
- Table 1 Example of the binary encoding through VLBE 1 for three genomes: G1, G2 and G3.
- Note that (1,2) and (-2,-1) are the same adjacency.
- Table 2 Example of the binary sequences using VLBE 2 for three genomes: G1, G2 and G3.
- We apply different evolutionary rates r so that the tree diameters are in the range d ∈ {1n, 2n, 3n, 4n}: a larger diameter means that a genome is more distant from its ancestor, and hence that the data set is more computationally expensive to analyze.
- For evolving along each branch, we use a set of evolutionary events, including inversions, fusions, fissions, translocations, indels, segmental duplications and whole genome duplications.
- This result is in line with the observation that variable length binary encoding preserves more adjacency and gene content information than MLWD does.
- These simulations show that our VLWD approach can reconstruct more accurate phylogenies from genome data that have experienced various evolutionary events than the previous binary encoding-based approach MLWD.
- VLWD 3 also outperforms VLWD 1 and VLWD 2, indicating the importance of encoding the multiplicity of both adjacencies and gene content.
- Fig. 1 RF error rates for different approaches for trees with 60 species (left) and 100 species (right), with genomes of 1000 genes and tree diameters from 1 to 4 times the number of genes, under evolutionary events without duplications. VLWD1, VLWD2 and VLWD3 are the three proposed methods, and MLWD is the previous encoding method.
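The RF (Robinson-Foulds) error rate reported in these figures compares the bipartitions of an inferred tree with those of the true tree. The sketch below shows one common way to compute it, assuming trees are given as nested tuples of leaf names and that the distance is normalized by 2(n − 3), the maximum for two unrooted binary trees on n leaves. The exact normalization used in the paper is not stated in this summary, so that choice is an assumption.

```python
def leaves(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Return the non-trivial bipartitions, each stored as the side without a reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found

def rf_error_rate(inferred, truth):
    """Symmetric difference of bipartitions, normalized by 2(n - 3); needs n > 3."""
    n = len(leaves(truth))
    return len(splits(inferred) ^ splits(truth)) / (2 * (n - 3))

truth = (("A", "B"), ("C", ("D", "E")))
inferred = (("A", "C"), ("B", ("D", "E")))
print(rf_error_rate(inferred, truth))  # 0.5
print(rf_error_rate(truth, truth))     # 0.0
```

Storing each bipartition as the side that excludes a fixed reference leaf makes the comparison independent of where the nested-tuple representation happens to be rooted.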
- In our approach, since the encoded sequence of each genome combines information from both gene adjacency and gene content, it is difficult to compute the optimal transition probabilities following the same procedure as described in [20].
- Thus, we set 1000 as the default bias ratio in the above transition model.
- Figures 2 and 3 together indicate that MLWD returns similar results for data sets with and without whole genome duplication, while VLWD 3 takes advantage of encoding the multiplicity of both gene adjacencies and gene content, and thus improves on the cases with whole genome duplication compared to those without.
- In the previous part, we tested our VLWD 3 approach on simulated data sets and achieved very good performance in reconstructing phylogenies.
- Here we test VLWD 3 on the whole-genome data of eleven mammal species from the online database Ensembl [24].
- To obtain the whole-genome data of the eleven mammal species, we first encode all of the genes into gene orders by using the same gene order to represent all of the homologous genes across different mammal genomes (each genome may contain multiple copies of homologous genes).
- Subsequently, we input the gene order content and adjacencies into the VLWD 3 approach to reconstruct the phylogenetic relationship for these eleven mammal species (see Fig.
- Fig. 2 RF error rates for different approaches for trees with 60 species (left) and 100 species (right), with genomes of 1000 genes and tree diameters from 1 to 4 times the number of genes, under evolutionary events with segmental duplications. VLWD1, VLWD2 and VLWD3 are the three proposed methods, and MLWD is the previous encoding method.
- Fig. 3 RF error rates for different approaches for trees with 60 species (left) and 100 species (right), with genomes of 1000 genes and tree diameters from 1 to 4 times the number of genes, under evolutionary events with both segmental and whole genome duplications. VLWD1, VLWD2 and VLWD3 are the three proposed methods, and MLWD is the previous encoding method.
- We also compare this VLWD 3 phylogeny with the previous gene order based mammal phylogeny study of Luo et al.
- There are eight mammal species shared by these two phylogenies, and all of the shared branches for these eight species agree with each other.
- middle two branches in the tree of Fig.
- We describe three simple yet powerful approaches for phylogenetic reconstruction based on maximum likelihood (ML), and design experiments to show the importance of taking into account the multiplicities of both gene adjacencies and gene content.
- Extensive experiments on simulated data sets show that our proposed approaches yield more accurate phylogenies than existing methods, particularly in the presence of a large number of duplications or whole genome duplications.
- Moreover, we applied our new approach to reconstruct the phylogeny of 11 mammal genomes, using only the whole-genome data from Ensembl [24].
- Our new encoding schemes successfully model the multiplicities of gene adjacencies and gene content and incorporate them into a maximum-likelihood framework.
- Experiments on both simulated and real data sets show the effectiveness and efficiency of our approaches in reconstructing phylogenies from whole-genome data in the presence of massive duplications.
- The full contents of the supplement are available online at https://bmcgenomics..
- Genome-scale evolution: reconstructing gene orders in the ancestral species.
- In: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology.
- New approaches for reconstructing phylogenies from gene order data.
- MLGO: phylogeny reconstruction and ancestral inference from gene-order data.
- Reconstructing Ancestral Genomic Orders Using Binary Encoding and Probabilistic Models.
- Ancestral reconstruction with duplications using binary encoding and probabilistic model.
- Reconstruction of ancestral gene orders using probabilistic and gene encoding approaches.
- Evolutionary trees from DNA sequences: a maximum likelihood approach.
- RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.
- Maximum likelihood phylogenetic reconstruction using gene order encodings.
- Maximum likelihood phylogenetic reconstruction from high-resolution whole-genome data and a tree of 68 eukaryotes.
- Rates of genome evolution and branching order from whole genome analysis.
- Phylogeny Reconstruction from Whole-Genome Data Using Variable Length Binary Encoding
