
Ancient evolutionary signals of protein-coding sequences allow the discovery of new genes in the Drosophila melanogaster genome


Summary

- Background: The current growth in DNA sequencing techniques makes genome annotation a crucial task in the genomic era.
- Traditional gene finders focus on protein-coding sequences, but they are far from being exhaustive.
- This strategy potentially highlights protein-coding regions in genomic sequences regardless of traditional homology or translation signatures.
- Keywords: Ancient sequences, Gene finding, Protein-coding genes, Drosophila melanogaster, Protomotifs.
- Research groups from all over the world are sequencing whole genomes as a common task, taking advantage of the current burst in the genomics era [1].
- However, computational tools for gene discovery usually miss around 20% of protein-coding genes when annotating a whole genome, or even more in the case of eukaryotic organisms [2, 3].
- A significant number of protein-coding sequences and other functional genomic elements are missing when using currently available genomic annotation approaches.
- The D. melanogaster genome was sequenced in 2000, and 13,601 protein-coding genes were initially annotated by integrating the results of the two gene finders used, which predicted 13,189 and 17,464 genes, respectively [4].
- The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
- If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
- A greater increase is expected to come from the discovery of new kinds of genes, such as those encoding proteins shorter than 100 amino acids, which could number in the thousands in the fruit-fly genome [7].
- Among the newly proposed methods, we have previously shown that the accumulation of low-score alignments, which would represent footprints of ancient sequences, highlights present and ancient protein-coding regions that are hard to discover by conventional methods [10].
- Thus, the AnABlast profile, with its peaks of accumulated protomotifs, accurately marks putative protein-coding genes, pseudogenes, and fossils of ancient coding sequences, overcoming the effects of possible sequencing errors and reading-frame shifts, since it does not search for reading frames but for sequence coding signals.
- We previously showed that this strategy is useful to discover putative protein-coding regions.
- melanogaster genome, and its annotated protein-coding.
- We also show that AnABlast is useful for discovering small ORFs and fossil sequences that are hidden from conventional gene-finder algorithms, and show how this new strategy can contribute to discovering the complete set of protein-coding regions of a whole genome.
- Searching for protein-coding signals in the fruit-fly genome.
- AnABlast is a computational tool that searches for protein-coding regions in whole genomes by taking into account low-score alignments shared by multiple unrelated protein sequences.
- Alignments obtained from this similarity search (called protomotifs), including those with a low score, are then piled up along the query sequence; peaks accumulating protomotifs above a specific threshold highlight potential protein-coding regions and are considered coding-signals (Fig.
- Finally, these coding-signals can be evaluated: those underlying exons of a protein-coding gene are true positive predictions, exons without coding-signals are false negatives, and coding-signals underlying introns or intergenic regions are putative false positives.
- Peaks with a protomotif accumulation above a peak-height threshold are considered as putative protein-coding regions (coding-signal).
- represent genomic regions encoding putative new proteins, but also non-functional degenerated protein-coding regions, something of particular interest in current genome research.
- To test the capability of AnABlast to discover protein-coding regions, we used this algorithm to analyze the whole genome of the fruit-fly D.
- positive signals represent a particularly interesting set of genomic regions, since they could constitute new protein-coding regions.
- Protein-coding signals highlighted by AnABlast are mainly composed of low-score protein sequence alignments, and occasionally of high-score ones.
- To test whether such alignments are just random or actually match true protein-coding regions, we studied the distribution of protomotifs underlying protein-coding regions at different BLAST bit-scores, with respect to the different possible reading frames.
- Though millions of protomotifs were scattered throughout the fruit-fly genome within annotated exons, most of them were concentrated in the right reading frame, with a much lower number found in any of the other five possible reading frames (Fig.
- Thus, protomotifs are mainly accumulated in the true reading frame in spite of their low score.
- Interestingly, a significant number of them accumulate in the right strand but at a different reading frame.
- Fig. 2 Distribution of protomotifs from true positive coding-signals, separated by the true reading frame of the protein-coding sequences where they accumulate.
- The panels show protomotifs accumulated, at different BLAST bit-scores, in protein-coding sequences starting in (a) frame +1, (b) frame +2, (c) frame +3, (d) frame −1, (e) frame −2, and (f) frame −3.
- In fact, the protomotifs accumulated on the opposite strand present bit-score values lower than 30, while those accumulated on the true strand present values close to 100.
- This result suggests, from an evolutionary point of view, that new protein-coding genes might arise simply from a shift of the reading frame on the same strand.
- Optimization of AnABlast parameters for the efficient prediction of protein-coding signal.
- Until now we have seen how AnABlast coding-signals mainly match protein-coding regions in the genome, but we did not use any threshold to evaluate the results and measure the accuracy of the gene prediction procedure.
- Regardless of the score used, true positive coding-signals account for the highest peak-heights (Fig.
- However, the absolute number of true protein-coding regions dropped with such higher scores, decreasing the number of peaks underlying protein-coding sequences (Fig.
- However, the distribution of these false positives shows outliers with peak-height values indistinguishable from the true positive set, which could be considered as new putative protein-coding regions.
- In a specific genome, the accuracy of the protein-coding sequence prediction by AnABlast depends not only on the bit-score value, but also on the peak-height threshold coming from the alignment accumulation.
- Thus, when the most current database, from 2017, was used, the false positive outliers appear from a peak-height of 70, suggesting that peaks higher than this value could predict new protein-coding regions.
- To better test both the precision and recall of AnABlast in predicting protein-coding sequences, coding-signals were compared against the gene annotation of fruit-fly genome and the accuracy of AnABlast prediction in this set was analyzed.
- When using the 2017 database, precision reaches an asymptote of around 90% at a peak-height of about 100 (only 1 in 10 predictions is wrong), though recall at this threshold is only 65% (only 6.5 in 10 of the true protein-coding sequences are recovered) (Fig.
- By using these parameters, we expect that AnABlast could discover new unknown protein-coding genes and exons inside the 10% putative false positives.
- All of this yields a large number of AnABlast coding-signals matching protein-coding regions spread over the fruit-fly chromosomes, and between 4500 and 7000 candidates (depending on the database used) for new protein-coding sequences (Table 1.
- The number of annotated protein-coding sequences in annotated genomes is continuously revised, and new genes, exons and pseudogenes keep appearing as a consequence of experimental results and new in silico approaches.
- For instance, when the FlyBase release of 2012 is compared to the current 2017 release, 38 protein-coding genes, 91 exons from well-known genes, and 74 pseudogenes are found to have entered the database after 2012.
- This dataset of true protein-coding sequences absent from the 2012 release allows us to carry out a simulation to estimate the efficiency of AnABlast in discovering new protein-coding sequences.
- 35), we found that AnABlast highlights the majority of the protein-coding sequences from this dataset (Table 2).
- More than 60% of the protein-coding genes and 80% of the pseudogenes were predicted by AnABlast.
- In the case of new exons, their short length (some of them code for only a few amino acids) makes in silico identification extremely difficult.
- Overall, it is important to highlight that most of these new protein-coding sequences predicted by AnABlast were not found by the widely used gene finder AUGUSTUS [11].
- One of the new genes that AnABlast failed to identify with the 2012 database (CG45546) encodes a short protein of 93 amino acids (Fig.
- Interestingly, AnABlast efficiently identified it when using the 2017 database, because this sequence and its putative homologs were by then included in the database, increasing the peak-height to a significant level.
- One of these exons was found in the Ory gene (CG40446).
- The exon appearing in the 2017 database is found by AnABlast using the 2012 release and a peak-height higher than 35 (Fig.
- Interestingly, AnABlast produced two weak peaks within an intronic sequence of this gene, with a peak-height around 50, very similar to others in the 3′ end.
- In addition, a high peak overlapping with an exon in the 5′ end also emerges in the reverse strand; it represents a tri-nucleotide region coding for amino acid repeats in all the reading frames.
- This artefact is characteristic of nucleotide repeats, and it can be avoided by enabling the low-complexity filter in the similarity search step with BLAST.
- In addition to new genes and exons, AnABlast was able to discover 59 pseudogenes that did not appear in the 2012 database used (Table 2).
- This shows the ability of AnABlast to discover protein-coding regions regardless of the presence of a complete open reading frame.
- One of these pseudogenes (CR44906), included in FlyBase in 2013, is clearly highlighted by AnABlast in the reverse strand of the 2012 database (Fig.
- Remarkably, another coding-signal is found in the forward strand, upstream of this pseudogene.
- Putative new protein-coding sequences in the present database.
- Finally, given the efficient identification of protein-coding sequences by AnABlast, it is expected that, after further characterization, a considerable fraction of the false positive sequences predicted when the 2017 database is used will become true positives.
- A list of false positives that are likely to represent new putative protein-coding sequences is available in Suppl.
- Fig. 5 AnABlast profile for three regions of the fruit-fly genome.
- Green color represents protomotif accumulation in the forward strand, and red color in the reverse strand.
- (b) Region including part of the gene Ory (CG40446), together with the exons annotated in both database releases (the red arrow marks two peaks corresponding to an ancient mobile element).
- (c) Region of the pseudogene CR44906, including surrounding genes and a transposable element in the 5′ end.
- alignments (protomotifs) do not accumulate randomly but in true genomic protein-coding regions (Supplementary file 1.
- So, it is expected that future growth in the protein database will increase the protomotif accumulation in true protein-coding regions, whereas no accumulation would appear in non-coding regions.
- melanogaster genome as a model system, we set optimal parameters for using AnABlast as a new protein-coding finder in whole sequenced genomes.
- 85% of the actual genes annotated in a genome (Table 2).
- However, the accumulation of protein sequences using a non-redundant database and low bit-scores in the BLAST search enables AnABlast to discover new genes that escape conventional strategies, reaching a precision of up to 90% with exons, genes and pseudogenes among the identified protein-coding signals (Fig.
- Importantly, coding signals are also identified by AnABlast in the coding strand, but at different reading frames.
- In fact, the peak-height distribution matching ORFs in true genes has the highest value, followed by peaks identified in the next frame, which suggests that new protein-coding regions may emerge by point deletions in the original frame.
- This observation agrees with previous evidence in mammals suggesting a higher frequency of evolutionary fixation for deletion than for insertion mutations [18, 19], a trend that has also been found in the D.
- Remarkably, AnABlast coding-signals are sometimes found at the ends of well-known genes, overlapping with the right reading frame and suggesting that the C-terminus and N-terminus of genes are subject to evolutionary contractions and expansions that are efficiently identified by AnABlast [10].
- The discovery of protein-coding genes by AnABlast is independent of the presence of an open reading frame, a feature that allows finding sequences without canonical structures, such as pseudogenes and transposable sequences.
- These short protein-coding sequences were missed in the past, since it is difficult to distinguish functional open reading frames from non-functional ones that arose by chance [22].
- The present study shows how AnABlast is able to discover new putative protein-coding genes and exons where other methods fail.
- To allow the analysis of genomic regions, we have built a web application which is available at http://www.bioinfocabd.upo.es/anablast/ [23], where researchers can evaluate new protein-coding signals in their own sequences.
- Search for protein-coding signals.
- Protomotifs were classified by reading frame when they matched well-known exons in the genome.
- The remaining analyses, with sequences up to 10 kb, were performed in the AnABlast web application, which can analyze genomic sequences up to 25 kb, or longer if.
- As a stricter constraint, the sequences were searched in the UniProt database to discard previously described protein-coding genes, and only genes not appearing in any database release before 2017 were retained.
- The sequences from the testing dataset were taken with 100 nucleotides both in the 5′.
- Accuracy was measured by comparing exons with the protein-coding signals from AnABlast, considering true positives (TP, AnABlast coding-signals matching exons or pseudogenes), false positives (FP, AnABlast coding-signals matching introns or intergenic regions), and false negatives (FN, exons or pseudogenes without AnABlast coding-signals).
- Recall and precision values when using AnABlast at different peak-height thresholds with the fruit-fly genome and the 2012 and 2017 releases of the database.
- Note that false positives from 2017 are novel predicted protein-coding sequences.
- AnABlast results for all the sequences in the testing dataset, including genes, exons and pseudogenes.
- Competitiveness of the Spanish Government grant BFU P.
- Genomics: annotation of the Drosophila genome.
- FlyBase 2.0: the next generation.
- Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes.
- AnABlast: re-searching for protein-coding sequences in genomic regions
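
The procedure described in the summary (protomotifs piled up along the query, a peak-height threshold defining coding-signals, and TP/FP/FN counting against annotated exons) can be illustrated with a minimal Python sketch. This is not the AnABlast implementation: the function names, the half-open coordinate convention, and the choice to treat every contiguous run above the threshold as one coding-signal are assumptions made for illustration, and strand and reading frame are ignored.

```python
from typing import List, Tuple

# Hypothetical convention: half-open [start, end) intervals on the query.
Interval = Tuple[int, int]


def pile_up(length: int, protomotifs: List[Interval]) -> List[int]:
    """Count how many protomotifs cover each position of the query."""
    diff = [0] * (length + 1)
    for start, end in protomotifs:
        diff[start] += 1
        diff[end] -= 1
    profile, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        profile.append(depth)
    return profile


def coding_signals(profile: List[int], threshold: int) -> List[Interval]:
    """Return maximal runs of positions whose depth exceeds the threshold."""
    signals, run_start = [], None
    for pos, depth in enumerate(profile):
        if depth > threshold and run_start is None:
            run_start = pos
        elif depth <= threshold and run_start is not None:
            signals.append((run_start, pos))
            run_start = None
    if run_start is not None:
        signals.append((run_start, len(profile)))
    return signals


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(signals: List[Interval], exons: List[Interval]):
    """TP: signals overlapping an exon; FP: signals overlapping none;
    FN: exons without any signal (the definitions given in the summary)."""
    tp = sum(any(overlaps(s, e) for e in exons) for s in signals)
    fp = len(signals) - tp
    fn = sum(not any(overlaps(e, s) for s in signals) for e in exons)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return tp, fp, fn, precision, recall


# Toy data (invented for illustration, not taken from the paper).
toy_protomotifs = [(2, 10), (3, 9), (4, 8), (15, 18)]
toy_exons = [(5, 7), (12, 14)]
toy_profile = pile_up(20, toy_protomotifs)
toy_signals = coding_signals(toy_profile, threshold=0)
toy_result = evaluate(toy_signals, toy_exons)
```

Raising `threshold` in `coding_signals` trades recall for precision, which is the trade-off the summary reports when varying the peak-height threshold. Note that, following the definitions in the text, TP counts coding-signals while FN counts exons, so recall here mixes the two units.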
