
De novo transcriptome assembly, annotation and comparison of four ecological and evolutionary model salmonid fish species


Summary

- Improving understanding of the genetic mechanisms underlying traits in these species would significantly progress research in these fields.
- Assembly completeness was assessed using three approaches, all of which supported high quality of the assemblies: 1) ~78% of Actinopterygian single-copy orthologs were successfully captured in our assemblies, 2) orthogroup inference identified high overlap in the protein sequences present across all four species (40% shared across all four and 84% shared by at least two), and 3) comparison with the published Atlantic salmon genome suggests that our assemblies represent well covered (~98%) protein-coding transcriptomes.
- Thorough comparison of the generated assemblies found that 84-90% of transcripts in each assembly were orthologous with at least one of the other three species.
- Full list of author information is available at the end of the article.
- 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
- The latter point forms the primary objective of the FAASG (Functional Annotation of All Salmonid Genomes), a recent initiative developed by the ICSASG, which aims to generate functionally annotated resources for nine species of salmonids and integrate data generated from within the wider research community [24].
- RNA-Seq methods allow genome-wide investigation of the transcriptome, providing an in-depth overview of transcript sequence and expression profiles [25–28].
- Furthermore, the recent release of the PhyloFish database (http://phylofish.sigenae.org/index.html) represents another major contribution, with the generation of transcriptomic resources for 15 fish species, of which six are salmonids: grayling (Thymallus thymallus), lake whitefish (Coregonus clupeaformis), European whitefish (Coregonus lavaretus), brown trout (Salmo trutta), rainbow trout (Oncorhynchus mykiss) and brook trout (Salvelinus fontinalis), with 66,996 to 78,415 transcripts per species [35].
- We also conduct a thorough comparison of the de novo assemblies generated for these four species, providing valuable insight into the level of sequence similarity and divergence between salmonids of varying phylogenetic proximity.
- to assess assembly completeness, including a comparative analysis of the current assemblies against the published reference genome for Atlantic salmon, and other reference transcriptomes available.
- The tanks used a flow-through system supplied with untreated water from Loch Lomond and were subject to the ambient temperature of the loch, which ranged from 4 to 16 °C over the duration of the study.
- To allow permeation of the RNALater preservative into all tissues, several incisions were made along the dorsal side of each.
- Briefly, libraries were synthesised for each of the 32 samples using the TruSeq Stranded mRNA Sample Preparation kit (Illumina, San Diego, CA), according to the manufacturer’s instructions.
- A schematic representation of the de novo transcriptome reconstruction and analysis pipeline is given in Fig.
- Given that of the four species studied, a reference genome is currently only available for Atlantic salmon, we assembled the transcriptomes de novo to avoid any bias that might have been introduced by a genome-guided approach.
- 1, [11]) and the high complexity of the genome (as a result of the Ss4R duplication event [9.
- Consequently, the pre-processed reads for each of the four species were subjected to the de novo assembly procedure using Trinity r with the default parameters.
- Prior to filtering our four de novo assemblies to remove redundant and poorly constructed transcripts, we performed an initial quality assessment of the transcript sets.
- Read representation was determined by mapping the cleaned reads back to their corresponding assemblies, for each of the four species individually, with Bowtie2 v2.2.6 (--local, --no-unal) [46].
- To determine how successfully assembled transcripts were reconstructed to full- or near full-length in each of the four assemblies, we calculated coverage against the NCBI Atlantic salmon proteins database (GCF .
- To provide a comprehensive and quantitative overview of the level of completeness achieved for our assemblies,.
- number of complete (length is within two standard deviations of the mean length of the given BUSCO), duplicated (complete BUSCOs represented by more than one transcript), fragmented (partially recovered BUSCOs) and missing (not recovered) in each of the four de novo assemblies.
- To further assess the completeness and utility of the resources presented here, we examined how successfully BUSCOs were recovered in our assemblies compared to the NCBI protein dataset for Atlantic.
- Fig. 2 Schematic of the de novo transcriptome reconstruction and analysis pipeline used to generate the protein-coding transcriptome assemblies for Atlantic salmon, brown trout, Arctic charr and European whitefish.
- Utilising the sister taxa in the present study provides a validation of the completeness of our de novo transcriptomes.
- In addition to assessing completeness of the final assemblies, we also applied OrthoFinder to assess and control completeness of our assemblies at each stage of the filtering pipeline, by comparing orthogroup size distribution within our salmon de novo assemblies relative to the Atlantic salmon RefSeq protein set (GCA .
- Fourth, we compared completeness and similarity of the current assemblies for Arctic charr, brown trout and European whitefish to previously published transcriptomes for these three species.
- Full-length transcript reconstruction in the previous assemblies for each of the three species was evaluated following the same protocol described above.
- Second, we identified the transcript set overlap of the current compared to previously published conspecific assemblies.
- To provide comprehensive annotation of the final transcript sets, we compared our de novo assemblies against two annotation resources.
- Again, GO annotation of the ‘assembly-specific’ transcripts was conducted with PANTHER, per the pipelines described above.
- 30) removed approximately 11% of the raw reads.
- This resulted in high quality RNA-seq datasets, which contained between 180 and 210 M paired-end reads for each of the four species (Table 1).
- Respectively, for the Atlantic salmon, brown trout, Arctic charr and European whitefish assemblies we found that and 91% of the reads successfully aligned.
- amino acid identity), which collapsed around 12% of the transcripts.
- However, in those other published assemblies, no annotation was found for around half of the transcripts.
- Therefore, we performed additional filtering and analyses on the de novo assemblies to produce comprehensive reference gene sets for each of the four species.
- Furthermore, ~60% of the query transcripts aligned significantly (-evalue 1e-3) to the Atlantic salmon reference sequences over more than 70% of their length (Fig.
- We employed three robust, reference-based methods to evaluate and compare the completeness of the gene set of our four transcriptomes.
- First, protein gene set completeness was assessed using the BUSCO pipeline, which revealed that the majority of the Actinopterygian core genes had been successfully recovered in all four assemblies.
- Specifically, of the 4584 single-copy orthologs searched, we recovered 76% to.
- Only between 10 and 13% of the 4584 single-copy orthologs were classified as missing from our assemblies, indicating good coverage and high quality of the assembly of the protein-coding transcriptomes for these species.
- We found that BUSCO recovery in the current assemblies was three times greater than that identified for the PhyloFish assembly of the corresponding species.
- For the brown trout and European whitefish assemblies presented here, we recovered 78% and 76% of the BUSCOs completely, respectively, whereas only 26% of BUSCOs were completely recovered in either of the previous trout and whitefish assemblies (Table 3).
- Atlantic salmon (yellow), brown trout (green), Arctic charr (blue) and European whitefish (red).
- high-quality of the dataset, recovery for both ‘complete’.
- Future combination of the current assemblies with RNA-seq data generated from different developmental stages could offer a promising means of producing transcriptomes with even greater levels of completion for these species.
- Over of the transcripts that were identified as putative orthologs were shared across all four species.
- We also found that approximately 50% of the inferred orthogroups were represented by at least three species, and that over 84% of the orthologous transcripts identified in our four assemblies were shared by at least one of the other species’ assemblies (Fig.
- Additionally, we found that ~92% of the total transcripts in each of the four assemblies were orthologous with at least one transcript from the Atlantic salmon RefSeq protein dataset (Additional file 2: Table S2).
- The marked level of sequence overlap observed between the four current transcriptomes, as well as between the published set of Atlantic salmon RefSeq proteins, further validates the completeness and quality of the assemblies presented here.
- This statement is further supported by the additional OrthoFinder analyses we performed comparing the orthogroup distribution size of the current salmon assembly (at all four filtering steps: unfiltered, after TransDecoder single-best ORF prediction, after CD-Hit clustering at 100% identity and after Trinity full-length transcript analysis (e.g.
- Given the high quality of the recently published protein set for Atlantic salmon, we were able to empirically test whether we had successfully re-constructed a comprehensive set of orthologous transcripts in our assemblies.
- The results demonstrated good consistency, both between the present and existing protein sets for Atlantic salmon, as well as between subsequent filtering steps of the current salmon assembly (Additional file 3: Figure S2.
- Despite the relatively strict filtering we applied to the current assemblies, we found that only between 0.04 and 11% of the total orthogroups were lost during filtering.
- As such, these results further vindicate the quality of the assemblies we present here.
- For example, an RNA-seq analysis of transcriptomic diversity between ecologically divergent cichlid species, Amphilophus astorquii and Amphilophus zaliosus, found that over 50% of the 24,174 and 21,382 ESTs (respectively) were orthologous between the species [64].
- Table 3 Summary of the complete, duplicated, fragmented and missing orthologs inferred from Benchmarking Universal.
- This is consistent with the recent Atlantic salmon reference genome publication, which found that 98% of the NCBI mRNA sequences for Atlantic salmon aligned to the genome [12].
- 70% coverage) length compared to the NCBI protein database for Atlantic salmon.
- Of the seven assemblies, we found that the number of transcripts reconstructed to full-length was highest in the PhyloFish brown trout [35] assembly (19,404 transcripts), followed by the four current assemblies (11,099 to 13,546 transcripts), then the PhyloFish European whitefish [35] assembly (5073 transcripts), with the lowest number of full-length transcripts recovered in the Magnanou et al.
- Table 4 Alignment statistics of the new de novo transcriptomes mapping to the Atlantic salmon reference genome ICSASG_v2 Assembly Number of transcripts.
- This further supports the quality of the new assemblies presented here and the relevance of their contribution to the currently available resources for salmonids.
- A total of 8038 sequences were identified as overlapping between the charr transcriptomes, which is representative of around 24% of the current assembly and 23% of the Magnanou et al.
- ~86% of the total transcripts from the current assembly and ~41% of the total PhyloFish transcripts.
- Similarly, for European whitefish we found sequence overlap for 28,499 transcripts, representative of ~85% of the total transcripts from the current assembly and ~38% of the PhyloFish transcripts.
- Here we made no direct assessment of the cause of the differences between the charr assemblies, or of the high proportion of transcripts that were unique to all three previous assemblies; however, we offer several possible explanations.
- Given that successful alignment to the NCBI Atlantic salmon protein database was used as part of our filtering pipeline, 100% of the final set of transcripts for all four assemblies are annotated to the salmon database (Table 6).
- The consistently complete or near complete annotation obtained across both protein databases gives us high confidence in the accuracy of the assembled transcripts.
- Respectively, in the published lake whitefish [32], coho salmon [33] and Arctic charr [34] transcriptomes, 54, 40 and 48% of the transcripts were unannotated.
- Higher annotation success was obtained for the six salmonid species included in the PhyloFish database, with unannotated transcripts comprising just 9 to 15% of the assemblies [35].
- SwissProt/UniProtKB accessions are among the most widely recognised by GO analysis software; therefore, the high level of annotation against the SwissProt database makes our four assemblies very useful for future comparative analyses and downstream applications.
- Consistency across the assemblies further indicates accuracy of the assemblies and the assigned annotations.
- We performed a separate GO analysis on the ‘assembly-specific’ transcripts and observed no difference in the number and assortment of the gene ontology terms compared to the complete dataset (Fig.
- distinguish between orthologous and paralogous sequences in our transcriptome assemblies and by combining the results generate a robust approximation of the number of paralogous sequences.
- First, using the BUSCO tool, we found that 34 to 37% of the single-copy orthologs detected in our assemblies were duplicated (Table 3).
- Of the total number of transcripts in each of the species’.
- Further, to the best of our knowledge, the results presented here represent the most comprehensive identification of true paralogs within de novo assembled transcriptomes for salmonids, demonstrating a considerably higher capture rate than reported previously for the coho salmon transcriptome, where 29% of the assembled transcripts were identified as duplicates [33].
- However, it is important to note that although we have high confidence in our identified paralogs, they are not representative of the complete set of paralogs present across the genome.
- Publication of the high-quality reference genome for Atlantic salmon has provided invaluable insight into the rediploidization process and the evolutionary fate of duplicated genes within the salmonid genome [12].
- [12] found that 55% of the duplicated genes created during the salmonid-specific WGD event have been retained as two functional copies in the genome.
- The increased complexity of the salmonid genomes makes it difficult to distinguish between true paralogs and duplicated sequences that result from sequencing error and mis-assembly.
- The reduced proportion of duplicate genes (34 to 37%) identified in the current study is likely a result of the current limitations for de novo assembly algorithms.
- Specifically, de novo assemblers, such as Trinity, are not able to distinguish between similar paralogs, therefore reconstruction of the complete set of paralogs for species with such highly duplicated genomes remains a major challenge.
- Furthermore, we provide a comprehensive overview and characterization of the generated transcriptomes, as well as presenting a comparison across these four species.
- The marked level of continuity and completeness of the transcriptomes is highly supported by several methods of quantitative and qualitative assessment.
- Database of the orthogroups containing Atlantic salmon RefSeq proteins (GCF and the corresponding transcripts from four de novo protein-coding transcriptomes presented here;.
- Cumulative number of unique matching proteins that aligned to the NCBI protein database for Atlantic salmon (GCF at a given coverage: Atlantic salmon (yellow), brown trout (green), Arctic charr (blue), European whitefish (red), Magnanou et al.
- ‘species-specific’ transcripts for each of the four assemblies.
- We thank Alex Lyle, Stuart Wilson, Simon McKelvey, and Oliver Hooker for their valuable contributions to generating, rearing and sampling of the specimens used in this study.
- All authors read, commented on and approved the final version of the manuscript.
- Sequencing the genome of the Atlantic salmon (Salmo salar).
- A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification.
- Combined de novo and genome guided assembly and annotation of the Pinus patula juvenile shoot transcriptome.
- de novo assembly and annotation of the freshwater crayfish Astacus astacus transcriptome
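
The summary above describes four BUSCO categories: complete (length within two standard deviations of the mean length of the given BUSCO), duplicated (a complete BUSCO represented by more than one transcript), fragmented (partially recovered) and missing (not recovered). The Python sketch below illustrates only that length-based rule. It is not the BUSCO tool itself, which also relies on profile-based scoring, and the function names, input structures and example values are hypothetical, chosen purely for illustration.

```python
from collections import Counter, defaultdict

CATEGORIES = ("complete", "duplicated", "fragmented", "missing")


def classify_buscos(expected_lengths, hits):
    """Assign each BUSCO to one of the four categories.

    expected_lengths: dict of BUSCO id -> (mean_length, sd_length)
    hits: iterable of (busco_id, transcript_id, matched_length)
    Both input layouts are assumptions made for this sketch.
    """
    hits_by_busco = defaultdict(list)
    for busco_id, transcript_id, matched_length in hits:
        hits_by_busco[busco_id].append((transcript_id, matched_length))

    status = {}
    for busco_id, (mean_len, sd_len) in expected_lengths.items():
        busco_hits = hits_by_busco.get(busco_id, [])
        if not busco_hits:
            status[busco_id] = "missing"  # not recovered at all
            continue
        # Transcripts whose matched length lies within mean +/- 2 SD.
        complete_transcripts = {
            transcript_id
            for transcript_id, length in busco_hits
            if abs(length - mean_len) <= 2 * sd_len
        }
        if len(complete_transcripts) > 1:
            status[busco_id] = "duplicated"  # complete in more than one transcript
        elif len(complete_transcripts) == 1:
            status[busco_id] = "complete"  # complete in exactly one transcript
        else:
            status[busco_id] = "fragmented"  # only partially recovered
    return status


def summarise(status):
    """Return the percentage of BUSCOs in each category."""
    total = len(status)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(status.values())
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}


if __name__ == "__main__":
    # Made-up example values, not data from the paper.
    example_expected = {"b1": (300, 20), "b2": (500, 30), "b3": (200, 10), "b4": (400, 25)}
    example_hits = [
        ("b1", "t1", 310),
        ("b2", "t2", 495),
        ("b2", "t3", 520),
        ("b3", "t4", 90),
    ]
    print(summarise(classify_buscos(example_expected, example_hits)))
```

In this sketch, "complete" means complete in a single transcript, so the four percentages sum to 100.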
