
Impact of sequencing depth and technology on de novo RNA-Seq assembly


Summary

- Background: RNA-Seq data is inherently nonuniform for different transcripts because of differences in gene expression.
- However, the amount of genomic sequence assembled did not plateau for many of the analyzed organisms.
- Most of the unannotated genomic sequences are single-exon transcripts whose biological significance will be questionable for some users.
- On the issue of sequencing technology, both of the analyzed platforms recovered a similar number of full-length transcripts.
- The missing “gap” regions in the HiSeq assemblies were often attributed to higher GC contents, but this may be an artefact of library preparation and not of sequencing technology.
- DNBseq™ is a viable alternative to HiSeq for de novo RNA-Seq assembly.
- Keywords: RNA-Seq assembly, Sequencing depth, Sequencing technology.
- RNA-Seq is a widely used next-generation sequencing (NGS) methodology for transcriptome profiling [1], both to identify novel transcript sequences and for differential expression studies.
- Much has been written about this methodology and it is not our intention to rehash the many excellent articles that can be found in the literature [2, 3].
- scientists ask before they initiate an RNA-Seq experiment.
- polyploidy, and the outbred nature of many samples collected in the wild make genome assembly a continuing challenge.
- 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
- Full list of author information is available at the end of the article.
- There are several technical differences in the two sequencing pipelines, which are illustrated in Fig.
- Adapters are ligated to these fragments and are processed to produce libraries (2) containing single-strand DNA circles with an adapter in the middle for DNBSeq™ and linear double-stranded DNA with adapters at each end for Illumina.
- The small size of the DNBs.
- Sequencing depth is an important consideration for RNA-Seq because of the tradeoff between the cost of the experiment and the completeness of the resultant data.
- As more sequencing is done to assemble less abundant transcripts, a greater proportion of the additional reads will come from transcripts that already have sufficient depth to assemble.
- RNA-seq datasets are usually a few Gbp in size, but in the public databases there are some unusually large datasets, with sizes in the many tens of Gbp.
- Alignments of the high-quality scaffolds remaining were compared to the transcriptome annotations in the reference genomes.
- Even in the curves that reach the asymptote, it is not because the genome or annotated exome has been completely covered, as only 60% of the genome bases and 75% of the exon bases have transcripts aligned against them in the most complete case (Drosophila).
- Why so many are not in the “official”.
- An important consideration is the proportion of the assembled sequences that align with and without introns.
- In Homo sapiens, Mus musculus, and Arabidopsis thaliana libraries, a majority of the unannotated single-exon material is intronic, suggesting but not proving that they are simply unprocessed mRNAs.
- Some species contain higher levels of ORFs in the unannotated scaffolds, which may be partially explained by missing annotations in the references.
- The proportion of annotated scaffolds not containing ORFs is likely affected by the completeness of the assembly.
- We count the total number of unique bases in the alignments, based either on the genome or the exome.
- The vertical scale normalizes the exome size of the 16 Gbp assemblies to unity.
- 2 are consistent with the total amounts of non-minimal introns [20] in the underlying genomes.
- HiSeq and DNBseq™ platforms are nearly equivalent except in the most GC-rich regions.
- Within the context of the previous discussion, and in particular Fig. 2, we are most interested in recovering the exome that appears in the genome annotations.
- Many scaffolds align partially to the exome.
- annotation threshold, of either the scaffold length or of the reference transcript length.
- Annotated and unannotated SE refer to the proportion of annotated and unannotated transcripts that are single-exon.
- Unannotated SEI refers to the proportion of unannotated single-exon transcripts that are intronic.
- Annotated and Unannotated ORF refer to the proportion of scaffolds in each category that have ORFs of at least 100 amino acids in length, out of the scaffolds that are at least 300 bases long.
- All of the analyzed sequence was for libraries created from the Universal Human Reference RNA (UHRR), which comprises RNA from ten human cell lines and is commonly used as a control for microarray gene-expression experiments.
- Target sizes were and 10 Gbp, to the extent that sufficient data was available in the source library.
- Each user will have to decide for him/herself if this additional genome coverage is worthwhile, given that it was not included in the exome annotations from GENCODE.
- Next, the scaffolds were aligned against the GRCh38 transcriptome using the LAST aligner [25], to evaluate the completeness of the RNA-seq assembly.
- To be declared complete, at least 95% of the annotated transcript must be aligned to by a single RNA-seq scaffold.
- DNBseq™ results were fairly consistent in the number of complete transcripts recovered for each subset size.
- We also combined all of the DNBseq™ libraries and assembled subsets of different sizes.
- This is likely because the sequences that are sampled in different libraries are complementary and occur in sufficient quantity such that they will be assembled in the combined libraries.
- To get an idea of the overlap in complete assembled transcripts between the two sequencing platforms, we compared the complete transcripts for the 4 Gbp subset assemblies, as that was the largest available subset in most of the libraries.
- We then compared the GC contents of the assembled regions to the gap regions.
- In contrast, albeit for only some of the HiSeq libraries, the gap regions reveal a bias against GC-rich sequence.
- The fact that Illumina libraries can be susceptible to both high and low GC biases has previously been reported [10–12], although there are techniques that can reduce the magnitude of the biases.
- To ensure that the subset of transcripts is representative of the complete set of transcripts, we counted the number of transcripts with GC-content in 1% segments for both sets.
- Finally, we estimated the transcript abundances using Kallisto [26], focusing again on the 3 representative libraries, plotting the ratio of transcript abundances from the DNBseq™ library to each of the two HiSeq libraries.
- the 9,687 transcripts (4.75% of all transcripts) with abundances meeting the threshold in ERR1831367 and SRR1261168, we see a slight slope in the regression-fitted line, as expected if there is a bias against higher GC-content reads in the HiSeq data.
- How much RNA-Seq data is optimal? It is well-known that there are diminishing returns to ever deeper transcriptome sequencing and the exact choice will always be a function of budget vs ambition.
- Cumulants of GC-content in the assembled versus gap regions for (a) DNBseq™ libraries and (b) HiSeq libraries.
- Relative coverage as a function of GC-content, computed on 100-base windows across the set of 565 transcripts described in the text.
- These debates escalated when the ENCODE Consortium assigned biochemical functions to 80% of the human genome [29].
- It is up to the.
- In the former category, the most pertinent question is if the DNBseq™ platform (BGISEQ-500 and more recent MGISEQ-2000 and MGISEQ-T7, which are capable of PE150 reads) is a viable alternative to Illumina.
- Here, we show that for recovery of transcript sequences from de novo assembled RNA-Seq libraries the two platforms give equally good results.
- Some of the Illumina libraries under-represented GC-rich sequences, leading to gaps in the assemblies.
- Fig. 8. A slight bias in the transcript abundances vs GC-content.
- The log ratio of the expression levels of (a) SRR1261168, the HiSeq run with the most complete assemblies and (b) SRR950078, the HiSeq run with the least complete assemblies, compared to ERR1831364, the DNBseq™ run with the most complete assemblies.
- systematic analysis of library making protocols that is beyond the scope of this publication, it is unclear if this is an intrinsic disadvantage of the Illumina platform.
- Increasing sequencing depth of RNA-Seq experiments has quickly diminishing returns in terms of exomic sequence assembled.
- A large portion of the additional sequences assembled as sequencing depth increases appears to be unannotated single-exon transcripts.
- DNBseq™ is a viable alternative to HiSeq for de novo RNA-Seq assembly.
- Higher levels of GC-bias are seen in some of the Illumina libraries, which is likely attributable to differences in library preparation.
- To improve alignment rates in the next step, five bases were also trimmed from the 5′-ends of reads in the SRR1523365 (C.
- Scaffolds that align to the genome over at least 98%.
- A scaffold is considered to be annotated if it has an alignment that is greater than that percentage threshold, where the denominator on the percentage calculation is the length of the scaffold or the length of the reference transcript (whichever is more favorable).
- Reads are eliminated when more than 10% of the bases have a PHRED score of less than 10 or when more than 1% of the bases are ambiguous N’s.
- Regions are declared to be a gap when there are no scaffolds that align to that region of the annotated transcript and, moreover, there are no reads (assembled or unassembled) that align to that region.
- For each 100 bp window along each transcript, we calculate the ratio of the read depth for that window against the average read depth along the transcript, as well as the GC-content of that window.
- For each library pair, we plot the ratio of the transcript abundance for each transcript, as a function of GC-content.
- RNA-Seq: RNA sequencing.
- All sequencing data used in this study were previously available on SRA, with the identifiers described in the datasets sections.
- Project name: Supporting code for “Impact of sequencing depth and technology on de novo RNA-Seq assembly”.
- ZZ, DA, XL, CG and RD are employees of MGI, which makes one of the technologies being evaluated.
- However, they only provided the sequence, and the University of Alberta based authors were responsible for the analyses, results, and discussions in the paper.
- RNA-Seq: a revolutionary tool for transcriptomics.
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.
- Phylotranscriptomic analysis of the origin and early diversification of land plants.
- An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome.
- Comparative performance of the BGISEQ-500 vs Illumina HiSeq2500 sequencing platforms for palaeogenomic sequencing.
- Assessment of the cPAS-based BGISEQ-500 platform for metagenomic sequencing.
- Comparative performance of the BGISEQ-500 and Illumina HiSeq4000 sequencing platforms for transcriptome analysis in plants.
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.
- SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.
- BLAT — The BLAST-Like Alignment Tool.
- Near-optimal probabilistic RNA-seq quantification.
- Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs.
- An integrated encyclopedia of DNA elements in the human genome.
- On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE.
- Defining functional DNA elements in the human genome.
- Proceedings of the National Academy of Sciences.
- A reference human genome dataset of the BGISEQ-500 sequencer.
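
The coverage criteria in the summary (counting unique aligned bases, declaring a transcript complete when at least 95% of it is aligned to by a single scaffold, and calling a scaffold annotated using whichever denominator is more favorable) reduce to simple interval arithmetic. The sketch below assumes half-open `(start, end)` coordinates and hypothetical function names; the annotation threshold is left as a required parameter because its value is not stated in the text above, and "more favorable" is interpreted here as the larger of the two fractions.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def unique_bases(intervals):
    """Count reference bases covered by at least one alignment block."""
    return sum(end - start for start, end in merge_intervals(intervals))


def is_complete(transcript_len, scaffold_blocks, threshold=0.95):
    """True if ONE scaffold's aligned blocks cover >= threshold of the transcript."""
    return unique_bases(scaffold_blocks) / transcript_len >= threshold


def is_annotated(aligned_len, scaffold_len, transcript_len, threshold):
    """True if the aligned fraction exceeds the threshold, using the more
    favorable (larger) of the two possible fractions."""
    best = max(aligned_len / scaffold_len, aligned_len / transcript_len)
    return best > threshold
```

For example, `unique_bases([(0, 10), (5, 20), (30, 40)])` counts the overlap between the first two blocks only once.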
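
The ORF statistic in the summary counts scaffolds of at least 300 bases that contain an ORF of at least 100 amino acids. The text does not say how ORFs were called, so the sketch below makes its own assumptions: an ORF runs from an ATG to the next in-frame stop codon, all six reading frames are searched, and ORFs that run off the end of the scaffold are ignored. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(seq):
    """Length in amino acids of the longest ATG-to-stop ORF in six frames."""
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    best = 0
    for strand in (seq, reverse_complement):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in _STOP_CODONS:
                    # codons from ATG up to, but not including, the stop
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def has_orf(seq, min_aa=100, min_len=300):
    """None if the scaffold is too short to be counted, else True/False."""
    if len(seq) < min_len:
        return None
    return longest_orf_aa(seq) >= min_aa
```

Returning `None` for short scaffolds keeps them out of both the numerator and the denominator, matching the "out of the scaffolds that are at least 300 bases long" wording.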
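
The relative-coverage calculation in the summary (for each 100 bp window along a transcript, the ratio of the window's read depth to the transcript's average read depth, paired with the window's GC-content) can be sketched like this. The function name and the per-base depth list are illustrative; non-overlapping windows and dropping a trailing partial window are assumptions, since the text does not specify either.

```python
def window_stats(seq, depth, window=100):
    """Return a list of (gc_fraction, relative_depth) per full window.

    seq   -- transcript sequence
    depth -- per-base read depth, same length as seq
    """
    if len(seq) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    mean_depth = sum(depth) / len(depth) if depth else 0.0
    stats = []
    # non-overlapping windows; a trailing partial window is dropped (assumption)
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        gc_fraction = (chunk.count("G") + chunk.count("C")) / window
        window_depth = sum(depth[start:start + window]) / window
        relative = window_depth / mean_depth if mean_depth > 0 else float("nan")
        stats.append((gc_fraction, relative))
    return stats
```

Pooling these pairs across transcripts and averaging the relative depth within each GC bin would give a curve of relative coverage as a function of GC-content.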