
A high-throughput SNP discovery strategy for RNA-seq data



- The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery.
- However, SNP development is limited by the availability of reliable SNP discovery methods.
- In particular, the optimum assembler and SNP caller for accurate SNP prediction from next generation sequencing data are not known.
- Results: Herein we performed SNP prediction based on RNA-seq data of peach and mandarin peel tissue under a comprehensive comparison of two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, oases, SOAPdenovo, Trans-abyss) and two SNP callers (GATK and GBS).
- The rate of false positive SNPs was significantly lower when the paired-end read length was 150 bp compared with 125 bp.
- The combination of the Trinity assembler, the GATK SNP caller, and a 150 bp paired-end read length performed best in SNP discovery, with 100% accuracy.
- Conclusions: Through comparison of authentic SNPs obtained by PCR cloning strategy and putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provided a reliable and efficient strategy, Trinity-GATK with 150 bp paired-end read length, for SNP discovery from RNA-seq data.
- Keywords: Single nucleotide polymorphism (SNP), RNA-seq, Paired-end read length, Trinity, GATK.
- The basic procedures for converting raw data generated from whole genome or transcriptome sequencing into a final SNP result include obtaining raw data from NGS platforms, assembling, and SNP calling to identify SNP between the same unigenes or between different samples from different plant varieties or within one sample when the plant is genetically heterogeneous.
- The availability of multiple choices of read lengths, assemblers and SNP callers makes SNP analysis even more complicated.
- It has been reported that different read lengths, assemblers and SNP callers all contribute to the accuracy and reliability of the final SNP result [22–25].
- Therefore, theoretically a longer read length can produce higher quality raw data and influence downstream analyses, although no reports on the effects of read length on the accuracy of SNP discovery are available.
- At present, the best combination of assembly and SNP calling software for achieving the most accurate SNP data from RNA-seq is not known.
- Access to SNP information from RNA-seq data is a formidable task, limited by the availability of reliable SNP discovery methods, including assembly and SNP calling pipelines that resolve the problems of genotyping errors and missing data.
- Here, transcriptome sequencing of two peach cultivars and two mandarin cultivars was completed; for peach, two paired-end read lengths were used.
- The effects of different paired-end read lengths, assemblers, and SNP callers on the accuracy of SNP results were investigated, and it was found that SNPs can be accurately discovered by performing RNA-seq with a 150 bp read length, assembling with Trinity and SNP calling with GATK.
- The study provides general guidelines for accurate SNP discovery from transcriptome data.
- Paired-end read lengths of 125 bp and 150 bp were applied to peach and 150 bp to mandarin.
- After pre-processing filtering of low-quality sequences and adaptor trimming, high quality sequencing reads that passed thresholds were assembled for further SNP discovery analysis.
- Reads filtered, processed and assembled into contigs as described above were used for SNP discovery.
- When the paired-end read length was 125 bp, the percentages of multi-mapped reads were approximately 60% in both ‘HJ’ and ‘YL’, and those of uniquely mapped reads were around 40% (Additional file 2:.
- When the paired-end read length was 150 bp, the percentages of multi-mapped reads declined dramatically to less than 30%, while the percentages of uniquely mapped reads were significantly higher (Additional file 2: Table S2).
- These data imply a strong influence of paired-end read length on read mapping quality.
- Depending on the paired-end read lengths, and the assemblers and
- Table 1 Summary of the sequencing data of ‘Hujingmilu’ peach and ‘Ponkan’ mandarin libraries with either 125 bp or 150 bp paired-end read lengths. Libraries listed include HJ-150 bp, YL-150 bp, PK-150 bp and YP-150 bp.
- The data from the two paired-end read lengths (125 bp and 150 bp) in peach were compared, and the number of SNPs predicted with the same assembler and SNP caller was found to be affected by read length.
- For four of the assemblers (all except oases), the number of SNPs predicted was higher when the read length was 125 bp (Additional file 3:.
- Irrespective of the read lengths and SNP callers employed, the minimum number of SNPs was generated with the assembler IDBA_tran for both peach and mandarin (Additional file 3: Table S3).
- To explore which assembler and SNP caller were most suitable for accurately detecting SNPs, the SNPs predicted through transcriptomic analysis were compared with those obtained via PCR amplification followed by gene cloning and sequencing.
- Five anthocyanin biosynthesis related genes in peach and nine carotenogenic genes in mandarin, each possessing putative SNPs predicted by at least one combination of read length, assembler and SNP caller, were selected.
- Table S4), and the number of SNPs predicted through different combinations of assemblers and SNP callers for transcriptome analysis, as well as the accuracy, are summarized in Table 2.
- All these combinations were applied to transcriptomic data from paired-end read lengths of 125 bp and 150 bp, respectively, and an overview of raw data for targeted genes is listed in Additional file 6: Table S6 and Additional file 7: Table S7.
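The set of strategies compared here is the cross product of the assemblers, SNP callers and read lengths named in the text. The following is a minimal sketch of that enumeration, not the authors' pipeline: the function name `strategies` and the tuple layout are illustrative assumptions, and tool names are spelled as they appear in this summary.

```python
from itertools import product

# Tool names as written in the text; the tuple layout is an illustrative choice.
ASSEMBLERS = ["Trinity", "IDBA", "oases", "SOAPdenovo", "Trans-abyss"]
SNP_CALLERS = ["GATK", "GBS"]


def strategies(read_lengths_bp):
    """Return every (read length, assembler, SNP caller) combination."""
    return list(product(read_lengths_bp, ASSEMBLERS, SNP_CALLERS))


# Peach was sequenced at both read lengths, mandarin only at 150 bp.
peach_strategies = strategies([125, 150])
mandarin_strategies = strategies([150])

print(len(peach_strategies), len(mandarin_strategies))
```

Five assemblers times two callers gives the ten combinations per read length, so peach has twenty runs to compare and mandarin has ten.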
- The accuracy of SNP discovery in peach was compared among the ten combinations of assemblers and SNP callers.
- Different read lengths with the same assembler and SNP caller gave distinct results for the same samples.
- The accuracy of SNP discovery, reflected in a higher percentage of true SNPs discovered and a much lower percentage of false positive SNPs predicted, was higher with a read length of 150 bp than with 125 bp (Table 2).
- Across the ten combinations, the average percentage of true SNP discovery was 50.25% and 39.25% at read lengths of 150 bp and 125 bp, respectively, while the percentage of false positive SNP discovery was 6.11% and 34.45% (Table 2).
- Moreover, with a read length.
- Fig. 1 A simplified workflow of analysis strategies for RNA-seq and SNP discovery.
- There was no obvious correlation between the percentage of authentic SNP discovery and the percentage of false positive SNP discovery with a read length of 150 bp (Table 2).
- The accuracy of SNP discovery was also affected by the combination of assembler and SNP caller.
- When the read length was 150 bp, the combination of Trinity and GATK produced the highest accuracy, i.e., 100% discovery of authentic SNPs and no false positive SNP predictions (Table 2).
- No combination with 100% accuracy was observed when the read length was 125 bp, further indicating the importance of a longer read length.
- For a read length of 150 bp, no false positive SNPs were predicted with four assemblers (the single exception being IDBA) when the SNP caller GBS was used.
- Compared with GATK, the high filtration standard of GBS reduced the error rate but, on the other hand, missed many authentic SNPs, and the rate of true SNP discovery dropped to 15% with four assemblers and to 5%.
- The study showed that not only SNP callers but also assemblers had a strong influence on the accuracy of SNP discovery from RNA-seq data.
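The two accuracy figures discussed above can be computed by comparing a set of predicted SNPs against the set validated by cloning and sequencing. The sketch below assumes definitions that this summary does not spell out: true discovery is taken as recovered authentic SNPs over all authentic SNPs, and the false positive rate as non-authentic predictions over all predictions. The function name, the `(gene, position)` identifiers and the data are hypothetical placeholders, not values from the paper.

```python
def snp_accuracy(authentic, predicted):
    """Return (true_discovery_pct, false_positive_pct) for one strategy.

    Assumed definitions (not stated in the source):
      true_discovery_pct = authentic SNPs recovered / all authentic SNPs
      false_positive_pct = non-authentic predictions / all predictions
    """
    authentic, predicted = set(authentic), set(predicted)
    if not authentic:
        raise ValueError("need at least one authentic SNP")
    recovered = authentic & predicted
    false_positives = predicted - authentic
    true_pct = 100.0 * len(recovered) / len(authentic)
    fp_pct = 100.0 * len(false_positives) / len(predicted) if predicted else 0.0
    return true_pct, fp_pct


# Synthetic placeholder data: 40 made-up sites spread over 5 made-up genes,
# mirroring only the counts mentioned for peach (40 SNPs, five genes).
authentic_snps = [("gene%d" % (i % 5 + 1), 100 + i) for i in range(40)]

# A caller that recovers every authentic SNP and nothing else.
perfect = snp_accuracy(authentic_snps, authentic_snps)

# A strict caller that reports only 6 of the 40 authentic SNPs.
strict = snp_accuracy(authentic_snps, authentic_snps[:6])

# The same 6 true calls plus 2 spurious ones.
noisy = snp_accuracy(
    authentic_snps, authentic_snps[:6] + [("gene1", 9001), ("gene2", 9002)]
)

print(perfect, strict, noisy)
```

Under these assumed definitions, recovering 6 of 40 authentic SNPs corresponds to a 15% true discovery rate, which illustrates how a stringent filter can remove false positives while missing most real variants.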
- To investigate whether the optimum combination of assembler and SNP caller is independent of plant species, all ten combinations were also applied to transcriptomic data from mandarin with a read length of 150 bp.
- The number of SNPs predicted through different combinations of assemblers and SNP callers for transcriptome analysis, as well as the accuracy, are summarized in Table 3.
- Table 2 Accuracy of SNP predictions from ten combinations of assemblers and SNP callers, with 40 authentic SNPs present in five anthocyanin biosynthesis related genes in peach as an example. RNA-seq was performed with read lengths of 125 bp and 150 bp. Column header: Paired-end read length (bp).
- In summary, the combination of Trinity with GATK was the best strategy for SNP discovery, obtaining 100% accuracy in peach and mandarin when the read length was 150 bp, and this strategy might be applicable to a wide range of plants and other organisms.
- Characterization of SNPs in peach and mandarin transcriptomes.
- As described above, using the combination of Trinity with GATK and a read length of 150 bp, SNPs in transcriptome data of peach and mandarin can be accurately discovered.
- RNA-seq datasets and the subsequent SNP discovery often differ because sequencing quality is affected by read length, sequencing depth and sequencing platform, as well as by various downstream analyses.
- In this study, our data suggested that the accuracy of SNP discovery was affected by paired-end read lengths, assemblers and SNP callers.
- Previously, the effect of read length on the accuracy of SNP discovery has not been reported, although it was suggested that a higher read length can
- Table 3 Accuracy of SNP predictions from ten combinations of assemblers and SNP callers, with 240 authentic SNPs present in nine carotenogenic genes in mandarin as an example. RNA-seq was performed with a read length of 150 bp.
- Here our data indicated that a longer read length is necessary for high quality SNP discovery.
- The third factor affecting SNP discovery that we found in this study is the SNP caller.
- The study revealed the merits and defects of two paired-end read lengths, five assemblers and two SNP callers for SNP analysis, and provides a detailed comparison of the different methods for reference.
- Highly efficient SNP discovery has become possible in recent decades owing to the availability of huge amounts of raw data from next generation sequencing.
- Although several assemblers and SNP callers have been used in previous studies, the accuracy was rarely evaluated and the optimum combination of assembler and SNP caller is not clear.
- In addition, the effect of the paired-end read length of sequencing on the accuracy of SNP discovery has not been.
- YP) transcriptomes using Trinity and GATK with read length of 150 bp.
- Here we evaluated the accuracy of SNP discovery from combinations of two paired-end read lengths, five assemblers, and two SNP callers, and established an ideal strategy, i.e., obtaining sequencing raw data from 150 bp paired-end read length, assembling with Trinity and SNP calling with GATK, for SNP discovery.
- With the advantages of high throughput, high accuracy, low cost, and especially, being independent of a reference genome, the method established here can be expected to be widely used for SNP discovery in genetic diversity analysis, breeding and genome-wide association studies.
- The sequencing was performed on Illumina HiSeq™ 2500 and 4000 platforms, and paired-end reads were generated.
- Paired-end read lengths of 125 bp and 150 bp were used for the peach cultivars and a length of 150 bp for the mandarin cultivars.
- Transcriptome assembly and SNP detection.
- A simplified workflow of assembly and SNP calling is outlined in Fig.
- Additional file 1: Table S1.
- Additional file 2: Table S2.
- Additional file 3: Table S3. Number of SNPs predicted from RNA-seq data under different paired-end read lengths, assemblers and SNP callers.
- Additional file 4: Table S4. (DOCX 17 kb)
- Additional file 5: Table S5. (DOCX 21 kb)
- Additional file 6: Table S6. An overview of the number of SNPs predicted in targeted genes from peach with ten different strategies under the read length of 125 bp.
- Additional file 7: Table S7. An overview of the number of SNPs predicted in targeted genes from peach with ten different strategies under the read length of 150 bp.
- Additional file 8: Table S8. An overview of the number of SNPs predicted in targeted genes from mandarin with ten different strategies under the read length of 150 bp.
- YP) (B) libraries using Trinity and GATK with read length of 150 bp.
- YP) libraries using Trinity and GATK with read length of 150 bp.
- RNA-Seq: RNA-sequencing.
- SNP discovery through next-generation sequencing and its applications.
- SNP discovery in the transcriptome of white Pacific shrimp Litopenaeus vannamei by next generation sequencing.
- GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data.
- Novel tools for conservation genomics: comparing two high-throughput approaches for SNP discovery in the transcriptome of the European hake.
- Genotype and SNP calling from next-generation sequencing data.
- The impact of read length on quantification of differentially expressed genes and splice junction detection.
- The impacts of read length and transcriptome complexity for de novo assembly: a simulation study.
- Read length versus depth of coverage for viral quasispecies reconstruction.
- De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res.
