
Discerning novel splice junctions derived from RNA-seq alignment: A deep learning approach

- Results: In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions.
- We show (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq samples significantly reduces 43 million candidates to around 3 million highly confident novel splice junctions.
- Conclusions: A model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data has been implemented.
- The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating a reliable performance and gross usability for classifying novel splice junctions derived from RNA-seq alignment.
- can potentially identify novel splice junctions between exons by the evidence of spliced alignments.
- However, novel splice junctions predicted by read alignments are not totally reliable, since the possibility of randomly mapping a short read up to 150 bases to the large reference genome is high [13], especially when gapped alignments with short anchoring sequences are permitted.
- [14] that investigated splicing variation, 21,504 RNA-seq samples from the Sequence Read Archive (SRA) were aligned to the human hg19 reference genome with Rail-RNA [15], identifying 42 million putative splice junctions in total.
- This value is 125 times the number of total annotated splice junctions in humans, making it implausible that all of them actually exist.
- False positive splice junctions may lead to false edges in splice graphs, significantly increasing the complexity of the graphical structures [16].
- Conventional strategies designed to filter out false positive exon splice junctions depend primarily on two properties: (1) the number and the diversity of reads mapped to the given splice junction [13].
- Thus, further classification of putative splice junctions revealed by RNA-seq data is still necessary but remains a challenging issue.
- Our method is applicable to both splice site prediction and splice junction classification.
- Trained on an older version of the GENCODE project gene annotation data [37], we show that our algorithm can predict the newly annotated splice junctions with high accuracy and performs better than the splice site-based approach.
- The application of DeepSplice to further classify the putative human splice junctions in intropolis from Nellore et al. [14] eliminates around 83% of unannotated splice junctions.
- We discover that the combinational information from the functional pairing of donor and acceptor sites facilitates the recognition of splice junctions and demonstrate from large amounts of sequencing data that non-coding genomic sequences contribute much more than coding sequences to the location of splice junctions [25, 38].
- We then evaluated DeepSplice’s performance by classifying annotated splice junctions from.
- Finally, we applied DeepSplice to intropolis [14], a newly published splice junction database with splice junctions derived from 21,504 samples.
- splice site classification.
- DeepSplice predicts newly annotated splice junctions with high accuracy.
- Next, we evaluated the accuracy of DeepSplice in terms of splice junction classification.
- To achieve this, we trained DeepSplice using splice junctions extracted from the GENCODE annotation version 3c, and then tested the model on newly annotated splice junctions in the GENCODE annotation version 19.
- The training set contains 521,512 splice junctions, and the testing set contains 106,786 splice junctions.
- In both training and testing sets, half of the splice junctions are annotated, and the rest are false splice junctions randomly sampled [42] from the human reference genome (GRCh37/hg19).
- We trained the first model by feeding the 521,512 training splice junction sequences to DeepSplice for a binary classification, splice junctions or not.
- In the first mode (Splice Junction Mode), the input splice junction sequences were 120 nucleotides long, comprising 30 nucleotides upstream and 30 nucleotides downstream of both the donor and the acceptor splice site.
- Splice Junction Mode achieved an auPRC score of for Donor+Acceptor Site Mode.
- Fig. 2 The ROC curves of DeepSplice Splice Junction Mode and Donor+Acceptor Site Mode for splice junction classification on the GENCODE data set.
- DeepSplice Splice Junction Mode achieves a higher auROC score of 0.989.
- Donor+Acceptor Site Mode detected 39,067 splice junctions from 9185 genes.
- There is a 95% likelihood that the confidence interval covers the true classification error of DeepSplice on the testing splice junctions.
- These results indicate that the proposed splice junction classification in DeepSplice achieves higher accuracy in identifying novel splice junctions in large data sets than conventional splice site classification.
- There are highly conserved segments on splice junctions between exons and introns which help in the prediction of splice junctions by computational methods and decipher biological signals of splice junctions.
- This is achieved by the quantification of the contribution of nucleotides in splice junction sequences to the classification process using deep Taylor decomposition [40].
- We then applied deep Taylor decomposition to the results of splice junction classification with the GENCODE data set.
- nucleotides in the testing splice junction sequences.
- Regions of increased importance in splice junction classification are consistent with the result from splice site classification.
- There are putative splice junctions in total, including canonical splice junctions containing the flanking string GT-AG, semi-canonical splice junctions containing the flanking string AT-AC or GC-AG [44], and no non-canonical splice junctions, which are not allowed by Rail-RNA.
- Table 3 lists the number of splice junctions in each category, separated by the number of reoccurrence in samples and the total read support across all samples on four scales: (a) equal to 1, {1}; (b) more than 1 and no greater than 10, (1, 10]; (c) more than 10 and no greater than 1000, (10, 1000]; and (d) more than 1000, (1000, +∞).
- As listed in Table 3, for our analysis, we only retain splice junctions in intropolis that are supported by more than one sample, followed by the filtering of false splice junction sequences due to repetitive sequences.
- After this pre-processing, splice junctions were left for further classification.
- The DeepSplice model was trained on 812,967 splice junctions including annotated splice junctions from the GENCODE annotation, false splice junctions generated from the HS3D data set, and randomly selected semi-canonical splice junctions with only one read support from intropolis.
- Overall, DeepSplice classified 3,063,698 splice junctions as positive.
- Figure 5(a) lists the proportions of positive canonical splice junctions, positive semi-canonical splice junctions and negatives from the classification results at different levels of average read support per sample.
- Splice junctions with average read support per sample more than 15 achieve a positive rate around 88%.
- There is a significant rise in the probability to obtain a positive splice junction with the increase of the average read support per sample.
- Around 99% of positive splice junctions contain the canonical flanking string.
- Table 2 Classification performance evaluation of different DeepSplice modes on the GENCODE data set.
- Splice junction classification Splice Junction Mode .
- Figure 5(b) illustrates the proportions of positive semi-canonical and canonical splice junctions cumulatively with the increase of the average read support per sample.
- To further clarify characteristics of the positives, we categorized splice junctions in intropolis based on annotated splice sites in GENCODE annotation: (1) splice junctions with both splice sites annotated, (2) splice junctions with the donor splice site annotated, (3) splice junctions with the acceptor splice site annotated, and (4) splice junctions with neither the donor nor acceptor splice sites annotated.
- Figure 6(a) shows the discrete proportions of negatives and positive splice junctions in each category above, given the average read support per sample.
- Results indicate that 97% of splice junctions with both sites annotated are classified as positives, while only 39% with both sites being novel are positive.
- Splice junctions connecting annotated splice sites also tend to be associated with higher read coverage.
- Figure 6(b) illustrates the proportions of positive splice junctions in each category cumulatively with the increase of the average read support per sample.
- Figure 7 shows that positive splice junctions in intropolis near known protein-coding junctions show a periodic pattern, such that splice sites that maintain the coding frame of the exon are observed more often than those that disrupt the frame.
- Even though splice junctions with high read support and/or high reoccurrence are more likely to be classified as real, a significant portion of relatively low-expressed splice junctions also carry true splicing signals.
- DeepSplice may provide the first round of filtering of RNA-seq derived splice junctions for further structural validation, and studies that assess functional annotation of these splice junctions are warranted.
- In the future, DeepSplice could also be extended to discriminate between splice junctions that are highly or lowly supported by gene expression evidence and to identify which sequence patterns are associated with this difference.
- Employing a deep convolutional neural network, we develop DeepSplice, a model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data, which can be applied to all species with sufficient transcript.
- The major application of DeepSplice is the classification of splice junctions rather than individual donor or acceptor sites.
- DeepSplice employs a convolutional neural network (CNN, or ConvNet) to understand sequence features that characterize real splice junctions.
- In the supervised training step, the CNN learns features that help to differentiate actual splice junctions from fake ones.
- Deep Taylor decomposition [40] of the CNN is used to explain to what extent each nucleotide in the candidate splice junction has contributed to the inference.
- Splice junction representation.
- A splice junction sequence is represented by four subsequences, the upstream exonic subsequence and downstream intronic subsequence at the donor site, and the upstream intronic subsequence and downstream exonic subsequence at the acceptor site.
- Table 3 Distribution of splice junctions from intropolis given.
- Splice junction number Reoccurrence in samples.
- Fig. 4 Visualization of the contribution of nucleotides in the flanking splice sequences to the final decision function of DeepSplice on the GENCODE dataset for splice junction classification.
- The nucleotides in the proximity of a splice junction have the highest impact on the classification outcome.
- Each splice junction sequence is transformed into a 3-dimensional tensor of shape [height, width, channels].
- In this way, the convolutional neural network transforms the nucleotide signal in splice junction sequences to the final label of class as shown in Fig.
- Fig. 5 Positive splice junctions tend to have high read support and contain the canonical flanking string.
- a Discrete proportions of negatives, positive semi-canonical splice junctions and positive canonical splice junctions from the classification results, given the average read support per sample.
- Splice junctions with average read support per sample more than 15 achieve a positive rate of around 88%.
- In contrast, for splice junctions with average read support per sample no more than 1, only 36% are identified as positive.
- There is a significant rise in the probability to obtain a positive splice junction with the increase of the average read support per sample.
- Around 99% positive splice junctions contain the canonical flanking string.
- b Cumulative proportions of positive semi-canonical and canonical splice junctions with the increase of the average read support per sample.
- We propose to use deep Taylor decomposition [40] to explain the contribution of nucleotides in the splice junction sequence to the final decision function of the deep convolutional neural network, as shown in Fig.
- Fig. 6 Positive splice junctions tend to have both donor and acceptor sites annotated.
- a Discrete proportions of negatives, positive splice junctions without annotated site, positive splice junctions with acceptor site annotated, positive splice junctions with donor site annotated and positive splice junctions with two sides annotated, given the average read support per sample.
- 97% of splice junctions with both sites annotated are classified as positives, while only 39% with both sites being novel are positive.
- b Cumulative proportions of positive splice junctions in each category with the increase of the average read support per sample.
- Fig. 8 Visualization of splice junction sequence representation and deep convolutional neural network in DeepSplice.
- The convolutional neural network transforms the nucleotide signal in splice junction sequences to the final label of class.
- Positive splice junctions in intropolis near known protein-coding junctions show a periodic pattern.
- splice junction sequence S to quantify their contributions to the predicted label of class.
- For both architectures, the inputs are splice junction sequences.
- Filtering of false splice junctions as a result of repetitive sequences.
- Before the classification of splice junctions, we first remove the splice junctions whose sequence at the acceptor (donor) site has high sequence similarity with the immediate flanking sequence next to the donor (acceptor) site or the sequence at any of its alternative acceptor (donor) sites, as shown in Fig.
- This filtering strategy is independent of read coverage and enables the retention of correct splice junctions even with low read.
- Fig. 10 Illustration of splice junction filtering strategy.
- Deep Taylor decomposition explains the contribution of each nucleotide in the splice junction sequence to the final decision function of the deep convolutional neural network.
- The removal of these sequences is necessary as most of them are highly similar with one of the splice junctions remaining in the data set.
- MapSplice: accurate mapping of RNA-seq reads for splice junction discovery.
- TrueSight: a new algorithm for splice junction detection using RNA-seq.
- Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the sequence read archive.
- Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions.
- DeepSplice: Deep classification of novel splice junctions revealed by RNA-seq.
- Recognition of splice junctions on DNA sequences by BRAIN learning algorithm
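
The splice junction representation described in the excerpts above (four 30-nucleotide subsequences forming a 120-nucleotide input, encoded as a tensor of shape [height, width, channels], with GT-AG as the canonical flanking string and AT-AC or GC-AG as semi-canonical ones) can be sketched in Python as follows. This is a minimal illustration and not DeepSplice's actual code: the function names, the A/C/G/T channel order, the height of 1, and the all-zero encoding of ambiguous bases are assumptions made for the example.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed channel order; the source does not specify one


def one_hot(seq: str) -> List[List[float]]:
    """Encode each base as a 4-channel row; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0.0] * len(NUCLEOTIDES)
        idx = NUCLEOTIDES.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def encode_junction(donor_exon: str, donor_intron: str,
                    acceptor_intron: str, acceptor_exon: str,
                    flank: int = 30) -> List[List[List[float]]]:
    """Concatenate the four flanking subsequences and return a nested list
    of shape [height=1, width=4*flank, channels=4] (height of 1 is assumed)."""
    parts = [donor_exon, donor_intron, acceptor_intron, acceptor_exon]
    for part in parts:
        if len(part) != flank:
            raise ValueError(f"each subsequence must be {flank} nucleotides long")
    return [one_hot("".join(parts))]


def flanking_category(donor_intron: str, acceptor_intron: str) -> str:
    """Classify a junction by its intronic flanking string: the first two
    intronic bases at the donor site and the last two at the acceptor site."""
    motif = donor_intron[:2].upper() + "-" + acceptor_intron[-2:].upper()
    if motif == "GT-AG":
        return "canonical"
    if motif in ("GC-AG", "AT-AC"):
        return "semi-canonical"
    return "non-canonical"


if __name__ == "__main__":
    # Synthetic sequences, for illustration only.
    d_exon, d_intron = "A" * 30, "GT" + "C" * 28
    a_intron, a_exon = "C" * 28 + "AG", "T" * 30
    tensor = encode_junction(d_exon, d_intron, a_intron, a_exon)
    print(len(tensor), len(tensor[0]), len(tensor[0][0]))
    print(flanking_category(d_intron, a_intron))
```

A tensor built this way could be passed to a convolutional network as a single-row "image" with four channels; the network architecture itself is not reproduced here.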
