
MRCNN: A deep learning model for regression of genome-wide DNA methylation

- Background: Determination of genome-wide DNA methylation is significant for both basic research and drug development.
- Results: In this paper, we propose a deep learning method for prediction of the genome-wide DNA methylation, in which the Methylation Regression is implemented by Convolutional Neural Networks (MRCNN).
- Experiments show that, by minimizing a continuous loss function, our model converges and is more precise than the state-of-the-art method (DeepCpG) according to the evaluation results.
- Conclusions: Genome-wide DNA methylation could be evaluated based on the corresponding local DNA sequences of target CpG loci.
- With the autonomous learning pattern of deep learning, MRCNN enables accurate predictions of genome-wide DNA methylation status without predefined features and discovers some de novo methylation-related motifs that match known motifs by extracting sequence patterns.
- DNA methylation primarily occurs symmetrically at the cytosine residues that are followed by guanine (CpG) on both DNA strands, and 70–80% of the CpG dinucleotides are methylated in the mammalian genomes [1].
- 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0).
- Conversely, although the density of CpG islands in the promoter region is high, the gene remains unmethylated.
- In addition, most of the methods need to combine a large amount of information, such as knowledge of predefined features.
- Here, we report MRCNN, a computational method based on convolutional neural networks for prediction of genome-wide DNA methylation states at CpG-site resolution [20, 21].
- On the other hand, by using a continuous loss function to perform parameter calculations, a continuous value prediction of the methylation level can be achieved.
- In addition, some de novo motifs are discovered from the filters of the convolution layer.
- The ratio is used as the network prediction target value, while the weights between the nodes in the network are optimized by minimizing the error between the predicted value and the target value.
- We choose a window size of 400 bp (200 bp of DNA upstream and 200 bp downstream, not counting the target site), in consideration of the potential computational workload.
- Prior to conducting MRCNN training, these fragments needed to be encoded to convert the bases A, T, C, and G in the original sequence into matrices that could be input to the network.
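
The encoding step can be illustrated with a minimal sketch. The column order `ATCG`, the all-zero row for unknown symbols, and the helper names `one_hot_encode` and `flanking_window` are assumptions made for illustration; the source only states that the four bases are converted into matrices and that the window holds 200 bp on each side of the target site.

```python
# Hypothetical column order: the source does not say which column maps to which base.
BASES = "ATCG"


def one_hot_encode(sequence):
    """Encode a DNA string as a list of rows, one 4-element row per base."""
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        index = BASES.find(base)
        if index >= 0:  # unknown symbols such as N stay all-zero (an assumption)
            row[index] = 1.0
        rows.append(row)
    return rows


def flanking_window(upstream, downstream, flank=200):
    """Join the last `flank` upstream bases and the first `flank` downstream
    bases, leaving out the target CpG site itself."""
    if len(upstream) < flank or len(downstream) < flank:
        raise ValueError("each flank must contain at least `flank` bases")
    return upstream[-flank:] + downstream[:flank]


# Toy flanks of exactly 200 bases each
window = flanking_window("ACGT" * 50, "TTGCA" * 40)
matrix = one_hot_encode(window)
print(len(matrix), len(matrix[0]))  # rows and columns of the encoded input
```

Each row holds exactly one non-zero entry, so the resulting 400*4 matrix carries the same information as the original sequence.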
- The first layer of the MRCNN is a single convolutional layer, which is mainly employed to extract single nitrogenous base information from the 400*4 input matrix.
- Because each base is an independent 1*4 code, the size of the convolution kernel can only be 1*4.
- In the design of the first layer, we chose not to adopt the pooling operation because the convolution of the first layer is essentially a synthesis of the coding information, that is, it ensures that each base's encoded information can be read completely by the network.
- Here, $w_{x,y}^{f,1}$ is the parameter or weight of the convolutional filter $f$ for this layer, and $b_{f,1}$ is the corresponding bias.
- Then, the output of the first layer, $L_{n,1}$, for each CpG site is a 400*1 tensor with 16 channels.
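
As a rough illustration of the shapes involved, the sketch below applies 16 filters of size 1*4 to a 400*4 input, with no activation and no pooling. The random weights, zero biases, and the function name `first_layer` are placeholders chosen here, not values or names from the paper.

```python
import random


def first_layer(one_hot_rows, weights, biases):
    """Apply 1*4 linear filters row by row.

    one_hot_rows: list of rows, each a 4-element encoding of one base
    weights: one 4-element weight list per filter
    biases: one bias value per filter
    Returns one row per position and one value (channel) per filter.
    """
    output = []
    for row in one_hot_rows:
        channels = [
            sum(w * x for w, x in zip(filter_weights, row)) + bias
            for filter_weights, bias in zip(weights, biases)
        ]
        output.append(channels)
    return output


rng = random.Random(0)
n_filters = 16
# Placeholder parameters; a trained model would learn these.
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n_filters)]
biases = [0.0] * n_filters
# Toy input: 400 positions, all set to the first column of the encoding
rows = [[1.0, 0.0, 0.0, 0.0] for _ in range(400)]
features = first_layer(rows, weights, biases)
print(len(features), len(features[0]))  # positions and channels
```

Because each input row is one-hot, every filter simply selects one of its four weights per position, which is why the kernel cannot be meaningfully narrower or wider than 1*4 at this stage.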
- The size of the convolution kernel is 3*3, the pooling method is max pooling, and the step sizes are 1*1 and 3*3, respectively.
- Nonoverlapping pooling is implemented to decrease the dimensions of the input tensor and, hence, the number of model parameters.
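
Nonoverlapping max pooling can be sketched as follows. The sketch assumes that the pooling window equals the 3*3 step size and that trailing rows or columns that do not fill a window are dropped; the source states only the step size and that the pooling does not overlap.

```python
def max_pool_2d(grid, size=3):
    """Non-overlapping max pooling: window and stride are both `size`.

    Trailing rows or columns that do not fill a full window are dropped
    (an assumption made for this sketch).
    """
    out_rows = len(grid) // size
    out_cols = len(grid[0]) // size
    pooled = []
    for r in range(out_rows):
        pooled_row = []
        for c in range(out_cols):
            window = [
                grid[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 6*6 grid holding the values 0..35 in row-major order
grid = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = max_pool_2d(grid)
print(pooled)  # [[14, 17], [32, 35]]
```

With a 3*3 window and stride, each spatial dimension shrinks by a factor of three, which is what reduces the number of parameters in the layers that follow.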
- The convolution of the first layer and of these two layers is a linear convolution operation, with no pooling layer connection or activation function.
- The main purpose is to counteract a side effect of the convolution and the nonlinear activation function, which causes part of the input to fall into the saturated zone, where the corresponding weights cannot be updated.
- For the loss function in the training process, we chose the Mean Square Error (MSE) function, which is a classic solution to the problem of regression.
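
For reference, MSE is the mean of the squared differences between predicted and target values. The sketch below uses the standard definition; the function name and the toy values are illustrative and are not data from the paper.

```python
def mean_squared_error(predicted, target):
    """Mean of squared differences between predictions and targets."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Toy methylation levels in [0, 1]
predicted_levels = [0.9, 0.2, 0.5]
target_levels = [1.0, 0.0, 0.5]
loss = mean_squared_error(predicted_levels, target_levels)
print(loss)
```

Because the loss is continuous in the predictions, minimizing it yields a continuous methylation level rather than a binary class.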
- First, for construction of the model, we selected nearly 10 million sites from WGBS for training.
- Since the chromosome order of the sites is shuffled, it is not necessary to consider differences among chromosomes, which is more conducive to the discovery of genome-wide DNA methylation patterns.
- Then, it is reshaped as a 2D tensor for the following operations, and the convolution and pooling operations obtain higher-level sequence features, while the next two convolution layers overcome the side effects of the saturated zone.
- The test set was divided according to two aspects: the original methylation level, and whether the region where the site is located belongs to a CpG island.
- Details are explained in the Results section.
- This also helps reduce the accidental errors in the model testing process, which is equivalent to a number of completely different test sets, as the training and test sites are completely different in origin.
- On the basis of the above, in order to analyze the sequence features extracted during the training of the model, we visualized the weight matrix of the convolutional filters by reverse decoding from the weight assignment and the corresponding raw tensor input.
- Specifically, the products of the first convolutional layer shared four types of weights, which corresponded to the original encoding of the four bases, so that the base sequence could be assigned according to the input, and then the weights of the different sequences could be reassigned according to the size of the filter weights.
- Here, $Y$ represents the predicted value of the methylation level and $Y_0$ represents the true value.
- These three states were grouped by different cutoff values of the methylation rate.
- Analysis of the classification performance was implemented by comparing the classification metrics of sites from CpG islands and non-CpG islands among different models, which is more comprehensive because methylation patterns differ across distinct regions of the genome.
- In addition, we analyzed the filters from the model training process, verified the validity of the sequence feature extraction, and obtained related de novo motifs.
- Here, to demonstrate the predictive ability for different methylation states, we divided the continuous methylation values in the raw data into states using different cutoff values.
- Most previous studies focused on predictions of hypermethylation and hypomethylation; thus, we also evaluated model performance based on predictions of these two states.
- In addition, in order to objectively evaluate the regression prediction, we added an evaluation of the prediction of the intermediate methylation state.
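
Grouping continuous methylation values by cutoffs might look like the sketch below. The cutoffs 0.25 and 0.75 are hypothetical placeholders, since the actual values are not given in this text; the labels `hypo`, `mid`, and `hyper` stand for the hypomethylation, intermediate, and hypermethylation states.

```python
def methylation_state(level, low=0.25, high=0.75):
    """Map a methylation level in [0, 1] to a state label.

    `low` and `high` are hypothetical cutoffs used only for illustration.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("methylation level must lie in [0, 1]")
    if level <= low:
        return "hypo"
    if level >= high:
        return "hyper"
    return "mid"


levels = [0.05, 0.4, 0.9]
states = [methylation_state(level) for level in levels]
print(states)  # ['hypo', 'mid', 'hyper']
```

Keeping the cutoffs as parameters makes it easy to repeat an evaluation under several groupings.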
- The different regression results of the three groups confirmed our expectation that MRCNN plays different roles in learning hypermethylation (hyper), hypomethylation (hypo) and intermediate methylation (mid) statuses.
- In terms of the overall regression results, MRCNN achieved good results.
- First, the maximum error for a single site prediction was approximately 0.5, and the prediction error distribution showed high accuracy, as most of the errors were concentrated around 0.1 for all test sites (see Additional file 1).
- The RMSE and MAE of the three groups were calculated as follows: hyper: RMSE MAE = 0.129885.
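
For readers unfamiliar with the two metrics, the sketch below computes RMSE and MAE from their standard definitions. The function names and toy values are illustrative only and do not reproduce any result from the paper.

```python
import math


def rmse(predicted, target):
    """Root mean square error: square root of the mean squared difference."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, target):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(p - t) for p, t in zip(predicted, target)]
    return sum(absolute) / len(absolute)


# Toy methylation levels, not data from the paper
toy_predicted = [0.9, 0.2, 0.5]
toy_target = [1.0, 0.0, 0.5]
print(rmse(toy_predicted, toy_target), mae(toy_predicted, toy_target))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of the error distribution within each group.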
- Considering that most previous studies on methylation were based on CpG islands [4], the evaluation of the classification performance was implemented for loci from CpG islands and non-CpG islands.
- Additionally, we compared MRCNN to DeepCpG for analysis of the classification ability for methylation under different deep-learning architectures and brought in the simple CNN model as the baseline method.
- In particular, these sites were previously grouped, with part of them from CpG islands and the rest from non-CpG islands.
- The results of the classification comparison were shown in Fig.
- The results showed that the overall prediction of MRCNN was better than that of DeepCpG, while the result of DeepCpG was better than that of the baseline model, CNN.
- P-value on sites from CpG islands.
- Fig. 2 MRCNN achieved regression of the whole genome methylation.
- The box diagrams depict the distribution of the prediction errors of the three groups of sites.
- P-value on sites from non-CpG islands.
- To fully compare the classification performance of the three models, we also selected several sets of loci from the whole genome with different sizes for testing.
- In addition, we also find that in the prediction of sites from CpG islands, the sensitivity (SE) is less than the specificity (SP), while the situation is exactly the opposite for sites from non-CpG islands.
- This illustrates the effect of the different methylation patterns of CpG islands and non-CpG islands on feature extraction during model training.
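
Sensitivity and specificity can be computed from binary labels as in the sketch below. Treating the methylated class as positive (label 1) is an assumption made here; the text does not state which class is taken as positive, and the toy labels are not data from the paper.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (SE, SP) for binary labels, with 1 as the positive class.

    SE = TP / (TP + FN), SP = TN / (TN + FP).
    A metric is NaN when its denominator is zero.
    """
    tp = fn = tn = fp = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == 1 and prediction == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif prediction == 0:
            tn += 1
        else:
            fp += 1
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return se, sp


# Toy labels: 1 = methylated (assumed positive), 0 = unmethylated
se, sp = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(se, sp)
```

Computing the pair separately for CpG-island and non-CpG-island sites is what makes the opposite SE/SP ordering described above visible.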
- The test loci come from normal brain white matter, lung tissue, and colon tissue, which were randomly distributed on CpG islands and non-CpG islands for the consideration of genome-wide methylation prediction.
- The results of the classification performances were shown in Fig.
- The difference between the SE and SP between CpG islands and non-CpG islands reveals distinct methylation patterns in different regions of the genomes.
- To be more cautious, we also evaluated the prediction of MRCNN in the cancerous phenotypes of the three tissues; the results are shown in Additional file 3.
- In particular, we analyzed the learned filters of the first convolutional layer.
- We compared the representation of the learned filters with the original input tensor representation and found that the learned filters were better able to distinguish the methylation level of the sites and explain the feature extraction by MRCNN.
- The original features could not distinguish the hypermethylation and hypomethylation states well, whereas after the convolutional feature extraction the two states could be roughly separated, which is sufficient to demonstrate the validity of the convolution operation.
- The discovered sequence motifs associated with DNA methylation were obtained with the online motif-based sequence analysis tool MEME [23] (version 5.0.1).
- Part of the motifs and their matches were shown in Fig.
- Fig. 5 Clustering results for hypermethylation and hypomethylation loci of the original features and the learned filters of the first convolutional layer.
- (a) t-SNE plot of the original input tensor representation.
- (b) t-SNE plot of the learned feature map representation.
- Fig. 6 Discovered sequence motifs associated with DNA methylation.
- corresponding class and family representing the biological factor species of the known motifs.
- The p-value is defined as the probability that a random motif of the same width as the target would have an optimal alignment with a match score as good as or better than the target's.
- major bases of one certain type at a specific site, while there was no particularly obvious trend in the motifs corresponding to the hypomethylation and intermediate methylation statuses.
- In addition, for both hypermethylation and hypomethylation sites, several of the matched known motifs were related to zinc finger factors, suggesting that these factors might play an important role in the methylation process.
- There have been reports in the literature that methylation is associated with zinc finger factors [26].
- Thus, the combination of the two aspects gives us an opportunity.
- Although we can analyze the phenomenon of methylation through deep learning, we still need to understand the essence of methylation, and only by making full use of the known information can we discover new knowledge.
- One of the future efforts is to combine deep learning with existing known methylated biological backgrounds to construct a comprehensive analytical model to achieve deeper understanding of this epigenetic phenomenon.
- In this paper, we propose a novel deep learning model based on convolutional neural networks for predicting DNA methylation at single-CpG-site precision using local DNA sequence.
- The extraction of DNA sequence features is achieved by multistep 2D-array-convolution, and the MSE loss function is minimized to achieve regression of the methylation values.
- We also demonstrate the discovery of de novo motifs by analyzing the learned filters of the convolutional layer, and some of these motifs have been reported to play an important role in the regulation of methylation.
- Distribution of the differences between the predicted value and the true value of all sites.
- Comparison of the comprehensive classification performance metrics, including ACC and AUC, on test subsets of different sizes.
- Comparison of the classification performances in three cancerous tissues.
- The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/.
- Predicting DNA methylation level across human tissues.
- DNA methylation and human disease.
- DNA methylation landscapes: provocative insights from epigenomics.
- Principles and challenges of genome-wide DNA methylation analysis.
- Functions of DNA methylation: islands, start sites, gene bodies and beyond.
- Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.
- DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning.
- Linking DNA methylation and histone modification:.
- CpGIMethPred: computational model for predicting methylation status of CpG islands in human genome.
- Predicting methylation status of CpG islands in the human brain.
- Histone methylation marks play important roles in predicting the methylation status of CpG islands.
- Predicting DNA methylation susceptibility using CpG flanking sequences.
- Deep learning
