
PretiMeth: Precise prediction models for DNA methylation based on single methylation mark


Summary

- Background: The computational prediction of methylation levels at single-CpG resolution is a promising way to explore the methylation levels of CpGs not covered by existing array techniques, especially for the large reserves of 450 K BeadChip array data.
- This limits the application of the prediction results, especially in downstream analyses with high precision requirements.
- Only one DNA methylation feature that shared the most similar methylation pattern with the CpG locus to be predicted was applied in the model.
- PretiMeth achieved precise modeling for each CpG locus by using only one significant feature, which also suggests that our precise prediction models could probably be used as a reference in probe set design when the DNA methylation BeadChip is updated.
- The methylation landscape of the human genome and the aberrant methylation patterns that result in different diseases are still a research hot spot.
- DNA methylation data from genome-wide sequencing can provide more comprehensive methylation information, but the high cost of current bisulfite sequencing platforms makes it impractical for large-scale research [7].
- To date, it remains quite important to extract the methylation levels of CpG loci not covered by the existing methylation array data, especially for the 450 K array data from precious cancer studies.
- DeepCpG is a deep neural network model for predicting the methylation state of CpG dinucleotides in multiple cells based on surrounding sequence components and neighbouring methylation information [13].
- Most of the models performed well and achieved a prediction accuracy close to 90%.
- Previous studies have indicated that the methylation level of a CpG locus is correlated with the methylation levels of its neighbouring CpG loci (indicating possible co-methylation), and the methylation marks of the upstream and downstream CpG loci have been widely used as important and informative features for prediction.
- This strategy could improve the prediction accuracy for CpG loci that do not have highly correlated neighbouring CpG loci or have no surrounding CpG locus in the defined flanking region.
- In this study, we proposed to predict the methylation levels of model loci (the target CpG loci to be predicted) by using the methylation levels of feature loci (the CpG loci used for feature selection), and constructed a logistic regression model for each model locus (Fig.
- For each model locus, the methylation value of its co-methylated CpG locus was finally selected as the prediction feature.
- Logistic regression was applied to predict the methylation levels of model loci based solely on their co-methylated loci.
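
The single-feature idea can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the model for one locus has the form sigmoid(w·x + b), where x is the methylation level of the co-methylated locus, and it fits (w, b) by plain gradient descent on a cross-entropy loss with continuous targets. The function names, the fitting procedure and the toy values are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function; its output always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_single_feature_lr(x, y, learning_rate=1.0, epochs=5000):
    """Fit y ~ sigmoid(w * x + b) for one model locus.

    x: methylation levels of the co-methylated feature locus (one per sample)
    y: methylation levels of the model locus in the same samples
    Uses batch gradient descent on cross-entropy with continuous targets
    (an assumed fitting procedure, chosen for simplicity).
    """
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(w * xi + b) - yi
            grad_w += err * xi
            grad_b += err
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict(w, b, x):
    """Predicted methylation levels of the model locus."""
    return [sigmoid(w * xi + b) for xi in x]


# Toy training data (made-up values, not taken from the study).
feature_levels = [0.10, 0.20, 0.40, 0.60, 0.80, 0.90]
model_levels = [0.12, 0.18, 0.42, 0.63, 0.79, 0.88]

w, b = fit_single_feature_lr(feature_levels, model_levels)
predicted = predict(w, b, feature_levels)
```

Because the output passes through the sigmoid, every prediction stays between 0 and 1, which matches the definition of a methylation level.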
- In both the cross-validation and the independent data test, PretiMeth demonstrated satisfactory performance and outperformed other comparable methods.
- Furthermore, we applied our precise prediction models (Super high accurate and High accurate models) to 13 cancers from TCGA and obtained the methylation landscapes for the tumor and normal samples.
- One of the CpG loci, chr, was located in an enhancer region that was marked by DNase I and H3K27ac and bound by a variety of transcription factors (TFs), indicating that it may be a potential therapeutic target for a variety of cancers.
- In the investigation of the differentially methylated genes, we found 10 genes differentially methylated in at least 10 cancers.
- Methylation correlations between model loci and the candidate feature loci.
- To compare the methylation similarity between the model loci and the three candidate feature loci (including the nearest neighbouring CpG loci, the CpG loci.
- Previous prediction work using the methylation values from nearby CpG loci had shown that nearby loci were closely co-methylated, particularly when they were less than 2 kb apart.
- First, we restricted the search for the nearest neighbouring loci to the 2 kb flanking region of the model loci; 189,582 model loci met this requirement.
- Then, we compared the correlation between all 413,719 model loci and the three feature loci without the restriction of the 2 kb flanking range (Additional file 2:.
- We found that the average correlation between the model loci and the nearest neighbouring loci fell to 0.4651.
- A significant co-methylation trend was shown between the model loci and their co-methylated loci, which indicated that the methylation values of the co-methylated loci might be more informative for prediction than the other two types of feature loci.
- Besides, we investigated the location distances between all the model loci and their co-methylated loci, and found that ~ 95% of distances were larger than 2 kb, which indicated that the co-methylated loci not only exist in the nearby regions of model loci but also could exist in two distal regions (Fig.
- We further evaluated the profile of the correlation between model loci and their co-methylated loci based on the different regions of the genome (Fig.
- The model loci and their co-methylated loci that were both located in the gene promoter region showed a higher correlation than those in other regions (especially in TSS200 and 1stExon).
- To establish separate prediction models for each CpG locus, we used the methylation values of the three candidate feature loci as the prediction features.
- The shade of the color indicates the methylation level: dark represents hyper-methylation and light represents hypo-methylation.
- PretiMeth accurately predicts the methylation levels of model loci by the methylation levels of feature loci.
- The performance of LR model was slightly better than the OLS model, and the output value of the logistic regression method was more in line with the definition of methylation level.
- For feature selection, we found that the contribution of the co-methylated loci was significantly higher than the other two types of features, which was consistent with the conclusions of our correlation analysis.
- Although the performance of the model increased slightly when all three features were applied, fewer model loci could be predicted when more features were used, owing to the missing values in the 450 K array data.
- Therefore, considering the balance between the accuracy and practicality of the prediction method, we used only the co-methylated locus as the feature to develop PretiMeth based on the logistic regression algorithm.
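
The practicality side of this trade-off can be illustrated with a small sketch: a model locus is predictable in a given sample only if every feature locus the model requires was actually measured. The probe identifiers, feature kinds and values below are hypothetical and chosen only to show the effect; `None` stands for a missing array value.

```python
# Hypothetical mapping from each model locus to its three candidate feature loci.
FEATURES_OF = {
    "model_1": {"co_methylated": "cg_a", "nearest": "cg_b", "similar_seq": "cg_c"},
    "model_2": {"co_methylated": "cg_d", "nearest": "cg_e", "similar_seq": "cg_f"},
    "model_3": {"co_methylated": "cg_g", "nearest": "cg_a", "similar_seq": "cg_d"},
}

# One hypothetical 450 K sample; None marks a missing measurement.
SAMPLE = {
    "cg_a": 0.81, "cg_b": None, "cg_c": 0.40,
    "cg_d": 0.12, "cg_e": 0.15, "cg_f": 0.09,
    "cg_g": None,
}


def predictable_loci(features_of, sample, kinds):
    """Return the model loci whose required feature loci are all measured."""
    usable = []
    for model_locus, feats in features_of.items():
        if all(sample.get(feats[kind]) is not None for kind in kinds):
            usable.append(model_locus)
    return usable


one_feature = predictable_loci(FEATURES_OF, SAMPLE, ["co_methylated"])
three_features = predictable_loci(
    FEATURES_OF, SAMPLE, ["co_methylated", "nearest", "similar_seq"]
)
```

In this toy sample, requiring all three features leaves fewer predictable loci than requiring the co-methylated locus alone, which is the effect described above.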
- a The distribution of correlation coefficients for model loci and their co-methylated loci.
- The solid red line represents the cumulative distribution function (CDF) and the blue histogram represents the probability density function (PDF) of the Pearson correlation coefficient.
- b The methylation profile between two pairs of model loci and their co-methylated loci.
- c The correlation coefficient matrix between model loci and their co-methylated loci located in different genomic regions.
- (M: the co-methylated loci.
- The advantage of our single-locus modeling compared with the previous general models is that our precision model could tell how accurate the predicted methylation levels of the CpG loci were, which means one can know which CpG locus is accurately predicted and which is relatively unreliable.
- Therefore, we used the RMSE of the cross-validation results to assess the accuracy of the predictions for each single-locus model.
- However, there are no fixed restrictions on the division of the models; it depends mainly on the user’s judgment of the model accuracy and on the task requirements.
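
A per-locus accuracy grading of this kind could look like the sketch below. The RMSE cut-offs and the name of the lowest category are hypothetical placeholders, since the division is left to the user; only the idea of grading each single-locus model by its cross-validation RMSE comes from the text.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed methylation levels."""
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))


# Hypothetical cut-offs (upper RMSE bound, category label); adjust per task.
CUTOFFS = [
    (0.05, "Super high accurate"),
    (0.10, "High accurate"),
    (0.20, "Medium accurate"),
]


def categorize(rmse_value, cutoffs=CUTOFFS, default="Low accurate"):
    """Assign a single-locus model to the first category whose bound it meets."""
    for upper_bound, label in cutoffs:
        if rmse_value <= upper_bound:
            return label
    return default  # hypothetical name for the remaining models


example_rmse = rmse([0.80, 0.10, 0.55], [0.78, 0.12, 0.50])
example_category = categorize(example_rmse)
```

A user with stricter downstream requirements would simply tighten the cut-offs and keep only the loci that fall in the top categories.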
- To further verify the performance of our proposed precision models, we applied them to 139 other independent test samples and obtained the average values of the performance indicators.
- Then we observed the performance of the four categories of models on the 139 independent testing sets.
- The Super high accurate and High accurate models achieved extremely high prediction accuracy, while the Medium accurate models also achieved high accuracy (ACC ≥ 0.9), comparable to other state-of-the-art general prediction models.
- To verify the extensibility of the model from the 450 K array to the EPIC array, we applied our models to predict the methylation levels of the model loci (measured by EPIC array) from the methylation levels of the co-methylated loci (measured by 450 K array).
- The average prediction results are shown in Table 2, and the scatter plot comparing the predicted methylation values with the methylation values detected by EPIC data is shown in Fig.
- And the stable high prediction performance of the Super high accurate and High accurate model indicates that they.
- Furthermore, we applied our Super high accurate and High accurate models to WGBS data, using the methylation values of the co-methylated loci to predict the methylation levels of the model loci (both sets of methylation values were measured by WGBS).
- For each 450 K dataset, we applied our PretiMeth model to predict the methylation levels of the model loci.
- the prediction accuracy of each single CpG locus), only the CpG loci predicted based on the Super high accurate and High accurate models were applied for DML analysis to ensure the reliability of the analysis.
- This also reflected the coverage of the EPIC array design, which provided more methylation information of loci in the remote regulatory region [9, 26].
- Therefore, we examined the region including chr in the UCSC genome browser [29] and found that the region is marked by DNase I and the active enhancer marker H3K27ac (Additional file 2: Figure S5a).
- We also checked the Roadmap chromatin state of the region in the WashU Epigenome Browser [30] (Additional file 2: Figure S5b), and found that it was annotated as Genic enhancers, Enhancers and Strong transcription in different normal cells or tissues.
- Some of the TFs have been reported to play key roles in the process of
- Table 1 The performance of methylation prediction based on.
- As methylation changes in an enhancer region can be owing to the gain or loss of transcription factor (TF) binding [37–39], we suspected that this enhancer region may be a potential therapeutic target for a variety of cancers.
- In the differential methylation analysis based on the experimental data, the locus chr showed hypomethylation in breast cancer (Δβ = 0.1134, P = 0.0235) and hypermethylation in prostate cancer (Δβ.
- Importantly, PretiMeth picked up on potential co-methylated loci by calculating the methylation correlation between distant CpGs to improve the prediction performance.
- In the cross-platform performance evaluation, the Super high accurate and High accurate models performed quite well on both array and WGBS data.
- In our algorithm, adding the methylation marks of neighbouring flanking CpG loci and of the CpG loci with the most similar flanking sequence composition slightly improved the prediction performance.
- This not only simplified the model construction but also improved prediction for CpG loci that did not have highly correlated neighbouring CpG loci or had no surrounding CpG locus in the defined flanking region.
- When applying our model to TCGA data, we only focused on the CpG loci derived
- Table 2 The prediction performance in the cross-chip evaluation.
- Based on our PretiMeth model, we could accurately predict the methylation level of some EPIC-covered loci by using the methylation level of 450 K-covered loci.
- Fig. 6 Comparison of the predicted methylation levels and the methylation levels profiled by EPIC technology in IMR90 sample 1 and NA12878 sample 1 from (a) all prediction models, (b) Super high and High accurate models, and (c) only Super high accurate models.
- Therefore, PretiMeth can probably be used as a reference in probe set design when the DNA methylation BeadChip is updated.
- c Probe chr, located in the body region of the KIFC3 gene, showed significant hypomethylation (the methylation level of the locus in tumor samples was lower than that in normal samples) in 12 cancers based on predicted methylation data.
- PretiMeth used a single-locus modeling strategy and could provide an evaluation of the prediction accuracy for each single CpG locus, which would facilitate candidate selection for subsequent biological applications.
- Meanwhile, our findings supported the idea that the methylation value of the co-methylated locus is very important for methylation prediction.
- Moreover, 3 additional samples measured by EPIC were used to compare the prediction performance of PretiMeth with that of the other two methods.
- For the model application, the 450 K array data of the TCGA database were downloaded.
- In total, there were 413,719 model loci and 450,137 feature loci.
- Three kinds of candidate features were used for model construction: the methylation value of the nearest neighbouring CpG locus, the methylation value of the co-methylated CpG locus, and the methylation value of the CpG locus with the most similar flanking sequence.
- the nearest neighbouring CpG locus: the feature locus located closest to the model locus on the same chromosome.
- the co-methylated CpG locus: the feature locus sharing the most similar methylation pattern with the model locus in the EPIC samples.
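
Finding the nearest neighbouring feature locus is a one-dimensional nearest-position search per chromosome. The sketch below assumes the feature positions on one chromosome are available as a sorted list of integer coordinates; the function name, the optional `max_dist` argument (which mimics a 2 kb flanking restriction) and the coordinates are illustrative assumptions.

```python
from bisect import bisect_left


def nearest_feature_locus(model_pos, feature_positions, max_dist=None):
    """Return the feature position closest to model_pos, or None.

    feature_positions: sorted positions of feature loci on the same chromosome.
    max_dist: optional flanking limit in bp (for example 2000 for a 2 kb window).
    """
    i = bisect_left(feature_positions, model_pos)
    # Only the positions just left and just right of the insertion point can be nearest.
    candidates = feature_positions[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda pos: abs(pos - model_pos))
    if max_dist is not None and abs(best - model_pos) > max_dist:
        return None
    return best


# Hypothetical coordinates on a single chromosome.
features_on_chrom = [100, 1500, 9000]
nearest_any = nearest_feature_locus(1200, features_on_chrom)
nearest_within_2kb = nearest_feature_locus(5000, features_on_chrom, max_dist=2000)
```

With the flanking limit in place, a model locus that has no feature locus within 2 kb simply has no nearest-neighbour feature.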
- $\beta_{CpG_i} = \{\beta_i^1, \beta_i^2, \ldots, \beta_i^n\}$, where $\beta_{CpG_i}$ denotes the vector of methylation values of the i-th locus in all n samples and $\beta_i^k$ ($k = 1, \ldots, n$) represents the methylation value of the i-th locus in the k-th sample.
- To characterize the sequence composition pattern across CpGs, we extracted 340 sequence features from the 200 bp flanking range of the i-th locus, including the occurrence frequencies of all 1- to 4-mers: $Seq_{CpG_i} = \{OC^{1\text{-mers}}_{CpG_i}, \ldots\}$
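
The count of 340 features follows from the number of possible 1- to 4-mers over the four bases: 4 + 16 + 64 + 256 = 340. A minimal sketch of such a feature vector is shown below; counting overlapping windows and normalising each count by the number of windows of that length are assumptions, as is the example sequence.

```python
from itertools import product

ALPHABET = "ACGT"

# All 1-, 2-, 3- and 4-mers in a fixed order: 4 + 16 + 64 + 256 = 340 entries.
KMERS = ["".join(p) for k in range(1, 5) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(seq):
    """Occurrence frequency of every 1- to 4-mer in a flanking sequence."""
    seq = seq.upper()
    features = []
    for kmer in KMERS:
        k = len(kmer)
        windows = len(seq) - k + 1
        if windows <= 0:
            features.append(0.0)
            continue
        count = sum(1 for i in range(windows) if seq[i:i + k] == kmer)
        features.append(count / windows)
    return features


# Made-up flanking sequence, far shorter than a real 200 bp window.
example_features = kmer_frequencies("ACGTACGT")
```

Two loci can then be compared by correlating their 340-dimensional vectors.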
- The methylation similarity $PearsonMeth_{ij}$ and the sequence similarity $PearsonSeq_{ij}$ were measured by the Pearson correlation coefficient between model loci and feature loci:
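
As an illustration of how such a similarity could drive feature selection, the sketch below computes the Pearson correlation coefficient between a model locus and each candidate feature locus across samples and keeps the candidate with the highest coefficient. Selecting by the highest signed coefficient (rather than the absolute value), the function names and the probe identifiers are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sd_x * sd_y)


def co_methylated_locus(model_values, feature_values):
    """Return the feature locus whose profile correlates best with the model locus.

    feature_values: dict mapping a feature locus ID to its values across samples.
    """
    return max(feature_values, key=lambda locus: pearson(model_values, feature_values[locus]))


# Made-up methylation profiles across four samples.
model_profile = [0.10, 0.35, 0.60, 0.90]
candidates = {
    "cg_x": [0.85, 0.60, 0.40, 0.15],  # anti-correlated
    "cg_y": [0.12, 0.30, 0.65, 0.88],  # strongly correlated
    "cg_z": [0.50, 0.50, 0.50, 0.50],  # constant
}
best_feature = co_methylated_locus(model_profile, candidates)
```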
- and CpGs that were not classified in any of the previous categories were annotated as intergenic.
- Additional criteria included the location of the CpG loci relative to the CpG island (open sea, island, shore, shelf) [9].
- The logistic regression algorithm and the ordinary least squares algorithm were each used to predict the methylation levels of model loci from the methylation levels of feature loci in the training data.
- Let $\beta_{CpG_l}$ represent the methylation level of the nearest neighbouring locus l, $\beta_{CpG_i}$ the methylation level of the i-th model locus, $\beta_{CpG_m}$ the methylation level of the matched co-methylated locus m, and $\beta_{CpG_n}$ the methylation level of the matched locus n with the most similar sequence.
- The methylation level predicted by the logistic regression model for the i-th model locus in the k-th sample is:
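
Assuming the standard logistic (sigmoid) form, which is consistent with the fitting parameters $(w_i, b_i)$ and the [0,1] output range described here, the prediction can be written as follows. The exact notation, including the use of a weight vector $w_i$ over the three candidate features, is an assumption rather than a transcription of the original equation.

```latex
\beta^{k}_{i}(LR) = \frac{1}{1 + \exp\!\left(-\left(w_i^{\top}\,\beta^{k}_{CpG} + b_i\right)\right)},
\qquad
\beta^{k}_{CpG} = \left\{\beta^{k}_{CpG_l},\ \beta^{k}_{CpG_m},\ \beta^{k}_{CpG_n}\right\}
```

When only the co-methylated locus is used as the feature, $\beta^{k}_{CpG}$ reduces to the single value $\beta^{k}_{CpG_m}$ and $w_i$ to a scalar.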
- …, $\beta^k_{CpG_n}\}$; $\beta^k_{CpG}$ represents the experimental methylation levels of the matched loci in the k-th sample for $CpG_i$, $(w_i, b_i)$ are the fitting parameters of the logistic regression model for $CpG_i$, and $\beta^k_i(LR)$ takes a value in [0,1].
- The methylation level predicted by the ordinary least squares model for the i-th model locus in the k-th sample is:
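
Assuming the usual linear form of an ordinary least squares model over the same matched feature loci, the prediction can be sketched as below. The symbol $\beta^{k}_{i}(OLS)$ and the parameter names $(w_i, b_i)$ are chosen by analogy with the logistic regression notation and are assumptions.

```latex
\beta^{k}_{i}(OLS) = w_i^{\top}\,\beta^{k}_{CpG} + b_i
```

Unlike a logistic output, this linear prediction is not constrained to the interval [0,1].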
- where $Y$ represents the predicted value of the methylation level and $Y_0$ represents the value detected with the array or WGBS technique.
- For calculating SP, SE, MCC, and ACC, we defined the methylation status as +1 if the methylation value is larger than 0.5, and as −1 otherwise.
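
The thresholding rule above can be turned into the four classification indicators as in the following sketch. It uses the textbook definitions of sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC); the function names and the example values are assumptions.

```python
import math


def methylation_status(values, threshold=0.5):
    """Map methylation values to +1 (value > threshold) or -1 (otherwise)."""
    return [1 if v > threshold else -1 for v in values]


def classification_metrics(predicted, observed, threshold=0.5):
    """Compute SE, SP, ACC and MCC from continuous methylation values."""
    pred = methylation_status(predicted, threshold)
    obs = methylation_status(observed, threshold)
    tp = sum(1 for p, o in zip(pred, obs) if p == 1 and o == 1)
    tn = sum(1 for p, o in zip(pred, obs) if p == -1 and o == -1)
    fp = sum(1 for p, o in zip(pred, obs) if p == 1 and o == -1)
    fn = sum(1 for p, o in zip(pred, obs) if p == -1 and o == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SE": tp / (tp + fn) if tp + fn else 0.0,
        "SP": tn / (tn + fp) if tn + fp else 0.0,
        "ACC": (tp + tn) / len(obs),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up predicted and detected methylation values for four loci.
metrics = classification_metrics([0.9, 0.8, 0.2, 0.6], [0.95, 0.7, 0.1, 0.3])
```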
- Additional file 1: A detailed note of the samples used in this study.
- The distribution of correlation coefficients for model loci and the candidate feature loci.
- Scatter plotting the predicted methylation levels and the methylation levels profiled by 850 K technology in other samples of IMR90 and NA12878.
- Annotations of the enhancer region we found.
- Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome.
- Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences.
- Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling.
- Histone methylation marks play important roles in predicting the methylation status of CpG islands
