
Guidance for DNA methylation studies: statistical insights from the Illumina EPIC array


Summary

- Background: There has been a steady increase in the number of studies aiming to identify DNA methylation differences associated with complex phenotypes.
- First, we quantify the multiple testing burden and propose a standard statistical significance threshold for identifying DNA methylation sites that are associated with an outcome.
- Second, we establish whether linear regression, the chosen statistical tool for the majority of studies, is appropriate and whether it is biased by the underlying distribution of DNA methylation data.
- Finally, we assess the sample size required for adequately powered DNA methylation association studies.
- Results: We quantified DNA methylation in the Understanding Society cohort (n = 1175), a large population-based study, using the Illumina EPIC array to assess the statistical properties of DNA methylation association analyses.
- By simulating null DNA methylation studies, we generated the distribution of p-values expected by chance and calculated the 5% family-wise error for EPIC array studies to be 9 × 10⁻⁸.
- Next, we tested whether the assumptions of linear regression are violated by DNA methylation data and found that the majority of sites do not satisfy the assumption of normal residuals.
- Finally, we performed power calculations for EPIC-based DNA methylation studies, demonstrating that existing studies with data on ~1000 samples are adequately powered to detect small differences at the majority of sites.
- A significance threshold of P < 9 × 10⁻⁸ adequately controls the false positive rate for EPIC array DNA methylation studies.
- Linear regression is a valid statistical methodology for DNA methylation studies, despite the fact that the data do not always satisfy the assumptions of this test.
- These findings have implications for epidemiological-based studies of DNA methylation and provide a framework for the interpretation of findings from current and future studies.
- There is increasing interest in the role of epigenetic processes in health and disease, with the primary focus of most epigenetic epidemiological studies being on DNA methylation (DNAm) [1].
- Because DNAm is bounded by the limits of 0 and 1, the variance is compressed at the extreme ends of the distribution.
- This property of the data is called heteroskedasticity, defined as a relationship between the mean and variance of a dataset, and violates another assumption of linear regression.
- First, we used a permutation procedure to establish an appropriate significance threshold that accounts for the multiple testing burden of the EPIC array.
- Using the inverse of the Bonferroni correction formula this is equivalent to.
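
The permutation idea can be sketched in a few lines: run many null studies, keep the smallest p-value from each, and take the 5th percentile of those minima as the 5% family-wise threshold. The sketch below is illustrative only. It uses independent uniform p-values as a stand-in for real permuted-phenotype results (an assumption that ignores the correlation between DNAm sites), and the function names `fwer_threshold` and `inverse_bonferroni` are hypothetical.

```python
import random


def fwer_threshold(min_pvalues, alpha=0.05):
    """Empirical alpha-quantile of the per-simulation minimum p-values."""
    ordered = sorted(min_pvalues)
    # Position of the alpha-quantile in the sorted minima (never below index 0).
    k = max(int(alpha * len(ordered)) - 1, 0)
    return ordered[k]


def inverse_bonferroni(threshold, alpha=0.05):
    """Effective number of independent tests implied by a threshold."""
    return alpha / threshold


# Toy null simulations: independent uniform p-values replace real
# permutation results (assumption made purely for illustration).
rng = random.Random(1)
n_sites, n_sims = 500, 1000
min_p = [min(rng.random() for _ in range(n_sites)) for _ in range(n_sims)]

threshold = fwer_threshold(min_p)
effective_tests = inverse_bonferroni(threshold)
```

With independent tests the effective number comes out close to the number of sites; with correlated sites, as on a real array, it would be expected to be smaller.
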
- In order to model the information gain in terms of number of independent tests as the coverage of the microarray increases, we applied our permutation procedure to subsamples of DNAm sites at increasing densities (x_i = 5, 15.
- Line graphs depicting the relationship between the number of EPIC array DNA methylation sites (x-axis) and a) the 5% family-wise error rate (FWER.
- The final point includes all DNA methylation sites on the EPIC array and therefore could not be resampled to generate a confidence interval.
- Line graphs depicting the relationship between the number of DNA methylation sites (x-axis) and a) the effective number of independent tests (y-axis) and b) the multiple testing corrected threshold.
- The blue horizontal line represents the estimated asymptote of the Monod model of 5,803,067 independent tests, equivalent (by 0.05 / 5,803,067) to a genome-wide significance threshold of approximately 8.6 × 10⁻⁹.
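
As a rough illustration of the saturation argument, the sketch below assumes the standard Monod form y = a·x / (b + x), where a is the asymptote. Only the asymptote (5,803,067) comes from the text; the half-saturation constant `HALF_SATURATION` is a hypothetical value chosen for illustration, not an estimate from the study.

```python
def monod(x, asymptote, half_saturation):
    """Standard Monod saturation curve: rises with x and levels off at `asymptote`."""
    return asymptote * x / (half_saturation + x)


ASYMPTOTE = 5_803_067          # estimated independent tests, from the text
HALF_SATURATION = 1_000_000    # hypothetical value, for illustration only

# Inverse Bonferroni: threshold implied by the asymptotic number of tests.
genome_wide_threshold = 0.05 / ASYMPTOTE

# Effective tests at a few site counts: gains shrink as coverage grows.
curve = [monod(x, ASYMPTOTE, HALF_SATURATION) for x in (10**5, 10**6, 10**7, 10**8)]
```

Because the curve flattens, each additional block of sites adds fewer independent tests than the previous one.
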
- First, we considered the level of DNAm at each site, hypothesising that sites which are located at the extremes of the distribution would be more likely to violate the assumptions of the tests.
- Table 1 Summary of DNA methylation sites significantly rejecting the assumptions of linear regression.
- For each of the 5 tests performed by the gvlma package the number and percentage of DNA methylation sites with significant p-values (P <.
- because sites with low levels of variation are typically located at the boundaries of the distribution of DNAm (Additional file 1: Figure S7).
- Venn diagram depicting the overlap of DNA methylation sites significant for each test of a linear assumption (P <.
- Presented are the number of overlapping DNA methylation sites along with the percentage of all tested sites shown in brackets.
- Fig. 4: Comparison of tests of linear regression assumptions across the distribution of DNA methylation levels.
- Boxplots of −log10(p-value) for each of the 5 tests, (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity, for groups of DNA methylation sites binned by their mean DNA methylation level.
- As with the beta-value analysis, the primary assumption violated by M-values related to the shape of the distribution of residuals.
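
For readers unfamiliar with the two scales, a minimal sketch of the usual conversion is given below, using the common definition M = log2(β / (1 − β)). The clipping constant `eps` is an assumption added here to avoid infinite values at exactly 0 or 1; it is not taken from the study.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value (0..1) to an M-value via the log2 logit."""
    # Clip away from the boundaries so the logarithm stays finite (assumed eps).
    b = min(max(beta, eps), 1 - eps)
    return math.log2(b / (1 - b))


def m_to_beta(m):
    """Inverse transform: M-value back to a beta-value."""
    return 2 ** m / (2 ** m + 1)


# Equal steps in beta are stretched on the M scale near the boundaries.
m_values = [beta_to_m(b) for b in (0.01, 0.1, 0.5, 0.9, 0.99)]
```
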
- We found that the distribution of the mean rank was normally distributed with a mean of 402,446 (Additional file 1: Figure S9), similar to the expected value.
- We observed no association between p-values from the gvlma tests and a DNAm site's mean rank, indicating that even highly significant rejections of the assumptions of linear regression do not bias EWAS results in terms of either false positives or false negatives (Fig. 6).
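
The mean-rank check can be illustrated with a small sketch. If a site is unbiased, its rank in a null study is equally likely to fall anywhere from 1 to N, so its expected mean rank is (N + 1) / 2; for the 804,826 sites tested this is 402,413.5, close to the reported mean of 402,446. The toy simulation below uses uniform random p-values and hypothetical helper names; it is not the study's pipeline.

```python
import random


def expected_mean_rank(n_sites):
    """Expected rank of an unbiased site among n_sites."""
    return (n_sites + 1) / 2


def rank_sites(pvalues):
    """Rank sites within one study: 1 = smallest p-value."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    ranks = [0] * len(pvalues)
    for rank, site in enumerate(order, start=1):
        ranks[site] = rank
    return ranks


def mean_ranks(studies):
    """Average each site's rank across all simulated studies."""
    totals = [0] * len(studies[0])
    for pvalues in studies:
        for site, rank in enumerate(rank_sites(pvalues)):
            totals[site] += rank
    return [total / len(studies) for total in totals]


# Toy null studies: 200 simulations of 50 sites with uniform p-values.
rng = random.Random(7)
studies = [[rng.random() for _ in range(50)] for _ in range(200)]
site_mean_ranks = mean_ranks(studies)
```

A site whose mean rank sat far from (N + 1) / 2 would be systematically favoured or penalised under the null.
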
- Fig. 5: Comparison of tests of linear regression assumptions against DNA methylation variability.
- Boxplots of −log10(p-value) for each of the 5 tests, (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity, for groups of DNA methylation sites binned by their standard deviation.
- Fig. 6: Comparison of tests of linear regression assumptions with bias in DNA methylation association studies.
- Scatterplots of −log10(p-value) (y-axis) from the (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity tests performed in the R gvlma package against average (mean) ranking from 1000 simulated null association studies (x-axis) for all DNA methylation sites.
- We attempted to extrapolate from the experiment-wide threshold for the EPIC array to estimate an appropriate threshold for all potential tests across the genome, including those not currently profiled by the EPIC array, by using simulations to profile how the number of independent tests changes as the coverage of the microarray increases.
- Therefore, our estimate of the number of independent tests in the genome is likely to be imprecise.
- Fig. 7: Power curves of typical DNA methylation studies.
- This was particularly the case for DNAm sites that have low levels of variation or are located at the extreme ends of the distribution.
- We show that linear regression is a valid statistical methodology for DNAm studies, despite the fact that the data do not always satisfy the assumptions of the test.
- In other words, continually increasing the number of sites studied has diminishing returns in terms of the increase in additional variation captured.
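- For each of the site densities (x_i), the effective number of independent tests (m_i) was calculated by using the inverse of the Bonferroni correction for multiple testing, m_i = 0.05 / P_Ti, where P_Ti is the significance threshold estimated at that density.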
- For each of the 804,826 models (one per DNAm site) we tested for violations of the assumptions of linear regression using the gvlma (Global Validation of Linear Model Assumptions) R package [37].
- This package performs four tests to test the performance of the model fit with regards to the four assumptions of a linear regression: linearity, homoskedasticity, uncorrelatedness and normality of the residuals (Additional file 1:.
- In order to assess how DNAm sites on the EPIC array performed across these five tests we plotted Quantile-Quantile (QQ) plots of the observed vs expected p-values.
- These mean ranks were then compared with the p-values of the gvlma tests.
- Power calculations were performed for each of the 804,826 sites in the dataset using the function pwr.t.test from the R package pwr [54].
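
To give a feel for what such a calculation involves, here is a minimal sketch of two-sample power using a normal approximation to the t-test. It is not `pwr.t.test` itself, and the effect size, standard deviation and group size in the example are hypothetical inputs rather than values from the study; the default `alpha` uses the 9 × 10⁻⁸ threshold stated in the summary.

```python
import math
from statistics import NormalDist


def two_sample_power(mean_diff, sd, n_per_group, alpha=9e-8):
    """Approximate two-sided power for comparing two group means.

    Normal approximation to the two-sample t-test (an assumption that
    is reasonable for large groups, not an exact noncentral-t result).
    """
    z = NormalDist()
    effect_size = mean_diff / sd                      # Cohen's d
    ncp = effect_size * math.sqrt(n_per_group / 2)    # shift of the test statistic
    crit = z.inv_cdf(1 - alpha / 2)                   # two-sided critical value
    return z.cdf(ncp - crit) + z.cdf(-ncp - crit)


# Hypothetical site: 2% mean difference, SD of 5%, 500 samples per group.
example_power = two_sample_power(mean_diff=0.02, sd=0.05, n_per_group=500)
```

Repeating this per site, with each site's own standard deviation, gives the proportion of sites at which a study of a given size is adequately powered.
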
- Quantile-Quantile plots of the 5 tests of the assumptions of linear regression.
- −log10(p-value) from the (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity tests performed in the R gvlma package for all DNA methylation sites.
- Scatterplots of −log10(p-value) against mean DNA methylation level from the (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity tests performed in the R gvlma package for all DNA methylation sites.
- Each point represents a single site, and the color of the point represents the density of points plotted (low density in grey to high density in yellow).
- Scatterplots of DNA methylation standard deviation against −log10(p-values) from the (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity tests performed in the R gvlma package.
- Each point represents a single DNA methylation site, and the color of the points represents the density of points plotted (low density in grey to high density in yellow).
- DNA methylation sites were defined as variable if the range of their middle 80% of values, calculated as the 90th percentile (P90) minus the 10th percentile (P10), was greater than 5%.
- y-axis) against mean methylation level (x-axis), for all DNA methylation sites tested.
- The color of the points represents the density of points plotted (low density in grey to high density in yellow).
- Comparison of suitability of linear regression assumptions for M-values across the distribution of DNA methylation levels.
- Boxplots of −log10(p-value) for each of the 5 tests, (a) global, (b) skewness, (c) kurtosis, (d) link function and (e) heteroskedasticity, for groups of DNA methylation sites binned by their mean DNA methylation level, measured as a beta-value.
- Histogram of DNA methylation sites' mean rank from simulated null association studies.
- Estimating the multiple testing correction significance threshold for sub-samples of EPIC array DNA methylation sites.
- Summary of results from tests of assumptions of linear regression separated by mean DNA methylation level.
- The number and percentage of DNA methylation sites significant for each test (P <.
- split by mean DNA methylation level.
- split by DNA methylation level standard deviation.
- Summary of results from tests of assumptions of linear regression separated by DNA methylation variability status.
- Variable DNA methylation sites are defined as those with the range of their middle 80% of values greater than 5%.
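
The variability definition translates directly into code. The sketch below assumes beta-values on the 0 to 1 scale (so 5% corresponds to 0.05) and uses the "inclusive" percentile method from Python's standard library; the exact percentile convention used in the study is not stated, so that choice is an assumption, and `is_variable` is a hypothetical helper name.

```python
from statistics import quantiles


def is_variable(betas, min_range=0.05):
    """True if the middle 80% of values (P90 - P10) spans more than min_range."""
    # quantiles(..., n=10) returns the nine decile cut points P10..P90.
    deciles = quantiles(betas, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    return (p90 - p10) > min_range


# A site spread across the full range versus one pinned near full methylation.
spread_site = [i / 100 for i in range(100)]
pinned_site = [0.95 + 0.0001 * i for i in range(50)]
flags = (is_variable(spread_site), is_variable(pinned_site))
```
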
- Summary of DNA methylation sites significantly rejecting the assumptions of linear regression comparing beta-values and M-values.
- DNA methylation data generation in UKHLS was funded through enhancements to the Economic and Social Research Council (ESRC) grants ES/K005146/1 and ES/N00812X/1.
- Individual-level DNA methylation data are available on application through the European Genome-phenome Archive under accession EGAS (https://www.ebi.ac.uk/ega/home).
- Phenotype linked to DNA methylation data are available through application to the METADAC (www.metadac.ac.uk).
- DNA methylation profiling in breast cancer discordant identical twins identifies DOK7 as novel epigenetic biomarker.
- Genome-scale discovery of DNA-methylation biomarkers for blood-based detection of colorectal cancer.
- Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis.
- Identification of type 1 diabetes-associated DNA methylation variable positions that precede disease diagnosis.
- An integrated genetic-epigenetic analysis of schizophrenia: evidence for co-localization of genetic associations and differential DNA methylation.
- Common DNA methylation alterations in multiple brain regions in autism.
- Alzheimer's disease: early alterations in brain DNA methylation at ANK1, BIN1, RHBDF2 and other loci.
- Principles and challenges of genomewide DNA methylation analysis.
- A genome-wide analysis of DNA methylation and fine particulate matter air pollution in three study populations: KORA F3, KORA F4, and the normative aging study.
- Differences in smoking associated DNA methylation patterns in south Asians and Europeans.
- Novel region discovery method for Infinium 450K DNA methylation data reveals changes associated with aging in muscle and neuronal pathways.
- DNA methylation contributes to natural human variation.
- Association of DNA methylation with age, gender, and smoking in an Arab population.
- Genome-wide DNA methylation analysis of systemic lupus erythematosus reveals persistent hypomethylation of interferon genes and compositional changes to CD4+ T-cell populations.
- A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies.
- Bigmelon: tools for analysing large DNA methylation datasets.
- Leveraging DNA-methylation quantitative-trait loci to characterize the relationship between methylomic variation, gene expression, and complex traits.
- Epigenome-wide profiling identifies significant differences in DNA methylation between matched-pairs of T- and B-lymphocytes from healthy individuals.
- Aberrant DNA methylation associated with Alzheimer's disease in the superior temporal gyrus.
- Beta regression improves the detection of differential DNA methylation for epigenetic epidemiology.
- Elevated polygenic burden for autism is associated with differential DNA methylation at birth.
- DNA methylation in newborns and maternal smoking in pregnancy: genome-wide consortium meta-analysis.
- Power and sample size estimation for epigenome-wide association scans to detect differential DNA methylation.
- DNA methylation age of human tissues and cell types.
- DNA methylation arrays as surrogate measures of cell mixture distribution.
- Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis
