Dealing with big data in epigenetic studies

There’s been a long-standing need for tools and mathematical techniques that effectively deal with the large amounts of data generated by epigenetic studies. When researchers fail to transform large data sets into information, their experimental conditions and results aren’t completely or accurately described.

This 2018 Baltimore review/promotional paper described an approach that promised to resolve the following data issues.

1. Epigenetic changes occur in many ways and areas, and they often aren’t isolated from each other:

“Fully characterizing the polymorphic and stochastic nature of DNA methylation requires specification of joint probability distributions of methylation patterns formed by sets of spatially coupled CpG sites.”

2. The absence of DNA methylation or gene expression is itself a signal that should be processed into information. A study of DNA methylation and age described this limitation as:

“Due to the methods applied in the present study, not all the effects of DNA methylation on gene expression could be detected; this limitation is also true for previously reported results.

The textbook case of DNA methylation regulating gene expression (the methylation of a promoter and silencing of a gene) remains undetected in many cases because in an array analysis, an unexpressed gene shows no signal that can be distinguished from background and is therefore typically omitted from the analysis.”

The current review described the problem as:

“These techniques assign zero probabilities to unobserved methylation patterns despite their biological plausibility, which results in underestimating the true biological heterogeneity of methylation patterns.”

3. As a subset of the above, unknown or random past causes and effects of epigenetic changes aren’t adequately modeled:

“We demonstrated..that the empirical approach to joint methylation analysis..does not perform well when dealing with highly stochastic methylation data.”
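The zero-probability problem quoted under item 2 can be illustrated with a small sketch. This is not informME’s method. It’s a toy version of the empirical approach the review criticizes: the joint distribution is estimated by counting how often each methylation pattern appears among the reads. The function names, the number of CpG sites, the read count, and the simulation of sites as independent coin flips are all assumptions made for illustration. Real CpG sites are spatially coupled, as the first quote notes.

```python
from collections import Counter
from itertools import product
import random


def simulate_reads(n_reads, n_sites, p_meth, seed=0):
    """Simulate reads as tuples of 0/1 methylation calls.

    Assumption for illustration only: each site is methylated
    independently with probability p_meth.
    """
    rng = random.Random(seed)
    return [
        tuple(int(rng.random() < p_meth) for _ in range(n_sites))
        for _ in range(n_reads)
    ]


def empirical_joint(reads, n_sites):
    """Estimate the joint distribution over all 2**n_sites patterns
    by relative frequency. Patterns never seen get probability 0."""
    counts = Counter(reads)
    total = len(reads)
    return {
        pattern: counts.get(pattern, 0) / total
        for pattern in product((0, 1), repeat=n_sites)
    }


n_sites = 4   # hypothetical region with 4 CpG sites -> 16 possible patterns
n_reads = 10  # hypothetical coverage, fewer reads than patterns

reads = simulate_reads(n_reads, n_sites, p_meth=0.5)
dist = empirical_joint(reads, n_sites)
unobserved = [p for p, prob in dist.items() if prob == 0.0]

print(f"{len(unobserved)} of {2 ** n_sites} patterns are assigned probability zero")
```

With 10 reads and 16 possible patterns, at least 6 patterns must go unobserved, so the empirical estimate rules them out even though every one of them is plausible under the simulation. The number of patterns doubles with each added CpG site, so the gap widens quickly at realistic coverage.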

The paper’s approach is tailored for whole genome bisulfite sequencing (WGBS), the “gold-standard experimental technique for studying DNA methylation.” It’s named informME and is publicly available at https://github.com/GarrettJenkinson/informME.