Received date: February 27, 2012; Accepted date: April 22, 2012; Published date: April 26, 2012
Citation: Su J, Huang D, Yan H, Liu H, Zhang Y (2012) Advances in Bioinformatics Tools for High-Throughput Sequencing Data of DNA Methylation. Hereditary Genet 1:107. doi: 10.4172/2161-1041.1000107
Copyright: © 2012 Su J, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
DNA methylation plays crucial roles in regulating gene expression during cellular development and differentiation. Recently, next-generation sequencing (NGS) technologies have spurred a revolution in investigating global DNA methylation profiles. Analysis of DNA methylation patterns on a genome-wide scale is essential to understanding the underlying mechanisms of DNA methylation. Here, we review several next-generation sequencing techniques coupled with different pretreatment methods (endonuclease digestion, affinity enrichment and bisulfite conversion) and summarize the related bioinformatics tools and resources for further analysis of DNA methylation.
DNA methylation; Next-generation sequencing technologies; Bioinformatics tools
Epigenetics, the study of mitotically or meiotically heritable regulatory changes in gene function that occur without changing the DNA sequence, has flourished in recent years. One of the most well-studied epigenetic phenomena is DNA methylation, which occurs mainly at the 5 position of cytosine (C5) in the sequence contexts CpG, CpHpG and CpHpH (where H is A, T or C) and has been widely observed in animals, plants and fungi. Methylation patterns change constantly over evolutionary time. The most common pattern in invertebrate animals is ‘mosaic methylation’, whereby stably methylated domains are interspersed with methylation-free regions. In contrast, vertebrate genomes are globally methylated, with the exception of CpG islands (genomic regions enriched in CpGs). Compared with the relative stability of genomic DNA sequences, DNA methylomes change dynamically among different cells and even vary with changing conditions within a single cell.
As an important epigenetic mark involved in a diverse range of biological processes, DNA methylation is well known for its roles in stable transcriptional gene silencing, X inactivation and genomic imprinting. Current evidence indicates that DNA methylation also plays a significant role in maintaining cellular function and in the development of autoimmunity and aging [7,8]. Aberrant DNA methylation may be associated with disordered gene expression in carcinogenesis. Moreover, DNA methylation is adapted for a specific cellular memory function in development, which is supported by the heritability and the secondary nature of DNA methylation states. It is therefore of great significance to study the biological function of DNA methylation and its underlying mechanism in various organisms. Fortunately, the development of sequencing technology has made it easier to measure genome-wide DNA methylation profiles, which is a prerequisite for understanding the role of methylation in development and disease. This review focuses mainly on bioinformatics tools for processing and analyzing high-throughput DNA methylation data generated by next-generation sequencing technologies (Figure 1).
NGS-based technologies for detecting DNA methylation
Next-generation sequencing (NGS), the latest and most promising methodology for genome-wide analysis of DNA methylation, offers an alternative to array-based DNA methylation analysis. The very large amount of sequence information produced by NGS provides a quantitative measure of DNA methylation abundance. Sequencing-based analysis also increases the efficiency and resolution of DNA methylation detection, because it can use less input DNA to obtain high-coverage sequencing of whole genomes and avoids biases that affect hybridization, such as sequence composition [1,10]. Moreover, the ability to interrogate CpGs in repetitive elements gives sequencing-based approaches a distinct advantage over microarrays. Nearly all technologies for sequence-specific DNA methylation analysis are based on one of three main pretreatment approaches for DNA samples: endonuclease digestion, affinity enrichment and bisulfite conversion. Different combinations of pretreatment methods and subsequent molecular biology techniques, such as DNA microarrays and high-throughput sequencing, have produced a plethora of techniques for mapping DNA methylation on a genome-wide scale. In the following section, we review the various NGS-based technologies for detecting DNA methylation, as summarized in Table 1.
|Pretreatment method||Genome coverage||NGS-based analysis||Application||Ref|
|Endonuclease digestion||Moderate||Methyl-seq||Assay a range of genomic elements, allowing a broader survey of regions than classic methylation studies limited to CpG islands and promoters|||
|Endonuclease digestion||Moderate||HELP-seq||Measurement of repetitive sequences, copy-number variability, allele-specific methylation and smaller fragments (<50 bp); sensitive detection of hypomethylated loci||[12,75]|
|Endonuclease digestion||Moderate||MSCC||Identification of the unmethylated regions of a genome by pinpointing unmethylated CpGs at single base-pair resolution||[13,14]|
|Affinity enrichment||Moderate||MeDIP-seq||Generation of unbiased, cost-effective, full-genome methylation levels without the limitations of restriction sites or CpG islands||[1,76]|
|Affinity enrichment||Moderate||MIRA-seq||Analyze recovered or double-stranded methylated DNA on a genome-wide scale; applicable to various clinical and diagnostic situations|||
|Affinity enrichment||Moderate||MBD-seq||Applicable to any biological setting to identify differentially methylated regions at the genomic scale|||
|Affinity enrichment||Moderate||MethylCap-seq||Detection of differentially methylated regions (DMRs) with high genome coverage; detection of DMRs in clinical samples||[20,77]|
|Bisulfite conversion||High||BS-seq||Sensitively measure cytosine methylation on a genome-wide scale within specific sequence contexts||[48,78]|
|Bisulfite conversion||High||RRBS||Analyze a limited number of gene promoters and regulatory sequence elements in a large number of samples; analyze and compare genomic methylation patterns||[11,28]|
|Bisulfite conversion||High||BSPP||Focus sequencing on the most informative genomic regions; exon capturing and SNP genotyping; detecting methylation in large genomes|||
|Bisulfite conversion||High||BC-seq||Detect site-specific switches in methylation; determine DNA methylation frequencies in CGIs sampled from a variety of genomic settings including promoters, exons, introns and intergenic loci|||
Table 1: NGS-based technologies for detecting DNA methylation.
Methylation-sensitive restriction endonucleases (such as HpaII and HhaI) and their methylation-insensitive isoschizomers (such as MspI, the isoschizomer of HpaII) are widely used to distinguish methylated from unmethylated cytosines. Over the past decade, several methods based on methylation-sensitive restriction have been developed. HELP-seq, which uses NGS to analyze the output of the HELP assay (HpaII-tiny fragment enrichment by ligation-mediated PCR), is more sensitive than array-based HELP in identifying hypomethylated loci. Methyl-sensitive cut counting (MSCC) is a cost-effective approach that detects unmethylated CpGs at single-base resolution using a flanking cut with a type IIS restriction enzyme (MmeI) and adaptor ligation after HpaII digestion [13-15]. Another method, Methyl-seq, sequences the fragments digested by HpaII or MspI rather than randomly sheared fragments.
Affinity purification of methylated DNA, the most recent and simplest way to enrich methylated DNA, was first demonstrated with the methyl-binding protein MECP2. Enrichment-based methods are widely employed to survey DNA methylation patterns. These methods use NGS to sequence methylated DNA fragments obtained either with methyl-CpG binding domain (MBD) proteins (MBD-seq, MIRA-seq and MethylCap-seq) or by immunoprecipitation with antibodies specific for methylated cytosine (MeDIP-seq). MeDIP-seq and MBD-seq are similar in concept: fragmented DNA is enriched based on its methylation content. MIRA-seq (methylated CpG island recovery assay) captures fragmented double-stranded methylated DNA using the MBD2b/MBD3L1 complex and then sequences the eluted DNA. In MethylCap-seq, a methyl-binding domain protein is used to enrich DNA fractions with similar methylation levels, and the avidity of the DNA-MBD interaction depends on the local methyl-CpG density [20,23,24].
The discovery that sodium bisulfite treatment deaminates unmethylated cytosines in single-stranded DNA to uracil while leaving methylated cytosines intact has driven a revolution in DNA methylation analysis since the 1990s [25,26]. Bisulfite-converted DNA is particularly well suited to sequencing-based approaches, especially with the application of next-generation sequencing platforms. Whole-genome bisulfite sequencing (BS-seq), which combines sodium bisulfite treatment of DNA with NGS, is a powerful technique for measuring the methylation state genome-wide at single-base resolution. Nevertheless, mapping high-throughput bisulfite reads to the reference genome remains a great challenge. A modified version of BS-seq, reduced representation bisulfite sequencing (RRBS), is based on size selection of restriction fragments. Although RRBS reduces sequence redundancy, it is limited to methylation at restriction sites. To address this limitation, bisulfite padlock probes (BSPP) can be used: padlock probes capture an arbitrary set of sequencing targets from bisulfite-converted DNA in a highly parallel manner. Another bisulfite-based method, BC-seq, profiles DNA methylation in genomic regions spanning tens of millions of bases by combining bisulfite conversion with hybrid selection and deep sequencing.
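The chemistry described above can be illustrated with a small in silico sketch. The function name `bisulfite_convert` and the toy sequence are hypothetical and chosen only for illustration; the sketch assumes a single strand and complete conversion, and it writes converted cytosines as T because uracil is read as thymine after amplification.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment (plus amplification) of one DNA strand.

    Unmethylated C is deaminated to U and read as T; methylated C is kept.
    `methylated_positions` holds 0-based indices of methylated cytosines.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")  # unmethylated cytosine is converted
        else:
            converted.append(base)  # methylated C and all other bases stay
    return "".join(converted)


if __name__ == "__main__":
    # Toy strand with cytosines at indices 1, 4 and 5; only index 1 is methylated.
    print(bisulfite_convert("ACGTCCGA", methylated_positions={1}))
```

Comparing the converted strand with the original shows why a remaining C marks a methylated position, and why the converted read no longer matches the reference exactly.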
Tools for mapping short reads to references and peak detection
Because the combination of NGS and the pretreatment methods generates millions to billions of short reads, mapping this volume of data to a reference genome poses several challenges, including alignment accuracy, running time and computational cost. Several tools have been developed to address these issues, such as MAQ, SOAPaligner/SOAP2, Bowtie, BLAT, BWA, BFAST and SHRiMP (Table 2). These methods fall into two categories: hash table-based methods and Burrows–Wheeler-based methods. SHRiMP, MAQ, SOAP and BFAST are hash table-based tools, which hash either the reads or the reference by constructing a hash table of short oligomers. Among them, MAQ is an accurate, efficient, versatile and user-friendly tool that has been widely used to align short reads from a single individual. However, like other hash table-based aligners, MAQ has two drawbacks: it requires a large amount of memory to build an index for the human genome, and it is unsuitable for aligning longer reads because it does not support gapped alignment of single-end reads. Several tools, such as Bowtie, BWA and SOAP2, are instead based on Burrows–Wheeler indexing. Bowtie is a short-read aligner that offers competitive speed and high accuracy, particularly for single-end rather than paired-end reads. SOAP2 is a significantly improved version of the original SOAP [35,36]. By implementing Burrows–Wheeler indexing, SOAP2 both reduces memory usage and substantially increases alignment speed. Unlike Bowtie, SOAP2 can be applied to both single-end and paired-end reads. BWA can additionally perform gapped alignment of short reads, at the cost of alignment speed.
In summary, the ability of these tools to map tens of millions of short reads to a reference genome efficiently opens the door to further investigation of DNA methylation patterns.
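The two indexing strategies can be contrasted with a toy sketch. It is not the implementation of any of the tools named above: the function names are hypothetical, only exact matches are handled, and the Burrows–Wheeler transform is built by naive rotation sorting, which is feasible only for very short sequences.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table strategy: map every k-mer of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hash_lookup(read, reference, index, k):
    """Seed with the first k bases of the read, then verify the whole read."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text += "$"  # sentinel that sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def backward_search(bwt_string, pattern):
    """Count exact occurrences of `pattern` using only the BWT string."""
    first_occurrence = {}
    for i, c in enumerate(sorted(bwt_string)):
        first_occurrence.setdefault(c, i)  # rank of first row starting with c

    def occ(c, i):
        return bwt_string[:i].count(c)  # occurrences of c in bwt_string[:i]

    lo, hi = 0, len(bwt_string)  # half-open interval of matching rows
    for c in reversed(pattern):
        if c not in first_occurrence:
            return 0
        lo = first_occurrence[c] + occ(c, lo)
        hi = first_occurrence[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference = "ACGTACGTTACG"
    index = build_kmer_index(reference, k=3)
    print(hash_lookup("ACGT", reference, index, k=3))
    print(backward_search(bwt(reference), "ACGT"))
```

The hash table stores a position list for every k-mer, which is where the memory cost comes from, whereas backward search needs only the transformed string and small count tables; production aligners replace the naive `occ` scan with sampled rank structures.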
Table 2: Alignment tools for short reads.
To detect functionally significant regions with distinct methylation patterns, several peak detection algorithms originally developed for ChIP-seq data have been applied to identify significantly methylated or unmethylated regions from NGS data of DNA methylation [37,38]. To our knowledge, more than 14 dedicated peak detection tools have been developed, as summarized in Table 3. A sliding window is the technique most commonly used in peak detection algorithms such as MACS, PeakSeq, FindPeaks and USeq. After identifying windows, these methods use different approaches to determine which windows are truly enriched regions. For example, FindPeaks calculates the significance of genomic regions without a control sample, assuming that the reads follow a Poisson distribution, while MACS uses a control sample to model the background distribution of the reads more accurately and empirically estimates the false discovery rate (FDR) for each detected peak. There are also several statistical algorithms, such as BayesPeak, which is based on a fully Bayesian hidden Markov model, and chromaSig, which is based on an unsupervised learning method. Useful comparisons of these algorithms have been made by Wilbanks and Facciotti and by Pepke et al. However, these peak detection algorithms do not take the CpG density of genomic fragments into consideration when detecting peaks of tag density. A new algorithm that identifies significant peaks while correcting for the effect of CpG density would therefore help to detect potentially methylated or unmethylated regions of the genome from NGS data.
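The control-free, Poisson-based sliding-window idea described above can be sketched as follows. This is a simplified illustration and not the algorithm of FindPeaks, MACS or any other listed tool; the window size, step, significance threshold and function names are assumptions made for the example, and no multiple-testing correction or CpG-density correction is applied.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(read_starts, genome_length,
                          window=200, step=100, alpha=1e-3):
    """Flag windows whose read count is unlikely under a uniform background."""
    # Expected reads per window if reads were spread evenly over the genome.
    lam = len(read_starts) * window / genome_length
    enriched = []
    for w_start in range(0, genome_length - window + 1, step):
        w_end = w_start + window
        count = sum(1 for s in read_starts if w_start <= s < w_end)
        p_value = poisson_upper_tail(count, lam)
        if p_value < alpha:
            enriched.append((w_start, w_end, count, p_value))
    return enriched


if __name__ == "__main__":
    # Sparse background (one read every 500 bp) plus a cluster near 5000.
    reads = list(range(0, 10000, 500)) + list(range(5000, 5015))
    for w_start, w_end, count, p_value in call_enriched_windows(reads, 10000):
        print(w_start, w_end, count, p_value)
```

A control-based caller would replace the genome-wide rate `lam` with a local rate estimated from the control sample, which is the refinement the text attributes to MACS.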
|Tool||Description||URL||Ref|
|MACS||Model-based analysis of ChIP-Seq||http://liulab.dfci.harvard.edu/MACS/|||
|ChIPseeqer||in-depth analysis of ChIP-seq datasets||http://physiology.med.cornell.edu/faculty/elemento/lab/CS_files/ChIPseeqer-2.0.tar.gz|||
|HPeak||A HMM-based algorithm for defining read enriched regions||www.sph.umich.edu/csg/qin/HPeak|||
|CASSys||ChIP-seq data Analysis Software System||http://localness.zbh.uni-hamburg.de/~ProjektChipSeq/cgi-bin/login.rb|||
|PeakSeq||A general scoring approach for ChIP-seq data analysis.||http://info.gersteinlab.org/PeakSeq|||
|Sole-Search||Integrated peak-calling and analysis software||http://chipseq.genomecenter.ucdavis.edu/cgi-bin/chipseq.cgi|||
|SISSRS||Precise identification of binding sites from short reads generated from ChIP-Seq experiment||http://sissrs.rajajothi.com/|||
|BayesPeak||Bayesian analysis of ChIP-seq data||http://bioconductor.org/packages/release/bioc/html/BayesPeak.html|||
|PeakRanger||A cloud-enabled peak caller for ChIP-seq data||http://www.modencode.org/software/ranger/|||
|FindPeaks 3.1||A tool for identifying areas of
|Sole-Search||An integrated analysis program for peak detection and functional annotation using ChIP-seq data||http://chipseq.genomecenter.ucdavis.edu/cgi-bin/chipseq.cgi|||
|PeakAnalyzer||Genome-wide annotation of chromatin binding and modification loci||http://www.bioinformatics.org/peakanalyzer/wiki/|||
|chromaSig||A Probabilistic Approach to Finding Common Chromatin Signatures||http://bioinformatics-renlab.ucsd.edu/rentrac/wiki/ChromaSig|||
|Fish the ChIPs||A pipeline for automated genomic annotation of ChIP-Seq data|
Table 3: Peak detection algorithms.
Analysis of bisulfite sequencing data with computational tools
The alignment approaches for short reads cannot be applied directly to bisulfite sequencing data, because bisulfite conversion turns unmethylated cytosines into uracils, which are read as thymines after amplification. Fortunately, several tools have been developed for analyzing bisulfite sequencing data, as shown in Table 4. Most of the alignment tools combine a strategy for handling the asymmetric C/T conversion with a mapping algorithm taken from an existing short-read mapping program, such as Bowtie (BS Seeker and Bismark) or SOAP (BSMAP). Several software tools are also available for downstream bisulfite sequencing analysis. Some, such as BiQ Analyzer HT and CpGviewer, accept raw bisulfite sequences as input, while others, such as CyMATE and CpG PatternFinder, require aligned sequences as input. QUMA is an interactive web-based tool for quantitative methylation analysis that includes most of the data-processing functions needed for the analysis of bisulfite sequences. With these dedicated tools for bisulfite sequencing data, users can determine the methylation level of CpGs at single-base resolution.
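One common way to handle the asymmetric C/T conversion is to align in a reduced three-letter alphabet and then compare the original read with the original reference. The sketch below illustrates that general idea only; it is not the algorithm of BS Seeker, Bismark or BSMAP. The function names are hypothetical, and the sketch assumes exact matching and reads from the forward strand only.

```python
def c_to_t(seq):
    """Project a sequence onto the three-letter alphabet (every C becomes T)."""
    return seq.upper().replace("C", "T")


def map_bisulfite_read(read, reference):
    """Return all positions where the converted read matches the converted reference."""
    converted_reference = c_to_t(reference)
    converted_read = c_to_t(read)
    positions = []
    start = converted_reference.find(converted_read)
    while start != -1:
        positions.append(start)
        start = converted_reference.find(converted_read, start + 1)
    return positions


def call_methylation(read, reference, pos):
    """For each reference C covered by the read: True = methylated, False = not."""
    calls = {}
    for offset, read_base in enumerate(read.upper()):
        if reference[pos + offset].upper() != "C":
            continue
        if read_base == "C":
            calls[pos + offset] = True   # C survived bisulfite treatment
        elif read_base == "T":
            calls[pos + offset] = False  # C was converted, so unmethylated
    return calls


if __name__ == "__main__":
    reference = "TTACGTCCGATT"
    read = "ACGTTTGA"  # bisulfite read from positions 2-9; only the C at 3 is methylated
    for pos in map_bisulfite_read(read, reference):
        print(pos, call_methylation(read, reference, pos))
```

Averaging such calls over all reads that cover a cytosine gives the per-site methylation level reported at single-base resolution.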
|Tool||Description||URL||Ref|
|RRBSMAP||A fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing||http://rrbsmap.computational-epigenetics.org/|||
|BSMAP||Whole genome bisulfite sequence mapping||http://code.google.com/p/bsmap/|||
|BS Seeker||Precise mapping for bisulfite sequencing||http://pellegrini.mcdb.ucla.edu/BS_Seeker/BS_Seeker.html|||
|Bismark||Map and determine the Methylation state of BS-Seq read||http://www.bioinformatics.bbsrc.ac.uk/projects/bismark/|||
|SOCS-B||An alignment algorithm for bisulfite sequencing using the Applied Biosystems SOLiD System||http://solidsoftwaretools.com/gf/project/socs/|||
|BRAT||Bisulfite-treated reads analysis tool||http://compbio.cs.ucr.edu/brat/|||
|BISMA||Analysis of bisulfite Sequencing data from both unique and repetitive sequences||http://biochem.jacobs-university.de/BDPC/BISMA/|||
|BiQAnalyzerHT||Locus-specific analysis of DNA methylation by high-throughput bisulfite sequencing||http://biq-analyzer-ht.bioinf.mpi-inf.mpg.de/|||
|CpGviewer||Sequence analysis and editing for bisulphite genomic sequencing projects||http://xserve1.leeds.ac.uk/~iancarr/cpgviewer|||
|CpG PatternFinder||Windows-based program for bisulphite DNA||-|||
|CyMATE||Bisulphite-based analysis of plant genomic DNA||http://www.gmi.oeaw.ac.at/CyMATE|||
|GenomeStudio Software||Analyzing data generated from Illumina assays||-||-|
|MethMarker||Design, optimize and validate DNA methylation biomarkers for a given DMR||http://methmarker.mpi-inf.mpg.de/|||
|BDPC||Bisulfite sequencing Data methylation analysis.||http://biochem.jacobs-university.de/BDPC|||
|MethylCoder||Software pipeline for bisulfite-treated sequences||https://github.com/brentp/methylcode|||
|QUMA||Quantification tool for methylation analysis||http://quma.cdb.riken.jp/|||
Table 4: Analysis of bisulfite sequencing data with computational tools.
DNA Methylation databases
To store the vast amount of data generated by the NGS-based technologies described above, several methylation databases have been made available to researchers, who can use the data as input for further research (Table 5). Primary methylation databases include MethDB, designed to store heterogeneous data from different kinds of experiments, and NGSmethDB, established for the storage and retrieval of methylation data derived from NGS. MethylomeDB includes published DNA methylation data related to brain development and function, and MethyCancer is an openly accessible database of human DNA methylation and cancer. Moreover, DiseaseMeth, a human disease methylation database, incorporates gene methylation data derived from cross-dataset analysis of disease and normal samples, and can be used to identify differentially methylated genes and to investigate the relationship between genes and disease. As the study of methylation advances, more databases will be created and more information about methylation will become available.
|Name||Description||URL||Ref|
|MethDB||Database for DNA methylation data||http://www.methdb.de|||
|MethyCancer Database||Database of cancer DNA methylation data||http://methycancer.psych.ac.cn/|||
|PubMeth||Database of DNA methylation literature||http://www.pubmeth.org/|||
|NGSmethDB||Database for DNA methylation data at single-base resolution||http://bioinfo2.ugr.es/NGSmethDB/gbrowse/|||
|DBCAT||Database of CpG islands and analytical tools for identifying comprehensive methylation profiles in cancer cells||http://dbcat.cgm.ntu.edu.tw/|||
|MethylomeDB||Database of DNA methylation profiles of the brain||http://epigenomics.columbia.edu/methylomedb/index.html|||
|DiseaseMeth||Human disease methylation database||http://bioinfo.hrbmu.edu.cn/diseasemeth|||
|CpG IE||Identification of CpG islands||http://bioinfo.hku.hk/cpgieintro.html|||
|CpG IS||Identification of CpG islands||http://cpgislands.usc.edu/|||
|CG clusters||Identification of CpG islands||http://greallylab.aecom.yu.edu/cgClusters/|||
|CpGcluster||Identification of CpG islands||http://bioinfo2.ugr.es/CpGcluster|||
|CpGIF||Identification of CpG islands||http://www.usd.edu/~sye/cpgisland/CpGIF.htm|||
|CpG_MI||Identification of CpG islands||http://bioinfo.hrbmu.edu.cn/cpgmi|||
|CpGProD||Identification of CpG islands||http://pbil.univ-lyon1.fr/software/cpgprod.html|||
|EpiGRAPH||Genome scale statistical analysis||http://epigraph.mpi-inf.mpg.de/WebGRAPH|||
|Galaxy||General purpose analysis||http://main.g2.bx.psu.edu/|||
|QDMR||Identification of differentially methylated regions||http://bioinfo.hrbmu.edu.cn/qdmr|||
|Batman||MeDIP DNA methylation analysis tool||http://td-blade.gurdon.cam.ac.uk/software/batman|||
|CisGenome Browser||A flexible tool for genomic data visualization||http://biogibbs.stanford.edu/~jiangh/browser/|||
|MethVisual||Visualization and exploratory statistical analysis of DNA methylation profiles from bisulfite sequencing||http://methvisual.molgen.mpg.de/|||
|MethTools||A toolbox to visualize and analyze DNA methylation data||http://genome.imb-jena.de/methtools/|||
Table 5: Bioinformatics tools.
CpG islands (CGIs) are genomic regions with a high frequency of CpGs that typically occur in promoter regions. Because of their importance as genomic markers in promoter regions and as epigenetic regulatory regions associated with promoter activity, the identification of CGIs is indispensable. CGIs can be identified through either experimental [60,61] or computational methods. Here, we introduce only some computational methods (Table 5). Three sequence parameters, namely length, GC content and the ratio of observed to expected CpGs (ObsCpG/ExpCpG), are commonly used as sliding-window criteria in identification algorithms (CpG IE, CpG IS and CpGProD). However, the traditional CGI criteria often identify repetitive sequences, which are generally highly methylated. Although CpGProD and CpG IS use more stringent criteria to address this problem, a portion of functional CGIs may be missed because of ad hoc thresholds. In addition, the window size and step size may limit the number and length of CGIs found by these methods. As a result, rather than further revising the existing base-composition criteria, several other methods focus on the statistical properties of a sequence. For example, CpGcluster identifies CGIs based on the physical distances between neighboring CpGs, and CG clusters obtains CG-dense fragments based on an empirical species-specific definition of a CG cluster. Nevertheless, compared with the sequence-criteria-based methods, CpGcluster has a high false positive rate, and the proportion of promoter-associated CGIs in CG clusters is slightly lower. Among the current tools for identifying CpG islands, CpG_MI achieved the highest prediction accuracy for functional CpG islands by fully utilizing the cumulative mutual information of physical distances between neighboring CpGs. These algorithms for identifying CpG islands provide functional regions for studies of DNA methylation.
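The three sliding-window criteria can be made concrete with a short sketch. The thresholds used as defaults (window of 200 bp, GC content of at least 0.5, ObsCpG/ExpCpG of at least 0.6) are the commonly cited classical values and are an assumption here; the text above does not state which thresholds each tool uses, and the function names are hypothetical. The ratio is computed as ObsCpG/ExpCpG = (number of CpGs × N) / (number of Cs × number of Gs) for a window of length N.

```python
def window_stats(seq):
    """Return (GC content, ObsCpG/ExpCpG) for one window."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    expected_cpg = c_count * g_count / n if n else 0.0
    obs_exp = cpg_count / expected_cpg if expected_cpg > 0 else 0.0
    return gc_content, obs_exp


def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Slide a window along `seq` and merge overlapping windows that pass."""
    islands = []
    for start in range(0, len(seq) - window + 1, step):
        gc_content, obs_exp = window_stats(seq[start:start + window])
        if gc_content >= min_gc and obs_exp >= min_obs_exp:
            end = start + window
            if islands and start <= islands[-1][1]:
                islands[-1] = (islands[-1][0], end)  # extend previous island
            else:
                islands.append((start, end))
    return islands


if __name__ == "__main__":
    # Toy sequence: a CG-rich block flanked by AT-only sequence.
    toy = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_cpg_islands(toy, window=200, step=50))
```

Changing `window` or `step` in the toy example shifts the reported boundaries, which illustrates the dependence on window and step size noted above.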
Other analysis tools of methylation data
Several other methods for bioinformatics analysis of DNA methylation are also listed in Table 5. Besides CpG islands, differentially methylated regions (DMRs) are another focus of recent DNA methylation studies. Compared with other methods based on statistics or counting, QDMR (quantitative differentially methylated regions) is an effective tool for quantifying methylation differences and identifying DMRs across multiple samples by adapting Shannon entropy. CisGenome Browser is a widely applicable tool for data visualization. The comprehensive tools EpiGRAPH and Galaxy support the analysis of genomic and epigenomic data, such as genome sequences, conservation scores, methylation data or any signal associated with genomic loci or regions generated by biological experiments.
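The entropy idea can be sketched with plain Shannon entropy over the methylation levels of one region across samples. This is only an illustration of the underlying principle and not QDMR's adapted formula; the function name, the normalization of levels to proportions, and the handling of an all-zero region are assumptions made for the example.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (bits) of one region's methylation levels across samples.

    Low entropy: methylation is concentrated in few samples (candidate DMR).
    High entropy: methylation is similar in all samples.
    """
    total = sum(levels)
    if total == 0:
        # Unmethylated everywhere: no difference between samples (assumption).
        return math.log2(len(levels))
    proportions = [level / total for level in levels]
    return -sum(p * math.log2(p) for p in proportions if p > 0)


if __name__ == "__main__":
    regions = {
        "uniform": [0.5, 0.5, 0.5, 0.5],
        "sample_specific": [0.9, 0.0, 0.0, 0.0],
        "intermediate": [0.8, 0.8, 0.1, 0.1],
    }
    # Rank regions from most to least sample-specific.
    for name in sorted(regions, key=lambda r: methylation_entropy(regions[r])):
        print(name, round(methylation_entropy(regions[name]), 3))
```

With four samples the entropy ranges from 0 bits (methylated in a single sample) to 2 bits (identical levels in all samples), so a threshold on this value separates candidate DMRs from uniformly methylated regions.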
In recent years, researchers have paid increasing attention to DNA methylation in embryonic development and in cancer. Coupled with different pretreatment approaches, numerous next-generation sequencing-based technologies are available for detecting DNA methylation. Although enzyme-based and affinity enrichment-based technologies are relatively simple and cheap compared with bisulfite-based methods, the single-base resolution of bisulfite sequencing makes it possible for researchers to extract methylation information for CpGs genome-wide. The large quantity of data generated by these methods has shifted the bottleneck in DNA methylation research from data generation to data analysis. In this review, we summarized useful alignment methods and peak-detection algorithms, as well as bioinformatics tools for the storage, analysis and visualization of DNA methylation data. As technologies for measuring DNA methylation continue to advance, more computational tools and resources for DNA methylation analysis will become available, which may help users to explore the mechanisms underlying DNA methylation patterns.
JS and DH contributed equally to this work and are regarded as co-first authors. This work was supported in part by the National Natural Science Foundation of China (61075023), the Science Foundation of Heilongjiang Province (C201012 and QC2011C061) and the Scientific Research Fund of Heilongjiang Provincial Education Department (12511272).