Complexity and Entropy Analysis of DNA Methyltransferase


Introduction
DNA methylation is a major component of epigenetics, which refers to heritable changes in gene expression without alteration of the DNA sequence. It plays an important role in the regulation of gene expression, chromatin structure, normal development, cell differentiation, X chromosome inactivation and genomic imprinting [1]. In higher organisms, DNA methylation is regulated by three DNA methyltransferases: DNMT1, DNMT3a and DNMT3b. Functionally, DNMT1 is a maintenance methyltransferase responsible for copying DNA methylation patterns to the daughter strands during DNA replication [2], whereas DNMT3a and DNMT3b are de novo methyltransferases [3] and transcriptional repressors with distinct localization to heterochromatin [4,5]. Most importantly, the methyltransferases are closely related to the speciation process, and their activities vary among tissues, cell types and developmental stages as well as with aging [6][7][8][9]. However, the complexity and structural characteristics of these genes across species are still unknown; an information analysis from the viewpoint of entropy may therefore help elucidate the mechanisms and effects of these genes in epigenetic processes.
In the present research, we hypothesize that the information measure of a gene is related to its genomic complexity, which in turn is associated with the relevant biological processes. We test this assumption with an information analysis based on available sequences from different species. Among the three methyltransferases, the sequence information available for DNMT1 is relatively complete and the gene is more conserved than DNMT3a and DNMT3b, so we focused only on the DNMT1 gene, using Shannon entropy. This information measure was introduced by Claude E. Shannon [10]. Theoretically, it reflects the uncertainty associated with a random variable and quantifies the information contained in a message. The complexity of a symbolic sequence reflects the ability to represent it in a compact form based on some structural features of the sequence, indicating the regularity or randomness of a sequence that encodes a complex structure. Several methods are currently available for measuring information complexity, including entropy [11], generalized complexity [12], clustering of cryptically simple sequence stochastic complexity [13], alphabet capacity l-gram [14], generalized lattice graphs [15], promoter QSAR model [16] and grammatical complexity [17]. They are widely applied in many fields, such as engineering, biology, medicine, agriculture and forestry [18][19][20][21][22][23]. Among these methods, the entropy measure is the simplest, using only the symbol frequencies.
Generally speaking, complex sequences give rise to complex structures. Because DNMT1 is an important gene controlling DNA methylation, the complexity and structural information stored in it should be studied further to determine whether evolution tends towards complexity. In this paper, we explored the information complexity of the gene in different species, analyzed its DNA sequence, protein sequence, and the domain and non-domain regions it encodes, and checked its methylation status in a unique chicken model, in an attempt to elucidate the relationship between the complexity information indices of the DNMT1 gene and biological processes.

Methods and Material
Dataset resource
mRNA sequences, genomic DNA sequences and protein sequences of the DNMT1 gene in different species were obtained from the NCBI website (see supplementary materials). Intron sequences of DNMT1 were obtained from the website http://genome.ucsc.edu/cgi-bin/hgBlat?command=start and organized with Perl scripts. Domain regions of the protein sequences of the different species were obtained from the website http://pfam.sanger.ac.uk/.

Nucleotide composition
We first calculated the GC content and AT content, which are measures of nucleotide composition [24,25]. Both are defined as:

$$\mathrm{GC} = \frac{\sum_{i=1}^{n} (G_i + C_i)}{n}, \qquad \mathrm{AT} = \frac{\sum_{i=1}^{n} (A_i + T_i)}{n} \qquad (1)$$

where n is the window size, $A_i$ is 1 if there is a nucleotide A in the i-th position and 0 otherwise (the terms $T_i$, $C_i$ and $G_i$ are defined similarly), so $\sum A_i$, $\sum T_i$, $\sum C_i$ and $\sum G_i$ are the numbers of the four nucleotides within the window. The GC content and AT content depend on the sliding window size.
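The windowed composition defined above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `window_composition` is hypothetical, the default of 20 bp with non-overlapping windows follows the description in the composition analysis, and dropping a trailing partial window and treating U as T are assumptions.

```python
def window_composition(seq, window=20):
    """Return (gc, at): GC and AT content per non-overlapping window."""
    # Assumption: RNA-style input is mapped to the DNA alphabet.
    seq = seq.upper().replace("U", "T")
    gc, at = [], []
    # Step through the sequence window by window; a trailing partial
    # window is dropped (an assumption, the paper does not say).
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc.append((chunk.count("G") + chunk.count("C")) / window)
        at.append((chunk.count("A") + chunk.count("T")) / window)
    return gc, at


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    gc, at = window_composition("GGCCGGCCAATTAATTGCAT", window=4)
    print(gc, at)
```

Ambiguous bases such as N are counted in neither numerator, so GC and AT need not sum to 1 for windows that contain them.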

Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test compares the distributions of values in two data vectors $X_1$ and $X_2$ of lengths $n_1$ and $n_2$, respectively, representing random samples from some underlying distribution(s). It is non-parametric and distribution free. The P value was obtained with the Statistics Toolbox of MATLAB. The null hypothesis ($H_0$) is that $X_1$ and $X_2$ are drawn from the same continuous distribution, whereas the alternative hypothesis ($H_A$) is that they are drawn from different continuous distributions. $H_0$ is rejected if the index H is 1 and accepted if the index H is 0. The test statistic can be written as:

$$\max_x \left| F_1(x) - F_2(x) \right| \qquad (2)$$

where $F_1(x)$ is the proportion of $X_1$ values less than or equal to x and $F_2(x)$ is the proportion of $X_2$ values less than or equal to x. In this paper, $X_1$ and $X_2$ are the GC (or AT) contents of the two species being compared.
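As an illustration of equation (2), the two-sample statistic can be computed directly from the empirical distribution functions. The sketch below is not the MATLAB routine used in the paper: the function name `ks_two_sample` is hypothetical, and the P value relies on a common asymptotic series approximation, which is an assumption and can be inaccurate for small samples.

```python
import math
from bisect import bisect_right


def ks_two_sample(x1, x2, alpha=0.05):
    """Return (statistic, p_value, h) for a two-sample K-S test."""
    s1, s2 = sorted(x1), sorted(x2)
    n1, n2 = len(s1), len(s2)
    # F_k(v) is the proportion of sample k that is <= v; the maximum
    # gap is attained at one of the observed values.
    d = max(abs(bisect_right(s1, v) / n1 - bisect_right(s2, v) / n2)
            for v in s1 + s2)
    # Asymptotic approximation of the P value (assumption, see lead-in).
    n_eff = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * lam * lam * k * k)
                for k in range(1, 102))
    p = min(max(p, 0.0), 1.0)
    # h = 1 means the null hypothesis is rejected at level alpha.
    h = 1 if p < alpha else 0
    return d, p, h
```

Here `x1` and `x2` would be the per-window GC (or AT) contents of two species; `h` plays the role of the index H described above.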

Methylation analysis of DNMT1 gene
Lines 6₃ and 7₂ of White Leghorn chickens were initially selected at the Avian Disease and Oncology Laboratory in 1939 for resistance or susceptibility to tumors induced by herpesvirus. Tissue samples were collected at 2 weeks and 15 months of age and stored at −20°C until analysis. We then extracted DNA, treated the genomic DNA with bisulfite, amplified it by PCR and quantitatively measured the methylation level of the gene by pyrosequencing [8].

Information analysis
Shannon entropy is a measure of the uncertainty associated with a random variable. The Shannon entropy of a discrete random variable X taking possible values $\{x_1, x_2, \ldots, x_M\}$ is

$$H(X) = -\sum_{i=1}^{M} p(x_i) \log p(x_i)$$

where $p(x_i)$ is the probability of $X = x_i$, M is the number of distinct symbols in the sequence and N is the length of the sequence. According to the Rényi definition, the q-th order generalized entropy is defined as

$$H_q = \frac{1}{1-q} \log \sum_{i=1}^{M} p_i^{\,q}$$

where $p_i$ is the occurrence probability of the i-th symbol in the sequence. When q = 0, $H_0 = H_{\max} = \log M$. If q → 1, $H_1 = -\sum_{i=1}^{M} p_i \log p_i$. When q = 2, $H_2 = -\log \sum_{i=1}^{M} p_i^2$. Thus $H_0$ is the maximum Shannon entropy, which represents a completely random sequence. $H_1$ is the Shannon entropy of information theory and measures the uncertainty of a single character in the sequence. $H_2$ measures the uncertainty of a single character being repeated twice.
It can be shown that the generalized entropy is a non-increasing function of q, so that $H_0 \ge H_1 \ge H_2$.
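The generalized entropy can be estimated from symbol frequencies as in the sketch below. The function name `renyi_entropy` and the `alphabet_size` parameter are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math
from collections import Counter


def renyi_entropy(seq, q, alphabet_size=None):
    """q-th order generalized (Renyi) entropy of a symbol sequence."""
    counts = Counter(seq)
    n = len(seq)
    # Occurrence probability of each observed symbol.
    probs = [c / n for c in counts.values()]
    if q == 0:
        # H_0 = log M; M defaults to the number of observed symbols.
        return math.log(alphabet_size or len(counts))
    if q == 1:
        # Limit q -> 1 gives the Shannon entropy H_1.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1 - q)


if __name__ == "__main__":
    seq = "ATGGCGGCCATTGACC"  # toy sequence, for illustration only
    h0, h1, h2 = (renyi_entropy(seq, q, alphabet_size=4) for q in (0, 1, 2))
    print(h0, h1, h2)
```

For a DNA or mRNA sequence `alphabet_size` would be 4, and for a protein sequence 20; passing it explicitly keeps $H_0$ comparable between sequences that happen to lack a symbol.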
Based on the generalized entropy, three derived indices were used: the information of ordering (I), the repeatability complexity ($GC_R$) and the redundancy (R).
To compare the complexity of sequences of different lengths, the relative repeatability complexity was defined as follows:

$$GC_R\,(\%) = \frac{H_1 - H_2}{H_0} \times 100$$

In addition, we also computed, for the first time, the information entropy of the DNA methylation levels of four CpG sites in the exon 1 region of DNMT1 from the methylation and unmethylation percentages.
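A sketch of how the derived indices might be computed is given below. Only the relative repeatability complexity, (H_1 − H_2)/H_0 × 100, is stated explicitly in the paper; the forms used here for I (H_0 − H_1), GC_R (H_1 − H_2) and R (1 − H_1/H_0) are conventional definitions adopted as assumptions, and the function name `information_indices` is hypothetical.

```python
import math
from collections import Counter


def information_indices(seq, alphabet_size):
    """Entropy-based indices of a symbol sequence (natural log)."""
    n = len(seq)
    probs = [c / n for c in Counter(seq).values()]
    h0 = math.log(alphabet_size)                 # maximum entropy
    h1 = -sum(p * math.log(p) for p in probs)    # Shannon entropy
    h2 = -math.log(sum(p * p for p in probs))    # second-order entropy
    gc_r = h1 - h2                               # assumed form of GC_R
    return {
        "H0": h0,
        "H1": h1,
        "H2": h2,
        "I": h0 - h1,                    # assumed: information of ordering
        "GC_R": gc_r,
        "R": 1 - h1 / h0,                # assumed: redundancy
        "GC_R_percent": gc_r / h0 * 100,  # as given in the paper
    }


if __name__ == "__main__":
    # Toy protein fragment, for illustration only.
    print(information_indices("MPARTAPARVPTLAVPAISLPDDVRRRLKDL", 20))
```

Under these assumed definitions a sequence with uniform symbol usage has I, GC_R and R all equal to zero, and the indices grow as the composition becomes more uneven.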

Phylogenetic relationship of DNMT1 among several species
The mRNA sequences of the DNMT1 gene of different species were downloaded, aligned and compared. The evolutionary tree shown in Figure 1 was then built [26], with distances calculated by the Jukes-Cantor method. Figure 1 shows that the animals are grouped according to the evolutionary distance of their DNMT1 mRNA sequences. The smallest evolutionary distance is found between human and chimpanzee, followed by dog and cattle, and then, in sequence, by mouse and zebrafish.
The results indicate that the phylogenetic relationship of the DNMT1 mRNA sequences reflects the variation arising from speciation. We therefore inferred that the epigenetic mechanism controlled by this gene may be vital not only to the speciation process, but also to the evolution and development of different organisms.
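For reference, the Jukes-Cantor distance between two aligned sequences is d = −(3/4) ln(1 − (4/3) p), where p is the proportion of differing sites. The sketch below assumes the two sequences are already aligned and skips gapped columns, which is one possible convention; the function name is hypothetical and this is not the tree-building software used in the paper.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns without a gap in either sequence (assumption).
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    # p: observed proportion of differing sites.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        # The correction is undefined at saturation.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the raw mismatch proportion, because it accounts for multiple substitutions at the same site.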

Composition analysis of DNA sequence of DNMT1 gene
Because of alternative splicing of exons, DNA mutations and RNA editing, there may be relatively large variations in mRNA length and nucleotide composition for an orthologous gene among different species, which are directly related to species-specific modular formation and the regulation of gene expression. Therefore, to further explore the nucleotide composition of the DNMT1 gene, we first computed the GC and AT contents and plotted them as shown in Figure 2a and Figure 2b. The window size was set to about 20 bp, with no overlap between windows. We found that in mammals the GC content is larger than 0.5 at most positions, whereas the AT content is less than 0.5 at most positions. In zebrafish, on the other hand, the AT content is larger than the GC content over most positions (Table 1). The proportions of the four DNA nucleotides were also calculated for the mRNA sequences. The nucleotide distribution is clearly not uniform and varies among species. As shown in Figure 3, in all species except zebrafish the numbers of G and C nucleotides in the mRNA sequence exceed those of A and T, which is consistent with the GC and AT contents (Table 2).
To test the distribution of GC and AT contents, a two-sample Kolmogorov-Smirnov (K-S) test was used to compare the distributions among the species. The null hypothesis ($H_0$) of this test is that the GC or AT contents in species A and species B are drawn from the same continuous distribution. We accept $H_0$ when the index H is 0 and reject it when H is 1. The K-S test results are shown in Table 3 and Table 4. We found that the test results based on the GC and AT contents are exactly the same, and that they may vary with the reference species. Interestingly, chimpanzee, cattle, mouse and human share the same distribution (P>0.05), whereas they differ significantly from zebrafish and dog (P<0.05, Tables 3 and 4).

The entropy analysis of mRNA sequence of different species
Because the mRNA sequences differ so much, we computed several measures related to the information entropy of the DNMT1 gene in different species in order to compare the complexity and information of the mRNA sequences quantitatively. The information indices include Shannon entropy ($H_1$), repeatability complexity ($GC_R$), information of ordering (I) and redundancy (R). The results are shown in Table 5. The $GC_R$, I and R of the human mRNA sequence are the highest among the species studied, whereas the Shannon entropy is highest in zebrafish. As shown in Figure 4a, $GC_R$, I and R gradually decrease from higher organisms to lower ones. This indicates that the mRNA sequences of higher organisms are more complex and more regular, and store more potential information. We therefore consider that $GC_R$, I and R reflect the complicated processes of evolution and development.
The analysis was then extended to the complexity and information of the intron regions and the whole gene sequences. Because the whole DNA sequence of the DNMT1 gene is not available for chimpanzee and cattle, we analyzed entropy information in only four species: human, dog, mouse and zebrafish. The $GC_R$, I and R of both the intron regions and the whole genomic DNA are listed in Tables 6 and 7. We found that the Shannon entropy of the intron regions and of the whole genomic sequence of DNMT1 is similar in the three mammals, which indicates that their DNA nucleotide structure and composition are similar. The zebrafish has the lowest information indices in the mRNA sequence and the highest $GC_R$, I and R in the intron regions and the whole genomic sequence (Figures 4a, 4b and 4c). The trends of $GC_R$, I and R for the whole DNA sequence and for the introns are similar in the three mammals, and are clearly the reverse of the trend for mRNA. That the human information content is the highest in the mRNA sequence but the lowest in the introns and the whole DNA suggests that the mRNA sequence plays an important role in species evolution. The results confirm that the entropy information of the DNMT1 gene depends on DNA nucleotide composition. In terms of the entropy information of the gene, there are significant correlations and similarities among mammals, and a relatively larger divergence between mammals and the non-mammalian vertebrate (Figures 4b and 4c).

Information analysis of protein and domain coded by DNMT1 gene in different species
The entropy information of the mRNA sequences of the gene differs among species. We postulated that these differences in the complexity information measures may result in changes in protein sequence and function. Proteins largely constitute the machinery of life and carry out all structural, catalytic and regulatory functions, and the protein sequence forms the basis of the higher-order structures of a protein. Therefore, it is necessary to analyze the entropy information of the protein sequences and of the different domains encoded by the DNMT1 gene among these species, and to check the impact of entropy information on the protein domains. To compare the entropy information of protein sequences and compositions easily, we used the relative repeatability complexity ($GC_R$%), which removes the effect of different protein sequence lengths. The results are shown in Table 8. The range of the relative repeatability complexity is narrow, with a minimum of 0.0034 and a maximum of 0.0038. The $GC_R$% of the DNMT1 protein is again highest in human, as in the information analysis of the mRNA sequence of the gene. Moreover, the information indices R and I of most mammals, except mouse, are higher than those of zebrafish (Table 8).
With the aid of computer prediction, we found seven domains of four types in the coding region of the DNMT1 gene: one DMAP domain, one CXXC zinc finger domain, two BAH domains and three DNA methylase domains. Information analysis was performed separately for the seven domains and the non-domain regions, and showed that the $GC_R$%, I and R of most domains and non-domain regions are larger than those of the whole protein sequence. Across species, we found that the $GC_R$% of the non-domain region is related to the evolutionary distance of the organisms, i.e., the $GC_R$% gradually decreases from higher organisms to lower ones (Figure 5). The results suggest that non-domain regions may be important for biological processes. They imply that there are relationships between non-domain regions and domains, and that their interactions may lead to more complicated protein functions. Furthermore, the information indices of the DMAP binding domain, the CXXC zinc finger domain and the second DNA methylase domain are not only larger than those of the other domains, but also significantly higher than those of the non-domain region, as shown in Figures 6a, 6b and 6c. Interestingly, domains of the same type have unequal information (data not shown). We therefore consider that domain regions store more information than non-domain regions, which further indicates that protein function is mainly determined by domain regions.

The information analysis of DNA methylation for DNMT1 gene and aging
In our on-going research, a relationship between the methylation status of the DNMT1 gene and aging was discovered in a unique chicken model. DNA methylation levels of four CpG sites in the exon 1 region of the chicken DNMT1 gene were measured quantitatively at 8 weeks and 15 months of age (Figure 7a). In liver, the methylation level of the CpG sites at 8 weeks of age was lower than that at 15 months of age (P<0.05). The results suggest that the methylation status of DNMT1 is related to aging in chickens. In the further information analysis, we treated the methylation percentage as the probability of methylation occurring at a given CpG site. The probabilities at the different CpG sites could then be used to calculate the entropy and the other information indices, which establishes a relationship between the DNA methylation profile and entropy. The results are shown in Figure 7b. Except for the Shannon entropy $H_1$, the $GC_R$, I and R of the DNA methylation profiles of the DNMT1 gene follow the methylation levels, i.e., the complexity and entropy of the methylation status of the gene are higher at 15 months of age than at 8 weeks of age.
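One way to read the procedure described above is to treat each CpG site as a two-state variable (methylated with probability p, unmethylated with probability 1 − p) and to compute the indices per site. The sketch below follows that reading, which is an assumption, as are the conventional forms of I, GC_R and R; the function name and the example levels are hypothetical and are not data from the study.

```python
import math


def methylation_site_indices(p_methylated):
    """Entropy indices of one CpG site treated as a two-state variable."""
    if not 0.0 <= p_methylated <= 1.0:
        raise ValueError("methylation level must be a fraction in [0, 1]")
    # Zero-probability states contribute nothing to the sums.
    probs = [p for p in (p_methylated, 1.0 - p_methylated) if p > 0]
    h0 = math.log(2)                              # two possible states
    h1 = -sum(p * math.log(p) for p in probs)     # Shannon entropy
    h2 = -math.log(sum(p * p for p in probs))     # second-order entropy
    return {"H1": h1, "GC_R": h1 - h2, "I": h0 - h1, "R": 1 - h1 / h0}


if __name__ == "__main__":
    # Made-up methylation fractions for four CpG sites, illustration only.
    hypothetical_levels = [0.10, 0.25, 0.40, 0.55]
    for site, level in enumerate(hypothetical_levels, start=1):
        print(site, methylation_site_indices(level))
```

Under this reading the Shannon entropy of a site peaks at 50% methylation, so it need not track the methylation level itself, whereas the other indices respond to how far the site is from that midpoint.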

Discussion
Searching DNA regions for complexity from different information viewpoints is one of the pivotal tasks of structural sequence analysis in the post-genomic era. Entropy, as an information measure, includes several types, such as Shannon entropy, spectral entropy and conformational entropy. Shannon entropy has been applied to measure global splicing disorders [27], and to analyze conserved protein sequences of influenza A viruses and identify vaccine targets [28]. Spectral entropy has been used to measure order and correlations in genomic DNA sequences and to analyze levels of ordering in coding and noncoding regions of DNA sequences [29,30]. A relationship between spectral entropy and GC content was established in the β-esterase gene cluster [31,32]. However, most of the research mentioned above involved entropy analysis of DNA sequences only. In this study, to overcome the variation in protein sequence length and in the types of amino acids, we applied to protein sequences an information measure named relative repeatability complexity, $GC_R$(%) = ($H_1$ − $H_2$)/$H_0$ × 100. Alongside the information of ordering (I), complexity ($GC_R$) and redundancy (R), the relative repeatability complexity supplies a further, relative measure of repeatability complexity. Like Shannon entropy, we believe that $GC_R$% could be widely used in the complexity analysis of protein sequences and in domain identification.
The research revealed the relationship between the DNA nucleotide composition of the DNMT1 gene and its complexity information. With different approaches, the GC and AT contents as well as different entropy measures were explored in various forms of the DNMT1 gene from different species. Because of differences in DNA nucleotide composition, the entropy of the DNMT1 gene among species was found to depend on DNA base composition, which corresponds to the differences in the AT and GC contents of the DNMT1 gene between mammals and zebrafish. We further demonstrated that the complexity of the introns of the DNMT1 gene in mammals is lower than that of the coding regions. Most importantly, although the AT and GC contents of chimpanzee, cattle, mouse and human have the same distribution based on the K-S test (P>0.05) and differ significantly from those of zebrafish and dog (P<0.05), there are obvious similarities among mammals in terms of the entropy information indices of the DNMT1 gene [31]. All of these results indicate that the complexity information of the DNMT1 gene may be preconditioned by a strong inequality in nucleotide content (base composition) in different species, by tandem repeats, dispersed repeats or palindrome-hairpin structures, and by a combination of all these factors.
We not only demonstrated the correlation of complexity information between mRNA composition and protein sequence, but also revealed the impact of entropy information on the domain and non-domain regions encoded by the DNMT1 gene in different species. The entropy change of the protein as a whole is interpreted as a diversification of the protein encoded by the DNMT1 gene, and is also reflected in the complexity of the DNA sequences of these organisms. These results indicate that there appears to be a general evolutionary tendency: similarity among mammalian organisms and diversification from others. The claim is meaningful in terms of biological function because much other evidence from experiments and protein sequence alignments has confirmed that protein sequence complexity and entropy information have a strong linear correlation with properties such as packing density and hydrophobicity [33] and with the structural and functional characteristics of G-protein-coupled receptors [34]. Remarkably, binding at a DNA binding site, one of the most important biological interactions between DNA and protein, is a completely entropy-driven process. Some evidence shows that a positive entropy change may be due to dehydration upon formation of the protein/DNA complex [35]. Considering our previous results, to further study the function of the DNMT1 gene in epigenetics, we may redesign its structures via configurational entropy. Future studies should therefore include the identification of functional domains and domain boundaries from sequence alone using entropy information among different organisms, and a deeper investigation, with the aid of bioinformatics methods, of the mechanisms of entropy-driven DNA binding in the protein/DNA complex of the DNMT1 gene [36][37][38].
In our research, the aging-specific methylation status of DNMT1 in a unique chicken model also implied some aging-driven entropy characteristics. Much evidence has linked aberrant DNA methylation and histone acetylation to a number of aging-related disorders, including cancer, autoimmune disorders and others [39][40][41]. Given the substantial evidence that aging processes are characterized by entropy changes, we think that increasing entropy may lead to a loss of molecular fidelity that slowly accumulates and overwhelms maintenance systems [42,43]. The relationship between the methylation status and the entropy levels of the DNMT1 gene, which functions in maintaining methylation fidelity, is reflected in a large increase in methylation with aging in the unique chicken model. Moreover, the changing entropy information suggests that aging is a process of energy dispersion: entropy disperses concentrated energy, resulting in biological inactivity or malfunction. Considering other factors, we propose the hypothesis that the aging process is due in part to DNA methylation, DNA damage, mutations, loss of chemical bonds and similar events, which may cause entropy changes and loss of molecular fidelity. To understand the aging process through changes in DNA methylation with age, the DNMT1 gene could be a candidate target for future research [44].
In summary, the information measure of the DNMT1 gene, one of the most important epigenetic genes, is related to its genomic complexity, which in turn is associated with evolution and aging processes. The intrinsic mechanism has yet to be studied. In the post-genomic era, many unknown genes can be clustered and analyzed in their domain and non-domain regions based on the complexity of DNA and protein sequences. By applying these entropy information methods to various functional genomic regions, we will gain deeper insight into gene functions and genome annotation.