Circular Code Signal in Frameshift Genes

In 1996, a trinucleotide circular code was identified simultaneously in eukaryotic and prokaryotic genes [1,2]. It allows their reading frame to be retrieved. Frameshift genes, by bypassing or rereading one nucleotide, shift translation. Therefore, the theory of circular codes forecasts a shift of the circular code signal in this class of genes. This hypothesis is verified in this paper. Frameshifting and circular code are briefly presented.


Introduction
In 1996, a trinucleotide circular code was identified simultaneously in eukaryotic and prokaryotic genes [1,2]. It allows their reading frame to be retrieved. Frameshift genes, by bypassing or rereading one nucleotide, shift translation. Therefore, the theory of circular codes forecasts a shift of the circular code signal in this class of genes. This hypothesis is verified in this paper. Frameshifting and circular code are briefly presented.

Frameshifting
By our convention, the reading frame in a gene established by a start codon ATG, GTG or TTG is the frame 0, and the shifted frames 1 and 2 are the reading frame 0 shifted by one and two nucleotides in the 5΄ − 3΄ direction, respectively.
In the reading frame of a gene, a series of nucleotide triplets (codons) is translated into a series of amino acids according to the genetic code. The correspondence between the nucleotide sequence and the protein sequence is determined by genetic codes that are well conserved across species.
However, in some cases, the protein synthesized by the ribosome does not correspond to the transcribed mRNA because of a translational error [3]. These translational errors are not the consequence of a genetic code alteration; they result mainly from a change in the ribosome translation site on the mRNA. This site modification can take different forms: frameshifting, stop codon readthrough or ribosomal hopping [4,5,6].
Two types of triggers can stimulate translational errors: instant and programmed. Instant translational errors are very rare, with a rate estimated at 3 × 10^-5 [7]. Programmed translational errors are produced by different means, including a special motif, a secondary structure in the mRNA or a lack of an amino acid in the cell. These events can increase the probability of a translational error up to nearly 100% [4].
Frameshifting is a translational error where the ribosome pauses translation and then either bypasses one nucleotide, shifting translation from frame 0 to frame 1 (+1 frameshift) (Figure 1), or rereads one nucleotide, shifting translation from frame 0 to frame 2 (-1 frameshift) (Figure 2). The ribosome then continues to translate the codons in the shifted frame until it encounters a stop codon in that frame. In some rare cases, the number of nucleotides bypassed or reread may vary. The protein product is thus partially encoded by frame 0 and partially encoded by the shifted frame [8,9].
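The two frameshift types described above can be illustrated with a minimal Python sketch. The function name `codons_with_frameshift`, the 0-based `site` index and the toy sequence are illustrative assumptions and are not taken from the paper; the sketch only lists the codons read before and after the event and does not model ribosome pausing or stop codons.

```python
def codons_with_frameshift(seq, site, shift):
    """List the codons read before and after a single-nucleotide frameshift.

    seq   : nucleotide string whose index 0 starts a codon of frame 0
    site  : 0-based index of a frame-0 codon start where the event occurs
            (hypothetical convention chosen for this sketch)
    shift : +1 -> the nucleotide at `site` is bypassed
            -1 -> the nucleotide just before `site` is reread
    """
    if shift not in (+1, -1):
        raise ValueError("shift must be +1 or -1")
    if site <= 0 or site % 3 != 0:
        raise ValueError("site must be a positive multiple of 3")
    # Codons read in frame 0 up to the frameshift site.
    before = [seq[i:i + 3] for i in range(0, site, 3)]
    # Reading resumes one nucleotide later (+1) or one nucleotide earlier (-1).
    restart = site + 1 if shift == +1 else site - 1
    after = [seq[i:i + 3] for i in range(restart, len(seq) - 2, 3)]
    return before, after


if __name__ == "__main__":
    toy = "ATGAAACCCGGGTTT"
    print(codons_with_frameshift(toy, 6, +1))  # codons after the site start in frame 1
    print(codons_with_frameshift(toy, 6, -1))  # codons after the site start in frame 2
```

In this sketch the codons read after a +1 event start at an index congruent to 1 modulo 3 (frame 1), and those read after a -1 event start at an index congruent to 2 modulo 3 (frame 2), in agreement with the frame convention of the text.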
Frameshift prediction using computational methods is a difficult task due to the diversity of motifs and of secondary or tertiary structures which stimulate frameshifting. Moreover, these computational methods generate many frameshift gene candidates. Alignment with experimentally determined frameshift genes nevertheless constitutes a classical approach to identify frameshifts [10,11,12]. For -1 frameshift identification, some computational methods consider the primary and secondary structure of the mRNA associated with the frameshift site. This structure is classically composed of a slippery heptamer X,XXY,YYZ (X, Y and Z are three nucleotides and the comma "," marks the codon boundaries in frame 0) separated by five to nine nucleotides from a stem loop (Figure 3). The approaches based on this structural model have also allowed several frameshift genes to be identified (e.g. [13,14,15,16]).
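As an illustration of the primary-structure part of this model, the following Python sketch scans a sequence for heptamers of the form X,XXY,YYZ whose commas fall on frame 0 codon boundaries. The function name `find_slippery_heptamers` and the parameter `frame0_offset` are hypothetical; the sketch does not require X, Y and Z to be distinct, and it models neither the spacer of five to nine nucleotides nor the stem loop.

```python
import re

# X XXY YYZ: three identical nucleotides, three identical nucleotides, one nucleotide.
# The lookahead makes overlapping candidates visible to finditer.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2([ACGT])\3\3[ACGT]))")


def find_slippery_heptamers(seq, frame0_offset=0):
    """Return (index, heptamer) pairs for heptamers X,XXY,YYZ in frame 0.

    frame0_offset is the index of the first nucleotide of any frame-0 codon
    (assumed known, e.g. from the start codon).
    """
    hits = []
    for match in SLIPPERY.finditer(seq.upper()):
        i = match.start()
        # The first X closes a frame-0 codon, so the codon XXY starts at i + 1.
        if (i + 1 - frame0_offset) % 3 == 0:
            hits.append((i, match.group(1)))
    return hits


if __name__ == "__main__":
    # Toy sequence read as ATG,GCT,TTA,AAC,GGG in frame 0.
    print(find_slippery_heptamers("ATGGCTTTAAACGGG"))
```

Such a scan only produces candidates; as noted above, motif-based methods generate many candidates that must be filtered by further structural or experimental evidence.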

Common circular code
In 1996, two statistical studies, a codon frequency per frame and a codon correlation function per frame, showed that the 64 trinucleotides T = {AAA,…,TTT} are preferentially distributed in the three frames of genes [1,2]. By excluding the four trinucleotides with identical nucleotides T_id = {AAA,CCC,GGG,TTT} and by assigning each trinucleotide to a preferential frame (frame of its highest occurrence frequency), the same three subsets X_0, X_1 and X_2 of 20 trinucleotides are found in the frames 0, 1 and 2, respectively, of two large and different gene populations (protein coding regions): eukaryotes (26, sequences, 4,709,758 trinucleotides) (Table 1) [1,2]. These three trinucleotide subsets present several strong biomathematical properties, particularly the fact that they are circular codes.
We recall the definitions and the main properties of this common circular code which will be involved in this paper. In other words, a trinucleotide circular code Y is a set of trinucleotides such that all sequences (e.g. genes) generated by concatenation of words of Y and written on a circle, where the last letter is followed by the first one, have only one decomposition (factorization) into words of Y. If a sequence has two decompositions then Y is not a circular code. As an example, let the set Y be composed of the six following trinucleotides: Y = {AAT,ATG,CCT,CTA,GCC,GGC} and the sequence w be a series of the nine following letters: w = ATGGCCCTA. The sequence w, written on a circle, can be factorized into words of Y in two different ways: ATG,GCC,CTA and AAT,GGC,CCT, the commas showing the way of decomposition (Figure 4). Therefore, Y is not a circular code. In contrast, if the set Z obtained by replacing the word GGC of Y by GTC is considered, i.e. Z = {AAT,ATG,CCT,CTA,GCC,GTC}, then there never exists an ambiguous sequence with Z, such as w for Y, and Z is a circular code. The flower automaton is the classical algorithm in code theory to test if a set of words is a circular code or not (without generating all the possible sequences). Furthermore, it can also determine the minimal window length for retrieving the construction frame (size of the longest ambiguous word which can be read in at least two frames, plus one letter) (detailed in [17]).

Figure 1: The +1 frameshift. At a certain site, the ribosome bypasses one nucleotide and changes the codon reading frame from frame 0 to frame 1.

Table 1: The common circular code X. The common circular code X identified in both eukaryotic and prokaryotic genes: X_0, X_1 and X_2 are the preferential sets of 20 trinucleotides in frames 0, 1 and 2 of genes, respectively.
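The example with Y, Z and w can be checked with a short Python sketch. The first function enumerates the decompositions of a word written on a circle. The second one is not the flower automaton named in the text: it is a graph acyclicity criterion taken from later work on circular codes and swapped in here only because it is short (each trinucleotide b1b2b3 contributes the arrows b1 -> b2b3 and b1b2 -> b3, and the set is circular if and only if the resulting graph has no directed cycle). All function names are illustrative assumptions.

```python
def circular_decompositions(word, code):
    """Decompositions of `word`, written on a circle, into trinucleotides of `code`.

    The length of `word` is assumed to be a multiple of 3.
    """
    found = []
    for shift in range(3):
        rotated = word[shift:] + word[:shift]
        parts = [rotated[i:i + 3] for i in range(0, len(rotated), 3)]
        if all(part in code for part in parts):
            found.append(parts)
    return found


def is_circular(code):
    """Graph test (assumption: not the flower automaton of the text)."""
    graph = {}
    for b1, b2, b3 in code:
        graph.setdefault(b1, set()).add(b2 + b3)
        graph.setdefault(b1 + b2, set()).add(b3)

    state = {}  # node -> "open" while on the current search path, then "done"

    def has_cycle(node):
        if state.get(node) == "open":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "open"
        if any(has_cycle(nxt) for nxt in graph.get(node, ())):
            return True
        state[node] = "done"
        return False

    return not any(has_cycle(node) for node in list(graph))


if __name__ == "__main__":
    Y = {"AAT", "ATG", "CCT", "CTA", "GCC", "GGC"}
    Z = {"AAT", "ATG", "CCT", "CTA", "GCC", "GTC"}
    w = "ATGGCCCTA"
    print(circular_decompositions(w, Y))  # two decompositions
    print(circular_decompositions(w, Z))  # one decomposition
    print(is_circular(Y), is_circular(Z))
```

Counting decompositions of the single word w only shows that w is ambiguous for Y; the statement that Z is a circular code concerns all words over Z, which is why a general test is needed.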

Definition 4: Complementary map:
The complementary map [...]. Furthermore, according to the property of the complementary and antiparallel double helix, [...].
Definition 5: Self-complementary trinucleotide circular code: [...]
Definition 7: Permuted trinucleotide circular code: A permuted trinucleotide circular code 𝒫(Y) is the circular permutation of a trinucleotide circular code Y so that for each y ∈ Y, 𝒫(y) ∈ 𝒫(Y).

Remark 1:
A trinucleotide circular code Y does not necessarily imply that Y_1 = 𝒫(Y) and Y_2 = 𝒫^2(Y) are also trinucleotide circular codes.
Definition 9: C^3 circular code gene population: A gene population F has the C^3 trinucleotide circular code property if the three sets of trinucleotides with the highest occurrence frequency in frames 0, 1 and 2, respectively, of the genes of F are C^3 trinucleotide circular codes.
Definition 10: C^3 self-complementary trinucleotide circular code: A trinucleotide circular code Y is C^3 self-complementary if Y is a C^3 trinucleotide circular code satisfying the following properties: [...] (Y_1 and Y_2 are complementary trinucleotide circular codes).
The trinucleotide set X_0 coding the reading frames (frames 0) of eukaryotic and prokaryotic genes is a C^3 self-complementary trinucleotide circular code [1]. X_0 will also be noted C^3 code X and simply called common circular code. Therefore, the common circular code X and its two permuted circular codes X_1 = 𝒫(X) and X_2 = 𝒫^2(X) can exist in a DNA double helix simultaneously: X in a given DNA strand can be paired with X in the antiparallel complementary DNA (cDNA) strand, X_1 (X shifted by one nucleotide in the 5΄ − 3΄ direction) in a given DNA strand can be paired with X_2 (X shifted by two nucleotides in the 5΄ − 3΄ direction) in the cDNA strand, and X_2 in a given DNA strand can similarly be paired with X_1 in the cDNA strand. Furthermore, the C^3 code X allows retrieval of any frame in genes, locally anywhere in the three frames and in particular without start codons in reading frames, and automatically with a minimal window of 13 nucleotides in each frame (Property 5 and Remark at page 121 in [1]).
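The pairing properties stated above can be checked numerically with the following Python sketch. The 20 trinucleotides of X_0 are quoted here from the literature on the common circular code [1], since Table 1 is not reproduced in this text; the helper names `reverse_complement` and `permute` are illustrative assumptions. The sketch checks self-complementarity and the pairing of X_1 with X_2, not circularity.

```python
# Common circular code X_0, as listed in the circular code literature [1].
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(trinucleotide):
    """Trinucleotide read on the antiparallel complementary strand."""
    return trinucleotide.translate(COMPLEMENT)[::-1]


def permute(trinucleotide):
    """Circular permutation by one letter: l1 l2 l3 -> l2 l3 l1."""
    return trinucleotide[1:] + trinucleotide[0]


X1 = {permute(t) for t in X0}  # X shifted by one nucleotide
X2 = {permute(t) for t in X1}  # X shifted by two nucleotides

if __name__ == "__main__":
    print({reverse_complement(t) for t in X0} == X0)  # X pairs with X
    print({reverse_complement(t) for t in X1} == X2)  # X_1 pairs with X_2
    print(len(X0 | X1 | X2))  # number of distinct trinucleotides covered
```

The three sets are pairwise disjoint and together cover the 60 trinucleotides that remain once AAA, CCC, GGG and TTT are excluded.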
A recent review of circular codes in genes details the research context, the history and the other properties of this C^3 code X (rarity, largest window length, higher frequency of "misplaced" trinucleotides in shifted frames, flexibility) [18].

Method
Circular code signal (CCS) method
We have developed circular code signal (CCS) methods to identify protein coding genes (e.g. [19]), a global maximum or minimum of the C^3 code X in microRNAs [20], etc. This approach is extended and applied here to the case of frameshift genes. It reveals periodic signals of the C^3 code X.

Let T = {AAA,…,TTT} be the set of 64 trinucleotides over the 4-nucleotide alphabet {A,C,G,T}. Let n ∈ {A,C,G,T} be a nucleotide, t ∈ T a trinucleotide and F a frameshift gene population with N(F) sequences s. By convention, the position i = 0 refers to the frameshift site in a frameshift gene s and is associated with the nucleotide n_0. In a +1 frameshift gene, n_0 is the bypassed nucleotide and in a -1 frameshift gene, n_0 is the reread nucleotide. By convention, all nucleotides before the frameshift nucleotide (in the 5΄ direction) have a negative position i < 0 and all nucleotides after the frameshift nucleotide (in the 3΄ direction) have a positive position i > 0. The sequence s is considered as a series of nucleotides n_i and the method reading frame is …, n_−1, n_0, n_1, …. The sequence s begins at position s_min (negative value) and ends at position s_max (positive value), with a length equal to s_max − s_min + 1. Let the frame 0 of a sequence s be the frame established by the trinucleotide n_0 n_1 n_2, and the shifted frames 1 and 2 be the frame 0 shifted by one and two nucleotides in the 5΄ − 3΄ direction, respectively. Thus, the method reading frame is not necessarily the classical codon reading frame (Figure 5). Let X_f, f ∈ {0,1,2}, be the three codes X_0, X_1 and X_2 in the three frames f.
Then, the score function P(X_f,i,s) computes the occurrence of the code X_f in its associated frame f in a window w_i of a sequence s.
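Since the formulas of the score function are not legible in this text, the following Python sketch only illustrates one plausible reading of it and should not be taken as the exact definition used in the paper. It assumes a window of 14 nucleotides, in which the four trinucleotides read in each frame f of the window are tested against the code X_f, the score being the fraction of the 12 tested trinucleotides that belong to the code of their frame. The set X_0 is quoted from the literature [1]; the names `window_score` and `population_score` and the representation of a gene as a pair (sequence, index of n_0) are hypothetical.

```python
X0 = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})
X1 = frozenset(t[1:] + t[0] for t in X0)  # one circular permutation of X_0
X2 = frozenset(t[1:] + t[0] for t in X1)  # two circular permutations of X_0
CODES = (X0, X1, X2)
WINDOW = 14  # window length in nucleotides


def window_score(seq, start):
    """Fraction of trinucleotides of the window that belong to the code of their frame."""
    window = seq[start:start + WINDOW]
    if start < 0 or len(window) < WINDOW:
        raise ValueError("window does not fit in the sequence")
    hits = 0
    for f, code in enumerate(CODES):
        # Four trinucleotides per frame: offsets f, f + 3, f + 6, f + 9.
        for k in range(f, f + 12, 3):
            hits += window[k:k + 3] in code
    return hits / 12


def population_score(genes, i):
    """Mean window score at position i, relative to the frameshift site of each gene.

    genes: iterable of (sequence, index of n_0 in the sequence) pairs.
    Genes whose window does not fit at position i are skipped.
    """
    scores = []
    for seq, site in genes:
        start = site + i
        if 0 <= start <= len(seq) - WINDOW:
            scores.append(window_score(seq, start))
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    toy = "GAT" * 10  # GAT in X_0, ATG in X_1, TGA in X_2
    print([window_score(toy, k) for k in range(4)])
```

On the toy repeat of GAT, the score is maximal when the window starts on a GAT boundary and drops when the window is shifted by one or two nucleotides, which is the modulo 3 behaviour the method relies on.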

Circular code periodic signals revealed by the CCS method
The definitions used in this CCS method have properties which allow circular code periodic signals in genes to be identified.

Property 1:
The indicator function (Formula 1) is based on a set X of 20 trinucleotides which has the following property: it contains no permuted trinucleotide 𝒫(w) of any of its trinucleotides w (Definition 7), and thus, in particular, no trinucleotide of T_id = {AAA,CCC,GGG,TTT}. For example, on the purine/pyrimidine alphabet {R,Y} (R = {A,G}, Y = {C,T}), it was already explained that the classes of motifs RRR and YYY have no modulo 3 periodicity in genes (Figure 2 in [21]) as they do not occur in a preferential frame (Table 3(a) in [1]).

Property 2:
The indicator function (Formula 1) is based on three sets X_f associated with the three frames f, f ∈ {0,1,2}, so that X_1 = 𝒫(X_0) and X_2 = 𝒫^2(X_0), i.e. the frame 1 (2 respectively) is analysed with the trinucleotides of X_0 permuted by one (two respectively) nucleotides. Note that X_1 and X_2 also contain no trinucleotide of T_id.

Property 3:
The window length of 14 nucleotides corresponds to the length of the minimal window of 13 nucleotides of the three circular codes X_0, X_1 and X_2 to retrieve the frames 0, 1 and 2 in genes [18]. This sliding window length is very short compared to some classical methods of signal processing for genes ([22] and their subsequent works).

Property 4:
If a gene population F has the C^3 code X property (Definition 9) then P(i,F) has a modulo 3 periodicity.
Indeed, a high score value P(i,F) in a window w_i at position i followed by two low score values at positions i + 1 and i + 2 reflects a high probability that the reading frame of the trinucleotides of X_0 is the frame of position i. Conversely, if the window w_i is in the reading frame then the trinucleotides of X_0 are not identified in frame 0 of w_i+1 and w_i+2 but in frames 2 and 1, respectively. Hence, the score values of the windows w_i+1 and w_i+2 are low. A similar reasoning holds for the codes X_1 and X_2 which are permuted trinucleotide circular codes of X_0 (Definitions 7 and 8).

Significance level of a modulo 3 periodicity
A modulo 3 periodicity is quantified by counting the local peaks in a frame according to an indicator function δ_P(i,F).
Note: When n is sufficiently large and p is not too close to 0 or 1, i.e. n > Max{5/p_0, 5/(1−p_0)} = 15, the central limit theorem applies and the normal distribution Z(F) can be used as an approximation of the binomial distribution Y(F) (Table 3).
In order to automate the reading frame identification, let p(f,a,b,F) be the observed probability of modulo 3 maxima in the frame f of the range [a,b] in the frameshift gene population F. Then, the reading frame in the position interval [a,b] in the frameshift population F is the frame f ∈ {0,1,2} such that p(f,a,b,F) is maximal. This statistical approach can be easily extended to evaluate any type of periodicity (modulo 2, modulo 3, etc.).
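The counting of local peaks, the significance level and the frame identification can be sketched in Python as follows. The exact formulas are not legible in this text, so several choices are assumptions: a local peak is taken as a score strictly greater than both neighbours, the null probability p_0 = 1/3 is inferred from the bound Max{5/p_0, 5/(1−p_0)} = 15, and the significance level is taken as the one-sided upper tail of the normal approximation. All function names are illustrative.

```python
import math


def local_peak_counts(scores):
    """Count local maxima of the score per frame i mod 3.

    scores: dict mapping a position i (possibly negative) to the score at i.
    """
    counts = [0, 0, 0]
    for i, value in scores.items():
        if i - 1 in scores and i + 1 in scores:
            if value > scores[i - 1] and value > scores[i + 1]:
                counts[i % 3] += 1  # Python's % keeps negative positions in 0..2
    return counts


def significance(k, n, p0=1 / 3):
    """Normal approximation of the binomial tail P(Y >= k) for n peaks (assumed test)."""
    z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))
    alpha = 0.5 * math.erfc(z / math.sqrt(2))
    return z, alpha


def identify_frame(counts):
    """Frame with the highest observed probability of peaks, and that probability."""
    n = sum(counts)
    if n == 0:
        raise ValueError("no local peak in the interval")
    probabilities = [c / n for c in counts]
    frame = max(range(3), key=probabilities.__getitem__)
    return frame, probabilities[frame]


if __name__ == "__main__":
    # Toy signal with a perfect modulo 3 periodicity in frame 0.
    toy = {i: (1.0 if i % 3 == 0 else 0.0) for i in range(-30, 31)}
    counts = local_peak_counts(toy)
    print(counts, identify_frame(counts), significance(counts[0], sum(counts)))
```

In practice the counts would be computed separately on the interval before the frameshift site and on the interval after it, so that the two identified frames can be compared.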

Data acquisition
The frameshift gene populations F used in this study are extracted from the RECODE 2 database (release 2010; [23,24,25]). The RECODE 2 database is a compilation of programmed translational recoding events. It deals with programmed ribosomal frameshifts, codon redefinition and translational bypass occurring in a variety of organisms. Each entry includes the gene, its encoded protein for both normal and alternate decoding, the type of the recoding event involved and the trans-factors and cis-elements that influence recoding.
Our study concerns the -1 and +1 frameshifts of eukaryotic and prokaryotic genes as the common circular code X has been identified only in these populations and not, for example, in viral genes [1]. Therefore, four gene populations F are extracted according to the frameshift type and the organism kingdom: -1 frameshifts of eukaryotes EUK-1 and prokaryotes PRO-1, and +1 frameshifts of eukaryotes EUK+1 and prokaryotes PRO+1. Table 2 shows the kingdom, the shift type, the number of genes, the minimum i position min_F and the maximum i position max_F of the studied frameshift gene populations.

Results
Figures 6-9 show the graphical results of the score function P(i,F) (Formula 2) computed on the four frameshift gene populations F of eukaryotes and prokaryotes EUK+1, PRO+1, EUK-1 and PRO-1, respectively. The x-axis represents the position i of the sliding window in F and the y-axis, the score value P(i,F). For display purposes, the graphical results are presented only on an interval of 200 nucleotides around the frameshift position i = 0.
A modulo 3 periodicity is observed in these four frameshift gene populations. It means that the C^3 code X is a main primary structure in these populations. In Figures 6-9, the local peaks are marked according to the frame of their position i. The peaks in frame f = 0 of the sequence s (i mod 3 = 0) are marked by rhombuses, and those in frames f = 1 (i mod 3 = 1) and f = 2 (i mod 3 = 2), by squares and triangles, respectively. It must be noted again that the frames here are established according to the frameshift position i = 0 and not, as usual, by the start codon of the sequence s; hence f = 0 is not necessarily the classical codon reading frame.
The frame that contains most of the local peaks before the frameshift site differs from the one after. Precisely, in Figure 6 of EUK+1 and Figure 7 of PRO+1, before the frameshift site, the majority of peaks (53% with a significance level α = 0.565% and 67% with α = 0.040%, respectively) are in frame 0 and after the frameshift site, the majority of peaks (both 62% with α = 0.018%) are in frame 1. In Figure 8 of EUK-1 and Figure 9 of PRO-1, before the frameshift site, the majority of peaks (94% with α ≈ 10^-15 and 65% with α = 0.005%, respectively) are in frame 1 and after the frameshift site, the majority of peaks (62% with α = 0.018% and 88% with α ≈ 10^-12, respectively) are in frame 0. This change of periodicity frame after the frameshift site is clearly observed in Figures 6-9. In genes without frameshift sites, almost all local peaks are in the same frame (see for example the pioneer papers at the gene and gene population levels [26,27,28]). Table 3 shows the observed probability p(f,a,b,F) of modulo 3 maxima in the three frames of the ranges [a,b] in the four studied frameshift gene populations F before and after their frameshift sites. As expected and in agreement with the results in the figures, in the +1 frameshift genes, the reading frames identified are frame 0 before the frameshift site and frame 1 after the frameshift site. In the -1 frameshift genes, the reading frames identified are frame 1 before the frameshift site and frame 0 after the frameshift site. Therefore, the periodicity frame associated with the C^3 code X moves in the same direction as the translational frameshifting.
This change of periodicity frame is also observed at the individual gene level. For example, Figure 10 shows the computation of P(i,F) (Formula 2) on the gene abp140, which is a +1 frameshift eukaryotic gene identified experimentally [29]. Before the frameshift site, the majority of local peaks (74% with α ≈ 10^-7) are in frame 0 while after the frameshift site, the majority of local peaks (82% with α ≈ 10^-9) are in frame 1. By studying the local peaks before and after the frameshift position in individual frameshift genes of the RECODE 2 database, this circular code signal approach successfully identifies the frameshift type (+1 and -1) in 68% of genes.
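The decision rule implied by these observations can be written as a small Python sketch. The function name `frameshift_type` is hypothetical. The two cases reported in the text are frame 0 then frame 1 for a +1 frameshift and frame 1 then frame 0 for a -1 frameshift; treating every other pair of frames through the difference modulo 3 is an assumption of this sketch, not a statement of the paper.

```python
def frameshift_type(frame_before, frame_after):
    """Classify a gene from the periodicity frames found before and after the site.

    Returns "+1", "-1", or None when both regions share the same frame.
    """
    if frame_before not in (0, 1, 2) or frame_after not in (0, 1, 2):
        raise ValueError("frames must be 0, 1 or 2")
    shift = (frame_after - frame_before) % 3
    return {0: None, 1: "+1", 2: "-1"}[shift]


if __name__ == "__main__":
    print(frameshift_type(0, 1))  # pattern reported for +1 frameshift genes
    print(frameshift_type(1, 0))  # pattern reported for -1 frameshift genes
```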

Discussion
Periodic signals of the common circular code X are identified in the +1 and -1 frameshift genes of both eukaryotes and prokaryotes. Furthermore, the circular code periodic signal shifts in the same direction as the translational frameshifting. This last result confirms a theoretical forecast of circular codes. Indeed, if a circular code is associated with a unique decomposition of a reading frame, a frameshift gene with two successive reading frames should have a shift of this circular code signal.
The statistical results observed suggest two hypotheses concerning the origin of ribosomal translational frameshifting. The first hypothesis assumes that the frameshift gene structure is composed of two regions (before and after the frameshift site) which have the same circular code distribution but not in the same frame. When the ribosome scans the mRNA and reaches the second region, it detects a change in the code distribution and shifts the mRNA forward or backward in order to retrieve the common distribution. The second hypothesis assumes that the circular code distribution is the same in the entire mRNA. A particular motif or secondary structure causes the ribosome to shift the reading frame at a certain position. This shift drove frameshift genes to evolve their internal structure by adding or deleting a nucleotide to overcome the side effect of the shift.
The common circular code is a structural property associated with genes. The statistical analysis of the RECODE 2 database (release 2010) revealed here that the frameshift genes also have this code property in their primary structure whatever the frameshift type (+1 and -1) and whatever the species kingdom (eukaryotes and prokaryotes).
The properties of this CCS method may lead to some considerations in signal processing of DNA sequences in the particular case of genes. Periodicities can be revealed in genes with the following window features: an unexpectedly short sliding window of 14 nucleotides; a window content based on a subset of trinucleotides, precisely 20 trinucleotides; a subset of 20 trinucleotides with particular properties, precisely no permuted trinucleotides; and a different subset of 20 trinucleotides for each gene frame so that the three subsets are deduced from each other by a number of permutations related to the frame shift, precisely X_1 (X_2 respectively) in frame 1 (2 respectively) is deduced by one (two respectively) permutation of the 20 trinucleotides of X_0.
This circular code information may be used directly or combined with existing methods to improve the identification of frameshift genes in genomes and their encoded proteins.

Table 3: Observed probability p(f,a,b,F) (in % rounded) of modulo 3 maxima in the three frames of the ranges [a,b] in the four studied frameshift gene populations F. For each population F, this probability is computed for two position intervals. The first interval is the region preceding the frameshift site and ranging from the minimum i position min_F = a (Table 2) of the population F to the frameshift position b = 0. The second interval is the region following the frameshift site and ranging from the frameshift site a = 1 to the maximum i position max_F = b (Table 2) of the population F. The values in bold indicate the identified reading frames. In all populations, the circular code periodic signals move in the same direction as the translational frameshifting.