Evolution of the Genetic Code – Some Novel Aspects

The Genetic Code has been known since 1962 [1]. It is largely universal, though some minor variations have been discovered. It is the logical connection between the nucleic acid and protein “worlds” [2]. Both nucleic acids and proteins have changed over evolutionary history, so it is rational to assume that the Genetic Code itself has an evolutionary history. Studies on the Genetic Code demand simultaneous and independent knowledge of the corresponding nucleic acid and protein sequences, for which data are usually not available. De novo sequencing of proteins has no scientific priority. An additional methodological difficulty is that the species that utilized early variants of the Genetic Code may have long been extinct, and ancient variants of proteins are no longer translated. As a result, studies on the evolution of the Genetic Code consist mainly of speculations that have very little chance of experimental confirmation or rejection. However, the history of the Genetic Code is not wholly beyond the reach of serious scientific study. This article reviews and analyzes the main theories about the evolutionary development of the Genetic Code and extends them with recently discovered aspects of its function.



Assumptions
In this study we have made some necessary assumptions:
1. The development of the Genetic Code is, like any other biological development, a process from the simpler to the more complex. Therefore we accept the possibility that the present-day four-base-type nucleic acids developed from (say) two-base-type molecules; that the current triplet code might have been preceded by a one- or two-letter code; and that the current 64 codons have developed from, say, four or sixteen codons.
2. Codons that have been in use for longer (in evolutionary terms) are numerically over-represented in the species-specific Codon Usage Tables (CUT). It is logical to suppose that newly developed functions required new proteins, and coding for more proteins required a number of available amino acids and of codons that increased during the millions of years of biological evolution.
3. Some species disappear. This extinction of the "unfit" may (though not necessarily) mean the loss of all information related to their biology while they were extant. However, much of ancient biological history is preserved in modern organisms, often in hidden, non-expressed genomic sequences. It is well documented that the non-expressed part of the genome has grown (accumulated) rapidly during evolution and has reached huge proportions (~98% of the total DNA) in humans. (It is also observed that these structures sometimes become activated by mistake and this activation leads to immunodeficiency, cancerization or the appearance of bizarre body parts, a phenomenon called atavism [3].) This consideration suggests that the history of ancient proteins and the function of the primitive Genetic Code might be preserved in non-expressed genomic DNA.
4. Codons developed in a way that was compatible with the principle of base complementarity. Bases are known to form complementary pairs and nucleic acids are known to form complementary strands. Therefore, it seems inevitable that codon-anticodon pairs must have existed at every stage of codon development, and it has always been important that the meanings of codons and anticodons are not confused during translation.
5. Codons developed in close connection with their encoded amino acids. The specific (unique) interaction between nucleic acids and proteins is biologically as important and meaningful as the specific (unique) interaction between complementary nucleic acid strands. This assumption might be controversial and is known to have both supporters [4] and opponents [5] among influential scientists. However, our own development, The Common Periodic amino acids and a stop signal. The location of redundancy is the 3rd codon position, called the wobble base. This is the first indication of structure in the codons.
The second (central) codon bases are also clearly distinguished from the others (1st and 3rd). These central bases are undoubtedly related to the physicochemical properties (charge, hydropathy and some structural aspects) of the encoded amino acids [6].
There is a readily detectable, periodic energy pattern along exons that is not detectable along intronic sequences. The free folding energy (-dG, Gibbs energy) is periodically lower (on average) in relation to the 1st and 3rd codon bases than the 2nd. This is possible only if G-C base pairs are preferentially located at the 1st and 3rd codon positions (also as the average distribution). This energy pattern provides a virtual physicochemical definition of codon boundaries [7].
The foregoing observations indicate that the 2nd codon letter is clearly distinguished from the 1st and 3rd, both structurally and functionally.
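The positional G-C preference described above can be checked on any in-frame coding sequence with a simple count. The following is a minimal sketch, not the folding-energy analysis of [7]: it only measures base composition per codon position, and the function name and the toy sequence are assumptions introduced here for illustration.

```python
def gc_by_codon_position(cds: str) -> list[float]:
    """Return the G+C fraction at codon positions 1, 2 and 3.

    `cds` is assumed to be an in-frame coding sequence (A, C, G, T);
    trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    gc = [0, 0, 0]
    for i in range(n_codons * 3):
        if cds[i] in "GC":
            gc[i % 3] += 1  # i % 3 is the 0-based position inside the codon
    return [count / n_codons for count in gc]


# Hypothetical toy sequence (codons GCG GAG CTG GTC), chosen only to
# illustrate a G-C excess at the 1st and 3rd positions.
toy = "GCGGAGCTGGTC"
print(gc_by_codon_position(toy))  # [1.0, 0.25, 1.0]
```

A higher value at positions 1 and 3 than at position 2, averaged over many exons, would be consistent with the pattern described in the text; base composition alone does not reproduce the free-energy calculation itself.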

The possible origin of the Genetic Code
The literature is rather rich in ideas regarding the possible origin and development of the Genetic Code. It has been suggested that the present-day Genetic Code developed from a primitive A+T-containing code [8], while others have found evidence for a primitive G+C-containing code [9]. We performed statistical analyses of Codon Usage Frequencies (CUFs) in several species in the hope of finding evidence for one or the other primitive code.
Genome-wide species sequencing projects have emerged only during the past 10 years and have provided reliable data for analysis of species-specific codon usage. These data are collected in numerous Codon Usage Tables (CUT) [10]. We examined codon usage frequencies in 113 species from different stages of evolutionary development, and found that codon usage is strongly biased in every species and shows a rather similar pattern in different organisms (Figure 1). These large differences in codon usage frequencies are not simply the result of differences in amino acid usage frequencies, which are to be expected, but are largely caused by differences in the usage of synonymous codons (codons encoding the same amino acid).
Furthermore, the synonymous codon usage bias is about the same (fairly well conserved) in all species. Detailed analyses of the base compositions of codons reveal that the most frequent codons are preferentially built of A and T bases (Figure 2). The 13 most frequent codons (20% of the 64 possible) have A or T in the central position (the critical position in relation to the physicochemical properties of the encoded amino acid [6]) and account for 36% of all codon usage.
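To separate synonymous codon bias from amino acid usage, codon counts can be normalized within each synonymous family. The sketch below uses relative synonymous codon usage (RSCU), a common normalization that is swapped in here for illustration; the source does not state which statistic was applied to the CUT data. The standard genetic code table is built from its usual TCAG ordering, and the toy sequence is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence; unknown triplets are skipped."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if c in CODE)


def rscu(counts: Counter) -> dict[str, float]:
    """Relative synonymous codon usage.

    RSCU = observed count / (family total / family size), so 1.0 means
    'used as often as expected if all synonymous codons were equal'.
    Families that never occur in the input are left out.
    """
    families: dict[str, list[str]] = {}
    for codon, aa in CODE.items():
        families.setdefault(aa, []).append(codon)
    result = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue
        for c in codons:
            result[c] = counts.get(c, 0) * len(codons) / total
    return result


# Hypothetical toy input: Lys is encoded twice by AAA and once by AAG.
values = rscu(codon_counts("AAAAAAAAGTTT"))
print(round(values["AAA"], 3), round(values["AAG"], 3))  # 1.333 0.667
```

Because the normalization is done within each amino acid, a difference in RSCU between two species cannot be explained by a difference in how often that amino acid is used.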
The codon usage frequency pattern is strikingly similar in the seven major species categories. However, it is possible to detect some minor but statistically significant differences (Figure 3). The relative CUF pattern shows significant, systematic differences when the values of animals are compared to distant species categories such as viruses, protists and archaea. No such difference is found when the animal values are compared to phylogenetically closer relatives such as bacteria, fungi and plants.
The negative correlation between animals and archaea is most significant. This difference is clearly related to the base compositions of the codons: AT-rich codons are preferentially used by archaea, while animals prefer GC-rich codons (Figure 4).
The amino acid usage frequencies of animals and archaea are clearly different, but this difference does not correlate with the AT content of the synonymous codons and therefore fails to explain the AT-related differences in CUF patterns (Figure 5).
The dominance of A and T in the most frequently used codons, and the synonymous codon usage bias in favor of AT-rich codons in older rather than younger species, suggest to us that primitive codons were built of A and T bases, while G and C came into use during a later, second developmental stage.
The next important question about the primitive Code is the number of bases necessary in the codons. The modern triplet codon provides 64 different combinations of the four nucleic acid bases, far more than necessary to encode 20 amino acids. Two singlet codons (A and T) can theoretically encode two amino acids. The coding rules in the modern Common Periodic Table of Codons and Nucleic Acids [6] suggest that T codes for one hydrophobic and A for one hydrophilic or charged amino acid, say T>Phe [now encoded by TTT] and A>Lys [now encoded by AAA]. Thus, the hypothetical primitive [FK]n-type oligopeptides and [AT]n-type oligonucleotides may have formed a nucleo-peptide complex (the positively charged K attracting the negatively charged nucleic acid), which would preferentially have become located on aqueous boundary surfaces, forming a layer (primitive membrane?) (Figure 6). The next step toward the complexity of the present-day Genetic Code might have been recognition of the order (sequence) of bases. Bases in the order ATAATA form a different shape from, for example, TATTAT, and this shape difference might have been utilized to distinguish between two amino acids (regarding their coding as well as their binding). There is strong evidence suggesting that codons and amino acids developed in parallel (co-evolution) [4] and there is a significant connection between the physicochemical properties of the amino acids and their codon structures [11]. Therefore, the development of a triplet code that still utilized only A and T seems a rather "logical" possibility [12].
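The size of an AT-only triplet code follows directly from the combinatorics: two bases in three positions give 2^3 = 8 codons, against 4^3 = 64 for the full alphabet. The sketch below enumerates the eight AT-only triplets and, as an assumption made only for illustration, looks up what each one encodes in the modern standard code; the source does not claim that the primitive code used the same assignments.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# All triplets that can be written with A and T only.
at_only = {"".join(c): CODE["".join(c)] for c in product("AT", repeat=3)}

print(len(CODE), len(at_only))  # 64 8
for codon, aa in at_only.items():
    print(codon, aa)  # '*' marks a stop codon in the modern code

# Distinct meanings available to an AT-only triplet code under the
# (assumed) modern assignments: six amino acids plus a stop signal.
print(sorted(set(at_only.values())))
```

Under the modern assignments these eight codons cover Phe, Leu, Ile, Tyr, Asn and Lys plus a stop signal, a mix of hydrophobic and hydrophilic residues that includes the Phe (TTT) and Lys (AAA) pair mentioned in the text.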
The development of this triplet code immediately raises the question of translation reading frames: where are the beginning and the end of a triplet codon? We assume that the codon boundaries were not yet defined in the primitive code; AT-only triplet codons were overlappingly translated (Figure 7).
The idea of overlapping translation goes back to the 1950s. In the years immediately following the proposal of the structure for dsDNA, George Gamow suggested a so-called "diamond code" to explain the connection and information transfer between DNA and proteins [13][14][15]. In his model the nucleic acid bases form 20 different cavities into which the 20 different amino acids fit specifically. The order of cavities determines the order (sequence) of encoded amino acids, which polymerize and form the individual proteins. Gamow's model was the very first model for translation and it turned out to be overlapping, which means that the 2nd and 3rd bases of a triplet codon are identical to the 1st and 2nd bases in the next triplet codon, so amino acid neighbors are interdependently encoded. The attractive feature of the overlapping codon model is that it takes advantage of an interesting structural similarity between amino acids and nucleic acids, namely that the distance between the amino acids and the distance between nucleotide bases is the same, which strongly suggests a connection and 1:1 relationship between these very different residues. In addition, the "frame shift" problem does not arise in overlapping translation. A big disadvantage of this model, which turned out to be "fatal", is that it simply does not permit some amino acid neighbors that do exist in real proteins [16]. Gamow's model is still revisited time after time as a way of avoiding frame-shift problems when no other way is apparent. We know today that codon boundaries are physicochemically defined in modern codons by the periodic distribution of GC bases [7], which was not the case in the AT-only model described above.
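The difference between the two reading modes, and the neighbor restriction that made the overlapping model untenable, can be made concrete in a few lines. This is a minimal sketch of overlapping triplet reading as described in the text (each codon shares two bases with the next); it does not model Gamow's diamond-shaped cavities, and the function names are introduced here for illustration.

```python
def read_nonoverlapping(seq: str) -> list[str]:
    """Read triplets in steps of three bases (modern translation)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def read_overlapping(seq: str) -> list[str]:
    """Read triplets in steps of one base, so neighbors share two bases."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def allowed_successors(codon: str, alphabet: str = "ACGT") -> list[str]:
    """Codons that may follow `codon` under overlapping reading.

    The next codon must start with the last two bases of the current one,
    so only len(alphabet) successors exist instead of all possible codons.
    """
    return [codon[1:] + base for base in alphabet]


seq = "ATAATA"
print(read_nonoverlapping(seq))  # ['ATA', 'ATA']
print(read_overlapping(seq))     # ['ATA', 'TAA', 'AAT', 'ATA']

# With four bases each codon has 4 possible successors out of 64;
# with an AT-only alphabet it has 2 out of 8.
print(allowed_successors("ATA"))        # ['TAA', 'TAC', 'TAG', 'TAT']
print(allowed_successors("ATA", "AT"))  # ['TAA', 'TAT']
```

Overlapping reading yields one codon per base and needs no frame, but every codon fixes two of the three bases of its successor, which is why some amino acid neighbors cannot be encoded.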
There is no overlapping translation in recent or modern organisms (as far as we know), but signs that it once existed might have been preserved. When we look for such molecular, phylogenetic "fossils" it is important to bear in mind that biological history is preserved in the form of DNA (not protein) and historical DNA records are located in the non-translated DNA domains (introns, and even regions called "junk" DNA) [17].
Suppose that overlapping translation did exist in the past, but at a certain point in evolution it was replaced by the now-practiced non-overlapping translation. In that case, some nucleic acid sequences might exist in two different forms with the same translational meaning (protein sequence): one "compact", which was overlappingly translated in the past, and one "extended", which is nowadays translated non-overlappingly (Extended OTS). A third category of nucleic acid sequences, which developed later, comprises those that cannot be compressed into OTS and are called Extended non-OTS (Figure 8).
To test this idea we constructed 64 polycodon sequences, each corresponding to one codon repeated 10 times. We looked for the incidence of these simple monotone repeats in the Nucleotide Sequence Databases provided by the Blast server of NCBI [18] (which contains all GenBank + EMBL + DDBJ + PDB sequences, but no EST, STS, ...) [19]. Surprisingly many sequences were found that were significantly similar to the polycodon-like query oligonucleotides, and their frequencies differed depending on the base composition of the query (Figure 9). This base composition-dependent distribution recalled what we found in the CUF tables: once again, AT-rich sequences (codons) were more frequent than GC-rich ones (see Figures 2 and 4).
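The 64 query oligonucleotides are easy to generate, and grouping them by the number of A or T bases per codon gives the base-composition classes against which hit frequencies can be compared. This is a minimal sketch of the query construction only; the database search itself is not reproduced, and the function names are introduced here for illustration.

```python
from collections import Counter
from itertools import product


def polycodon_queries(repeats: int = 10) -> dict[str, str]:
    """Map each of the 64 codons to that codon repeated `repeats` times."""
    return {
        "".join(c): "".join(c) * repeats
        for c in product("ACGT", repeat=3)
    }


def at_count(codon: str) -> int:
    """Number of A or T bases in a codon (0 to 3)."""
    return sum(base in "AT" for base in codon)


queries = polycodon_queries()
print(len(queries), len(queries["ATA"]))  # 64 30
print(queries["ATA"])                     # ATAATAATAATAATAATAATAATAATAATA

# Size of each base-composition class: 8 codons with no A/T, 24 with
# one, 24 with two and 8 with three.
classes = Counter(at_count(codon) for codon in queries)
print(sorted(classes.items()))  # [(0, 8), (1, 24), (2, 24), (3, 8)]
```

Because the classes differ in size, hit counts should be compared per query or per class average rather than as raw class totals.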
In the next step we assumed, very arbitrarily, that these highly frequent polycodon sequences represent compact, primitive nucleic acids that could be extended and translated overlappingly as well as non-overlappingly (as shown in Figure 8). About 1/3 of the compact sequences were also found in the extended OTS forms, especially those derived from AT-rich sequences (Figure 10). The extended non-OTS forms had the lowest frequency, and this was independent of base composition.
These findings suggest that codon-like repeats (especially AT-rich) played a significant role in the genome and were therefore preserved. They represent compact sequences from the early period of codon development, which could be overlappingly translated. However, these sequences successively lost their importance (translation?) with the development of GC-containing codons and the shift to non-overlapping translation.
Our recent and previous experiments provide support for the ideas of Jimenez-Sanchez [8], who suggested that the present-day genetic code developed from a simpler, AT-only code, and that G and C were added in a second, later step of development. However, an AT-only nucleic acid cannot be dissected into well-defined triplet codons, so frame-shifts have to exist and the overlapping translation of codons is unavoidable. These simple nucleic acids and their translation have, of course, significant limitations for the development of biological functions and life as we know it today. Another concern regarding the AT-only codon is the absence of any experimental evidence. The bearers of the AT-only code might have disappeared. More likely, the idea of Jimenez-Sanchez is an extreme and theoretical extrapolation of the biological reality that older species contain more AT bases, and younger species more GC bases, in their genomes.
The situation changed dramatically with the addition of G and C bases. This addition increased the number of possible codons to 64, provided the possibility of high energy signatures along the nucleic acid sequences (physicochemical definition of codon boundaries [7]) and made it possible to shift from overlapping translation to the present-day, more permissive, non-overlapping variant. An alternative evolutionary model was proposed by Ikehara et al. [9], who emphasized the importance of GC at the 1st and 3rd codon positions. We completely agree with Ikehara's statements about the special significance of G and C in defining codon boundaries. However, our recent analyses of CUF tables clearly show the dominance of AT-rich codons in terms of frequency of use. This indicates that AT-rich codons have an evolutionary importance of their own.

Figure 8. There are three categories of nucleic acid sequences. The oldest form is the compact sequence, in which codon boundaries cannot yet be recognized, so it is translated overlappingly (OTS). Extended OTS sequences developed from compact sequences and made it possible to read the OTS non-overlappingly. Extended non-OTS sequences are late developments and they cannot be compressed into compact forms. Codons are indicated by blue boxes. The difference between extended OTS and non-OTS is indicated by red letters. Note that the translation of extended OTS and non-OTS sequences into protein sequences is the same (yellow boxes).

The involvement of overlapping translation in our model of codon evolution solves the obvious problem of frame shifts, but creates new concerns at the same time. It might be difficult to understand how the transition to non-overlapping translation could have happened without serious conflict between the two systems.
Evolutionary questions often have philosophical aspects. The biological sciences currently tend to picture evolution as an uninterrupted, continuous, linear process. However, it is not known whether the biological evolution that scientists are able to observe on the Earth is "the" biological evolution or only a local variant of a greater biological evolution in the Universe. Francis H. C. Crick, a founding father of molecular biology, launched the idea of panspermia [20]. It suggests that life developed somewhere in the universe and spread through the cosmos, even to the Earth, as DNA trapped in cosmic debris. In this case only a short part of biological evolution is available to us; we can see trends and suggest necessary events (like the AT-only nucleic acids [8] and overlapping translation [13][14][15]) without the possibility of finding remains of these evolutionary steps.
It is concluded that the well-known triplet codons and the 64/20 translation form a complex system that is the result of successive development from much simpler systems, such as AT-only codons (which coded for only a few amino acids) and overlapping translation. The evolutionary "addition" of GC nucleotides was necessary to define the present-day codon structure, which physicochemically marks codon boundaries and makes the more sophisticated, non-overlapping translation possible.