Journal of Computer Science & Systems Biology

Author(s): Arquès DG, Fallot JP, Marsan L, Michel CJ

Abstract: The subset X0 = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC} of 20 trinucleotides occurs preferentially in frame 0 (the reading frame established by the ATG start trinucleotide) of protein (coding) genes of both prokaryotes and eukaryotes. This subset X0 is a complementary maximal circular code whose two circular permutations, X1 and X2, are also maximal circular codes and occur preferentially in frames 1 and 2 respectively (frame 0 shifted by one and two nucleotides respectively in the 5'-3' direction). X0 is called a C3 code (Arquès and Michel, 1997, J. Biosyst 44, 107-134). A quantitative study of the three subsets X0, X1 and X2 in the three frames 0, 1 and 2 of eukaryotic protein genes shows that their occurrence frequencies are constant functions of the trinucleotide position in the sequences. The frequencies of X0, X1 and X2 in frame 0 of eukaryotic protein genes are 48.5%, 29% and 22.5% respectively. These properties are not observed in the 5' and 3' regions of eukaryotes, where X0, X1 and X2 occur with variable frequencies around the random value (1/3). Several unexpected frequency asymmetries, e.g. the frequency difference between X1 and X2 in frame 0, are related to a new property of the C3 code X0 involving substitutions. An evolutionary analytical model with three parameters (p, q, t) is based on an independent mixing of the 20 codons (trinucleotides in frame 0) of X0 with equiprobability (1/20), followed by t ≈ 4 substitutions per codon distributed over the three codon sites in the proportions p ≈ 0.1, q ≈ 0.1 and r = 1 - p - q ≈ 0.8 respectively. This model reproduces the frequencies of X0, X1 and X2 observed in the three frames of protein genes and explains these asymmetries. The complex behaviour of the analytical curves is totally unexpected and a priori difficult to imagine. Finally, the evolutionary analytical method developed could be applied to phylogenetic tree reconstruction and DNA sequence alignment.
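The set relations and the model ingredients described in the abstract can be illustrated with a short Python sketch. It is not the authors' analytical method. It builds X1 and X2 by circular permutation of X0, measures the share of each subset in a chosen frame, and runs a rough Monte Carlo stand-in for the mixing-plus-substitution process. The function names (`rotate`, `class_frequencies`, `simulate`) are hypothetical. Two modelling choices are assumptions not stated in the abstract: each codon receives exactly t substitutions, and each substitution replaces the base with one of the three other bases chosen uniformly.

```python
import random

# The 20 trinucleotides of X0, as listed in the abstract.
X0 = frozenset("""AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC
                  GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC""".split())

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotate(word, k):
    """Circularly permute a trinucleotide by k positions to the left."""
    return word[k:] + word[:k]


def reverse_complement(word):
    """Complement each base and reverse the reading direction."""
    return word.translate(COMPLEMENT)[::-1]


# X1 and X2 are the two circular permutations of X0.
X1 = frozenset(rotate(w, 1) for w in X0)
X2 = frozenset(rotate(w, 2) for w in X0)


def class_frequencies(seq, frame):
    """Fraction of trinucleotides read in `frame` that belong to X0, X1, X2.

    AAA, CCC, GGG and TTT belong to none of the three subsets, so the
    three fractions can sum to less than 1.
    """
    counts = {"X0": 0, "X1": 0, "X2": 0}
    total = 0
    for i in range(frame, len(seq) - 2, 3):
        word = seq[i:i + 3]
        total += 1
        for name, code in (("X0", X0), ("X1", X1), ("X2", X2)):
            if word in code:
                counts[name] += 1
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


def simulate(n_codons, t=4, p=0.1, q=0.1, seed=0):
    """Monte Carlo stand-in for the model (assumptions noted in the lead-in).

    Each codon is drawn uniformly from X0, then receives exactly t
    substitutions; the substituted site is chosen with weights (p, q, 1-p-q).
    """
    rng = random.Random(seed)
    codons = sorted(X0)
    out = []
    for _ in range(n_codons):
        codon = list(rng.choice(codons))
        for _ in range(t):
            site = rng.choices((0, 1, 2), weights=(p, q, 1 - p - q))[0]
            # Assumption: uniform choice among the three other bases.
            codon[site] = rng.choice([b for b in "ACGT" if b != codon[site]])
        out.append("".join(codon))
    return "".join(out)


if __name__ == "__main__":
    seq = simulate(100_000)
    for frame in (0, 1, 2):
        print(frame, class_frequencies(seq, frame))
```

Because the substitution scheme here is a simplification, the printed frequencies should not be expected to match the 48.5%, 29% and 22.5% reported for real genes. The sketch only shows how the three subsets and the three frames relate to one another.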
