Genome-Wide Relative Analysis of Codon Usage Bias and Codon Context Pattern in the Bacteria Salinibacter Ruber, Chromohalobacter Salexigens and Rhizobium Etli

The codon is the basic unit of biological message transmission during protein synthesis in an organism. Codon usage bias is the preferential use of particular synonymous codons in an organism. This preferential use is found not only among species but also among genes within the same genome. The variation in codon usage patterns is shaped by natural processes such as mutation, drift and selection pressure. In this study, we used computational and statistical techniques to examine codon usage bias and codon context patterns in Salinibacter ruber (extremely halophilic), Chromohalobacter salexigens (moderately halophilic) and Rhizobium etli (non-halophilic). In addition, compositional variation in translated amino acid frequency, the effective number of codons (ENc) and optimal codons were studied. A plot of ENc versus GC3s suggests that both mutation bias and translational selection contribute to the differences in codon bias. However, mutation bias is the driving force of the synonymous codon usage patterns in the halophilic bacteria (Salinibacter ruber and Chromohalobacter salexigens), whereas translational selection seems to affect the codon usage pattern in the non-halophilic bacterium (Rhizobium etli). Correspondence analysis of Relative Synonymous Codon Usage revealed gene clusters that differ in number among the bacteria under study. Moreover, the codon context pattern also varied among these bacteria. These results clearly indicate variation in the codon usage pattern among these bacterial genomes.

Mohammad Samir Farooqi1*, DC Mishra1, Niyati Rai1, DP Singh2, Anil Rai1, KK Chaturvedi1, Ratna Prabha2 and Manjeet Kaur1


Introduction
The study of organisms from extreme environments is an important field of research for enhancing knowledge of molecular and biological approaches in agriculture. It supports a deeper understanding across multiple scientific areas for developing new varieties or breeds and biological materials. The ability of an organism to survive under high-salt conditions offers an excellent opportunity to improve the understanding of hypersaline physiology and to identify the genes responsible for salt tolerance. Halophilic organisms thrive in saline environments such as salt lakes, coastal lagoons and man-made salterns, which are characterized by two stress factors: high ion concentration and low water potential [1,2]. Extensive information on the taxonomy, physiology and ecology of halophilic microorganisms has been reported, but relative codon usage patterns in these organisms have been little studied. A wide range of halophilic microorganisms is found in the domains Archaea and Bacteria. The saline cytoplasm of these organisms requires enzymes that are rich in acidic amino acids and dependent on K+ or Na+ for their biological activity [3]. Oren and Mana (2002) [4] reported that these organisms include: (i) the extremely halophilic Archaea of the family Halobacteriaceae, which comprises Halobacterium, Haloarcula, Haloquadratum, Halorhabdus, Natronobacterium and Natronococcus; (ii) the halophilic Bacteria of the order Haloanaerobiales; and (iii) the bacterium Salinibacter ruber (S. ruber). S. ruber, an extreme halophile, is a red, aerobic bacterium that requires at least 150 g of salt per litre for growth and grows optimally at NaCl concentrations of 200-300 g/litre [5]. Chromohalobacter salexigens (C. salexigens), a Gram-negative aerobic bacterium, is moderately halophilic. It grows at NaCl concentrations between 0.5 M and 4 M, with optimum growth at 2.0-2.5 M and an optimum temperature of 37°C [6,7]. Rhizobium etli (R. etli) is a Gram-negative soil bacterium that fixes nitrogen and forms an endosymbiotic nitrogen-fixing association with the roots of legumes. R. etli inoculants are useful as biofertilizers. They promote plant growth and productivity and are internationally accepted as an alternative source of N-fertilizer [8].
The genetic code is the set of rules by which the sequence of nucleotides in DNA or RNA determines the amino acid sequence of a protein during synthesis. It employs 64 codons, which can be grouped into 21 disjoint families: one for each of the 20 standard amino acids and a 21st for the translation termination signal. Different codons that encode the same amino acid are called synonymous codons, and they usually differ by the nucleotide at the third codon position. According to the number of synonymous codons for each amino acid, there are two amino acids with one codon choice, nine with two, one with three, five with four and three with six. These represent five synonymous family (SF) types, designated SF types 1, 2, 3, 4 and 6 [9]. The unequal or preferred usage of particular codons within a synonymous family is termed Synonymous Codon Usage (SCU). Specific SCU patterns may arise from mutational bias [10], bias in G+C content, natural selection and other factors. The SCU pattern is non-random and species-specific [11,12], and the frequency of synonymous codon usage varies among species [13]. Significant variation in codon usage bias (CUB) has also been reported among different genes within the same organism [14,15]. In some organisms codon bias is very strong, whereas in others the different synonymous codons are used with similar frequencies [16-18]. Similarly, the strength of codon bias varies across genes within each genome, with some genes using a highly biased set of codons and others using the different synonymous codons with similar frequencies. The degree of codon usage bias also depends on the level of gene expression, with highly expressed genes exhibiting greater codon bias than infrequently expressed genes [14,15]. This correlation has been used to predict the highly expressed genes of an organism, especially in prokaryotes.
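The family sizes quoted above can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original analysis; the names CODON_TABLE and synonymous_family_sizes are illustrative, and the amino acid string is the standard code written in TCAG order.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter amino acid (illustrative name).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_family_sizes(table=CODON_TABLE):
    """Return {family size: number of amino acids with that many codons}."""
    codons_per_aa = Counter(aa for aa in table.values() if aa != "*")
    return dict(sorted(Counter(codons_per_aa.values()).items()))


print(synonymous_family_sizes())  # {1: 2, 2: 9, 3: 1, 4: 5, 6: 3}
```

Excluding the two single-codon amino acids (Met and Trp) and the three stop codons leaves the 59 codons on which synonymous usage can vary.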
Among bacteria, genomic G+C content varies over a wide range, presumably reflecting variation in mutation biases [19], and has a major impact on codon usage [20]. Analysis of codon usage patterns can provide a basis for understanding the mechanisms behind the biased usage of synonymous codons [15,21,22]. The codon context pattern may affect translational selection of genes, which makes it suitable for studying codon bias patterns and understanding genetic diversity. Studies have already shown that each genome uses a set of preferred codons and that codon context is not random [23-25].
In this study, the genomes of the bacteria S. ruber, C. salexigens and R. etli were analyzed in terms of synonymous codon usage bias and codon context pattern, both to understand the molecular mechanisms at work under salinity stress and to compare codon usage among these bacteria.

Nucleotide sequence data
In our study we included the complete coding sequences (CDSs) of three bacteria, viz. S. ruber, C. salexigens and R. etli, which are extremely halophilic, moderately halophilic and non-halophilic, respectively. The nucleotide sequences in FASTA format were retrieved from http://cmr.jcvi.org/cgi-bin/CMR/CmrHomePage.cgi. To minimize sampling errors, gene sequences shorter than 300 bp and those with intermediate termination codons were removed [26]. The final dataset after exclusion of these sequences consisted of 1450, 2147 and 3703 genes of S. ruber, C. salexigens and R. etli, respectively. A Perl program was developed to merge these gene sequences for further processing and analysis [27,28].
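The two exclusion criteria can be sketched as a single predicate. This is an illustration in Python rather than the Perl program used in the study; the function name passes_filter and the treatment of the final codon as a permitted terminator are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_filter(cds, min_length=300):
    """Keep a CDS only if it is long enough and has no internal stop codon."""
    seq = cds.upper()
    if len(seq) < min_length:
        return False
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Assumption: the last codon may be the legitimate terminator,
    # so only the codons before it are checked.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


# Hypothetical sequences used only to exercise the filter.
kept = passes_filter("ATG" + "GCC" * 100 + "TAA")      # 306 bp, clean
too_short = passes_filter("ATG" + "GCC" * 10 + "TAA")  # 36 bp
internal_stop = passes_filter("ATG" + "GCC" * 50 + "TAG" + "GCC" * 50 + "TAA")
print(kept, too_short, internal_stop)  # True False False
```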

Calculation of codon usage indices
The frequency of codons (excluding stop codons) corresponding to each amino acid in the CDSs was used for codon usage analysis. Relative Synonymous Codon Usage (RSCU) [29], Effective Number of Codons (ENc) [9] and Codon Adaptation Index (CAI) [30] were calculated. Highly and lowly expressed genes and the frequency of optimal codons were identified in all three bacterial species using the CodonW software (http://codonw.sourceforge.net/).
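For orientation, RSCU for a codon is its observed count divided by the count expected if all codons of its synonymous family were used equally, so a value above 1 indicates a preferred codon. The following is a minimal sketch of that definition, not a reimplementation of CodonW; the names CODON_TABLE and rscu are illustrative.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(sequences):
    """Return {codon: RSCU} pooled over an iterable of in-frame DNA CDSs."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and anything containing ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        # Met and Trp offer no synonymous choice; unused families are skipped.
        if len(codons) == 1 or total == 0:
            continue
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


# Toy input: four alanine codons, three of them GCC.
print(rscu(["GCCGCCGCCGCT"])["GCC"])  # 3.0
```

Within each family the RSCU values sum to the number of synonymous codons, which is why the 59 values can be compared across amino acids with different degeneracy.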

Statistical analysis
Statistical analysis was carried out using SAS 9.2. To derive valid biological conclusions, multivariate statistical analysis using Correspondence Analysis (CA) was applied. For large multi-dimensional datasets, CA reduces the dimensionality of the data so that most of the variation can be visualized in a few axes [31,32]. CA was also used for determining highly expressed genes and optimal codons. Pearson correlation was calculated to identify the relationship between CAI and ENc values. To assess the effect of the base composition at the third codon position on the effective number of codons, Poisson regression analysis was performed with ENc as the dependent variable and the A3s, T3s, C3s and G3s frequencies as independent variables.

Codon context pattern analysis
Codon context generally refers to a sequential pair of codons in a gene. Codon context pattern analysis was performed using the Anaconda 2.0 software [33]. The amino acid pairs and the residual values of each codon pair were calculated from the coding sequences. A cluster tree was generated to compare the genomes through codon context pattern analysis. The cluster pattern is based on the average matrix of residuals of each codon context among the species [34]. Codon context patterns reveal that specific codons are frequently used as the 3′ and 5′ context of start and stop codons.
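The idea of a residual for a codon pair can be illustrated with a contingency table of adjacent codons. The sketch below counts (5′ codon, 3′ codon) pairs and computes the standard adjusted residual of a contingency table; that Anaconda uses exactly this statistic is an assumption here, and the function names are illustrative.

```python
from collections import Counter
from math import sqrt


def codon_pair_counts(sequences):
    """Count adjacent (5' codon, 3' codon) pairs across in-frame CDSs."""
    pairs = Counter()
    for seq in sequences:
        seq = seq.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        pairs.update(zip(codons, codons[1:]))
    return pairs


def adjusted_residuals(pairs):
    """Return {(5' codon, 3' codon): adjusted residual} under independence."""
    total = sum(pairs.values())
    row, col = Counter(), Counter()
    for (first, second), n in pairs.items():
        row[first] += n
        col[second] += n

    residuals = {}
    for first in row:
        for second in col:
            expected = row[first] * col[second] / total
            variance = (expected
                        * (1 - row[first] / total)
                        * (1 - col[second] / total))
            if variance > 0:
                observed = pairs[(first, second)]
                residuals[(first, second)] = (observed - expected) / sqrt(variance)
    return residuals


# Toy sequence: the pair GAC-GAG occurs twice, GAC-GCC never.
pairs = codon_pair_counts(["GACGAGGACGAGGCCGGC"])
res = adjusted_residuals(pairs)
print(res[("GAC", "GAG")] > 0, res[("GAC", "GCC")] < 0)  # True True
```

Positive residuals mark codon pairs seen more often than independence predicts (preferred contexts), and negative residuals mark pairs seen less often (rejected contexts).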

Codon usage pattern
The overall RSCU values for the 59 codons (Table 1) provide ample evidence of codon usage bias in the studied bacterial genomes. In every synonymous codon family, codons ending with C or G are preferred over A- and T-ending codons in these bacteria. Moreover, the bias is stronger towards codons ending with C than towards those ending with G, which shows that the genes of these bacteria are dominated by C-ending codons. These results indicate that the codon usage pattern in these bacterial species is shaped mostly by compositional constraints.

Heterogeneity of codon usage
To study the heterogeneity of codon usage, two indices, ENc and GC3s, were used (Figure 2). S. ruber (green) shows a wide range of GC3s content, from 23 to 97% (mean: 85%, standard deviation: 7.8%), and a wide range of ENc values, from 29.17 to 61 (mean: 37.85, standard deviation: 5.73). In C. salexigens (blue), GC3s values vary from 33 to 94% (mean: 81%, standard deviation: 6.1%) and the corresponding ENc values vary from 26.32 to 61 (mean: 37.9, standard deviation: 4.462). In R. etli (red), GC3s values vary from 29 to 91% (mean: 61%, standard deviation: 14.6%) and the corresponding ENc values vary from 25.04 to 61 (mean: 46.2, standard deviation: 7.004). The heterogeneity of the means was also supported by independent-sample t-tests performed for GC3s and ENc between S. ruber and C. salexigens, S. ruber and R. etli, and R. etli and C. salexigens (Table 3). All differences between means were significant (P < 0.05) except for ENc between S. ruber and C. salexigens.
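As a reminder of what is being compared, GC3s can be computed per gene as follows. The sketch assumes the usual CodonW-style definition, which the text does not spell out: the fraction of G or C at the third position of synonymous codons, excluding Met, Trp and stop codons. The function name gc3s is illustrative.

```python
# Codons with no synonymous alternative (Met, Trp) and the stop codons.
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(cds):
    """Fraction of G or C at the third position of synonymous codons."""
    seq = cds.upper()
    third_bases = [
        seq[i + 2]
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 3] not in NON_SYNONYMOUS
    ]
    if not third_bases:
        raise ValueError("no synonymous codons in sequence")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


# Toy CDS: ATG and TAA are excluded; GCC and GCG count, GCT does not.
value = gc3s("ATGGCCGCTGCGTAA")  # two of three synonymous codons end in G/C
```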

Mutational bias effect on codon usage variation
To examine the relationship between nucleotide composition and codon selection in each species, the Pearson correlation between ENc and CAI was calculated. A significant negative correlation was observed in S. ruber (r = -0.43711, P < 0.0001), C. salexigens (r = -0.57703, P < 0.0001) and R. etli (r = -0.72062, P < 0.0001). The correlation values indicate that the codon usage bias of genes in these species has a distinct relationship with the nucleotide composition of the coding sequences.
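The correlation coefficient reported above is the ordinary Pearson r between the per-gene ENc and CAI values. A minimal sketch of the computation follows; the study used SAS, so this is only an illustration, and the input values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-gene values: ENc falls as CAI rises, so r is negative.
enc_values = [55.0, 48.0, 41.0, 35.0]
cai_values = [0.30, 0.42, 0.55, 0.71]
r = pearson_r(enc_values, cai_values)
```

A negative r, as in all three species, means that genes with stronger bias (lower ENc) tend to have higher CAI.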
To further test the compositional constraints, Poisson regression was performed with ENc as the dependent variable and nucleotide composition at the third codon position as the predictor variables. The estimated coefficients are shown in Table 4. All coefficients are significant in S. ruber. The coefficient of T3s is non-significant in C. salexigens, and those of T3s and G3s are non-significant in R. etli. Figure 2 clearly shows that the distribution for R. etli is tri-modal, whereas those of the other two species are uni-modal. This indicates that high variation in codon usage occurs in these bacteria. This large range of variation is probably due to differential mutational pressure acting on different coding regions of a genome.

Relationship between ENc and GC3s
Based on codon homozygosity, ENc is one of the most useful measures of codon usage bias in different organisms. ENc values were calculated for the coding sequences of the three bacterial species. To identify the dominant factors shaping codon usage bias, ENc values were plotted against the corresponding GC3s values, together with the curve expected in the absence of selection pressure (Figure 3). If GC3s composition were the only determinant of codon usage bias, all genes would lie on this expected curve, which is not the case in this study.
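The expected curve in such a plot is commonly taken from Wright's formulation of ENc, in which the expected value under GC3s constraint alone is 2 + s + 29 / (s^2 + (1 - s)^2), with s the GC3s fraction. The sketch below assumes this is the curve drawn in Figure 3; the function name expected_enc is illustrative.

```python
def expected_enc(gc3s):
    """Expected ENc when codon usage is constrained only by GC3s (0..1)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2 + gc3s + 29 / (gc3s ** 2 + (1 - gc3s) ** 2)


# The curve peaks at equal base usage and falls towards either extreme.
for s in (0.0, 0.5, 0.85, 1.0):
    print(s, round(expected_enc(s), 2))
```

Genes that fall well below this curve are more biased than their GC3s alone would predict, which is usually read as evidence of factors other than composition, such as translational selection.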
The ENc plot in Figure 3 shows the strongest negative correlation for S. ruber (green) and C. salexigens (blue), whereas in R. etli (red) two distinct groups are observed, one with almost zero correlation and the other with a negative correlation between ENc and GC3s. Thus, a strong influence of compositional constraints on codon usage bias can be inferred from the presence of a significant negative correlation between ENc and GC3s.

Multivariate statistical approach
The RSCU values of the genes of these bacteria were subjected to correspondence analysis (CA), a method of multivariate statistical analysis (MVA). In this study, CA was performed on RSCU values to minimize the effects of amino acid composition, and the most prominent axes contributing to codon usage variation among the genes were determined. Axis 1 accounts for the largest fraction of the variation, axis 2 describes the second largest trend, and each subsequent axis describes a progressively smaller amount of variation, as shown in Table 5. Although the first axis explains a substantial amount of variation, its value is still lower than that found in other organisms studied earlier [35]. The low value might be due to the extreme genomic composition [36] of this organism. Figure 4 also shows that the majority of points are clustered around the origin of the axes, indicating that these genes have more or less similar codon usage biases. However, a few points are widely scattered along the negative side of axis 1, which suggests that the codon usage bias of these genes is not homogeneous. In the scatter plot of axis 1 against axis 2 (Figure 4), the scores for R. etli genes are clearly differentiated into three clusters, whereas a single cluster is observed for S. ruber and for C. salexigens. Genes falling in the same cluster have more or less similar codon usage bias.

Translational optimal codons
To identify the optimal codons, 10% of the genes from each extreme of axis 1 were analysed for the species under study (Table 5). Ikemura (1981) [37] showed that there is a match between these codons and the most abundant tRNAs. It has been reported that highly expressed genes have a strong selective preference for codons whose corresponding tRNA molecules are present at high concentration [38,39]. This trend has been interpreted as co-adaptation between the amino acid composition of proteins and the tRNA pool to enhance translational efficiency. The possible reasons for the varying GC bias in bacteria from saline habitats [40], although not very strict, could be linked with tRNA affinity as deciphered in this study, with selection on genomic base composition [41], and with the highly acidic proteome of halophiles, which mostly lacks basic proteins and over-represents acidic residues (e.g. Asp and Glu) [42].

Codon context analysis
Data clustering helps to identify patterns of preferred and rejected codon pairs, which can give a better understanding of genetic diversity. Codon context maps along with cluster trees were generated. The 5' codons are in rows and the 3' codons are in columns of a 64 x 64 contingency table (Table 6). Green represents the most frequent contexts, i.e. positive residual values, and red represents the least frequent contexts, i.e. negative residual values. The highest and lowest numbers of codon contexts are lower in S. ruber than in C. salexigens and R. etli (Figures 5a, 5b and 5c). Hierarchical clustering of codon context data based on single linkage highlights discrete groups of good and bad codon contexts. Figures 5a, 5b and 5c show that most codons do not fall into any cluster, which indicates that preferences or rejections of codons are defined on a one-to-one basis. Further, the species-specific codon context maps indicate that each species has its own set of codon context rules and that there are no clearly distinguishable common features among these species.
The distributions of the adjusted residuals from the codon context maps of S. ruber, C. salexigens and R. etli (Figures 6a, 6b and 6c) show that 57.75, 49.78 and 52.28 percent of the residuals, respectively, fall within the non-significant -5 to +5 interval, indicating that a very large number of codon combinations do not contribute significantly to the rejection of independence. This is in accordance with the clustering result for the codon context described above.
The frequencies of codon contexts in each bacterial species were analysed and found to be variable. The frequent and rare codon contexts of each species are listed in Table 7. In S. ruber, GAC-GAG was the most abundant (2848) and GUA-AUC the rarest (37), whereas in C. salexigens CUG-GCC was the most frequent (3611) and AUC-GUA the rarest (50). In R. etli, GCC-GGC was the most frequent (7559) and UAC-CGU the rarest (122).

Conclusion
Codon usage bias is the parameter that delineates the differences in the occurrence of synonymous codons in genomic coding sequences. It was calculated for all coding sequences of the three bacteria. The analysis showed that S. ruber and C. salexigens follow a remarkably similar pattern of codon usage bias, whereas R. etli varies in a noticeably different manner. Although these bacteria are similar in their overall codon bias pattern, some prominent differences are also seen. These differences are due to mutation and genetic drift as well as translational selection acting on the coding sequences. Selection favours the preferred codons over the non-preferred ones. Nevertheless, the existence of non-preferred or non-optimal codons is due to the action of mutation and genetic drift.
In this study, the most frequent codons were found to end with G or C at the third codon position, with a greater preference for C in all three species. This finding may be the result of the compositional constraint acting on the codon usage pattern in these bacteria. This comparative study will be useful for understanding the pattern of codon usage in these bacterial species.

Table 6: RSCU for the highly and lowly expressed genes highlighting translational optimal codons in the three bacteria.
*Codons whose occurrences are significantly higher (P < 0.01) in the genes on the extreme left side of axis 1 than in the genes on the extreme right of the first major axis. AA: amino acid; N: number of codons; 1: genes on extreme left of axis 1; 2: genes on extreme right of axis 1.