Manjeet Kumar and Petety V. Balajia*
Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
Received Date: July 01, 2014; Accepted Date: July 22, 2014; Published Date: July 31, 2014
Citation: Kumar M, Balajia PV (2014) Diversity, Abundance and Distribution of O-linked Glycosylation Pathway Enzymes in Prokaryotes-A Comparative Genomics Study. J Glycomics Lipidomics 4:117. doi: 10.4172/2153-0637.1000117
Copyright: © 2014 Kumar M, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In prokaryotes, the protein N- and O-glycosylation pathways (GlyPW) have been experimentally characterised
in only some organisms. Identifying GlyPWs in other prokaryotes is essential to understand the role of glycosylation.
Herein we report a BLASTp and a hidden Markov model (HMM)-profile based comparative genomics approach to identify
putative O-glycosylation enzymes in completely sequenced prokaryotic genomes using the experimentally characterized
O-GlyPW enzymes as query sequences. Homologs for enzymes of all five categories viz., initiation, modification,
extension, flippase and oligosaccharyltransferase are found in 128 organisms and no homolog is found for any of these
in 52 organisms. A large number of organisms have homologs for all categories except oligosaccharyltransferases, which
show high sequence diversity. Thus, O-GlyPW enzyme homologs are widely prevalent. Most of the 128 organisms are
proteobacteria and more than half are pathogenic. The pattern of distribution of homologs indicates species- and strain-specific
variations and acquisition of homologs by horizontal gene transfer.
DATDH: 2,4-Diacetamido-2,4,6-trideoxy-hexose; diNAcBac: N,N′-Diacetamido bacillosamine (i.e., 2,4-diacetamido-2,4,6-trideoxyglucose); GalT: Galactosyltransferase; GATDH: 4-Glyceramido-2-acetamido-2,4,6-trideoxy-hexosamine; GlcT: Glucosyltransferase; GT: Glycosyltransferase; HMM: Hidden Markov model; LPS: Lipopolysaccharide; MSA: Multiple Sequence Alignment; ORF: Open Reading Frame; OT/OTase: Oligosaccharyltransferase; pgl: Protein glycosylation locus in Campylobacter jejuni and Pilin glycosylation locus in Neisseria
The pathways for the glycosylation of proteins have been characterized in some prokaryotes. These include the O-glycosylation pathways of Neisseria [1-5], Helicobacter pylori, Pseudomonas aeruginosa, Bacteroides fragilis and Acinetobacter baumannii, and the N-glycosylation pathways of Campylobacter jejuni [10-12], Haloferax volcanii and Methanococcus voltae. In the genus Neisseria, the O-glycosylation pathway (Figure S1) has been delineated in the species gonorrhoeae [1,5], lactamica and meningitidis [2,3]. The enzymes involved in these pathways have been characterized to various extents [1-4,15-19]. For example, in Neisseria meningitidis, PglE has been shown to be a β1,4-GalT and pglE has been shown to be responsible for phase variation between tri- and disaccharide structures. In Neisseria gonorrhoeae, enzyme activities, substrate specificities and steady-state kinetic parameters have been determined. Functional characterization of PglL from Neisseria meningitidis and PilO from Pseudomonas aeruginosa has shown that both these enzymes have relaxed glycan specificity and that they require the glycan to be translocated to the periplasm. PilO prefers short oligosaccharides whereas the range of glycans that PglL can transfer is structurally more diverse. In N. gonorrhoeae and N. meningitidis, the protein O-glycosylation enzymes are clustered and form the pilin glycosylation locus. Pgl polymorphism, phase variability and competition among the enzymes for a common substrate may lead to glycoforms [3,17,20], i.e., variants of a glycoprotein which differ from each other only in the nature of the attached glycan. For example, strains which possess NsPglB1 have 2,4-diacetamido-2,4,6-trideoxy-glucose at the reducing end of the glycans; in contrast, strains which possess its variant allele NsPglB2 have 4-glyceramido-2-acetamido-2,4,6-trideoxy-hexosamine.
Enzymes of the prokaryotic O-glycosylation pathways can be grouped into five categories (Figure S1 and Table S1). Category-I includes the initiation enzymes which catalyse the transfer of a saccharide to a lipid molecule. This forms the first step in the assembly of glycans on a lipid-linked carrier. The N-terminal domains of the enzymes NsPglB and NsPglB2 are examples of this category of enzymes. Category-II includes modification enzymes which catalyse the modification of simple saccharides. Examples include the enzymes involved in the biosynthesis of DATDH. These are NsPglD (dehydratase), NsPglC (aminotransferase) and the C-terminal domains of NgPglB and NsPglB2. Category-III includes extension enzymes. These are glycosyltransferases (GTs) which catalyse the transfer of a saccharide from a nucleotide sugar donor substrate to acceptors in different linkages. These enzymes are responsible for the extension and elaboration of the lipid-linked glycan. The enzymes NsPglA (α-1,3-GalT), NsPglH (α-1,3-GlcT) and NsPglE (β-1,4-GalT) are a few examples. Category-IV includes flippases which flip the pre-assembled glycan from the cytosolic side to the periplasmic side. These enzymes can flip the lipid-linked glycan containing 1, 2 or 3 saccharide moieties (Figure S1). Category-V includes oligosaccharyltransferases (OTs) which transfer the pre-assembled glycan from a lipid-linked carrier to the acceptor protein. Minimally, an organism requires at least one initiator enzyme (Category-I), a flippase (Category-IV) and an OT (Category-V) for O-glycosylation. Enzymes belonging to Category-II and -III determine the final structure of the glycan.
1 In Neisseria, pgl denotes pilin glycosylation locus and contains enzymes of the O-glycosylation pathway. In Campylobacter jejuni, pgl denotes protein glycosylation locus and contains enzymes of the N-glycosylation pathway. The enzymes that constitute these pathways are denoted by the letters of the alphabet e.g., PglA, PglB, and so on. However, enzymes sharing the same name have different functions in the two pathways e.g., PglC of Campylobacter jejuni is a galactosyltransferase whereas PglC of Neisseria is an acetyltransferase. Hence, in this study, 2 or 3 letter prefixes denoting the genus and species names of organisms are added to the names of proteins (Table S1).
The identification of enzymes and the characterization of their substrate specificities are critical to delineate the glycosylation pathways in various prokaryotes. These also help in understanding the role of glycans in processes such as virulence and pathogenesis. GTs are potential drug targets. In addition, their promiscuous substrate specificity in response to variations in the assay conditions is advantageous for in vitro glycan synthesis [23,24]. Experimental approaches for the identification of new GTs include the use of probes derived from the sequences of hitherto characterized GTs and screening cell lysates for activity. The main disadvantage of such approaches is that they are very time-consuming. Computational approaches can help to reduce the time by narrowing down the possible candidate ORFs. Such an approach has indeed been used to identify putative eukaryotic, prokaryotic and archaeal GTs, followed by experimental characterization in a few cases. In view of this, the present study was initiated with the objective of identifying the homologs of the enzymes involved in O-glycosylation pathways using a bioinformatics-based comparative genomics approach. The amino acid sequences of the ORFs have been used as queries for all the database searches.
Enzymes of the O-glycosylation pathway have been characterized from several organisms (Table S1). The amino acid sequences of these enzymes were used as queries for searching their homologs. The proteomes of 865 completely sequenced bacterial genomes constituted the target dataset (Table S2). This dataset is the same as that used for searching the homologs of enzymes that are part of the Campylobacter jejuni N-glycosylation pathway. This dataset was used as such to facilitate comparison of the results from the present study with those obtained for the N-glycosylation pathway. For each organism, its Taxonomy ID was used to fetch the following information from the NCBI database: superkingdom, group, genome size, GC content, Gram status, motility, oxygen requirement, habitat, temperature range and pathogenicity.
The search strategy is depicted in Figure S2. Essentially, the target database was first searched by BLAST and hits with E-value <1.0 were selected. This was followed by the identification of hits with high query and subject coverages, i.e., the extent to which the alignment covers the query/subject sequences. Hits were combined if the query sequences shared ≥ 75% sequence identity. A multiple sequence alignment (MSA) of these hits was used to generate a hidden Markov model (HMM) profile using the software HMMER (http://hmmer.janelia.org). The dataset of 865 proteomes was re-searched using this HMM profile. Both BLAST and HMMER were installed and run locally. Default values were used for all the parameters except that BLOSUM62 was used as the scoring matrix and the composition-based score adjustment was set to True. The E-value cut-off was set to 0.1 for HMMER. Multiple sequence alignment of the chosen BLAST hits was performed using T-Coffee with default values for all the parameters.
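The first filtering stage of this strategy can be illustrated with a minimal Python sketch. It assumes that each BLASTp alignment is available as a record carrying its E-value, alignment coordinates and sequence lengths; the record layout, the field names and the example hits are hypothetical and are not taken from the study. Coverage is computed here from the span of a single alignment, which is a simplifying assumption. The E-value cut-off of 1.0 is the one stated in the text; the 80% coverage threshold is the one quoted for most enzymes in the Results.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One BLASTp alignment (hypothetical layout; coordinates are 1-based, inclusive)."""
    subject_id: str
    evalue: float
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence spanned by the alignment."""
    return (end - start + 1) / length


def select_hits(hits, max_evalue=1.0, min_cov=0.8):
    """Keep hits with E-value below the cut-off and high query AND subject coverage."""
    kept = []
    for h in hits:
        q_cov = coverage(h.q_start, h.q_end, h.q_len)
        s_cov = coverage(h.s_start, h.s_end, h.s_len)
        if h.evalue < max_evalue and q_cov > min_cov and s_cov > min_cov:
            kept.append(h)
    return kept


# Illustrative records only; these are not hits from the study.
example_hits = [
    BlastHit("hit_A", 1e-20, 1, 95, 100, 5, 100, 110),   # passes all filters
    BlastHit("hit_B", 2.0, 1, 95, 100, 1, 95, 100),      # E-value too high
    BlastHit("hit_C", 1e-5, 1, 50, 100, 1, 50, 300),     # coverage too low
]
selected = select_hits(example_hits)
print([h.subject_id for h in selected])
```

For the oligosaccharyltransferases, where only query coverage was used, the same function could be adapted by dropping the subject-coverage condition.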
Analysis of BLAST hits. Searching the dataset of 865 proteomes using BLASTp gave a large number of hits for most of the enzymes (Table 1). The number of hits obtained for different enzymes within a category is variable. Hits for Category-I enzymes varied between 764 and 1476; variations in the number of hits are much higher for Category-II, -III and -IV. Very few hits are obtained for Category-V enzymes. Within each category, not all hits are unique. This is because many of the hits share sequence similarity with more than one query enzyme. Query coverage is also important, in addition to E-value, to establish sequence homology. Hence, query coverages of the hits were plotted against their respective E-values (Figure S3). In addition, cumulative frequencies were plotted to visualize the distribution of E-values. It is seen that nearly 75% of hits in Category-I have E-value <10⁻¹⁰. Moreover, the query coverage is >0.8 for 6,515 hits. This indicates that the sequence homologs of Category-I enzymes have diverged less. In Category-II, 939 hits have E-value <10⁻¹⁰ and query coverage >0.8. Only a small fraction of hits for enzymes of Category-III, -IV and -V have E-value <10⁻¹⁰.
|Protein||Number of BLASTp hits||Number of BLASTp hits chosen for MSA||QS Coverage threshold||Number of hits pooled for MSA§||Number of HMM
|Highest E-value (HMM search)|
† NA denotes not applicable
§Hits were grouped together when query sequence identity is ≥ 75%
¶BLAST hits with query coverage ≥ 70% were selected directly for final analysis and HMM profiles were not generated
#Only query coverage was taken in these cases
Table 1: Number of hits obtained from BLAST and HMM searches †
Identification of homologs from HMM profiles. The distribution of E-values and the extent of query coverages of BLAST hits (Figure S3) suggest that there can be many false positives vis-à-vis molecular function. It is not possible to ascertain the exact number of false positives without experimental data. Hence, a more stringent strategy was used to identify homologs so that false positives are fewer (Figure S2). Essentially, BLAST hits with very high query coverages and low E-values were chosen to generate HMM profiles. Specifically, the following steps were followed:
(i) Hits with high (>80%) query and subject coverages were selected. In the case of NmPglH, NmPglE, SeWzx, PaWzx and AaWaaL, a lower cut-off for query coverage had to be used since very few hits have higher coverages (Table 1). In Category-V, very few hits had high subject coverages, suggesting that the hits are much longer in primary sequence than the query. Hence, only query coverage was used as the cut-off criterion in this case. High query and subject coverage criteria left very few hits for further analysis in the case of NsPglF, PaPilO, PaWaaL, HpWaaL, PgWaaL and HpWaaL-G. In these cases, BLAST hits having query coverage of ≥ 70% were selected as final hits for analysis.
(ii) Within each category, hits were combined when query sequences shared ≥ 75% sequence identity. For example, in Category-I, hits for the enzymes EcWecA, KpWecA and YeWecA were combined together.
(iii) MSAs of the hits chosen as above were obtained by considering the entire sequence and HMM profiles were generated.
(iv) The dataset of 865 proteomes was re-searched using these HMM profiles and the hits that have HMM profile coverage ≥ 90% were selected for further analysis. These are taken to be the sequence homologs of the query enzymes (Table 1).
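Step (ii) above, pooling the hits of query enzymes that share ≥ 75% sequence identity, can be sketched as a small grouping routine. The sketch below is only illustrative: it assumes that pairwise percent identities between query sequences have already been computed, and it links queries transitively (single linkage via union-find), which is an assumption since the text does not say how chains of similar queries were handled. The identity values in the example are made up for illustration; only the enzyme names come from the text.

```python
def group_queries(names, identity, threshold=75.0):
    """Group query enzymes whose pairwise identity is >= threshold (single linkage).

    names    : list of query enzyme names
    identity : dict mapping (name_a, name_b) -> percent identity (assumed precomputed)
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical identities; the values are illustrative, not measured.
queries = ["EcWecA", "KpWecA", "YeWecA", "NsPglB"]
pairwise = {
    ("EcWecA", "KpWecA"): 85.0,
    ("KpWecA", "YeWecA"): 80.0,
    ("EcWecA", "NsPglB"): 20.0,
}
print(group_queries(queries, pairwise))
```

The BLAST hits of every enzyme in a group would then be pooled into one MSA, from which a single HMM profile is built.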
Setting a high stringency cut-off of at least 90% HMM profile coverage meant a substantial reduction in the number of hits (Table 1). The E-values for the hits satisfying the 90% HMM profile coverage are very low except in nine cases: the highest E-value in these cases lies between 1.0×10⁻⁴ and 0.1 (Table 1). Plots of cumulative frequencies of E-values for such hits showed that, even in these cases, most of the hits have E-values <10⁻¹⁰ (Figure S4). Thus, choosing only hits with low E-value and high alignment coverage ensured that the hits are likely to be functional homologs also. The final hits for further analysis were obtained by combining the hits of all enzymes from each category (Table 2).
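The two quantities discussed here, HMM profile coverage and the cumulative frequency of E-values, can be made concrete with a short sketch. It assumes that each HMM hit is available as a tuple of E-value, first and last aligned profile positions, and profile length; this tuple layout and the example values are hypothetical. The 90% coverage threshold and the 10⁻¹⁰ and 0.1 E-value levels are the ones mentioned in the text.

```python
def profile_coverage(hmm_from: int, hmm_to: int, profile_len: int) -> float:
    """Fraction of the HMM profile covered by the alignment (1-based, inclusive)."""
    return (hmm_to - hmm_from + 1) / profile_len


def cumulative_fraction(evalues, thresholds):
    """For each threshold, the fraction of hits whose E-value lies below it."""
    if not evalues:
        raise ValueError("no E-values supplied")
    n = len(evalues)
    return {t: sum(e < t for e in evalues) / n for t in thresholds}


# Hypothetical hits: (E-value, hmm_from, hmm_to, profile_len)
hmm_hits = [
    (1e-50, 1, 200, 200),
    (1e-12, 5, 190, 200),
    (1e-3, 10, 195, 200),
    (1e-30, 1, 100, 200),   # covers only half of the profile, so it is dropped
]

kept = [e for e, start, end, length in hmm_hits
        if profile_coverage(start, end, length) >= 0.9]
print(cumulative_fraction(kept, [1e-10, 0.1]))
```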
|Category||Name||Total number of HMM hits§||Number of unique hits||Number of organisms|
¶The number of organisms from which the indicated number of unique HMM hits
§These are the hits that have ≥90% HMM profile coverage (Table 1)
Table 2: Total number of HMM hits, unique hits chosen for further analysis and the number of organisms to which these hits belong.
Analysis of HMM hits. Every HMM hit is unique only in the case of Category-II. This is not surprising since enzymes belonging to this category have different molecular functions viz., dehydratase, acetyltransferase and aminotransferase (Table S1). In Category-I, -III and -IV, only a subset of hits is unique, indicating that many hits align with more than one HMM in that category, albeit with different E-values (Table 2). The highest number of hits for a given category is obtained for extension enzymes (Category-III). This may be a reflection of the diversity of the glycan structures. Alternatively, some of these enzymes may be part of other glycan biosynthesis pathways e.g., LPS and capsular polysaccharides.
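The distinction between total and unique hits within a category can be illustrated as follows. The sketch assumes that the hits of each HMM profile are available as a collection of protein identifiers; the profile names and identifiers in the example are hypothetical placeholders, not data from Table 2.

```python
from collections import Counter


def summarize_category(hits_by_profile):
    """Return (total hits, unique hits, hits shared by more than one profile).

    hits_by_profile maps an HMM profile name to the protein identifiers it hit.
    """
    per_profile = {name: set(ids) for name, ids in hits_by_profile.items()}
    total = sum(len(ids) for ids in per_profile.values())
    counts = Counter(pid for ids in per_profile.values() for pid in ids)
    unique = len(counts)
    shared = sum(1 for c in counts.values() if c > 1)
    return total, unique, shared


# Hypothetical category with two profiles that share one hit.
example = {
    "profile_1": ["prot_1", "prot_2", "prot_3"],
    "profile_2": ["prot_3", "prot_4"],
}
print(summarize_category(example))
```

When every hit is unique, as reported for Category-II, the total and unique counts coincide and the shared count is zero.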
Very few hits are obtained for Category-V and most of them are unique i.e., most of the hits share sequence similarity with only one enzyme in this category (Table 2). Comparison of the amino acid sequences of the experimentally characterized OTs in Category-V (Table S1) showed that these enzymes are highly divergent. Statistically significant sequence similarity coupled with adequate query coverage can be observed in only two cases:
(i) Moderate similarity (alignment scores between 29.3 and 47.8 bits; E-values between 3.0×10⁻⁴ and 5.0×10⁻¹⁰) is shared by a part (residues 185-365) of AaWaaL with the OTs from H. pylori and N. meningitidis.
(ii) The H. pylori enzymes HpWaaL and HpWaaL-G show very high similarity with each other. In contrast, the two P. aeruginosa enzymes PaPilO and PaWaaL have no detectable similarity with each other; the same is true of the two N. meningitidis enzymes NmOTase and NmPglL. A similar observation viz., proteins performing the same molecular function despite the absence of sequence similarity, was made in the case of two OTs involved in the N-glycosylation of prokaryotic proteins. These are Campylobacter jejuni PglB and Pyrococcus furiosus OT.
Organisms that have homologs for all the enzyme categories and for none. The dataset used in this study has proteomes of 865 organisms. Of these, 128 have at least one homolog for all the five enzyme categories. All these 128 organisms belong to the superkingdom Bacteria and are represented by different groups (Table 3). However, a majority are from proteobacteria. The percentage of organisms of a group that have homologs for all enzyme categories is highest for Betaproteobacteria and this may be because all the Category-II and -III enzymes are from Neisseria, a betaproteobacterium (Table S1). These 128 organisms are quite diverse in terms of their habitat, motility and pathogenicity (Table 4). Out of the 128, 70 are pathogens with representation from Alphaproteobacteria, Bacteroidetes/Chlorobi, Betaproteobacteria, Firmicutes, Gammaproteobacteria and Spirochaetes. Also, 54 of these organisms are pathogenic in humans/animals. The temperature range is known for 116 of these organisms; of these, 111 are mesophiles. The organisms belong to different habitats (39 multiple habitats, 46 host-associated) and also differ in their oxygen requirements (29 facultative, 14 anaerobic, 61 aerobic). The genome size is ≥ 5 Mb for nearly half (62 out of 128) of them.
|Group||Number of organisms in the dataset||Pathways¶|
|Number of organisms vis-à-vis O-glycosylation pathway||Number of organisms vis-à-vis N-glycosylation pathway|
|Homologs for all enzymes||No homolog for any enzyme||Homologs for all enzymes||No homolog for any enzyme|
|Chlamydiae / Verrucomicrobia||13||0||3||0||0|
¶Data for homologs of the N-glycosylation pathway is from Ref. 
Table 3: Number of organisms in each group that have / do not have homologs for enzymes of the O- and N-glycosylation pathways.
The Betaproteobacteria group has the highest number (51/128) of organisms that have homologs for all five enzyme categories. The habitat of these organisms is also diverse: 14 are multiple habitat, 18 are host-associated and 8 are terrestrial. The GC content of these organisms varies from 48 to 69%. A majority of these organisms are motile. A substantial number (47/128) of Gammaproteobacteria also have homologs for enzymes of all five categories. There is a significant variation in the genome size (1.9-6.6 Mb) and GC content (32.2-66.6%) of these organisms, which indicates the absence of any correlation between genome size or GC content and O-glycosylation of proteins.
The motility status is known for 90 of the 128 organisms; the vast majority (68 out of 90) are motile. It has been shown in some organisms that flagella are O-glycosylated and that this is important for their assembly [35,36]. In Pseudomonas syringae, it has been suggested that the absence of glycosylation destabilizes the filament structure of flagella and affects the swimming activity of mutants. In addition, in Pseudomonas aeruginosa, it has been suggested that the glycosylation of the flagellum and motility can play a crucial role in flagellum-mediated virulence. Nearly half of the organisms (from among the 128) that are motile are also pathogenic (Table 4). Thus, it can be inferred that these pathogenic organisms glycosylate flagellar proteins besides several other virulence factors. With respect to other features of these 128 organisms, it is observed that a large number are Gram negative, mesophilic and have >50% GC content. Six organisms viz., Clavibacter michiganensis subsp. sepedonicus, Verminephrobacter eiseniae EF01-2, Yersinia pseudotuberculosis YPIII, Actinobacillus pleuropneumoniae L20, Candidatus Ruthia magnifica str. Cm (Calyptogena magnifica) and Dichelobacter nodosus VCS1703A have at least one homolog each for initiator (Category-I), flippase (Category-IV) and OT (Category-V) enzymes. Among these, Clavibacter michiganensis is Gram positive and belongs to the group Actinobacteria. Verminephrobacter eiseniae belongs to Betaproteobacteria and the remaining four are Gammaproteobacteria. These organisms live in mesophilic temperatures and have multiple/host-associated habitats.
|Taxid||Organism name||Gram status||Motile||Habitat§||Temp. range†||Pathogenic|
Table 4: Some characteristics of organisms that have at least one homolog for each category of O-glycosylation pathway enzymes¶.
As mentioned earlier, an organism should minimally have an initiator enzyme, a flippase and an OT to O-glycosylate proteins. A large number of organisms did not have homologs of all three of these enzymes. In most of the cases, OT is the missing enzyme (Table S3). These organisms probably do have OTs but these have escaped detection in this study because of the high sequence divergence of OTs, as mentioned earlier.
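The minimal requirement stated here lends itself to a simple presence/absence check. The sketch below assumes that, for each organism, the set of enzyme categories with at least one homolog is already known; the function names and the category labels used as strings are illustrative choices, not part of the study.

```python
CATEGORIES = ("I", "II", "III", "IV", "V")
# Initiator (Category-I), flippase (Category-IV) and OT (Category-V)
MINIMAL = frozenset({"I", "IV", "V"})


def classify(present):
    """Classify an organism by the enzyme categories for which it has homologs."""
    found = set(present) & set(CATEGORIES)
    if not found:
        return "none"
    if found == set(CATEGORIES):
        return "all five"
    if MINIMAL <= found:
        return "minimal set"
    return "incomplete"


def missing_from_minimal(present):
    """List the minimally required categories that have no homolog."""
    return sorted(MINIMAL - set(present))


print(classify({"I", "IV", "V"}))                      # minimal set
print(classify({"I", "II", "III", "IV"}))              # incomplete
print(missing_from_minimal({"I", "II", "III", "IV"}))  # ['V']
```

An organism reported as "incomplete" with only Category-V missing corresponds to the most common case described in the text, where the OT is the enzyme that escaped detection.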
Fifty-two organisms do not have homologs for even a single enzyme of any of the five categories. These organisms also belong to diverse habitats. Their temperature range is mostly mesophilic and they are from different subgroups (Table 5). These organisms have varied morphology. Among different groups, Chlamydiae and Crenarchaeota do not have homologs for any of the five enzyme categories. Out of 52 organisms which do not have homologs for even a single enzyme category, 41 are host-associated and 30 are pathogenic. Comparative genomics studies have shown that large scale genome deletions are characteristic of host-associated organisms/symbionts [39,40].
|Taxid||Organism name||Gram status||Motility||Habitat§||Temp. range†||Pathogenic|
Table 5: Some characteristics of organisms that do not have a homolog for any of the O-glycosylation pathway enzymes¶.
The genome size of 25 out of the 52 organisms which lack homologs for any of the five enzyme categories is ≤ 1.0 Mb. The small size of these genomes can be a reason for the absence of homologs for any of the five enzyme categories. Among the organisms which have at least one homolog for each enzyme category, 122 have a genome size of ≥ 2 Mb. Also, no correlation was found between the presence/absence of homologs and GC content (Figure S5).
Organisms that have both O- and N-linked glycosylation. Organisms that have homologs of the enzymes of the N-glycosylation pathway of Campylobacter jejuni have been identified in an earlier study. It is seen that the maximum number of organisms that have homologs for all enzymes of the N-glycosylation pathway belong to the group Epsilonproteobacteria (Table 3) and the query enzymes are from C. jejuni, an epsilonproteobacterium. This scenario is similar to that observed for the O-glycosylation pathway i.e., all Category-II and -III query enzymes are from Neisseria, a betaproteobacterium. This is suggestive of the inherent sequence divergence of the glycosylation pathway enzymes.
It is found that Roseiflexus castenholzii and Desulfovibrio desulfuricans have homologs for both N- and O-glycosylation pathway enzymes. R. castenholzii belongs to the group Chloroflexi whereas D. desulfuricans is a Deltaproteobacteria. These two organisms differ from each other in their habitat, oxygen requirements and temperature range. Despite these differences, they both seem to have N- and O-linked glycosylation pathway enzymes. Glycosylation is known to play a role in the stabilization of the folded form of proteins  and this can be a possible role for glycosylation of proteins in R. castenholzii, a thermophile.
Desulfovibrio desulfuricans shows the potential of being pathogenic since it has been found that it can cause bacteremia in an immunocompetent man. These two organisms can be good model systems to study the effects of glycosylation and for exploitation as microbial factories for glycosylating heterologous proteins.
Species and strain-specific variations in the presence of homologs. Analysis of the presence of homologs for enzymes of different categories in different species of a genus did not show much variation, especially when Category-V is excluded (Table 6). Homologs of OTs are present in only a few species in genera such as Pseudomonas and Leptospira and in none of the species of Escherichia and Thermotoga. This probably is due to the high sequence divergence observed among enzymes of Category-V, as mentioned earlier. All the species of Helicobacter and all but one of the species of Yersinia (from among those present in the dataset) lack homologs of flippases (Category-IV). Since these organisms have homologs of OTs, it is possible that either an alternative flippase is present (non-orthologous gene displacement) or it has substantially diverged from the sequences used as query (Table S1). Two species of Francisella lack homologs for Category-V enzymes. The absence of a homolog in Francisella philomiragia represents a species-specific loss. In Francisella tularensis subsp. holarctica, the loss is strain-specific as other strains do have homologs of OTs. One species in the genus Ralstonia viz., Ralstonia metallidurans does not have homologs for any of the five enzyme categories. This organism has a specialized habitat and it is not clear if the absence of homologs is in any way related to its habitat. Organisms belonging to Yersinia have homologs for enzymes of Category-I, -II and -V. In addition, only Yersinia enterocolitica and Yersinia pseudotuberculosis have homologs for Category-III enzymes and only Yersinia pseudotuberculosis has homologs for Category-IV enzymes. As no other strain of Yersinia pseudotuberculosis contains homologs for Category-IV enzymes, this strain seems to have acquired these genes by horizontal gene transfer. This surmise is strengthened by GC content: the GC content of the homolog is ~31% whereas the GC content of the rest of the genome is ~48%. The GC contents of two homologs of Category-III enzymes in Y. enterocolitica and Y. pseudotuberculosis are 32 and 30%, respectively, suggesting the possibility of horizontal gene transfer in these cases also. In Acinetobacter baumannii, one strain has homologs for all five enzyme categories whereas a few others have homologs for only three or four enzyme categories. This is suggestive of strain-specific variability as observed in Neisseria.
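The GC-content argument for horizontal gene transfer rests on a simple calculation: the percentage of G and C bases in a gene compared with that of the genome. A minimal sketch is given below; the function names and the toy sequence are hypothetical, ambiguous bases such as N are ignored by assumption, and no particular deviation threshold is implied by the study.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def gc_deviation(gene_seq: str, genome_gc_percent: float) -> float:
    """Difference (in percentage points) between gene GC content and genome GC content."""
    return gc_content(gene_seq) - genome_gc_percent


# Toy sequence with 30% GC compared against a genome GC content of 48%.
toy_gene = "AAAGGTTTCA"
print(gc_content(toy_gene))
print(gc_deviation(toy_gene, 48.0))
```

A strongly negative deviation, such as the roughly 17 percentage points between ~31% and ~48% quoted above, is the kind of signal the text takes as suggestive of horizontal acquisition.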
Overall, a few genera had homologs for all enzyme categories whereas homologs for a few categories were absent in other genera (Table 6). This may be due to the local needs/habitat of that particular organism. The non-uniform occurrence of homologs for different categories across different genera as well as within the same genus hints at heterogeneity in the glycans synthesized by these organisms. This type of heterogeneity is likely to be present in different organisms of a species also. In one related study, it was established that glycan structures with different chain lengths are present in the genus Campylobacter when grouped on the basis of thermotolerance. The variation of homologs in different organisms is not surprising since, even among Neisseria, species- and strain-specific polymorphisms have been reported [20,45].
|Genus||Number of organisms||Number of organisms with homolog for at least one enzyme in the five categories|
Table 6: Variations in the presence of homologs for O-glycosylation pathway enzymes in different genera.
Distribution of different enzyme categories among the organisms. OT is critical for glycosylation and the existence of its homologs in an organism strengthens the prediction that O-glycosylation occurs in that organism. Homologs for OT were found in 168 organisms (Table 2). A few of these have more than one homolog. Most of these 168 organisms are proteobacteria; others include Actinobacteria, Bacteroidetes/Chlorobi, Chloroflexi, Cyanobacteria, Firmicutes and Spirochaetes. Twenty-one organisms have at least one homolog for all enzyme categories except Category-IV (Table S4). It can be surmised that a divergent class of flippases is involved in transferring the oligosaccharide across the membrane in these organisms. Some organisms belonging to the Betaproteobacteria and Gammaproteobacteria groups are missing homologs for extension enzymes (Category-III). This is suggestive of a glycan containing only a monosaccharide. The actinobacterium Clavibacter michiganensis subsp. sepedonicus lacks homologs for Category-II enzymes, which indicates variability in the glycan structure. Homologs of extension enzymes (Category-III) are present in most of the organisms. The numbers of organisms having homologs for the initiator and modification enzyme categories are almost equal, with slightly more having modification enzymes. A substantial number of organisms have homologs for flippases.
Antibiotic resistant organisms having homologs for all enzyme categories. There are 85 organisms in the dataset that are tagged as antibiotic resistant by the Centers for Disease Control and Prevention, Atlanta (www.cdc.gov/drugresistance/DiseasesConnectedAR.html#1). Nine of these have homologs for all five enzyme categories and this hints at the existence of an O-glycosylation pathway. The genomes of all of these organisms are >2 Mb with 39-57% GC content. The habitat is either host-associated or multiple. All are mesophiles and live in aerobic environments. Recently, the antibiotic-resistant Acinetobacter baumannii ATCC 17978 has been reported to have the O-glycosylation pathway. The present study also shows that this organism has homologs for all five enzyme categories and hence can potentially glycosylate proteins.
Distribution of organisms in the phylogenetic tree. 16S rRNA based phylogenetic analysis shows that the organisms which have homologs for all five enzyme categories are scattered in the phylogenetic tree, as are those that do not have homologs for any of the five enzyme categories (Figure 1). Organisms having homologs for all categories and for none of the categories are clustered in only a few branches. Variations in the occurrence of homologs belonging to different categories are observed among closely related organisms in certain subtrees (Figure 2). For example, in the Bradyrhizobium subtree, all but two of the five organisms have homologs for all enzyme categories (Figure 2A). These two organisms viz. Rhodopseudomonas palustris and Oligotropha carboxidovorans have homologs for all enzyme categories except Category-V and Categories-IV and -V, respectively. In the subtree containing some Betaproteobacteria, Ralstonia metallidurans and Ralstonia eutropha have homologs for all enzyme categories but their immediate neighbour Cupriavidus taiwanensis does not have a homolog for any enzyme category (Figure 2B). In this subtree, Ralstonia solanacearum has homologs for all enzymes except flippases. Polynucleobacter necessarius lacks homologs for Category-III and -V. The presence/absence of all homologs and variations in the number of homologs represent significant diversity among the members of the subtree. All except two organisms in the subtree containing Diaphorobacter sp. and Leptothrix cholodnii have homologs for all enzyme categories (Figure 2C). These two organisms are Methylibium petroleiphilum and Polaromonas naphthalenivorans, which lack homologs for Category-V enzymes. As discussed earlier, even these organisms may have OTs and the reason for not finding the homologs may be sequence divergence.
The organisms which lack homologs for any of the five categories were also mapped in the phylogenetic tree. In one of the subtrees, most of the members are from Mycoplasma (Figure 2D). Homologs are absent in all organisms except Mycoplasma mycoides. It is intriguing that many of these organisms also lack homologs for enzymes involved in N-linked glycosylation, as reported earlier. The absence of both N- and O-linked glycosylation in these parasitic organisms suggests that these organisms have very different pathways for glycosylation or have evolved other, as yet unknown, mechanisms to serve the role played by glycosylation.
In some subtrees, one organism has homologs for enzymes of all categories whereas its neighbour does not have a homolog for enzymes of any category. For example, Geobacter uraniireducens (a deltaproteobacterium) has homologs from all five enzyme categories but its neighbour lacks homologs for only Category-V (Figure 2E and 2G). G. uraniireducens has the largest genome (5.1 Mb) among the Geobacter species that are part of this study. It is tempting to speculate that the large genome size of this organism is the reason for it having homologs for all five enzyme categories. Interestingly, it is the only Geobacter in the dataset which is microaerophilic; all others are anaerobic. Additionally, the homolog of the Category-V enzyme in G. uraniireducens has a markedly lower GC content (40%) than the genome of this organism as a whole (54%). This suggests that the gene was acquired by horizontal transfer. Also, other members of this subtree, viz. Geobacter metallireducens and Pelobacter carbinolicus, do not have homologs for even a single enzyme category.
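The GC content comparison above rests on a simple calculation: the percentage of G and C bases in a gene is compared with that of the whole genome. The following is a minimal sketch of that arithmetic, not the procedure used in this study. The function names are hypothetical, and ignoring ambiguous bases such as N is an assumption of the sketch.

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; ambiguity codes
    such as N are ignored (an assumption of this sketch).
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def gc_difference(gene_gc: float, genome_gc: float) -> float:
    """Return gene GC minus genome GC, in percentage points."""
    return gene_gc - genome_gc


if __name__ == "__main__":
    # Values quoted in the text for the Category-V homolog of
    # G. uraniireducens (40%) and its genome as a whole (54%).
    print(gc_difference(40.0, 54.0))
```

With the values quoted in the text, the gene lies 14 percentage points below the genome average. A deviation of this kind is suggestive of horizontal acquisition but is not conclusive on its own; it is usually weighed together with other evidence such as codon usage and phylogenetic incongruence.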
The variation in the number of homologs belonging to different categories in many organisms reflects the diversity of the O-glycosylation pathway, as has been demonstrated in Neisseria gonorrhoeae [5,15]. These variations can be attributed to horizontal gene transfer and selective loss of genetic material [46-48]. Moreover, a gene may exist in a phase-variable form in a few strains but not in others. Such a gene might benefit one strain as a constitutively expressed gene, whereas another strain of the same species may gain an advantage from it as a contingency gene. One such example is Haemophilus influenzae, which uses mechanisms such as homologous recombination and slipped-strand mispairing to generate high-frequency changes in the expression of genes belonging to the polysaccharide (LPS, CPS) and fimbrial categories. The understanding of the O-linked glycosylation system and its effects is likely to be more complicated, since a dynamic interplay between O-glycosylation and other post-translational modifications such as the addition of phosphoethanolamine/phosphocholine has been reported.
In summary, homologs for all five enzyme categories are found in 128 organisms. The actual number is likely to be higher, since a significant number of organisms have homologs for all categories except OTs, which are known to be highly divergent in their sequences. Besides, the criteria used to identify homologs were kept very stringent to minimise false positives. Overall, this study clearly shows that O-glycosylation pathway enzyme homologs are widely prevalent. Analyses of the pattern of distribution of homologs indicate species- and strain-specific variations in glycan structures and acquisition of O-glycosylation pathway enzyme homologs by horizontal gene transfer in certain clades.
There are several examples of proteins which share sequence similarity but show varying levels of functional similarity. In view of this, it is not possible to ascertain the exact nature of the donor and acceptor substrates used by the homologs of the different enzyme categories identified in this study. Further bioinformatics analyses, combined with experimental data, are essential to ascertain the specific functions of these enzymes. The experimental characterization of the substrate specificities, combined with the spatiotemporal pattern of expression of these genes, will lead to a better understanding of their involvement in various biological processes. The homologs identified are a good starting point for experimental characterization of their molecular functions.
Manjeet Kumar is grateful to the Council of Scientific and Industrial Research, India, for a research fellowship.