Phylogenetics & Evolutionary Biology

The Distribution of Polyhedral Bacterial Microcompartments Suggests Frequent Horizontal Transfer and Operon Reassembly

Bacterial microcompartments (BMCs) are proteinaceous organelles that carry out specific metabolic reactions. Using domain representations of the BMC shell proteins, we identified BMCs in genomes of 358 bacterial species including human gut microbes, bioremediation agents, cellulosic ethanol producers, and pathogens. Multiple BMCs of different metabolic types are present in 40% of the BMC-containing genomes. BMC genes frequently clustered at a single locus that includes enzymes related to the compartment’s metabolic function. The distribution of BMC-containing species was mapped onto a phylogenetic tree constructed from 16S rRNA. BMCs were sporadically distributed across the phylogenetic tree. All bacterial families that contained species with BMCs also had species without them. Even within a species, BMC number varied, indicative of frequent horizontal transfer and gene loss. Similarly, phylogenetic trees constructed from individual BMC genes indicated that horizontal gene transfer of the BMC loci is a common occurrence.


Introduction
Polyhedral protein microcompartments were first identified in Cyanobacteria in 1961 [1]. In Cyanobacteria and chemoautotrophs these microcompartments contain the central enzyme involved in the fixation of carbon dioxide, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) [2,3], and were hence named carboxysomes. Also associated with the carboxysome is carbonic anhydrase, which converts bicarbonate to carbon dioxide [4]. Carboxysomes are not bounded by a lipid bilayer, but instead are composed of a crystalline layer of shell proteins similar in appearance to bacterial virus coats [5]. In Synechococcus WH8102 there are approximately 250 individual RuBisCO complexes per carboxysome, organized into three to four concentric layers [6]. The carboxysomes appear to be necessary to concentrate carbon dioxide [7] in close proximity to the RuBisCO complex. Molecular transport across the shell is believed to occur via small pores at the center of each hexameric face of the shell structure [5], but the mechanisms that govern transport are not clear.
Genetic analyses resulted in the identification of orthologs of the carboxysome shell proteins in enteric bacteria and subsequently other bacterial families [8][9][10]. We will refer to these structures as bacterial microcompartments (BMCs), although they have also been called polyhedral microcompartments, metabolosomes, polyhedral bodies, and protein microcompartments. BMCs are relatively large macromolecular complexes, 100 to 150 nm in cross section, and contain metabolic enzymes both within, and possibly as integral parts of, the polyhedral shell [9,11].
In Salmonella the shell protein genes are localized within two unusually large operons, comprising 21 and 17 genes, involved in the catabolism of 1,2-propanediol (pdu-type BMC) and ethanolamine (eut-type BMC) [11,12]. Each BMC contains an adenosylcobalamin (Ado-B12) cofactor-requiring enzyme, either propanediol dehydratase or ethanolamine ammonia lyase. In addition to sharing shell structural proteins and a requirement for Ado-B12, each BMC operon also contains acetaldehyde and alcohol dehydrogenases, which lead to the production either of propionate and propanol from 1,2-propanediol (pdu BMCs) or of acetate and ethanol from ethanolamine (eut BMCs). Thus, the BMC pathways can be used as a hydrogen sink when the alcohols are produced or to provide carbon and adenosine triphosphate (ATP) through acetyl-/propionyl-coenzyme A and ultimately pyruvate. BMCs may play a role in protecting the cell from toxic aldehydes [13][14][15][16], conserving volatile metabolic intermediates [7] and/or creating separate sub-pools of the larger cofactors NAD and coenzyme A (CoA) [17].
In the human gut microbe Roseburia inulinivorans, microarray analysis led to the identification of a novel BMC-associated B12-independent propanediol dehydratase [18]. The R. inulinivorans BMC functions in the metabolism of animal host-derived fucose. A similar BMC has also been shown to exist in the forest soil-derived Clostridium phytofermentans, where it functions in the metabolism of fucose and rhamnose [19]. This type of BMC has been termed the glycyl radical prosthetic group-based (grp) BMC [20].
The carboxysome and BMC (eut, pdu and grp) shell proteins share two protein domains. One protein domain, represented by the Protein family (Pfam) database [21] model Pfam03319, is the EutN/PduN/CcmL/CsoS4/GrpN family. Proteins having this domain form pentamers which act as the vertices of the polyhedral shell structure and have thus recently been named vertex proteins [22]. The second domain is present in proteins which form cyclical hexamers that are believed to come together and form the faces of the shell [23][24][25]. This protein domain is represented by the Pfam database model Pfam00936 and is present in multiple paralogs which number between 3 and 7 genes per BMC locus [25]. For this paper we will refer to proteins with this domain as shell face proteins, although the location and orientation of all family members have not been experimentally determined. Using these two protein domains or representative proteins, new BMC shell protein genes are continuously being identified as more bacterial genomes are sequenced [20][25][26][27][28]. This comparative genome approach has led to the discovery of new types of BMCs and orphan BMCs whose metabolic role is still unknown. An additional metabolic type of BMC has been proposed: genetic, biochemical and comparative genomic analyses suggest that a BMC locus present in Rhodococcus erythropolis and Mycobacterium sp. MCS is involved in aminoalcohol or aminoketone metabolism [29,30].
The polyhedral shape of BMCs suggests similarities to phage capsids, but there is no evidence for sequence or structural homology. Thus, the origin of the BMC shell proteins remains an open question. Several BMC loci have been noted to undergo horizontal gene transfer (HGT) [23,25,31]. In this study we conduct a comprehensive analysis of the available genomes by first identifying BMC loci using the vertex protein, then mapping the distribution onto a phylogenetic tree built from the 16S rRNA gene. The diverse functions and taxonomic distribution suggest a complex evolutionary history. By looking at the distribution of BMC type and dispersal patterns across phyla, we also find that operons coding for the same function have evolved independently several times. Here, we discuss four main phyla that include the most sequenced species, Actinobacteria, Cyanobacteria, Firmicutes and Proteobacteria, to give us a more accurate view of the dispersal of BMCs than in phyla where less sampling exists.

Identifying vertex proteins in genome sequences
Predicted proteins from complete bacterial genomes were downloaded from the GenBank FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) on May 9, 2013. We searched the Pfam representation of the vertex protein (Pfam03319) against this database using RPS-BLAST [32]. A cutoff for determining significant hits to Pfam03319 was based on E-value (lower than 10^-5) and an amino acid length of ~100. All of these proteins had a member of the shell face protein family (Pfam00936) within 20 genes on the chromosome, whereas none of the hits immediately above this cutoff contained proximal genes with the Pfam00936 shell face protein domain.

Phylogenetic analysis
The vertex protein family sequences obtained above were aligned using ClustalW2 [33]. PhyML [34] was used to construct our phylogenetic trees using the Maximum Likelihood method with 100 bootstrap replicates. The phylogenetic trees were visualized and manipulated in Dendroscope [35] and FigTree (http://tree.bio.ed.ac.uk/software/figtree/), and TreeCollapseCL4 (http://emmahodcroft.com/TreeCollapseCL.html) was used to collapse tree branches based on a bootstrap cutoff. 16S rRNA sequences were extracted from the NCBI complete genome data set using custom Perl scripts. Twenty-three genomes lacked 16S rRNA sequences and 15 had 16S rRNA sequences that contained stretches of unknown nucleotides (N) or were truncated. These genomes were removed from the analysis. The remaining 16S rRNA sequences from NCBI were aligned to the master 16S rRNA alignment at the Ribosomal Database Project (RDP) website [36]. The taxonomy of each 16S rRNA gene was determined using RDP classifier [37]. Phylogenetic trees were constructed and visualized as described above.

Determining BMC type based on neighboring enzymes
To identify enzymes that can be used to determine the BMC type (e.g. carboxysome, eut, pdu, grp), protein sequences 20 positions to the left and 20 positions to the right of each vertex protein were retrieved from the PTT files available for each genome at the GenBank FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) using Perl scripts. The protein sequences were then annotated using the COG (Clusters of Orthologous Groups) and Pfam databases implemented in the Conserved Domain Database (CDD) [38]. To identify the BMC type enzymes associated with this 40-gene region, we queried using RPS-BLAST for Pfams belonging to diagnostic enzymes. For the carboxysome, we used pfam02788 and pfam00101 (RuBisCO large subunit (rbcL) and RuBisCO small subunit (rbcS)). For the eut-type BMC, we used pfam06277, pfam06751 and pfam05985 (activating enzyme (eutA), ethanolamine ammonia-lyase large subunit (eutB) and ethanolamine ammonia-lyase small subunit (eutC)). For the pdu-type BMC, we used pfam02286 (B12-dependent propanediol dehydratase large subunit (pduC)). For the grp-type BMC, we used pfam01228, pfam04055 and pfam13247 (glycyl radical domain (Gly_radical), the radical S-adenosylmethionine domain (radical_SAM) and pyruvate formate lyase (PFL)).

Taxonomic distribution of BMCs
The BMC vertex protein was chosen for identifying BMC loci and constructing the phylogenetic relationship among BMCs, because it is usually present as a single copy, whereas the shell face proteins are frequently present as multiple paralogs. BMC vertex proteins were identified in 358 of the 2458 genomes in the NCBI database (Table 1). None of the 152 genomes sequenced to date from Archaeal phyla (Euryarchaeota, Crenarchaeota, Korarchaeota and Nanoarchaeota) contain BMCs.
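The hit-selection rule described in the methods (a Pfam03319 hit with an E-value below 10^-5 on a protein of roughly 100 amino acids, checked for a Pfam00936 neighbor within 20 genes) can be illustrated with a minimal Python sketch. This is not the authors' Perl code. The `Gene` record, the 70 to 130 residue tolerance around "~100", and the reuse of the same E-value cutoff for the Pfam00936 neighbor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

EVALUE_CUTOFF = 1e-5        # from the text: E-value lower than 10^-5
NEIGHBOR_WINDOW = 20        # from the text: shell face protein within 20 genes
LENGTH_RANGE = (70, 130)    # ASSUMPTION: tolerance around "~100" amino acids


@dataclass
class Gene:
    """Hypothetical record for one predicted protein."""
    index: int                                     # gene order on the chromosome
    length_aa: int                                 # protein length in residues
    pfam_hits: dict = field(default_factory=dict)  # Pfam accession -> E-value


def is_vertex_hit(gene):
    """True if the gene passes the E-value and length criteria for Pfam03319."""
    evalue = gene.pfam_hits.get("pfam03319")
    if evalue is None or evalue >= EVALUE_CUTOFF:
        return False
    return LENGTH_RANGE[0] <= gene.length_aa <= LENGTH_RANGE[1]


def has_shell_face_neighbor(gene, genome):
    """True if a Pfam00936 protein lies within NEIGHBOR_WINDOW genes.

    Chromosome circularity is ignored in this sketch.
    """
    for other in genome:
        if other is gene:
            continue
        evalue = other.pfam_hits.get("pfam00936")
        close = abs(other.index - gene.index) <= NEIGHBOR_WINDOW
        if close and evalue is not None and evalue < EVALUE_CUTOFF:
            return True
    return False


def find_vertex_hits(genome):
    """Return every gene in the genome that passes the vertex protein criteria."""
    return [g for g in genome if is_vertex_hit(g)]
```

In the paper the neighbor check served as a validation of the cutoff rather than as a filter, so the sketch keeps `has_shell_face_neighbor` separate from `find_vertex_hits`.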
Several phyla, including the Bacteroidetes, Chlamydiae and Tenericutes, have over 50 genomes represented in the database but completely lack the signature protein for BMCs.
No phylum had vertex proteins in 100% of the genomes sequenced. In the Cyanobacteria, only Cyanobacterium UCYN-A lacks a shell vertex protein.
Cyanobacterium UCYN-A has a reduced genome and lacks many genes related to photosynthesis and the Calvin cycle, including RuBisCO [39]. An RPS-BLAST search for BMC shell proteins using the pfam00936 domain supports the loss of all shell components of the carboxysome in Cyanobacterium UCYN-A.
The broad sporadic distribution of BMCs in many phyla is illustrated in Figure 1 using the context of a phylogenetic tree based on the 16S rRNA gene (which serves as a proxy for an organismal tree). The broad phylogenetic distribution of BMCs is unusual.
With the exception of Cyanobacteria, all taxonomic groups contain BMCs in only a fraction of their members, even at the family and genus levels (Tables 2 and 3).
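The per-group tallies behind statements such as "only a fraction of the members" amount to counting, for each taxonomic group, how many genomes carry at least one vertex protein. A minimal Python sketch follows, assuming the input is a list of (group name, has vertex protein) pairs; the function name and input format are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def bmc_fraction_by_group(records):
    """Summarize BMC presence per taxonomic group.

    records: iterable of (group_name, has_vertex_protein) pairs, one per genome.
    Returns {group: (genomes_with_bmc, total_genomes, fraction)}.
    """
    totals = defaultdict(int)
    with_bmc = defaultdict(int)
    for group, has_vertex in records:
        totals[group] += 1
        if has_vertex:
            with_bmc[group] += 1
    return {
        group: (with_bmc[group], totals[group], with_bmc[group] / totals[group])
        for group in totals
    }


# Overall figure reported in the text: 358 of 2458 genomes.
overall_percent = 100 * 358 / 2458
```

The same function works at any rank (phylum, family, genus) as long as the group label is taken from that rank. For the whole data set, 358 of 2458 genomes is about 14.6%.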

Many bacterial genomes contain more than one BMC vertex protein
Many genomes contain more than one vertex protein (Table 4), which is indicative of either multiple BMC loci or paralogs within a locus. A Meliobacteria genome contains 6 instances of the vertex protein distributed across 4 separate BMC loci. While most BMC loci contain only 1 copy of the vertex protein, there are exceptions, most notably that nearly all α-carboxysomes contain 2 copies. In Cyanobacteria and other carboxysome-containing taxa there was only 1 BMC locus, while many other phyla contained multiple BMCs per genome. Forty percent of the genomes with BMC vertex proteins contained 2 or more separate BMC loci (Table 5).

Figure 1 (caption, partial): ... complete prokaryotic genomes database. The sequences from genomes that contain at least one BMC vertex protein are highlighted in red. A more detailed version of the tree including the taxon labels will be supplied upon request.
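Distinguishing "multiple BMC loci" from "paralogs within a locus" requires deciding when two vertex proteins belong to the same locus. The paper does not state its exact rule, so the Python sketch below assumes one: vertex proteins whose gene positions are within `max_gap` genes of each other are merged into a single locus. The default of 20 borrows the window size used elsewhere in the methods, and the positions in the test are hypothetical.

```python
def group_into_loci(vertex_positions, max_gap=20):
    """Cluster vertex protein gene positions into loci.

    vertex_positions: gene-order indices of vertex proteins on one chromosome.
    max_gap: ASSUMED maximum distance (in genes) between consecutive vertex
             proteins of the same locus.
    Returns a list of loci, each a sorted list of positions.
    """
    loci = []
    for pos in sorted(vertex_positions):
        if loci and pos - loci[-1][-1] <= max_gap:
            loci[-1].append(pos)    # close to the previous hit: same locus
        else:
            loci.append([pos])      # far from the previous hit: new locus
    return loci


def has_multiple_loci(vertex_positions, max_gap=20):
    """True if the genome carries 2 or more separate BMC loci."""
    return len(group_into_loci(vertex_positions, max_gap)) >= 2
```

With this rule, a genome with six vertex proteins can resolve into four loci when two of the loci each hold a pair of paralogs, which is the pattern reported for the six-copy genome.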


Phylogenetic analysis of the vertex protein
A phylogenetic tree was constructed from all vertex proteins (Figure 2). The vertex protein is short (~100 amino acids) and the bootstrap support for some groupings is weak. However, many robust patterns are evident. Cyanobacteria are present in 2 major groups corresponding to their RuBisCO type. The multiple vertex proteins present in the marine Prochlorococcus and Synechococcus and some chemoautotrophs branch deeply in the tree, suggesting ancient gene duplication. More frequently, vertex proteins belonging to the same phylum did not cluster together (Figure 2A), suggesting separate evolutionary trajectories resulting from horizontal gene transfer.
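Because bootstrap support is weak for some groupings, weakly supported branches are typically collapsed into polytomies before the topology is interpreted; the methods mention doing this with a bootstrap cutoff. The Python sketch below illustrates the idea on a toy tree structure. It is not the implementation used by TreeCollapseCL4; the `Node` class is hypothetical and branch lengths are ignored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Minimal tree node: leaves have a name, internal nodes have support."""
    name: Optional[str] = None
    support: Optional[float] = None      # bootstrap support (0 to 100)
    children: list = field(default_factory=list)


def collapse_low_support(node, cutoff):
    """Collapse internal branches with support below cutoff (in place).

    A weakly supported internal node is removed and its children are attached
    directly to its parent, producing a polytomy. The root is never removed.
    """
    new_children = []
    for child in node.children:
        collapse_low_support(child, cutoff)   # post-order: fix the subtree first
        weak = (
            child.children
            and child.support is not None
            and child.support < cutoff
        )
        if weak:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


def to_newick(node):
    """Serialize the topology only (no support values or branch lengths)."""
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"
```

After collapsing, only groupings that survive the cutoff should be read as evidence for or against monophyly of a BMC type.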

Distribution of BMC types
BMC types are distinguished by the different enzymes encapsulated within or associated with the compartment (Figure 3). The 5 previously identified types include the α-carboxysome, β-carboxysome, eut, pdu, and grp types. Because the type enzymes are nearly always associated with the vertex protein in the same operon/locus, we searched the 40 genes surrounding the vertex protein for the BMC type enzymes. The results are shown in the context of the vertex protein tree (Figure 2B) and with the taxonomic distribution (Figure 4). The α-carboxysomes and β-carboxysomes formed monophyletic groups. The β-carboxysomes consisted of only Cyanobacteria (Figure 4B). The α-carboxysomes contained equal numbers of Cyanobacteria and Proteobacteria (Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria) and a single member of the Actinobacteria (Figure 4A).
In the other BMC types the vertex protein phylogeny was inconsistent with the distribution of the type enzymes and none of these BMC types formed monophyletic groups (Figure 2B). This indicates multiple origins of each of the other BMC types, since vertex proteins belonging to loci of similar function were not most closely related to each other. In the eut-type (Figure 4C), Proteobacteria (50%), composed mostly of Gammaproteobacteria, formed the largest group, closely followed by Firmicutes (47%). There were also minor taxa (less than 5%) including Fusobacteria, Chloroflexi, Synergistetes and Actinobacteria. In the pdu-type (Figure 4E), Proteobacteria (48%), composed mostly of Gammaproteobacteria and some Deltaproteobacteria, along with Firmicutes (47%), formed the largest groups. There were also minor taxa including Actinobacteria, Fusobacteria and Synergistetes. For the grp-type BMC (Figure 4D), Firmicutes (56%) formed the largest group, closely followed by Proteobacteria (40%), composed mostly of Gammaproteobacteria and some Alphaproteobacteria and Deltaproteobacteria. The minor taxa included Acidobacteria, Actinobacteria, Chlorobi and Planctomycetes.
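The type assignment used in this section (search the 20 genes on either side of each vertex protein for diagnostic Pfam domains) can be sketched in Python as follows. The Pfam accessions are those listed in the methods. Everything else is an assumption made for illustration: the input format, the function names, and in particular the rule that every diagnostic Pfam of a type must be present in the window, since the paper does not say whether all or any were required.

```python
# Diagnostic Pfam domains per BMC type, as listed in the methods.
DIAGNOSTIC_PFAMS = {
    "carboxysome": {"pfam02788", "pfam00101"},       # rbcL, rbcS
    "eut": {"pfam06277", "pfam06751", "pfam05985"},  # eutA, eutB, eutC
    "pdu": {"pfam02286"},                            # pduC
    "grp": {"pfam01228", "pfam04055", "pfam13247"},  # Gly_radical, radical_SAM, PFL
}


def window_pfams(gene_pfams, vertex_index, flank=20):
    """Collect Pfam accessions found within `flank` genes of the vertex protein.

    gene_pfams: list in chromosome order; element i is the set of Pfam
                accessions annotated on gene i (hypothetical input format).
    """
    start = max(0, vertex_index - flank)
    end = min(len(gene_pfams), vertex_index + flank + 1)
    found = set()
    for pfams in gene_pfams[start:end]:
        found.update(pfams)
    return found


def classify_locus(gene_pfams, vertex_index, flank=20):
    """Return the BMC types whose diagnostic Pfams all occur in the window.

    ASSUMPTION: a type is called only when every diagnostic Pfam is present.
    Loci matching no type are reported as ["unknown"].
    """
    found = window_pfams(gene_pfams, vertex_index, flank)
    types = sorted(t for t, needed in DIAGNOSTIC_PFAMS.items() if needed <= found)
    return types or ["unknown"]
```

The RuBisCO domains alone do not separate α-carboxysomes from β-carboxysomes, so this sketch reports both as "carboxysome". A locus whose type enzymes sit outside the window is reported as "unknown" even if the enzymes exist elsewhere in the genome.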

Clades with no BMC type enzymes
Since we were looking for specific proteins as key enzymes in a 40-gene window, some BMC types eluded us and were designated as an unknown BMC type. The taxonomic distribution of the BMC loci with no adjacent eut, pdu or grp type enzymes was very different and included Planctomycetes, Chlorobi, Chloroflexi, Actinobacteria and a Verrucomicrobium (Figure 4E). Most of these loci included an alcohol dehydrogenase and/or an aldehyde dehydrogenase, indicative of the non-carboxysomal type BMCs. However, many of these loci had 2 or 3 vertex proteins, a feature common in α-carboxysomes.
We searched the rest of the genome outside of the 40-gene window for type enzymes that might be part of a separate operon (Supplementary Table S1). We found that all Cyanobacteria had RuBisCO subunit sequences somewhere in their genomes (not shown).
Mycobacteria have been proposed to have a different BMC type [30]. Our results showed that all Mycobacteria that had BMC genomic potential also had pdu-type enzymes elsewhere in their genome. Finally, there were three different clades consisting of four Planctomycetes, a Firmicute and a Verrucomicrobium that have an unknown BMC type. Most taxa that had an unknown BMC type had grp-related enzymes somewhere in their genomes, although this may be an overestimate since radical SAM and pyruvate formate lyase domains are found in other enzymes.

Discussion
We demonstrated that (1) BMCs are present in 11 different bacterial phyla, (2) with the exception of Cyanobacteria, BMC distribution is sporadic even at the family and genus levels, (3) Cyanobacteria and other carboxysome-containing taxa only have 1 BMC type per genome, (4) multiple BMC types are present in 40% of the BMC-containing genomes, (5) monophyletic clades of α-carboxysomes and β-carboxysomes are evident on the vertex protein tree, and (6) there is an incongruence between BMC type and vertex protein evolution for eut, pdu and grp-type BMCs. Overall, our results suggest that, with the exception of carboxysomal loci in Cyanobacteria, BMC loci have undergone considerable HGT. The fraction of genomes with BMCs in our study (15%) is similar to an earlier report of 17%, which examined roughly half of the current complete genomes [30]. While the NCBI complete prokaryotic genome database contains over 2500 genomes, sequencing is concentrated in just a few phyla, whereas other phyla have not been extensively sampled for taxonomic breadth. Many phyla in our study contained fewer than 50 sampled genomes and BMCs may yet be discovered in those groups.

Ancient origin of carboxysomes
No genomes from Archaea contain BMCs, suggesting that BMCs arose after the divergence of Bacteria from Archaea and that horizontal transfer of BMCs has not occurred between these domains. Where and when did BMCs originate? It is evident that carboxysomes have conferred a great advantage specifically for Cyanobacteria, since virtually all of the species belonging to this group have the genomic potential to express these structures (Table 1). This suggests an origin of BMCs that dates back to the origin of Cyanobacteria roughly three billion years ago [40].
Our phylogenetic analysis of the vertex protein was unable to provide strong support for a sister relationship between the α-carboxysomes and β-carboxysomes (Figure 2B). A recent analysis of cyanobacterial genomes indicates that the marine Prochlorococcus and Synechococcus originated within the Cyanobacteria and are not ancestral to all other Cyanobacteria [41]. While it is possible that the α-carboxysome originated within chemoautotrophic Proteobacteria, the sporadic distribution of the carboxysome in these bacteria does not provide evidence of an older lineage.
It is interesting that while no chloroplast genomes contain carboxysomes, a carboxysome operon is present in Paulinella chromatophora, a freshwater amoeba with photosynthetic endosymbionts of cyanobacterial origin [42]. This endosymbiotic event is not related to the origin of chloroplasts and occurred more recently, about 60 million years ago, as the result of endosymbiosis of a member of the marine Prochlorococcus/Synechococcus group [43][44][45]. Why then have no chloroplast genomes retained the carboxysome shell proteins? One possible explanation is that Cyanobacteria did not require this structure early on, when atmospheric CO2 levels were higher and prior to the symbiotic origin of chloroplasts. In either scenario of possible origin, the carboxysome is undoubtedly old in an evolutionary sense, likely dating back at least 2.5 billion years to the Great Oxidation Event [46].

Recent origin of Eut, Pdu and Grp-type BMCs
The taxonomic segregation of the carboxysomes and the eut, pdu and grp-type BMCs suggests that ecological differences may play important roles in selecting for BMC types. The eut, pdu and grp-type BMCs are important in heterotrophic lifestyles such as those of gut and soil microbes in the Proteobacteria and Firmicutes. There are 1654 genomes sequenced from these two phyla, representing 64% of the total NCBI complete genome database. Thus, it is not surprising that they are the most abundant taxa for each of these 3 BMC types. However, in these phyla the eut, pdu and grp-type BMCs do not seem to be subject to consistent strong selection, as they are present in some taxa but absent in close relatives that appear in a similar environment. The limited phylogenetic depth of these BMC types in any taxonomic group suggests that they are of recent evolutionary origin and undergo frequent gene transfer and loss (Figure 2B).
Many of the bacteria with multiple BMC loci (Table 5) have different types of BMCs. It has long been known that Salmonella and Listeria strains contain both the eut and pdu-type BMCs [23,27,47]. Some strains of E. coli contain 1 each of the eut, pdu and grp-type BMCs. Clostridium phytofermentans contains 1 eut type and 2 grp types. However, the 2 grp-type BMCs are differentially expressed [19].

BMC operon evolution
While there were some loci with no key enzymes in the 40-gene region surrounding the vertex protein, for the most part BMC-related genes clustered closely together on the chromosome. One explanation is offered by the Selfish Operon Theory, which, when first proposed, used the eut-type BMC locus of Salmonella typhimurium as a supporting example [31]. The theory suggests that genes having a weakly selected function are more likely to persist in a population by clustering close to each other. This is because when these genes are located in a single locus they are more likely to be horizontally transferred together and maintain their function. In this manner they are less likely to be lost to genetic drift. The theory goes on to predict that genes having a weakly selected function are more likely to cluster together than those that are strongly selected for. Interestingly, β-carboxysomal genes are more dispersed across physically distant regions of the genome than other BMC loci, and carboxysomes have been strongly selected for in the Cyanobacteria, which is consistent with the Selfish Operon Theory. However, the question remains of why α-carboxysome genes, if also strongly selected for, still cluster together.
The distribution of the diagnostic enzymes associated with the eut, pdu and grp-type BMCs is not consistent with the vertex protein phylogeny since vertex proteins belonging to BMC loci of similar function do not form monophyletic groups. This phylogenetic topology suggests that the reassembly of BMC loci of similar function has occurred several times. Many bacteria contain other enzymes in the genome that share sequence similarity with these enzymes, so it is not clear why or when these enzymes need to be encapsulated in a BMC. It is possible that the "free standing" enzymes may replace the BMC type enzymes in the BMC locus leading to different BMC functions. Other enzymes associated with BMC loci also do not have a consistent phylogenetic pattern suggesting frequent association or loss from the BMC locus (results not shown). Detailed phylogenetic analysis of these genes is needed to determine rates of gene gain, replacement and loss from BMC loci to further evaluate the Selfish Operon Theory and other aspects of BMC evolution.