Genome Mining and Comparative Genomic Analysis of Five Coagulase-Negative Staphylococci (CNS) Isolated from Human Colon and Gall Bladder

Coagulase-negative staphylococci (CNS) are known to cause distinct types of infections in humans, such as endocarditis and urinary tract infections (UTI). Surprisingly, there is a lack of genome analysis data in the literature on CNS, particularly those of human origin. In light of this, we performed genome mining and comparative genomic analysis of the CNS strains Staphylococcus cohnii subsp. cohnii strain GM22B2, Staphylococcus equorum subsp. equorum strain G8HB1 and Staphylococcus pasteuri strain BAB3, isolated from gall bladder, and Staphylococcus haemolyticus strain 1HT3 and Staphylococcus warneri strain 1DB1, isolated from colon. We found that 29% of the virulence determinants were shared among the CNS strains; these were involved in resistance to antibiotics and toxic compounds, bacteriocins and ribosomally synthesized peptides, adhesion, invasion, intracellular resistance, prophage regions and pathogenicity islands. Ten unique virulence factors involved in adhesion, negative transcriptional regulation, resistance to copper and cadmium, and phage maturation were also present in our strains. Apart from comparing genome homology, size and G + C content, we also showed the presence of 10 different CRISPR-cas genes in the CNS strains. Further, KAAS-based annotation revealed the presence of CNS genes in different pathways involved in human diseases. In conclusion, this study is a first attempt to unveil the pathogenomics of CNS isolated from two distinct body organs and highlights the importance of CNS as emerging pathogens of the health care sector.

Journal of Data Mining in Genomics & Proteomics, ISSN: 2153-0602
Citation: Nair RG, Kaur G, Khatri I, Singh NK, Maurya SK, et al. (2016) Genome Mining and Comparative Genomic Analysis of Five Coagulase-Negative Staphylococci (CNS) Isolated from Human Colon and Gall Bladder. J Data Mining Genomics Proteomics 7: 192. doi:10.4172/21530602.1000192


Introduction
The genus Staphylococcus is well characterized and consists of fifty-one species and twenty-seven subspecies (www.bacterio.net/staphylococcus.html). Its members are Gram-stain-positive with a low G + C content of 30-35 mol% [1,2]. Methicillin-resistant Staphylococcus aureus (MRSA) and vancomycin-resistant Staphylococcus aureus (VRSA) are among the prominent pathogens that cause a wide variety of infections in humans as well as animals [3][4][5][6][7]. The human body harbours a vast repertoire of bacteria, among which the genus Staphylococcus includes bacteria that can cause severe infections in the host; the majority of these colonize newborn babies via the mother's skin [8,9]. Staphylococcus aureus is a major pathogen in the genus that causes endovascular infections, pneumonia, septic arthritis, endocarditis, osteomyelitis, foreign-body infections and sepsis in hospitals and outpatients [10][11][12][13]. Second in importance are the CNS, which cause infections associated with intravascular catheters and prosthetic devices, post-operative sternal wound infections, and infections in neonates and immunocompromised hosts in the health care environment [14][15][16]. CNS are also gradually acquiring drug resistance, which limits present therapies and poses a great threat to the health care system worldwide [8,14,17,18,19,20,21].
To gain more insight into the pathogenicity of CNS isolated from humans, we sequenced the draft genomes of three CNS strains recovered from human gall bladder and two from human colon; the samples were collected at the Post Graduate Institute of Medical Education and Research, Chandigarh, India.
The homology and differences among the five CNS genomes were assessed using Mauve 2.3.1 and BRIG. Further, a comparative genomic strategy was employed using the published genome of S. aureus strain RF122 to characterize the pathogenic properties of the CNS isolates. We were able to identify and analyze the major virulence determinants among these strains, which included adhesion, resistance to antibiotics and toxic compounds, bacteriocins and ribosomally synthesized peptides, invasion and intracellular resistance, phages and prophages, and CRISPR-cas proteins. The findings demonstrated the pathogenic potential of CNS isolated from two distinct body organs and identified them as emerging human pathogens.

Materials and Methods

Bacterial strain isolation and identification
Of the five strains of staphylococci, three, viz. Staphylococcus cohnii subsp. cohnii strain GM22B2, Staphylococcus equorum subsp. equorum strain G8HB1 and Staphylococcus pasteuri strain BAB3, were isolated from gall bladder, whereas Staphylococcus haemolyticus strain 1HT3 and Staphylococcus warneri strain 1DB1 were isolated from the colon. The five CNS strains were isolated from five different patients. Patients 1 to 3 (strains G22B2, G8HB1 and BAB3, respectively) underwent laparoscopic cholecystectomy for removal of gallstones. Patient 4 (strain 1HT3) had a gastric lipoma, sampled for biopsy. Patient 5 (strain 1DB1) had carcinoma of the cecum, terminal colon. The tissue samples were recovered during the course of surgery and cut into smaller pieces with sterile scissors and forceps. The tissue samples were homogenized in sterile 1X PBS and centrifuged at 4000 rpm for 2 minutes to remove debris. The supernatants were serially diluted, plated on tryptic soya agar (TSA; HiMedia, India) and incubated at 37°C for 36 h, and pure colonies were isolated. The selected strains were identified by 16S rRNA gene sequencing. Genomic DNA extraction and amplification were performed as previously described [22]. Identification of phylogenetic neighbours and calculation of pairwise 16S rRNA gene sequence similarities were achieved using the EzTaxon server [23], and alignment was carried out using MEGA version 6.0 [24]. Phylogenetic trees were constructed using the neighbour-joining, maximum likelihood and maximum parsimony algorithms. Bootstrap analysis was performed to assess the confidence limits of the branching (Figure 1).
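The pairwise 16S rRNA gene sequence similarity mentioned above can be illustrated with a minimal sketch. It assumes two sequences that have already been aligned to equal length and defines similarity as percent identity over columns where neither sequence has a gap; the exact definition used by the EzTaxon server is not stated in the text, so the function name `pairwise_similarity` and the gap handling are illustrative assumptions.

```python
def pairwise_similarity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: both inputs come from the same alignment, so they have equal
    length and use '-' as the gap character.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not taken from the study)
print(pairwise_similarity("ACGT-A", "ACGA-A"))
```

Excluding gapped columns is only one possible convention; counting gaps as mismatches would give lower values for the same alignment.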

Whole genome sequencing and annotation
The draft genomes were sequenced at Genotypic Pvt. Ltd. (Bengaluru, India, http://www.genotypic.co.in). Library preparation was performed at Genotypic Technology's genomics facility following the NEXTFlex DNA library protocol as per the manufacturer's instructions. Approximately 3 μg of genomic DNA was sonicated using Covaris to obtain fragments of 500 to 700 bp. The size distribution was checked by running an aliquot of the sample on an Agilent HS DNA Chip. The fragmented DNA was cleaned up using the HighPrep PCR clean-up system as described by the manufacturer. It was then subjected to a series of enzymatic reactions that repaired frayed ends, phosphorylated the fragments and added a single-nucleotide 'A' overhang, after which adaptors were ligated using the NEXTFlex DNA sequencing kit following the manufacturer's protocol. Sample clean-up was done using HighPrep PCR beads. After ligation and clean-up, fragments of ~500-800 bp were size-selected on a 2% low-melting agarose gel and cleaned using a MinElute column (QIAGEN). PCR (cycles) amplification of adaptor-ligated fragments was done, and the products were cleaned up using HighPrep PCR clean-up beads.

Comparative genomics and pathogenomics
Automated genome annotation of all five Staphylococcus strains was accomplished using RAST [25][26][27]. The ribosomal RNA genes in the genomes were identified by RNAmmer 1.2 [28]. The tRNA and tmRNA genes were identified by ARAGORN [29]. Prophage regions were identified by PHAST [30]. Insertion sequence (IS) elements were identified by ISfinder (http://www-is.biotoul.fr/) [31]. Genome sequence similarity among the five CNS and the reference strain Staphylococcus aureus RF122 was assessed using BRIG [32]. Multiple whole genome sequence alignment was performed using Mauve 2.3.1, which uses a sum-of-pairs breakpoint score to detect rearrangements in whole genomes. Even if two strains have unequal genome content, this method can predict genome rearrangements with high accuracy, which makes it an important tool for evolutionary genomics [33,34]. The CRISPRFinder tool was used to identify the CRISPR genes in the genomes of the CNS strains [35]. Further, the KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/) was employed to map the orthologous genes and their biological roles in various pathways.
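The idea behind a sum-of-pairs breakpoint score can be sketched as follows. This is a simplified illustration, not Mauve's actual implementation: it assumes every genome is given as a signed order of the same set of collinear blocks (sign = orientation), counts a breakpoint wherever two blocks adjacent in one genome are not adjacent in the other, and sums that count over all genome pairs. The block orders and genome labels are hypothetical.

```python
from itertools import combinations


def breakpoints(order_a, order_b):
    """Count adjacencies of order_a that are not preserved in order_b.

    Orders are lists of signed block ids; (x, y) and its reverse complement
    (-y, -x) describe the same adjacency. Assumption: both genomes contain
    the same blocks, and chromosome ends are not counted.
    """
    adjacencies_b = set()
    for x, y in zip(order_b, order_b[1:]):
        adjacencies_b.add((x, y))
        adjacencies_b.add((-y, -x))
    return sum((x, y) not in adjacencies_b for x, y in zip(order_a, order_a[1:]))


def sum_of_pairs_breakpoints(genomes):
    """Sum the pairwise breakpoint counts over all pairs of genomes."""
    names = sorted(genomes)
    return sum(breakpoints(genomes[a], genomes[b]) for a, b in combinations(names, 2))


# Hypothetical block orders: genome B carries an inversion of blocks 2-3.
toy_genomes = {
    "A": [1, 2, 3, 4],
    "B": [1, -3, -2, 4],
    "C": [1, 2, 3, 4],
}
print(sum_of_pairs_breakpoints(toy_genomes))
```

In this toy case the single inversion creates two breakpoints in each pair that involves genome B, while the identical orders of A and C contribute none.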

Genome features
The genomes were sequenced with Illumina MiSeq, and assembly was carried out with CLC bio workbench v6.0.4 (CLC Bio, Denmark) (www.clcbio.com). Among the CNS isolates, the genome of S. equorum subsp. equorum G8HB1 was the largest (2.799 Mb) compared with the other CNS species (ranging from 2.403 Mb to 2.776 Mb). The genomic G + C content of S. equorum subsp. equorum G8HB1 (33.08%) was the highest among all the CNS. The G + C contents of S. haemolyticus 1HT3 (32.78%), S. cohnii subsp. cohnii G22B2 (32.28%) and S. warneri 1DB1 (32.55%) were higher than that of S. pasteuri BAB3 (31.50%). The variation in genomic G + C content among the CNS species could be attributed to mutation and selection pressures [36,37], which may arise from multiple factors such as environment [38], symbiotic lifestyle [39], aerobiosis [40] and nitrogen-fixing ability [41]. There was a slight variation in the number of rRNA operons and tRNA-coding genes (Table 1) among the CNS species, which could be correlated with the strength of codon usage bias, known as the S value. Rapidly growing CNS species would have more rRNA and tRNA genes and a stronger codon usage bias [41]. The number of insertion sequence (IS) elements was highest in S. haemolyticus strain 1HT3, which had 33 IS elements belonging to 11 IS families, suggesting genome-wide inversion and rearrangement (Table 1). Although the total numbers of IS elements in the other CNS species were comparable, the diversity and distribution of the IS families differed.
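Genome size and G + C content of a draft assembly, as compared above, can be computed directly from the contig sequences. The following is a minimal sketch, assuming the contigs are available as plain strings and that G + C content is calculated over unambiguous A/C/G/T bases only; the function name `assembly_stats` and the toy contigs are illustrative and not part of the study's workflow.

```python
from typing import Iterable, Tuple


def assembly_stats(contigs: Iterable[str]) -> Tuple[int, float]:
    """Return (total assembly length in bp, G+C percentage).

    Assumption: ambiguous bases such as N count towards the length but are
    excluded from the G+C denominator.
    """
    total_len = 0
    gc = 0
    acgt = 0
    for contig in contigs:
        seq = contig.upper()
        total_len += len(seq)
        g_plus_c = seq.count("G") + seq.count("C")
        gc += g_plus_c
        acgt += g_plus_c + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("no A/C/G/T bases found")
    return total_len, 100.0 * gc / acgt


# Toy contigs (hypothetical, not from the sequenced strains)
size_bp, gc_percent = assembly_stats(["GGCC", "AATT", "ACGTNN"])
print(size_bp, round(gc_percent, 2))
```

For a real draft genome the contigs would be read from the assembly FASTA file, and the length would typically be reported in Mb.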

Multiple whole genome alignment
The multiple whole genome alignment of the five CNS strains was carried out with S. aureus RF122 as the reference strain. The number of lines connecting Locally Collinear Blocks (LCBs) reflects the extent of homology between genomes. The genomes of strains 1DB1, 1HT3 and BAB3 showed more connecting lines and therefore share a greater extent of homologous regions. Conditional links in red between 1DB1, G22B2 and the reference strain RF122 indicated a stretch of sequence common to these genomes (Figure 2).

Draft genome visualization using BRIG
A whole genome circular comparative map of the five CNS strains against the reference genome of S. aureus RF122 was generated on the basis of BLAST sequence similarities using BRIG [32], which maps whole genomes as concentric rings. Each genome was represented by a different colour; the darker areas in the circular map indicated 100% sequence similarity with the reference genome, whereas the lighter (grey) areas indicated 70% sequence similarity (Figure 3).

Genome comparison and identification of virulence determinants
Comparison of the five CNS genomes with the reference genome of S. aureus RF122 revealed several major categories of genes. Two of these categories, viz. (1) virulence, disease and defense and (2) phages, prophages, transposable elements and plasmids, were studied further because of their relevance to pathogenesis in humans. In total, 128 genes responsible for virulence were present across all the major categories in the strains. Ten unique virulence factors were present in the genomes of the CNS strains (Table 2). The pie charts depict the putative factors contributing to pathogenesis in the CNS strains (Figures 4A and 4B).
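The distinction between shared determinants and factors unique to the CNS strains can be expressed with simple set operations on a presence/absence table. The sketch below assumes each strain is represented by the set of annotated functional roles found in its genome; the strain labels `CNS_1` and `CNS_2`, the short role identifiers and the toy table are hypothetical and do not reproduce the study's annotation.

```python
def shared_and_unique(presence, reference):
    """Summarise a presence/absence table of virulence-related roles.

    presence:  dict mapping strain name -> set of functional roles
    reference: name of the reference strain within `presence`
    Returns (roles shared by all strains,
             roles found in non-reference strains but not in the reference,
             percentage of all roles that are shared).
    """
    all_roles = set().union(*presence.values())
    shared = set.intersection(*presence.values())
    non_reference = [roles for strain, roles in presence.items() if strain != reference]
    unique_to_non_reference = set().union(*non_reference) - presence[reference]
    percent_shared = 100.0 * len(shared) / len(all_roles)
    return shared, unique_to_non_reference, percent_shared


# Hypothetical toy table (not the study's data)
toy_presence = {
    "RF122": {"hsp33", "toxin"},
    "CNS_1": {"hsp33", "cadA"},
    "CNS_2": {"hsp33", "cadA", "arsR"},
}
shared, unique, pct = shared_and_unique(toy_presence, reference="RF122")
print(sorted(shared), sorted(unique), pct)
```

With real annotation tables the same three quantities correspond to the shared determinants, the factors absent from the reference genome, and the percentage of shared determinants.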

Genes responsible for adhesion
Adhesion to the surface of the host cell is the primary step in the process of infection and determines pathogen survival and the extent of pathogenicity. A total of 23 genes responsible for adhesion, divided into two subsystems, were present across the strains. One gene with a functional role as a chaperonin (heat shock protein 33) was common to all the strains (Figure 5). The number of genes responsible for adhesion varied to a great extent among the CNS strains compared with the reference genome of S. aureus RF122, which had the largest number of genes. S. cohnii subsp. cohnii strain G22B2 had only one such gene and S. equorum subsp. equorum strain G8HB1 had none, which could be related to a poor adhesion capacity of these strains. Adhesion helps to predict biofilm formation, which is crucial in understanding pathogenicity.

Genes encoding toxins and super antigens
S. aureus is well known to produce exfoliative toxins (ETs) and pyrogenic toxin superantigens (PTSAgs) that cause diseases such as staphylococcal scalded-skin syndrome (SSSS), staphylococcal toxic shock syndrome (TSS) and staphylococcal food poisoning (SFP) [10,18]. Superantigens of S. aureus can bind directly to the V-β region of T-cells and to major histocompatibility complex (MHC) class II molecules of antigen-presenting cells, thereby bypassing antigen processing and presentation. This leads to direct activation of V-β-expressing T-cells [15]. Genes for toxins and superantigens were absent from all the CNS strains and were present only in the reference strain RF122.

Genes involved in production of bacteriocins and ribosomally synthesized antibacterial peptides
Bacteriocins are low-molecular-weight proteins that have a lethal effect against bacteria through a fast-acting mechanism that forms pores in target membranes. The sub-category bacteriocins and ribosomally synthesized antibacterial peptides comprised two subsystems, bacitracin stress response and colicin V bacteriocin production cluster, which together consisted of 12 genes. The colicin V bacteriocin production cluster was present in all the CNS strains. Strains G8HB1 and G22B2 had fewer genes involved in the bacitracin stress response (Figure 6). These findings suggest that the CNS strains contribute to antimicrobial activity to an extent comparable to S. aureus RF122.

Genes involved in resistance to antibiotic and toxic compounds
Resistance to drugs and toxic compounds is a prevalent feature of all the CNS strains, which necessitates the development of newer, improved multipotent drugs against clinical CNS strains. A total of 46 genes conferring resistance to antibiotics and toxic compounds, divided into fifteen subsystems, were present across the staphylococcal strains under study. Eighteen genes were common to all the staphylococcal strains (Figure 7). The CNS isolates had more resistance genes than the reference genome of S. aureus RF122. Strains G22B2 and G8HB1 had the most genes associated with resistance to arsenic and cobalt-zinc-cadmium toxicity, with mercuric reductase and with cadmium resistance. Strains 1HT3 and G8HB1 had the most genes for copper homeostasis compared with S. aureus RF122. For beta-lactamase resistance, strain 1DB1 had the largest number of genes. Genes for bile hydrolysis, the multidrug resistance 2-protein version found in Gram-positive bacteria, teicoplanin resistance in Staphylococcus and resistance to fluoroquinolones were equally present in all the strains. No genes for resistance to fosfomycin or chromium compounds were found in strains RF122, 1HT3, 1DB1 and BAB3, whereas strains G8HB1 and G22B2 carried genes for resistance to both. Aminoglycoside adenylyltransferase genes were absent from all the strains except G8HB1 and 1DB1. These data suggest that the clinically isolated CNS strains could be multidrug resistant.

Genes involved in invasion and intracellular resistance
This sub-category included two subsystems, Mycobacterium virulence operon involved in protein synthesis (SSU ribosomal proteins) and Mycobacterium virulence operon involved in protein synthesis (LSU ribosomal proteins), with a total of 9 genes present in all the strains. In mycobacteria, SSU and LSU ribosomal proteins contribute to tuberculosis infection. Our findings demonstrate that these proteins are also encoded by the CNS strains, possibly conferring pathogenicity on them. SSU proteins constitute the smaller subunit of the ribosome and are named using Rv numbers, e.g. Rv0682-Rv0686. Here, Rv0682 is encoded by the gene rpsL and forms protein S12, which is involved in the translation initiation step. Similarly, LSU proteins constitute the larger subunit of the ribosome; one such protein, Rv1641, is encoded by the gene infC and functions as initiation factor 3 during protein synthesis. These genes possibly originate from Mycobacterium tuberculosis and may help in invading the host cell and in supporting intracellular survival for longer periods, thereby evading the host immune system.

Genes encoding phages and prophages
Several phages and prophage regions that could contribute to virulence have been found in the genomes of Staphylococcus spp. This sub-category included 9 subsystems comprising 23 genes. The largest number of genes was present in the subsystem phage packaging machinery in strain 1DB1 (Figure 8). Phage tail length tape measure proteins were present in all the strains except 1HT3, which suggests possible horizontal transfer of tail fibre proteins among these strains.

Pathogenicity islands in staphylococci
Pathogenicity islands (PAIs), like plasmids and bacteriophages, are among the factors responsible for the evolution of pathogens [42]. In staphylococci, PAIs were first identified in S. aureus, where they are known as SaPIs. They consist of phage-related chromosomal islands that are highly mobile and encode superantigens [43]. Only one PAI was found in the staphylococcal strains. The Listeria pathogenicity island LIPI-1 extended subsystem contained two genes: phosphatidylinositol-specific phospholipase C (EC 4.6.1.13), present only in the reference strain RF122, and zinc metalloproteinase precursor (EC 3.4.24.29), present in all the strains except 1HT3.

Analysis of prophage regions
Prophage regions consist of phage DNA integrated into the bacterial genome. Phage DNA acts as a mobile genetic element and can be considered a vector for lateral gene transfer among bacteria. In fact, a large proportion of bacterial virulence factors are encoded by phages [44]. Prophage regions were identified in all six strains, including the reference genome RF122. Five strains had intact prophage regions. In total, twelve different types of prophage regions were found across the strains, including the prophage region PHAGE_Staphy_PT1028_NC_007045, common to strains 1DB1 and 1HT3, and the prophage region PHAGE_Staphy_StB20_NC_019915, present in strains 1DB1 and G22B2. Diversity in the prophage regions contributes to the adaptation of lysogens to new hosts and may contribute to pathogenicity in the CNS strains (Table 3).

Identification of CRISPR-cas proteins
Bacteria and archaea also have defence mechanisms that enable them to protect themselves against foreign invaders such as viruses. One such system consists of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) and the CRISPR-associated (Cas) proteins, which provide adaptive immunity to the bacteria [45][46][47]. CRISPR-cas genes were detected in all five CNS strains but not in the reference strain RF122. Strain 1DB1 had two CRISPR-cas genes, coding for a hypothetical protein (S. aureus), in the region 1265562-1265698 bp. Strain 1HT3 had only one CRISPR-cas gene, in the region 10866-10925 bp, coding for a hypothetical protein (S. haemolyticus). In the BAB3 genome, CRISPR-cas genes were found in two regions, 354805-354864 and 36501-36775 bp. In G22B2, two CRISPR-cas genes were found, in the regions 440-521 and 117647-117712 bp. Among all the strains, G8HB1 had the largest number of CRISPR-cas genes, located in the regions 2255-2348, 219970-220100 and 291655-291801 bp. The presence of these genes suggests that the strains have evolved phage resistance mechanisms.
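The reported CRISPR regions can be summarised by their lengths. The sketch below uses the coordinates quoted in the text and assumes they are 1-based and inclusive, so that length = end - start + 1; that convention and the helper name `region_lengths` are assumptions, and only one region is listed in the text for strain 1DB1.

```python
# Coordinates as quoted in the text: strain -> list of (start, end) in bp.
crispr_regions = {
    "1DB1": [(1265562, 1265698)],
    "1HT3": [(10866, 10925)],
    "BAB3": [(354805, 354864), (36501, 36775)],
    "G22B2": [(440, 521), (117647, 117712)],
    "G8HB1": [(2255, 2348), (219970, 220100), (291655, 291801)],
}


def region_lengths(regions):
    """Return strain -> list of region lengths, assuming inclusive coordinates."""
    return {
        strain: [end - start + 1 for start, end in spans]
        for strain, spans in regions.items()
    }


for strain, lengths in region_lengths(crispr_regions).items():
    print(strain, lengths)
```

All listed regions are short (tens to a few hundred base pairs), which is the scale expected for repeat-spacer arrays rather than for full cas gene operons.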

Gene candidates involved in the human diseases
Putative functional genes involved in pathways related to human diseases were identified in the CNS strains using KAAS [48]. In total, 357 putative genes were associated with different processes such as metabolism, genetic information processing, cellular processing and virulence properties (Figures 9A-9E). Functional pathways relevant to human diseases were identified in all strains except 1DB1 (Table 4). The protein groEL, which has a chaperonin function, was involved in three human diseases, viz. tuberculosis, legionellosis and type I diabetes mellitus. In tuberculosis, it stimulates the production of IL-18 via TLR-4-mediated MyD88 signaling, whereas in legionellosis it mediates the entry of Legionella into the epithelial cell or macrophage, and in type I diabetes mellitus it causes apoptosis of β cells in the pancreas via IL-2 and IFN-gamma release, which targets the CD8 cytotoxic T-cells and macrophages. Gene regulation is under the tight control of microRNAs (miRNAs), which are small RNAs of 21-23 nucleotides in length. miRNA signatures have been observed in several cancers. The protein DNMT 1 is a target of the microRNA miR-152, which blocks DNMT 1 involved in hepatocellular carcinoma. Other proteins involved in the cancer pathways included frmA (alcohol dehydrogenase), which functions in chemical carcinogenesis by conversion of chloral acetate to trichloroacetic acid during the synthesis of olefins, which may lead to renal tubule adenoma and carcinoma. The protein fumC performs the reversible conversion of fumarate to malate in the mitochondria, thereby acting as a tumor suppressor by allowing the production of HGH (hypoxia-inducible factor prolyl hydroxylase), which is negatively regulated by fumarate. HGH regulates the HIF-α (hypoxia-inducible factor 1 alpha) pathway by degrading HIF-α, which mediates cell proliferation via TGF-β.

Two component systems
In response to environmental cues, bacteria can change their gene expression pattern through receptors that respond to external chemical and physical stimuli; these constitute the two-component system. Each two-component system consists of a sensor protein, a histidine kinase (HK), and a response regulator (RR). Phosphorylation of the response regulator leads to a change in its output domain, which can then bind to DNA and mediate transcriptional control [49]. Three proteins involved in the two-component system pathway were identified in the CNS strains using KAAS [48]. NarG (nitrate reductase alpha subunit), present in strains G8HB1, BAB3 and 1DB1, was involved in nitrogen metabolism, functioning in nitrate reduction. LytS (LytT family, sensor histidine kinase) of G22B2 mediated autolysis via generation of the holin-like proteins lrgA and lrgB. YdfJ (membrane protein) of strain 1HT3 was involved in signal transduction.

Discussion
CNS are the most prominent group associated with hospital-acquired infections and are therefore becoming increasingly important in diagnostics and pathogenesis. Though these bacteria inhabit the human skin and mucous membranes, they are also isolated from a wide variety of habitats and are no longer considered mere symbionts. In hospital settings, as the use of more invasive procedures involving foreign polymer bodies such as catheters increases, so does the risk of these bacteria colonising the polymer surface by forming a thick, multi-layered biofilm. Newer antibiotic-resistant strains of CNS are appearing owing to the continuous use of antibiotics in hospitals. CNS top the list of pathogens causing nosocomial infections at the global level, and it is imperative to understand their pathogenomics for effective therapy. The standard microbiological and molecular biology approaches used to assess the virulence profile of pathogenic CNS, although effective, are time consuming. This limitation can be overcome by whole genome sequencing (WGS), which is a powerful tool for studying bacterial genomics. Sequencing the entire genome provides more detailed and robust information for comparison between two or more genomes.
In the present study, using WGS and bioinformatics analysis, we determined the virulence factors in five different CNS strains isolated from two distinct human organs, gall bladder and colon. The study revealed numerous attributes present in their genomes that may confer pathogenicity on the five CNS strains, such as resistance to antibiotics and toxic compounds, adhesion, invasion, intracellular resistance and prophage regions. Several traits were conserved among all the CNS strains and the reference genome RF122, such as genes encoding invasion and intracellular resistance, resistance to fluoroquinolones and teicoplanin, the multidrug resistance 2-protein version found in Gram-positive bacteria, and the colicin V and bacteriocin production cluster, which could possibly be explained by horizontal gene transfer. A smaller number of traits were unique to the CNS strains and absent from the reference genome of RF122. These included the arsenical resistance operon repressor, cadmium resistance protein, cadmium efflux system accessory protein and phage major tail protein. The acquisition of these new resistance genes differentiates these CNS strains from other CNS species and points to their greater pathogenic potential. The genomes of the five CNS strains could be ordered on the basis of pathogenic potential, from most to least pathogenic: S. equorum subsp. equorum strain G8HB1, S. cohnii subsp. cohnii strain G22B2, S. pasteuri strain BAB3, S. warneri strain 1DB1 and S. haemolyticus strain 1HT3.
In conclusion, this study is a first attempt to map and understand the virulence profiles of different CNS species isolated from distinct human body organs. WGS is a highly appropriate tool for analysing the pathogenic potential of CNS. Our analysis has provided new insights into the genomes of CNS and offers potential for developing therapies against emerging CNS pathogens.