Evolutionary Analysis of CRISPRs in Archaea: An Evidence for Horizontal Gene Transfer


Introduction
CRISPRs are a defense mechanism found in prokaryotes of both the archaeal and the bacterial domains. About 90% of archaea and 40% of bacteria hold CRISPR loci in their genomes [1]. A CRISPR unit is made up of repeating sequences, known as direct repeats, that are separated by spacer sequences; the array is preceded by a leader sequence of 500-550 bp [2,3]. The direct repeats of a CRISPR are partially palindromic and range from 24 to 48 bp in length [4,5]. These repeats tend to show dyad symmetry, which results in the formation of hairpin structures [6][7][8] including the spacers. Similarity searches confirm that spacers are captured segments of the genome sequences of invaders, derived from either the sense or the antisense strand [9]. A CRISPR imparts immunity against invading organisms through a mechanism catalyzed by the products of the CRISPR-associated (CAS) genes.
A CRISPR unit is activated when a foreign genome is introduced into the cell, for example through conjugation [10][11][12]. The genome of the invading strain is broken down through the CRISPR mechanism with the help of the CAS proteins, and the resulting foreign nucleotide fragments are integrated into the host as spacers at the leader end of the CRISPR unit [13][14][15]. When the prokaryote encounters the same invader again, the CRISPR unit in the host genome is transcribed into a pre-CRISPR RNA (pre-crRNA) molecule [16,17]. These pre-crRNA molecules are then processed into CRISPR RNAs (crRNAs) with the aid of the proteins encoded by the CAS genes [18,19]. Each resulting crRNA contains a spacer flanked by fragments of the repeats on either side. The crRNAs scan the invading genome for a fragment that matches the bound spacer sequence. A fragment in the invading genome that matches any of the spacers in the host crRNAs is called a protospacer [20]. The crRNAs together with the CAS proteins (the CRISPR-CAS complex) bind to the protospacer in the foreign genome by complementary base pairing. The complex then shears the invading genome into small fragments, which are in turn inserted into the host CRISPR unit as spacers [21][22][23].
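The spacer-to-protospacer matching step described above can be illustrated with a minimal Python sketch. It only performs exact string matching on both strands of a target sequence; it does not model the CAS proteins or any mismatch tolerance. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
# Toy illustration of spacer-to-protospacer matching on either strand.
# The sequences below are invented for the example.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_protospacers(spacer: str, target: str) -> list[tuple[int, str]]:
    """Return (0-based start, strand) for every exact occurrence of the spacer.

    '+' means the spacer sequence itself occurs in `target`;
    '-' means its reverse complement does, so the match lies on the other strand.
    """
    hits = []
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        start = target.find(query)
        while start != -1:
            hits.append((start, strand))
            start = target.find(query, start + 1)
    return sorted(hits)


toy_spacer = "GATTACAGG"                  # hypothetical spacer
toy_invader = "TTGATTACAGGAACCTGTAATCTT"  # hypothetical invader fragment
print(find_protospacers(toy_spacer, toy_invader))
```

Searching both strands reflects the statement in the Introduction that spacers may derive from either the sense or the antisense strand of the invader.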
The whole CRISPR mechanism is catalyzed by the products of the CAS genes, which are found in the vicinity of the CRISPR arrays [13,24]. These genes encode the enzymes involved in the processing of the CRISPR transcripts. They also aid in the recognition and neutralization of foreign genetic elements through the inclusion of new spacers. CAS genes can be classified into different categories depending on their role. There are altogether six types of core CAS genes associated with the CRISPR mechanism, of which Cas5 and Cas6 are newly added [25]. Excluding the newly added genes, the four core genes are arranged as Cas3-Cas4-Cas1-Cas2. Cas2 is a sequence-specific endoribonuclease [26]; Cas3 acts as a helicase [27]; Cas4 resembles the RecB family of exonucleases and contains a cysteine-rich motif; and Cas1, which is found in all the organisms harboring a CRISPR unit, is highly basic. Apart from the core genes, there are a few subtype genes that belong to the RAMP (Repair Associated Mysterious Proteins) family of proteins [28].
Each of the genomes analyzed in this study holds CRISPR units that vary in length, repeats and spacers. CRISPRs appear to share similar direct repeats among the studied archaeal strains [17]. Similarity in the repeats of CRISPR units indicates possible horizontal gene transfer between the strains [6,29]. This horizontal gene transfer may be mediated by plasmids, megaplasmids and even prophages that carry the CRISPR units [8,30]. Similarity in spacers seems to arise when two different organisms are invaded by the same phage or plasmid. When CRISPR arrays in the chromosome and in a plasmid of the same strain carry similar spacers, the genome is protected from degradation owing to the 5' overlap of the repeat [15]. The analysis of CAS genes shows that the CAS genes commonly found in the vicinity of all the CRISPR units are Cas1 and Cas6. The present study focuses on the computational and phylogenetic analysis of the CRISPR units present in all 110 species of archaea. The main objectives of this work are the analysis of the direct repeats and of the CAS proteins associated with the CRISPRs in order to substantiate their diversity.

Abstract
Akin to the eukaryotic immune system, prokaryotes harbor CRISPRs, arrays of DNA direct repeats and spacers that confer immunity against invading phages and plasmids. A CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) unit within the genome of an organism consists of an array of repeating sequences interspaced by unique spacers and is associated with special genes that reside adjacent to the array. The spacer sequences are nucleotide fragments integrated from an invading organism. Comparative genomic analysis of CRISPR sequences reveals wide variation in the CRISPR-CAS systems of different prokaryotes. Here, the CRISPR sequences of all the archaeal species are analyzed, along with a case study. The phylogenetic analysis of the CRISPR sequences indicates similarity among the direct repeats of the analyzed organisms, with no trace of spacer similarity. Further, novel CRISPR elements are identified in addition to those already present in the database. The CRISPRs are then subjected to local alignment using BLAST to check whether any of the sequences show similarity with the human and viral genomes available in NCBI.

Materials and Methods
Even though there are efficient software packages for extracting repeats from genome and protein sequences [31][32][33][34], a dedicated software package is needed to extract CRISPRs. Hence, we used the online CRISPRFinder [35] (URL: http://crispr.u-psud.fr/Server/), a program that reports all the essential details of CRISPRs. Among the tools available for retrieving CRISPR units, CRISPRFinder was found to be the more efficient for CRISPR investigation.

Retrieving the CRISPR units from the available archaeal strains
The archaeal domain accommodates a total of 110 species. The 110 archaeal species and their strains were retrieved from the taxonomic databases, and their whole genome sequences were obtained from the NCBI/GenBank database. The search for CRISPR units in the archaeal genomes was carried out through the CRISPRFinder web interface (with default parameters), which identifies CRISPRs with precision and allows an exact definition of the direct repeat consensus boundaries and the related spacers. This program was developed in Perl under Debian Linux [35] and was used to obtain the CRISPRs along with their flanking sequences. The CRISPRFinder output displays the CRISPRs with their repeats, the intervening spacers and their exact positions in the genome, and the referenced genes found within the sequence.
CRISPRFinder employs a stringent filter to single out the confirmed CRISPRs. Confirmed CRISPRs are those that have at least three motifs and at least two exactly identical direct repeats (DRs), while the remaining candidates are tagged as questionable CRISPRs. For our analyses, we considered only confirmed CRISPRs and validated them against CRISPRdb [36], a database that catalogues the confirmed CRISPRs. This database serves as a reliable source of complete CRISPR information.
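As a rough illustration of the filter just described, the rule can be written as a small Python function. This is one possible reading of the stated criterion (at least three repeat motifs, at least two of them exactly identical); it is not CRISPRFinder's own code, and the function name and the toy repeat lists are hypothetical.

```python
from collections import Counter


def classify_candidate(repeats: list[str]) -> str:
    """Label a candidate array as 'confirmed' or 'questionable'.

    `repeats` holds the repeat copies (motifs) detected in one candidate array.
    Assumed rule: at least three motifs, of which at least two are identical.
    """
    has_three_motifs = len(repeats) >= 3
    # Highest number of times any single repeat sequence occurs in the array.
    top_count = Counter(repeats).most_common(1)[0][1] if repeats else 0
    has_two_identical = top_count >= 2
    if has_three_motifs and has_two_identical:
        return "confirmed"
    return "questionable"


# Invented candidates, for illustration only.
print(classify_candidate(["GTTTCAGAAC", "GTTTCAGAAC", "GTTTCAGAAT"]))
print(classify_candidate(["GTTTCAGAAC", "GTTTCAGAAT"]))
```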

CRISPR analysis
The finalized direct repeats, representing 191 diverse CRISPR clusters from different organisms, were analyzed to infer the percentage of the organism's genome that they make up and to find the specific GTTTG/C and GAAAC motifs in the direct repeat sequences. These motifs are responsible for the palindromic nature of the direct repeat sequences within the CRISPR array. The chosen direct repeats were then aligned with ClustalW [37] (URL: http://www.ebi.ac.uk/Tools/msa/clustalw2/), a multiple sequence alignment program. The alignment results were used as input for constructing the phylogenetic tree with MEGA (Molecular Evolutionary Genetics Analysis), an offline toolkit for conducting alignments and drawing relationship trees with accurate branch distances [38].
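The two measurements described in this subsection, the terminal motif check and the share of the genome occupied by CRISPR arrays, can be sketched in Python as follows. The sketch reads "GTTTG/C" as GTTT followed by G or C, and it looks for each motif only within a short window at the corresponding end of the repeat; the window size of 8 and the function names are assumptions made for this example. The repeat and the lengths used in the demonstration are values quoted elsewhere in this paper.

```python
import re

# "GTTTG/C" is read here as GTTT followed by either G or C (assumption).
FIVE_PRIME_MOTIF = re.compile(r"GTTT[GC]")
THREE_PRIME_MOTIF = re.compile(r"GAAAC")


def motif_report(repeat: str, window: int = 8) -> dict[str, bool]:
    """Report whether the 5' and 3' motifs occur near the ends of a repeat."""
    repeat = repeat.upper()
    return {
        "five_prime": bool(FIVE_PRIME_MOTIF.search(repeat[:window])),
        "three_prime": bool(THREE_PRIME_MOTIF.search(repeat[-window:])),
    }


def crispr_percent_of_genome(array_lengths_bp: list[int], genome_length_bp: int) -> float:
    """Percentage of a genome covered by the listed CRISPR arrays."""
    return 100.0 * sum(array_lengths_bp) / genome_length_bp


# Thermococcus sp. CL1 repeat quoted in the palindromicity section.
print(motif_report("GTTTCAGAACCACATAATGTTTGGAAAC"))
# Longest M. cuprina Ar-4 array only, not the sum of all its arrays.
print(round(crispr_percent_of_genome([12176], 1840348), 2))
```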

CAS protein retrieval and phylogenetic analysis
CAS genes form a fundamental part of the CRISPR machinery; they encode the DNA-manipulating enzymes needed to carry out the defense mechanism. The genes related to each CRISPR in CRISPRdb are referenced to GenBank, and the amino acid sequences of the proteins encoded by these genes were retrieved from NCBI. Phylogenetic investigation was carried out by aligning the protein sequences from different species in ClustalW. The phylograms constructed with MEGA showed a clear picture of the close relationships among the species, supporting the view that they may be the products of horizontal gene transfer.

Results and Discussion
The structural attributes of CRISPRs are found to vary both within and between species. From the examination of the CRISPR units and their organization within the genome, a comparative assessment is made of the diversity of the CRISPR units within the archaeal domain. All the retrieved chromosomes and plasmid genomes were scanned with CRISPRFinder for the presence of CRISPR loci.

The output retrieved from the CRISPRFinder
Each genome FASTA file was uploaded to CRISPRFinder, which returned the CRISPR sequences together with the direct repeats, spacers, CAS genes and leader sequences. The CRISPRFinder output was cross-checked against the data available in CRISPRdb to retain only the confirmed CRISPR sequences. The screening process revealed that some of the CRISPRs presented as questionable sequences in CRISPRFinder are treated as confirmed in the database. These CRISPRs were combined with the other confirmed sequences that are common to the tool and the repository, giving a grand total of 391 CRISPRs. The number of repeat elements per CRISPR unit varies with the species. Among the analyzed archaeal strains, Metallosphaera cuprina Ar-4 (1,840,348 bp) in Crenarchaeota harbors the longest CRISPR, 12,176 bp in length with 25-bp direct repeats and 189 spacers, followed by an 11,632-bp CRISPR in Methanococcus voltae A3 (1,936,387 bp) in Euryarchaeota with 31-bp direct repeats and 171 spacers. Methanotorris igneus Kol5 (1,854,197 bp) has the shortest CRISPR, 85 bp in length with 31-bp direct repeats separated by a single spacer (Supplementary Table 1). In general, the CRISPRs comprise 2-100 direct repeats interspersed with 1-99 spacers. The number of direct repeats is higher in the methanogens, with repeat lengths ranging from 24 to 46 bp.
A total of 44 plasmid sequences were also retrieved along with the chromosomal genomes of the 110 archaeal species, and only ten plasmid genomes showed the presence of CRISPR units. Of the ten plasmids, one belongs to the methanogens and the rest belong to the halophilic archaeal group. These plasmid genomes display a total of 15 CRISPR units, giving a net count of 391 CRISPR units including those present in the chromosomal genomes.
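The array sizes quoted above can be sanity-checked with a short calculation. The sketch below assumes that an array with n spacers contains n + 1 repeats, and it reads the figures 25, 31 and 31 given for the three arrays as repeat lengths in bp; both points are assumptions of this example rather than statements from the CRISPRFinder output. Under those assumptions the implied mean spacer length follows directly from the array length.

```python
def mean_spacer_length(array_bp: int, repeat_bp: int, n_spacers: int) -> float:
    """Mean spacer length implied by an array of n_spacers spacers.

    Assumes the array is repeat-spacer-...-repeat, i.e. n_spacers + 1 repeats.
    """
    n_repeats = n_spacers + 1
    return (array_bp - n_repeats * repeat_bp) / n_spacers


# (array length in bp, assumed repeat length in bp, number of spacers)
arrays = {
    "Metallosphaera cuprina Ar-4": (12176, 25, 189),
    "Methanococcus voltae A3": (11632, 31, 171),
    "Methanotorris igneus Kol5": (85, 31, 1),
}

for strain, (array_bp, repeat_bp, n_spacers) in arrays.items():
    print(strain, round(mean_spacer_length(array_bp, repeat_bp, n_spacers), 2))
```

Spacer lengths in the range of roughly 20 to 40 bp would be consistent with the array lengths reported for these three strains.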

Novel CRISPRs
A total of 33 novel CRISPRs are seen across many of the species, in addition to the CRISPR clusters that appear in both the CRISPRFinder output and the database. These CRISPRs satisfy the criteria for forming CRISPR-like units and are displayed under the category of confirmed CRISPRs in the CRISPRFinder output (Tables 1a and 1b). The majority of the novel CRISPRs are found in the methanogens. Because these CRISPRs are not included in the database yet satisfy the criteria for a CRISPR, they are also labeled as confirmed for further investigation. To examine the conservation of the novel CRISPRs, we aligned them using ClustalW and constructed a circular phylogram using the Interactive Tree of Life [39]. From Figure 1, it is evident that even though the novel CRISPRs are conserved along the [...] (Figure 1) to observe its distribution associated with the novel CRISPRs.

Palindromicity in the sequences
Repeat sequences in different CRISPR loci are not completely conserved. However, certain partially conserved sequences have been detected, such as the GTTTG/C motif at the 5' end and the GAAAC motif at the 3' end of the direct repeat, which impart a partially palindromic character to the direct repeat unit. Some of the direct repeats of Thermococcus sp. CL1 display palindromicity in their sequences, as shown below:

• GTTTCAGAACCACATAATGTTTGGAAAC
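The partial palindromicity of the repeat listed above can be checked with a few lines of Python. The sketch counts how many bases at the 5' end pair with the reverse complement of the 3' end, which is one simple way to quantify a terminal inverted repeat; the function names are chosen for this example only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_stem_length(repeat: str) -> int:
    """Number of leading bases that can pair with the trailing bases.

    Position k pairs with position -1-k when repeat[k] equals the
    complement of repeat[-1-k], i.e. when repeat[k] == rc[k].
    """
    rc = reverse_complement(repeat)
    k = 0
    while k < len(repeat) // 2 and repeat[k] == rc[k]:
        k += 1
    return k


repeat = "GTTTCAGAACCACATAATGTTTGGAAAC"  # Thermococcus sp. CL1 repeat from the text
print(repeat == reverse_complement(repeat))  # full palindrome?
print(terminal_stem_length(repeat))          # length of the paired termini
```

For this repeat the 5' GTTTC and the 3' GAAAC are reverse complements of each other, so only the five terminal bases pair and the repeat as a whole is not a perfect palindrome.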
The advantage of having these special motifs in their genomes is that

Phylogenetic approach
The broad distribution of similar CRISPR/CAS systems among various organisms, irrespective of their origin, can be considered the result of horizontal gene transfer during microbial evolution. To examine this, an evolutionary analysis of the direct repeats is carried out using ClustalW.

Similarity between the repeats
To determine the evolutionary relationships and sequence similarities among the direct repeat sequences of the 391 CRISPR loci, the repeat units are aligned using ClustalW. The alignment scores are examined to retain only those sequences that display high sequence similarity with each other. The similarity among the analyzed sequences is drawn in the form of phylogenetic trees, or phylograms, with the aid of MEGA. Alignment of the 391 direct repeat sequences is carried out in two steps. First, all the direct repeat units are aligned; then the sequences with a score above 95 are selected for analyzing the possibility of horizontal gene transfer (Table 2). Figure 2 shows the phylogram for all 391 CRISPRs across the 110 species of archaea, with the novel CRISPRs tagged by a black dot. From the phylogenetic tree of the direct repeats (Figure 2), it is evident that the CRISPRs are well conserved within the archaeal phyla. This supports the view that CRISPRs can be used to infer the evolutionary relationships of the 110 archaeal species. The tree also serves as evidence for horizontal gene transfer across archaeal phyla, as exemplified by Figure 3. For more detailed information, the tree (Figure 2) is colour coded by archaeal phylum. Similarity between species of two different phyla is seen in the direct repeats of F. placidus and Candidatus Korarchaeum (Figure 3). In addition, the program MEME [40] was employed (with default parameters) to identify and analyze the consensus motif from the multiple sequence alignment of all 391 direct repeats (Figure 4), and a total of 33 novel CRISPRs are identified. The consensus motif (Figure 4) was also found in the novel CRISPRs.
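The selection of highly similar repeat pairs can be illustrated with a simplified stand-in for the alignment score. The sketch below does not reproduce ClustalW scoring; it substitutes position-wise percent identity between repeats of equal length and keeps the pairs that exceed the threshold of 95 used above. The repeat names and the two variant sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Position-wise identity of two equal-length sequences, in percent."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def similar_pairs(repeats: dict[str, str], threshold: float = 95.0) -> list[tuple[str, str]]:
    """Return the name pairs whose identity exceeds the threshold."""
    pairs = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(repeats.items(), 2):
        # Unequal lengths would need a real alignment; they are skipped here.
        if len(seq_a) == len(seq_b) and percent_identity(seq_a, seq_b) > threshold:
            pairs.append((name_a, name_b))
    return pairs


toy_repeats = {
    "repeat_A": "GTTTCAGAACCACATAATGTTTGGAAAC",
    "repeat_B": "GTTTCAGAACCACATAATGTTTGGAAAT",  # one substitution relative to A
    "repeat_C": "CTTTCAATCCTCTTTTTGAGATTCAAAC",  # unrelated toy sequence
}
print(similar_pairs(toy_repeats))
```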

Similarity between the spacers
Spacer acquisition occurs between the repeat sequences of the CRISPR array when the host genome is infected by a pathogen. The new spacer is added between the leader sequence and the first repeat unit, with the assistance of the CAS gene products and of a special feature on the incoming genome sequence known as the Protospacer Adjacent Motif (PAM). Spacers are occasionally repeated, sometimes more than once within a cluster, and can also appear in different arrays within the same chromosome. In some cases, interspecies repetition of a particular spacer can also be seen, which may suggest that the same invader can attack two entirely different organisms. In archaea, a few spacers are found to be repeated at different positions in different CRISPRs within the same organism. The euryarchaeal strains Pyrococcus furiosus DSM 3638 and Pyrococcus furiosus COM1 share 187 spacer matches in their CRISPR units, while the crenarchaeal strains Sulfolobus solfataricus 98/2 and Sulfolobus solfataricus P2 show 159 spacer matches in their CRISPR loci. The CRISPR loci of the halophilic archaeon Natronomonas pharaonis DSM 2160 and of its extrachromosomal plasmid PL23 share seven spacer sequences. This repetition of spacer units in different CRISPR clusters suggests that different clusters are activated and integrate these spacers on each encounter with the foreign elements. Although identical groups of spacer-repeat units have been observed in closely related strains, they have not been detected in other species. Among the analyzed CRISPR units of archaea, none shows similarity between spacers except in the halophiles. In the halophiles, some of the spacers of the plasmid CRISPR match spacers of the chromosomal CRISPR of the same strain (Table 3).

On the core CAS proteins
A set of genes known as the core CAS genes (Cas1-Cas6) encodes enzymes, such as helicases and nucleases, that help in the manipulation of the DNA strands. These proteins, which are indispensable for the functioning of the CRISPRs, have also been analyzed phylogenetically for sequence similarity and probable evolutionary homology. The FASTA sequences of the core CAS proteins were obtained from GenBank for all the species of interest. Each CAS protein family is aligned across all the organisms in a single run (for example, the Cas1 proteins of all the species are aligned together). The alignment scores and the distance guide tree produced by the Clustal alignment are analyzed and used to construct the phylogram for each CAS protein family with MEGA. Among the CAS genes, Cas4 and Cas1 are seen in most of the strains (Table 4). Table 5 shows the distribution of CAS genes across the 110 archaeal species. When the core CAS proteins of the CRISPRs under study are aligned, no significant score is obtained for the Cas3 proteins, and the same holds for the Cas6 proteins. A decent similarity is observed in the Cas1, Cas3, Cas4 and Cas6 groups of protein families. The Cas2 genes of A. hospitalis, S. islandicus HVE10/4 and S. islandicus REY15AQ display 100% similarity. The Cas5 genes of S. solfataricus, S. islandicus LS215 and YG5714 share 100% similarity. Many of the CRISPRs share similar spacers within the CRISPR loci and across genomes. A few of the strains share the same spacers in their chromosomal and plasmid genomes. Among the seven species of Pyrobaculum, some share similar DRs (Table 2). Similarity between spacers suggests that the organisms had encountered the same phage or plasmid.
The presence of similar spacers in the CRISPR loci of the chromosomal and plasmid genomes of the same strain, in turn, suggests protection of the genome from degradation owing to the 5' overlap of the repeat. Among the analyzed species, P. aerophilum str. IM2 and Pyrobaculum sp. 1860 share the direct repeat 'GTTTCAACTATCTTTTGATTTCTGG', while P. aerophilum str. IM2, Pyrobaculum sp. 1860 and P. oguniense TE7 share 'CCAGAAATCAAAAGATAGTTGAAAC'. Finally, P. arsenaticum DSM 13514 and P. oguniense TE7 share 'CTTTCAATCCTCTTTTTGAGATTC'. As for spacer similarity, P. arsenaticum DSM 13514 and P. oguniense TE7 have similar spacers within their CRISPR units (Table 3).
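When comparing repeats across genomes, it may help to compare them in a strand-aware manner, since the same repeat can be reported from either strand. The short Python sketch below, with a helper name chosen for this example, shows that the first two Pyrobaculum repeats quoted above are reverse complements of one another, whereas the third is distinct from both.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def same_repeat_either_strand(a: str, b: str) -> bool:
    """True if the two repeats are identical on the same or on opposite strands."""
    return a == b or a == reverse_complement(b)


dr_1 = "GTTTCAACTATCTTTTGATTTCTGG"  # P. aerophilum str. IM2, Pyrobaculum sp. 1860
dr_2 = "CCAGAAATCAAAAGATAGTTGAAAC"  # the above two plus P. oguniense TE7
dr_3 = "CTTTCAATCCTCTTTTTGAGATTC"   # P. arsenaticum DSM 13514, P. oguniense TE7

print(same_repeat_either_strand(dr_1, dr_2))
print(same_repeat_either_strand(dr_1, dr_3))
```

Whether the two orientations reflect array direction in the respective genomes or only the reporting convention of the tool cannot be decided from the sequences alone.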
For the analysis, the direct repeats of all seven Pyrobaculum species were aligned using ClustalW and a phylogenetic tree was constructed (Figure 5). In the phylogenetic tree, the species sharing similar direct repeats are grouped under the same clade. The tree (Figure 5) suggests horizontal gene transfer between the different species, which could have been mediated by plasmids, megaplasmids and even prophages. Horizontal gene transfer plays a critical role in the distribution and evolution of CRISPR loci [41]. The existence of DRs might assist the inclusion of DNA segments by recombination and thus contribute to the evolution of species and their genomic differentiation [30].

BLAST results of CRISPR sequences
The retrieved CRISPRs were subjected to NCBI BLAST. The BLAST results did not show any match between the CRISPRs and the human genome sequences. A CRISPR locus in the P. yayanosii chromosome showed a match with a hypothetical protein sequence of the mushroom Tuber melanosporum Mel28. A portion of the CRISPR in M. voltae A3 showed high-scoring matches with Leptotrichia buccalis DSM 1135. The results also revealed that, within the archaeal domain, Euryarchaeota holds 55.2% of the CRISPRs, followed by Crenarchaeota with 39.9%, archaeal plasmids with 3.8%, and Korarchaeota and Nanoarchaeota with 0.25% each. None of the Thaumarchaeota genomes holds CRISPRs.

Conclusion
A total of 391 confirmed CRISPR loci are detected in the genomes of 110 archaeal species, of which 33 are novel groups that are not recorded in CRISPRdb. The 5' and 3' palindromic motifs, which underlie the name of this defense system, can pave the way for further understanding of the RNA-based CRISPR mechanism. The direct repeats of the CRISPRs may be considered products of horizontal gene transfer, since they show a phylogenetic relationship among some distantly related species of different genera. A set of core protein data retrieved from the databases, when aligned and phylogenetically examined, displayed a clear picture of the relationships among the species, suggesting that they have undergone horizontal gene transfer. Using the results of the present study, a comparative analysis of the CRISPR contents and their functions across the complete archaeal domain can be carried out to shed light on the similarities and dissimilarities in CRISPR organization. Many CRISPRs share the same spacers within the CRISPR loci, and some share them between organisms. Some of the strains share the same spacers in their chromosomal and plasmid genomes. Such spacers protect the strain from degradation owing to the 5' overlap of the repeat, provided that the spacers are flanked by the same repeats.