Mukti Jaiswal and Anjana Pandey*
Nanotechnology and Molecular Biology laboratory, Department of Biotechnology, University of Allahabad, Allahabad-211002, Uttar Pradesh, India
Received Date: July 14, 2014; Accepted Date: August 18, 2014; Published Date: August 20, 2014
Citation: Jaiswal M, Pandey A (2014) In Silico Mining of Simple Sequence Repeats in Whole Genome of Xanthomonas sp. J Comput Sci Syst Biol 7:203-208. doi: 10.4172/jcsb.1000157
Copyright: © 2014 Jaiswal M, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Microsatellites are currently a major source of genetic markers. In this study, we mined simple sequence repeats (SSRs) in the whole genomes of three Xanthomonas species (Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris) by in silico methods. A total of 640, 377 and 541 SSRs were detected, corresponding to densities of 1 SSR/8.08 kb, 1 SSR/13.10 kb and 1 SSR/9.510 kb for sequence lengths of 5175.554 kb, 4941.439 kb and 5148.708 kb, respectively. Only 32 types (0.618%), 39 types (0.789%) and 96 types (1.864%) of SSR sequences were present in Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris, respectively. Depending on the repeat unit, the length of SSRs ranged from 10 to 18 bp for di-, 12 to 27 bp for tri-, 12 to 24 bp for tetra-, 20 to 35 bp for penta-, 24 to 72 bp for hexa-, 21 to 133 bp for hepta-, 24 to 40 bp for octa- and 27 to 36 bp for nona-nucleotide repeats. Di-nucleotide repeats were the most frequent repeat type (70.97%), followed by tri-nucleotide (22.23%), hepta-nucleotide (2.86%), hexa-nucleotide (1.91%) and tetra-nucleotide (1.392%) repeats across all three species of Xanthomonas. Sequences containing SSRs were also annotated to assign a function to each of them. Most SSR-containing sequences of the Xanthomonas species (77.56%) could not be assigned to any specific class of protein because no homologs were found in the protein sequence database; these could be treated as ideal molecular markers.
Xanthomonas; Microsatellite; Molecular marker; Genome annotation; Protein sequence database; Simple sequence repeat
Simple sequence repeats (SSRs), also called microsatellites or short tandem repeats, are short repeat motifs (1–6 bp) that are present in both protein-coding and non-coding regions of DNA sequences [1-3]. SSRs are highly abundant in prokaryotic genomes and show a high level of length polymorphism due to insertion or deletion of one or more repeat units. SSRs are more abundant in non-coding regions than in exons because the lack of selective constraints prevents the correction of mutations at these loci [6,7]. Moreover, different taxa vary in the abundance of different types of SSRs. Strand-slippage replication is generally considered to be the primary mechanism generating microsatellite polymorphisms [9-12]. Recombination may also act on these sequences by changing repeat number through unequal crossover or gene conversion. SSRs are either mined conventionally [13-17] or identified in database sequences of genomes and expressed sequence tags (ESTs), which represent the expressed part of the genome [18-20].
SSRs fulfil the criteria of ideal molecular markers: they are highly polymorphic, provide reproducible results and are simple to assay. They have also been found useful in numerous DNA- and PCR-fingerprinting experiments for strain typing of a variety of fungi, without prior knowledge of their abundance and distribution in the investigated fungal genomes [21,22], and are useful across a number of related plant species [23,24]. Because of their high mutability, SSRs are thought to play an active role in genome evolution by creating and maintaining genetic variation. The genus Xanthomonas is a diverse and economically important group of bacterial phytopathogens belonging to the gamma subdivision of the Proteobacteria. Xanthomonas axonopodis pv. citri (XAC; syn. X. citri pv. citri) causes Asiatic citrus canker and reduces fruit quality and yield. C. paradisi (grapefruit) and C. aurantifolia (Mexican lime) are the most susceptible in the field, whereas C. reticulata (mandarin/tangerine) and C. sinensis (sweet orange) are relatively tolerant [26,27]. Significantly, no citrus species is resistant to XAC after artificial inoculation, suggesting that there is no true genetic resistance against XAC and that field tolerance is mainly due to variation in growth habit. Xanthomonas oryzae pv. oryzae causes bacterial blight of rice by invading the vascular tissue, which constrains production of this staple crop in much of Asia and parts of Africa. Tremendous progress has been made in characterizing the disease and breeding for resistance. Xanthomonas oryzae pv. oryzae is important both from the standpoint of food security and as a model for understanding fundamental aspects of bacterial interactions with plants. Xanthomonas campestris pv. campestris (Xcc), a Gram-negative aerobic rod, is the causal agent of black rot, which affects crucifers such as Brassica and Arabidopsis. 
The Xcc bacterium also infects weeds, including Arabidopsis thaliana, which has been sequenced and is the model species used in plant research.
The present study mined the whole genomes of three Xanthomonas species (Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris) available at the National Center for Biotechnology Information (NCBI) to determine the distribution and abundance of SSRs for marker development and to annotate SSR-containing sequences. We compared the complete genomes of Xanthomonas axonopodis pv. citri str. 306, Xanthomonas oryzae pv. oryzae KACC10331 and Xanthomonas campestris pv. campestris str. 8004 with each other. This comparative genomics approach has greatly accelerated the study of the molecular basis of pathogenicity and virulence of Xac.
Retrieval of whole genome sequences
Complete genome sequences of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris were downloaded from National Center for Biotechnology Information (NCBI) GenBank under accession numbers AE008923.1, AE013598.1 and CP000050.1, respectively.
Detection of SSRs
SSRs were harvested using the SSRIT (Simple Sequence Repeat Identification Tool) software. The minimum SSR length was fixed at 10 bp. SSRs were defined as ≥10 bp di-nucleotide repeats; ≥12 bp tri-nucleotide repeats; ≥16 bp tetra-nucleotide repeats; ≥20 bp penta-nucleotide repeats; ≥24 bp hexa-nucleotide repeats; ≥21 bp hepta-nucleotide repeats; ≥24 bp octa-nucleotide repeats; and ≥27 bp nona-nucleotide repeats. Poly A and poly T repeats were removed using Microsoft Word, as these are not considered SSRs because of their presence at the 3' ends of mRNA/cDNA sequences.
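The length thresholds above can be illustrated with a minimal regular-expression sketch in Python. This is not SSRIT and makes no claim to reproduce its output; the function name `find_ssrs`, the table `MIN_LENGTH_BP` and the test sequence are hypothetical, and the thresholds are read as "greater than or equal to" the stated lengths.

```python
import re

# Minimum SSR length in bp for each repeat-unit size, taken from the text
# (di- through nona-nucleotide). Mononucleotide runs are not searched at all.
MIN_LENGTH_BP = {2: 10, 3: 12, 4: 16, 5: 20, 6: 24, 7: 21, 8: 24, 9: 27}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit, min_len in MIN_LENGTH_BP.items():
        min_repeats = -(-min_len // unit)  # ceiling division
        # A motif of `unit` bases followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves built from a shorter unit,
            # e.g. "AA" (a poly A run) or "GCGC" (really a GC repeat).
            if any(unit % k == 0 and motif == motif[:k] * (unit // k)
                   for k in range(1, unit)):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


# Hypothetical test sequence: (GC)5, (CCG)4 and a poly A run that is ignored.
example = "AT" + "GC" * 5 + "TTA" + "CCG" * 4 + "A" * 12 + "T"
print(find_ssrs(example))
```

Positions are 0-based. Only perfect repeats are reported, so imperfect or compound microsatellites would need a different approach.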
Annotation of SSR containing sequences
Functional annotation of all SSR-containing sequences was determined on the basis of 70% similarity to entries in the non-redundant (nr) protein database. Searches were performed with BLASTX, a variant of the Basic Local Alignment Search Tool (BLAST). The proteins retrieved by the BLASTX similarity search were classified into their respective classes.
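As a rough illustration of the similarity cut-off, the sketch below filters BLAST tabular output (the standard 12-column `-outfmt 6` layout) and keeps the best-scoring hit per query. It assumes that "70% similarity" can be approximated by the percent-identity column, which the text does not state; the function name `best_hits` and the query and subject identifiers are hypothetical.

```python
import csv
import io

# The 12 default columns of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, min_identity=70.0):
    """Map each query ID to its highest-bitscore hit with pident >= min_identity."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["pident"]) < min_identity:
            continue  # below the assumed 70% cut-off
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Hypothetical BLASTX rows for two SSR-containing query sequences.
rows = [
    ["ssr_seq_1", "prot_A", "92.5", "40", "3", "0", "1", "120", "10", "49", "1e-20", "85.1"],
    ["ssr_seq_1", "prot_B", "71.0", "38", "11", "0", "4", "117", "5", "42", "1e-10", "52.0"],
    ["ssr_seq_2", "prot_C", "55.0", "30", "13", "1", "1", "90", "7", "36", "1e-03", "30.2"],
]
example = "\n".join("\t".join(r) for r in rows)
hits = best_hits(example)
print({query: hit["sseqid"] for query, hit in hits.items()})
```

Queries with no hit above the cut-off, such as `ssr_seq_2` here, are absent from the result, which corresponds to sequences that could not be assigned to any class.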
In the present study, the whole genome sequences of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris available at NCBI were searched for microsatellites with a minimum length of 10 bp. Excluding poly A and poly T repeats, a total of 640 SSRs were detected in the whole genome of Xanthomonas axonopodis pv. citri, 377 SSRs in that of Xanthomonas oryzae pv. oryzae and 541 SSRs in that of Xanthomonas campestris pv. campestris. Depending on the length of the repeat unit (1–9 bp), the lengths of SSRs varied from 10 to 133 bp. Figure 1 shows the frequencies of SSRs with di-, tri-, tetra-, penta-, hexa-, hepta-, octa- and nona-nucleotide repeat units. The most frequent repeat type across the three genomes was di-nucleotide repeats (70.97%), followed by tri-nucleotide (22.23%), hepta-nucleotide (2.86%), hexa-nucleotide (1.91%) and tetra-nucleotide (1.392%) repeats. Penta-, octa- and nona-nucleotide repeats occurred at frequencies of 1.10%, 0.55% and 0.36%, respectively, in Xanthomonas campestris pv. campestris, whereas no penta-, octa- or nona-nucleotide repeat was detected in Xanthomonas axonopodis pv. citri or Xanthomonas oryzae pv. oryzae. The average frequency percentages across all three Xanthomonas species are shown in Figure 1.
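The SSR densities quoted in the abstract follow directly from these counts and the genome lengths given there. The short Python sketch below reproduces the arithmetic; the dictionary layout and function names are hypothetical. It also shows that the percentages quoted in the abstract next to the numbers of SSR types match the number of types per 100 kb of sequence, which is an inference from the figures rather than something the text states.

```python
# Genome lengths (kb), SSR counts and numbers of SSR types as reported in the abstract.
GENOMES = {
    "X. axonopodis pv. citri": {"length_kb": 5175.554, "ssrs": 640, "types": 32},
    "X. oryzae pv. oryzae": {"length_kb": 4941.439, "ssrs": 377, "types": 39},
    "X. campestris pv. campestris": {"length_kb": 5148.708, "ssrs": 541, "types": 96},
}


def kb_per_ssr(length_kb, ssrs):
    """Average sequence length (kb) containing one SSR."""
    return length_kb / ssrs


def types_per_100kb(length_kb, types):
    """Number of SSR types per 100 kb of sequence (assumed reading of the percentages)."""
    return 100.0 * types / length_kb


for name, g in GENOMES.items():
    density = kb_per_ssr(g["length_kb"], g["ssrs"])
    type_rate = types_per_100kb(g["length_kb"], g["types"])
    print(f"{name}: 1 SSR per {density:.3f} kb, {type_rate:.3f} types per 100 kb")
```

The computed densities agree with the reported values to within 0.01 kb; the small differences suggest the published figures were truncated rather than rounded.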
The observed frequency of different repeat types comprising the SSRs is presented in Figure 2A-2E and summarized in Table 1A and 1B. SSRs comprised 5 types of di-nucleotide repeats: (GC)n, (CG)n, (TC)n, (GT)n, (CT)n; 24 different types of tri-nucleotide repeats: (CCG)n, (CAG)n, (GGC)n, (CCA)n, (CGG)n, (ACG)n, (CGC)n, (GTG)n, (GCC)n, (GGT)n, (GTT)n, (ACC)n, (TGG)n, (GCG)n, (CTG)n, (CAA)n, (TGC)n, (AGC)n, (TGG)n, (TCC)n, (CAC)n, (GTG)n, (AGA)n and (GCA)n; 19 types of tetra-nucleotide repeats: (CGTG)n, (CGGC)n, (GAAG)n, (ATGC)n, (AGCG)n, (AGCC)n, (GCGG)n, (GCCC)n, (GCTG)n, (CCGG)n, (CCCG)n, (AAGC)n, (GGCA)n, (TGCC)n, (GGCT)n, (GCAC)n, (CGGG)n, (GCCA)n, (CACG)n; 30 different types of hexa-nucleotide repeats: (CATCTA)n, (ACAGCG)n, (GTTGCG)n, (GTAGCG)n, (GGCAGT)n, (GGCAAT)n, (GCTGCC)n, (GGCGTT)n, (CAGGCC)n, (GTAGCT)n, (TTGGCT)n, (CAATGT)n, (GCATGG)n, (TGCTGT)n, (TTGCCG)n, (TTGGAA)n, (CATCTA)n, (ACACCA)n, (CCGCGG)n, (ATGGCC)n, (TCGGAA)n, (ATTGCC)n, (GTCATG)n, (GATGGA)n, (CGATAC)n, (ATGTCG)n, (CGCCAA)n, (TCGCTG)n, (TGTCGC)n, (AGCCAA)n; and 33 types of hepta-nucleotide repeats: (ATTGGCC)n, (CGGGAAT)n, (TGGGGAT)n, (TCGGGAA)n, (GGGATTC)n, (ACGCACA)n, (AATCGGG)n, (GGGATTT)n, (GGGAGTC)n, (GGCGGAT)n, (TTCCCGA)n, (CGATTCC)n, (CGCAAAC)n, (CCAATCC)n, (CCGCTTG)n, (CAACCGC)n, (TAAGCAG)n, (ATCGGGA)n, (ATTCCCA)n, (TTCCCGC)n, (GGTTGCG)n, (CCGATTC)n, (GGGAATG)n, (GATTCGG)n, (GGATTCG)n, (GGGAATC)n, (GGATTGG)n, (GCGTGTC)n, (GGGAAGC)n, (GGACTGC)n, (GGGATTG)n, (GGGATGC)n, (GTTGCGT)n.
Table 1a: Summary of in silico mining of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris.
*Data in parentheses is the percentage value of the repeat type
Table 1b: Summary of in silico mining of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris.
Among di-nucleotide repeats, (GC)n and (CG)n were the most frequent in all species of Xanthomonas. The most abundant tri-nucleotide repeat was (CGC)n, present in X. campestris, followed by (GCC)n, (GCG)n, (CAC)n, (CCG)n, (CCA)n, (TGC)n, (GCC)n, (CAA)n, (TGG)n and (GCA)n; the others were less frequent and spread across all three genomes (Figure 2B). The most abundant tetra-nucleotide repeat was (GAAG)n, present in Xanthomonas campestris; the other tetra-nucleotide repeat types were present at equal frequencies. The hexa-nucleotide repeat (ATGGCC)n occurred twice and was found only in X. axonopodis pv. citri, while the other hexa-nucleotide repeats were present at equal frequencies. Among hepta-nucleotide repeats, (CGATTCC)n occurred four times, (GGGAATG)n three times and (GGGAATC)n twice, all only in Xanthomonas campestris pv. campestris; the rest of the SSRs were found at equal frequency.
Annotation of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris sequences containing SSRs.
To determine the function of SSR-containing sequences, the 32 SSR-containing sequences from Xanthomonas axonopodis pv. citri were annotated against the nr protein database available at http://www.ncbi.nlm.nih.gov. Annotations were available for only 4 (12.5%) of these sequences (Figure 3A); the matching database entries comprised 15 (5.49%) predicted proteins, 6 (2.19%) putative proteins, 124 (45.42%) hypothetical proteins and 128 (46.88%) proteins belonging to different functional classes. Most sequences, 28 (87.5%), could not be assigned to any specific class because no homolog was present in the protein sequence database. Likewise, the 39 SSR-containing sequences of Xanthomonas oryzae pv. oryzae were annotated against the nr protein database. Annotations were available for only 10 (25.64%) of them (Figure 3B); the matching entries comprised 59 (7.54%) predicted proteins, 22 (2.81%) putative proteins, 403 (51.53%) hypothetical proteins and 298 (38.10%) proteins belonging to different functional classes. Most sequences, 29 (74.35%), could not be assigned to any specific class because no homolog was present in the protein sequence database. For Xanthomonas campestris pv. campestris, the 96 SSR-containing sequences were annotated against the nr protein database. Annotations were available for only 28 (29.16%) of them (Figure 3C); the matching entries comprised 88 (8.76%) predicted proteins, 16 (1.59%) putative proteins, 498 (49.60%) hypothetical proteins and 402 (40.03%) proteins belonging to different functional classes. Most sequences, 68 (70.83%), could not be assigned to any specific class because no homologs were present in the protein sequence database. The matched proteins were also searched for SSRs, but no protein was found to contain an SSR.
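The class percentages above are shares of the total number of matching database entries for each genome. A minimal Python sketch of that tally is given below, using the Xanthomonas axonopodis pv. citri counts from the text. The keyword-based `classify` function is a hypothetical illustration of how hit descriptions might be sorted into the four classes; it is not the procedure used in the study.

```python
def classify(description):
    """Assign a hit description to a class by keyword (illustrative assumption)."""
    text = description.lower()
    for label in ("hypothetical", "putative", "predicted"):
        if label in text:
            return label
    return "other functional class"


def shares(counts):
    """Convert class counts to percentages of the total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Counts reported in the text for Xanthomonas axonopodis pv. citri.
xac_counts = {"predicted": 15, "putative": 6,
              "hypothetical": 124, "other functional class": 128}
xac_shares = shares(xac_counts)
for label, pct in xac_shares.items():
    print(f"{label}: {pct:.2f}%")

# Share of the 32 SSR-containing sequences left without any class.
unassigned_pct = 100.0 * 28 / 32
print(f"unassigned: {unassigned_pct:.1f}%")
```

The computed shares agree with the reported percentages to within 0.01, with the small differences again consistent with truncation of the published values.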
In the present study, whole genome sequences of three species of Xanthomonas retrieved from NCBI were mined for SSRs which could be used for designing the markers.
The abundance of the different repeats in the SSRs detected in Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris was variable and not evenly distributed. These results are similar to earlier findings showing that the abundance of different repeats varies extensively depending on the species examined. Because we excluded poly A and poly T repeats, they are under-represented in this study. SSRs with di-nucleotide repeats, followed by tri-nucleotide repeats, were the most abundant in the sequences of all three Xanthomonas species. In terms of the number of distinct repeat types, hepta-nucleotide repeats were the most numerous with 33 (26.82%), followed by hexa- 30 (24.39%), tri- 23 (18.69%), tetra- 19 (15.44%), di- 7 (5.6%), penta- 6 (4.8%), octa- 3 (2.43%) and nona-nucleotide repeats 2 (1.62%). Earlier studies on E. coli and chlamydial strains reported a characteristic SSR distribution with a marked relative abundance of tri- and hexa-nucleotide repeats. Microsatellites contribute to the genomic variability of prokaryotes even in closely related genomes, as supported by the observed difference in SSR distribution patterns between related genomes. It has been shown that Chlamydia-specific Pmp proteins are candidate sequences in which SSRs can contribute to genetic variation in the strains examined. (AT)n and (CT)n are the most common repeat motifs in fungi, plants and insects. In our study (GC)n and (CG)n repeats were abundant, while no (AT)n and (CT)n repeats were detected. Smaller repeat motifs were predominant among the SSRs identified, and occurrence decreased as the length of the repeat unit increased. This may be because longer repeats have higher mutation rates and are therefore less stable.
It has been proposed that numerous SSRs are hot spots for recombination [34,35]; di-nucleotide repeats in particular are preferential sites for recombination because of their high affinity for recombination enzymes. As molecular markers, di-nucleotide repeats are more important than other SSRs and are among the most sought-after markers because of their higher mutation rates. SSRs may affect DNA replication and also play an important role in the regulation of gene activity. Some SSRs, found in upstream activation sequences, serve as binding sites for a variety of regulatory proteins [38,39]. In addition, the presence of repeated sequences within proteins has been detected in all organisms examined.
The SSR-containing sequences that matched protein-coding sequences were as follows: 4 (12.5%) in Xanthomonas axonopodis pv. citri, 10 (25.64%) in Xanthomonas oryzae pv. oryzae and 28 (29.16%) in Xanthomonas campestris pv. campestris. The annotations for the proteins matched by these SSR-containing sequences were categorized into different classes of proteins (predicted, hypothetical, putative and others). Because no homolog was present in the protein sequence database, the remaining SSR-containing sequences, i.e. 28 (87.5%), 29 (74.35%) and 68 (70.83%) for Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris, respectively, could not be assigned to any specific class, and no obvious function has yet been assigned to them.
It can therefore be concluded that the most frequent repeat type within the whole genomes of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris was di-nucleotide repeats (70.97%), followed by tri-nucleotide (22.23%), hepta-nucleotide (2.86%), hexa-nucleotide (1.91%) and tetra-nucleotide (1.392%) repeats.
It can also be concluded that the non-coding SSR-containing sequences of Xanthomonas axonopodis pv. citri, Xanthomonas oryzae pv. oryzae and Xanthomonas campestris pv. campestris, numbering 28 (87.5%), 29 (74.35%) and 68 (70.83%), respectively, could serve as ideal molecular markers. The in silico approach used here saves both cost and time, provides quantitative data on the distribution of SSRs in the whole genome, and yields information for designing molecular markers for use in various studies.
Prof. Anjana Pandey of Department of Biotechnology, University of Allahabad, is gratefully acknowledged for continuous support and encouragement.