Sablok G * and N.S.Shekhawat
Computational Unit, Biotechnology Center, Jai Narain Vyas University, Jodhpur-342033.
Received Date: September 27, 2008; Accepted Date: November 20, 2008; Published Date: December 26, 2008
Citation: Sablok G, Shekhawat NS (2008) Bioinformatics Analysis of Distribution of Microsatellite Markers (SSRs) / Single Nucleotide Polymorphism (SNPs) in Expressed Transcripts of Prosopis Juliflora: Frequency and Distribution. J Comput Sci Syst Biol 1:087-091. doi: 10.4172/jcsb.1000008
Copyright: © 2008 Sablok G, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Computer Science & Systems Biology
The present paper aims to identify the distribution of Microsatellite markers (SSRs)/Single Nucleotide Polymorphism (SNPs) as resource tools for the analysis of interspecies hypervariability in Prosopis spp. Microsatellites are ubiquitously repeated extension of 1-5 bp motif extended throughout the genome. The dbEST division of Genbank contains 1467 ESTs of Prosopis juliflora which have been utilized for the development of genic microsatellite markers (SSRs). Locally installed assembling program was used for the cluster analysis of the EST. Analysis was performed on a Linux Cluster system. The analysis of the putative unigenes has been shown to have most abundant motif of A/T followed by dinucleotide AG/CT and trinucleotide repeat AAG/CTT. EST-derived SNPs are becoming the resources for the development of SNP markers. The relative rates of development of the high throughput computational methods for the detection of SNPs (Single Nucleotide Polymorphism) and small indels (insertion / deletion) has gained wide applications in the field of the molecular markers.
Prosopis juliflora; insilico analysis; Expressed Sequence Tag (EST); Microsatellite markers; SNP Markers
Microsatellites, often abbreviated as simple sequence repeats (SSRs), are 1 to 5 bp repeat mutation prone motifs. Due to mutation prone specific and high rate of polymorphism, microsatellites (SSR) have become a modern genetic resource tool for the genetic mapping studies, developing and establishing new patterns of molecular breeding and in inferring the phylogenetic and comparative genome analysis. These repetitive stretches are distributed in the coding and non-coding region with a bias towards the noncoding regions. The variation of SSR is controlled by a complex two-level system allowing changes in the number of repeat units and changes in the repeat sequences by point mutations (Sharopova Natalya, 2008). Simple Sequence Repeats have attracted relatively more attention because of their abundance in plants genome, reproducibility, high level of polymorphism, co-dominant inheritance and hypervariability. The ‘hypervariability’ can be documented and supported by illustrating the fact that the SSR differs in the number of repeats from genotype to genotype making them as the best suitable model genetic markers. The linkage maps of nuclear genomes are constructed through the analysis of di-, tri- or tetra-nucleotide SSRs. Single nucleotide repeats have been used in the population genetic analyses of chloroplast genomes (Powell et al., 1995). SSR are also finding applications in the field of comparative genomics where SSR of one species are finding the utilization in the (genetic mapping of the other species. Cordeiro et al., 2001; Peakall et al., 1998; Rallo etal., 2003). The frequency of the SSR reported is 1 SSR /6 kb in plant genomes (Cardle et al., 2000). Recent investigation and analysis on several plant genomes have also demonstrated that the frequencies of SSRs were significantly higher in ESTs than in genomic DNA (Morgante et al., 2002). Bioinformatics is currently acting as a resource tool for the development of SSR with the availability of the genomic sequences and Expressed Sequence Tags (ESTs) as molecular databases. Expressed Sequence Tags (ESTs) are currently the most widely sequenced nucleotide commodity in the terms of number of sequences and total nucleotide count. These provide a robust sequence resource that can be used for genome annotation, gene discovery and comparative genomics (Rudd, 2003). The concept of using cDNA as a route to expedited gene discovery was first discovered in early 1980s (Putney et. al., 1983). (Brenner 1990) proposed that an obvious method for characterizing the important part of human genome will involve looking at messengers from the expressed genes thus advocating the application of highthroughput methods for transcriptome sampling. Mark Adams (1991) first used the term EST in relation to gene discovery and the Human Genome Project. The first step in deriving a gene index from an EST set is to remove redundancy by clustering ESTs representing the same native transcripts (Kalyanaraman et al., 2003). This strategy is implicit in DNA assembly tools such as PHRAP (http:// www.phrap.org/), TIGR Assembler (Sutton et al., 1996) or CAP3 (Huang and Madan, 1999) that are widely used for EST clustering. The Nucleotide Redundancy Index is calculated at the whole organism level, and represents the ratio of the total length of all clusters for that organism to the total length of all ESTs for that organism; at the extreme case, a value of 1 would indicate that every cluster was a singleton containing one EST. This measure should provide some indication of when newly sequenced ESTs are not finding new gene products. In this study we screened SSRs based upon ESTs that were publicly available forProsopis species. We evaluated different motifs of mono, di-, tri-, tetra-, and pentanucleotide SSRs for variation in length and location, relative abundance, distribution for further usage as molecular markers. We have also analyzed the distribution of the SNP and the frequency of the SNP distribution.
All the Prosopis ESTs (1467) used are downloaded from the public database NCBI dbEST division (http:// www.ncbi.nlm.nih.gov/dbEST/index.html) in May of 2008. ESTs downloaded from dbEST division (http:// www.ncbi.nlm.nih.gov/dbEST/index.html) of NCBI were analyzed for the nucleotide redundancy index and for putative gene mining. The ESTs of Prosopis used were directly downloaded from the NCBI dbEST database. Cluster analysis was employed to define groups of overlapping EST sequences and by so doing eliminate EST redundancy while simultaneously improving the length of sequences in the local database. This type of clustering also assists in the estimation of genome coverage and SSR frequency per genome. ESTs were respectively clustered using PHRAP obtained under an academic License. The putative unigenes were saved as dual resource databases in FASTA-formatted files for SSR detection. The number of sequences per cluster varied widely. All unigene databases were used to identify and characterize SSRs using a Perl Script for SSR identification. The minimum repeat unit was defined as ten for mononucleotides, six for dinucleotides, and five for all the higher order motifs including tri-, tetra-, penta-, and hexanucleotides The FASTA-formatted sequence file was allowed to search each sequence for all possible combination of mononucleotide, dinucleotide, trinucleotide, tetranucleotide, and pentanucleotide repeats. The SNP were identified using the alignment with the consensus sequence and in house perl script was used to validate it.
Occurrence and Frequency Analysis of Different SSR Motifs and SNP Distribution
The increasing numbers of genomic and expressed sequences are acting as valuable resources for developing a new class of molecular markers. These data sets have been used to study comparative genomics, assigning gene function and in protein evolution. Comparative genome analysis requires the same sets of genes (i.e. cross-reference genes) to be mapped to chromosomes in the species compared. Thus, comparative maps with sets of expressed sequence tag (EST) derived markers (i.e. cross-species markers) are essential for comparative genome analysis. Several studies have utilized publicly available ESTs to mine simple sequence repeats (SSRs) or microsatellites markers for plants and human. The EST derived SSR markers (ESTSSRs) have proved very useful for construction of genetic and comparative maps. Cluster analyses were performed on Prosopis ESTs using the PHRAP program. A total of 1467 Prosopis ESTs analyzed fell into 953 putative Unigenes showing a redundancy of 55.62 %. The 953 putative Unigenes have a total base pair of 402624 bp were then used as a source database in FASTA-formatted text files for the for SSR identification. The fasta formatted files were allowed to search in the clustered databases of Prosopis juliflora leaf cDNA constructs for mono, di-, tri-, tetra-, and penta-nucleotide SSRs respectively. The results depicted 199 EST- SSRs in the 179 putative unigenes out of 953 showing an occurrence of 18.78 % with the frequency of 1 SSR/2.021 kb. The eSSR’s identified in the present study is clustered into 10 different types of motifs in Prosopis juliflora leaf cDNA constructs. The frequency distribution of 10 different types of motifs has been shown (Figure 1).
The analysis was again performed on the Unigene dataset for the identification of the potent SSR by removing the Poly A tail which are added during post transcriptional modifications. The Unigene dataset was trimmed using trim- EST program of EMBOSS package and the results showed a remarkable decrease in the SSR frequency. Earlier the SSR frequency was 1 SSR /2.021kb but now the frequency changed to 1 SSR/5.20kb indicating the presence of the detection of the Poly A tail as Mononucleotide repeat motif. The frequency Distribution has been shown after trimming (Figure 2). The number of SSR falls from 199 to 77 showing an occurrence of 7.24% in a total 401042 bp analyzed. For the identification of the putative SNP we have used CAP3 and then on the basis of alignment with the consensus sequence we have identified 2284 SNP sites and 1008 indel polymorphisms with frequency 1.60 SNPs / 100 bp. The reported frequency of SNP has also shown the smallest contig containing the concatenation of the two EST .The relative rate of the transition and transversion were 589 and 687 respectively. To identify the SNP frequency the concatenation of the 795 sequences were done to generate a consensus sequence of 142711 bp. The distribution of the EST concatenation is shown (Figure 3). The SNP distribution is shown (Figure 4).
This work is done to implement the clustering algorithm on the identification of the SSR in the Prosopis juliflora spp. The work described the identification of the SSR and the removal of the Poly A tail in the identification of true mononucleotide tracts. The work also helps in the identification of SNP as a resource tool for further usage as a molecular marker.
We thank Biotechnology Center, Jai Narain Vyas University for providing the computational facilities. This work was supported by a grant from Department of Science and Technology, Rajasthan, India.
Make the best use of Scientific Research and information from our 700 + peer reviewed, Open Access Journals