VMD: Viral Microsatellite Database-A Comprehensive Resource for all Viral Microsatellites

Microsatellites are short DNA sequences consisting of tandem repetitions of a motif 1-6 nucleotides long. They are found in all known genomes and play a significant role in many fields, including DNA fingerprinting, population studies, forensics, paternity studies, gene regulation, genetic disorder studies, and genome evolution. They are extensively used as genetic markers for identifying pathogenic bacteria and viruses. Moreover, they have been found to be associated with the plasticity, adaptation and virulence of bacteria and viruses. In silico analysis of microsatellites in various viruses could reveal many interesting facts about their evolution and adaptation. To the best of our knowledge, there is no comprehensive database dedicated to viral microsatellites that extracts all types of microsatellites with flexible extraction options. In this paper, we describe a relational database named Viral Microsatellite Database (VMD). VMD currently hosts the microsatellites of around 3500 viral genomes along with their alignments, locus information, imperfection information, protein information, etc. The database can be accessed and downloaded free of charge for academic and research purposes from http://www.mcr.org.in/vmd.


Introduction
Microsatellites, also called Simple Sequence Repeats (SSRs) or Short Tandem Repeats (STRs), are tandem repeats of motifs 1-6 nucleotides long (mono- to hexanucleotides), e.g. (T)9, (ATGC)4, (ATCGAT)3 (Schlotterer, 2000). Microsatellites can be classified into three main categories: (i) perfect, (ii) imperfect and (iii) compound (Merkel and Gemmel, 2008). A 'perfect' microsatellite tract is a tandem repetition of exact copies of a particular motif, e.g. CAGCAGCAGCAG. Here, the motif 'CAG' is repeated 4 times, represented as (CAG)4. Perfect repeats are often interrupted by mismatches such as insertions, deletions or substitutions, resulting in an 'imperfect' microsatellite tract. E.g. CAGCTGCAGCAG is an imperfect microsatellite tract with a substitution A->T at the 5th position. Finally, a 'compound' microsatellite tract is one that contains multiple motifs within the same tract, separated by 0 or more intervening nucleotides.
E.g. (CAG)4 ctgca (GC)3 is a compound microsatellite tract made of two microsatellite tracts with motifs CAG and GC, separated by 5 nucleotides. Over the years, these ubiquitous repeats have attracted great interest from researchers because of their applications and significance in various fields. They are found in all known genomes, including those of bacteria and viruses, and are distributed throughout the genome in both coding and non-coding regions (Toth et al., 2007). These repeats also play an important role in gene regulation and can cause changes in protein products (Li et al., 2004; Martin et al., 2005). Mutations in microsatellite tracts have been implicated in certain neurodegenerative diseases in humans (Tautz and Schlotterer, 1994). Microsatellites are used in several areas such as DNA fingerprinting, forensics, paternity studies and linkage analysis.
Apart from these applications, microsatellites are highly polymorphic because they gain and lose repeat units (motifs) over time, which makes them important in studies of genome evolution (Jarne and Lagoda, 1996). They are thought to be one of the sources of genetic diversity (Kashi and King, 2006). More recently, studies have revealed that microsatellites impart a certain degree of plasticity to bacterial genomes, indicating their significance for pathogen adaptability, virulence and survival (Sreenu et al., 2006). Studies of certain viral genomes have also shown that polymorphism does exist among viruses (Davis et al., 1999) and that microsatellites are useful as markers in epidemiological and virulence studies (Hood et al., 1996). Studies also show that microsatellites play an im- 'Msatfinder'. However, these databases are not specifically designed for viral microsatellites and do not include complete microsatellite information for all available viral genomes, such as repeat statistics, alignments, motif-specific search and coding/non-coding locus information. IMEx-web [unpublished] is another comprehensive resource available to date, from which one can extract microsatellites of all types with varying degrees of imperfection. IMEx-web does not store the microsatellites in a database but extracts the repeats 'on the fly' over the web using the program IMEx (Mudunuri and Nagarajaram, 2007). Storing the repeats in a database and providing a better search facility has clear advantages over extracting the microsatellites with a program each time. A database makes it easy to filter the output, for example to retrieve repeats of a particular type or size, retrieve repeats in coding regions only, sort the results by a particular field, or restrict the output to selected fields. Moreover, VMD is made available for download so that one can write their own SQL queries to extract re-

Database Description
Database architecture
VMD follows a multi-tier architecture in which the data is extracted and curated using various programs and tools at different levels and compiled into a relational database. The database is connected to a flexible, user-friendly web-interface with many easy-to-use options. Perfect and imperfect microsatellites are extracted from all the viral genomes using IMEx (Mudunuri and Nagarajaram, 2007) and their corresponding information is stored in the database. Flanking sequences and compound microsatellites are not stored in the database. As described above, compound microsatellites contain more than one microsatellite tract separated by 0 or more nucleotides. The user sets the maximum allowable distance (dMax) between any two microsatellites. To reduce redundancy, compound microsatellites are derived at query time: the distance between adjacent microsatellites in the database is calculated, and they are reported in the output as a compound microsatellite if the distance is less than or equal to dMax. Similarly, the flanking sequences of the repeats are extracted from the genome sequence only when required. The user sets the flanking limit, which can vary from query to query, so the flanking sequences are extracted 'on the fly' and incorporated in the output. The detailed architecture and design of VMD is shown in Figure 1.
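The query-time derivation of compound microsatellites and flanking sequences described above can be illustrated with a minimal Python sketch. It is not the PHP code used by VMD. The `Repeat` record, the function names and the 1-based inclusive coordinates are assumptions made for illustration, and treating overlapping repeats (negative gap) as adjacent is a simplification.

```python
from typing import List, NamedTuple, Tuple


class Repeat(NamedTuple):
    """Hypothetical record for one stored microsatellite (1-based, inclusive)."""
    motif: str
    start: int
    end: int


def compound_tracts(repeats: List[Repeat], d_max: int) -> List[List[Repeat]]:
    """Group neighbouring repeats whose separation is <= d_max nucleotides."""
    groups: List[List[Repeat]] = []
    current: List[Repeat] = []
    for rep in sorted(repeats, key=lambda r: r.start):
        # Number of intervening nucleotides between the previous repeat and this one.
        if current and rep.start - current[-1].end - 1 <= d_max:
            current.append(rep)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [rep]
    if len(current) > 1:
        groups.append(current)
    return groups


def flanks(sequence: str, rep: Repeat, limit: int) -> Tuple[str, str]:
    """Return up to `limit` nucleotides on each side of a repeat."""
    left_start = max(0, rep.start - 1 - limit)
    left = sequence[left_start:rep.start - 1]
    right = sequence[rep.end:rep.end + limit]
    return left, right


if __name__ == "__main__":
    # Toy sequence: TT (CAG)4 ctgca (GC)3 AA
    seq = "TTCAGCAGCAGCAGCTGCAGCGCGCAA"
    stored = [Repeat("CAG", 3, 14), Repeat("GC", 20, 25)]
    print(compound_tracts(stored, d_max=5))
    print(flanks(seq, stored[0], limit=2))
```

Because only the simple repeats are stored, the same table can serve any dMax or flanking limit a user chooses, at the cost of a small computation per query.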

Microsatellite extraction
All sequenced viral genomes were downloaded from the National Center for Biotechnology Information (NCBI) repository (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). We downloaded both the sequence files (.fna) and the protein information files (.ptt) of all sequenced viral genomes. Microsatellites were extracted from all the sequences using Imperfect Microsatellite Extractor (IMEx) with the following parameters: repeat number (n): mono: 6, di: 3, tri to hexa: 2; imperfection limit per repeat unit (k): mono to tri: 1, tetra and penta: 2, hexa: 3. Microsatellites were extracted from each sequence at three levels of imperfection: perfect microsatellites (0% imperfection), stringent imperfect microsatellites (10% imperfection, i.e. one mismatch per 10 nucleotides) and relaxed imperfect microsatellites (20% imperfection). The corresponding information for each microsatellite, including the repeat motif, start and end co-ordinates, tract length, number of iterations, imperfection percentage in the tract (p%), alignments and nucleotide composition, is also stored in the database.
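To make the repeat-number parameters concrete, the following minimal Python sketch finds perfect microsatellites only, using regular expressions. It is not IMEx and does not handle imperfect tracts. Reading n as the minimum number of iterations per motif size is an assumption, as are the function names and the 1-based inclusive coordinates in the output.

```python
import re
from typing import Dict, List, Tuple

# Repeat-number thresholds from the text, read here as minimum iterations
# per motif size (an assumption): mono 6, di 3, tri to hexa 2.
MIN_REPEATS: Dict[int, int] = {1: 6, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return (motif + motif).find(motif, 1) == len(motif)


def perfect_microsatellites(
    sequence: str, min_repeats: Dict[int, int] = MIN_REPEATS
) -> List[Tuple[str, int, int, int]]:
    """Return (motif, start, end, iterations) for perfect tracts.

    Simplification: matches for one motif size do not overlap, and a tract
    such as AAAAAA is reported once as (A)6, not again as (AA)3.
    """
    sequence = sequence.upper()
    hits = []
    for size, n in min_repeats.items():
        # A motif of `size` bases followed by at least n - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, n - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if not is_primitive(motif):
                continue
            iterations = len(match.group(0)) // size
            hits.append((motif, match.start() + 1, match.end(), iterations))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    print(perfect_microsatellites("TTCAGCAGCAGCAGTT"))
```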

Web-interface
A user-friendly web-interface has been designed to help users extract microsatellite information. For easier navigation, the genomes are organized in alphabetical order so that the user can browse through the genome names easily. VMD search is provided in two modes. Basic mode, with limited options, is designed for beginners: the user selects a viral genome of interest and the type of repeat, and extracts the results.

JCSB/Vol.2 November-December 2009
Advanced mode is for expert users and includes many options to filter the repeats of interest. In addition to the options in basic mode, users of the advanced mode can filter repeats of a particular motif (e.g. CAG, GT, ATG) or of particular size(s) (e.g. mono, tri, penta); restrict the maximum and minimum number of repeat units (number of iterations); find microsatellites in coding regions only, non-coding regions only, or both; sort the results by a particular field; select the fields to be present in the output file; and export the results in various formats. To support primer design for the microsatellites of interest, the popular primer design module Primer3 (Rozen and Skaletsky, 2000) has been incorporated in the web-interface.
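The advanced-mode filters map naturally onto a parameterized SQL query. The sketch below illustrates the idea in Python with an in-memory SQLite database standing in for MySQL. The table name `repeats`, its columns and the toy rows are hypothetical and do not reflect the actual VMD schema or data.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_toy_db() -> sqlite3.Connection:
    """Create a hypothetical repeats table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repeats ("
        "genome TEXT, motif TEXT, start_pos INTEGER, end_pos INTEGER, "
        "iterations INTEGER, region TEXT)"
    )
    rows = [
        ("toy_virus", "CAG", 120, 131, 4, "coding"),
        ("toy_virus", "GT", 40, 47, 4, "non-coding"),
        ("toy_virus", "ATG", 300, 305, 2, "coding"),
        ("toy_virus", "A", 500, 506, 7, "non-coding"),
    ]
    conn.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def advanced_search(
    conn: sqlite3.Connection,
    genome: str,
    motif: Optional[str] = None,
    motif_size: Optional[int] = None,
    min_iter: Optional[int] = None,
    max_iter: Optional[int] = None,
    region: Optional[str] = None,
) -> List[Tuple]:
    """Apply only the filters the caller supplied; sort by start position."""
    clauses, params = ["genome = ?"], [genome]
    if motif is not None:
        clauses.append("motif = ?")
        params.append(motif)
    if motif_size is not None:
        clauses.append("LENGTH(motif) = ?")
        params.append(motif_size)
    if min_iter is not None:
        clauses.append("iterations >= ?")
        params.append(min_iter)
    if max_iter is not None:
        clauses.append("iterations <= ?")
        params.append(max_iter)
    if region is not None:
        clauses.append("region = ?")
        params.append(region)
    sql = (
        "SELECT motif, start_pos, end_pos, iterations, region FROM repeats "
        "WHERE " + " AND ".join(clauses) + " ORDER BY start_pos"
    )
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    print(advanced_search(db, "toy_virus", motif_size=3, region="coding"))
```

Passing user input as bound parameters rather than concatenating it into the SQL string keeps such a search form safe against injection.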

Output options
VMD offers a wide range of output options to help users analyze the results. The results of a query can be exported in HTML, Excel, CSV and text formats, and the user can select more than one of these, so that the results can be post-processed and analyzed in whatever way is convenient. Users can also choose which information should be part of the output, such as repeat number, flanking sequence, coding information, alignments and nucleotide composition. An option to send the results through an automated email delivery module is also provided, so that users need not wait for the system to extract the results.

Implementation
In-house programs written in C have been developed to submit each viral genome automatically as input to the IMEx program for extraction of microsatellites. The database has been constructed using the popular relational database management system MySQL, and the extracted microsatellite information has been stored in various tables in the database using Perl scripts developed in-house. Similarly, the protein information is extracted and stored in the database using in-house C and Perl programs. An Apache web server has been installed on an IBM Xeon server running the Linux operating system. A comprehensive web-interface has been developed using HTML and CSS and hosted on the server. The forms are validated thoroughly using JavaScript. The connection between the web-interface and the MySQL database is handled by the server-side scripting language PHP. The mapping of microsatellite repeats to their corresponding coding/non-coding regions, as well as the extraction of compound microsatellites and flanking sequences, is performed by PHP programs. The Primer3 module was incorporated via CGI-Perl programming.
Given the importance of viruses and their impact in the current era, the Viral Microsatellite Database (VMD) can be a very useful resource for microsatellite studies of thousands of viral genomes.

Availability
The current release (version 1.1, December 2009) archives the microsatellites and corresponding repeat information of 3465 complete viral genomes. Bimonthly updates of the database are planned. VMD is freely available and can be accessed from http://www.mcr.org.in/vmd. The database can be downloaded free of charge upon request, strictly for non-commercial use by academicians and researchers.