Received date: June 08, 2010; Accepted date: June 29, 2010; Published date: June 29, 2010
Citation: Croce O, Chevenet F, Christen R (2010) A New Web Server for the Rapid Identification of Microorganisms. J Microbial Biochem Technol 2:084-088. doi:10.4172/1948-5948.1000029
Copyright: © 2010 Croce O, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords: Identification of microorganisms; RNA sequences; BLAST; Web server; Phylogenetic tree
Microbial organisms are now usually identified by comparing their SSU rRNA gene sequences to those of known organisms (Stackebrandt and Goebel, 1994). The usual application is to study the composition of the microbial community within a given environmental or clinical sample. SSU rRNA gene sequences are obtained (McCabe et al., 1999) either by cloning the PCR products and randomly sequencing a set of clones (Amann et al., 1995; Hugenholtz and Pace, 1996) or by pyrosequencing (Jonasson et al., 2002; Roesch et al., 2007; Huse et al., 2008; Christen, 2008). The question is then whether these sequences are related to other sequences already found in environmental samples, and/or to well-known cultured microorganisms and possibly a type strain (Albuquerque et al., 2009).
The general process starts with a similarity search of the new sequence(s) against the public databases, usually with BLAST (Altschul et al., 1990). The new sequence(s) are then aligned to the most similar sequences, and a classification or phylogenetic analysis is finally performed to reach an identification. Online BLAST servers are a common choice to quickly identify related sequences, but the databases they rely on are now filled with partial sequences from environmental samples that are often poorly annotated. For example, the NCBI BLAST server (www.ncbi.nlm.nih.gov/BLAST/) and the EBI server (www.ebi.ac.uk/blast2/) allow the exclusion of environmental sequences, but these databases still contain many inaccurate descriptions. A similar database restricted to 16S rRNA sequences exists at DDBJ (https://blast.ddbj.nig.ac.jp/top-e.html). As a result, any BLAST query returns many (sometimes only) poorly described sequences (Lin et al., 2008; Clarridge, 2004), and a tedious manual analysis of the BLAST results is necessary to identify closely related, well-described species. Some other tools such as Eztaxon (Chun et al., 2007) are more specific. Eztaxon is associated with a hand-curated database of 16S rRNA gene sequences for bacterial type strains. It allows users to perform similarity-based searches, multiple sequence alignments and various phylogenetic analyses. The Eztaxon database is extremely useful for characterizing a new species before its publication but, being restricted to bacterial type strains, it cannot identify properly deposited cultured species that have not been validated as type strains. It also does not support the identification of protozoa and archaea. Finally, it does not provide the taxonomic information required to construct a fully annotated phylogenetic tree.
Finding similar rRNA sequences using a specific BLAST
The database server described here contains SSU rRNA sequences extracted from the EMBL database. Each entry is parsed with a Python script that checks whether the species definition line contains a proper Latin species name. Descriptions such as ‘uncultured’, ‘env’, ‘Genus sp.’ or ‘genomosp.’ are therefore automatically excluded. Where necessary, SSU rRNA subsequences are extracted from longer genomic fragments. The NCBI taxonomic description is also fully extracted, checked and associated with each sequence. Finally, we use the “List of Prokaryotic names with Standing in Nomenclature” at https://www.bacterio.net (Euzeby, 2008) to identify all the bacterial type strains. The data are stored in a local relational database (MySQL). Several BLAST databases are formatted from these data, for the Bacteria, Archaea and Protozoa divisions. For each division we provide sub-databases restricted to sequences of a minimal length (500, 800, 1000 or 1200 nt), since longer sequences are better suited to resolving deeper branching. The databases often contain many sequences of the same species. We therefore built sub-databases that include only 2 sequences per species, which makes it possible, in each case, to produce a workable phylogenetic tree with proper outgroups. These two sequences are selected from among the longest of each species.
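The filtering and selection steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the script used on the server: the excluded tokens are taken from the examples in the text, while the record layout (accession, description, sequence) and the names `has_proper_species_name` and `select_representatives` are hypothetical.

```python
from collections import defaultdict

# Tokens that mark a poorly described entry (from the examples in the text;
# the variants without a trailing dot are an assumption).
EXCLUDED_TOKENS = {"uncultured", "env", "sp.", "sp", "genomosp.", "genomosp"}


def has_proper_species_name(description):
    """Return True if the description starts with a two-word species name
    and contains none of the excluded tokens."""
    tokens = description.split()
    if len(tokens) < 2:
        return False
    if any(token.lower() in EXCLUDED_TOKENS for token in tokens):
        return False
    return tokens[0].isalpha() and tokens[1].isalpha()


def select_representatives(records, min_length=500, per_species=2):
    """Keep, for each species, the accessions of the longest sequences.

    records: iterable of (accession, description, sequence) tuples.
    """
    by_species = defaultdict(list)
    for accession, description, sequence in records:
        if len(sequence) >= min_length and has_proper_species_name(description):
            species = " ".join(description.split()[:2])
            by_species[species].append((accession, sequence))

    selected = {}
    for species, entries in by_species.items():
        # Longest sequences first, then keep only the requested number.
        entries.sort(key=lambda entry: len(entry[1]), reverse=True)
        selected[species] = [accession for accession, _ in entries[:per_species]]
    return selected
```

Raising `min_length` to 800, 1000 or 1200 would correspond to the other sub-databases mentioned above.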
Two different web interfaces are offered: the NCBI BLAST default interface (Ye et al., 2006; McGinnis and Madden, 2004) and the ViroBLAST interface (Deng et al., 2007), which extends the utility of BLAST to query against multiple sequence databases and user sequence datasets. ViroBLAST also offers a user-friendly output that makes BLAST results easy to parse and navigate. Because the database contains a restricted number of sequences, a BLAST search completes in only a few seconds (we have limited the maximum number of BLAST hits to 500).
Recovering information and display using Blast2Tree
Recovering an informative taxonomy for a sequence is often very difficult with conventional BLAST servers. The definition line of the FASTA format is too long and is therefore often truncated by phylogeny programs such as Phylip (Felsenstein, 1989) or Clustal (Larkin et al., 2007). Moreover, this line contains no information about the taxonomic assignment. The online tool Blast2Tree is intended to solve these problems by making it easy to retrieve sequences and their associated information. A simple copy/paste of any number of lines from the BLAST hits returns the corresponding SSU rRNA sequences (in two different formats) as well as the associated taxonomy. This can be done several times, from different parts of the same BLAST result or from different queries. The final list can be hand-edited, and the selection can be viewed as a table showing the complete taxonomic assignments. A phylogenetic tree can be drawn on screen and exported before or after data retrieval. This interface also accepts data copied and pasted from Eztaxon.
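As an illustration of the copy/paste step, the sketch below parses pasted hit lines into records keyed by accession, so that repeated pastes do not produce duplicates. The line layout (`accession|description  score  e-value`) is an assumption based on the example output in Table 2, and the function names are hypothetical; the parsing actually used by Blast2Tree is not described in the text.

```python
def parse_hit_line(line):
    """Parse one pasted BLAST hit line, or return None if it is not a hit."""
    parts = line.split()
    if len(parts) < 3:
        return None
    head = " ".join(parts[:-2])
    accession, separator, description = head.partition("|")
    if not separator or not accession:
        return None
    evalue_text = parts[-1]
    if evalue_text.lower().startswith("e"):
        # BLAST sometimes prints e-values such as "e-100" without a mantissa.
        evalue_text = "1" + evalue_text
    try:
        score = float(parts[-2])
        evalue = float(evalue_text)
    except ValueError:
        return None  # header or other non-hit line
    return {
        "accession": accession.strip(),
        "description": description.strip(),
        "score": score,
        "evalue": evalue,
    }


def collect_hits(pasted_text, hits=None):
    """Add the hits found in pasted_text to an accession-keyed dictionary."""
    hits = {} if hits is None else hits
    for line in pasted_text.splitlines():
        hit = parse_hit_line(line)
        if hit is not None:
            hits.setdefault(hit["accession"], hit)
    return hits
```

Because the accession alone serves as the key, it can also be used as a short sequence name for programs that truncate long definition lines, with the full description kept in the table.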
Downloads of the following files are possible:
1. The exact SSU rRNA sequences (FASTA format), even when such sequences are embedded into a larger sequence (such as complete genomes). It is possible either to download sequences as selected from the BLAST result, or to download a clade of sequences (such as every sequence in a genus or a given species).
2. The annotations (tlf format). These annotations comprise the complete taxonomic description, the name of the strain if available and an indication of whether it is a type strain. The tlf format is a simple ASCII file that can be used by software such as TreeDyn (https://www.treedyn.org) or ScripTree (https://atgc.lirmm.fr/scriptree/).
3. The full taxonomy (HTML format) is sorted under similar terms. The taxonomical terms can be retrieved from the pasted BLAST hits or from keywords entered into the form.
4. A phylogenetic tree (Newick format). Blast2Tree is able to automatically build a phylogenetic tree based on MAFFT software (Katoh et al., 2009). MAFFT is a program for multiple sequence alignment. It offers various multiple alignment strategies, and we integrated the fastest one (although the least accurate), which is a simple progressive method like Clustal. The detailed algorithms are described by Katoh et al. (2002). The automatic tree building is also very fast: it takes less than 40 seconds for 500 leaves, 7 seconds for 100 leaves and 3 seconds for 10 leaves. Users can also upload their own tree.
5. An image of a phylogenetic tree (jpeg, ps or tiff formats). Blast2Tree integrates the scripting language “ScripTree” to generate high-quality images of phylogenetic trees. A single click is enough to display a downloadable image. This image includes the sequence to identify, the set of sequences selected from the BLAST results and the related annotations.
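The first download option, an exact SSU rRNA sequence cut out of a longer entry, can be illustrated with a short sketch. It assumes that the gene coordinates are already known and given as 1-based inclusive positions, as in EMBL feature tables; the function names are hypothetical and the server's own extraction code is not described here.

```python
# Complement table for DNA; other characters (e.g. N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_subsequence(genome, start, end, strand="+"):
    """Return genome[start..end] (1-based, inclusive).

    For strand "-" the reverse complement of that region is returned.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates outside the sequence")
    fragment = genome[start - 1:end]
    if strand == "-":
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment


def to_fasta(identifier, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [">" + identifier]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

A short identifier such as the accession number keeps the FASTA definition line within the limits of programs that truncate sequence names.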
We compared the results of the common BLAST servers (NCBI BLAST, EBI WU-Blast2, DDBJ BLAST) with those of our web server. As an example, we took a sequence that is described as an uncultured bacterium in the related NCBI entry (Table 1). The objective was to determine which known species are most closely related to the input sequence. With similar options on each server, the results are very different:
1. Most of the 100 similar sequences returned by the NCBI BLAST server are annotated as “Uncultured...”. NCBI BLAST recently added a feature to exclude environmental sample sequences. With this option enabled the results improve, but they remain unsuitable because of the number of vague descriptions (e.g. many descriptions with “Enterobacter sp.”).
2. WU-Blast2 on EBI server with the database “embl release” gives results similar to NCBI. Improved results can be obtained by selecting the database “embl standard prokaryotes”, the most frequent sequences retrieved being “Enterobacter sp” blended with various species. Although sequences from the environment are not included in this database, many sequences are still poorly described.
3. BLAST on DDBJ offers a database of 16S rRNA sequences of prokaryotes, which behaves very similarly to the “embl standard prokaryotes” database at EBI.
4. On our server, we selected the database containing bacterial 16S rRNA sequences with only 2 sequences for each species. The first results clearly show that the closest species are from the genera Kluyvera and Enterobacter (Table 2).
Table 1: Sequence [EMBL: GU084214] described as an uncultured bacterium. This sequence is taken as an example of an “unknown sequence” to highlight the differences between the results of the usual online BLASTs (EBI, NCBI, DDBJ) and our BLAST server.
|Sequences producing significant alignments:||Score||E-value|
|AY567708|Candidatus Cuticobacterium kirbyi||1356||0.0|
Table 2: Example of BLAST output from https://bioinfo.unice.fr/blast. The BLASTed sequence is [EMBL: GU084214] (the target sequence to identify) and the database used is “2 Seq/Bacteria”. Only the first 33 most similar sequences are displayed.
This example highlights that none of the common public BLAST servers are able to easily retrieve either a set of sequences with an informative and complete taxonomy or subsequences embedded in a larger genomic fragment. Only the use of a dedicated database returns meaningful BLAST results that are required for an easy identification of microbial rRNA sequences.
Figure 1 shows the phylogenetic tree built with Blast2Tree. This tree includes the sequences retrieved from our server and the sequence to identify. The default view displays the tree with, for each leaf, the name of the species, the strain (with a superscript “T” if the strain is a type strain) and the accession number. In this figure, the sequence to identify (named “New.sequence”) is clearly related to the genus Kluyvera, and the figure is almost ready for publication. The user can also refine the analysis, for example by adding all sequences of Kluyvera.
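The leaf labelling described above (species, strain with a type-strain marker, accession number) can be illustrated by rewriting the leaf names of a Newick string. This is a minimal sketch that assumes unquoted leaf names in the input tree; the annotation dictionary, the plain-text marker "(T)" standing in for the superscript, and the function names are all hypothetical and do not reflect how ScripTree renders the image.

```python
import re

# An unquoted leaf name directly follows "(" or "," in a Newick string.
_LEAF = re.compile(r"(?<=[(,])([^():,;'\s]+)")


def make_label(species, strain, accession, is_type_strain):
    """Build a display label: species, strain (marked if type strain), accession."""
    strain_part = strain
    if is_type_strain:
        strain_part = strain + "(T)"
    return " ".join(part for part in (species, strain_part, accession) if part)


def relabel_newick(newick, annotations):
    """Replace annotated leaf names by quoted display labels.

    Leaves without an annotation (e.g. the query sequence) are kept as-is.
    """
    def replace(match):
        name = match.group(1)
        if name not in annotations:
            return name
        # Newick quoted labels escape a single quote by doubling it.
        label = make_label(**annotations[name]).replace("'", "''")
        return "'" + label + "'"

    return _LEAF.sub(replace, newick)
```

Internal node labels and branch lengths are not touched, since they follow ")" or ":" rather than "(" or ",".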
Besides these common BLAST servers, some projects based on specific rRNA databases include tools intended to help identify prokaryotic sequences.
1. The Greengenes web application (DeSantis et al., 2006) gives access to a database of aligned 16S rRNA sequences (https://greengenes.lbl.gov). It allows users to export, slice, browse and compare the sequences, and to search for probes through many graphical tools. The data and tools presented by Greengenes aim to assist the researcher in choosing phylogenetically specific probes, interpreting microarray results, and aligning/annotating novel sequences. These tools also make it possible to compare a given sequence against the Greengenes database, using either a specific BLAST or a tool called “Simrank” based on shared 7-mers. They display an interactive table of BLAST or Simrank hits arranged by taxonomy.
2. The Ribosomal Database Project (RDP) (Cole et al., 2009) provides ribosome-related databases and online tools for data analysis to the scientific community (https://rdp.cme.msu.edu). As of May 2010 (release 10.20), RDP maintains 1,237,963 aligned and annotated 16S rRNA sequences of Archaea and Bacteria, including sequences from cultured organisms and sequences obtained from environmental samples. Among the available services, RDP provides a classifier that assigns 16S rRNA gene sequences to the new phylogenetically consistent higher-order bacterial taxonomy using a naïve Bayesian classifier (Wang et al., 2007). This service lets the user select a set of sequences according to relevant criteria such as sequence length or sequences from isolates only. It is then possible to build a crude online tree, which can be downloaded in Newick format.
3. Silva is an online resource that provides comprehensive, quality-checked and regularly updated databases of aligned 16S, 18S, 23S and 28S ribosomal RNA sequences for Bacteria, Archaea and Eukarya (Pruesse et al., 2007) (https://www.arb-silva.de). The project “The All-Species Living Tree” is associated with the Silva databases. This project aims to reconstruct a single 16S rRNA tree harboring all sequenced type strains of the hitherto classified species of Archaea and Bacteria (Yarza et al., 2008). Silva also offers phylogenetic classification linked to the ARB software (Ludwig et al., 2004) (https://www.arb-home.de). An online form helps to align the sequence to identify with the closest sequences included in the Silva database. A file is created (fasta+metadata or ARB file format) and can be used with the ARB software for further phylogenetic studies and for the visualization of the tree. The ARB software package has to be installed on the system and is not available for all operating systems (e.g. Windows). Note that an independent BLAST server exists which uses a part of the Silva data (https://www.sepsitest-blast.de). SepsiTest has similarities with our BLAST server, but its database includes only sequences of type strains. We have nevertheless added the possibility to paste the BLAST output of SepsiTest into Blast2Tree.
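To give an idea of what a comparison "based on shared 7-mers" involves, the sketch below scores reference sequences by the fraction of the query's distinct 7-mers that they contain. It is not the Simrank implementation, whose exact scoring is not described here; the scoring rule and the function names are assumptions made for illustration.

```python
def kmers(sequence, k=7):
    """Return the set of distinct k-mers of a sequence (case-insensitive)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(query, reference, k=7):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)


def rank_references(query, references, k=7):
    """Sort (name, score) pairs by decreasing shared k-mer fraction.

    references: dict mapping a name to its sequence.
    """
    scores = [(name, shared_kmer_fraction(query, sequence, k))
              for name, sequence in references.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Such a score needs no alignment, which is why k-mer methods are typically fast, but it says nothing about where in the sequence the shared words occur.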
The final phylogenetic tree should not include sequences that are not clearly named at the species level. Depending on the database provided by the tools described above, or on the skill of the user, it can be very tedious to avoid sequences annotated with uninformative terms such as ‘Candidatus’, ‘unclassified’ or ‘environmental samples’. Moreover, the user can spend a lot of time obtaining representative annotations in order to have a “ready to print” tree for publication. Finally, it is not easy to obtain a quick overview of a relevant set of closely related sequences for an unknown sequence to identify. Our web server is intended to fill this gap.
Quality of analysis and display
The MAFFT algorithm used to build the trees is very fast but gives results of only average quality. No bootstrap confidence estimates or other statistical analyses are displayed. To obtain a more robust tree, we recommend retrieving the sequences from Blast2Tree and building the tree with classical phylogenetic methods. This procedure takes longer but should provide better results. It is then possible to re-import the tree (in Newick format) into Blast2Tree. For the final tree image, Blast2Tree runs automatic procedures using the scripting facilities of “ScripTree” (www.scriptree.org). Some options allow the user to select which annotations are added to the image. It is also possible to export the generated files from Blast2Tree to the ScripTree software in order to obtain a more customized tree output. Another way to modify the tree is to use the stand-alone software TreeDyn, which is an advanced tool for the graphical management of trees.
Advantages of the system
BLAST searches on the main public servers are not appropriate for an accurate identification of prokaryotes, because it is not easy to find taxonomically meaningful relatives of a query sequence. The main advantages of using a specific server are (i) clear and informative results (every hit is from a well-described species); (ii) fast BLAST computation due to the small amount of data (the sequence databases are much smaller than the public databases); (iii) the possibility to export the true SSU rRNA sequences and taxonomic descriptions, which facilitates phylogenetic analysis and helps the user to integrate new sequences into a meaningful phylogenetic tree.
Existing tools based on specific rRNA databases offer many advantages, but they are not fully suitable for a fast and easy identification of an unknown prokaryote sequence that has to be included in an annotated tree. However, high-quality databases such as those provided by Silva or Greengenes could be used in the future in addition to the database used by our server.
Finally, the online tool Blast2Tree provides a fast and useful way to see how a new sequence is positioned within a tree of relevant, closely related sequences.
Availability and requirements
The server is freely accessible over the Internet at https://bioinfo.unice.fr/blast/. Online help, including tutorials and examples of use, is also available at https://bioinfo.unice.fr/blast/blast2tree/help/. This server for microbial identification has been running for about one year and is used within the framework of a European project. The server and database are available for local installation; interested individuals are invited to contact the authors for more information.
This work was supported by funds from the European Commission for the HEALTHY WATER project (FOOD-CT-2006-036306) to R. Christen. The authors are solely responsible for the content of this publication. It does not represent the opinion of the European Commission. The European Commission is not responsible for any use that might be made of data appearing therein.