Comparison of the Virulence Factors and Analysis of Hypothetical Sequences of the Strains TIGR4, D39, G54 and R6 of Streptococcus pneumoniae

Whole genome sequences of four strains of Streptococcus pneumoniae, the encapsulated TIGR4, D39 and G54 and the nonencapsulated R6, are compared with respect to genome features, whole genome pairwise alignment, gene role category and virulence factors using relevant comparative genomics tools. The study of capsular polysaccharide synthesizing (cps) genes reveals that many cps genes are unique to TIGR4, which indicates the high virulence of TIGR4. Further, the other virulence factors pneumococcal surface protein A, autolysin, hyaluronate lyase, pneumolysin, neuraminidase B and pneumococcal surface antigen A of TIGR4 are closely related to those of the other three strains, and hence the virulence due to these factors seems to be similar among the four strains. The strains differ, however, in neuraminidase A, choline binding protein A and immunoglobulin A1 protease. Also in the present study, 4 and 22 hypothetical protein sequences of TIGR4 and R6, respectively, are predicted as virulence factors. Among those sequences, 8 hypothetical protein sequences of R6, with 7 different functional regions, are found to be related to previously known virulence factors of TIGR4 and R6 of S. pneumoniae.


Introduction
The whole genome sequences of closely related bacterial species or strains are providing new avenues of investigation for understanding microbial diversity, pathogenesis, host-parasite interaction, evolution, etc. through comparative analysis of their genomes. Streptococcus pneumoniae, commonly called pneumococcus (Dowson, 2004; Gregory and DeSalle, 2005), is a human pathogen that causes life-threatening diseases such as pneumonia, bacteremia, meningitis, sepsis and otitis media. Genome sequencing of four S. pneumoniae strains, namely TIGR4, D39, G54 and R6, has been completed, and genome sequencing of 14 other strains is ongoing. The G54 genome sequence has not yet been added to GenBank but is included in the Comprehensive Microbial Resource (CMR), whereas the D39 genome sequence is available in GenBank but not in CMR. TIGR4, a clinical isolate, is encapsulated and highly virulent, and many of its virulence factors have been studied (Tettelin et al., 2001). D39, an encapsulated and virulent strain (Lanie et al., 2007), was used by Avery, MacLeod and McCarty (Avery et al., 1979) in their landmark study on the role of DNA as the genetic material. G54 is an encapsulated clinical strain of type 19F (Dopazo et al., 2001). R6, a derivative of the serotype 2 clinical isolate D39, is nonencapsulated and avirulent. The genes encoding many virulence factors are present in the R6 genome in addition to the genes of capsular biosynthesis (Hoskins et al., 2001).
Many types of comparative studies (Tettelin et al., 2001; Lanie et al., 2007; Hoskins et al., 2001; AlonsoDeVelasco et al., 1995; Brückner et al., 2004; Ferretti et al., 2004; Silva et al., 2006) have already been carried out on various aspects of Streptococcus strains. A preliminary comparative analysis (Jothi et al., 2007) of the whole genomes of the encapsulated TIGR4 and the nonencapsulated R6 strains of S. pneumoniae provided some insights into the high virulence of TIGR4. The present study examines specifically how the whole genomes of the four strains TIGR4, D39, G54 and R6 of S. pneumoniae differ from each other in their genome features, genome diversity, gene role category and virulence factors. Comparison of the virulence factors among these strains can provide further insight into strain-specific features relevant to virulence and can stimulate new approaches to disease prevention and treatment.
S. pneumoniae has two surface layers outside the plasma membrane, namely the cell wall and the capsule. The cell wall has a triple-layered peptidoglycan that holds the capsular and cell wall polysaccharides as well as a few proteins. The capsule completely covers the inner structure of S. pneumoniae. The cell wall polysaccharide is common to all serotypes of S. pneumoniae, but the chemical structure of the capsular polysaccharide is serotype-specific (AlonsoDeVelasco et al., 1995). Since Avery's experiment (Avery et al., 1979), the capsule has been recognized as the major virulence factor of S. pneumoniae. Experimental proof for this was provided by the difference in 50% lethal dose between encapsulated and nonencapsulated strains: encapsulated strains were found (AlonsoDeVelasco et al., 1995) to be at least 10^5 times more virulent than strains lacking the capsule. Certain proteins of S. pneumoniae, such as pneumococcal surface protein A (PspA), autolysin (LytA), hyaluronate lyase (Hyl), pneumolysin (Ply), neuraminidases A and B (NanA and NanB), choline binding protein A (CbpA), pneumococcal surface antigen A (PsaA) and immunoglobulin A1 (IgA1) protease, are important virulence factors (AlonsoDeVelasco et al., 1995; Jedrzejas, 2001; Rigden et al., 2003) and could be used as potential vaccine candidates. The preliminary identification of the surface proteins and virulence factors of S. pneumoniae was done by computational analysis of its genome sequences (Tettelin and Hollingshead, 2004; Gregory and DeSalle, 2005; Tettelin et al., 2001; Hoskins et al., 2001) and continued in several subsequent studies (Brückner et al., 2004; Polissi et al., 1998; Wizemann et al., 2001). Strains of S. pneumoniae are now resistant to commonly prescribed antibiotics such as penicillin, macrolides and fluoroquinolones (Tettelin et al., 2001). Because of this multidrug resistance, a deeper understanding of the virulence factors is needed, and the comparative genomics approach may provide more insight.
At present, the function of only about 70% of the genes in any given genome can be predicted with reasonable confidence (Bork, 2000). The remaining genes are either hypothetical (having no known homolog) or conserved hypothetical (homologous to genes of unknown function), and it is unclear whether they encode actual proteins. The large number of hypothetical protein sequences in completely sequenced genomes makes their study an enormous task. Characterization of these genes or proteins of unknown function is generally recognized as an essential step towards fully understanding the biology of a pathogenic organism and identifying potential targets. A few studies (Galperin and Koonin, 2004; Brown, 2005; Sivashankari and Shanmughavel, 2006) have already been carried out on hypothetical sequences. In the present study, hypothetical protein sequences of the strains TIGR4 and R6 of S. pneumoniae are analyzed for virulence using VirulentPred. The predicted sequences are further examined to determine how closely they are related to previously known virulence factors of TIGR4 and R6 of S. pneumoniae.

Materials and methods
Various analyses of the whole genomes of the four strains TIGR4, D39, G54 and R6 of S. pneumoniae, including whole genome alignment, comparison of gene role categories, location of the virulence factors in the genome and comparison of virulence regions, are carried out using appropriate bioinformatics software tools.

Sequence Retrieval and Whole Genome Pairwise Alignment
The complete genome sequences and the lists of annotated gene and protein sequences of TIGR4, D39 and R6 are retrieved from the NCBI FTP server (ftp://ftp.ncbi.nih.gov/genomes). We used the run-mummer3 program available in the standalone MUMmer 3.20 (http://mummer.sourceforge.net/) and its built-in mummerplot to obtain the whole genome pairwise alignments of S. pneumoniae strains TIGR4, D39 and R6 in different combinations. MUMmer at the Comprehensive Microbial Resource (CMR) is used for the whole genome pairwise alignment of the strains TIGR4, G54 and R6 in different combinations.

Comparison of the Role Category of Genes and Sequence Analysis
The role category pie chart tool in the CMR database (http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi) is used for the comparison of genome features and functional role categories of the whole genomes of TIGR4, G54 and R6. The Bacterial Annotation System (BASys, http://wishart.biology.ualberta.ca/basys), a web server for automated bacterial genome annotation, is used to obtain the role categories for the three strains TIGR4, D39 and R6, whose whole genomes are already available in it. From the prediction server of the Center for Biological Sequence Analysis (CBS, http://www.cbs.dtu.dk/services), the Genome Atlas is used for the analysis of repeats of S. pneumoniae. The sequences of the various virulence factors considered in our study have been verified using the virulence factors database (http://www.mgc.ac.cn/VFs). BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html) is used to compute the sequence composition of the genomes and genes. Further, LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html) is used for the pairwise global alignment of the gene sequences of the strains of S. pneumoniae.
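As a rough illustration of the sequence composition step, the following Python sketch counts the four bases in a nucleotide string and reports the AT and GC percentages. It is not BioEdit's implementation; the function name `base_composition` and the toy sequence are assumptions introduced here for illustration only.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return counts of A, T, G, C plus AT and GC percentages.

    Characters other than A/T/G/C (for example N) are ignored in the totals.
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    total = a + t + g + c
    if total == 0:
        raise ValueError("sequence contains no A, T, G or C")
    return {
        "A": a, "T": t, "G": g, "C": c,
        "AT%": 100.0 * (a + t) / total,
        "GC%": 100.0 * (g + c) / total,
    }


# Toy sequence (hypothetical, not taken from any of the four genomes)
comp = base_composition("ATGCATTTAA")
print(comp)
```

Applied to a whole genome sequence read from a FASTA file, the same function would give the kind of composition values summarized for the four strains.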

Functional Annotation of Hypothetical Sequences
VirulentPred (http://bioinfo.icgeb.res.in/virulent) is an SVM (Support Vector Machine) based method for predicting bacterial virulent protein sequences, which can be used to screen proteomes for virulent proteins. In the present study, this tool is used to analyse the hypothetical sequences of the strains TIGR4 and R6 of S. pneumoniae. From the proteomes of TIGR4 and R6 of S. pneumoniae, all unannotated hypothetical protein sequences are retrieved using a PERL script, and those sequences are used as the data set for virulence factor prediction.
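The retrieval step can be sketched as follows. This is a Python sketch rather than the PERL script used in the study, which is not reproduced in the text. It assumes that hypothetical entries are marked by the literal phrase "hypothetical protein" in the FASTA header; the function names and the toy records are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hypothetical_proteins(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep records whose header contains 'hypothetical protein'.

    Assumption: the annotation uses this exact phrase (case-insensitive).
    """
    return [
        (header, seq)
        for header, seq in read_fasta(lines)
        if "hypothetical protein" in header.lower()
    ]


# Toy proteome (made-up identifiers and sequences)
demo = """>toy1 hypothetical protein
MKTLLV
AAG
>toy2 autolysin
MEINVS
>toy3 hypothetical protein
MSK
""".splitlines()

hits = hypothetical_proteins(demo)
for header, seq in hits:
    print(header, len(seq))
```

In practice, `lines` would be an open file handle on the proteome FASTA file, and the retained records would be written out as the input set for the prediction server.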

Results and Discussion
Comparative genomics and in silico studies have begun to reveal insights into gene and protein functions of many organisms. Here, we compare the genomes of the strains TIGR4, D39, G54 and R6 of S. pneumoniae using appropriate tools for whole genome comparison, and the results are discussed below. Table 1 summarizes the general information about the genomes, including gene statistics for the four strains, obtained and compiled from the CMR and NCBI web servers. The genome sizes of the four strains range between 2 Mb and 2.16 Mb (cf. Sl.No.2 of Table 1). Based on genome size, D39 is the smallest and TIGR4 is the largest of the four strains. The nucleotide base compositions (A, T, G, C, AT and GC) show that all four strains have low GC (~40%) genomes. The number of protein-encoding genes in the four strains ranges between 1914 and 2234 (cf. Sl.No.3 of Table 1). Of the total base pairs (bp) of the four genomes, approximately 85-87% are coding and the remainder is non-coding DNA. The number of genes involved in RNA synthesis (structural RNA, tRNA and rRNA) is more or less similar in all strains. Finally, comparison of the global and local repeats of TIGR4 and R6 using the CBS web server shows that both kinds of repeats are higher in TIGR4 than in R6 (cf. Sl.No.4 of Table 1), which may be related to duplicated regions of the chromosome (Gregory and DeSalle, 2005).

Comparison of Whole Genome Pairwise Alignments
The whole genome pairwise alignments of the strains TIGR4, D39 and R6 of S. pneumoniae (whose sequence data are available at NCBI) are obtained using the standalone version of MUMmer, and the results are plotted using its built-in mummerplot. The whole genome pairwise alignments of the strains TIGR4, G54 and R6 are obtained using CMR, where these sequences are available. The five possible alignments are shown in Figure 1(a)-(e). Generally, the genomes of prokaryotes are very dynamic, with insertions, deletions, inversions and translocations commonly observed among related species or even between different strains of the same species (Gregory and DeSalle, 2005; Hughes, 2000). The net result is that the particular complement of genes and their order along the chromosome are not typically conserved over evolutionary time. In some cases, genes that are grouped into operons in one species may be dispersed throughout the genome in others. We find similar results when analyzing the genomes of the four strains of S. pneumoniae. In particular, we find that gene order is stable in the genome pairs TIGR4 vs. D39 and TIGR4 vs. R6, as shown by the fact that most of the points lie along the diagonal in Figures 1a and 1b. These results suggest that the stability of gene order in D39 vs. R6 must also be relatively high, and this is shown in Figure 1c. It is also consistent with the fact that R6 is a derivative of D39. The whole genome pairwise alignments of TIGR4 vs. G54 and of R6 vs. G54, shown in Figures 1d and 1e, respectively, do not show such a high degree of stability of gene order as the alignments above.
Many of the gene and protein sequences of these strains are approximately the same, which is not surprising as all the strains occupy the same niche in the human respiratory system. The small differences might have arisen after the divergence of these strains from other evolutionary lineages, as adaptations to their host. Such differences increase greatly in pathogens and appear to be associated with the ability to infect eukaryotes, perhaps reflecting a mechanism for evading host immune defenses, and the unique genes may be located in a plasticity zone.
Since the G54 genome sequence is not available at the NCBI web server and the D39 genome is not available at the CMR server, we could not obtain the whole genome alignment for D39 vs. G54. However, we are able to predict its form from the earlier results. As Figures 1d and 1e are similar, the alignment of D39 vs. G54 is expected to possess a similar structure. This prediction may be confirmed once the whole genome sequence of G54 is made available at NCBI or the genome sequence of D39 is included in CMR.
From a similar analysis, we have also noted that the genes gi|15900279-cps4E, gi|15900280-cps4F, gi|15900281-cps4G, gi|15900282-cps4H, gi|15900286-cps4I, gi|15900287-cps4J, gi|15900288-cps4K, gi|15900289-cps4L and gi|15900788-cps-ptv are unique to TIGR4. Similarly, the genes gi|116516773-cps2E and gi|116516341-cps-ptv are unique to the D39 strain, and the genes NT05SP0198, NT05SP0202 and NT05SP1909 are unique to the strain G54. In R6, however, the only cps gene, gi|15902136-capD, is common to all the other strains (Table 2). That TIGR4 has more cps genes than the other strains indicates its high virulence. The results further suggest that virulence is lower in D39 and G54, and much lower in R6, compared with TIGR4.
Although not all the cps genes of TIGR4 are present in the D39, G54 and R6 strains, these strains are also pathogenic. Therefore, to identify virulence factors other than the cps genes, we consider the other genes of the strains from the gene role category aspect.

Comparison of the Role Category of Genes
The role categories of genes of the different strains are compared using two different tools, chosen according to the availability of genome sequences: (i) the CMR role category pie chart for TIGR4, G54 and R6 (Table 3) and (ii) the Bacterial Annotation System (BASys) for the strains TIGR4, D39 and R6. The numbers of genes responsible for the biosynthesis of various proteins (Sl.Nos.1-9 of Table 3) in TIGR4 are nearly the same as in G54 and R6, which suggests a basic complement of proteins required for certain cellular processes. But the numbers of genes responsible for the biosynthesis of some other proteins (Sl.Nos.10-23 of Table 3) in TIGR4 are notably different from those in G54 and R6. This suggests that these proteins are important for strain uniqueness and may be involved in variations in pathogenesis among the strains. Each percentage given in Table 3 is specific to the genes involved in that category only and does not represent the overall gene percentage. For example, autolysin (SP1937) of TIGR4 is assigned to two role categories, cell envelope and cellular processes (Sl.Nos.11 and 12 of Table 3), and the percentage given is specific to the respective category.

The number of genes responsible for pathogenesis in the strains TIGR4, G54 and R6 was counted manually from the CMR gene role category (sub role categories pathogenesis, toxin production and resistance) and found to be 101 (4.52%), 47 (2.30%) and 42 (1.89%), respectively (Sl.No.19 of Table 3). TIGR4 has many pathogenic factors and is highly virulent, while the G54 and R6 strains have roughly 40-50% of the number of pathogenic factors found in TIGR4. Mobile and extrachromosomal elements comprise a significant fraction of the genome, with 134 genes (5.99%) in TIGR4, 71 (3.46%) in G54 and 86 (3.87%) in R6 (Sl.No.18 of Table 3). Transposons generally carry genes for antibiotic resistance (Gregory and DeSalle, 2005); our results therefore suggest that antibiotic resistance may be relatively higher in TIGR4 than in the strains G54 and R6.
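The comparison of pathogenesis-related gene counts relative to TIGR4 can be reproduced with a few lines of Python. The counts are those reported in the text; the variable names are chosen here for illustration.

```python
# Pathogenesis-related gene counts reported in the text (Sl.No.19 of Table 3)
pathogenesis = {"TIGR4": 101, "G54": 47, "R6": 42}

# Express each count as a percentage of the TIGR4 count
reference = pathogenesis["TIGR4"]
ratios = {
    strain: round(100.0 * count / reference, 1)
    for strain, count in pathogenesis.items()
}
print(ratios)  # {'TIGR4': 100.0, 'G54': 46.5, 'R6': 41.6}
```

G54 and R6 thus carry a little under half as many pathogenesis-related genes as TIGR4.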
From the results of the comparative study of TIGR4, D39 and R6 using the BASys server, we find that most of the values are more or less similar. However, there is a higher percentage of genes of unknown function in the strains TIGR4, D39 and G54, which indicates that the reason for the differences may also be hidden in the unknown genes or proteins (data not shown).
From Table 3, the numbers of hypothetical, conserved hypothetical, unclassified and unknown genes in the whole genomes of the strains TIGR4, G54 and R6 are noted and shown in Table 4. Nearly 37-42% of the genes are of unknown type, which shows that these sequences have yet to be annotated and assigned functions, and some of them may be responsible for virulence. Using the multigenome homology comparison tool available at CMR, the numbers of unique genes in TIGR4, G54 and R6 are found to be 288, 104 and 78, respectively (Table 4).
The unique genes of the strains TIGR4, G54 and R6 themselves include many hypothetical, conserved hypothetical, unknown and unclassified sequences, whose percentage ranges from 65 to 74; other possible differences among the strains may therefore be revealed by studying these gene sequences. As far as the virulence factors are concerned, the unique genes of the strain TIGR4 include 3 capsular polysaccharide biosynthesis proteins (SP_0351 (cps4F), SP_0352 (cps4G) and SP_0359 (cps4K)), 4 cell wall surface anchor family proteins (SP_0462, SP_0463, SP_0464 and SP_1772), a PspC protein (SP_1417), a NanA protein (SP_1693) and an IgA1 protease (SP_2155). The unique genes of R6 include three proteins of the type 2 capsule locus (Spr0315, Spr0317 and Spr0319). The strain G54, however, does not have such virulence factors among its unique genes (Table 4). This result indicates the high virulence of TIGR4 and also suggests that those virulence factors are specific to TIGR4 and R6. These differences might have arisen because of species-specific adaptation to the host, particularly for the sake of defense mechanisms.

Comparison of Virulence Factors Other than Capsular Polysaccharide Synthesizing Genes
In S. pneumoniae, surface and cytoplasmic proteins such as pneumococcal surface protein A (PspA), autolysin (LytA), hyaluronate lyase (Hyl), pneumolysin (Ply), two neuraminidases (NanA and NanB), choline binding protein A (CbpA), pneumococcal surface antigen A (PsaA) and immunoglobulin A1 (IgA1) protease have already been described as virulence factors (Jedrzejas, 2001; Rigden et al., 2003). The comparative results for these sequences, obtained from CMR, are given in Table 5, which provides more insight into the virulence factors of the strains TIGR4, D39, G54 and R6 of S. pneumoniae.
The virulence factors of TIGR4 are taken as reference and compared with the related sequences of the strains D39, G54 and R6. Likewise, the virulence factors of D39 are taken as reference and compared with the related sequences of the strains G54 and R6, and the virulence factors of G54 are taken as reference and compared with the related sequences of the remaining strain R6. All comparisons use the pairwise sequence alignment tool LALIGN with default parameters (Alignment: Global; Scoring matrix: BLOSUM50; Gap opening penalty: -14; Gap extension penalty: -4), and the results are shown comparatively in Table 5.
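To make the notion of percent identity from a pairwise global alignment concrete, here is a minimal Python sketch of a Needleman-Wunsch alignment. It is not LALIGN: as a simplifying assumption it uses a plain match/mismatch score and a linear gap penalty instead of BLOSUM50 with separate gap opening and extension penalties, and it defines identity as identical columns divided by alignment length, which is one of several conventions. The function names and the toy peptides are hypothetical.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns the two aligned strings, using '-' for gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a: str, aln_b: str) -> float:
    """Identical (non-gap) columns divided by alignment length, in percent."""
    if not aln_a or len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)


# Toy peptides (hypothetical, not real virulence factor sequences)
aln_a, aln_b = global_align("MKTAYIAK", "MKTAYAK")
identity = percent_identity(aln_a, aln_b)
print(aln_a)
print(aln_b)
print(identity)
```

With a substitution matrix and affine gap penalties the alignments, and therefore the identities, can differ from this simplified scheme, so the sketch illustrates the calculation rather than reproducing the values in Table 5.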
PspA is located in the cell wall of pneumococci and is present in all S. pneumoniae strains (Jedrzejas, 2001). PspA of TIGR4 has ~53-63% identity with PspA of D39, G54 and R6 (Table 5). When we compare PspA in D39 vs. G54 and in G54 vs. R6, the identity between those strains is nearly 63%. These results indicate that nearly 50-60% of the virulence nature of PspA of TIGR4 exists in the other strains D39, G54 and R6. It is interesting to note that there is 100% identity between the PspA sequences of D39 and R6, so the virulence nature of PspA in these two strains is exactly the same.
Regarding LytA, Hyl, Ply, NanB and PsaA, all four strains of S. pneumoniae have above 90% identity, so the effect of these five virulence factors is also likely to be similar. This similarity is also reflected in G+C percentage, protein length and gene length, whereas the location of the genes in the genomes varies; the similarities and differences can be seen in Table 5.

Table 5: Comparison of the common virulence factors, namely pneumococcal surface protein A (PspA), autolysin (LytA), hyaluronate lyase (Hyl), pneumolysin (Ply), neuraminidase A (NanA), neuraminidase B (NanB), choline binding protein A (CbpA), pneumococcal surface antigen A (PsaA) and immunoglobulin A1 (IgA1) protease, of four strains of S. pneumoniae. The LALIGN program is used to find the identity between sequences.

All strains have different neuraminidase A (NanA) sequences except G54 and R6 (~90% identity). CbpA and IgA1 protease of the strain TIGR4 show high identity (~73% and ~87%, respectively) with those of D39 and R6, which are exactly identical (100%) to each other. Comparisons involving G54, however, show much lower identity (~40% and ~35%). It therefore seems that the virulence nature based on CbpA and IgA1 protease is similar among the strains TIGR4, D39 and R6 and differs in G54.
From Table 5, it is interesting to note that all the virulence factors of D39 are very similar to those of R6 (above 99% identity, except NanA), which confirms the fact that the avirulent strain R6 is a derivative of the strain D39 (Lanie et al., 2007). Based on the role category, all TIGR4 virulence factors fall under pathogenesis-related functions, which is also consistent with the high virulence of TIGR4.

Functional Annotation of Hypothetical Sequences Relevant to the Virulence Factors
Prediction of virulence factors from the hypothetical sequences of S. pneumoniae has implications for the identification and characterization of the virulence mechanism. Using VirulentPred (Garg and Gupta, 2008), the present study predicted that 4 hypothetical sequences of TIGR4 and 22 of R6 are virulence factors. All these sequences are listed in Table 6. The prediction is based on protein features such as amino acid composition, di-peptide composition, similarity search, higher order di-peptide composition and PSSM, and on the cascaded SVM module of the tool VirulentPred. However, similar predictions are not possible at present with D39 and G54 as the sequence information of the latter is not fully available.
Among the 4 predicted virulence factors of TIGR4, only one sequence (gi|15901572) is also predicted in R6, as the hypothetical protein gi|15903627, and its functional region is predicted as Plasmid_Txe (PF06769). This family contains many hypothetical proteins and has no homolog among the other virulence factors mentioned. In R6, however, it is interesting to note that among the 22 hypothetical protein sequences predicted as virulence factors, 8 different sequences (gi|15902372, gi|15903388, gi|15903446, gi|15902652, gi|15902781, gi|15903694, gi|15903627 and gi|15903771), with 7 different functional regions, are related to the previously mentioned virulence factors of the strains R6 and TIGR4. Those virulence factors are hyaluronidase, immunoglobulin A1 protease, capsular polysaccharide synthesis proteins, pneumolysin, neuraminidase and choline binding protein. The related sequences of TIGR4 and R6, except gi|15903771, are compared in Table 7.
The hypothetical protein sequence gi|15903771 of R6 has 71 amino acids, and its functional region is predicted as a putative cell wall binding repeat (residues 42-60) using InterProScan (ID: PF01473). The same functional region is also found to be repeatedly present in known virulence factors such as pneumococcal surface protein A, autolysin and choline binding proteins of the strains TIGR4 and R6. Since many domain regions have been identified in these known virulence factors of TIGR4 and R6, the regions are not given explicitly, but they can easily be obtained using InterProScan.

Conclusion
We have compared the virulence nature of the encapsulated strains TIGR4, D39 and G54 and the nonencapsulated strain R6 of Streptococcus pneumoniae using comparative genomics tools. From the whole genome pairwise alignments, we found that the stability of gene order in the genome pairs TIGR4 vs. D39, TIGR4 vs. R6 and D39 vs. R6 is relatively higher than in the pairs TIGR4 vs. G54 and R6 vs. G54. We are able to predict the possible structure of the whole genome pairwise alignment of D39 vs. G54 from the alignments of TIGR4 vs. G54 and R6 vs. G54.
From the comparison of the capsular polysaccharide (cps) synthesizing genes, we found that the TIGR4 strain has more cps genes than the other strains: many cps genes are unique to TIGR4, only a few to D39 and G54, and none to R6, which indicates the high virulence of TIGR4. Further, the other virulence factors pneumococcal surface protein A, autolysin, hyaluronate lyase, pneumolysin, neuraminidase B and pneumococcal surface antigen A of TIGR4 are closely related to those of the other three strains, which suggests that the virulence due to these factors is similar among the four strains. The virulence factors neuraminidase A, choline binding protein A and immunoglobulin A1 protease of TIGR4, however, differ from those of the other strains of S. pneumoniae, which suggests that these factors are responsible for the differences in virulence among the four strains.
From the gene role category comparison, the many gene categories of TIGR4 that are nearly the same as in G54 and R6 suggest a basic complement of proteins required for certain cellular processes in the strains of S. pneumoniae. The many gene categories of TIGR4 that are notably different from those of G54 and R6, in contrast, suggest that these proteins are important for strain uniqueness and may be involved in variations in pathogenesis. Since many hypothetical, conserved hypothetical, unknown and unclassified proteins exist among the genes in the dissimilar role categories, it seems that many of these genes of S. pneumoniae have yet to be annotated and assigned functions, and some of them may also be responsible for virulence. Further, we have also found that most of the virulence factors are the same in D39 and R6, which confirms the fact that R6 is a derivative of the strain D39.
In an effort to annotate the uncharacterized protein sequences (hypothetical and conserved hypothetical), the present study predicted that 4 and 22 hypothetical sequences of the strains TIGR4 and R6 of S. pneumoniae, respectively, are virulence factors. Among those predicted virulence factors, 1 and 8 different hypothetical sequences of TIGR4 and R6, respectively, contain conserved sequences of known virulence factors such as hyaluronidase, immunoglobulin A1 protease, capsular polysaccharide synthesis proteins, pneumolysin, neuraminidase and choline binding protein. These sequences may also be considered desirable targets for therapeutics. The aim of this effort is to narrow down the search for virulence factors among all hypothetical sequences, and these predictions will be established only when they are confirmed experimentally.