Analysis of Regulatory Genomics and Gene Expression Pattern of Medicinal Importance Genes of Helicobacter Pylori

Helicobacter pylori is recognized as the main causal agent of chronic gastritis and duodenal ulcers, and it is associated with the subsequent development of gastric carcinoma. Because it has adapted to life in a unique niche, the gastric epithelium of primates, its promoters may show types of regulatory motif different from those of other bacteria. Motifs are the sequence segments responsible for gene regulation, so studying them may make it possible to control the expression of genes of interest. The objective of this work is to analyze the regulatory sequence patterns of virulence genes of medicinal importance, to provide a basis for drug development and for further analysis of transcriptional regulatory networks. For this purpose we used available microarray gene expression data from the Stanford Microarray Database together with computational tools. We found that Helicobacter pylori shows different types of regulatory motif, of both oligo and dyad pattern, in the studied genes. The most common length of a single-block (oligo) motif is 8 to 14 nucleotides, and the most common dyad pattern is 4 (4/8)3. We also observed that the GC content of these regulatory motifs is only 15-20%, which is comparatively low.
Journal of Computer Science & Systems Biology Open Access JCSB/Vol.3 Issue 1
*Corresponding author: Pramod Katara, Department of Bioscience and Biotechnology, Banasthali University, Bansthali 304022, India, Tel: +919413094705; E-mail: pmkatara@gmail.com
Received November 05, 2009; Accepted February 06, 2010; Published February 06, 2010
Citation: Katara P, Singh A, Ragav D, Sharma V (2010) Analysis of Regulatory Genomics and Gene Expression Pattern of Medicinal Importance Genes of Helicobacter Pylori. J Comput Sci Syst Biol 3: 010-015. doi:10.4172/jcsb.1000049
Copyright: © 2010 Katara P, et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
In recent years there has been exponential growth in molecular genetics technologies such as DNA microarrays, which allow us for the first time to obtain a "whole" view of the cell. At the same time, developments in high-throughput computing and bioinformatics have provided various computational algorithms and tools to define the functional, structural and regulatory behavior of the genes analyzed by DNA microarray technology.
At present there are various technologies for measuring gene (mRNA) expression levels, but cDNA microarrays are preferred in the scientific community because they allow a quantitative readout of gene expression on a gene-by-gene basis (Brown and Botstein, 1999; Duggan et al., 1999). Microarrays have opened the possibility of creating data sets of molecular information to represent many systems of biological or clinical interest. Gene expression profiles can be used as input to large-scale data analysis, for example to study regulatory genomics, to discover taxonomy, to discover new genes of drug importance, and to increase our understanding of normal and disease states (Debouck and Goodfellow, 1999; Alizadeh et al., 2000).
A first step in analyzing this type of information is to examine the extremes, i.e. genes with significant differential expression in two individual samples or in a time series after a given treatment. This simple technique can be extremely efficient, for example, in screens for potential tumor markers or drug targets (Debouck and Goodfellow, 1999). However, such analyses do not address the full potential of genome-scale experiments to alter our understanding of cellular biology by providing, through an inclusive analysis of the entire repertoire of transcripts, a continuing comprehensive window into the state of a cell as it goes through a biological process. What is needed instead is a holistic approach to analysis of genomic data that focuses on illuminating order in the entire set of observations, allowing biologists to develop an integrated understanding of the process being studied.
A natural basis for organizing gene expression data is to cluster genes with similar patterns of expression (Eisen et al., 1998; Alizadeh et al., 2000). The first step is to adopt a mathematical description of similarity. For any series of measurements, a number of sensible measures of similarity in the behavior of two genes can be used, such as the Euclidean distance, angle, or dot product of the two n-dimensional vectors representing a series of n measurements. We have found that the standard correlation coefficient (i.e., the dot product of two normalized vectors) conforms well to the intuitive biological notion of what it means for two genes to be "co-expressed"; this may be because this statistic captures similarity in "shape" but places no emphasis on the magnitude of the two series of measurements.
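The point about "shape" versus magnitude can be illustrated with a minimal sketch. The two expression profiles below are invented for illustration and are not taken from the data set; the function names are likewise hypothetical. Each profile is mean-centered and scaled to unit length, so that the correlation coefficient is literally a dot product of normalized vectors.

```python
import math

def centered_unit_vector(values):
    """Subtract the mean and scale to unit length."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [c / norm for c in centered]

def pearson(x, y):
    """Correlation coefficient as the dot product of two normalized vectors."""
    ux, uy = centered_unit_vector(x), centered_unit_vector(y)
    return sum(a * b for a, b in zip(ux, uy))

def euclidean(x, y):
    """Euclidean distance, which is sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: same shape, ten-fold difference in magnitude.
gene_a = [0.1, 0.5, 1.0, 0.4]
gene_b = [1.0, 5.0, 10.0, 4.0]

print(pearson(gene_a, gene_b))    # close to 1: the shapes are identical
print(euclidean(gene_a, gene_b))  # large: the magnitudes differ
```

Under the correlation measure the two genes count as co-expressed, whereas the Euclidean distance would place them far apart.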
Conserved DNA sequences, present in all types of organisms as motifs or in other forms, correspond to transcriptional regulatory motifs in the upstream regions of genes (McGuire et al., 2001). These conserved regions are often binding sites for DNA-binding proteins; some also act as transcription factor binding sites (TFBS) and are known as gene regulatory elements.
In bacteria it is difficult to locate the regulatory region for a gene found within an operon (Jacob and Monod, 1961), since the promoter for that operon can lie several genes upstream, and it is difficult to predict which gene is at the head of the operon (Price et al., 2005). In addition, there are fewer instances of most regulatory motifs in a bacterial genome than in yeast or other eukaryotes, as there is usually only one instance of a regulatory motif per operon instead of one instance per gene. It is easier to discover a motif that is found in more copies in the genome. However, one can increase the number of instances of a conserved regulatory motif by pooling upstream sequences from co-expressed (co-regulated) genes or from orthologous genes in closely related organisms, assuming the motif is conserved across these organisms. In the current bioinformatics era, various statistical software tools based on different algorithms are available for finding such conserved motifs in given upstream sequences, but the main challenge is to extract biological relevance from their output (Lescot et al.). Here we try to identify the consensus nucleotide elements (motifs), and their patterns, that are responsible for gene expression and act as regulatory sequences in Helicobacter pylori.
Helicobacter pylori is an organism of great medical significance. The genomes of its widely available strains (26695 and J99) have been completely sequenced; each contains a single circular genome of about 1.7 million base pairs with around 1,500 predicted coding sequences (Tomb et al., 1997; Alm et al., 1999). The objective of the present work is to find the gene expression patterns of virulence genes and the regulatory sequences that act as binding sites for transcriptional regulators in these genes. This study is based on the well-established concept that genes with similar expression patterns most probably share common regulatory machinery (Altman and Raychaudhuri, 2001; Allocco et al., 2004).

Prediction of coexpressed genes
We obtained a hierarchical clustering of the microarray gene expression data using Cluster, which can be visualized in TreeView as a hierarchical tree. This tree contains all of the input data in hierarchical form, with closely related (co-expressed) genes, as judged by their expression values, falling in the same cluster. By comparing different correlation types we also found that centered correlation is better suited to hierarchical clustering and gives more appropriate output for further processing (Eisen et al., 1998; Wen et al., 1998).
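The clustering idea can be sketched in a few lines. This is not the Cluster program itself: it is a toy agglomerative procedure that uses 1 minus the centered correlation as the distance and, as an assumption made here, average linkage between clusters. The gene names and profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx, cy = [v - mx for v in x], [v - my for v in y]
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    return sum(a * b for a, b in zip(cx, cy)) / norm

def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between two clusters is the mean of (1 - r) over all
    gene pairs drawn from the two clusters (average linkage, assumed).
    """
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates,
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles: two rising genes and two falling genes.
profiles = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [0.1, 0.9, 2.2, 2.9],
    "geneC": [3.0, 2.0, 1.0, 0.0],
    "geneD": [2.8, 2.1, 0.9, 0.2],
}
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

Recording the order of merges, rather than stopping at a fixed number of clusters, would give the hierarchical tree that a viewer displays.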
After clustering, we examined the genes of every cluster and found that most of the well-documented virulence genes, i.e. VacA, FecA and some Cag-PAI genes, lie close together, and most of them are grouped in two clusters along with some other genes; these two clusters are therefore the ones of interest. The genes of these selected 'seed' clusters were used for further analysis, and the rest of the clusters were not considered further.

Analysis of gene expression pattern of coexpressed gene
When the clustered genes were loaded into Genesis and other plotting tools for expression pattern analysis, the plots showed that gene expression changes observably with time and with the environmental conditions provided. This fluctuation appears to be the same for most genes within the same cluster (Clusters 1 and 2; Figure 1).
When we calculated the variance across all genes at every condition, we found that it is very high for some genes at some points. This shows that clustering may place some false-positive genes in a cluster. Genes whose function is still unknown but which fall in these two clusters can be assumed to be related in some way to the virulence of H. pylori and are considered probable virulence genes (Eisen et al., 1998; Spellman et al., 1998; van Noort et al., 2003).

Retrieval of upstream (promoter) region
Using RSAT we retrieved the promoter region for the genes of every cluster and found that the length of the upstream region differs from gene to gene. Notably, for some genes the length of the promoter region is '0' (CL 1-5; CL 2-3), that is, their distance from the neighboring gene is '0'. When we searched for these genes in the TIGR Operon Database, we found that most of them belong to the same operon unit as their neighbor (Salgado et al., 2000). This indicates that ORFs at distance '0' from their neighbor are probably controlled by the same regulatory machinery; the present method cannot find motifs for them because no promoter region was retrieved.
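The reasoning about zero-length upstream regions can be sketched as follows. This is not RSAT output or the operon database query; it is a hypothetical illustration that assumes all genes lie on the same strand, are sorted by start position, and use 1-based inclusive coordinates. The gene names and coordinates are invented.

```python
def group_by_zero_distance(genes):
    """Group consecutive genes separated by no non-coding bases.

    genes: list of (name, start, end) tuples on one strand, sorted by
    start, with 1-based inclusive coordinates (an assumption of this
    sketch). A gene whose upstream gap is 0 joins the group of the
    preceding gene, as a candidate member of the same operon.
    """
    groups = []
    previous_end = None
    for name, start, end in genes:
        # Non-coding bases between the previous gene and this one;
        # overlapping genes are treated as distance 0.
        gap = None if previous_end is None else max(0, start - previous_end - 1)
        if gap == 0:
            groups[-1].append(name)
        else:
            groups.append([name])
        previous_end = end if previous_end is None else max(previous_end, end)
    return groups

# Hypothetical coordinates: geneB starts immediately after geneA ends.
genes = [("geneA", 100, 400), ("geneB", 401, 900), ("geneC", 1050, 1500)]
print(group_by_zero_distance(genes))
```

In such a grouping only the first gene of each group has an upstream region available for motif search, which matches the limitation noted above.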

Analysis of Regulatory Motif
Oligo (regulatory motif) analysis with MEME, which uses position weight matrices, predicted the five most probable motifs for the genes of each cluster. In some cases a predicted sequence may be the Pribnow box, TATAAT (Pribnow, 1975), which is common and essential for starting transcription, so care is needed before treating this type of sequence as a regulatory motif. We ranked the probable motifs by motif score (Table 1). It is hard to say which particular motif size is most probable in a cluster; in MEME the ranking is based on P-value and E-value. We also ran BioProspector and observed that more than 60% of the MEME motifs matched BioProspector motifs (Liu et al., 2001). It should be noted that MEME may give different results on different runs because of its learning-based procedure; applying BioProspector therefore served to confirm the presence of these motifs.
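How a position weight matrix scores a candidate site can be shown with a short sketch. It does not reproduce MEME, which learns the matrix from unaligned sequences; here the aligned sites are given, the background is assumed uniform, and a pseudocount of 1 is an arbitrary choice. The sites and the scanned sequence are invented for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Log-odds matrix from aligned sites of equal length.

    Assumes a uniform background (0.25 per base); the pseudocount
    avoids taking the logarithm of zero for unseen bases.
    """
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in BASES
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Hypothetical aligned sites resembling a Pribnow box.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, start = best_hit(pwm, "GGCGTATAATGCGC")
print(score, start)
```

A high-scoring hit that matches TATAAT in this way would be a candidate core promoter element rather than a gene-specific regulatory motif, which is the caution raised above.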
The motifs occur in different numbers and patterns in the genes of both clusters. In some genes they do not occur even once, i.e. HP0527 and HP0887 in cluster-1, and HP0010, HP0109, HP0633 and HP0632 in cluster-2. In other genes they are repeated: in cluster-1, gene HP0382 shows motif '+3' twice, and in cluster-2, HP0111 shows motif '+1' three times in succession. We also observed that motifs are present in both '+' and '-' directions (Figure 1a and 1b). Together this shows that coexpressed genes do not share all of their motifs with each other; they share only some specific motifs in their upstream regions, and either these alone account for their coregulation or there may be other factors behind it (Kremling et al., 2000).

Analysis of dyad pattern motif
Using the dyad-analysis option of RSAT we found some possible dyad-pattern motifs, but the number of such dyads is very low in both clusters (Table 2 and Table 3).
These tables show that, if only non-overlapping dyads are considered, there are just 3 (CL 1 = 1, CL 2 = 2) across all 36 genes. The most frequently observed dyad pattern is 4 (4/8)3.

[Table column headings: Cluster-1 motifs (nucleotide); Cluster-2 motifs (nucleotide)]
The occurrence significance ranges from 0.01 to 2.80, indicating that most of the observed dyads are not highly significant. We therefore conclude that most of the virulence genes of H. pylori are regulated by simple oligo-pattern regulatory motifs, and that dyad-pattern motifs do not play a significant role in their regulation.
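For readers unfamiliar with dyads, the counting step behind such an analysis can be sketched as below. A dyad is taken here to be a pair of short words separated by a fixed number of unspecified bases. The word length of 3 and the spacing range of 0 to 20 are assumptions of this sketch, the sequences are invented, and no significance statistic is computed, so this is only the first step of what a dyad-analysis tool does.

```python
from collections import Counter

def count_dyads(sequences, word=3, spacings=range(0, 21)):
    """Count spaced word pairs (left, spacing, right) in the sequences.

    word and spacings are illustrative defaults, not values taken
    from the study.
    """
    counts = Counter()
    for seq in sequences:
        for spacing in spacings:
            span = 2 * word + spacing
            for start in range(len(seq) - span + 1):
                left = seq[start:start + word]
                right = seq[start + word + spacing:start + span]
                counts[(left, spacing, right)] += 1
    return counts

# Hypothetical upstream sequences sharing AAA, a 4-base spacer, then GGG.
sequences = ["AAACCCCGGG", "AAATTTTGGG"]
counts = count_dyads(sequences)
print(counts[("AAA", 4, "GGG")])
```

A full analysis would compare each observed count with the count expected by chance, which is what the occurrence significance values discussed above summarize.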

Conclusion
In this work we identified the gene expression patterns and the regulatory elements (motifs), of different lengths and patterns, in the upstream regions of VacA and CagA-PAI family genes and of some other genes that cluster with them according to their expression values in response to different experimental conditions. We also inferred a probable function for some previously uncharacterized genes: HP1458, HP0673, HP0448 and HP0722.
The study of gene expression patterns shows that virulence genes are under-expressed under initial adverse conditions (i.e. iron starvation) but after some time return to normal expression or become over-expressed. This shows that virulence genes are able to be expressed even under adverse conditions.
Through the study of regulatory genomics we found that the virulence genes of Helicobacter pylori show a distinctive regulatory pattern. They share regulatory motifs of various patterns and lengths (6 to 22 nucleotides); the most common length of a single-block motif is 8 to 14 nucleotides and the most common dyad pattern is [4 (4/8)3]. Most of the motifs found differ from one another at only one or two nucleotide positions. In all motifs more than 80% of the nucleotides are A or T, and G and C account for just 15 to 20%, which is lower than the GC content of the whole genome (39%). The uncharacterized gene HP0887 shows considerable similarity to the VacA gene, which produces a cytotoxin; this indicates that HP0887 may also be related to cytotoxin activity and so may have medicinal importance. Gene HP1458 shows an expression pattern similar to that of the Cag26 gene, a member of the Cag-PAI that serves as a marker for the presence of the PAI; this study therefore suggests that HP1458 could also be used as a marker for the presence of the PAI.
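The GC comparison above rests on a simple calculation, sketched here. The two motif strings are invented placeholders rather than motifs from Table 1; only the genome-wide value of 39% is taken from the text.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Hypothetical AT-rich motifs used only to illustrate the calculation.
motifs = ["TTTAAAGATT", "AATTTTCAAA"]
pooled = gc_fraction("".join(motifs))

GENOME_GC = 0.39  # whole-genome GC content quoted in the text
print(pooled, pooled < GENOME_GC)
```

Pooling the motifs before computing the fraction weights each nucleotide equally; averaging per-motif fractions instead would weight each motif equally, and the two can differ when motif lengths vary.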

Methodology
In the present study, to mine the regulatory sequences of virulence genes of Helicobacter pylori, the virulence genes were collected from available databases, including VFDB (Chen et al., 2005; Yoon et al., 2005), and by mining various research articles (Censini et al., 1996; Maeda et al., 1999; Karhukorpi et al., 2000; Ali et al., 2005).
For the analysis, data sets from the asynchrony experiments reported on the Stanford University web site (http://genomewww.stanford.edu) were downloaded. The data were collected as fluorescence measurements and are reported as relative levels of expression under iron starvation. This data set was previously used to determine genes that are regulated by iron availability and to assess the growth phase dependence of that regulation (Merrell et al., 2003). On SMD the raw data are provided as a set of 32 Excel files (each file carries equal weight), and each file contains expression data for a different time point and iron concentration. The data set consists of expression data for a set of 4607 genes. The raw data files were sorted and scaled by taking the base-2 logarithm of the normalized (mean) R/G ratio. Genes whose ratio differed from the mean by more than two standard deviations, in absolute value, in at least one array were taken for further filtering (Tamayo et al., 1999; Ulm et al., 2004). Subsequently, only the 80% of the sorted genes that passed the spot filter criteria were selected. Ultimately a set of 772 of the 4607 genes was taken for the current analysis.
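The transformation and variability filter can be sketched as follows. The reading adopted here, that the mean and standard deviation are computed per array across genes, is an assumption, as are the function names and the toy ratios; the spot filter step is not modeled.

```python
import math
import statistics

def log2_ratios(ratios):
    """Base-2 logarithm of each normalized R/G ratio, per gene."""
    return {gene: [math.log2(r) for r in values]
            for gene, values in ratios.items()}

def variable_genes(log_ratios, n_sd=2.0):
    """Genes deviating from the array mean by more than n_sd standard
    deviations in at least one array (per-array statistics assumed)."""
    genes = list(log_ratios)
    n_arrays = len(log_ratios[genes[0]])
    keep = set()
    for array in range(n_arrays):
        column = [log_ratios[gene][array] for gene in genes]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        for gene in genes:
            if abs(log_ratios[gene][array] - mean) > n_sd * sd:
                keep.add(gene)
    return keep

# Hypothetical ratios for ten genes on two arrays; gene9 is induced
# 16-fold on the first array and unchanged on the second.
ratios = {f"gene{i}": [1.0, 1.0] for i in range(9)}
ratios["gene9"] = [16.0, 1.0]

selected = variable_genes(log2_ratios(ratios))
print(selected)
```

The logarithm makes induction and repression symmetric around zero, so a 16-fold increase and a 16-fold decrease deviate from the mean by comparable amounts.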
To cluster genes on the basis of their expression values (Schena et al., 1996; Yeung et al., 2004), the Cluster software from the Eisen lab was used (Eisen et al., 1998). Because we are interested in genes responsible for virulence, after hierarchical clustering only those clusters containing our genes of interest (known virulence genes) were selected, and all other genes in these clusters were considered probable virulence genes (van Noort et al., 2003; Bergmann et al., 2004).
Further, to analyze the pattern of regulatory sequences of the coexpressed virulence genes, we first retrieved their upstream regions using the RSAT program, and then performed oligo-analysis with MEME and dyad-analysis with the RSAT server.