Somnath Tagore^{1}, Virendra S. Gomase^{1*}, Rajat K. De^{2}  
^{1}Department of Bioinformatics, Padmashree Dr. D.Y. Patil University, Plot No50, Sector15, CBD Belapur, Navi Mumbai 400614, India  
^{2}Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India  
Corresponding Author: Dr. Virendra S. Gomase, Department of Bioinformatics, Padmashree Dr. D.Y. Patil University, Plot No50, Sector15, CBD Belapur, Navi Mumbai 400614, India, Email: virusgene1@yahoo.co.in  
Received July 11, 2008; Accepted August 02, 2008; Published August 14, 2008  
Citation: Somnath T, Virendra VG, Rajat KD (2008). Pathway Modeling: New face of Graphical Probabilistic Analysis. J Proteomics Bioinform 1: 281-286. doi:10.4172/jpb.1000035  
Copyright: © 2008 Somnath T, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.  
Pathway analysis is one of the most interesting aspects of systems biology. Biological pathways are attractive to model but difficult to optimize, and various disease-related modeling problems can be analyzed using this simulation approach. Graphical probabilistic approaches are one family of methods used for designing and analyzing pathways. This article discusses the graphical approaches that are actively used in pathway modeling.
Keywords 
Pathway modeling; Pathway analysis; Helmholtz machine; HMM 
Introduction 
Biological pathways are modeled in order to analyze and visualize the various substeps of a network, to study gene expression profiles, and to predict the outcome of alterations made to cells. A major challenge in developing these models is choosing the correct level of abstraction. Because biological networks are large and diverse, it is essential to balance computational complexity against model fidelity and to move between models at different levels of detail in meaningful ways. Here, graphical probabilistic models are discussed for modeling biochemical pathways. Biological pathways are categorized into metabolic pathways, signal transduction pathways and gene regulatory networks, and all three categories are considered here. 
Graphical Probabilistic Models 
Graphical probabilistic models represent multivariate probability densities as a product of terms, each of which involves only a few variables. The structure of this product is represented by a graph that links the variables appearing in a common term. The common types of graphical models are discussed here (Agarwal et al., 2000; Hall et al., 1999). 
Types of Graphical Probabilistic Models 
Bayesian Networks 
Bayesian networks are used for predicting relationships among variables. A Bayesian network is a directed acyclic graph whose nodes represent random variables and whose arcs represent statistical dependence relations among the variables, together with a local probability distribution for each variable given the values of its parents (Levitski et al., 2007; Marashi et al., 2007). 
Thus, for each variable $X_i$, 
$$i \in \{1, \ldots, N\} \qquad (1)$$
the set of parent variables is denoted by $\mathrm{parents}(X_i)$, and the joint distribution of the variables is the product of the local distributions: 
$$\Pr(X_1, \ldots, X_N) = \prod_{i=1}^{N} \Pr(X_i \mid \mathrm{parents}(X_i)) \qquad (2)$$
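The factorization in Equation (2) can be illustrated with a minimal Python sketch. The three-node network, the variable names and all probabilities below are hypothetical values chosen for illustration and do not come from the cited studies.

```python
from itertools import product

# Hypothetical network: A -> B and A -> C (all variables binary).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[var][parent_values] = Pr(var = 1 | parents); numbers are made up.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.6},
}

def local_prob(var, value, assignment):
    """Pr(var = value | parents(var)), read from the conditional table."""
    key = tuple(assignment[p] for p in parents[var])
    p_one = cpt[var][key]
    return p_one if value == 1 else 1.0 - p_one

def joint_prob(assignment):
    """Joint probability as the product of the local distributions."""
    result = 1.0
    for var, value in assignment.items():
        result *= local_prob(var, value, assignment)
    return result

# Summing the joint over every assignment should give 1.
names = list(parents)
total = sum(
    joint_prob(dict(zip(names, values)))
    for values in product((0, 1), repeat=len(names))
)

# One specific assignment: Pr(A=1) * Pr(B=1 | A=1) * Pr(C=0 | A=1).
example = joint_prob({"A": 1, "B": 1, "C": 0})
```

Each variable needs only a table indexed by its parents, which is why the product form stays compact even when the full joint distribution would be large.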
Gaussian Networks 
The multivariate normal distribution is parameterized by a mean vector and a covariance matrix. A difficulty in working with it directly is that the covariance matrix must be positive definite. With Gaussian networks, this constraint need not be considered explicitly (McKinney, 2006). 
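One way to see why the positive-definiteness constraint can be set aside is to build the covariance matrix from a linear Gaussian network, in which each node is a linear function of its parents plus independent noise. The sketch below assumes that formulation; the weights `b` and noise variances `v` are hypothetical, and the only requirement placed on them is that every variance is positive.

```python
import math

# Hypothetical network over X0, X1, X2 in topological order.
# b[i][j] is the weight of parent j in the linear equation for node i (j < i).
b = [
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0],
    [-0.5, 1.2, 0.0],
]
# Conditional (noise) variance of each node given its parents.
v = [1.0, 0.5, 2.0]

def covariance_from_network(b, v):
    """Build the joint covariance matrix node by node in topological order."""
    n = len(v)
    sigma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            cov = sum(b[i][k] * sigma[k][j] for k in range(i))
            sigma[i][j] = sigma[j][i] = cov
        sigma[i][i] = v[i] + sum(b[i][k] * sigma[i][k] for k in range(i))
    return sigma

def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; it succeeds only for a
    symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

sigma = covariance_from_network(b, v)
network_is_valid = is_positive_definite(sigma)
```

In this parameterization the weights can take any real values, so a learning procedure can adjust them freely without checking the covariance matrix after each step.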
Maximum Likelihood 
Maximum Likelihood Estimation begins with writing a mathematical expression called the likelihood function of the sample data. It is the probability of obtaining that particular set of data, given the chosen probability distribution model. This expression contains the unknown model parameters. The values of these parameters that maximize the sample likelihood are known as the Maximum Likelihood Estimators (MLEs) (Hu, 2004; Jin et al., 2006). 
Thus, given a family $M_i$ of probability distributions parameterized by $i$ and associated with a known probability function $f_i$, we may draw a sample $x_1, \ldots, x_n$ of $n$ values from this distribution and then use $f_i$ to compute the probability density (Justenhoven et al., 2001): 
$$f_i(x_1, \ldots, x_n \mid i) \qquad (3)$$
In this case, the likelihood function is given by 
$$L(i) = f_i(x_1, \ldots, x_n \mid i) \qquad (4)$$
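As a small worked example of maximizing the likelihood function in Equation (4), the Python sketch below assumes a Bernoulli model, which is chosen here purely for illustration. The code variable `p` plays the role of the parameter in Equation (4), and the sample is made up.

```python
import math

# Made-up binary sample, e.g. 1 = a pathway step was observed as active.
sample = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p, data):
    """Log of the likelihood for independent Bernoulli(p) observations."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

# Crude grid search over candidate parameter values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, sample))

# For the Bernoulli model the maximizer is known in closed form:
# it is the sample mean, which the grid search should reproduce.
p_closed = sum(sample) / len(sample)
```

Working with the logarithm of the likelihood turns the product over observations into a sum, which avoids numerical underflow and does not change the location of the maximum.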
Density Estimation 
Density Estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function (Estivill-Castro et al., 2001). 
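Kernel density estimation is one common way to carry out density estimation; it is used here only as an illustrative choice and is not named in the text above. The following Python sketch assumes a Gaussian kernel, and the data and the bandwidth of 0.5 are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at a point x by
    averaging Gaussian bumps centred on the observed values."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Made-up measurements forming two loose clusters.
observed = [2.1, 2.4, 2.5, 3.0, 3.2, 5.8, 6.1, 6.3]
density = gaussian_kde(observed, bandwidth=0.5)

# Sanity check: a Riemann sum over a wide interval should be close to 1.
step = 0.01
steps = 2000
area = sum(density(-5.0 + k * step) * step for k in range(steps))
```

The bandwidth controls the smoothness of the estimate: a small value follows individual observations closely, while a large value blurs the two clusters together.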
Helmholtz Machine (HM) 
Helmholtz Machines are neural networks that learn the hidden structure of a data set by being trained to create a generative model that can reproduce the original data. By learning representations of the data, the underlying structure of the generative model comes to approximate the hidden structure of the data set (Estivill-Castro et al., 2001; Estivill-Castro et al., 2001). They are categorized as autoencoders, deterministic HMs and stochastic HMs. An autoencoder reconstructs its best guess of the input on the basis of the code that it sees, a deterministic HM is inspired by mean-field methods, and a stochastic HM captures the correlations between the activities in different hidden layers (Han et al., 2000). 
Latent Variable Models (LVM) 
Latent Variable Models relate a set of manifest (observed) variables to a set of latent variables, and are grouped according to whether the manifest and latent variables are categorical or continuous. They provide a means to parse out measurement error by combining across observed variables, and they allow the estimation of complex causal models. They are well developed for both metric and discrete observed variables, and they can also account for clustering through random effects (Tonella, 2001). 
Generative Topographic Mapping (GTM) 
In Generative Topographic Mapping (GTM), the training data are assumed to arise by first picking a point probabilistically in a low-dimensional space, then mapping that point through a smooth function to the observed high-dimensional input space, and finally adding noise in that high-dimensional space. The Expectation-Maximization (EM) algorithm is used with the training set to learn the parameters of the model, including those of the low-dimensional probability distribution (Cormen et al., 2000). 
Hidden Markov Model (HMM) 
In a Hidden Markov Model, the state is not directly visible, but variables influenced by the state are visible. The model consists of a finite set of states, each of which is associated with a probability distribution over the possible output tokens (Demetrescu et al., 2003). Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state, an outcome or observation is generated according to the associated probability distribution. The three main problems for HMMs are the evaluation problem, the decoding problem and the learning problem (Demetrescu et al., 2003). 
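The evaluation and decoding problems can be sketched in a few lines of Python using the standard forward and Viterbi algorithms. The two-state model below, with its state names, output tokens and probabilities, is entirely hypothetical; the learning problem (typically addressed with the Baum-Welch algorithm) is not shown.

```python
# Hypothetical hidden states and made-up model parameters.
states = ("active", "inactive")
start = {"active": 0.6, "inactive": 0.4}
trans = {
    "active": {"active": 0.7, "inactive": 0.3},
    "inactive": {"active": 0.4, "inactive": 0.6},
}
emit = {
    "active": {"high": 0.8, "low": 0.2},
    "inactive": {"high": 0.1, "low": 0.9},
}

def forward(observations):
    """Evaluation problem: probability of the observations,
    summed over all hidden state paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

def viterbi(observations):
    """Decoding problem: most probable hidden state path and its probability."""
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

obs_seq = ["high", "high", "low"]
likelihood = forward(obs_seq)
best_prob, best_path = viterbi(obs_seq)
```

Both routines reuse the results of the previous time step, so their cost grows linearly with the length of the observation sequence rather than exponentially with the number of possible paths.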
Application of Graphical Probabilistic Models 
Application to Metabolic Pathway Modeling 
A machine learning system has been introduced for determining gene functions from heterogeneous data sources using a Weighted Naive Bayesian network (WNB). The aim is to infer the functions of putative genes or Open Reading Frames (ORFs) from existing databases using computational methods, on the hypothesis that integrating evidence from multiple, complementary sources significantly improves prediction accuracy. The experimental results suggest that this hypothesis is valid and provide guidelines for using the WNB system for data collection, training and prediction. The combined training data sets consist of gene expression results, clustering outputs and sequence homology from public databases. The weight training process is also used to analyze the contribution of each source of information to the prediction performance (Deng et al., 2006). 
Searching for peptide hormones that signal via membrane receptors is often hampered by their small size and lack of sequence similarity. A search tool based on a hidden Markov model has been developed that uses various peptide hormone sequence features to estimate the likelihood that a protein contains a processed and secreted peptide of this class. Analysis of the top-scoring hypothetical and poorly annotated human proteins identifies two candidate peptide hormones, both of which are localized to secretory granules in a transfected pancreatic cell line. The findings demonstrate the utility of a bioinformatics approach for identifying novel biologically active peptides (Mirabeau et al., 2007). 
Multivariate methods are used for the analysis of molecular data, including genotypic data and clinical phenotypes. These methods include latent variable models and joint multivariate modeling techniques. Given the wide variety in the data considered, the objectives of the analyses and the methods applied, direct comparison of the results is discussed (Beyene et al., 2007). 
Major stem cell species are studied using a co-clustering latent variable model (LVM). It helps to explain cell type-specific transcription factors, using expression profiles. The LVM-based study also helps to analyze regulatory modules for each stem cell cluster. Furthermore, the identities of the stem cell clusters are revealed by the constituent genes that are directly targeted by the modules (Joung et al., 2007). 
Application to Signal Transduction Modeling 
A primer on the use of Bayesian networks has been introduced for analyzing the connectivity of signaling networks. Bayesian networks are used to derive causal influences among biological signaling molecules. The primer describes how to automatically derive a Bayesian network model from proteomic data and how to interpret the resulting model (Pe’er, 2005). 
Stochastic biochemical systems are used for modeling transcriptional regulation in single cells. Transcriptional regulation can be modeled using a hidden Markov model (HMM), which allows it to be studied both mathematically and computationally in single cells. Analysis by Monte Carlo simulation, however, is computationally laborious. Several simulations based on a transcriptional regulatory system are employed to show the relative merits and limitations of various approximation techniques (Goutsias, 2006). 
Graphical models are well suited to analyzing G-protein coupled receptors (GPCRs). Most signaling networks in cells are mediated through the interaction of GPCRs with heterotrimeric GTP-binding proteins (G-proteins). Experimental data suggest that heterotrimeric G-proteins interact with parts of the activated receptor at the transmembrane helix-intracellular loop interface. An exploratory approach has been designed to generate a refined library of Hidden Markov Models that predict the coupling preference of GPCRs to heterotrimeric G-proteins. It predicts the coupling preferences of GPCRs to the Gs, Gi/o and Gq/11 subfamilies, but not to the G12/13 subfamily (Sgourakis et al., 2005). 
A Hidden Markov Model library has been designed for classifying protein kinases into 12 families. The classification achieves a misclassification rate of zero on the characterized kinomes of H. sapiens, M. musculus, D. melanogaster, C. elegans, S. cerevisiae, D. discoideum, and P. falciparum. The library is applied to 38 unclassified kinases of yeast, including AGC (5), CAMK (17), CMGC (4), and STE (1). It also facilitates the annotation of kinomes and provides data regarding the early evolution and subsequent adaptations of the various protein kinase families (Miranda-Saavedra et al., 2007). 
Application to Gene Regulatory Networks 
Gene regulatory networks are modeled using probabilistic Boolean network methods and dynamic Bayesian network methods. These methods are compared using a biological time-series dataset from the Drosophila Interaction Database to construct a Drosophila gene network. A subset of time points and gene samples from the whole dataset is also used to evaluate the performance of the two approaches (Li et al., 2007). 
A hierarchical hidden Markov regression model has been introduced for determining gene regulatory networks from genomic sequence and gene expression microarray data. A hybrid Monte Carlo methodology is devised to estimate parameters under two classes of latent structure: one arising from the unobservable state identity of genes, and the other from the unknown set of covariates influencing the response within a state (Gupta et al., 2007). 
A comparative gene predictor called Conrad has been proposed, based on semi-Markov conditional random fields (SMCRFs). It is trained to maximize annotation accuracy. It encodes information as features and treats all features equally in the training and inference algorithms. On Cryptococcus neoformans, configuring Conrad to reproduce the predictions of a two-species phylo-GHMM closely matches the performance of Twinscan. It produces similar results on Aspergillus nidulans when Conrad is compared with Fgenesh (DeCaprio et al., 2007). 
Hidden Markov Models are compared with genotyping to determine the transmission characteristics of sporadic vancomycin-resistant enterococci (VRE). For this, a structured continuous-time hidden Markov model (HMM) is developed. Two parameters are estimated, one to quantify the cross-transmission of VRE and the other to quantify the level of VRE colonization from sporadic sources. Some evidence is found, based on model selection criteria, that the cross-transmission parameter changed during the study period. The model estimates that cross-transmission increases at week 120 and declines after week 135, coinciding with environmental decontamination. HMMs are also applied to serial prevalence data to estimate the characteristics of acquisition of nosocomial pathogens and to distinguish between epidemic and sporadic acquisition (McBryde et al., 2007). 
Current Research 
Bayesian networks are used to predict interaction partners from multiple alignments of interacting protein domain sequences, without the need for any training examples. The approach accurately predicts interaction partners in datasets of polyketide synthases. In addition, analysis of the predicted genome-wide two-component signaling networks shows that interacting kinase/regulator pairs that lie adjacent on the genome and those that lie isolated form two relatively independent components of the signaling network in each genome (Burger et al., 2008). 
Predictive modeling of nuclear hormone receptor response elements with a hidden Markov model, coupled with chromatin microarray technology, reveals a binding site in the Type I human hepatic 3alpha-hydroxysteroid dehydrogenase (AKR1C4) promoter for the nuclear hormone receptor liver X receptor alpha (LXRalpha). It also suggests that LXRalpha modulates the bile acid biosynthetic pathway at a unique site downstream of CYP7A1 (Stayrook et al., 2008). 
The probable state paths of three nucleotide sequences from the cis-regulatory regions of target genes are identified using a Hidden Markov Model (HMM). These regions are key elements in the transcriptional regulation of gene expression. The computations are also used to predict C(2)H(2) zinc finger transcription factor binding sites in the cis-regulatory regions of their target genes (Cho et al., 2008). 
Certain Markov matrix (MMM) values are used to numerically characterize 81 sequences of type III RNases and 133 proteins of a control group. One MMM-QSAR model and one classic hidden Markov model (HMM) are also developed on the same data. The MMM-QSAR model discriminates RNases from other proteins with an accuracy of 97.35% without using alignment, a result as good as that of known HMM techniques. Furthermore, the MMM-QSAR model predicts the new RNase III with the same accuracy as other classical alignment methods (Agüero-Chapín et al., 2008). 
Conclusion 
Graphical probabilistic models are of great importance in systems biology, especially for analyzing and modeling biological networks. Bayesian Networks have applications in almost every field of life science, including gene expression analysis, genetic and metabolic network analysis, and pathway modeling. Gaussian Networks are applied to analyze interaction networks such as protein-protein, gene-gene and gene-protein networks, and are also used for pathway modeling. Maximum Likelihood is used in phylogenetic estimation, the study of genetic crossover, pathway modeling and gene expression analysis. Density Estimation is useful for certain immunological or clinical trials, metabolic network analysis and pathway modeling. The Helmholtz Machine (HM) is used in studying the metabolic activities of the brain and nervous system. Latent Variable Models (LVM) are used for studying regulatory networks, pathway modeling and gene expression profiles. Generative Topographic Mapping (GTM) is used in microarray analysis, gene expression level analysis and pathway modeling. Lastly, Hidden Markov Models (HMM) are used in protein structure analysis, sequence analysis, metabolic pathway analysis, gene expression analysis and promoter region identification. 
References 
