Piotr H. Pawlowski^{1 * } , Szymon Kaczanowski^{1,2}, Piotr Zielenkiewicz^{1,3}  
^{1}Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warszawa, Poland  
^{2}Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA  
^{3}Plant Molecular Biology Laboratory, Warsaw University, Warszawa, Poland  
Corresponding Author :  Dr. Piotr H. Pawlowski, PhD Fax :(48) 39121623 Email :piotrp@ibb.waw.pl 
Received April 13, 2008; Accepted May 14, 2008; Published May 20, 2008  
Citation: Piotr HP, Szymon K, Piotr Z (2008) Protein Interaction Network. Double Exponential Model. J Proteomics Bioinform 1:061067. doi: 10.4172/jpb.1000010  
Copyright: © 2008 Piotr HP, et al. This is an openaccess article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.  
Related article at Pubmed Scholar Google 
Visit for more related articles at Journal of Proteomics & Bioinformatics
The proper theoretical description of the distribution of the node degree for yeast proteinprotein interaction network was investigated to deal with the observed discrepancy between usually proposed models and the existing data. The power law or the generalized power law with exponential cutoff were shown to be inaccurate within a wide range of degree values. Proposed linearcombinationofexponentialdecays method exactly characterizing the distribution by the spectrum of decay constants revealed two separate parameter domains. A consequent hypothesis that the node degree distribution could follow the universal double exponential law was successfully verified by selected model comparison using the AIC criterion. BIND and DIP data for H. pylori, E. coli, S. cerevisiae, D. melanogaster, C. elegans and A. thaliana were used for this purpose. A linear change in the magnitude of the distribution components with proteome size was observed, manifesting the evolutional stability of the process of developing the protein interaction network. Proposed kinetic model of protein evolution, considering the two hypothetical protein classes, first, with a relatively rapid emerging rate and a short characteristic residence time, and the second one, with the opposite properties, analytically described the nature of biexponential pattern. The model presents a situation in which evolutionary conserved proteins increase their interactions due to specific kinetic conditions. Thus, we oppose the opinion that the majority of such interactions are biologically significant, and, therefore the older parts of interactome are more complex. We believe that our interactome results support the hypothesis of Stuart Kaufman, presented in his book "The Origin of Order", that random mutations and natural selection constitute the origin of order and complexity.
Keywords 
Protein interaction network; Exponential model; Scale free network; Power law 
Introduction 
The degree of a node (or connectivity) is the number of edges that are adjacent to it. From the theoretical point of view, it is one of the basic measures characterizing the importance of the node in the network. Although the power law (PL) and the generalized power law supplemented with an exponential cutoff (GPLEC) were widely popularized (Wagner, 2001; Jeong et al.. 2001) as the rules describing the distribution of the node degrees in proteinprotein interaction network, attempts at a more exact mathematical description are still being undertaken (Thomas et al.. 2003; Berg et al.. 2004). The reasons are both of practical and methodological nature. The first reason pertains to the still evolving databases, and the second one concerns the facts that the usually simple shape of arrangement of experimental points may be fitted in various manners giving at different theoretical assumptions quite similar results. According to the DIP data (see Materials and Methods) we could observe that the degree distribution of nodes of S. cerevisiae protein interaction network follows approximately a PL or a GPLEC, but only for the degree values smaller than 10. For higher values of we saw a serious discrepancy between the theory and the experiment, already reported by others as an exponential decay (Wilhelm et al.. 2003). 
There are additional indications (Barabási and Oltvai, 2004; PereiraLeal et al.. 2005) that the biological network characteristics may contain an exponential component. The main aim of the present paper is to resolve whether by using a more complex exponentialtype model one can better describe the distribution of node degree in the protein interaction network. Developing the above idea we proposed to consider a node distribution as a linear combination of exponential decays A_{i} exp (−λ_{i }k ) , with amplitudes A_{i} and decay constants λ_{i } being positive values. Our method applied to S. cerevisiae DIP data revealed two separate domains of λ_{i} , with two characteristic values of the parameters related to the relatively "fast", then "slow", tendency of a distribution to decay along kaxis. This led to the natural concept that a double exponential curve a_{1} exp(d_{1}k) + a_{2} exp(d_{2} k) could be a better model of the node degree distribution than the standard or modified power law. This supposition was confirmed by using BIND or DIP data for 6 different organisms and the AIC criterion (see Materials and Methods). The obtained results led to analysis of the dependence of both exponential contributions to the total protein pool on proteome size, clearly indicating a linear trend. In consequence, this analysis helps us to better characterise the evolutionary mechanism leading to the observed double exponential distribution and points out its universal elements. 
To explain the biexponential character of node degree distribution, the kinetic model of protein network evolution was proposed. It relates the searched distribution formula to the parameters describing the rate of some creation and disruption processes, postulated as being important in formation of the net. According to our model, two basic types of proteins, marked "1" and "2", with a different dynamics of evolutional behaviour were assumed. They were shown to be good candidates, from a statistical point of view, for the lowconnected nodes and hubs, respectively. 
The discussed results suggest that the process of evolution leads to a "biological" order in the interactome. Therefore, they support the hypothesis of Kaufman, (1993) that the process of random mutation and selection always leads to complexity. 
Materials and Methods 
Protein interaction network data for H. pylori ( = 724 nodes, N_{e} = 1403 edges) and S. cerevisiae (analogous values 4135 and 7839) were taken from Coevolution and Selforganisation in Dynamical Networks data sets (COSIN, http://www.cosin.org) derived from the Database of Interacting Proteins (DIP, http://dip.doembi.u cla.edu/). Data for E. coli (399 and 312), D. melanogaster (7910 and 23128), C. elegans (3227 and 5026) and A. thaliana (487 and 959) were taken from Biomolecular Interaction Network Database (BIND, http://www.bind.ca/Action). Only single proteinprotein interaction records (without selfinteraction) were analyzed. No noninteracting proteins were reported. 
According to our method of linear combination of exponential decays (LCED), a S. cerevisiae node degree distribution (histogram) was tentatively described by the sum: 
i_max n_{K} = ∑ A_{i} exp(λ_{i}K) (1) i=0 
where n_{k} was a number of k degree nodes and i _ max was the maximal value of a sum index i .Equation 1 was fitted to the experimental data, at _ max = 50 and gridded spectrum of decay constants λ_{i} = {0, 0.025, 0.050, 0.075… 1.250}. The fit had been repeated 20 times to find the sets of amplitudes A_{i} , and then the respective averages <A_{i} > were analysed. As a fitting algorithm the NonlinearRegress procedure (NRP) from Mathematica 4.1 (http://www.wolfram.com) was applied, with substitution A_{i} = (A^{'}_{i})^{2} to guarantee only the positive value of amplitude. Random starting conditions, A^{'}_{i0} , were being selected within the range 0.5 ≤ A^{'}_{0i} ≤ 1.5. 
In the final modelling with a double exponential law (DEL), 
n_{K} =a_{1} exp(d_{1}K) + a_{2} exp(d_{2}k) (2) 
In the alternative modelling with a PL, 
n_{K} = Ak^{−γ} (3) 
and with a GPLC, 
n_{K} = A (K+ K_{0})^{−γ} exp(K/K_{c}) (4) 
The fits were performed in the range 1 ≤ k ≤ 15, using NRP once (without squared substitution of amplitude), and at default starting conditions (1.0). 
To rate the quality of the proposed models, corrected Akaike's Information Criterion (AICc) was adopted, defined as: 
AIC_{c } = Z ln( σ^{2} ) + 2m + 2m(m + 1) / z m 1 (5) 
where σ ^{2} is the average squared residual for a given model,  the number of model m parameters, and . the number of observations (Burnham and Anderson 2004). In the case of PL, m = 2. For GPLEC and DEL, m = 4. The number of analysed points was z = 15 in each competing model. Models with a smaller AICc value were being favoured. 
In the theoretical considerations, the total proteome size ( * NP )of the analysed species was assumed to be equal to the number of open reading frames, i.e., 1788 for H. pylori, 4285 for E. coli, 6307 for S. cerevisiae, 14218 for D. melanogaster and 18944 for C. elegans (Liu and Rost, 2001) or 28952 for family members of A. thaliana (Horan et al., 2005). Due to division by the scaling factor , where: 
SC = n_{0} + n_{K >0} / N^{*}_{P} (6) 
describes the ratio of the extrapolated size of the analysed probe to the size of the total proteome, the DEL model amplitudes for accessed data, a1 and a2, were transformed into hypothetical values a_{1}^{*}= a_{1} / sc and a_{2}^{*}= a_{2} / sc , for the total species proteome (see Appendix 1). In eq. 6 the unknown value n_{0} was replaced by a1 + a2 . Then, the expected amount of proteins in considered contributions 1 and 2 to the total proteome was estimated by the sum of infinite geometrical series 
∞ N_{i}^{*} = ∑ a_{i}^{*}exp( d_{i}K ) i = 1, 2 (7) K=0 
leading to: 
N_{1}^{*} = a_{1}^{*} / 1  exp( d_{1} ) (8) 
N_{2}^{*} = a_{2}^{*} / 1  exp( d_{2} ) (9) 
In the estimation of the parameters of the model of protein network evolution (Appendix 2) eqs. A.2.811 were applied. 
Results 
It was observed that the distribution histogram of node degree of S. cerevisiae proteinprotein interaction network exhibits a wellordered pattern in the range1≤ k ≤ 25 (Fig. 1). 
Above that range statistical fluctuations prevailed and quantization perturbed the continuity of analysed characteristics of the network. Attempts to describe the investigated distribution by a PL: A=(1.65 ± 0.02).103, γ = (1.27 ± 0.02)(upper line), or by a GPLEC: A = (2.4 ± 0.7).10^{3},k0 = (0.3 ± 0.7),γ =(0.5 ± 0.3),kc =(3.0 ± 0.4)(bottom line) gave good results only in the range 1 ≤ k ≤ 10.The PL parameters obtained,γ = 1.27 and A/Np = 0.40, are consistent with γ = 1.32 and A/Np = 0.42 for the whole yeast interaction network (Yu et al.. 2004). A different picture is seen in case of the GPLEC model. One can notice a big discrepancy between our result and those for a Mean standard error (s.e.) is also presented. small sample of 1870 nodes (PastorSatorras et al., 2003), which may indicate the narrow area of applicability of the cutoff formula. 
The proposed LCED method (Fig. 2) revealed two narrow ranges of decay constants spectrum with dominant amplitudes at λ_{7} = 0.175 and λ_{25} = 0.625 (characteristic values of node degree: 1/ λ_{7} = 5.7, 1/ λ _{25} = 1.6). Halfwidth of the observed peaks equals 0.025 and 0.050, respectively. An example of one in 20 fits performed to obtain the above spectrum is also presented (Fig. 3). As it is seen here, and in the case of other fits (data not shown), their qualities, especially in the range of values k > 10, are better than the estimation with standard or modified power law. 
As a result of the above, it was hypothesized that our combination, even reduced to a double exponential formula, could provide a better description of the node degree distribution than the considered power law type models. The examples of yeast and five other species were analysed for k < 15. Corresponding fits of proposed DEL models are presented in (Fig. 4) af and Table 1.Their qualities are confirmed by AICc values, which favour biexponential approximation in 5/6 of the investigated cases (Table 2). Plots of alternative fits are not shown. 
Some parameters of DEL models vary with proteome size. The size N1* and N2* of distinguished protein groups increases with the total number of proteins N*_{P} (Fig. 5). There was no detected essential dependence of decay constant d_{1} and d_{2} on the proteome size. 
Discussion 
The results presented above confirm recent reports (Goldberg et al., 2005) suggesting the "break" of a power law in the global description of the protein interaction network. Actually, we can suggest that this "break" may be caused by the second exponential term in node degree distribution, which does not affect strongly the formula in the range of the node degree smaller than 10, but may be essential elsewhere. 
Initial inspection of the data shown in Fig. 1 reveals that GPLEC, the 4parameter improvement of PL (bottom line), fits better than PL alone (upper line), but is still a very long way from perfect. Hence we decided to introduce a more general description. 
In accordance with our idea, protein interaction network consists of subpopulations of vertexes described by a similar statistical formula, but with different parameters. As a universal formula we choose exponential decay, which is consistent with the suggested model of network evolution (see Appendix 2). 
The proposed LCED method revealed the spectrum of decay constants and the magnitude of subpopulational contributions into the degree global distribution (Fig. 2). Two classes of nodes with the values of decay constant lying closely together were clearly distinguished. Good quality of fits (Fig.3) testifies to the utility of the method and the acceptance of the formula. 
Reducing the huge number of parameters of a general model and taking into account the above observation, we propose to limit the number of decay components to only two items, indexed by 1 and 2. It did not weaken the fitting abilities for different species in the range 1 ≤ k ≤ 15 (Fig. 4af, Table 1), which was confirmed by the AICc criterion. As seen in Table 2, the DEL models are the best in 5/6 of investigated cases and just a little worse (2^{nd} place) than the winner in one case. Generally, they are more effective for networks with big proteomes (the PL model for a small probe of A. thaliana may be an exception) than for sets with a small protein number; PL or GPLEC models may give similar results 
Documented changes in the dimensions of the indexed protein classes with proteome size (Fig. 5) indicate a similar tendency for linear increase for the first (a) and the second (b) component of proteome. This way the ratio N_{1}^{*}/ N_{2}^{*} ≈ 2 .5 seems to be a universal constant for a wide group of organisms. The contribution of each class of proteins to the summary distribution was shown in the example of a yeast probe (Fig. 6). 
As seen for small values of k , the two classes contribute to the global distribution. For k > 10, the first class vanishes and the second class clearly dominates. The latter class may be related to so called hubs. It is worth stressing that the second class of proteins may bear only a few links, too.

It seems that the proposed doubleexponential model is a simplification of a hypothetical multicomponential model describing the full spectrum of contributions from different classes of proteins. The analysed data indicate that there probably exists the third, small amplitude class of yeast proteins (not visible in Fig. 2), which may be related to the "super" hubs connecting hundreds of nodes; however, a "false positive" error cannot be excluded. 
Although the two protein classes clearly dominate, the analysed subpopulations do not form spikes along the decay constant axes, but have some definite width. We believe that more sophisticated analysis of discussed contributions, considering their continuous representation, should fully describe protein network statistics and reveal new properties of the proteome system. 
As mentioned beforehand, to specify our hypothesis, we proposed a simple mathematical model of protein network evolution (Appendix 2). The applied assumptions permit duplication events to occur even more often than the appearance of "new" types of protein encoding genes. Such behaviour is suggested by the observation that genecopy number within a family is often changed during the process of speciation (Cheng et al., 2005; Ma and Gustafson, 2005; Ting et al., 2004). However, to avoid an enormous expansion of the system, we assumed that the speciation processes are no more frequent than deletion episodes effectively leading to the elimination of proteins. On the other hand, one can detect evolutionary conservation of genes present even in different kingdoms. Therefore, the probability of multiplication of old "proteins" is similar to the probability of multiplication of"young" proteins in a given genome. The facts mentioned above were "silently" included in the model. It relates amplitudes and decay constants to the emergence rates, q_{1} and q_{1} , effective elimination rates, γ_{1} and γ_{2} , and interaction gaining rates, ν_{1} and ν_{2} of the two classes of proteins, with different dynamics of evolutional performance. This difference in dynamics of the evolution of proteins manifests in the observed difference between"fast" and "slow" tendency in the variation of the node degree distribution along kaxis. In general the above parameters may differ for different evolutional pathways. 
According our model, the linear trend in Fig. 5 may be related to the stable dynamics of evolution of investigated classes of proteins during the inter space progress. Indeed, with equations A.2.1214 it is easy to show that the observed dependence calls for stability of the ratio. q_{1} γ_{2} / q_{ 2} γ_{1} This linear trend also suggests that for the total proteomes the corresponding amplitudes of calculated probability (frequency) of the occurrence of a node with a given degree may remain approximately constant. In a sense, we showed not a scalefree distribution but a scalefree evolution. 
As the analysed decay constants d_{1} and d_{2} do not exhibit a clear tendency to change, we may simply imagine that during evolution γ_{1} ,γ_{2} , ν_{1} and ν_{2} remain approximately constant (see eqs. A.2.1011). According to this picture, q_{1} and q_{2} slowly evolve in a stable manner ( q_{1} / q_{2} = const ), governed, for example, by the varying amount of DNA, which accounts for the change in the global protein pool (see eq. A.2.12). 
To make our considerations more quantitative we estimated values q_{1} , q_{2} , γ_{1} and γ_{2} , assuming that ν_{1} =ν_{2} =0.1 [1/mln years] (Berg et al., 2004). It is seen (Table 3) that first class of proteins may be characterized by a relatively rapid emerging rate q_{1} and also relatively rapid elimination γ_{1} rate (or short characteristic residence time) when to compare with the second class of proteins. 
The proposed mathematical model of evolution suggests unexpected explanation of the observation of Barabasi and coworkers that more densely interconnected parts, "motives" of the interaction network, are more strictly evolutionary conserved (Wuchty et al., 2003). Intuitively, one can suppose that proteins belonging to such motives are evolutionary conserved because they are required for maintaining the connections in such motives. But the results of our simulations suggest an exactly opposite explanation: the old proteins (evolutionary conserved proteins) are more interconnected because they are simply old enough. This explanation although surprising for us, does in fact have sense. Since the majority of the proteins are not interacting (for example, proteinprotein interaction network of yeast contains only approximately 30000 proteinprotein interactions (according to the estimation of Kumar and Snyder, 2000) and more than 36000000 proteinprotein pairs), and the protein interaction network is evolutionary conserved (see, for example, Matthews et al., 2001), it is likely that the majority of interactions have biological significance and that interactions appear gradually during the process of evolution. It is also likely that new "proteins" have no interactions or have a small number of interactions. During the process of evolution these proteins slowly gain new "useful" interactions. If they belong to the class 2, they may even gain many such interactions. This process leads to a wellordered proteinprotein interaction network in which proteins are not randomly connected and in which one can distinguish "modules" of interacting proteins. 
As we have already referred to in the Introduction, our results support the hypothesis of Stuart Kaufman that natural selection, random mutations and the process of evolution are the source of order in biological systems. This paper shows a random process of evolution leading to complex and nonrandom systems. Although it remains an open question whether the random process is rapid enough to lead to creation of structures as complex as multienzymatic complexes or flagelles, we believe that a right step in the proper direction has been taken. 
Acknowledgements 
We would like to thank Doktor Christophe Pakleza for his help in the application acknowledgement of Mathematica to numerical calculations. Fianancial support from the ministry of Science and Higher Education geant PBZMNil_2/1/2005 and postdoctoral fellowship from the foundation for polish Science to S.K are gratefully acknowledged. 
Appendix 1 
It was assumed that each data set analysed is only a homogenous part of the total proteome of a given species. Then the fitted DEL model formula and the hypothetical distribution of the total population of proteins of a given organism (see Appendix 2) are related in the proportion: 
n_{K} ^{def} a_{1} exp( d_{1}k ) + a_{2} exp ( d_{2}k ) N_{P} = = (A1.1) n_{k}^{*} a_{1}^{*} exp( d_{1}k ) + a_{2}^{*} exp (d_{2}k) N_{p}^{*} 
where a_{1}^{*} and a_{2}^{*} are the amplitudes of a hypothetical distribution for the total population, N_{p} is the extrapolated size of the analysed probe and N_{p}^{*} is the total size of proteome. 
In the above ratio, N_{p} value includes interacting proteins ( N_{k>0} ) and also noninteracting ones ( n_{0} )  not included in the investigated data sets, so that: 
N_{p} = n_{0} + N_{k>0} (A.1.2) 
As eq. A.1.1 is fulfilled for each node degree and for different decay constants d_{1} and d_{2} , it should be: 
a_{1}^{*} = a_{1} / sc (A.1.3) 
a_{2}^{*} = a_{2} / sc (A.1.4) 
where the scaling factor equals to: 
n_{0} + N_{k>0} sc = (A.1.5) N_{p}^{*} 
Appendix 2 
Let us consider protein interaction network containing two classes of proteins (namely 1 and 2) characterized by different dynamics of evolutional performance, i.e., emerging with the rates q_{1} and q_{2} (as noninteracting at the beginning), then gaining some interactions with the rates ν_{1} and ν_{2} , and being eliminated with the rates γ_{1} and γ_{2 }  per protein. All mentioned rates are assumed as being distinct and constant.

A number of selected proteins of a given class δN_{i}^{*}(i=1,2), originated within small period of time , vanishes with age a according the equation 
dδN_{i}^{*} 
with an initial condition 
N_{i}^{*}_{a=0} = q_{i}δt i= 1, 2 (A.2.2) 
The resolution of eqs. A.2.1 and A.2.2 represents the exponentially diminishing course 
δN_{i}^{*} = q_{i}δt exp(  γ_{i}a) i= 1, 2 (A.2.3) 
The assumed continuous approximation and linear increase in protein connectivity 
K = ν_{i}a (A.2.4) 
and also the relationship , let us to transform eq. A.2.3 into the formula 
q_{i} γ_{i} δN_{i}^{*}= δK exp(  K ) i=1, 2 (A.2.5) ν_{i} ν_{i} 
which integrated within successive intervals [k, k+1] indicates the number of kdegree proteins of class "i" , , equal to 
q_{i} γ_{i} γ_{i} n_{ki}^{*} = (1  exp(  ) ) exp(  K ) i=1, 2 (A.2.6) γ_{i} ν_{i} ν_{i} 
Now, the total distribution of node degree, n_{k}^{*}, where n_{k}^{*} = n_{k1}^{* }+ n_{k2}^{*}, may be written in the doubleexponential form: 
n_{k}^{*} = a_{1}^{*} exp( d_{1}k ) + a_{2}^{*} exp( d_{2}k ) (A.2.7) 
The symbols introduced above mean 
q_{1} γ_{1} a_{1}^{*} = (1  exp(  ) ) (A.2.8) γ_{1} ν_{1} 
q_{2} γ_{2} a_{2}^{*} = (1  exp(  ) ) (A.2.9) γ_{2} ν_{2} 
γ_{1} d_{1} = (A.2.10) ν_{1} 
γ_{2} d_{2} = (A.2.11) ν_{2} 
A contribution of "i " class proteins in eqs. A.2.7 formally vanishes for K > ζ_{e}v_{i} −1, where is the time of evolution of interactome. Thus the index k should not exceed max[ ζ_{e}v_{1} −1, ζ_{e}v_{2} −1 ] Assuming a relatively high value ζ_{e} ( >> 1/ ν_{i} ), by summation of a superposition of geometrical series n_{k}^{*}described by the eq. A.2.7 over 0 ≤ k ≤ ∞ , one can obtain the total size of proteome : N_{p}^{*} 
q_{1} q _{2} N_{p}^{*} = + (A.2.12) γ_{1} γ_{2} 
with a distinguished levels of class contribution 
q_{1} N_{1}^{*} = (A.2.13) γ_{1} 
and 
q_{2} N_{2}^{*} = (A.2.14) Y_{2} 
References 
