and equating individual terms with Equation (12) and integrating
leads to
$$\int f\,dr = -T\,\Delta S = \int f(r_1)\,dr_1 + \int f(r_2)\,dr_2 + \dots + \int f(r_n)\,dr_n \qquad (14)$$
Factoring out the constant T and solving for the entropy we
obtain
$$\Delta S = \Delta S_1 + \Delta S_2 + \dots + \Delta S_n = \sum_k \Delta S_k \qquad (15)$$
which is exactly the same as Equation (A23) in App A6 for ξ = 1 mer.
Normally, we should expect that ξ > 1 mer. Because the
CLE model averages the entropy contributions of each interaction
over the Kuhn length (ξ), for ξ > 1 mer, the entropy
in Equation (15) should be scaled by a factor 1/ξ (explained
in App A and (Dawson et al., 2001a)). Doing so,
the result agrees exactly with the expressions found in Equations
(A11), (A16-A20) and therefore Equation (A23).
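As a minimal sketch of the bookkeeping in Equation (15), the per-interaction entropy changes can be summed and then scaled by 1/ξ. The function name `total_entropy_change` and the sample values are illustrative assumptions and are not taken from the original derivation; the ΔSk values themselves would have to be supplied by the expressions in App A.

```python
from typing import Sequence


def total_entropy_change(delta_s: Sequence[float], xi: float = 1.0) -> float:
    """Sum the per-interaction entropy changes and scale the sum by 1/xi.

    delta_s: hypothetical values of Delta S_k, one per interaction k.
    xi: Kuhn length in mers (xi = 1 recovers the unscaled sum).
    """
    if xi < 1.0:
        raise ValueError("the Kuhn length is assumed to be at least 1 mer")
    return sum(delta_s) / xi


# Hypothetical entropy changes (arbitrary units) for three interactions.
example = [-1.0, -2.0, -3.0]
print(total_entropy_change(example))          # unscaled sum, xi = 1 mer
print(total_entropy_change(example, xi=3.0))  # same sum scaled by 1/3
```

The scaling is applied once to the total rather than term by term, which is equivalent only because a single Kuhn length is assumed for the whole chain in this sketch.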
There are at least four independent ways to arrive at
Equation (15). In (Dawson et al., 2001a) the CLE-model
was derived directly from physical considerations of the
entropy and in (Dawson et al., 2001b) it was derived by
assuming that each connection leads to the creation of a
new loop. In this Section, we have derived Equation (15)
from consideration of the force on a chain (which is physically
analogous to the pressure of an ideal gas). Equation
(15) can also be derived qualitatively from considerations
of diffusion.
The CLE Model Satisfies the Coordination Number Using the GPC Model
The entropy of a folded polymer is known to have the
form of an integral expression (Dill and Stigter, 1995; Chan
and Dill, 1997). Here we show that the summation rule in
Equation (15) has the properties of integration and that it
satisfies Equations (2 – 4) with the GPC model. We show
that the conformations of a polymer chain composed of N
segments have a coordination number that is a function of N;
i.e., q = f(N). Hence, under these conditions, the CLE model
satisfies both distinguishability and Equation (4).
The summation rule in Equation (15) is a consequence of
integrating the correlated interactions (the cross links) in
the model. From the derivation of each ΔSk (App A3, Equation
(A7)), ΔSk represents the probability that state k in configuration
rk[i] should acquire a configuration rk[f]: p(rk[f] | rk[i])Δr,
where p is the probability and k ⇔ (i,j) describes the
interaction between monomers i and j (App A, Equation
(A3)ff). The entropy Sk(rk) corresponds to the probability
of the configuration rk: P(rk)Δr. The ratio of these states (associated with rk[i] and rk[f]) forms a conditional probability§
§The independence of the conditional probability for this Gaussian model is understood because it can be worked out from a Markov chain rule where
successive steps in the configuration depend only on the given configuration at the immediately previous step and are independent of any steps prior to
that point (Montroll, 1950; Feller, 1968 and 1971). In other words, knowledge of previous steps is restricted to the state of the current step and the next
step to be assigned. We are, therefore, justified in using this strategy on the grounds that it is the definition. Further, the theorem on the subadditivity of
entropy assures us that S12 ≤ S1 + S2. Hence, at worst, we have consistently erred on the side of overestimation of the entropy. One can visualize that
the effective coupling dies off with distance; hence, for large enough Kuhn length, the Markov model is reasonable. This is the concept of renormalization
theory discussed in App A1. Whereas the model can certainly be further refined, it does not change these concepts.
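The subadditivity statement S12 ≤ S1 + S2 can be illustrated numerically. The sketch below uses the Shannon form of the entropy (in units of kB) on a hypothetical joint probability table for two two-state subsystems; the table values and function names are assumptions chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy -sum(p ln p) in units of k_B, skipping zero-probability states."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def joint_and_marginal_entropies(
    joint: List[List[float]],
) -> Tuple[float, float, float]:
    """Return (S12, S1, S2) for a joint probability table over two subsystems."""
    s1 = shannon_entropy([sum(row) for row in joint])        # marginal over rows
    s2 = shannon_entropy([sum(col) for col in zip(*joint)])  # marginal over columns
    s12 = shannon_entropy([p for row in joint for p in row])  # joint entropy
    return s12, s1, s2


# Hypothetical correlated subsystems: matching states are more likely.
correlated = [[0.4, 0.1], [0.1, 0.4]]
s12, s1, s2 = joint_and_marginal_entropies(correlated)
print(s12, s1 + s2, s12 <= s1 + s2)
```

Treating the subsystems as independent replaces S12 by S1 + S2, so any correlation makes the independent estimate an upper bound, which is the sense in which the footnote speaks of erring on the side of overestimation.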
Figure 6: Example of a single hairpin containing Ls base pairs and a loop of length l nt. The total length of the sequence is N.

Writing Equation (16) in terms of the entropy, we see that
Equation (15) is measuring the likelihood that each state k
will transition from an initial state [i] to a final state [f]

where k[ ] denotes the state of the interactions between ij.
We transform the summation into integration by exchanging
the state label k for the enclosed sequence length (N(k))

Equation (18) calculates the total change in entropy due to
forcing a polymer into a specific configuration that is a function
of N(k).
In general, Equation (15) is much easier to arrange and
evaluate than the integral form in Equation (18). However,
for an RNA chain forming a hairpin in a single stem from 5’
to 3’ (Figure 6) or two anti-parallel beta-strands joined via a
loop, the summation in Equation (15) can be easily written
as an integral. For the GPC with parameters γ and δ = 2,
Equation (15) becomes

and, converting to an integral,

Hence, q ∝ N easily satisfies Equation (4). Equation (21)
can also be derived directly from the summation (see
(Dawson et al., 2001a)).**
**It is true that the combinatorics of RNA secondary structure stems follow a 2N-1 rule. This applies only to the RNA secondary structure. Furthermore, whereas
this adequately explains the computational combinatorics of RNA secondary structure, it does not justify equating the combinatorics with the entropy
because these systems involve distinguishable entities. The combinatorics consider only the pairing, not the distinguishable relationships or how the chain arrived
at a particular configuration.
What does this mean?
First, Equation (21) shows factorial-order growth with the
number of binding pairs (i.e., cross-links) that are formed.
The RNA-stem has strong correlation due to the fixed and
stabilized structure. This also suggests that the “penalty”
for RNA-stem formation should go mostly to stem formation
rather than to loop formation for the coarse-grained
entropy. (The fine grained entropy counts in the loops and
in the pairing interactions.) This relationship also would apply
to various beta-sheet conformations.
Second, Equation (21) is consistent with the fact that the
number of ways that N distinguishable particles can be arranged
is N!. It is consistent with the Gaussian (and Gamma)
function statistics because its maximum value is always less
than or equal to the normalization constant (Equation (A11)).
Moreover, the global entropy is known to be an integral property
for this type of system (Dill and Stigter, 1995; Chan
and Dill, 1997). Hence, Equation (15) is consistent with the
concept of integration and consistent with textbook statistics.
Equation (21) is also consistent with the fact that x-ray diffraction can distinguish these indexed monomers and
produces a structural factor proportional to the number of
monomers (N). Were the true structures that of a lattice, a
coordination number (q) should emerge from the lattice
parameters and structural factor of the x-ray diffraction data
and most of the monomers in the protein structure or RNA
structure could not be uniquely identified and assigned because
of this degeneracy. We would observe dispersion
akin to a crystal with many defects. Instead, we observe unique
angles that are distinguishable (e.g., for proteins, the
Ramachandran plots all show non-degenerate distinguishable
residues).
For lattices that use the self avoiding random walk (App
B) with a large enough coordination number, degenerate
(but distinguishable) conformations have been observed
(Pokarowski et al., 2003). In this case, the example contained
approximately 12 residues and the lattice was a face
centered cubic (i.e., q = 12). Hence, even the lattice model
predicts degeneracy when a large enough coordination number
is used.
Finally, this model resolves the inconsistencies associated with Equation
(4). A unique coordination number (q ~ O(N)) is always
obtained from the CLE model combined with the
GPC.††
†† In terms of Equation (4), the CLE-model permits N unique angles; hence, all entities in these models are theoretically distinguishable. Nevertheless, in
this Section, we are still ignoring the fact that there is real physical space involved with a real polymer. This must limit the set of possible
conformations and will be addressed later in this work.
Making the Lattice Model Consistent
We have shown that the CLE model satisfies Equation
(4). In this Section, we show how to unify the lattice model
and the GPC.
The relationship between the lattice model and the CLE-based
GPC can be expressed as a family of equations having
the following form

where q(N), g(N), h(N) and α(N) are increasing functions
of N, w and β are constants, and we have assumed
a unit Kuhn length (ξ ≡ 1 mer).

which is the same form as Equation (A19) and easily transforms
into

which is very close to the asymptotic approximation known
as Stirling’s formula, $N! \approx (2\pi)^{1/2} (N/e)^{N} N^{1/2}$,
where $N! = 1 \cdot 2 \cdot 3 \cdots N$.‡‡ The total number of ways one can arrange
N distinguishable objects is also N! in size.
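The closeness of these expressions is easy to check numerically. The sketch below compares ln N!, the logarithm of Stirling's formula, and ln(N^N) for a few chain lengths; working in logarithms avoids overflow, and the helper names and the sample values of N are illustrative assumptions.

```python
import math


def log_factorial(n: int) -> float:
    """ln(N!) computed through the log-Gamma function, ln Gamma(N + 1)."""
    return math.lgamma(n + 1)


def log_stirling(n: int) -> float:
    """ln of Stirling's formula (2 pi)^(1/2) (N/e)^N N^(1/2)."""
    return 0.5 * math.log(2.0 * math.pi) + n * (math.log(n) - 1.0) + 0.5 * math.log(n)


def log_n_to_n(n: int) -> float:
    """ln(N^N), the order-of-magnitude estimate quoted in the text."""
    return n * math.log(n)


for n in (5, 10, 50, 100):
    print(n, log_factorial(n), log_stirling(n), log_n_to_n(n))
```

On a logarithmic scale the factorial and N^N differ by roughly N, a term linear in N, so the two counts are of similar order in the sense used here even though N^N is always the larger of the two.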
‡‡ The derivation of Stirling’s formula is via the Gamma function: $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt$. However, in Γ(x), N plays the role of x and the integration of t is
from 0 to ∞. In this work, the probability density function and its weight ($e^{-t} t^{x-1}\,dt$) have the same form (after exchange of variables) as Equation (A12).
However, x corresponds to δγ and the integration is from 1 to N. Therefore, whereas both arrive at almost the same formula, the meaning behind the
operations is completely different. A detailed derivation of Stirling’s formula is found in Lebedev (1965).
All the amino acids in a protein (or RNA) are distinguishable
using X-ray crystallography or NMR spectroscopy.
Such monomers are semi-classical enough in size and mass
to obey Maxwell-Boltzmann statistics which are used when
computing the statistics of distinguishable objects. It follows
that the true number of conformations must also be of order N! in size. In the previous section, we have shown
that the summation in Equation (15) is also a form of integration.
Equation (21) leads to Equation (23); whence the
number of conformations is of order $N^N$. Equation (26) shows
that this is of similar size to a factorial expression. The CLE
model yields a family of equations that are consistent with a
system of distinguishable particles.
The lattice model requires that q(N) = constant ≡ qo,
g(N)/h(N) = 1 and β = 1. With the exception of a range of values
around qo, this does not match conceptually with a simple
integration of Equation (25). Neither does it return anything
remotely resembling a Gaussian model when we try to evaluate
its derivative:
$$\frac{d[\ln(q_o^{N})]}{dN} = \ln(q_o) \qquad (27)$$
A variant of the lattice model has the form C N ∝ q on
N ln NY
(Arinstein, 2005). This conforms to the second term in Equation
(25). However, it still fails to satisfy Equation (4) for N
large enough.
What is missing is the degeneracy. For a lattice constant
qo and sequence length N, the degeneracy σ (N) is

and so it grows monotonically with N. Equation (28) corrects for
the fact that we generate too many states when N >> qo
and too few when N << qo. In Equation (28), the coordination
number is q(N)=N, the root mean square deviation
expands to $(g(N)/h(N))^{\beta} = \sqrt{N}$ and the scaling factor is
the exponential base w=e. This is a standard Gaussian distribution.
The reason why the lattice model has often been successful
is because the sequence lengths that are used are
typically of similar order to qo. The extreme computational
costs usually restrict the use of lattice models to 4 <N < 20
mers. For such cases, Equation (27) has the form
qo ~ N and, as a result, it tends to yield conformations of approximately
the right order, disguising the issue.
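To make the point concrete, the sketch below compares the logarithm of a fixed-coordination lattice count, qo^N, with the logarithm of the N^N estimate for distinguishable monomers. The value qo = 12 is the face centered cubic example quoted earlier; the function names and the chosen lengths are assumptions, and the degeneracy correction of Equation (28) is deliberately left out.

```python
import math


def log_lattice_count(n: int, q0: int) -> float:
    """ln(q0^N): lattice estimate with a fixed coordination number q0."""
    return n * math.log(q0)


def log_distinguishable_count(n: int) -> float:
    """ln(N^N): order-of-magnitude estimate for N distinguishable monomers."""
    return n * math.log(n)


q0 = 12  # face centered cubic lattice
for n in (4, 8, 12, 20, 100):
    difference = log_lattice_count(n, q0) - log_distinguishable_count(n)
    print(n, round(difference, 3))
```

The difference changes sign at N = qo and stays modest for lengths near qo, but it grows without bound in magnitude as N increases, which is why the mismatch is easy to overlook for short chains.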
We have shown that Equations (22-25) are consistent with
a Gaussian-type model and satisfy the inequality on the right
hand side of Equation (4). Equation (27) has the same form
as Equations (22-25) and therefore, the CLE can incorporate
this model. Equation (4) is satisfied if we weight the coordination number by Equation (28). Therefore, the CLE
model embraces both forms and shows the route of transformation
between them.
Including Variable Flexibility: the Kuhn Length
Most functional biopolymers have different flexibilities in
different parts of their structures that reflect their function
and all such polymers have a Kuhn length larger than one
mer. A scaffold should be stiff (ξ ~ 10 mers) whereas a
protein-protein docking region would require “shock absorbers”
and flexible interfaces (ξ ~ 3 to 5 mers) to help the
subunits bind. Mechanical parts require flexible joints (ξ ~
3 mers). All this indicates that we need a model that accounts
for different flexibilities of a real polymer.
For the case where ξ ≠ 1 mer, Equation (22) must be
renormalized (see App A1). For ξ > 1, Equation (22) is transformed
as follows

where γξ [] functions as an operator for scaling the global
entropy (Equation (A18)).
In Equation (25), we obtained the isolated entropy for a
particular binding pair (or region of binding pairs) in the
biopolymer. Because the conformations and their derivatives
are related and separable, we can handle each of these
parts separately and add them together according to Equation
(15).
We can now generalize these findings. From Equation
(15), we can sum the entropies. From Equation (25), the
derivatives express the instantaneous long range entropy
contribution between mers ij. A full transformation for the
CLE model for a given binding pair (bp) configuration (Equation
(A23); App A6) becomes

where N(ij) corresponds to individual binding pairs, ξk refers
to successive segments of mers k each of which has a
Kuhn length of ξk. The first summation of Equation (30)
scales the contribution of ΔSγô ( ξk) (the local coarse-grained entropy; App A6) for all segments of mers k in the polymer
chain and the second summation scales the entropy of a
group of binding pairs (the global coarse-grained entropy;
App A6), where ξk can vary depending on the location of ij.
Here we presume that ξk > 1 mer. Hence, the model is
easily adapted to a variable Kuhn length from first principles;
unlike either the lattice model or the GPC.
We can now understand from the total entropy that
branching in RNA structures reduces the entropy loss.
Consider two branches of length N1 and N2 such that N1 + N2 ≤ N3, where N3 is the closing point of the two branches. It is
clear that even $q_o^{N_1} \cdot q_o^{N_2} = q_o^{N_1+N_2} \le q_o^{N_3}$; surely, therefore,
$N_1^{N_1} \cdot N_2^{N_2} \ll N_3^{N_3}$. Hence, Equation (22) shows that branching
is a way to reduce entropy loss in a complex structure.
We should expect multibranch loops in slowly folding polymers
like RNA to branch if there is any reasonable option
to do so. It is possible therefore, to scale these contributions
independently allowing a variable Kuhn length within the
same structure via Equation (30), yielding a variable flexibility
in the final structure.
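The branching inequality can be checked in logarithmic form. The sketch below assumes the order-of-magnitude count N^N for each enclosed length and reports ln(N3^N3) minus ln(N1^N1 · N2^N2); the function names and the example lengths are hypothetical.

```python
import math


def log_conformations(n: int) -> float:
    """ln(N^N) for an enclosed length of N mers (0 for an empty segment)."""
    return n * math.log(n) if n > 0 else 0.0


def branching_log_gain(n1: int, n2: int, n3: int) -> float:
    """Return ln(N3^N3) - ln(N1^N1 * N2^N2) for branches with N1 + N2 <= N3."""
    if n1 + n2 > n3:
        raise ValueError("the two branches must fit inside the closing length N3")
    return log_conformations(n3) - (log_conformations(n1) + log_conformations(n2))


# Hypothetical branch lengths: two stems of 10 and 15 mers closed at 30 mers.
print(branching_log_gain(10, 15, 30))
```

A positive value means that the two shorter branches together restrict far fewer conformations than a single closure spanning the full length N3.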
For a pure lattice model where no correction for degeneracy
is necessary,ΔSξγô involves a small correction proportional
to ln(qo). The entropy in Equation (30) is then

which is a linear function of N: (N-1)ln(qo) (White et al.,
2005).
Using the CLE model, not only have we found a way to
transform the lattice model so that it is consistent, we have
shown how to evaluate a lattice when the structure has a
variable flexibility.
Squeezing a More Realistic Model from the Boundaries
In previous Sections, we have already taken issue with
the Markov chain approximation used in the GPC-model
and the lattice model. The lattice model and the GPC are
merely statistical models that ignore the physical realities of
the systems they model. These models have largely been
successful because the physically impossible configurations
just so happen to have a small enough probability that ignoring
their “possibility” does not significantly affect many results.
Nevertheless, outrageously absurd configurations can be imagined that become ever more possible with increasing
length. Hence, an attrition of such conformations is expected
particularly for large N. Here we consider how to
build a more realistic model for estimating the total number
of conformations of a biopolymer that considers real polymers
with self avoiding interactions, coordination limits and
chain-winding limits.
Equations (23) and (29) express a family of equations to
which both models belong. We have seen in Equation (4)
that the lattice model can underestimate the number of conformations
for large sequence length. Likewise, because
the Markov model involves non-interacting particles, the
GPC-model can overestimate the true number of conformations.
Therefore, we can set bounds on the solution.
The basic lattice model permits folding back on itself. This
is certainly physically impossible and should be removed
from the set of possibilities. This is addressed by considering
a self avoiding chain. Since folding back on the same
chain is forbidden in this model, the coordination number
(qo) must necessarily be reduced. We therefore introduce
an effective coordination number $\tilde{q}$ for the lattice where $\tilde{q} < q_o$
(Sykes, 1963). See App B for an explanation of how
to estimate an effective coordination number.
The upper bound is the GPC model. For the GPC model,
the self avoiding walk is often approximated using the lattice
model results of (Fisher, 1966), where the exponent (γ)
on the volume term of Equation (A12) or the logarithmic
term of Equation (25) or (A19) increases from 1.5 to 1.75
in 3 dimensions. The consequence of a self avoiding walk is
that it tends to increase the volume of the polymer.
We begin by assuming that the Kuhn length (ξ) is 1 mer
for the GPC. There is no loss of generality in this assumption
because the “lattice” can also be set to have the same
spacing as the Kuhn length so that the same boundaries
apply. For a lattice constant different from the Kuhn length,
Equation (30) scales the lattice model accordingly since the
Kuhn length tends to freeze out the degrees of freedom of
the monomers.
The true solution must lie between these two bounds such
that


where ψ is a constant, ν is an excluded volume weight, and
δ (0.5 ≤ δ ≤ 2) is the weight on the exponential function
(see App A5 and (Dawson et al., 2006)). For the standard
GPC-model ν = 1/2 and δ ≡ 2. When δ < 2, the weight on
N decreases, and if the system is globular (ν → 1/3), this further
decreases the weight. Setting α(N) = γN, $q(N) = (\psi N)^{\delta\nu}$,
$g(N) = \exp\{(\psi N)^{1-\delta\nu}/(1-\delta\nu)\psi\}$, $h(N) = e^{N}$, $w = \exp(\delta\nu)$, and
β = ζ(γ,δ) in Equation (22), the derivative of the result (for
δν < 1)§§ is
§§ The expression for g(N) is also true for δν = 1. To see that it is so, one can integrate the following inequality $1/x^{1+\epsilon} \le 1/x^{1-\epsilon}$ between fixed limits a to b
(a < b) and then bring b − a arbitrarily close to zero as ε → 0. It therefore follows that when δν = 1, the assembled components of the expression $g(N)^{\beta}$
will contain the argument $(\psi N)^{1-\delta\nu}/(1-\delta\nu)$, which must approach ln(ψN) as δν → 1, and this results in the Gaussian expression found in Equation (23).
It is therefore part of the same family of equations. The equation is also true for δν > 1; however, the result can even exceed the GPC model which already
overestimates the true number of conformations. This case may be valid for denatured proteins and RNA where the solvent could become part of the “conformations” (in effect). This case is not considered here.
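The limiting step in this footnote can be sketched numerically under one stated assumption: the argument is read as a definite integral between fixed limits, since a bare ratio x^ε/ε would diverge as ε → 0. The integral of x^(-(1-ε)) from a to b is (b^ε − a^ε)/ε, and it approaches ln(b/a) as ε → 0. The function name and the sample limits are illustrative.

```python
import math


def integral_inverse_power(a: float, b: float, eps: float) -> float:
    """Integral of x**(-(1 - eps)) from a to b, with 0 < a < b.

    For eps != 0 this equals (b**eps - a**eps) / eps; expm1 keeps the
    difference accurate when eps is small. For eps == 0 it is ln(b / a).
    """
    if not 0.0 < a < b:
        raise ValueError("limits must satisfy 0 < a < b")
    log_ratio = math.log(b / a)
    if eps == 0.0:
        return log_ratio
    return (a ** eps) * math.expm1(eps * log_ratio) / eps


# Hypothetical limits: from 1 mer to 50 mers.
for eps in (0.5, 0.1, 0.01, 1e-6):
    print(eps, integral_inverse_power(1.0, 50.0, eps), math.log(50.0))
```

In this reading ε plays the role of 1 − δν, so the power-law form passes continuously into the logarithm of the Gaussian case.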

where ψν = $\xi^{-1}(\xi/\lambda)^{1/\nu}$, $\zeta(\gamma,\delta) = [\Gamma(\gamma+3/\delta)/\Gamma(\gamma+1/\delta)]^{\delta/2}$
(from Equation (A14)) and ξ = 1 mer. The form of Equation
(36) is identical to that given in Equation (A19).
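The factor ζ(γ,δ) is straightforward to evaluate. The sketch below computes it through the log-Gamma function for numerical stability; the function name `zeta_weight` and the trial value γ = 1.75 are assumptions made for illustration only.

```python
import math


def zeta_weight(gamma: float, delta: float) -> float:
    """Evaluate [Gamma(gamma + 3/delta) / Gamma(gamma + 1/delta)] ** (delta / 2).

    Both Gamma-function arguments are assumed to be positive.
    """
    log_ratio = math.lgamma(gamma + 3.0 / delta) - math.lgamma(gamma + 1.0 / delta)
    return math.exp(0.5 * delta * log_ratio)


# Trial values: delta = 2 (Gaussian), 1 (exponential), 1/2 (square-root exponential).
for delta in (2.0, 1.0, 0.5):
    print(delta, zeta_weight(1.75, delta))
```

For δ = 2 the ratio reduces to Γ(z + 1)/Γ(z) = z with z = γ + 1/2, so ζ(γ, 2) = γ + 1/2, which gives a quick sanity check on any implementation.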
When we apply Equation (29) for ξ > 1 mer, α(N) = γN/ξ and β = ζ(γ,δ)/ξ (where q(N), g(N), h(N) and w are unchanged),
we obtain the exact expression in Equation (A19)
(37)
which indicates that ξ is scaling the conformations (and
therefore also the entropy) by the effective mers. Reductions
to γ, particularly on the logarithmic term of Equation
(A19), would further reduce this number of conformations
from the standard GPC-model. This offers a far better description
of the actual number of conformations.
The weight δ is a measure of the long range correlation where δ=2 (Gaussian) reflects localized or independent
coupling, δ =1 (exponential) reflects diffusive coupling and
δ =1/2 (exponential square root) reflects a glassy unstructured
coupling. Because the polymer chain requires real
physical length considerations in evaluating this coupling,
there is certainly a “diffusive” component in the structure in
the sense that the correlation extends over a far longer range
than would occur if the polymer chain was non-interacting.
Consequently, this reduces the number of degrees of freedom
and independence of each effective mer. In general,
most biopolymers that we have studied so far tend to fall in
the range 1 ≤ δ ≤ 2. The parameter ν (App A) tends to be
less than 1/2 in globular proteins (Grosberg and Khokhlov,
1994) suggesting that νγδ < 1; i.e., the correlation is glassy.
By proper partitioning of this function, one could even introduce
a variable δ or γ to this problem. In addition to regions
of variable flexibility, some biopolymers are believed to have
disordered regions and globular regions as well; hence “squeezing” offers additional options for future exploration.
In this Section, we have shown that we can “squeeze”
the correct solution between limits; the lattice model on the
one end and the GPC at the other end. The true conformation
limits on folding a beta-strand back and forth can be
largely accounted for by including a weight δ on the logarithmic
term of Equation (36) because the solution is bounded
between the two extremes (Equation (34)). “Squeezing” is
a convenient starting point for developing tractable statistical
models that consider the steric effects and long range correlation
contributions that are ignored in statistical Markov
chain based models (Montroll, 1950; Feller, 1968 and 1971).
Incorporating the Worm Like Chain Model into the CLE Model
As shown in (Dawson et al., 2001a), the logarithmic function
in Equation (37) represents the resistance of a polymer
to compression and the remaining function is associated with
the stretching of a polymer chain. The stretching term is
important to comment on.
The function in Equations (36) and (37) is the generalized
treatment of the probability based on a Gamma function (and its derivative). The GPC does not properly model
changes in the entropy due to stretching the chain to a point
approaching the contour length. The solution for the worm
like chain model (Marko and Siggia, 1995), also known
as a Porod-Kratky chain (Flory, 1969), weights the
stretching term ($g^{\beta}(N)/h^{\beta}(N)$) with far greater accuracy.
The force response for the worm like chain is shown by
(Marko and Siggia, 1995) to be

where A is the persistence length (A ≈ ξb/2; (Flory, 1969)), L
is the contour length (L=Nb; Equation (A2)) and we have
used the relationship in Equation (8). Neglecting bending
and over-stretching issues of DNA (Rouzina and Bloomfield,
2001ab) etc., the entropy can be approximated by integrating
the negative of Equation (38) with respect to r, yielding

where S˜k(r) emphasizes the “spring like” contribution to
the entropy and C is an integration constant.
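As a hedged illustration of this integration step, the sketch below uses the widely quoted Marko-Siggia interpolation formula, f A/(kB T) = 1/[4(1 − r/L)²] − 1/4 + r/L. This is the standard published form and is only assumed to correspond to Equation (38), whose exact expression (after applying Equation (8)) is not reproduced here. The function names, the midpoint-rule check and the sample parameters are illustrative assumptions.

```python
from typing import Callable


def wlc_force(r: float, contour: float, persistence: float, kt: float = 1.0) -> float:
    """Marko-Siggia interpolation force at extension r (0 <= r < contour)."""
    x = r / contour
    if not 0.0 <= x < 1.0:
        raise ValueError("extension must satisfy 0 <= r < L")
    return (kt / persistence) * (0.25 / (1.0 - x) ** 2 - 0.25 + x)


def wlc_work(r: float, contour: float, persistence: float, kt: float = 1.0) -> float:
    """Analytic integral of wlc_force from 0 to r."""
    x = r / contour
    return (kt * contour / persistence) * (
        0.5 * x * x - 0.25 * x + 0.25 / (1.0 - x) - 0.25
    )


def wlc_entropy_change(r: float, contour: float, persistence: float) -> float:
    """Entropy change in units of k_B, taking Delta S = -(work) / T."""
    return -wlc_work(r, contour, persistence, kt=1.0)


def midpoint_integral(func: Callable[[float], float], a: float, b: float,
                      steps: int = 20000) -> float:
    """Plain midpoint rule, used only to cross-check the analytic integral."""
    h = (b - a) / steps
    return h * sum(func(a + (k + 0.5) * h) for k in range(steps))


# Hypothetical chain: contour length 100 b, persistence length 2 b, stretched to 0.8 L.
L, A = 100.0, 2.0
print(wlc_work(80.0, L, A), midpoint_integral(lambda r: wlc_force(r, L, A), 0.0, 80.0))
print(wlc_entropy_change(80.0, L, A))
```

The 1/(1 − r/L) term in the integrated form diverges as r approaches L, which is the finite-extensibility behavior that a purely Gaussian stretching term lacks.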
Equation (38) is scaling the system to N/ξ links with a
persistence length A. The CLE model is defined by each
mer and there are ξ mers in the Kuhn length (for the usual
case where ξ >1 mer). To transform this to a mer-equivalent
(rij) expression, we must scale Equation (39) by a weight
1/ξ. Therefore, for a single binding pair ij;

Since there is a group of n8 (=ξ) binding pairs in a given link
of the chain, the entropy is unchanged when we consider
the average entropy of the group;(s˜k(r8)/ ξ)n8=s˜k(r8),
where r8 is the effective averaged position of the group.
The CLE model averages the contribution from each binding
pair of mers ij in the group.
Let
A ≈ ξb/2 and L → Lij = Nij b (43)
Then Equation (39) transforms to

***In the definition of the entropy in Equation (A11), the GPC model has the limits 0 < r < ∞. For the worm like chain model, these limits must change
to 0 < r < L.
From the definition of the reference state in Equation (A17),
we obtain

Since for large Nij both $\langle r_{ij}^2 \rangle = \xi N_{ij} b^2 \ll L_{ij}^2$ and $\lambda b \ll L_{ij}$, Equation (43) quickly simplifies to

which is exactly the second expressed term of Equation
(A17).
For structure prediction problems, the stretching contribution
can to some extent be neglected. The Jacobson-
Stockmayer model is a prime example (Jacobson and
Stockmayer, 1950). However, if one were to consider the
same situation in which multiple points were pulled apart,
we must include the independent contributions of Equation
(43). It should be clear that the so-called “Gaussian” contribution
does not adequately address this issue because it allows
the chain to extend to infinite length. Equation (43) is
in far more reasonable agreement with the anticipated behavior
of a real polymer when stretched out to length L.
For the case of stretching, we can also use Equation (29).
Consider a chain that is stretched from the equilibrium position $r[i] = b\sqrt{\xi N}$ (App A2) to some significant fraction of its
contour length ρ[f] = r[f]/(Nb), where [i] refers to the initial
and [f] the final state of the system and r[i] < r (= r[f]) < Nb.
Using $\rho[i] = r[i]/(Nb) = \sqrt{\xi/N}$, ρ = r/(Nb) and integrating
Equation (43) with respect to r using the states [i] and [f], we obtain

Where 
We now have the means to seek an equivalent expression
for stretching the GPC toward its full extension: ρ → 1.
To define g(N) and h(N) in Equation (29), the terms on
the right hand side of Equation (45) are integrated. Integrating
with respect to N and using the substitution N=r/ (ρb)
while holding r and b constant, this yields
(46)
and
(47)
where the remaining terms are Equation (29) becomes
(48)
where r → Nb is assumed. From Equation (48), the entropic
response of a chain stretched out to its contour length from
the equilibrium position r[i] is
(49)
where $0 < b\sqrt{\xi N} < r < Nb$. Equation (49) yields an expression
that handles stretching. As r → Nb, the dominant term
in Equation (49) is Γ(r/Nb) and the logarithmic term can be
basically neglected.
We have shown that we can incorporate the worm like
chain model directly into the model in a seamless fashion
and therefore drastically improve the stretching domain predictions
of the CLE model. This is because the stretching
and compression components are decoupled. We have
therefore shown that the CLE model is not only universal; it
is highly versatile as an entropy estimation scheme for
biopolymers. Moreover, this shows that the weight of the
stretching term need not be precisely a Gaussian weight;
even an alternative constant weight is allowed because the
compression and extension components are separable.
The Virial Equation of State and the Contact Order Model
In Equation (8), we introduced the virial equation of state.
Here we examine the equation of state of an ideal polymer
in the context of the CLE model.
To tie this to familiar concepts, we first construct the equation of
state for an ideal gas. An ideal gas consists of non-interacting
particles. For such a gas, the measured parameters P, V and T
represent, respectively, the average values for the pressure, volume
and temperature of a gas consisting of N gas particles. There
are so many gas particles in a normal volume that we simply cannot
measure each one; instead, we measure their average collective
properties. We can say effectively that each gas particle in a
vessel occupies an average fractional volume V/N and if we leave
P free, then P depends on N, V and T. For an ideal gas, the Helmholtz
free energy is $F = c_v T - N k_B T \ln(V/V_o)$, where cv is the specific heat at
constant volume and Vo is a reference state volume. The virial
equation of state for an ideal gas is immediately obtained
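For reference, the step can be written out explicitly. The block below applies the standard thermodynamic identity P = −(∂F/∂V) at constant T and N to the Helmholtz form quoted above. It is a textbook reconstruction rather than a transcription, so no equation number from the original is assumed.

```latex
\begin{align*}
P &= -\left(\frac{\partial F}{\partial V}\right)_{T,N}
   = -\frac{\partial}{\partial V}
     \left[\, c_v T - N k_B T \ln\!\left(\frac{V}{V_o}\right) \right]
   = \frac{N k_B T}{V}, \\
PV &= N k_B T .
\end{align*}
```

Only the logarithmic term depends on V, so the specific heat term drops out of the pressure and the familiar ideal gas law follows.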

In a similar way, using Equation (8) and referring to Figure
5, we can construct an average $\bar{r}$ and an average $\bar{f}$ such that

Figure 7: A comparison of the predicted minimum-free-energy secondary-structures of tRNA using the standard-model
(left) that neglects global interactions and the CLE-model (right) that incorporates them. The base-pairing thermodynamic
parameters are identical for both calculations. (a) Optimal secondary structure predictions of tRNA(Phe) for E. coli. (b)
Optimal secondary structure predictions of tRNA(Ala). (c) Optimal secondary structure predictions of tRNA(Ser) corresponding
to codon UCC. (d) Optimal secondary structure predictions of tRNA(Ser), codon AGC.
Figure 8: Comparison of the optimal secondary structure of the Tetrahymena thermophila group I intron (the L-21 ScaI
ribozyme) using the standard-model (left) and the CLE-model (right).
where Nbp is the number of binding pairs (base pairs in this case) and $\bar{r}$ would be taken as roughly the midpoint of the
stem shown in Figure 5 and reflects the collective interaction
of all the base pairs forming the single stem of the folded
RNA molecule in Figure 5. Extrapolating to more complex
structures, a single domain can be defined by $\bar{r}$ and the
observed behavior of the system will depend mostly on the
largest domain. Hence a biopolymer would have some $\bar{r}$ such that $\bar{r} = (1/2)\max\{r_{ij}\}$.
For both the ideal gas and the ideal polymer discussed
here, the contributions to P and $\bar{f}$ are due to the sum of the
interactions of all the components in the system. For the
ideal gas, this was just added up by multiplication. For the
ideal polymer, we have to sum the binding pair contributions
individually. Using the average values for $\bar{f}$, $\bar{r}$ and Nbp, the ideal polymer equation can also be expressed in the same
form as the ideal gas.
The variable $\bar{r}$ is closely connected with the contact order
model, where the rate determining folding time is established
by max{rij} (Ivankov et al., 2003). The maximum in
the entropy is correlated with the largest value Nij. This
means that the folding time of the largest domain will be the
rate limiting step. We have shown that the contact order
model is a form of the virial equation of state and therefore
expresses the average equation of state for the system.
Therefore, the CLE model has the contact order model
within its interpretation framework.
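A rough sketch of how the largest domain might be picked out from a list of binding pairs is given below. It uses the sequence separation |j − i| as a stand-in for rij, which is an assumption made purely for illustration, and the function name and the example pair list are hypothetical.

```python
from typing import Iterable, Tuple


def largest_domain(pairs: Iterable[Tuple[int, int]]) -> Tuple[int, float]:
    """Return (N_bp, r_bar) with r_bar = 0.5 * max |j - i| over all pairs.

    pairs: binding pairs (i, j) given as monomer indices along the chain.
    """
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("at least one binding pair is required")
    max_separation = max(abs(j - i) for i, j in pair_list)
    return len(pair_list), 0.5 * max_separation


# Hypothetical single stem of three stacked pairs closing a hairpin.
stem = [(1, 20), (2, 19), (3, 18)]
print(largest_domain(stem))
```

In this simplified picture the outermost pair of the largest stem fixes the average, in line with the statement that the largest domain dominates the observed behavior.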
To Experimentalists
We have derived and discussed, at length, a theory
that supports modeling the coarse-grained entropy of biopolymers.
We have shown that the existing models are subsumed
and extended under the theoretical framework of the CLE model. Here we explain why experimentalists
should want to understand the coarse-grained entropy we try to model.
First, the Kuhn length (ξ) is rarely mentioned in most studies
of biopolymers, yet flexibility is known to be very important
in functional proteins and RNA molecules. For typical
protein structures or folded single strand RNA (ssRNA)
structures, we can assume that 3<ξ ≤10 mers. However,
double strand RNA or DNA (dsRNA/dsDNA) can easily
show ξ > 200 mers; the very same linear sequence has two
drastically different Kuhn lengths (i.e., flexibilities). Similarly,
fibrous proteins (Lehninger, 1975) like collagen (a major
component of tissue consisting of a triple helix of amino
acids) and α-keratin (found in hair with an alpha helix) have
a long Kuhn length. A short fragment of such an amino
acid sequence looks similar to many protein fragments or
peptides. Why does ξ change?
Second, aggregation is what can happen when you boil or
denature any of these biopolymers. We also know of plaques
that form in neurodegenerative diseases. It is actually quite
easy to produce aggregation in an amino acid sequence:
indeed, it seems more difficult to produce amino acid sequences
that don’t easily aggregate (He et al., 2008). Perhaps
natural selection has already filtered out most of these
dysfunctional amino acid sequences from the gene pool and
what we see is a small subset of the actual possibilities.
Why is aggregation so common?
Third, we know that there are domains in folded proteins
and RNA. These are typically of the order of 200 to 500
monomers, though some are larger. What process limits this
size?
Fourth, what are the coarse-grained differences between
protein-protein binding and protein folding?
Equation (30) reveals a large part of where these features
come from. First, for dsRNA and fibrous proteins there
are no loops. The second term can be neglected in first
approximation. In the absence of any well defined tuning
from natural selection, the entropy cost of a functional domain
is non-linear (Dawson et al., 2001ab)

where pbp is the fraction of paired monomers in the domain
and C(ξ) represents the Kuhn length corrections contained in
the first term of Equation (30). Like C(ξ), the enthalpy tends
to be local and linear in contribution. Moreover, the primary
contributors to the enthalpy are the statistical pairing potentials
(Zhang et al., 1997; Mintseris and Weng, 2003) that
only grow linearly with the presence of pairing interactions.
Hence, on the whole, it is typically far less expensive in
entropy to combine these biopolymers in fibrils than to form
complex folds. Aggregation is far easier than forming well-ordered
and expensive structural folds. It is more economical to dock
many proteins together than to fold up a single complex
functional protein. Indeed, according to Equation (30), it is
hardly surprising that amyloid proteins form plaques; rather,
it would be surprising if they did not.
Yet ignorance abounds. In some biophysics meetings,
only two or maybe three people even mention persistence
length or Kuhn length. Flexibility receives honorable mention,
but its application to the design and properties of
biopolymers is essentially ignored because there is no global
concept of entropy. One can see many people who treat
the entire domain of a protein or an RNA molecule with the
same type of additive statistical pairing potential as if there
is no difference between biopolymers that fold, form fibrous
structures or dock. Occasionally, there is mention that a
global effect may confound the prediction (Zhang et al.,
1997), but that is as far as it goes. Hardly anyone seems to
find it strange that proteins can so easily aggregate. Lattice
models and worm-like chain models are used on the same
protein, yet no one even asks how the same protein can
have such different entropies. If we proceed so fallaciously on
the global coarse-grained scale, how in the world can we
expect to get the fine-grained details right?
Qualitatively, the CLE model can certainly explain these
properties. In our previous work, we have also shown that
in at least some important cases, the CLE model can quantitatively
address these issues (Dawson et al., 2007) and provide
structures that are predicted at the minimum free energy.
Some solutions for RNA folding are quite stable and
hardly difficult to hit on with the CLE model. Figures 7 and
8 show some examples of the predicted minimum free energy
structures for tRNA and the group I intron, respectively,
for the standard model, which neglects these global contributions,
and the CLE model, which considers these interactions.
The local statistical-thermodynamic potentials are identical
in these calculations: we used the Mfold 3.0
data set (Mathews et al., 1999). The standard model results
are calculated using the Vienna Package RNAfold version
1.4 and the CLE model calculations are done using vsfold5
and vsfold4 (Dawson et al., 2006; 2007). This clearly shows
that it is possible to use statistical-thermodynamic pairing
potentials and predict a minimum free energy structure that
approaches the native state structure for the RNA molecule.
For tRNA, we observed 80% success in a complete genome
of RNA (Ito N, unpublished data). Preliminary protein
calculations also show promise (Dawson et al., 2005).
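One way to make a comparison between a predicted and a native secondary structure quantitative is sketched below. The text does not state how success was scored, so this is only an assumed scoring scheme: both structures are given in the dot-bracket notation used by the Vienna Package, and the score is the fraction of native base pairs recovered (sensitivity) together with the fraction of predicted pairs that are native (positive predictive value). The function names are hypothetical, and plain parentheses cannot encode pseudoknots, so this sketch would not cover pseudoknotted predictions.

```python
from typing import Set, Tuple


def base_pairs(dot_bracket: str) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack = []
    pairs = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def pair_recovery(predicted: str, native: str) -> Tuple[float, float]:
    """Return (sensitivity, ppv) of predicted pairs against native pairs."""
    if len(predicted) != len(native):
        raise ValueError("structures must have the same length")
    pred = base_pairs(predicted)
    nat = base_pairs(native)
    common = pred & nat
    # Conventions for empty sets are an assumption of this sketch.
    sensitivity = len(common) / len(nat) if nat else 1.0
    ppv = len(common) / len(pred) if pred else 1.0
    return sensitivity, ppv


if __name__ == "__main__":
    native = "((..))"     # toy structures, not real tRNA data
    predicted = "(....)"
    print(pair_recovery(predicted, native))
```

A threshold on such a score (for example, counting a prediction as a success when most native pairs are recovered) would then give a success rate over a set of sequences; the choice of threshold is likewise an assumption.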
Success is not guaranteed. For one thing, there is currently
no way to know what the Kuhn length should be for a
particular problem, and therefore, we usually have to make
an educated guess. There are clear differences in the behavior
of the pairing potentials such as the Mfold 2.9 data
set (Freier et al., 1986). This shows local interactions are
important in these problems too. Likewise, there are indications
that the GPC formulation could use different weights
for g and h in Equation (29). Hence, what tuning should be
applied to the CLE approach is still not completely clear.
This suggests that more needs to be done with statistical
pairing potentials in the context of the global entropy.
However, the model
has consistently offered a fighting chance and has already
shown that it can overcome many obstacles and progress
onward.
In this work, we have provided a foundation that unifies
the lattice model, the worm-like chain model, the Gaussian
polymer chain model, and the contact order model under
one framework. Clearly, each of these models has come close to
the right answer for the coarse-grained entropy of polymers.
This work does not solve every aspect of this problem.
Nevertheless, the method presented here is a powerful
tool for guiding us on how to ask the right questions.
Acknowledgements
This work was supported by a Grant-in-aid from the Ministry
of Education, Culture, Sports, Science and Technology
of Japan (MEXT). We thank Elliot Lieb for pointing us to
the subadditivity of entropy theorem and Michael Zuker for
asking WD “why does the lattice model fail when q > N?”.
WD would like to thank Neil A McDougall (theoretical particle
physicist), Kenji Yamamoto (International Medical Center
of Japan), Greg Rose (business consultant), Craig Stevens
(software engineer), Yucong Zhu (optical engineer) and
Bejon Kumar Bhowmick (bioinformatics) for their encouragement
and discussions.
References
- Adzhubei AA, Sternberg MJ (1994) Conservation of
polyproline II helices in homologous proteins: implications
for structure prediction by model building. Protein
Science 3: 2395-2410.
- Arinstein AE (2005) Uniaxial ordering and rotator phase
of ribbonlike polymers. Phys Rev E 72: 051806.
- Ashcroft NW, Mermin ND (1976) Solid State Physics.
Philadelphia, Saunders College.
- Baldwin RL, Rose GD (1999) Is protein folding hierarchic?
I. Local structure and peptide folding. Trends in
Biochemical Sciences 24: 26-33.
- Baldwin RL, Rose GD (1999) Is protein folding hierarchic?
II. Folding intermediates and transition states. Trends
in Biochemical Sciences 24: 77-83.
- Chan SC, Dill KA (1997) Solvation: how to obtain macroscopic
energies from partitioning and solvation experiments.
Annual Review of Biophysics and Biomolecular
Structure 26: 425-59.
- Chen SJ (2008) RNA folding: conformational statistics,
folding kinetics, and ion electrostatics. Annu Rev Biophys
37: 197-214.
- Cohen FE, Sternberg MJ, Taylor WR (1982) Analysis
and prediction of the packing of alpha-helices against a
beta-sheet in the tertiary structure of globular proteins.
Journal of Molecular Biology 156: 821-62.
- Dawson W, Fujiwara K, Kawai G (2007) Prediction of
RNA pseudoknots using heuristic modeling with mapping
and sequential folding. PLoS One 2: 905.
- Dawson W, Fujiwara K, Kawai G, Futamura Y,
Yamamoto K (2006) A method for finding optimal RNA
secondary structures using a new entropy model (vsfold).
Nucleotides, Nucleosides, and Nucleic Acids 25: 171-189.
- Dawson W, Kawai G, Yamamoto K (2005) Modeling
the long range entropy of biopolymers: A focus on protein
structure prediction and folding. Recent Research
Developments in Experimental & Theoretical Biology 1:
57-92.
- Dawson W, Suzuki K, Yamamoto K (2001a) A physical
origin for functional domain structure in nucleic acids as
evidenced by cross-linking entropy: part 1. Journal of Theoretical
Biology 213: 359-86.
- Dawson W, Suzuki K, Yamamoto K (2001b) A physical
origin for functional domain structure in nucleic acids as
evidenced by cross-linking entropy: part 2. Journal of Theoretical
Biology 213: 387-412.
- Day R, Daggett V (2003) All-atom simulations of protein
folding and unfolding. Advances in Protein Chemistry
66: 373-403.
- de Gennes PG (1979) Scaling Concepts in Polymer
Physics. Ithaca, Cornell University Press.
- Dill KA, Stigter D (1995) Modeling protein stability as
heteropolymer collapse. Advances in Protein Chemistry
46: 59-104.
- Ding F, Tsao D, Nie H, Dokholyan NV (2008) Ab initio
folding of proteins with all-atom discrete molecular dynamics.
Structure 16: 1010-8.
- Feller W (1968) An Introduction to Probability Theory
and Its Applications (pt I). New York, Wiley.
- Feller W (1971) An Introduction to Probability Theory
and Its Applications (pt II). New York, John Wiley &
Sons.
- Fisher ME (1966) Effect of excluded volume on phase
transitions in biopolymers. J Chem Phys 45: 1469-1473.
- Flory PJ (1953) Principles of Polymer Chemistry. Ithaca,
Cornell University Press.
- Flory PJ (1969) Statistical Mechanics of Chain Molecules.
New York, Wiley (Regrettably out of print.)
- Freier SM, Kierzek R, Jaeger JA, Sugimoto N, Caruthers
MH, et al. (1986) Improved free-energy parameters for
predictions of RNA duplex stability. Proc Natl Acad Sci
USA 83: 9373-7.
- Friederich MW, Vacano E, Hagerman PJ (1998) Global
flexibility of tertiary structure in RNA: yeast tRNA(phe)
as a model system. Proceedings of the National Academy
of Science (USA) 95: 3572-77.
- Go N (1999) The consistency principle revisited. In: Old
and new views of protein folding. Amsterdam, Elsevier
Science, pp 249-257.
- Grosberg AY, Khokhlov AR (1994) Statistical Physics
of Macromolecules. New York, AIP Press.
- Hagerman PJ (1997) Flexibility of RNA. Annual Review
of Biophysics and Biomolecular Structure 26: 139-56.
- He YN, Chen YH, Alexander P, Bryan PN, Orban J
(2008) NMR structures of two designed proteins with
high sequence identity but different fold and function.
Proc Natl Acad Sci USA 105: 14412-14417.
- Hnizdo V, Tan J, Killian BJ, Gilson MK (2008) Efficient
calculation of configurational entropy from molecular
simulations by combining the mutual-information expansion
and nearest-neighbor methods. J Comput Chem 29:
1605-14.
- Honig B, Ray A, Levinthal C (1976) Conformational flexibility
and protein folding: rigid structural fragments connected
by flexible joints in subtilisin BPN. Proc Natl Acad
Sci USA 73: 1974-8.
- Ivankov DN, Garbuzynskiy SO, Alm E, Plaxco KW,
Baker D, et al. (2003) Contact order revisited: influence
of protein size on the folding rate. Protein Science 12:
2057-62.
- Jacobson H, Stockmayer W (1950) Intramolecular reaction
in polycondensations. I. The theory of linear systems.
Journal of Chemical Physics 18: 1600-1606.
- Kolinski A, Gront D, Pokarowski P, Skolnick J (2003) A
simple lattice model that exhibits a protein-like cooperative
all-or-none folding transition. Biopolymers 69: 399-405.
- Kuhn W (1934) Über die Gestalt fadenförmiger Moleküle
in Lösungen [On the shape of filiform molecules in solution].
Kolloidzeitschrift 68: 2 (from citation in I. Müller, 2007, ISBN: 978-3-540-46226-2).
- Kuhn W (1936) Beziehungen zwischen Molekülgröße,
statistischer Molekülgestalt und elastischen Eigenschaften
hochpolymerer Stoffe [Relations between molecular size,
statistical molecular shape and elastic properties of high
polymers]. Kolloidzeitschrift 76: 258 (from citation in I. Müller, 2007, ISBN: 978-3-540-46226-2).
- Lebedev NN (1965) Special Functions & their Applications.
Englewood Cliffs (NJ), Prentice-Hall. Dover reprint.
- Lehninger AL (1975) Biochemistry. Ed. 2. New York,
Worth Publishers, INC. (Out of print: newer editions may
contain the same subject material. Newer books such
as Stryer L Biochemistry, Freeman also contain very
similar material on this related topic but not in the
same exposition.)
- Lesk AM (2001) Protein Architecture. Oxford, Oxford
University Press.
- Liu SM, Haynes CA (2005) Energy landscapes for adsorption
of a protein-like HP chain as a function of native-state
stability. J Colloid Interface Sci 284: 7-13.
- Ma SK (1973) Introduction to Renormalization Group.
Reviews of Modern Physics 45: 589-614.
- Marko JF, Siggia ED (1995) Stretching DNA. Macromolecules
28: 8759-8770.
- Mathews DH, Sabina J, Zuker M, Turner DH (1999)
Expanded sequence dependence of thermodynamic parameters
improves prediction of RNA secondary structure.
Journal of Molecular Biology 288: 911-940.
- McKenzie DS (1976) Polymers and scaling. Physics
Reports 27C: 35-88.
- Mintseris J, Weng Z (2003) Atomic contact vectors in
protein-protein recognition. Proteins 53: 629-39.
- Mirny LA, Abkevich VI, Shakhnovich EI (1998) How
evolution makes proteins fold quickly. Proc Natl Acad
Sci USA 95: 4976-81.
- Montroll EW (1950) Markoff chain and excluded volume
effect in polymer chains. J Chem Phys 18: 734-743.
- Murray LJ, Arendall WB 3rd, Richardson DC,
Richardson JS (2003) RNA backbone is rotameric. Proc
Natl Acad Sci USA 100: 13904-9.
- Nash LK (1974) Elements of Statistical Thermodynamics.
Reading, Addison-Wesley. (Kindly reissued recently
by Dover Books and well worth the investment.)
- Onuchic JN, Nymeyer H, Garcia AE, Chahine J, Socci
ND (2000) The energy landscape theory of protein folding:
insights into folding mechanisms and scenarios. Advances
in Protein Chemistry 53: 87-152.
- Pappu RV, Rose GD (2002) A simple model for
polyproline II structure in unfolded states of alanine-based
peptides. Protein Science 11: 2437-2455.
- Pappu RV, Srinivasan R, Rose GD (2000) The Flory
isolated-pair hypothesis is not valid for polypeptide chains:
implications for protein folding. Proceedings of the National
Academy of Science USA 97: 12565-70.
- Pokarowski P, Kolinski A, Skolnick J (2003) A minimal
physically realistic protein-like lattice model: designing
an energy landscape that ensures all-or-none folding to
a unique native state. Biophys J 84: 1518-26.
- Richardson JS (1977) beta-Sheet topology and the relatedness
of proteins. Nature 268: 495-500.
- Richardson JS (1981) The anatomy and taxonomy of
protein structure. Advances in Protein Chemistry 34: 167-339.
- Rouzina I, Bloomfield VA (2001a) Force-induced melting
of the DNA double helix. 1. Thermodynamic analysis.
Biophys J 80: 882-93.
- Rouzina I, Bloomfield VA (2001b) Force-induced melting
of the DNA double helix. 2. Effect of solution conditions.
Biophys J 80: 894-900.
- Swendsen RH (2006) Statistical mechanics of colloids
and Boltzmann’s definition of the entropy. Am J Phys
74: 187-190.
- Swendsen RH (2008) Gibbs’ Paradox and the Definition
of Entropy. Entropy 10: 15-18.
- Sykes MF (1963) Self-avoiding walks on the simple cubic
lattice. J Chem Phys 39: 410-412.
- Takasu A, Watanabe K, Kawai G (2002) Analysis of
relative positions of ribonucleotide bases in a crystal structure
of ribosome. Nucleosides Nucleotides Nucleic Acids
21: 449-62.
- Taylor WJ (1948) Average length and radius of normal
paraffin hydrocarbon molecules. J Chem Phys 16: 257-267.
- Tinoco I, Bustamante C (1999) How RNA folds. Journal
of Molecular Biology 293: 271-81.
- Wall FT, Erpenbeck JJ (1959) New method for the statistical
computation of polymer dimensions. J Chem Phys
30: 634-637.
- Wall FT, Hiller LA, Atchison WF (1955) Statistical
computation of mean dimensions of macromolecules. II. J
Chem Phys 23: 913-921.
- White RP, Funt J, Meirovitch H (2005) Calculation of
the Entropy of Lattice Polymer Models from Monte Carlo
Trajectories. Chem Phys Lett 410: 430-435.
- Zhang C, Vasmatzis G, Cornette JL, DeLisi C (1997)
Determination of atomic desolvation energies from the
structures of crystallized proteins. J Mol Biol 267: 707-26.