Mathematical Modelling of Protein Precipitation Based on the Phase Equilibrium for an Antibody Fragment from E. coli Lysis

Yu Ji and Yuhong Zhou*
Department of Biochemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK

Copyright: © 2013 Ji Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract
Precipitation is an important operation in biopharmaceutical purification yet the mechanism of protein precipitation in multi-component solutions is not well understood. Existing models lack fundamental understanding of the process. In this paper, a new model describing how the protein solubility changes in the protein precipitation is proposed and is based on the phase equilibrium of the light liquid phase and dense solid phase. The model structure is generic and robust. It adequately reflects the non-linearity of protein precipitation kinetics and thus provides new fundamental insights into the protein precipitation in multi-component, complex protein solution. Two feed stocks of a pure fragment antigen-binding (Fab’) solution obtained by chromatographic purification and a clarified Fab’ homogenate solution from E. coli were used to examine the effect of ammonium sulphate concentrations and pH conditions on precipitation. It was found that the model can describe pure Fab’ precipitation well, and identify the non-ideal behavior of Fab’ precipitation in multi-component homogenates. Through statistical analysis, the model parameters have been further reduced from 8 to 4. The quality of the model is such that errors were within the acceptable statistical confidence limits, even when applied to multi-component impurity precipitation. The new model with fewer parameters is better than existing empirical models in reflecting the salting-in and salting-out effect of the protein precipitation. This demonstrated that the structure of the model is sound and over-fitting in the parameter estimation is avoided. The model can be applied directly to industrial processes for protein precipitation process design after appropriate calibration with the required operating conditions of pH and salt concentration.


Introduction
Protein precipitation is a technique that uses differences in protein solubility to precipitate proteins from the liquid phase into the solid phase. It has been used extensively to separate and purify proteins for sample preparation [1]. Ammonium sulphate is usually used to separate protein from complex solutions because it does not denature protein and has a very high salting-out effect [2][3][4]. Currently, with advanced fermentation technology, higher protein titres can be achieved upstream and it is now possible to produce multi-kilogram quantities of therapeutic monoclonal antibodies in a single batch [5]. However, this creates problems at the downstream purification stages. The high concentration of target protein plus impurities in the feedstock changes the physical properties of the protein solution. If such a complex biological material is applied directly to chromatographic columns, the columns are susceptible to fouling and blockage [6][7][8], which significantly increases chromatographic processing time and cost. Therefore, a primary separation, such as protein precipitation, may be beneficial in preparing a relatively clearer and less contaminated solution for expensive high resolution steps.
During precipitation, the solubility of a protein depends primarily on process conditions including pH, salt concentration and temperature [9]. In order to optimize the precipitation process operation, a good understanding of the impact of these conditions on the behavior of the protein is needed. For industrial scale process engineering and design purposes, a protein precipitation model that directly links protein solubility with operating conditions would help support industrial process development, e.g. scale-up, predict optimal process conditions and provide information for process control [10].
The first attempt to model protein solubility was by Cohn [11]. His log-linear equation, discussed later, gives a simple empirical relationship between the soluble protein concentration and the ionic strength of the solution over a narrow salt concentration range. Melander and Horvath [12] then improved Cohn's empirical equation by linking the hydrophobic effect with thermodynamic parameters such as the hydrophobic surface. Unfortunately, the improved model often sheds little light on bioprocess operation and design because the linkage between the operating conditions and the hydrophobic surface cannot be established. The universal quasi-chemical (UNIQUAC) model describes protein solubility by protein activity coefficients, and a polynomial relationship between protein activity coefficients and osmotic second virial coefficients can be used to model protein behavior [13,14], but experiments to obtain protein activity data are required. The theoretical thermodynamic equations that predict protein solubility from molecular radius and surface parameters [15][16][17] worked quite well in simple, defined systems in which all physical properties are known. However, such thermodynamic-based models are of limited use for process design and control because the thermodynamic properties of complex multi-component processing materials are unknown.
Modified empirical exponential models that describe the traditional sigmoid shape of the precipitation curve directly link predictions with process conditions [18,19], but these models provide little fundamental understanding. Although pH has been reported as a strong factor in protein precipitation, it was not considered in these models. Temperature is another variable that often strongly influences protein precipitation. However, as most proteins are sensitive to temperature, a fixed temperature is applied during the industrial precipitation process, typically a low temperature (~ 4°C), to prevent protein denaturation.
The goal of this paper is to develop and validate a protein precipitation model that uses bioprocessing conditions as inputs to predict the protein solubility for complex multi-component materials. The model will be based on theoretical phase equilibrium to achieve an improved process understanding.
Antibody fragments expressed intracellularly in E. coli, as next generation therapeutics, are cheaper to produce by fermentation than antibodies from mammalian cell culture because of the shorter culture time and less expensive media. They also have better selectivity than antibodies but are more difficult to purify due to the high level of impurities.
Two different feed stocks, a pure fragment of antibody (Fab') solution and a clarified Fab' solution from E. coli homogenate, will be used to examine the generality of the model. The model will be validated by experimental data, and then statistical tests will be used to evaluate the quality of the model. The predictions of the model will be compared with four existing models where pH will be introduced as an extra variable [11,18,19].

Materials and Methods
Sodium monobasic phosphate, sodium dibasic phosphate, sodium acetate and ammonium sulphate were purchased from Sigma Chemical Co. Ltd. (Dorset, UK). All chemicals were reagent grade. Fab' producing strain E. coli W3110 was kindly provided by UCB (Slough, UK) and the cell paste was provided by the Fermentation Group, Department of Biochemical Engineering in University College London.

Precipitation material preparation
E. coli cells were suspended in a 10 mM pH 7.0 phosphate buffer at 40% (wt) and homogenized in an APV Lab 40 Homogenizer at 750 bar. The homogenized solution was centrifuged in an Eppendorf Centrifuge 5810R at 12,000 rpm for 2 hours, and the supernatant was collected as the stock solution for further study. Pure Fab' solution was prepared from the collected supernatant. An AKTA Basic HPLC (GE Healthcare, Sweden) and a Hitrap Mabselect 5 ml HPLC column (GE Healthcare, Sweden) were used to purify Fab'. The eluate was buffer exchanged to 10 mM pH 7.0 phosphate buffer and stored at 4°C.

Fab' and impurity concentration analysis
Fab' concentration and total protein concentration were analyzed by HPLC Agilent 1200 (Agilent Technologies, UK) with a Hitrap Protein G 1 ml HPLC column (GE Healthcare, Sweden). Fab' concentration was calculated from the peak area according to a calibration curve, which was obtained using pure Fab' after Protein A and size exclusion chromatography. The impurity concentration analysis method was the same as that for Fab', except that the feedstock was used as the standard and the impurity peak area was monitored.

Fab' precipitation by microwell scale high throughput experimentation
The Fab' precipitation was carried out in ABgene 96 deep microwell plates using a Packard MultiPROBE II HT EX (Packard BioScience Company, Meriden, U.S.A.). The experimental conditions were selected as follows: pH from 4.5 to 8.0 with intervals of 0.5, and ammonium sulphate concentration from 0 to 3.0 mol/L, with intervals of 0.2 mol/L for pure Fab' solution and 0.3 mol/L for clarified homogenized solution. The total volume of precipitate supernatant was 1.8 ml. The precipitation plate was shaken on an Eppendorf thermomixer at 600 rpm for 2 hours and then centrifuged at 4000 rpm for 15 minutes. The clear supernatant was transferred to an Agilent 96 HPLC microwell plate and analyzed in triplicate.

Hybrid model derivation and parameter estimation
Phase equilibrium based protein precipitation model:
Protein precipitation has been treated thermodynamically as a pure crystallisation process when the solution contains only protein and salt [15][16][17]. However, for proteins in a real fermentation broth or in other complex biological materials, the precipitate will not form a pure crystal but an amorphous mixture [20,21]. The precipitation can then be treated as a distribution between a light liquid phase (supernatant) and a dense liquid phase (precipitate). Therefore, the model proposed in this paper is based on the phase equilibrium for the target protein in a multi-component solution.
Suppose $V_l$ is the light liquid phase volume, $V_d$ the dense phase volume, $C_T$ the maximum protein concentration in the solution and $V_T$ the total solution volume, with the approximation that there is no volume change during precipitation. Introducing equations (5) and (6) into equation (4) then gives equation (7).
The dense phase volume in protein precipitation will increase with salt concentration as more protein is precipitated, and will reach a nearly constant level at high salt concentration. It is often very small compared to the total solution volume because the total protein concentration is low in biopharmaceutical processing material, so it is reasonable to assume that it depends on the protein properties and on pH, probably with an apparent isoelectric point (pI). Our preliminary experimental results, shown in Figure 1, illustrate that the Fab' concentration increased linearly with pH in the range 4-8. Hence the pH effect was approximated as linear, and the effect of salt on the dense phase volume was described by equation (9), where $C_s$ is the salt concentration, $a_1$, $b_1$, $c_1$, $d_1$ are constants, and $\alpha = a_1 d_1$, $\beta = a_1 - b_1 c_1$, $\chi = b_1$, $\delta = d_1$ are the lumped constants. With different protein solutions, these lumped parameters may vary and hence should be estimated from real experimental data.
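As a hedged sketch only, the phase-equilibrium balance described here could be written as follows. It assumes the usual form $\mu = \mu^{\circ} + RT\ln(\gamma C)$ for the protein chemical potential in each phase. The symbols $\gamma_l$, $\gamma_d$, $\mu_l^{\circ}$, $\mu_d^{\circ}$, $C_l$, $C_d$ and $K$ are introduced for illustration and are not taken from the original equations. The last line is one form that is algebraically consistent with the stated definitions of the lumped constants; that it represents the ratio $V_d/V_T$ is an assumption.

```latex
% Sketch under stated assumptions; not the paper's numbered equations.
\begin{align*}
\mu_l^{\circ} + RT\ln(\gamma_l C_l) &= \mu_d^{\circ} + RT\ln(\gamma_d C_d)
  && \text{(equal chemical potential in both phases)}\\
K \equiv \frac{C_d}{C_l} &= \frac{\gamma_l}{\gamma_d}
  \exp\!\left(\frac{\mu_l^{\circ}-\mu_d^{\circ}}{RT}\right)\\
C_T V_T &= C_l V_l + C_d V_d, \qquad V_T = V_l + V_d
  && \text{(mass balance, no volume change)}\\
\frac{C_l}{C_T} &= \frac{1}{1 + \dfrac{V_d}{V_T}\,(K-1)}
  \;\approx\; \frac{1}{1 + \dfrac{V_d}{V_T}\,K}
  && (K \gg 1)\\
\frac{V_d}{V_T} &= a_1 + b_1(\mathrm{pH}-c_1)\,\frac{C_s}{d_1+C_s}
  = \frac{\alpha + \beta C_s + \chi\,\mathrm{pH}\,C_s}{\delta + C_s}
\end{align*}
```

Expanding the last line gives $a_1 d_1 + (a_1 - b_1 c_1) C_s + b_1\,\mathrm{pH}\,C_s$ in the numerator, which matches $\alpha = a_1 d_1$, $\beta = a_1 - b_1 c_1$, $\chi = b_1$ and $\delta = d_1$.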
In 1943, Kirkwood [22] defined the protein activity coefficient in a multi-component solution as a simple function of the concentrations of all solute species. Long and McDevit [23] assumed that the protein activity coefficient can be represented by a log-linear function based on fundamental chemical thermodynamics, equation (10), where $k_s$ is the salt activity coefficient, $k_i$ the activity coefficient of component $i$ and $C_i$ the concentrations of the other components in the solution.
In a multi-component solution containing biomolecules and salt, the protein activity coefficient is dominated by the salt concentration; the other effects caused by biomolecules can be regarded as constant due to their very low concentrations. Therefore, in the liquid phase, the second part of equation (10) can be represented by a constant. In the dense phase, the concentration of salt is considered not to change significantly while the other molecules still have little or no effect; hence the overall protein activity coefficient can be regarded as a constant. This gives equation (11), where $w_1$, $w_2$ and $w_3$ are constants.
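A log-linear relation of the kind described could, as a sketch, be written as below. The symbol $\gamma$ for the protein activity coefficient, the phase subscripts $l$ and $d$, and the assignment of $w_1$, $w_2$ and $w_3$ to particular terms are assumptions made for illustration.

```latex
% Sketch under stated assumptions; the mapping of w_1, w_2, w_3 is a guess.
\begin{align*}
\log\gamma &= k_s C_s + \sum_i k_i C_i
  && \text{(log-linear in all solute concentrations)}\\
\log\gamma_l &\approx w_1 C_s + w_2
  && \text{(light phase: other solutes lumped into a constant)}\\
\log\gamma_d &\approx w_3
  && \text{(dense phase: treated as constant)}
\end{align*}
```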
In some cases, the protein property and its main interaction with salt will depend on the type of salt, so equation (11) may need a second order or even a higher order term of the salt concentration [23]. In this study, only the first order term was used in the model. A higher order model needs only to be considered if the first order model fails.
In 1985, Arakawa and Timasheff [20] published a theoretical protein precipitation model, which gave the theoretical chemical potential equation (12), where $\mu_2^{\circ}$ is the protein standard chemical potential in the solution, $\mu_{2,w}^{\circ}$ the protein standard chemical potential in water, $Z$ the net charge of the protein, $I$ the ionic strength and $m_3$ the salt molar concentration, with $A$ and $B$ the coefficients. The second differential term is an empirical constant over a wide range of salt concentrations [20]. The temperature was regarded as a constant in this study.
The ionic strength is defined as:
$$I = \frac{1}{2}\sum_i Q_i Z_i^2 \qquad (13)$$
where $Q_i$ is the molar concentration of ion $i$, and $Z_i$ is the charge number of that ion. For a neutral salt such as ammonium sulphate the ionic strength is linearly proportional to salt concentration. As there is no general mathematical model for protein surface net charge as a function of pH, we approximate $Z^2$ in equation (12) by a second order pH polynomial. Therefore, equation (12) can be simplified into a function of salt concentration and pH, equation (14), where ϕ, φ, γ, η are lumped constants.
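The proportionality between ionic strength and salt concentration can be checked with a short calculation. The sketch below assumes complete dissociation of ammonium sulphate into two NH4+ ions and one SO4(2-) ion; the function names are illustrative.

```python
def ionic_strength(ions):
    """Ionic strength I = 0.5 * sum(Q_i * Z_i**2).

    ions: iterable of (molar concentration Q_i, charge number Z_i) pairs.
    """
    return 0.5 * sum(q * z ** 2 for q, z in ions)


def ammonium_sulphate_ionic_strength(cs):
    """Ionic strength of (NH4)2SO4 at molar concentration cs.

    ASSUMPTION: complete dissociation into 2 NH4+ and 1 SO4^2-.
    """
    return ionic_strength([(2.0 * cs, +1), (cs, -2)])
```

Under this assumption the ionic strength is three times the salt concentration, so a 1 mol/L solution has an ionic strength of 3 mol/L.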
Under the assumption that the salt concentration in the dense phase is very small and does not change significantly, the value of equation (14) for dense phase protein will be considered as a constant, giving equation (15), where τ is a lumped constant.
The second term essentially describes the protein salting-in effect at low salt concentration. To simplify the calculation, the second term is approximated by the pH effect for the low concentration range because, in our experiments, the pH effect dominated at low salt concentration. It was then described by a simplified second order polynomial function, while the salt effect was separated from this term and lumped into the first term on the right hand side of equation (15) to give equation (16), in which the coefficients are constants. At high salt concentration the salting-in phenomenon does not occur, or its effect is very small compared to the first term in equation (15). Therefore, the coefficients in equation (16) were relatively insignificant for the overall model prediction in the high concentration range. This also kept the model mathematically consistent throughout the salt range.
Combining equations (9), (11) and (16), equation (7) becomes equation (18). This model is able to describe the strong non-linearity of the precipitation surface due to its sigmoid structure. All the parameters in equation (18) are lumped and so it is difficult to predict their values or limit their ranges. However, according to the modelling assumptions, parameters θ and δ should have positive values and the term $\alpha + \beta C_s + \chi\,\mathrm{pH}\,C_s$ will be positive. At low salt concentrations, the exponential expression in the model is not a dominant effect, and thus the decrease of dense phase volume caused by the salting-in effect, which makes the value of $V_d/V_T$ smaller, explains the protein concentration increase. At high salt concentration, the second term, which contains an exponential expression, will have a value much larger than 1.0, so we can neglect the value of 1.0 in the denominator. The model is then similar to the exponential structure of Cohn's equation, discussed below. This model has a thermodynamic base assisted by empirical relationships. The model structure involves eight parameters to allow a proper expression of the solubility surface. This is different from models based purely on experimental observation, which are too simplified and so lose the necessary details.
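The following is a minimal sketch of a sigmoid surface with the structure described in this paragraph. It is not equation (18) itself. The rational term uses the lumped constants α, β, χ and δ as named in the text. The form of the exponent (linear in salt through θ, quadratic in pH through the hypothetical coefficients `q0`, `q1`, `q2`) is an assumption chosen only so that the sketch has eight parameters.

```python
import math


def dense_phase_fraction(cs, ph, alpha, beta, chi, delta):
    """Rational term (alpha + beta*Cs + chi*pH*Cs) / (delta + Cs)."""
    return (alpha + beta * cs + chi * ph * cs) / (delta + cs)


def soluble_fraction(cs, ph, alpha, beta, chi, delta, theta, q0, q1, q2):
    """Sigmoid surface 1 / (1 + dense_fraction * exp(exponent)).

    ASSUMPTION: the exponent is linear in salt (theta) and quadratic in pH
    (q0, q1, q2); these names and this form are illustrative only.
    """
    exponent = theta * cs + q0 + q1 * ph + q2 * ph ** 2
    fraction = dense_phase_fraction(cs, ph, alpha, beta, chi, delta)
    return 1.0 / (1.0 + fraction * math.exp(exponent))
```

With scaled inputs in the 0-1 range and positive θ and δ, the exponential term is close to 1 at low salt, so the rational term controls the result, while at high salt the exponential term grows and the soluble fraction falls towards zero.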

Model comparison:
In order to evaluate the capability of this new model, it is useful to compare it with three published models, Cohn's [11], Niktari's [18] and Habib's [19] models plus a polynomial model. For process design purposes, all the selected models were modified to contain a pH factor by introducing a second order polynomial expression of pH to substitute for the model coefficients without changing the model structure, in order to link protein solubility directly to operating variables.
In the expanded Cohn's equation (19), $S/S_0$ is the ratio of the Fab' concentration in the supernatant to the Fab' concentration in the feedstock. The expanded Niktari's equation is equation (20) and the expanded Habib's sigmoid model is equation (21). A second order polynomial equation with interaction terms, equation (22), is used to represent the conventional two-factor polynomial model, where $y$ is the Fab' concentration in the supernatant and the other symbols are parameters.
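Two of the comparison models can be sketched as follows, under stated assumptions. The Cohn form uses the familiar log-linear relation log(S/S0) = β − K·Cs, with base-10 logarithm assumed and both coefficients replaced by second order polynomials in pH as the text describes. The coefficient names and their ordering are hypothetical, and the exact expanded equations of the paper may differ.

```python
def quadratic_in_ph(coeffs, ph):
    """Second order polynomial c0 + c1*pH + c2*pH**2."""
    c0, c1, c2 = coeffs
    return c0 + c1 * ph + c2 * ph ** 2


def cohn_with_ph(cs, ph, beta_coeffs, ks_coeffs):
    """S/S0 from log10(S/S0) = beta(pH) - Ks(pH) * Cs.

    ASSUMPTION: base-10 logarithm and pH-quadratic coefficients.
    """
    beta = quadratic_in_ph(beta_coeffs, ph)
    ks = quadratic_in_ph(ks_coeffs, ph)
    return 10.0 ** (beta - ks * cs)


def second_order_polynomial(cs, ph, b):
    """Two-factor second order polynomial with an interaction term."""
    b0, b1, b2, b3, b4, b5 = b
    return (b0 + b1 * ph + b2 * cs + b3 * ph * cs
            + b4 * ph ** 2 + b5 * cs ** 2)
```

The Cohn form is always positive because it is an exponential, whereas nothing in the polynomial form prevents a negative predicted concentration.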
Although the number of parameters involved in these models is similar, the structure of our model is very different.

Data analysis and parameter estimation:
In order to eliminate errors in the model parameter estimation caused by the different orders of magnitude of the variables, the experimental conditions of pH and salt concentration were normalized to a 0-1 range by the scaling:
$$\lambda_{scaled} = \frac{\lambda_{real} - \lambda_L}{\lambda_U - \lambda_L}$$
where $\lambda_{scaled}$ is the scaled value, $\lambda_{real}$ the real value, $\lambda_L$ the real value at the lower limit and $\lambda_U$ the real value at the upper limit.
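The scaling can be sketched in a few lines. The limits below reuse the experimental ranges given in the methods (pH 4.5 to 8.0, ammonium sulphate 0 to 3.0 mol/L); treating these as the scaling limits is an assumption, and the names are illustrative.

```python
PH_LIMITS = (4.5, 8.0)      # assumed lower and upper limits for pH
SALT_LIMITS = (0.0, 3.0)    # assumed limits for salt concentration, mol/L


def scale_to_unit_range(real_value, lower_limit, upper_limit):
    """Map a real value onto 0-1 using the lower and upper limits."""
    if upper_limit == lower_limit:
        raise ValueError("upper and lower limits must differ")
    return (real_value - lower_limit) / (upper_limit - lower_limit)
```

For example, a salt concentration of 1.5 mol/L maps to 0.5 on the scaled axis.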
The Fab' concentration and the impurity concentration in both feed stocks were also normalized. The initial concentration was not used for the normalization because there was a salting-in effect. The maximum concentrations of Fab' and impurities during salting-in were regarded as the true maximum protein concentrations in the solution. In reality, this concentration was difficult to obtain, so the maximum concentration from the experimental data was used as the approximation to the true maximum concentration, and also as the upper limit for equation (22). The model parameters in equations (18) - (22) were estimated by the nonlinear least-squares regression method in the Matlab Toolbox (MathWorks, USA).
The protein concentration was described by equation (18) with regard to pH and ammonium sulphate concentration. As the focus of this study is on the development of an accurate precipitation model, a large data set is needed to test the model accuracy and regression fitting. The accuracies of the model, equation (18), and the other models, equations (19), (20), (21) and (22), were measured by $R^2$. In addition, the statistical F-test was used to evaluate the new model. The Wilcoxon test and paired t-test for individual parameters in the model were used to validate the model.
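The goodness-of-fit measure can be illustrated with the conventional coefficient of determination. The text does not state which definition of R² was used, so the form below (one minus the ratio of residual to total sum of squares) is an assumption.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot.

    ASSUMPTION: conventional definition; observed values must not all
    be identical, otherwise SS_tot is zero.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1.0, and a model that only predicts the mean of the observations gives 0.0.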

Protein precipitation modelling for pure Fab', Fab' in clarified homogenate and impurities
A total of 119 experiments for pure Fab' solution and 79 experiments for clarified homogenate were carried out. The pure Fab' concentration, as shown in Figure 2a, increased slightly during the salting-in phase and then gradually decreased with salt concentration, similar to many previous pure protein precipitation curves [11,20]. For the clarified homogenate shown in Figure 2b, when the salt concentration was low, the Fab' concentration was significantly affected by other components in the solution and differed from that of the pure Fab' solution. At low pH and low salt concentration, the Fab' concentration was only 40% of the highest concentration. The same phenomenon occurred with the impurity solubility, shown in Figure 2c. A possible explanation is that the low Fab' concentration under certain conditions in the non-pure solution was caused by co-precipitation of the Fab' molecule with other impurity proteins, the solubility of which was significantly changed by pH at low ionic strength. The effect of salting-out dominated when the salt concentration increased, and thus the solubility was nearly the same as that of the pure Fab' solution.
These experimental data sets were then used to develop the pure Fab' precipitation model based on equation (18). The estimated parameters are shown in Table 1 and the $R^2$ value was 0.975. The F-test value of the model fitting was 624.81, indicating that model accuracy at 95% confidence was achieved. It is clear that the model prediction satisfactorily described both the salting-in and salting-out features of the concentrations without losing accuracy in either phase.
The generality of the model structure was then assessed by applying the model to Fab' precipitation in a clarified homogenate, where multiple components exist. The parameters of the pure Fab' model were used as the initial guess for the parameter estimation. The $R^2$ value for Fab' in clarified homogenate was 0.972 as shown in Table 1, which was very similar to that of the pure Fab' model. The F-test value of this model was 320.36, which was smaller than that of the pure Fab' model. Nevertheless, this suggests that the model was accurate to within 95% confidence. The predicted Fab' concentration surface is shown in Figure 2b. It can be seen that the Fab' solubility in the clarified homogenate was in general predicted well by the model. However, there is a slight discrepancy between the predicted solubility and the experimental data in the low pH range as well as at very low salt concentrations.
We also assessed the model by applying it to impurity precipitation, where a mixture of proteins was treated as an assumed pseudo-single molecule with the average characteristics of all the proteins in the solution, e.g. average electric charge and hydrophobic behaviour. A total of 79 experimental data points from an impurity precipitation in a clarified homogenate were used for parameter estimation. The results are given in Table 1 and the $R^2$ value was 0.945, which is slightly lower than that of the Fab' models. The F-test value of the model was 172.60. These measures showed that the model was accurate to 95% confidence. The predicted impurity concentration surface is shown in Figure 2c. The geometrical patterns of the real data points and the model predicted surface were slightly different, especially in the high salt concentration region. We believe the difference was caused by the simplification of a mixture of proteins into a pseudo-single protein. Although the impurities were regarded as a pseudo-single molecule with average values for the whole mixture, in reality different proteins will have different sensitivities to pH and salt concentration. Conditions may significantly affect one protein with no great effect on other molecules. Thus, the

Model modification
It has been shown in Figure 2 that the new model predictions agreed well with the experimental data. However, there are eight parameters in equation (18) and the model exhibits a high level of nonlinearity. A simpler model is more useful for process operation and design. It is also beneficial for the parameter estimation, since a higher number of parameters has the potential to over-fit the data, which could make the model less accurate. The t-test for each individual parameter in the model can be used to evaluate and decide whether a parameter is necessary. If a parameter fails the t-test, it is either inaccurate or not needed, or both. After carrying out the t-test for each parameter in model (18), the parameters α, β, and χ failed in all three models, i.e. pure Fab', Fab' in clarified homogenate and impurities. Therefore a simplified model, equation (24), is proposed. The simplified models were developed for pure Fab', Fab' in clarified homogenate and impurities in clarified homogenate based on equation (24) by using the experimental data sets again. The parameters are shown in Table 2. The $R^2$ values were a little lower than previously, but all tests showed that the models had excellent statistical confidence. All parameters passed t-tests with 95% confidence.
Using the values in Table 2, all three models showed predictions similar to those in Figure 2. When the salt concentration is low (<0.2 mol/L), the value of the exponential term is small and changes little, while the linear term dominates and changes rapidly. This describes the salting-in phenomenon at low salt concentration better. When the salt concentration is high, i.e. in the salting-out range, the exponential term dominates due to the large values of parameter θ, while the effect of the linear term is small. The values of the parameters associated with pH vary relatively little. The parameter of the second order pH term, σ, is largest for the impurities, consistent with the fact that, of the three materials, the impurity concentration was influenced most by pH. The effect of pH around the neutral point is very small due to the small parameter value and the second order structure. However, according to the models, there will be large effects at the extreme pH conditions, which describe the protein concentration changes seen in the experiments.
The most difficult operating conditions to determine in precipitation are the cutting points, both for salting-in at low salt concentration and for salting-out at high salt concentration. As equation (24) is derived from thermodynamic phase equilibrium, the term $1/(\delta + C_s)$ is strongly related to the salting-in effect, and δ significantly influences the magnitude of the salting-in effect. The term (1/δ)exp(λ) indicates the potential increase in protein concentration. As shown in Table 2, the (1/δ) values in the models of Fab' in clarified homogenate and of impurities are much larger than that of pure Fab', so the salting-in effect is significant in these multi-component cases. During salting-out, χ, the coefficient for the interaction between pH and salt, and θ, the coefficient for the salt, play an important role. In the clarified homogenate case, the values of χ in the Fab' model and the impurity model are nearly two hundred times that in the pure Fab' model, showing that the impact of multiple components on salting-out is significant. Together, χ, θ and λ dominate the salting-out effect.

Model validation
Nine independent experiments were carried out and the results used to validate the model. When validating bioprocess models, it is not recommended to use error percentage to evaluate the models because the range of bioprocess data may be very wide, even after scaling, which will introduce mathematical errors. Thus, statistical tests are needed to validate any new model, no matter how good the fitting of the data is in the regression step.
There are, however, several unusual problems in bioprocess model validation. First, the number of samples used for validation is normally small, e.g. nine samples in this case, due to the high cost and long experimental time as well as the limited protein solution material available. Secondly, the distribution of most bioprocess data is normally unknown, or the data do not conform to any known distribution, e.g. the standard normal distribution. Statistically, a normal distribution can only be assumed when the number of samples is large, normally more than 30. Therefore, for a small validation group with an unknown distribution, it is risky to use the paired t-test because of its high probability of failure. Two possible solutions are to use a Wilcoxon signed-rank test [24] for the few samples, or to analyse the validation samples together with the previous regression data by the paired t-test. For a Wilcoxon test, a 2-tailed significance > 0.05 is regarded as a validation pass. For a paired t-test, a significance (the p-value associated with the correlation) > 0.05 means that the null hypothesis, i.e. no difference between the experimental data and the model calculated values, cannot be rejected. Table 3 shows the test results for the modified model, equation (24).
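For a validation set as small as nine samples, the Wilcoxon signed-rank test can be computed exactly by enumerating every possible sign assignment. The sketch below is one way to do this and is not necessarily how the authors' statistical software computed it. The function names are illustrative, zero differences are dropped, and tied absolute differences receive average ranks.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(observed, predicted):
    """Return (W, exact two-tailed p) for paired samples.

    W is the smaller of the positive and negative rank sums. The p-value
    enumerates all 2**n sign patterns, so it suits small n only.
    """
    diffs = [o - p for o, p in zip(observed, predicted) if o != p]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_stat = min(w_plus, total - w_plus)
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w_stat + 1e-12:
            count += 1
    return w_stat, count / 2 ** len(ranks)
```

Following the criterion in the text, a returned p-value above 0.05 would be read as a validation pass.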

Model comparison
The experimental data sets were also used to estimate the parameters for the four models described by equations (19) to (22) (Table 4).
Theoretically, the polynomial model has the most flexibility to fit the data. However, the same results were obtained as for the previous two models, as shown in Figure 6. The model for pure Fab' precipitation was quite good, with $R^2$ of 0.95, but the models for Fab' precipitation in multi-component solution were less good, with $R^2$ below 0.90 for this non-ideal solution. Moreover, at high salt concentration, the predicted concentration surface of the polynomial model was below zero, which also occurred in Habib's model. Therefore, both models were quite misleading, as a predicted value below zero has no physical meaning. The negative values can be eliminated manually by constraining the parameters during regression, but at the cost of overall accuracy. The polynomial model was not considered adequate for fitting clarified homogenate precipitation. A higher order polynomial model may be more accurate but will inevitably introduce more parameters.
The comparison studies demonstrate that the structure of the model is crucial. If the structure is not right, the model will not predict the performance well even though it has a high number of parameters, e.g. 9 parameters in Habib's model.

Conclusions
Phase equilibrium based models have been developed then validated using statistical tests. The model equation (24) can precisely describe the precipitation salting-in and salting-out effects for pure Fab' , Fab' in clarified homogenate and impurity with salt concentration and pH change. The model structure, based on a single protein, takes the multi-component factor into consideration and can be applied to a protein in a multi-component solution or a pseudo-single molecule. The estimated parameters in the model passed regression statistical tests proving that the models were accurate with 95% confidence.
Compared to thermodynamic based models, this new model conveniently predicts the precipitation results from operating conditions rather than thermodynamic parameters. The comparison between this new model and four existing empirical models showed that it was superior, as supported by the statistical tests. As our model structure is derived from fundamental phase equilibrium, it exhibits good prediction even though there are only 4 parameters. Such studies also show the challenge in multi-component protein precipitation modelling when strong nonlinearity exists. Our model structure is able to reflect the non-linearity of protein precipitation in both salting-in and salting-out better than existing empirical models.