Reconstruction of Local Biochemical Reaction Network Based on Human Chromosome 9 Sequence Data

Metabolic networks are complex and highly interconnected, thus systems-level computational approaches are required to elucidate and to understand metabolic genotype–phenotype relationships. This paper has manually reconstructed the local human metabolic network based on DNA sequence data of human chromosome nine. Herein the paper describes the reconstruction process and discusses how the resulting chromosome-scale (or local) network differs from genome-scale ones. The underestimated results have revealed many gaps in the current understanding of human metabolism that require future experimental investigation. They also suggest possible problems arising from local reconstruction based on partial genome data. The study suggests further applications enabled by reconstruction of human metabolic network. The establishment of this network represents a step toward genome-scale human systems biology.


Introduction
After the Human Genome Project was finished at the beginning of 21 st century, it seems that we have our own destiny under control. However, what is the next step to deal with such a large pool of data so that they are meaningful to people? The direct approach to understanding the complex processes encoded by the human genome is studying gene products' function, assigning these enzyme products to biochemical pathways and reconstructing biochemical networks. These biochemical pathways define regulated sequences of biochemical transformations. It is a first step towards quantitative modelling of metabolism. An individual's metabolism is determined by one's genetics, environment, and nutrition. Hopefully, with the available human genome sequence and its annotation [1][2][3], the human body's metabolic network can be reconstructed. Numerous metabolic genes and enzymes have been individually studied for decades; however, these results are dispersed without integrated understanding. The procedure for integrating these diverse data types to form a network reconstruction and predictive model is well established for microorganisms [4] and has been applied to mouse hybridomas [5]. Such in silico models have enabled hypothesis-driven biology, including the prediction of the outcome of adaptive evolution [6][7][8][9][10] and the identification and discovery of candidates for missing metabolic functions that were subsequently experimentally verified [11]. Because metabolic networks are more complex in mammals than in single-celled organisms, there is likely to be an even greater opportunity for the use of computational models to understand the basis of normal and abnormal cellular function [12]. At the same time, reconstructing biochemical reaction networks in mammals is also more complex and tougher. Assignment of genes to pathways also permits a validation of the human genome annotation because patterns of pathway assignments spotlight likely false-positive and false-negative genome annotations. in databases. To solve this problem, two ways were suggested. One is de novo discover, that is using some analyzing tools to predict potential functions from protein sequence information. In the other method, these sequences were blasted and their functions were regarded the same as their identities' (similar known proteins') functions. The blast process used BLASTP 2.2.23, chose UniProtKB (Protein) as the database and blosum62 as the scoring matrix. All the candidates selected under certain criteria from the blast results were searched against UniProtKB for detailed description. Each candidate was manually examined to see whether it has certain characters. Candidates with similar function were regarded as one protein. A second selection was done then based on the information richness and redundancy of these candidate proteins.
Reactions these predicted proteins involved in were defined as those catalyzed by the enzymes determined in the final selection. Thus, structural proteins and proteins with unknown function were not examined in the reaction prediction step. Reactions involving predicted proteins as substrates were not included since it is rare for a gene product to be a metabolite. Reactions were obtained from KEGG ENZYME database and BRENDA.
KEGG LIGAND was used to map all the reactions predicted before, in detail, Pathway Mapper was employed. Besides, each reaction was also searched against KEGG PATHWAY database to view the whole map of pathway it involved in. All the predicted proteins were also searched against EMBL REACTOME database to directly find out whether they are involved in any known pathways.
All the processes were limited to Homo Sapiens if species specification was possible.

Results
The whole DNA sequence of chr9 is 1246910 bp(divided into 939400bp, referred as s1; 307510bp, referred as s2). The average G+C is 41.04%(42.11% for s1; 39.96% for s2). GENSCAN predicted 53 genes ( Supplementary Information 1), including information about their structures. GENSCAN results also predicted 53 peptides accordingly (Supplementary Invitation 2), so there is no need to use another tool, such as Transeq, to translate nucleotides into proteins.
2 predicted proteins were located in mitochondria; 5 were associated with secretary pathway (Supplementary Information 3). Physiochemical property analysis showed these predicted peptides vary in amino acids composition, thus differ in charge distribution ( Supplementary Information 4). Nevertheless, it is also necessary to infer secondary structures of these predicted proteins for function is usually associated with protein' higher structures. The results by different online tools were compared and final results of secondary structure (Supplementary Information 5) came from Scratch Protein Predictor (ExPASy).
The blast result is a list of proteins in the order of increasing E-value. The selection of similar proteins to represent the predicted protein is a problem. To which degree of similarity can we define the function of a predicted protein the same as that of its identity? Considering some relative numbers in alignment, candidates were selected with identity higher than 80% and relatively low E-value in this study. 166 protein candidates, corresponding to 24 predicted peptides, were selected under this criterion (supplementary data not shown here). 8 of these predicted proteins had matches with 100% identity, and E-values were also low enough so that we can equate these predicted proteins to their identical matches. However, most of these 166 candidates were uncharacterized proteins or isoforms of each other. After filtering according to their ontology and information from UniProtKB, 13 distinctive proteins were determined (Table 1), of which 9 were enzymes, which could be subdivided into 7 metabolic enzymes, plus 2 nonmetabolic enzymes (including enzymes whose substrates are macromolecules, such as protein kinases and DNA polymerases). The 'Unmatched' row includes predicted proteins with no character information, while the remaining 4 nonenzymatic proteins are listed in the 'Nonenzyme' row.
Some proteins' descriptions were not well-characterized that one protein may retrieve several enzymes with different EC number but similar function. In such cases, extra alignments between the predicted protein and its identities were done. The enzyme with the highest score was chosen. Sometimes several predicted proteins were assigned to one enzyme, which means several genes have the same function. 30.2% of the predicted genes coded enzymes, correspond to 28.75% of the whole chr9 DNA.
KEGG assigned one reaction to one enzyme, while BRENDA offered as many reactions as possible from literature study. Thus, sometimes one gene product was matched to more than one reaction, as happens with multifunctional enzymes. All together 82 reactions (52 of them had missing information on products) were predicted besides alpha-D-glucose 1,6-phosphomutase alpha-D-glucose 1-phosphate = D-glucose 6-phosphate Pro#, the ID number of predicted protein, in this study we set the order of the predicted proteins resulting from GENSCAN as 1~53.   (Table 2). Among them, one protein was multifunctional, the other two were specific to one pathway each. KEGG Pathway Mapper resulted in 7 pathway maps (Table 3), but the streptomycin biosynthesis and biosynthesis of secondary metabolites pathways obviously do not occur in human beings, thus, they were ignored.

Discussion
It is common to find several genes referred to one enzyme, or at least their products have the same function. One explanation is that these gene products may form a protein complex; therefore, their protein annotations are similar. Another explanation is that these genes code isoenzymes, that is enzymes with similar function but various sequences. This may result from gene duplication at some time. Gene duplications commonly happen in the evolution history, especially on the same chromosome. It is a main type of gene mutation and chromosome variation. According to duplication mechanism, it is quite possible to observe 'one enzyme, several genes' as these predicted genes locate on the same chromosome. The result suggests functional redundancy of interlocked genes may be a common phenomenon in higher organisms, which coordinates with the concept of quantitative trait.
Most predicted reactions involved phosphorylation and dephosphorylation of ATP or GTP, indicating their possible correlations in energy metabolism. However, there's still a lot of missing information, and these reactions are relatively not well-interconnected that a proper network can't be built.
The predicted enzymes encoded by human chromosome 9 and their corresponding reactions are relatively less in quantity, compared to other genome-wide predictions with DNA sizes in consideration. This suggests that prediction based on partial genome data is often problematic. Computational prediction of organism metabolic networks should better use whole genome data. The reason is not clear though. It is possible that this study used genes and proteins predicted simply from DNA sequences as input, while genome-wide network reconstruction researches used information from genome annotation as input. Gene annotations may be better as they are curated and confirmed by literature. Another potential inference from this underestimation is that adding data results in nonlinear increase of information, that is to say, the information encoded by whole genome isn't equal to simple adding of information from partial genome data.
The formulation of an in silico model from the reconstruction and initial analysis of the network structure will likely be critical in elucidating underlying mechanisms of disease and identifying treatment strategies by developing cell-, tissue-, and context-specific models and building additional layers of complexity into the framework. Genomescale microbial metabolic reconstructions have been widely used to successfully perform systems analysis to the point that models resulting from these reconstructions have become tools for hypothesis driven biological discovery [4]. Human metabolic reconstruction is expected not only to become a prototype for other mammalian reconstructions but hopefully also to enable significant dimensions in the study of human systems biology. The future promise for individualized medicine and treatment will need a context to integrate and analyze data, and models resulting from these reconstructions can play a significant role in fulfilling this need. However, the development of cell-type or context-specific models will require the integration of various types of data, including transcriptomic, proteomic, fluxomic, and metabolomic measurements. Achieving these ambitious goals will require top-down data sets in conjunction with quantitative bottom-up reconstructions such as the one this study has tried to make. Pro#, the ID number of predicted protein, in this study we set the order of the predicted proteins resulting from GENSCAN as 1~53. Pro#, the ID number of predicted protein, in this study we set the order of the predicted proteins resulting from GENSCAN as 1~53.