Role of a Web-based Software Platform for Systems Biology

Open-source software has gained interest in scientific communities across all domains and fields, because developing tools within individual teams is expensive. Pooling resources is a strong argument for encouraging teams to develop new tools, to exploit costly centralized computing resources and to build valuable repositories of databases. Providing users with a standard browsing interface is also a key point when curation, navigation and selection of hypotheses are required. In systems biology, these arguments hold with particular force, since network reconstruction and simulation are time-consuming, and very large, hard-to-manage datasets are becoming available. Such tasks may also require simple, specific tools and innovative algorithms that can be integrated into the frameworks. Users can also be involved in curating data, selecting model hypotheses, and interpreting and sharing results. The modern platform is a recent concept, as recent as global efforts in systems biology. The meeting of these two key research fronts is fascinating enough in itself to warrant a discussion of its future.


Introduction
Open-source software has gained interest in scientific communities across all domains and fields, because developing tools within individual teams is expensive. Pooling resources is a strong argument for encouraging teams to develop new tools, to exploit costly centralized computing resources and to build valuable repositories of databases. Providing users with a standard browsing interface is also a key point when curation, navigation and selection of hypotheses are required. In systems biology, these arguments hold with particular force, since network reconstruction and simulation are time-consuming, and very large, hard-to-manage datasets are becoming available. Such tasks may also require simple, specific tools and innovative algorithms that can be integrated into the frameworks. Users can also be involved in curating data, selecting model hypotheses, and interpreting and sharing results. The modern platform is a recent concept, as recent as global efforts in systems biology. The meeting of these two key research fronts is fascinating enough in itself to warrant a discussion of its future.

What is a computational platform?
Let us return to the original definition of a platform in general, and more specifically in science and bioinformatics. The word 'platform' is very old. In common language before the 1990s there were four widespread meanings: (1) a flat surface; (2) a flat base for devices; (3) a political basis (a programme); and (4) a drilling plant (rig). For our purpose with respect to bioinformatics, two features are fundamental to the definition: a basis for assembly, and integrative capability. In this context, a software platform is some combination of hardware architecture and software framework, including an application framework, which allows software to run. A platform might be simply defined as a place to launch software. It includes a computer's architecture, operating system, programming languages and related user interface (runtime system libraries or graphical user interface). Such a working environment is becoming an instrument for managing data in many fields, for example the natural sciences, the social sciences and management systems in big companies [1]. Specifically, in computer science we can see it as the continuation of the effort to develop intelligent systems able to gather information from experts (now called users) through data-integration workflows. Modern artificial intelligence has to develop, and adhere to, stringent standards and frameworks in order to integrate with novel computing paradigms and environments, such as collaborative networks and cluster architectures [2]. Hence, we can argue that a software platform is a scriptable, integrative computing environment with a web portal making its systems available from any location. Such an environment is not free: the mutualization of costs, and the perennity of its maintenance over time, are guaranteed by an institution or a consortium of institutions. We have defined at length what a computational platform could be. But what is a computational platform not? It is not open-source downloadable software; in that case, it may be just a framework. It is not a protocol to acquire wet-lab data together with analytical tools to process the data and display results. It is not just an online database.

Short historical overview
The notion of an integrative library available to users is an old concept. The idea of an integrated environment for computers is as old as computer science itself since, by its architecture, a computer is a group of sub-systems. With regard to the technical toolbox we should mention two well-known environments. The first is the Numerical Algorithms Group (NAG), a non-profit software company (Oxford, UK), which released its first Fortran library in October 1971. The second is the R project, initiated in 1993 at the University of Auckland, which released a core distribution in mid-1997 and now has more than 3,000 packages. Both toolboxes provide methods for the solution of mathematical, statistical and data-mining problems, with visualization capabilities, for scientific software development.
The 'omics' measurement platforms developed in the 1990s preceded the emergence of software platforms. The primary measurement platforms in use, mass spectrometry and microarrays, are associated with tools to process their data flows. If we look at the bibliography, the history started with database building and microarray processing; the terms 'systems biology' and 'platform' both appear in bioinformatics with the emergence of large databases of high-throughput data and the availability of pathways. The two naturally met once modelling and pathway exploitation were required to understand groups of genes; the joining of these two concepts was probably born at the beginning of 2000 [3]. The implementation of a platform comes in the wake of traditional expert systems and tools for problem solving and knowledge management. A platform is useful for building molecular network maps, simulation tools, data resources and web services for sharing information.

Location of platforms
As bioinformatics is community-oriented rather than country-piloted, many platforms, which would benefit from a centralized information portal, have appeared in several universities. This is the case in France, where 13 platforms are indexed by the French national network in bioinformatics [113]. Another institution, IBISA (Device in Agronomy and Life Sciences), offers a directory of platforms in life sciences and has also introduced quality-control labels dedicated to the usage, service and development of a platform (technological and software) [108].

Data for systems biology
As mentioned above, a platform is not a database but is strongly linked to a database since data are necessary to infer interactions and obtain parameters for models (using pathways for instance). Some key experimental methods for systems biology are as follows.
(1) Oligonucleotide microarrays: the most widely used methods to monitor the expression levels of RNA transcripts in a biological sample are based on microarrays. They measure the hybridization of fluorescently labelled cDNA, synthesized from extracted mRNA, to known nucleotide sequences spotted on solid surfaces. For each gene on the microarray, an expression value is derived from the fluorescence intensity of the hybridized RNAs. (2) RNA deep sequencing: the most recent transcriptomics approaches are based on the deep sequencing of transcripts extracted from biological samples. (3) Mass spectrometry (MS) experiments: the compounds present in a sample are identified through accurate measurements of their mass-to-charge ratios. (4) Nuclear magnetic resonance (NMR): a common method in metabolomics which, in contrast to MS-based approaches, in most cases does not require analyte separation. NMR spectroscopy can provide detailed information on the molecular structure of compounds found in complex mixtures, and a wide range of small-molecule metabolites in a sample can be detected simultaneously. (5) Literature and already-built pathways, which can help to select a model that mines existing interactions in a given space and time development of a biological state.
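As an illustration of the first processing step above, the sketch below derives log2 expression ratios from two-channel fluorescence intensities; the gene names, intensity values and background level are all hypothetical.

```python
import math

def log2_ratio(sample_intensity, reference_intensity, background=0.0):
    """Background-subtract and compute a log2 expression ratio
    for one spot on a two-channel microarray."""
    s = max(sample_intensity - background, 1e-9)   # avoid log of non-positive values
    r = max(reference_intensity - background, 1e-9)
    return math.log2(s / r)

# Hypothetical fluorescence intensities (sample channel, reference channel)
spots = {"geneA": (5200.0, 1300.0), "geneB": (800.0, 800.0), "geneC": (300.0, 2400.0)}
expression = {g: round(log2_ratio(s, r, background=100.0), 2)
              for g, (s, r) in spots.items()}
```

Positive values indicate genes over-expressed in the sample relative to the reference (here geneA), negative values under-expressed genes (geneC); real pipelines add normalization and replicate handling on top of this core step.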

Objectives of a platform in systems biology
Modelling leads to multiple and immediate fallout for such biological issues as: (1) metabolic engineering; (2) predictive toxicology and drug repurposing, where drug-target networks can be used to identify multiple targets and to determine suitable combinations of drug targets or drugs; (3) solving major nutrition-associated problems in humans and animals, including obesity, diabetes, cardiovascular disease, cancer, ageing and intrauterine growth retardation; (4) inflammatory (immune) response below the baseline; (5) development of novel and more efficient microbes for the production of biofuels; (6) revealing new biomarkers in medical pathologies.

Examples and core of a software platform
If platforms 10 years ago were largely focused on database repositories and microarray or sequence processing, some platforms are now specifically dedicated to systems biology. They generally offer access through the internet as a web portal, are developed on a Linux operating system, and store data in a database management system such as MySQL, with a Java- or HTML-based user interface; they integrate frameworks and data analysis tools written in scripting languages (Python, R, MATLAB, Perl), or in C++ to speed up computation, as in the BioArray Software Environment (BASE) [4].
Another core element is system-centric knowledge management (SKM): a global approach to discovering, organizing and sharing scientific knowledge from large-scale data. Knowledge management is the collection of processes that govern the creation, dissemination and utilization of knowledge. In this context, the creation step refers to the data-integration 'pipeline' which proceeds from experimental data (either from biological experiments or simulations) to the sharing and utilization of the underlying inputs/outputs of each of the data-integration steps (from 'raw' data to fully fledged systems dynamics models). Currently, solutions of large-scale system-centred problems suffer from a serious lack of integration of the underlying human, data, information and knowledge resources. Researchers who discover new insights disperse their results over a variety of journal and conference publications, biomedical databases, information bases and knowledge bases. These are typically devoted to a general subject area (e.g., gene sequences, protein structures) rather than being exclusively dedicated to the system under study. Researchers wanting to obtain relevant data, information and knowledge therefore invest considerable effort in locating and reintegrating the information in the context of the system under investigation. In other words, instead of being published, shared and used in the context of the relevant structures of the system in question, the data, information and knowledge are heavily fragmented, decontextualized and physically distributed, only to be relocated, recontextualized and reintegrated by those who need them. From a knowledge management perspective, this is an extremely poor solution.
In the following sections we detail the different core components of such a software platform.

Commercial toolkits
Toolkits providing access to large pathway repositories are also widespread; high-level display, easy searching and quality of service are guaranteed. Some popular products include Ingenuity

Web-based databases
Pathway-oriented approaches are appealing because they are hypothesis-driven. However, their limitation is our incomplete knowledge of biological gene networks. Hence, even if sharing databases was fundamental to the emergence of platforms, sharing curated pathways and models remains useful beyond experimental data. We can say that KEGG (Kyoto Encyclopedia of Genes and Genomes) [14] and the visualization of pathways [15] played a pioneering role. Some recent platforms, for example WikiPathways, take advantage of modern technology such as Web 2.0 [16]. There is also a tendency towards sharing simulation models, as in BioModels.net [17]. Some databases are species-oriented, like AtPID [18] or RIMAS [19] for plants, CoryneRegNet [20] for bacteria or HPRD [21] for humans. Uetz et al. [15] identified a yeast network of about 957 interactions among 6,000 proteins; HPRD now offers 39,194 interactions and 30,047 proteins for humans. Other databases are related to diseases, like NetAge [22], while others are more generalist, like Gaggle Tool Creator (GTC) [23].

Standards
Standards help to integrate and exchange information, and they are used more and more in platforms. BioPAX is a protocol for the specification and representation of cell signalling pathways, gene-regulatory networks, protein-protein interactions and other types of biomolecular interaction data [24]. SBML (Systems Biology Markup Language) provides community-driven software support [25]. MIRIAM, the minimal information required in the annotation of models, is a scheme to provide extensive documentation in the model file in a structured manner [26]. SBGN (Systems Biology Graphical Notation) is a recent attempt to standardize the visual representation of biological networks [27]; automatic generation of SBML equations from SBGN diagrams has been made possible. BioCoder is a C++ library that enables biologists to express the exact steps needed to execute a protocol [28].
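To make the role of such exchange formats concrete, the sketch below assembles a deliberately simplified SBML-flavoured XML description of a toy degradation model. It illustrates only the general structure (compartments, species, reactions); it is not a schema-complete SBML document, and the model identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified SBML-like skeleton: one compartment, two species, one reaction.
sbml = ET.Element("sbml", {"xmlns": "http://www.sbml.org/sbml/level3/version1/core",
                           "level": "3", "version": "1"})
model = ET.SubElement(sbml, "model", id="toy_degradation")

comps = ET.SubElement(model, "listOfCompartments")
ET.SubElement(comps, "compartment", id="cell", size="1")

species = ET.SubElement(model, "listOfSpecies")
ET.SubElement(species, "species", id="mRNA", compartment="cell", initialAmount="100")
ET.SubElement(species, "species", id="degraded", compartment="cell", initialAmount="0")

reactions = ET.SubElement(model, "listOfReactions")
rxn = ET.SubElement(reactions, "reaction", id="decay", reversible="false")
ET.SubElement(ET.SubElement(rxn, "listOfReactants"), "speciesReference", species="mRNA")
ET.SubElement(ET.SubElement(rxn, "listOfProducts"), "speciesReference", species="degraded")

document = ET.tostring(sbml, encoding="unicode")
```

The point of the standard is precisely that a document like this can be generated by one tool and parsed by another without a bespoke converter.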

Simulation frameworks
A simulation framework is used to confirm the relevance of interactions and to validate the structure and dynamics of a pathway. A first distinction is based on the type of modelling used, i.e. deterministic (DM) versus stochastic (SM). Deterministic models implicitly assume that the underlying quantities, i.e. concentrations or molecule numbers, vary in a deterministic and continuous fashion. The stochastic framework, on the other hand, takes into account the random interactions of the biochemical species. Coarse-grained models using Ordinary Differential Equations (ODEs), Partial Differential Equations (PDEs), Stochastic Differential Equations (SDEs) or Markov Jump Processes (MJPs) are typically used to model simple synthetic biology circuits. Two main types of stochastic models (MJPs and SDEs) are typically used in the literature to represent stochastic systems, with simulation and inference methods such as Gillespie's algorithm, the Gibson-Bruck method, tau-leaping and Bayesian approaches. These coarse-grained models can be used as simplifications as long as their corresponding assumptions are satisfied. Flux balance analysis exploits the properties of a stoichiometric matrix and homeostasis using matrix algebra (MA); P systems can be associated with this family as a kind of finite automata. In this family we also find Petri net modelling and Boolean networks. Logical models (LM) offer good properties for analysing the slopes of dynamics; in this family we also find symbolic differential systems.
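The deterministic/stochastic distinction can be illustrated on the simplest possible system, first-order decay A -> 0. The sketch below, with a hypothetical rate constant and initial population, compares a Gillespie-style Markov jump simulation with the matching ODE solution; for large molecule numbers the two agree on average, which is the condition under which the deterministic simplification is valid.

```python
import math
import random

def gillespie_decay(n0, k, t_end, rng):
    """Stochastic (Markov jump) simulation of first-order decay A -> 0.
    Returns the molecule count at time t_end."""
    n, t = n0, 0.0
    while n > 0:
        propensity = k * n
        t += rng.expovariate(propensity)  # exponential waiting time to next event
        if t > t_end:
            break
        n -= 1                            # one molecule degrades
    return n

def deterministic_decay(n0, k, t_end):
    """Matching deterministic (ODE) solution of dn/dt = -k*n."""
    return n0 * math.exp(-k * t_end)

# Hypothetical parameters: 1000 molecules, rate 0.5 per time unit, horizon 2.0
rng = random.Random(42)
runs = [gillespie_decay(1000, 0.5, 2.0, rng) for _ in range(200)]
stochastic_mean = sum(runs) / len(runs)
ode_value = deterministic_decay(1000, 0.5, 2.0)
```

Individual stochastic runs fluctuate around the ODE value; at low copy numbers those fluctuations dominate, which is why platforms typically offer both simulation modes.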

Laboratory information management system (LIMS) frameworks
A LIMS is the central part of a typical proteomics workflow, whose outputs include journal articles and data stored in repositories for community-wide use. Some LIMS are more convincing than others, thanks to a highly sophisticated graphical workflow editor, mature advanced functionality, easy out-of-the-box behaviour, or the possibility of combining results from multiple search engines [60].

Visualization frameworks
Visualization is a means of exploratory data analysis and, since the birth of platforms and web-based portals, tools like GraphViz [107] have been a key method for network analysis. Cytoscape is a general framework for visualization with customization [80]. The Genome Network Platform provides an integrated user interface with a protein-protein interaction network viewer [81]. CellDesigner is a structure diagram editor for drawing gene-regulatory and biochemical networks [82]. ProMoT is a 'drag and drop' design platform for synthetic gene circuits [83]. MetPA [84] and Reactome [85] offer tools for visualization and navigation over thousands of pathways.

Multivariate data analysis frameworks
Biological space is highly dimensional in terms of objects and features, so multivariate data analysis can be fruitful for data processing. Classical approaches can be used, and the R platform is very widespread, as in STRUCTURELAB [86]. Some implementations are ad hoc, with multidimensional scaling as in BASE [87]. Less standard in multivariate data analysis, but also popular for graphical analysis, are BIOCHAM [88] and Bayesian approaches such as the Visual Integration for Bayesian Evaluation (VIBE) software [89]. Even more ad hoc techniques are integrated in toolkits such as ORBIT [90] and PerturbationAnalyzer [91], as well as model checking [92,93].
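As a minimal illustration of such dimension-reduction techniques, the sketch below computes the first principal component of hypothetical two-dimensional expression data via the closed-form eigendecomposition of the 2x2 covariance matrix; it is a toy stand-in for the full multivariate machinery of environments like R, not any platform's actual implementation.

```python
import math

def leading_principal_component(points):
    """First principal component of 2-D data: largest eigenvalue and
    unit eigenvector of the 2x2 covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n        # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n        # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n  # cov(x, y)
    # Closed-form largest eigenvalue of a symmetric 2x2 matrix
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector: (a - lam)*vx + b*vy = 0  =>  v = (b, lam - a)
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

# Hypothetical samples measured on two strongly correlated features
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
variance, direction = leading_principal_component(data)
```

Because the second feature is roughly twice the first in this toy data, the leading direction comes out close to (1, 2) normalized; in practice the same idea scales to thousands of genes per sample.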

Workflow frameworks
Workflow technologies for data-processing design, application and execution link these tools into high-throughput processing pipelines. Their ongoing development can be expected to greatly enrich researchers' ability not only to process newly obtained bio/text data, but also to compare, contrast and combine results from previous studies via meta-analytic and data-mining approaches, and to visualize patterns in bio/text results that could only be identified through large-scale computing. Taverna is a good representative of such a framework [91].
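A minimal sketch of the pipeline idea, far simpler than an engine such as Taverna: processing steps declared as plain functions and chained in order by a generic runner. All step names and data here are hypothetical.

```python
def parse_records(text):
    """Split raw lines of 'gene<TAB>value' into (gene, float) pairs."""
    return [(g, float(v)) for g, v in
            (line.split("\t") for line in text.strip().splitlines())]

def filter_expressed(records, threshold=1.0):
    """Keep genes whose expression value exceeds a threshold."""
    return [(g, v) for g, v in records if v > threshold]

def rank(records):
    """Sort genes by descending expression value."""
    return sorted(records, key=lambda gv: gv[1], reverse=True)

def run_pipeline(data, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        data = step(data)
    return data

raw = "geneA\t0.4\ngeneB\t2.5\ngeneC\t1.7"
result = run_pipeline(raw, [parse_records, filter_expressed, rank])
# result == [("geneB", 2.5), ("geneC", 1.7)]
```

Real workflow frameworks add what this sketch omits: provenance tracking, distributed execution, and graphical editing of the step graph.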

Ontology and terminology management
A 'knowledge space' where information is heavily fragmented, decontextualized and physically distributed, and requires recontextualization by those who need it, is a poor solution. Ontologies and terminologies offer a more efficient proposal. Open Biomedical Ontologies (OBO) [94] and the Gene Ontology (GO) [95] are popular packages for integration, and other, more specific ontologies have appeared for plants or microbes. iHOP [96] is a full platform for extracting relations and named entities from the literature (2,800 organisms, 110,000 genes, 23.4 million sentences). SZGR [97] integrates literature and GO exploration in a disease-specific platform. The ONTO-Toolkit is a framework for managing known ontologies such as OBO [98]; in the same way, the Ontology Lookup Service [99] is based on GO and vocabulary exploration. Onto-Tools [100] is a framework to assist annotation with GO-based ontologies. Ondex is a helper tool for semantic integration [101]. EXCERBT, a tool integrated into the MIPS web database, gives access to literature on relationships between proteins [102]. caBIG [103] integrates a framework into a cancer database platform, with high-level vocabulary and concept exploration and cleaning. Finally, an ontology tool can also be integrated into a simulation framework, as in Ph-SIM, which links physiological knowledge to simulation parameters [104].

What should be the next step?
The next-generation platforms will perhaps be as novel as current next-generation sequencing devices, or as microarrays before them, and will transform our working habits. Some points are perhaps almost achievable now. The following is a short list, neither final nor complete. Advances should be made in: (1) in silico support of synthetic biology, from specific data-exchange formats to the most popular software platforms and algorithms; (2) design and construction of an artificial bacterial cell; (3) workflow experiment design; (4) drug-target discovery and pharmacogenomics; (5) coupling distinct computational models of science and engineering systems, still a recurring challenge when developing multi-applications (a kind of meta-analysis).