Received date: July 24, 2013; Accepted date: October 24, 2013; Published date: October 28, 2013
Citation: Neogi SG, Krestyaninova M, Kapushesky M, Emam I, Brazma A, et al. (2013) MoDa-A Data Warehouse for Multi-“Omics” Data. J Data Mining Genomics Proteomics 4:145. doi: 10.4172/2153-0602.1000145
Copyright: © 2013 Neogi SG, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The range of various “omics” technologies for measuring properties of biomolecular entities (e.g. transcripts, proteins, metabolites) in biological samples in a high throughput manner is continuing to increase. Information systems enabling integrative exploration of results of such experiments are needed. We have developed a system, MoDa (Molecular Data warehouse), that provides a unified framework for finding and visualizing results of various experimental techniques of molecular biology.
The warehouse architecture is optimized for various types of filtering and querying annotations of samples, experimental results and properties of genes and other molecular entities. The implementation is based on the BioMart technology, with enhanced means for manipulating multidimensional data. The user interface is a web-based application.
An important consideration for every data warehousing project is data acquisition and cleaning. To ensure that the data uploaded into the warehouse is consistent and sufficiently well-annotated for further statistical analyses, we implemented a repository for sample and research subject data, experimental metadata, and experimental results. A gene re-annotation pipeline was used to provide a uniform reference system for the collected data along the bioentity (“gene”) dimension.
We expect that the developed data warehousing infrastructure can be useful for collaborative projects employing high throughput molecular biology technologies.
Data warehousing; Data integration; Functional genomics
MoDa: Molecular Data warehouse; MolPAGE: Molecular Phenotyping to Accelerate Genomic Epidemiology, an EU project; SIMBioMS: System for Information Management in BioMedical Studies, software system; SIMS: Sample Information Management System, part of SIMBioMS; AIMS: Assay Information Management System, part of SIMBioMS; MAGE-TAB: MicroArray Gene Expression [format]–TABular; RDBMS: Relational DataBase Management System; NetCDF: Network Common Data Format, a binary file format; LIMS: Laboratory Information Management System
Context of the project
The work described in this paper was driven by data management needs of a collaborative multi-site project “Molecular Phenotyping to Accelerate Genomic Epidemiology” (MolPAGE). The project consortium brought together 18 leading academic institutions and biotechnological and pharmaceutical companies. The focus of MolPAGE was metabolic and cardiovascular diseases, and an important objective of the consortium was to develop standards and technologies capable of supporting a wide range of genomic epidemiology studies.
The major task of the informatics work was to design and implement a complete data management solution for the outcomes of molecular phenotyping experiments (transcriptomics, proteomics, metabonomics, and others). A 2-tier system was built, reflecting the two main stages of the data management lifecycle (Figure 1):
1. Information submission, storage and unification, facilitated by data repository components and the reannotation system;
2. Data integration, search and retrieval, facilitated by the MoDa warehouse interface.
The information system for data submission and storage, SIMBioMS, includes two components interlinked through identifiers: the Sample Information Management System (SIMS) and the Assay Information Management System (AIMS). These are two databases with associated web-based submission tools, enabling the collection of sample information (SIMS) and of experimental results and metadata (AIMS). Apart from data submission, these systems also support sample and assay metadata search and export, as well as data file exchange. The system for sample information management, SIMS, has been described separately in Viksna et al. (referred to as “Sample Management Database” in that article).
A critical and often costly aspect of any data warehousing effort is data cleansing and transformation to ensure that the data that can be queried through the warehouse is consistent. After the experimental results of a study have been collected in AIMS, all supporting metadata and sample data can be exported from SIMS and AIMS, the data files parsed, and the data points linked to the metadata and transferred to the data warehouse. As a part of data processing, the measured biomolecular entities (transcripts, proteins, metabolites, etc.) are passed through the reannotation system, which maps them against a uniform reference system.
The main goal of this paper is to present the MoDa data warehouse and its user interface, which guides users through the data sets by presenting information in an intuitive manner. We discuss a unified data abstraction framework for data access and analysis, as well as the user interface aspects of MoDa.
In the information technology industry, a data warehouse is a database that can be used for decision making and that contains summary data about an organization’s performance in order to enable a wide range of queries and ad hoc data analysis. A data warehouse is often seen as a complementary system to an operational database, which is used for managing daily information flow but has limited support for queries. We use the term “data warehouse” in a similar manner: our data warehouse widens the range of queries, but it is a read-only database that does not contain all the data being collected and managed in the project. Data cleaning, curation and homogenization also enable further expanded analysis capabilities using downstream tools.
In the field of bioinformatics the line between a database and a warehouse is rather vague. When a public resource is called a data warehouse, this is very often done in order to reflect its enhanced search capabilities. Requirements for data warehousing in molecular biology are sketched in Schönbach et al. For gene expression data, a potential information management framework, including references to the need for data warehousing, is described in Basset et al. Previous data warehousing efforts have either concentrated on a single type of information for one or several species, or spanned many data types but focused on a single organism; see Table 1 for an overview of data warehousing efforts in bioinformatics.
|Resource||Type and scope||Description|
|Oncomine||Gene expression, limited scope||Cancer-specific; relevant datasets from public repositories have been picked, array design and sample descriptions were reannotated, and a range of statistical processing techniques were performed to make data from different datasets comparable.|
|Genevestigator||Gene expression, limited scope||Similar to Oncomine, but for Arabidopsis.|
|ArrayExpress warehouse||Gene expression, wide scope||Along with the ArrayExpress repository forms the ArrayExpress resource, one of two large public gene expression databases. Provides gene-oriented query capabilities; access to individual profiles of gene expression; uniform, cleaned up annotation; and statistics-based ranking of query results. In 2009 superseded by Gene Expression Atlas.|
|METLIN||Metabonomics||Provides flexible search facilities.|
|Uniprot||Proteins||Protein sequence and functional information.|
|Biozon||Genes, proteins, pathways||A unified biological resource of DNA sequences, proteins and pathways.|
|FlyMine||Organism-specific, Drosophila||Sequence, functional, interaction and gene expression data. Covers a wide range of Drosophila related information and provides sophisticated query capabilities. Interesting work from the software engineering perspective, where popular queries are optimized by dynamically constructing the necessary structures in the database.|
|GeneCards||Organism-specific, Homo sapiens||Proteomic, transcriptomic, genomic, genetic and functional information.|
|BioMart||Software for in-house installation||Query-oriented data management system that provides capabilities to transform any database into a query-oriented warehouse, a generated website and programmatic access methods. Works well for one dimensional information (e.g., lists of genes) and sequence information.|
|BioWarehouse||Software for in-house installation||Software environment for integrating biological databases into a single data warehouse suitable for data management, mining and exploration.|
Table 1: Representative examples of different types of data warehousing efforts.
None of the existing data warehousing solutions satisfied our needs: managing multidimensional data generated by a range of high-throughput molecular phenotyping technologies and providing means of data exploration. In this paper we introduce an approach for the integration of such data. At the centre of our work lies a simple abstraction for multidimensional “omics” data agglomeration and access. This approach has been validated by the implementation of a warehouse prototype that supports queries on genes, SNPs, methylation sites, proteins and small molecules, e.g. “in which experiments has gene g1 been studied?” or “which genes exhibit an expression profile similar to that of gene g2 in a particular experiment?”.
Data warehouse: purpose and design
The MoDa data warehouse is designed for the integration of data generated by different genotyping and molecular phenotyping platforms. It provides an architecture and a graphical user interface that are flexible enough to accommodate various data types and to give comprehensive access to the information content.
We had previous experience with principles and software components that have been proven to work in the context of gene expression data. In order to make this approach work also for epigenomics, proteomics and metabonomics data, we generalized the concept of the gene expression matrix, enabling management of a wider range of biomeasurements.
In transcriptomics, the data is usually presented as an expression data matrix where rows correspond to genes and columns correspond to samples (Figure 2). There is also a third dimension corresponding to different measurement types, but most often users are interested in only a single measurement type that is thought to represent the true expression levels of genes.
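As a minimal sketch of this abstraction, the matrix can be modelled as two annotated dimensions plus one numeric layer per measurement type. The class name `MeasurementMatrix`, its methods, and the identifiers `g1`, `s1`, etc. are hypothetical and chosen only for illustration; they are not taken from the MoDa code base.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MeasurementMatrix:
    """Rows are biomolecular entities, columns are samples (illustrative names)."""

    entities: List[str]
    samples: List[str]
    # Third dimension: one entities-by-samples layer per measurement type.
    layers: Dict[str, List[List[float]]] = field(default_factory=dict)

    def add_layer(self, measurement_type: str, values: List[List[float]]) -> None:
        # Reject layers whose shape does not match the two annotated dimensions.
        if len(values) != len(self.entities) or any(
            len(row) != len(self.samples) for row in values
        ):
            raise ValueError("layer shape must be entities x samples")
        self.layers[measurement_type] = values

    def profile(self, entity: str, measurement_type: str = "expression") -> Dict[str, float]:
        """Return the values of one entity across all samples."""
        row = self.entities.index(entity)
        return dict(zip(self.samples, self.layers[measurement_type][row]))


# Made-up example: two genes measured in three samples.
matrix = MeasurementMatrix(entities=["g1", "g2"], samples=["s1", "s2", "s3"])
matrix.add_layer("expression", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
g2_profile = matrix.profile("g2")
```

Most queries would touch a single layer, which is why `profile` defaults to one measurement type while the other layers remain available.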
The structure of data produced using various “-omics” technologies has significant commonalities. The “sample” dimension remains the same, while the “gene” dimension can be generalized to other types of molecular entities, such as proteins or metabolites. In Table 2 we illustrate how molecular identifiers and key characteristics of biomolecular entities vary when representing transcriptomics, proteomics and metabonomics data.
|Technology||microarrays||affinity arrays||mass spectrometry||LC-MS, NMR|
|Biomolecular entity identifiers||gene/transcript identifier, e.g., Ensembl||protein identifier, e.g., Uniprot||-||metabolite identifier, e.g., HMDB or ChEBI||-|
|Biomolecular entity characteristics||sequence and annotation||sequence and annotation||m/z; probability; number of chrom. fraction||chemical formula and annotation||chemical shift (NMR) or m/z (LC-MS)|
|Measurement types||expression value||abundance level||intensity||intensity|
|Basic assay unit||hybridisation||hybridisation||spectrum of an LC fraction||acquisition with a certain pulse programme (NMR); spectrum of a certain LC fraction (LC-MS)|
Table 2: Key characteristics of transcriptomics, proteomics and metabonomics experiments.
One of the implications of the wide variety of high-throughput technologies used in MolPAGE is the necessity of handling two substantially different types of molecular data:
- identified data, produced by affinity-based techniques, and
- non-identified data, produced by spectral methods.
Identified bioentities and consistency of molecular annotation
In affinity-based analysis it is assumed that the identities of biomolecular entities for which some characteristic (e.g. concentration) is measured can be derived from the information about reporter molecules (e.g., antibodies or oligonucleotides). Data producers applied various processing techniques to affinity-based data, resulting in matrices of bioentities vs samples that were submitted to AIMS. In order to load data into MoDa, these results had to be referenced to a universal system of identifiers. This step is often called re-annotation, and it is a mandatory data integration stage for easy cross-experiment data access, as it yields a consistent index of all molecular measurements. How data consistency was achieved in MolPAGE for both the sample and bioentity dimensions is described in Section 2.2.
Spectral data management
Let us consider spectral results and how they can be represented in a warehouse. In mass spectrometry and NMR, experimental results take the form of peak lists, for instance mass/charge (m/z) ratios and the corresponding intensity values. As part of post-experimental processing, identifiers of precursor proteins may be assigned to the m/z values of the detected peptides. Protein identifiers in mass spectrometry, as opposed to gene identifiers in microarrays, are of a probabilistic nature and represent results of post-experimental analysis rather than results of a measurement per se.
The use of so-called hyphenated methodologies (e.g. LC-MS), i.e., techniques where a molecular separation method is coupled with a spectral one, creates a new dimension: time or fraction number. As illustrated in Figure 3(b), this can be wrapped into the sample dimension. More generally, as with gene expression data, where the same sample can be measured several times using technical replicates, we use the term “assay dimension” instead of sample dimension: a single sample is used for each assay, but additional technology-specific parameters attached to each assay can distinguish several measurements of the same sample.
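The idea of wrapping replicates and LC fractions into a single assay dimension can be sketched as follows. The `Assay` record, its field names and the label format are assumptions made for illustration only and do not reflect the actual AIMS or MoDa data model.

```python
from typing import Dict, List, NamedTuple, Optional


class Assay(NamedTuple):
    """One column of the data matrix (hypothetical field names)."""

    sample_id: str
    replicate: int = 1
    lc_fraction: Optional[int] = None  # only set for hyphenated methods


def assay_label(assay: Assay) -> str:
    # Build a unique column label from the sample and its technology-specific parameters.
    parts = [assay.sample_id, f"rep{assay.replicate}"]
    if assay.lc_fraction is not None:
        parts.append(f"frac{assay.lc_fraction}")
    return "/".join(parts)


def assays_by_sample(assays: List[Assay]) -> Dict[str, List[Assay]]:
    # Group assays so that all measurements of one sample can be found together.
    grouped: Dict[str, List[Assay]] = {}
    for assay in assays:
        grouped.setdefault(assay.sample_id, []).append(assay)
    return grouped


# Made-up example: one sample hybridised twice, another split into two LC fractions.
assays = [
    Assay("s1", replicate=1),
    Assay("s1", replicate=2),
    Assay("s2", lc_fraction=1),
    Assay("s2", lc_fraction=2),
]
labels = [assay_label(a) for a in assays]
```

Each assay still refers to exactly one sample, so sample annotation can be attached to every column without duplication in the sample records themselves.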
Often, in both peptide and metabolite analysis, MS and/or NMR are employed for finding a characteristic molecular pattern (or a fingerprint) of a certain phenotype or condition. Considering individual analytes one by one can be very tedious and not the most effective approach. Full-length spectra comparison and applications of unsupervised learning data analysis methods constitute a widely used approach for NMR and MS-based analysis. Effectively, a spectral pattern rather than a particular molecule is the biomarker in this case. Even when metabolites are identified, effective data integration is difficult. While proteins can easily be associated with transcripts and genes by using such public resources as UniProt, GenBank, Ensembl, etc., metabolites may be linked to other types of biomolecular entities only through such aggregate concepts as phenotypes or known pathways.
Thus, proteomics data (affinity-based or MS-MS), SNP and methylation data are relatively easy to integrate with gene expression information through Ensembl or UniProt identifiers. At the same time, an attempt to integrate non-identified spectral results to the same extent (i.e., enabling complex queries by m/z across several samples or acquisitions) may result in ineffective data representation, excessive complexity in data storage and misinterpretation (comparison of peaks across spectra cannot be performed on the fly, even if the spectra are acquired on the same machine under a standard protocol). Therefore, although in principle it would be possible to integrate spectral, i.e., non-identified, data in MoDa by having e.g. m/z ratios on the biomolecular entity dimension and the corresponding peak intensities as measured values (similarly to how it is done in the METLIN database), we have decided to integrate into MoDa only measurements of bioentities whose identities have been established to a certain level of confidence.
There are two main approaches to integrating data so that a data warehouse can be built: 1) uniformity of data can be enforced upon submission, or 2) data and metadata can be post-processed in order to align them to a single reference system.
For the “sample dimension” we follow the first approach: integration is enforced by the design of the sample data collection (in SIMS). When molecular data is entered into AIMS, either on the Assay or the Study Group level, samples from the SIMS database are referenced. This ensures that, first, if the same sample is analysed by different methods, this fact is retained in the system, and, second, that the information on samples coming from various collections is captured consistently in SIMS and is directly associated with the data files in AIMS.
The second approach is followed for the “gene dimension” integration. Because of the wide spectrum of underlying chemical and physical principles used in high-throughput experiments for measuring the concentration of molecules (sequences, modifications, proteins, small molecules), there are differences in how experimental results are reported. For correct interpretation of the data, all bioentities have to be referenced through a consistent system of identifiers. Integration along the gene/small molecule dimension can be seen as a data reduction process: the number of samples investigated in a high-throughput study is on the order of tens or hundreds, rarely thousands, while the number of transcripts/proteins/metabolites characterized can be tens of thousands or more. Upon data submission, references to biomolecular entities are usually encoded in data files, next to the data values.
The Ensembl and UniProt databases were used as the primary sources for translating gene and protein identifiers provided by users into a uniform set of descriptors and for enriching the submitted data with sequence and structural annotation. The Human Metabolome database was used as the primary source of annotation for small molecules. Software components employed for reannotation were AIMS (data export functionality), a set of data parsers (developed specifically for MolPAGE, as well as some used internally for ArrayExpress), and BioMart and UniProt webservices supporting multiple gene queries. The reannotation process was facilitated by a data manager. A detailed description of the pipeline will be provided in a separate paper.
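The core of such a re-annotation step can be illustrated with a small sketch that translates submitted identifiers through a lookup table and reports those that cannot be mapped. This is not the MolPAGE pipeline, which is described elsewhere; the function `reannotate`, the in-memory `id_map` and all identifiers shown are placeholders, whereas in practice the mapping would be obtained from the reference resources named above.

```python
from typing import Dict, List, Sequence, Tuple


def reannotate(
    rows: Sequence[Tuple[str, List[float]]],
    id_map: Dict[str, str],
) -> Tuple[Dict[str, List[List[float]]], List[str]]:
    """Map submitted identifiers onto reference identifiers.

    Returns the measurements grouped by reference identifier and the list
    of submitted identifiers for which no mapping was found.
    """
    mapped: Dict[str, List[List[float]]] = {}
    unmapped: List[str] = []
    for submitted_id, values in rows:
        # Normalise case and whitespace before the lookup (an assumption of this sketch).
        reference_id = id_map.get(submitted_id.strip().upper())
        if reference_id is None:
            unmapped.append(submitted_id)
        else:
            # Several reporters may map to the same reference entity.
            mapped.setdefault(reference_id, []).append(values)
    return mapped, unmapped


# Placeholder identifiers, not real database accessions.
id_map = {"PROBE_A": "REF_GENE_1", "PROBE_B": "REF_GENE_1", "PROBE_C": "REF_GENE_2"}
rows = [
    ("probe_a", [1.0, 2.0]),
    ("PROBE_B ", [1.5, 2.5]),
    ("probe_x", [9.0, 9.0]),
]
mapped, unmapped = reannotate(rows, id_map)
```

Keeping the unmapped identifiers rather than silently dropping them gives a data manager a list to review, which fits the curated nature of the process described above.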
The primary requirement for the data warehouse functionality is to facilitate queries based on gene, protein and metabolite properties, sample and person information, as well as experimental metadata. The interface is built around the concept of providing data about a subset of biomolecular entities for all samples and studies.
There are two complementary approaches to the design of data warehouse interfaces. The first one is to strictly predefine the way in which users will interact with the system. For example, 1) a user selects a bioentity or a set of bioentities, 2) the system retrieves all studies where these bioentities have been studied, 3) the user can individually explore the studies more deeply, or export data to some data analysis tool. The other option is to provide as flexible an interface as possible, where users have multiple decision points, e.g., to retrieve an initial set of samples, to further limit the scope by filtering out irrelevant samples, and to retrieve genes studied in assays carried out on these samples.
We have implemented the first option in the ArrayExpress data warehouse, from which the MoDa implementation derives. However, the underlying software infrastructure of ArrayExpress, and therefore also of MoDa, is also well suited to the second approach.
The following tasks can be carried out using the MoDa user interface:
• for a subset of biomolecular entities (e.g., genes or small molecules), find in which studies they have been investigated;
• for a subset of biomolecular entities, find in which studies related biomolecular entities (products or substrates) have been investigated;
• compare performance of different reporters for the same biological entity, e.g., for Affymetrix transcriptomics experiments, compare different probesets for the same gene;
• visualize data for several bioentities, from one or several studies;
• find profiles similar to a given one, on a per-study basis.
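Two of the tasks listed above, finding the studies in which a set of bioentities has been investigated and finding a similar profile within one study, can be sketched as follows. The source does not state which similarity measure MoDa uses; Pearson correlation is an assumption of this sketch, and the function names and the toy data are hypothetical.

```python
import math
from typing import Dict, Iterable, List

# study name -> bioentity identifier -> values across the assays of that study
Studies = Dict[str, Dict[str, List[float]]]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def studies_for(entities: Iterable[str], studies: Studies) -> List[str]:
    """Names of the studies in which at least one of the entities was measured."""
    wanted = set(entities)
    return sorted(name for name, profiles in studies.items() if not wanted.isdisjoint(profiles))


def most_similar(query: str, profiles: Dict[str, List[float]], top: int = 1) -> List[str]:
    """Entities of one study ranked by correlation with the query profile."""
    target = profiles[query]
    scored = [(pearson(target, values), name) for name, values in profiles.items() if name != query]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top]]


# Made-up data for two studies.
studies: Studies = {
    "study_A": {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [2.0, 4.0, 6.0, 8.0], "g3": [4.0, 3.0, 2.0, 1.0]},
    "study_B": {"g3": [0.5, 0.7, 0.6]},
}
```

With this toy data, `studies_for(["g3"], studies)` lists both studies, while `most_similar("g1", studies["study_A"])` ranks `g2` first because its profile rises in step with that of `g1`.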
We used the ArrayExpress gene expression data warehouse as the prototype version of MoDa, and generalized and simplified the underlying software in order to add new types of data of similar structure.
As described above, MoDa is optimised for managing two-dimensional data matrices. One of the dimensions is the gene (or, more generally, molecular entity) dimension, while the other one is the sample dimension. Dimension elements (genes and samples) have rich annotation, while the matrix contains just numeric values, e.g., gene expression levels.
Data is loaded into MoDa from MAGE-TAB files, a simple tab-delimited format that was created for gene expression data exchange but can also serve well for other technologies. The data is moved from the repository into the data warehouse using a set of functional modules that parse the files in the repository and create MAGE-TAB files, which are then loaded into the data warehouse by means of a single application. By decoupling the data mapping and data loading tasks we are able to trace the data movement efficiently, and such a modular approach is also beneficial for the software development process.
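The separation between a mapping step that writes tab-delimited files and a single loader that reads them can be illustrated with a small round trip. This sketch deliberately uses a simplified layout (one header row of assay names, one row per bioentity) and does not implement the MAGE-TAB specification; the column name `entity_id` and both function names are assumptions.

```python
import csv
import io
from typing import List, Tuple


def write_matrix(entities: List[str], assays: List[str], values: List[List[float]]) -> str:
    """Mapping step: serialise a data matrix as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity_id"] + list(assays))
    for entity, row in zip(entities, values):
        writer.writerow([entity] + [repr(value) for value in row])
    return buffer.getvalue()


def read_matrix(text: str) -> Tuple[List[str], List[str], List[List[float]]]:
    """Loading step: parse the tab-delimited text back into its three parts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    assays = header[1:]
    entities: List[str] = []
    values: List[List[float]] = []
    for record in reader:
        entities.append(record[0])
        values.append([float(cell) for cell in record[1:]])
    return entities, assays, values


# Made-up matrix: two bioentities measured in two assays.
text = write_matrix(["g1", "g2"], ["a1", "a2"], [[1.5, 2.0], [0.25, 4.0]])
entities, assays, values = read_matrix(text)
```

Because the intermediate file is plain text, each technology-specific parser can be tested on its own and the loader never needs to know which technology produced the file.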
We used Lucene libraries to enable search by molecule, gene, or sample. There is no underlying database, so the entire system is much simpler and easier to clone and install at other sites. Also, compared to an RDBMS, Lucene enables more flexible text search over bioentity and sample properties (including approximate querying). The data are stored in NetCDF files, a binary format for managing multi-dimensional data that was initially developed for earth scientists but is increasingly being taken up in bioinformatics [6,12].
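To illustrate what approximate querying over annotation text means in practice, the following sketch builds a tiny inverted index and tolerates misspelled query terms. It uses Python's standard `difflib` module as a stand-in and is not Lucene, nor does it reproduce Lucene's scoring or query syntax; the function names, the similarity cutoff and the annotation strings are assumptions.

```python
import difflib
import re
from typing import Dict, List, Set


def build_index(annotations: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: lower-cased token -> identifiers of the records containing it."""
    index: Dict[str, Set[str]] = {}
    for record_id, text in annotations.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(token, set()).add(record_id)
    return index


def search(index: Dict[str, Set[str]], term: str, cutoff: float = 0.8) -> List[str]:
    """Return records whose annotation contains a token close to the query term."""
    # get_close_matches ranks indexed tokens by string similarity to the term.
    matches = difflib.get_close_matches(term.lower(), list(index), n=5, cutoff=cutoff)
    hits: Set[str] = set()
    for token in matches:
        hits |= index[token]
    return sorted(hits)


# Made-up annotations for two bioentities and one sample.
annotations = {
    "g1": "insulin receptor",
    "g2": "leptin",
    "s1": "adipose tissue biopsy",
}
index = build_index(annotations)
```

Here `search(index, "insuln")` still finds the record annotated with "insulin", which is the kind of tolerance to typing errors that is awkward to obtain with exact-match queries in a relational database.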
For the user interface we decided to try out Flex, a free, open source framework for rich internet application development that creates Flash applications. We also used BlazeDS, a technology for connecting Flash/Flex applications with Java-based data services (in our case, search and retrieval of bioentities and studies, and retrieval of data from NetCDF files). See Figure 4 for an overview of all components.
We designed and implemented a warehouse for unified access to transcriptomics, methylation, proteomics and metabonomics data. The user interface allows for filtering and querying these data by properties of molecules and genes and by genomic regions. The content of the system is a non-redundant index of abundance levels of all molecules, genes and sites of methylation which were registered in the studied samples. Molecular and sample annotation is consistent throughout the entire data content.
In order to optimize the annotation process, a software infrastructure for the management of clinical and high throughput experiment data was developed:
• a LIMS-like repository was implemented in order to collect and process sample and assay information and experimental results;
• all data sets were consistently re-annotated;
• results of proteomics, transcriptomics, metabonomics and methylation analyses were presented in a data warehouse.
Separating submission and in-house data management from data retrieval allowed us to optimize the data warehouse architecture for complex queries and multifaceted data access while ensuring data quality and consistency.
We have created a uniform data management infrastructure usable in a multi-site, multi-technology project that could be further customized and developed for a wide range of biomedical projects.
MoDa is available from http://wwwdev.ebi.ac.uk/fg/MoDa2/
This work has been funded by the European Commission as a part of the Integrated Project MolPAGE (grant code: LSHG-CT-2004-512066). We would like to thank Amy Barrett, Anthony Maher, Derek Crockford, Magnus Åberg, Severine Zirah, Marc E Dumas, Anna Asplund, Erik Björling, Susanne Schwonbeck, Jens Lamerz, Andreas Petri, Kristian Almstrup, Matthias Schuster, Dimo Dietrich, Florian Eckhardt and ArrayExpress curators for providing data for MoDa, Juris Viksna, Andris Zarins, Peteris Rucevskis, Natalja Kurbatova and Karlis Podnieks for their work on SIMBioMS, and Hugo Berube, Arek Kasprzyk and Damian Smedley for their help with MoDa design and implementation.
No competing financial interests exist.