Figure 1

a) UniProt complete proteome and RefSeq NP entries were mapped to NCBI Entrez Gene identifiers and merged to create the target proteome set. A total of 18,931 proteins were common to the two curated sets, while 198 and 1,386 proteins were found exclusively in RefSeq and UniProt, respectively.
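The merge in panel (a) amounts to set operations on Entrez Gene identifiers. The following is a minimal sketch, assuming each source has already been reduced to a set of gene IDs; the function name `merge_proteomes` and the toy identifiers are hypothetical and are not taken from the authors' pipeline.

```python
def merge_proteomes(uniprot_genes, refseq_genes):
    """Merge two collections of Entrez Gene IDs and split them by source overlap."""
    uniprot_genes = set(uniprot_genes)
    refseq_genes = set(refseq_genes)
    return {
        "target": uniprot_genes | refseq_genes,        # union: target proteome set
        "common": uniprot_genes & refseq_genes,        # present in both sources
        "refseq_only": refseq_genes - uniprot_genes,   # exclusive to RefSeq
        "uniprot_only": uniprot_genes - refseq_genes,  # exclusive to UniProt
    }


# Toy identifiers for illustration only (not real data).
merged = merge_proteomes({1, 2, 3, 4}, {3, 4, 5})
print({name: len(ids) for name, ids in merged.items()})

# With the counts reported in the caption, the target set size is their sum.
target_size = 18_931 + 198 + 1_386
print(target_size)  # 20515
```

Because the three categories are disjoint, their counts add up to the size of the merged target set.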
b) Mapping of the human proteome was carried out in three phases. In phase I, datasets were integrated from existing proteomic databases that host human protein data. Custom Perl scripts were used to parse the downloaded datasets, and data inclusion criteria were applied. Filtered protein identifications were mapped to the target proteome, and the resulting annotations were clustered for 19,552 unique proteins. In phase II, mRNA expression and transmembrane (TM) domains were examined for the 963 genes that were not identified at the protein level in phase I, and the tissues in which these genes are expressed were identified. Nearly 80 published proteomic studies (Supplementary Table 1) that used tissues/samples in which these genes are expressed were collated and integrated. Protein identifications from these studies that could be mapped to the target proteome yielded 14,526 unique proteins, of which 149 were not detected in phase I. In phase III, the scientific literature was manually curated for the remaining 814 genes that could not be identified at the protein level, and an additional 15 proteins were collected. The remaining 799 genes could not be identified at the protein level by the three-phase approach. For these 799 genes, mRNA expression profiles and gene status were examined to provide possible clues to their protein-coding potential. Protein data from the three phases were assembled to map the human proteome. The assembled data encompass 19,712 unique proteins and are freely accessible through the Human Proteome Browser (http://www.humanproteomebrowser.info). CRCDB – colorectal cancer database.
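The number of genes still lacking protein-level evidence after each phase follows from simple subtraction. The sketch below reproduces that bookkeeping using only the counts stated in panels (a) and (b); the helper name `remaining_after_phases` is hypothetical.

```python
def remaining_after_phases(start_missing, recovered_per_phase):
    """Return the count of unidentified genes before and after each later phase."""
    remaining = [start_missing]
    for recovered in recovered_per_phase:
        if recovered > remaining[-1]:
            raise ValueError("cannot recover more genes than are still missing")
        remaining.append(remaining[-1] - recovered)
    return remaining


target_size = 18_931 + 198 + 1_386   # target proteome set from panel (a)
phase1_identified = 19_552           # unique proteins after phase I
missing_after_phase1 = target_size - phase1_identified

# 149 proteins were newly detected in phase II and 15 in phase III.
trail = remaining_after_phases(missing_after_phase1, [149, 15])
print(trail)  # [963, 814, 799]
```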

Figure 1: Three-phase approach to map the human proteome.