ISSN: 0974-276X
Journal of Proteomics & Bioinformatics

Statistical Analysis of Protein Microarray Data: A Case Study in Type 1 Diabetes Research

Le TT An1-3#, Anna Pursiheimo1,2#, Robert Moulder1 and Laura L Elo1,2*
1Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Turku, Finland
2Department of Mathematics and Statistics, University of Turku, Finland
3School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Vietnam
#Authors contributed equally
Corresponding Author : Laura Elo
Adjunct Professor, Group Leader
Turku Centre for Biotechnology
and Department of Mathematics and Statistics
FI-20014 University of Turku, Finland
Tel: +358 2 333 8009
Fax: +358 2 231 8808
E-mail: [email protected]
Received September 14, 2014; Accepted October 24, 2014; Published October 28, 2014
Citation: An LTT, Pursiheimo A, Moulder R, Elo LL (2014) Statistical Analysis of Protein Microarray Data: A Case Study in Type 1 Diabetes Research. J Proteomics Bioinform S12:003. doi: 10.4172/jpb.S12-003
Copyright: © 2014 An LTT, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In this report we provide an overview of protein microarrays and devote particular attention to the statistical methods used in data analysis, with applications concerning the study of type 1 diabetes. These methodologies are illustrated with publicly available data from a study that identified novel type 1 diabetes associated autoantibodies. Amongst the methods employed, the Reproducibility-Optimized Test Statistic (ROTS) shows better detection than the widely used LIMMA. With the application of this analytical approach, we identify new protein biomarkers that were not reported in the original investigation. This observation emphasises the benefit of using different methods to extract critical information in the analysis of microarray data.

Keywords: Protein microarray; Type 1 diabetes; Biomarker; Computational method; Reproducibility-Optimized Test Statistic (ROTS)
Abbreviations: T1D: Type 1 Diabetes; T2D: Type 2 Diabetes; NGT: Normal Glucose Tolerance; SAM: Significance Analysis of Microarrays; LIMMA: Linear Models for Microarray Data; RP: Rank Product; ROTS: Reproducibility-Optimized Test Statistic; FDR: False Discovery Rate; ROC: Receiver Operating Characteristics
Protein microarrays
In recent years, powerful high throughput microarray techniques for gene expression profiling have emerged and have been widely applied in comparative studies of cellular states and biological specimens [1,2]. These facilitate automated, parallel analysis of thousands of genes and have created new possibilities in biomedical research. Despite such advantages, it should be noted that the correlation between RNA expression data and protein translation is variable [3] and that transcriptomic measurements cannot account for post-translational changes [4]. An important alternative is the use of protein microarrays, which can be used for protein detection, quantification and interaction measurements, thus providing a promising complement to other systems biology approaches [5].
Akin to their oligonucleotide targeting analogues, protein microarrays are constructed on solid supports, such as a glass slide or nitrocellulose membrane, onto which small amounts of different probes (proteins) are bound at discrete locations [6]. These can range from high density chips containing thousands of proteins to specific arrays with tens or hundreds of antibodies. Currently there are three different types of protein microarrays: functional microarrays, reverse phase microarrays and analytical microarrays [6]. Functional protein microarrays generally incorporate a large panel of purified proteins or protein domains and are used to detect biochemical activity and protein interactions [7]. With reverse phase protein microarrays a cell lysate is arrayed and then probed with antibodies against specific protein targets [8]. Analytical microarrays have included arrays of antibodies, aptamers, or affibodies and are typically used to measure binding affinities, specificities, and expression levels of proteins in complex mixtures [9,10]. Overall the main areas of application for protein microarrays include proteomics, protein functional analysis, antibody characterization, diagnostics, and treatment development. The listed range of specific targets and applications reads like a who’s who of protein-oriented biological research, including protein-protein/peptide/RNA interactions [11-13], protein post-translational modifications [4] and biomarker identification [14]. The latter includes application towards detection of infectious disease [15], cancer [14,16] and autoimmune diseases, e.g. systemic lupus erythematosus [17] and rheumatoid arthritis [18].
Amongst the methods of detection used with protein microarrays, fluorescent labelling is the most widely used. Other approaches include photochemical and radioisotope tags. The fluorescent label or tag is attached to the probe or secondary antibody and the interaction determined by, for example, a microarray scanner [6].
Type 1 diabetes
Type 1 diabetes (T1D) is an autoimmune disease that results in the destruction of the insulin producing beta cells of the islets of Langerhans [19]. Following this destruction, the patient is dependent on daily insulin substitution for the rest of his/her life and there is a high risk of developing acute and long-term complications. There is a strong genetic component for T1D risk, in addition to the role of the environment, diet and viral infections, which have been indicated as influential factors driving its onset. Whilst the genetic traits for T1D are common in some populations, the outcome is unpredictable. Early signs of its onset are found with the detection of autoantibodies against islet cells (ICA). Currently the panel of autoantibodies that are regularly used to determine the development of the autoimmune reactions that underlie the development of T1D consists of islet cell autoantibodies (ICA) and antibodies against insulin (IAA), glutamic acid decarboxylase (GADA), IA-2 protein (IA-2A) and zinc transporter 8 (ZnT8) [20]. In prospective studies of T1D pathogenesis, serum samples are collected at regular intervals from genetically conferred risk groups and tested for these autoantibodies [21,22]. In this respect the use of protein microarrays is an attractive option for multiplexed detection of these known autoantibodies and for the detection of new markers.
Amongst the literature describing the use of protein microarrays in T1D research there have been a number of studies using commercial cytokine antibody arrays [23-27]. Broadly these include a group of studies using essentially the same cytokine array (RayBiotech) to study the effects of T1D autoantigen stimulation on cytokine production from peripheral blood mononuclear cells (PBMCs) from diabetics and children with diabetic parents. The PBMC samples used in these studies were from patients with cystic fibrosis related T1D [23], neonates with T1D parents [24], T1D patients and their relatives [26], T1D children [27] and children with mothers displaying maternal hyperglycaemia [25]. Miersch et al. [28] produced microarrays displaying 6000 proteins that were used to identify new T1D autoantibodies from the sera of T1D patients. Their analysis revealed 26 novel autoantibodies. In a similar fashion, Koo et al. [29] screened sera from type 1 and type 2 diabetes using arrays of 9,600 proteins. In their study, two novel autoantibodies were identified.
Protein microarrays: statistical and computational approaches
There are a number of reviews on differential expression analysis and feature selection using microarray data [30-35]. Most of them have focused on gene expression microarrays. Here we provide a brief review of differential expression analysis in the context of protein microarrays. In particular, we describe the essential steps in the analysis of protein microarray data and a number of computational tools for determining statistically significant differences between distinct sample groups. Finally, we provide a case study with recently published protein microarray data from a study of T1D, which demonstrates the favourable performance of our reproducibility-optimized test statistic ROTS in comparison to six other methods: Rank Product [36], T-test [37], SAM [38], LIMMA [39], Wilcoxon rank sum test [40] and M-score [41].
After data generation, pre-processing is typically needed. This step includes removing unwanted outliers and damaged microarrays and normalizing the data distribution [42]. When determining biological differences between sample groups (e.g. cases and controls), normalization is needed to avoid systematic errors and other artificial differences. In general, normalization is used to adjust the expression values so that the measurements across the samples can be compared. The most common methods for protein microarrays are quantile normalization, variance stabilizing normalization, cyclic loess and robust linear model normalization [43-47]. The normalization methods can affect the results of the analysis, but currently there is no global consensus on the best solution for this [42,48,49].
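As an illustration of the normalization step, the following is a minimal sketch of a log2 transformation followed by quantile normalization, written with the Python standard library only. The function name `quantile_normalize` and the toy intensities are hypothetical and are not taken from any of the cited tools; ties are broken by position rather than averaged, which dedicated implementations usually handle more carefully.

```python
import math

def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns."""
    n = len(columns[0])
    # The mean of the k-th smallest value across samples defines the reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this sample's proteins, ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference value of its rank
        normalized.append(out)
    return normalized

# Toy intensities (hypothetical): three arrays (samples), four proteins each.
raw = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
log2_data = [[math.log2(v) for v in col] for col in raw]
normalized = quantile_normalize(log2_data)
```

After this step every sample has the same set of values, while the ranking of proteins within each sample is preserved.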
Analysis of differential expression
A common goal in the analysis of experimental results is to identify the features that distinguish different conditions. This often begins by using statistical tools to compare the expression levels of the different conditions to find differentially expressed proteins. Several different approaches have been employed for protein microarrays, including Rank Product (RP) [50], Wilcoxon rank sum test [26,50], T-test [51,52], Significance Analysis of Microarrays SAM [53], Linear Models for Microarray Data LIMMA [54], M statistic [29] and many more. The available tools often have limitations: for instance, they can be time consuming to apply, they may not adapt well to the intrinsic properties of new data sets, or their results may show poor reproducibility across data sets. For example, the T-test does not work well for data sets with only a few replicate samples and it relies on the assumption that the data are normally distributed, which is not usually the case [55,56]. When the distribution of the data is not known, non-parametric methods are preferred. However, they can also be dependent on the characteristics of the data. To deal with small sample sizes, it has been proposed to use relevant background knowledge [57], improved statistical tests [38,39,58,59] or more than one method to obtain better detections [45,60]. Overall, there have been strong reasons to develop different tools to improve statistical power and identify reliable features from differential expression analysis.
To address the problem of deciding which statistical test is suitable for a particular data set and adapts best to its characteristics, we have introduced a Reproducibility-Optimized Test Statistic ROTS [58,59], which learns the optimal test statistic directly from the given data. More specifically, ROTS gives more freedom to the standard deviation term in the T-test, which enables optimization toward maximal overlap among top-ranked proteins across bootstrap resamples. Table 1 summarizes ROTS together with several other widely used tools for differential expression analysis: Rank Product, Wilcoxon rank sum test, the ordinary T-test, SAM and LIMMA. In brief, SAM and LIMMA are modifications of the ordinary T-test whereas the Wilcoxon rank sum test and Rank Product are common non-parametric methods based on ranks.
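The idea of giving more freedom to the standard deviation term can be sketched as a family of statistics of the form |mean difference| / (a1 + a2 * s), where s is the standard error used in the T-test. The code below is a simplified illustration under that assumption and is not the ROTS package itself: the function names and the parameter names `a1` and `a2` are hypothetical, and the actual procedure searches over many parameter values and bootstrap resamples, comparing the resulting overlap against randomized data.

```python
import statistics

def rots_like_statistic(x, y, a1, a2):
    """|mean(x) - mean(y)| / (a1 + a2 * s), with s the pooled standard error."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    s = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return abs(statistics.mean(x) - statistics.mean(y)) / (a1 + a2 * s)

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k top-ranked proteins shared by two score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k

# Toy log2 intensities for one protein in two groups (hypothetical values).
cases, controls = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
t_like = rots_like_statistic(cases, controls, a1=0.0, a2=1.0)       # ordinary t-type statistic
difference = rots_like_statistic(cases, controls, a1=1.0, a2=0.0)   # plain mean difference

# Scores from two hypothetical bootstrap resamples of the same data.
resample_1 = {"P1": 3.0, "P2": 2.0, "P3": 1.0}
resample_2 = {"P1": 3.0, "P2": 0.5, "P3": 1.0}
reproducibility = top_k_overlap(resample_1, resample_2, k=2)
```

Setting `a1=0, a2=1` recovers a t-type statistic and `a1=1, a2=0` the absolute mean difference; a reproducibility-optimizing procedure would prefer the parameter values for which the top-ranked lists agree most across resamples.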
False discovery rate
Statistical testing is based on setting a null hypothesis (e.g. there is no difference in protein expression between two groups) and assessing whether the data provide evidence against it. Statistical significance is determined by the p-value, which is the probability of observing a result at least as extreme as the one obtained if the null hypothesis is actually true. If the p-value is small (e.g. 0.05 or less) then there is strong evidence against the null hypothesis, whilst a large p-value means that the evidence is weak or the test is not significant.
In protein microarray studies, a large number of statistical tests are made simultaneously, one for each protein on the array. With such multiple testing it is necessary to apply corrections when assessing protein differential expression. For example, if there are 1000 proteins on the array, none of which is truly differentially expressed, and we use a p-value threshold of 0.05 to determine differential expression, then by random chance alone we would expect 50 false positive discoveries. To reduce the number of false positives, the p-values need to be corrected. Traditionally, Bonferroni correction has been used, but it is often too conservative and may also discard many true discoveries [61]. Therefore, the less conservative False Discovery Rate (FDR) approach has been developed. Common methods to control FDR include the Benjamini–Hochberg procedure and permutation-based procedures [62,63].
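The multiple testing arithmetic and the Benjamini–Hochberg adjustment can be sketched as follows. This is a minimal illustration with hypothetical p-values; the function name `benjamini_hochberg` is ours, and established statistical packages provide tested implementations that should be preferred in practice.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# 1000 tests at a threshold of 0.05: expected false positives if no protein truly differs.
expected_false_positives = 1000 * 0.05

# Hypothetical p-values for five proteins.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(p)
nominal_hits = [i for i, v in enumerate(p) if v < 0.05]    # no correction
fdr_hits = [i for i, v in enumerate(q) if v < 0.05]        # FDR < 0.05
```

In this toy example four proteins pass the nominal threshold, but only the first two remain after the adjustment, which illustrates why lists based on nominal p-values tend to be longer.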
Data visualization
For each processing step, good visualization is helpful in the interpretation of the results, especially with high-dimensional data. Before and after normalization, histograms or boxplots can show the overall change in the shape of the data. This can also indicate possible outliers, which should be removed from the data before further analysis. Next, unsupervised methods such as hierarchical clustering, displayed together with heat maps, can be used to explore known patterns or suggest new ones to be considered in order to obtain the full picture of the data. After detection of differential expression, volcano plots or receiver operating characteristics (ROC) curves are often drawn to aid the interpretation of the results. For example, volcano plots show the fold change against the significance, while ROC curves visualize the relationship between the sensitivity and specificity of classification.
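The quantities behind these two plots are simple to compute. The sketch below, using hypothetical values and function names of our own choosing, derives the coordinates of one point on a volcano plot and one point on an ROC curve; plotting itself is left to any graphics library.

```python
import math

def volcano_point(mean_log2_case, mean_log2_control, p_value):
    """Coordinates of one protein on a volcano plot: (log2 fold change, -log10 p)."""
    # On log2-transformed data the fold change is a difference of group means.
    return mean_log2_case - mean_log2_control, -math.log10(p_value)

def sensitivity_specificity(scores, labels, threshold):
    """One ROC point: samples scoring at or above the threshold are called positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 0)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical protein: mean log2 intensity 10 in cases, 8 in controls, p = 0.001.
x, y = volcano_point(10.0, 8.0, 0.001)

# Hypothetical classifier scores; label 1 = case, 0 = control.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.35)
```

Sweeping the threshold over all observed scores and plotting sensitivity against 1 - specificity traces out the ROC curve.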
Case study: Identification of autoantibodies for T1D
To illustrate the performance of the different statistical methods in discovering new autoantibody biomarkers of T1D, we re-analysed the recently published protein microarray data by Koo et al. [29]. The data were downloaded from the Gene Expression Omnibus (GEO) database (accession number GSE50866), including measurements from serum samples of 16 T1D patients, 16 T2D patients, and 27 healthy controls with normal glucose tolerance (NGT). The data were from the ProtoArray protein microarrays, containing 9480 human proteins. The data available in GEO had already been log transformed (base 2) and quantile normalized, and these normalized data were used in the statistical analysis.
Using six different statistical methods, we identified proteins that showed significant differences between two groups of samples at a false discovery rate (FDR) below 0.05. Following the approach of the original study, three comparisons were considered: T1D vs. NGT, T1D vs. NGT and T2D (NGT+T2D), and T1D vs. T2D. Additionally, we compared the obtained results to those of the original study using the M-statistic with P<0.05 and an additional Z-score criterion [29]. Table 2 shows the numbers of detections with the different methods.
Overall, the widely used LIMMA and the Wilcoxon test detected only one protein in the three comparisons, whereas Rank Product resulted in very long lists of detections, suggesting that these methods may not suit the present data. In general, many more findings were made in the original study than in our comparisons. This is in line with the fact that the original study did not control the FDR levels, and was therefore more liberal than our FDR controlling strategy and thus more prone to false positive detections. SAM and ROTS detected similar numbers of proteins as significant. T-test detected more proteins in the comparisons T1D vs. NGT, and T1D vs. NGT+T2D, but none in the comparison T1D vs. T2D.
Investigation of the common detections between SAM, T-test and ROTS suggested that the overlap of the detections between these methods was often relatively small (Figure 1). In the comparison T1D vs. NGT, only ~15% (6/39) of the detections made by the T-test were found with at least one other method. With SAM the overlap was ~50% (10/19), and with ROTS 60% (9/15). In the comparison T1D vs. NGT+T2D, the overlap was ~5% (1/23) with T-test, 20% (2/10) with SAM, and 40% (2/5) with ROTS. Finally, in the comparison T1D vs. T2D, the overlap was ~50% (5/11) and ~80% (5/6) for SAM and ROTS, respectively, whereas T-test did not find any proteins. Taken together, ROTS gave the highest proportion of detections that were also found by at least one of the other methods. This supports the potential relevance of the proteins detected using ROTS, as it has been found in various contexts that detections made simultaneously by multiple different statistics are more likely to be true than those made by a single statistic [64,65]. Furthermore, out of all the 18 detections made with at least two methods across the three comparisons (11 in T1D vs. NGT, 2 in T1D vs. NGT+T2D, and 5 in T1D vs. T2D; Table 3) only two were not detected by ROTS (~10%), and these two were quite close to the borderline (FDR=0.056 and FDR=0.125).
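The overlap proportions quoted above are of the form "detections of one method that were also found by at least one other method, divided by all detections of that method". A minimal sketch of this calculation is given below; the protein identifiers are hypothetical placeholders rather than the actual detections, whereas the counts in `reported` are those stated in the text.

```python
def shared_fraction(method, detections):
    """Fraction of a method's detections also reported by at least one other method."""
    own = detections[method]
    others = set().union(*(v for k, v in detections.items() if k != method))
    return len(own & others) / len(own) if own else float("nan")

# Hypothetical detection lists for three methods.
toy = {
    "T-test": {"A", "B", "C", "D"},
    "SAM": {"A", "E"},
    "ROTS": {"A", "B"},
}
toy_fractions = {m: shared_fraction(m, toy) for m in toy}

# Counts reported in the text as (shared, total), converted to percentages.
reported = {
    "T1D vs. NGT": {"T-test": (6, 39), "SAM": (10, 19), "ROTS": (9, 15)},
    "T1D vs. NGT+T2D": {"T-test": (1, 23), "SAM": (2, 10), "ROTS": (2, 5)},
    "T1D vs. T2D": {"SAM": (5, 11), "ROTS": (5, 6)},
}
percentages = {
    comparison: {m: round(100 * shared / total, 1) for m, (shared, total) in counts.items()}
    for comparison, counts in reported.items()
}
```

With the reported counts, ROTS has the highest percentage in each of the three comparisons, which is the basis for the statement that it gave the highest proportion of shared detections.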
Figure 2 further illustrates the relationship between the significant proteins detected using ROTS in the different comparisons. A total of four proteins were detected both in the comparison T1D vs. NGT and in the comparison T1D vs. NGT+T2D. These included EEF1A1 (eukaryotic translation elongation factor 1 alpha 1), EDIL3 (EGF-like repeats and discoidin I-like domains 3), ZADH1 (PTGR2, prostaglandin reductase 2), and MGC72080 (MGC72080 pseudoprotein). The latter two were not detected in the original study or with the other statistical tests considered in this study (Table 4). Prostaglandin reductase 2 (ZADH1) is involved in the metabolism of prostaglandins and has been implicated in relation to insulin sensitivity [66]. MGC72080 appears to be the product of a pseudogene. It should be noted, however, that both of these proteins were detected with low intensity signals, and thus the results should be interpreted with caution. One alternative in such circumstances is to filter out the low-abundance proteins using, for instance, the overall average intensity or variance across the samples [67] or the combination of Z score, Chebyshev inequality precision value and coefficient of variation [68,69]. However, such implementations can be subjective and result in the loss of data describing potentially important proteins, such as lower abundance signalling molecules or receptors.
Four different proteins were detected by all three methods (T-test, SAM and ROTS) across the comparisons. These were EEF1A1, EDIL3, SFRS3 (serine/arginine-rich splicing factor 3), and CPEB1 (cytoplasmic polyadenylation element binding protein 1). EEF1A1 was the key finding in the original study, whereas SFRS3 was not detected in the original study despite the overall lower stringency used there. The T1DBase database shows that these proteins are highly expressed in T1D related cells, such as in pancreatic islets. Our consistent findings with the different methods suggest that these proteins could be useful candidates for further experimental studies to validate their role in T1D, for example using the validation methods shown in [28,29]. In addition, it was notable that UBE2L3, the other validated protein in the original study, was not consistently detected by multiple statistical tests in our comparisons. This is likely due to the fact that it was detected with a wide range of signal intensities in the individual samples measured. This further highlights the importance of careful validation in independent sample cohorts.
This report provides an overview of the statistical and computational tools available for protein microarray data analysis and demonstrates how they can be used to help to study T1D. In addition to clarifying the expression changes and activity of known proteins relevant to T1D, protein microarrays can enable the discovery of new biomarkers to predict the onset of T1D. From the computational viewpoint, the existing literature reveals limitations in the current practices. In particular, our reanalysis of the recently published T1D data [29] demonstrated how the choice of the statistical test can have a large impact on the results obtained. For instance, the widely used methods for differential expression analysis LIMMA, Rank Product and Wilcoxon rank sum test did not perform well on these data. To overcome such limitations, we propose to adjust the test statistic to the properties of the data by optimizing the reproducibility of detection using bootstrap resampling. Our ROTS package performed well for the given data set, yielding the highest proportion of detections that were also found by at least one of the other methods, supporting their potential relevance.
Another important issue in the analysis to be highlighted is the use of FDR to reduce the number of false positive discoveries. For instance, in the original study of the T1D data [29], the authors used nominal p-values to determine significance, which does not control FDR and is likely to produce several false positive findings. Accordingly, they found a large set of detections that was eventually reduced to two candidates validated in independent experiments [29]. Controlling FDR helps to eliminate many of the false positive detections.
In prospective studies of T1D risk cohorts, diabetes has been diagnosed in subjects who have not displayed any of the known autoantibodies [70,71]. Notably, in one diabetes study, 19% of the children were negative for all autoantibodies and this significantly increased with the age of diagnosis [70]. Therefore, there is a growing demand to discover and validate new autoantibodies which can better predict the disease onset. The capabilities of protein microarray technology present many possibilities for T1D research, including the search for new autoantibodies. Moreover, proteomics markers, derived from discovery experiments in T1D research, could be profiled using targeted antibody assays to assist in risk classification, as has been investigated in the context of systemic lupus erythematosus [72]. In such studies, flexibility in the statistical approaches employed can help to fully utilize the data. With more biological information about the relevant proteins, more complex dimensions could be integrated for further study, for example, connecting the detected proteins with their interactive pathways or networks to enhance the markers and practical applications in clinical T1D.
Finally, increasing the public availability of protein microarray data sets, in formats suitable for reanalysis, would greatly benefit the research community. If the collected data from most of the studies were made available, one could utilize several computational and statistical methods to identify and suggest a smaller set of relevant candidate biomarkers for further validation experiments, which would save laboratory effort and cost.
The authors would like to thank Henna Kallionpää and Deepankar Chakroborty for several interesting discussions. The work is funded by Juvenile Diabetes Research Foundation (JDRF), Päivikki and Sakari Sohlberg Foundation, Yrjö Jahnsson Foundation, and the Diabetes Research Foundation.
