Ullah E^{1}, Shahzad M^{3}, Rawi R^{1}, Dehbi M^{2}, Suhre K^{4}, Selim M^{5} and Bensmail H^{1*}
^{1}Computational Sciences and Engineering, Qatar Computing Research Institute, Education City, Qatar Foundation, Doha, Qatar
^{2}Qatar Biomedical Research Institute, Education City, Qatar Foundation, Doha, Qatar
^{3}Information Systems Department, University of Carnegie Mellon, US
^{4}Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Education City, Qatar Foundation, Doha, Qatar
^{5}Dermatology Department, Hamad Medical Corporation, Doha, Qatar
Received date: December 30, 2014; Accepted date: January 14, 2015; Published date: January 16, 2015
Citation: Ullah E, Shahzad M, Rawi R, Dehbi M, Suhre K, et al. (2015) Integrative 1HNMRbased Metabolomic Profiling to Identify Type2 Diabetes Biomarkers: An Application to a Population of Qatar. Metabolomics 5:136. doi:10.4172/21530769.1000136
Copyright: © 2015 Ullah E, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Diabetes is a leading health problem in the developed world. The recent surge of wealth in Qatar has made it one of the most vulnerable nations to diabetes and related diseases. Recent technological advances in ^{1}H Nuclear Magnetic Resonance (NMR) spectroscopy for metabolomic profiling offer a great opportunity for biomarker discovery. Using this technology, we present an integrative approach to discover new metabolites and possibly new biomarkers. We performed an integrative analysis of ^{1}H NMR spectra measured in urine from 348 participants of the Qatar Metabolomics Study on Diabetes (QMDiab). Our analyses revealed several metabolites that correlate with diabetes and identified specific metabolites affected by anti-diabetes medication, which constrains differentiation between diabetic and control patients.
^{1}H-NMR; Metabolomics; Biomarkers; Diabetes; PCA; ADMM
Many chronic diseases like Type II Diabetes (T2D) and its complications may be preventable by avoiding factors that trigger the disease process. Accurate prediction and identification using biomarkers will be useful for disease prevention and for initiating proactive therapies in those individuals who are most likely to develop the disease. Recent technological advances in proton (^{1}H) Nuclear Magnetic Resonance (NMR) spectroscopy for metabolomic profiling offer a great opportunity for biomarker discovery [1-18]. Because of experimental issues in the technical equipment, the levels of some metabolites cannot be universally determined. As the number of measured metabolites often exceeds the number of samples, dimensionality reduction methods are required.
In this study, we present a possible analysis workflow for mining ^{1}H-NMR spectra from a sample of subjects with T2D and controls (see Methods). The workflow uses robust statistical approaches, namely regularized principal component and regularized cluster analysis methods, as an integrative approach to discover new metabolites and possibly new biomarkers (Figure 1).
QMDiab is a 2012 study from the Dermatology Department of Hamad Medical Corporation in Doha, Qatar. The incentive was the high prevalence of T2D in Qatar, which ranked #21 worldwide in 2013 (International Diabetes Federation, 2014). Metabolite analysis was performed on human blood and urine biofluids of 348 subjects with T2D and controls (here we use the urine biofluid only), of whom at least 100 were Qatari (173 males and 175 females). The subject characteristics are shown in Table 1.
Population characteristics  T2D (n = 178)  Non-T2D (n = 170)
Age (years)  54.0 (34.8-70.7)  38.5 (23.3-62.5)
Gender (% female)  75 (44.1%)  98 (55.1%)
Ethnicity
Arab (%)  85 (50.0%)  115 (64.6%)
South Asian (%)  65 (38.2%)  34 (19.1%)
Filipino (%)  13 (7.6%)  22 (12.4%)
Other or mix (%)  7 (4.1%)  7 (3.9%)
Table 1: Subject characteristics. Arab: Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates and Yemen. South Asian: India, Bangladesh, Nepal, Pakistan, Sri Lanka. Values represent median (90% range) or number of subjects (%).
In the first round of analysis the complete spectra were binned into different bin sizes and normalized using the total peak area normalization method [8]. The qualitative analysis of the major variances in the spectra was performed directly using a newly developed flexible and robust PCA (which we named fPCA) that preprocesses noisy and correlated ^{1}H-NMR data (Figure 2). The fPCA was able to cluster all the samples without diabetes, but the samples with diabetes had a wide spread. Compounds were identified by NMR spectrum analysis of the original data and the loading spectrum of fPC1 (Betaine, Dimethylamine, Glucose, Mannitol, N,N-Dimethylglycine, and β-Alanine). These compounds belong to the families of ammonium compounds, amines, sugars and amino acids and can be used as potential biomarkers in human urine for detection of diabetes. Moreover, our study showed that 9 out of 178 patients with diabetes had potential Paraquat poisoning based on their abnormal concentrations of Citrate, Glutinane and Alanine, and 24 out of 178 had Salicylate (Aspirin) detected in their urine. Regarding the Aspirin finding, we note that people with diabetes are often encouraged to take Aspirin as it may reduce the risk of heart attack due to coronary obstruction, which is a risk many diabetics may develop [19]. Flexible PCA is coded in Java and R and is available upon request from the corresponding author.
Abundances of metabolites are indicative of a variety of conditions, and can provide important insights in a wide variety of biological and clinical investigations. At the same time, interpretation of the spectra gives rise to substantial methodological challenges. The spectra are subject to biological and technical variations, and to uncertainty in identification and quantification of peaks. Nuclear magnetic resonance spectroscopy is a method of choice for identifying and quantifying metabolites in complex biological mixtures, as it is fast, nondestructive and highly reproducible. However interpretation of the spectra is hampered by their complexity, presence of overlapping peaks, and biological variation in the abundance of metabolites. The difficulty is particularly apparent in modern investigations, which require an accurate and fast analysis of spectra from hundreds and even thousands of biological samples. Statistical inference is the only approach that can yield objective and reproducible conclusions from such data. At present the statistical tools available for this task are of limited performance.
Diabetes is usually a lifelong chronic disease characterized by an above-average concentration of sugar in the blood and urine. It is regarded as a disease of affluence and hence affects a considerable portion of the population of the developed world. Diabetes is caused by a reduction in insulin production by the pancreas or a decreased response of body cells to insulin. The prevalence of diabetes in Qatar is higher in females than in males [1]. Risk also increases with age, obesity, hypertension, heart disease and smoking [1]. Family history also affects a person's predisposition to diabetes, which shows that there is a significant genetic component. In Qatar, many marriages are between close cousins, and this is a cause for concern. Qatar Diabetes Association (QDA), which was set up by Qatar Foundation (www.qf.org.qa), is leading the fight against diabetes by educating the general population about the risk factors.
Technologies to measure high-throughput biomedical data in proteomics, chemometrics, and genomics have led to a proliferation of high-dimensional data that pose many statistical challenges. As metabolites are biologically interconnected, the variables in these data sets not only far outnumber the samples but are also often highly correlated and noisy. More generally, methods such as PLS, PCA and SPCA can be used as dimension reduction techniques that find projections of the data that maximize the covariance between the data and the response [15]. During the last decade, several methods have been proposed to encourage sparsity in these projections, or loading vectors, to select relevant features in high-dimensional data [13,14]. There are several motivations for regularizing the PCA loading vectors. Several authors have shown that the PCA projection vectors are asymptotically inconsistent in high-dimensional settings, and encouraging sparsity in the loadings has been shown to yield consistent projections [11-15]. However, the computational cost is high when a large number of loadings is required, so it is desirable to find an approach that regularizes the loadings, reduces the number of features and speeds up the computation of PCA. The PCA loading vectors can be used as a data compression technique when making future predictions; sparsity further compresses the data. As many variables in high-dimensional data are noisy and irrelevant, sparsity presents a method for automatic feature selection. This leads to results that are easier to interpret and visualize. While sparsity in PCA is important for high-dimensional data, there is also a need for more general and flexible regularized methods. Consider our NMR spectroscopy data as a motivating example. These high-throughput data measure the spectrum of chemical resonances of all the latent metabolites, or small molecules, present in a biological sample.
Typical experimental data consists of discretized, functional, and nonnegative spectra with variables measuring in the thousands for only a small number of samples. Additionally, variables in the spectra have complex dependencies arising from correlation at adjacent chemical shifts, metabolites resonating at more than one chemical shift, and overlapping resonances of latent metabolites. Because of these complex dependencies, there is a long history of using PCA to reduce the NMR spectrum for supervised data [16]. Classical PCA or Sparse PCA, however, are not optimal for this type of data as they do not account for the nonnegativity or functional nature of the spectra and do not encourage sparsity or group sparsity.
In this paper, we seek a more flexible framework for regularizing the PCA loadings in high-dimensional ^{1}H NMR data, one that encourages sparsity, group sparsity, or smoothness and also leads to a computationally efficient and fast numerical algorithm.
QMDiab Study
This study was embedded in the Qatar Metabolomics Study on Diabetes (QMDiab), a cross-sectional case-control study with 348 subjects (Tables 1-3). The work was a joint collaboration between Hamad Medical Corporation and Weill Cornell Medical College Qatar. Patients were asked to enroll between February and June 2012. The study has been approved by the Institutional Review Board (IRB) of Hamad Medical Corporation and Weill Cornell Medical College Qatar and is in accordance with the Helsinki Declaration of 1975. Written informed consent was obtained from all participants. The study measured metabolites in 348 individuals aged 17 to 81. The metabolites were measured in three body fluids: non-fasting blood plasma, urine, and saliva. From February to June 2012, 1107 samples were taken from the participants, comprising 1563 metabolites including amino acids, peptides, carbohydrates and lipids, as well as age, gender, ethnicity, weight, height, Body Mass Index (BMI) and personal history of T2D [17].
Characteristics  Arab (n = 200), T2D (n = 85)  Arab (n = 200), Non-T2D (n = 115)  South Asian (n = 99), T2D (n = 65)  South Asian (n = 99), Non-T2D (n = 34)  Filipino (n = 35), T2D (n = 13)  Filipino (n = 35), Non-T2D (n = 22)
Age (years)  53.9 (34.2-71.2)  39.1 (22.6-64.4)  52.6 (35.2-69.1)  39.0 (25.0-57.6)  49.3 (37.8-63.0)  37.2 (23.2-57.8)
Gender (% female)  51 (60.0%)  70 (60.9%)  11 (16.9%)  13 (38.2%)  11 (84.6%)  13 (59.1%)
Smoking (%)  8 (9.4%)  10 (8.7%)  6 (9.2%)  2 (5.9%)  1 (7.7%)  2 (9.1%)
Table 2: Subject characteristics stratified by ethnicity.
Characteristics  Female (n = 173), T2D (n = 75)  Female (n = 173), Non-T2D (n = 98)  Male (n = 175), T2D (n = 98)  Male (n = 175), Non-T2D (n = 95)
Age (years)  52.6 (33.7-70.6)  36.5 (19.5-61.2)  54.4 (34.9-71.1)  41.7 (25.9-64.3)
Ethnicity
Arab (%)  51 (68.0%)  70 (71.4%)  34 (35.8%)  45 (56.3%)
South Asian (%)  11 (14.7%)  13 (13.3%)  54 (56.8%)  21 (26.3%)
Filipino (%)  11 (14.7%)  13 (13.3%)  2 (2.1%)  9 (11.3%)
Other or mix (%)  2 (2.7%)  2 (2.0%)  5 (5.3%)  5 (6.3%)
Table 3: Subject characteristics stratified by gender.
The samples were analyzed by three companies: Metabolon Inc., Chenomx Inc., and Biocrates Life Sciences AG. The respective companies utilized liquid/gas chromatography with mass spectrometry injections, targeted profiling using NMR, and Multiple Reaction Monitoring (MRM). The study found that ethnicity, gender and smoking all had a strong effect on a diabetes risk factor, advanced glycation end products. Women, Arabs, Filipinos, and smokers were more strongly affected than men, South Asians, and non- or irregular smokers [17].
Statistical Analysis
NMR binned data: When dealing with high-resolution NMR spectra, it is in general impracticable to work with all the data points of the spectra, which usually number on the order of 32K or more. The most common strategy used to reduce the number of variables consists of dividing each spectrum into a defined number of regions, the so-called bins. Several binning strategies are available today, from regular binning, where bins have a fixed width, to more sophisticated strategies such as Gaussian or dynamic adaptive binning [8]. Here we used regular binning to preprocess the high-resolution data (∼65,536 data points in a single spectrum) and remove any anomalies. This was motivated by the fact that, when dealing with an array of NMR spectra, regular binning of the stacked spectra generates a matrix, whereas it is not possible to generate a similar matrix directly from deconvolved peaks (peak lists), since the number and position of peaks vary from spectrum to spectrum. In our case, we used a binning approach that automates the binning of the assembled NMR spectra using imposed alignment of each spectrum. In fact, we had 354 files that contained NMR coordinates. Each file had approximately 65,000 data points. This means that, for the 348 spectra in the final matrix, our algorithm had to iterate through 348 x 65,000 = 22,620,000 (about 22.6 million) data points. Each bin gives rise to a new value which is representative of the bin. We used a bin interval of 0.007 ppm. Using Java, we iterated through all x values in each interval and calculated the mean m and standard deviation σ of the corresponding intensities. We then considered only the values inside m ± 3σ and calculated their mean. The matrix obtained after processing the data was of size 348 x 2960.
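The binning step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' Java code; the function name `bin_spectrum`, its arguments, and the use of the population standard deviation are assumptions made for the example.

```python
import math


def bin_spectrum(ppm, intensity, width=0.007, n_sigma=3.0):
    """Regular binning: one trimmed-mean value per fixed-width bin.

    ppm and intensity are equal-length sequences for a single spectrum.
    Returns a list of (bin_start_ppm, representative_value) pairs.
    """
    start = min(ppm)
    bins = {}
    for x, y in zip(ppm, intensity):
        index = int((x - start) / width)  # which fixed-width bin x falls in
        bins.setdefault(index, []).append(y)

    binned = []
    for index in sorted(bins):
        values = bins[index]
        mean = sum(values) / len(values)
        # Population standard deviation of the intensities in this bin.
        sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        # Keep only the values inside mean +/- n_sigma * sigma, then average.
        kept = [v for v in values if abs(v - mean) <= n_sigma * sigma]
        binned.append((start + index * width, sum(kept) / len(kept)))
    return binned
```

Applying such a function to every spectrum over a common ppm range and stacking the representative values row by row would give a samples-by-bins matrix of the kind described in the text.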
Sparse PCA with elastic net: Consider the linear regression model with n observations and p predictors. Let Y = (y_{1}, ..., y_{n})^{T} be the response vector and X = [X_{1}, ..., X_{p}] the matrix of predictors, where X_{j} = (x_{1j}, ..., x_{nj})^{T} for j = 1, ..., p. After a location transformation we can assume that all the X_{j} and Y are centered. The lasso is a penalized least squares method that imposes a constraint on the l_{1} norm of the regression coefficients. Thus, the lasso estimates β_{lasso} are obtained by minimizing the lasso criterion
$$\hat{\beta}_{lasso} = \arg\min_{\beta} \left\| Y - \sum_{j=1}^{p} X_{j}\beta_{j} \right\|^{2} + \lambda \sum_{j=1}^{p} |\beta_{j}| \qquad (1)$$
where λ is nonnegative. The lasso continuously shrinks the coefficients toward zero, and achieves its prediction accuracy via the bias-variance tradeoff. Due to the nature of the l_{1} penalty, some coefficients will be shrunk to exactly zero if λ is large enough. The lasso has drawbacks, however: for example, when p > n it selects at most n variables. The elastic net [12] generalizes the lasso to overcome these drawbacks, while enjoying its other favorable properties. For any nonnegative λ_{1} and λ_{2}, the elastic net estimates are given as follows:
$$\hat{\beta}_{en} = (1 + \lambda_{2}) \arg\min_{\beta} \left\| Y - \sum_{j=1}^{p} X_{j}\beta_{j} \right\|^{2} + \lambda_{2} \sum_{j=1}^{p} |\beta_{j}|^{2} + \lambda_{1} \sum_{j=1}^{p} |\beta_{j}| \qquad (2)$$
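The connection between this robust regression method and PCA has been discussed by [11], and the problem becomes equivalent to the following optimization problem
$$\hat{\beta} = \arg\min_{\beta} \| Z_{i} - X\beta \|^{2} + \lambda_{2} \|\beta\|^{2} + \lambda_{1} \|\beta\|_{1} \qquad (3)$$
where $\|\beta\|_{1} = \sum_{j=1}^{p} |\beta_{j}|$ is the l_{1}-norm of β and Z_{i} = U_{i}D_{ii} is the i^{th} principal component, with X = UDV^{T} the singular value decomposition of X. Approximated principal components are given by $X\hat{V}_{i}$, where $\hat{V}_{i} = \hat{\beta}/\|\hat{\beta}\|$, and a large enough λ_{1} gives a sparse $\hat{\beta}$, hence a sparse $\hat{V}_{i}$.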
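To make the regression view of a sparse loading concrete, the sketch below assumes that criterion (3) is the elastic-net regression of the principal component Z_{i} on X, that is, minimizing ||Z_{i} - Xβ||^{2} + λ_{2}||β||^{2} + λ_{1}||β||_{1}, and solves it with plain coordinate descent. It is not the authors' Algorithm 1 or their fPCA; the function names and default values are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Scalar soft-thresholding: shrink value toward zero by threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def sparse_loading(X, component=0, lam1=0.0, lam2=1.0, sweeps=500):
    """Approximate one sparse PCA loading by elastic-net regression.

    Minimizes ||Z_i - X b||^2 + lam2 * ||b||^2 + lam1 * ||b||_1 by
    coordinate descent, then rescales b to unit length.
    """
    X = X - X.mean(axis=0)                      # center the columns
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    z = U[:, component] * d[component]          # Z_i = U_i D_ii
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)               # X_j^T X_j for every column
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            # Partial residual that leaves out the contribution of column j.
            partial = z - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    norm = np.linalg.norm(beta)
    return beta / norm if norm > 0 else beta
```

With `lam1=0` the normalized solution coincides with the ordinary PCA loading, which is a convenient sanity check; increasing `lam1` sets more entries exactly to zero.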
Algorithm 1 summarizes the steps of SPCA. From an algorithmic point of view, to find the solutions of (3), each of the corresponding optimization problems can be recast as a Lasso problem by introducing new observations and then solved with the Least Angle Regression (LARS) algorithm or a coordinate-descent (Gauss-Seidel) algorithm. It is interesting to note that (i) for p ≥ n the augmented data set has p + n observations and p variables, which can slow the computation considerably; (ii) if the original design matrix is normalized, there is no guarantee that the augmented design matrix will behave similarly, which can cause a loss of part of the interpretation of the data; and (iii) the coordinate-descent algorithm follows a one-at-a-time philosophy, i.e. it minimizes the loss function over β_{j} while keeping the components β_{k}, k ≠ j, fixed at their current values, so that Gauss-Seidel cannot be developed for a grouped variable selection problem. To overcome these limitations, we derive a unified algorithm based on the alternating direction method of multipliers to handle sparse principal component selection, which aims at selecting important components and penalizing the others through β [20]. We propose a doubly regularized model with a general penalty term of the form
(4)
so the flexible elastic equation to minimize, given a fixed A=[α_{1},…, α_{k}], from Algorithm 1 becomes:
where λ, μ ≥ 0 are two tuning parameters, and Q = (q_{ij})_{1 ≤ i, j ≤ p} are weights associated with the l_{1} and l_{2} norms respectively, which are fixed in advance.
The advantages of our algorithm are the following: (1) It provides a general framework to deal with the limitations of unweighted versions of lasso-type estimates. With a proper choice of the weights, a weighted version possesses the oracle properties of selecting the subset of interesting variables, increasing the number of hits and decreasing the number of false positives. (2) It combines the strengths of the Lasso and of a quadratic penalty designed to capture additional structure on the features in the high-dimensional setting, which is frequent in high-throughput data generated from ^{1}H-NMR spectroscopy. (3) It yields an easy and fast algorithm, based on the Alternating Direction Method of Multipliers (ADMM), that finds the optimal estimator without augmenting or normalizing the data (see next section).
Alternating Direction Method of Multipliers (ADMM): Recently, the alternating direction method of multipliers has been revisited and successfully applied to solving large scale problems arising from different applications. In this section we give an overview of ADMM. Consider the following optimization problem:
minimize f (β) + g(ξ)
subject to β - ξ = 0, (5)
where f and g are two convex functions and β, ξ ∈ R^{p}. In this optimization problem, we have two sets of variables, with a separable objective. The augmented Lagrangian for this problem is:
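$$L_{\tau}(\beta, \xi, \delta) = f(\beta) + g(\xi) + \delta^{T}(\beta - \xi) + \frac{\tau}{2} \|\beta - \xi\|_{2}^{2}$$
where δ is the dual variable for the constraint β - ξ = 0 and τ > 0 is a penalty parameter. The augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method and, in particular, to yield convergence without strong assumptions like strict convexity or finiteness of f and g.
At iteration k, the ADMM algorithm consists of the three steps:
$$\beta^{k+1} := \arg\min_{\beta} L_{\tau}(\beta, \xi^{k}, \delta^{k}) \qquad (6)$$
$$\xi^{k+1} := \arg\min_{\xi} L_{\tau}(\beta^{k+1}, \xi, \delta^{k}) \qquad (7)$$
$$\delta^{k+1} := \delta^{k} + \tau(\beta^{k+1} - \xi^{k+1}) \qquad (8)$$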
1. In the first step of the ADMM algorithm, we fix ξ and δ and minimize the augmented Lagrangian over β.
2. In the second step, we fix β and δ and minimize the augmented Lagrangian over ξ.
3. Finally, we update the dual variable δ.
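If we consider the scaled dual variable η = (1/τ)δ and the residual r = β - ξ, the ADMM algorithm can be expressed in its scaled dual form as (we will use the scaled form in the paper):
$$\beta^{k+1} := \arg\min_{\beta} \left( f(\beta) + \frac{\tau}{2} \|\beta - \xi^{k} + \eta^{k}\|_{2}^{2} \right) \qquad (9)$$
$$\xi^{k+1} := \arg\min_{\xi} \left( g(\xi) + \frac{\tau}{2} \|\beta^{k+1} - \xi + \eta^{k}\|_{2}^{2} \right) \qquad (10)$$
$$\eta^{k+1} := \eta^{k} + \beta^{k+1} - \xi^{k+1} \qquad (11)$$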
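A toy problem makes the three scaled-form steps tangible. The sketch below assumes f(β) = (1/2)||β - a||^{2} and g(ξ) = λ||ξ||_{1}, a choice made only for illustration because both minimizations then have closed forms and the exact solution is the soft-thresholded vector a. The function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator applied to a single number."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def admm_l1_denoise(a, lam, tau=1.0, iterations=200):
    """Scaled ADMM for f(beta) = 0.5*||beta - a||^2, g(xi) = lam*||xi||_1."""
    p = len(a)
    xi = [0.0] * p
    eta = [0.0] * p
    for _ in range(iterations):
        # Step 1: minimize over beta with xi and eta fixed (closed form here).
        beta = [(a[j] + tau * (xi[j] - eta[j])) / (1.0 + tau) for j in range(p)]
        # Step 2: minimize over xi with beta and eta fixed (soft-thresholding).
        xi = [soft_threshold(beta[j] + eta[j], lam / tau) for j in range(p)]
        # Step 3: update the scaled dual variable with the residual beta - xi.
        eta = [eta[j] + beta[j] - xi[j] for j in range(p)]
    return xi
```

For example, `admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0)` approaches `[2.0, 0.0, 0.2]`, the soft-thresholded input, and the middle entry is exactly zero because the ξ-update thresholds it at every iteration.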
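Stopping criteria: The primal and dual residuals at iteration k have the forms:
$$r^{k} = \beta^{k} - \xi^{k}, \qquad s^{k} = -\tau(\xi^{k} - \xi^{k-1}).$$
The ADMM algorithm terminates when the primal and dual residuals satisfy a stopping criterion. A typical stopping criterion is given in [5], where the authors propose to terminate when $\|r^{k}\|_{2} \le \varepsilon^{pri}$ and $\|s^{k}\|_{2} \le \varepsilon^{dual}$. The tolerances ε^{pri} > 0 and ε^{dual} > 0 can be chosen using an absolute and relative criterion, such as $\varepsilon^{pri} = \sqrt{p}\,\varepsilon^{abs} + \varepsilon^{rel} \max\{\|\beta^{k}\|_{2}, \|\xi^{k}\|_{2}\}$ and $\varepsilon^{dual} = \sqrt{p}\,\varepsilon^{abs} + \varepsilon^{rel} \|\delta^{k}\|_{2}$, where ε^{abs} > 0 and ε^{rel} > 0 are absolute and relative tolerances. A reasonable value for the relative stopping criterion is 10^{-3} or 10^{-4}, while the choice of the absolute stopping criterion depends on the scale of the typical variable values (see [5] for details).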
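The stopping rule can be written as a small helper. The sketch assumes the commonly used form of the criterion for the constraint β - ξ = 0: primal residual β - ξ, dual residual of norm τ||ξ^{k} - ξ^{k-1}||, and tolerances built from an absolute and a relative part. The function name and the default tolerances are illustrative choices, not values reported by the authors.

```python
import math


def norm2(vector):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in vector))


def admm_should_stop(beta, xi, xi_prev, eta, tau, eps_abs=1e-4, eps_rel=1e-3):
    """Return True when both residuals fall below their tolerances."""
    p = len(beta)
    primal = norm2([b - x for b, x in zip(beta, xi)])
    dual = tau * norm2([x - xp for x, xp in zip(xi, xi_prev)])
    eps_pri = math.sqrt(p) * eps_abs + eps_rel * max(norm2(beta), norm2(xi))
    # tau * eta is the unscaled dual variable delta.
    eps_dual = math.sqrt(p) * eps_abs + eps_rel * tau * norm2(eta)
    return primal <= eps_pri and dual <= eps_dual
```

Such a check would typically be called once per iteration, after the dual update, with `xi_prev` holding the value of ξ from the previous iteration.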
SPCA with ADMM: In this section we derive an efficient Alternating Direction Method of Multipliers algorithm for an elastic net approach of sparse PCA estimators with a more general penalty term of the form
To check the importance of a variable, we estimate its coefficient solution of the generic problem:
(12)
or equivalently
(13)
where λ, μ are two nonnegative tuning parameters, Q is a positive semidefinite matrix, y^{**} = Σ^{1/2}α_{j}, X^{**} = Σ^{1/2} and Σ is the covariance matrix of X.
Equation (13) combines the strengths of regularized techniques of Lasso type and of a quadratic penalty designed to capture additional structure on the features. When = 1, it is straightforward to show that all types of lasso models (Lasso, Enet, Slasso, L1Cp and Wfusion) are particular cases of (13) using an augmented data reparameterization of the form
Therefore any efficient algorithm developed to find the whole solution path of the Lasso, such as least angle regression or the coordinate descent algorithm, can be applied. Unfortunately, the good properties of the two optimization techniques are overshadowed by the difficulties (i), (ii) and (iii). To deal with those problems, we propose to solve (13) using the ADMM algorithm. The idea is simple and straightforward. First, we propose to rewrite (13) in the following ADMM form:
subject to β - ξ = 0. (14)
If we write
is the soft thresholding function introduced and analyzed by [6]. The dual-update step is straightforward and consists of updating η^{k} by η^{k+1} := η^{k} + β^{k+1} - ξ^{k+1}. It is worth noting that since τ > 0, μ ≥ 0, and X^{t}X and Q are positive semidefinite matrices, (X^{t}X + μQ + τI_{p}) is always invertible. If p > n, let M = μQ + τI_{p}; to alleviate the cost of calculations, we can exploit the Woodbury formula for (X^{t}X + M)^{-1}. Algorithm 2 shows the complete details of the flexible elastic net with ADMM and Algorithm 3 summarizes the flexible PCA.
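The updates described above can be sketched as follows. The objective is an assumption made for this example: (1/2)||y - Xβ||^{2} + (μ/2)β^{T}Qβ + λ||ξ||_{1} subject to β - ξ = 0, which is one scaling consistent with the matrix X^{t}X + μQ + τI_{p} mentioned in the text; the authors' Algorithm 2 may use different constants or weights. Q is assumed symmetric, and the function names are hypothetical.

```python
import numpy as np


def woodbury_inverse(X, M_inv):
    """(X^T X + M)^{-1} from M^{-1}, inverting only an n x n matrix.

    Assumes M (hence M_inv) is symmetric. This pays off when p > n and
    M is cheap to invert, for instance when it is diagonal.
    """
    n = X.shape[0]
    XM = X @ M_inv                                   # n x p
    small = np.linalg.inv(np.eye(n) + XM @ X.T)      # n x n
    return M_inv - XM.T @ small @ XM


def flexible_enet_admm(X, y, Q, lam, mu, tau=1.0, iterations=500):
    """Scaled ADMM for 0.5*||y - X b||^2 + 0.5*mu*b^T Q b + lam*||b||_1."""
    n, p = X.shape
    M = mu * Q + tau * np.eye(p)
    if p > n:
        inverse = woodbury_inverse(X, np.linalg.inv(M))
    else:
        inverse = np.linalg.inv(X.T @ X + M)
    Xty = X.T @ y
    xi = np.zeros(p)
    eta = np.zeros(p)
    for _ in range(iterations):
        # beta-update: a linear system with the matrix X^T X + mu*Q + tau*I.
        beta = inverse @ (Xty + tau * (xi - eta))
        # xi-update: elementwise soft-thresholding at level lam / tau.
        shifted = beta + eta
        xi = np.sign(shifted) * np.maximum(np.abs(shifted) - lam / tau, 0.0)
        # Scaled dual update.
        eta = eta + beta - xi
    return xi
```

Because the matrix in the β-update does not change across iterations, its inverse (or a factorization) is computed once and reused. Returning ξ rather than β gives exact zeros for the coefficients removed by the l_{1} penalty.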
Tuning parameters selection: In practice, it is important to select appropriate tuning parameters in order to obtain good prediction precision and to control the amount of sparsity in the model. The tuning parameters can be chosen by minimizing an estimate of the out-of-sample prediction error. If a validation set is available, this error can be estimated directly. Lacking a validation set, one can use ten-fold cross-validation. In our experiments λ takes 100 logarithmically equally spaced values, μ ∈ {0, 0.1, 1, 10, 100} and γ ∈ {0.5, 1, 2.5, 5, 25}.
fPCA was applied to examine similarities and/or differences in the ^{1}H-NMR spectra. A flexible principal component is a weighted linear combination of the original NMR variables, so that the original data matrix is compressed into a smaller number of variables; the NMR data may be compressed into three to four fPCs in cases where the changes between groups or due to specific treatments are quite large. Figure 3 shows the projection of processed urine samples with uniform 0.007 ppm bin widths on the first and second fPC axes, and Figure 4 summarizes the loading scores. In this projection, fPCA shows two perpendicular clustered groups with an overlap between diabetic and nondiabetic samples.
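The search grid and the ten-fold split can be laid out as below. The μ and γ values come from the text; the range of the λ grid (10^{-3} to 10^{2}) is not stated there and is an assumption, as are the helper names and the interleaved assignment of samples to folds.

```python
import itertools


def log_grid(low, high, count=100):
    """count values from low to high, equally spaced on a log scale."""
    ratio = high / low
    return [low * ratio ** (i / (count - 1)) for i in range(count)]


def kfold_indices(n_samples, k=10):
    """Split range(n_samples) into k disjoint validation folds."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]


LAMBDA_GRID = log_grid(1e-3, 1e2)      # assumed range; 100 values as in the text
MU_GRID = [0, 0.1, 1, 10, 100]
GAMMA_GRID = [0.5, 1, 2.5, 5, 25]

# Every (lambda, mu, gamma) combination to be scored by cross-validation.
PARAMETER_GRID = list(itertools.product(LAMBDA_GRID, MU_GRID, GAMMA_GRID))

# Ten validation folds over the 348 samples; the rest of each split is training.
FOLDS = kfold_indices(348, k=10)
```

For each parameter triple, the model would be fitted on the nine training folds and scored on the held-out fold, and the triple with the smallest average out-of-sample error retained.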
Figure 3: The X axis represents projection on the first flexible principal component and the Y axis represents the orthogonal component. Clustering of the blue stars to the left of the zero line indicates the urine metabolomic diagnostic test is highly sensitive in determining the presence of diabetes disease. (Blue: Nondiabetic, Red: Diabetic)
Interestingly, after further analysis, we identified that the overlap corresponds to patients with controlled diabetes. Any supervised learning algorithm may lead to invalid results due to the large overlap of diabetic and nondiabetic samples and provide a one-cluster summary. Here, flexible principal component 1 (fPC1) captures the maximum variance across diabetic and nondiabetic samples, while flexible principal component 2 (fPC2) summarizes the maximum variance across samples within the diabetic or nondiabetic groups. Sixty metabolites were identified by ^{1}H-NMR spectrum analysis of the original data and the loading spectra of fPC1 and fPC2. Twenty-four metabolites from major energy sources such as carbohydrates, lipids, and proteins, identified by NMR spectrum analysis of the original data and the loading spectrum of fPC1, can be used as potential biomarkers in human urine for detection of diabetes. In total, 24 metabolites were detected at statistically different concentrations (Table 4).
Metabolites detected by fPC1  Metabolites detected by fPC2
2-Hydroxyisobutyrate  2-Hydroxyisobutyrate
3-Hydroxyisovalerate  -
Acetate  Acetate
Acetone  -
Betaine  Betaine
Creatine  -
Creatinine  -
Dimethylamine  Dimethylamine
Glucose  Glucose
Glycine  Glycine
Glycolate  Glycolate
Hypoxanthine  -
Isopropanol  -
Lactate  -
Maleate  Maleate
Mannitol  Mannitol
Methanol  Methanol
Methylamine  Methylamine
N,N-Dimethylglycine  N,N-Dimethylglycine
Succinate  -
Tartrate  Tartrate
Taurine  Taurine
β-Alanine  -
π-Methylhistidine  -
Table 4: Compounds detected in actual samples. The left column lists the compounds detected by fPC1 and the right column those detected by fPC2 (-: not detected). Compounds detected with fPC1 are potential biomarkers in human urine for diabetes. Compounds detected with fPC2 indicate most variations in the normal and diabetic samples. Compounds in red have been reported abnormal in human urine in HMDB.
Many compounds detected at higher levels for T2D were the end products of gluconeogenesis, including glucose and its polymer. Glycinebetaine (betaine) and glutamate, three of the major osmoprotectants used by S. Typhimurium, were found at higher concentrations. Other compounds more abundant in gestational diabetes mellitus were 3-hydroxyisovalerate and 2-hydroxyisobutyrate, probably due to altered biotin status and amino acid and/or gut metabolisms (the latter possibly related to higher BMI values). The major compounds detected at higher levels were the upper TCA cycle intermediate succinate and the transhydrogenase lactate/malate, which has dual metabolic functions, namely Delta(1)-piperideine-2-carboxylate/Delta(1)-pyrroline-2-carboxylate reductase, the first member of a novel subclass in a large family of NAD(P)-dependent oxidoreductases [7]. The compounds identified by analysis of the loading spectrum of fPC2 are the compounds that indicate most variations in the normal and diabetic blood samples (Table 6). The fPCA was able to cluster all the samples without diabetes. The samples with diabetes had a wide spread. An important observation in this case is the overlap of diabetic and nondiabetic samples. The overlap may be a result of controlled diabetes in some diabetic samples, which resulted in metabolite concentrations comparable to those of nondiabetic samples. The distribution of samples in the (fPC1, fPC2) space shows that diabetic and nondiabetic samples are distributed along fPC1, while variation among the samples within the diabetic and nondiabetic groups is along fPC2. The spectra for fPC1 and fPC2 are shown in Figure 4. The spectra were processed to compute the concentration of metabolites.
Blood metabolite  Actual sample  fPC1 (mM)  fPC2 (mM)
1,3-Dihydroxyacetone  X  0.4195  0.1324
1,3-Dimethylurate  X  1.3649  1.1966
1,6-Anhydro-β-D-glucose  X  -  -
1,7-Dimethylxanthine  X  0.1613  0.117
1-Methylnicotinamide  X  -  -
2-Hydroxyisobutyrate  X  0.0761  0.038
2-Oxobutyrate  X  0.0716  0
2-Oxoglutarate  X  0.0631  0
3-Aminoisobutyrate  X  -  -
3-Hydroxyisovalerate  X  0.0221  0
3-Hydroxymandelate  X  -  -
3-Indoxylsulfate  X  -  -
4-Hydroxyphenylacetate  X  -  -
4-Pyridoxate  X  0.2208  0
5,6-Dihydrouracil  X  0.6077  0
Acetate  X  0.0601  0.0072
Acetoin  X  0.0215  0
Acetone  X  0.0518  0
Alanine  X  -  -
Adenine  X  0.1256  0
Arabinitol  X  3.6563  0.1111
Asparagine  X  -  -
Benzoate  X  -  -
Betaine  X  10.4309  9.8544
Butanone  X  0.0261  0
Caffeine  X  0.1369  0.0722
Carnitine  X  -  -
Choline  X  -  -
Citrate  X  -  -
Creatine  X  1.0648  0
Creatine phosphate  X  1.3224  0
Creatinine  X  0.633  0
Cytosine  X  0.0336  0
Dimethylamine  X  0.2206  0.4833
Dimethyl sulfone  X  0.2033  0.1388
Ethanol  X  -  -
Ethanolamine  X  -  -
Ethylene glycol  X  2.2361  -
Formate  X  -  -
Fumarate  X  0.1383  -
Galactose  X  -  -
Glucose  X  156.7578  146.6056
Glucuronate  X  6.3225  0
Glycine  X  15.3768  15.733
Glycolate  X  58.6403  52.3104
π-Methylhistidine  X  0.1438  0
Table 5: Compounds detected in actual samples and loadings of fPC1 and fPC2 (-: no entry).
Blood metabolite  Actual sample  fPC1 (mM)  fPC2 (mM)
Guanidoacetate  -  19.2443  16.7526
Hippurate  X  -  -
Histidine  X  -  -
Histamine  -  0.1448  0
Hypoxanthine  X  0.1254  0
Isopropanol  X  0.017  0
Lactate  X  0.2972  0
Lysine  X  -  -
Maleate  X  0.2852  0.1811
Mannitol  X  18.0916  0.9322
Methanol  X  1.1467  0.9029
Methylamine  X  0.0838  0.0072
Methylguanidine  -  0.1761  0.1834
N,N-Dimethylglycine  X  0.028  0.0368
N-Methylhydantoin  -  0.2124  0
N-Nitrosodimethylamine  -  0.2762  0
O-Acetylcarnitine  X  -  -
O-Acetylcholine  -  0.1567  0
O-Phosphocholine  X  -  -
Propionate  -  0.0089  0
Propylene glycol  X  -  -
Pyroglutamate  X  -  -
Salicylate  X  -  -
Sarcosine  -  0.2763  0.3998
Serine  -  0.0698  0
Succinate  X  0.0456  0
Sucrose  X  -  -
Tartrate  X  0.308  0.3615
Taurine  X  41.1061  48.1989
Threonine  X  -  -
Thymine  -  0.0636  0
Trigonelline  X  -  -
Trimethylamine N-oxide  X  -  -
Tyrosine  X  -  -
trans-Aconitate  -  0.1153  0.0789
Trimethylamine  -  0.0404  0.0392
Trimethylamine N-oxide  -  6.421  6.0661
Uracil  X  -  -
Urea  X  -  -
Uridine  X  -  -
Valine  X  -  -
Xylose  X  -  -
cis-Aconitate  X  -  -
trans-Aconitate  X  -  -
β-Alanine  X  0.4513  0
τ-Methylhistidine  -  -  -
Table 6: Compounds detected in actual samples and loadings of fPC1 and fPC2 (-: no entry).
Since the reference compound was present in all the metabolites, it was not included in fPC1 and fPC2. Therefore, the concentration of compounds in fPC1 and fPC2 could not be computed. However, the maximum concentration of each compound was calculated and is shown in Table 5. Moreover, 9 people out of 178 had potential Paraquat poisoning based on their abnormal concentrations of citrate, glutinane and alanine, and 24 out of 178 had Salicylate (Aspirin) detected in their blood and urine. Regarding the Aspirin finding, we note that people with diabetes are often encouraged to take Aspirin as it may reduce the risk of heart attack due to coronary obstruction, which is a risk many diabetics may develop (Figure 5).
Figure 5: Metabolite biomarkers for Diabetes incident and metabolic classes. Red metabolites are the perturbed metabolites that were found associated with incident diabetes and their link to organspecific processes and pathways, courtesy of [10]
In this study, we presented an integrative analysis that revealed metabolites correlated with diabetes in a subset of the Qatari population. Furthermore, we identified specific metabolites affected by medication, which constrains differentiation between diabetic and control patients. Despite significant advances, no single profiling method currently allows simultaneous analysis of all of the metabolites in the metabolome. The main achievement of our study is an integrative statistical method for mining raw ^{1}H NMR data. Challenges arise in handling big data where the number of peaks is larger than the number of samples, which limits the use of traditional statistical methods. Our next work will continue the development of computational methods for the analysis of complex ^{1}H NMR datasets and their integration with equally complex genomic, transcriptomic, and proteomic profiles, as well as metabolome-integrated network analysis.