A Penalized Regression Approach for Integrative Analysis in Genome-Wide Association Studies

Over one thousand genome-wide association studies (GWAS) have been conducted in the past decade. Increasing biological evidence suggests the polygenic genetic architecture of complex traits: a complex trait is affected by many risk variants with small or moderate effects jointly. Meanwhile, recent progress in GWAS suggests that complex human traits may share common genetic bases, which is known as “pleiotropy”. To further improve statistical power of detecting risk genetic variants in GWAS, we propose a penalized regression method to analyze the GWAS dataset of primary interest by incorporating information from other related GWAS. The proposed method does not require the individual-level of genotype and phenotype data from other related GWAS, making it useful when only summary statistics are available. method. Simulation studies showed that the proposed approach had satisfactory performance. We applied the proposed method to analyze a body mass index (BMI) GWAS dataset from a European American (EA) population and achieved improvement over single GWAS analysis. Journal of Biometrics & Biostatistics J o u rn al of Bio metrics & Bistatis t i c s


Introduction
Genome-wide association studies (GWAS) provide an unprecedented opportunity for identifying disease-associated genetic variants. Although disease associated SNPs at genome-wide -8 is referred to as "missing heritability". Rather than only using genomehuman height can be explained by using all of the genotyped common missing but remains hidden in the genome: due to the limited sample remain undiscovered. So far, people have found similar genetic architectures for many other complex diseases [4], such as psychiatric GWAS with larger sample sizes, in which more associated common 34,840 patients and 114,981 healthy people are analyzed to understand the genetic architecture of type 2 diabetes [8]). However, large sample recruitment may be expensive and time-consuming.
For single GWAS analysis, many existing statistical methods have been proposed [9,10]. Among them, penalized regression methods [11][12][13][14] have drawn particular attention in GWAS. However, due to limited sample size of a single GWAS and polygenicity of a complex trait, many existing methods do not have enough power to uncover the remaining risk genetic variants. Recently, increasing evidence suggests that complex traits may share common genetic bases, which is known as "pleiotropy" [15][16][17]. A systematic investigation of pleiotropy [18] suggests that 16.9% of genes and 4.6% of SNPs have been reported to statistical power in GWAS data analysis by integrating multiple from two aspects. First, a direct pool of samples from multiple GWAS existing methods (e.g. [19]) require the availability of all genotype data privacy restrictions on sharing individual-level data.
In this work, we aim at improving statistical power of identifying associated markers for the given GWAS data by integrating information from other GWAS, where only the summary statistics rather than the geneotype data of some GWAS are needed. We propose a penalized proposed approach is that genetically related traits can share common genetic bases [18,20], which enables us to borrow information from some related GWAS when analyzing the trait of primary interest.

Algorithm
Now we present our algorithm for parameter estimation in the above model. Noticing that objective function (1) is jointly convex in (B, σ 1 , . . . , σ k ), it is very convenient for us to use an alternating strategy k , k=1, . . . , q, we optimize (1) with k Fixingˆ( 1, , ) k k k q , the optimization problem becomes Since 1 = p j j |B gradient method [22].
gradient algorithm solves optimization problem (2) iteratively using the proximal operator of g(B): where the superscript m indicates the m th iteration, τ is the Lipschitz constant of f (B) and Is the gradient of f (B) evaluated at B (m-1) . Note that is q × p matrix which does not involve the optimization variable B. Let its ( ) m j G denote the j-th column of ( 1) ( 1) into p separate optimization problems and be solved analytically: To further accelerate the convergence of the above proximal gradient algorithm, we use the accelerated proximal gradient algorithm (APG) [22], where two points {B (m-1) , B (m-2) of the APG algorithm is given in Algorithm 1.

X X n
for m ≥ 1 do GWAS together and use a group-Lasso penalty to integrate information study and real data analysis, we showed that the proposed method had advantage over single-GWAS analysis.

Model
Suppose we have q GWAS, in which we have complete data for GWAS 1, including its genotype data and phenotype data, and only Consider a linear model between the phenotype y 1 and genotype X 1 of GWAS 1, is the error term 2 1 denotes the noise variance. For the rest of GWAS with only summary statistics, we assume that  p k z is an estimate of the true z k =b k +e k , k denotes the noise variance for the k-th GWAS, k=2, . . . , q. Here we use a simple example to illustrate the key idea. Suppose the 3 can improve the statistical power of identifying risk variants as the same loci consistently produce this toy example, we have 1 as the vector of the true for the second group, and B j for the j-th group, j=1, .. . , p. For convenience, we denote 1 2 , , . . . , is the j-th column.
To integrate information from multiple GWAS, we propose the following optimization problem Where γ is the regularization parameter controlling the sparsity of closely related to the scaled Lasso problem [21]. Here we emphasize on integration of information from multiple GWAS and use the group penalty to achieve this goal.
( 1) b be the solution of Algorithm 1. Fixing B at  B , we can take the derivative of (1) with respect to σ k , k=1, . . . , q and set them to zeros, yielding the following updating equations: Based on the above alternative optimization strategy, we now summarize the overall working algorithm in Algorithm 2.  γ γ max max  chosen according to the criteria that the minimum prediction error in primary GWAS is selected.

Simulation Study
We conducted a simulation study to evaluate the performance of the proposed method. For comparison, we also considered scaled Lasso on one GWAS with genotype data. We simulated two sets of genotype data, one for GWAS with genotype and one for summary statistics of GWAS. In the simulation study, we set n 1 =500 with n 2 =500 or n 2 =2000 for the sample sizes of two GWAS, while we set the number of SNPs to be p=5000 or p=10000. We considered the auto-regressive correlation ρ |j-k| . For block AR, we set block size to be 20 equally distributed a block is set to ρ |j-k| and 0 otherwise. We considered three scenarios with ρ=0.2, 0.5 and 0.8, corresponding to weak, moderate, and strong correlations, respectively. SNPs in the simulation study were generated with a two-stage procedure [11]. First, we drew the predictor vector xi 0, 1, or 2 according to whether x ij <-c,-c ≤ x ij ≤ c, or x ij under normal distribution in the way that signal-to-noise ratio was we normalized the genotype such that Finally, we generated the quantitative trait using the linear model, where e k was the error term under normal distribution with mean zero. In both correlation structures, there were ten trait-associated markers and in block AR, each of the ten blocks contained one trait-associated marker. For the second dataset, we applied the single-marker analysis and obtained the summary statistics for the integration, in which we only used this partial information without knowledge of genotype data.
combinations of correlation structures (AR and block AR), the sample size of the second GWAS n 2 and the total number of SNPs p for comparing the proposed method with the scaled Lasso. We used area under the curve (AUC) to show the selection performance. We 2 ) of observed values in Figure 1. As indicated by AUC and r 2 , the proposed method has selection and prediction performance of the proposed method can further improve as the sample size of the second GWAS n 2 increase from 500 to 2000, which indicates the proposed pIGWAS method is

Real Data Analysis
We applied our pIGWAS method to the quantitative trait-body mass index (BMI). We primarily used European American (EA) samples from two GWAS-Study of Addiction: Genetics and Environment (SAGE) and the Collaborative Study on the Genetics of Alco-holism web-site of Genetic Investigation of Anthropometric Traits (GIANT) consortium (http://www.broadinstitute.org/collaboration/giant/index. were 656,848 SNPs with minor allele frequency (MAF) ≥ 0.01 and pvalue ≥ 0.001 for Hardy-Weinberg equilibrium test in both GWAS data. For summary statistics from GIANT, we used SNPs with no missing values in MAF. Overall, there were 619,651 SNPs used in all chromosomes satisfying the pruning criteria in genotype data and existing plots for log 10 p-value and β using conventional marginal analysis for GWAS data are given in Figure 2. Obviously, there is not a rich set of upper panel of Figure 2) compared to the meta-analysis conducted in [25] (the lower panel of Figure 2).
First, we performed pIGWAS using combined GWAS data of SAGE the EA samples containing genotype data with corresponding summary conducted chromosome by chromosome. We also evaluated the relative stability of the selected SNPs using random sampling [26].
times. For each SNP, we were able to calculate the proportion of times   that the SNP was associated with the trait out of 100 samplings, i.e., the observed occurrence index (OOI). For comparison, we conducted single data analysis of GWAS with EA samples using scaled Lasso, and evaluated its relative stability of the selected markers using the OOI. The associated markers identified by integrative and single data analysis were listed in Table 1. The average OOI of SNPs selected by pIGWAS is 0.649 while that of SNPs selected by scaled Lasso is 0.648, suggesting a limited improvement of pIGWAS over scaled Lasso on the COGA-SAGE data set. There may be two reasons for this. First, the GWAS signals of BMI from the COGA-SAGE may be too weak to be distinguished from noise ( Figure 1). Second, the pleiotropic effects between BMI and height may not be strong enough to boost the power of pIGWAS. It is expected that pIGWAS could achieve a better performance in presence of well-powered GWAS signals and pleiotropy information.

Conclusion
GWAS suffer from low statistical power due to the individual weak effects of genetic variants. In this study, we proposed a statistical approach to jointly analyzing primary GWAS data with summary statistics together from other source. The key idea of the proposed approach lies on the existence of pleiotropic effects of genetic variants, which allows us to borrow information from genetically related GWAS. Specifically, we proposed a novel penalized regression that combines multiple GWAS together. The computationally efficient algorithm is derived for optimizing the model parameters. Based on both simulation study and real data analysis, we demonstrated the advantages of the proposed method over single-GWAS analysis.