Momiao Xiong^{1*}, Dan Xie^{1,2}, Pengfei Hu^{3} and Zheng Hou^{3}  
^{1}Human Genetics Center, Division of Biostatistics, The University of Texas Health Science Center at Houston, P.O. Box 20186, Houston, Texas 77225, USA  
^{2}College of Information Engineering, Hubei University of Chinese Medicine, Wuhan, Hubei, 430065, China  
^{3}State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, 200433, China  
Corresponding Author :  Dr. Momiao Xiong Human Genetics Center The University of Texas Health Science Center at Houston P.O. Box 20186, Houston, Texas 77225, USA Tel: 7135009894 Fax: 7135000900 Email: Momiao.X[email protected] 

Received April 01, 2013; Accepted April 25, 2013; Published April 29, 2013  
Citation: Xiong M, Xie D, Hu P, Hou Z (2013) Studies of Natural Selection in the Era of Nextgeneration Sequencing. J Phylogen Evolution Biol 1:104. doi:10.4172/232990021104  
Copyright: © 2013 Xiong M, et al. This is an openaccess article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.  
Related article at Pubmed Scholar Google 
Visit for more related articles at Journal of Phylogenetics & Evolutionary Biology
Introduction  
A deeper understanding of positive selection is of fundamental importance with far reaching consequences for basic and biomedical science. Evolutionary pressures acting on the genome as a whole and the specific role of environmental pressures on divergence shape the natural selection. In the past population genetics has mainly evolved from a theorydriven field with limited empirical data [1]. However, despite the intense interest in genomewide scans, a coherent study of recent human evolutionary history has yet to emerge due to the limitations of data, models, and analytical tools. Practical clear evidences that show the importance and nature of selection in human evolution are still limited [2]. Nextgeneration sequencing technologies detect ten millions of the genomic variants including both common and rare variants [35]. NextGeneration Sequencing (NGS) will broaden the range of questions open to empirical investigation of population genetics and offer unprecedented opportunities for population genetic and natural selection studies. The signature of the natural selection on human genome can be potentially identified by nextgeneration sequencing.  
Despite it can provide a powerful tool for discovering clear picture of natural selection on human genome, NGS also raises great challenges for our data analysis. The next generation sequencing technologies suffer from three limitations: sequence errors, assembly errors, and missing data. Although adaptations are often influenced by multiple interacted genes that are organized into pathways and networks, there have been no systematic, global, and rigorous analyses into the forces governing the evolution of biological pathways. There is increasing evidence that polygenic adaptation, where selection acts simultaneously on many loci, may be an important mechanism of evolutionary change in human genomes. Using the current paradigm of single locus approaches we are unable to detect a large number of natural selection signals.  
Gene expression analyses are important sources to study function of genetic variation and are increasingly acquiring an important role in unraveling mechanism of complex traits and identifying the evolutionary history of human populations. The rapidly advances in NGS technologies have been becoming the platform of choice for gene expression profiling. RNAseq for expression profiling offers comprehensive picture of natural selection on geneexpression and is superior to microarray platforms. RNAseq has made a number of significant qualitative and quantitative improvements on gene expression analysis and provides multiple layers of resolutions and transcriptome complexity: the expression at exon, SNP, and positional level; splicing; posttranscriptional RNA editing across the entire gene; isoform and allelespecific expression [6].  
The current paradigm for analysis of natural selection on gene expression that are mainly designed for microarray expression data is to use a single value of summarizing statistic to represent gene expression level and overlook all information on expression difference in exons, genomic position and alleles. Therefore, although RNAseq dramatically increases the level of biological details [7], we still use the traditional statistical methods for identifying natural selection on geneexpression and do not efficiently use all of the information contained in RNAseq data.  
To address these critical limitations and challenges arising from NGS data, the goals of this paper are to review the traditional methods for natural selection analysis and develop a unified statistical framework for the detection of adaptive evolution acting on biological pathways and application of these methods to largeresequencing data sets in humans that are free of ascertainment bias. To explore observed expression variation in exons or in genomic position across the genes, we extend one dimensional diffusion process to multidimensional diffusion process and a single variate stochastic differential equation to multivariate stochastic differential equations. We then use the extended multidimensional diffusion processes to model the evolution of geneexpression acted by natural selection. We hope that the present new development of natural selection analysis for NGS data will open a new avenue for natural selection analysis with NGS data.  
Genomewide Natural Selection Studies with NGS Data  
Detecting the regions in human genome under natural selection from whole genome sequencing data  
Identifying signatures of natural selection in the human genome is of fundamental implication for the study of population evolution and for the biomedical research. The distribution of selection in genome will provide important functional information. Natural selection modifies the level of variability within and between populations and shapes the pattern of genetic variations in the genome. And genetic variation in genome is the raw data for detection of natural selection. Reports on natural selection genes have been lasted for years, but very few are of whole genome from whole genome sequencing data. The 1,000 Genomes Project produces whole genome sequencing data and offers a unique and great opportunity to scan the genome for signature of natural selection. We applied the Tajima’s statistic test which compare the difference between the average number of nucleotide differences and the number of segregating sites [8], Fu’s test that is based on the infinite sites of models [9], Fay and Wu’s H test measuring departures from neutrality that are reflected in the difference between highfrequency and intermediate frequency [10]. Zeng et al.’s E test [11] comparing estimates of population nucleotide diversity parameter from the highfrequency and lowfrequency variants, Achaz’s Y test that can be used to test neutrality despite sequencing errors [12] and F_{st} statistic that measures genetic differentiation [13] to low coverage data from 1,000 genomes project [14] to scan the entire genomes for detection of selection. The data consisted of three populations: 60 individuals in CEU, 60 in YRI, and ASI (30 in CHB and 30 JPT). The size of slidingwindow being tested is 10 kb and the empirical significance level is 5%. We identified a number of candidate selection regions at 5% empirical significance level within 10kbsliding window. Figure 1 showed the distribution of candidate selection genes. By comparing with the empirical genomewide distribution of F_{st}, we also identified 6,507 candidate selection regions at an empirical significance level of 2.5% across the genomes. Among them, 106 candidate selection regions were overlapped with 581 identified selection genes by F_{st} which were reported in the literatures. Table 1 presented some overlapped candidate selection genes.  
Parallel natural selection on pathways  
It is well known that gene function in a concert, rather than in isolation. Complex phenotype variations are caused by dynamic interaction among many genes and many environmental exposures through regulation and metabolism. Adaptations are often influenced by multiple interacted genes that are organized into pathways [1517]. However, in the conventional population genetics, most researches have primarily focused on natural selection acting on a single locus [18]. Little attention has been paid to determining how the natural selection acts on multiple interacted genes in response to environmental perturbation. To reveal how multiple genes that are often organized into pathways or interacted networks are evolved to adapt dramatic changes in environment, lifestyle and culture [19], we conducted Fisher exact test for ontology and pathway enrichment analysis of the candidate selection genes. The dataset is a low coverage data from 1,000 genomes project as described in the previous section. The total number of SNPs and average number of SNPs per kb in CEU, YRI and ASI (CHB and JPT) are 9,262,087, 11,815,086, 7,170,861 and 3.32, 4.23 and 2.57, respectively. The Tajima’s test statistic which compares the difference between the average number of nucleotide differences and the number of segregating sites [8] was applied to test for natural selection on a gene. The genome was divided into nonoverlapping windows of 10 kb. The genome empirical Pvalue of 0.05 was used as threshold for declaring significant evidence of selection. If a gene included two selection candidate genomic regions, each with 10 kb, then the gene is taken as the selection candidate gene. The genomewide scan for selection signals identified 2,710, 2,675, 2,518 and 2,609 candidate selection genes in YRI, CEU, CHB and JPT, respectively. Our approach captured some genes that were previously reported to be under natural selection in climate, physical environment, diet and immune response (Table 2). To investigate the functions of genes under selection, we used Fisher’s exact test to identify the pathways which are enriched with the candidate selection genes, we assembled 501 pathways from KEGG and Biocarta (http://www.biocarta.com). The Pvalue for declaring enrichment of candidate selection genes after Bonferroni correction for multiple tests is 0.0001. Table 3 summarized pathways that are enriched with candidate selection genes including pathway involved in the development of neural system, metabolism of carbohydrates and phosphates, cell communication and signal transduction. We found that all six pathways enriched by high F_{ST} genes in the recent paper by Amato et al. [20] were included in table 3. A remarkable feature in table 3 is that pathways show enrichment of selective signals in at least three populations. This demonstrated that many genes were subject to parallel selection in human populations. To examine this in more detail, we next examined patterns of selection across populations for each pathway. As a representative example, figure 2 shows the axon guidance pathway, which exhibits one of the strongest enrichments for candidate selection genes as assessed by both Tajima’s D and F_{ST} [20]. Several features emerge from figure 2. First, although axon guidance pathway is under selection in four populations, the candidate selection genes in the pathway may be different in four populations. Interestingly, although the axon guidance pathway possesses strong evidence of selection in each population, the specific targets of selection are not identical (approximately 35% of candidate selection genes are significant in a single population). Thus, the evolutionary history of the axon guidance pathway in humans’ exhibits evidence for both shared signatures of selection and parallel adaptation. Second, many candidate selection genes shared by four populations are in the membrane and initiators of axon guidance pathway. For example, the gene DCC is a transmembrane protein and mediates axon guidance of neuronal growth cones. The gene EPHA3 belongs to the ephrin receptor subfamily and mediates developmental events, particularly in the nervous system. The genes ROBO1 encodes an integral membrane protein that functions in axon guidance and neuronal precursor cell migration.  
Natural Selection on Gene Expression  
Single locus test of natural selection on gene expression  
Evolutionary changes in geneexpression regulation may play a more important role in shaping phenotypic evolution than evolutionary changes in sequence [21]. Advances in genomewide gene expression profiling techniques open a new era in investigation of natural selection on gene expression. We widely observe heritable gene expression variation in yeast, mice and humans [21]. Although general forces acting on geneexpression evolution have remained unclear, natural selection and random genetic drift are among the major evolutionary forces to shape geneexpression variation.  
A key issue for investigation of natural selection on geneexpression is to develop a neutral model for gene expression. The first neutral model is a “clocklike” model proposed by Khaitovich et al. [22]. The model assumes that gene expression variance increases with divergence time. If we consider geneexpression as a quantitative trait, the models of phenotype trait evolution such as Brownian motion can be used to study natural selection on geneexpression evolution [2326]. Similar to the symmetric random walk which in each time unit is equally to take a unit step either to the left or to the right, genetic mutations that are fixed in an evolving population cause overexpression or under expression with equal probability. Variation in geneexpression caused by such mutation can hence be modeled by Brownian motion. However, the effects of mutations on expression of some genes may be slightly more complex than Brownian motion. Mutation perturbations are more likely to shift the geneexpression toward some optimal values. Therefore, their effects on the geneexpression variation can be modeled by the OrnsteinUhlenback (OU) stochastic process [24].  
We assume that geneexpression in evolution is a diffusion process [27]. Let X(t) be geneexpression level at time t . Then, the variable X(t) should satisfy the following two conditions:  
and  
,  
Where μ(x,t) is referred to as the drift parameter and σ^{2}(x,t) the diffusion parameter.  
For the neutral model, change in geneexpression is similar to a symmetric random walk. Thus, the process X(t) follows a Brownian motion with μ(x,t) = 0 and a constant variance σ^{2}(x,t) = σ^{2}.  
and for t > 0 . Then, the density function of Brownian motion X (t) should satisfy its associated Kolmogorov backward equation:  
(1)  
Its solution is a Gaussian kernel:  
(2)  
To consider the impact of natural selection on the geneexpression, we assume that μ(x,t) = −λx and σ^{2}(x,t) = σ^{2}. Its associated Kolmogorov backward equation is  
(3)  
The solution of above equation is  
(4)  
The drift parameter λ in the OU process represents a restoring force directed towards the origin. Equation (4) shows that the mean of the geneexpression X (t) will finally return to the origin after perturbation of selection disappears. We also see from equation (4) that the variance of X (t) does not increase in proportion to time. In beginning of selection, the variance is a linear function of the time σ^{2}t . When the time is getting large, the variance will saturates at a stable equilibrium  
The change in geneexpression is caused by two principal forces. First, a nonrandom change in geneexpression is induced by external forces such as natural selection. Second, random forces such as interaction among genes, random sampling etc, will also change geneexpression levels and can be modeled by Brownian motion. Thus, for a small duration from time t to t + Δt , the change in geneexpression can be approximated by  
(5)  
where the drift parameter μ(x,t) is a measure of the instantaneous change rate of expression at time t and expression level x while ΔB(t) = B(t + Δt) − B(t) is the incremental change associated with the standard Brownian motion process B(t) and σ^{2}(x,t) > 0measures the instantaneous variance associated with the evolution process of geneexpression X (t). The first term in equation (5) measures the changes caused by the deterministic evolutionary forces while the second term represents the random component of the evolution process of geneexpression X (t).  
The limit relation attendant to equation (5) is given by [27]  
(6)  
or written in the differential notation  
(7)  
Let . The process W(t) is commonly referred to as the white noise process. The white noise process is a continuous Gaussian process with uncorrelated points. The geneexpression evolution process X (t) satisfies the stochastic differential equation:  
(8)  
with initial value X (0) = x .  
Now we solve stochastic differential equation (8). Equation (8) can be rewritten as  
Integrating both sides yields  
(9)  
By integrating by parts the righthand side, we obtain  
(10)  
The process X (t) is a linear combination of Brownian motion B(τ) . This implies that X (t) is a Gaussian process. It follows from equation (10) that  
(11)  
The variance of X (t) can be obtained from equation (9)  
which implies that  
(12)  
Combining equations (11) and (12) we infer that X (t) is a Gaussian process with density function , which is consistent with equation (4).  
Multilocus joint tests of natural selection with RNAseq data  
Next we consider RNAseq data. We attempt to extend onedimensional Brownian motion to multidimensional Brownian motion and a single variate stochastic differential equation to a system of stochastic differential equations to model natural selection on geneexpression.  
A standard ndimensional Brownian motion is defined as a vector of n independent Brownian motions:  
(13)  
In other words, each Z_{j}(t) is a onedimensional Brownian motion and different components are independent. Now we define a dependent Brownian motion. Let  
be a positive symmetric matrix and can be decomposed as  
ρ = HH^{T}.  
We define a new vectorvalued process W(t) = [W_{1}(t),...,W_{n}(t)]^{T} as  
(14)  
which is referred to as a correlated Brownian motion. This new process has the following properties:  
(1) W(0) = 0 ,  
(2) If s ≤ t , then W(t) −W(s) is multivariate normal with mean 0 and variancecovariance matrix (t − s)ρ ,  
(3) If 0 ≤ r ≤ s < t , then the random variables W(t) −W(s) and W(r) are independent, and  
(4) The paths t →W(t) are continuous with probability 1.  
Now we introduce the concept of systems of stochastic differential equations. Let A(X (t) and H(X (t), t) be vector functions of the process X (t) . A system of stochastic differential equations is given by  
(15)  
Let X (t) i be the number of reads at the i th genomic position at time t . Let A(X (t) and H(X (t) be, respectively, given by  
A(X(t)) = AX(t) and H(X(t)) = Λ ,  
where  
Then, equation (15) is reduced to  
(16)  
The process described by equation (16) is similar to the strongly coupled OrnsteinUehlenbeck stochastic process. To solve equation (16) we begin with  
Integrating from 0to t on both sides of the above equation, we obtain  
or  
(17)  
Similar to the arguments for scalar OU process, X (t) is a multidimensional Gaussian process. Its mean and covariance matrix are, respectively, derived by  
and  
which implies  
(18)  
The variance of X (t) is therefore given by  
(19)  
Discussion  
In this paper we have addressed important issues for analysis of natural selection with NGS data. We investigated how adaptive evolution has shaped patterns of natural selection on whole genome. We extended traditional single locus tests of selection to multilocus joint tests of natural selection which represents a significant conceptual advance that departs from the current paradigm of single locus neutrality tests and will result in more biologically interpretable signatures of selection and facilitate the detection of adaptive evolution acting on biological pathways.  
To highlight the insights gleaned from analyzing selection acting on pathways, as well as current methodological limitations, we performed a genomewide scan for selection in the low coverage data of the 1000 Genomes Project to identify pathways that are enriched for candidate selection genes. We identified all six pathways enriched for high F_{ST} genes described in a recent analysis by Amato et al. [20] along with several novel pathways such as small cell lung cancer. We demonstrated enrichment of candidate selection genes in at least three populations, suggesting either shared signatures of selection, parallel adaptation, or a combination of both. Our preliminary results showed that integrating information across loci will increase the power to detect more subtle selection that would otherwise be difficult to identify through single locus approaches.  
Allelespecific alternative splicing, polyadenylation, allelespecific transcription start and end sites, and differential promoter usage generate a large variability at the transcriptional level. RNAseq technologies are able to measure mRNA variation across the genes. They provide substantially detailed biological insight than microarray platform. RNAseq is now opening unprecedented avenues to address the analysis of natural selection on entire transcriptomes. However, none of statistical methods for detecting signal of natural selection on geneexpression with RNAseq data are available. RNAseq poses great challenges to use its remarkable features in analysis of natural selection. To address these challenges, we extended one dimensional diffusion process to multidimensional diffusion process and one dimensional OrnsteinUehlenbeck process to coupled multidimensional Ornstein Uehlenbeck process for modeling natural selection acting on expression at each genomic position. To derive distribution of evolution process of geneexpression influenced by natural selection, we extended a single variate stochastic differential equation to multivariate stochastic differential equation. Their solutions are density function of multidimensional Gaussian process which is able to fully explore substantial variation of mRNA expression and capture the detailed evolutionary picture of geneexpression at genomic position level.  
RNAseq can identify different mRNA variants and measure expressions at exon, SNP, positional levels. The proposed multivariate stochastic differential equations for detection of natural selection on geneexpression take the various transcript variants into account and hence can consider complex patterns of gene expressions. We expect that the developed statistical methods for identifying natural selection on geneexpression with RNAseq data substantially outperforms the current statistical methods for identifying signal of natural selection with microarray expression data or RNAseq data based on overall gene expression levels.  
Analysis of NGS data opens up unprecedented avenue to address the essential issues on detection of natural selection on whole genome and entire transcriptomes. The results in this paper are preliminary. The purpose of this paper is to stimulate further discussions regarding great challenges we are facing in developing statistical methods and computational algorithms for analyzing natural selection with large and formidably complex genomic sequencing and RNAseq datasets to optimally utilize biological information hidden in the genomic sequencing data and RNAseq data.  
Acknowledgement  
The project described was supported by Grant 1R01AR057120–01, 1R01HL10603401 from National Institutes of Health and NHLBI.  
References  


Table 1  Table 2  Table 3 
Figure 1  Figure 2  Figure 3 