alexa Studies of Natural Selection in the Era of Next-generation Sequencing | OMICS International
ISSN: 2329-9002
Journal of Phylogenetics & Evolutionary Biology
Make the best use of Scientific Research and information from our 700+ peer reviewed, Open Access Journals that operates with the help of 50,000+ Editorial Board Members and esteemed reviewers and 1000+ Scientific associations in Medical, Clinical, Pharmaceutical, Engineering, Technology and Management Fields.
Meet Inspiring Speakers and Experts at our 3000+ Global Conferenceseries Events with over 600+ Conferences, 1200+ Symposiums and 1200+ Workshops on
Medical, Pharma, Engineering, Science, Technology and Business

Studies of Natural Selection in the Era of Next-generation Sequencing

Momiao Xiong1*, Dan Xie1,2, Pengfei Hu3 and Zheng Hou3
1Human Genetics Center, Division of Biostatistics, The University of Texas Health Science Center at Houston, P.O. Box 20186, Houston, Texas 77225, USA
2College of Information Engineering, Hubei University of Chinese Medicine, Wuhan, Hubei, 430065, China
3State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, 200433, China
Corresponding Author : Dr. Momiao Xiong
Human Genetics Center
The University of Texas Health Science Center at Houston
P.O. Box 20186, Houston, Texas 77225, USA
Tel: 713-500-9894
Fax: 713-500-0900
E-mail: [email protected]
Received April 01, 2013; Accepted April 25, 2013; Published April 29, 2013
Citation: Xiong M, Xie D, Hu P, Hou Z (2013) Studies of Natural Selection in the Era of Next-generation Sequencing. J Phylogen Evolution Biol 1:104. doi:10.4172/2329-9002-1-104
Copyright: © 2013 Xiong M, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Related article at
DownloadPubmed DownloadScholar Google

Visit for more related articles at Journal of Phylogenetics & Evolutionary Biology

Introduction
A deeper understanding of positive selection is of fundamental importance with far reaching consequences for basic and biomedical science. Evolutionary pressures acting on the genome as a whole and the specific role of environmental pressures on divergence shape the natural selection. In the past population genetics has mainly evolved from a theory-driven field with limited empirical data [1]. However, despite the intense interest in genome-wide scans, a coherent study of recent human evolutionary history has yet to emerge due to the limitations of data, models, and analytical tools. Practical clear evidences that show the importance and nature of selection in human evolution are still limited [2]. Next-generation sequencing technologies detect ten millions of the genomic variants including both common and rare variants [3-5]. Next-Generation Sequencing (NGS) will broaden the range of questions open to empirical investigation of population genetics and offer unprecedented opportunities for population genetic and natural selection studies. The signature of the natural selection on human genome can be potentially identified by next-generation sequencing.
Despite it can provide a powerful tool for discovering clear picture of natural selection on human genome, NGS also raises great challenges for our data analysis. The next generation sequencing technologies suffer from three limitations: sequence errors, assembly errors, and missing data. Although adaptations are often influenced by multiple interacted genes that are organized into pathways and networks, there have been no systematic, global, and rigorous analyses into the forces governing the evolution of biological pathways. There is increasing evidence that polygenic adaptation, where selection acts simultaneously on many loci, may be an important mechanism of evolutionary change in human genomes. Using the current paradigm of single locus approaches we are unable to detect a large number of natural selection signals.
Gene expression analyses are important sources to study function of genetic variation and are increasingly acquiring an important role in unraveling mechanism of complex traits and identifying the evolutionary history of human populations. The rapidly advances in NGS technologies have been becoming the platform of choice for gene expression profiling. RNA-seq for expression profiling offers comprehensive picture of natural selection on gene-expression and is superior to microarray platforms. RNA-seq has made a number of significant qualitative and quantitative improvements on gene expression analysis and provides multiple layers of resolutions and transcriptome complexity: the expression at exon, SNP, and positional level; splicing; post-transcriptional RNA editing across the entire gene; isoform and allele-specific expression [6].
The current paradigm for analysis of natural selection on gene expression that are mainly designed for microarray expression data is to use a single value of summarizing statistic to represent gene expression level and overlook all information on expression difference in exons, genomic position and alleles. Therefore, although RNA-seq dramatically increases the level of biological details [7], we still use the traditional statistical methods for identifying natural selection on geneexpression and do not efficiently use all of the information contained in RNA-seq data.
To address these critical limitations and challenges arising from NGS data, the goals of this paper are to review the traditional methods for natural selection analysis and develop a unified statistical framework for the detection of adaptive evolution acting on biological pathways and application of these methods to large-resequencing data sets in humans that are free of ascertainment bias. To explore observed expression variation in exons or in genomic position across the genes, we extend one dimensional diffusion process to multidimensional diffusion process and a single variate stochastic differential equation to multivariate stochastic differential equations. We then use the extended multidimensional diffusion processes to model the evolution of geneexpression acted by natural selection. We hope that the present new development of natural selection analysis for NGS data will open a new avenue for natural selection analysis with NGS data.
Genome-wide Natural Selection Studies with NGS Data
Detecting the regions in human genome under natural selection from whole genome sequencing data
Identifying signatures of natural selection in the human genome is of fundamental implication for the study of population evolution and for the biomedical research. The distribution of selection in genome will provide important functional information. Natural selection modifies the level of variability within and between populations and shapes the pattern of genetic variations in the genome. And genetic variation in genome is the raw data for detection of natural selection. Reports on natural selection genes have been lasted for years, but very few are of whole genome from whole genome sequencing data. The 1,000 Genomes Project produces whole genome sequencing data and offers a unique and great opportunity to scan the genome for signature of natural selection. We applied the Tajima’s statistic test which compare the difference between the average number of nucleotide differences and the number of segregating sites [8], Fu’s test that is based on the infinite sites of models [9], Fay and Wu’s H test measuring departures from neutrality that are reflected in the difference between highfrequency and intermediate frequency [10]. Zeng et al.’s E test [11] comparing estimates of population nucleotide diversity parameter from the high-frequency and low-frequency variants, Achaz’s Y test that can be used to test neutrality despite sequencing errors [12] and Fst statistic that measures genetic differentiation [13] to low coverage data from 1,000 genomes project [14] to scan the entire genomes for detection of selection. The data consisted of three populations: 60 individuals in CEU, 60 in YRI, and ASI (30 in CHB and 30 JPT). The size of sliding-window being tested is 10 kb and the empirical significance level is 5%. We identified a number of candidate selection regions at 5% empirical significance level within 10kb-sliding window. Figure 1 showed the distribution of candidate selection genes. By comparing with the empirical genome-wide distribution of Fst, we also identified 6,507 candidate selection regions at an empirical significance level of 2.5% across the genomes. Among them, 106 candidate selection regions were overlapped with 581 identified selection genes by Fst which were reported in the literatures. Table 1 presented some overlapped candidate selection genes.
Parallel natural selection on pathways
It is well known that gene function in a concert, rather than in isolation. Complex phenotype variations are caused by dynamic interaction among many genes and many environmental exposures through regulation and metabolism. Adaptations are often influenced by multiple interacted genes that are organized into pathways [15-17]. However, in the conventional population genetics, most researches have primarily focused on natural selection acting on a single locus [18]. Little attention has been paid to determining how the natural selection acts on multiple interacted genes in response to environmental perturbation. To reveal how multiple genes that are often organized into pathways or interacted networks are evolved to adapt dramatic changes in environment, lifestyle and culture [19], we conducted Fisher exact test for ontology and pathway enrichment analysis of the candidate selection genes. The dataset is a low coverage data from 1,000 genomes project as described in the previous section. The total number of SNPs and average number of SNPs per kb in CEU, YRI and ASI (CHB and JPT) are 9,262,087, 11,815,086, 7,170,861 and 3.32, 4.23 and 2.57, respectively. The Tajima’s test statistic which compares the difference between the average number of nucleotide differences and the number of segregating sites [8] was applied to test for natural selection on a gene. The genome was divided into non-overlapping windows of 10 kb. The genome empirical P-value of 0.05 was used as threshold for declaring significant evidence of selection. If a gene included two selection candidate genomic regions, each with 10 kb, then the gene is taken as the selection candidate gene. The genome-wide scan for selection signals identified 2,710, 2,675, 2,518 and 2,609 candidate selection genes in YRI, CEU, CHB and JPT, respectively. Our approach captured some genes that were previously reported to be under natural selection in climate, physical environment, diet and immune response (Table 2). To investigate the functions of genes under selection, we used Fisher’s exact test to identify the pathways which are enriched with the candidate selection genes, we assembled 501 pathways from KEGG and Biocarta (http://www.biocarta.com). The P-value for declaring enrichment of candidate selection genes after Bonferroni correction for multiple tests is 0.0001. Table 3 summarized pathways that are enriched with candidate selection genes including pathway involved in the development of neural system, metabolism of carbohydrates and phosphates, cell communication and signal transduction. We found that all six pathways enriched by high FST genes in the recent paper by Amato et al. [20] were included in table 3. A remarkable feature in table 3 is that pathways show enrichment of selective signals in at least three populations. This demonstrated that many genes were subject to parallel selection in human populations. To examine this in more detail, we next examined patterns of selection across populations for each pathway. As a representative example, figure 2 shows the axon guidance pathway, which exhibits one of the strongest enrichments for candidate selection genes as assessed by both Tajima’s D and FST [20]. Several features emerge from figure 2. First, although axon guidance pathway is under selection in four populations, the candidate selection genes in the pathway may be different in four populations. Interestingly, although the axon guidance pathway possesses strong evidence of selection in each population, the specific targets of selection are not identical (approximately 35% of candidate selection genes are significant in a single population). Thus, the evolutionary history of the axon guidance pathway in humans’ exhibits evidence for both shared signatures of selection and parallel adaptation. Second, many candidate selection genes shared by four populations are in the membrane and initiators of axon guidance pathway. For example, the gene DCC is a transmembrane protein and mediates axon guidance of neuronal growth cones. The gene EPHA3 belongs to the ephrin receptor subfamily and mediates developmental events, particularly in the nervous system. The genes ROBO1 encodes an integral membrane protein that functions in axon guidance and neuronal precursor cell migration.
Natural Selection on Gene Expression
Single locus test of natural selection on gene expression
Evolutionary changes in gene-expression regulation may play a more important role in shaping phenotypic evolution than evolutionary changes in sequence [21]. Advances in genome-wide gene expression profiling techniques open a new era in investigation of natural selection on gene expression. We widely observe heritable gene expression variation in yeast, mice and humans [21]. Although general forces acting on gene-expression evolution have remained unclear, natural selection and random genetic drift are among the major evolutionary forces to shape gene-expression variation.
A key issue for investigation of natural selection on gene-expression is to develop a neutral model for gene expression. The first neutral model is a “clock-like” model proposed by Khaitovich et al. [22]. The model assumes that gene expression variance increases with divergence time. If we consider gene-expression as a quantitative trait, the models of phenotype trait evolution such as Brownian motion can be used to study natural selection on gene-expression evolution [23-26]. Similar to the symmetric random walk which in each time unit is equally to take a unit step either to the left or to the right, genetic mutations that are fixed in an evolving population cause overexpression or under expression with equal probability. Variation in gene-expression caused by such mutation can hence be modeled by Brownian motion. However, the effects of mutations on expression of some genes may be slightly more complex than Brownian motion. Mutation perturbations are more likely to shift the gene-expression toward some optimal values. Therefore, their effects on the gene-expression variation can be modeled by the Ornstein-Uhlenback (OU) stochastic process [24].
We assume that gene-expression in evolution is a diffusion process [27]. Let X(t) be gene-expression level at time t . Then, the variable X(t) should satisfy the following two conditions:
and
,
Where μ(x,t) is referred to as the drift parameter and σ2(x,t) the diffusion parameter.
For the neutral model, change in gene-expression is similar to a symmetric random walk. Thus, the process X(t) follows a Brownian motion with μ(x,t) = 0 and a constant variance σ2(x,t) = σ2.
and for t > 0 . Then, the density function of Brownian motion X (t) should satisfy its associated Kolmogorov backward equation:
                                                                  (1)
Its solution is a Gaussian kernel:
                    (2)
To consider the impact of natural selection on the gene-expression, we assume that μ(x,t) = −λx and σ2(x,t) = σ2. Its associated Kolmogorov backward equation is
                                                        (3)
The solution of above equation is
                      (4)
The drift parameter λ in the OU process represents a restoring force directed towards the origin. Equation (4) shows that the mean of the gene-expression X (t) will finally return to the origin after perturbation of selection disappears. We also see from equation (4) that the variance of X (t) does not increase in proportion to time. In beginning of selection, the variance is a linear function of the time σ2t . When the time is getting large, the variance will saturates at a stable equilibrium
The change in gene-expression is caused by two principal forces. First, a nonrandom change in gene-expression is induced by external forces such as natural selection. Second, random forces such as interaction among genes, random sampling etc, will also change geneexpression levels and can be modeled by Brownian motion. Thus, for a small duration from time t to t + Δt , the change in gene-expression can be approximated by
                                 (5)
where the drift parameter μ(x,t) is a measure of the instantaneous change rate of expression at time t and expression level x while ΔB(t) = B(t + Δt) − B(t) is the incremental change associated with the standard Brownian motion process B(t) and σ2(x,t) > 0measures the instantaneous variance associated with the evolution process of geneexpression X (t). The first term in equation (5) measures the changes caused by the deterministic evolutionary forces while the second term represents the random component of the evolution process of geneexpression X (t).
The limit relation attendant to equation (5) is given by [27]
                                                        (6)
or written in the differential notation
                                                        (7)
Let . The process W(t) is commonly referred to as the white noise process. The white noise process is a continuous Gaussian process with uncorrelated points. The gene-expression evolution process X (t) satisfies the stochastic differential equation:
                                                        (8)
with initial value X (0) = x .
Now we solve stochastic differential equation (8). Equation (8) can be rewritten as
Integrating both sides yields
                                                        (9)
By integrating by parts the right-hand side, we obtain
                           (10)
The process X (t) is a linear combination of Brownian motion B(τ) . This implies that X (t) is a Gaussian process. It follows from equation (10) that
                 (11)
The variance of X (t) can be obtained from equation (9)
which implies that
                                                        (12)
Combining equations (11) and (12) we infer that X (t) is a Gaussian process with density function , which is consistent with equation (4).
Multi-locus joint tests of natural selection with RNA-seq data
Next we consider RNA-seq data. We attempt to extend onedimensional Brownian motion to multi-dimensional Brownian motion and a single variate stochastic differential equation to a system of stochastic differential equations to model natural selection on geneexpression.
A standard n-dimensional Brownian motion is defined as a vector of n independent Brownian motions:
                                                        (13)
In other words, each Zj(t) is a one-dimensional Brownian motion and different components are independent. Now we define a dependent Brownian motion. Let
be a positive symmetric matrix and can be decomposed as
ρ = HHT.
We define a new vector-valued process W(t) = [W1(t),...,Wn(t)]T as
                                                        (14)
which is referred to as a correlated Brownian motion. This new process has the following properties:
(1) W(0) = 0 ,
(2) If s ≤ t , then W(t) −W(s) is multivariate normal with mean 0 and variance-covariance matrix (t − s)ρ ,
(3) If 0 ≤ r ≤ s < t , then the random variables W(t) −W(s) and W(r) are independent, and
(4) The paths t →W(t) are continuous with probability 1.
Now we introduce the concept of systems of stochastic differential equations. Let A(X (t) and H(X (t), t) be vector functions of the process X (t) . A system of stochastic differential equations is given by
                          (15)
Let X (t) i be the number of reads at the i -th genomic position at time t . Let A(X (t) and H(X (t) be, respectively, given by
A(X(t)) = AX(t) and H(X(t)) = Λ ,
where
Then, equation (15) is reduced to
                                                        (16)
The process described by equation (16) is similar to the strongly coupled Ornstein-Uehlenbeck stochastic process. To solve equation (16) we begin with
Integrating from 0to t on both sides of the above equation, we obtain
or
                                       (17)
Similar to the arguments for scalar OU process, X (t) is a multidimensional Gaussian process. Its mean and covariance matrix are, respectively, derived by
and
which implies
                                 (18)
The variance of X (t) is therefore given by
                              (19)
Discussion
In this paper we have addressed important issues for analysis of natural selection with NGS data. We investigated how adaptive evolution has shaped patterns of natural selection on whole genome. We extended traditional single locus tests of selection to multi-locus joint tests of natural selection which represents a significant conceptual advance that departs from the current paradigm of single locus neutrality tests and will result in more biologically interpretable signatures of selection and facilitate the detection of adaptive evolution acting on biological pathways.
To highlight the insights gleaned from analyzing selection acting on pathways, as well as current methodological limitations, we performed a genome-wide scan for selection in the low coverage data of the 1000 Genomes Project to identify pathways that are enriched for candidate selection genes. We identified all six pathways enriched for high FST genes described in a recent analysis by Amato et al. [20] along with several novel pathways such as small cell lung cancer. We demonstrated enrichment of candidate selection genes in at least three populations, suggesting either shared signatures of selection, parallel adaptation, or a combination of both. Our preliminary results showed that integrating information across loci will increase the power to detect more subtle selection that would otherwise be difficult to identify through single locus approaches.
Allele-specific alternative splicing, polyadenylation, allele-specific transcription start and end sites, and differential promoter usage generate a large variability at the transcriptional level. RNA-seq technologies are able to measure mRNA variation across the genes. They provide substantially detailed biological insight than microarray platform. RNA-seq is now opening unprecedented avenues to address the analysis of natural selection on entire transcriptomes. However, none of statistical methods for detecting signal of natural selection on gene-expression with RNA-seq data are available. RNA-seq poses great challenges to use its remarkable features in analysis of natural selection. To address these challenges, we extended one dimensional diffusion process to multidimensional diffusion process and one dimensional Ornstein-Uehlenbeck process to coupled multidimensional Ornstein- Uehlenbeck process for modeling natural selection acting on expression at each genomic position. To derive distribution of evolution process of gene-expression influenced by natural selection, we extended a single variate stochastic differential equation to multi-variate stochastic differential equation. Their solutions are density function of multidimensional Gaussian process which is able to fully explore substantial variation of mRNA expression and capture the detailed evolutionary picture of gene-expression at genomic position level.
RNA-seq can identify different mRNA variants and measure expressions at exon, SNP, positional levels. The proposed multi-variate stochastic differential equations for detection of natural selection on gene-expression take the various transcript variants into account and hence can consider complex patterns of gene expressions. We expect that the developed statistical methods for identifying natural selection on gene-expression with RNA-seq data substantially outperforms the current statistical methods for identifying signal of natural selection with microarray expression data or RNA-seq data based on overall gene expression levels.
Analysis of NGS data opens up unprecedented avenue to address the essential issues on detection of natural selection on whole genome and entire transcriptomes. The results in this paper are preliminary. The purpose of this paper is to stimulate further discussions regarding great challenges we are facing in developing statistical methods and computational algorithms for analyzing natural selection with large and formidably complex genomic sequencing and RNA-seq datasets to optimally utilize biological information hidden in the genomic sequencing data and RNA-seq data.
Acknowledgement
The project described was supported by Grant 1R01AR057120–01, 1R01HL106034-01 from National Institutes of Health and NHLBI.
References



























 

Tables and Figures at a glance

image
Table 1 Table 2 Table 3

Figures at a glance

image
Figure 1 Figure 2 Figure 3
Select your language of interest to view the total content in your interested language
Post your comment

Share This Article

Relevant Topics

Recommended Conferences

Article Usage

  • Total views: 11714
  • [From(publication date):
    April-2013 - Dec 14, 2017]
  • Breakdown by view type
  • HTML page views : 7867
  • PDF downloads : 3847
 

Post your comment

captcha   Reload  Can't read the image? click here to refresh

Peer Reviewed Journals
 
Make the best use of Scientific Research and information from our 700 + peer reviewed, Open Access Journals
International Conferences 2017-18
 
Meet Inspiring Speakers and Experts at our 3000+ Global Annual Meetings

Contact Us

Agri & Aquaculture Journals

Dr. Krish

[email protected]

1-702-714-7001Extn: 9040

Biochemistry Journals

Datta A

[email protected]

1-702-714-7001Extn: 9037

Business & Management Journals

Ronald

[email protected]

1-702-714-7001Extn: 9042

Chemistry Journals

Gabriel Shaw

[email protected]

1-702-714-7001Extn: 9040

Clinical Journals

Datta A

[email protected]

1-702-714-7001Extn: 9037

Engineering Journals

James Franklin

[email protected]

1-702-714-7001Extn: 9042

Food & Nutrition Journals

Katie Wilson

[email protected]

1-702-714-7001Extn: 9042

General Science

Andrea Jason

[email protected]

1-702-714-7001Extn: 9043

Genetics & Molecular Biology Journals

Anna Melissa

[email protected]

1-702-714-7001Extn: 9006

Immunology & Microbiology Journals

David Gorantl

[email protected]

1-702-714-7001Extn: 9014

Materials Science Journals

Rachle Green

[email protected]

1-702-714-7001Extn: 9039

Nursing & Health Care Journals

Stephanie Skinner

[email protected]

1-702-714-7001Extn: 9039

Medical Journals

Nimmi Anna

[email protected]

1-702-714-7001Extn: 9038

Neuroscience & Psychology Journals

Nathan T

[email protected]

1-702-714-7001Extn: 9041

Pharmaceutical Sciences Journals

Ann Jose

[email protected]

1-702-714-7001Extn: 9007

Social & Political Science Journals

Steve Harry

[email protected]

1-702-714-7001Extn: 9042

 
© 2008- 2017 OMICS International - Open Access Publisher. Best viewed in Mozilla Firefox | Google Chrome | Above IE 7.0 version