Medical, Pharma, Engineering, Science, Technology and Business

**Dimitrios V Vavoulis ^{*} and Julian Gough**

Department of Computer Science, University of Bristol, Bristol, United Kingdom

- *Corresponding Author:
- Dimitrios V Vavoulis

Department of Computer Science

University of Bristol

Bristol, United Kingdom

**Tel:**+44 (0)117 331573

**E-mail:**[email protected]

**Received Date:** October 20, 2013; **Accepted Date:** November 18, 2013; **Published Date:** November 25, 2013

**Citation:** Vavoulis DV, Gough J (2013) Non-Parametric Bayesian Modelling of Digital Gene Expression Data. J Comput Sci Syst Biol 7:001-009. doi: 10.4172/jcsb.1000131

**Copyright:** © 2013 Vavoulis DV, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Visit for more related articles at**
Journal of Computer Science & Systems Biology

Next-generation sequencing technologies provide a revolutionary tool for generating gene expression data. Starting with a fixed RNA sample, they construct a library of millions of differentially abundant short sequence tags or “reads”, which constitute a fundamentally discrete measure of the level of gene expression. A common limitation in experiments using these technologies is the low number or even absence of biological replicates, which complicates the statistical analysis of digital gene expression data. Analysis of this type of data has often been based on modified tests originally devised for analysing microarrays; both these and even de novo methods for the analysis of RNA-seq data are plagued by the common problem of low replication. We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. We begin with a hierarchical model for modelling over-dispersed count data and a blocked Gibbs sampling algorithm for inferring the posterior distribution of model parameters conditional on these counts. The algorithm compensates for the problem of low numbers of biological replicates by clustering together genes with tag counts that are likely sampled from a common distribution and using this augmented sample for estimating the parameters of this distribution. The number of clusters is not decided a priori, but it is inferred along with the remaining model parameters. We demonstrate the ability of this approach to model biological data with high fidelity by applying the algorithm on a public dataset obtained from cancerous and non-cancerous neural tissues. Source code implementing the methodology presented in this paper takes the form of the Python Package DGEclust, which is freely available at the following link: https://bitbucket.org/DimitrisVavoulis/dgeclust.

DGEclust; Stick-breaking priors; Negative binomial distribution

It is a common truth that our knowledge in Molecular Biology is only as good as the tools we have at our disposal. Next-generation or high-throughput sequencing technologies provide a revolutionary tool in the aid of genomic studies by allowing the generation, in a relatively short time, of millions of short sequence tags, which reflect particular aspects of the molecular state of a biological system. A common application of these technologies is the study of the transcriptome, which involves a family of methodologies, including RNA-seq ([1]), CAGE (Cap Analysis of Gene Expression; [2]) and SAGE (Serial Analysis of Gene Expression; [3]). When compared to microarrays, this class of methodologies offers several advantages, including detection of a wider level of expression levels and independence on prior knowledge of the biological system, which is required by hybridisation-based technologies, such as microarrays.

Typically, an experiment in this category starts with the extraction of a snapshot RNA sample from the biological system of interest and it’s shearing in a large number of fragments of varying lengths. The population of these fragments is then reversed-transcribed to a c-DNA library and sequenced on a high- throughput platform, generating large numbers of short DNA sequences known as “reads”. The ensuing analysis pipeline starts with mapping or aligning these reads on a reference genome. At the next stage, the mapped reads are summarised into gene-, exon- or transcript-level counts, normalised and further analysed for detecting differential gene expression [4].

It is important to realize that the normalised read (or tag) count data generated from this family of methodologies represents the number of times a particular class of c-DNA fragments has been sequenced, which is directly related to their abundance in the library and, in turn, the abundance of the associated transcripts in the original sample. Thus, this count data is essentially a discrete or digital measure of gene expression, which is fundamentally different in nature (and, in general terms, superior in quality) from the continuous fluorescence intensity measurements obtained from the application of microarray technologies. Due to their better quality, next-generation sequence assays tend to replace microarray- based technologies, despite their higher cost [5].

One approach for the analysis of count data of gene expression is to transform the counts to approximate normality and then apply existing methods aimed at the analysis of microarrays [6,7]. However, as noted in McCarthy et al. [8], this approach may fail in the case of very small counts (which are far from normally distributed) and also due to the strong mean-variance relationship of count data, which is not taken into account by tests based on a normality assumption. Proper statistical modelling and analysis of count data of gene expression requires novel approaches, rather than adaptation of existing methodologies, which aimed from the beginning at processing continuous input.

Formally, the generation of count data using next-generation sequencing assays can be thought of as random sampling of an underlying population of cDNA fragments. Thus, the counts for each tag describing a class of cDNA fragments can, in principle, be modelled using the Poisson distribution, whose variance is, by definition, equal to its mean. However, it has been shown that, in real count data of gene expression, the variance can be larger than what is predicted by the Poisson distribution [9-12]. An approach that accounts for the so-called “over-dispersion” in the data is to adopt quasi-likelihood methods, which augment the variance of the Poisson distribution with a scaling factor, thus by-passing the assumption of equality between the mean and variance [13-16]. An alternative approach is to use the Negative Binomial distribution, which is derived from the Poisson, assuming a Gamma-distributed rate parameter. The Negative Binomial distribution incorporates both a mean and a variance parameter, thus modelling over-dispersion in a natural way [17,18]. An overview of existing methods for the analysis of gene expression count data can be found in Oshlack et al. and Kvam et al. [4,19].

Despite the decreasing cost of next-generation sequencing assays (and also due to technical and ethical restrictions), digital datasets of gene expression are often characterised by a small number of biological replicates or no replicates at all. Although this complicates any effort to statistically analyse the data, it has led to inventive attempts at estimating as accurately as possible the biological variability in the data given very small samples. One approach is to assume a locally linear relationship between the variance and the mean in the Negative Binomial distribution, which allows estimating the variance by pooling together data from genes with similar expression levels [17]. Alternatively, one can make the rather restrictive assumption that all genes share the same variance, in which case the over-dispersion parameter in the Negative Binomial distribution can be estimated from a very large set of data points [11]. A further elaboration of this approach is to assume a unique variance per gene and adopt a weighted-likelihood methodology for sharing information between genes, which allows for an improved estimation of the gene-specific over-dispersion parameters [8]. Another yet distinct empirical Bayes approach is implemented in the software *baySeq*, which adopts a form of information sharing between genes by assuming the same prior distribution among the parameters of samples demonstrating a large degree of similarity [18].

In summary, proper statistical modelling and analysis of digital gene expression data requires the development of novel methods, which take into account both the discrete nature of this data and the typically small number (or even the absence) of biological replicates. The development of such methods is particularly urgent due to the huge amount of data being generated by high-throughput sequencing assays. In this paper, we present a method for modelling digital gene expression data that utilizes a novel form of information sharing between genes (based on non-parametric Bayesian clustering) to compensate for the all-toocommon problem of low or no replication, which plagues most current analysis methods.

We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. Our point of departure is a hierarchical model for over-dispersed counts. The model is built around the Negative Binomial distribution, which depends, in our formulation, on two parameters: the mean and an over-dispersion parameter. We assume that these parameters are sampled from a Dirichlet process with a joint Inverse Gamma - Normal base distribution, which we have implemented using stick breaking priors. By construction, the model imposes a clustering effect on the data, where all genes in the same cluster are statistically described by a unique Negative Binomial distribution. This can be thought of as a form of information sharing between genes, which permit pooling together data from genes in the same cluster for improved estimation of the mean and over-dispersion parameters, thus bypassing the problem of little or no replication. We develop a blocked Gibbs sampling algorithm for estimating the posterior distributions of the various free parameters in the model. These include the mean and over-dispersion for each gene and the number of clusters (and their occupancies), which does not need to be fixed *a priori*, as in alternative (parametric) clustering methods. In principle, the proposed method can be applied on various forms of digital gene expression data (including RNA-seq, CAGE, SAGE, Tagseq, etc.) with little or no replication and it is actually applied on one such example dataset herein.

The digital gene expression data we are considering is arranged in an *M×N* matrix, where each of the *N* rows corresponds to a different gene and each of the *M* columns corresponds to a different sample. Furthermore, all samples are grouped in *L* different classes (i.e. tissues or experimental conditions). It holds that *L *≤* M*, where the equality is true if there are no replicas in the data.

We indicate the number of reads for the *i ^{th}* gene at the

(1)

Where *μ _{ij}* is the mean of the Negative Binomial distribution and is the variance. Since the variance is always larger than the mean by the quantity , the Negative Binomial distribution can be thought of as a generalisation of the Poisson distribution, which accounts for over-dispersion. Furthermore, we model the mean as , where the offset is the depth or exposure of sample

Given the model above, the likelihood of observed reads *yij* = *{yij*: *λ(j)=l}* for the *i ^{th}* gene in class

(2)

Where the index *j* satisfies the condition *λ(j)=**l*. By extension, for the *i ^{th}* gene across all sample classes, the likelihood of observed counts

(3)

where the class indicator l runs across all L classes.

**Information sharing between genes**

A common feature of digital gene expression data is the small number of biological replicates per class, which makes any attempt to estimate the gene- and class-specific parameters *θ _{il}*={

(4)

where indicates a discrete random measure centered at and *w _{k}* is the corresponding weight. Conceptually, the fact that the above summation goes to infinity expresses our lack of prior knowledge regarding the number of components that appear in the mixture, other than the obvious restriction that their maximum number cannot be larger than the number of genes times the number of sample classes. In this formulation, the parameters are sampled from a prior base distribution G

(5)

Given the above, can take only positive values, as it ought to, while can take both positive and negative values.

What makes the mixture in Eq. 4 special is the procedure for generating the infinite sequence of mixing weights. We set *w _{1}*=V

(6)

There are various ways for defining the parameters *a _{k}* and

Given the above formulation, sampling *θ _{il}* from its prior distribution is straightforward. First, we introduce an indicator variable

**Figure 1:** Format of digital gene expression data. Rows correspond to genes and columns correspond to samples. Samples are grouped into classes (e.g. tissues or experimental conditions). Each element of the data matrix is a whole number indicating the number of counts or reads corresponding to the i^{th} gene at the j^{th} sample. The sum of the reads across all genes in a sample is the depth or exposure of that sample.

**Figure 2:** The clustering effect that results from imposing a stick-breaking prior on the gene and class- specific model parameters, θ_{il}. A matrix of indicator variables is used to cluster the observed count data into a finite number of groups, where the genes in each group share the same model parameters. The number of clusters is not known a priori. The distribution of weight mass among the various clusters in the model is determined by parameter η.

**Generative model**

The description in the previous paragraphs suggests a hierarchical model, which presumably underlies the stochastic generation of the data matrix in **Figure 1**. This model is explicitly described below:

(7)

At the bottom of the hierarchy, we identify the measured reads *y _{ij}* for each gene in each sample, which follow a Negative Binomial distribution with parameters . The parameters of the Negative Binomial distribution are gene- and class-specific and they are completely determined by an also gene- and class-specific indicator variable z

At this point, we introduce some further notation. We indicate the *N × L* matrix of indicator variables with the letter *Z*; lists the centers of the point measures in Eq. 4 and W={w_{1}, w_{2}, . . .} is the vector of mixing weights. We are interested in computing the joint posterior density , where *Y* is a matrix of count data as in **Figure 1**. We approximate the above distribution through numerical (Monte Carlo) methods, i.e. by sampling a large number of 9 -tuples from it. One way to achieve this is by constructing a Markov chain, which admits as its stationary distribution. Such a Markov chain can be constructed by using Gibbs sampling, which consists of alternating repeated sampling from the full conditional posteriors and . Below, we explain how to sample from each of these conditional distributions.

**Sampling from the conditional posterior **

In order to sample from the above distribution it is convenient to truncate the infinite mixture in Eq. 4 by rejecting all terms with index larger than K and setting , which is equivalent to setting K V = 1 . It has been shown that the error associated with this approximation when is less than or equal to ([8]). For example, for N =14×10^{3},M = 6,k = 200 and η =1 , the error is minimal (less than10^{−80} ). Thus, the truncation should be virtually indistinguishable from the full (infinite) mixture.

Next, we distinguish between ac k active clusters and kin inactive clusters , such that and . Active clusters are those containing at least one gene, while those containing no genes are considered inactive. We write:

Updating the inactive clusters is a simple matter of sampling k_{in} times from the joint distribution in Eq. 5 given the hyper-parameters *Φ*. Sampling the active clusters is more complicated and involves sampling each active cluster center individually from its respective posterior , where *Y _{ac,k}* is a matrix of measured count data for all genes in the

(8)

Where the superscript + indicates a candidate vector of parameters. Each of the two elements (*α* and *β*) of this vector is drawn from a symmetric proposal of the following form:

(9)

Where the random number r is sampled from the standard Normal distribution, i.e., r ~ Normal(0,1).The prior of is a joint Inverse Gamma - Normal distribution, as shown in Equation 5, while the likelihood function is a product of Negative Binomial probability distributions, similar to those in Equation 2 and 3.

**Sampling from the conditional posterior**

Each element *z _{il}* of the matrix of indicator variables

(10)

In the above expression, *Yil *is the data for the *i ^{th}* gene in class

**Sampling from the conditional posterior p(w | z)**

The mixing weights *W* are generated using a truncated stick-breaking process with *η=1*. As pointed out in Engström et al. [20], this implies that *W* follows a generalised Dirichlet distribution. Considering the conjugacy between this and the multinomial distribution, the first step in updating *W *is to generate *K − 1* Beta-distributed random numbers:

(11)

for *k=1, . . . , K − 1*, where *Nk *is the total number of genes in the k^{th} cluster. Notice that *Nk* can be inferred from *Z* by simple counting and where *N* is the total number of genes. *VK* is set equal to 1, in order to ensure that the weights add up to 1. These are simply generated by setting *V _{1}*=

**Sampling from the conditional posterior **

The hyper-parameters influence indirectly the observations* Y* through their effect on the distribution of the active cluster centres, where and If we further assume independence between and we can write

Assuming *K _{ac}*active clusters and considering that the prior for α* (see Equation 5), it follows that the posterior is:

(12)

The parameters γ_{1} to γ_{4} are given by the following expressions:

where the initial parameters are all positive. Since sampling from Equation 12 cannot be done exactly, we employ a Metropolis algorithm with acceptance probability

(13)

where the proposal distribution *q(•|•)* for sampling new candidate points has the same form as in Eq. 9. Furthermore, taking advantage of the conjugacy between a normal likelihood and a Normal-Inverse Gamma prior, the posterior probability for parameters *µ _{β}* and becomes:

(14)

The parameters to δ_{1} to δ_{4} (given initial parameters to are as follows:

where Sampling a -pair from the above posterior takes place in two simple steps: first, we sample where *δ _{3}* and

We summarise the algorithm for drawing samples from the posterior below. Notice that *x(t)* indicates the value of *x* at the *t ^{th}* iteration of the algorithm.

1. Set

2. Set

3. Set

4. Set *K* , the truncation level

5. Sample from its prior (Eq. 5) conditional on

6. Set all *K* elements of to the same value i.e *1/K*

7. Sample from the Categorical distribution with weights

8. For t=1,2,3,.....T

a. Sample given and the data matrix *Y* using a single step of the Metropolis algorithm for each active cluster (see Eq. 8)

b. Sample from its prior given (see Eq. 5)

c. Sample Z^{(t)} given , and the data matrix Y (see Eq. 10)

d. Sample W^{(t)} given Z(t) (see Eq. 11)

e. Sample *φ*^{(t)} given and (see Eqs. 12 and 14)

9. Discard the first *T _{0}* samples, which are produced during the burn-in period of the algorithm (i.e. before equilibrium is attained), and work with the remaining

The above procedure implements a form of blocked Gibbs sampling with embedded Metropolis steps for impossible to directly sample from distributions.

We applied the methodology described in the preceding sections on publicly available digital gene expression data (obtained from control and cancerous tissue cultures of neural stem cells; [20]) for evaluation purposes. The data we used in this study can be found at the following URL: http://genomebiology.com/content/supplementary/gb-2010-11-10-r106-s3.tgz. As shown in **Table 1**, this dataset consists of four libraries from glioblastoma-derived neural stem cells and two from non- cancerous neural stem cells. Each tissue culture was derived from a different subject (with the exception of GliNS1 and G144, which came from the same patient). Thus, the samples are divided in two classes (cancerous and non-cancerous) with four and two replicates, respectively.

Cancerous | Non-Cancerous | |||||
---|---|---|---|---|---|---|

Genes | GliNS1 | G144 | G166 | G179 | CB541 | CB660 |

13CDNA73 | 4 | 0 | 6 | 1 | 0 | 5 |

15E1.2 | 75 | 74 | 222 | 458 | 215 | 167 |

182-FIP | 118 | 127 | 555 | 231 | 334 | 114 |

. |
. |
. |
. |
. |
. |
. |

. |
. |
. |
. |
. |
. |
. |

. |
. |
. |
. |
. |
. |
. |

**Table 1:** Format of the data [6].

We implemented the algorithm presented above in the programming language Python, using the libraries NumPy, SciPy and MatplotLib. The most recent version of the software can be found at the following link: https://bitbucket.org/DimitrisVavoulis/dgeclust. Calculations were expressed as operations between arrays and the multiprocessing Python module was utilised in order to take full advantage of the parallel architecture of modern multicore processors. The algorithm was run for 200K iterations, which took approximately two days to complete on a 12-core desktop computer. Simulation results were saved to the disk every 50 iterations.

The raw simulation output includes chains of random values of the hyper-parameters *φ* , the gene- and class-specific indicators *Z* and the active cluster centres

The raw simulation output includes chains of random values of the hyper-parameters *φ* , the gene- and class-specific indicators *Z* and the active cluster centres , which constitute an approximation to the corresponding posterior distributions given the data matrix *Y*. The chains corresponding to the four different components of are illustrated in **Figure 3**. It may be observed that these reached equilibrium early during the simulation (after less than 20K iterations) and they remained stable for the remaining of the simulation. As explained earlier, these hyper-parameters are important, because they determine the prior distributions of the cluster centres α* and β* (hyperparameters and respectively) and, subsequently, of the gene- and class-specific parameters α and β. It follows from analysis of the chains in **Figure 3** that the estimates for these hyper-parameters are (indicating the mean and standard deviation of the estimates): The corresponding Inverse Gamma and Normal distributions, which are the priors of the cluster centres α* and β*, respectively, are illustrated in **Figure 4**.

**Figure 3:** Simulation results after 200K iterations. The chains of random samples correspond to the components of the vector of hyper-parameters φ , i.e. μ_{β} and σ^{2}_{β} . (Panel A) and a_{α} and s_{α} (panel B). The former determines the Normal prior distribution of the cluster center parameters β∗, while the latter pair determines the Inverse Gamma prior distribution of the cluster center parameters α∗. The random samples in each chain are approximately sampled (and constitute an approximation of) the corresponding posterior distribution conditional on the data matrix Y.

**Figure 4:** Estimated Inverse Gamma (panel A) and Normal (panel B) prior distributions for the cluster parameters α* and β*, respectively. The solid lines indicate mean distributions, i.e. those obtained for the mean values of the hyper-parameters a_{α} , s_{α} ,μ_{β} , σ_{β}^{2} . The dashed lines are distributions obtained by adding or subtracting individually one standard deviation from each relevant hyper-parameter.

A major use of the methodology presented above is that it allows us to estimate the gene and class-specific parameters α and β, under the assumption that the same values for these parameters are shared between different genes or even by the same gene among different sample classes. This form of information sharing permits pulling together data from different genes and classes for estimating pairs of *αand β *parameters in a robust way, even when only a small number of replicates (or no replicates at all) are available per sample class. As an example, in **Figure 5** we illustrate the chains of random samples for *α and β *corresponding to the non-cancerous class of samples for the tag with ID 182-FIP (third row in **Table 1**). These samples constitute approximations of the posterior distributions of the corresponding parameters. Despite the very small number of replicates (*n=4*), the variance of the random samples is finite. Similar chains were derived for each gene in the dataset, although it should be emphasised that the number of such estimates is smaller than the total number of genes, since more than one genes share the same parameter estimates.

**Figure 5:** Chains of random samples approximating the posterior distributions of the parameters α (panel A) and β (panel B) corresponding to the non-cancerous class of samples for the tag with ID 182-FIP (third row in **Table 1**). These samples were generated after 200K iterations of the algorithm. A similar pair of chains exists for each gene at each sample class (i.e. cancerous and non-cancerous), although not all pairs are distinct to each other due to the clustering effect imposed on the data by the algorithm.

It has already been mentioned that the sharing of *α and β *parameter values between different genes can be viewed as a form of clustering (**Figure 2**), i.e. there are different groups of genes, where all genes in a particular group share the same *α and β *parameter values. As expected in a Bayesian inference framework, the number of clusters is not constant, but it is itself a random variable, which is characterised by its own posterior distribution and its value, fluctuates randomly from one iteration to the next. In **Figure 6**, we illustrate the chain of sampled cluster numbers during the course of the simulation (panel A). The first 75K iterations were discarded as burn-in and the remaining samples were used for plotting the histogram in panel B, which approximates the posterior distribution of the number of clusters given the data matrix *Y*. It may be observed that the number of clusters fluctuates between 35 and 55 with a peak at around 42 clusters. The algorithm we present above does not make any particular assumptions regarding the number of clusters, apart from the obvious one that this number cannot exceed the number of genes times the number of sample libraries. Although the truncation level *K*=200 sets an artificial limit in the maximum number of clusters, this is never a problem in practise, since the actual estimated number of clusters is typically much smaller that the truncation level *K* (see the y-axis in **Figure 6A**). The fact that the number of clusters is not decided a priori, but rather inferred along with the other free parameters in the model sets the described methodology in an advantageous position with respect to alternative clustering algorithms, which require deciding the number of clusters at the beginning of the simulation [21].

**Figure 6:** Stochastic evolution of the number of clusters during 200K iterations of the simulation (panel A) and the resulting histogram after discarding the first 75K iterations as burn-in (panel B). After reaching equilibrium, the number of clusters fluctuates around a mean of approximately 43 clusters. In general, the estimated number of clusters is much smaller than the truncation level (K = 200, see y-axis in panel A). The histogram in panel B approximates the posterior distribution of the number of clusters given the data matrix Y.

Similarly to the stochastic fluctuation in the number of clusters, the cluster occupancies (i.e. the number of genes per cluster) are a random vector. In **Figure 7**, we illustrate the cluster occupancies at two different stages of the simulation, i.e. after 100K and 200K iterations, respectively. We may observe that, with the exception of a single super-cluster (containing more than 6000 genes), cluster occupancies range from between around 3000 and less than 1000 genes. It should be clarified that each cluster includes many (potentially, hundreds of) genes and it may span several classes. An individual cluster represents a Negative Binomial distribution (with concrete *α and β* parameters), which models with high probability the count data from all its member genes. This is illustrated in **Figure 8**, where we show the histogram of the log of the count data from the first sample (sample GliNS1 in **Table 1**) along with a subset of the estimated clusters after 200K iterations (gray lines) and the fitted model (red line). It may be observed that each cluster models a subset of the gene expression data in the particular sample. The complete model describing the whole sample is a weighted sum of the individual clusters/Negative Binomial distributions. Formally,

**Figure 7:** Cluster occupancies after 100K and 200K iterations of the algorithm. A single super-cluster (including more than 6000 genes) appears at both stages of the simulation. The occupancy of the remaining clusters demonstrates some variability during the course of the simulation, with clusters containing between 3000 and less than 1000 genes.

**Figure 8:** Histogram of the log of the number of reads from sample GliNS1, a subset of the estimated clusters (gray lines) and the estimated model of the sample at the end of the simulation. Each cluster (gray line) represents a Negative Binomial distribution with specific α and β parameters, which models a subset of the count data in this particular sample. The complete model (red line) is the weighted sum of all component clusters.

(15)

where *Yj* is the *j ^{th}* sample and the index

The proposed methodology provides a compact way to model each sample in a digital gene expression dataset following a two-step procedure: first, the dataset is partitioned into a finite number of clusters, where each cluster represents a Negative Binomial distribution (modelling a subset of the data) and the parameters of each such distribution are estimated. Subsequently, each sample in the dataset can be modelled as a weighted sum of Negative Binomial distributions. In **Figure 9**, we show the log of count data for each sample in the dataset shown in **Table 1** along with the fitted models (red lines) after 200 K iterations of the algorithm.

**Figure 9:** Histograms of the log of the number of reads from cancerous (panels Ai-iv) and non-cancerous (panels Bi,ii) samples and the respective estimated models after 200K iterations of the algorithm. As already mentioned, each red line is the weighted sum of many component Negative Binomial distributions / clusters, which model different subsets of each data sample. We may observe that the estimated models fit tightly the corresponding data samples.

Next-generation sequencing technologies are routinely being used for generating huge volumes of gene expression data in a relatively short time. This data is fundamentally discrete in nature and their analysis requires the development of novel statistical methods, rather than modifying existing tests that were originally aimed at the analysis of microarrays. The development of such methods is an active area of research and several papers have been published on the subject [4,19].

In this paper, we present a novel approach for modelling overdispersed count data of gene expression (i.e. data with variance larger than the mean predicted by the Poisson distribution) using a hierarchical model based on the Negative Binomial distribution. The novel aspect of our approach is the use of a Dirichlet process in the form of stick breaking priors for modelling the parameters (mean and overdispersion) of the Negative Binomial distribution. By construction, this formulation forces clustering of the count data, where genes in the same cluster are sampled from the same Negative Binomial distribution, with a common pair of mean and over-dispersion parameters. Through this elegant form of information sharing between genes, we compensate for the problem of little or no replication, which often restricts the analysis of digital gene expression datasets. We have demonstrated the ability of this approach to model accurately actual biological data by applying the proposed methodology on a publicly available dataset obtained from cancerous and non-cancerous cultured neural stem cells [20].

We show that inference is achieved in the proposed model through the application of a blocked Gibbs sampler, which includes estimating, among others, the gene- and class-specific mean and over-dispersion of the Negative Binomial distribution. Similarly, the number of clusters and their occupancies are inferred along with the rest free parameters in the model.

Currently, the software implementing the proposed method remains relatively computationally expensive. In particular, 200 K iterations require approximately two days completing on a 12-core desktop computer. This time scale is not disproportionate to the production time of experimental data and it is mainly due to the high volume of the tested data (> 15 K genes per sample) and the need to obtain long chains of samples for a more accurate estimation of posterior distributions. Long execution times are a characteristic, more generally, of all Monte Carlo approximation methods. Our implementation of the algorithm is completely parallelised and calculations are expressed as operations between vectors in order to take full advantage of modern multi-core computers. Ongoing work towards reducing execution times aims at the application of variation inference methods [22], instead of the blocked Gibbs sampler we currently use. The algorithm can be further improved by avoiding truncation of the infinite summation described in Equation 4, as described in Papaspiliopoulos and Roberts [23] and in Walker [24].

This non-parametric Bayesian approach for modelling count data has thus shown great promise in handling over-dispersion and the alltoo- common problem of low replication, both in theoretical evaluation and on the example dataset. The software that has been produced (DGE clust ) will be of great utility for the study of digital gene expression data and the statistical theory will contribute to leading the development of non-parametric methods in general for modelling all forms of count data of gene expression.

The authors would like to thank Prof. Peter Green and Dr. Richard Goldstein for useful discussions. Also, we would like to thank P. G. Engstrom and colleagues for producing the public data we used in this paper. Source code implementing the methodology presented in this paper can be downloaded from the following link: https://bitbucket.org/DimitrisVavoulis/dgeclust.

This work was supported by grants EPSRC EP/H032436/1 and BBSRC G_{0}22771/1.

- Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11: R106.
- Auer PL, Doerge RW (2011) A Two-Stage Poisson Model for Testing RNA-Seq Data. Statistical Applications in Genetics and Molecular Biology 10:1- 26.
- David M Blei, Michael I Jordan (2006) Variational inference for dirichlet process mixtures. Bayesian Analysis 1: 121-144.
- Carninci P (2009) Is sequencing enlightenment ending the dark age of the transcriptome? Nat Methods 6: 711-713.
- Nicole Cloonan, Alistair RRF, Gabriel Kolle, Brooke BAG, Geoffrey JF, et al. (2008) Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5: 613-619.
- Engström PG, Tommei D, Stricker SH, Ender C, Pollard SM, et al. (2012) Digital transcriptome profiling of normal and glioblastoma-derived neural stem cells identifies genes associated with patient survival. Genome Med 4: 1-76.
- Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.
- Ishwaran H, Lancelot FJ (2001) Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association 96:161-173.
- Daxin Jiang, Chun Tang, Aidong Zhang (2004) Cluster analysis for gene expression data: A survey. IEEE Trans Knowl Data Eng 16: 1370-1386.
- Kvam VM, Liu P, Si Y (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 99: 248-256.
- Langmead B, Hansen KD, Leek JT (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 11: R83.
- Lu J, Tomfohr JK, Kepler TB (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics 6: 165.
- McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40: 4288-4297.
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, et al. (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344-1349.
- Oshlack A, Robinson MD, Young MD (2010) From RNA-seq reads to differential expression results. Genome Biol 11: 220.
- Papaspiliopoulos, Roberts GO (2008) Retrospective mcmc for dirichlet process hierarchical models. Biometrika 95: 169-186.
- Robinson MD, Smyth GK (2007) Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23: 2881-2887.
- Robinson MD, Smyth GK (2008) Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9: 321-332.
- Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, et al. (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100: 15776-15781.
- Srivastava S, Chen L (2010) A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res 38: e170.
- 't Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, et al. (2008) Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res 36: e141.
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270: 484-487.
- S Walker (2007) Sampling the dirichlet mixture model with slices. Comm Statist Sim Comput 36: 45-54.
- Wang L, Feng Z, Wang X, Wang X, Zhang X (2010) DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26: 136-138.
- Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57-63.

Select your language of interest to view the total content in your interested language

- Advanced DNA Sequencing
- Algorithm
- Animal and Tissue Engineering
- Applications of Bioinformatics
- Artificial Intelligence Studies
- Artificial intelligence
- Artificial neural networks
- Big data
- Bioinformatics Algorithms
- Bioinformatics Databases
- Bioinformatics Modeling
- Bioinformatics Tools
- Biology Engineering
- Biostatistics: Current Trends
- Cancer Proteomics
- Chemistry of Biology
- Clinical Proteomics
- Cloud Computation
- Cluster analysis
- Comparative genomics
- Comparative proteomics
- Computational Chemistry
- Computational Sciences
- Computational drug design
- Computer Science
- Computer-aided design (CAD)
- Current Proteomics
- Data Mining Current Research
- Data algorithms
- Data mining applications in genomics
- Data mining applications in proteomics
- Data mining in drug discovery
- Data mining tools
- Data modelling and intellegence
- Data warehousing
- Ethics in Synthetic Biology
- Evolution of social network
- Evolutionary Optimisation
- Evolutionary algorithm
- Evolutionary algorithm in datamining
- Evolutionary computation
- Evolutionary science
- Experimental Physics
- Findings on Machine Learning
- Gene Synthesis
- Genome annotation
- Genomic data mining
- Genomic data warehousing
- Handover
- Human Proteome Project Applications
- Hybrid soft computing
- Industrial Biotechnology
- Knowledge modelling
- Machine Learninng
- Mapping of genomes
- Mass Spectrometry in Proteomics
- Mathematical Modeling
- Mathematics for Computer Science
- Meta genomics
- Microarray Proteomics
- Models of Science
- Molecular and Cellular Proteomics
- Multi Objective Programming
- Neural Network
- Ontology Engineering
- P4 medicine
- Physics Models
- Protein Sequence Analysis
- Proteogenomics
- Proteome Profiling
- Proteomic Analysis
- Proteomic Biomarkers
- Proteomics Clinical Applications
- Proteomics Research
- Proteomics Science
- Proteomics data warehousing
- Python for Bioinformatics
- Quantitative Proteomics
- Robotics Research
- Scientific Computing
- Simulation Computer Science
- Soft Computing
- Statistical data mining
- Studies on Computational Biology
- Swarm Robotics
- Swarm intelligence
- Synthetic Biology
- Synthetic Biology medicine
- Synthetic Biotechnology
- Synthetic Genomics
- Synthetic biology drugs
- Systems Biology
- Technologies in Computer Science
- Theoretical Chemistry
- Theoretical Computer Science
- Theoretical Issues in Ergonomics Science
- Theoretical Methods
- Theoretical and Applied Science

- Total views:
**11608** - [From(publication date):

January-2014 - Nov 24, 2017] - Breakdown by view type
- HTML page views :
**7841** - PDF downloads :
**3767**

Peer Reviewed Journals

International Conferences
2017-18