A Statistical Approach to Correcting Cross-Annotations in a Metagenomic Functional Profile Generated by Short Reads

Background Categorizing protein coding sequences into one family, if the proteins they encode perform the same biochemical function, and then tabulating the relative abundances among all the families, is a widely-adopted practice for functional profiling of a metagenomic sample. By homology searching of metagenomic sequencing reads against a protein database, the relative abundance of a family can be represented by the number of reads aligned to its members. However, it has been observed that, for short reads generated by next-generation sequencing platforms, some may be erroneously assigned to the functional families they are not associated to. This commonly occurred phenomenon is termed as cross-annotation. Current methods for functional profiling of a metagenomic sample use empirical cutoff values, to select the alignments and ignore such cross-annotation problem, or employ summarized equation to do a simple adjustment. Result By introducing latent variables, we use the Probabilistic Latent Semantic Analysis to model the proportions of reads assigned to functional families in a metagenomic sample. The approach can be applied on a metagenomic sample after the list of the true functional families being obtained or estimated. It was implemented in metagenomic samples functionally characterized by the database of Clusters of Orthologous Groups of proteins, and successfully addressed the cross-annotation issue on both in vitro-simulated, bioinformatics tool simulated metagenomic samples, and a real-world data. Conclusions Correcting cross-annotation will increase the accuracy of the functional profiling of a metagenome generated by short reads. It will further benefit differential abundance analysis of metagenomic samples under different conditions.

significance of sequence similarity, which requires that the lengths of two sequences compared are sufficiently long [21]. Researchers had to use different "conventional" or "empirical" cutoff values to choose the alignments with significant sequence similarity, for example, BLAST E-value cutoffs 10 −3 , 10 −5 [7,9,22]. It has been observed that, a large part of homologues, which can be detected by BLAST searching with long reads, are missed by searching with short reads using these E-value cutoffs [22]. In our recent paper [23], we proposed taking a number between 63 and 68 (default as 66) of BLAST similarity score as the cutoff to choose homologues, when aligning short reads with ~100 bases against COG database. We further suggested, through comparing the alignment output by RPS-BLAST on the same sample, to estimate artificial COGs in the BLAST output after cutoff.
Zhang et al. [16] pointed out another issue in the read-count approach with short reads, that is, different functional families tend to have different proportions of wrong annotations. We observed the same problem when analyzing the in vitro-Simulated data set M_4X (details later). For example, in the BLAST searching output after filtration by the score cutoff 66, the counts of reads assigned to COG0642, COG5001, COG2199, and COG2200 are 12172, 18, 4584 and 3543 respectively. But, only partial numbers of these reads truly originate from the CDSs to which they are associated. There are 7757, 12, 1669 and 1117 such reads correspondingly. This indicates that a non-negligible proportion of aligned reads, for example 4415 (12172 minus 7757) reads to COG0642, are actually associated to other COGs. Meanwhile, we know that the true counts of reads to these four COGs should be 10573, 5071, 2241, and 1751 respectively. This implies that many reads from a COG, for example 2816 (10573 minus 7757) reads from COG0642, can be erroneously assigned to other COG families. These phenomena together define cross-annotation and are demonstrated in Figure 1.
The above examples show that the problem of cross-annotation is not trivial and will greatly affect the accuracy of the functional profiling if not being addressed properly. In this paper, we propose a method to mitigate the cross-annotation effect and improve the accuracy of estimates of read counts assigned to the functional families.

Methods
In construction of functional profile of a metogenomic sample by the read-count approach, given the total number of reads and the probability that a read is generated from a COG family, the expected count of reads originated from the family can be easily calculated following a multinomial rule. Thus, accurately estimating the probability that a read is generated from a COG can certainly reduce the cross-annotation effect. We apply Probabilistic Latent Semantic Analysis (PLSA, details next) to estimate these probabilities, and then the proportions of reads originated from the estimated existing COGs.

Input data
The metagenomic short reads with about 100 bases are BLAST (specifically, blastx) aligned against the COG database. A read is assigned to a COG family according to its best-hit association. The raw functional profile, consisting of the list of all detected COGs and corresponding relative abundances (quantified by the counts of reads assigned), may include artificial COGs and have the cross-annotation issue. Following the work-flow in Du et al., [23], the BLAST alignments with similarity score greater than 66 are retained and the artificial COGs are identified and removed. Furthermore, we treat a COG family as an artifact as well, if it has zero read count in the RPS-BLAST output after filtration by the similarity score 61. Then the input data for PLSA modeling consist of the following parts:

1.
The COGs, to which the sequencing reads have been aligned;

2.
The count of reads assigned to each COG in (1);

3.
The estimated existing COGs (that is, the non-artificial COGs), and one extra family which covers the CDSs that exist but are not classified into COG families, and all the existing non-coding sequences in the sample.

PLSA modeling
PLSA is a statistical modeling technique originally developed for information retrieval from text collections [24]. In the following, we will show how PLSA modeling is used to correct the cross-annotations. Suppose that the metagenomic sequencing reads are aligned to N different COG families, of which there are M truly existing COGs (M ≤ N). Define A: one of the N COGs, denoted by a 1 , a 2 … a N , to which a read is aligned; T: one of the M COGs, denoted by c 1 , c 2 … c M , from which a read originates; α ij : the probability of a read being aligned to a i given that it originates from c j , that is, α ij =P(A=a i |T =c j ) or P(a i |c j ); β j : the probability of an aligned read being from the COG c j in the metagenomic sample, that is, β j= P(T=c j ) or P(c j ); y i : the observed count of reads being aligned to the COG a i ; t ir u : The unobserved value of T for the r th read aligned to COG a i , where r=1,…, y i .
Then, the probability of a read originating from c j and being aligned to a i is However, it is unobservable which COG an aligned read originates from. Thus, for any COG a i and the corresponding count y i , the probability that the r th read (r=1,…, y i ) is from one of the M COGs, c 1 , c 2 … c M , and aligned to a i can be written as: For any i ∈ {1,2,…, N}, under the assumption that a read being aligned to COG a i is independent of another read being aligned to a i , we have the following likelihood function of (α i1 , α i2 ,…, α iM , β 1 , β 2 ,…, β M ): Further assume that a read being aligned to COG a i is independent of the read being aligned to another COG, then the likelihood function and the log-likelihood function of (α, β) are where α denotes the vector (α 11 , α 12 , ⋯, α 1M , α 21 , α 22 , ⋯, α 2M , ⋯, α N1 , α N2 , ⋯, α NM )′; β denotes the vector (β 1 , β 2 , ⋯, β M )′; y denotes the vector (y 1 , y 2 , ⋯, y N )′; and t u denotes the vector Given the observed counts {y i }, our goal is to find the estimates (Maximum Likelihood Estimates, MLEs) of parameters α, β. However this could not be done by maximizing the likelihood directly since t u , the realization of T, is unobservable. Nevertheless, by treating T as a latent variable, we can apply the Expectation-Maximization (EM) algorithm to search for the MLEs of α, β. Next, we describe in detail the iteration steps of the algorithm, but postpone the setting of the initial values to Section 4.3.
E step-In this step, we calculate the expected value of the log-likelihood function with respect to the condition distribution of T |y, θ (k) , where θ (k) stands for the current estimate of θ=(α,β)′. By Bayes' rule, for a read being aligned to COG a i , the conditional probability that it is from COG c j is M step-In this step, we seek the maximizer of Q(θ|θ (k) ), that is, find θ(k+1) = argmax Q(θ| θ (k) ).
Using the Lagrangian method to maximize this function with respect to α ij s, subject to the constraint ∑ i = 1 N α i j = 1, j = 1, 2,…, M, we obtain the unique stationary point: Maximizing Q(β|θ (k) ) with respect to β j s, subject to the constraint ∑ j = 1 M β j = 1, we obtain the unique stationary point: where j=1,…, M. For any parameter, the iteration continues until the absolute change of two consecutive estimates is less than 10 −6 .
After the convergence of E-M iterations, PLSA modeling constructs the below decomposition of the vector of observed read counts, by introducing the latent variables: The approximation symbol is used to reflect the fact that the left hand side is a vector of integers (counts), while the decomposition in the right hand side may result in non-integer output. The MLEs of β j 's will then serve as the estimate of the proportions of the estimated existing COG families. Then, the estimated read count to COG c j can be computed as

Statistical learning about the initial values for PLSA modeling
Generally, the result by iterative MLE approach is sensitive to the initial values, since the algorithm may reach the local maximization. Two assumptions have been made in order to learn the initial values of parameters. First we assume that, for the reads originating from the CDSs associated to a common COG, the frequencies of reads assigned to different COGs are similar across samples. Second, for the reads aligned to a common COG, the frequencies of the reads originating from CDSs associated to different COGs, which appear in considered samples, do not change dramatically either. Thus, we can learn the distributions from one simulated metagenomic sample, and then use the learned distribution to set the initial values for PLSA modeling for another simulated or real sample.
The learned α i j L was computed as the percentage of the reads being aligned to COG a i among the reads originating from COG c j , in the learning sample. Let γ ji be the conditional probability of a read originating from c j given it being aligned to a i . Empirically, the learned γ ji L was calculated as the observed relative frequency of reads originating from c j in all the reads assigned to a i . In the following, we describe in detail how to set the initial values for PLSA modeling in a sample different from the learning sample.
The initial value of α

1.
For an estimated existing COG family c j , which is also present in the learning sample, we directly take α ji L as the initial value if the corresponding aligned COG appears in both samples. Otherwise, if the corresponding aligned COG appears in the new sample only, the initial value is set as the ratio of the remaining probability, 1 − ∑ i ∈ ∩ α i j L , and the number of the aligned COGs that appear in the new sample only, where i ∈ ∩ means that the summation is over all the aligned COGs appearing in both samples.

2.
For an estimated existing COG family shown in the new sample but not in the learning sample, we set the equal initial value as probability 1 N for each aligned COG.
The initial value of β

1.
For an estimated existing COG family c j , which is also in the learning sample, the initial value for β j L is set as, where i ∈ ∩ has the same meaning as above.

2.
For the estimated existing COGs shown in the new sample only, they share the same initial value, that is, the ratio of the remaining probability, 1 − ∑ j ∈ ∩ β j L , and the number of the estimated existing COGs appearing in the new sample only.

Results from the in vitro-simulated metagenomic data set
We used the simulated metagenomic datasets M2 and M3 in [25], which are 4X readcoverage data, and named M2_4X and M3_4X here. In the simulations, sequencing reads with about 100 bases were produced for different preselected bacterial genomes by 454 GS20 platforms. The description about these genomes is given in Supplementary file. These data were generated through a genuine sequencing process; therefore they can best capture the characteristics of the sequencing errors introduced by 454 GS20 platforms. The related genome references were downloaded from NCBI website, with the files that contain the locations of COG coding sequences (COG-CDS) on the genomes. BLAST (that is, blastn) was applied to align the short reads against the references. The best-hit alignment with identical match greater than 95% determined where a read comes from (genome, location), otherwise the read was excluded. If the location of a read overlaps with the coverage of a COG-CDS by at least 60 bases, we consider this COG as the correct annotation for the read.
Following the steps given in Section 2.1, M3_4X were BLAST aligned against the COG database, and the output alignments with similarity score greater than 66 were kept to serve as a learning sample. M2_4X is the data set we used to evaluate the proposed methods. In Figure 2 we compare the propositions of COG families generated with ("After PLSA" in the plot) and without ("Before PLSA" in the plot) our proposed method to the true propositions. the total variation distance (DTV): [26][27][28]. For each of the four measurements, the lower the value, the more accurate the method is. Both panels in Figure 2 indicate that the accuracy of the functional profiling of M2_4X is further improved by applying the PLSA method.
We also compared the functional profiles, in Figure 3, generated separately by our PLSA modeling method, the method proposed in Zhang et al., [16], and the two currently used Evalue cutoff methods (1e-3, 1e-5). Note, in order to compare different methods, the true proportions are generated using all the sequencing reads with detected genomic locations. Due to the facts that Zhang's method was summarized from a single simulation, and that the problems of both artificial families and cross-annotations are ignored by the E-value cutoff methods, the abundance proportions of certain COGs are skewed greatly in the profiles generated by these methods. In the true profile, COG2814 is ranked the fourth abundant family; however, it is ranked the 27th in Zhang's method, the 196th and 243th abundant in the profiles by E-value cutoff at 1e-3 and 1e-5 respectively. In the profile by our method, this functional family is correctly annotated as the fourth abundant one. Similar situations can be observed for the families: COG4191, COG5001 and COG2271 (left panel in Figure 3). For example, COG5001 is ranked the 14 th , 13 th , 2246 th , 2230 th and 2215 th abundant respectively in the true profile, the profiles by our method, Zhang's method, E-value cutoff at 1e-3 and 1e-5. Its actual proportion of 0.0047 is closely estimated as 0.0056 by our method; but the estimation drops dramatically to 3.9e-6, 6.6e-6 and 4.6e-6 respectively in the other three profiles, erroneously indicating that the family is very trivial. On the other hand, we observed that certain trivially abundant entries in the true profile, such as COG0784 (1866 th ), COG0067 (2024 th ) and COG0506 (2255 th ), become non-trivial in profiles by Zhang's method and the methods of E-value cutoff (not appear in the plot). As an example, COG0784 becomes non-trivial (ranks the 77 th , 27 th and 70 th respectively) in the profiles by the three methods. Evaluated by the above four measurements, the estimate of the complete functional profile by our PLSA method also shows the best accuracy (the lowest bar in the right panel in Figure 3).

Results from metagenomic data set simulated by a bioinformatics tool
The numbers of species in the samples in Section 5.1 are 7 (M2_4x) and 8 (M3_4X), usually smaller compared to the real-life metagenomic data. Thus, in this subsection, we would evaluate the proposed method on a metagenomic sample with large species diversity. We randomly selected 100 NCBI bacterial genome accession numbers (in the format of NC_######), among which 57 genomes were excluded in the simulation since they were for plasmid DNA sequences. MetaSim, a bioinformatic tool to simulate sequencing reads according to selected genomes, was used to generate the metagenomic data set, called Simu. The data set contains the short reads of 43 genomes with coverage one under the simulated conditions of 454 GS20 sequencing platform (see a brief description about these genomes, and the parameters used for simulation in Supplementary File). Since we know exactly where a read originates from, we did not use the blastn step as we did in Section 5.1. As before, the correct annotation of a read is defined as the COG whose coverage overlaps with the location of the read by at least 60 bases. To apply our method for modelling this simulated data set, we selected the learning data as the combination, called M_4X (available upon request), of all the three 4X read-coverage data sets of the simulated metagenomes M1, M2 and M3 in [25]. The reason we combined these 4X read-coverage data as the learning set is that this would provide us with more observed reads originating from a common COG, thus we will have a better statistical learning result about the distribution of these reads being aligned among COGs. On the other hand, with more reads being aligned to a common COG, we would have a better learning about the distribution of reads originating from COGs as well. The results presented in Figure 4 exhibit the profiles of COG families generated before and after applying the proposed PLSA modeling method. The finding is similar to those from Figure 2 in that the accuracy of the functional profiling can be improved by the proposed method, except the AVGRE measurement. A partial explanation for this is that the proposed method smooth the relative errors.
The comparison between PLSA modeling approach and the other three methods using data set Simu is presented in Figure 5. It is clear that (the left panel) the proportion of COG2814 is poorly estimated by the three methods (the true proportion: 0.0082; the proportion by Zhang's method: 0.0017; the proportion by cutoff 1e-3: 0.00048; the proportion by the cutoff 1e-5: 0.00045). The estimation is greatly improved to 0.0057 by our method. As a trivial abundant family, COG0784 has the true proportion of 1.9×10 −5 , which is estimated as 3.07×10 −5 , by our method (not appear in the plot). Its abundance is greatly inflated to a significant entry in the profiles generated by the two E-value cutoff methods (the cutoff 1e-3: 0.0013; the cutoff 1e-5: 0.0011). For the estimate of the complete functional profile, PLSA modeling method provides the best accuracy giving the lowest error in each of four measurements (the right panel in Figure 5).

Application of the proposed method on a real data set
A picoplanktonic sample was collected from 25m-depth seawater at the Hawaii Ocean Time Series (HOT) station on March, 2006. It was then sequenced with 454 GS20 machine to yield 385,193 short reads of 108 bases long on average [2]. We name this data set as HOT25. By using M_4X as the learning data, we applied he proposed PLSA modeling approach to correct the cross-annotations in its BLAST output (the one after filtration with similarity score cutoff 66). For the top 20 most abundant COGs estimated by PLSA modeling method, Figure 6a shows the discrepancy of the abundances given by the four approaches (the PLSA modeling method, Zhang's method, E-value Cutoffs 1e-3 and 1e-5). Unlike the simulated data, prior information on COG families for real data are not available, and thus, cannot be used to show the closeness of these profiles to the true one. The comparison of the complete functional profiles is displayed in Figure 6b, with the profile generated by Zhang's method being excluded since it is too different to compare. We can see that some COGs are estimated as very trivial ones by PLSA modeling method, but significant ones by E-value cutoff methods (the red/orange triangles close to the right tail of the blue curve). Table 1 lists three functional families, COG1028, COG0642 and COG0477, with corresponding proportions estimated by the methods and the ranks of abundances in each generated profile. A recent study has detected COG1028 is the most abundant COG family in the metagenomic samples from HOT station, the second abundant in the samples collected from western Arctic Ocean, and the third in the samples from the coastal water near Cape May, NJ [29,30]. The COG1028 belongs to the COG category I, Lipid transport and metabolism, demonstrating the universally important roles in different latitude of seawater. The PLSA modeling method also ranks COG1028 as the most abundant one in the HOT25 sample, but the other methods do not reach to the same conclusion. The family COG0642 comes from COG category T, Signal transduction. This mechanism is important for microbes to cope with changing environmental conditions. The role of COG0642 has been examined in many seawater related metagnomic projects; its abundance level is found varied in different depth of water since the environmental stimuli, such as temperature and sunlight, are different [31][32][33][34]. For the HOT25 sample, the COG0642 is estimated as the 51 st abundant family by the PLSA modeling method, but ranked very differently, 86th and 131th separately, by the E-value cutoff methods. There are also some research records about COG0477. In order to understand how bacterioplankton transform dissolved organic carbon in marine systems, Mou [35] conducted metagenomic analysis of bacterioplankton enriched with dimethylsulfoniopropionate (DMSP) and vanillic acid (VanA). Sequencing reads with an average length of 97 bases were obtained by pyrosequencing. The reads were aligned to COG database, and the abundance of each COG family was obtained. Furthermore, PCRbased 16S rDNA analysis was also carried out in the same project. For COG0477, its abundances in both DMSP and VanA samples were found very high by the metagenomic approach; but, interestingly there are no genes associated to COG0477 in the genomes detected by 16SrDNA analysis using the same samples. This is another supporting evidence about artifacts and cross-annotations when short reads being annotated. Note that different from E-value cutoff methods, both PLSA modeling method and Zhang's method give zero abundance to this family.

Discussion
Due to the fact that a microbial community usually includes multiple strains or similar species, the algorithms for assembling sequencing reads generated from a single genome are not applicable to metagenomic reads. On the other hand, accurate assembly really depends on sufficient sequencing depth [36], while a metagenomic sample generally consists of data with lower sequencing depth. Prior to 2013, the development of metagenomic assembly was evaluated as "in its infancy" [37,38], "at an early stage" [36]. And, to our knowledge, since 2013, there has been no breakthrough that can provide a widely accepted assembling tool. Therefore, we chose to analyze the unassembled short reads directly in this study. There might be an argument that assembled reads would reduce the cross-annotations. However, as a recent study indicated, there is a price for using assembled reads -it can bring in a considerable proportion of chimeric contigs [39], which is even harder to deal with in our opinion. Analysis of unassembled metagenomic reads is one of the approaches currently employed to study microbial communities [40][41][42][43]. Our method can be adapted to handle short reads with different lengths (e.g. ~200 bases), given a good alignment cutoff value and a trustable learning data set. We conducted the study specifically on reads with ~100 bases owing to two reasons: first, the alignment cutoff value for reads with ~100 bases has been suggested [23], and PLSA modeling method was applied on the filtered result; second, through literature search, the in vitro-simulated metagenomic samples are only available with this length range. Should similar samples with other lengths become available, we will enlarge the application scope of the method.
In addition to the issues of artificial COGs and cross-annotations, it has been reported in the literature that another problem exists with read count bias in metagenomic data [16]. Briefly speaking, the count of reads aligned to a COG family is correlated with the lengths and the conservations of COG-CDSs associated to the COG. This bias may have impact on the accuracy of the functional profile of COG families and deserves further investigation. To study and correct the read bias is one of our future research topics.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.       Table 1 Proportions and ranks of three COGs by different methods.