Ranking of Prokaryotic Genomes Based on Maximization of Sortedness of Gene Lengths

How variations of gene lengths (some genes become longer than their predecessors, while other genes become shorter and the sizes of these factions are randomly different from organism to organism) depend on organismal evolution and adaptation is still an open question. We propose to rank the genomes according to lengths of their genes, and then find association between the genome rank and variousproperties, such as growth temperature, nucleotide composition, and pathogenicity. This approach reveals evolutionary driving factors. The main purpose of this study is to test effectiveness and robustness of several ranking methods. The selected method of evaluation is measuring of overall sortedness of the data. We have demonstrated that all considered methods give consistent results and Bubble Sort and Simulated Annealing achieve the highest sortedness. Also, Bubble Sort is considerably faster than the Simulated Annealing method. Journal of Data Mining in Genomics & Proteomics J o u r n a l of D ata Mi ning in Gmics & rot e o m i c s ISSN: 2153-0602 Citation: Bolshoy A, Salih B, Cohen I, Tatarinova T (2014) Ranking of Prokaryotic Genomes Based on Maximization of Sortedness of Gene Lengths. J Data Mining Genomics Proteomics 5: 151. doi:10.4172/2153-0602.1000151


Introduction
It is natural to group homologous genes into well-defined categories, into gene families. There are several existing approaches. The most popular collection is the database of Clusters of Orthologous Groups (COG) of proteins, containing a comprehensive collection of prokaryotic gene families in complete genomes. Generally, proteins belonging to the same functional family have high sequence similarity; however, their lengths may be substantially different [1][2][3]. It is not clear how variations of gene lengths (some genes become longer than their predecessors, while other genes become shorter and the sizes of these factions are randomly different from organism to organism) depend on organismal evolution and adaptation. To answer this question we propose to rank genomes according to lengths of their genes, and calculate coefficients of association between genome rank and genome property. The main purpose of this study is to find a computationally effective ranking method that consistently gives reasonable results.
In this work we analyze four methods: Average ranking, Simple Additive Ranking, Bubble Sort and Simulated Annealing. It appearsthat the four selected methods give similar results applied to the same genomic set. Comparing the sortedness obtained by application of the four ranking methods to the matrix of gene lengths, we have found that two Monte Carlo heuristics (Bubble Sort and Simulated Annealing) show the best values of sortedness. Therefore, Bubble sort is a preferable method because it is considerably faster than the Simulated Annealing method.
This paper is organized as follows. Section 2 describes the formal definition of genome-ranking problem, general description of ranking methods, and structure of input data. It also introduces terminology used throughout this paper. Section3presents a detailed description of the selected ranking algorithms, and genomic data as an input to them. Section 4 presents the results and discussion of them. Finally, Section 5presents our conclusions.

Problem Formulation Ranking
The manuscript is motivated by the following problem: Given a set of objects, with each object described by a set of attributes, the rationale is to find out whether associations between a group of certain object attributes, which we call "object descriptors" and other attributes, which we call "object properties", are significantly non-random. There are different ways to do this [3][4][5]. Two main approaches are clustering and ranking.Using measures of similarity between the objects based on similarity between the sets of attribute values (object descriptors) one can cluster the objects(for example, genome-classification related applications in [5]), and after that apply methods of multifactor analysis to reveal which object properties are associated with different clusters. We explored this approach, taking genomes as objects and gene lengthsas object descriptors [5][6][7], and found that performed classification closely related to phylogeny and/or taxonomy of prokaryotes. Alternatively, one can work with the precedence relations between the descriptors to rank the objects for further analysis whether object properties correlate with the ranking [3].
A ranking is a relationship between a set of objects such that, for any two objects, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second. In a framework of our approach [5][6][7], every object is defined by a sparse vector of attribute values. So, a sparse matrix presents a set of objects. Permutations of the rows generate different orderings of the objects. There are different approaches to define the optimal order of rows in the matrix (the best ranking). Here we present two possible attitudes: maximization of the sortedness, measured as a minimum of the number of discordant pairs, and the Kemeny-optimal ranking [8,9]. In both cases,Kendall tau distance, based on the Kendall tau correlation coefficient, is used.

Sortedness
To compare quality of ranking obtained as results of different procedures, it is essential to select an appropriate measure of sortedness. Note, that the ranking procedures produce different permutations of rows of the same matrix. The task is not trivial, since each column has an individual sortedness, and ranks of elements in one column do not necessarily agree with the ranks inother columns.
Let us start with the definition of array sortedness and then expand it to matrices. There are several measures for quantifying the sortedness of an array [10]. Here, we present the three most popular measures: DISC, LIS and ED.
a) The number of inversions in a sequence η, denoted by DISC (η), is defined as the number of pairs of items in η which violate the natural ordering property (number of discordant pairs). b) Another measure for sortedness is the length of the longest increasing subsequence in η, denoted by LIS (η).
c) The third measure is an edit distance to monotonicity, denoted by ED (η), defined as the minimum number of single item deletions needed to reach a sorted sequence η' (it can be easily verified that in fact ED (η)=n-LIS (η)).
Among the three definitions of sortedness, the LIS and ED are more suitable for data stream applications and are not applicable for our purposes, while DISC is as a naturally appropriate definition of the sortedness of a matrix. A pair of numbers either follows the natural ordering or violates it. To apply the DISC criteriato array to a matrix we first should define the precedence rules for the rows, which may be determined in a few different ways, especially when a matrix is sparse: 1) The relationship between two rows r1 and r2 is determined by simple un-weighted comparisons of the elements common to both rows (both elements are not equal to zero): r >r is true then r 1 precedes r 2 ; if for the majority of common attributes the condition is true,then r 2 precedes r 1 ; otherwise r 1 and r 2 are tied.
2) The relationship between two rows (objects) r 1 , r 2 is determined by the sign of the sum of differences between non-missing elements of the rows: Having the definition of the precedence relations, we may count all contradictions of the type "r 1 precedes r 2 but r 1 ranked higher than r 2 "´in the given ranking, i.e. the number of discordant pairs. An optimal ranking will have a minimal number of discordant pairs of objects.

Kemeny-optimal ranking
A complementary to the sortedness approach is an optimal rank aggregation approach. The approach is to combine k different complete ranked lists of the same set of n elements into a single ranking, which best describes the preferences expressed in the given k lists. This problem dates back to as early as the late 18th century, when Condorcet and Bordain dependently proposed voting systems for elections with more than two objects [11,12]. There are numerous applications in sports, databases, and statistics [13,14] in which it is necessary to effectively combine rankings from different sources.
In the last decades, rank aggregation has been investigated and defined from a mathematical perspective. In particular, Kemeny [8] proposed a precise criterion for determining the "best" aggregate ranking. Given n objects and k permutations of the objects, {π 1 , π 2 , . . . , π k }, a Kemeny optimal ranking [8,9] of the objects is the ranking π that minimizes a "sum of distances", , where d(π,π k ) denotes a distance between a "total" ranking π and an "individual" ranking π k . Usually, either Kendall's τ rank-correlation coefficient or Spearman's ρ rank-correlation coefficient is used to find a distance. Kendall's τ is calculated as the difference between the number of concordant and discordant pairs divided by the total number of pairs. For our purposes, we should mention that a Kemeny optimal ranking minimizes the number of pairwise disagreements with the given k rankings.
It is known that finding a Kemeny optimal ranking is NP-hard [15] and remains NP-hard even when there are only four input lists to aggregate [13]. This motivates the problem of finding a ranking that approximately minimizes the number of disagreements with the given input rankings. Several approximation algorithms are currently used [13,16].

Solving of the optimization problem
Kemeny optimal ranking may be formulated in terms of solving an optimization problem using either Kendall's τ rank-correlation coefficient or Spearman's ρ rank-correlation coefficient. As described above, these coefficients provide measures of the degree of correspondence between two ranking vectors. In particular, they assess how well the natural ordering property of the vectors is preserved.
Comparing two pairs of equations (Eq. 1, Eq. 3) and (Eq. 2, Eq. 2) we can see that searching for the maximal sortednessis equivalent to finding the Kemeny optimal ranking in terms of combinatorial optimization.
For our goals Kendall's τ coefficient is more suitable than the Spearman's coefficient. However, finding a solution that maximizes the Kendall's τ rank-correlation coefficient is a difficult, NP-hard [15], task. Therefore heuristic methods, such as the Monte Carlo method [17] and the Simulated Annealing Procedure [18], are frequently used.

Monte carlo methods
In this section we will make aslightly artificial distinction between the Monte Carlo Method (MCM) and the Simulated Annealing (SA) procedure. The Monte Carlo method was developed in the late 1940s by Stanislaw Ulam while he was working on the nuclear bomb project at Los Alamos. It was named by Nicholas Metropolis after the Monte Carlo Casino [19]. In this study we propose to apply MCM to select the best B-rank ranking.
MCM follows the following steps: Define a domain of possible inputs.
Generate inputs randomly from a probability distribution over the domain.
Perform a deterministic computation on the inputs.
Aggregate the results.
The Metropolis-Hastings algorithm is a Monte Carlo method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult or impossible. The algorithm was named after Nicholas Metropolis, who was the first author of [18], and W. K. Hastings, who extended it to the more general case in 1970 [20]. The Simulated Annealing procedure is an adaptation of the Metropolis-Hastings algorithm. The method was described by [21] and by [22]. This method simulates behavior of a physical system, the internal energy of which is to be minimized. The goal of the procedure is to bring the system from an arbitrary initial configuration to a configuration with the minimum possible energy. At each step, SA considers some neighboring configuration ψ' of the current configuration ψ and probabilistically decides between moving the system to configuration ψ' or staying in configuration ψ. These probabilities ultimately lead the system to move to configurations of lower energy.

Gene Length-based model
In the works of Bolshoy et al. [5,6] a "gene length based" model was introduced. It is an algebraic model to represent genomes as vectors of genes. The set of genomes is represented as a matrix, in which each row stands for a genome and each column stands for a gene family, and each item stands for the length of a member of a gene family i in a genome j. In our study, the objects are prokaryotic genomes; the descriptors are the lengths of the genome proteins indexed according to the certain database. (We usethe database named "Clusters of Orthologous Groups of proteins database", see below in Materials and Methods.) Rankings of the selected genomes are performed using the descriptors in order to calculate coefficients of association between a genome rank and a genome property. The descriptors are of a homogenous nature-they're all positive integers. There are different types of genome properties, i.e., a prokaryote is either Archaea or Bacteria; an organism may be hyper thermophile, thermophile, psychrophile or mesophile; a genome has a certain GC-content, and so on. There are different types of data: Nominal, Ordinal, Interval, and Ratio. There are different types of data of genome properties as well: Kingdom is a nominal dichotomous (binary).
Variable: thermophilicity is an ordinal variable; GC composition is a ratio variable; and number of genes is an interval variable.

COGs database
The presented procedures are evaluated on the subset of the database of Clusters of Orthologous Groups of proteins (COGs) [1,2,23]. The principles of the database construction are described by [23]. Briefly, the COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The data in COGs are updated continuously following the sequencing of new prokaryotic genomic sequences.
COGs database is a comprehensive gene-family definition database, developed to classify all conserved genes based on their homologous relationships and evolutionary development [23]. Each COG consists of at least three proteins assumed to have the same evolutionary counterparts. As described by [23], the COGs database is a growing and useful resource to identify genes and groups of orthologs across prokaryotic species that are related by evolution. Information about every completely sequenced and annotated prokaryotic genome is stored in the PTT-formatted files. The PTT file format is a table of protein features, prepared by the National Center for Biotechnology Information (NCBI). The complete collection of current PTT files can be found at ftp://ftp.ncbi.nih.gov/genomes/. Organization of the PTT files is summarized in Table 1 [24].
From every suitable NCBI PTT file, we extracted information about length (column 3) and COG (column 8). We added the genome index and a nominal binary identifier (chromosome, plasmid) to get a file describing the complete set of gene lengths across all available prokaryotic genomes. After the processing of the PTT files, two files were obtained.
One contained the names of genomes with a record format /integer, string/ <genome_index, genome_name> (for example, 22, Bacillus amyloliquefaciens dsm 7). The other was a gene-length file with a record format /integer, integer, integer/ <genome_index, COG_index, protein_length> //for example, 2, 1474, 411//. These data were sorted by COG_index, genome_index, protein_length in ascending order.All currently available genomes were described in these two files. To check the ranking procedures described below we used small subsets of this dataset.

Pre-processing procedures
To get an input file for further ranking the following pre-processing procedures were applied: 1. Selection of subsets of genomes. A subset may be defined applying different criteria: it may be either a representative sample, a taxa-specific subset, or randomly chosen genomes.
2. Application of a filtering parameter (an entry threshold) on a selected subset. Only COGs containing more than a threshold number of genomes are considered for further processing. For example, if the filtering value is equal to 20% and an amount of genomes in a subset is equal to 500, then only COGs containing at least 100 genomes are considered (passed the entry threshold).
3. Sampling: If there are multiple instances of a COG related to the same genome, a median length value for all paralogs (triplets <genome_index, COG_index, protein_length> from the same genome and from the same COG) is used for further processing.

Set of genomes
To compare performance of the methods, we used the same dataset as in our previous publication [3]. This small set contains 9 Archaeal and 91 Bacterial genomes. Table 2 of [3] briefly describes these genomes.

Average ranking method (A-rank)
Given a matrix A MxN where A i,j is the value of j th descriptor of the i th object, the average ranking method works this way: for each object i the average ofall its descriptor values are calculated, which determines the rank of object i relative to other objects. All missing values are ignored.

Simple additive ranking (SAR)
SAR or, alternatively, SAW (Simple Additive Weighting) is the oldest, most widely known and practically used method [25,26]. The method integrates several criteria into a single-weighted value. This is reflected in its name. In the framework of this method the object gets its rank through a simple addition of weighted ranks obtained by sorting of individual attributes. First, weights based on the importance of various attributes are assigned. Second, the ranks within the attribute are scaled. This means that whatever the ranges of the intra-ranks of the particular intra-attribute rankings are, they should each be converted to a comparable scale. For example, if there are 100 objects, one attribute has a ranking scale from 1 to 10 and the other has a ranking  Once all attribute rankings cover the same scale, they can be multiplied by their respective attribute weights. The "utility" for each object is defined by adding the scaled weighted scores across the attributes with further dividing by the number of contributing attributes. Objects are sorted in order of ascending utility.
Here is how SAR strategy is applied to the COGs dataset. All COGs are assigned a number. Given a matrix A MxN where A i,j is the value of j th descriptor of the i th object, the ranking is based on an individual ranking of each object based on the weighted sum of ranks for each descriptor separately. Thus, the eventual rank of the i th object, R i , is calculated as Where j refers to descriptor number and w j is the weight of the j th descriptor and r ij the ranking of the i th object with regard to the j th descriptor. Subsequently the ranks are scaled (normalized). In case of a sparse matrix, r ij of a missing descriptor value takes the fixed rank value M/2 while rij of non-missing descriptor values are calculated regularly and then scaled uniformly to the range [1..M]. Note that we used the version in which w j =1 for every j.

Bubble-sort ranking (B-rank)
The strategy of Local Pairwise Interchange (LOPI) amounts to choosing a pair of objects, interchanging them, and evaluating a quality of ranking t(ψ) for the changed ranking [27]. If the new ranking is better than a previous one, then the new ranking is accepted. The procedure is stopped if there is no interchange that may improve t(ψ). LOPI does not guarantee global optimality, but it is very efficient [28], and being enhanced by the Monte Carlo technique [17] brings sufficiently good results. (Randomness is introduced through the random choice of the initial configuration.) In a simulation study by [28] the LOPI strategies found a global maximum of t (ψ) in the majority of the cases.
As a LOPI strategy we apply here the regular "bubble sort" procedure [29] interchanging the rows of a given matrix. The order of any two rows, r 1 and r 2 , under this method is determined by the sign of the scalar G(r 1 ,r 2 ), such that if G(r 1 ,r 2 )<0, then r 1 precedes r 2 , and if G(r 1 ,r 2 )>0, then r 2 precedes r 1 , otherwise no change to their original order.

Simulated annealing
For large datasets an exhaustive search is computationally demanding and not feasible. For example, in order to use direct maximization for ranking of N=1390 genomes, one needs to examine N! ≈ 10 103765 configurations and calculate "system energy" for every configuration. Because of this, heuristic methods such as the Simulated Annealing method [21] are frequently used. Simulated Annealing Procedure is a general probabilistic meta-heuristic for the global optimization problem of obtaining a good approximation to the global optimum of a given function in a large search space.
Simulated Annealing models a process of heating a material and then slowly lowering the temperature to decrease defects, thus minimizing the system energy. At each step, a pair of objects is randomly chosen, than the objects are interchanged, and a quality of ranking τ for the changed ranking is evaluated. The algorithm accepts the proposed ranking that increases τ, but also, with a certain probability, rankings that lower the τ. We used acceptance probability function in the form The algorithm was implemented in R using the mpiR package to enable parallel processing using a high performance computer cluster for large datasets. By accepting points that lower the objective, the algorithm avoids being trapped in local maximum in early iterations and is able to explore globally for better solutions.

Results and Discussion
The Average ranking, the Simple Additive Ranking, and the Bubble sort ranking methods were applied both to the non-filtered input (matrix of size 100×5664, Figure 1) and to a smaller matrix (filtered version: excluding columns that contain more than 65% null values given a matrix of size 100×1455, Figure 2). Simulated Annealing procedure was applied only to the smaller matrix because of computational complexity. Each procedure produced a certain order of rows in matrix, a ranking vector X. Calculating of the Kendall tau correlation coefficients between the ranking vector X and each column of the matrix yielded distribution of correlations between the global ranking and individual COGs. These distributions are shown in Figures  1 and 2.
Distribution of τ' before the ordering (random ranking) is shown as a blue line and the post-ordering distributions are shown as bar graphs. Unordered genomes produce distributions of τ that are centered at zero. Application of all four ranking algorithms to this set resulted in right-shifted distributions of τ. The distributions of τ in Figures 1 and  2 are bell-shaped and are similar to each other, and the post-ordering distributions were all distinctly shifted to the right.
We conducted the Shapiro-Wilk test of normality for the unordered genomes and found that p-value is 0.19>0.05, supporting our visual examination that the randomly ordered genomes have approximately normal distribution of τ. All ordered distributions fail the normality test (p-value<0.05). For the unordered distribution, the skewness is 2.0703. The ordered distributions have larger absolute values of skewness, which are ≈ -5 ( Figure 1).
Since the distributions in Figures 1 and 2 are approximately normal and our sample sizes are large, we used t-test to find the significance of the differences of the means (shown in Table 3). Difference between the means of ordered and unordered sets is around 0.2 for all four methods (p-value <10 -16 for all four comparisons) [6,30]. It should also be mentioned that (a) all τ values for the filtered matrices are higher than their counterparts calculated for the complete matrices, and (b) the Simulated Annealing approach results in the best ordering, with Bubble sort being a close second [31][32][33][34] (Table 3).
It has already been mentioned [3,4,35] that A-rank is not likely to produce valuable results as compared with other ranking methods mentioned in this study. The difference between Simulated Annealing and Bubble Sort ranking is 0.002, which is not significant, given p-value of the t-test is 0.9854. We do not expect different ranking methods to yield identical genome orders, since different methods use different criteria for ordering.
Significant differences in genome orders produced by different ranking methods are expected, since the four methods use different criteria for ranking. However, we find that the four methods are highly correlated (Table 4). B-rank (Bubble Sort) and SA-rank (Simulated Annealing) are the most similar rankings (τ=0.84 for the filtered datasets). Table 4 also shows that filtering improves the agreement between different ranking methods by removing less abundant COGs that skew the results.
For all four methods there is significant average correlation between global rankings and individual COGs lengths. The filtering procedure results in the distributions without heavy left and right tails. COGs in the two extremes of the distribution have a small number of genomes and addition of these scores to the total τ may create a bias. The comparison of Figures 1 and 2 clearly demonstrates necessity of a filtering stage among the preprocessing procedures. We should not be surprised by this observation, since have already demonstrated that COGs containing very small amount of genomes ("unpopulated" COGs) should not be considered in phylogenomic methods. The "Forest of Life" concept was developed by [31][32][33][34]. They showed that there is a general evolutionary trend in the "Forest of Life" as well as in the "Tree of Life", but gene trees constructed from unpopulated COGs poorly contribute to and frequently contradict it. Now we demonstrated that the same is true while ranking prokaryotic genomesunpopulated COGs should be excluded. Note that although histograms A-D in Figure 2 are very similar, this does not imply identical genome orderings. Table 2 shows the genomes in ascending order as dictated by the SAR technique: genomes with shorter genes occupy the top portion of the table. Several features of the Table 2 are quite remarkable. First, the columns are not identical; moreover, they are not even nearly identical. Nevertheless, the SAR ranking at the bottom part (longer genes) amazingly coincides with all of the other three ranking methods. This is not the case for the top part of the SAR ranking. While all hyperthermophiles that have high SAR ranks (Thermotogae, Thermoplasmata, Aquifex, and Archaeoglobus) also have high SA and B ranks as well, Bacilli that have high SAR ranks (ranks 1, 6, 9) have much lower ranks by the SA ranking method (17, 19, 24, correspondingly).
Performance for Bubble Sort is O(n 2 ), and for Simulated Annealing is O(K(n 2 +n) log n) [35,36], where K is the number of COGs and n is the number of genomes. Therefore, we conclude that Bubble Sort is faster than Simulated Annealing and produces nearly identical ranking ( Table 4).
As we can see from the Table 3, the Average ranking method results in the lowest value of sortedness (0.15114 for the complete COG's dataset and 0.19459 for the filtered dataset). It also has the lowest correlation, in the vicinity of 0.5, with the other three methods (Table  4).
Motivated by the performance of Bubble sort for the 100-genome dataset, we applied it to the set of 1390 genomes. We used the same 35% filtering cut-off as for the 100-genome dataset. Here we show some preliminary results of ordering the big set of genomes ( Figure 3) for several groups of genomes.
• Crenarchaeota, phylum of Archaea, tend to have shorter genes ( Figure 3, top left). Almost all sequenced genomes of this phylum are related to hyperthermophilic species. Probably, it is the main factor affecting gene lengths of the species of this phylum, but contribution of other factors is an open question.
• Halobacteriales have longer genes (Figure 3,top right).In taxonomy, the Halobacteria are a class of the Euryarchaeota, and the extremely halophilic, aerobic members of Archaea are classified within the family Halobacteriaceae, order Halobacteriales. So, at the top of Figure 3 we present plots describing distribution of ranks of genomes from two different groups of Archaea. Our speculation is that hyperthermophilicity is a factor of gene shortening, while halophilicity is factor acting in the opposite direction.
• Middle plots of Figure 3 present Campylobacterales (left) and Chlamydiae (right). All genomes of these two selections are pretty short; however, the plots are very different. Family of Campylobacterales (belonging to the phylum Proteobacteria), have an average rank of 203, with the smallest rank of 10 (Helico bacterbizzozeronii ciii-1) and the largest rank of 392 (Helicobacter hepaticus ATCC 51449). Members of the class of Chlamydiae have exceptionally long genes: the ranks of 21 members of this class are located from positions 835 to 1127 in the ranking list.
• Bottom plots of Figure 3 present Actinobacteria (left) and Cyanobacteria (right). As a rule, members of the class of Actinobacteria have long genes. Finally, Cyanobacteria also tend to have long genes. At this point, it is only observation and we have no speculations on this issue.

Conclusions
We have presented four methods of genome ranking and compared their performance using a dataset of 100 genomes randomly selected from the entire NCBI collection of Eubacterial and Archaeal genomes. We have demonstrated that all four methods produce consistent results and that Bubble Sort and Simulated Annealing have the best ranking. Given computational advantages of Bubble Sort, it is the optimal method for the task of genome ordering. We also showed that filtering procedure (removal of the less populated COGs) improve the final sortedness of the dataset.
Using the subset of 100 randomly selected genomes, we demonstrated that hyperthermophilic species have shorter genes than the mesophilic species. Addition of all currently sequenced genomes   did not change this conclusion. It would be wrong to claim that environmental stress always causes genes to be shorter.