Richard O Akinola, Gaston K Mazandu and Nicola J Mulder*
Computational Biology Group/Institute of Infectious Disease and Molecular Medicine, Department of Clinical Laboratory Sciences, Faculty of Health Sciences, University of Cape Town, South Africa
Received Date: July 01, 2013; Accepted Date: August 06, 2013; Published Date: August 12, 2013
Citation: Akinola RO, Mazandu GK, Mulder NJ (2013) A Systems Level Comparison of Mycobacterium tuberculosis, Mycobacterium leprae and Mycobacterium smegmatis Based on Functional Interaction Network Analysis. J Bacteriol Parasitol 4:173. doi: 10.4172/2155-9597.1000173
Copyright: © 2013 Akinola RO, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Bacteriology & Parasitology
Mycobacterium leprae is a pathogenic bacteria that causes leprosy, a disease which affects mainly the skin, peripherical nerves, eyes and mucosa of the upper respiratory tract. Despite significant progress recorded in the last few years to stop this disease through a Multi-Drug Therapy (MDT) strategy, every year there are new reported cases of the disease. According to the World Health Organization, there were 192,242 new cases at the beginning of 2011. Mycobacterium leprae cannot be cultured in the laboratory but can be grown in mouse foot pads and more recently in nine banded armadillos because of its susceptibility to leprosy. Its highly reduced genome makes it an interesting species as a model for reductive evolution within the mycobacterial genus; it shares the same ancestor with Mycobacterium tuberculosis (MTB). A functional network for MTB was generated previously and extensive computational analyses were conducted to reveal the biological organization of the organism on the basis of the network’s topological properties. Here, we use genomic sequences and functional data from public databases to build protein functional networks for another slow grower, Mycobacterium leprae (MLP) and the fast growing non-pathogenic Mycobacterium smegmatis (MSM). Together with the MTB network, this provides an opportunity for comparison of three mycobacteria with different sized genomes. In this paper, we use network centrality measures to systematically compare MTB, MLP and MSM to quantify differences between these organisms at the systems biology level and to study network biology and evolution.
Biological networks; Mycobacterial pathogens; Mycobacterium tuberculosis; Mycobacterium leprae; Mycobacterium smegmatis; Rewiring rate
Leprosy is a chronic dermatological  and malignant human neurological disease . It is caused by the pathogen Mycobacterium leprae (MLP) which has similar characteristics to Mycobacterium tuberculosis (MTB). Various attempts to culture MLP have failed because, of all known bacteria, it has the longest doubling time . MLP is an acid-fast, rod shaped bacillus  and has a doubling time of ~14 days  and is host specific. It has been grown in mouse foot pads and more recently in nine banded armadillos because of their susceptibility to leprosy. According to the WHO , leprosy has been classified into two types based on how they smear the skin, paucibacillary (PB) and multibacillary (MB). Leprosy affects mainly the skin, peripheral nerves, the eyes and mucosa of the upper respiratory tract . However, through Multidrug Therapy, there has been a reduction in the number of reported cases of the disease, from the 228 474 new cases in 2010 to 192 246 cases at the beginning of 2011 . Its highly reduced genome makes it an interesting species as a model for reductive evolution within a genus.
According to Monot et al. , there are seven strains of Mycobacterium leprae that are well characterised, namely: India2, Thai53, TN, Africa, NHDP63, NHDP98 and Br4923. The first three strains are of SNP type 1, the fourth of SNP type 2, the fifth and sixth i.e., NHDP63 and NHDP98 are of SNP type 3, while the last one is of SNP 4 . The MLP strain TN is from Tamil Nadu, the Thai53 strain is from Thailand, the NHDP strains are from the United States and the Br4923 strain is from Brazil. So far, complete genome sequences have been obtained for TN and Br4923. In this article, we use the MLP strain TN.
Tuberculosis is caused by MTB and is one of the ‘most dangerous’ infectious diseases ; it claimed about 1.8 million victims in 2008 and there were estimates of 9.4 million new cases that year (3.6 million of whom are women), including 1.4 million cases among people living with Human Immunodeficiency Virus (HIV) or Acquired Immunodeficiency Syndrome (AIDS) according to the World Health Organization (WHO). The MTB bacillus is slow growing and has a complex cell wall.
Mycobacterium smegmatis (MSM) is an aerobic, fast growing, non-pathogenic mycobacterium which has many common features with pathogenic mycobacteria . It has the potential to adapt to microaerobiosis by changing from active growth to dormant or latent states. It can be dormant in conditions of low oxygen concentrations and can survive for more than 650 days in the absence of carbon, nitrogen and phosphorus. MSM is particularly useful in understanding the cellular processes that are important to pathogenic mycobacteria like MLP, MTB and M. avium subsp. paratuberculosis . This is one of the major reasons why we are including this mycobacterium in the present study. For the purpose of our comparisons, we will use the MC2 155 strain of MSM.
The genome sizes of MLP, MTB and MSM are 3,268,203; 4,411,532 and 6,988,209 base pairs, respectively (see, for example  and http://mycobrowser.epfl.ch/smegmalist.html, accessed on 15 September, 2012). This implies that the genome of MLP is approximately 1.4 Mb smaller than MTB and less than half the size of MSM. In addition, the G+C content of MLP is 57% which is lower than other mycobacterial genomes. However, as pointed out in Eiglmeier et al. , genomes of organisms that have suffered reductive evolution are usually richer in A+T content. MSM has a doubling period of three to four hours in culture and forms colonies in 3-4 days, MTB has a doubling time of twenty to twenty-four hours and forms a colony in an agar medium in three to four weeks , while, as stated earlier, MLP doubles in approximately fourteen days. Therefore, these three organisms represent three different genome sizes as well as different growth rates within one genus.
Although MTB and MLP share a common ancestor, MLP is an obligate intracellular parasite while MTB is a facultative intracellular parasite . Youm and Saier  compared the clinical CDC1551 strain of MTB to the TN strain of MLP. The genome of the MTB strain encodes 4189 proteins and the MLP strain 1605 proteins. This reduction in the MLP is attributed to reductive evolution with many genes having become pseudogenes. They defined a pseudogene as an inactivated gene that no longer produces functional proteins. In comparison to other mycobacterial species, two main consequences were proposed for the reduction in the genome of MLP : the presence of few proteins belonging to the PE and PPE functional category and traces belonging to insertion sequences and bacteriophages. As shown in Table S1, the number of proteins in the MTB genome belonging to the PE and PPE family is roughly fifteen times that of MLP, and while 82 proteins in MTB are insertion sequences or derived from bacteriophages there are only two in MLP. In addition, the presence of pseudogenes in MLP and the corresponding absence thereof in MTB accounts for some of the phenotypic differences between the two pathogens.
Gómez-Valero et al.  defined reductive evolution as the process by which genes and their corresponding functions are lost, resulting in the downsizing of the genome. Three reasons based on changes in lifestyle were given why an organism may have reductive evolution: a desire to ’move’ from a free living to a host-associated or intracellular life, when the organism restricts itself from multiple to specific hosts and from multiple to specific host tissues. By analyzing the distribution of gene-loss along the ancestral genome, it was shown that the genome downsizing in MLP was as a result of gene by gene inactivation and not inactivation in blocks or large chunks, before a gradual nucleotide loss . In addition, they classified ancestral genes in the MLP genome into three categories: retained, absent/deleted and pseudogenized. Genes belonging to the ’absent’ category have either diverged so much that they cannot be recognized or were totally deleted, while those in the pseudogenized category have sufficient levels of nucleotide similarity with MTB. It was also reported that 1537 genes have been lost from the ancestor to MLP, of which, 1129 are pseudogenes. In this work, we use functional genomic data from public databases to generate functional interaction networks for slow growers: MLP and MTB, and the fast growing non-pathogenic MSM. We used network centrality measures to make the following comparisons: MTB versus MLP, MTB versus MSM and MLP versus MSM in order to determine the impact of genome size on network evolution.
We downloaded datasets containing Uniprot protein accession numbers (ids) and gene names from UniProt Consortium  http://www.uniprot.org, accessed on 29 June, 2012) for the three mycobacterial organisms: Mycobacterium leprae, Mycobacterium tuberculosis and Mycobacterium smegmatis. These protein accession numbers were then used to extract protein protein interactions data from STRING [15,16]. STRING (Search Tool for the Retrieval of Interacting Genes/ Proteins) is a database containing predicted and known Protein-Protein Interactions (PPI). These functional protein-protein associations are derived from conserved genomic neighbourhood, gene fusion, imports from database (knowledge), phylogenetic co-occurrence, highthroughput experiments and text mining. STRING web resources and databases can be accessed from http://string-db.org/(accessed on 30 June, 2012). All the data from these different sources are integrated into a single network, before computing the combined confidence score for all PPI’s.
In addition, we derived other interactions from sequence similarity and signatures (shared domains), microarray data (co-expression), Protein Data Bank (PDB) [17,18] and MINT , DIP  and Intact http://www.ebi.ac.uk/intact/(accessed on 6 October 2012) data. PPI data from MINT, DIP and Intact were used to predict interologs in MLP and MSM based on the premise that orthologs of interacting proteins should themselves interact. Ortholog data were downloaded from Biomart (http://www.ebi.ac.uk/uniprot/biomart/, accessed on 12 August 2012). DOMINE is a database containing known and predicted protein domain interactions. The Domain-Domain Interactions (DDI) are inferred from Protein Data Bank (PDB) entries and those interactions from PFAM domain definitions predicted by thirteen different methodologies. We extracted DDI’s with PFAM ids from the DOMINE website (http://domine.utdallas.edu/cgi-bin/Domine, accessed on 17 October, 2012), neglecting self interactions to avoid loops. With the aid of the data containing PFAM ids and their corresponding InterPro ids, we converted those interactions from DDI into their interPro equivalents, before changing them to Uniprot- Uniprot protein interaction ids. InterPro data was downloaded from the interPro website for both MLP and MSM. We assigned a uniform score of 0.85 for all these interactions. We also used an informationtheory based technique proposed by Mazandu and Mulder  to derive PPI’s from protein sequence similarity and signatures as well as shared domains. In line with Mazandu et al. , the microarray data for MTB were downloaded from the Standford Microarray Database (SMD), at http:smd.stanford.edu/ (accessed on 28 October, 2011) and NCBI Gene Expression Omnibus (GEO) (see, http://www.ncbi.nlm. nih.gov/geo/ (accessed on 12 September, 2012).
Out of the seven experiments used in analyzing the microarray data for the MTB network, two with file ids 15569 and 15575 were downloaded from SMD and the remaining five; GSM219305, GSM219324, GSM219694, GSM219695, GSM219696 from GEO . For MSM, we retrieved 23 experiments from the Array Express database: four with GEO accessions: GSM743320, GSM743321, GSM743322, and GSM743323. The remaining 19 ids are GSM748761, GSM748762, GSM748763, to GSM748779. We then employed a random partial least squares regression approach described by Mazandu et al.  to generate functional association scores between pairs of interacting proteins. However, for MLP, we downloaded only four experiments contained in the GSE17191 series matrix from GEO (http://www.ncbi. nlm.nih.gov/geo/query/acc.cgi?acc=GSE17191, accessed on 12 August, 2012). This limited number of microarray experiments prevented us from using the same technique used for MSM and MTB so we decided to calculate the correlation instead. The Pearson correlation coefficient was calculated to find co-expressed genes and we inferred interactions between genes for which the correlation coefficient was exactly one.
After calculating the confidence score for each functional association protein pair, we computed the combined confidence score C(p,q) for interacting proteins p and q using the formula 
where n is the total number of PPI data sources and is the confidence score of a functional association between p and q predicted using the type of data source s. In all the three networks, n=11. Next, we define some network centrality measures that we used to characterize the proteins in each of the networks and in determining the most central protein.
Let be an ordered set of all proteins and be the set of all interacting proteins in the network G. It is conventional to define a network G as a graph G=(P, Q) . Then, we define the entries bij for i, j=1, 2, 3,...,m of the  m by m adjacency matrix of G as
We assume that the networks under consideration are simple, meaning there are no loops and no multiple functional protein-protein interactions. Hence, bii=0. Most importantly, the adjacency matrix is symmetric for undirected networks. Next, let d(p1, p2) be the distance between protein p1 and p2. For all possible paths in a network G from p1 to p2 , d(p1, p2) is the length of the shortest path from p1 to p2 . If there is no path between any two proteins, then the distance between them is infinite. The average shortest path length of a network is defined as .
where d(p1, p2) is the shortest path length from p1 to p2. The density of a network is the ratio of the number of edges to the number of possible edges in the network .
When two proteins p1 and p2 are functionally linked together by an edge, we say they are adjacent to each other. Several approaches were used to define the degree centrality of a protein pi. However, Nieminen [26,27], gave a mathematical formula for computing it as:
where u(pi, pj) is the Kronecker-Delta function. D(pi) is large if pi is functionally connected to other proteins in the network, meaning that such a protein partakes in a high number of biological interactions in the organism. If D(pi)=0, then it means pi does not interact with other proteins and it ’may’ not be an important protein neccessary for the survival of the organism. From the above formula, it means that the maximum D(pi) is m-1. The degree centrality of a protein gives an indication of its communicability in a network .
Betweenness is a structural property of communication in a network. According to Freeman , this means that a protein in a biological network is central if it falls on the shortest path between connecting pairs of other proteins. Thus, other proteins in the network depend on such a protein because it could withhold or distort information during transmission. The greater the betweenness of a protein  the greater its influence on the flow of information and importance in the biological processes of an organism. Mathematically, the normalized betweenness of a protein p1 in a network is defined as ,
Where is the number of shortest paths from protein x to y with represents the number of shortest paths from x to y with p1 as an inner protein , and
for all i . This can be expressed as Ae = λe, where e is the eigenvector of the adjacency matrix A corresponding to the eigenvalue l. According to Cvetkovi et al. , the eigenvector chosen as the eigenvector centrality must have all positive entries. Among all the eigenvectors corresponding to different eigenvalues l, only the one corresponding to the eigenvalue of the largest modulus should be the eigenvector centrality. The eigenvector centrality of a protein gives an indication of how connected the protein is to other well connected proteins in the network.
The closeness of a protein in a network is a measure of the degree to which it is close to other proteins on average and it can also be defined as the reciprocal of the average distance to other proteins. The closeness and betweenness centralities are anchored on the fact that information flows  along the shortest paths in a network and does not split. The closeness centrality of a central protein is usually high as it has a shorter distance to other proteins on average . Let be the set of all positive real numbers, the normalized closeness of a protein is
where d(p1, p2) is as defined in (2) and it is the length of the shortest path between p1 and p2.
Comparing two biological networks using evolution rewiring
Given any two networks, the approach used by Shou et al.  in measuring the evolutionary rewiring rate of biological networks was to name one as the reference network and the other as the compared network. For example, we take MTB as the reference network and MLP as the compared one. Firstly, all orthologous nodes from both networks are identified. This is then followed by the identification of three sets of nodes: Common nodes (CN), Lost Nodes (LN) and Gained Nodes (GN). Common nodes are nodes that have orthologous counterparts in both networks. Loss nodes are nodes present in the MTB network but with an absence of the orthologous counterpart in the MLP network, while gained nodes are nodes present in the compared network that do not have orthologous counterparts in the reference network. Three types of edges were distinguished as: gained edges from gained nodes, lost edges from lost nodes, and common edges from common nodes. TE is the total number of possible edges if both networks were fully connected. TE is the total number of rewired edges obtained by counting all interolog edges in both networks.
The ratio RE/TE was defined as the percentage of edge change in both networks. This ratio will be used later to measure edge changes in the different types of biological networks we want to compare. The formula .
was then used to calculate the evolutionary rewiring rate of two different organisms, where time divergence is the estimated evolutionary divergence time (measured in Mys) between the two organisms. The rewiring rate is measured as the number of rewired edges per edge per Mys. Network identity is defined as  the ratio of the number of common edges between orthologous nodes present in both networks to the total number of edges in both networks times 100%.
We generated three functional interaction networks for MTB, MLP and MSM by integrating data from eleven different evidence types. In Table 1, we present a comparison of the number of interactions and confidence scores from each of the data sources. We classified the confidence scores into three confidence levels viz-a-viz, low confidence scores are those scores less than 0.3 but not equal to zero (score<0.3), medium confidence scores range from 0.3 to 0.7 (0.3 ≤ score ≤ 0.7), while high confidence scores are scores strictly greater than 0.7 but less than or equal to 1 (0.7<score ≤ 1). In most cases, the combined score is higher than the individual sub-scores  and the confidence increases when a protein-protein interaction is predicted by many data sources or when interaction data are integrated from many evidence types. An understanding of the biological organization of an organism from its PPI network can play a crucial role in vaccine or drug target discovery by highlighting important proteins. Network centrality measures can be used to locate central proteins that play important roles in the biological processes and molecular functions of the organism.
|Type of evidence||Low Confidence||Medium Confidence||High Confidence|
Table 1: Data source and confidence range (low confidence: scores less than 0.3; medium confidence: scores from 0.3 to 0.7; high confidence: scores greater than 0.7) of the functional networks for MLP, MTB and MSM.
In the next section, we present and compare the functional PPI networks for MTB, MLP and MSM.
Functional Protein-Protein Interaction Networks for the three Mycobacteria
A description of the MTB network has been given previously by Mazandu and Mulder , the only difference between the MTB network presented in this work and the previous one, is that the number of functional interactions has increased from 58098 to 59919, because of the addition of PPIs predicted from interologs and PDB. In this paper, we follow the approach used by Mazandu and Mulder  in generating biological networks by taking those interactions with medium and high confidence scores. In addition, all the three networks included functional associations with a low confidence score if the interactions were predicted by two or more data sources. However, the number of functional interactions with low confidence scores is small in all three networks compared to those predicted by medium and high confidence scores. Table 2 summarizes important structural properties of the three networks under consideration.
|Mycobacterium tuberculosis||Mycobacterium leprae||Mycobacterium smegmatis|
|Number of proteins (Nodes)||4136||1412||4953|
|Number of functional interactions (Edges)||59919||20742||66543|
|Number of hubs||201||103||755|
|Average shortest path length||3.62739||3.16955||4.2224|
|Number of connected components||23||19||166|
|% of Nodes in largest component||58.7%||97.5%||91.7%|
Table 2: Comparing network parameters and values in the MTB, MLP and MSM networks.
The numbers of proteins in the MLP and MSM networks are 1412 and 4953, while the numbers of protein protein functional interactions are 27042 and 66543, respectively. This shows that the number of proteins and functional association pairs in MSM and MTB are roughly three times that of MLP, though they share a common ancestor. The 1412 proteins in the MLP network corresponds to 87.9% of the complete proteome obtained from the Uniprot database [14,33,34], while the 4953 proteins in MSM amounts to 74.5% of the complete proteome. From Mazandu and Mulder  and Table 2, there are 201 hubs, in the MTB network, while MLP and MSM, have 103 and 755 hubs, respectively. Degree based hubs are proteins with a high degree and structural hubs are those proteins that are able to disconnect the network . Here we are referring to structural hubs. The high number of hubs in the MSM network may simply be a reflection of the larger genome and network.
The average path length is computed by finding the mean over all shortest paths between all pairs of proteins in the network . While the MLP network has an average shortest path length of approximately 3 which can be seen in Figure S1(b), the MTB and MSM networks have average shortest path length of approximately 4, as shown in Figures S1(a) and S1(c), respectively. These figures show the probability distributions of their shortest path lengths. The computed average path length for each of the organisms is of the order log |P| in magnitude. Using the same argument as in Mazandu and Mulder , this means that each of the networks exhibit the ‘small world property’ [35,36]. These values give an indication of information spread in their respective networks independent of the number of proteins. Out of the three organisms, MSM has the highest number of connected components at 166 compared to the 23 for MTB and 19 for MLP
Furthermore, based on the degree of each protein in the three networks, results of our computation show that the distribution of the degree approximates a power-law, that is, for each protein degree k, . We show that each of the networks depicts a scalefree topology. The degree exponents for each of the mycobacterial species are β ~ 3.48 , β ~ 3.41 and β ~ 3.41 for the MTB (Figure S2(a)),MSM(Figure S2(c)) and MLP (Figure S2(b)) networks, respectively.
Locating the most central protein
One of the major problems in network analysis is which criteria should be used to identify the most central protein in a network. However, as stated earlier, the higher the betweenness of a protein, the greater the influence on the flow of information and importance in the biological processes of an organism. By ordering proteins based on betweenness, the MTB network in Mazandu and Mulder  was analyzed for the variations in the betweeness metric in terms of protein category by showing that proteins with high degree centralities that are positioned in the centre of the network are more likely to locate other proteins in a connected component faster than structural hubs. From the three computed networks, we used the betweenness centrality metric to obtain the following most central proteins: Q8VKQ9 in the MTB network, which does not have an ortholog in the other two networks; A0R5I7 in the MSM network, which is not orthologous to any protein in the MTB and MLP networks, and P57993 in the MLP network, which is an ortholog of P65728 in the MTB network and A0QQK3 in the MSM network. Table 3 shows some important properties of these central proteins based on network centrality measures. Apart from the most central MTB protein, which belongs to the PE family, the most central proteins in MSM and MLP (and its orthologs) are Serinethreonine protein kinases.
|Mycobacterium tuberculosis||Q8VKQ9||PE family protein||1.65648e-03||117248.09||0.34124||156||N|
|Mycobacterium leprae||P57993||Probable serine/threonine||3.05175e-02||36356.78||0.42189||125||N|
Table 3: The most central proteins for the three mycobacterial organisms and their centrality measures.
Next, for each network, we calculated the shortest path length, d of all proteins from Q8VKQ9, A0R5I7 and P57993, classified them into d=1, 2, 3,..., 9 from the protein and n, the total number of proteins in each set. For example, from Figure S3 (supplementary material), d=1 represents proteins which have a shortest path length one from Q8VKQ9 and n=156 proteins belong to that category, in the same vein, d=2 means those proteins with distance two from Q8VKQ9 and n=987 etc. Out of the 4136 proteins in the MTB network, only 53 have no path to this protein. A similar diagram showing the distribution of shortest path lengths of other proteins from A0R5I7 for the MSM network and P57993 for the MLP network are illustrated in Figures S4 and S5 (supplementary material), respectively. Thirty-six (out of 1412) proteins have no path to P57993 in MLP and 409 out of the 4953 proteins have no path to A0R5I7 in MSM.
For different protein sets at each distance from the central protein for the three organisms under consideration, we used their Gene Ontology (GO) annotations to carry out GO term over-representation analysis using Blast2GO . Blast2GO enrichment analysis tool produces a statistical significance  analysis of each GO term in a given protein set using Fisher’s exact test to find over or under represented functional labels between two protein sets. The proteins in the sets under consideration were used as the query sets and the remaining proteins in the network constitute the reference-set. We considered functions with p-values less than 0.05 to be significant and computed the adjusted p-values using the BONFERRONI correction . The results are shown in Tables S2-S10 (supplementary material). From Table S2, the result shows that proteins annotated to cellular metabolic process were the most over-represented for proteins sets with d=1 and 2 in MLP. In Table S3, the GO term corresponding to ‘integral to membrane’ is over-represented in d=1, oxidoreductase and acyl-CoA dehydrogenase activity in d=2, and proteins involved in biosynthetic processes are over-represented in d=3 for MTB. Similarly, in Tables S4-S5, the GO terms cellular metabolic process and protein metabolic process were over-represented in MSM for d=1 and both nucleotide and ATP binding were over-represented for d=2. Furthermore, we subdivided the total proteins in each mycobacterial organism into different sets based on network centrality measures and used Blast2GO’s  Fisher’s Exact test to find over represented GO terms in each of the sets. The results are as tabulated in Tables S2-S10 (supplementary material). Table S6 shows for high degree proteins the GO-term ‘translation’, high closeness, the GO-term ‘nucleotide binding’ and high eigenvector centrality the GO-term ‘ribonucleoprotein complex’ are significantly over-represented in the MLP network. From tables S7-S8, hubs having GO-term ‘cobalin binding’ and high degree proteins (degree >100) have GO-term ‘regulation of transcription, DNA-dependent’ significantly over-represented in MSM. In the same vein, as shown in Tables S9- S10 for the MTB network, protein sets that are hubs, high degree, high betweenness, high closeness and eigenvector have the following respective GO-terms: ‘transposase activity’, ‘oxidoreductase activity’, ‘binding’, ‘ACP phosphopantetheine’, and ‘oxidoreductase activity’ significantly over-represented.
Comparing important proteins in the MTB and MLP networks
We compare important proteins in the MTB and MLP networks using the approaches presented in Mazandu and Mulder  and Shou et al.  to understand the biological processes to which these proteins are involved. These proteins possess certain topological properties such as having high betweenness, closeness and eigenvector centralities, which make them important in the functionality of the network.
The center of gravity of a network is defined as the set of proteins that maximizes the closeness measure to any other protein in the network . A protein in a functional network belongs to the gravity centre, if its closeness centrality measure is strictly greater than the reciprocal of the average shortest path length . This value corresponds to 1/3.62739 or 0.27568 in MTB and 1/3.16955 or 0.31550 in the MLP network. In using the betweenness centrality to determine important proteins, we considered those proteins in which their betweenness is greater than the total number of shortest paths; obtained by multiplying the average shortest path length by the total number of proteins in the functional network. We then combined these criteria with the requirement that the eigenvector centrality should be greater than 10-5. We obtained a set of 355 and 116 proteins which have a high centre of gravity and thus may be potentially interesting as drug targets in the MTB and MLP networks, respectively [40-43]. Interestingly, proteins belonging to the intermediary metabolism and respiration functional class are the most represented in these lists of potential drug targets as shown in Table S1 and Figure 1. This is followed by proteins belonging to the unknown classes for MTB and information pathways for MLP. We obtained the functional classes from Tuberculist (http:// genolist.pasteur.fr/Tuberculist, accessed 28 October, 2011) & Leproma (http://genolist.pasteur.fr/Leproma, accessed 28 September, 2012). Among these potential drug targets, we extracted those proteins with high closeness which are classified as central proteins, and influential proteins, which are those with high eigenvector centralities. 241 and 69 are central and influential targets, respectively in the MTB network, while 95 and 37 are central and influential targets, respectively in MLP
We used the technique described in Shou et al.  to identify a total of 2859 proteins in the MTB network without a corresponding ortholog in the MLP network and 135 proteins in the MLP network without corresponding orthologs in the MTB network. In total, 1277 proteins have orthologous counterparts in both networks as shown in Table 4, only five pairs are both hubs (Table S10). A close look at Table S10 shows that these proteins have high betweenness and closeness. The protein O07727 (Probable D-amino-acid oxidase) in the MTB network is a drug, central and influential target and should be examined further as a potential drug target.
|Organism||Uniprot Acc.||Uniprot Description||shortest path length (d)|
|Mycobacterium tuberculosis||Q8VKQ9||PE family protein||156||987||2190||656||77||12||4||0||0|
|Mycobacterium leprae||P57993||Probable serine/threonine-protein kinase||125||820||328||87||11||4||0||0||0|
|Mycobacterium smegmatis||A0R5I7||Serine/threonine protein kinase||153||1573||1857||721||174||51||10||3||1|
Table 4: The three most important proteins in the three networks and the shortest path lengths from each of them.
Comparing important proteins in the MTB and MSM networks
Due to the unavailability of curated functional classes for MSM at the time of writing, we were unable to make the same kind of comparison as with MLP. However, we identified 2148 distinct proteins belonging to the MTB network without corresponding orthologs in the MSM network. Similarly, 2965 proteins in the MSM network have no corresponding orthologs in the MTB network. 1988 proteins have orthologous counterparts in both networks (Table 5). Furthermore, we found five orthologous protein pairs in the networks that are both hubs (Table S11). Using the same approach as outlined previously, we identified 294 potential drug targets in the MSM network and by choosing proteins with closeness greater than 0.27, we found 184 proteins as central targets. As defined earlier, influential targets are proteins top ranked by their eigenvector centralities, with values greater than 0.07. We identified 16 influential target proteins in MSM.
|in A only||in B only||Proteins||Edges||Identity|
Table 5: Number of ortholog proteins shared, common edges and network identity of the compared networks.
Comparing important proteins in the MSM and MLP networks
As shown in Table 5, out of the 1412 proteins in the MLP proteome, only 342 have no orthologous counterpart in the MSM network. 3883 proteins in MSM have no corresponding ortholog in MLP. Sixteen out of the 1070 orthologs present in both networks are both hubs (Table S12).
Table 5 summarizes the three comparisons discussed so far. A common edge is an edge in which both protein pairs are corresponding orthologs in both networks and are interologs. From the second to last column, we include the total number of common edges to the two organisms being compared. 3693 functional interactions are common to the MTB and MLP networks, 2284 edges are common to the MSM and MTB networks, while 1901 are common to MLP and MSM.
From the three networks, we sought those proteins which have orthologous counterparts in all three organisms and found a total of 1001 proteins (Figure 2), 260 additional proteins have orthologs in the MLP and MTB networks, 46 proteins have orthologs in MLP and MSM alone, and 983 orthologous proteins are present in just the MTB and MSM networks etc. All three networks have 297 common edges. Based on the classification of proteins as drug, central and influential targets and using the 1001 orthologs in their intersection, we found eight drug and one influential target overlaps among the three organisms.
Subnetworks of orthologs only
From the original three networks, we removed those proteins (and interactions or edges involving those proteins) that are not among the 1001 set of common proteins. We are now left with three subnetworks each consisting of 1001 proteins. The network properties of the three subnetworks are compared in Table 6. The last row (singletons) shows the number of proteins in the subnetworks without neighbours or proteins, and thus with degree zero. We then determined the number of shared edges for the orthologs and used this to calculate network identity for these subnetworks. The last column of Table 7 shows that the MLP and MTB subnetworks are more similar than the MTB versus MSM and MLP versus MSM subnetworks.
|Mycobacterium tuberculosis||Mycobacterium leprae||Mycobacterium smegmatis|
|Number of proteins (Nodes)||1001||1001||1001|
|Number of functional interactions (Edges)||9941||13670||5086|
|Number of hubs||38||60||160|
|Average degree (μd)||20||27||11|
|Average shortest path length||3.1055||3.0124||4.1655|
|Number of connected components||16||33||157|
|% of Nodes in largest component||98.5%||95.8%||79.2%|
|#(Degree-μd) < 0||628||598||773|
|#(Degree-μd) = 0||25||10||16|
|#(Degree-μd) > 0||976||393||212|
Table 6: Comparing network parameters and values in the MTB, MLP and MSM subnetworks. μd is the mean degree for each network. #(Degree-μd) < 0=628 means the total number of proteins such that the deviation from the mean degree (Degree-μd) is less than zero is 628.
|A||B||Edges in||Edges in||Common||Common||Network|
|A only||B only||Proteins||Edges||Identity|
Table 7: Number of common edges and network identity of the compared sub networks.
As an example, from Table S10, we chose the Q7D903 protein from the MTB network on the grounds that it has a high betweenness, it is a hub and has orthologs in both MLP and MSM networks. As shown in Figure 5, this protein has 37 neighbours while its corresponding orthologs Q9CD64 and A0R3F9 have 17 and 34 neighbours in MLP and MSM networks, respectively (Figures 6 and 7). However, out of the 37 neighbours, Figure 8 shows that only seven of them have orthologs in both MLP and MSM. In the same vein, out of the 17 neighbours of Q9CD64, 13 have orthologs in both MTB and MSM core subnetworks. Finally, the third network in Figure 8) shows that only seven proteins out of the 34 neighbours of A0R3F9 have orthologs in both MLP and MTB subnetworks. Among the three proteins in Figure 8, only six of their respective neighbours are orthologs of each other. The functional classes of the neighbour nodes are not conserved but then neither are those of the central ortholog proteins. This may be due to differences in assigning functional classes for different organisms and the fact that for MSM we had to use GO terms, since there was no functional class label available. We used PINV http://biosual.cbio.uct.ac.za/biosual/tests/pinv/pinv.html (accessed on 15 June, 2013) to generate the figure showing the interactions these proteins are involved in.
Evolutionary differences between the three mycobacterial species
In line with Shou et al. , for each pair of networks compared, we took sub-samples of their edges from 100% to 1%. We subjected both networks to random attacks by removing 50 nodes and all edges involving those nodes; repeated the simulations 10 times before calculating 95% confidence intervals on the resulting numbers. Since the MTB and MSM networks have more edges and nodes than the MLP network, we decided to remove 200 nodes rather than 50. The results in Tables S13-S15 show that as more nodes and their edges were removed from the networks, the percentage of edge change (defined previously) decreases. Since the three organisms have a common ancestor, and considering that they diverged approximately 2000 million years ago and computed rewiring rates for MTB and MLP networks and MLP and MSM networks.
In line with Shou et al. , we define bottlenecks as proteins within the top 10% ranked by betweenness and hubs as proteins within the top 10% ranked by degree. The choice of 10% is due to the small number of proteins in the MLP network. We grouped proteins in each network into Bottleneck Hubs (BH), Non-Hub-Bottlenecks (NH-B), Non- Bottleneck Hubs (NB-H) and Non-Hub Non-Bottlenecks (NH-NB). Bottleneck hubs are proteins within the top 10% ranked by betweenness and degree, non hubs non bottlenecks are neither within the top 10% ranked by betweenness nor degree. Non-hub bottlenecks are not within the top 10% ranked by degree but are top 10% ranked by betweenness, while non-bottleneck hubs are the converse of non-hub bottlenecks. Figures 3 and 4 show that Bottleneck hubs rewire faster than non-hubs non bottlenecks. For the MSM network, it can be observed from Figure 4 that no proteins belong to the Non-hub non-bottlenecks category. These results agree with those in Shou et al. .
In this study, we have generated functional protein-protein interaction networks for Mycobacterium leprae, Mycobacterium smegmatis and used an updated Mycobacterium tuberculosis network. In addition, since the betweenness centrality is a measure of the flow of information in a network, we have identified central proteins in each of the three networks, and carried out an overrepresentation analysis of the GO terms in the three networks under different headings. We compared the three networks using the following three-way approach: slow grower (MTB) versus slow grower (MLP), fast grower MSM versus MLP, MTB versus MSM and using orthologs as described in Shou et al. , we identified 1001 orthologous proteins common to the three networks. We also computed the network identities of the compared networks and determined that MLP’s network is more similar to the MTB network than the MSM network which makes sense as they are more closely related and both are slow growers. Furthermore, from the original three networks, we removed those proteins and functional interactions involving those proteins that are not among the 1001 set of proteins. Thus, obtaining three subnetworks each consisting of 1001 proteins. Based on the criteria outlined in Section 3-C, we determined eight overlapping drug targets proteins and one overlapping influential target protein among the 1001 proteins set. We then determined the number of shared edges for the orthologs and used this to calculate network identity for these subnetworks. Our result shows that the MLP and MTB subnetworks are more similar than the MTB versus MSM and MLP versus MSM subnetworks. One other interesting result that we found in this study is that among the three proteins in the three subnetworks shown in Figure 8, only six of their respective neighbours are orthologs of each other. Finally, by comparing the network rewiring rates of the compared organisms, we determined that bottleneck hubs rewire faster than non-hubs non-bottlenecks in line with Shou et al. .
The authors appreciate financial support received from the National Research Fundation (NRF) South Africa and the developers of open-source software. We also appreciate Kenneth Opap for useful discussions on PDB and most especially the Computational Biology Group of the Institute of Infectious Disease and Molecular Medicine here at the University of Cape Town.