Setting up a Meta-Threading Pipeline for High-Throughput Structural Bioinformatics: eThread Software Distribution, Walkthrough and Resource Profiling

eThread, a meta-threading and machine learning-based approach, is designed to effectively identify structural templates for use in protein structure and function modeling from genomic data. This is an essential methodology for high-throughput structural bioinformatics and critical for systems biology, where extensive knowledge of protein structures and functions at the systems level is a prerequisite. eThread integrates a diverse collection of algorithms,


Introduction
In modern biological sciences, the focus has shifted from the study of individual molecules to the exhaustive exploration of molecular interactions at the systems level. This new paradigm has given rise to the rapidly developing domain of systems biology [1], which lies at the intersection of life and computer sciences. Systems biology is facilitated by whole genome sequencing [2] that routinely generates large datasets of protein sequences; nevertheless, the molecular structures and functions of many of these sequences often remain unknown. Computational methods for protein structure and function prediction are expected to bridge the gap between the number of known sequences and the number of fully annotated gene products, which are requisite for systems biology applications. Amongst many computational techniques developed over the past years, the most accurate algorithms in this field build on homology, i.e. they use information inferred from related proteins. Sequence-based methods can provide useful structural and functional information for a subset of target proteins; however, these algorithms typically require a high sequence identity to already annotated proteins to maintain a high accuracy [3]. As might be expected, this reduces the coverage of suitable targets, since for many proteins no close homologues are available in the public databases, e.g. the Protein Data Bank [4]. It has been demonstrated that relaxing the safe sequence similarity thresholds in sequence-based function annotation may lead to high levels of mis-annotation [5].
To address this issue, a number of techniques have been developed to search for low-sequence identity templates that can be used to construct the structural model of a target protein and to subsequently infer its molecular function.
This is a major goal of contemporary structural bioinformatics, which aims at the high-throughput modeling of all gene products across the entire proteomes of various organisms in the so-called "twilight zone" of sequence similarity [6]. Protein threading [7] represents the latest trend in the development of template identification and alignment algorithms. These techniques have the desired capability to effectively deal with the complex and equivocal relations between protein sequence, structure and function, which are the major obstacles for standard bioinformatics approaches. Such a structure-oriented approach holds considerable promise to speed up genome-wide protein annotation [8], which will certainly have an impact on many areas of modern molecular, cell and systems biology. It has great potential to overcome the limitations of more traditional sequence-based approaches, although at the cost of a significantly increased demand for computational resources.
In particular, meta-threading techniques are widely used in structural bioinformatics. These methods operate by considering outputs from a variety of individual threading algorithms; the combined predictions have a higher chance of being accurate than those produced by a single method. An example of such an approach is the recently developed eThread, which integrates ten state-of-the-art single-threading algorithms additionally supported by machine learning to provide a unified resource for protein structure and function modeling [9]. Here, the scientific challenge is to efficiently combine multiple algorithms to significantly increase the overall accuracy over the individual methods and to push the envelope of systems-level template-based protein structure modeling and functional annotation. Meta-threading pipelines also pose significant challenges at the level of practical implementation and the optimal utilization of computing resources. These can be considered as heterogeneous collections of algorithms and computational techniques that may significantly differ from each other in terms of the required wall clock time, memory usage, I/O operations and network bandwidth. They also may or may not share common input files or the access to external data libraries (e.g. template libraries), and often employ a complicated system of dependencies between individual jobs. To the best of our knowledge, the majority of currently implemented meta-threading pipelines are available as pseudo-gateways, i.e. web interfaces that query several publicly available CGI servers over the Internet using simple web tools such as wget or curl.
A local installation of the meta-threading pipeline on a high-performance computing (HPC) platform provides the most reliable and robust solution for high-throughput structural bioinformatics [10]. However, running a diverse collection of algorithms on a large multi-core system necessarily requires comprehensive resource profiling to ensure the optimal utilization of available resources. In this communication, we describe the stand-alone software distribution of eThread, which can be deployed on any modern Linux-based HPC system. We perform comprehensive resource profiling, analyze the computational requirements and discuss the achievable models of parallelism. We also touch on the possibility of accelerating the computations by graphics processors (GPUs). Finally, we provide a case study to demonstrate the practical application of this software on production platforms. eThread is freely available for academic and non-commercial users at www.brylinski.org/ethread.

Materials and Methods
Overview of eThread pipeline
eThread is a meta-threading procedure that combines predictions from ten state-of-the-art single-threading algorithms: COMPASS [11]
In addition to target-to-template alignments generated by eThread, eThread/TASSER-Lite needs long-range inter-residue contacts, which can be predicted for a given target sequence using eContact (included in the eThread software distribution). Moreover, both modeling protocols employ several popular tools for protein structure modeling, e.g. Jackal [23] and Pulchra [24]. Typically, multiple models are generated for a given target sequence. To rank them and assign the prediction confidence, we developed eRank, which uses individual scoring functions provided by [28] and Stride [29]. The complete list of structure-based tools required by eThread is shown in table 2. eThread, eContact and eRank also employ several machine learning models to improve prediction accuracy; we use two software packages that offer various Support Vector Machines (

Physical testing systems
The primary testing system is an HP Proliant DL 180 G6 server, which has two Intel Xeon E5645 6-core processors running at 2.4GHz and is equipped with 48GB of memory. Additionally, the following three systems were used for the benchmarking of GPU-BLAST: 1) dual Intel Xeon E5620 4-core processor running at 2.4GHz, equipped with 24GB RAM and NVIDIA Tesla M2050, 2) dual Intel Xeon E5540 4-core processor running at 2.5GHz, equipped with 24GB RAM and NVIDIA Tesla M2070, and 3) single Intel Xeon E5540 4-core processor running at 2.5GHz, equipped with 36GB RAM and NVIDIA Tesla C2075.

Simulated multi-core systems
We constructed 18 virtual multi-core systems, equipped with 6, 8, 12, 16, 24 and 32 computing cores, and 1, 2 and 4GB of RAM per core. We also designed a simple job scheduling system that assigns jobs to the computing cores using the following rules: 1) the total number of concurrently running jobs must be less than or equal to the number of cores, 2) the total memory for running jobs cannot exceed the host shared

Datasets
The resource profiling is carried out on a dataset of 275 proteins randomly chosen from the original eThread benchmarking dataset [9]. These proteins were selected to uniformly populate 11 bins with 25 structures in each bin. The bins evenly span the range of the target sequence length between 50 and 600 residues.
As a benchmarking dataset for the simulated systems-level modeling, we selected the complete proteome of Escherichia coli K-12 [32], which comprises 4,646 gene products 50-600 amino acids in length. The eThread meta-threading pipeline employs ten individual threading algorithms; thus, the total number of jobs needed to process the E. coli proteome is 46,460. The expected memory consumption and computing time for each job were calculated based on the meta-threading profiling results obtained on the primary testing system.
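The length-stratified selection of profiling targets described above can be illustrated with a short Python sketch. It assumes equal-width bins of 50 residues (so that 11 bins cover 50-600 residues, with a length of exactly 600 falling into the last bin); the function names and the bin-edge convention are hypothetical and are not taken from the eThread distribution.

```python
import random
from collections import defaultdict

# Assumed binning scheme: 11 bins, 50 residues wide, covering 50-600 residues.
MIN_LEN, MAX_LEN = 50, 600
BIN_WIDTH = 50
N_BINS = 11
PER_BIN = 25


def length_bin(length):
    """Return the 0-based bin index for a sequence length, or None if out of range."""
    if not MIN_LEN <= length <= MAX_LEN:
        return None
    # A length of exactly 600 is folded into the last bin.
    return min((length - MIN_LEN) // BIN_WIDTH, N_BINS - 1)


def stratified_sample(lengths, per_bin=PER_BIN, seed=0):
    """Randomly pick up to `per_bin` protein ids from each length bin.

    `lengths` maps a protein id to its sequence length.
    """
    bins = defaultdict(list)
    for protein_id, n_residues in sorted(lengths.items()):
        index = length_bin(n_residues)
        if index is not None:
            bins[index].append(protein_id)
    rng = random.Random(seed)
    return {
        index: rng.sample(ids, min(per_bin, len(ids)))
        for index, ids in sorted(bins.items())
    }
```

With at least 25 candidates in every bin, the sample contains 11 × 25 = 275 proteins, matching the size of the profiling dataset.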

Profiling of meta-threading components
Individual single-threading components of the eThread pipeline significantly differ from each other with respect to the wall clock time and memory consumption. Both resources are typically limited on many HPC systems; for example, currently the largest HPC cluster in the state of Louisiana, Queen Bee (http://www.loni.org/systems/), allows jobs to run for up to 48 hours and features 8 computing cores and 8GB of RAM per node. The results of meta-threading resource profiling on the primary testing system are shown in figures 1 and 2. Figure 1A shows the average wall clock time required for each single-threading component method. In all cases, except for HMMER, the simulation time scales well with the target sequence length; however, the algorithms differ with respect to the total CPU time. The least expensive algorithms, CSI-BLAST, pfTools and HMMER, require at most minutes, whereas the most expensive, Threader, typically needs several hours to complete the calculations. eThread uses two template libraries: full-chain (11,468 structures) and domain-only (10,013 structures). Figure 1B reports the percentage of time spent on threading through a particular library; due to the number and length of template structures, the chain library requires slightly longer simulation times. Furthermore, several algorithms use PSI-BLAST to construct a sequence profile for a given target, which is often the most time-consuming step. Blue pie slices in figure 1B show that for HHpred, COMPASS and pGenThreader, 88%, 62% and 51% of CPU cycles are used up by the sequence profile construction, respectively.
Individual threading algorithms also differ with respect to the memory utilization, see figure 2A. CSI-BLAST, pfTools, Threader and HMMER require the least amount of memory, whereas HHpred, COMPASS and pGenThreader, all of which employ PSI-BLAST for the construction of sequence profiles, need significantly more RAM. Figure 2B shows that for HHpred, COMPASS and pGenThreader, 86%, 78% and 93% of the memory was utilized during the sequence profile construction, respectively. SAM-T2K has the highest memory requirement because it launches BLASTP, which loads a large sequence library into memory. The actual threading calculations use only ~2% of the memory; however, they account for 96% of the simulation time (Figure 1B, SAM-T2K). Furthermore, in most cases, the required memory scales well with the target protein length. It is also important to note that these algorithms do not depend on each other and therefore can be efficiently processed in parallel.
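Measurements of this kind can be collected on a Linux system with a few lines of Python. The sketch below is only an illustration of the general approach and is not the profiling code used for figures 1 and 2; `profile_command` is a hypothetical helper, and the command line passed to it is a placeholder for an actual threading job.

```python
import resource
import subprocess
import sys
import time


def profile_command(argv):
    """Run a command to completion and return (wall_seconds, peak_rss).

    peak_rss comes from getrusage(RUSAGE_CHILDREN).ru_maxrss, which Linux
    reports in kilobytes. The value is the maximum over all child processes
    that this interpreter has waited for so far, so each job should be
    profiled from a fresh Python process to obtain a per-job figure.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall_seconds = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall_seconds, peak_rss


if __name__ == "__main__":
    # Placeholder job: start a Python interpreter that does nothing.
    wall, peak = profile_command([sys.executable, "-c", "pass"])
    print("wall clock: %.3f s, peak RSS: %d kB" % (wall, peak))
```

Wall clock time and peak resident memory are the two quantities reported per algorithm above; CPU time, if needed, is available from the `ru_utime` and `ru_stime` fields of the same structure.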

Simulated multi-core system running meta-threading
We conduct a simple computer experiment to show that meta-threading pipelines follow Gustafson-Barsis' law, which states that computations involving arbitrarily large data sets can be efficiently parallelized [33]. Figure 3 presents the simulated operation of three virtual systems equipped with 8, 16 and 32 cores, and 1GB of RAM per core, processing 46,460 individual threading jobs for the E. coli proteome. The CPU utilization is almost constantly 100% throughout the operation time, with the exception of the two larger systems, where a few high memory jobs initially saturated the available host memory and blocked the remaining cores for a short period of time, see figures 3B and 3C. Moreover, because of the job prioritization, high memory jobs, e.g. SAM-T2K, pGenThreader and COMPASS, as well as long jobs, e.g. Threader, were selected for execution before other threading algorithms. The total computing time required for completing meta-threading against the E. coli proteome is shown in figure 4. For instance, one needs 6 years, 303 days and 17 hours to complete the calculations on a single computing core. Using 100 nodes of the aforementioned Louisiana HPC cluster Queen Bee, each equipped with 8 cores and 8GB of RAM, the entire E. coli proteome could be processed in 3 days and 3 hours. However, increasing the size of the host shared memory does not shorten the simulation time. This is because the computations are dominated by low memory jobs, e.g. Threader. Figure 3A demonstrates that the memory was fully utilized only throughout around 50% of the total computing time, which leaves substantial room to maneuver job allocation. Figure 4 inset shows that a meta-threading pipeline closely follows Gustafson-Barsis' law, i.e. doubling the number of CPU cores shortens the computing time by a factor of 2.
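As a consistency check on the timings quoted above, the scaled speedup can be written out explicitly. The block below uses the standard form of Gustafson-Barsis' law, with α denoting the serial fraction of the work, and assumes 365-day years when converting the single-core time to days; the reported 3 days and 3 hours come from the simulation, not from this division.

```latex
\begin{align*}
S(N) &= N - \alpha\,(N - 1), \qquad \alpha \approx 0 \;\Rightarrow\; S(N) \approx N \\
T_{1} &\approx 6 \times 365 + 303 + \tfrac{17}{24} \approx 2493.7\ \text{days} \\
N &= 100\ \text{nodes} \times 8\ \text{cores} = 800 \\
T_{800} &\approx \frac{T_{1}}{800} \approx 3.12\ \text{days} \approx 3\ \text{days}\ 3\ \text{hours}
\end{align*}
```

The near-ideal value obtained by simple division agrees with the simulated result, which is what one would expect when the serial fraction is negligible.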
We identify three factors responsible for this performance: 1) a large set of diverse protein threading jobs, 2) a considerable margin for the memory utilization on modern HPC systems, and 3) the lack of dependencies between individual jobs. Consequently, proteome-wide meta-threading applications are perfectly parallelizable at the task level.
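The scheduling rules used in the simulation (a core limit, a shared memory limit, and priority for high memory and long jobs) can be sketched as a small event-driven simulation. This is a minimal illustration and not the scheduler used to produce figures 3 and 4: the function name, the exact priority ordering (memory first, then duration) and the use of integer megabytes are assumptions, and the loop is not optimized for tens of thousands of jobs.

```python
import heapq


def simulate(jobs, n_cores, total_mem_mb):
    """Return the makespan (in hours) of a greedy schedule.

    jobs: list of (hours, mem_mb) tuples, one per independent job.
    Rules: at most n_cores jobs run concurrently, and the memory of all
    running jobs never exceeds total_mem_mb. Pending jobs are considered
    in order of decreasing memory, then decreasing duration (assumed).
    """
    if n_cores < 1:
        raise ValueError("at least one core is required")
    if any(mem > total_mem_mb for _, mem in jobs):
        raise ValueError("a job needs more memory than the host provides")

    pending = sorted(jobs, key=lambda job: (job[1], job[0]), reverse=True)
    running = []  # min-heap of (finish_time, mem_mb)
    free_mem = total_mem_mb
    now = 0.0

    while pending or running:
        # Start every pending job that fits, scanning in priority order.
        waiting = []
        for i, (hours, mem) in enumerate(pending):
            if len(running) >= n_cores:
                waiting.extend(pending[i:])
                break
            if mem <= free_mem:
                heapq.heappush(running, (now + hours, mem))
                free_mem -= mem
            else:
                waiting.append((hours, mem))
        pending = waiting

        # Advance the clock to the next job completion and release its memory.
        finish_time, mem = heapq.heappop(running)
        now = finish_time
        free_mem += mem

    return now
```

For 64 identical one-hour jobs with ample memory, `simulate` gives 8 hours on 8 cores and 4 hours on 16 cores, the linear scaling discussed above; when memory rather than the core count is the bottleneck, adding cores no longer helps.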

Profiling of eThread
eThread is a meta-predictor that integrates outputs from individual threading components. It operates in two modes: structural and functional. The former identifies the most confident structural templates and constructs consensus target-to-template alignments for use in protein structure modeling. The latter additionally evaluates selected templates for their utility in function assignment and considers a variety of protein molecular functions: ligand, metal, inorganic cluster, protein and nucleic acid binding. Figure 5 shows that the memory required by eThread is well correlated with the target protein length, whereas the average wall clock time is characterized by larger standard deviations; this is because the simulation time also depends on the number of identified templates. Moreover, including the functional component fourfold increases the wall clock time and the memory

Profiling of protein structure modeling
In eThread, the three-dimensional structural models of target proteins can be constructed using two modeling protocols: Modeller [21] and TASSER-Lite [22]. Resource profiling results for eThread/Modeller are shown in figure 6A. Both the simulation time and the memory used are correlated with the target protein length. This protocol has relatively low hardware requirements; even long target sequences typically need less than 0.5GB of RAM and up to 50 hours of CPU time. Compared to eThread/Modeller, eThread/TASSER-Lite requires significantly more resources, see figure 6B. The duration of structure assembly and refinement simulations in TASSER-Lite is limited to 48 hours, so the total wall clock time does not keep growing exponentially for sequences longer than 300 residues. Both modeling protocols profiled here, eThread/Modeller and eThread/TASSER-Lite, typically generate full chain models within 1-3 days on a single computing core. eThread/Modeller comprises two modeling stages: structure assembly by Modeller followed by model ranking using eRank. There is very little overhead resulting from eRank; it completes within seconds and is therefore not included in the profiling results. In contrast, structure modeling using eThread/TASSER-Lite consists of four consecutive stages: residue contact prediction using eContact, threading by Prospector [34], structure assembly/refinement by TASSER and model ranking by eRank. As shown in figure 6C, structure assembly and refinement is the most computationally intense stage and consumes 80% of the total CPU time. TASSER-Lite also includes additional threading using Prospector. This modeling stage is the most memory intense and accounts for 65% of the total memory utilization of up to 1.65GB, see figure 6D. eContact, which predicts long-range inter-residue contacts before TASSER simulations get started, and eRank, which ranks the constructed models, extend the modeling time by only 5% and have relatively small memory requirements compared to the remaining modeling stages.

Potential for GPU acceleration
Heterogeneous HPC systems that include massively parallel graphics processors (GPUs) are quickly becoming popular, mainly because of their remarkably high performance-to-cost ratio. Consequently, GPU-accelerated supercomputers show an exponential growth in the Top500 ranking, with 52 systems powered by NVIDIA Tesla GPUs currently on the list compared to only 10 systems in 2010. Bioinformatics and systems biology are examples of many rapidly developing research areas that are moving towards heterogeneous computing architectures [35]; GPU implementations of several popular bioinformatics tools have been reported recently. Bioinformatics software available for GPUs includes both sequence alignment algorithms, e.g. CUDA-BLASTP [36], GPU-BLAST [37], CUDASW++ [38] and GHOSTM [39], and structure alignment algorithms, e.g. ppsAlign [40] and TM-score-GPU [41]. Most of the component methods integrated into the eThread pipeline do not have a GPU implementation, with the exception of BLASTP, which is used by SAM-T2K. To the best of our knowledge, GPU-BLASTP has not been benchmarked against its serial CPU version in a meta-threading environment. Figure 7 shows wall clock times for serial NCBI BLASTP compared to those of GPU-BLAST [37], collected on three systems equipped with different GPU cards. Interestingly, the speedup depends on the target sequence length; this is likely due to the overhead caused by transferring the library data to the accelerator. Longer sequences require more calculations, thus the parallel processing by multiple GPU cores results in significantly shorter simulation times compared to the serial version. The speedup starts at 1.4, 1.5 and 1.7 for sequences 50-100aa, and reaches 2.1, 2.2 and 2.5 for sequences 550-600aa on Tesla M2050, M2070 and C2075 cards, respectively.
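The dependence of the speedup on sequence length is what a simple fixed-overhead model would predict. The sketch below is a toy model for illustration only: it assumes that the GPU run pays a constant data transfer cost and then executes the remaining work a fixed factor faster than the CPU. The function and its parameters are hypothetical, and the numbers in the usage note are not fitted to figure 7.

```python
def gpu_speedup(cpu_seconds, transfer_seconds, kernel_speedup):
    """Overall speedup of a GPU run under a fixed-overhead model (assumed).

    cpu_seconds:      wall clock time of the serial CPU run
    transfer_seconds: constant cost of moving library data to the accelerator
    kernel_speedup:   how many times faster the GPU executes the computation
    """
    if cpu_seconds <= 0 or transfer_seconds < 0 or kernel_speedup <= 0:
        raise ValueError("times must be positive and the overhead non-negative")
    gpu_seconds = transfer_seconds + cpu_seconds / kernel_speedup
    return cpu_seconds / gpu_seconds
```

With an illustrative 5-second overhead and a threefold kernel speedup, a 10-second CPU job gains only a factor of 1.2, whereas a 100-second job gains a factor of about 2.6. The overall speedup approaches, but never reaches, the kernel speedup as the amount of computation grows.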
Without the laborious porting of the source codes of individual protein threading algorithms to CUDA, accelerating the construction of sequence profiles using PSI-BLAST would be the next logical step to speed up the entire pipeline, since it would benefit HHpred, COMPASS and pGenThreader; however, a GPU implementation of PSI-BLAST is not currently available. Nevertheless, the authors of GPU-BLAST noted that PSI-BLAST can be implemented on the GPU similarly to BLASTP and that similar speedups can be expected [37]. Once available, it could give a boost to meta-threading pipelines by moving the sequence profile construction to GPU accelerators.

Software walkthrough and a case study
eThread is freely available to the academic community. Web-based eThread provides a user-friendly interface for fast and easy access to the entire software package. Once the target amino acid sequence is submitted with selected structure prediction options, the modeling results can be downloaded to a local machine or displayed directly on the website, see figure 8 for a snapshot of prediction results. In figure 8A, the top-ranked structural model predicted using eThread/Modeller is visualized in the Astex Viewer Java applet [42]. The estimated TM-score of 0.753 for this model suggests a high modeling confidence.
The Ramachandran plot in figure 8B shows that 87.8% of amino acids in the predicted target backbone reside in the most favorable region (colored in red); here, a threshold of 90% is commonly accepted to define high quality models. The web server also provides other results from the model quality check by PROCHECK [43] with respect to main-chain and side-chain parameters, e.g. peptide bond planarity, bad non-bonded interactions and Cα tetrahedral distortion. These may help users assess how a predicted structure compares with well-refined experimental structures (Figure 8C).
In addition to the web-based service, we also provide the source code of eThread, allowing users to install the software package and build protein structures locally. There are three major steps for the local setup, including the installation of: 1) required Perl modules, 2) third-party software for single-threading algorithms, and 3) eThread software and the corresponding threading libraries. We note that all third-party programs are free for academic and non-commercial use; however, users are responsible for obtaining software licenses and complying with legal and other requirements. Upon the completion of the software installation, all single-threading tools as well as eThread will be available locally for the identification of structural templates and the construction of target-to-template alignments. Finally, structural modeling protocols, Modeller [21] and TASSER-Lite [22], can be used to build the three-dimensional model for a given target sequence followed by a simple all-atom refinement using molecular mechanics.
As a proof of concept, we use eThread to build a structural model for the 59-residue fragment of an uncharacterized protein from domestic horse [44] (UniProt ID: F6VMN7), for which the experimental structure is unavailable. The first step is the construction of threading alignments. We start with obtaining the amino acid sequence of F6VMN7 in FASTA format from UniProt [45]. Next, we deploy each protein threading/fold recognition algorithm with the sequence of F6VMN7 as an input to identify suitable templates and to generate the corresponding target-to-template alignments. Structural templates are selected from both full-chain and domain-only libraries. Because different threading algorithms generate target-to-template alignments in different formats, the output files are converted to the eThread format using conversion scripts included in the eThread software distribution. Next, the target-to-template alignments constructed by individual threading algorithms are concatenated into a single file, which is one of the input files for eThread. In addition to this input, two other files, the target sequence in FASTA format and the threading libraries, are required to generate the final consensus alignments. As the second step, we use Modeller and TASSER-Lite separately to build structure models from eThread alignments. Typically, more than one model is predicted, therefore model construction is followed by ranking and quality assessment. Specifically, for eRank/Modeller, three input files, including the target F6VMN7 sequence and two files generated by PSIPRED and eThread, are required to rank the constructed models and assign TM-scores as confidence estimates. In figure 9A, the top-ranked eThread/Modeller model is shown in PDB format and its molecular structure is visualized in VMD [46]. For eThread/TASSER-Lite, the inter-residue contacts are first predicted using eContact; subsequently, residue contacts, target sequence and eThread alignments are used as input to construct the three-dimensional model.
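The input preparation in the first step, reading the target sequence and merging the converted alignments into one file, can be sketched in Python as follows. This is an illustration only: the helper names and the file names in the usage note are hypothetical, and the conversion scripts shipped with eThread, which must be run before the merge, are not reproduced here.

```python
from pathlib import Path


def read_fasta(path):
    """Return (header, sequence) for the first record of a FASTA file."""
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found in %s" % path)
    return header, "".join(chunks)


def concatenate_alignments(alignment_files, output_file):
    """Merge already converted per-algorithm alignment files into one file.

    Files are written in sorted order and empty files are skipped.
    Returns the number of files merged.
    """
    merged = 0
    with Path(output_file).open("w") as out:
        for path in sorted(Path(p) for p in alignment_files):
            text = path.read_text()
            if not text.strip():
                continue
            out.write(text if text.endswith("\n") else text + "\n")
            merged += 1
    return merged
```

For example, `read_fasta("F6VMN7.fasta")` could be used to confirm the target length before threading, and `concatenate_alignments(Path("converted").glob("*.aln"), "F6VMN7-all.aln")` would produce the combined alignment file; both paths are placeholders.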
Similar to eRank/Modeller, eRank/TASSER-Lite is then deployed to rank protein models and assign the modeling confidence. The resulting top-ranked model constructed for F6VMN7 by eThread/TASSER-Lite is shown in figure 9B. Figure 9C shows the global superposition of the top-ranked models generated by eThread/Modeller and eThread/TASSER-Lite. Both models are remarkably similar to each other (RMSD is 0.74Å), which indicates a high confidence for the structure modeling of this target. However, there are some differences. For example, the model built using eThread/Modeller (yellow) has longer beta sheets than the model built using eThread/TASSER-Lite (green), with approximately 27% and 13% of residues assigned by STRIDE [29] to the β-sheet conformation, respectively. This suggests that despite the likely correct global topology, the model constructed by eThread/TASSER-Lite may require more rigorous local structure refinement to improve the secondary structure content [9].
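The secondary structure comparison above reduces to counting residues in the strand state. The sketch below assumes a per-residue string of one-letter STRIDE codes in which E denotes an extended (β-strand) residue; whether isolated bridges (B/b) were also counted in the reported percentages is not stated, so the set of strand codes is left as a parameter. The function name is hypothetical.

```python
def strand_fraction(ss_string, strand_codes=("E",)):
    """Return the fraction of residues assigned to a strand state.

    ss_string:    one secondary structure code per residue
    strand_codes: codes counted as beta-sheet (assumed to be 'E' only)
    """
    if not ss_string:
        raise ValueError("empty secondary structure string")
    n_strand = sum(1 for code in ss_string if code in strand_codes)
    return n_strand / len(ss_string)
```

As an illustration, a 59-residue chain with 16 strand residues gives 16/59 ≈ 0.27, i.e. about 27%.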

Conclusion
Systems biology is emerging as a promising discipline in the field of biology. Powered by modern computer technologies, it aims to help comprehend molecular interactions at the systems level. Towards this goal, acquiring extensive knowledge of protein structures and their functions is essential. Continuing advancements in sequencing technologies spark off the rapid accumulation of gene and gene product sequences; yet, the annotation of these sequences is falling far behind. Therefore, high-throughput protein annotation is a daunting task in bioinformatics. To date, various computational tools have been developed to reach this goal. Unlike sequence-based methods, which heavily depend on high sequence identity to already annotated protein sequences, structure-based methods are making headway in function inference in the "twilight zone" [6] of sequence identity. Consequently, the genome-wide coverage of annotated proteins can be systematically expanded. Among many template-based approaches, protein meta-threading is of particular interest, primarily because this approach integrates multifaceted factors to enhance the prediction accuracy. Towards this effort, we developed eThread, which combines ten single-threading algorithms supported by machine learning to identify suitable templates for the prediction of protein structure and function [9]. The heterogeneous collection of algorithms used in eThread creates a challenge for the optimal utilization of system resources. In this communication, we thoroughly profile eThread and its component methods in terms of the total wall clock time and memory consumption, as well as the resource distribution at major computing stages. The profiling results show that the total CPU time and memory utilization differ dramatically among single-threading methods; yet, the overall resource requirements typically scale well with the target protein length.
Furthermore, in a simple experiment using several simulated multi-core systems, we show that meta-threading pipelines closely follow Gustafson-Barsis' law [33]; thus, systems-level applications, e.g. genome-wide modeling of protein structure and function, are exemplary tasks for large computer clusters.
In addition to parallel computing using multiple CPU cores, we also examine whether using a GPU-accelerated platform would shorten the production time. The benchmarking results are encouraging; however, significantly speeding up protein meta-threading pipelines requires substantial code development and the porting of individual algorithms to CUDA. Here, one of the most promising targets for GPU computing is PSI-BLAST, which is used by several component methods. A GPU implementation of this algorithm could significantly accelerate the entire meta-threading pipeline. Similarly to GPU computing, alternative technologies, such as the Intel Many Integrated Core architecture, also hold considerable promise to speed up bioinformatics applications.
We provide a user-friendly web service freely to the academic community and non-commercial users; we also provide the source code of eThread, which can be deployed locally on a high-performance computing platform for high-throughput protein structure and function modeling. The web-based gateway, stand-alone software, benchmarking results and datasets, as well as documentation and illustrative tutorials are available at www.brylinski.org/ethread.