Mining the Association Rules of Transcription Factor Binding Sites in Human Tandem Repeats Using the Apriori Algorithm



Introduction
Repetitive DNA sequences have been identified in large quantities in both eukaryotic and prokaryotic genomes (Buard et al., 1994; Van Belkum et al., 1998). Various databases describe the characteristics of tandem repeats (TR), for instance the ABCC GRID database (Collins et al., 2003), the Minisatellite database (le Flèche et al., 2001), the PlantSat database (Macas et al., 2002), the Representative Sequences DataBase (RSDB) (Horng et al., 2002), the Microsatellite Analysis Server (MICAS) for prokaryotic genomes (Sreenu et al., 2003) and the Tandem Repeats Finder (TRF) (Benson, 1999). Boby et al. (2004) stated "Many databases exist of perfect TR, but the focus on short perfect repeats has left gaps in our understanding of the potentially important biological or medical roles of those that are longer and harder to detect". Boby et al. (2004) built a database of perfect and imperfect TR, TRbase, relating TR to disease genes in the human genome. Unfortunately, for reasons unknown, TRbase contained only TR and annotations retrieved for the completed chromosomes 4, 5, 6, 14, 16, 18, 19, 20, 21 and 22, rather than the whole human genome (Boby, 2004).
Many transcription factor binding sites have been collected in databases. TRANSFAC (Heinemeyer et al., 1998; Heinemeyer et al., 1999) is the most complete and best maintained database of transcription factor binding sites. Binding sites can be described either by consensus patterns or by nucleotide distribution matrices. On this point, Brazma et al. (1997) stated "The matrix representation is generally considered as the best available means for representing the consensus, however, at present most consensus descriptions are unreliable in the sense that they tend to give many false positives when compared against the genome sequences of even modest length". Therefore, this study describes the binding sites using consensus patterns. Brazma et al. (1997) analyzed association rules among combinations of binding sites. Their work focused on upstream regions and random regions, and on the ratio in which the combinations appear in them. Their tool can find all the combinations satisfying the given parameters with respect to the given set of upstream regions, its counterset, and the chosen set of sites. However, the tool has been applied only to the yeast genome.
When faced with a large amount of repeat sequences, data mining plays a prominent role in knowledge extraction. Agrawal and Srikanth (1993) introduced the problem of mining association rules over basket data. An example of an association rule from that work is '50% of transactions that contain beer also contains diapers; 5% of all transactions contain both of these items'. Here 50% is called the confidence of the rule, and 5% is the support of the rule. Frequently used data mining approaches include association rules, statistical methods, neural networks and genetic algorithms.
In statistics, the chi-square test statistic (χ²) is extensively applied for testing independence and correlation. Chi-square is based on comparing observed frequencies with the corresponding expected frequencies. The closer the observed frequencies are to the expected frequencies, the greater the weight of evidence in favor of independence. Let f_0 be an observed frequency and f the corresponding expected frequency. Chi-square is used to test the significance of the deviation from the expected values. The χ² value is defined as follows:
$$\chi^2 = \sum \frac{(f_0 - f)^2}{f}$$
where a χ² value of 0 implies that the sites are statistically independent. If the value is higher than a certain threshold, e.g., 4.12 at the 97% significance level, we reject the independence assumption and say that the sites are correlated.
Research on partial classification using association rules has presented two case studies, one in medical diagnosis and one in telecommunications (Ali et al., 1997). Instead of attempting to predict future values, such research focuses on identifying characteristics of some of the data classes.
Transcription factors (TF) are proteins that exert control over gene expression by recognizing and binding short DNA sequences whose length in base pairs is roughly the width of the major groove (Bulyk, 2003). Experimental methods to identify these binding sites include SELEX and recent high-throughput methods such as ChIP-chip (Ren et al., 2000) and protein-dsDNA binding microarrays (Mukherjee et al., 2004). Cawley et al. (2004) concluded that TF binding site regions are not only located at the 5' termini of protein-coding genes but are also distributed throughout the human genome.
In our study, the first step was to extend the original TRbase in order to fill this gap. Following the conclusion of Cawley et al. (2004) that TF binding sites are distributed throughout the human genome, we used the algorithm of Agrawal and Srikanth (1993) to explore the combinations of TF binding sites in TR sequences across the human genome rather than in a specific region. We then identified all combination rules, which were pruned using the chi-square test (χ²) for independence and correlation (Mukherjee et al., 2004). The pruned combination rules of TF binding sites should be useful to biologists researching gene expression and regulatory elements. Fig. 1 illustrates the framework of our method. The project started with the extension of the original TRbase, followed by statistical analysis and data mining, including the generation of association rules.

Methods
Framework
The steps of the proposed approach are summarized as follows:
1. Extension of the original TRbase to include all human chromosomes.
2. Determination of the number of item sets of the TF binding sites in TRANSFAC.

TRbase Extension
DNA sequences and annotations in the original TRbase were retrieved for the completed chromosomes 4, 5, 6, 14, 16, 18, 19, 20, 21 and 22 (Boby, 2004). This project, however, required data on disease genes and their relevant information across the whole human genome, so the original TRbase had to be extended to all human chromosomes before data preparation. DNA sequences and annotations of the remaining chromosomes were downloaded from GenBank. The DNA sequences were extracted from GenBank in FASTA format using the Seqret program of EMBOSS (Rice et al., 2000), and all TR were detected using the TRF program (version 3.01) (Benson, 1999) with the parameters of Boby (2004).

Association rule mining for TF binding sites in tandem repeats
Association rule mining and the algorithm of Agrawal and Srikanth (1993)
Association rule mining is important for extracting knowledge from large numbers of repetitive sequences. Agrawal and Srikanth (1993) developed the Apriori algorithm for mining association rules. The Apriori algorithm (Agrawal and Srikanth, 1994) accepts two thresholds as inputs, min-supp and min-conf, and finds all association rules whose support and confidence are greater than or equal to those thresholds. The algorithm mines association rules in two stages.
The first stage generates all the sets of items that satisfy the min-supp constraint, called frequent itemsets. The second stage constructs, from those frequent itemsets, all the association rules that satisfy the min-conf constraint. Details of the algorithm are given in Agrawal and Srikanth (1994) and Liu et al. (1999).
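The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name `apriori`, the representation of transactions as Python sets, and the toy data in the usage note are all hypothetical, and support is recomputed by scanning every transaction, which is far slower than the optimized counting in the original algorithm.

```python
from itertools import combinations


def apriori(transactions, min_supp, min_conf):
    """Return (frequent itemsets with their support, association rules)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Stage 1: level-wise search for frequent itemsets.
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_supp:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current
                        for s in combinations(c, k - 1))}

    # Stage 2: split each frequent itemset into antecedent -> consequent.
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                conf = supp / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return frequent, rules
```

With the hypothetical transactions `[{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"C"}]`, min-supp 0.5 and min-conf 0.9, the sketch finds the frequent itemsets {A}, {B}, {C} and {A, B}, and the two rules A → B and B → A.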

Tandem repeats sequences in TRbase
The extended TRbase, now covering the whole human genome, consists of perfect and imperfect TR. For more details on the features of the data in TRbase, refer to Boby (2004).
The TRANSFAC database (release 7.0) contains 7915 site sequences and 6133 factor entries. Most sites are also consensus patterns. The data in TRANSFAC have the following features. A single transcription factor binding site accession number may have several different consensus sequences. Different binding site accession numbers may share the same consensus sequence. Wildcard characters such as 'M' or 'W' used in TRANSFAC allow one sequence to cover several others. Short consensus sequences may appear within longer ones. Because of these complex characteristics of the transcription factor binding sites in TRANSFAC, our approach requires a preprocessing step.

Features of the data in TRANSFAC
Genome sequences are strings over the four letters A, C, G and T. TRANSFAC consensus patterns also use additional symbols, such as the 'M' and 'W' mentioned above, each of which stands for a set of nucleotides.
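One way to handle such symbols during preprocessing is sketched below. It assumes that the additional symbols follow the standard IUPAC nucleotide ambiguity codes, which is an assumption made here and is not confirmed by the text. The names `IUPAC`, `consensus_to_regex` and `occurs_in` are hypothetical helpers, not part of TRANSFAC or of the tool used in this study.

```python
import re

# Standard IUPAC nucleotide ambiguity codes (assumed, not taken from the text).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Compile a consensus pattern into a regular expression over A, C, G, T."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]  # raises KeyError for an unknown symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def occurs_in(consensus, sequence):
    """True if the consensus pattern matches anywhere in the sequence."""
    return consensus_to_regex(consensus).search(sequence.upper()) is not None
```

Under these assumptions the pattern 'AWG' covers both 'AAG' and 'ATG', so it is found in a sequence such as 'CCATGCC', whereas 'AMG' (covering 'AAG' and 'ACG') is not.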

Mapping of tandem repeats and transcription factor binding sites
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of TF binding sites retrieved from TRANSFAC (http://www.gene-regulation.com/pub/databases.html), called the item set. Let D be a set of TR, where each TR sequence d corresponds to a transaction that includes a set of items. Example 2 illustrates the mapping between TR and the TF binding sites.
In Example 2, 'AAAAAAAAAAAAAAAAGAAAAGG' is a TR sequence in the TRbase database. We mapped it to a transaction whose ID is ID17. The repeat sequence contains consensus patterns including 'AAAAG' and 'GG'. The consensus pattern 'AAAAG' has the accession number R02248, whereas the consensus pattern 'GG' has three accession numbers: R04365, R04367 and R04690.
In our experiments, the minimum support is set to 10% and the minimum confidence to 90%. Association rules are generated only if their support is at least 10% and their confidence is at least 90%. The Apriori and AprioriTid algorithms (Agrawal and Srikanth, 1994) are then applied to mine association rules.
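The mapping in this example can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in this study: the helper `to_transaction` and the dictionary `site_index` are hypothetical, the 'CCGT' entry is invented to show a non-matching pattern, and plain substring search is used, so wildcard symbols are not handled here.

```python
def to_transaction(tr_sequence, site_index):
    """Map a TR sequence to the set of accession numbers it contains."""
    # Drop whitespace left over from formatting and normalize the case.
    sequence = "".join(tr_sequence.split()).upper()
    items = set()
    for pattern, accessions in site_index.items():
        if pattern in sequence:
            items.update(accessions)
    return items


# Consensus pattern -> accession numbers, taken from the example in the text.
site_index = {
    "AAAAG": ["R02248"],
    "GG": ["R04365", "R04367", "R04690"],
    "CCGT": ["R_HYPOTHETICAL"],  # invented entry that does not match
}

transactions = {
    "ID17": to_transaction("AAAAAAAAAAAAAAAAGAAAAGG", site_index),
}
```

Here the transaction ID17 holds the four accession numbers R02248, R04365, R04367 and R04690, which then serve as the items of that transaction in rule mining.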

Pruning the discovered associations
It is well known that many discovered associations are redundant or minor variations of others. Their existence may simply be due to chance rather than true correlation. Thus, those spurious and insignificant rules should be removed. This is similar to the pruning of overfitting rules in classification (Klemettinen et al., 1994). Rules that are very specific (with many conditions) tend to overfit the data and have little predictive power. Although association rules are not normally used for prediction, rules that only capture the irregularities and idiosyncrasies of the data have no value and should be removed. An example of such a rule is shown below.
Example: We have the following two rules:
R1: Job = yes → Loan = approved [sup = 60%, conf = 90%]
R2: Job = yes, Credit_history = good → Loan = approved [sup = 40%, conf = 91%]
If we know R1, then R2 is insignificant because it gives little extra information. Its slightly higher confidence is more likely due to chance than to true correlation. It should thus be pruned. R1 is more general and simpler, and general and simple rules are preferred. In this work, we measure the significance of a rule using the chi-square test (χ²) for correlation from statistics (Mills, 1955).
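The chi-square statistic for a single rule can be computed from a contingency table of observed counts. The sketch below is an illustration under stated assumptions: the function name `chi_square` is hypothetical, the counts in the usage note are invented, and every row and column total is assumed to be non-zero so that no expected frequency is zero.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f0 in enumerate(row):
            # Expected frequency of the cell if rows and columns are independent.
            f = row_totals[i] * col_totals[j] / n
            chi2 += (f0 - f) ** 2 / f
    return chi2
```

For a rule X → Y, the rows can be taken as "X present" and "X absent" and the columns as "Y present" and "Y absent". With the invented counts `[[30, 10], [10, 50]]` the statistic is about 34.03, far above commonly used thresholds, whereas `[[20, 20], [30, 30]]` gives exactly 0, the value that indicates independence.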

Summarizing the unpruned rules
Pruning can reduce the number of rules substantially. However, the number of rules left can still be very large. This step finds a subset of the rules, called direction setting rules (or DS rules), to summarize the unpruned rules. Essentially, DS rules are significant association rules that set the directions for non-DS rules to follow. The direction of a rule is the type of correlation it has, i.e., positive correlation, negative correlation or independence, which is also computed using the χ² test. Here, summarization is presented as a post-processing method; in implementation, it is combined with rule mining. Let us see an example.
Example 2: We have the following discovered rules:
R1: Job = yes → Loan = approved [sup = 40%, conf = 70%]
R2: Own_house = yes → Loan = approved [sup = 30%, conf = 75%]
χ² analysis shows that having a job is positively correlated with the grant of a loan, and owning a house is also positively correlated with obtaining a loan. Then the following association is not so surprising to us:
R3: Job = yes, Own_house = yes → Loan = approved [sup = 20%, conf = 90%]
because it intuitively follows from R1 and R2. We can use R1 and R2 to provide a summary of the three rules. R1 and R2 are DS rules as they set the direction (positive correlation) that is followed by R3. In real-life data sets, a large number of associations are like R3. From the example, we see that the DS rules give the essential relationships of the domain. The non-DS rule is not surprising if we already know the DS rules. However, this by no means says that non-DS rules are not interesting. Non-DS rules can provide further details about the domain. For example, the non-DS rule above (R3) gives a higher confidence, which may be of interest to the user. Using DS rules to form a summary is analogous to summarizing a text article. From the summary, we know the essence of the article. If we are interested in the details of a particular aspect, the summary can point us to them in the article.
In the same way, the DS rules give the essence of the domain and point the user to the related non-DS rules. Non-DS rules are basically combinations of DS rules. In the above example, R3 is a combination of R1 and R2.
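The direction of a rule can be sketched as below. This is a simplified illustration and not the full DS-rule procedure of Liu et al. (1999): it only classifies a single rule X → Y, the function name `rule_direction` is hypothetical, and the threshold is left to the caller. The value 3.84 used in the usage note is the standard 95% critical value for one degree of freedom, chosen here as an assumption for the example.

```python
def rule_direction(n, n_x, n_y, n_xy, threshold):
    """Classify rule X -> Y as 'positive', 'negative' or 'independent'.

    n: number of transactions; n_x, n_y: transactions containing X, Y;
    n_xy: transactions containing both X and Y.
    """
    # Cells of the 2x2 table: X and Y, X only, Y only, neither.
    a = n_xy
    b = n_x - n_xy
    c = n_y - n_xy
    d = n - n_x - n_y + n_xy
    denominator = n_x * (n - n_x) * n_y * (n - n_y)
    if denominator == 0:
        # X or Y occurs in all or in no transactions: nothing to test.
        return "independent"
    # Closed form of the chi-square statistic for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / denominator
    if chi2 < threshold:
        return "independent"
    expected = n_x * n_y / n  # co-occurrences expected under independence
    return "positive" if n_xy > expected else "negative"
```

With invented counts, `rule_direction(100, 40, 40, 30, 3.84)` is positive because 30 co-occurrences exceed the 16 expected under independence, `rule_direction(100, 40, 40, 5, 3.84)` is negative, and `rule_direction(100, 40, 50, 20, 3.84)` is independent.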
In the present study, we therefore measure the significance of rules using the chi-square test (χ²) for correlation from statistics (Mills, 1955). The chi-square statistical test is frequently adopted to test independence and correlation.

Results and Discussions
Extended TRbase
The original TRbase can be accessed via the website http://trbase.ex.ac.uk (Boby, 2004). Similarly, the new extended TRbase can be accessed via http://www.trbase2.cn. Unlike the original TRbase, the extended TRbase covers all human chromosomes.

Association rule mining
We retrieved all human TF binding site consensus sequences (3,447) from the TRANSFAC database (Ali et al., 1997; Bulyk, 2003) and released those consensus sequences, which were found to consist of 455,979 site sequences. We then mapped the released sequences to itemsets for association rule mining. The 649,400 TR whose con_size was 10 or greater were selected as the transaction dataset. As described above, the minimum confidence and support were set to 90% and 10%, respectively. In total, 44 combinations of TF binding sites (supplementary file 2) were obtained. Liu et al. (1999) describe the pruning and summarizing approaches and definitions. The 44 combinations of TF binding sites (supplementary file 1) were classified into three groups: positive correlation rules, negative correlation rules and independent rules. In our experiment, only the positive correlation rules were considered significant, and these were further classified into DS rules and non-DS rules. Once verified, the rules can be used to find genes in complete genomes and to cluster TR. To explore human TR, we employed both statistical and standard data mining methods. One classification rule for the human genome is discussed here as an example. The rule R00046 → R00707 (positive correlation, DS) means that the consensus sequences of more than 10% of the TR in the human genome contain the consensus sequences corresponding to both TFBS R00046 and R00707 and that, in 90% of the cases, the two occur together in the same TR. In most cases, each tandem repeat in TRbase2 corresponds to a disorder, and the corresponding disorders are linked in a certain sense by the combination of R00046 and R00707. Those combinations may thus provide a bridge for research into correlations among human diseases. Future studies should investigate other methods to detect the combinations mined here.

Conclusion
The extended TRbase provides a platform to study the associations between disease genes and previously uncharacterized TR in the whole human genome. We successfully updated the original TRbase to include the whole human genome (http://www.trbase2.cn). Moreover, the extended TRbase now holds more comprehensive knowledge about human TR through the inclusion of statistical analysis. It provides users not only with query tools but also with useful statistical data. The authors hope that the statistical results will help biologists discover a whole new area of biology, including predicting transposons. This study identified TF binding site combinations in TR sequences of the TRbase. The TF binding sites in TRANSFAC were first preprocessed because of their complex characteristics. The Apriori and AprioriTid algorithms (Agrawal and Srikanth, 1994) were then applied to mine the associations from the TF binding site combinations in repeat sequences, and a set of association rules was generated. Chi-square tests were used to remove the insignificant rules. Finally, redundant rules were pruned and the remaining rules were classified into DS and non-DS sets. The discovered rules can also be used to find useful genes in complete genomes as well as to partially cluster the TR in the extended TRbase.