ISSN: 2153-0602
Journal of Data Mining in Genomics & Proteomics

Information Theory Based Feature Selection for Multi-Relational Naive Bayesian Classifier

Vimalkumar B Vaghela1*, Kalpesh H Vandra2 and Nilesh K Modi3

1Department of Computer Science & Engineering, Karpagam University, Coimbatore, India

2C. U. Shah College of Engineering & Technology, Surendranagar, Gujarat, India

3S.V. Institute of Computer Studies, S. V. Campus, Kadi, Gujarat, India

*Corresponding Author:
Vimalkumar B Vaghela
Department of Computer Science & Engineering
Karpagam University
Coimbatore, India
E-mail: [email protected]

Received date: May 09, 2014; Accepted date: June 30, 2014; Published date: July 08, 2014

Citation: Vaghela VB, Vandra KH, Modi NK (2014) Information Theory Based Feature Selection for Multi-Relational Naïve Bayesian Classifier. J Data Mining Genomics Proteomics 5:155. doi:10.4172/2153-0602.1000155

Copyright: © 2014 Vaghela VB, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Today, data are commonly stored in relational structures. The usual approach to mining such data is to join several relations into a single relation via foreign-key links, a process known as flattening. Flattening can cause problems such as long run times, data redundancy, and statistical skew in the data. A critical issue is therefore how to mine data directly from multiple relations; the solution is the approach called multi-relational data mining (MRDM). A further issue is that irrelevant or redundant attributes in a relation may contribute nothing to classification accuracy. Feature selection is thus an essential data pre-processing step in multi-relational data mining: by filtering out irrelevant or redundant features from relations, we improve classification accuracy, achieve good time performance, and improve the comprehensibility of the models. We propose an entropy based feature selection method for the multi-relational Naïve Bayesian classifier. The method uses InfoDist and Pearson’s correlation to filter out irrelevant and redundant features from the multi-relational database and thereby enhance classification accuracy. We evaluated our algorithm on the PKDD financial dataset and achieved better accuracy than existing feature selection methods.

Keywords

Multi-relational data mining; Entity-relationship; Maximum relevant feature set; Minimum redundancy feature set; Feature and relation selection; Fast correlation based filter

Introduction

The term data mining refers to the extraction of valuable knowledge from large amounts of data; it is the process of discovering knowledge from data. With the massive quantity of data stored in repositories, it is increasingly important to develop powerful analysis and decision-making tools for the extraction of interesting knowledge. The task of classification is concerned with predicting the value of one field from the values of the other fields. The target field is called the class; the other fields are called attributes. Propositional machine learning algorithms assume the input data is represented in a simple attribute-value format. Most existing data mining algorithms (including algorithms for classification, clustering, association analysis, outlier detection, etc.) work on single tables. For example, a typical classification algorithm (e.g., C4.5 or SVM) works on a table containing many tuples, each of which has a class label and a value for each attribute in the table.

In recent years, there has been growing interest in multi-relational classification research and applications, which address the difficulties of dealing with a large relation search space, complex relationships between relations, and a daunting number of attributes. Most structured data is stored in relational databases and is, by its nature, spread over multiple relations [1]. Conventionally, many classification approaches can only be applied to a single relation. Applying these approaches to multi-relational data often requires transforming the data into a single table by flattening and feature construction, which is known as propositionalization. However, many of these methods are heuristic, and flattening may cause problems such as long run times and statistical skew in the data.

Multi-relational data mining (MRDM) has been successfully applied in a variety of areas, such as marketing, sales, finance, fraud detection, and the natural sciences. Multi-relational data mining looks for patterns that involve multiple relations in a relational database. Its main difference from traditional data mining approaches is that it does not need to transform the data into a single table; it learns from the data in its original form, preserving its structure and incorporating that structure into the learning process [2,3].

Relational databases

A relational database is a collection of tables called relations, each of which is assigned a unique name. Each relation consists of a set of attributes and stores a large set of tuples. Every tuple in a relation represents an object, which is identified by a unique key and described by a set of attribute values. Often a semantic model is used to represent relational databases, allowing one to describe and design the database without having to pay attention to the physical database. Such a model is often referred to as a database scheme. One of the most common models is the Entity-Relationship (ER) model (Figure 1).

Figure 1: The schema of a financial database (from PKDD CUP 1999) [9,10].

A relational database typically consists of several tables (relations), not just one table. A schema for a relational database describes a set of entities DB={E1, E2, …, En} and a set of relationships between entities [4]. Each row in a relation is a tuple. Each relation has at least one primary key attribute; the other attributes are either descriptive attributes or foreign key attributes. Foreign key attributes link to primary key attributes of other relations. A relational database contains multiple interconnected relations, each of which represents a certain kind of object or a type of relationship. In other words, a relational database consists of a set of named tables, often referred to as relations, that individually behave like the single table that is the subject of propositional data mining [5]. Data structures more complex than a single record are implemented by relating pairs of tables through so-called foreign key relations [6]. Such a relation specifies how certain columns in one table can be used to look up information in corresponding columns in the other table, thus relating sets of records in the two tables [7,8].
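As an illustration of how such a schema can be represented for programmatic use, the following Python sketch encodes the relations of the financial database in Figure 1 together with their primary and foreign keys. The key and attribute names are our own assumptions for illustration, not taken from the dataset documentation.

```python
# A minimal sketch (not the authors' code) of the PKDD financial schema: each relation
# lists its primary key and the foreign keys that link it to other relations.
schema = {
    "Loan":        {"pk": "loan_id",     "fks": {"account_id": "Account"}},
    "Account":     {"pk": "account_id",  "fks": {"district_id": "District"}},
    "Order":       {"pk": "order_id",    "fks": {"account_id": "Account"}},
    "Transaction": {"pk": "trans_id",    "fks": {"account_id": "Account"}},
    "Disposition": {"pk": "disp_id",     "fks": {"account_id": "Account",
                                                 "client_id": "Client"}},
    "Card":        {"pk": "card_id",     "fks": {"disp_id": "Disposition"}},
    "Client":      {"pk": "client_id",   "fks": {"district_id": "District"}},
    "District":    {"pk": "district_id", "fks": {}},
}

def linked_relations(table):
    """Relations reachable from `table` through a single foreign-key link (either direction)."""
    out = set(schema[table]["fks"].values())
    out |= {t for t, meta in schema.items() if table in meta["fks"].values()}
    return out

print(sorted(linked_relations("Account")))
# ['Disposition', 'District', 'Loan', 'Order', 'Transaction']
```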

Semantic relationship graph

For a classification task in a multi-relational database, there is usually one table containing the class label attribute. We call this table the target table and the class label attribute the target attribute. Apart from the target table, there are usually many other tables linked to the target table, directly or indirectly, through arbitrarily long chains of joins. To represent this kind of relationship between tables, we use a graph called a semantic relationship graph. A semantic relationship graph [9-11] is similar to an ER diagram and can usually be generated automatically by common commercial database systems [12].

Definition: A semantic relationship graph is a directed acyclic graph SRG (V, E, W), where V is a set of vertices, each of which corresponds to a table in the database; E is a set of directed edges, where an edge (v, w) means table w can be linked to table v by directly joining the two tables; and W is a set of attributes, each of which links two tables. We call this kind of attribute a link attribute.

Each edge of the semantic relationship graph represents one of the following two relationships between tables v and w:

Primary-key to foreign-key relationship, indicating that table w contains a foreign key referring to the primary key of table v.

Foreign-key to primary-key relationship, indicating that table v contains a foreign key referring to the primary key of table w.

The reason we define a directed graph instead of an undirected graph is that we need to start from the target table and link the other tables to it step by step. We can also relax the constraints of the semantic relationship graph by allowing cycles; in that case, to avoid excessive iteration, a parameter can be set to control the number of iterations. In the following sections, we regard the SRG as an acyclic graph. The SRG facilitates the process of virtually joining the relations and acts as a road map for the entire algorithm. An example SRG for a financial database from PKDD CUP 99 is given in Figure 2.
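The definition above can be realized directly as an adjacency structure. The sketch below is a hypothetical illustration, not the authors' implementation: it builds an SRG from (parent table, child table, link attribute) edges and checks by breadth-first search that every table is reachable from the target table Loan. The edge list and attribute names are assumptions based on Figures 1 and 2.

```python
from collections import deque

# Assumed SRG edges for the PKDD financial database: (parent, child, link attribute).
edges = [
    ("Loan", "Account", "account_id"),
    ("Account", "Order", "account_id"),
    ("Account", "Transaction", "account_id"),
    ("Account", "Disposition", "account_id"),
    ("Account", "District", "district_id"),
    ("Disposition", "Card", "disp_id"),
    ("Disposition", "Client", "client_id"),
    ("Client", "District", "district_id"),
]

def build_srg(edges, target="Loan"):
    """Build an adjacency list keyed by table and verify reachability from the target."""
    srg = {}
    for v, w, attr in edges:
        srg.setdefault(v, []).append((w, attr))
    # every table should be reachable from the target; the SRG acts as a road map
    seen, queue = {target}, deque([target])
    while queue:
        v = queue.popleft()
        for w, _ in srg.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return srg, seen

srg, reachable = build_srg(edges)
print(sorted(reachable))   # all eight relations, reached from the target table Loan
```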

Figure 2: Semantic relationship graph for the financial database from PKDD CUP99 [11].

Tuple ID propagation

Tuple ID propagation is a method for virtually joining non-target relations with the target relation. It is a flexible and efficient method that avoids the high cost of physical joins. Suppose the primary key of the target relation is an attribute of integers representing the IDs of the target tuples; we use the ID of each target tuple to represent that tuple. This process takes only a small amount of time and space compared with the physical joins used by existing classifiers, and it boosts the effectiveness of multi-relational classification techniques. Tuple ID propagation allows search across the relational database at a cost that is observed to be lower than physical joins in both time and space.

Definition: ID propagation. Suppose we have relations R1 and R2, which can be joined by attributes R1.A and R2.A. Each tuple in R1 is associated with some IDs in the target relation. For each tuple t in R2, we set t’s IDs to be the union of {u’s IDs | u ∈ R1, u.A = t.A}.
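A minimal sketch of this definition follows, assuming R1's tuples already carry sets of propagated target-tuple IDs and R2 is represented only by its join-attribute values; all data values below are made up.

```python
from collections import defaultdict

# Each tuple of R2 receives the union of the IDs of the R1 tuples
# that share its join-attribute value.
r1 = [("acc1", {1}), ("acc2", {2, 3}), ("acc1", {4})]   # (join value R1.A, propagated IDs)
r2_join_values = ["acc1", "acc2", "acc3"]               # join value R2.A of each R2 tuple

def propagate(r1, r2_join_values):
    ids_by_key = defaultdict(set)
    for a, ids in r1:
        ids_by_key[a] |= ids
    # t's IDs = union of {u's IDs | u in R1, u.A = t.A}
    return [(a, ids_by_key.get(a, set())) for a in r2_join_values]

print(propagate(r1, r2_join_values))
# [('acc1', {1, 4}), ('acc2', {2, 3}), ('acc3', set())]
```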

Feature selection process

With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection [13] are demanded. Feature selection [14] plays an important role in classification and is an important preprocessing step in machine learning. It selects an effective subset of the original features according to a certain criterion so that it can improve the performance of later data processing, such as classification and clustering. In real-world applications, there are many irrelevant and redundant attributes in the relations of a relational database that contribute little to classification accuracy. Hence, feature selection is an essential data processing step in multi-relational data mining. By applying feature selection techniques [4], we can improve classification accuracy, achieve good time performance, and enhance the comprehensibility of the models. Feature selection reduces the number of features; removes irrelevant, redundant, or noisy data; and brings immediate benefits for applications: speeding up a data mining algorithm and improving mining performance, such as predictive accuracy and result comprehensibility [15]. In fact, feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization.

Feature selection is a process that selects a subset of the original features. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain expands, the number of features N increases [16]. Finding an optimal feature subset is usually intractable, and many problems related to feature selection have been shown to be NP-hard [17]. Feature selection algorithms [18] designed with different evaluation criteria broadly fall into two categories: the filter model and the wrapper model.

Filter model: The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any mining algorithm.

Wrapper model: The wrapper model [19] requires one predetermined mining algorithm and uses its performance as the evaluation criterion. It searches for features better suited to the mining algorithm aiming to improve mining performance, but it also tends to be more computationally expensive than the filter model.

Feature selection has been defined by many authors from various angles [20]. As expected, many of those definitions are similar in intuition and/or content. The following list covers those that are conceptually different and span a range of definitions:

1. Find the minimally sized feature subset that is necessary and sufficient for the target concept.

2. Select a subset of M features from a set of N features, M<N, such that the value of a criterion function is optimized over all subsets of size M.

3. The aim of feature selection is to choose a subset of features for improving prediction accuracy or decreasing the size of the structure without significantly decreasing prediction accuracy of the classifier built using only the selected features.

4. The goal of feature selection is to select a small subset such that the resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution given all feature values.

Feature selection attempts to select the minimally sized subset of features according to the following criteria (Figures 3 and 4, Table 1):

Feature selection method | Author, year of publication | Description | Drawback
Feature and Relation Selection (FARS) | B. Hu, H. Liu, J. He and X. Du (2008) | Evaluated using table symmetrical uncertainty (TSU), the symmetrical uncertainty (SU) value between a relation and the class (over a multi-relational dataset). | Discrete values are not handled.
Feature Selection using InfoDist | C. Sha, X. Qiu and A. Zhou (2007) | Evaluated by InfoDist, which is based on information theory. | Discrete values are not handled.
Wilks Lambda criterion method | A. Ouardighi, A. Akadi and D. Aboutajdine (2007) | Evaluated by a statistical value used in discriminant analysis. | The relevance criterion alone is insufficient to improve classifier performance.
MR2 feature selection method | A. Unler, A. Murat and R. B. Chinnam (2007) | Uses InfoDist and Pearson correlation to compute the relevant features (over a multi-relational dataset). | Cutoff value is hard to decide.
Fast Correlation Based Filter (FCBF) | L. Yu and H. Liu (2003) | Evaluated by information gain; combines an optimal-subset method with a feature relevance weight method. | Discrete values are not handled.

Table 1: Analysis of feature selection methods.

Figure 3: Feature Selection Process [20].

Figure 4: Our Proposed Algorithm.

1. The classification accuracy does not significantly decrease; and

2. The resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution, given all features.

Our proposed entropy based feature selection algorithm

In feature selection, we first use InfoDist to evaluate the distance between a feature and the class label. If a feature xi has a smaller distance d(xi, C) to the class label C, we consider it more relevant to the class label. We define a cutoff distance based on the standard deviation: features whose distance is larger than the mean distance plus the cutoff value are regarded as irrelevant and are removed. In the experiments, we observe the effect on classification accuracy with respect to different cutoff values.

Second, we use Pearson’s correlation to evaluate the correlation between features. Two features with high correlation are redundant to each other, so we select the minimum-redundancy features according to the correlation between the features. We select three different feature sets according to the InfoDist distances and Pearson’s correlations for our experiments. The selection methods are described as follows (a code sketch follows the list):

1. Maximum relevant feature set (MaxRel): We use the cutoff value to discard irrelevant features from the sorted InfoDist feature list. The features that have the smallest Pearson’s correlation with respect to each feature in the remaining feature list are appended to the selection feature list, with duplicates eliminated.

2. Minimum redundancy feature set (MinRed) [21]: We discard irrelevant features from the sorted InfoDist feature list using the cutoff value, as in the maximum relevant feature set. The feature that has the smallest Pearson’s correlation with a listed feature is placed immediately after that feature in the selection list [22]. The selected feature lists are primarily based on low redundancy between features.
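The sketch below illustrates our reading of the MaxRel step: the cutoff filter on InfoDist distances followed by the least-correlated append with duplicates eliminated. The helper structures and toy values are hypothetical; in the real algorithm these quantities are computed over the propagated, virtually joined relations.

```python
from statistics import mean, pstdev

# MaxRel sketch (our interpretation, not the authors' code).
# `infodist`: feature name -> d(X, C); `corr`: unordered feature pair -> Pearson correlation.
def max_rel(infodist, corr, cutoff=1.0):
    dists = list(infodist.values())
    threshold = mean(dists) + cutoff * pstdev(dists)
    # drop features beyond mean + cutoff * std, keep the rest sorted by InfoDist
    relevant = sorted((f for f, d in infodist.items() if d <= threshold),
                      key=infodist.get)
    selected = []
    for f in relevant:
        # append the feature least correlated with f, eliminating duplicates
        least = min((g for g in relevant if g != f),
                    key=lambda g: abs(corr[frozenset((f, g))]))
        if least not in selected:
            selected.append(least)
    return selected

# toy usage with three made-up features
infodist = {"amount": 0.2, "duration": 0.3, "balance": 0.9}
corr = {frozenset(("amount", "duration")): 0.8,
        frozenset(("amount", "balance")): 0.1,
        frozenset(("duration", "balance")): 0.2}
print(max_rel(infodist, corr, cutoff=0.5))   # ['duration', 'amount']
```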

InfoDist calculation: InfoDist is based on information theory. The main concept of information theory is entropy, which measures the expected uncertainty or the amount of information provided by a certain event. The entropy of a random variable X is defined as follows:

H(X) = -\sum_{x} P(X = x)\,\log_2 P(X = x)    (1)

where P(X = x) is the prior probability of x.

The conditional entropy H(Y|X) of a random variable Y given X is defined as follows:

H(Y|X) = -\sum_{x} P(X = x) \sum_{y} P(Y = y \mid X = x)\,\log_2 P(Y = y \mid X = x)    (2)

Mutual information is a measure of how much the probability distribution for a random variable changes when the value of another random variable is known. The mutual information between two random variables X and Y is defined in the following:

I(X; Y) = \sum_{x}\sum_{y} P(X = x, Y = y)\,\log_2 \frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}    (3)

InfoDist adopts conditional entropy to measure the relevance between a feature and the class label. The distance d(X, C) between a feature X and the class label C is evaluated by

d(X, C) = \frac{H(X \mid C) + H(C \mid X)}{H(X, C)}    (4)
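For concreteness, the sketch below estimates these quantities from discrete samples; the normalized InfoDist form it implements mirrors our reconstruction in equation (4) and is an assumption rather than the authors' exact code.

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) over a sample of discrete values, as in equation (1)."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(ys, xs):
    """H(Y | X) estimated from paired samples, as in equation (2)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def infodist(xs, cs):
    """d(X, C) = (H(X|C) + H(C|X)) / H(X, C); smaller means more relevant to the class."""
    h_joint = entropy(list(zip(xs, cs)))          # joint entropy H(X, C)
    if h_joint == 0:
        return 0.0
    return (conditional_entropy(xs, cs) + conditional_entropy(cs, xs)) / h_joint

# a feature that fully determines the class label has distance 0
print(infodist(["low", "low", "high", "high"], ["A", "A", "B", "B"]))   # 0.0
```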

Pearson’s correlation calculation: We adopt Pearson’s correlation to measure the redundancy between features. If variables X and Y are both continuous, the correlation is calculated by the following formula:

r(X, Y) = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}    (5)

If X is a discrete feature with k values and Y is a continuous feature, the correlation is calculated by the following formula:

r(X, Y) = \sum_{i=1}^{k} P(X = x_i)\, r(X_{b_i}, Y)    (6)

where X_{b_i} is a binary feature that takes value 1 when X has value x_i and 0 otherwise. If variables X and Y are both discrete (with k and l values, respectively), the correlation is calculated by the following formula:

r(X, Y) = \sum_{i=1}^{k}\sum_{j=1}^{l} P(X = x_i, Y = y_j)\, r(X_{b_i}, Y_{b_j})    (7)
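The sketch below implements equations (5)-(7) under the indicator-variable construction described above. The probability weighting and the use of absolute correlation values (so that opposite-signed indicator correlations do not cancel) are our interpretation; the text does not specify these details.

```python
import numpy as np

def pearson(x, y):
    """Equation (5): Pearson correlation of two continuous samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def corr_discrete_continuous(x, y):
    """Equation (6): X discrete, Y continuous; prior-weighted indicator correlations."""
    x, y = np.asarray(x), np.asarray(y, float)
    total = 0.0
    for v in np.unique(x):
        xb = (x == v).astype(float)               # binary indicator X_bi
        total += xb.mean() * abs(pearson(xb, y))  # xb.mean() estimates P(X = x_i)
    return total

def corr_discrete_discrete(x, y):
    """Equation (7): both discrete; joint-probability-weighted indicator correlations."""
    x, y = np.asarray(x), np.asarray(y)
    total = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            p = float(np.mean((x == vx) & (y == vy)))   # estimates P(X = x_i, Y = y_j)
            if p > 0:
                total += p * abs(pearson((x == vx).astype(float),
                                         (y == vy).astype(float)))
    return total

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 3))                    # 1.0
print(round(corr_discrete_continuous(list("aabb"), [1, 1, 5, 5]), 3))   # 1.0
```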

There are two ways to deal with multiple relations in Naïve Bayes [23,24]: one is to convert the multiple relations into a single relation, and the other is to deal with each relation directly. We prefer the latter because of the advantages of MRDM.

We now need to extend the Naïve Bayes classification formula to deal with multiple relations. Assume that t is the target relation and s is a non-target relation that can be joined with t. Assume relation t has n attributes and relation s has r attributes. For a tuple x in relation t, x = (x1, x2, …, xn), there are p tuples in relation s that can be joined with x. These p tuples are (yk1, yk2, …, ykp), where each tuple yki is represented by its r attributes: yki = (yki1, yki2, …, ykir). The class label of the tuple x is then predicted according to the following formula:

\hat{c}(x) = \arg\max_{c}\; P(c)\,\prod_{j=1}^{n} P(x_j \mid c)\,\prod_{i=1}^{p}\prod_{l=1}^{r} P(y_{k_i l} \mid c)    (8)

In order to make the above expression operational, we need a feasible way to specify the probability distribution for each attribute and compute the associated conditional probabilities. In our algorithm, we adopt the tuple ID propagation method to virtually join relations along each path and collect the information required for the computation. To guide the search within the relation space, a semantic relationship graph is also constructed to represent and summarize the relationships between the relations in the database.
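A sketch of the prediction rule in equation (8) follows, assuming the priors and conditional probabilities have already been collected (e.g., via tuple ID propagation). The log-space computation and the small floor value are our additions for numerical robustness, and the attribute and class names in the usage example are made up.

```python
import math

def predict(x_attrs, joined_attrs, priors, cond_probs, classes):
    """x_attrs: {attr: value} of the target tuple;
    joined_attrs: list of {attr: value} dicts, one per joined non-target tuple;
    priors: {c: P(c)}; cond_probs: {(attr, value, c): P(value | c)}."""
    best_c, best_score = None, -math.inf
    for c in classes:
        score = math.log(priors[c])
        # the target tuple and every virtually joined tuple contribute naive-Bayes factors
        for attrs in [x_attrs, *joined_attrs]:
            for a, v in attrs.items():
                score += math.log(cond_probs.get((a, v, c), 1e-9))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

priors = {"good": 0.7, "bad": 0.3}
cond = {("status", "A", "good"): 0.9, ("status", "A", "bad"): 0.2,
        ("k_symbol", "SIPO", "good"): 0.5, ("k_symbol", "SIPO", "bad"): 0.4}
print(predict({"status": "A"}, [{"k_symbol": "SIPO"}], priors, cond, ["good", "bad"]))
# -> 'good'
```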

By virtual join, tuple IDs are propagated from the target relation to the non-target relations; the semantic relationships between relations remain unchanged, and the storage cost is lower than that of a physical join. In feature selection, we first use InfoDist to evaluate the distance between a feature and the class label: if a feature xi has a smaller distance d(xi, C) to the class label C, we consider it more relevant, and features whose distance is larger than the mean distance plus the cutoff value (defined from the standard deviation) are regarded as irrelevant and removed. Second, we use Pearson’s correlation to evaluate the correlation between features; two features with high correlation are redundant to each other, so we select the minimum-redundancy features according to the correlations: the feature that has the smallest Pearson’s correlation with respect to each feature in the remaining feature list is appended to the selection feature list, with duplicates eliminated. We select three different feature sets according to the InfoDist distances and Pearson’s correlations for our experiments. As a result, the best candidate features are produced to improve classification accuracy. These selected features are then used for multi-relational Naïve Bayes classification.

Entropy based feature selection algorithm: The symbols and functions referred to in the algorithm are as follows:

1. D, Relational database

2. M, Target table

3. Ri (i = 1, 2, …, n), Association tables

4. FeatureCount_Ri, Number of features in table Ri

5. Function CreateRelationGraph(G) generates the relation graph.

6. Function Propagate(Ri, M) transmits the class label from the target table M to the table Ri.

7. Function InfoDist(Aj, C) calculates the InfoDist of feature Aj with respect to class label C in the respective table.

8. Function PerCorr(Aj, C) calculates the Pearson’s correlation of feature Aj with respect to the other features in the table.

9. Aj, Feature of the table

10. C, Class label

Experiments, Results and Discussion

For our experimental study, we used the well-known PKDD Financial relational dataset (Table 2), whose schema is described in Figure 1. The Loan table is the target table, and the other tables are treated as non-target tables.

Relation | No. of objects | Description
Account | 4500 | describes static characteristics of an account
Client | 5369 | describes characteristics of a client
Disposition | 5369 | relates a client with an account
Order | 6471 | describes characteristics of a payment order
Transaction | 1056320 | describes one transaction on an account
Loan | 682 | describes a loan granted for a given account
Card | 892 | describes a credit card issued to an account
District | 77 | describes demographic characteristics of a district

Table 2: PKDD Financial dataset description

Table 3 compares the performance of existing feature selection methods with our proposed entropy based feature selection classifier. We used accuracy as the comparison parameter and achieved better accuracy than the existing methods.

Data set | Classifier | Accuracy (%)
PKDD Financial dataset | FARS | 83
 | Multi-relational Naïve Bayes classifier with MR2 feature selection and wrapper method | 89
 | Entropy based feature selection classifier | 91

Table 3: Performance comparison of the entropy based feature selection classifier with existing feature selection based classifiers.

Table 4 compares our classifier with existing multi-relational classifiers (without a feature selection approach). We still achieve better performance than the existing methods on the PKDD financial dataset.

Data set | Classifier | Accuracy (%)
PKDD Financial dataset | FOIL | 71.5
 | TILDE | 81.3
 | Graph-NB | 85.25
 | CrossMine | 89.8
 | Entropy based feature selection classifier | 91

Table 4: Performance comparison of the entropy based feature selection classifier with existing multi-relational classifiers.

To calculate the classification accuracy, we develop four join paths according to the dataset relationships. Each path starts from the Loan target table and passes through the Account table before branching. For easy reference, we call the join paths OrderPath, TransPath, CardPath and ClientPath (Figures 5 and 6).

Figure 5: Performance Comparisons of Entropy based feature selection classifier with existing Feature Selection Based Classifier.

Figure 6: Performance Comparisons of Entropy based feature selection classifier with existing Multi-relational Classifier.

OrderPath -> Loan, Account, Order

TransPath -> Loan, Account, Transaction

CardPath -> Loan, Account, Disposition, Card

ClientPath -> Loan, Account, Disposition, Client

Conclusion and Future Work

We proposed an entropy based feature selection method that combines the MaxRel feature selection method with the multi-relational Naïve Bayes classifier. Our proposed entropy based feature selection for the multi-relational Naïve Bayesian classifier improves classification accuracy and enhances the comprehensibility of the models. In the pre-processing step, feature selection selects relevant features using InfoDist values and removes redundant features using Pearson’s correlation. In the filter step, we retain the most relevant features in the feature pool with respect to the cut-off value. The experimental results show that our proposed classifier is effective with respect to classification accuracy. In future work, we plan to apply our proposed classifier to more relational datasets to further measure its performance.

References
