Entropy Based Mean Clustering: An Enhanced Clustering Approach

Many applications of clustering require the use of normalized data, such as text data or mass spectra mining data. The K-Means clustering algorithm is one of the most widely used clustering algorithms and works on a greedy approach. Major problems with traditional K-means clustering are the generation of empty clusters and the large number of computations required to form the clusters. To overcome these problems we propose the Entropy Based Means Clustering Algorithm. The proposed algorithm produces normalized cluster centers and is hence highly useful for text data or massive data. The proposed algorithm shows better performance than the traditional K-Means clustering algorithm in mining data in terms of reducing time, seed predictions and avoiding empty clusters.


INTRODUCTION
While data collection methodologies have become increasingly sophisticated in recent years, the problem of inaccurate data continues to be a challenge for many data mining problems. This is because data collection methodologies are often inaccurate and are based on incomplete or inaccurate information. For example, the information collected from surveys is highly incomplete and either needs to be imputed or ignored altogether. In other cases, the base data for the data mining process may itself be only an estimation from other underlying phenomena. In many cases, a quantitative estimation of the noise in different fields is available. An example is illustrated in [8], in which error driven methods are used to improve the quality of retail sales merchandising. Many scientific methods for data collection are known to have error-estimation methodologies built into the data collection and feature extraction process.
Exploratory data analysis processes often make use of clustering techniques. These can be used to look for groups of similar objects according to some metric. Properties can be considered as well. Many methods can provide relevant partitions on one dimension (say objects or properties), but they suffer from the lack of explicit cluster characterization, i.e., which properties are shared by the objects of the same cluster.
Clustering is one of the most important research areas in the field of data mining. Clustering means creating groups of objects based on their features in such a way that the objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering is an unsupervised learning technique. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. Clustering algorithms can be applied in many domains.
K-means clustering is a method that partitions n data points within a vector space into k distinct clusters. Points are allocated to the closest cluster and cluster locations arise naturally to fit the available data.
K-means minimizes intra-cluster variance, that is, clusters form that minimize the sum of the squared distances between data points and the center (centroid) of their containing cluster. However, k-means is not guaranteed to find a global minimum.
The k-means algorithm [1,2] is successful in producing clusters for many practical applications. However, the computational complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, the algorithm produces different clusters depending on the random choice of initial centroids [3]. It is also difficult to compare the quality of the clusters produced: for example, different initial partitions or values of k affect the outcome, and the algorithm does not work well with non-globular clusters. Several attempts have been made by researchers to improve the performance of the k-means clustering algorithm.
In this paper, we present an enhanced approach which eliminates unnecessary computations in partitioning the data. The basic execution of the k-means algorithm is preserved along with all its necessary characteristics. In the proposed algorithm, the complexity of the mechanism is reduced by using the entropy of each seed of a cluster. Here, the term entropy denotes the number of identical instances of a value in the dataset.

Problem Statement
The aim is to compare the two algorithms using normally distributed data points. Two unsupervised clustering methods, namely K-means and entropy based K-means, are examined and analyzed based on the distance between the input data points. The clusters are formed according to the distance between data points, and a cluster center is formed for each cluster. For the implementation, we take the datasets from the UCI Machine Learning Repository. The implementation used advanced Java, MS-Excel and MATLAB. The execution time is measured in milliseconds. This paper deals with a method for improving the efficiency of the k-means algorithm and shows that the elapsed time taken by entropy based k-means is less than that of the k-means algorithm.

FACTORS DRIVING THE PROPOSED WORK
This section describes the original k-means clustering algorithm. The idea is to classify a given set of data into k disjoint clusters, where the value of k is fixed in advance. The algorithm consists of two separate phases: the first phase is to define k centroids, one for each cluster [4,17]. The next phase is to take each point belonging to the given data set and associate it with the nearest centroid.
The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. The algorithm works as follows:
1. The algorithm arbitrarily selects k points as the initial cluster centers ("means").
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between the point and each cluster center.
3. Each cluster center is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge.
Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated or that the changes do not make a material difference in the definition of the clusters.
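The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: the function name `kmeans_sketch`, the `init_centers` and `max_iter` parameters, and the choice to let an empty cluster keep its previous center are all assumptions made for the example.

```python
import math
import random


def kmeans_sketch(points, k, init_centers=None, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: pick k initial centers (arbitrarily, unless supplied by the caller).
    centers = list(init_centers) if init_centers is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to the closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster simply keeps its old center.
                new_centers.append(centers[j])
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Note that nothing in this basic procedure prevents a cluster from ending up with no members, which is the empty-cluster problem discussed in this paper.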
In the K-means algorithm, a set D of n patterns {x_1, x_2, ..., x_n} of dimension d is partitioned into K clusters denoted by {C_1, C_2, C_3, ..., C_K} so as to minimize the objective function

$$J = \sum_{j=1}^{K} \sum_{i=1}^{n} \left\| x_i^{(j)} - C_j \right\|^2$$

where $\| x_i^{(j)} - C_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $C_j$, and J is an indicator of the distance of the n data points from their respective cluster centres.
Although the K-means algorithm produces good-quality results, it struggles with a large number of computations, does not perform well on non-globular clusters, and has a further major problem in handling null (empty) clusters.

THE ENTROPY BASED MEANS CLUSTERING ALGORITHM
The proposed Entropy Based Means Clustering algorithm reduces the significant limitations observed in the basic K-means clustering technique. The Entropy Based Mean algorithm is a slight modification of the K-means clustering method and can be used more effectively than the normal k-means algorithm. The proposed algorithm works in three phases. In the first phase it computes the entropy of each seed (element or item) in the data set and then arranges the seed elements in the order of their seed entropy. For example, given the seed-entropy pairs 1-10, 2-5, 3-9, 4-6, 5-1, it arranges the data as 1, 3, 4, 2, 5, i.e., in descending order of entropy. In the second phase, it builds the candidate set. This candidate set is unique in nature, i.e., it does not contain duplicate elements.
In the third phase, clustering is applied to the candidate set using Euclidean distances, and the remaining elements, which are not in the candidate set, are placed in the clusters where their matching candidate elements reside.

A. Arranging the data in the descending order of the Entropy
This phase identifies the entropy of each seed in the data set D = {x_1, x_2, x_3, ..., x_n} and arranges the seeds in descending order of seed entropy. Here entropy is calculated as the number of elements of the same kind. For example, for D = {1,2,1,2,4,4,4,1,4,1,1,3} the entropies, in descending order, are: seed 1 has entropy 5, seed 4 has entropy 4, seed 2 has entropy 2, and seed 3 has entropy 1.
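The counting and ordering in this phase can be illustrated with a short Python sketch. The function name `seed_entropy_order` is hypothetical, and the handling of ties (seeds with equal counts keep their order of first appearance) is an assumption, since the text does not specify it.

```python
from collections import Counter


def seed_entropy_order(data):
    """Return (seed, entropy) pairs in descending order of entropy.

    "Entropy" here follows the paper's usage: the number of times a
    value occurs in the data set.
    """
    counts = Counter(data)
    # most_common() sorts by count, highest first; ties keep insertion order.
    return counts.most_common()


D = [1, 2, 1, 2, 4, 4, 4, 1, 4, 1, 1, 3]
ordered = seed_entropy_order(D)
```

For the sample data set D above, `ordered` lists seed 1 first and seed 3 last, and contains only four entries for the twelve original elements.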

B. Identification of Candidate set C from D
This phase determines the candidate seeds in the dataset. In our sample we have 4 candidate seeds instead of 12 data seeds. In this way we reduce the number of computations needed to form the clusters.

C. Making the clusters by using defined K and C.
This phase proceeds as follows:
3. Make the candidate data set C such that C contains no duplicate seeds, and make one duplicate candidate set DC.
4. a) Set the mean of each cluster CL_k to 0 and call it the cluster centre CC_k.
   b) Assign a seed from the candidate set C to every cluster CL_k.
5. Recompute the mean of each CL_k.
6. For each seed point C_i remaining in C, find the closest centroid CC_j and assign C_i to cluster j.
7. a) Place the seed point C_i in the cluster CL_k whose centre is at the present nearest distance.
   b) Detach C_i from C.
   c) Repeat step 5.
8. Repeat steps 6 to 7 until the candidate set C becomes empty and convergence is reached.
9. For each element in {D - DC}, do the following:
   a) Compare the seeds of each CL_k with the data seeds in {D - DC}.
   b) Place the seeds in {D - DC} into the corresponding CL_k.
The above algorithm shows that the new clustering scheme closely follows the original k-means process, with some differences: the preparation of the candidate set C, which reduces the number of computations, and the rearrangement of the data seeds.
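One possible reading of the three phases is sketched below in Python for one-dimensional numeric data. It is a simplified single-pass illustration under stated assumptions, not the authors' implementation: the function name `ebm_cluster_sketch` is hypothetical, the first k candidates (those with the highest entropy) are assumed to seed the k clusters, each center is assumed to be the unweighted mean of the candidate seeds in its cluster, and no further refinement iterations are run once the candidate set is empty.

```python
from collections import Counter


def ebm_cluster_sketch(data, k):
    """Single-pass sketch of entropy based mean clustering for 1-D numeric data."""
    # Phase A: count occurrences ("entropy") and order seeds by descending count.
    ordered = [value for value, _ in Counter(data).most_common()]
    # Phase B: the ordered unique values form the candidate set C.
    candidates = list(ordered)
    if k > len(candidates):
        raise ValueError("k cannot exceed the number of unique values")
    # Phase C: seed every cluster with one candidate, so no cluster starts empty.
    clusters = [[seed] for seed in candidates[:k]]
    centers = [float(seed) for seed in candidates[:k]]
    for c in candidates[k:]:
        # Assign the candidate to the nearest center, then recompute that mean.
        j = min(range(k), key=lambda idx: abs(c - centers[idx]))
        clusters[j].append(c)
        centers[j] = sum(clusters[j]) / len(clusters[j])
    # Final step: every original element joins the cluster of its matching candidate.
    home = {value: j for j, members in enumerate(clusters) for value in members}
    labels = [home[x] for x in data]
    return labels, centers
```

With D = [1, 2, 1, 2, 4, 4, 4, 1, 4, 1, 1, 3] and k = 2, this sketch computes distances for only the four candidate seeds and places seeds 1 and 2 in one cluster and seeds 4 and 3 in the other. Because every cluster receives a seed before any distance is computed, no cluster can be empty as long as k does not exceed the number of unique values.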

Rate of Convergence of the Entropy based Clustering Algorithm.
In the proposed algorithm, an iteration starts with a set of old centers CC_k(old). The data elements are distributed among the clusters according to the minimum Euclidean distance, and then a set of new centers CC_k(new) is generated by averaging the data elements in each cluster.
This center update can be described mathematically as

$$CC_k^{(\mathrm{new})} = \frac{1}{n_k} \sum_{x_i \in CL_k} x_i$$

where n_k is the number of elements in cluster CL_k. If the new centers CC_k(new) do not match the old centers CC_k(old) exactly, the algorithm performs the next iteration, taking CC_k(new) as CC_k(old).
Let CL be the cluster consisting of the elements CL = {x_1, x_2, x_3, x_4, ..., x_n} and the corresponding cluster center. The subsequent centers can be obtained as follows. Similarly for the r-th iteration. All the above equations are in geometric progression (G.P.), so combining all the equations we get the converged condition for entropy based mean clustering as follows. For the converged centers of all clusters, say CL_1 = {x_1, x_2, x_3, x_4, ..., x_p} and CL_2 = {y_1, y_2, y_3, ..., y_q}, the overall process can be defined as satisfying the two conditions i. ii.

EXPERIMENTAL RESULTS
This section compares the performance of conventional k-means and entropy based mean clustering in terms of handling null clusters and the accuracy of the clusters.
For this test, we used 50 instances of the "abalone" dataset (9 attributes) from the UCI Machine Learning Repository, clustering on the attribute "Rings" (Table 1).

A. Handling of Empty Clusters and Accuracy of Clusters
Here we show experimentally how the entropy based clustering algorithm overcomes the empty-cluster problem of K-means clustering. Table 2 shows the number of samples (elements) allocated to each cluster. From Table 3, we notice that the average cluster size in EBM clustering is almost uniform in every case (2, 5, 10 and 15 clusters), whereas in K-means the average cluster size is not uniform. This may affect the size and shape of the clusters.
Figure 1 shows the comparative runtimes of the two algorithms (see also Table 4). From Figure 1, we notice that the runtime of EBM clustering is almost the same for every number of clusters, whereas the runtime of K-means increases as the number of clusters increases. The following figures show the final clusters obtained from the EBM clustering technique.

CONCLUSION
This paper proposes Entropy Based Mean clustering, which enhances several features of basic K-means clustering. The time complexity can be estimated from the CPU elapsed time of the two algorithms. As a rule, the elapsed time varies from one processor to another, depending on the speed and type of the system. Partitioning algorithms work well for spherical-shaped clusters in different types of data points. The advantage of the K-means algorithm is its favorable execution time. Its drawback is that the user has to know in advance how many clusters are sought. From the experimental results, we observe that the K-means algorithm is efficient for smaller data sets and Entropy Based Mean clustering is good for larger databases. For both algorithms, the size and shape of the clusters depend upon the type of attribute selected for clustering and the number of unique values of that attribute. EBM clustering is preferable because it eliminates empty clusters during cluster generation itself, so no additional phases are required.