A Micro Partitioning Technique in MapReduce for Massive Data Analysis
In recent years, large amounts of structured and unstructured data have been collected from various sources. Such volumes are difficult for a single machine to handle, so the work must be distributed across a large number of computers. Hadoop is one such distributed framework; it processes data in a distributed manner using the MapReduce programming model. For MapReduce to work, it must divide the workload across the machines in the cluster. Its performance depends on how evenly the workload is distributed, without skew, and on avoiding the execution of tasks on poorly performing nodes, called stragglers. The workload distribution depends on the algorithm that partitions the data. To overcome the problem of skew, an efficient partitioning technique is proposed. The proposed algorithm improves load balancing and reduces memory requirements. Slow-running nodes also degrade the performance of a MapReduce job. To address this, a technique called micro partitioning is used: the work is divided into a number of small tasks greater than the number of reducers, and these tasks are then assigned to the reducers. Running many small tasks lessens the impact of stragglers, since any work scheduled on a slow node is small and the remaining tasks can be picked up by other idle workers.
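The effect of micro partitioning on stragglers can be illustrated with a small simulation. The sketch below is not the proposed algorithm itself, which the abstract does not spell out. It assumes a simple greedy scheduler in which each reducer pulls the next task as soon as it becomes idle, and the names `micro_partition` and `simulate_makespan`, the reducer speeds, and the task sizes are all hypothetical values chosen for illustration.

```python
import heapq
import zlib


def micro_partition(key: str, num_micro_partitions: int) -> int:
    """Map a key to one of many small partitions (hypothetical helper).

    Assumption: a plain hash-modulo partitioner; the number of micro
    partitions is chosen to be larger than the number of reducers.
    """
    return zlib.crc32(key.encode("utf-8")) % num_micro_partitions


def simulate_makespan(task_sizes, reducer_speeds):
    """Return the job completion time under greedy pull scheduling.

    Assumption: each reducer takes the next task as soon as it is idle,
    and a task of size s takes s / speed time units on a given reducer.
    """
    # Heap of (time at which the reducer becomes idle, reducer index).
    idle = [(0.0, i) for i in range(len(reducer_speeds))]
    heapq.heapify(idle)
    finish = 0.0
    for size in task_sizes:
        free_at, i = heapq.heappop(idle)
        done_at = free_at + size / reducer_speeds[i]
        finish = max(finish, done_at)
        heapq.heappush(idle, (done_at, i))
    return finish


# Hypothetical cluster: three normal reducers and one straggler
# running at a quarter of the normal speed.
speeds = [1.0, 1.0, 1.0, 0.25]

# The same 64 units of work, split two ways.
coarse_tasks = [16.0] * 4   # one partition per reducer
micro_tasks = [1.0] * 64    # many more partitions than reducers

coarse_makespan = simulate_makespan(coarse_tasks, speeds)
micro_makespan = simulate_makespan(micro_tasks, speeds)
```

With one partition per reducer, the straggler holds a full quarter of the work and determines the completion time of the whole job. With 64 micro partitions, the straggler only ever holds one small task at a time, and the faster reducers absorb the rest, so `micro_makespan` comes out well below `coarse_makespan` in this setup.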