EXTRACTION OF STRUCTURED INFORMATION FROM UNSTRUCTURED OR SEMI-STRUCTURED MACHINE-READABLE WEB PAGES
Vinod Kumar Raavi*1 and Satya P Kumar Somayajula2
Corresponding Author: Vinod Kumar Raavi, E-mail: [email protected]
Extracting structured information from unstructured or semi-structured machine-readable documents now plays a vital role, and the World Wide Web is the major source of such information. Many websites publish their content through common templates in order to achieve high publishing productivity. Template detection has recently received considerable attention because it can improve tasks such as the clustering and classification of web documents and the performance of search engines: templates degrade the performance and efficiency of web applications for machines because of the irrelevant terms they contain. In this paper we present a novel algorithm for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents according to the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. We therefore consider the proposed algorithm to be the best among the template detection algorithms.
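The idea of grouping pages by template structure and extracting one template per group can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, which the abstract does not specify. It assumes a simplified stand-in: each page is represented by its set of root-to-element tag paths, pages are clustered greedily by Jaccard similarity, and the template of a cluster is taken to be the paths shared by all its members. The names `tag_paths`, `cluster_by_template`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Hypothetical helper: structural fingerprint of one page."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_template(pages, threshold=0.6):
    """Greedy one-pass clustering (an assumption, not the paper's method).

    A page joins the first cluster whose current template is similar enough;
    the cluster template is then narrowed to the paths shared by all members.
    """
    clusters = []
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["template"] = cluster["template"] & paths
                cluster["members"].append(index)
                break
        else:
            clusters.append({"template": paths, "members": [index]})
    return clusters
```

In this sketch the clustering and the template extraction happen in the same pass, which mirrors the abstract's claim that templates are extracted simultaneously with the grouping. The greedy strategy is order-dependent and is used here only because it is short.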