Special Issue Article
Entity Recognition in a Web Based Join Structure
Given a document, the task of Entity Recognition is to identify predefined entities such as person names, products, or locations in that document. With a potentially large dictionary, entity recognition becomes a dictionary-based membership checking problem. One formulation, Approximate Membership Extraction (AME), aims to find all substrings of a document that approximately match any reference in the given dictionary. AME generates many redundant matched substrings, which makes it unsuitable for real-world tasks based on entity extraction. Approximate Membership Localization (AML) instead aims only at locating true mentions of clean references. It rests on an important observation: in real-world situations, one word position within a document generally belongs to only one reference-matched substring, meaning that the true matched substrings should not overlap. AML therefore targets non-overlapping substrings in a given document that approximately match any clean reference. Where several substrings overlap, only the one with the highest similarity to a clean reference qualifies as a result. Web-based join structure, a search-based approach that joins two tables using entity recognition from web documents, is a typical real-world application that relies heavily on membership checking. Membership checking is performed using correlation, Inverse Document Frequency (IDF), Jaccard similarity, and the P-Pruning technique.
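The difference between AME and AML can be illustrated with a minimal sketch. It is not the method of the article: it enumerates candidate windows by brute force instead of using P-Pruning, uses unweighted token-level Jaccard similarity instead of an IDF-weighted score, and resolves overlaps greedily by descending similarity. The function names (`jaccard`, `ame_candidates`, `aml_select`), the threshold of 0.6, the window limit, and the sample sentence are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (start, end, reference, similarity); end is exclusive, positions are token indices.
Match = Tuple[int, int, str, float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Unweighted Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ame_candidates(tokens: List[str], references: List[str],
                   threshold: float, max_len: int) -> List[Match]:
    """AME-style step: every window whose similarity to some reference
    reaches the threshold is reported, including overlapping ones."""
    ref_sets = [(ref, set(ref.lower().split())) for ref in references]
    matches: List[Match] = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = {t.lower() for t in tokens[start:end]}
            for ref, ref_set in ref_sets:
                score = jaccard(window, ref_set)
                if score >= threshold:
                    matches.append((start, end, ref, score))
    return matches


def aml_select(candidates: List[Match]) -> List[Match]:
    """AML-style step (greedy simplification): visit candidates from the
    highest similarity down and keep one only if none of its word
    positions is already covered by a kept match."""
    used: Set[int] = set()
    kept: List[Match] = []
    ordered = sorted(candidates, key=lambda m: (-m[3], m[0], m[1]))
    for start, end, ref, score in ordered:
        if all(pos not in used for pos in range(start, end)):
            kept.append((start, end, ref, score))
            used.update(range(start, end))
    return sorted(kept)


# Hypothetical document and dictionary, chosen only for illustration.
tokens = "I bought a Canon EOS 400D camera in New York City last week".split()
references = ["Canon EOS 400D Digital Camera", "New York City"]

candidates = ame_candidates(tokens, references, threshold=0.6, max_len=5)
for start, end, ref, score in aml_select(candidates):
    print(" ".join(tokens[start:end]), "->", ref, round(score, 2))
```

On this sample, the AME-style step reports many overlapping windows around each mention (for example "EOS 400D camera" and "a Canon EOS 400D camera"), while the AML-style step keeps only the best-scoring, non-overlapping window per mention.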