Received date: May 13, 2013; Accepted date: September 03, 2013; Published date: September 09, 2013
Citation: Renganathan V, Babu AN, Sarbadhikari SN (2013) A Tutorial on Information Filtering Concepts and Methods for Bio-medical Searching. J Health Med Informat 4:131. doi: 10.4172/2157-7420.1000131
Copyright: © 2013 Renganathan V, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords: Information filtering system; Filtering agent; User profiles; Methods of gathering user information; Filtering models; Information filtering types
Vast amounts of information are now widely accessible on the web. Customarily, when a user wants to find interesting documents or data sources, the user has to actively search the World Wide Web. Searchers require effective means to efficiently find the information that they really need, and to avoid the irrelevant information that does not match their interests. Information retrieval [1,2] and information filtering are two major information access techniques:
Information retrieval is concerned with retrieving the relevant documents from a large collection of material efficiently. It is concerned with the collection, representation, storage, organization, access, manipulation and display of information.
The immense volume of source information, however, often leads to query results which are too long and unwieldy for human users to manage effectively. The need, therefore, arises for more “intelligent” aids for information access tasks. Information filtering is an example of such an information access process.
Unlike information retrieval, information filtering generally focuses on users’ long-term information needs, often being stable preferences. It operates on dynamically changing information streams (e.g. email and news). Based on a user’s profile, which is initially derived from his or her interests, a filtering system processes a new item and takes appropriate actions that either ignore it or bring it to the user’s notice.
Information filtering and information retrieval have similar aims. Each seeks to retrieve information according to the user’s request while minimizing the amount of irrelevant information. There are, however, key differences between information retrieval and information filtering, as noted in Table 1.
|Characteristic||Information retrieval||Information filtering|
|User profile||Not necessary||Essential|
|Information seeking behavior||Short term||Long term|
|User query||Brief||Description or explanation of the information|
|User interaction with the system||Single information seeking episodes||Series of information seeking episodes|
Table 1: Information retrieval vs filtering system.
Need for information filtering techniques in the biomedical field
Information filtering is a critical resource for biomedical researchers as the Internet contains vast volumes of heterogeneous data and analytical tools.
• Information filtering tools can reduce the time a biomedical researcher spends obtaining relevant information, which can, in turn, enhance the timeliness and quality of research.
• It will help the researcher in transforming the information to knowledge in a shorter time period.
Need for information filtering felt in different biomedical sub fields
• Public health
• Medical informatics
A well-known example is PubMed. Medical professionals have many sources of information available to them, such as PubMed, and information filtering is needed to extract from these sources the information that suits each professional’s profile of interests.
An information filtering system consists of a filtering agent and a user profile (Figure 1).
The filtering agent [5,6] acts as an interface between the user and the document system, and helps the user find documents relevant to a given topic through the user agent. It reduces the user’s time and effort in locating relevant documents through the specialized domain knowledge it possesses.
The filtering agent filters out irrelevant incoming documents and presents to the user only those documents that match the user’s interests. Over time, the filtering system becomes more effective as it learns the user’s preferences and becomes more accurate at the filtering task. Its roles include interfacing with the source document subsystem, managing the user profile (discussed below), calculating the relevance of a document vector to the user profile, and communicating with the user.
The process flow associated with the filtering agent:
1. The topic for the current document filtering session is obtained from the user
2. User profile vector is initialized using the current topic’s title and text
3. Document Vector d is obtained from source document subsystem
4. Current user-profile vector t is retrieved
5. The following probabilities are calculated:
Probability of Relevance (PR) = P(Relevant | d, t)
Probability of Non-Relevance (PN) = P(Non-Relevant | d, t)
6. If PR > PN, the document d is treated as relevant and:
(a) The text of d is obtained from the source document subsystem
(b) The text of d is presented to the user
(c) The actual relevance judgment for d is obtained from the user
(d) The user-profile vector t is updated with the actual relevance judgment of d
7. Repeat from Step 3.
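The process flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the two probabilities are estimated, so `probability_of_relevance` is a hypothetical stand-in (the share of a document's term occurrences that already appear in the profile), and the function names, the whitespace tokenizer and the update rule are likewise invented for the example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def probability_of_relevance(doc_vector, profile_vector):
    """Hypothetical stand-in for P(Relevant | d, t): the share of the
    document's term occurrences whose term also appears in the profile."""
    total = sum(doc_vector.values())
    if total == 0:
        return 0.0
    overlap = sum(n for term, n in doc_vector.items() if term in profile_vector)
    return overlap / total


def run_filtering_session(topic_text, documents, get_user_judgment):
    # Steps 1-2: initialise the user-profile vector t from the topic text.
    profile = Counter(tokenize(topic_text))
    shown = []
    for text in documents:                          # Step 3: next document d
        doc = Counter(tokenize(text))
        pr = probability_of_relevance(doc, profile)  # Step 5: PR
        pn = 1.0 - pr                                # Step 5: PN
        if pr > pn:                                  # Step 6
            shown.append(text)                       # 6(a)-(b): present d
            if get_user_judgment(text):              # 6(c): actual judgment
                profile.update(doc)                  # 6(d): reinforce profile
            # A negative judgment is simply ignored in this sketch.
    return shown, profile
```

Because the profile is reinforced after each positive judgment, a later document can pass the PR > PN test on the strength of terms learned from an earlier one, which is the learning behaviour the text attributes to the filtering agent.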
The user profile [7-10] (Figure 2) represents the user’s details (interests, needs, goals, and behavior), and is constantly updated in response to user feedback. The quality of user profiles has a major impact on the performance of information retrieval and filtering systems.
In order to provide personalized information to a user, the system creates and maintains a description of the type of information that the user is interested in accessing. Personalized content is retrieved by matching information against the user profile.
User profiles fall into two categories: static and dynamic. Static profiling is the process of acquiring a user’s characteristics, such as age, gender and profession, through direct input from the user.
A dynamic profile is created from the current or future actions of the user, and captures attributes such as current location, position, occupation and browser used.
Example of user profile and topic of interest
The user profile of a cardiologist contains his topics of interest (for example, myocardial infarction), and allows the filtering system to deliver articles on myocardial infarction whenever they are published. Apart from articles, the user profile can be used to recommend a movie or music based on the hobbies listed in the profile.
User profile of a physician who wishes to use an information filtering system
Education: DM (sub-specialization in cardiology)
Field of work: Cardiology
Topic of interest: Myocardial infarction
Hobbies: Soccer, music
User profile for a personalized cancer information system for patients
Types of interest
Stability of interests: direct to short-term interests
Main interest groups: explanation of medical issues concerning the patient, especially medical problems and treatments
Privacy sacrifice: anonymous use by the server
Task and goal characteristics
Available time: short to medium
Frequency of use: irregular
Types of task: to gain knowledge about the personal medical situation and treatments
Delivery patterns: almost synchronously
Current medical problems; past medical problems
Personal details such as age and sex
Initialization: implicit via medical dossier
Maintenance: implicit via medical dossier, which is updated by the medical staff
Methods for gathering user information
The user profile helps the filtering system in the personalization process. It can be created using three methods: explicit, implicit and mixed.
The explicit method requires the user to furnish information, such as preferences, explicitly through a questionnaire or template. This type of personalization is known as customization. Explicit methods are suited to acquiring user characteristics such as name, age, preferences and priorities. The explicit method also includes the relevance feedback process [12-14], in which the user evaluates (by rating or ranking) the relevance of the documents that the filtering system returned. Relevance feedback helps the filtering system to update the user profile explicitly.
The implicit method employs procedures that record information about the changing behavior or characteristics of the user. In this method, the user is not aware that information is being collected. Implicit methods are appropriate for recording user actions, such as time spent, items purchased, movement history, keys pressed and mouse movements. This method helps the filtering system keep the user profile up to date and detect changes in the user’s interests.
The mixed method combines the explicit and implicit methods. The system constantly looks for patterns in the user’s behavior, and asks the user explicit questions to verify its assumptions about user characteristics, for example by asking the user to rate how well the system’s output matches his or her preferences. A mixed system also begins its task by asking the user to provide information about preferences, which addresses the cold start problem.
Information filtering models may be interpreted as decision functions whose domain is the set of all possible document features and whose range is the set {relevant, non-relevant}. Even though it has its roots in the techniques of information retrieval, the filtering algorithm differs from personalized search. The following are the important models used in the information filtering process.
String matching model
In the traditional string matching model, the user specifies his or her information need as a string of words. A document matches the information need if the user-specified string occurs in the document. This method is one of the earliest and simplest approaches. It is poorly suited to matching documents that require contextual and experiential knowledge, and it suffers from the following problems: homonymy (words that are spelled the same but have different meanings), synonymy (different words that have the same meaning), polysemy (words with multiple meanings) and poor response time.
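A minimal sketch of string matching, with two invented article titles, shows the synonymy problem directly. The function name and the sample data are assumptions made for illustration.

```python
def string_match(document, query):
    """Return True when the user-specified string occurs in the document."""
    return query.lower() in document.lower()


# Hypothetical titles: both concern the same condition, but only one
# contains the literal query string.
docs = [
    "Early management of myocardial infarction in adults",
    "Outcomes after heart attack in elderly patients",
]
matches = [d for d in docs if string_match(d, "myocardial infarction")]
```

Only the first title is matched; the second is missed because it uses the synonym "heart attack".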
Boolean model
The Boolean model [18,19] is a modification of the above method in which the user can combine words using primitive Boolean operators such as AND, OR and NOT. This gives the user a better tool for expressing his or her information need but, at the same time, requires some skill on the part of the user. The model assigns equal weight to all terms in the query, and it shares the problems of the string matching model.
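As a rough illustration of Boolean filtering, the sketch below evaluates a nested query against the set of words in each document. The tuple-based query representation and the function names are assumptions chosen for brevity, not a format described in the article.

```python
def evaluate(query, terms):
    """Evaluate a nested Boolean query against a set of document terms.

    A query is either a single term (str) or a tuple of the form
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    """
    if isinstance(query, str):
        return query.lower() in terms
    op, *args = query
    if op == "AND":
        return all(evaluate(a, terms) for a in args)
    if op == "OR":
        return any(evaluate(a, terms) for a in args)
    if op == "NOT":
        return not evaluate(args[0], terms)
    raise ValueError(f"unknown operator: {op}")


def boolean_filter(documents, query):
    """Keep the documents whose word set satisfies the query."""
    return [d for d in documents if evaluate(query, set(d.lower().split()))]
```

For example, the query `("AND", ("OR", "aspirin", "statin"), "infarction", ("NOT", "pediatric"))` keeps documents that mention infarction together with either drug, and drops any that mention pediatric. Every term counts equally, as the text notes.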
Vector space model
A document K is represented as a vector of dimension n, where n is the total number of terms. Each term is given a weight, which represents its importance in the document and in the whole document collection. The vector K is represented as K = (w1, ..., wn), where wi is the weight assigned to the i-th term. Similarly, the query (profile) vector is represented as U = (x1, ..., xn).
The vector is created from the list of words the document contains; words with high frequency and low content-discriminating power (such as ‘and’ and ‘the’) are excluded. The weight of a term is calculated by multiplying its term frequency (tf) by its inverse document frequency (idf). tf measures how frequently the term appears in the document, and idf decreases as the term appears in more documents of the collection.
The relevance of an unseen document to a given query (profile) is judged by calculating the distance between the query (profile) vector and the document vector, and comparing this distance to a threshold value P (the vector distance is established using some pre-specified distance metric). Normally, the cosine similarity measure is used.
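The weighting and comparison steps can be sketched as follows. The stop-word list, the particular idf variant (log of N over document frequency) and all function names are assumptions made for the example; many other tf-idf variants exist. Because cosine is a similarity rather than a distance, the sketch treats a document as relevant when the similarity is at or above the threshold.

```python
import math
from collections import Counter

STOP_WORDS = {"and", "the", "of", "in", "a"}  # illustrative, not exhaustive


def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def idf_table(corpus_tokens):
    """idf(term) = log(N / df(term)); one common variant among several."""
    n = len(corpus_tokens)
    df = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse vector mapping each term to tf * idf (0 for unknown terms)."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine_similarity(u, k):
    dot = sum(w * k.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_k = math.sqrt(sum(w * w for w in k.values()))
    if norm_u == 0 or norm_k == 0:
        return 0.0
    return dot / (norm_u * norm_k)


def is_relevant(profile_text, document_text, idf, threshold):
    u = tfidf_vector(tokenize(profile_text), idf)   # profile vector U
    k = tfidf_vector(tokenize(document_text), idf)  # document vector K
    return cosine_similarity(u, k) >= threshold
```

The idf table would be built once from a reference collection, for example `idf_table([tokenize(d) for d in corpus])`, and reused for every incoming document.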
Latent semantic index model
The Latent Semantic Index (LSI) model [22,23] assumes that there is a latent structure in the pattern of word usage across documents, which can be exploited to overcome one drawback of the vector space model, namely its assumption that words are orthogonal or independent. It creates a multidimensional semantic structure of the information using the implicit higher-order structure in the association of terms with documents. LSI relies on the singular value decomposition (SVD) technique [22,23], which is also used to reduce the dimensionality of large volumes of data. LSI creates a semantic space for a set of documents that the user has previously judged as relevant or not. A new document is categorized as relevant if it is close to the interesting documents in the semantic space. If, on the other hand, it is close to the non-interesting documents, it is categorized as irrelevant.
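The SVD step can be written out as a short worked formulation. The symbols X, k, U_k, Σ_k and V_k are the usual notation for a truncated SVD and are introduced here for illustration; the centroid-based decision rule in the last line is only one possible reading of "close to the interesting documents" and is an assumption, not a rule given in the article.

```latex
% X: n-by-m term-document matrix (n terms, m previously judged documents)
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(n, m)

% Folding a new document with term vector d \in \mathbb{R}^n
% into the k-dimensional semantic space
\hat{d} = \Sigma_k^{-1} U_k^{\top} d

% One possible decision rule, with c_R and c_N the centroids of the
% relevant and non-relevant documents in that space
\text{relevant} \iff \cos(\hat{d}, c_R) > \cos(\hat{d}, c_N)
```

Keeping only the k largest singular values is what reduces the dimensionality, and terms that co-occur across documents end up close together in the reduced space.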
Bayesian network model
Bayesian network principles can be used in collaborative filtering systems. The rating data can be used to learn a Bayesian network, in which each item is represented by a node, and a directed arrow between two items signifies that the user’s interest in one item influences interest in the other. The Bayesian network is used to create a probabilistic decision tree for each item, in which the leaf nodes are likelihoods of the target user’s interest in the target item, and the intermediate decisions are based on the target user’s view of the parent items of the target item in the network. Bayesian networks are useful when user preferences change slowly relative to the time needed to build the model, but they are not suitable for environments in which user preference models need to be updated frequently.
Artificial neural network models
Artificial neural network (ANN) models are used in information filtering systems [25,26] to create the terms in user profiles automatically, by training the network on examples. The ability of an ANN to model non-linear relationships can be applied to matching documents to the user profile. Neural networks can represent the user’s preferences, with words from documents represented as nodes and the strength of association between words in the same document represented by the weights of the connections between them. Two types of learning algorithm are available for neural networks: supervised learning and unsupervised learning.
Markov model
A Markov model based information system works on the Markov Decision Process (MDP), a stochastic model of sequential decisions. Given a number of observed events (i.e. the past history of the user’s actions, such as the last visited web page or a characteristic of the document), the next event is predicted from the probability distribution of the events that have followed these observed events in the past. The Mitios online book store is an example of a system that uses the MDP model.
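The prediction step described above can be illustrated with a first-order Markov chain over browsing events. This is a simplification made for the example: it models only the transition probabilities and omits the actions and rewards of a full MDP. The session data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_transitions(sessions):
    """Count how often each event was followed by each next event."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return counts


def next_event_distribution(counts, current):
    """Estimated probability of each event that has followed `current`."""
    followers = counts.get(current)
    if not followers:
        return {}
    total = sum(followers.values())
    return {event: n / total for event, n in followers.items()}


def predict_next(counts, current):
    """Most probable next event, or None if `current` was never observed."""
    dist = next_event_distribution(counts, current)
    return max(dist, key=dist.get) if dist else None
```

With sessions in which "cardiology" was followed twice by "infarction" and once by "arrhythmia", the estimated distribution after "cardiology" assigns 2/3 to "infarction", which is therefore the predicted next event.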
A content based filtering system [29-31] recommends a document by matching the document profile with the user profile, using traditional information retrieval techniques such as term frequency and inverse document frequency (TF-IDF). User characteristics are gathered over time and profiled automatically based on the user’s prior feedback and choices. The system uses item-to-item correlation in recommending documents to the user. It starts by collecting content details about each item, such as treatments and symptoms for a disease-related item, or author and publisher for a book. In the next step, the system asks the user to rate the items. Finally, the system matches each unrated item against the user profile, assigns it a score, and presents the user with the items ranked according to their scores.
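The three steps (collect content details, collect ratings, score and rank unrated items) can be sketched as below. The scoring rule, which averages the user's ratings per content feature, is an assumption chosen for simplicity rather than the TF-IDF matching mentioned above, and the item identifiers and feature names are invented.

```python
from collections import defaultdict


def build_profile(items, ratings):
    """Average the user's ratings over every feature of the rated items."""
    totals, counts = defaultdict(float), defaultdict(int)
    for item_id, rating in ratings.items():
        for feature in items[item_id]:
            totals[feature] += rating
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


def score(features, profile):
    """Mean profile weight of the item's features that the profile knows."""
    known = [profile[f] for f in features if f in profile]
    return sum(known) / len(known) if known else 0.0


def rank_unrated(items, ratings):
    """Unrated item ids, ordered from highest to lowest score."""
    profile = build_profile(items, ratings)
    unrated = [item_id for item_id in items if item_id not in ratings]
    return sorted(unrated, key=lambda i: score(items[i], profile), reverse=True)
```

Here `items` maps each item id to a set of content features and `ratings` maps the ids the user has already rated to numeric ratings. Only the user's own ratings are used, which is what distinguishes this approach from collaborative filtering.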
News Dude is an example of a content based filtering system. It uses TF-IDF for short-term interests and a Bayesian classifier for long-term interests, learning from an initial set of documents provided by the user.
Content based information filtering systems are not affected by the cold start problem or the new user problem, as they focus on the needs of the individual user. They are not suitable for multimedia items, such as images, audio and video, because multimedia documents must be tagged with a semantic description of the resource, which is a time-consuming process. Content based filtering methods also cannot filter documents based on quality and relevance.
Collaborative filtering systems [33-36] filter information based on the interests of the user (past history) and the ratings of other users with similar interests. The approach is widely used in filtering and recommender systems, especially in e-commerce applications. Examples include Amazon.com and eBay, where a user’s past shopping history is used to recommend new products.
A collaborative filtering system computes the similarity between users’ interests, using measures such as the Pearson correlation coefficient. The system collects ratings of each item from different users, either explicitly or through their behavior, and then calculates the similarity between the users’ ratings. Ratings can be explicit, on a numeric scale [18,37], or implicit, such as purchases, clicks and mouse movements. The users are then grouped according to the calculated similarity measures, and future items are recommended to a user based on the recommendations of the other users in the group.
Consider a group of users U1, U2, ..., Un and items I1, I2, ..., Im. Table 2 shows the ratings given by the users to the different items.
Table 2: Rating given by the users on different items.
For example, if the similarity between users U1 and U5 is high, then U1 and U5 can be grouped, and new items will be recommended to each of them based on the other’s interests. Here, item I3 will be recommended to user U5 as a new item, based on the high rating given by U1, the other user in the group. Similarly, item Im will be recommended to user U4 based on the rating of user U3.
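The grouping and recommendation logic can be sketched with the Pearson correlation coefficient. The rating values used with this sketch are hypothetical (they are not the contents of Table 2), and the similarity and rating thresholds are arbitrary assumptions.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def recommend(target, ratings, min_similarity=0.5, min_rating=4):
    """Items the target has not rated but similar users rated highly."""
    suggestions = set()
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        if pearson(ratings[target], other_ratings) >= min_similarity:
            for item, value in other_ratings.items():
                if item not in ratings[target] and value >= min_rating:
                    suggestions.add(item)
    return sorted(suggestions)
```

If U1 and U5 agree closely on the items they have both rated, and U1 has given I3 a high rating that U5 has not yet seen, `recommend("U5", ratings)` returns I3, mirroring the example in the text. Users whose ratings correlate negatively with U5 contribute nothing.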
There are two types of collaborative filtering system: memory based and model based. Memory based collaborative filtering systems recommend items using neighborhood principles, whereas model based systems recommend items using models estimated from the ratings. The models are built using machine learning algorithms such as Bayesian networks, clustering and Markov models.
Collaborative systems can be used to filter all types of items, including multimedia items. They suffer, however, from the cold start problem and the early rater problem: a new item cannot be filtered until at least one user has rated it. They also suffer when the data are sparse, because there are few items in common from which to calculate the similarity measures, which makes recommendation difficult.
Application of collaborative filtering in the biomedical field
The following are examples of the application of collaborative filtering in the health care field.
1. Collaborative filtering is used in the form of recommender systems in health care. In a nursing clinical recommender system, the user (say, a nurse) selects all the items required for a particular treatment plan, rather than selecting particular items of interest based on personal likes, as in commercial recommender systems. User behavior in the clinical recommender system is binary (the item is either accepted or rejected), and the requirement is objective rather than subjective.
2. Collaborative filtering is used to address the omission of medications, by detecting omissions in the patient’s observed medication list and reconciling the medication lists.
3. Collaborative filtering is applied to make personalized medical predictions in the CARE system, a collaborative recommendation engine for prospective and proactive health care, which assumes that patients with similar histories will go on to develop similar conditions.
5. The diabetic recommender system [43,44] is a collaborative filtering system that suggests diabetic-friendly food options when a person shops for groceries or eats at a restaurant, based on an algorithm developed by Yehuda Koren of Yahoo Research.
6. Walk et al. used a collaborative filtering tool to perform a high-level mapping of recommender techniques to collaborative ontology engineering platforms in the biomedical domain, with proof-of-concept implementations in the context of the ICD-11 project.
7. The Smart H tweet engine provides users with personalized health-related recommendations and alerts based on stored clinician knowledge. It extracts user interests, health conditions and emotions from social media to create a rich user profile.
8. Elsten built a life style coach, a recommender system that facilitates the decision process of its users by filtering large quantities of data and recommending only a few personally relevant items using collaborative filtering techniques.
9. Kotevska proposed a collaborative, patient-centered health care system model that provides tools for personal health care by generating recommendations, notifications and suggestions for users using collaborative filtering techniques.
10. Song et al. proposed a Health Social Network Recommender System, based on the Support Vector Machine (SVM), to provide a social networking framework for patient care, in particular for parents of children with Autism Spectrum Disorder (ASD).
Hybrid filtering systems
Hybrid filtering systems combine features of both content based and collaborative filtering systems. A hybrid system overcomes the cold start and early rater problems by using the content based approach in the initial stage. In subsequent stages, it uses collaborative filtering features, which allow the system to recommend all types of items, including multimedia items, and so overcome the limitations of content based filtering techniques.
The most successful social filtering system is Yahoo. Yahoo employs humans to evaluate documents and places the interesting ones in its structured information database. The simplest and most common form of filtering is organizing discussions into groups (newsgroups, mailing lists, forums, etc.). Each group has a topic and wants only contributions within that topic. Sometimes, the right to submit contributions is restricted.
Filtering against spamming
Filtering against spamming involves both content based and collaborative filtering techniques. Content based spam filtering techniques use the entire content of emails to identify the words and phrases that characterize messages classified as spam by users. Content based spam filtering uses rule based techniques, the nearest neighbor method and Bayesian networks to filter spam. SpamAssassin is one such spam filter, which uses Bayesian network principles to filter emails.
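To make the Bayesian idea concrete, the sketch below implements a small naive Bayes classifier with Laplace smoothing. Naive Bayes is a deliberately simple Bayesian technique swapped in for illustration; it is not a full Bayesian network and makes no claim to reproduce SpamAssassin's implementation. The class name and training messages are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Word-based naive Bayes; train at least one message per class first."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record a message that a user has labelled 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Log prior plus the Laplace-smoothed log likelihood of each word.
        log_score = math.log(self.message_counts[label] / total_messages)
        for word in words:
            count = self.word_counts[label][word]
            log_score += math.log((count + 1) / (total_words + len(vocabulary)))
        return log_score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")
```

Each user classification becomes a `train` call, so the word statistics, and with them the filter's decisions, follow what users actually mark as spam.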
In collaborative spam filtering, when a user classifies an email as spam, its signature is added to a central database on the email servers. A signature is computed for every new email and compared with the database; if it matches, the email is classified as spam. The approach works on the principle of near-duplicate similarity matching. Yahoo! uses a collaborative spam filtering system to classify email as spam.
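A signature database with near-duplicate matching might look like the following sketch. The signature scheme (sets of three-word shingles compared by Jaccard similarity), the 0.6 threshold and the class name are all assumptions made for illustration; the article does not specify how signatures are computed, and production systems typically use compact hashes instead of raw shingle sets.

```python
import re


def shingles(text, size=3):
    """Signature of a message: the set of its overlapping word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two signatures have in common."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class SharedSpamDatabase:
    """Central store of signatures for messages that users reported as spam."""

    def __init__(self, threshold=0.6):
        self.signatures = []
        self.threshold = threshold

    def report_spam(self, text):
        self.signatures.append(shingles(text))

    def is_spam(self, text):
        signature = shingles(text)
        return any(
            jaccard(signature, known) >= self.threshold
            for known in self.signatures
        )
```

Because matching is approximate, a reported message still matches after small edits such as changed punctuation, capitalization or a single substituted word.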
Recall is the number of relevant documents retrieved divided by the total number of relevant documents.
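The definition translates directly into code. The function name and the document identifiers used with it are illustrative.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)
```

For instance, if four documents are relevant and the system retrieves three documents of which two are relevant, recall is 2/4 = 0.5.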
Information filtering in different forms
An information filtering system is used to filter different forms of information items, such as:
• Music and movies
The perceived benefits of information filtering have motivated several researchers to develop information filtering software for biomedical researchers, including:
• PubCrawler is a free alerting service that keeps users informed of the current contents of Medline and GenBank.
• A personalized cancer information system is available for cancer patients to receive information about treatment and disease measurement related to their medical condition.
• Oncosifter is available for retrieving the latest news, diagnosis and treatment information related to cancer.
• MARVIN filters relevant documents from a set of Web pages and follows links to retrieve new documents.
• PURE is a PubMed article recommender system that automatically finds articles relevant to the user’s interests using content based filtering techniques.
Apart from these programs, other generic filtering software, also called recommender systems, is available:
• SIFT is an information filtering system that helps users receive new documents based on the profiles they submit to the system through the profile index.
• Beehive, a distributed system for social sharing and filtering of information, provides a simple and intuitive interface for distributing relevant information.
• The Tapestry system supports both content based filtering and collaborative filtering. It involves people in collaborative filtering by recording their reactions to the documents they read, which helps to filter documents effectively.
• InRoute is an information filtering system based on the inference network model.
• RAMA helps to retrieve useful information from various Internet sources, including USENIX news and anonymous FTP servers.
• PI-Agent implements an information filtering system using importance based classification realized by implicit pre-classification profiles.
• NewsSIEVE is an information filtering agent that automatically generates simple user profiles to make the review process easier and improve the performance of the filtering system.
• Syskill & Webert is a software agent that learns to rate pages on the World Wide Web (WWW) and can recommend pages that might be of interest to the user.
• The INFOS system reduces the user’s search burden by automatically eliminating Usenet news articles predicted to be irrelevant.
• iAgent is an information filtering system for filtering newspapers.
• NewT is a system that helps the user filter Usenet Netnews.
• Ringo is a personalized music recommender system accessible to users via email.
• Expressing long-term information needs in the form of a user profile is a difficult job, as the user’s needs and interests depend on many parameters, including personal and behavioral characteristics. Some of these parameters (the user’s mood, work load, financial position, etc.) are not easily translated into system components.
• The user profile has to characterize current and future documents correctly. The quality of user profiles is key to making a filtering system work. From the user’s point of view, there are two potential problems. If a large proportion of the items that the system sends to a user are irrelevant, the system becomes more of an annoyance than a help. Conversely, if the system fails to provide the user with enough relevant information, the benefit of information filtering is largely lost, because the user will still have to hunt actively for information.
• The total utility of an information filtering system is maximal when the effort of programming the user profile is low and the proportion of correctly classified documents is high. A very large user profile overburdens the filtering system.
• Humans are often unable to explain in general how they evaluate the relevance (class affiliation) of a document, i.e. the user may not be able to explain the basis on which he or she has given relevance feedback to the filtering system.
• Synonymy and homography in the user profile may also pose problems.
• Privacy is an important issue in maintaining user profiles.
Efficient information retrieval requires information filtering and search adaptation to the user’s current needs, interests, knowledge level, etc. From the early days of SDI (Selective Dissemination of Information) to modern information retrieval, information filtering has undergone tremendous change. In this paper I have tried to focus on the methods of information filtering. I have not covered every aspect of the topic, such as the technical details behind the methods, because that is beyond the scope of this article. The main aim of this paper is to pinpoint the highlights of information filtering and to draw a comprehensive picture of the methodology behind it.
Vinaitheerthan wrote the first draft. Dr. Ajit N Babu and Dr. SN Sarbadhikari have meticulously rewritten and enhanced the quality of the manuscript for making it more acceptable to biomedical researchers.