Visit for more related articles at International Journal of Advancements in Technology
Part-of-speech (POS) disambiguation in corpora is one of the most challenging areas in Natural Language Processing. Earlier work has addressed the problem of bilingual corpora disambiguation for Hindi using Hidden Markov Models and Neural Networks. In this paper, a Quantum Neural Network (QNN) is used as a Hindi parts of speech tagger. To analyze the effectiveness of the proposed approach, 2600 sentences of news items containing 11500 words from various newspapers have been evaluated. During simulation and evaluation, an accuracy of up to 99.13% is achieved, which is significantly better than other existing approaches for Hindi parts of speech tagging.
Parts of Speech tagging, Tokenizer, Tagset, Quantum Neural Networks, Pattern Recognition.
POS: Parts of Speech, QNN: Quantum Neural Network, HMM: Hidden Markov Model, CRF: Conditional Random Field.
Hindi is the National language of India, spoken by around 500 million Indians. It is the world’s fourth most commonly used language after Chinese, English and Spanish. Hindi is a morphologically rich language with relatively free word order. Therefore, many permutations of the same sentence convey similar meaning. The tagger assigns to each word in the corpus the part of speech that suits its context. Each word is related to adjacent and related words in the corpus. POS tagging helps in parsing the corpus, which is an important step in natural language processing.
Tagging is the process of identifying the correct syntactic categories of words in a corpus. The identification is ambiguous because the mapping between words and their syntactic categories is not unique. The most important problem in POS tagging is to assign the most appropriate morpho-syntactic category to each word in a sentence from those listed in the lexicon, given the context. Annotating a text with POS tags is useful for subsequent manipulation of the text. The tagger processes all words, and the words that belong to a certain class provide a useful abstraction, for example for retrieving all verbs from a text. The grammatical parts of speech are important because they allow meaning and structure to be derived from a sentence. To analyze a long sentence syntactically, the input sentence is broken up into multiple simple sentences at conjunctions and prepositions.
One of the important functions of the tagger is to categorize the words in a text properly into a finite set of syntactic categories. This process is indeterminate as the mapping from words to the tag space is often one-to-many. POS tagging is a difficult task with challenges like ambiguous parts of speech. Many POS taggers for English are available based on machine learning techniques such as decision trees [4,5,6], transformation-based error-driven learning [7, 8, 9], maximum entropy methods, Markov models, etc. Hybrid taggers that combine the stochastic and rule-based approaches, such as CLAWS, are also available. Some work has been done on morphology-based disambiguation in Hindi POS tagging. Bharati et al. (1995), in their work on a computational Paninian Parser, described a technique where POS tagging is implicit and is merged with the parsing phase. Ray et al. (2003) proposed an algorithm that identifies Hindi word groups on the basis of the lexical tags for individual words. Their partial POS tagger (as they call it) reduces the number of possible tags for a given sentence by imposing some constraints on the sequence of lexical categories that are possible in a Hindi sentence.
This paper presents a QNN based approach which learns the parameters of the POS tagger from a representative training data set, and whose training time and performance are better than those of a Neural Network based tagger.
As discussed above, many researchers have introduced POS taggers, but there is still room to work on ambiguous parts of speech because the existing POS taggers lack accuracy. Many researchers have proposed machine learning based POS taggers that aim to tag in the way humans interpret text, but their accuracy is not satisfactory. Hence there is scope to improve the accuracy of POS taggers. A POS tagger for Hindi based on QNN is a possible solution to this problem. It recognizes the patterns of POS tagging because it has the ability to learn from examples. If the QNN based POS tagger is not working correctly, a user without expert technical knowledge can make changes without knowing how the computer stores and represents rules.
Hindi, unlike English, belongs to the category of inflectionally rich languages, which suffer from the data sparseness problem. QNN is one of the most efficient approaches for learning from sparse data. Hindi is a relatively free word-order language; hence it requires an approach which provides variable lengths of context. Most of the previous approaches used for POS tagging of Hindi were unable to provide variable lengths of context, but QNN is quite capable of handling these issues.
Various approaches are used for POS tagging systems, such as rule-based models, statistical models, and neural networks. The major disadvantage of rule-based and stochastic approaches is their inherent inability to deal with unknown words, i.e. words that are not part of the training set.
Morphological rules based POS tagger
The morphological rule based POS tagger is not designed for learning. This system used a locally annotated, modestly sized corpus of 15,562 words. A high-coverage lexicon and a decision tree based algorithm were used for morphological analysis, and the POS categories were identified by lexicon lookup. The performance of the system was evaluated by 4-fold cross validation over the corpus of 15,562 words, giving 93.45% accuracy.
Maximum Entropy based POS tagger
The Maximum Entropy (ME) based POS tagger requires feature functions extracted from a training corpus. Normally a feature function is a Boolean function which captures some aspect of the language that is relevant to the sequence labeling task. Performance increases until 75% of the training corpus is used, after which accuracy drops because the trained model overfits the training corpus. The lowest and highest POS tagging accuracies of the system were 87.04% and 89.34%, and the average accuracy over 10 runs was 88.4%.
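To make the notion of a Boolean feature function concrete, a minimal Python sketch of one such feature follows. The suffix, the tag name and the function name are illustrative assumptions and are not taken from the surveyed system.

```python
def suffix_ta_and_verb(word: str, tag: str) -> bool:
    """Hypothetical feature: fires when the word ends in 'ता' and the candidate tag is V."""
    return word.endswith("ता") and tag == "V"


# The feature is evaluated on a (word, candidate tag) pair.
fires_for_verb = suffix_ta_and_verb("जाता", "V")
fires_for_noun = suffix_ta_and_verb("जाता", "N")
```

A maximum entropy model would combine many such features, each with a learned weight, to score candidate tags.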
Conditional Random Fields based POS tagger
Agarwal et al. developed a POS tagger based on conditional random fields. This system makes use of a Hindi morph analyzer for training and to obtain the root word and possible POS tags for every word in the corpus. Training and testing were performed on a corpus of 150,000 words. The performance of the system was 82.67%.
Based on the surveyed work, tagging is a very ambiguous process, and the existing tagging systems for Hindi still do not work accurately on ambiguous corpora. The work presented in this paper is similar to the neural approach to POS tagging [15, 16].
Like the human brain, the QNN algorithm is able to work on information that is certain as well as uncertain. The human brain learns and predicts very complex patterns, and it is also efficient in unrealistic situations that involve multilevel discrete information. QNN reflects properties similar to those of the human brain by using quantum superposition of states in a Neural Network. It is not possible to address such unrealistic situations with a traditional Neural Network. On the other hand, the QNN is a possible solution for them.
Karayiannis et al. [17, 18] introduced a novel Neural Network model based on the superposition of quantum states, having a multilevel transfer function. QNN has the ability to classify uncertain data. QNN is similar to the ANN, with the difference that the traditional ANN uses the ordinary sigmoid function, whereas QNN uses a multilevel activation function, each of which consists of a sum of sigmoid functions shifted by quantum intervals. According to Daqi et al., the transfer function of the quantum neuron in the hidden layer consists of a superposition of several traditional transfer functions. Using QNN it is possible to define a new understanding of mind and brain function as well as new, unprecedented abilities in information processing.
As shown in Fig. 1, the three-layer architecture of the QNN consists of inputs, one layer of multilevel hidden units, and output units. In the QNN, a multilevel activation function is used instead of the ordinary sigmoid function. Each multilevel function consists of a sum of sigmoid functions shifted by the quantum intervals [21, 22, 23, 24].
The sigmoid function with various graded levels has been used as the activation function for each hidden neuron and is expressed as:

$$\mathrm{sgm}(x) = \frac{1}{n_s} \sum_{r=1}^{n_s} \frac{1}{1 + \exp\left(-(x - \theta^{r})\right)}$$

Here every neural network node represents three substates, separated by the quantum interval $\theta^{r}$ of quantum level $r$, where $n_s$ denotes the number of grades in the quantum activation function.
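A multilevel activation of the kind described, a sum of ordinary sigmoids each shifted by one quantum level, can be sketched in Python as follows. The averaging over the number of levels, the unit slope, and the example shift values are assumptions made for illustration; only the interval value 3.5 is mentioned later in the paper.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Ordinary logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def multilevel_sigmoid(x: float, thetas: Sequence[float]) -> float:
    """Average of sigmoids, each shifted by one quantum level in `thetas`.

    Assumption: the levels are supplied explicitly and the sum is
    normalised by the number of levels n_s.
    """
    n_s = len(thetas)
    return sum(sigmoid(x - theta) for theta in thetas) / n_s


# Hypothetical example: three levels spaced 3.5 apart and centred on zero.
example_levels = [-3.5, 0.0, 3.5]
midpoint = multilevel_sigmoid(0.0, example_levels)
```

With widely spaced levels the response rises in a staircase of soft steps rather than a single transition, which is what lets a hidden unit represent graded, uncertain membership.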
The proposed POS tagging system is inspired by the human translator. To identify parts of speech, a human first refers to the dictionary/lexicon and picks the part of speech information directly from it, then matches it against the sentence pattern on the basis of grammar rules. If it suits the pattern the decision stands; otherwise the human corrects the part of speech on the basis of the sentence pattern. The proposed system uses the same method. The raw sentence first passes through the Tokenizer, which splits the sentence into words and indexes them as tokens; the resulting indexed words then pass through the Rule based POS Tagger.
The Rule based POS tagger tags each word simply by using the Lexicon. Its outcome is not perfect, so for correction it finally passes through the QNN based POS tagger, which corrects the rule based tags using pattern recognition over the corpus. For learning, some manually tagged sentences are input to the QNN based POS tagger, and on the basis of these tagged sentences it learns the patterns of POS tagging. The whole process is shown in the architecture diagram of the QNN based parts of speech tagger in Fig. 2.
Representation Of The Input And Output
2600 Hindi sentences of news items from various newspapers are used for training. The corpus used for training and testing contains 11500 words. The training set is generated from a simple deterministic grammar by a program. The POS tag of each word in a sentence must be represented in numeric form. This work uses a binary representation for the POS tag. Table 1 shows the input POS tags, which use a 3-bit encoding scheme, and the corresponding numeric codes for the target word's parts of speech tags.
Tagset with Its Coding Mechanism
The tagset is the set of parts of speech tags from which the tagger draws the part of speech of a given word. A tagset generally contains N (Noun), V (Verb), ADJ (Adjective), ADV (Adverb), PREP (Preposition), CONJ (Conjunction) etc., depending on the morphological structure of the language. The tagset for the proposed Hindi parts of speech tagger is listed with its coding mechanism in Table 1.
In the parts of speech tagset (as given in Table 1), the resulting codes are generated from the base class of the part of speech and the occurrence number. The occurrence number starts at 0: the first time a noun occurs in a sentence the resulting code is .100, the second time a noun occurs the resulting code is .101, and so on. Numerically, the coding mechanism is expressed as
Resulting code (POS id) = (POS base id + (Occurrence Number /1000))
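The coding rule above can be sketched in Python as follows. The noun base id of .100 follows the text; the verb base id, the function name and the use of `Decimal` (chosen to avoid floating-point rounding in the third decimal place) are assumptions for illustration, since Table 1 is not reproduced here.

```python
from collections import Counter
from decimal import Decimal
from typing import Dict, List

# N = .100 is stated in the text; the V entry is an assumed placeholder.
POS_BASE_ID: Dict[str, Decimal] = {
    "N": Decimal("0.100"),
    "V": Decimal("0.110"),
}


def encode_tags(tags: List[str]) -> List[Decimal]:
    """Resulting code = POS base id + occurrence number / 1000."""
    seen: Counter = Counter()  # occurrences of each tag so far, starting at 0
    codes = []
    for tag in tags:
        codes.append(POS_BASE_ID[tag] + Decimal(seen[tag]) / Decimal(1000))
        seen[tag] += 1
    return codes


codes = encode_tags(["N", "V", "N"])
```

In this sketch the second noun in the sentence receives .101 while the first receives .100, matching the example given in the text.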
The Tokenizer splits a sentence into meaningful elements, which are often referred to as words. Literally, a Tokenizer breaks up sentences into pieces called tokens. A token is an instance of a sequence of characters or numbers in a sentence that is grouped together as a useful semantic unit for processing. In the proposed model, the Tokenizer splits the sentence into words and indexes each word as a token.
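A minimal sketch of such a tokenizer in Python is shown below. Splitting on whitespace and dropping the danda (।) are assumptions; the paper does not describe the tokenization rules in detail.

```python
from typing import List, Tuple


def tokenize(sentence: str) -> List[Tuple[int, str]]:
    """Split a sentence into words and index each word as a token."""
    cleaned = sentence.replace("।", " ")  # assumption: treat the danda as a separator
    return list(enumerate(cleaned.split()))


tokens = tokenize("राम घर जाता है।")
```

Each token is an (index, word) pair, so later stages can refer to a word by its position in the sentence.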
Rule based POS Tagger
The Rule based POS tagger labels each word with its most likely POS tag by using the lexicon/dictionary and well-defined rules.
In a dictionary every word has its meaning along with part of speech information, but a single word may carry several parts of speech. The part of speech of a word always depends on the sentence in which the word is used, which is why parts of speech tagging is very ambiguous. The Rule based POS Tagger picks the appropriate part of speech on the basis of well-defined rules, with the help of the information about the word from the dictionary/lexicon.
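The lexicon lookup step might look like the following Python sketch. The lexicon entries, the "first listed tag wins" rule and the default tag for unknown words are all hypothetical stand-ins; the paper's actual rules are not given.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon: a word may list more than one part of speech.
LEXICON: Dict[str, List[str]] = {
    "राम": ["N"],
    "घर": ["N"],
    "जाता": ["V"],
    "सोना": ["N", "V"],  # ambiguous: "gold" (noun) or "to sleep" (verb)
}


def rule_based_tag(tokens: List[Tuple[int, str]],
                   lexicon: Dict[str, List[str]],
                   default: str = "N") -> List[Tuple[int, str, str]]:
    """Label each token with its first lexicon tag, or `default` if unknown."""
    tagged = []
    for index, word in tokens:
        candidates = lexicon.get(word, [])
        tagged.append((index, word, candidates[0] if candidates else default))
    return tagged


draft = rule_based_tag([(0, "राम"), (1, "सोना"), (2, "लाया")], LEXICON)
```

A lookup of this kind cannot resolve an ambiguous entry such as "सोना" from context, which is the gap the QNN correction stage is meant to close.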
Quantum neuro tagger algorithm.
Given a sentence, perform the following steps:
• Learning Phase
INPUT: Manually tagged training corpus
OUTPUT: The Patterns of POS Tagging rules learned.
• Tagging Phase
INPUT: Untagged Corpus
Step 1: Tokenizer splits the sentence into words and indexes it as token
Step 2: Label most likely tag (using Lexicon) by Rule based POS Tagger
Step 3: Passes to the QNN based Parts of Speech Tagger
OUTPUT: Most accurate POS Tagged Corpus
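The three steps of the tagging phase can be sketched as a simple pipeline in Python. All stage implementations here are placeholders: in particular, the corrector is an identity function standing in for a trained QNN, which is not implemented in this sketch.

```python
from typing import Callable, List, Tuple

Token = Tuple[int, str]
Tagged = Tuple[int, str, str]


def tag_sentence(sentence: str,
                 tokenizer: Callable[[str], List[Token]],
                 rule_tagger: Callable[[List[Token]], List[Tagged]],
                 qnn_corrector: Callable[[List[Tagged]], List[Tagged]]) -> List[Tagged]:
    tokens = tokenizer(sentence)   # Step 1: split into indexed tokens
    draft = rule_tagger(tokens)    # Step 2: most likely tag from the lexicon
    return qnn_corrector(draft)    # Step 3: correction by the trained network


# Placeholder stages, for illustration only.
def demo_tokenizer(sentence: str) -> List[Token]:
    return list(enumerate(sentence.split()))


def demo_rule_tagger(tokens: List[Token]) -> List[Tagged]:
    return [(i, w, "N") for i, w in tokens]


def demo_corrector(draft: List[Tagged]) -> List[Tagged]:
    return draft  # a trained QNN would revise the tags here


result = tag_sentence("राम घर", demo_tokenizer, demo_rule_tagger, demo_corrector)
```

Passing the stages in as functions keeps the sketch independent of how each stage is actually built.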
Implementation of Quantum Neural Based POS Tagger
As described above in section 5, this concept is inspired by the human interpreter, so the steps used to implement the POS tagging rules with QNN are similar to the steps a human interpreter follows. Our system first picks the part of speech of each word using the well-defined rules and the lexicon. A word can have different parts of speech in different sentences; its part of speech in a given sentence depends on how the word acts in that sentence. To resolve this ambiguity, after the rule based parts of speech have been picked using the well-defined rules and the dictionary/lexicon, the set of parts of speech passes through the QNN based POS tagging system. This system is used as a pattern recognizer, which learns and corrects the part of speech tag information on the basis of the corpus/sentence patterns learned during training.
Fig. 3 shows the incorrect parts of speech which pass through the QNN, N (.100) HV (.111) ADV (.112) V (.110) Pre (.140) A (.123) PostN (.105), and the resulting numeric codes, N (.100) HV (.111) NE (.180) V (.110) Pre (.140) A (.123) PostN (.105), which give the accurate POS tagging for the context in which the sentence is used. The network which implements the rules must recognize the pattern inherent in this reorganization. This is done by training the network on a sufficient number of coded input and output sentences chosen as the training set.
Unlike the example shown above, the outputs of the network are not exact integers. Thus the outputs must be rounded off to the nearest integer, and some basic error corrections are necessary to obtain the symbolic codes.
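One way to turn raw network outputs back into symbolic tags is sketched below in Python. Scaling the codes by 1000 before rounding and snapping to the nearest known code as the "basic error correction" are assumptions; the paper does not spell out its correction procedure. The three codes used are taken from the Fig. 3 example.

```python
from typing import Dict, List

# Codes from the Fig. 3 example, scaled by 1000 so that they are integers.
VALID_CODES: Dict[int, str] = {100: "N", 111: "HV", 180: "NE"}


def decode_outputs(outputs: List[float],
                   valid_codes: Dict[int, str] = VALID_CODES) -> List[str]:
    """Round each output to an integer code, then snap to the nearest valid code."""
    tags = []
    for value in outputs:
        code = round(value * 1000)
        nearest = min(valid_codes, key=lambda c: abs(c - code))
        tags.append(valid_codes[nearest])
    return tags


decoded = decode_outputs([0.1004, 0.1132, 0.1796])
```

Here the second output rounds to 113, which is not a valid code, so it is mapped to the closest valid one (111).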
Every word in each language is assigned a unique numeric code, because the total number of parts of speech in one language did not exceed ten in the test. It is possible to use three numeric codes to encode all the words in one language. Fig. 3 shows how this encoding scheme produced a total of seven numeric codes in the input layer and a total of seven numeric codes in the output layer of the QNN. All the errors of words in Hindi and Devanagari-Hindi, sentences and parts of speech are evaluated and recorded.
The POS distribution for Devanagari-Hindi sentences by number and percentage is shown in Table 2. Experiments show that memorization of the training data is occurring. The observed results are shown in Table 3. The results in the series of tables in this section were obtained after training with the lexicon POS tags of the 2600 Hindi sentences of news items from various newspapers, containing 11500 words, paired with human-assigned POS tags.
500 tests are performed for each value of the quantum interval (θ), with random data sets for training, validation and testing drawn from the POS data of the 2600 Hindi sentences. The results shown in Table 3 are the averages over these 500 runs. In Table 3, the best performance is obtained for a quantum interval θ equal to 3.5 with respect to all the parameters, i.e. the epochs or iterations needed to train the network and the training, validation and test performance measured as mean square error (MSE). Table 3 also compares the performance of the QNN with that of the ANN on these parameters, from which we conclude that the QNN is better than the ANN for POS tagging.
During the experiment, every word in a sentence is assigned a unique numeric code for its part of speech. Fig. 3 shows how the encoding scheme produced a total of seven numeric codes in the input layer and a total of seven numeric codes in the output layer of the QNN. All the part of speech errors for words in a Hindi sentence are evaluated and recorded. On the basis of the input pairs of POS sets, the QNN memorizes the pattern of parts of speech. For training, the lexicon based POS tags of a Hindi sentence are paired with the POS tags assigned by a human to the same sentence. During testing it was found that with three or more nodes the accuracy remains constant.
Due to the structure of the grammar used, the QNN finds it easiest to learn the part of speech of prepositions (only two prepositions are used) and hardest to learn the correct tagging between the adjective and the second noun. It is also slightly harder to learn the correct part of speech of adverbs, because in Hindi grammar the positions of the verb and adverb are changed randomly in the training and test sets. Fig. 4 shows that the proposed POS tagger disambiguates and identifies the parts of speech with higher accuracy.
The accuracy for each category of parts of speech is shown in Fig. 4. Looking at the categories with low accuracy, such as Question Word, Negative Word, Verb and Adverb, we find that all of them are highly ambiguous and, almost invariably, very rare in the corpus. Also, most of them are hard to disambiguate without semantic information.
Experiments show that during the learning process with the QNN based POS tagger for Hindi, there is a decrease in the indeterminacy and an increase in the reliability of pattern recognition of parts of speech. Hence, by using a POS tagger with QNN, the proposed system achieves better POS tagging with higher accuracy in comparison to other existing approaches.
This paper proposes a new POS tagging method which exploits the advantages of the Quantum Neural Network. 2600 sentences containing 11500 words of news items from various newspapers are used to analyze the effectiveness of the proposed POS tagger; for training, only 600 sentences of news items are used as input paired sentences. On the basis of the tests performed on the dataset, the accuracy percentage for the various parts of speech using ANN and QNN is calculated. As shown in Table 4, the overall accuracy of the QNN based POS tagger is 99.13%.
Experiments confirm that the accuracy of the parts of speech tagger based on QNN is 99.13% for simple sentences, which is better than other POS tagging methods: Morphological Rule Based POS tagging, Hidden Markov Model based POS tagging, the Maximum Entropy based POS tagger for Hindi, and the Conditional Random Fields based POS tagger for Hindi [15, 16]. A comparison of the various POS tagging systems is shown in Table 5.
In this work we have presented a Quantum Neural Network approach to the problem of POS tagging for Hindi and achieved an accuracy of 99.13%. The accuracy of this system has been improved significantly by incorporating techniques for handling unknown words using QNN. A close investigation of the evaluation results reveals that most of the POS tagging errors occur with unknown words. Along with the unknown word handling techniques, it uses an effective encoding scheme in which corpus-based and rule-based features are implicitly used for tagging. Its performance is also compared with other approaches such as the Morphological Rule Based POS tagger, the Hidden Markov Model based POS tagger, and the Maximum Entropy based POS tagger. It was also shown that it requires less training time than the ANN based tagger.
[1] R.G. Raj and S. Abdul-Kareem, “A Pattern Based Approach for the Derivation of Base Forms of Verbs from Participles and Tenses for Flexible NLP”, Malaysian Journal of Computer Science, 2011, Vol. 24, pp. 138-159.
[2] R.G. Raj and S. Abdul-Kareem, “Information Dissemination and Storage for Tele-Text Based Conversational Systems' Learning”, Malaysian Journal of Computer Science, 2009, Vol. 22, pp. 138-159.
[3] C. D. Manning and H. Schutze, “Foundations of Statistical Natural Language Processing”, MIT Press, 2002.
[4] E. Black et al., “Decision tree models applied to the labeling of text with parts-of-speech”, DARPA Workshop on Speech and Natural Language, 1992.
[5] B. Merialdo, “Tagging English text with a probabilistic model”, Computational Linguistics, 1994, Vol. 20, pp. 155-171.
[6] A. Ekbal and S. Saha, “Simulated annealing based classifier ensemble techniques: Application to part of speech tagging”, Information Fusion, 2013, Vol. 14, pp. 288-300.
[7] E. Brill, “A simple rule-based Parts of Speech tagger”, Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, Trento, IT, 1992, pp. 152-155.
[8] E. Brill, “Some advances in transformation-based Parts of Speech tagging”, AAAI '94: Proceedings of the Twelfth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, Menlo Park, CA, USA, 1994, Vol. 1, pp. 722-727.
[9] E. Brill, “Transformation-Based Error Driven Learning and Natural Language Processing: A Case Study in Parts of Speech Tagging”, Computational Linguistics, 1995, Vol. 21, pp. 543-566.
[10] A. Ratnaparkhi, “A Maximum Entropy Part-Of-Speech Tagger”, EMNLP, 1996.
[11] M. Shrivastava and P. Bhattacharyya, “Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information without Extensive Linguistic Knowledge”, 6th International Conference on Natural Language Processing (ICON), 2008.
[12] R. Garside and N. Smith, “A Hybrid Grammatical Tagger: CLAWS4”, in R. Garside, G. Leech, and A. McEnery (Eds.), Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman, 1997, pp. 102-121.
[13] S. Singh, K. Gupta, M. Shrivastava, and P. Bhattacharyya, “Morphological richness offsets resource demand – experiences in constructing a POS tagger for Hindi”, Proceedings of the COLING/ACL Main Conference Poster Sessions, Sydney, Australia, 2006, pp. 779-786.
[14] A. Dalal, K. Nagaraj, U. Sawant and S. Shelke, “Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”, Proceedings of the NLPAI Machine Learning Competition, 2006.
[15] PVS Avinesh and G. Karthik, “Part-Of-Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning”, Proceedings of the NLPAI Contest, 2006.
[16] Himashu and A. Anirudh, “Part of Speech Tagging and Chunking with Conditional Random Fields”, Proceedings of the NLPAI Contest, 2006.
[17] G. Purushothaman and N. B. Karayiannis, “Fuzzy pattern classification using feed forward neural networks with multilevel hidden neurons”, IEEE Int. Conf. on Neural Networks, Orlando, FL, USA, 1994, pp. 1577-1582.
[18] G. Purushothaman and N. B. Karayiannis, “Quantum Neural Networks (QNNs): Inherently fuzzy feed forward neural networks”, IEEE Transactions on Neural Networks, 1997, Vol. 8, pp. 679-693.
[19] Z. Daqi and Wu Rushi, “A Multi-layer Quantum Neural Networks Recognition System for Handwritten Digital Recognition”, IEEE Third Int. Conf. on Natural Computation (ICNC), Haikou, Hainan, China, 2007, pp. 267-271.
[20] L. Fei, S. Zhao and Z. Baoyu, “Quantum Neural Network in Speech Recognition”, IEEE 6th International Conf. on Signal Processing, Beijing, China, 2002, pp. 357-362.
[21] R. Narayan, S. Chakraverty and V.P. Singh, “Quantum Neural Network based Machine Translator for Hindi to English”, The Scientific World Journal, 2014, Vol. 2014, Article ID 485737.
[22] S. Chakraverty, P. Gupta and S. Sharma, “Neural network-based simulation for response identification of two-storey shear building subject to earthquake motion”, Journal of Neural Computing and Applications, 2010, Vol. 3, No. 19, pp. 367-375.
[23] R. Narayan, S. Chakraverty and V.P. Singh, “Machine Translation using Quantum Neural Network for Simple Sentences”, International Journal of Information and Computation Technology, 2013, Vol. 3, No. 7, pp. 683-690.
[24] R. Narayan, S. Chakraverty and V.P. Singh, “Neural Network based Parts of Speech Tagger for Hindi”, Third International Conference on Advances in Control and Optimisation of Dynamical Systems, IIT Kanpur, Proceedings of IFAC-Elsevier, 2014, Vol. 3, No. 1, pp. 519-524.